{ "ABC_8f0aab7fd30ffc56cc477b25e6bb16_0": { "x": [ { "sent_id": "8f0aab7fd30ffc56cc477b25e6bb16-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "8f0aab7fd30ffc56cc477b25e6bb16-C001-2", "text": "The FrameNet lexical database (Fillmore & Baker 2010; Ruppenhofer et al. 2006) http://framenet.icsi." }, { "sent_id": "8f0aab7fd30ffc56cc477b25e6bb16-C001-3", "text": "berkeley.edu), covers roughly 13,000 lexical units (word senses) for the core Engish lexicon, associating them with roughly 1,200 fully defined semantic frames; these frames and their roles cover the majority of event types in everyday, non-specialist text, and they are documented with 200,000 manually annotated examples." }, { "sent_id": "8f0aab7fd30ffc56cc477b25e6bb16-C001-4", "text": "This tutorial will teach attendees what they need to know to start using the FrameNet lexical database as part of an NLP system." }, { "sent_id": "8f0aab7fd30ffc56cc477b25e6bb16-C001-5", "text": "We will cover the basics of Frame Semantics, explain how the database was created, introduce the Python API and the state of the art in automatic frame semantic role labeling systems; and we will discuss FrameNet collaboration with commercial partners." }, { "sent_id": "8f0aab7fd30ffc56cc477b25e6bb16-C001-6", "text": "Time permitting, we will present new research on frames and annotation of locative relations, as well as corresponding metaphorical uses, along with information about how frame semantic roles can aid the interpretation of metaphors." }, { "sent_id": "8f0aab7fd30ffc56cc477b25e6bb16-C001-7", "text": "\u2022 Introduction" }, { "sent_id": "8f0aab7fd30ffc56cc477b25e6bb16-C001-8", "text": "----------------------------------" }, { "sent_id": "8f0aab7fd30ffc56cc477b25e6bb16-C001-9", "text": "****" }, { "sent_id": "8f0aab7fd30ffc56cc477b25e6bb16-C001-10", "text": "\u2022 Introduction Michael Ellsworth (International Computer Science Institute, infinity@icsi.berkeley.edu) has been involved with FrameNet for well over a decade." 
}, { "sent_id": "8f0aab7fd30ffc56cc477b25e6bb16-C001-11", "text": "His chief focus is on semantic relations in FrameNet (Ruppenhofer et al. 2006) , how they can be used for paraphrase (Ellsworth & Janin 2007) , and mapping to other resources (Sche\u21b5czyk & Ellsworth 2006; Ferr\u00e1ndez et al. 2010b) ." }, { "sent_id": "8f0aab7fd30ffc56cc477b25e6bb16-C001-12", "text": "Increasingly, he has examined the connection of FrameNet to syntax and the Constructicon (Torrent & Ellsworth 2013; Ziem & Ellsworth to appear) , including in his pending dissertation on the constructions and frame semantics of emotion." } ], "y": { "@BACK@": { "gold_contexts": [ [ "8f0aab7fd30ffc56cc477b25e6bb16-C001-11" ] ], "cite_sentences": [ "8f0aab7fd30ffc56cc477b25e6bb16-C001-11" ] } } }, "ABC_0860b08831b01e7e98c66ced63b256_0": { "x": [ { "sent_id": "0860b08831b01e7e98c66ced63b256-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "0860b08831b01e7e98c66ced63b256-C001-2", "text": "Rescoring approaches for parsing aim to re-rank and change the order of parse trees produced by a general parser for a given sentence." }, { "sent_id": "0860b08831b01e7e98c66ced63b256-C001-3", "text": "The re-ranking quality depends on the precision of the rescoring function." }, { "sent_id": "0860b08831b01e7e98c66ced63b256-C001-4", "text": "However it is a challenge to design an appropriate function to determine the qualities of parse trees." }, { "sent_id": "0860b08831b01e7e98c66ced63b256-C001-5", "text": "No matter which method is used, Treebank is a widely used resource in parsing task." }, { "sent_id": "0860b08831b01e7e98c66ced63b256-C001-6", "text": "Most approaches utilize complex features to re-estimate the tree structures of a given sentence [1, 2, 3] ." }, { "sent_id": "0860b08831b01e7e98c66ced63b256-C001-7", "text": "Unfortunately, sizes of treebanks are generally small and insufficient, which results in a common problem of data sparseness." 
}, { "sent_id": "0860b08831b01e7e98c66ced63b256-C001-8", "text": "Learning knowledge from analyzing large-scaled unlabeled data is compulsory and proved useful in the previous works [4, 5, 6] ." }, { "sent_id": "0860b08831b01e7e98c66ced63b256-C001-9", "text": "How to extract useful information from unannotated large scale corpus has been a research issue." }, { "sent_id": "0860b08831b01e7e98c66ced63b256-C001-30", "text": "The first step is to have the parser to produce n-best parse trees with their structural scores." }, { "sent_id": "0860b08831b01e7e98c66ced63b256-C001-31", "text": "For each parsed tree including words, part-of-speech (PoS) and semantic role labels." }, { "sent_id": "0860b08831b01e7e98c66ced63b256-C001-32", "text": "Second, we extract word-to-word associations (or called word dependency, a dependency implies its close association with other words in either syntactic or semantic perspective) from large amounts of auto-parsed data and adopt word2vecf [13] to train dependency-based word embeddings." }, { "sent_id": "0860b08831b01e7e98c66ced63b256-C001-33", "text": "The last step is to build a structural rescoring method to find the best tree structure from the n-best candidates." }, { "sent_id": "0860b08831b01e7e98c66ced63b256-C001-34", "text": "We conduct experiments on the standard data sets of the Chinese Treebank." }, { "sent_id": "0860b08831b01e7e98c66ced63b256-C001-35", "text": "We also study how different types of embeddings influence on rescoring, including word, word with semantic role labels, and word senses (concepts)." }, { "sent_id": "0860b08831b01e7e98c66ced63b256-C001-36", "text": "Experimental results show that using semantic role labels in dependency embeddings has best performance. And the final experiments results indicate that our proposed approach outperforms the best parser in Chinese." 
}, { "sent_id": "0860b08831b01e7e98c66ced63b256-C001-37", "text": "Furthermore we attempt to compare the performance of using the traditional conditional probability method with our approach." }, { "sent_id": "0860b08831b01e7e98c66ced63b256-C001-38", "text": "From the experimental results, the embedding scores can relax data sparseness problem and have better results than the traditional approach." }, { "sent_id": "0860b08831b01e7e98c66ced63b256-C001-39", "text": "Keywords: Word Embeddings, Parsing, Word Dependency, Rescoring." }, { "sent_id": "0860b08831b01e7e98c66ced63b256-C001-10", "text": "Word embeddings have become increasingly popular lately, proving to be valuable as a source of features in a broad range of NLP tasks [7, 8, 9 ]." }, { "sent_id": "0860b08831b01e7e98c66ced63b256-C001-11", "text": "The word2vec [10] is among the most widely used word embedding models today." }, { "sent_id": "0860b08831b01e7e98c66ced63b256-C001-12", "text": "Their success is largely due to an efficient and user-friendly implementation that learns high quality word embeddings from very large corpora." }, { "sent_id": "0860b08831b01e7e98c66ced63b256-C001-13", "text": "The word2vec learns low dimensional continuous vector representations for words by considering window-based contexts, i.e., context words within some fixed distance of each side of the target words." }, { "sent_id": "0860b08831b01e7e98c66ced63b256-C001-14", "text": "Another different context type is dependency-based word embedding [11, 12, 13] , which considers syntactic contexts rather" }, { "sent_id": "0860b08831b01e7e98c66ced63b256-C001-15", "text": "----------------------------------" }, { "sent_id": "0860b08831b01e7e98c66ced63b256-C001-16", "text": "****" }, { "sent_id": "0860b08831b01e7e98c66ced63b256-C001-17", "text": "knowledge from analyzing large-scaled unlabeled data is compulsory and proved useful in the previous works [4, 5, 6] ." 
}, { "sent_id": "0860b08831b01e7e98c66ced63b256-C001-18", "text": "How to extract useful information from unannotated large scale corpus has been a research issue." }, { "sent_id": "0860b08831b01e7e98c66ced63b256-C001-19", "text": "Word embeddings have become increasingly popular lately, proving to be valuable as a source of features in a broad range of NLP tasks [7, 8, 9] ." }, { "sent_id": "0860b08831b01e7e98c66ced63b256-C001-20", "text": "The word2vec [10] is among the most widely used word embedding models today." }, { "sent_id": "0860b08831b01e7e98c66ced63b256-C001-21", "text": "Their success is largely due to an efficient and user-friendly implementation that learns high quality word embeddings from very large corpora." }, { "sent_id": "0860b08831b01e7e98c66ced63b256-C001-22", "text": "The word2vec learns low dimensional continuous vector representations for words by considering window-based contexts, i.e., context words within some fixed distance of each side of the target words." }, { "sent_id": "0860b08831b01e7e98c66ced63b256-C001-23", "text": "Another different context type is dependency-based word embedding [11, 12, 13] , which considers syntactic contexts rather" }, { "sent_id": "0860b08831b01e7e98c66ced63b256-C001-24", "text": "The 2016 Conference on Computational Linguistics and Speech Processing ROCLING 2016, pp." }, { "sent_id": "0860b08831b01e7e98c66ced63b256-C001-25", "text": "100-102 \uf0d3 The Association for Computational Linguistics and Chinese Language Processing 100 than window contexts in word2vec." }, { "sent_id": "0860b08831b01e7e98c66ced63b256-C001-26", "text": "Bansal et al. [8] and Melamud et al. [11] show the benefits of such modified-context embeddings in dependency parsing task." 
}, { "sent_id": "0860b08831b01e7e98c66ced63b256-C001-27", "text": "The dependency-based word embedding can relieve the problem of data sparseness, since even without occurrence of dependency word pairs in a corpus, dependency scores can be still calculated by word embeddings [12] ." }, { "sent_id": "0860b08831b01e7e98c66ced63b256-C001-28", "text": "In this paper, we proposed a rescoring approach for parsing, based on a combination of original parsing scores and dependency word embedding scores to assist the determination of the best parse tree among the n-best parse trees." }, { "sent_id": "0860b08831b01e7e98c66ced63b256-C001-29", "text": "There are three main steps in our rescoring approach." } ], "y": { "@MOT@": { "gold_contexts": [ [ "0860b08831b01e7e98c66ced63b256-C001-27", "0860b08831b01e7e98c66ced63b256-C001-28" ] ], "cite_sentences": [ "0860b08831b01e7e98c66ced63b256-C001-27" ] } } }, "ABC_8abffa3f807bad5ae2073aa7db215d_0": { "x": [ { "sent_id": "8abffa3f807bad5ae2073aa7db215d-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "8abffa3f807bad5ae2073aa7db215d-C001-2", "text": "Rescoring approaches for parsing aim to re-rank and change the order of parse trees produced by a general parser for a given sentence." }, { "sent_id": "8abffa3f807bad5ae2073aa7db215d-C001-3", "text": "The re-ranking quality depends on the precision of the rescoring function." }, { "sent_id": "8abffa3f807bad5ae2073aa7db215d-C001-4", "text": "However it is a challenge to design an appropriate function to determine the qualities of parse trees." }, { "sent_id": "8abffa3f807bad5ae2073aa7db215d-C001-5", "text": "No matter which method is used, Treebank is a widely used resource in parsing task." }, { "sent_id": "8abffa3f807bad5ae2073aa7db215d-C001-6", "text": "Most approaches utilize complex features to re-estimate the tree structures of a given sentence [1, 2, 3] ." 
}, { "sent_id": "8abffa3f807bad5ae2073aa7db215d-C001-7", "text": "Unfortunately, sizes of treebanks are generally small and insufficient, which results in a common problem of data sparseness." }, { "sent_id": "8abffa3f807bad5ae2073aa7db215d-C001-8", "text": "Learning knowledge from analyzing large-scaled unlabeled data is compulsory and proved useful in the previous works [4, 5, 6] ." }, { "sent_id": "8abffa3f807bad5ae2073aa7db215d-C001-9", "text": "How to extract useful information from unannotated large scale corpus has been a research issue." }, { "sent_id": "8abffa3f807bad5ae2073aa7db215d-C001-10", "text": "Word embeddings have become increasingly popular lately, proving to be valuable as a source of features in a broad range of NLP tasks [7, 8, 9 ]." }, { "sent_id": "8abffa3f807bad5ae2073aa7db215d-C001-11", "text": "The word2vec [10] is among the most widely used word embedding models today." }, { "sent_id": "8abffa3f807bad5ae2073aa7db215d-C001-12", "text": "Their success is largely due to an efficient and user-friendly implementation that learns high quality word embeddings from very large corpora." }, { "sent_id": "8abffa3f807bad5ae2073aa7db215d-C001-13", "text": "The word2vec learns low dimensional continuous vector representations for words by considering window-based contexts, i.e., context words within some fixed distance of each side of the target words." }, { "sent_id": "8abffa3f807bad5ae2073aa7db215d-C001-14", "text": "Another different context type is dependency-based word embedding [11, 12, 13] , which considers syntactic contexts rather" }, { "sent_id": "8abffa3f807bad5ae2073aa7db215d-C001-15", "text": "----------------------------------" }, { "sent_id": "8abffa3f807bad5ae2073aa7db215d-C001-16", "text": "****" }, { "sent_id": "8abffa3f807bad5ae2073aa7db215d-C001-17", "text": "knowledge from analyzing large-scaled unlabeled data is compulsory and proved useful in the previous works [4, 5, 6] ." 
}, { "sent_id": "8abffa3f807bad5ae2073aa7db215d-C001-18", "text": "How to extract useful information from unannotated large scale corpus has been a research issue." }, { "sent_id": "8abffa3f807bad5ae2073aa7db215d-C001-19", "text": "Word embeddings have become increasingly popular lately, proving to be valuable as a source of features in a broad range of NLP tasks [7, 8, 9] ." }, { "sent_id": "8abffa3f807bad5ae2073aa7db215d-C001-20", "text": "The word2vec [10] is among the most widely used word embedding models today." }, { "sent_id": "8abffa3f807bad5ae2073aa7db215d-C001-21", "text": "Their success is largely due to an efficient and user-friendly implementation that learns high quality word embeddings from very large corpora." }, { "sent_id": "8abffa3f807bad5ae2073aa7db215d-C001-22", "text": "The word2vec learns low dimensional continuous vector representations for words by considering window-based contexts, i.e., context words within some fixed distance of each side of the target words." }, { "sent_id": "8abffa3f807bad5ae2073aa7db215d-C001-23", "text": "Another different context type is dependency-based word embedding [11, 12, 13] , which considers syntactic contexts rather" }, { "sent_id": "8abffa3f807bad5ae2073aa7db215d-C001-24", "text": "The 2016 Conference on Computational Linguistics and Speech Processing ROCLING 2016, pp." }, { "sent_id": "8abffa3f807bad5ae2073aa7db215d-C001-25", "text": "100-102 \uf0d3 The Association for Computational Linguistics and Chinese Language Processing 100 than window contexts in word2vec." }, { "sent_id": "8abffa3f807bad5ae2073aa7db215d-C001-26", "text": "Bansal et al. [8] and Melamud et al. [11] show the benefits of such modified-context embeddings in dependency parsing task." 
}, { "sent_id": "8abffa3f807bad5ae2073aa7db215d-C001-27", "text": "The dependency-based word embedding can relieve the problem of data sparseness, since even without occurrence of dependency word pairs in a corpus, dependency scores can be still calculated by word embeddings [12] ." }, { "sent_id": "8abffa3f807bad5ae2073aa7db215d-C001-28", "text": "In this paper, we proposed a rescoring approach for parsing, based on a combination of original parsing scores and dependency word embedding scores to assist the determination of the best parse tree among the n-best parse trees." }, { "sent_id": "8abffa3f807bad5ae2073aa7db215d-C001-29", "text": "There are three main steps in our rescoring approach." }, { "sent_id": "8abffa3f807bad5ae2073aa7db215d-C001-30", "text": "The first step is to have the parser to produce n-best parse trees with their structural scores." }, { "sent_id": "8abffa3f807bad5ae2073aa7db215d-C001-31", "text": "For each parsed tree including words, part-of-speech (PoS) and semantic role labels." }, { "sent_id": "8abffa3f807bad5ae2073aa7db215d-C001-32", "text": "Second, we extract word-to-word associations (or called word dependency, a dependency implies its close association with other words in either syntactic or semantic perspective) from large amounts of auto-parsed data and adopt word2vecf [13] to train dependency-based word embeddings." }, { "sent_id": "8abffa3f807bad5ae2073aa7db215d-C001-33", "text": "The last step is to build a structural rescoring method to find the best tree structure from the n-best candidates." }, { "sent_id": "8abffa3f807bad5ae2073aa7db215d-C001-34", "text": "We conduct experiments on the standard data sets of the Chinese Treebank." }, { "sent_id": "8abffa3f807bad5ae2073aa7db215d-C001-35", "text": "We also study how different types of embeddings influence on rescoring, including word, word with semantic role labels, and word senses (concepts)." 
}, { "sent_id": "8abffa3f807bad5ae2073aa7db215d-C001-36", "text": "Experimental results show that using semantic role labels in dependency embeddings has best performance. And the final experiments results indicate that our proposed approach outperforms the best parser in Chinese." }, { "sent_id": "8abffa3f807bad5ae2073aa7db215d-C001-37", "text": "Furthermore we attempt to compare the performance of using the traditional conditional probability method with our approach." }, { "sent_id": "8abffa3f807bad5ae2073aa7db215d-C001-38", "text": "From the experimental results, the embedding scores can relax data sparseness problem and have better results than the traditional approach." }, { "sent_id": "8abffa3f807bad5ae2073aa7db215d-C001-39", "text": "Keywords: Word Embeddings, Parsing, Word Dependency, Rescoring." } ], "y": { "@BACK@": { "gold_contexts": [ [ "8abffa3f807bad5ae2073aa7db215d-C001-14" ], [ "8abffa3f807bad5ae2073aa7db215d-C001-23" ], [ "8abffa3f807bad5ae2073aa7db215d-C001-26" ] ], "cite_sentences": [ "8abffa3f807bad5ae2073aa7db215d-C001-14", "8abffa3f807bad5ae2073aa7db215d-C001-23", "8abffa3f807bad5ae2073aa7db215d-C001-26" ] } } }, "ABC_117d30ddacd28478c6cce1e6d39c12_0": { "x": [ { "sent_id": "117d30ddacd28478c6cce1e6d39c12-C001-24", "text": "The 2016 Conference on Computational Linguistics and Speech Processing ROCLING 2016, pp." }, { "sent_id": "117d30ddacd28478c6cce1e6d39c12-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "117d30ddacd28478c6cce1e6d39c12-C001-2", "text": "Rescoring approaches for parsing aim to re-rank and change the order of parse trees produced by a general parser for a given sentence." }, { "sent_id": "117d30ddacd28478c6cce1e6d39c12-C001-3", "text": "The re-ranking quality depends on the precision of the rescoring function." }, { "sent_id": "117d30ddacd28478c6cce1e6d39c12-C001-25", "text": "100-102 \uf0d3 The Association for Computational Linguistics and Chinese Language Processing 100 than window contexts in word2vec." 
}, { "sent_id": "117d30ddacd28478c6cce1e6d39c12-C001-26", "text": "Bansal et al. [8] and Melamud et al. [11] show the benefits of such modified-context embeddings in dependency parsing task." }, { "sent_id": "117d30ddacd28478c6cce1e6d39c12-C001-27", "text": "The dependency-based word embedding can relieve the problem of data sparseness, since even without occurrence of dependency word pairs in a corpus, dependency scores can be still calculated by word embeddings [12] ." }, { "sent_id": "117d30ddacd28478c6cce1e6d39c12-C001-28", "text": "In this paper, we proposed a rescoring approach for parsing, based on a combination of original parsing scores and dependency word embedding scores to assist the determination of the best parse tree among the n-best parse trees." }, { "sent_id": "117d30ddacd28478c6cce1e6d39c12-C001-29", "text": "There are three main steps in our rescoring approach." }, { "sent_id": "117d30ddacd28478c6cce1e6d39c12-C001-30", "text": "The first step is to have the parser to produce n-best parse trees with their structural scores." }, { "sent_id": "117d30ddacd28478c6cce1e6d39c12-C001-31", "text": "For each parsed tree including words, part-of-speech (PoS) and semantic role labels." }, { "sent_id": "117d30ddacd28478c6cce1e6d39c12-C001-32", "text": "Second, we extract word-to-word associations (or called word dependency, a dependency implies its close association with other words in either syntactic or semantic perspective) from large amounts of auto-parsed data and adopt word2vecf [13] to train dependency-based word embeddings." }, { "sent_id": "117d30ddacd28478c6cce1e6d39c12-C001-33", "text": "The last step is to build a structural rescoring method to find the best tree structure from the n-best candidates." }, { "sent_id": "117d30ddacd28478c6cce1e6d39c12-C001-34", "text": "We conduct experiments on the standard data sets of the Chinese Treebank." 
}, { "sent_id": "117d30ddacd28478c6cce1e6d39c12-C001-35", "text": "We also study how different types of embeddings influence on rescoring, including word, word with semantic role labels, and word senses (concepts)." }, { "sent_id": "117d30ddacd28478c6cce1e6d39c12-C001-36", "text": "Experimental results show that using semantic role labels in dependency embeddings has best performance. And the final experiments results indicate that our proposed approach outperforms the best parser in Chinese." }, { "sent_id": "117d30ddacd28478c6cce1e6d39c12-C001-37", "text": "Furthermore we attempt to compare the performance of using the traditional conditional probability method with our approach." }, { "sent_id": "117d30ddacd28478c6cce1e6d39c12-C001-4", "text": "However it is a challenge to design an appropriate function to determine the qualities of parse trees." }, { "sent_id": "117d30ddacd28478c6cce1e6d39c12-C001-5", "text": "No matter which method is used, Treebank is a widely used resource in parsing task." }, { "sent_id": "117d30ddacd28478c6cce1e6d39c12-C001-6", "text": "Most approaches utilize complex features to re-estimate the tree structures of a given sentence [1, 2, 3] ." }, { "sent_id": "117d30ddacd28478c6cce1e6d39c12-C001-7", "text": "Unfortunately, sizes of treebanks are generally small and insufficient, which results in a common problem of data sparseness." }, { "sent_id": "117d30ddacd28478c6cce1e6d39c12-C001-8", "text": "Learning knowledge from analyzing large-scaled unlabeled data is compulsory and proved useful in the previous works [4, 5, 6] ." }, { "sent_id": "117d30ddacd28478c6cce1e6d39c12-C001-9", "text": "How to extract useful information from unannotated large scale corpus has been a research issue." }, { "sent_id": "117d30ddacd28478c6cce1e6d39c12-C001-10", "text": "Word embeddings have become increasingly popular lately, proving to be valuable as a source of features in a broad range of NLP tasks [7, 8, 9 ]." 
}, { "sent_id": "117d30ddacd28478c6cce1e6d39c12-C001-11", "text": "The word2vec [10] is among the most widely used word embedding models today." }, { "sent_id": "117d30ddacd28478c6cce1e6d39c12-C001-12", "text": "Their success is largely due to an efficient and user-friendly implementation that learns high quality word embeddings from very large corpora." }, { "sent_id": "117d30ddacd28478c6cce1e6d39c12-C001-13", "text": "The word2vec learns low dimensional continuous vector representations for words by considering window-based contexts, i.e., context words within some fixed distance of each side of the target words." }, { "sent_id": "117d30ddacd28478c6cce1e6d39c12-C001-14", "text": "Another different context type is dependency-based word embedding [11, 12, 13] , which considers syntactic contexts rather" }, { "sent_id": "117d30ddacd28478c6cce1e6d39c12-C001-15", "text": "----------------------------------" }, { "sent_id": "117d30ddacd28478c6cce1e6d39c12-C001-16", "text": "****" }, { "sent_id": "117d30ddacd28478c6cce1e6d39c12-C001-17", "text": "knowledge from analyzing large-scaled unlabeled data is compulsory and proved useful in the previous works [4, 5, 6] ." }, { "sent_id": "117d30ddacd28478c6cce1e6d39c12-C001-18", "text": "How to extract useful information from unannotated large scale corpus has been a research issue." }, { "sent_id": "117d30ddacd28478c6cce1e6d39c12-C001-19", "text": "Word embeddings have become increasingly popular lately, proving to be valuable as a source of features in a broad range of NLP tasks [7, 8, 9] ." }, { "sent_id": "117d30ddacd28478c6cce1e6d39c12-C001-20", "text": "The word2vec [10] is among the most widely used word embedding models today." }, { "sent_id": "117d30ddacd28478c6cce1e6d39c12-C001-21", "text": "Their success is largely due to an efficient and user-friendly implementation that learns high quality word embeddings from very large corpora." 
}, { "sent_id": "117d30ddacd28478c6cce1e6d39c12-C001-22", "text": "The word2vec learns low dimensional continuous vector representations for words by considering window-based contexts, i.e., context words within some fixed distance of each side of the target words." }, { "sent_id": "117d30ddacd28478c6cce1e6d39c12-C001-23", "text": "Another different context type is dependency-based word embedding [11, 12, 13] , which considers syntactic contexts rather" }, { "sent_id": "117d30ddacd28478c6cce1e6d39c12-C001-38", "text": "From the experimental results, the embedding scores can relax data sparseness problem and have better results than the traditional approach." }, { "sent_id": "117d30ddacd28478c6cce1e6d39c12-C001-39", "text": "Keywords: Word Embeddings, Parsing, Word Dependency, Rescoring." } ], "y": { "@MOT@": { "gold_contexts": [ [ "117d30ddacd28478c6cce1e6d39c12-C001-10" ], [ "117d30ddacd28478c6cce1e6d39c12-C001-19" ] ], "cite_sentences": [ "117d30ddacd28478c6cce1e6d39c12-C001-10", "117d30ddacd28478c6cce1e6d39c12-C001-19" ] }, "@BACK@": { "gold_contexts": [ [ "117d30ddacd28478c6cce1e6d39c12-C001-26" ] ], "cite_sentences": [ "117d30ddacd28478c6cce1e6d39c12-C001-26" ] } } }, "ABC_ad62ec914bb7b002952f22afdca15f_0": { "x": [ { "sent_id": "ad62ec914bb7b002952f22afdca15f-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "ad62ec914bb7b002952f22afdca15f-C001-2", "text": "Rescoring approaches for parsing aim to re-rank and change the order of parse trees produced by a general parser for a given sentence." }, { "sent_id": "ad62ec914bb7b002952f22afdca15f-C001-3", "text": "The re-ranking quality depends on the precision of the rescoring function." }, { "sent_id": "ad62ec914bb7b002952f22afdca15f-C001-4", "text": "However it is a challenge to design an appropriate function to determine the qualities of parse trees." }, { "sent_id": "ad62ec914bb7b002952f22afdca15f-C001-5", "text": "No matter which method is used, Treebank is a widely used resource in parsing task." 
}, { "sent_id": "ad62ec914bb7b002952f22afdca15f-C001-6", "text": "Most approaches utilize complex features to re-estimate the tree structures of a given sentence [1, 2, 3] ." }, { "sent_id": "ad62ec914bb7b002952f22afdca15f-C001-7", "text": "Unfortunately, sizes of treebanks are generally small and insufficient, which results in a common problem of data sparseness." }, { "sent_id": "ad62ec914bb7b002952f22afdca15f-C001-8", "text": "Learning knowledge from analyzing large-scaled unlabeled data is compulsory and proved useful in the previous works [4, 5, 6] ." }, { "sent_id": "ad62ec914bb7b002952f22afdca15f-C001-9", "text": "How to extract useful information from unannotated large scale corpus has been a research issue." }, { "sent_id": "ad62ec914bb7b002952f22afdca15f-C001-10", "text": "Word embeddings have become increasingly popular lately, proving to be valuable as a source of features in a broad range of NLP tasks [7, 8, 9 ]." }, { "sent_id": "ad62ec914bb7b002952f22afdca15f-C001-11", "text": "The word2vec [10] is among the most widely used word embedding models today." }, { "sent_id": "ad62ec914bb7b002952f22afdca15f-C001-12", "text": "Their success is largely due to an efficient and user-friendly implementation that learns high quality word embeddings from very large corpora." }, { "sent_id": "ad62ec914bb7b002952f22afdca15f-C001-13", "text": "The word2vec learns low dimensional continuous vector representations for words by considering window-based contexts, i.e., context words within some fixed distance of each side of the target words." 
}, { "sent_id": "ad62ec914bb7b002952f22afdca15f-C001-14", "text": "Another different context type is dependency-based word embedding [11, 12, 13] , which considers syntactic contexts rather" }, { "sent_id": "ad62ec914bb7b002952f22afdca15f-C001-15", "text": "----------------------------------" }, { "sent_id": "ad62ec914bb7b002952f22afdca15f-C001-16", "text": "****" }, { "sent_id": "ad62ec914bb7b002952f22afdca15f-C001-17", "text": "knowledge from analyzing large-scaled unlabeled data is compulsory and proved useful in the previous works [4, 5, 6] ." }, { "sent_id": "ad62ec914bb7b002952f22afdca15f-C001-18", "text": "How to extract useful information from unannotated large scale corpus has been a research issue." }, { "sent_id": "ad62ec914bb7b002952f22afdca15f-C001-19", "text": "Word embeddings have become increasingly popular lately, proving to be valuable as a source of features in a broad range of NLP tasks [7, 8, 9] ." }, { "sent_id": "ad62ec914bb7b002952f22afdca15f-C001-20", "text": "The word2vec [10] is among the most widely used word embedding models today." }, { "sent_id": "ad62ec914bb7b002952f22afdca15f-C001-21", "text": "Their success is largely due to an efficient and user-friendly implementation that learns high quality word embeddings from very large corpora." }, { "sent_id": "ad62ec914bb7b002952f22afdca15f-C001-22", "text": "The word2vec learns low dimensional continuous vector representations for words by considering window-based contexts, i.e., context words within some fixed distance of each side of the target words." }, { "sent_id": "ad62ec914bb7b002952f22afdca15f-C001-23", "text": "Another different context type is dependency-based word embedding [11, 12, 13] , which considers syntactic contexts rather" }, { "sent_id": "ad62ec914bb7b002952f22afdca15f-C001-24", "text": "The 2016 Conference on Computational Linguistics and Speech Processing ROCLING 2016, pp." 
}, { "sent_id": "ad62ec914bb7b002952f22afdca15f-C001-25", "text": "100-102 \uf0d3 The Association for Computational Linguistics and Chinese Language Processing 100 than window contexts in word2vec." }, { "sent_id": "ad62ec914bb7b002952f22afdca15f-C001-26", "text": "Bansal et al. [8] and Melamud et al. [11] show the benefits of such modified-context embeddings in dependency parsing task." }, { "sent_id": "ad62ec914bb7b002952f22afdca15f-C001-27", "text": "The dependency-based word embedding can relieve the problem of data sparseness, since even without occurrence of dependency word pairs in a corpus, dependency scores can be still calculated by word embeddings [12] ." }, { "sent_id": "ad62ec914bb7b002952f22afdca15f-C001-28", "text": "In this paper, we proposed a rescoring approach for parsing, based on a combination of original parsing scores and dependency word embedding scores to assist the determination of the best parse tree among the n-best parse trees." }, { "sent_id": "ad62ec914bb7b002952f22afdca15f-C001-29", "text": "There are three main steps in our rescoring approach." }, { "sent_id": "ad62ec914bb7b002952f22afdca15f-C001-30", "text": "The first step is to have the parser to produce n-best parse trees with their structural scores." }, { "sent_id": "ad62ec914bb7b002952f22afdca15f-C001-31", "text": "For each parsed tree including words, part-of-speech (PoS) and semantic role labels." }, { "sent_id": "ad62ec914bb7b002952f22afdca15f-C001-32", "text": "Second, we extract word-to-word associations (or called word dependency, a dependency implies its close association with other words in either syntactic or semantic perspective) from large amounts of auto-parsed data and adopt word2vecf [13] to train dependency-based word embeddings." }, { "sent_id": "ad62ec914bb7b002952f22afdca15f-C001-33", "text": "The last step is to build a structural rescoring method to find the best tree structure from the n-best candidates." 
}, { "sent_id": "ad62ec914bb7b002952f22afdca15f-C001-34", "text": "We conduct experiments on the standard data sets of the Chinese Treebank." }, { "sent_id": "ad62ec914bb7b002952f22afdca15f-C001-35", "text": "We also study how different types of embeddings influence on rescoring, including word, word with semantic role labels, and word senses (concepts)." }, { "sent_id": "ad62ec914bb7b002952f22afdca15f-C001-36", "text": "Experimental results show that using semantic role labels in dependency embeddings has best performance. And the final experiments results indicate that our proposed approach outperforms the best parser in Chinese." }, { "sent_id": "ad62ec914bb7b002952f22afdca15f-C001-37", "text": "Furthermore we attempt to compare the performance of using the traditional conditional probability method with our approach." }, { "sent_id": "ad62ec914bb7b002952f22afdca15f-C001-38", "text": "From the experimental results, the embedding scores can relax data sparseness problem and have better results than the traditional approach." }, { "sent_id": "ad62ec914bb7b002952f22afdca15f-C001-39", "text": "Keywords: Word Embeddings, Parsing, Word Dependency, Rescoring." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "ad62ec914bb7b002952f22afdca15f-C001-14" ], [ "ad62ec914bb7b002952f22afdca15f-C001-23" ] ], "cite_sentences": [ "ad62ec914bb7b002952f22afdca15f-C001-14", "ad62ec914bb7b002952f22afdca15f-C001-23" ] }, "@USE@": { "gold_contexts": [ [ "ad62ec914bb7b002952f22afdca15f-C001-32" ] ], "cite_sentences": [ "ad62ec914bb7b002952f22afdca15f-C001-32" ] } } }, "ABC_5fe872d8e15ac38f845bc244f7bf5f_0": { "x": [ { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-74", "text": "**INTERACTION OF SENTIMENT AND DECEPTION**" }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-27", "text": "**DECEPTIVE REVIEWS FROM MECHANICAL TURK**" }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-2", "text": "The rising influence of user-generated online reviews (Cone, 2011) has led to growing incentive for businesses to solicit and manufacture DECEPTIVE OPINION SPAM-fictitious reviews that have been deliberately written to sound authentic and deceive the reader." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-3", "text": "Recently, Ott et al. (2011) have introduced an opinion spam dataset containing gold standard deceptive positive hotel reviews." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-4", "text": "However, the complementary problem of negative deceptive opinion spam, intended to slander competitive offerings, remains largely unstudied." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-5", "text": "Following an approach similar to Ott et al. (2011) , in this work we create and study the first dataset of deceptive opinion spam with negative sentiment reviews." 
}, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-73", "text": "----------------------------------" }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-6", "text": "Based on this dataset, we find that standard n-gram text categorization techniques can detect negative deceptive opinion spam with performance far surpassing that of human judges." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-7", "text": "Finally, in conjunction with the aforementioned positive review dataset, we consider the possible interactions between sentiment and deception, and present initial results that encourage further exploration of this relationship." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-8", "text": "----------------------------------" }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-10", "text": "Consumer's purchase decisions are increasingly influenced by user-generated online reviews of products and services (Cone, 2011) ." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-11", "text": "Accordingly, there is a growing incentive for businesses to solicit and manufacture DECEPTIVE OPINION SPAMfictitious reviews that have been deliberately written to sound authentic and deceive the reader (Ott et al., 2011) ." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-12", "text": "For example, Ott et al. (2012) has estimated that between 1% and 6% of positive hotel reviews appear to be deceptive, suggesting that some hotels may be posting fake positive reviews in order to hype their own offerings." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-13", "text": "In this work we distinguish between two kinds of deceptive opinion spam, depending on the sentiment expressed in the review." 
}, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-14", "text": "In particular, reviews intended to promote or hype an offering, and which therefore express a positive sentiment towards the offering, are called positive deceptive opinion spam." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-15", "text": "In contrast, reviews intended to disparage or slander competitive offerings, and which therefore express a negative sentiment towards the offering, are called negative deceptive opinion spam." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-16", "text": "While previous related work (Ott et al., 2011; Ott et al., 2012) has explored characteristics of positive deceptive opinion spam, the complementary problem of negative deceptive opinion spam remains largely unstudied." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-17", "text": "Following the framework of Ott et al. (2011) , we use Amazon's Mechanical Turk service to produce the first publicly available 1 dataset of negative deceptive opinion spam, containing 400 gold standard deceptive negative reviews of 20 popular Chicago hotels." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-18", "text": "To validate the credibility of our deceptive reviews, we show that human deception detection performance on the negative reviews is low, in agreement with decades of traditional deception detection research (Bond and DePaulo, 2006) ." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-19", "text": "We then show that standard n-gram text categorization techniques can be used to detect negative deceptive opinion spam with approximately 86% accuracy -far surpassing that of the human judges." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-20", "text": "In conjunction with Ott et al. 
(2011) 's positive deceptive opinion spam dataset, we then explore the interaction between sentiment and deception with respect to three types of language features: (1) changes in first-person singular use, often attributed to psychological distancing (Newman et al., 2003) , (2) decreased spatial awareness and more narrative form, consistent with theories of reality monitoring (Johnson and Raye, 1981) and imaginative writing (Biber et al., 1999; Rayson et al., 2001) , and (3) increased negative emotion terms, often attributed to leakage cues (Ekman and Friesen, 1969) , but perhaps better explained in our case as an exaggeration of the underlying review sentiment." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-21", "text": "----------------------------------" }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-22", "text": "**DATASET**" }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-23", "text": "One of the biggest challenges facing studies of deception is obtaining labeled data." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-24", "text": "Recently, Ott et al. (2011) have proposed an approach for generating positive deceptive opinion spam using Amazon's popular Mechanical Turk crowdsourcing service." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-25", "text": "In this section we discuss our efforts to extend Ott et al. (2011) 's dataset to additionally include negative deceptive opinion spam." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-26", "text": "----------------------------------" }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-28", "text": "Deceptive negative reviews are gathered from Mechanical Turk using the same procedure as Ott et al. (2011) ." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-29", "text": "In particular, we create and divide 400 HITs evenly across the 20 most popular hotels in Chicago, such that we obtain 20 reviews for each hotel." 
}, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-30", "text": "We allow workers to complete only a single HIT each, so that each review is written by a unique worker." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-31", "text": "2 We further require workers to be located in the United States and to have an average past approval rating of at least 90%." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-32", "text": "We allow a maximum of 30 minutes to complete the HIT, and reward accepted submissions with one US dollar ($1)." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-33", "text": "Each HIT instructs a worker to imagine that they work for the marketing department of a hotel, and that their manager has asked them to write a fake negative review of a competitor's hotel to be posted online." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-34", "text": "Accompanying each HIT is the name and URL of the hotel for which the fake negative review is to be written, and instructions that: (1) workers should not complete more than one similar HIT, (2) submissions must be of sufficient quality, i.e., written for the correct hotel, legible, reasonable in length, 3 and not plagiarized, 4 and, (3) the HIT is for academic research purposes." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-35", "text": "Submissions are manually inspected to ensure that they are written for the correct hotel and to ensure that they convey a generally negative sentiment." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-36", "text": "5 The average accepted review length was 178 words, higher than for the positive reviews gathered by Ott et al. (2011) , who report an average review length of 116 words." 
}, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-37", "text": "----------------------------------" }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-38", "text": "**TRUTHFUL REVIEWS FROM THE WEB**" }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-39", "text": "Negative (1-or 2-star) truthful reviews are mined from six popular online review communities: Expedia, Hotels.com, Orbitz, Priceline, TripAdvisor, and Yelp." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-40", "text": "While reviews mined from these communities cannot be considered gold standard truthful, recent work (Mayzlin et al., 2012; Ott et al., 2012) suggests that deception rates among travel review portals is reasonably small." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-41", "text": "Following Ott et al. (2011) , we sample a subset of the available truthful reviews so that we retain an equal number of truthful and deceptive reviews (20 each) for each hotel." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-42", "text": "However, because the truthful reviews are on average longer than our deceptive reviews, we sample the truthful reviews according to a log-normal distribution fit to the lengths of our deceptive reviews, similarly to Ott et al. (2011) Table 1 : Deception detection performance, incl." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-43", "text": "(P)recision, (R)ecall, and (F)1-score, for three human judges and two meta-judges on a set of 160 negative reviews." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-44", "text": "The largest value in each column is indicated with boldface." 
}, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-45", "text": "----------------------------------" }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-46", "text": "**HUMAN PERFORMANCE**" }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-47", "text": "Recent large-scale meta-analyses have shown human deception detection performance is low, with accuracies often not much better than chance (Bond and DePaulo, 2006) ." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-48", "text": "Indeed, Ott et al. (2011) found that two out of three human judges were unable to perform statistically significantly better than chance (at the p < 0.05 level) at detecting positive deceptive opinion spam." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-49", "text": "Nevertheless, it is important to subject our reviews to human judgments to validate their convincingness." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-50", "text": "In particular, if human detection performance is found to be very high, then it would cast doubt on the usefulness of the Mechanical Turk approach for soliciting gold standard deceptive opinion spam." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-51", "text": "Following Ott et al. (2011) , we asked three volunteer undergraduate university students to read and make assessments on a subset of the negative review dataset described in Section 2." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-52", "text": "Specifically, we randomized all 40 deceptive and truthful reviews from each of four hotels (160 reviews total)." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-53", "text": "We then asked the volunteers to read each review and mark whether they believed it to be truthful or deceptive." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-54", "text": "Performance for the three human judges appears in Table 1 ." 
}, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-55", "text": "We additionally show the deception detection performance of two meta-judges that aggregate the assessments of the individual human judges: (1) the MAJORITY meta-judge predicts deceptive when at least two out of three human judges predict deceptive (and truthful otherwise), and (2) the SKEP-TIC meta-judge predicts deceptive when at least one out of three human judges predicts deceptive (and truthful otherwise)." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-56", "text": "A two-tailed binomial test suggests that JUDGE 1 and JUDGE 2 both perform better than chance (p = 0.0002, 0.003, respectively), while JUDGE 3 fails to reject the null hypothesis of performing at-chance (p = 0.07)." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-57", "text": "However, while the best human judge is accurate 65% of the time, inter-annotator agreement computed using Fleiss' kappa is only slight at 0.07 (Landis and Koch, 1977) ." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-58", "text": "Furthermore, based on Cohen's kappa, the highest pairwise interannotator agreement is only 0.26, between JUDGE 1 and JUDGE 2." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-59", "text": "These low agreements suggest that while the judges may perform statistically better than chance, they are identifying different reviews as deceptive, i.e., few reviews are consistently identified as deceptive." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-60", "text": "----------------------------------" }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-61", "text": "**AUTOMATED CLASSIFIER PERFORMANCE**" }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-62", "text": "Standard n-gram-based text categorization techniques have been shown to be effective at detecting deception in text (Jindal and Liu, 2008; Mihalcea and Strapparava, 2009; Ott et al., 2011; Feng et al., 2012) ." 
}, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-63", "text": "Following Ott et al. (2011) , we evaluate the performance of linear Support Vector Machine (SVM) classifiers trained with unigram and bigram term-frequency features on our novel negative deceptive opinion spam dataset." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-64", "text": "We employ the same 5-fold stratified cross-validation (CV) procedure as Ott et al. (2011) , whereby for each cross-validation iteration we train our model on all reviews for 16 hotels, and test our model on all reviews for the remaining 4 hotels." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-65", "text": "The SVM cost parameter, C, is tuned by nested cross-validation on the training data." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-66", "text": "Results appear in Table 2 ." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-67", "text": "Each row lists the sentiment of the train and test reviews, where \"Cross Val.\" corresponds to the cross-validation procedure described above, and \"Held Out\" corresponds to classifiers trained on reviews of one sentiment and tested on the other." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-68", "text": "The results suggest that n-grambased SVM classifiers can detect negative deceptive opinion spam in a balanced dataset with performance far surpassing that of untrained human judges (see Section 3.1)." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-69", "text": "Furthermore, our results show that classifiers trained and tested on reviews of different sentiments perform worse, despite having more training data, 7 than classifiers trained and tested on reviews of the same sentiment." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-70", "text": "This suggests that cues to deception differ depending on the sentiment of the text (see Section 4)." 
}, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-71", "text": "Interestingly, we find that training on the combined sentiment dataset results in performance that is comparable to that of the \"same sentiment\" classifiers (88.4% vs. 89.3% accuracy for positive reviews and 86.0% vs. 86.0% accuracy for negative reviews)." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-72", "text": "This is explainable in part by the increased training set size (1,280 vs. 640 reviews per 4 training folds)." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-75", "text": "An important question is how language features operate in our fake negative reviews compared with the fake positive reviews of Ott et al. (2011) ." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-76", "text": "For example, fake positive reviews included less spatial language (e.g., floor, small, location, etc.) because individuals who had not actually experienced the hotel simply had less spatial detail available for their review (Johnson and Raye, 1981) ." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-77", "text": "This was also the case for our negative reviews, with less spatial language observed for fake negative reviews relative to truthful." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-78", "text": "Likewise, our fake negative reviews had more verbs relative to nouns than truthful, suggesting a more narrative style that is indicative of imaginative writing (Biber et al., 1999; Rayson et al., 2001 ), a pattern also observed by Ott et al. (2011) ." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-79", "text": "There were, however, several important differences in the deceptive language of fake negative relative to fake positive reviews." 
}, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-80", "text": "First, as might be expected, negative emotion terms were more fre-quent, according to LIWC (Pennebaker et al., 2007) , in our fake negative reviews than in the fake positive reviews. But, importantly, the fake negative reviewers over-produced negative emotion terms (e.g., terrible, disappointed) relative to the truthful reviews in the same way that fake positive reviewers over-produced positive emotion terms (e.g., elegant, luxurious)." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-81", "text": "Combined, these data suggest that the more frequent negative emotion terms in the present dataset are not the result of \"leakage cues\" that reveal the emotional distress of lying (Ekman and Friesen, 1969) ." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-82", "text": "Instead, the differences suggest that fake hotel reviewers exaggerate the sentiment they are trying to convey relative to similarly-valenced truthful reviews." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-83", "text": "Second, the effect of deception on the pattern of pronoun frequency was not the same across positive and negative reviews." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-84", "text": "In particular, while first person singular pronouns were produced more frequently in fake reviews than truthful, consistent with the case for positive reviews, the increase was diminished in the negative reviews examined here." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-85", "text": "In the positive reviews reported by Ott et al. (2011) , the rate of first person singular in fake reviews (M=4.36%, SD=2.96%) was twice the rate observed in truthful reviews (M=2.18%, SD=2.04%)." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-86", "text": "In contrast, the rate of first person singular in the deceptive negative reviews (M=4.47%, SD=2.83%) was only 57% greater than for truthful reviews (M=2.85%, SD=2.23%)." 
}, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-87", "text": "These results suggest that the emphasis on the self, perhaps as a strategy of convincing the reader that the author had actually been to the hotel, is not as evident in the fake negative reviews, perhaps because the negative tone of the reviews caused the reviewers to psychologically distance themselves from their negative statements, a phenomenon observed in several other deception studies, e.g., Hancock et al. (2008) ." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-88", "text": "----------------------------------" }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-89", "text": "**CONCLUSION**" }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-90", "text": "We have created the first publicly-available corpus of gold standard negative deceptive opinion spam, containing 400 reviews of 20 Chicago hotels, which we have used to compare the deception detection capabilities of untrained human judges and standard n-gram-based Support Vector Machine classifiers." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-91", "text": "Our results demonstrate that while human deception detection performance is greater for negative rather than positive deceptive opinion spam, the best detection performance is still achieved through automated classifiers, with approximately 86% accuracy." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-92", "text": "We have additionally explored, albeit briefly, the relationship between sentiment and deception by utilizing Ott et al. (2011) 's positive deceptive opinion spam dataset in conjunction with our own." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-93", "text": "In particular, we have identified several features of language that seem to remain consistent across sentiment, such as decreased awareness of spatial details and exaggerated language." 
}, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-94", "text": "We have also identified other features that vary with the sentiment, such as first person singular use, although further work is required to determine if these differences may be exploited to improve deception detection performance." }, { "sent_id": "5fe872d8e15ac38f845bc244f7bf5f-C001-95", "text": "Indeed, future work may wish to jointly model sentiment and deception in order to better determine the effect each has on language use." } ], "y": { "@MOT@": { "gold_contexts": [ [ "5fe872d8e15ac38f845bc244f7bf5f-C001-3", "5fe872d8e15ac38f845bc244f7bf5f-C001-4" ], [ "5fe872d8e15ac38f845bc244f7bf5f-C001-11" ], [ "5fe872d8e15ac38f845bc244f7bf5f-C001-16" ], [ "5fe872d8e15ac38f845bc244f7bf5f-C001-23", "5fe872d8e15ac38f845bc244f7bf5f-C001-24" ] ], "cite_sentences": [ "5fe872d8e15ac38f845bc244f7bf5f-C001-3", "5fe872d8e15ac38f845bc244f7bf5f-C001-11", "5fe872d8e15ac38f845bc244f7bf5f-C001-16", "5fe872d8e15ac38f845bc244f7bf5f-C001-24" ] }, "@SIM@": { "gold_contexts": [ [ "5fe872d8e15ac38f845bc244f7bf5f-C001-5" ], [ "5fe872d8e15ac38f845bc244f7bf5f-C001-17" ], [ "5fe872d8e15ac38f845bc244f7bf5f-C001-41" ], [ "5fe872d8e15ac38f845bc244f7bf5f-C001-42" ], [ "5fe872d8e15ac38f845bc244f7bf5f-C001-51" ], [ "5fe872d8e15ac38f845bc244f7bf5f-C001-78" ] ], "cite_sentences": [ "5fe872d8e15ac38f845bc244f7bf5f-C001-5", "5fe872d8e15ac38f845bc244f7bf5f-C001-17", "5fe872d8e15ac38f845bc244f7bf5f-C001-41", "5fe872d8e15ac38f845bc244f7bf5f-C001-42", "5fe872d8e15ac38f845bc244f7bf5f-C001-51", "5fe872d8e15ac38f845bc244f7bf5f-C001-78" ] }, "@USE@": { "gold_contexts": [ [ "5fe872d8e15ac38f845bc244f7bf5f-C001-17" ], [ "5fe872d8e15ac38f845bc244f7bf5f-C001-20" ], [ "5fe872d8e15ac38f845bc244f7bf5f-C001-28" ], [ "5fe872d8e15ac38f845bc244f7bf5f-C001-51" ], [ "5fe872d8e15ac38f845bc244f7bf5f-C001-63" ], [ "5fe872d8e15ac38f845bc244f7bf5f-C001-64" ], [ "5fe872d8e15ac38f845bc244f7bf5f-C001-92" ] ], "cite_sentences": [ 
"5fe872d8e15ac38f845bc244f7bf5f-C001-17", "5fe872d8e15ac38f845bc244f7bf5f-C001-20", "5fe872d8e15ac38f845bc244f7bf5f-C001-28", "5fe872d8e15ac38f845bc244f7bf5f-C001-51", "5fe872d8e15ac38f845bc244f7bf5f-C001-63", "5fe872d8e15ac38f845bc244f7bf5f-C001-64", "5fe872d8e15ac38f845bc244f7bf5f-C001-92" ] }, "@EXT@": { "gold_contexts": [ [ "5fe872d8e15ac38f845bc244f7bf5f-C001-25" ] ], "cite_sentences": [ "5fe872d8e15ac38f845bc244f7bf5f-C001-25" ] }, "@DIF@": { "gold_contexts": [ [ "5fe872d8e15ac38f845bc244f7bf5f-C001-36" ] ], "cite_sentences": [ "5fe872d8e15ac38f845bc244f7bf5f-C001-36" ] }, "@BACK@": { "gold_contexts": [ [ "5fe872d8e15ac38f845bc244f7bf5f-C001-48" ], [ "5fe872d8e15ac38f845bc244f7bf5f-C001-62" ], [ "5fe872d8e15ac38f845bc244f7bf5f-C001-85" ] ], "cite_sentences": [ "5fe872d8e15ac38f845bc244f7bf5f-C001-48", "5fe872d8e15ac38f845bc244f7bf5f-C001-62", "5fe872d8e15ac38f845bc244f7bf5f-C001-85" ] }, "@UNSURE@": { "gold_contexts": [ [ "5fe872d8e15ac38f845bc244f7bf5f-C001-75" ] ], "cite_sentences": [ "5fe872d8e15ac38f845bc244f7bf5f-C001-75" ] } } }, "ABC_5f62958d0cdd32b15067c1afe458a5_0": { "x": [ { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-19", "text": "Copyrights for third-party components of this work must be honored." }, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-20", "text": "For all other uses, contact the owner/author(s)." }, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-2", "text": "Unsupervised learning of low-dimensional, semantic representations of words and entities has recently gained a ention." }, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-3", "text": "In this paper we describe the Semantic Entity Retrieval Toolkit (SERT) that provides implementations of our previously published entity representation models." 
}, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-4", "text": "e toolkit provides a uni ed interface to di erent representation learning algorithms, ne-grained parsing con guration and can be used transparently with GPUs." }, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-5", "text": "In addition, users can easily modify existing models or implement their own models in the framework." }, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-6", "text": "A er model training, SERT can be used to rank entities according to a textual query and extract the learned entity/word representation for use in downstream algorithms, such as clustering or recommendation." }, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-7", "text": "----------------------------------" }, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-9", "text": "e unsupervised learning of low-dimensional, semantic representations of words and entities has recently gained a ention for the entity-oriented tasks of expert nding [9] and product search [8] ." }, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-10", "text": "Representations are learned from a document collection and domain-speci c associations between documents and entities." }, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-11", "text": "Expert nding is the task of nding the right person with the appropriate skills or knowledge [1] and an association indicates document authorship (e.g., academic papers) or involvement in a project (e.g., annual progress reports)." }, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-12", "text": "In the case of product search, an associated document is a product description or review [8] ." }, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-13", "text": "In this paper we describe the Semantic Entity Retrieval Toolkit (SERT) that provides implementations of our previously published entity representation models [8, 9] ." 
}, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-14", "text": "Beyond a uni ed interface that combines di erent models, the toolkit allows for ne-grained parsing con guration and GPU-based training through integration with eano [3, 6] ." }, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-15", "text": "Users can easily extend existing models or implement their own models within the uni ed framework." }, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-16", "text": "A er model training, SERT can compute matching scores between an entity and a piece of text (e.g., a query)." }, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-17", "text": "is matching score can then be used for ranking entities, or as a feature in a downstream machine learning system, such as the learning to rank component * e toolkit is licensed under the permissive MIT open-source license and can be found at h ps://github.com/cvangysel/SERT." }, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-18", "text": "Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for pro t or commercial advantage and that copies bear this notice and the full citation on the rst page." }, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-21", "text": "e collection is parsed, processed and packaged in a numerical format using the prepare ( \u00a72.1) utility." }, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-22", "text": "A erwards, the training ( \u00a72.2) utility learns representations of entities and words and the query ( \u00a72.3) utility is used to compute matching scores between entities and queries." }, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-23", "text": "of a search engine." 
}, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-24", "text": "In addition, the learned representations can be extracted and used as feature vectors in entity clustering or recommendation tasks [10] ." }, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-25", "text": "e toolkit is licensed under the permissive MIT open-source license; see the footnote on the rst page." }, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-26", "text": "----------------------------------" }, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-27", "text": "**THE TOOLKIT**" }, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-28", "text": "SERT is organized as a pipeline of utilities as depicted in Fig. 1 ." }, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-29", "text": "First, a collection of documents and entity associations is processed and packaged using a numerical format ( \u00a72.1)." }, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-30", "text": "Low-dimensional representations of words and entities are then learned ( \u00a72.2) and a erwards the representations can be used to make inferences ( \u00a72.3)." }, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-31", "text": "----------------------------------" }, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-32", "text": "**COLLECTION PARSING AND PREPARATION**" }, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-33", "text": "To begin, SERT constructs a vocabulary that will be used to tokenize the document collection." }, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-34", "text": "Non-signi cant words that are too frequent (e.g., stopwords), noisy (e.g., single characters) and rare words are ltered out." }, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-35", "text": "Words that do not occur in the dictionary are ignored." 
}, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-36", "text": "A erwards, word sequences are extracted from the documents and stored together with the associated entities in the numerical format provided by NumPy [7] ." }, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-37", "text": "Word sequences can be extracted consecutively or a stride can be speci ed to extract nonconsecutive windows." }, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-38", "text": "In addition, a hierarchy of word sequence extractors can be applied to extract skip-grams, i.e., word sequences where a number of tokens are skipped a er selecting a token [4] ." }, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-39", "text": "To support short documents, a special-purpose padding token can be used to ll up word sequences that are longer than a particular document." }, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-40", "text": "A er word sequence extraction, a weight can be assigned to each word sequence/entity pair that can be used to re-weight the training objective." }, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-41", "text": "For example, in the case of expert nding [9] , this weight is the reciprocal of the document length of the document where the sequence was extracted from. is avoids a bias in the objective towards long documents." }, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-42", "text": "An alternative option that exists within the toolkit is to resample word sequence/entity pairs such that every entity is associated with the same number of word sequences, as used for product search [8] ." }, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-43", "text": "Code snippet 1: Illustrative example of the SERT model interface." }, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-44", "text": "e full interface supports more functionality omitted here for brevity." 
}, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-45", "text": "Users can de ne a symbolic graph of computation using the eano library [6] in combination with Lasagne [3] ." }, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-46", "text": "----------------------------------" }, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-47", "text": "**REPRESENTATION LEARNING**" }, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-48", "text": "A er the collection has been processed and packaged in a machinefriendly format, representations of words and entities can be learned." }, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-49", "text": "e toolkit includes implementations of state-of-the-art representation learning models that were applied to expert nding [9] and product search [8] ." }, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-50", "text": "Users of the toolkit can use these implementations to learn representations out-of-the-box or adapt the algorithms to their needs." }, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-51", "text": "In addition, users can implement their own models by extending an interface provided by the framework." }, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-52", "text": "Code snippet 1 shows an example of a model implemented in the SERT toolkit where users can de ne a symbolic cost function that will be optimized using eano [6] ." }, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-53", "text": "Due to the component-wise organization of the toolkit (Fig. 1) , modeling and text processing are separated from each other." }, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-54", "text": "Consequently, researchers can focus on modeling and representation learning only." }, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-55", "text": "In addition, any improvements to the collection processing ( \u00a72.1) collectively bene ts all models implemented in SERT." 
}, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-56", "text": "----------------------------------" }, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-57", "text": "**ENTITY RANKING & OTHER USES OF THE REPRESENTATIONS**" }, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-58", "text": "Once a model has been trained, SERT can be used to rank entities w.r.t." }, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-59", "text": "a textual query." }, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-60", "text": "e concrete implementation used to rank entities depends on the model that was trained." }, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-61", "text": "In the most generic case, a matching score is computed for every entity and entities are ranked in decreasing order of his score." }, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-62", "text": "However, in the special case when the model is interpreted as a metric vector space [2, 8] , SERT casts entity ranking as a k-nearest neighbor problem and uses specialized data structures for retrieval [5] ." }, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-63", "text": "A er ranking, SERT outputs the entity rankings as a TREC-compatible le that can be used as input to the trec eval 1 evaluation utility." }, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-64", "text": "In this paper we described the Semantic Entity Retrieval Toolkit, a toolkit that learns latent representations of words and entities." }, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-65", "text": "e toolkit contains implementations of state-of-the-art entity representations algorithms [8, 9] and consists of three components: text processing, representation learning and inference." }, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-66", "text": "Users of the toolkit can easily make changes to existing model implementations or contribute their own models by extending an interface provided by the SERT framework." 
}, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-67", "text": "Future work includes integration with Pyndri [11] such that document collections indexed with Indri can transparently be used to train entity representations." }, { "sent_id": "5f62958d0cdd32b15067c1afe458a5-C001-68", "text": "In addition, integration with machine learning frameworks besides eano, such as TensorFlow and PyTorch, will make it easier to integrate existing models into SERT." } ], "y": { "@MOT@": { "gold_contexts": [ [ "5f62958d0cdd32b15067c1afe458a5-C001-3" ], [ "5f62958d0cdd32b15067c1afe458a5-C001-9" ] ], "cite_sentences": [ "5f62958d0cdd32b15067c1afe458a5-C001-3", "5f62958d0cdd32b15067c1afe458a5-C001-9" ] }, "@EXT@": { "gold_contexts": [ [ "5f62958d0cdd32b15067c1afe458a5-C001-3" ], [ "5f62958d0cdd32b15067c1afe458a5-C001-13" ], [ "5f62958d0cdd32b15067c1afe458a5-C001-42" ], [ "5f62958d0cdd32b15067c1afe458a5-C001-49", "5f62958d0cdd32b15067c1afe458a5-C001-50" ] ], "cite_sentences": [ "5f62958d0cdd32b15067c1afe458a5-C001-3", "5f62958d0cdd32b15067c1afe458a5-C001-13", "5f62958d0cdd32b15067c1afe458a5-C001-42", "5f62958d0cdd32b15067c1afe458a5-C001-49" ] }, "@BACK@": { "gold_contexts": [ [ "5f62958d0cdd32b15067c1afe458a5-C001-9" ], [ "5f62958d0cdd32b15067c1afe458a5-C001-12" ], [ "5f62958d0cdd32b15067c1afe458a5-C001-49", "5f62958d0cdd32b15067c1afe458a5-C001-50" ] ], "cite_sentences": [ "5f62958d0cdd32b15067c1afe458a5-C001-9", "5f62958d0cdd32b15067c1afe458a5-C001-12", "5f62958d0cdd32b15067c1afe458a5-C001-49" ] }, "@SIM@": { "gold_contexts": [ [ "5f62958d0cdd32b15067c1afe458a5-C001-42" ] ], "cite_sentences": [ "5f62958d0cdd32b15067c1afe458a5-C001-42" ] }, "@UNSURE@": { "gold_contexts": [ [ "5f62958d0cdd32b15067c1afe458a5-C001-62" ] ], "cite_sentences": [ "5f62958d0cdd32b15067c1afe458a5-C001-62" ] }, "@USE@": { "gold_contexts": [ [ "5f62958d0cdd32b15067c1afe458a5-C001-65" ] ], "cite_sentences": [ "5f62958d0cdd32b15067c1afe458a5-C001-65" ] } } }, "ABC_2b148e376c39eae7f674610118e588_0": { "x": [ { 
"sent_id": "2b148e376c39eae7f674610118e588-C001-91", "text": "Results are shown in Table 1 ." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-2", "text": "There is growing interest in the language developed by agents interacting in emergentcommunication settings." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-3", "text": "Earlier studies have focused on the agents' symbol usage, rather than on their representation of visual input." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-4", "text": "In this paper, we consider the referential games of Lazaridou et al. (2017) , and investigate the representations the agents develop during their evolving interaction." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-5", "text": "We find that the agents establish successful communication by inducing visual representations that almost perfectly align with each other, but, surprisingly, do not capture the conceptual properties of the objects depicted in the input images." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-6", "text": "We conclude that, if we are interested in developing language-like communication systems, we must pay more attention to the visual semantics agents associate to the symbols they use." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-7", "text": "----------------------------------" }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-9", "text": "There has recently been a revival of interests in language emergence simulations involving agents interacting in visually-grounded games." 
}, { "sent_id": "2b148e376c39eae7f674610118e588-C001-10", "text": "Unlike earlier work (e.g., Briscoe, 2002; Cangelosi and Parisi, 2002; Steels, 2012) , many recent simulations consider realistic visual input, for example, by playing referential games with real-life pictures (e.g., Jorge et al., 2016; Lazaridou et al., 2017; Havrylov and Titov, 2017; Lee et al., 2018; Evtimova et al., 2018) ." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-11", "text": "This setup allows us to address the exciting issue of whether the needs of goal-directed communication will lead agents to associate visually-grounded conceptual representations to discrete symbols, developing naturallanguage-like word meanings." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-12", "text": "However, while most studies present some analysis of the agents' symbol usage, they pay little or no attention to the representation of the visual input that the agents develop as part of their evolving interaction." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-13", "text": "We study here agent representations following the model and setup of Lazaridou et al. (2017) ." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-14", "text": "This is an ideal starting point, since it involves an extremely simple signaling game (Lewis, 1969) , that is however played with naturalistic images, thus allowing us to focus on the question of how the agents represent these images, and whether such representations meet our expectations for natural word meanings." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-15", "text": "In their first game, Lazaridou's Sender and Receiver are exposed to the same pair of images, one of them being randomly marked as the \"target\"." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-16", "text": "The Sender always sees the target in the left position, and it must pick one discrete symbol from a fixed vocabulary to send to the Receiver." 
}, { "sent_id": "2b148e376c39eae7f674610118e588-C001-17", "text": "The Receiver sees the images in random order, together with the sent symbol, and it tries to guess which image is the target." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-18", "text": "In case of success, both players get a payoff of 1." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-19", "text": "Since an analysis of vocabulary usage brings inconclusive evidence that the agents are using the symbols to represent natural concepts (such as beaver or bayonet), Lazaridou and colleagues next modify the game, by presenting to the Sender and the Receiver different images for each of the two concepts (e.g., the Sender must now signal that the target is a beaver, while seeing a different beaver from the one shown to the Receiver)." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-20", "text": "This setup should encourage conceptlevel thinking, since the two agents should not be able to communicate about low-level perceptual characteristics of images they do not share." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-21", "text": "Lazaridou and colleagues present preliminary evidence suggesting that, indeed, agents are now developing conceptual symbol meanings." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-22", "text": "We replicate Lazaridou's games, and we find that, in both, the agents develop successfully aligned representations that, however, are not capturing conceptual properties at all." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-23", "text": "In what is perhaps our most striking result, agents trained in either version of the game succeed at communicating about pseudoimages generated from random noise (Fig. 2) ." 
}, { "sent_id": "2b148e376c39eae7f674610118e588-C001-24", "text": "We conclude that, if we want interactive agents to develop a vocabulary of words denoting natural meanings, more attention must be paid to the way in which they are representing their perceptual input." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-25", "text": "----------------------------------" }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-26", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-27", "text": "Architecture We re-implement Lazaridou's Sender and Receiver architectures (using their better-behaved \"informed\" Sender)." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-28", "text": "Both agents are feed-forward networks." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-29", "text": "The Sender takes image representations as input, it projects them into its own representational space, compares them, and finally outputs a probability distribution over vocabulary symbols, from which a single discrete symbol is then sampled." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-30", "text": "We report here results obtained with an output vocabulary of 100 symbols, but the same patterns were observed using a range of sizes from 2 to 1, 000." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-31", "text": "The Receiver takes as input the target and distractor input image representations in random order, as well as the symbol produced by the sender (as a vocabulary-sized one-hot vector)." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-32", "text": "It embeds the images and the symbol into its own representational space, where it performs a symbol-to-image comparison, producing a probability distribution over the two images, one of which is selected by sampling from this distribution." 
}, { "sent_id": "2b148e376c39eae7f674610118e588-C001-33", "text": "If the Receiver selected the target image, a reward of 1 is assigned to both agents." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-34", "text": "The whole architecture is jointly trained by letting the agents play, and updating their parameters with Reinforce (Williams, 1992) ." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-35", "text": "See Lazaridou et al. (2017) for details." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-36", "text": "Data Following Lazaridou et al. (2017) , for each of the 463 concepts they used, we randomly sample 100 images from ImageNet (Deng et al., 2009 )." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-37", "text": "We construct 50, 000 mini-batches of 32 image pairs during training and 1, 024 pairs for validation." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-38", "text": "We construct a held-out test set in the same way by sampling 10 images per concept from ImageNet (for 2 concepts, we were not able to assemble enough further images), for a total of 4, 610." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-39", "text": "We compute RSA scores (see below) on the crossproduct of these images." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-40", "text": "We also use the heldout set to construct mini-batches of images pairs to compute test performance." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-41", "text": "Following Lazaridou, the images are passed through a pre-trained VGG ConvNet (Simonyan and Zisserman, 2015) ." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-42", "text": "The input vector fed to the agents is the second-tolast 4096-D fully connected layer 1 ." 
}, { "sent_id": "2b148e376c39eae7f674610118e588-C001-43", "text": "Games We re-implement both Lazaridou's same-image game, where Sender and Receiver are shown the same two images (always of different concepts), and their different-image game, where the Receiver sees different images than the Sender's." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-44", "text": "We repeat all experiments using 100 random initialization seeds." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-45", "text": "As we faithfully reproduced the setup of Lazaridou et al. (2017) , we refer the reader there for hyper-parameters and training details." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-46", "text": "----------------------------------" }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-47", "text": "**EXPERIMENTS**" }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-48", "text": "We first asked in which way playing the game affects the way agents \"see\" the input data, that is, in which way their image embeddings differ from the input image representations, and from each other." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-49", "text": "Concerning Sender and Receiver, a reasonable expectation is that successful communication implies a convergence of representations." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-50", "text": "How should these representations relate to the input? Recall that input representations are from one of the top layers of a state-of-the-art ConvNet trained on ImageNet concept categorization, and the top layers of such networks are known to capture highlevel concept semantics (Zeiler and Fergus, 2014) ." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-51", "text": "The game image pairs are always sampled from different concepts." 
}, { "sent_id": "2b148e376c39eae7f674610118e588-C001-52", "text": "So, it would make sense for the agents to simply learn to carry through the similarity structure of the input space, in order to communicate about distinct concepts." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-53", "text": "Consequently, we predicted that, as training proceeds, Sender and Receiver representations will become closer to each other, and to the input ones." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-54", "text": "In order to compare the similarity structure of input, Sender and Receiver spaces, we borrow representational similarity analysis (RSA) from computational neuroscience (Kriegeskorte et al., 2008) ." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-55", "text": "Given two sets r 1 and r 2 of representations of the same item collection (e.g., r 1 is the collection of input images mapped in Sender embedding space and r 2 is the same collection represented by Receiver), we first compute s 1 as all possible pairwise (cosine) similarities between the representations in r 1 , and s 2 as those in r 2 ." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-56", "text": "We then compute the (Spearman) correlation between the similarity vectors s 1 and s 2 ." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-57", "text": "This latter value, which we will call RSA score, measures the global agreement between s 1 and s 2 , relative to the chosen input collection." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-58", "text": "If N is the number of items in the collection that we compute representations for, both similarity vectors s 1 and s 2 are of length N (N \u2212 1)." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-59", "text": "Therefore, it is not necessary for representations r 1 and r 2 to belong to the same space (for example, in our case, input and agent vectors have different dimensionality)." 
}, { "sent_id": "2b148e376c39eae7f674610118e588-C001-60", "text": "Figure 1 shows RSA and mean validation reward (MVR) development curves for the crossvalidated best seed in the same-image game." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-61", "text": "At the beginning of training, the RSA scores are nonzeros, which is expected as the two agents architectures are similar and randomly initialized the same way." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-62", "text": "They are also somewhat correlated with the input, which we attribute to the fact that untrained neural networks can already extract relevant image features (Jarrett et al., 2009 )." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-63", "text": "As training converges, Sender and Receiver similarity spaces also converge." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-64", "text": "However, contrary to our prediction, the agent similarity spaces are not strongly correlated with the input visual space." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-65", "text": "We note that, during the first few hundred games, the Sender (green curve) aligns with the input, but the Receiver (blue curve) does not." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-66", "text": "Therefore, it seems that, in order to establish communication, the two agents have to drift from the input." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-67", "text": "Indeed, when communication is successfully established at the end of training, 2 the two agents have a RSA score of \u03c1 S/R = 0.98." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-68", "text": "However either agent's score with the input is a much lower \u03c1 S/I = \u03c1 R/I = 0.33." 
}, { "sent_id": "2b148e376c39eae7f674610118e588-C001-69", "text": "3 On the contrary, when the agents fail to establish communication, by the end of training their RSA score is just \u03c1 S/R = 0.39, but they stay closer to the input (\u03c1 S/I = 0.58 and \u03c1 R/I = 0.42)." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-70", "text": "4 The drift of the agents from input similarity could be attributed to the characteristics of the game they are playing." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-71", "text": "Since they are only asked to distinguish between pictures of different concepts, they have no incentive to keep different instances of a concept distinct (if the agents are never asked to distinguish one dog from another, they might eventually become unable to tell dogs apart)." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-72", "text": "That is, we might be assisting to the inception of a form of categorical perception (Goldstone and Hendrickson, 2010) , whereby the agents lose sensitivity to within-category differences." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-73", "text": "If this is the case, we should observe that sameconcept image similarity is higher in Sender (or Receiver) space with respect to input space." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-74", "text": "However, this turns out not to be the case." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-75", "text": "To the contrary, average pairwise same-concept similarity is consistently lower in Sender space than in the input" }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-76", "text": "(mean z-normalized same-concept similarity in input space is at 1.94 vs. 0.57 in Sender space, averaged across successful seeds)." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-77", "text": "A similar effect is observed by looking at higher-class (mammal, furniture, etc.) 
similarities: images from the same classes become less similar in Sender space (0.61 z-normalized within-class input similarity vs. 0.30 in Sender space)." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-78", "text": "This suggests that the agents are becoming less proficient at capturing the similarity among instances of the same concept or of the same class." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-79", "text": "The same conclusion is qualitatively supported by the pairs of images that underwent the largest shift between input and Sender space." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-80", "text": "For example, for two test images of avocados which have an input similarity of 0.82 (and are reasonably similar to the human eye), the Sender similarity is at the low value of \u22120.27 (Receiver similarity is \u22120.59)." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-81", "text": "Contrarily, for an image of a cabin in a field and an image of a telephone that have an intuitively correct very low input similarity of 0.02, the Sender similarity for these images is 0.94 (Receiver similarity is 0.95)." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-82", "text": "Lazaridou et al. (2017) designed their second game to encourage more general, concept-like referents." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-83", "text": "Unfortunately, we replicate the anomalies above in the different-image setup, although to a less marked extent." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-84", "text": "When successful communication is established at the end of training, the agents have \u03c1 S/R = 0.90. But again, the agents' representation do not align with the input space: their scores with the input are at lower values of \u03c1 S/I = 0.40 and \u03c1 R/I = 0.37." 
}, { "sent_id": "2b148e376c39eae7f674610118e588-C001-85", "text": "5 In case of communication failure, by the end of training their RSA score is at the lower value of \u03c1 S/R = 0.74, and their values with respect to the input are \u03c1 S/I = 0.36 and \u03c1 R/I = 0.34." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-86", "text": "6 Again, same-concept images drift apart in agent space, although now to a lesser extent (1.94 z-normalized mean similarity in input space vs. 1.07 in Sender space)." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-87", "text": "More encouragingly, we don't find the same pattern for withinclass mean similarities (0.61 input space vs. 0.75 Sender space)." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-88", "text": "We must conjecture that the agents are comparing low-level properties of the image pairs, independently of the game they play." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-89", "text": "As an extreme way to test this, we look at how agents trained to play the two games behave when tested with input pairs that are just random noise vectors drawn from a standard Normal distribution." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-90", "text": "7 If the agents are indeed indifferent to the objects represented by images, the radical shift in the nature of the input to the game should not affect them much." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-92", "text": "We confirm that the same-image game is the easiest, and we observe that agents trained in one game perform reasonably well on the other." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-93", "text": "More importantly, no matter which game they are trained on, the agents perform very well on noise input!" 
}, { "sent_id": "2b148e376c39eae7f674610118e588-C001-94", "text": "This confirms our hypothesis that the Sender and Receiver are able to communicate about input data that contain no conceptual content at all, which in turn suggests that they haven't extracted any concept-level information (e.g., features that would allow them to recognize instances of the dog or chair category) during training." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-95", "text": "To get a sense of the sort of noise pairs agents succeed to communicate about, Fig-Figure 2 : Noise vectors agents trained on the sameimage game successfully communicate about." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-96", "text": "ure 2 provides an example." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-97", "text": "Finally, we draw 1, 000 noise pairs (z 1 , z 2 ), and present each to the Sender with either z 1 or z 2 as target." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-98", "text": "We then compare, pair by pair, whether the highest probability symbol changes when the target is swapped." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-99", "text": "We average across 10 random runs using the best cross-validated seed." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-100", "text": "In both versions of the game, for more than 99% of the pairs, the symbol with highest probability changes when the target is swapped." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-101", "text": "This suggests that the agents perform a relative comparison of the two inputs, rather than an absolute one, in line with the general conclusion that they are not using the vocabulary to denote stable conceptual properties of the objects depicted in the images." 
}, { "sent_id": "2b148e376c39eae7f674610118e588-C001-102", "text": "----------------------------------" }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-103", "text": "**DISCUSSION**" }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-104", "text": "Existing literature in game theory already showed that convergence towards successful communication is ensured under specific conditions (see Skyrms (2010) and references therein)." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-105", "text": "However, the important contribution of Lazaridou et al. (2017) is to play a signaling game with real-life images instead of artificial symbols." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-106", "text": "This raises new empirical questions that are not answered by the general mathematical results, such as: When the agents do succeed at communicating, what are the input features they rely upon?" }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-107", "text": "Do the internal representations they develop relate to the conceptual properties of the input?" }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-108", "text": "Our study suggests that the agents' representations are not capturing general conceptual properties of different objects, but they are rather specifically tuned to successfully distinguish images based on inscrutable lowlevel relational properties." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-109", "text": "Interestingly, our conclusions can be aligned with findings in psycholinguistic experimental literature on dialogue." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-110", "text": "In order to achieve communication, the agents develop a form of ''conceptual pact\" (Brennan and Clark, 1996) : Their internal representations align while at the same time drifting away from human-level properties of the input." 
}, { "sent_id": "2b148e376c39eae7f674610118e588-C001-111", "text": "The agents agree on a shared use of the vocabulary, that does not correspond to concepts in the input data." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-112", "text": "In future work, we would like to encourage the development of more natural word meanings by enforcing the agent representations to stay more faithful to the perceptual input they receive." }, { "sent_id": "2b148e376c39eae7f674610118e588-C001-113", "text": "Moving ahead, it is fundamental to design setups where agents would have stronger reasons to develop human-like communication strategies." } ], "y": { "@MOT@": { "gold_contexts": [ [ "2b148e376c39eae7f674610118e588-C001-4" ], [ "2b148e376c39eae7f674610118e588-C001-10", "2b148e376c39eae7f674610118e588-C001-11" ], [ "2b148e376c39eae7f674610118e588-C001-13" ], [ "2b148e376c39eae7f674610118e588-C001-22" ], [ "2b148e376c39eae7f674610118e588-C001-105", "2b148e376c39eae7f674610118e588-C001-106" ] ], "cite_sentences": [ "2b148e376c39eae7f674610118e588-C001-4", "2b148e376c39eae7f674610118e588-C001-10", "2b148e376c39eae7f674610118e588-C001-13", "2b148e376c39eae7f674610118e588-C001-22", "2b148e376c39eae7f674610118e588-C001-105" ] }, "@BACK@": { "gold_contexts": [ [ "2b148e376c39eae7f674610118e588-C001-10", "2b148e376c39eae7f674610118e588-C001-11" ], [ "2b148e376c39eae7f674610118e588-C001-15" ], [ "2b148e376c39eae7f674610118e588-C001-19" ], [ "2b148e376c39eae7f674610118e588-C001-21" ], [ "2b148e376c39eae7f674610118e588-C001-35" ], [ "2b148e376c39eae7f674610118e588-C001-45" ], [ "2b148e376c39eae7f674610118e588-C001-82", "2b148e376c39eae7f674610118e588-C001-83" ] ], "cite_sentences": [ "2b148e376c39eae7f674610118e588-C001-10", "2b148e376c39eae7f674610118e588-C001-15", "2b148e376c39eae7f674610118e588-C001-19", "2b148e376c39eae7f674610118e588-C001-21", "2b148e376c39eae7f674610118e588-C001-35", "2b148e376c39eae7f674610118e588-C001-45", "2b148e376c39eae7f674610118e588-C001-82" ] }, "@USE@": 
{ "gold_contexts": [ [ "2b148e376c39eae7f674610118e588-C001-22" ], [ "2b148e376c39eae7f674610118e588-C001-27" ], [ "2b148e376c39eae7f674610118e588-C001-36" ], [ "2b148e376c39eae7f674610118e588-C001-41" ], [ "2b148e376c39eae7f674610118e588-C001-43" ] ], "cite_sentences": [ "2b148e376c39eae7f674610118e588-C001-22", "2b148e376c39eae7f674610118e588-C001-27", "2b148e376c39eae7f674610118e588-C001-36", "2b148e376c39eae7f674610118e588-C001-41", "2b148e376c39eae7f674610118e588-C001-43" ] }, "@SIM@": { "gold_contexts": [ [ "2b148e376c39eae7f674610118e588-C001-36" ], [ "2b148e376c39eae7f674610118e588-C001-41" ], [ "2b148e376c39eae7f674610118e588-C001-45" ], [ "2b148e376c39eae7f674610118e588-C001-82", "2b148e376c39eae7f674610118e588-C001-83" ] ], "cite_sentences": [ "2b148e376c39eae7f674610118e588-C001-36", "2b148e376c39eae7f674610118e588-C001-41", "2b148e376c39eae7f674610118e588-C001-45", "2b148e376c39eae7f674610118e588-C001-82" ] } } }, "ABC_737a452057be3e254b35bd8df492be_0": { "x": [ { "sent_id": "737a452057be3e254b35bd8df492be-C001-83", "text": "We reproduced the results for each model on three of the data sets." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-2", "text": "The original goal of any social media platform is to enable users to engage in healthy and meaningful conversations. But more often than not, it becomes an avenue for wanton attacks." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-3", "text": "We want to alleviate this issue, and hence we provide a detailed analysis of how abusive behavior can be monitored on Twitter." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-4", "text": "The complexity of natural language constructs makes this task challenging."
}, { "sent_id": "737a452057be3e254b35bd8df492be-C001-5", "text": "We show how applying contextual attention to Long Short Term Memory networks helps us achieve near state-of-the-art results on multiple benchmark abuse detection data sets from Twitter." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-6", "text": "----------------------------------" }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-7", "text": "**INTRODUCTION & RELATED WORK**" }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-8", "text": "Any social interaction involves an exchange of viewpoints and thoughts. But these views and thoughts can be caustic." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-9", "text": "Often we see that users resort to verbal abuse to win an argument or overshadow someone's opinion." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-10", "text": "On Twitter, people from every sphere have experienced online abuse." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-11", "text": "Be it a famous celebrity with millions of followers or someone from a marginalized community such as LGBTQ people or women." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-12", "text": "We want to channel Natural Language Processing (NLP) toward social good and aid the process of flagging abusive tweets and users." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-13", "text": "Detecting abuse on Twitter can be challenging, particularly because the text is often noisy." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-14", "text": "Abuse can also have different facets." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-15", "text": "[10] released one of the initial data sets from Twitter with the goal of identifying what constitutes racism and sexism."
}, { "sent_id": "737a452057be3e254b35bd8df492be-C001-16", "text": "[9] in their work pointed out that hate speech is different from offensive language and released a data set of 25k tweets with the goal of distinguishing hate speech from offensive language." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-17", "text": "Table 1 shows tweets from the [10] data set demonstrating online abuse, e.g. 'Stop saying dumb blondes with pretty faces as you need a pretty face to pull them off !!! #mkr' and 'In Islam women must be locked in their houses and Muslims claim this is treating them well'. They find that racist and homophobic tweets are more likely to be classified as hate speech, but sexist tweets are generally classified as offensive." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-18", "text": "[4] introduced a large, hand-coded corpus of online harassment data for studying the nature of harassing comments and the culture of trolling." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-19", "text": "Keeping these motivations in mind, we make the following salient contributions:" }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-20", "text": "\u2022 We build a deep context-aware attention-based model for abusive behavior detection on Twitter." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-21", "text": "To the best of our knowledge, ours is the first work that exploits context-aware attention for this task." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-22", "text": "\u2022 Our model is robust and achieves consistent performance gains on all three abusive data sets. \u2022 We show how context-aware attention helps focus on certain abusive keywords when they are used in specific contexts, improving the performance of abusive behavior detection."
}, { "sent_id": "737a452057be3e254b35bd8df492be-C001-23", "text": "----------------------------------" }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-24", "text": "**RELATED WORK**" }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-25", "text": "Existing approaches to abusive text detection can be broadly divided into two categories: 1) feature-intensive machine learning algorithms such as Logistic Regression (LR) and the Multilayer Perceptron (MLP);" }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-26", "text": "2) deep learning models which learn feature representations on their own. [10] released the popular data set of 16k tweets annotated as belonging to the sexism, racism or none class, and provided a feature-engineered model for detection of abuse in their corpus." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-27", "text": "[9] use a similar handcrafted, feature-engineered model to identify offensive language and distinguish it from hate speech." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-28", "text": "[2] experiment with multiple deep learning architectures for the task of hate speech detection on Twitter using the same data set from [10]." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-29", "text": "Their best-reported F1-score is achieved using Long Short Term Memory Networks (LSTM) + Gradient Boosting." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-30", "text": "On the data set released by [10], [5] experiment with a two-step approach of detecting abusive language first and then classifying it into specific types, i.e. racist, sexist or none." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-31", "text": "They achieve the best results using a Hybrid Convolutional Neural Network (CNN), with the intuition that character-level input would counter purposely or mistakenly misspelled words and made-up vocabulary."
}, { "sent_id": "737a452057be3e254b35bd8df492be-C001-32", "text": "[6] in their work ran experiments on the Gazetta dataset and the DETOX system ([12]) and show that a Recurrent Neural Network (RNN) coupled with deep, classification-specific attention outperforms the previous state of the art in abusive comment moderation." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-33", "text": "In their more recent work, [7] explored how user embeddings, user-type embeddings, and user-type biases can improve their previous RNN-based model on the Gazetta dataset." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-34", "text": "Attentive neural networks have been shown to perform well on a variety of NLP tasks ([13], [11])." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-35", "text": "[13] use hierarchical contextual attention for text classification (i.e. attention at both the word and sentence levels) on six large-scale text classification tasks and demonstrate that the proposed architecture outperforms previous methods by a substantial margin." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-36", "text": "We primarily focus on word-level attention because most of the tweets are single-sentence tweets." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-37", "text": "----------------------------------" }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-38", "text": "**MODEL**" }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-39", "text": "The best choice for modeling tweets was Long Short Term Memory Networks (LSTMs) because of their ability to capture long-term dependencies by introducing a gating mechanism that ensures proper gradient propagation through the network." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-40", "text": "We use bidirectional LSTMs because of their inherent capability of capturing information from both the past and the future states."
}, { "sent_id": "737a452057be3e254b35bd8df492be-C001-41", "text": "A bidirectional LSTM (BiLSTM) consists of a forward LSTM that reads the sentence from x_1 to x_T and a backward LSTM that reads the sentence from x_T to x_1, where T is the number of words in the sentence under consideration and x_i is the i-th word in the sentence." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-42", "text": "We obtain the final annotation for a given word x_i by concatenating the annotations from both directions (Eq. 1)." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-43", "text": "[1] show that LSTMs can benefit from depth in space." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-44", "text": "Stacking multiple recurrent hidden layers on top of each other, just as feed-forward layers are stacked in conventional deep networks, gives performance gains, and hence we choose stacked LSTMs for our experiments." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-45", "text": "----------------------------------" }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-46", "text": "**WORD ATTENTION**" }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-47", "text": "The attention mechanism assigns a weight to each word annotation that is obtained from the BiLSTM layer." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-48", "text": "We compute the fixed representation v of the whole message as a weighted sum of all the word annotations, which is then fed to a final fully-connected Softmax layer to obtain the class probabilities." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-49", "text": "We first feed the LSTM output h_i of each word x_i through a Multilayer Perceptron to get u_i as its hidden representation." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-50", "text": "u_c is our word-level context vector that is randomly initialized and learned as we train our network."
}, { "sent_id": "737a452057be3e254b35bd8df492be-C001-51", "text": "Once u_i is obtained, we calculate the importance of the word as the similarity of u_i with u_c and get a normalized importance weight \u03b1_i through a softmax function. Table 2 lists the data sets and their total tweet counts: [10] 15,844; [9] 25,112; [4] 20,362." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-52", "text": "The context vector u_c can be seen as a tool that filters which words are more important among all the words, similar to the gating used in the LSTM." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-53", "text": "Figure 2 shows the high-level architecture of this model." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-54", "text": "W_h and b_h are the attention layer's weights and biases." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-55", "text": "More formally," }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-56", "text": "----------------------------------" }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-57", "text": "**EXPERIMENTS**" }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-84", "text": "----------------------------------" }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-58", "text": "In this section we first describe the data sets and then present our results on these three data sets. We also show some examples where our model failed." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-59", "text": "Finally, we show how attention helps us understand the model better." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-60", "text": "----------------------------------" }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-61", "text": "**DATA SETS**" }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-62", "text": "We use the three benchmark data sets for abusive content detection on Twitter."
}, { "sent_id": "737a452057be3e254b35bd8df492be-C001-63", "text": "At the time of the experiment, the [10] data set had a total of 15,844 tweets out of which 1,924 were labelled as belonging to racism, 3,058 as sexism and 10,862 as none." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-64", "text": "The [9] data set had a total of 25,112 tweets out of which 1,498 were labelled as hate speech, 19,326 as offensive language and 4,288 as neither." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-65", "text": "For the [4] data set, there were 20,362 tweets out of which 5,235 were positive harassment examples and 15,127 were negative." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-66", "text": "We refer to the [10] data set as D1, the [9] data set as D2, and [4] as D3. For tweet tokenization, we use Ekphrasis, a text processing tool built specifically for social platforms such as Twitter." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-67", "text": "[3] use a big collection of Twitter messages (330M) to generate word embeddings, with a vocabulary size of 660K words, using GloVe ([8])." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-68", "text": "We use these pre-trained word embeddings for initializing the first layer (embedding layer) of our neural networks." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-69", "text": "----------------------------------" }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-70", "text": "**RESULTS**" }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-71", "text": "The network is trained at a learning rate of 0.001 for 10 epochs, with a dropout of 0.2 to prevent over-fitting." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-72", "text": "The results are averaged over 10-fold cross-validation for D1 and D3 and 5-fold cross-validation for D2, because [9] reported results using 5-fold CV."
}, { "sent_id": "737a452057be3e254b35bd8df492be-C001-73", "text": "Because of class imbalance in all our data sets, we report weighted F1 scores." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-74", "text": "Table 3 shows our results in detail." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-75", "text": "We compare our model with the best models reported in each paper." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-76", "text": "Because [4] is a data set paper, we cannot fill the corresponding row." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-77", "text": "* denotes the numbers from baseline papers." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-78", "text": "All the results were reproducible except for the one marked red." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-79", "text": "For the (Waseem and Hovy, 2016) data set, (Badjatiya et al., 2017) claim that using Gradient Boosting with LSTM embeddings obtained from random word embeddings boosted their performance by 12 F1 points, from 81.0 to 93.0." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-80", "text": "When we tried to reproduce the result, we did not find any significant improvement over 81.0." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-81", "text": "Results show that our model is robust when it comes to the performance on all three data sets." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-82", "text": "Table 3: Data sets and the results of different models." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-85", "text": "**MODELS**" }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-86", "text": "We also share some examples from the three data sets in Figure 2, which our BiLSTM attention model could not classify correctly."
}, { "sent_id": "737a452057be3e254b35bd8df492be-C001-87", "text": "On closer investigation, we find that most cases where our model fails are instances where the annotation is either noisy or the difference between classes is very blurred and subtle." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-88", "text": "The first tweet is from the [10] data set, the second tweet is from the [9] data set and the third from the [4] data set." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-89", "text": "----------------------------------" }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-90", "text": "**WHY CONTEXTUAL ATTENTION?**" }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-91", "text": "The attention mechanism enables our neural network to focus on the relevant parts of the input more than the irrelevant parts while performing a prediction task." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-92", "text": "But relevance often depends on the context, and so the importance of words is highly context-dependent." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-93", "text": "For example, the word islam may appear in the realm of Racism as well as in any normal conversation." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-94", "text": "The top tweet in Figure 3 belongs to the None class while the bottom tweet belongs to the Racism class." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-95", "text": "----------------------------------" }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-96", "text": "**ATTENTION HEAT MAP VISUALIZATION**" }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-97", "text": "The color intensity corresponds to the weight given to each word by the contextual attention." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-98", "text": "The first tweet is a sexist tweet from [10], whereas the second tweet is an example of a racist tweet from the same data set."
}, { "sent_id": "737a452057be3e254b35bd8df492be-C001-99", "text": "The third tweet is from [9] data set labelled as offensive language." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-100", "text": "----------------------------------" }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-101", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-102", "text": "We successfully built a deep context-aware attention-based model and applied it to the task of abusive tweet detection." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-103", "text": "We ran experiments on three relevant data sets and empirically showed how our model is robust when it comes to detecting abuse on Twitter." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-104", "text": "We also show how context-aware attention helps us to interpret the model's performance by visualizing the attention weights and conducting thorough error analysis." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-105", "text": "As for future work, we want to experiment with a model that learns user embeddings from their historical tweets." }, { "sent_id": "737a452057be3e254b35bd8df492be-C001-106", "text": "We also want to model abusive text classification in Twitter by taking tweets in context because often standalone tweets don't give a clear picture of a tweet's intent." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "737a452057be3e254b35bd8df492be-C001-15" ], [ "737a452057be3e254b35bd8df492be-C001-26" ], [ "737a452057be3e254b35bd8df492be-C001-28" ], [ "737a452057be3e254b35bd8df492be-C001-30" ] ], "cite_sentences": [ "737a452057be3e254b35bd8df492be-C001-15", "737a452057be3e254b35bd8df492be-C001-26", "737a452057be3e254b35bd8df492be-C001-28", "737a452057be3e254b35bd8df492be-C001-30" ] }, "@USE@": { "gold_contexts": [ [ "737a452057be3e254b35bd8df492be-C001-17" ], [ "737a452057be3e254b35bd8df492be-C001-51" ], [ "737a452057be3e254b35bd8df492be-C001-63" ], [ "737a452057be3e254b35bd8df492be-C001-66" ], [ "737a452057be3e254b35bd8df492be-C001-72" ], [ "737a452057be3e254b35bd8df492be-C001-88" ], [ "737a452057be3e254b35bd8df492be-C001-98" ] ], "cite_sentences": [ "737a452057be3e254b35bd8df492be-C001-17", "737a452057be3e254b35bd8df492be-C001-51", "737a452057be3e254b35bd8df492be-C001-63", "737a452057be3e254b35bd8df492be-C001-66", "737a452057be3e254b35bd8df492be-C001-72", "737a452057be3e254b35bd8df492be-C001-88", "737a452057be3e254b35bd8df492be-C001-98" ] } } }, "ABC_92f4cc0d6516a19a860d5b9af80f59_0": { "x": [ { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-2", "text": "We introduce RelNet: a new model for relational reasoning." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-3", "text": "RelNet is a memory augmented neural network which models entities as abstract memory slots and is equipped with an additional relational memory which models relations between all memory pairs." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-4", "text": "The model thus builds an abstract knowledge graph on the entities and relations present in a document which can then be used to answer questions about the document." 
}, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-5", "text": "It is trained end-to-end: the only supervision to the model is in the form of correct answers to the questions." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-6", "text": "We test the model on the 20 bAbI question-answering tasks with 10k examples per task and find that it solves all the tasks with a mean error of 0.3%, achieving 0% error on 11 of the 20 tasks." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-7", "text": "----------------------------------" }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-9", "text": "Reasoning about entities and their relations is an important problem for achieving general artificial intelligence." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-10", "text": "Often such problems are formulated as reasoning over a graph-structured representation of knowledge." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-11", "text": "Knowledge graphs, for example, consist of entities and relations between them [1, 2, 3, 4] ." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-12", "text": "Representation learning [5, 6, 7, 8] and reasoning [9, 10, 11, 12] with such structured representations is an important and active area of research." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-13", "text": "Most previous work on knowledge representation and reasoning relies on a pipeline of natural language processing systems, often consisting of named entity extraction [13] , entity resolution and coreference [14] , relationship extraction [5] , and knowledge graph inference [15] ." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-14", "text": "While this cascaded approach of using NLP systems can be effective at reasoning with knowledge bases at scale, it also leads to a problem of compounding errors from each component sub-system."
}, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-15", "text": "The importance of each of these sub-components for a particular downstream application is also not very clear." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-16", "text": "For the task of question-answering, we instead make an attempt at an end-to-end approach which directly models the entities and relations in the text as memory slots." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-17", "text": "While incorporating existing knowledge (from curated knowledge bases) for the purpose of question-answering [12, 9, 16] is an important area of research, we consider the simpler setting where all the information is contained within the text itself -which is the approach taken by many recent memory based neural network models [17, 18, 19, 20] ." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-18", "text": "Recently, Henaff et al." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-19", "text": "[18] proposed a dynamic memory-based neural network for implicitly modeling the state of entities present in the text for question answering." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-20", "text": "However, this model lacks any module for relational reasoning." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-21", "text": "In response, we propose RelNet, which extends memory-augmented neural networks with a relational memory to reason about relationships between multiple entities present within the text." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-22", "text": "Our end-to-end method reads text and writes to both memory slots and edges between them." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-23", "text": "Intuitively, the memory slots correspond to entities and the edges correspond to relationships between entities, each represented as a vector."
}, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-24", "text": "The only supervision signal for our method comes from answering questions on the text." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-25", "text": "We demonstrate the utility of the model through experiments on the bAbI tasks [19] and find that the model achieves smaller mean error across the tasks than the best previously published result [18] in the 10k examples regime and achieves 0% error on 11 of the 20 tasks." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-26", "text": "----------------------------------" }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-27", "text": "**RELNET MODEL**" }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-28", "text": "We describe the RelNet model in this section." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-29", "text": "The model is sequential in nature, consisting of the following steps: read the text, process it into a dynamic relational memory, and then generate the answer via attention conditioned on the question." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-30", "text": "We model the dynamic memory in a fashion similar to Recurrent Entity Networks [18] and then equip it with an additional relational memory." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-31", "text": "There are three main components to the model: 1) input encoder 2) dynamic memory, and 3) output module." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-32", "text": "We will describe these three modules in detail." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-33", "text": "The input encoder and output module implementations are similar to the Entity Network [18] and the main novelty lies in the dynamic memory."
}, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-34", "text": "We describe the operations executed by the network for a single example consisting of a document with T sentences, where each sentence consists of a sequence of words represented with K-dimensional word embeddings {e_1, . . . , e_N}, a question on the document represented as another sequence of words, and an answer to the question." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-35", "text": "----------------------------------" }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-36", "text": "**INPUT ENCODER:**" }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-37", "text": "The input at each time point is a sentence from the document which can be encoded into a fixed vector representation using some encoding mechanism, such as a recurrent neural network." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-38", "text": "We use a simple encoder with a learned multiplicative mask [18, 17] :" }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-39", "text": "Dynamic Relational Memory: This is the main component of an end-to-end reasoning pipeline, where we need to process the information contained in the text such that it can be used to reason about the entities, their properties and the relationships among them." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-40", "text": "The memory consists of two parts: entity memory and relational memory." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-41", "text": "The entity memory is organized as a key-value memory network [12] , where the keys are global embeddings updated during training time but not during inference, and the value memory slot is a dynamic memory for each example (document, question) whose values are updated while reading the document." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-42", "text": "The memory thus consists of D memory slots {m_1, . . . , m_D} (each a vector of dimension K) and associated keys {k_1, . . . , k_D} (again vectors of dimension K)." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-43", "text": "At time t, after reading sentence t into a vector representation s_t, a gating mechanism decides the set of memories to be updated (< \u00b7, \u00b7 > denotes the inner product):" }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-44", "text": "Intuitively the memory slots can be thought of as entities." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-45", "text": "Indeed, Henaff et al." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-46", "text": "[18] found that if they tie the key vectors to entities in the text, then the memories contain information about the state of those entities." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-47", "text": "The update in (1) essentially does a soft selection of memory slots based on cosine distance in the embedding space." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-48", "text": "Note that there can be multiple entities in a sentence, hence a sigmoid operation is more suitable, and it is also more scalable [18] ." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-49", "text": "After selecting the set of memories, there is an update step which stores information in the corresponding memory slots:" }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-50", "text": "where PReLU is a parametric rectified linear unit [21] , and U, V and W are k \u00d7 k parameter matrices." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-51", "text": "Now we augment the model with additional relational memory cells." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-52", "text": "Intuitively, the entity memory allows modeling of entities and information about the entities in isolation."
}, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-53", "text": "This can be insufficient in scenarios where a particular entity participates in many relations with other entities across the document." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-54", "text": "Thus, in order to succeed at relational reasoning the model needs to be able to compare each pair of the entity memories." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-79", "text": "However, the relation network is not a memory-based model and there is no mechanism to read and write relevant information for each pair." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-55", "text": "The relational memories will allow modeling of these relations and provide an inherent inductive bias towards a more structured representation of the participating entities in the text, in the form of a latent knowledge graph." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-56", "text": "The relational memories are D^2 memory slots {r_ij} indexed by the entity memory slots i, j \u2208 {1, . . . , D}." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-57", "text": "The relational memories are updated as follows." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-58", "text": "First, a gating mechanism decides the set of active relational memories:" }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-59", "text": "where g_i^m, g_j^m select the relational memory slot based on the active entity slots, and the last sigmoid gate decides whether the corresponding relational memory needs to be updated based on the current input sentence." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-60", "text": "After selecting the set of active relational memories, we update the contents of the relational memory: r\u0303_ij \u2190 PReLU(A r_ij + B s_t), r_ij \u2190 r_ij + g_ij^r \u2299 r\u0303_ij (4), where again A and B are k \u00d7 k parameter matrices."
}, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-61", "text": "Note that for updates (3)- (4) we use a different encoding mask to obtain the sentence representation for relations." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-62", "text": "Similar to [18] , we normalize the memories after each update step (that is after reading each sentence)." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-63", "text": "This acts as a forget step and does not cause the memory to explode." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-64", "text": "The full memory consists of the entity memory slots {h j } and the relational memory slots {r ij }." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-65", "text": "Output Module This is a standard attention module used in memory networks [17, 18] ." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-66", "text": "The question is encoded as a K dimensional vector q using the same encoding mechanism as the sentences (though with a separate learned mask)." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-67", "text": "We first concatenate the relational memory vectors with the corresponding entity vectors, and project the resulting memory vector to k dimension." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-68", "text": "Then attention on these projected memories, conditioned on the vector q, yields the final answer:" }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-69", "text": "where y is the predicted answer, and C, H, Z are parameter matrices." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-70", "text": "----------------------------------" }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-71", "text": "**RELATED WORK**" }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-72", "text": "There is a long line of work in textual question-answering systems [22, 23] ." 
}, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-73", "text": "Recent successful approaches use memory based neural networks for question answering [24, 19, 25, 20, 18] ." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-74", "text": "Our model is also a memory network based model and is also related to the neural turing machine [26] ." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-75", "text": "As described previously, the model is closely related to the Recurrent Entity Networks model [18] which describes an end-to-end approach to model entities in text but does not directly model relations." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-76", "text": "Other approaches to question answering use external knowledge, for instance external knowledge bases [27, 12, 28, 29, 10] or external text like Wikipedia [30, 31] ." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-77", "text": "Very recently, and in parallel to this work, a method for relational reasoning called relation networks [32] was proposed." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-78", "text": "They demonstrated that simple neural network modules are not as effective at relational reasoning and their proposed module is similar to our model." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-80", "text": "Moreover, while their approach scales as the square of the number of sentences, our approach scales as the square of the number of memory slots used per QA pair." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-81", "text": "The output module in our model can be seen as a type of relation network." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-82", "text": "Representation learning and reasoning over graph structured data is also relevant to this work." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-83", "text": "Graph based neural network models [33, 34, 35] have been proposed which take graph data as an input." 
}, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-84", "text": "The relational memory however does not rely on a specified graph structure, and such models can potentially be used for multi-hop reasoning over the relational memory." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-85", "text": "----------------------------------" }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-86", "text": "**EXPERIMENTS**" }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-87", "text": "We evaluate the model's performance on the bAbI tasks [19] , a collection of 20 question answering tasks which have become a benchmark for evaluating memory-augmented neural networks." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-88", "text": "We compare the performance with the Recurrent Entity Networks model (EntNet) [18] ." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-89", "text": "Performance is measured in terms of mean percentage error on the tasks." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-90", "text": "Training Details: We used Adam, did a grid search for the learning rate in {0.01, 0.005, 0.001}, chose a fixed learning rate of 0.005 based on performance on the validation set, and clipped the gradient norm at 2." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-91", "text": "We keep all other details similar to [18] for a fair comparison." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-92", "text": "Embedding dimensions were fixed at 100; models were trained for a maximum of 250 epochs with a minibatch size of 32 for all tasks except task 3, for which the batch size was 16." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-93", "text": "The document sizes were limited to the most recent 70 sentences for all tasks, except for task 3, for which the limit was 130." 
}, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-94", "text": "The RelNet models were run 5 times with different random seeds on each task, and the model with the best validation performance was chosen as the final model." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-95", "text": "The baseline EntNet model was run 10 times for each task [18] ." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-96", "text": "The results are shown in Table 1 ." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-97", "text": "The RelNet model achieves a mean error of 0.285% across tasks, which is better than the results of the EntNet model [18] ." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-98", "text": "The RelNet model is able to achieve 0% test error on 11 of the tasks, whereas the EntNet model achieves 0% error on 7 of the tasks." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-99", "text": "----------------------------------" }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-100", "text": "**CONCLUSION**" }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-101", "text": "We demonstrated an end-to-end trained neural network augmented with a structured memory representation which can reason about entities and relations for question answering." }, { "sent_id": "92f4cc0d6516a19a860d5b9af80f59-C001-102", "text": "Future work will investigate the performance of these models on more real world datasets, interpreting what the models learn, and scaling these models to answer questions about entities and relations from reading massive text corpora." 
} ], "y": { "@MOT@": { "gold_contexts": [ [ "92f4cc0d6516a19a860d5b9af80f59-C001-17" ], [ "92f4cc0d6516a19a860d5b9af80f59-C001-18", "92f4cc0d6516a19a860d5b9af80f59-C001-19", "92f4cc0d6516a19a860d5b9af80f59-C001-20", "92f4cc0d6516a19a860d5b9af80f59-C001-21" ] ], "cite_sentences": [ "92f4cc0d6516a19a860d5b9af80f59-C001-17", "92f4cc0d6516a19a860d5b9af80f59-C001-19", "92f4cc0d6516a19a860d5b9af80f59-C001-20" ] }, "@BACK@": { "gold_contexts": [ [ "92f4cc0d6516a19a860d5b9af80f59-C001-17" ], [ "92f4cc0d6516a19a860d5b9af80f59-C001-45", "92f4cc0d6516a19a860d5b9af80f59-C001-46" ], [ "92f4cc0d6516a19a860d5b9af80f59-C001-48" ], [ "92f4cc0d6516a19a860d5b9af80f59-C001-73" ], [ "92f4cc0d6516a19a860d5b9af80f59-C001-75" ] ], "cite_sentences": [ "92f4cc0d6516a19a860d5b9af80f59-C001-17", "92f4cc0d6516a19a860d5b9af80f59-C001-46", "92f4cc0d6516a19a860d5b9af80f59-C001-48", "92f4cc0d6516a19a860d5b9af80f59-C001-73", "92f4cc0d6516a19a860d5b9af80f59-C001-75" ] }, "@DIF@": { "gold_contexts": [ [ "92f4cc0d6516a19a860d5b9af80f59-C001-25" ], [ "92f4cc0d6516a19a860d5b9af80f59-C001-97" ], [ "92f4cc0d6516a19a860d5b9af80f59-C001-98" ] ], "cite_sentences": [ "92f4cc0d6516a19a860d5b9af80f59-C001-25", "92f4cc0d6516a19a860d5b9af80f59-C001-97", "92f4cc0d6516a19a860d5b9af80f59-C001-98" ] }, "@SIM@": { "gold_contexts": [ [ "92f4cc0d6516a19a860d5b9af80f59-C001-30" ], [ "92f4cc0d6516a19a860d5b9af80f59-C001-33" ], [ "92f4cc0d6516a19a860d5b9af80f59-C001-62" ], [ "92f4cc0d6516a19a860d5b9af80f59-C001-65" ], [ "92f4cc0d6516a19a860d5b9af80f59-C001-91" ], [ "92f4cc0d6516a19a860d5b9af80f59-C001-95" ] ], "cite_sentences": [ "92f4cc0d6516a19a860d5b9af80f59-C001-30", "92f4cc0d6516a19a860d5b9af80f59-C001-33", "92f4cc0d6516a19a860d5b9af80f59-C001-62", "92f4cc0d6516a19a860d5b9af80f59-C001-65", "92f4cc0d6516a19a860d5b9af80f59-C001-91", "92f4cc0d6516a19a860d5b9af80f59-C001-95" ] }, "@EXT@": { "gold_contexts": [ [ "92f4cc0d6516a19a860d5b9af80f59-C001-30" ], [ "92f4cc0d6516a19a860d5b9af80f59-C001-33" ] ], "cite_sentences": [ 
"92f4cc0d6516a19a860d5b9af80f59-C001-30", "92f4cc0d6516a19a860d5b9af80f59-C001-33" ] }, "@USE@": { "gold_contexts": [ [ "92f4cc0d6516a19a860d5b9af80f59-C001-38" ] ], "cite_sentences": [ "92f4cc0d6516a19a860d5b9af80f59-C001-38" ] }, "@UNSURE@": { "gold_contexts": [ [ "92f4cc0d6516a19a860d5b9af80f59-C001-88" ] ], "cite_sentences": [ "92f4cc0d6516a19a860d5b9af80f59-C001-88" ] } } }, "ABC_b5097b3d901d073bfe06bcd88318ac_0": { "x": [ { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-63", "text": "Different observations were noted from the many trials." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-2", "text": "Word2Vec is a prominent tool for Natural Language Processing (NLP) tasks." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-3", "text": "Similar inspiration is found in distributed embeddings for state-of-the-art (sota) deep neural networks." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-4", "text": "However, wrong combination of hyper-parameters can produce poor quality vectors." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-5", "text": "The objective of this work is to show optimal combination of hyper-parameters exists and evaluate various combinations." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-6", "text": "We compare them with the original model released by Mikolov." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-7", "text": "Both intrinsic and extrinsic (downstream) evaluations, including Named Entity Recognition (NER) and Sentiment Analysis (SA) were carried out." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-8", "text": "The downstream tasks reveal that the best model is task-specific, high analogy scores don't necessarily correlate positively with F1 scores and the same applies for more data." 
}, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-9", "text": "Increasing vector dimension size after a point leads to poor quality or performance." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-10", "text": "If ethical considerations to save time, energy and the environment are made, then reasonably smaller corpora may do just as well or even better in some cases." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-11", "text": "Besides, using a small corpus, we obtain better human-assigned WordSim scores, corresponding Spearman correlation and better downstream (NER & SA) performance compared to Mikolov's model, trained on a 100 billion word corpus." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-12", "text": "----------------------------------" }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-13", "text": "**INTRODUCTION**" }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-14", "text": "There have been many implementations of the word2vec model in either of the two architectures it provides: continuous skipgram and continuous bag of words (CBoW) (Mikolov et al. (2013a) )." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-15", "text": "Similar distributed models of word or subword embeddings (or vector representations) find usage in sota, deep neural networks like Bidirectional Encoder Representations from Transformers (BERT) and its successors (Devlin et al. (2018) ; Raffel et al. (2019) )." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-16", "text": "These deep networks generate contextual representations of words after being trained for extended periods on large corpora, unsupervised, using attention mechanisms (Vaswani et al. (2017) )." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-17", "text": "It has been observed that various hyper-parameter combinations have been used in different research involving word2vec with the possibility of many of them being sub-optimal (Naili et al. (2017) ; Wang et al. 
(2018) ; Dhingra et al. (2017) )." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-18", "text": "Therefore, the authors seek to address the research question: what is the optimal combination of word2vec hyper-parameters for intrinsic and extrinsic NLP purposes?" }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-19", "text": "There are astronomically high numbers of combinations of hyper-parameters possible for neural networks, even with just a" }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-21", "text": "few layers." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-22", "text": "Hence, the scope of our extensive work over three corpora is on dimension size, training epochs, window size and vocabulary size for the training algorithms (hierarchical softmax and negative sampling) of both skipgram and CBoW. The corpora used for word embeddings are English Wiki News Abstract by Wikipedia (2019a) of about 15MB, English Wiki Simple (SW) Articles by Wikipedia (2019b) of about 711MB and the Billion Word (BW) of 3.9GB by Chelba et al. (2013) ." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-23", "text": "The corpus used for sentiment analysis is the Internet Movie Database (IMDb) dataset of movie reviews by Maas et al. (2011) while that for NER is the Groningen Meaning Bank (GMB) by Bos et al. (2017) , containing 47,959 sentence samples." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-24", "text": "The IMDb dataset used has a total of 25,000 sentences with half being positive sentiments and the other half being negative sentiments." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-25", "text": "The Groningen Meaning Bank (GMB) dataset has 17 labels, with 9 main labels and 2 context tags." 
}, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-26", "text": "It is however unbalanced due to the high percentage of tokens with the label 'O'." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-27", "text": "This skew in the GMB dataset is typical with NER datasets." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-28", "text": "The objective of this work is to determine the optimal combinations of word2vec hyperparameters for intrinsic evaluation (semantic and syntactic analogies) and extrinsic evaluation tasks (Zhang et al. (2019) ; Wang et al. (2019) ), like SA and NER." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-29", "text": "It is not our objective in this work to record sota results." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-30", "text": "Some of the main contributions of this research are the empirical establishment of optimal combinations of word2vec hyper-parameters for NLP tasks, discovering the behaviour of quality of vectors viz-a-viz increasing dimensions and the confirmation of embeddings being task-specific for the downstream." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-31", "text": "The rest of this paper is organised as follows: the literature review that briefly surveys distributed representation of words, particularly word2vec; the methodology employed in this research work; the results obtained and the conclusion." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-32", "text": "----------------------------------" }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-33", "text": "**LITERATURE REVIEW**" }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-34", "text": "Breaking away from the non-distributed (high-dimensional, sparse) representations of words, typical of traditional bag-of-words or one-hot-encoding (Turian et al. (2010) ), Mikolov et al. (2013a) created word2vec." 
}, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-35", "text": "Word2Vec consists of two shallow neural network architectures: continuous skipgram and CBoW. It uses distributed (low-dimensional, dense) representations of words that group similar words." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-36", "text": "This new model traded the complexity of deep neural network architectures, by other researchers, for more efficient training over large corpora." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-37", "text": "Its architectures have two training algorithms: negative sampling and hierarchical softmax (Mikolov et al. (2013b) )." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-38", "text": "The released model was trained on Google news dataset of 100 billion words." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-39", "text": "Implementations of the model have been undertaken by researchers in the programming languages Python and C++, though the original was done in C (\u0158eh\u016f\u0159ek and Sojka (2010))." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-40", "text": "Continuous skipgram predicts (by maximizing classification of) words before and after the center word, for a given range." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-41", "text": "Since distant words are less connected to a center word in a sentence, less weight is assigned to such distant words in training." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-42", "text": "CBoW, on the other hand, uses words from the history and future in a sequence, with the objective of correctly classifying the target word in the middle." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-43", "text": "It works by projecting all history or future words within a chosen window into the same position, averaging their vectors." 
}, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-44", "text": "Hence, the order of words in the history or future does not influence the averaged vector." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-45", "text": "This is similar to the traditional bag-of-words, which is oblivious of the order of words in its sequence." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-46", "text": "A loglinear classifier is used in both architectures (Mikolov et al. (2013a) )." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-47", "text": "In further work, they extended the model to be able to do phrase representations and subsample frequent words (Mikolov et al. (2013b) )." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-48", "text": "Being a Neural Network Language Model (NNLM), word2vec assigns probabilities to words in a sequence, like other NNLMs such as feedforward networks or recurrent neural networks (Turian et al. (2010) )." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-49", "text": "Earlier models like latent dirichlet allocation (LDA) and latent semantic analysis (LSA) exist and effectively achieve low dimensional vectors by matrix factorization (Deerwester et al. (1990) ; Levy et al. (2015) )." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-50", "text": "It's been shown that word vectors are beneficial for NLP tasks (Turian et al. (2010) ), such as sentiment analysis and named entity recognition." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-51", "text": "Besides, Mikolov et al. (2013a) showed with vector space algebra that relationships among words can be evaluated, expressing the quality of vectors produced from the model." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-52", "text": "The famous, semantic example: vector(\"King\") -vector(\"Man\") + vector(\"Woman\") \u2248 vector(\"Queen\") can be verified using cosine distance." 
}, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-53", "text": "Another type of semantic meaning is the relationship between a capital city and its corresponding country." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-54", "text": "Syntactic relationship examples include plural verbs and past tense, among others." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-55", "text": "Combination of both syntactic and semantic analyses is possible and provided (totaling over 19,000 questions) as Google analogy test set by Mikolov et al. (2013a) ." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-56", "text": "WordSimilarity-353 test set is another analysis tool for word vectors (Finkelstein et al. (2002) )." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-57", "text": "Unlike Google analogy score, which is based on vector space algebra, WordSimilarity is based on human expert-assigned semantic similarity on two sets of English word pairs." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-58", "text": "Both tools rank from 0 (totally dissimilar) to 1 (very much similar or exact, in Google analogy case)." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-59", "text": "A typical artificial neural network (ANN) has very many hyper-parameters which may be tuned." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-60", "text": "Hyper-parameters are values which may be manually adjusted and include vector dimension size, type of algorithm and learning rate (Levy et al. (2015) )." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-61", "text": "Mikolov et al. (2013a) tried various hyper-parameters with both architectures of their model, ranging from 50 to 1,000 dimensions, 30,000 to 3,000,000 vocabulary sizes, 1 to 3 epochs, among others." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-62", "text": "In our work, we extended research to 3,000 dimensions." 
}, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-64", "text": "They observed diminishing returns after a certain point, despite additional dimensions or larger, unstructured training data." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-65", "text": "However, quality increased when both dimensions and data size were increased together." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-66", "text": "Although Mikolov et al. (2013b) pointed out that the choice of optimal hyper-parameter configurations depends on the NLP problem at hand, they identified the most important factors as the architecture, dimension size, subsampling rate, and window size." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-67", "text": "In addition, it has been observed that variables like the size of datasets improve the quality of word vectors and, potentially," }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-68", "text": "----------------------------------" }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-69", "text": "**METHODOLOGY**" }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-70", "text": "The models were generated on a shared cluster running Ubuntu 16 with 32 CPUs of 32x Intel Xeon 4110 at 2.1GHz." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-71", "text": "The Gensim (\u0158eh\u016f\u0159ek and Sojka (2010)) python library implementation of word2vec was used with parallelization to utilize all 32 CPUs." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-72", "text": "The downstream experiments were run on a Tesla GPU on a shared DGX cluster running Ubuntu 18." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-73", "text": "The Pytorch deep learning framework was used." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-74", "text": "Gensim was chosen because of its relative stability, popular support and to minimize the time required in writing and testing a new implementation in python from scratch." 
}, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-75", "text": "1" }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-76", "text": "To form the vocabulary, words occurring less than 5 times in the corpora were dropped, stop words were removed using the natural language toolkit (NLTK) (Loper and Bird (2002) ) and data pre-processing was carried out." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-77", "text": "The Google analogy test set and WordSimilarity-353 (Finkelstein et al. (2002) ) were chosen for intrinsic evaluations." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-78", "text": "They measure the quality of word vectors." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-79", "text": "The analogy scores are averages of both semantic and syntactic tests." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-80", "text": "NER and SA were chosen for extrinsic evaluations." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-81", "text": "The GMB dataset for NER was trained in an LSTM network, which had an embedding layer for input." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-82", "text": "The network diagram is shown in fig. 1 ." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-83", "text": "The IMDb dataset for SA was trained in a BiLSTM network, which also used an embedding layer for input." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-84", "text": "Its network diagram is given in fig. 2 ." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-85", "text": "It includes an additional hidden linear layer." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-86", "text": "Hyper-parameter details of the two networks for the downstream tasks are given in table 2." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-87", "text": "The metrics for extrinsic evaluation include F1, precision, recall and accuracy scores." 
}, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-88", "text": "In both tasks, the default pytorch embedding was tested before being replaced by the pre-trained embeddings released by Mikolov et al. (2013a) and ours." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-89", "text": "In each case, the dataset was shuffled before training and split in the ratio 70:15:15 for training, validation (dev) and test sets." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-90", "text": "A batch size of 64 was used." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-91", "text": "For each task, experiments for each embedding were conducted four times, and an average value was calculated and reported in the next section." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-92", "text": "It should be noted, however, that gensim multithreading for 30 and 40 epochs seemed unstable and crashed, preventing any related experiments." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-93", "text": "Table 3 summarizes key results from the intrinsic evaluations for 300 dimensions." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-94", "text": "Table 4 reveals the training time (in hours) and average embedding loading time (in seconds) representative of the various models used." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-95", "text": "Tables 5 and 6 summarize key results for the extrinsic evaluations." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-96", "text": "Figures 3, 4 , 5, 6 and 7 present the line graph of the eight combinations for different dimension sizes for Simple Wiki, the trend of the Simple Wiki and Billion Word corpora over several dimension sizes, analogy score comparison for models across datasets, NER mean F1 scores on the GMB dataset and SA mean F1 scores on the IMDb dataset, respectively." 
}, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-97", "text": "The combination of skipgram using hierarchical softmax and a window size of 8 for 300 dimensions outperformed the others in analogy scores for the Wiki Abstract." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-98", "text": "However, because of the tiny file size, its results are so poor that they are not worth reporting here." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-99", "text": "Hence, we'll focus on results from the Simple Wiki and Billion Word corpora." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-100", "text": "----------------------------------" }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-101", "text": "**RESULTS AND DISCUSSION**" }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-102", "text": "The best combination changes as corpus size increases, as can be seen from table 3." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-103", "text": "In terms of analogy score, for 10 epochs, w8s0h0 performs best while w8s1h0 performs best in terms of WordSim and the corresponding Spearman correlation." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-104", "text": "Meanwhile, increasing the corpus size to BW, w4s1h0 performs best in terms of analogy score while w8s1h0 maintains its position as the best in terms of WordSim and Spearman correlation." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-105", "text": "Besides considering quality metrics, it can be observed from table 4 that the comparative ratio of values between the models is not commensurate with the results in intrinsic or extrinsic values, especially when we consider the amount of time and energy spent, since more training time results in more energy consumption. Information on the length of training time for the released Mikolov model is not readily available." 
}, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-106", "text": "However, it is interesting to note that their presumed best model, which was released, is also s1h0." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-107", "text": "Its analogy score, which we tested and report, is confirmed in their paper." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-108", "text": "It beats our best models only in analogy score (even for Simple Wiki), performing worse in the others." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-109", "text": "This is in spite of using a much bigger corpus with a vocabulary size of 3,000,000 and 100 billion words, while Simple Wiki had a vocabulary size of 367,811 and is 711MB." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-110", "text": "It is very likely our analogy scores will improve when we use a much larger corpus, as can be observed from table 3, which involves just one billion words." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-111", "text": "Although the two best combinations in analogy (w8s0h0 & w4s0h0) for SW, as shown in fig. 3 , decreased only slightly compared to others with increasing dimensions, the increased training time and much larger serialized model size render any possible minimal score advantage over higher dimensions undesirable." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-112", "text": "As can be observed in fig. 4 , from 100 dimensions, scores improve but start to drop after over 300 dimensions for SW and after over 400 dimensions for BW." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-113", "text": "More becomes worse!" }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-114", "text": "This trend is true for all combinations for all tests." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-115", "text": "Polynomial interpolation may be used to determine the optimal dimension in both corpora." 
}, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-116", "text": "Our models are available for confirmation and source codes are available on github." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-117", "text": "2 With regards to NER, most pretrained embeddings outperformed the default pytorch embedding, with our BW w4s1h0 model (which is best in BW analogy score) performing best in F1 score and closely followed by Mikolov et al. (2013a) model." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-118", "text": "On the other hand, with regards to SA, pytorch embedding outperformed the pretrained embeddings but was closely followed by our SW w8s0h0 model (which also had the best SW analogy score)." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-119", "text": "Mikolov et al. (2013a) performed second worst of all, despite originating from a very huge corpus." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-120", "text": "The combinations w8s0h0 & w4s0h0 of SW performed reasonably well in both extrinsic tasks, just as the default pytorch embedding did." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-121", "text": "----------------------------------" }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-122", "text": "**CONCLUSION**" }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-123", "text": "This work analyses, empirically, optimal combinations of hyper-parameters for embeddings, specifically for word2vec." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-124", "text": "It further shows that for downstream tasks, like NER and SA, there's no silver bullet!" }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-125", "text": "However, some combinations show strong performance across tasks." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-126", "text": "Performance of embeddings is task-specific and high analogy scores do not necessarily correlate positively with performance on downstream tasks." 
}, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-127", "text": "This point on correlation is somewhat similar to results by Chiu et al. (2016) and Wang et al. (2019) ." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-128", "text": "It was discovered that increasing dimension size depreciates performance after a point." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-129", "text": "If strong considerations of saving time, energy and the environment are made, then reasonably smaller corpora may suffice or even be better in some cases." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-130", "text": "The on-going drive by many researchers to use ever-growing data to train deep neural networks can benefit from the findings of this work." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-131", "text": "Indeed, hyper-parameter choices are very important in neural network systems (Levy et al. (2015) )." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-132", "text": "Future work that may be investigated are performance of other architectures of word or sub-word embeddings, the performance and comparison of embeddings applied to languages other than English and how embeddings perform in other downstream tasks." }, { "sent_id": "b5097b3d901d073bfe06bcd88318ac-C001-133", "text": "In addition, since the actual reason for the changes in best model as corpus size increases is not clear, this will also be suitable for further research." 
} ], "y": { "@USE@": { "gold_contexts": [ [ "b5097b3d901d073bfe06bcd88318ac-C001-6" ], [ "b5097b3d901d073bfe06bcd88318ac-C001-88" ] ], "cite_sentences": [ "b5097b3d901d073bfe06bcd88318ac-C001-6", "b5097b3d901d073bfe06bcd88318ac-C001-88" ] }, "@DIF@": { "gold_contexts": [ [ "b5097b3d901d073bfe06bcd88318ac-C001-11" ], [ "b5097b3d901d073bfe06bcd88318ac-C001-104", "b5097b3d901d073bfe06bcd88318ac-C001-105" ], [ "b5097b3d901d073bfe06bcd88318ac-C001-117" ], [ "b5097b3d901d073bfe06bcd88318ac-C001-119" ] ], "cite_sentences": [ "b5097b3d901d073bfe06bcd88318ac-C001-11", "b5097b3d901d073bfe06bcd88318ac-C001-105", "b5097b3d901d073bfe06bcd88318ac-C001-117", "b5097b3d901d073bfe06bcd88318ac-C001-119" ] }, "@MOT@": { "gold_contexts": [ [ "b5097b3d901d073bfe06bcd88318ac-C001-14" ] ], "cite_sentences": [ "b5097b3d901d073bfe06bcd88318ac-C001-14" ] }, "@BACK@": { "gold_contexts": [ [ "b5097b3d901d073bfe06bcd88318ac-C001-34" ], [ "b5097b3d901d073bfe06bcd88318ac-C001-46" ], [ "b5097b3d901d073bfe06bcd88318ac-C001-51" ], [ "b5097b3d901d073bfe06bcd88318ac-C001-55" ], [ "b5097b3d901d073bfe06bcd88318ac-C001-61" ] ], "cite_sentences": [ "b5097b3d901d073bfe06bcd88318ac-C001-34", "b5097b3d901d073bfe06bcd88318ac-C001-46", "b5097b3d901d073bfe06bcd88318ac-C001-51", "b5097b3d901d073bfe06bcd88318ac-C001-55", "b5097b3d901d073bfe06bcd88318ac-C001-61" ] } } }, "ABC_c57e98c9c07dd5d8653e172136c901_0": { "x": [ { "sent_id": "c57e98c9c07dd5d8653e172136c901-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "c57e98c9c07dd5d8653e172136c901-C001-2", "text": "Given the current growth in research and related emerging technologies in machine learning and deep learning, it is timely to introduce this tutorial to a large number of researchers and practitioners who are attending COLING 2018 and working on statistical models, deep neural networks, sequential learning and natural language understanding." 
}, { "sent_id": "c57e98c9c07dd5d8653e172136c901-C001-3", "text": "To the best of our knowledge, there is no similar tutorial presented in previous ACL/COLING/EMNLP/NAACL." }, { "sent_id": "c57e98c9c07dd5d8653e172136c901-C001-4", "text": "This three-hour tutorial will concentrate on a wide range of theories and applications and systematically present the recent advances in deep Bayesian and sequential learning which are impacting the communities of computational linguistics, human language technology and machine learning for natural language processing." }, { "sent_id": "c57e98c9c07dd5d8653e172136c901-C001-5", "text": "----------------------------------" }, { "sent_id": "c57e98c9c07dd5d8653e172136c901-C001-6", "text": "**TUTORIAL DESCRIPTION**" }, { "sent_id": "c57e98c9c07dd5d8653e172136c901-C001-7", "text": "This tutorial introduces the advances in deep Bayesian learning with abundant applications for natural language understanding ranging from speech recognition (Saon and Chien, 2012; Chan et al., 2016) to document summarization (Chang and Chien, 2009 ), text classification (Blei et al., 2003; Zhang et al., 2015) , text segmentation (Chien and Chueh, 2012) , information extraction (Narasimhan et al., 2016) , image caption generation (Vinyals et al., 2015; Xu et al., 2015) , sentence generation (Li et al., 2016b) , dialogue control (Zhao and Eskenazi, 2016; Li et al., 2016a) , sentiment classification, recommendation system, question answering (Sukhbaatar et al., 2015) and machine translation , to name a few." }, { "sent_id": "c57e98c9c07dd5d8653e172136c901-C001-8", "text": "Traditionally, \"deep learning\" is taken to be a learning process where the inference or optimization is based on the real-valued deterministic model." 
}, { "sent_id": "c57e98c9c07dd5d8653e172136c901-C001-9", "text": "The \"semantic structure\" in words, sentences, entities, actions and documents drawn from a large vocabulary may not be well expressed or correctly optimized in mathematical logic or computer programs." }, { "sent_id": "c57e98c9c07dd5d8653e172136c901-C001-10", "text": "The \"distribution function\" in discrete or continuous latent variable model for natural language may not be properly decomposed or estimated in model inference." }, { "sent_id": "c57e98c9c07dd5d8653e172136c901-C001-11", "text": "This tutorial addresses the fundamentals of statistical models and neural networks, and focus on a series of advanced Bayesian models and deep models including hierarchical Dirichlet process , Chinese restaurant process (Blei et al., 2010) , hierarchical Pitman-Yor process (Teh, 2006) , Indian buffet process (Ghahramani and Griffiths, 2005) , recurrent neural network (Mikolov et al., 2010; Van Den Oord et al., 2016) , long short-term memory (Hochreiter and Schmidhuber, 1997; , sequence-to-sequence model (Sutskever et al., 2014), variational auto-encoder (Kingma and Welling, 2014) , generative adversarial network (Goodfellow et al., 2014) , attention mechanism (Chorowski et al., 2015; Seo et al., 2016) , memory-augmented neural network (Graves et al., 2014; Graves et al., 2014) , stochastic neural network Miao et al., 2016) , predictive state neural network (Downey et al., 2017) , policy gradient (Yu et al., 2017) and reinforcement learning (Mnih et al., 2015) ." }, { "sent_id": "c57e98c9c07dd5d8653e172136c901-C001-12", "text": "We present how these models are connected and why they work for a variety of applications on symbolic and complex patterns in natural language." }, { "sent_id": "c57e98c9c07dd5d8653e172136c901-C001-13", "text": "The variational inference and sampling method are formulated to tackle the optimization for complicated models (Rezende et al., 2014) ." 
}, { "sent_id": "c57e98c9c07dd5d8653e172136c901-C001-14", "text": "The word and sentence embeddings, clustering and co-clustering are merged with linguistic and semantic constraints." }, { "sent_id": "c57e98c9c07dd5d8653e172136c901-C001-15", "text": "A series of case studies are presented to tackle different issues in deep Bayesian learning and understanding." }, { "sent_id": "c57e98c9c07dd5d8653e172136c901-C001-16", "text": "At last, we point out a number of directions and outlooks for future studies." }, { "sent_id": "c57e98c9c07dd5d8653e172136c901-C001-17", "text": "The presentation of this tutorial is arranged into five parts." }, { "sent_id": "c57e98c9c07dd5d8653e172136c901-C001-18", "text": "First of all, we share the current status of researches on natural language understanding, statistical modeling and deep neural network and explain the key issues in deep Bayesian learning for discrete-valued observation data and latent semantics." }, { "sent_id": "c57e98c9c07dd5d8653e172136c901-C001-19", "text": "A new paradigm called the symbolic neural learning is introduced to extend how data analysis is performed from language processing to semantic learning and memory networking." }, { "sent_id": "c57e98c9c07dd5d8653e172136c901-C001-20", "text": "Secondly, we address a number of Bayesian models ranging from latent variable model to VB inference (Chien and Chang, 2014; Chien and Chueh, 2011; Chien, 2015b) , MCMC sampling (Watanabe and Chien, 2015) and BNP learning (Chien, 2016; Chien, 2015a; Chien, 2018) for hierarchical, thematic and sparse topics from natural language." 
}, { "sent_id": "c57e98c9c07dd5d8653e172136c901-C001-21", "text": "In the third part, a series of deep models including deep unfolding (Chien and Lee, 2018) , Bayesian RNN (Gal and Ghahramani, 2016; Chien and Ku, 2016) , sequence-to-sequence learning (Graves et al., 2006; Gehring et al., 2017) , CNN (Kalchbrenner et al., 2014; Xingjian et al., 2015; , GAN (Tsai and Chien, 2017) and VAE are introduced." }, { "sent_id": "c57e98c9c07dd5d8653e172136c901-C001-22", "text": "The coffee break is arranged within this part." }, { "sent_id": "c57e98c9c07dd5d8653e172136c901-C001-23", "text": "Next, the fourth part focuses on a variety of advanced studies which illustrate how deep Bayesian learning is developed to infer the sophisticated recurrent models for natural language understanding." }, { "sent_id": "c57e98c9c07dd5d8653e172136c901-C001-24", "text": "In particular, the memory network Chien and Lin, 2018) , neural variational learning (Serban et al., 2017; Chung et al., 2015) , neural discrete representation (Jang et al., 2016; Maddison et al., 2016; van den Oord et al., 2017) , recurrent ladder network (Rasmus et al., 2015; Pr\u00e9mont-Schwarz et al., 2017; S\u00f8nderby et al., 2016) , stochastic neural network (Fraccaro et al., 2016; Goyal et al., 2017; Shabanian et al., 2017) , Markov recurrent neural network (Venkatraman et al., 2017; Kuo and Chien, 2018) , sequence GAN (Yu et al., 2017) and reinforcement learning (Tegho et al., 2017) are introduced in various deep models which open a window to more practical tasks, e.g. reading comprehension, sentence generation, dialogue system, question answering and machine translation." }, { "sent_id": "c57e98c9c07dd5d8653e172136c901-C001-25", "text": "In the final part, we spotlight on some future directions for deep language understanding which can handle the challenges of big data, heterogeneous condition and dynamic system." 
}, { "sent_id": "c57e98c9c07dd5d8653e172136c901-C001-26", "text": "In particular, deep learning, structural learning, temporal modeling, long history representation and stochastic learning are emphasized." }, { "sent_id": "c57e98c9c07dd5d8653e172136c901-C001-27", "text": "Slides of this tutorial are available at http://chien.cm.nctu.edu.tw/home/coling/. APSIPA 2013 , ISCSLP 2014 , Interspeech 2013 , 2016 and ICASSP 2012" }, { "sent_id": "c57e98c9c07dd5d8653e172136c901-C001-28", "text": "----------------------------------" }, { "sent_id": "c57e98c9c07dd5d8653e172136c901-C001-29", "text": "**INSTRUCTOR**" } ], "y": { "@BACK@": { "gold_contexts": [ [ "c57e98c9c07dd5d8653e172136c901-C001-11" ] ], "cite_sentences": [ "c57e98c9c07dd5d8653e172136c901-C001-11" ] } } }, "ABC_d70b9838e8a32a8638d7aed0adc80a_0": { "x": [ { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-105", "text": "Especially, we refresh the concepts of the structure reordering rules and the discontiguous phrase rules." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-106", "text": "This novel classification system may supports the SMT research community with some helpful references." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-2", "text": "Recently, numerous statistical machine translation models which can utilize various kinds of translation rules are proposed." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-3", "text": "In these models, not only the conventional syntactic rules but also the non-syntactic rules can be applied." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-4", "text": "Even the pure phrase rules are includes in some of these models." 
}, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-107", "text": "In the future works, aiming to analyze the rule contributions and the redundances issues using the presented rule classification based on some real translation systems, we plan to implement some synchronous grammar based syntax translation models such as the one presented in (Liu et al., 2007) or in (Zhang et al., 2008a) ." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-108", "text": "Taking such a system as the experimental platform, we can perform comprehensive statistics about distributions of different rule categories." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-109", "text": "What is more important, the contribution of each rule category can be evaluated seriatim." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-110", "text": "Furthermore, which kinds of rules are preferentially applied in the 1-best decoding can be studied." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-111", "text": "All these investigations could reveal very useful information for the optimization of rule extraction and the improvement of the computational models for synchronous grammar based machine translation." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-5", "text": "Although the better performances are reported over the conventional phrase model and syntax model, the mixture of diversified rules still leaves much room for study." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-6", "text": "In this paper, we present a refined rule classification system." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-7", "text": "Based on this classification system, the rules are classified according to different standards, such as lexicalization level and generalization." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-8", "text": "Especially, we refresh the concepts of the structure reordering rules and the discontiguous phrase rules." 
}, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-9", "text": "This novel classification system may supports the SMT research community with some helpful references." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-10", "text": "----------------------------------" }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-11", "text": "**INTRODUCTION**" }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-12", "text": "Phrase-based statistical machine translation models (Marcu and Wong, 2002; Koehn et al., 2003; Och and Ney, 2004; Koehn, 2004; Koehn et al., 2007) have achieved significant improvements in translation accuracy over the original IBM word-based model." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-13", "text": "However, there are still many limitations in phrase based models." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-14", "text": "The most frequently pointed limitation is its inefficacy to modeling the structure reordering and the discontiguous corresponding." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-15", "text": "To overcome these limitations, many syntaxbased SMT models have been proposed (Wu, 1997; Chiang, 2007; Ding et al., 2005; Eisner, 2003; Quirk et al., 2005; Liu et al., 2007; Zhang et al., 2007; Zhang et al., 2008a; Zhang et al., 2008b; Gildea, 2003; Galley et al., 2004; Bod, 2007) ." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-16", "text": "The basic motivation behind syntax-based model is that the syntax information has the potential to model the structure reordering and discontiguous corresponding by the intrinsic structural generalization ability." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-17", "text": "Although remarkable progresses have been reported, the strict syntactic constraint (the both sides of the rules should strictly be a subtree of the whole syntax parse) greatly hinders the utilization of the non-syntactic translation equivalents." 
}, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-18", "text": "To alleviate this constraint, a few works have attempted to make full use of the non-syntactic rules by extending their syntax-based models to more general frameworks." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-19", "text": "For example, forest-to-string transformation rules have been integrated into the tree-to-string translation framework by (Liu et al., 2006; Liu et al., 2007) ." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-20", "text": "Zhang et al. (2008a) made it possible to utilize the non-syntactic rules and even the phrases which are used in phrase based model by advancing a general tree sequence to tree sequence framework based on the tree-to-tree model presented in (Zhang et al., 2007) ." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-21", "text": "In these models, various kinds of rules can be employed." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-22", "text": "For example, as shown in Figure 1 and Figure 2 , Figure 1 shows a Chinese-to-English sentence pair with syntax parses on both sides and the word alignments (dotted lines)." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-23", "text": "Figure 2 lists some of the rules which can be extracted from the sentence pair in Figure 1 by the system used in (Zhang et al., 2008a) ." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-24", "text": "These rules includes not only conventional syntax rules but also the tree sequence rules (the multi-headed syntax rules )." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-25", "text": "Even the phrase rules are adopted by the system." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-26", "text": "Although the better performances are reported over the conventional phrase-based model and syntax-based model, the mixture of diversified rules still leaves much room for study." 
}, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-27", "text": "Given such a hybrid rule set, we must want to know what kinds of rules can make more important contributions to the overall system performance and what kinds of rules are redundant compared with the others." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-28", "text": "From engineering point of view, the developers may concern about which kinds of rules should be preferred and which kinds of rules could be discard without too much decline in translation quality." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-29", "text": "However, one of the precondition for the investigations of these issues is what are the \"rule categories\"?" }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-30", "text": "In other words, some comprehensive rule classifications are necessary to make the rule analyses feasible." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-31", "text": "The motivation of this paper is to present such a rule classification." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-32", "text": "----------------------------------" }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-33", "text": "**RELATED WORKS**" }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-34", "text": "A few researches have made some exploratory investigations towards the effects of different rules by classifying the translation rules into different subcategories (Liu et al., 2007; Zhang et al., 2008a; DeNeefe et al., 2007) ." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-35", "text": "Liu et al. (2007) differentiated the rules in their tree-to-string model which integrated with forest 1 -to-string into fully lexicalized rules, non-lexicalized rules and partial lexicalized rules according to the lexicalization levels." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-36", "text": "As an extension, Zhang et al. 
(2008a) proposed two more categories: Structure Reordering Rules (SRR) and Discontiguous Phrase Rules (DPR)." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-37", "text": "The SRR stands for the rules which have at least two non-terminal leaf nodes with inverted order in the source and target side." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-38", "text": "And DPR refers to the rules having at least one non-terminal leaf node between two terminal leaf nodes." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-39", "text": "(DeNeefe et al., 2007) made an illuminating breakdown of the different kinds of rules." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-40", "text": "Firstly, they classify all the GHKM rules (Galley et al., 2004; Galley et al., 2006) into two categories: non-lexical rules and lexical rules." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-41", "text": "The former are the rules whose source side has no source words." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-42", "text": "In other words, a non-lexical rule is a purely abstract rule. (Footnote 1: a \"forest\" means a sub-tree sequence derived from a given parse tree. Footnote 2: one reviewer asked about the acronym GHKM.)" }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-43", "text": "We guess it is an acronym for the authors of (Galley et al., 2004)." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-44", "text": "The latter are the complementary set of the former. Lexical rules are then classified further into phrasal rules and non-phrasal rules." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-45", "text": "The phrasal rules refer to the rules whose source side and the yield of the target side contain exactly one contiguous phrase each; one or more non-terminals can be placed on either side of the phrase." 
}, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-46", "text": "In other words, each phrasal rule can be simulated by the conjunction of two more phrase rules." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-47", "text": "(DeNeefe et al., 2007) classifies non-phrasal rules further into structural rules, re-ordering rules, and noncontiguous phrase rules." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-48", "text": "However, these categories are not explicitly defined in (DeNeefe et al., 2007) since out of its focus." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-49", "text": "Our proposed rule classification is inspired by these works." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-50", "text": "----------------------------------" }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-51", "text": "**RULES CLASSIFICATIONS**" }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-52", "text": "Currently, there have been several classifications in SMT research community." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-53", "text": "Generally, the rules can be classified into two main groups according to whether syntax information is involved: bilingual phrases (Phrase) and syntax rules (Syntax)." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-54", "text": "Further, the syntax rules can be divided into three categories according to the lexicalization levels (Liu et al., 2007; Zhang et al., 2008a source and target sides are non-lexicons (nonterminals)" }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-55", "text": "3) Partially lexicalized (PLex): otherwise." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-56", "text": "In Figure 2 , R 1 -R 3 are FLex rules, and R 5 -R 8 are PLex rules." 
}, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-57", "text": "Following (Zhang et al., 2008b) , a syntax rule r can be formalized into a tuple < \u03be s , \u03be t , A T , A N T > , where \u03be s and \u03be t are tree sequences of source side and target side respectively, A T is a many-to-many correspondence set which includes the alignments between the terminal leaf nodes from source and target side, and A N T is a one-to-one correspondence set which includes the synchronizing relations between the non-terminal leaf nodes from source and target side." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-58", "text": "Then, the syntax rules can also fall into two categories according to whether equipping with generalization capability (Chiang, 2007; Zhang et al., 2008a) : 1) Initial rules (Initial): all leaf nodes of this rule are terminals." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-59", "text": "2) Abstract rules (Abstract): otherwise, i.e. at least one leaf node is a non-terminal." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-60", "text": "A non-terminal leaf node in a rule is named an abstract node since it has the generalization capability." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-61", "text": "Comparing these two classifications for syntax rules, we can find that a FLex rule is a initial rule when ULex rules and PLex rules belong to abstract rules." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-62", "text": "These classifications are clear and easy for understanding." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-63", "text": "However, we argue that they need further refinement for in-depth study." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-64", "text": "Specially, more refined differentiations are needed for the abstract rules (ULex rules and PLex rules) since they play important roles for the characteristic capabilities which are deemed to be the advantages over the phrase-based model." 
}, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-65", "text": "For instance, the potentials to model the structure reordering and the discontiguous correspondence." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-66", "text": "The Structure Reordering Rules (SRR) and Discontiguous Phrase Rules (DPR) mentioned by (Zhang et al., 2008a) can be regarded as more in-depth classification of the syntax rules." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-67", "text": "In (Zhang et al., 2008a) , they are described as follows: Definition 1: The Structure Reordering Rule (SRR) refers to the structure reordering rule that has at least two non-terminal leaf nodes with inverted order in the source and target side." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-68", "text": "Definition 2: The Discontiguous Phrase Rule (DPR) refers to the rule having at least one nonterminal leaf node between two lexicalized leaf nodes." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-69", "text": "Based on these descriptions, R 7 , R 8 in Figure 2 belong to the category of SRR and R 6 , R 7 fall into the category of DPR." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-70", "text": "Although these two definitions are easy implemented in practice, we argue that the definition of SRR is not complete." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-71", "text": "The reordering rules involving the reordering between content word terminals and non-terminal (such as R 5 in Figure 2 ) also can model the useful structure reorderings." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-72", "text": "Moreover, it is not uncommon that a rule demonstrates the reorderings between two non-terminals as well as the reorderings between one non-terminal and one content word terminal." 
}, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-73", "text": "The reason for our emphasis of content word terminal is that the reorderings between the non-terminals and function word are less meaningful." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-74", "text": "One of the theoretical problems with phrase based SMT models is that they can not effectively model the discontiguous translations and numerous attempts have been made on this issue (Simard et al., 2005; Quirk and Menezes, 2006; Wellington et al., 2006; Bod, 2007; Zhang et al., 2007) ." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-75", "text": "What seems to be lacking, however, is a explicit definition to the discontiguous translation." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-76", "text": "The definition of DPR in (Zhang et al., 2008a ) is explicit but somewhat rough and not very accurate." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-77", "text": "For example, in Figure 3(a) , non-terminal node pair ([0,' '] , [0,'love'] ) is surrounded by lexical terminals." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-78", "text": "According to Definition 2, it is a DPR." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-79", "text": "However, obviously it is not a discontiguous phrase actually." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-80", "text": "This rule can be simulated by conjunctions of three phrases (' ', 'I'; ' ', 'love'; ' ','you')." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-81", "text": "In contrast, the translation rule in Figure 3(b) is an actual discontiguous phrase rule." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-82", "text": "The English correspondences of the Chinese word ' ' is dispersed in the English side in which the correspondence of Chinese word ' ' is inserted." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-83", "text": "This rule can not be simulated by any conjunctions of the sub phrases." 
}, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-84", "text": "It must be noted that the discontiguous phrase (' '-\"switch . . . off\") can not be abstracted under the existing synchronous grammar frameworks." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-85", "text": "The fundamental reason is that the corresponding parts should be abstracted in the same time and lexicalized in the same time." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-86", "text": "In other words, the discontiguous phrase can not be modeled by the permutation between non-terminals (abstract nodes)." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-87", "text": "Another point to notice is that our focus in this paper is the ability demonstrated by the abstract rules." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-88", "text": "Thus, we do not pay much attentions to the reorderings and discontiguous phrases involved in the phrase rules (e.g. \" \"-\"switch the light off\") since they lack the generalization capability." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-89", "text": "Therefore, the discontiguous phrase is limited to the relation between non-terminals and terminals." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-90", "text": "On the basis of the above analyses, we present a novel classification system for the abstract rules based on the crossings between the leaf node alignment links." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-91", "text": "Given an abstract rule r =< Note that the intersection of SRR NT 2 and SRR NT-T is not necessary an empty set, i.e. a rule can be both SRR NT 2 and SRR NT-T rule." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-92", "text": "The basic characteristic of the discontiguous translation is that the correspondence of one nonterminal N T is inserted among the correspondences of one phrase X. Figure 5 (a) illustrates this situation." 
}, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-93", "text": "However, this characteristic cannot serve as a necessary and sufficient condition." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-94", "text": "For example, if the phrase X can be divided as in Figure 5 (b), then the rule in Figure 5 (a) is actually a reordering rule rather than a discontiguous phrase rule." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-95", "text": "For a sufficient condition, we constrain that the phrase X = w i . . ." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-96", "text": "w j must satisfy the requirement that w i be connected with w j through word alignment links (a word is connected with itself)." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-97", "text": "In Figure 5(c) , f 1 is connected with f 2 while N T is inserted between e 1 and e 2 ." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-98", "text": "Thus, the rule in Figure 5 (c) is a discontiguous phrase rule." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-99", "text": "Definition 3: Given an abstract rule r =< \u03be s , \u03be t , A T , A N T >, it is a Discontiguous Phrase iff \u2203 two links l t1 , l t2 from A T and a link l nt from A N T satisfying: l t1 and l t2 are emitted from the same word, and l t1 is crossed with l nt while l t2 is not crossed with l nt ." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-100", "text": "Through Definition 3, we know that DPR is a subset of SRR NT-T." }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-101", "text": "----------------------------------" }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-102", "text": "**CONCLUSIONS AND FUTURE WORK**" }, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-103", "text": "In this paper, we present a refined rule classification system."
}, { "sent_id": "d70b9838e8a32a8638d7aed0adc80a-C001-104", "text": "Based on this classification system, the rules are classified according to different standards, such as lexicalization level and generalization." } ], "y": { "@BACK@": { "gold_contexts": [ [ "d70b9838e8a32a8638d7aed0adc80a-C001-15" ], [ "d70b9838e8a32a8638d7aed0adc80a-C001-16", "d70b9838e8a32a8638d7aed0adc80a-C001-17" ], [ "d70b9838e8a32a8638d7aed0adc80a-C001-20" ], [ "d70b9838e8a32a8638d7aed0adc80a-C001-23" ], [ "d70b9838e8a32a8638d7aed0adc80a-C001-34" ], [ "d70b9838e8a32a8638d7aed0adc80a-C001-36" ], [ "d70b9838e8a32a8638d7aed0adc80a-C001-37" ], [ "d70b9838e8a32a8638d7aed0adc80a-C001-38" ], [ "d70b9838e8a32a8638d7aed0adc80a-C001-54" ], [ "d70b9838e8a32a8638d7aed0adc80a-C001-58" ], [ "d70b9838e8a32a8638d7aed0adc80a-C001-66" ], [ "d70b9838e8a32a8638d7aed0adc80a-C001-67" ], [ "d70b9838e8a32a8638d7aed0adc80a-C001-68" ], [ "d70b9838e8a32a8638d7aed0adc80a-C001-69" ], [ "d70b9838e8a32a8638d7aed0adc80a-C001-76" ] ], "cite_sentences": [ "d70b9838e8a32a8638d7aed0adc80a-C001-15", "d70b9838e8a32a8638d7aed0adc80a-C001-16", "d70b9838e8a32a8638d7aed0adc80a-C001-20", "d70b9838e8a32a8638d7aed0adc80a-C001-23", "d70b9838e8a32a8638d7aed0adc80a-C001-34", "d70b9838e8a32a8638d7aed0adc80a-C001-36", "d70b9838e8a32a8638d7aed0adc80a-C001-37", "d70b9838e8a32a8638d7aed0adc80a-C001-38", "d70b9838e8a32a8638d7aed0adc80a-C001-54", "d70b9838e8a32a8638d7aed0adc80a-C001-58", "d70b9838e8a32a8638d7aed0adc80a-C001-66", "d70b9838e8a32a8638d7aed0adc80a-C001-67", "d70b9838e8a32a8638d7aed0adc80a-C001-68", "d70b9838e8a32a8638d7aed0adc80a-C001-69", "d70b9838e8a32a8638d7aed0adc80a-C001-76" ] }, "@MOT@": { "gold_contexts": [ [ "d70b9838e8a32a8638d7aed0adc80a-C001-15" ], [ "d70b9838e8a32a8638d7aed0adc80a-C001-16", "d70b9838e8a32a8638d7aed0adc80a-C001-17" ], [ "d70b9838e8a32a8638d7aed0adc80a-C001-49" ], [ "d70b9838e8a32a8638d7aed0adc80a-C001-70" ], [ "d70b9838e8a32a8638d7aed0adc80a-C001-76" ] ], "cite_sentences": [ 
"d70b9838e8a32a8638d7aed0adc80a-C001-15", "d70b9838e8a32a8638d7aed0adc80a-C001-16", "d70b9838e8a32a8638d7aed0adc80a-C001-49", "d70b9838e8a32a8638d7aed0adc80a-C001-70", "d70b9838e8a32a8638d7aed0adc80a-C001-76" ] }, "@USE@": { "gold_contexts": [ [ "d70b9838e8a32a8638d7aed0adc80a-C001-23" ], [ "d70b9838e8a32a8638d7aed0adc80a-C001-54" ], [ "d70b9838e8a32a8638d7aed0adc80a-C001-58" ] ], "cite_sentences": [ "d70b9838e8a32a8638d7aed0adc80a-C001-23", "d70b9838e8a32a8638d7aed0adc80a-C001-54", "d70b9838e8a32a8638d7aed0adc80a-C001-58" ] }, "@DIF@": { "gold_contexts": [ [ "d70b9838e8a32a8638d7aed0adc80a-C001-78", "d70b9838e8a32a8638d7aed0adc80a-C001-79" ], [ "d70b9838e8a32a8638d7aed0adc80a-C001-91" ], [ "d70b9838e8a32a8638d7aed0adc80a-C001-100" ] ], "cite_sentences": [ "d70b9838e8a32a8638d7aed0adc80a-C001-78", "d70b9838e8a32a8638d7aed0adc80a-C001-91", "d70b9838e8a32a8638d7aed0adc80a-C001-100" ] }, "@FUT@": { "gold_contexts": [ [ "d70b9838e8a32a8638d7aed0adc80a-C001-107" ], [ "d70b9838e8a32a8638d7aed0adc80a-C001-108" ] ], "cite_sentences": [ "d70b9838e8a32a8638d7aed0adc80a-C001-107", "d70b9838e8a32a8638d7aed0adc80a-C001-108" ] } } }, "ABC_e834dadbcf08cf14e476b5f5cbf79e_0": { "x": [ { "sent_id": "e834dadbcf08cf14e476b5f5cbf79e-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "e834dadbcf08cf14e476b5f5cbf79e-C001-2", "text": "Given the current growth in research and related emerging technologies in machine learning and deep learning, it is timely to introduce this tutorial to a large number of researchers and practitioners who are attending COLING 2018 and working on statistical models, deep neural networks, sequential learning and natural language understanding." }, { "sent_id": "e834dadbcf08cf14e476b5f5cbf79e-C001-3", "text": "To the best of our knowledge, there is no similar tutorial presented in previous ACL/COLING/EMNLP/NAACL." 
}, { "sent_id": "e834dadbcf08cf14e476b5f5cbf79e-C001-4", "text": "This three-hour tutorial will concentrate on a wide range of theories and applications and systematically present the recent advances in deep Bayesian and sequential learning which are impacting the communities of computational linguistics, human language technology and machine learning for natural language processing." }, { "sent_id": "e834dadbcf08cf14e476b5f5cbf79e-C001-5", "text": "----------------------------------" }, { "sent_id": "e834dadbcf08cf14e476b5f5cbf79e-C001-6", "text": "**TUTORIAL DESCRIPTION**" }, { "sent_id": "e834dadbcf08cf14e476b5f5cbf79e-C001-7", "text": "This tutorial introduces the advances in deep Bayesian learning with abundant applications for natural language understanding ranging from speech recognition (Saon and Chien, 2012; Chan et al., 2016) to document summarization (Chang and Chien, 2009 ), text classification (Blei et al., 2003; Zhang et al., 2015) , text segmentation (Chien and Chueh, 2012) , information extraction (Narasimhan et al., 2016) , image caption generation (Vinyals et al., 2015; Xu et al., 2015) , sentence generation (Li et al., 2016b) , dialogue control (Zhao and Eskenazi, 2016; Li et al., 2016a) , sentiment classification, recommendation system, question answering (Sukhbaatar et al., 2015) and machine translation , to name a few." }, { "sent_id": "e834dadbcf08cf14e476b5f5cbf79e-C001-8", "text": "Traditionally, \"deep learning\" is taken to be a learning process where the inference or optimization is based on the real-valued deterministic model." }, { "sent_id": "e834dadbcf08cf14e476b5f5cbf79e-C001-9", "text": "The \"semantic structure\" in words, sentences, entities, actions and documents drawn from a large vocabulary may not be well expressed or correctly optimized in mathematical logic or computer programs." 
}, { "sent_id": "e834dadbcf08cf14e476b5f5cbf79e-C001-10", "text": "The \"distribution function\" in discrete or continuous latent variable models for natural language may not be properly decomposed or estimated in model inference." }, { "sent_id": "e834dadbcf08cf14e476b5f5cbf79e-C001-11", "text": "This tutorial addresses the fundamentals of statistical models and neural networks, and focuses on a series of advanced Bayesian models and deep models including hierarchical Dirichlet process , Chinese restaurant process (Blei et al., 2010) , hierarchical Pitman-Yor process (Teh, 2006) , Indian buffet process (Ghahramani and Griffiths, 2005) , recurrent neural network (Mikolov et al., 2010; Van Den Oord et al., 2016) , long short-term memory (Hochreiter and Schmidhuber, 1997) , sequence-to-sequence model (Sutskever et al., 2014), variational auto-encoder (Kingma and Welling, 2014) , generative adversarial network (Goodfellow et al., 2014) , attention mechanism (Chorowski et al., 2015; Seo et al., 2016) , memory-augmented neural network (Graves et al., 2014) , stochastic neural network (Miao et al., 2016) , predictive state neural network (Downey et al., 2017) , policy gradient (Yu et al., 2017) and reinforcement learning (Mnih et al., 2015) ." }, { "sent_id": "e834dadbcf08cf14e476b5f5cbf79e-C001-12", "text": "We present how these models are connected and why they work for a variety of applications on symbolic and complex patterns in natural language." }, { "sent_id": "e834dadbcf08cf14e476b5f5cbf79e-C001-13", "text": "The variational inference and sampling methods are formulated to tackle the optimization for complicated models (Rezende et al., 2014) ." }, { "sent_id": "e834dadbcf08cf14e476b5f5cbf79e-C001-14", "text": "The word and sentence embeddings, clustering and co-clustering are merged with linguistic and semantic constraints."
}, { "sent_id": "e834dadbcf08cf14e476b5f5cbf79e-C001-15", "text": "A series of case studies is presented to tackle different issues in deep Bayesian learning and understanding." }, { "sent_id": "e834dadbcf08cf14e476b5f5cbf79e-C001-16", "text": "Finally, we point out a number of directions and outlooks for future studies." }, { "sent_id": "e834dadbcf08cf14e476b5f5cbf79e-C001-17", "text": "The presentation of this tutorial is arranged into five parts." }, { "sent_id": "e834dadbcf08cf14e476b5f5cbf79e-C001-18", "text": "First of all, we share the current status of research on natural language understanding, statistical modeling and deep neural networks, and explain the key issues in deep Bayesian learning for discrete-valued observation data and latent semantics." }, { "sent_id": "e834dadbcf08cf14e476b5f5cbf79e-C001-19", "text": "A new paradigm called symbolic neural learning is introduced to extend how data analysis is performed from language processing to semantic learning and memory networking." }, { "sent_id": "e834dadbcf08cf14e476b5f5cbf79e-C001-20", "text": "Secondly, we address a number of Bayesian models ranging from latent variable models to VB inference (Chien and Chang, 2014; Chien and Chueh, 2011; Chien, 2015b) , MCMC sampling (Watanabe and Chien, 2015) and BNP learning (Chien, 2016; Chien, 2015a; Chien, 2018) for hierarchical, thematic and sparse topics from natural language." }, { "sent_id": "e834dadbcf08cf14e476b5f5cbf79e-C001-21", "text": "In the third part, a series of deep models including deep unfolding (Chien and Lee, 2018) , Bayesian RNN (Gal and Ghahramani, 2016; Chien and Ku, 2016) , sequence-to-sequence learning (Graves et al., 2006; Gehring et al., 2017) , CNN (Kalchbrenner et al., 2014; Xingjian et al., 2015) , GAN (Tsai and Chien, 2017) and VAE is introduced." }, { "sent_id": "e834dadbcf08cf14e476b5f5cbf79e-C001-22", "text": "The coffee break is arranged within this part."
}, { "sent_id": "e834dadbcf08cf14e476b5f5cbf79e-C001-23", "text": "Next, the fourth part focuses on a variety of advanced studies which illustrate how deep Bayesian learning is developed to infer the sophisticated recurrent models for natural language understanding." }, { "sent_id": "e834dadbcf08cf14e476b5f5cbf79e-C001-24", "text": "In particular, the memory network (Chien and Lin, 2018) , neural variational learning (Serban et al., 2017; Chung et al., 2015) , neural discrete representation (Jang et al., 2016; Maddison et al., 2016; van den Oord et al., 2017) , recurrent ladder network (Rasmus et al., 2015; Pr\u00e9mont-Schwarz et al., 2017; S\u00f8nderby et al., 2016) , stochastic neural network (Fraccaro et al., 2016; Goyal et al., 2017; Shabanian et al., 2017) , Markov recurrent neural network (Venkatraman et al., 2017; Kuo and Chien, 2018) , sequence GAN (Yu et al., 2017) and reinforcement learning (Tegho et al., 2017) are introduced in various deep models which open a window to more practical tasks, e.g. reading comprehension, sentence generation, dialogue systems, question answering and machine translation." }, { "sent_id": "e834dadbcf08cf14e476b5f5cbf79e-C001-25", "text": "In the final part, we spotlight some future directions for deep language understanding which can handle the challenges of big data, heterogeneous conditions and dynamic systems." }, { "sent_id": "e834dadbcf08cf14e476b5f5cbf79e-C001-26", "text": "In particular, deep learning, structural learning, temporal modeling, long history representation and stochastic learning are emphasized." }, { "sent_id": "e834dadbcf08cf14e476b5f5cbf79e-C001-27", "text": "Slides of this tutorial are available at http://chien.cm.nctu.edu.tw/home/coling/. 
" }, { "sent_id": "e834dadbcf08cf14e476b5f5cbf79e-C001-28", "text": "----------------------------------" }, { "sent_id": "e834dadbcf08cf14e476b5f5cbf79e-C001-29", "text": "**INSTRUCTOR**" } ], "y": { "@BACK@": { "gold_contexts": [ [ "e834dadbcf08cf14e476b5f5cbf79e-C001-24" ] ], "cite_sentences": [ "e834dadbcf08cf14e476b5f5cbf79e-C001-24" ] } } }, "ABC_f3282df3adadf78320e99c09d8384f_0": { "x": [ { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-2", "text": "We present a very simple, unsupervised method for the pairwise matching of documents from heterogeneous collections." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-3", "text": "We demonstrate our method with the Concept-Project matching task, which is a binary classification task involving pairs of documents from heterogeneous collections." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-4", "text": "Although our method only employs standard resources without any domain- or task-specific modifications, it clearly outperforms the more complex system of the original authors." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-5", "text": "In addition, our method is transparent, because it provides explicit information about how a similarity score was computed, and efficient, because it is based on the aggregation of (pre-computable) word-level similarities." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-6", "text": "----------------------------------" }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-8", "text": "We present a simple and efficient unsupervised method for pairwise matching of documents from heterogeneous collections." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-9", "text": "Following Gong et al. 
(2018) , we consider two document collections heterogeneous if their documents differ systematically with respect to vocabulary and / or level of abstraction." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-10", "text": "With these defining differences, there often also comes a difference in length, which, however, by itself does not make document collections heterogeneous." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-11", "text": "Examples include collections in which expert answers are mapped to non-expert questions (e.g. InsuranceQA by Feng et al. (2015) ), but also so-called community QA collections (Blooma and Kurian (2011) ), where the lexical mismatch between Q and A documents is often less pronounced than the length difference." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-12", "text": "Like many other approaches, the proposed method is based on word embeddings as universal meaning representations, and on vector cosine as the similarity metric." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-13", "text": "However, instead of computing pairs of document representations and measuring their similarity, our method assesses the document-pair similarity on the basis of selected pairwise word similarities." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-14", "text": "This has the following advantages, which make our method a viable candidate for practical, real-world applications: efficiency, because pairwise word similarities can be efficiently (pre-)computed and cached, and transparency, because the selected words from each document are available as evidence for what the similarity computation was based on." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-15", "text": "We demonstrate our method with the Concept-Project matching task (Gong et al. (2018) ), which is described in the next section." 
}, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-16", "text": "----------------------------------" }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-17", "text": "**TASK, DATA SET, AND ORIGINAL APPROACH**" }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-18", "text": "The Concept-Project matching task is a binary classification task where each instance is a pair of heterogeneous documents: one concept, which is a short science curriculum item from NGSS 1 , and one project, which is a much longer science project description for school children from ScienceBuddies 2 ." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-19", "text": "CONCEPT LABEL: ecosystems: -ls2.a: interdependent relationships in ecosystems CONCEPT DESCRIPTION: Ecosystems have carrying capacities , which are limits to the numbers of organisms and populations they can support ." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-20", "text": "These limits result from such factors as the availability of living and nonliving resources and from such challenges such as predation , competition , and disease ." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-21", "text": "Organisms would have the capacity to produce populations of great size were it not for the fact that environments and resources are finite ." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-22", "text": "This fundamental tension affects the abundance ( number of individuals ) of species in any given ecosystem ." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-23", "text": "PROJECT LABEL: Primary Productivity and Plankton PROJECT DESCRIPTION: Have you seen plankton? I am not talking about the evil villain trying to steal the Krabby Patty recipe from Mr. Krab. I am talking about plankton that live in the ocean." 
}, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-24", "text": "In this experiment you can learn how to collect your own plankton samples and see the wonderful diversity in shape and form of planktonic organisms." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-25", "text": "The oceans contain both the earth's largest and smallest organisms." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-26", "text": "Interestingly they share a delicate relationship linked together by what they eat." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-27", "text": "The largest of the ocean's inhabitants, the Blue Whale, eats very small plankton, which themselves eat even smaller phytoplankton." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-28", "text": "All of the linkages between predators, grazers, and primary producers in the ocean make up an enormously complicated food web." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-29", "text": "The base of this food web depends upon phytoplankton, very small photosynthetic organisms which can make their own energy by using energy from the sun." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-30", "text": "These phytoplankton provide the primary source of the essential nutrients that cycle through our ocean's many food webs." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-31", "text": "This is called primary productivity, and it is a very good way of measuring the health and abundance of our fisheries." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-32", "text": "There are many different kinds of phytoplankton in our oceans." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-33", "text": "[...] One way to study plankton is to collect the plankton using a plankton net to collect samples of macroscopic and microscopic plankton organisms." 
}, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-34", "text": "The net is cast out into the water or trolled behind a boat for a given distance then retrieved." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-35", "text": "Upon retrieving the net, the contents of the collecting bottle can be removed and the captured plankton can be observed with a microscope." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-36", "text": "The plankton net will collect both phytoplankton (photosynthetic plankton) and zooplankton (non-photosynthetic plankton and larvae) for observation." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-37", "text": "In this experiment you will make your own plankton net and use it to collect samples of plankton from different marine or aquatic locations in your local area." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-38", "text": "You can observe both the abundance (total number of organisms) and diversity (number of different kinds of organisms) of planktonic forms to make conclusions about the productivity and health of each location." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-39", "text": "In this experiment you will make a plankton net to collect samples of plankton from different locations as an indicator of primary productivity." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-40", "text": "You can also count the number of phytoplankton (which appear green or brown) compared to zooplankton (which are mostly marine larval forms) and compare." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-41", "text": "Do the numbers balance, or is there more of one type than the other? What effect do you think this has on productivity cycles?" }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-42", "text": "Food chains are very complex." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-43", "text": "Find out what types of predators and grazers you have in your area." 
}, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-44", "text": "You can find this information from a field guide or from your local Department of Fish and Game. Can you use this information to construct a food web for your local area?" }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-45", "text": "Some blooms of phytoplankton can be harmful and create an anoxic environment that can suffocate the ecosystem and leave a \"Dead Zone\" behind. Did you find an excess of brown algae or diatoms?" }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-46", "text": "These can be indicators of a harmful algal bloom." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-47", "text": "Re-visit this location over several weeks to report on an increase or decrease of these types of phytoplankton." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-48", "text": "Do you think that a harmful algal bloom could be forming in your area?" }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-49", "text": "For an experiment that studies the relationship between water quality and algal bloom events, see the Science Buddies project Harmful Algal Blooms in the Chesapeake Bay." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-50", "text": "The publicly available data set 3 contains 510 labelled pairs 4 involving C = 75 unique concepts and P = 230 unique projects." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-51", "text": "A pair is annotated as 1 if the project matches the concept (57%), and as 0 otherwise (43%)." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-52", "text": "The annotation was done by undergrad engineering students." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-53", "text": "Gong et al. (2018) do not provide any specification, or annotation guidelines, of the semantics of the 'matches' relation to be annotated." 
}, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-54", "text": "Instead, they create gold standard annotations based on a majority vote of three manual annotations." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-55", "text": "Figure 1 provides an example of a matching C-P pair." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-56", "text": "The concept labels can be very specific, potentially introducing vocabulary that is not present in the actual concept descriptions." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-57", "text": "The extent to which this information is used by Gong et al. (2018) is not entirely clear, so we experiment with several setups (cf. Section 4)." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-58", "text": "----------------------------------" }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-59", "text": "**GONG ET AL. (2018)'S APPROACH**" }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-60", "text": "The approach by Gong et al. (2018) is based on the idea that the longer document in the pair is reduced to a set of topics which capture the essence of the document in a way that eliminates the effect of a potential length difference." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-61", "text": "In order to overcome the vocabulary mismatch, these topics are not based on words and their distributions (as in LSI (Deerwester et al. (1990) ) or LDA (Blei et al. (2003) )), but on word embedding vectors." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-62", "text": "Then, basically, matching is done by measuring the cosine similarity between the topic vectors and the short document words." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-63", "text": "Gong et al. (2018) motivate their approach mainly with the length mismatch argument, which they claim makes approaches relying on document representations (incl. vector averaging) unsuitable." 
}, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-64", "text": "Accordingly, they use Doc2Vec (Le and Mikolov (2014) ) as one of their baselines, and show that its performance is inferior to their method." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-65", "text": "They do not, however, provide a much simpler averaging-based baseline." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-66", "text": "As a second baseline, they use Word Mover's Distance (Kusner et al. (2015) ), which is based on word-level distances, rather than distance of global document representations, but which also fails to be competitive with their topic-based method." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-67", "text": "Gong et al. (2018) use two different sets of word embeddings: One (topic wiki) was trained on a full English Wikipedia dump, the other (wiki science) on a smaller subset of the former dump which only contained science articles." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-68", "text": "A standard way to compute the similarity of two documents (i.e. word sequences) c and p is to average over the word embeddings for each sequence first, and to compute the cosine similarity between the two averages afterwards." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-69", "text": "In the first step, weighting can be applied by multiplying a vector with the TF, IDF, or TF*IDF score of its pertaining word." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-70", "text": "We implement this standard measure (AVG COS SIM) as a baseline for both our method and for the method by Gong et al. (2018) ." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-71", "text": "It yields a single scalar similarity score." 
}, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-72", "text": "The core idea of our alternative method is to turn the above process upside down, by computing the cosine similarity of selected pairs of words from c and p first, and to average over the similarity scores afterwards (cf. also Section 6)." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-73", "text": "More precisely, we implement a measure TOP n COS SIM AVG as the average of the n highest pairwise cosine similarities of the n top-ranking words in c and p. Ranking, again, is done by TF, IDF, and TF*IDF." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-74", "text": "For each ranking, we take the top-ranking n words from c and p, compute n \u00d7 n similarities, rank by decreasing similarity, and average over the top n similarities." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-75", "text": "This measure yields both a scalar similarity score and a list of < c x , p y , sim > tuples, which represent the qualitative aspects of c and p on which the similarity score is based." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-76", "text": "----------------------------------" }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-77", "text": "**EXPERIMENTS**" }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-78", "text": "Setup All experiments are based on off-the-shelf word-level resources: We employ WOMBAT (M\u00fcller and Strube (2018) ) for easy access to the 840B GloVe (Pennington et al. (2014) ) and the GoogleNews 5 Word2Vec (Mikolov et al. (2013) ) embeddings." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-79", "text": "These embedding resources, while slightly outdated, are still widely used." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-80", "text": "However, they cannot handle out-of-vocabulary tokens due to their fixed, word-level lexicon." 
}, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-81", "text": "Therefore, we also use a pretrained English fastText model 6 (Bojanowski et al. (2017) ; Grave et al. (2018) ), which also includes subword information." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-82", "text": "IDF weights for approx. 12 million" }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-83", "text": "different words were obtained from the English Wikipedia dump provided by the Polyglot project (Al-Rfou et al. (2013))." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-84", "text": "All resources are case-sensitive, i.e. they might contain different entries for words that only differ in case (cf. Section 5)." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-85", "text": "We run experiments in different setups, varying both the input representation (GloVe vs. Google vs. fastText embeddings, \u00b1 TF-weighting, and \u00b1 IDF-weighting) for concepts and projects, and the extent to which concept descriptions are used: For the latter, Label means only the concept label (first and second row in the example), Description means only the textual description of the concept, and Both means the concatenation of Label and Description." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-86", "text": "For the projects, we always use both label and description." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-87", "text": "For the project descriptions, we extract only the last column of the original file (CONTENT), and remove user comments and some boilerplate." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-88", "text": "Each instance in the resulting data set is a tuple of < c, p, label >, where c and p are bags of words, with case preserved and function words 7 removed, and label is either 0 or 1." 
}, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-89", "text": "Parameter Tuning Our method is unsupervised, but we need to define a threshold parameter which controls the minimum similarity that a concept and a project description should have in order to be considered a match." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-90", "text": "Also, the TOP n COS SIM AVG measure has a parameter n which controls how many ranked words are used from c and p, and how many similarity scores are averaged to create the final score." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-91", "text": "Parameter tuning experiments were performed on a random subset of 20% of our data set (54% positive)." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-92", "text": "Note that Gong et al. (2018) used only 10% of their 537-instance data set as tuning data." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-93", "text": "The tuning data results of the best-performing parameter values for each setup can be found in Tables 1 and 2 ." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-94", "text": "The top F scores per type of concept input (Label, Description, Both) are given in bold." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-95", "text": "For AVG COS SIM and TOP n COS SIM AVG, we determined the threshold values (T) on the tuning data by doing a simple .005 step search over the range from 0.3 to 1.0." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-96", "text": "For TOP n COS SIM AVG, we additionally varied the value of n in steps of 2 from 2 to 30." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-97", "text": "Results The top tuning data scores for AVG COS SIM (Table 1) show that the Google embeddings with TF*IDF weighting yield the top F score for all three concept input types (.881 -.945)." 
}, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-98", "text": "Somewhat expectedly, the best overall F score (.945) is produced in the setting Both, which provides the most information." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-99", "text": "Actually, this is true for all four weighting schemes for both GloVe and Google, while fastText consistently yields its top F scores (.840 -.911) in the Label setting, which provides the least information." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-100", "text": "Generally, the level of performance of the simple baseline measure AVG COS SIM on this data set is rather striking." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-101", "text": "For TOP n COS SIM AVG, the tuning data results (Table 2 ) are somewhat more varied: First, there is no single best performing set of embeddings: Google yields the best F score for the Label setting (.953), while GloVe (though only barely) leads in the Description setting (.912)." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-102", "text": "This time, it is fastText which produces the best F score in the Both setting, which is also the best overall tuning data F score for TOP n COS SIM AVG (.954)." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-103", "text": "While the difference to the Google result for Label is only minimal, it is striking that the best overall score is again produced using the 'richest' setting, i.e. the one involving both TF and IDF weighting and the most informative input." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-104", "text": "We then selected the best performing parameter settings for every concept input and ran experiments on the held-out test data." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-105", "text": "Since the original data split used by Gong et al. 
(2018) is unknown, we cannot exactly replicate their settings, but we also perform ten runs using randomly selected 10% of our 408 instances test data set, and report average P, R, F, and standard deviation." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-106", "text": "The results can be found in Table 3 ." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-107", "text": "For comparison, the two top rows provide the best results of Gong et al. (2018) ." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-108", "text": "The first interesting finding is that the AVG COS SIM measure again performs very well: In all three settings, it beats both the system based on general-purpose embeddings (topic wiki) and the one that is adapted to the science domain (topic science), with again the Both setting yielding the best overall result (.926)." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-109", "text": "Note that our Both setting is probably the one most similar to the concept input used by Gong et al. (2018) ." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-110", "text": "This result corroborates our findings on the tuning data, and clearly contradicts the (implicit) claim made by Gong et al. (2018) regarding the infeasibility of document-level matching for documents of different lengths." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-111", "text": "The second, more important finding is that our proposed TOP n COS SIM AVG measure is also very competitive, as it also outperforms both systems by Gong et al. (2018) in two out of three settings." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-112", "text": "It only fails in the setting using only the Description input." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-113", "text": "8 This is the more important as we exclusively employ off-the-shelf, general-purpose embeddings, while Gong et al. 
(2018) reach their best results with a much more sophisticated system and with embeddings that were custom-trained for the science domain." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-114", "text": "Thus, while the performance of our proposed TOP n COS SIM AVG method is superior to the approach by Gong et al. (2018) , it is itself outperformed by the 'baseline' AVG COS SIM method with appropriate weighting." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-115", "text": "However, apart from raw classification performance, our method also aims at providing human-interpretable information on how a classification was done." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-116", "text": "In the next section, we perform a detail analysis on a selected setup." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-117", "text": "----------------------------------" }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-118", "text": "**CONCEPT INPUT**" }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-119", "text": "----------------------------------" }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-120", "text": "**DETAIL ANALYSIS**" }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-121", "text": "The similarity-labelled word pairs from concept and project description which are selected during classification with the TOP n COS SIM AVG measure provide a way to qualitatively evaluate the basis on which each similarity score was computed." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-122", "text": "We see this as an advantage over average-based comparison (like AVG COS SIM), since it provides a means to check the plausibility of the decision." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-123", "text": "Here, we are mainly interested in the overall best result, so we perform a detail analysis on the best-performing Both setting only (fastText, TF*IDF weighting, T = .310, n = 14)." 
}, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-124", "text": "Since the Concept-Project matching task is a binary classification task, its performance can be qualitatively analysed by providing examples for instances that were classified correctly (True Positive (TP) and True Negative (TN)) or incorrectly (False Positive (FP) and False Negative (FN))." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-125", "text": "Table 5 shows the concept and project words from selected instances (one TP, FP, TN, and FN case each) of the tuning data set." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-126", "text": "Concept and project words are ordered alphabetically, with concept words appearing more than once being grouped together." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-127", "text": "According to the selected setting, the number of word pairs is n = 14." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-128", "text": "The bottom line in each column provides the average similarity score as computed by the TOP n COS SIM AVG measure." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-129", "text": "This value is compared against the threshold T = .310." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-130", "text": "The similarity is higher than T in the TP and FP cases, and lower otherwise." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-131", "text": "Without going into too much detail, it can be seen that the selected words provide a reasonable idea of the gist of the two documents." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-132", "text": "Another observation relates to the effect of using unstemmed, case-sensitive documents as input: the top-ranking words often contain inflectional variants (e.g. enzyme and enzymes, level and levels in the example), and words differing in case only can also be found." 
}, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-133", "text": "Currently, these are treated as distinct (though semantically similar) words, mainly out of compatibility with the pretrained GloVe and Google embeddings." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-134", "text": "However, since our method puts a lot of emphasis on individual words, in particular those coming from the shorter of the two documents (the concept), results might be improved by somehow merging these words (and their respective embedding vectors) (see Section 7)." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-135", "text": "----------------------------------" }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-136", "text": "**RELATED WORK**" }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-137", "text": "While in this paper we apply our method to the Concept-Project matching task only, the underlying task of matching text sequences to each other is much more general." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-138", "text": "Many existing approaches follow (2017))." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-139", "text": "As the name suggests, these approaches collect the results of element-wise matchings (comparisons) first, and create the final result by aggregating these results later." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-140", "text": "Our method can be seen as a variant of compare-aggregate which is characterized by extremely simple methods for comparison (cosine vector similarity) and aggregation (averaging)." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-141", "text": "Other approaches, like He and Lin (2016) and Wang and Jiang (2017) , employ much more elaborated supervised neural networks methods." 
}, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-142", "text": "Also, on a simpler level, the idea of averaging similarity scores (rather than scoring averaged representations) is not new: Camacho-Collados and Navigli (2016) use the average of pairwise word similarities to compute their compactness score." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-143", "text": "----------------------------------" }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-144", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-145", "text": "We presented a simple method for semantic matching of documents from heterogeneous collections as a solution to the Concept-Project matching task by Gong et al. (2018) ." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-146", "text": "Although much simpler, our method clearly outperformed the original system in most input settings." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-147", "text": "Another result is that, contrary to the claim made by Gong et al. (2018) , the standard averaging approach does indeed work very well even for heterogeneous document collections, if appropriate weighting is applied." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-148", "text": "Due to its simplicity, we believe that our method can also be applied to other text matching tasks, including more 'standard' ones which do not necessarily involve heterogeneous document collections." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-149", "text": "This seems desirable because our method offers additional transparency by providing not only a similarity score, but also the subset of words on which the similarity score is based." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-150", "text": "Future work includes detailed error analysis, and exploration of methods to combine complementary information about (grammatically or orthographically) related words from word embedding resources." 
}, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-151", "text": "Also, we are currently experimenting with a pretrained ELMo (Peters et al. (2018) ) model as another word embedding resource." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-152", "text": "ELMo takes word embeddings a step further by dynamically creating contextualized vectors from input word sequences (normally sentences)." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-153", "text": "Our initial experiments have been promising, but since ELMo tends to yield different, context-dependent vectors for the same word in the same document, ways have still to be found to combine them into single, document-wide vectors, without (fully) sacrificing their context-awareness." }, { "sent_id": "f3282df3adadf78320e99c09d8384f-C001-154", "text": "The code used in this paper is available at https://github.com/nlpAThits/TopNCosSimAvg." } ], "y": { "@MOT@": { "gold_contexts": [ [ "f3282df3adadf78320e99c09d8384f-C001-10", "f3282df3adadf78320e99c09d8384f-C001-9" ], [ "f3282df3adadf78320e99c09d8384f-C001-57" ], [ "f3282df3adadf78320e99c09d8384f-C001-145" ] ], "cite_sentences": [ "f3282df3adadf78320e99c09d8384f-C001-9", "f3282df3adadf78320e99c09d8384f-C001-57", "f3282df3adadf78320e99c09d8384f-C001-145" ] }, "@BACK@": { "gold_contexts": [ [ "f3282df3adadf78320e99c09d8384f-C001-10", "f3282df3adadf78320e99c09d8384f-C001-9" ], [ "f3282df3adadf78320e99c09d8384f-C001-59", "f3282df3adadf78320e99c09d8384f-C001-60" ], [ "f3282df3adadf78320e99c09d8384f-C001-63" ], [ "f3282df3adadf78320e99c09d8384f-C001-64" ], [ "f3282df3adadf78320e99c09d8384f-C001-65" ], [ "f3282df3adadf78320e99c09d8384f-C001-66" ], [ "f3282df3adadf78320e99c09d8384f-C001-67" ] ], "cite_sentences": [ "f3282df3adadf78320e99c09d8384f-C001-9", "f3282df3adadf78320e99c09d8384f-C001-59", "f3282df3adadf78320e99c09d8384f-C001-60", "f3282df3adadf78320e99c09d8384f-C001-63", "f3282df3adadf78320e99c09d8384f-C001-64", "f3282df3adadf78320e99c09d8384f-C001-65", 
"f3282df3adadf78320e99c09d8384f-C001-66", "f3282df3adadf78320e99c09d8384f-C001-67" ] }, "@USE@": { "gold_contexts": [ [ "f3282df3adadf78320e99c09d8384f-C001-15" ] ], "cite_sentences": [ "f3282df3adadf78320e99c09d8384f-C001-15" ] }, "@DIF@": { "gold_contexts": [ [ "f3282df3adadf78320e99c09d8384f-C001-52", "f3282df3adadf78320e99c09d8384f-C001-53", "f3282df3adadf78320e99c09d8384f-C001-54" ], [ "f3282df3adadf78320e99c09d8384f-C001-91", "f3282df3adadf78320e99c09d8384f-C001-92" ], [ "f3282df3adadf78320e99c09d8384f-C001-105" ], [ "f3282df3adadf78320e99c09d8384f-C001-110" ], [ "f3282df3adadf78320e99c09d8384f-C001-113" ], [ "f3282df3adadf78320e99c09d8384f-C001-114" ], [ "f3282df3adadf78320e99c09d8384f-C001-147" ] ], "cite_sentences": [ "f3282df3adadf78320e99c09d8384f-C001-53", "f3282df3adadf78320e99c09d8384f-C001-54", "f3282df3adadf78320e99c09d8384f-C001-92", "f3282df3adadf78320e99c09d8384f-C001-105", "f3282df3adadf78320e99c09d8384f-C001-110", "f3282df3adadf78320e99c09d8384f-C001-113", "f3282df3adadf78320e99c09d8384f-C001-114", "f3282df3adadf78320e99c09d8384f-C001-147" ] }, "@UNSURE@": { "gold_contexts": [ [ "f3282df3adadf78320e99c09d8384f-C001-57" ] ], "cite_sentences": [ "f3282df3adadf78320e99c09d8384f-C001-57" ] }, "@EXT@": { "gold_contexts": [ [ "f3282df3adadf78320e99c09d8384f-C001-57" ], [ "f3282df3adadf78320e99c09d8384f-C001-70" ] ], "cite_sentences": [ "f3282df3adadf78320e99c09d8384f-C001-57", "f3282df3adadf78320e99c09d8384f-C001-70" ] }, "@SIM@": { "gold_contexts": [ [ "f3282df3adadf78320e99c09d8384f-C001-109" ], [ "f3282df3adadf78320e99c09d8384f-C001-111" ] ], "cite_sentences": [ "f3282df3adadf78320e99c09d8384f-C001-109", "f3282df3adadf78320e99c09d8384f-C001-111" ] } } }, "ABC_e75e14ff2812f34ff456eb472a36d2_0": { "x": [ { "sent_id": "e75e14ff2812f34ff456eb472a36d2-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "e75e14ff2812f34ff456eb472a36d2-C001-2", "text": "Given the current growth in research and related emerging technologies in machine learning and deep 
learning, it is timely to introduce this tutorial to a large number of researchers and practitioners who are attending COLING 2018 and working on statistical models, deep neural networks, sequential learning and natural language understanding." }, { "sent_id": "e75e14ff2812f34ff456eb472a36d2-C001-3", "text": "To the best of our knowledge, there is no similar tutorial presented in previous ACL/COLING/EMNLP/NAACL." }, { "sent_id": "e75e14ff2812f34ff456eb472a36d2-C001-4", "text": "This three-hour tutorial will concentrate on a wide range of theories and applications and systematically present the recent advances in deep Bayesian and sequential learning which are impacting the communities of computational linguistics, human language technology and machine learning for natural language processing." }, { "sent_id": "e75e14ff2812f34ff456eb472a36d2-C001-5", "text": "----------------------------------" }, { "sent_id": "e75e14ff2812f34ff456eb472a36d2-C001-6", "text": "**TUTORIAL DESCRIPTION**" }, { "sent_id": "e75e14ff2812f34ff456eb472a36d2-C001-7", "text": "This tutorial introduces the advances in deep Bayesian learning with abundant applications for natural language understanding ranging from speech recognition (Saon and Chien, 2012; Chan et al., 2016) to document summarization (Chang and Chien, 2009 ), text classification (Blei et al., 2003; Zhang et al., 2015) , text segmentation (Chien and Chueh, 2012) , information extraction (Narasimhan et al., 2016) , image caption generation (Vinyals et al., 2015; Xu et al., 2015) , sentence generation (Li et al., 2016b) , dialogue control (Zhao and Eskenazi, 2016; Li et al., 2016a) , sentiment classification, recommendation system, question answering (Sukhbaatar et al., 2015) and machine translation , to name a few." }, { "sent_id": "e75e14ff2812f34ff456eb472a36d2-C001-8", "text": "Traditionally, \"deep learning\" is taken to be a learning process where the inference or optimization is based on the real-valued deterministic model." 
}, { "sent_id": "e75e14ff2812f34ff456eb472a36d2-C001-9", "text": "The \"semantic structure\" in words, sentences, entities, actions and documents drawn from a large vocabulary may not be well expressed or correctly optimized in mathematical logic or computer programs." }, { "sent_id": "e75e14ff2812f34ff456eb472a36d2-C001-10", "text": "The \"distribution function\" in discrete or continuous latent variable model for natural language may not be properly decomposed or estimated in model inference." }, { "sent_id": "e75e14ff2812f34ff456eb472a36d2-C001-11", "text": "This tutorial addresses the fundamentals of statistical models and neural networks, and focus on a series of advanced Bayesian models and deep models including hierarchical Dirichlet process , Chinese restaurant process (Blei et al., 2010) , hierarchical Pitman-Yor process (Teh, 2006) , Indian buffet process (Ghahramani and Griffiths, 2005) , recurrent neural network (Mikolov et al., 2010; Van Den Oord et al., 2016) , long short-term memory (Hochreiter and Schmidhuber, 1997; , sequence-to-sequence model (Sutskever et al., 2014), variational auto-encoder (Kingma and Welling, 2014) , generative adversarial network (Goodfellow et al., 2014) , attention mechanism (Chorowski et al., 2015; Seo et al., 2016) , memory-augmented neural network (Graves et al., 2014; Graves et al., 2014) , stochastic neural network Miao et al., 2016) , predictive state neural network (Downey et al., 2017) , policy gradient (Yu et al., 2017) and reinforcement learning (Mnih et al., 2015) ." }, { "sent_id": "e75e14ff2812f34ff456eb472a36d2-C001-12", "text": "We present how these models are connected and why they work for a variety of applications on symbolic and complex patterns in natural language." }, { "sent_id": "e75e14ff2812f34ff456eb472a36d2-C001-13", "text": "The variational inference and sampling method are formulated to tackle the optimization for complicated models (Rezende et al., 2014) ." 
}, { "sent_id": "e75e14ff2812f34ff456eb472a36d2-C001-14", "text": "The word and sentence embeddings, clustering and co-clustering are merged with linguistic and semantic constraints." }, { "sent_id": "e75e14ff2812f34ff456eb472a36d2-C001-15", "text": "A series of case studies are presented to tackle different issues in deep Bayesian learning and understanding." }, { "sent_id": "e75e14ff2812f34ff456eb472a36d2-C001-16", "text": "At last, we point out a number of directions and outlooks for future studies." }, { "sent_id": "e75e14ff2812f34ff456eb472a36d2-C001-17", "text": "The presentation of this tutorial is arranged into five parts." }, { "sent_id": "e75e14ff2812f34ff456eb472a36d2-C001-18", "text": "First of all, we share the current status of researches on natural language understanding, statistical modeling and deep neural network and explain the key issues in deep Bayesian learning for discrete-valued observation data and latent semantics." }, { "sent_id": "e75e14ff2812f34ff456eb472a36d2-C001-19", "text": "A new paradigm called the symbolic neural learning is introduced to extend how data analysis is performed from language processing to semantic learning and memory networking." }, { "sent_id": "e75e14ff2812f34ff456eb472a36d2-C001-20", "text": "Secondly, we address a number of Bayesian models ranging from latent variable model to VB inference (Chien and Chang, 2014; Chien and Chueh, 2011; Chien, 2015b) , MCMC sampling (Watanabe and Chien, 2015) and BNP learning (Chien, 2016; Chien, 2015a; Chien, 2018) for hierarchical, thematic and sparse topics from natural language." 
}, { "sent_id": "e75e14ff2812f34ff456eb472a36d2-C001-21", "text": "In the third part, a series of deep models including deep unfolding (Chien and Lee, 2018) , Bayesian RNN (Gal and Ghahramani, 2016; Chien and Ku, 2016) , sequence-to-sequence learning (Graves et al., 2006; Gehring et al., 2017) , CNN (Kalchbrenner et al., 2014; Xingjian et al., 2015; , GAN (Tsai and Chien, 2017) and VAE are introduced." }, { "sent_id": "e75e14ff2812f34ff456eb472a36d2-C001-22", "text": "The coffee break is arranged within this part." }, { "sent_id": "e75e14ff2812f34ff456eb472a36d2-C001-23", "text": "Next, the fourth part focuses on a variety of advanced studies which illustrate how deep Bayesian learning is developed to infer the sophisticated recurrent models for natural language understanding." }, { "sent_id": "e75e14ff2812f34ff456eb472a36d2-C001-24", "text": "In particular, the memory network Chien and Lin, 2018) , neural variational learning (Serban et al., 2017; Chung et al., 2015) , neural discrete representation (Jang et al., 2016; Maddison et al., 2016; van den Oord et al., 2017) , recurrent ladder network (Rasmus et al., 2015; Pr\u00e9mont-Schwarz et al., 2017; S\u00f8nderby et al., 2016) , stochastic neural network (Fraccaro et al., 2016; Goyal et al., 2017; Shabanian et al., 2017) , Markov recurrent neural network (Venkatraman et al., 2017; Kuo and Chien, 2018) , sequence GAN (Yu et al., 2017) and reinforcement learning (Tegho et al., 2017) are introduced in various deep models which open a window to more practical tasks, e.g. reading comprehension, sentence generation, dialogue system, question answering and machine translation." }, { "sent_id": "e75e14ff2812f34ff456eb472a36d2-C001-25", "text": "In the final part, we spotlight on some future directions for deep language understanding which can handle the challenges of big data, heterogeneous condition and dynamic system." 
}, { "sent_id": "e75e14ff2812f34ff456eb472a36d2-C001-26", "text": "In particular, deep learning, structural learning, temporal modeling, long history representation and stochastic learning are emphasized." }, { "sent_id": "e75e14ff2812f34ff456eb472a36d2-C001-27", "text": "Slides of this tutorial are available at http://chien.cm.nctu.edu.tw/home/coling/. APSIPA 2013 , ISCSLP 2014 , Interspeech 2013 , 2016 and ICASSP 2012" }, { "sent_id": "e75e14ff2812f34ff456eb472a36d2-C001-28", "text": "----------------------------------" }, { "sent_id": "e75e14ff2812f34ff456eb472a36d2-C001-29", "text": "**INSTRUCTOR**" } ], "y": { "@BACK@": { "gold_contexts": [ [ "e75e14ff2812f34ff456eb472a36d2-C001-21" ] ], "cite_sentences": [ "e75e14ff2812f34ff456eb472a36d2-C001-21" ] } } }, "ABC_f1e5584a2139160943d9f0338e6ce0_0": { "x": [ { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-2", "text": "We explore blindfold (question-only) baselines for Embodied Question Answering." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-3", "text": "The EmbodiedQA task requires an agent to answer a question by intelligently navigating in a simulated environment, gathering necessary visual information only through first-person vision before finally answering." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-4", "text": "Consequently, a blindfold baseline which ignores the environment and visual information is a degenerate solution, yet we show through our experiments on the EQAv1 dataset that a simple question-only baseline achieves state-of-the-art results on the EmbodiedQA task in all cases except when the agent is spawned extremely close to the object." 
}, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-5", "text": "----------------------------------" }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-7", "text": "Recent breakthroughs in static, unimodal tasks such as image classification [16] and language processing [18] has prompted research towards multimodal tasks [1, 8] and virtual environments [4, 15, 25] ." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-8", "text": "This is substantiated by embodiment theories in cognitive science that have argued for agent learning to be interactive and multimodal, mimicking key aspects of human learning [9, 17] ." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-30", "text": "The master and sub-policies are initialized with Behavior Cloning (BC), and later fine-tuned with Asynchronous Advantage Actor-Critic (A3C) [19] ." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-9", "text": "To foster and measure progress in such virtual environments, new tasks have been introduced, one of them being Embodied Question Answering (EmbodiedQA) [5] ." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-10", "text": "The EmbodiedQA task requires an agent to intelligently navigate in a simulated household environment [25] and answer questions through egocentric vision." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-11", "text": "Concretely, an agent is spawned at a random location in an environment (a house or building) and asked a question (e.g. 'What color is the car?')." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-12", "text": "The agent perceives its environment through first-person egocentric vision and can perform a few atomic actions (move-forward, turn, strafe, etc.)." 
}, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-13", "text": "The goal of the agent is to intelligently navigate the environment and gather visual information necessary for answering the question." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-14", "text": "Subsequent to the introduction of the task, several methods have been introduced to solve the EmbodiedQA task [5, 6] , using some combination of reinforcement learning, behavior cloning and hierarchical control." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-15", "text": "Apart from using the question and images from the environment, these methods also rely on varying degrees of expert supervision such as shortest path demonstrations and subgoal policy sketches." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-16", "text": "In this work, we evaluate simple question-only baselines that never see the environment and receive no form of expert supervision." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-17", "text": "We examine whether existing methods outperform baselines designed to solely capture dataset bias, in order to better understand the performance of these existing methods." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-18", "text": "To our surprise, blindfold baselines achieve state-of-the-art performance on the EmbodiedQA task, except in the case when the agent is spawned extremely close to the object." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-19", "text": "Even in the latter case, blindfold baselines perform surprisingly close to existing state-of-the-art methods." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-20", "text": "We note that this finding is reminiscent of several recent works in both Computer Vision and Natural Language Processing, where researchers have found that statistical irregularities in the dataset can enable degenerate methods to perform surprisingly well [11, 12, 14, 21] ." 
}, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-21", "text": "Our findings suggest that current EmbodiedQA models are ineffective at leveraging the context from the environment, in fact this context or embodiment in the environment can negatively hamper them." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-22", "text": "We hope comparison with our baseline results can more effectively demonstrate how well a method is able to leverage embodiment in the environment." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-23", "text": "Upon further error analysis of our models and qualitative inspection of the dataset, we find that there exist biases in the EQAv1 dataset that allow blindfold models to perform so well." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-24", "text": "We acknowledge the active effort of Das et al. [5] in removing some biases via entropy-pruning but note that further efforts might be necessary to fully correct these biases." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-25", "text": "----------------------------------" }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-26", "text": "**RELATED WORK**" }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-27", "text": "EmbodiedQA Methods: Das et al. [5] introduced the PACMAN-RL+Q model which is bootstrapped with expert shortest-path demonstrations and later fine-tuned with REINFORCE [24] ." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-28", "text": "This model consists of a hierarchical navigation module: a planner and a controller, and a question answering module that acts when the navigation module has given up control." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-29", "text": "In a later work, Das et al. [6] introduce Neural Modular Control (NMC) which is a hierarchical policy network that operates over expert sub-policy sketches." 
}, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-31", "text": "Dataset Biases and Trivial Baselines: Many recent studies in language and vision show how biases in a dataset allow models to perform well on a task without leveraging the meaning of the text or image in the underlying dataset." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-32", "text": "A simple CNN-BoW model was shown to achieve state-of-the-art results [12] on the Visual7W [26] task while also performing surprisingly well compared to the most complex systems proposed for the VQA dataset [1] and other joint vision and language tasks [2, 10] ." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-33", "text": "Simple nearest neighbor approaches have been shown to perform well on image captioning datasets [7] ." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-34", "text": "This phenomenon has also been observed in language processing tasks." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-35", "text": "On the Story-cloze task which was presented to evaluate common-sense reasoning, Schwartz et al. [23] achieved state-ofthe-art performance by ignoring the narrative and training a linear classifier with features related to the writing style of the two potential endings, rather than their content." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-36", "text": "Similar observations were found on the Natural Language Inference (NLI) datasets, where methods ignoring the context and relying only on the hypothesis perform remarkably well [11, 21] ." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-37", "text": "Most recently, question-only and passage-only baselines on several QA datasets highlighted similar issues [14] ." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-38", "text": "3 Methods grey (1946) black (1386) bedroom (978) living room (871) bathroom (451) balcony (10) passenger elevator (1) Figure 2 : Frequency of each answer in the entire EQAv1 dataset." 
}, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-39", "text": "We observe that answers do not appear equally in the dataset, and are biased toward a select few." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-40", "text": "Average BOW Embedding We use a simple linear classifier as described in [3, 13, 22] , which takes word level embeddings and averages them to construct the question representation." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-41", "text": "We first perform a look-up over an embedding matrix for each word to get individual word representations." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-42", "text": "These word representations are then averaged into a text representation, which is in turn fed to a linear classifier." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-43", "text": "This architecture is similar to the fastText model of [13] ." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-44", "text": "It is also a common and strong baseline in language and vision and language tasks [13, 22] ." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-45", "text": "We use the softmax function f to compute the probability distribution over the predefined classes." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-46", "text": "The training criterion minimizes the negative log-likelihood over the classes." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-47", "text": "Nearest Neighbor Answer Distribution (NN-AnswerDist) This method attempts to answer purely based on the per-question answer distribution of the training set." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-48", "text": "For an input question we find either the identical question in the training set or if one doesn't exist the nearest matching question (based on number of shared words)." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-49", "text": "We then select the most likely answer for the training set." 
}, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-50", "text": "Performance on this baseline is directly indicative of the bias in answer distributions in the dataset." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-51", "text": "We note that for EQAv1 almost all questions in the validation and test sets are present in the training set." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-52", "text": "The answers span across 72 different categories of color, location and objects." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-53", "text": "We note that there are only 2 questions in the validation set, and 6 questions in the test set that are not in the training set." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-54", "text": "This limits the ability to test how well an agent generalizes across unseen combinations of rooms/objects/colors." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-76", "text": "This shows that there is room for building new models that leverage the context and embodiment in the environment." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-55", "text": "To get rid of peaky answers, an entropy pruning method was applied by [5] where questions with normalized entropy below 0.5 were excluded." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-56", "text": "However this still leaves an uneven answer distribution that can be exploited." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-57", "text": "Training Details 1 We evaluate the efficacy of our proposed baselines on the EQAv1 dataset." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-58", "text": "For the BoW model, we initialize the embeddings with Glove vectors [20] of size 100, which are allowed to be fine-tuned during the training procedure." 
}, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-59", "text": "We use the Adam optimizer (batch-size of 64) with a learning rate of 5e \u22123 which is annealed via a scheduling mechanism based on plateaus in the validation loss." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-60", "text": "The training procedure is run for 200 epochs and we use the checkpoint with minimum validation loss to compute accuracy on the test set." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-61", "text": "The NN-AnswerDist and the Majority baselines are self-descriptive and there are no specific training details that we apply." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-62", "text": "We also train the [5] text embedding model (an LSTM) with the optimization settings described in [5] for 200 epochs." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-63", "text": "Results Detailed results are reported in Table 4 ." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-64", "text": "Following Das et al. [6] , we report the agent's top-1 accuracy on the test set when spawned 10, 20 and 50 steps away from the goal, denoted as T 10 , T 20 and T 50 respectively." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-65", "text": "Since the performance of blindfold baselines are not affected based on where the agent is spawned, their accuracy is same across T 10 , T 20 and T 50 ." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-66", "text": "We observe that the BoW model outperforms all existing methods except NMC(BC+A3C) in the case where agent is spawned very close to the target." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-67", "text": "The Nearest Neighbour method also does pretty well, and only falls behind to PACMAN (BC+REINFORCE) and NMC(BC+A3C) in the T 10 case." 
}, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-68", "text": "The difference in performance b/w the Nearest Neighbour method and BoW is primarily due to the fact that the BoW method leverages validation metrics more effectively, uses distributed word representations and differs in optimization." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-69", "text": "We also observe that the majority baseline achieves an accuracy of only 17.15%, suggesting that the other question-only baselines leverage dataset biases separate from class modes." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-70", "text": "For completeness, we also include a question only baseline derived directly from the EmbodiedQA codebase, which uses only the Question LSTM in the PACMAN model, termed as PACMAN Q-only (LSTM)." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-71", "text": "Note that we only compare the top-1 accuracy of different methods here, and not the navigation performance since it's not directly applicable to these blindfold baselines." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-72", "text": "To better understand the exact bias exploited by the text only models we observe that (a) The questions from training set are largely repeated in the validation and test set, with only 2 and 6 questions being unique to them respectively." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-73", "text": "As noted earlier, this means that models don't need to generalize across unseen combinations of rooms/objects/colors to perform well on this task (b) Despite entropy-pruning, there is a noticeable bias in the answer distribution of EQAv1 questions (see [5, Appendix A])." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-74", "text": "Our results on the Nearest Neighbour baseline confirm this source of bias and explain largely the text model performance." 
}, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-75", "text": "Viewing these results holistically, we conclude that current methods for the EmbodiedQA task are not effective at using context from the environment, and in fact this negatively hampers them." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-77", "text": "Oracles: We now examine whether the EQAv1 dataset and the proposed oracle navigation can improve over pure text baselines, to leverage visual information in the most ideal case." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-78", "text": "We reproduce the settings for training the VQA model 2 ." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-79", "text": "Specifically we train the VQA model described in [6] on the last 5 frames of oracle navigation for 50 epochs with ADAM and a learning rate of 3e \u2212 4 using batch size 20." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-80", "text": "We observe the accuracy is improved over text baselines in this unrealistic setting, but the use of this model with navigation in PACMAN reduces performance to below the text baselines." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-81", "text": "For completeness we benchmark an oracle with our BoW embedding model in place of the LSTM with all other settings kept constant." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-82", "text": "As noted in [5] , we re-iterate that these oracles are far from perfect, as they may not contain the best vantage or context to answer the question." 
}, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-83", "text": "----------------------------------" }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-84", "text": "**T 10**" }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-85", "text": "T 20 T 50 T any Navigation + VQA PACMAN (BC) [5] 48 BOW-CNN VQA-Only 56.5 Table 1 : We compare to the published results from [6] for agent spawned at various steps away from the target: 10, 30, 50, and anywhere in the environment." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-86", "text": "Question-only baselines outperform Navigation+VQA methods except when spawned 10 steps from the target object." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-87", "text": "A VQA-only system with oracle navigation can improve on a pure text baseline but isn't effective when combined with navigation." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-88", "text": "(*) indicates our reproduction of the model described in [5] Error Analysis: To better understand the shortcomings and limitations, we perform an error analysis of the one of the runs of the BoW model on different question types: Here, the color category Preposition Location Color 9.09 51.72 53.31 Table 2 : Accuracy of the BoW model on different question types subsumes color and color_room both." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-89", "text": "The particularly low accuracy on preposition questions is due to the fact that there exist very few questions of this type in the training set (2.44%), and the entropy of answer distribution in this class is much higher compared to color and location question types." 
}, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-90", "text": "----------------------------------" }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-91", "text": "**CONCLUSION**" }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-92", "text": "We show that simple question only baselines largely outperform or closely compete with existing methods on the EmbodiedQA task." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-93", "text": "Our results indicate existing models are not able to convincingly use sensory inputs from the environment to perform question answering, although they have been demonstrated some ability navigate toward the object of interest." }, { "sent_id": "f1e5584a2139160943d9f0338e6ce0-C001-94", "text": "Besides providing a benchmark score for future researchers working on this task, our results suggest considerations for future dataset and task construction in EQA and related tasks." } ], "y": { "@MOT@": { "gold_contexts": [ [ "f1e5584a2139160943d9f0338e6ce0-C001-2", "f1e5584a2139160943d9f0338e6ce0-C001-3", "f1e5584a2139160943d9f0338e6ce0-C001-4" ], [ "f1e5584a2139160943d9f0338e6ce0-C001-14", "f1e5584a2139160943d9f0338e6ce0-C001-15", "f1e5584a2139160943d9f0338e6ce0-C001-16" ], [ "f1e5584a2139160943d9f0338e6ce0-C001-75", "f1e5584a2139160943d9f0338e6ce0-C001-76" ] ], "cite_sentences": [ "f1e5584a2139160943d9f0338e6ce0-C001-2", "f1e5584a2139160943d9f0338e6ce0-C001-3", "f1e5584a2139160943d9f0338e6ce0-C001-14", "f1e5584a2139160943d9f0338e6ce0-C001-75" ] }, "@BACK@": { "gold_contexts": [ [ "f1e5584a2139160943d9f0338e6ce0-C001-2", "f1e5584a2139160943d9f0338e6ce0-C001-3", "f1e5584a2139160943d9f0338e6ce0-C001-4" ], [ "f1e5584a2139160943d9f0338e6ce0-C001-9" ], [ "f1e5584a2139160943d9f0338e6ce0-C001-10" ], [ "f1e5584a2139160943d9f0338e6ce0-C001-14" ], [ "f1e5584a2139160943d9f0338e6ce0-C001-26", "f1e5584a2139160943d9f0338e6ce0-C001-27" ] ], "cite_sentences": [ "f1e5584a2139160943d9f0338e6ce0-C001-2", "f1e5584a2139160943d9f0338e6ce0-C001-3", 
"f1e5584a2139160943d9f0338e6ce0-C001-9", "f1e5584a2139160943d9f0338e6ce0-C001-10", "f1e5584a2139160943d9f0338e6ce0-C001-14", "f1e5584a2139160943d9f0338e6ce0-C001-27" ] }, "@USE@": { "gold_contexts": [ [ "f1e5584a2139160943d9f0338e6ce0-C001-18" ], [ "f1e5584a2139160943d9f0338e6ce0-C001-23" ], [ "f1e5584a2139160943d9f0338e6ce0-C001-70" ], [ "f1e5584a2139160943d9f0338e6ce0-C001-75", "f1e5584a2139160943d9f0338e6ce0-C001-76" ], [ "f1e5584a2139160943d9f0338e6ce0-C001-85" ], [ "f1e5584a2139160943d9f0338e6ce0-C001-88" ], [ "f1e5584a2139160943d9f0338e6ce0-C001-92" ] ], "cite_sentences": [ "f1e5584a2139160943d9f0338e6ce0-C001-18", "f1e5584a2139160943d9f0338e6ce0-C001-70", "f1e5584a2139160943d9f0338e6ce0-C001-75", "f1e5584a2139160943d9f0338e6ce0-C001-85", "f1e5584a2139160943d9f0338e6ce0-C001-88", "f1e5584a2139160943d9f0338e6ce0-C001-92" ] }, "@UNSURE@": { "gold_contexts": [ [ "f1e5584a2139160943d9f0338e6ce0-C001-24" ], [ "f1e5584a2139160943d9f0338e6ce0-C001-80" ] ], "cite_sentences": [ "f1e5584a2139160943d9f0338e6ce0-C001-24", "f1e5584a2139160943d9f0338e6ce0-C001-80" ] }, "@DIF@": { "gold_contexts": [ [ "f1e5584a2139160943d9f0338e6ce0-C001-55", "f1e5584a2139160943d9f0338e6ce0-C001-56", "f1e5584a2139160943d9f0338e6ce0-C001-62" ], [ "f1e5584a2139160943d9f0338e6ce0-C001-67" ], [ "f1e5584a2139160943d9f0338e6ce0-C001-85" ] ], "cite_sentences": [ "f1e5584a2139160943d9f0338e6ce0-C001-55", "f1e5584a2139160943d9f0338e6ce0-C001-62", "f1e5584a2139160943d9f0338e6ce0-C001-67", "f1e5584a2139160943d9f0338e6ce0-C001-85" ] }, "@SIM@": { "gold_contexts": [ [ "f1e5584a2139160943d9f0338e6ce0-C001-62" ], [ "f1e5584a2139160943d9f0338e6ce0-C001-73", "f1e5584a2139160943d9f0338e6ce0-C001-74" ], [ "f1e5584a2139160943d9f0338e6ce0-C001-82" ] ], "cite_sentences": [ "f1e5584a2139160943d9f0338e6ce0-C001-62", "f1e5584a2139160943d9f0338e6ce0-C001-73", "f1e5584a2139160943d9f0338e6ce0-C001-82" ] }, "@FUT@": { "gold_contexts": [ [ "f1e5584a2139160943d9f0338e6ce0-C001-94" ] ], "cite_sentences": [ 
"f1e5584a2139160943d9f0338e6ce0-C001-94" ] } } }, "ABC_b8f090dadfbd01a17912e006e7ccfc_0": { "x": [ { "sent_id": "b8f090dadfbd01a17912e006e7ccfc-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "b8f090dadfbd01a17912e006e7ccfc-C001-2", "text": "Concepts and methods of complex networks have been applied to probe the properties of a myriad of real systems [1] ." }, { "sent_id": "b8f090dadfbd01a17912e006e7ccfc-C001-3", "text": "The finding that written texts modeled as graphs share several properties of other completely different real systems has inspired the study of language as a complex system [2] ." }, { "sent_id": "b8f090dadfbd01a17912e006e7ccfc-C001-4", "text": "Actually, language can be represented as a complex network in its several levels of complexity." }, { "sent_id": "b8f090dadfbd01a17912e006e7ccfc-C001-5", "text": "As a consequence, morphological, syntactical and semantical properties have been employed in the construction of linguistic networks [3] ." }, { "sent_id": "b8f090dadfbd01a17912e006e7ccfc-C001-6", "text": "Even the character level has been useful to unfold particular patterns [4, 5] ." }, { "sent_id": "b8f090dadfbd01a17912e006e7ccfc-C001-7", "text": "In the review by Cong and Liu [6], the authors emphasize the need to use the topological information of complex networks modeling the various spheres of the language to better understand its origins, evolution and organization." }, { "sent_id": "b8f090dadfbd01a17912e006e7ccfc-C001-8", "text": "In addition, the authors cite the use of networks in applications aiming at holistic typology and stylistic variations." }, { "sent_id": "b8f090dadfbd01a17912e006e7ccfc-C001-9", "text": "In this context, I will discuss some possible directions that could be followed in future research directed towards the understanding of language via topological characterization of complex linguistic networks." 
}, { "sent_id": "b8f090dadfbd01a17912e006e7ccfc-C001-10", "text": "In addition, I will comment the use of network models for language processing applications." }, { "sent_id": "b8f090dadfbd01a17912e006e7ccfc-C001-11", "text": "Additional prospects for future practical research lines will also be discussed in this comment." }, { "sent_id": "b8f090dadfbd01a17912e006e7ccfc-C001-12", "text": "The topological analysis of complex textual networks has been widely studied in the recent years." }, { "sent_id": "b8f090dadfbd01a17912e006e7ccfc-C001-13", "text": "As for cooccurrence networks of characters, it was possible to verify that they follow the scale-free and small-world features [4] ." }, { "sent_id": "b8f090dadfbd01a17912e006e7ccfc-C001-14", "text": "Co-occurrence networks of words (or adjacency networks) have accounted for most of the models tackling textual applications." }, { "sent_id": "b8f090dadfbd01a17912e006e7ccfc-C001-15", "text": "In special, they have been more prevalent than syntactical networks because they represent a simplified representation of the complex syntactical analysis [7, 8] , as most of the syntactical links occur between neighboring words." }, { "sent_id": "b8f090dadfbd01a17912e006e7ccfc-C001-16", "text": "Despite its outward simplicity, co-occurrence networks have proven useful in many applications, such as in authorship recognition [9] , extractive summarization [10, 11, 12] , stylistic identification [13] and part-of-speech tagging [14] ." }, { "sent_id": "b8f090dadfbd01a17912e006e7ccfc-C001-17", "text": "Furthermore, such representation has also been useful in the analysis of the complexity [15] and quality of texts [16] ." }, { "sent_id": "b8f090dadfbd01a17912e006e7ccfc-C001-18", "text": "Unfortunately, a major problem arising from the analyses performed with co-occurrence networks is the difficulty to provide a rigorous interpretation of the factors accounting for the success of the model." 
}, { "sent_id": "b8f090dadfbd01a17912e006e7ccfc-C001-19", "text": "Therefore, future investigations should pursue a better interpretation at the network level aiming at the understanding of the fundamental properties of the language." }, { "sent_id": "b8f090dadfbd01a17912e006e7ccfc-C001-20", "text": "Most importantly, it is clear from some recent studies [8, 9 ] that novel topological measurements should be introduced to capture a wider range of linguistic features." }, { "sent_id": "b8f090dadfbd01a17912e006e7ccfc-C001-21", "text": "Many of the applications relying on network analysis outperform other traditional shallow strategies in natural language processing (see e.g. the extractive summarization task [10, 11] )." }, { "sent_id": "b8f090dadfbd01a17912e006e7ccfc-C001-22", "text": "However, when deep analyzes are performed, network-based strategies usually do not perform better than other techniques making extensive use of semantic resources and tools." }, { "sent_id": "b8f090dadfbd01a17912e006e7ccfc-C001-23", "text": "In order to improve the performance of network-based applications, I suggest a twofold research line: (i) the introduction of measurements consistent with the nature of the problem; and (ii) the combination of topological strategies with other traditional natural language processing methods." }, { "sent_id": "b8f090dadfbd01a17912e006e7ccfc-C001-24", "text": "More specifically, in (i), I propose 1 E-mail:diego.raphael@gmail.com, diego@icmc.usp.br" }, { "sent_id": "b8f090dadfbd01a17912e006e7ccfc-C001-25", "text": "December 4, 2014 the conception of measurements that are able to capture semantic aspects, since the topological measurements of co-occurrence networks capture mostly syntactic factors [8] ." }, { "sent_id": "b8f090dadfbd01a17912e006e7ccfc-C001-26", "text": "Although such networks have proved useful in some semantical-dependent tasks (see e.g. 
a topological approach to word sense disambiguation in [17] ), I believe that the creation of novel semantic-based measurements would improve the state of the art." }, { "sent_id": "b8f090dadfbd01a17912e006e7ccfc-C001-27", "text": "Alternative forms of creating the network could also be useful to grasp semantic features hidden in the topological space." }, { "sent_id": "b8f090dadfbd01a17912e006e7ccfc-C001-28", "text": "In (ii), I suggest, for example, the introduction of a hybrid classifier that could consider both linguistic (deeper linguistic processing [18] ) and topological attributes at the same time." }, { "sent_id": "b8f090dadfbd01a17912e006e7ccfc-C001-29", "text": "Examples of combinations of distinct strategies are described in [9] , [19] and [20]." }, { "sent_id": "b8f090dadfbd01a17912e006e7ccfc-C001-30", "text": "In sum, the network framework has proven useful for understanding the properties of language and for its applications, especially those related to textual classification at several levels." }, { "sent_id": "b8f090dadfbd01a17912e006e7ccfc-C001-31", "text": "Despite the limitations imposed by our restricted understanding of the mechanisms behind the classification, it is worth noting that such a representation remains entirely generic, and is therefore useful for many tasks as well as for analyzing the evolution of languages, cultures and emotional trends." }, { "sent_id": "b8f090dadfbd01a17912e006e7ccfc-C001-32", "text": "For this reason, I believe that the use of complex networks in both practical and theoretical investigations shall yield novel insights into the mechanisms behind language."
}, { "sent_id": "b8f090dadfbd01a17912e006e7ccfc-C001-33", "text": "----------------------------------" }, { "sent_id": "b8f090dadfbd01a17912e006e7ccfc-C001-34", "text": "****" }, { "sent_id": "b8f090dadfbd01a17912e006e7ccfc-C001-35", "text": "the conception of measurements that are able to capture semantic aspects, since the topological measurements of co-occurrence networks capture mostly syntactic factors [8] ." }, { "sent_id": "b8f090dadfbd01a17912e006e7ccfc-C001-36", "text": "Although such networks have proved useful in some semantical-dependent tasks (see e.g. a topological approach to word sense disambiguation in [17] ), I believe that the creation of novel semantic-based measurements would improve the state of the art." }, { "sent_id": "b8f090dadfbd01a17912e006e7ccfc-C001-37", "text": "Alternative forms to create the network could also be useful to grasp semantical features hidden in the topological space." }, { "sent_id": "b8f090dadfbd01a17912e006e7ccfc-C001-38", "text": "In (ii), I suggest, for example, the introduction of a hybrid classifier that could consider both linguistic (deeper linguistic processing [18] ) and topological attributes at the same time in a hybrid way." }, { "sent_id": "b8f090dadfbd01a17912e006e7ccfc-C001-39", "text": "Examples of combinations of distinct strategies are described in [9] , [19] and [20] ." }, { "sent_id": "b8f090dadfbd01a17912e006e7ccfc-C001-40", "text": "In sum, the network framework has proven applicable to understand the properties of the language and its applications, especially those related to the textual classification in several levels." }, { "sent_id": "b8f090dadfbd01a17912e006e7ccfc-C001-41", "text": "Despite the limitations imposed by the restrict understanding of the mechanisms behind the classification, it is worth noting that the such representation remains entirely generic, being therefore useful to many tasks as well as for analyzing the evolution of languages, cultures and emotional trends." 
}, { "sent_id": "b8f090dadfbd01a17912e006e7ccfc-C001-42", "text": "For this reason, I believe that the use of complex networks in both practical and theoretical investigations shall yield novels insights into the mechanisms behind the language." } ], "y": { "@BACK@": { "gold_contexts": [ [ "b8f090dadfbd01a17912e006e7ccfc-C001-29" ], [ "b8f090dadfbd01a17912e006e7ccfc-C001-39" ] ], "cite_sentences": [ "b8f090dadfbd01a17912e006e7ccfc-C001-29", "b8f090dadfbd01a17912e006e7ccfc-C001-39" ] } } }, "ABC_f2dfc35b67e47c12cba3cd0ec743a5_1": { "x": [ { "sent_id": "f2dfc35b67e47c12cba3cd0ec743a5-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "f2dfc35b67e47c12cba3cd0ec743a5-C001-2", "text": "Concepts and methods of complex networks have been applied to probe the properties of a myriad of real systems [1] ." }, { "sent_id": "f2dfc35b67e47c12cba3cd0ec743a5-C001-3", "text": "The finding that written texts modeled as graphs share several properties of other completely different real systems has inspired the study of language as a complex system [2] ." }, { "sent_id": "f2dfc35b67e47c12cba3cd0ec743a5-C001-4", "text": "Actually, language can be represented as a complex network in its several levels of complexity." }, { "sent_id": "f2dfc35b67e47c12cba3cd0ec743a5-C001-5", "text": "As a consequence, morphological, syntactical and semantical properties have been employed in the construction of linguistic networks [3] ." }, { "sent_id": "f2dfc35b67e47c12cba3cd0ec743a5-C001-6", "text": "Even the character level has been useful to unfold particular patterns [4, 5] ." }, { "sent_id": "f2dfc35b67e47c12cba3cd0ec743a5-C001-7", "text": "In the review by Cong and Liu [6], the authors emphasize the need to use the topological information of complex networks modeling the various spheres of the language to better understand its origins, evolution and organization." 
}, { "sent_id": "f2dfc35b67e47c12cba3cd0ec743a5-C001-8", "text": "In addition, the authors cite the use of networks in applications aiming at holistic typology and stylistic variations." }, { "sent_id": "f2dfc35b67e47c12cba3cd0ec743a5-C001-9", "text": "In this context, I will discuss some possible directions that could be followed in future research directed towards the understanding of language via topological characterization of complex linguistic networks." }, { "sent_id": "f2dfc35b67e47c12cba3cd0ec743a5-C001-10", "text": "In addition, I will comment the use of network models for language processing applications." }, { "sent_id": "f2dfc35b67e47c12cba3cd0ec743a5-C001-11", "text": "Additional prospects for future practical research lines will also be discussed in this comment." }, { "sent_id": "f2dfc35b67e47c12cba3cd0ec743a5-C001-12", "text": "The topological analysis of complex textual networks has been widely studied in the recent years." }, { "sent_id": "f2dfc35b67e47c12cba3cd0ec743a5-C001-13", "text": "As for cooccurrence networks of characters, it was possible to verify that they follow the scale-free and small-world features [4] ." }, { "sent_id": "f2dfc35b67e47c12cba3cd0ec743a5-C001-14", "text": "Co-occurrence networks of words (or adjacency networks) have accounted for most of the models tackling textual applications." }, { "sent_id": "f2dfc35b67e47c12cba3cd0ec743a5-C001-15", "text": "In special, they have been more prevalent than syntactical networks because they represent a simplified representation of the complex syntactical analysis [7, 8] , as most of the syntactical links occur between neighboring words." }, { "sent_id": "f2dfc35b67e47c12cba3cd0ec743a5-C001-16", "text": "Despite its outward simplicity, co-occurrence networks have proven useful in many applications, such as in authorship recognition [9] , extractive summarization [10, 11, 12] , stylistic identification [13] and part-of-speech tagging [14] ." 
}, { "sent_id": "f2dfc35b67e47c12cba3cd0ec743a5-C001-17", "text": "Furthermore, such representation has also been useful in the analysis of the complexity [15] and quality of texts [16] ." }, { "sent_id": "f2dfc35b67e47c12cba3cd0ec743a5-C001-18", "text": "Unfortunately, a major problem arising from the analyses performed with co-occurrence networks is the difficulty to provide a rigorous interpretation of the factors accounting for the success of the model." }, { "sent_id": "f2dfc35b67e47c12cba3cd0ec743a5-C001-19", "text": "Therefore, future investigations should pursue a better interpretation at the network level aiming at the understanding of the fundamental properties of the language." }, { "sent_id": "f2dfc35b67e47c12cba3cd0ec743a5-C001-20", "text": "Most importantly, it is clear from some recent studies [8, 9 ] that novel topological measurements should be introduced to capture a wider range of linguistic features." }, { "sent_id": "f2dfc35b67e47c12cba3cd0ec743a5-C001-21", "text": "Many of the applications relying on network analysis outperform other traditional shallow strategies in natural language processing (see e.g. the extractive summarization task [10, 11] )." }, { "sent_id": "f2dfc35b67e47c12cba3cd0ec743a5-C001-22", "text": "However, when deep analyzes are performed, network-based strategies usually do not perform better than other techniques making extensive use of semantic resources and tools." }, { "sent_id": "f2dfc35b67e47c12cba3cd0ec743a5-C001-23", "text": "In order to improve the performance of network-based applications, I suggest a twofold research line: (i) the introduction of measurements consistent with the nature of the problem; and (ii) the combination of topological strategies with other traditional natural language processing methods." 
}, { "sent_id": "f2dfc35b67e47c12cba3cd0ec743a5-C001-24", "text": "More specifically, in (i), I propose 1 E-mail:diego.raphael@gmail.com, diego@icmc.usp.br" }, { "sent_id": "f2dfc35b67e47c12cba3cd0ec743a5-C001-25", "text": "December 4, 2014 the conception of measurements that are able to capture semantic aspects, since the topological measurements of co-occurrence networks capture mostly syntactic factors [8] ." }, { "sent_id": "f2dfc35b67e47c12cba3cd0ec743a5-C001-26", "text": "Although such networks have proved useful in some semantical-dependent tasks (see e.g. a topological approach to word sense disambiguation in [17] ), I believe that the creation of novel semantic-based measurements would improve the state of the art." }, { "sent_id": "f2dfc35b67e47c12cba3cd0ec743a5-C001-27", "text": "Alternative forms to create the network could also be useful to grasp semantical features hidden in the topological space." }, { "sent_id": "f2dfc35b67e47c12cba3cd0ec743a5-C001-28", "text": "In (ii), I suggest, for example, the introduction of a hybrid classifier that could consider both linguistic (deeper linguistic processing [18] ) and topological attributes at the same time in a hybrid way." }, { "sent_id": "f2dfc35b67e47c12cba3cd0ec743a5-C001-29", "text": "Examples of combinations of distinct strategies are described in [9] , [19] and [20]." }, { "sent_id": "f2dfc35b67e47c12cba3cd0ec743a5-C001-30", "text": "In sum, the network framework has proven applicable to understand the properties of the language and its applications, especially those related to the textual classification in several levels." }, { "sent_id": "f2dfc35b67e47c12cba3cd0ec743a5-C001-31", "text": "Despite the limitations imposed by the restrict understanding of the mechanisms behind the classification, it is worth noting that the such representation remains entirely generic, being therefore useful to many tasks as well as for analyzing the evolution of languages, cultures and emotional trends." 
}, { "sent_id": "f2dfc35b67e47c12cba3cd0ec743a5-C001-32", "text": "For this reason, I believe that the use of complex networks in both practical and theoretical investigations shall yield novels insights into the mechanisms behind the language." }, { "sent_id": "f2dfc35b67e47c12cba3cd0ec743a5-C001-33", "text": "----------------------------------" }, { "sent_id": "f2dfc35b67e47c12cba3cd0ec743a5-C001-34", "text": "****" }, { "sent_id": "f2dfc35b67e47c12cba3cd0ec743a5-C001-35", "text": "the conception of measurements that are able to capture semantic aspects, since the topological measurements of co-occurrence networks capture mostly syntactic factors [8] ." }, { "sent_id": "f2dfc35b67e47c12cba3cd0ec743a5-C001-36", "text": "Although such networks have proved useful in some semantical-dependent tasks (see e.g. a topological approach to word sense disambiguation in [17] ), I believe that the creation of novel semantic-based measurements would improve the state of the art." }, { "sent_id": "f2dfc35b67e47c12cba3cd0ec743a5-C001-37", "text": "Alternative forms to create the network could also be useful to grasp semantical features hidden in the topological space." }, { "sent_id": "f2dfc35b67e47c12cba3cd0ec743a5-C001-38", "text": "In (ii), I suggest, for example, the introduction of a hybrid classifier that could consider both linguistic (deeper linguistic processing [18] ) and topological attributes at the same time in a hybrid way." }, { "sent_id": "f2dfc35b67e47c12cba3cd0ec743a5-C001-39", "text": "Examples of combinations of distinct strategies are described in [9] , [19] and [20] ." }, { "sent_id": "f2dfc35b67e47c12cba3cd0ec743a5-C001-40", "text": "In sum, the network framework has proven applicable to understand the properties of the language and its applications, especially those related to the textual classification in several levels." 
}, { "sent_id": "f2dfc35b67e47c12cba3cd0ec743a5-C001-41", "text": "Despite the limitations imposed by the restrict understanding of the mechanisms behind the classification, it is worth noting that the such representation remains entirely generic, being therefore useful to many tasks as well as for analyzing the evolution of languages, cultures and emotional trends." }, { "sent_id": "f2dfc35b67e47c12cba3cd0ec743a5-C001-42", "text": "For this reason, I believe that the use of complex networks in both practical and theoretical investigations shall yield novels insights into the mechanisms behind the language." } ], "y": { "@BACK@": { "gold_contexts": [ [ "f2dfc35b67e47c12cba3cd0ec743a5-C001-26" ], [ "f2dfc35b67e47c12cba3cd0ec743a5-C001-36" ] ], "cite_sentences": [ "f2dfc35b67e47c12cba3cd0ec743a5-C001-26", "f2dfc35b67e47c12cba3cd0ec743a5-C001-36" ] } } }, "ABC_07a2b256766020450c85eae2839db8_1": { "x": [ { "sent_id": "07a2b256766020450c85eae2839db8-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "07a2b256766020450c85eae2839db8-C001-2", "text": "Concepts and methods of complex networks have been applied to probe the properties of a myriad of real systems [1] ." }, { "sent_id": "07a2b256766020450c85eae2839db8-C001-3", "text": "The finding that written texts modeled as graphs share several properties of other completely different real systems has inspired the study of language as a complex system [2] ." }, { "sent_id": "07a2b256766020450c85eae2839db8-C001-4", "text": "Actually, language can be represented as a complex network in its several levels of complexity." }, { "sent_id": "07a2b256766020450c85eae2839db8-C001-5", "text": "As a consequence, morphological, syntactical and semantical properties have been employed in the construction of linguistic networks [3] ." }, { "sent_id": "07a2b256766020450c85eae2839db8-C001-6", "text": "Even the character level has been useful to unfold particular patterns [4, 5] ." 
}, { "sent_id": "07a2b256766020450c85eae2839db8-C001-7", "text": "In the review by Cong and Liu [6], the authors emphasize the need to use the topological information of complex networks modeling the various spheres of the language to better understand its origins, evolution and organization." }, { "sent_id": "07a2b256766020450c85eae2839db8-C001-8", "text": "In addition, the authors cite the use of networks in applications aiming at holistic typology and stylistic variations." }, { "sent_id": "07a2b256766020450c85eae2839db8-C001-9", "text": "In this context, I will discuss some possible directions that could be followed in future research directed towards the understanding of language via topological characterization of complex linguistic networks." }, { "sent_id": "07a2b256766020450c85eae2839db8-C001-10", "text": "In addition, I will comment the use of network models for language processing applications." }, { "sent_id": "07a2b256766020450c85eae2839db8-C001-11", "text": "Additional prospects for future practical research lines will also be discussed in this comment." }, { "sent_id": "07a2b256766020450c85eae2839db8-C001-12", "text": "The topological analysis of complex textual networks has been widely studied in the recent years." }, { "sent_id": "07a2b256766020450c85eae2839db8-C001-13", "text": "As for cooccurrence networks of characters, it was possible to verify that they follow the scale-free and small-world features [4] ." }, { "sent_id": "07a2b256766020450c85eae2839db8-C001-14", "text": "Co-occurrence networks of words (or adjacency networks) have accounted for most of the models tackling textual applications." }, { "sent_id": "07a2b256766020450c85eae2839db8-C001-15", "text": "In special, they have been more prevalent than syntactical networks because they represent a simplified representation of the complex syntactical analysis [7, 8] , as most of the syntactical links occur between neighboring words." 
}, { "sent_id": "07a2b256766020450c85eae2839db8-C001-16", "text": "Despite its outward simplicity, co-occurrence networks have proven useful in many applications, such as in authorship recognition [9] , extractive summarization [10, 11, 12] , stylistic identification [13] and part-of-speech tagging [14] ." }, { "sent_id": "07a2b256766020450c85eae2839db8-C001-17", "text": "Furthermore, such representation has also been useful in the analysis of the complexity [15] and quality of texts [16] ." }, { "sent_id": "07a2b256766020450c85eae2839db8-C001-18", "text": "Unfortunately, a major problem arising from the analyses performed with co-occurrence networks is the difficulty to provide a rigorous interpretation of the factors accounting for the success of the model." }, { "sent_id": "07a2b256766020450c85eae2839db8-C001-19", "text": "Therefore, future investigations should pursue a better interpretation at the network level aiming at the understanding of the fundamental properties of the language." }, { "sent_id": "07a2b256766020450c85eae2839db8-C001-20", "text": "Most importantly, it is clear from some recent studies [8, 9 ] that novel topological measurements should be introduced to capture a wider range of linguistic features." }, { "sent_id": "07a2b256766020450c85eae2839db8-C001-21", "text": "Many of the applications relying on network analysis outperform other traditional shallow strategies in natural language processing (see e.g. the extractive summarization task [10, 11] )." }, { "sent_id": "07a2b256766020450c85eae2839db8-C001-22", "text": "However, when deep analyzes are performed, network-based strategies usually do not perform better than other techniques making extensive use of semantic resources and tools." 
}, { "sent_id": "07a2b256766020450c85eae2839db8-C001-23", "text": "In order to improve the performance of network-based applications, I suggest a twofold research line: (i) the introduction of measurements consistent with the nature of the problem; and (ii) the combination of topological strategies with other traditional natural language processing methods." }, { "sent_id": "07a2b256766020450c85eae2839db8-C001-24", "text": "More specifically, in (i), I propose 1 E-mail:diego.raphael@gmail.com, diego@icmc.usp.br" }, { "sent_id": "07a2b256766020450c85eae2839db8-C001-25", "text": "December 4, 2014 the conception of measurements that are able to capture semantic aspects, since the topological measurements of co-occurrence networks capture mostly syntactic factors [8] ." }, { "sent_id": "07a2b256766020450c85eae2839db8-C001-26", "text": "Although such networks have proved useful in some semantical-dependent tasks (see e.g. a topological approach to word sense disambiguation in [17] ), I believe that the creation of novel semantic-based measurements would improve the state of the art." }, { "sent_id": "07a2b256766020450c85eae2839db8-C001-27", "text": "Alternative forms to create the network could also be useful to grasp semantical features hidden in the topological space." }, { "sent_id": "07a2b256766020450c85eae2839db8-C001-28", "text": "In (ii), I suggest, for example, the introduction of a hybrid classifier that could consider both linguistic (deeper linguistic processing [18] ) and topological attributes at the same time in a hybrid way." }, { "sent_id": "07a2b256766020450c85eae2839db8-C001-29", "text": "Examples of combinations of distinct strategies are described in [9] , [19] and [20]." }, { "sent_id": "07a2b256766020450c85eae2839db8-C001-30", "text": "In sum, the network framework has proven applicable to understand the properties of the language and its applications, especially those related to the textual classification in several levels." 
}, { "sent_id": "07a2b256766020450c85eae2839db8-C001-31", "text": "Despite the limitations imposed by the restrict understanding of the mechanisms behind the classification, it is worth noting that the such representation remains entirely generic, being therefore useful to many tasks as well as for analyzing the evolution of languages, cultures and emotional trends." }, { "sent_id": "07a2b256766020450c85eae2839db8-C001-32", "text": "For this reason, I believe that the use of complex networks in both practical and theoretical investigations shall yield novels insights into the mechanisms behind the language." }, { "sent_id": "07a2b256766020450c85eae2839db8-C001-33", "text": "----------------------------------" }, { "sent_id": "07a2b256766020450c85eae2839db8-C001-34", "text": "****" }, { "sent_id": "07a2b256766020450c85eae2839db8-C001-35", "text": "the conception of measurements that are able to capture semantic aspects, since the topological measurements of co-occurrence networks capture mostly syntactic factors [8] ." }, { "sent_id": "07a2b256766020450c85eae2839db8-C001-36", "text": "Although such networks have proved useful in some semantical-dependent tasks (see e.g. a topological approach to word sense disambiguation in [17] ), I believe that the creation of novel semantic-based measurements would improve the state of the art." }, { "sent_id": "07a2b256766020450c85eae2839db8-C001-37", "text": "Alternative forms to create the network could also be useful to grasp semantical features hidden in the topological space." }, { "sent_id": "07a2b256766020450c85eae2839db8-C001-38", "text": "In (ii), I suggest, for example, the introduction of a hybrid classifier that could consider both linguistic (deeper linguistic processing [18] ) and topological attributes at the same time in a hybrid way." }, { "sent_id": "07a2b256766020450c85eae2839db8-C001-39", "text": "Examples of combinations of distinct strategies are described in [9] , [19] and [20] ." 
}, { "sent_id": "07a2b256766020450c85eae2839db8-C001-40", "text": "In sum, the network framework has proven applicable to understand the properties of the language and its applications, especially those related to the textual classification in several levels." }, { "sent_id": "07a2b256766020450c85eae2839db8-C001-41", "text": "Despite the limitations imposed by the restrict understanding of the mechanisms behind the classification, it is worth noting that the such representation remains entirely generic, being therefore useful to many tasks as well as for analyzing the evolution of languages, cultures and emotional trends." }, { "sent_id": "07a2b256766020450c85eae2839db8-C001-42", "text": "For this reason, I believe that the use of complex networks in both practical and theoretical investigations shall yield novels insights into the mechanisms behind the language." } ], "y": { "@BACK@": { "gold_contexts": [ [ "07a2b256766020450c85eae2839db8-C001-20" ] ], "cite_sentences": [ "07a2b256766020450c85eae2839db8-C001-20" ] }, "@FUT@": { "gold_contexts": [ [ "07a2b256766020450c85eae2839db8-C001-19", "07a2b256766020450c85eae2839db8-C001-20" ] ], "cite_sentences": [ "07a2b256766020450c85eae2839db8-C001-20" ] }, "@MOT@": { "gold_contexts": [ [ "07a2b256766020450c85eae2839db8-C001-19", "07a2b256766020450c85eae2839db8-C001-20" ] ], "cite_sentences": [ "07a2b256766020450c85eae2839db8-C001-20" ] }, "@USE@": { "gold_contexts": [ [ "07a2b256766020450c85eae2839db8-C001-24", "07a2b256766020450c85eae2839db8-C001-25" ] ], "cite_sentences": [ "07a2b256766020450c85eae2839db8-C001-25" ] } } }, "ABC_485464efe36b1ca1f6dc4e1cf8d474_1": { "x": [ { "sent_id": "485464efe36b1ca1f6dc4e1cf8d474-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "485464efe36b1ca1f6dc4e1cf8d474-C001-2", "text": "Given the current growth in research and related emerging technologies in machine learning and deep learning, it is timely to introduce this tutorial to a large number of researchers and practitioners who are 
attending COLING 2018 and working on statistical models, deep neural networks, sequential learning and natural language understanding." }, { "sent_id": "485464efe36b1ca1f6dc4e1cf8d474-C001-3", "text": "To the best of our knowledge, no similar tutorial has been presented at previous ACL/COLING/EMNLP/NAACL conferences." }, { "sent_id": "485464efe36b1ca1f6dc4e1cf8d474-C001-4", "text": "This three-hour tutorial will concentrate on a wide range of theories and applications and systematically present the recent advances in deep Bayesian and sequential learning which are impacting the communities of computational linguistics, human language technology and machine learning for natural language processing." }, { "sent_id": "485464efe36b1ca1f6dc4e1cf8d474-C001-5", "text": "----------------------------------" }, { "sent_id": "485464efe36b1ca1f6dc4e1cf8d474-C001-6", "text": "**TUTORIAL DESCRIPTION**" }, { "sent_id": "485464efe36b1ca1f6dc4e1cf8d474-C001-7", "text": "This tutorial introduces the advances in deep Bayesian learning with abundant applications for natural language understanding, ranging from speech recognition (Saon and Chien, 2012; Chan et al., 2016) to document summarization (Chang and Chien, 2009), text classification (Blei et al., 2003; Zhang et al., 2015), text segmentation (Chien and Chueh, 2012), information extraction (Narasimhan et al., 2016), image caption generation (Vinyals et al., 2015; Xu et al., 2015), sentence generation (Li et al., 2016b), dialogue control (Zhao and Eskenazi, 2016; Li et al., 2016a), sentiment classification, recommendation systems, question answering (Sukhbaatar et al., 2015) and machine translation, to name a few." }, { "sent_id": "485464efe36b1ca1f6dc4e1cf8d474-C001-8", "text": "Traditionally, \"deep learning\" is taken to be a learning process where the inference or optimization is based on a real-valued deterministic model."
}, { "sent_id": "485464efe36b1ca1f6dc4e1cf8d474-C001-9", "text": "The \"semantic structure\" in words, sentences, entities, actions and documents drawn from a large vocabulary may not be well expressed or correctly optimized in mathematical logic or computer programs." }, { "sent_id": "485464efe36b1ca1f6dc4e1cf8d474-C001-10", "text": "The \"distribution function\" in discrete or continuous latent variable model for natural language may not be properly decomposed or estimated in model inference." }, { "sent_id": "485464efe36b1ca1f6dc4e1cf8d474-C001-11", "text": "This tutorial addresses the fundamentals of statistical models and neural networks, and focus on a series of advanced Bayesian models and deep models including hierarchical Dirichlet process , Chinese restaurant process (Blei et al., 2010) , hierarchical Pitman-Yor process (Teh, 2006) , Indian buffet process (Ghahramani and Griffiths, 2005) , recurrent neural network (Mikolov et al., 2010; Van Den Oord et al., 2016) , long short-term memory (Hochreiter and Schmidhuber, 1997; , sequence-to-sequence model (Sutskever et al., 2014), variational auto-encoder (Kingma and Welling, 2014) , generative adversarial network (Goodfellow et al., 2014) , attention mechanism (Chorowski et al., 2015; Seo et al., 2016) , memory-augmented neural network (Graves et al., 2014; Graves et al., 2014) , stochastic neural network Miao et al., 2016) , predictive state neural network (Downey et al., 2017) , policy gradient (Yu et al., 2017) and reinforcement learning (Mnih et al., 2015) ." }, { "sent_id": "485464efe36b1ca1f6dc4e1cf8d474-C001-12", "text": "We present how these models are connected and why they work for a variety of applications on symbolic and complex patterns in natural language." }, { "sent_id": "485464efe36b1ca1f6dc4e1cf8d474-C001-13", "text": "The variational inference and sampling method are formulated to tackle the optimization for complicated models (Rezende et al., 2014) ." 
}, { "sent_id": "485464efe36b1ca1f6dc4e1cf8d474-C001-14", "text": "The word and sentence embeddings, clustering and co-clustering are merged with linguistic and semantic constraints." }, { "sent_id": "485464efe36b1ca1f6dc4e1cf8d474-C001-15", "text": "A series of case studies are presented to tackle different issues in deep Bayesian learning and understanding." }, { "sent_id": "485464efe36b1ca1f6dc4e1cf8d474-C001-16", "text": "At last, we point out a number of directions and outlooks for future studies." }, { "sent_id": "485464efe36b1ca1f6dc4e1cf8d474-C001-17", "text": "The presentation of this tutorial is arranged into five parts." }, { "sent_id": "485464efe36b1ca1f6dc4e1cf8d474-C001-18", "text": "First of all, we share the current status of researches on natural language understanding, statistical modeling and deep neural network and explain the key issues in deep Bayesian learning for discrete-valued observation data and latent semantics." }, { "sent_id": "485464efe36b1ca1f6dc4e1cf8d474-C001-19", "text": "A new paradigm called the symbolic neural learning is introduced to extend how data analysis is performed from language processing to semantic learning and memory networking." }, { "sent_id": "485464efe36b1ca1f6dc4e1cf8d474-C001-20", "text": "Secondly, we address a number of Bayesian models ranging from latent variable model to VB inference (Chien and Chang, 2014; Chien and Chueh, 2011; Chien, 2015b) , MCMC sampling (Watanabe and Chien, 2015) and BNP learning (Chien, 2016; Chien, 2015a; Chien, 2018) for hierarchical, thematic and sparse topics from natural language." 
}, { "sent_id": "485464efe36b1ca1f6dc4e1cf8d474-C001-21", "text": "In the third part, a series of deep models including deep unfolding (Chien and Lee, 2018) , Bayesian RNN (Gal and Ghahramani, 2016; Chien and Ku, 2016) , sequence-to-sequence learning (Graves et al., 2006; Gehring et al., 2017) , CNN (Kalchbrenner et al., 2014; Xingjian et al., 2015; , GAN (Tsai and Chien, 2017) and VAE are introduced." }, { "sent_id": "485464efe36b1ca1f6dc4e1cf8d474-C001-22", "text": "The coffee break is arranged within this part." }, { "sent_id": "485464efe36b1ca1f6dc4e1cf8d474-C001-23", "text": "Next, the fourth part focuses on a variety of advanced studies which illustrate how deep Bayesian learning is developed to infer the sophisticated recurrent models for natural language understanding." }, { "sent_id": "485464efe36b1ca1f6dc4e1cf8d474-C001-24", "text": "In particular, the memory network Chien and Lin, 2018) , neural variational learning (Serban et al., 2017; Chung et al., 2015) , neural discrete representation (Jang et al., 2016; Maddison et al., 2016; van den Oord et al., 2017) , recurrent ladder network (Rasmus et al., 2015; Pr\u00e9mont-Schwarz et al., 2017; S\u00f8nderby et al., 2016) , stochastic neural network (Fraccaro et al., 2016; Goyal et al., 2017; Shabanian et al., 2017) , Markov recurrent neural network (Venkatraman et al., 2017; Kuo and Chien, 2018) , sequence GAN (Yu et al., 2017) and reinforcement learning (Tegho et al., 2017) are introduced in various deep models which open a window to more practical tasks, e.g. reading comprehension, sentence generation, dialogue system, question answering and machine translation." }, { "sent_id": "485464efe36b1ca1f6dc4e1cf8d474-C001-25", "text": "In the final part, we spotlight on some future directions for deep language understanding which can handle the challenges of big data, heterogeneous condition and dynamic system." 
}, { "sent_id": "485464efe36b1ca1f6dc4e1cf8d474-C001-26", "text": "In particular, deep learning, structural learning, temporal modeling, long history representation and stochastic learning are emphasized." }, { "sent_id": "485464efe36b1ca1f6dc4e1cf8d474-C001-27", "text": "Slides of this tutorial are available at http://chien.cm.nctu.edu.tw/home/coling/. APSIPA 2013 , ISCSLP 2014 , Interspeech 2013 , 2016 and ICASSP 2012" }, { "sent_id": "485464efe36b1ca1f6dc4e1cf8d474-C001-28", "text": "----------------------------------" }, { "sent_id": "485464efe36b1ca1f6dc4e1cf8d474-C001-29", "text": "**INSTRUCTOR**" } ], "y": { "@USE@": { "gold_contexts": [ [ "485464efe36b1ca1f6dc4e1cf8d474-C001-7" ] ], "cite_sentences": [ "485464efe36b1ca1f6dc4e1cf8d474-C001-7" ] }, "@BACK@": { "gold_contexts": [ [ "485464efe36b1ca1f6dc4e1cf8d474-C001-7" ] ], "cite_sentences": [ "485464efe36b1ca1f6dc4e1cf8d474-C001-7" ] } } }, "ABC_684a637d08e8dbabddc1f1982f5393_1": { "x": [ { "sent_id": "684a637d08e8dbabddc1f1982f5393-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "684a637d08e8dbabddc1f1982f5393-C001-2", "text": "The goal of spoken dialogue systems (SDS) is to offer efficient and natural access to applications and services." }, { "sent_id": "684a637d08e8dbabddc1f1982f5393-C001-3", "text": "A common task for SDS is to help users select a suitable option (e.g., flight, hotel, restaurant) from the set of options available." }, { "sent_id": "684a637d08e8dbabddc1f1982f5393-C001-4", "text": "When the number of options is small, they can simply be presented sequentially." }, { "sent_id": "684a637d08e8dbabddc1f1982f5393-C001-5", "text": "However, as the number of options increases, the system must have strategies for summarizing the options to enable the user to browse the option space." 
}, { "sent_id": "684a637d08e8dbabddc1f1982f5393-C001-6", "text": "In this talk, we evaluate two recent approaches to information presentation in SDS: (1) the Refiner approach (Polifroni et al., 2003) which generates summaries by clustering the options to maximize coverage of the domain, and (2) the user-model based summarize and refine (UMSR) approach (Demberg and Moore, 2006) which clusters options to maximize utility with respect to a user model, and uses linguistic devices (e.g., discourse cues, adverbials) to highlight the trade-offs among the presented items." }, { "sent_id": "684a637d08e8dbabddc1f1982f5393-C001-7", "text": "To evaluate these strategies, we go beyond the typical \"overhearer\" evaluation methodology, in which participants read or listen to pre-prepared dialogues, which limits the evaluation criteria to users' perceptions (e.g., informativeness, ease of comprehension)." }, { "sent_id": "684a637d08e8dbabddc1f1982f5393-C001-8", "text": "Using a Wizard-of-Oz methodology to evaluate the approaches in an interactive setting, we show that in addition to being preferred by users, the UMSR approach is superior to the Refiner approach in terms of both task success and dialogue efficiency, even when the user is performing a demanding secondary task." }, { "sent_id": "684a637d08e8dbabddc1f1982f5393-C001-9", "text": "Finally, we hypothesize that UMSR is more effective because it uses linguistic devices to highlight relations (e.g., trade-offs) between options and attributes." }, { "sent_id": "684a637d08e8dbabddc1f1982f5393-C001-10", "text": "We report the results of two studies which show that the discourse cues in UMSR summaries help users compare different options and choose between options, even though they do not improve verbatim recall." 
}, { "sent_id": "684a637d08e8dbabddc1f1982f5393-C001-11", "text": "----------------------------------" }, { "sent_id": "684a637d08e8dbabddc1f1982f5393-C001-12", "text": "****" }, { "sent_id": "684a637d08e8dbabddc1f1982f5393-C001-13", "text": "The goal of spoken dialogue systems (SDS) is to offer efficient and natural access to applications and services." }, { "sent_id": "684a637d08e8dbabddc1f1982f5393-C001-14", "text": "A common task for SDS is to help users select a suitable option (e.g., flight, hotel, restaurant) from the set of options available." }, { "sent_id": "684a637d08e8dbabddc1f1982f5393-C001-15", "text": "When the number of options is small, they can simply be presented sequentially." }, { "sent_id": "684a637d08e8dbabddc1f1982f5393-C001-16", "text": "However, as the number of options increases, the system must have strategies for summarizing the options to enable the user to browse the option space." }, { "sent_id": "684a637d08e8dbabddc1f1982f5393-C001-17", "text": "In this talk, we evaluate two recent approaches to information presentation in SDS: (1) the Refiner approach (Polifroni et al., 2003) which generates summaries by clustering the options to maximize coverage of the domain, and (2) the user-model based summarize and refine (UMSR) approach (Demberg and Moore, 2006) which clusters options to maximize utility with respect to a user model, and uses linguistic devices (e.g., discourse cues, adverbials) to highlight the trade-offs among the presented items." }, { "sent_id": "684a637d08e8dbabddc1f1982f5393-C001-18", "text": "To evaluate these strategies, we go beyond the typical \"overhearer\" evaluation methodology, in which participants read or listen to pre-prepared dialogues, which limits the evaluation criteria to users' perceptions (e.g., informativeness, ease of comprehension)." 
}, { "sent_id": "684a637d08e8dbabddc1f1982f5393-C001-19", "text": "Using a Wizard-of-Oz methodology to evaluate the approaches in an interactive setting, we show that in addition to being preferred by users, the UMSR approach is superior to the Refiner approach in terms of both task success and dialogue efficiency, even when the user is performing a demanding secondary task." }, { "sent_id": "684a637d08e8dbabddc1f1982f5393-C001-20", "text": "Finally, we hypothesize that UMSR is more effective because it uses linguistic devices to highlight relations (e.g., trade-offs) between options and attributes." }, { "sent_id": "684a637d08e8dbabddc1f1982f5393-C001-21", "text": "We report the results of two studies which show that the discourse cues in UMSR summaries help users compare different options and choose between options, even though they do not improve verbatim recall." } ], "y": { "@USE@": { "gold_contexts": [ [ "684a637d08e8dbabddc1f1982f5393-C001-6", "684a637d08e8dbabddc1f1982f5393-C001-8", "684a637d08e8dbabddc1f1982f5393-C001-9" ], [ "684a637d08e8dbabddc1f1982f5393-C001-10" ] ], "cite_sentences": [ "684a637d08e8dbabddc1f1982f5393-C001-6", "684a637d08e8dbabddc1f1982f5393-C001-8", "684a637d08e8dbabddc1f1982f5393-C001-9", "684a637d08e8dbabddc1f1982f5393-C001-10" ] }, "@DIF@": { "gold_contexts": [ [ "684a637d08e8dbabddc1f1982f5393-C001-8", "684a637d08e8dbabddc1f1982f5393-C001-9" ] ], "cite_sentences": [ "684a637d08e8dbabddc1f1982f5393-C001-8", "684a637d08e8dbabddc1f1982f5393-C001-9" ] } } }, "ABC_418016e4df12f80205cadc41119244_1": { "x": [ { "sent_id": "418016e4df12f80205cadc41119244-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "418016e4df12f80205cadc41119244-C001-2", "text": "Conditions are essential in the statements of biological literature." 
}, { "sent_id": "418016e4df12f80205cadc41119244-C001-3", "text": "Without the conditions (e.g., environment, equipment) that were precisely specified, the facts (e.g., observations) in the statements may no longer be valid." }, { "sent_id": "418016e4df12f80205cadc41119244-C001-4", "text": "One biological statement has one or multiple fact(s) and/or condition(s)." }, { "sent_id": "418016e4df12f80205cadc41119244-C001-5", "text": "Their subject and object can be either a concept or a concept's attribute." }, { "sent_id": "418016e4df12f80205cadc41119244-C001-6", "text": "Existing information extraction methods do not consider the role of condition in the biological statement nor the role of attribute in the subject/object." }, { "sent_id": "418016e4df12f80205cadc41119244-C001-7", "text": "In this work, we design a new tag schema and propose a deep sequence tagging framework to structure conditional statement into fact and condition tuples from biological text." }, { "sent_id": "418016e4df12f80205cadc41119244-C001-8", "text": "Experiments demonstrate that our method yields a information-lossless structure of the literature." }, { "sent_id": "418016e4df12f80205cadc41119244-C001-9", "text": "----------------------------------" }, { "sent_id": "418016e4df12f80205cadc41119244-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "418016e4df12f80205cadc41119244-C001-11", "text": "Extracting information from biological text plays an important role in biological knowledge graph construction, relational inference, hypothesis generation and validation." }, { "sent_id": "418016e4df12f80205cadc41119244-C001-12", "text": "Open IE systems extract a diverse set of relational tuples without requiring any relationspecific schema in advance, which was supposed to be ideally suited to the biological corpora." }, { "sent_id": "418016e4df12f80205cadc41119244-C001-13", "text": "For example, given a statement sentence in a biochemistry paper [3] : \"We observed that ... 
alkaline pH increases the activity of TRPV5/V6 channels in Jurkat T cells." }, { "sent_id": "418016e4df12f80205cadc41119244-C001-14", "text": "\", an Open IE system [2] would return a (subject, relation, object)-tuple (alkaline pH, increases, activity of TRPV5/V6 channels in Jurkat T cells)." }, { "sent_id": "418016e4df12f80205cadc41119244-C001-15", "text": "Such an information representation has two problems." }, { "sent_id": "418016e4df12f80205cadc41119244-C001-16", "text": "First, the condition that the channels were in the Jurkat T cells, on which the observation was obtained, remained unstructured in the object argument." }, { "sent_id": "418016e4df12f80205cadc41119244-C001-17", "text": "In biological domains, conditions are essential in claiming observations and hypotheses [1] ." }, { "sent_id": "418016e4df12f80205cadc41119244-C001-18", "text": "Second, the object name was long, infrequent, and less likely to be associated with other tuple extractions, leading to knowledge graph sparsity and poor inference, though the concept \"TRPV5/V6 channels\" and its attribute \"activity\" could be frequently mentioned." }, { "sent_id": "418016e4df12f80205cadc41119244-C001-19", "text": "Unlike the general domain where relations are often linking two named entities, in biological domains, the relations could be linking either biological concepts or a concept's particular attributes." }, { "sent_id": "418016e4df12f80205cadc41119244-C001-20", "text": "A well-designed information structure of biological statement is desired." }, { "sent_id": "418016e4df12f80205cadc41119244-C001-21", "text": "We propose a new information extraction task named \"Biological Conditional Statement Structuring\" (BioCS)." }, { "sent_id": "418016e4df12f80205cadc41119244-C001-22", "text": "Given a biological statement sentence, BioCS extracts fact tuple(s) as well as condition tuple(s) on which the fact(s) were observed or claimed." 
}, { "sent_id": "418016e4df12f80205cadc41119244-C001-23", "text": "The expected outputs of the example sentence are: Fact 1: (alkaline_pH, increases, {TRPV5/V6 _channels: activity}), and Condition 1: (TRPV5/V6_channels, in, Jurkat _T_cells)." }, { "sent_id": "418016e4df12f80205cadc41119244-C001-24", "text": "The subject or object in the tuple is formatted as {concept: attribute} where the attribute could be null if it is a concept only." }, { "sent_id": "418016e4df12f80205cadc41119244-C001-25", "text": "Actually we find Inspired by [2] , we formulate BioCS as a sequence tagging problem." }, { "sent_id": "418016e4df12f80205cadc41119244-C001-26", "text": "We propose a new deep sequence tagging framework to achieve the goal." }, { "sent_id": "418016e4df12f80205cadc41119244-C001-27", "text": "Experiments on a data set of 141M sentences from PubMed paper abstracts show that the biological knowledge graph we constructed provide a good understanding of biological statements (https://scikg.github.io)." }, { "sent_id": "418016e4df12f80205cadc41119244-C001-28", "text": "----------------------------------" }, { "sent_id": "418016e4df12f80205cadc41119244-C001-29", "text": "**FRAMEWORK DESCRIPTION**" }, { "sent_id": "418016e4df12f80205cadc41119244-C001-30", "text": "Our framework has two modules: (1) a deep sequence tagging model, taking multiple language feature sequences as inputs and returning multiple fact and condition tuples; (2) an iterative self-training scheme with massive unlabeled data to enhance the model." } ], "y": { "@MOT@": { "gold_contexts": [ [ "418016e4df12f80205cadc41119244-C001-25" ] ], "cite_sentences": [ "418016e4df12f80205cadc41119244-C001-25" ] } } }, "ABC_53bc4206427e95d600f787e0531df1_1": { "x": [ { "sent_id": "53bc4206427e95d600f787e0531df1-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "53bc4206427e95d600f787e0531df1-C001-2", "text": "Topic models learn topics base on the amount of the word co-occurrence in the documents." 
}, { "sent_id": "53bc4206427e95d600f787e0531df1-C001-3", "text": "The word co-occurrence is a degree which describes how often the two words appear together." }, { "sent_id": "53bc4206427e95d600f787e0531df1-C001-4", "text": "BTM, discovers topics from bi-terms in the whole corpus to overcome the lack of local word co-occurrence information." }, { "sent_id": "53bc4206427e95d600f787e0531df1-C001-5", "text": "However, BTM will make the common words be performed excessively because BTM identifies the word co-occurrence information by the bi-term" }, { "sent_id": "53bc4206427e95d600f787e0531df1-C001-6", "text": "----------------------------------" }, { "sent_id": "53bc4206427e95d600f787e0531df1-C001-7", "text": "****" }, { "sent_id": "53bc4206427e95d600f787e0531df1-C001-8", "text": "frequency in corpus-level." }, { "sent_id": "53bc4206427e95d600f787e0531df1-C001-9", "text": "Thus, we propose a PMI-\u03b2 priors methods on BTM." }, { "sent_id": "53bc4206427e95d600f787e0531df1-C001-10", "text": "Our PMI-\u03b2 priors method can adjust the co-occurrence score to prevent the common words problem." }, { "sent_id": "53bc4206427e95d600f787e0531df1-C001-11", "text": "Next, we will describe the detail of our method of PMI-\u03b2 priors." }, { "sent_id": "53bc4206427e95d600f787e0531df1-C001-12", "text": "However, just consider the frequency of bi-term in corpus-level will generate the topics which contain too many common words." }, { "sent_id": "53bc4206427e95d600f787e0531df1-C001-13", "text": "To solve this problem, we consider the Pointwise Mutual Information (PMI) [9] ." }, { "sent_id": "53bc4206427e95d600f787e0531df1-C001-14", "text": "Since the PMI score not only considers the co-occurrence frequency of the two words, but also normalizes by the single word frequency." }, { "sent_id": "53bc4206427e95d600f787e0531df1-C001-15", "text": "Thus, we want to apply PMI score in the original BTM." 
}, { "sent_id": "53bc4206427e95d600f787e0531df1-C001-16", "text": "A suitable way to apply PMI scores is modifying the priors in the BTM." }, { "sent_id": "53bc4206427e95d600f787e0531df1-C001-17", "text": "The reason is that the priors modifying will not increase the complexity in the generation model and very intuitive." }, { "sent_id": "53bc4206427e95d600f787e0531df1-C001-18", "text": "Clearly, there are two kinds of priors in BTM which are \u03b2-prior and \u03b2-priors." }, { "sent_id": "53bc4206427e95d600f787e0531df1-C001-19", "text": "The \u03b2-prior is a corpus-topic bias without the data." }, { "sent_id": "53bc4206427e95d600f787e0531df1-C001-20", "text": "While the \u03b2-priors are topic-word biases without the data." }, { "sent_id": "53bc4206427e95d600f787e0531df1-C001-21", "text": "Applying the PMI score to the \u03b2-priors is the only one choice because we can adjust the degree of the word co-occurrence by modifying the distributions in the \u03b2-priors." }, { "sent_id": "53bc4206427e95d600f787e0531df1-C001-22", "text": "For example, we assume that a topic contains three words \"pen\", \"apple\" and \"banana\"." }, { "sent_id": "53bc4206427e95d600f787e0531df1-C001-23", "text": "In the symmetric priors, we set <0.1, 0.1, 0.1> which means no bias of these three words, while we can apply <0.1, 0.5, 0.5> to enhance the word co-occurrence of \"apple\" and \"banana\"." }, { "sent_id": "53bc4206427e95d600f787e0531df1-C001-24", "text": "Thus the topic will prefer to put the \"apple\" and \"banana\" together in the topic sampling step." }, { "sent_id": "53bc4206427e95d600f787e0531df1-C001-25", "text": "Table 1 shows the clustering results on the Twitter2011 dataset, when we set the number of topic to 50." }, { "sent_id": "53bc4206427e95d600f787e0531df1-C001-26", "text": "As expected, BTM is better than Mixture of unigram and LDA got the worst result when we adopt the symmetric priors <0.1>. 
When applying the PMI-\u03b2 priors, we get better results than BTM with symmetric priors." }, { "sent_id": "53bc4206427e95d600f787e0531df1-C001-27", "text": "In addition, our baseline method, PCA-\u03b2, is better than the original LDA because the PCA-\u03b2 prior can compensate for the lack of global word co-occurrence information in the original LDA." }, { "sent_id": "53bc4206427e95d600f787e0531df1-C001-28", "text": "In this paper, we propose a solution for topic models to enhance the amount of word co-occurrence information in short-text corpora." }, { "sent_id": "53bc4206427e95d600f787e0531df1-C001-29", "text": "First, we find that BTM identifies word co-occurrence by considering the bi-term frequency at the corpus level." }, { "sent_id": "53bc4206427e95d600f787e0531df1-C001-30", "text": "BTM overemphasizes common words because the bi-term frequency comes from the whole corpus instead of a short document." }, { "sent_id": "53bc4206427e95d600f787e0531df1-C001-31", "text": "We propose a PMI-\u03b2 priors method to overcome this problem." }, { "sent_id": "53bc4206427e95d600f787e0531df1-C001-32", "text": "The experimental results show that our PMI-\u03b2-BTM gets the best results on regular short news-title text."
} ], "y": { "@USE@": { "gold_contexts": [ [ "53bc4206427e95d600f787e0531df1-C001-13", "53bc4206427e95d600f787e0531df1-C001-15", "53bc4206427e95d600f787e0531df1-C001-16", "53bc4206427e95d600f787e0531df1-C001-21" ] ], "cite_sentences": [ "53bc4206427e95d600f787e0531df1-C001-13", "53bc4206427e95d600f787e0531df1-C001-15", "53bc4206427e95d600f787e0531df1-C001-16", "53bc4206427e95d600f787e0531df1-C001-21" ] }, "@BACK@": { "gold_contexts": [ [ "53bc4206427e95d600f787e0531df1-C001-13", "53bc4206427e95d600f787e0531df1-C001-14" ] ], "cite_sentences": [ "53bc4206427e95d600f787e0531df1-C001-13", "53bc4206427e95d600f787e0531df1-C001-14" ] } } }, "ABC_1781b27c13dca15752cb6aa8a9fc38_1": { "x": [ { "sent_id": "1781b27c13dca15752cb6aa8a9fc38-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "1781b27c13dca15752cb6aa8a9fc38-C001-2", "text": "Given the current growth in research and related emerging technologies in machine learning and deep learning, it is timely to introduce this tutorial to a large number of researchers and practitioners who are attending COLING 2018 and working on statistical models, deep neural networks, sequential learning and natural language understanding." }, { "sent_id": "1781b27c13dca15752cb6aa8a9fc38-C001-3", "text": "To the best of our knowledge, there is no similar tutorial presented in previous ACL/COLING/EMNLP/NAACL." }, { "sent_id": "1781b27c13dca15752cb6aa8a9fc38-C001-4", "text": "This three-hour tutorial will concentrate on a wide range of theories and applications and systematically present the recent advances in deep Bayesian and sequential learning which are impacting the communities of computational linguistics, human language technology and machine learning for natural language processing." 
}, { "sent_id": "1781b27c13dca15752cb6aa8a9fc38-C001-5", "text": "----------------------------------" }, { "sent_id": "1781b27c13dca15752cb6aa8a9fc38-C001-6", "text": "**TUTORIAL DESCRIPTION**" }, { "sent_id": "1781b27c13dca15752cb6aa8a9fc38-C001-7", "text": "This tutorial introduces the advances in deep Bayesian learning with abundant applications for natural language understanding ranging from speech recognition (Saon and Chien, 2012; Chan et al., 2016) to document summarization (Chang and Chien, 2009 ), text classification (Blei et al., 2003; Zhang et al., 2015) , text segmentation (Chien and Chueh, 2012) , information extraction (Narasimhan et al., 2016) , image caption generation (Vinyals et al., 2015; Xu et al., 2015) , sentence generation (Li et al., 2016b) , dialogue control (Zhao and Eskenazi, 2016; Li et al., 2016a) , sentiment classification, recommendation system, question answering (Sukhbaatar et al., 2015) and machine translation , to name a few." }, { "sent_id": "1781b27c13dca15752cb6aa8a9fc38-C001-8", "text": "Traditionally, \"deep learning\" is taken to be a learning process where the inference or optimization is based on the real-valued deterministic model." }, { "sent_id": "1781b27c13dca15752cb6aa8a9fc38-C001-9", "text": "The \"semantic structure\" in words, sentences, entities, actions and documents drawn from a large vocabulary may not be well expressed or correctly optimized in mathematical logic or computer programs." }, { "sent_id": "1781b27c13dca15752cb6aa8a9fc38-C001-10", "text": "The \"distribution function\" in discrete or continuous latent variable model for natural language may not be properly decomposed or estimated in model inference." 
}, { "sent_id": "1781b27c13dca15752cb6aa8a9fc38-C001-11", "text": "This tutorial addresses the fundamentals of statistical models and neural networks, and focus on a series of advanced Bayesian models and deep models including hierarchical Dirichlet process , Chinese restaurant process (Blei et al., 2010) , hierarchical Pitman-Yor process (Teh, 2006) , Indian buffet process (Ghahramani and Griffiths, 2005) , recurrent neural network (Mikolov et al., 2010; Van Den Oord et al., 2016) , long short-term memory (Hochreiter and Schmidhuber, 1997; , sequence-to-sequence model (Sutskever et al., 2014), variational auto-encoder (Kingma and Welling, 2014) , generative adversarial network (Goodfellow et al., 2014) , attention mechanism (Chorowski et al., 2015; Seo et al., 2016) , memory-augmented neural network (Graves et al., 2014; Graves et al., 2014) , stochastic neural network Miao et al., 2016) , predictive state neural network (Downey et al., 2017) , policy gradient (Yu et al., 2017) and reinforcement learning (Mnih et al., 2015) ." }, { "sent_id": "1781b27c13dca15752cb6aa8a9fc38-C001-12", "text": "We present how these models are connected and why they work for a variety of applications on symbolic and complex patterns in natural language." }, { "sent_id": "1781b27c13dca15752cb6aa8a9fc38-C001-13", "text": "The variational inference and sampling method are formulated to tackle the optimization for complicated models (Rezende et al., 2014) ." }, { "sent_id": "1781b27c13dca15752cb6aa8a9fc38-C001-14", "text": "The word and sentence embeddings, clustering and co-clustering are merged with linguistic and semantic constraints." }, { "sent_id": "1781b27c13dca15752cb6aa8a9fc38-C001-15", "text": "A series of case studies are presented to tackle different issues in deep Bayesian learning and understanding." }, { "sent_id": "1781b27c13dca15752cb6aa8a9fc38-C001-16", "text": "At last, we point out a number of directions and outlooks for future studies." 
}, { "sent_id": "1781b27c13dca15752cb6aa8a9fc38-C001-17", "text": "The presentation of this tutorial is arranged into five parts." }, { "sent_id": "1781b27c13dca15752cb6aa8a9fc38-C001-18", "text": "First of all, we share the current status of researches on natural language understanding, statistical modeling and deep neural network and explain the key issues in deep Bayesian learning for discrete-valued observation data and latent semantics." }, { "sent_id": "1781b27c13dca15752cb6aa8a9fc38-C001-19", "text": "A new paradigm called the symbolic neural learning is introduced to extend how data analysis is performed from language processing to semantic learning and memory networking." }, { "sent_id": "1781b27c13dca15752cb6aa8a9fc38-C001-20", "text": "Secondly, we address a number of Bayesian models ranging from latent variable model to VB inference (Chien and Chang, 2014; Chien and Chueh, 2011; Chien, 2015b) , MCMC sampling (Watanabe and Chien, 2015) and BNP learning (Chien, 2016; Chien, 2015a; Chien, 2018) for hierarchical, thematic and sparse topics from natural language." }, { "sent_id": "1781b27c13dca15752cb6aa8a9fc38-C001-21", "text": "In the third part, a series of deep models including deep unfolding (Chien and Lee, 2018) , Bayesian RNN (Gal and Ghahramani, 2016; Chien and Ku, 2016) , sequence-to-sequence learning (Graves et al., 2006; Gehring et al., 2017) , CNN (Kalchbrenner et al., 2014; Xingjian et al., 2015; , GAN (Tsai and Chien, 2017) and VAE are introduced." }, { "sent_id": "1781b27c13dca15752cb6aa8a9fc38-C001-22", "text": "The coffee break is arranged within this part." }, { "sent_id": "1781b27c13dca15752cb6aa8a9fc38-C001-23", "text": "Next, the fourth part focuses on a variety of advanced studies which illustrate how deep Bayesian learning is developed to infer the sophisticated recurrent models for natural language understanding." 
}, { "sent_id": "1781b27c13dca15752cb6aa8a9fc38-C001-24", "text": "In particular, the memory network Chien and Lin, 2018) , neural variational learning (Serban et al., 2017; Chung et al., 2015) , neural discrete representation (Jang et al., 2016; Maddison et al., 2016; van den Oord et al., 2017) , recurrent ladder network (Rasmus et al., 2015; Pr\u00e9mont-Schwarz et al., 2017; S\u00f8nderby et al., 2016) , stochastic neural network (Fraccaro et al., 2016; Goyal et al., 2017; Shabanian et al., 2017) , Markov recurrent neural network (Venkatraman et al., 2017; Kuo and Chien, 2018) , sequence GAN (Yu et al., 2017) and reinforcement learning (Tegho et al., 2017) are introduced in various deep models which open a window to more practical tasks, e.g. reading comprehension, sentence generation, dialogue system, question answering and machine translation." }, { "sent_id": "1781b27c13dca15752cb6aa8a9fc38-C001-25", "text": "In the final part, we spotlight on some future directions for deep language understanding which can handle the challenges of big data, heterogeneous condition and dynamic system." }, { "sent_id": "1781b27c13dca15752cb6aa8a9fc38-C001-26", "text": "In particular, deep learning, structural learning, temporal modeling, long history representation and stochastic learning are emphasized." }, { "sent_id": "1781b27c13dca15752cb6aa8a9fc38-C001-27", "text": "Slides of this tutorial are available at http://chien.cm.nctu.edu.tw/home/coling/. 
" }, { "sent_id": "1781b27c13dca15752cb6aa8a9fc38-C001-28", "text": "----------------------------------" }, { "sent_id": "1781b27c13dca15752cb6aa8a9fc38-C001-29", "text": "**INSTRUCTOR**" } ], "y": { "@USE@": { "gold_contexts": [ [ "1781b27c13dca15752cb6aa8a9fc38-C001-11" ] ], "cite_sentences": [ "1781b27c13dca15752cb6aa8a9fc38-C001-11" ] }, "@BACK@": { "gold_contexts": [ [ "1781b27c13dca15752cb6aa8a9fc38-C001-11" ] ], "cite_sentences": [ "1781b27c13dca15752cb6aa8a9fc38-C001-11" ] } } }, "ABC_4bc5fc3bccb704b9978b294ffb07de_1": { "x": [ { "sent_id": "4bc5fc3bccb704b9978b294ffb07de-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "4bc5fc3bccb704b9978b294ffb07de-C001-2", "text": "Given the current growth in research and related emerging technologies in machine learning and deep learning, it is timely to introduce this tutorial to a large number of researchers and practitioners who are attending COLING 2018 and working on statistical models, deep neural networks, sequential learning and natural language understanding." }, { "sent_id": "4bc5fc3bccb704b9978b294ffb07de-C001-3", "text": "To the best of our knowledge, there is no similar tutorial presented in previous ACL/COLING/EMNLP/NAACL." }, { "sent_id": "4bc5fc3bccb704b9978b294ffb07de-C001-4", "text": "This three-hour tutorial will concentrate on a wide range of theories and applications and systematically present the recent advances in deep Bayesian and sequential learning which are impacting the communities of computational linguistics, human language technology and machine learning for natural language processing."
}, { "sent_id": "4bc5fc3bccb704b9978b294ffb07de-C001-5", "text": "----------------------------------" }, { "sent_id": "4bc5fc3bccb704b9978b294ffb07de-C001-6", "text": "**TUTORIAL DESCRIPTION**" }, { "sent_id": "4bc5fc3bccb704b9978b294ffb07de-C001-7", "text": "This tutorial introduces the advances in deep Bayesian learning with abundant applications for natural language understanding ranging from speech recognition (Saon and Chien, 2012; Chan et al., 2016) to document summarization (Chang and Chien, 2009 ), text classification (Blei et al., 2003; Zhang et al., 2015) , text segmentation (Chien and Chueh, 2012) , information extraction (Narasimhan et al., 2016) , image caption generation (Vinyals et al., 2015; Xu et al., 2015) , sentence generation (Li et al., 2016b) , dialogue control (Zhao and Eskenazi, 2016; Li et al., 2016a) , sentiment classification, recommendation system, question answering (Sukhbaatar et al., 2015) and machine translation , to name a few." }, { "sent_id": "4bc5fc3bccb704b9978b294ffb07de-C001-8", "text": "Traditionally, \"deep learning\" is taken to be a learning process where the inference or optimization is based on the real-valued deterministic model." }, { "sent_id": "4bc5fc3bccb704b9978b294ffb07de-C001-9", "text": "The \"semantic structure\" in words, sentences, entities, actions and documents drawn from a large vocabulary may not be well expressed or correctly optimized in mathematical logic or computer programs." }, { "sent_id": "4bc5fc3bccb704b9978b294ffb07de-C001-10", "text": "The \"distribution function\" in discrete or continuous latent variable model for natural language may not be properly decomposed or estimated in model inference." 
}, { "sent_id": "4bc5fc3bccb704b9978b294ffb07de-C001-11", "text": "This tutorial addresses the fundamentals of statistical models and neural networks, and focus on a series of advanced Bayesian models and deep models including hierarchical Dirichlet process , Chinese restaurant process (Blei et al., 2010) , hierarchical Pitman-Yor process (Teh, 2006) , Indian buffet process (Ghahramani and Griffiths, 2005) , recurrent neural network (Mikolov et al., 2010; Van Den Oord et al., 2016) , long short-term memory (Hochreiter and Schmidhuber, 1997; , sequence-to-sequence model (Sutskever et al., 2014), variational auto-encoder (Kingma and Welling, 2014) , generative adversarial network (Goodfellow et al., 2014) , attention mechanism (Chorowski et al., 2015; Seo et al., 2016) , memory-augmented neural network (Graves et al., 2014; Graves et al., 2014) , stochastic neural network Miao et al., 2016) , predictive state neural network (Downey et al., 2017) , policy gradient (Yu et al., 2017) and reinforcement learning (Mnih et al., 2015) ." }, { "sent_id": "4bc5fc3bccb704b9978b294ffb07de-C001-12", "text": "We present how these models are connected and why they work for a variety of applications on symbolic and complex patterns in natural language." }, { "sent_id": "4bc5fc3bccb704b9978b294ffb07de-C001-13", "text": "The variational inference and sampling method are formulated to tackle the optimization for complicated models (Rezende et al., 2014) ." }, { "sent_id": "4bc5fc3bccb704b9978b294ffb07de-C001-14", "text": "The word and sentence embeddings, clustering and co-clustering are merged with linguistic and semantic constraints." }, { "sent_id": "4bc5fc3bccb704b9978b294ffb07de-C001-15", "text": "A series of case studies are presented to tackle different issues in deep Bayesian learning and understanding." }, { "sent_id": "4bc5fc3bccb704b9978b294ffb07de-C001-16", "text": "At last, we point out a number of directions and outlooks for future studies." 
}, { "sent_id": "4bc5fc3bccb704b9978b294ffb07de-C001-17", "text": "The presentation of this tutorial is arranged into five parts." }, { "sent_id": "4bc5fc3bccb704b9978b294ffb07de-C001-18", "text": "First of all, we share the current status of researches on natural language understanding, statistical modeling and deep neural network and explain the key issues in deep Bayesian learning for discrete-valued observation data and latent semantics." }, { "sent_id": "4bc5fc3bccb704b9978b294ffb07de-C001-19", "text": "A new paradigm called the symbolic neural learning is introduced to extend how data analysis is performed from language processing to semantic learning and memory networking." }, { "sent_id": "4bc5fc3bccb704b9978b294ffb07de-C001-20", "text": "Secondly, we address a number of Bayesian models ranging from latent variable model to VB inference (Chien and Chang, 2014; Chien and Chueh, 2011; Chien, 2015b) , MCMC sampling (Watanabe and Chien, 2015) and BNP learning (Chien, 2016; Chien, 2015a; Chien, 2018) for hierarchical, thematic and sparse topics from natural language." }, { "sent_id": "4bc5fc3bccb704b9978b294ffb07de-C001-21", "text": "In the third part, a series of deep models including deep unfolding (Chien and Lee, 2018) , Bayesian RNN (Gal and Ghahramani, 2016; Chien and Ku, 2016) , sequence-to-sequence learning (Graves et al., 2006; Gehring et al., 2017) , CNN (Kalchbrenner et al., 2014; Xingjian et al., 2015; , GAN (Tsai and Chien, 2017) and VAE are introduced." }, { "sent_id": "4bc5fc3bccb704b9978b294ffb07de-C001-22", "text": "The coffee break is arranged within this part." }, { "sent_id": "4bc5fc3bccb704b9978b294ffb07de-C001-23", "text": "Next, the fourth part focuses on a variety of advanced studies which illustrate how deep Bayesian learning is developed to infer the sophisticated recurrent models for natural language understanding." 
}, { "sent_id": "4bc5fc3bccb704b9978b294ffb07de-C001-24", "text": "In particular, the memory network Chien and Lin, 2018) , neural variational learning (Serban et al., 2017; Chung et al., 2015) , neural discrete representation (Jang et al., 2016; Maddison et al., 2016; van den Oord et al., 2017) , recurrent ladder network (Rasmus et al., 2015; Pr\u00e9mont-Schwarz et al., 2017; S\u00f8nderby et al., 2016) , stochastic neural network (Fraccaro et al., 2016; Goyal et al., 2017; Shabanian et al., 2017) , Markov recurrent neural network (Venkatraman et al., 2017; Kuo and Chien, 2018) , sequence GAN (Yu et al., 2017) and reinforcement learning (Tegho et al., 2017) are introduced in various deep models which open a window to more practical tasks, e.g. reading comprehension, sentence generation, dialogue system, question answering and machine translation." }, { "sent_id": "4bc5fc3bccb704b9978b294ffb07de-C001-25", "text": "In the final part, we spotlight on some future directions for deep language understanding which can handle the challenges of big data, heterogeneous condition and dynamic system." }, { "sent_id": "4bc5fc3bccb704b9978b294ffb07de-C001-26", "text": "In particular, deep learning, structural learning, temporal modeling, long history representation and stochastic learning are emphasized." }, { "sent_id": "4bc5fc3bccb704b9978b294ffb07de-C001-27", "text": "Slides of this tutorial are available at http://chien.cm.nctu.edu.tw/home/coling/. 
" }, { "sent_id": "4bc5fc3bccb704b9978b294ffb07de-C001-28", "text": "----------------------------------" }, { "sent_id": "4bc5fc3bccb704b9978b294ffb07de-C001-29", "text": "**INSTRUCTOR**" } ], "y": { "@USE@": { "gold_contexts": [ [ "4bc5fc3bccb704b9978b294ffb07de-C001-7" ] ], "cite_sentences": [ "4bc5fc3bccb704b9978b294ffb07de-C001-7" ] }, "@BACK@": { "gold_contexts": [ [ "4bc5fc3bccb704b9978b294ffb07de-C001-7" ] ], "cite_sentences": [ "4bc5fc3bccb704b9978b294ffb07de-C001-7" ] } } }, "ABC_74471d4e333ce76fd62b968045eba5_1": { "x": [ { "sent_id": "74471d4e333ce76fd62b968045eba5-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "74471d4e333ce76fd62b968045eba5-C001-2", "text": "Given the current growth in research and related emerging technologies in machine learning and deep learning, it is timely to introduce this tutorial to a large number of researchers and practitioners who are attending COLING 2018 and working on statistical models, deep neural networks, sequential learning and natural language understanding." }, { "sent_id": "74471d4e333ce76fd62b968045eba5-C001-3", "text": "To the best of our knowledge, there is no similar tutorial presented in previous ACL/COLING/EMNLP/NAACL." }, { "sent_id": "74471d4e333ce76fd62b968045eba5-C001-4", "text": "This three-hour tutorial will concentrate on a wide range of theories and applications and systematically present the recent advances in deep Bayesian and sequential learning which are impacting the communities of computational linguistics, human language technology and machine learning for natural language processing."
}, { "sent_id": "74471d4e333ce76fd62b968045eba5-C001-5", "text": "----------------------------------" }, { "sent_id": "74471d4e333ce76fd62b968045eba5-C001-6", "text": "**TUTORIAL DESCRIPTION**" }, { "sent_id": "74471d4e333ce76fd62b968045eba5-C001-7", "text": "This tutorial introduces the advances in deep Bayesian learning with abundant applications for natural language understanding ranging from speech recognition (Saon and Chien, 2012; Chan et al., 2016) to document summarization (Chang and Chien, 2009 ), text classification (Blei et al., 2003; Zhang et al., 2015) , text segmentation (Chien and Chueh, 2012) , information extraction (Narasimhan et al., 2016) , image caption generation (Vinyals et al., 2015; Xu et al., 2015) , sentence generation (Li et al., 2016b) , dialogue control (Zhao and Eskenazi, 2016; Li et al., 2016a) , sentiment classification, recommendation system, question answering (Sukhbaatar et al., 2015) and machine translation , to name a few." }, { "sent_id": "74471d4e333ce76fd62b968045eba5-C001-8", "text": "Traditionally, \"deep learning\" is taken to be a learning process where the inference or optimization is based on the real-valued deterministic model." }, { "sent_id": "74471d4e333ce76fd62b968045eba5-C001-9", "text": "The \"semantic structure\" in words, sentences, entities, actions and documents drawn from a large vocabulary may not be well expressed or correctly optimized in mathematical logic or computer programs." }, { "sent_id": "74471d4e333ce76fd62b968045eba5-C001-10", "text": "The \"distribution function\" in discrete or continuous latent variable model for natural language may not be properly decomposed or estimated in model inference." 
}, { "sent_id": "74471d4e333ce76fd62b968045eba5-C001-11", "text": "This tutorial addresses the fundamentals of statistical models and neural networks, and focus on a series of advanced Bayesian models and deep models including hierarchical Dirichlet process , Chinese restaurant process (Blei et al., 2010) , hierarchical Pitman-Yor process (Teh, 2006) , Indian buffet process (Ghahramani and Griffiths, 2005) , recurrent neural network (Mikolov et al., 2010; Van Den Oord et al., 2016) , long short-term memory (Hochreiter and Schmidhuber, 1997; , sequence-to-sequence model (Sutskever et al., 2014), variational auto-encoder (Kingma and Welling, 2014) , generative adversarial network (Goodfellow et al., 2014) , attention mechanism (Chorowski et al., 2015; Seo et al., 2016) , memory-augmented neural network (Graves et al., 2014; Graves et al., 2014) , stochastic neural network Miao et al., 2016) , predictive state neural network (Downey et al., 2017) , policy gradient (Yu et al., 2017) and reinforcement learning (Mnih et al., 2015) ." }, { "sent_id": "74471d4e333ce76fd62b968045eba5-C001-12", "text": "We present how these models are connected and why they work for a variety of applications on symbolic and complex patterns in natural language." }, { "sent_id": "74471d4e333ce76fd62b968045eba5-C001-13", "text": "The variational inference and sampling method are formulated to tackle the optimization for complicated models (Rezende et al., 2014) ." }, { "sent_id": "74471d4e333ce76fd62b968045eba5-C001-14", "text": "The word and sentence embeddings, clustering and co-clustering are merged with linguistic and semantic constraints." }, { "sent_id": "74471d4e333ce76fd62b968045eba5-C001-15", "text": "A series of case studies are presented to tackle different issues in deep Bayesian learning and understanding." }, { "sent_id": "74471d4e333ce76fd62b968045eba5-C001-16", "text": "At last, we point out a number of directions and outlooks for future studies." 
}, { "sent_id": "74471d4e333ce76fd62b968045eba5-C001-17", "text": "The presentation of this tutorial is arranged into five parts." }, { "sent_id": "74471d4e333ce76fd62b968045eba5-C001-18", "text": "First of all, we share the current status of researches on natural language understanding, statistical modeling and deep neural network and explain the key issues in deep Bayesian learning for discrete-valued observation data and latent semantics." }, { "sent_id": "74471d4e333ce76fd62b968045eba5-C001-19", "text": "A new paradigm called the symbolic neural learning is introduced to extend how data analysis is performed from language processing to semantic learning and memory networking." }, { "sent_id": "74471d4e333ce76fd62b968045eba5-C001-20", "text": "Secondly, we address a number of Bayesian models ranging from latent variable model to VB inference (Chien and Chang, 2014; Chien and Chueh, 2011; Chien, 2015b) , MCMC sampling (Watanabe and Chien, 2015) and BNP learning (Chien, 2016; Chien, 2015a; Chien, 2018) for hierarchical, thematic and sparse topics from natural language." }, { "sent_id": "74471d4e333ce76fd62b968045eba5-C001-21", "text": "In the third part, a series of deep models including deep unfolding (Chien and Lee, 2018) , Bayesian RNN (Gal and Ghahramani, 2016; Chien and Ku, 2016) , sequence-to-sequence learning (Graves et al., 2006; Gehring et al., 2017) , CNN (Kalchbrenner et al., 2014; Xingjian et al., 2015; , GAN (Tsai and Chien, 2017) and VAE are introduced." }, { "sent_id": "74471d4e333ce76fd62b968045eba5-C001-22", "text": "The coffee break is arranged within this part." }, { "sent_id": "74471d4e333ce76fd62b968045eba5-C001-23", "text": "Next, the fourth part focuses on a variety of advanced studies which illustrate how deep Bayesian learning is developed to infer the sophisticated recurrent models for natural language understanding." 
}, { "sent_id": "74471d4e333ce76fd62b968045eba5-C001-24", "text": "In particular, the memory network Chien and Lin, 2018) , neural variational learning (Serban et al., 2017; Chung et al., 2015) , neural discrete representation (Jang et al., 2016; Maddison et al., 2016; van den Oord et al., 2017) , recurrent ladder network (Rasmus et al., 2015; Pr\u00e9mont-Schwarz et al., 2017; S\u00f8nderby et al., 2016) , stochastic neural network (Fraccaro et al., 2016; Goyal et al., 2017; Shabanian et al., 2017) , Markov recurrent neural network (Venkatraman et al., 2017; Kuo and Chien, 2018) , sequence GAN (Yu et al., 2017) and reinforcement learning (Tegho et al., 2017) are introduced in various deep models which open a window to more practical tasks, e.g. reading comprehension, sentence generation, dialogue system, question answering and machine translation." }, { "sent_id": "74471d4e333ce76fd62b968045eba5-C001-25", "text": "In the final part, we spotlight on some future directions for deep language understanding which can handle the challenges of big data, heterogeneous condition and dynamic system." }, { "sent_id": "74471d4e333ce76fd62b968045eba5-C001-26", "text": "In particular, deep learning, structural learning, temporal modeling, long history representation and stochastic learning are emphasized." }, { "sent_id": "74471d4e333ce76fd62b968045eba5-C001-27", "text": "Slides of this tutorial are available at http://chien.cm.nctu.edu.tw/home/coling/. 
" }, { "sent_id": "74471d4e333ce76fd62b968045eba5-C001-28", "text": "----------------------------------" }, { "sent_id": "74471d4e333ce76fd62b968045eba5-C001-29", "text": "**INSTRUCTOR**" } ], "y": { "@BACK@": { "gold_contexts": [ [ "74471d4e333ce76fd62b968045eba5-C001-7" ] ], "cite_sentences": [ "74471d4e333ce76fd62b968045eba5-C001-7" ] }, "@USE@": { "gold_contexts": [ [ "74471d4e333ce76fd62b968045eba5-C001-7" ] ], "cite_sentences": [ "74471d4e333ce76fd62b968045eba5-C001-7" ] } } }, "ABC_cf2cc67035107f5bdaab85a760e56e_1": { "x": [ { "sent_id": "cf2cc67035107f5bdaab85a760e56e-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "cf2cc67035107f5bdaab85a760e56e-C001-2", "text": "We introduced KaWAT (Kata Word Analogy Task), a new word analogy task dataset for Indonesian." }, { "sent_id": "cf2cc67035107f5bdaab85a760e56e-C001-3", "text": "We evaluated on it several existing pretrained Indonesian word embeddings and embeddings trained on Indonesian online news corpus." }, { "sent_id": "cf2cc67035107f5bdaab85a760e56e-C001-4", "text": "We also tested them on two downstream tasks and found that pretrained word embeddings helped either by reducing the training epochs or yielding significant performance gains." }, { "sent_id": "cf2cc67035107f5bdaab85a760e56e-C001-5", "text": "----------------------------------" }, { "sent_id": "cf2cc67035107f5bdaab85a760e56e-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "cf2cc67035107f5bdaab85a760e56e-C001-7", "text": "Despite the existence of various Indonesian pretrained word embeddings, there are no publicly available Indonesian analogy task datasets on which to evaluate these embeddings." }, { "sent_id": "cf2cc67035107f5bdaab85a760e56e-C001-8", "text": "Consequently, it is unknown if Indonesian word embeddings introduced in, e.g., (Al-Rfou et al., 2013) and (Grave et al., 2018) , capture syntactic or semantic information as measured by analogy tasks."
}, { "sent_id": "cf2cc67035107f5bdaab85a760e56e-C001-9", "text": "Also, such embeddings are usually trained on Indonesian Wikipedia (Al-Rfou et al., 2013; Bojanowski et al., 2017) whose size is relatively small, approximately 60M tokens." }, { "sent_id": "cf2cc67035107f5bdaab85a760e56e-C001-10", "text": "Therefore, in this work, we introduce KaWAT (Kata Word Analogy Task), an Indonesian word analogy task dataset, and new Indonesian word embeddings pretrained on 160M tokens of online news corpus." }, { "sent_id": "cf2cc67035107f5bdaab85a760e56e-C001-11", "text": "We evaluated these embeddings on KaWAT, and also tested them on POS tagging and text summarization as representatives of syntactic and semantic downstream task respectively." }, { "sent_id": "cf2cc67035107f5bdaab85a760e56e-C001-12", "text": "----------------------------------" }, { "sent_id": "cf2cc67035107f5bdaab85a760e56e-C001-13", "text": "**METHODOLOGY**" }, { "sent_id": "cf2cc67035107f5bdaab85a760e56e-C001-14", "text": "We asked an Indonesian linguist to help build KaWAT based on English analogy task datasets such as Google Word Analogy (Mikolov et al., 2013a) and BATS (Gladkova et al., 2016) ." }, { "sent_id": "cf2cc67035107f5bdaab85a760e56e-C001-15", "text": "Following those works, we split the analogy tasks into two categories, syntax and semantic." }, { "sent_id": "cf2cc67035107f5bdaab85a760e56e-C001-16", "text": "We included mostly morphological analogies in the syntax category, leveraging the richness of Indonesian inflectional morphology." }, { "sent_id": "cf2cc67035107f5bdaab85a760e56e-C001-17", "text": "For semantic, we included analogies such as antonyms, country capitals and currencies, gender-specific words, measure words, and Indonesian province capitals." }, { "sent_id": "cf2cc67035107f5bdaab85a760e56e-C001-18", "text": "In total, we have 15K syntactic and 19K semantic analogy queries." }, { "sent_id": "cf2cc67035107f5bdaab85a760e56e-C001-19", "text": "KaWAT is available online." 
}, { "sent_id": "cf2cc67035107f5bdaab85a760e56e-C001-20", "text": "One of the goals of this work is to evaluate and compare existing Indonesian pretrained word embeddings." }, { "sent_id": "cf2cc67035107f5bdaab85a760e56e-C001-21", "text": "We used the fastText pretrained embeddings introduced in (Bojanowski et al., 2017) and (Grave et al., 2018), which were trained on Indonesian Wikipedia and on Indonesian Wikipedia plus Common Crawl data, respectively." }, { "sent_id": "cf2cc67035107f5bdaab85a760e56e-C001-22", "text": "We refer to them as Wiki/fastText and CC/fastText hereinafter." }, { "sent_id": "cf2cc67035107f5bdaab85a760e56e-C001-23", "text": "We also used two other pretrained embeddings: the polyglot embedding trained on Indonesian Wikipedia (Al-Rfou et al., 2013) and the NLPL embedding trained on the Indonesian portion of the CoNLL 2017 corpus (Fares et al., 2017)." }, { "sent_id": "cf2cc67035107f5bdaab85a760e56e-C001-24", "text": "For training our word embeddings, we used an online news corpus obtained from Tempo." }, { "sent_id": "cf2cc67035107f5bdaab85a760e56e-C001-25", "text": "We used Tempo newspaper and magazine articles up to the year 2014." }, { "sent_id": "cf2cc67035107f5bdaab85a760e56e-C001-26", "text": "This corpus contains roughly 400K articles, 160M word tokens, and 600K word types." }, { "sent_id": "cf2cc67035107f5bdaab85a760e56e-C001-27", "text": "To train the word embeddings, we experimented with three algorithms: word2vec (Mikolov et al., 2013b), fastText (Bojanowski et al., 2017), and GloVe (Pennington et al., 2014)." }, { "sent_id": "cf2cc67035107f5bdaab85a760e56e-C001-28", "text": "We refer to them henceforth as Tempo/word2vec, Tempo/fastText, and Tempo/GloVe, respectively." }, { "sent_id": "cf2cc67035107f5bdaab85a760e56e-C001-29", "text": "We used gensim to run word2vec and fastText, and the original C implementation for GloVe."
}, { "sent_id": "cf2cc67035107f5bdaab85a760e56e-C001-30", "text": "For all three, we used their default hyperparameters, i.e., no tuning was performed." }, { "sent_id": "cf2cc67035107f5bdaab85a760e56e-C001-31", "text": "Our three embeddings are available online." }, { "sent_id": "cf2cc67035107f5bdaab85a760e56e-C001-32", "text": "Evaluation on KaWAT was done using gensim with its KeyedVectors.most_similar method." }, { "sent_id": "cf2cc67035107f5bdaab85a760e56e-C001-33", "text": "Since the vocabularies of the word embeddings differ, for a fair comparison we first removed analogy queries containing any word that is missing from at least one vocabulary." }, { "sent_id": "cf2cc67035107f5bdaab85a760e56e-C001-34", "text": "In other words, we only kept queries whose words all exist in all vocabularies." }, { "sent_id": "cf2cc67035107f5bdaab85a760e56e-C001-35", "text": "After this process, there were roughly 6K syntactic and 1.5K semantic queries." }, { "sent_id": "cf2cc67035107f5bdaab85a760e56e-C001-36", "text": "We performed evaluation by computing a 95% confidence interval of the accuracy at rank 1 by bootstrapping." }, { "sent_id": "cf2cc67035107f5bdaab85a760e56e-C001-37", "text": "Our implementation code is available online." }, { "sent_id": "cf2cc67035107f5bdaab85a760e56e-C001-38", "text": "**RESULTS**" }, { "sent_id": "cf2cc67035107f5bdaab85a760e56e-C001-39", "text": "----------------------------------" }, { "sent_id": "cf2cc67035107f5bdaab85a760e56e-C001-40", "text": "**WORD ANALOGY RESULTS**" }, { "sent_id": "cf2cc67035107f5bdaab85a760e56e-C001-41", "text": "We found that on syntactic analogies, Wiki/fastText achieved 2.7% accuracy, significantly outperforming the others, even CC/fastText, which was trained on a much larger corpus." }, { "sent_id": "cf2cc67035107f5bdaab85a760e56e-C001-42", "text": "Other embeddings performed poorly, mostly below 1% accuracy."
}, { "sent_id": "cf2cc67035107f5bdaab85a760e56e-C001-43", "text": "The overall trend of low accuracy scores attests to the difficulty of syntactic KaWAT analogies, making it suitable as a benchmark for future research." }, { "sent_id": "cf2cc67035107f5bdaab85a760e56e-C001-44", "text": "On semantic analogies, Tempo/GloVe, with 20.42% accuracy, clearly outperformed all the others except Tempo/word2vec." }, { "sent_id": "cf2cc67035107f5bdaab85a760e56e-C001-45", "text": "Surprisingly, we found that Tempo/fastText performed very poorly with less than 1% accuracy, even worse than Wiki/fastText, which was trained on much smaller data." }, { "sent_id": "cf2cc67035107f5bdaab85a760e56e-C001-46", "text": "Overall, the accuracies on semantic analogies are also low, less than 25%, which again attests to the suitability of KaWAT as a benchmark for future work." }, { "sent_id": "cf2cc67035107f5bdaab85a760e56e-C001-47", "text": "----------------------------------" }, { "sent_id": "cf2cc67035107f5bdaab85a760e56e-C001-48", "text": "**DOWNSTREAM TASK RESULTS**" }, { "sent_id": "cf2cc67035107f5bdaab85a760e56e-C001-49", "text": "To check how useful these embeddings are for downstream tasks, we evaluated them on POS tagging and text summarization." }, { "sent_id": "cf2cc67035107f5bdaab85a760e56e-C001-50", "text": "For each task, we compared two embeddings: the best off-the-shelf pretrained embedding and our best proposed embedding on the syntactic (for POS tagging) and semantic (for summarization) analogy tasks, respectively." }, { "sent_id": "cf2cc67035107f5bdaab85a760e56e-C001-51", "text": "We used the same model and settings as (Kurniawan and Aji, 2018) for POS tagging and (Kurniawan and Louvan, 2018) for summarization." }, { "sent_id": "cf2cc67035107f5bdaab85a760e56e-C001-52", "text": "However, for computational reasons, we tuned only the learning rate using grid search, and only used the first fold of the summarization dataset."
}, { "sent_id": "cf2cc67035107f5bdaab85a760e56e-C001-53", "text": "Our key finding from the POS tagging experiment is that using the two embeddings did not yield a significant gain in test F1 score compared with not using any pretrained embedding (around 97.23)." }, { "sent_id": "cf2cc67035107f5bdaab85a760e56e-C001-54", "text": "However, on average, using Wiki/fastText resulted in 20% fewer training epochs, compared with only 4% when using Tempo/GloVe." }, { "sent_id": "cf2cc67035107f5bdaab85a760e56e-C001-55", "text": "For the summarization experiment, Tempo/GloVe was significantly better than CC/fastText in ROUGE-1 and ROUGE-L scores (66.63 and 65.93, respectively)." }, { "sent_id": "cf2cc67035107f5bdaab85a760e56e-C001-56", "text": "The scores when using CC/fastText were on par with those of not using any pretrained word embedding, and we did not observe fewer training epochs when using pretrained word embeddings." }, { "sent_id": "cf2cc67035107f5bdaab85a760e56e-C001-57", "text": "----------------------------------" }, { "sent_id": "cf2cc67035107f5bdaab85a760e56e-C001-58", "text": "**CONCLUSION**" }, { "sent_id": "cf2cc67035107f5bdaab85a760e56e-C001-59", "text": "We introduced KaWAT, a new dataset for the Indonesian word analogy task, and evaluated several Indonesian pretrained word embeddings on it." }, { "sent_id": "cf2cc67035107f5bdaab85a760e56e-C001-60", "text": "We found that (1) in general, accuracies on the analogy tasks were low, suggesting that improvements for Indonesian word embeddings are still possible and that KaWAT is hard enough to serve as a benchmark dataset for that purpose, (2) on syntactic analogies, the embedding of (Bojanowski et al., 2017) performed best and yielded 20% fewer training epochs when employed for POS tagging, and (3) on semantic analogies, the GloVe embedding trained on the Tempo corpus performed best and produced significant gains in ROUGE-1 and ROUGE-L scores when used for text summarization."
} ], "y": { "@BACK@": { "gold_contexts": [ [ "cf2cc67035107f5bdaab85a760e56e-C001-9" ] ], "cite_sentences": [ "cf2cc67035107f5bdaab85a760e56e-C001-9" ] }, "@MOT@": { "gold_contexts": [ [ "cf2cc67035107f5bdaab85a760e56e-C001-9" ] ], "cite_sentences": [ "cf2cc67035107f5bdaab85a760e56e-C001-9" ] }, "@USE@": { "gold_contexts": [ [ "cf2cc67035107f5bdaab85a760e56e-C001-21" ], [ "cf2cc67035107f5bdaab85a760e56e-C001-27", "cf2cc67035107f5bdaab85a760e56e-C001-29" ], [ "cf2cc67035107f5bdaab85a760e56e-C001-60" ] ], "cite_sentences": [ "cf2cc67035107f5bdaab85a760e56e-C001-21", "cf2cc67035107f5bdaab85a760e56e-C001-27", "cf2cc67035107f5bdaab85a760e56e-C001-29", "cf2cc67035107f5bdaab85a760e56e-C001-60" ] } } }, "ABC_7c5c5f13205c40a27d2629727df840_1": { "x": [ { "sent_id": "7c5c5f13205c40a27d2629727df840-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "7c5c5f13205c40a27d2629727df840-C001-2", "text": "Roughly speaking, the compositional semantics analysis of natural language consists in mapping a sentence to a logical formula which depicts its meaning." }, { "sent_id": "7c5c5f13205c40a27d2629727df840-C001-3", "text": "To do so, a lexicon endows each word with a partial formula (a \u03bb-term over the base types t for propositions and e for individuals), and the (binary) parse tree of the sentence specifies for each node which subtree is the function and which one is the argument." }, { "sent_id": "7c5c5f13205c40a27d2629727df840-C001-4", "text": "Hence one obtains a \u03bb-term corresponding to the whole parse tree, and by reduction a \u03bb-term of type t which corresponds to a formula of higher-order logic." }, { "sent_id": "7c5c5f13205c40a27d2629727df840-C001-5", "text": "----------------------------------" }, { "sent_id": "7c5c5f13205c40a27d2629727df840-C001-6", "text": "****" }, { "sent_id": "7c5c5f13205c40a27d2629727df840-C001-7", "text": "semantics." }, { "sent_id": "7c5c5f13205c40a27d2629727df840-C001-8", "text": "Our meta logic (a.k.a.
glue logic) is system F with many base types t, e_i (instead of simply typed \u03bb-calculus with just t and e); our logic for semantic representations is many-sorted higher-order logic (sorts e_i instead of a single sort e)." }, { "sent_id": "7c5c5f13205c40a27d2629727df840-C001-9", "text": "For representing quantification, we actually prefer to use Hilbert's \u01eb- and \u03c4-terms, constructed with two constants \u01eb, \u03c4 : \u039b\u03b1. (\u03b1 \u2192 t) \u2192 \u03b1, one of which is for generic elements [8]." }, { "sent_id": "7c5c5f13205c40a27d2629727df840-C001-10", "text": "An important but rather easy property holds: if the constants define an n-order q-sorted logic, any (\u03b7-long) normal \u03bb-term of type t does actually correspond to a formula of n-order q-sorted logic (possibly n = \u03c9)." }, { "sent_id": "7c5c5f13205c40a27d2629727df840-C001-11", "text": "We preferred system F to the modern type theories (MTT) used by Luo [5] and to the categorical logic of Asher [1] because of its formal simplicity and absence of variants: as the terms are issued from the lexicon by means of syntactic rules, F terms of problematic complexity are avoided." }, { "sent_id": "7c5c5f13205c40a27d2629727df840-C001-12", "text": "There are two properties of Luo's approach [5] that would be welcome: a proper notion of subtyping, mathematically safe and linguistically relevant, and predefined inductive types with specific reduction rules." }, { "sent_id": "7c5c5f13205c40a27d2629727df840-C001-13", "text": "Subtyping is most welcome in particular to represent ontological inclusion (a "human being" is an "animal", thus predicates that apply to "animals" also apply to "human beings")." }, { "sent_id": "7c5c5f13205c40a27d2629727df840-C001-14", "text": "Coercive subtyping as developed by Luo and Soloviev [10] sounds promising for F (and other notions as well [4])."
}, { "sent_id": "7c5c5f13205c40a27d2629727df840-C001-15", "text": "The key property of coercive subtyping is that there is at most one subtyping map between any two complex types, provided that there is at most one subtyping map between any two base types." }, { "sent_id": "7c5c5f13205c40a27d2629727df840-C001-16", "text": "Predefined inductive types would be most welcome in our setting, e.g., integers as in G\u00f6del's system T and finite sets of \u03b1-objects." }, { "sent_id": "7c5c5f13205c40a27d2629727df840-C001-17", "text": "Of course, F can encode such types, but such encodings are far from natural." }, { "sent_id": "7c5c5f13205c40a27d2629727df840-C001-18", "text": "Reduction in such a setting is related to the work of Soloviev and Chemouil [9]. The key point is to show that normalisation and confluence are preserved and that there is no closed, constant-free term of a false type." }, { "sent_id": "7c5c5f13205c40a27d2629727df840-C001-19", "text": "We shall also illustrate the linguistic relevance of these extensions, which are already included in Moot's syntactic and semantic parser for French." }, { "sent_id": "7c5c5f13205c40a27d2629727df840-C001-20", "text": "[6]" } ], "y": { "@USE@": { "gold_contexts": [ [ "7c5c5f13205c40a27d2629727df840-C001-9" ] ], "cite_sentences": [ "7c5c5f13205c40a27d2629727df840-C001-9" ] } } }, "ABC_5b14e259a557aa3cbfcbd6265f04c8_1": { "x": [ { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-2", "text": "Adversarial training has shown impressive success in learning a bilingual dictionary without any parallel data by mapping monolingual embeddings to a shared space." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-3", "text": "However, recent work has shown superior performance for non-adversarial methods in more challenging language pairs."
}, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-4", "text": "In this work, we revisit the adversarial autoencoder for unsupervised word translation and propose two novel extensions to it that yield more stable training and improved results." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-5", "text": "Our method includes regularization terms to enforce cycle consistency and input reconstruction, and pits the target encoders as adversaries against the corresponding discriminators." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-6", "text": "Extensive experiments with European, non-European, and low-resource languages show that our method is more robust and achieves better performance than recently proposed adversarial and non-adversarial approaches." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-7", "text": "----------------------------------" }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-9", "text": "Learning cross-lingual word embeddings has been shown to be an effective way to transfer knowledge from one language to another for many key linguistic tasks including machine translation, named entity recognition, part-of-speech tagging, and parsing (Ruder et al., 2017)." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-10", "text": "While earlier efforts solved the associated word alignment problem using large parallel corpora (Luong et al., 2015), broader applicability demands methods that relax this requirement, since acquiring a large corpus of parallel data is not feasible in most scenarios." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-11", "text": "Recent methods instead use embeddings learned from monolingual data, and learn a linear mapping from one language to another with the underlying assumption that the two embedding spaces exhibit similar geometric structures (i.e., are approximately isomorphic)."
}, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-12", "text": "This allows the model to learn effective cross-lingual representations without expensive supervision (Artetxe et al., 2017)." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-13", "text": "Given monolingual word embeddings of two languages, Mikolov et al. (2013a) show that a linear mapping can be learned from a seed dictionary of 5000 word pairs by minimizing the sum of squared Euclidean distances between the mapped vectors and the target vectors." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-14", "text": "Subsequent works (Xing et al., 2015; Artetxe et al., 2016, 2017; Smith et al., 2017) propose to improve the model by normalizing the embeddings, imposing an orthogonality constraint on the mapper, and modifying the objective function." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-15", "text": "While these methods assume some supervision in the form of a seed dictionary, fully unsupervised methods have recently shown competitive results." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-16", "text": "Zhang et al. (2017a,b) first reported encouraging results with adversarial training." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-17", "text": "Conneau et al. (2018) improved this approach with post-mapping refinements, showing impressive results for several language pairs." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-18", "text": "Their learned mapping was then successfully used to train a fully unsupervised neural machine translation system (Lample et al., 2018a,b)." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-19", "text": "Although successful, adversarial training has been criticized for not being stable and failing to converge, inspiring researchers to propose non-adversarial methods more recently (Xu et al., 2018a; Hoshen and Wolf, 2018; Alvarez-Melis and Jaakkola, 2018; Artetxe et al., 2018b)."
}, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-20", "text": "In particular, Artetxe et al. (2018b) show that the adversarial methods of Conneau et al. (2018) and Zhang et al. (2017a,b) fail for many language pairs." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-21", "text": "In this paper, we revisit adversarial training and propose a number of key improvements that yield more robust training and improved mappings." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-22", "text": "Our main idea is to learn the cross-lingual mapping in a projected latent space and add more constraints to guide the unsupervised mapping in this space." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-23", "text": "We accomplish this by proposing a novel adversarial autoencoder framework (Makhzani et al., 2015), where adversarial mapping is done in the (latent) code space as opposed to the original embedding space (Figure 1)." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-24", "text": "This gives the model the flexibility to automatically induce the required geometric structures in its latent code space that could potentially yield better mappings." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-25", "text": "S\u00f8gaard et al. (2018) recently find that the isomorphic assumption made by most existing methods does not hold in general even for two closely related languages like English and German." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-26", "text": "In their words, "approaches based on this assumption have important limitations"." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-27", "text": "By mapping the latent vectors through adversarial training, our approach therefore departs from the isomorphic assumption." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-28", "text": "In our adversarial training, not only the mapper but also the target encoder is trained to fool the discriminator."
}, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-29", "text": "This forces the discriminator to improve its discrimination skills, which in turn pushes the mapper to generate indistinguishable translations." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-30", "text": "To guide the mapping, we include two additional constraints." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-31", "text": "Our first constraint enforces cycle consistency so that code vectors, after being translated from one language to another and then translated back to their source space, remain close to the original vectors." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-32", "text": "The second constraint ensures reconstruction of the original input word embeddings from the back-translated codes." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-33", "text": "This grounding step forces the model to retain word semantics during the mapping process." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-34", "text": "We conduct a series of experiments with six different language pairs (in both directions) comprising European, non-European, and low-resource languages from two different datasets." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-35", "text": "Our results show that our model is more robust and yields significant gains over Conneau et al. (2018) for all translation tasks in all evaluation measures." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-36", "text": "Our method also gives a better initial mapping compared to other existing methods (Artetxe et al., 2018b)." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-37", "text": "We also perform an extensive ablation study to understand the contribution of different components of our model."
}, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-38", "text": "The study reveals that cycle consistency contributes the most, while adversarial training of the target encoder and post-cycle reconstruction also have a significant effect." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-39", "text": "We have released our source code at https://ntunlpsg.github.io/project/unsup-word-translation/. The remainder of this paper is organized as follows." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-40", "text": "After discussing related work in Section 2, we present our unsupervised word translation approach with adversarial autoencoder in Section 3." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-41", "text": "We describe our experimental setup in Section 4, and present our results with in-depth analysis in Section 5." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-42", "text": "Finally, we summarize our findings with possible future directions in Section 6." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-43", "text": "----------------------------------" }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-44", "text": "**RELATED WORK**" }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-45", "text": "In recent years, a number of methods have been proposed to learn a bilingual dictionary from monolingual word embeddings." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-46", "text": "Many of these methods use an initial seed dictionary." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-47", "text": "Mikolov et al. (2013a) show that a linear transformation can be learned from a seed dictionary of 5000 pairs by minimizing the squared Euclidean distance." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-48", "text": "In their view, the key reason behind the good performance of their model is the similarity of geometric arrangements in vector spaces of the embeddings of different languages."
}, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-49", "text": "For translating a new source word, they map the corresponding word embedding to the target space using the learned mapping and find the nearest target word." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-50", "text": "In their approach, they found that simple linear mapping works better than non-linear mappings with multilayer neural networks." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-51", "text": "Xing et al. (2015) enforce the word vectors to be of unit length during the learning of the embeddings and modify the objective function for learning the mapping to maximize the cosine similarity instead of using Euclidean distance." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-52", "text": "To preserve length normalization after mapping, they enforce the orthogonality constraint on the mapper." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-53", "text": "Instead of learning a mapping from the source to the target embedding space, Faruqui and Dyer (2014) use a technique based on Canonical Correlation Analysis (CCA) to project both source and target embeddings to a common low-dimensional space, where the correlation of the word pairs in the seed dictionary is maximized." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-54", "text": "Artetxe et al. (2016) show that the above methods are variants of the same core optimization objective and propose a closed form solution for the mapper under orthogonality constraint." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-55", "text": "Smith et al. (2017) find that this solution is closely related to the orthogonal Procrustes solution." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-56", "text": "In their follow-up work, Artetxe et al. (2017) obtain competitive results using a seed dictionary of only 25 word pairs." 
}, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-57", "text": "They propose a self-learning framework that performs two steps iteratively until convergence." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-58", "text": "In the first step, they use the dictionary (starting with the seed) to learn a linear mapping, which is then used in the second step to induce a new dictionary." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-59", "text": "A more recent line of research attempts to eliminate the seed dictionary totally and learn the mapping in a purely unsupervised way." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-60", "text": "This was first proposed by Miceli Barone (2016), who initially used an adversarial network similar to that of Conneau et al. (2018), and found that the mapper (which is also the encoder) translates everything to a single embedding, commonly known as the mode collapse issue (Goodfellow, 2017)." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-61", "text": "To preserve diversity in mapping, he used a decoder to reconstruct the source embedding from the mapped embedding, extending the framework to an adversarial autoencoder." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-62", "text": "His preliminary qualitative analysis shows encouraging results, though not competitive with methods using bilingual seeds." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-63", "text": "He suspected issues with training and with the isomorphic assumption." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-64", "text": "In our work, we successfully address these issues with an improved model that also relaxes the isomorphic assumption." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-65", "text": "Our model uses two separate autoencoders, one for each language, which allows us to put more constraints to guide the mapping."
}, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-66", "text": "We also distinguish the role of an encoder from the role of a mapper." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-67", "text": "The encoder projects embeddings to latent code vectors, which are then translated by the mapper." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-68", "text": "Zhang et al. (2017a) improved adversarial training with orthogonal parameterization and cycle consistency." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-69", "text": "To aid training, they incorporate additional techniques like noise injection, which works as a regularizer." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-70", "text": "For selecting the best model, they rely on sharp drops of the discriminator accuracy." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-71", "text": "In their follow-up work (Zhang et al., 2017b), they minimize the Earth-Mover's distance between the distribution of the transformed source embeddings and the distribution of the target embeddings." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-72", "text": "Conneau et al. (2018) show impressive results with adversarial training and refinement with the Procrustes solution." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-73", "text": "Instead of using the adversarial loss, Xu et al. (2018a) use the Sinkhorn distance and adopt cycle consistency inspired by CycleGAN (Zhu et al., 2017)." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-74", "text": "We also incorporate cycle consistency along with the adversarial loss." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-75", "text": "However, while all these methods learn the mapping in the original embedding space, our approach learns it in the latent code space, considering both the mapper and the target encoder as adversaries."
}, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-76", "text": "In addition, we use a post-cycle reconstruction to guide the mapping." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-77", "text": "A number of non-adversarial methods have also been proposed recently." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-78", "text": "Artetxe et al. (2018b) learn an initial dictionary by exploiting the structural similarity of the embeddings and use a robust self-learning algorithm to improve it iteratively." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-79", "text": "Hoshen and Wolf (2018) align the second moment of word distributions of the two languages using principal component analysis (PCA) and then refine the alignment iteratively using a variation of the Iterative Closest Point (ICP) method used in computer vision." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-80", "text": "Alvarez-Melis and Jaakkola (2018) cast the problem as an optimal transport problem and exploit the Gromov-Wasserstein distance, which measures how similarities between pairs of words relate across languages." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-81", "text": "----------------------------------" }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-82", "text": "**APPROACH**" }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-83", "text": "Let X = {x_1, . . . , x_n} and Y = {y_1, . . . , y_m} be two sets consisting of n and m d-dimensional word embeddings for a source and a target language, respectively." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-84", "text": "We assume that X and Y are trained independently from monolingual corpora." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-85", "text": "Our aim is to learn a mapping f(x) in an unsupervised way (i.e., no bilingual dictionary is given) such that for every x_i, f(x_i) corresponds to its translation in Y. Our overall approach follows the same sequence of steps as Conneau et al.
(2018)." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-86", "text": "1. Induction of a seed dictionary through adversarial training." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-87", "text": "2. Iterative refinement of the initial mapping through the Procrustes solution." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-88", "text": "3. Application of CSLS for nearest-neighbor search." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-89", "text": "We propose a novel adversarial autoencoder model to learn the initial mapping for inducing a seed dictionary in step 1, and we adopt existing refinement methods for steps 2 and 3." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-90", "text": "----------------------------------" }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-91", "text": "**ADVERSARIAL AUTOENCODER FOR INITIAL DICTIONARY INDUCTION**" }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-92", "text": "Our proposed model (Figure 1) has two autoencoders, one for each language." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-93", "text": "Each autoencoder comprises an encoder E_X (resp." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-94", "text": "E_Y) and a decoder D_X (resp." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-95", "text": "D_Y)." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-96", "text": "The encoders transform an input x (resp." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-97", "text": "y) into a latent code z_x (resp." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-98", "text": "z_y), from which the decoders try to reconstruct the original input."
}, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-99", "text": "We use a linear encoder and l 2 reconstruction loss" }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-100", "text": "where \u03b8 E X \u2208 R c\u00d7d and \u03b8 D X \u2208 R d\u00d7c are the parameters of the encoder and the decoder for ddimensional word embedding and c-dimensional code vector." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-101", "text": "2 The encoder, decoder and the reconstruction loss for the other autoencoder (autoenc Y ) is similarly defined." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-102", "text": "Let q(z x |x) and q(z y |y) be the encoding distributions of the two autoencoders." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-103", "text": "We use adversarial training to find a mapping between q(z x |x) and q(z y |y)." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-104", "text": "This is in contrast with most existing methods (e.g., Conneau et al. (2018) ; Artetxe et al. (2017) ) that directly map the distribution of the source word embeddings p(x) to the distribution of the target p(y)." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-105", "text": "As S\u00f8gaard et al. (2018) pointed out, the isomorphism does not hold in general between the word embedding spaces of two languages." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-106", "text": "Mapping the latent codes gives our model more flexibility to induce the required semantic structures in its code space that could potentially yield more accurate mappings." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-107", "text": "As shown in Figure 1 , we include two linear mappings G : Z x \u2192 Z y and F : Z y \u2192 Z x to project the code vectors (samples from q(.|.)) from one language to the other." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-108", "text": "In addition, we have two language discriminators, L X and L Y ." 
}, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-109", "text": "The discriminators are trained to discriminate between the mapped codes and the encoded codes, while the mappers and encoders are jointly trained to fool their respective discriminator." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-110", "text": "This results in a three-player game, where the discriminator tries to identify the origin of a code, and the mapper and the encoder act together to prevent the discriminator to succeed by making the mapped vector and the encoded vector as similar as possible." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-111", "text": "Discriminator Loss Let \u03b8 L X and \u03b8 L Y denote the parameters of the two discriminators, and W G and W F are the mapping weight matrices." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-112", "text": "The loss for the source discriminator L X can be written as" }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-113", "text": "where P L X (src|z) is the probability according to L X to distinguish whether z is coming from the source encoder (src = 1) or from the target-tosource mapper" }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-114", "text": "Our discriminators have the same architecture as Conneau et al. (2018) ." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-115", "text": "It is a feed-forward network with two hidden layers of size 2048 and Leaky-ReLU activations." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-116", "text": "We apply dropout with a rate of 0.1 on the input to the discriminators." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-117", "text": "Instead of using 1 and 0, we also apply a smoothing coefficient (s = 0.2) in the discriminator loss." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-118", "text": "Adversarial Loss The mappers and encoders are trained jointly with the following adversarial loss to fool their respective discriminators." 
}, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-119", "text": "The adversarial loss for mapper G and encoder E Y is similarly defined." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-120", "text": "Note that we consider both the mapper and the target encoder as generators." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-121", "text": "This is in contrast to existing adversarial methods, which do not use any autoencoder in the target side." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-122", "text": "The mapper and the target encoder team up to fool the discriminator." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-123", "text": "This forces the discriminator to improve its skill and vice versa for the generators, forcing them to produce indistinguishable codes through better mapping." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-124", "text": "Cycle Consistency and Reconstruction The adversarial method introduced above maps a \"bag\" of source embeddings to a \"bag\" of target embeddings, and in theory, the mapper can match the target language distribution." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-125", "text": "However, mapping at the bag level is often insufficient to learn the individual word level mappings." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-126", "text": "In fact, there exist infinite number of possible mappings that can match the same target distribution." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-127", "text": "Thus to learn better mappings, we need to enforce more constraints to our objective." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-128", "text": "The first form of constraints we consider is cycle consistency to ensure that a source code z x translated to the target language code space, and translated back to the original space remains unchanged, i.e., z x \u2192G(z x )\u2192F (G(z x ))\u2248z x ." 
}, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-129", "text": "Formally, the cycle consistency loss in one direction:" }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-130", "text": "The loss in the other direction (z y \u2192 F (z y ) \u2192 G(F (z y )) \u2248 z y ) is similarly defined." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-131", "text": "In addition to cycle consistency, we include another constraint to guide the mapping further." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-132", "text": "In particular, we ask the decoder of the respective autoencoder to reconstruct the original input from the back-translated code." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-133", "text": "We compute this post-cycle reconstruction loss for the source autoencoder as follows:" }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-134", "text": "The reconstruction loss at the target autoencoder is defined similarly." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-135", "text": "Apart from improved mapping, both cycle consistency and reconstruction lead to more stable training in our experiments." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-136", "text": "Specifically, they help our training to converge and get around the mode collapse issue (Goodfellow, 2017) ." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-137", "text": "Since the model now has to translate the mapped code back to the source code and reconstruct the original word embedding, the generators cannot get away by mapping all source codes to a single target code." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-138", "text": "Total Loss The total loss for mapping a batch from source to target is" }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-139", "text": "where \u03bb 1 and \u03bb 2 control the relative importance of the three loss components." 
}, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-140", "text": "Similarly we define the total loss for mapping in the opposite direction L tar\u2192src ." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-141", "text": "The complete objective of our model is:" }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-142", "text": "----------------------------------" }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-143", "text": "**TRAINING AND DICTIONARY CONSTRUCTION**" }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-144", "text": "We present the training procedure of our model and the overall word translation process in Algorithm 1." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-145", "text": "We first pre-train the autoencoders separately on monolingual embeddings (Step 1)." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-146", "text": "This pre-training is required to induce word semantics (and relations) in the latent code space." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-147", "text": "We start adversarial training (" }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-148", "text": "Step 2) by updating the discriminators for n critics (5) times, each time with a random batch." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-149", "text": "Then we update the generators (the mapper and target encoder) on the adversarial loss." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-150", "text": "The mappers then go through two more updates, one for cycle consistency and another for post-cycle reconstruction." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-151", "text": "The autoencoders (encoder-decoder) in this stage get updated only on the post-cycle reconstruction loss." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-152", "text": "We also apply the orthogonalization update to the mappers following Conneau et al. (2018) with \u03b2 = 0.01." 
}, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-153", "text": "Our training setting is similar to Conneau et al. (2018) , and we apply the same pre-and postprocessing steps." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-154", "text": "We use stochastic gradient descent (SGD) with a batch size of 32, a learning rate of 0.1, and a decay of 0.98." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-155", "text": "For selecting the best model, we use the unsupervised validation criterion proposed by Conneau et al. (2018) , which correlates highly with the mapping quality." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-156", "text": "In this criterion, 10, 000 most frequent source words along with their nearest neighbors in the target space are considered." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-157", "text": "The average cosine similarity between these pseudo translations is considered as the validation metric." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-158", "text": "The initial bilingual dictionary induced by adversarial training (or any other unsupervised method) is generally of lower quality than what could be achieved by a supervised method." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-159", "text": "Conneau et al. (2018) and Artetxe et al. (2018b) propose fine-tuning methods to refine the initial mappings." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-160", "text": "Similar to Conneau et al. (2018) ), we finetune our initial mappings (G and F ) by iteratively solving the Procrustes problem and applying a dictionary induction step." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-161", "text": "This method uses singular value decomposition or SVD of Z T y Z x to find the optimal mappings G (similarly SVD(Z T x Z y ) for F ) given the approximate alignment of words from the previous step." 
}, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-162", "text": "For generating synthetic dictionary in each iteration, we only consider the translation pairs that are mutual nearest neighbors." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-163", "text": "In our fine-tuning, we run five iterations of this process." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-164", "text": "For finding the nearest neighbors, we use the Cross-domain Similarity Local Scaling (CSLS) which works better in mitigating the hubness problem (Conneau et al., 2018) ." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-165", "text": "----------------------------------" }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-166", "text": "**EXPERIMENTAL SETTINGS**" }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-167", "text": "Following the tradition, we evaluate our model on word translation (a.k.a. bilingual lexicon induction) task, which measures the accuracy of the predicted dictionary to a gold standard dictionary." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-168", "text": "----------------------------------" }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-169", "text": "**DATASETS**" }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-170", "text": "We evaluate our model on two different datasets." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-171", "text": "The first one is from Conneau et al. (2018) , which consists of FastText monolingual embeddings of (d =) 300 dimensions (Bojanowski et al., 2017) trained on Wikipedia monolingual corpus and gold dictionaries for 110 language pairs." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-172", "text": "3 To show the generality of different methods, we consider European, non-European and low-resource languages." 
}, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-173", "text": "In particular, we evaluate on English (En) from/to Spanish (Es), German (De), Italian (It), Arabic (Ar), Malay (Ms), and Hebrew (He)." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-174", "text": "We also evaluate on the more challenging dataset of Dinu et al. (2015) and its subsequent extension by Artetxe et al. (2018a) ." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-175", "text": "We will refer to this dataset as Dinu-Artexe dataset." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-176", "text": "From this dataset, we choose to experiment on English from/to Italian and Spanish." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-177", "text": "English and Italian embeddings were trained on WacKy corpora using CBOW (Mikolov et al., 2013b) , while the Spanish embeddings were trained on WMT News Crawl." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-178", "text": "The CBOW vectors are also of 300 dimensions." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-179", "text": "----------------------------------" }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-180", "text": "**BASELINES**" }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-181", "text": "We compare our method with the unsupervised models of Conneau et al. (2018) , Artetxe et al. (2018b) , Alvarez-Melis and Jaakkola (2018) , Xu et al. (2018a) , and Hoshen and Wolf (2018) ." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-182", "text": "To evaluate how our unsupervised method compares with methods that rely on a bilingual seed dictionary, we follow Conneau et al. (2018) , and compute a supervised baseline that uses the Procrustes solution directly on the seed dictionary (5000 pairs) to learn the mapping function, and then uses CSLS to do the nearest neighbor search." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-183", "text": "We also compare with the supervised approaches of Artetxe et al. 
(2017 Artetxe et al. ( , 2018a , which to our knowledge are the state-of-the-art supervised systems." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-184", "text": "For some of the baselines, results are reported from their papers, while for the rest we report results by running the publicly available codes on our machine." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-185", "text": "For training our model on European languages, the weight for cycle consistency (\u03bb 1 ) in Eq. 7 was always set to 5, and the weight for post-cycle reconstruction (\u03bb 2 ) was set to 1." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-261", "text": "Now let us turn our attention to the results with fine-tuning." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-186", "text": "For non-European languages, we use different values of \u03bb 1 and \u03bb 2 for different language pairs." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-187", "text": "4 The dimension of the code vectors in our model was set to 350." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-188", "text": "----------------------------------" }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-189", "text": "**RESULTS**" }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-190", "text": "We present our results on European languages on the datasets of Conneau et al. (2018) and Dinu et al. (2015) in Tables 1 and 3 , while the results on non-European languages are shown in Table 2 ." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-191", "text": "Through experiments, our goal is to assess:" }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-192", "text": "1. Does the unsupervised mapping method based on our proposed adversarial autoencoder model improve over the best existing adversarial method of Conneau et al. (2018) in terms of mapping accuracy and convergence (Section 5.1)?" }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-193", "text": "2. 
How does our unsupervised mapping method compare with other unsupervised and supervised approaches (Section 5.2)? 3." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-194", "text": "Which components of our adversarial autoencoder model attribute to improvements (Section 5.3)?" }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-195", "text": "----------------------------------" }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-196", "text": "**COMPARISON WITH CONNEAU ET AL. (2018)**" }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-197", "text": "Since our approach follows the same steps as Conneau et al. (2018), we first compare our proposed model with their model on European (Table 1) , non-European and low-resource languages (Table 2 ) on their dataset." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-198", "text": "In the tables, we present the numbers that they reported in their paper (Conneau et al. (2018) (paper)) as well as the results that we get by running their code on our machine (Conneau et al. (2018) (code))." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-199", "text": "For a fair comparison with respect to the quality of the learned mappings (or induced seed dictionary), here we only consider the results of our approach that use the refinement procedure of Conneau et al. (2018) ." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-200", "text": "In Table 1 , we see that our Adversarial autoencoder + Conneau et al. (2018) Refinement outperforms Conneau et al. (2018) in all the six translation tasks involving European language pairs, yielding gains in the range 0.3 -1.3%." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-201", "text": "Our method is also superior to theirs for the non-European and low-resource language pairs in Table 2 ." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-202", "text": "Here our method gives more gains ranging from 1.8 to 4.3%." 
}, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-203", "text": "Note specifically that Malay (Ms) is a lowresource language, and the FastText contains word vectors for only 155K Malay words." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-204", "text": "We found their model to be very fragile for En from/to Ms, and does not converge at all for Ms\u2192En." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-205", "text": "We ran their code 10 times for Ms\u2192En but failed every time." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-206", "text": "Compared to that, our method is more robust and converged most of the time we ran." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-207", "text": "If we compare our method with the method of Conneau et al. (2018) on the more challenging Dinu-Artexe dataset in Table 3 , we see here also our method performs better than their method in all the four translation tasks involving European language pairs." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-208", "text": "In this dataset, our method shows more robustness compared to their method." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-209", "text": "For example, their method had difficulties in converging for En from/to Es translations; for En\u2192Es, it converges only 2 times out of 10 attempts, while for Es\u2192En it did not converge a single time in 10 attempts." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-210", "text": "Compared to that, our method was more robust, converging 4 times out of 10 attempts." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-211", "text": "In Section 5.3, we compare our model with Conneau et al. (2018) more rigorously by evaluating them with and without fine-tuning and measuring their performance on P@1, P@5, and P@10." 
}, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-212", "text": "----------------------------------" }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-213", "text": "**COMPARISON WITH OTHER METHODS**" }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-214", "text": "In this section, we compare our model with other state-of-the-art methods that do not follow the same procedure as us and Conneau et al. (2018) ." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-215", "text": "For example, Artetxe et al. (2018b) do the initial mapping in the similarity space, then they apply a different self-learning method to fine-tune the embeddings, and perform a final refinement with symmetric re-weighting." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-216", "text": "Instead of mapping from source to target, they map both source and target embeddings to a common space." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-217", "text": "Let us first consider the results for European language pairs on the dataset of Conneau et al. (2018) in Table 1 ." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-218", "text": "Our Adversarial autoencoder + Conneau et al. (2018) Refinement performs better than most of the other methods on this dataset, achieving the highest accuracy for 4 out of 6 translation tasks." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-219", "text": "For De\u2192En, our result is very close to the best system of Artetxe et al. (2018b) with only 0.2% difference." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-220", "text": "Table 3 : Word translation accuracy (P@1) on the English-Italian and English-Spanish language pairs of Dinu-Artetxe dataset (Dinu et al., 2015; Artetxe et al., 2017) ." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-221", "text": "All methods use CBOW embeddings." 
}, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-222", "text": "** indicates the model failed to converge; '-' indicates the authors did not report the number." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-223", "text": "On the dataset of Dinu et al. (2015) ; Artetxe et al. (2017) in Table 3 , our Adversarial autoencoder + Conneau et al. (2018) Refinement performs better than other methods except Artetxe et al. (2018b) ." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-224", "text": "On average our method lags behind by about 2%." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-225", "text": "However, as mentioned, they follow a different refinement and mapping methods." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-226", "text": "For non-European and low-resource language pairs in Table 2 , our Adversarial autoencoder + Conneau et al. (2018) Refinement exhibits better performance than others in one translation task, where the model of Artetxe et al. (2018b) performs better in the rest." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-227", "text": "One important thing to notice here is that other unsupervised models (apart from ours and Artetxe et al. (2018b) ) fail to converge in one or more language pairs." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-228", "text": "We notice that the method of Artetxe et al. (2018b) gives better results than other baselines, even in some translation tasks they achieve the highest accuracy." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-229", "text": "To understand whether the improvements of their method are due to a better initial mapping or better post-processing, we conducted two additional experiments." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-230", "text": "In our first experiment, we use their method to induce the initial seed dictionary and then apply iterative Procrustes solution (same refinement procedure of Conneau et al. (2018) ) for refinement." 
}, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-231", "text": "Table 4 shows the results." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-232", "text": "Surprisingly, on both datasets their initial mappings fail to produce any reasonable results." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-233", "text": "So we suspect that the main gain in (Artetxe et al., 2018b) comes from their finetuning method, which they call robust self learn- ing." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-234", "text": "In our second experiment, we use the initial dictionary induced by our adversarial training and then apply their refinement procedure." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-235", "text": "Here for most of the translation tasks, we achieve better results; see the model Adversarial autoencoder + Artetxe et al. (2018b) Refinement in Tables 1 -3 ." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-236", "text": "These two experiments demonstrate that the quality of the initial dictionary induced by our model is far better than that of Artetxe et al. (2018b) ." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-237", "text": "----------------------------------" }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-238", "text": "**MODEL DISSECTION**" }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-239", "text": "We further analyze our model by dissecting it and measuring the contribution of each novel component that is proposed in this work." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-240", "text": "We achieve this by incrementally removing a new component from the model and evaluating it on different translation tasks." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-262", "text": "Here also we see gains across all datasets for our model, although the gains are not as verbose as before (about 1% on average)." 
}, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-241", "text": "In order to better understand the contribution of each component, we evaluate each model by measuring its P@1, P@5, and P@10 with fine-tuning and without fine-tuning." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-242", "text": "In case of without fine-tuning, the models apply the CSLS neighbor search directly on the mappings learned from the adversarial training, i.e., no Procrustes solution based refinement is done after the adversarial training." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-243", "text": "This setup allows us to compare our model directly with the adversarial model of Conneau et al. (2018) , putting the effect of finetuning aside." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-244", "text": "Table 5 presents the ablation results for En-Es, En-De, and En-It in both directions." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-245", "text": "The first row presents the results of Conneau et al. (2018) that uses adversarial training to map the word embeddings." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-246", "text": "The next row shows the results of our full model." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-247", "text": "The subsequent rows incrementally detach one component from our model." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-248", "text": "For example, -Enc." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-249", "text": "adv denotes the variant of our model where the target encoder is not trained on the adversarial loss (\u03b8 E X in Eq. 4); --Recon excludes the post-cycle reconstruction loss from -Enc." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-250", "text": "adv, and ---Cycle excludes the cycle consistency from --Recon." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-251", "text": "Thus, ---Cycle is a variant of our model that uses only adversarial loss to learn the mapping." 
}, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-252", "text": "However, it is important Table 5 : Ablation study of our adversarial autoencoder model on the dataset of Conneau et al. (2018)." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-253", "text": "to note that in contrast to Conneau et al. (2018) , our mapping is performed at the code space." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-254", "text": "As we compare our full model with the model of Conneau et al. (2018) in the without fine-tuning setting, we notice large improvements in all measures across all datasets: 5.1 -7.3% in En\u2192Es, 3 -6% in Es\u2192En, 3.4 -4.3% in En\u2192De, 1 -3% in De\u2192En, 3.4 -4.3% in En\u2192It, and 0.3 -3.7% in It\u2192En." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-255", "text": "These improvements demonstrate that our model finds a better mapping compared to Conneau et al. (2018) ." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-256", "text": "Among the three components, the cycle consistency is the most influential one across all languages." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-257", "text": "Training the target encoder adversarially also gives a significant boost." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-258", "text": "The reconstruction has less impact." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-259", "text": "If we compare the results of ---Cycle with Conneau-18, we see sizeable gains for En-Es in both directions." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-260", "text": "This shows the benefits of mapping at the code level." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-263", "text": "However, this is not surprising as it has been shown that iterative fine-tuning with Procrustes solution is a robust method that can recover many errors made in the initial mapping (Conneau et al., 2018) ." 
}, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-264", "text": "Given a good enough initial mapping, the measures converge nearly to the same point even though the differences were comparatively more substantial initially; for example, notice that the scores are very similar for P@5 and P@10 measures after fine-tuning." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-265", "text": "----------------------------------" }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-266", "text": "**CONCLUSIONS**" }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-267", "text": "We have proposed an adversarial autoencoder framework to learn the cross-lingual mapping of monolingual word embeddings of two languages in a completely unsupervised way." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-268", "text": "In contrast to the existing methods that directly map word embeddings, our method first learns to transform the embeddings into latent code vectors by pretraining an autoencoder." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-269", "text": "We apply adversarial training to map the distributions of the source and target code vectors." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-270", "text": "In our adversarial training, both the mapper and the target encoder are treated as generators that act jointly to fool the discriminator." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-271", "text": "To guide the mapping further, we include constraints for cycle consistency and post-cycle reconstruction." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-272", "text": "Through extensive experimentations on six different language pairs comprising European, nonEuropean and low-resource languages from two different data sources, we demonstrate that our method outperforms the method of Conneau et al. (2018) for all translation tasks in all measures (P@{1,5,10}) across all settings (with and without fine-tuning)." 
}, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-273", "text": "Comparison with other existing methods also shows that our method learns better mapping (not considering the fine-tuning)." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-274", "text": "With an ablation study, we further demonstrated that the cycle consistency is the most important component followed by the adversarial training of target encoder and the post-cycle reconstruction." }, { "sent_id": "5b14e259a557aa3cbfcbd6265f04c8-C001-275", "text": "In future work, we plan to incorporate knowledge from the similarity space in our adversarial framework." } ], "y": { "@BACK@": { "gold_contexts": [ [ "5b14e259a557aa3cbfcbd6265f04c8-C001-17" ], [ "5b14e259a557aa3cbfcbd6265f04c8-C001-20" ], [ "5b14e259a557aa3cbfcbd6265f04c8-C001-60" ], [ "5b14e259a557aa3cbfcbd6265f04c8-C001-72" ], [ "5b14e259a557aa3cbfcbd6265f04c8-C001-159" ], [ "5b14e259a557aa3cbfcbd6265f04c8-C001-164" ], [ "5b14e259a557aa3cbfcbd6265f04c8-C001-263" ] ], "cite_sentences": [ "5b14e259a557aa3cbfcbd6265f04c8-C001-17", "5b14e259a557aa3cbfcbd6265f04c8-C001-20", "5b14e259a557aa3cbfcbd6265f04c8-C001-60", "5b14e259a557aa3cbfcbd6265f04c8-C001-72", "5b14e259a557aa3cbfcbd6265f04c8-C001-159", "5b14e259a557aa3cbfcbd6265f04c8-C001-164", "5b14e259a557aa3cbfcbd6265f04c8-C001-263" ] }, "@MOT@": { "gold_contexts": [ [ "5b14e259a557aa3cbfcbd6265f04c8-C001-19", "5b14e259a557aa3cbfcbd6265f04c8-C001-20", "5b14e259a557aa3cbfcbd6265f04c8-C001-21" ], [ "5b14e259a557aa3cbfcbd6265f04c8-C001-191", "5b14e259a557aa3cbfcbd6265f04c8-C001-192" ] ], "cite_sentences": [ "5b14e259a557aa3cbfcbd6265f04c8-C001-20", "5b14e259a557aa3cbfcbd6265f04c8-C001-192" ] }, "@EXT@": { "gold_contexts": [ [ "5b14e259a557aa3cbfcbd6265f04c8-C001-20", "5b14e259a557aa3cbfcbd6265f04c8-C001-21" ] ], "cite_sentences": [ "5b14e259a557aa3cbfcbd6265f04c8-C001-20" ] }, "@USE@": { "gold_contexts": [ [ "5b14e259a557aa3cbfcbd6265f04c8-C001-20", "5b14e259a557aa3cbfcbd6265f04c8-C001-21" ], [ 
"5b14e259a557aa3cbfcbd6265f04c8-C001-85" ], [ "5b14e259a557aa3cbfcbd6265f04c8-C001-114" ], [ "5b14e259a557aa3cbfcbd6265f04c8-C001-152" ], [ "5b14e259a557aa3cbfcbd6265f04c8-C001-153" ], [ "5b14e259a557aa3cbfcbd6265f04c8-C001-155" ], [ "5b14e259a557aa3cbfcbd6265f04c8-C001-159", "5b14e259a557aa3cbfcbd6265f04c8-C001-160", "5b14e259a557aa3cbfcbd6265f04c8-C001-161" ], [ "5b14e259a557aa3cbfcbd6265f04c8-C001-170", "5b14e259a557aa3cbfcbd6265f04c8-C001-171" ], [ "5b14e259a557aa3cbfcbd6265f04c8-C001-181", "5b14e259a557aa3cbfcbd6265f04c8-C001-184" ], [ "5b14e259a557aa3cbfcbd6265f04c8-C001-182" ], [ "5b14e259a557aa3cbfcbd6265f04c8-C001-190" ], [ "5b14e259a557aa3cbfcbd6265f04c8-C001-196" ], [ "5b14e259a557aa3cbfcbd6265f04c8-C001-197" ], [ "5b14e259a557aa3cbfcbd6265f04c8-C001-199" ], [ "5b14e259a557aa3cbfcbd6265f04c8-C001-200", "5b14e259a557aa3cbfcbd6265f04c8-C001-205" ], [ "5b14e259a557aa3cbfcbd6265f04c8-C001-211" ], [ "5b14e259a557aa3cbfcbd6265f04c8-C001-214" ], [ "5b14e259a557aa3cbfcbd6265f04c8-C001-217", "5b14e259a557aa3cbfcbd6265f04c8-C001-218" ], [ "5b14e259a557aa3cbfcbd6265f04c8-C001-230" ], [ "5b14e259a557aa3cbfcbd6265f04c8-C001-243" ], [ "5b14e259a557aa3cbfcbd6265f04c8-C001-252" ] ], "cite_sentences": [ "5b14e259a557aa3cbfcbd6265f04c8-C001-20", "5b14e259a557aa3cbfcbd6265f04c8-C001-85", "5b14e259a557aa3cbfcbd6265f04c8-C001-114", "5b14e259a557aa3cbfcbd6265f04c8-C001-152", "5b14e259a557aa3cbfcbd6265f04c8-C001-153", "5b14e259a557aa3cbfcbd6265f04c8-C001-155", "5b14e259a557aa3cbfcbd6265f04c8-C001-159", "5b14e259a557aa3cbfcbd6265f04c8-C001-160", "5b14e259a557aa3cbfcbd6265f04c8-C001-161", "5b14e259a557aa3cbfcbd6265f04c8-C001-171", "5b14e259a557aa3cbfcbd6265f04c8-C001-181", "5b14e259a557aa3cbfcbd6265f04c8-C001-184", "5b14e259a557aa3cbfcbd6265f04c8-C001-182", "5b14e259a557aa3cbfcbd6265f04c8-C001-190", "5b14e259a557aa3cbfcbd6265f04c8-C001-196", "5b14e259a557aa3cbfcbd6265f04c8-C001-197", "5b14e259a557aa3cbfcbd6265f04c8-C001-199", "5b14e259a557aa3cbfcbd6265f04c8-C001-200", 
"5b14e259a557aa3cbfcbd6265f04c8-C001-205", "5b14e259a557aa3cbfcbd6265f04c8-C001-211", "5b14e259a557aa3cbfcbd6265f04c8-C001-214", "5b14e259a557aa3cbfcbd6265f04c8-C001-217", "5b14e259a557aa3cbfcbd6265f04c8-C001-218", "5b14e259a557aa3cbfcbd6265f04c8-C001-230", "5b14e259a557aa3cbfcbd6265f04c8-C001-243", "5b14e259a557aa3cbfcbd6265f04c8-C001-252" ] }, "@DIF@": { "gold_contexts": [ [ "5b14e259a557aa3cbfcbd6265f04c8-C001-35" ], [ "5b14e259a557aa3cbfcbd6265f04c8-C001-72", "5b14e259a557aa3cbfcbd6265f04c8-C001-75" ], [ "5b14e259a557aa3cbfcbd6265f04c8-C001-104" ], [ "5b14e259a557aa3cbfcbd6265f04c8-C001-181" ], [ "5b14e259a557aa3cbfcbd6265f04c8-C001-197" ], [ "5b14e259a557aa3cbfcbd6265f04c8-C001-200", "5b14e259a557aa3cbfcbd6265f04c8-C001-201", "5b14e259a557aa3cbfcbd6265f04c8-C001-204", "5b14e259a557aa3cbfcbd6265f04c8-C001-205", "5b14e259a557aa3cbfcbd6265f04c8-C001-206" ], [ "5b14e259a557aa3cbfcbd6265f04c8-C001-207", "5b14e259a557aa3cbfcbd6265f04c8-C001-208", "5b14e259a557aa3cbfcbd6265f04c8-C001-209", "5b14e259a557aa3cbfcbd6265f04c8-C001-210" ], [ "5b14e259a557aa3cbfcbd6265f04c8-C001-211" ], [ "5b14e259a557aa3cbfcbd6265f04c8-C001-243" ], [ "5b14e259a557aa3cbfcbd6265f04c8-C001-254" ], [ "5b14e259a557aa3cbfcbd6265f04c8-C001-255" ], [ "5b14e259a557aa3cbfcbd6265f04c8-C001-259" ], [ "5b14e259a557aa3cbfcbd6265f04c8-C001-272" ], [ "5b14e259a557aa3cbfcbd6265f04c8-C001-253" ] ], "cite_sentences": [ "5b14e259a557aa3cbfcbd6265f04c8-C001-35", "5b14e259a557aa3cbfcbd6265f04c8-C001-72", "5b14e259a557aa3cbfcbd6265f04c8-C001-75", "5b14e259a557aa3cbfcbd6265f04c8-C001-104", "5b14e259a557aa3cbfcbd6265f04c8-C001-181", "5b14e259a557aa3cbfcbd6265f04c8-C001-197", "5b14e259a557aa3cbfcbd6265f04c8-C001-200", "5b14e259a557aa3cbfcbd6265f04c8-C001-201", "5b14e259a557aa3cbfcbd6265f04c8-C001-204", "5b14e259a557aa3cbfcbd6265f04c8-C001-205", "5b14e259a557aa3cbfcbd6265f04c8-C001-206", "5b14e259a557aa3cbfcbd6265f04c8-C001-207", "5b14e259a557aa3cbfcbd6265f04c8-C001-208", "5b14e259a557aa3cbfcbd6265f04c8-C001-209", 
"5b14e259a557aa3cbfcbd6265f04c8-C001-210", "5b14e259a557aa3cbfcbd6265f04c8-C001-211", "5b14e259a557aa3cbfcbd6265f04c8-C001-243", "5b14e259a557aa3cbfcbd6265f04c8-C001-254", "5b14e259a557aa3cbfcbd6265f04c8-C001-255", "5b14e259a557aa3cbfcbd6265f04c8-C001-259", "5b14e259a557aa3cbfcbd6265f04c8-C001-272", "5b14e259a557aa3cbfcbd6265f04c8-C001-253" ] }, "@SIM@": { "gold_contexts": [ [ "5b14e259a557aa3cbfcbd6265f04c8-C001-114" ], [ "5b14e259a557aa3cbfcbd6265f04c8-C001-181" ], [ "5b14e259a557aa3cbfcbd6265f04c8-C001-197" ], [ "5b14e259a557aa3cbfcbd6265f04c8-C001-211" ], [ "5b14e259a557aa3cbfcbd6265f04c8-C001-243" ] ], "cite_sentences": [ "5b14e259a557aa3cbfcbd6265f04c8-C001-114", "5b14e259a557aa3cbfcbd6265f04c8-C001-181", "5b14e259a557aa3cbfcbd6265f04c8-C001-197", "5b14e259a557aa3cbfcbd6265f04c8-C001-211", "5b14e259a557aa3cbfcbd6265f04c8-C001-243" ] } } }, "ABC_d3122aab8960a7c89afe87c73faa59_1": { "x": [ { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-2", "text": "We analyze the performance of different sentiment classification models on syntacticallycomplex inputs like A-but-B sentences." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-3", "text": "The first contribution of this analysis addresses reproducible research: to meaningfully compare different models, their accuracies must be averaged over far more random seeds than what has traditionally been reported." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-4", "text": "With proper averaging in place, we notice that the distillation model described in Hu et al. (2016) , which incorporates explicit logic rules for sentiment classification, is ineffective." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-5", "text": "In contrast, using contextualized ELMo embeddings (Peters et al., 2018a) instead of logic rules yields significantly better performance." 
}, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-6", "text": "Additionally, we provide analysis and visualizations that demonstrate ELMo's ability to implicitly learn logic rules." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-7", "text": "Finally, a crowdsourced analysis reveals how ELMo outperforms baseline models even on sentences with ambiguous sentiment labels." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-8", "text": "----------------------------------" }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-10", "text": "In this paper, we explore the effectiveness of methods designed to improve sentiment classification (positive vs. negative) of sentences that contain complex syntactic structures." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-11", "text": "While simple bag-of-words or lexicon-based methods (Pang and Lee, 2005; Wang and Manning, 2012; Iyyer et al., 2015) achieve good performance on this task, they are unequipped to deal with syntactic structures that affect sentiment, such as contrastive conjunctions (i.e., sentences of the form \"A-but-B\") or negations." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-12", "text": "Neural models that explicitly encode word order (Kim, 2014) , syntax (Socher et al., 2013; Tai et al., 2015) and semantic features (Li et al., 2017) have been proposed with the aim of improving performance on these more complicated sentences." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-13", "text": "Recently, Hu et al. (2016) incorporate logical rules into a neural model and show that these rules increase the model's accuracy on sentences containing contrastive conjunctions, while Peters et al. (2018a) demonstrate increased overall accuracy on sentiment analysis by initializing a model with representations from a language model trained on millions of sentences." 
}, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-14", "text": "In this work, we carry out an in-depth study of the effectiveness of the techniques in Hu et al. (2016) and Peters et al. (2018a) for sentiment classification of complex sentences." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-15", "text": "Part of our contribution is to identify an important gap in the methodology used in Hu et al. (2016) for performance measurement, which is addressed by averaging the experiments over several executions." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-16", "text": "With the averaging in place, we obtain three key findings: (1) the improvements in Hu et al. (2016) can almost entirely be attributed to just one of their two proposed mechanisms and are also less pronounced than previously reported; (2) contextualized word embeddings (Peters et al., 2018a) incorporate the \"A-but-B\" rules more effectively without explicitly programming for them; and (3) an analysis using crowdsourcing reveals a bigger picture where the errors in the automated systems have a striking correlation with the inherent sentiment-ambiguity in the data." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-17", "text": "----------------------------------" }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-18", "text": "**LOGIC RULES IN SENTIMENT CLASSIFICATION**" }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-19", "text": "Here we briefly review background from Hu et al. (2016) to provide a foundation for our reanalysis in the next section." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-63", "text": "with early stopping on the development set." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-20", "text": "We focus on a logic rule for sentences containing an \"A-but-B\" structure (the only rule for which Hu et al. (2016) provide experimental results)." 
}, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-21", "text": "Intuitively, the logic rule for such sentences is that the sentiment associated with the whole sentence should be the same as the sentiment associated with phrase \"B\"." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-22", "text": "1 More formally, let p \u03b8 (y|x) denote the probability assigned to the label y \u2208 {+, \u2212} for an input x by the baseline model using parameters \u03b8." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-23", "text": "A logic rule is (softly) encoded as a variable r \u03b8 (x, y) \u2208 [0, 1] indicating how well labeling x with y satisfies the rule." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-24", "text": "For the case of A-but-B sentences, r \u03b8 (x, y) = p \u03b8 (y|B) if x has the structure A-but-B (and 1 otherwise)." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-25", "text": "Next, we discuss the two techniques from Hu et al. (2016) for incorporating rules into models: projection, which directly alters a trained model, and distillation, which progressively adjusts the loss function during training." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-26", "text": "Projection." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-27", "text": "The first technique is to project a trained model into a rule-regularized subspace, in a fashion similar to Ganchev et al. (2010) ." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-28", "text": "More precisely, a given model p \u03b8 is projected to a model q \u03b8 defined by the optimum value of q in the following optimization problem: 2 min q,\u03be\u22650" }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-29", "text": "Here q(X, Y ) denotes the distribution of (x, y) when x is drawn uniformly from the set X and y is drawn according to q(\u00b7|x)." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-30", "text": "Iterative Rule Knowledge Distillation." 
}, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-31", "text": "The second technique is to transfer the domain knowledge encoded in the logic rules into a neural network's parameters." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-32", "text": "Following Hinton et al. (2015) , a \"student\" model p \u03b8 can learn from the \"teacher\" model q \u03b8 , by using a loss function \u03c0H(p \u03b8 , P true ) + (1 \u2212 \u03c0)H(p \u03b8 , q \u03b8 ) during training, where P true denotes the distribution implied by the ground truth, H(\u00b7, \u00b7) denotes the cross-entropy function, and \u03c0 is a hyperparameter." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-33", "text": "Hu et al. (2016) computes q \u03b8 after every gradient update by projecting the current p \u03b8 , as described above." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-34", "text": "Note that both mechanisms can be combined: After fully training p \u03b8 using the iterative distillation process above, the projection step can be applied one more time to obtain q \u03b8 which is then used as the trained model." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-35", "text": "Dataset." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-36", "text": "All of our experiments (as well as those in Hu et al. (2016) ) use the SST2 dataset, a binarized subset of the popular Stanford Sentiment Treebank (SST) (Socher et al., 2013) ." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-37", "text": "The dataset includes phrase-level labels in addition to sentence-level labels (see Table 1 for detailed statistics); following Hu et al. (2016) , we use both types of labels for the comparisons in Section 3.2." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-38", "text": "In all other experiments, we use only sentencelevel labels, and our baseline model for all experiments is the CNN architecture from Kim (2014) ." 
}, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-39", "text": "----------------------------------" }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-40", "text": "**A REANALYSIS**" }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-41", "text": "In this section we reanalyze the effectiveness of the techniques of Hu et al. (2016) and find that most of the performance gain is due to projection and not knowledge distillation." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-42", "text": "The discrepancy with the original analysis can be attributed to the relatively small dataset and the resulting variance across random initializations." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-43", "text": "We start by analyzing the baseline CNN by Kim (2014) to point out the need for an averaged analysis." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-44", "text": "----------------------------------" }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-45", "text": "**IMPORTANCE OF AVERAGING**" }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-46", "text": "We run the baseline CNN by Kim (2014) across 100 random seeds, training on sentence-level la- bels." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-47", "text": "We observe a large amount of variation from run-to-run, which is unsurprising given the small dataset size." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-48", "text": "The inset density plot in Figure 1 shows the range of accuracies (83.47 to 87.20) along with 25, 50 and 75 percentiles." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-49", "text": "3 The figure also shows how the variance persists even after the average converges: the accuracies of 100 models trained for 20 epochs each are plotted in gray, and their average is shown in red." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-50", "text": "We conclude that, to be reproducible, only averaged accuracies should be reported in this task and dataset." 
}, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-51", "text": "This mirrors the conclusion from a detailed analysis by Reimers and Gurevych (2017) in the context of named entity recognition." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-52", "text": "----------------------------------" }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-53", "text": "**PERFORMANCE OF HU ET AL. (2016)**" }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-54", "text": "We carry out an averaged analysis of the publicly available implementation 4 of Hu et al. (2016) ." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-55", "text": "Our analysis reveals that the reported performance of their two mechanisms (projection and distillation) is in fact affected by the high variability across random seeds." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-56", "text": "Our more robust averaged analysis yields a somewhat different conclusion of their effectiveness." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-57", "text": "In Figure 2 , the first two columns show the reported accuracies in Hu et al. (2016) for models trained with and without distillation (corresponding to using values \u03c0 = 1 and \u03c0 = 0.95 t in the t th epoch, respectively)." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-58", "text": "The two rows show the results for models with and without a final projection into the rule-regularized space." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-59", "text": "We keep our hyper-parameters identical to Hu et al. (2016) ." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-60", "text": "5 The baseline system (no-project, no-distill) is identical to the system of Kim (2014) ." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-61", "text": "All the systems are trained on the phrase-level SST2 dataset 3 We use early stopping based on validation performance for all models in the density plot." 
}, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-62", "text": "4 https://github.com/ZhitingHu/logicnn/ 5 In particular, C = 6 for projection." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-86", "text": "**KL DIVERGENCE ANALYSIS:**" }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-64", "text": "The number inside each arrow indicates the improvement in accuracy by adding either the projection or the distillation component to the training algorithm." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-65", "text": "Note that the reported figures suggest that while both components help in improving accuracy, the distillation component is much more helpful than the projection component." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-66", "text": "The next two columns, which show the results of repeating the above analysis after averaging over 100 random seeds, contradict this claim." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-67", "text": "The averaged figures show lower overall accuracy increases, and, more importantly, they attribute these improvements almost entirely to the projection component rather than the distillation component." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-68", "text": "To confirm this result, we repeat our averaged analysis restricted to only \"A-but-B\" sentences targeted by the rule (shown in the last two columns)." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-69", "text": "We again observe that the effect of projection is pronounced, while distillation offers little or no advantage in comparison." 
}, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-70", "text": "----------------------------------" }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-71", "text": "**CONTEXTUALIZED WORD EMBEDDINGS**" }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-72", "text": "Traditional context-independent word embeddings like word2vec (Mikolov et al., 2013) or GloVe (Pennington et al., 2014) are fixed vectors for every word in the vocabulary." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-73", "text": "In contrast, contextualized embeddings are dynamic representations, dependent on the current context of the word." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-74", "text": "We hypothesize that contextualized word embeddings might inherently capture these logic rules due to increasing the effective context size for the CNN layer in Kim (2014) ." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-75", "text": "Following the recent success of ELMo (Peters et al., 2018a) in sentiment analysis, we utilize the TensorFlow Hub implementation of ELMo 6 and feed these contextualized embeddings into our CNN model." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-76", "text": "We fine-tune the ELMo LSTM weights along with the CNN weights on the downstream CNN task." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-77", "text": "As in Section 3, we check performance with and without the final projection into the rule-regularized space." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-78", "text": "We present our results in Table 2 ." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-79", "text": "Switching to ELMo word embeddings improves performance by 2.9 percentage points on an average, corresponding to about 53 test sentences." 
}, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-80", "text": "Of these, about 32 sentences (60% of the improvement) correspond to A-but-B and negation style sentences, which is substantial when considering that only 24.5% of test sentences include these discourse relations (Table 1 )." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-81", "text": "As further evidence that ELMo helps on these specific constructions, the non-ELMo baseline model (no-project, no-distill) gets 255 sentences wrong in the test corpus on average, only 89 (34.8%) of which are A-but-B style or negations." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-82", "text": "Statistical Significance: Using a two-sided Kolmogorov-Smirnov statistic (Massey Jr, 1951) with \u03b1 = 0.001 for the results in Table 2 , we find that ELMo and projection each yield statistically significant improvements, but distillation does not." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-83", "text": "Also, with ELMo, projection is not significant." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-84", "text": "Specific comparisons have been added in the Appendix, in Table A3 ." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-85", "text": "----------------------------------" }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-87", "text": "We observe no significant gains by projecting a trained ELMo model into an A-but-B rule-regularized space, unlike the other models." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-88", "text": "We confirm that ELMo's predictions are much closer to the A-but-B rule's manifold than those of the other models by computing KL(q \u03b8 ||p \u03b8 ) where p \u03b8 and q \u03b8 are the original and projected distributions: Averaged across all A-but-B sentences and 100 seeds, this gives 0.27, 0.26 and 0.13 for the Kim (2014) , Hu et al. (2016) with distillation and ELMo systems respectively." 
}, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-89", "text": "Intra-sentence Similarity: To understand the information captured by ELMo embeddings for A-but-B sentences, we measure the cosine similarity between the word vectors of every pair of words within the A-but-B sentence (Peters et al., 2018b) ." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-90", "text": "We compare the intra-sentence similarity for fine-tuned word2vec embeddings (baseline), ELMo embeddings without fine-tuning and finally fine-tuned ELMo embeddings in Figure 3 ." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-91", "text": "In the fine-tuned ELMo embeddings, we notice the words within the A and within the B part of the A-but-B sentence share the same part of the vector space." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-92", "text": "This pattern is less visible in the Table 2 : Average performance (across 100 seeds) of ELMo on the SST2 task." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-93", "text": "We show performance on A-but-B sentences (\"but\"), negations (\"neg\")." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-94", "text": "ELMo embeddings without fine-tuning and absent in the word2vec embeddings." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-95", "text": "This observation is indicative of ELMo's ability to learn specific rules for A-but-B sentences in sentiment classification." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-96", "text": "More intra-sentence similarity heatmaps for A-but-B sentences are in Figure A1 ." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-97", "text": "----------------------------------" }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-98", "text": "**CROWDSOURCED EXPERIMENTS**" }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-99", "text": "We conduct a crowdsourced analysis that reveals that SST2 data has significant levels of ambiguity even for human labelers." 
}, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-100", "text": "We discover that ELMo's performance improvements over the baseline are robust across varying levels of ambiguity, whereas the advantage of Hu et al. (2016) is reversed in sentences of low ambiguity (restricting to A-but-B style sentences)." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-101", "text": "Our crowdsourced experiment was conducted on Figure Eight ." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-102", "text": "8 Nine workers scored the sentiment of each A-but-B and negation sentence in the test SST2 split as 0 (negative), 0.5 (neutral) or 1 (positive)." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-103", "text": "(SST originally had three crowdworkers choose a sentiment rating from 1 to 25 for every phrase.) More details regarding the crowd experiment's parameters have been provided in Appendix A." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-104", "text": "We average the scores across all users for each sentence." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-105", "text": "Sentences with a score in the range (x, 1] are marked as positive (where x \u2208 [0.5, 1)), sentences in [0, 1 \u2212 x) marked as negative, and sentences in [1 \u2212 x, x] are marked as neutral." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-106", "text": "For instance, \"flat , but with a revelatory performance by michelle williams\" (score=0.56) is neutral when x = 0.6." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-107", "text": "9 We present statistics of our dataset 10 in Table 3 ." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-108", "text": "Inter-annotator agree-7 Trained on sentences and not phrase-level labels for a fair comparison with baseline and ELMo, unlike Section 3.2." 
}, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-109", "text": "8 https://www.figure-eight.com/ 9 More examples of neutral sentences have been provided in the Appendix in Table A1 , as well as a few \"flipped\" sentences receiving an average score opposite to their SST2 label (Table A2) ." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-110", "text": "10 The dataset along with source code can be found in ment was computed using Fleiss' Kappa (\u03ba)." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-111", "text": "As expected, inter-annotator agreement is higher for higher thresholds (less ambiguous sentences)." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-112", "text": "According to Landis and Koch (1977) , \u03ba \u2208 (0.2, 0.4] corresponds to \"fair agreement\", whereas \u03ba \u2208 (0.4, 0.6] corresponds to \"moderate agreement\"." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-113", "text": "We next compute the accuracy of our model for each threshold by removing the corresponding neutral sentences." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-114", "text": "Higher thresholds correspond to sets of less ambiguous sentences." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-115", "text": "Table 3 shows that ELMo's performance gains in Table 2 extends across all thresholds." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-116", "text": "In Figure 4 we compare all the models on the A-but-B sentences in this set." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-117", "text": "Across all thresholds, we notice trends similar to previous sections: 1) ELMo performs the best among all models on A-but-B style sentences, and projection results in only a slight improvement; 2) models in Hu et al. (2016) (with and without distillation) benefit considerably from projection; but 3) distillation offers little improvement (with or without projection)." 
}, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-118", "text": "Also, as the ambiguity threshold increases, we see decreasing gains from projection on all models." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-119", "text": "In fact, beyond the 0.85 threshold, projection degrades the average performance, indicating that projection is useful for more ambiguous sentences." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-120", "text": "----------------------------------" }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-121", "text": "**CONCLUSION**" }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-122", "text": "We present an analysis comparing techniques for incorporating logic rules into sentiment classification systems." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-123", "text": "Our analysis included a metastudy highlighting the issue of stochasticity in performance across runs and the inherent ambiguity in the sentiment classification task itself, which was tackled using an averaged analysis and https://github.com/martiansideofthemoon/ logic-rules-sentiment." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-124", "text": "Table 3 : Number of sentences in the crowdsourced study (447 sentences) which got marked as neutral and which got the opposite of their labels in the SST2 dataset, using various thresholds." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-125", "text": "Inter-annotator agreement is computed using Fleiss' Kappa." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-126", "text": "Average accuracies of the baseline and ELMo (over 100 seeds) on non-neutral sentences are also shown." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-127", "text": "a crowdsourced experiment identifying ambiguous sentences." 
}, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-128", "text": "We present evidence that a recently proposed contextualized word embedding model (ELMo) (Peters et al., 2018a) implicitly learns logic rules for sentiment classification of complex sentences like A-but-B sentences." }, { "sent_id": "d3122aab8960a7c89afe87c73faa59-C001-129", "text": "Future work includes a fine-grained quantitative study of ELMo word vectors for logically complex sentences along the lines of Peters et al. (2018b) ." } ], "y": { "@DIF@": { "gold_contexts": [ [ "d3122aab8960a7c89afe87c73faa59-C001-4" ], [ "d3122aab8960a7c89afe87c73faa59-C001-100" ], [ "d3122aab8960a7c89afe87c73faa59-C001-117" ] ], "cite_sentences": [ "d3122aab8960a7c89afe87c73faa59-C001-4", "d3122aab8960a7c89afe87c73faa59-C001-100", "d3122aab8960a7c89afe87c73faa59-C001-117" ] }, "@MOT@": { "gold_contexts": [ [ "d3122aab8960a7c89afe87c73faa59-C001-4" ], [ "d3122aab8960a7c89afe87c73faa59-C001-15" ] ], "cite_sentences": [ "d3122aab8960a7c89afe87c73faa59-C001-4", "d3122aab8960a7c89afe87c73faa59-C001-15" ] }, "@USE@": { "gold_contexts": [ [ "d3122aab8960a7c89afe87c73faa59-C001-3", "d3122aab8960a7c89afe87c73faa59-C001-4" ], [ "d3122aab8960a7c89afe87c73faa59-C001-14" ], [ "d3122aab8960a7c89afe87c73faa59-C001-15" ], [ "d3122aab8960a7c89afe87c73faa59-C001-20" ], [ "d3122aab8960a7c89afe87c73faa59-C001-16" ], [ "d3122aab8960a7c89afe87c73faa59-C001-25" ], [ "d3122aab8960a7c89afe87c73faa59-C001-36" ], [ "d3122aab8960a7c89afe87c73faa59-C001-37" ], [ "d3122aab8960a7c89afe87c73faa59-C001-41" ], [ "d3122aab8960a7c89afe87c73faa59-C001-53", "d3122aab8960a7c89afe87c73faa59-C001-54", "d3122aab8960a7c89afe87c73faa59-C001-55" ], [ "d3122aab8960a7c89afe87c73faa59-C001-57" ], [ "d3122aab8960a7c89afe87c73faa59-C001-59" ], [ "d3122aab8960a7c89afe87c73faa59-C001-88" ], [ "d3122aab8960a7c89afe87c73faa59-C001-100" ] ], "cite_sentences": [ "d3122aab8960a7c89afe87c73faa59-C001-4", "d3122aab8960a7c89afe87c73faa59-C001-14", 
"d3122aab8960a7c89afe87c73faa59-C001-15", "d3122aab8960a7c89afe87c73faa59-C001-20", "d3122aab8960a7c89afe87c73faa59-C001-16", "d3122aab8960a7c89afe87c73faa59-C001-25", "d3122aab8960a7c89afe87c73faa59-C001-36", "d3122aab8960a7c89afe87c73faa59-C001-37", "d3122aab8960a7c89afe87c73faa59-C001-41", "d3122aab8960a7c89afe87c73faa59-C001-53", "d3122aab8960a7c89afe87c73faa59-C001-54", "d3122aab8960a7c89afe87c73faa59-C001-55", "d3122aab8960a7c89afe87c73faa59-C001-57", "d3122aab8960a7c89afe87c73faa59-C001-59", "d3122aab8960a7c89afe87c73faa59-C001-88", "d3122aab8960a7c89afe87c73faa59-C001-100" ] }, "@BACK@": { "gold_contexts": [ [ "d3122aab8960a7c89afe87c73faa59-C001-13" ], [ "d3122aab8960a7c89afe87c73faa59-C001-19" ], [ "d3122aab8960a7c89afe87c73faa59-C001-33" ], [ "d3122aab8960a7c89afe87c73faa59-C001-53" ] ], "cite_sentences": [ "d3122aab8960a7c89afe87c73faa59-C001-13", "d3122aab8960a7c89afe87c73faa59-C001-19", "d3122aab8960a7c89afe87c73faa59-C001-33", "d3122aab8960a7c89afe87c73faa59-C001-53" ] }, "@EXT@": { "gold_contexts": [ [ "d3122aab8960a7c89afe87c73faa59-C001-15" ], [ "d3122aab8960a7c89afe87c73faa59-C001-16" ], [ "d3122aab8960a7c89afe87c73faa59-C001-41" ], [ "d3122aab8960a7c89afe87c73faa59-C001-100" ], [ "d3122aab8960a7c89afe87c73faa59-C001-117" ] ], "cite_sentences": [ "d3122aab8960a7c89afe87c73faa59-C001-15", "d3122aab8960a7c89afe87c73faa59-C001-16", "d3122aab8960a7c89afe87c73faa59-C001-41", "d3122aab8960a7c89afe87c73faa59-C001-100", "d3122aab8960a7c89afe87c73faa59-C001-117" ] } } }, "ABC_b2cb08afadadeddc0f8e7267163c0e_1": { "x": [ { "sent_id": "b2cb08afadadeddc0f8e7267163c0e-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "b2cb08afadadeddc0f8e7267163c0e-C001-2", "text": "Subjectivity and sentiment analysis focuses on the automatic identification of private states, such as opinions, emotions, sentiments, evaluations, beliefs, and speculations in natural language." 
}, { "sent_id": "b2cb08afadadeddc0f8e7267163c0e-C001-3", "text": "While subjectivity classification labels text as either subjective or objective, sentiment classification adds an additional level of granularity by further classifying subjective text as either positive, negative or neutral." }, { "sent_id": "b2cb08afadadeddc0f8e7267163c0e-C001-4", "text": "----------------------------------" }, { "sent_id": "b2cb08afadadeddc0f8e7267163c0e-C001-5", "text": "****" }, { "sent_id": "b2cb08afadadeddc0f8e7267163c0e-C001-6", "text": "While much of the research work in this area has been applied to English, research on other languages is growing, including Japanese, Chinese, German, Spanish, and Romanian." }, { "sent_id": "b2cb08afadadeddc0f8e7267163c0e-C001-7", "text": "While most of the researchers in the field are familiar with the methods applied to English, few of them have closely looked at the original research carried out in other languages." }, { "sent_id": "b2cb08afadadeddc0f8e7267163c0e-C001-8", "text": "For example, in languages such as Chinese, researchers have been looking at the ability of characters to carry sentiment information (Ku et al., 2005; Xiang, 2011)." }, { "sent_id": "b2cb08afadadeddc0f8e7267163c0e-C001-9", "text": "In Romanian, due to markers of politeness and additional verbal modes embedded in the language, experiments have hinted that subjectivity detection may be easier to achieve (Banea et al., 2008)." }, { "sent_id": "b2cb08afadadeddc0f8e7267163c0e-C001-10", "text": "These additional sources of information may not be available across all languages; yet various articles have pointed out that by investigating a synergistic approach for detecting subjectivity and sentiment in multiple languages at the same time, improvements can be achieved not only in other languages, but in English as well."
}, { "sent_id": "b2cb08afadadeddc0f8e7267163c0e-C001-11", "text": "The development of and interest in these methods are also highly motivated by the fact that only 27% of Internet users speak English (www.internetworldstats.com/stats.htm, Oct 11, 2011), and that number diminishes further every year, as more people across the globe gain Internet access." }, { "sent_id": "b2cb08afadadeddc0f8e7267163c0e-C001-12", "text": "The aim of this tutorial is to familiarize the attendees with the subjectivity and sentiment research carried out on languages other than English in order to enable and promote cross-fertilization." }, { "sent_id": "b2cb08afadadeddc0f8e7267163c0e-C001-13", "text": "Specifically, we will review work along three main directions." }, { "sent_id": "b2cb08afadadeddc0f8e7267163c0e-C001-14", "text": "First, we will present methods where the resources and tools have been specifically developed for a given target language." }, { "sent_id": "b2cb08afadadeddc0f8e7267163c0e-C001-15", "text": "In this category, we will also briefly overview the main methods that have been proposed for English, but which can be easily ported to other languages." }, { "sent_id": "b2cb08afadadeddc0f8e7267163c0e-C001-16", "text": "Second, we will describe cross-lingual approaches, including several methods that have been proposed to leverage the resources and tools available in English by using cross-lingual projections." }, { "sent_id": "b2cb08afadadeddc0f8e7267163c0e-C001-17", "text": "Finally, we will show how the expression of opinions and polarity pervades language boundaries, and thus methods that holistically explore multiple languages at the same time can be effectively considered."
} ], "y": { "@BACK@": { "gold_contexts": [ [ "b2cb08afadadeddc0f8e7267163c0e-C001-9" ] ], "cite_sentences": [ "b2cb08afadadeddc0f8e7267163c0e-C001-9" ] }, "@MOT@": { "gold_contexts": [ [ "b2cb08afadadeddc0f8e7267163c0e-C001-7", "b2cb08afadadeddc0f8e7267163c0e-C001-8", "b2cb08afadadeddc0f8e7267163c0e-C001-9" ] ], "cite_sentences": [ "b2cb08afadadeddc0f8e7267163c0e-C001-9" ] } } }, "ABC_3b0a82129333203eca96a7473095f3_1": { "x": [ { "sent_id": "3b0a82129333203eca96a7473095f3-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "3b0a82129333203eca96a7473095f3-C001-2", "text": "In a recent paper advocating a corpus-based and probabilistic approach to grammar development, Black, Lafferty, and Roukos (1992) argue that \"the current state of the art is far from being able to produce a robust parser of general English\" and advocate \"steady and quantifiable,\" empirically corpus-driven grammar development and testing." }, { "sent_id": "3b0a82129333203eca96a7473095f3-C001-3", "text": "Black et al. are addressing a community in which armchair introspection has been and still is the dominant methodology in many quarters, but in some parts of Europe, corpus linguistics never died." }, { "sent_id": "3b0a82129333203eca96a7473095f3-C001-4", "text": "For nearly two decades, the Nijmegen group led by Jan Aarts have been undertaking corpus analyses that, although motivated primarily by the desire to study language variation using corpus data, are particularly relevant to the issue of broad-coverage grammar development." }, { "sent_id": "3b0a82129333203eca96a7473095f3-C001-5", "text": "In distinction to other groups undertaking corpus-based work (e.g., Garside, Leech, and Sampson 1987), the Nijmegen group has consistently adopted the position that it is possible and desirable to develop a formal, generative grammar that characterizes the syntactic properties of a given corpus and can be used to assign appropriate analyses to each of its sentences." 
}, { "sent_id": "3b0a82129333203eca96a7473095f3-C001-6", "text": "Nelleke Oostdijk's book provides a detailed description of the cumulative development of a grammar capable of analyzing a one million-word corpus of English written texts, drawn from a wide but balanced variety of sources." }, { "sent_id": "3b0a82129333203eca96a7473095f3-C001-7", "text": "This task forms a significant component of the wider Tools for Syntactic Corpus Analysis (TOSCA) project being undertaken at Nijmegen." }, { "sent_id": "3b0a82129333203eca96a7473095f3-C001-8", "text": "Oostdijk's work provides an excellent example of the strengths and weaknesses of the approach advocated by Black et al. In addition, she discusses issues such as sampling and tokenization of corpus material, as well as the exploitation of the analyzed corpus in studies of language variation." }, { "sent_id": "3b0a82129333203eca96a7473095f3-C001-9", "text": "However, in this review I will concentrate on the central core of her book: the development of the grammar and performance of the associated parser, since this is the part that is most relevant to computational linguistics." }, { "sent_id": "3b0a82129333203eca96a7473095f3-C001-10", "text": "Oostdijk begins by locating her work and the TOSCA project within the field of computational linguistics (arguing that it is distinguished by \"an interest in language itself as it is actually produced\" (p. 2)) and contrasting it to the LSP system (Sager 1981) and Parsifal (Marcus 1980)." }, { "sent_id": "3b0a82129333203eca96a7473095f3-C001-11", "text": "The comparison is brief and the choice odd, since more general broad-coverage grammars, such as DIAGRAM (Robinson 1982), PEG (Jensen et al. 1986) and ANLT (Grover et al. 1989), and more corpus-oriented parsing systems, such as FIDDITCH (Hindle 1983, 1993) or MITFP (de Marcken 1990), have been developed within the field but are not discussed anywhere."
}, { "sent_id": "3b0a82129333203eca96a7473095f3-C001-12", "text": "A similar suspicion of isolationism recurs in the sections dealing with the grammatical formalism used;" }, { "sent_id": "3b0a82129333203eca96a7473095f3-C001-13", "text": "----------------------------------" }, { "sent_id": "3b0a82129333203eca96a7473095f3-C001-14", "text": "****" }, { "sent_id": "3b0a82129333203eca96a7473095f3-C001-15", "text": "this is based on (extended) affix grammar (Koster 1971) and, although only described informally, the variant of affix grammar adopted is probably similar in generative and expressive capacity to unification-based formalisms, such as PATR-II (Shieber 1986) or the ANLT formalism (Briscoe et al. 1987), with some interesting extensions making it more adequate for phenomena such as agreement in coordinate structures." }, { "sent_id": "3b0a82129333203eca96a7473095f3-C001-16", "text": "Unfortunately, no comparison is offered." }, { "sent_id": "3b0a82129333203eca96a7473095f3-C001-17", "text": "More discussion is devoted to comparison with the approach to corpus analysis taken by the Lancaster group (Garside et al. 1987); Oostdijk argues that because their espousal of probabilistic methods and rejection of a rule-based generative approach is not founded on sound empirical evidence, it is impossible to develop a comprehensive generative grammar for a corpus." }, { "sent_id": "3b0a82129333203eca96a7473095f3-C001-18", "text": "While I am sympathetic to Oostdijk's position and think that the grammar she goes on to present is impressive enough to bias us towards the opposite conclusion, it is a mistake to accept the assumption that the two approaches are incompatible, as much recent work (including that of Black et al. 1992) has demonstrated the usefulness of combining statistical techniques with rule-based systems."
}, { "sent_id": "3b0a82129333203eca96a7473095f3-C001-19", "text": "The core of the book is a description of the grammar developed and analyses adopted for notoriously difficult phenomena, such as nonconstituent coordination, gapping, apposition, partitives, other noun phrase premodifier syntax, and so forth." }, { "sent_id": "3b0a82129333203eca96a7473095f3-C001-20", "text": "The grammatical framework adopted is based on a conventional notion of constituency, with nodes assigned categorial labels augmented with functional categories encoding mostly familiar grammatical relations." }, { "sent_id": "3b0a82129333203eca96a7473095f3-C001-21", "text": "The commitment to nonelliptical accounts of the full range of coordinate and gapped constructions that occur in the corpus leads to adoption of linguistically nonstandard analyses; for example, grouping noun phrase complements of ditransitive verbs into single constituents." }, { "sent_id": "3b0a82129333203eca96a7473095f3-C001-22", "text": "Once again, no reference is made to recent theoretical work addressing similar problems, such as extended categorial or combinatory grammar (e.g., Steedman 1985) ." }, { "sent_id": "3b0a82129333203eca96a7473095f3-C001-23", "text": "Nevertheless, the coverage of the resultant grammar is impressive, and the (computational) linguist who has not developed a substantial grammar from natural data will find enough interesting insights, analyses, and detailed discussion of constructions sometimes ignored in the more mainstream generative literature to be convinced, I hope, of the value of corpus-based grammar development." 
}, { "sent_id": "3b0a82129333203eca96a7473095f3-C001-24", "text": "There are, however, dangers, as well as strengths in this approach; for instance, the commitment to assign an analysis to each sentence of the corpus can easily lead to reification of undesirable decisions in the grammar and consequent propagation throughout the analyzed corpus: a case in point might be the use of ditransitive complement constituents introduced to deal with gapped examples." }, { "sent_id": "3b0a82129333203eca96a7473095f3-C001-25", "text": "Corpus-based development and testing of a grammar requires computational support to be practical and, given the goal of the TOSCA project, a method is needed to select the semantically and pragmatically appropriate analysis from the set licensed by the grammar for each sentence in the corpus." }, { "sent_id": "3b0a82129333203eca96a7473095f3-C001-26", "text": "A separate system is used to assign each word of the input sentence an unambiguous and correct lexical category compatible with the grammar developed." }, { "sent_id": "3b0a82129333203eca96a7473095f3-C001-27", "text": "This system and the lexical categories are not described in the book but appear to be more fine-grained than the categories assigned by tagging programs (e.g., CLAWS2, Garside et al. 1987) , incorporating subcategorization information concerning complementation, for instance." }, { "sent_id": "3b0a82129333203eca96a7473095f3-C001-28", "text": "The parsing system then assigns analyses to this unambiguous sequence of lexical categories." }, { "sent_id": "3b0a82129333203eca96a7473095f3-C001-29", "text": "Oostdijk does not describe the parser-generator or parser developed for the affix grammar formalism used, but instead concentrates on the issues of parse selection and performance both in terms of coverage and efficiency." 
}, { "sent_id": "3b0a82129333203eca96a7473095f3-C001-30", "text": "Parse selection is done interactively by guiding the parser manually; Oostdijk justifies this approach by arguing that it ensures a high level of accuracy and guarantees parsing efficiency by pre-empting unnecessary search." }, { "sent_id": "3b0a82129333203eca96a7473095f3-C001-31", "text": "An approach in which intervention is limited to selection between predefined legitimate analyses is an improvement on one in which the analyst is able to create new descriptions at will (e.g., Leech and Garside 1991) in that the resulting database of analyses will be consistent and intervention will be simpler and faster." }, { "sent_id": "3b0a82129333203eca96a7473095f3-C001-32", "text": "However, other approaches are possible, such as the use of probabilities to guide parse selection, if not grammar induction (e.g., Black et al. 1992)." }, { "sent_id": "3b0a82129333203eca96a7473095f3-C001-33", "text": "Oostdijk does not consider this possibility, presumably because of her acceptance of the incompatibility of rule-based and statistical techniques." }, { "sent_id": "3b0a82129333203eca96a7473095f3-C001-34", "text": "The decision to manually select parses, coupled with the fact that the TOSCA parser (on the hardware available) is not always able to compute all the possible analyses, even starting from unambiguous lexical categories, has the unfortunate side effect that a significant effort has been devoted to removing linguistically motivated ambiguity from the grammar." }, { "sent_id": "3b0a82129333203eca96a7473095f3-C001-35", "text": "Earlier, Oostdijk argues for the strict separation of grammatical formalism and parsing algorithm on familiar grounds, but the same arguments tell against the decision, for instance, to stipulate that coordination occurs at specific nodes in case of ambiguity (p.
133), since such distinctions often correlate with differences of semantic scope." }, { "sent_id": "3b0a82129333203eca96a7473095f3-C001-36", "text": "Despite these criticisms--and in a practical project of this type some compromises are inevitable--Oostdijk's achievement is impressive; her book is well written and easy to read, and she manages the difficult task of striking the right level between an exhaustive and exhausting documentation of a substantial grammar and a superficial overview." }, { "sent_id": "3b0a82129333203eca96a7473095f3-C001-37", "text": "There are very few typographical errors and the book has been well produced." } ], "y": { "@BACK@": { "gold_contexts": [ [ "3b0a82129333203eca96a7473095f3-C001-2" ], [ "3b0a82129333203eca96a7473095f3-C001-8" ], [ "3b0a82129333203eca96a7473095f3-C001-18" ], [ "3b0a82129333203eca96a7473095f3-C001-32" ] ], "cite_sentences": [ "3b0a82129333203eca96a7473095f3-C001-2", "3b0a82129333203eca96a7473095f3-C001-8", "3b0a82129333203eca96a7473095f3-C001-18", "3b0a82129333203eca96a7473095f3-C001-32" ] }, "@MOT@": { "gold_contexts": [ [ "3b0a82129333203eca96a7473095f3-C001-2" ] ], "cite_sentences": [ "3b0a82129333203eca96a7473095f3-C001-2" ] } } }, "ABC_d52a1a26cbf8a6f528be5494f05e45_1": { "x": [ { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-95", "text": "Word CNN." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-168", "text": "**EXTERNAL FACTORS**" }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-2", "text": "We investigate the design challenges of constructing effective and efficient neural sequence labeling systems, by reproducing twelve neural sequence labeling models, which include most of the state-of-the-art structures, and conduct a systematic model comparison on three benchmarks (i.e. NER, Chunking, and POS tagging)."
}, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-3", "text": "Misconceptions and inconsistent conclusions in existing literature are examined and clarified through statistical experiments." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-4", "text": "In the comparison and analysis process, we reach several practical conclusions which can be useful to practitioners." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-5", "text": "----------------------------------" }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-195", "text": "Figure 6 shows the decoding times on 10000 NER sentences." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-7", "text": "Sequence labeling models have been used for fundamental NLP tasks such as POS tagging, chunking and named entity recognition (NER)." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-8", "text": "Traditional work uses statistical approaches such as Hidden Markov Models (HMM) and Conditional Random Fields (CRF) (Ratinov and Roth, 2009; Passos et al., 2014; Luo et al., 2015) with handcrafted features and task-specific resources." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-9", "text": "With advances in deep learning, neural models have given state-of-the-art results on many sequence labeling tasks (Ling et al., 2015; Lample et al., 2016; Ma and Hovy, 2016)." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-10", "text": "Words and characters are encoded in distributed representations (Mikolov et al., 2013) and sentence-level features are learned automatically during end-to-end training." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-11", "text": "Many existing state-of-the-art neural sequence labeling models utilize word-level Long Short-Term Memory (LSTM) structures to represent global sequence information and a CRF layer to capture dependencies between neighboring labels (Lample et al., 2016; Ma and Hovy, 2016; Peters et al., 2017)."
}, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-12", "text": "As an alternative, the Convolutional Neural Network (CNN) (LeCun et al., 1989) has also been used for its parallel computing ability, leading to an efficient training and decoding process." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-13", "text": "Despite their dominance in the research literature, reproducing published results for neural models can be challenging, even when the code is available as open source." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-14", "text": "For example, Reimers and Gurevych (2017b) conduct a large number of experiments using the code of Ma and Hovy (2016), but cannot obtain results comparable to those reported in the paper." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-15", "text": "Liu et al. (2018) report lower average F-scores on NER when reproducing the structure of Lample et al. (2016), and on POS tagging when reproducing Ma and Hovy (2016)." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-16", "text": "Most literature compares results with others by citing the scores directly (Lample et al., 2016) without re-implementing them under the same settings, resulting in less persuasive evidence for the advantage of their models." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-17", "text": "In addition, conclusions from different reports can be contradictory." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-18", "text": "For example, most work observes that stochastic gradient descent (SGD) gives the best performance on the NER task (Chiu and Nichols, 2016; Lample et al., 2016; Ma and Hovy, 2016), while Reimers and Gurevych (2017b) report that SGD is the worst optimizer on the same datasets." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-19", "text": "The comparison between different deep neural models is challenging due to sensitivity to experimental settings."
}, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-20", "text": "We list six inconsistent configurations in the literature, which lead to difficulties for fair comparison." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-21", "text": "\u2022 Dataset." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-22", "text": "Most work reports sequence labeling results on both the CoNLL 2003 English NER (Tjong Kim Sang and De Meulder, 2003) and PTB POS (Marcus et al., 1993) datasets (Collobert et al., 2011; Ma and Hovy, 2016)." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-23", "text": "Ling et al. (2015) give results only on the POS dataset, while some papers (Chiu and Nichols, 2016; Lample et al., 2016; Strubell et al., 2017) report results on the NER dataset only." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-24", "text": "dos Santos et al. (2015) conduct experiments on NER for Portuguese and Spanish." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-25", "text": "Most work uses the development set to select hyperparameters (Lample et al., 2016; Ma and Hovy, 2016), while others add the development set into the training set (Chiu and Nichols, 2016; Peters et al., 2017)." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-26", "text": "Reimers and Gurevych (2017b) use a smaller dataset (13862 vs 14987 sentences)." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-27", "text": "Different from Ma and Hovy (2016) and Liu et al. (2018) , choose a different data split on the POS dataset." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-28", "text": "Liu et al. (2018) and Hashimoto et al. (2017) use different development sets for chunking." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-29", "text": "\u2022 Preprocessing." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-30", "text": "A typical data preprocessing step is to normalize digit characters (Chiu and Nichols, 2016; Lample et al., 2016; Yang et al., 2016; Strubell et al., 2017)."
}, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-31", "text": "Reimers and Gurevych (2017b) use fine-grained representations for less frequent words." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-32", "text": "Ma and Hovy (2016) do not use preprocessing." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-33", "text": "\u2022 Features." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-34", "text": "Strubell et al. (2017) and Chiu and Nichols (2016) apply word spelling features and further integrate context features." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-35", "text": "Collobert et al. (2011) and use neural features to represent external gazetteer information." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-36", "text": "Besides, Lample et al. (2016) and Ma and Hovy (2016) use end-to-end structure without handcrafted features." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-37", "text": "\u2022 Hyperparameters including learning rate, dropout rate (Srivastava et al., 2014) , number of layers, hidden size etc. can strongly affect the model performance." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-38", "text": "Chiu and Nichols (2016) search for the hyperparameters for each task and show that the system performance is sensitive to the choice of hyperparameters." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-39", "text": "However, existing models use different parameter settings, which affects the fair comparison." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-40", "text": "\u2022 Evaluation." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-41", "text": "Some literature reports results using mean and standard deviation under different random seeds (Chiu and Nichols, 2016; Peters et al., 2017; Liu et al., 2018) ." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-42", "text": "Others report the best result among different trials (Ma and Hovy, 2016) , which cannot be compared directly." 
}, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-43", "text": "\u2022 Hardware environment can also affect system accuracy." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-44", "text": "Liu et al. (2018) observe that the system gives better accuracy on the NER task when trained using a GPU as compared to a CPU." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-45", "text": "Besides, the running speeds are highly affected by the hardware environment." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-46", "text": "To address the above concerns, we systematically analyze neural sequence labeling models on three benchmarks: CoNLL 2003 NER (Tjong Kim Sang and De Meulder, 2003), CoNLL 2000 chunking (Tjong Kim Sang and Buchholz, 2000) and PTB POS tagging (Marcus et al., 1993)." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-47", "text": "Table 1 shows a summary of the models we investigate, which can be categorized under three settings: (i) character sequence representations; (ii) word sequence representations; (iii) inference layer." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-48", "text": "While various combinations of these three settings have been proposed in the literature, others have not been examined." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-49", "text": "We compare all models in Table 1, which includes most state-of-the-art methods." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-50", "text": "To make fair comparisons, we build a unified framework 1 to reproduce the twelve neural sequence labeling architectures in Table 1." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-51", "text": "Experiments show that our framework gives comparable or better results when reproducing existing work, showing the practicality and reliability of our analysis for practitioners."
}, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-52", "text": "The detailed comparison and analysis show that (i) Character information provides a significant improvement on accuracy; (ii) Word-based LSTM models outperforms CNN models in most cases; (iii) CRF can improve model accuracy on NER and chunking but does not on POS tagging." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-53", "text": "Our framework is based on PyTorch with batched implementation, which is highly efficient, facilitating quick configurations for new tasks." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-54", "text": "----------------------------------" }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-55", "text": "**RELATED WORK**" }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-56", "text": "Collobert et al. (2011) proposed a seminal neural architecture for sequence labeling." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-57", "text": "It captures word sequence information with a one-layer CNN based on pretrained word embeddings and handcrafted neural features, followed with a CRF output layer." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-58", "text": "dos Santos et al. (2015) extended this model by integrating character-level CNN features." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-59", "text": "Strubell et al. (2017) built a deeper dilated CNN architecture to capture larger local features." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-60", "text": "Hammerton (2003) was the first to exploit LSTM for sequence labeling." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-61", "text": "built a BiLSTM-CRF structure, which has been extended by adding character-level LSTM (Lample et al., 2016; Liu et al., 2018) , GRU (Yang et al., 2016) , and CNN (Chiu and Nichols, 2016; Ma and Hovy, 2016) features." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-62", "text": "Yang et al. 
(2017a) proposed a neural reranking model to improve NER models." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-63", "text": "These models achieve state-of-the-art results in the literature." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-64", "text": "Reimers and Gurevych (2017b) compared several word-based LSTM models on several sequence labeling tasks, reporting the score distributions over multiple runs rather than a single value." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-65", "text": "They investigated the influence of various hyperparameters and configurations." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-66", "text": "Our work is similar in comparing different neural architectures under unified settings, but differs in four main aspects: 1) Their experiments are based on a BiLSTM with handcrafted word features, while our experiments are based on end-to-end neural models without human knowledge." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-67", "text": "2) Their system gives relatively low performances on standard benchmarks, while ours can give comparable or better results with state-of-the-art models, rendering our observations more informative for practitioners." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-68", "text": "3) Our findings are more consistent with most previous work on configurations such as the usefulness of character information (Lample et al., 2016; Ma and Hovy, 2016), optimizer (Chiu and Nichols, 2016; Lample et al., 2016; Ma and Hovy, 2016) and tag scheme (Ratinov and Roth, 2009; Dai et al., 2015)." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-69", "text": "In contrast, many results of Reimers and Gurevych (2017b) contradict existing reports."
}, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-70", "text": "4) We conduct a wider range of comparison for word sequence representations, including all combinations of character CNN/LSTM and word CNN/LSTM structures, while Reimers and Gurevych (2017b) studied the word LSTM models only." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-71", "text": "----------------------------------" }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-72", "text": "**NEURAL SEQUENCE LABELING MODELS**" }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-73", "text": "Our neural sequence labeling framework contains three layers, i.e., a character sequence representation layer, a word sequence representation layer and an inference layer, as shown in Figure 1 ." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-74", "text": "Character information has been proven to be critical for sequence labeling tasks (Chiu and Nichols, 2016; Lample et al., 2016; Ma and Hovy, 2016) , with LSTM and CNN being used to model character sequence information (\"Char Rep.\")." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-75", "text": "Similarly, on the word level, LSTM or CNN structures can be leveraged to capture long-term information or local features (\"Word Rep.\"), respectively." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-76", "text": "Subsequently, the inference layer assigns labels to each word using the hidden states of word sequence representations." 
}, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-77", "text": "----------------------------------" }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-78", "text": "**CHARACTER SEQUENCE REPRESENTATIONS**" }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-79", "text": "Character features such as prefix, suffix and capitalization can be represented with embeddings through a feature-based lookup table (Collobert et al., 2011; Strubell et al., 2017) , or neural networks without human-defined features (Lample et al., 2016; Ma and Hovy, 2016) ." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-80", "text": "In this work, we focus on neural character sequence representations without hand-engineered features." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-81", "text": "Character CNN." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-82", "text": "Using a CNN structure to encode character sequences was firstly proposed by Santos and Zadrozny (2014), and followed by many subsequent investigations (dos Santos et al., 2015; Chiu and Nichols, 2016; Ma and Hovy, 2016) ." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-83", "text": "In our experiments, we take the same structure as Ma and Hovy (2016) , using one layer CNN structure with max-pooling to capture character-level representations." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-84", "text": "Figure 2 (a) shows the CNN structure on representing word \"Mexico\"." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-85", "text": "Character LSTM." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-86", "text": "Shown as Figure 2 (b), in order to model the global character sequence information of a word \"Mexico\", we utilize a bidirectional LSTM on the character sequence of each word and concatenate the left-to-right final state F LST M and the right-to-left final state B LST M as character sequence representations." 
}, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-87", "text": "Liu et al. (2018) applied one bidirectional LSTM for the character sequence over a sentence rather than each word individually." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-88", "text": "We examined both structures and found that they give comparable accuracies on sequence labeling tasks." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-89", "text": "We choose Lample et al. (2016) 's structure as its character LSTMs can be calculated in parallel, making the system more efficient." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-90", "text": "----------------------------------" }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-91", "text": "**WORD SEQUENCE REPRESENTATIONS**" }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-92", "text": "Similar to character sequences in words, we can model word sequence information through LSTM or CNN structures." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-93", "text": "LSTM has been widely used in sequence labeling (Lample et al., 2016; Ma and Hovy, 2016; Chiu and Nichols, 2016; Liu et al., 2018) ." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-94", "text": "CNN can be much faster than LSTM due to the fact that convolution calculation can be parallel on the input sequence (Collobert et al., 2011; dos Santos et al., 2015; Strubell et al., 2017) ." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-96", "text": "Figure 3(a) shows the multi-layer CNN on word sequence, where words are represented by embeddings." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-97", "text": "If a character sequence representation layer is used, then word embeddings and character sequence representations are concatenated for word representations." 
}, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-98", "text": "For each CNN layer, a window of size 3 slides along the sequence, extracting local features on the word inputs and a ReLU function (Glorot et al., 2011 ) is followed." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-99", "text": "We follow Strubell et al. (2017) by using 4 CNN layers." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-100", "text": "Batch normalization (Ioffe and Szegedy, 2015) and dropout (Srivastava et al., 2014) are applied following each CNN layer." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-101", "text": "Word LSTM." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-102", "text": "Shown in Figure 3 (b), word representations are fed into a forward LSTM and backward LSTM, respectively." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-103", "text": "The forward LSTM captures the word sequence information from left to right, while the backward LSTM extracts information in a reversed direction." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-104", "text": "The hidden states of the forward and backward LSTMs are concatenated at each word to give global information of the whole sequence." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-105", "text": "----------------------------------" }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-106", "text": "**INFERENCE LAYER**" }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-107", "text": "The inference layer takes the extracted word sequence representations as features and assigns labels to the word sequence." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-108", "text": "Independent local decoding with a linear layer mapping word sequence representations to label vocabulary and performing softmax can be quite effective (Ling et al., 2015) , while for tasks that with strong output label dependency, such as NER, CRF is a more appropriate choice." 
}, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-109", "text": "In this work, we examine both softmax and CRF as inference layer on three sequence labeling tasks." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-110", "text": "----------------------------------" }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-111", "text": "**EXPERIMENTS**" }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-112", "text": "We investigate the main influencing factors to system accuracy, including the character sequence representations, word sequence representations, inference algorithm, pretrained embeddings, tag scheme, running environment and optimizer; analyzing system performances from the perspective of decoding speed and accuracies on in-vocabulary (IV) and out-of-vocabulary (OOV) entities/chunks/words." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-113", "text": "----------------------------------" }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-114", "text": "**SETTINGS**" }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-115", "text": "Data." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-116", "text": "The NER dataset has been standardly split in Tjong Kim Sang and De Meulder (2003 (Toutanova et al., 2003; Santos and Zadrozny, 2014; Ma and Hovy, 2016; Liu et al., 2018) , we adopt the standard splits by using sections 0-18 as training set, sections 19-21 as development set and sections 22-24 as test set." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-117", "text": "No preprocessing is performed on either dataset except for normalizing digits." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-118", "text": "The dataset statistics are listed in Table 2 ." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-119", "text": "Hyperparameters." 
}, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-120", "text": "Table 3 shows the hyperparameters used in our experiments, which mostly follow Ma and Hovy (2016) , including the learning rate \u03b7 = 0.015 for word LSTM models." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-121", "text": "For word CNN based models, a large \u03b7 leads to convergence problem." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-122", "text": "We take \u03b7 = 0.005 with more epochs (200) instead." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-123", "text": "GloVe 100-dimension (Pennington et al., 2014 ) is used to initialize word embeddings and character embeddings are randomly initialized." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-124", "text": "We use mini-batch stochastic gradient descent (SGD) with a decayed learning rate to update parameters." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-125", "text": "For NER and chunking, we the BIOES tag scheme." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-126", "text": "Evaluation." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-127", "text": "Standard precision (P), recall (R) and F1-score (F) are used as the evaluation metrics for NER and chunking; token accuracy is used to evaluate the performance of POS tagger." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-128", "text": "Development datasets are used to select the optimal model among all epochs, and we report scores of the selected model on the test dataset." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-129", "text": "To reduce the volatility of the system, we conduct each experiment 5 times under different random seeds, and report the max, mean, and standard deviation for each model." 
}, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-130", "text": "----------------------------------" }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-131", "text": "**RESULTS**" }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-132", "text": "Tables 4, 5 and 6 show the results of the twelve models on NER, chunking and POS datasets, respectively." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-133", "text": "Existing work has also been listed in the tables for comparison." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-134", "text": "To simplify the description, we use \"CLSTM\" and \"CCNN\" to represent character LSTM and character CNN encoder, respectively." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-135", "text": "Similarly, \"WLSTM\" and \"WCNN\" represent word LSTM and word CNN structure, respectively." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-136", "text": "As shown in Table 4 , most NER work focuses on WLSTM+CRF structures with different character sequence representations." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-137", "text": "We re-implement the structure of several reports (Chiu and Nichols, 2016; Ma and Hovy, 2016; Peters et al., 2017) , which take the CCNN+WLSTM+CRF architecture." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-138", "text": "Our reproduced models give slightly better performances." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-139", "text": "The results of Lample et al. (2016) can be reproduced by our CLSTM+WLSTM+CRF." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-140", "text": "In most cases, our \"Nochar\" based models underperform their corresponding prototypes Strubell et al., 2017) , which utilize the hand-crafted features." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-141", "text": "Table 5 shows the results of the chunking task." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-142", "text": "Peters et al. 
(2017) give the best reported single model results (95.00\u00b10.08), and our CLSTM+WLSTM+CRF gives a comparable performance (94.93\u00b10.05)." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-143", "text": "We re-implement Zhai et al. (2017)'s model in our Nochar+WLSTM but cannot reproduce their results, which may be because they use grid search for hyperparameter selection." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-144", "text": "Our Nochar+WCNN+CRF gives comparable results with Collobert et al. (2011), even though ours does not include character information." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-145", "text": "The results of the POS tagging task are shown in Table 6." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-146", "text": "The results of Lample et al. (2016), Ma and Hovy (2016) and Yang et al. (2017b) can be reproduced by our CLSTM+WLSTM+CRF and CCNN+WLSTM+CRF models." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-147", "text": "Our WLSTM-based models give better results than the WLSTM+CRF-based models, which is consistent with the fact that Ling et al. (2015) take CLSTM+WLSTM without a CRF layer but achieve the best POS accuracy." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-148", "text": "Santos and Zadrozny (2014) build a pure CNN structure on both the character and word levels, which can be reproduced by our CCNN+WCNN models." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-149", "text": "Based on the above observations, most results in the literature are reproducible." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-150", "text": "Our implementations achieve comparable or better results with state-of-the-art models." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-151", "text": "We do not fine-tune any hyperparameters to fit a specific task."
}, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-152", "text": "Results on Table 4 , 5 and 6 are all under the same hyperparameters, which demonstrates the generalization ability of our framework." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-153", "text": "----------------------------------" }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-154", "text": "**NETWORK SETTINGS**" }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-155", "text": "Character LSTM vs Character CNN." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-156", "text": "Unlike the observations of Reimers and Gurevych (2017b) , in our experiments, character information can significantly (p < 0.01) 3 improve sequence labeling models (by comparing the row of Nochar with CLSTM or CCNN on Table 4 , 5 and 6), while the difference between CLSTM and CCNN is not significant." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-157", "text": "In most cases, CLSTM and CCNN can give comparable results under different frameworks and different tasks." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-158", "text": "CCNN gives the best NER result under the WL-STM+CRF framework, while CLSTM gets better NER results in all other configurations." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-159", "text": "For chunking and POS tagging, CLSTM consistently outperforms CCNN under all settings, while the difference is statistically insignificant (p > 0.2)." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-160", "text": "We conclude that the difference between CLSTM and CCNN is small, which is consistent with the observation of Reimers and Gurevych (2017b) ." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-161", "text": "Word LSTM vs Word CNN." 
}, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-162", "text": "By comparing the performances of WLSTM+CRF, WLSTM with WCNN+CRF, WCNN on the three benchmarks, we conclude that word-based LSTM models are significantly (p < 0.01) better than word-based CNN models in most cases." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-163", "text": "It demonstrates that global word context information is necessary for sequence labeling." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-164", "text": "Softmax vs CRF." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-165", "text": "Models with CRF inference layer can consistently outperform the models with softmax layer under all configurations on NER and chunking tasks, proving the effectiveness of label dependency information." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-166", "text": "While for POS tagging, the local softmax based models give slightly better accuracies while the difference is insignificant (p > 0.2)." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-167", "text": "----------------------------------" }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-169", "text": "In addition to model structures, external factors such as pretrained embeddings, tag scheme, and optimizer can significantly influence system performance." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-170", "text": "We investigate a set of external factors on the NER dataset with the two best models: CLSTM+WLSTM+CRF and CCNN+WLSTM+CRF." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-171", "text": "Pretrained embedding." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-172", "text": "Figure 4 (a) shows the F1-scores of the two best models on the NER test set with two different pretrained embeddings, as well as the random initialization." 
}, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-173", "text": "Compared with the random initialization, models using pretrained embeddings give significant improvements (p < 0.01)." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-174", "text": "The GloVe 100-dimension embeddings give higher F1-scores than SENNA (Collobert et al., 2011) on both models, which is consistent with the observation of Ma and Hovy (2016) ." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-175", "text": "Tag scheme." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-176", "text": "We examine two different tag schemes: BIO and BIOES (Ratinov and Roth, 2009) ." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-177", "text": "The results are shown in Figure 4 (b)." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-178", "text": "In our experiments, models using BIOES are significantly (p < 0.05) better than BIO." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-179", "text": "Our observation is consistent with most literature (Ratinov and Roth, 2009; Dai et al., 2015) ." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-180", "text": "Reimers and Gurevych (2017b) report that the difference between the schemes is insignificant." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-181", "text": "Running environment." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-182", "text": "Liu et al. (2018) observe that neural sequence labeling models can give better results on GPU rather than CPU." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-183", "text": "We conduct repeated experiments on both GPU and CPU environments." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-184", "text": "The results are shown in Figure 4 (b)." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-185", "text": "Models run on CPU give a lower mean F1-score than models run on GPU, while the difference is insignificant (p > 0.2)." 
}, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-186", "text": "Optimizer." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-187", "text": "We compare different optimizers including SGD, Adagrad (Duchi et al., 2011 ), Adadelta (Zeiler, 2012 , RMSProp (Tieleman and Hinton, 2012) and Adam (Kingma and Ba, 2014) ." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-188", "text": "The results are shown in Figure 5 5 ." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-189", "text": "In contrast to Reimers and Gurevych (2017b) , who reported that SGD is the worst optimizer, our results show that SGD outperforms all other optimizers significantly (p < 0.01), with a slower convergence process during training." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-190", "text": "Our observation is consistent with most literature (Chiu and Nichols, 2016; Lample et al., 2016; Ma and Hovy, 2016) ." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-191", "text": "----------------------------------" }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-192", "text": "**ANALYSIS**" }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-193", "text": "Decoding speed." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-194", "text": "We test the decoding speeds of the twelve models on the NER dataset using a Nvidia GTX 1080 GPU." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-196", "text": "The CRF inference layer severely limits the decoding speed due to the left-to-right inference process, which disables the parallel decoding." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-197", "text": "Character LSTM significantly slows down the system." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-198", "text": "Compared with models without character information, adding character CNN representations does not affect the decoding speed too much but can give significant accuracy improvements (shown in Section 4.3)." 
}, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-199", "text": "Due to the support of parallel computing, word-based CNN models are consistently faster than word-based LSTM models, with close accuracies, leading to large utilization potential in practice." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-200", "text": "Table 7 : Results for OOV analysis." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-201", "text": "OOV error." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-202", "text": "We conduct error analysis on in-vocabulary and out-of-vocabulary words with the CRF based models 6 ." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-203", "text": "Following Ma and Hovy (2016) , words in the test set are divided into four subsets: in-vocabulary words, out-of-training-vocabulary words (OOTV), out-of-embedding-vocabulary words (OOEV) and out-of-both-vocabulary words (OOBV)." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-204", "text": "For NER and chunking, we consider entities or chunks rather than words." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-205", "text": "The OOV entities and chunks are categorized following Ma and Hovy (2016) ." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-206", "text": "Table 7 shows the performances of different OOV splits on three benchmarks." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-207", "text": "The top three rows list the performances of word-based LSTM CRF models, followed by the word-based CNN CRF models." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-208", "text": "The results of OOEV in NER keep 100% because of there exist only 8 OOEV entities and all are recognized correctly." 
}, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-209", "text": "It is obvious that character LSTM or CNN representations improve OOTV and OOBV the most on both WLSTM+CRF and WCNN+CRF models across all three datasets, proving that the main contribution of neural character sequence representations is to disambiguate the OOV words." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-210", "text": "Models with character LSTM representations give the best IV scores across all configurations, which may be because character LSTM can be well trained on IV data, bringing the useful global character sequence information." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-211", "text": "On the OOVs, character LSTM and CNN gives comparable results." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-212", "text": "----------------------------------" }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-213", "text": "**CONCLUSION**" }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-214", "text": "We built a unified neural sequence labeling framework to reproduce and compare recent state-of-theart models with different configurations." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-215", "text": "We explored three neural model design decisions: character sequence representations, word sequence representations, and inference layer." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-216", "text": "Experiments show that character information helps to improve model performances, especially on disambiguating OOV words." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-217", "text": "Character-level LSTM and CNN structures give comparable improvements, with the latter being more efficient." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-218", "text": "In most cases, models with word-level LSTM encoders outperform those with CNN, at the expense of longer decoding time." 
}, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-219", "text": "We observed that the CRF inference algorithm is effective on NER and chunking tasks, but does not have the advantage on POS tagging." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-220", "text": "With controlled experiments on the NER dataset, we showed that BIOES tags are better than BIO." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-221", "text": "Besides, pretrained GloVe 100d embedding and SGD optimizer give significantly better performances compared to their competitors." }, { "sent_id": "d52a1a26cbf8a6f528be5494f05e45-C001-222", "text": "6 We choose the models that give the median performance on the test set for conducting result analysis." } ], "y": { "@BACK@": { "gold_contexts": [ [ "d52a1a26cbf8a6f528be5494f05e45-C001-9" ], [ "d52a1a26cbf8a6f528be5494f05e45-C001-11" ], [ "d52a1a26cbf8a6f528be5494f05e45-C001-14", "d52a1a26cbf8a6f528be5494f05e45-C001-15" ], [ "d52a1a26cbf8a6f528be5494f05e45-C001-18" ], [ "d52a1a26cbf8a6f528be5494f05e45-C001-22", "d52a1a26cbf8a6f528be5494f05e45-C001-25", "d52a1a26cbf8a6f528be5494f05e45-C001-27", "d52a1a26cbf8a6f528be5494f05e45-C001-32", "d52a1a26cbf8a6f528be5494f05e45-C001-36" ], [ "d52a1a26cbf8a6f528be5494f05e45-C001-61" ], [ "d52a1a26cbf8a6f528be5494f05e45-C001-74" ], [ "d52a1a26cbf8a6f528be5494f05e45-C001-79" ], [ "d52a1a26cbf8a6f528be5494f05e45-C001-82" ], [ "d52a1a26cbf8a6f528be5494f05e45-C001-93" ], [ "d52a1a26cbf8a6f528be5494f05e45-C001-116" ] ], "cite_sentences": [ "d52a1a26cbf8a6f528be5494f05e45-C001-9", "d52a1a26cbf8a6f528be5494f05e45-C001-11", "d52a1a26cbf8a6f528be5494f05e45-C001-14", "d52a1a26cbf8a6f528be5494f05e45-C001-15", "d52a1a26cbf8a6f528be5494f05e45-C001-18", "d52a1a26cbf8a6f528be5494f05e45-C001-22", "d52a1a26cbf8a6f528be5494f05e45-C001-25", "d52a1a26cbf8a6f528be5494f05e45-C001-27", "d52a1a26cbf8a6f528be5494f05e45-C001-32", "d52a1a26cbf8a6f528be5494f05e45-C001-36", "d52a1a26cbf8a6f528be5494f05e45-C001-61", 
"d52a1a26cbf8a6f528be5494f05e45-C001-74", "d52a1a26cbf8a6f528be5494f05e45-C001-79", "d52a1a26cbf8a6f528be5494f05e45-C001-82", "d52a1a26cbf8a6f528be5494f05e45-C001-93", "d52a1a26cbf8a6f528be5494f05e45-C001-116" ] }, "@MOT@": { "gold_contexts": [ [ "d52a1a26cbf8a6f528be5494f05e45-C001-13", "d52a1a26cbf8a6f528be5494f05e45-C001-14", "d52a1a26cbf8a6f528be5494f05e45-C001-15" ], [ "d52a1a26cbf8a6f528be5494f05e45-C001-17", "d52a1a26cbf8a6f528be5494f05e45-C001-18" ], [ "d52a1a26cbf8a6f528be5494f05e45-C001-20", "d52a1a26cbf8a6f528be5494f05e45-C001-21", "d52a1a26cbf8a6f528be5494f05e45-C001-22", "d52a1a26cbf8a6f528be5494f05e45-C001-23", "d52a1a26cbf8a6f528be5494f05e45-C001-25", "d52a1a26cbf8a6f528be5494f05e45-C001-27", "d52a1a26cbf8a6f528be5494f05e45-C001-30", "d52a1a26cbf8a6f528be5494f05e45-C001-32", "d52a1a26cbf8a6f528be5494f05e45-C001-34", "d52a1a26cbf8a6f528be5494f05e45-C001-36", "d52a1a26cbf8a6f528be5494f05e45-C001-41", "d52a1a26cbf8a6f528be5494f05e45-C001-42" ] ], "cite_sentences": [ "d52a1a26cbf8a6f528be5494f05e45-C001-14", "d52a1a26cbf8a6f528be5494f05e45-C001-15", "d52a1a26cbf8a6f528be5494f05e45-C001-18", "d52a1a26cbf8a6f528be5494f05e45-C001-22", "d52a1a26cbf8a6f528be5494f05e45-C001-25", "d52a1a26cbf8a6f528be5494f05e45-C001-27", "d52a1a26cbf8a6f528be5494f05e45-C001-32", "d52a1a26cbf8a6f528be5494f05e45-C001-36", "d52a1a26cbf8a6f528be5494f05e45-C001-42" ] }, "@SIM@": { "gold_contexts": [ [ "d52a1a26cbf8a6f528be5494f05e45-C001-68" ], [ "d52a1a26cbf8a6f528be5494f05e45-C001-120" ], [ "d52a1a26cbf8a6f528be5494f05e45-C001-146" ], [ "d52a1a26cbf8a6f528be5494f05e45-C001-174" ], [ "d52a1a26cbf8a6f528be5494f05e45-C001-190" ] ], "cite_sentences": [ "d52a1a26cbf8a6f528be5494f05e45-C001-68", "d52a1a26cbf8a6f528be5494f05e45-C001-120", "d52a1a26cbf8a6f528be5494f05e45-C001-146", "d52a1a26cbf8a6f528be5494f05e45-C001-174", "d52a1a26cbf8a6f528be5494f05e45-C001-190" ] }, "@USE@": { "gold_contexts": [ [ "d52a1a26cbf8a6f528be5494f05e45-C001-73", "d52a1a26cbf8a6f528be5494f05e45-C001-74" ], [ 
"d52a1a26cbf8a6f528be5494f05e45-C001-79", "d52a1a26cbf8a6f528be5494f05e45-C001-80" ], [ "d52a1a26cbf8a6f528be5494f05e45-C001-83" ], [ "d52a1a26cbf8a6f528be5494f05e45-C001-116" ], [ "d52a1a26cbf8a6f528be5494f05e45-C001-120" ], [ "d52a1a26cbf8a6f528be5494f05e45-C001-137" ], [ "d52a1a26cbf8a6f528be5494f05e45-C001-146" ], [ "d52a1a26cbf8a6f528be5494f05e45-C001-203" ], [ "d52a1a26cbf8a6f528be5494f05e45-C001-205" ] ], "cite_sentences": [ "d52a1a26cbf8a6f528be5494f05e45-C001-74", "d52a1a26cbf8a6f528be5494f05e45-C001-79", "d52a1a26cbf8a6f528be5494f05e45-C001-83", "d52a1a26cbf8a6f528be5494f05e45-C001-116", "d52a1a26cbf8a6f528be5494f05e45-C001-120", "d52a1a26cbf8a6f528be5494f05e45-C001-137", "d52a1a26cbf8a6f528be5494f05e45-C001-146", "d52a1a26cbf8a6f528be5494f05e45-C001-203", "d52a1a26cbf8a6f528be5494f05e45-C001-205" ] } } }, "ABC_ce990a3d035b8e57fe86b8d84ca479_1": { "x": [ { "sent_id": "ce990a3d035b8e57fe86b8d84ca479-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "ce990a3d035b8e57fe86b8d84ca479-C001-2", "text": "Given the current growth in research and related emerging technologies in machine learning and deep learning, it is timely to introduce this tutorial to a large number of researchers and practitioners who are attending COLING 2018 and working on statistical models, deep neural networks, sequential learning and natural language understanding." }, { "sent_id": "ce990a3d035b8e57fe86b8d84ca479-C001-3", "text": "To the best of our knowledge, there is no similar tutorial presented in previous ACL/COLING/EMNLP/NAACL." }, { "sent_id": "ce990a3d035b8e57fe86b8d84ca479-C001-4", "text": "This three-hour tutorial will concentrate on a wide range of theories and applications and systematically present the recent advances in deep Bayesian and sequential learning which are impacting the communities of computational linguistics, human language technology and machine learning for natural language processing." 
}, { "sent_id": "ce990a3d035b8e57fe86b8d84ca479-C001-5", "text": "----------------------------------" }, { "sent_id": "ce990a3d035b8e57fe86b8d84ca479-C001-6", "text": "**TUTORIAL DESCRIPTION**" }, { "sent_id": "ce990a3d035b8e57fe86b8d84ca479-C001-7", "text": "This tutorial introduces the advances in deep Bayesian learning with abundant applications for natural language understanding ranging from speech recognition (Saon and Chien, 2012; Chan et al., 2016) to document summarization (Chang and Chien, 2009 ), text classification (Blei et al., 2003; Zhang et al., 2015) , text segmentation (Chien and Chueh, 2012) , information extraction (Narasimhan et al., 2016) , image caption generation (Vinyals et al., 2015; Xu et al., 2015) , sentence generation (Li et al., 2016b) , dialogue control (Zhao and Eskenazi, 2016; Li et al., 2016a) , sentiment classification, recommendation system, question answering (Sukhbaatar et al., 2015) and machine translation , to name a few." }, { "sent_id": "ce990a3d035b8e57fe86b8d84ca479-C001-8", "text": "Traditionally, \"deep learning\" is taken to be a learning process where the inference or optimization is based on the real-valued deterministic model." }, { "sent_id": "ce990a3d035b8e57fe86b8d84ca479-C001-9", "text": "The \"semantic structure\" in words, sentences, entities, actions and documents drawn from a large vocabulary may not be well expressed or correctly optimized in mathematical logic or computer programs." }, { "sent_id": "ce990a3d035b8e57fe86b8d84ca479-C001-10", "text": "The \"distribution function\" in discrete or continuous latent variable model for natural language may not be properly decomposed or estimated in model inference." 
}, { "sent_id": "ce990a3d035b8e57fe86b8d84ca479-C001-11", "text": "This tutorial addresses the fundamentals of statistical models and neural networks, and focus on a series of advanced Bayesian models and deep models including hierarchical Dirichlet process , Chinese restaurant process (Blei et al., 2010) , hierarchical Pitman-Yor process (Teh, 2006) , Indian buffet process (Ghahramani and Griffiths, 2005) , recurrent neural network (Mikolov et al., 2010; Van Den Oord et al., 2016) , long short-term memory (Hochreiter and Schmidhuber, 1997; , sequence-to-sequence model (Sutskever et al., 2014), variational auto-encoder (Kingma and Welling, 2014) , generative adversarial network (Goodfellow et al., 2014) , attention mechanism (Chorowski et al., 2015; Seo et al., 2016) , memory-augmented neural network (Graves et al., 2014; Graves et al., 2014) , stochastic neural network Miao et al., 2016) , predictive state neural network (Downey et al., 2017) , policy gradient (Yu et al., 2017) and reinforcement learning (Mnih et al., 2015) ." }, { "sent_id": "ce990a3d035b8e57fe86b8d84ca479-C001-12", "text": "We present how these models are connected and why they work for a variety of applications on symbolic and complex patterns in natural language." }, { "sent_id": "ce990a3d035b8e57fe86b8d84ca479-C001-13", "text": "The variational inference and sampling method are formulated to tackle the optimization for complicated models (Rezende et al., 2014) ." }, { "sent_id": "ce990a3d035b8e57fe86b8d84ca479-C001-14", "text": "The word and sentence embeddings, clustering and co-clustering are merged with linguistic and semantic constraints." }, { "sent_id": "ce990a3d035b8e57fe86b8d84ca479-C001-15", "text": "A series of case studies are presented to tackle different issues in deep Bayesian learning and understanding." }, { "sent_id": "ce990a3d035b8e57fe86b8d84ca479-C001-16", "text": "At last, we point out a number of directions and outlooks for future studies." 
}, { "sent_id": "ce990a3d035b8e57fe86b8d84ca479-C001-17", "text": "The presentation of this tutorial is arranged into five parts." }, { "sent_id": "ce990a3d035b8e57fe86b8d84ca479-C001-18", "text": "First of all, we share the current status of researches on natural language understanding, statistical modeling and deep neural network and explain the key issues in deep Bayesian learning for discrete-valued observation data and latent semantics." }, { "sent_id": "ce990a3d035b8e57fe86b8d84ca479-C001-19", "text": "A new paradigm called the symbolic neural learning is introduced to extend how data analysis is performed from language processing to semantic learning and memory networking." }, { "sent_id": "ce990a3d035b8e57fe86b8d84ca479-C001-20", "text": "Secondly, we address a number of Bayesian models ranging from latent variable model to VB inference (Chien and Chang, 2014; Chien and Chueh, 2011; Chien, 2015b) , MCMC sampling (Watanabe and Chien, 2015) and BNP learning (Chien, 2016; Chien, 2015a; Chien, 2018) for hierarchical, thematic and sparse topics from natural language." }, { "sent_id": "ce990a3d035b8e57fe86b8d84ca479-C001-21", "text": "In the third part, a series of deep models including deep unfolding (Chien and Lee, 2018) , Bayesian RNN (Gal and Ghahramani, 2016; Chien and Ku, 2016) , sequence-to-sequence learning (Graves et al., 2006; Gehring et al., 2017) , CNN (Kalchbrenner et al., 2014; Xingjian et al., 2015; , GAN (Tsai and Chien, 2017) and VAE are introduced." }, { "sent_id": "ce990a3d035b8e57fe86b8d84ca479-C001-22", "text": "The coffee break is arranged within this part." }, { "sent_id": "ce990a3d035b8e57fe86b8d84ca479-C001-23", "text": "Next, the fourth part focuses on a variety of advanced studies which illustrate how deep Bayesian learning is developed to infer the sophisticated recurrent models for natural language understanding." 
}, { "sent_id": "ce990a3d035b8e57fe86b8d84ca479-C001-24", "text": "In particular, the memory network Chien and Lin, 2018) , neural variational learning (Serban et al., 2017; Chung et al., 2015) , neural discrete representation (Jang et al., 2016; Maddison et al., 2016; van den Oord et al., 2017) , recurrent ladder network (Rasmus et al., 2015; Pr\u00e9mont-Schwarz et al., 2017; S\u00f8nderby et al., 2016) , stochastic neural network (Fraccaro et al., 2016; Goyal et al., 2017; Shabanian et al., 2017) , Markov recurrent neural network (Venkatraman et al., 2017; Kuo and Chien, 2018) , sequence GAN (Yu et al., 2017) and reinforcement learning (Tegho et al., 2017) are introduced in various deep models which open a window to more practical tasks, e.g. reading comprehension, sentence generation, dialogue system, question answering and machine translation." }, { "sent_id": "ce990a3d035b8e57fe86b8d84ca479-C001-25", "text": "In the final part, we spotlight on some future directions for deep language understanding which can handle the challenges of big data, heterogeneous condition and dynamic system." }, { "sent_id": "ce990a3d035b8e57fe86b8d84ca479-C001-26", "text": "In particular, deep learning, structural learning, temporal modeling, long history representation and stochastic learning are emphasized." }, { "sent_id": "ce990a3d035b8e57fe86b8d84ca479-C001-27", "text": "Slides of this tutorial are available at http://chien.cm.nctu.edu.tw/home/coling/. 
" }, { "sent_id": "ce990a3d035b8e57fe86b8d84ca479-C001-28", "text": "----------------------------------" }, { "sent_id": "ce990a3d035b8e57fe86b8d84ca479-C001-29", "text": "**INSTRUCTOR**" } ], "y": { "@USE@": { "gold_contexts": [ [ "ce990a3d035b8e57fe86b8d84ca479-C001-11" ] ], "cite_sentences": [ "ce990a3d035b8e57fe86b8d84ca479-C001-11" ] }, "@BACK@": { "gold_contexts": [ [ "ce990a3d035b8e57fe86b8d84ca479-C001-11" ] ], "cite_sentences": [ "ce990a3d035b8e57fe86b8d84ca479-C001-11" ] } } }, "ABC_f78b11352d01b6567392f8cb3c7642_1": { "x": [ { "sent_id": "f78b11352d01b6567392f8cb3c7642-C001-6", "text": "Even the character level has been useful to unfold particular patterns [4, 5] ." }, { "sent_id": "f78b11352d01b6567392f8cb3c7642-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "f78b11352d01b6567392f8cb3c7642-C001-2", "text": "Concepts and methods of complex networks have been applied to probe the properties of a myriad of real systems [1] ." }, { "sent_id": "f78b11352d01b6567392f8cb3c7642-C001-3", "text": "The finding that written texts modeled as graphs share several properties of other completely different real systems has inspired the study of language as a complex system [2] ." }, { "sent_id": "f78b11352d01b6567392f8cb3c7642-C001-4", "text": "Actually, language can be represented as a complex network in its several levels of complexity." }, { "sent_id": "f78b11352d01b6567392f8cb3c7642-C001-5", "text": "As a consequence, morphological, syntactical and semantical properties have been employed in the construction of linguistic networks [3] ." }, { "sent_id": "f78b11352d01b6567392f8cb3c7642-C001-7", "text": "In the review by Cong and Liu [6], the authors emphasize the need to use the topological information of complex networks modeling the various spheres of the language to better understand its origins, evolution and organization."
}, { "sent_id": "f78b11352d01b6567392f8cb3c7642-C001-8", "text": "In addition, the authors cite the use of networks in applications aiming at holistic typology and stylistic variations." }, { "sent_id": "f78b11352d01b6567392f8cb3c7642-C001-9", "text": "In this context, I will discuss some possible directions that could be followed in future research directed towards the understanding of language via topological characterization of complex linguistic networks." }, { "sent_id": "f78b11352d01b6567392f8cb3c7642-C001-10", "text": "In addition, I will comment the use of network models for language processing applications." }, { "sent_id": "f78b11352d01b6567392f8cb3c7642-C001-11", "text": "Additional prospects for future practical research lines will also be discussed in this comment." }, { "sent_id": "f78b11352d01b6567392f8cb3c7642-C001-12", "text": "The topological analysis of complex textual networks has been widely studied in the recent years." }, { "sent_id": "f78b11352d01b6567392f8cb3c7642-C001-13", "text": "As for cooccurrence networks of characters, it was possible to verify that they follow the scale-free and small-world features [4] ." }, { "sent_id": "f78b11352d01b6567392f8cb3c7642-C001-14", "text": "Co-occurrence networks of words (or adjacency networks) have accounted for most of the models tackling textual applications." }, { "sent_id": "f78b11352d01b6567392f8cb3c7642-C001-15", "text": "In special, they have been more prevalent than syntactical networks because they represent a simplified representation of the complex syntactical analysis [7, 8] , as most of the syntactical links occur between neighboring words." }, { "sent_id": "f78b11352d01b6567392f8cb3c7642-C001-16", "text": "Despite its outward simplicity, co-occurrence networks have proven useful in many applications, such as in authorship recognition [9] , extractive summarization [10, 11, 12] , stylistic identification [13] and part-of-speech tagging [14] ." 
}, { "sent_id": "f78b11352d01b6567392f8cb3c7642-C001-17", "text": "Furthermore, such representation has also been useful in the analysis of the complexity [15] and quality of texts [16] ." }, { "sent_id": "f78b11352d01b6567392f8cb3c7642-C001-18", "text": "Unfortunately, a major problem arising from the analyses performed with co-occurrence networks is the difficulty to provide a rigorous interpretation of the factors accounting for the success of the model." }, { "sent_id": "f78b11352d01b6567392f8cb3c7642-C001-19", "text": "Therefore, future investigations should pursue a better interpretation at the network level aiming at the understanding of the fundamental properties of the language." }, { "sent_id": "f78b11352d01b6567392f8cb3c7642-C001-20", "text": "Most importantly, it is clear from some recent studies [8, 9 ] that novel topological measurements should be introduced to capture a wider range of linguistic features." }, { "sent_id": "f78b11352d01b6567392f8cb3c7642-C001-21", "text": "Many of the applications relying on network analysis outperform other traditional shallow strategies in natural language processing (see e.g. the extractive summarization task [10, 11] )." }, { "sent_id": "f78b11352d01b6567392f8cb3c7642-C001-22", "text": "However, when deep analyzes are performed, network-based strategies usually do not perform better than other techniques making extensive use of semantic resources and tools." }, { "sent_id": "f78b11352d01b6567392f8cb3c7642-C001-23", "text": "In order to improve the performance of network-based applications, I suggest a twofold research line: (i) the introduction of measurements consistent with the nature of the problem; and (ii) the combination of topological strategies with other traditional natural language processing methods." 
}, { "sent_id": "f78b11352d01b6567392f8cb3c7642-C001-24", "text": "More specifically, in (i), I propose 1 E-mail:diego.raphael@gmail.com, diego@icmc.usp.br" }, { "sent_id": "f78b11352d01b6567392f8cb3c7642-C001-25", "text": "December 4, 2014 the conception of measurements that are able to capture semantic aspects, since the topological measurements of co-occurrence networks capture mostly syntactic factors [8] ." }, { "sent_id": "f78b11352d01b6567392f8cb3c7642-C001-26", "text": "Although such networks have proved useful in some semantical-dependent tasks (see e.g. a topological approach to word sense disambiguation in [17] ), I believe that the creation of novel semantic-based measurements would improve the state of the art." }, { "sent_id": "f78b11352d01b6567392f8cb3c7642-C001-27", "text": "Alternative forms to create the network could also be useful to grasp semantical features hidden in the topological space." }, { "sent_id": "f78b11352d01b6567392f8cb3c7642-C001-28", "text": "In (ii), I suggest, for example, the introduction of a hybrid classifier that could consider both linguistic (deeper linguistic processing [18] ) and topological attributes at the same time in a hybrid way." }, { "sent_id": "f78b11352d01b6567392f8cb3c7642-C001-29", "text": "Examples of combinations of distinct strategies are described in [9] , [19] and [20]." }, { "sent_id": "f78b11352d01b6567392f8cb3c7642-C001-30", "text": "In sum, the network framework has proven applicable to understand the properties of the language and its applications, especially those related to the textual classification in several levels." }, { "sent_id": "f78b11352d01b6567392f8cb3c7642-C001-31", "text": "Despite the limitations imposed by the restrict understanding of the mechanisms behind the classification, it is worth noting that the such representation remains entirely generic, being therefore useful to many tasks as well as for analyzing the evolution of languages, cultures and emotional trends." 
}, { "sent_id": "f78b11352d01b6567392f8cb3c7642-C001-32", "text": "For this reason, I believe that the use of complex networks in both practical and theoretical investigations shall yield novels insights into the mechanisms behind the language." }, { "sent_id": "f78b11352d01b6567392f8cb3c7642-C001-33", "text": "----------------------------------" }, { "sent_id": "f78b11352d01b6567392f8cb3c7642-C001-34", "text": "****" }, { "sent_id": "f78b11352d01b6567392f8cb3c7642-C001-35", "text": "the conception of measurements that are able to capture semantic aspects, since the topological measurements of co-occurrence networks capture mostly syntactic factors [8] ." }, { "sent_id": "f78b11352d01b6567392f8cb3c7642-C001-36", "text": "Although such networks have proved useful in some semantical-dependent tasks (see e.g. a topological approach to word sense disambiguation in [17] ), I believe that the creation of novel semantic-based measurements would improve the state of the art." }, { "sent_id": "f78b11352d01b6567392f8cb3c7642-C001-37", "text": "Alternative forms to create the network could also be useful to grasp semantical features hidden in the topological space." }, { "sent_id": "f78b11352d01b6567392f8cb3c7642-C001-38", "text": "In (ii), I suggest, for example, the introduction of a hybrid classifier that could consider both linguistic (deeper linguistic processing [18] ) and topological attributes at the same time in a hybrid way." }, { "sent_id": "f78b11352d01b6567392f8cb3c7642-C001-39", "text": "Examples of combinations of distinct strategies are described in [9] , [19] and [20] ." }, { "sent_id": "f78b11352d01b6567392f8cb3c7642-C001-40", "text": "In sum, the network framework has proven applicable to understand the properties of the language and its applications, especially those related to the textual classification in several levels." 
}, { "sent_id": "f78b11352d01b6567392f8cb3c7642-C001-41", "text": "Despite the limitations imposed by the restrict understanding of the mechanisms behind the classification, it is worth noting that the such representation remains entirely generic, being therefore useful to many tasks as well as for analyzing the evolution of languages, cultures and emotional trends." }, { "sent_id": "f78b11352d01b6567392f8cb3c7642-C001-42", "text": "For this reason, I believe that the use of complex networks in both practical and theoretical investigations shall yield novels insights into the mechanisms behind the language." } ], "y": { "@BACK@": { "gold_contexts": [ [ "f78b11352d01b6567392f8cb3c7642-C001-16", "f78b11352d01b6567392f8cb3c7642-C001-19", "f78b11352d01b6567392f8cb3c7642-C001-20" ], [ "f78b11352d01b6567392f8cb3c7642-C001-29" ], [ "f78b11352d01b6567392f8cb3c7642-C001-39" ] ], "cite_sentences": [ "f78b11352d01b6567392f8cb3c7642-C001-16", "f78b11352d01b6567392f8cb3c7642-C001-20", "f78b11352d01b6567392f8cb3c7642-C001-29", "f78b11352d01b6567392f8cb3c7642-C001-39" ] }, "@MOT@": { "gold_contexts": [ [ "f78b11352d01b6567392f8cb3c7642-C001-19", "f78b11352d01b6567392f8cb3c7642-C001-20" ] ], "cite_sentences": [ "f78b11352d01b6567392f8cb3c7642-C001-20" ] }, "@FUT@": { "gold_contexts": [ [ "f78b11352d01b6567392f8cb3c7642-C001-19", "f78b11352d01b6567392f8cb3c7642-C001-20" ] ], "cite_sentences": [ "f78b11352d01b6567392f8cb3c7642-C001-20" ] }, "@USE@": { "gold_contexts": [ [ "f78b11352d01b6567392f8cb3c7642-C001-28", "f78b11352d01b6567392f8cb3c7642-C001-29" ], [ "f78b11352d01b6567392f8cb3c7642-C001-38", "f78b11352d01b6567392f8cb3c7642-C001-39" ] ], "cite_sentences": [ "f78b11352d01b6567392f8cb3c7642-C001-29", "f78b11352d01b6567392f8cb3c7642-C001-39" ] } } }, "ABC_2b6dd9388c43df4416c738b2d1ed5f_1": { "x": [ { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-47", "text": "**CLASSIFIER MODEL**" }, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-1", "text": "**ABSTRACT**" }, { "sent_id": 
"2b6dd9388c43df4416c738b2d1ed5f-C001-2", "text": "In multilingual societies like the Indian subcontinent, use of code-switched languages is much popular and convenient for the users." }, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-3", "text": "In this paper, we study offense and abuse detection in the code-switched pair of Hindi and English (i.e. Hinglish), the pair that is the most spoken." }, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-4", "text": "The task is made difficult due to non-fixed grammar, vocabulary, semantics and spellings of Hinglish language." }, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-5", "text": "We apply transfer learning and make a LSTM based model for hate speech classification." }, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-6", "text": "This model surpasses the performance shown by the current best models to establish itself as the state-of-the-art in the unexplored domain of Hinglish offensive text classification." }, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-7", "text": "We also release our model and the embeddings trained for research purposes." }, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-8", "text": "----------------------------------" }, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-10", "text": "With the penetration of internet among masses, the content being posted on social media channels has uptaken." }, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-11", "text": "Specifically, in the Indian subcontinent, number of Internet users has crossed 500 mi 1 , and is rising rapidly due to inexpensive data 2 ." }, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-12", "text": "With this rise, comes the problem of hate speech, offensive and abusive posts on social media." 
}, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-13", "text": "Although there are many previous works which deal with Hindi and English hate speech (the top two languages in India), but very few on the code-switched version (Hinglish) of the two (Mathur et al. 2018) ." }, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-14", "text": "This is partially due to the following reasons: (i) Hinglish consists of no-fixed grammar and vocabulary." }, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-15", "text": "It derives a part of its semantics from Devnagari and another part from the Roman script." }, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-16", "text": "(ii) Hinglish speech and written text consists of a concoction of words spoken in Hindi as well as English, but written in the Roman script." }, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-17", "text": "This makes the spellings variable and dependent on the writer of the text." }, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-18", "text": "Hence code-switched languages present tough challenges in terms of parsing and getting the meaning out of the text." }, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-19", "text": "For instance, the sentence, \"Modiji foreign yatra par hai\", is in the Hinglish language." }, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-20", "text": "Somewhat correct translation of this would be, \"Mr. Modi is on a foriegn tour\"." }, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-21", "text": "However, even this translation has some flaws due to no direct translation available for the word ji, which is used to show respect." }, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-46", "text": "----------------------------------" }, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-22", "text": "Verbatim translation would lead to \"Mr. 
Modi foreign tour on is\". (Table 1 gives examples of word-pairs in the Hinglish-English dictionary: acha/good, gunda/thug, s**la/blo*dy, ra*di/h*oker.)" }, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-23", "text": "Moreover, the word yatra here, can have phonetic variations, which would result in multiple spellings of the word as yatra, yaatra, yaatraa, etc." }, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-24", "text": "Also, the problem of hate speech has been rising in India, and according to the policies of the government and the various social networks, one is not allowed to misuse his right to speech to abuse some other community or religion." }, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-25", "text": "Due to the various difficulties associated with the Hinglish language, it is challenging to automatically detect and ban such kind of speech." }, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-26", "text": "Thus, with this in mind, we build a transfer learning based model for the code-switched language Hinglish, which outperforms the baseline model of (Mathur et al. 2018) ." }, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-27", "text": "We also release the embeddings and the model trained." }, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-28", "text": "----------------------------------" }, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-29", "text": "**METHODOLOGY**" }, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-30", "text": "Our methodology primarily consists of these steps: preprocessing of the dataset, training of word embeddings, training of the classifier model and then using it on the HEOT dataset."
}, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-31", "text": "----------------------------------" }, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-32", "text": "**PRE-PROCESSING**" }, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-33", "text": "In this work, we use the datasets released by (Davidson et al. 2017 ) and HEOT dataset provided by (Mathur et al. 2018) ." }, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-34", "text": "The datasets obtained pass through these steps of processing: (i) Removal of punctuatios, stopwords, URLs, numbers, emoticons, etc." }, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-35", "text": "This was then followed by transliteration using the Xlit-Crowd conversion dictionary 3 and translation of each word to English using Hindi to English dictionary 4 ." }, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-36", "text": "To deal with the spelling variations, we manually added some common variations of popular Hinglish words." }, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-37", "text": "Final dictionary comprised of 7200 word pairs." }, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-38", "text": "Additionally, to deal with profane words, which are not present in Xlit-Crowd, we had to make a profanity dictionary (with 209 profane words) as well." }, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-39", "text": "Table 1 gives some examples from the dictionary." }, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-40", "text": "----------------------------------" }, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-41", "text": "**TRAINING WORD EMBEDDINGS**" }, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-42", "text": "We tried Glove (Pennington, Socher, and Manning 2014) and Twitter word2vec (Godin et al. 2015) code for training embeddings for the processed tweets." 
}, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-43", "text": "The embeddings were trained on both the datasets provided by (Davidson et al. 2017 ) and HEOT." }, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-44", "text": "These embeddings help to learn distributed representations of tweets." }, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-45", "text": "After experimentation, we kept the size of embeddings fixed to 100." }, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-48", "text": "Both the HEOT and (Davidson et al. 2017 ) datasets contain tweets which are annotated in three categories: offensive, abusive and none (or benign)." }, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-49", "text": "Some examples from the dataset are shown in Table 2 ." }, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-50", "text": "We use a LSTM based classifier model for training our model to classify these tweets into these three categories." }, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-51", "text": "An overview of the model is given in the Figure 1 ." }, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-52", "text": "The model consists of one layer of LSTM followed by three dense layers." }, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-53", "text": "The LSTM layer uses a dropout value of 0.2." }, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-54", "text": "Categorical crossentropy loss was used for the last layer due to the presence of multiple classes." }, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-55", "text": "We use Adam optimizer along with L2 regularisation to prevent overfitting." }, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-56", "text": "As indicated by the Figure 1 , the model was initially trained on the dataset provided by (Davidson et al. 2017) , and then re-trained on the HEOT dataset so as to benefit from the transfer of learned features in the last stage." 
}, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-57", "text": "The model hyperparameters were experimentally selected by trying out a large number of combinations through grid search." }, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-58", "text": "Results Table 3 shows the performance of our model (after getting trained on (Davidson et al. 2017) ) with two types of embeddings in comparison to the models by (Mathur et al. 2018) and (Davidson et al. 2017 ) on the HEOT dataset averaged over three runs." }, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-59", "text": "We also compare results on pre-trained embeddings." }, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-60", "text": "As shown in the table, our model when given Glove embeddings performs better than all other models." }, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-61", "text": "For comparison purposes, in Table 4 we have also evaluated our results on the dataset by (Davidson et al. 2017 )." }, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-62", "text": "----------------------------------" }, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-63", "text": "**CONCLUSION**" }, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-64", "text": "In this paper, we presented a pipeline which given Hinglish text can classify it into three categories: offensive, abusive and benign." }, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-65", "text": "This LSTM based model performs better than the other systems present." }, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-66", "text": "We also release the code, the dictionary made and the embeddings trained in the process." }, { "sent_id": "2b6dd9388c43df4416c738b2d1ed5f-C001-67", "text": "We believe this model would be useful in hate speech detection tasks for code-switched languages." 
} ], "y": { "@USE@": { "gold_contexts": [ [ "2b6dd9388c43df4416c738b2d1ed5f-C001-33" ], [ "2b6dd9388c43df4416c738b2d1ed5f-C001-43" ], [ "2b6dd9388c43df4416c738b2d1ed5f-C001-48", "2b6dd9388c43df4416c738b2d1ed5f-C001-50" ], [ "2b6dd9388c43df4416c738b2d1ed5f-C001-56" ], [ "2b6dd9388c43df4416c738b2d1ed5f-C001-61" ], [ "2b6dd9388c43df4416c738b2d1ed5f-C001-58" ] ], "cite_sentences": [ "2b6dd9388c43df4416c738b2d1ed5f-C001-33", "2b6dd9388c43df4416c738b2d1ed5f-C001-43", "2b6dd9388c43df4416c738b2d1ed5f-C001-48", "2b6dd9388c43df4416c738b2d1ed5f-C001-56", "2b6dd9388c43df4416c738b2d1ed5f-C001-61", "2b6dd9388c43df4416c738b2d1ed5f-C001-58" ] }, "@BACK@": { "gold_contexts": [ [ "2b6dd9388c43df4416c738b2d1ed5f-C001-48" ] ], "cite_sentences": [ "2b6dd9388c43df4416c738b2d1ed5f-C001-48" ] }, "@DIF@": { "gold_contexts": [ [ "2b6dd9388c43df4416c738b2d1ed5f-C001-58" ] ], "cite_sentences": [ "2b6dd9388c43df4416c738b2d1ed5f-C001-58" ] }, "@SIM@": { "gold_contexts": [ [ "2b6dd9388c43df4416c738b2d1ed5f-C001-58" ] ], "cite_sentences": [ "2b6dd9388c43df4416c738b2d1ed5f-C001-58" ] } } }, "ABC_1f48420f55771e243c73babf54632f_1": { "x": [ { "sent_id": "1f48420f55771e243c73babf54632f-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "1f48420f55771e243c73babf54632f-C001-2", "text": "Given the current growth in research and related emerging technologies in machine learning and deep learning, it is timely to introduce this tutorial to a large number of researchers and practitioners who are attending COLING 2018 and working on statistical models, deep neural networks, sequential learning and natural language understanding." }, { "sent_id": "1f48420f55771e243c73babf54632f-C001-3", "text": "To the best of our knowledge, there is no similar tutorial presented in previous ACL/COLING/EMNLP/NAACL." 
}, { "sent_id": "1f48420f55771e243c73babf54632f-C001-4", "text": "This three-hour tutorial will concentrate on a wide range of theories and applications and systematically present the recent advances in deep Bayesian and sequential learning which are impacting the communities of computational linguistics, human language technology and machine learning for natural language processing." }, { "sent_id": "1f48420f55771e243c73babf54632f-C001-5", "text": "----------------------------------" }, { "sent_id": "1f48420f55771e243c73babf54632f-C001-6", "text": "**TUTORIAL DESCRIPTION**" }, { "sent_id": "1f48420f55771e243c73babf54632f-C001-7", "text": "This tutorial introduces the advances in deep Bayesian learning with abundant applications for natural language understanding ranging from speech recognition (Saon and Chien, 2012; Chan et al., 2016) to document summarization (Chang and Chien, 2009 ), text classification (Blei et al., 2003; Zhang et al., 2015) , text segmentation (Chien and Chueh, 2012) , information extraction (Narasimhan et al., 2016) , image caption generation (Vinyals et al., 2015; Xu et al., 2015) , sentence generation (Li et al., 2016b) , dialogue control (Zhao and Eskenazi, 2016; Li et al., 2016a) , sentiment classification, recommendation system, question answering (Sukhbaatar et al., 2015) and machine translation , to name a few." }, { "sent_id": "1f48420f55771e243c73babf54632f-C001-8", "text": "Traditionally, \"deep learning\" is taken to be a learning process where the inference or optimization is based on the real-valued deterministic model." }, { "sent_id": "1f48420f55771e243c73babf54632f-C001-9", "text": "The \"semantic structure\" in words, sentences, entities, actions and documents drawn from a large vocabulary may not be well expressed or correctly optimized in mathematical logic or computer programs." 
}, { "sent_id": "1f48420f55771e243c73babf54632f-C001-10", "text": "The \"distribution function\" in discrete or continuous latent variable model for natural language may not be properly decomposed or estimated in model inference." }, { "sent_id": "1f48420f55771e243c73babf54632f-C001-11", "text": "This tutorial addresses the fundamentals of statistical models and neural networks, and focus on a series of advanced Bayesian models and deep models including hierarchical Dirichlet process , Chinese restaurant process (Blei et al., 2010) , hierarchical Pitman-Yor process (Teh, 2006) , Indian buffet process (Ghahramani and Griffiths, 2005) , recurrent neural network (Mikolov et al., 2010; Van Den Oord et al., 2016) , long short-term memory (Hochreiter and Schmidhuber, 1997; , sequence-to-sequence model (Sutskever et al., 2014), variational auto-encoder (Kingma and Welling, 2014) , generative adversarial network (Goodfellow et al., 2014) , attention mechanism (Chorowski et al., 2015; Seo et al., 2016) , memory-augmented neural network (Graves et al., 2014; Graves et al., 2014) , stochastic neural network Miao et al., 2016) , predictive state neural network (Downey et al., 2017) , policy gradient (Yu et al., 2017) and reinforcement learning (Mnih et al., 2015) ." }, { "sent_id": "1f48420f55771e243c73babf54632f-C001-12", "text": "We present how these models are connected and why they work for a variety of applications on symbolic and complex patterns in natural language." }, { "sent_id": "1f48420f55771e243c73babf54632f-C001-13", "text": "The variational inference and sampling method are formulated to tackle the optimization for complicated models (Rezende et al., 2014) ." }, { "sent_id": "1f48420f55771e243c73babf54632f-C001-14", "text": "The word and sentence embeddings, clustering and co-clustering are merged with linguistic and semantic constraints." 
}, { "sent_id": "1f48420f55771e243c73babf54632f-C001-15", "text": "A series of case studies are presented to tackle different issues in deep Bayesian learning and understanding." }, { "sent_id": "1f48420f55771e243c73babf54632f-C001-16", "text": "At last, we point out a number of directions and outlooks for future studies." }, { "sent_id": "1f48420f55771e243c73babf54632f-C001-17", "text": "The presentation of this tutorial is arranged into five parts." }, { "sent_id": "1f48420f55771e243c73babf54632f-C001-18", "text": "First of all, we share the current status of researches on natural language understanding, statistical modeling and deep neural network and explain the key issues in deep Bayesian learning for discrete-valued observation data and latent semantics." }, { "sent_id": "1f48420f55771e243c73babf54632f-C001-19", "text": "A new paradigm called the symbolic neural learning is introduced to extend how data analysis is performed from language processing to semantic learning and memory networking." }, { "sent_id": "1f48420f55771e243c73babf54632f-C001-20", "text": "Secondly, we address a number of Bayesian models ranging from latent variable model to VB inference (Chien and Chang, 2014; Chien and Chueh, 2011; Chien, 2015b) , MCMC sampling (Watanabe and Chien, 2015) and BNP learning (Chien, 2016; Chien, 2015a; Chien, 2018) for hierarchical, thematic and sparse topics from natural language." }, { "sent_id": "1f48420f55771e243c73babf54632f-C001-21", "text": "In the third part, a series of deep models including deep unfolding (Chien and Lee, 2018) , Bayesian RNN (Gal and Ghahramani, 2016; Chien and Ku, 2016) , sequence-to-sequence learning (Graves et al., 2006; Gehring et al., 2017) , CNN (Kalchbrenner et al., 2014; Xingjian et al., 2015; , GAN (Tsai and Chien, 2017) and VAE are introduced." }, { "sent_id": "1f48420f55771e243c73babf54632f-C001-22", "text": "The coffee break is arranged within this part." 
}, { "sent_id": "1f48420f55771e243c73babf54632f-C001-23", "text": "Next, the fourth part focuses on a variety of advanced studies which illustrate how deep Bayesian learning is developed to infer the sophisticated recurrent models for natural language understanding." }, { "sent_id": "1f48420f55771e243c73babf54632f-C001-24", "text": "In particular, the memory network Chien and Lin, 2018) , neural variational learning (Serban et al., 2017; Chung et al., 2015) , neural discrete representation (Jang et al., 2016; Maddison et al., 2016; van den Oord et al., 2017) , recurrent ladder network (Rasmus et al., 2015; Pr\u00e9mont-Schwarz et al., 2017; S\u00f8nderby et al., 2016) , stochastic neural network (Fraccaro et al., 2016; Goyal et al., 2017; Shabanian et al., 2017) , Markov recurrent neural network (Venkatraman et al., 2017; Kuo and Chien, 2018) , sequence GAN (Yu et al., 2017) and reinforcement learning (Tegho et al., 2017) are introduced in various deep models which open a window to more practical tasks, e.g. reading comprehension, sentence generation, dialogue system, question answering and machine translation." }, { "sent_id": "1f48420f55771e243c73babf54632f-C001-25", "text": "In the final part, we spotlight on some future directions for deep language understanding which can handle the challenges of big data, heterogeneous condition and dynamic system." }, { "sent_id": "1f48420f55771e243c73babf54632f-C001-26", "text": "In particular, deep learning, structural learning, temporal modeling, long history representation and stochastic learning are emphasized." }, { "sent_id": "1f48420f55771e243c73babf54632f-C001-27", "text": "Slides of this tutorial are available at http://chien.cm.nctu.edu.tw/home/coling/. 
Related tutorials were presented at APSIPA 2013, ISCSLP 2014, Interspeech 2013 and 2016, and ICASSP 2012." }, { "sent_id": "1f48420f55771e243c73babf54632f-C001-28", "text": "----------------------------------" }, { "sent_id": "1f48420f55771e243c73babf54632f-C001-29", "text": "**INSTRUCTOR**" } ], "y": { "@USE@": { "gold_contexts": [ [ "1f48420f55771e243c73babf54632f-C001-7" ] ], "cite_sentences": [ "1f48420f55771e243c73babf54632f-C001-7" ] } } }, "ABC_e6fe4c6c32294072dbc1ee5bb0a606_2": { "x": [ { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-2", "text": "We propose a smoothed max pooling loss and its application to keyword spotting systems." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-3", "text": "The proposed approach jointly trains an encoder (to detect keyword parts) and a decoder (to detect the whole keyword) in a semi-supervised manner." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-4", "text": "The proposed new loss function allows training a model to detect the parts and whole of a keyword, without strictly depending on frame-level labeling from LVCSR (large vocabulary continuous speech recognition), making further optimization possible." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-5", "text": "The proposed system outperforms the baseline keyword spotting model in [1] due to increased optimizability." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-6", "text": "Further, it can be more easily adapted for on-device learning applications due to reduced dependency on LVCSR." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-7", "text": "----------------------------------" }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-9", "text": "Keyword detection has become an important front-end service for ASR-based assistant interfaces (e.g. Hey Google, Alexa, Hey Siri)."
}, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-10", "text": "As assistant technology spreads to more ubiquitous use-cases (mobile, IOT), reducing resource consumption (memory and computation) while improving accuracy has been the key success criteria of keyword spotting techniques." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-11", "text": "Following the successes in general ASR [2, 3] , the neural network based approach has been extensively explored in keyword spotting area with benefits of lowering resource requirements and improving accuracy [4, 5, 6, 7, 8, 9, 10, 11] ." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-12", "text": "Such works include DNN + temporal integration [4, 5, 11, 12] , and HMM + DNN hybrid approaches [6, 7, 8, 9, 10] ." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-13", "text": "Recently introduced end-to-end trainable DNN approaches [1, 13] further improved accuracy and lowered resource requirements using highly optimizable system design." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-14", "text": "In general, training of such DNN based systems required framelevel labels generated by LVCSR systems [14, 1] ." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-15", "text": "These approaches make end-to-end optimizable keyword spotting system depend on labels generated from non-end-to-end system trained for a different task." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-16", "text": "However, for keyword-spotting, the exact position of the keyword is not as relevant as its presence." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-17", "text": "Therefore, such strict dependency on frame-level labels may limit further optimization promised by the end-to-end approach." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-18", "text": "In [1] , the top level loss is derived by integrating frame-level losses, which are computed using frame-level labels from LVCSR." 
}, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-19", "text": "Integrating frame-level losses penalizes slightly mis-aligned correct predictions, which can limit detection accuracy, especially for difficult data (e.g. noisy or accented speech) where LVCSR labels may have higher-than-normal uncertainty." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-20", "text": "In such case, losses can be fully minimized only when the predicted value and position-in-time matches that of provided frame level labels, where exact position match is not highly relevant for high accuracy." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-21", "text": "Prior work of CTC-training [15] or sequence-to-sequence training [16] dont require on frame level alignment information." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-22", "text": "However those approaches need to train full-sized encoders, which require fully transcribed speech data." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-23", "text": "Work in [17] proposed max pooling loss, which doesnt depend on phoneme level alignment information, but its application is limited to decoder level training." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-24", "text": "In this paper, we prepose a new smoothed max pooling loss for training an end-to-end keyword spotting system." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-25", "text": "The new loss function reduces dependence on dense labels from LVCSR." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-26", "text": "Further, the new loss function jointly trains an encoder (detecting keyword parts) and decoder (detecting whole keyword)." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-27", "text": "One can train models to generate stable activations for a target pattern even without exact location of the target specified." 
}, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-28", "text": "We describe the details of the proposed method in Section 2." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-29", "text": "Then we show experiment setup in Section 3, and results in Section 4." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-30", "text": "We conclude with discussions in Section 5." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-31", "text": "----------------------------------" }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-32", "text": "**SMOOTHED MAX POOLING LOSS FOR TRAINING ENCODER/DECODER KEYWORD SPOTTING MODEL**" }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-33", "text": "The proposed model uses the same encoder/decoder structure as [1] ( Fig.1 ), but it differs in that encoder and decoder models are trained simultaneously using smoothed max pooling loss." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-34", "text": "In [1] , both encoder and decoder models are trained with cross entropy (CE) loss using frame level labels." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-35", "text": "In the proposed approach, we define losses for encoder and decoder using smoothed max pooling loss, and optimize the combination of two losses simultaneously." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-36", "text": "The proposed smoothed max pooling loss doesnt strictly depend on phoneme-level alignment, allowing better optimization than the baseline." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-37", "text": "----------------------------------" }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-38", "text": "**BASELINE END-TO-END KEYWORD SPOTTING MODEL**" }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-39", "text": "Both the baseline and the proposed model have an encoder which takes spectral domain feature Xt as input and generate (K+1) outputs Y E corresponding to phoneme-like sound units (Fig.1 )." 
}, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-40", "text": "The decoder model takes the encoder output as input and generates binary output Y D that predicts existence of a keyword ." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-41", "text": "The model is fed with acoustic input features at each frame (generated every 10ms), and generates prediction labels at each frame in a streaming manner." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-42", "text": "In [1] , the encoder model is trained first, and then the decoder model is trained while encoder model weights are frozen." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-43", "text": "In [1] , the encoder model is trained to predict phonemelevel labels provided from LVCSR." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-44", "text": "Both encoder and decoder models use CE-loss defined in Eq (1) and (2), where Xt = [xt\u2212C l , \u00b7 \u00b7 \u00b7 , xt, \u00b7 \u00b7 \u00b7 , xt+C r ], xt is spectral feature of d-dimension, yi(Xt, W ) stands for ith dimension of network's softmax output, W is network weight, and ct is a frame-level label at frame t." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-45", "text": "In [1] , target label sequence consists of intervals of repeated labels which we call runs." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-46", "text": "These label runs define clearly defined intervals where a model should learn to generate strong activation in label output dimension." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-47", "text": "While such model behavior can be trained end-to-end, the labels need to be provided from a LVCSR system which is typically non-end-to-end system [2] ." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-48", "text": "The timing and accuracy of labels from LVCSR system can limit the accuracy of the trained model." 
}, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-49", "text": "----------------------------------" }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-50", "text": "**SMOOTHED MAX POOLING LOSS**" }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-51", "text": "Instead of interval based CE-loss, we propose temporal max pooling operation to avoid specifying exact activation position (timing) from supervised labels." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-52", "text": "We also propose to apply temporal smoothing on the logits of frames before max pooling operation." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-53", "text": "[17] also explores max pooling loss, where one specifies a window of max pooling in the time domain, and computes CE loss only with the logit of the frame with maximum activation." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-54", "text": "However, with such simple max pooling loss, the learned activation tends to resemble a delta function, whose peak values tend to be unstable under small variation and temporal shift of audio." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-55", "text": "By introducing temporal smoothing on logits before max pooling, the model learns temporally smooth activation and stable peak values." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-56", "text": "Eqs.(3) to (7) define the smoothed max pooling loss." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-57", "text": "Where s(t) is a smoothing filter, is a convolution over time and [\u03c4 start i , \u03c4 end i ] defines the interval of ith max pooling window." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-58", "text": "\u03c4 c is a set of frames not included in any of the max pooling windows." 
}, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-59", "text": "----------------------------------" }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-60", "text": "**SMOOTHED MAX POOLING LOSS FOR DECODER**" }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-61", "text": "In our proposed approach, the decoder submodel is trained to generate strong activation on output dimension 1 near end of keyword." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-62", "text": "Eqs.(3) to (7) (with the number of positive targets n=1) and (8) to (9) define the loss for the decoder submodel. ] includes actual end-point of the keyword." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-63", "text": "By defining the interval long enough, the model can learn optimal position of strongest activation in a semi-supervised manner." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-64", "text": "For current work, we used word level alignment from [2] to get \u03c9 end , but it can be computed from output of existing detection model such as [1] ." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-65", "text": "Fig. 2 (a) and (c) visualizes relationship between keyword and decoder pooling window." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-66", "text": "----------------------------------" }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-67", "text": "**SMOOTHED MAX POOLING LOSS FOR ENCODER**" }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-68", "text": "Unlike [17] where only decoder level output is trained with max pooling, we propose training encoder level output also using smoothed max pooling." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-69", "text": "In our method, encoder model learns a sequence of sound-parts that constitute a keyword in a semi-supervised manner." 
}, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-70", "text": "This can be done by placing K max-pooling windows sequentially over expected keyword location and define a max pooling loss at each window ( Fig. 2(b) )." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-71", "text": "The number of windows (n=K) and win size E are tuned such that K approximates number of distinguishable sound parts (i.e. phonemes), and K * win size E matches the average length of the keyword." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-72", "text": "Eqs.(3) to (7) and (10) to (11) define such encoder loss." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-73", "text": "Where offset E and win size E are also tunable parameters." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-74", "text": "Fig.2 (a) and 2 (b) show the relationship between expected keyword and pooling windows." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-75", "text": "Both the encoder and the decoder models are trained jointly using loss in (12) ." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-76", "text": "The tunable parameter \u03b1 controls the relative importance of each loss." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-77", "text": "T otal loss = \u03b1 * Loss E + Loss D (12)" }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-78", "text": "----------------------------------" }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-79", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-80", "text": "We compare the model trained with the new smoothed max pooling loss on encoder/decoder architecture with the baseline in [1] ." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-81", "text": "Both the baseline and the proposed model have the same architecture." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-82", "text": "Only the training losses are different." 
}, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-83", "text": "Details of the setup are discussed below." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-84", "text": "----------------------------------" }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-85", "text": "**FRONT-END**" }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-86", "text": "We used the same frontend feature extract as the baseline [1] in our experiments." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-87", "text": "The front-end extracts and stacks a 40-d feature vector of log-mel filter-bank energies at each frame and stacks them to generate input feature vector Xt. Refer to [1] for further details." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-88", "text": "----------------------------------" }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-89", "text": "**MODEL SETUP**" }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-90", "text": "We selected E2E 318K architecture in [1] as the baseline and use the same structure for testing all other models." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-91", "text": "As shown in Fig. 1 , the model has 7 SVDF layers and 3 linear bottleneck dense layers." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-92", "text": "For detailed architectural parameters, please refer to [1] ." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-93", "text": "We call the baseline model as Baseline CE CE where encoder and decoder submodels are trained with CE loss." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-94", "text": "We call the proposed model as Max4 SMP SMP where both encoder and decoder submodels are trained by SMP (smoothed max pooling) loss." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-95", "text": "We also performed ablation study by testing other models that use different losses." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-96", "text": "Table 1 summarizes all the tested models." 
}, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-97", "text": "Model Max1-Max3 uses SMP (smoothed max pooling) loss for the decoder, but uses different losses for the encoder." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-98", "text": "Max1 CTC SMP used CTC loss to train the encoder." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-99", "text": "Standard CTC loss function from Tensorflow [18] was used." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-100", "text": "CTC loss doesnt need alignments, but it learns peaky activations whose peak values are not highly stable." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-101", "text": "Max2 NA SMP has no encoder loss (i.e. \u03b1 = 0), s.t." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-102", "text": "the entire network is trained by decoder loss only." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-103", "text": "Max3 CE SMP used baseline CE loss for encoder." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-104", "text": "Model Max4-Max7 are tested to measure the importance of the smoothing operation." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-105", "text": "MP means max pooling without smoothing (i.e. s(t) = 1)." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-106", "text": "For the decoder SMP(smoothed max pooling) loss, we used truncated Gaussian as the smoothing filter s(t) with \u00b5 = 0, \u03c3 = 9 frames (90ms) and truncated length 21 frames." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-107", "text": "Max pooling window of size 60 frames (600ms) with offset D = 40 frames (400ms) is used." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-108", "text": "For the encoder SMP loss, we used truncated gaussian with \u00b5 = 0, \u03c3 = 4 frames and truncated length 9." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-109", "text": "Encoder max pooling windows have size of 20 frames with offset E = 40 frames." 
}, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-110", "text": "These windows are placed sequentially in 40 frames interval." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-111", "text": "----------------------------------" }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-112", "text": "**DATASET**" }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-113", "text": "The training data consists of 2.1 million anonymized utterances with the keywords Ok Google and Hey Google." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-114", "text": "Data augmentation similar to [1] has been used for better robustness." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-115", "text": "Evaluation is done on four data sets separate from training data, representing diverse environmental conditions -Clean non-accented set contains 170K non-accented English utterances of keywords in quiet condition." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-116", "text": "Clean accented has 138K English utterances of keyword with Australian, British, and Indian accents in quiet conditions." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-117", "text": "Query logs contains 58K utterances from anonymized voice search queries." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-118", "text": "In-vehicle set has 144K utterances with the keywords recorded inside cars while driving, which includes significant amount of noises from road, engine, and fans." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-119", "text": "All sets are augmented with 64K negative utterances, which are re-recorded TV noise." 
}, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-120", "text": "----------------------------------" }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-121", "text": "**RESULTS**" }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-122", "text": "To show effectiveness of the proposed approach, we evaluated falsereject (FR) and false-accept (FA) tradeoff across various models described in Section 3." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-123", "text": "All models are converted to inference models using TensorFlow Lites quantization [19] ." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-124", "text": "Table 2 summarizes FR rates of models in Fig.3 and 4 at selected FA rate (0.1 FA per hour measured on 64K re-recorded TV noise set)." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-125", "text": "Fig.3 shows the ROC curves of various models (baseline, Max1-Max4) across different conditions." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-126", "text": "Figure 4 shows the ROC curves of Max4-Max7 models across different conditions." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-127", "text": "Across model types and evaluation conditions Max4 SMP SMP shows the best accuracy and ROC curve." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-128", "text": "Max3 CE MP model also performs better than the baseline but not as good as Max4." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-129", "text": "Other variations Max2 (has only decoder loss) and Max1 (has CTC encoder loss) performed worse than baseline." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-130", "text": "Comparison among models with max pooling and different smoothing options (Fig.4) shows that Max4 SMP SMP (smoothed max poling on both encoder and decoder) performs the best and outperforms Max7(no smoothing on encoder and decoder max pooling loss)." 
}, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-131", "text": "Especially the proposed Max4 model reduces FR rate to nearly half of the baseline in clean accented and noisy inside-vehicle conditions, where it's more difficult to obtain training data with accurate alignments." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-132", "text": "----------------------------------" }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-133", "text": "**CONCLUSION**" }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-134", "text": "We presented smoothed max pooling loss for training keyword spotting model with improved optimizability." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-135", "text": "Experiments show that the proposed approach outperforms the baseline model with CE loss by relative 22%-54% across a variety of conditions." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-136", "text": "Further, we show that applying smoothing before max pooling is highly important for achieving accuracy better than the baseline." }, { "sent_id": "e6fe4c6c32294072dbc1ee5bb0a606-C001-137", "text": "The proposed approach provides further benefits of reducing dependence on LVCSR to provide phoneme level alignments, which is desirable for embedded learning scenarios, like on-device learning [20] [21] ." 
} ], "y": { "@DIF@": { "gold_contexts": [ [ "e6fe4c6c32294072dbc1ee5bb0a606-C001-5" ], [ "e6fe4c6c32294072dbc1ee5bb0a606-C001-33" ], [ "e6fe4c6c32294072dbc1ee5bb0a606-C001-36" ], [ "e6fe4c6c32294072dbc1ee5bb0a606-C001-80" ], [ "e6fe4c6c32294072dbc1ee5bb0a606-C001-93", "e6fe4c6c32294072dbc1ee5bb0a606-C001-94" ], [ "e6fe4c6c32294072dbc1ee5bb0a606-C001-128" ], [ "e6fe4c6c32294072dbc1ee5bb0a606-C001-129" ], [ "e6fe4c6c32294072dbc1ee5bb0a606-C001-131" ], [ "e6fe4c6c32294072dbc1ee5bb0a606-C001-135" ], [ "e6fe4c6c32294072dbc1ee5bb0a606-C001-136" ] ], "cite_sentences": [ "e6fe4c6c32294072dbc1ee5bb0a606-C001-5", "e6fe4c6c32294072dbc1ee5bb0a606-C001-33", "e6fe4c6c32294072dbc1ee5bb0a606-C001-36", "e6fe4c6c32294072dbc1ee5bb0a606-C001-80", "e6fe4c6c32294072dbc1ee5bb0a606-C001-93", "e6fe4c6c32294072dbc1ee5bb0a606-C001-128", "e6fe4c6c32294072dbc1ee5bb0a606-C001-129", "e6fe4c6c32294072dbc1ee5bb0a606-C001-131", "e6fe4c6c32294072dbc1ee5bb0a606-C001-135", "e6fe4c6c32294072dbc1ee5bb0a606-C001-136" ] }, "@BACK@": { "gold_contexts": [ [ "e6fe4c6c32294072dbc1ee5bb0a606-C001-13" ], [ "e6fe4c6c32294072dbc1ee5bb0a606-C001-14" ], [ "e6fe4c6c32294072dbc1ee5bb0a606-C001-18" ], [ "e6fe4c6c32294072dbc1ee5bb0a606-C001-42" ], [ "e6fe4c6c32294072dbc1ee5bb0a606-C001-43" ], [ "e6fe4c6c32294072dbc1ee5bb0a606-C001-45" ] ], "cite_sentences": [ "e6fe4c6c32294072dbc1ee5bb0a606-C001-13", "e6fe4c6c32294072dbc1ee5bb0a606-C001-14", "e6fe4c6c32294072dbc1ee5bb0a606-C001-18", "e6fe4c6c32294072dbc1ee5bb0a606-C001-42", "e6fe4c6c32294072dbc1ee5bb0a606-C001-43", "e6fe4c6c32294072dbc1ee5bb0a606-C001-45" ] }, "@USE@": { "gold_contexts": [ [ "e6fe4c6c32294072dbc1ee5bb0a606-C001-33" ], [ "e6fe4c6c32294072dbc1ee5bb0a606-C001-86" ], [ "e6fe4c6c32294072dbc1ee5bb0a606-C001-90" ], [ "e6fe4c6c32294072dbc1ee5bb0a606-C001-103" ], [ "e6fe4c6c32294072dbc1ee5bb0a606-C001-125" ] ], "cite_sentences": [ "e6fe4c6c32294072dbc1ee5bb0a606-C001-33", "e6fe4c6c32294072dbc1ee5bb0a606-C001-86", "e6fe4c6c32294072dbc1ee5bb0a606-C001-90", 
"e6fe4c6c32294072dbc1ee5bb0a606-C001-103", "e6fe4c6c32294072dbc1ee5bb0a606-C001-125" ] }, "@SIM@": { "gold_contexts": [ [ "e6fe4c6c32294072dbc1ee5bb0a606-C001-33" ], [ "e6fe4c6c32294072dbc1ee5bb0a606-C001-39" ], [ "e6fe4c6c32294072dbc1ee5bb0a606-C001-81" ], [ "e6fe4c6c32294072dbc1ee5bb0a606-C001-86" ], [ "e6fe4c6c32294072dbc1ee5bb0a606-C001-90" ], [ "e6fe4c6c32294072dbc1ee5bb0a606-C001-114" ] ], "cite_sentences": [ "e6fe4c6c32294072dbc1ee5bb0a606-C001-33", "e6fe4c6c32294072dbc1ee5bb0a606-C001-39", "e6fe4c6c32294072dbc1ee5bb0a606-C001-81", "e6fe4c6c32294072dbc1ee5bb0a606-C001-86", "e6fe4c6c32294072dbc1ee5bb0a606-C001-90", "e6fe4c6c32294072dbc1ee5bb0a606-C001-114" ] }, "@UNSURE@": { "gold_contexts": [ [ "e6fe4c6c32294072dbc1ee5bb0a606-C001-87" ], [ "e6fe4c6c32294072dbc1ee5bb0a606-C001-92" ] ], "cite_sentences": [ "e6fe4c6c32294072dbc1ee5bb0a606-C001-87", "e6fe4c6c32294072dbc1ee5bb0a606-C001-92" ] } } }, "ABC_0da20f60adaff9637ebdbe2a27f2a4_2": { "x": [ { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-2", "text": "Variational Autoencoder (VAE) is a powerful method for learning representations of highdimensional data." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-3", "text": "However, VAEs can suffer from an issue known as latent variable collapse (or KL loss vanishing), where the posterior collapses to the prior and the model will ignore the latent codes in generative tasks." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-4", "text": "Such an issue is particularly prevalent when employing VAE-RNN architectures for text modelling (Bowman et al., 2016) ." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-5", "text": "In this paper, we present a simple architecture called holistic regularisation VAE (HR-VAE), which can effectively avoid latent variable collapse." 
}, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-6", "text": "Compared to existing VAE-RNN architectures, we show that our model can achieve a much more stable training process and can generate text of significantly better quality." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-7", "text": "----------------------------------" }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-9", "text": "Variational Autoencoder (VAE) (Kingma and Welling, 2013) is a powerful method for learning representations of high-dimensional data." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-10", "text": "However, recent attempts at applying VAEs to text modelling are still far less successful than their application to images and speech (Bachman, 2016; Fraccaro et al., 2016; Semeniuta et al., 2017)." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-11", "text": "When applying VAEs to text modelling, recurrent neural networks (RNNs) 1 are commonly used as the architecture for both encoder and decoder (Bowman et al., 2016; Xu and Durrett, 2018; Dieng et al., 2019)." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-12", "text": "While such a VAE-RNN based architecture allows encoding and generating variable-length sentences (in the decoding phase) effectively, it is also vulnerable to an issue known as latent variable collapse (or KL loss vanishing), where the posterior collapses to the prior and the model ignores the latent codes in generative tasks." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-13", "text": "Various efforts have been made to alleviate the latent variable collapse issue." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-14", "text": "Bowman et al. (2016) use KL annealing, where a variable weight is added to the KL term in the cost function at training time." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-15", "text": "Yang et al. 
(2017) discovered that there is a trade-off between the contextual capacity of the decoder and the effective use of encoding information, and developed a dilated CNN decoder which can vary the amount of conditioning context." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-16", "text": "They also introduced a loss clipping strategy in order to make the model more robust." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-17", "text": "Xu and Durrett (2018) addressed the problem by replacing the standard normal distribution for the prior with the von Mises-Fisher (vMF) distribution." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-18", "text": "With vMF, the KL loss only depends on the concentration parameter, which is fixed during training and testing, and hence results in a constant KL loss." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-19", "text": "In a more recent work, Dieng et al. (2019) avoided latent variable collapse by including skip connections in the generative model, where the skip connections enforce strong links between the latent variables and the likelihood function." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-20", "text": "Although the aforementioned works show effectiveness in addressing the latent variable collapse issue to some extent, they either require careful engineering to balance the weight between the reconstruction loss and the KL loss (Bowman et al., 2016), or resort to designing more sophisticated model structures (Yang et al., 2017; Xu and Durrett, 2018; Dieng et al., 2019)." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-21", "text": "In this paper, we present a simple architecture called holistic regularisation VAE (HR-VAE), which can effectively avoid latent variable collapse." 
}, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-22", "text": "In contrast to existing VAE-RNN models for text modelling which merely impose a standard normal distribution prior on the last hidden state of the RNN encoder, our HR-VAE model imposes regularisation for all hidden states of the RNN encoder." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-23", "text": "Another advantage of our model is that it is generic and can be applied to any existing VAE-RNN-based architectures." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-24", "text": "We evaluate our model against several strong baselines which apply VAE for text modelling (Bowman et al., 2016; Yang et al., 2017; Xu and Durrett, 2018) ." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-25", "text": "We conducted experiments based on two public benchmark datasets, namely, the Penn Treebank dataset (Marcus and Marcinkiewicz, 1993) and the end-to-end (E2E) text generation dataset (Novikova et al., 2017) ." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-26", "text": "Experimental results show that our HR-VAE model not only can effectively mitigate the latent variable collapse issue with a stable training process, but also can give better predictive performance than the baselines, as evidenced by both quantitative (e.g., negative log likelihood and perplexity) and qualitative evaluation." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-27", "text": "The code for our model is available online 2 ." 
}, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-28", "text": "----------------------------------" }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-29", "text": "**METHODOLOGY**" }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-30", "text": "----------------------------------" }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-31", "text": "**BACKGROUND OF VAE**" }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-32", "text": "A variational autoencoder (VAE) is a deep generative model, which combines variational inference with deep learning." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-33", "text": "The VAE modifies the conventional autoencoder architecture by replacing the deterministic latent representation z of an input x with a posterior distribution P(z|x), and imposing a prior distribution on the posterior, such that the model allows sampling from any point of the latent space while still being able to generate novel and plausible output." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-34", "text": "The prior is typically chosen to be a standard normal distribution, i.e., P(z) = N(0, 1), such that the KL divergence between posterior and prior can be computed in closed form (Kingma and Welling, 2013)." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-35", "text": "To train a VAE, we need to optimise the marginal likelihood P_\u03b8(x) = \u222b P(z) P_\u03b8(x|z) dz, where the log likelihood can take the following form (footnote 2: https://github.com/ruizheliUOA/HR-VAE):" }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-36", "text": "Here Q_\u03c6(z|x) is the variational approximation for the true posterior P_\u03b8(z|x)." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-37", "text": "Specifically, Q_\u03c6(z|x) can be regarded as an encoder (a.k.a." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-38", "text": "the recognition model) and P_\u03b8(x|z) the decoder (a.k.a. the generative model)." 
}, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-39", "text": "Both encoder and decoder are implemented via neural networks." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-40", "text": "As proved in (Kingma and Welling, 2013), optimising the marginal log likelihood is essentially equivalent to maximising L(\u03b8, \u03c6; x), i.e., the evidence lower bound (ELBO), which consists of two terms." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-41", "text": "The first term is the expected reconstruction error indicating how well the model can reconstruct data given a latent variable." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-42", "text": "The second term is the KL divergence of the approximate posterior from the prior, i.e., a regularisation pushing the learned posterior to be as close to the prior as possible." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-43", "text": "----------------------------------" }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-44", "text": "**VARIATIONAL AUTOENCODER WITH HOLISTIC REGULARISATION**" }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-45", "text": "In this section, we discuss the technical details of the proposed holistic regularisation VAE (HR-VAE) model, a general architecture which can effectively mitigate the KL vanishing phenomenon." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-46", "text": "Our model design is motivated by one noticeable defect shared by the VAE-RNN based models in previous works (Bowman et al., 2016; Yang et al., 2017; Xu and Durrett, 2018; Dieng et al., 2019)." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-47", "text": "That is, all these models, as shown in Figure 1a, only impose a standard normal distribution prior on the last hidden state of the RNN encoder, which potentially leads to learning a suboptimal representation of the latent variable and results in a model vulnerable to KL loss vanishing."
}, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-48", "text": "Our hypothesis is that to learn a good representation of the data and a good generative model, it is crucial to impose the standard normal prior on all the hidden states of the RNN-based encoder (see Figure 1b), which allows a better regularisation of the model learning process." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-49", "text": "We implement the HR-VAE model using a two-layer LSTM for both the encoder and decoder." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-50", "text": "However, one should note that our architecture can be readily applied to other types of RNN such as GRU." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-51", "text": "For each time stamp t (see Figure 1b), we concatenate the hidden state h_t and the cell state c_t of the encoder." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-52", "text": "The concatenation (i.e., [h_t; c_t]) is then fed into two linear transformation layers for estimating \u00b5_t and \u03c3_t^2, which are the parameters of a normal distribution corresponding to the concatenation of h_t and c_t." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-53", "text": "Letting Q_\u03c6_t(z_t|x) = N(z_t|\u00b5_t, \u03c3_t^2), we wish Q_\u03c6_t(z_t|x) to be close to a prior P(z_t), which is a standard Gaussian." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-54", "text": "Finally, the KL divergence between these two multivariate Gaussian distributions (i.e., Q_\u03c6_t and P(z_t)) will contribute to the overall KL loss of the ELBO." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-55", "text": "By taking the average of the KL loss at each time stamp t, the resulting ELBO takes the following form" }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-56", "text": "KL(Q_\u03c6_t(z_t|x) \u2016 P(z_t))." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-57", "text": "(3) As can be seen in Eq. 
3, our solution to the KL collapse issue does not require any engineering for balancing the weight between the reconstruction term and the KL loss, as is commonly the case in existing works (Bowman et al., 2016)." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-58", "text": "The weight between these two terms of our model is simply 1 : 1." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-59", "text": "----------------------------------" }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-60", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-61", "text": "----------------------------------" }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-62", "text": "**DATASETS**" }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-63", "text": "We evaluate our model on two public datasets, namely, Penn Treebank (PTB) (Marcus and Marcinkiewicz, 1993) and the end-to-end (E2E) text generation corpus (Novikova et al., 2017), which have been used in a number of previous works for text generation (Bowman et al., 2016; Xu and Durrett, 2018; Wiseman et al., 2018; Su et al., 2018)." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-64", "text": "PTB consists of more than 40,000 sentences from Wall Street Journal articles whereas the E2E dataset contains over 50,000 sentences of restaurant reviews." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-65", "text": "The statistics of these two datasets are summarised in Table 1 ." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-66", "text": "----------------------------------" }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-67", "text": "**IMPLEMENTATION DETAILS**" }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-68", "text": "For the PTB dataset, we used the train-test split following (Bowman et al., 2016; Xu and Durrett, 2018)." 
}, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-69", "text": "For the E2E dataset, we used the train-test split from the original dataset (Novikova et al., 2017) and indexed the words with a frequency higher than 3." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-70", "text": "We represent input data with 512-dimensional word2vec embeddings (Mikolov et al., 2013)." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-71", "text": "We set the dimension of the hidden layers of both encoder and decoder to 256." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-72", "text": "The Adam optimiser (Kingma and Ba, 2014) was used for training with an initial learning rate of 0.0001." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-73", "text": "Each utterance in a mini-batch was padded to the maximum length for that batch, and the maximum batch-size allowed was 128." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-74", "text": "----------------------------------" }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-75", "text": "**BASELINES**" }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-76", "text": "We compare our HR-VAE model with three strong baselines using VAE for text modelling: VAE-LSTM-base 3 : A variational autoencoder model which uses LSTM for both encoder and decoder." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-77", "text": "KL annealing is used to tackle the latent variable collapse issue (Bowman et al., 2016) ; VAE-CNN 4 : A variational autoencoder model with an LSTM encoder and a dilated CNN decoder (Yang et al., 2017) ; vMF-VAE 5 : A variational autoencoder model using LSTM for both encoder and decoder, where the prior distribution is the von Mises-Fisher (vMF) distribution rather than a Gaussian distribution (Xu and Durrett, 2018)." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-78", "text": "the decoder needs to predict the entire sequence with only the help of the given latent variable z." 
}, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-79", "text": "In this way, a high-quality representation abstracting the information of the input sentence is much needed by the decoder, which forces z to learn the required information." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-80", "text": "Overall performance." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-81", "text": "Table 2 shows the language modelling results of our approach and the baselines." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-82", "text": "We report negative log likelihood (NLL), KL loss, and perplexity (PPL) on the test set." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-83", "text": "As expected, all the models have a higher KL loss in the inputless setting than in the standard setting, as z is required to encode more information about the input data for reconstruction." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-84", "text": "In terms of overall performance, our model outperforms all the baselines on both datasets (i.e., PTB and E2E)." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-85", "text": "For instance, when comparing with the strongest baseline vMF-VAE in the standard setting, our model reduces NLL from 96 to 79 and PPL from 98 to 43 on PTB, respectively." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-86", "text": "In the inputless setting, our performance gain is even higher, i.e., NLL reduced from 117 to 85 and PPL from 262 to 54." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-87", "text": "A similar pattern can be observed for the E2E dataset." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-88", "text": "These observations suggest that our approach can learn a better generative model for data." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-89", "text": "Loss analysis." 
}, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-90", "text": "To conduct a more thorough evaluation, we further investigate model behaviours in terms of both reconstruction loss and KL loss, as shown in Figure 2 ." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-91", "text": "These plots were obtained based on the E2E training set using the inputless setting." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-92", "text": "We can see that the KL loss of VAE-LSTM-base, which uses Sigmoid annealing (Bowman et al., 2016), collapses to zero, leading to poor generative performance, as indicated by the high reconstruction loss." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-93", "text": "The KL losses for both VAE-CNN and vMF-VAE are nonzero, where the former mitigates the KL collapse issue with a KL loss clipping strategy and the latter by replacing the standard normal distribution for the prior with the vMF distribution (i.e., with the vMF distribution, the KL loss only depends on a fixed concentration parameter, and hence results in a constant KL loss)." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-94", "text": "Although both VAE-CNN and vMF-VAE outperform VAE-LSTM-base by a large margin in terms of reconstruction loss as shown in Figure 2 , one should also notice that these two models actually overfit the training data, as their performance on the test set is much worse (cf." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-95", "text": "Table 2 )." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-96", "text": "In contrast to the baselines, which mitigate the KL collapse issue by carefully engineering the weight between the reconstruction loss and KL loss or by choosing a different prior, we provide a simple and elegant solution through holistic KL regularisation, which can effectively mitigate the KL collapse issue and achieve a better reconstruction error in both training and testing." 
}, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-97", "text": "Sentence reconstruction." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-98", "text": "Lastly, we show some sentence examples reconstructed by vMF-VAE (i.e., the best baseline) and our model in the inputless setting using sentences from the E2E test set as input." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-99", "text": "As shown in Table 3 , the sentences generated by vMF-VAE contain repeated words in quite a few cases, such as 'city city area' and 'blue spice spice'." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-100", "text": "In addition, vMF-VAE also tends to generate unnecessary or unrelated words at the end of sentences, making the generated sentences ungrammatical." }, { "sent_id": "0da20f60adaff9637ebdbe2a27f2a4-C001-101", "text": "The sentences reconstructed by our model, in contrast, are more grammatical and more similar to the corresponding ground truth sentences than vMF-VAE." } ], "y": { "@MOT@": { "gold_contexts": [ [ "0da20f60adaff9637ebdbe2a27f2a4-C001-4", "0da20f60adaff9637ebdbe2a27f2a4-C001-5" ], [ "0da20f60adaff9637ebdbe2a27f2a4-C001-20" ], [ "0da20f60adaff9637ebdbe2a27f2a4-C001-46" ] ], "cite_sentences": [ "0da20f60adaff9637ebdbe2a27f2a4-C001-4", "0da20f60adaff9637ebdbe2a27f2a4-C001-20", "0da20f60adaff9637ebdbe2a27f2a4-C001-46" ] }, "@BACK@": { "gold_contexts": [ [ "0da20f60adaff9637ebdbe2a27f2a4-C001-11" ], [ "0da20f60adaff9637ebdbe2a27f2a4-C001-14" ], [ "0da20f60adaff9637ebdbe2a27f2a4-C001-63" ] ], "cite_sentences": [ "0da20f60adaff9637ebdbe2a27f2a4-C001-11", "0da20f60adaff9637ebdbe2a27f2a4-C001-14", "0da20f60adaff9637ebdbe2a27f2a4-C001-63" ] }, "@DIF@": { "gold_contexts": [ [ "0da20f60adaff9637ebdbe2a27f2a4-C001-24", "0da20f60adaff9637ebdbe2a27f2a4-C001-25", "0da20f60adaff9637ebdbe2a27f2a4-C001-26" ], [ "0da20f60adaff9637ebdbe2a27f2a4-C001-57" ] ], "cite_sentences": [ "0da20f60adaff9637ebdbe2a27f2a4-C001-24", "0da20f60adaff9637ebdbe2a27f2a4-C001-57" ] }, 
"@USE@": { "gold_contexts": [ [ "0da20f60adaff9637ebdbe2a27f2a4-C001-63" ], [ "0da20f60adaff9637ebdbe2a27f2a4-C001-68" ], [ "0da20f60adaff9637ebdbe2a27f2a4-C001-77" ], [ "0da20f60adaff9637ebdbe2a27f2a4-C001-92" ] ], "cite_sentences": [ "0da20f60adaff9637ebdbe2a27f2a4-C001-63", "0da20f60adaff9637ebdbe2a27f2a4-C001-68", "0da20f60adaff9637ebdbe2a27f2a4-C001-77", "0da20f60adaff9637ebdbe2a27f2a4-C001-92" ] } } }, "ABC_be26538a785f9ec9edc1ea031194cf_2": { "x": [ { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-2", "text": "In multilingual societies like the Indian subcontinent, use of code-switched languages is very popular and convenient for the users." }, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-3", "text": "In this paper, we study offense and abuse detection in the code-switched pair of Hindi and English (i.e. Hinglish), the most widely spoken such pair." }, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-4", "text": "The task is made difficult by the non-fixed grammar, vocabulary, semantics and spellings of the Hinglish language." }, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-5", "text": "We apply transfer learning and build an LSTM-based model for hate speech classification." }, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-6", "text": "This model surpasses the performance of the current best models, establishing itself as the state of the art in the unexplored domain of Hinglish offensive text classification." }, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-30", "text": "Our methodology primarily consists of these steps: preprocessing of the dataset, training of word embeddings, training of the classifier model, and then applying it to the HEOT dataset." 
}, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-31", "text": "----------------------------------" }, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-32", "text": "**PRE-PROCESSING**" }, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-33", "text": "In this work, we use the datasets released by (Davidson et al. 2017) and the HEOT dataset provided by (Mathur et al. 2018)." }, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-34", "text": "The datasets obtained pass through these steps of processing: (i) removal of punctuation, stopwords, URLs, numbers, emoticons, etc." }, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-35", "text": "This was then followed by transliteration using the Xlit-Crowd conversion dictionary 3 and translation of each word to English using a Hindi-to-English dictionary 4 ." }, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-36", "text": "To deal with the spelling variations, we manually added some common variations of popular Hinglish words." }, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-37", "text": "The final dictionary comprised 7,200 word pairs." }, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-38", "text": "Additionally, to deal with profane words which are not present in Xlit-Crowd, we had to make a profanity dictionary (with 209 profane words) as well." }, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-39", "text": "Table 1 gives some examples from the dictionary." }, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-40", "text": "----------------------------------" }, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-41", "text": "**TRAINING WORD EMBEDDINGS**" }, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-42", "text": "We tried the Glove (Pennington, Socher, and Manning 2014) and Twitter word2vec (Godin et al. 2015) code for training embeddings for the processed tweets." 
}, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-43", "text": "The embeddings were trained on both the dataset provided by (Davidson et al. 2017) and HEOT." }, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-44", "text": "These embeddings help to learn distributed representations of tweets." }, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-45", "text": "After experimentation, we kept the size of the embeddings fixed at 100." }, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-46", "text": "----------------------------------" }, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-47", "text": "**CLASSIFIER MODEL**" }, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-48", "text": "Both the HEOT and (Davidson et al. 2017) datasets contain tweets which are annotated in three categories: offensive, abusive and none (or benign)." }, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-49", "text": "Some examples from the dataset are shown in Table 2 ." }, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-50", "text": "We use an LSTM-based classifier model to classify these tweets into the three categories." }, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-51", "text": "An overview of the model is given in Figure 1 ." }, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-52", "text": "The model consists of one layer of LSTM followed by three dense layers." }, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-53", "text": "The LSTM layer uses a dropout value of 0.2." }, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-54", "text": "Categorical cross-entropy loss was used for the last layer due to the presence of multiple classes." }, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-55", "text": "We use the Adam optimizer along with L2 regularisation to prevent overfitting." 
}, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-56", "text": "As indicated by Figure 1 , the model was initially trained on the dataset provided by (Davidson et al. 2017), and then re-trained on the HEOT dataset so as to benefit from the transfer of learned features in the last stage." }, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-57", "text": "The model hyperparameters were experimentally selected by trying out a large number of combinations through grid search." }, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-58", "text": "Results: Table 3 shows the performance of our model (after being trained on (Davidson et al. 2017)) with two types of embeddings, in comparison to the models by (Mathur et al. 2018) and (Davidson et al. 2017) on the HEOT dataset, averaged over three runs." }, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-59", "text": "We also compare results on pre-trained embeddings." }, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-60", "text": "As shown in the table, our model, when given Glove embeddings, performs better than all other models." }, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-61", "text": "For comparison purposes, in Table 4 we have also evaluated our results on the dataset by (Davidson et al. 2017)." }, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-62", "text": "----------------------------------" }, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-63", "text": "**CONCLUSION**" }, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-64", "text": "In this paper, we presented a pipeline which, given Hinglish text, can classify it into three categories: offensive, abusive and benign." }, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-65", "text": "This LSTM-based model performs better than the other existing systems." }, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-66", "text": "We also release the code, the dictionary made and the embeddings trained in the process." 
}, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-67", "text": "We believe this model would be useful in hate speech detection tasks for code-switched languages." }, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-7", "text": "We also release our model and the embeddings trained for research purposes." }, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-8", "text": "----------------------------------" }, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-10", "text": "With the penetration of the internet among the masses, the amount of content being posted on social media channels has risen." }, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-11", "text": "Specifically, in the Indian subcontinent, the number of Internet users has crossed 500 million 1 , and is rising rapidly due to inexpensive data 2 ." }, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-12", "text": "With this rise comes the problem of hate speech and offensive and abusive posts on social media." }, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-13", "text": "Although there are many previous works which deal with Hindi and English hate speech (the top two languages in India), there are very few on the code-switched version (Hinglish) of the two (Mathur et al. 2018)." }, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-14", "text": "This is partially due to the following reasons: (i) Hinglish has no fixed grammar and vocabulary." }, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-15", "text": "It derives a part of its semantics from Devnagari and another part from the Roman script." }, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-16", "text": "(ii) Hinglish speech and written text consist of a concoction of words spoken in Hindi as well as English, but written in the Roman script." 
}, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-17", "text": "This makes the spellings variable and dependent on the writer of the text." }, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-18", "text": "Hence, code-switched languages present tough challenges in terms of parsing and getting the meaning out of the text." }, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-19", "text": "For instance, the sentence, \"Modiji foreign yatra par hai\", is in the Hinglish language." }, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-20", "text": "A somewhat correct translation of this would be, \"Mr. Modi is on a foreign tour\"." }, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-21", "text": "However, even this translation has some flaws, as no direct translation is available for the word ji, which is used to show respect." }, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-22", "text": "Verbatim translation would lead to \"Mr. Modi foreign tour on is\". [Table 1 (Examples of word-pairs in the Hinglish-English dictionary): acha/good, gunda/thug, s**la/blo*dy, ra*di/h*oker]" }, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-23", "text": "Moreover, the word yatra here can have phonetic variations, which would result in multiple spellings of the word as yatra, yaatra, yaatraa, etc." }, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-24", "text": "Also, the problem of hate speech has been rising in India, and according to the policies of the government and the various social networks, one is not allowed to misuse one's right to speech to abuse some other community or religion." }, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-25", "text": "Due to the various difficulties associated with the Hinglish language, it is challenging to automatically detect and ban such kind of speech." 
}, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-26", "text": "Thus, with this in mind, we build a transfer learning based model for the code-switched language Hinglish, which outperforms the baseline model of (Mathur et al. 2018) ." }, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-27", "text": "We also release the embeddings and the model trained." }, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-28", "text": "----------------------------------" }, { "sent_id": "be26538a785f9ec9edc1ea031194cf-C001-29", "text": "**METHODOLOGY**" } ], "y": { "@MOT@": { "gold_contexts": [ [ "be26538a785f9ec9edc1ea031194cf-C001-13" ] ], "cite_sentences": [ "be26538a785f9ec9edc1ea031194cf-C001-13" ] }, "@DIF@": { "gold_contexts": [ [ "be26538a785f9ec9edc1ea031194cf-C001-26" ], [ "be26538a785f9ec9edc1ea031194cf-C001-58", "be26538a785f9ec9edc1ea031194cf-C001-59", "be26538a785f9ec9edc1ea031194cf-C001-60" ] ], "cite_sentences": [ "be26538a785f9ec9edc1ea031194cf-C001-26", "be26538a785f9ec9edc1ea031194cf-C001-58" ] }, "@USE@": { "gold_contexts": [ [ "be26538a785f9ec9edc1ea031194cf-C001-33" ] ], "cite_sentences": [ "be26538a785f9ec9edc1ea031194cf-C001-33" ] } } }, "ABC_9d12c4d6aea96ff3e1b93faf1eb961_2": { "x": [ { "sent_id": "9d12c4d6aea96ff3e1b93faf1eb961-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "9d12c4d6aea96ff3e1b93faf1eb961-C001-2", "text": "Given the current growth in research and related emerging technologies in machine learning and deep learning, it is timely to introduce this tutorial to a large number of researchers and practitioners who are attending COLING 2018 and working on statistical models, deep neural networks, sequential learning and natural language understanding." }, { "sent_id": "9d12c4d6aea96ff3e1b93faf1eb961-C001-3", "text": "To the best of our knowledge, there is no similar tutorial presented in previous ACL/COLING/EMNLP/NAACL." 
}, { "sent_id": "9d12c4d6aea96ff3e1b93faf1eb961-C001-4", "text": "This three-hour tutorial will concentrate on a wide range of theories and applications and systematically present the recent advances in deep Bayesian and sequential learning which are impacting the communities of computational linguistics, human language technology and machine learning for natural language processing." }, { "sent_id": "9d12c4d6aea96ff3e1b93faf1eb961-C001-5", "text": "----------------------------------" }, { "sent_id": "9d12c4d6aea96ff3e1b93faf1eb961-C001-6", "text": "**TUTORIAL DESCRIPTION**" }, { "sent_id": "9d12c4d6aea96ff3e1b93faf1eb961-C001-7", "text": "This tutorial introduces the advances in deep Bayesian learning with abundant applications for natural language understanding ranging from speech recognition (Saon and Chien, 2012; Chan et al., 2016) to document summarization (Chang and Chien, 2009 ), text classification (Blei et al., 2003; Zhang et al., 2015) , text segmentation (Chien and Chueh, 2012) , information extraction (Narasimhan et al., 2016) , image caption generation (Vinyals et al., 2015; Xu et al., 2015) , sentence generation (Li et al., 2016b) , dialogue control (Zhao and Eskenazi, 2016; Li et al., 2016a) , sentiment classification, recommendation system, question answering (Sukhbaatar et al., 2015) and machine translation , to name a few." }, { "sent_id": "9d12c4d6aea96ff3e1b93faf1eb961-C001-8", "text": "Traditionally, \"deep learning\" is taken to be a learning process where the inference or optimization is based on the real-valued deterministic model." }, { "sent_id": "9d12c4d6aea96ff3e1b93faf1eb961-C001-9", "text": "The \"semantic structure\" in words, sentences, entities, actions and documents drawn from a large vocabulary may not be well expressed or correctly optimized in mathematical logic or computer programs." 
}, { "sent_id": "9d12c4d6aea96ff3e1b93faf1eb961-C001-10", "text": "The \"distribution function\" in discrete or continuous latent variable model for natural language may not be properly decomposed or estimated in model inference." }, { "sent_id": "9d12c4d6aea96ff3e1b93faf1eb961-C001-11", "text": "This tutorial addresses the fundamentals of statistical models and neural networks, and focus on a series of advanced Bayesian models and deep models including hierarchical Dirichlet process , Chinese restaurant process (Blei et al., 2010) , hierarchical Pitman-Yor process (Teh, 2006) , Indian buffet process (Ghahramani and Griffiths, 2005) , recurrent neural network (Mikolov et al., 2010; Van Den Oord et al., 2016) , long short-term memory (Hochreiter and Schmidhuber, 1997; , sequence-to-sequence model (Sutskever et al., 2014), variational auto-encoder (Kingma and Welling, 2014) , generative adversarial network (Goodfellow et al., 2014) , attention mechanism (Chorowski et al., 2015; Seo et al., 2016) , memory-augmented neural network (Graves et al., 2014; Graves et al., 2014) , stochastic neural network Miao et al., 2016) , predictive state neural network (Downey et al., 2017) , policy gradient (Yu et al., 2017) and reinforcement learning (Mnih et al., 2015) ." }, { "sent_id": "9d12c4d6aea96ff3e1b93faf1eb961-C001-12", "text": "We present how these models are connected and why they work for a variety of applications on symbolic and complex patterns in natural language." }, { "sent_id": "9d12c4d6aea96ff3e1b93faf1eb961-C001-13", "text": "The variational inference and sampling method are formulated to tackle the optimization for complicated models (Rezende et al., 2014) ." }, { "sent_id": "9d12c4d6aea96ff3e1b93faf1eb961-C001-14", "text": "The word and sentence embeddings, clustering and co-clustering are merged with linguistic and semantic constraints." 
}, { "sent_id": "9d12c4d6aea96ff3e1b93faf1eb961-C001-15", "text": "A series of case studies are presented to tackle different issues in deep Bayesian learning and understanding." }, { "sent_id": "9d12c4d6aea96ff3e1b93faf1eb961-C001-16", "text": "At last, we point out a number of directions and outlooks for future studies." }, { "sent_id": "9d12c4d6aea96ff3e1b93faf1eb961-C001-17", "text": "The presentation of this tutorial is arranged into five parts." }, { "sent_id": "9d12c4d6aea96ff3e1b93faf1eb961-C001-18", "text": "First of all, we share the current status of researches on natural language understanding, statistical modeling and deep neural network and explain the key issues in deep Bayesian learning for discrete-valued observation data and latent semantics." }, { "sent_id": "9d12c4d6aea96ff3e1b93faf1eb961-C001-19", "text": "A new paradigm called the symbolic neural learning is introduced to extend how data analysis is performed from language processing to semantic learning and memory networking." }, { "sent_id": "9d12c4d6aea96ff3e1b93faf1eb961-C001-20", "text": "Secondly, we address a number of Bayesian models ranging from latent variable model to VB inference (Chien and Chang, 2014; Chien and Chueh, 2011; Chien, 2015b) , MCMC sampling (Watanabe and Chien, 2015) and BNP learning (Chien, 2016; Chien, 2015a; Chien, 2018) for hierarchical, thematic and sparse topics from natural language." }, { "sent_id": "9d12c4d6aea96ff3e1b93faf1eb961-C001-21", "text": "In the third part, a series of deep models including deep unfolding (Chien and Lee, 2018) , Bayesian RNN (Gal and Ghahramani, 2016; Chien and Ku, 2016) , sequence-to-sequence learning (Graves et al., 2006; Gehring et al., 2017) , CNN (Kalchbrenner et al., 2014; Xingjian et al., 2015; , GAN (Tsai and Chien, 2017) and VAE are introduced." }, { "sent_id": "9d12c4d6aea96ff3e1b93faf1eb961-C001-22", "text": "The coffee break is arranged within this part." 
}, { "sent_id": "9d12c4d6aea96ff3e1b93faf1eb961-C001-23", "text": "Next, the fourth part focuses on a variety of advanced studies which illustrate how deep Bayesian learning is developed to infer the sophisticated recurrent models for natural language understanding." }, { "sent_id": "9d12c4d6aea96ff3e1b93faf1eb961-C001-24", "text": "In particular, the memory network Chien and Lin, 2018) , neural variational learning (Serban et al., 2017; Chung et al., 2015) , neural discrete representation (Jang et al., 2016; Maddison et al., 2016; van den Oord et al., 2017) , recurrent ladder network (Rasmus et al., 2015; Pr\u00e9mont-Schwarz et al., 2017; S\u00f8nderby et al., 2016) , stochastic neural network (Fraccaro et al., 2016; Goyal et al., 2017; Shabanian et al., 2017) , Markov recurrent neural network (Venkatraman et al., 2017; Kuo and Chien, 2018) , sequence GAN (Yu et al., 2017) and reinforcement learning (Tegho et al., 2017) are introduced in various deep models which open a window to more practical tasks, e.g. reading comprehension, sentence generation, dialogue system, question answering and machine translation." }, { "sent_id": "9d12c4d6aea96ff3e1b93faf1eb961-C001-25", "text": "In the final part, we spotlight on some future directions for deep language understanding which can handle the challenges of big data, heterogeneous condition and dynamic system." }, { "sent_id": "9d12c4d6aea96ff3e1b93faf1eb961-C001-26", "text": "In particular, deep learning, structural learning, temporal modeling, long history representation and stochastic learning are emphasized." }, { "sent_id": "9d12c4d6aea96ff3e1b93faf1eb961-C001-27", "text": "Slides of this tutorial are available at http://chien.cm.nctu.edu.tw/home/coling/. 
APSIPA 2013 , ISCSLP 2014 , Interspeech 2013 , 2016 and ICASSP 2012" }, { "sent_id": "9d12c4d6aea96ff3e1b93faf1eb961-C001-28", "text": "----------------------------------" }, { "sent_id": "9d12c4d6aea96ff3e1b93faf1eb961-C001-29", "text": "**INSTRUCTOR**" } ], "y": { "@BACK@": { "gold_contexts": [ [ "9d12c4d6aea96ff3e1b93faf1eb961-C001-7" ] ], "cite_sentences": [ "9d12c4d6aea96ff3e1b93faf1eb961-C001-7" ] } } }, "ABC_12c5d72fad925c8ec025cda87a0fd9_2": { "x": [ { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-2", "text": "Verb-noun combinations (VNCs) -e.g., blow the whistle, hit the roof, and see stars -are a common type of English idiom that are ambiguous with literal usages." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-3", "text": "In this paper we propose and evaluate models for classifying VNC usages as idiomatic or literal, based on a variety of approaches to forming distributed representations." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-4", "text": "Our results show that a model based on averaging word embeddings performs on par with, or better than, a previously-proposed approach based on skip-thoughts." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-5", "text": "Idiomatic usages of VNCs are known to exhibit lexico-syntactic fixedness." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-6", "text": "We further incorporate this information into our models, demonstrating that this rich linguistic knowledge is complementary to the information carried by distributed representations." 
}, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-7", "text": "----------------------------------" }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-9", "text": "Multiword expressions (MWEs) are combinations of multiple words that exhibit some degree of idiomaticity (Baldwin and Kim, 2010) ." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-10", "text": "Verb-noun combinations (VNCs), consisting of a verb with a noun in its direct object position, are a common type of semantically-idiomatic MWE in English and cross-lingually (Fazly et al., 2009 )." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-11", "text": "Many VNCs are ambiguous between MWEs and literal combinations, as in the following examples of see stars, in which 1 is an idiomatic usage (i.e., an MWE), while 2 is a literal combination." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-12", "text": "1 1. Hereford United were seeing stars at Gillingham after letting in 2 early goals 2. Look into the night sky to see the stars MWE identification is the task of automatically determining which word combinations at the token-level form MWEs (Baldwin and Kim, 2010) , and must be able to make such distinctions." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-13", "text": "This is particularly important for applications such as machine translation (Sag et al., 2002) , where the appropriate meaning of word combinations in context must be preserved for accurate translation." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-14", "text": "In this paper, following prior work (e.g., Salton et al., 2016 ), we frame token-level identification of VNCs as a supervised binary classification problem, i.e., idiomatic vs. literal." 
}, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-15", "text": "We consider a range of approaches to forming distributed representations of the context in which a VNC occurs, including word embeddings (Mikolov et al., 2013) , word embeddings tailored to representing sentences (Kenter et al., 2016) , and skip-thoughts sentence embeddings (Kiros et al., 2015) ." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-16", "text": "We then train a support vector machine (SVM) on these representations to classify unseen VNC instances." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-17", "text": "Surprisingly, we find that an approach based on representing sentences as the average of their word embeddings performs comparably to, or better than, the skip-thoughts based approach previously proposed by Salton et al. (2016) ." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-18", "text": "VNCs exhibit lexico-syntactic fixedness." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-19", "text": "For example, the idiomatic interpretation in example 1 above is typically only accessible when the verb see has active voice, the determiner is null, and the noun star is in plural form, as in see stars or seeing stars." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-20", "text": "Usages with a determiner (as in example 2), a singular noun (e.g., see a star), or passive voice (e.g., stars were seen) typically only have the literal interpretation." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-21", "text": "In this paper we further incorporate knowledge of the lexico-syntactic fixedness of VNCs -automatically acquired from corpora using the method of Fazly et al. (2009) -into our various embedding-based approaches." 
}, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-22", "text": "Our experimental results show that this leads to substantial improve-ments, indicating that this rich linguistic knowledge is complementary to that available in distributed representations." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-23", "text": "----------------------------------" }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-24", "text": "**RELATED WORK**" }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-25", "text": "Much research on MWE identification has focused on specific kinds of MWEs (e.g., Patrick and Fletcher, 2005; Uchiyama et al., 2005) , including English VNCs (e.g., Fazly et al., 2009; Salton et al., 2016) , although some recent work has considered the identification of a broad range of kinds of MWEs (e.g., Schneider et al., 2014; Brooke et al., 2014; Savary et al., 2017) ." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-26", "text": "Work on MWE identification has leveraged rich linguistic knowledge of the constructions under consideration (e.g., Fazly et al., 2009; Fothergill and Baldwin, 2012) , treated literal and idiomatic as two senses of an expression and applied approaches similar to word-sense disambiguation (e.g., Birke and Sarkar, 2006; Hashimoto and Kawahara, 2008) , incorporated topic models (e.g., Li et al., 2010) , and made use of distributed representations of words (Gharbieh et al., 2016) ." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-27", "text": "In the most closely related work to ours, Salton et al. (2016) represent token instances of VNCs by embedding the sentence that they occur in using skip-thoughts (Kiros et al., 2015) -an encoderdecoder model that can be viewed as a sentencelevel counterpart to the word2vec (Mikolov et al., 2013 ) skip-gram model." 
}, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-28", "text": "During training the target sentence is encoded using a recurrent neural network, and is used to predict the previous and next sentences." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-29", "text": "Salton et al. then use these sentence embeddings, representing VNC token instances, as features in a supervised classifier." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-30", "text": "We treat this skip-thoughts based approach as a strong baseline to compare against." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-31", "text": "Fazly et al. (2009) formed a set of eleven lexicosyntactic patterns for VNC instances capturing the voice of the verb (active or passive), determiner (e.g., a, the), and number of the noun (singular or plural)." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-32", "text": "They then determine the canonical form, C(v, n), for a given VNC as follows: 2" }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-33", "text": "where P is the set of patterns, T z is a predetermined threshold, which is set to 1, and z(v, n, pt k ) is calculated as follows:" }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-34", "text": "where f (\u00b7) is the frequency of a VNC occurring in a given pattern in a corpus, 3 and f and s are the mean and standard deviations for all patterns for the given VNC, respectively." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-74", "text": "We borrowed ideas from both of these approaches in structuring our experiments." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-35", "text": "Fazly et al. (2009) showed that idiomatic usages of a VNC tend to occur in that expression's canonical form, while literal usages do not." 
}, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-36", "text": "This approach provides a strong, linguistically-informed, unsupervised baseline, referred to as CForm, for predicting whether VNC instances are idiomatic or literal." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-37", "text": "In this paper we incorporate knowledge of canonical forms into embedding-based approaches to VNC token classification, and show that this linguistic knowledge can be leveraged to improve such approaches." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-38", "text": "----------------------------------" }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-39", "text": "**MODELS**" }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-40", "text": "We describe the models used to represent VNC token instances below." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-41", "text": "For each model, a linear SVM classifier is trained on these representations." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-42", "text": "----------------------------------" }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-43", "text": "**WORD2VEC**" }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-44", "text": "We trained word2vec's skip-gram model (Mikolov et al., 2013 ) on a snapshot of Wikipedia from September 2015, which consists of approximately 2.6 billion tokens." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-45", "text": "We used a window size of \u00b18 and 300 dimensions." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-46", "text": "We ignore all words that occur less than fifteen times in the training corpus, and did not set a maximum vocabulary size." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-47", "text": "We perform negative sampling and set the number of training epochs to five." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-48", "text": "We used batch processing with approximately 10k words in each batch." 
}, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-49", "text": "To embed a given a sentence containing a VNC token instance, we average the word embeddings for each word in the sentence, including stopwords." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-50", "text": "4 Prior to averaging, we normalize each embedding to have unit length." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-51", "text": "----------------------------------" }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-52", "text": "**SIAMESE CBOW**" }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-53", "text": "The Siamese CBOW model (Kenter et al., 2016) learns word embeddings that are better able to represent a sentence through averaging than conventional word embeddings such as skip-gram or CBOW." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-54", "text": "We use a Siamese CBOW model that was pretrained on a snapshot of Wikipedia from November 2012 using randomly initialized word embeddings." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-55", "text": "5 Similarly to the word2vec model, to embed a given sentence containing a VNC instance, we average the word embeddings for each word in the sentence." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-56", "text": "----------------------------------" }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-57", "text": "**SKIP-THOUGHTS**" }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-58", "text": "We use a publicly-available skip-thoughts model, that was pre-trained on a corpus of books." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-59", "text": "6 We represent a given sentence containing a VNC instance using the skip-thoughts encoder." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-60", "text": "Note that this approach is our re-implementation of the skipthoughts based method of Salton et al. (2016) , and we use it as a strong baseline for comparison." 
}, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-61", "text": "----------------------------------" }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-62", "text": "**DATA AND EVALUATION**" }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-63", "text": "In this section, we discuss the dataset used in our experiments, and the evaluation of our models." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-64", "text": "----------------------------------" }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-65", "text": "**DATASET**" }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-66", "text": "We use the VNC-Tokens dataset (Cook et al., 2008) -the same dataset used by Fazly et al. (2009) and Salton et al. (2016) -to train and evaluate our models." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-67", "text": "This dataset consists of sentences containing VNC usages drawn from the British National Corpus (Burnard, 2000) , 7 along with a label indicating whether the VNC is an idiomatic or literal usage (or whether this cannot be determined, in which case it is labelled \"unknown\")." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-68", "text": "VNC-Tokens is divided into DEV and TEST sets that each include fourteen VNC types and a total of roughly six hundred instances of these types annotated as literal or idiomatic." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-69", "text": "Following Salton et al. (2016) , we use DEV and TEST, and ignore all token instances annotated as \"unknown\"." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-70", "text": "Fazly et al. (2009) and Salton et al. (2016) structured their experiments differently." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-71", "text": "Fazly et al. report results over DEV and TEST separately." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-72", "text": "In this setup TEST consists of expressions that were not seen during model development (done on DEV)." 
}, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-73", "text": "Salton et al., on the other hand, merge DEV and TEST, and create new training and testing sets, such that each expression is present in the training and testing data, and the ratio of idiomatic to literal usages of each expression in the training data is roughly equal to that in the testing data." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-75", "text": "We retain We then divide each of these into training and testing sets, using the same ratios of idiomatic to literal usages for each expression as Salton et al. (2016) ." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-76", "text": "This allows us to develop and tune a model on DEV, and then determine whether, when retrained on instances of unseen VNCs in (the training portion of) TEST, that model is able to generalize to new VNCs without further tuning to the specific expressions in TEST." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-77", "text": "----------------------------------" }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-78", "text": "**EVALUATION**" }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-79", "text": "The proportion of idiomatic usages in the testing portions of both DEV and TEST is 63%." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-80", "text": "We therefore use accuracy to evaluate our models following Fazly et al. (2009) because the classes are roughly balanced." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-81", "text": "We randomly divide both DEV and TEST into training and testing portions ten times, following Salton et al. (2016) ." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-82", "text": "For each of the ten runs, we compute the accuracy for each expression, and then compute the average accuracy over the expressions." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-83", "text": "We then report the average accuracy over the ten runs." 
}, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-84", "text": "----------------------------------" }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-85", "text": "**EXPERIMENTAL RESULTS**" }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-86", "text": "In this section we first consider the effect of tuning the cost parameter of the SVM for each model on DEV, and then report results on DEV and TEST using the tuned models." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-87", "text": "----------------------------------" }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-88", "text": "**PARAMETER TUNING**" }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-89", "text": "We tune the SVM for each model on DEV by carrying out a linear search for the penalty cost from 0.01-100, increasing by a factor of ten each time." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-90", "text": "Results for this parameter tuning are shown in Table 1 ." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-91", "text": "These results highlight the importance of choosing an appropriate setting for the penalty cost." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-92", "text": "For example, the accuracy of the word2vec model ranges from 0.619-0.830 depending on the cost setting." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-93", "text": "In subsequent experiments, for each model, we use the penalty cost that achieves the highest accuracy in Table 1 ." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-94", "text": "----------------------------------" }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-95", "text": "**DEV AND TEST RESULTS**" }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-96", "text": "In Table 2 we report results on DEV and TEST for each model, as well as the unsupervised CForm model of Fazly et al. (2009) , which simply labels a VNC as idiomatic if it occurs in its canonical form, and as literal otherwise." 
}, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-97", "text": "We further consider each model (other than CForm) in two setups." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-98", "text": "\u2212CF corresponds to the models as described in Section 3. +CF further incorporates lexico-syntactic knowledge of canonical forms into each model by concatenating the embedding representing each VNC token instance with a one-dimensional vector which is one if the VNC occurs in its canonical form, and zero otherwise." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-99", "text": "We first consider results for the \u2212CF setup." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-100", "text": "On both DEV and TEST, the accuracy achieved by each supervised model is higher than that of the unsupervised CForm approach, except for Siamese CBOW on TEST." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-101", "text": "The word2vec model achieves the highest accuracy on DEV and TEST of 0.830 and 0.804, respectively." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-102", "text": "The difference between the word2vec model and the next-best model, skip-thoughts, is significant using a bootstrap test (Berg-Kirkpatrick et al., 2012) with 10k repetitions for DEV (p = 0.006), but not for TEST (p = 0.051)." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-103", "text": "Nevertheless, it is remarkable that the relatively simple approach to averaging word embeddings used by word2vec performs as well as, or better than, the much more complex skipthoughts model used by Salton et al. (2016) ." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-104", "text": "8 8 The word2vec and skip-thoughts models were trained on different corpora, which could contribute to the differences in results for these models." 
}, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-105", "text": "We therefore carried out an additional experiment in which we trained word2vec on BookCorpus, the corpus on which skip-thoughts was trained." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-106", "text": "This new word2vec model achieved accuracies of 0.825 and 0.809, on DEV and TEST, respectively, which are also higher accuTurning to the +CF setup, we observe that, for both DEV and TEST, each model achieves higher accuracy than in the \u2212CF setup." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-107", "text": "9 All of these differences are significant using a bootstrap test (p < 0.002 in each case)." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-108", "text": "In addition, each method outperforms the unsupervised CForm approach on both DEV and TEST." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-109", "text": "These findings demonstrate that the linguistically-motivated, lexico-syntactic knowledge encoded by the canonical form feature is complementary to the information from a wide range of types of distributed representations." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-110", "text": "In the +CF setup, the word2vec model again achieves the highest accuracy on both DEV and TEST of 0.854 and 0.852, respectively." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-111", "text": "10 The difference between the word2vec model and the next-best model, again skip-thoughts, is significant for both DEV and TEST using a bootstrap test (p < 0.05 in each case)." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-112", "text": "To better understand the impact of the canonical form feature when combined with the word2vec model, we compute the average precision, recall, and F1 score for each MWE for both the positive (idiomatic) and negative (literal) classes, for each run on TEST." 
}, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-113", "text": "11 For a given run, we then compute the average precision, recall, and F1 score across all MWEs, and then the average over all ten runs." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-114", "text": "We do this using CForm, and the word2vec model with and without the canonical form feature." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-115", "text": "Results are shown in Table 3 ." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-116", "text": "In line with the findings of Fazly et al. (2009) , CForm achieves higher precision and recall on idiomatic usages than literal ones." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-117", "text": "In particular, the relatively low recall for the literal class indicates that many literal usages occur in a canonical form." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-118", "text": "Comparing the word2vec model with and without the canonical form feature, we see that, when this feature is used, there is a relatively larger increase in precision and recall (and F1 score) for the literal class, than for the idiomatic class." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-119", "text": "This indicates that, although the racies than those obtained by the skip-thoughts model." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-120", "text": "9 In order to determine that this improvement is due to the information about canonical forms carried by the additional feature in the +CF setup, and not due to the increase in number of dimensions, we performed additional experiments in which we concatenated the embedding representations with a random binary feature, and with a randomly chosen value between 0 and 1." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-121", "text": "For each model, neither of these approaches outperformed that model using the +CF setup." 
}, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-122", "text": "10 In the +CF setup, the word2vec model using embeddings that were trained on the same corpus as skip-thoughts achieved accuracies of 0.846 and 0.851, on DEV and TEST, respectively." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-123", "text": "These are again higher accuracies than the corresponding setup for the skip-thoughts model." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-124", "text": "11 We carried out the same analysis on DEV." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-125", "text": "The findings were similar." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-126", "text": "canonical form feature itself performs relatively poorly on literal usages, it provides information that enables the word2vec model to better identify literal usages." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-127", "text": "----------------------------------" }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-128", "text": "**CONCLUSIONS**" }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-129", "text": "Determining whether a usage of a VNC is idiomatic or literal is important for applications such as machine translation, where it is vital to preserve the meanings of word combinations." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-130", "text": "In this paper we proposed two approaches to the task of classifying VNC token instances as idiomatic or literal based on word2vec embeddings and Siamese CBOW." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-131", "text": "We compared these approaches against a linguistically-informed unsupervised baseline, and a model based on skip-thoughts previously applied to this task (Salton et al., 2016) ." 
}, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-132", "text": "Our experimental results show that a comparatively simple approach based on averaging word embeddings performs at least as well as, or better than, the approach based on skip-thoughts." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-133", "text": "We further proposed methods to combine linguistic knowledge of the lexico-syntactic fixedness of VNCs -socalled \"canonical forms\", which can be automatically acquired from corpora via statistical methods -with the embedding based approaches." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-134", "text": "Our findings indicate that this rich linguistic knowledge is complementary to that available in distributed representations." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-135", "text": "Alternative approaches to embedding sentences containing VNC instances could also be considered, for example, FastSent (Hill et al., 2016) ." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-136", "text": "However, all of the models we used represent the context of a VNC by the sentence in which it occurs." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-137", "text": "In future work we therefore also intend to consider approaches such as context2vec (Melamud et al., 2016) which explicitly encode the context in which a token occurs." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-138", "text": "Finally, one known challenge of VNC token classification is to develop models that are able to generalize to VNC types that were not seen during training (Gharbieh et al., 2016) ." }, { "sent_id": "12c5d72fad925c8ec025cda87a0fd9-C001-139", "text": "In future work we plan to explore this experimental setup." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "12c5d72fad925c8ec025cda87a0fd9-C001-10" ], [ "12c5d72fad925c8ec025cda87a0fd9-C001-25" ], [ "12c5d72fad925c8ec025cda87a0fd9-C001-26" ], [ "12c5d72fad925c8ec025cda87a0fd9-C001-31" ], [ "12c5d72fad925c8ec025cda87a0fd9-C001-35" ], [ "12c5d72fad925c8ec025cda87a0fd9-C001-70" ], [ "12c5d72fad925c8ec025cda87a0fd9-C001-71" ] ], "cite_sentences": [ "12c5d72fad925c8ec025cda87a0fd9-C001-10", "12c5d72fad925c8ec025cda87a0fd9-C001-25", "12c5d72fad925c8ec025cda87a0fd9-C001-26", "12c5d72fad925c8ec025cda87a0fd9-C001-31", "12c5d72fad925c8ec025cda87a0fd9-C001-35", "12c5d72fad925c8ec025cda87a0fd9-C001-70", "12c5d72fad925c8ec025cda87a0fd9-C001-71" ] }, "@USE@": { "gold_contexts": [ [ "12c5d72fad925c8ec025cda87a0fd9-C001-21" ], [ "12c5d72fad925c8ec025cda87a0fd9-C001-66" ], [ "12c5d72fad925c8ec025cda87a0fd9-C001-80" ] ], "cite_sentences": [ "12c5d72fad925c8ec025cda87a0fd9-C001-21", "12c5d72fad925c8ec025cda87a0fd9-C001-66", "12c5d72fad925c8ec025cda87a0fd9-C001-80" ] }, "@SIM@": { "gold_contexts": [ [ "12c5d72fad925c8ec025cda87a0fd9-C001-66" ], [ "12c5d72fad925c8ec025cda87a0fd9-C001-116" ] ], "cite_sentences": [ "12c5d72fad925c8ec025cda87a0fd9-C001-66", "12c5d72fad925c8ec025cda87a0fd9-C001-116" ] }, "@MOT@": { "gold_contexts": [ [ "12c5d72fad925c8ec025cda87a0fd9-C001-80" ] ], "cite_sentences": [ "12c5d72fad925c8ec025cda87a0fd9-C001-80" ] }, "@DIF@": { "gold_contexts": [ [ "12c5d72fad925c8ec025cda87a0fd9-C001-96" ] ], "cite_sentences": [ "12c5d72fad925c8ec025cda87a0fd9-C001-96" ] } } }, "ABC_15dd59368074f3473b57d86568807f_2": { "x": [ { "sent_id": "15dd59368074f3473b57d86568807f-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-2", "text": "In this work, we present an approach based on combining string kernels and word embeddings for automatic essay scoring." 
}, { "sent_id": "15dd59368074f3473b57d86568807f-C001-3", "text": "String kernels capture the similarity among strings based on counting common character ngrams, which are a low-level yet powerful type of feature, demonstrating state-of-theart results in various text classification tasks such as Arabic dialect identification or native language identification." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-4", "text": "To our best knowledge, we are the first to apply string kernels to automatically score essays." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-5", "text": "We are also the first to combine them with a high-level semantic feature representation, namely the bag-of-super-word-embeddings." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-6", "text": "We report the best performance on the Automated Student Assessment Prize data set, in both indomain and cross-domain settings, surpassing recent state-of-the-art deep learning approaches." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-7", "text": "----------------------------------" }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-9", "text": "Automatic essay scoring (AES) is the task of assigning grades to essays written in an educational setting, using a computer-based system with natural language processing capabilities." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-10", "text": "The aim of designing such systems is to reduce the involvement of human graders as far as possible." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-11", "text": "AES is a challenging task as it relies on grammar as well as semantics, pragmatics and discourse (Song et al., 2017) ." 
}, { "sent_id": "15dd59368074f3473b57d86568807f-C001-12", "text": "Although traditional AES methods typically rely on handcrafted features (Larkey, 1998; Foltz et al., 1999; Attali and Burstein, 2006; Dikli, 2006; Wang and Brown, 2008; Chen and He, 2013; Somasundaran et al., 2014; Yannakoudakis et al., 2014; Phandi et al., 2015) , recent results indicate that state-of-the-art deep learning methods reach better performance (Alikaniotis et al., 2016; Dong and Zhang, 2016; Taghipour and Ng, 2016; Song et al., 2017; Tay et al., 2018) , perhaps because these methods are able to capture subtle and complex information that is relevant to the task (Dong and Zhang, 2016) ." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-13", "text": "In this paper, we propose to combine string kernels (low-level character n-gram features) and word embeddings (high-level semantic features) to obtain state-of-the-art AES results." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-14", "text": "Since recent methods based on string kernels have demonstrated remarkable performance in various text classification tasks ranging from authorship identification (Popescu and Grozea, 2012) and sentiment analysis (Gim\u00e9nez-P\u00e9rez et al., 2017; to native language identification (Popescu and Ionescu, 2013; Ionescu et al., 2014; Ionescu, 2015; and dialect identification , we believe that string kernels can reach equally good results in AES." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-15", "text": "To the best of our knowledge, string kernels have never been used for this task." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-16", "text": "As string kernels are a simple approach that relies solely on character n-grams as features, it is fairly obvious that such an approach will not to cover several aspects (e.g.: semantics, discourse) required for the AES task." 
}, { "sent_id": "15dd59368074f3473b57d86568807f-C001-17", "text": "To solve this problem, we propose to combine string kernels with a recent approach based on word embeddings, namely the bag-of-super-wordembeddings (BOSWE) ." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-18", "text": "To our knowledge, this is the first successful attempt to combine string kernels and word embeddings." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-19", "text": "We evaluate our approach on the Automated Student Assessment Prize data set, in both in-domain and cross-domain settings." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-20", "text": "The empirical results indicate that our approach yields a better performance than state-of-the-art approaches (Phandi et al., 2015; Dong and Zhang, 2016; Tay et al., 2018) ." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-21", "text": "----------------------------------" }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-22", "text": "**METHOD**" }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-23", "text": "String kernels." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-24", "text": "Kernel functions (Shawe-Taylor and Cristianini, 2004) capture the intuitive notion of similarity between objects in a specific domain." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-25", "text": "For example, in text mining, string kernels can be used to measure the pairwise similarity between text samples, simply based on character n-grams." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-26", "text": "Various string kernel functions have been proposed to date (Lodhi et al., 2002; Shawe-Taylor and Cristianini, 2004; Ionescu et al., 2014) ." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-27", "text": "One of the most recent string kernels is the histogram intersection string kernel (HISK) (Ionescu et al., 2014 )." 
}, { "sent_id": "15dd59368074f3473b57d86568807f-C001-28", "text": "For two strings over an alphabet \u03a3, x, y \u2208 \u03a3 * , the intersection string kernel is formally defined as follows:" }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-29", "text": "where num v (x) is the number of occurrences of n-gram v as a substring in x, and n is the length of v. In our AES experiments, we use the intersection string kernel based on a range of character n-grams." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-30", "text": "We approach AES as a regression task, and employ \u03bd-Support Vector Regression (\u03bd-SVR) (Suykens and Vandewalle, 1999; Shawe-Taylor and Cristianini, 2004) for training." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-31", "text": "Bag-of-super-word-embeddings." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-32", "text": "Word embeddings are long known in the NLP community (Bengio et al., 2003; Collobert and Weston, 2008) , but they have recently become more popular due to the word2vec (Mikolov et al., 2013) framework that enables the building of efficient vector representations from words." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-33", "text": "On top of the word embeddings, developed an approach termed bag-ofsuper-word-embeddings (BOSWE) by adapting an efficient computer vision technique, the bag-ofvisual-words model (Csurka et al., 2004) , for natural language processing tasks." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-34", "text": "The adaptation consists of replacing the image descriptors (Lowe, 2004) useful for recognizing object patterns in images with word embeddings (Mikolov et al., 2013) useful for recognizing semantic patterns in text documents." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-35", "text": "The BOSWE representation is computed as follows." 
}, { "sent_id": "15dd59368074f3473b57d86568807f-C001-36", "text": "First, each word in the collection of training documents is represented as word vector using a pre-trained word embeddings model." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-37", "text": "Based on the fact that word embeddings carry semantic information by projecting semantically related words in the same region of the embedding space, the next step is to cluster the word vectors in order to obtain relevant semantic clusters of words." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-38", "text": "As in the standard bag-of-visual-words model, the clustering is done by k-means (Leung and Malik, 2001) , and the formed centroids are stored in a randomized forest of k-d trees (Philbin et al., 2007) to reduce search cost." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-39", "text": "The centroid of each cluster is interpreted as a super word embedding or super word vector that embodies all the semantically related word vectors in a small region of the embedding space." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-40", "text": "Every embedded word in the collection of documents is then assigned to the nearest cluster centroid (the nearest super word vector)." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-41", "text": "Put together, the super word vectors generate a vocabulary (codebook) that can further be used to describe each document as a bag-of-super-wordembeddings." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-42", "text": "To obtain the BOSWE represenation for a document, we just have to compute the occurrence count of each super word embedding in the respective document." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-43", "text": "After building the representation, we employ a kernel method to train the BOSWE model for our specific task." 
}, { "sent_id": "15dd59368074f3473b57d86568807f-C001-44", "text": "To be consistent with the string kernel approach, we choose the histogram intersection kernel and the same regression method, namely \u03bd-SVR." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-45", "text": "Model fusion." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-46", "text": "In the primal form, a linear classifier takes as input a feature matrix X of r samples (rows) with m features (columns) and optimizes a set of weights in order to reproduce the r training labels." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-47", "text": "In the dual form, the linear classifier takes as input a kernel matrix K of r \u00d7 r components, where each component k ij is the similarity between examples x i and x j ." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-48", "text": "Kernel methods work by embedding the data in a Hilbert space and by searching for linear relations in that space, using a learning algorithm." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-49", "text": "The embedding can be performed either (i) implicitly, by directly specifying the similarity function between each pair of samples, or (ii) explicitly, by first giving the embedding map \u03c6 and by computing the inner product between each pair of samples embedded in the Hilbert space." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-50", "text": "For the linear kernel, the associated embedding map is \u03c6(x) = x and options (i) or (ii) are equivalent, i.e. the similarity function is the inner product." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-51", "text": "Hence, the linear kernel matrix K can be obtained as K = X \u00b7 X \u2032 , where X \u2032 is the transpose of X. For other kernels, e.g. 
the histogram intersection kernel, it is not possible to explicitly define the embedding map (Shawe-Taylor and Cristianini, 2004) , and the only solution is to adopt option (i) and compute the corresponding kernel matrix directly." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-52", "text": "Therefore, we combine HISK and BOSWE in the dual (kernel) form, by simply summing up the two corresponding kernel matrices." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-53", "text": "However, summing up kernel matrices is equivalent to feature vector concatenation in the primal Hilbert space." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-54", "text": "To better explain this statement, let us suppose that we can define the embedding map of the histogram intersection kernel and, consequently, we can obtain the corresponding feature matrix of HISK with r \u00d7 m_1 components denoted by X_1 and the corresponding feature matrix of BOSWE with r \u00d7 m_2 components denoted by X_2." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-55", "text": "We can now combine HISK and BOSWE in two ways." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-56", "text": "One way is to compute the corresponding kernel matrices K_1 = X_1 \u00b7 X\u2032_1 and K_2 = X_2 \u00b7 X\u2032_2, and to sum the matrices into a single kernel matrix K_+ = K_1 + K_2." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-57", "text": "The other way is to first concatenate the feature matrices into a single feature matrix X_+ = [X_1 X_2] of r \u00d7 (m_1 + m_2) components, and to compute the final kernel matrix using the inner product, i.e. K_+ = X_+ \u00b7 X\u2032_+." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-58", "text": "Either way, the two approaches, HISK and BOSWE, are fused before the learning stage."
}, { "sent_id": "15dd59368074f3473b57d86568807f-C001-59", "text": "As a consequence of kernel summation, the search space of linear patterns grows, which should help the kernel classifier, in our case \u03bd-SVR, to find a better regression function." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-60", "text": "----------------------------------" }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-61", "text": "**EXPERIMENTS**" }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-62", "text": "Data set." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-63", "text": "To evaluate our approach, we use the Automated Student Assessment Prize (ASAP) 1 data set from Kaggle." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-64", "text": "The ASAP data set contains 8 prompts of different genres." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-65", "text": "The number of essays per prompt along with the score ranges are presented in Table 1 ." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-66", "text": "Since the official test data of the ASAP competition is not released to the public, we, as well as others before us (Phandi et al., 2015; Dong and Zhang, 2016; 1 https://www.kaggle.com/c/asap-aes/data Tay et al., 2018) , use only the training data in our experiments." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-67", "text": "Evaluation procedure." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-68", "text": "As Dong and Zhang (2016), we scaled the essay scores into the range 0-1." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-69", "text": "We closely followed the same settings for data preparation as (Phandi et al., 2015; Dong and Zhang, 2016) ." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-70", "text": "For the in-domain experiments, we use 5-fold cross-validation." 
}, { "sent_id": "15dd59368074f3473b57d86568807f-C001-71", "text": "The 5-fold cross-validation procedure is repeated for 10 times and the results were averaged to reduce the accuracy variation introduced by randomly selecting the folds." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-72", "text": "We note that the standard deviation in all cases in below 0.2%." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-73", "text": "For the cross-domain experiments, we use the same source\u2192target domain pairs as (Phandi et al., 2015; Dong and Zhang, 2016) , namely, 1\u21922, 3\u21924, 5\u21926 and 7\u21928." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-74", "text": "All essays in the source domain are used as training data." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-75", "text": "Target domain samples are randomly divided into 5 folds, where one fold is used as test data, and the other 4 folds are collected together to sub-sample target domain train data." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-76", "text": "The sub-sample sizes are n t = {10, 25, 50, 100}. The sub-sampling is repeated for 5 times as in (Phandi et al., 2015; Dong and Zhang, 2016) to reduce bias." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-77", "text": "As our approach performs very well in the cross-domain setting, we also present experiments without subsampling data from the target domain, i.e. when the sub-sample size is n t = 0." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-78", "text": "As evaluation metric, we use the quadratic weighted kappa (QWK)." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-79", "text": "Baselines." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-80", "text": "We compare our approach with stateof-the-art methods based on handcrafted features (Phandi et al., 2015) , as well as deep features (Dong and Zhang, 2016; Tay et al., 2018) ." 
}, { "sent_id": "15dd59368074f3473b57d86568807f-C001-81", "text": "We note that results for the cross-domain setting are reported only in some of these recent works (Phandi et al., 2015; Dong and Zhang, 2016) ." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-82", "text": "Implementation choices." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-83", "text": "For the string kernels approach, we used the histogram intersection string kernel (HISK) based on the blended range of character n-grams from 1 to 15." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-84", "text": "To compute the intersection string kernel, we used the open-source code provided by Ionescu et al. (2014) ." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-85", "text": "For the BOSWE approach, we used the pre-trained word embeddings computed by the word2vec toolkit (Mikolov et al., 2013) on the Google News data set using the Skip-gram model, which produces 300-dimensional vectors for 3 million words and phrases." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-86", "text": "We used functions from the VLFeat li- Table 2 : In-domain automatic essay scoring results of our approach versus several state-of-the-art methods (Phandi et al., 2015; Dong and Zhang, 2016; Tay et al., 2018) ." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-87", "text": "Results are reported in terms of the quadratic weighted kappa (QWK) measure, using 5-fold cross-validation." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-88", "text": "The best QWK score (among the machine learning systems) for each prompt is highlighted in bold." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-89", "text": "brary (Vedaldi and Fulkerson, 2008) for the other steps involved in the BOSWE approach, such as the k-means clustering and the randomized forest of k-d trees." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-90", "text": "We set the number of clusters (dimension of the vocabulary) to k = 500." 
}, { "sent_id": "15dd59368074f3473b57d86568807f-C001-91", "text": "After computing the BOSWE representation, we apply the L 1 -normalized intersection kernel." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-92", "text": "We combine HISK and BOSWE in the dual form by summing up the two corresponding matrices." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-93", "text": "For the learning phase, we employ the dual implementation of \u03bd-SVR available in LibSVM (Chang and Lin, 2011) ." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-94", "text": "We set its regularization parameter to c = 10 3 and \u03bd = 10 \u22121 in all our experiments." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-95", "text": "In-domain results." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-96", "text": "The results for the in-domain automatic essay scoring task are presented in Table 2." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-97", "text": "In our empirical study, we also include feature ablation results." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-98", "text": "We report the QWK measure on each prompt as well as the overall average." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-99", "text": "We first note that the histogram intersection string kernel alone reaches better overall performance (0.780) than all previous works (Phandi et al., 2015; Dong and Zhang, 2016; Tay et al., 2018) ." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-100", "text": "Remarkably, the overall performance of the HISK is also higher than the inter-human agreement (0.754)." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-101", "text": "Although the BOSWE model can be regarded as a shallow approach, its overall results are comparable to those of deep learning approaches (Dong and Zhang, 2016; Tay et al., 2018) ." 
}, { "sent_id": "15dd59368074f3473b57d86568807f-C001-102", "text": "When we combine the two models (HISK and BOSWE), we obtain even better results." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-103", "text": "Indeed, the combination of string kernels and word embeddings attains the best performance on 7 out of 8 prompts." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-104", "text": "The average QWK score of HISK and BOSWE (0.785) is more than 2% better the average scores of the best-performing state-of-the-art approaches Tay et al., 2018) ." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-105", "text": "Cross-domain results." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-106", "text": "The results for the crossdomain automatic essay scoring task are presented in Table 3 ." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-107", "text": "For each and every source\u2192target pair, we report better results than both state-of-theart methods (Phandi et al., 2015; Dong and Zhang, 2016) ." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-108", "text": "We observe that the difference between our best QWK scores and the other approaches are sometimes much higher in the cross-domain setting than in the in-domain setting." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-109", "text": "We particularly notice that the difference from (Phandi et al., 2015) when n t = 0 is always higher than 10%." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-110", "text": "Our highest improvement (more than 54%, from 0.187 to 0.728) over (Phandi et al., 2015) is recorded for the pair 5\u21926, when n t = 0." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-111", "text": "Our score in this case (0.728) is even higher than both scores of Phandi et al. (2015) and Dong and Zhang (2016) when they use n t = 50." 
}, { "sent_id": "15dd59368074f3473b57d86568807f-C001-112", "text": "Different from the in-domain setting, we note that the combination of string kernels and word embeddings does not always provide better results than string kernels alone, particularly when the number of target samples (n t ) added into the training set is less or equal to 25." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-113", "text": "Discussion." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-114", "text": "It is worth noting that in a set of preliminary experiments (not included in the paper), we actually considered another approach based on word embeddings." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-115", "text": "We tried to obtain a document embedding by averaging the word vectors for each document." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-116", "text": "We computed the average as well as the standard deviation for each component of the word vectors, resulting in a total of 600 features, since the word vectors are 300-dimensional." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-117", "text": "We applied this method in the in-domain setting and we obtained a surprisingly low overall QWK score, around 0.251." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-118", "text": "We concluded that this simple apSource\u2192Target Method n t = 0 n t = 10 n t = 25 n t = 50 n t = 100 1\u21922 (Phandi et al., 2015) Table 3 : Corss-domain automatic essay scoring results of our approach versus two state-of-the-art methods (Phandi et al., 2015; Dong and Zhang, 2016) ." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-119", "text": "Results are reported in terms of the quadratic weighted kappa (QWK) measure, using the same evaluation procedure as (Phandi et al., 2015; Dong and Zhang, 2016) ." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-120", "text": "The best QWK scores for each source\u2192target domain pair are highlighted in bold." 
}, { "sent_id": "15dd59368074f3473b57d86568807f-C001-121", "text": "proach is not useful, and decided to use BOSWE instead." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-122", "text": "It would have been interesting to present an error analysis based on the discriminant features weighted higher by the \u03bd-SVR method." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-123", "text": "Unfortunately, this is not possible because our approach works in the dual space and we cannot transform the dual weights into primal weights, as long as the histogram intersection kernel does not have an explicit embedding map associated to it." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-124", "text": "In future work, however, we aim to replace the histogram intersection kernel with the presence bits kernel, which will enable us to perform an error analysis based on the overused or underused patterns, as described by ." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-125", "text": "----------------------------------" }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-126", "text": "**CONCLUSION**" }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-127", "text": "In this paper, we described an approach based on combining string kernels and word embeddings for automatic essay scoring." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-128", "text": "We compared our approach on the Automated Student Assessment Prize data set, in both in-domain and crossdomain settings, with several state-of-the-art approaches (Phandi et al., 2015; Dong and Zhang, 2016; Tay et al., 2018) ." }, { "sent_id": "15dd59368074f3473b57d86568807f-C001-129", "text": "Overall, the in-domain and the cross-domain comparative studies indicate that string kernels, both alone and in combination with word embeddings, attain the best performance on the automatic essay scoring task." 
}, { "sent_id": "15dd59368074f3473b57d86568807f-C001-130", "text": "Using a shallow approach, we report better results compared to recent deep learning approaches (Dong and Zhang, 2016; Tay et al., 2018) ." } ], "y": { "@BACK@": { "gold_contexts": [ [ "15dd59368074f3473b57d86568807f-C001-12" ], [ "15dd59368074f3473b57d86568807f-C001-81" ] ], "cite_sentences": [ "15dd59368074f3473b57d86568807f-C001-12", "15dd59368074f3473b57d86568807f-C001-81" ] }, "@DIF@": { "gold_contexts": [ [ "15dd59368074f3473b57d86568807f-C001-20" ], [ "15dd59368074f3473b57d86568807f-C001-80" ], [ "15dd59368074f3473b57d86568807f-C001-86" ], [ "15dd59368074f3473b57d86568807f-C001-99" ], [ "15dd59368074f3473b57d86568807f-C001-107" ], [ "15dd59368074f3473b57d86568807f-C001-111" ], [ "15dd59368074f3473b57d86568807f-C001-118" ], [ "15dd59368074f3473b57d86568807f-C001-128", "15dd59368074f3473b57d86568807f-C001-129", "15dd59368074f3473b57d86568807f-C001-130" ] ], "cite_sentences": [ "15dd59368074f3473b57d86568807f-C001-20", "15dd59368074f3473b57d86568807f-C001-80", "15dd59368074f3473b57d86568807f-C001-86", "15dd59368074f3473b57d86568807f-C001-99", "15dd59368074f3473b57d86568807f-C001-107", "15dd59368074f3473b57d86568807f-C001-111", "15dd59368074f3473b57d86568807f-C001-118", "15dd59368074f3473b57d86568807f-C001-128", "15dd59368074f3473b57d86568807f-C001-130" ] }, "@USE@": { "gold_contexts": [ [ "15dd59368074f3473b57d86568807f-C001-66" ], [ "15dd59368074f3473b57d86568807f-C001-68" ], [ "15dd59368074f3473b57d86568807f-C001-69" ], [ "15dd59368074f3473b57d86568807f-C001-73" ], [ "15dd59368074f3473b57d86568807f-C001-76" ], [ "15dd59368074f3473b57d86568807f-C001-119" ] ], "cite_sentences": [ "15dd59368074f3473b57d86568807f-C001-66", "15dd59368074f3473b57d86568807f-C001-68", "15dd59368074f3473b57d86568807f-C001-69", "15dd59368074f3473b57d86568807f-C001-73", "15dd59368074f3473b57d86568807f-C001-76", "15dd59368074f3473b57d86568807f-C001-119" ] }, "@SIM@": { "gold_contexts": [ [ 
"15dd59368074f3473b57d86568807f-C001-66" ], [ "15dd59368074f3473b57d86568807f-C001-68" ], [ "15dd59368074f3473b57d86568807f-C001-69" ], [ "15dd59368074f3473b57d86568807f-C001-73" ], [ "15dd59368074f3473b57d86568807f-C001-101" ], [ "15dd59368074f3473b57d86568807f-C001-119" ] ], "cite_sentences": [ "15dd59368074f3473b57d86568807f-C001-66", "15dd59368074f3473b57d86568807f-C001-68", "15dd59368074f3473b57d86568807f-C001-69", "15dd59368074f3473b57d86568807f-C001-73", "15dd59368074f3473b57d86568807f-C001-101", "15dd59368074f3473b57d86568807f-C001-119" ] } } }, "ABC_2606ecb66287c0199f3aa6d95f6774_2": { "x": [ { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-74", "text": "----------------------------------" }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-75", "text": "**VISION AND LANGUAGE**" }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-26", "text": "This enables us to incorporate 'agreement' between the humans directly into the metric." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-2", "text": "As language and visual understanding by machines progresses rapidly, we are observing an increasing interest in holistic architectures that tightly interlink both modalities in a joint learning and inference process." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-27", "text": "Although the idea is not entirely new [35, 36, 37] , we believe it sits at the core of building more open and holistic challenges." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-3", "text": "This trend has allowed the community to progress towards more challenging and open tasks and refueled the hope at achieving the old AI dream of building machines that could pass a turing test in open domains." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-4", "text": "In order to steadily make progress towards this goal, we realize that quantifying performance becomes increasingly difficult." 
}, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-5", "text": "Therefore we ask how we can precisely define such challenges and how we can evaluate different algorithms on this open tasks?" }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-6", "text": "In this paper, we summarize and discuss such challenges as well as try to give answers where appropriate options are available in the literature." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-7", "text": "We exemplify some of the solutions on a recently presented dataset of question-answering task based on real-world indoor images that establishes a visual turing challenge." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-8", "text": "Finally, we argue despite the success of unique ground-truth annotation, we likely have to step away from carefully curated dataset and rather rely on 'social consensus' as the main driving force to create suitable benchmarks." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-9", "text": "Providing coverage in this inherently ambiguous output space is an emerging challenge that we face in order to make quantifiable progress in this area." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-10", "text": "----------------------------------" }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-11", "text": "**INTRODUCTION**" }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-12", "text": "Recently we witness a tremendous progress in the machine perception [1, 2, 3, 4, 5, 6, 7, 8] and in the language understanding [9, 10, 11, 12, 13] tasks." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-13", "text": "The progress in both fields has inspired researchers to build holistic architectures for challenging grounding [14, 15] , natural language generation from image/video [16, 17, 18] , image-to-sentence alignment [19, 20, 21, 22] , and recently presented question-answering problems [23, 24, 25, 26, 27] ." 
}, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-14", "text": "In this paper we argue for a Visual Turing Test -an open domain task of question-answering based on real-world images that resemblances the famous Turing Test [28, 29] and deviates from other attempts [30, 31, 32] -and discuss challenges together with tools to benchmark different models on such task." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-15", "text": "We typically measure the progress in the field by quantifying the performance of different methods against a carefully crafted set of benchmarks." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-16", "text": "Crowdsourcing in combination of machine learning approaches have served us well to generate curated datasets with a unique ground truth at scale [33, 34] ." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-17", "text": "As the complexity and the openness of the task grows, the quest of crafting good benchmarks also becomes more difficult." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-18", "text": "First, interpreting and evaluating the answer of a system becomes increasingly difficult and ideally would rely on human judgement." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-19", "text": "Yet we want to have objective metrics that we can evaluate automatically at large scale." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-20", "text": "Second, establishing an evaluation methodology that assigns scores over a large output domain is challenging, as any system based on ontologies will have limited coverage." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-21", "text": "Third, if our aim is to mimic human response, we have to deal with inherent ambiguities due to human judgement that stem from issues like binding, reference frames, social conventions." 
}, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-22", "text": "For instance [27] reports that for a question answering task on real-world images even human answers are inconsistent." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-23", "text": "Obviously this cannot be a problem of humans but rather argues for inherent ambiguities in the task." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-24", "text": "Competing methods are validated against true annotations, but what is the \"truth\" in a task where even human answers cannot completely agree with each other?" }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-25", "text": "Instead of seeking an unique, \"true\" answer we suggest to look into 'social consensus' that takes multiple human answers as different interpretations of the question into account." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-28", "text": "We exemplify some of our findings on the DAQUAR dataset [27] with the aim of demonstrating different challenges that are present in the dataset." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-29", "text": "We hope that our exposition is helpful towards building a public visual turing challenge and will generate a discussion for the agreeable evaluation procedure and designing systems that can address open domain tasks." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-30", "text": "In this paper holistic architecture (also holistic learner) is a machine learning architecture designed to work on the task that fuses at least two modalities, e.g. language and vision." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-31", "text": "The external world is a part of a task accessible to the holistic learner only via sensors and it can be either human world (the world that surrounds us), or a machine world that models some aspects of human world." 
}, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-32", "text": "----------------------------------" }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-33", "text": "**CHALLENGES**" }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-34", "text": "As we strive for more holistic and open tasks such as grounding or question-answering based on images, we need to deal with a large gamut of challenges." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-35", "text": "In this section we have distilled and discuss some of the most prominent ones in order to guide the further discussion." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-36", "text": "Vision and language Scalability: Perception and natural language understanding are crucial parts of holistic reasoning as they ground any representation in the external world and therefore serve as a common reference point for machines and humans." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-37", "text": "The human conceptualization divides these percepts into different instances, categories as well as spatio-temporal concepts." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-38", "text": "Architectures that aim at mimicking or reproducing this space of human concepts need to capture the same diversity and therefore scale up to thousands of concepts [38, 39, 40] ." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-39", "text": "Concept ambiguity: As the number of categories grows, the semantic boundaries become more fuzzy, and hence ambiguities are inherently introduced [41, 42] ." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-40", "text": "For instance, sometimes we may overlook the difference between 'night stand' and 'cabinet', or 'armchair' and 'sofa'." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-41", "text": "Therefore it is reasonable to expect from the holistic architectures to create alternative hypotheses of the external world during inference." 
}, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-42", "text": "This also relates to the gradual category membership in human perception as portrayed in the prototype theory [41, 43] ." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-43", "text": "Attributes: The human concepts are not limited to object categories, but also include attributes such as genders, colors, states (lights can be either on or off)." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-44", "text": "Often these concepts cannot be learned on their own, but rather are contextualized by the associated noun." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-45", "text": "E.g. white in \"white\" elephant is surly different from \"white\" in white snow." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-46", "text": "Ambiguity in reference resolution: Reliably answering on questions is challenging even for humans." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-47", "text": "The quality of an answer depends on how ambiguous and latent notions of reference frames and intentions are understood [27, 44] ." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-48", "text": "Depending on the cultural bias and the context, we may use object-centric or observer-centric or even world-centric frames of reference [45] ." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-49", "text": "Moreover, it is even unclear what 'with', 'beneath', 'over' mean." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-50", "text": "It seems at least difficult to symbolically define them in terms of predicates." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-51", "text": "While holistic learning and inference encompassing all the aforementioned aspects has yet to be shown, current research directions show promise [46, 47, 48] by adapting the symbolic-based approaches [10, 11, 23, 24] with vector-based approaches [12, 19, 25] to represent the meaning." 
}, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-52", "text": "Common sense knowledge It turns out that some questions can solely be answered with the access to common sense knowledge with high reliability." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-53", "text": "For instance \"Which object on the table is used for cutting?\" already narrows the likely options significantly and the correct answer is probably \"knife\" or \"scissors\"." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-54", "text": "Other questions like \"Which hand of the teacher is on her chin?\" require the mixture of the vision and language." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-55", "text": "To understand the question, a holistic learner needs to first detect a person, figure out that the person may be a teacher, understand a gender of the person, detect her chin, understand 'left' and 'right' side, and finally relates 'her' with the 'teacher'." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-56", "text": "However, different parts of the common sense knowledge can be used with different modality." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-57", "text": "An 'object for cutting' is not about seeing but about the affordance of the object and it cannot be learnt solely from the set of images." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-58", "text": "On the other hand things that often co-occur together may stand for the visual-based common sense knowledge." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-59", "text": "For instance we may expect to find a scissor or a pen inside a small plastic box, but never a wall or a window." 
}, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-60", "text": "Common sense knowledge can help holistic machine learning architectures to either fulfill the task (question \"Which object on the table is used for cutting?\" can utilizes this type of knowledge), or limit the hypothesis space and hence to reduce the computational complexity of the search problem." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-61", "text": "For instance an architecture could be guided by its common sense knowledge to limit the space of possible locations of the 'scissors' and answer on \"What is in front of scissors?\" more effectively." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-62", "text": "Defining a benchmark dataset and quantifying performance We argue that the question answering based on the visual input task significantly differ from the grounding problem and has unique advantages towards defining a challenge dataset." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-63", "text": "Most prominently, the latter is about finding (either with a hand-crafted set of rules or learnt-based approaches) a mapping between the linguistic fragments and the physical world [14, 15, 49] , whereas the question answering task is about an end-to-end system where we do not necessarily want to enforce any constraints or penalty for the internal representation of the holistic learner." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-64", "text": "In this sense grounding is a latent sub-task that the holistic learner needs to solve, but will not be evaluated on." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-65", "text": "Finally, we argue that establishing benchmark dataset based on a question answering task similar to a turing test, is more tractable." 
}, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-66", "text": "Learning grounding asks for exhaustive symbolic-based annotations of the world, while question answering only needs textual annotations for the aspects that the question refers to." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-67", "text": "----------------------------------" }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-68", "text": "**DAQUAR: BUILDING A DATASET FOR VISUAL TURING CHALLENGE**" }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-69", "text": "DAQUAR [27] is a challenging, large dataset for a question answering task based on real-world images." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-70", "text": "The images present real-world indoor scenes [50] , while the questions are unconstrained natural language sentences." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-71", "text": "DAQUAR's language scope is beyond the nouns or tuples that are typical to recognition datasets [51, 52, 53] ." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-72", "text": "Other, linguistically rich datasets either do not tackle images at all [54, 55] or consider only few in very constrained domain [15] , or are more suitable for the learning an embedding/image-sentence retrieval or language generation [22, 56, 57, 58] ." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-73", "text": "In this section we discuss in isolation different challenges reflected in DAQUAR." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-76", "text": "The machine world in DAQUAR is represented as a set of images and questions about their content." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-77", "text": "DAQUAR contains 1088 different nouns in the question, 803 in the answers, and 1586 altogether (we use the Stanford POS Tagger [59] to extract the nouns from the questions)." 
}, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-78", "text": "If we consider only nouns in singular form in the questions, we still have 573 categories." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-79", "text": "The current state-of-the-art semantic segmentation methods on the NYU-Depth V2 dataset [50] can discriminate only between up to 37 object categories [2, 60, 61] , much fewer to what is needed." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-80", "text": "DAQUAR also contains other parts of speech where only colors and spatial prepositions are grounded in [27] ." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-81", "text": "Moreover, ambiguities naturally emerge due to fine grained categories that exist in DAQUAR." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-82", "text": "For instance 'night stand', 'stool' and 'cabinet' sometimes refer to the same thing." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-83", "text": "There is also a variation in the naming of colors among the annotations." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-84", "text": "Questions rely heavily on the spatial concepts with different frame of reference." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-85", "text": "DAQUAR includes various challenges related to natural language understanding." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-86", "text": "Any semantic representation needs to work with the large number of predicates (reaching about 4 million to account different interpretations of the external world), with questions of substantial length (10.5 words in average with variance 5.5; the longest question has 30 words), and possible language errors in the questions." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-87", "text": "Common sense knowledge DAQUAR includes questions that can be reliably answered using common sense knowledge." 
}, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-88", "text": "For instance \"Which object on the table is used for cutting?\" already provides strong non-visual cues for the \"cutting\" object." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-89", "text": "Answers on other questions, such as \"What is above the desk in front of scissors?\", can be improved if the search space is reasonable restricted." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-90", "text": "Moreover, some annotators hypothesize missing parts of the object based on their common sense." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-91", "text": "To sum up, we believe that common sense knowledge is an interesting venue to explore with DAQUAR." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-92", "text": "----------------------------------" }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-93", "text": "**QUESTION ANSWERING TASK**" }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-94", "text": "The question answering task is also about understanding hidden intentions of the questioner with grounding as a sub-goal to solve." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-95", "text": "Some authors [23, 24, 27] treat the grounding (understood here as the logical representation of the meaning of the question) as a latent variable in the question answering task." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-96", "text": "Others [44] have modeled the pragmatic effects in the question answering task, but such approaches have never been shown to work in less constrained environments." 
}, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-97", "text": "----------------------------------" }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-98", "text": "**QUANTIFYING THE PERFORMANCE OF HOLISTIC ARCHITECTURES**" }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-99", "text": "Together with increasing complexity and openness of the task, quantifying performance of the holistic architectures becomes challenging due to several issues: Automation: Evaluating answers on such complex tasks as answering on questions requires a quite deep understanding of natural language, involved concepts and hidden intentions of the questioner." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-100", "text": "The ideal but impractical metric would be to manually judge every single answer of every architecture individually." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-101", "text": "Since this is infeasible we are seeking an automatic approximation so that we can evaluate different holistic architectures at scale." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-102", "text": "Ambiguity: The complex tasks that we are interested in are inherently ambiguous." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-103", "text": "The ambiguities stem from cultural bias, different frame of reference and fined grained categorization." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-104", "text": "This implies that multiple interpretations of a question are possible and hence many correct answers." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-105", "text": "Coverage: Since there are multiple ways of expressing the same concept, the automatic performance metric should take the equivalence class among the answers into the consideration by assigning similar scores to all members of the same class." 
}, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-106", "text": "There are attempts to alleviate this issue via defining similarity scores [62] over the lexical databases [63, 64] ." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-107", "text": "These approaches, however, lacks of coverage: we cannot assign a similarity between the terms that are not represented in the structure." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-108", "text": "----------------------------------" }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-109", "text": "**WUPS SCORES**" }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-110", "text": "We exemplify the aforementioned requirements by illustrating the WUPS scorean automatic metric that quantifies performance of the holistic architectures proposed by [27] ." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-111", "text": "This metric is motivated by the development of a 'soft' generalization of accuracy that takes ambiguities of different concepts into account via the set membership measure \u00b5:" }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-112", "text": "where for each i-th question, A i and T i are the answers produced by the architecture and human respectively, and they are represented as bags of words." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-113", "text": "The authors of [27] have proposed using WUP similarity [62] as the membership measure \u00b5 in the WUPS score." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-114", "text": "Such choice of \u00b5 suffers from the aforementioned coverage problem and the whole metric takes only one human interpretation of the question into account." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-115", "text": "Future directions for defining metrics Recent work provides several directions towards improving scores." 
}, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-116", "text": "To deal with ambiguities that stem from different readings of the same question we are collecting more human answers per question and we propose, based on that, two generalizations of WUPS score." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-117", "text": "The first, we call Interpretation Metric, runs Eq. 1 over many human answers and takes the maximal score, so that the machine answer is high if it is similar to at least one human answer." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-118", "text": "However, with many human answers, we can also rank higher the machine answers that are 'socially agreeable' by measuring if they agree with most human answers." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-119", "text": "This can be done by averaging over multiple human answers." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-120", "text": "We call such second extension, Consensus Metric." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-121", "text": "The problem with coverage can be potentially alleviated with vector based representations [12] of the answers." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-122", "text": "Although in this case the coverage issues are less problematic, we understand the concerns that such score is dependent on the training data used to build such representation." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-123", "text": "On the other hand, due to abundance of textual data and recent improvements of vector based approaches [12, 65] , we consider it as a valid alternative to similarities that are based on ontologies." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-124", "text": "Experimental scenarios In many cases, success on challenging learning problems has been accelerated by use of external data in the training, e.g. in object detection [3] ." 
}, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-125", "text": "We believe that a Visual Turing challenge should consists of a sub-task with a prohibited use of auxiliary data to understand how the holistic learners generalize from limited and challenging data in a more established setup." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-126", "text": "On the other hand we should not limit ourselves to such artificial restrictions in building next generation of the holistic learners." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-127", "text": "Therefore open sub-tasks with a permissible use of another sources in the training have to be stated, including: additional vision and language resources, synthetic data and curated questions." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-128", "text": "----------------------------------" }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-129", "text": "**SUMMARY**" }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-130", "text": "The goal of this contribution is to sparkle the discussions about benchmarking holistic architectures on complex and more open tasks." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-131", "text": "We identify particular challenges that holistic tasks should exhibit and exemplify how they are manifested in a recent question answering challenge [27] ." }, { "sent_id": "2606ecb66287c0199f3aa6d95f6774-C001-132", "text": "To judge competing architectures and measure the progress on the task, we suggest several directions to further improve existing metrics, and discuss different experimental scenarios." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "2606ecb66287c0199f3aa6d95f6774-C001-13" ], [ "2606ecb66287c0199f3aa6d95f6774-C001-47" ], [ "2606ecb66287c0199f3aa6d95f6774-C001-69" ], [ "2606ecb66287c0199f3aa6d95f6774-C001-71" ], [ "2606ecb66287c0199f3aa6d95f6774-C001-76" ], [ "2606ecb66287c0199f3aa6d95f6774-C001-77" ], [ "2606ecb66287c0199f3aa6d95f6774-C001-80" ], [ "2606ecb66287c0199f3aa6d95f6774-C001-81" ], [ "2606ecb66287c0199f3aa6d95f6774-C001-87" ], [ "2606ecb66287c0199f3aa6d95f6774-C001-95" ] ], "cite_sentences": [ "2606ecb66287c0199f3aa6d95f6774-C001-13", "2606ecb66287c0199f3aa6d95f6774-C001-47", "2606ecb66287c0199f3aa6d95f6774-C001-69", "2606ecb66287c0199f3aa6d95f6774-C001-71", "2606ecb66287c0199f3aa6d95f6774-C001-76", "2606ecb66287c0199f3aa6d95f6774-C001-77", "2606ecb66287c0199f3aa6d95f6774-C001-80", "2606ecb66287c0199f3aa6d95f6774-C001-81", "2606ecb66287c0199f3aa6d95f6774-C001-87", "2606ecb66287c0199f3aa6d95f6774-C001-95" ] }, "@MOT@": { "gold_contexts": [ [ "2606ecb66287c0199f3aa6d95f6774-C001-21", "2606ecb66287c0199f3aa6d95f6774-C001-22" ], [ "2606ecb66287c0199f3aa6d95f6774-C001-73" ], [ "2606ecb66287c0199f3aa6d95f6774-C001-85" ], [ "2606ecb66287c0199f3aa6d95f6774-C001-91" ] ], "cite_sentences": [ "2606ecb66287c0199f3aa6d95f6774-C001-22", "2606ecb66287c0199f3aa6d95f6774-C001-73", "2606ecb66287c0199f3aa6d95f6774-C001-85", "2606ecb66287c0199f3aa6d95f6774-C001-91" ] }, "@USE@": { "gold_contexts": [ [ "2606ecb66287c0199f3aa6d95f6774-C001-28" ], [ "2606ecb66287c0199f3aa6d95f6774-C001-110" ] ], "cite_sentences": [ "2606ecb66287c0199f3aa6d95f6774-C001-28", "2606ecb66287c0199f3aa6d95f6774-C001-110" ] }, "@FUT@": { "gold_contexts": [ [ "2606ecb66287c0199f3aa6d95f6774-C001-113", "2606ecb66287c0199f3aa6d95f6774-C001-114", "2606ecb66287c0199f3aa6d95f6774-C001-115" ], [ "2606ecb66287c0199f3aa6d95f6774-C001-131", "2606ecb66287c0199f3aa6d95f6774-C001-132" ] ], "cite_sentences": [ "2606ecb66287c0199f3aa6d95f6774-C001-113", "2606ecb66287c0199f3aa6d95f6774-C001-131" ] } } 
}, "ABC_4821d5a283c1d4ba162b58e5fac8bc_2": { "x": [ { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-74", "text": "All essays in the source domain are used as training data." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-78", "text": "As evaluation metric, we use the quadratic weighted kappa (QWK)." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-79", "text": "Baselines." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-2", "text": "In this work, we present an approach based on combining string kernels and word embeddings for automatic essay scoring." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-3", "text": "String kernels capture the similarity among strings based on counting common character ngrams, which are a low-level yet powerful type of feature, demonstrating state-of-theart results in various text classification tasks such as Arabic dialect identification or native language identification." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-4", "text": "To our best knowledge, we are the first to apply string kernels to automatically score essays." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-5", "text": "We are also the first to combine them with a high-level semantic feature representation, namely the bag-of-super-word-embeddings." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-6", "text": "We report the best performance on the Automated Student Assessment Prize data set, in both indomain and cross-domain settings, surpassing recent state-of-the-art deep learning approaches." 
}, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-7", "text": "----------------------------------" }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-75", "text": "Target domain samples are randomly divided into 5 folds, where one fold is used as test data, and the other 4 folds are collected together to sub-sample target domain train data." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-76", "text": "The sub-sample sizes are n t = {10, 25, 50, 100}. The sub-sampling is repeated for 5 times as in (Phandi et al., 2015; Dong and Zhang, 2016) to reduce bias." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-77", "text": "As our approach performs very well in the cross-domain setting, we also present experiments without subsampling data from the target domain, i.e. when the sub-sample size is n t = 0." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-9", "text": "Automatic essay scoring (AES) is the task of assigning grades to essays written in an educational setting, using a computer-based system with natural language processing capabilities." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-10", "text": "The aim of designing such systems is to reduce the involvement of human graders as far as possible." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-11", "text": "AES is a challenging task as it relies on grammar as well as semantics, pragmatics and discourse (Song et al., 2017) ." 
}, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-12", "text": "Although traditional AES methods typically rely on handcrafted features (Larkey, 1998; Foltz et al., 1999; Attali and Burstein, 2006; Dikli, 2006; Wang and Brown, 2008; Chen and He, 2013; Somasundaran et al., 2014; Yannakoudakis et al., 2014; Phandi et al., 2015) , recent results indicate that state-of-the-art deep learning methods reach better performance (Alikaniotis et al., 2016; Dong and Zhang, 2016; Taghipour and Ng, 2016; Song et al., 2017; Tay et al., 2018) , perhaps because these methods are able to capture subtle and complex information that is relevant to the task (Dong and Zhang, 2016) ." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-13", "text": "In this paper, we propose to combine string kernels (low-level character n-gram features) and word embeddings (high-level semantic features) to obtain state-of-the-art AES results." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-14", "text": "Since recent methods based on string kernels have demonstrated remarkable performance in various text classification tasks ranging from authorship identification (Popescu and Grozea, 2012) and sentiment analysis (Gim\u00e9nez-P\u00e9rez et al., 2017; to native language identification (Popescu and Ionescu, 2013; Ionescu et al., 2014; Ionescu, 2015; and dialect identification , we believe that string kernels can reach equally good results in AES." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-15", "text": "To the best of our knowledge, string kernels have never been used for this task." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-16", "text": "As string kernels are a simple approach that relies solely on character n-grams as features, it is fairly obvious that such an approach will not to cover several aspects (e.g.: semantics, discourse) required for the AES task." 
}, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-17", "text": "To solve this problem, we propose to combine string kernels with a recent approach based on word embeddings, namely the bag-of-super-wordembeddings (BOSWE) ." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-18", "text": "To our knowledge, this is the first successful attempt to combine string kernels and word embeddings." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-19", "text": "We evaluate our approach on the Automated Student Assessment Prize data set, in both in-domain and cross-domain settings." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-20", "text": "The empirical results indicate that our approach yields a better performance than state-of-the-art approaches (Phandi et al., 2015; Dong and Zhang, 2016; Tay et al., 2018) ." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-21", "text": "----------------------------------" }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-22", "text": "**METHOD**" }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-23", "text": "String kernels." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-24", "text": "Kernel functions (Shawe-Taylor and Cristianini, 2004) capture the intuitive notion of similarity between objects in a specific domain." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-25", "text": "For example, in text mining, string kernels can be used to measure the pairwise similarity between text samples, simply based on character n-grams." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-26", "text": "Various string kernel functions have been proposed to date (Lodhi et al., 2002; Shawe-Taylor and Cristianini, 2004; Ionescu et al., 2014) ." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-27", "text": "One of the most recent string kernels is the histogram intersection string kernel (HISK) (Ionescu et al., 2014 )." 
}, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-28", "text": "For two strings over an alphabet \u03a3, x, y \u2208 \u03a3 * , the intersection string kernel is formally defined as follows:" }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-29", "text": "where num v (x) is the number of occurrences of n-gram v as a substring in x, and n is the length of v. In our AES experiments, we use the intersection string kernel based on a range of character n-grams." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-30", "text": "We approach AES as a regression task, and employ \u03bd-Support Vector Regression (\u03bd-SVR) (Suykens and Vandewalle, 1999; Shawe-Taylor and Cristianini, 2004) for training." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-31", "text": "Bag-of-super-word-embeddings." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-32", "text": "Word embeddings are long known in the NLP community (Bengio et al., 2003; Collobert and Weston, 2008) , but they have recently become more popular due to the word2vec (Mikolov et al., 2013) framework that enables the building of efficient vector representations from words." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-33", "text": "On top of the word embeddings, developed an approach termed bag-ofsuper-word-embeddings (BOSWE) by adapting an efficient computer vision technique, the bag-ofvisual-words model (Csurka et al., 2004) , for natural language processing tasks." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-34", "text": "The adaptation consists of replacing the image descriptors (Lowe, 2004) useful for recognizing object patterns in images with word embeddings (Mikolov et al., 2013) useful for recognizing semantic patterns in text documents." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-35", "text": "The BOSWE representation is computed as follows." 
}, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-36", "text": "First, each word in the collection of training documents is represented as word vector using a pre-trained word embeddings model." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-37", "text": "Based on the fact that word embeddings carry semantic information by projecting semantically related words in the same region of the embedding space, the next step is to cluster the word vectors in order to obtain relevant semantic clusters of words." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-38", "text": "As in the standard bag-of-visual-words model, the clustering is done by k-means (Leung and Malik, 2001) , and the formed centroids are stored in a randomized forest of k-d trees (Philbin et al., 2007) to reduce search cost." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-39", "text": "The centroid of each cluster is interpreted as a super word embedding or super word vector that embodies all the semantically related word vectors in a small region of the embedding space." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-40", "text": "Every embedded word in the collection of documents is then assigned to the nearest cluster centroid (the nearest super word vector)." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-41", "text": "Put together, the super word vectors generate a vocabulary (codebook) that can further be used to describe each document as a bag-of-super-wordembeddings." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-42", "text": "To obtain the BOSWE represenation for a document, we just have to compute the occurrence count of each super word embedding in the respective document." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-43", "text": "After building the representation, we employ a kernel method to train the BOSWE model for our specific task." 
}, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-44", "text": "To be consistent with the string kernel approach, we choose the histogram intersection kernel and the same regression method, namely \u03bd-SVR." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-45", "text": "Model fusion." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-46", "text": "In the primal form, a linear classifier takes as input a feature matrix X of r samples (rows) with m features (columns) and optimizes a set of weights in order to reproduce the r training labels." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-47", "text": "In the dual form, the linear classifier takes as input a kernel matrix K of r \u00d7 r components, where each component k ij is the similarity between examples x i and x j ." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-48", "text": "Kernel methods work by embedding the data in a Hilbert space and by searching for linear relations in that space, using a learning algorithm." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-49", "text": "The embedding can be performed either (i) implicitly, by directly specifying the similarity function between each pair of samples, or (ii) explicitly, by first giving the embedding map \u03c6 and by computing the inner product between each pair of samples embedded in the Hilbert space." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-50", "text": "For the linear kernel, the associated embedding map is \u03c6(x) = x and options (i) or (ii) are equivalent, i.e. the similarity function is the inner product." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-51", "text": "Hence, the linear kernel matrix K can be obtained as K = X \u00b7 X \u2032 , where X \u2032 is the transpose of X. For other kernels, e.g. 
the histogram intersection kernel, it is not possible to explicitly define the embedding map (Shawe-Taylor and Cristianini, 2004), and the only solution is to adopt option (i) and compute the corresponding kernel matrix directly." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-52", "text": "Therefore, we combine HISK and BOSWE in the dual (kernel) form, by simply summing up the two corresponding kernel matrices." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-53", "text": "However, summing up kernel matrices is equivalent to feature vector concatenation in the primal Hilbert space." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-54", "text": "To better explain this statement, let us suppose that we can define the embedding map of the histogram intersection kernel and, consequently, that we can obtain the corresponding feature matrix of HISK with r \u00d7 m_1 components, denoted by X_1, and the corresponding feature matrix of BOSWE with r \u00d7 m_2 components, denoted by X_2." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-55", "text": "We can now combine HISK and BOSWE in two ways." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-56", "text": "One way is to compute the corresponding kernel matrices K_1 = X_1 \u00b7 X_1\u2032 and K_2 = X_2 \u00b7 X_2\u2032, and to sum the matrices into a single kernel matrix K_+ = K_1 + K_2." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-57", "text": "The other way is to first concatenate the feature matrices into a single feature matrix X_+ = [X_1 X_2] of r \u00d7 (m_1 + m_2) components, and to compute the final kernel matrix using the inner product, i.e. K_+ = X_+ \u00b7 X_+\u2032." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-58", "text": "Either way, the two approaches, HISK and BOSWE, are fused before the learning stage."
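The equivalence claimed here, that summing kernel matrices equals concatenating the explicit feature representations, is easy to verify numerically for linear kernels over two hypothetical feature matrices standing in for X_1 (HISK) and X_2 (BOSWE):

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.standard_normal((4, 3))  # stand-in for the HISK feature matrix
X2 = rng.standard_normal((4, 5))  # stand-in for the BOSWE feature matrix

# Way 1: compute the two kernel matrices, then sum them.
K1 = X1 @ X1.T
K2 = X2 @ X2.T
K_sum = K1 + K2

# Way 2: concatenate the features, then take inner products.
X_cat = np.hstack([X1, X2])
K_cat = X_cat @ X_cat.T

# The two routes agree entry by entry.
assert np.allclose(K_sum, K_cat)
```

For the histogram intersection kernel the explicit map is unavailable, which is why the paper works with Way 1 only; the demo merely confirms that the summed kernel corresponds to a concatenated feature space.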
}, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-59", "text": "As a consequence of kernel summation, the search space of linear patterns grows, which should help the kernel classifier, in our case \u03bd-SVR, to find a better regression function." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-60", "text": "----------------------------------" }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-61", "text": "**EXPERIMENTS**" }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-62", "text": "Data set." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-63", "text": "To evaluate our approach, we use the Automated Student Assessment Prize (ASAP) 1 data set from Kaggle." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-64", "text": "The ASAP data set contains 8 prompts of different genres." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-65", "text": "The number of essays per prompt along with the score ranges are presented in Table 1 ." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-66", "text": "Since the official test data of the ASAP competition is not released to the public, we, as well as others before us (Phandi et al., 2015; Dong and Zhang, 2016; 1 https://www.kaggle.com/c/asap-aes/data Tay et al., 2018) , use only the training data in our experiments." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-67", "text": "Evaluation procedure." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-68", "text": "As Dong and Zhang (2016), we scaled the essay scores into the range 0-1." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-69", "text": "We closely followed the same settings for data preparation as (Phandi et al., 2015; Dong and Zhang, 2016) ." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-70", "text": "For the in-domain experiments, we use 5-fold cross-validation." 
}, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-71", "text": "The 5-fold cross-validation procedure is repeated for 10 times and the results were averaged to reduce the accuracy variation introduced by randomly selecting the folds." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-72", "text": "We note that the standard deviation in all cases in below 0.2%." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-73", "text": "For the cross-domain experiments, we use the same source\u2192target domain pairs as (Phandi et al., 2015; Dong and Zhang, 2016) , namely, 1\u21922, 3\u21924, 5\u21926 and 7\u21928." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-80", "text": "We compare our approach with stateof-the-art methods based on handcrafted features (Phandi et al., 2015) , as well as deep features (Dong and Zhang, 2016; Tay et al., 2018) ." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-81", "text": "We note that results for the cross-domain setting are reported only in some of these recent works (Phandi et al., 2015; Dong and Zhang, 2016) ." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-82", "text": "Implementation choices." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-83", "text": "For the string kernels approach, we used the histogram intersection string kernel (HISK) based on the blended range of character n-grams from 1 to 15." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-84", "text": "To compute the intersection string kernel, we used the open-source code provided by Ionescu et al. (2014) ." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-85", "text": "For the BOSWE approach, we used the pre-trained word embeddings computed by the word2vec toolkit (Mikolov et al., 2013) on the Google News data set using the Skip-gram model, which produces 300-dimensional vectors for 3 million words and phrases." 
}, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-86", "text": "We used functions from the VLFeat li- Table 2 : In-domain automatic essay scoring results of our approach versus several state-of-the-art methods (Phandi et al., 2015; Dong and Zhang, 2016; Tay et al., 2018) ." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-87", "text": "Results are reported in terms of the quadratic weighted kappa (QWK) measure, using 5-fold cross-validation." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-88", "text": "The best QWK score (among the machine learning systems) for each prompt is highlighted in bold." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-89", "text": "brary (Vedaldi and Fulkerson, 2008) for the other steps involved in the BOSWE approach, such as the k-means clustering and the randomized forest of k-d trees." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-90", "text": "We set the number of clusters (dimension of the vocabulary) to k = 500." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-91", "text": "After computing the BOSWE representation, we apply the L 1 -normalized intersection kernel." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-92", "text": "We combine HISK and BOSWE in the dual form by summing up the two corresponding matrices." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-93", "text": "For the learning phase, we employ the dual implementation of \u03bd-SVR available in LibSVM (Chang and Lin, 2011) ." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-94", "text": "We set its regularization parameter to c = 10 3 and \u03bd = 10 \u22121 in all our experiments." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-95", "text": "In-domain results." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-96", "text": "The results for the in-domain automatic essay scoring task are presented in Table 2." 
}, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-97", "text": "In our empirical study, we also include feature ablation results." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-98", "text": "We report the QWK measure on each prompt as well as the overall average." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-99", "text": "We first note that the histogram intersection string kernel alone reaches better overall performance (0.780) than all previous works (Phandi et al., 2015; Dong and Zhang, 2016; Tay et al., 2018) ." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-100", "text": "Remarkably, the overall performance of the HISK is also higher than the inter-human agreement (0.754)." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-101", "text": "Although the BOSWE model can be regarded as a shallow approach, its overall results are comparable to those of deep learning approaches (Dong and Zhang, 2016; Tay et al., 2018) ." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-102", "text": "When we combine the two models (HISK and BOSWE), we obtain even better results." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-103", "text": "Indeed, the combination of string kernels and word embeddings attains the best performance on 7 out of 8 prompts." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-104", "text": "The average QWK score of HISK and BOSWE (0.785) is more than 2% better the average scores of the best-performing state-of-the-art approaches Tay et al., 2018) ." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-105", "text": "Cross-domain results." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-106", "text": "The results for the crossdomain automatic essay scoring task are presented in Table 3 ." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-107", "text": "For each and every source\u2192target pair, we report better results than both state-of-theart methods (Phandi et al., 2015; Dong and Zhang, 2016) ." 
}, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-108", "text": "We observe that the difference between our best QWK scores and the other approaches are sometimes much higher in the cross-domain setting than in the in-domain setting." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-109", "text": "We particularly notice that the difference from (Phandi et al., 2015) when n t = 0 is always higher than 10%." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-110", "text": "Our highest improvement (more than 54%, from 0.187 to 0.728) over (Phandi et al., 2015) is recorded for the pair 5\u21926, when n t = 0." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-111", "text": "Our score in this case (0.728) is even higher than both scores of Phandi et al. (2015) and Dong and Zhang (2016) when they use n t = 50." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-112", "text": "Different from the in-domain setting, we note that the combination of string kernels and word embeddings does not always provide better results than string kernels alone, particularly when the number of target samples (n t ) added into the training set is less or equal to 25." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-113", "text": "Discussion." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-114", "text": "It is worth noting that in a set of preliminary experiments (not included in the paper), we actually considered another approach based on word embeddings." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-115", "text": "We tried to obtain a document embedding by averaging the word vectors for each document." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-116", "text": "We computed the average as well as the standard deviation for each component of the word vectors, resulting in a total of 600 features, since the word vectors are 300-dimensional." 
}, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-117", "text": "We applied this method in the in-domain setting and we obtained a surprisingly low overall QWK score, around 0.251." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-118", "text": "We concluded that this simple apSource\u2192Target Method n t = 0 n t = 10 n t = 25 n t = 50 n t = 100 1\u21922 (Phandi et al., 2015) Table 3 : Corss-domain automatic essay scoring results of our approach versus two state-of-the-art methods (Phandi et al., 2015; Dong and Zhang, 2016) ." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-119", "text": "Results are reported in terms of the quadratic weighted kappa (QWK) measure, using the same evaluation procedure as (Phandi et al., 2015; Dong and Zhang, 2016) ." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-120", "text": "The best QWK scores for each source\u2192target domain pair are highlighted in bold." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-121", "text": "proach is not useful, and decided to use BOSWE instead." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-122", "text": "It would have been interesting to present an error analysis based on the discriminant features weighted higher by the \u03bd-SVR method." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-123", "text": "Unfortunately, this is not possible because our approach works in the dual space and we cannot transform the dual weights into primal weights, as long as the histogram intersection kernel does not have an explicit embedding map associated to it." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-124", "text": "In future work, however, we aim to replace the histogram intersection kernel with the presence bits kernel, which will enable us to perform an error analysis based on the overused or underused patterns, as described by ." 
}, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-125", "text": "----------------------------------" }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-126", "text": "**CONCLUSION**" }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-127", "text": "In this paper, we described an approach based on combining string kernels and word embeddings for automatic essay scoring." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-128", "text": "We compared our approach on the Automated Student Assessment Prize data set, in both in-domain and crossdomain settings, with several state-of-the-art approaches (Phandi et al., 2015; Dong and Zhang, 2016; Tay et al., 2018) ." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-129", "text": "Overall, the in-domain and the cross-domain comparative studies indicate that string kernels, both alone and in combination with word embeddings, attain the best performance on the automatic essay scoring task." }, { "sent_id": "4821d5a283c1d4ba162b58e5fac8bc-C001-130", "text": "Using a shallow approach, we report better results compared to recent deep learning approaches (Dong and Zhang, 2016; Tay et al., 2018) ." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "4821d5a283c1d4ba162b58e5fac8bc-C001-12" ], [ "4821d5a283c1d4ba162b58e5fac8bc-C001-81" ] ], "cite_sentences": [ "4821d5a283c1d4ba162b58e5fac8bc-C001-12", "4821d5a283c1d4ba162b58e5fac8bc-C001-81" ] }, "@DIF@": { "gold_contexts": [ [ "4821d5a283c1d4ba162b58e5fac8bc-C001-20" ], [ "4821d5a283c1d4ba162b58e5fac8bc-C001-80" ], [ "4821d5a283c1d4ba162b58e5fac8bc-C001-86" ], [ "4821d5a283c1d4ba162b58e5fac8bc-C001-99" ], [ "4821d5a283c1d4ba162b58e5fac8bc-C001-107" ], [ "4821d5a283c1d4ba162b58e5fac8bc-C001-108" ], [ "4821d5a283c1d4ba162b58e5fac8bc-C001-109" ], [ "4821d5a283c1d4ba162b58e5fac8bc-C001-110" ], [ "4821d5a283c1d4ba162b58e5fac8bc-C001-111" ], [ "4821d5a283c1d4ba162b58e5fac8bc-C001-118" ], [ "4821d5a283c1d4ba162b58e5fac8bc-C001-128", "4821d5a283c1d4ba162b58e5fac8bc-C001-129", "4821d5a283c1d4ba162b58e5fac8bc-C001-130" ] ], "cite_sentences": [ "4821d5a283c1d4ba162b58e5fac8bc-C001-20", "4821d5a283c1d4ba162b58e5fac8bc-C001-80", "4821d5a283c1d4ba162b58e5fac8bc-C001-86", "4821d5a283c1d4ba162b58e5fac8bc-C001-99", "4821d5a283c1d4ba162b58e5fac8bc-C001-107", "4821d5a283c1d4ba162b58e5fac8bc-C001-108", "4821d5a283c1d4ba162b58e5fac8bc-C001-109", "4821d5a283c1d4ba162b58e5fac8bc-C001-110", "4821d5a283c1d4ba162b58e5fac8bc-C001-111", "4821d5a283c1d4ba162b58e5fac8bc-C001-118", "4821d5a283c1d4ba162b58e5fac8bc-C001-128" ] }, "@MOT@": { "gold_contexts": [ [ "4821d5a283c1d4ba162b58e5fac8bc-C001-66" ] ], "cite_sentences": [ "4821d5a283c1d4ba162b58e5fac8bc-C001-66" ] }, "@SIM@": { "gold_contexts": [ [ "4821d5a283c1d4ba162b58e5fac8bc-C001-66" ], [ "4821d5a283c1d4ba162b58e5fac8bc-C001-69" ], [ "4821d5a283c1d4ba162b58e5fac8bc-C001-73" ], [ "4821d5a283c1d4ba162b58e5fac8bc-C001-76" ], [ "4821d5a283c1d4ba162b58e5fac8bc-C001-119" ] ], "cite_sentences": [ "4821d5a283c1d4ba162b58e5fac8bc-C001-66", "4821d5a283c1d4ba162b58e5fac8bc-C001-69", "4821d5a283c1d4ba162b58e5fac8bc-C001-73", "4821d5a283c1d4ba162b58e5fac8bc-C001-76", 
"4821d5a283c1d4ba162b58e5fac8bc-C001-119" ] }, "@USE@": { "gold_contexts": [ [ "4821d5a283c1d4ba162b58e5fac8bc-C001-66" ], [ "4821d5a283c1d4ba162b58e5fac8bc-C001-69" ], [ "4821d5a283c1d4ba162b58e5fac8bc-C001-73" ], [ "4821d5a283c1d4ba162b58e5fac8bc-C001-119" ] ], "cite_sentences": [ "4821d5a283c1d4ba162b58e5fac8bc-C001-66", "4821d5a283c1d4ba162b58e5fac8bc-C001-69", "4821d5a283c1d4ba162b58e5fac8bc-C001-73", "4821d5a283c1d4ba162b58e5fac8bc-C001-119" ] } } }, "ABC_79b3933e51c5fd412d00829815a958_2": { "x": [ { "sent_id": "79b3933e51c5fd412d00829815a958-C001-205", "text": "Future work will explore other structured prediction tasks, such as parsing and generation." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-228", "text": "We also use BiL-STMs for the inference networks." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-2", "text": "Deep energy-based models are powerful, but pose challenges for learning and inference (Belanger and McCallum, 2016) ." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-3", "text": "Tu and Gimpel (2018) developed an efficient framework for energy-based models by training \"inference networks\" to approximate structured inference instead of using gradient descent." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-4", "text": "However, their alternating optimization approach suffers from instabilities during training, requiring additional loss terms and careful hyperparameter tuning." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-5", "text": "In this paper, we contribute several strategies to stabilize and improve this joint training of energy functions and inference networks for structured prediction." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-6", "text": "We design a compound objective to jointly train both costaugmented and test-time inference networks along with the energy function." 
}, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-7", "text": "We propose joint parameterizations for the inference networks that encourage them to capture complementary functionality during learning." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-8", "text": "We empirically validate our strategies on two sequence labeling tasks, showing easier paths to strong performance than prior work, as well as further improvements with global energy terms." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-9", "text": "* Work done at the University of Chicago and Toyota Technological Institute at Chicago." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-10", "text": "----------------------------------" }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-11", "text": "**INTRODUCTION**" }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-12", "text": "Energy-based modeling (LeCun et al., 2006) associates a scalar compatibility measure to each configuration of input and output variables." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-13", "text": "Belanger and McCallum (2016) formulated deep energy-based models for structured prediction, which they called structured prediction energy networks (SPENs)." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-14", "text": "SPENs use arbitrary neural networks to define the scoring function over input/output pairs." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-15", "text": "However, this flexibility leads to challenges for learning and inference." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-16", "text": "The original work on SPENs used gradient descent for structured inference (Belanger and McCallum, 2016; Belanger et al., 2017) ." 
}, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-17", "text": "Gimpel (2018, 2019) found improvements in both speed and accuracy by replacing the use of gradient descent with a method that trains a neural network (called an \"inference network\") to do inference directly." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-18", "text": "Their formulation, which jointly trains the inference network and energy function, is similar to training in generative adversarial networks (Goodfellow et al., 2014) , which is known to suffer from practical difficulties in training due to the use of alternating optimization (Salimans et al., 2016) ." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-19", "text": "To stabilize training, Tu and Gimpel (2018) experimented with several additional terms in the training objectives, finding performance to be dependent on their inclusion." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-20", "text": "Also, when using the approach of Tu and Gimpel (2018) , there is a mismatch between the training and test-time uses of the trained inference network." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-21", "text": "During training with hinge loss, the inference network is actually trained to do \"costaugmented\" inference." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-22", "text": "However, at test time, the goal is to simply minimize the energy without any cost term." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-23", "text": "Tu and Gimpel (2018) fine-tuned the cost-augmented network to match the test-time criterion, but found only minimal change from this fine-tuning." 
}, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-24", "text": "This suggests that the cost-augmented network was mostly acting as a test-time inference network by convergence, which may be hindering the potential contributions of cost-augmented inference in max-margin structured learning (Tsochantaridis et al., 2004; Taskar et al., 2004) ." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-25", "text": "In this paper, we contribute a new training objective for SPENs that addresses the above concern and also contribute several techniques for stabilizing and improving learning." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-26", "text": "We design a compound objective to jointly train both costaugmented and test-time inference networks along with the energy function." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-27", "text": "In the context of the new objective, we propose shared parameterizations for the two inference networks that encourage them to capture complementary functionality while reducing the total number of parameters being trained." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-28", "text": "Quantitative and qualitative analysis shows clear differences in the characteristics of the trained cost-augmented and test-time inference networks." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-29", "text": "We also present three methods to streamline and stabilize training that help with both the old and new objectives." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-30", "text": "We empirically validate our strategies on two sequence labeling tasks from natural language processing (NLP), namely part-ofspeech tagging and named entity recognition." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-31", "text": "We show easier paths to strong performance than prior work, and further improvements with global energy terms." 
}, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-32", "text": "While SPENs have been used for multiple NLP tasks, including multi-label classification (Belanger and McCallum, 2016) , part-of-speech tagging (Tu and Gimpel, 2018) , and semantic role labeling (Belanger et al., 2017) , they are not widely used in NLP." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-33", "text": "Structured prediction is extremely common in NLP, but is typically approached using methods that are more limited than SPENs (such as conditional random fields) or models that suffer from a train/test mismatch (such as most auto-regressive models)." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-34", "text": "SPENs offer a maximally expressive framework for structured prediction while avoiding the train/test mismatch and therefore offer great potential for NLP." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-35", "text": "However, the training and inference difficulties have deterred NLP researchers." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-36", "text": "Our hope is that our methods can enable SPENs to be applied to a larger set of applications, including generation tasks." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-37", "text": "----------------------------------" }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-38", "text": "**BACKGROUND**" }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-39", "text": "We denote the input space by X ." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-40", "text": "For an input x \u2208 X , we denote the structured output space by Y(x)." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-41", "text": "The entire space of structured outputs is denoted Y = \u222a x\u2208X Y(x)." 
}, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-42", "text": "A SPEN (Belanger and McCallum, 2016) defines an energy function E \u0398 : X \u00d7 Y \u2192 R parameterized by \u0398 that computes a scalar energy for an input/output pair." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-43", "text": "At test time, for a given input x, prediction is done by choosing the output with lowest energy:" }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-44", "text": "However, solving equation (1) requires combinatorial algorithms because Y is a structured, discrete space." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-45", "text": "This becomes intractable when E \u0398 does not decompose into a sum over small \"parts\" of y. Belanger and McCallum (2016) relax this problem by allowing the discrete vector y to be continuous; Y R denotes the relaxed output space." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-46", "text": "They solve the relaxed problem by using gradient descent to iteratively minimize the energy with respect to y. The energy function parameters \u0398 are trained using a structured hinge loss which requires repeated cost-augmented inference during training." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-47", "text": "Using gradient descent for the repeated cost-augmented inference steps is time-consuming and makes learning unstable (Belanger et al., 2017) ." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-48", "text": "Tu and Gimpel (2018) propose an alternative that replaces gradient descent with a neural network trained to do inference, i.e., to mimic the function performed in equation (1)." 
}, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-49", "text": "This \"inference network\" A \u03a8 : X \u2192 Y R is parameterized by \u03a8 and trained with the goal that" }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-50", "text": "When training the energy function parameters \u0398, Tu and Gimpel (2018) replaced the cost-augmented inference step in the structured hinge loss from Belanger and McCallum (2016) with a costaugmented inference network F \u03a6 and trained the energy function and inference network parameters jointly:" }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-51", "text": "where D is the set of training pairs, [h] + = max(0, h), and is a structured cost function that computes the distance between its two arguments." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-52", "text": "Tu and Gimpel (2018) alternatively optimized \u0398 and \u03a6, which is similar to training in generative adversarial networks (Goodfellow et al., 2014) ." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-53", "text": "As alternating optimization can be difficult in practice (Salimans et al., 2016) , Tu & Gimpel experimented with including several additional terms in the above objective to stabilize training." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-54", "text": "We adopt the same learning framework as Tu & Gimpel of jointly learning the energy function and inference network, but we propose a novel objective function that jointly trains a cost-augmented inference network, a test-time inference network, and the energy function." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-55", "text": "The energy functions we use for our sequence labeling tasks are taken from Tu and Gimpel (2018) and are described in detail in the appendix." 
}, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-56", "text": "----------------------------------" }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-57", "text": "**AN OBJECTIVE FOR JOINT LEARNING OF INFERENCE NETWORKS**" }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-58", "text": "We now describe our \"compound\" objective that combines two widely-used losses in structured pre-diction." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-59", "text": "We first present it without inference networks:" }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-60", "text": "This objective contains two different inference problems, which are also the two inference problems that must be solved in structured max-margin learning, whether during training or at test time." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-61", "text": "Eq. (1) shows the test-time inference problem." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-62", "text": "The other one is cost-augmented inference, defined as follows:" }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-63", "text": "arg min" }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-64", "text": "where y * is the gold standard output." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-65", "text": "This inference problem involves finding an output with low energy but high cost relative to the gold standard." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-66", "text": "Thus, it is not well-aligned with the test-time inference problem." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-67", "text": "Tu and Gimpel (2018) used the same inference network for solving both problems, which led them to fine-tune the network at test-time with a different objective." 
}, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-68", "text": "We avoid this issue by training two inference networks, A \u03a8 for test-time inference and F \u03a6 for cost-augmented inference:" }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-69", "text": "As indicated, this loss can be viewed as the sum of the margin-rescaled and perceptron losses for SPEN training with inference networks." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-70", "text": "We treat this optimization problem as a minmax game and find a saddle point for the game similar to Tu and Gimpel (2018) and Goodfellow et al. (2014) ." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-71", "text": "We alternatively optimize \u0398, \u03a6, and \u03a8. The objective for the energy function parameters is:" }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-72", "text": "When we remove 0-truncation (see Sec. 4.1), the objective for the inference network parameters is:" }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-73", "text": "Joint Parameterizations. If we were to train independent inference networks A \u03a8 and F \u03a6 , this new objective could be much slower than the original approach of (Tu and Gimpel, 2018) ." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-74", "text": "However, the compound objective offers several natural options for defining joint parameterizations of the two inference networks." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-75", "text": "We consider three options which are visualized in Figure 1 and described below:" }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-76", "text": "\u2022 Separated: F \u03a6 and A \u03a8 are two independent networks with their own architectures and parameters as shown in Figure 1 (a)." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-77", "text": "\u2022 Shared: F \u03a6 and A \u03a8 share a \"feature\" network as shown in Figure 1 (b)." 
}, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-78", "text": "We consider this option because both F \u03a6 and A \u03a8 are trained to produce output labels with low energy." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-79", "text": "However F \u03a6 also needs to produce output labels with high cost (i.e., far from the gold standard)." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-80", "text": "\u2022 Stacked: the cost-augmented network F \u03a6 is a function of the output of the test-time network A \u03a8 and the gold standard output y. That is," }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-81", "text": "where q is a parameterized function." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-82", "text": "This is depicted in Figure 1 (c)." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-83", "text": "Note that we block the gradient at A \u03a8 when updating \u03a6." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-84", "text": "For the q function in the stacked option, we use an affine transform on the concatenation of the inference network label distribution and the gold standard one-hot vector." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-85", "text": "That is, denoting the vector at position t of the cost-augmented network output by F \u03a6 (x, y) t , we have:" }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-86", "text": "One motivation for these parameterizations is to reduce the total number of parameters in the procedure." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-87", "text": "Generally, the number of parameters is expected to decrease when moving from separated to shared to stacked." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-88", "text": "We will compare the three options empirically in our experiments, in terms of both accuracy and number of parameters." 
}, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-89", "text": "Another motivation, specifically for the third option, is to distinguish the two inference networks in terms of their learned functionality." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-90", "text": "With all three parameterizations, the cost-augmented network will be trained to produce an output that differs from the gold standard, due to the presence of the (F \u03a6 (x), y i ) term in the combined objective." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-91", "text": "However, Tu and Gimpel (2018) found that the trained cost-augmented network was barely affected by fine-tuning for the test-time inference objective." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-92", "text": "This suggests that the cost-augmented network was mostly acting as a test-time inference network by the time of convergence." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-93", "text": "With the stacked parameterization, however, we explicitly provide the gold standard y to the cost-augmented network, permitting it to learn to change the predictions of the test-time network in appropriate ways to improve the energy function." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-94", "text": "----------------------------------" }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-95", "text": "**TRAINING STABILITY AND EFFECTIVENESS**" }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-96", "text": "We now discuss several methods that simplify and stabilize training SPENs with inference networks." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-97", "text": "When describing them, we will illustrate their impact by showing training trajectories for the Twitter part-of-speech tagging task described in Section 6 and the appendix." 
}, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-98", "text": "----------------------------------" }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-99", "text": "**REMOVING ZERO TRUNCATION**" }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-100", "text": "Tu and Gimpel (2018) used the following objective for the cost-augmented inference network (maximizing it with respect to \u03a6):" }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-101", "text": "where [h] + = max(0, h)." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-102", "text": "However, there are two potential reasons why l will equal zero and therefore trigger no gradient update." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-103", "text": "First, E \u0398 (the energy function, corresponding to the discriminator in a GAN) may already be well-trained, and it does a good job separating the gold standard output and the cost-augmented inference network output." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-104", "text": "Or, it may be the case that the cost-augmented inference network (corresponding to the generator in a GAN) is so poorly trained that the energy of its output is extremely large, leading the margin constraints to be satisfied and l 0 to be zero." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-105", "text": "In standard margin-rescaled max-margin learning in structured prediction (Taskar et al., 2004; Tsochantaridis et al., 2004) , the cost-augmented inference step is performed exactly (or approximately with reasonable guarantee of effectiveness), ensuring that when l 0 is 0, the energy parameters are well trained." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-106", "text": "However, in our case, l 0 may be zero simply because the cost-augmented inference network is undertrained, which will be the case early in training." 
}, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-107", "text": "Then, when using zero truncation, the gradient of the inference network parameters will be 0." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-108", "text": "This is likely why Tu and Gimpel (2018) found it important to add several stabilization terms to the l 0 objective." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-109", "text": "We find that by instead removing the truncation, learning stabilizes and becomes less dependent on these additional terms." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-110", "text": "Note that we retain the truncation at zero when updating the energy parameters \u0398." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-111", "text": "As shown in Figure 3 (a) in the appendix, without any stabilization terms and with truncation, the inference network will barely move from its starting point and learning fails overall." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-112", "text": "However, without truncation, the inference network can work well even without any stabilization terms." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-113", "text": "----------------------------------" }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-114", "text": "**LOCAL CROSS ENTROPY (CE) LOSS**" }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-115", "text": "Tu and proposed adding a local cross entropy loss, which is the sum of the label cross entropy losses over all positions in the sequence, to stabilize inference network training." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-116", "text": "We similarly find this term to help speed up conver- gence and improve accuracy." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-117", "text": "Figure 3(b) shows faster convergence to high accuracy when adding the local CE term." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-118", "text": "More comparisons are in Section 7." 
}, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-119", "text": "----------------------------------" }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-120", "text": "**MULTIPLE INFERENCE NETWORK UPDATE STEPS**" }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-121", "text": "When training SPENs with inference networks, the inference network parameters are nested within the energy function." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-122", "text": "We found that the gradient components of the inference network parameters consequently have smaller absolute values than those of the energy function parameters." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-123", "text": "So, we alternate between k \u2265 1 steps of optimizing the inference network parameters (\"I steps\") and one step of optimizing the energy function parameters (\"E steps\")." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-124", "text": "We find this strategy especially helpful when using complex inference network architectures." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-125", "text": "To analyze this, we compute the cost-augmented loss l 1 = (F \u03a6 (x), y) \u2212 E \u0398 (x, F \u03a6 (x)) and the margin-rescaled loss l 0 = [ (F \u03a6 (x), y) \u2212 E \u0398 (x, F \u03a6 (x)) + E \u0398 (x, y)] + averaged over all training pairs (x, y) after each set of I steps." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-126", "text": "The I steps seek to maximize these terms and the E steps seek to minimize them." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-127", "text": "Figs. 2(a) and (b) show l 1 and l 0 during training for different numbers of I steps for every one E step." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-128", "text": "Fig. 2(c) shows the norm of \u2202E \u0398 (x,A \u03a8 (x)) \u2202\u03a8 after the E steps, and Fig. 2(d) shows the norm of \u2202l 0 \u2202\u03a6 after the I steps." 
}, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-129", "text": "With k = 1, the inference network lags behind the energy, making the energy parameter updates very small, as shown by the small norms in Fig. 2(c) ." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-130", "text": "The inference network gradient norm (Fig. 2(d) ) remains high, indicating underfitting." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-131", "text": "However, increasing k too much also harms learning, as evidenced by the \"plateau\" effect in the l 1 curves for k = 50; this indicates that the energy function is lagging behind the inference network." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-132", "text": "Using k = 5 leads to more of a balance between l 1 and l 0 and gradient norms that are mostly decreasing during training." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-133", "text": "We treat k as a hyperparameter that is tuned in our experiments." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-134", "text": "----------------------------------" }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-135", "text": "**GLOBAL ENERGIES FOR SEQUENCE LABELING**" }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-136", "text": "In addition to new training strategies, we also experiment with several global energy terms for sequence labeling." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-137", "text": "Eq. (8) in the appendix shows the base energy." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-138", "text": "To capture long-distance dependencies, we include global energy (GE) terms in the form of Eq. (9)." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-139", "text": "Tu and Gimpel (2018) pretrained their tag language model (TLM) on a large, automatically-tagged corpus and fixed its parameters when optimizing \u0398. We instead do not pretrain the TLM and learn its parameters when training \u0398." 
}, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-140", "text": "We also propose new global energy terms." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-141", "text": "Define y t = h(y 0 , . . . , y t\u22121 ) where h is an LSTM TLM that takes a sequence of labels as input and returns a distribution over next labels." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-142", "text": "First, we add a TLM in the backward direction (denoted y t analogously to the forward TLM)." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-143", "text": "Second, we include words as additional inputs to forward and backward TLMs." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-144", "text": "We define y t = g(x 0 , ..., x t\u22121 , y 0 , ..., y t\u22121 ) where g is a forward LSTM TLM." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-145", "text": "We define the backward version similarly (denoted y t )." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-146", "text": "The global energy is therefore" }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-147", "text": "Here \u03b3 is a hyperparameter that is tuned." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-148", "text": "We experiment with three settings for the global energy: GE(a): forward TLM as in Tu and Gimpel (2018) ; GE(b): forward and backward TLMs (\u03b3 = 0); GE(c): all four TLMs in Eq. (7)." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-149", "text": "----------------------------------" }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-150", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-151", "text": "We consider two sequence labeling tasks: Twitter part-of-speech (POS) tagging (Gimpel et al., 2011) and named entity recognition (NER; Tjong Kim Sang and De Meulder, 2003) , described in detail in the appendix." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-152", "text": "We consider three NER modeling configurations." 
}, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-153", "text": "NER uses only words as input and pretrained, fixed GloVe embeddings." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-154", "text": "NER+ uses words, the case of the first letter, POS" }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-155", "text": "----------------------------------" }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-156", "text": "**RESULTS AND ANALYSIS**" }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-157", "text": "Effect of Removing Truncation." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-158", "text": "Table 1 shows results for the margin-rescaled and perceptron losses when considering the removal of zero truncation and its interaction with the use of the local CE term." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-159", "text": "Training fails for both tasks when using zero truncation without the CE term." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-160", "text": "Removing truncation makes learning succeed and leads to effective models even without using CE." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-161", "text": "However, when using the local CE term, truncation has little effect on performance." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-162", "text": "The importance of CE in prior work (Tu and Gimpel, 2018) is likely due to the fact that truncation was being used." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-163", "text": "Effect of Local CE." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-164", "text": "The local cross entropy (CE) term is useful for both tasks, though it appears more helpful for tagging." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-165", "text": "This may be because POS tagging is a more local task." 
}, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-166", "text": "Regardless, for both tasks, as shown in Section 4.2, the inclusion of the CE term speeds convergence and improves training stability." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-167", "text": "For example, on NER, using the CE term reduces the number of epochs chosen by early stopping from \u223c100 to \u223c25." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-168", "text": "On Twitter POS Tagging, using the CE term reduces the number of epochs chosen by early stopping from \u223c150 to \u223c60." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-169", "text": "Effect of Compound Objective and Joint Parameterizations." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-170", "text": "The compound objective is the sum of the margin-rescaled and perceptron losses, and outperforms them both (see Table 2 )." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-171", "text": "Across all tasks, the shared and stacked parameterizations are more accurate than the previous objectives." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-172", "text": "For the separated parameterization, the performance Table 3 : Top: differences in accuracy/F1 between test-time inference networks A \u03a8 and cost-augmented networks F \u03a6 (on development sets)." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-173", "text": "The \"marginrescaled\" row uses a SPEN with the local CE term and without zero truncation, where A \u03a8 is obtained by finetuning F \u03a6 as done by Tu and Gimpel (2018) ." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-174", "text": "Bottom: most frequent output differences between A \u03a8 and F \u03a6 on the development set." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-175", "text": "drops slightly for NER, likely due to the larger number of parameters." 
}, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-176", "text": "The shared and stacked options also have fewer parameters to train than the separated option, and the stacked version processes examples at the fastest rate during training." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-177", "text": "The left part of Table 3 shows how the performance of the test-time inference network A \u03a8 and the cost-augmented inference network F \u03a6 vary when using the new compound objective." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-178", "text": "The differences between F \u03a6 and A \u03a8 are larger than in the baseline configuration, showing that the two are learning complementary functionality." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-179", "text": "With the stacked parameterization, the cost-augmented network F \u03a6 receives as an additional input the gold standard label sequence, which leads to the largest differences as the cost-augmented network can explicitly favor incorrect labels." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-180", "text": "1 The right part of Table 3 shows qualitative differences between the two inference networks." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-181", "text": "On the POS development set, we count the differences between the predictions of A \u03a8 and F \u03a6 when A \u03a8 makes the correct prediction." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-182", "text": "2 The most frequent combinations show that F \u03a6 tends to output tags that are highly confusable with those output by A \u03a8 ." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-183", "text": "For example, it often outputs proper noun when the gold standard is common noun or vice versa." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-184", "text": "It also captures the noun-verb ambiguity and ambiguities among adverbs, adjectives, and prepositions." 
}, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-185", "text": "Global Energies." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-186", "text": "The results are shown in Table 4 ." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-187", "text": "Adding the backward (b) and word-augmented TLMs (c) improves over only using the forward TLM from Tu and Gimpel (2018) ." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-188", "text": "With the global energies, our performance is comparable to several strong results (cf. 90.94 of Lample et al., 2016 and 91.37 of Ma and Hovy, 2016) ." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-189", "text": "However, it is still lower than the state of the art (Akbik et al., 2018; Devlin et al., 2019) , likely due to the lack of contextualized embeddings." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-190", "text": "----------------------------------" }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-191", "text": "**RELATED WORK**" }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-192", "text": "Aside from the relevant work discussed already, there are several efforts aimed at stabilizing and improving learning in adversarial frameworks, for example those developed for generative adversarial networks (GANs) (Goodfellow et al., 2014; Salimans et al., 2016; Zhao et al., 2017; ." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-193", "text": "Progress in training GANs has come largely from overcoming learning difficulties by modifying loss functions and optimization, and GANs have become more successful and popular as a result." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-194", "text": "Notably, Wasserstein GANs provided the first convergence measure in GAN training using Wasserstein distance." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-195", "text": "To compute Wasserstein distance, the discriminator uses weight clipping, which limits network capacity." 
}, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-196", "text": "Weight clipping was subsequently replaced with a gradient norm constraint (Gulrajani et al., 2017) ." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-197", "text": "Miyato et al. (2018) proposed a novel weight normalization technique called spectral normalization." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-198", "text": "These methods may be applicable to the similar optimization problems solved in learning SPENs." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-199", "text": "Another direction may be to explore alternative training objectives for SPENs, such as those that use weaker supervision than complete structures (Rooshenas et al., 2018) ." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-200", "text": "----------------------------------" }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-201", "text": "**CONCLUSIONS**" }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-202", "text": "We contributed several strategies to stabilize and improve joint training of SPENs and inference networks." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-203", "text": "Our use of joint parameterizations mitigates the need for fine-tuning of inference networks, leads to complementarity in the learned costaugmented and test-time networks, and yields improved performance overall." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-204", "text": "These developments offer promise for SPENs to be more easily trained and deployed for a broad range of NLP tasks." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-206", "text": "We have taken initial steps in this direction, experimenting with constituency parsing using the attentionaugmented sequence-to-sequence model of Tran et al. (2018) ." 
}, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-207", "text": "Preliminary experiments are positive, 3 but significant challenges remain, specifically Our experiments in this paper consider sequence labeling tasks, so the input x is a length-T sequence of tokens where x t denotes the token at position t. The output y is a sequence of labels also of length T ." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-208", "text": "We use y t to denote the output label at position t, where y t is a vector of length L (the number of labels in the label set) and where y t,j is the jth entry of the vector y t ." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-209", "text": "In the original output space Y(x), y t,j is 1 for a single j and 0 for all others." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-210", "text": "In the relaxed output space Y R (x), y t,j can be interpreted as the probability of the tth position being labeled with label j. We then use the following energy for sequence labeling (Tu and Gimpel, 2018) :" }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-211", "text": "where U j \u2208 R d is a parameter vector for label j and the parameter matrix W \u2208 R L\u00d7L contains label pair parameters." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-212", "text": "Also, b(x, t) \u2208 R d denotes the \"input feature vector\" for position t. We define it to be the d-dimensional BiLSTM (Hochreiter and Schmidhuber, 1997) hidden vector at t. The full set of energy parameters \u0398 includes the U j vectors, W , and the parameters of the BiLSTM." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-213", "text": "Tu and Gimpel (2018) also added a global energy term that they referred to as a \"tag language model\" (TLM)." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-214", "text": "We use h to denote an LSTM TLM that takes a sequence of labels as input and returns a distribution over next labels." 
}, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-215", "text": "We define y t = h(y 0 , . . . , y t\u22121 )." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-216", "text": "Then, the energy term is:" }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-217", "text": "where y 0 is the start-of-sequence symbol and y T +1 is the end-of-sequence symbol." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-218", "text": "This energy returns the negative log-likelihood under the TLM of the candidate output y." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-219", "text": "For inference networks, we use architectures similar to those used by (Tu and Gimpel, 2018) ." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-220", "text": "In particular, we choose BiLSTMs as the inference network architectures in our experiments." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-221", "text": "We also use BiLSTMs for the baselines and both the inference networks and baseline models use the same hidden sizes." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-222", "text": "----------------------------------" }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-223", "text": "**A.2 EXPERIMENTAL SETUP DETAILS**" }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-224", "text": "Twitter Part-of-Speech (POS) Tagging." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-225", "text": "We use the Twitter POS data from Gimpel et al. (2011) and Owoputi et al. (2013) which contains 25 tags." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-226", "text": "We use 100-dimensional skip-gram (Mikolov et al., 2013) embeddings from Tu et al. (2017) ." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-227", "text": "Like Tu and Gimpel (2018) , we use a BiLSTM to compute the input feature vector for each position, using hidden dimension of size 100." 
}, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-229", "text": "The output of the inference network is a softmax function, so the inference network will produce a distribution over labels at each position." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-230", "text": "The \u2206 is L1 distance." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-231", "text": "We train the inference network using stochastic gradient descent (SGD) with momentum and train the energy parameters using Adam (Kingma and Ba, 2014) ." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-232", "text": "We also explore training the inference network using Adam when we do not use the local CE loss." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-233", "text": "4 In experiments with the local CE term, its weight is set to 1." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-234", "text": "Named Entity Recognition (NER)." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-235", "text": "We use the CoNLL 2003 English dataset (Tjong Kim Sang and De Meulder, 2003; Ma and Hovy, 2016; Lample et al., 2016) ." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-236", "text": "We use the BIOES tagging scheme, following previous work (Ratinov and Roth, 2009) , resulting in 17 NER labels." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-237", "text": "We use 100-dimensional pretrained GloVe embeddings (Pennington et al., 2014) ." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-238", "text": "The task is evaluated using F1 score computed with the conlleval script." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-239", "text": "The architectures for the feature networks in the energy function and inference networks are all BiLSTMs." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-240", "text": "The architectures for tag language models are LSTMs." 
}, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-241", "text": "We use a dropout keep-prob of 0.7 for all LSTM cells." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-242", "text": "The hidden size for all LSTMs is 128." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-243", "text": "We use Adam (Kingma and Ba, 2014) and do early stopping on the development set." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-244", "text": "The hyperparameter k (the number of I steps) is tuned over the set {1, 2, 5, 10, 50}. \u03b3 is tuned over the set {0, 0.5, 1}. 4 We find that Adam works better than SGD when training the inference network without the local cross entropy term." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-245", "text": "The three curves for each setting correspond to different random seeds." }, { "sent_id": "79b3933e51c5fd412d00829815a958-C001-246", "text": "Tu and Gimpel (2018) use truncation and CE during training." } ], "y": { "@MOT@": { "gold_contexts": [ [ "79b3933e51c5fd412d00829815a958-C001-3", "79b3933e51c5fd412d00829815a958-C001-4" ], [ "79b3933e51c5fd412d00829815a958-C001-23" ], [ "79b3933e51c5fd412d00829815a958-C001-67", "79b3933e51c5fd412d00829815a958-C001-68" ] ], "cite_sentences": [ "79b3933e51c5fd412d00829815a958-C001-3", "79b3933e51c5fd412d00829815a958-C001-4", "79b3933e51c5fd412d00829815a958-C001-23", "79b3933e51c5fd412d00829815a958-C001-67" ] }, "@BACK@": { "gold_contexts": [ [ "79b3933e51c5fd412d00829815a958-C001-19" ], [ "79b3933e51c5fd412d00829815a958-C001-20" ], [ "79b3933e51c5fd412d00829815a958-C001-23" ], [ "79b3933e51c5fd412d00829815a958-C001-32" ], [ "79b3933e51c5fd412d00829815a958-C001-48" ], [ "79b3933e51c5fd412d00829815a958-C001-50" ], [ "79b3933e51c5fd412d00829815a958-C001-52" ], [ "79b3933e51c5fd412d00829815a958-C001-53" ], [ "79b3933e51c5fd412d00829815a958-C001-91" ], [ "79b3933e51c5fd412d00829815a958-C001-100" ], [ "79b3933e51c5fd412d00829815a958-C001-162" ], [ "79b3933e51c5fd412d00829815a958-C001-246" 
] ], "cite_sentences": [ "79b3933e51c5fd412d00829815a958-C001-19", "79b3933e51c5fd412d00829815a958-C001-20", "79b3933e51c5fd412d00829815a958-C001-23", "79b3933e51c5fd412d00829815a958-C001-32", "79b3933e51c5fd412d00829815a958-C001-48", "79b3933e51c5fd412d00829815a958-C001-50", "79b3933e51c5fd412d00829815a958-C001-52", "79b3933e51c5fd412d00829815a958-C001-53", "79b3933e51c5fd412d00829815a958-C001-91", "79b3933e51c5fd412d00829815a958-C001-100", "79b3933e51c5fd412d00829815a958-C001-162", "79b3933e51c5fd412d00829815a958-C001-246" ] }, "@USE@": { "gold_contexts": [ [ "79b3933e51c5fd412d00829815a958-C001-54" ], [ "79b3933e51c5fd412d00829815a958-C001-55" ], [ "79b3933e51c5fd412d00829815a958-C001-173" ], [ "79b3933e51c5fd412d00829815a958-C001-187" ], [ "79b3933e51c5fd412d00829815a958-C001-210" ], [ "79b3933e51c5fd412d00829815a958-C001-227" ] ], "cite_sentences": [ "79b3933e51c5fd412d00829815a958-C001-54", "79b3933e51c5fd412d00829815a958-C001-55", "79b3933e51c5fd412d00829815a958-C001-173", "79b3933e51c5fd412d00829815a958-C001-187", "79b3933e51c5fd412d00829815a958-C001-210", "79b3933e51c5fd412d00829815a958-C001-227" ] }, "@SIM@": { "gold_contexts": [ [ "79b3933e51c5fd412d00829815a958-C001-54" ], [ "79b3933e51c5fd412d00829815a958-C001-70" ], [ "79b3933e51c5fd412d00829815a958-C001-148" ], [ "79b3933e51c5fd412d00829815a958-C001-173" ], [ "79b3933e51c5fd412d00829815a958-C001-213", "79b3933e51c5fd412d00829815a958-C001-214" ], [ "79b3933e51c5fd412d00829815a958-C001-219" ], [ "79b3933e51c5fd412d00829815a958-C001-227" ] ], "cite_sentences": [ "79b3933e51c5fd412d00829815a958-C001-54", "79b3933e51c5fd412d00829815a958-C001-70", "79b3933e51c5fd412d00829815a958-C001-148", "79b3933e51c5fd412d00829815a958-C001-173", "79b3933e51c5fd412d00829815a958-C001-213", "79b3933e51c5fd412d00829815a958-C001-219", "79b3933e51c5fd412d00829815a958-C001-227" ] }, "@DIF@": { "gold_contexts": [ [ "79b3933e51c5fd412d00829815a958-C001-54" ], [ "79b3933e51c5fd412d00829815a958-C001-73" ], [ 
"79b3933e51c5fd412d00829815a958-C001-108", "79b3933e51c5fd412d00829815a958-C001-109" ], [ "79b3933e51c5fd412d00829815a958-C001-139" ] ], "cite_sentences": [ "79b3933e51c5fd412d00829815a958-C001-54", "79b3933e51c5fd412d00829815a958-C001-73", "79b3933e51c5fd412d00829815a958-C001-108", "79b3933e51c5fd412d00829815a958-C001-139" ] }, "@EXT@": { "gold_contexts": [ [ "79b3933e51c5fd412d00829815a958-C001-54" ] ], "cite_sentences": [ "79b3933e51c5fd412d00829815a958-C001-54" ] } } }, "ABC_03b7c2e050957dcff336183823e6f1_2": { "x": [ { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-2", "text": "We use a generative history-based model to predict the most likely derivation of a dependency parse." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-3", "text": "Our probabilistic model is based on Incremental Sigmoid Belief Networks, a recently proposed class of latent variable models for structure prediction." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-4", "text": "Their ability to automatically induce features results in multilingual parsing which is robust enough to achieve accuracy well above the average for each individual language in the multilingual track of the CoNLL-2007 shared task." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-5", "text": "This robustness led to the third best overall average labeled attachment score in the task, despite using no discriminative methods." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-6", "text": "We also demonstrate that the parser is quite fast, and can provide even faster parsing times without much loss of accuracy." 
}, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-7", "text": "----------------------------------" }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-9", "text": "The multilingual track of the CoNLL-2007 shared task (Nivre et al., 2007) considers dependency parsing of texts written in different languages." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-10", "text": "It requires use of a single dependency parsing model for the entire set of languages; model parameters are estimated individually for each language on the basis of provided training sets." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-11", "text": "We use a recently proposed dependency parser (Titov and Henderson, 2007b ) 1 which has demonstrated state-of-theart performance on a selection of languages from the CoNLL-X shared task (Buchholz and Marsi, 2006) ." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-12", "text": "This parser employs a latent variable model, Incremental Sigmoid Belief Networks (ISBNs), to define a generative history-based model of projective parsing." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-13", "text": "We used the pseudo-projective transformation introduced in (Nivre and Nilsson, 2005) to cast non-projective parsing tasks as projective." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-14", "text": "Following (Nivre et al., 2006) , the encoding scheme called HEAD in (Nivre and Nilsson, 2005 ) was used to encode the original non-projective dependencies in the labels of the projectivized dependency tree." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-15", "text": "In the following sections we will briefly discuss our modifications to the ISBN parser, experimental setup, and achieved results." 
}, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-16", "text": "----------------------------------" }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-17", "text": "**THE PROBABILITY MODEL**" }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-18", "text": "Our probability model uses the parsing order proposed in (Nivre et al., 2004) , but instead of performing deterministic parsing as in (Nivre et al., 2004) , this ordering is used to define a generative historybased model, by adding word prediction to the Shift parser action." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-19", "text": "We also decomposed some parser actions into sub-sequences of decisions." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-20", "text": "We split arc prediction decisions (Left-Arc r and Right-Arc r ) each into two elementary decisions: first the parser creates the corresponding arc, then it assigns a relation r to the arc." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-21", "text": "Similarly, we decompose the decision to shift a word into a decision to shift and a prediction of the word." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-22", "text": "We used part-of-speech tags and fine-grain word features, which are given in the data, to further decompose word predictions." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-23", "text": "First we predict the fine-grain part-of-speech tag for the word, then the set of word features (treating each set as an atomic value), and only then the particu-lar word form." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-24", "text": "This approach allows us to both decrease the effect of sparsity and to avoid normalization across all the words in the vocabulary, significantly reducing the computational expense of word prediction." 
}, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-25", "text": "When conditioning on words, we treated each word feature individually, as this proved to be useful in (Titov and Henderson, 2007b) ." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-26", "text": "The probability of each parser decision, conditioned on the complete parse history, is modeled using a form a graphical model called Incremental Sigmoid Belief Networks." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-27", "text": "ISBNs, originally proposed for constituent parsing in (Titov and Henderson, 2007a) , use vectors of binary latent variables to encode information about the parse history." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-28", "text": "These history variables are similar to the hidden state of a Hidden Markov Model. But unlike the graphical model for an HMM, which would specify conditional dependency edges only between adjacent states in the parse history, the ISBN graphical model can specify conditional dependency edges between latent variables which are arbitrarily far apart in the parse history." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-29", "text": "The source state of such an edge is determined by the partial parse structure built at the time of the destination state, thereby allowing the conditional dependency edges to be appropriate for the structural nature of the parsing problem." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-30", "text": "In particular, they allow conditional dependencies to be local in the parse structure, not just local in the history sequence." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-31", "text": "In this they are similar to the class of neural networks proposed in (Henderson, 2003) for constituent parsing." 
}, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-32", "text": "In fact, in (Titov and Henderson, 2007a ) it was shown that this neural network can be viewed as a coarse approximation to the corresponding ISBN model." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-33", "text": "Traditional statistical parsing models also condition on features which are local in the parse structure, but these features need to be explicitly defined before learning, and require careful feature selection." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-34", "text": "This is especially difficult for languages unknown to the parser developer, since the number of possible features grows exponentially with the structural distance considered." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-35", "text": "The ISBN model uses an alternative approach, where latent variables are used to induce features during learning." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-36", "text": "The most important problem in designing an ISBN is to define an appropriate structural locality for each parser decision." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-37", "text": "This is done by choosing a fixed set of relationships between parser states, where the information which is needed to make the decision at the earlier state is also useful in making the decision at the later state." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-38", "text": "The latent variables for these related states are then connected with conditional dependency edges in the ISBN graphical model." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-39", "text": "Longer conditional dependencies are then possible through chains of these immediate conditional dependencies, but there is an inductive bias toward shorter chains." 
}, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-40", "text": "This bias makes it important that the set of chosen relationships defines an appropriate notion of locality." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-41", "text": "However, as long as there exists some chain of relationships between any two states, then any statistical dependency which is clearly manifested in the data can be learned, even if it was not foreseen by the designer." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-42", "text": "This provides a potentially powerful form of feature induction, which is nonetheless biased toward a notion of locality appropriate for the nature of the problem." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-43", "text": "In our experiments we use the same definition of structural locality as was proposed for the ISBN dependency parser in (Titov and Henderson, 2007b) ." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-44", "text": "The current state is connected to previous states using a set of 7 distinct relationships defined in terms of each state's parser configuration, which includes of a stack and a queue." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-45", "text": "Specifically, the current state is related to the last previous state whose parser configuration has: the same queue, the same stack, a stack top which is the rightmost right child of the current stack top, a stack top which is the leftmost left child of the current stack top, a front of the queue which is the leftmost child of the front of the current queue, a stack top which is the head word of the current stack top, a front of the queue which is the current stack top." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-46", "text": "Different model parameters are trained for each of these 7 types of relationship, but the same parameters are used everywhere in the graphical model where the relationship holds." 
}, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-47", "text": "Each latent variable in the ISBN parser is also conditionally dependent on a set of explicit features of the parsing history." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-48", "text": "As long as these explicit features include all the new information from the last parser decision, the performance of the model is not very sensitive to this design choice." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-49", "text": "We used the base feature model defined in (Nivre et al., 2006) for all the languages but Arabic, Chinese, Czech, and Turkish." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-50", "text": "For Arabic, Chinese, and Czech, we used the same feature models used in the CoNLL-X shared task by (Nivre et al., 2006) , and for Turkish we used again the base feature model but extended it with a single feature: the part-of-speech tag of the token preceding the current top of the stack." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-51", "text": "----------------------------------" }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-52", "text": "**PARSING**" }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-53", "text": "Exact inference in ISBN models is not tractable, but effective approximations were proposed in (Titov and Henderson, 2007a) ." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-54", "text": "Unlike (Titov and Henderson, 2007b ), in the shared task we used only the simplest feed-forward approximation, which replicates the computation of a neural network of the type proposed in (Henderson, 2003) ." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-55", "text": "We would expect better performance with the more accurate approximation based on variational inference proposed and evaluated in (Titov and Henderson, 2007a )." 
}, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-56", "text": "We did not try this because, on larger treebanks it would have taken too long to tune the model with this better approximation, and using different approximation methods for different languages would not be compatible with the shared task rules." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-57", "text": "To search for the most probable parse, we use the heuristic search algorithm described in (Titov and Henderson, 2007b) , which is a form of beam search." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-58", "text": "In section 4 we show that this search leads to quite efficient parsing." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-59", "text": "To overcome a minor shortcoming of the parsing algorithm of (Nivre et al., 2004) we introduce a simple language independent post-processing step." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-60", "text": "Nivre's parsing algorithm allows unattached nodes to stay on the stack at the end of parsing, which is reasonable for treebanks with unlabeled attachment to root." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-61", "text": "However, this sometimes happens with languages where only labeled attachment to root is allowed." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-62", "text": "In these cases (only 35 tokens in Greek, 17 in Czech, 1 in Arabic, on the final testing set) we attached them using a simple rule: if there are no tokens in the sentence attached to root, then the considered token is attached to root with the most frequent root-attachment relation used for its part-ofspeech tag." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-63", "text": "If there are other root-attached tokens in the sentence, it is attached to the next root-attached token with the most frequent relation." 
}, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-64", "text": "Preference is given to the most frequent attachment direction for its part-of-speech tag." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-65", "text": "This rule guarantees that no loops are introduced by the post-processing." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-66", "text": "----------------------------------" }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-67", "text": "**EXPERIMENTS**" }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-68", "text": "We evaluated the ISBN parser on all the languages considered in the shared task (Haji\u010d et al., 2004; Aduriz et al., 2003; Mart\u00ed et al., 2007; Chen et al., 2003; B\u00f6hmov\u00e1 et al., 2003; Marcus et al., 1993; Johansson and Nugues, 2007; Prokopidis et al., 2005; Csendes et al., 2005; Montemagni et al., 2003; Oflazer et al., 2003) ." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-69", "text": "ISBN models were trained using a small development set taken out from the training set, which was used for tuning learning and decoding parameters, for early stopping and very coarse feature engineering." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-70", "text": "2 The sizes of the development sets were different: starting from less than 2,000 tokens for smaller treebanks to 5,000 tokens for the largest one." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-71", "text": "The relatively small sizes of the development sets limited our ability to perform careful feature selection, but this should not have significantly affected the model performance, as discussed in section 2." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-72", "text": "3 We used frequency cutoffs: we ignored any property (word form, lemma, feature) which occurs in the training set less than a given threshold." 
}, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-73", "text": "We used a threshold of 20 for Greek and Chinese and a threshold of 5 for the rest." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-74", "text": "Because cardinalities of each of these sets (sets of word forms, lemmas and features) effect the model efficiency, we selected the larger threshold when validation results with the smaller threshold were comparable." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-75", "text": "For the ISBN latent variables, we used vectors of length 80, based on our previous experience." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-76", "text": "Results on the final testing set are presented in table 1." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-77", "text": "The model achieves relatively high scores on each individual language, significantly better than each average result in the shared task." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-78", "text": "This leads to the third best overall average results in the shared task, both in average labeled attachment score and in average unlabeled attachment score." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-79", "text": "The absolute error increase in labeled attachment score over the best system is only 0.4%." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-80", "text": "We attribute ISBN's success mainly to its ability to automatically induce features, as this significantly reduces the risk of omitting any important highly predictive features." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-81", "text": "This makes an ISBN parser a particularly good baseline when considering a new treebank or language, be-Ara Bas Cat Chi Cze Eng Gre Hun Ita cause it does not require much effort in feature engineering." 
}, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-82", "text": "As was demonstrated in (Titov and Henderson, 2007b) , even a minimal set of local explicit features achieves results which are non-significantly different from a carefully chosen set of explicit features, given the language independent definition of locality described in section 2." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-83", "text": "It is also important to note that the model is quite efficient." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-84", "text": "Figure 1 shows the tradeoff between accuracy and parsing time as the width of the search beam is varied, on the development set." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-85", "text": "This curve plots the average labeled attachment score over Basque, Chinese, English, and Turkish as a function of parsing time per token." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-86", "text": "4 Accuracy of only 1% below the maximum can be achieved with average processing time of 17 ms per token, or 60 tokens per second." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-87", "text": "5 We also refer the reader to (Titov and Henderson, 2007b) for more detailed analysis of the ISBN dependency parser results, where, among other things, it was shown that the ISBN model is especially accurate at modeling long dependencies." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-88", "text": "----------------------------------" }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-89", "text": "**CONCLUSION**" }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-90", "text": "We evaluated the ISBN dependency parser in the multilingual shared task setup and achieved competitive accuracy on every language, and the third best average score overall." 
}, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-91", "text": "The proposed model requires minimal design effort because it relies mostly on automatic feature induction, which is highly desirable when using new treebanks or languages." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-92", "text": "The parsing time needed to achieve high accuracy is also quite small, making this model a good candidate for use in practical applications." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-93", "text": "The fact that our model defines a probability model over parse trees, unlike the previous stateof-the-art methods (Nivre et al., 2006; McDonald et al., 2006) , makes it easier to use this model in applications which require probability estimates, such as in language processing pipelines or for language modeling." }, { "sent_id": "03b7c2e050957dcff336183823e6f1-C001-94", "text": "Also, as with any generative model, it should be easy to improve the parser's accuracy with discriminative reranking, such as discriminative retraining techniques (Henderson, 2004) or data-defined kernels (Henderson and Titov, 2005) , with or even without the introduction of any additional linguistic features." 
} ], "y": { "@USE@": { "gold_contexts": [ [ "03b7c2e050957dcff336183823e6f1-C001-11" ], [ "03b7c2e050957dcff336183823e6f1-C001-43" ], [ "03b7c2e050957dcff336183823e6f1-C001-57" ] ], "cite_sentences": [ "03b7c2e050957dcff336183823e6f1-C001-11", "03b7c2e050957dcff336183823e6f1-C001-43", "03b7c2e050957dcff336183823e6f1-C001-57" ] }, "@MOT@": { "gold_contexts": [ [ "03b7c2e050957dcff336183823e6f1-C001-25" ] ], "cite_sentences": [ "03b7c2e050957dcff336183823e6f1-C001-25" ] }, "@SIM@": { "gold_contexts": [ [ "03b7c2e050957dcff336183823e6f1-C001-43" ], [ "03b7c2e050957dcff336183823e6f1-C001-82" ] ], "cite_sentences": [ "03b7c2e050957dcff336183823e6f1-C001-43", "03b7c2e050957dcff336183823e6f1-C001-82" ] }, "@DIF@": { "gold_contexts": [ [ "03b7c2e050957dcff336183823e6f1-C001-54" ] ], "cite_sentences": [ "03b7c2e050957dcff336183823e6f1-C001-54" ] }, "@UNSURE@": { "gold_contexts": [ [ "03b7c2e050957dcff336183823e6f1-C001-87" ] ], "cite_sentences": [ "03b7c2e050957dcff336183823e6f1-C001-87" ] } } }, "ABC_9a8b29b10539be9e6c65172a97b16f_2": { "x": [ { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-89", "text": "----------------------------------" }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-90", "text": "**WHY CONTEXTUAL ATTENTION?**" }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-2", "text": "The original goal of any social media platform is to facilitate users to indulge in healthy and meaningful conversations. But more often than not, it has been found that it becomes an avenue for wanton attacks." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-3", "text": "We want to alleviate this issue and hence we try to provide a detailed analysis of how abusive behavior can be monitored in Twitter." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-4", "text": "The complexity of the natural language constructs makes this task challenging." 
}, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-5", "text": "We show how applying contextual attention to Long Short Term Memory networks help us give near state of art results on multiple benchmarks abuse detection data sets from Twitter." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-6", "text": "----------------------------------" }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-7", "text": "**INTRODUCTION & RELATED WORK**" }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-8", "text": "Any social interaction involves an exchange of viewpoints and thoughts. But these views and thoughts can be caustic." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-9", "text": "Often we see that users resort to verbal abuse to win an argument or overshadow someone's opinion." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-10", "text": "On Twitter, people from every sphere have experienced online abuse." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-11", "text": "Be it a famous celebrity with millions of followers or someone representing a marginalized community such as LGBTQ, Women and more." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-12", "text": "We want to channelize Natural Language Processing (NLP) for social good and aid in the process of flagging abusive tweets and users." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-13", "text": "Detecting abuse on Twitter can be challenging, particularly because the text is often noisy." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-14", "text": "Abuse can also have different facets." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-15", "text": "[10] released one of the initial data sets from Twitter with the goal of identifying what constitutes racism and sexism." 
}, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-16", "text": "[9] in their work pointed out that hate speech is different from offensive language and released a data set of 25k tweets with the goal of distinguishing hate speech from offensive language." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-17", "text": "Stop saying dumb blondes with pretty faces as you need a pretty face to pull them off !!! #mkr In Islam women must be locked in their houses and Muslims claim this is treating them well Table 1 : Tweets from [10] data set demonstrating online abuse They find that racist and homophobic tweets are more likely to be classified as hate speech but sexist tweets are generally classified as offensive." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-18", "text": "[4] introduced a large, hand-coded corpus of online harassment data for studying the nature of harassing comments and the culture of trolling." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-19", "text": "Keeping these motivations in mind, we make the following salient contributions:" }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-20", "text": "\u2022 We build a deep context-aware attention-based model for abusive behavior detection on Twitter ." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-21", "text": "To the best of our knowledge ours is the first work that exploits context aware attention for this task." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-22", "text": "\u2022 Our model is robust and achieves consistent performance gains in all the three abusive data sets \u2022 We show how context aware attention helps in focusing on certain abusive keywords when used in specific context and improve the performance of abusive behavior detection ." 
}, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-23", "text": "----------------------------------" }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-24", "text": "**RELATED WORK**" }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-25", "text": "Existing approaches to abusive text detection can be broadly divided into two categories: 1) Feature intensive machine learning algorithms such as Logistic Regression (LR), Multilayer Perceptron (MLP) and etc." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-26", "text": "2) Deep Learning models which learn feature representations on their own. [10] released the popular data set of 16k tweets annotated as belonging to sexism, racism or none class 1 , and provided a feature engineered model for detection of abuse in their corpus." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-27", "text": "[9] use a similar handcrafted feature engineered model to identify offensive language and distinguish it from hate speech." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-28", "text": "[2] in their work, experiment with multiple deep learning architectures for the task of hate speech detection on Twitter using the same data set by [10] ." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-29", "text": "Their best-reported F1-score is achieved using Long Short Term Memory Networks (LSTM) + Gradient Boosting." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-30", "text": "On the data set released by [10] , [5] experiment with a two-step approach of detecting abusive language first and then classifying them into specific types i.e. racist, sexist or none." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-31", "text": "They achieve best results using a Hybrid Convolution Neural Network (CNN) with the intuition that character level input would counter the purposely or mistakenly misspelled words and made-up vocabularies." 
}, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-32", "text": "[6] in their work ran experiments on the Gazetta dataset and the DETOX system ( [12] ) and show that a Recurrent Neural Network (RNN) coupled with deep, classification-specific attention outperforms the previous state of the art in abusive comment moderation." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-33", "text": "In their more recent work [7] explored how user embeddings, user-type embeddings, and user type biases can improve their previous RNN based model on the Gazetta dataset." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-34", "text": "Attentive neural networks have been shown to perform well on a variety of NLP tasks ( [13] , [11] )." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-35", "text": "[13] use hierarchical contextual attention for text classification (i.e attention both at word and sentence level) on six large scale text classification tasks and demonstrate that the proposed architecture outperform previous methods by a substantial margin." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-36", "text": "We primarily focus on word level attention because most of the tweets are single sentence tweets." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-37", "text": "----------------------------------" }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-38", "text": "**MODEL**" }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-39", "text": "The best choice for modeling tweets was Long Short Term Memory Networks (LSTMs) because of their ability to capture long-term dependencies by introducing a gating mechanism that ensures the proper gradient propagation through the network." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-40", "text": "We use bidirectional LSTMs because of their inherent capability of capturing information from both: the past and the future states." 
}, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-41", "text": "A bidirectional LSTM (BiLSTM) consists of a forward LSTM \u2212 \u2192 f that reads the sentence from x 1 to x T and a backward LSTM \u2190 \u2212 f that reads the sentence from x T to x 1 , where T is the number of words in the sentence under consideration and x i is the i t h word in the sentence." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-42", "text": "We obtain the final annotation for a given word x i , by concatenating the annotations from both directions (Eq. [1] )." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-43", "text": "[1] show that LSTMs can benefit from depth in space." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-44", "text": "Stacking multiple recurrent hidden layers on top of each other, just as feed forward layers are stacked in the conventional deep networks give performance gains .And hence we choose stacked LSTM for our experiments." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-45", "text": "----------------------------------" }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-46", "text": "**WORD ATTENTION**" }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-47", "text": "The attention mechanism assigns a weight to each word annotation that is obtained from the BiLSTM layer." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-48", "text": "We compute the fixed representation v of the whole message as a weighted sum of all the word annotations which is then fed to a final fully-connected Softmax layer to obtain the class probabilities." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-49", "text": "We first feed the LSTM output h i of each word x i through a Multi Layer Perceptron to get u i as its hidden representation." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-50", "text": "u c is our word level context vector that is randomly initialized and learned as we train our network." 
}, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-51", "text": "Once u i is obtained we calculate the importance of the word as the similarity Data Set Tweets Count [10] 15,844 [9] 25,112 [4] 20,362 Table 2 : Data sets and their total tweets count of u i with u c and get a normalized importance weight \u03b1 i through a softmax function." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-52", "text": "The context vector u c can be seen as a tool which filters which word is more important over all the words like that used in the LSTM." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-53", "text": "Figure 2 shows the high-level architecture of this model." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-54", "text": "W h and b h are the attention layers weights and biases." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-55", "text": "More formally," }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-56", "text": "----------------------------------" }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-57", "text": "**EXPERIMENTS**" }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-58", "text": "In this section we talk about data sets first and then go on to show our results obtained on these three data sets .We also show some examples where our model failed ." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-59", "text": "Finally we show how attention helps us understand the model in a better fashion." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-60", "text": "----------------------------------" }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-61", "text": "**DATA SETS**" }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-62", "text": "We have used the 3 benchmark data sets for abusive content detection on Twitter." 
}, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-63", "text": "At the time of the experiment, the [10] data set had a total of 15,844 tweets out of which 1,924 were labelled as belonging to racism, 3,058 as sexism and 10,862 as none." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-64", "text": "The [9] data set had a total of 25,112 tweets out of which 1498 were labelled as hate speech, 19,326 as offensive language and 4,288 as neither." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-65", "text": "For the [4] data set, there were 20,362 tweets out of which 5,235 were positive harassment examples and 15,127 were negative." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-66", "text": "We call [10] data set as D1 , [9] data set as D2 and [4] as D3 For tweet tokenization, we use Ekphrasis which is a text processing tool built specially from social platforms such as Twitter." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-67", "text": "[3] use a big collection of Twitter messages (330M) to generate word embeddings, with a vocabulary size of 660K words, using GloVe ( [8] )." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-68", "text": "We use these pre-trained word embeddings for initializing the first layer (embedding layer) of our neural networks." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-69", "text": "----------------------------------" }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-70", "text": "**RESULTS**" }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-71", "text": "The network is trained at a learning rate of 0.001 for 10 epochs, with a dropout of 0.2 to prevent over-fitting." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-72", "text": "The results are averaged over 10-fold cross-validations for D1 and D3 and 5 fold cross-validations for D2 because [9] reported results using 5 fold CV." 
}, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-73", "text": "Because of class imbalance in all our data sets, we report weighted F1 scores." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-74", "text": "Table 3 shows our results in detail." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-75", "text": "We compare our model with the best models reported in each paper." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-76", "text": "Because [4] is a data set paper, we cannot fill the corresponding row." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-77", "text": "* denotes the numbers from baseline papers." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-78", "text": "All the results were reproducible except for the one marked red." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-79", "text": "For (Waseem and Hovy, 2016 ) data set, (Badjatiyaet al., 2017) claim that using Gradient Boosting with LSTM embeddings obtained from random word embeddings boosted their performance by 12 F1 from 81.0 to 93.0." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-80", "text": "When we tried to reproduce the result, we did not find any significant improvement over 81." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-81", "text": "Results show that our model is robust when it comes to the performance on all of the three data sets." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-82", "text": "Table 3 : Data sets and the results of different models." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-83", "text": "We reproduced the results for each model on three of the data sets." 
}, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-84", "text": "----------------------------------" }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-85", "text": "**MODELS**" }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-86", "text": "We also share some examples from the three data sets in Figure 2 which our BiLSTM attention model could not classify correctly." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-87", "text": "On closer investigation we find that most cases where our model fails are instances where annotation is either noisy or the difference between classes are very blurred and subtle." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-88", "text": "The first tweet is a tweet from [10] , the second tweet is a tweet from from [9] data set and the third from the [4]" }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-91", "text": "Attention mechanism enables our neural network to focus on the relevant parts of the input more than the irrelevant parts while performing a prediction task." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-92", "text": "But the relevance is often dependant on the context and so the importance of words is highly context dependent." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-93", "text": "For example, the word islam may appear in the realm of Racism as well as in any normal conversation." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-94", "text": "The top tweet in Figure 3 belongs to None class while the bottom tweet belongs to Racism class." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-95", "text": "----------------------------------" }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-96", "text": "**ATTENTION HEAT MAP VISUALIZATION**" }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-97", "text": "The color intensity corresponds to the weight given to each word by the contextual attention." 
}, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-98", "text": "The first tweet is a sexist tweet from [10] where as the second tweet is an example of racist tweet from the same datset ." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-99", "text": "The third tweet is from [9] data set labelled as offensive language." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-100", "text": "----------------------------------" }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-101", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-102", "text": "We successfully built a deep context-aware attention-based model and applied it to the task of abusive tweet detection." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-103", "text": "We ran experiments on three relevant data sets and empirically showed how our model is robust when it comes to detecting abuse on Twitter." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-104", "text": "We also show how context-aware attention helps us to interpret the model's performance by visualizing the attention weights and conducting thorough error analysis." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-105", "text": "As for future work, we want to experiment with a model that learns user embeddings from their historical tweets." }, { "sent_id": "9a8b29b10539be9e6c65172a97b16f-C001-106", "text": "We also want to model abusive text classification in Twitter by taking tweets in context because often standalone tweets don't give a clear picture of a tweet's intent." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "9a8b29b10539be9e6c65172a97b16f-C001-16" ], [ "9a8b29b10539be9e6c65172a97b16f-C001-27" ] ], "cite_sentences": [ "9a8b29b10539be9e6c65172a97b16f-C001-16", "9a8b29b10539be9e6c65172a97b16f-C001-27" ] }, "@USE@": { "gold_contexts": [ [ "9a8b29b10539be9e6c65172a97b16f-C001-51" ], [ "9a8b29b10539be9e6c65172a97b16f-C001-62", "9a8b29b10539be9e6c65172a97b16f-C001-63", "9a8b29b10539be9e6c65172a97b16f-C001-64" ], [ "9a8b29b10539be9e6c65172a97b16f-C001-99" ] ], "cite_sentences": [ "9a8b29b10539be9e6c65172a97b16f-C001-51", "9a8b29b10539be9e6c65172a97b16f-C001-64", "9a8b29b10539be9e6c65172a97b16f-C001-99" ] }, "@MOT@": { "gold_contexts": [ [ "9a8b29b10539be9e6c65172a97b16f-C001-72" ] ], "cite_sentences": [ "9a8b29b10539be9e6c65172a97b16f-C001-72" ] }, "@DIF@": { "gold_contexts": [ [ "9a8b29b10539be9e6c65172a97b16f-C001-87", "9a8b29b10539be9e6c65172a97b16f-C001-88" ] ], "cite_sentences": [ "9a8b29b10539be9e6c65172a97b16f-C001-88" ] } } }, "ABC_31e8c524f05495fdd87bfac6fbecc8_2": { "x": [ { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-2", "text": "In this paper we use methods for creating a large lexicon of verbal polarity shifters and apply them to German." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-3", "text": "Polarity shifters are content words that can move the polarity of a phrase towards its opposite, such as the verb \"abandon\" in \"abandon all hope\"." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-4", "text": "This is similar to how negation words like \"not\" can influence polarity." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-5", "text": "Both shifters and negation are required for high precision sentiment analysis." 
}, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-6", "text": "Lists of negation words are available for many languages, but the only language for which a sizable lexicon of verbal polarity shifters exists is English." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-7", "text": "This lexicon was created by bootstrapping a sample of annotated verbs with a supervised classifier that uses a set of data-and resource-driven features." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-8", "text": "We reproduce and adapt this approach to create a German lexicon of verbal polarity shifters." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-9", "text": "Thereby, we confirm that the approach works for multiple languages." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-10", "text": "We further improve classification by leveraging cross-lingual information from the English shifter lexicon." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-11", "text": "Using this improved approach, we bootstrap a large number of German verbal polarity shifters, reducing the annotation effort drastically." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-12", "text": "The resulting German lexicon of verbal polarity shifters is made publicly available." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-13", "text": "----------------------------------" }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-14", "text": "**INTRODUCTION**" }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-15", "text": "Polarity shifters are content words such as verbs, nouns or adjectives that influence the sentiment polarity of an expression in ways similar to negation words." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-16", "text": "For example, the negated statement in (1) that uses the negation word nicht in German and not in English can also be expressed using the verbal shifter unterlassen in German and fail in English, as seen in (2)." 
}, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-17", "text": "(1) Peter hat ihnen nicht geholfen." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-18", "text": "Peter did not help them." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-19", "text": "(2) Peter hat es unterlassen shifter ihnen zu helfen." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-20", "text": "Peter failed shifter to help them." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-21", "text": "Polarity shifters can affect both positive and negative expressions, moving their polarity towards the opposite polarity." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-22", "text": "In (3) the shifter verweigern/deny affects the positive polar expression Stipendium/scholarship, resulting in a negative polarity for the sentence." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-23", "text": "On the other hand, the shifter lindern/alleviate in (4) creates a positive sentence despite the negative polar expression Schmerz/pain." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-24", "text": "As can be seen for verhindern/prevent in (5) and (6), the same shifter can even affect both positive and negative expressions." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-25", "text": "We present a reproduction and extension to the work of Schulder et al. (2017) , which introduced a lexicon of verbal polarity shifters, as well as methods to increase the size of this lexicon through bootstrapping." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-26", "text": "The lexicon lists verb lemmas and assigns a binary label (shifter or no shifter) to each." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-27", "text": "The original approach was developed on English." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-28", "text": "We apply it to German, validating the generality of the approach and creating a new resource, a German lexicon of 677 verbal polarity shifters." 
}, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-29", "text": "We also improve the bootstrapping process by adding features that leverage polarity shifter resources across languages." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-30", "text": "As is the case with negation, modeling polarity shifting is important for various tasks in NLP, such as relation extraction (Sanchez-Graillet and Poesio, 2007) , recognition of textual entailment (Harabagiu et al., 2006) and especially sentiment analysis (Wiegand et al., 2010) ." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-31", "text": "However, while there has been significant research on negation in sentiment analysis (Wiegand et al., 2010) , current classifiers fail to handle polarity shifters adequately (Schulder et al., 2017) ." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-32", "text": "This is in part due to the lack of lexical resources for polarity shifters." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-33", "text": "Unlike negation words (no, not, never, etc.) , of which there are only a few dozen in a language, polarity shifters are far more numerous." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-34", "text": "Among verbs alone there are many hundreds (Schulder et al., 2017) ." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-35", "text": "Comprehensive shifter lexicons are, therefore, considerably more expensive to create." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-36", "text": "Once available, they can be used to improve the aforementioned tasks, as has already been shown for the case of English polarity classification (Schulder et al., 2017) ." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-37", "text": "To reduce the cost of creating such polarity shifter lexicons, Schulder et al. (2017) introduced methods to automatically generate a labeled list of words using either a limited amount of labeled training data or no labeled data at all." 
}, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-38", "text": "Their approach includes both features that rely on semantic resources and datadriven ones." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-39", "text": "They limited their work to English verbs, but expressed the expectation that their methods should also work for other languages." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-40", "text": "To verify that expectation, we apply their approach to German, for which all resources required to reproduce their experiments are available." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-41", "text": "Keeping in mind that this is not the case for many other languages, we focus our evaluation on differentiating between features that rely on unstructured data and those requiring rare semantic resources." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-42", "text": "While polarity shifters are not restricted to a particular part of speech -shifter nouns (e.g. downfall), adjectives (devoid) and adverbs (barely) also exist -we limit ourselves to verbs." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-43", "text": "Verbs and nouns are the most important minimal semantic units (Schneider et al., 2016) and verbs are usually the main syntactic predicates of clauses, projecting far-reaching scopes." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-44", "text": "Focusing on verbs also allows us a closer comparison with Schulder et al. (2017) and to investigate cross-lingual similarities between verbal shifters." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-45", "text": "The contributions of this paper are:" }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-46", "text": "(i) we introduce a German lexicon of verbal polarity shifters; (ii) we reproduce and adapt the approach of Schulder et al. 
(2017) to German to extend our lexicon; (iii) we introduce additional methods that take advantage of the existence of the English verbal polarity shifter lexicon and improve upon the current state of the art." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-47", "text": "The focus of our work is the binary classification of verbal polarity shifters in German." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-48", "text": "The resulting German lexicon of 677 verbal polarity shifters is made publicly available." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-49", "text": "1" }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-50", "text": "----------------------------------" }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-51", "text": "**RELATED WORK**" }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-52", "text": "Existing work on negation modeling focuses almost exclusively on negation words (see the survey of Wiegand et al. (2010) )." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-53", "text": "One reason for this is the lack of lexicons and corpora that cover other forms of polarity shifters." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-54", "text": "Even the most complex negation lexicon for English sentiment analysis (Wilson et al., 2005) includes a mere 12 verbal shifters." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-55", "text": "So far the only larger resources for polarity shifters are the English-language verbal shifter lexicons recently introduced by Schulder et al. (2017) and Schulder et al. (2018) ." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-56", "text": "Schulder et al. (2017) automatically bootstrap a lexicon which covers 980 verbal shifters at the lemma level, while Schulder et al. (2018) manually annotate word senses of verbs, creating a lexicon of 2131 shifter senses across 1220 verbs." 
}, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-57", "text": "As we reproduce and extend the work of Schulder et al. (2017) , all further use of and comparison to an English shifter lexicon refers to their bootstrapped lexicon as well." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-58", "text": "To create shifter lexicons at a large scale, automation and bootstrapping techniques are required." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-59", "text": "Danescu-Niculescu-Mizil et al. (2009) propose using negative polarity items (NPIs) to extract downwardentailing operators, which are closely related to polarity shifters." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-60", "text": "Schulder et al. (2017) also make use of NPIs in addition to a number of other features." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-61", "text": "Rather than using lexicons, another approach would be to learn polarity shifters from labelled corpora." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-62", "text": "In the case of negation, this has already been examined for the biomedical domain (Huang and Lowe, 2007; Morante and Daelemans, 2009; Zou et al., 2013) , the review domain (Ikeda et al., 2008; Kessler and Sch\u00fctze, 2012; Socher et al., 2013; Yu et al., 2016) and across domains (Fancellu et al., 2016) ." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-63", "text": "Unfortunately, due to the considerably higher lexical diversity of polarity shifters, far larger corpora would be required for learning shifter than for learning negation." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-64", "text": "Available corpora that are suitable for negation learning, such as the Sentiment Treebank (Socher et al., 2013) or the BioScope corpus (Szarvas et al., 2008) , are fairly small in size." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-65", "text": "Most verbs occur in them in very few instances or not at all." 
}, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-66", "text": "In the BioScope corpus, for example, there are only 6 verbal shifters (Morante, 2010) ." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-67", "text": "Polarity classifiers trained on such corpora, such as the state-of-the-art Recursive Neural Tensor Network tagger (Socher et al., 2013) , fail to detect many instances of polarity shifting." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-68", "text": "Schulder et al. (2017) show that the explicit knowledge provided by a shifter lexicon can improve polarity classification in such cases." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-69", "text": "----------------------------------" }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-70", "text": "**DATA**" }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-71", "text": "We create a gold standard for German verbal shifters, following the approach Schulder et al. (2017) used for their English gold standard." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-72", "text": "An expert annotator, who is a native speaker of German, labeled 2000 verbs, randomly sampled from GermaNet (Hamp and Feldweg, 1997) , a German wordnet resource." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-73", "text": "The remaining 7262 GermaNet verbs are used to bootstrap a larger lexicon in \u00a75.3." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-74", "text": "Each verb is assigned a binary label of being a shifter or not." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-75", "text": "To qualify as a shifter, a verb must permit polar expressions as its dependents and cause the polarity of the expression that embeds both verb and polar expression to move towards the opposite of the polar expression." 
}, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-76", "text": "For example, in (6) verhindern shifts the negative polarity of its dependent ein Gemetzel, resulting in a positive expression." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-77", "text": "Annotation is performed at the lemma level, as word-sense disambiguation tends to be insufficiently robust." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-78", "text": "----------------------------------" }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-79", "text": "**RESOURCE TYPE GERMAN RESOURCE ENGLISH RESOURCE WORDNET**" }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-80", "text": "GermaNet (Hamp and Feldweg, 1997) WordNet (Miller et al., 1990)" }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-81", "text": "----------------------------------" }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-82", "text": "**TEXT CORPUS**" }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-83", "text": "DeWaC Web Corpus Amazon Product Reviews (Baroni et al., 2009 ) (Jindal and Liu, 2008) Polarity Lexicon PolArt Sentiment Lexicon Subjectivity Lexicon (Klenner et al., 2009 ) (Wilson et al., 2005) Framenet Salsa (Burchardt et al., 2006) FrameNet (Baker et al., 1998)" }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-84", "text": "----------------------------------" }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-85", "text": "**EFFECTS**" }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-86", "text": "EffektGermaNet (Ruppenhofer and Brandes, 2015) EffectWordNet (Choi et al., 2014) Table 3 : Distribution of verbal shifters in the PolArt Sentiment Lexicon (Klenner et al., 2009) ." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-87", "text": "Table 1 provides an overview of the German resources we use in our reproduction, compared to the resources used for the English shifter lexicon." 
}, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-88", "text": "More detailed descriptions of the resources are provided in sections discussing feature design ( \u00a74) and experiments ( \u00a75)." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-89", "text": "Table 2 shows that in our gold data 11.2% of verbs are shifters, which is a bit less than the 15.2% of the English gold standard." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-90", "text": "Table 3 shows the shifter distribution among verbs with sentiment polarity (determined using the PolArt Sentiment Lexicon (Klenner et al., 2009) )." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-91", "text": "As was the case for the English gold data, it shows a tendency for shifter verbs to be negative rather than positive terms." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-92", "text": "----------------------------------" }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-93", "text": "**FEATURE DESIGN**" }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-94", "text": "In this section we introduce the features that we will use to bootstrap our German verbal shifter lexicon in \u00a75.3." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-95", "text": "We start by outlining the features proposed by Schulder et al. (2017) and how we adapt them for use with German ( \u00a74.1)." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-96", "text": "We further separate them into data-driven features ( \u00a74.1.1) and resource-driven features ( \u00a74.1.2) to highlight their requirements when applied to a new language." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-97", "text": "In \u00a74.2 we introduce new methods that can either be used as stand-alone classifiers or as features for an SVM classifier." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-98", "text": "Both methods take advantage of existing knowledge about English verbal shifters." 
}, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-99", "text": "One method uses a bilingual dictionary (\u00a74.2.1) and the other cross-lingual word embeddings (\u00a74.2.2)." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-100", "text": "----------------------------------" }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-101", "text": "**FEATURE REPRODUCTION**" }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-102", "text": "In this section we briefly describe how we adapt the features of Schulder et al. (2017) to German-language data." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-103", "text": "We distinguish between features that mainly rely on text data from a corpus (\u00a74.1.1) and those that require complex semantic resources (\u00a74.1.2)." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-104", "text": "When working with languages with scarcer resources, it can be expected that the former will be more readily available than the latter." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-105", "text": "----------------------------------" }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-106", "text": "**DATA-DRIVEN FEATURES**" }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-107", "text": "The main requirement of the following features is a reasonably sized text corpus to detect syntactic patterns and word frequencies." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-108", "text": "The text corpus was lemmatized using the TreeTagger (Schmid, 1994) and parsed for syntactic dependency structures with ParZu (Sennrich et al., 2009)." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-109", "text": "For features requiring knowledge of polarities, we use the PolArt Sentiment Lexicon (Klenner et al., 2009)."
}, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-110", "text": "Distributional Similarity (SIM): The distributional similarity feature assumes that words that are semantically similar to negation words are also likely to be polarity shifters." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-111", "text": "Semantic similarity is modeled as cosine similarity in a word embedding space." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-112", "text": "The word embeddings are created using Word2Vec (Mikolov et al., 2013) on the German web corpus DeWaC (Baroni et al., 2009), using the same hyperparameters as Schulder et al. (2017) and German translations of their negation seeds." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-113", "text": "Polarity Clash (CLASH): The polarity clash feature assumes that shifting will often occur when a polar verb modifies an expression of the opposite polarity, such as in (7)." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-114", "text": "The feature is further narrowed down to negative verbs that modify positive nouns, as polar verbal shifters are predominantly of negative polarity (Table 3)." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-115", "text": "Particle Verbs (PRT): Certain verb particles indicate a complete transition to an end state (Brinton, 1985)." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-116", "text": "Schulder et al. (2017) hypothesize that this phenomenon correlates with shifting, which can be seen as producing a new (negative) end state." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-117", "text": "Therefore, they collect particle verbs containing relevant English particles, such as away, down and out." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-118", "text": "For our German data we chose the following particles associated with negative end states: ab, aus, entgegen, fort, herunter, hinunter, weg and wider."
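The SIM feature described above reduces to a cosine comparison between a candidate verb and a handful of negation seed words. Below is a minimal pure-Python sketch; the toy low-dimensional vectors stand in for the real Word2Vec/DeWaC embeddings, and the function names and the choice of taking the maximum similarity over the seeds are illustrative assumptions, not the paper's exact implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def sim_feature(verb_vec, seed_vecs):
    """SIM sketch: similarity of a verb to the closest negation seed."""
    return max(cosine(verb_vec, s) for s in seed_vecs)

# Toy 3-d "embeddings" (real ones would be 300-d Word2Vec vectors).
seeds = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]   # e.g. nicht, kein
verhindern = [0.8, 0.2, 0.1]                  # shifter-like verb
essen = [0.0, 0.1, 1.0]                       # unrelated verb

assert sim_feature(verhindern, seeds) > sim_feature(essen, seeds)
```

A verb close to the negation seeds in embedding space receives a high SIM score and is treated as a likely shifter.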
}, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-119", "text": "Heuristic using 'jeglich' (ANY): Negative polarity items (NPIs) are known to occur in the context of negation (Giannakidou, 2008)." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-120", "text": "Schulder et al. (2017) showed that the English NPI any co-occurs with shifters, so its presence in a verb phrase can indicate the presence of a verbal shifter." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-121", "text": "We expect the same for the German NPI jeglich, as seen in (8)." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-122", "text": "We collect all verbs with a polar direct object that is modified by the lemma jeglich." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-123", "text": "The resulting pattern matches are sorted by their frequency, normalized over their respective verb frequency and then reranked using Personalised PageRank (Agirre and Soroa, 2009). Anti-Shifter Feature (ANTI): This feature specifically targets anti-shifters, verbs that exhibit polar stability instead of causing polar shifting." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-124", "text": "These are commonly verbs indicating creation or continued existence, such as live, introduce, construct or prepare." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-125", "text": "Such verbs often co-occur with the adverbs ausschlie\u00dflich, zuerst, neu and extra, as seen in (9)-(12)." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-126", "text": "Accordingly, we can create a list of anti-shifters by selecting the verbs that most often co-occur with these adverbs." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-127", "text": "Example: They specially prepared (anti-shifter) vegan dishes for me."
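The ANY heuristic above amounts to counting, per verb, how often a polar direct object is modified by jeglich and normalizing that count by the verb's overall frequency. A simplified sketch under stated assumptions: the Personalised PageRank reranking step is omitted, and the triple format, invented German lemmas, and function name are illustrative only.

```python
from collections import Counter

def any_heuristic(dependency_triples, verb_freq):
    """Count verbs whose polar direct object is modified by 'jeglich',
    then normalize each count by the verb's corpus frequency."""
    hits = Counter(verb for verb, polar_obj, has_jeglich in dependency_triples
                   if has_jeglich)
    return {verb: hits[verb] / verb_freq[verb] for verb in hits}

# Hypothetical parsed triples: (verb lemma, polar object, jeglich present?)
triples = [
    ("verlieren", "Hoffnung", True),
    ("verlieren", "Mut", True),
    ("haben", "Hoffnung", True),
    ("haben", "Spass", False),
]
freq = {"verlieren": 4, "haben": 100}

scores = any_heuristic(triples, freq)
assert scores["verlieren"] > scores["haben"]
```

The normalization keeps highly frequent verbs such as haben from dominating the ranking purely by raw co-occurrence count.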
}, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-128", "text": "----------------------------------" }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-129", "text": "**RESOURCE-DRIVEN FEATURES**" }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-130", "text": "The following features rely on advanced semantic resources which are available in only a few languages." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-131", "text": "GermaNet: Wordnets are large lexical ontologies providing various kinds of semantic information and relations." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-132", "text": "Schulder et al. (2017) used glosses, hypernyms and supersenses taken from the English WordNet (Miller et al., 1990) as features in their work." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-133", "text": "We use GermaNet (Hamp and Feldweg, 1997), a German wordnet resource that provides all these features." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-134", "text": "In the case of glosses, called paraphrases in GermaNet, the resource offers two variants: the paraphrases originally written for GermaNet, and a more extensive set of paraphrases harvested from Wiktionary (Henrich et al., 2014)." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-135", "text": "To improve coverage we use this paraphrase extension in our experiments." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-136", "text": "Salsa FrameNet: Framenets provide semantic frames that group words with similar semantic behavior." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-137", "text": "Schulder et al. (2017) use the frame memberships of verbs as a feature, hypothesizing that verbal shifters will be found in the same frames."
}, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-138", "text": "We reproduce this feature using frames from the German FrameNet project Salsa (Burchardt et al., 2006) ." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-139", "text": "EffektGermaNet: Wiebe and colleagues (Deng et al., 2013; Choi et al., 2014) introduced the idea that events can have harmful or beneficial effects on their objects." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-140", "text": "These effects are related but not identical to polarity shifting." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-141", "text": "Choi et al. (2014) provide lexical information on effects in their English resource EffectWordNet." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-142", "text": "We use its German counterpart, EffektGermaNet (Ruppenhofer and Brandes, 2015) , to model the effect feature in our data." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-143", "text": "----------------------------------" }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-144", "text": "**NEW FEATURES**" }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-145", "text": "In \u00a74.1 we described how we reproduce features already used for English shifter classification." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-146", "text": "Next we introduce new features that have not yet been used for the creation of a verbal shifter lexicon." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-147", "text": "----------------------------------" }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-148", "text": "**BILINGUAL DICTIONARY**" }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-149", "text": "The motivation behind the work of Schulder et al. (2017) was to introduce a large lexicon of verbal polarity shifters." 
}, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-150", "text": "Now that such a lexicon exists for English, it is an obvious resource to use when creating verbal shifter lexicons for other languages." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-151", "text": "We hypothesize that a verb with the same meaning as an English verbal shifter will also function as a shifter in its own language." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-152", "text": "All that is required is a mapping from English verbs to, in our case, German verbs." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-153", "text": "We choose to use the bootstrapped lexicon of Schulder et al. (2017), rather than the manually created one of Schulder et al. (2018), to show that bootstrapping is sufficient for all stages of the learning process." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-154", "text": "One potential source for such a mapping is a bilingual dictionary." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-155", "text": "We use the English-German dataset by DictCC, as it is large (over one million translation pairs) and publicly available." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-156", "text": "It covers 76% of German verbs found in GermaNet and 77% of English verbs found in WordNet." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-157", "text": "Mapping the shifter labels of the English verbs to German verbs is performed as follows: For each German verb, all possible English translations are looked up." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-158", "text": "Using the English verbal shifter lexicon, we confirm whether the English translations are shifters." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-159", "text": "If the majority of translations are shifters, the German word is also labeled as a shifter, otherwise as not a shifter."
}, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-160", "text": "This approach provides explicit labels for 1368 of our 2000 gold standard verbs (68%)." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-161", "text": "Less than 6% of these are tied between shifter and no shifter translations." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-162", "text": "Ties are resolved in favor of the shifter label." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-163", "text": "The remaining verbs are labeled with the majority label no shifter." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-164", "text": "While this bilingual dictionary mapping approach makes for a promising feature, we refrain from considering it for generating a gold standard." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-165", "text": "Using a dictionary instead of annotating a random sample would introduce biases existing in the dictionary, e.g. more translation pairs being available for frequent words, which can in turn favor features that work better for frequent words." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-166", "text": "Schulder et al. (2017) also observe in their error analysis that some verbs act as shifters in only some of their word senses." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-167", "text": "As different word senses often do not translate into the same foreign word, indiscriminate translation may introduce non-shifting senses of English shifter words as false positives." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-168", "text": "Evaluating the dictionary mapping as a feature will allow us to judge its usefulness for high-precision lexicon induction in future work."
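The majority-vote label transfer described above can be sketched in a few lines: look up all English translations of a German verb, count how many are shifters, break ties in favor of the shifter label, and default uncovered verbs to the majority label no shifter. The toy dictionary entries and English shifter set below are invented for illustration; they are not DictCC data.

```python
def map_shifter_label(german_verb, dictionary, english_shifters):
    """Majority vote over English translations; ties favor 'shifter',
    verbs without any translation default to 'no shifter'."""
    translations = dictionary.get(german_verb, [])
    if not translations:
        return "no shifter"
    votes = sum(1 for t in translations if t in english_shifters)
    # votes * 2 >= len(...) counts exact ties as shifters.
    return "shifter" if votes * 2 >= len(translations) else "no shifter"

# Hypothetical DictCC-style many-to-many entries (toy data).
dictcc = {
    "verhindern": ["prevent", "avert", "preclude"],
    "essen": ["eat", "dine"],
    "mindern": ["reduce", "lessen"],
}
en_shifters = {"prevent", "avert", "preclude", "reduce"}

assert map_shifter_label("verhindern", dictcc, en_shifters) == "shifter"
assert map_shifter_label("essen", dictcc, en_shifters) == "no shifter"
assert map_shifter_label("mindern", dictcc, en_shifters) == "shifter"  # tie
assert map_shifter_label("tanzen", dictcc, en_shifters) == "no shifter"
```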
}, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-169", "text": "----------------------------------" }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-170", "text": "**CROSS-LINGUAL WORD EMBEDDINGS**" }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-171", "text": "As an alternative to using bilingual dictionaries we investigate transferring English shifter labels to German using cross-lingual word embeddings." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-172", "text": "These are word embeddings which provide a shared vector space for words from multiple languages." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-173", "text": "Similar to how the SIM feature (see \u00a74.1.1) compares negation words to verbs in a monolingual word embedding, a cross-lingual word embedding allows us to compare English verbs to verbs of another language based on their distributional similarity without having labeled data for the other language." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-174", "text": "These comparisons can then be used to apply the labels of the English lexicon of verbal shifters to the other language." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-175", "text": "Mapping shifter labels cross-lingually with a bilingual dictionary, as described in \u00a74.2.1, requires a dictionary with good coverage for both languages." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-176", "text": "For many languages, publicly available dictionaries of adequate size are hard to come by." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-177", "text": "For instance, the second largest English dictionary on DictCC is only 40% of the size of the English-German dataset, and only a few others exceed 2% of its size." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-178", "text": "In \u00a75.2 we explore the effect of dictionary size on mapping performance and how cross-lingual word embeddings fare in comparison."
}, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-179", "text": "Methods for creating cross-lingual word embeddings can be grouped into cross-lingual training and monolingual mappings." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-180", "text": "Cross-lingual training learns joint embeddings from parallel corpora." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-181", "text": "However, such corpora are far smaller and rarer than monolingual corpora and, therefore, not ideal for us." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-182", "text": "Monolingual mappings take preexisting monolingual word embeddings and learn linear transformations to map both embeddings onto the same vector space." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-183", "text": "Commonly, these approaches use bilingual dictionaries to initialize this mapping, which would rather defeat our goal of using embeddings as a data-driven alternative to dictionaries." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-184", "text": "The VecMap framework (Artetxe et al., 2017) provides an initialization method that relies on numerals instead of a dictionary." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-185", "text": "The idea behind this is that Arabic numerals are used in most languages, even across different writing systems (e.g. Cyrillic, Chinese, etc.), and, therefore, can function as a dictionary without requiring actual bilingual knowledge." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-186", "text": "For our experiments, we train Word2Vec word embeddings for English and German, using the Amazon Product Review (Jindal and Liu, 2008) and DeWaC (Baroni et al., 2009) corpora, respectively." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-187", "text": "Ideally, product review corpora would be used for both languages, but available German review corpora are considerably smaller than their English counterparts."
}, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-188", "text": "For example, the German corpus Webis-CLS (Prettenhofer and Stein, 2010) contains only 33 million words, while the English-language Amazon Product Review Corpus consists of 1.2 billion words." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-189", "text": "When generating word embeddings, the size of the corpus is very important for the quality of the resulting embedding, so we choose instead to use DeWaC, a web corpus of 1.7 billion words." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-190", "text": "Training is performed using the same hyperparameters as used by Artetxe et al. (2017)." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-191", "text": "We use VecMap to create a cross-lingual word embedding using the default configuration for numeral-based mappings." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-192", "text": "The resulting cross-lingual embedding covers 79% of German GermaNet verbs as well as 79% of English WordNet verbs." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-193", "text": "It covers 1598 of our 2000 gold data verbs (80%)." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-194", "text": "We use this new word embedding to apply English shifter labels to German." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-195", "text": "To achieve this, we go through our list of German verbs, look up the most similar English verb for each and apply its label." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-196", "text": "We also investigated majority voting using k-nearest neighbors, but this did not improve performance."
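The label transfer just described is a 1-nearest-neighbor lookup in the shared embedding space: for each German verb, find the most similar English verb by cosine similarity and copy its shifter label. A minimal sketch, assuming toy 2-d vectors in place of the real VecMap output; all names and data below are illustrative.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def transfer_label(german_vec, english_vecs, english_labels):
    """Label a German verb with the label of its nearest English verb
    (1-NN by cosine) in the shared cross-lingual embedding space."""
    nearest = max(english_vecs, key=lambda w: cosine(german_vec, english_vecs[w]))
    return english_labels[nearest]

# Toy shared space (VecMap would produce real high-dimensional vectors).
en_vecs = {"prevent": [1.0, 0.0], "eat": [0.0, 1.0]}
en_labels = {"prevent": "shifter", "eat": "no shifter"}

assert transfer_label([0.9, 0.1], en_vecs, en_labels) == "shifter"
assert transfer_label([0.1, 0.9], en_vecs, en_labels) == "no shifter"
```

The paper notes that a k-nearest-neighbor majority vote did not improve over this single-neighbor variant.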
}, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-197", "text": "----------------------------------" }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-198", "text": "**EXPERIMENTS**" }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-199", "text": "----------------------------------" }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-200", "text": "**CLASSIFIER EVALUATION**" }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-201", "text": "We start our evaluation by reproducing the classifier evaluation of Schulder et al. (2017)." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-202", "text": "The task is the classification of all verbs from the given gold standard in a 10-fold cross validation." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-203", "text": "Analogously to Schulder et al. (2017), we evaluate a supervised SVM classifier as well as a graph-based label propagation (LP) classifier that requires no labeled training data." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-204", "text": "In addition, we evaluate our cross-lingual word embedding classifier (\u00a74.2.2) and our dictionary classifier (\u00a74.2.1), which both make use of the pre-existing English lexicon, but require no additional labeled German data." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-205", "text": "For an overview of the classifiers and their data requirements, see Table 4." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-206", "text": "For the LP classifier we use the ANY feature as seeds for the positive label (shifter) and the ANTI feature as negative label (no shifter) seeds." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-207", "text": "For SVM we group features into data-driven and resource-driven feature sets (see Table 6), as outlined in \u00a74.1.1 and \u00a74.1.2, and introduce the outputs of the cross-lingual word embedding and dictionary classifiers as additional separate features." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-208", "text": "Table 5 shows the performance of our various classifiers."
}, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-209", "text": "All classifiers clearly outperform the baseline, and resource-based features outperform data-based ones." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-219", "text": "This is similar to the performance observed for English (Schulder et al., 2017)." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-210", "text": "Footnote 5: BilBOWA (Gouws et al., 2015) seeks to address the coverage problem of parallel corpora by incorporating additional monolingual corpora into the training process." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-211", "text": "However, our experiments with it did not provide satisfactory results." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-212", "text": "This is in line with reports by Artetxe et al. (2017) and Upadhyay et al. (2016)." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-213", "text": "Footnote 6: Word2Vec configuration: CBOW, 300 dimensions, context window of 5 words, sub-sampling at 1e-05, negative samples at 10 and vocabulary restricted to the 200,000 most frequent words." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-214", "text": "We also experimented with using the full vocabulary, but this resulted in lower-quality embeddings." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-215", "text": "Footnote 7: As in Schulder et al. (2017), accuracy proves to be a problematic measure, as it has a strong majority-label bias." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-216", "text": "The no shifter label makes up 88.8% of our gold annotation (Table 2), which explains the strong performance of the majority baseline on this metric." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-217", "text": "Table 6: Features included in SVM feature groups in Table 5." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-218", "text": "All features in data and resource were also used in Schulder et al. (2017)."
}, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-220", "text": "Cross-lingual embeddings and dictionaries as stand-alone classifiers both outperform the label propagation approach due to their better recall on shifters." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-221", "text": "Interestingly, the cross-lingual embedding classifier performs far better than SIM, despite both relying on word embeddings to judge distributional similarity." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-222", "text": "Comparing similarity among verbs, even cross-lingually, works better than across parts of speech, as required for negation-shifter comparisons." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-223", "text": "Adding both cross-lingual features to the SVM classifier improves performance further." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-224", "text": "This shows that they are not only complementary to the existing features, but also to each other, as using only one cross-lingual feature does not improve performance as much." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-225", "text": "The most feature-rich SVM configuration, SVM data+resource+dict+embed, provides a significant improvement over SVM data+resource, the best classifier of Schulder et al. (2017)." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-226", "text": "We conclude that cross-lingual shifter information is useful even when the same bootstrapping process and feature set are used in both the source and target language." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-227", "text": "Figure 1 shows the learning curve of select SVM configurations, compared to the classifiers that work without labeled German data, i.e. LP, Embedding and Dictionary."
}, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-228", "text": "Cross-lingual embedding and dictionary classifiers provide a stronger baseline than LP, outperforming SVM data+resource when training data is sparse." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-229", "text": "However, adding them as features to the SVM results in a classifier that consistently improves upon all other systems, even at small training sizes of only 20%." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-230", "text": "Combining all available sources of information as SVM features is therefore the preferred approach if any amount of training data is available." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-231", "text": "----------------------------------" }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-232", "text": "**EVALUATION OF DICTIONARY SIZE**" }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-233", "text": "The dictionary mapping approach (\u00a74.2.1) has been shown to be a strong stand-alone classifier and SVM feature." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-234", "text": "[Figure 1: Learning curve on gold standard. SVM data+resource represents the previously best classifier (Schulder et al., 2017).]" }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-236", "text": "For many other languages, finding a publicly available dictionary of comparable size may pose a challenge." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-237", "text": "Therefore, we investigate how smaller dictionaries may perform in our classifiers." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-238", "text": "The English-German DictCC dictionary covers slightly over 8000 of the English verbs found in WordNet." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-239", "text": "Of the 2000 German verbs in our gold standard, DictCC covers 1368."
}, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-240", "text": "To simulate bilingual dictionaries of smaller size, we create a version of the DictCC dictionary with half the English vocabulary by limiting it to the 4000 most frequent verbs from WordNet (Dict voc_size=4k)." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-241", "text": "We also create even smaller versions with only the 1000 (Dict voc_size=1k) and 500 most frequent English verbs (Dict voc_size=0.5k)." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-242", "text": "As bilingual dictionaries provide a many-to-many mapping, having half the English vocabulary does not necessarily mean that we receive only half the German translations." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-243", "text": "Many German words receive multiple translations, all of which we then use to determine their shifter label via majority vote." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-244", "text": "Reducing the English vocabulary, therefore, first reduces the number of label votes for each German word, until, eventually, German words are removed as there are no more votes for them." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-245", "text": "Having fewer votes per German output label can, however, still affect the robustness of the labeling process." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-246", "text": "In our case, reducing the English vocabulary by half still provides translations for 1168 of the German words in our gold data, i.e. 85% of the full dictionary." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-247", "text": "Reducing it further to 1000 English verbs drops the size of the German vocabulary to 52%." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-248", "text": "Using only the 500 most frequent English words leaves a German coverage of 33%."
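The dictionary-size simulation above can be expressed as a simple filter: restrict the English side to the top-k most frequent verbs, and note that German entries only disappear once all of their translations are gone; before that, they merely lose votes. A minimal sketch with invented toy entries (not real DictCC data):

```python
def shrink_dictionary(dictionary, allowed_english):
    """Simulate a smaller bilingual dictionary by keeping only translation
    pairs whose English side is in the allowed (top-k frequency) set."""
    shrunk = {}
    for german, translations in dictionary.items():
        kept = [t for t in translations if t in allowed_english]
        if kept:               # a German verb drops out only when
            shrunk[german] = kept  # all of its translations are removed
    return shrunk

# Hypothetical many-to-many entries.
dictcc = {
    "verhindern": ["prevent", "preclude"],
    "essen": ["eat"],
    "zanken": ["bicker"],
}
top_k = {"prevent", "eat"}     # stand-in for the most frequent English verbs
small = shrink_dictionary(dictcc, top_k)

assert set(small) == {"verhindern", "essen"}   # 'zanken' loses coverage
assert small["verhindern"] == ["prevent"]      # fewer votes remain
```

This mirrors the observation in the text that halving the English vocabulary still leaves most German verbs covered, while further cuts remove coverage outright.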
}, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-249", "text": "Figure 2 shows the performance of the differently sized dictionaries as stand-alone classifiers, while Figure 3 shows how much they can improve the best classifier of Schulder et al. (2017), i.e. SVM data+resource." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-250", "text": "In both cases we see that while even smaller dictionaries can still provide acceptable performance, using cross-lingual embeddings is preferable to using a dictionary of insufficient size." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-251", "text": "----------------------------------" }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-252", "text": "**BOOTSTRAPPING**" }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-253", "text": "In their extrinsic evaluation, Schulder et al. (2017) showed that explicit knowledge of a large number of polarity shifters can improve sentiment analysis." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-254", "text": "To increase the size of our lexicon, we bootstrap additional shifters following their approach." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-255", "text": "We train our best classifier (Table 5) on the 2000 verbs from our gold standard (\u00a73) and then use it to classify the remaining 7262 GermaNet verbs that had not been labeled so far." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-256", "text": "Of these, the classifier labels 595 verbs as shifters." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-257", "text": "A German native speaker manually checks these predicted shifters and confirms 453 to be true verbal shifters." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-258", "text": "Limiting our annotation effort to predicted shifters and discarding all others reduces the cost of annotation by 92%."
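The bootstrap-then-verify loop described above has three steps: train on the gold sample, classify the unlabeled verbs, and manually confirm only the predicted shifters. The sketch below uses toy stand-ins for both the real SVM (a hypothetical prefix heuristic) and the human annotator (a predicate function); every name and data item is illustrative, not the paper's implementation.

```python
def bootstrap(gold, unlabeled, train, verify, threshold=0.5):
    """Train on gold data, classify unlabeled verbs, and keep only the
    predicted shifters that a (human) verifier confirms."""
    score = train(gold)                        # verb -> confidence in [0, 1]
    predicted = [v for v in unlabeled if score(v) >= threshold]
    confirmed = [v for v in predicted if verify(v)]
    return confirmed, len(predicted)

def toy_train(gold):
    # Stand-in for the SVM: scores verbs with a toy prefix heuristic.
    return lambda verb: 0.9 if verb.startswith("ver") else 0.1

gold = [("verhindern", "shifter"), ("essen", "no shifter")]
unlabeled = ["vermeiden", "verlieren", "tanzen", "singen"]

# The "annotator" here rejects one false positive.
confirmed, n_predicted = bootstrap(gold, unlabeled, toy_train,
                                   lambda v: v != "verlieren")

assert n_predicted == 2                # only the 'ver...' verbs predicted
assert confirmed == ["vermeiden"]      # annotator rejected 'verlieren'
```

Because only predicted shifters reach the annotator, the manual workload shrinks dramatically, which is the 92% cost reduction reported in the text.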
}, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-259", "text": "Table 7 shows the classifier precision at different confidence intervals." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-260", "text": "Like Schulder et al. (2017), we see very high performance for the first quartile, matching their observation that manual confirmation is not strictly necessary for high-confidence labels." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-261", "text": "Combining the 453 bootstrapped shifters with the 224 shifters from the gold standard, we produce a novel list of 677 German verbal shifters (see footnote 1)." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-262", "text": "Table 7: Classification of GermaNet verbs that were not part of the gold standard (\u00a73); verbs are ranked by the confidence score of the classifier and evaluated at intervals by the precision of the shifter label." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-263", "text": "----------------------------------" }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-264", "text": "**CONCLUSION**" }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-265", "text": "We confirm that the bootstrapping process for creating a large lexicon of verbal polarity shifters can successfully be applied to German." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-266", "text": "Given appropriate resources, the effort for adjusting to a new language is minimal, mostly requiring translating seed words and adjusting syntactic patterns, while the underlying concepts of the features remain the same." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-267", "text": "Using a manually annotated sample of 2000 verbs taken from GermaNet, we train a supervised classifier with various data- and resource-driven features."
}, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-268", "text": "Its performance is further improved by leveraging information from an existing English lexicon of verbal shifters using bilingual dictionaries and cross-lingual word embeddings." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-269", "text": "The resulting improved classifier allows us to triple the number of confirmed German shifters in our lexicon." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-270", "text": "We differentiate features by whether they require only unlabeled data and basic linguistic tools or whether they depend on rare semantic resources that may not be available for many languages." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-271", "text": "In addition, we introduce the possibility of using cross-lingual resources to reduce the dependence on resources in the target language." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-272", "text": "This shows promise, improving performance for both unsupervised and supervised classification, especially for scenarios where only small amounts of training data are available." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-273", "text": "However, supervised learning that combines all features still provides the best results." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-274", "text": "Our recommendation for creating shifter lexicons in new languages is to start out with cross-lingual label transfer, but to also invest in annotating a random sample of verbs if possible, especially if advanced semantic resources like a wordnet are available, as they require supervised learning to be leveraged." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-275", "text": "In reproducing the work of Schulder et al. (2017) , we limited ourselves to verbs." 
}, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-276", "text": "In the future, we would like to investigate methods to extend the shifter lexicon to also cover nouns and adjectives." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-277", "text": "While we have shown that the same approach for classifying verbal shifters works for German and English, future work will expand the number of languages, especially to verify that these methods can also be applied to non-Indo-European languages, such as Chinese, Japanese or Arabic." }, { "sent_id": "31e8c524f05495fdd87bfac6fbecc8-C001-278", "text": "In this context it will also be interesting to see whether using shifter lexicons from several languages can further improve the dictionary and cross-lingual word embedding classifiers." } ], "y": { "@EXT@": { "gold_contexts": [ [ "31e8c524f05495fdd87bfac6fbecc8-C001-25" ], [ "31e8c524f05495fdd87bfac6fbecc8-C001-27", "31e8c524f05495fdd87bfac6fbecc8-C001-28" ], [ "31e8c524f05495fdd87bfac6fbecc8-C001-40" ], [ "31e8c524f05495fdd87bfac6fbecc8-C001-46" ], [ "31e8c524f05495fdd87bfac6fbecc8-C001-57" ], [ "31e8c524f05495fdd87bfac6fbecc8-C001-95" ], [ "31e8c524f05495fdd87bfac6fbecc8-C001-102" ], [ "31e8c524f05495fdd87bfac6fbecc8-C001-253", "31e8c524f05495fdd87bfac6fbecc8-C001-254" ] ], "cite_sentences": [ "31e8c524f05495fdd87bfac6fbecc8-C001-25", "31e8c524f05495fdd87bfac6fbecc8-C001-27", "31e8c524f05495fdd87bfac6fbecc8-C001-28", "31e8c524f05495fdd87bfac6fbecc8-C001-40", "31e8c524f05495fdd87bfac6fbecc8-C001-46", "31e8c524f05495fdd87bfac6fbecc8-C001-57", "31e8c524f05495fdd87bfac6fbecc8-C001-95", "31e8c524f05495fdd87bfac6fbecc8-C001-102", "31e8c524f05495fdd87bfac6fbecc8-C001-253", "31e8c524f05495fdd87bfac6fbecc8-C001-254" ] }, "@BACK@": { "gold_contexts": [ [ "31e8c524f05495fdd87bfac6fbecc8-C001-31" ], [ "31e8c524f05495fdd87bfac6fbecc8-C001-33", "31e8c524f05495fdd87bfac6fbecc8-C001-34" ], [ "31e8c524f05495fdd87bfac6fbecc8-C001-36" ], [ "31e8c524f05495fdd87bfac6fbecc8-C001-37" ], [ 
"31e8c524f05495fdd87bfac6fbecc8-C001-38" ], [ "31e8c524f05495fdd87bfac6fbecc8-C001-55" ], [ "31e8c524f05495fdd87bfac6fbecc8-C001-56" ], [ "31e8c524f05495fdd87bfac6fbecc8-C001-60" ], [ "31e8c524f05495fdd87bfac6fbecc8-C001-68" ], [ "31e8c524f05495fdd87bfac6fbecc8-C001-116" ], [ "31e8c524f05495fdd87bfac6fbecc8-C001-253" ] ], "cite_sentences": [ "31e8c524f05495fdd87bfac6fbecc8-C001-31", "31e8c524f05495fdd87bfac6fbecc8-C001-34", "31e8c524f05495fdd87bfac6fbecc8-C001-36", "31e8c524f05495fdd87bfac6fbecc8-C001-37", "31e8c524f05495fdd87bfac6fbecc8-C001-38", "31e8c524f05495fdd87bfac6fbecc8-C001-55", "31e8c524f05495fdd87bfac6fbecc8-C001-56", "31e8c524f05495fdd87bfac6fbecc8-C001-60", "31e8c524f05495fdd87bfac6fbecc8-C001-68", "31e8c524f05495fdd87bfac6fbecc8-C001-116", "31e8c524f05495fdd87bfac6fbecc8-C001-253" ] }, "@MOT@": { "gold_contexts": [ [ "31e8c524f05495fdd87bfac6fbecc8-C001-39" ], [ "31e8c524f05495fdd87bfac6fbecc8-C001-116", "31e8c524f05495fdd87bfac6fbecc8-C001-117", "31e8c524f05495fdd87bfac6fbecc8-C001-118" ], [ "31e8c524f05495fdd87bfac6fbecc8-C001-149", "31e8c524f05495fdd87bfac6fbecc8-C001-150", "31e8c524f05495fdd87bfac6fbecc8-C001-151" ] ], "cite_sentences": [ "31e8c524f05495fdd87bfac6fbecc8-C001-39", "31e8c524f05495fdd87bfac6fbecc8-C001-116", "31e8c524f05495fdd87bfac6fbecc8-C001-117", "31e8c524f05495fdd87bfac6fbecc8-C001-149" ] }, "@FUT@": { "gold_contexts": [ [ "31e8c524f05495fdd87bfac6fbecc8-C001-39" ], [ "31e8c524f05495fdd87bfac6fbecc8-C001-166", "31e8c524f05495fdd87bfac6fbecc8-C001-167", "31e8c524f05495fdd87bfac6fbecc8-C001-168" ], [ "31e8c524f05495fdd87bfac6fbecc8-C001-277" ] ], "cite_sentences": [ "31e8c524f05495fdd87bfac6fbecc8-C001-39", "31e8c524f05495fdd87bfac6fbecc8-C001-166", "31e8c524f05495fdd87bfac6fbecc8-C001-277" ] }, "@DIF@": { "gold_contexts": [ [ "31e8c524f05495fdd87bfac6fbecc8-C001-39" ], [ "31e8c524f05495fdd87bfac6fbecc8-C001-225" ], [ "31e8c524f05495fdd87bfac6fbecc8-C001-234" ], [ "31e8c524f05495fdd87bfac6fbecc8-C001-249" ] ], "cite_sentences": [ 
"31e8c524f05495fdd87bfac6fbecc8-C001-39", "31e8c524f05495fdd87bfac6fbecc8-C001-225", "31e8c524f05495fdd87bfac6fbecc8-C001-234", "31e8c524f05495fdd87bfac6fbecc8-C001-249" ] }, "@SIM@": { "gold_contexts": [ [ "31e8c524f05495fdd87bfac6fbecc8-C001-44" ], [ "31e8c524f05495fdd87bfac6fbecc8-C001-112" ], [ "31e8c524f05495fdd87bfac6fbecc8-C001-120", "31e8c524f05495fdd87bfac6fbecc8-C001-121" ], [ "31e8c524f05495fdd87bfac6fbecc8-C001-132", "31e8c524f05495fdd87bfac6fbecc8-C001-133" ], [ "31e8c524f05495fdd87bfac6fbecc8-C001-137", "31e8c524f05495fdd87bfac6fbecc8-C001-138" ], [ "31e8c524f05495fdd87bfac6fbecc8-C001-203" ], [ "31e8c524f05495fdd87bfac6fbecc8-C001-215" ], [ "31e8c524f05495fdd87bfac6fbecc8-C001-218" ], [ "31e8c524f05495fdd87bfac6fbecc8-C001-260" ], [ "31e8c524f05495fdd87bfac6fbecc8-C001-277" ] ], "cite_sentences": [ "31e8c524f05495fdd87bfac6fbecc8-C001-44", "31e8c524f05495fdd87bfac6fbecc8-C001-112", "31e8c524f05495fdd87bfac6fbecc8-C001-120", "31e8c524f05495fdd87bfac6fbecc8-C001-132", "31e8c524f05495fdd87bfac6fbecc8-C001-137", "31e8c524f05495fdd87bfac6fbecc8-C001-203", "31e8c524f05495fdd87bfac6fbecc8-C001-215", "31e8c524f05495fdd87bfac6fbecc8-C001-218", "31e8c524f05495fdd87bfac6fbecc8-C001-260", "31e8c524f05495fdd87bfac6fbecc8-C001-277" ] }, "@USE@": { "gold_contexts": [ [ "31e8c524f05495fdd87bfac6fbecc8-C001-57" ], [ "31e8c524f05495fdd87bfac6fbecc8-C001-71" ], [ "31e8c524f05495fdd87bfac6fbecc8-C001-137", "31e8c524f05495fdd87bfac6fbecc8-C001-138" ], [ "31e8c524f05495fdd87bfac6fbecc8-C001-153" ], [ "31e8c524f05495fdd87bfac6fbecc8-C001-201" ], [ "31e8c524f05495fdd87bfac6fbecc8-C001-275" ] ], "cite_sentences": [ "31e8c524f05495fdd87bfac6fbecc8-C001-57", "31e8c524f05495fdd87bfac6fbecc8-C001-71", "31e8c524f05495fdd87bfac6fbecc8-C001-137", "31e8c524f05495fdd87bfac6fbecc8-C001-153", "31e8c524f05495fdd87bfac6fbecc8-C001-201", "31e8c524f05495fdd87bfac6fbecc8-C001-275" ] } } }, "ABC_e90c9a93ec445a636fcee924306d95_2": { "x": [ { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-1", 
"text": "**ABSTRACT**" }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-2", "text": "The current state-of-the-art for image annotation and image retrieval tasks is obtained through deep neural networks, which combine an image representation and a text representation into a shared embedding space." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-3", "text": "In this paper we evaluate the impact of using the Full-Network embedding in this setting, replacing the original image representation in a competitive multimodal embedding generation scheme." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-4", "text": "Unlike the one-layer image embeddings typically used by most approaches, the Full-Network embedding provides a multi-scale representation of images, which results in richer characterizations." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-5", "text": "To measure the influence of the Full-Network embedding, we evaluate its performance on three different datasets, and compare the results with the original multimodal embedding generation scheme when using a one-layer image embedding, and with the rest of the state-of-the-art." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-6", "text": "Results for image annotation and image retrieval tasks indicate that the Full-Network embedding is consistently superior to the one-layer embedding." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-7", "text": "These results motivate the integration of the FullNetwork embedding on any multimodal embedding generation scheme, something feasible thanks to the flexibility of the approach." 
}, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-8", "text": "----------------------------------" }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-10", "text": "Image annotation (also known as caption retrieval) is the task of automatically associating an input image with a describing text." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-11", "text": "Image annotation methods are an emerging technology, enabling semantic image indexing and search applications." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-12", "text": "The complementary task of associating an input text with a fitting image (known as image retrieval or image search) is also of relevance for the same sort of applications." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-13", "text": "State-of-the-art image annotation methods are currently based on deep neural net representations, where an image embedding (e.g., obtained from a convolutional neural network or CNN) and a text embedding (e.g., obtained from a recurrent neural network or RNN) are combined into a unique multimodal embedding space." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-14", "text": "While several techniques for merging both spaces have been proposed [1, 2, 3, 4, 5, 6] , little effort has been made in finding the most appropriate image embeddings to be used in that process." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-15", "text": "In fact, most approaches simply use a one-layer CNN embedding [7, 8] ." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-16", "text": "In this paper we explore the impact of using a Full-Network embedding (FNE) [9] to generate the required image embedding, replacing the one-layer embedding." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-17", "text": "We do so by integrating the FNE into the multimodal embedding pipeline defined by Kiros et al. 
[1] , which is based on the use of a Gated Recurrent Units neural network (GRU) [10] for text encoding and a CNN for image encoding." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-18", "text": "Unlike one-layer embeddings, the FNE represents features of varying specificity in the context of the visual dataset, while also discretizing the features to regularize the space and alleviate the curse of dimensionality." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-19", "text": "These particularities result in a richer visual embedding space, which may be more reliably mapped to a common visual-textual embedding space." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-20", "text": "The generic pipeline defined by Kiros et al. [1] has been recently outperformed in image annotation and image search tasks by methods specifically targeting one of those tasks [4, 18] ." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-21", "text": "We choose to test our contribution on this pipeline for its overall competitive performance, expecting that any conclusion may generalize when applied to other solutions and tasks (e.g., caption generation)." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-22", "text": "This assumption would be weaker if a more problem-specific methodology were chosen instead." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-23", "text": "Our main goal is to establish the competitiveness of the FNE as an image representation to be used in caption related tasks." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-24", "text": "We test the suitability of this approach by evaluating its performance on both image annotation and image retrieval using three publicly available datasets: Flickr8k [11] , Flickr30k [12] and MSCOCO [13] ." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-25", "text": "Results obtained by the pipeline including the FNE are compared with the original pipeline of Kiros et al. 
[1] using a one-layer embedding, and also with the methods currently obtaining state-of-the-art results on the three datasets." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-26", "text": "----------------------------------" }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-27", "text": "**RELATED WORK**" }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-28", "text": "In the last few years, several solutions have been proposed to the problem of building common representations for images and text with the goal of enabling cross-domain search [1, 2, 3, 4, 5] ." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-29", "text": "This paper builds upon the methodology described by Kiros et al. [1] , which is in turn based on previous works in the area of Neural Machine Translation [14] ." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-30", "text": "In their work, Kiros et al. [1] define a vectorized representation of an input text by using GRU RNNs." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-31", "text": "In this setting, each word in the text is codified into a vector using a word dictionary, vectors which are then fed one by one into the GRUs." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-32", "text": "Once the last word vector has been processed, the activations of the GRUs at the last time step conveys the representation of the whole input text in the multimodal embedding space." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-33", "text": "In parallel, images are processed through a Convolutional Neural Network (CNN) pre-trained on ImageNet [15] , extracting the activations of the last fully connected layer to be used as a representation of the images." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-34", "text": "To solve the dimensionality matching between both representations (the output of the GRUs and the last fully-connected of the CNN) an affine transformation is applied on the image representation." 
}, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-35", "text": "Similarly to the approach of Kiros et al. [1] , most image annotation and image retrieval approaches rely on the use of CNN features for image representation." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-36", "text": "The current best overall performing model (considering both image annotation and image retrieval tasks) is the Fisher Vector (FV) [4] , although its performance is most competitive on the image retrieval task." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-37", "text": "FV are computed with respect to the parameters of a Gaussian Mixture Model (GMM) and an Hybrid Gaussian-Laplacian Mixture Model (HGLMM)." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-38", "text": "For both images and text, FV are build using deep neural network features; a VGG [16] CNN for images features, and a word2vec [17] for text features." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-39", "text": "For the specific problem of image annotation, the current state-of-art is obtained with the Word2VisualVec (W2VV) model [18] ." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-40", "text": "This approach uses as a multimodal embedding space the same visual space where images are represented, involving a deeper text processing." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-41", "text": "Finally for the largest dataset we consider (MSCOCO), the best results in certain metrics are obtained by MatchCNN (m-CNN) [5] , which is based on the use of CNNs to encode both image and text." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-42", "text": "the FNE, which results in the architecture shown in Figure 1 ." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-43", "text": "Next we describe these components in further detail." 
}, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-44", "text": "----------------------------------" }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-45", "text": "**FULL-NETWORK EMBEDDING**" }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-46", "text": "The FNE generates a representation of an input image by processing it through a pre-trained CNN, and extracting the neural activations of all convolutional and fully-connected layers." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-47", "text": "After the initial feature extraction process, the FNE performs a dimensionality reduction step for convolutional activations, by applying a spatial average pooling on each convolutional filter." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-48", "text": "After the spatial pooling, every feature (from both convolutional and fully-connected layers) is standardized through the z-values, which are computed over the whole image train set." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-49", "text": "This standardization process puts the value of the each feature in the context of the dataset." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-50", "text": "At this point, the meaning of a single feature value is the degree with which the feature value is atypically high (if positive) or atypically low (if negative) in the context of the dataset." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-51", "text": "Zero marks the typical behavior." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-52", "text": "The last step in the FNE pipeline is a feature discretitzation process." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-53", "text": "The previously standardized embedding is usually of large dimensionality (e.g., 12,416 features for VGG16) which entails problems related with the curse of dimensionality." 
}, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-54", "text": "The usual approach to address this issue is to apply some dimensionality reduction methods (e.g., PCA) [19, 20] ." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-55", "text": "FNE uses a different approach, reducing expressiveness through the discretization of features, while keeping the dimensionality." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-56", "text": "Specifically, the FNE discretization maps the feature values to the {\u22121, 0, 1} domain, where -1 indicates an unusually low value (i.e., the feature is significant by its absence for an image in the context of the dataset), 0 indicates that the feature has an average value (i.e., the feature is not significant) and 1 indicates an uncommonly high activation (i.e., the feature is significant by its presence for an image in the context of the dataset)." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-57", "text": "The mapping of standardized values into these three categories is done through the definition of two constant thresholds." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-58", "text": "The optimal values of these thresholds can be found empirically for a labeled dataset [21] ." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-59", "text": "Instead, we use threshold values shown to perform consistently across several domains [9] ." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-60", "text": "----------------------------------" }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-61", "text": "**MULTIMODAL EMBEDDING**" }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-62", "text": "In our approach, we integrate the FNE with the multimodal embedding pipeline of Kiros et al. [1] ." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-63", "text": "To do so we use the FNE to obtain an image representation instead of the output of the last layer of a CNN, as the original model does." 
}, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-64", "text": "The encoder architecture processing the text is used as it is, using a GRUs recurrent neural network to encode the sentences." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-65", "text": "To combine both embeddings, Kiros et al. [1] use an affine transformation on the image representation (in our case, the FNE) identical to a fully connected neural network layer." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-66", "text": "This extra layer is trained simultaneously with the GRUs." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-67", "text": "The elements of the multimodal pipeline that are tuned during the training phase of the model are shown in orange in Figure 1 ." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-68", "text": "In simple terms, the training procedure consist on the optimization of the pairwise ranking loss between the correct image-caption pair and a random pair." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-69", "text": "Assuming that a correct pair of elements should be closer in the multimodal space than a random pair." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-70", "text": "The loss L can be formally defined as follows:" }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-71", "text": "Where i is an image vector, c is its correct caption vector, and i k and c k are sets of random images and captions respectively." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-72", "text": "The operator s(\u2022, \u2022) defines the cosine similarity." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-73", "text": "This formulation includes a margin term \u03b1 to avoid pulling the image and caption closer once their distance is smaller than the margin." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-74", "text": "This makes the optimization focus on distant pairs instead of improving the ones that are already close." 
}, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-75", "text": "----------------------------------" }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-76", "text": "**EXPERIMENTS**" }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-77", "text": "In this section we evaluate the impact of using the FNE in a multimodal pipeline (FN-MME) for both image annotation and image retrieval tasks." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-78", "text": "To properly measure the relevance of the FNE, we compare the results of the FN-MME with those of the original multimodal pipeline reported by Kiros et al. [1] (CNN-MME)." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-79", "text": "Additionally, we define a second baseline by using the original multimodal pipeline with a training configuration closer to the one used for the FNE experiments (i.e., same source CNN, same MME dimensionality, etc.)." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-80", "text": "We refer to this second baseline as CNN-MME*." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-81", "text": "----------------------------------" }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-82", "text": "**DATASETS**" }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-83", "text": "In our experiments we use three different publicly available datasets:" }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-84", "text": "The Flickr8K dataset [11] contains 8,000 hand-selected images from Flickr, depicting actions and events." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-85", "text": "Five correct captions are provided for each image." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-86", "text": "Following the provided splits, 6,000 images are used for train, 1,000 are used in validation and 1,000 more are kept for testing." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-87", "text": "The Flickr30K dataset [12] is an extension of Flickr8K." 
}, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-88", "text": "It contains 31,783 photographs of everyday activities, events and scenes." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-89", "text": "Five correct captions are provided for each image." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-90", "text": "In our experiments 29,000 images are used for training, 1,014 conform the validation set and 1,000 are kept for test." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-91", "text": "These splits are the same ones used by Kiros et al. [1] and by Karpathy and Fei-Fei [22] ." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-92", "text": "The MSCOCO dataset [13] includes images of everyday scenes containing common objects in their natural context." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-93", "text": "For captioning, 82,783 images and 413,915 captions are available for training, while 40,504 images and 202,520 captions are available for validation." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-94", "text": "Captions from the test set are not publicly available." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-95", "text": "Previous contributions consider using a subset of the validation set for validation and a different subset for test." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-96", "text": "In most cases, such subsets are composed by either 1,000 or 5,000 images for each set, with their corresponding 5 captions per image." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-97", "text": "In our experiments we consider both settings." 
}, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-98", "text": "----------------------------------" }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-99", "text": "**IMPLEMENTATION AND EVALUATION DETAILS**" }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-100", "text": "The caption sentences are word-tokenized using the Natural Language Toolkit (NLTK) for Python [23] ." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-101", "text": "The choice of the word embedding size and the number of GRUs has been analyzed to obtain a range of suitable parameters to test in the validation set." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-102", "text": "The total number of different words is 8,919 for Flickr8k, 22,962 for Flickr30k and 32,775 for MSCOCO." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-103", "text": "Using all the words present in the dataset is likely to produce overfitting problems when training on examples containing words that only occur a few times." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-104", "text": "This overfitting problem may not have a huge impact on performance, but it may add undesired noise in the multimodal representation." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-105", "text": "The original setup [1] limited the word embedding to the 300 most frequent words, while using 300 GRUs." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-106", "text": "The Bi-LSTM model [25] in contrast defines the vocabulary size to include words appearing more than 5 time in the dataset, leading to dictionaries of size 2,018 for Flickr8k, 7,400 for Flickr30k and 8,801 for MSCOCO." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-107", "text": "Our own preliminary experiments on the validation set showed that increasing multimodal space dimensionality and dictionary length slightly improved the performance of image retrieval, in detriment of image annotation." 
}, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-108", "text": "However, the combined performance difference remains rather small when using non-extreme parameter values (e.g., a model with 10,000 words vocabulary on MSCOCO dataset show a 0.4% average recall reduction when compared with a 2,000 words model)." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-109", "text": "Since we are building a model for solving both tasks, we kept the parameters obtaining the highest combined score in the validation set." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-110", "text": "For the Flickr datasets, the word embedding is limited to the 1,000 most frequent words." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-111", "text": "For the MSCOCO dataset, we use a larger dictionary, considering the 2,000 most frequent words." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-112", "text": "In both cases we use 2,048 GRUs, which is also the dimensionality of the resultant multimodal embedding space." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-113", "text": "For generating the image embedding we use the classical VGG16 CNN architecture [16] as source model pretrained for ImageNet [15] ." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-114", "text": "This architecture is composed of 16 convolutional layers combined with pooling layers followed by two fully connected layers and the final softmax output layer." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-115", "text": "When using the FNE, this results in a image embedding space of 12,416 dimensions." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-116", "text": "On all our experiments (both the CNN-MME* and the FN-MME) the margin parameter \u03b1 is set to 0.2, and the batch size to 128 image-caption pairs." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-117", "text": "Within the same batch, every possible alternative image-caption pair is used as contrasting example." 
}, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-118", "text": "The models are trained up to 25 epochs, and the best performing model on the validation set is chosen (i.e., early stopping)." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-119", "text": "We use gradient clipping for the GRUs with a threshold of 2." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-120", "text": "We use ADAM [24] as optimization algorithm, with a learning rate of 0.0002 for the Flickr datasets, and 0.00025 for MSCOCO." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-121", "text": "To evaluate both image annotation and image retrieval we use the following metrics:" }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-122", "text": "\u2022 Recall@K (R@K) is the fraction of images for which a correct caption is ranked within the top-K retrieved results (and vice-versa for sentences)." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-123", "text": "Results are provided for R@1, R@5 and R@10." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-124", "text": "\u2022 Median rank (Med r) of the highest ranked ground truth result." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-125", "text": "----------------------------------" }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-126", "text": "**RESULTS**" }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-127", "text": "For both image annotation and image retrieval tasks on the Flickr8k dataset, Table 1 shows the results of the proposed FN-MME, the reported results of the original model CNN-MME, the results of the original model when using our configuration CNN-MME*, and the current state-of-the-art (SotA)." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-128", "text": "Tables 2 and 3 are analogous for the Flickr30k and MSCOCO datasets." 
}, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-129", "text": "Additional results of the CNN-MME model were made publicly available later on by the original authors [26] ." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-130", "text": "We include these for the MSCOCO dataset, which was not evaluated in the original paper [1] ." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-131", "text": "First, let us consider the impact of using the FNE." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-132", "text": "On all cases, the multimodal pipeline proposed by Kiros et al. [1] obtains equal or better results when using the FNE." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-133", "text": "This is the case for the originally reported results (CNN-MME), for the results made available later on by the original authors (CNN-MME \u2020), and for the experiments we do using same configuration as the FN-MME (CNN-MME*)." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-134", "text": "The comparison we consider to be the most relevant is the FN-MME against the CNN-MME*, as these contain the least differences besides the image embedding being used." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-135", "text": "In this particular case, the FN-MME outperforms the CNN-MME* by 3 percentual points on average for the Flickr datasets, and roughly by 4 points for the MSCOCO dataset." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-136", "text": "To measure the relevance of the improvement provided by using the FNE, we compare the FN-MME model with the current state-of-the-art for image annotation and image retrieval." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-137", "text": "For the Flickr datasets, particularly for image annotation tasks, the performance of the FN-MME is significantly closer to the state of the art than the other variants of the same model (CNN-MME, CNN-MME \u2020, CNN-MME*)." 
}, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-138", "text": "Remarkably, the FN-MME provides the best reported results on image annotation for the MSCOCO dataset." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-139", "text": "However, let us remark that the competitive W2VV method [18] has no reported results for MSCOCO." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-140", "text": "The results of the FN-MME for image retrieval tasks are significantly further from the stateof-the-art." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-141", "text": "Overall, the competitiveness of FN-MME increases with dataset size." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-142", "text": "----------------------------------" }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-143", "text": "**CONCLUSIONS**" }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-144", "text": "For the multimodal pipeline of Kiros et al. [1] , using the Full-Network image embedding results in consistently higher performances than using a one-layer image embedding." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-145", "text": "These results suggest that the visual representation provided by the FNE is superior to the current standard for the construction of most multimodal embeddings." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-146", "text": "When compared to the current state-of-the-art, the results obtained by the FN-MME are significantly less competitive than problem-specific methods." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-147", "text": "Since this happens for all models using the same pipeline (CNN-MME, CNN-MME \u2020, CNN-MME*), these results indicate that the original architecture of Kiros et al. [1] is itself outperformed in general by more problem-specific techniques." 
}, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-148", "text": "Since the FNE is compatible with most multimodal pipelines based on CNN embeddings, as future work of this paper we intend to evaluate the performance of the FNE when integrated into the current state-of-the-art on image annotation (W2VV [18] ) and image retrieval (FV [4] )." }, { "sent_id": "e90c9a93ec445a636fcee924306d95-C001-149", "text": "If the boost in performance obtained by the FNE on the Kiros et al. [1] pipeline translates to these other methods, such combination would be likely to define new state-of-the-art results on both tasks." } ], "y": { "@BACK@": { "gold_contexts": [ [ "e90c9a93ec445a636fcee924306d95-C001-14" ], [ "e90c9a93ec445a636fcee924306d95-C001-15" ], [ "e90c9a93ec445a636fcee924306d95-C001-20" ], [ "e90c9a93ec445a636fcee924306d95-C001-28" ], [ "e90c9a93ec445a636fcee924306d95-C001-30" ], [ "e90c9a93ec445a636fcee924306d95-C001-35" ], [ "e90c9a93ec445a636fcee924306d95-C001-65" ], [ "e90c9a93ec445a636fcee924306d95-C001-67" ] ], "cite_sentences": [ "e90c9a93ec445a636fcee924306d95-C001-14", "e90c9a93ec445a636fcee924306d95-C001-15", "e90c9a93ec445a636fcee924306d95-C001-20", "e90c9a93ec445a636fcee924306d95-C001-28", "e90c9a93ec445a636fcee924306d95-C001-30", "e90c9a93ec445a636fcee924306d95-C001-35", "e90c9a93ec445a636fcee924306d95-C001-65", "e90c9a93ec445a636fcee924306d95-C001-67" ] }, "@EXT@": { "gold_contexts": [ [ "e90c9a93ec445a636fcee924306d95-C001-17" ], [ "e90c9a93ec445a636fcee924306d95-C001-29" ], [ "e90c9a93ec445a636fcee924306d95-C001-62" ], [ "e90c9a93ec445a636fcee924306d95-C001-79" ] ], "cite_sentences": [ "e90c9a93ec445a636fcee924306d95-C001-17", "e90c9a93ec445a636fcee924306d95-C001-29", "e90c9a93ec445a636fcee924306d95-C001-62", "e90c9a93ec445a636fcee924306d95-C001-79" ] }, "@USE@": { "gold_contexts": [ [ "e90c9a93ec445a636fcee924306d95-C001-21" ], [ "e90c9a93ec445a636fcee924306d95-C001-91" ] ], "cite_sentences": [ "e90c9a93ec445a636fcee924306d95-C001-21", 
"e90c9a93ec445a636fcee924306d95-C001-91" ] }, "@MOT@": { "gold_contexts": [ [ "e90c9a93ec445a636fcee924306d95-C001-21" ] ], "cite_sentences": [ "e90c9a93ec445a636fcee924306d95-C001-21" ] }, "@DIF@": { "gold_contexts": [ [ "e90c9a93ec445a636fcee924306d95-C001-25" ], [ "e90c9a93ec445a636fcee924306d95-C001-62", "e90c9a93ec445a636fcee924306d95-C001-63" ], [ "e90c9a93ec445a636fcee924306d95-C001-65" ], [ "e90c9a93ec445a636fcee924306d95-C001-78" ], [ "e90c9a93ec445a636fcee924306d95-C001-105" ], [ "e90c9a93ec445a636fcee924306d95-C001-127" ], [ "e90c9a93ec445a636fcee924306d95-C001-130" ], [ "e90c9a93ec445a636fcee924306d95-C001-132" ], [ "e90c9a93ec445a636fcee924306d95-C001-144" ], [ "e90c9a93ec445a636fcee924306d95-C001-147" ] ], "cite_sentences": [ "e90c9a93ec445a636fcee924306d95-C001-25", "e90c9a93ec445a636fcee924306d95-C001-62", "e90c9a93ec445a636fcee924306d95-C001-63", "e90c9a93ec445a636fcee924306d95-C001-65", "e90c9a93ec445a636fcee924306d95-C001-78", "e90c9a93ec445a636fcee924306d95-C001-105", "e90c9a93ec445a636fcee924306d95-C001-127", "e90c9a93ec445a636fcee924306d95-C001-130", "e90c9a93ec445a636fcee924306d95-C001-132", "e90c9a93ec445a636fcee924306d95-C001-144", "e90c9a93ec445a636fcee924306d95-C001-147" ] }, "@FUT@": { "gold_contexts": [ [ "e90c9a93ec445a636fcee924306d95-C001-149" ] ], "cite_sentences": [ "e90c9a93ec445a636fcee924306d95-C001-149" ] } } }, "ABC_95883b369c4b019fa98493a728c1a0_2": { "x": [ { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-2", "text": "Word embedding spaces are powerful tools for capturing latent semantic relationships between terms in corpora (Mikolov et al. 2013; Pennington, Socher, and Manning 2014) , and have become widely popular for building state-of-the-art natural language processing algorithms." }, { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-3", "text": "However, recent studies have shown that societal biases (e.g., gender, race, age, etc.) 
present in text corpora may be incorporated into the word embedding spaces learned from them as well." }, { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-4", "text": "Thus, there is an ethical concern that human-like biases contained in the corpora and their derived embedding spaces might be propagated, or even amplified, with the usage of the biased embedding spaces in downstream applications." }, { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-5", "text": "In an attempt to quantify these biases so that they may be better understood and studied, several bias metrics have been proposed." }, { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-6", "text": "We explore the statistical properties of these proposed measures in the context of their cited applications as well as their supposed utilities." }, { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-7", "text": "We find that there are significant caveats to the simple interpretation of these metrics as proposed, and that some applications of these metrics in well-cited works may be erroneous." }, { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-8", "text": "Specifically, we find that the bias metric proposed by (Bolukbasi et al. 2016) is highly sensitive to embedding hyper-parameter selection, and that in many cases, the variance due to the selection of some hyper-parameters, notably the embedding space dimensionality, is greater than the variance in the metric due to corpus selection, while in fewer cases, even the relative rankings of the bias measured in the embedding spaces of various corpora vary with hyper-parameter selection." }, { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-9", "text": "In light of these observations, it may be the case that bias estimates should not be thought to directly reflect the properties of the underlying corpus, but rather the properties of the specific embedding spaces in question, particularly in the context of hyper-parameter selections used to generate them."
}, { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-10", "text": "Hence, bias metrics of spaces generated with differing hyper-parameters should be compared only with explicit consideration of the embedding-learning algorithms' particular configurations." }, { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-11", "text": "While it may be possible to use embedding spaces generated with a controlled hyperparameter configuration to rank corpora in terms of the quantity of bias contained, the numerical value of this bias metric has poor stability across model configurations and * The two first authors contributed equally." }, { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-12", "text": "----------------------------------" }, { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-13", "text": "**INTRODUCTION**" }, { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-14", "text": "Word embeddings are widely used for their ability to capture the semantic meanings of terms within a corpus." }, { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-32", "text": "As one can observe, this bias measure identifies some profession names such as commander and nurse, which historically were predominantly male and female jobs, respectively." }, { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-15", "text": "They are widely praised as useful tools for generating features for use in natural language processing systems, and recently, some researchers are attempting to study the structures of these embedding spaces to learn about the fundamental linguistic properties, or even the semantic content of a corpus." 
}, { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-16", "text": "While many of the properties of embedding spaces are useful and socially benign, a subset of these properties can reveal latent relationships between terms which could be socially problematic, and may be of interest to researchers for study, or possibly lead to risks of propagating the implicit biases of a corpus if these socially problematic term relationships are used in downstream machine learning applications which would ideally be free of such implicit biases." }, { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-17", "text": "Qualitatively, when inspecting term-analogical relationships as is done in (Garg et al. 2018) , it is difficult to deny the existence of these biases, but the task of quantitatively capturing them in a canonical measure has been a topic of recent study." }, { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-18", "text": "Initial attempts have been made to develop metrics which seek to describe the geometric properties of the embedding space with respect to various axes of interest which are empirically determined to correspond to our intuitions of the hypothetical biases under study to quantify the degrees to which various biases exist within the embedding space, and presumably, the underlying text corpus (Bolukbasi et al. 2016; Caliskan, Bryson, and Narayanan 2017; Garg et al. 2018) ." }, { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-19", "text": "For instance, using such a bias measure, (Bolukbasi et al. 2016) concluded that \"word embeddings trained on Google News articles exhibit female/male gender stereotypes to a disturbing extent\"." }, { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-20", "text": "To the end of quantifying these biases, (Caliskan, Bryson, and Narayanan 2017) proposed and used a bias measure to report that \"text corpora contain recoverable and accurate imprints of our historic biases\"." 
}, { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-21", "text": "The genesis of these biases via the underlying text corpus is explored in depth in (Brunet et al. 2018) ." }, { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-22", "text": "Although this family of bias metrics is fairly new, initial attempts have been made to refine and explore their validity and robustness as explored in the metric significance research in (Caliskan, Bryson, and Narayanan 2017) , which seeks to determine whether a bias score is significant, given the possibility of having selected various sets of metric-inducing terms." }, { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-23", "text": "----------------------------------" }, { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-24", "text": "**PROPERTIES OF THE BIAS MEASUREMENT TECHNIQUE**" }, { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-25", "text": "In this section, we evaluate the stability of the bias measure developed in (Bolukbasi et al. 2016) which is claimed to measure societal biases in word embeddings." }, { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-26", "text": "For the sake of com-pleteness, we repeat the definition of (Bolukbasi et al. 2016 )'s bias metric:" }, { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-27", "text": "Given two groups of words (e.g., gender words)" }, { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-28", "text": "first groups' subspace direction g is identified as the first principal component of the vectors" }, { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-29", "text": "Consequently, given a set of words W = { w i }, bias for each word is defined as cosine similarity ( w i , g) and direct bias for a given set of words is calculated as" }, { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-30", "text": "In essence, this bias metric measures how closely a given word embedding aligns with a gender axis defined by a first principal component of vectors man-woman, he-she, him-her, etc. 
And direct bias is essentially the average magnitude of bias present in a list of neutral words." }, { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-31", "text": "Figure 1 illustrates the results of the (Bolukbasi et al. 2016) bias detection algorithm for a list of profession names, with word embedding vectors of dimension 256 trained using the Skip-gram algorithm (Mikolov et al. 2013) on a sample of 23k Wikipedia articles with a 50k-term vocabulary." }, { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-33", "text": "Moreover, we trained word embeddings with differing dimension on the same sampled Wikipedia corpus using the Skip-gram algorithm, and calculated Kendall tau rank correlation coefficients for the bias metrics corresponding to terms in the above list of professions." }, { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-34", "text": "As illustrated in Figure 2, we found that though rankings for low dimensional embeddings were unstable, for larger dimensions (\u2265 128) the rankings of the biases of the profession terms achieve superior Kendall tau scores." }, { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-35", "text": "Although this bias measure nearly preserves term-bias-rankings, particularly for highly dimensional embedding configurations, we found that the bias measure decays, approaching zero as the space dimensionality increases." }, { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-36", "text": "In Figure 3a, we have depicted bias score densities for word embeddings trained for a range of dimensions using the Skip-gram algorithm on the same text corpus." }, { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-37", "text": "One can observe that bias score densities change substantially with changing embedding dimension." }, { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-38", "text": "Next, we evaluated the stability of the direct bias measure proposed in (Bolukbasi et al. 2016) ."
}, { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-39", "text": "The authors assert that the direct bias measure can be used as a metric to conclude how much an embedding is biased." }, { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-40", "text": "For instance, for word embeddings trained on Google News articles, they reported that direct gender bias on 327 profession names is 0.08 and thus they concluded that this embedding is biased." }, { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-41", "text": "However, as we have illustrated in Figure 3a , this bias score is not stable, i.e., direct bias measure decays exponentially with increasing word embedding dimension." }, { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-42", "text": "Moreover, the change in the direct bias, the average bias magnitude in a word embedding, varies more with model hyper-parameter configuration than with corpus selection, and in certain instances, notably Twitter vs. Wikipedia, the bias ranking of corpora may also change with altering embedding dimension." }, { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-43", "text": "This, in turn, suggests that this bias metric magnitude corresponds less directly to the inherent bias of the text corpus than to to the measurable bias resultant of the embedding training algorithm hyper-parameter selections employed." }, { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-44", "text": "The decrease of bias measure with increasing embedding dimension could lead to the conclusion that you can reduce direct bias from word embeddings by increasing the dimensionality." }, { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-45", "text": "However, we believe the downward slope we see in the plots is merely an effect of the properties of the cosine similarity metric in high dimension spaces." 
}, { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-46", "text": "In particular, in Figure 4 we observe that arbitrary pairs of word embeddings become less and less similar to each other as the number of dimensions increases." }, { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-47", "text": "We feel additional research regarding whether apparently-low-bias, high dimension embedding spaces regain their measurable bias when projected into smaller dimension spaces wold help further understand this phenomenon." }, { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-48", "text": "Future work attempting to develop a better canonical bias metric for corpora should seek to be less sensitive to model training hyper-parameters, accounting for the properties of the cosine similarity at various dimensionalities." }, { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-49", "text": "In addition to the explored sensitivity to the dimension of the embedding space, we note that there is some degree of sensitivity of the proposed metric to the sample of terms used to induce the axis onto which term vectors are projected to determine their bias." }, { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-50", "text": "While it is not difficult to hypothesize various words which are supposed to be either ideally neutral terms W , or ideally bias-axis-aligned G 1 , G 2 , or as in (Bolukbasi et al. 2016) , have these term sets evaluated by a crowd, it is much more difficult to argue the canonicity of a given term set W , G 1 or G 2 . If it is not possible to defend the term set selections used to define the metric as being canonical, it is better for them to be mathematically regarded as a sample of the canonical term set." 
}, { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-51", "text": "In this light, it becomes critical to understand the sensitivity of the proposed metric to the particular term set sampled, and ideally, characterize the distribution of the metric under many such samples as partially explored in (Antoniak and Mimno 2018) ." }, { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-52", "text": "In examining several online corpora, we observe that although the degree of variance due to term selection at a given embedding dimensionality is small compared to that due to the choice of embedding dimension, we feel that the incorporation of a description of the variance of the bias metric would be prudent to allow for the development of a notion of statistical significance of bias comparisons between corpora." }, { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-53", "text": "In Figure 5 , we show that for several online corpora 2 , the variance of the bias metric under bootstrap samples of G 1 , G 2 , and W is large compared to the inter-group mean differences, making statements comparing these mean bias estimates suspect if reported in absence of an account for the variance of the metric under sampled term sets." }, { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-54", "text": "We note that without accounting for the bias metric variance due to target term sample, one would be led to conclude that the bias of reddit-politics.glove d200 is greater than the bias of reddit-pics.glove d200, while in fact, this difference is probably not significant (p > 0.5), whereas reddit-programming.glove d200 is probably significantly more biased than redditpolitics.glove d200 (p < 0.001) even when accounting for the variance of the bias metric due to target term sample." 
}, { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-55", "text": "----------------------------------" }, { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-56", "text": "**CONCLUDING REMARKS**" }, { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-57", "text": "We conclude that while meta analyses of the bias metrics proposed by (Bolukbasi et al. 2016) indicate that the metrics capture and somewhat quantify sociologically meaningful biases present in learned embedding spaces, the metrics are highly sensitive to the hyperparameter configurations of the algorithms used to learn them." }, { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-58", "text": "We specifically find that the average magnitude of the quantified bias is particularly sensitive to the embedding dimension hyper-parameter selected, as well as the sample of bias-axis-inducing terms used to construct the various projections employed by the metric." }, { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-59", "text": "While it is the case that the bias metrics in (Bolukbasi et al. 2016 ) may provide meaningful rankings of corpora when controlling for model hyper-parameter configuration, publishing the average absolute value of the metric without a complete account for model configuration is suspect." }, { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-60", "text": "Moreover, we feel publications utilizing these bias metrics would benefit from including information regarding the variance or confidence intervals of the metric under bias-term sampling, if the absolute values of the metric must be published." 
}, { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-61", "text": "Regarding the use of the metric as a model selection tool intended to minimize the bias of downstream models employing the embedding space in question, we feel it is important to understand how the properties of the metric employed vary with embedding dimension, and to critically consider whether increased embedding dimensionality truly reduces implicit biases captured within the space, or whether the increased dimensionality simply reduces the apparent magnitude of the bias to simplistic quantification methods per the properties of the cosine similarity in high dimension spaces." }, { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-62", "text": "Important steps have been taken regarding the quantification of implicit bias contained in embedding spaces, but we feel there is still effort to be made towards developing metrics which are less sensitive to model hyper-parameter selection, and which possess more robust geometric properties if these metrics are to be used for model selection." }, { "sent_id": "95883b369c4b019fa98493a728c1a0-C001-63", "text": "When reporting and discussing these metric values for various corpora, we feel it is necessary to include detailed information regarding the embedding learner's hyper-parameter configuration to improve the utility and interpretability of the results." 
} ], "y": { "@DIF@": { "gold_contexts": [ [ "95883b369c4b019fa98493a728c1a0-C001-8" ], [ "95883b369c4b019fa98493a728c1a0-C001-31", "95883b369c4b019fa98493a728c1a0-C001-32", "95883b369c4b019fa98493a728c1a0-C001-33", "95883b369c4b019fa98493a728c1a0-C001-34" ], [ "95883b369c4b019fa98493a728c1a0-C001-38", "95883b369c4b019fa98493a728c1a0-C001-39", "95883b369c4b019fa98493a728c1a0-C001-40", "95883b369c4b019fa98493a728c1a0-C001-41" ], [ "95883b369c4b019fa98493a728c1a0-C001-57" ], [ "95883b369c4b019fa98493a728c1a0-C001-59" ] ], "cite_sentences": [ "95883b369c4b019fa98493a728c1a0-C001-8", "95883b369c4b019fa98493a728c1a0-C001-31", "95883b369c4b019fa98493a728c1a0-C001-38", "95883b369c4b019fa98493a728c1a0-C001-39", "95883b369c4b019fa98493a728c1a0-C001-40", "95883b369c4b019fa98493a728c1a0-C001-57", "95883b369c4b019fa98493a728c1a0-C001-59" ] }, "@BACK@": { "gold_contexts": [ [ "95883b369c4b019fa98493a728c1a0-C001-18" ], [ "95883b369c4b019fa98493a728c1a0-C001-19" ], [ "95883b369c4b019fa98493a728c1a0-C001-25" ], [ "95883b369c4b019fa98493a728c1a0-C001-26" ] ], "cite_sentences": [ "95883b369c4b019fa98493a728c1a0-C001-18", "95883b369c4b019fa98493a728c1a0-C001-19", "95883b369c4b019fa98493a728c1a0-C001-25", "95883b369c4b019fa98493a728c1a0-C001-26" ] }, "@MOT@": { "gold_contexts": [ [ "95883b369c4b019fa98493a728c1a0-C001-25" ] ], "cite_sentences": [ "95883b369c4b019fa98493a728c1a0-C001-25" ] }, "@UNSURE@": { "gold_contexts": [ [ "95883b369c4b019fa98493a728c1a0-C001-50" ] ], "cite_sentences": [ "95883b369c4b019fa98493a728c1a0-C001-50" ] } } }, "ABC_822b2010b07d3e1103f904fa45388a_2": { "x": [ { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-68", "text": "**RESULTS**" }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-2", "text": "Neural Machine Translation (NMT) is known to outperform Phrase Based Statistical Machine Translation (PBSMT) for resource rich language pairs but not for resource poor ones." 
}, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-3", "text": "Transfer Learning (Zoph et al., 2016 ) is a simple approach in which we can simply initialize an NMT model (child model) for a resource poor language pair using a previously trained model (parent model) for a resource rich language pair where the target languages are the same." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-4", "text": "This paper explores how different choices of parent models affect the performance of child models." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-5", "text": "We empirically show that using a parent model with the source language falling in the same or linguistically similar language family as the source language of the child model is the best." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-6", "text": "----------------------------------" }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-8", "text": "One of the most attractive features of Neural Machine Translation (NMT) (Bahdanau et al., 2015; Cho et al., 2014; Sutskever et al., 2014) is that it is possible to train an end to end system without the need to deal with word alignments, phrase tables and complicated decoding algorithms which are a characteristic of Phrase Based Statistical Machine Translation (PBSMT) systems (Koehn et al., 2003) ." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-9", "text": "It is reported that NMT works better than PBSMT only when there is an abundance of parallel corpora." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-10", "text": "In the case of low resource languages like Hausa, vanilla NMT is either worse than or comparable to PBSMT (Zoph et al., 2016) ." 
}, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-11", "text": "However, it is possible to use a previously trained X-Y model (parent model; X-Y being the resource rich language pair where X and Y represent the source and target languages respectively) to initialize the parameters of a Z-Y model (child model; Z-Y being the resource poor language pair) leading to significant improvements (Zoph et al., 2016) for the latter." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-12", "text": "This paper is about an empirical study of transfer learning for NMT for low resource languages." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-13", "text": "Our main focus is on translation to English for the following low resource languages: Hausa, Uzbek, Marathi, Malayalam, Punjabi, Malayalam, Kazakh, Luxembourgish, Javanese and Sundanese." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-14", "text": "Our main contribution is that we empirically (and exhaustively; within reason) show that using a resource rich language pair in which the source language is linguistically closer to the source language of the resource poor pair is much better than other choices of language pairs." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-15", "text": "----------------------------------" }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-16", "text": "**RELATED WORK**" }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-17", "text": "Transfer learning for NMT (Zoph et al., 2016) is an approach where previously trained NMT models for French and German to English (resource rich pairs) were used to initialize models for Hausa, Uzbek, Spanish to English (resource poor pairs)." 
}, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-18", "text": "They showed that French-English as a parent model was better than German-English when trying to improve the Spanish-English translation quality (since Spanish is linguistically closer to French than German) but they did not conduct an exhaustive investigation for multiple language pairs." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-19", "text": "In this paper we extend this work to explore how language relatedness impacts transfer learning." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-20", "text": "----------------------------------" }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-21", "text": "**OVERVIEW OF TRANSFER LEARNING**" }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-22", "text": "Refer to Figure 1 for an overview of the method." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-23", "text": "It is essentially the same as described in (Zoph et al., 2016) where we learn a model (parent model) for a resource rich language pair (Hindi-English) and use it to initialize the model (child model) for the resource poor pair (Marathi-English)." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-24", "text": "Henceforth the source languages of the parent model and 282 child models will be known as parent and child languages respectively and the corresponding language pairs will be known as the parent and child language pairs respectively." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-25", "text": "The target language vocabulary (English) should be the same for both the parent and the child models." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-47", "text": "We use 5K sentences for evaluation and the rest form the development set." 
}, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-26", "text": "Following the originally proposed method we focused on freezing 1 (by setting gradients to zero) the decoder embeddings and softmax layers when learning child models since they represent the majority of the decoder parameter space." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-27", "text": "This method can easily be applied in cases where we wish to use the X-Y pair to help the Z-Y pair where Y is usually English." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-28", "text": "----------------------------------" }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-29", "text": "**EXPERIMENTAL SETTINGS**" }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-30", "text": "All of our experiments were performed using an encoder-decoder NMT system with attention for the various baselines and transfer learning experiments." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-31", "text": "We used an in house NMT system developed using the Tensorflow (Abadi et al., 2015) framework so as to exploit multiple GPUs to speed up training." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-32", "text": "To ensure replicability we use the same NMT model design as in the original work (Zoph et al., 2016) ." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-33", "text": "In order to enable infinite vocabulary we use the word piece model (WPM) (Schuster and Nakajima, 2012 ) as a segmentation model which is closely related to the Byte Pair Encoding (BPE) based segmentation approach (Sennrich et al., 2016) ." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-34", "text": "We evaluate our models using the standard BLEU (Papineni et al., 2002) metric 2 on the detokenized translations of the test set." 
}, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-35", "text": "However we report the only the difference between the BLEU scores of the transferred and the baseline models since our focus is not on the BLEU scores themselves but rather the improvement by using transfer learning and on observing the language relatedness phenomenon." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-36", "text": "Baseline models are simply ones trained from scratch by initializing the model parameters with random values." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-37", "text": "----------------------------------" }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-38", "text": "**LANGUAGES**" }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-39", "text": "The set of parent languages (and abbreviations) we considered is: Hindi (Hi), Indonesian (Id), Turkish (Tr), Russian (Ru), German (De) and French (Fr)." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-40", "text": "The set of child languages (and abbreviations) consists of: Luxembourgish (Lb), Hausa (Ha), Somali (So), Malayalam (Ml), Punjabi (Pa), Marathi (Mr), Uzbek (Uz), Javanese (Jw), Kazakah (Kk) and Sundanese (Su)." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-41", "text": "Table 1 groups the languages into language families." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-42", "text": "For each child model we try around 3 to 4 parent models out of which one is mostly learned from a linguistically close parent language pair." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-43", "text": "The source languages vary but the target language is always English." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-44", "text": "Since there are no standard training sets for many of these language pairs, we use parallel data automatically mined from the web using an in-house crawler." 
}, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-45", "text": "For evaluation, we use a set of 9K English sentences collected from the web and translated by humans into each of the source languages mentioned above." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-46", "text": "Each sentence has one reference translation." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-48", "text": "To give a rough idea of the corpora sizes consider the WMT14 dataset for German-English which contains around 5M lines of parallel corpora for training." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-49", "text": "The child language pair corpora sizes vary from being one decimal order of magnitude smaller to one decimal order of magnitude larger than the WMT14 German-English corpus." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-50", "text": "However the parent language pair corpora are two to three decimal orders of magnitude larger than the aforementioned dataset." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-51", "text": "From left to right, the languages above are ordered according to the size of their corpora with the leftmost being the one with the smallest dataset." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-52", "text": "Since these datasets are mined from the open web they represent a realistic scenario and hence it should be evident that the child language pairs are truly resource poor." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-53", "text": "Our choice of languages was influenced by two factors:" }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-54", "text": "\u2022 a. We wanted to replicate the basic transfer learning results (Zoph et al., 2016) and hence chose French, German for Hausa and Uzbek." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-55", "text": "\u2022 b. 
We wanted to compare the effects of using parent languages belonging to the same lan" }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-56", "text": "----------------------------------" }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-57", "text": "**SETTINGS**" }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-58", "text": "Following the aforementioned factors influencing our language choices we conducted our experiments in two stages as below:" }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-59", "text": "\u2022 Exhaustive experimentation on 6 child languages (Hausa, Uzbek, Marathi, Malayalam, Punjabi and Somali) by using 4 parent languages (French, German, Russian and Hindi)." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-60", "text": "This was done in order to verify whether there is any language relatedness phenomenon worth exploring or not." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-61", "text": "Based on these experiments we proposed a hypothesis that a parent language from the same or a closely related language family should be a lot more helpful than any other parent language." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-62", "text": "\u2022 Opportunistic experimentation on 4 child languages (Kazakh, Javanese, Sundanese and Luxembourgish) by using 3 parent languages out of which one is from the same language family and the other two are from another language family." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-63", "text": "Turkish being the related language for Kazakh, German for Luxembourgish and Indonesian for Javanese and Sundanese." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-64", "text": "The model and training details are the same as that in the original work (Zoph et al., 2016) Note that the target language (English) vocabulary is same for all settings and the WPM is learned on the English side of the French-English corpus since it is the largest one amongst all our pairs." 
}, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-65", "text": "We deliberately chose this since we wished to maintain the same target side vocabulary for all our experiments (both baseline and transfer) for fair comparison." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-66", "text": "The parent source vocabulary (and hence embeddings) is randomly mapped to child source vocabulary since it was shown that NMT is less sensitive to it (Zoph et al., 2016" }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-67", "text": "----------------------------------" }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-69", "text": "Refer to Table 2 for the results of the exhaustive experimentation round and Table 3 for those of the opportunistic experimentation round." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-70", "text": "As mentioned before we only report the difference between the BLEU scores of the transferred and the baseline model." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-71", "text": "Entries in bold indicate the parent-child pair that performed the best amongst others." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-72", "text": "Furthermore, entries that have an \"*\" mark represent the parent-child pair with a BLEU difference that is statistically significant compared to the BLEU difference of other parent-child pairs." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-73", "text": "----------------------------------" }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-74", "text": "**OBSERVATIONS**" }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-75", "text": "One thing that stood out during the exhaustive experimentation phase (Table 2) is that Hindi as a parent language led to better gains (from +0.57 to +2.8) for all Indian languages as opposed to gains (-1.62 to +1.89) due to other parents." 
}, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-76", "text": "In the case of Marathi all other parent languages led to degradation in performance and Punjabi gained the most (+2.41) from Hindi as a parent where as the gains due to the others were at most +0.8." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-77", "text": "It makes sense that Punjabi being the closest language (linguistically speaking) to Hindi would gain the most followed by Marathi." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-78", "text": "It is also important to note that amongst all parent languages Hindi had the least amount of data and French had the most." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-79", "text": "This led us to believe that beyond a certain amount the size of the training data is not the real factor behind the gains observed due to transfer learning." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-80", "text": "Amongst the child languages Uzbek and Marathi were the most resource abundant ones and hence the gains to the transfer learning (less than 1 BLEU point) are notable only in cases where the baseline systems are not that strong." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-81", "text": "Following this we decided to verify our hypothesis that: \"A parent language from the same (or linguistically similar) language family as the child language will have a larger impact on transfer learning.\" From Table 3 it can be seen that this hypothesis is mostly true." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-82", "text": "The gain (+8.58) in the case of German as a parent for Luxembourgish is quite striking since the latter is known to be closely related to the former." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-83", "text": "Moreover using German gives an additional improvement of around 2 BLEU points over other parents." 
}, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-84", "text": "Indonesian, Javanese and Sundanese are close to each other in the same way that Punjabi is similar to Hindi." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-85", "text": "Thus Indonesian as a parent gives around 1 to 2 BLEU improvement for these language pairs over when other parents are chosen." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-86", "text": "Indonesian, Javanese and Sundanese use the same script but Hindi and Punjabi do not." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-87", "text": "In spite of this Hindi still acts as a better parent as compared to the others which means that the NMT system does learn certain grammatical features which provide the child models with a good prior when transferring the parameters." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-88", "text": "Finally, Kazakh received maximum benefit when using Turkish as a parent but the baseline model for Kazakh was too strong and thus it is difficult to draw any proper conclusion in this case since Hindi as a parent helped almost as much." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-89", "text": "We did try a scenario where Turkish was used as a parent for Uzbek (not in the tables) but failed to see any particular improvement over when other parents are used but it should be noted that, linguistically speaking, Turkish is a lot closer to Kazakh than it is to Uzbek." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-90", "text": "Although we do not give details here due to lack of space transfer learning helps cut down the training time by more than half in most cases since more than half the model is already pre-trained." 
}, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-91", "text": "----------------------------------" }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-92", "text": "**CONCLUSIONS AND FUTURE WORK**" }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-93", "text": "We presented our work on an empirical study of language relatedness for transfer learning in Neural Machine Translation." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-94", "text": "We showed that in general, transfer learning done on a X-Y language pair to Z-Y language pair has maximum impact when Z-Y is resource scarce and when X and Z fall in the same or linguistically similar language family." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-95", "text": "We did exhaustive experimentation to validate our hypothesis and it stands to be true in most cases." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-96", "text": "In the future we would like to experiment with transfer learning where we use Spanish as a parent for Italian with a slight modification where we force the Spanish vocabulary to resemble Italian by applying a segmentation mechanism (like BPE or WPM) trained on Italian to Spanish." }, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-97", "text": "This should help exploit cognates between closely related languages." 
}, { "sent_id": "822b2010b07d3e1103f904fa45388a-C001-98", "text": "285" } ], "y": { "@MOT@": { "gold_contexts": [ [ "822b2010b07d3e1103f904fa45388a-C001-3" ], [ "822b2010b07d3e1103f904fa45388a-C001-53", "822b2010b07d3e1103f904fa45388a-C001-54" ], [ "822b2010b07d3e1103f904fa45388a-C001-66" ] ], "cite_sentences": [ "822b2010b07d3e1103f904fa45388a-C001-3", "822b2010b07d3e1103f904fa45388a-C001-54", "822b2010b07d3e1103f904fa45388a-C001-66" ] }, "@BACK@": { "gold_contexts": [ [ "822b2010b07d3e1103f904fa45388a-C001-10" ], [ "822b2010b07d3e1103f904fa45388a-C001-11" ], [ "822b2010b07d3e1103f904fa45388a-C001-17" ], [ "822b2010b07d3e1103f904fa45388a-C001-18" ] ], "cite_sentences": [ "822b2010b07d3e1103f904fa45388a-C001-10", "822b2010b07d3e1103f904fa45388a-C001-11", "822b2010b07d3e1103f904fa45388a-C001-17" ] }, "@SIM@": { "gold_contexts": [ [ "822b2010b07d3e1103f904fa45388a-C001-23" ], [ "822b2010b07d3e1103f904fa45388a-C001-64" ] ], "cite_sentences": [ "822b2010b07d3e1103f904fa45388a-C001-23", "822b2010b07d3e1103f904fa45388a-C001-64" ] }, "@USE@": { "gold_contexts": [ [ "822b2010b07d3e1103f904fa45388a-C001-22", "822b2010b07d3e1103f904fa45388a-C001-23" ], [ "822b2010b07d3e1103f904fa45388a-C001-32" ] ], "cite_sentences": [ "822b2010b07d3e1103f904fa45388a-C001-23", "822b2010b07d3e1103f904fa45388a-C001-32" ] }, "@EXT@": { "gold_contexts": [ [ "822b2010b07d3e1103f904fa45388a-C001-19" ] ], "cite_sentences": [] } } }, "ABC_6b147afca676882878e67bc10abd58_2": { "x": [ { "sent_id": "6b147afca676882878e67bc10abd58-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-2", "text": "We present a shallow approach to the sentence ordering problem." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-3", "text": "The employed features are based on discourse entities, shallow syntactic analysis, and temporal precedence relations retrieved from VerbOcean." 
}, { "sent_id": "6b147afca676882878e67bc10abd58-C001-4", "text": "We show that these relatively simple features perform well in a machine learning algorithm on datasets containing sequences of events, and that the resulting models achieve optimal performance with small amounts of training data." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-53", "text": "For completeness, we also consider the effects of using verb groups and whole sentences as syntactic units of choice." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-5", "text": "The model does not yet perform well on datasets describing the consequences of events, such as the destructions after an earthquake." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-6", "text": "----------------------------------" }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-8", "text": "Sentence ordering is a problem in many natural language processing tasks." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-9", "text": "While it has, historically, mainly been considered a challenging problem in (concept-to-text) language generation tasks, more recently, the issue has also generated interest within summarization research (Barzilay, 2003; Ji and Pulman, 2006) ." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-10", "text": "In the spirit of the latter, this paper investigates the following questions: (1) Does the topic of the text influence the factors that are important to sentence ordering? (2) Which factors are most important for determining coherent sentence orderings? (3) How much performance is gained when using deeper knowledge resources?" }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-11", "text": "Past research has investigated a wide range of aspects pertaining to the ordering of sentences in text." 
}, { "sent_id": "6b147afca676882878e67bc10abd58-C001-12", "text": "The most prominent approaches include: (1) temporal ordering in terms of publication date (Barzilay, 2003) , (2) temporal ordering in terms of textual cues in sentences (Bollegala et al., 2006) , (3) the topic of the sentences (Barzilay, 2003) , (4) coherence theories (Barzilay and Lapata, 2008) , e.g., Centering Theory, (5) content models (Barzilay and Lee, 2004) , and (6) ordering(s) in the underlying documents in the case of summarisation (Bollegala et al., 2006; Barzilay, 2003) ." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-13", "text": "----------------------------------" }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-14", "text": "**THE MODEL**" }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-15", "text": "We view coherence assessment, which we recast as a sentence ordering problem, as a machine learning problem using the feature representation discussed in Section 2.1." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-16", "text": "It can be viewed as a ranking task because a text can only be more or less coherent than some other text." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-17", "text": "The sentence ordering task used in this paper can easily be transformed into a ranking problem." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-18", "text": "Hence, paralleling Barzilay and Lapata (2008) , our model has the following structure." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-19", "text": "The data consists of alternative orderings (x ij , x ik ) of the sentences of the same document d i ." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-20", "text": "In the training data, the preference ranking of the alternative orderings is known." 
}, { "sent_id": "6b147afca676882878e67bc10abd58-C001-21", "text": "As a result, training consists of determining a parameter vector w that minimizes the number of violations of pairwise rankings in the training set, a problem which can be solved using SVM constraint optimization (Joachims, 2002) ." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-22", "text": "The following section explores the features available for this optimization." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-23", "text": "----------------------------------" }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-24", "text": "**FEATURES**" }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-25", "text": "Approaches to sentence ordering can generally be categorized as knowledge-rich or knowledge-lean." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-26", "text": "Knowledge-rich approaches rely on manually created representations of sentence orderings using do-main communication knowledge." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-27", "text": "Barzilay and Lee (2004) 's knowledge-lean approach attempts to automate the inference of knowledge-rich information using a distributional view of content." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-28", "text": "In essence, they infer a number of topics using clustering." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-54", "text": "----------------------------------" }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-29", "text": "The clusters are represented by corresponding states in a hidden Markov model, which is used to model the transitions between topics." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-30", "text": "Lapata (2003) , in contrast, does not attempt to model topics explicitly." 
}, { "sent_id": "6b147afca676882878e67bc10abd58-C001-31", "text": "Instead, she reduces sentence ordering to the task of predicting the next sentence given the previous sentence, which represents a coarse attempt at capturing local coherence constraints." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-32", "text": "The features she uses are derived from three categories -verbs, nouns, and dependencies -all of which are lexicalised." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-33", "text": "Her system thereby, to some extent, learns a precedence between the words in the sentences, which in turn represent topics." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-34", "text": "Ji and Pulman (2006) base their ordering strategy not only on the directly preceding sentence, but on all preceding sentences." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-35", "text": "In this way, they are able to avoid a possible topic bias when summarizing multiple documents." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-36", "text": "This is specific to their approach as both Lapata (2003)'s and Barzilay and Lee (2004) 's approaches are not tailored to summarization and therefore do not experience the topic bias problem." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-37", "text": "The present paper deviates from Lapata (2003) insofar as we do not attempt to learn the ordering preferences between pairs of sentences." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-38", "text": "Instead, we learn the ranking of documents." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-39", "text": "The advantage of this approach is that it allows us to straightforwardly discern the individual value of various features (cf." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-40", "text": "Barzilay and Lapata (2008) )." 
}, { "sent_id": "6b147afca676882878e67bc10abd58-C001-41", "text": "The methods used in this paper are mostly shallow with the exception of two aspects." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-42", "text": "First, some of the measures make use of WordNet relations (Fellbaum, 1998) , and second, some use the temporal ordering provided by the \"happens-before\" relation in VerbOcean (Chklovski and Pantel, 2004) ." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-43", "text": "While the use of WordNet is self-explanatory, its effect on sentence ordering algorithms does not seem to have been explored in any depth." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-44", "text": "The use of VerbOcean is meant to reveal the degree to which common sense orderings of events affect the ordering of sentences, or whether the order is reversed." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-45", "text": "With this background, the sentence ordering features used in this paper can be grouped into three categories:" }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-46", "text": "----------------------------------" }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-47", "text": "**GROUP SIMILARITY**" }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-48", "text": "The features in this category are inspired by discourse entity-based accounts of local coherence." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-49", "text": "Yet, in contrast to Barzilay and Lapata (2008) , who employ the syntactic properties of the respective occurrences, we reduce the accounts to whether or not the entities occur in subsequent sentences (similar to Karamanis (2004) 's NOCB metric)." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-50", "text": "We also investigate whether using only the information from the head of the noun group (cf." 
}, { "sent_id": "6b147afca676882878e67bc10abd58-C001-51", "text": "Barzilay and Lapata (2008) ) suffices, or whether performance is gained when allowing the whole noun group in order to determine similarity." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-52", "text": "Moreover, as indicated above, some of the noun group measures make use of WordNet synonym, hypernym, hyponym, antonym relationships." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-55", "text": "**TEMPORAL ORDERING**" }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-56", "text": "This set of features uses information on the temporal ordering of sentences, although it currently only includes the \"happens-before\" relations in VerbOcean." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-57", "text": "----------------------------------" }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-58", "text": "**LONGER RANGE RELATIONS**" }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-59", "text": "The group similarity features only capture the relation between a sentence and its immediate successor." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-60", "text": "However, the coherence of a text is clearly not only defined by direct relations, but also requires longer range relations between sentences (e.g., Barzilay and Lapata (2008) )." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-61", "text": "The features in this section explore the impact of such relations on the coherence of the overall document as well as the appropriate way of modeling them." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-62", "text": "----------------------------------" }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-63", "text": "**EXPERIMENTS**" }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-64", "text": "This section introduces the datasets used for the experiments, describes the experiments, and discusses our main findings." 
}, { "sent_id": "6b147afca676882878e67bc10abd58-C001-65", "text": "----------------------------------" }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-66", "text": "**EVALUATION DATASETS**" }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-67", "text": "The three datasets used for the automatic evaluation in this paper are based on human-generated texts (Table 1 )." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-68", "text": "The first two are the earthquake and accident datasets used by Barzilay and Lapata (2008) ." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-69", "text": "Each of these sets consists of 100 datasets in the training and test sets, respectively, as well as 20 random permutations for each text." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-70", "text": "The third dataset is similar to the first two in that it contains original texts and random permutations." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-71", "text": "In contrast to the other two sources, however, this dataset is based on the human summaries from DUC 2005 (Dang, 2005) ." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-72", "text": "It comprises 300 human summaries on 50 document sets, resulting in a total of 6,000 pairwise rankings split into training and test sets." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-73", "text": "The source furthermore differs from Barzilay and Lapata (2008) 's datasets in that the content of each text is not based on one individual event (an earthquake or accident), but on more complex topics followed over a period of time (e.g., the espionage case between GM and VW along with the various actions taken to resolve it)." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-74", "text": "Since the different document sets cover completely different topics the third dataset will mainly be used to evaluate the topic-independent properties of our model." 
}, { "sent_id": "6b147afca676882878e67bc10abd58-C001-75", "text": "----------------------------------" }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-76", "text": "**EXPERIMENT 1**" }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-77", "text": "In the first part of this experiment, we consider the problem of the granularity of the syntactic units to be used." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-78", "text": "That is, does it make a difference whether we use the words in the sentence, the words in the noun groups, the words in the verb groups, or the words in the respective heads of the groups to determine coherence? (The units are obtained by processing the documents using the LT-TTT2 tools (Grover and Tobin, 2006) ; the lemmatizer used by LT-TTT2 is morpha (Minnen and Pearce, 2000) .) We also consider whether lemmatization is beneficial in each of the granularities." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-79", "text": "The results -presented in Table 2 -indicate that considering only the heads of the verb and noun groups separately provides the best performance." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-105", "text": "The resulting score is the number of times the entities occur in N out of M sentences." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-80", "text": "In particular, the heads outperform the whole groups, and the heads separately also outperform noun and verb group heads together." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-81", "text": "As for the question of whether lemmatization provides better results, one needs to distinguish the case of noun and verb groups." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-82", "text": "For noun groups, lemmatization improves performance, which can mostly be attributed to singular and plural forms." 
}, { "sent_id": "6b147afca676882878e67bc10abd58-C001-83", "text": "In the case of verb groups, however, the lemmatized version yields worse results than the surface forms, a fact mainly explained by the tense and modality properties of verbs." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-84", "text": "Given the appropriate unit of granularity, we can consider the impact of semantic relations between surface realizations on coherence." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-85", "text": "For these experiments we use the synonym, hypernym, hyponym, and antonym relations in WordNet." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-86", "text": "The rationale for the consideration of semantic relations lies in the fact that the frequent use of the same words is usually deemed bad writing style." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-87", "text": "One therefore tends to observe the use of semantically similar terms in neighboring sentences." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-88", "text": "The results of using semantic relations for coherence rating are provided in Table 3." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-89", "text": "Synonym detection improves performance, while the other units provide poorer performance." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-90", "text": "This suggests that the hypernym and hyponym relations tend to over-generalize in the semantics." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-91", "text": "----------------------------------" }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-92", "text": "**SYNTACTIC**" }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-93", "text": "The third category of features investigated is the temporal ordering of sentences; we use VerbOcean to obtain the temporal precedence between two events." 
}, { "sent_id": "6b147afca676882878e67bc10abd58-C001-94", "text": "One would expect events to be described either in chronological order or in its reverse." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-95", "text": "(Table 4: The impact of the VerbOcean 'happens-before' temporal precedence relation on accuracy on the training datasets.)" }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-96", "text": "While the former ordering represents a factual account of some sequence of events, the latter corresponds to newswire-style texts, which present the most important event(s) first, even though they may derive from previous events." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-97", "text": "Table 4 provides the results of the experiments with temporal orderings." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-98", "text": "The first two rows validate the ordering of the events, while the latter two require the corresponding sentences to have a noun group in common in order to increase the likelihood that two events are related." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-99", "text": "The results clearly show that there is potential in the direct ordering of events." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-100", "text": "This suggests that sentence ordering can to some degree be achieved using simple temporal precedence orderings in a domain-independent way." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-101", "text": "This holds despite the results indicating that the features work better for sequences of events (as in the accident dataset) as opposed to accounts of the results of some event(s) (as in the earthquake dataset)." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-102", "text": "The final category of features investigates the degree to which relations between sentences other than directly subsequent sentences are relevant."
}, { "sent_id": "6b147afca676882878e67bc10abd58-C001-103", "text": "To this end, we explore two different approaches." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-104", "text": "The first set of features considers the distribution of entities within a fixed set of sentences, and captures in how many different sentences the entities occur." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-106", "text": "The second set only considers the similarity score from the current sentence and the other sentences within a certain range from the current sentence." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-107", "text": "The score of this feature is the sum of the individual similarities." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-108", "text": "Table 5 clearly confirms that longer range relations are relevant to the assessment of the coherence of text." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-109", "text": "An interesting difference between the two approaches is that sentence similarity only provides good results for neighboring sentences or sentences only one sentence apart, while the occurrence-counting method also works well over longer ranges." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-110", "text": "Having evaluated the potential contributions of the individual features and their modeling, we now use SVMs to combine the features into one comprehensive measure." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-111", "text": "Given the indications from the foregoing experiments, the results in Table 6 are disappointing." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-112", "text": "In particular, the performance on the earthquake dataset is below standard." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-113", "text": "However, it seems that sentence ordering in that set is primarily defined by topics, as only content models perform well." 
}, { "sent_id": "6b147afca676882878e67bc10abd58-C001-114", "text": "(Barzilay and Lapata (2008) only perform well when using their coreference module, which determines antecedents based on the identified coreferences in the original sentence ordering, thereby biasing their orderings towards the correct ordering.) Longer range and WordNet relations together (Chunk+Temp-WN+LongRange+) achieve the best performance." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-115", "text": "The corresponding configuration is also the only one that achieves reasonable performance when compared with other systems." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-116", "text": "----------------------------------" }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-117", "text": "**EXPERIMENT 2**" }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-118", "text": "As stated, the ultimate goal of the models presented in this paper is the application of sentence ordering to automatically generated summaries." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-119", "text": "It is, in this regard, important to distinguish coherence as studied in Experiment 1 and coherence in the context of automatic summarization." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-120", "text": "Namely, for newswire summarization systems, the topics of the documents are unknown at the time of training." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-121", "text": "(Table 7: Cross-Training between Accident and Earthquake datasets. The results for Coreference+Syntax+Salience+ and HMM-Based Content Models are reproduced from Barzilay and Lapata (2008).)" }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-123", "text": "As a result, model performance on out-of-domain texts is important for summarization."
}, { "sent_id": "6b147afca676882878e67bc10abd58-C001-124", "text": "Experiment 2 seeks to evaluate how well our model performs in such cases." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-125", "text": "To this end, we carry out two sets of tests." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-126", "text": "First, we cross-train the models between the accident and earthquake datasets to determine system performance in unseen domains." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-127", "text": "Second, we use the dataset based on the DUC 2005 model summaries to investigate whether our model's performance on unseen topics reaches a plateau after training on a particular number of different topics." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-128", "text": "Surprisingly, the results are rather good, when compared to the poor results in part of the previous experiment (Table 7) ." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-129", "text": "In fact, model performance is nearly independent of the training topic." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-130", "text": "Nevertheless, the results on the earthquake test set indicate that our model is missing essential components for the correct prediction of sentence orderings on this set." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-131", "text": "When compared to the results obtained by Barzilay and Lapata (2008) and Barzilay and Lee (2004) , it would appear that direct sentence-to-sentence similarity (as suggested by the Barzilay and Lapata baseline score) or capturing topic sequences are essential for acquiring the correct sequence of sentences in the earthquake dataset."
}, { "sent_id": "6b147afca676882878e67bc10abd58-C001-132", "text": "The final experimental setup applies the best model (Chunk+Temporal-WordNet+LongRange+) to the summarization dataset and evaluates how well the model generalises as the number of topics in the training dataset increases. (Table 8: Accuracy on 20 test topics (2,700 pairs) with respect to the number of topics used for training.)" }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-133", "text": "The results, provided in Table 8, indicate that very little training data (both regarding the number of pairs and the number of different topics) is needed." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-134", "text": "Unfortunately, they also suggest that the DUC summaries are more similar to the earthquake than to the accident dataset." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-135", "text": "----------------------------------" }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-136", "text": "**CONCLUSIONS**" }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-137", "text": "This paper investigated the effect of different features on sentence ordering." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-138", "text": "While a set of features has been identified that works well individually as well as in combination on the accident dataset, the results on the earthquake and DUC 2005 datasets are disappointing." }, { "sent_id": "6b147afca676882878e67bc10abd58-C001-139", "text": "Taking into account the performance of content models and the baseline of the Barzilay and Lapata (2008) model, the most convincing explanation is that the sentence ordering in the earthquake datasets is based on some sort of topic notion, providing a variety of possible antecedents between which our model is thus far unable to distinguish without resorting to the original (correct) ordering."
}, { "sent_id": "6b147afca676882878e67bc10abd58-C001-140", "text": "Future work will have to concentrate on this aspect of sentence ordering, as it appears to coincide with the structure of the summaries for the DUC 2005 dataset." } ], "y": { "@BACK@": { "gold_contexts": [ [ "6b147afca676882878e67bc10abd58-C001-12" ] ], "cite_sentences": [ "6b147afca676882878e67bc10abd58-C001-12" ] }, "@SIM@": { "gold_contexts": [ [ "6b147afca676882878e67bc10abd58-C001-18" ], [ "6b147afca676882878e67bc10abd58-C001-121" ], [ "6b147afca676882878e67bc10abd58-C001-131" ] ], "cite_sentences": [ "6b147afca676882878e67bc10abd58-C001-18", "6b147afca676882878e67bc10abd58-C001-121", "6b147afca676882878e67bc10abd58-C001-131" ] }, "@UNSURE@": { "gold_contexts": [ [ "6b147afca676882878e67bc10abd58-C001-39", "6b147afca676882878e67bc10abd58-C001-40" ], [ "6b147afca676882878e67bc10abd58-C001-50", "6b147afca676882878e67bc10abd58-C001-51" ], [ "6b147afca676882878e67bc10abd58-C001-139" ] ], "cite_sentences": [ "6b147afca676882878e67bc10abd58-C001-40", "6b147afca676882878e67bc10abd58-C001-51", "6b147afca676882878e67bc10abd58-C001-139" ] }, "@DIF@": { "gold_contexts": [ [ "6b147afca676882878e67bc10abd58-C001-49" ], [ "6b147afca676882878e67bc10abd58-C001-73" ], [ "6b147afca676882878e67bc10abd58-C001-114" ] ], "cite_sentences": [ "6b147afca676882878e67bc10abd58-C001-49", "6b147afca676882878e67bc10abd58-C001-73", "6b147afca676882878e67bc10abd58-C001-114" ] }, "@MOT@": { "gold_contexts": [ [ "6b147afca676882878e67bc10abd58-C001-60" ] ], "cite_sentences": [ "6b147afca676882878e67bc10abd58-C001-60" ] }, "@USE@": { "gold_contexts": [ [ "6b147afca676882878e67bc10abd58-C001-68" ] ], "cite_sentences": [ "6b147afca676882878e67bc10abd58-C001-68" ] } } }, "ABC_0bb68718667b8850dc0110d10d1d3a_2": { "x": [ { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-2", "text": "In this work, we investigate the task of textual response generation in 
a multimodal task-oriented dialogue system." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-3", "text": "Our work is based on the recently released Multimodal Dialogue (MMD) dataset (Saha et al., 2017) in the fashion domain." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-4", "text": "We introduce a multimodal extension to the Hierarchical Recurrent Encoder-Decoder (HRED) model and show that this extension outperforms strong baselines in terms of text-based similarity metrics." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-5", "text": "We also showcase the shortcomings of current vision and language models by performing an error analysis on our system's output." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-6", "text": "----------------------------------" }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-8", "text": "This work aims to learn strategies for textual response generation in a multimodal conversation directly from data." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-9", "text": "Conversational AI has great potential for online retail: It greatly enhances user experience and in turn directly affects user retention (Chai et al., 2000) , especially if the interaction is multi-modal in nature." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-10", "text": "So far, most conversational agents are uni-modal, ranging from open-domain conversation (Ram et al., 2018; Papaioannou et al., 2017; Fang et al., 2017) to task-oriented dialogue systems (Rieser and Lemon, 2010, 2011; Young et al., 2013; Singh et al., 2000; Wen et al., 2016)." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-11", "text": "While recent progress in deep learning has unified research at the intersection of vision and language, the availability of open-source multimodal dialogue datasets still remains a bottleneck."
}, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-12", "text": "This research makes use of a recently released Multimodal Dialogue (MMD) dataset (Saha et al., 2017) , which contains multiple dialogue sessions in the fashion domain." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-13", "text": "The MMD dataset provides an interesting new challenge, combining recent efforts on task-oriented dialogue systems, as well as visually grounded dialogue." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-14", "text": "In contrast to simple QA tasks in visually grounded dialogue, e.g. (Antol et al., 2015) , it contains conversations with a clear end-goal." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-15", "text": "However, in contrast to previous slot-filling dialogue systems, e.g. (Rieser and Lemon, 2011; Young et al., 2013) , it heavily relies on the extra visual modality to drive the conversation forward (see Figure 1) ." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-16", "text": "In the following, we propose a fully data-driven response generation model for this task." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-17", "text": "Our work is able to ground the system's textual response with language and images by learning the semantic correspondence between them while modelling long-term dialogue context." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-18", "text": "Lu et al., 2016) ." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-19", "text": "In contrast to standard sequence-to-sequence models (Cho et al., 2014; Sutskever et al., 2014; Bahdanau et al., 2015) , HREDs model the dialogue context by introducing a context Recurrent Neural Network (RNN) over the encoder RNN, thus forming a hierarchical encoder." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-20", "text": "We build on top of the HRED architecture to include multimodality over multiple images."
}, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-21", "text": "A simple HRED consists of three RNN modules: encoder, context and decoder." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-22", "text": "In multimodal HRED, we combine the output representations from the utterance encoder with concatenated multiple image representations and pass them as input to the context encoder (see Figure 2) ." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-23", "text": "A dialogue is modelled as a sequence of utterances (turns), which in turn are modelled as sequences of words and images." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-24", "text": "Formally, a dialogue is generated according to the following:" }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-25", "text": "where t_n is the n-th utterance in a dialogue." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-26", "text": "For each m = 1, ..., M_n, we have hidden states of each module defined as:" }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-27", "text": "where f^{text}_\u03b8, f^{cxt}_\u03b8 and f^{dec}_\u03b8 are GRU cells (Cho et al., 2014)." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-28", "text": "\u03b8 represents the model parameters, w_{n,m} is the m-th word in the n-th utterance, and g^{enc}_\u03b8 is a Convolutional Neural Network (CNN); here we use VGGnet (Simonyan and Zisserman, 2014)." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-29", "text": "We pass multiple images in a context through the CNN in order to get encoded image representations g^{enc}_\u03b8(img_k)." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-30", "text": "Then these are combined together and passed through a linear layer l^{img} to get the aggregated image representation for one turn of context, denoted by h^{img}_n; the text and image representations are subsequently concatenated and passed as input to the context RNN."
}, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-31", "text": "h^{cxt}_N, the final hidden state of the context RNN, acts as the initial hidden state of the decoder RNN." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-32", "text": "Finally, output is generated by passing h^{dec}_{n,m} through an affine transformation followed by a softmax activation." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-33", "text": "The model is trained using cross entropy on next-word prediction." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-34", "text": "During generation, the decoder conditions on the previous output token." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-35", "text": "Saha et al. (2017) propose a similar baseline model for the MMD dataset, extending HREDs to include the visual modality." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-36", "text": "However, for simplicity's sake, they 'unroll' multiple images in a single utterance to include only one image per utterance." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-37", "text": "While computationally leaner, this approach ultimately loses the objective of capturing multimodality over the context of multiple images and text." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-38", "text": "In contrast, we combine all the image representations in the utterance using a linear layer." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-39", "text": "We argue that modelling all images is necessary to answer questions that address previous agent responses." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-40", "text": "For example in Figure 3 , when the user asks \"what about the 4th image?\", it is impossible to give a correct response without reasoning over all images in the previous response."
}, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-41", "text": "In the following, we empirically show that our extension leads to better results in terms of text-based similarity measures, as well as quality of generated dialogues." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-42", "text": "Example contexts for a given system utterance; note the difference in our approach from Saha et al. (2017) when extracting the training data from the original chat logs." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-43", "text": "For simplicity, in this illustration we consider a context size of 2 previous utterances." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-44", "text": "'|' differentiates turns for a given context." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-45", "text": "We concatenate the representation vector of all images in one turn of a dialogue to form the image context." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-46", "text": "If there is no image in the utterance, we use a 4096-dimensional zero vector to form the image context." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-47", "text": "In this work, we focus only on the textual response of the agent." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-48", "text": "----------------------------------" }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-49", "text": "**EXPERIMENTS AND RESULTS**" }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-50", "text": "----------------------------------" }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-51", "text": "**DATASET**" }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-52", "text": "The MMD dataset (Saha et al., 2017) consists of 100/11/11k train/validation/test chat sessions comprising 3.5M context-response pairs for the model."
}, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-53", "text": "Each session contains an average of 40 dialogue turns (average of 8 words per textual response, 4 images per image response)." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-54", "text": "The data contains complex user queries, which pose new challenges for multimodal, task-based dialogue, such as quantitative inference (sorting, counting and filtering): \"Show me more images of the 3rd product in some different directions\", inference using domain knowledge and long term context: \"Will the 5th result go well with a large sized messenger bag?\", inference over aggregate of images: \"List more in the upper material of the 5th image and style as the 3rd and the 5th\", and co-reference resolution." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-55", "text": "Note that we started with the raw transcripts of dialogue sessions to create our own version of the dataset for the model." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-56", "text": "This is done since the authors originally consider each image as a different context, while we consider all the images in a single turn as one concatenated context (cf. Figure 3 )." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-57", "text": "----------------------------------" }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-58", "text": "**IMPLEMENTATION**" }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-59", "text": "We use the PyTorch framework (Paszke et al., 2017) for our implementation." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-60", "text": "We used 512 as the word embedding size as well as the hidden dimension for all the RNNs, using GRUs (Cho et al., 2014) with tied embeddings for the (bidirectional) encoder and decoder." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-61", "text": "The decoder uses a Luong-style attention mechanism (Luong et al., 2015) with input feeding."
}, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-62", "text": "We trained our model with the Adam optimizer (Kingma and Ba, 2015) , with a learning rate of 0.0004 and clipping the gradient norm at 5." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-63", "text": "We perform early stopping by monitoring validation loss." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-64", "text": "For image representations, we use the FC6 layer representations of the VGG-19 (Simonyan and Zisserman, 2014) , pre-trained on ImageNet." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-66", "text": "----------------------------------" }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-67", "text": "**ANALYSIS AND RESULTS**" }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-68", "text": "We report sentence-level BLEU-4 (Papineni et al., 2002) , METEOR (Lavie and Agarwal, 2007) and ROUGE-L (Lin and Och, 2004) using the evaluation scripts provided by Sharma et al. (2017) ." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-69", "text": "We compare our results against Saha et al. (2017) by using their code and data-generation scripts." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-70", "text": "Note that the results reported in their paper are on a different version of the corpus, hence not directly comparable." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-71", "text": "Table 1 provides results for different configurations of our model (\"T\" stands for text-only in the encoder, \"M\" for multimodal, and \"attn\" for using attention in the decoder)." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-72", "text": "We experimented with different context sizes and found that output quality improved with increased context size (models with 5-turn context perform better than those with a 2-turn context), confirming the observation by Serban et al. (2016, 2017)."
}, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-73", "text": "Using attention clearly helps: even T-HRED-attn outperforms M-HRED (without attention) for the same context size." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-74", "text": "We also tested whether multimodal input has an impact on the generated outputs." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-75", "text": "However, there was only a slight increase in BLEU score (M-HRED-attn vs T-HRED-attn)." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-76", "text": "To summarize, our best performing model (M-HRED-attn) outperforms the model of Saha et al. by 7 BLEU points." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-77", "text": "This can be primarily attributed to the way we created the input for our model from raw chat logs, as well as incorporating more information during decoding via attention." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-78", "text": "Figure 4 provides example output utterances using M-HRED-attn with a context size of 5." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-79", "text": "Our model is able to accurately map the response to previous textual context turns as shown in (a) and (c)." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-80", "text": "In (c), it is able to capture that the user is asking about the style in the 1st and 2nd image." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-81", "text": "(d) shows an example where our model is able to relate that the corresponding product is 'jeans' from visual features, while it is not able to model fine-grained details like in (b) that the style is 'casual fit' but resorts to 'woven'."
}, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-82", "text": "----------------------------------" }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-83", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-84", "text": "In this research, we address the novel task of response generation in search-based multimodal dialogue by learning from the recently released Multimodal Dialogue (MMD) dataset (Saha et al., 2017) ." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-85", "text": "We introduce a novel extension to the Hierarchical Recurrent Encoder-Decoder (HRED) model (Serban et al., 2016) and show that our implementation significantly outperforms the model of Saha et al. (2017) by modelling the full multimodal context." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-86", "text": "Contrary to their results, our generation outputs improved by adding attention and increasing context size." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-87", "text": "However, we also show that multimodal HRED does not improve significantly over text-only HRED, similar to observations by Agrawal et al. (2016) and Qian et al. (2018) ." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-88", "text": "Our model learns to handle textual correspondence between the questions and answers, while mostly ignoring the visual context." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-89", "text": "This indicates that we need better visual models to encode the image representations when we have multiple similar-looking images, e.g., black hats in Figure 3 ." }, { "sent_id": "0bb68718667b8850dc0110d10d1d3a-C001-90", "text": "We believe that the results should improve with a jointly trained or fine-tuned CNN for generating the image representations, which we plan to implement in future work."
} ], "y": { "@EXT@": { "gold_contexts": [ [ "0bb68718667b8850dc0110d10d1d3a-C001-3", "0bb68718667b8850dc0110d10d1d3a-C001-4" ], [ "0bb68718667b8850dc0110d10d1d3a-C001-55" ] ], "cite_sentences": [ "0bb68718667b8850dc0110d10d1d3a-C001-3", "0bb68718667b8850dc0110d10d1d3a-C001-55" ] }, "@USE@": { "gold_contexts": [ [ "0bb68718667b8850dc0110d10d1d3a-C001-12" ], [ "0bb68718667b8850dc0110d10d1d3a-C001-69" ] ], "cite_sentences": [ "0bb68718667b8850dc0110d10d1d3a-C001-12", "0bb68718667b8850dc0110d10d1d3a-C001-69" ] }, "@BACK@": { "gold_contexts": [ [ "0bb68718667b8850dc0110d10d1d3a-C001-13" ], [ "0bb68718667b8850dc0110d10d1d3a-C001-14" ], [ "0bb68718667b8850dc0110d10d1d3a-C001-15" ], [ "0bb68718667b8850dc0110d10d1d3a-C001-52" ] ], "cite_sentences": [ "0bb68718667b8850dc0110d10d1d3a-C001-13", "0bb68718667b8850dc0110d10d1d3a-C001-14", "0bb68718667b8850dc0110d10d1d3a-C001-15", "0bb68718667b8850dc0110d10d1d3a-C001-52" ] }, "@SIM@": { "gold_contexts": [ [ "0bb68718667b8850dc0110d10d1d3a-C001-35" ] ], "cite_sentences": [ "0bb68718667b8850dc0110d10d1d3a-C001-35" ] }, "@DIF@": { "gold_contexts": [ [ "0bb68718667b8850dc0110d10d1d3a-C001-42" ], [ "0bb68718667b8850dc0110d10d1d3a-C001-56" ], [ "0bb68718667b8850dc0110d10d1d3a-C001-70" ], [ "0bb68718667b8850dc0110d10d1d3a-C001-76" ], [ "0bb68718667b8850dc0110d10d1d3a-C001-85" ], [ "0bb68718667b8850dc0110d10d1d3a-C001-86" ] ], "cite_sentences": [ "0bb68718667b8850dc0110d10d1d3a-C001-42", "0bb68718667b8850dc0110d10d1d3a-C001-56", "0bb68718667b8850dc0110d10d1d3a-C001-70", "0bb68718667b8850dc0110d10d1d3a-C001-76", "0bb68718667b8850dc0110d10d1d3a-C001-85", "0bb68718667b8850dc0110d10d1d3a-C001-86" ] }, "@MOT@": { "gold_contexts": [ [ "0bb68718667b8850dc0110d10d1d3a-C001-56" ], [ "0bb68718667b8850dc0110d10d1d3a-C001-84" ] ], "cite_sentences": [ "0bb68718667b8850dc0110d10d1d3a-C001-56", "0bb68718667b8850dc0110d10d1d3a-C001-84" ] } } }, "ABC_08c64c92b77dbd9e999092a2fec3d1_2": { "x": [ { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-91", "text": 
"The last one evaluates quality of word embeddings." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-2", "text": "We study the problem of jointly embedding a knowledge base and a text corpus." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-3", "text": "The key issue is the alignment model making sure the vectors of entities, relations and words are in the same space." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-4", "text": "Wang et al. (2014a) rely on Wikipedia anchors, making the applicable scope quite limited." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-5", "text": "In this paper we propose a new alignment model based on text descriptions of entities, without dependency on anchors." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-6", "text": "We require the embedding vector of an entity not only to fit the structured constraints in KBs but also to be equal to the embedding vector computed from the text description." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-7", "text": "Extensive experiments show that, the proposed approach consistently performs comparably or even better than the method of Wang et al. (2014a) , which is encouraging as we do not use any anchor information." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-8", "text": "----------------------------------" }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-10", "text": "Knowledge base embedding has attracted surging interest recently." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-11", "text": "The aim is to learn continuous vector representations (embeddings) for entities and relations of a structured knowledge base (KB) such as Freebase." 
}, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-12", "text": "Typically it optimizes a global objective function over all the facts in the KB and hence the embedding vector of an entity / relation is expected to encode global information in the KB." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-13", "text": "It is capable of inferring missing facts in a KB and aiding fact extraction (Bordes et al., 2011; Bordes et al., 2012; Socher et al., 2013; Chang et al., 2013; Wang et al., 2014b; Lin et al., 2015) ." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-14", "text": "Although seemingly encouraging, the approaches in the aforementioned literature suffer from two common issues: (1) Embeddings are exclusive to entities/relations within KBs." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-15", "text": "Computation between KBs and text, which is prevalent in practice, cannot be handled." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-16", "text": "For example, in fact extraction, a candidate value may be just a phrase in text." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-17", "text": "(2) KB sparsity." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-18", "text": "The above approaches are only based on structured facts of KBs, and thus cannot work well on entities with few facts." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-19", "text": "An important milestone, the approach of Wang et al. (2014a) solves issue (1) by jointly embedding entities, relations, and words into the same vector space and hence is able to deal with words/phrases beyond entities in KBs." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-20", "text": "The key component is the so-called alignment model, which makes sure the embeddings of entities, relations, and words are in the same space."
}, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-21", "text": "Two alignment models are introduced there: one uses entity names and another uses Wikipedia anchors." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-22", "text": "However, both of them have drawbacks." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-23", "text": "As reported in the paper, using entity names severely pollutes the embeddings of words." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-24", "text": "Thus it is not recommended in practice." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-25", "text": "Using Wikipedia anchors completely relies on the special data source and hence the approach cannot be applied to other customer data." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-26", "text": "To fully address the two issues, this paper proposes a new alignment method, aligning by entity descriptions." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-27", "text": "We only assume some entities in KBs have text descriptions, which almost always holds in practice." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-28", "text": "We require the embedding of an entity not only to fit the structured constraints in KBs but also to equal the vector computed from the text description." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-29", "text": "Meanwhile, if an entity has few facts, the description will provide information for embedding; thus the issue of KB sparsity is also well handled." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-30", "text": "We conduct extensive experiments on the tasks of triplet classification, link prediction, relational fact extraction, and analogical reasoning to compare with the previous approach (Wang et al., 2014a) ." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-31", "text": "Results show that our approach consistently achieves better or comparable performance."
}, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-32", "text": "----------------------------------" }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-33", "text": "**RELATED WORK**" }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-34", "text": "TransE This is a representative knowledge embedding model proposed by ." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-35", "text": "For a fact (h, r, t) in KBs, where h is the head entity, r is the relation, and t is the tail entity, TransE models the relation r as a translation vector r connecting the embeddings h and t of the two entities, i.e., h + r is close to t. The model is simple, effective and efficient." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-36", "text": "Most knowledge embedding models thereafter, including this paper, are variants of this model (Wang et al., 2014b; Wang et al., 2014a; Lin et al., 2015) ." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-37", "text": "Skip-gram This is an efficient word embedding method proposed by Mikolov et al. (2013a) , which learns word embeddings from word co-occurrences in text windows." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-38", "text": "Without any supervision, it amazingly recovers the semantic relations between words in a vector space such as 'King' \u2212 'Queen' \u2248 'Man' \u2212 'Woman'." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-39", "text": "However, as it is unsupervised, it cannot tell the exact relation between two words." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-40", "text": "Wang et al. (2014a) combine knowledge embedding and word embedding in a joint framework so that the entities/relations and words are in the same vector space and hence operators like inner product (similarity) between them are meaningful." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-41", "text": "This brings convenience to tasks requiring computation between knowledge bases and text."
}, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-42", "text": "Meanwhile, jointly embedding utilizes information from both structured KBs and unstructured text and hence the knowledge embedding and word embedding can be enhanced by each other." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-43", "text": "Their model is composed of three components: a knowledge model to embed entities and relations, a text model to embed words, and an alignment model to make sure entities/relations and words are in the same vector space." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-44", "text": "The knowledge model and text model are variants of TransE and Skip-gram respectively." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-45", "text": "The key component is the alignment model." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-46", "text": "They introduced two: alignment by entity names and alignment by Wikipedia anchors." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-47", "text": "(1) Alignment by Entity Names makes a replica of KB facts but replaces each entity ID with its name string, i.e., the vector of a name phrase is encouraged to equal the vector of the entity (identified by ID)." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-48", "text": "It has problems with ambiguous entity names and was observed to pollute word embeddings; thus it is not recommended by the authors." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-49", "text": "----------------------------------" }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-50", "text": "**KNOWLEDGE AND TEXT JOINTLY EMBEDDING**" }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-51", "text": "where C is the collection of observed word and context pairs and A refers to the set of all anchors in Wikipedia."
}, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-52", "text": "Pr(w|e_v) is the probability of the anchor predicting its context word, which takes a form similar to Skip-gram for word embedding." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-53", "text": "Alignment by anchors works well in improving both knowledge embeddings and word embeddings." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-54", "text": "However, it completely relies on the special data source of Wikipedia anchors and cannot be applied to other general data settings." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-55", "text": "----------------------------------" }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-56", "text": "**ALIGNMENT BY ENTITY DESCRIPTIONS**" }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-57", "text": "We first describe the settings and notations." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-58", "text": "We are given a knowledge base, i.e., a set of facts (h, r, t), where h, t \u2208 E (the set of entities) and r \u2208 R (the set of relations)." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-59", "text": "Some entities have text descriptions." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-60", "text": "The description of entity e is denoted as D_e." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-61", "text": "w_{i,n} is the n-th word in the description of e_i." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-62", "text": "N_i is the length (in words) of the description of e_i." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-63", "text": "We try to learn embeddings e_i, r_j and w_l for each entity e_i, relation r_j and word w_l respectively." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-64", "text": "The vocabulary of words is V. The union vocabulary of entities and words together is I = E \u222a V. In this paper \"word(s)\" refers to \"word(s)/phrase(s)\"."
}, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-65", "text": "We follow the jointly embedding framework of (Wang et al., 2014a) , i.e., learning optimal embeddings by minimizing the following loss" }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-66", "text": "where L_K, L_T and L_A are the component loss functions of the knowledge model, text model and alignment model respectively." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-67", "text": "Our focus is on a new alignment model L_A while the knowledge model L_K and text model L_T are the same as the counterparts in (Wang et al., 2014a) ." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-68", "text": "However, to make the content self-contained, we still need to briefly explain L_K and L_T." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-69", "text": "Knowledge Model Describes the plausibility of a triplet (h, r, t) by defining" }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-70", "text": "where z(h, r, t) = b \u2212 0.5 \u00b7 ||h + r \u2212 t||_2^2, with b = 7 as suggested by Wang et al. (2014a) ." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-71", "text": "Pr(r|h, t) and Pr(t|h, r) are defined in the same way." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-72", "text": "The loss function of knowledge model is then defined as" }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-73", "text": "log Pr(h|r, t) + log Pr(t|h, r) + log Pr(r|h, t) (4)" }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-74", "text": "Text Model Defines the probability of a pair of words w and v co-occurring in a text window:" }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-75", "text": "where" }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-76", "text": "Then the loss function of text model is" }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-77", "text": "Alignment Model This part is different from Wang et al. (2014a) ."
}, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-78", "text": "For each word w in the description of entity e, we define Pr(w|e), the conditional probability of predicting w given e:" }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-79", "text": "Pr(w|e) = exp{z(e, w)} / \u2211_{w'\u2208V} exp{z(e, w')}," }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-80", "text": "where z(e, w) = b \u2212 0.5 \u00b7 ||e \u2212 w||_2^2." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-81", "text": "Notice that e is the same vector of entity e appearing in the knowledge model of Eq. (3)." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-82", "text": "We also define Pr(e|w) in the same way by revising the normalization term" }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-83", "text": "Then the loss function of alignment model is" }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-84", "text": "[log Pr(w|e) + log Pr(e|w)]" }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-85", "text": "Training We use stochastic gradient descent (SGD) to minimize the overall loss of Eq. (2), which sequentially updates the embeddings." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-86", "text": "Negative sampling is used to calculate the normalization terms over large vocabularies." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-87", "text": "We implement a multi-threading version to deal with large data sets, where memory is shared and lock-free." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-88", "text": "----------------------------------" }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-89", "text": "**EXPERIMENTS**" }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-90", "text": "We conduct experiments on the following tasks: link prediction , triplet classification (Socher et al., 2013) , relational fact extraction , and analogical reasoning (Mikolov et al., 2013b) ."
}, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-92", "text": "We try to study whether the proposed alignment model, without using any anchor information, is able to achieve comparable or better performance than alignment by anchors." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-93", "text": "As to the methods, \"Separately\" denotes the method of separately embedding knowledge bases and text. \"Jointly(anchor)\" and \"Jointly(name)\" denote the jointly embedding methods based on Alignment by Wikipedia Anchors and Alignment by Entity Names in (Wang et al., 2014a) respectively." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-94", "text": "\"Jointly(desp)\" is the joint embedding method based on alignment by entity descriptions." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-95", "text": "Data For link prediction, FB15K is used as the knowledge base." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-96", "text": "For triplet classification, a large dataset provided by (Wang et al., 2014a) is used as the knowledge base." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-97", "text": "Both sets are subsets of Freebase." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-98", "text": "For all tasks, Wikipedia articles are used as the text corpus." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-99", "text": "As many Wikipedia articles can be mapped to Freebase entities, we regard a Wikipedia article as the description for the corresponding entity in Freebase." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-100", "text": "Following the settings in (Wang et al., 2014a) , we apply the same preprocessing steps, including sentence segmentation, tokenization, and named entity recognition." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-101", "text": "We combine the consecutive tokens covered by an anchor or identically tagged as \"Location/Person/Organization\" and regard them as phrases."
}, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-102", "text": "Link Prediction This task aims to complete a fact (h, r, t) in the absence of h or t, simply based on ||h + r \u2212 t||." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-103", "text": "We follow the same protocol in ." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-104", "text": "We directly copy the results of the baseline (TransE) from and implement \"Jointly(anchor)\"." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-105", "text": "The results are in Table 1 . \"MEAN\" is the average rank of the true absent entity." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-106", "text": "\"HITS@10\" is the accuracy of the top 10 predictions containing the true entity." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-107", "text": "Lower \"MEAN\" and higher \"HITS@10\" are better. \"Raw\" and \"Filtered\" are two settings on processing candidates." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-108", "text": "We train \"Jointly(anchor)\" and \"Jointly(desp)\" with the embedding dimension k among {50, 100, 150}, the learning rate \u03b1 in {0.01, 0.025}, the number of negative examples per positive example c in {5, 10}, the max skip-range s in {5, 10} and traverse the text corpus with only 1 epoch." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-109", "text": "The best configurations of \"Jointly(anchor)\" and \"Jointly(desp)\" are exactly the same: k = 100, \u03b1 = 0.025, c = 10, s = 5." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-110", "text": "From the results, we observe that: (1) Both jointly embedding methods are much better than the baseline TransE, which demonstrates that external textual resources make entity embeddings more discriminative."
}, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-111", "text": "Intuitively, \"Jointly(anchor)\" indicates \"how to use an entity in text\", while \"Jointly(desp)\" shows \"what is the definition/meaning of an entity\"." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-112", "text": "Both are helpful to distinguish an entity from others." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-113", "text": "(2) Under the setting of \"Raw\", \"Jointly(desp)\" and \"Jointly(anchor)\" are comparable." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-114", "text": "In other settings \"Jointly(desp)\" wins." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-115", "text": "Triplet Classification This is a binary classification task, predicting whether a candidate triplet (h, r, t) is a correct fact or not." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-116", "text": "It is used in (Socher et al., 2013; Wang et al., 2014b; Wang et al., 2014a) ." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-117", "text": "We follow the same protocol in (Wang et al., 2014a) ." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-118", "text": "We train their models via our own implementation on our dataset." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-119", "text": "The results are in Table 2 . \"e-e\" means both sides of a triplet (h, r, t) are entities in KB, \"e-w\" means the tail side is a word out of KB entity vocabulary, similarly for \"w-e\" and \"w-w\"." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-120", "text": "The best configurations of the models are: k = 150, \u03b1 = 0.025, c = 10, s = 5 and traversing the text corpus with 6 epochs." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-121", "text": "The results reveal that: (1) Jointly embedding is indeed effective." 
}, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-122", "text": "Both jointly embedding methods can well handle the cases of \"e-w\", \"w-e\" and \"w-w\", which means the vector computation between entities/relations and words is really meaningful." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-123", "text": "Meanwhile, even the case of \"e-e\" is improved." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-124", "text": "(2) Our method, \"Jointly(desp)\", outperforms \"Jointly(anchor)\" on all types of triplets." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-125", "text": "We believe that the good performance of \"Jointly(desp)\" is due to the appropriate design of the alignment mechanism." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-126", "text": "Using an entity's description information is a more straightforward and effective way to align entity embeddings and word embeddings." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-127", "text": "----------------------------------" }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-128", "text": "**RELATIONAL FACT EXTRACTION**" }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-129", "text": "This task is to extract facts (h, r, t) from plain text. show that combining scores from TransE and some text-side base extractor achieved a much better precision-recall curve compared to the base extractor." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-130", "text": "Wang et al. (2014a) confirm this observation and show that jointly embedding brings further encouraging improvement over TransE. In this experiment, we follow the same settings as (Wang et al., 2014a) to investigate the performance of our new alignment model." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-131", "text": "We use the same public dataset NYT+FB, released by Riedel et al. (2010) and used in and (Wang et al., 2014a) ."
}, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-132", "text": "We use Mintz (Mintz et al., 2009) and MIML (Surdeanu et al., 2012) as our base extractors." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-133", "text": "In order to combine the score of a base extractor and the score from embeddings, we only reserve the testing triplets whose entities and relations can be mapped to the embeddings learned from the triplet classification experiment." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-134", "text": "Since both Mintz and MIML are probabilistic models, we use the same method in (Wang et al., 2014a) to linearly combine the scores." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-135", "text": "The precision-recall curves are plotted in Fig. (1) ." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-136", "text": "On both base extractors, the jointly embedding methods outperform separate embedding." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-137", "text": "Moreover, \"Jointly(desp)\" is slightly better than \"Jointly(anchor)\", which is in accordance with the results from the link prediction experiment and the triplet classification experiment." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-138", "text": "Analogical Reasoning This task evaluates the quality of word embeddings (Mikolov et al., 2013b) ." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-139", "text": "We use the original dataset released by (Mikolov et al., 2013b) and follow the same evaluation protocol of (Wang et al., 2014a) ." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-140", "text": "For a true analogical pair like (\"France\", \"Paris\") and (\"China\", \"Beijing\"), we hide \"Beijing\" and predict it by selecting the word from the vocabulary whose vector has highest similarity with the vector of \"China\" + \"Paris\" - \"France\"."
}, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-141", "text": "We use the word embeddings learned for the triplet classification experiment and conduct the analogical reasoning experiment for \"Skip-gram\", \"Jointly(anchor)\", \"Jointly(name)\" and \"Jointly(desp)\"." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-142", "text": "Results are presented in Table 3 . \"Acc\" is the accuracy of the predicted word." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-143", "text": "\"HITS@10\" is the accuracy of the top 10 candidates containing the ground truth." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-144", "text": "The evaluation analogical pairs are organized into two groups, \"Words\" and \"Phrases\", by whether an analogical pair contains phrases (i.e., multiple words)." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-145", "text": "From the table we observe that: (1) Both \"Jointly(anchor)\" and \"Jointly(desp)\" outperform \"Skip-gram\"." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-146", "text": "(2) \"Jointly(desp)\" achieves the best results, especially for the case of \"Phrases\"." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-147", "text": "Both \"Jointly(anchor)\" and \"Skip-gram\" only consider the context of words, while \"Jointly(desp)\" not only considers the context but also uses the whole document to disambiguate words." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-148", "text": "Intuitively, the whole document is also a valuable resource to disambiguate words." }, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-149", "text": "(3) We further verify that \"Jointly(name)\", i.e., using entity names for alignment, indeed pollutes word embeddings, which is consistent with the reports in (Wang et al., 2014a) ."
}, { "sent_id": "08c64c92b77dbd9e999092a2fec3d1-C001-150", "text": "The above four experiments are consistent in results: without using any anchor information, alignment by entity description is able to achieve better or comparable performance, compared to alignment by Wikipedia anchors proposed by Wang et al. (2014a) ." } ], "y": { "@MOT@": { "gold_contexts": [ [ "08c64c92b77dbd9e999092a2fec3d1-C001-4" ], [ "08c64c92b77dbd9e999092a2fec3d1-C001-130" ] ], "cite_sentences": [ "08c64c92b77dbd9e999092a2fec3d1-C001-4", "08c64c92b77dbd9e999092a2fec3d1-C001-130" ] }, "@DIF@": { "gold_contexts": [ [ "08c64c92b77dbd9e999092a2fec3d1-C001-7" ], [ "08c64c92b77dbd9e999092a2fec3d1-C001-30", "08c64c92b77dbd9e999092a2fec3d1-C001-31" ], [ "08c64c92b77dbd9e999092a2fec3d1-C001-77" ], [ "08c64c92b77dbd9e999092a2fec3d1-C001-150" ] ], "cite_sentences": [ "08c64c92b77dbd9e999092a2fec3d1-C001-7", "08c64c92b77dbd9e999092a2fec3d1-C001-30", "08c64c92b77dbd9e999092a2fec3d1-C001-77", "08c64c92b77dbd9e999092a2fec3d1-C001-150" ] }, "@BACK@": { "gold_contexts": [ [ "08c64c92b77dbd9e999092a2fec3d1-C001-19" ], [ "08c64c92b77dbd9e999092a2fec3d1-C001-36" ], [ "08c64c92b77dbd9e999092a2fec3d1-C001-40" ] ], "cite_sentences": [ "08c64c92b77dbd9e999092a2fec3d1-C001-19", "08c64c92b77dbd9e999092a2fec3d1-C001-36", "08c64c92b77dbd9e999092a2fec3d1-C001-40" ] }, "@USE@": { "gold_contexts": [ [ "08c64c92b77dbd9e999092a2fec3d1-C001-36" ], [ "08c64c92b77dbd9e999092a2fec3d1-C001-65" ], [ "08c64c92b77dbd9e999092a2fec3d1-C001-67" ], [ "08c64c92b77dbd9e999092a2fec3d1-C001-69", "08c64c92b77dbd9e999092a2fec3d1-C001-70" ], [ "08c64c92b77dbd9e999092a2fec3d1-C001-93" ], [ "08c64c92b77dbd9e999092a2fec3d1-C001-96" ], [ "08c64c92b77dbd9e999092a2fec3d1-C001-100" ], [ "08c64c92b77dbd9e999092a2fec3d1-C001-117" ], [ "08c64c92b77dbd9e999092a2fec3d1-C001-130" ], [ "08c64c92b77dbd9e999092a2fec3d1-C001-131" ], [ "08c64c92b77dbd9e999092a2fec3d1-C001-134" ], [ "08c64c92b77dbd9e999092a2fec3d1-C001-139" ] ], "cite_sentences": [ 
"08c64c92b77dbd9e999092a2fec3d1-C001-36", "08c64c92b77dbd9e999092a2fec3d1-C001-65", "08c64c92b77dbd9e999092a2fec3d1-C001-67", "08c64c92b77dbd9e999092a2fec3d1-C001-70", "08c64c92b77dbd9e999092a2fec3d1-C001-93", "08c64c92b77dbd9e999092a2fec3d1-C001-96", "08c64c92b77dbd9e999092a2fec3d1-C001-100", "08c64c92b77dbd9e999092a2fec3d1-C001-117", "08c64c92b77dbd9e999092a2fec3d1-C001-130", "08c64c92b77dbd9e999092a2fec3d1-C001-131", "08c64c92b77dbd9e999092a2fec3d1-C001-134", "08c64c92b77dbd9e999092a2fec3d1-C001-139" ] }, "@SIM@": { "gold_contexts": [ [ "08c64c92b77dbd9e999092a2fec3d1-C001-67" ], [ "08c64c92b77dbd9e999092a2fec3d1-C001-100" ], [ "08c64c92b77dbd9e999092a2fec3d1-C001-117" ], [ "08c64c92b77dbd9e999092a2fec3d1-C001-131" ], [ "08c64c92b77dbd9e999092a2fec3d1-C001-134" ], [ "08c64c92b77dbd9e999092a2fec3d1-C001-139" ], [ "08c64c92b77dbd9e999092a2fec3d1-C001-149" ] ], "cite_sentences": [ "08c64c92b77dbd9e999092a2fec3d1-C001-67", "08c64c92b77dbd9e999092a2fec3d1-C001-100", "08c64c92b77dbd9e999092a2fec3d1-C001-117", "08c64c92b77dbd9e999092a2fec3d1-C001-131", "08c64c92b77dbd9e999092a2fec3d1-C001-134", "08c64c92b77dbd9e999092a2fec3d1-C001-139", "08c64c92b77dbd9e999092a2fec3d1-C001-149" ] } } }, "ABC_c60a1131c6b1639b772b0e5c59588e_2": { "x": [ { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-98", "text": "Query Time: The set of target entities corresponding to a source entity and the relation being predicted is not available during query (test) time." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-2", "text": "Large-scale Knowledge Bases (such as NELL, Yago, Freebase, etc.) are often sparse, i.e., a large number of valid relations between existing entities are missing." 
}, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-3", "text": "Recent research has addressed this problem by augmenting the KB graph with additional edges mined from a large text corpus while keeping the set of nodes fixed, and then using the Path Ranking Algorithm (PRA) to perform KB inference over this augmented graph." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-4", "text": "In this paper, we extend this line of work by augmenting the KB graph not only with edges, but also with bridging entities, where both the edges and bridging entities are mined from a 500 million web text corpus." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-5", "text": "Through experiments on real-world datasets, we demonstrate the value of bridging entities in improving the performance and running time of PRA in the KB inference task." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-6", "text": "----------------------------------" }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-8", "text": "Large-scale knowledge bases (KB) like Freebase (Bollacker et al., 2008) , Yago (Suchanek et al., 2007) , NELL (Mitchell et al., 2015) can be useful in a variety of applications like natural language question answering, semantic search engines, etc." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-9", "text": "These knowledge bases consist of millions of real world entities and relationships between them which are stored in the form of a directed graph where links represent relations and nodes represent the entities." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-10", "text": "Although such KBs contain millions of entities, they are still very sparse, i.e., they are missing a large number of relations between existing entities (West et al., 2014) ."
}, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-11", "text": "Performing inference over the knowledge graph for predicting relations between two entities is one way of densifying the KB graph." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-12", "text": "Figure 1: Example showing how addition of the bridging entity, Brian McCain, and the two edges incident on it can help the PRA algorithm (Lao and Cohen, 2010) to infer the initially missing relation instance teamPlaysSport(Yankees, BaseBall)." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-13", "text": "The original KB graph consisted only of two nodes, Yankees and Baseball, and no edges." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-14", "text": "For example, from (Germany, playsinTournament, FIFA) and (FIFA, tournamentofSport, Soccer), we can infer (Germany, playsSport, Soccer)." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-15", "text": "The Path Ranking Algorithm (PRA) (Lao and Cohen, 2010) , (Lao et al., 2011) performs such an inference by learning inference rules over the knowledge graph." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-16", "text": "If the knowledge graph is sparse, i.e., if there are very few or no paths between source and target entities, then PRA is unable to predict the existence of a relation." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-17", "text": "To address this shortcoming, (Lao et al., 2012) augmented the knowledge graph with paths obtained from an external corpus." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-99", "text": "We use all the entities included in the range of the relation being predicted as candidate target entities." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-91", "text": "Low-quality bridging entities connect source-target pairs from both positive and negative training sets, and hence are eliminated by the sparse logistic regression classifier."
}, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-92", "text": "The negative dataset is generated using the closed-world assumption by performing a random walk." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-93", "text": "After augmenting the KB, we run the training phase of the PRA algorithm to obtain the feature (path) weights computed by the logistic regression Table 2: Comparison of Mean Reciprocal Rank (MRR) metric for 10 relations from NELL (higher is better)." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-94", "text": "PRA-SVO, PRA-VS are the systems proposed in (Gardner et al., 2013; Gardner et al., 2014) ." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-95", "text": "PRA-ODA is the approach proposed in this paper." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-96", "text": "Improvements in PRA-ODA over PRA-SVO are statistically significant with p < 0.007, with PRA-SVO as the null hypothesis." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-18", "text": "The added paths consisted of unlexicalized dependency labels obtained from a dependency-parsed external corpus." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-19", "text": "To improve the expressivity of the added paths, instead of the unlexicalized labels, (Gardner et al., 2013) augmented the KB graph with verbs (surface relations) from a corpus containing over 600 million Subject-Verb-Object (SVO) triples." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-20", "text": "These verbs act as edges that connect previously unconnected entities, thereby increasing the connectivity of the KB graph, which can potentially improve PRA performance." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-21", "text": "However, na\u00efvely adding these edges increases the feature sparsity, which degrades the discriminative ability of the logistic regression classifier used in PRA."
}, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-22", "text": "This can be addressed by adding latent relations obtained by clustering the surface relations, instead of directly adding the surface relations." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-23", "text": "This reduces feature sparsity and has been shown to improve PRA inference (Gardner et al., 2013) , (Gardner et al., 2014) ." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-24", "text": "In this article we propose a scheme for augmenting the KB using paths obtained by mining noun phrases that connect two SVO triples from an external corpus." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-25", "text": "We term these noun phrases bridging entities since they bridge two KB relations to form a path." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-26", "text": "This is different from the scheme in (Gardner et al., 2013) and (Gardner et al., 2014) , which adds edges between KB nodes by mining surface relations from an external corpus." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-27", "text": "We search for such bridging entities in the corpus by performing a limited-depth DFS (depth-first search) on the corpus graph in an on-demand fashion." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-28", "text": "We term this procedure On-Demand Augmentation (ODA), because the search can be performed during test time in an on-demand manner." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-29", "text": "In contrast, the previous approaches of adding edges or embeddings to the KB (Gardner et al., 2013) , and vector space random walk PRA (Gardner et al., 2014) are batch procedures." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-30", "text": "As we shall see in Section 4, due to a limited search space, on-demand augmentation is much faster than the algorithms in (Gardner et al., 2013; Gardner et al., 2014) ."
}, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-31", "text": "Furthermore, since edges are not added blindly, on-demand augmentation does not increase feature sparsity, which is responsible for performance degradation." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-32", "text": "Our experiments suggest that ODA provides better performance than (Gardner et al., 2013) and nearly the same prediction performance as provided by (Gardner et al., 2014) , but in both cases with the added advantage of faster running time and greater flexibility due to its online and on-demand nature." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-33", "text": "The code along with the results can be obtained at https://github.com/malllabiisc/pra-oda." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-34", "text": "----------------------------------" }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-35", "text": "**RELATED WORK**" }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-36", "text": "Using surface-level relations and noun phrases for extracting meaningful relational facts is not a new idea (Hearst, 1992) , (Brin, 1999) , (Etzioni et al., 2004) ." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-37", "text": "However, none of these approaches makes use of Knowledge Bases for improving information extraction." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-38", "text": "The Path Ranking Algorithm (PRA) first proposed in (Lao and Cohen, 2010) was used for performing inference over a KB in (Lao et al., 2011) ." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-39", "text": "It was extended by (Lao et al., 2012) to improve the inference by augmenting the KB with syntactic information obtained from a dependency-parsed corpus."
}, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-40", "text": "Augmenting the KB for improving PRA inference using surface relations mined from an external corpus and using latent edge labels obtained by performing PCA on the surface relations was explored in (Gardner et al., 2013) ." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-41", "text": "Instead of a hard mapping of surface relations to latent embeddings, (Gardner et al., 2014) perform a 'soft' mapping using vector space random walks." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-42", "text": "This allows the random walker to traverse an edge semantically similar to the current edge type more frequently than other edges." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-43", "text": "Although, like others, we too use an external corpus to augment the KB, the crucial difference in our approach is that apart from adding surface relations, we also add bridging entities that enable us to create new paths in the KB." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-44", "text": "Furthermore, the procedure is targeted so that only paths that play a part in inferring the relations of interest are added." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-45", "text": "Thus, the number of paths added in this manner is much lower than the number of surface relations added using the procedure in (Gardner et al., 2013) ." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-46", "text": "As we shall see in Section 4, this results in a more effective algorithm with faster runtime." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-47", "text": "We first present a brief overview of the Path Ranking Algorithm (PRA) (Lao and Cohen, 2010) ." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-48", "text": "The PRA uses paths as features for a logistic regression classifier, which predicts whether the given relation exists between a pair of entities."
}, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-49", "text": "For a given pair of entities s and t, the path types connecting s to t form the feature vector." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-50", "text": "A path type \u03c0 is an ordered set of relations." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-51", "text": "Paths with the same ordered relations but different intermediate or terminal entities belong to the same path type." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-52", "text": "For example," }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-53", "text": "The value of a feature is taken to be P (s \u2192 t; \u03c0), where P (s \u2192 t; \u03c0) is the probability of reaching t from s by traversing paths of type \u03c0." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-54", "text": "PRA approximates these probabilities by running a random walk (RW) on the KB graph." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-55", "text": "Let F = {\u03c0 1 , \u03c0 2 , ..., \u03c0 k } be the set of all path types." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-56", "text": "For predicting the existence of relation r between entities s and t, the logistic regression classifier outputs a score, which is a measure of the confidence that r exists between s and t. It does so by first assigning weights to the features in the training phase." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-57", "text": "The score is given by" }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-58", "text": "where \u03b8 r \u03c0 is the weight learned by the logistic regression classifier during training specifically for relation r and path type \u03c0." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-59", "text": "During the test phase, since targets are not available, the PRA gathers candidate targets by performing a random walk and then computes feature vectors and the score."
}, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-60", "text": "----------------------------------" }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-61", "text": "**PRA-SVO AND PRA-VS**" }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-62", "text": "PRA-SVO and PRA-VS are the systems proposed in (Gardner et al., 2013) and (Gardner et al., 2014) , respectively, where the KB graph is augmented with edges mined from a large subject-verb-object (SVO) triple corpus." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-63", "text": "In these two systems, only new edges are added over the fixed set of nodes, and the augmentation happens in a batch, offline setting." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-64", "text": "In contrast, PRA-ODA, the method proposed in this paper, also expands the set of nodes through bridging entities, and performs the augmentation in an on-demand manner." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-65", "text": "----------------------------------" }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-66", "text": "**PRA ON-DEMAND AUGMENTATION**" }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-67", "text": "(PRA-ODA) Training: Let s and t be any two KB entities and let s (n) and t (n) be their corresponding noun phrase representations or aliases." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-68", "text": "We search for bridging entities x 1 , x 2 , ..., x n by performing a limited-depth depth-first search (DFS) starting with s n such that we obtain a path s" }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-69", "text": "\u2212\u2192 t, where v i are verbs present in the corpus graph." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-70", "text": "This is done for all n \u2264 d max \u2212 1, where d max is the maximum depth of DFS." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-71", "text": "We add an 'ALIAS' edge between the KB entity and its noun phrase representation."
}, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-72", "text": "The usefulness of bridging entities is illustrated in Fig. 1 ." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-73", "text": "We mine bridging entities from a corpus containing over 600 million SVO triples, which were obtained from the ClueWeb09 corpus (Callan et al., 2009 ) parsed using the MALT parser (Nivre et al., 2007) ." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-74", "text": "We use MongoDB to store the triples as an adjacency list." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-75", "text": "During training time, for any relation that is being inferred, both the source and its corresponding target entities are known." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-76", "text": "A limited-depth DFS is performed for all depths less than d max on the SVO graph, with the aliases of the subject entity acting as the starting points." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-77", "text": "Such aliases are available for the NELL and Freebase knowledge bases." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-78", "text": "The DFS is said to discover a path if the terminating entity of the path matches any alias of the target entity." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-79", "text": "We choose to use aliases to perform string matching, since it is easy to change the softness of the match by simply adding more aliases." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-80", "text": "This is done for all training source-target pairs." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-81", "text": "A few examples of added paths are shown in Table 1 ." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-82", "text": "The SVO graph is noisy, since it is obtained by parsing the ClueWeb corpus, which was obtained by scraping the web."
}, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-83", "text": "To reduce noise, we add the top K most frequent discovered SVO path types, where K is a tunable parameter." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-84", "text": "By SVO path type we refer to a set of ordered verbs mined from the SVO corpus." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-85", "text": "There is a possibility that the bridging entities, extracted from the corpus, may be present in the KB." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-86", "text": "If the bridging entity matches any alias, then it is treated as an alias to an existing KB entity." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-87", "text": "If not, then the bridging entity is added to the KB as a new entity." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-88", "text": "To avoid overfitting, we add negative data to the training set." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-89", "text": "Furthermore, only high-quality, expressive bridging entities result in meaningful and discriminative paths." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-90", "text": "Although the quality of bridging entities depends on the corpus, low-quality bridging entities can be filtered out by adding negative training data." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-97", "text": "classifier." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-100", "text": "For example, if the relation is riverFlowsThroughCity, the candidate target set would include entities in the KB that are cities." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-101", "text": "The DFS is now performed starting from source entities as during training, but this time restricted only to paths with positive weights learned during training."
}, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-102", "text": "Any paths (along with bridging entities) found during this search are added to the KB, and the PRA algorithm is now run over this augmented graph." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-103", "text": "----------------------------------" }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-104", "text": "**EXPERIMENTS**" }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-105", "text": "We used the implementation of PRA provided by the authors of (Gardner et al., 2014) ." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-106", "text": "For our experiments, we used the same data for 10 NELL relations as used in (Gardner et al., 2014) ." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-107", "text": "The augmentation resulted in the addition of 1086 paths during training and 1430 paths during test time." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-108", "text": "We split the NELL data into 60% training data, 15% development data, and 25% test data." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-109", "text": "Values for d max and K, the number of most frequent paths, were obtained by tuning on a development set for 4 relations (athleteplaysforsport, actorstarredinmovie, citylocatedincountry (Gardner et al., 2013; Gardner et al., 2014) ." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-110", "text": "PRA-ODA is the approach proposed in this paper." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-111", "text": "Between the two top-performing systems, i.e., PRA-ODA and PRA-VS, PRA-ODA is faster by a factor of 1.8." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-112", "text": "and journalistwritesforpublication)." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-113", "text": "The hyperparameter values d max = 2, K = 10 yielded the highest MRR and were used for the rest of the relations."
}, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-114", "text": "For the L 1 and L 2 regularization parameters in the logistic regression classifier, we used the same values as used in (Gardner et al., 2013; Gardner et al., 2014) , viz., L 1 = 0.005, and L 2 = 1.0." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-115", "text": "This is because the parameters were reported to be robust, and seemed to work well even when the knowledge base was augmented." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-116", "text": "We compare the results (PRA-ODA) with the PRA algorithm executed on the NELL KB, the NELL KB augmented with surface relations (PRA-SVO) (Gardner et al., 2013) , and vector space random walk PRA (PRA-VS) (Gardner et al., 2014) ." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-117", "text": "The run times, i.e., the time taken to perform an entire experiment, for PRA-SVO and PRA-VS include the time taken to augment the NELL KB with SVO edges." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-118", "text": "The PRA-VS runtime also includes the time taken for generating embeddings to perform the vector space random walk." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-119", "text": "As can be seen from Table 2 and Table 3 , our scheme, PRA-ODA, provides performance equivalent to PRA-VS with faster running time (speedup of 1.8)." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-120", "text": "In addition to the time taken for the full SVO augmentation, PRA-VS takes additional time to generate embeddings (13 minutes) from the added verbs." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-121", "text": "We note that the batch augmentation in the case of PRA-SVO and PRA-VS, and the embedding computation in the case of PRA-VS, are all specific to the relations in the evaluation set, and hence cannot be ignored as a one-time offline cost."
}, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-122", "text": "In other words, these costs are likely to increase as more relations (and their instances) are included during training and testing." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-123", "text": "Runtime gains with PRA-ODA are likely to be even more pronounced in such settings." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-124", "text": "An additional advantage of the proposed algorithm is that it can also be run on top of any PRA-based algorithm, such as PRA-SVO and PRA-VS." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-125", "text": "----------------------------------" }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-126", "text": "**CONCLUSION**" }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-127", "text": "In this paper, we investigated the usefulness of adding paths to a Knowledge Base for improving its connectivity by mining bridging entities from an external corpus." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-128", "text": "While previous KB augmentation methods focused only on adding mined surface verbs and kept the node set fixed, we extended these approaches by also adding bridging entities in an online fashion." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-129", "text": "We used a large 500 million web text corpus to mine these additional edges and bridging entities." }, { "sent_id": "c60a1131c6b1639b772b0e5c59588e-C001-130", "text": "Through experiments on real-world datasets, we demonstrate that the proposed approach is not only comparable to or better than other state-of-the-art baselines, but more importantly provides a faster overall runtime than the alternatives."
} ], "y": { "@BACK@": { "gold_contexts": [ [ "c60a1131c6b1639b772b0e5c59588e-C001-19" ], [ "c60a1131c6b1639b772b0e5c59588e-C001-23" ], [ "c60a1131c6b1639b772b0e5c59588e-C001-40" ], [ "c60a1131c6b1639b772b0e5c59588e-C001-62" ] ], "cite_sentences": [ "c60a1131c6b1639b772b0e5c59588e-C001-19", "c60a1131c6b1639b772b0e5c59588e-C001-23", "c60a1131c6b1639b772b0e5c59588e-C001-40", "c60a1131c6b1639b772b0e5c59588e-C001-62" ] }, "@DIF@": { "gold_contexts": [ [ "c60a1131c6b1639b772b0e5c59588e-C001-26" ], [ "c60a1131c6b1639b772b0e5c59588e-C001-29" ], [ "c60a1131c6b1639b772b0e5c59588e-C001-30" ], [ "c60a1131c6b1639b772b0e5c59588e-C001-32" ], [ "c60a1131c6b1639b772b0e5c59588e-C001-45" ], [ "c60a1131c6b1639b772b0e5c59588e-C001-62", "c60a1131c6b1639b772b0e5c59588e-C001-63", "c60a1131c6b1639b772b0e5c59588e-C001-64" ], [ "c60a1131c6b1639b772b0e5c59588e-C001-96" ], [ "c60a1131c6b1639b772b0e5c59588e-C001-116", "c60a1131c6b1639b772b0e5c59588e-C001-117", "c60a1131c6b1639b772b0e5c59588e-C001-118", "c60a1131c6b1639b772b0e5c59588e-C001-119" ], [ "c60a1131c6b1639b772b0e5c59588e-C001-121" ], [ "c60a1131c6b1639b772b0e5c59588e-C001-124" ] ], "cite_sentences": [ "c60a1131c6b1639b772b0e5c59588e-C001-26", "c60a1131c6b1639b772b0e5c59588e-C001-29", "c60a1131c6b1639b772b0e5c59588e-C001-30", "c60a1131c6b1639b772b0e5c59588e-C001-32", "c60a1131c6b1639b772b0e5c59588e-C001-45", "c60a1131c6b1639b772b0e5c59588e-C001-62", "c60a1131c6b1639b772b0e5c59588e-C001-96", "c60a1131c6b1639b772b0e5c59588e-C001-116", "c60a1131c6b1639b772b0e5c59588e-C001-117", "c60a1131c6b1639b772b0e5c59588e-C001-121", "c60a1131c6b1639b772b0e5c59588e-C001-124" ] }, "@MOT@": { "gold_contexts": [ [ "c60a1131c6b1639b772b0e5c59588e-C001-62", "c60a1131c6b1639b772b0e5c59588e-C001-63", "c60a1131c6b1639b772b0e5c59588e-C001-64" ] ], "cite_sentences": [ "c60a1131c6b1639b772b0e5c59588e-C001-62" ] }, "@USE@": { "gold_contexts": [ [ "c60a1131c6b1639b772b0e5c59588e-C001-114" ] ], "cite_sentences": [ "c60a1131c6b1639b772b0e5c59588e-C001-114" ] }, 
"@SIM@": { "gold_contexts": [ [ "c60a1131c6b1639b772b0e5c59588e-C001-114" ] ], "cite_sentences": [ "c60a1131c6b1639b772b0e5c59588e-C001-114" ] }, "@UNSURE@": { "gold_contexts": [ [ "c60a1131c6b1639b772b0e5c59588e-C001-124" ] ], "cite_sentences": [ "c60a1131c6b1639b772b0e5c59588e-C001-124" ] } } }, "ABC_7f78697390e28cc7798f8cb183cb59_2": { "x": [ { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-137", "text": "In future work we therefore also intend to consider approaches such as context2vec (Melamud et al., 2016) which explicitly encode the context in which a token occurs." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-2", "text": "Verb-noun combinations (VNCs) -e.g., blow the whistle, hit the roof, and see stars -are a common type of English idiom that are ambiguous with literal usages." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-3", "text": "In this paper we propose and evaluate models for classifying VNC usages as idiomatic or literal, based on a variety of approaches to forming distributed representations." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-4", "text": "Our results show that a model based on averaging word embeddings performs on par with, or better than, a previously-proposed approach based on skip-thoughts." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-5", "text": "Idiomatic usages of VNCs are known to exhibit lexico-syntactic fixedness." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-6", "text": "We further incorporate this information into our models, demonstrating that this rich linguistic knowledge is complementary to the information carried by distributed representations." 
}, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-7", "text": "----------------------------------" }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-9", "text": "Multiword expressions (MWEs) are combinations of multiple words that exhibit some degree of idiomaticity (Baldwin and Kim, 2010) ." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-10", "text": "Verb-noun combinations (VNCs), consisting of a verb with a noun in its direct object position, are a common type of semantically-idiomatic MWE in English and cross-lingually (Fazly et al., 2009 )." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-11", "text": "Many VNCs are ambiguous between MWEs and literal combinations, as in the following examples of see stars, in which 1 is an idiomatic usage (i.e., an MWE), while 2 is a literal combination." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-138", "text": "Finally, one known challenge of VNC token classification is to develop models that are able to generalize to VNC types that were not seen during training (Gharbieh et al., 2016) ." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-12", "text": "1. Hereford United were seeing stars at Gillingham after letting in 2 early goals. 2. Look into the night sky to see the stars. MWE identification is the task of automatically determining which word combinations at the token level form MWEs (Baldwin and Kim, 2010) , and must be able to make such distinctions." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-13", "text": "This is particularly important for applications such as machine translation (Sag et al., 2002) , where the appropriate meaning of word combinations in context must be preserved for accurate translation."
}, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-14", "text": "In this paper, following prior work (e.g., Salton et al., 2016 ), we frame token-level identification of VNCs as a supervised binary classification problem, i.e., idiomatic vs. literal." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-15", "text": "We consider a range of approaches to forming distributed representations of the context in which a VNC occurs, including word embeddings (Mikolov et al., 2013) , word embeddings tailored to representing sentences (Kenter et al., 2016) , and skip-thoughts sentence embeddings (Kiros et al., 2015) ." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-16", "text": "We then train a support vector machine (SVM) on these representations to classify unseen VNC instances." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-17", "text": "Surprisingly, we find that an approach based on representing sentences as the average of their word embeddings performs comparably to, or better than, the skip-thoughts based approach previously proposed by Salton et al. (2016) ." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-18", "text": "VNCs exhibit lexico-syntactic fixedness." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-19", "text": "For example, the idiomatic interpretation in example 1 above is typically only accessible when the verb see has active voice, the determiner is null, and the noun star is in plural form, as in see stars or seeing stars." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-20", "text": "Usages with a determiner (as in example 2), a singular noun (e.g., see a star), or passive voice (e.g., stars were seen) typically only have the literal interpretation." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-21", "text": "In this paper we further incorporate knowledge of the lexico-syntactic fixedness of VNCs -automatically acquired from corpora using the method of Fazly et al. 
(2009) -into our various embedding-based approaches." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-22", "text": "Our experimental results show that this leads to substantial improvements, indicating that this rich linguistic knowledge is complementary to that available in distributed representations." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-23", "text": "----------------------------------" }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-24", "text": "**RELATED WORK**" }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-25", "text": "Much research on MWE identification has focused on specific kinds of MWEs (e.g., Patrick and Fletcher, 2005; Uchiyama et al., 2005) , including English VNCs (e.g., Fazly et al., 2009; Salton et al., 2016) , although some recent work has considered the identification of a broad range of kinds of MWEs (e.g., Schneider et al., 2014; Brooke et al., 2014; Savary et al., 2017) ." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-26", "text": "Work on MWE identification has leveraged rich linguistic knowledge of the constructions under consideration (e.g., Fazly et al., 2009; Fothergill and Baldwin, 2012) , treated literal and idiomatic as two senses of an expression and applied approaches similar to word-sense disambiguation (e.g., Birke and Sarkar, 2006; Hashimoto and Kawahara, 2008) , incorporated topic models (e.g., Li et al., 2010) , and made use of distributed representations of words (Gharbieh et al., 2016) ." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-27", "text": "In the most closely related work to ours, Salton et al. (2016) represent token instances of VNCs by embedding the sentence that they occur in using skip-thoughts (Kiros et al., 2015) -an encoder-decoder model that can be viewed as a sentence-level counterpart to the word2vec (Mikolov et al., 2013 ) skip-gram model."
}, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-28", "text": "During training, the target sentence is encoded using a recurrent neural network, and is used to predict the previous and next sentences." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-29", "text": "Salton et al. then use these sentence embeddings, representing VNC token instances, as features in a supervised classifier." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-30", "text": "We treat this skip-thoughts based approach as a strong baseline to compare against." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-31", "text": "Fazly et al. (2009) formed a set of eleven lexico-syntactic patterns for VNC instances capturing the voice of the verb (active or passive), determiner (e.g., a, the), and number of the noun (singular or plural)." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-32", "text": "They then determine the canonical form, C(v, n), for a given VNC as follows:" }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-33", "text": "where P is the set of patterns, T z is a predetermined threshold, which is set to 1, and z(v, n, pt k ) is calculated as follows:" }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-34", "text": "where f (\u00b7) is the frequency of a VNC occurring in a given pattern in a corpus, and f and s are the mean and standard deviation for all patterns for the given VNC, respectively." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-35", "text": "Fazly et al. (2009) showed that idiomatic usages of a VNC tend to occur in that expression's canonical form, while literal usages do not." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-36", "text": "This approach provides a strong, linguistically-informed, unsupervised baseline, referred to as CForm, for predicting whether VNC instances are idiomatic or literal."
}, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-37", "text": "In this paper we incorporate knowledge of canonical forms into embedding-based approaches to VNC token classification, and show that this linguistic knowledge can be leveraged to improve such approaches." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-38", "text": "----------------------------------" }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-39", "text": "**MODELS**" }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-40", "text": "We describe the models used to represent VNC token instances below." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-41", "text": "For each model, a linear SVM classifier is trained on these representations." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-42", "text": "----------------------------------" }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-43", "text": "**WORD2VEC**" }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-44", "text": "We trained word2vec's skip-gram model (Mikolov et al., 2013 ) on a snapshot of Wikipedia from September 2015, which consists of approximately 2.6 billion tokens." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-45", "text": "We used a window size of \u00b18 and 300 dimensions." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-46", "text": "We ignore all words that occur less than fifteen times in the training corpus, and did not set a maximum vocabulary size." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-47", "text": "We perform negative sampling and set the number of training epochs to five." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-48", "text": "We used batch processing with approximately 10k words in each batch." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-49", "text": "To embed a given a sentence containing a VNC token instance, we average the word embeddings for each word in the sentence, including stopwords." 
}, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-50", "text": "4 Prior to averaging, we normalize each embedding to have unit length." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-51", "text": "----------------------------------" }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-52", "text": "**SIAMESE CBOW**" }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-53", "text": "The Siamese CBOW model (Kenter et al., 2016) learns word embeddings that are better able to represent a sentence through averaging than conventional word embeddings such as skip-gram or CBOW." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-54", "text": "We use a Siamese CBOW model that was pretrained on a snapshot of Wikipedia from November 2012 using randomly initialized word embeddings." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-55", "text": "5 Similarly to the word2vec model, to embed a given sentence containing a VNC instance, we average the word embeddings for each word in the sentence." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-56", "text": "----------------------------------" }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-57", "text": "**SKIP-THOUGHTS**" }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-58", "text": "We use a publicly-available skip-thoughts model, that was pre-trained on a corpus of books." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-59", "text": "6 We represent a given sentence containing a VNC instance using the skip-thoughts encoder." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-60", "text": "Note that this approach is our re-implementation of the skipthoughts based method of Salton et al. (2016) , and we use it as a strong baseline for comparison." 
}, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-61", "text": "----------------------------------" }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-62", "text": "**DATA AND EVALUATION**" }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-63", "text": "In this section, we discuss the dataset used in our experiments, and the evaluation of our models." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-64", "text": "----------------------------------" }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-65", "text": "**DATASET**" }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-66", "text": "We use the VNC-Tokens dataset (Cook et al., 2008) -the same dataset used by Fazly et al. (2009) and Salton et al. (2016) -to train and evaluate our models." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-67", "text": "This dataset consists of sentences containing VNC usages drawn from the British National Corpus (Burnard, 2000) , 7 along with a label indicating whether the VNC is an idiomatic or literal usage (or whether this cannot be determined, in which case it is labelled \"unknown\")." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-68", "text": "VNC-Tokens is divided into DEV and TEST sets that each include fourteen VNC types and a total of roughly six hundred instances of these types annotated as literal or idiomatic." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-69", "text": "Following Salton et al. (2016) , we use DEV and TEST, and ignore all token instances annotated as \"unknown\"." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-70", "text": "Fazly et al. (2009) and Salton et al. (2016) structured their experiments differently." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-71", "text": "Fazly et al. report results over DEV and TEST separately." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-72", "text": "In this setup TEST consists of expressions that were not seen during model development (done on DEV)." 
}, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-73", "text": "Salton et al., on the other hand, merge DEV and TEST, and create new training and testing sets, such that each expression is present in the training and testing data, and the ratio of idiomatic to literal usages of each expression in the training data is roughly equal to that in the testing data." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-74", "text": "We borrowed ideas from both of these approaches in structuring our experiments." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-75", "text": "We retain We then divide each of these into training and testing sets, using the same ratios of idiomatic to literal usages for each expression as Salton et al. (2016) ." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-76", "text": "This allows us to develop and tune a model on DEV, and then determine whether, when retrained on instances of unseen VNCs in (the training portion of) TEST, that model is able to generalize to new VNCs without further tuning to the specific expressions in TEST." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-77", "text": "----------------------------------" }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-78", "text": "**EVALUATION**" }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-79", "text": "The proportion of idiomatic usages in the testing portions of both DEV and TEST is 63%." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-80", "text": "We therefore use accuracy to evaluate our models following Fazly et al. (2009) because the classes are roughly balanced." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-81", "text": "We randomly divide both DEV and TEST into training and testing portions ten times, following Salton et al. (2016) ." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-82", "text": "For each of the ten runs, we compute the accuracy for each expression, and then compute the average accuracy over the expressions." 
}, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-83", "text": "We then report the average accuracy over the ten runs." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-84", "text": "----------------------------------" }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-85", "text": "**EXPERIMENTAL RESULTS**" }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-86", "text": "In this section we first consider the effect of tuning the cost parameter of the SVM for each model on DEV, and then report results on DEV and TEST using the tuned models." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-87", "text": "----------------------------------" }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-88", "text": "**PARAMETER TUNING**" }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-89", "text": "We tune the SVM for each model on DEV by carrying out a linear search for the penalty cost from 0.01-100, increasing by a factor of ten each time." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-90", "text": "Results for this parameter tuning are shown in Table 1 ." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-91", "text": "These results highlight the importance of choosing an appropriate setting for the penalty cost." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-92", "text": "For example, the accuracy of the word2vec model ranges from 0.619-0.830 depending on the cost setting." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-93", "text": "In subsequent experiments, for each model, we use the penalty cost that achieves the highest accuracy in Table 1 ." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-94", "text": "----------------------------------" }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-95", "text": "**DEV AND TEST RESULTS**" }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-96", "text": "In Table 2 we report results on DEV and TEST for each model, as well as the unsupervised CForm model of Fazly et al. 
(2009) , which simply labels a VNC as idiomatic if it occurs in its canonical form, and as literal otherwise." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-97", "text": "We further consider each model (other than CForm) in two setups." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-98", "text": "\u2212CF corresponds to the models as described in Section 3. +CF further incorporates lexico-syntactic knowledge of canonical forms into each model by concatenating the embedding representing each VNC token instance with a one-dimensional vector that is one if the VNC occurs in its canonical form, and zero otherwise." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-99", "text": "We first consider results for the \u2212CF setup." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-100", "text": "On both DEV and TEST, the accuracy achieved by each supervised model is higher than that of the unsupervised CForm approach, except for Siamese CBOW on TEST." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-101", "text": "The word2vec model achieves the highest accuracy on DEV and TEST of 0.830 and 0.804, respectively." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-139", "text": "In future work we plan to explore this experimental setup." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-102", "text": "The difference between the word2vec model and the next-best model, skip-thoughts, is significant using a bootstrap test (Berg-Kirkpatrick et al., 2012) with 10k repetitions for DEV (p = 0.006), but not for TEST (p = 0.051)." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-103", "text": "Nevertheless, it is remarkable that the relatively simple approach of averaging word2vec embeddings performs as well as, or better than, the much more complex skip-thoughts model used by Salton et al. (2016) ."
}, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-104", "text": "8 8 The word2vec and skip-thoughts models were trained on different corpora, which could contribute to the differences in results for these models." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-105", "text": "We therefore carried out an additional experiment in which we trained word2vec on BookCorpus, the corpus on which skip-thoughts was trained." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-106", "text": "This new word2vec model achieved accuracies of 0.825 and 0.809, on DEV and TEST, respectively, which are also higher accuTurning to the +CF setup, we observe that, for both DEV and TEST, each model achieves higher accuracy than in the \u2212CF setup." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-107", "text": "9 All of these differences are significant using a bootstrap test (p < 0.002 in each case)." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-108", "text": "In addition, each method outperforms the unsupervised CForm approach on both DEV and TEST." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-109", "text": "These findings demonstrate that the linguistically-motivated, lexico-syntactic knowledge encoded by the canonical form feature is complementary to the information from a wide range of types of distributed representations." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-110", "text": "In the +CF setup, the word2vec model again achieves the highest accuracy on both DEV and TEST of 0.854 and 0.852, respectively." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-111", "text": "10 The difference between the word2vec model and the next-best model, again skip-thoughts, is significant for both DEV and TEST using a bootstrap test (p < 0.05 in each case)." 
}, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-112", "text": "To better understand the impact of the canonical form feature when combined with the word2vec model, we compute the average precision, recall, and F1 score for each MWE for both the positive (idiomatic) and negative (literal) classes, for each run on TEST." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-113", "text": "11 For a given run, we then compute the average precision, recall, and F1 score across all MWEs, and then the average over all ten runs." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-114", "text": "We do this using CForm, and the word2vec model with and without the canonical form feature." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-115", "text": "Results are shown in Table 3 ." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-116", "text": "In line with the findings of Fazly et al. (2009) , CForm achieves higher precision and recall on idiomatic usages than literal ones." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-117", "text": "In particular, the relatively low recall for the literal class indicates that many literal usages occur in a canonical form." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-118", "text": "Comparing the word2vec model with and without the canonical form feature, we see that, when this feature is used, there is a relatively larger increase in precision and recall (and F1 score) for the literal class, than for the idiomatic class." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-119", "text": "This indicates that, although the racies than those obtained by the skip-thoughts model." 
}, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-120", "text": "9 In order to determine that this improvement is due to the information about canonical forms carried by the additional feature in the +CF setup, and not due to the increase in number of dimensions, we performed additional experiments in which we concatenated the embedding representations with a random binary feature, and with a randomly chosen value between 0 and 1." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-121", "text": "For each model, neither of these approaches outperformed that model using the +CF setup." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-122", "text": "10 In the +CF setup, the word2vec model using embeddings that were trained on the same corpus as skip-thoughts achieved accuracies of 0.846 and 0.851, on DEV and TEST, respectively." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-123", "text": "These are again higher accuracies than the corresponding setup for the skip-thoughts model." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-124", "text": "11 We carried out the same analysis on DEV." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-125", "text": "The findings were similar." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-126", "text": "canonical form feature itself performs relatively poorly on literal usages, it provides information that enables the word2vec model to better identify literal usages." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-127", "text": "----------------------------------" }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-128", "text": "**CONCLUSIONS**" }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-129", "text": "Determining whether a usage of a VNC is idiomatic or literal is important for applications such as machine translation, where it is vital to preserve the meanings of word combinations." 
}, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-130", "text": "In this paper we proposed two approaches to the task of classifying VNC token instances as idiomatic or literal based on word2vec embeddings and Siamese CBOW." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-131", "text": "We compared these approaches against a linguistically-informed unsupervised baseline, and a model based on skip-thoughts previously applied to this task (Salton et al., 2016) ." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-132", "text": "Our experimental results show that a comparatively simple approach based on averaging word embeddings performs at least as well as, or better than, the approach based on skip-thoughts." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-133", "text": "We further proposed methods to combine linguistic knowledge of the lexico-syntactic fixedness of VNCs -socalled \"canonical forms\", which can be automatically acquired from corpora via statistical methods -with the embedding based approaches." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-134", "text": "Our findings indicate that this rich linguistic knowledge is complementary to that available in distributed representations." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-135", "text": "Alternative approaches to embedding sentences containing VNC instances could also be considered, for example, FastSent (Hill et al., 2016) ." }, { "sent_id": "7f78697390e28cc7798f8cb183cb59-C001-136", "text": "However, all of the models we used represent the context of a VNC by the sentence in which it occurs." 
} ], "y": { "@USE@": { "gold_contexts": [ [ "7f78697390e28cc7798f8cb183cb59-C001-14" ], [ "7f78697390e28cc7798f8cb183cb59-C001-60" ], [ "7f78697390e28cc7798f8cb183cb59-C001-66" ], [ "7f78697390e28cc7798f8cb183cb59-C001-69" ], [ "7f78697390e28cc7798f8cb183cb59-C001-70", "7f78697390e28cc7798f8cb183cb59-C001-71", "7f78697390e28cc7798f8cb183cb59-C001-72", "7f78697390e28cc7798f8cb183cb59-C001-73", "7f78697390e28cc7798f8cb183cb59-C001-74" ], [ "7f78697390e28cc7798f8cb183cb59-C001-75" ], [ "7f78697390e28cc7798f8cb183cb59-C001-81" ], [ "7f78697390e28cc7798f8cb183cb59-C001-131" ] ], "cite_sentences": [ "7f78697390e28cc7798f8cb183cb59-C001-14", "7f78697390e28cc7798f8cb183cb59-C001-60", "7f78697390e28cc7798f8cb183cb59-C001-66", "7f78697390e28cc7798f8cb183cb59-C001-69", "7f78697390e28cc7798f8cb183cb59-C001-70", "7f78697390e28cc7798f8cb183cb59-C001-73", "7f78697390e28cc7798f8cb183cb59-C001-75", "7f78697390e28cc7798f8cb183cb59-C001-81", "7f78697390e28cc7798f8cb183cb59-C001-131" ] }, "@DIF@": { "gold_contexts": [ [ "7f78697390e28cc7798f8cb183cb59-C001-17" ], [ "7f78697390e28cc7798f8cb183cb59-C001-103" ] ], "cite_sentences": [ "7f78697390e28cc7798f8cb183cb59-C001-17", "7f78697390e28cc7798f8cb183cb59-C001-103" ] }, "@BACK@": { "gold_contexts": [ [ "7f78697390e28cc7798f8cb183cb59-C001-25" ], [ "7f78697390e28cc7798f8cb183cb59-C001-27" ], [ "7f78697390e28cc7798f8cb183cb59-C001-29" ] ], "cite_sentences": [ "7f78697390e28cc7798f8cb183cb59-C001-25", "7f78697390e28cc7798f8cb183cb59-C001-27", "7f78697390e28cc7798f8cb183cb59-C001-29" ] }, "@MOT@": { "gold_contexts": [ [ "7f78697390e28cc7798f8cb183cb59-C001-29", "7f78697390e28cc7798f8cb183cb59-C001-30" ] ], "cite_sentences": [ "7f78697390e28cc7798f8cb183cb59-C001-29" ] }, "@SIM@": { "gold_contexts": [ [ "7f78697390e28cc7798f8cb183cb59-C001-66" ], [ "7f78697390e28cc7798f8cb183cb59-C001-75" ] ], "cite_sentences": [ "7f78697390e28cc7798f8cb183cb59-C001-66", "7f78697390e28cc7798f8cb183cb59-C001-75" ] } } }, 
"ABC_2d3ec2e77947cb23af773926ec917b_3": { "x": [ { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-141", "text": "the Spearman rank correlation coefficient of the substitute ranking phase was 0.522." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-2", "text": "We propose a new dataset for evaluating a Japanese lexical simplification method." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-3", "text": "Previous datasets have several deficiencies." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-4", "text": "All of them substitute only a single target word, and some of them extract sentences only from newswire corpus." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-5", "text": "In addition, most of these datasets do not allow ties and integrate simplification ranking from all the annotators without considering the quality." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-6", "text": "In contrast, our dataset has the following advantages: (1) it is the first controlled and balanced dataset for Japanese lexical simplification with high correlation with human judgment and (2) the consistency of the simplification ranking is improved by allowing candidates to have ties and by considering the reliability of annotators." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-7", "text": "----------------------------------" }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-9", "text": "Lexical simplification is the task to find and substitute a complex word or phrase in a sentence with its simpler synonymous expression." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-10", "text": "We define complex word as a word that has lexical and subjective difficulty in a sentence." 
}, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-11", "text": "It can help in reading comprehension for children and language learners (De Belder and Moens, 2010) ." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-12", "text": "This task is a rather easier task which prepare a pair of complex and simple representations than a challenging task which changes the substitute pair in a given context (Specia et al., 2012; Kajiwara and Yamamoto, 2015) ." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-13", "text": "Construction of a benchmark dataset is important to ensure the reliability and reproducibility of evaluation." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-14", "text": "However, few resources are available for the automatic evaluation of lexical simplification." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-15", "text": "Specia et al. (2012) and De Belder and Moens (2010) created benchmark datasets for evaluating English lexical simplification." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-16", "text": "In addition, Horn et al. (2014) extracted simplification candidates and constructed an evaluation dataset using English Wikipedia and Simple English Wikipedia." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-17", "text": "In contrast, such a parallel corpus does not exist in Japanese." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-18", "text": "Kajiwara and Yamamoto (2015) constructed an evaluation dataset for Japanese lexical simplification 1 in languages other than English." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-19", "text": "However, there are four drawbacks in the dataset of Kajiwara and Yamamoto (2015) : (1) they extracted sentences only from a newswire corpus; (2) they substituted only a single target word; (3) they did not allow ties; and (4) they did not integrate simplification ranking considering the quality." 
}, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-20", "text": "Hence, we propose a new dataset addressing the problems in the dataset of Kajiwara and Yamamoto (2015) ." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-21", "text": "The main contributions of our study are as follows:" }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-22", "text": "\u2022 It is the first controlled and balanced dataset for Japanese lexical simplification." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-23", "text": "We extract sentences from a balanced corpus and control sentences to have only one complex word." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-24", "text": "Experimental results show that our dataset is more suitable than previous datasets for evaluating systems with respect to correlation with human judgment." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-25", "text": "\u2022 The consistency of simplification ranking is greatly improved by allowing candidates to have ties and by considering the reliability of annotators." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-26", "text": "Our dataset is available at GitHub 2 ." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-27", "text": "----------------------------------" }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-28", "text": "**RELATED WORK**" }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-29", "text": "The evaluation dataset for the English Lexical Simplification task (Specia et al., 2012) Figure 1: A part of the dataset of Kajiwara and Yamamoto (2015) ." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-30", "text": "notated on top of the evaluation dataset for English lexical substitution (McCarthy and Navigli, 2007) ." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-31", "text": "They asked university students to rerank substitutes according to simplification ranking." 
}, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-32", "text": "Sentences in their dataset do not always contain complex words, and it is not appropriate to evaluate simplification systems if a test sentence does not include any complex words." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-33", "text": "In addition, De Belder and Moens (2012) built an evaluation dataset for English lexical simplification based on that developed by McCarthy and Navigli (2007) ." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-34", "text": "They used Amazon's Mechanical Turk to rank substitutes and employed the reliability of annotators to remove outlier annotators and/or downweight unreliable annotators." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-35", "text": "The reliability was calculated on penalty based agreement (McCarthy and Navigli, 2007) and Fleiss' Kappa." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-36", "text": "Unlike the dataset of Specia et al. (2012) , sentences in their dataset contain at least one complex word, but they might contain more than one complex word." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-37", "text": "Again, it is not adequate for the automatic evaluation of lexical simplification because the human ranking of the resulting simplification might be affected by the context containing complex words." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-38", "text": "Furthermore, De Belder and Moens' (2012) dataset is too small to be used for achieving a reliable evaluation of lexical simplification systems." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-39", "text": "3 Problems in previous datasets for Japanese lexical simplification Kajiwara and Yamamoto (2015) followed Specia et al. (2012) to construct an evaluation dataset for Japanese lexical simplification." 
}, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-40", "text": "Namely, they split the data creation process into two steps: substitute extraction and simplification ranking." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-41", "text": "During the substitute extraction task, they collected substitutes of each target word in 10 different contexts." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-42", "text": "These contexts were randomly selected from a newswire corpus." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-43", "text": "The target word was a content word (noun, verb, adjective, or adverb) , and was neither a simple word nor part of any compound words." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-44", "text": "They gathered substitutes from five annotators using crowdsourcing." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-45", "text": "These procedures were the same as for De Belder and Moens (2012) ." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-46", "text": "During the simplification ranking task, annotators were asked to reorder the target word and its substitutes in a single order without allowing ties." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-47", "text": "They used crowdsourcing to find five annotators different from those who performed the substitute extraction task." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-48", "text": "Simplification ranking was integrated on the basis of the average of the simplification ranking from each annotator to generate a gold-standard ranking that might include ties." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-49", "text": "During the substitute extraction task, agreement among the annotators was 0.664, whereas during the simplification ranking task, Spearman's rank correlation coefficient score was 0.332." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-50", "text": "Spearman's score of this work was lower than that of Specia et al. 
(2012) by 0.064." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-51", "text": "Thus, there was a big blur between annotators, and the simplification ranking collected using crowdsourcing tended to have a lower quality." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-52", "text": "Figure 1 shows a part of the dataset of Kajiwara and Yamamoto (2015) ." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-53", "text": "Our discussion in this paper is based on this example." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-54", "text": "Domain of the dataset is limited." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-55", "text": "Because Kajiwara and Yamamoto (2015) extracted sentences from a newswire corpus, their dataset has a poor variety of expression." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-56", "text": "English lexical simplification datasets (Specia et al., 2012; De Belder and Moens, 2012) do not have this problem because both of them use a balanced corpus of English (Sharoff, 2006) ." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-57", "text": "Complex words might exist in context." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-58", "text": "In Figure 1, even when a target word such as \" (feel exalted)\" is simplified, another complex word \" (skill)\" is left in a sentence." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-59", "text": "Lexical simplification is a task of simplifying complex words in a sentence." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-60", "text": "Previous datasets may include multiple complex words in a sentence but target only one complex word." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-61", "text": "Not only the target word but also other complex words should be considered as well, but annotation of substitutes and simplification ranking to all complex words in a sentence produces a huge number of patterns, therefore takes a very high cost of annotation." 
}, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-62", "text": "For example, when a sentence contains three complex words with 10 substitutes each, annotators must consider 10^3 patterns." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-63", "text": "Thus, it is desirable that a sentence contain only simple words after the target word is substituted." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-64", "text": "Therefore, in this work, we extract sentences containing only one complex word." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-65", "text": "Ties are not permitted in simplification ranking." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-66", "text": "When each annotator assigns a simplification ranking to a substitution list, a tie cannot be assigned in previous datasets (Specia et al., 2012; Kajiwara and Yamamoto, 2015) ." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-67", "text": "This deteriorates ranking consistency when some substitutes have similar simplicity." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-68", "text": "De Belder and Moens (2012) allow ties in simplification ranking and report considerably higher agreement among annotators than Specia et al. (2012) ." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-69", "text": "The method of ranking integration is na\u00efve." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-70", "text": "Kajiwara and Yamamoto (2015) and Specia et al. (2012) use an average score to integrate rankings, but it might be biased by outliers." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-71", "text": "De Belder and Moens (2012) report a slight increase in agreement by greedily removing annotators to maximize the agreement score."
}, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-72", "text": "----------------------------------" }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-73", "text": "**BALANCED DATASET FOR EVALUATION OF JAPANESE LEXICAL SIMPLIFICATION**" }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-74", "text": "We create a balanced dataset for the evaluation of Japanese lexical simplification." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-75", "text": "Figure 2 illustrates how we constructed the dataset." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-76", "text": "It follows the data creation procedure of Kajiwara and Yamamoto's (2015) dataset with improvements to resolve the problems described in Section 3." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-77", "text": "We use a crowdsourcing application, Lancers 3 (3 http://www.lancers.jp/). (Figure 3 : Example of annotation for extracting substitutes.)" }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-78", "text": "Annotators provide substitutes that preserve the meaning of the target word, which is shown in bold in the sentence." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-79", "text": "In addition, annotators can write a substitute including particles." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-80", "text": "We use this application to perform substitute extraction, substitute evaluation, and substitute ranking." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-81", "text": "In each task, we requested annotators who had completed at least 95% of their previous assignments correctly." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-82", "text": "They were native Japanese speakers."
}, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-83", "text": "----------------------------------" }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-84", "text": "**EXTRACTING SENTENCES**" }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-85", "text": "Our work defines complex words as \"High Level\" words in the Lexicon for Japanese Language Education (Sunakawa et al., 2012) ." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-86", "text": "4 The word level was assigned by five teachers of Japanese, based on their experience and intuition." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-87", "text": "There were 7,940 high-level words out of 17,921 words in the lexicon." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-88", "text": "In addition, target words of this work comprised content words (nouns, verbs, adjectives, adverbs, adjectival nouns, sahen nouns, 5 and sahen verbs 6 )." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-89", "text": "Sentences that include a complex word were randomly extracted from the Balanced Corpus of Contemporary Written Japanese (Maekawa et al., 2010) ." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-90", "text": "Sentences shorter than seven words or longer than 35 words were excluded." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-91", "text": "We excluded target words that appeared as a part of compound words." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-92", "text": "Following previous work, 10 contexts of occurrence were collected for each complex word." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-93", "text": "We assigned 30 complex words for each part of speech." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-94", "text": "The total number of sentences was 2,100 (30 words \u00d7 10 sentences \u00d7 7 parts of speech)."
}, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-95", "text": "We used a crowdsourcing application to annotate 1,800 sentences, and we asked university students majoring in computer science to annotate 300 sentences to investigate the quality of crowdsourcing." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-96", "text": "----------------------------------" }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-97", "text": "**EXTRACTING SUBSTITUTES**" }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-98", "text": "Simplification candidates were collected using crowdsourcing techniques." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-99", "text": "For each complex word, five annotators wrote substitutes that did not change the sense of the sentence. 4 http://jhlee.sakura.ne.jp/JEV.html 5 A sahen noun is a kind of noun that can form a verb by adding the generic verb \"suru (do)\" to the noun." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-100", "text": "(e.g. \" repair\") 6 A sahen verb is a sahen noun accompanied by \"suru\"." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-101", "text": "(e.g. \"" }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-102", "text": "(do repair)\")" }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-103", "text": "Substitutes could include particles in context." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-104", "text": "Conjugation was allowed to cover variations of both verbs and adjectives." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-105", "text": "Figure 3 shows an example of annotation." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-106", "text": "To improve the quality of the lexical substitution, inappropriate substitutes were removed, as described in the next subsection."
}, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-107", "text": "----------------------------------" }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-108", "text": "**EVALUATING SUBSTITUTES**" }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-109", "text": "Five annotators selected the appropriate substitutes, i.e., those that did not change the sense of the sentence." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-110", "text": "Substitutes that won a majority were defined as correct." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-111", "text": "Figure 4 shows an example of annotation." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-112", "text": "Nine complex words that were evaluated as not having substitutes were excluded at this point." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-113", "text": "As a result, 2,010 sentences were annotated, as described in the next subsection." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-114", "text": "----------------------------------" }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-115", "text": "**RANKING SUBSTITUTES**" }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-116", "text": "Five annotators arranged substitutes and complex words according to the simplification ranking." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-117", "text": "Annotators were permitted to assign a tie, but they could select at most four items to be in a tie, because we intended to prevent insincere annotators from selecting a tie for all items." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-118", "text": "Figure 5 shows an example of annotation."
}, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-119", "text": "----------------------------------" }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-120", "text": "**INTEGRATING SIMPLIFICATION RANKING**" }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-121", "text": "Annotators' rankings were integrated into one ranking, using maximum likelihood estimation (Matsui et al., 2014) to penalize deceptive annotators, as was done by De Belder and Moens (2012) ." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-122", "text": "This method estimates the reliability of annotators in addition to determining the true order of rankings." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-123", "text": "We applied the reliability score to exclude extraordinary annotators." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-124", "text": "Table 1 shows the characteristics of our dataset." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-125", "text": "It is about the same size as the datasets of previous work (Specia et al., 2012; Kajiwara and Yamamoto, 2015) ." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-126", "text": "Our dataset has two advantages: (1) improved correlation with human judgment by making a controlled and balanced dataset, and (2) enhanced consistency by allowing ties in ranking and removing outlier annotators." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-127", "text": "In the following subsections, we evaluate our dataset in detail."
}, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-128", "text": "----------------------------------" }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-129", "text": "**RESULT**" }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-130", "text": "----------------------------------" }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-131", "text": "**INTRINSIC EVALUATION**" }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-132", "text": "To evaluate the quality of the ranking integration, the Spearman rank correlation coefficient was calculated." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-133", "text": "The baseline integration ranking used an average score (Kajiwara and Yamamoto, 2015) ." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-134", "text": "Our proposed method excludes outlier annotators by using a reliability score calculated using the method developed by Matsui et al. (2014) ." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-135", "text": "Pairwise agreement is calculated between each pair of sets (p 1 , p 2 \u2208 P ) from all the possible pairings (P) (Equation 1)." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-136", "text": "The agreement among annotators from the substitute evaluation phase was 0.669, and agreement among the students was 0.673, which is similar to the level found in crowdsourcing." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-137", "text": "This score is almost the same as that from Kajiwara and Yamamoto (2015) ." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-138", "text": "On the contrary, agreement in the substitute ranking task was higher. Table 3 : Details of sentences and substitutes in our dataset." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-139", "text": "(BCCWJ comprises three main subcorpora: publication (P), library (L), and special-purpose (O)."
}, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-140", "text": "PB = book, PM = magazine, PN = newswire, LB = book, OW = white paper, OT = textbook, OP = PR paper, OB = bestselling books, OC = Yahoo! Answers, OY = Yahoo! Blogs, OL = law, OM = magazine.) Table 2 : Correlation of ranking integration (Average: baseline 0.541, outlier removal 0.580)." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-142", "text": "This score is higher than that from Kajiwara and Yamamoto (2015) by 0.190." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-143", "text": "This clearly shows the importance of allowing ties during the substitute ranking task." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-144", "text": "Table 2 shows the results of the ranking integration." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-145", "text": "Our method achieved better accuracy in ranking integration than previous methods (Specia et al., 2012; Kajiwara and Yamamoto, 2015) and is similar to the results from De Belder and Moens (2012) ." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-146", "text": "This shows that the reliability score can be used to improve the quality of ranking integration." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-147", "text": "Table 3 shows the number of sentences and average substitutes in each genre." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-148", "text": "In our dataset, the number of acquired substitutes is 8,636 words and the average number of substitutes is 4.30 words per sentence." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-149", "text": "Figure 6 illustrates a part of our dataset." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-150", "text": "Substitutes that include particles are found in 75 contexts (3.7%)." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-151", "text": "If particles were not permitted in substitutes, we would obtain only two substitutes (4 and 7)."
}, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-152", "text": "By permitting substitutes to include particles, we are able to obtain 7 substitutes." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-153", "text": "In ranking substitutes, the Spearman rank correlation coefficient is 0.729, which is substantially higher than the crowdsourcing score." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-154", "text": "Thus, the annotation method needs to be considered carefully." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-155", "text": "----------------------------------" }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-156", "text": "**EXTRINSIC EVALUATION**" }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-157", "text": "In this section, we evaluate our dataset using five simple lexical simplification methods." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-158", "text": "We calculate the 1-best accuracy on our dataset and on the dataset of Kajiwara and Yamamoto (2015) ." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-159", "text": "Annotated data were collected in the substitute ranking tasks of our work and of Kajiwara and Yamamoto (2015), totaling 21,700 ((2010 + 2330) \u00d7 5) rankings." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-160", "text": "Then, we calculate the correlation between the accuracies on the annotated data and those on either the dataset of Kajiwara and Yamamoto (2015) or our dataset."
}, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-161", "text": "----------------------------------" }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-162", "text": "**LEXICAL SIMPLIFICATION SYSTEMS**" }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-163", "text": "We used several metrics for these experiments:" }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-164", "text": "Frequency Because a highly frequent word is said to be simple, the most frequent word is selected as the simplification candidate from the substitutes, using the unigram frequency of Japanese Web N-gram (Kudo and Kazawa, 2007) ." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-165", "text": "This unigram frequency is counted over two billion sentences of Japanese Web text." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-166", "text": "Aramaki et al. (2013) claimed that a word used by many people is simple, so we pick the word used by the largest number of users." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-167", "text": "The number of users was estimated from the Twitter corpus created by Aramaki et al. (2013) ." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-168", "text": "The corpus contains 250 million tweets from 100,000 users." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-169", "text": "Familiarity Assuming that a word known by many people is simple, we replace the target word with substitutes according to the familiarity score, using familiarity data constructed by Amano and Kondo (2000) ." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-170", "text": "The familiarity score is averaged over 28 annotators using seven grades."
}, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-171", "text": "----------------------------------" }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-172", "text": "**NUMBER OF USERS**" }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-173", "text": "JEV We hypothesized that a word with low difficulty for non-native speakers is simple, so we select words using a Japanese learner dictionary created by Sunakawa et al. (2012) ." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-174", "text": "Each word in the dictionary has a difficulty score averaged over five Japanese teachers' subjective annotations on a six-grade scale." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-175", "text": "JLPT Same as above, but uses a different source called the Japanese Language Proficiency Test (JLPT)." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-176", "text": "We choose the lowest-level word according to the JLPT levels." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-177", "text": "These levels are a scale of one to five." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-178", "text": "----------------------------------" }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-179", "text": "**EVALUATION**" }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-180", "text": "We ranked substitutes according to the metrics, and calculated the 1-best accuracy for each target word." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-181", "text": "Finally, to compare the two datasets, we computed the Pearson product-moment correlation coefficient of the accuracies on our dataset and on the dataset of Kajiwara and Yamamoto (2015) against those on the annotated data." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-182", "text": "Table 4 shows the result of this experiment."
}, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-183", "text": "The Pearson coefficient shows that our dataset correlates with human annotation better than the dataset of Kajiwara and Yamamoto (2015) , possibly because we controlled each sentence to include only one complex word." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-184", "text": "Because our dataset is balanced, the accuracies of the Web corpus-based metrics (Frequency and Number of Users) are closer to those on the annotated data than with the dataset of Kajiwara and Yamamoto (2015) ." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-185", "text": "----------------------------------" }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-186", "text": "**CONCLUSION**" }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-187", "text": "We have presented a new controlled and balanced dataset for the evaluation of Japanese lexical simplification." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-188", "text": "Experimental results show that (1) our dataset is more consistent than the previous datasets and (2) lexical simplification methods using our dataset correlate with human annotation better than the previous datasets." }, { "sent_id": "2d3ec2e77947cb23af773926ec917b-C001-189", "text": "Future work includes increasing the number of sentences, so as to leverage the dataset for machine learning-based simplification methods."
} ], "y": { "@DIF@": { "gold_contexts": [ [ "2d3ec2e77947cb23af773926ec917b-C001-12" ], [ "2d3ec2e77947cb23af773926ec917b-C001-66" ], [ "2d3ec2e77947cb23af773926ec917b-C001-142" ], [ "2d3ec2e77947cb23af773926ec917b-C001-145" ], [ "2d3ec2e77947cb23af773926ec917b-C001-183" ], [ "2d3ec2e77947cb23af773926ec917b-C001-184" ] ], "cite_sentences": [ "2d3ec2e77947cb23af773926ec917b-C001-12", "2d3ec2e77947cb23af773926ec917b-C001-66", "2d3ec2e77947cb23af773926ec917b-C001-142", "2d3ec2e77947cb23af773926ec917b-C001-145", "2d3ec2e77947cb23af773926ec917b-C001-183", "2d3ec2e77947cb23af773926ec917b-C001-184" ] }, "@USE@": { "gold_contexts": [ [ "2d3ec2e77947cb23af773926ec917b-C001-12" ], [ "2d3ec2e77947cb23af773926ec917b-C001-29" ], [ "2d3ec2e77947cb23af773926ec917b-C001-39" ], [ "2d3ec2e77947cb23af773926ec917b-C001-52" ], [ "2d3ec2e77947cb23af773926ec917b-C001-76" ], [ "2d3ec2e77947cb23af773926ec917b-C001-133" ], [ "2d3ec2e77947cb23af773926ec917b-C001-158" ], [ "2d3ec2e77947cb23af773926ec917b-C001-159" ], [ "2d3ec2e77947cb23af773926ec917b-C001-160" ], [ "2d3ec2e77947cb23af773926ec917b-C001-181" ] ], "cite_sentences": [ "2d3ec2e77947cb23af773926ec917b-C001-12", "2d3ec2e77947cb23af773926ec917b-C001-29", "2d3ec2e77947cb23af773926ec917b-C001-39", "2d3ec2e77947cb23af773926ec917b-C001-52", "2d3ec2e77947cb23af773926ec917b-C001-76", "2d3ec2e77947cb23af773926ec917b-C001-133", "2d3ec2e77947cb23af773926ec917b-C001-158", "2d3ec2e77947cb23af773926ec917b-C001-159", "2d3ec2e77947cb23af773926ec917b-C001-160", "2d3ec2e77947cb23af773926ec917b-C001-181" ] }, "@BACK@": { "gold_contexts": [ [ "2d3ec2e77947cb23af773926ec917b-C001-18" ], [ "2d3ec2e77947cb23af773926ec917b-C001-19" ], [ "2d3ec2e77947cb23af773926ec917b-C001-44" ], [ "2d3ec2e77947cb23af773926ec917b-C001-47" ], [ "2d3ec2e77947cb23af773926ec917b-C001-55" ] ], "cite_sentences": [ "2d3ec2e77947cb23af773926ec917b-C001-18", "2d3ec2e77947cb23af773926ec917b-C001-19", "2d3ec2e77947cb23af773926ec917b-C001-44", 
"2d3ec2e77947cb23af773926ec917b-C001-47", "2d3ec2e77947cb23af773926ec917b-C001-55" ] }, "@MOT@": { "gold_contexts": [ [ "2d3ec2e77947cb23af773926ec917b-C001-19" ], [ "2d3ec2e77947cb23af773926ec917b-C001-39", "2d3ec2e77947cb23af773926ec917b-C001-40", "2d3ec2e77947cb23af773926ec917b-C001-41" ], [ "2d3ec2e77947cb23af773926ec917b-C001-52", "2d3ec2e77947cb23af773926ec917b-C001-53", "2d3ec2e77947cb23af773926ec917b-C001-54" ], [ "2d3ec2e77947cb23af773926ec917b-C001-55" ], [ "2d3ec2e77947cb23af773926ec917b-C001-70" ] ], "cite_sentences": [ "2d3ec2e77947cb23af773926ec917b-C001-19", "2d3ec2e77947cb23af773926ec917b-C001-39", "2d3ec2e77947cb23af773926ec917b-C001-40", "2d3ec2e77947cb23af773926ec917b-C001-41", "2d3ec2e77947cb23af773926ec917b-C001-52", "2d3ec2e77947cb23af773926ec917b-C001-55", "2d3ec2e77947cb23af773926ec917b-C001-70" ] }, "@EXT@": { "gold_contexts": [ [ "2d3ec2e77947cb23af773926ec917b-C001-20" ], [ "2d3ec2e77947cb23af773926ec917b-C001-76" ], [ "2d3ec2e77947cb23af773926ec917b-C001-183" ] ], "cite_sentences": [ "2d3ec2e77947cb23af773926ec917b-C001-20", "2d3ec2e77947cb23af773926ec917b-C001-76", "2d3ec2e77947cb23af773926ec917b-C001-183" ] }, "@SIM@": { "gold_contexts": [ [ "2d3ec2e77947cb23af773926ec917b-C001-125" ] ], "cite_sentences": [ "2d3ec2e77947cb23af773926ec917b-C001-125" ] } } }, "ABC_6330c615d4c62b30933dac3057c9d6_3": { "x": [ { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-2", "text": "Generative models for text have substantially contributed to tasks like machine translation and language modeling, using maximum likelihood optimization (MLE)." }, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-3", "text": "However, for creative text generation, where multiple outputs are possible and originality and uniqueness are encouraged, MLE falls short." 
}, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-4", "text": "Methods optimized for MLE lead to outputs that can be generic, repetitive and incoherent." }, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-5", "text": "In this work, we use a Generative Adversarial Network framework to alleviate this problem." }, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-6", "text": "We evaluate our framework on poetry, lyrics and metaphor datasets, each with widely different characteristics, and report better performance of our objective function over other generative models." }, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-7", "text": "----------------------------------" }, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-8", "text": "**INTRODUCTION AND RELATED WORK**" }, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-9", "text": "Language models can be optimized to recognize syntax and semantics with great accuracy [1] ." }, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-10", "text": "However, the output generated can be repetitive and generic, leading to monotonous or uninteresting responses (e.g. \"I don't know\") regardless of the input [2] ." }, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-11", "text": "While application of attention [3, 4] and advanced decoding mechanisms like beam search and variation sampling [5] have shown improvements, they do not solve the underlying problem." }, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-12", "text": "In creative text generation, the objective is not strongly bound to the ground truth-instead the objective is to generate diverse, unique or original samples." }, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-13", "text": "We attempt to do this through a discriminator which can give feedback to the generative model through a cost function that encourages sampling of creative tokens."
}, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-14", "text": "The contributions of this paper are in the usage of a GAN framework to generate creative pieces of writing." }, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-15", "text": "Our experiments suggest that generative text models, while very good at encapsulating semantic, syntactic and domain information, perform better with external feedback from a discriminator for fine-tuning objectiveless decoding tasks like that of creative text." }, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-16", "text": "We show this by evaluating our model on three very different creative datasets containing poetry, metaphors and lyrics." }, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-17", "text": "Previous work on handling the shortcomings of MLE includes length-normalizing sentence probability [6] , future cost estimation [7] , diversity-boosting objective functions [8, 2] and penalizing repeating tokens [9] ." }, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-18", "text": "When it comes to poetry generation using generative text models, Zhang and Lapata [10] , Yi et al. [11] and Wang et al. [12] use language modeling to generate Chinese poems." }, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-19", "text": "However, none of these methods provide feedback on the quality of the generated sample and hence, do not address the qualitative objective required for creative decoding." }, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-20", "text": "For the task of text generation, MaskGAN [13] uses a Reinforcement Learning signal from the discriminator, FMD-GAN [14] uses an optimal transport mechanism as an objective function." }, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-21", "text": "GumbelGAN [15] uses a Gumbel-Softmax distribution that replaces the non-differentiable sample from a categorical distribution with a differentiable sample to propagate stronger gradients."
}, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-22", "text": "Li et al. [2] use a discriminator for a diversity promoting objective." }, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-23", "text": "Yu et al. [16] use SeqGAN to generate poetry and comment on the performance of SeqGAN over MLE in human evaluations, encouraging our study of GANs for creative text generation." }, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-24", "text": "However, these studies do not focus solely on creative text." }, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-25", "text": "Using GANs, we can train generative models in a two-player game setting between a discriminator and a generator, where the discriminator (a binary classifier) learns to distinguish between real and fake data samples and the generator tries to fool the discriminator by generating authentic and high quality output [17] ." }, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-26", "text": "GANs have shown to be successful in image generation tasks [18] and recently, some progress has been observed in text generation [14, 13, 16] ." }, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-27", "text": "Our generator is a language model trained using backpropagation through time [19] ." }, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-28", "text": "During the pre-training phase we optimize for MLE and during the GAN training phase, we optimize on the creativity reward from the discriminator." }, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-29", "text": "The discriminator's encoder has the same architecture as the generator encoder module with the addition of a pooled decoder layer." }, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-30", "text": "The decoder contains 3 [DenseBatchNormalization, ReLU] blocks and an additional Sigmoid layer."
}, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-31", "text": "The discriminator decoder takes the hidden state at the last time step of a sequence concatenated with both the max-pooled and mean-pooled representation of the hidden states [20] and outputs a number in the range [0, 1]." }, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-32", "text": "The difficulty of using GANs in text generation comes from the discrete nature of text, making the model non-differentiable; hence, we update parameters for the generator model with policy gradients as described in Yu et al. [16] ." }, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-33", "text": "We utilize AWD-LSTM [21] and TransformerXL [22] based language models." }, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-34", "text": "For model hyperparameters please refer to Supplementary Section Table 2 ." }, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-35", "text": "We use the Adam optimizer [23] with \u03b21 = 0.7 and \u03b22 = 0.8 similar to [20] and use a batch size of 50." }, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-36", "text": "Other practices for LM training were the same as [22] and [21] for Transformer-XL and AWD-LSTM respectively." }, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-37", "text": "We refer to our proposed GAN as Creative-GAN and compare it to a baseline (a language model equivalent to our pre-trained generator) and a GumbelGAN model [15] across all proposed datasets." }, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-38", "text": "We use three creative English datasets with distinct linguistic characteristics: (1) a corpus of 740 classical and contemporary English poems, (2) a corpus of 14,950 metaphor sentences retrieved from a metaphor database website 1 and (3) a corpus of 1,500 song lyrics ranging across genres."
}, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-39", "text": "The mix of linguistic styles within this corpus offers the potential for interesting variation during the generation phase." }, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-40", "text": "We use the same pre-processing as in earlier work [20, 24] ." }, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-41", "text": "We reserve 10% of our data for the test set and another 10% for our validation set." }, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-42", "text": "We first pre-train our generator on the Gutenberg dataset [25] for 20 epochs and then fine-tune [20] them to our target datasets with a language modeling objective." }, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-43", "text": "The discriminator's encoder is initialized to the same weights as our fine-tuned language model." }, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-44", "text": "Once we have our fine-tuned encoders for each target dataset, we train in an adversarial manner." }, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-45", "text": "The discriminator objective here is to score the quality of the creative text." }, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-46", "text": "The discriminator is trained for 3 iterations for every iteration of the generator, a practice seen in previous work [26] ." }, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-47", "text": "Creative-GAN relies on using the reward from the discriminator [13, 16] for backpropagation." }, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-48", "text": "We follow a similar training procedure for GumbelGAN." }, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-49", "text": "Outputs are generated through sampling over a multinomial distribution for all methods, instead of argmax on the log-likelihood probabilities, as sampling has been shown to produce better output quality [5] . Please refer to Supplementary Section Table 3 for training parameters of each dataset and Table 2 for hyperparameters of each encoder." }, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-50", "text": "We pick these values after experimentation with our validation set." }, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-51", "text": "Training and output generation code can be found online 2 ." }, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-52", "text": "----------------------------------" }, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-53", "text": "**EVALUATION AND CONCLUSION**" }, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-54", "text": "Evaluating creative generation tasks is both critical and complex [27] ." }, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-55", "text": "Along the lines of previous research on evaluating text generation tasks [27] , we report the perplexity scores of our test set on the evaluated models in the Supplementary Section, Table 1. Our model shows improvements over baseline and GumbelGAN." }, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-56", "text": "Common computational methods like BLEU [28] and perplexity are at best a heuristic and not strong indicators of good performance in text generation models [29] ." }, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-57", "text": "Particularly, since these scores use target sequences as a reference, it has the same pitfalls as relying on MLE." }, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-58", "text": "The advantages in this approach lie in the discriminator's ability to influence the generator to explore other possibilities." }, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-59", "text": "Sample outputs for our model can be found on our website 3 ."
}, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-60", "text": "----------------------------------" }, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-61", "text": "**SUPPLEMENTARY MATERIAL**" }, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-62", "text": "In this section, we report our results on computational metrics, hyperparameters and training configurations for our models." }, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-63", "text": "Table 1 shows the results of the perplexity score evaluation of the evaluated models, Table 2 shows hyperparameters for each encoding method and Table 3 shows our training parameters." }, { "sent_id": "6330c615d4c62b30933dac3057c9d6-C001-64", "text": "In Table 3 Table 3 : Training Parameters" } ], "y": { "@USE@": { "gold_contexts": [ [ "6330c615d4c62b30933dac3057c9d6-C001-31" ], [ "6330c615d4c62b30933dac3057c9d6-C001-40" ], [ "6330c615d4c62b30933dac3057c9d6-C001-42" ] ], "cite_sentences": [ "6330c615d4c62b30933dac3057c9d6-C001-31", "6330c615d4c62b30933dac3057c9d6-C001-40", "6330c615d4c62b30933dac3057c9d6-C001-42" ] }, "@SIM@": { "gold_contexts": [ [ "6330c615d4c62b30933dac3057c9d6-C001-35" ] ], "cite_sentences": [ "6330c615d4c62b30933dac3057c9d6-C001-35" ] } } }, "ABC_491879c73f8aa9f11bfae01abc795d_3": { "x": [ { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-2", "text": "Verb-noun idiomatic combinations (VNICs) are idioms consisting of a verb with a noun in its direct object position." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-91", "text": "We there-8 This is equivalent to macro-averaged recall." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-3", "text": "Usages of these expressions can be ambiguous between an idiomatic usage and a literal combination." 
}, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-4", "text": "In this paper we propose supervised and unsupervised approaches, based on word embeddings, to identifying token instances of VNICs." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-5", "text": "Our proposed supervised and unsupervised approaches perform better than the supervised and unsupervised approaches of Fazly et al. (2009) , respectively." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-6", "text": "----------------------------------" }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-7", "text": "**VERB-NOUN IDIOMATIC COMBINATIONS**" }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-8", "text": "Much research on multiword expressions (MWEs) in natural language processing (NLP) has focused on various type-level prediction tasks, e.g., MWE extraction (e.g., Church and Hanks, 1990; Smadja, 1993; Lin, 1999 ) -i.e., determining which MWE types are present in a given corpus Kim, 2010) -and compositionality prediction (e.g., McCarthy et al., 2003; Reddy et al., 2011; Salehi et al., 2014) ." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-9", "text": "However, word combinations can be ambiguous between literal combinations and MWEs." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-10", "text": "For example, consider the following two usages of the expression hit the roof :" }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-11", "text": "1. I think Paula might hit the roof if you start ironing." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-12", "text": "2. When the blood hit the roof of the car I realised it was serious." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-13", "text": "The first example of hit the roof is an idiomatic usage, while the second is a literal combination." 
}, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-14", "text": "1 MWE identification is the task of determining which token instances in running text are MWEs (Baldwin and Kim, 2010) ." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-15", "text": "Although there has been relatively less work on MWE identification than other type-level MWE prediction tasks, it is nevertheless important for NLP applications such as machine translation that must be able to distinguish MWEs from literal combinations in context." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-16", "text": "Some recent work has focused on token-level identification of a wide range of types of MWEs and other multiword units (e.g., Newman et al., 2012; Schneider et al., 2014; Brooke et al., 2014) ." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-17", "text": "Many studies, however, have taken a word sense disambiguation-inspired approach to MWE identification (e.g., Birke and Sarkar, 2006; Katz and Giesbrecht, 2006; Li et al., 2010) , treating literal combinations and MWEs as different word senses, and have exploited linguistic knowledge of MWEs (e.g., Patrick and Fletcher, 2005; Uchiyama et al., 2005; Hashimoto and Kawahara, 2008; Fazly et al., 2009; Fothergill and Baldwin, 2012) ." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-18", "text": "In this study we focus on English verb-noun idiomatic combinations (VNICs)." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-19", "text": "VNICs are formed from a verb with a noun in its direct object position." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-20", "text": "They are a common and productive type of English idiom, and occur cross-lingually (Fazly et al., 2009) ." 
}, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-21", "text": "VNICs tend to be relatively lexico-syntactically fixed, e.g., whereas hit the roof is ambiguous between literal and idiomatic meanings, hit the roofs and a roof was hit are most likely to be literal usages." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-22", "text": "Fazly et al. (2009) exploit this property in their unsupervised approach, referred to as CFORM." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-23", "text": "They define lexico-syntactic patterns for VNIC token instances based on the noun's determiner (e.g., a, the, or possibly no determiner), the number of the noun (singular or plural), and the verb's voice (active or passive)." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-24", "text": "They propose a statistical method for automatically determining a given VNIC type's canonical idiomatic form, based on the frequency of its usage in these patterns in a corpus." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-25", "text": "2 They then classify a given token instance of a VNIC as idiomatic if it occurs in its canonical form, and as literal otherwise." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-26", "text": "Fazly et al. also consider a supervised approach that classifies a given VNIC instance based on the similarity of its context to that of idiomatic and literal instances of the same expression seen during training." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-27", "text": "Distributed representations of word meaning in the form of word embeddings (Mikolov et al., 2013) have recently been demonstrated to benefit a wide range of NLP tasks including POS tagging (e.g., Ling et al., 2015) , question answering (e.g., Dong et al., 2015) , and machine translation (e.g., Zou et al., 2013) ." 
}, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-28", "text": "Moreover, word embeddings have been shown to improve over count-based models of distributional similarity for predicting MWE compositionality (Salehi et al., 2015) ." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-29", "text": "In this work we first propose a supervised approach to identifying VNIC token instances based on word embeddings that outperforms the supervised method of Fazly et al. (2009) ." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-30", "text": "We then propose an unsupervised approach to this task, that combines word embeddings with Fazly et al.'s unsupervised CFORM approach, that improves over CFORM." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-31", "text": "----------------------------------" }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-32", "text": "**MODELS FOR VNIC IDENTIFICATION BASED ON WORD EMBEDDINGS**" }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-33", "text": "The following subsections propose supervised and unsupervised approaches to VNIC identification based on word embeddings." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-34", "text": "----------------------------------" }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-35", "text": "**SUPERVISED VNIC IDENTIFICATION**" }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-36", "text": "For the proposed supervised approach, we first extract features based on word embeddings from word2vec representing a token instance of a VNIC in context, and then use these representations of VNIC tokens to train a supervised classifier." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-37", "text": "We first form a vector e representing a given VNIC token at the type level." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-38", "text": "e is formed by averaging the embeddings of the lemmatized component words forming the VNIC." 
}, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-39", "text": "We then form a vector c representing the context of the VNIC token instance." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-40", "text": "MWEs, including VNICs, can be discontiguous." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-41", "text": "We therefore form two vectors, c verb and c noun , representing the context of the verb and noun components, respectively, of the VNIC instance, and then average 2 In some cases a VNIC may have a small number of canonical forms, as opposed to just one." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-42", "text": "Original text: You can see the stars, now, in the city" }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-43", "text": "where k is the window size that the word2vec model was trained on, and w j t is the embedding of the word in position t of the input sentence relative to the jth component of the MWE (i.e., either the verb or noun)." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-44", "text": "In forming c verb and c noun the other component token of the VNIC is not considered part of the context." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-45", "text": "The summation is done over the same window size that the word2vec model was trained on so that c j captures the same information that the word2vec model has learned to capture." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-46", "text": "After computing c verb and c noun these vectors are averaged to form c. Figure 1 shows the process for forming c for an example sentence." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-47", "text": "Finally, to form the feature vector representing a VNIC instance, we subtract e from c, and append to this vector a single binary feature representing whether the VNIC instance occurs in its canonical form, as determined by Fazly et al. (2009) ." 
}, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-48", "text": "The feature vectors are then used to train a supervised classifier; in our experiments we use the linear SVM implementation from Pedregosa et al. (2011) ." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-49", "text": "The motivation for the subtraction is to capture the difference between the context in which a VNIC instance occurs ( c) and a type-level representation of that expression ( e), to potentially represent VNIC instances such that the classifier is able to generalize across expressions (i.e., to generalize to MWE types that are unseen during training)." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-50", "text": "The canonical form feature is included because it is known to be highly informative as to whether an instance is idiomatic or literal." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-51", "text": "----------------------------------" }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-52", "text": "**UNSUPERVISED VNIC IDENTIFICATION**" }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-53", "text": "Our unsupervised approach combines the word embedding-based representation used in the supervised approach (without relying on training a supervised classifier, of course) with the unsupervised CFORM method of Fazly et al. (2009) ." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-54", "text": "In this approach, we first represent each token instance of a given VNIC type as a feature vector, using the same representation as in Section 2.1." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-55", "text": "3 We then apply k-means clustering to form k clusters of the token instances." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-56", "text": "4 All instances in each cluster are then assigned a single class, idiomatic or literal, depending on whether the majority of token instances in a cluster are in that VNIC's canonical form or not, respectively." 
}, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-57", "text": "In the case of ties the method backs off to a most-frequent class (idiomatic) baseline." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-58", "text": "This method is unsupervised in that it does not rely on any gold standard labels." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-59", "text": "----------------------------------" }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-60", "text": "**MATERIALS AND METHODS**" }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-61", "text": "In this section we describe training details for the word embeddings and the dataset used for evaluation." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-62", "text": "----------------------------------" }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-63", "text": "**WORD EMBEDDINGS**" }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-64", "text": "The word embeddings required by our proposed methods were trained using the gensim 5 implementation of the skip gram version of word2vec (Mikolov et al., 2013) ." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-65", "text": "The model was trained on a snapshot of English Wikipedia from 1 September 2015." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-66", "text": "The text was pre-processed using wp2txt 6 to remove markup, and then tokenized with the Stanford tokenizer (Manning et al., 2014) ." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-67", "text": "Tokens occurring less than 15 times were removed, and the negative sampling parameter was set to 5." 
}, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-68", "text": "----------------------------------" }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-69", "text": "**VNC-TOKENS DATASET**" }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-70", "text": "The VNC-Tokens dataset (Cook et al., 2008) contains instances of 53 VNIC types -drawn from the British National Corpus (Burnard, 2007)" }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-71", "text": "----------------------------------" }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-72", "text": "**EXPERIMENTAL RESULTS**" }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-73", "text": "In the following subsections we describe the results of experiments using our supervised approach, the ability of this method to generalize across MWE types, and finally the results of the unsupervised approach." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-74", "text": "----------------------------------" }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-75", "text": "**SUPERVISED RESULTS**" }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-76", "text": "Following Fazly et al. (2009) , the supervised approach was evaluated using a leave-one-token-out strategy." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-77", "text": "That is, for each MWE, a single token instance is held out, and the classifier is trained on the remaining instances." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-78", "text": "The trained model is then used to classify the held out instance." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-79", "text": "This is Table 1 for a variety of settings of window size and number of dimensions for the word embeddings." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-80", "text": "The results reveal the general trend that smaller window sizes, and more dimensions, tend to give higher accuracy, although the overall amount of variation is relatively small." 
}, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-81", "text": "The accuracy on DEV and TEST ranges from 85.5%-88.2% and 83.4%-88.3%, respectively." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-82", "text": "All of these accuracies are higher than those reported by Fazly et al. (2009) for their supervised approach." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-83", "text": "They are also substantially higher than the most-frequent class baseline, and the unsupervised CFORM method of Fazly et al." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-84", "text": "That a window size of just 1 performs well is interesting." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-85", "text": "A word2vec model with a smaller window size gives more syntactically-oriented word embeddings, whereas a larger window size gives more semantically-oriented embeddings (Trask et al., 2015) ." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-86", "text": "The CFORM method of Fazly et al. (2009 ) is a strong unsupervised benchmark for this task, and relies on the lexico-syntactic pattern in which an MWE token instance occurs." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-87", "text": "A smaller window size for the word embedding features might be better able to capture similar information to CFORM, which could explain the good performance of the model using a window size of 1." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-88", "text": "----------------------------------" }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-89", "text": "**GENERALIZATION TO UNSEEN VNICS**" }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-90", "text": "We do not expect to have substantial amounts of annotated training data for every VNIC." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-92", "text": "fore further consider whether the supervised approach is able to generalize to MWE types that are unseen during training." 
}, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-93", "text": "Indeed, this scenario motivated the choice of representation of VNIC token instances in Section 2.1." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-94", "text": "In these experiments we perform a leave-one-type-out evaluation." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-95", "text": "In this case, all token instances for a single MWE type are held out, and the token instances of the remaining MWE types (limited to those within either DEV or TEST) are used to train a classifier." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-96", "text": "The classifier is then used to classify the token instances of the held out MWE type." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-97", "text": "This process is repeated until all instances of all MWE types have been classified." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-98", "text": "For these experiments we consider the setup that performed best on average over DEV and TEST in the previous experiments (i.e., a window size of 1 and 300 dimensional vectors)." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-99", "text": "The macroaveraged accuracy on DEV and TEST is 68.9% and 69.4%, respectively." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-100", "text": "Although this is a substantial improvement over the most-frequent class baseline, it is well-below the accuracy for the previously-considered leave-one-token-out setup." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-101", "text": "Moreover, the unsupervised CFORM method of Fazly et al. (2009) gives substantially higher accuracies than this supervised approach." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-102", "text": "The limited ability of this model to generalize to unseen MWE types further motivates exploring unsupervised approaches to this task." 
}, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-103", "text": "----------------------------------" }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-104", "text": "**UNSUPERVISED RESULTS**" }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-105", "text": "The k-means clustering for the unsupervised approach is repeated 100 times with randomlyselected initial centroids, for several values of k. The average accuracy and standard deviation of the unsupervised approach over these 100 runs are shown in the left panel of Table 2 ." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-106", "text": "For k = 4 and 5 on TEST, this approach surpasses the unsupervised CFORM method of Fazly et al. (2009) ; however, on DEV this approach does not outperform Fazly et al.'s CFORM approach for any of the val-ues of k considered." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-107", "text": "Analyzing the results on individual expressions indicates that the unsupervised approach gives especially low accuracy for hit roof -which is in DEV-as compared to the CFORM method of Fazly et al., which could contribute to the overall lower accuracy of the unsupervised approach on this dataset." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-108", "text": "We now consider the upperbound of an unsupervised approach that selects a single label for each cluster of usages." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-109", "text": "In the right panel of Table 2 we show results for an oracle approach that always selects the best label for each cluster." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-110", "text": "In this case, as the number of clusters increases, so too will the accuracy." 
}, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-111", "text": "9 Nevertheless, these results show that, even for relatively small values of k, there is scope for improving the proposed unsupervised method through improved methods for selecting the label for each cluster, and that the performance of such a method could potentially come close to that of the supervised approach." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-112", "text": "A word's predominant sense is known to be a powerful baseline in word-sense disambiguation, and prior work has addressed automatically identifying predominant word senses (McCarthy et al., 2007; Lau et al., 2014) ." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-113", "text": "The findings here suggest that methods for determining whether a set of usages of a VNIC are predominantly literal or idiomatic could be leveraged to give further improvements in unsupervised VNIC identification." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-114", "text": "----------------------------------" }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-115", "text": "**CONCLUSIONS**" }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-116", "text": "In this paper we proposed supervised and unsupervised approaches, based on word embeddings, to identifying token instances of VNICs that performed better than the supervised approach, and unsupervised CFORM approach, of Fazly et al. (2009) , respectively." }, { "sent_id": "491879c73f8aa9f11bfae01abc795d-C001-117", "text": "In future work we intend to consider methods for determining the predominant \"sense\" (i.e., idiomatic or literal) of a set of usages of a VNIC, in an effort to further improve unsupervised VNIC identification." 
} ], "y": { "@DIF@": { "gold_contexts": [ [ "491879c73f8aa9f11bfae01abc795d-C001-5" ], [ "491879c73f8aa9f11bfae01abc795d-C001-29" ], [ "491879c73f8aa9f11bfae01abc795d-C001-30" ], [ "491879c73f8aa9f11bfae01abc795d-C001-82" ], [ "491879c73f8aa9f11bfae01abc795d-C001-83" ], [ "491879c73f8aa9f11bfae01abc795d-C001-101" ], [ "491879c73f8aa9f11bfae01abc795d-C001-106" ], [ "491879c73f8aa9f11bfae01abc795d-C001-107" ], [ "491879c73f8aa9f11bfae01abc795d-C001-116" ] ], "cite_sentences": [ "491879c73f8aa9f11bfae01abc795d-C001-5", "491879c73f8aa9f11bfae01abc795d-C001-29", "491879c73f8aa9f11bfae01abc795d-C001-30", "491879c73f8aa9f11bfae01abc795d-C001-82", "491879c73f8aa9f11bfae01abc795d-C001-83", "491879c73f8aa9f11bfae01abc795d-C001-101", "491879c73f8aa9f11bfae01abc795d-C001-106", "491879c73f8aa9f11bfae01abc795d-C001-107", "491879c73f8aa9f11bfae01abc795d-C001-116" ] }, "@BACK@": { "gold_contexts": [ [ "491879c73f8aa9f11bfae01abc795d-C001-17" ], [ "491879c73f8aa9f11bfae01abc795d-C001-20" ], [ "491879c73f8aa9f11bfae01abc795d-C001-22" ], [ "491879c73f8aa9f11bfae01abc795d-C001-23" ], [ "491879c73f8aa9f11bfae01abc795d-C001-24" ], [ "491879c73f8aa9f11bfae01abc795d-C001-25" ], [ "491879c73f8aa9f11bfae01abc795d-C001-26" ], [ "491879c73f8aa9f11bfae01abc795d-C001-86" ] ], "cite_sentences": [ "491879c73f8aa9f11bfae01abc795d-C001-17", "491879c73f8aa9f11bfae01abc795d-C001-20", "491879c73f8aa9f11bfae01abc795d-C001-22", "491879c73f8aa9f11bfae01abc795d-C001-23", "491879c73f8aa9f11bfae01abc795d-C001-24", "491879c73f8aa9f11bfae01abc795d-C001-25", "491879c73f8aa9f11bfae01abc795d-C001-26", "491879c73f8aa9f11bfae01abc795d-C001-86" ] }, "@USE@": { "gold_contexts": [ [ "491879c73f8aa9f11bfae01abc795d-C001-30" ], [ "491879c73f8aa9f11bfae01abc795d-C001-47" ], [ "491879c73f8aa9f11bfae01abc795d-C001-53" ], [ "491879c73f8aa9f11bfae01abc795d-C001-76" ] ], "cite_sentences": [ "491879c73f8aa9f11bfae01abc795d-C001-30", "491879c73f8aa9f11bfae01abc795d-C001-47", "491879c73f8aa9f11bfae01abc795d-C001-53", 
"491879c73f8aa9f11bfae01abc795d-C001-76" ] }, "@SIM@": { "gold_contexts": [ [ "491879c73f8aa9f11bfae01abc795d-C001-87" ] ], "cite_sentences": [ "491879c73f8aa9f11bfae01abc795d-C001-87" ] } } }, "ABC_f7e80cf0a6724675cab2825cbf7e10_3": { "x": [ { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-2", "text": "Word embeddings have been shown to be useful across state-of-the-art systems in many natural language processing tasks, ranging from question answering systems to dependency parsing." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-3", "text": "(Herbelot and Vecchi, 2015) explored word embeddings and their utility for modeling language semantics." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-4", "text": "In particular, they presented an approach to automatically map a standard distributional semantic space onto a set-theoretic model using partial least squares regression." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-5", "text": "We show in this paper that a simple baseline achieves a +51% relative improvement compared to their model on one of the two datasets they used, and yields competitive results on the second dataset." 
}, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-6", "text": "----------------------------------" }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-8", "text": "Word embeddings are one of the main components in many state-of-the-art systems for natural language processing (NLP), such as language modeling (Mikolov et al., 2010 ), text classification (Socher et al., 2013; Kim, 2014; Blunsom et al., 2014; , question answering (Weston et al., 2015; Wang and Nyberg, 2015) , machine translation (Bahdanau et al., 2014; Tamura et al., 2014; Sundermeyer et al., 2014) , as well as named entity recognition (Collobert et al., 2011; Lample et al., 2016; Labeau et al., 2015) ." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-9", "text": "Word embeddings can be pre-trained using large unlabeled datasets typically based on token cooccurrences (Mikolov et al., 2013; Collobert et al., 2011; Pennington et al., 2014) ." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-10", "text": "They can also be jointly learned with the task." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-11", "text": "Understanding what information word embeddings contain is subsequently of high interest." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-12", "text": "(Herbelot and Vecchi, 2015) investigated a method to map word embeddings to formal semantics, which is the center of interest of this paper." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-13", "text": "Specifically, given a feature and a word vector of a concept, they tried to automatically find how often the given concept has the given feature." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-14", "text": "For example, the concept yam is always a vegetable, the concept cat has a coat most of the time, the concept plug has sometimes 3 prongs, and the concept dog never has wings." 
}, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-15", "text": "The method they used was based on partial least squares regression (PLSR)." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-16", "text": "We propose a simple baseline that outperforms their model." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-17", "text": "----------------------------------" }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-18", "text": "**TASK**" }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-19", "text": "In this section, we summarize the task presented in (Herbelot and Vecchi, 2015) ." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-20", "text": "The following is an example of a concept along with some of its features, as formatted in one of the two datasets used to evaluate the model: yam a vegetable all all all yam eaten by cooking all most most yam grows in the ground all all all yam is edible all most all yam is orange some most most yam like a potato all all all The concept yam has six features (a vegetable, eaten by cooking, grows in the ground, is edible, is orange, and like a potato)." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-21", "text": "Each feature in this dataset is annotated by three different humans." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-22", "text": "The annotation is a quantifier that reflects how frequently the concept has a feature." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-23", "text": "Five quantifiers are used: no, few, some, most, and all." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-24", "text": "In this example, the concept yam has been annotated as some, most and most for the feature is orange." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-25", "text": "Each of the five quantifiers is converted into a numerical format with the following (somehow arbitrary) mapping: no \u2192 0; few \u2192 0.05; some \u2192 0.35; most \u2192 0.95; all \u2192 1." 
}, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-26", "text": "The value is averaged over the three annotators." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-27", "text": "Using this mapping, we can map a concept into a \"model-theoretic vector\" (also called feature vector)." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-28", "text": "If a feature has not been annotated for a concept, then the element in the model-theoretic vector corresponding to the feature will have value 0." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-29", "text": "As a result, any element of a model-theoretic vector that has value 0 may correspond to a feature that has either been annotated as no by the three annotators, or not been annotated (presumed no)." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-30", "text": "Given that there can be many features and it is possible that only some of them are annotated for each concept, the model-theoretic vector may be quite sparse." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-31", "text": "In the yam example, if we only included features annotated with yam, the model-theoretic vector would be as follows:" }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-32", "text": "The additional coordinates corresponding to all the remaining features would be zero." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-33", "text": "Each concept word will have a vector of the same dimension (number of unique features) in the same dataset." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-34", "text": "The coordinates mean the same from one concept to another." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-35", "text": "For example, the feature is vegetable appears in the same coordinate position in all the vectors." 
}, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-36", "text": "----------------------------------" }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-37", "text": "**DATASETS**" }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-38", "text": "Two datasets are used:" }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-39", "text": "\u2022 The Animal Dataset (AD) (Herbelot, 2013) contains 73 concepts and 54 features." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-40", "text": "All concepts are animals, and for each concept all features are annotated by 1 human annotator." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-41", "text": "There are 3942 annotated pairs of conceptfeature (73 * 54 = 3942)." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-42", "text": "The dimension of the model-theoretic vectors will therefore be 54." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-43", "text": "\u2022 TheMcRae norms (QMR) (McRae et al., 2005) contains 541 concepts covering living and nonliving entities (e.g., alligator, chair, accordion), as well as 2201 features." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-44", "text": "One concept is annotated with 11.4 features on average by 3 human annotators." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-45", "text": "There are 6187 annotated pairs of concept-feature (541 * 11.4 \u2248 6187)." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-46", "text": "The dimension of the model-theoretic vectors will therefore be 2201, and each model-theoretic vector will have on average 2201 \u2212 11.4 = 2189.6 elements set to 0 due to unannotated features." 
}, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-47", "text": "----------------------------------" }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-48", "text": "**MODEL**" }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-49", "text": "In the previous section, we have seen how to convert a concept into a model-theoretic vector based on human annotations." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-50", "text": "The goal of (Herbelot and Vecchi, 2015) is to analyze whether there exists a transformation from the word embedding of a concept to its model-theoretic vector, the gold standard being the human annotations." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-51", "text": "The word embeddings are taken from the word embeddings pre-trained with word2vec GoogleNews-vectors-negative300 1 (300 dimensions), which were trained on part of the Google News dataset, consisting of approximately 100 billion words." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-52", "text": "The transformation used in (Herbelot and Vecchi, 2015) is based on Partial Least Squares Regression (PLSR)." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-53", "text": "The PLSR is fitted on the training set: the inputs are the word embeddings for each concept, and the outputs are the model-theoretic vectors for each concept." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-54", "text": "To assess the quality of the predictions, the Spearman rank-order correlation coefficient is computed between the predictions and the gold modeltheoretic vectors, ignoring all features for which a concept has not been annotated." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-55", "text": "The idea is that some of the features might be present but not given as options during annotation." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-56", "text": "The method should therefore not be penalized for not suggesting them." 
}, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-57", "text": "Figure 1 illustrates the model." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-58", "text": "(Herbelot and Vecchi, 2015) 's system." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-59", "text": "The word embedding of a concept is transformed to a modeltheoric vector via a PLSR." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-60", "text": "The quality of the predicted model-theoric vector is assessed with the Spearman rank-order correlation coefficient between the predictions and the gold model-theoretic vectors." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-61", "text": "Note that some of the elements that equal 0 in the gold model-theoretic vector may correspond to features that are not annotated for the concept." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-62", "text": "Such features are omitted when evaluating the Spearman rank-order correlation coefficient." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-63", "text": "Also, the dimension of the model-theoretic vectors could be larger or smaller than the dimension of the word embedding." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-64", "text": "Since the word embeddings we use have 300 dimensions, the model-theoretic vectors will be smaller than the word embeddings in the AD dataset, and larger in the QMR dataset." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-65", "text": "----------------------------------" }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-66", "text": "**EXPERIMENTS**" }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-67", "text": "We compare (Herbelot and Vecchi, 2015) 's model (PLSR + word2vec) against three baselines: random vectors, mode, and nearest neighbor." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-68", "text": "\u2022 Mode: A predictor that outputs, for each feature, the most common feature value (i.e., the mode) in the training set." 
}, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-69", "text": "For example, if a feature is annotated as all for most concepts, then the predictor will always output all for this feature." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-70", "text": "When finding the most common value of a feature, we ignore all the concepts for which the feature is not annotated." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-71", "text": "The resulting predictor does not take any concept into account when making a prediction." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-72", "text": "Indeed, the predicted values are always the same, regardless of the concept." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-73", "text": "If a feature has the same value for most concepts, the predictor may perform reasonably well." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-74", "text": "\u2022 Nearest neighbor (NN): A predictor that outputs for any concept the model-theoretic vector from the training set corresponding to the most similar concept in the training set." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-75", "text": "Similarity is based on the cosine similarity of the word vectors." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-76", "text": "This is a simple nearest neighbor predictor." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-77", "text": "\u2022 Random vectors: (Herbelot and Vecchi, 2015) used pre-trained word embeddings as input to the PLSR, we instead simply use random vectors of same dimension (300, continuous uniform distribution between 0 and 1)." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-78", "text": "We also apply retrofitting (Faruqui et al., 2014) on the word embeddings in order to leverage relational information from semantic lexicons by encouraging linked words to have similar vector representations." 
}, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-79", "text": "Using (Faruqui et al., 2014 )'s retrofitting tool 2 , we retrofit the word embeddings (GoogleNews-vectorsnegative300) on each of the 4 datasets present in the retrofitting tool (framenet, ppdb-xl, wordnetsynonyms+, and wordnet-synonyms." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-80", "text": "(Herbelot and Vecchi, 2015) ) in the last row." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-81", "text": "PLSR stands for partial least squares regression, NN for nearest neighbor, ppdb for the Paraphrase Database (Ganitkevitch et al., 2013) ." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-82", "text": "There are two ways to compute the mode: either taking the mode of the means of the 3 annotations (mode), or the mode for all annotations (true-mode)." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-83", "text": "QMR has 3 potentially different annotations for each concept-feature pair, while AD has 3 only one annotation for each concept-feature pair: as a result, mode and true-mode have similar results for AD, but potentially different results for QMR." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-84", "text": "For each run, a train/test split was randomly chosen (60 training samples for AD, 400 for QMR, in order to have the same number of training samples as in (Herbelot and Vecchi, 2015) 's Table 2 )." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-85", "text": "----------------------------------" }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-86", "text": "**AD**" }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-87", "text": "6 Results and discussion Table 1 presents the results, using the Spearman correlation as the performance metric." 
}, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-88", "text": "The experiment was coded in Python using scikit-learn (Pedregosa et al., 2011) and the source as well as the complete result log and the two datasets are available online 3 ." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-89", "text": "We could reproduce the results for the QMR dataset using PLSR and word2vec embeddings (0.346 in (Herbelot and Vecchi, 2015) vs. 0.332 in our experiments, but we could not ex-3 https://github.com/Franck-Dernoncourt/ model-theoretic actly reproduce the results for the AD dataset (0.634 in (Herbelot and Vecchi, 2015) vs. 0.572 in our experiments): this discrepancy most likely results from the choice of the training set." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-90", "text": "Our experiments' results are averaged over 1000 runs, and for each run the training/test split is randomly chosen, the only constraint being having the same number of training samples as in (Herbelot and Vecchi, 2015) ." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-91", "text": "For the AD dataset, our worst run achieved 0.435, and our best run achieved 0.713, which emphasizes the lack of robustness of the results with respect to the train/test split." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-92", "text": "QMR dataset (min: 0.244; max: 0.407), which is expected since QMR is significantly larger than AD." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-93", "text": "Furthermore, the mode baseline yields results that are good on the AD dataset (0.554, vs. 0.634 in (Herbelot and Vecchi, 2015) vs. 0.572 in our PLSR + word2vec implementation), and significantly better than all other models on the QMR dataset (0.522, vs. 0.346 in (Herbelot and Vecchi, 2015) , i.e. +51% improvement)." 
}, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-94", "text": "To get an intuition of why the mode baseline works well, Figures 2 and 3 show that most features tend to have one clearly dominant quantifier in the AD dataset." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-95", "text": "A similar trend can be found in the QMR dataset." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-96", "text": "In the AD dataset, there are 54 features, each of them being annotated for all 73 concepts." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-97", "text": "In the QMR dataset, there are 2201 features, each of them being annotated for only 6187 2201 \u2248 2.81 concepts on average." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-98", "text": "As a result, it is much more difficult for the PLSR to learn the mapping from word embeddings to model-theoretic vectors in the QMR dataset than in the AD dataset." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-99", "text": "This explains why the mode baseline outperforms PLSR in the QMR dataset but not in the AD dataset." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-100", "text": "The random vector baseline with PLSR performs mediocrely on the AD dataset, and very poorly on the QMR dataset." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-101", "text": "The nearest neighbor baseline yields some competitive results on the AD dataset, but lower results on the QMR dataset." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-102", "text": "Lastly, using retrofitting increases the performances on both AD and QMR datasets." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-103", "text": "This is expected as applying retrofitting to word embeddings leverages relational information from semantic lexicons by encouraging linked words to have similar vector representations." 
}, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-104", "text": "----------------------------------" }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-105", "text": "**CONCLUSION**" }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-106", "text": "In this paper we have presented several baselines for mapping distributional to model-theoretic semantic spaces." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-107", "text": "The mode baseline significantly outperforms (Herbelot and Vecchi, 2015) 's model on the QMR dataset, and yields competitive results on the AD dataset." }, { "sent_id": "f7e80cf0a6724675cab2825cbf7e10-C001-108", "text": "This indicates that state-of-the-art models do not efficiently map word embeddings to model-theoretic vectors in these datasets." } ], "y": { "@BACK@": { "gold_contexts": [ [ "f7e80cf0a6724675cab2825cbf7e10-C001-3" ], [ "f7e80cf0a6724675cab2825cbf7e10-C001-12" ], [ "f7e80cf0a6724675cab2825cbf7e10-C001-50" ], [ "f7e80cf0a6724675cab2825cbf7e10-C001-52" ] ], "cite_sentences": [ "f7e80cf0a6724675cab2825cbf7e10-C001-3", "f7e80cf0a6724675cab2825cbf7e10-C001-12", "f7e80cf0a6724675cab2825cbf7e10-C001-50", "f7e80cf0a6724675cab2825cbf7e10-C001-52" ] }, "@USE@": { "gold_contexts": [ [ "f7e80cf0a6724675cab2825cbf7e10-C001-12" ], [ "f7e80cf0a6724675cab2825cbf7e10-C001-19" ], [ "f7e80cf0a6724675cab2825cbf7e10-C001-58" ], [ "f7e80cf0a6724675cab2825cbf7e10-C001-67" ], [ "f7e80cf0a6724675cab2825cbf7e10-C001-80" ], [ "f7e80cf0a6724675cab2825cbf7e10-C001-89" ] ], "cite_sentences": [ "f7e80cf0a6724675cab2825cbf7e10-C001-12", "f7e80cf0a6724675cab2825cbf7e10-C001-19", "f7e80cf0a6724675cab2825cbf7e10-C001-58", "f7e80cf0a6724675cab2825cbf7e10-C001-67", "f7e80cf0a6724675cab2825cbf7e10-C001-80", "f7e80cf0a6724675cab2825cbf7e10-C001-89" ] }, "@DIF@": { "gold_contexts": [ [ "f7e80cf0a6724675cab2825cbf7e10-C001-77" ], [ "f7e80cf0a6724675cab2825cbf7e10-C001-93" ], [ "f7e80cf0a6724675cab2825cbf7e10-C001-107" ] ], "cite_sentences": [ 
"f7e80cf0a6724675cab2825cbf7e10-C001-77", "f7e80cf0a6724675cab2825cbf7e10-C001-93", "f7e80cf0a6724675cab2825cbf7e10-C001-107" ] }, "@SIM@": { "gold_contexts": [ [ "f7e80cf0a6724675cab2825cbf7e10-C001-84" ], [ "f7e80cf0a6724675cab2825cbf7e10-C001-90" ] ], "cite_sentences": [ "f7e80cf0a6724675cab2825cbf7e10-C001-84", "f7e80cf0a6724675cab2825cbf7e10-C001-90" ] } } }, "ABC_7e380d496bb253885218465b778cc1_3": { "x": [ { "sent_id": "7e380d496bb253885218465b778cc1-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-2", "text": "Pour la documentation des langues, la transcription est un processus tr\u00e8s co\u00fbteux : une minute d'enregistrement n\u00e9cessiterait environ une heure et demie de travail pour un linguiste (Austin and Sallabank, 2013) ." }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-3", "text": "R\u00e9cemment, la collecte de traductions (dans des langues bien document\u00e9es) align\u00e9es aux enregistrements est devenue une solution populaire pour garantir l'interpr\u00e9tabilit\u00e9 des enregistrements (Adda et al., 2016) et aider \u00e0 leur traitement automatique." }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-4", "text": "Dans cet article, nous \u00e9tudions l'impact de la langue de traduction sur les approches automatiques en documentation des langues." }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-5", "text": "Nous traduisons un corpus parall\u00e8le bilingue Mboshi-Fran\u00e7ais (Godard et al., 2017) dans quatre autres langues, et \u00e9valuons l'impact de la langue de traduction sur une t\u00e2che de segmentation en mots non supervis\u00e9e." }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-6", "text": "Nos r\u00e9sultats sugg\u00e8rent que la langue de traduction peut influencer l\u00e9g\u00e8rement la qualit\u00e9 de segmentation." 
}, { "sent_id": "7e380d496bb253885218465b778cc1-C001-7", "text": "Cependant, combiner l'information apprise par diff\u00e9rents mod\u00e8les bilingues nous permet d'am\u00e9liorer ces r\u00e9sultats de mani\u00e8re marginale." }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-8", "text": "For language documentation initiatives, transcription is an expensive resource: one minute of audio is estimated to take one hour and a half on average of a linguist's work (Austin and Sallabank, 2013) ." }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-9", "text": "Recently, collecting aligned translations in well-resourced languages became a popular solution for ensuring posterior interpretability of the recordings (Adda et al., 2016) ." }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-10", "text": "In this paper we investigate language-related impact in automatic approaches for computational language documentation." }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-11", "text": "We translate the bilingual Mboshi-French parallel corpus (Godard et al., 2017) into four other languages, and we perform bilingual-rooted unsupervised word discovery." }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-12", "text": "Our results hint towards an impact of the well-resourced language in the quality of the output." }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-13", "text": "However, by combining the information learned by different bilingual models, we are only able to marginally increase the quality of the segmentation." }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-14", "text": "MOTS-CL\u00c9S : d\u00e9couverte non supervis\u00e9e du lexique, documentation des langues, approches multilingues." 
}, { "sent_id": "7e380d496bb253885218465b778cc1-C001-15", "text": "----------------------------------" }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-16", "text": "**INTRODUCTION**" }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-17", "text": "The Cambridge Handbook of Endangered Languages (Austin and Sallabank, 2011) estimates that at least half of the 7,000 languages currently spoken worldwide will no longer exist by the end of this century." }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-18", "text": "For these endangered languages, data collection campaigns have to accommodate the challenge that many of them are from oral tradition, and producing transcriptions is costly." }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-19", "text": "This transcription bottleneck problem can be handled by translating into a widely spoken language to ensure subsequent interpretability of the collected recordings, and such parallel corpora have been recently created by aligning the collected audio with translations in a well-resourced language (Adda et al., 2016; Godard et al., 2017; Boito et al., 2018) ." }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-20", "text": "Moreover, some linguists suggested that more than one translation should be collected to capture deeper layers of meaning (Evans and Sasse, 2004) ." }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-21", "text": "This work is a contribution to the Computational Language Documentation (CLD) research field, that aims to replace part of the manual steps performed by linguists during language documentation initiatives by automatic approaches." }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-22", "text": "Here we investigate the unsupervised word discovery and segmentation task, using the bilingual-rooted approach from Godard et al. (2018) ." 
}, { "sent_id": "7e380d496bb253885218465b778cc1-C001-23", "text": "There, words in the well-resourced language are aligned to unsegmented phonemes in the endangered language in order to identify group of phonemes, and to cluster them into word-like units." }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-24", "text": "We experiment with the Mboshi-French parallel corpus, translating the French text into four other well-resourced languages in order to investigate language impact in this CLD approach." }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-25", "text": "Our results hint that this language impact exists, and that models based on different languages will output different word-like units." }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-26", "text": "----------------------------------" }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-27", "text": "**METHODOLOGY**" }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-28", "text": "The Multilingual Mboshi Parallel Corpus: In this work we extend the bilingual Mboshi-French parallel corpus (Godard et al., 2017) , fruit of the documentation process of Mboshi (Bantu C25), an endangered language spoken in Congo-Brazzaville." }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-29", "text": "The corpus contains 5,130 utterances, for which it provides audio, transcriptions and translations in French." }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-30", "text": "We translate the French into four other well-resourced languages through the use of the DeepL translator." }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-31", "text": "1 The languages added to the dataset are: English, German, Portuguese and Spanish." }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-32", "text": "Table 1 shows some statistics for the produced Multilingual Mboshi parallel corpus." 
}, { "sent_id": "7e380d496bb253885218465b778cc1-C001-33", "text": "2 Bilingual Unsupervised Word Segmentation/Discovery Approach: We use the bilingual neuralbased Unsupervised Word Segmentation (UWS) approach from Godard et al. (2018) to discover words in Mboshi." }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-34", "text": "In this approach, Neural Machine Translation (NMT) models are trained between language pairs, using as source language the translation (word-level) and as target, the language to document (unsegmented phonemic sequence)." }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-35", "text": "Due to the attention mechanism present in these networks (Bahdanau et al., 2014) , posterior to training, it is possible to retrieve soft-alignment probability matrices between source and target sequences." }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-36", "text": "These matrices give us sentence-level source-to-target alignment information, and by using it for clustering neighbor phonemes aligned to the same translation word, we are able to create segmentation in the target side." }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-37", "text": "The product of this approach is a set of (discovered-units, translation words) pairs." }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-38", "text": "Multilingual Leveraging: In this work we apply two simple methods for including multilingual information into the bilingual models from Godard et al. (2018) ." }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-39", "text": "The first one, Multilingual Voting, consists of merging the information learned by models trained with different language pairs by performing a voting over the final discovered boundaries." }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-40", "text": "The voting is performed by applying an agreement threshold T over the output boundaries." 
}, { "sent_id": "7e380d496bb253885218465b778cc1-C001-41", "text": "This threshold balances between accepting all boundaries from all the bilingual models (zero agreement) and accepting only input boundaries discovered by all these models (total agreement)." }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-42", "text": "The second method is ANE Selection." }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-43", "text": "For every language pair and aligned sentence in the dataset, a soft-alignment probability matrix is generated." }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-44", "text": "We use Average Normalized Entropy (ANE) (Boito et al., 2019a) computed over these matrices for selecting the most confident one for segmenting each phoneme sequence." }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-45", "text": "This exploits the idea that models trained on different language pairs will have language-related behavior, thus differing on the resulting alignment and segmentation over the same phoneme sequence." }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-46", "text": "----------------------------------" }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-47", "text": "**EXPERIMENTS**" }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-48", "text": "The experiment settings from this paper and evaluation protocol for the Mboshi corpus (Boundary F-scores using the ZRC speech reference) are the same from Boito et al. (2019a) ." }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-49", "text": "Table 2 presents the results for bilingual UWS and multilingual leveraging." }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-50", "text": "For the former, we reach our best result by using as aligned information the French, the original aligned language for this dataset." }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-51", "text": "Languages closely related to French (Spanish and Portuguese) ranked better, while our worst result used German." 
}, { "sent_id": "7e380d496bb253885218465b778cc1-C001-52", "text": "English also performs notably well in our experiments." }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-53", "text": "We believe this is due to the statistics features of the resulting text." }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-54", "text": "We observe in Table 1 that the English portion of the dataset contains the smallest vocabulary among all languages." }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-55", "text": "Since we train our systems in very low-resource settings, vocabularyrelated features can impact greatly the system's capacity to language-model, and consequently the final quality of the produced alignments." }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-56", "text": "Even in high-resource settings, it was already attested that some languages are more difficult to model than others (Cotterell et al., 2018) ." }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-57", "text": "For the multilingual selection experiments, we experimented combining the languages from top to bottom as they appear Table 2 (ranked by performance; e.g. 1-3 means the combination of FR(1), Table 3 : Top 10 confident (discovered type, translation) pairs for the five bilingual models." }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-58", "text": "The \"+\" mark means the discovered type is a concatenation of two existing true types." }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-59", "text": "EN(2) and PT (3))." }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-60", "text": "We observe that the performance improvement is smaller than the one observed in previous work (Boito et al., 2019b) , which we attribute to the fact that our dataset was artificially augmented." }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-61", "text": "This could result in the available multilingual form of supervision not being as rich as in a manually generated dataset." 
}, { "sent_id": "7e380d496bb253885218465b778cc1-C001-62", "text": "Finally, the best boundary segmentation result is obtained by performing multilingual voting with all the languages and an agreement of 50%, which indicates that the information learned by different languages will provide additional complementary evidence." }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-63", "text": "Lastly, following the methodology from Boito et al. (2019a) , we extract the most confident alignments (in terms of ANE) discovered by the bilingual models." }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-64", "text": "Table 3 presents the top 10 most confident (discovered type, translation) pairs." }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-65", "text": "3 Looking at the pairs the bilingual models are most confident about, we observe there are some types discovered by all the bilingual models (e.g. Mboshi word itua, and the concatenation obo\u00e1+ng\u00e1)." }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-66", "text": "However, the models still differ for most of their alignments in the table." }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-67", "text": "This hints that while a portion of the lexicon might be captured independently of the language used, other structures might be more dependent of the chosen language." }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-68", "text": "On this note, Haspelmath (2011) suggests the notion of word cannot always be meaningfully defined cross-linguistically." }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-69", "text": "----------------------------------" }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-70", "text": "**CONCLUSION**" }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-71", "text": "In this work we train bilingual UWS models using the endangered language Mboshi as target and different well-resourced languages as aligned information." 
}, { "sent_id": "7e380d496bb253885218465b778cc1-C001-72", "text": "Results show that similar languages rank better in terms of segmentation performance, and that by combining the information learned by different models, segmentation is further improved." }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-73", "text": "This might be due to the different languagedependent structures that are captured by using more than one language." }, { "sent_id": "7e380d496bb253885218465b778cc1-C001-74", "text": "Lastly, we extend the bilingual Mboshi-French parallel corpus, creating a multilingual corpus for the endangered language Mboshi that we make available to the community." } ], "y": { "@USE@": { "gold_contexts": [ [ "7e380d496bb253885218465b778cc1-C001-44" ], [ "7e380d496bb253885218465b778cc1-C001-48" ], [ "7e380d496bb253885218465b778cc1-C001-63" ] ], "cite_sentences": [ "7e380d496bb253885218465b778cc1-C001-44", "7e380d496bb253885218465b778cc1-C001-48", "7e380d496bb253885218465b778cc1-C001-63" ] } } }, "ABC_3d6df70136820f74ce76f60a59cc42_3": { "x": [ { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-2", "text": "In this paper, we quantify, analyze and mitigate gender bias exhibited in ELMo's contextualized word vectors." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-3", "text": "First, we conduct several intrinsic analyses and find that (1) training data for ELMo contains significantly more male than female entities, (2) the trained ELMo embeddings systematically encode gender information and (3) ELMo unequally encodes gender information about male and female entities." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-4", "text": "Then, we show that a state-of-the-art coreference system that depends on ELMo inherits its bias and demonstrates significant bias on the WinoBias probing corpus." 
}, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-5", "text": "Finally, we explore two methods to mitigate such gender bias and show that the bias demonstrated on WinoBias can be eliminated." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-6", "text": "----------------------------------" }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-8", "text": "Distributed representations of words in the form of word embeddings Pennington et al., 2014) and contextualized word embeddings (Peters et al., 2018; Devlin et al., 2018; Radford et al., 2018; McCann et al., 2017; Radford et al., 2019) have led to huge performance improvement on many NLP tasks." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-9", "text": "However, several recent studies show that training word embeddings in large corpora could lead to encoding societal biases present in these human-produced data (Bolukbasi et al., 2016; Caliskan et al., 2017) ." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-10", "text": "In this work, we extend these analyses to the ELMo contextualized word embeddings." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-11", "text": "Our work provides a new intrinsic analysis of how ELMo represents gender in biased ways." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-12", "text": "First, the corpus used for training ELMo has a significant gender skew: male entities are nearly three times more common than female entities, which leads to gender bias in the downloadable pre-trained contextualized embeddings." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-13", "text": "Then, we apply principal component analysis (PCA) to show that after training on such biased corpora, there exists a lowdimensional subspace that captures much of the gender information in the contextualized embeddings." 
}, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-14", "text": "Finally, we evaluate how faithfully ELMo preserves gender information in sentences by measuring how predictable gender is from ELMo representations of occupation words that co-occur with gender revealing pronouns." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-15", "text": "Our results show that ELMo embeddings perform unequally on male and female pronouns: male entities can be predicted from occupation words 14% more accurately than female entities." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-16", "text": "In addition, we examine how gender bias in ELMo propagates to the downstream applications." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-17", "text": "Specifically, we evaluate a state-of-the-art coreference resolution system ) that makes use of ELMo's contextual embeddings on WinoBias (Zhao et al., 2018a) , a coreference diagnostic dataset that evaluates whether systems behave differently on decisions involving male and female entities of stereotyped or anti-stereotyped occupations." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-18", "text": "We find that in the most challenging setting, the ELMo-based system has a disparity in accuracy between pro-and anti-stereotypical predictions, which is nearly 30% higher than a similar system based on GloVe (Lee et al., 2017) ." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-19", "text": "Finally, we investigate approaches for mitigating the bias which propagates from the contextualized word embeddings to a coreference resolution system." 
}, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-20", "text": "We explore two different strategies: (1) a training-time data augmentation technique (Zhao et al., 2018a) , where we augment the corpus for training the coreference system with its genderswapped variant (female entities are swapped to male entities and vice versa) and, afterwards, retrain the coreference system; and (2)" }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-21", "text": "----------------------------------" }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-22", "text": "**RELATED WORK**" }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-23", "text": "Gender bias has been shown to affect several realworld applications relying on automatic language analysis, including online news (Ross and Carter, 2011) , advertisements (Sweeney, 2013) , abusive language detection (Park et al., 2018) , machine translation (Font and Costa-juss\u00e0, 2019; Vanmassenhove et al., 2018) , and web search (Kay et al., 2015) ." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-24", "text": "In many cases, a model not only replicates bias in the training data but also amplifies it (Zhao et al., 2017) ." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-25", "text": "For word representations, Bolukbasi et al. (2016) and Caliskan et al. (2017) show that word embeddings encode societal biases about gender roles and occupations, e.g. engineers are stereotypically men, and nurses are stereotypically women." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-26", "text": "As a consequence, downstream applications that use these pretrained word embeddings also reflect this bias." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-27", "text": "For example, Zhao et al. (2018a) and Rudinger et al. (2018) show that coreference resolution systems relying on word embeddings encode such occupational stereotypes." 
}, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-28", "text": "In concurrent work, May et al. (2019) measure gender bias in sentence embeddings, but their evaluation is on the aggregation of word representations." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-29", "text": "In contrast, we analyze bias in contextualized word representations and its effect on a downstream task." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-30", "text": "To mitigate bias from word embeddings, Bolukbasi et al. (2016) propose a post-processing method to project out the bias subspace from the pre-trained embeddings." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-31", "text": "Their method is shown to reduce the gender information from the embeddings of gender-neutral words, and, remarkably, maintains the same level of performance on different downstream NLP tasks." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-32", "text": "Zhao et al. (2018b) further propose a training mechanism to separate gender information from other factors." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-33", "text": "However, Gonen and Goldberg (2019) argue that entirely removing bias is difficult, if not impossible, and the gender bias information can be often recovered." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-34", "text": "This paper investigates a natural follow-up question: What are effective bias mitigation techniques for contextualized embeddings?" }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-35", "text": "----------------------------------" }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-36", "text": "**GENDER BIAS IN ELMO**" }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-37", "text": "In this section we describe three intrinsic analyses highlighting gender bias in trained ELMo contextual word embeddings (Peters et al., 2018) ." 
}, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-38", "text": "We show that (1) training data for ELMo contains sig#occurrence #M-biased occs." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-39", "text": "#F-biased occs." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-40", "text": "M 5,300,000 170,000 81,000 F 1,600,000 33,000 36,000 nificantly more male entities compared to female entities leading to gender bias in the pre-trained contextual word embeddings (2) the geometry of trained ELMo embeddings systematically encodes gender information and (3) ELMo propagates gender information about male and female entities unequally." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-41", "text": "Table 1 lists the data analysis on the One Billion Word Benchmark (Chelba et al., 2013) corpus, the training corpus for ELMo." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-42", "text": "We show counts for the number of occurrences of male pronouns (he, his and him) and female pronouns (she and her) in the corpus as well as the co-occurrence of occupation words with those pronouns." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-43", "text": "We use the set of occupation words defined in the WinoBias corpus and their assignments as prototypically male or female (Zhao et al., 2018a) ." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-44", "text": "The analysis shows that the Billion Word corpus contains a significant skew with respect to gender: (1) male pronouns occur three times more than female pronouns and (2) male pronouns co-occur more frequently with occupation words, irrespective of whether they are prototypically male or female." 
}, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-45", "text": "----------------------------------" }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-46", "text": "**TRAINING DATA BIAS**" }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-47", "text": "----------------------------------" }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-48", "text": "**GEOMETRY OF GENDER**" }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-49", "text": "Next, we analyze the gender subspace in ELMo." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-50", "text": "We first sample 400 sentences with at least one gendered word (e.g., he or she from the OntoNotes 5.0 dataset (Weischedel et al., 2012) and generate the corresponding gender-swapped variants (changing he to she and vice-versa)." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-51", "text": "We then calculate the difference of ELMo embeddings between occupation words in corresponding sentences and conduct principal component analysis for all pairs of sentences." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-52", "text": "Figure 1 shows there are two principal components for gender in ELMo, in contrast to GloVe which only has one (Bolukbasi et al., 2016) ." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-53", "text": "The two principal components in ELMo seem to represent the gender from the contextual information (Contextual Gender) as well as the gender embedded in the word itself (Occupational Gender)." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-54", "text": "To visualize the gender subspace, we pick a few sentence pairs from WinoBias (Zhao et al., 2018a) ." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-55", "text": "Each sentence in the corpus contains one gendered pronoun and two occupation words, such as \"The developer corrected the secretary because she made a mistake\" and also the same sentence with the opposite pronoun (he)." 
}, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-56", "text": "In Figure 1 on the right, we project the ELMo embeddings of occupation words that are co-referent with the pronoun (e.g. secretary in the above example) for when the pronoun is male (blue dots) and female (orange dots) on the two principal components from the PCA analysis." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-57", "text": "Qualitatively, we can see the first component separates male and female contexts while the second component groups male related words such as lawyer and developer and female related words such as cashier and nurse." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-58", "text": "----------------------------------" }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-59", "text": "**UNEQUAL TREATMENT OF GENDER**" }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-60", "text": "To test how ELMo embeds gender information in contextualized word embeddings, we train a classifier to predict the gender of entities from occupation words in the same sentence." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-61", "text": "We collect sentences containing gendered words (e.g., he-she, father-mother) and occupation words (e.g., doctor) 1 from the OntoNotes 5.0 corpus (Weischedel et al., 2012) , where we treat occupation words as a mention to an entity, and the gender of that entity is taken to the gender of a co-referring gendered word, if one exists." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-62", "text": "For example, in the sentence \"the engineer went back to her home,\" we take engineer to be a female mention." 
}, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-63", "text": "Then we split all such instances into training and test, with 539 and 62 instances, respectively and augment these sentences by swapping all the gendered words with words of the opposite gender such that the numbers of male 1 We use the list collected in (Zhao et al., 2018a) and female entities are balanced." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-64", "text": "We first test if ELMo embedding vectors carry gender information." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-65", "text": "We train an SVM classifier with an RBF kernel 2 to predict the gender of a mention (i.e., an occupation word) based on its ELMo embedding." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-66", "text": "On development data, this classifier achieves 95.1% and 80.6% accuracy on sentences where the true gender was male and female respectively." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-67", "text": "For both male and female contexts, the accuracy is much larger than 50%, demonstrating that ELMo does propagate gender information to other words." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-68", "text": "However, male information is more than 14% more accurately represented in ELMo than female information, showing that ELMo propagates the information unequally for male and female entities." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-69", "text": "----------------------------------" }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-70", "text": "**BIAS IN COREFERENCE RESOLUTION**" }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-71", "text": "In this section, we establish that coreference systems that depend on ELMo embeddings exhibit significant gender bias." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-72", "text": "Then we evaluate two simple methods for removing the bias from the systems and show that the bias can largely be reduced." 
}, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-73", "text": "----------------------------------" }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-74", "text": "**SETUP**" }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-75", "text": "We evaluate bias with respect to the WinoBias dataset (Zhao et al., 2018a) , a benchmark of paired male and female coreference resolution examples following the Winograd format (Hirst, 1981; Rahman and Ng, 2012; Peng et al., 2015) ." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-76", "text": "It contains two different subsets, pro-stereotype, where pronouns are associated with occupations predominately associated with the gender of the pronoun, or anti-stereotype, when the opposite relation is true." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-77", "text": "Table 2 : F1 on OntoNotes and WinoBias development sets." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-78", "text": "WinoBias dataset is split Semantics Only and w/ Syntactic Cues subsets." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-79", "text": "ELMo improves the performance on the OntoNotes dataset by 5% but shows stronger bias on the WinoBias dataset." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-80", "text": "Avg. stands for averaged F1 score on the pro-and anti-stereotype subsets while \"Diff.\" is the absolute difference between these two subsets." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-81", "text": "* indicates the difference between pro/anti stereotypical conditions is significant (p < .05) under an approximate randomized test (Graham et al., 2014) ." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-82", "text": "Mitigating bias by data augmentation reduces all the bias from the coreference model to a neglect level." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-83", "text": "However, the neutralizing ELMo approach only mitigates bias when there are other strong learning signals for the task." 
}, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-84", "text": "Each subset consists of two types of sentences: one that requires semantic understanding of the sentence to make coreference resolution (Semantics Only) and another that relies on syntactic cues (w/ Syntactic Cues)." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-85", "text": "Gender bias is measured by taking the difference of the performance in pro-and antistereotypical subsets." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-86", "text": "Previous work (Zhao et al., 2018a) evaluated the systems based on GloVe embeddings but here we evaluate a state-of-the-art system that trained on the OntoNotes corpus with ELMo embeddings ." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-87", "text": "----------------------------------" }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-88", "text": "**BIAS MITIGATION METHODS**" }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-89", "text": "Next, we describe two methods for mitigating bias in ELMo for the purpose of coreference resolution:" }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-90", "text": "(1) a train-time data augmentation approach and (2) a test-time neutralization approach." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-91", "text": "Zhao et al. (2018a) propose a method to reduce gender bias in coreference resolution by augmenting the training corpus for this task." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-92", "text": "Data augmentation is performed by replacing gender revealing entities in the OntoNotes dataset with words indicating the opposite gender and then training on the union of the original data and this swapped data." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-93", "text": "In addition, they find it useful to also mitigate bias in supporting resources and therefore replace standard GloVe embeddings with bias mitigated word embeddings from Bolukbasi et al. (2016) ." 
}, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-94", "text": "We evaluate the performance of both aspects of this approach." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-95", "text": "----------------------------------" }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-96", "text": "**DATA AUGMENTATION**" }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-97", "text": "Neutralization We also investigate an approach to mitigate bias induced by ELMo embeddings without retraining the coreference model." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-98", "text": "Instead of augmenting training corpus by swapping gender words, we generate a gender-swapped version of the test instances." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-99", "text": "We then apply ELMo to obtain contextualized word representations of the original and the gender-swapped sentences and use their average as the final representations." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-100", "text": "Table 2 summarizes our results on WinoBias." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-101", "text": "----------------------------------" }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-102", "text": "**RESULTS**" }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-103", "text": "ELMo Bias Transfers to Coreference Row 3 in Table 2 summarizes performance of the ELMo based coreference system on WinoBias." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-104", "text": "While ELMo helps to boost the coreference resolution F1 score (OntoNotes) it also propagates bias to the task." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-105", "text": "It exhibits large differences between pro-and anti-stereotyped sets (|Diff|) on both semantic and syntactic examples in WinoBias." 
}, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-106", "text": "Bias Mitigation Rows 4-6 in Table 2 summarize the effectiveness of the two bias mitigation approaches we consider." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-107", "text": "Data augmentation is largely effective at mitigating bias in the coreference resolution system with ELMo (reducing |Diff | to insignificant levels) but requires retraining the system." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-108", "text": "Neutralization is less effective than augmentation and cannot fully remove gender bias on the Semantics Only portion of WinoBias, indicating it is effective only for simpler cases." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-109", "text": "This observation is consistent with Gonen and Goldberg (2019) , where they show that entirely removing bias from an embedding is difficult and depends on the manner, by which one measures the bias." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-110", "text": "----------------------------------" }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-111", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-112", "text": "Like word embedding models, contextualized word embeddings inherit implicit gender bias." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-113", "text": "We analyzed gender bias in ELMo, showing that the corpus it is trained on has significant gender skew and that ELMo is sensitive to gender, but unequally so for male and female entities." }, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-114", "text": "We also showed this bias transfers to downstream tasks, such as coreference resolution, and explored two bias mitigation strategies: 1) data augmentation and 2) neutralizing embeddings, effectively eliminating the bias from ELMo in a state-of-the-art system." 
}, { "sent_id": "3d6df70136820f74ce76f60a59cc42-C001-115", "text": "With increasing adoption of contextualized embeddings to get better results on core NLP tasks, e.g. BERT (Devlin et al., 2018) , we must be careful how such unsupervised methods perpetuate bias to downstream applications and our work forms the basis of evaluating and mitigating such bias." } ], "y": { "@USE@": { "gold_contexts": [ [ "3d6df70136820f74ce76f60a59cc42-C001-4" ], [ "3d6df70136820f74ce76f60a59cc42-C001-17" ], [ "3d6df70136820f74ce76f60a59cc42-C001-20" ], [ "3d6df70136820f74ce76f60a59cc42-C001-43" ], [ "3d6df70136820f74ce76f60a59cc42-C001-54" ], [ "3d6df70136820f74ce76f60a59cc42-C001-63" ], [ "3d6df70136820f74ce76f60a59cc42-C001-75" ], [ "3d6df70136820f74ce76f60a59cc42-C001-77" ], [ "3d6df70136820f74ce76f60a59cc42-C001-79" ], [ "3d6df70136820f74ce76f60a59cc42-C001-94" ], [ "3d6df70136820f74ce76f60a59cc42-C001-100" ], [ "3d6df70136820f74ce76f60a59cc42-C001-103" ], [ "3d6df70136820f74ce76f60a59cc42-C001-105" ] ], "cite_sentences": [ "3d6df70136820f74ce76f60a59cc42-C001-4", "3d6df70136820f74ce76f60a59cc42-C001-17", "3d6df70136820f74ce76f60a59cc42-C001-20", "3d6df70136820f74ce76f60a59cc42-C001-43", "3d6df70136820f74ce76f60a59cc42-C001-54", "3d6df70136820f74ce76f60a59cc42-C001-63", "3d6df70136820f74ce76f60a59cc42-C001-75", "3d6df70136820f74ce76f60a59cc42-C001-77", "3d6df70136820f74ce76f60a59cc42-C001-79", "3d6df70136820f74ce76f60a59cc42-C001-94", "3d6df70136820f74ce76f60a59cc42-C001-100", "3d6df70136820f74ce76f60a59cc42-C001-103", "3d6df70136820f74ce76f60a59cc42-C001-105" ] }, "@EXT@": { "gold_contexts": [ [ "3d6df70136820f74ce76f60a59cc42-C001-5" ] ], "cite_sentences": [ "3d6df70136820f74ce76f60a59cc42-C001-5" ] }, "@BACK@": { "gold_contexts": [ [ "3d6df70136820f74ce76f60a59cc42-C001-27" ], [ "3d6df70136820f74ce76f60a59cc42-C001-76" ], [ "3d6df70136820f74ce76f60a59cc42-C001-78" ], [ "3d6df70136820f74ce76f60a59cc42-C001-86" ], [ "3d6df70136820f74ce76f60a59cc42-C001-91" ], [ 
"3d6df70136820f74ce76f60a59cc42-C001-93" ] ], "cite_sentences": [ "3d6df70136820f74ce76f60a59cc42-C001-27", "3d6df70136820f74ce76f60a59cc42-C001-76", "3d6df70136820f74ce76f60a59cc42-C001-78", "3d6df70136820f74ce76f60a59cc42-C001-86", "3d6df70136820f74ce76f60a59cc42-C001-91", "3d6df70136820f74ce76f60a59cc42-C001-93" ] }, "@DIF@": { "gold_contexts": [ [ "3d6df70136820f74ce76f60a59cc42-C001-27", "3d6df70136820f74ce76f60a59cc42-C001-29" ] ], "cite_sentences": [ "3d6df70136820f74ce76f60a59cc42-C001-27" ] }, "@MOT@": { "gold_contexts": [ [ "3d6df70136820f74ce76f60a59cc42-C001-44" ], [ "3d6df70136820f74ce76f60a59cc42-C001-86" ], [ "3d6df70136820f74ce76f60a59cc42-C001-108" ] ], "cite_sentences": [ "3d6df70136820f74ce76f60a59cc42-C001-44", "3d6df70136820f74ce76f60a59cc42-C001-86", "3d6df70136820f74ce76f60a59cc42-C001-108" ] } } }, "ABC_311b238406da4891c09cb9c3c0334d_3": { "x": [ { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-2", "text": "Abstract." }, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-3", "text": "This paper describes our system created to detect stance in online discussions." }, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-4", "text": "The goal is to identify whether the author of a comment is in favor of the given target or against." }, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-5", "text": "Our approach is based on a maximum entropy classifier, which uses surface-level, sentiment and domain-specific features." }, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-6", "text": "The system was originally developed to detect stance in English tweets." }, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-7", "text": "We adapted it to process Czech news commentaries." 
}, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-8", "text": "----------------------------------" }, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-10", "text": "Stance detection has been defined as automatically detecting whether the author of a piece of text is in favor of the given target or against it." }, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-11", "text": "In the third class, there are the cases, in which neither inference is likely." }, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-12", "text": "It can be viewed as a subtask of opinion mining and it stands next to the sentiment analysis." }, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-13", "text": "The significant difference is that in sentiment analysis, systems determine whether a piece of text is positive, negative, or neutral." }, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-14", "text": "However, in stance detection, systems are to determine author's favorability towards a given target and the target even may not be explicitly mentioned in the text." }, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-15", "text": "Moreover, the text may express positive opinion about an entity contained in the text, but one can also infer that the author is against the defined target (an entity or a topic)." }, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-16", "text": "This makes the task more difficult, compared to the sentiment analysis, but it can often bring complementary information [3] ." }, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-17", "text": "There are many applications which could benefit from the automatic stance detection, including information retrieval, textual entailment, or text summarization, in particular opinion summarization." 
}, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-18", "text": "----------------------------------" }, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-19", "text": "**TASK DESCRIPTION**" }, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-20", "text": "The system was originally created for the SemEval 2016 task: Detecting stance in tweets [5] ." }, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-21", "text": "The task had two independent subtasks -supervised and weakly supervised." }, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-22", "text": "There were 19 participating systems for the supervised subtask and 9 for weaklysupervised subtask." }, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-23", "text": "Our system performed well for Abortion (2nd), Climate change (3rd) and Hillary Clinton (4th)." }, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-24", "text": "The overall rank was 9th." }, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-25", "text": "In the weakly-supervised task, we were ranked 4th, only the top system was significantly better." }, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-26", "text": "Official results are summarized in the We used the same system to detect stance in Czech news commentaries." }, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-27", "text": "We collected 1.560 comments from a Czech news server 1 related to two topics -\"Milo\u0161 Zeman\" (the Czech president) and \"Smoking ban in restaurants\" (statistics in Table 2 )." }, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-28", "text": "Consider the following example from the topic \"Milo\u0161 Zeman\"." 
}, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-29", "text": "----------------------------------" }, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-30", "text": "**THE APPROACH OVERVIEW**" }, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-31", "text": "We preprocessed the Czech commentaries by the same rules as in the original system [3] (for example: all urls were replaced by keyword URL, links to images are replaced by IMGURL, only letters are preserved, the rest of the characters is removed, \u2026)." }, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-32", "text": "Moreover, we stemmed the texts by HPS -High Precision Stemmer [2] ." }, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-33", "text": "The system is based on a standard maximum entropy classifier [4] , trained separately for each topic, with the following features." }, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-34", "text": "It has been showed that unigrams perform quite well in this task [6] ." }, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-35", "text": "Our model is based on TF-IDF and uses the top 1000 words from the vocabulary." }, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-36", "text": "The rest of the features can be turned on or off for each topic." }, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-37", "text": "Initial n-grams 4 , as showed in [1] can be useful features." }, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-38", "text": "Out system supports initial unigrams to initial trigrams." }, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-39", "text": "Another surface feature was the comment length in words after preprocessing." }, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-40", "text": "We used a resource borrowed from the sentiment analysis -Entity-centered sentiment dictionaries (ECSD): dictionaries created mainly for the purpose of entity-related polarity detection [7] ." 
}, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-41", "text": "The original system [3] used more features, which could not be easily applied on Czech commentaries." }, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-42", "text": "We do not work with tweets, so we could not use a set of features generated from hashtags." }, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-43", "text": "We have not analyzed the influence of part-of-speech (POS) tags yet." }, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-44", "text": "We did not identify strong candidates to build a domain specific dictionary as in [3] ." }, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-45", "text": "Bigram features did not work in the case of the tweet analysis, so we did not use it in this work as well." }, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-46", "text": "However, we plan to revisit the influence of bigram, POS or domain-specific features." }, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-47", "text": "Table 4 shows results on the Czech data." }, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-48", "text": "We used two evaluation measures." }, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-49", "text": "The first one was used for the SemEval'16 evaluation -the average F1-score on FAVOR and AGAINST classes." }, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-50", "text": "The second one includes the NONE class as well." }, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-51", "text": "We used 10-fold cross validation to distribute training and testing data." }, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-52", "text": "4 Initial n-grams are basically the first n words of the sentence." }, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-53", "text": "The results show that performance on the Czech data is significantly worse (.43 -.46) than on the English tweets corpus (.47 -.62)." 
}, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-54", "text": "It is mainly due to the lack of some key features like hashtags or domain-specific." }, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-55", "text": "Moreover, in the tweets corpus the stance tend to lean to one direction (either FAVOR or AGAINST), while in the Czech corpus most of the comments are considered neutral (NONE)." }, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-56", "text": "----------------------------------" }, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-57", "text": "**RESULTS**" }, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-58", "text": "----------------------------------" }, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-59", "text": "**CONCLUSION**" }, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-60", "text": "The paper describes the system originally created to participate in Tweet Stance Detection task in SemEval 2016 and additionally used to detect stance in Czech news commentaries." }, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-61", "text": "We experienced worse performance in comparison with the original English tweets corpus." }, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-62", "text": "It is mainly due to the lack of some significant features like hashtags." }, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-63", "text": "The current plan is to revisit the influence of bigram, POS or domainspecific features." 
}, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-64", "text": "----------------------------------" }, { "sent_id": "311b238406da4891c09cb9c3c0334d-C001-65", "text": "**6**" } ], "y": { "@BACK@": { "gold_contexts": [ [ "311b238406da4891c09cb9c3c0334d-C001-16" ] ], "cite_sentences": [ "311b238406da4891c09cb9c3c0334d-C001-16" ] }, "@USE@": { "gold_contexts": [ [ "311b238406da4891c09cb9c3c0334d-C001-31" ] ], "cite_sentences": [ "311b238406da4891c09cb9c3c0334d-C001-31" ] }, "@DIF@": { "gold_contexts": [ [ "311b238406da4891c09cb9c3c0334d-C001-41" ], [ "311b238406da4891c09cb9c3c0334d-C001-44" ] ], "cite_sentences": [ "311b238406da4891c09cb9c3c0334d-C001-41", "311b238406da4891c09cb9c3c0334d-C001-44" ] } } }, "ABC_21c2160667b3ff919e39285cd1ece7_3": { "x": [ { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-2", "text": "Subcategorization information is a useful feature in dependency parsing." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-3", "text": "In this paper, we explore a method of incorporating this information via Combinatory Categorial Grammar (CCG) categories from a supertagger." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-4", "text": "We experiment with two popular dependency parsers (Malt and MST) for two languages: English and Hindi." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-5", "text": "For both languages, CCG categories improve the overall accuracy of both parsers by around 0.3-0.5% in all experiments." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-6", "text": "For both parsers, we see larger improvements specifically on dependencies at which they are known to be weak: long distance dependencies for Malt, and verbal arguments for MST." 
}, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-7", "text": "The result is particularly interesting in the case of the fast greedy parser (Malt), since improving its accuracy without significantly compromising speed is relevant for large scale applications such as parsing the web." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-8", "text": "----------------------------------" }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-10", "text": "Dependency parsers can recover much of the predicate-argument structure of a sentence, while being relatively efficient to train and extremely fast at parsing." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-11", "text": "Dependency parsers have been gaining in popularity in recent times due to the availability of large dependency treebanks for several languages and parsing shared tasks (Buchholz and Marsi, 2006; Nivre et al., 2007a; Bharati et al., 2012) ." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-12", "text": "Ambati et al. (2013) showed that the performance of Malt (Nivre et al., 2007b) on the free word order language, Hindi, is improved by using lexical categories from Combinatory Categorial Grammar (CCG) (Steedman, 2000) ." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-13", "text": "In this paper, we extend this work and show that CCG categories are useful even in the case of English, a typologically different language, where parsing accuracy of dependency parsers is already extremely high." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-14", "text": "In addition, we also demonstrate the utility of CCG categories to MST (McDonald et al., 2005) for both languages." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-15", "text": "CCG lexical categories contain subcategorization information regarding the dependencies of predicates, including longdistance dependencies." 
}, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-16", "text": "We show that providing this subcategorization information in the form of CCG categories can help both Malt and MST on precisely those dependencies for which they are known to have weak rates of recovery." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-17", "text": "The result is particularly interesting for Malt, the fast greedy parser, as the improvement in Malt comes without significantly compromising its speed, so that it can be practically applied in web scale parsing." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-18", "text": "Our results apply both to English, a fixed word order and morphologically simple language, and to Hindi, a free word order and morphologically rich language, indicating that CCG categories from a supertagger are an easy and robust way of introducing lexicalized subcategorization information into dependency parsers." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-19", "text": "----------------------------------" }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-20", "text": "**RELATED WORK**" }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-21", "text": "Parsers using different grammar formalisms have different strengths and weaknesses, and prior work has shown that information from one formalism can improve the performance of a parser in another formalism." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-22", "text": "Sagae et al. (2007) achieved a 1.4% improvement in accuracy over a state-of-the-art HPSG parser by using dependencies from a dependency parser for constraining wide-coverage rules in the HPSG parser." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-23", "text": "Coppola and Steedman (2013) incorporated higher-order dependency features into a cube decoding phrasestructure parser and obtained significant gains on dependency recovery for both in-domain and out-of-domain test sets." 
}, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-24", "text": "Kim et al. (2012) improved a CCG parser using dependency features." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-25", "text": "They extracted n-best parses from a CCG parser and provided dependency Pierre Vinken will join the board as a nonexecutive director Nov. 29" }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-26", "text": "Figure 1: A CCG derivation and the Stanford scheme dependencies for an example sentence." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-27", "text": "features from a dependency parser to a re-ranker with an improvement of 0.35% in labelled F-score of the CCGbank test set." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-28", "text": "Conversely, Ambati et al. (2013) showed that a Hindi dependency parser (Malt) could be improved by using CCG categories." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-29", "text": "Using an algorithm similar to Cakici (2005) and Uematsu et al. (2013) , they first created a Hindi CCGbank from a Hindi dependency treebank and built a supertagger." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-30", "text": "They provided CCG categories from a supertagger as features to Malt and obtained overall improvements of 0.3% and 0.4% in unlabelled and labelled attachment scores respectively." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-31", "text": "Figure 1 shows a CCG derivation with CCG lexical categories for each word and Stanford scheme dependencies (De Marneffe et al., 2006) for an example English sentence." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-32", "text": "(Details of CCG and dependency parsing are given by Steedman (2000) and K\u00fcbler et al. 
(2009)" }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-33", "text": "----------------------------------" }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-34", "text": "**DATA AND TOOLS**" }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-35", "text": ".)" }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-36", "text": "----------------------------------" }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-37", "text": "**TREEBANKS**" }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-38", "text": "In English dependency parsing literature, Stanford and CoNLL dependency schemes are widely popular." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-39", "text": "We used the Stanford parser's built-in converter (with the basic projective option) to generate Stanford dependencies and Penn2Malt 1 to generate CoNLL dependencies from Penn Treebank (Marcus et al., 1993) ." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-40", "text": "We used standard splits, training (sections 02-21), development (section 22) and testing (section 23) for our experiments." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-41", "text": "For Hindi, we worked with the Hindi Dependency Treebank (HDT) released as part of Coling 2012 Shared Task (Bharati et al., 2012) ." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-42", "text": "HDT contains 12,041 training, 1,233 development and 1,828 testing sentences." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-43", "text": "We used the English (Hockenmaier and Steedman, 2007) and Hindi CCGbanks (Ambati et al., 1 http://w3.msi.vxu.se/ nivre/research/Penn2Malt.html 2013) for our experiments." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-44", "text": "For Hindi we used two lexicons: a fine-grained one (with morphological information) and a coarse-grained one (without morphological information)." 
}, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-45", "text": "----------------------------------" }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-46", "text": "**SUPERTAGGERS**" }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-47", "text": "We used Clark and Curran (2004) 's supertagger for English, and Ambati et al. (2013) 's supertagger for Hindi." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-48", "text": "Both are Maximum Entropy based CCG supertaggers." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-49", "text": "The Clark and Curran (2004) supertagger uses different features like word, partof-speech, and contextual and complex bi-gram features to obtain a 1-best accuracy of 91.5% on the development set." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-50", "text": "In addition to the above mentioned features, Ambati et al. (2013) employed morphological features useful for Hindi." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-51", "text": "The 1-best accuracy of Hindi supertagger for finegrained and coarse-grained lexicon is 82.92% and 84.40% respectively." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-52", "text": "----------------------------------" }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-53", "text": "**DEPENDENCY PARSERS**" }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-54", "text": "There has been a significant amount of work on parsing English and Hindi using the Malt and MST parsers in the recent past (Nivre et al., 2007a; Bharati et al., 2012) ." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-55", "text": "We first run these parsers with previous best settings (McDonald et al., 2005; Zhang and Nivre, 2012; Bharati et al., 2012) and treat them as our baseline." 
}, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-56", "text": "In the case of English, Malt uses arc-standard and stack-projective parsing algorithms for CoNLL and Stanford schemes respectively and LIBLIN-EAR learner (Fan et al., 2008) for both the schemes." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-57", "text": "MST uses 1st-order features, and a projective parsing algorithm with 5-best MIRA training for both the schemes." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-58", "text": "For Hindi, Malt uses the arc-standard parsing algorithm with a LIBLIN-EAR learner." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-59", "text": "MST uses 2nd-order features, nonprojective algorithm with 5-best MIRA training." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-60", "text": "For English, we assigned POS-tags using a perceptron tagger (Collins, 2002) ." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-61", "text": "For Hindi, we also did all our experiments using automatic features Ambati et al. (2013) )." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-62", "text": "(POS, chunk and morphological information) extracted using a Hindi shallow parser 2 ." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-63", "text": "----------------------------------" }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-64", "text": "**CCG CATEGORIES AS FEATURES**" }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-65", "text": "Following Ambati et al. (2013) , we used supertags which occurred at least K times in the training data, and backed off to coarse POS-tags otherwise." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-66", "text": "For English K=1, i.e., when we use CCG categories for all words, gave the best results." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-67", "text": "K=15 gave the best results for Hindi due to sparsity issues, as the data for Hindi is small." 
}, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-68", "text": "We provided a supertag as an atomic symbol similar to a POS tag and didn't split it into a list of argument and result categories." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-69", "text": "We explored both Stanford and CoNLL schemes for English and fine and coarsegrained CCG categories for Hindi." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-70", "text": "All feature and parser tuning was done on the development data." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-71", "text": "We assigned automatic POS-tags and supertags to the training data." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-72", "text": "----------------------------------" }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-73", "text": "**EXPERIMENTS WITH SUPERTAGGER OUTPUT**" }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-74", "text": "We first used gold CCG categories extracted from each CCGbank as features to the Malt and MST, to get an upper bound on the utility of CCG categories." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-75", "text": "As expected, gold CCG categories boosted the Unlabelled Attachment Score (UAS) and Labelled Attachment Score (LAS) by a large amount (4-7% in all the cases)." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-76", "text": "We then experimented with using automatic CCG categories from the English and Hindi supertaggers as a feature to Malt and MST." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-77", "text": "With automatic categories from a supertagger, we got statistically significant improvements (McNemar's test, p < 0.05 for Hindi LAS and p < 0.01 for the rest) over the baseline parsers, for all cases (Table 1) ." 
}, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-78", "text": "Since the CCGbanks used to train the supertaggers are automatically generated from the constituency or dependency treebanks used to train 2 http://ltrc.iiit.ac.in/analyzer/hindi/ the dependency parsers, the improvements are indeed due to reparameterization of the model to include CCG categories and not due to additional hand annotations in the CCGbanks." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-79", "text": "This shows that the rich subcategorization information provided by automatically assigned CCG categories can help Malt and MST in realistic applications." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-80", "text": "For English, in case of Malt, we achieved 0.3% improvement in both UAS and LAS for Stanford scheme." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-81", "text": "For CoNLL scheme, these improvements were 0.4% and 0.5% in UAS and LAS respectively." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-82", "text": "For MST, we got around 0.5% improvements in all cases." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-83", "text": "In case of Hindi, fine-grained supertags gave larger improvements for MST." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-84", "text": "We got final improvements of 0.5% and 0.3% in UAS and LAS respectively." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-85", "text": "In contrast, for Malt, Ambati et al. (2013) had shown that coarse-grained supertags gave larger improvements of 0.3% and 0.4% in UAS and LAS respectively." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-86", "text": "Due to better handling of error propagation in MST, the richer information in fine-grained categories may have surpassed the slightly lower supertagger performance, compared to coarse-grained categories." 
}, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-87", "text": "----------------------------------" }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-88", "text": "**ANALYSIS: ENGLISH**" }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-89", "text": "We analyze the impact of CCG categories on different labels (label-wise) and distance ranges (distance-wise) for CoNLL scheme dependencies (We observed a similar impact for the Stanford scheme dependencies as well)." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-90", "text": "Figure 2a shows the F-score for three major dependency labels, namely, ROOT (sentence root), SUBJ (subject), OBJ (object)." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-91", "text": "For Malt, providing CCG categories gave an increment of 1.0%, 0.3% for ROOT and SUBJ labels respectively." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-92", "text": "For MST, the improvements for ROOT and SUBJ were 0.5% and 0.8% respectively." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-93", "text": "There was no significant improvement for OBJ label, especially in the case of Malt." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-94", "text": "Figure 2b shows the F-score of dependencies based on the distance ranges between words." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-95", "text": "The percentage of dependencies in the 1\u22125, 6\u221210 and >10 distance ranges are 88.5%, 6.6% and 4.9% respectively out of the total of around 50,000 dependencies." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-96", "text": "For both Malt and MST, there was very slight improvement for short distance dependencies (1\u22125) but significant improvements for longer distances (6\u221210 and >10)." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-97", "text": "For Malt, there was an improvement of 0.6% and 0.9% for distances 6\u221210, and >10 respectively." 
}, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-98", "text": "For MST, these improvements were 1.0% and 1.0% respectively." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-99", "text": "----------------------------------" }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-100", "text": "**ANALYSIS: HINDI**" }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-101", "text": "In the case of Hindi, for MST, providing CCG categories gave an increment of 0.5%, 0.4% and 0.3% for ROOT, SUBJ and OBJ labels respectively in F-score over the baseline." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-102", "text": "Ambati et al. (2013) showed that for Hindi, providing CCG categories as features improved Malt in better handling of long distance dependencies." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-103", "text": "The percentage of dependencies in the 1\u22125, 6\u221210 and >10 distance ranges are 82.2%, 8.6% and 9.2% respectively out of the total of around 40,000 dependencies." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-104", "text": "Similar to English, there was very slight improvement for short distance dependencies (1\u22125)." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-105", "text": "But for longer distances, 6\u221210, and >10, there was significant improvement of 1.3% and 1.3% respectively for MST." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-106", "text": "Ambati et al. (2013) reported similar improvements for Malt as well." 
}, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-107", "text": "----------------------------------" }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-108", "text": "**DISCUSSION**" }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-109", "text": "Though valency is a useful feature in dependency parsing (Zhang and Nivre, 2011) , Zhang and Nivre (2012) showed that providing valency information dynamically, in the form of the number of dependencies established in a particular state during parsing, did not help Malt." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-110", "text": "However, as we have shown above, providing this information as a static lexical feature in the form of CCG categories does help Malt." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-111", "text": "In addition to specifying the number of arguments, CCG categories also contain syntactic type and direction of those arguments." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-112", "text": "However, providing CCG categories as features to zpar (Zhang and Nivre, 2011 ) didn't have significant impact as it is already using similar information." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-113", "text": "----------------------------------" }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-114", "text": "**IMPACT ON WEB SCALE PARSING**" }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-115", "text": "Greedy parsers such as Malt are very fast and are practically useful in large-scale applications such as parsing the web." 
}, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-116", "text": "----------------------------------" }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-117", "text": "**CONCLUSION**" }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-118", "text": "We have shown that informative CCG categories, which contain both local subcategorization information and capture long distance dependencies elegantly, improve the performance of two dependency parsers, Malt and MST, by helping in recovering long distance relations for Malt and local verbal arguments for MST." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-119", "text": "This is true both in the case of English (a fixed word order language) and Hindi (free word order and morphologically richer language), extending the result of Ambati et al. (2013) ." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-120", "text": "The result is particularly interesting in the case of Malt which cannot directly use valency information, which CCG categories provide indirectly." }, { "sent_id": "21c2160667b3ff919e39285cd1ece7-C001-121", "text": "It leads to an improvement in performance without significantly compromising speed and hence promises to be applicable to web scale processing." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "21c2160667b3ff919e39285cd1ece7-C001-12" ], [ "21c2160667b3ff919e39285cd1ece7-C001-28" ], [ "21c2160667b3ff919e39285cd1ece7-C001-29" ], [ "21c2160667b3ff919e39285cd1ece7-C001-30" ], [ "21c2160667b3ff919e39285cd1ece7-C001-50" ], [ "21c2160667b3ff919e39285cd1ece7-C001-85" ] ], "cite_sentences": [ "21c2160667b3ff919e39285cd1ece7-C001-12", "21c2160667b3ff919e39285cd1ece7-C001-28", "21c2160667b3ff919e39285cd1ece7-C001-29", "21c2160667b3ff919e39285cd1ece7-C001-30", "21c2160667b3ff919e39285cd1ece7-C001-50", "21c2160667b3ff919e39285cd1ece7-C001-85" ] }, "@EXT@": { "gold_contexts": [ [ "21c2160667b3ff919e39285cd1ece7-C001-13" ], [ "21c2160667b3ff919e39285cd1ece7-C001-119" ] ], "cite_sentences": [ "21c2160667b3ff919e39285cd1ece7-C001-13", "21c2160667b3ff919e39285cd1ece7-C001-119" ] }, "@USE@": { "gold_contexts": [ [ "21c2160667b3ff919e39285cd1ece7-C001-43" ], [ "21c2160667b3ff919e39285cd1ece7-C001-47" ], [ "21c2160667b3ff919e39285cd1ece7-C001-61" ], [ "21c2160667b3ff919e39285cd1ece7-C001-65" ], [ "21c2160667b3ff919e39285cd1ece7-C001-102" ] ], "cite_sentences": [ "21c2160667b3ff919e39285cd1ece7-C001-43", "21c2160667b3ff919e39285cd1ece7-C001-47", "21c2160667b3ff919e39285cd1ece7-C001-61", "21c2160667b3ff919e39285cd1ece7-C001-65", "21c2160667b3ff919e39285cd1ece7-C001-102" ] }, "@DIF@": { "gold_contexts": [ [ "21c2160667b3ff919e39285cd1ece7-C001-84", "21c2160667b3ff919e39285cd1ece7-C001-85" ] ], "cite_sentences": [ "21c2160667b3ff919e39285cd1ece7-C001-85" ] }, "@SIM@": { "gold_contexts": [ [ "21c2160667b3ff919e39285cd1ece7-C001-106" ] ], "cite_sentences": [ "21c2160667b3ff919e39285cd1ece7-C001-106" ] } } }, "ABC_f1a800c7cd47ac2edf2172cedb5889_3": { "x": [ { "sent_id": "f1a800c7cd47ac2edf2172cedb5889-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "f1a800c7cd47ac2edf2172cedb5889-C001-2", "text": "Neural dependency parsing models that compose word representations from characters can presumably exploit morphosyntax when making attachment 
decisions." }, { "sent_id": "f1a800c7cd47ac2edf2172cedb5889-C001-3", "text": "How much do they know about morphology?" }, { "sent_id": "f1a800c7cd47ac2edf2172cedb5889-C001-4", "text": "We investigate how well they handle morphological case, which is important for parsing." }, { "sent_id": "f1a800c7cd47ac2edf2172cedb5889-C001-5", "text": "Our experiments on Czech, German and Russian suggest that adding explicit morphological case-either oracle or predicted-improves neural dependency parsing, indicating that the learned representations in these models do not fully encode the morphological knowledge that they need, and can still benefit from targeted forms of explicit linguistic modeling." }, { "sent_id": "f1a800c7cd47ac2edf2172cedb5889-C001-6", "text": "----------------------------------" }, { "sent_id": "f1a800c7cd47ac2edf2172cedb5889-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "f1a800c7cd47ac2edf2172cedb5889-C001-8", "text": "Parsing morphologically rich languages (MRLs) is difficult due to the complex relationship of syntax to morphology." }, { "sent_id": "f1a800c7cd47ac2edf2172cedb5889-C001-9", "text": "But the success of neural networks offer an appealing solution to this problem by computing word representation from characters." }, { "sent_id": "f1a800c7cd47ac2edf2172cedb5889-C001-10", "text": "Character-level models (Ling et al., 2015; Kim et al., 2016) learn relationship between similar word forms and have shown to be effective for parsing MRLs (Ballesteros et al., 2015; Dozat et al., 2017; Shi et al., 2017; Bj\u00f6rkelund et al., 2017) ." }, { "sent_id": "f1a800c7cd47ac2edf2172cedb5889-C001-11", "text": "Does that mean that we can do away with explicit modeling of morphology altogether? Consider two challenges in parsing MRLs raised by Tsarfaty et al. (2010 Tsarfaty et al. ( , 2013 :" }, { "sent_id": "f1a800c7cd47ac2edf2172cedb5889-C001-12", "text": "\u2022 Can we represent words abstractly so as to reflect shared morphological aspects between them? 
\u2022 Which types of morphological information should we include in the parsing model?" }, { "sent_id": "f1a800c7cd47ac2edf2172cedb5889-C001-13", "text": "It is tempting to hypothesize that character-level models effectively solve the first problem." }, { "sent_id": "f1a800c7cd47ac2edf2172cedb5889-C001-14", "text": "For the second, Tsarfaty et al. (2010) and Seeker and Kuhn (2013) reported that morphological case is beneficial across morphologically rich languages with extensive case systems, where case syncretism is pervasive and often hurts parsing performance. But these studies focus on vintage parsers; do neural parsers with character-level representations also solve this second problem?" }, { "sent_id": "f1a800c7cd47ac2edf2172cedb5889-C001-15", "text": "We attempt to answer this question by asking whether an explicit model of morphological case helps dependency parsing, and our results show that it does." }, { "sent_id": "f1a800c7cd47ac2edf2172cedb5889-C001-16", "text": "Furthermore, a pipeline model in which we feed predicted case to the parser outperforms multi-task learning in which case prediction is an auxiliary task." }, { "sent_id": "f1a800c7cd47ac2edf2172cedb5889-C001-17", "text": "These results suggest that neural dependency parsers do not adequately infer this crucial linguistic feature directly from the input text." }, { "sent_id": "f1a800c7cd47ac2edf2172cedb5889-C001-18", "text": "----------------------------------" }, { "sent_id": "f1a800c7cd47ac2edf2172cedb5889-C001-19", "text": "**DEPENDENCY PARSING MODEL**" }, { "sent_id": "f1a800c7cd47ac2edf2172cedb5889-C001-20", "text": "We use a neural graph-based dependency parser similar to that of Kiperwasser and Goldberg (2016) and Zhang et al. (2017) for all our experiments." }, { "sent_id": "f1a800c7cd47ac2edf2172cedb5889-C001-21", "text": "We treat our parser as a black box and experiment only with the input representations of the parser." 
}, { "sent_id": "f1a800c7cd47ac2edf2172cedb5889-C001-22", "text": "Let w = w 1 , . . . , w |w| be an input sentence of length |w| and let w 0 denote an artificial ROOT token." }, { "sent_id": "f1a800c7cd47ac2edf2172cedb5889-C001-23", "text": "For each input token w i , we compute the context-independent representation, e(w i ) with a bidirectional LSTM (bi-LSTM) over characters." }, { "sent_id": "f1a800c7cd47ac2edf2172cedb5889-C001-24", "text": "We concatenate the result with its part-of-speech (POS) representation, t i : x i = [e(w i ); t i ]." }, { "sent_id": "f1a800c7cd47ac2edf2172cedb5889-C001-25", "text": "We then feed x i to a word-level bi-LSTM encoder to learn a contextual word representation w i ." }, { "sent_id": "f1a800c7cd47ac2edf2172cedb5889-C001-26", "text": "The model uses these representations to compute the probability p(h i , i | w, i) of head h i \u2208 {0, ..., |w|}/i and label i of word w i ." }, { "sent_id": "f1a800c7cd47ac2edf2172cedb5889-C001-27", "text": "----------------------------------" }, { "sent_id": "f1a800c7cd47ac2edf2172cedb5889-C001-28", "text": "**EXPERIMENTS**" }, { "sent_id": "f1a800c7cd47ac2edf2172cedb5889-C001-29", "text": "We experiment with three fusional languages with extensive case systems: Czech, German, and Rus- sian; and we consider four forms of input (e(w i ), \u00a72): word (embedding), characters, characters with gold case, and characters with predicted case." }, { "sent_id": "f1a800c7cd47ac2edf2172cedb5889-C001-30", "text": "For the latter two, we append the case label to the character sequence, e.g. b, a, t, Acc represents bat with accusative case." }, { "sent_id": "f1a800c7cd47ac2edf2172cedb5889-C001-31", "text": "Using the same method, we also supply the gold full analysis, to tease out the importance of case specifically." 
}, { "sent_id": "f1a800c7cd47ac2edf2172cedb5889-C001-32", "text": "Finally, we experiment with multitask learning (MTL; S\u00f8gaard and Goldberg, 2016; Coavoux and Crabb\u00e9, 2017) , using the bi-LSTM states of the lower layer of the bi-LSTM encoder to predict case feature." }, { "sent_id": "f1a800c7cd47ac2edf2172cedb5889-C001-33", "text": "We then look at the performance when we replace gold case with predicted case." }, { "sent_id": "f1a800c7cd47ac2edf2172cedb5889-C001-34", "text": "We train a morphological tagger to predict case information." }, { "sent_id": "f1a800c7cd47ac2edf2172cedb5889-C001-35", "text": "The tagger has the same structure as the parser's encoder, with an additional feedforward neural network with one hidden layer followed by a softmax layer." }, { "sent_id": "f1a800c7cd47ac2edf2172cedb5889-C001-36", "text": "We found that predicted case improves accuracy, although the effect is different across languages." }, { "sent_id": "f1a800c7cd47ac2edf2172cedb5889-C001-37", "text": "These results are interesting, since in vintage parsers, predicted case usually harmed accuracy (Tsarfaty et al., 2010) ." }, { "sent_id": "f1a800c7cd47ac2edf2172cedb5889-C001-38", "text": "However, we note that our taggers use gold POS, which might help." }, { "sent_id": "f1a800c7cd47ac2edf2172cedb5889-C001-39", "text": "Pipeline model vs. Multi-task learning In general, MTL models achieve similar or slightly better performance than the character-only models, suggesting that supplying case in this way is beneficial." }, { "sent_id": "f1a800c7cd47ac2edf2172cedb5889-C001-40", "text": "However, we found that using predicted case in a pipeline model gives more improvements than MTL." }, { "sent_id": "f1a800c7cd47ac2edf2172cedb5889-C001-41", "text": "We also observe an interesting pattern in which MTL achieves better tagging accuracy than the pipeline model but lower performance in parsing (Table 2 )." 
}, { "sent_id": "f1a800c7cd47ac2edf2172cedb5889-C001-42", "text": "This is surprising since it suggests that the MTL model must learn to effectively encode case in the model's representation, but must not effectively use it for parsing." }, { "sent_id": "f1a800c7cd47ac2edf2172cedb5889-C001-43", "text": "----------------------------------" }, { "sent_id": "f1a800c7cd47ac2edf2172cedb5889-C001-44", "text": "**CONCLUSION**" }, { "sent_id": "f1a800c7cd47ac2edf2172cedb5889-C001-45", "text": "Vintage dependency parsers rely on hand-crafted feature engineering to encode morphology." }, { "sent_id": "f1a800c7cd47ac2edf2172cedb5889-C001-46", "text": "The recent success of character-level models for many NLP tasks motivates us to ask whether their learned representations are powerful enough to completely replace this feature engineering." }, { "sent_id": "f1a800c7cd47ac2edf2172cedb5889-C001-47", "text": "By empirically testing this using a single feature known to be important-morphological case-we have shown that they are not." }, { "sent_id": "f1a800c7cd47ac2edf2172cedb5889-C001-48", "text": "Experiments with multi-task learning suggest that although MTL gives better performance, it is still underperformed by a traditional pipeline model." 
} ], "y": { "@MOT@": { "gold_contexts": [ [ "f1a800c7cd47ac2edf2172cedb5889-C001-11" ], [ "f1a800c7cd47ac2edf2172cedb5889-C001-14" ] ], "cite_sentences": [ "f1a800c7cd47ac2edf2172cedb5889-C001-11", "f1a800c7cd47ac2edf2172cedb5889-C001-14" ] }, "@BACK@": { "gold_contexts": [ [ "f1a800c7cd47ac2edf2172cedb5889-C001-14" ], [ "f1a800c7cd47ac2edf2172cedb5889-C001-37" ] ], "cite_sentences": [ "f1a800c7cd47ac2edf2172cedb5889-C001-14", "f1a800c7cd47ac2edf2172cedb5889-C001-37" ] }, "@DIF@": { "gold_contexts": [ [ "f1a800c7cd47ac2edf2172cedb5889-C001-36", "f1a800c7cd47ac2edf2172cedb5889-C001-37" ] ], "cite_sentences": [ "f1a800c7cd47ac2edf2172cedb5889-C001-37" ] } } }, "ABC_4dbf6e2bfa30e96816b7f6d9c6e069_3": { "x": [ { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-2", "text": "Automatic summarization techniques on meeting conversations developed so far have been primarily extractive, resulting in poor summaries." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-3", "text": "To improve this, we propose an approach to generate abstractive summaries by fusing important content from several utterances." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-4", "text": "Any meeting is generally comprised of several discussion topic segments." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-5", "text": "For each topic segment within a meeting conversation, we aim to generate a one sentence summary from the most important utterances using an integer linear programming-based sentence fusion approach." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-6", "text": "Experimental results show that our method can generate more informative summaries than the baselines." 
}, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-7", "text": "----------------------------------" }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-9", "text": "Meeting summarization helps both participants and non-participants by providing a short and concise snapshot of the most important content discussed in the meetings." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-10", "text": "A recent study revealed that people generally prefer abstractive summaries [4] ." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-11", "text": "Table 1 shows the human-written abstractive summaries along with the humangenerated extractive summaries from a meeting transcript." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-12", "text": "As can be seen, the utterances are highly noisy and contain unnecessary information." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-13", "text": "Even if an extractive summarizer can accurately classify these utterances as \"important\" and present them to a reader, it is hard to read and synthesize information from such utterances." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-14", "text": "In contrast, human written summaries are compact and readable." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-15", "text": "We propose an automatic way of generating short and concise abstractive summaries of meetings." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-16", "text": "Any meeting conversation includes dialogues on several topics." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-85", "text": "Our technique significantly outperforms the extractive method." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-17", "text": "For example, in Table 1 , the participants converse on two topics: design features and selling prices." 
}, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-18", "text": "Given the most important sentences within a topic segment, our goal is to generate a one-sentence summary from each segment and appending them to form a comprehensive summary of the meeting." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-19", "text": "Moreover, we also aim to generate summaries that resemble human-written summaries in terms of writing style." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-20", "text": "Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, and that copies bear this notice and the full citation on the first page." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-21", "text": "Copyrights for thirdparty components of this work must be honored." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-22", "text": "For all other uses, contact the owner/author(s)." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-23", "text": "Copyright is held by the author/owner(s)." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-24", "text": "A: Cause you have more complicated characters like European languages, then you need more buttons." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-25", "text": "D: I'm thinking the price might appeal to a certain market in one region, whereas in another it'll be different, so D: kay trendy probably means something other than just basic Abstractive summary: The team then discussed various features to consider in making the remote." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-26", "text": "Set 2: Human-generated extractive summary B: Like how much does, you know, a remote control cost." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-27", "text": "B: Well twenty five Euro, I mean that's um that's about like eighteen pounds or something." 
}, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-28", "text": "D: This is this gonna to be like the premium product kinda thing or B: So I don't know how how good a remote control that would get you. Um. Abstractive summary: The project manager talked about the project finances and selling prices." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-29", "text": "To aggregate the information from multiple utterances, we adapt an existing integer linear programming (ILP) based fusion technique [1] ." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-30", "text": "The fusion technique is based on the idea of merging dependency parse trees of the utterances." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-31", "text": "The trees are merged on the common nodes that are represented by the word and parts-ofspeech (POS) combination." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-32", "text": "Each edge of the merged structure is represented as a variable in the ILP objective function, and the solution will decide whether the edge has to be preserved or discarded." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-33", "text": "We modify the technique by introducing an anaphora resolution step and also an ambiguity resolver that takes the context of words into account." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-34", "text": "Further, to solve the ILP, we introduce several constraints, such as desired length of the output, etc." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-35", "text": "To the best of our knowledge, our work is the first to address the problems of readability, grammaticality and content selection jointly for meeting summary generation without employing a templatebased approach." 
}, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-36", "text": "We conduct experiments on the AMI corpus 1 that consists of meeting transcripts and show that our best method outperforms extractive model significantly on ROUGE-2 scores (0.048 vs 0.026)." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-37", "text": "----------------------------------" }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-38", "text": "**PROPOSED APPROACH**" }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-39", "text": "Dependency fusion on meeting data requires an algorithm that is robust for noisy data as utterances often have disfluencies." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-40", "text": "Our work applies fusion to all the important utterances within the topic segment to generate the best sub-tree that satisfies the constraints and maximizes the objective function of the optimization problem." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-41", "text": "Anaphora resolution step replaces pronouns with the original nouns in the previous utterance that they refer to in order to increase the chances of merging." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-42", "text": "Consider the following utterances: start node and the end node to ensure defined start and end points of the merged structure." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-43", "text": "The words from the utterances are iteratively added onto the graph." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-44", "text": "The words that have the same word form and POS tag are assigned to the same nodes." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-45", "text": "Ambiguity resolver. Suppose that a new word wi that has k ambiguous nodes where it can be mapped to." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-46", "text": "The k ambiguous nodes are referred to as mappable nodes." 
}, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-47", "text": "For every ambiguous mapping candidate, we first find the words to the left and right of the mappable node of the sentences, and then compute the number of words in both the directions that are common to the words in either direction of the word wi." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-48", "text": "Finally, wi is mapped to the node that has the highest directed context." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-49", "text": "ILP formulation." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-50", "text": "Figure 1 shows the sub-graph (marked using blue bold arrows) that we wish to retain from the merged graph structure to generate a one-sentence summary from several merged utterances." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-51", "text": "All the sentences generated from each meeting transcript are concatenated to produce the final abstractive summary." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-52", "text": "We need to maximize the information content of the generated sentence, keeping it grammatical." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-53", "text": "We model the problem as an integer linear programming (ILP) formulation, similar to the dependency graph fusion as proposed by Fillipova and Strube [1] ." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-54", "text": "The directed edges in the graph (binary variables) are represented as x g,d,l , where g, d and l denote the governor node, dependent node and the label of an edge, respectively." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-55", "text": "We maximize the following objective function:" }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-56", "text": "As shown in Equation (1), we introduce three different terms: p(l | g), I(d) and px N ." 
}, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-57", "text": "Each relation in a dependency graph consists of the governing node, the dependent node and the relation type." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-58", "text": "The term p(l | g) denotes the probabilities of the labels given a governor node, g. For every node (word and POS) in the entire corpus, the probabilities are represented as the ratio of the sum of the frequency of a particular label and the sum of the frequencies of all the labels emerging from a node." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-59", "text": "In this work, we calculate these values using Reuters corpora [5] to obtain dominant relations from non-conversational style of text." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-60", "text": "For example, Table 2 shows the probabilities of outgoing edges from a node, (produced/VBN)." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-86", "text": "In future work, we plan to design an end-to-end framework for summary generation from meetings." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-61", "text": "This term assigns the importance of grammatical relations to a node and only the relations that are more dominant from a node will be preferred." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-62", "text": "The term I(d) denotes the informativeness of a node calculated using Hori and Furui's formula [2] ." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-63", "text": "The last term in Equation (1) is based on the idea of lexical cohesion." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-64", "text": "Towards the end of any segment, generally, more important discussions might happen that will conclude a particular topic and then start another." 
}, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-65", "text": "In order to take this fact into account, we introduce the term px N , where N and px denote the total number of extracted utterances in a segment and the position of the utterance (the edge x belongs to) in the set of N utterances, respectively." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-66", "text": "In order to solve the above ILP problem, we impose a number of constraints." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-67", "text": "Some of the constraints have been directly adapted from the original ILP formulation [1] ." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-68", "text": "For example, we use the same constraints for restricting one incoming edge per node, as well as we impose the connectivity constraint to ensure a connected graph structure." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-69", "text": "Further, we restrict the subtree to have just one start edge and one end edge." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-70", "text": "This helps in preserving one ROOT node, as well as it limits to one end node for the generated subtree." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-71", "text": "We also limit the generated subtree to have a maximum of 15 nodes that controls the length of the summary sentence." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-72", "text": "We also add few linguistic constraints that ensure the coherence of the output such as every node can have maximum of one determinant, etc." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-73", "text": "We also impose constraints to prevent cycles in the graph structure, otherwise finding the best path from start and end nodes might be difficult." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-74", "text": "The final graph is linearized to obtain a coherent sentence." 
}, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-75", "text": "In the linearization process, we order the nodes based on their original ordering in the utterance." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-76", "text": "----------------------------------" }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-77", "text": "**EXPERIMENTAL RESULTS**" }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-78", "text": "The AMI Meeting corpus contains 20 meeting transcripts in the test set along with their corresponding abstractive (human-written) summaries as well as the annotations of topic segments." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-79", "text": "ROUGE is used to compare content selection of several approaches." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-80", "text": "We compared the content selection of our approach to an extractive summarizer [3] , which works as a baseline." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-81", "text": "We also compared our model without using anaphora resolution to see the impact of resolving pronouns." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-82", "text": "All the summaries were compared against the human-written summaries as reference." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-83", "text": "The results in Table 3 show that our method outperforms the other techniques on both ROUGE-2 (R2) and ROUGE-SU4 (R-SU4) recall scores." }, { "sent_id": "4dbf6e2bfa30e96816b7f6d9c6e069-C001-84", "text": "Moreover, we computed a coarse estimate of grammaticality using the log-likelihood score (LL) from the parser." 
} ], "y": { "@USE@": { "gold_contexts": [ [ "4dbf6e2bfa30e96816b7f6d9c6e069-C001-29" ], [ "4dbf6e2bfa30e96816b7f6d9c6e069-C001-32" ], [ "4dbf6e2bfa30e96816b7f6d9c6e069-C001-53" ], [ "4dbf6e2bfa30e96816b7f6d9c6e069-C001-66" ] ], "cite_sentences": [ "4dbf6e2bfa30e96816b7f6d9c6e069-C001-29", "4dbf6e2bfa30e96816b7f6d9c6e069-C001-32", "4dbf6e2bfa30e96816b7f6d9c6e069-C001-53", "4dbf6e2bfa30e96816b7f6d9c6e069-C001-66" ] }, "@EXT@": { "gold_contexts": [ [ "4dbf6e2bfa30e96816b7f6d9c6e069-C001-34" ], [ "4dbf6e2bfa30e96816b7f6d9c6e069-C001-67" ] ], "cite_sentences": [ "4dbf6e2bfa30e96816b7f6d9c6e069-C001-34", "4dbf6e2bfa30e96816b7f6d9c6e069-C001-67" ] }, "@SIM@": { "gold_contexts": [ [ "4dbf6e2bfa30e96816b7f6d9c6e069-C001-53" ] ], "cite_sentences": [ "4dbf6e2bfa30e96816b7f6d9c6e069-C001-53" ] } } }, "ABC_e9404db1fbda5dd8c55a40711d06ec_3": { "x": [ { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-2", "text": "In this proposal track paper, we have presented a crowdsourcing-based word embedding evaluation technique that will be more reliable and linguistically justified." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-3", "text": "The method is designed for intrinsic evaluation and extends the approach proposed in (Schnabel et al., 2015) ." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-4", "text": "Our improved evaluation technique captures word relatedness based on the word context." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-5", "text": "----------------------------------" }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-7", "text": "The semantic relatedness between words can be ambiguous if the context of the word is not known (Patwardhan et al., 2003) , and word sense disambiguation is the process of assigning a meaning to a polysemous word based on its context." 
}, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-8", "text": "The context defines linguistic and corresponding factual real world knowledge which provides a difference between word's sense and its reference." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-9", "text": "The sense of a word concerns one of the meanings of a word in a particular language." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-10", "text": "Reference is used to deal with the relationship between a language and the real world knowledge about an object or entity." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-11", "text": "The context of a word can be understood through a sentence, and thus understanding a word in a sentential context works as ambiguity resolution (Faust and Chiarello, 1998) ." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-12", "text": "The vector space representation of words (embeddings) keeps related words nearby in the vector space." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-13", "text": "The word relatedness is usually measured through synonyms, but synonyms can differ in at least one semantic feature." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-14", "text": "The feature can be 'denotative', referring to some actual, real world difference in the object the language is dealing with, such as, walk, lumber, stroll, meander, lurch, stagger." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-15", "text": "The feature can be 'connotative', referring to how the user feels about the object rather than any real difference in the object itself, such as, die, pass away, give up the ghost, kick the bucket, croak." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-16", "text": "Absolute synonyms are usually rare in a language." 
}, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-17", "text": "For example: sofa and couch are nearly absolute synonyms, however based on the context, they have different meaning in at least one way, such as, couch potato, because there is no word sense available for sofa potato (Vajda, 2001) ." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-18", "text": "Crowdsourcing (Ambati et al., 2010; CallisonBurch, 2009 ), which allows employing people worldwide to perform short tasks via online platforms, can be an effective tool for performing evaluation in a time and cost-effective way (Ambati, 2012) ." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-19", "text": "In (Schnabel et al., 2015) , crowdsourcingbased evaluation was proposed for synonyms or a word relatedness task where six word embedding techniques were evaluated." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-20", "text": "The crowdsourcingbased intrinsic evaluation which tests embeddings for semantic relationship between words focuses on a direct comparison of word embeddings with respect to individual queries." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-21", "text": "Although the method is promising for evaluating different word embeddings, it has some shortcomings." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-22", "text": "Specifically, it does not explicitly consider word context." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-23", "text": "As the approach relies on human interpretation of words, it is important to take into account how humans interpret or understand the meaning of a word." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-24", "text": "Humans usually understand semantic relatedness between words based on the context." 
}, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-25", "text": "Thus, if the approach is based only on the word without its context, it will be difficult for humans to understand the meaning of a particular word, and it could result in word sense ambiguity (WSA)." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-26", "text": "In this paper, we show what are the consequences of the lack of the word context in (Schnabel et al., 2015) , and we discuss how to address the resulting challenge." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-27", "text": "Specifically, we add a sentential context to mitigate word sense ambiguity, and this extension leads to an improved evaluation technique that explicitly accounts for multiple senses of a word." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-28", "text": "----------------------------------" }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-29", "text": "**CROWDSOURCING EVALUATION**" }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-30", "text": "----------------------------------" }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-31", "text": "**DETAILS OF THE METHOD**" }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-32", "text": "The method in (Schnabel et al., 2015) started by creating a query inventory which is a pre-selected set of query terms and semantically related target words." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-33", "text": "The query inventory consists of 100 query terms that balance frequency, part of speech (POS), and concreteness." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-34", "text": "The query terms were selected from 10 out of 45 broad categories from WordNet (Miller, 1995) ." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-35", "text": "Then, 10 random words with one adjective, one adverb, four nouns, and four verbs were selected based on concrete concepts from each category." 
}, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-36", "text": "Among the 10 words, 3 words were rare with the property that the number of their occurrences in the training corpusWikipedia dump (2008-03-01)-is smaller than 2500." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-37", "text": "For each of those 100 query terms in the inventory, the nearest neighbours at ranks k \u2208 {1, 5, 50} for the six embeddings from CBOW (Mikolov et al., 2013) , Glove (Pennington et al., 2014) , TSCCA (Dhillon et al., 2012) , C&W (Collobert et al., 2011) , H-PCA (Lebret and Lebret, 2013) , and Random Projection (Li et al., 2006) were retrieved." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-38", "text": "Then, for each k, the query word along with the six words corresponding to the embeddings described above were presented to human testers (Turkers) from Amazon Mechanical Turk (MTurk) for evaluation." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-39", "text": "Each Turker was requested to evaluate between 20 and 50 items per task, where an item corresponds to the query term and a set of 6 retrieved nearest neighbour words from each of the six embeddings." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-40", "text": "The Turkers' were then asked to select one of the six words that is the closest synonym to the query word according to their perception." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-41", "text": "For the selected 100 query words and 3 ranks (k), there were a total of 300 terms on which Turkers' perception-based choices were used for evaluating the embedding techniques." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-42", "text": "The comparison of embeddings was done by averaging the win ratio, where the win ratio was how many times the Turker chose a particular embedding divided by the number of total ratings for the corresponding query word." 
}, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-43", "text": "----------------------------------" }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-44", "text": "**SHORTCOMINGS OF THE METHOD**" }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-45", "text": "A word relatedness evaluation task for word embeddings is challenging due to ambiguity inherent in word sense and corresponding reference." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-46", "text": "Although the experiments in (Schnabel et al., 2015) incorporated participants with adequate knowledge of English, the ambiguity is inherent in the language." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-47", "text": "This means that evaluations that ignore the context may have impact on the evaluation result." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-48", "text": "Also, the evaluated word embedding techniques in (Schnabel et al., 2015) except TSCCA (Dhillon et al., 2015)-generate one vector for each word, and that makes comparisons between two related words from two embedding techniques difficult." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-49", "text": "For example, the word 'bank' may be embedded by CBOW as a noun in the context of 'he cashed a cheque at the bank' where the related word according to nearest neighbours would be 'financial' or 'finance' whereas the TSCCA might embed the same 'bank' as a noun but in the context of 'they pulled the canoe up on the bank' where related word according to nearest neighbours would be 'slope' or 'incline'." 
}, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-50", "text": "Although all the embedding techniques have been trained with the same corpus, different techniques may encode different explanatory factors of variation present in the data (Gao et al., 2014) , and using one embedding vector per word cannot capture the different meanings (Huang et al., 2012) , and as a result, not all senses will be conflated into one representation." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-51", "text": "If the query word 'bank' is presented to a user with 'financial' and 'incline' as related words, and a user is asked which one is more likely to be a related word, then the user has to choose one word, but she does not know the context." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-52", "text": "Therefore, if 100 people were asked to evaluate the query word, and 50 persons voted for 'financial' and 50 persons voted for 'incline' to be a related word, then both CBOW and TSCCA have the same score." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-53", "text": "However, this judgement would be inaccurate as CBOW can embed one vector per word whereas TSCCA can embed multiple vectors for each word." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-54", "text": "Thus user's choice of a related word does not have sufficient impact on the quality evaluation of the embedding techniques." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-55", "text": "Note that the word 'bank', as a noun, has 10 senses in WordNet." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-56", "text": "Before we introduce our extensions in the next section, we investigate how (Schnabel et al., 2015) accommodates word sense ambiguity." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-57", "text": "The Turker is presented with a query word and several related words to choose from." 
}, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-58", "text": "If the options presented to the Turker are from different contexts, the Turker has to choose from several correct senses." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-59", "text": "The Turker could be instructed that multiple senses can be encountered during the experiment, and one of the two alternative solutions could be considered:" }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-60", "text": "1. Biased Select the sense that is most likely according to your knowledge of the language 2." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-61", "text": "Uniform sampling Select one sense randomly giving the same preference to all options The first approach would be more appropriate because senses that are more common would be given higher priority." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-62", "text": "The second option would be hard to implement in practice because it is not clear if random sampling could be achieved, but this option will be useful to show connections with our method." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-63", "text": "Certainly, even if the Turker can sample according to a uniform probability, the real samples would depend on which senses contained in the corpus were captured by various word embedding techniques." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-64", "text": "Overall, using the above options, one could argue that the method accommodates different senses because the evaluation measures how well the word embedding methods recover the sense selection strategy of the user." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-65", "text": "The biased method would be desirable because it would focus on the most frequent senses, but one should note that this would depend on the subjective judgement of the user and her knowledge." 
}, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-66", "text": "----------------------------------" }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-67", "text": "**PROPOSED EXTENSIONS**" }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-68", "text": "Recent efforts on multiple embeddings for words (Neelakantan et al., 2015; Reisinger and Mooney, 2010 ) require a more sophisticated evaluation and further motivate our ideas." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-69", "text": "There are existing works, such as (Song, 2016; Iacobacci et al., 2015) , where the sense embedding was proposed as a remedy for the current word embedding limitation on ubiquitous polysemous words, and the method learns a vector for each sense of a word." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-70", "text": "For words with multiple meanings, it is important to see how many senses a word embedding technique can represent through multiple vectors." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-71", "text": "To achieve such an evaluation, we have first extended the work of (Schnabel et al., 2015) to include sentential context to avoid word sense ambiguity faced by a human tester." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-72", "text": "In our method, every query word is accompanied by a context sentence." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-73", "text": "We then extended the method further so that it is more suitable to evaluate embedding techniques designed for polysemous words with regard to their ability to embed diverse senses." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-74", "text": "----------------------------------" }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-75", "text": "**FIRST EXTENSION**" }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-76", "text": "Our chief idea is to extend the work of (Schnabel et al., 2015) by adding a context sentence for each query term." 
}, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-77", "text": "Using a context sentence for resolving word sense ambiguity is not a new concept, and it has been used by numerous researchers, such as (Melamud et al., 2015; Huang et al., 2012; Stetina et al., 1998; Biemann, 2013) ." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-78", "text": "In particular, human judgement based approaches, such as (Huang et al., 2012) , have used the sentential context to determine the similarity between two words, and (Biemann, 2013) used sentential context for lexical substitution realising the importance of the word interpretation in the context for crowdsourcing-based evaluations." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-79", "text": "Due to limited and potentially insufficient embedded vocabulary used to identify a related sense of the query term, we are also proposing to provide another option of 'None of the above' along with the six words." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-80", "text": "In fact, (Schnabel et al., 2015) have already considered 'I don't know the meaning of one (or several) of the words'; however, when the context is in place, there may be a situation when none of the embeddings make a good match for the query term, and in that case 'None of the above' is more appropriate." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-81", "text": "In this way, the user's response will be more justified, and a more reliable evaluation score will be retrieved." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-82", "text": "Our proposal is based on an observation that human reasoning about a word is based on the context, and in crowdsourcing evaluations, we use a human to interpret the meaning; and based on their judgement, we evaluate embedding techniques." 
}, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-83", "text": "So the human should be presented with the examples in the manner that is consistent with what humans see in real-life." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-84", "text": "----------------------------------" }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-85", "text": "**SECOND EXTENSION**" }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-86", "text": "In our first extension above, every query word is presented in a context." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-87", "text": "In order to implement a multi-sense evaluation, every query word is presented in several contexts where contexts represent different senses." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-88", "text": "The number (p) of the contexts presented, where p \u2265 1, will depend on the number and frequency of available senses for a particular query word." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-89", "text": "Note that p contexts for the query word are presented in every round, and the Turker has more senses to choose from when word embeddings encode multiple senses per word." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-90", "text": "----------------------------------" }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-91", "text": "**EXAMPLE**" }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-92", "text": "The true, related words are those that are retrieved from the embedding techniques using the nearest neighbour algorithm, for example." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-93", "text": "Below, we show an example word 'bar' together with its context; the context is extracted from WordNet." 
}, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-94", "text": "To extend the evaluation for multi-sense embedding capabilities of the embedding techniques, we will extend the example setting above by adding multiple test cases for each query word representing different senses." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-95", "text": "Note that this is not needed in (Schnabel et al., 2015) where query words are not annotated." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-96", "text": "In the above example, only one test case per query word was presented." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-97", "text": "However, for the query word 'Bar' as a noun, there are 15 senses available in WordNet 3.0, and 23 senses available in 2012 version of Wikipedia (Dandala et al., 2013a) ." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-98", "text": "For the second extension, the human evaluator will be presented with p context sentences representing p different senses." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-99", "text": "The criteria for selecting senses, and the corresponding context sentences will be discussed in the next section." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-100", "text": "----------------------------------" }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-101", "text": "**CONTEXT GENERATION**" }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-102", "text": "In every iteration, every word embedding method will return its best match for the query term." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-103", "text": "Our method will need to determine a context (i.e. an appropriate sentence for the given word)." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-104", "text": "We call this process context generation, and this section introduces two approaches that can be used to implement it." 
}, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-105", "text": "----------------------------------" }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-106", "text": "**INFORMED MATCHING**" }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-107", "text": "In this informed approach, our assumption is that the senses selected for the query word should exist in the training corpus." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-108", "text": "Below we explain how to implement this feature." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-109", "text": "----------------------------------" }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-110", "text": "**MATCHING FREQUENT SENSES**" }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-111", "text": "In this approach, the goal is to use the most frequent senses from WordNet." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-112", "text": "In this way, we can take into account the frequency of senses embedded in WordNet." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-113", "text": "For every query word, the most frequent n, where n \u2265 1, word senses will be selected from WordNet." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-114", "text": "Note that we have to select only those senses that exist in our training corpus which is Wikipedia in this case." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-115", "text": "The mapping of the senses between Wikipedia and WordNet will be implemented using a method similar to (Mihalcea, 2007, Section 3.1) ." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-116", "text": "In the final step of their method, the labels (Wikipedia senses) are manually (i.e. they are performed by a human) mapped to WordNet senses." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-117", "text": "An alternative approach would be automated mapping introduced in (Fernando and Stevenson, 2012) , which does not require human intervention." 
}, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-118", "text": "One could argue that the manual mapping would be more accurate because of the incorporation of the human judgement, however, this is expensive and time consuming." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-119", "text": "As the overlapping, most frequent senses from the Wikipedia and WordNet will be chosen, the correct senses corresponding to the embedded word can be selected by Turkers as long as the word embedding methods are accurate." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-120", "text": "Since our method presents n senses per run, it is more likely that one or more of the chosen senses were embedded by the embedding techniques." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-121", "text": "Note that senses in WordNet are generally ordered from the most frequent to the least frequent." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-122", "text": "WordNet sense frequencies come from the SemCor (Miller et al., 1993) sensetagged corpus which means that WordNet frequencies are well justified, and they are based on data." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-123", "text": "The example sentence corresponding to the chosen sense will be taken as a context sentence." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-124", "text": "As WordNet was annotated by humans, we assume that the context sentences are correct for a particular sense." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-125", "text": "Matching Rare Senses In (Vossen et al., 2013) , the authors argue that current sense-tagged corpora have insufficient support for rare senses and contexts and, as a result, they may not be sufficient for word-sense-disambiguation." 
}, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-126", "text": "For example, WordNet 3.0 has 15 senses for the word 'bar' as a noun, whereas 2012 version of Wikipedia has 23 senses (Dandala et al., 2013a) for this word." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-127", "text": "As a remedy for this issue, we propose another way to generate contexts where we utilise m, where m \u2265 1, randomly selected senses from the training corpus (Wikipedia in our case)." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-128", "text": "Note that this section applies to the situation where none of the rare senses exist in WordNet." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-129", "text": "Since Wikipedia does not contain frequencies for senses, sampling has to be according to a uniform distribution." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-130", "text": "Overall, Wikipedia can be used as a training corpus for the embedding methods and also for sense annotation." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-131", "text": "In (Mihalcea, 2007) , the authors showed that links in Wikipedia articles are appropriate for representing a sense." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-132", "text": "When Wikipedia will be used for selecting rare senses, the context sentence will be retrieved using a similar method to (Mihalcea, 2007, Section 3.1) ." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-133", "text": "Specifically, in the final step of the mapping method of (Mihalcea, 2007, Section 3.1) , the labels (Wikipedia senses) were mapped to WordNet senses." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-134", "text": "However, this time we are interested in the word senses that are not available in WordNet; as a result, we will map the selected senses from Wikipedia to the appropri-ate subsenses in the Oxford Dictionary of English (ODE) (Soanes and Stevenson, 2003) ." 
}, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-135", "text": "Note that ODE provides a hierarchical structure of senses, and each polysemous sense is divided into a core sense and a set of subsenses (Navigli, 2006) ." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-136", "text": "We will follow an approach similar to (Navigli, 2006) where WordNet sense was semantically mapped to the ODE core senses." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-137", "text": "They mapped to the core senses because they were interested in the coarsegrained sense mapping to resolve granularity inherent in WordNet." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-138", "text": "In our case, we will do semantic mapping between Wikipedia senses (piped link or simple link) and ODE subsenses, instead of mapping the WordNet sense to the ODE core senses." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-139", "text": "Then, corresponding context sentences will be selected from Wikipedia or ODE." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-140", "text": "Overall, when the corresponding context sentence for a query term is not available in WordNet, the context sentence can be retrieved from Wikipedia (Mihalcea, 2007; Dandala et al., 2013b) or ODE using the method described above." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-141", "text": "----------------------------------" }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-142", "text": "**RANDOM MATCHING**" }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-143", "text": "The informed method described above requires either manual matching by humans (which are time consuming and expensive) or an automated matching which may be inaccurate." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-144", "text": "An alternative approach is to sample senses randomly from WordNet ignoring senses contained in the training corpus." 
}, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-145", "text": "The sampling distribution should be based on frequencies of senses." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-146", "text": "In this case, 'None of the above' option will be used whenever none of the embedded words are related to the query word according to the presented context." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-147", "text": "If we consider a large number of Turkers' evaluations, the evaluation will still give the performance score reflecting the true performance score of the embedding technique." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-148", "text": "However, this will be more costly because more Turkers will be required." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-149", "text": "----------------------------------" }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-150", "text": "**MERIT OF OUR EXTENSIONS**" }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-151", "text": "At the end of Sec. 2.2, we explained how word sense ambiguity is accommodated in (Schnabel et al., 2015) ." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-152", "text": "We argued that their evaluation was in expectation with respect to subjective preferences of the Turkers." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-153", "text": "Additionally, when the context is not provided, the Turkers may even forget about common senses of the query word." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-154", "text": "In our proposal, we argue that query words should be presented in an appropriate context." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-155", "text": "Similar to Sec. 2.2, we can distinguish two ways in which we can apply our method:" }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-156", "text": "1. Informed sampling Sample senses according to their frequency in WordNet 2." 
}, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-157", "text": "Uniform sampling Sample senses according to a uniform probability distribution if no frequency data is available (e.g. Wikipedia) We can now draw a parallel with alternative ways that Turkers may apply to solve the word sense ambiguity problem." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-158", "text": "In particular, under certain conditions (i.e. when word embeddings don't use sense frequency information), the uniform sampling option in our method would be equivalent with the uniform sampling method in Sec. 2.2." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-159", "text": "This means that asking the Turkers to select senses randomly according to a uniform probability distribution is the same as sampling contexts according to a uniform distribution." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-160", "text": "The two approaches differ, however, when non-uniform, informed probability distributions are used." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-161", "text": "Informed sampling in our approach is based on WordNet whose sense frequencies are based on data-driven research." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-162", "text": "This means that the overall evaluation would be based on real frequencies coming from the data instead of subjective and idiosyncratic judgements by the Turkers." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-163", "text": "This probabilistic argument provides another justification for our approach." 
}, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-164", "text": "----------------------------------" }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-165", "text": "**CONCLUSION**" }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-166", "text": "In this paper, a crowdsourcing-based word embedding evaluation technique of (Schnabel et al., 2015) was extended to provide data-driven treatment of word sense ambiguity." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-167", "text": "The method of (Schnabel et al., 2015) relies on user's subjective and knowledge dependent ability to select 'preferred' meanings whereas our method would deal with this problem selecting explicit contexts for words." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-168", "text": "The selection is according to the real frequencies of meanings computed from data." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-169", "text": "With this data-driven feature, our method could be more appropriate to evaluate both methods that produce one embedding per word as well as methods that produce one embedding per word sense." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-170", "text": "Our method would provide scores that accommodate word sense frequencies in the real use of the language." }, { "sent_id": "e9404db1fbda5dd8c55a40711d06ec-C001-171", "text": "Here, we assume that word embeddings should recover the most frequent senses with higher priority." 
} ], "y": { "@EXT@": { "gold_contexts": [ [ "e9404db1fbda5dd8c55a40711d06ec-C001-3" ], [ "e9404db1fbda5dd8c55a40711d06ec-C001-71" ], [ "e9404db1fbda5dd8c55a40711d06ec-C001-73" ], [ "e9404db1fbda5dd8c55a40711d06ec-C001-76" ], [ "e9404db1fbda5dd8c55a40711d06ec-C001-166" ] ], "cite_sentences": [ "e9404db1fbda5dd8c55a40711d06ec-C001-3", "e9404db1fbda5dd8c55a40711d06ec-C001-71", "e9404db1fbda5dd8c55a40711d06ec-C001-73", "e9404db1fbda5dd8c55a40711d06ec-C001-76", "e9404db1fbda5dd8c55a40711d06ec-C001-166" ] }, "@BACK@": { "gold_contexts": [ [ "e9404db1fbda5dd8c55a40711d06ec-C001-19" ], [ "e9404db1fbda5dd8c55a40711d06ec-C001-20" ], [ "e9404db1fbda5dd8c55a40711d06ec-C001-23" ], [ "e9404db1fbda5dd8c55a40711d06ec-C001-32" ], [ "e9404db1fbda5dd8c55a40711d06ec-C001-46", "e9404db1fbda5dd8c55a40711d06ec-C001-47", "e9404db1fbda5dd8c55a40711d06ec-C001-48" ], [ "e9404db1fbda5dd8c55a40711d06ec-C001-80" ], [ "e9404db1fbda5dd8c55a40711d06ec-C001-95" ], [ "e9404db1fbda5dd8c55a40711d06ec-C001-151" ], [ "e9404db1fbda5dd8c55a40711d06ec-C001-152" ] ], "cite_sentences": [ "e9404db1fbda5dd8c55a40711d06ec-C001-19", "e9404db1fbda5dd8c55a40711d06ec-C001-20", "e9404db1fbda5dd8c55a40711d06ec-C001-23", "e9404db1fbda5dd8c55a40711d06ec-C001-32", "e9404db1fbda5dd8c55a40711d06ec-C001-46", "e9404db1fbda5dd8c55a40711d06ec-C001-48", "e9404db1fbda5dd8c55a40711d06ec-C001-80", "e9404db1fbda5dd8c55a40711d06ec-C001-95", "e9404db1fbda5dd8c55a40711d06ec-C001-151", "e9404db1fbda5dd8c55a40711d06ec-C001-152" ] }, "@MOT@": { "gold_contexts": [ [ "e9404db1fbda5dd8c55a40711d06ec-C001-21" ], [ "e9404db1fbda5dd8c55a40711d06ec-C001-22" ], [ "e9404db1fbda5dd8c55a40711d06ec-C001-25" ], [ "e9404db1fbda5dd8c55a40711d06ec-C001-26" ], [ "e9404db1fbda5dd8c55a40711d06ec-C001-46", "e9404db1fbda5dd8c55a40711d06ec-C001-47", "e9404db1fbda5dd8c55a40711d06ec-C001-48" ], [ "e9404db1fbda5dd8c55a40711d06ec-C001-80" ] ], "cite_sentences": [ "e9404db1fbda5dd8c55a40711d06ec-C001-21", "e9404db1fbda5dd8c55a40711d06ec-C001-22", 
"e9404db1fbda5dd8c55a40711d06ec-C001-25", "e9404db1fbda5dd8c55a40711d06ec-C001-26", "e9404db1fbda5dd8c55a40711d06ec-C001-46", "e9404db1fbda5dd8c55a40711d06ec-C001-48", "e9404db1fbda5dd8c55a40711d06ec-C001-80" ] }, "@USE@": { "gold_contexts": [ [ "e9404db1fbda5dd8c55a40711d06ec-C001-26" ], [ "e9404db1fbda5dd8c55a40711d06ec-C001-56" ], [ "e9404db1fbda5dd8c55a40711d06ec-C001-152" ] ], "cite_sentences": [ "e9404db1fbda5dd8c55a40711d06ec-C001-26", "e9404db1fbda5dd8c55a40711d06ec-C001-56", "e9404db1fbda5dd8c55a40711d06ec-C001-152" ] }, "@DIF@": { "gold_contexts": [ [ "e9404db1fbda5dd8c55a40711d06ec-C001-167" ] ], "cite_sentences": [ "e9404db1fbda5dd8c55a40711d06ec-C001-167" ] } } }, "ABC_fe539365c7bb4555280fd1a5478aba_3": { "x": [ { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-2", "text": "Opinions in social media play such an important role for customers and companies that there is a growing tendency to post fake reviews in order to change purchase decisions and opinions." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-3", "text": "In this paper we propose the use of different features for a low dimension representation of opinions." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-4", "text": "We evaluate our proposal incorporating the features to a Support Vector Machines classifier and we use an available corpus with reviews of hotels in Chicago." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-5", "text": "We perform comparisons with previous works and we conclude that using our proposed features it is possible to obtain competitive results with a small amount of features for representing the data." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-6", "text": "Finally, we also investigate if the use of emotions can help to discriminate between truthful and deceptive opinions as previous works show to happen for deception detection in text in general." 
}, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-7", "text": "----------------------------------" }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-9", "text": "Spam is commonly present on the Web through of fake opinions, untrue reviews, malicious comments or unwanted texts posted in electronic commerce sites and blogs." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-10", "text": "The purpose of those kinds of spam is promote products and services, or simply damage their reputation." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-11", "text": "A deceptive opinion spam can be defined as a fictitious opinion written with the intention to sound authentic in order to mislead the reader." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-12", "text": "An opinion spam usually is a short text written by an unknown author using a not very well defined style." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-13", "text": "These characteristics make the problem of automatic detection of opinion spam a very challenging problem." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-14", "text": "First attempts for solving this problem considered unsupervised approaches trying to identify duplicate content (Jindal and Liu, 2008) , and searching for unusual review patterns (Jindal et al., 2010) or groups of opinion spammers (Mukherjee et al., 2011) ." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-15", "text": "Later, supervised methods were presented." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-16", "text": "Such is the case of (Feng et al., 2012a; Feng et al., 2012b) in which the authors extended the n-gram feature by incorporating syntactic production rules derived from probabilistic context free grammar parse trees." 
}, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-17", "text": "In (Liu et al., 2002) a learning from positive and unlabeled examples (PU-learning) approach was successfully applied to detect deceptive opinion spam, using only few examples of deceptive opinions and a set of unlabeled data." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-18", "text": "Then, in (Hern\u00e1ndez Fusilier et al., 2015a ) the authors proposed a PU-learning variant for the same task, concluding the appropriateness of their approach for detecting opinion spam." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-19", "text": "In this paper we study the feasibility of the application of different features for representing safely information about clues related to fake reviews." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-20", "text": "We focus our study in a variant of the stylistic feature character n-grams named character n-grams in tokens." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-21", "text": "We also study an emotion-based feature and a linguistic processes feature based on LIWC variables." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-22", "text": "We evaluated the proposed features with a Support Vector Machines (SVM) classifier using a corpus of 1600 reviews of hotels (Ott et al., 2011; Ott et al., 2013) ." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-23", "text": "We show an experimental study evaluating the single features and combining them with the intention to obtain better features." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-24", "text": "After that previous study, we selected the one with we obtained the best results and made direct and indirect comparisons with some other methods." 
}, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-25", "text": "The obtained results show that the proposed features can capture information from the contents of the reviews and the writing style allowing to obtain classification results as good as with traditional character n-grams but with a lower dimensionality of representation." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-26", "text": "The rest of the paper is organized as follows." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-27", "text": "Section 2 describes briefly the proposed features." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-28", "text": "Section 3 shows the experimental study performed." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-29", "text": "The description of the corpus and the different experiments carried out can also be found in this section." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-30", "text": "Finally, the main conclusions and future work are in Section 4." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-31", "text": "----------------------------------" }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-32", "text": "**FEATURE SELECTION FOR DECEPTION CLUES**" }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-33", "text": "In this section we describe the three different kinds of features studied in this work and the tools used for their extraction." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-34", "text": "----------------------------------" }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-35", "text": "**CHARACTER N-GRAMS IN TOKENS**" }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-36", "text": "The main difference of character n-grams in tokens 1 with respect to the traditional NLP feature character n-grams is the consideration of the tokens for the extraction of the feature." 
}, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-37", "text": "That is, tokens with less than n characters are not considered in the process of extraction neither blank spaces." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-38", "text": "Character n-grams in tokens preserve the main characteristics of the standard character ngrams (Sili\u0107 et al., 2007) : effectiveness for quantifying the writing style used in a text (Keselj et al., 2003; Stamatatos, 2013) , the independence of language and domains (Wei et al., 2008) , the robustness to noise present in the text (Cavnar and Trenkle, 1994) , and, easiness of extraction in any text. But unlike the traditional character n-grams, the proposed feature obtains a smaller set of attributes, that is, character n-grams in tokens avoids the need of feature dimension reduction." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-39", "text": "Figure 1 illustrates that difference." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-40", "text": "in tokens feature is considerably less, although the effectiveness of this representation still being good, as we will see in Section 3." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-41", "text": "For the extraction of character n-grams in tokens we have used Natural Language Toolkit (NLTK) package (Bird et al., 2009 ) with Python language." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-42", "text": "----------------------------------" }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-43", "text": "**EMOTIONS-BASED FEATURE**" }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-44", "text": "Previous works have been demonstrated that the use of emotions helps to discriminate truthful from deceptive text (Hancock et al., 2008; Burgoon et al., 2003; Newman et al., 2003) ." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-45", "text": "There is some evidence that liars use more negative emotions that truth-tellers." 
}, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-46", "text": "Based on that, we obtained the percentages of positive, negative and neutral emotions contained in the sentences of a document." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-47", "text": "Then, we have used these values as features in order to represent the polarity of the text." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-48", "text": "For the calculation of the percentages of positive, negative and neutral emotions contained in the text we have used the Natural Language Sentiment Analysis API 2 which analyzes the sentiments, labeling a text with its polarity (positive, negative or neutral)." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-49", "text": "We have obtained the polarities of each sentence and then we have obtained the percentages of the polarities associated to the whole document (a review in our case)." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-50", "text": "Finally, we have used those values as features." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-51", "text": "----------------------------------" }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-52", "text": "**LIWC-BASED FEATURE: LINGUISTIC PROCESSES**" }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-53", "text": "Several features derived from Linguistic Inquiry and Word Count (LIWC) were considered." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-54", "text": "In particular we have studied those related to functional aspects of the text such as word count, adverbs, pronouns, etc." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-55", "text": "After performing an early experimental study considering the 26 different variables of the linguistic processes category in LIWC2007 software (Pennebaker et al., 2007) , we have concluded that pronouns, articles and verbs (present, past and future tense) would help to distinguish fake from true reviews." 
}, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-56", "text": "----------------------------------" }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-57", "text": "**EXPERIMENTAL STUDY**" }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-58", "text": "In order to evaluate our proposal, we have performed some experimental study on the first publicly available opinion spam dataset gathered and presented in (Ott et al., 2011; Ott et al., 2013) ." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-59", "text": "We first describe the corpus and then we show the different experiments made." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-60", "text": "Finally we compare our results with those published previously." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-61", "text": "----------------------------------" }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-62", "text": "**OPINION SPAM CORPUS**" }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-63", "text": "The Opinion Spam corpus presented in (Ott et al., 2011; Ott et al., 2013 ) is composed of 1600 positive and negative opinions for hotels with the corresponding gold-standard." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-64", "text": "From the 800 positive reviews (Ott et al., 2011) , the 400 truthful where mined from TripAdvisor 5-star reviews about the 20 most popular hotels in Chicago area." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-65", "text": "All reviews were written in English, have at least 150 characters and correspond to users who had posted opinions previously on TripAdvisor (non first-time authors)." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-66", "text": "The 400 deceptive opinions correspond to the same 20 hotels and were gathered using Amazon Mechanical Turk crowdsourcing service." 
}, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-67", "text": "From the 800 negative reviews (Ott et al., 2013) , the 400 truthful where mined from TripAdvisor, Expedia, Hotels.com, Orbitz, Priceline and Yelp." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-68", "text": "The reviews are 1 or 2-star category and are about the same 20 hotels in Chicago." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-69", "text": "The 400 deceptive reviews correspond to the same 20 hotels and were obtained using Amazon Mechanical Turk." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-70", "text": "----------------------------------" }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-71", "text": "**TRUTHFUL FROM DECEPTIVE OPINION CLASSIFICATION**" }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-72", "text": "We have obtained the representations of the opinion reviews considering the features described in Section 2." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-73", "text": "For all, we have used term frequencyinverse document frequency (tf-idf) weighting scheme." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-74", "text": "The only text preprocessing made was convert all words to lowercase characters." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-75", "text": "Na\u00efve Bayes and SVM algorithms in Weka (Hall et al., 2009) were used to perform the classification." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-76", "text": "We only show the results obtained with SVM because its performance was the best." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-77", "text": "For all experiments we have performed a 10 fold cross-validation procedure in order to study the effectiveness of the SVM classifier with the different representations." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-78", "text": "For simplicity, we have used LibSVM 3 which implements a C-SVC version of SVM with a radial basis function." 
}, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-79", "text": "We have run the classifier with the default parameters." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-80", "text": "The values reported in the tables correspond to the macro average F-measure as it is reported in Weka." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-81", "text": "Tables 1, 2 and 3 show the F-measure obtained with each feature proposed for the Opinion Spam corpus." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-82", "text": "In the second part of the table, the combination of each single feature was used as representation of the reviews." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-83", "text": "The best value is in boldface." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-84", "text": "As we can observe, the best result (F-measure = 0.89) was obtained with the combination of 4-grams in tokens and the articles, pronouns and verbs (LIWC)." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-85", "text": "With the combination of 3-grams and LIWC feature the F-measure is quite similar." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-86", "text": "Table 2 : Deceptive opinions detection with SVM for negative reviews of Opinion Spam corpus (800 opinions)." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-87", "text": "Table 2 shows the results obtained considering only the negative reviews (800 documents)." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-88", "text": "The best result (F-measure = 0.865) was obtained with the feature 4-grams in tokens plus LIWC variables." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-89", "text": "It is interesting to note that similar results (although sightly lower) were obtained also with three more features: the single 4-grams in tokens, the combination of the last one with positive and negative emotions percentages, and also with 3-grams combined with LIWC's tokens." 
}, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-90", "text": "Table 3 : Deceptive opinions detection with SVM for positive and negative reviews of Opinion Spam corpus (1600 opinions)." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-91", "text": "Table 3 shows the classification results considering the whole corpus, that is, the combined case of positive plus negative reviews (1600 documents)." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-92", "text": "The best F-measure (0.879) was obtained, as the same as the previous cases, with 4-grams in tokens plus LIWC feature." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-93", "text": "It is worth noting that with the combination of 4-grams in tokens with POSNEG feature seems to be effective when positive and negative polarities are considered together in deception detection, a fact that is not present when just one polarity is considered (see Tables 1 and 2 )." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-94", "text": "As we can observe from Tables 1, 2 and 3, the differences of F-measure values are quite small." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-95", "text": "In fact, for the almost similar values like, for example, 4-grams in tokens + LIWC compared with 3-grams + LIWC or 3 + 4-grams + POSNEG (see Table 1 ) the differences are not statistically significant." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-96", "text": "Consequently we have selected the one with highest F-measure value (4-grams in tokens + LIWC) for simplicity, but some of the other representations can be used instead." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-97", "text": "In order to analyze the set of attributes corresponding to the feature 4-grams in tokens combined with LIWC, we have calculated the Information Gain ranking." 
}, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-98", "text": "From this analysis we have observed that the set of attributes with highest information gain are similar for negative and both together polarities corpora." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-99", "text": "The study shows that 4-grams in tokens are in the top positions of the ranking and those reveal information related to places (chic, chig, igan for Chicago and Michigan cities), amenities (floo, elev, room for floor, elevator, room) and their characterization (luxu, smel, tiny for luxury, smells and tiny)." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-100", "text": "From the 7th position of the ranking we can observe the first LIWC attributes: pronouns (my, I, we) and after 15th position we can observe verbs (is, open, seemed) ." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-101", "text": "Interestingly, the articles can be observed from position 68th in the ranking (a, the)." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-102", "text": "Regarding the corpus considering only the positive reviews, the ranking is similar to the cases analyzed before with exception of the pronouns which appear at 1st position (my) and at 16th position (I, you)." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-103", "text": "This fact could indicate the presence of many opinions concerned with their own experience (good) making the personal pronouns one of the most discriminative attribute for positive polarity spam opinion detection." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-104", "text": "With respect to the characterization of the amenities, the adjectives observed in 4-grams in tokens have to do with positive opinions about those (elax, amaz, good for relax, amazing and good)." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-105", "text": "Figure 2 illustrates the first positions of the ranking of attributes obtained for positive reviews." 
}, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-106", "text": "----------------------------------" }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-107", "text": "**COMPARISON OF RESULTS**" }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-108", "text": "For a comparison of the performance of our proposal, we analyzed the obtained results with respect to the state-of-the-art." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-109", "text": "We have made a comparison considering the results of five different models." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-110", "text": "The first four of these were used in an indirect comparison, while just one method was used in a direct comparison of the performance." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-111", "text": "In (Banerjee and Chua, 2014 ) the authors presented the results of a logistic regression model using 13 different independent variables: complexity, reading difficulty, adjective, article, noun, preposition, adverb, verb, pronoun, personal pronoun, positive cues, perceptual words and future tense." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-112", "text": "In (Ren et al., 2014) a semi-supervised model called mixing population and individual property PU learning, is presented." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-113", "text": "The model is then incorporated to a SVM classifier." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-114", "text": "In (Ott et al., 2011 ) the authors used the 80 dimensions of LIWC2007, unigrams and bigrams as set of features with a SVM classifier." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-115", "text": "In (Feng and Hirst, 2013) , profile alignment compatibility features combined with unigrams, bigrams and syntactic production rules were proposed for representing the opinion spam corpus." 
}, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-116", "text": "Then, a multivariate performance measures version of SVM classifier (named SVM perf ) was trained." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-117", "text": "In (Hern\u00e1ndez Fusilier et al., 2015b ) the authors studied two different representations: character n-grams and word n-grams." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-118", "text": "In particular, the best results were obtained with a Na\u00efve Bayes classifier using character 4 and 5 grams as features." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-119", "text": "As we stated before, two kinds of comparisons are shown: an indirect (we could not obtain the complete set of results reported by the authors) and a direct (the authors kindly made available the results and a statistical comparison can be performed)." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-120", "text": "In Table 4 we can observe the indirect comparison of our results with those of (Banerjee and Chua, 2014) and (Ren et al., 2014) obtained with a 10 fold cross validation experiment, and then, with a 5 fold cross validation in order to make a fair comparison with the results of (Ott et al., 2011) and (Feng and Hirst, 2013) ." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-121", "text": "Note that the results are expressed in terms of the accuracy as those were published by the authors; the results correspond only to positive reviews of the Opinion Spam corpus because the authors experimented in that corpus alone." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-122", "text": "From the Table 4 we can observe that the combination of 13 independent variables seems to have the lowest prediction accuracy (accuracy = 70.50%)." 
}, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-123", "text": "About the last result, the authors in (Banerjee and Chua, 2014) concluded that only articles and pronouns (over the 13 variables) could significantly distinguish true from false reviews." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-124", "text": "The accuracy of the semi-supervised model is slightly lower (86.69%) than that of our approach (89%), although good enough." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-125", "text": "The authors concluded that the good performance of the semi-supervised model is due the topic information captured by the model combined with the examples and their similarity (Ren et al., 2014) ." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-126", "text": "Then, they could obtain an accurate SVM classifier." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-127", "text": "Regarding the experiments with the 5 fold cross-validation, we obtained similar results to those of (Ott et al., 2011) and slightly lower than the ones of (Feng and Hirst, 2013) ." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-128", "text": "From this last experiment we can observe that using the representation of (Feng and Hirst, 2013 ) with more than 20138 attributes it is possible to obtain comparable results with those of our approach where we use a smaller representation (1533 attributes)." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-129", "text": "----------------------------------" }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-130", "text": "**MODEL ACCURACY**" }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-131", "text": "10 fold cross-validation (Banerjee and Chua, 2014) 70.50% (Ren et al., 2014) 86.69% Our approach 89%" }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-132", "text": "5 fold cross-validation (Ott et al., 2011) 89.8% (Feng and Hirst, 2013) 91.3% Our approach 89.8% Table 4 : Indirect comparison of the performance." 
}, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-133", "text": "Deceptive opinions detection for positive reviews of Opinion Spam corpus (800 opinions)." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-134", "text": "In Table 5 we can observe the direct comparison of the performance for the positive and negative polarities reviews of the Opinion Spam corpus considering the proposal of (Hern\u00e1ndez Fusilier et al., 2015b) ." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-135", "text": "First column shows the representation proposed, the second one shows the amount of attributes (Attr.) of the representation, the third column shows the F-measure value (F) obtained after a 10 fold cross-validation process, and the last column shows the p-value obtained in the statistical significance test used to study the differences of performance between (Hern\u00e1ndez Fusilier et al., 2015b) It is interesting to note that the F-measure values obtained with both approaches are quite similar for positive and negative reviews, as we can observe in Table 5 ." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-136", "text": "Regarding the amount of attributes used for each representation of the reviews, it is worth noting that our approach uses 97% and 95% fewer attributes for positive and negative reviews compared with the model of (Hern\u00e1ndez Fusilier et al., 2015b) ." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-137", "text": "Even using a combination of two simple features as character 4-grams in tokens and LIWC variables as we have proposed, the amount of attributes is considerably lower than the traditional character n-grams without diminishing the quality of the classification." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-138", "text": "The reason of the lower dimensionality of our representation has to do with the manner in which the n-grams are obtained." 
}, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-139", "text": "The high descriptive power of character n-grams in tokens plus the information added with the LIWC variables seem to be adequate to obtain an accurate classifier (SVM in our case)." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-140", "text": "In order to determine if the differences of performance of (Hern\u00e1ndez Fusilier et al., 2015b) and our approach are statistically significant, we have calculated the Mann-Whitney U-test (Mann and Whitney, 1947) ." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-141", "text": "This nonparametric test compares two unpaired groups of values without making the assumption of the normality of the samples." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-142", "text": "However, the requirements of independence of the samples, the data is continuous and ordinal, there are no ties between the groups and the assumption that the distribution of both groups are similar in shape, are satisfied." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-143", "text": "The null hypothesis states that the samples come from the same population, that is, the classifiers performs equally well with the proposed models." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-144", "text": "We have calculated the Mann-Whitney U-test considering a 2-tailed hypothesis and significance level of 0.05." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-145", "text": "In Table 5 we can observe that the p-value obtained in the comparison of performance of positive reviews corpus is 0.094 > 0.05 which stands for the difference of results are not statistically significant (the p-value is not \u2264 0.05, then the null hypothesis is not rejected)." 
}, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-146", "text": "The same conclusion can be obtained with respect to the results corresponding to the negative reviews corpus, for which the test obtained a p-value of 0.748 > 0.05." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-147", "text": "From the last test we concluded that both approaches performs similarly well." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-148", "text": "A statistical analysis of variance over the Fmeasure values obtained in the evaluation of (Hern\u00e1ndez Fusilier et al., 2015b) and our approach complements our performance study." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-149", "text": "This analysis can be obtained from the boxplots 4 with the distribution of F-measure values of each proposal with both polarity reviews corpora." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-150", "text": "----------------------------------" }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-151", "text": "**CONCLUSIONS AND FUTURE WORK**" }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-152", "text": "In this work we have proposed some interesting features for deceptive opinions detection." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-153", "text": "We have studied how different features contribute to model deception clues." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-154", "text": "Character n-grams in tokens seems to capture correctly the content and the writing style of the reviews helping this, in some way, to differentiate truthful from deceptive opinions." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-155", "text": "Many works have demonstrated that emotions-based features can discriminate deceptive text, but in our experimental study this feature seems not to provide too much useful information for detecting deception in reviews." 
}, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-156", "text": "We also have used some variables extracted from LIWC as pronouns, articles and verbs." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-157", "text": "That information combined with character 4-grams in tokens was selected for modeling the representation of the reviews." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-158", "text": "For the experimental study we have used the positive and negative polarities reviews corresponding to the corpora proposed by (Ott et al., 2011; Ott et al., 2013) with 800 reviews each one (400 true and 400 false opinions)." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-159", "text": "We have used both corpora in a separate way but we have performed experiments joining both polarities reviews in a combined corpus of 1600 reviews." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-160", "text": "From the results obtained with the different features we have concluded that character 4-grams in tokens with LIWC variables performs the best using a SVM classifier." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-161", "text": "We made also a comparison with the approach of (Hern\u00e1ndez Fusilier et al., 2015b) and the results were similar (no statistically significant difference was found), but our low dimensionality representation makes our approach more efficient." }, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-162", "text": "For future work we plans to investigate another emotion/sentiment features in order to study the contributions in tasks of deception detection of opinion spam." 
}, { "sent_id": "fe539365c7bb4555280fd1a5478aba-C001-163", "text": "Also we are interesting to test our model with other corpora related to opinion spam as the one recently proposed in (Fornaciari and Poesio, 2014" } ], "y": { "@USE@": { "gold_contexts": [ [ "fe539365c7bb4555280fd1a5478aba-C001-22" ], [ "fe539365c7bb4555280fd1a5478aba-C001-58" ], [ "fe539365c7bb4555280fd1a5478aba-C001-120" ], [ "fe539365c7bb4555280fd1a5478aba-C001-121" ], [ "fe539365c7bb4555280fd1a5478aba-C001-158" ] ], "cite_sentences": [ "fe539365c7bb4555280fd1a5478aba-C001-22", "fe539365c7bb4555280fd1a5478aba-C001-58", "fe539365c7bb4555280fd1a5478aba-C001-120", "fe539365c7bb4555280fd1a5478aba-C001-121", "fe539365c7bb4555280fd1a5478aba-C001-158" ] }, "@BACK@": { "gold_contexts": [ [ "fe539365c7bb4555280fd1a5478aba-C001-63" ], [ "fe539365c7bb4555280fd1a5478aba-C001-64" ], [ "fe539365c7bb4555280fd1a5478aba-C001-114" ] ], "cite_sentences": [ "fe539365c7bb4555280fd1a5478aba-C001-63", "fe539365c7bb4555280fd1a5478aba-C001-64", "fe539365c7bb4555280fd1a5478aba-C001-114" ] }, "@SIM@": { "gold_contexts": [ [ "fe539365c7bb4555280fd1a5478aba-C001-127" ], [ "fe539365c7bb4555280fd1a5478aba-C001-132" ] ], "cite_sentences": [ "fe539365c7bb4555280fd1a5478aba-C001-127", "fe539365c7bb4555280fd1a5478aba-C001-132" ] } } }, "ABC_9158f716efae0d1f38510dd0847c45_3": { "x": [ { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-22", "text": "**UNBOUNDED DEPENDENCY EVALUATION**" }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-21", "text": "----------------------------------" }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-2", "text": "We evaluate two dependency parsers, MSTParser and MaltParser, with respect to their capacity to recover unbounded dependencies in English, a type of evaluation that has been applied to grammarbased parsers and statistical phrase structure parsers but not to dependency parsers." 
}, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-3", "text": "The evaluation shows that when combined with simple post-processing heuristics, the parsers correctly recall unbounded dependencies roughly 50% of the time, which is only slightly worse than two grammar-based parsers specifically designed to cope with such dependencies." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-4", "text": "----------------------------------" }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-5", "text": "**INTRODUCTION**" }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-6", "text": "Though syntactic parsers for English are reported to have accuracies over 90% on the Wall Street Journal (WSJ) section of the Penn Treebank (PTB) (McDonald et al., 2005; Sagae and Lavie, 2006; Huang, 2008; Carreras et al., 2008) , broad-coverage parsing is still far from being a solved problem." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-7", "text": "In particular, metrics like attachment score for dependency parsers (Buchholz and Marsi, 2006) and Parseval for constituency parsers (Black et al., 1991) suffer from being an average over a highly skewed distribution of different grammatical constructions." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-8", "text": "As a result, infrequent yet semantically important construction types could be parsed with accuracies far below what one might expect." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-9", "text": "This shortcoming of aggregate parsing metrics was highlighted in a recent study by Rimell et al. (2009) , introducing a new parser evaluation corpus containing around 700 sentences annotated with unbounded dependencies in seven different grammatical constructions." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-10", "text": "This corpus was used to evaluate five state-of-the-art parsers for English, focusing on grammar-based and statistical phrase structure parsers." 
}, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-11", "text": "For example, in the sentence By Monday, they hope to have a sheaf of documents both sides can trust., parsers should recognize that there is a dependency between trust and documents, an instance of object extraction out of a (reduced) relative clause." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-12", "text": "In the evaluation, the recall of state-of-the-art parsers on this kind of dependency varies from a high of 65% to a low of 1%." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-13", "text": "When averaging over the seven constructions in the corpus, none of the parsers had an accuracy higher than 61%." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-14", "text": "In this paper, we extend the evaluation of Rimell et al. (2009) to two dependency parsers, MSTParser (McDonald, 2006) and MaltParser (Nivre et al., 2006a) , trained on data from the PTB, converted to Stanford typed dependencies (de Marneffe et al., 2006) , and combined with a simple post-processor to extract unbounded dependencies from the basic dependency tree." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-15", "text": "Extending the evaluation to dependency parsers is of interest because it sheds light on whether highly tuned grammars or computationally expensive parsing formalisms are necessary for extracting complex linguistic phenomena in practice." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-16", "text": "Unlike the best performing grammar-based parsers studied in Rimell et al. (2009) , neither MSTParser nor MaltParser was developed specifically as a parser for English, and neither has any special mechanism for dealing with unbounded dependencies." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-17", "text": "Dependency parsers are also often asymptotically faster than grammar-based or constituent parsers, e.g., MaltParser parses sentences in linear time." 
}, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-18", "text": "Our evaluation ultimately shows that the recall of MSTParser and MaltParser on unbounded dependencies is much lower than the average (un)labeled attachment score for each system." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-19", "text": "Nevertheless, the two dependency parsers are found to perform only slightly worse than the best grammar-based parsers evaluated in Rimell et al. (2009) and considerably better than the other statistical parsers in that evaluation." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-20", "text": "Interestingly, though the two systems have similar accuracies overall, there is a clear distinction between the kinds of errors each system makes, which we argue is consistent with observations by McDonald and Nivre (2007) ." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-23", "text": "An unbounded dependency involves a word or phrase interpreted at a distance from its surface position, where an unlimited number of clause boundaries may in principle intervene." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-24", "text": "The unbounded dependency corpus of Rimell et al. (2009) includes seven grammatical constructions: object extraction from a relative clause (ObRC), object extraction from a reduced relative clause (ObRed), subject extraction from a relative clause (SbRC), free relatives (Free), object questions (ObQ), right node raising (RNR), and subject extraction from an embedded clause (SbEm), all chosen for being relatively frequent and easy to identify in PTB trees." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-25", "text": "Examples of the constructions can be seen in Figure 1 ." 
}, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-26", "text": "The evaluation set contains 80 sentences per construction (which may translate into more than 80 dependencies, since sentences containing coordinations may have more than one gold-standard dependency), while the development set contains between 13 and 37 sentences per construction." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-27", "text": "The data for ObQ sentences was obtained from various years of TREC, and for the rest of the constructions from the WSJ (0-1 and 22-24) and Brown sections of the PTB." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-28", "text": "Each sentence is annotated with one or more gold-standard dependency relations representing the relevant unbounded dependency." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-29", "text": "The goldstandard dependencies are shown as arcs below the sentences in Figure 1 ." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-30", "text": "The format of the dependencies in the corpus is loosely based on the Stanford typed dependency scheme, although the evaluation procedure permits alternative representations and does not require that the parser output match the gold-standard exactly, as long as the \"spirit\" of the construction is correct." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-31", "text": "The ability to recover unbounded dependencies is important because they frequently form part of the basic predicate-argument structure of a sentence." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-32", "text": "Subject and object dependencies in particular are crucial for a number of tasks, including information extraction and question answering." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-33", "text": "Moreover, Rimell et al. 
(2009) show that, although individual types of unbounded dependencies may be rare, the unbounded dependency types in the corpus, considered as a class, occur in as many as 10% of sentences in the PTB." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-34", "text": "In Rimell et al. (2009), five state-of-the-art parsers were evaluated for their recall on the gold-standard dependencies." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-35", "text": "Three of the parsers were based on grammars automatically extracted from the PTB: the C&C CCG parser (Clark and Curran, 2007), the Enju HPSG parser (Miyao and Tsujii, 2005), and the Stanford parser (Klein and Manning, 2003)." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-36", "text": "The two remaining systems were the RASP parser (Briscoe et al., 2006), using a manually constructed grammar and a statistical parse selection component, and the DCU post-processor of PTB parsers (Cahill et al., 2004) using the output of the Charniak and Johnson reranking parser (Charniak and Johnson, 2005)." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-37", "text": "Because of the wide variation in parser output representations, a mostly manual evaluation was performed to ensure that each parser got credit for the constructions it recovered correctly." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-38", "text": "The parsers were run essentially \"out of the box\", meaning that the development set was used to confirm input and output formats, but no real tuning was performed." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-39", "text": "In addition, since a separate question model is available for C&C, this was also evaluated on ObQ sentences." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-40", "text": "The best overall performers were C&C and Enju, which is unsurprising since they are deep parsers based on grammar formalisms designed to recover just such dependencies." 
}, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-41", "text": "The DCU post-processor performed somewhat worse than expected, often identifying the existence of an unbounded dependency but failing to identify the grammatical class (subject, object, etc.)." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-42", "text": "RASP and Stanford, although not designed to recover such dependencies, nevertheless recovered a subset of them." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-43", "text": "Performance of the parsers also varied widely across the different constructions." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-44", "text": "----------------------------------" }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-45", "text": "**DEPENDENCY PARSERS**" }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-46", "text": "In this paper we repeat the study of Rimell et al. (2009) for two dependency parsers, with the goal of evaluating how parsers based on dependency grammars perform on unbounded dependencies." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-47", "text": "MSTParser 1 is a freely available implementation of the parsing models described in McDonald (2006) ." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-48", "text": "According to the categorization of parsers in K\u00fcbler et al. (2008) it is a graph-based parsing system in that core parsing algorithms can be equated to finding directed maximum spanning trees (either projective or non-projective) from a dense graph representation of the sentence." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-49", "text": "Graph-based parsers typically rely on global training and inference algorithms, where the goal is to learn models in which the weight/probability of correct trees is higher than that of incorrect trees." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-50", "text": "At inference time a global search is run to find the highest weighted dependency tree." 
}, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-51", "text": "Unfortunately, global inference and learning for graph-based dependency parsing is typically NP-hard (McDonald and Satta, 2007) ." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-52", "text": "As a result, graph-based parsers (including MSTParser) often limit the scope of their features to a small number of adjacent arcs (usually two) and/or resort to approximate inference ." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-53", "text": "MaltParser 2 is a freely available implementation of the parsing models described in Nivre et al. (2006a) and Nivre et al. (2006b) ." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-54", "text": "MaltParser is categorized as a transition-based parsing system, characterized by parsing algorithms that produce dependency trees by transitioning through abstract state machines (K\u00fcbler et al., 2008) ." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-55", "text": "Transitionbased parsers learn models that predict the next state given the current state of the system as well as features over the history of parsing decisions and the input sentence." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-56", "text": "At inference time, the parser starts in an initial state, then greedily moves to subsequent states -based on the predictions of the model -until a termination state is reached." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-57", "text": "Transition-based parsing is highly efficient, with run-times often linear in sentence length." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-58", "text": "Furthermore, transition-based parsers can easily incorporate arbitrary non-local features, since the current parse structure is fixed by the state." 
}, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-59", "text": "However, the greedy nature of these systems can lead to error propagation if early predictions place the parser in incorrect states." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-60", "text": "McDonald and Nivre (2007) compared the accuracy of MSTParser and MaltParser along a number of structural and linguistic dimensions." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-61", "text": "They observed that, though the two parsers exhibit indistinguishable accuracies overall, MSTParser tends to outperform MaltParser on longer dependencies as well as those dependencies closer to the root of the tree (e.g., verb, conjunction and preposition dependencies), whereas MaltParser performs better on short dependencies and those further from the root (e.g., pronouns and noun dependencies)." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-62", "text": "Since long dependencies and those near to the root are typically the last constructed in transition-based parsing systems, it was concluded that MaltParser does suffer from some form of error propagation." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-63", "text": "On the other hand, the richer feature representations of MaltParser led to improved performance in cases where error propagation has not occurred." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-64", "text": "However, that study did not investigate unbounded dependencies." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-65", "text": "----------------------------------" }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-66", "text": "**METHODOLOGY**" }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-67", "text": "In this section, we describe the methodological setup for the evaluation, including parser training, post-processing, and evaluation." 
}, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-68", "text": "3" }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-69", "text": "----------------------------------" }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-70", "text": "**PARSER TRAINING**" }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-71", "text": "One important difference between MSTParser and MaltParser, on the one hand, and the best performing parsers evaluated in Rimell et al. (2009) , on the other, is that the former were never developed specifically as parsers for English." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-72", "text": "Instead, they are best understood as data-driven parser generators, that is, tools for generating a parser given a training set of sentences annotated with dependency structures." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-73", "text": "Over the years, both systems have been applied to a wide range of languages (see, e. (2007)), but they come with no language-specific enhancements and are not equipped specifically to deal with unbounded dependencies." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-74", "text": "Since the dependency representation used in the evaluation corpus is based on the Stanford typed dependency scheme (de Marneffe et al., 2006) , we opted for using the WSJ section of the PTB, converted to Stanford dependencies, as our primary source of training data." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-75", "text": "Thus, both parsers were trained on section 2-21 of the WSJ data, which we converted to Stanford dependencies using the Stanford parser (Klein and Manning, 2003) ." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-76", "text": "The Stanford scheme comes in several varieties, but because both parsers require the dependency structure for each sentence to be a tree, we had to use the so-called basic variety (de Marneffe et al., 2006) ." 
}, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-77", "text": "It is well known that questions are very rare in the WSJ data, and Rimell et al. (2009) found that parsers trained only on WSJ data generally performed badly on the questions included in the evaluation corpus, while the C&C parser equipped with a model trained on a combination of WSJ and question data had much better performance." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-78", "text": "To investigate whether the performance of MSTParser and MaltParser on questions could also be improved by adding more questions to the training data, we trained one variant of each parser using data that was extended with 3924 questions taken from QuestionBank (QB) (Judge et al., 2006) ." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-79", "text": "4 Since the QB sentences are annotated in PTB style, it was possible to use the same conversion procedure as for the WSJ data." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-80", "text": "However, it is clear that the conversion did not always produce adequate dependency structures for the questions, an observation that we will return to in the error analysis below." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-81", "text": "In comparison to the five parsers evaluated in Rimell et al. (2009) , it is worth noting that MSTParser and MaltParser were trained on the same basic data as four of the five, but with a different kind of syntactic representation -dependency trees instead of phrase structure trees or theoryspecific representations from CCG and HPSG." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-82", "text": "It is especially interesting to compare MSTParser and MaltParser to the Stanford parser, which essentially produces the same kind of dependency structures as output but uses the original phrase structure trees from the PTB as input to training." 
}, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-83", "text": "For our experiments we used MSTParser with the same parsing algorithms and features as reported in ." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-84", "text": "However, unlike that work we used an atomic maximum entropy model as the second stage arc predictor as opposed to the more time consuming sequence labeler. showed that there is negligible accuracy loss when using atomic rather than structured labeling." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-85", "text": "For MaltParser we used the projective Stack algorithm (Nivre, 2009 ) with default settings and a slightly enriched feature model." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-86", "text": "All parsing was projective because the Stanford dependency trees are strictly projective." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-87", "text": "----------------------------------" }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-88", "text": "**POST-PROCESSING**" }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-89", "text": "All the development and test sets in the corpus of Rimell et al. (2009) were parsed using MSTParser and MaltParser after part-of-speech tagging the input using SVMTool (Gim\u00e9nez and M\u00e0rquez, 2004 ) trained on section 2-21 of the WSJ data in Stanford basic dependency format." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-90", "text": "The Stanford parser has an internal module that converts the basic dependency representation to the collapsed representation, which explicitly represents additional dependencies, including unbounded dependencies, that can be inferred from the basic representation (de Marneffe et al., 2006) ." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-91", "text": "We performed a similar conversion using our own tool." 
}, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-92", "text": "Broadly speaking, there are three ways in which unbounded dependencies can be inferred from the Stanford basic dependency trees, which we will refer to as simple, complex, and indirect." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-93", "text": "In the simple case, the dependency coincides with a single, direct dependency relation in the tree." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-94", "text": "This is the case, for example, in Figure 1d -e, where all that is required is that the parser identifies the dependency relation from a governor to an argument (dobj(see, What), dobj(have, effect)), which we call the Arg relation; no post-processing is needed." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-95", "text": "In the complex case, the dependency is represented by a path of direct dependencies in the tree, as exemplified in Figure 1a ." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-96", "text": "In this case, it is not enough that the parser correctly identifies the Arg relation dobj(carries, that); it must also find the dependency rcmod(fragment, carries)." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-97", "text": "We call this the Link relation, because it links the argument role inside the relative clause to an element outside the clause." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-98", "text": "Other examples of the complex case are found in Figure 1c and in Figure 1f ." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-99", "text": "In the indirect case, finally, the dependency cannot be defined by a path of labeled dependencies, whether simple or complex, but must be inferred from a larger context of the tree using heuristics." 
}, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-100", "text": "Consider Figure 1b , where there is a Link relation (rcmod(things, do)), but no corresponding Arg relation inside the relative clause (because there is no overt relative pronoun)." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-101", "text": "However, given the other dependencies, we can infer with high probability that the implicit relation is dobj." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-102", "text": "Another example of the indirect case is in Figure 1g ." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-103", "text": "Our post-processing tool performs more heuristic inference for the indirect case than the Stanford parser does (cf. Section 4.3)." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-104", "text": "In order to handle the complex and indirect cases, our post-processor is triggered by the occurrence of a Link relation (rcmod or conj) and first tries to add dependencies that are directly implied by a single Arg relation (relations involving relative pronouns for rcmod, shared heads and dependents for conj)." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-105", "text": "If there is no overt relative pronoun, or the function of the relative pronoun is underspecified, the post-processor relies on the obliqueness hierarchy subj < dobj < pobj and simply picks the first \"missing function\", unless it finds a clausal complement (indicated by the labels ccomp and xcomp), in which case it descends to the lower clause and restarts the search there." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-106", "text": "----------------------------------" }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-107", "text": "**PARSER EVALUATION**" }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-108", "text": "The evaluation was performed using the same criteria as in Rimell et al. (2009) ." 
}, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-109", "text": "A dependency was considered correctly recovered if the goldstandard head and dependent were correct and the label was an \"acceptable match\" to the goldstandard label, indicating the grammatical function of the extracted element at least to the level of subject, passive subject, object, or adjunct." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-110", "text": "The evaluation in Rimell et al. (2009) took into account a wide variety of parser output formats, some of which differed significantly from the gold-standard." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-111", "text": "Since MSTParser and MaltParser produced Stanford dependencies for this experiment, evaluation required less manual examination than for some of the other parsers, as was also the case for the output of the Stanford parser in the original evaluation." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-112", "text": "However, a manual evaluation was still performed in order to resolve questionable cases." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-113", "text": "----------------------------------" }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-114", "text": "**RESULTS**" }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-115", "text": "The results are shown in Table 2 : Parser accuracy on the unbounded dependency corpus." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-116", "text": "The ObQ score for C&C, MSTParser, and MaltParser is for a model trained with additional questions (without this C&C scored 27.5; MSTParser and MaltParser as in Table 1 )." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-117", "text": "a weighted macroaverage, where the constructions are weighted proportionally to their relative frequency in the PTB." 
}, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-118", "text": "WAvg excludes ObQ sentences, since frequency statistics were not available for this construction in Rimell et al. (2009) ." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-119", "text": "Our first observation is that the accuracies for both systems are considerably below the \u223c90% unlabeled and \u223c88% labeled attachment scores for English that have been reported previously Hall et al., 2006) ." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-120", "text": "Comparing the two parsers, we see that MaltParser is more accurate on dependencies in relative clause constructions (ObRC, ObRed, SbRC, and Free), where argument relations tend to be relatively local, while MSTParser is more accurate on dependencies in RNR and SbEm, which involve more distant relations." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-121", "text": "Without the additional QB training data, the average scores for the two parsers are indistinguishable, but MSTParser appears to have been better able to take advantage of the question training, since MST-Q performs better than Malt-Q on ObQ sentences." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-122", "text": "On the weighted average MaltParser scores 3.5 points higher, because the constructions on which it outperforms MSTParser are more frequent in the PTB, and because WAvg excludes ObQ, where MSTParser is more accurate." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-123", "text": "Table 2 shows the results for MSTParser and MaltParser in the context of the other parsers evaluated in Rimell et al. (2009) ." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-124", "text": "5 For the parsers 5 The average scores reported differ slightly from those in which have a model trained on questions, namely C&C, MSTParser, and MaltParser, the figure shown for ObQ sentences is that of the question model." 
}, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-125", "text": "It can be seen that MSTParser and MaltParser perform below C&C and Enju, but above the other parsers, and that MSTParser achieves the highest score on SbEm sentences and MaltParser on SbRC sentences." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-126", "text": "It should be noted, however, that Table 2 does not represent a direct comparison across all parsers, since most of the other parsers would have benefited from heuristic postprocessing of the kind implemented here for MSTParser and MaltParser." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-127", "text": "This is especially true for RASP, where the grammar explicitly leaves some types of attachment decisions for post-processing." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-128", "text": "For DCU, improved labeling heuristics would significantly improve performance." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-129", "text": "It is instructive to compare the dependency parsers to the Stanford parser, which uses the same output representation and has been used to prepare the training data for our experiments." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-130", "text": "Stanford has very low recall on ObRed and SbEm, the categories where heuristic inference plays the largest role, but mirrors MSTParser for most other categories." 
}, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-131", "text": "----------------------------------" }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-132", "text": "**ERROR ANALYSIS**" }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-133", "text": "We now proceed to a more detailed error analysis, based on the development sets, and classify the errors made by the parsers into three categories: A global error is one where the parser completely fails to build the relevant clausal structure -the relative clause in ObRC, ObRed, SbRC, Free, SbEmb; the interrogative clause in ObQ; and the clause headed by the higher conjunct in RNR -often as a result of surrounding parsing errors." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-134", "text": "When a global error occurs, it is usually meaningless to further classify the error, which means that this category excludes the other two." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-135", "text": "An Arg error is one where the parser has constructed the relevant clausal structure but fails to find the Arg relation -in the simple and complex cases -or the set of surrounding Arg relations needed to infer an implicit Arg relation -in the indirect case (cf. Section 4.2)." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-136", "text": "A Link error is one where the parser fails to find the crucial Link relation -rcmod in ObRC, ObRed, SbRC, SbEmb; conj in RNR (cf. Section 4.2)." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-137", "text": "Link errors are not relevant for Free and ObQ, where all the crucial relations are clause-internal." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-138", "text": "Table 3 shows the frequency of different error types for MSTParser (first) and MaltParser (second) in the seven development sets." 
}, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-139", "text": "First of all, we can see that the overall error distribution is very similar for the two parsers, which is probably due to the fact that they have been trained on exactly the same data with exactly the same annotation (unlike the five parsers previously evaluated)." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-140", "text": "However, there is a tendency for MSTParser to make fewer Link errors, especially in the relative clause categories ObRC, ObRed and SbRC, which is compatible with the observation from the test results that MSTParser does better on more global dependencies, while MaltParser has an advantage on more local dependencies, although this is not evident from the statistics from the relatively small development set." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-141", "text": "Comparing the different grammatical constructions, we see that Link errors dominate for the relative clause categories ObRC, ObRed and SbRC, where the parsers make very few errors with respect to the internal structure of the relative clauses (in fact, no errors at all for MaltParser on SbRC)." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-142", "text": "This is different for SbEm, where the analysis of the argument structure is more complex, both because there are (at least) two clauses involved and because the unbounded dependency can only be inferred indirectly from the basic dependency representation (cf. Section 4.2)." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-143", "text": "Another category where Arg errors are frequent is RNR, where all such errors consist in attaching the relevant dependent to the second conjunct instead of to the first. 6 Thus, in the example in Figure 1f , both parsers found the conj relation between puzzled and angered but attached by to the second verb." 
}, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-144", "text": "Global errors are most frequent for RNR, probably indicating that coordinate structures are difficult to parse in general, and for ObQ (especially for MaltParser), probably indicating that questions are not well represented in the training set even after the addition of QB data." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-145", "text": "7 As noted in Section 4.1, this may be partly due to the fact that conversion to Stanford dependencies did not seem to work as well for QB as for the WSJ data." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-146", "text": "Another problem is that the part-of-speech tagger used was trained on WSJ data only and did not perform as well on the ObQ data." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-147", "text": "Uses of What as a determiner were consistently mistagged as pronouns, which led to errors in parsing." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-148", "text": "Thus, for the example in Figure 1e , both parsers produced the correct analysis except that, because of the tagging error, they treated What rather than effect as the head of the wh-phrase, which counts as an error in the evaluation." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-149", "text": "In order to get a closer look specifically at the Arg errors, Table 4 gives the confusion matrix for such errors, showing which grammatical functions are mistaken for each other, with an extra category Other for cases where the function is left unspecified by the parser or the error is an attachment error rather than a labeling error (and excluding the RNR category because of the special nature of the Arg errors in this category)." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-150", "text": "The results again confirm that the two parsers make very few errors on subjects and objects clauseinternally." 
}, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-151", "text": "The few cases where an object is mistaken as a subject occur in ObQ, where both parsers perform rather poorly in general." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-152", "text": "By contrast, there are many more errors on prepositional objects and on embedded subjects and objects." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-153", "text": "We believe an important part of the explanation for this pattern is to be found in the Stanford dependency representation, where subjects and objects are marked as such but all other functions realized by wh elements are left unspecified (using the generic rel dependency), which means that the recovery of these functions currently has to rely on heuristic rules as described in Section 4.2." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-154", "text": "Finally, we think it is possible to observe the tendency for MaltParser to be more accurate at local labeling decisions -reflected in fewer cross-label confusions -and for MSTParser to perform better on more distant attachment decisions -reflected in fewer errors in the Other category (and in fewer Link errors)." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-155", "text": "----------------------------------" }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-156", "text": "**CONCLUSION**" }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-157", "text": "In conclusion, the capacity of MSTParser and MaltParser to recover unbounded dependencies is very similar on the macro and weighted macro level, but there is a clear distinction in their strengths -constructions involving more distant dependencies such as ObQ, RNR and SbEm for MSTParser and constructions with more locally defined configurations such as ObRC, ObRed, SbRC and Free for MaltParser." 
}, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-158", "text": "This is a pattern that has been observed in previous evaluations of the parsers and can be explained by the global learning and inference strategy of MSTParser and the richer feature space of MaltParser (McDonald and Nivre, 2007) ." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-159", "text": "Perhaps more interestingly, the accuracies of MSTParser and MaltParser are only slightly below the best performing systems in Rimell et al. (2009) -C&C and Enju ." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-160", "text": "This is true even though MSTParser and MaltParser have not been engineered specifically for English and lack special mechanisms for handling unbounded dependencies, beyond the simple post-processing heuristic used to extract them from the output trees." }, { "sent_id": "9158f716efae0d1f38510dd0847c45-C001-161", "text": "Thus, it is reasonable to speculate that the addition of such mechanisms could lead to computationally lightweight parsers with the ability to extract unbounded dependencies with high accuracy." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "9158f716efae0d1f38510dd0847c45-C001-9" ], [ "9158f716efae0d1f38510dd0847c45-C001-24" ], [ "9158f716efae0d1f38510dd0847c45-C001-33", "9158f716efae0d1f38510dd0847c45-C001-34" ], [ "9158f716efae0d1f38510dd0847c45-C001-77" ], [ "9158f716efae0d1f38510dd0847c45-C001-110" ], [ "9158f716efae0d1f38510dd0847c45-C001-118" ] ], "cite_sentences": [ "9158f716efae0d1f38510dd0847c45-C001-9", "9158f716efae0d1f38510dd0847c45-C001-24", "9158f716efae0d1f38510dd0847c45-C001-33", "9158f716efae0d1f38510dd0847c45-C001-34", "9158f716efae0d1f38510dd0847c45-C001-77", "9158f716efae0d1f38510dd0847c45-C001-110", "9158f716efae0d1f38510dd0847c45-C001-118" ] }, "@EXT@": { "gold_contexts": [ [ "9158f716efae0d1f38510dd0847c45-C001-14" ] ], "cite_sentences": [ "9158f716efae0d1f38510dd0847c45-C001-14" ] }, "@DIF@": { "gold_contexts": [ [ "9158f716efae0d1f38510dd0847c45-C001-16" ], [ "9158f716efae0d1f38510dd0847c45-C001-19" ], [ "9158f716efae0d1f38510dd0847c45-C001-71" ], [ "9158f716efae0d1f38510dd0847c45-C001-81" ], [ "9158f716efae0d1f38510dd0847c45-C001-159" ] ], "cite_sentences": [ "9158f716efae0d1f38510dd0847c45-C001-16", "9158f716efae0d1f38510dd0847c45-C001-19", "9158f716efae0d1f38510dd0847c45-C001-71", "9158f716efae0d1f38510dd0847c45-C001-81", "9158f716efae0d1f38510dd0847c45-C001-159" ] }, "@USE@": { "gold_contexts": [ [ "9158f716efae0d1f38510dd0847c45-C001-46" ], [ "9158f716efae0d1f38510dd0847c45-C001-81" ], [ "9158f716efae0d1f38510dd0847c45-C001-89" ], [ "9158f716efae0d1f38510dd0847c45-C001-108" ], [ "9158f716efae0d1f38510dd0847c45-C001-123" ] ], "cite_sentences": [ "9158f716efae0d1f38510dd0847c45-C001-46", "9158f716efae0d1f38510dd0847c45-C001-81", "9158f716efae0d1f38510dd0847c45-C001-89", "9158f716efae0d1f38510dd0847c45-C001-108", "9158f716efae0d1f38510dd0847c45-C001-123" ] } } }, "ABC_9f1d2be80dbfd726a24fb2a05e130b_3": { "x": [ { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-1", "text": "**ABSTRACT**" }, { "sent_id": 
"9f1d2be80dbfd726a24fb2a05e130b-C001-2", "text": "Machine transliteration is the process of automatically transforming the script of a word from a source language to a target language, while preserving pronunciation." }, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-3", "text": "Sequence to sequence learning has recently emerged as a new paradigm in supervised learning." }, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-4", "text": "In this paper a character-based encoder-decoder model has been proposed that consists of two Recurrent Neural Networks." }, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-5", "text": "The encoder is a Bidirectional recurrent neural network that encodes a sequence of symbols into a fixed-length vector representation, and the decoder generates the target sequence using an attention-based recurrent neural network." }, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-6", "text": "The encoder, the decoder and the attention mechanism are jointly trained to maximize the conditional probability of a target sequence given a source sequence." }, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-7", "text": "Our experiments on different datasets show that the proposed encoderdecoder model is able to achieve significantly higher transliteration quality over traditional statistical models." }, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-8", "text": "----------------------------------" }, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-10", "text": "Machine Transliteration is defined as phonetic transformation of names across languages Karimi et al., 2011) ." }, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-11", "text": "Transliteration of named entities is the essential part of many multilingual applications, such as machine translation (Koehn, 2010) and cross-language information retrieval (Jadidinejad and Mahmoudi, 2010) ." 
}, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-12", "text": "Recent studies pay a great attention to the task of Neural Machine Translation (Cho et al., 2014a; Sutskever et al., 2014) ." }, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-13", "text": "In neural machine translation, a single neural network is responsible for reading a source sentence and generates its translation." }, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-14", "text": "From a probabilistic perspective, translation is equivalent to finding a target sentence y that maximizes the conditional probability of y given a source sentence x, i.e., arg max y p(y | x)." }, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-15", "text": "The whole neural network is jointly trained to maximize the conditional probability of a correct translation given a source sentence, using the bilingual corpus." }, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-16", "text": "Transforming a name from spelling to phonetic and then use the constructed phonetic to generate the spelling on the target language is a very complex task (Oh et al., 2006; Finch et al., 2015) ." }, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-17", "text": "Based on successful studies on Neural Machine Translation (Cho et al., 2014a; Sutskever et al., 2014; Hirschberg and Manning, 2015) , in this paper, we proposed a character-based encoderdecoder model which learn to transliterate endto-end." }, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-18", "text": "In the opposite side of classical models which contains different components, the proposed model is trained end-to-end, so it able to apply to any language pairs without tuning for a spacific one." 
}, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-19", "text": "----------------------------------" }, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-20", "text": "**PROPOSED MODEL**" }, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-21", "text": "Here, we describe briefly the underlying framework, called RNN Encoder-Decoder, proposed by (Cho et al., 2014b) and (Sutskever et al., 2014) upon which we build a machine transliteration model that learns to transliterate end-to-end." }, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-22", "text": "The enoder is a character-based recurrent neural network that learns a highly nonlinear mapping from a spelling to the phonetic of the input sequence." }, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-23", "text": "This network reads the source name x = (x 1 , . . . , x T ) and encodes it into a sequence of hidden states h = (h 1 , \u00b7 \u00b7 \u00b7 , h T ):" }, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-24", "text": "Each hidden state h i is a bidirectional recurrent representation with forward and backward sequence information around the ith character." }, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-25", "text": "The representation of a forward sequence and a backward sequence of the input character sequence is estimated and concatenated to form a context set C = {h 1 , h 2 , ..., h T } (Dong et al., 2015; Chung et al., 2016) ." }, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-26", "text": "Then, the decoder, another recurrent neural network, computes the conditional distribution over all possible transliteration based on this context set and generates the corresponding transliteration y = (y 1 , \u00b7 \u00b7 \u00b7 , y T ) based on the encoded sequence of hidden states h. 
The whole model is jointly trained to maximize the conditional log-probability of the correct transliteration given a source sequence with respect to the parameters \u03b8 of the model:" }, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-27", "text": "where (x_n, y_n) is the n-th training pair of character sequences, and T_n is the length of the n-th target sequence y_n." }, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-28", "text": "For each conditional term in Equation 2, the decoder updates its hidden state by:" }, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-29", "text": "where c_t is a context vector computed by a soft attention mechanism:" }, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-30", "text": "The soft attention mechanism f_a weights each vector in the context set C according to its relevance given what has been transliterated so far." }, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-31", "text": "Finally, the hidden state h_t, together with the previous target symbol y_{t-1} and the context vector c_t, is fed into a feedforward neural network to produce the conditional distribution described in Equation 2." }, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-32", "text": "The whole model, consisting of the encoder, decoder and soft attention mechanism, is trained end-to-end to minimize the negative log-likelihood using stochastic gradient descent." }, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-33", "text": "----------------------------------" }, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-34", "text": "**EXPERIMENTS**" }, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-35", "text": "We conducted a set of experiments to show the effectiveness of the RNN Encoder-Decoder model (Cho et al., 2014b; Sutskever et al., 2014) in the task of machine transliteration, using standard benchmark datasets provided by the NEWS 2015-16 shared task."
}, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-36", "text": "Table 1 shows different datasets in our experiments." }, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-37", "text": "Each dataset covers different levels of difficulty and training set size." }, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-38", "text": "The proposed model has been applied on ." }, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-39", "text": "each dataset without tuning the algorithm for each specific language pairs." }, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-40", "text": "Also, we don't apply any preprocessing on the source or target language in order to evaluate the effectiveness of the proposed model in a fair situation." }, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-41", "text": "'TaskID' is a unique identifier in the following experiments." }, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-42", "text": "We leveraged a character-based encoderdecoder model (Bojanowski et al., 2015; Chung et al., 2016) with soft attention mechanism (Cho et al., 2014b) ." }, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-43", "text": "In this model, input sequences in both source and target languages have been represented as characters." }, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-44", "text": "Using characters instead of words leads to longer sequences, so Gated Recurrent Units (Cho et al., 2014a) have been used for the encoder network to model long term dependencies." }, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-45", "text": "The encoder has 128 hidden units for each direction (forward and backward), and the decoder has 128 hidden units with soft attention mechanism (Cho et al., 2014b) ." }, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-46", "text": "We train the model using stochastic gradient descent with Adam (Kingma and Ba, 2014)." 
}, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-47", "text": "Each update is computed using a minibatch of 128 sequence pairs." }, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-48", "text": "The norm of the gradient is clipped with a threshold 1 (Pascanu et al., 2013) ." }, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-49", "text": "Also, beamsearch has been used to approximately find the most likely transliteration given a source sequence (Koehn, 2010) ." }, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-50", "text": "Table 2 shows the effectiveness of the proposed model on different datasets using standard measures ." }, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-51", "text": "The proposed neural machine transliteration model has been compared to the baseline method provided by NEWS 2016 organizers ." }, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-52", "text": "Baseline results are based on a machine translation implementation at the character level using MOSES (Koehn et al., 2007) ." }, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-53", "text": "Experimental results shows that the proposed model is significantly better than the robust baseline using different metrics." }, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-54", "text": "Figure 1 shows the learning curve of the pro- Table 2 : The effectiveness of neural machine transliteration is compared with the robust baseline (Koehn et al., 2007) provided by NEWS 2016 shared task on transliteration of named entities." }, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-55", "text": "posed model on different datasets." }, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-56", "text": "It is clear that in most datasets, the trained model is capable of robust transliteration after a few number of iterations." 
}, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-57", "text": "As shown in Table 1 , each dataset has different number of training set and also different number of characters in the source and target language." }, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-58", "text": "For example, when transliterating from English to Chinese (TaskID='En-Ch') and English to Hebrew, the target names contains 548 and 37 different tokens respectively." }, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-59", "text": "Since we leverage a same model for different datasets without tuning the model for each dataset, differences in the learning curves are expectable." }, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-60", "text": "For some datasets (such as 'En-Ch'), it takes more time to fit the model to the training data while for some others (such as 'En-He'), the model fit to the training data after a few iterations." }, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-61", "text": "----------------------------------" }, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-62", "text": "**CONCLUSION**" }, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-63", "text": "In this paper we proposed Neural Machine Transliteration based on successful studies in sequence to sequence learning (Sutskever et al., 2014) and Neural Machine Translation (Ling et al., 2015; Costa-Juss\u00e0 and Fonollosa, 2016; Bahdanau et al., 2015; Cho et al., 2014a) ." }, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-64", "text": "Neural Machine Transliteration typically consists of two components, the first of which encodes a source name sequence x and the second decodes to a target name sequence y. Different parts of the proposed model jointly trained using stochastic gradient descent to minimize the log-likelihood." 
}, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-65", "text": "Experiments on different datasets using benchmark measures revealed that the proposed model is able to achieve significantly higher transliteration quality over traditional statistical models (Koehn, 2010) ." }, { "sent_id": "9f1d2be80dbfd726a24fb2a05e130b-C001-66", "text": "In this paper we did not concentrate on improving the model for achieving state-of-the-art results, so applying hyperparameter optimization (Bergstra and Bengio, 2012) , multi-task sequence to sequence learning (Luong et al., 2015) and multiway transliteration (Firat et al., 2016; Dong et al., 2015) are quite promising for future works." } ], "y": { "@BACK@": { "gold_contexts": [ [ "9f1d2be80dbfd726a24fb2a05e130b-C001-12" ], [ "9f1d2be80dbfd726a24fb2a05e130b-C001-17" ], [ "9f1d2be80dbfd726a24fb2a05e130b-C001-21" ] ], "cite_sentences": [ "9f1d2be80dbfd726a24fb2a05e130b-C001-12", "9f1d2be80dbfd726a24fb2a05e130b-C001-17", "9f1d2be80dbfd726a24fb2a05e130b-C001-21" ] }, "@USE@": { "gold_contexts": [ [ "9f1d2be80dbfd726a24fb2a05e130b-C001-17" ], [ "9f1d2be80dbfd726a24fb2a05e130b-C001-21" ], [ "9f1d2be80dbfd726a24fb2a05e130b-C001-35" ], [ "9f1d2be80dbfd726a24fb2a05e130b-C001-63" ] ], "cite_sentences": [ "9f1d2be80dbfd726a24fb2a05e130b-C001-17", "9f1d2be80dbfd726a24fb2a05e130b-C001-21", "9f1d2be80dbfd726a24fb2a05e130b-C001-35", "9f1d2be80dbfd726a24fb2a05e130b-C001-63" ] } } }, "ABC_fca14f99953b9dc30a594525ee92b5_3": { "x": [ { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-2", "text": "Learning disentangled representations of high-dimensional data is currently an active research area." }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-3", "text": "However, compared to the field of computer vision, less work has been done for speech processing." 
}, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-4", "text": "In this paper, we provide a review of two representative efforts on this topic and propose the novel concept of fine-grained disentangled speech representation learning." }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-5", "text": "----------------------------------" }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-7", "text": "Representation learning is a fundamental challenge in machine learning and artificial intelligence." }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-8", "text": "While there are multiple criteria for an ideal representation, disentangled representation (illustrated in Figure 1 ), which explicitly separates the underlying causal factors of the observed data, has been of particular interest, because it can be useful for a large variety of tasks and domains [1, 2, 3, 4, 5] ." }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-9", "text": "For example, in [5] , the authors show that learning disentangling latent factors corresponding to pose and identity in photos of human faces can improve the performance of both pose estimation and face verification." }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-10", "text": "Learning disentangled representation from high-dimensional data is not a trivial task and multiple techniques, such as \u03b2-VAE [1] , Info-GAN [2] , and DC-IGN [3] , have been developed to address this problem." }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-11", "text": "While disentangling natural image representation has been studied extensively, much less work has focused on natural speech, leaving a rather large void in the understanding of this problem." 
}, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-12", "text": "In this paper, we first present a short review and comparison of two representative efforts on this topic [6, 7] , where both efforts involve using an auto-encoder and can be applied to the same task (i.e., voice conversion), but the key disentangling algorithms and underlying ideas are very different." }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-13", "text": "In [6] , the authors proposed an unsupervised factorized hierarchical variational autoencoder (FHVAE)." }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-14", "text": "The key idea is that assuming that the speech data is generated from two separate latent variable sets z1 and z2, where z1 contains segment-level (short-term) variables and z2 contains sequencelevel (long-term) variables (z2 that are further conditioned on an s-vector \u00b52)." }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-15", "text": "Leveraging the multi-scale nature that different factors affect speech at different time scales (e.g., speaker identity affects the fundamental frequency and volume of speech signal at the sequence level while the phonetic content affects the speech signal at the segment level), by training an autoencoder in a sequence-to-sequence manner, z1 can be forced to encode segment-level information (e.g., speech content), while z2 and \u00b52 can be forced to encode sequence-level information (e.g., speaker identity)." }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-16", "text": "In the experiments, by keeping z2 fixed and changing z1, speech of the same content, but by different speakers, can be synthesized naturally, demonstrating the clean separation between content and speaker information." 
}, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-17", "text": "Further, the learned s-vector is shown to be a stronger feature than the conventional i-vector in the speaker verification task, demonstrating that it encodes speaker-level information well." }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-18", "text": "In [6] and subsequent efforts [8, 9, 10] , the authors further showed that the disentangled representation is also helpful in the speech recognition task." }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-19", "text": "These efforts convey two primary insights: 1) by adding the appropriate prior assumptions on the latent variables, speech content information and speaker-level information can be separated out in an unsupervised learning manner; 2) the learned disentangled representations are useful to improve both speech synthesis and broader inference tasks." }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-20", "text": "Different from [6] , in [7] , the authors propose a supervised approach based on adversarial training [11, 12, 13, 14] (illustrated in Figure 2 (left))." }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-21", "text": "In addition to a regular autoencoder, the authors add a regularization term in its objective function to force the latent variables (i.e., the encoding) to not contain speaker information." }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-22", "text": "This is done by introducing an auxiliary speaker verification classifier C. C is trained to correctly identify the speaker y from the latent variables z (i.e., minimizing the misclassification loss Lc = \u2212logP (y|z)), while the encoder is trained to maximize Lc, i.e., to avoid encoding speaker information in z. Both z and speaker label y are fed to the decoder for reconstruction, and the complete objective function of the auto-encoder is hence minimizing Lrec \u2212 \u03bbLc (where Lrec is the point-wise L1-norm loss)." 
}, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-23", "text": "By alternatively training the auto-encoder and C, the z is learned to be an encoding of speech content information." }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-24", "text": "Further, the residual information of the speech is reconstructed through another GAN and auxiliary classifier." }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-25", "text": "The experiment shows that such a scheme can be successfully applied to a voice conversion task." }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-26", "text": "The main insight of this work is how to use supervision to conduct representation disentanglement." }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-27", "text": "In summary, although the implementation is very different, in order to learn disentangled representations, both works add constraints to the latent variables." }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-28", "text": "Such a constraint can be a prior assumption (in the unsupervised case) or a regularization term (in the supervised case)." }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-29", "text": "While both efforts show good empirical performance in real tasks and lay the groundwork for future efforts, the learned disentangled representation is relatively coarse-grained." }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-30", "text": "That is, in [6] , z1 and z2 are in fact corresponding to general fast-changing and slow-changing information, i.e., z1 may contain other fast-changing information such as emotion, while z2 may contain slow-changing factors such as background and channel noise." }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-31", "text": "In [7] , the authors actually separate out speaker information and general non-speaker information, which may contain a lot of detailed information." 
}, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-32", "text": "Coarse-grained disentangled representations are enough for some tasks, such as voice speaker conversion, but might be limited for other tasks." }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-33", "text": "Next, we discuss the need for (and benefits of) fine-grained disentangled representations." }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-34", "text": "----------------------------------" }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-35", "text": "**FINE-GRAINED DISENTANGLED SPEECH REPRESENTATION**" }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-36", "text": "Natural speech signal inherently contains both linguistic and paralinguistic information; common paralinguistic information include gender, age, health status, personality, friendliness, mood, and emotion (sorted from long-term to short-term) [15] ." }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-37", "text": "In the original representation of natural speech, these information are entangled together. But in fact, many of these information are essentially independent or of low correlation with each other as well as with the linguistic content, hence raising the possibility to disentangle them in some latent space." }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-38", "text": "Natural speech signals can be viewed as produced by multiple finegrained causal factors and disentangling these factors leads to the following benefits: Synthesis: Learning fine-grained disentangled representation can help more flexible speech synthesis." }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-39", "text": "Assume that disentangled latent variables corresponding to age, personality, friendliness, emotion, and content are learned; we may then be able to synthesize speech signals corresponding to arbitrary combinations of these factors, according to the requirement of the application scenario." 
}, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-40", "text": "Further, this may support novel AI applications, such as speech style transfer and predicting the future voice of a given subject (similar technology has been adopted in computer vision, e.g., image style transfer [16] and face aging [17] )." }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-41", "text": "In contrast, a coarse-grained disentangled representation [6, 7] may only support a simple voice speaker conversion task." }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-42", "text": "Inference: Learning fine-grained disentangled representation can also help with more accurate inference and reasoning." }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-43", "text": "When we attempt to predict one target variable, we usually want to eliminate the interference of other factors." }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-44", "text": "For example, a speech recognition system is expected to be emotionindependent, while a speech emotion recognition system is expected to be text-independent." }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-45", "text": "Historically, some manually designed algorithms are used to eliminate the effects of unrelated factors, e.g., speaker normalization [18, 19] and speaker adaptation [20, 21] are commonly used to eliminate the impacts of speaker variability." }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-46", "text": "However, it is difficult to manually design algorithms for all underlying factors." }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-47", "text": "Previous work in representation learning has shown that by disentangling different (independent) factors, all corresponding inference tasks [5] can gain performance improvements." 
}, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-48", "text": "In addition, the learned disentangled and interpretable representation helps us to understand the inner working mechanism of the machine learning model." }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-49", "text": "The question is how can we learn fine-grained disentangled representation from speech?" }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-50", "text": "Following the discussion in the previous section, we can either add a prior assumption (unsupervised) or a regularization (supervised) to the latent variables." }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-51", "text": "However, for unsupervised approaches, when we want to disentangle many factors, designing such prior assumptions is difficult and needs to be done very carefully." }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-52", "text": "Hence, we first consider a fully supervised solution." }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-53", "text": "Figure 2 : Illustration of supervised representation learning." }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-54", "text": "Left: the approach in [7] ." }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-55", "text": "Right: the proposed approach for learning fine-grained disentangled representation." }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-56", "text": "Assume that we want to disentangle n independent factors f1, f2, ...,fn of natural speech." }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-57", "text": "Further assume that we have a data set D that has complete annotations {y f 1 ,y f 2 ,...,y fn } corresponding to each factor for each sample." }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-58", "text": "We can then extend the approach in [7] to learn disentangled latent variables {z f 1 ,z f 2 ,...,z fn } corresponding to more than two factors." 
}, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-59", "text": "As illustrated in Figure 2 (right), we build n auto-encoders, each used to learn latent variables corresponding to one factor." }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-60", "text": "In order to guarantee the disentanglement, for each auto-encoder we further need to build n \u2212 1 auxiliary predictors." }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-61", "text": "Since the auto-encoder attempts to learn latent variable z f i , we train each predictor to correctly predict one factor j(j = i) based on z f i (i.e., minimizing the miss prediction loss Lij ), and train the auto-encoder to maximize the minimum of the loss of each predictor." }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-62", "text": "The training is conducted alternatively and during the training, ground truth annotations y = {yj |j = i} are fed to the decoder for successful reconstruction." }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-63", "text": "Hence, the complete loss of auto-encoder i is Li = Lrec \u2212 min j =i (\u03bbjLij ), where Lrec is the point-wise L1-norm reconstruction loss, and \u03bb is the parameter controlling the disentanglement degree of each factor." }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-64", "text": "Note that the predictors having the same target factor cannot be re-used across different auto-encoders, because they are based on different latent variables." }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-65", "text": "However, in practice, there does not exist such a speech dataset D that has complete annotations for each factor; most speech datasets only have a limited number of annotations." }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-66", "text": "Nevertheless, natural speech has relatively fixed variation factors, which makes it possible to use multiple datasets to cooperatively learn disentangled representations." 
}, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-67", "text": "This is very different from natural images, which have a much larger variation." }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-68", "text": "For example, handwriting digit images have factors \"number\" and \"writing style\", which are completely different from human face images; therefore, it is difficult to use a handwriting image dataset and a human face dataset together to learn disentangled representations." }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-69", "text": "In contrast, for most speech datasets, although they are collected for different purposes and hence have different annotations, they share the same factors such as content, emotion, age, and gender." }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-70", "text": "Assume that we have a set of datasets D1, D2, ...,Dn, where each has an annotation of one different factor." }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-71", "text": "In this practical setting, two things become more complicated: 1) without the ground-truth label, the loss of predictor log P (y|z) cannot be calculated; 2) the encoder aims to remove the information unrelated to the desired factor and the decoder does not have the ground-truth label of the removed information to reconstruct the input." }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-72", "text": "To solve these challenges, we only use the samples with the corresponding ground-truth labels to train the predictors, and use the certainty of the prediction as the regularizer for the auto-encoder (e.g., for discrete labels y: max P (y k |z), k \u2208 [1, Card(y)])." }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-73", "text": "In the same iteration, we also feed this prediction to the decoder." 
}, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-74", "text": "In this approach, we penalize the high-certainty prediction rather than the correct prediction, hence the ground-truth label is not needed." }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-75", "text": "We feed the prediction as the ground truth to the decoder, compensating the information the encoder removes (according to this prediction) to reconstruct the input." }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-76", "text": "In this paper, we extend the idea of [7] for fine-grained disentangled speech representation learning." }, { "sent_id": "fca14f99953b9dc30a594525ee92b5-C001-77", "text": "The training procedure detail, convergence condition, and the empirical performance need to be further explored and are left as future work." } ], "y": { "@USE@": { "gold_contexts": [ [ "fca14f99953b9dc30a594525ee92b5-C001-12" ] ], "cite_sentences": [ "fca14f99953b9dc30a594525ee92b5-C001-12" ] }, "@BACK@": { "gold_contexts": [ [ "fca14f99953b9dc30a594525ee92b5-C001-12" ], [ "fca14f99953b9dc30a594525ee92b5-C001-13" ], [ "fca14f99953b9dc30a594525ee92b5-C001-18" ], [ "fca14f99953b9dc30a594525ee92b5-C001-30" ] ], "cite_sentences": [ "fca14f99953b9dc30a594525ee92b5-C001-12", "fca14f99953b9dc30a594525ee92b5-C001-13", "fca14f99953b9dc30a594525ee92b5-C001-18", "fca14f99953b9dc30a594525ee92b5-C001-30" ] }, "@DIF@": { "gold_contexts": [ [ "fca14f99953b9dc30a594525ee92b5-C001-20" ], [ "fca14f99953b9dc30a594525ee92b5-C001-41" ] ], "cite_sentences": [ "fca14f99953b9dc30a594525ee92b5-C001-20", "fca14f99953b9dc30a594525ee92b5-C001-41" ] } } }, "ABC_2eaa48dbc5e42a5934e905ec2288ac_3": { "x": [ { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-22", "text": "**METHOD**" }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-2", "text": "In this work, we present an approach based on combining string kernels and word embeddings for automatic 
essay scoring." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-3", "text": "String kernels capture the similarity among strings based on counting common character ngrams, which are a low-level yet powerful type of feature, demonstrating state-of-theart results in various text classification tasks such as Arabic dialect identification or native language identification." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-4", "text": "To our best knowledge, we are the first to apply string kernels to automatically score essays." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-5", "text": "We are also the first to combine them with a high-level semantic feature representation, namely the bag-of-super-word-embeddings." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-6", "text": "We report the best performance on the Automated Student Assessment Prize data set, in both indomain and cross-domain settings, surpassing recent state-of-the-art deep learning approaches." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-7", "text": "----------------------------------" }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-9", "text": "Automatic essay scoring (AES) is the task of assigning grades to essays written in an educational setting, using a computer-based system with natural language processing capabilities." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-10", "text": "The aim of designing such systems is to reduce the involvement of human graders as far as possible." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-11", "text": "AES is a challenging task as it relies on grammar as well as semantics, pragmatics and discourse (Song et al., 2017) ." 
}, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-12", "text": "Although traditional AES methods typically rely on handcrafted features (Larkey, 1998; Foltz et al., 1999; Attali and Burstein, 2006; Dikli, 2006; Wang and Brown, 2008; Chen and He, 2013; Somasundaran et al., 2014; Yannakoudakis et al., 2014; Phandi et al., 2015) , recent results indicate that state-of-the-art deep learning methods reach better performance (Alikaniotis et al., 2016; Dong and Zhang, 2016; Taghipour and Ng, 2016; Song et al., 2017; Tay et al., 2018) , perhaps because these methods are able to capture subtle and complex information that is relevant to the task (Dong and Zhang, 2016) ." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-13", "text": "In this paper, we propose to combine string kernels (low-level character n-gram features) and word embeddings (high-level semantic features) to obtain state-of-the-art AES results." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-14", "text": "Since recent methods based on string kernels have demonstrated remarkable performance in various text classification tasks ranging from authorship identification (Popescu and Grozea, 2012) and sentiment analysis (Gim\u00e9nez-P\u00e9rez et al., 2017; to native language identification (Popescu and Ionescu, 2013; Ionescu et al., 2014; Ionescu, 2015; and dialect identification , we believe that string kernels can reach equally good results in AES." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-15", "text": "To the best of our knowledge, string kernels have never been used for this task." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-16", "text": "As string kernels are a simple approach that relies solely on character n-grams as features, it is fairly obvious that such an approach will not to cover several aspects (e.g.: semantics, discourse) required for the AES task." 
}, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-17", "text": "To solve this problem, we propose to combine string kernels with a recent approach based on word embeddings, namely the bag-of-super-wordembeddings (BOSWE) ." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-18", "text": "To our knowledge, this is the first successful attempt to combine string kernels and word embeddings." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-19", "text": "We evaluate our approach on the Automated Student Assessment Prize data set, in both in-domain and cross-domain settings." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-20", "text": "The empirical results indicate that our approach yields a better performance than state-of-the-art approaches (Phandi et al., 2015; Dong and Zhang, 2016; Tay et al., 2018) ." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-21", "text": "----------------------------------" }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-23", "text": "String kernels." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-24", "text": "Kernel functions (Shawe-Taylor and Cristianini, 2004) capture the intuitive notion of similarity between objects in a specific domain." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-25", "text": "For example, in text mining, string kernels can be used to measure the pairwise similarity between text samples, simply based on character n-grams." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-26", "text": "Various string kernel functions have been proposed to date (Lodhi et al., 2002; Shawe-Taylor and Cristianini, 2004; Ionescu et al., 2014) ." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-27", "text": "One of the most recent string kernels is the histogram intersection string kernel (HISK) (Ionescu et al., 2014 )." 
}, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-28", "text": "For two strings over an alphabet \u03a3, x, y \u2208 \u03a3 * , the intersection string kernel is formally defined as follows:" }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-29", "text": "where num v (x) is the number of occurrences of n-gram v as a substring in x, and n is the length of v. In our AES experiments, we use the intersection string kernel based on a range of character n-grams." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-30", "text": "We approach AES as a regression task, and employ \u03bd-Support Vector Regression (\u03bd-SVR) (Suykens and Vandewalle, 1999; Shawe-Taylor and Cristianini, 2004) for training." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-31", "text": "Bag-of-super-word-embeddings." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-32", "text": "Word embeddings are long known in the NLP community (Bengio et al., 2003; Collobert and Weston, 2008) , but they have recently become more popular due to the word2vec (Mikolov et al., 2013) framework that enables the building of efficient vector representations from words." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-33", "text": "On top of the word embeddings, developed an approach termed bag-ofsuper-word-embeddings (BOSWE) by adapting an efficient computer vision technique, the bag-ofvisual-words model (Csurka et al., 2004) , for natural language processing tasks." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-34", "text": "The adaptation consists of replacing the image descriptors (Lowe, 2004) useful for recognizing object patterns in images with word embeddings (Mikolov et al., 2013) useful for recognizing semantic patterns in text documents." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-35", "text": "The BOSWE representation is computed as follows." 
}, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-36", "text": "First, each word in the collection of training documents is represented as word vector using a pre-trained word embeddings model." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-37", "text": "Based on the fact that word embeddings carry semantic information by projecting semantically related words in the same region of the embedding space, the next step is to cluster the word vectors in order to obtain relevant semantic clusters of words." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-38", "text": "As in the standard bag-of-visual-words model, the clustering is done by k-means (Leung and Malik, 2001) , and the formed centroids are stored in a randomized forest of k-d trees (Philbin et al., 2007) to reduce search cost." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-39", "text": "The centroid of each cluster is interpreted as a super word embedding or super word vector that embodies all the semantically related word vectors in a small region of the embedding space." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-40", "text": "Every embedded word in the collection of documents is then assigned to the nearest cluster centroid (the nearest super word vector)." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-41", "text": "Put together, the super word vectors generate a vocabulary (codebook) that can further be used to describe each document as a bag-of-super-wordembeddings." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-42", "text": "To obtain the BOSWE represenation for a document, we just have to compute the occurrence count of each super word embedding in the respective document." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-43", "text": "After building the representation, we employ a kernel method to train the BOSWE model for our specific task." 
}, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-44", "text": "To be consistent with the string kernel approach, we choose the histogram intersection kernel and the same regression method, namely \u03bd-SVR." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-45", "text": "Model fusion." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-46", "text": "In the primal form, a linear classifier takes as input a feature matrix X of r samples (rows) with m features (columns) and optimizes a set of weights in order to reproduce the r training labels." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-47", "text": "In the dual form, the linear classifier takes as input a kernel matrix K of r \u00d7 r components, where each component k ij is the similarity between examples x i and x j ." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-48", "text": "Kernel methods work by embedding the data in a Hilbert space and by searching for linear relations in that space, using a learning algorithm." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-49", "text": "The embedding can be performed either (i) implicitly, by directly specifying the similarity function between each pair of samples, or (ii) explicitly, by first giving the embedding map \u03c6 and by computing the inner product between each pair of samples embedded in the Hilbert space." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-50", "text": "For the linear kernel, the associated embedding map is \u03c6(x) = x and options (i) or (ii) are equivalent, i.e. the similarity function is the inner product." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-51", "text": "Hence, the linear kernel matrix K can be obtained as K = X \u00b7 X \u2032 , where X \u2032 is the transpose of X. For other kernels, e.g. 
the histogram intersection kernel, it is not possible to explicitly define the embedding map (Shawe-Taylor and Cristianini, 2004) , and the only solution is to adopt option (i) and compute the corresponding kernel matrix directly." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-52", "text": "Therefore, we combine HISK and BOSWE in the dual (kernel) form, by simply summing up the two corresponding kernel matrices." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-53", "text": "However, summing up kernel matrices is equivalent to feature vector concatenation in the primal Hilbert space." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-54", "text": "To better explain this statement, let us suppose that we can define the embedding map of the histogram intersection kernel and, consequently, we can obtain the corresponding feature matrix of HISK with r \u00d7 m 1 components denoted by X 1 and the corresponding feature matrix of BOSWE with r \u00d7 m 2 components denoted by X 2 ." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-55", "text": "We can now combine HISK and BOSWE in two ways." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-56", "text": "One way is to compute the corresponding kernel matrices K 1 = X 1 \u00b7 X \u2032 1 and K 2 = X 2 \u00b7 X \u2032 2 , and to sum the matrices into a single kernel matrix K + = K 1 + K 2 ." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-57", "text": "The other way is to first concatenate the feature matrices into a single feature matrix X + = [X 1 X 2 ] of r \u00d7 (m 1 + m 2 ) components, and to compute the final kernel matrix using the inner product, i.e. K + = X + \u00b7 X \u2032 + ." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-58", "text": "Either way, the two approaches, HISK and BOSWE, are fused before the learning stage."
}, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-59", "text": "As a consequence of kernel summation, the search space of linear patterns grows, which should help the kernel classifier, in our case \u03bd-SVR, to find a better regression function." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-60", "text": "----------------------------------" }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-61", "text": "**EXPERIMENTS**" }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-62", "text": "Data set." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-63", "text": "To evaluate our approach, we use the Automated Student Assessment Prize (ASAP) 1 data set from Kaggle." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-64", "text": "The ASAP data set contains 8 prompts of different genres." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-65", "text": "The number of essays per prompt along with the score ranges are presented in Table 1 ." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-66", "text": "Since the official test data of the ASAP competition is not released to the public, we, as well as others before us (Phandi et al., 2015; Dong and Zhang, 2016; 1 https://www.kaggle.com/c/asap-aes/data Tay et al., 2018) , use only the training data in our experiments." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-67", "text": "Evaluation procedure." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-68", "text": "As Dong and Zhang (2016), we scaled the essay scores into the range 0-1." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-69", "text": "We closely followed the same settings for data preparation as (Phandi et al., 2015; Dong and Zhang, 2016) ." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-70", "text": "For the in-domain experiments, we use 5-fold cross-validation." 
}, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-71", "text": "The 5-fold cross-validation procedure is repeated for 10 times and the results were averaged to reduce the accuracy variation introduced by randomly selecting the folds." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-72", "text": "We note that the standard deviation in all cases in below 0.2%." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-73", "text": "For the cross-domain experiments, we use the same source\u2192target domain pairs as (Phandi et al., 2015; Dong and Zhang, 2016) , namely, 1\u21922, 3\u21924, 5\u21926 and 7\u21928." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-74", "text": "All essays in the source domain are used as training data." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-75", "text": "Target domain samples are randomly divided into 5 folds, where one fold is used as test data, and the other 4 folds are collected together to sub-sample target domain train data." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-76", "text": "The sub-sample sizes are n t = {10, 25, 50, 100}. The sub-sampling is repeated for 5 times as in (Phandi et al., 2015; Dong and Zhang, 2016) to reduce bias." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-77", "text": "As our approach performs very well in the cross-domain setting, we also present experiments without subsampling data from the target domain, i.e. when the sub-sample size is n t = 0." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-78", "text": "As evaluation metric, we use the quadratic weighted kappa (QWK)." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-79", "text": "Baselines." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-80", "text": "We compare our approach with stateof-the-art methods based on handcrafted features (Phandi et al., 2015) , as well as deep features (Dong and Zhang, 2016; Tay et al., 2018) ." 
}, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-81", "text": "We note that results for the cross-domain setting are reported only in some of these recent works (Phandi et al., 2015; Dong and Zhang, 2016) ." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-82", "text": "Implementation choices." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-83", "text": "For the string kernels approach, we used the histogram intersection string kernel (HISK) based on the blended range of character n-grams from 1 to 15." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-84", "text": "To compute the intersection string kernel, we used the open-source code provided by Ionescu et al. (2014) ." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-85", "text": "For the BOSWE approach, we used the pre-trained word embeddings computed by the word2vec toolkit (Mikolov et al., 2013) on the Google News data set using the Skip-gram model, which produces 300-dimensional vectors for 3 million words and phrases." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-86", "text": "We used functions from the VLFeat li- Table 2 : In-domain automatic essay scoring results of our approach versus several state-of-the-art methods (Phandi et al., 2015; Dong and Zhang, 2016; Tay et al., 2018) ." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-87", "text": "Results are reported in terms of the quadratic weighted kappa (QWK) measure, using 5-fold cross-validation." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-88", "text": "The best QWK score (among the machine learning systems) for each prompt is highlighted in bold." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-89", "text": "brary (Vedaldi and Fulkerson, 2008) for the other steps involved in the BOSWE approach, such as the k-means clustering and the randomized forest of k-d trees." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-90", "text": "We set the number of clusters (dimension of the vocabulary) to k = 500." 
}, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-91", "text": "After computing the BOSWE representation, we apply the L 1 -normalized intersection kernel." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-92", "text": "We combine HISK and BOSWE in the dual form by summing up the two corresponding matrices." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-93", "text": "For the learning phase, we employ the dual implementation of \u03bd-SVR available in LibSVM (Chang and Lin, 2011) ." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-94", "text": "We set its regularization parameter to c = 10 3 and \u03bd = 10 \u22121 in all our experiments." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-95", "text": "In-domain results." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-96", "text": "The results for the in-domain automatic essay scoring task are presented in Table 2." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-97", "text": "In our empirical study, we also include feature ablation results." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-98", "text": "We report the QWK measure on each prompt as well as the overall average." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-99", "text": "We first note that the histogram intersection string kernel alone reaches better overall performance (0.780) than all previous works (Phandi et al., 2015; Dong and Zhang, 2016; Tay et al., 2018) ." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-100", "text": "Remarkably, the overall performance of the HISK is also higher than the inter-human agreement (0.754)." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-101", "text": "Although the BOSWE model can be regarded as a shallow approach, its overall results are comparable to those of deep learning approaches (Dong and Zhang, 2016; Tay et al., 2018) ." 
}, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-102", "text": "When we combine the two models (HISK and BOSWE), we obtain even better results." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-103", "text": "Indeed, the combination of string kernels and word embeddings attains the best performance on 7 out of 8 prompts." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-104", "text": "The average QWK score of HISK and BOSWE (0.785) is more than 2% better the average scores of the best-performing state-of-the-art approaches Tay et al., 2018) ." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-105", "text": "Cross-domain results." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-106", "text": "The results for the crossdomain automatic essay scoring task are presented in Table 3 ." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-107", "text": "For each and every source\u2192target pair, we report better results than both state-of-theart methods (Phandi et al., 2015; Dong and Zhang, 2016) ." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-108", "text": "We observe that the difference between our best QWK scores and the other approaches are sometimes much higher in the cross-domain setting than in the in-domain setting." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-109", "text": "We particularly notice that the difference from (Phandi et al., 2015) when n t = 0 is always higher than 10%." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-110", "text": "Our highest improvement (more than 54%, from 0.187 to 0.728) over (Phandi et al., 2015) is recorded for the pair 5\u21926, when n t = 0." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-111", "text": "Our score in this case (0.728) is even higher than both scores of Phandi et al. (2015) and Dong and Zhang (2016) when they use n t = 50." 
}, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-112", "text": "Different from the in-domain setting, we note that the combination of string kernels and word embeddings does not always provide better results than string kernels alone, particularly when the number of target samples (n t ) added into the training set is less or equal to 25." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-113", "text": "Discussion." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-114", "text": "It is worth noting that in a set of preliminary experiments (not included in the paper), we actually considered another approach based on word embeddings." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-115", "text": "We tried to obtain a document embedding by averaging the word vectors for each document." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-116", "text": "We computed the average as well as the standard deviation for each component of the word vectors, resulting in a total of 600 features, since the word vectors are 300-dimensional." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-117", "text": "We applied this method in the in-domain setting and we obtained a surprisingly low overall QWK score, around 0.251." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-118", "text": "We concluded that this simple apSource\u2192Target Method n t = 0 n t = 10 n t = 25 n t = 50 n t = 100 1\u21922 (Phandi et al., 2015) Table 3 : Corss-domain automatic essay scoring results of our approach versus two state-of-the-art methods (Phandi et al., 2015; Dong and Zhang, 2016) ." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-119", "text": "Results are reported in terms of the quadratic weighted kappa (QWK) measure, using the same evaluation procedure as (Phandi et al., 2015; Dong and Zhang, 2016) ." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-120", "text": "The best QWK scores for each source\u2192target domain pair are highlighted in bold." 
}, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-121", "text": "proach is not useful, and decided to use BOSWE instead." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-122", "text": "It would have been interesting to present an error analysis based on the discriminant features weighted higher by the \u03bd-SVR method." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-123", "text": "Unfortunately, this is not possible because our approach works in the dual space and we cannot transform the dual weights into primal weights, as long as the histogram intersection kernel does not have an explicit embedding map associated to it." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-124", "text": "In future work, however, we aim to replace the histogram intersection kernel with the presence bits kernel, which will enable us to perform an error analysis based on the overused or underused patterns, as described by ." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-125", "text": "----------------------------------" }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-126", "text": "**CONCLUSION**" }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-127", "text": "In this paper, we described an approach based on combining string kernels and word embeddings for automatic essay scoring." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-128", "text": "We compared our approach on the Automated Student Assessment Prize data set, in both in-domain and crossdomain settings, with several state-of-the-art approaches (Phandi et al., 2015; Dong and Zhang, 2016; Tay et al., 2018) ." }, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-129", "text": "Overall, the in-domain and the cross-domain comparative studies indicate that string kernels, both alone and in combination with word embeddings, attain the best performance on the automatic essay scoring task." 
}, { "sent_id": "2eaa48dbc5e42a5934e905ec2288ac-C001-130", "text": "Using a shallow approach, we report better results compared to recent deep learning approaches (Dong and Zhang, 2016; Tay et al., 2018) ." } ], "y": { "@BACK@": { "gold_contexts": [ [ "2eaa48dbc5e42a5934e905ec2288ac-C001-12" ] ], "cite_sentences": [ "2eaa48dbc5e42a5934e905ec2288ac-C001-12" ] }, "@DIF@": { "gold_contexts": [ [ "2eaa48dbc5e42a5934e905ec2288ac-C001-20" ], [ "2eaa48dbc5e42a5934e905ec2288ac-C001-99" ], [ "2eaa48dbc5e42a5934e905ec2288ac-C001-104" ], [ "2eaa48dbc5e42a5934e905ec2288ac-C001-130" ] ], "cite_sentences": [ "2eaa48dbc5e42a5934e905ec2288ac-C001-20", "2eaa48dbc5e42a5934e905ec2288ac-C001-99", "2eaa48dbc5e42a5934e905ec2288ac-C001-104", "2eaa48dbc5e42a5934e905ec2288ac-C001-130" ] }, "@SIM@": { "gold_contexts": [ [ "2eaa48dbc5e42a5934e905ec2288ac-C001-66" ], [ "2eaa48dbc5e42a5934e905ec2288ac-C001-101" ] ], "cite_sentences": [ "2eaa48dbc5e42a5934e905ec2288ac-C001-66", "2eaa48dbc5e42a5934e905ec2288ac-C001-101" ] }, "@USE@": { "gold_contexts": [ [ "2eaa48dbc5e42a5934e905ec2288ac-C001-80" ], [ "2eaa48dbc5e42a5934e905ec2288ac-C001-86" ], [ "2eaa48dbc5e42a5934e905ec2288ac-C001-128" ] ], "cite_sentences": [ "2eaa48dbc5e42a5934e905ec2288ac-C001-80", "2eaa48dbc5e42a5934e905ec2288ac-C001-86", "2eaa48dbc5e42a5934e905ec2288ac-C001-128" ] } } }, "ABC_e7f972baa73e7ababa28eded3adad9_3": { "x": [ { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-2", "text": "There is abundance of digitised texts available in Sanskrit." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-3", "text": "However, the word segmentation task in such texts are challenging due to the issue of Sandhi." 
}, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-4", "text": "In Sandhi, words in a sentence often fuse together to form a single chunk of text, where the word delimiter vanishes and sounds at the word boundaries undergo transformations, which is also reflected in the written text." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-5", "text": "Here, we propose an approach that uses a deep sequence to sequence (seq2seq) model that takes only the sandhied string as the input and predicts the unsandhied string." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-6", "text": "The state of the art models are linguistically involved and have external dependencies for the lexical and morphological analysis of the input." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-7", "text": "Our model can be trained \"overnight\" and be used for production." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-8", "text": "In spite of the knowledge lean approach, our system preforms better than the current state of the art by gaining a percentage increase of 16.79 % than the current state of the art." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-9", "text": "----------------------------------" }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-11", "text": "Sanskrit had profound influence as the knowledge preserving language for centuries in India." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-12", "text": "The tradition of learning and teaching Sanskrit, though limited, still exists in India." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-13", "text": "There have been tremendous advancements in digitisation of ancient manuscripts in Sanskrit in the last decade." 
}, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-14", "text": "Numerous initiatives such as the Digital Corpus of Sanskrit 1 , GRETIL 2 , The Sanskrit Library 3 and others from the Sanskrit Linguistic and Computational Linguistic community is a fine example of such efforts (Goyal et al., 2012; Krishna et al., 2017) ." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-15", "text": "The digitisation efforts have made the Sanskrit manuscripts easily available in the public domain." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-16", "text": "However, the accessibility of such digitised manuscripts is still limited." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-17", "text": "Numerous technical challenges in indexing and retrieval of such resources in a digital repository arise due to the linguistic peculiarities posed by the language." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-18", "text": "Word Segmentation in Sanskrit is an important yet non-trivial prerequisite for facilitating efficient processing of Sanskrit texts." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-19", "text": "Sanskrit has been primarily communicated orally." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-20", "text": "Due to its oral tradition, the phonemes in Sanskrit undergo euphonic assimilation in spoken format." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-21", "text": "This gets reflected in writing as well and leads to the phenomena of Sandhi (Goyal and Huet, 2016) ." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-22", "text": "Sandhi leads to phonetic transformations at word boundaries of a written chunk, and the sounds at the end of a word join together to form a single chunk of character sequence." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-23", "text": "This not only makes the word boundaries indistinguishable, but transformations occur to the characters at the word boundaries." 
}, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-24", "text": "The transformations can be deletion, insertion or substitution of one or more sounds at the word ends." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-25", "text": "There are about 281 sandhi rules, each denoting a unique combination of phonetic transformations, documented in the grammatical tradition of Sanskrit." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-26", "text": "The *The first two authors contributed equally 1 http://kjc-sv013.kjc.uni-heidelberg.de/dcs/ 2 http://gretil.sub.uni-goettingen.de/ 3 http://sanskritlibrary.org/ proximity between two compatible sounds as per any one of the 281 rules is the sole criteria for sandhi." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-27", "text": "The Sandhi do not make any syntactic or semantic changes to the words involved." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-28", "text": "Sandhi is an optional operation relied solely on the discretion of the writer." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-29", "text": "While the Sandhi formation is deterministic, the analysis of Sandhi is non-deterministic and leads to high level of ambiguity." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-30", "text": "For example, the chunk 'gardabhas ." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-31", "text": "c\u0101\u015bva\u015bca' (the ass and the horse) has 625 possible phonetically and lexically valid splits (Hellwig, 2015) ." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-32", "text": "Now, the correct split relies on the semantic compatibility between the split words." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-33", "text": "The word segmentation problem is a well studied problem across various languages where the segmentation is nontrivial." 
}, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-34", "text": "For languages such as Chinese and Japanese, where there is no explicit boundary markers between the words (Xue, 2003) , numerous sequence labelling approaches have been proposed." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-35", "text": "In Sanskrit, it can be seen that the merging of word boundaries is the discretion of the writer." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-36", "text": "In this work, we propose a purely engineering based pipeline for segmentation of Sanskrit sentences." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-37", "text": "The word segmentation problem is a structured prediction problem and we propose a deep sequence to sequence (seq2seq) model to solve the task." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-38", "text": "We use an encoder-decoder framework where the sandhied (unsegmented) and the unsandhied (segmented) sequences are treated as the input at the encoder and the output at the decoder, respectively." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-39", "text": "We train the model so as to maximise the conditional probability of predicting the unsandhied sequence given its corresponding sandhied sequence (Cho et al., 2014) ." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-40", "text": "We propose a knowledge-lean data-centric approach for the segmentation task." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-41", "text": "Our approach will help to scale the segmentation process in comparison with the challenges posed by knowledge involved processes in the current systems (Krishna et al., 2017) ." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-42", "text": "We only use parallel segmented and unsegmented sentences during training." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-43", "text": "At run-time, we only require the input sentence." 
}, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-44", "text": "Our model can literally be trained overnight." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-45", "text": "The best performing model of ours takes less than 12 hours to train in a 'Titan X' 12 GB memory, 3584 GPU Cores system." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-46", "text": "Our title for the paper is inspired from the title for the work by Wang et al. (2015) ." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-47", "text": "As with the original paper, we want to emphasise on the ease with which our system can be used for training and at runtime, as it do not require any linguistically involved preprocessing." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-48", "text": "Such requirements often limit the scalability of a system and tediousness involved in the process limits the usability of a system." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-49", "text": "Since Sanskrit is a resource scarce language, we use the sentencepiece (Schuster and Nakajima, 2012) , an unsupervised text tokeniser to obtain a new vocabulary for a corpus, that maximises the likelihood of the language model so learnt." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-50", "text": "We propose a pipeline for finding the semantically most valid segmented word-forms for a given sentence." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-51", "text": "Our model uses multiple layers of LSTM cells with attention." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-52", "text": "Our model outperforms the current state of the art by 16.79 %." 
}, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-53", "text": "----------------------------------" }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-54", "text": "**MODELS FOR WORD SEGMENTATION IN SANSKRIT**" }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-55", "text": "A number of methods have been proposed for word segmentation in Sanskrit." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-56", "text": "Hellwig (2015) treats the problem as a character level RNN sequence labelling task." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-57", "text": "The author, in addition to reporting sandhi splits to upto 155 cases, additionally categorises the rules to 5 different types." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-58", "text": "Since, the results reported by the author are not at word-level, as is the standard with word segmentation systems in general, a direct comparison with the other systems is not meaningful." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-59", "text": "Mittal (2010) proposed a method based on Finite State Transducers by incorporating rules of sandhi." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-60", "text": "The system generates all possible splits and then provides a ranking of various splits, based on probabilistic ranking inferred from a dataset of 25000 split points." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-61", "text": "Using the same dataset, Natarajan and Charniak (2011) proposed a sandhi splitter for Sanskrit." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-62", "text": "The method is an extension of Bayesian word segmentation approach by Goldwater et al. (2006) ." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-63", "text": "Krishna et al. (2016) is currently the state of the art in Sanskrit word segmentation." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-64", "text": "The system treats the problem as an iterative query expansion problem." 
}, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-65", "text": "Using a shallow parser for Sanskrit (Goyal et al., 2012) , an input sentence is first converted to a graph of possible candidates and desirable nodes are iteratively selected using Path Constrained Random Walks (Lao and Cohen, 2010) ." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-66", "text": "To further catalyse the research in word segmentation for Sanskrit, Krishna et al. (2017) has released a dataset for the word segmentation task." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-67", "text": "The work releases a dataset of 119,000 sentences in Sanskrit along with the lexical and morphological analysis from a shallow parser." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-68", "text": "The work emphasises the need for not just predicting the inflected word form but also the prediction of the associated morphological information of the word." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-69", "text": "The additional information will be beneficial in further processing of Sanskrit texts, such as Dependency parsing or summarisation (Krishna et al., 2017) .So far, no system successfully predicts the morphological information of the words in addition to the final word form." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-70", "text": "Though Krishna et al. (2016) has designed their system with this requirement in mind and outlined the possible extension of their system for the purpose, the system currently only predicts the final word-form." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-71", "text": "----------------------------------" }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-72", "text": "**METHOD**" }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-73", "text": "We use an encoder-decoder framework for tackling our segmentation problem, and propose a deep seq2seq model using LSTMs for our prediction task." 
}, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-74", "text": "Our model follows the architecture from Wu et al. (2016) , originally proposed for neural machine translation." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-75", "text": "We consider the pair of sandhied and unsandhied sentences as source and target sentences, respectively." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-76", "text": "Following the insights from Sutskever et al. (2014), we reverse the sequence order at the input and we find that the reversal of the string leads to improvement in the results." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-77", "text": "We also use a deep architecture with 3 layers each at the encoder and decoder, as it is shown that deeper models perform better than shallow LSTM Models." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-78", "text": "We also experiment with models with and without attention and find that the model with attention leads to considerable improvement in performance of the system (Wu et al., 2016) ." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-79", "text": "Given the training set S, our training objective is to maximise the log probability of the segmented sequences T where the unsegmented sequences S are given." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-80", "text": "The training objective is to maximise 1 |S|" }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-81", "text": "For a new sentence, we need to output a sequence T with maximum likelihood for the given input ." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-82", "text": "LSTMs are used both at the encoder and decoder." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-83", "text": "We use softmax layer at the decoder and perform greedy decoding to obtain the final prediction." 
}, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-84", "text": "The outputs are then passed to the loss function which calculates the log-perplexity over the data samples in the batch." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-85", "text": "We then update the parameters via backpropagation and use Adam optimiser (Kingma and Ba, 2015) for our model." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-86", "text": "Vocabulary Enhancement for the model -Sanskrit, being a resource poor language, the major challenge is to obtain enough data for the supervised task." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-87", "text": "While there are plenty of sandhied texts available for Sanskrit, it is hard to find parallel or unsandhied texts alone, as it is deterministic to get sandhied text from unsandhied texts." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-88", "text": "In our case we use 105,000 parallel strings from the Digital Corpus of Sanskrit as released in Krishna et al. (2017) ." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-89", "text": "To handle the data sparsity, we adopt a purely engineering based approach for our model." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-90", "text": "Rather than relying on the real word boundaries, we use the 'sentencepiece' model, an unsupervised text tokeniser (Schuster and Nakajima, 2012) to obtain a new vocabulary for the corpus." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-91", "text": "The method was originally proposed for segmentation problem in Japanese and Korean speech recognition systems." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-92", "text": "In the method, a greedy approach is used to identify new word units from a corpus that maximises the likelihood of the language model so learnt (Schuster and Nakajima, 2012) ." 
}, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-93", "text": "Figure 1 shows the instance of words learnt from the sentencepiece model corresponding to the original input from the corpus." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-94", "text": "In the sentencepiece model, the 'space' in the original input is also treated as a character and is replaced with the special symbol ' '." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-95", "text": "So 'am . r\u0101ma' is a word in our vocabulary, which originally is part of two words." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-96", "text": "Our model is fed only the 'words' from the new vocabulary, henceforth to be referred to as 'GibberishVocab'." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-97", "text": "Note that the decoder also outputs words from GibberishVocab." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-98", "text": "The output from decoder is then converted to the original vocabulary for evaluating the outputs." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-99", "text": "This is trivially done by reclaiming the original 'space' as the delimiter for the old vocabulary." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-100", "text": "----------------------------------" }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-101", "text": "**EXPERIMENTS**" }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-102", "text": "----------------------------------" }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-103", "text": "**DATASET**" }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-104", "text": "We used a dataset of 107,000 sentences from the Sanskrit Word Segmentation Dataset (Krishna et al., 2017) ." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-105", "text": "The dataset is a subset of the Digital Corpus of Sanskrit." 
}, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-106", "text": "From the dataset we only use the input sentences and the ground truth inflected word-forms." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-107", "text": "We ignore all the other morphological and lemma information available in the dataset." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-108", "text": "----------------------------------" }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-109", "text": "**BASELINES**" }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-110", "text": "We compare the performance of our system with two other baseline systems." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-111", "text": "supervisedPCRW -This is the current state of the art for word segmentation in Sanskrit." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-112", "text": "The method treats the problem as an iterative query expansion task." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-113", "text": "This is a linguistically involved approach, as at first a lexicon driven shallow parser is used to obtain all the phonetically valid segments for a given sentence." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-114", "text": "The sentence is then converted into a graph with the segments as the nodes." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-115", "text": "The edges are formed between every pair of nodes which can co-exist in a sentence and are not competing for the same input position." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-116", "text": "The edge weights are formed by weighted sum of random walks across typed paths." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-117", "text": "The authors use typed paths to obtain extended contexts about the word pairs from the candidate pool." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-118", "text": "The typed paths are designed with human supervision which is also linguistically involved." 
}, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-119", "text": "GraphCRF -We use a structured prediction approach using graph based Conditional Random Fields, where we first obtain the possible candidate segments using the shallow parser and then convert the segments into a graph." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-120", "text": "For every node segment, we learn a word vector using fastText (Bojanowski et al., 2016) ." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-121", "text": "segSeq2Seq -This is the proposed model as described in Section 3. but without attention." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-122", "text": "attnSegSeq2Seq -This is the proposed model as described in Section 3. with attention." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-123", "text": "We report all our results on a test data of 4,200 sentences which was not used in any part of the training." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-124", "text": "From the dataset we ignore about 7,500 sentences which are neither part of training nor the test set." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-125", "text": "We used 214,000 strings both from input and output strings in the training data to obtain the GibberishVocab using sentencepiece model." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-126", "text": "We use string-wise micro averaged precision, recall and FScore to evaluate our model as is the standard with evaluating word segmentation models." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-127", "text": "We find that the default vocabulary size of 8,000 for the GibberishVocab works best." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-128", "text": "Of the 8,000 'words', the encoder vocabulary size is 7,944 and the decoder vocabulary size is 7,464." 
}, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-129", "text": "This shows the high overlap in the vocabulary in GibberishVocab at both input and output sides, in spite of the difference in phonetic transformations due to sandhi." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-130", "text": "Originally the training data contained 60,308 segmented words at the output side." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-131", "text": "By reducing the vocabulary size at decoder side to 7,464, we make the probability distribution (softmax) at the decoder layer denser." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-132", "text": "Even if we followed a linguistic approach there were 16,473 unique lemmas in the training dataset." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-133", "text": "----------------------------------" }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-134", "text": "**TRAINING PROCEDURE AND HYPERPARAMETERS**" }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-135", "text": "Our models have 3 layers at both the encoder and decoder." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-136", "text": "The models contain an embedding layer which is a trainable matrix with individual word vector having a size of 128." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-137", "text": "Our LSTM layers consist of 128 cells at both the encoder and decoder layers." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-138", "text": "We train the sentences in a batch size of 128 and keep the sequence length of each sequence to be 35." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-139", "text": "The initial learning rate was set at 0.001 and we trained our system for 80 epochs after which the network parameters converged." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-140", "text": "We used Adam optimiser with parameter values \u03b2 1 ,\u03b2 2 as 0.9 and 0.999, respectively." 
}, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-141", "text": "We use dropout in the hidden layers with different settings from 0.1 to 0.4 in step sizes of 0.1." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-142", "text": "We find that a dropout of 0.2 is the best performing configuration." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-143", "text": "Dropout helps to avoid over-fitting of data (Srivastava et al., 2014) ." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-144", "text": "Both the 'segSeq2Seq' and 'attnSegSeq2Seq' models follow the same architecture and have the same hyperparameter settings and vary only on the attention mechanism." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-145", "text": "Table 1 shows the performance of the competing systems." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-146", "text": "We can find that the system 'attnSegSeq2Seq' outperforms the current state of the art with a percent increase of 16.29 % in F-Score." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-147", "text": "The model 'segSeq2Seq' falls short of the current state of the art with a percent decrease of 6.29 % in F-Score." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-148", "text": "It needs to be noted that the systems 'attnSegSeq2Seq' and 'segSeq2Seq' are exactly the same architectures other than the addition of attention in the former. But there is a percentage increase of 23.61 % for both the systems." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-149", "text": "One probable reason for this is due to the free word order nature of sentences in Sanskrit." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-150", "text": "Since there are multiple permutations of words in a sentence which are valid syntactically and convey the same semantic meaning, the entire input context is required to understand the meaning of a sentence for any distributional semantic model." 
}, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-151", "text": "Figure 2 shows the results of the competing systems on strings of different lengths in terms of words in the sentence." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-152", "text": "This should not be confused with sequence length." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-153", "text": "Here, we mean the 'word' as per the original vocabulary and is common for all the competing systems." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-154", "text": "For all the strings with up to 10 words, our system 'attnSegSeq2Seq' consistently outperforms all the systems in terms of both precision and recall." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-155", "text": "The current state of the art performs slightly better than our system, for sentences with more than 10 words." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-156", "text": "It needs to be noted that the average length of a string in the Digital Corpus of Sanskrit is 6.7 (Krishna et al., 2016) ." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-157", "text": "The proportion of sentences with more than 10 words in our dataset is less than 1 %." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-158", "text": "The test dataset has slightly more than 4 % sentences with 10 or more words." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-159", "text": "The 'segSeq2Seq' model performs better than the state of the art for both Precision and Recall for strings with less than or equal to 6 words." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-160", "text": "Figure 2a shows the proportion of sentences in the test data based on the frequency of words in it." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-161", "text": "Figure 2b shows the proportion of strings in the test dataset based on the number of words in the strings." 
}, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-162", "text": "Our systems attnSegSeq2Seq takes overall 11 hours 40 minutes and for 80 epochs in a 'Titan X' 12GB GPU memory, 3584 GPU Cores, 62GB RAM and Intel Xeon CPU E5-2620 2.40GHz system." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-163", "text": "For segSeq2Seq it takes 7 hours for the same setting." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-164", "text": "----------------------------------" }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-165", "text": "**RESULTS**" }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-166", "text": "----------------------------------" }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-167", "text": "**DISCUSSION**" }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-168", "text": "The purpose of our proposed model is purely to identify the word splits and correctness of the inflected word forms from a sandhied string." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-169", "text": "The word-level indexing in retrieval systems is often affected by phonetic transformations in words due to sandhi." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-170", "text": "For example, the term 'parame\u015bvarah . ' is split as 'parama' (ultimate) and '\u012b\u015bvarah ." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-171", "text": "' (god)." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-172", "text": "Now, a search for instances of the word '\u012b\u015bvarah . ' might lead to missing search results without proper indexing." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-173", "text": "String matching approaches often result in low precision results." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-174", "text": "Using a lexicon driven system might alleviate the said issues, but can lead to possible splits which are not semantically compatible." 
}, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-175", "text": "For parame\u015bvarah ." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-176", "text": "', it can be split as 'parama' (ultimate), '\u015bva' (dog) and 'rah . ' (to facilitate)." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-177", "text": "Though this is not semantically meaningful it is lexically valid." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-178", "text": "Such tools are put to use by some of the existing systems (Krishna et al., 2016; Mittal, 2010 ) to obtain additional morphological or syntactic information about the sentences." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-179", "text": "This limits the scalability of those systems, as they cannot handle out of vocabulary words." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-180", "text": "Scalability of such systems is further restricted as the sentences often need to undergo linguistically involved preprocessing steps that lead to human in the loop processing." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-181", "text": "The systems by Krishna et al. (2016) and Krishna et al. (2017) assume that the parser by Goyal et al. (2012) , identifies all the possible candidate chunks." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-182", "text": "Our proposed model is built with precisely one purpose in mind, which is to predict the final word-forms in a given sequence." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-183", "text": "Krishna et al. (2017) states that it is desirable to predict the morphological information of a word from along with the final word-form as the information will be helpful in further processing of Sanskrit." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-184", "text": "The segmentation task is seen as a means and not an end itself." 
}, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-185", "text": "Here, we overlook this aspect and see the segmentation task as an end in itself." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-186", "text": "So we achieve scalability at the cost of missing out on providing valuable linguistic information." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-187", "text": "Models that use linguistic resources are at an advantage here." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-188", "text": "Those systems such as Krishna et al. (2016) can be used to identify the morphological tags of the system as they currently store the morphological information of predicted candidates, but do not use them for evaluation as of now." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-189", "text": "Currently, no system exists that performs the prediction of wordform and morphological information jointly for Sanskrit." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-190", "text": "In our case, since we learn a new vocabulary altogether, the real word boundaries are opaque to the system." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-191", "text": "The decoder predicts from its own vocabulary." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-192", "text": "But predicting morphological information requires the knowledge of exact word boundaries." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-193", "text": "This should be seen as a multitask learning set up." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-194", "text": "One possible solution is to learn 'GibberishVocab' on the set of words rather than sentences. But this leads to increased vocabulary at decoder which is not beneficial, given the scarcity of the data we have." 
}, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-195", "text": "Given the importance of morphological segmentation in morphologically rich languages such as Hebrew and Arabic (Seeker and \u00c7 etinoglu, 2015) , the same applies to the morphologically rich Sanskrit as well (Krishna et al., 2017) ." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-196", "text": "But, we leave this work for future." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-197", "text": "----------------------------------" }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-198", "text": "**CONCLUSION**" }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-199", "text": "In this work we presented a model for word segmentation in Sanskrit using a purely engineering based appraoch." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-200", "text": "Our model with attention outperforms the current state of the art (Krishna et al., 2016) ." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-201", "text": "Since, we tackle the problem with a non-linguistic approach, we hope to extend the work to other Indic languages as well where sandhi is prevalent such as Hindi, Marathi, Malayalam, Telugu etc." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-202", "text": "Since we find that the inclusion of attention is highly beneficial in improving the performance of the system, we intend to experiment with recent advances in the encoder-decoder architectures, such as Vaswani et al. (2017) and Gehring et al. (2017) , where different novel approaches in using attention are experimented with." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-203", "text": "Our experiments in line with the measures reported in Krishna et al. (2016) show that our system performs robustly across strings of varying word size." 
}, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-204", "text": "----------------------------------" }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-205", "text": "**CODE AND DATASET**" }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-206", "text": "All our working code can be downloaded at https: //github.com/cvikasreddy/skt." }, { "sent_id": "e7f972baa73e7ababa28eded3adad9-C001-207", "text": "The dataset for training can be downloaded at https://zenodo.org/ record/803508#.WTuKbSa9UUs" } ], "y": { "@BACK@": { "gold_contexts": [ [ "e7f972baa73e7ababa28eded3adad9-C001-14" ], [ "e7f972baa73e7ababa28eded3adad9-C001-66", "e7f972baa73e7ababa28eded3adad9-C001-67", "e7f972baa73e7ababa28eded3adad9-C001-68" ], [ "e7f972baa73e7ababa28eded3adad9-C001-181" ], [ "e7f972baa73e7ababa28eded3adad9-C001-183" ], [ "e7f972baa73e7ababa28eded3adad9-C001-195" ] ], "cite_sentences": [ "e7f972baa73e7ababa28eded3adad9-C001-14", "e7f972baa73e7ababa28eded3adad9-C001-66", "e7f972baa73e7ababa28eded3adad9-C001-67", "e7f972baa73e7ababa28eded3adad9-C001-68", "e7f972baa73e7ababa28eded3adad9-C001-181", "e7f972baa73e7ababa28eded3adad9-C001-183", "e7f972baa73e7ababa28eded3adad9-C001-195" ] }, "@EXT@": { "gold_contexts": [ [ "e7f972baa73e7ababa28eded3adad9-C001-41" ] ], "cite_sentences": [ "e7f972baa73e7ababa28eded3adad9-C001-41" ] }, "@FUT@": { "gold_contexts": [ [ "e7f972baa73e7ababa28eded3adad9-C001-69" ], [ "e7f972baa73e7ababa28eded3adad9-C001-195", "e7f972baa73e7ababa28eded3adad9-C001-196" ] ], "cite_sentences": [ "e7f972baa73e7ababa28eded3adad9-C001-69", "e7f972baa73e7ababa28eded3adad9-C001-195" ] }, "@USE@": { "gold_contexts": [ [ "e7f972baa73e7ababa28eded3adad9-C001-88" ], [ "e7f972baa73e7ababa28eded3adad9-C001-104" ] ], "cite_sentences": [ "e7f972baa73e7ababa28eded3adad9-C001-88", "e7f972baa73e7ababa28eded3adad9-C001-104" ] }, "@DIF@": { "gold_contexts": [ [ "e7f972baa73e7ababa28eded3adad9-C001-181", "e7f972baa73e7ababa28eded3adad9-C001-182" ], [ 
"e7f972baa73e7ababa28eded3adad9-C001-183", "e7f972baa73e7ababa28eded3adad9-C001-184", "e7f972baa73e7ababa28eded3adad9-C001-185" ] ], "cite_sentences": [ "e7f972baa73e7ababa28eded3adad9-C001-181", "e7f972baa73e7ababa28eded3adad9-C001-183" ] } } }, "ABC_1540b0b172971ac75771b414765f1d_3": { "x": [ { "sent_id": "1540b0b172971ac75771b414765f1d-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-2", "text": "Historical text normalization suffers from small datasets that exhibit high variance, and previous work has shown that multitask learning can be used to leverage data from related problems in order to obtain more robust models." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-3", "text": "Previous work has been limited to datasets from a specific language and a specific historical period, and it is not clear whether results generalize." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-4", "text": "It therefore remains an open problem, when historical text normalization benefits from multi-task learning." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-5", "text": "We explore the benefits of multi-task learning across 10 different datasets, representing different languages and periods." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-6", "text": "Our main findingcontrary to what has been observed for other NLP tasks-is that multi-task learning mainly works when target task data is very scarce." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-7", "text": "----------------------------------" }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-9", "text": "Historical text normalization is the problem of translating historical documents written in the absence of modern spelling conventions and making them amenable to search by today's scholars, processable by natural language processing models, and readable to laypeople." 
}, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-10", "text": "In other words, historical text normalization is a text-to-text generation, where the input is a text written centuries ago, and the output is a text that has the same contents, but uses the orthography of modern-day language." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-11", "text": "In this paper, we limit ourselves to word-by-word normalization, ignoring the syntactic differences between modern-day languages and their historic predecessors." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-12", "text": "Resources for historical text normalization are scarce." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-13", "text": "Even for major languages like English and German, we have very little training data for inducing normalization models, and the models we induce may be very specific to these datasets and not scale to writings from other historic periodsor even just writings from another monastery or by another author." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-14", "text": "Bollmann and S\u00f8gaard (2016) and Bollmann et al. (2017) recently showed that we can obtain more robust historical text normalization models by exploiting synergies across historical text normalization datasets and with related tasks." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-15", "text": "Specifically, Bollmann et al. (2017) showed that multitask learning with German grapheme-to-phoneme translation as an auxiliary task improves a stateof-the-art sequence-to-sequence model for historical text normalization of medieval German manuscripts." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-16", "text": "----------------------------------" }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-17", "text": "**CONTRIBUTIONS**" }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-18", "text": "We study when multi-task learning leads to improvements in historical text normalization." 
}, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-19", "text": "Specifically, we evaluate a state-ofthe-art approach to historical text normalization (Bollmann et al., 2017) with and without various auxiliary tasks, across 10 historical text normalization datasets." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-20", "text": "We also include an experiment in English historical text normalization using data from Twitter and a grammatical error correction corpus (FCE) as auxiliary datasets." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-21", "text": "Across the board, we find that, unlike what has been observed for other NLP tasks, multi-task learning only helps when target task data is scarce." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-22", "text": "----------------------------------" }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-23", "text": "**DATASETS**" }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-24", "text": "We consider 10 datasets from 8 different languages: German, using the Anselm dataset (taken from Bollmann et al., 2017) and texts from the RIDGES corpus (Odebrecht et al., 2016) Bollmann et al. (2017) to obtain a single dataset." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-25", "text": "For RIDGES, we use 16 texts and randomly sample 70% of all sentences from each text for the training set, and 15% for the dev/test sets." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-26", "text": "The Spanish and Portuguese datasets consist of manually normalized subsets of the Post Scriptum corpus; here, we randomly sample 80% (train) and 10% (dev/test) of all sentences per century represented in the corpus." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-27", "text": "Dataset splits for the other languages are taken from Pettersson (2016) and Ljube\u0161i\u0107 et al. (2016) ." 
}, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-28", "text": "We preprocessed all datasets to remove punctuation, perform Unicode normalization, replace digits that do not require normalization with a dummy symbol, and lowercase all tokens." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-29", "text": "Table 1 gives an overview of all historical datasets, the approximate time period of historical texts that they cover, as well as the size of the dataset splits." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-30", "text": "Note that, to the best of our knowledge, the Spanish, Portuguese, and German RIDGES datasets have not been used in the context of automatic historical text normalization before." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-31", "text": "Table 2 additionally gives some examples of historical word forms and their gold-standard normalizations from each of these datasets." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-32", "text": "3" }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-33", "text": "----------------------------------" }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-34", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-35", "text": "Model We use the same encoder-decoder architecture with attention as described in Bollmann et al. (2017) ." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-36", "text": "4 This is a fairly standard model consisting of one bidirectional LSTM unit in the encoder and one (unidirectional) LSTM unit in the decoder." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-37", "text": "The input for the encoder is a single historical word form represented as a sequence of characters and padded with word boundary symbols; i.e., we only input single tokens in isolation, not full sentences." 
}, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-38", "text": "The decoder attends over the encoder's outputs and generates the normalized output characters." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-39", "text": "----------------------------------" }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-40", "text": "**HYPERPARAMETERS**" }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-41", "text": "We use the same hyperparameters across all our experiments: The dimensionality of the embedding layer is 60, the size of the LSTM layers is set to 300, and we use a dropout rate of 0.2." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-42", "text": "We use the Adam optimizer (Kingma and Ba, 2014) with a character-wise cross-entropy loss." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-43", "text": "Training is done on mini-batches of 50 samples with early stopping based on validation on the individual development sets." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-44", "text": "The hyperparameters were set on a randomly selected subset of 50,000 tokens from each of the following datasets: English, German (Anselm), Hungarian, Icelandic, and Slovene (Gaj)." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-45", "text": "Bollmann et al. (2017) also describe a multi-task learning (MTL) scenario where the encoder-decoder model is trained on two datasets in parallel." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-46", "text": "We perform similar experiments on pairwise combinations of our datasets." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-47", "text": "Table 2 : Examples of input tokens (first line) and reference normalization (second line) for each of the historical datasets." 
}, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-48", "text": "----------------------------------" }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-49", "text": "**MULTI-TASK LEARNING**" }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-50", "text": "The question we ask here is whether training on pairs of datasets can improve over training on datasets individually, which pairings yield the best results, and what properties of the datasets are most predictive of this." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-51", "text": "In other words, we are interested in when multi-task learning works." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-52", "text": "In the multi-task learning setting, the two datasets-or \"tasks\"-share all parts of the encoder-decoder model except for the final prediction layer, which is specific to each dataset." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-53", "text": "This way, most parts of the model are forced to learn language-independent representations." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-54", "text": "This is different from Luong et al. (2015) and related work in machine translation, where typically only the encoder or the decoder is shared." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-55", "text": "We do not explore these alternatives here." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-56", "text": "During training, we iterate over both our datasets in parallel in a random order, with each parameter update now being based on 50 samples from each dataset." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-57", "text": "Since datasets are of different sizes, we define an epoch to be a fixed size of 50,000 samples." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-58", "text": "Validation is performed for both datasets after each epoch, and model states are saved independently for each one if its validation accuracy improved." 
}, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-59", "text": "This means that even if the ideal number of epochs is different for the datasets, only the best state for each dataset will be used in the end." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-60", "text": "Training ends only after the validation accuracy for each dataset has stopped improving." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-61", "text": "Sparse data scenario The training sets in our experiments range from ca. 25,000 to 230,000 tokens." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-62", "text": "Generally, historical text normalization suffers from scarce resources, and our biggest datasets are considered huge compared to what scholars typically have access to." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-63", "text": "Creating gold-standard normalizations is cumbersome and expensive, and for many languages and historic periods, it is not feasible to obtain big datasets." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-64", "text": "Therefore, we also present experiments on reduced datasets; instead of taking the full training sets, we only use the first 5,000 tokens from each one." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-65", "text": "In this case, for multi-task learning, we combine the small target datasets with the full auxiliary datasets." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-66", "text": "This procedure mimics a realistic scenario: If a researcher is interested in normalizing a language for which no manually normalized resource exists, they could conceivably create a small batch of manual normalizations for this language and then leverage an existing corpus in another language using multi-task learning." 
}, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-67", "text": "Table 3 : Normalization accuracy (in percent) using the full or sparse training sets, both for the single-task setup and the best-performing multitask (MTL) setup." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-68", "text": "----------------------------------" }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-69", "text": "**RESULTS**" }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-70", "text": "We evaluate our models using normalization accuracy, i.e., the percentage of correctly normalized word forms." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-71", "text": "Table 3 compares the accuracy scores of our single-task baseline models and for multi-task learning, in both the full and the sparse data scenario." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-72", "text": "For multi-task learning, we report the test set performance of the best target-auxiliary task pair combination, as evaluated on development data." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-73", "text": "Figure 1 visualizes the results for all pairwise combinations of the multi-task models; here, we show the error reduction of multitask learning over our single-task baseline to better highlight by how much the MTL setup changes the performance." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-74", "text": "----------------------------------" }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-75", "text": "**FULL DATASETS**" }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-76", "text": "We make two observations about the results for the full data scenario (the left side of Fig. 1 ): (i) the usefulness of multi-task learning depends more on the dataset that is being evaluated than the one it is trained together with; and (ii) for most datasets, multi-task learning is detrimental rather than beneficial." 
}, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-77", "text": "One hypothesis about multi-task learning is that its usefulness correlates with either synergistic or complementary properties of the datasets." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-78", "text": "In other words, it is conceivable that the performance on one dataset improves most with an MTL setup when it is paired with another dataset that is either (i) very similar, or (ii) provides an additional signal that is useful for, but not covered in, the first dataset." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-79", "text": "The results in Figure 1 show that there can indeed be considerable variation depending on the exact dataset combination; e.g., the error reduction on Slovene (Bohori\u010d) ranges from 5% (when paired with the Gaj dataset) to 33.2% (when paired with Swedish)." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-80", "text": "At the same time, the question whether multi-task learning helps at all seems to depend mostly on the dataset being evaluated." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-81", "text": "With few exceptions, for most datasets, the error rate either always improves or always worsens, independently of the auxiliary task." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-82", "text": "Considering the dataset statistics in Table 1 , it appears that the size of the training corpus is the most important factor for these results." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-83", "text": "The four corpora that consistently benefit from MTLGerman (RIDGES), Icelandic, Slovene (Bohori\u010d), and Swedish-also have the smallest training sets, with about 50,000 tokens or less." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-84", "text": "For other tasks, different patterns have been observed (Mart\u00ednez Alonso and Plank, 2017; Bingel and S\u00f8-gaard, 2017 ); see Sec. 5." 
}, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-85", "text": "----------------------------------" }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-86", "text": "**SPARSE DATASETS**" }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-87", "text": "In the sparse data scenario where only 5,000 tokens are used for training (right side of Fig. 1 ), MTL almost always leads to improvements over the single-task training setup." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-88", "text": "This further confirms the hypothesis that multitask learning is beneficial for historical text normalization when the target task dataset is small." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-89", "text": "----------------------------------" }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-90", "text": "**ENGLISH WITH NON-HISTORICAL AUXILIARY DATA**" }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-91", "text": "We also conduct a follow-up experiment on the (sparse) English dataset using a Twitter normalization dataset (Han and Baldwin, 2011 ) and a grammatical error corpus (Yannakoudakis et al., 2011) as auxiliary data." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-92", "text": "The results are presented in Table 4 ." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-93", "text": "Surprisingly, the Twitter dataset is actually more helpful than the best historical dataset; but of course, it is also in-language, unlike the historical datasets." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-94", "text": "Figure 1: Percentage change of error of MTL over single-task models; rows are targets, columns auxiliary data." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-95", "text": "Left: full data; right: sparse data." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-96", "text": "Blue scores are improvements, reds increases in error." 
}, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-97", "text": "----------------------------------" }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-98", "text": "**RELATED WORK AND CONCLUSION**" }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-99", "text": "There has been considerable work on multitask sequence-to-sequence models for other tasks (Dong et al., 2015; Luong et al., 2015; Elliott and K\u00e1d\u00e1r, 2017) ." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-100", "text": "There is a wide range of design questions and sharing strategies that we ignore here, focusing instead on under what circumstances the approach advocated in (Bollmann et al., 2017) works." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-101", "text": "Our main observation-that the size of the target dataset is most predictive of multi-task learning gains-runs counter previous findings for other NLP tasks (Mart\u00ednez Alonso and Plank, 2017; Bingel and S\u00f8gaard, 2017) ." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-102", "text": "Mart\u00ednez Alonso and Plank (2017) find that the label entropy of the auxiliary dataset is more predictive; Bingel and S\u00f8-gaard (2017) find that the relative differences in the steepness of the two single-task loss curves is more predictive." }, { "sent_id": "1540b0b172971ac75771b414765f1d-C001-103", "text": "Both papers consider sequence tagging problems with a small number of labels; and it is probably not a surprise that their findings do not seem to scale to the case of historical text normalization." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "1540b0b172971ac75771b414765f1d-C001-14" ], [ "1540b0b172971ac75771b414765f1d-C001-15" ], [ "1540b0b172971ac75771b414765f1d-C001-45" ] ], "cite_sentences": [ "1540b0b172971ac75771b414765f1d-C001-14", "1540b0b172971ac75771b414765f1d-C001-15", "1540b0b172971ac75771b414765f1d-C001-45" ] }, "@USE@": { "gold_contexts": [ [ "1540b0b172971ac75771b414765f1d-C001-19" ], [ "1540b0b172971ac75771b414765f1d-C001-24" ], [ "1540b0b172971ac75771b414765f1d-C001-35" ], [ "1540b0b172971ac75771b414765f1d-C001-44" ], [ "1540b0b172971ac75771b414765f1d-C001-100" ] ], "cite_sentences": [ "1540b0b172971ac75771b414765f1d-C001-19", "1540b0b172971ac75771b414765f1d-C001-24", "1540b0b172971ac75771b414765f1d-C001-35", "1540b0b172971ac75771b414765f1d-C001-44", "1540b0b172971ac75771b414765f1d-C001-100" ] }, "@MOT@": { "gold_contexts": [ [ "1540b0b172971ac75771b414765f1d-C001-19" ] ], "cite_sentences": [ "1540b0b172971ac75771b414765f1d-C001-19" ] }, "@SIM@": { "gold_contexts": [ [ "1540b0b172971ac75771b414765f1d-C001-45", "1540b0b172971ac75771b414765f1d-C001-46" ] ], "cite_sentences": [ "1540b0b172971ac75771b414765f1d-C001-45" ] } } }, "ABC_81499fd759b958a0c02d9ed9d72a46_3": { "x": [ { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-2", "text": "Crosslingual word embeddings represent lexical items from different languages using the same vector space, enabling crosslingual transfer." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-3", "text": "Most prior work constructs embeddings for a pair of languages, with English on one side." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-4", "text": "We investigate methods for building high quality crosslingual word embeddings for many languages in a unified vector space." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-5", "text": "In this way, we can exploit and combine information from many languages." 
}, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-6", "text": "We report competitive performance on bilingual lexicon induction, monolingual similarity and crosslingual document classification tasks." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-7", "text": "----------------------------------" }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-9", "text": "Monolingual word embeddings have facilitated advances in many natural language processing tasks, such as natural language understanding (Collobert and Weston, 2008) , sentiment analysis , and dependency parsing (Dyer et al., 2015) ." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-10", "text": "Crosslingual word embeddings represent words from several languages in the same low dimensional space." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-11", "text": "They are helpful for multilingual tasks such as machine translation (Brown et al., 1993) and bilingual named entity recognition (Wang et al., 2013) ." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-12", "text": "Crosslingual word embeddings can also be used in transfer learning, where the source model is trained on one language and applied directly to another language; this is suitable for the low-resource scenario (Yarowsky and Ngai, 2001; Duong et al., 2015b; Das and Petrov, 2011; T\u00e4ckstr\u00f6m et al., 2012) ." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-13", "text": "Most prior work on building crosslingual word embeddings focuses on a pair of languages." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-14", "text": "English is usually on one side, thanks to the wealth of available English resources." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-15", "text": "However, it is highly desirable to have a crosslingual word embeddings for many languages so that different relations can be exploited." 
}, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-16", "text": "1 For example, since Italian and Spanish are similar, they are excellent candidates for transfer learning." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-17", "text": "However, few parallel resources exist between Italian and Spanish for directly building bilingual word embeddings." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-18", "text": "Our multilingual word embeddings, on the other hand, map both Italian and Spanish to the same space without using any direct bilingual signal between them." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-19", "text": "In addition, multilingual word embeddings allow multiple source language transfer learning, producing a more general model and overcoming data sparseness (McDonald et al., 2011; Guo et al., 2016; Agi\u0107 et al., 2016) ." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-63", "text": "This leads to the new objective function" }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-20", "text": "Moreover, multilingual word embeddings are also crucial for multilingual applications such as multi-source machine translation (Zoph and Knight, 2016) , and multisource transfer dependency parsing (McDonald et al., 2011; Duong et al., 2015a) ." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-21", "text": "We propose several algorithms to map bilingual word embeddings to the same vector space, either during training or during post-processing." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-22", "text": "We apply a linear transformation to map the English side of each pretrained crosslingual word embedding to the same space." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-23", "text": "We also extend Duong et al. (2016) , which used a lexicon to learn bilingual word embeddings." 
}, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-24", "text": "We modify the objective function to jointly build multilingual word embeddings during training." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-25", "text": "Unlike most prior work which focuses on downstream applications, we measure the quality of our multilingual word embeddings in three ways: bilingual lexicon induction, monolingual word similarity, and crosslingual document classification tasks." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-26", "text": "Relative to a benchmark of training on each language pair separately and to various published multilingual word embeddings, we achieved high performance for all the tasks." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-27", "text": "In this paper we make the following contributions: (a) novel algorithms for post hoc combination of multiple bilingual word embeddings, applicable to any pretrained bilingual model; (b) a method for jointly learning multilingual word embeddings, extending Duong et al. (2016) , to jointly train over monolingual corpora in several languages; (c) achieving competitive results in bilingual, monolingual and crosslingual transfer settings." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-28", "text": "----------------------------------" }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-29", "text": "**RELATED WORK**" }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-30", "text": "Crosslingual word embeddings are typically based on co-occurrence statistics from parallel text (Luong et al., 2015; Gouws et al., 2015; Chandar A P et al., 2014; Klementiev et al., 2012; Ko\u010disk\u00fd et al., 2014; Huang et al., 2015) ." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-31", "text": "Other work uses more widely available resources such as comparable data (Vuli\u0107 and Moens, 2015) and shared Wikipedia entries (S\u00f8gaard et al., 2015) ." 
}, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-32", "text": "However, those approaches rely on data from Wikipedia, and it is non-trivial to extend them to languages that are not covered by Wikipedia." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-33", "text": "Lexicons are another source of bilingual signal, with the advantage of high coverage." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-34", "text": "Multilingual lexical resources such as PanLex (Kamholz et al., 2014) and Wiktionary 2 cover thousands of languages, and have been used to construct high performance crosslingual word embeddings (Mikolov et al., 2013a; Xiao and Guo, 2014; Faruqui and Dyer, 2014) ." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-35", "text": "Previous work mainly focuses on building word embeddings for a pair of languages, typically with English on one side, with the exception of Coulmance et al. (2015) , S\u00f8gaard et al. (2015) and Ammar et al. (2016) ." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-36", "text": "Coulmance et al. (2015) extend the bilingual skipgram model from Luong et al. (2015) , training jointly over many languages using the Europarl corpora." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-37", "text": "We also compare our models with an extension of Huang et al. (2015) adapted for multiple languages also using bilingual corpora." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-38", "text": "However, parallel data is an expensive resource and using parallel data seems to under-perform on the bilingual lexicon induction task (Vuli\u0107 and Moens, 2015) ." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-39", "text": "While Coulmance et al. (2015) use English as the pivot language, S\u00f8gaard et al. (2015) learn multilingual word em-beddings for many languages using Wikipedia entries which are the same for many languages." 
}, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-40", "text": "However, their approach is limited to languages covered by Wikipedia and seems to underperform other methods." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-41", "text": "Ammar et al. (2016) propose two algorithms, MultiCluster and MultiCCA, for multilingual word embeddings using a set of bilingual lexicons." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-42", "text": "MultiCluster first builds a graph whose nodes are lexical items and whose edges are translations." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-43", "text": "Each cluster in this graph is an anchor point for building multilingual word embeddings." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-44", "text": "MultiCCA is an extension of Faruqui and Dyer (2014), performing canonical correlation analysis (CCA) for multiple languages using English as the pivot." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-45", "text": "A shortcoming of MultiCCA is that it ignores polysemous translations by retaining only one-to-one dictionary pairs (Gouws et al., 2015), disregarding much information." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-46", "text": "As a solution, we propose a simple post hoc method that maps the English parts of the bilingual word embeddings to each other." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-47", "text": "In this way, the mapping is always exact and one-to-one." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-48", "text": "Duong et al. (2016) constructed bilingual word embeddings based on monolingual data and PanLex." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-49", "text": "In this way, their approach can be applied to more languages, as PanLex covers more than a thousand languages." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-50", "text": "They solve the polysemy problem by integrating an EM algorithm for selecting translations from the lexicon." 
}, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-51", "text": "Relative to many previous crosslingual word embeddings, their joint training algorithm achieved state-of-the-art performance on the bilingual lexicon induction task, performed significantly better on monolingual similarity, and achieved a competitive result on crosslingual document classification." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-52", "text": "Here we also adopt their approach, and extend it to multilingual embeddings." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-53", "text": "----------------------------------" }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-54", "text": "**BASE MODEL FOR BILINGUAL EMBEDDINGS**" }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-55", "text": "We briefly describe the base model (Duong et al., 2016), an extension of the continuous bag-of-words (CBOW) model (Mikolov et al., 2013a) with negative sampling." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-56", "text": "The original objective function is" }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-57", "text": "where D is the training data, h_i = (1/2k) \u2211_{j=\u2212k, j\u22600}^{k} v_{w_{i+j}} is a vector encoding the context over a window of size k centred around position i, V and U \u2208 R^{|V_e| \u00d7 d} are learned matrices referred to as the context and centre word embeddings, where V_e is the vocabulary, and p is the number of negative examples randomly drawn from a noise distribution, w_{ij} \u223c P_n(w)." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-58", "text": "Duong et al. (2016) extend the CBOW model to two languages, using monolingual text in both languages and a bilingual lexicon." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-59", "text": "Their approach augments CBOW by generating not only the middle word, but also its translation in the other language." 
}, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-60", "text": "This is done by first selecting a translation w\u0304_i from the lexicon for the middle word w_i, based on the cosine distance between the context h_i and the context embeddings V for each candidate foreign translation." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-61", "text": "In this way, source monolingual training contexts must generate both source and target words, and similarly target monolingual training contexts also generate source and target words." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-62", "text": "Overall this results in compatible word embeddings across the two languages, and highly informative nearest neighbours across the two languages." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-64", "text": "where D_s and D_t are the source and target monolingual data, and V_s and V_t are the source and target vocabularies." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-65", "text": "Compared with the CBOW objective function in Equation (1), this represents two additions: the translation cross-entropy log \u03c3(u_{w\u0304_i} \u00b7 h_i), and a regularisation term \u2211_{w \u2208 V_s \u222a V_t} \u2016u_w \u2212 v_w\u2016\u00b2_2 which penalises divergence between the context and centre word embedding vectors for each word type; this was shown to improve embedding quality (Duong et al., 2016)." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-66", "text": "----------------------------------" }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-67", "text": "**POST HOC UNIFICATION OF EMBEDDINGS**" }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-68", "text": "Our goal is to learn multilingual word embeddings over more than two languages." 
}, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-69", "text": "One simple way to do this is to take several learned bilingual word embeddings which share a common target language (here, English), and map them into a shared space (Mikolov et al., 2013a; Faruqui and Dyer, 2014)." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-70", "text": "In this section we propose post hoc methods; in \u00a74 we develop an integrated multilingual method using joint inference." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-71", "text": "Formally, the input to the post hoc combination methods is a set of n pre-trained bilingual word embedding matrices C_i = (E_i, F_i) for i \u2208 F, where F is the set of foreign (non-English) languages, E_i \u2208 R^{|V_{e_i}| \u00d7 d} are the English word embeddings and F_i \u2208 R^{|V_{f_i}| \u00d7 d} are the foreign-language word embeddings for language i, with V_{e_i} and V_{f_i} the English and foreign vocabularies and d the embedding dimension." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-72", "text": "These bilingual embeddings can be produced by any method, e.g., those discussed in \u00a72." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-73", "text": "Linear Transformation." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-74", "text": "The simplest method is to learn a linear transformation which maps the English part of each bilingual word embedding into the same space (inspired by Mikolov et al. (2013a)), as illustrated in Figure 1." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-75", "text": "One language pair is chosen as the pivot, en-it in this example, and the English sides of the other language pairs, en-de, en-es and en-nl, are mapped to closely match the English side of the pivot, en-it." 
}, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-76", "text": "This is achieved by learning a linear transformation matrix for each language, W_de, W_es and W_nl respectively, where each W_i \u2208 R^{d \u00d7 d} is learned to minimize the objective function \u2016E_i W_i \u2212 E_pivot\u2016\u00b2_2, where E_pivot is the English embedding of the pivot pair, en-it." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-77", "text": "Each foreign language f_i is then mapped into the same space using the learned matrix W_i, i.e., F\u2032_i = F_i W_i." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-78", "text": "These projected foreign embeddings are then used in evaluation, along with the English side of the language pair with the largest English vocabulary coverage, i.e., the biggest |V_{e_i}|." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-79", "text": "Together these embeddings allow querying of monolingual and crosslingual word similarity, and multilingual transfer of trained models." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-80", "text": "The advantage of this approach is that it is very fast and simple to train, since the objective function is strictly convex and has a closed-form solution." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-81", "text": "Moreover, unlike Mikolov et al. (2013a), who learn the projection from a source to a target language, we learn the projection from English to English, and thus do not require a lexicon, sidestepping the polysemy problem." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-82", "text": "[Figure 2 caption: Example of our multilingual joint training model (without mapping) for learning multilingual embeddings for three languages, en, it and de, using joint inference.]" 
}, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-84", "text": "----------------------------------" }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-85", "text": "**MULTILINGUAL JOINT TRAINING**" }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-86", "text": "Instead of combining bilingual word embeddings in a post-processing step, it might be more beneficial to do so during training, so that languages can interact with each other more freely." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-87", "text": "We extend the method in \u00a72.1 to jointly learn the multilingual word embeddings during training." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-88", "text": "The input to the model is the combined monolingual data for each language and the set of lexicons between any language pair." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-89", "text": "We modify the base model (Duong et al., 2016) to accommodate more languages." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-90", "text": "For the first step, instead of predicting the translation for a single target language, we predict the translation for all languages in the lexicon." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-91", "text": "That is, we compute w\u0304^f_i = argmax_{w \u2208 dict_{fe}(w^e_i)} cos(v_w, h_i)," }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-92", "text": "which is the best translation in language f of source word w^e_i in language e, given the bilingual lexicon dict_{fe} and the context." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-93", "text": "For the second step, we jointly predict the word w^e_i and all translations w\u0304^f_i in all foreign languages f \u2208 T for which we have a dictionary dict_{fe}, as illustrated in Figure 2." 
}, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-94", "text": "A possible criticism of this approach is that a linear transformation is not powerful enough for the required mapping." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-95", "text": "We experimented with non-linear transformations but did not observe any improvements." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-96", "text": "Faruqui and Dyer (2014) extended Mikolov et al. (2013a) by projecting both source and target languages into the same space using canonical correlation analysis (CCA)." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-97", "text": "We also adopted this approach in the multilingual setting, applying multi-view CCA to map the English part of each pre-trained bilingual word embedding into the same space." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-98", "text": "However, we observed only minor improvements." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-99", "text": "The English word cat might have several translations in German {Katze, Raupe, Typ} and Italian {gatto, gatta}. In the first step, we select the closest translation given the context for each language, i.e., Katze and gatto for German and Italian respectively." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-100", "text": "In the second step, we jointly predict the English word cat together with the selected translations Katze and gatto, using the following modified objective function:" }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-101", "text": "where D_all and V_all are the combined monolingual data and vocabulary for all languages." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-102", "text": "Each of the p negative samples w_{ij} is sampled from a unigram model over the combined vocabulary V_all." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-103", "text": "Explicit mapping." 
}, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-104", "text": "As we keep adding languages to the model, the hidden layer (shared between all languages) might not have enough capacity to accommodate them all." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-105", "text": "However, we can combine the strength of the linear transformation proposed in \u00a73 with our joint model as described in Equation (3)." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-106", "text": "We explicitly learn the linear transformation jointly during training by adding the following regularization term to the objective function:" }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-107", "text": "where D_e is the English monolingual data (since we use English as the pivot language), F is the set of foreign (non-English) languages, W_f \u2208 R^{d \u00d7 d} is the linear transformation matrix, and \u03b1 controls the contribution of the regularization term and will be tuned in \u00a76." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-108", "text": "Thus, the set of learned parameters for the model are the word and context embeddings U, V and the |F| linear transformation matrices {W_f}_{f \u2208 F}." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-109", "text": "After training is finished, we linearly transform the foreign language embeddings with the corresponding learned matrix W_f, such that all embeddings are in the same space." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-110", "text": "----------------------------------" }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-111", "text": "**EXPERIMENT SETUP**" }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-112", "text": "Our experimental setup is based on that of Duong et al. (2016)." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-113", "text": "[Table 1 caption: BiWE baselines are from Duong et al. (2016), where each pair is trained separately." 
}, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-114", "text": "Our proposed methods include linear transformation (Linear), joint prediction as in Equation (3) (Joint) and joint prediction with explicit mapping as in Equation (4) (+Mapping)." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-115", "text": "We report recall at 1 and 5 with respect to four baseline multilingual word embeddings." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-116", "text": "The best scores are shown in bold.]" }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-117", "text": "We use the first 5 million sentences from the tokenized monolingual data of the Wikipedia dump from Al-Rfou et al. (2013)." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-118", "text": "The dictionary is from PanLex, which covers more than 1,000 language varieties." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-119", "text": "We build multilingual word embeddings for 5 languages (en, it, es, nl, de) jointly, using the same parameters as Duong et al. (2016)." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-120", "text": "During training, for a fairer comparison, we only use lexicons between English and each target language." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-121", "text": "However, it is straightforward to incorporate a lexicon between any pair of languages into our model." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-122", "text": "The pretrained bilingual word embeddings for the postprocessing experiment in \u00a73 are also from Duong et al. (2016)." 
}, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-123", "text": "In the following sections, we evaluate the performance of our multilingual word embeddings in comparison with bilingual word embeddings and previously published multilingual word embeddings (MultiCluster, MultiCCA, MultiSkip and MultiTrans) for three tasks: bilingual lexicon induction (\u00a76), monolingual similarity (\u00a77) and crosslingual document classification (\u00a78)." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-124", "text": "MultiCluster and MultiCCA are the models proposed by Ammar et al. (2016), trained on monolingual data using bilingual lexicons extracted by aligning the Europarl corpus." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-125", "text": "MultiSkip is a reimplementation of the multilingual skipgram model from Coulmance et al. (2015)." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-126", "text": "(We use the whole data if there are fewer than 5 million sentences.)" }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-127", "text": "(Default parameters: learning rate 0.025, negative sampling with 25 samples, subsampling rate 1e\u22124, embedding dimension d = 200, window size 48, run for 15 epochs, and \u03b4 = 0.01 for combining word and context embeddings.)" }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-128", "text": "MultiTrans is the multilingual version of the translation invariance model from Huang et al. (2015)." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-129", "text": "Both MultiSkip and MultiTrans are trained directly on parallel data from Europarl." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-130", "text": "All of these previously published embeddings were trained with 512 dimensions on 12 languages and acquired directly from Ammar et al. (2016)." 
}, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-131", "text": "----------------------------------" }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-132", "text": "**BILINGUAL LEXICON INDUCTION**" }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-133", "text": "In this section we evaluate our multilingual models on the bilingual lexicon induction (BLI) task, which tests the bilingual quality of the model." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-134", "text": "Given a word in the source language, the model must predict the translation in the target language." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-135", "text": "We report recall at 1 and 5 for the various models listed in Table 1." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-136", "text": "The evaluation data for the it-en, es-en, and nl-en pairs was manually constructed (Vuli\u0107 and Moens, 2015)." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-137", "text": "We extend the evaluation to the nl-es pair, which does not involve English." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-138", "text": "The BiWE results for pairs involving English in Table 1 are from Duong et al. (2016), the current state of the art for this task." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-139", "text": "For the nl-es pair, we cannot build bilingual word embeddings directly, since we do not have a corresponding bilingual lexicon." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-140", "text": "Instead, we use English as the pivot language." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-141", "text": "To get the nl-es translation, we use the two bilingual embeddings nl-en and es-en from Duong et al. (2016)." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-142", "text": "We get the best English translation for the Dutch word, and then the top 5 Spanish translations for that English word." 
}, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-143", "text": "This simple trick performs surprisingly well, probably because bilingual word embeddings involving English, such as nl-en and es-en from Duong et al. (2016), are very accurate." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-144", "text": "For the linear transformation, we use the first pair, it-en, as the pivot and learn to project the es-en, de-en and nl-en pairs into this space, as illustrated in Figure 1." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-145", "text": "We use the English part (E_biggest) of the transformed de-en pair as the English output." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-146", "text": "Despite its simplicity, the linear transformation performs surprisingly well." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-147", "text": "Our joint model, which predicts all target languages simultaneously as described in Equation (3), performs consistently better than the linear transformation on all language pairs." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-148", "text": "The joint model with explicit mapping, as described in Equation (4), can be understood as a combination of the joint model and the linear transformation." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-149", "text": "For this model, we need to tune \u03b1 in Equation (4)." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-150", "text": "We tested \u03b1 with values in the range {10^{\u2212i}}_{i=0}^{5} using the es-en pair on the BLI task." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-151", "text": "\u03b1 = 0.1 gives the best performance." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-152", "text": "To avoid over-fitting, we use the same value of \u03b1 for all experiments and all other pairs." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-153", "text": "With this tuned value of \u03b1, our joint model with mapping clearly outperforms the other proposed methods on all pairs." 
}, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-154", "text": "More importantly, this result is substantially better than all the baselines across four language pairs and two evaluation metrics." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-155", "text": "Compared with the state of the art (BiWE), our final model (Joint + Mapping) is more general and more widely applicable, yet achieves better results, especially for recall at 5." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-156", "text": "----------------------------------" }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-157", "text": "**MONOLINGUAL SIMILARITY**" }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-158", "text": "The multilingual word embeddings should preserve the monolingual properties of each language." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-159", "text": "We evaluate using the monolingual similarity task proposed in Luong et al. (2015)." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-160", "text": "In this task, the model is asked to give a similarity score for a pair of words in the same language." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-161", "text": "This score is then measured against human judgment." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-162", "text": "Following Duong et al. (2016), we evaluate on three datasets: WordSim353 (WS-en), RareWord (RW-en), and the German version of WordSim353 (WS-de) (Finkelstein et al., 2001; Luong et al., 2013; Luong et al., 2015)." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-163", "text": "Table 2 shows the results of our multilingual word embeddings with respect to several baselines." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-164", "text": "The trend is similar to the bilingual lexicon induction task." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-165", "text": "The linear transformation performs surprisingly well." 
}, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-166", "text": "Our joint model achieves results similar to the linear transformation (better on WS-de but worse on WS-en and RW-en)." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-167", "text": "Our joint model with explicit mapping recovers this drop and performs slightly better than the linear transformation." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-168", "text": "More importantly, this model is substantially better than all baselines, except MultiTrans on the RW-en dataset." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-169", "text": "This can probably be explained by the low coverage of MultiTrans on this dataset." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-170", "text": "Our final model (Joint + Mapping) is also close to the best bilingual word embedding (BiWE) performance reported by Duong et al. (2016)." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-171", "text": "----------------------------------" }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-172", "text": "**CROSSLINGUAL DOCUMENT CLASSIFICATION**" }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-173", "text": "In the previous sections, we have shown that our methods for building multilingual word embeddings, whether applied as a post-processing step or during training, preserve high-quality bilingual and monolingual relations." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-174", "text": "In this section, we demonstrate the usefulness of multi-language crosslingual word embeddings through the crosslingual document classification (CLDC) task." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-175", "text": "This task exploits transfer learning, where the document classifier is trained on the source language and tested on the target language." 
}, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-176", "text": "[Table 3 caption: Crosslingual document classification accuracy for various models." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-177", "text": "Chandar A P et al. (2014) and Luong et al. (2015) achieved state-of-the-art results for en\u2192de and de\u2192en respectively, and serve as references." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-178", "text": "The best results for bilingual and multilingual word embeddings are in bold.]" }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-179", "text": "The source-language classifier is transferred to the target language using crosslingual word embeddings: each document is represented as the sum of its bag-of-word embeddings weighted by tf.idf." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-180", "text": "This setting is useful for low-resource target languages where annotated data is insufficient." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-181", "text": "The training and test data are from the multilingual RCV1/RCV2 corpora (Lewis et al., 2004), where each document is annotated with labels from 4 categories: CCAT (Corporate/Industrial), ECAT (Economics), GCAT (Government/Social) and MCAT (Markets)." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-182", "text": "We extend the evaluation from Klementiev et al. (2012) to cover more language pairs." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-183", "text": "We use the same data split for the en\u2192de and de\u2192en pairs, and additionally construct training and test data for it\u2192de, it\u2192es and en\u2192es." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-184", "text": "For each pair, we use 1,000 documents in the source language as the training data and 5,000 documents in the target language as the test data." 
}, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-185", "text": "The training data is randomly sampled, but the test data (for es) is evenly balanced among labels." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-186", "text": "Table 3 shows the accuracy on the CLDC task for many pairs and models, with respect to the baselines." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-187", "text": "For all bilingual models (Duong et al., 2016; Luong et al., 2015; Chandar A P et al., 2014), the bilingual word embeddings are constructed for each pair separately." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-188", "text": "Thus they only cover pairs involving English, since most bilingual resources have English on one side." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-189", "text": "For all our models, including Linear, Joint and Joint + Mapping, the embedding space is available for multiple languages; this is why we can exploit different relations, such as it\u2192es." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-190", "text": "This is the motivation for the work reported in this paper." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-191", "text": "Suppose we want to build a document classifier for es but lack any annotations." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-192", "text": "It is common to build en-es crosslingual word embeddings for transfer learning, but this only achieves 53.8% accuracy." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-193", "text": "Yet when we use Italian (it) as the source, we get 81.0% accuracy." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-194", "text": "This is explained by the fact that it and es are very similar languages." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-195", "text": "The trend observed in Table 3 is consistent with previous observations." 
}, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-196", "text": "The linear transformation performs well." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-197", "text": "Joint training performs better, especially for the it\u2192de pair." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-198", "text": "The joint model with explicit mapping is generally our best model, even better than the base bilingual model from Duong et al. (2016)." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-199", "text": "The de\u2192en result improves on the existing state of the art reported in Luong et al. (2015)." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-200", "text": "Our final model (Joint + Mapping) achieved competitive results compared with four strong baseline multilingual word embeddings, achieving the best results for two out of five pairs." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-201", "text": "Moreover, the best score for each language pair always comes from multilingual training, emphasizing its advantages over bilingual training." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-202", "text": "Mikolov et al. (2013b) showed that monolingual word embeddings capture some analogy relations, such as Paris \u2212 France + Italy \u2248 Rome." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-203", "text": "These relations still appear to hold in our multilingual embeddings." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-204", "text": "[Table 5 caption: Performance of our model compared with MultiCluster and MultiCCA using the extrinsic and intrinsic evaluation tasks on 12 languages proposed in Ammar et al. (2016); all models are trained on the same dataset." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-205", "text": "The best score for each task is shown in bold.]" 
}, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-206", "text": "----------------------------------" }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-207", "text": "**ANALYSIS**" }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-208", "text": "While MultiSkip and MultiTrans are trained on parallel corpora, MultiCluster and MultiCCA use monolingual corpora and bilingual lexicons, which is similar to our proposed methods." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-209", "text": "Therefore, for a strict comparison, we train our best model (Joint + Mapping) using the same monolingual data and the same set of bilingual lexicons on the same 12 languages as MultiCluster and MultiCCA." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-210", "text": "Table 5 shows the performance on the intrinsic and extrinsic tasks proposed in Ammar et al. (2016)." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-211", "text": "Multilingual dependency parsing and document classification are trained on a set of source languages and tested on a target language in the transfer learning setting." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-212", "text": "The monolingual word similarity task is similar to our monolingual similarity task described in \u00a77; multilingual word similarity extends it to pairs of words in different languages." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-213", "text": "Monolingual QVEC and multilingual QVEC test the linguistic content of word embeddings in the monolingual and multilingual settings." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-214", "text": "Monolingual QVEC-CCA and multilingual QVEC-CCA are extended versions of monolingual QVEC and multilingual QVEC, also proposed in Ammar et al. (2016)." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-215", "text": "(The comparison is also strict with respect to word coverage, since MultiSkip and MultiTrans usually have much lower word coverage, biasing the intrinsic evaluations.)" 
}, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-216", "text": "Table 5 shows that our model achieved competitive results, best at 4 out of 9 evaluation tasks." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-217", "text": "----------------------------------" }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-218", "text": "**CONCLUSION**" }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-219", "text": "In this paper, we introduced several methods for building unified multilingual word embeddings." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-220", "text": "These represent an improvement because they exploit more relations and combine information from many languages." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-221", "text": "The input to our model is just a set of monolingual data and a set of bilingual lexicons between any language pairs." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-222", "text": "We induce the bilingual relationship for all language pairs while keeping high quality monolingual relations." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-223", "text": "Our multilingual joint training model with explicit mapping consistently achieves better performance compared with linear transformation." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-224", "text": "We achieve new state-of-the-art performance on bilingual lexicon induction task for recall at 5, similar excellent results with the state-of-the-art bilingual word embeddings on monolingual similarity task (Duong et al., 2016) ." }, { "sent_id": "81499fd759b958a0c02d9ed9d72a46-C001-225", "text": "Moreover, our model is competitive at the crosslingual document classification task, achieving a new state of the art for de\u2192en and it\u2192de pair." 
} ], "y": { "@EXT@": { "gold_contexts": [ [ "81499fd759b958a0c02d9ed9d72a46-C001-23" ], [ "81499fd759b958a0c02d9ed9d72a46-C001-48", "81499fd759b958a0c02d9ed9d72a46-C001-52" ] ], "cite_sentences": [ "81499fd759b958a0c02d9ed9d72a46-C001-23", "81499fd759b958a0c02d9ed9d72a46-C001-48", "81499fd759b958a0c02d9ed9d72a46-C001-52" ] }, "@BACK@": { "gold_contexts": [ [ "81499fd759b958a0c02d9ed9d72a46-C001-48" ] ], "cite_sentences": [ "81499fd759b958a0c02d9ed9d72a46-C001-48" ] }, "@USE@": { "gold_contexts": [ [ "81499fd759b958a0c02d9ed9d72a46-C001-48", "81499fd759b958a0c02d9ed9d72a46-C001-52" ] ], "cite_sentences": [ "81499fd759b958a0c02d9ed9d72a46-C001-48", "81499fd759b958a0c02d9ed9d72a46-C001-52" ] } } }, "ABC_91e4fd2556d4a04e477ea97208b218_4": { "x": [ { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-2", "text": "Question generation from a knowledge base (KB) is the task of generating questions related to the domain of the input KB." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-3", "text": "We propose a system for generating fluent and natural questions from a KB, which significantly reduces the human effort by leveraging massive web resources." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-4", "text": "In more detail, a seed question set is first generated by applying a small number of hand-crafted templates on the input KB, then more questions are retrieved by iteratively forming already obtained questions as search queries into a standard search engine, before finally questions are selected by estimating their fluency and domain relevance." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-5", "text": "Evaluated by human graders on 500 random-selected triples from Freebase, questions generated by our system are judged to be more fluent than those of Serban et al. (2016) by human graders." 
}, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-6", "text": "----------------------------------" }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-8", "text": "Question generation is important as questions are useful for student assessment or coaching purposes in educational or professional contexts, and a large-scale corpus of question and answer pairs is also critical to many NLP tasks including question answering, dialogue interaction and intelligent tutoring systems." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-9", "text": "There has been much literature so far (Chen et al., 2009; Ali et al., 2010; Heilman and Smith, 2010; Curto et al., 2012; Lindberg et al., 2013; Mazidi and Nielsen, 2014; Labutov et al., 2015) studying question generation from text." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-10", "text": "Recently people are becoming interested in question generation from KB, since large-scale KBs, such as Freebase (Bollacker et al., 2008) and DBPedia (Auer et al., 2007) , are freely available, and entities and their relations are already present in KBs but not for texts." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-11", "text": "Question generation from KB is challenging as function words and morphological forms for entities are abstracted away when a KB is created." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-79", "text": "**SHOWN IN**" }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-12", "text": "To tackle this challenge, previous work (Seyler et al., 2015; Serban et al., 2016 ) relies on massive human-labeled data." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-13", "text": "Treating question generation as a machine translation problem, Serban et al. (2016) train a neural machine translation (NMT) system with 10,000 triple 1 , question pairs." 
}, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-14", "text": "At test time, input triples are \"translated\" into questions with the NMT system." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-15", "text": "On the other hand, the question part of the 10,000 pairs are human generated, which requires a large amount of human effort." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-16", "text": "In addition, the grammaticality and naturalness of generated questions can not be guaranteed (as seen in Table 1 )." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-17", "text": "We propose a system for generating questions from KB that significantly reduces the human effort by leveraging the massive web resources." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-18", "text": "Given a KB, a small set of question templates are first hand-crafted based on the predicates in the KB." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-19", "text": "These templates consist of a transcription of the predicate in the KB (e.g. performsActivity\u21d2how to) and placeholders for the subject (#X#) and the object (#Y#)." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-20", "text": "A seed question set is then generated by applying the templates on the KB." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-21", "text": "The seed question set is further expanded through a search engine (e.g., Google, Bing), by iteratively forming each generated question as a search query to retrieve more related question candidates." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-22", "text": "Finally a selection step is applied by estimating the fluency and domain relevance of each question candidate." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-23", "text": "The only human labor in this work is the question template construction." 
}, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-24", "text": "Our system does not require a large number of templates because: (1) the iterative question expansion can produce a large number of questions even with a relatively small number of seed questions, as we see in the experiments, (2) multiple entities in the KB share the same predicates." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-25", "text": "Another advantage is that our system can easily generate updated questions as web is self-updating consistently." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-26", "text": "In our experiment, we compare with Serban et al. (2016) on 500 random selected triples from Freebase (Bollacker et al., 2008) ." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-27", "text": "Evaluated by 3 human graders, questions generated by our system are significantly better then Serban et al. (2016) on grammaticality and naturalness." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-28", "text": "----------------------------------" }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-29", "text": "**KNOWLEDGE BASE**" }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-30", "text": "A knowledge base (KB) can be viewed as a directed graph, in which nodes are entities (such as \"jigsaw\" and \"CurveCut\") and edges are relations of entities (such as \"performsActivity\")." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-31", "text": "A KB can also be viewed as a list of triples in the format of subject, predicate, object , where subjects and objects are entities, and predicates are relations." 
}, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-32", "text": "----------------------------------" }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-33", "text": "**SYSTEM**" }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-34", "text": "Shown in Figure 1 , our system contains the submodules of question template construction, seed question generation, question expansion and selection." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-35", "text": "Given an input KB, a small set of question templates is first constructed such that each template is associated with a predicate, then a seed question set is generated by applying the template set on the input KB, before finally more questions are generated from related questions that are iteratively retrieved from a search engine with already-obtained questions as search queries (section 3.1)." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-36", "text": "Taking our in-house KB of power tool domain as an example, template \"how to use #X#\" is first constructed for predicate \"performsActivity\"." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-37", "text": "In addition, seed question \"how to use jigsaw\" is generated by applying the template on triple \" jigsaw, performsActivity, CurveCut \", be-" }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-38", "text": "----------------------------------" }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-39", "text": "**QUESTION EXPANSION AND SELECTION**" }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-40", "text": "Shown in Algorithm 1, the expanded question set E is initialized as the seed question set (Line 1)." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-41", "text": "In each iteration, an already-obtained question is expanded from web and the retrieved questions are added to E if E does not contain them (Lines 6-10)." 
}, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-42", "text": "As there may be a large number of questions generated in the loop, we limit the maximum number of iterations with I max (Line 4)." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-43", "text": "The questions collected from the web search engine may not be fluent or domain relevant; especially the domain relevance drops significantly as the iteration goes on." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-44", "text": "Here we adopt a skip-gram model (Mikolov and Dean, 2013 ) and a language model for evaluating the domain relevance and fluency of the expanded questions, respectively." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-45", "text": "For System grammatical naturalness Serban et al. (2016) 3.36 3.14 Ours 3.53 3.31 Table 1 : Comparing generated questions domain relevance, we take the seed question set as the in-domain data D in , the domain relevance of expanded question q is defined as:" }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-46", "text": "where v(\u00b7) is the document embedding defined as the averaged word embedding within the document." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-47", "text": "For fluency, we define the averaged language model score as:" }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-48", "text": "where LM(\u00b7) is the general-domain language model score (log probability), and LEN(\u00b7) is the word count." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-49", "text": "We apply thresholds t rel and t f lu for domain relevance and fluency respectively, and filter out questions whose scores are below these thresholds." 
}, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-50", "text": "----------------------------------" }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-51", "text": "**EXPERIMENTS**" }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-52", "text": "We perform three experiments to evaluate our system qualitatively and quantitatively." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-53", "text": "In the first experiment, we compare our end-to-end system with the previous state-of-the-art method (Serban et al., 2016) on Freebase (Bollacker et al., 2008) , a domain-general KB." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-54", "text": "In the second experiment, we validate our domain relevance evaluation method on a standard dataset about short document classification." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-55", "text": "In the final experiment, we run our endto-end system on a highly specialized in-house KB and present sample results, showing that our system is capable of generating questions from domain specific KBs." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-56", "text": "----------------------------------" }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-57", "text": "**EVALUATION ON FREEBASE**" }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-58", "text": "We first compare our system with Serban et al. (2016) on 500 randomly selected triples from Freebase (Bollacker et al., 2008) 2 ." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-59", "text": "For the 500 triples, we hand-crafted 106 templates, as these triples share only 53 distinct predicates (we made 2 templates for each predicate on average)." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-60", "text": "991 seed questions are generated by applying the templates on the triples, and 1529 more questions are retrieved from Google." 
}, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-61", "text": "To evaluate the fluency of the candidate questions, we train a 4-gram language model (LM) on gigaword (LDC2011T07) with Kneser Ney smoothing." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-62", "text": "Using the averaged language model score as index, the top 500 questions are selected to compare with the results from Serban et al. (2016) ." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-63", "text": "We ask three native English speakers to evaluate the fluency and the naturalness 3 of both results based on a 4-point scheme where 4 is the best." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-64", "text": "We show the averaged human rate in Table 2 , where we can see that our questions are more grammatical and natural than Serban et al. (2016) ." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-65", "text": "The naturalness score is less than the grammatical score for both methods." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-66", "text": "It is because naturalness is a more strict metric since a natural question should also be grammatical." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-67", "text": "Shown in Table 1 , we compare our questions with Serban et al. (2016) where questions in the same line describe the same entity." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-68", "text": "We can see that our questions are grammatical and natural as these questions are what people usually ask on the web." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-69", "text": "On the other hand, questions from Serban et al. (2016) Ma et al. 
(2015) 85.48 Ours 85.65 Table 3 : Precision on the web snippet dataset was someone who was involved in the leukemia ?\" and \"whats the title of a book of the subject of the bible ?\"), unnatural (\"what 's one of the mountain where can you found in argentina in netflix ?\") or confusing (\"who was someone who was involved in the leukemia ?\")." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-70", "text": "----------------------------------" }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-71", "text": "**DOMAIN RELEVANCE**" }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-72", "text": "We test our domain-relevance evaluating method on the web snippet dataset, which is a commonlyused for domain classification of short documents." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-73", "text": "It contains 10,060 training and 2,280 test snippets (short documents) in 8 classes (domains), and each snippet has 18 words on average." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-74", "text": "There have been plenty of prior results (Phan et al., 2008; Chen et al., 2011; Ma et al., 2015) on the dataset." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-75", "text": "Table 3 , we compare our domainrelevance evaluation method (section 3.1) with previous state-of-the-art methods: Phan et al. (2008) first derives latent topics with LDA (Blei et al., 2003) from Wikipedia, then uses the topics as appended features to expand the short text." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-76", "text": "Chen et al. (2011) further expanded Phan et al. (2008) by using multi-granularity topics." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-77", "text": "Ma et al. (2015) adopts a Bayesian model that the probability a document D belongs to a topic t equals to the prior of t times the probability each word w in D comes from t. 
Our method first concatenates training documents of the same domain into one \"domain document\", then calculates each document embedding by averaging the word embeddings within it, before finally assigning the label of the nearest (by cosine similarity) \"domain document\" to each test document." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-78", "text": "----------------------------------" }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-80", "text": "Simple as it is, our method outperforms all previous methods, proving its effectiveness." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-81", "text": "The reason may be that word embeddings capture the similarity between distinct words (such as \"finance\" and \"economy\"), which is hard for traditional methods." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-82", "text": "On the other hand, LDA only learns probabilities of words belonging to topics." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-83", "text": "how to change circular saw blade how to measure lawn mower cutting height how to sharpen drill bits on bench grinder how does an oscillating multi tool work how to cut a groove in wood without a router what type of sander to use on deck do i need a hammer drill can i use acrylic paint on wood how to use a sharpening stone with oil Table 4 : Example expanded questions" }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-84", "text": "----------------------------------" }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-85", "text": "**EVALUATION ON THE DOMAIN-SPECIFIC KB**" }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-86", "text": "The last experiment is on our in-house KB in the power tool domain." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-87", "text": "It contains 67 distinct predicates, 293 distinct subjects and 279 distinct objects respectively." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-88", "text": "For the 67 predicates, we hand-craft 163 templates."
}, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-89", "text": "Here we use the same language model as in our first experiment, and learn a skip-gram model (Mikolov and Dean, 2013) on Wikipedia 4 for evaluating domain relevance." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-90", "text": "We generate 12,228 seed questions from which 20,000 more questions are expanded with Google." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-91", "text": "Shown in Table 4 are some expanded questions from which we can see that most of them are grammatical and relevant to the power tool domain." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-92", "text": "In addition, most questions are informative and correspond to a specific answer, except the one \"do I need a hammer drill\" that lacks context information." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-93", "text": "Finally, in addition to the simple factoid questions, our system generates many complex questions such as \"how to cut a groove in wood without a router\"." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-94", "text": "----------------------------------" }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-95", "text": "**CONCLUSION**" }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-96", "text": "We presented a system to generate natural language questions from a knowledge base." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-97", "text": "By leveraging rich web information, our system is able to generate domain-relevant questions in wide scope, while human effort is significantly reduced." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-98", "text": "Evaluated by human graders, questions generated by our system are significantly better than these from Serban et al. (2016) on 500 random-selected triples from Freebase." 
}, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-99", "text": "We also demonstrated generated questions from our in-house KB of power tool domain, which are fluent and domain-relevant in general." }, { "sent_id": "91e4fd2556d4a04e477ea97208b218-C001-100", "text": "Our current system only generates questions without answers, leaving automatic answer mining as our future work." } ], "y": { "@DIF@": { "gold_contexts": [ [ "91e4fd2556d4a04e477ea97208b218-C001-5" ], [ "91e4fd2556d4a04e477ea97208b218-C001-27" ], [ "91e4fd2556d4a04e477ea97208b218-C001-45" ], [ "91e4fd2556d4a04e477ea97208b218-C001-64" ], [ "91e4fd2556d4a04e477ea97208b218-C001-69" ], [ "91e4fd2556d4a04e477ea97208b218-C001-98" ] ], "cite_sentences": [ "91e4fd2556d4a04e477ea97208b218-C001-5", "91e4fd2556d4a04e477ea97208b218-C001-27", "91e4fd2556d4a04e477ea97208b218-C001-45", "91e4fd2556d4a04e477ea97208b218-C001-64", "91e4fd2556d4a04e477ea97208b218-C001-69", "91e4fd2556d4a04e477ea97208b218-C001-98" ] }, "@BACK@": { "gold_contexts": [ [ "91e4fd2556d4a04e477ea97208b218-C001-12" ], [ "91e4fd2556d4a04e477ea97208b218-C001-13" ] ], "cite_sentences": [ "91e4fd2556d4a04e477ea97208b218-C001-12", "91e4fd2556d4a04e477ea97208b218-C001-13" ] }, "@USE@": { "gold_contexts": [ [ "91e4fd2556d4a04e477ea97208b218-C001-26" ], [ "91e4fd2556d4a04e477ea97208b218-C001-53" ], [ "91e4fd2556d4a04e477ea97208b218-C001-58" ], [ "91e4fd2556d4a04e477ea97208b218-C001-62" ], [ "91e4fd2556d4a04e477ea97208b218-C001-67" ] ], "cite_sentences": [ "91e4fd2556d4a04e477ea97208b218-C001-26", "91e4fd2556d4a04e477ea97208b218-C001-53", "91e4fd2556d4a04e477ea97208b218-C001-58", "91e4fd2556d4a04e477ea97208b218-C001-62", "91e4fd2556d4a04e477ea97208b218-C001-67" ] } } }, "ABC_75b2aa54c363151130ca2146044922_4": { "x": [ { "sent_id": "75b2aa54c363151130ca2146044922-C001-11", "text": "----------------------------------" }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-12", "text": "**INTRODUCTION**" }, { "sent_id": 
"75b2aa54c363151130ca2146044922-C001-105", "text": "**AB AB**" }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-2", "text": "Word sense induction (WSI), or the task of automatically discovering multiple senses or meanings of a word, has three main challenges: domain adaptability, novel sense detection, and sense granularity flexibility." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-3", "text": "While current latent variable models are known to solve the first two challenges, they are not flexible to different word sense granularities, which differ very much among words, from aardvark with one sense, to play with over 50 senses." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-4", "text": "Current models either require hyperparameter tuning or nonparametric induction of the number of senses, which we find both to be ineffective." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-5", "text": "Thus, we aim to eliminate these requirements and solve the sense granularity problem by proposing AutoSense, a latent variable model based on two observations: (1) senses are represented as a distribution over topics, and (2) senses generate pairings between the target word and its neighboring word." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-6", "text": "These observations alleviate the problem by (a) throwing garbage senses and (b) additionally inducing fine-grained word senses." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-7", "text": "Results show great improvements over the stateof-the-art models on popular WSI datasets." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-8", "text": "We also show that AutoSense is able to learn the appropriate sense granularity of a word." 
}, { "sent_id": "75b2aa54c363151130ca2146044922-C001-9", "text": "Finally, we apply AutoSense to the unsupervised author name disambiguation task where the sense granularity problem is more evident and show that AutoSense is evidently better than competing models." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-10", "text": "We share our data and code here: https://github.com/rktamplayo/ AutoSense." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-106", "text": "represents the number of a \u2208 A and b \u2208 B assignments." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-13", "text": "Word sense induction (WSI) is the task where given an ambiguous target word (e.g. cold) and texts where the word is used, we automatically discover its multiple senses or meanings (e.g. (1) nose infection, (2) absence of heat, etc.)." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-14", "text": "We show examples of words with multiple senses and example usage in a text 1 in Figure 1 ." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-15", "text": "It is distinct from its similar supervised counterpart, word sense disambiguation (WSD) (Stevenson and Wilks 2003) , because WSI models should consider the following challenges due to its unsupervised nature: (C1) adaptability to new domains, (C2) ability to detect novel senses, and (C3) flexibility to different word sense granularities (Jurgens and Klapaftis 2013) ." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-16", "text": "Another task similar to the WSI is the unsupervised author name disambiguation (UAND) task (Song et al. 2007) , where it aims to automatically find different authors, instead of words, with the same name." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-17", "text": "In this paper, we consider a latent variable modeling approach to WSI problem as it is proven to be more effective than other approaches (Chang, Pei, and Chen 2014; Komninos and Manandhar 2016) ." 
}, { "sent_id": "75b2aa54c363151130ca2146044922-C001-18", "text": "Specifically, we look into methods based on Latent Dirichlet Allocation (LDA) (Blei, Ng, and Jordan 2003) , a topic modeling method that automatically discovers the topics underlying a set of documents using Dirichlet priors to infer the multinomial distribution over words and topics." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-19", "text": "LDA naturally answers two of the three main problems mentioned above, i.e. (C1) and (C2), of the WSI task (Brody and Lapata 2009) ." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-20", "text": "However, it is not flexible with regards to (C3), or the sense granularity problem, as it requires the users to specify the number of senses: Current systems (Wang et al. 2015; Chang, Pei, and Chen 2014) required to set the number of senses to a small number (set to 3 or 5 in the literature) to get a good accuracy, however many words may have a large number of senses, e.g. play in Figure 1 ." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-21", "text": "----------------------------------" }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-22", "text": "**LDA: !(#|%)**" }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-23", "text": "AutoSense: !('|#, (% ) , %)) Figure 2: Example induced senses when the target word is cold from LDA and AutoSense." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-24", "text": "Applying our observations to LDA introduces both garbage and fine-grained senses." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-25", "text": "To this end, we propose a latent variable model called AutoSense that solves all the challenges of WSI, including overcoming the sense granularity problem." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-26", "text": "Consider Figure 2 on finding the senses of the target word cold." 
}, { "sent_id": "75b2aa54c363151130ca2146044922-C001-27", "text": "An LDA model naively considers the topics as senses and thus differentiates the usage of cold in the medical and science domains, even though the same sense is commonly used in the two domains." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-28", "text": "This results in too many senses induced by the model." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-29", "text": "We extend LDA using two observations." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-30", "text": "First, we introduce a separate latent variable for senses, which can be represented as a distribution over topics." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-31", "text": "This introduces more accurate induced senses (e.g. the cold: nose infection sense can be from a mixture of medical, science, and temperature topics), as well as garbage senses (colored red in the figure) as most topic distributions will not be assigned to any instance." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-32", "text": "Second, we enforce senses to generate target-neighbor pairs, a pair (w t , w) which consists of the target word w t and one of its neighboring word w, at once." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-33", "text": "This separates the topic distributions into fine-grained senses based on lexical semantic features easily captured by the target-neighbor pairs." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-34", "text": "For example, the cold: absence of heat and the cold: sensation from low temperature senses are both related to temperature, but have different syntactic and semantic usage." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-35", "text": "By applying the two observations above, AutoSense removes the strict requirement on correctly setting the number of senses by throwing garbage senses and introducing fine-grained senses." 
}, { "sent_id": "75b2aa54c363151130ca2146044922-C001-36", "text": "Nonparametric models (Teh et al. 2004; have also been used to solve this problem by automatically inducing the number of senses, however our experiments show that these models are less effective than parametric models and induce incorrect number of senses." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-37", "text": "Our proposed model is parametric, and is also able to adapt to the different number of senses of different words, even when the number of senses is set to an arbitrarily large number." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-38", "text": "Moreover, the model can also be used in other tasks such as UAND where the variance in the number of senses is large." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-39", "text": "To the best of our knowledge, we are the first to experiment extensively on the sense granularity problem of parametric latent variable models." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-40", "text": "In our experiments, we estimate the parameters of the model using collapsed Gibbs sampling and get the sense distribution of each instance as the WSI solution." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-41", "text": "We evaluate our model using the SemEval 2010 and 2013 WSI datasets (Manandhar et al. 2010; Jurgens and Klapaftis 2013) ." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-42", "text": "Results show that AutoSense performs superior than previous state-of-the-art models." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-43", "text": "We also provide analyses and experiments that shows how AutoSense overcomes the issue on sense granularity." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-44", "text": "Finally, we show that our model performs the best on unsupervised author name disambiguation (UAND), where the sense granularities are extremely varied." 
}, { "sent_id": "75b2aa54c363151130ca2146044922-C001-45", "text": "----------------------------------" }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-46", "text": "**RELATED WORK**" }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-47", "text": "Previous works on WSI used context vectors and attributes (Almuhareb, Poesio, and others 2006) , pretrained classification systems (Tsvetkov et al. 2014) , and alignment of parallel corpus (Yao, Van Durme, and Callison-Burch 2012) ." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-48", "text": "In the most recent shared task on WSI (Jurgens and Klapaftis 2013) , top models used lexical substitution method (AI-KU) (Baskaya et al. 2013) and Hierarchical Dirichlet Process trained with additional instances (Unimelb) ." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-49", "text": "Latent variable models such as LDA (Blei, Ng, and Jordan 2003) are used to induce the word sense of a target word after rigorous preprocessing and feature extraction (LDA, Spectral) (Goyal and Hovy 2014)." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-50", "text": "More recent models introduced a latent variable for the sense of a word, with the assumption that a sense has multiple concepts (HC, HC+Zipf) (Chang, Pei, and Chen 2014) and that topics and senses should be inferred jointly (STM) (Wang et al. 2015) ." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-51", "text": "In this paper, we also use a separate sense latent variable, however we show boost in performance by representing it with more versatility and by incorporating the use of targetneighbor pairs." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-52", "text": "HC was also extended to a nonparametric model (BNP-HC) (Teh et al. 2004 ) in order to automatically set the number of senses of a word, providing flexibility to the sense granularity (Yao and Van Durme 2011; Lau et al. 2012; ." 
}, { "sent_id": "75b2aa54c363151130ca2146044922-C001-53", "text": "In our experiments, we show that the sense granularity induced from nonparametric models are incorrect making the models less effective." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-54", "text": "Recent inclusions to the WSI models are neural-based dense distributional representation models." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-55", "text": "STM also used word embeddings (Mikolov et al. 2013) to assign similarity weights during inference (STM+w2v) (Wang et al. 2015) ." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-56", "text": "Existing sense embeddings are also used to perform word sense induction (CRP-PPMI, SE-WSI-fix, WG, DIVE) (Song 2016; Pelevina et al. 2016; Chang et al. 2018 )." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-201", "text": "pare the performances of the models as the number of senses increases." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-57", "text": "These models, on their own, do not perform well on the WSI task until recently when embeddings of words and their dependencies are used to construct a probabilistic model (MCC) (Komninos and Manandhar 2016) ." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-58", "text": "We show that neuralbased embeddings are still ineffective for this task and that our model performs better than these models as well." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-59", "text": "In the unsupervised author name disambiguation (UAND) domain, LDA-based models have also been used (Shu, Long, and Meng 2009) to employ text features for the task, while non-text features such as co-authors, publication venue, year, and citations are found to be stronger features (Tang et al. 2012) ." 
}, { "sent_id": "75b2aa54c363151130ca2146044922-C001-60", "text": "In this paper, we study on how to improve the performance of text features for UAND using latent variable models, which can later be combined with non-text features in the future work." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-61", "text": "Figure 3: Graphical representation of AutoSense." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-62", "text": "Nodes are random variables, edges are dependencies, and plates are replications." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-63", "text": "Nodes shaded in black are observed." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-64", "text": "The node shaded in red is the observed target word." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-65", "text": "The dependency edges of \u03b8 s|t , \u03b8 t|s , and \u03b8 st are not shown for clarity: They are all generated by the Dirichlet prior \u03b1." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-66", "text": "Moreover, sense variables are dependent to \u03b8 s|t and \u03b8 st , while topic variables are dependent to \u03b8 t|s and \u03b8 st ." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-67", "text": "----------------------------------" }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-68", "text": "**PROPOSED MODEL**" }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-69", "text": "There are two reasons why Latent Dirichlet Allocation (LDA) (Blei, Ng, and Jordan 2003) is not effective for WSI." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-70", "text": "First, LDA tries to give instance assignments to all senses even when it is unnecessary." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-71", "text": "For example, when the number of senses S is set to 10, the model tries to assign all the senses to all instances even when the original number of senses of a target word is 3." 
}, { "sent_id": "75b2aa54c363151130ca2146044922-C001-72", "text": "LDA extensions (Wang et al. 2015; Chang, Pei, and Chen 2014) mitigated this problem by setting S to a small number (e.g. 3 or 5)." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-73", "text": "However, this is not a good solution because there are many words with more than five senses." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-74", "text": "Second, LDA and its extensions do not consider the existence of fine-grained senses." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-75", "text": "For example, the cold: absence of heat and the cold: sensation from low temperature senses are fine-grained senses because they are similarly related to temperature yet have different usage." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-76", "text": "----------------------------------" }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-77", "text": "**AUTOSENSE MODEL**" }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-78", "text": "To solve the problems above, we propose to extend LDA in two parts." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-79", "text": "First, we introduce a new latent variable, apart from the topic latent variable, to represent word senses." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-80", "text": "Previous works also attempted to introduce a separate sense latent variable to generate all the words (Chang, Pei, and Chen 2014) , or to generate only the neighboring words within a local context, decided by a strict user-specified window (Wang et al. 2015) ." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-81", "text": "We improve by softening the strict local context assumption by introducing a switch variable which decides whether a word not in a local context should be generated by conditioning also on the sense latent variable." 
}, { "sent_id": "75b2aa54c363151130ca2146044922-C001-82", "text": "Our experiments show that our sense representation provides superior improvements from previous models." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-83", "text": "Second, we force the model to generate target-neighbor pairs at once in the local context, instead of generating words one by one." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-84", "text": "A target-neighbor pair (w t , w) consists of the target word w t and a neighboring word w in the local context." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-85", "text": "For example, the target-neighbor pairs in \"cold snowy weather\", where w t is cold, are (cold, snowy) and (cold, weather)." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-86", "text": "These pairs give explicit information on the lexical semantics of the target word given the neighboring words." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-87", "text": "In our running example (Figure 2 ), the cold: absence of heat and the cold: sensation from low temperature senses can be easily differentiated when we are given the target-neighbor pairs (cold, weather) and (cold, climate) for the former, and (cold, water) and (cold, f resh) for the latter sense, rather than the individual words." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-88", "text": "These extensions bring us to our proposed model called AutoSense." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-89", "text": "The graphical representation of AutoSense is shown in Figure 3 , while the meaning of the notations used in this paper is shown in Table 1 ." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-90", "text": "Generative process For each instance, we divide the text into two contexts: the local context L which includes the target word w t and its neighboring words w l , and the global context M which contains the other remaining words w m ." 
}, { "sent_id": "75b2aa54c363151130ca2146044922-C001-91", "text": "Words from different contexts are generated separately." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-92", "text": "In the global context M , words w m are generated from either a sense s or a topic t latent variable." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-93", "text": "The selection is done by a switch variable x. If x = 1, then the word generation is done by using the sense variable s. Otherwise, it is done by using the topic variable t. The probability of a global context word w m in document d is given below." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-94", "text": "In the local context L, words w l are generated from both sense s and topic t variables." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-95", "text": "Also, the target word w t is generated along with w l as target-neighbor pairs (w t , w l ) using the sense variable s. Sense and topic variables are dependent to each other, so we generate them using the joint probability p(s, t|d)." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-96", "text": "We factorize p(s, t|d) approximately using ideas from dependency networks (Heckerman et al. 2000) to avoid independency assumptions, i.e. p(a, b|c) = p(a|b, c)p(b|a, c), and deficient modeling (Brown et al. 1993) to ignore redundancies, i.e. p(a|b, c)p(b|a, c) = p(a|b)p(a|c)p(b|a)p(b|c)p(a, b)." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-97", "text": "The probability of a local context word w l in document d given below." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-98", "text": "Inference We use collapsed Gibbs sampling (Griffiths and Steyvers 2004) to estimate the latent variables." 
}, { "sent_id": "75b2aa54c363151130ca2146044922-C001-99", "text": "At each transition step of the Markov chain, for each word w m in the global context, we draw the switch x \u223c {1, 2}, and the sense s = k or the topic t = j variables using the conditional probabilities given below." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-100", "text": "The variable C AB ab represents the number of a \u2208 A and b \u2208 B assignments, excluding the current word." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-101", "text": "The rest corresponds to the other remaining variables, such as the instance d, the current word w m , the \u03b8 and \u03c6 distributions, and the \u03b1, \u03b2, and \u03b3 Dirichlet priors." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-102", "text": "Subsequently, for each word w l and the target word w t (forming the target-neighbor pair (w t , w l )) in the local context, we draw the sense s = k and the topic t = j variables using the conditional probability given below." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-103", "text": "Word sense induction After inference is done, the approximate probability of the sense s of the target word in a given document d is induced using the sense distribution of the document as shown in the equation below, where C" }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-104", "text": "----------------------------------" }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-107", "text": "We also calculate the word distribution of each sense using the second equation below to inspect the meaning of sense." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-108", "text": "----------------------------------" }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-109", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-110", "text": "Datasets and preprocessing We use two publicly available datasets: SemEval 2010 Task 14 (Manandhar et al. 
2010) For preprocessing, we do tokenization, lemmatization, and removing of symbols to build the word lists using Stanford CoreNLP (Manning et al. 2014) ." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-111", "text": "We divide the word lists into two contexts: the local and global context." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-112", "text": "Following (Wang et al. 2015), we set the local context window to 10, with a maximum number of words of 21 (i.e. 10 words before and 10 words after)." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-113", "text": "Other words are put into the global context." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-114", "text": "Note however that AutoSense has a less strict global/local context assumption as it treats some words in the global context as local depending on the switch variable." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-115", "text": "Parameter setting We set the hyperparameters to \u03b1 = 0.1, \u03b2 = 0.01, \u03b3 = 0.3, following the conventional setup (Griffiths and Steyvers 2004; Chemudugunta, Smyth, and Steyvers 2006) ." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-116", "text": "We arbitrarily set the number of senses to S = 15, and the number of topics T = 2S = 30, following (Wang et al. 2015) ." 
}, { "sent_id": "75b2aa54c363151130ca2146044922-C001-117", "text": "We also include four other versions of our model: AutoSense \u2212wp removes the target-neighbor pair constraint and transforms the local context to that of STM, AutoSense \u2212sw removes the switch variable and transforms the global context to that of LDA, AutoSense s=X is a tuned and best version of the model, where the number of senses is tuned over a separate development set provided by the shared tasks and X is the tuned number of sense, different for each dataset, and AutoSense s=100 is the overestimated and worst version of the model, where we set the number of senses to an arbitrary large number, i.e. 100." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-118", "text": "We set the number of iterations to 2000 and run the Gibbs sampler." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-119", "text": "Following the convention of previous works (Lau et al. 2012; Goyal and Hovy 2014; Wang et al. 2015) , we assume convergence when the number of iterations is high." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-120", "text": "However, due to the randomized nature of Gibbs sampling, we report the average scores over 5 runs of Gibbs sampling." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-121", "text": "We then use the distribution \u03b8 s|d as shown in Equation 1 as the solution of the WSI problem." 
}, { "sent_id": "75b2aa54c363151130ca2146044922-C001-122", "text": "----------------------------------" }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-123", "text": "**MODEL F-S V-M AVG \u0394(#S)**" }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-124", "text": "----------------------------------" }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-125", "text": "**EXPERIMENTS WORD SENSE INDUCTION**" }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-126", "text": "SemEval 2010 For the SemEval 2010 dataset, we compare models using two unsupervised metrics: V-measure (V-M) and paired F-score (F-S)." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-127", "text": "V-M favors a high number of senses (e.g. assigning one cluster per instance), while F-S favors a small number of senses (e.g. all instances in one cluster) (Manandhar et al. 2010) ." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-128", "text": "In order to get a common ground for comparison, we do a geometric average AVG of both metrics, following (Wang et al. 2015) ." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-129", "text": "Finally, we also report the absolute difference between the actual (3.85) and induced number of senses as \u03b4(#S)." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-130", "text": "We compare with seven other models: a) LDA on cooccurrence graphs (LDA) and b) spectral clustering on cooccurrence graphs (Spectral) as reported in (Goyal and Hovy Results are shown in Table 2a , where AutoSense outperforms other competing models on AVG." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-131", "text": "Among the AutoSense models, the AutoSense \u2212wp and AutoSense \u2212sw version perform the worst, emphasizing the necessity of the target-neighbor pairs and the switch variable." 
}, { "sent_id": "75b2aa54c363151130ca2146044922-C001-132", "text": "The overestimated AutoSense s=100 performs better than previously proposed models, proving the robustness of our model on the different word sense granularities." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-133", "text": "On the \u03b4(#S) metric, the untuned AutoSense and AutoSense s=5 perform the best." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-134", "text": "The V-M metric needs to be interpreted carefully, because it can easily be maximized by separating all instances into different sense clusters and thus overestimating the actual number of senses #S and decreasing the F-S metric." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-135", "text": "The model BNP-HC is an example of such: Though its V-M metric is the highest, it scores the lowest on the F-S metric and greatly overestimates #S, thus having a very high \u03b4(#S)." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-136", "text": "The goal is thus a good balance of V-M and F-S (i.e. highest AVG), and a close estimation of #S (i.e. lowest \u03b4(#S), which is successfully achieved by our models." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-137", "text": "SemEval 2013 Two metrics are used for the SemEval 2013 dataset: fuzzy B-cubed (F-BC) and fuzzy normalized mutual information (F-NMI)." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-138", "text": "F-BC gives preference to labelling all instances with the same sense, while F-NMI gives preference to labelling all instances with distinct senses." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-139", "text": "Therefore, computing the AVG of both metrics is also necessary in this experiment, for ease of comparison, as also suggested in (Wang et al. 2015) ." 
}, { "sent_id": "75b2aa54c363151130ca2146044922-C001-140", "text": "We use seven baselines: a) lexical substitution method (AI-KU) and b) nonparametric HDP model (Unimelb) as reported in (Jurgens and Klapaftis 2013) , c) Sense-Topic Model STM, d) STM with word2vec weights (STM+w2v) as reported in (Wang et al. 2015) , e) Word Graph embeddings (WG), f) Distributional Inclusion Vector Embedding (DIVE) as reported in (Chang et al. 2018) , and g) Multi Context Continuous model MCC as reported in (Komninos and Manandhar 2016) ." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-141", "text": "Results are shown in Table 2b ." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-142", "text": "Among the models, all versions of AutoSense perform better than other models on AVG." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-143", "text": "The untuned AutoSense and AutoSense s=7 especially garner noticeable increase of 6.1% on fuzzy B-cubed metric from MCC, the previous best model." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-144", "text": "We also notice a big 6.0% decrease on the fuzzy B-cubed of AutoSense when the target-neighbor pair context is removed." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-145", "text": "This means that introducing the target-neighbor pair is crucial to the improvement of the model." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-146", "text": "Finally, the overestimated AutoSense model performs as well as the other AutoSense models, even outperforming all previous models on AVG, which proves the effectiveness of AutoSense even when s is set to a large value." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-147", "text": "For completeness, we also report STM with additional contexts, STM+actual and STM+ukWac (Wang et al. 2015) , where they used the actual additional contexts from the original data and semantically similar contexts from ukWac, respectively, as additional global context." 
}, { "sent_id": "75b2aa54c363151130ca2146044922-C001-148", "text": "With the performance gain we achieved, AutoSense without additional context can perform comparably to models with additional contexts: Our model greatly outperforms these models on the Sense Word distribution #Docs 1 hotel tour tourist summer flight 22 2 month ticket available performance 3 3 guest office stateroom class suite 3 * advance overseas line popular japan 0 * email day buy unable tour 0 * sort basic tour time 0 Table 3 : Six of the 15 senses of the target verb book using AutoSense with S = 15." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-149", "text": "The word lists shown are preprocessed to remove stopwords and the target word." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-150", "text": "The first three senses are senses which are assigned at least once to an instance document." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-151", "text": "The last three are garbage senses." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-152", "text": "F-BC metric by at least 2%." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-153", "text": "Also, considering that both AutoSense and STM are LDA-based models, the same data enhancements can straightforwardly be applied when the needs arise." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-154", "text": "We similarly apply the actual additional contexts to AutoSense and find that we achieve state-of-the-art performance on AVG." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-155", "text": "----------------------------------" }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-156", "text": "**SENSE GRANULARITY PROBLEM**" }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-157", "text": "The main weakness of LDA when used on WSI task is the sense granularity problem." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-158", "text": "Recent models such as HC (Chang, Pei, and Chen 2014) and STM (Wang et al. 
2015) mitigated this problem by tuning the number of senses hyperparameter S to minimize the error." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-159", "text": "However, such tuning, often empirically set to a small number such as S = 3 (Wang et al. 2015) , fails to infer varying number of senses of words, especially for words with a higher number of senses." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-160", "text": "Nonparametric models such as HDP and BNP-HC Chang, Pei, and Chen 2014) claim to automatically induce different S for each word." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-161", "text": "However, as shown in the results in Table 2 , the estimated S is far from the actual number of senses and both models are ineffective." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-162", "text": "On the other hand, Table 2 also shows that AutoSense is effective even when S is overestimated." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-163", "text": "We explain why through an example result shown in Table 3 , where the target word is the verb book, the actual number of senses is three, and S is set to 15." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-164", "text": "First, we see that there are senses which are not assigned to any instance document, signified by * , which we call garbage senses." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-165", "text": "We notice that effectively representing a new latent variable for sense as a distribution over topics forces the model to throw garbage senses." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-166", "text": "Second, while it is easy to distinguish the third sense (i.e., book: register in a booker) to the two other senses, the first and second senses both refer to planning or arranging for an event in advance." 
}, { "sent_id": "75b2aa54c363151130ca2146044922-C001-167", "text": "Incorporating the target-neighbor pairs helps the model differentiates both into fine-grained senses book: arrange for and reserve in advance and book: engage for a performance." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-168", "text": "We compare the competing models quantitatively on how they correctly detect the actual number of sense clusters using cluster error, which is the mean absolute error between the detected number and the actual number of sense clusters." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-169", "text": "We compare the cluster errors of LDA (Blei, Ng, and Jordan 2003) , STM (Wang et al. 2015) , HC (Chang, Pei, and Chen 2014) , and a nonparametric model HDP (Teh et al. 2004 ), with AutoSense." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-170", "text": "We report the results in Figure 4 ." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-171", "text": "Results show that the cluster error of LDA increases sharply as the number of senses exceeds the actual mean number of senses." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-172", "text": "HC and STM also throw garbage senses since they also introduce in some way a new sense variable, however the cluster errors of both models still increase when S is set beyond the maximum number of senses." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-173", "text": "We argue that this is because first, the sense representation is not optimal as they assume strict local/global context assumption, and second and most importantly, the models do not produce fine-grained senses." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-174", "text": "AutoSense does both garbage sense throwing and fine-grained sense induction, which helps in the detection of the actual word granularity." 
}, { "sent_id": "75b2aa54c363151130ca2146044922-C001-175", "text": "Finally, the cluster error of AutoSense is always better than that of HDP." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-176", "text": "This shows that AutoSense, despite being a parametric model, automatically detects the number of sense clusters without parameter tuning and is more accurate than the automatic detection of nonparametric models." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-177", "text": "----------------------------------" }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-178", "text": "**UNSUPERVISED AUTHOR NAME DISAMBIGUATION**" }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-179", "text": "Unsupervised author name disambiguation (UAND) is a task very similar to the WSI task, where ambiguous author names are the target words." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-180", "text": "However, one additional challenge of UAND is that there can be as many as 100 authors Table 4 : Statistics of the number of senses of target words/names in the datasets used in the paper." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-181", "text": "with the same name, whereas words can have at most 20 different senses, at least in our datasets, as shown in the dataset statistics in Table 4 ." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-182", "text": "Moreover, the standard deviations of the author name disambiguation datasets are also higher, which means that there is more variation on the number of senses per target author name." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-183", "text": "Thus, in this task, the sense granularity problem is more difficult and needs to be addressed properly." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-184", "text": "Current state-of-the-art models use non-text features such as publication venue and citations (Tang et al. 2012) ." 
}, { "sent_id": "75b2aa54c363151130ca2146044922-C001-185", "text": "We argue that text features also provide informative clues to disambiguate author names." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-186", "text": "In this experiment, we make use of text features such as the title and abstract of research papers as data instance of the task." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-187", "text": "In addition, we also include in our dataset author names and the publication venue as pseudo-words." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-188", "text": "In this way, we can reformulate the UAND task as a WSI task, and exploit text features not used in current techniques." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-189", "text": "Experimental setup We use two publicly available datasets for the UAND task: Arnet 4 and PubMed 5 ." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-190", "text": "The Arnet dataset contains 100 ambiguous author names and a total of 7528 papers as data instance." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-191", "text": "Each instance includes the title, author list, and publication venue of a research paper authored by the given author name." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-192", "text": "In addition, we also manually extract the abstracts of the research papers for additional context." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-193", "text": "The PubMed dataset contains 37 author names with a total of 2875 research papers as instances." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-194", "text": "It includes the PubMed ID of the papers authored by the given author name." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-195", "text": "We extract the title, author list, publication venue, and abstract of each PubMed ID from the PubMed website." 
}, { "sent_id": "75b2aa54c363151130ca2146044922-C001-196", "text": "We use LDA (Blei, Ng, and Jordan 2003) , HC (Chang, Pei, and Chen 2014) and STM (Wang et al. 2015) as baselines." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-197", "text": "We do not compare with non-text feature-based models (Tang et al. 2012; Cen et al. 2013 ) because our goal is to compare sense topic models on a task where the sense granularities are more varied." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-198", "text": "For STM and AutoSense, the title, publication venue and the author names are used as local contexts while the abstract is used as the global context." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-199", "text": "This decision is based on conclusions from previous works (Tang et al. 2012 ) that the title, publication venue, and the author names are more informative than the abstract when disambiguating author names." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-200", "text": "We use the same parameters as used above, and we set S to 5, 25, 50, and 100 to com-4 https://aminer.org/disambiguation 5 https://github.com/Yonsei-TSMM/author_ name_disambiguation Model S = 5 S = 25 S = 50 S = 100 LDA 31.5 13." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-202", "text": "Results For evaluation, we use the pairwise F1 measure to compare the performance of competing models, following (Tang et al. 2012) ." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-203", "text": "Results are shown in Figure 5 ." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-204", "text": "AutoSense performs the best on almost all settings, except on the PubMed dataset and when S = 5, where it garners a comparable result with STM." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-205", "text": "However, in the case where S is set close to the maximum number of senses in the dataset (i.e. 28 in PubMed and 112 in Arnet), AutoSense performs the best among the models." 
}, { "sent_id": "75b2aa54c363151130ca2146044922-C001-206", "text": "LDA and HC perform badly on all settings, and their performance decreases greatly when S becomes high." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-207", "text": "STM also shows a decrease in performance on the PubMed dataset when S = 100." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-208", "text": "This is because the PubMed dataset has a lower maximum number of senses and STM is sensitive to the setting of S, which hurts the robustness of the model to different sense granularities." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-209", "text": "----------------------------------" }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-210", "text": "**CONCLUSION**" }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-211", "text": "We proposed a solution to the sense granularity problem, one of the major challenges of the WSI task." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-212", "text": "We introduced AutoSense, a latent variable model that not only throws away garbage senses, but also induces fine-grained senses." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-213", "text": "We showed that AutoSense greatly outperforms the current state-of-the-art models on both the SemEval 2010 and 2013 WSI datasets." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-214", "text": "We also showed through experiments how AutoSense is able to overcome the sense granularity problem, a well-known flaw of latent variable models." }, { "sent_id": "75b2aa54c363151130ca2146044922-C001-215", "text": "We further applied our model to the UAND task, a similar task but with a more varied number of senses, and showed that AutoSense performs the best among latent variable models, proving its robustness to different sense granularities."
} ], "y": { "@MOT@": { "gold_contexts": [ [ "75b2aa54c363151130ca2146044922-C001-20" ], [ "75b2aa54c363151130ca2146044922-C001-72", "75b2aa54c363151130ca2146044922-C001-73" ], [ "75b2aa54c363151130ca2146044922-C001-159" ] ], "cite_sentences": [ "75b2aa54c363151130ca2146044922-C001-20", "75b2aa54c363151130ca2146044922-C001-72", "75b2aa54c363151130ca2146044922-C001-159" ] }, "@BACK@": { "gold_contexts": [ [ "75b2aa54c363151130ca2146044922-C001-50" ], [ "75b2aa54c363151130ca2146044922-C001-55" ], [ "75b2aa54c363151130ca2146044922-C001-72" ], [ "75b2aa54c363151130ca2146044922-C001-80" ], [ "75b2aa54c363151130ca2146044922-C001-158" ] ], "cite_sentences": [ "75b2aa54c363151130ca2146044922-C001-50", "75b2aa54c363151130ca2146044922-C001-55", "75b2aa54c363151130ca2146044922-C001-72", "75b2aa54c363151130ca2146044922-C001-80", "75b2aa54c363151130ca2146044922-C001-158" ] }, "@USE@": { "gold_contexts": [ [ "75b2aa54c363151130ca2146044922-C001-112" ], [ "75b2aa54c363151130ca2146044922-C001-116" ], [ "75b2aa54c363151130ca2146044922-C001-119" ], [ "75b2aa54c363151130ca2146044922-C001-128" ], [ "75b2aa54c363151130ca2146044922-C001-139" ], [ "75b2aa54c363151130ca2146044922-C001-140" ], [ "75b2aa54c363151130ca2146044922-C001-147" ], [ "75b2aa54c363151130ca2146044922-C001-169" ], [ "75b2aa54c363151130ca2146044922-C001-196" ] ], "cite_sentences": [ "75b2aa54c363151130ca2146044922-C001-112", "75b2aa54c363151130ca2146044922-C001-116", "75b2aa54c363151130ca2146044922-C001-119", "75b2aa54c363151130ca2146044922-C001-128", "75b2aa54c363151130ca2146044922-C001-139", "75b2aa54c363151130ca2146044922-C001-140", "75b2aa54c363151130ca2146044922-C001-147", "75b2aa54c363151130ca2146044922-C001-169", "75b2aa54c363151130ca2146044922-C001-196" ] } } }, "ABC_55160a7ab2df9a86e677bcc72d9842_4": { "x": [ { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-2", "text": "Programmers typically organize executable source code using 
high-level coding patterns or idiomatic structures such as nested loops, exception handlers and recursive blocks, rather than as individual code tokens." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-3", "text": "In contrast, state-of-the-art semantic parsers still map natural language instructions to source code by building the code syntax tree one node at a time." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-4", "text": "In this paper, we introduce an iterative method to extract code idioms from large source code corpora by repeatedly collapsing most-frequent depth-2 subtrees of their syntax trees, and we train semantic parsers to apply these idioms during decoding." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-5", "text": "We apply this idiom-based code generation to a recent context-dependent semantic parsing task, and improve the state of the art by 2.2% BLEU score while reducing training time by more than 50%." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-6", "text": "This improved speed enables us to scale up the model by training on an extended training set that is 5x larger, to further push the state of the art by an additional 2.3% BLEU and 0.9% exact match." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-7", "text": "----------------------------------" }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-9", "text": "When programmers translate Natural Language (NL) specifications into executable source code, they typically start with a high-level plan of the major structures required, such as nested loops, conditionals, etc. and then proceed to fill in specific details into these components." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-10", "text": "We refer to these high-level structures (Figure 1 (b) ) as code idioms (Allamanis and Sutton, 2014) ."
}, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-11", "text": "In this paper, we demonstrate how learning to use code idioms leads to an improvement in model accuracy and training time for the task of semantic parsing, i.e., mapping intents in NL into general purpose source code (Iyer et al., 2017; Ling et al., 2016) ." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-12", "text": "State-of-the-art semantic parsers are neural encoder-decoder models, where decoding is guided by the grammar of the target programming language (Yin and Neubig, 2017; Rabinovich et al., 2017; Iyer et al., 2018) to ensure syntactically valid programs." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-13", "text": "For general purpose programming languages with large formal grammars, this can easily lead to long decoding paths even for short snippets of code." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-14", "text": "For example, Figure 1 shows an intermediate parse tree for a generic if-then-else code snippet, for which the decoder requires as many as eleven decoding steps before ultimately filling in the slots for the if condition, the then expression and the else expression." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-15", "text": "However, the if-then-else block can be seen as a higher level structure such as shown in Figure 1 (b) that can be applied in one decoding step and reused in many different programs." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-16", "text": "In this paper, we refer to frequently recurring subtrees of programmatic parse trees as code idioms, and we equip semantic parsers with the ability to learn and directly generate idiomatic structures as in Figure 1 (b)." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-17", "text": "We introduce a simple iterative method to extract idioms from a dataset of programs by repeatedly collapsing the most frequent depth-2 subtrees of syntax parse trees." 
}, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-18", "text": "Analogous to the byte pair encoding (BPE) method (Gage, 1994; Sennrich et al., 2016 ) that creates new subtokens of words by repeatedly combining frequently occurring adjacent pairs of subtokens, our method takes a depth-2 syntax subtree and replaces it with a tree of depth-1 by removing all the internal nodes." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-19", "text": "This method is in contrast with the approach using probabilistic tree substitution grammars (pTSG) taken by Allamanis and Sutton (2014) , who use the explanation quality of an idiom to prioritize idioms that are more interesting, with an end goal of suggesting useful idioms to programmers using IDEs." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-20", "text": "Once idioms are extracted, we greedily apply them to semantic parsing training sets to provide supervision for learning to apply idioms." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-21", "text": "We evaluate our approach on a context dependent semantic parsing task (Iyer et al., 2018) using the CONCODE dataset, where we improve the state of the art by 2.2% BLEU score." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-22", "text": "Furthermore, generating source code using idioms results in a more than 50% reduction in the number of decoding steps, which cuts down training time to less than half, from 27 to 13 hours." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-23", "text": "Taking advantage of this reduced training time, we further push the state of the art on CONCODE to an EM of 13.4 and a BLEU score of 28.9 by training on an extended version of the training set (with 5x the amount of training examples)."
}, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-24", "text": "----------------------------------" }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-25", "text": "**RELATED WORK**" }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-26", "text": "Neural encoder-decoder models have proved effective in mapping NL to logical forms (Dong and Lapata, 2016) and also for directly producing general purpose programs (Iyer et al., 2017, 2018) ." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-27", "text": "Ling et al. (2016) use a sequence-to-sequence model with attention and a copy mechanism to generate source code." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-28", "text": "Instead of directly generating a sequence of code tokens, recent methods focus on constrained decoding mechanisms to generate syntactically correct output using a decoder that is either grammar-aware or has a dynamically determined modular structure paralleling the structure of the abstract syntax tree (AST) of the code (Rabinovich et al., 2017; Yin and Neubig, 2017) ." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-29", "text": "Iyer et al. (2018) use a similar decoding approach but use a specialized context encoder for the task of context-dependent code generation." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-30", "text": "We augment these neural encoder-decoder models with the ability to decode in terms of frequently occurring higher level idiomatic structures to achieve gains in accuracy and training time." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-31", "text": "Another related method to produce source code is with the help of sketches, which are code snippets containing slots in the place of low-level information such as variable names and arguments."
}, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-32", "text": "Dong and Lapata (2018) generate sketches as intermediate representations to convert NL to logical forms; Hayati et al. (2018) retrieve sketches from a large training corpus and later modify them for the current input; Murali et al. (2018) use a combination of neural learning and type-guided combinatorial search to convert existing sketches into executable programs, whereas Nye et al. (2019) additionally also generate the sketches before synthesising programs." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-33", "text": "While we don't explicitly generate sketches, we find that our idiom-based decoder learns to generate commonly used programming sketches with slots, and fills them in during subsequent decoding timesteps." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-34", "text": "More closely related to the idioms that we use for decoding is Allamanis and Sutton (2014) , who develop a system (HAGGIS) to automatically mine idioms from large code bases." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-35", "text": "They focused on finding idioms that are interesting and explainable, e.g., those that can be included as preset code templates in programming IDEs." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-36", "text": "Instead, we learn idiomatic structures that are frequently used and can be easily associated with natural language phrases in our dataset." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-37", "text": "The production of large subtrees in a single step directly translates to a large speedup in training and inference." 
}, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-38", "text": "----------------------------------" }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-39", "text": "**IDIOM AWARE ENCODER-DECODER MODELS**" }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-40", "text": "Our goal is to train semantic parsers with the ability to learn to use code idioms during program generation." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-41", "text": "To do this, we first extract a set of frequently used idioms from the training set, and then provide them as supervision to the semantic parser's learning algorithm." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-42", "text": "Formally, if a semantic parser decoder is guided by a grammar G = (N, \u03a3, R), where N and \u03a3 are the sets of non-terminals and terminals respectively, and R is the set of production rules of the form A \u2192 \u03b2, A \u2208 N, \u03b2 \u2208 {N \u222a \u03a3}*, we would like to construct an idiom set I with rules of the form B \u2192 \u03b3, B \u2208 N, \u03b3 \u2208 {N \u222a \u03a3}*, such that B \u21d2\u22652 \u03b3 under G, i.e., \u03b3 can be derived in two or more steps from B under G." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-43", "text": "For the example in Figure 1 , R would contain rules for expanding each non-terminal, such as Statement \u2192 if ParExpr Statement IfOrElse and ParExpr \u2192 { Expr }, whereas I would contain the idiomatic rule State-" }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-44", "text": "The decoder builds trees from \u011c = (N, \u03a3, R \u222a I)." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-45", "text": "Although the sets of valid programs under G and \u011c are exactly the same, this introduction of ambiguous rules into G in the form of idioms presents an opportunity to learn shorter derivations."
}, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-46", "text": "In the next two sections, we describe the idiom extraction process, i.e., how I is chosen, and the idiom application process, i.e., how the decoder is trained to learn to apply idioms." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-47", "text": "----------------------------------" }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-48", "text": "**IDIOM EXTRACTION**" }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-49", "text": "The procedure to add idiomatic rules, I, to the regular production rules, R, is described in Algorithm 1." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-50", "text": "Our goal is to populate the set I by identifying a set of frequently occurring idioms (subtrees) from the programs in training set D. Since enumerating all subtrees of every AST in the training set is infeasible, we observe that any subtree s\u2032 of a particular frequently occurring subtree s is at least as frequent as s, so we take a bottom-up approach by repeatedly collapsing the most frequent depth-2 subtrees." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-51", "text": "Intuitively, this can be viewed as a particular kind of generalization of the BPE (Gage, 1994; Sennrich et al., 2016) algorithm for sequences, where new subtokens are created by repeatedly combining frequently occurring adjacent pairs of subtokens." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-52", "text": "Our definition of the most frequent depth-2 subtree is specific to parse trees, in that we require that either all or none of the children of any non-terminal node be included in the subtree; any subtree that only includes some children of a non-terminal is disallowed." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-53", "text": "We perform idiom extraction in an iterative fashion."
}, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-54", "text": "We start by populating T with all parse trees of programs in D using grammar G (Step 4)." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-55", "text": "Each iteration then comprises retrieving the most frequent depth-2 subtree s from T (Step 8), followed by post-processing T to replace all occurrences of s in T with a collapsed (depth-1) version of s (Step 10 and Step 17)." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-56", "text": "The collapse method (Step 20) simply takes a subtree, removes all its internal nodes and attaches its leaves directly to its root (Step 22)." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-57", "text": "The collapsed version of s is a new idiomatic rule (a depth-1 subtree), which is added to our set of idioms, I (Step 12)." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-58", "text": "We illustrate two iterations of this algorithm in Figure 2 ((a)-(b) and (c)-(d))." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-59", "text": "Assuming (a) is the most frequent depth-2 subtree from the dataset, it is transformed into the idiomatic rule in (b)." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-60", "text": "Larger idiomatic trees are learned by the combination of several depth-2 subtrees as the algorithm progresses." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-61", "text": "This is shown in Figure 2 (c), which contains within it the idiom extracted in (b), owing to the post-processing of the dataset after idiom (b) is extracted (Step 10 of Algorithm 1); this effectively makes the idiom in (d) a depth-3 idiom."
}, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-62", "text": "----------------------------------" }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-63", "text": "**MODEL TRAINING WITH IDIOMS**" }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-64", "text": "Once a set of idioms I is obtained, we next train our semantic parsing models to apply these idioms while decoding." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-65", "text": "We do this by supervising production rule generation in the decoder using a compressed set of rules for each example, using the idiom set I (see Algorithm 2)." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-66", "text": "More concretely, we first obtain the parse tree t i (or production rule set p i ) for each training example program y i under grammar G (Step 3) and then greedily collapse each depth-2 subtree in t i corresponding to every idiom in I (Step 5)." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-67", "text": "Once t i cannot be further collapsed, we translate t i into production rules r i based on the collapsed tree, with |r i | \u2264 |p i | (Step 7)." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-68", "text": "This process is illustrated in Figure 3 where we perform two applications of the first idiom from Figure 2 (b), followed by one application of the second idiom from Figure 2 (d) , after which the tree cannot be further compressed using those two idioms." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-69", "text": "The final tree can be represented using |r i | = 2 rules instead of the original |p i | = 5 rules." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-70", "text": "The decoder is then trained similar to previous approaches (Yin and Neubig, 2017; Iyer et al., 2018) using the compressed set of rules."
}, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-71", "text": "In later experiments, we find that this results in a rule set compression of more than 50% (see Section 7)." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-72", "text": "----------------------------------" }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-73", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-74", "text": "We apply our approach to the context dependent encoder-decoder model of Iyer et al. (2018) on the CONCODE dataset, and compare performance to a better tuned instance of their best model." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-75", "text": "----------------------------------" }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-76", "text": "**TASK**" }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-77", "text": "The CONCODE task involves mapping an NL query together with a class environment comprising a list of variables (with types) and methods (with return types), into the source code of a class member function." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-78", "text": "Figure 4 (a) shows an example where the context comprises variables and methods (with types) that would normally exist in a class that implements a vector, such as vecElements and dotProduct()." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-79", "text": "(Figure 4 (b): source code and AST derivation.)" }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-80", "text": "Conditioned on this context, the task involves mapping the NL query Adds a scalar to this vector in place into a sequence of parsing rules to generate the source code in Figure 4 (b)." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-81", "text": "Formally, their task is: Given an NL utterance q, a set of context variables {v i } with types {t i }, and a set of context methods {m i } with return types {r i }, predict a set of parsing rules {a i } of the target program."
}, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-82", "text": "Their best performing model is an encoder-decoder model with a context aware encoder and a decoder that produces production rules from the grammar of the target programming language." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-83", "text": "----------------------------------" }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-84", "text": "**BASELINE MODEL**" }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-85", "text": "We follow the approach of Iyer et al. (2018) with three major modifications in their encoder, which yields improvements in speed and accuracy (Iyer-Simp) ." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-86", "text": "First, in addition to camel-case splitting of identifier tokens, we further use byte-pair encoding (BPE) (Sennrich et al., 2016) on all NL tokens, identifier names and types and embed all these BPE tokens using a single embedding matrix." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-87", "text": "Next, we replace their RNN that contextualizes the subtokens of identifiers and types with an average of the subtoken embeddings." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-88", "text": "Finally, we consolidate their three separate RNNs for contextualizing NL, variable names with types, and method names with types, into a single shared RNN, which greatly reduces the number of model parameters." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-89", "text": "Formally, let {q i } represent the set of BPE tokens of the NL, and {t ij }, {v ij }, {r ij } and {m ij } represent the jth BPE token of the ith variable type, variable name, method return type, and method name respectively." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-90", "text": "First, all these elements are embedded using a BPE token embedding matrix B to give us q i , t ij , v ij , r ij and m ij ."
}, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-91", "text": "Using Bi-LSTM f , the encoder then computes:" }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-92", "text": "Then, h 1 , . . . , h z , and t i , v i , r i , m i are passed on to the attention mechanism in the decoder, exactly as in Iyer et al. (2018) ." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-93", "text": "The decoder of Iyer et al. (2018) is left unchanged." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-94", "text": "This forms our baseline model (Iyer-Simp)." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-95", "text": "----------------------------------" }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-96", "text": "**HYPERPARAMETERS**" }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-97", "text": "To create models that use idioms, we augment this decoder by first retrieving the top-K most frequent idioms from the training set (Algorithm 1), followed by post-processing the training set by greedily applying these idioms (Algorithm 2; we denote this model as Iyer-Simp-K)." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-98", "text": "We evaluate all our models on the CONCODE dataset which was created using Java class files from github.com." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-99", "text": "It contains 100K tuples of (NL, code, context) for training, 2,000 tuples for development, and an additional 2,000 tuples for testing." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-100", "text": "We use a BPE vocabulary of 10K tokens for embedding matrix B and get the best validation set results using the original hyperparameters used by Iyer et al. (2018) ." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-101", "text": "Since idiom-aware training is significantly faster than without idioms, we are also able to train on an additional 400K training examples that Iyer et al. (2018) released as part of CONCODE."
}, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-102", "text": "We report exact match accuracy, corpus-level BLEU score (which serves as a measure of partial credit) (Papineni et al., 2002) , and training time for all these configurations." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-103", "text": "Iyer et al. (2018) ." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-104", "text": "Significant improvements in training speed after incorporating idioms make training on large amounts of data possible." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-105", "text": "----------------------------------" }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-106", "text": "**RESULTS AND DISCUSSION**" }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-107", "text": "taining comparable EM accuracy." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-108", "text": "Using the top-200 idioms results in a target AST compression of more than 50%, which results in fewer decoder RNN steps being performed." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-109", "text": "This reduces training time further by more than 50%, from 27 hours to 13 hours." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-110", "text": "In Table 2 , we illustrate the changes in EM, BLEU and training time as we vary the number of idioms." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-111", "text": "We find that 200 idioms performs best overall in terms of balancing accuracy and training time." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-112", "text": "Adding more idioms continues to reduce training time, but accuracy also suffers." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-113", "text": "Since we permit idioms to contain identifier names in order to capture frequently used library methods in idioms, having too many idioms hurts generalization, especially since the test set is built using repositories disjoint from the training set."
}, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-114", "text": "Finally, the amount of compression, and therefore the training time, plateaus after the top-600 idioms are incorporated." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-115", "text": "Compared to the model of Iyer et al. (2018) , our significantly reduced training time enables us to train on their extended training set." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-116", "text": "We run Iyer-Simp using 400 idioms (taking advantage of even lower training time) on up to 5 times the amount of data, making sure that we do not include in training any NL from the validation or the test sets." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-117", "text": "Since the set of idioms learned from the original training set is quite general, we use it directly rather than relearning the idioms from scratch." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-118", "text": "We report EM and BLEU scores for different amounts of training data on the same validation and test sets as CONCODE in Table 3 ." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-119", "text": "In general, accuracies increase with the amount of data with the best model achieving a BLEU score of 28.9 and EM of 13.4." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-120", "text": "Figure 5 shows some of the idioms that were extracted from CONCODE." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-121", "text": "(a) is an idiom to construct a new object with arguments, (b) represents a try-catch block, and, (c) is an integer-based for loop." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-122", "text": "In (e), we show how small idioms are combined to form larger ones; it combines an if-then idiom with a throw-exception idiom, which throws an object instantiated using idiom (a)."
}, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-123", "text": "The decoder also learns idioms to directly generate common library methods such as System.out.println( StringLiteral ) in one decoding step (d)." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-124", "text": "----------------------------------" }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-125", "text": "**CONCLUSIONS**" }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-126", "text": "We presented a general approach to make semantic parsers aware of target idiomatic structures, which involves first identifying frequently used idioms and then providing semantic parsing models with supervision to apply these idioms." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-127", "text": "We demonstrated this approach on the task of context dependent code generation where we achieved a new state of the art in exact match accuracy and BLEU score." }, { "sent_id": "55160a7ab2df9a86e677bcc72d9842-C001-128", "text": "We also found that decoding using idioms significantly reduces training time and allows us to train on significantly larger datasets."
} ], "y": { "@BACK@": { "gold_contexts": [ [ "55160a7ab2df9a86e677bcc72d9842-C001-12" ], [ "55160a7ab2df9a86e677bcc72d9842-C001-26" ], [ "55160a7ab2df9a86e677bcc72d9842-C001-29" ] ], "cite_sentences": [ "55160a7ab2df9a86e677bcc72d9842-C001-12", "55160a7ab2df9a86e677bcc72d9842-C001-26", "55160a7ab2df9a86e677bcc72d9842-C001-29" ] }, "@USE@": { "gold_contexts": [ [ "55160a7ab2df9a86e677bcc72d9842-C001-21" ], [ "55160a7ab2df9a86e677bcc72d9842-C001-74" ], [ "55160a7ab2df9a86e677bcc72d9842-C001-92" ], [ "55160a7ab2df9a86e677bcc72d9842-C001-93" ], [ "55160a7ab2df9a86e677bcc72d9842-C001-100" ], [ "55160a7ab2df9a86e677bcc72d9842-C001-101" ] ], "cite_sentences": [ "55160a7ab2df9a86e677bcc72d9842-C001-21", "55160a7ab2df9a86e677bcc72d9842-C001-74", "55160a7ab2df9a86e677bcc72d9842-C001-92", "55160a7ab2df9a86e677bcc72d9842-C001-93", "55160a7ab2df9a86e677bcc72d9842-C001-100", "55160a7ab2df9a86e677bcc72d9842-C001-101" ] }, "@SIM@": { "gold_contexts": [ [ "55160a7ab2df9a86e677bcc72d9842-C001-70" ], [ "55160a7ab2df9a86e677bcc72d9842-C001-102", "55160a7ab2df9a86e677bcc72d9842-C001-103" ] ], "cite_sentences": [ "55160a7ab2df9a86e677bcc72d9842-C001-70", "55160a7ab2df9a86e677bcc72d9842-C001-103" ] }, "@EXT@": { "gold_contexts": [ [ "55160a7ab2df9a86e677bcc72d9842-C001-85" ] ], "cite_sentences": [ "55160a7ab2df9a86e677bcc72d9842-C001-85" ] }, "@DIF@": { "gold_contexts": [ [ "55160a7ab2df9a86e677bcc72d9842-C001-115" ] ], "cite_sentences": [ "55160a7ab2df9a86e677bcc72d9842-C001-115" ] } } }, "ABC_ddd23a034c366b62b53d15128edd45_4": { "x": [ { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-2", "text": "This paper proposes a method for reordering words in a Japanese sentence based on concurrent execution with dependency parsing so that the sentence becomes more readable." 
}, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-3", "text": "Our contributions are summarized as follows: (1) we extend a probablistic model used in the previous work which concurrently performs word reordering and dependency parsing; (2) we conducted an evaluation experiment using our semi-automatically constructed evaluation data so that sentences in the data are more likely to be spontaneously written by natives than the automatically constructed evaluation data in the previous work." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-4", "text": "----------------------------------" }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-5", "text": "**INTRODUCTION**" }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-6", "text": "Although Japanese has relatively free word order, Japanese word order is not completely arbitrary and has some sort of preference." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-7", "text": "Since such preference is incompletely understood, even Japanese natives often write Japanese sentences which are grammatically well-formed but not easy to read." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-8", "text": "For example, in Figure 1 , the word order of S1 is less readable than that of S2 because the distance between the bunsetsu \"Suzuki-san-ga (Mr. Suzuki)\" and its modified bunsetsu \"toi-te-shimatta (solved)\" is large and thus the loads on working memory become large (Nihongo Kijutsu Bunpo Kenkyukai, 2009; Uchimoto et al., 2000) There have been some conventional researches for reordering words in a sentence so that the sentence becomes easier to read (Belz et al., 2011; Filippova and Strube, 2007; Harbusch et al., 2006; Kruijff et al., 2001; Ringger et al., 2004; Shaw and Hatzivassiloglou, 1999; Uchimoto et al., 2000; Yokobayashi et al., 2004) ." 
}, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-9", "text": "Most of the conventional researches used syntactic information by assuming that an input sentence for word reordering^\u0102 Note: A box and an arrow express a bunsetsu 1 and a dependency relation, respectively." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-10", "text": "Both the sentences S1 and S2 have the same meaning which is translated as \"Mr. Suzuki instantly solved the problem that Mr. Sato could not possibly solve.\" in English." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-11", "text": "The difference between S1 and S2 is just in their word orders in Japanese." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-12", "text": "Figure 1: Example of less-readable/readable word order has been already parsed." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-13", "text": "There is a problem that the errors of dependency parsing increase when an input sentence is less-readable, and the parsing errors cause negative effects on word reordering." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-14", "text": "To solve the problem, we previously proposed a method for concurrently performing word reordering and dependency parsing and confirmed the effectiveness of their proposed method using evaluation data created by randomly changing the word order in newspaper article sentences (Yoshida et al., 2014) ." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-15", "text": "However, since some of the just automatically created sentences are unlikely to be spontaneously written by a native, the evaluation is thought to be not enough." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-16", "text": "In addition, the probablistic model has room for improvement in targeting at sentences which a native is likely to spontaneously write." 
}, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-17", "text": "This paper proposes a new method on Japanese word reordering based on concurrent execution with dependency parsing by extending the probablistic model proposed by Yoshida et al. (2014) , and describes an evaluation experiment using our 1 Bunsetsu is a linguistic unit in Japanese that roughly corresponds to a basic phrase in English." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-18", "text": "A bunsetsu consists of one independent word and zero or more ancillary words." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-19", "text": "A dependency relation in Japanese is a modification relation in which a modifier bunsetsu depends on a modified bunsetsu." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-20", "text": "That is, the modifier bunsetsu and the modified bunsetsu work as modifier and modifyee, respectively." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-21", "text": "evaluation data semi-automatically constructed by adding human judgement after automatically changing word order in newspaper article sentences." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-22", "text": "The experimental results showed the effectiveness of our method." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-23", "text": "----------------------------------" }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-24", "text": "**WORD ORDER AND DEPENDENCY**" }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-25", "text": "In this section, we discuss the relation between word order and dependency in a Japanese sentence using the example shown in Figure 1 ." 
}, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-26", "text": "On the ground that dependency is one of fundamental contributing factors which decide the appropriate word order (Nihongo Kijutsu Bunpo Kenkyukai, 2009), the conventional method (Uchimoto et al., 2000) reordered words using syntactic information obtained by dependency parsing which was assumed to be beforehand performed." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-27", "text": "However, the accuracy of dependency parsing decreases when an input sentence has lessreadable word order such as S1 because dependency parsers are usually trained on syntactically annotated corpora in which sentences have the readable word order such as S2." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-28", "text": "On the other hand, if word reordering is performed before dependency parsing, the accuracy of the word reordering is thought to decrease because syntactic information can not be utilized." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-29", "text": "In fact, to change the word order in S1 to the appropriate one such as S2, it is necessary to comprehend the dependency structure of S1." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-30", "text": "The above discussion indicates that word reordering and dependency parsing depend on each other." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-31", "text": "Therefore, we can consider it is more desirable to concurrently perform the two processes than to sequentially perform them." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-32", "text": "----------------------------------" }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-33", "text": "**WORD REORDERING METHOD**" }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-34", "text": "In our method, a sentence, on which morphological analysis and bunsetsu segmentation have been performed, is considered as the input." 
}, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-35", "text": "We assume that the input sentence might have word order which is not easy to read but grammatically well-formed." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-36", "text": "Our method identifies the suitable word order which is easy to read by being executed concurrently with dependency parsing." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-37", "text": "We realize the concurrent execution of dependency parsing and word reordering by searching for the maximum-likelihood pattern of word order and dependency structure for an input sentence." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-38", "text": "We use the same search algorithm as one proposed by Yoshida et al. (2014) , which can efficiently find the approximate solution from a huge number of candidates of the pattern by extending CYK algorithm used in conventional dependency parsing." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-39", "text": "In this paper, we refine the probabilistic model proposed by Yoshida et al. (2014) to improve the accuracy." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-40", "text": "Note our method reorders bunsetsus in a sentence without paraphrasing and does not reorder morphemes within a bunsetsu." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-41", "text": "In addition, we assume there are not any inverted structures and commas in an input sentence." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-42", "text": "----------------------------------" }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-43", "text": "**PROBABILISTIC MODEL FOR WORD REORDERING**" }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-44", "text": "When a sequence of bunsetsus in an input sentence B = b 1 \u00b7 \u00b7 \u00b7b n is provided, our method identifies the structure S which maximizes P (S|B)." 
}, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-45", "text": "The structure S is defined as a tuple S = \u27e8O, D\u27e9 where In the probablistic model proposed by Yoshida et al. (2014) , P (S|B) was calculated as follows:" }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-46", "text": "We extend the above model and calculate P (S|B) as follows:" }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-47", "text": "where \u03b1 is a weight and 0 \u2264 \u03b1 \u2264 1. Formula (2) is obtained for the weighted geometric average 2 between the following two Formulas (3) and (4)." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-48", "text": "Here, Formulas (3) and (4) are derived by expanding P (O, D|B) based on multiplication theorem." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-49", "text": "Formula (3) is thought to represent the processing flow in which dependency parsing is executed after word reordering, and Formula (4) is thought to represent the inverse flow." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-50", "text": "According to the probability theory, the calculated result of Formula (2) is equal to those of Formulas (3) and (4)." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-51", "text": "However, in practice, since each factor in the formulas is estimated based on a training corpus, the results of these formulas are different from each other." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-52", "text": "Figure 2 shows a conceptual diagram which represents relations among Fomulas (2) -(4)." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-53", "text": "If an input sentence has low adequacy of word order, it is thought that performing word reordering before dependency parsing enables S = \u27e8O, D\u27e9 to be identified with higher accuracy, and thus, we can conceive an idea of calculating P (O, D|B) by Fomula (3)." 
}, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-54", "text": "Conversely, if an input sentence has high adequacy of word order, it is probably better to perform word reordering after dependency parsing, and thus, we can think of calculating P (O, D|B) by Fomula (4)." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-55", "text": "Therefore, we mix Formulas (3) and (4) by adjusting the weight \u03b1 depending on the adequacy of word order in an input sentence, instead of using the constant 0.5 in the previous model proposed by Yoshida et al. (2014) ." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-56", "text": "Each factor in Formula (2) is estimated by the maximum entropy method in the same approximation procedure as that of Yoshida et al. (2014) ." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-57", "text": "----------------------------------" }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-58", "text": "**EXPERIMENT**" }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-59", "text": "To evaluate the effectiveness of our method, we applied our method to less-readable sentences artificially created by changing the word order of Japanese newspaper article sentences, and evaluated how much our method could reproduce the word order of the original sentences." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-60", "text": "----------------------------------" }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-61", "text": "**CONSTRUCTION OF EVALUATION DATA**" }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-62", "text": "From a viewpoint of utilizing our method for support revision, it is desirable to use less-readable sentences spontaneously written by Japanese natives in the experiment." 
}, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-63", "text": "However, it is not easy to collect a large amount of pairs composed of such a sentence and the corresponding sentence which was modified by hand so that the word order becomes readable, and also, such data is unavailable." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-64", "text": "In addition, since spontaneously written sentences have many factors other than word order which decrease the readability, it is difficult to conduct the evaluation with a focus solely on word order." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-65", "text": "Therefore, our previous work (Yoshida et al., 2014) artificially generated sentences which were not easy to read, by just automatically changing the word order of newspaper article sentences in Kyoto Text Corpus 3 based on the dependency structure." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-66", "text": "However, just automatically changing the word order may create sentences which are unlikely to be written by a native." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-67", "text": "To solve the problem, we semi-automatically constructed the evaluation data by adding human judgement." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-68", "text": "That is, if a subject judges that a sentence generated by automatically changing the word order in the same way as the previous work (Yoshida et al., 2014 ) may have spontaneously written by a native." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-69", "text": "Our constructed data has 552 sentences including 4,906 bunsetsus." 
}, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-70", "text": "----------------------------------" }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-71", "text": "**OUTLINE OF EXPERIMENT**" }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-72", "text": "Since our method needs to decide the weight \u03b1 in Formula (2) in advance, we conducted 5-fold cross validation using the evaluation data constructed in Section 4.1." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-73", "text": "Concretely, we divided 552 sentences into 5 sets, and then, we repeated an experiment 5 times, in which we used one set from among 5 sets as the test data and the others as the held-out data to decide \u03b1." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-74", "text": "As the training data to estimate each probability in Formula (2), we used 7,976 sentences in Kyoto Text Corpus, which were different from the 552 sentences." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-75", "text": "Here, we used the Maximum Entropy Modeling Toolkit for Python and C++ 4 with the default options except \"-i (iteration) 1000.\"" }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-76", "text": "In the evaluation of word reordering, we obtained the complete agreement (the percentage of the sentences in which all words' order completely agrees with that of the original sentence) and pair agreement (the percentage of the pairs of bunsetsus whose word order agrees with that in the original sentence), which are defined by Uchimoto et al. (2000) ." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-77", "text": "Here, when deciding \u03b1 using the held-out data, we calculate the \u03b1 to two places of decimals which maximizes the pair agreement." 
}, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-78", "text": "In the evaluation of dependency parsing, we obtained the dependency accuracy (the percentage of correctly analyzed dependencies out of all dependencies) and sentency accuracy (the percentage of the sentences in which all the dependencies are analyzed correctly), which were defined by Uchimoto et al. (1999) ." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-79", "text": "We compared our method to Yoshida's method (Yoshida et al., 2014) and two conventional sequential methods." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-80", "text": "Both the sequential methods execute the dependency parsing primarily, and then, perform the word reordering by using the conventional word reordering method (Uchimoto et al., 1999) ." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-81", "text": "The difference between the two is the method of dependency parsing." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-82", "text": "The sequential methods 1 and 2 use the dependency parsing method proposed by Uchimoto et al. (2000) and the dependency parsing tool CaboCha 5 , respectively." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-83", "text": "All of the methods used the same training features as those described in Yoshida et al. (2014) ." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-84", "text": "Table 1 shows the experimental results on word reordering of each method." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-85", "text": "Here, the last row shows the agreements measured by comparing the input word order with the correct word order." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-86", "text": "The agreements mean the values which can be achieved with no reordering." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-87", "text": "The both agreements of our method are micro averages for the agreements of each of the 5 sets." 
}, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-88", "text": "As the result of decision of \u03b1 by using the held-out data, the \u03b1 for 3 sets was 0.66, and the \u03b1 for the other two sets was 0.75." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-89", "text": "The both agreements of our method were highest among all." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-90", "text": "We can confirm the effectiveness of our method." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-91", "text": "Although the purpose of our method is reordering to improve readability, our method generates a dependency structure as a by-product." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-92", "text": "Here, for reference, we show the experimental results on dependency parsing in Table 2 ." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-93", "text": "The dependency accuracy of our method was significantly lower than that of the two sequential methods, and was higher than that of Yoshida's method although there was no significant difference." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-94", "text": "On the other hand, the sentence accuracy of our method was highest among all the methods although there were no significant differences in them." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-95", "text": "As a result of analysis, especially, our method and Yoshida's method tended to improve the sentence accuracy very well in case of short sentences." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-96", "text": "On the other hand, CaboCha, which is a dependency parser in sequential 2, tended not to depend very well on the length of sentences." 
}, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-97", "text": "----------------------------------" }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-98", "text": "**EXPERIMENTAL RESULTS**" }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-99", "text": "----------------------------------" }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-100", "text": "**CONCLUSION**" }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-101", "text": "This paper proposed the method for reordering bunsetsus in a Japanese sentence based on executing concurrently with dependency parsing." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-102", "text": "Especially, we extended the probablistic model proposed by Yoshida et al. (2014) to deal with sentences spontaneously written by a native." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-103", "text": "In addition, we conducted the experiment using our semiautomatically constructed evaluation data so that the sentences are likely to be spontaneously written by a native." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-104", "text": "The experimental results showed the effectiveness of our method." }, { "sent_id": "ddd23a034c366b62b53d15128edd45-C001-105", "text": "In the future, we would like to develop a word reordering method which can take account of comma positions by integrating our method with a method for identifying proper comma positions (for example, Murata et al., 2010) ." 
} ], "y": { "@EXT@": { "gold_contexts": [ [ "ddd23a034c366b62b53d15128edd45-C001-3" ], [ "ddd23a034c366b62b53d15128edd45-C001-17" ], [ "ddd23a034c366b62b53d15128edd45-C001-39" ], [ "ddd23a034c366b62b53d15128edd45-C001-46" ], [ "ddd23a034c366b62b53d15128edd45-C001-102" ] ], "cite_sentences": [ "ddd23a034c366b62b53d15128edd45-C001-3", "ddd23a034c366b62b53d15128edd45-C001-17", "ddd23a034c366b62b53d15128edd45-C001-39", "ddd23a034c366b62b53d15128edd45-C001-46", "ddd23a034c366b62b53d15128edd45-C001-102" ] }, "@BACK@": { "gold_contexts": [ [ "ddd23a034c366b62b53d15128edd45-C001-14" ], [ "ddd23a034c366b62b53d15128edd45-C001-65" ], [ "ddd23a034c366b62b53d15128edd45-C001-68" ] ], "cite_sentences": [ "ddd23a034c366b62b53d15128edd45-C001-14", "ddd23a034c366b62b53d15128edd45-C001-65", "ddd23a034c366b62b53d15128edd45-C001-68" ] }, "@USE@": { "gold_contexts": [ [ "ddd23a034c366b62b53d15128edd45-C001-38" ], [ "ddd23a034c366b62b53d15128edd45-C001-45" ], [ "ddd23a034c366b62b53d15128edd45-C001-56" ], [ "ddd23a034c366b62b53d15128edd45-C001-79" ], [ "ddd23a034c366b62b53d15128edd45-C001-83" ] ], "cite_sentences": [ "ddd23a034c366b62b53d15128edd45-C001-38", "ddd23a034c366b62b53d15128edd45-C001-45", "ddd23a034c366b62b53d15128edd45-C001-56", "ddd23a034c366b62b53d15128edd45-C001-79", "ddd23a034c366b62b53d15128edd45-C001-83" ] }, "@DIF@": { "gold_contexts": [ [ "ddd23a034c366b62b53d15128edd45-C001-55" ], [ "ddd23a034c366b62b53d15128edd45-C001-93" ], [ "ddd23a034c366b62b53d15128edd45-C001-94" ] ], "cite_sentences": [ "ddd23a034c366b62b53d15128edd45-C001-55", "ddd23a034c366b62b53d15128edd45-C001-93", "ddd23a034c366b62b53d15128edd45-C001-94" ] }, "@MOT@": { "gold_contexts": [ [ "ddd23a034c366b62b53d15128edd45-C001-65", "ddd23a034c366b62b53d15128edd45-C001-66" ] ], "cite_sentences": [ "ddd23a034c366b62b53d15128edd45-C001-65" ] }, "@SIM@": { "gold_contexts": [ [ "ddd23a034c366b62b53d15128edd45-C001-95" ] ], "cite_sentences": [ "ddd23a034c366b62b53d15128edd45-C001-95" ] } } }, 
"ABC_a99a393c83e47393400c72338faf80_4": { "x": [ { "sent_id": "a99a393c83e47393400c72338faf80-C001-55", "text": "**SYSTEM DEFINITION**" }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-2", "text": "This paper looks at the task of predicting word association strengths across three datasets; WordNet Evocation (BoydGraber et al., 2006), University of Southern Florida Free Association norms (Nelson et al., 2004), and Edinburgh Associative Thesaurus (Kiss et al., 1973)." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-3", "text": "We achieve results of r = 0.357 and \u03c1 = 0.379, r = 0.344 and \u03c1 = 0.300, an \u03c1 = 0.292 and \u03c1 = 0.363, respectively." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-4", "text": "We find Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) cosine similarities, as well as vector offsets, to be the highest performing features." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-5", "text": "Furthermore, we examine the usefulness of Gaussian embeddings (Vilnis and McCallum, 2014) for predicting word association strength, the first work to do so." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-6", "text": "----------------------------------" }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-8", "text": "Word embeddings such as Word2Vec (Mikolov et al., 2013) or GloVe (Pennington et al., 2014) have received increasing attention in the world of natural language processing and computational linguistics." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-9", "text": "Under such embeddings, the semantic relatedness of two words is generally taken to be the cosine similarity of their word vectors." 
}, { "sent_id": "a99a393c83e47393400c72338faf80-C001-10", "text": "Although this approach performs well for variety of applications, it is not without its limitations." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-11", "text": "First, it defines \"relatedness\" quite narrowly as the extent to which the two words appear in similar contexts." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-12", "text": "Second, it fails to capture how humans internally represent words (De Deyne et al., 2016b) ." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-13", "text": "Word associations offer a more flexible view of semantic relatedness by leveraging \"lexical knowledge acquired through world experience\" (Nelson et al., 2004) ." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-14", "text": "While word embeddings capture distributional relationships, word associations are able to capture more nuanced relationships \"which are based on human perception and experiences [and] are not reflected in common language usage." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-15", "text": "\" (Ma, 2013) For example, \"yellow\" is so closely associated with \"banana\" that many people would only specify a banana's colour if it is not yellow." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-16", "text": "This is backed up by De Deyne et al. (2016b) which found word associations performed better than word embeddings across a variety of semantic relatedness tasks." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-17", "text": "Furthermore, word associations, unlike cosine similarities, are asymmetric; when presented with the word \"beer\", many people think of the word \"glass\" but when presented with the word \"glass\", few people think of the word \"beer\" (Ma, 2013) ." 
}, { "sent_id": "a99a393c83e47393400c72338faf80-C001-18", "text": "This directionality allows for more fine-grained exploration of semantic links, with applications in word similarity (Jabeen et al., 2013) and computational humour (Cattle and Ma, 2016) ." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-19", "text": "Although several word association datasets exist, such as the Edinburgh Associative Thesaurus (EAT, Kiss et al., 1973) , the University of South Florida Free Association Norms (USF, Nelson et al., 2004 ), or WordNet Evocation (Evocation, Boyd-Graber et al., 2006 , their reliance on human annotations mean they all suffer from coverage issues relating to limited vocabularies or sparse connectivity (Cattle and Ma, 2016; De Deyne et al., 2016b) ." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-20", "text": "Although these issues would be somewhat alleviated by the creation of larger datasets, collecting human judgments for all possible word pairs is impractical." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-21", "text": "Therefore, the ability to predict association strengths between arbitrary word pairs represents the best solution to these coverage issues (Boyd-Graber et al., 2006) ." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-22", "text": "Although the prediction of Evocation ratings has attracted some attention (Boyd-Graber et al., 2006; Hayashi, 2016) , to the best of our knowledge this is the first work to focus on the prediction of USF or EAT strengths." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-23", "text": "As described in Sec-tion 2, USF and EAT have several advantages over Evocation, such as the ability to work with ambiguous words instead of WordNet synsets." 
}, { "sent_id": "a99a393c83e47393400c72338faf80-C001-24", "text": "Following Hayashi (2016)'s work on Evocation prediction, we frame word association prediction as a supervised regression task and introduce several new and modified features, including the first use of Gaussian embeddings (Vilnis and McCallum, 2014) to better capture the asymmetric nature of word associations." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-25", "text": "----------------------------------" }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-26", "text": "**PREVIOUS WORK**" }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-27", "text": "Word association has been used in psychological and psycholinguistic experiments for well over 100 years (Boyd-Graber et al., 2006; De Deyne and Storms, 2008) ." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-28", "text": "Word association datasets such as USF or EAT have typically framed word association as \"a task that requires participants to produce the first word to come to mind that is related in a specified way to a presented cue\" (Nelson et al., 2000) ." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-29", "text": "These datasets use forward strengths, the proportion of participants who produce a specific response, to \"index the relative accessibility of related words in memory [for a given cue]\" (Nelson et al., 2004) ." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-30", "text": "This cue/response framework has several drawbacks." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-31", "text": "First, since forward strengths are relative, comparing strengths across different cue words is difficult." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-32", "text": "Second, both cues and responses are ambiguous, with each participant's responses being greatly influenced by how they chose to interpret a given cue." 
}, { "sent_id": "a99a393c83e47393400c72338faf80-C001-33", "text": "For example, someone responding to the cue \"brother\" with \"monk\" is considering a different sense of \"brother\" than someone who responds \"sister\" (Ma, 2013) ." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-34", "text": "As such, forward strengths are biased toward responses which presume more readily apparent cue word senses." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-35", "text": "Third, limiting participants to a single response can lead to weaker associations being underreported or omitted entirely." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-36", "text": "Evocation solves the ambiguity issue by focusing on associations between WordNet synsets." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-37", "text": "Boyd-Graber et al. (2006) presented participants with randomly selected synset pairs and asked them to score how much the first synset evoked (i.e. brought to mind) the second." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-38", "text": "Unlike forward strengths, these Evocation ratings are absolute, meaning they can be directly compared across different cues." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-39", "text": "While randomly selecting synset pairs ensured that weaker associations would not be underreported, it did have the disadvantage that 67% of pairs were unanimously rated as having no connection (Boyd-Graber et al., 2006) ." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-40", "text": "Despite attempts to address this spareness issue by expanding Evocation with data gathered from Amazon Mechanical Turk 1 (Nikolova et al., 2009) or word-sense disambiguated USF cue/response pairs (Ma, 2013) , obtaining human judgments for all possible synset pairs is impractical." 
}, { "sent_id": "a99a393c83e47393400c72338faf80-C001-41", "text": "As such, the prediction of Evocation ratings presents the most promising solution to this coverage issue." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-42", "text": "Boyd-Graber et al. (2006) detailed a simple Evocation estimator which used a combination of WordNet structure-based features, WordNet definition-based features, and corpus-based word co-occurrence features." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-43", "text": "However, this approach is somewhat limited in that it frames Evocation prediction as a classification task, considering only five Evocation levels." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-44", "text": "The main drawback of Evocation prediction as a classification task is that it is too coarse-grained to deal with very weak associations, such as those in remote triads (De Deyne et al., 2016a) , or very slight variations in association strength, such as those useful for computational humour (Cattle and Ma, 2016) ." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-45", "text": "To this end, Hayashi (2016) framed Evocation prediction as a supervised regression task." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-46", "text": "They employed a combination of WordNet structure-based features, word embedding-based features, and lexical features and found that vector offsets, i.e. the mathematical difference between vectors, were a strong indicator of Evocation ratings." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-47", "text": "While Evocation's use of unambiguous synsets is useful for many applications, it is not without its own drawbacks." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-48", "text": "First, it requires texts to be word sense disambiguated; a non-trivial task." 
}, { "sent_id": "a99a393c83e47393400c72338faf80-C001-49", "text": "Second, since humans do not conceptualize words as a discrete set of independent word senses, Evocation is unable to capture natural associations owing to homography, homophony, or polysemy (Ma, 2013)." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-50", "text": "As such, despite their drawbacks, word associations may provide a more flexible, more holistic view of mental semantics." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-51", "text": "By allowing participants to record more than one response, De Deyne and Storms (2008), and their derivative works De Deyne et al. (2013) and De Deyne et al. (2016b) , were able to better represent weaker associations." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-52", "text": "However, this introduced its own set of problems as great care had to be taken to avoid chaining, i.e. responding to a previous response instead of the cue, and retrieval inhibition." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-53", "text": "De Deyne and Storms (2008) frames word association collection as a continuous task, meaning not only that the vocabulary is ever growing but also that changes in associations over time can be observed and tracked. But despite the steps taken to improve the size and quality of their association dataset, practicality dictates that coverage issues cannot be completely eliminated." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-54", "text": "----------------------------------" }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-56", "text": "Our word association prediction system extends the method in Hayashi (2016) with several modifications to make it better suited to the USF and EAT datasets." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-57", "text": "First, we modify Hayashi (2016)'s lexVector." 
}, { "sent_id": "a99a393c83e47393400c72338faf80-C001-58", "text": "Hayashi (2016) represent each word's part-ofspeech (POS) using a one-hot encoded five dimensional vector (one of each POS in WordNet)." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-59", "text": "Similarly, they represent each word's lexical category using a one-hot encoded 45 dimensional vector (one for each WordNet lexicographer file)." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-60", "text": "This results in a 100 dimensional vector representing the POS and lexical categories of both the cue and the response." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-61", "text": "Since words in USF and EAT can be associated with multiple synsets and we want to be able to capture associations related to polysemy, instead using a one-hot encoding we employ count vectors specifying the number of synsets from each POS/lexical category each word belongs to." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-62", "text": "Second, instead of computing Wu-Palmer similarity (WUP, Wu and Palmer, 1994 ) between a single synset pair, we compute it for all cue synset/response synset pairs and record the maximum and average values." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-63", "text": "Following Boyd-Graber et al. (2006) and Ma (2013), we also explored the use of path and Leacock-Chodorow (Leacock and Chodorow, 1998) similarities but found they did not add any advantage over WUP alone." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-64", "text": "We take a similar approach for adapting load and betweenness centralities (Barthelemy, 2004) as well as AutoExtend (AutoEx, Rothe and Sch\u00fctze, 2015) similarity." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-65", "text": "Third, we extend the notion of dirRel, introduced in Hayashi (2016) to leverage the semantic network structure of WordNet." 
}, { "sent_id": "a99a393c83e47393400c72338faf80-C001-66", "text": "Given a graph where nodes represent synsets and arcs represent WordNet relations such as hypernym/hyponym and holonym/meronym, dirRel(s,t,k) is the proportion of k-step neighbours of s that are also k-step neighbours of t. In the original formula, s and t are nodes representing a single synset." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-67", "text": "We instead consider a set of nodes S and a set of nodes T representing the set of synsets associated with the cue and response words, respectively, as shown in Equation 1." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-68", "text": "This may increase the probability that |nb(S, k)\u2229nb(T, k)| > 0, a shortcoming of the original dirRel due to WordNet's \"relatively sparse connective structure\" (Hayashi, 2016) ." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-69", "text": "Fourth, in addition to the Word2Vec (w2v) cosine similarity between cue/response pairs calculated using Google's pre-trained 300 dimension Word2Vec embeddings 2 ." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-70", "text": "We also examine the effectiveness of Stanford's pre-trained 300 dimension GloVe embeddings 3 ." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-71", "text": "Fifth, in order to better capture asymmetric word associations, we propose using Gaussian embeddings." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-72", "text": "Gaussian embeddings (Vilnis and McCallum, 2014) represent words not as a fixed point in vector space but as \"potential functions\", continuous densities in latent space; therefore, they are more suitable for capturing asymmetric relationships." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-73", "text": "More specifically, for each cue/response pair, we calculate both the KL-divergence and cosine similarities of their Gaussian embeddings." 
}, { "sent_id": "a99a393c83e47393400c72338faf80-C001-74", "text": "The embeddings have a dimensionality of 300 and are trained on English Wikipedia using the Word2Gauss 4 (w2g) and the hyperparameters reported by the developer 5 Sixth, since cue and response words are not associated with a single synset, the AutoEx embeddings employed in Hayashi (2016) to compute vector offsets are not well suited for our task." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-75", "text": "Instead, we experiment with offsets calculated using w2v, GloVe, and w2g embeddings." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-76", "text": "Finally, our 300 topic LDA model (Blei et al., 2003) was trained using Gensim 6 on full English Wikipedia instead of the subset of English Wikipedia used in Hayashi (2016) ." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-77", "text": "Using the above features, we trained a multilayer perceptron for each of our three datasets; Evocation, USF, and EAT." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-78", "text": "In the case of Evocation, we discarded any synset information and Table 1 : Individual feature performance after 50 epochs simply use each synset's headword (e.g. given the sysnet entity.n.01, we only considered the word entity)." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-79", "text": "Following the setup used in Hayashi (2016) , all neural networks are trained using the Chainer 7 Python library with rectified linear units, dropout, and two hidden layers, each with 50% of the number of units in the input layer." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-80", "text": "All were trained on 80% of their respective dataset, with 20% held out for testing." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-81", "text": "Mean squared error was used as a loss function and optimization was performed using Adam algorithm (Kingma and Ba, 2014) ." 
}, { "sent_id": "a99a393c83e47393400c72338faf80-C001-82", "text": "To act as a baseline, we also reimplemented the system described in Hayashi (2016) and trained it on the same 80/20 split of Evocation as our system." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-83", "text": "In addition to the reported results, we also performed feature selection experiments using 20% of the training sets as validation." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-84", "text": "----------------------------------" }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-85", "text": "**RESULTS AND DISCUSSION**" }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-86", "text": "The performance of individual features are reported in Table 1 while the results of our ablation experiments are reported in Table 2 ." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-87", "text": "For all experiments we report both the Pearson correlation coefficient (as r) and Spearman's rank correlation coefficient (as \u03c1)." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-88", "text": "The best performing single feature on Evocation and USF is w2v cosine similarity." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-89", "text": "However, its removal in the ablation test had little effect." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-90", "text": "This is likely due to redundancy between w2v and GloVe; not only does GloVe perform similarly to w2v but removing both features at the same time produced the largest drop in performance." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-91", "text": "It is unclear why word embedding cosine similarities in gen- Table 2 : Ablation performance after 50 epochs eral performed relatively poorly on EAT." 
}, { "sent_id": "a99a393c83e47393400c72338faf80-C001-92", "text": "While the USF and EAT datasets are very similar, EAT does seem to contain a greater number of multiword cues/responses which would not be in the word embedding vocabularies." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-93", "text": "In such cases, perhaps a multi-word embedding like Doc2Vec (Le and Mikolov, 2014) would be more appropriate." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-94", "text": "However, if this were indeed the issue, one would expect vector offsets to perform equally poorly." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-95", "text": "This is not the case, with GloVe offsets being the best performing single feature on EAT and the removal of w2v offsets causing the greatest drop in performance in the EAT ablation tests." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-96", "text": "The results of our Hayashi (2016) implementation are roughly comparable to those reported in the original paper (r = 0.374, \u03c1 = 0.401 compared to r = 0.439, \u03c1 = 0.400)." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-97", "text": "Our slightly lower Pearson's R may be due to differences in way we split our training and test data as well as due to randomness in the training process itself." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-98", "text": "On Evocation, our system does not perform as well as Hayashi (2016) ." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-99", "text": "This is expected as we explicitly ignore any synset information and instead attempt to predict association strengths between word-sense ambiguous words." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-100", "text": "Despite this, our performance is not appreciably lower, indicating the fitness of our system." 
}, { "sent_id": "a99a393c83e47393400c72338faf80-C001-101", "text": "The fact that we perform better on Evocation than either USF or EAT is quite interesting considering our system was designed with USF and EAT in mind." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-102", "text": "There are several possible explanations for this." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-103", "text": "First, as mentioned in Section 2, 67% of cue/response pairs in Evocation have a strength of zero." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-104", "text": "This uniformity in Evocation strengths may make them easier to predict." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-105", "text": "Second, due to the way USF and EAT were collected, there are no true zeros in the datasets." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-106", "text": "This lack of grounding may skew the predictions." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-107", "text": "Third, this may be an indication that predicting associations in a wordsense ambiguous context is a harder task than predicting them in a word-sense disambiguated one." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-108", "text": "Boyd-Graber et al. (2006) explicitly told annotators to ignore associations based on polysemy or rhyme." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-109", "text": "It could be the case that these effects are more difficult to identify." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-110", "text": "Another possible explanation for this relatively lower performance is a lack of bespoke features." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-111", "text": "For example, we heavily rely on WordNet-based features which make sense in a word-sense disambiguated context but are less suited for ambiguous contexts." 
}, { "sent_id": "a99a393c83e47393400c72338faf80-C001-112", "text": "In fact, removal of several of these features, such as betweenness or AutoEx similarity, seem to slightly improve performance." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-113", "text": "One explanation is that, despite noting in Section2 that word association strengths are influenced by word-sense frequencies, our system instead implicitly assumes all synsets are equally likely." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-114", "text": "The most surprising finding was the poor performance of Gaussian embeddings overall, and KL-divergence in particular." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-115", "text": "Given the asymmetric nature of word associations, KL-divergence seemed to be a natural fit." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-116", "text": "However, it is vastly outperformed by even cosine similarity on the same set of embeddings." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-117", "text": "Despite this, the usefulness of Gaussian embeddings cannot be ruled out." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-118", "text": "While we used pre-trained embeddings for Word2Vec and GloVe, we had to train our own Gaussian embedding model." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-119", "text": "Not only were Word2Vec and GloVe trained on much larger corpora than Gaussian embedding's English Wikipedia, but the pre-trained embeddings likely underwent a greater degree of hyperparameter tuning." 
}, { "sent_id": "a99a393c83e47393400c72338faf80-C001-120", "text": "----------------------------------" }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-121", "text": "**CONCLUSIONS AND FUTURE WORKS**" }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-122", "text": "In this paper we explored the effectiveness of various features for predicting Evocation, USF, and EAT association strengths, finding GloVe and Word2Vec cosine similarities as well as vector offsets to be the most useful features." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-123", "text": "We also examined the effectiveness of Gaussian embeddings for capturing the asymmetric nature of word embeddings but found it to be less effective than traditional word embeddings." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-124", "text": "Although we report a lower performance than that in Hayashi (2016) , potentially indicating that predicting association strengths in word-sense ambiguous contexts is a harder task, we believe our results are a promising start." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-125", "text": "Training Gaussian embeddings on a larger corpus may lead to improved effectiveness." }, { "sent_id": "a99a393c83e47393400c72338faf80-C001-126", "text": "Future works should also consider incorporating word-sense frequencies or developing word-sense agnostic features, with a particular focus on asymmetric features." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "a99a393c83e47393400c72338faf80-C001-22" ], [ "a99a393c83e47393400c72338faf80-C001-45" ], [ "a99a393c83e47393400c72338faf80-C001-46" ], [ "a99a393c83e47393400c72338faf80-C001-58" ], [ "a99a393c83e47393400c72338faf80-C001-59" ], [ "a99a393c83e47393400c72338faf80-C001-66" ] ], "cite_sentences": [ "a99a393c83e47393400c72338faf80-C001-22", "a99a393c83e47393400c72338faf80-C001-45", "a99a393c83e47393400c72338faf80-C001-46", "a99a393c83e47393400c72338faf80-C001-58", "a99a393c83e47393400c72338faf80-C001-59", "a99a393c83e47393400c72338faf80-C001-66" ] }, "@EXT@": { "gold_contexts": [ [ "a99a393c83e47393400c72338faf80-C001-24" ], [ "a99a393c83e47393400c72338faf80-C001-56" ], [ "a99a393c83e47393400c72338faf80-C001-57" ], [ "a99a393c83e47393400c72338faf80-C001-65" ] ], "cite_sentences": [ "a99a393c83e47393400c72338faf80-C001-24", "a99a393c83e47393400c72338faf80-C001-56", "a99a393c83e47393400c72338faf80-C001-57", "a99a393c83e47393400c72338faf80-C001-65" ] }, "@DIF@": { "gold_contexts": [ [ "a99a393c83e47393400c72338faf80-C001-61" ], [ "a99a393c83e47393400c72338faf80-C001-62" ], [ "a99a393c83e47393400c72338faf80-C001-67" ], [ "a99a393c83e47393400c72338faf80-C001-68" ], [ "a99a393c83e47393400c72338faf80-C001-74" ], [ "a99a393c83e47393400c72338faf80-C001-75" ], [ "a99a393c83e47393400c72338faf80-C001-76" ], [ "a99a393c83e47393400c72338faf80-C001-98" ], [ "a99a393c83e47393400c72338faf80-C001-124" ] ], "cite_sentences": [ "a99a393c83e47393400c72338faf80-C001-61", "a99a393c83e47393400c72338faf80-C001-62", "a99a393c83e47393400c72338faf80-C001-67", "a99a393c83e47393400c72338faf80-C001-68", "a99a393c83e47393400c72338faf80-C001-74", "a99a393c83e47393400c72338faf80-C001-75", "a99a393c83e47393400c72338faf80-C001-76", "a99a393c83e47393400c72338faf80-C001-98", "a99a393c83e47393400c72338faf80-C001-124" ] }, "@USE@": { "gold_contexts": [ [ "a99a393c83e47393400c72338faf80-C001-79" ], [ "a99a393c83e47393400c72338faf80-C001-82" ] ], "cite_sentences": [ 
"a99a393c83e47393400c72338faf80-C001-79", "a99a393c83e47393400c72338faf80-C001-82" ] }, "@SIM@": { "gold_contexts": [ [ "a99a393c83e47393400c72338faf80-C001-96" ] ], "cite_sentences": [ "a99a393c83e47393400c72338faf80-C001-96" ] } } }, "ABC_87a87855d67d90c691ab5bedb4d460_4": { "x": [ { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-23", "text": "----------------------------------" }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-24", "text": "**AUDIOBOOK CORPUS FOR END-TO-END SPEECH TRANSLATION**" }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-25", "text": "----------------------------------" }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-26", "text": "**AUGMENTED LIBRISPEECH**" }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-22", "text": "Finally, section 5 concludes this work." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-2", "text": "We investigate end-to-end speech-to-text translation on a corpus of audiobooks specifically augmented for this task." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-3", "text": "Previous works investigated the extreme case where source language transcription is not available during learning nor decoding, but we also study a midway case where source language transcription is available at training time only." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-4", "text": "In this case, a single model is trained to decode source speech into target text in a single pass." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-5", "text": "Experimental results show that it is possible to train compact and efficient end-to-end speech translation models in this setup." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-6", "text": "We also distribute the corpus and hope that our speech translation baseline on this corpus will be challenged in the future." 
}, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-7", "text": "----------------------------------" }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-9", "text": "Most spoken language translation (SLT) systems integrate (loosely or closely) two main modules: source language speech recognition (ASR) and source-to-target text translation (MT)." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-10", "text": "In these approaches, a symbolic sequence of words (or characters) in the source language is used as an intermediary representation during the speech translation process." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-11", "text": "However, recent works have attempted to build end-to-end speech-to-text translation without using source language transcription during learning or decoding." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-12", "text": "One attempt to translate directly a source speech signal into target language text is that of [1] ." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-13", "text": "However, the authors focus on the alignment between source speech utterances and their text translation without proposing a complete end-to-end translation system." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-14", "text": "The first attempt to build an end-to-end speech-to-text translation system (which does not use source language) is our own work [2] but it was applied to a synthetic (TTS) speech corpus." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-15", "text": "A similar approach was then proposed and evaluated on a real speech corpus by [3] ." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-16", "text": "This paper is a follow-up of our previous work [2] ." 
}, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-17", "text": "We now investigate end-to-end speech-to-text translation on a corpus of audiobooks -LibriSpeech [4] -specifically augmented to perform end-to-end speech translation [5] ." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-18", "text": "While previous works [2, 3] investigated the extreme case where source language transcription is not available during learning nor decoding (unwritten language scenario defined in [6, 7] ), we also investigate, in this paper, a midway case where a certain amount of source language transcription is available during training." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-19", "text": "In this intermediate scenario, a unique (endto-end) model is trained to decode source speech into target text through a single pass (which can be interesting if compact speech translation models are needed)." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-20", "text": "This paper is organized as follows: after presenting our corpus in section 2, we present our end-to-end models in section 3." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-21", "text": "Section 4 describes our evaluation on two datasets: the synthetic dataset used in [2] and the audiobook dataset described in section 2." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-27", "text": "Large quantities of parallel texts (e.g. Europarl or Open-Subtitles) are available for training text machine translation systems, but there are no large (>100h) and publicly available parallel corpora that include speech in a source language aligned to text in a target language." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-28", "text": "The Fisher/Callhome Spanish-English corpora [8] are only medium size (38h), contain low-bandwidth recordings, and are not available for free." 
}, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-29", "text": "We very recently built a large English to French corpus for direct speech translation training and evaluation [5] 1 , which is much larger than the existing corpora described above." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-30", "text": "We started from the LibriSpeech corpus used for Automatic Speech Recognition (ASR), which has 1000 hours of speech aligned with their transcriptions [4] ." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-31", "text": "The read audiobook recordings derive from a project based on a collaborative effort: LibriVox." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-32", "text": "The speech recordings are based on public domain books available on Gutenberg Project 2 which are distributed in LibriSpeech along with the recordings." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-33", "text": "Our augmentation of LibriSpeech is straightforward: we automatically aligned e-books in a foreign language (French) with English utterances of LibriSpeech." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-34", "text": "This lead to 236 hours of English speech aligned to French translations at utterance level (more details can be found in [5] )." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-35", "text": "Since English (source) transcriptions are initially available for Lib-riSpeech, we also translated them using Google Translate." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-36", "text": "To summarize, for each utterance of our 236h corpus, the following quadruplet is available: English speech signal, English transcription, French text translation 1 (from alignment of e-books) and translation 2 (from MT of English transcripts)." 
}, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-37", "text": "----------------------------------" }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-38", "text": "**MT AND AST TASKS**" }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-39", "text": "This paper focuses on the speech translation (AST) task of audiobooks from English to French, using the Augmented LibriSpeech corpus." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-40", "text": "We compare a direct (end-to-end) approach, with a cascaded approach that combines a neural speech transcription (ASR) model with a neural machine translation model (MT)." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-41", "text": "The ASR and MT results are also reported as baselines for future uses of this corpus." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-42", "text": "Augmented LibriSpeech contains 236 hours of speech in total, which is split into 4 parts: a test set of 4 hours, a dev set of 2 hours, a clean train set of 100 hours, and an extended train set with the remaining 130 hours." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-43", "text": "Table 1 gives detailed information about the size of each corpus." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-44", "text": "All segments in the corpus were sorted according to their alignment confidence scores, as produced by the alignment software used by the authors of the corpus [5] ." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-45", "text": "The test, dev and train sets correspond to the highest rated alignments." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-46", "text": "The remaining data (extended train) is more noisy, as it contains more incorrect alignments." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-47", "text": "The test set was manually checked, and incorrect alignments were removed." 
}, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-48", "text": "We perform all our experiments using train only (without extended train)." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-49", "text": "Furthermore, we double the training size by concatenating the aligned references with the Google Translate references." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-50", "text": "We also mirror our experiments on the BTEC synthetic speech corpus, as a follow-up to [2] ." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-51", "text": "----------------------------------" }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-52", "text": "**END-TO-END MODELS**" }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-53", "text": "For the three tasks, we use encoder-decoder models with attention [9, 10, 11, 2, 3] ." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-54", "text": "Because we want to share some parts of the model between tasks (multi-task training), the ASR and AST models use the same encoder architecture, and the AST and MT models use the same decoder architecture." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-55", "text": "----------------------------------" }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-56", "text": "**SPEECH ENCODER**" }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-57", "text": "The speech encoder is a mix between the convolutional encoder presented in [3] and our previously proposed encoder [2] ." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-58", "text": "It takes as input a sequence of audio features: x = (x 1 , . . . , x Tx ) \u2208 R Tx\u00d7n ." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-59", "text": "Like [2] , these features are given as input to two non-linear (tanh) layers, which output new features of size n ." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-60", "text": "Like [3] , this new set of features is then passed to a stack of two convolutional layers." 
}, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-61", "text": "Each layer applies 16 convolution filters of shape (3, 3, depth) with a stride of (2, 2) w.r.t." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-62", "text": "time and feature dimensions; depth is 1 for the first layer, and 16 for the second layer." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-63", "text": "We get features of shape (T x /2, n /2, 16) after the 1 st layer, and (T x /4, n /4, 16) after the 2 nd layer." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-64", "text": "This latter tensor is flattened with shape (T x = T x /4, 4n ) before being passed to a stack of three bidirectional LSTMs." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-65", "text": "This set of features has 1/4th the time length of the initial features, which speeds up training, as the complexity of the model is quadratic with respect to the source length." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-66", "text": "In our models, we use n = 128, which gives features of size 512." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-67", "text": "The last bidirectional LSTM layer computes a sequence of annotations h = h 1 , \u00b7 \u00b7 \u00b7 , h T x , where each annotation h i is a concatenation of the corresponding forward and backward states:" }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-68", "text": "This model differs from [2] , which did not use convolutions, but time pooling between each LSTM layer, resulting in a shorter sequence (pyramidal encoder)." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-69", "text": "----------------------------------" }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-70", "text": "**CHARACTER-LEVEL DECODER**" }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-71", "text": "We use a character-level decoder composed of a conditional LSTM [12] , followed by a dense layer." 
}, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-72", "text": "where update 1 and update 2 are two LSTMs with cell size m ." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-73", "text": "look is a vanilla global attention mechanism [9] , which uses a feed-forward network with one hidden layer of size m ." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-74", "text": "E k\u00d7|V | is the target embedding matrix, with k the embedding size and |V | the vocabulary size." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-75", "text": "c t \u2208 R 2m is a context vector which summarizes the input states to help the decoder generate a new symbol and update its state." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-76", "text": "generate uses a nonlinear layer followed by a linear projection to compute a score for each symbol in target vocabulary V ." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-77", "text": "It then picks target symbol z t with the highest score:" }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-78", "text": "where l is the output layer size." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-79", "text": "----------------------------------" }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-80", "text": "**EXPERIMENTS**" }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-81", "text": "----------------------------------" }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-82", "text": "**MODEL SETTINGS**" }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-83", "text": "Speech files were preprocessed using Yaafe [13] , to extract 40 MFCC features and frame energy for each frame with a step size of 10 ms and window size of 40 ms, following [14, 2] ." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-84", "text": "We tokenize and lowercase all the text, and normalize the punctuation, with the Moses scripts 3 for LibriSpeech are of size 46 for English (transcriptions) and 167 for French (translation)." 
}, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-85", "text": "The decoder outputs are always at the character-level (for AST, MT and ASR)." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-86", "text": "For the MT task, the LibriSpeech English (source) side is preprocessed into subword units [15] ." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-87", "text": "We limit the number of merge operations to 30k, which gives a vocabulary of size 27k." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-88", "text": "The MT encoder for BTEC takes entire words as input." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-89", "text": "Our BTEC models use an LSTM size of m = m = 256, while the LibriSpeech models use a cell size of 512, except for the speech encoder layers which use a cell size of m = 256 in each direction." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-90", "text": "We use character embeddings of size k = 64 for BTEC, and k = 128 for LibriSpeech." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-91", "text": "The MT encoders are more shallow, with a single bidirectional layer." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-92", "text": "The source embedding sizes for words (BTEC) and subwords (LibriSpeech) are respectively 128 and 256." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-93", "text": "The input layers in the speech encoders have a size of 256 for the first layer and n = 128 for the second." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-94", "text": "The Lib-riSpeech decoders use an output layer size of l = 512." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-95", "text": "For BTEC, we do not use any non-linear output layer, as we found that this led to overfitting." 
}, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-96", "text": "----------------------------------" }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-97", "text": "**TRAINING SETTINGS**" }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-98", "text": "We train our models with Adam [16] , with a learning rate of 0.001, and a mini-batch size of 64 for BTEC, and 32 for LibriSpeech (because of memory constraints)." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-99", "text": "We use variational dropout [17] , i.e., the same dropout mask is applied to all elements in a batch at all time steps, with a rate of 0.2 for LibriSpeech and 0.4 for BTEC." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-100", "text": "In the MT tasks, we also drop source and target symbols at random, with probability 0.2." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-101", "text": "Dropout is not applied on recurrent connections [18] ." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-102", "text": "We train all our models on LibriSpeech train augmented with the Google Translate references, i.e., the source side of the corpus (speech) is duplicated, and the target side (translations) is a concatenation of the aligned references with the Google Translate references." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-103", "text": "Because of GPU memory limits, we set the maximum length to 1400 frames for LibriSpeech input, and 300 characters for its output." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-104", "text": "This covers about 90% of the training corpus." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-105", "text": "Longer sequences are kept but truncated to the maximum size." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-106", "text": "We evaluate our models on the dev set every 1000 mini-batch updates using BLEU for AST and MT, and WER for ASR, and keep the best performing checkpoint for final evaluation on the test set." 
}, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-107", "text": "Our models are implemented with TensorFlow [19] as part of the LIG-CRIStAL NMT toolkit 4 ." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-108", "text": "Table 2 presents the results for the ASR and MT tasks on BTEC and LibriSpeech." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-109", "text": "The MT task (and by extension the AST task) on LibriSpeech (translating novels) looks particularly challenging, as we observe BLEU scores around 20% 5 For Automatic Speech Translation (AST), we try four settings." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-110", "text": "The cascaded model combines both the ASR and MT models (as a pipeline)." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-111", "text": "The end-to-end model (described in section 3) does not make any use of source language transcripts." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-112", "text": "The pre-trained model is identical to end-to-end, but its encoder and decoder are initialized with our ASR and MT Table 3 : Results of the AST task on BTEC test." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-113", "text": "\u2020 was obtained with an ensemble of 5 models, while we use ensembles of 2 models." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-114", "text": "The non-cascaded ensemble combines the pre-trained and multi-task models." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-115", "text": "Contrary to [2] , we only present mono-reference results." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-116", "text": "ASR mono ASR multi MT mono MT multi Fig. 1 : Augmented LibriSpeech Dev BLEU scores for the MT task, and WER scores for the ASR task, with the initial (mono-task) models, and when multi-task training picks up." 
}, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-117", "text": "----------------------------------" }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-118", "text": "**RESULTS**" }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-119", "text": "models." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-120", "text": "The multi-task model is also pre-trained, but continues training for all tasks, by alternating updates like [20] , with 60% of updates for AST and 20% for ASR and MT." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-121", "text": "Table 3 and 4 present the results for the end-to-end AST task on BTEC and LibriSpeech." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-122", "text": "On both corpora, we show that: (1) it is possible to train compact end-to-end AST models with a performance close to cascaded models; (2) pretraining and multi-task learning 6 improve AST performance;" }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-123", "text": "(3) contrary to [3] , in both BTEC and LibriSpeech settings, best AST performance is observed when a symbolic sequence of symbols in the source language is used as an intermediary representation during the speech translation process (cascaded system); (4) finally, the AST results presented on Lib-riSpeech demonstrate that our augmented corpus is useful, although challenging, to benchmark end-to-end AST systems on real speech at a large scale." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-124", "text": "We hope that our baseline on Augmented LibriSpeech will be challenged in the future." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-125", "text": "The large improvements on MT and AST on the BTEC corpus, compared to [2] are mostly due to our use of a better decoder, which outputs characters instead of words." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-126", "text": "6 BLEU End-to-End Pre-train Multi-task Fig. 
2 : Dev BLEU scores on 3 models for end-to-end AST of audiobooks." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-127", "text": "Best scores on the dev set for the end-to-end (mono-task), pre-train and multi-task models were achieved at steps 369k, 129k and 95k." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-128", "text": "Figure 1 shows the evolution of BLEU and WER scores for MT and ASR tasks with single models, and when we continue training them as part of a multi-task model." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-129", "text": "The multi-task procedure does more updates on AST, which explains the degraded results, but we observe that the speech encoder and text decoder are still able to generalize well to other tasks." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-130", "text": "Figure 2 shows the evolution of dev BLEU scores for our three AST models on LibriSpeech." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-131", "text": "We see that pre-training helps the model converge much faster." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-132", "text": "Eventually, the End-to-End system reaches a similarly good solution, but after three times as many updates." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-133", "text": "Multi-Task training does not seem to be helpful when combined with pre-training." 
}, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-134", "text": "----------------------------------" }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-135", "text": "**ANALYSIS**" }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-136", "text": "----------------------------------" }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-137", "text": "**CONCLUSION**" }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-138", "text": "We present baseline results on End-to-End Automatic Speech Translation on a new speech translation corpus of audiobooks, and on a synthetic corpus extracted from BTEC (follow-up to [2] )." }, { "sent_id": "87a87855d67d90c691ab5bedb4d460-C001-139", "text": "We show that, while cascading two neural models for ASR and MT gives the best results, end-to-end methods that incorporate the source language transcript come close in performance." } ], "y": { "@BACK@": { "gold_contexts": [ [ "87a87855d67d90c691ab5bedb4d460-C001-14" ] ], "cite_sentences": [ "87a87855d67d90c691ab5bedb4d460-C001-14" ] }, "@EXT@": { "gold_contexts": [ [ "87a87855d67d90c691ab5bedb4d460-C001-16" ], [ "87a87855d67d90c691ab5bedb4d460-C001-18" ], [ "87a87855d67d90c691ab5bedb4d460-C001-138" ] ], "cite_sentences": [ "87a87855d67d90c691ab5bedb4d460-C001-16", "87a87855d67d90c691ab5bedb4d460-C001-18", "87a87855d67d90c691ab5bedb4d460-C001-138" ] }, "@USE@": { "gold_contexts": [ [ "87a87855d67d90c691ab5bedb4d460-C001-21" ], [ "87a87855d67d90c691ab5bedb4d460-C001-50" ], [ "87a87855d67d90c691ab5bedb4d460-C001-53" ], [ "87a87855d67d90c691ab5bedb4d460-C001-57" ], [ "87a87855d67d90c691ab5bedb4d460-C001-83" ] ], "cite_sentences": [ "87a87855d67d90c691ab5bedb4d460-C001-21", "87a87855d67d90c691ab5bedb4d460-C001-50", "87a87855d67d90c691ab5bedb4d460-C001-53", "87a87855d67d90c691ab5bedb4d460-C001-57", "87a87855d67d90c691ab5bedb4d460-C001-83" ] }, "@SIM@": { "gold_contexts": [ [ "87a87855d67d90c691ab5bedb4d460-C001-59" ] ], "cite_sentences": [ "87a87855d67d90c691ab5bedb4d460-C001-59" ] }, 
"@DIF@": { "gold_contexts": [ [ "87a87855d67d90c691ab5bedb4d460-C001-115" ], [ "87a87855d67d90c691ab5bedb4d460-C001-125" ] ], "cite_sentences": [ "87a87855d67d90c691ab5bedb4d460-C001-115", "87a87855d67d90c691ab5bedb4d460-C001-125" ] } } }, "ABC_d160d5a44f795f2b694d5ee538d713_4": { "x": [ { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-113", "text": "**DISCUSSION**" }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-135", "text": "**CONCLUSION**" }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-112", "text": "----------------------------------" }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-2", "text": "In this paper, we report recent improvements to the exemplar-based learning approach for word sense disambiguation that have achieved higher disambiguation accuracy." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-3", "text": "By using a larger value of k, the number of nearest neighbors to use for determining the class of a test example, and through 10-fold cross validation to automatically determine the best k, we have obtained improved disambiguation accuracy on a large sense-tagged corpus first used in (Ng and Lee, 1996) ." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-4", "text": "The accuracy achieved by our improved exemplar-based classifier is comparable to the accuracy on the same data set obtained by the Naive-Bayes algorithm, which was reported in (Mooney, 1996) to have the highest disambiguation accuracy among seven state-of-the-art machine learning algorithms." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-5", "text": "----------------------------------" }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-7", "text": "Much recent research on word sense disambiguation (WSD) has adopted a corpus-based, learning approach." 
}, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-8", "text": "Many different learning approaches have been used, including neural networks (Leacock et al., 1993) , probabilistic algorithms (Bruce and Wiebe, 1994; Gale et al., 1992a; Gale et al., 1995; Leacock et al., 1993; Yarowsky, 1992) , decision lists (Yarowsky, 1994) , exemplar-based learning algorithms (Cardie, 1993; Ng and Lee, 1996) , etc." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-9", "text": "In particular, Mooney (1996) evaluated seven state-of-the-art machine learning algorithms on a common data set for disambiguating six senses of the word \"line\"." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-10", "text": "The seven algorithms that he evaluated are: a Naive-Bayes classifier (Duda and Hart, 1973) , a perceptron (Rosenblatt, 1958) , a decisiontree learner (Quinlan, 1993) , a k nearest-neighbor classifier (exemplar-based learner) (Cover and Hart, 1967) , logic-based DNF and CNF learners (Mooney, 1995) , and a decision-list learner (Rivest, 1987) ." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-11", "text": "His results indicate that the simple Naive-Bayes algorithm gives the highest accuracy on the \"line\" corpus tested." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-12", "text": "Past research in machine learning has also reported that the Naive-Bayes algorithm achieved good performance on other machine learning tasks (Clark and Niblett, 1989; Kohavi, 1996) ." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-13", "text": "This is in spite of the conditional independence assumption made by the Naive-Bayes algorithm, which may be unjustified in the domains tested." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-14", "text": "Gale, Church and Yarowsky (Gale et al., 1992a; Gale et al., 1995; Yarowsky, 1992) have also successfully used the Naive-Bayes algorithm (and several extensions and variations) for word sense disambiguation." 
}, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-15", "text": "On the other hand, our past work on WSD (Ng and Lee, 1996) used an exemplar-based (or nearest neighbor) learning approach." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-16", "text": "Our WSD program, LEXAS, extracts a set of features, including part of speech and morphological form, surrounding words, local collocations, and verb-object syntactic relation from a sentence containing the word to be disambiguated." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-17", "text": "These features from a sentence form an example." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-18", "text": "LEXAS then uses the exemplar-based learning algorithm PEBLS (Cost and Salzberg, 1993) to find the sense (class) of the word to be disambiguated." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-19", "text": "In this paper, we report recent improvements to the exemplar-based learning approach for WSD that have achieved higher disambiguation accuracy." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-20", "text": "The exemplar-based learning algorithm PEBLS contains a number of parameters that must be set before running the algorithm." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-21", "text": "These parameters include the number of nearest neighbors to use for determining the class of a test example (i.e., k in a k nearest-neighbor classifier), exemplar weights, feature weights, etc." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-22", "text": "We found that the number k of nearest neighbors used has a considerable impact on the accuracy of the induced exemplar-based classifier." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-23", "text": "By using 10-fold cross validation (Kohavi and John, 1995) on the training set to automatically determine the best k to use, we have obtained improved disambiguation accuracy on a large sensetagged corpus first used in (Ng and Lee, 1996) ." 
}, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-24", "text": "The accuracy achieved by our improved exemplar-based classifier is comparable to the accuracy on the same data set obtained by the Naive-Bayes algorithm, which was reported in (Mooney, 1996) to have the highest disambiguation accuracy among seven stateof-the-art machine learning algorithms." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-25", "text": "The rest of this paper is organized as follows." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-26", "text": "Section 2 gives a brief description of the exemplar-based algorithm PEBLS and the Naive-Bayes algorithm." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-27", "text": "Section 3 describes the 10-fold cross validation training procedure to determine the best k number of nearest neighbors to use." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-28", "text": "Section 4 presents the disambiguation accuracy of PEBLS and Naive-Bayes on the large corpus of (Ng and Lee, 1996) ." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-29", "text": "Section 5 discusses the implications of the results." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-30", "text": "Section 6 gives the conclusion." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-31", "text": "----------------------------------" }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-32", "text": "**LEARNING ALGORITHMS**" }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-33", "text": "----------------------------------" }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-34", "text": "**PEBLS**" }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-35", "text": "The heart of exemplar-based learning is a measure of the similarity, or distance, between two examples." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-36", "text": "If the distance between two examples is small, then the two examples are similar." 
}, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-37", "text": "In PEBLS (Cost and Salzberg, 1993) , the distance between two symbolic values vl and v2 of a feature f is defined as: Vl for feature f in any class." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-38", "text": "P(Ci]v2) is estimated similarly." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-39", "text": "This distance metric of PEBLS is adapted from the value difference metric of the earlier work of (Stanfill and Waltz, 1986 Note that the nearest neighbor algorithm tested in (Mooney, 1996) uses Hamming distance as the distance metric between two symbolic feature values." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-40", "text": "This is different from the above distance metric used in PEBLS." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-41", "text": "----------------------------------" }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-42", "text": "**NAIVE-BAYES**" }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-43", "text": "Our presentation of the Naive-Bayes algorithm (Duda and Hart, 1973) follows that of (Clark and Niblett, 1989) ." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-44", "text": "This algorithm is based on Bayes' theorem:" }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-45", "text": "is the probability that a test example is of class Ci given feature values vj." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-46", "text": "(Avj denotes the conjunction of all feature values in the test example.) The goal of a Naive-Bayes classifier is to determine the class Ci with the highest conditional probability P(Ci] A vj)." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-47", "text": "Since the denominator P(Avj) of the above expression is constant for all classes Ci, the problem reduces to finding the class Ci with the maximum value for the numerator." 
}, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-48", "text": "(Gale et al., 1992a) ) are also possible, although we have not experimented with these other variations." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-49", "text": "For the experimental results reported in this paper, we used the implementation of Naive-Bayes algorithm in the PEBLS program (Rachlin and Salzberg, 1993) , which has an option for training and testing using the Naive-Bayes algorithm." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-50", "text": "We only changed the handling of zero probability counts to the method just described." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-51", "text": "3 Improvements to Exemplar-Based WSD PEBLS contains a number of parameters that must be set before running the algorithm." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-52", "text": "These parameters include k (the number of nearest neighbors to use for determining the class of a test example), exemplar weights, feature weights, etc." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-53", "text": "Each of these parameters has a default value in PEBLS, eg., k = 1, no exemplar weighting, no feature weighting, etc." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-54", "text": "We have used the default values for all parameter settings in our previous work on exemplar-based WSD reported in (Ng and Lee, 1996) ." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-55", "text": "However, our preliminary investigation indicates that, among the various learning parameters of PEBLS, the number k of nearest neighbors used has a considerable impact on the accuracy of the induced exemplar-based classifier." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-56", "text": "Cross validation is a well-known technique that can be used for estimating the expected error rate of a classifier which has been trained on a particular data set." 
}, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-57", "text": "For instance, the C4.5 program (Quinlan, 1993) contains an option for running cross validation to estimate the expected error rate of an induced rule set." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-58", "text": "Cross validation has been proposed as a general technique to automatically determine the parameter settings of a given learning algorithm using a particular data set as training data (Kohavi and John, 1995) ." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-59", "text": "In m-fold cross validation, a training data set is partitioned into m (approximately) equal-sized blocks, and the learning algorithm is run m times." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-60", "text": "In each run, one of the m blocks of training data is set aside as test data (the holdout set) and the remaining m-1 blocks are used as training data." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-61", "text": "The average error rate of the m runs is a good estimate of the error rate of the induced classifier." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-62", "text": "For a particular parameter setting, we can run m-fold cross validation to determine the expected error rate of that particular parameter setting." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-63", "text": "We can then choose an optimal parameter setting that minimizes the expected error rate." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-64", "text": "Kohavi and John (1995) reported the effectiveness of such a technique in obtaining optimal sets of parameter settings over a large number of machine learning problems." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-65", "text": "In our present study, we used 10-fold cross validation to automatically determine the best k (number of nearest neighbors) to use from the training data." 
}, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-66", "text": "To determine the best k for disambiguating a word on a particular training set, we run 10-fold cross validation using PEBLS 21 times, each time with k = 1,5, 10, 15,..., 85, 90, 95,100." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-67", "text": "We compute the error rate for each k, and choose the value of k with the minimum error rate." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-68", "text": "Note that the automatic determination of the best k through 10-fold cross validation makes use of only the training set, without looking at the test set at all." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-69", "text": "----------------------------------" }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-70", "text": "**4**" }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-71", "text": "Experimental Results Mooney (1996) has reported that the Naive-Bayes algorithm gives the best performance on disambiguating six senses of the word \"line\", among seven state-of-the-art learning algorithms tested." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-72", "text": "However, his comparative study is done on only one word using a data set of 2,094 examples." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-73", "text": "In our present study, we evaluated PEBLS and Naive-Bayes on a much larger corpus containing sense-tagged occurrences of 121 nouns and 70 verbs." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-74", "text": "This corpus was first reported in (Ng and Lee, 1996) , and it contains about 192,800 sense-tagged word occurrences of 191 most frequently occurring and ambiguous words of English." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-75", "text": "1 These 191 words have been tagged with senses from WOI:tDNET (Miller, 1990) , an on-line, electronic dictionary available publicly." 
}, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-76", "text": "For this set of 191 words, the average number of senses per noun is 7.8, while the average number of senses per verb is 12.0." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-77", "text": "The sentences in this corpus were drawn from the combined corpus of the i million word Brown corpus and the 2.5 million word Wall Street Journal (WSJ) corpus." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-78", "text": "We tested both algorithms on two test sets from this corpus." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-79", "text": "The first test set, named BC50, consists of 7,119 occurrences of the 191 words appearing in 50 text files of the Brown corpus." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-80", "text": "The second test set, named WSJ6, consists of 14,139 occurrences of the 191 words appearing in 6 text files of the WSJ corpus." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-81", "text": "Both test sets are identical to the ones reported in (Ng and Lee, 1996) ." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-82", "text": "Since the primary aim of our present study is the comparative evaluation of learning algorithms, not feature representation, we have chosen, for simplicity, to use local collocations as the only features in the example representation." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-83", "text": "Local collocations have been found to be the single most informative set of features for WSD (Ng and Lee, 1996) ." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-84", "text": "That local collocation knowledge provides important clues to WSD has also been pointed out previously by Yarowsky (1993) ." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-85", "text": "Let w be the word to be disambiguated, and let 12 ll w rl r2 be the sentence fragment containing w. 
In the present study, we used seven features in the representation of an example, which are the local collocations of the surrounding 4 words." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-86", "text": "These seven features are: 12-11, ll-rl, rl-r2, ll, rl, 12 , and r2." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-87", "text": "The first three features are concatenation of two words." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-88", "text": "2" }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-89", "text": "The experimental results obtained are tabulated in Table 1 ." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-90", "text": "The first three rows of accuracy fig-1This corpus is available from the Linguistic Data Consortium (LDC)." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-91", "text": "Contact the LDC at ldc@unagi.cis.upenn.edu for details." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-92", "text": "2The first five of these seven features were also used in (Ng and Lee, 1996) . , 1996) ." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-93", "text": "The default strategy of picking the most frequent sense has been advocgted as the baseline performance for evaluating WSD programs (Gale et al., 1992b; Miller et al., 1994) ." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-94", "text": "There are two instantiations of this strategy in our current evaluation." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-95", "text": "Since WORDNET orders its senses such that sense 1 is the most frequent sense, one possibility is to always pick sense 1 as the best sense assignment." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-96", "text": "This assignment method does not even need to look at the training examples." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-97", "text": "We call this method \"Sense 1\" in Table 1 ." 
}, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-98", "text": "Another assignment method is to determine the most frequently occurring sense in the training examples, and to assign this sense to all test examples." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-99", "text": "We call this method \"Most Frequent\" in Table 1 ." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-100", "text": "The accuracy figures of LEXAS as reported in (Ng and Lee, 1996) are reproduced in the third row of Table 1 ." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-101", "text": "These figures were obtained using all features including part of speech and morphological form, surrounding words, local collocations, and verb-object syntactic relation." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-102", "text": "However, the feature value pruning method of (Ng and Lee, 1996) only selects surrounding words and local collocations as feature values if they are indicative of some sense class as measured by conditional probability (See (Ng and Lee, 1996) for details)." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-103", "text": "The next three rows show the accuracy figures of PEBLS using the parameter setting of k = 1, k = 20, and 10-fold cross validation for finding the best k, respectively." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-104", "text": "The last row shows the accuracy figures of the Naive-Bayes algorithm." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-105", "text": "Accuracy figures of the last four rows are all based on only seven collocation features as described earlier in this section." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-106", "text": "However, all possible feature values (collocated words) are used, without employing the feature value pruning method used in (Ng and Lee, 1996) ." 
}, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-107", "text": "Note that the accuracy figures of PEBLS with k = 1 are 1.0% and 1.6% higher than the accuracy figures of (Ng and Lee, 1996) in the third row, also with k = 1." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-108", "text": "The feature value pruning method of (Ng and Lee, 1996) is intended to keep only feature values deemed important for classification." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-109", "text": "It seems that the pruning method has filtered out some usefifl collocation values that improve classification accuracy, such that this unfavorable effect outweighs the additional set of features (part of speech and morphological form, surrounding words, and verb-object syntactic relation) used." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-110", "text": "Our results indicate that although Naive-Bayes performs better than PEBLS with k = 1, PEBLS with k = 20 achieves comparable performance." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-111", "text": "Furthermore, PEBLS with 10-fold cross validation to select the best k yields results slightly better than the Naive-Bayes algorithm." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-114", "text": "To understand why larger values of k are needed, we examined the performance of PEBLS when tested on the WSJ6 test set." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-115", "text": "During 10-fold cross validation runs on the training set, for each of the 191 words, we compared two error rates: the minimum expected error rate of PEBLS using the best k, and the expected error rate of the most frequent classifter." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-116", "text": "We found that for 13 words out of the 191 words, the minimum expected error rate of PEBLS using the best k is still higher than the expected error rate of the most frequent classifier." 
}, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-117", "text": "That is, for these 13 words, PEBLS will produce, on average, lower accuracy than the most frequent classifier." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-118", "text": "Importantly, for 11 of these 13 words, the best k found by PEBLS are at least 85 and above." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-119", "text": "This indicates that for a training data set when PEBLS has trouble even outperforming the most frequent classifter, it will tend to use a large value for k. This is explainable since for a large value of k, PEBLS will tend towards the performance of the most frequent classifier, as it will find the k closest matching training examples and select the majority class among this large number of k examples." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-120", "text": "Note that in the extreme case when k equals the size of the training set, PEBLS will behave exactly like the most frequent classifier." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-121", "text": "Our results indicate that although PEBLS with k = 1 gives lower accuracy compared with Naive-Bayes, PEBLS with k = 20 performs as well as Naive-Bayes." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-122", "text": "Furthermore, PEBLS with automatically selected k using 10-fold cross validation gives slightly higher performance compared with Naive-Bayes." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-123", "text": "We believe that this result is significant, in light of the fact that Naive-Bayes has been found to give the best performance for WSD among seven state-ofthe-art machine learning algorithms (Mooney, 1996) ." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-124", "text": "It demonstrates that an exemplar-based learning approach is suitable for the WSD task, achieving high disambiguation accuracy." 
}, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-125", "text": "One potential drawback of an exemplar-based learning approach is the testing time required, since each test example must be compared with every training example, and hence the required testing time grows linearly with the size of the training set." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-126", "text": "However, more sophisticated indexing methods such as that reported in (Friedman et al., 1977) can reduce this to logarithmic expected time, which will significantly reduce testing time." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-127", "text": "In the present study, we have focused on the comparison of learning algorithms, but not on feature representation of examples." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-128", "text": "Our past work (Ng and Lee, 1996) suggests that multiple sources of knowledge are indeed useful for WSD." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-129", "text": "Future work will explore the addition of these other features to further improve disambiguation accuracy." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-130", "text": "Besides the parameter k, PEBLS also contains other learning parameters such as exemplar weights and feature weights." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-131", "text": "Exemplar weighting has been found to improve classification performance (Cost and Saizberg, 1993) ." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-132", "text": "Also, given the relative importance of the various knowledge sources as reported in (Ng and Lee, 1996) , it may be possible to improve disambignation performance by introducing feature weighting." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-133", "text": "Future work can explore the effect of exemplar weighting and feature weighting on disambiguation accuracy." 
}, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-134", "text": "----------------------------------" }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-136", "text": "In summary, we have presented improvements to the exemplar-based learning approach for WSD." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-137", "text": "By using a larger value of k, the number of nearest neighbors to use for determining the class of a test example, and through 10-fold cross validation to automatically determine the best k, we have obtained improved disambignation accuracy on a large sensetagged corpus." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-138", "text": "The accuracy achieved by our improved exemplar-based classifier is comparable to the accuracy on the same data set obtained by the Naive-Bayes algorithm, which was recently reported to have the highest disambignation accuracy among seven state-of-the-art machine learning algorithms." }, { "sent_id": "d160d5a44f795f2b694d5ee538d713-C001-139", "text": "7" } ], "y": { "@DIF@": { "gold_contexts": [ [ "d160d5a44f795f2b694d5ee538d713-C001-3" ], [ "d160d5a44f795f2b694d5ee538d713-C001-23" ], [ "d160d5a44f795f2b694d5ee538d713-C001-106" ], [ "d160d5a44f795f2b694d5ee538d713-C001-107" ] ], "cite_sentences": [ "d160d5a44f795f2b694d5ee538d713-C001-3", "d160d5a44f795f2b694d5ee538d713-C001-23", "d160d5a44f795f2b694d5ee538d713-C001-106", "d160d5a44f795f2b694d5ee538d713-C001-107" ] }, "@BACK@": { "gold_contexts": [ [ "d160d5a44f795f2b694d5ee538d713-C001-8" ], [ "d160d5a44f795f2b694d5ee538d713-C001-15" ], [ "d160d5a44f795f2b694d5ee538d713-C001-74" ], [ "d160d5a44f795f2b694d5ee538d713-C001-83" ], [ "d160d5a44f795f2b694d5ee538d713-C001-102" ], [ "d160d5a44f795f2b694d5ee538d713-C001-108" ], [ "d160d5a44f795f2b694d5ee538d713-C001-128" ] ], "cite_sentences": [ "d160d5a44f795f2b694d5ee538d713-C001-8", "d160d5a44f795f2b694d5ee538d713-C001-15", "d160d5a44f795f2b694d5ee538d713-C001-74", "d160d5a44f795f2b694d5ee538d713-C001-83", 
"d160d5a44f795f2b694d5ee538d713-C001-102", "d160d5a44f795f2b694d5ee538d713-C001-108", "d160d5a44f795f2b694d5ee538d713-C001-128" ] }, "@USE@": { "gold_contexts": [ [ "d160d5a44f795f2b694d5ee538d713-C001-28" ], [ "d160d5a44f795f2b694d5ee538d713-C001-54" ], [ "d160d5a44f795f2b694d5ee538d713-C001-73", "d160d5a44f795f2b694d5ee538d713-C001-74" ], [ "d160d5a44f795f2b694d5ee538d713-C001-100" ] ], "cite_sentences": [ "d160d5a44f795f2b694d5ee538d713-C001-28", "d160d5a44f795f2b694d5ee538d713-C001-54", "d160d5a44f795f2b694d5ee538d713-C001-74", "d160d5a44f795f2b694d5ee538d713-C001-100" ] }, "@SIM@": { "gold_contexts": [ [ "d160d5a44f795f2b694d5ee538d713-C001-81" ], [ "d160d5a44f795f2b694d5ee538d713-C001-92" ] ], "cite_sentences": [ "d160d5a44f795f2b694d5ee538d713-C001-81", "d160d5a44f795f2b694d5ee538d713-C001-92" ] }, "@FUT@": { "gold_contexts": [ [ "d160d5a44f795f2b694d5ee538d713-C001-132" ] ], "cite_sentences": [ "d160d5a44f795f2b694d5ee538d713-C001-132" ] } } }, "ABC_caa0ffb1d4e3e5310a28b921333d1e_4": { "x": [ { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-2", "text": "Bidirectional Encoder Representations from Transformers (BERT) has recently achieved state-of-the-art performance on a broad range of NLP tasks including sentence classification, machine translation, and question answering." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-3", "text": "The BERT model architecture is derived primarily from the transformer." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-4", "text": "Prior to the transformer era, bidirectional Long Short-Term Memory (BLSTM) has been the dominant modeling architecture for neural machine translation and question answering." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-5", "text": "In this paper, we investigate how these two modeling techniques can be combined to create a more powerful model architecture." 
}, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-6", "text": "We propose a new architecture denoted as Transformer with BLSTM (TRANS-BLSTM) which has a BLSTM layer integrated to each transformer block, leading to a joint modeling framework for transformer and BLSTM." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-7", "text": "We show that TRANS-BLSTM models consistently lead to improvements in accuracy compared to BERT baselines in GLUE and SQuAD 1.1 experiments." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-8", "text": "Our TRANS-BLSTM model obtains an F1 score of 94.01% on the SQuAD 1.1 development dataset, which is comparable to the state-of-the-art result." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-9", "text": "----------------------------------" }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-11", "text": "Learning representations (Mikolov et al., 2013) of natural language and language model pre-training (Devlin et al., 2018; Radford et al., 2019) has shown promising results recently." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-12", "text": "These pretrained models serve as generic up-stream models and they can be used to improve down-stream applications such as natural language inference, paraphrasing, named entity recognition, and question answering." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-13", "text": "The innovation of BERT (Devlin et al., 2018) comes from the \"masked language model\" with a pre-training objective, inspired by the Cloze task (Taylor, 1953) ." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-14", "text": "The masked language model randomly masks some of the tokens from the input, and the objective is to predict the original token based only on its context." 
}, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-15", "text": "Follow-up work including RoBERTa (Liu et al., 2019b) investigated hyper-parameter design choices and suggested longer model training time." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-16", "text": "In addition, XLNet (Yang et al., 2019) has been proposed to address the BERT pre-training and fine-tuning discrepancy where masked tokens were found in the former but not in the latter." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-17", "text": "Nearly all existing work suggests that a large network is crucial to achieve the state-of-the-art performance." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-18", "text": "For example, (Devlin et al., 2018) has shown that across natural language understanding tasks, using larger hidden layer size, more hidden layers, and more attention heads always leads to better performance." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-19", "text": "However, they stop at a hidden layer size of 1024." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-20", "text": "ALBERT (Lan et al., 2019) showed that it is not the case that simply increasing the model size would lead to better accuracy." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-21", "text": "In fact, they observed that simply increasing the hidden layer size of a model such as BERT-large can lead to significantly worse performance." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-22", "text": "On the other hand, model distillation (Hinton et al., 2015; Tang et al., 2019; Sun et al., 2019; Sanh et al., 2019) has been proposed to reduce the BERT model size while maintaining high performance." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-23", "text": "In this paper, we attempt to improve the performance of BERT via architecture enhancement." 
}, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-24", "text": "BERT is based on the encoder of the transformer model (Vaswani et al., 2017) , which has been proven to obtain state-of-the-art accuracy across a broad range of NLP applications (Devlin et al., 2018) ." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-25", "text": "Prior to BERT, bidirectional LSTM (BLSTM) has dominated sequential modeling for many tasks including machine translation (Chiu and Nichols, 2016) and speech recognition (Graves et al., 2013) ." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-26", "text": "Given both models have demonstrated superior accuracy on various benchmarks, it is natural to raise the question whether a combination of the transformer and BLSTM can outperform each individual architecture." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-27", "text": "In this paper, we attempt to answer this question by proposing a transformer BLSTM joint modeling framework." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-28", "text": "Our major contribution in this paper is two fold: 1) We propose the TRANS-BLSTM model architectures, which combine the transformer and BLSTM into one single modeling framework, leveraging the modeling capability from both the transformer and BLSTM." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-29", "text": "2) We show that the TRANS-BLSTM models can effectively boost the accuracy of BERT baseline models on SQuAD 1.1 and GLUE NLP benchmark datasets." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-30", "text": "2 Related work 2.1 BERT Our work focuses on improving the transformer architecture (Vaswani et al., 2017) , which motivated the recent breakthrough in language representation, BERT (Devlin et al., 2018) ." 
}, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-31", "text": "Our work builds on top of the transformer architecture, integrating each transformer block with a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) ." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-55", "text": "The BERT model consists of a transformer encoder (Vaswani et al., 2017) as shown in Figure 1 ." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-32", "text": "Related to our work, XLNet (Yang et al., 2019) proposes twostream self-attention as opposed to single-stream self-attention used in classic transformers." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-33", "text": "With two-stream attention, XLNet can be treated as a general language model that does not suffer from the pretrain-finetune discrepancy (the mask tokens are seen during pretraining but not during finetuning) thanks to its autoregressive formulation." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-34", "text": "Our method overcomes this limitation with a different approach, using single-stream self-attention with an integrated BLSTM layer for each transformer layer." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-35", "text": "----------------------------------" }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-36", "text": "**BIDIRECTIONAL LSTM**" }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-37", "text": "The LSTM network (Hochreiter and Schmidhuber, 1997) has demonstrated powerful modeling capability in sequential learning tasks including named entity tagging (Huang et al., 2015; Chiu and Nichols, 2016) , machine translation (Bahdanau et al., 2015; Wu et al., 2016) and speech recognition (Graves et al., 2013; Sak et al., 2014) ." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-38", "text": "The motivation of this paper is to integrate bidirectional LSTM layers to the transformer model to further improve transformer performance." 
}, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-39", "text": "The work of (Tang et al., 2019) attempts to distill a BERT model to a single-layer bidirectional LSTM model." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-40", "text": "It is relevant to our work as both utilizing bidirectional LSTM." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-41", "text": "However, their work leads to inferior accuracy compared to BERT baseline models." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-42", "text": "Similar to their observation, we show that in our experiments, the use of BLSTM model alone (even with multiple stacked BLSTM layers) leads to significantly worse results compared to BERT models." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-43", "text": "However, our proposed joint modeling framework, TRANS-BLSTM, is able to boost the accuracy of the transformer BERT models." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-44", "text": "----------------------------------" }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-45", "text": "**COMBINE RECURRENT NETWORK AND TRANSFORMER**" }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-46", "text": "Previous work has explored the combination of the recurrent network and transformer." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-47", "text": "For example, (Lei et al., 2018) has substituted the feedforward network in transformer with the simple recurrent unit (SRU) implementation and achieved better accuracy in machine translation." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-48", "text": "It is similar to one of the proposed models in this paper." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-49", "text": "However, the difference is that our paper investigates the gain of the combination in BERT pre-training context, while their paper focused on the parallelization speedup of SRU in machine translation encoder and decoder context." 
}, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-50", "text": "----------------------------------" }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-51", "text": "**TRANS AND PROPOSED TRANS-BLSTM ARCHITECTURES**" }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-52", "text": "In this section, we first review the transformer architecture, then propose the transformer bidirectional LSTM network architectures (TRANS-BLSTM), which integrates the BLSTM to either the transformer encoder or decoder." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-53", "text": "----------------------------------" }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-54", "text": "**TRANSFORMER ARCHITECTURE (TRANS)**" }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-56", "text": "The original transformer architecture uses multiple stacked self-attention layers and point-wise fully connected layers for both the encoder and decoder." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-57", "text": "However, BERT only leverages the encoder to generate hidden value representation and the original transformer decoder (for generating text in neural machine translation etc.) is replaced by a linear layer followed by a softmax layer, shown in Figure 1 , both for sequential classification tasks (named entity tagging, question answering) and sentence classification tasks (sentiment classification etc.)." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-58", "text": "The encoder is composed of a stack of N = 12 or N = 24 layers for the BERT-base and -large cases respectively." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-59", "text": "Each layer consists of two sub-layers." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-60", "text": "The first sub-layer is a multi-head selfattention mechanism, and the second sub-layer is a simple, position-wise fully connected feed-forward network." 
}, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-61", "text": "(Vaswani et al., 2017 ) employs a residual connection (He et al., 2016) around each of the two sub-layers, followed by layer normalization (Ba et al., 2016) ." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-62", "text": "That is, the output of each sublayer is LayerN orm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-63", "text": "To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension of 768 and 1024 for BERT-base and BERT-large, respectively." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-64", "text": "We used the same multi-head selfattention from the original paper (Vaswani et al., 2017) ." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-65", "text": "We used the same input and output representations, i.e., the embedding and positional encoding, and the same loss objective, i.e., masked LM prediction and next sentence prediction, from the BERT paper (Devlin et al., 2018) ." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-66", "text": "----------------------------------" }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-67", "text": "**PROPOSED TRANSFORMER BIDIRECTIONAL LSTM (TRANS-BLSTM) ARCHITECTURES**" }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-68", "text": "Previous experiments indicated that a bidirectional LSTM model alone may not perform on par with a transformer." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-69", "text": "For example, the distillation from a transformer model to a single-layer bidirectional LSTM model (Tang et al., 2019) resulted in significantly lower accuracy." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-70", "text": "We also confirmed this on our experiments in Section 4.3." 
}, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-71", "text": "In this paper, we hypothesize that the transformer and bidirectional LSTM may be complementary in sequence modeling." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-72", "text": "We are motivated to investigate how a bidirectional LSTM can further improve accuracy in down-stream tasks relative to a classic transformer model." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-73", "text": "Figure 2 shows the two proposed Transformer with Bidirectional LSTM architectures (denoted as the TRANS-BLSTM-1 and TRANS-BLSTM-2) models respectively:" }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-74", "text": "TRANS-BLSTM-1 For each BERT layer, we replace the feedforward layer with a bidirectional LSTM layer." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-75", "text": "TRANS-BLSTM-2 We add a bidirectional LSTM layer which takes the same input as the original BERT layer." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-76", "text": "The output of the bidirectional LSTM layer is summed up with the original BERT layer output (before the Layer-Norm)." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-121", "text": "However, in the whole word masking setting, we are able to fairly compare the proposed TRANS-BLSTM models to the original BERT models." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-77", "text": "The motivation of adding BLSTM is to integrate the self-attention and bidirectional LSTM to produce a better joint model framework (as we will see in the experiments later)." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-78", "text": "We found that these two architectures lead to similar accuracy in our experiments (see Section 4.5)." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-79", "text": "We thus focus on the latter (TRANS-BLSTM-2) and refer to this model as TRANS-BLSTM henceforth for simplicity." 
}, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-80", "text": "For both architectures, if we use the same number of BLSTM hidden units as in the BERT model H, we obtain the BLSTM output with dimension of 2H, and we therefore need a linear layer to project the output of the BLSTM (with dimensionality 2H) to H in order to match the transformer output." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-81", "text": "Alternatively, if we set the number of BLSTM hidden units to H/2 (we denote this model as TRANS-BLSTM-SMALL), we need not include an additional projection layer." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-82", "text": "----------------------------------" }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-83", "text": "**ADDING BIDIRECTIONAL LSTM TO TRANSFORMER DECODER**" }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-84", "text": "While the above method adds bidirectional LSTM layers to a transformer encoder, we can in addition replace the linear layer with bidirectional LSTM layers in decoder." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-85", "text": "The number of bidirectional LSTM layers is a hyper parameter to tune; we use 2 in this paper." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-86", "text": "While the bidirectional LSTM layers in encoder help the pre-training task for the masked language model and next sentence prediction task, the bidirectional LSTM in decoder may help in downstream sequential prediction tasks such as question answering." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-87", "text": "----------------------------------" }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-88", "text": "**OBJECTIVE FUNCTIONS**" }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-89", "text": "Following the BERT (Devlin et al., 2018) , we use masked language model loss and next sentence prediction (NSP) loss to train the models." 
}, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-90", "text": "The masked LM (MLM) is often referred to as a Cloze task in the literature (Taylor, 1953) ." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-91", "text": "The encoder output, corresponding to the mask tokens, are fed into an output softmax over the vocabulary." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-92", "text": "In our experiments, we randomly mask 15% of all whole word wordpiece tokens in each sequence (Wu et al., 2016) ." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-93", "text": "We also use the next sentence prediction loss as introduced in (Devlin et al., 2018) to train our models." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-94", "text": "Specifically, when choosing the sentences A and B for each pre-training example, 50% of the time B is the actual next sentence that follows A, and 50% of the time it is a random sentence from the corpus." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-95", "text": "We note that recent work (Yang et al., 2019; Liu et al., 2019a; Lan et al., 2019; Raffel et al., 2019) has argued that the NSP loss may not be useful in improving model accuracy." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-96", "text": "Nevertheless, we used the NSP loss in our experiments to have a fair comparison between the proposed models and the original BERT models." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-97", "text": "Table 1 shows the model parameter size and training speedup for TRANS/BERT (TRANS and BERT are exchangeable in this paper), TRANS-BLSTM-SMALL, and TRANS-BLSTM respectively." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-98", "text": "Here, the TRANS-BLSTM-SMALL and TRANS-BLSTM models are 50% and 100% larger than the TRANS model (base, large) respec-tively." 
}, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-99", "text": "Consequently, TRANS-BLSTM-SMALL and TRANS-BLSTM models require more computational resources and longer training times compared to the vanilla transformer model." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-100", "text": "The slowest-training model is the TRANS-BLSTM which is also our baseline." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-101", "text": "Models with fewer parameters can train faster." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-102", "text": "For example, the large TRANS/BERT model boasts a 2.8 fold speedup compared to the TRANS-BLSTM large model." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-103", "text": "We note that the focus of the paper is to investigate whether a joint transformer and BLSTM architecture can further improve the performance over a transformer baseline." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-104", "text": "This is important to keep in mind because simply increasing the number of hidden units in BERT-large is not enough to positively affect accuracy (Lan et al., 2019) (also see Section 4.6)." 
}, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-105", "text": "----------------------------------" }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-106", "text": "**MODEL PARAMETERS**" }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-107", "text": "----------------------------------" }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-108", "text": "**EXPERIMENTS**" }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-109", "text": "----------------------------------" }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-110", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-111", "text": "We use the same large-scale data which has been used for BERT model pre-training, the BooksCorpus (800M words) (Zhu et al., 2015) and English Wikipedia (2.5B words) (Wikipedia contributors, 2004; Devlin et al., 2018) ." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-112", "text": "The two corpora consist of about 16GB of text." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-113", "text": "Following the original BERT setup (Devlin et al., 2018) , we format the inputs as \"[CLS] TRANS/BERT 108M 12 768 768 12 6.0X Base TRANS-BLSTM-SMALL 152M 12 768 768 12 3.3X TRANS-BLSTM 237M 12 768 768 12 2.5X Large TRANS/BERT 334M 24 1024 1024 16 2.8X TRANS-BLSTM-SMALL 487M 24 1024 1024 16 1.4X TRANS-BLSTM 789M 24 1024 1024 16 1 Table 1 : Parameter size and training speed for TRANS/BERT, TRANS-BLSTM-SMALL, and TRANS-BLSTM on base and large settings respectively." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-114", "text": "x 1 = x 11 , x 12 . . . and x 2 = x 21 , x 22 . . . are two segments." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-115", "text": "To reduce the training memory consumption, we set the maximum input length to 256 (as opposed to 512 in the original BERT paper)." 
}, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-116", "text": "We note that this setting may adversely affect the best accuracy we report in our paper 1 , but the relative accuracy gain by the proposed models are still valid." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-117", "text": "Similar to BERT, we use a vocabulary size of 30k with wordpiece tokenization." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-118", "text": "We generate the masked input from the MLM targets using unigram masking, which is denoted as whole word masking." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-119", "text": "That is, each masking applies to a whole word at one time." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-120", "text": "We note that using n-gram masking (for example, with n = 3) (Joshi et al., 2019; Lan et al., 2019) with the length of each n-gram mask selected randomly can further improve the downstream task accuracy (for example, 2% F1 score increase was observed on SQuAD 1.1 data set with n-gram masking and span boundary representation prediction (Joshi et al., 2019) )." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-122", "text": "Similar to (Devlin et al., 2018) , the training data generator chooses 15% of the token positions at random for making." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-123", "text": "If the i-th token is chosen, we replace the i-th token with (1) the [MASK] token 80% of the time (2) a random token 10% of the time (3) the unchanged i-th token 10% of the time." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-124", "text": "The model updates use a batch size of 256 and Adam optimizer with learning rate starting from 1e-4." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-125", "text": "Training was done on a cluster of nodes, where each node consists of 8 Nvidia Tesla V100 GPUs." 
}, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-126", "text": "We vary the node size from 1 to 8 depending on the model size." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-127", "text": "Our TRANS-BLSTM is implemented on top of Pytorch transformer repository 2 ." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-128", "text": "1 Nevertheless, our implementation of baseline BERT model obtained higher accuracy than that reported by the original BERT paper (Devlin et al., 2018) ." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-129", "text": "2 https://github.com/huggingface/pytorch-transformers." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-130", "text": "----------------------------------" }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-131", "text": "**DOWNSTREAM EVALUATION DATASETS**" }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-132", "text": "Following the previous work (Devlin et al., 2018; Yang et al., 2019; Liu et al., 2019a; Lan et al., 2019) , we evaluate our models on the General Language Understanding Evaluation (GLUE) benchmark and the Stanford Question Answering Dataset (SQuAD 1.1) (Rajpurkar et al., 2016) ." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-133", "text": "GLUE is the General Language Understanding Evaluation benchmark consisting of a diverse collection of natural language understanding tasks." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-134", "text": "GLUE is model-agnostic and the tasks are selected to incentivize the development of general and robust NLU systems." 
}, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-135", "text": "The tasks included in GLUE are (1) Multi-Genre Natural Language Inference (MNLI) for sentence entailment classification, (2) Quora Question Pairs (QQP) for semantic equivalence classification, (3) Question Natural Language Inference (QNLI) for predicting whether the sentence in a query-sentence pair contains a correct answer, (4) Stanford Sentiment Treebank (SST-2) for sentiment analysis of movie reviews, (5) Corpus of Linguistic Acceptability (CoLA) for determining whether an English sentence is linguistically acceptable, (6) Semantic Textual Similarity (STS-B)." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-136", "text": "The Stanford Question Answering Dataset (SQuAD) is a corpus consisting of 100k question/answer pairs sourced from Wikipedia." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-137", "text": "----------------------------------" }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-138", "text": "**BIDIRECTIONAL LSTM MODEL ON SQUAD DATASET**" }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-139", "text": "For the down-stream fine-tuning experiments on SQuAD 1.1 dataset, we have the following hyperparameters for training." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-140", "text": "We set the learning rate to be 3e-5, training batch size to be 12, and the number of training epochs to be 2." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-141", "text": "We first run the experiment by replacing the transformer in BERT base with a bidirectional LSTM model with the same number of layers." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-142", "text": "That is, we replace the 12 transformer layers with 12 BLSTM layers." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-167", "text": "As can been see, the TRANS-BLSTM-SMALL can boost the baseline BERT model from F1 score of 90.05% to 90.76%, and from 92.34% to 92.86% on base and large cases respectively." 
}, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-143", "text": "Table 2 shows the BERT base models, including the original BERT-base model in (Devlin et al., 2018) and our implementation, and the bidirectional LSTM model accuracy over SQuAD 1.1 development dataset." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-144", "text": "Our implementation results in a higher F1 score (90.05%) compared to the original BERT-base one (88.50%)." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-145", "text": "This may be due to the fact that we use the whole word masking while BERT-base used partial word masking (an easier task, which may prevent from learning a better model)." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-146", "text": "We found that the BLSTM model has F1 score of 83.43%, which is significantly worse than our TRANS/BERT baseline (90.05%)." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-147", "text": "----------------------------------" }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-148", "text": "**MODEL**" }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-149", "text": "EM F1 BERT-base (Devlin et al., 2018)" }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-150", "text": "----------------------------------" }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-151", "text": "**MODELS PRE-TRAINING**" }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-152", "text": "We run three pre-training experiments for base and large settings respectively." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-153", "text": "1) BERT model training baseline (denoted as TRANS/BERT representing a transformer model or BERT), 2) TRANS-BLSTM-SMALL, with BLSTM having half of the hidden units of the transformer (768/2 = 384 on BERT base and 1024/2 = 512 on BERT large) for BLSTM, and 3) TRANS-BLSTM, with BLSTM having the same hidden units as the transformer (768 on BERT base and 1024 on BERT large)." 
}, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-154", "text": "Figure 3 shows the training loss for base TRANS/BERT, TRANS-BLSTM-SMALL, and TRANS-BLSTM models." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-155", "text": "As can be seen, TRANS-BLSTM-SMALL model has lower training loss than the TRANS/BERT model." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-156", "text": "TRANS-BLSTM can further decrease the training loss compared to TRANS-BLSTM-SMALL." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-157", "text": "This suggests that the proposed TRANS-BLSTM-SMALL and TRANS-BLSTM are capable of fitting the training data better than the original BERT model." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-158", "text": "----------------------------------" }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-159", "text": "**COMPARE TWO VERSIONS OF TRANS-BLSTM MODELS**" }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-160", "text": "We proposed two versions of TRANS-BLSTM models in section 3.2 (see Fig 2) , with TRANS-BLSTM-1 replacing the feedforward layer with a bidirectional LSTM layer, and TRANS-BLSTM-2 adding a parallel bidirectional LSTM layer." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-161", "text": "We trained these two models and list their performance on SQuAD 1.1 development dataset in Table 3 ." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-162", "text": "We note that these two models lead to similar accuracy on this dataset." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-163", "text": "We will use TRANS-BLSTM-2 to report the accuracy in the rest of the experiments (denoted as TRANS-BLSTM for notational simplicity)." 
}, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-164", "text": "----------------------------------" }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-165", "text": "**MODEL**" }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-166", "text": "EM F1 TRANS-BLSTM-1 84.87 91.52 TRANS-BLSTM-2 84.75 91.53 Table 4 shows the results of SQuAD dataset for TRANS/BERT, TRANS-BLSTM-SMALL and TRANS-BLSTM models for base and large settings respectively." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-168", "text": "In addition, the TRANS-BLSTM can further boost accuracy on top of TRANS-BLSTM-SMALL to 91.53% and 93.82% on base and large respectively." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-169", "text": "The accuracy boosts suggest that the bidirectional LSTM model can add additional accuracy gain on top of the transformer models." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-170", "text": "----------------------------------" }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-171", "text": "**MODELS EVALUATION ON SQUAD DATASET**" }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-172", "text": "Compared to adding bidirectional LSTM layers to the encoder, the addition of bidirectional LSTMs to the decoder (see +BLSTM experiments in Table 4 ) offers additional improvements on five out of six cases." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-173", "text": "For example, it boosts the base TRANS/BERT model F1 score of 90.05% to 90.67%, and boosts the large TRANS-BLSTM model F1 score of 93.82% to 94.01%." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-174", "text": "----------------------------------" }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-175", "text": "**MODEL EVALUATION ON GLUE DATASETS**" }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-176", "text": "Following (Devlin et al., 2018) , we use a batch size of 32 and 3-epoch fine-tuning over the data for all GLUE tasks." 
}, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-177", "text": "For each task, we selected the best fine-tuning learning rate (among 5e-5, 4e-5, 3e-5, and 2e-5) on the development set." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-178", "text": "Additionally similar to (Devlin et al., 2018) , for large BERT and TRANS-BLSTM models, we found that finetuning was sometimes unstable on small datasets, so we ran several random restarts and selected the best model on the development set." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-179", "text": "Table 5 shows the results of GLUE datasets for original BERT (Devlin et al., 2018) , ours TRANS/BERT, TRANS-BLSTM-SMALL and TRANS-BLSTM on base and large settings respectively." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-180", "text": "Following the BERT setting (Devlin et al., 2018) , we exclude the problematic WNLI set." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-181", "text": "F1 scores are reported for QQP and MRPC, Spearman correlations are reported for STS-B, and accuracy scores are reported for the other tasks." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-182", "text": "Unlike the evaluation on SQuAD dataset, we do not apply the BLSTM layer to the decoder." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-183", "text": "This is because that the tasks on GLUE are classification tasks based on the [CLS] token, and are not sequential prediction tasks (for example the SQuAD dataset) which may benefit more from including a BLSTM layer." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-184", "text": "We note again the accuracy discrepancy between the original BERT and our implementation of BERT, which may be due to the fact that the former uses partial word masking while the later uses whole word masking." 
}, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-185", "text": "Similar to the SQuAD results, the TRANS-BLSTM-SMALL and TRANS-BLSTM base models can improve the TRANS/BERT base model from the average GLUE score of 84.63% to 84.77% and 85.35% respectively." }, { "sent_id": "caa0ffb1d4e3e5310a28b921333d1e-C001-186", "text": "In addition, the TRANS-BLSTM-SMALL and TRANS-BLSTM large models can improve the TRANS/BERT large model from the average GLUE score of 85.59% to 86.23% and 86.50% respectively." } ], "y": { "@BACK@": { "gold_contexts": [ [ "caa0ffb1d4e3e5310a28b921333d1e-C001-11" ], [ "caa0ffb1d4e3e5310a28b921333d1e-C001-13" ], [ "caa0ffb1d4e3e5310a28b921333d1e-C001-18" ], [ "caa0ffb1d4e3e5310a28b921333d1e-C001-24" ] ], "cite_sentences": [ "caa0ffb1d4e3e5310a28b921333d1e-C001-11", "caa0ffb1d4e3e5310a28b921333d1e-C001-13", "caa0ffb1d4e3e5310a28b921333d1e-C001-18", "caa0ffb1d4e3e5310a28b921333d1e-C001-24" ] }, "@MOT@": { "gold_contexts": [ [ "caa0ffb1d4e3e5310a28b921333d1e-C001-19" ], [ "caa0ffb1d4e3e5310a28b921333d1e-C001-30" ] ], "cite_sentences": [ "caa0ffb1d4e3e5310a28b921333d1e-C001-19", "caa0ffb1d4e3e5310a28b921333d1e-C001-30" ] }, "@USE@": { "gold_contexts": [ [ "caa0ffb1d4e3e5310a28b921333d1e-C001-65" ], [ "caa0ffb1d4e3e5310a28b921333d1e-C001-89" ], [ "caa0ffb1d4e3e5310a28b921333d1e-C001-93" ], [ "caa0ffb1d4e3e5310a28b921333d1e-C001-111" ], [ "caa0ffb1d4e3e5310a28b921333d1e-C001-113" ], [ "caa0ffb1d4e3e5310a28b921333d1e-C001-132" ], [ "caa0ffb1d4e3e5310a28b921333d1e-C001-143" ], [ "caa0ffb1d4e3e5310a28b921333d1e-C001-148", "caa0ffb1d4e3e5310a28b921333d1e-C001-149" ], [ "caa0ffb1d4e3e5310a28b921333d1e-C001-176" ], [ "caa0ffb1d4e3e5310a28b921333d1e-C001-179" ], [ "caa0ffb1d4e3e5310a28b921333d1e-C001-180" ] ], "cite_sentences": [ "caa0ffb1d4e3e5310a28b921333d1e-C001-65", "caa0ffb1d4e3e5310a28b921333d1e-C001-89", "caa0ffb1d4e3e5310a28b921333d1e-C001-93", "caa0ffb1d4e3e5310a28b921333d1e-C001-111", "caa0ffb1d4e3e5310a28b921333d1e-C001-113", 
"caa0ffb1d4e3e5310a28b921333d1e-C001-132", "caa0ffb1d4e3e5310a28b921333d1e-C001-143", "caa0ffb1d4e3e5310a28b921333d1e-C001-149", "caa0ffb1d4e3e5310a28b921333d1e-C001-176", "caa0ffb1d4e3e5310a28b921333d1e-C001-179", "caa0ffb1d4e3e5310a28b921333d1e-C001-180" ] }, "@SIM@": { "gold_contexts": [ [ "caa0ffb1d4e3e5310a28b921333d1e-C001-122" ], [ "caa0ffb1d4e3e5310a28b921333d1e-C001-178" ] ], "cite_sentences": [ "caa0ffb1d4e3e5310a28b921333d1e-C001-122", "caa0ffb1d4e3e5310a28b921333d1e-C001-178" ] }, "@DIF@": { "gold_contexts": [ [ "caa0ffb1d4e3e5310a28b921333d1e-C001-128" ] ], "cite_sentences": [ "caa0ffb1d4e3e5310a28b921333d1e-C001-128" ] }, "@UNSURE@": { "gold_contexts": [ [ "caa0ffb1d4e3e5310a28b921333d1e-C001-149" ] ], "cite_sentences": [ "caa0ffb1d4e3e5310a28b921333d1e-C001-149" ] } } }, "ABC_6ca6283ae23bbd6d0827d8f5f2947a_4": { "x": [ { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-64", "text": "----------------------------------" }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-2", "text": "It is shown that many published models for the Stanford Question Answering Dataset (Rajpurkar et al., 2016) lack robustness, suffering an over 50% decrease in F1 score during adversarial evaluation based on the AddSent (Jia and Liang, 2017) algorithm." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-3", "text": "It has also been shown that retraining models on data generated by AddSent has limited effect on their robustness." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-4", "text": "We propose a novel alternative adversary-generation algorithm, AddSentDiverse, that significantly increases the variance within the adversarial training data by providing effective examples that punish the model for making certain superficial assumptions." 
}, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-5", "text": "Further, in order to improve robustness to AddSent's semantic perturbations (e.g., antonyms), we jointly improve the model's semantic-relationship learning capabilities in addition to our AddSentDiversebased adversarial training data augmentation." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-6", "text": "With these additions, we show that we can make a state-of-the-art model significantly more robust, achieving a 36.5% increase in F1 score under many different types of adversarial evaluation while maintaining performance on the regular SQuAD task." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-7", "text": "----------------------------------" }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-9", "text": "We explore the task of reading comprehension based question answering (Q&A), where we focus on the Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016) , in which models answer questions about paragraphs taken from Wikipedia." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-10", "text": "Significant progress has been made with deep end to end neural-attention models, with some achieving above human level performance on the test set (Wang and Jiang, 2017; Seo et al., 2017; Huang et al., 2018; Peters et al., 2018) ." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-11", "text": "However, as shown recently by Jia and Liang (2017) , these models are very fragile when presented with adversarially generated data." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-12", "text": "They proposed AddSent, which creates a semantically-irrelevant sentence containing a fake answer that resembles the question syntactically, and appends it to the context." 
}, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-13", "text": "Many state-ofthe-art models exhibit a nearly 50% reduction in F1 score on AddSent, showing their over-reliance on syntactic similarity and limited semantic understanding." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-14", "text": "Importantly, this is in part due to the nature of the SQuAD dataset." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-15", "text": "Most questions in the dataset have answer spans embedded in sentences that are syntactically similar to the question." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-16", "text": "Thus during training, the model is rarely punished for answering questions based on syntactic similarity, and learns it as a reliable approach to Q&A. This correlation between syntactic similarity and correctness is of course not true in general: the adversaries generated by AddSent (Jia and Liang, 2017) are syntactically similar to the question but do not answer them." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-17", "text": "The models' failures on AddSent demonstrates their ignorance of this aspect of the task." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-18", "text": "Jia and Liang (2017) presented some initial attempts to fix this problem by retraining the BiDAF model (Seo et al., 2017) with adversaries generated with AddSent. But they showed that the method is not very effective, as slight modifications (e.g., different positioning of the distractor sentence in the paragraph and different fake answer set) to the adversary generation algorithm at test time have drastic impact on the retrained model's performance." 
}, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-19", "text": "In this paper, we show that their method of adversarial training failed because the specificity of the AddSent algorithm along with the lack of naturally-occurring counterexamples allow models to learn superficial clues regarding what is a 'distractor' and subsequently ignore it; thus significantly limiting their robustness." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-20", "text": "Instead, we first introduce a novel algorithm, AddSentDiverse, for generating adversarial examples with signifi-cantly higher variance (by varying the locations where the distractors are placed and expanding the set of fake answers), so that the model is punished during training time for making these superficial assumptions about the distractor." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-21", "text": "We show that an AddSentDiverse-based adversariallytrained model beats an AddSent-trained model across 3 different adversarial test sets, showing an average improvement of 24.22% in F1 score, demonstrating a general increase in robustness." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-22", "text": "However, even with our diversified adversarial training data, the model is still not fully resilient to AddSent-style attacks, e.g., its antonymy-style semantic perturbations." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-23", "text": "Hence, we next add semantic relationship features to the model to let it directly identify such relationships between the context and question." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-24", "text": "Interestingly, we see that these additions only increase model robustness when trained adversarially, because intuitively in the non-adversarially-trained setup, there are not enough negative (adversarial) examples for the model to learn how to use its semantic features." 
}, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-25", "text": "Overall, we demonstrate that with our adversarial training method and model improvement, we can increase the performance of a state-of-theart model by 36.46% on the AddSent evaluation set." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-65", "text": "**SEMANTIC FEATURE ENHANCED MODEL**" }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-26", "text": "Although we focused on the AddSent adversary (Jia and Liang, 2017) , our method of effective adversarial training by eliminating superficial statistical correlations (with joint model capability improvements) are generalizable to other similar insertion-based adversaries for Q&A tasks." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-27", "text": "1" }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-28", "text": "----------------------------------" }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-29", "text": "**RELATED WORK**" }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-30", "text": "Adversarial Evaluation In computer vision, adversarial examples are frequently used to punish model oversensitivity, where semantic-preserving perturbations (usually in the form of small noise vectors) are added to an image to fool the classifier into giving it a different label (Szegedy et al., 2014; Goodfellow et al., 2015) ." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-31", "text": "In the field of Q&A, Jia and Liang (2017) introduced the AddSent algorithm, which generates adversaries that punish model failure in the other direction: overstability, or the inability to detect semantic-altering noise." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-32", "text": "It does so by generating distractor sentences that only resemble the questions syntactically and appending them to the context paragraphs (detailed description included in Sec. 3)." 
}, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-33", "text": "When tested on these adversarial examples, Jia and Liang (2017) showed that even the most 'robust' amongst published models (the Mnemonic Reader (Hu et al., 2017) ) only achieved 46.6% F1 (compared to 79.6% F1 on the regular task)." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-34", "text": "Since then, the FusionNet model (Huang et al., 2018) used history-of-word representations and multi-level attention mechanism to obtain an improved 51.4% F1 score under adversarial evaluation, but that is still a 30% decrease from the model's performance on the regular task." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-35", "text": "We show, however, that one can make a pre-existing model significantly more robust by simply retraining it with better, higher variance adversarial training data, and improve it further with minor semantic feature additions to its inputs." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-36", "text": "Adversarial Training It has been shown in the field of image classification that training with adversarial examples produces more robust and error-resistant models (Goodfellow et al., 2015; Kurakin et al., 2017) ." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-37", "text": "In the field of Q&A, Jia and Liang (2017) attempted to retrain the BiDAF (Seo et al., 2017) model with data generated with AddSent algorithm." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-38", "text": "Despite performing well when evaluated on AddSent, the retrained model suffers a more than 30% decrease in F1 performance when tested on a slightly different adversarial dataset generated by AddSentMod (which differs from AddSent in two superficial ways: using a different set of fake answers and prepending instead of appending the distractor sentence to the context)." 
}, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-39", "text": "We show that using AddSent to generate adversarial training data introduces new superficial trends for a model to exploit; and instead we propose the AddSentDiverse algorithm that generates highly varied data for adversarial training, resulting in more robust models." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-40", "text": "----------------------------------" }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-41", "text": "**METHODS**" }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-42", "text": "Our 'AddSentDiverse' algorithm is a modified version of AddSent (Jia and Liang, 2017) , aimed at producing good adversarial examples for robust training purposes." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-62", "text": "For each answer a, we generate the fake answer dynamically by randomly selecting another answer a = a from S that has the same type as a, as opposed to AddSent (Jia and Liang, 2017) , which uses a pre-defined fake answer for each type (e.g., \"Chicago\" for any location)." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-43", "text": "For each {context, question, answer} triple, AddSent does the following: (1) Several antonym and named-entity based semantic altering perturbations (swapping) are applied to the question; (2) A fake answer is generated that matches the 'type' of the original answer (e.g., Prague \u2192 Chicago, etc.); (3) The fake answer and the altered question are combined into a distractor statement based on a set of manually defined rules; (4) Errors in grammar are fixed by crowd-workers; (5) The finalized distractor is appended to the end of the context." 
}, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-44", "text": "The specificity of the algorithm creates new superficial cues that a model can learn and use during training and never get punished for: (1) a model can learn that it is unlikely for the last sentence to contain the real answer; (2) a model can learn that the fixed set of fake answers should not be picked." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-45", "text": "These nullify the effectiveness of the distractors as the model will learn to simply ignore them." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-46", "text": "We thus introduce the AddSentDiverse algorithm, which adds two modifications to AddSent that allows for generating higher-variance adversarial examples." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-47", "text": "Namely, we randomize the distractor placement (Sec. 3.1) and we diversity the set of fake answers used (Sec. 3.2)." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-48", "text": "Lastly, to address the antonymstyle semantic perturbations used in AddSent, we show that we need to improve model capabilities by adding indicator features for semantic relationships (but only when) in tandem with the addition of diverse adversarial data (Sec. 3.3)." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-49", "text": "----------------------------------" }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-50", "text": "**RANDOM DISTRACTOR PLACEMENT**" }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-51", "text": "Given a paragraph P containing n sentences, let X, Y be random variables representing the location of the sentence containing the correct answer counting from the front and back." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-52", "text": "2 Let P represent the paragraph with the inserted distractor, and X and Y represent the updated location of the sentence with the correct answer." 
}, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-53", "text": "As shown in Fig. 1 , their distribution is highly dependent on the strategy used to insert the distractor." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-54", "text": "During training done by Jia and Liang (2017) , the distractor is always added as the last sentence, creating a very skewed distribution for Y ." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-55", "text": "This resulted in the model learning to ignore the last sentence, as it was never punished for doing so." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-56", "text": "This, in turn, caused the retrained model to fail on AddSentMod, where the distractor is inserted to the front instead of the back of the context paragraph (this is shown by our experiments as well)." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-57", "text": "However, Fig. 1 shows that when the distractor is inserted randomly, the distributions of X and Y are almost identical to that of X and Y , indicating that no new correlation between the location of a sentence and its likelihood to contain the correct answer is introduced by the distractors, hence forcing the model to learn to discern them from the" }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-58", "text": "----------------------------------" }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-59", "text": "**DYNAMIC FAKE ANSWER GENERATION**" }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-60", "text": "To prevent the model from superficially deciding what is a distractor based on certain specific words, we dynamically generate the fake answers instead of using AddSent's pre-defined set." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-61", "text": "Let S be the set that contains all the answers in the SQuAD training data, tagged by their type (e.g., person, location, etc.)." 
}, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-63", "text": "This creates a much larger set of fake answers, thus decreasing the correlation between any text and its likelihood of being a part of a distractor, forcing the model to become more robust." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-66", "text": "In previous sections, we prevented the model from identifying distractors based on superficial clues such as location and fake answer identity by eliminating these correlations within the training data." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-67", "text": "But even if we force the model to learn some deeper methods for identifying/discarding the distractors, it only has limited ability in recognizing semantic differences because its current inputs do not capture crucial aspects of lexical semantics such as antonymy (which were inserted by Jia and Liang (2017) when generating the AddSent adversaries; see Sec. 3)." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-68", "text": "Most current models use pretrained word embeddings (e.g., GloVE (Pennington et al., 2014) and ELMo (Peters et al., 2018) ) as input, which are usually calculated based on the distributional hypothesis (Harris, 1954) , and do not capture lexical semantic relations such as antonymy (Geffet and Dagan, 2005) ." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-69", "text": "These shortcomings are reflected by our results in Sec. 4.6, where we see that we can't resolve all AddSent- style adversaries by diversifying the training data alone." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-70", "text": "For the model to be robust to semanticsbased (e.g., antonym-style) attacks, it needs extra knowledge of lexical semantic relations." 
}, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-71", "text": "Hence, we augment the input of each word in the question/context with two indicator features indicating the existence of its synonym and antonym (using WordNet (Fellbaum, 1998) ) in the context/question, allowing the model to use lexical semantics directly instead of learned statistical correlations of the word embeddings." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-72", "text": "----------------------------------" }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-73", "text": "**EXPERIMENTS AND RESULTS**" }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-74", "text": "----------------------------------" }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-75", "text": "**MODEL AND TRAINING DETAILS**" }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-76", "text": "We use the architecture and hyperparameters of the strong BiDAF + Self-Attn + ELMo (BSAE) model (Peters et al., 2018) , currently (as of January 10, 2018) the third highest performing singlemodel on the SQuAD leaderboard." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-77", "text": "3" }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-78", "text": "----------------------------------" }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-79", "text": "**EVALUATION DETAILS**" }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-80", "text": "Models are evaluated on the original SQuAD dev set and 4 adversarial datasets: AddSent, the adversarial evaluation set by Jia and Liang (2017) , and 3 variations of AddSent: AddSentPrepend, where the distractor is prepended to the context, AddSentRandom, where the distractor is randomly inserted into the context, 4 and AddSentMod (Jia and Liang, 2017) , where a different set of fake answers is used and the distractor is prepended to the context." 
}, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-81", "text": "Experiments measure the soft F1 score and all of the adversarial evaluations are modeldependent, following the style of AddSent, where multiple adversaries are generated for each exam-ple in the evaluation set and the model's worst performance among the variants is recorded." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-82", "text": "----------------------------------" }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-83", "text": "**PRIMARY EXPERIMENT RESULTS**" }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-84", "text": "In our main experiment, we compare the BSAE model's performance on different test sets when trained with three different training sets: the original SQuAD data (Original-SQuAD), SQuAD data augmented with AddSent generated adversaries (similar to adversarial training conducted by Jia and Liang (2017)), and SQuAD data augmented with our AddSentDiverse generated adversaries." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-108", "text": "**ERROR ANALYSIS**" }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-85", "text": "For the latter two, we run the respective adversarial generation algorithms on the training set, and add randomly selected adversarial examples such that they make up 20% of the total training data." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-86", "text": "The results are shown in Table 1 ." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-87", "text": "First, as shown, the AddSent-trained model is not able to perform well on test sets where the distractors are not inserted at the end, e.g., the AddSentRandom adversarial test set." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-88", "text": "5 On the other hand, it can be seen that retraining with AddSentDiverse boosts performance of the model significantly across all adversarial datasets, indicating a general increase in robustness." 
}, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-89", "text": "----------------------------------" }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-90", "text": "**DISTRACTOR PLACEMENT RESULTS**" }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-91", "text": "We also conducted experiments studying the effect of different distractor placement strategies on the trained models' robustness." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-92", "text": "The BSAE model was trained on 4 variations of AddSentDiverseaugmented training set, with the only difference between them being the location of the distractor within the context: InsFirst, where the distractor is prepended, InsLast, where the distractor is appended, InsMid, where the distractor is inserted in the middle and InsRandom, where the distractor is randomly placed." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-93", "text": "The retrained models are tested on AddSent and AddSentPrepend, whose only difference is where the distractor is located." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-94", "text": "The result is shown in performs well on test sets created by a similar distractor placement strategy, indicating that they are exploiting superficial trends instead of learning to process the semantics." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-95", "text": "It is also shown that InsRandom gives optimal performance on both evaluation datasets." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-96", "text": "Further investigations regarding distractor placement can be found in the appendix." 
}, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-97", "text": "----------------------------------" }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-98", "text": "**FAKE ANSWER GENERATION RESULTS**" }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-99", "text": "We also conducted experiments studying the effect of training on data containing distractors with dynamically generated fake answers (DynamicFakeAns) instead of chosen from a predefined set (Fixed-FakeAns) ." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-100", "text": "The trained models are tested on AddSentPrepend and AddSentMod, whose only difference is that AddSentMod uses a different set of fake answers." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-101", "text": "The results are displayed in Table 3 ." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-102", "text": "It shows that the model trained on Fixed-FakeAns suffers an approximate 3% drop in performance when tested on a dataset with a different set of fake answers, but this gap does not exist for the model retrained on Dynamic-FakeAns." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-103", "text": "----------------------------------" }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-104", "text": "**SEMANTIC FEATURE ENHANCEMENT RESULTS**" }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-105", "text": "In Table 1 , we see that despite improving performance on adversarial test sets, adversarial training on the BSAE model leads to a 1% decrease in its performance on the original SQuAD task (from 84.65% to 83.49%)." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-106", "text": "Furthermore, there is still a 6.5% gap between its performance on adversarial datasets and the original SQuAD dev set (76.95% vs 83.49%)." 
}, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-107", "text": "----------------------------------" }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-109", "text": "Finally, we examined the errors of our final adversarially-trained BSAE+SA model on the AddSent dataset and found that out of the 21.09% remaining errors (Table 4) , 33.3% (46 cases) of these erroneous predictions occurred within the inserted distractor, and 63.7% (88 cases) occurred on questions that the model got wrong in the original SQuAD dev set (without the inserted distractors)." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-110", "text": "The former errors are mainly occurring within distractors created with named-entity replacements (which we haven't addressed directly in the current paper) or malformed distractors (that in fact do answer the question)." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-111", "text": "----------------------------------" }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-112", "text": "**CONCLUSION**" }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-113", "text": "We demonstrate that we can overcome model overstability and increase their robustness by training on diverse adversarial data that eliminates latent data correlations." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-114", "text": "We further show that adversarial training is more effective when we jointly add useful semantic-relations knowledge to improve model capabilities." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-115", "text": "We hope that these robustness methods are generalizable to other insertion-based adversaries for Q&A tasks." 
}, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-116", "text": "----------------------------------" }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-117", "text": "**A APPENDIX: DISTRACTOR PLACEMENT STRATEGIES**" }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-118", "text": "This section provides a theoretical framework to predict a model's performance on adversarial test sets when trained on adversarial data generated by a specific distractor-insertion strategy." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-119", "text": "Given a paragraph composed of n sentences (with the distractor inserted) P = {s 1 , s 2 , . . . , s n }, where s i is the ith sentence counting from the front." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-120", "text": "Define random variables X and Y to represent the location of the distractor counting from the front and back, respectively." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-121", "text": "The distributions of X and Y are dependent upon the insertion strategy used to add the distractors, several examples of this are displayed in Fig. 2 ." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-122", "text": "A bidirectional deep learning model, trained in a supervised setting, should be able to jointly learn X and Y ." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-123", "text": "Thus, at test time, when given a paragraph of n sentences, the model can obtain the probability that the sentence s a is the distractor, P sa , by computing P (X = a) + P (Y = n \u2212 a)." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-124", "text": "Ideally, we want the distribution of P sa to be uniform, as that means the model is not biased towards discarding any sentence as the distractor based on location." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-125", "text": "The actual distributions of P sa under different distractor-insertion strategies are displayed in Fig. 3 for n = 3, 5 and 7." 
}, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-126", "text": "We pick these n as they are typical lengths of contexts within the SQuAD dataset (the complete distribution of paragraph lengths in the SQuAD training set is shown in Fig. 4) ." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-127", "text": "We see that under random insertion, the distribution is very close to uniform." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-128", "text": "Note that if we were to aggregate n and plot P sa for n \u2264 3, 5 and 7, as shown in Fig. 3 , the distributions of P sa created by inserting in the middle and inserting randomly are very similar, but the distribution of inserting in the middle is skewed against the beginnings and ends of the paragraphs." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-129", "text": "This explains why in our experiment studying the effect of distractor placement strategies (see Table 2), InsMid's performance was not skewed towards either AddSent or AddSentPrepend, but was worse on both when compared to InsRandom." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-130", "text": "This method of calculating the distribution of P sa allows us to predict the model's performance when trained on datasets where the distractors are inserted at specific locations." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-131", "text": "To test this hypothesis, we created two datasets: InsFront-3 and InsFront-6 where the distractors were inserted as the 3rd and 6th sentence from the beginning and measure the model's performance when trained on these two datasets." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-132", "text": "The distributions of P sa for these two datasets are shown in Fig. 
5 , from which we can predict that models trained on InsFront-3 should perform slightly better on adversarial sets where the distractors are appended (as opposed to prepended), whereas those trained on InsFront-6 will perform much better on such adversarial sets." }, { "sent_id": "6ca6283ae23bbd6d0827d8f5f2947a-C001-133", "text": "These predictions are confirmed by the results in Table 5 ." } ], "y": { "@MOT@": { "gold_contexts": [ [ "6ca6283ae23bbd6d0827d8f5f2947a-C001-2" ], [ "6ca6283ae23bbd6d0827d8f5f2947a-C001-3" ], [ "6ca6283ae23bbd6d0827d8f5f2947a-C001-11" ], [ "6ca6283ae23bbd6d0827d8f5f2947a-C001-19" ], [ "6ca6283ae23bbd6d0827d8f5f2947a-C001-22" ], [ "6ca6283ae23bbd6d0827d8f5f2947a-C001-33" ], [ "6ca6283ae23bbd6d0827d8f5f2947a-C001-48" ], [ "6ca6283ae23bbd6d0827d8f5f2947a-C001-69" ] ], "cite_sentences": [ "6ca6283ae23bbd6d0827d8f5f2947a-C001-2", "6ca6283ae23bbd6d0827d8f5f2947a-C001-3", "6ca6283ae23bbd6d0827d8f5f2947a-C001-11", "6ca6283ae23bbd6d0827d8f5f2947a-C001-19", "6ca6283ae23bbd6d0827d8f5f2947a-C001-22", "6ca6283ae23bbd6d0827d8f5f2947a-C001-33", "6ca6283ae23bbd6d0827d8f5f2947a-C001-48", "6ca6283ae23bbd6d0827d8f5f2947a-C001-69" ] }, "@EXT@": { "gold_contexts": [ [ "6ca6283ae23bbd6d0827d8f5f2947a-C001-5" ], [ "6ca6283ae23bbd6d0827d8f5f2947a-C001-26" ], [ "6ca6283ae23bbd6d0827d8f5f2947a-C001-39" ], [ "6ca6283ae23bbd6d0827d8f5f2947a-C001-42" ], [ "6ca6283ae23bbd6d0827d8f5f2947a-C001-46" ] ], "cite_sentences": [ "6ca6283ae23bbd6d0827d8f5f2947a-C001-5", "6ca6283ae23bbd6d0827d8f5f2947a-C001-26", "6ca6283ae23bbd6d0827d8f5f2947a-C001-39", "6ca6283ae23bbd6d0827d8f5f2947a-C001-42", "6ca6283ae23bbd6d0827d8f5f2947a-C001-46" ] }, "@BACK@": { "gold_contexts": [ [ "6ca6283ae23bbd6d0827d8f5f2947a-C001-12" ], [ "6ca6283ae23bbd6d0827d8f5f2947a-C001-13" ], [ "6ca6283ae23bbd6d0827d8f5f2947a-C001-16" ], [ "6ca6283ae23bbd6d0827d8f5f2947a-C001-17" ], [ "6ca6283ae23bbd6d0827d8f5f2947a-C001-18" ], [ "6ca6283ae23bbd6d0827d8f5f2947a-C001-31" ], [ 
"6ca6283ae23bbd6d0827d8f5f2947a-C001-37" ], [ "6ca6283ae23bbd6d0827d8f5f2947a-C001-43" ], [ "6ca6283ae23bbd6d0827d8f5f2947a-C001-54" ] ], "cite_sentences": [ "6ca6283ae23bbd6d0827d8f5f2947a-C001-12", "6ca6283ae23bbd6d0827d8f5f2947a-C001-13", "6ca6283ae23bbd6d0827d8f5f2947a-C001-16", "6ca6283ae23bbd6d0827d8f5f2947a-C001-17", "6ca6283ae23bbd6d0827d8f5f2947a-C001-18", "6ca6283ae23bbd6d0827d8f5f2947a-C001-31", "6ca6283ae23bbd6d0827d8f5f2947a-C001-37", "6ca6283ae23bbd6d0827d8f5f2947a-C001-43", "6ca6283ae23bbd6d0827d8f5f2947a-C001-54" ] }, "@DIF@": { "gold_contexts": [ [ "6ca6283ae23bbd6d0827d8f5f2947a-C001-21" ], [ "6ca6283ae23bbd6d0827d8f5f2947a-C001-25" ], [ "6ca6283ae23bbd6d0827d8f5f2947a-C001-38" ], [ "6ca6283ae23bbd6d0827d8f5f2947a-C001-60" ], [ "6ca6283ae23bbd6d0827d8f5f2947a-C001-62" ], [ "6ca6283ae23bbd6d0827d8f5f2947a-C001-67" ], [ "6ca6283ae23bbd6d0827d8f5f2947a-C001-93" ] ], "cite_sentences": [ "6ca6283ae23bbd6d0827d8f5f2947a-C001-21", "6ca6283ae23bbd6d0827d8f5f2947a-C001-25", "6ca6283ae23bbd6d0827d8f5f2947a-C001-38", "6ca6283ae23bbd6d0827d8f5f2947a-C001-60", "6ca6283ae23bbd6d0827d8f5f2947a-C001-62", "6ca6283ae23bbd6d0827d8f5f2947a-C001-67", "6ca6283ae23bbd6d0827d8f5f2947a-C001-93" ] }, "@USE@": { "gold_contexts": [ [ "6ca6283ae23bbd6d0827d8f5f2947a-C001-80" ], [ "6ca6283ae23bbd6d0827d8f5f2947a-C001-81" ], [ "6ca6283ae23bbd6d0827d8f5f2947a-C001-84" ], [ "6ca6283ae23bbd6d0827d8f5f2947a-C001-87" ], [ "6ca6283ae23bbd6d0827d8f5f2947a-C001-93" ], [ "6ca6283ae23bbd6d0827d8f5f2947a-C001-109" ], [ "6ca6283ae23bbd6d0827d8f5f2947a-C001-129" ] ], "cite_sentences": [ "6ca6283ae23bbd6d0827d8f5f2947a-C001-80", "6ca6283ae23bbd6d0827d8f5f2947a-C001-81", "6ca6283ae23bbd6d0827d8f5f2947a-C001-84", "6ca6283ae23bbd6d0827d8f5f2947a-C001-87", "6ca6283ae23bbd6d0827d8f5f2947a-C001-93", "6ca6283ae23bbd6d0827d8f5f2947a-C001-109", "6ca6283ae23bbd6d0827d8f5f2947a-C001-129" ] } } }, "ABC_15c8ca572430c214d9c571fbe0db95_4": { "x": [ { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-1", 
"text": "**ABSTRACT**" }, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-2", "text": "In this paper, we present a unigram segmentation model for statistical machine translation where the segmentation units are blocks: pairs of phrases without internal structure." }, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-3", "text": "The segmentation model uses a novel orientation component to handle swapping of neighbor blocks." }, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-4", "text": "During training, we collect block unigram counts with orientation: we count how often a block occurs to the left or to the right of some predecessor block." }, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-5", "text": "The orientation model is shown to improve translation performance over two models: 1) no block re-ordering is used, and 2) the block swapping is controlled only by a language model." }, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-6", "text": "We show experimental results on a standard Arabic-English translation task." }, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-7", "text": "----------------------------------" }, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-9", "text": "In recent years, phrase-based systems for statistical machine translation (Och et al., 1999; Koehn et al., 2003; Venugopal et al., 2003) have delivered state-of-the-art performance on standard translation tasks." }, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-10", "text": "In this paper, we present a phrase-based unigram system similar to the one in (Tillmann and Xia, 2003) , which is extended by an unigram orientation model." }, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-11", "text": "The units of translation are blocks, pairs of phrases without internal structure." }, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-12", "text": "Fig. 
1 shows an example block translation using five Arabic-English blocks" }, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-13", "text": ". The unigram orientation model is trained from word-aligned training data." }, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-14", "text": "During decoding, we view translation as a block segmentation process, where the input sentence is segmented from left to right and the target sentence is generated from bottom to top, one block at a time." }, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-15", "text": "A monotone block sequence is generated except for the possibility to swap a pair of neighbor blocks." }, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-16", "text": "The novel orientation model is used to assist the block swapping: as shown in section 3, block swapping where only a trigram language model is used to compute probabilities between neighbor blocks fails to improve translation performance." }, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-17", "text": "(Wu, 1996; Zens and Ney, 2003) present re-ordering models that make use of a straight/inverted orientation model that is related to our work." }, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-18", "text": "Here, we investigate in detail the effect of restricting the word re-ordering to neighbor block swapping only." }, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-19", "text": "In this paper, we assume a block generation process that generates block sequences from bottom to top, one block at a time." }, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-20", "text": "The score of a successor block depends on its predecessor block" }, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-21", "text": "where is a block and )." }, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-22", "text": "The neutral orientation is not modeled explicitly in this paper, rather it is handled as a default case as explained below." 
}, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-23", "text": "In Fig. 1 , the orientation sequence is ." }, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-24", "text": "These counts are defined via an enumeration process and are used to define the orientation model E 9 \u00a1 \u00a5\u00a8" }, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-25", "text": "----------------------------------" }, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-26", "text": "**ORIENTATION UNIGRAM MODEL**" }, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-27", "text": "The basic idea of the orientation model can be illustrated as follows: In the example translation in Fig with right orientation, i.e. it is always involved in swapping." }, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-28", "text": "This intuition is formalized using unigram counts with orientation." }, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-29", "text": "The orientation model is related to the distortion model in (Brown et al., 1993 ), but we do not compute a block alignment during training." }, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-30", "text": "We rather enumerate all relevant blocks in some order." }, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-31", "text": "Enumeration does not allow us to capture position dependent distortion probabilities, but we can compute statistics about adjacent block predecessors." }, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-32", "text": "Our baseline model is the unigram monotone model described in (Tillmann and Xia, 2003) ." }, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-33", "text": "Here, we select blocks from word-aligned training data and unigram block occurrence counts 0 \u00a1 \u00a8 a re computed: all blocks for a training sentence pair are enumerated in some order and we count how often a given block occurs in the parallel training data \u00a1 of the predecessor is ignored." 
}, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-34", "text": "The are chosen to be optimal on the devtest set (the optimal parameter setting is shown in Table." }, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-35", "text": "1)." }, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-36", "text": "Only two parameters have to be optimized due to the constraint that the have to sum to`P" }, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-37", "text": "where" }, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-38", "text": "are not optimized separately, rather we define:" }, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-39", "text": ". Straightforward normalization over all successor blocks in Eq. 2 and in Eq. 3 is not feasible: there are tens of millions of possible successor blocks ." }, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-40", "text": "In future work, normalization over a restricted successor set, e.g. for a given source input sentence, all blocks that match this sentence might be useful for both training and decoding." }, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-41", "text": "The segmentation model in Eq. 1 naturally prefers translations that make use of a smaller number of blocks which leads to a smaller number of factors in Eq. 1." }, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-42", "text": "Using fewer 'bigger' blocks to carry out the translation generally seems to improve translation performance." }, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-43", "text": "Since normalization does not influence the number of blocks used to carry out the translation, it might be less important for our segmentation model." }, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-44", "text": "We use a DP-based beam search procedure similar to the one presented in (Tillmann and Xia, 2003) ." 
}, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-45", "text": "We maximize" }, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-46", "text": "----------------------------------" }, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-47", "text": "**EXPERIMENTAL RESULTS**" }, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-48", "text": "The translation system is tested on an Arabic-to-English translation task." }, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-49", "text": "The training data comes from the UN news sources: ." }, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-50", "text": "This is the model presented in (Tillmann and Xia, 2003) ." }, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-51", "text": "For the ) model, the sentence is translated mostly monotonously, and only neighbor blocks are allowed to be swapped (at most`block is skipped)." }, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-52", "text": "The Table 1 : three BLEU results are presented for both devtest set and blind test set." }, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-53", "text": "Two scaling parameters are set on the devtest set and copied for use on the blind test set." }, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-54", "text": "The second column shows the model name, the third column presents the optimal weighting as obtained from the devtest set by carrying out an exhaustive grid search." }, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-55", "text": "The fourth column shows BLEU results together with confidence intervals (Here, the word casing is ignored)." }, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-56", "text": "The block swapping model Table 2 presents devtest set example blocks that have actually been swapped." }, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-57", "text": "The training data is unsegmented, as can be seen from the first two blocks." 
}, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-58", "text": "The block in the first line has been seen times more often with left than with right orientation." }, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-59", "text": "Blocks for which the ratio\u00a83 \u00a9 9 \u00a9 9" }, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-60", "text": "is bigger than U P S T Q are likely candidates for swapping in our Arabic-English experiments." }, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-61", "text": "The ratio\u00a8itself is not currently used in the orientation model." }, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-62", "text": "The orientation model mostly effects blocks where the Arabic and English words are verbs or nouns." }, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-63", "text": "As shown in Fig. 1 , the orientation model uses the orientation probability A \u00a1 for the noun block , and only the default model for the adjective block \u00a1 ." }, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-64", "text": "Although the noun block might occur by itself without adjective, the swapping is not controlled by the occurrence of the adjective block \u00a2 \u00a1 (which does not have adjacent predecessors)." }, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-65", "text": "We rather model the fact that a noun block is typically preceded by some block \u00a5 \u00a9" }, { "sent_id": "15c8ca572430c214d9c571fbe0db95-C001-66", "text": ". This situation seems typical for the block swapping that occurs on the evaluation test set." 
} ], "y": { "@SIM@": { "gold_contexts": [ [ "15c8ca572430c214d9c571fbe0db95-C001-10" ], [ "15c8ca572430c214d9c571fbe0db95-C001-44" ] ], "cite_sentences": [ "15c8ca572430c214d9c571fbe0db95-C001-10", "15c8ca572430c214d9c571fbe0db95-C001-44" ] }, "@USE@": { "gold_contexts": [ [ "15c8ca572430c214d9c571fbe0db95-C001-32" ], [ "15c8ca572430c214d9c571fbe0db95-C001-50" ] ], "cite_sentences": [ "15c8ca572430c214d9c571fbe0db95-C001-32", "15c8ca572430c214d9c571fbe0db95-C001-50" ] } } }, "ABC_91c1a4ab0347fb8b11ff213a97e864_4": { "x": [ { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-20", "text": "----------------------------------" }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-21", "text": "**STRUCTURED PERCEPTRON FOR INEXACT HYPERGRAPH SEARCH**" }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-2", "text": "Online learning algorithms like the perceptron are widely used for structured prediction tasks." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-3", "text": "For sequential search problems, like left-to-right tagging and parsing, beam search has been successfully combined with perceptron variants that accommodate search errors (Collins and Roark, 2004; Huang et al., 2012) ." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-4", "text": "However, perceptron training with inexact search is less studied for bottom-up parsing and, more generally, inference over hypergraphs." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-5", "text": "In this paper, we generalize the violation-fixing perceptron of Huang et al. (2012) to hypergraphs and apply it to the cube-pruning parser of Zhang and McDonald (2012) ." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-6", "text": "This results in the highest reported scores on WSJ evaluation set (UAS 93.50% and LAS 92.41% respectively) without the aid of additional resources." 
}, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-7", "text": "----------------------------------" }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-9", "text": "Structured prediction problems generally deal with exponentially many outputs, often making exact search infeasible." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-10", "text": "For sequential search problems, such as tagging and incremental parsing, beam search coupled with perceptron algorithms that account for potential search errors have been shown to be a powerful combination (Collins and Roark, 2004; Daum\u00e9 and Marcu, 2005; Zhang and Clark, 2008; Huang et al., 2012) ." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-11", "text": "However, sequential search algorithms, and in particular left-to-right beam search (Collins and Roark, 2004; Zhang and Clark, 2008) , squeeze inference into a very narrow space." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-12", "text": "To address this, Huang (2008) formulated constituency parsing as approximate bottom-up inference in order to compactly represent an exponential number of outputs while scoring features of arbitrary scope." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-13", "text": "This idea was adapted to graph-based dependency parsers by Zhang and McDonald (2012) and shown to outperform left-to-right beam search." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-14", "text": "Both these examples, bottom-up approximate dependency and constituency parsing, can be viewed as specific instances of inexact hypergraph search." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-15", "text": "Typically, the approximation is accomplished by cube-pruning throughout the hypergraph (Chiang, 2007) ." 
}, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-16", "text": "Unfortunately, as the scope of features at each node increases, the inexactness of search and its negative impact on learning can potentially be exacerbated." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-17", "text": "Unlike sequential search, the impact on learning of approximate hypergraph search -as well as methods to mitigate any ill effects -has not been studied." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-18", "text": "Motivated by this, we develop online learning algorithms for inexact hypergraph search by generalizing the violation-fixing percepron of Huang et al. (2012) ." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-19", "text": "We empirically validate the benefit of this approach within the cube-pruning dependency parser of Zhang and McDonald (2012) ." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-22", "text": "The structured perceptron algorithm (Collins, 2002) is a general learning algorithm." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-23", "text": "Given training instances (x,\u0177), the algorithm first solves the decoding problem y \u2032 = argmax y\u2208Y(x) w \u00b7 f (x, y) given the weight vector w for the high-dimensional feature representation f of the mapping (x, y), where y \u2032 is the prediction under the current model,\u0177 is the gold output and Y(x) is the space of all valid outputs for input x. The perceptron update rule is simply:" }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-24", "text": "The convergence of original perceptron algorithm relies on the argmax function being exact so that the condition w \u00b7 f (x, y \u2032 ) > w \u00b7 f (x,\u0177) (modulo ties) always holds." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-25", "text": "This condition is called a violation because the prediction y \u2032 scores higher than the correct label\u0177." 
}, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-26", "text": "Each perceptron update moves weights away from y \u2032 and towards\u0177 to fix such violations. But when search is inexact, y \u2032 could be suboptimal so that sometimes w \u00b7 f (x, y \u2032 ) < w \u00b7 f (x,\u0177)." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-27", "text": "Huang et al. (2012) named such instances non-violations and showed that perceptron model updates for nonviolations nullify guarantees of convergence." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-28", "text": "To account for this, they generalized the original update rule to select an output y \u2032 within the pruned search space that scores higher than\u0177, but is not necessarily the highest among all possibilities, which represents a true violation of the model on that training instance." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-29", "text": "This violation fixing perceptron thus relaxes the argmax function to accommodate inexact search and becomes provably convergent as a result." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-30", "text": "In the sequential cases where\u0177 has a linear structure such as tagging and incremental parsing, the violation fixing perceptron boils down to finding and updating along a certain prefix of\u0177." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-31", "text": "Collins and Roark (2004) locate the earliest position in a chain structure where\u0177 pref is worse than y \u2032 pref by a margin large enough to cause\u0177 to be dropped from the beam." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-32", "text": "Huang et al. (2012) locate the position where the violation is largest among all prefixes of\u0177, where size of a violation is defined as w \u00b7 f (x, y \u2032 pref ) \u2212 w \u00b7 f (x,\u0177 pref )." 
}, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-33", "text": "For hypergraphs, the notion of prefix must be generalized to subtrees." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-34", "text": "Figure 1 shows the packedforest representation of the union of gold subtrees and highest-scoring (Viterbi) subtrees at every gold node for an input." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-35", "text": "At each gold node, there are two incoming hyperedges: one for the gold subtree and the other for the Viterbi subtree." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-36", "text": "After bottomup parsing, we can compute the scores for the gold subtrees as well as extract the corresponding Viterbi subtrees by following backpointers." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-37", "text": "These Viterbi subtrees need not necessarily to belong to the full Viterbi path (i.e., the Viterbi tree rooted at node N )." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-38", "text": "An update strategy must choose a subtree or a set of subtrees at gold nodes." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-39", "text": "This is to ensure that the model is updating its weights relative to the intersection of the search space and the gold path." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-40", "text": "Our first update strategy is called single-node max-violation (s-max)." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-41", "text": "Given a gold tree\u0177, it traverses the gold tree and finds the node n on which the violation between the Viterbi subtree and the gold subtree is the largest over all gold nodes." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-42", "text": "The violation is guaranteed to be greater than or equal to zero because the lower bound for the max-violation on any hypergraph is 0 which happens at the leaf nodes." 
}, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-43", "text": "Then we choose the subtree pair (\u0177 n , y \u2032 n ) and do the update similar to the prefix update for the sequential case." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-44", "text": "For example, in Figure 1 , suppose the max-violation happens at node K , which covers the left half of the input x, then the perceptron update would move parameters to the subtree represented by nodes B , C , H and K and away from A , B , G and K ." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-45", "text": "Our second update strategy is called parallel maxviolation (p-max)." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-46", "text": "It is based on the observation that violations on non-overlapping nodes can be fixed in parallel." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-47", "text": "We define a set of frontiers as a set of nodes that are non-overlapping and the union of which covers the entire input string x. The frontier set can include up to |x| nodes, in the case where the frontier is equivalent to the set of leaves." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-48", "text": "We travers\u00ea y bottom-up to compute the set of frontiers such that each has the max-violation in the span it covers." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-49", "text": "Concretely, for each node n, the max-violation frontier set can be defined recursively," }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-50", "text": "where maxv(n) is the function that returns the node with the absolute maximum violation in the subtree rooted at n and can easily be computed recursively over the hypergraph." 
}, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-51", "text": "To make a perceptron update, we generate the max-violation frontier set for the entire hypergraph and use it to choose subtree pairs n\u2208ft(root(x)) (\u0177 n , y \u2032 n ), where root(x) is the root of the hypergraph for input x. For example, in Figure 1 , if the union of K and L satisfies the definition of ft, then the perceptron update would move feature weights away from the union of the two Viterbi subtrees and towards their gold counterparts." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-52", "text": "In our experiments, we compare the performance of the two violation-fixing update strategies against two baselines." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-53", "text": "The first baseline is the standard update, where updates always happen at the root node of a gold tree, even if the Viterbi tree at the root node leads to a non-violation update." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-54", "text": "The second baseline is the skip update, which also always updates at the root nodes but skips any non-violations." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-55", "text": "This is the strategy used by Zhang and McDonald (2012) ." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-56", "text": "----------------------------------" }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-57", "text": "**EXPERIMENTS**" }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-58", "text": "We ran a number of experiments on the cubepruning dependency parser of Zhang and McDonald (2012) , whose search space can be represented as a hypergraph in which the nodes are the complete and incomplete states and the hyperedges are the instantiations of the two parsing rules in the Eisner algorithm (Eisner, 1996) ." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-59", "text": "The feature templates we used are a superset of Zhang and McDonald (2012) ." 
}, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-60", "text": "These features include first-, second-, and third-order features and their labeled counterparts, as well as valency features." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-61", "text": "In addition, we also included a feature template from Bohnet and Kuhn (2012) ." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-62", "text": "This template examines the leftmost child and the rightmost child of a modifier simultaneously." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-63", "text": "All other highorder features of Zhang and McDonald (2012) only look at arcs on the same side of their head." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-64", "text": "We trained the parser with hamming-loss-augmented MIRA (Crammer et al., 2006) , following Martins et al. (2010) ." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-65", "text": "Based on results on the English validation data, in all the experiments, we trained MIRA with 8 epochs and used a beam of size 6 per node." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-66", "text": "To speed up the parser, we used an unlabeled first-order model to prune unlikely dependency arcs at both training and testing time (Koo and Collins, 2010; Martins et al., 2013) ." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-67", "text": "We followed Rush and Petrov (2012) to train the first-order model to minimize filter loss with respect to max-marginal filtering." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-68", "text": "On the English validation corpus, the filtering model pruned 80% of arcs while keeping the oracle unlabeled attachment score above 99.50%." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-69", "text": "During training only, we insert the gold tree into the hypergraph if it was mistakenly pruned." 
}, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-70", "text": "This ensures that the gold nodes are always available, which is required for model updates." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-71", "text": "----------------------------------" }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-72", "text": "**ENGLISH AND CHINESE RESULTS**" }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-73", "text": "We report dependency parsing results on the Penn WSJ Treebank and the Chinese CTB-5 Treebank." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-74", "text": "Both treebanks are constituency treebanks." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-75", "text": "We generated two versions of dependency treebanks by applying commonly-used conversion procedures." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-76", "text": "For the first English version (PTB-YM), we used the Penn2Malt 1 software to apply the head rules of Yamada and Matsumoto and the Malt label set." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-77", "text": "For the second English version (PTB-S), we used the Stanford dependency framework (De Marneffe et al., 2006) by applying version 2.0.5 of the Stanford parser." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-78", "text": "We split the data in the standard way: sections 2-21 for training; section 22 for validation; and section 23 for evaluation." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-79", "text": "We utilized a linear chain CRF tagger which has an accuracy of 96.9% on the validation data and 97.3% on the evaluation data 2 ." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-80", "text": "For Chinese, we use the Chinese Penn Treebank converted to dependencies and split into train/-validation/evaluation according to Zhang and Nivre (2011) ." 
}, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-81", "text": "We report both unlabeled attachment scores (UAS) and labeled attachment scores (LAS), ignoring punctuations (Buchholz and Marsi, 2006) ." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-82", "text": "Table 1 displays the results." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-83", "text": "Our improved cube-pruned parser represents a significant improvement over the feature-rich transition-based parser of Zhang and Nivre (2011) with a large beam size." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-84", "text": "It also improves over the baseline cube-pruning parser without max-violation update strategies (Zhang and McDonald, 2012) , showing the importance of update strategies in inexact hypergraph search." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-85", "text": "The UAS score on Penn-YM is slightly higher than the best result known in the literature which was reported by the fourth-order unlabeled dependency parser of Ma and Zhao (2012) , although we did not utilize fourth-order features." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-86", "text": "The LAS score on Penn-YM is on par with the best reported by Bohnet and Kuhn (2012) ." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-87", "text": "On Penn-S, there are not many existing results to compare with, due to the tradition of reporting results on Penn-YM in the past." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-88", "text": "Nevertheless, our result is higher than the second best by a large margin." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-89", "text": "Our Chinese parsing scores are the highest reported results." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-90", "text": "The speed of our parser is around 200-300 tokens per second for English." 
}, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-91", "text": "This is faster than the parser of Bohnet and Kuhn (2012) which has roughly the same level of accuracy, but is slower than the parser of Martins et al. (2013) and Rush and Petrov (2012) , both of which only do unlabeled dependency parsing and are less accurate." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-92", "text": "Given that predicting labels on arcs can slow down a parser by a constant factor proportional to the size of the label set, the speed of our parser is competitive." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-93", "text": "We also tried to prune away arc labels based on observed labels for each POS tag pair in the training data." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-94", "text": "By doing so, we could speed up our parser to 500-600 tokens per second with less than a 0.2% drop in both UAS and LAS." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-95", "text": "----------------------------------" }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-96", "text": "**IMPORTANCE OF UPDATE STRATEGIES**" }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-97", "text": "The lower portion of Table 1 compares cube-pruning parsing with different online update strategies in order to show the importance of choosing an update strategy that accommodates search errors." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-98", "text": "The maxviolation update strategies (s-max and p-max) improved results on both versions of the Penn Treebank as well as the CTB-5 Chinese treebank." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-99", "text": "It made a larger difference on Penn-S relative to Penn-YM, improving as much as 0.93% in LAS against the skip update strategy." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-100", "text": "Additionally, we measured the percentage of non-violation updates at root nodes." 
}, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-101", "text": "In the last epoch of training, on Penn-YM, there was 24% non-violations if we used the skip update strategy; on Penn-S, there was 36% non-violations." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-102", "text": "The portion of non-violations indicates the inexactness of the underlying search." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-103", "text": "Search is harder on Penn-S due to the larger label set." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-104", "text": "Thus, as expected, maxviolation update strategies improve most where the search is the hardest and least exact." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-105", "text": "----------------------------------" }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-106", "text": "**CONLL RESULTS**" }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-107", "text": "We also report parsing results for 17 languages from the CoNLL 2006/2007 shared-task (Buchholz and Marsi, 2006; Nivre et al., 2007) ." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-108", "text": "The parser in our experiments can only produce projective dependency trees as it uses an Eisner algorithm backbone to generate the hypergraph (Eisner, 1996) ." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-109", "text": "So, at training time, we convert non-projective trees -of which there are many in the CoNLL data -to projective ones through flattening, i.e., attaching words to the lowest ancestor that results in projective trees." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-110", "text": "At testing time, our parser can only predict projective trees, though we evaluate on the true non-projective trees." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-111", "text": "Table 2 shows the full results." 
}, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-112", "text": "We sort the languages according to the percentage of nonprojective trees in increasing order." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-113", "text": "The Spanish treebank is 98% projective, while the Dutch treebank is only 64% projective." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-114", "text": "With respect to the Zhang and Nivre (2011) baseline, we improved UAS in 16 languages and LAS in 15 languages." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-115", "text": "The improvements are stronger for the projective languages in the top rows." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-116", "text": "We achieved the best published UAS results for 7 languages: Spanish, Catalan, Bulgarain, Italian, Swedish, Danish, and Greek." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-117", "text": "As these languages are typically from the more projective data sets, we speculate that extending the parser used in this study to handle non-projectivity will lead to state-of-the-art models for the majority of languages." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-118", "text": "----------------------------------" }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-119", "text": "**CONCLUSIONS**" }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-120", "text": "We proposed perceptron update strategies for inexact hypergraph search and experimented with a cube-pruning dependency parser." }, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-121", "text": "Both singlenode max-violation and parallel max-violation update strategies signficantly improved parsing results over the strategy that ignores any invalid udpates caused by inexactness of search." 
}, { "sent_id": "91c1a4ab0347fb8b11ff213a97e864-C001-122", "text": "The update strategies are applicable to any bottom-up parsing problems such as constituent parsing (Huang, 2008) and syntax-based machine translation with online learning (Chiang et al., 2008) ." } ], "y": { "@USE@": { "gold_contexts": [ [ "91c1a4ab0347fb8b11ff213a97e864-C001-5" ], [ "91c1a4ab0347fb8b11ff213a97e864-C001-19" ], [ "91c1a4ab0347fb8b11ff213a97e864-C001-52", "91c1a4ab0347fb8b11ff213a97e864-C001-55" ], [ "91c1a4ab0347fb8b11ff213a97e864-C001-58" ], [ "91c1a4ab0347fb8b11ff213a97e864-C001-59" ] ], "cite_sentences": [ "91c1a4ab0347fb8b11ff213a97e864-C001-5", "91c1a4ab0347fb8b11ff213a97e864-C001-19", "91c1a4ab0347fb8b11ff213a97e864-C001-55", "91c1a4ab0347fb8b11ff213a97e864-C001-58", "91c1a4ab0347fb8b11ff213a97e864-C001-59" ] }, "@BACK@": { "gold_contexts": [ [ "91c1a4ab0347fb8b11ff213a97e864-C001-13" ], [ "91c1a4ab0347fb8b11ff213a97e864-C001-55" ], [ "91c1a4ab0347fb8b11ff213a97e864-C001-63" ] ], "cite_sentences": [ "91c1a4ab0347fb8b11ff213a97e864-C001-13", "91c1a4ab0347fb8b11ff213a97e864-C001-55", "91c1a4ab0347fb8b11ff213a97e864-C001-63" ] }, "@DIF@": { "gold_contexts": [ [ "91c1a4ab0347fb8b11ff213a97e864-C001-84" ] ], "cite_sentences": [ "91c1a4ab0347fb8b11ff213a97e864-C001-84" ] } } }, "ABC_bdd7a4dabf8a8d7c0a2b638eb6eb72_4": { "x": [ { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-2", "text": "Large-scale Knowledge Bases (such as NELL, Yago, Freebase, etc.) are often sparse, i.e., a large number of valid relations between existing entities are missing." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-3", "text": "Recent research have addressed this problem by augmenting the KB graph with additional edges mined from a large text corpus while keeping the set of nodes fixed, and then using the Path Ranking Algorithm (PRA) to perform KB inference over this augmented graph." 
}, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-4", "text": "In this paper, we extend this line of work by augmenting the KB graph not only with edges, but also with bridging entities, where both the edges and bridging entities are mined from a 500 million web text corpus." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-5", "text": "Through experiments on real-world datasets, we demonstrate the value of bridging entities in improving the performance and running time of PRA in the KB inference task." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-6", "text": "----------------------------------" }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-8", "text": "Large-scale knowledge bases (KB) like Freebase (Bollacker et al., 2008) , Yago (Suchanek et al., 2007) , NELL (Mitchell et al., 2015) can be useful in a variety of applications like natural language question answering, semantic search engines, etc." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-9", "text": "These knowledge bases consist of millions of real world entities and relationships between them which are stored in the form of a directed graph where links represent relations and nodes represent the entities." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-10", "text": "Although such KBs contain millions of entities, they are still very sparse, i.e., they are missing a large number of relations between existing entities (West et al., 2014) ." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-11", "text": "Performing inference over the knowledge graph for predicting relations between two entities is one way of densifying the KB graph." 
}, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-12", "text": "For example, Figure 1 : Example showing how addition of the bridging entity, Brian McCain, and the two edges incident on it can help the PRA algorithm (Lao and Cohen, 2010) to infer the initially missing relation instance teamPlaysSport(Yankees, BaseBall)." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-13", "text": "The original KB graph consisted only of two nodes, Yankees and Baseball, and no edges." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-14", "text": "from (Germany, playsinTournament, FIFA) and (FIFA, tournamentofSport, Soccer), we can infer (Germany, playsSport, Soccer)." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-15", "text": "The Path Ranking Algorithm (PRA) (Lao and Cohen, 2010) , (Lao et al., 2011) performs such an inference by learning inference rules over the knowledge graph." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-16", "text": "If the knowledge graph is sparse, i.e., if there are a very few or no paths between source and target entities, then PRA is unable to predict the existence of a relation." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-17", "text": "To address this shortcoming, (Lao et al., 2012) augmented the knowledge graph with paths obtained from an external corpus." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-18", "text": "The added paths consisted of unlexicalized dependency labels obtained from a dependency parsed external corpus." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-19", "text": "To improve the expressivity of the added paths, instead of the unlexicalized labels, (Gardner et al., 2013) augmented the KB graph with verbs (surface relations) from a corpus containing over 600 million Subject-Verb-Object (SVO) triples." 
}, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-20", "text": "These verbs act as edges that connect previously unconnected entities thereby increasing the connectivity of the KB graph which can potentially improve PRA performance." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-21", "text": "However, na\u00efvely adding these edges increases the feature sparsity which degrades the discriminative ability of the logistic regression classifier used in PRA." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-22", "text": "This can be addressed by adding latent relations obtained by clustering the surface relations, instead of directly adding the surface relations." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-23", "text": "This reduces feature sparsity and has been shown to improve PRA inference (Gardner et al., 2013) , (Gardner et al., 2014) ." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-24", "text": "In this article we propose a scheme for augmenting the KB using paths obtained by mining noun phrases that connect two SVO triples from an external corpus." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-25", "text": "We term these noun phrases as bridging entities since they bridge two KB relations to form a path." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-26", "text": "This is different from the scheme in (Gardner et al., 2013) and (Gardner et al., 2014) , which adds edges between KB nodes by mining surface relations from an external corpus." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-27", "text": "We search for such bridging entities in the corpus by performing a limited depth DFS (depth first search) on the corpus graph in an on-demand fashion." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-28", "text": "We term this procedure as On-Demand Augmentation (ODA), because the search can be performed during test time in an on-demand manner." 
}, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-29", "text": "In contrast, the previous approaches of adding edges or embeddings to the KB (Gardner et al., 2013) , and vector space random walk PRA (Gardner et al., 2014) are batch procedures." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-30", "text": "As we shall see in Section 4, due to a limited search space, on-demand augmentation is much faster compared to algorithms in (Gardner et al., 2013; Gardner et al., 2014) ." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-31", "text": "Furthermore, since edges are not added blindly, on-demand augmentation does not increase feature sparsity which is responsible for performance degradation." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-32", "text": "Our experiments suggest that ODA provides better performance than (Gardner et al., 2013) and nearly the same prediction performance as provided by (Gardner et al., 2014) , but in both cases with the added advantage of faster running time and greater flexibility due to its online and on-demand nature." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-33", "text": "The code along with the results can be obtained at https://github.com/malllabiisc/pra-oda." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-34", "text": "----------------------------------" }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-35", "text": "**RELATED WORK**" }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-36", "text": "Using surface level relations and noun phrases for extracting meaningful relational facts is not a new idea (Hearst, 1992) , (Brin, 1999) , (Etzioni et al., 2004) ." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-37", "text": "However, none of them make use of Knowledge Bases for improving information extraction." 
}, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-38", "text": "The Path Ranking Algorithm (PRA) first proposed in (Lao and Cohen, 2010) was used for performing inference over a KB in (Lao et al., 2011) ." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-39", "text": "It was extended by (Lao et al., 2012) , to improve the inference by augmenting the KB with syntactic information obtained from a dependency parsed corpus." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-40", "text": "Augmenting the KB for improving PRA inference using surface relations mined from an external corpus and using latent edge labels obtained by performing PCA on the surface relations was explored in (Gardner et al., 2013) ." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-41", "text": "Instead of hard mapping of surface relations to latent embeddings, (Gardner et al., 2014 ) perform a 'soft' mapping using vector space random walks." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-42", "text": "This allows the random walker to traverse an edge semantically similar to the current edge type more frequently than other edges." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-43", "text": "Although, like others, we too use an external corpus to augment the KB, the crucial difference in our approach is that apart from adding surface relations, we also add bridging entities that enable us to create new paths in the KB." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-44", "text": "Furthermore, the procedure is targeted so that only paths that play a part in inferring the relations that are of interest are added." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-45", "text": "Thus, the number of paths added in this manner is much lower than the number of surface relations added using the procedure in (Gardner et al., 2013) ." 
}, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-46", "text": "As we shall see in Section 4, this results in a more effective algorithm with faster runtime." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-47", "text": "We first present a brief overview of the Path Ranking Algorithm (PRA) (Lao and Cohen, 2010) ." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-48", "text": "The PRA uses paths as features for a logistic regression classifier which predicts if the given relation exists between a pair of entities." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-49", "text": "For a given pair of entities s and t, the path type connecting s to t form the feature vector." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-50", "text": "A path types \u03c0 is an ordered set of relations." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-51", "text": "Paths with the same ordered relations but different intermediate or terminal entities belong to the same path type." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-52", "text": "For example," }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-53", "text": "The value of a feature, is taken to be P (s \u2192 t; \u03c0), where P (s \u2192 t; \u03c0) is the probability of reaching t from s by traversing paths of type \u03c0." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-54", "text": "PRA approximates these probabilities by running a random walk (RW) on the KB graph." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-55", "text": "Let F = {\u03c0 1 , \u03c0 2 , ..., \u03c0 k } be the set of all path types." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-56", "text": "For predicting the existence of relation r between entities s and t, the logistic regression classifier outputs a score which is a measure of the confidence that r exists between s and t. It does so by first assigning weights to the features in the training phase." 
}, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-57", "text": "The score is given by" }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-58", "text": "where \u03b8 r \u03c0 is the weight learned by the logistic regression classifier during training specially for relation r and path type \u03c0." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-59", "text": "During the test phase, since targets are not available, the PRA gathers candidate targets by performing a random walk and then computes feature vectors and the score." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-60", "text": "----------------------------------" }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-61", "text": "**PRA-SVO AND PRA-VS**" }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-62", "text": "PRA-SVO and PRA-VS are the systems proposed in (Gardner et al., 2013) and (Gardner et al., 2014) respectively, where the KB graph is augmented with edges mined from a large subject-verb-object (SVO) triple corpus." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-63", "text": "In these two systems, only new edges are added over the fixed set of nodes, and the augmentation happens in a batch, offline setting." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-64", "text": "In contrast, PRA-ODA, the method proposed in the paper, also expands the set of nodes through bridging entities, and performs the augmentation in an on-demand manner." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-65", "text": "----------------------------------" }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-66", "text": "**PRA ON-DEMAND AUGMENTATION**" }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-67", "text": "(PRA-ODA) Training: Let s and t be any two KB entities and let s (n) and t (n) be their corresponding noun phrase representations or aliases." 
}, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-68", "text": "We search for bridging entities x 1 , x 2 , ..x n by performing limited depth first search (DFS) starting with s n such that we obtain a path s" }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-69", "text": "\u2212\u2192 t, where v i are verbs present in the corpus graph." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-70", "text": "This is done for all n \u2264 d max \u2212 1, where d max is the maximum depth of DFS." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-71", "text": "We add an 'ALIAS' edge between the KB entity and its noun phase representation." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-72", "text": "The usefulness of bridging entities is illustrated in Fig. 1 ." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-73", "text": "We mine bridging entities from a corpus containing over 600 million SVO triples which were obtained from the ClueWeb09 corpus (Callan et al., 2009 ) parsed using the MALT parser (Nivre et al., 2007) ." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-74", "text": "We use Mongo DB to store the triples as an adjacency list." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-75", "text": "During training time, for any relation that is being inferred, both the source and its corresponding target entities are known." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-76", "text": "A limited depth DFS is performed for all depths less then d max on the SVO graph with the aliases of subject entity acting as the starting points." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-77", "text": "Such aliases are available for the NELL and Freebase knowledge bases." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-78", "text": "The DFS is said to discover a path if the terminating entity of the path matches any alias of the target entity." 
}, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-79", "text": "We choose to use aliases to perform string match, since it is easy to change the softness of the match by simply adding more aliases." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-80", "text": "This is done for all training sourcetarget pairs." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-81", "text": "A few examples of added paths are shown in Table 1 ." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-82", "text": "The SVO graph is noisy since it is obtained by parsing the ClueWeb corpus which was obtained by scraping the web." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-83", "text": "To reduce noise, we add the top K most frequent discovered SVO path types, where K is a tunable parameter." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-84", "text": "By SVO path type we refer to a set of ordered verbs mined from the SVO corpus." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-85", "text": "There is a possibility that the bridging entities, extracted from the corpus, may be present in the KB." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-86", "text": "If the bridging entity matches any alias, then it is treated as an alias to an existing KB entity." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-87", "text": "If not, then the bridging entity is added to the KB as a new entity." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-88", "text": "To avoid overfitting we add negative data to the training set." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-89", "text": "Furthermore, only high quality expressive bridging entities result in meaningful and discriminative paths." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-90", "text": "Although the quality of bridging entities depend on the corpus, low quality bridging entities can be filtered out by adding negative training data." 
}, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-91", "text": "Low quality bridging entities connect source target pairs from both positive and negative training sets, and hence are eliminated by the sparse logistic regression classifier." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-92", "text": "The negative dataset is generated using the closed world assumption by performing a random walk." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-93", "text": "After augmenting the KB, we run the training phase of the PRA algorithm to obtain the feature (path) weights computed by the logistic regression Table 2 : Comparison of Mean Reciprocal Rank (MRR) metric for 10 relations from NELL (higher is better)." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-94", "text": "PRA-SVO, PRA-VS are the systems proposed in (Gardner et al., 2013; Gardner et al., 2014) ." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-95", "text": "PRA-ODA is the approach proposed in this paper." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-96", "text": "Improvements in PRA-ODA over PRA-SVO is statistically significant with p < 0.007, with PRA-SVO as null hypothesis." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-97", "text": "classifier." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-98", "text": "Query Time: The set of target entities corresponding to a source entity and the relation being predicted is not available during query (test) time." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-99", "text": "We use all the entities included in the range of the relation being predicted as candidate target entities." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-100", "text": "For example, if the relation is riverFlowsThroughCity, the candidate target set would include entities in the KB that are cities." 
}, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-101", "text": "The DFS is now performed starting from source entities as during training, but this time only restricting to paths with positive weights learned during training." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-102", "text": "Any path (along with bridging entities) found during this search are added to the KB, and the PRA algorithm is now run over this augmented graph." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-103", "text": "----------------------------------" }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-104", "text": "**EXPERIMENTS**" }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-105", "text": "We used the implementation of PRA provided by the authors of (Gardner et al., 2014) ." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-106", "text": "For our experiments, we used the same 10 NELL relation data as used in (Gardner et al., 2014) ." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-107", "text": "The augmentation resulted in the addition of 1086 paths during training and 1430 paths during test time." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-108", "text": "We split the NELL data into 60% training data, 15 % development data and 25% test data." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-109", "text": "Values for d max , and K, the most frequent paths, were obtained by tuning on a development set for 4 relations (athleteplaysforsport,actorstarredinmovie,citylocatedincountry (Gardner et al., 2013; Gardner et al., 2014) ." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-110", "text": "PRA-ODA is the approach proposed in this paper." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-111", "text": "Between the two top performing systems, i.e., PRA-ODA and PRA-VS, PRA-ODA is faster by a factor of 1.8." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-112", "text": "and journalistwritesforpublication)." 
}, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-113", "text": "The hyperparameter values d max = 2, K = 10 reported the highest MRR and were used for the rest of the relations." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-114", "text": "For the L 1 and L 2 regularization parameters in the logistic regression classifier, we used the same values as used in (Gardner et al., 2013; Gardner et al., 2014) , viz., L 1 = 0.005, and L 2 = 1.0." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-115", "text": "This is because the parameters were reported to be robust, and seemed to work well even when the knowledge base was augmented." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-116", "text": "We compare the results (PRA-ODA) with the PRA algorithm executed on the NELL KB, NELL KB augmented with surface relations (PRA-SVO) (Gardner et al., 2013) and vector space random walk PRA (PRA-VS) (Gardner et al., 2014) ." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-117", "text": "The run times, i.e, the time taken to perform an entire experiment for PRA-SVO and PRA-VS includes the time taken to augment NELL KB with SVO edges." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-118", "text": "The PRA-VS runtime also includes the time taken for generating embeddings to perform the vector space random walk." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-119", "text": "As can be seen from Table 2 and Table 3 , our scheme, PRA-ODA, provides performance equivalent to PRA-VS with faster running time (speed up of 1.8)." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-120", "text": "In addition to the time taken for the full SVO augmentation, PRA-VS takes additional time to generate embeddings (13 minutes) from the added verbs." 
}, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-121", "text": "We note that the batch augmentation in case of PRA-SVO and PRA-VS, and embedding computation in case of PRA-VS are all specific to the relations in the evaluation set, and hence can't be ignored as a one-time offline cost." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-122", "text": "In other words, these costs are likely to increase as more relations (and their instances) are included during training and testing." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-123", "text": "Runtime gains with PRA-ODA are likely to be even more pronounced in such settings." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-124", "text": "An additional advantage of the proposed algorithm is that it can also be run on the top of any PRA based algorithm such as the PRA-SVO and PRA-VS." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-125", "text": "----------------------------------" }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-126", "text": "**CONCLUSION**" }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-127", "text": "In this paper, we investigated the usefulness of adding paths to a Knowledge Base for improving its connectivity by mining bridging entities from an external corpus." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-128", "text": "While previous KB augmentation methods focused only on augmentation using mined surface verbs while keeping the node set fixed, we extended these approaches by also adding bridging entities in an online fashion." }, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-129", "text": "We used a large corpus of 500 million web text corpus to mine these additional edges and bridging entities." 
}, { "sent_id": "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-130", "text": "Through experiments on real-world datasets, we demonstrate that the proposed approach is not only comparable or better than other state-of-theart baselines, but more importantly provides faster overall runtime compared with the alternatives." } ], "y": { "@BACK@": { "gold_contexts": [ [ "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-23" ], [ "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-29", "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-30" ], [ "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-41" ], [ "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-62" ], [ "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-94" ], [ "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-118" ], [ "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-120" ] ], "cite_sentences": [ "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-23", "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-29", "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-30", "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-41", "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-62", "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-94", "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-118", "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-120" ] }, "@DIF@": { "gold_contexts": [ [ "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-26" ], [ "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-28", "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-29", "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-30" ], [ "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-32" ], [ "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-41", "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-43" ], [ "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-61", "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-62", "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-64" ], [ "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-111" ] ], "cite_sentences": [ "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-26", "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-29", "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-30", "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-32", "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-41", "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-61", "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-62", "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-111" ] }, "@USE@": { "gold_contexts": [ [ 
"bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-105" ], [ "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-106" ], [ "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-109" ], [ "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-114" ], [ "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-116" ], [ "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-117" ], [ "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-121" ], [ "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-124" ] ], "cite_sentences": [ "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-105", "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-106", "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-109", "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-114", "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-116", "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-117", "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-121", "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-124" ] }, "@SIM@": { "gold_contexts": [ [ "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-118", "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-119" ] ], "cite_sentences": [ "bdd7a4dabf8a8d7c0a2b638eb6eb72-C001-118" ] } } }, "ABC_6092234b23f2620c356c2e417c2ce8_4": { "x": [ { "sent_id": "6092234b23f2620c356c2e417c2ce8-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "6092234b23f2620c356c2e417c2ce8-C001-2", "text": "We extend our previous work on constituency parsing (Kitaev and Klein, 2018) by incorporating pre-training for ten additional languages, and compare the benefits of no pre-training, ELMo , and BERT (Devlin et al., 2018)." }, { "sent_id": "6092234b23f2620c356c2e417c2ce8-C001-3", "text": "Pre-training is effective across all languages evaluated, and BERT outperforms ELMo in large part due to the benefits of increased model capacity." }, { "sent_id": "6092234b23f2620c356c2e417c2ce8-C001-4", "text": "Our parser obtains new state-of-the-art results for 11 languages, including English (95.8 F1) and Chinese (91.8 F1)." 
}, { "sent_id": "6092234b23f2620c356c2e417c2ce8-C001-5", "text": "----------------------------------" }, { "sent_id": "6092234b23f2620c356c2e417c2ce8-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "6092234b23f2620c356c2e417c2ce8-C001-7", "text": "There has recently been rapid progress in developing contextual word representations that improve accuracy across a range of natural language tasks Howard and Ruder, 2018; Radford et al., 2018; Devlin et al., 2018) ." }, { "sent_id": "6092234b23f2620c356c2e417c2ce8-C001-8", "text": "In our earlier work (Kitaev and Klein, 2018) , we showed that such representations are helpful for constituency parsing." }, { "sent_id": "6092234b23f2620c356c2e417c2ce8-C001-9", "text": "However, these results only considered the LSTM-based ELMo representations , and only for the English language." }, { "sent_id": "6092234b23f2620c356c2e417c2ce8-C001-10", "text": "We now extend this work to show that using only self-attention also works by substituting BERT (Devlin et al., 2018) ." }, { "sent_id": "6092234b23f2620c356c2e417c2ce8-C001-11", "text": "We further demonstrate that pre-training and self-attention are effective across languages by applying our parsing architecture to ten additional languages." }, { "sent_id": "6092234b23f2620c356c2e417c2ce8-C001-12", "text": "Our parser code and trained models for 11 languages are publicly available." 
}, { "sent_id": "6092234b23f2620c356c2e417c2ce8-C001-13", "text": "1" }, { "sent_id": "6092234b23f2620c356c2e417c2ce8-C001-14", "text": "----------------------------------" }, { "sent_id": "6092234b23f2620c356c2e417c2ce8-C001-15", "text": "**MODEL**" }, { "sent_id": "6092234b23f2620c356c2e417c2ce8-C001-16", "text": "Our parser as described in Kitaev and Klein (2018) accepts as input a sequence of vectors corresponding to words in a sentence, transforms these repre-1 https://github.com/nikitakit/self-attentive-parser sentations using one or more self-attention layers, and finally uses these representations to output a parse tree." }, { "sent_id": "6092234b23f2620c356c2e417c2ce8-C001-17", "text": "We incorporate BERT by taking the token representations from the last layer of a BERT model and projecting them to 512 dimensions (the default size used by our parser) using a learned projection matrix." }, { "sent_id": "6092234b23f2620c356c2e417c2ce8-C001-18", "text": "While our parser operates on vectors aligned to words in a sentence, BERT associates vectors to sub-word units based on Word-Piece tokenization (Wu et al., 2016) ." }, { "sent_id": "6092234b23f2620c356c2e417c2ce8-C001-19", "text": "We bridge this difference by only retaining the BERT vectors corresponding to the last sub-word unit for each word in the sentence." }, { "sent_id": "6092234b23f2620c356c2e417c2ce8-C001-20", "text": "We briefly experimented with other alternatives, such as using only the first sub-word instead, but did not find that this choice had a substantial effect on English parsing accuracy." }, { "sent_id": "6092234b23f2620c356c2e417c2ce8-C001-21", "text": "The fact that additional layers are applied to the output of BERT -which itself uses a selfattentive architecture -may at first seem redundant, but there are important differences between these two portions of the architecture." 
}, { "sent_id": "6092234b23f2620c356c2e417c2ce8-C001-22", "text": "The extra layers on top of BERT use word-based tokenization instead of sub-words, apply the factored version of self-attention proposed in Kitaev and Klein (2018) , and are randomly-initialized instead of being pre-trained." }, { "sent_id": "6092234b23f2620c356c2e417c2ce8-C001-23", "text": "We found that omitting these additional layers and using the BERT vectors directly hurt parsing accuracies." }, { "sent_id": "6092234b23f2620c356c2e417c2ce8-C001-24", "text": "We also extend the parser to predict part-ofspeech tags in addition to constituent labels, a feature we include based on feedback from users of our previous parser." }, { "sent_id": "6092234b23f2620c356c2e417c2ce8-C001-25", "text": "Tags are predicted using a small feed-forward network (with only one ReLU nonlinearity) after the final layer of self-attention." }, { "sent_id": "6092234b23f2620c356c2e417c2ce8-C001-26", "text": "This differs slightly from Joshi et al. (2018) , where tags are predicted based on span representations instead." }, { "sent_id": "6092234b23f2620c356c2e417c2ce8-C001-27", "text": "The tagging head is trained jointly with the parser by adding an auxiliary softmax crossentropy loss, averaged over all words present in a given batch." }, { "sent_id": "6092234b23f2620c356c2e417c2ce8-C001-28", "text": "(2018) We train our parser with a learning rate of 5 \u00d7 10 \u22125 and batch size 32, where BERT parameters are fine-tuned as part of training." }, { "sent_id": "6092234b23f2620c356c2e417c2ce8-C001-29", "text": "All other hyperparameters are unchanged from Kitaev and Klein (2018) and Devlin et al. (2018) ." 
}, { "sent_id": "6092234b23f2620c356c2e417c2ce8-C001-30", "text": "----------------------------------" }, { "sent_id": "6092234b23f2620c356c2e417c2ce8-C001-31", "text": "**COMPARISON OF PRE-TRAINING METHODS**" }, { "sent_id": "6092234b23f2620c356c2e417c2ce8-C001-32", "text": "In this section, we compare using BERT, ELMo, and training a parser from scratch on treebank data alone." }, { "sent_id": "6092234b23f2620c356c2e417c2ce8-C001-33", "text": "Our comparison of the different methods for English is shown in Table 1 ." }, { "sent_id": "6092234b23f2620c356c2e417c2ce8-C001-34", "text": "BERT with the \"base\" hyperparameter settings (12 layers, 12 attention heads per layer, and 768dimensional hidden vectors) performs comparably or slightly better than ELMo (95.32 vs. 95.21 F1), while a larger version of BERT (24 layers, 16 attention heads per layer, and 1024-dimensional hidden vectors) leads to better parsing accuracy (95.70 F1)." }, { "sent_id": "6092234b23f2620c356c2e417c2ce8-C001-35", "text": "These results show that both the LSTM-based architecture of ELMo and the selfattentive architecture of BERT are viable for parsing, and that pre-training benefits from having a high model capacity." }, { "sent_id": "6092234b23f2620c356c2e417c2ce8-C001-36", "text": "We did not observe a sizable difference between using a version of BERT that converts all text to lowercase and a version of BERT that retains case information." }, { "sent_id": "6092234b23f2620c356c2e417c2ce8-C001-37", "text": "We found that pre-training on only English outperformed multilingual pre-training given the same model capacity, but the relative decrease in error rate was less than 6% (95.24 vs. 94.97 F1)." }, { "sent_id": "6092234b23f2620c356c2e417c2ce8-C001-38", "text": "This is a promising result because it supports the idea of using joint multilingual pre-training as a way to provide support for many languages in a resource-efficient manner." 
}, { "sent_id": "6092234b23f2620c356c2e417c2ce8-C001-39", "text": "We also conduct a control experiment to try to tease apart the benefits of the BERT architecture and training setup from the effects of the data used for pre-training." }, { "sent_id": "6092234b23f2620c356c2e417c2ce8-C001-40", "text": "We originally attempted to use a randomly-initialized BERT model, but found that it would not train effectively within the range of hyperparameters we tried." }, { "sent_id": "6092234b23f2620c356c2e417c2ce8-C001-41", "text": "2 Instead, we trained an English parser using a version of BERT that was pre-trained on the Chinese Wikipedia." }, { "sent_id": "6092234b23f2620c356c2e417c2ce8-C001-42", "text": "Neither the pre-training domain nor the subword vocabulary used are a good fit for the target task; however, English does occur sporadically throughout the Chinese Wikipedia, and the model can represent losslessly English text -all English letters are present in its subword vocabulary, so in the worst case it will decompose an English word into its individual letters." }, { "sent_id": "6092234b23f2620c356c2e417c2ce8-C001-43", "text": "We found that this model achieved comparable performance to a version of our parser that was designed to be trained on treebank data alone (93.57 vs. 93.61 F1)." }, { "sent_id": "6092234b23f2620c356c2e417c2ce8-C001-44", "text": "This result suggests that even when the pre-training data is a poor fit for the target domain, fine-tuning can still produce results comparable to purely supervised training starting with randomly-initialized parameters." 
}, { "sent_id": "6092234b23f2620c356c2e417c2ce8-C001-45", "text": "----------------------------------" }, { "sent_id": "6092234b23f2620c356c2e417c2ce8-C001-46", "text": "**RESULTS**" }, { "sent_id": "6092234b23f2620c356c2e417c2ce8-C001-47", "text": "We train and evaluate our model on treebanks for eleven languages: English (see Table 2 ), the nine languages represented in the SPMRL 2013/2014 shared tasks (Seddah et al., 2013 ) (see Table 3 ), and Chinese (see Table 4 )." }, { "sent_id": "6092234b23f2620c356c2e417c2ce8-C001-48", "text": "For each of these languages, our parser obtains a higher F1 score than any past systems we are aware of." }, { "sent_id": "6092234b23f2620c356c2e417c2ce8-C001-49", "text": "The English and Chinese parsers use monolingual pre-training, while the remaining parsers incorporate a version of BERT pre-trained jointly on 104 languages." }, { "sent_id": "6092234b23f2620c356c2e417c2ce8-C001-50", "text": "----------------------------------" }, { "sent_id": "6092234b23f2620c356c2e417c2ce8-C001-51", "text": "**CONCLUSION**" }, { "sent_id": "6092234b23f2620c356c2e417c2ce8-C001-52", "text": "The remarkable effectiveness of unsupervised pretraining of vector representations of language suggests that future advances in this area can continue the ability of machine learning methods to model syntax (as well as other aspects of language.) At the same time, syntactic annotations remain a useful tool due to their interpretability, and we hope that our parsing software may be of use to others." 
} ], "y": { "@EXT@": { "gold_contexts": [ [ "6092234b23f2620c356c2e417c2ce8-C001-2" ], [ "6092234b23f2620c356c2e417c2ce8-C001-10" ] ], "cite_sentences": [ "6092234b23f2620c356c2e417c2ce8-C001-2", "6092234b23f2620c356c2e417c2ce8-C001-10" ] }, "@BACK@": { "gold_contexts": [ [ "6092234b23f2620c356c2e417c2ce8-C001-8" ], [ "6092234b23f2620c356c2e417c2ce8-C001-16" ] ], "cite_sentences": [ "6092234b23f2620c356c2e417c2ce8-C001-8", "6092234b23f2620c356c2e417c2ce8-C001-16" ] }, "@MOT@": { "gold_contexts": [ [ "6092234b23f2620c356c2e417c2ce8-C001-9" ] ], "cite_sentences": [ "6092234b23f2620c356c2e417c2ce8-C001-9" ] }, "@USE@": { "gold_contexts": [ [ "6092234b23f2620c356c2e417c2ce8-C001-22" ] ], "cite_sentences": [ "6092234b23f2620c356c2e417c2ce8-C001-22" ] }, "@SIM@": { "gold_contexts": [ [ "6092234b23f2620c356c2e417c2ce8-C001-29" ] ], "cite_sentences": [ "6092234b23f2620c356c2e417c2ce8-C001-29" ] } } }, "ABC_a127218cca5653f1700c0de6c8318a_4": { "x": [ { "sent_id": "a127218cca5653f1700c0de6c8318a-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "a127218cca5653f1700c0de6c8318a-C001-2", "text": "Concordancers are interactive software that searches for the input word and displays the list of its usages in a corpus." }, { "sent_id": "a127218cca5653f1700c0de6c8318a-C001-3", "text": "They have been widely used by language learners and educators to analyze word usages." }, { "sent_id": "a127218cca5653f1700c0de6c8318a-C001-4", "text": "Because naively listing all usages of the word overwhelms users, determining how to summarize the list is important for usability." }, { "sent_id": "a127218cca5653f1700c0de6c8318a-C001-5", "text": "Previous studies summarized the list by using the surrounding word patterns and showed their frequency; however, such a naive method counts substantially the same usages, such as \"the book\" and \"a book,\" separately; hence, such a method is not very informative to learners." 
}, { "sent_id": "a127218cca5653f1700c0de6c8318a-C001-6", "text": "Here, a novel approach for summarizing the list is proposed." }, { "sent_id": "a127218cca5653f1700c0de6c8318a-C001-7", "text": "According to the user's input word, the proposed system semantically visualizes each usage of the word using contextualized word embeddings interactively." }, { "sent_id": "a127218cca5653f1700c0de6c8318a-C001-8", "text": "It is shown that the system responds quickly with intuitive use cases." }, { "sent_id": "a127218cca5653f1700c0de6c8318a-C001-9", "text": "----------------------------------" }, { "sent_id": "a127218cca5653f1700c0de6c8318a-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "a127218cca5653f1700c0de6c8318a-C001-11", "text": "Concordancers are interactive software tools that search and display a usage list of the input words or word patterns within a corpus." }, { "sent_id": "a127218cca5653f1700c0de6c8318a-C001-12", "text": "The tools have been widely used in corpus linguistics and computer-aided language education to assist language learners and educators analyze word usages within a corpus (Hockey and Martin, 1987) ." }, { "sent_id": "a127218cca5653f1700c0de6c8318a-C001-13", "text": "In Natural Language Processing (NLP), studies have built sophisticated concordancers to support second language writing and translators in searching bilingual sentence-aligned corpus (Wu et al., 2004; Jian et al., 2004; Lux-Pogodalla et al., 2010) ." }, { "sent_id": "a127218cca5653f1700c0de6c8318a-C001-14", "text": "However, the information that conventional concordancers can provide for analyses of each usage is limited to the frequency of surrounding context patterns, parts of speech, and so on." }, { "sent_id": "a127218cca5653f1700c0de6c8318a-C001-15", "text": "The words that second language learners can search to learn their usages tend to be frequent." 
}, { "sent_id": "a127218cca5653f1700c0de6c8318a-C001-16", "text": "Therefore, a more sophisticated method to summarize many word usages in a large corpus for concordancers is desirable." }, { "sent_id": "a127218cca5653f1700c0de6c8318a-C001-17", "text": "Recently, contextualized word embeddings such as (Devlin et al., 2019) were proposed in NLP to capture the context of each word usage in vectors and to model the semantic distances between the usages using contexts as a clue." }, { "sent_id": "a127218cca5653f1700c0de6c8318a-C001-18", "text": "Unlike previous studies (Liu et al., 2017; Smilkov et al., 2016) that visualized different words using word embeddings, in this paper, we introduce a novel system intuitively helpful for concordancer users to visualize different usages of a word of interest." }, { "sent_id": "a127218cca5653f1700c0de6c8318a-C001-19", "text": "2 System Overview and Use Cases Fig. 1 shows our system layout." }, { "sent_id": "a127218cca5653f1700c0de6c8318a-C001-20", "text": "Once a user provides a word to the system, it automatically searches the word in the corpus in a similar way to typical concordancers." }, { "sent_id": "a127218cca5653f1700c0de6c8318a-C001-21", "text": "Unlike concordancers, our system has a database that stores contextualized word embeddings for each usage or occurrence of each word in the corpus." }, { "sent_id": "a127218cca5653f1700c0de6c8318a-C001-22", "text": "We used half a million sentences from the British National Corpus (BNC Consortium, 2007) as the raw corpus." }, { "sent_id": "a127218cca5653f1700c0de6c8318a-C001-23", "text": "We built the database by applying the bert-base-uncased model of the PyTorch Pretrained the BERT project 1 (Devlin et al., 2019) to the corpus." }, { "sent_id": "a127218cca5653f1700c0de6c8318a-C001-24", "text": "We used the last layer, which was more distant from the surface input, as the embeddings." 
}, { "sent_id": "a127218cca5653f1700c0de6c8318a-C001-25", "text": "The size of the database is roughly 200MB per thousand sentences." }, { "sent_id": "a127218cca5653f1700c0de6c8318a-C001-26", "text": "Our system visualizes these searched contextualized word embedding vectors." }, { "sent_id": "a127218cca5653f1700c0de6c8318a-C001-27", "text": "We visualize the contextualized word embedding vectors for the provided word by projecting these vectors into a twodimensional space." }, { "sent_id": "a127218cca5653f1700c0de6c8318a-C001-28", "text": "To visualize, we used principal component analysis (PCA) because its fast calculation is beneficial for short system response time and better interactivity." }, { "sent_id": "a127218cca5653f1700c0de6c8318a-C001-29", "text": "The number of points in the visualization is also set to a maximum of 100 so that users can easily understand it." }, { "sent_id": "a127218cca5653f1700c0de6c8318a-C001-30", "text": "Fig. 2 shows a use case of searching book." }, { "sent_id": "a127218cca5653f1700c0de6c8318a-C001-31", "text": "2 . Users can directly type the word in the textbox shown at the top of Fig. 2 ." }, { "sent_id": "a127218cca5653f1700c0de6c8318a-C001-32", "text": "Below is the visualization of the usages found and their list." }, { "sent_id": "a127218cca5653f1700c0de6c8318a-C001-33", "text": "Each dark-colored point links to each usage." }, { "sent_id": "a127218cca5653f1700c0de6c8318a-C001-34", "text": "The red lightly-colored point is the probe point: the usages are listed in the nearest order of the probe point." }, { "sent_id": "a127218cca5653f1700c0de6c8318a-C001-35", "text": "No usage is linked to the probe point." }, { "sent_id": "a127218cca5653f1700c0de6c8318a-C001-36", "text": "Users can 1 https://github.com/huggingface/ pytorch-pretrained-BERT 2 Fig. 2 and Fig. 3 shows use cases on a 10, 000-sentence experpt of the BNC corpus to avoid having too many hits hinder the reading of the paper." 
}, { "sent_id": "a127218cca5653f1700c0de6c8318a-C001-37", "text": "freely and interactively drag and move the probe point to change the list of usages below the visualization." }, { "sent_id": "a127218cca5653f1700c0de6c8318a-C001-38", "text": "Each line of the list shows the surrounding words of the usage, followed by the distance between the vectors of the usage and probe point in the two-dimensional visualization." }, { "sent_id": "a127218cca5653f1700c0de6c8318a-C001-39", "text": "In Fig. 2 , the probe point is on the left part of the visualized figure." }, { "sent_id": "a127218cca5653f1700c0de6c8318a-C001-40", "text": "In the first several lines of the list, the system successfully shows the usages of the word book about reading." }, { "sent_id": "a127218cca5653f1700c0de6c8318a-C001-41", "text": "In contrast, Fig. 3 shows the case, in which the users drag the probe point from the left to the right of the visualization." }, { "sent_id": "a127218cca5653f1700c0de6c8318a-C001-42", "text": "The first several lines of the list or the usages nearest the probe point show the usages of the word book about reservation." }, { "sent_id": "a127218cca5653f1700c0de6c8318a-C001-43", "text": "A careful reading of the usage list below shows that the words surrounding the word book vary." }, { "sent_id": "a127218cca5653f1700c0de6c8318a-C001-44", "text": "Thus, merely focusing on the surrounding words, such as \"to\" before book, cannot distinguish the usages of book about reservation from the usages of book about reading." }, { "sent_id": "a127218cca5653f1700c0de6c8318a-C001-45", "text": "----------------------------------" }, { "sent_id": "a127218cca5653f1700c0de6c8318a-C001-46", "text": "**DEMO OUTLINE**" }, { "sent_id": "a127218cca5653f1700c0de6c8318a-C001-47", "text": "We are expecting language learners to be users." 
}, { "sent_id": "a127218cca5653f1700c0de6c8318a-C001-48", "text": "We are planning to make our software openly available under an open-source license after we evaluate our system in more detail 3 ." }, { "sent_id": "a127218cca5653f1700c0de6c8318a-C001-49", "text": "As for the interoperability of the software, the software is built on the Jupyter notebook 4 using ipywidgets 5 ; hence it is accessible online via browsers without the need to install it to each learner's terminal computer." }, { "sent_id": "a127218cca5653f1700c0de6c8318a-C001-50", "text": "----------------------------------" }, { "sent_id": "a127218cca5653f1700c0de6c8318a-C001-51", "text": "**CONCLUSIONS**" }, { "sent_id": "a127218cca5653f1700c0de6c8318a-C001-52", "text": "We proposed a novel concordancer that can search the usages of a word and visualize the usages using contextualized word embeddings." }, { "sent_id": "a127218cca5653f1700c0de6c8318a-C001-53", "text": "Through use cases, we illustrated that a learner can understand different types of usage of book, which could not be captured only by surface information of the surrounding words." }, { "sent_id": "a127218cca5653f1700c0de6c8318a-C001-54", "text": "As future work, we will evaluate our system on more practical use cases with many language learners, especially from the perspective of support systems for second language vocabulary learning and reading (Ehara et al., 2012 (Ehara et al., , 2013 (Ehara et al., , 2014 ." }, { "sent_id": "a127218cca5653f1700c0de6c8318a-C001-55", "text": "----------------------------------" }, { "sent_id": "a127218cca5653f1700c0de6c8318a-C001-56", "text": "**ACKNOWLEDGMENTS**" }, { "sent_id": "a127218cca5653f1700c0de6c8318a-C001-57", "text": "This work was supported by JST, ACT-I Grant Number JPMJPR18U8, Japan." 
}, { "sent_id": "a127218cca5653f1700c0de6c8318a-C001-58", "text": "We used the AI Bridging Cloud Infrastructure (ABCI) by the National Institute of Advanced Industrial Science and Technology (AIST) for computational resources." }, { "sent_id": "a127218cca5653f1700c0de6c8318a-C001-59", "text": "We thank anonymous reviewers for their insightful and constructive comments." } ], "y": { "@BACK@": { "gold_contexts": [ [ "a127218cca5653f1700c0de6c8318a-C001-17" ] ], "cite_sentences": [ "a127218cca5653f1700c0de6c8318a-C001-17" ] }, "@USE@": { "gold_contexts": [ [ "a127218cca5653f1700c0de6c8318a-C001-23" ], [ "a127218cca5653f1700c0de6c8318a-C001-36" ] ], "cite_sentences": [ "a127218cca5653f1700c0de6c8318a-C001-23", "a127218cca5653f1700c0de6c8318a-C001-36" ] } } }, "ABC_42854f204c8d2a62822e12f731ad08_4": { "x": [ { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-2", "text": "Transformer with self-attention has achieved great success in the area of nature language processing." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-3", "text": "Recently, there have been a few studies on transformer for end-to-end speech recognition, while its application for hybrid acoustic model is still very limited." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-4", "text": "In this paper, we revisit the transformer-based hybrid acoustic model, and propose a model structure with interleaved self-attention and 1D convolution, which is proven to have faster convergence and higher recognition accuracy." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-5", "text": "We also study several aspects of the transformer model, including the impact of the positional encoding feature, dropout regularization, as well as training with and without time restriction." 
}, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-6", "text": "We show competitive recognition results on the public Librispeech dataset when compared to the Kaldi baseline at both cross entropy training and sequence training stages." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-7", "text": "For reproducible research, we release our source code and recipe within the PyKaldi2 toolbox." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-8", "text": "----------------------------------" }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-10", "text": "Recurrent neural networks (RNNs) with long short-term memory (LSTM) [1] units have defined the state-of-the-art large-scale speech recognition since 2014 [2] ." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-11", "text": "While there have been new types of sequence modeling approaches which are proposed and explored for speech recognition recently, such as sequence-to-sequence with attention [3, 4, 5] , connectionist temporal classification [6] and recurrent neural network transducer [6] , LSTM-RNNs remains the most popular neural network architectures for learning speech feature representations, although convolutional neural networks (CNNs) with different variants have shown competitive recognition results for some tasks." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-12", "text": "The key behind the success of RNNs is their capacity to learn temporal correlations in sequential signals through the recurrent connections when the networks are trained with the back-propagation through time (BPTT) [7] algorithm." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-13", "text": "However, a well-known weakness of RNNs is the gradient vanishing or explosion problem due to BPTT, and the recurrent connections in RNNs make it challenging to parallelize the computations in both training and inference stages." 
}, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-14", "text": "Transformer [8] , which relies solely on self-attention to capture the temporal correlations in sequential signals, is a new type of neural network structure for sequence modeling, which has achieved excellent results in machine translation [8] , language modeling [9] , as well as end-to-end speech recognition [10, 11] ." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-15", "text": "Self-attention is appealing for sequence modeling in the sense that it can learn long-term correlations by one step of attention operation, while for RNNs, it would take multiple steps in the time space for both forward and backward computation, and noise may accumulate during the process." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-16", "text": "CNNs, on the other hand, require multiple layers to capture the correlations between the two features which are very distant in the time space, although dilation that uses large strides can reduce the number of layers that is required." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-17", "text": "While there have been many studies on end-to-end speech recognition using transformers [10, 11, 12, 13, 14] , their applications for hybrid acoustic models are less well understood." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-18", "text": "In this paper, we study the more standard transformer for speech recognition within the hybrid framework, and provide further insight to this model through experiments on the Librispeech public dataset." 
}, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-19", "text": "----------------------------------" }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-20", "text": "**RELATED WORKS**" }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-21", "text": "There have been a few studies on transformers for end-to-end speech recognition, particularly for sequence-to-sequence with attention model [10, 11, 12] , as well as transducer [13] and CTC models [14] ." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-22", "text": "In [10] , the authors compared RNNs with transformers for various speech recognition and synthesis tasks, and obtained competitive or even better results with transformers." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-23", "text": "However, the key challenge for transformer-based sequence-to-sequence model is to perform online streaming speech recognition, as there is no clear boundary for chunk-wise self-attention." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-24", "text": "Transformer based transducer [13] and CTC model [14] do not have the issue for online speech recognition, however, the results presented in the two studies are not competitive compared the hybrid baseline system from Kaldi [15] ." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-25", "text": "The work that is closely related to ours is the time restricted selfattention for hybrid acoustic model [16] , where the self-attention layer is applied to a chunk of the acoustic frames on top of a timedelay neural network (TDNN) or an LSTM layer." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-26", "text": "Recently, Han et al. [17, 18] presented two extensions to this work by using multiple streams of acoustic features to the TDNN layers before the self-attention layer in [18] , or using multi-stride features to the selfattention layers [17] ." 
}, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-27", "text": "In fact, the key idea of the two studies is the same, i.e., sample the features using different strides (or sampling rates), and feed the multiple views of the features to the model, assuming that each view contains complementary acoustic information." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-28", "text": "The only difference is that the multiple views of features are fed into the TDNN layers in [18] , referred to as multi-stream, while they are fed into the self-attention layers directly in [17] , referred to as multi-stride self-attention model." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-29", "text": "In this work, we look at a few other aspects of transformerbased hybrid acoustic models that have not been studied previously." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-30", "text": "In [16, 17, 18] , self-attention is only applied in a chunk of acoustic input restricted by a time window, which makes the transformer model easier to train as it does not need to consider very long term correlations." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-31", "text": "While whole sequence-level self-attention has been applied in sequence-to-sequence models [10] , hybrid model is different in the sense that it is required to maintain strict frame-level alignments before performing predictions, which may be challenging for a transformer with multiple layers of self-attention as it may reorder the sequence." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-32", "text": "Furthermore, lower sampling rates are usually used for transformer-based acoustic models, which makes it easier for sequence-level self-attention as the input sequences are much shorter." 
}, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-33", "text": "We propose an interleaved self-attention and convolution structure for transformer model, with the motivation that convolution can learn local feature correlations and maintain the ordering information of the sequence while self-attention can capture longterm correlations." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-34", "text": "We show that the model can achieved competitive recognition results when trained with or without time-restriction." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-35", "text": "----------------------------------" }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-36", "text": "**TRANSFORMER**" }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-37", "text": "In this section, we review each component in the standard transformer model, and discuss a model structure that is mainly investigated for speech recognition in this work." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-38", "text": "----------------------------------" }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-39", "text": "**SELF-ATTENTION WITH MULTIPLE HEADS**" }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-40", "text": "The attention mechanism in transformer is technically the same as in the original RNN-based attention model [19] ." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-41", "text": "The key difference is that the query used to compute the attention probability is also from the source sequence, instead of using the decoder hidden state as in the RNN-based attention model [19] ." 
}, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-42", "text": "In [8] , the authors used the dot-production attention [20] rather than the conventional additive attention [19] in favor of the low computational complexity, which is rewritten here as:" }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-43", "text": "where Q, K, V are referred to the query, key and value according to [8] ." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-44", "text": "In transformer, both Q and K are from the source sequence, while in the conventional RNN-based attention model [19] , Q is from the decoder hidden state, and K is from the encoder hidden state." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-45", "text": "In Eq (1), d k is the dimension of the model, and it is used to scale the dot-product between Q and K in order to smooth the probability distribution returned by the Softmax operation." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-46", "text": "This is to avoid placing most of the attention probability to a single frame as a result of the dot-product attention, while additive attention does not require such a scaling factor from our experience." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-47", "text": "Another key idea from the transformer paper [8] is the multihead attention mechanism, which performs multiple attention operations in parallel using different model parameters." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-48", "text": "The output from different attention heads are then concatenated and projected before being fed into the next layer." 
}, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-49", "text": "It can be expressed as" }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-50", "text": "where" }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-51", "text": "where n is the number of attention heads, and W Q i , W K i , W V i are parameters for the i-th attention head, and W O is the projection matrix to reduce the dimension of the concatenated attention vector." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-52", "text": "----------------------------------" }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-53", "text": "**POSITIONAL ENCODING**" }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-54", "text": "The attention function (1) itself does not use the information of the order of the sequence V ." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-55", "text": "It is possible that reordering the elements in V can result in the same attention vector after the attention operation, since Eq (1) is only a weighted sum of the elements in V ." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-56", "text": "To encode where i refers to the dimension, and t denotes the time-step." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-57", "text": "We study the same type of the positional encoding for our speech recognition experiments, by adding P E[t] to the corresponding feature vector at the time step t." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-58", "text": "----------------------------------" }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-59", "text": "**INTERLEAVED SELF-ATTENTION AND CONVOLUTION**" }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-60", "text": "For hybrid models, our preliminary experiments show that transformers with multiple self-attention layers alone is hard to train without time restriction, and it can easily diverge after a few epochs." 
}, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-61", "text": "We hypothesize that this is due to the nature of hybrid models, which are expected to predict the frame-level labels." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-62", "text": "Hence, they are more sensitive to any reordering or shifting of the acoustic information in the time space compared with sequence-to-sequence models." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-63", "text": "The positional encoding along may not be able to provide sufficient information to maintain the sequential information in the acoustic sequence (cf. section 4.1)." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-64", "text": "In this paper, we propose a transformer model with interleaved 1D convolution and self-attention, with the motivation that the convolution layer can maintain the sequential information of the input sequence, while at the same time, it can learn the local correlations." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-65", "text": "Self-attention, on the other hand, is expected to capture the global information as the attention is performed as the entire sequence level." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-66", "text": "The model with interleaved convolution and self-attention has the flexibility to tradeoff the model capacity for learning both local and global information from the input sequence." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-67", "text": "Same as the standard transformer [8] , we also insert the feedforward layer after the multi-head attention." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-68", "text": "The final model structure is shown in Figure 1 ." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-69", "text": "It is possible that the feedforward layer is redundant, or its size could be reduced given the 1D convolution layer." 
}, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-70", "text": "We will investigate this aspect in our future work." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-71", "text": "Table 1 shows the number of parameters in each component of the transformer model studied in this paper." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-72", "text": "----------------------------------" }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-73", "text": "**EXPERIMENTS**" }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-74", "text": "We performed the experiments using the publicly available Librispeech corpus [21] , which contains around 960 hours of training data in total." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-75", "text": "To constrain our research scope, we fixed the depth of the transformer models to be 6 layers, and the dimension of the model d k in Eq (1) to be 512." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-76", "text": "The number of hidden units in the feedforward layer is 2048." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-77", "text": "The kernel size for each convolution layer is 3 without stride." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-78", "text": "The total number of parameters is around 26.6 million." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-79", "text": "We did some experiments using a smaller model, and the results are worse than what we reported here." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-80", "text": "We did not train deeper transformer models due to the memory constraint." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-81", "text": "In our experiments, we used a high frame rate as 100 Hz, i.e., extracting one acoustic frame in every 10 millisecond." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-82", "text": "This led to long acoustic sequences." 
}, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-83", "text": "As the memory cost of self-attention is in the order of O(T 2 ), where T is the length of the acoustic sequence, lower frame rate would significantly cut down the memory cost, and enable the training of much deeper transformer models that will studied in our future work." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-84", "text": "In terms of acoustic features, we used 80-dimensional raw logmel filter-banks (FBANKs), and we did not perform any form of speaker-level feature normalization." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-85", "text": "Instead, we only applied the utterance-level mean and variance normalization." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-86", "text": "We used a 4-gram language model for decoding that is released as the part of the corpus, and we used Kaldi [15] to build a Gaussian mixture model (GMM) system for bootstrapping." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-87", "text": "Our transformer acoustic models were trained using the PyKaldi2 toolbox [22] , which is built on top of Kaldi and PyTorch through the PyKaldi [23] wrapper." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-88", "text": "We used the Adam optimizer [24] cross entropy (CE) training, and the same learning rate scheduler as in [8] ." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-89", "text": "For sequence training, we used the vanilla stochastic gradient decent (SGD) with fixed learning rate." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-90", "text": "----------------------------------" }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-91", "text": "**RESULTS OF POSITIONAL ENCODING AND DROPOUT**" }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-92", "text": "We first evaluated the positional encoding discussed in section 3.2 and dropout training for the transformer model." 
}, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-93", "text": "Results are given in Table 2 ." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-94", "text": "Unlike the observations in the area of machine translation, positional encoding did not make a big difference in terms of recognition accuracies for our transformer models." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-95", "text": "One possible reason is that we have used 1D convolution, which has encoded some sequential information in to the model." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-96", "text": "Another possible reason is that the dynamic range of the positional encoding is much smaller compared to the output of the first linear layer in Figure 1 ." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-97", "text": "During the model training, the information in the positional encoding may be ignored." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-98", "text": "We performed a sanity check by removing the positional encoding when evaluating a transformer model trained with positional encoding, and obtained results which are only around 0.1% worse absolute." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-99", "text": "In the future, we shall investigate if it would make a difference after scaling the positional encoding features to the same dynamic range as the acoustic feature after the linear projection layer." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-100", "text": "As for dropout training, it was pointed out in [8] that transformer model for sequence to-sequence ASR may suffer from overfitting easily, and regularization such as dropout is important to address such kind of issue." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-101", "text": "In our experiments, we also observed the overfitting problem, and our models were usually trained for about 8 -10 epochs." 
}, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-102", "text": "However, dropout is not effective for our model, and it only slightly improved the recognition accuracy." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-103", "text": "It may be due to the convolution layers used in our model, as according to our experience, dropout does not work well for CNNs." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-104", "text": "While other regularization approach may be applicable including adding data noise, in our future work, we are more interested in evaluating the model with very large amount of our internal training data to see if the transformer model is still sensitive to the overfitting problem." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-105", "text": "----------------------------------" }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-106", "text": "**1D CONVOLUTION AND LAYER NORMALIZATION**" }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-107", "text": "We then evaluated the impact of the 1D convolution layers in our transformer model by removing all the convolution layers." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-108", "text": "This corresponds to a vanilla transformer as in [8] ." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-109", "text": "We still added the positional encoding feature to the inputs since the sequential information from the convolution layers is no longer available." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-110", "text": "We also have to insert another layer normalization to the output of the transformer before the output linear layer to stabilize the training." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-111", "text": "Otherwise, the training diverges quickly after one or two epochs in our experiments." 
}, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-112", "text": "The results are given in Table 3 , which shows that the recognition errors are much higher when the model does not have the convolution layers." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-113", "text": "Besides, without the convolution layers, the convergence of the model during training also became much slower, which demonstrates that the convolution layers are helpful for both convergence in model training and the recognition accuracy." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-114", "text": "For a fair comparison, we also added another layer normalization to the model with convolutions." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-115", "text": "While it can further speed up the convergence, it does not improve the recognition accuracy further." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-116", "text": "In the future, we shall compare the interleaved combination of self-attention and convolution to the sequential combination of self-attention and TDNN [16, 17, 18] , which is a special type of 1D convolution with subsampling." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-117", "text": "----------------------------------" }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-118", "text": "**WITH OR WITHOUT TIME RESTRICTION**" }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-119", "text": "In the previous experiments, we performed self-attention across the whole input sequence." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-120", "text": "This corresponds to the offline model as the whole sequence need to be visible before the attention operation." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-121", "text": "For online streaming speech recognition, we can simply apply to a time restriction window to the self-attention layer, which is the same as the study in [16] ." 
}, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-122", "text": "However, this is very challenging for sequenceto-sequence model based on transformer as the boundary for each output token is unclear." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-123", "text": "For hybrid models, the latency is controllable by adjusting the size of the time restriction window." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-124", "text": "In our implementation, we still take the whole sequence as the input, but mask out the frames which are outside the time restriction window, i.e., the attention probabilities of those frames are set to be zero during training." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-125", "text": "Follow the convention, we denote [\u2212l, r] as the attention time window, where l and r are the sizes of left and right context." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-126", "text": "When l is \u221e, it means that we perform self-attention up to the start of the acoustic sequence, while r = \u221e means the attention operation spans to the end of the sequence." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-127", "text": "The offline model corresponds to the time window of [\u2212\u221e, \u221e] as in Table 4 . Note that, each 1D convolution layer looks ahead 1 frame since the kernel size is 3 without stride (e.g., [\u22121, 1])." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-128", "text": "The time window in Table 4 is only for the attention operation, and the corresponding latency should have another 6 frames overhead from the convolution layers." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-129", "text": "In addition, the numbers in Table 4 refer to the accumulated attention window." 
}, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-130", "text": "If the time window for each self-attention layer is [\u2212\u221e, 2], then the total accumulated time window for 6 self-attention layers would be [\u2212\u221e , 12] ." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-131", "text": "For faster convergence, we used the transformer models with one more layer normalization as in section 4.2, although the offline results on the dev-other and test-other evaluation sets are slightly worse." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-132", "text": "The table shows that when we limited the future context information for the transformer model, we obtained slightly worse results." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-133", "text": "However, contrary to our expectations, when we increased the right context size, we did not achieve higher recognition accuracy, although the CE losses were significantly reduced, e.g., from \u223c0.78 for the model with the attention window of [\u2212\u221e, 0] to \u223c0.70 for the model with the attention window of [\u2212\u221e, 24] from our setups." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-134", "text": "Although the convolution layers have already looked ahead 6 frame in total, we believe the future context information should still be helpful." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-135", "text": "We hypothesis that this may be due to the multi-head attention." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-136", "text": "If one or more attention heads are placed at around the end of the time window, i.e., focusing on the future context information more than it should, the information from those time steps can help to reduce the training loss, but it may not be able to generalize well." 
}, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-137", "text": "To understand deeper about this results, we will" }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-138", "text": "----------------------------------" }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-139", "text": "**SEQUENCE TRAINING RESULTS**" }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-140", "text": "In Table 5 , we show the sequence training results of the transformer model trained with the maximum mutual information (MMI) criterion." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-141", "text": "We followed the traditional lattice-based sequence training approach, and the lattices were generated on-the-fly as implemented in PyKaldi2." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-142", "text": "We used a CE trained model as the seed model, and then trained the model with MMI using the vanilla SGD optimizer." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-143", "text": "We fixed learning rate as 5 \u00d7 10 \u22125 , and to avoid overfitting, we applied the CE regularization with weight as 0.2." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-144", "text": "The model was converged in less than 1 epoch." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-145", "text": "Table 5 shows that we obtained larger improvements on the noisy test sets (dev-other, test-other)." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-146", "text": "Our results are comparable to the results of the TDNN system in Kaldi 1 , which is a well-tuned hybrid system." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-147", "text": "In fact, the TDNN system applied the speed perturbation [25] for data argumentation and ivector based speaker adaptive training, while in our system, we only used the raw log-mel filterbank features without using any speakerlevel information." 
}, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-148", "text": "From that sense, our results are very competitive." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-149", "text": "Han et al. [17, 18] achieved better results by using multi-stream and multi-stride features on top of the TDNN system, which are also applicable to our system, and will be investigated in the future." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-150", "text": "----------------------------------" }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-151", "text": "**CONCLUSION**" }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-152", "text": "While transformer has been very successful in the area of nature language processing, its application to speech recognition is mostly within the end-to-end architecture." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-153", "text": "We are more interested in transformers for hybrid acoustic models as there is no theoretical issues for online streaming speech recognition." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-154", "text": "In this paper, we have presented a transformer model with interleaved self-attention and convolution for hybrid acoustic modeling, although this structure may be also applicable to end-to-end models." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-155", "text": "We have showed that the convolutional layers can improve the recognition accuracy with faster convergence compared to the model with self-attention layers only." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-156", "text": "We have also investigated several other aspects of the model including the impact of the positional encoding feature, dropout regularization as well training with and without the time restriction." 
}, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-157", "text": "Our work is an addition to the current study of self-attention for hybrid models with a sequential TDNN and self-attention architecture trained with time restriction only." }, { "sent_id": "42854f204c8d2a62822e12f731ad08-C001-158", "text": "For our future works, we shall study training much deeper transformer with low frame rate to get rid of the GPU memory constraint, as well as evaluate the model in the setting with a very large amount of training data." } ], "y": { "@BACK@": { "gold_contexts": [ [ "42854f204c8d2a62822e12f731ad08-C001-2" ], [ "42854f204c8d2a62822e12f731ad08-C001-14" ], [ "42854f204c8d2a62822e12f731ad08-C001-21" ], [ "42854f204c8d2a62822e12f731ad08-C001-22" ], [ "42854f204c8d2a62822e12f731ad08-C001-30" ], [ "42854f204c8d2a62822e12f731ad08-C001-31" ], [ "42854f204c8d2a62822e12f731ad08-C001-32" ], [ "42854f204c8d2a62822e12f731ad08-C001-40" ], [ "42854f204c8d2a62822e12f731ad08-C001-42" ], [ "42854f204c8d2a62822e12f731ad08-C001-43" ], [ "42854f204c8d2a62822e12f731ad08-C001-47" ], [ "42854f204c8d2a62822e12f731ad08-C001-100" ] ], "cite_sentences": [ "42854f204c8d2a62822e12f731ad08-C001-2", "42854f204c8d2a62822e12f731ad08-C001-14", "42854f204c8d2a62822e12f731ad08-C001-21", "42854f204c8d2a62822e12f731ad08-C001-22", "42854f204c8d2a62822e12f731ad08-C001-30", "42854f204c8d2a62822e12f731ad08-C001-31", "42854f204c8d2a62822e12f731ad08-C001-32", "42854f204c8d2a62822e12f731ad08-C001-40", "42854f204c8d2a62822e12f731ad08-C001-42", "42854f204c8d2a62822e12f731ad08-C001-43", "42854f204c8d2a62822e12f731ad08-C001-47", "42854f204c8d2a62822e12f731ad08-C001-100" ] }, "@MOT@": { "gold_contexts": [ [ "42854f204c8d2a62822e12f731ad08-C001-3" ], [ "42854f204c8d2a62822e12f731ad08-C001-17" ], [ "42854f204c8d2a62822e12f731ad08-C001-23" ], [ "42854f204c8d2a62822e12f731ad08-C001-24" ], [ "42854f204c8d2a62822e12f731ad08-C001-122", "42854f204c8d2a62822e12f731ad08-C001-123", "42854f204c8d2a62822e12f731ad08-C001-124" ], 
[ "42854f204c8d2a62822e12f731ad08-C001-152" ] ], "cite_sentences": [ "42854f204c8d2a62822e12f731ad08-C001-3", "42854f204c8d2a62822e12f731ad08-C001-17", "42854f204c8d2a62822e12f731ad08-C001-23", "42854f204c8d2a62822e12f731ad08-C001-24", "42854f204c8d2a62822e12f731ad08-C001-122", "42854f204c8d2a62822e12f731ad08-C001-152" ] }, "@EXT@": { "gold_contexts": [ [ "42854f204c8d2a62822e12f731ad08-C001-4" ], [ "42854f204c8d2a62822e12f731ad08-C001-18" ], [ "42854f204c8d2a62822e12f731ad08-C001-33" ] ], "cite_sentences": [ "42854f204c8d2a62822e12f731ad08-C001-4", "42854f204c8d2a62822e12f731ad08-C001-18", "42854f204c8d2a62822e12f731ad08-C001-33" ] }, "@USE@": { "gold_contexts": [ [ "42854f204c8d2a62822e12f731ad08-C001-5" ], [ "42854f204c8d2a62822e12f731ad08-C001-37" ], [ "42854f204c8d2a62822e12f731ad08-C001-60" ], [ "42854f204c8d2a62822e12f731ad08-C001-64" ], [ "42854f204c8d2a62822e12f731ad08-C001-71" ], [ "42854f204c8d2a62822e12f731ad08-C001-75" ], [ "42854f204c8d2a62822e12f731ad08-C001-87" ], [ "42854f204c8d2a62822e12f731ad08-C001-92" ], [ "42854f204c8d2a62822e12f731ad08-C001-94" ], [ "42854f204c8d2a62822e12f731ad08-C001-98" ], [ "42854f204c8d2a62822e12f731ad08-C001-107" ], [ "42854f204c8d2a62822e12f731ad08-C001-110" ], [ "42854f204c8d2a62822e12f731ad08-C001-131" ], [ "42854f204c8d2a62822e12f731ad08-C001-140" ], [ "42854f204c8d2a62822e12f731ad08-C001-153" ], [ "42854f204c8d2a62822e12f731ad08-C001-154" ] ], "cite_sentences": [ "42854f204c8d2a62822e12f731ad08-C001-5", "42854f204c8d2a62822e12f731ad08-C001-37", "42854f204c8d2a62822e12f731ad08-C001-60", "42854f204c8d2a62822e12f731ad08-C001-64", "42854f204c8d2a62822e12f731ad08-C001-71", "42854f204c8d2a62822e12f731ad08-C001-75", "42854f204c8d2a62822e12f731ad08-C001-87", "42854f204c8d2a62822e12f731ad08-C001-92", "42854f204c8d2a62822e12f731ad08-C001-94", "42854f204c8d2a62822e12f731ad08-C001-98", "42854f204c8d2a62822e12f731ad08-C001-107", "42854f204c8d2a62822e12f731ad08-C001-110", "42854f204c8d2a62822e12f731ad08-C001-131", 
"42854f204c8d2a62822e12f731ad08-C001-140", "42854f204c8d2a62822e12f731ad08-C001-153", "42854f204c8d2a62822e12f731ad08-C001-154" ] }, "@DIF@": { "gold_contexts": [ [ "42854f204c8d2a62822e12f731ad08-C001-44" ], [ "42854f204c8d2a62822e12f731ad08-C001-80" ], [ "42854f204c8d2a62822e12f731ad08-C001-122", "42854f204c8d2a62822e12f731ad08-C001-123", "42854f204c8d2a62822e12f731ad08-C001-124" ], [ "42854f204c8d2a62822e12f731ad08-C001-132" ] ], "cite_sentences": [ "42854f204c8d2a62822e12f731ad08-C001-44", "42854f204c8d2a62822e12f731ad08-C001-80", "42854f204c8d2a62822e12f731ad08-C001-122", "42854f204c8d2a62822e12f731ad08-C001-132" ] }, "@SIM@": { "gold_contexts": [ [ "42854f204c8d2a62822e12f731ad08-C001-67" ], [ "42854f204c8d2a62822e12f731ad08-C001-88" ], [ "42854f204c8d2a62822e12f731ad08-C001-108" ] ], "cite_sentences": [ "42854f204c8d2a62822e12f731ad08-C001-67", "42854f204c8d2a62822e12f731ad08-C001-88", "42854f204c8d2a62822e12f731ad08-C001-108" ] }, "@FUT@": { "gold_contexts": [ [ "42854f204c8d2a62822e12f731ad08-C001-83" ], [ "42854f204c8d2a62822e12f731ad08-C001-158" ] ], "cite_sentences": [ "42854f204c8d2a62822e12f731ad08-C001-83", "42854f204c8d2a62822e12f731ad08-C001-158" ] } } }, "ABC_1dd3adcb79c8bc4b5187b85d836ceb_4": { "x": [ { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-2", "text": "This talk will present several issues related to incremental (left-to-right) beam-search parsing of natural language using generative or discriminative models, either individually or in combination." }, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-3", "text": "The first part of the talk will provide background in incremental top-down and (selective) left-corner beamsearch parsing algorithms, and in stochastic models for such derivation strategies." 
}, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-4", "text": "Next, the relative benefits and drawbacks of generative and discriminative models with respect to heuristic pruning and search will be discussed." }, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-5", "text": "A range of methods for using multiple models during incremental parsing will be detailed." }, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-6", "text": "Finally, we will discuss the potential for effective use of fast, finite-state processing, e.g. partof-speech tagging, to reduce the parsing search space without accuracy loss." }, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-7", "text": "POS-tagging is shown to improve efficiency by as much as 20-25 percent with the same accuracy, largely due to the treatment of unknown words." }, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-8", "text": "In contrast, an 'islands-of-certainty' approach, which quickly annotates labeled bracketing over low-ambiguity word sequences, is shown to provide little or no efficiency gain over the existing beam-search." }, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-9", "text": "The basic parsing approach that will be described in this talk is stochastic incremental top-down parsing, using a beam-search to prune the search space." }, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-10", "text": "Grammar induction occurs from an annotated treebank, and non-local features are extracted from each derivation to enrich the stochastic model." }, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-11", "text": "Left-corner grammar and tree transforms can be applied to the treebank or the induced grammar, either fully or selectively, to change the derivation order while retaining the same underlying parsing algorithm." 
}, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-12", "text": "This approach has been shown to be accurate, relatively efficient, and robust using both generative and discriminative models (Roark, 2001; Roark, 2004; Collins and Roark, 2004) ." }, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-13", "text": "The key to effective beam-search parsing is comparability of analyses when the pruning is done." }, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-14", "text": "If two competing parses are at different points in their respective derivations, e.g. one is near the end of the derivation and another is near the beginning, then it will be difficult to evaluate which of the two is likely to result in a better parse." }, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-15", "text": "With a generative model, comparability can be accomplished by the use of a look-ahead statistic, which estimates the amount of probability mass required to extend a given derivation to include the word(s) in the look-ahead." }, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-16", "text": "Every step in the derivation decreases the probability of the derivation, but also takes the derivation one step closer to attaching to the look-ahead." }, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-17", "text": "For good parses, the look-ahead statistic should increase with each step of the derivation, ensuring a certain degree of comparability among competing parses with the same look-ahead." }, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-18", "text": "Beam-search parsing using an unnormalized discriminative model, as in Collins and Roark (2004) , requires a slightly different search strategy than the original generative model described in Roark (2001; 2004) ." }, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-19", "text": "This alternate search strategy is closer to the approach taken in Costa et al. 
(2001; 2003) , in that it enumerates a set of possible ways of attaching the next word before evaluating with the model." }, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-20", "text": "This ensures comparability for models that do not have the sort of behavior described above for the generative models, rendering look-ahead statistics difficult to estimate." }, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-21", "text": "This approach is effective, although somewhat less so than when a look-ahead statistic is used." }, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-22", "text": "A generative parsing model can be used on its own, and it was shown in Collins and Roark (2004) that a discriminative parsing model can be used on its own." }, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-23", "text": "Most discriminative parsing approaches, e.g. (Johnson et al., 1999; Collins, 2000; Collins and Duffy, 2002) , are re-ranking approaches, in which another model (typically a generative model) presents a relatively small set of candidates, which are then re-scored using a second, discriminatively trained model." }, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-24", "text": "There are other ways to combine a generative and discriminative model apart from waiting for the former to provide a set of completed candidates to the latter." }, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-25", "text": "For example, the scores can be used simultaneously; or the generative model can present candidates to the discriminative model at intermediate points in the string, rather than simply at the end." }, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-26", "text": "We discuss these options and their potential benefits." 
}, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-27", "text": "Finally, we discuss and present a preliminary evaluation of the use of rapid finite-state tagging to reduce the parsing search space, as was done in (Ratnaparkhi, 1997; Ratnaparkhi, 1999) ." }, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-28", "text": "When the parsing algorithm is integrated with model training, such efficiency improvements can be particularly important." }, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-29", "text": "POS-tagging using a simple bi-tag model improved parsing efficiency by nearly 25 percent without a loss in accuracy, when 1.2 tags per word were produced on average by the tagger." }, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-30", "text": "Producing a single tag sequence for each string resulted in further speedups, but at the loss of 1-2 points of accuracy." }, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-31", "text": "We show that much, but not all, of the speedup from POS-tagging is due to more constrained tagging of unknown words." }, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-32", "text": "In a second set of trials, we make use of what we are calling 'syntactic collocations', i.e. collocations that are (nearly) unambiguously associated with a particular syntactic configuration." }, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-33", "text": "For example, a chain of auxiliaries in English will always combine in a particular syntactic configuration, modulo noise in the annotation." }, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-34", "text": "In our approach, the labeled bracketing spanning the sub-string is treated as a tag for the sequence." }, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-35", "text": "A simple, finite-state method for finding such collocations, and an efficient longest match algorithm for labeling strings will be presented." 
}, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-36", "text": "The labeled-bracketing 'tags' are integrated with the parse search as follows: when a derivation reaches the first word of such a collocation, the remaining words are attached in the given configuration." }, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-37", "text": "This has the effect of extending the look-ahead beyond the collocation, as well as potentially reducing the amount of search required to extend the derivations to include the words in the collocation." }, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-38", "text": "However, while POS-tagging improved efficiency, we find that using syntactic collocations does not, indicating that 'islands-of-certainty' approaches are not what is needed from shallow processing; rather genuine dis-ambiguation of the sort provided by the POS-tagger." }, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-39", "text": "----------------------------------" }, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-40", "text": "****" }, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-41", "text": "will be difficult to evaluate which of the two is likely to result in a better parse." }, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-42", "text": "With a generative model, comparability can be accomplished by the use of a look-ahead statistic, which estimates the amount of probability mass required to extend a given derivation to include the word(s) in the look-ahead." }, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-43", "text": "Every step in the derivation decreases the probability of the derivation, but also takes the derivation one step closer to attaching to the look-ahead." }, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-44", "text": "For good parses, the look-ahead statistic should increase with each step of the derivation, ensuring a certain degree of comparability among competing parses with the same look-ahead." 
}, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-45", "text": "Beam-search parsing using an unnormalized discriminative model, as in Collins and Roark (2004) , requires a slightly different search strategy than the original generative model described in Roark (2001; 2004) ." }, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-46", "text": "This alternate search strategy is closer to the approach taken in Costa et al. (2001; 2003) , in that it enumerates a set of possible ways of attaching the next word before evaluating with the model." }, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-47", "text": "This ensures comparability for models that do not have the sort of behavior described above for the generative models, rendering look-ahead statistics difficult to estimate." }, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-48", "text": "This approach is effective, although somewhat less so than when a look-ahead statistic is used." }, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-49", "text": "A generative parsing model can be used on its own, and it was shown in Collins and Roark (2004) that a discriminative parsing model can be used on its own." }, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-50", "text": "Most discriminative parsing approaches, e.g. (Johnson et al., 1999; Collins, 2000; Collins and Duffy, 2002) , are re-ranking approaches, in which another model (typically a generative model) presents a relatively small set of candidates, which are then re-scored using a second, discriminatively trained model." }, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-51", "text": "There are other ways to combine a generative and discriminative model apart from waiting for the former to provide a set of completed candidates to the latter." 
}, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-52", "text": "For example, the scores can be used simultaneously; or the generative model can present candidates to the discriminative model at intermediate points in the string, rather than simply at the end." }, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-53", "text": "We discuss these options and their potential benefits." }, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-54", "text": "Finally, we discuss and present a preliminary evaluation of the use of rapid finite-state tagging to reduce the parsing search space, as was done in (Ratnaparkhi, 1997; Ratnaparkhi, 1999) ." }, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-55", "text": "When the parsing algorithm is integrated with model training, such efficiency improvements can be particularly important." }, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-56", "text": "POS-tagging using a simple bi-tag model improved parsing efficiency by nearly 25 percent without a loss in accuracy, when 1.2 tags per word were produced on average by the tagger." }, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-57", "text": "Producing a single tag sequence for each string resulted in further speedups, but at the loss of 1-2 points of accuracy." }, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-58", "text": "We show that much, but not all, of the speedup from POS-tagging is due to more constrained tagging of unknown words." }, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-59", "text": "In a second set of trials, we make use of what we are calling 'syntactic collocations', i.e. collocations that are (nearly) unambiguously associated with a particular syntactic configuration." }, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-60", "text": "For example, a chain of auxiliaries in English will always combine in a particular syntactic configuration, modulo noise in the annotation." 
}, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-61", "text": "In our approach, the labeled bracketing spanning the sub-string is treated as a tag for the sequence." }, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-62", "text": "A simple, finite-state method for finding such collocations, and an efficient longest match algorithm for labeling strings will be presented." }, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-63", "text": "The labeled-bracketing 'tags' are integrated with the parse search as follows: when a derivation reaches the first word of such a collocation, the remaining words are attached in the given configuration." }, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-64", "text": "This has the effect of extending the look-ahead beyond the collocation, as well as potentially reducing the amount of search required to extend the derivations to include the words in the collocation." }, { "sent_id": "1dd3adcb79c8bc4b5187b85d836ceb-C001-65", "text": "However, while POS-tagging improved efficiency, we find that using syntactic collocations does not, indicating that 'islands-of-certainty' approaches are not what is needed from shallow processing; rather genuine dis-ambiguation of the sort provided by the POS-tagger." 
} ], "y": { "@USE@": { "gold_contexts": [ [ "1dd3adcb79c8bc4b5187b85d836ceb-C001-12" ], [ "1dd3adcb79c8bc4b5187b85d836ceb-C001-18" ], [ "1dd3adcb79c8bc4b5187b85d836ceb-C001-45" ] ], "cite_sentences": [ "1dd3adcb79c8bc4b5187b85d836ceb-C001-12", "1dd3adcb79c8bc4b5187b85d836ceb-C001-18", "1dd3adcb79c8bc4b5187b85d836ceb-C001-45" ] }, "@BACK@": { "gold_contexts": [ [ "1dd3adcb79c8bc4b5187b85d836ceb-C001-22" ], [ "1dd3adcb79c8bc4b5187b85d836ceb-C001-49" ] ], "cite_sentences": [ "1dd3adcb79c8bc4b5187b85d836ceb-C001-22", "1dd3adcb79c8bc4b5187b85d836ceb-C001-49" ] } } }, "ABC_3251c6cd1afccf6ad8d5391a4360b0_4": { "x": [ { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-6", "text": "----------------------------------" }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-2", "text": "Speech translation systems usually follow a pipeline approach, using word lattices as an intermediate representation." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-3", "text": "However, previous work assume access to the original transcriptions used to train the ASR system, which can limit applicability in real scenarios." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-4", "text": "In this work we propose an approach for speech translation through lattice transformations and neural models based on graph networks." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-5", "text": "Experimental results show that our approach reaches competitive performance without relying on transcriptions, while also being orders of magnitude faster than previous work." 
}, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-8", "text": "Translation from speech utterances is a challenging problem that has been studied both under statistical, symbolic approaches (Ney, 1999; Casacuberta et al., 2004; Kumar et al., 2015) and more recently using neural models (Sperber et al., 2017) ." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-9", "text": "Most previous work rely on pipeline approaches, using the output of a speech recognition system (ASR) as an input to a machine translation (MT) one." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-10", "text": "These inputs can be simply the 1-best sentence returned by the ASR system or a more structured representation such as a lattice." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-11", "text": "Some recent work on end-to-end systems bypass the need for intermediate representations, with impressive results (Weiss et al., 2017) ." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-12", "text": "However, such a scenario has drawbacks." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-13", "text": "From a practical perspective, it requires access to the original speech utterances and transcriptions, which can be unrealistic if a user needs to employ an out-ofthe-box ASR system." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-14", "text": "From a theoretical perspective, intermediate representations such as lattices can be enriched through external, textual resources such as monolingual corpora or dictionaries." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-15", "text": "Sperber et al. (2017) proposes a lattice-tosequence model which, in theory, can address both problems above." 
}, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-16", "text": "However, their model suffers from training speed performance due to the lack of efficient batching procedures and they rely on transcriptions for pretraining." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-17", "text": "In this work, we address these two problems by applying lattice transformations and graph networks as encoders." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-18", "text": "More specifically, we enrich the lattices by applying subword segmentation using byte-pair encoding (Sennrich et al., 2016, BPE) and perform a minimisation step to remove redundant nodes arising from this procedure." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-19", "text": "Together with the standard batching strategies provided by graph networks, we are able to decrease training time by two orders of magnitude, enabling us to match their translation performance under the same training speed constraints without relying on gold transcriptions." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-20", "text": "----------------------------------" }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-21", "text": "**APPROACH**" }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-22", "text": "Many graph network options exist in the literature (Bruna et al., 2014; Duvenaud et al., 2015; Kipf and Welling, 2017; Gilmer et al., 2017) : in this work we opt for a Gated Graph Neural Network (Li et al., 2016, GGNN) , which was recently incorporated in an encoder-decoder architecture by Beck et al. (2018) ." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-23", "text": "Assume a directed" }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-24", "text": ") and L V and L E are respectively vocabularies for nodes and edges, from which node and edge labels ( v and e ) are defined." 
}, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-25", "text": "Given an input graph with node embeddings X, a GGNN is defined as u, v, e ) is the edge between nodes u and v, N (v) is the set of neighbour nodes for v, \u03c1 is a non-linear function, \u03c3 is the sigmoid function and" }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-26", "text": "Intuitively, a GGNN reduces to a GRU (Cho et al., 2014) if the graph is a linear chain." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-27", "text": "Therefore, the GGNN acts as a generalised encoder that updates nodes according to their neighbourhood." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-28", "text": "Multiple layers can be stacked, allowing information to be propagated through longer paths in the graph." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-29", "text": "Batching can be done by using adjacency matrices and matrix operations to perform the updates, enabling efficient processing on a GPU." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-30", "text": "----------------------------------" }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-31", "text": "**LATTICE TRANSFORMATIONS**" }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-32", "text": "As pointed out by Beck et al. (2018) , GGNNs can suffer from parameter explosion when the edge label space is large, as the number of parameters is proportional to the set of edge labels." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-33", "text": "This is a problem for lattices, since most of the information is encoded on the edges." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-34", "text": "We tackle this problem by transforming the lattices into their corresponding line graphs, which swaps nodes and edges." 
}, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-35", "text": "1 After this transformation, we also add start and end symbols, which enable the encoder to propagate information through all possible paths in the lattice." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-36", "text": "Importantly, we also remove node scores from the lattice in most of our experiments, but we do revisit this idea in \u00a73.3." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-37", "text": "Having lattices as inputs allow us to incorporate additional steps of textual transformations." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-38", "text": "To showcase this, in this work we perform subword segmentation on the lattice nodes using BPE." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-39", "text": "If a node is not present in the subword vocabulary, we split it into subwords and connect them in a leftto-right manner." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-40", "text": "The BPE segmentation can lead to redundant nodes in the lattice." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-41", "text": "Our next transformation step is a minimisation procedure, where such nodes are joined into a single node in the graph." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-42", "text": "To perform this step, we leverage an efficient algorithm for automata minimisation (Hopcroft, 1971) , which traverses the graph detecting redundant nodes by using equivalence classes, running in O(n log n) time, where n is the number of nodes." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-43", "text": "1 This procedure is also done in Sperber et al. (2017) ." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-44", "text": "The final step adds reverse and self-loop edges to the lattice, where these new edges have specific parameters in the encoder." 
}, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-45", "text": "This eases propagation of information and is standard practice when using graph networks as encoders Bastings et al., 2017; Beck et al., 2018) ." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-46", "text": "We show an example of all the transformation steps on Figure 1 ." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-47", "text": "In Figure 2 we show the architecture of our system, using the final lattice from Figure 1 as an example." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-48", "text": "Nodes are represented as embeddings that are updated according to the lattice structure, resulting in a set of hidden states as the output." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-49", "text": "Other components follow a standard seq2seq model, using a bilinear attention module (Luong et al., 2015) and a 2-layer LSTM (Hochreiter and Schmidhuber, 1997) as the decoder." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-50", "text": "----------------------------------" }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-51", "text": "**EXPERIMENTS**" }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-52", "text": "Data We perform experiments using the Fisher/Callhome Speech Translation corpus, composed of Spanish telephone conversations with their corresponding English translations." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-53", "text": "We use the original release by Post et al. (2013) , containing both 1-best and pruned lattice outputs from an ASR system for each Spanish utterance." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-54", "text": "2 The Fisher corpus contain 150K instances and we use the original splits provided with the datasets." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-55", "text": "Following previous work (Post et al., 2013; Sperber et al., 2017) , we lowercase and remove punctuation from the English translations." 
}, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-56", "text": "To build the BPE models, we extract the vocabulary from the Spanish training lattices, using 8K split operations." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-57", "text": "Models and Evaluation All our models are trained on the Fisher training set." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-58", "text": "For the 1-best baseline we use a standard seq2seq architecture and for the GGNN models, we use the same setup as Beck et al. (2018) ." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-59", "text": "Our implementation is based on the Sockeye toolkit (Hieber et al., 2017) and we use default values for most hyperparameters, except for batch size (16) and GGNN layers (8). 3 For regularisation, we apply 0.5 dropout on the input embeddings and perform early stopping on the corresponding Fisher dev set." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-60", "text": "Table 1 : Out-of-the-box scenario results, in BLEU scores. \"L\" corresponds to word lattice inputs, \"L+S\" and \"L+S+M\" correspond to lattices after subword segmentation and after minimisation, respectively." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-61", "text": "Each model is trained using 5 different seeds and we report BLEU (Papineni et al., 2001) results using the median performance according to the dev set and an ensemble of the 5 models." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-62", "text": "For the word-based models, we remove any tokens with frequency lower than 2 (as in Sperber et al. (2017) ), while for subword models we do not perform any threshold pruning." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-63", "text": "We report all results on the Fisher \"dev2\" set." 
}, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-64", "text": "4" }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-65", "text": "----------------------------------" }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-66", "text": "**OUT-OF-THE-BOX ASR SCENARIO**" }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-67", "text": "In this scenario we assume only lattices and 1-best outputs are available, simulating a setting where we do not have access to the transcriptions." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-68", "text": "Table 1 shows that results are consistent with previous work: lattices provide significant improvements over simply using the 1-best output." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-69", "text": "More importantly though, the results also highlight the benefits of our proposed transformations and we obtain the best ensemble performance using minimised lattices." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-70", "text": "Table 2 : Results with transcriptions, in BLEU scores." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-71", "text": "\"L+S+M\" corresponds to the same results in Table 1 and \"L+S+M+T\" is the setting with gold transcriptions added to the training set." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-72", "text": "----------------------------------" }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-73", "text": "**ADDING TRANSCRIPTIONS**" }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-74", "text": "The out-of-the-box results in \u00a73.1 are arguably more general in terms of applicability in real scenarios." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-75", "text": "However, in order to compare with the state-of-the-art, we also experiment with a scenario where we have access to the original Spanish transcriptions." 
}, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-76", "text": "To incorporate transcriptions into our model, we convert them into a linear chain graph, after segmenting using BPE." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-77", "text": "With this, we can simply take the union of transcriptions and lattices into a single training set." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-78", "text": "We keep the dev and test sets with lattices only, as this emulates test time conditions." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-79", "text": "The results shown in Table 2 are consistent with previous work: adding transcriptions further enhance the system performance." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-80", "text": "We also slightly outperform Sperber et al. (2017) in the setting where they ignore lattice scores, as in our approach." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-81", "text": "Most importantly, we are able to reach those results while being two orders of magnitude faster at training time: Sperber et al. (2017) report taking 1.5 days for each epoch while our architecture can process each epoch in 15min." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-82", "text": "The reason is because their model relies on the CPU while our GGNN-based model can be easily batched and computed in a GPU." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-83", "text": "Given those differences in training time, it is worth mentioning that the best model in Sperber et al. (2017) is surpassed by our best ensemble using lattices only." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-84", "text": "This means that we can obtain state-of-the-art performance even in an out-of-thebox scenario, under the same training speed constraints." 
}, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-85", "text": "While there are other constraints that may be considered (such as parameter budget), we nevertheless believe this is an encouraging result for real world scenarios." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-86", "text": "----------------------------------" }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-87", "text": "**ADDING LATTICE SCORES**" }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-88", "text": "Our approach is not without limitations." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-89", "text": "In particular, the GGNN encoder ignores lattice scores, which can help the model disambiguate between different paths in the lattice." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-90", "text": "As a simple first approach to incorporate scores, we embed them using a multilayer perceptron, using the score as the input." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-91", "text": "This however did not produce good results: performance dropped to 32.9 BLEU in the single model setting and 38.4 for the ensemble." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-92", "text": "It is worth noticing that Sperber et al. (2017) has a more principled approach to incorporate scores: by modifying the attention module." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-93", "text": "This is arguably a better choice, since the scores can directly inform the decoder about the ambiguity in the lattice." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-94", "text": "Since this approach does not affect the encoder, it is theoretically possible to combine our GGNN encoder with their attention module, we leave this avenue for future work." 
}, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-95", "text": "----------------------------------" }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-96", "text": "**CONCLUSIONS AND FUTURE WORK**" }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-97", "text": "In this work we proposed an architecture for lattice-to-string translation by treating lattices as general graphs and leveraging on recent advances in neural networks for graphs." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-98", "text": "5 Compared to previous similar work, our model permits easy minibatching and allows one to freely enrich the lattices with additional information, which we exploit by incorporating BPE segmentation and lattice minimisation." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-99", "text": "We show promising results and outperform baselines in speech translation, particularly in out-of-the-box ASR scenarios, when one has no access to transcriptions." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-100", "text": "For future work, we plan to investigate better approaches to incorporate scores in the lattices." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-101", "text": "The approaches used by Sperber et al. (2017) can provide a starting point in this direction." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-102", "text": "The same minimisation procedures we employ can be adapted to weighted lattices (Eisner, 2003 )." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-103", "text": "Another important avenue is to explore this approach in low-resource scenarios such as ones involving endangered languages (Adams et al., 2017; Anastasopoulos and Chiang, 2018) ." }, { "sent_id": "3251c6cd1afccf6ad8d5391a4360b0-C001-104", "text": "Workshop on Speech and Language Technologies, hosted at Carnegie Mellon University and sponsored by Johns Hopkins University with unrestricted gifts from Amazon, Apple, Facebook, Google, and Microsoft." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "3251c6cd1afccf6ad8d5391a4360b0-C001-8" ], [ "3251c6cd1afccf6ad8d5391a4360b0-C001-15" ] ], "cite_sentences": [ "3251c6cd1afccf6ad8d5391a4360b0-C001-8", "3251c6cd1afccf6ad8d5391a4360b0-C001-15" ] }, "@MOT@": { "gold_contexts": [ [ "3251c6cd1afccf6ad8d5391a4360b0-C001-16" ] ], "cite_sentences": [ "3251c6cd1afccf6ad8d5391a4360b0-C001-16" ] }, "@SIM@": { "gold_contexts": [ [ "3251c6cd1afccf6ad8d5391a4360b0-C001-43" ] ], "cite_sentences": [ "3251c6cd1afccf6ad8d5391a4360b0-C001-43" ] }, "@USE@": { "gold_contexts": [ [ "3251c6cd1afccf6ad8d5391a4360b0-C001-55" ], [ "3251c6cd1afccf6ad8d5391a4360b0-C001-62" ], [ "3251c6cd1afccf6ad8d5391a4360b0-C001-101" ] ], "cite_sentences": [ "3251c6cd1afccf6ad8d5391a4360b0-C001-55", "3251c6cd1afccf6ad8d5391a4360b0-C001-62", "3251c6cd1afccf6ad8d5391a4360b0-C001-101" ] }, "@DIF@": { "gold_contexts": [ [ "3251c6cd1afccf6ad8d5391a4360b0-C001-80" ], [ "3251c6cd1afccf6ad8d5391a4360b0-C001-81" ], [ "3251c6cd1afccf6ad8d5391a4360b0-C001-82" ], [ "3251c6cd1afccf6ad8d5391a4360b0-C001-83" ], [ "3251c6cd1afccf6ad8d5391a4360b0-C001-92" ] ], "cite_sentences": [ "3251c6cd1afccf6ad8d5391a4360b0-C001-80", "3251c6cd1afccf6ad8d5391a4360b0-C001-81", "3251c6cd1afccf6ad8d5391a4360b0-C001-82", "3251c6cd1afccf6ad8d5391a4360b0-C001-83", "3251c6cd1afccf6ad8d5391a4360b0-C001-92" ] }, "@FUT@": { "gold_contexts": [ [ "3251c6cd1afccf6ad8d5391a4360b0-C001-100", "3251c6cd1afccf6ad8d5391a4360b0-C001-101" ] ], "cite_sentences": [ "3251c6cd1afccf6ad8d5391a4360b0-C001-101" ] } } }, "ABC_d2c95c3198f21e793549d7b16bdaf8_4": { "x": [ { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-39", "text": "We provide a graphical, web-based user interface (GUI)." }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-2", "text": "Abstract." 
}, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-3", "text": "A plethora of Entity Linking (EL) approaches has recently been developed." }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-4", "text": "While many claim to be multilingual, the MAG (Multilingual AGDIS-TIS) approach has been shown recently to outperform the state of the art in multilingual EL on 7 languages." }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-5", "text": "With this demo, we extend MAG to support EL in 40 different languages, including especially low-resources languages such as Ukrainian, Greek, Hungarian, Croatian, Portuguese, Japanese and Korean." }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-6", "text": "Our demo relies on online web services which allow for an easy access to our entity linking approaches and can disambiguate against DBpedia and Wikidata." }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-7", "text": "During the demo, we will show how to use MAG by means of POST requests as well as using its user-friendly web interface." }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-8", "text": "All data used in the demo is available at https://hobbitdata.informatik.uni-leipzig.de/agdistis/" }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-9", "text": "----------------------------------" }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-11", "text": "A recent survey by IBM 3 suggests that more than 2.5 quintillion bytes of data are produced on the Web every day." }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-12", "text": "Entity Linking (EL), also known as Named Entity Disambiguation (NED), is one of the most important Natural Language Processing (NLP) techniques for extracting knowledge automatically from this huge amount of data." 
}, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-40", "text": "In addition, users can choose to use the REST interface or a Java snippet." }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-13", "text": "The goal of an EL approach is as follows: Given a piece of text, a reference knowledge base K and a set of entity mentions in that text, map each entity mention to the corresponding resource in K [4] ." }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-14", "text": "A large number of challenges has to be addressed while performing a disambiguation." }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-15", "text": "For instance, a given resource can be referred to using different labels due to phenomena such as synonymy, acronyms or typos." }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-16", "text": "For example, New York City, NY and Big Apple are all labels for the same entity." }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-17", "text": "Also, multiple entities can share the same name due to homonymy and ambiguity." }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-18", "text": "For example, both the state and the city of Rio de Janeiro are called Rio de Janeiro." }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-19", "text": "Despite the complexity of the task, EL approaches have recently achieved increasingly better results by relying on trained machine learning models [6] ." }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-20", "text": "A portion of these approaches claim to be multilingual and most of them rely on models which are trained on English corpora with cross-lingual dictionaries." }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-21", "text": "However, MAG (Multilingual AGDISTIS) [4] showed that the underlying models being trained on English corpora make them prone to failure when migrated to a different language." 
}, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-22", "text": "Additionally, these approaches hardly make their models or data available on more than three languages [6] ." }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-23", "text": "The new version of MAG (which is the quintessence of this demo) provides support for 40 different languages using sophisticated indices 4 ." }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-24", "text": "For the sake of server space, we deployed MAG-based web services for 9 languages and offer the other 31 languages for download." }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-25", "text": "Additionally, we provide an English index using Wikidata to show the knowledge-base agnosticism of MAG." }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-26", "text": "During the demo, we will show how to use the web services as well as MAG's user interface." }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-27", "text": "----------------------------------" }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-28", "text": "**MAG ENTITY LINKING SYSTEM**" }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-29", "text": "MAG's EL process comprises two phases, namely an offline and an online phase." }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-30", "text": "The sub-indices (which are generated during the offline phase) consist of surface forms, person names, rare references, acronyms and context information." }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-31", "text": "During the online phase, the EL is carried out in two steps: 1) candidate generation and 2) disambiguation." }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-32", "text": "The goal of the candidate generation step is to retrieve a tractable number of candidates for each mention." 
}, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-33", "text": "These candidates are later inserted into the disambiguation graph, which is used to determine the mapping between entities and mentions." }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-34", "text": "MAG implements two graph-based algorithms to disambiguate entities, i.e., PageRank and HITS." }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-35", "text": "Independently of the chosen graph algorithm, the highest candidate score among the set of candidates is chosen as correct disambiguation for a given mention [4] ." }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-36", "text": "----------------------------------" }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-37", "text": "**DEMONSTRATION**" }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-38", "text": "Our demonstration will show the capabilities of MAG for different languages." }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-41", "text": "For research purposes, MAG can be downloaded and deployed via Maven or Docker." }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-42", "text": "Figure 1 illustrates an example of MAG working on Spanish." }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-43", "text": "The online demo can be accessed via http://agdistis.aksw." }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-44", "text": "org/mag-demo and its code can be downloaded from https://github.com/ dice-group/AGDISTIS_DEMO/tree/v2." }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-45", "text": "We have set up a web service interface for each language version." }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-46", "text": "Each of these interfaces understands two mandatory parameters: (1) text and (2) type." }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-47", "text": "1. text accepts an UTF-8 and URL encoded string with entities annotated with XML-tag . 
It is also capable of recognizing NIF [3] or txt files." }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-48", "text": "2. type accepts two different values." }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-49", "text": "First, 'agdistis' to disambiguate the mentions using the graph-based algorithms, but also 'candidates' which list all possible entities for a given mention through the depth-candidate selection of MAG." }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-50", "text": "Other parameters." }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-51", "text": "The user can also define more parameters to fine-tune the disambiguation." }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-52", "text": "These parameters have to be set up within the properties file 5 or via environment variables while deploying it locally." }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-53", "text": "Below, we describe all the parameters." }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-54", "text": "-Popularity -The user can set it as popularity=false or popularity=true." }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-55", "text": "It allows MAG to use either the Page Rank or the frequency of a candidate to sort while candidate retrieval." }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-56", "text": "-Graph-based algorithm -The user can choose which graph-based algorithm to use for disambiguating among the candidates per mentions." }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-57", "text": "The current implementation offers HITS and PageRank as algorithms, algorithm=hits or algorithm =pagerank." }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-58", "text": "-Search by Context -This boolean parameter provides a search of candidates using a context index [4] ." }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-59", "text": "-Acronyms -This parameter enables a search by acronyms." 
}, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-60", "text": "In this case, MAG uses an additional index to filter the acronyms by expanding their labels and assigns them a high probability." }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-61", "text": "For example, PSG equals Paris Saint-Germain." }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-62", "text": "The parameter is acronym=false or acronym=true." }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-63", "text": "-Common Entities -This boolean option supports finding common entities , in case, users desire to find more than ORGANIZATIONs, PLACEs and PERSONs as entity type." }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-64", "text": "Knowledge-base Agnosticism." }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-65", "text": "Figure 2 shows a screen capture of our demo for disambiguating mentions using Wikidata." }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-66", "text": "We also provide a web service to allow further investigation." }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-67", "text": "In addition, MAG is used in a domain specific problem using a music Knowledge Base (KB) [5] ." }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-68", "text": "----------------------------------" }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-69", "text": "**EVALUATION OF THE USER INTERFACE**" }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-70", "text": "We performed a system usability study (SUS) 67 to validate the design of our user interface." }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-71", "text": "15 users -with a good or no knowledge of Semantic Web, EL or knowledge extraction -selected randomly from all departments at Leipzig University answered our survey." }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-72", "text": "We achieved a SUS-Score of 86.3." 
}, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-73", "text": "This score assigns the mark S to the current interface of MAG and places it into the category of the 10% interfaces, meaning that users of the interface are likely to recommend it to a friend." }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-74", "text": "Figure 3 shows the average voting per question and its standard deviation." }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-75", "text": "----------------------------------" }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-76", "text": "**SUMMARY**" }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-77", "text": "In this demo, we will present MAG, a KB-agnostic and deterministic approach for multilingual EL on 40 different languages contained in DBpedia." }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-78", "text": "Currently, MAG is used in diverse projects 8 and has been used largely by the Semantic Web community." }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-79", "text": "We also provide a demo/web-service using Wikidata for supporting an investigation of the graphs structures behind DBpedia and Wikidata pertaining to Information Extraction tasks [1, 2] ." }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-80", "text": "The indexes we provided will be used in future work to investigate the EL problem in low-resource languages." }, { "sent_id": "d2c95c3198f21e793549d7b16bdaf8-C001-81", "text": "Our next step will hence be to to evaluate EL on all 40 languages presented in this demo." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "d2c95c3198f21e793549d7b16bdaf8-C001-13" ], [ "d2c95c3198f21e793549d7b16bdaf8-C001-35" ], [ "d2c95c3198f21e793549d7b16bdaf8-C001-58" ] ], "cite_sentences": [ "d2c95c3198f21e793549d7b16bdaf8-C001-13", "d2c95c3198f21e793549d7b16bdaf8-C001-35", "d2c95c3198f21e793549d7b16bdaf8-C001-58" ] }, "@MOT@": { "gold_contexts": [ [ "d2c95c3198f21e793549d7b16bdaf8-C001-21" ] ], "cite_sentences": [ "d2c95c3198f21e793549d7b16bdaf8-C001-21" ] } } }, "ABC_1a8c7d22709cae34fbc1eb70fe5189_4": { "x": [ { "sent_id": "1a8c7d22709cae34fbc1eb70fe5189-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "1a8c7d22709cae34fbc1eb70fe5189-C001-2", "text": "This tutorial discusses a framework for incremental left-to-right structured predication, which makes use of global discriminative learning and beam-search decoding." }, { "sent_id": "1a8c7d22709cae34fbc1eb70fe5189-C001-3", "text": "The method has been applied to a wide range of NLP tasks in recent years, and achieved competitive accuracies and efficiencies." }, { "sent_id": "1a8c7d22709cae34fbc1eb70fe5189-C001-4", "text": "We give an introduction to the algorithms and efficient implementations, and discuss their applications to a range of NLP tasks." 
}, { "sent_id": "1a8c7d22709cae34fbc1eb70fe5189-C001-5", "text": "----------------------------------" }, { "sent_id": "1a8c7d22709cae34fbc1eb70fe5189-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "1a8c7d22709cae34fbc1eb70fe5189-C001-7", "text": "This tutorial discusses a framework of online global discriminative learning and beam-search decoding for syntactic processing (Zhang and Clark, 2011b) , which has recently been applied to a wide variety of natural language processing (NLP) tasks, including word segmentation (Zhang and Clark, 2007) , dependency parsing (Zhang and Clark, 2008b; Huang and Sagae, 2010; Zhang and Nivre, 2011; Bohnet and Kuhn, 2012) , context free grammar (CFG) parsing (Collins and Roark, 2004; Zhang and Clark, 2009; Zhu et al., 2013) , combinational categorial grammar (CCG) parsing (Zhang and Clark, 2011a; Xu et al., 2014) and machine translation (Liu, 2013) , achieving stateof-the-art accuracies and efficiencies." }, { "sent_id": "1a8c7d22709cae34fbc1eb70fe5189-C001-8", "text": "In addition, due to its high efficiencies, it has also been applied to a range of joint structural problems, such as joint segmentation and POS-tagging (Zhang and Clark, 2008a; Zhang and Clark, 2010) , joint POS-tagging and dependency parsing (Hatori et al., 2011; Bohnet and Nivre, 2012) , joint morphological analysis, POS-tagging and dependency parsing (Bohnet et al., 2013) , and joint segmentation, POS-tagging and parsing ." }, { "sent_id": "1a8c7d22709cae34fbc1eb70fe5189-C001-9", "text": "In addition to the aforementioned tasks, the framework can be applied to all structural prediction tasks for which the output can be constructed using an incremental process." }, { "sent_id": "1a8c7d22709cae34fbc1eb70fe5189-C001-10", "text": "The advantage of this framework is two-fold." 
}, { "sent_id": "1a8c7d22709cae34fbc1eb70fe5189-C001-11", "text": "First, beamsearch enables highly efficient decoding, which typically has linear time complexity, depending on the incremental process." }, { "sent_id": "1a8c7d22709cae34fbc1eb70fe5189-C001-12", "text": "Second, free from DPstyle constraints and Markov-style independence assumptions, the framework allows arbitrary features to be defined to capture structural patterns." }, { "sent_id": "1a8c7d22709cae34fbc1eb70fe5189-C001-13", "text": "In addition to feature advantages, the high accuracies of this framework are also enabled by direct interactions between learning and search (Daum\u00e9 III and Marcu, 2005; Huang et al., 2012; Zhang and Nivre, 2012) ." }, { "sent_id": "1a8c7d22709cae34fbc1eb70fe5189-C001-14", "text": "----------------------------------" }, { "sent_id": "1a8c7d22709cae34fbc1eb70fe5189-C001-15", "text": "**TUTORIAL OVERVIEW**" }, { "sent_id": "1a8c7d22709cae34fbc1eb70fe5189-C001-16", "text": "In this tutorial, we make an introduction to the framework, illustrating how it can be applied to a range of NLP problems, giving theoretical discussions and demonstrating a software implementation." }, { "sent_id": "1a8c7d22709cae34fbc1eb70fe5189-C001-17", "text": "We start with a detailed introduction of the framework, describing the averaged perceptron algorithm (Collins, 2002) and its efficient implementation issues (Zhang and Clark, 2007) , as well as beam-search and the early-update strategy (Collins and Roark, 2004) ." }, { "sent_id": "1a8c7d22709cae34fbc1eb70fe5189-C001-18", "text": "We then illustrate how the framework can be applied to NLP tasks, including word segmentation, joint segmentation & POS-tagging, labeled and unlabeled dependency parsing, joint POS-tagging and dependency parsing, CFG parsing, CCG parsing, and joint segmentation, POS-tagging and parsing." 
}, { "sent_id": "1a8c7d22709cae34fbc1eb70fe5189-C001-19", "text": "In each case, we illustrate how the task is turned into an incremental left-to-right output-building process, and how rich features are defined to give competitive accuracies." }, { "sent_id": "1a8c7d22709cae34fbc1eb70fe5189-C001-20", "text": "These examples can serve as guidance in applying the framework to other structural prediction tasks." }, { "sent_id": "1a8c7d22709cae34fbc1eb70fe5189-C001-21", "text": "In the second part of the tutorial, we give some analysis on why the framework is effective." }, { "sent_id": "1a8c7d22709cae34fbc1eb70fe5189-C001-22", "text": "We discuss several alternative learning algorithms, 13 and compare beam-search with greedy search on dependency parsing." }, { "sent_id": "1a8c7d22709cae34fbc1eb70fe5189-C001-23", "text": "We show that accuracy benefits from interaction between learning and search." }, { "sent_id": "1a8c7d22709cae34fbc1eb70fe5189-C001-24", "text": "Finally, the tutorial concludes with an introduction to ZPar, an open source toolkit that provides optimized C++ implementations of of all the above tasks." }, { "sent_id": "1a8c7d22709cae34fbc1eb70fe5189-C001-25", "text": "Ting Liu is a professor at HIT-SCIR." }, { "sent_id": "1a8c7d22709cae34fbc1eb70fe5189-C001-26", "text": "His research interest includes social computing, information retrieval and natural language processing." 
}, { "sent_id": "1a8c7d22709cae34fbc1eb70fe5189-C001-27", "text": "----------------------------------" }, { "sent_id": "1a8c7d22709cae34fbc1eb70fe5189-C001-28", "text": "**OUTLINE**" } ], "y": { "@USE@": { "gold_contexts": [ [ "1a8c7d22709cae34fbc1eb70fe5189-C001-7" ], [ "1a8c7d22709cae34fbc1eb70fe5189-C001-17" ] ], "cite_sentences": [ "1a8c7d22709cae34fbc1eb70fe5189-C001-7", "1a8c7d22709cae34fbc1eb70fe5189-C001-17" ] } } }, "ABC_357667e192057a48dff60edad86bf0_4": { "x": [ { "sent_id": "357667e192057a48dff60edad86bf0-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-2", "text": "Learning embeddings of words and knowledge base elements is a promising approach for open domain question answering." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-3", "text": "Based on the remark that relations and entities are distinct object types lying in the same embedding space, we analyze the benefit of adding a regularizer favoring the embeddings of entities to be orthogonal to those of relations." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-4", "text": "The main motivation comes from the observation that modifying the embeddings using prior knowledge often helps performance." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-5", "text": "The experiments show that incorporating the regularizer yields better results on a challenging question answering benchmark." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-6", "text": "----------------------------------" }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-8", "text": "Having a system which is able to answer questions based on a structured knowledge base is a challenging problem." 
}, { "sent_id": "357667e192057a48dff60edad86bf0-C001-9", "text": "The problem has been addressed recently by researchers working on large knowledge bases such as Reverb (Fader et al., 2011) and Freebase (Bollacker et al., 2008) ." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-10", "text": "The creation of question answering (QA) benchmarks for these knowledge bases (KB) has a significant impact on the domain, as shown by the number of QA systems recently proposed in the literature (Berant and Liang, 2014; Berant et al., 2013; Bordes et al., 2014a; Bordes et al., 2014b; Fader et al., 2013; Fader et al., 2014; Yao and Van Durme, 2014; Yih et al., 2014; Dong et al., 2015) ." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-11", "text": "We identify two types of approaches for KBcentric QA systems: parsing-based approaches and information retrieval (IR) based approaches." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-12", "text": "Parsing-based approaches (Yih et al., 2014; Berant et al., 2013; Berant and Liang, 2014; Reddy et al., 2014) answer factoid questions by learning a structured representation for the sentences, called logical form." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-13", "text": "This logical form is then used to query the knowledge base and retrieve the answer." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-14", "text": "IR-based approaches try to identify the best possible match between the knowledge base and the question (Bordes et al., 2014a; Bordes et al., 2014b; Yao and Van Durme, 2014; Dong et al., 2015) ." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-15", "text": "In this work, we focus on the second approach, using embedding models, mainly because it is robust to invalid syntax and can exploit information of the answer." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-16", "text": "We focus on the Wikianswers (Fader et al., 2013) dataset constructed for Reverb." 
}, { "sent_id": "357667e192057a48dff60edad86bf0-C001-17", "text": "On Wikianswers, the underlying semantics is very simple (just one single triple)." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-18", "text": "However, the task remains challenging due to the large variety of lexicalizations for the same semantics." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-19", "text": "We follow the approach of Bordes et .al (2014b) which learns the embeddings of words and KB elements." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-20", "text": "They model the semantics of natural language sentences and KB triples as the sum of the embeddings of the associated words and KB elements respectively." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-21", "text": "Despite its simplicity, this model performs surprisingly well in practice." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-22", "text": "Something even more interesting (Bordes et al., 2014b) is that the system can have a good performance even without using a paraphrase corpus." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-23", "text": "This makes the system very attractive in practice because in many specific domains, we might have a KB but there may be no paraphrase corpus as in Wikianswers." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-24", "text": "In our work, we push the results further when learning a QA system based only on the KB." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-25", "text": "Our contribution is to introduce a new orthogonality regularizer which distinguishes entities and relations." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-26", "text": "We also investigate the tradeoff captured by the orthogonality constraints." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-27", "text": "With a synthetic example, we show that if entities and relations are independent, orthogonal embeddings generate better results." 
}, { "sent_id": "357667e192057a48dff60edad86bf0-C001-28", "text": "The orthogonality constraint in the context of question answering is new, although it has been successfully used in other contexts ." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-29", "text": "Like (Bordes et al., 2014b) , we use al-most no linguistic features such as POS tagging, parsing, etc." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-30", "text": "----------------------------------" }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-31", "text": "**THE REVERB QUESTION ANSWERING TASK**" }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-32", "text": "The ReVerb question answering task was first introduced in (Fader et al., 2013) as follows." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-33", "text": "Given a large RDF KB and a natural language (NL) question whose answer is given by a triple contained in that KB, the task is to find a correct triple." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-34", "text": "For example, a correct answer to the NL question \"What is the main language in Hong Kong ?\" would be the KB triple (cantonese.e, be-major-languagein.r, hong-kong.e)." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-35", "text": "RDF triples are assertions of the form (e 1 , r, e 2 ) where r is a binary relation from some vocabulary R and e 1 , e 2 are entities from a vocabulary E." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-36", "text": "The KB used is ReVerb 1 , a publicly available set of 15 million extractions (Fader et al., 2011) defined over a vocabulary of approximately 600K relations and 3M entities." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-37", "text": "The test set used for evaluation includes 698 questions extracted from the website Wikianswers, many of which involve paraphrases." 
}, { "sent_id": "357667e192057a48dff60edad86bf0-C001-38", "text": "To map NL questions to KB queries, they first induce a lexicon mapping NL expressions to KB elements using manually defined patterns, alignments and a paraphrase corpus." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-39", "text": "Using this lexicon, multiple KB queries can be derived from a NL question." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-40", "text": "These queries are then ranked using a scoring function." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-41", "text": "Bordes et al. (2014b) introduce a linguistically leaner IR-based approach which identifies the KB triple most similar to the input NL question." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-42", "text": "In their approach, KB triples and NL questions are represented as sums of embeddings of KB symbols and words respectively." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-43", "text": "The similarity between a triple and a question is then simply the dot product of their embeddings." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-44", "text": "Interestingly, Bordes' (2014b) system performs relatively well (MAP score 0.34) on the Wikianswers dataset even without using the paraphrase corpus." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-45", "text": "This suggests that the embedding method successfully captures the similarity between NL questions and KB queries." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-46", "text": "Our work continues this direction by further separating relations with entities." 
}, { "sent_id": "357667e192057a48dff60edad86bf0-C001-47", "text": "----------------------------------" }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-48", "text": "**RELATED WORK**" }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-49", "text": "1 http://reverb.cs.washington.edu" }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-50", "text": "The idea of distinguishing entities and relations in question answering can also be found in (Yih et al., 2014) ." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-51", "text": "However, they base their work by supposing that we can cut the sentence into \"entity part\" and \"relation part\" and then calculate the matching score." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-52", "text": "Our model does not need this cut and simply enforces the entity embeddings and relation embeddings (on the KB side) to be different." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-53", "text": "Orthogonality or near orthogonality is a property which is desired in many embedding techniques." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-54", "text": "In random indexing (Sahlgren, 2005) , a near orthogonality is ensured amongst the embeddings of different contexts." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-55", "text": "In (Zanzotto and Dell'Arciprete, 2012) , to approximate tree kernels in a distributed way, different subtree feature embeddings are also constructed to be near orthogonal." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-56", "text": "Our work gives yet another motivation for orthogonal embeddings for the special case where the semantics of a sentence is modeled as the sum of its associated word embeddings." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-57", "text": "In this case, orthogonal word embeddings help to model their independence." 
}, { "sent_id": "357667e192057a48dff60edad86bf0-C001-58", "text": "----------------------------------" }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-59", "text": "**EMBEDDING MODEL**" }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-60", "text": "Word embeddings are generally learned (Deerwester et al., 1990; Mikolov et al., 2013; Lebret and Collobert, 2015; Faruqui et al., 2014) such that words with similar context will naturally share similar embeddings as measured for instance by cosine similarity." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-61", "text": "The embeddings learned in (Bordes et al., 2014b ) also encode context information." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-62", "text": "They link the embedding of words with the whole triple-answer in their scoring function." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-63", "text": "By this means, the word embedding carries the information of the whole triple." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-64", "text": "Our model further distinguishes entities and relations." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-65", "text": "Noting that entities and relations may have some independence (knowing that 'a man eats' doesn't help to tell 'which man'), the distinction is done via orthogonality." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-66", "text": "We show in the toy example that orthogonality helps to capture this independent structure of the data." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-67", "text": "----------------------------------" }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-68", "text": "**SCORING FUNCTION**" }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-69", "text": "The model learns the embedding of each word and KB element by trying to score the correct answers highest." 
}, { "sent_id": "357667e192057a48dff60edad86bf0-C001-70", "text": "Mathematically, let q be the query, and a be the answer-triple to align." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-71", "text": "Denote the total number of words as N w and the number of KB elements as N kb ." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-72", "text": "Then denote by \u03c6(q)" }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-73", "text": "\u2208 {0, 1} Nw Algorithm 1 Training with orthogonality regularizer 1." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-74", "text": "Sample a positive training pair (q i , a i ) from D. 2." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-75", "text": "Create a corrupted triple a i 3." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-102", "text": "We noticed that one important assumption that is not discussed in the basic approach is that the embedding space is the same for relations and entities." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-103", "text": "That approach has a tendency to learn similar embeddings for entities and relations, even if they have different meanings." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-76", "text": "If S(q i , a i ) \u2212 S(q i , a i ) < 0.1 : make a stochastic gradient ascent on S(q i , a i ) \u2212 S(q i , a i ) \u2212 \u03bb|E.R| 4. Normalize the embedding vector the 1-hot representation indicating the presence or absence of words in the query." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-77", "text": "Similarly we denote the sparse representation on the KB side as \u03c8(a)." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-78", "text": "Let M \u2208 R d\u00d7Nw be the embedding matrix for words and K \u2208 R d\u00d7N kb be the embedding matrix for the elements in the KB." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-79", "text": "d is the low dimension chosen by the user." 
}, { "sent_id": "357667e192057a48dff60edad86bf0-C001-80", "text": "The embedding of the sentence is then calculated as M \u03c6(q) and similarly the embedding of the answer-triple as K \u03c8(a)." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-81", "text": "We can score the matching of these embeddings:" }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-82", "text": "which is the dot product between the embedding of the sentence and the embedding of the triple." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-83", "text": "The model is introduced in (Bordes et al., 2014b) and we use the same scoring function." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-84", "text": "Note that the model actually sums up each word embedding to form the embedding of the sentence." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-85", "text": "----------------------------------" }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-86", "text": "**INFERENCE**" }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-87", "text": "The inference procedure is straightforward." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-88", "text": "Given a question q and a set of possible answer triples noted A(q), the model predicts the answer by returning the triple with the highest score: a = argmax a\u2208A(q) S(q, a)" }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-89", "text": "----------------------------------" }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-90", "text": "**TRAINING**" }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-91", "text": "Originally in (Bordes et al., 2014b) , given a question to be answered, training is performed by imposing a margin-constraint between the correct answer and negative ones." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-92", "text": "More precisely, note a a negative answer to the question q (the correct answer to q being a)." 
}, { "sent_id": "357667e192057a48dff60edad86bf0-C001-93", "text": "Then for each question answer pair, the system tries to maximize the following function by performing a gradient ascent step: min( , S(q, a) \u2212 S(q, a )) with the margin set to 0.1." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-94", "text": "In addition, the norms of columns in M and K are constrained to be inferior to 1." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-95", "text": "The training is done in a stochastic way by randomly selecting a question answer pair at each step." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-96", "text": "For each gradient step, the step size is calculated using Adagrad (Duchi et al., 2011) ." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-97", "text": "The negative example is created by randomly replacing each element of (e 1 , r, e 2 ) by another one with probability 2/3." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-98", "text": "----------------------------------" }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-99", "text": "**ENFORCING ORTHOGONAL EMBEDDINGS**" }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-100", "text": "In this work, we are especially interested in the additional assumptions we can make on the model in order to cope with data sparsity." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-101", "text": "Indeed, when the number of training data supporting the computation of embeddings is small, embedding models are brittle and can lead to disappointing results." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-104", "text": "Intuitively, we would like to balance that tendency by a \"prior knowledge\" preference towards choosing embeddings of entities and relations which are orthogonal to each other." 
}, { "sent_id": "357667e192057a48dff60edad86bf0-C001-105", "text": "To justify this assumption, consider a simple case where the underlying semantics is (e, r) as in the sentence \"John eats\"." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-106", "text": "We will use the same letter to indicate an entity or relation and their corresponding embeddings." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-107", "text": "In (Bordes et al., 2014b) , the embedding of the semantics is then calculated as e + r for this very simple case." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-108", "text": "Now suppose that \u2200e = e, ||e \u2212 e || 2 \u2265 (i.e John is different from Mary with margin ) and that the same kind of constraints also holds for relations." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-109", "text": "However, even when these constraints are satisfied, it is not guaranteed that ||e + r \u2212 e \u2212 r || 2 \u2265 , which means that the model may still get confused on the whole semantics even if each part is clear." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-110", "text": "One obvious and linguistically plausible solution is to say that the entities and relations lie in orthogonal spaces." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-111", "text": "Indeed, if relations and entities are orthogonal (\u2200r, e (r \u22a5 e)), then if two entities e, e and two relations r, r are distinct (i.e., ||e \u2212 e || 2 \u2265 and ||r \u2212 r || 2 \u2265 ), it follows that ||e + r \u2212 e \u2212 r || 2 = ||e \u2212 e || 2 + |||r \u2212 r || 2 \u2265 2 by Pythagorean theorem." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-112", "text": "That is, two sentences whose semantic representations involve two distinct entities and/or relations will have different values." 
}, { "sent_id": "357667e192057a48dff60edad86bf0-C001-113", "text": "In real problems, however, posing a hard orthogonality constraint largely reduces the model's What is the religious celebration of christians ? (christian.e be-all-about.r original-sin.e) (easter.e be-most-important-holiday.r christian.e)" }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-114", "text": "What do cassava come from ? (cassava.e be-source-of.r security.e) (cassava.e be-grow-in.r africa.e) Table 1 : Some examples for which our system differs from ( (Bordes et al., 2014b) )." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-115", "text": "Gold standard answer triples are marked in bold." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-116", "text": "expressive power 2 , so we decide to add it as a regularizer." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-117", "text": "More concretely, let the correct triple be (e 1 , r, e 2 ) and the negative one be (e 1 , r , e 2 )." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-118", "text": "Consider that we are in a case not satisfying the margin constraint, then we will try to maximize the following regularized function S(q, a) \u2212 S(q, a ) \u2212 \u03bb|E.R| with a gradient step." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-119", "text": "The regularizer |E.R| = |e 1 .r| + |e 2 .r| + |e 1 .r | + |e 2 .r | is minimized when all the entities and relations live in orthogonal space." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-120", "text": "The regularization parameter \u03bb is chosen via an automatically constructed development set for which we randomly selected 1/2000 of all the triples in the KB and generate associated questions." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-121", "text": "We discard these triples from training and choose the \u03bb value based on the score on the development set." 
}, { "sent_id": "357667e192057a48dff60edad86bf0-C001-122", "text": "The \u03bb value is by this means set to 0.01 with \u03bb in {0.5,0.1,0.05,0.01,0.005,0.001}. Once the \u03bb value is chosen, we retrain the whole system." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-123", "text": "----------------------------------" }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-124", "text": "**EXPERIMENTAL RESULTS**" }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-125", "text": "----------------------------------" }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-126", "text": "**TOY EXAMPLE**" }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-127", "text": "In this section, we illustrate the benefits of orthogonality via a toy example." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-128", "text": "We construct a KB containing 50 entities (E) and 50 relations (R) then generate all their cross products obtaining 2500 fact pairs." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-129", "text": "In consequence the entities and relations are independent." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-130", "text": "For every e i \u2208 E, we suppose that there is a single word lexicalizing the entity noted \"e i \" ." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-131", "text": "Similarly, we note the lexicalization of r j \"r j \"." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-132", "text": "We separate these 2500 pairs into training (2450) and test (50)." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-133", "text": "Notice that similarly to Wikianswers, this toy dataset involves KB entities and relations whose type is known a priori." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-134", "text": "The training corpus is built using one simple generation rule : (e i , r j ) \u2192 \"e i r j \" ." 
}, { "sent_id": "357667e192057a48dff60edad86bf0-C001-135", "text": "Negative examples are created by replacing with probability 1/2 both entity and relation with another one." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-136", "text": "We embed all the words and KB symbols in a space of 20 dimensions." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-137", "text": "We compare the model (Bordes et al., 2014b) with the model where we enforce E and R (and also \"E\" and \"R\") to be orthogonal." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-138", "text": "This means that words or KB symbols in fact live in an embedding space of dimension 10." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-139", "text": "At test time, for a given sentence \"e i r j \", a set of (e, r) pairs is ranked and we compute the proportion of cases where the first ranked pair is correct." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-140", "text": "Table 2 shows the results for both systems on two configurations: a configuration (Accuracy(1)) where the number of pairs to be ranked is 1250 and another (Accuracy(2)) with 2500 pairs." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-141", "text": "3 In both cases, imposing the orthogonality constraint improves performance by a large margin." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-142", "text": "----------------------------------" }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-143", "text": "**WIKIANSWERS**" }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-144", "text": "Wikianswers contains a set of possible triples for each question and we re-rank these triples to report our system's performance." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-145", "text": "This is the \"reranking\" setting used in (Bordes et al., 2014b) ." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-146", "text": "Table 3 compares different systems in this setting." 
}, { "sent_id": "357667e192057a48dff60edad86bf0-C001-147", "text": "The Embedding scores are taken from (Bordes et al., 2014b) Table 3 shows that our technique improves the performance also on the larger, non-synthetic, dataset provided by Fader (2013) over the Bordes (2014b)'s method." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-148", "text": "In addition, Table 1 shows some examples where the two systems differ and where the orthogonality regularized embeddings seem to better support the identification of similar relations." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-149", "text": "For instance, \"is the argument on\" is mapped to support.r rather than be-type-of.r and \"is the religious celebration of\" to be-mostimportant-holiday.r rather then be-all-about.r." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-150", "text": "----------------------------------" }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-151", "text": "**CONCLUSION**" }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-152", "text": "This paper introduces an embedding model for question answering with orthogonality regularizer." }, { "sent_id": "357667e192057a48dff60edad86bf0-C001-153", "text": "We show that orthogonality helps to capture the differences between entities and relations and that it helps improve performance on an existing dataset." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "357667e192057a48dff60edad86bf0-C001-10" ], [ "357667e192057a48dff60edad86bf0-C001-14" ], [ "357667e192057a48dff60edad86bf0-C001-20" ], [ "357667e192057a48dff60edad86bf0-C001-22" ], [ "357667e192057a48dff60edad86bf0-C001-41" ], [ "357667e192057a48dff60edad86bf0-C001-42" ], [ "357667e192057a48dff60edad86bf0-C001-44" ], [ "357667e192057a48dff60edad86bf0-C001-61" ], [ "357667e192057a48dff60edad86bf0-C001-91" ], [ "357667e192057a48dff60edad86bf0-C001-107" ] ], "cite_sentences": [ "357667e192057a48dff60edad86bf0-C001-10", "357667e192057a48dff60edad86bf0-C001-14", "357667e192057a48dff60edad86bf0-C001-20", "357667e192057a48dff60edad86bf0-C001-22", "357667e192057a48dff60edad86bf0-C001-41", "357667e192057a48dff60edad86bf0-C001-42", "357667e192057a48dff60edad86bf0-C001-44", "357667e192057a48dff60edad86bf0-C001-61", "357667e192057a48dff60edad86bf0-C001-91", "357667e192057a48dff60edad86bf0-C001-107" ] }, "@USE@": { "gold_contexts": [ [ "357667e192057a48dff60edad86bf0-C001-19" ], [ "357667e192057a48dff60edad86bf0-C001-29" ], [ "357667e192057a48dff60edad86bf0-C001-83" ], [ "357667e192057a48dff60edad86bf0-C001-137" ], [ "357667e192057a48dff60edad86bf0-C001-145" ], [ "357667e192057a48dff60edad86bf0-C001-147" ] ], "cite_sentences": [ "357667e192057a48dff60edad86bf0-C001-19", "357667e192057a48dff60edad86bf0-C001-29", "357667e192057a48dff60edad86bf0-C001-83", "357667e192057a48dff60edad86bf0-C001-137", "357667e192057a48dff60edad86bf0-C001-145", "357667e192057a48dff60edad86bf0-C001-147" ] }, "@EXT@": { "gold_contexts": [ [ "357667e192057a48dff60edad86bf0-C001-44", "357667e192057a48dff60edad86bf0-C001-46" ] ], "cite_sentences": [ "357667e192057a48dff60edad86bf0-C001-44" ] }, "@DIF@": { "gold_contexts": [ [ "357667e192057a48dff60edad86bf0-C001-114" ] ], "cite_sentences": [ "357667e192057a48dff60edad86bf0-C001-114" ] } } }, "ABC_9272b3f7e628a156caed328d475d0c_5": { "x": [ { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-1", "text": 
"**ABSTRACT**" }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-2", "text": "ABSTRACT" }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-3", "text": "----------------------------------" }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-4", "text": "**INTRODUCTION**" }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-5", "text": "Readability refers to the ease with which a given piece of natural language text can be read and understood." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-6", "text": "Intuitively, readability emerges from an interaction between the reader and the text, and depends on the prior knowledge of the reader, his/her reading skills, interest, and motivation [23] ." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-7", "text": "Although it may seem that automatic assessment of readability would be a very complicated process, as it turns out, fairly effective readability scoring can be achieved by means of several lowlevel features." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-8", "text": "Readability has many important applications, such as assessing the quality of student essays (one of the original applications of readability scoring), designing educational materials for schoolchildren and second-language learners, moderating newspaper content to convey information more clearly and effectively, and standardizing the language-learning experience of different age groups." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-9", "text": "Readability (\"reading ease\") and its converse -reading difficulty -are associated with different grade levels in school." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-10", "text": "It is generally observed that students from higher grade levels can write and comprehend texts with greater reading difficulty than students from lower grade levels." 
}, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-11", "text": "A lot of studies in readability therefore focused on correlating readability scores with grade levels, or even predicting grade levels from readability-oriented features." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-12", "text": "Existing methods of readability assessment look into a handful of low-level signals such as average sentence length (ASL), average word length in syllables (AWL), percentage of difficult words, and number of polysyllabic words." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-13", "text": "Early studies used word frequency lists to identify difficult words." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-14", "text": "Recently, readability evaluation has been tackled as a supervised machine learning problem [12] , [17] , [22] ." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-15", "text": "There have been many different studies on readability assessment in English (cf. Section 2)." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-16", "text": "Bengali has received much less attention owing to inadequate resources and a lack of robust natural language processing tools." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-63", "text": "The number of passages from different authors is shown in Table 1 ." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-17", "text": "It is only very recently that some groups of researchers looked into readability assessment in Bengali." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-18", "text": "They observed that English readability formulas did not work well on Bengali texts [11] , [21] ." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-19", "text": "This observation is not surprising, because Bengali is very different than English." 
}, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-20", "text": "Bengali is a highly inflected language, follows subject-object-verb ordering in sentences, and has a rich morphology." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-21", "text": "Further, Bengali shows word compounding and diglossia, i.e. formal and informal language variants (sadhu bhasha and cholit bhasha)." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-22", "text": "All these factors complicate readability scoring in Bengali." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-23", "text": "Since the concept of readability is highly subjective and reader-dependent, it is necessary to find out how much two native Bengali speakers agree on the readability level of a piece of text." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-24", "text": "Generalizing from there, we performed an inter-rater agreement study on readability assessment in Bengali." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-25", "text": "This study not only enables us to see how much human annotators agree on readability assessment, but also shows how difficult it is for humans to assign consistent readability scores." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-26", "text": "Since Bengali is very different from English, we want to see if (and how) readability is affected by the peculiarities of the language." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-27", "text": "As a by-product of this study, we obtained a human-annotated gold standard dataset for readability evaluation in Bengali." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-28", "text": "The rest of this paper is organized as follows." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-29", "text": "We briefly discuss related studies in Section 2, followed by a discussion of our dataset and annotation scheme in Section 3." 
}, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-30", "text": "Experimental results are described in Section 4, along with their explanation and observations." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-31", "text": "Section 5 concludes the paper with contributions, limitations, and further research directions." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-32", "text": "----------------------------------" }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-33", "text": "**RELATED WORK**" }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-34", "text": "Readability scoring in English has a long and rich history, starting with the work of L. A. Sherman in the late nineteenth century [20] ." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-35", "text": "Among the early readability formulas were Flesch Reading Ease [7] , Dale-Chall Formula [5] , Automated Readability Index [19] , Gunning Fog Index [9] , SMOG score [16] , and Coleman-Liau Index [2] ." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-36", "text": "These early indices were based on simple features like average number of characters, words, syllables and sentences, number of difficult and polysyllabic words, etc." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-37", "text": "Albeit simple, these readability indices were surprisingly good predictors of a reader's grade level." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-38", "text": "Two different lines of work focused on children and adult readability formulas." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-39", "text": "Recently Lahiri et al. showed moderate correlation between readability indices and formality score ( [10] ) in four different domains [14] ." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-64", "text": "We assigned the 30 text passages to seven independent annotators." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-40", "text": "Sinha et al. 
classified English readability formulas into three broad categories -traditional methods, cognitively motivated methods, and machine learning methods [21] ." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-41", "text": "Traditional methods assess readability using surface features and shallow linguistic features such as the ones mentioned in the preceding paragraph." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-42", "text": "Cognitively motivated methods take into account the cohesion and coherence of text, its latent topic structure, Kintsch's propositions, etc [1] , [8] , [13] ." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-43", "text": "Finally, machine learning methods utilize sophisticated structures such as language models [3] , [4] , [18] , query logs [15] , and several other features to predict the readability of open-domain text data." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-44", "text": "There are very few studies on readability assessment in Bengali texts." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-45", "text": "We found only three lines of work that specifically looked into Bengali readability [6] , [11] , [21] ." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-46", "text": "Das and Roychoudhury worked with a miniature model of two parameters in their pioneering study [6] ." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-47", "text": "They found that the two-parameter model was a better predictor of readability than the one-parameter model." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-48", "text": "Note, however, that Das and Roychoudhury's corpus was small (only seven documents), thereby calling into question the validity of their results." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-49", "text": "Sinha et al. alleviated these problems by considering six parameters instead of just two [21] ." 
}, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-50", "text": "They further showed that English readability indices were inadequate for Bengali, and built their own readability model on 16 texts." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-51", "text": "Around the same time, Islam et al. independently reached the same conclusion [11] ." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-52", "text": "They designed a Bengali readability classifier on lexical and information-theoretic features, resulting in an F-score 50% higher than that from traditional scoring approaches." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-53", "text": "While all the above studies are very important and insightful, none of them explicitly performed an inter-rater agreement study." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-54", "text": "For reasons mentioned in Section 1, an inter-rater agreement study is very important when we talk about readability assessment." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-55", "text": "Further, none of these studies made available their readability-annotated gold standard datasets, thereby stymieing further research." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-56", "text": "We attempt to bridge these gaps in our work." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-57", "text": "----------------------------------" }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-58", "text": "**METHODOLOGY**" }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-59", "text": "We collected a corpus of 30 Bengali text passages." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-60", "text": "The passages were randomly selected from the writings of four eminent Bengali authors -Rabindranath Tagore (1861-1941), Sarat Chandra Chattopadhyay (1876-1938), Bankim Chandra Chattopadhyay (1838-1894), and Bibhutibhushan Bandyopadhyay (1894-1950)." 
}, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-61", "text": "We ensured that samples from both sadhu bhasha as well as cholit bhasha were incorporated in our corpus." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-62", "text": "We also ensured that we had both adult text as well as children's text in the mix." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-65", "text": "The annotators were 30 to 35 years of age; they were from a similar educational background and socio-economic milieu; there were four female and three male annotators; and they all were native speakers of Bengali." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-66", "text": "Annotators were asked to assign a readability rating to each of the 30 passages." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-67", "text": "The rating scale was as follows: 1) Very easy to read 2) Easy to read 3) Somewhat easy to read 4) In-between 5) Somewhat difficult to read 6) Difficult to read 7) Very difficult to read This rating scale reflects the fact that readability is not a binary/ternary variable; it is an ordinal variable." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-68", "text": "We further collected the data on whether the annotators were avid readers of Bengali or not." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-69", "text": "Each annotator rated every passage." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-70", "text": "Note that readability annotation in Bengali is challenging because passages written in sadhu bhasha tend to be harder to read than those written in cholit bhasha." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-71", "text": "Since our dataset contains both sadhu bhasha and cholit bhasha, maintaining consistency in readability rating becomes a big issue." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-72", "text": "Table 2 gives the mean readability rating of the 30 text passages, along with their standard deviations." 
}, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-73", "text": "These ratings are averages over seven independent annotations." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-74", "text": "Note from Table 2 that none of the mean ratings is 1 or 7." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-75", "text": "In other words, mean ratings never reach the extreme readability values." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-76", "text": "This phenomenon is known as the central tendency bias." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-77", "text": "Note also that the standard deviations are not very high, which should be intuitive because the rating scale varies between 1 and 7. Agreement among the annotators was measured by Cohen's kappa (\u03ba) and Spearman's rank correlation coefficient (\u03c1)." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-78", "text": "Table 3 shows the pairwise \u03ba values among different annotators, and Table 4 gives the pairwise \u03c1 values." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-79", "text": "Both tables are symmetric around the main diagonal." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-80", "text": "Note from" }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-81", "text": "----------------------------------" }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-82", "text": "**RESULTS**" }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-83", "text": "----------------------------------" }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-84", "text": "**CONCLUSION**" }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-85", "text": "We performed an inter-rater agreement study for readability assessment in Bengali." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-86", "text": "This is the first time such an agreement study has been performed." 
}, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-87", "text": "We obtained moderate to fair agreement among seven independent annotators on 30 text passages written by four eminent Bengali authors." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-88", "text": "As a byproduct of this study, we obtained a gold standard human annotated readability dataset for Bengali." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-89", "text": "We plan to release this dataset for future research." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-90", "text": "We are working on readability modeling in Bengali, and this dataset will be very helpful." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-91", "text": "An important limitation of our study is the small corpus size." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-92", "text": "We only have 30 annotated passages at our disposal, whereas Islam et al. [11] had around 300. But Islam et al.'s dataset is not annotated in as fine-grained a fashion as ours." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-93", "text": "Note also that our dataset is larger than both Sinha et al.'s 16-document dataset [21] , and Das and Roychoudhury's seven-document dataset [6] ." }, { "sent_id": "9272b3f7e628a156caed328d475d0c-C001-94", "text": "We plan to increase the size of our dataset in the future." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "9272b3f7e628a156caed328d475d0c-C001-17", "9272b3f7e628a156caed328d475d0c-C001-18" ], [ "9272b3f7e628a156caed328d475d0c-C001-39", "9272b3f7e628a156caed328d475d0c-C001-40" ], [ "9272b3f7e628a156caed328d475d0c-C001-45" ], [ "9272b3f7e628a156caed328d475d0c-C001-49", "9272b3f7e628a156caed328d475d0c-C001-50" ] ], "cite_sentences": [ "9272b3f7e628a156caed328d475d0c-C001-18", "9272b3f7e628a156caed328d475d0c-C001-40", "9272b3f7e628a156caed328d475d0c-C001-45", "9272b3f7e628a156caed328d475d0c-C001-49", "9272b3f7e628a156caed328d475d0c-C001-50" ] }, "@MOT@": { "gold_contexts": [ [ "9272b3f7e628a156caed328d475d0c-C001-45" ] ], "cite_sentences": [ "9272b3f7e628a156caed328d475d0c-C001-45" ] }, "@DIF@": { "gold_contexts": [ [ "9272b3f7e628a156caed328d475d0c-C001-92", "9272b3f7e628a156caed328d475d0c-C001-93" ] ], "cite_sentences": [ "9272b3f7e628a156caed328d475d0c-C001-93" ] } } }, "ABC_55f67c918001335974608200a87cfc_5": { "x": [ { "sent_id": "55f67c918001335974608200a87cfc-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "55f67c918001335974608200a87cfc-C001-2", "text": "We introduce a relational graph neural network with bi-directional attention mechanism and hierarchical representation learning for open-domain question answering task." }, { "sent_id": "55f67c918001335974608200a87cfc-C001-3", "text": "Our model can learn contextual representation by jointly learning and updating the query, knowledge graph, and document representations." }, { "sent_id": "55f67c918001335974608200a87cfc-C001-4", "text": "The experiments suggest that our model achieves state-of-the-art on the WebQuestionsSP benchmark." }, { "sent_id": "55f67c918001335974608200a87cfc-C001-5", "text": "Preprint." }, { "sent_id": "55f67c918001335974608200a87cfc-C001-6", "text": "Under review." 
}, { "sent_id": "55f67c918001335974608200a87cfc-C001-7", "text": "----------------------------------" }, { "sent_id": "55f67c918001335974608200a87cfc-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "55f67c918001335974608200a87cfc-C001-9", "text": "Fusing structured knowledge from knowledge graphs into deep models using Graph Neural Networks (GNN) [23, 30, 29] has been shown to improve their performance on tasks such as visual question answering [12] , object detection [9] , natural language inference [2] , neural machine translation [11] , and open-domain question answering [17] ." }, { "sent_id": "55f67c918001335974608200a87cfc-C001-10", "text": "This particularly helps question answering neural models such as Memory Networks [22] and Key-Value Memory Networks [10] by providing them with well-structured knowledge on specific and open domains [28] ." }, { "sent_id": "55f67c918001335974608200a87cfc-C001-11", "text": "Most models, however, answer questions using a single information source, usually either a text corpus, or a single knowledge graph." }, { "sent_id": "55f67c918001335974608200a87cfc-C001-12", "text": "Text corpora have high coverage but extracting information from them is challenging whereas knowledge graphs are incomplete but are easier to extract answers from [17] ." }, { "sent_id": "55f67c918001335974608200a87cfc-C001-13", "text": "In this paper, we propose a relational GNN for open-domain question answering that learns contextual knowledge graph embeddings by jointly updating the embeddings from a knowledge graph and a set of linked documents." 
}, { "sent_id": "55f67c918001335974608200a87cfc-C001-14", "text": "More specifically, our contributions are as follows: (1) we use documents as contextualized relations to augment the knowledge graph and co-train the knowledge graph and document representations, (2) we introduce a bi-directional graph attention mechanism for knowledge graphs, (3) we propose a simple graph coarsening method to fuse node and cluster representations, and (4) we show that our model achieves state-of-the-art results on the WebQuestionsSP benchmark." }, { "sent_id": "55f67c918001335974608200a87cfc-C001-15", "text": "----------------------------------" }, { "sent_id": "55f67c918001335974608200a87cfc-C001-16", "text": "**RELATED WORKS**" }, { "sent_id": "55f67c918001335974608200a87cfc-C001-17", "text": "Graph representation learning on knowledge graphs allows projecting high-level factual information into embedding spaces which can be fused with other learned structural representations to improve down-stream tasks." }, { "sent_id": "55f67c918001335974608200a87cfc-C001-18", "text": "Pioneer works such as Trans-E [1] , Complex-E [18] , Hole-E [13] , and DistMul [25] use unsupervised and mostly linear models to learn such pre-trained representations." }, { "sent_id": "55f67c918001335974608200a87cfc-C001-19", "text": "A few recent works, on the other hand, use GNNs to compute the knowledge graph representation [24, 20, 27] ." }, { "sent_id": "55f67c918001335974608200a87cfc-C001-20", "text": "These high-level knowledge graph representations are particularly important for question answering task [22, 10, 17] ." }, { "sent_id": "55f67c918001335974608200a87cfc-C001-21", "text": "We use pre-trained representations to initialize the model and then update them using a relational and bi-directional GNN model." }, { "sent_id": "55f67c918001335974608200a87cfc-C001-22", "text": "Only a few works fuse knowledge graphs with text corpora to answer questions." 
}, { "sent_id": "55f67c918001335974608200a87cfc-C001-23", "text": "In [3] early fusion of knowledge graph facts and text is performed using Key-Value Memory Networks (KVMN)." }, { "sent_id": "55f67c918001335974608200a87cfc-C001-24", "text": "This model, however, ignores relational structure between the text and knowledge graph." }, { "sent_id": "55f67c918001335974608200a87cfc-C001-25", "text": "Our model links the knowledge graph and documents through document-contextualized edges and also links entities with their positions in the corpus." }, { "sent_id": "55f67c918001335974608200a87cfc-C001-26", "text": "This linking is used in GRAFT-Net as well which also performs question answering through fusing learned knowledge graph and linked document representations [17] ." }, { "sent_id": "55f67c918001335974608200a87cfc-C001-27", "text": "Unlike GRAFT-Net, our model uses variants of differential pooling [26] and bi-directional graph attention [19] for more powerful message passing." }, { "sent_id": "55f67c918001335974608200a87cfc-C001-28", "text": "Our model also introduces trainable documentcontextualized relation embeddings instead of exclusively relying on fixed relation representations." }, { "sent_id": "55f67c918001335974608200a87cfc-C001-29", "text": "----------------------------------" }, { "sent_id": "55f67c918001335974608200a87cfc-C001-30", "text": "**METHOD**" }, { "sent_id": "55f67c918001335974608200a87cfc-C001-31", "text": "Assume a knowledge graph G = (V, E, R) where V , R, and E are sets of entities, relations, and edges, respectively." }, { "sent_id": "55f67c918001335974608200a87cfc-C001-32", "text": "Each edge e i = (s, r, o) \u2208 E denotes an object o interacting with subject s through a relationship r. Given a textual query q \u2208 (w 1 ...w |q| ) and a set of relevant documents D linked to G by an imperfect linker, the task is to predict the answer nodes: v aq \u2208 V ." 
}, { "sent_id": "55f67c918001335974608200a87cfc-C001-33", "text": "A query can have zero or multiple answers and hence the task is reduced to binary node classification on G (i.e., binary cross entropy loss)." }, { "sent_id": "55f67c918001335974608200a87cfc-C001-34", "text": "Following [17] , we first extract a subgraph G q \u2282 G which contains v aq with high probability." }, { "sent_id": "55f67c918001335974608200a87cfc-C001-35", "text": "This is done by linking the query entities to G and expanding their neighborhood using Personalized PageRank (PPR) method." }, { "sent_id": "55f67c918001335974608200a87cfc-C001-36", "text": "We then initialize G q , q, D, and R in embedding space as follows." }, { "sent_id": "55f67c918001335974608200a87cfc-C001-37", "text": "Each entity v \u2208 V is initialized using pre-trained TransE embeddings [1] : h" }, { "sent_id": "55f67c918001335974608200a87cfc-C001-38", "text": "is the maximum number of tokens in documents, and H (l) D k,j \u2208 R dw corresponds to the jth token embedding of dimension d w in the kth document." }, { "sent_id": "55f67c918001335974608200a87cfc-C001-39", "text": "A bidirectional LSTM that takes in pre-trained GloVe embeddings [15] of tokens and computes a global sequence embedding is shared between query and documents to initialize their embeddings: h" }, { "sent_id": "55f67c918001335974608200a87cfc-C001-40", "text": "\u00d7dr is a trainable parameter, is the concatenation operator, and \u00b5 glove \u2208 R dw is the mean pool over GloVe embeddings of the relation tokens." }, { "sent_id": "55f67c918001335974608200a87cfc-C001-41", "text": "----------------------------------" }, { "sent_id": "55f67c918001335974608200a87cfc-C001-42", "text": "**DOCUMENTS AS RELATIONS**" }, { "sent_id": "55f67c918001335974608200a87cfc-C001-43", "text": "We use documents to explicitly augment the knowledge graph by forming additional relations between entities." 
}, { "sent_id": "55f67c918001335974608200a87cfc-C001-44", "text": "Assuming two unique entities v i and v j co-occurring within document d k , we compute the document relation embedding for both plausible directions between v i and v j ." }, { "sent_id": "55f67c918001335974608200a87cfc-C001-45", "text": "This lets the model learn to attend to the direction that optimizes the objective." }, { "sent_id": "55f67c918001335974608200a87cfc-C001-46", "text": "These embeddings are computed as follows: h" }, { "sent_id": "55f67c918001335974608200a87cfc-C001-47", "text": "d k is the learned textual embedding of the document and h" }, { "sent_id": "55f67c918001335974608200a87cfc-C001-48", "text": "d k ,p denotes the textual representation of the entity and is computed by summing up the textual embeddings of its tokens within the document (i.e., M d k (v) returns the positions of entity v in document d k )." }, { "sent_id": "55f67c918001335974608200a87cfc-C001-49", "text": "An example of this process is illustrated in Figure 1 ." }, { "sent_id": "55f67c918001335974608200a87cfc-C001-50", "text": "----------------------------------" }, { "sent_id": "55f67c918001335974608200a87cfc-C001-51", "text": "**BI-DIRECTIONAL GRAPH ATTENTION**" }, { "sent_id": "55f67c918001335974608200a87cfc-C001-52", "text": "To update the node embedding h (l) v we aggregate the embeddings of its connecting edges." }, { "sent_id": "55f67c918001335974608200a87cfc-C001-53", "text": "An edge embedding is computed by aggregating its relation embedding h (l) r with the neighbor node embedding h (l) v connecting through that edge." }, { "sent_id": "55f67c918001335974608200a87cfc-C001-54", "text": "Let e out = (v, r, v i ) \u2208 E and e in = (v j , r, v) \u2208 E represent two facts in which the node v is either the subject or the object." }, { "sent_id": "55f67c918001335974608200a87cfc-C001-55", "text": "If the node is an object, we will refer the edge to point inwards, and outwards if it is a subject." 
}, { "sent_id": "55f67c918001335974608200a87cfc-C001-56", "text": "The edge embeddings are computed as follows: eout denote the embeddings of the inward and outward edges connecting to node v, and W (l) r and W (l) v are trainable parameters." }, { "sent_id": "55f67c918001335974608200a87cfc-C001-57", "text": "To distinguish inward edges from outward edges, we negate h r ." }, { "sent_id": "55f67c918001335974608200a87cfc-C001-58", "text": "This is distinct from previous approaches which only process incoming nodes [17] ." }, { "sent_id": "55f67c918001335974608200a87cfc-C001-59", "text": "The next step is aggregating the embeddings of the edges connecting to the node, i.e., h l eout and h l ein ." }, { "sent_id": "55f67c918001335974608200a87cfc-C001-60", "text": "We apply two attention mechanisms to perform the aggregation and hence the model separately aggregates weighted sums of edge embeddings over each attention parameters: \u03b1 GAT is based on the similarity between a node and its neighbors with respect to the relationship between them." }, { "sent_id": "55f67c918001335974608200a87cfc-C001-61", "text": "Assume (v i , r, v j ) is an edge between nodes v i and v j with relationship of type r. The edge score is defined as the dot product between the edge embedding of the original direction and inverted direction (v j , r, v i ) and then normalized over all inward edges: \u03b1 )." }, { "sent_id": "55f67c918001335974608200a87cfc-C001-62", "text": "Unlike graph attention [26] , this method addresses heterogeneous directed knowledge graphs." 
}, { "sent_id": "55f67c918001335974608200a87cfc-C001-63", "text": "Finally, the model updates the node embedding (Figure 2):" }, { "sent_id": "55f67c918001335974608200a87cfc-C001-64", "text": "----------------------------------" }, { "sent_id": "55f67c918001335974608200a87cfc-C001-65", "text": "**HIERARCHICAL AGGREGATION**" }, { "sent_id": "55f67c918001335974608200a87cfc-C001-66", "text": "GNN models cannot learn hierarchical representations as they do not exploit the compositionality of graphs." }, { "sent_id": "55f67c918001335974608200a87cfc-C001-67", "text": "Inspired by [26] we define a linear layer to assign nodes into clusters and then use them to represent the nodes." }, { "sent_id": "55f67c918001335974608200a87cfc-C001-68", "text": "This increases the receptive field by letting messages pass among nodes belonging to the same cluster." }, { "sent_id": "55f67c918001335974608200a87cfc-C001-69", "text": "We compute the soft cluster assignments using a linear layer" }, { "sent_id": "55f67c918001335974608200a87cfc-C001-70", "text": "where H v \u2208 R nv\u00d7dv is the node embedding matrix, W (l) c \u2208 R dv\u00d7nc is a trainable parameter, and C (l) \u2208 R nv\u00d7nc is the normalized soft assignment matrix mapping n v nodes to n c clusters." }, { "sent_id": "55f67c918001335974608200a87cfc-C001-71", "text": "We then compute the cluster centroids using H (l) c = C (l) H (l) v and compute the cluster-based node representation using H (l) vc = softmax((H (l) v W (l) c ) )H (l) c ." }, { "sent_id": "55f67c918001335974608200a87cfc-C001-72", "text": "Finally, we concatenate the node representation in the final layer with all the cluster-based node representations from previous layers (i.e., similar to DenseNet [5] )." }, { "sent_id": "55f67c918001335974608200a87cfc-C001-73", "text": "We reduce the dimensionality using a trainable parameter W f inal \u2208 R dvL\u00d7dv followed by a sigmoid function to produce the probability score per node." 
}, { "sent_id": "55f67c918001335974608200a87cfc-C001-74", "text": "----------------------------------" }, { "sent_id": "55f67c918001335974608200a87cfc-C001-75", "text": "**RESULTS**" }, { "sent_id": "55f67c918001335974608200a87cfc-C001-76", "text": "We implemented the model with Pytorch [14] on an Nvidia DGX-1 server with 8 Volta GPUs and optimized it using Adam [7] with an initial learning rate of 0.001 and batch size of 10 for 100 epochs with early stopping." }, { "sent_id": "55f67c918001335974608200a87cfc-C001-77", "text": "The learning rate is decayed by a factor of 0.8 every 10 epochs." }, { "sent_id": "55f67c918001335974608200a87cfc-C001-78", "text": "We also apply batch-normalization [6] and dropout [16] ." }, { "sent_id": "55f67c918001335974608200a87cfc-C001-79", "text": "We evaluated our model on the WebQuestionsSP dataset consisting of 4,737 natural language questions (i.e., 3,098 training, 250 validation, and 1,639 test questions) posed over Freebase entities [8] ." }, { "sent_id": "55f67c918001335974608200a87cfc-C001-80", "text": "Following [17] , we apply the same pre-processing and report average F 1 and Hits@1, as well as micro-average, and macro-average F 1 scores." }, { "sent_id": "55f67c918001335974608200a87cfc-C001-81", "text": "[4] suggests that micro-averaged F 1 best represents the performance on imbalanced binary classification." }, { "sent_id": "55f67c918001335974608200a87cfc-C001-82", "text": "Table 1 shows the performance of our model compared to other models that also feature early fusion of the knowledge graph and text." }, { "sent_id": "55f67c918001335974608200a87cfc-C001-83", "text": "These include Key-Value Memory Networks (KVMN) [3] and GRAFT-Net [17] ." }, { "sent_id": "55f67c918001335974608200a87cfc-C001-84", "text": "The results suggest that our model outperforms GRAFT-Net with an absolute increase in all metrics." 
}, { "sent_id": "55f67c918001335974608200a87cfc-C001-85", "text": "To investigate the effect of the proposed methods we performed an ablation study by masking each introduced component and training and evaluating the model." }, { "sent_id": "55f67c918001335974608200a87cfc-C001-86", "text": "The results in Table 2 (Appendix) show the effect of each component and suggest that all introduced components contribute to the performance." }, { "sent_id": "55f67c918001335974608200a87cfc-C001-87", "text": "----------------------------------" }, { "sent_id": "55f67c918001335974608200a87cfc-C001-88", "text": "**CONCLUSION**" }, { "sent_id": "55f67c918001335974608200a87cfc-C001-89", "text": "We introduced a relational GNN with bi-directional attention and hierarchical representation learning for open-domain question answering that jointly learns to represent the query, knowledge graph, and documents." }, { "sent_id": "55f67c918001335974608200a87cfc-C001-90", "text": "The experiments showed that our model achieves state-of-the-art performance on WebQuestionsSP." }, { "sent_id": "55f67c918001335974608200a87cfc-C001-91", "text": "For future directions, we are planning to expand our model towards cross-modal question answering benchmarks such as Fact-based Visual Question Answering (FVQA) [21] ." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "55f67c918001335974608200a87cfc-C001-9" ], [ "55f67c918001335974608200a87cfc-C001-19", "55f67c918001335974608200a87cfc-C001-20" ], [ "55f67c918001335974608200a87cfc-C001-26" ] ], "cite_sentences": [ "55f67c918001335974608200a87cfc-C001-9", "55f67c918001335974608200a87cfc-C001-20", "55f67c918001335974608200a87cfc-C001-26" ] }, "@MOT@": { "gold_contexts": [ [ "55f67c918001335974608200a87cfc-C001-12", "55f67c918001335974608200a87cfc-C001-13" ], [ "55f67c918001335974608200a87cfc-C001-19", "55f67c918001335974608200a87cfc-C001-20", "55f67c918001335974608200a87cfc-C001-21" ] ], "cite_sentences": [ "55f67c918001335974608200a87cfc-C001-12", "55f67c918001335974608200a87cfc-C001-20" ] }, "@EXT@": { "gold_contexts": [ [ "55f67c918001335974608200a87cfc-C001-19", "55f67c918001335974608200a87cfc-C001-20", "55f67c918001335974608200a87cfc-C001-21" ] ], "cite_sentences": [ "55f67c918001335974608200a87cfc-C001-20" ] }, "@USE@": { "gold_contexts": [ [ "55f67c918001335974608200a87cfc-C001-34" ], [ "55f67c918001335974608200a87cfc-C001-80" ] ], "cite_sentences": [ "55f67c918001335974608200a87cfc-C001-34", "55f67c918001335974608200a87cfc-C001-80" ] }, "@DIF@": { "gold_contexts": [ [ "55f67c918001335974608200a87cfc-C001-57", "55f67c918001335974608200a87cfc-C001-58" ], [ "55f67c918001335974608200a87cfc-C001-82", "55f67c918001335974608200a87cfc-C001-83", "55f67c918001335974608200a87cfc-C001-84" ] ], "cite_sentences": [ "55f67c918001335974608200a87cfc-C001-58", "55f67c918001335974608200a87cfc-C001-83" ] } } }, "ABC_bdcd8b0f3a56606427ee298d454b52_5": { "x": [ { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-2", "text": "A variety of information extraction techniques rely on the fact that instances of the same relation are \"distributionally similar,\" in that they tend to appear in similar textual contexts." 
}, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-3", "text": "We demonstrate that extraction accuracy depends heavily on the accuracy of the language model utilized to estimate distributional similarity." }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-4", "text": "An unsupervised model selection technique based on this observation is shown to reduce extraction and type-checking error by 26% over previous results, in experiments with Hidden Markov Models." }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-5", "text": "The results suggest that optimizing statistical language models over unlabeled data is a promising direction for improving weakly supervised and unsupervised information extraction." }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-6", "text": "----------------------------------" }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-8", "text": "Many weakly supervised and unsupervised information extraction techniques assess the correctness of extractions using the distributional hypothesis-the notion that words with similar meanings tend to occur in similar contexts (Harris, 1985) ." }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-9", "text": "A candidate extraction of a relation is deemed more likely to be correct when it appears in contexts similar to those of \"seed\" instances of the relation, where the seeds may be specified by hand (Pa\u015fca et al., 2006) , taken from an existing, incomplete knowledge base (Snow et al., 2006; Pantel et al., 2009) , or obtained in an unsupervised manner using a generic extractor (Banko et al., 2007) ." }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-10", "text": "We refer to this technique as Assessment by Distributional Similarity (ADS)." 
}, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-11", "text": "Typically, distributional similarity is computed by comparing co-occurrence counts of extractions and seeds with various contexts found in the corpus." }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-12", "text": "Statistical Language Models (SLMs) include methods for more accurately estimating co-occurrence probabilities via back-off, smoothing, and clustering techniques (e.g. (Chen and Goodman, 1996; Rabiner, 1989; Bell et al., 1990) )." }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-13", "text": "Because SLMs can be trained from only unlabeled text, they can be applied for ADS even when the relations of interest are not specified in advance (Downey et al., 2007) ." }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-14", "text": "Unlabeled text is abundant in large corpora like the Web, making nearly-ceaseless automated optimization of SLMs possible. But how fruitful is such an effort likely to be-to what extent does optimizing a language model over a fixed corpus lead to improvements in assessment accuracy?" }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-15", "text": "In this paper, we show that an ADS technique based on SLMs is improved substantially when the language model it employs becomes more accurate." }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-16", "text": "In a large-scale set of experiments, we quantify how language model perplexity correlates with ADS performance over multiple data sets and SLM techniques." }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-17", "text": "The experiments show that accuracy over unlabeled data can be used for selecting among SLMs-for an ADS approach utilizing Hidden Markov Models, this results in an average error reduction of 26% over previous results in extraction and type-checking tasks." 
}, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-18", "text": "----------------------------------" }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-19", "text": "**EXTRACTION ASSESSMENT WITH LANGUAGE MODELS**" }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-20", "text": "We begin by formally defining the extraction and typechecking tasks we consider, then discuss statistical language models and their utilization for extraction assessment." }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-21", "text": "The extraction task we consider is formalized as follows: given a corpus, a target relation R, a list of seed instances S R , and a list of candidate extractions U R , the task is to order elements of U R such that correct instances for R are ranked above extraction errors." }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-22", "text": "Let U Ri denote the set of the ith arguments of the extractions in U R , and let S Ri be defined similarly for the seed set S R ." }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-23", "text": "For relations of arity greater than one, we consider the typechecking task, an important sub-task of extraction (Downey et al., 2007) ." }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-24", "text": "The typechecking task is to rank extractions with arguments that are of the proper type for a relation above type errors." }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-25", "text": "As an example, the extraction Founded(Bill Gates, Oracle) is type correct, but is not correct for the extraction task." }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-26", "text": "----------------------------------" }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-27", "text": "**STATISTICAL LANGUAGE MODELS**" }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-28", "text": "A Statistical Language Model (SLM) is a probability distribution P (w) over word sequences w = (w 1 , ..., w r )." 
}, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-29", "text": "The most common SLM techniques are n-gram models, which are Markov models in which the probability of a given word is dependent on only the previous n\u22121 words." }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-30", "text": "The accuracy of an n-gram model of a corpus depends on two key factors: the choice of n, and the smoothing technique employed to assign probabilities to word sequences seen infrequently in training." }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-31", "text": "We experiment with choices of n from 2 to 4, and two popular smoothing approaches, Modified Kneser-Ney (Chen and Goodman, 1996) and Witten-Bell (Bell et al., 1990) ." }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-32", "text": "Unsupervised Hidden Markov Models (HMMs) are an alternative SLM approach previously shown to offer accuracy and scalability advantages over ngram models in ADS (Downey et al., 2007) ." }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-33", "text": "An HMM models a sentence w as a sequence of observations w i each generated by a hidden state variable t i ." }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-34", "text": "Here, hidden states take values from {1, . . . , T }, and each hidden state variable is itself generated by some number k of previous hidden states." }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-35", "text": "Formally, the joint distribution of a word sequence w given a corresponding state sequence t is:" }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-36", "text": "The distributions on the right side of Equation 1 are learned from the corpus in an unsupervised manner using Expectation-Maximization, such that words distributed similarly in the corpus tend to be generated by similar hidden states (Rabiner, 1989) ." 
}, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-37", "text": "----------------------------------" }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-38", "text": "**PERFORMING ADS WITH SLMS**" }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-39", "text": "The Assessment by Distributional Similarity (ADS) technique is to rank extractions in U R in decreasing order of distributional similarity to the seeds, as estimated from the corpus." }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-40", "text": "In our experiments, we utilize an ADS approach previously proposed for HMMs (Downey et al., 2007) and adapt it to also apply to n-gram models, as detailed below." }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-41", "text": "Define a context of an extraction argument e i to be a string containing the m words preceding and m words following an occurrence of e i in the corpus." }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-42", "text": "Let C i = {c 1 , c 2 , ..., c |C i | } be the union of all contexts of extraction arguments e i and seed arguments s i for a given relation R. We create a probabilistic context vector for each extraction e i where the j-th dimension of the vector is the probability of the context surrounding given the extraction, P (c j |e i ), computed from the language model." }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-43", "text": "1 We rank the extractions in U R according to how similar their arguments' contextual distributions, P (c|e i ), are to those of the seed arguments." }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-44", "text": "Specifically, extractions are ranked according to:" }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-45", "text": "where KL represents KL Divergence, and the outer sum is taken over arguments e i of the extraction e. For HMMs, we alternatively rank extractions using the HMM state distributions P (t|e i ) in place of the probabilistic context vectors P (c|e i )." 
}, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-46", "text": "Our experiments show that state distributions are much more accurate for ADS than are HMM context vectors." }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-47", "text": "----------------------------------" }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-48", "text": "**EXPERIMENTS**" }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-49", "text": "In this section, we present experiments showing that SLM accuracy correlates strongly with ADS performance." }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-50", "text": "We also show that SLM performance can be used for model selection, leading to an ADS technique that outperforms previous results." }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-51", "text": "----------------------------------" }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-52", "text": "**EXPERIMENTAL METHODOLOGY**" }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-53", "text": "We experiment with a wide range of n-gram and HMM models." }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-54", "text": "The n-gram models are trained using the SRILM toolkit (Stolcke, 2002 variety of HMM configurations over a large corpus requires a scalable training architecture." }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-55", "text": "We constructed a parallel HMM codebase using the Message Passing Interface (MPI), and trained the models on a supercomputing cluster." }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-56", "text": "All language models were trained on a corpus of 2.8M sentences of Web text (about 60 million tokens)." }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-57", "text": "SLM performance is measured using the standard perplexity metric, and assessment accuracy is measured using area under the precision-recall curve (AUC), a standard metric for ranked lists of extractions." 
}, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-58", "text": "We evaluated performance on three distinct data sets." }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-59", "text": "The first two data sets evaluate ADS for unsupervised information extraction, and were taken from (Downey et al., 2007) ." }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-60", "text": "The first, Unary, was an extraction task for unary relations (Company, Country, Language, Film) and the second, Binary, was a type-checking task for binary relations (Conquered, Founded, Headquartered, Merged)." }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-61", "text": "The 10 most frequent extractions served as bootstrapped seeds." }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-62", "text": "The two test sets contained 361 and 265 extractions, respectively." }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-63", "text": "The third data set, Wikipedia, evaluates ADS on weaklysupervised extraction, using seeds and extractions taken from Wikipedia 'List of' pages (Pantel et al., 2009) ." }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-64", "text": "Seed sets of various sizes (5, 10, 15 and 20) were randomly selected from each list, and we present results averaged over 10 random samplings." }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-65", "text": "Other members of the seed list were added to a test set as correct extractions, and elements from other lists were added as errors." }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-66", "text": "The data set included 2264 extractions across 36 unary relations, including Composers and US Internet Companies." 
}, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-67", "text": "----------------------------------" }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-68", "text": "**OPTIMIZING LANGUAGE MODELS FOR IE**" }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-69", "text": "The first question we investigate is whether optimizing individual language models leads to better performance in ADS." }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-70", "text": "We measured the correlation between SLM perplexity and ADS performance as training proceeds in HMMs, and as n and the smoothing technique vary in the n-gram models." }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-71", "text": "Table 1 shows that as the SLM becomes more accurate (i.e. as perplexity decreases), ADS performance increases." }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-72", "text": "The correlation is strong (averaging -0.742) and is consistent across model configurations and data sets." }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-73", "text": "The low positive correlation for the ngram models on Wikipedia is likely due to a \"floor effect\"; the models have low performance overall on the difficult Wikipedia data set." }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-74", "text": "The lowestperplexity n-gram model (Mod Kneser-Ney smoothing with n=3, KN3) does exhibit the best IE performance, at 0.039 (the average performance of the HMM models is more than twice this, at 0.084)." }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-75", "text": "Figure 1 shows the relationship between SLM and ADS performance in detail for the best-performing HMM configuration." 
}, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-76", "text": "----------------------------------" }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-77", "text": "**MODEL SELECTION**" }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-78", "text": "Different language models can be configured in different ways: for example, HMMs require choices for the hyperparameters k and T ." }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-79", "text": "Here, we show that SLM perplexity can be used to select a high-quality model configuration for ADS using only unlabeled data." }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-80", "text": "We evaluate on the Unary and Binary data sets, since they have been employed in previous work on our corpora." }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-81", "text": "Figure 2 shows that for HMMs, ADS performance increases as perplexity decreases across various model configurations (a similar relationship holds for n-gram models)." }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-82", "text": "A model selection technique that picks the HMM model with lowest perplexity (HMM 1-100) results in better ADS performance than previous results." }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-83", "text": "As shown in Table 2, HMM 1-100 reduces error over the HMM-T model in (Downey et al., 2007) by 26%, on average." }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-84", "text": "The experiments also reveal an important difference between the HMM and n-gram approaches." }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-85", "text": "While KN3 is more accurate in SLM than our HMM models, it performs worse in ADS on average." }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-86", "text": "For example, HMM 1-25 underperforms KN3 in perpexity, at 537.2 versus 227.1, but wins in ADS, 0.880 to 0.853." 
}, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-87", "text": "We hypothesize that this is because the latent state distributions in the HMMs provide a more informative distributional similarity measure." }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-88", "text": "Indeed, when we compute distributional similarity for HMMs using probabilistic context vectors as opposed to state distributions, ADS performance for HMM 1-25 decreases to 5.8% below that of KN3." }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-89", "text": "----------------------------------" }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-90", "text": "**CONCLUSIONS**" }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-91", "text": "We presented experiments showing that estimating distributional similarity with more accurate statistical language models results in more accurate extrac- tion assessment." }, { "sent_id": "bdcd8b0f3a56606427ee298d454b52-C001-92", "text": "We note that significantly larger, more powerful language models are possible beyond those evaluated here, which (based on the trajectory observed in Figure 2 ) may offer significant improvements in assessment accuracy." 
} ], "y": { "@DIF@": { "gold_contexts": [ [ "bdcd8b0f3a56606427ee298d454b52-C001-13", "bdcd8b0f3a56606427ee298d454b52-C001-15" ], [ "bdcd8b0f3a56606427ee298d454b52-C001-82", "bdcd8b0f3a56606427ee298d454b52-C001-83" ] ], "cite_sentences": [ "bdcd8b0f3a56606427ee298d454b52-C001-13", "bdcd8b0f3a56606427ee298d454b52-C001-83" ] }, "@USE@": { "gold_contexts": [ [ "bdcd8b0f3a56606427ee298d454b52-C001-23" ], [ "bdcd8b0f3a56606427ee298d454b52-C001-59" ] ], "cite_sentences": [ "bdcd8b0f3a56606427ee298d454b52-C001-23", "bdcd8b0f3a56606427ee298d454b52-C001-59" ] }, "@BACK@": { "gold_contexts": [ [ "bdcd8b0f3a56606427ee298d454b52-C001-31", "bdcd8b0f3a56606427ee298d454b52-C001-32" ] ], "cite_sentences": [ "bdcd8b0f3a56606427ee298d454b52-C001-32" ] }, "@EXT@": { "gold_contexts": [ [ "bdcd8b0f3a56606427ee298d454b52-C001-40" ] ], "cite_sentences": [ "bdcd8b0f3a56606427ee298d454b52-C001-40" ] } } }, "ABC_866cc7036c626f07fba10ab2a839d8_5": { "x": [ { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-2", "text": "We present Mean Box Pooling, a novel visual representation that pools over CNN representations of a large number, highly overlapping object proposals." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-3", "text": "We show that such representation together with nCCA, a successful multimodal embedding technique, achieves state-of-the-art performance on the Visual Madlibs task." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-4", "text": "Moreover, inspired by the nCCA's objective function, we extend classical CNN+LSTM approach to train the network by directly maximizing the similarity between the internal representation of the deep learning architecture and candidate answers." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-5", "text": "Again, such approach achieves a significant improvement over the prior work that also uses CNN+LSTM approach on Visual Madlibs." 
}, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-6", "text": "----------------------------------" }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-8", "text": "Question answering about real-world images is a relatively new research thread [2, 5, 14, 15] that requires a chain of machine visual perception, natural language understanding, and deductive capabilities to successfully come up with an answer on a question about visual content." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-9", "text": "Although similar in nature to image description [3, 8, 27] it requires a more focused attention to details in the visual content, yet it is easier to evaluate different architectures on the task." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-10", "text": "Moreover, in contrast to many classical Computer Vision problems such as recognition or detection, the task does not evaluate any internal representation of methods, yet it requires a holistic understanding of the image." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-11", "text": "Arguably, it is also less prone to over-interpretations compared with the classical Turing Test [16, 25] ." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-12", "text": "To foster progress on this task, a few metrics and datasets have been proposed [2, 4, 14, 20] ." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-13", "text": "The recently introduced Visual Madlibs task [32] removes ambiguities in question or scene interpretations by introducing a multiple choice \"filling the blank\" task, where a c 2016." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-14", "text": "The copyright of this document resides with its authors." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-15", "text": "It may be distributed unchanged freely in print or electronic forms." 
}, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-16", "text": "----------------------------------" }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-17", "text": "**ARXIV:1608.02717V1 [CS.CV] 9 AUG 2016**" }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-18", "text": "machine has to complete the prompted sentence." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-19", "text": "Such completed sentence is next matched against four ground truth answers." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-20", "text": "Thanks to such a problem formulation, a traditional accuracy measure can be used to monitor the progress on this task." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-21", "text": "Due to its unambiguous evaluation, this work focuses on this task." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-22", "text": "Contributions." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-23", "text": "We present two main contributions." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-24", "text": "Mean Box Pooling: We argue for a rich image representation in the form of pooled representations of the objects." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-57", "text": "Furthermore, we also extend popular in the VQA community CNN+LSTM approach by learning to compare in the answer space." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-25", "text": "Although related ideas have been explored for visual question answering [22] , and even have been used in Visual Madlibs [32] , we are first to show a significant improvement of such representation by using object proposals." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-26", "text": "More precisely, we argue for an approach that pools over a large number, highly overlapping object proposals." 
}, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-27", "text": "This, arguably, increases the recall of extracting bounding boxes that describe an object, but also allows for multi-scale and multi-parts object representation." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-28", "text": "Our approach in the combination with the Normalized Correlation Analysis embedding technique improves on the state-of-the-art of the Visual Madlibs task." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-29", "text": "Text-Embedding Loss: Motivated by the popularity of deep architectures for visual question answering, that combine a global CNN image representation with an LSTM [7] question representation [4, 13, 17, 20, 29, 30, 31] , as well as the leading performance of nCCA on the multi-choice Visual Madlibs task [32] , we propose a novel extension of the CNN+LSTM architecture that chooses a prompt completion out of four candidates (see Figure 4 ) by measuring similarities directly in the embedding space." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-30", "text": "This contrasts with the prior approach of [32] that uses a post-hoc comparison between the discrete output of the CNN+LSTM method and all four candidates." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-31", "text": "To achieve this, we directly train an LSTM with a cosine similarity loss between the output embedding of the network and language representation of the ground truth completion." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-32", "text": "Such an approach integrates more tightly with the multi-choice filling the blanks task, and significantly outperforms the prior CNN+LSTM method [32] ." 
}, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-33", "text": "----------------------------------" }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-34", "text": "**RELATED WORK**" }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-35", "text": "Question answering about images is a relatively new task that switches focus from recognizing objects in the scene to a holistic \"image understanding\"." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-36", "text": "The very first work [14] on this topic has considered real world indoor scenes with associated natural language questions and answers." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-37", "text": "Since then different variants and larger datasets have been proposed: FM-IQA [4] , COCO-QA [20] , and VQA [2] ." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-38", "text": "Although answering questions on images is, arguably, more susceptible to automatic evaluation than the image description task [3, 8, 27] , ambiguities in the output space still remain." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-39", "text": "While such ambiguities can be handled using appropriate metrics [14, 15, 17, 26] , Visual Madlibs [32] has taken another direction, and handles them directly within the task." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-40", "text": "It asks machines to fill the blank prompted with a natural language description with a phrase chosen from four candidate completions (Figure 4 )." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-41", "text": "In general, the phrase together with the prompted sentence should serve as the accurate description of the image." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-42", "text": "With such problem formulation the standard accuracy measure is sufficient to automatically evaluate the architectures." 
}, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-43", "text": "The first proposed architecture [14] to deal with the question answering about images task uses image analysis methods and a set of hand-defined schemas to create a database of visual facts." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-44", "text": "The mapping from questions to executable symbolic representations is done by a semantic parser [12] ." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-45", "text": "Later deep learning approaches for question answering either generate [4, 17] answers or predict answers [13, 20] over a fixed set of choices." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-46", "text": "Most recently, attention based architectures, which put weights on a fixed grid Figure 2 : Overview of our full model, i.e. our proposed image representation using Mean Box Pooling, text encoding using average of Word2Vec representations, and normalized CCA for learning the joint space." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-47", "text": "over the image, yield state of the art results [29, 30, 31] ." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-48", "text": "Another, more focused \"hard\" attention, has also been studied in the image-to-text retrieval scenario [9] as well as fine-grained categorization [33] , person recognition [19] and zero-shot learning [1] ." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-49", "text": "Here representations are computed on objects, visual fragments or parts, that are further aggregated to form a visual representation." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-50", "text": "Closer to our work, [22] use Edge Boxes [34] to form memories [28] consisting of different image fragments that are either pooled or \"softmax\" weighted in order to provide the final score." 
}, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-51", "text": "However, in contrast to [22] , our experiments indicate a strong improvement by using object proposals." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-52", "text": "While a majority of the most recent work on visual question answering combine LSTM [7] with CNN [11, 23, 24] by concatenation or summation or piece-wise multiplication, Canonical Correlation Analysis (CCA and nCCA) [6] have also been shown to be a very effective multimodal embedding technique [32] ." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-53", "text": "Our work further investigates this embedding method as well as brings ideas from CCA over to an CNN+LSTM formulation." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-54", "text": "----------------------------------" }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-55", "text": "**METHOD**" }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-56", "text": "We use normalized CCA (nCCA) [6] to embed the textual embedding of answers and the visual representation of the image into a joint space, where candidate sentence completions are compared to the image." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-58", "text": "In Section 3.1, we propose a richer representation of the entire image obtained by pooling of CNN representations extracted from object proposals." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-59", "text": "Figure 1 illustrates the proposed Mean Box Pooling image representation and Figure 2 illustrates our whole method." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-60", "text": "In Section 3.3, we describe nCCA approach to encode two modalities into a joint space in greater details." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-61", "text": "In Section 3.4, we also investigate a CNN+LSTM architecture." 
}, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-62", "text": "Instead of generating a prompt completion that is next compared against candidate completions in a post-hoc process, we propose to choose a candidate completion by directly comparing candidates in the embedding space." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-63", "text": "This puts CNN+LSTM approach closer to nCCA with a tighter integration with the multi-choice Visual Madlibs task." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-64", "text": "This approach is depicted in Figure 3 ." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-65", "text": "----------------------------------" }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-66", "text": "**MEAN BOX POOLING IMAGE REPRESENTATION**" }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-67", "text": "Figure 1 illustrates our proposed image representation, which starts from extracting object proposals from the raw image." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-68", "text": "Next such object proposals are encoded via a CNN, and pooled in order to create a feature vector representation of the image." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-69", "text": "Extracting Region Proposals." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-70", "text": "Since questions are mainly about salient parts of the image, it seems reasonable to use object detection in order to extract such parts from the image." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-71", "text": "At the same time, however, it is important to not miss any object in the image." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-72", "text": "Moreover, arguably, sampling a context of the objects and capturing multi-scale, multi-parts properties seem to be important as well." 
}, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-73", "text": "Given all these reasons, we choose to use Edge Boxes [34] in order to generate a set of object bounding box proposals for feature extraction." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-74", "text": "Edge Boxes extract a number of bounding boxes along with a score for each bounding box that is interpreted as a confidence score that the bounding box contains an object." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-75", "text": "In our study, two hyper parameters are important: Non-Maxima Suppression and the number of proposals." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-76", "text": "The latter defines how many object proposals we want to maintain and hence implicitly influence recall of the proposals, while the former defines a threshold \u03b2 such that all predicted bounding boxes with the intersection over union greater than \u03b2 are removed." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-77", "text": "In practice, the lower the \u03b2 the more spread the object proposals are." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-78", "text": "Feature Extraction." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-79", "text": "Once the object proposals are extracted, we use output of the \"fc7\" layer of the VGG network [23] on the extracted image crops to encode the proposals." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-80", "text": "VGG is among the best performing recognition architectures on the large scale object recognition task [21] ." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-81", "text": "Pooling for Image Representation." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-82", "text": "Our final image representation is constructed by pooling the encoded object proposals together with the global image representation." 
}, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-83", "text": "Since we do not want to associate any particular order over the extracted object proposals, we investigate popular order-less pooling schemes." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-84", "text": "----------------------------------" }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-85", "text": "**POOLING FOR ANSWER REPRESENTATION.**" }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-86", "text": "We encode each word in the answer with a 300 dimensional word embedding [18] ." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-87", "text": "The embedded words are next mean pooled to form a vector representation of the answer." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-88", "text": "Note that, we do not encode prompts as they follow the same pattern for each Visual Madlibs category." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-89", "text": "----------------------------------" }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-90", "text": "**MULTIMODAL EMBEDDING**" }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-91", "text": "We use the Normalized Canonical Correlation Analysis (nCCA) to learn a mapping from two modalities: image and textual answers, into a joint embedding space." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-92", "text": "This embedding method has shown outstanding performance on the Visual Madlibs task [32] ." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-93", "text": "At the test time, given the encoded image, we choose an answer (encoded by the mean pooling over word2vec words representations) from the set of four candidate answers that is the most similar to the encoded image in the multimodal embedding space." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-147", "text": "The experiments suggest using a larger numbers of proposals, however the gain diminishes with the larger numbers." 
}, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-94", "text": "Formally, the Canonical Correlation Analysis (CCA) maximizes the cosine similarity between two modalities (also called views) in the embedding space, that is:" }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-95", "text": "where tr is the matrix trace,X := XW 1 ,\u0176 := YW 2 , and X,Y are two views (encoded images, and textual answers in our case)." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-96", "text": "Normalized Canonical Correlation Analysis (nCCA) [6] has been reported to work significantly better than the plain CCA." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-97", "text": "Here, columns of the projection matrices W 1 and W 2 are scaled by the p-th power (p=4) of the corresponding eigen values." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-98", "text": "The improvement is consistent with the findings of [32] , where nCCA performs better than CCA by about five percentage points in average on the hard task." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-99", "text": "----------------------------------" }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-100", "text": "**CNN+LSTM WITH TEXT-EMBEDDING LOSS**" }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-101", "text": "We present our novel architecture that extends prior approaches on question answering about images [4, 13, 17, 20, 29, 30, 31] by learning similarity between candidate labels and internal output embedding of the neural network." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-102", "text": "Figure 3 depicts our architecture." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-103", "text": "Similarly to prior work, we encode an image with a CNN encoder that is next concatenated with (learnable) word embeddings of the prompt sentence, and fed to a recurrent neural network." 
}, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-104", "text": "We use a special '' token to denote the empty blank space in the image description." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-105", "text": "On the other side, for each completion candidate s we compute its representation by averaging over word2vec [18] representations of the words contributing to s. However, in contrast to the prior work [32] , instead of comparing the discrete output of the network with the representation of s, we directly optimize an objective in the embedding space." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-106", "text": "During training we maximize the similarity measure between the output embedding and the representation of \u03c3 by optimizing the following objective:" }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-107", "text": "which is a cosine similarity between the representation of the available during the training correct completion\u015d i , and an output embedding vector of the i-th image-prompt training Table 4 : BLEU-1 and BLEU-2 computed on Madlibs testing dataset for different approaches." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-108", "text": "ImageNet using R-CNN [13] , covering 42 MS COCO categories." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-109", "text": "We observe similar performance between groundtruth and detected bounding boxes in Table 3 ." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-110", "text": "As an additional experiment we ask humans to answer the multiple choice task, with 5 Turkers answering each question." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-111", "text": "We use their results to filter out a subset of the hard multiple-choice questions where at least 3 Turkers choose the correct answer." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-112", "text": "Results of the methods on this subset are shown in Table 2 bottom set of rows." 
}, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-113", "text": "These results show the same pattern as on the unfiltered set, with slightly higher accuracy." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-114", "text": "Table 4 shows BLEU-1 and BLEU-2 scores for targeted generation." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-115", "text": "Although the CNN+LSTM models we trained on Madlibs were not quite as accurate as nCCA for selecting the correct multiple-choice answer, they did result in better, sometimes much better, accuracy (as measured by BLEU scores) for targeted generation." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-116", "text": "----------------------------------" }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-117", "text": "**CONCLUSIONS**" }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-118", "text": "We have introduced a new fill-in-the blank strategy for targeted natural language descriptions and used this to collect a Visual Madlibs dataset." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-119", "text": "Our analyses show that these descriptions are usually more detailed than generic whole image descriptions." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-120", "text": "We also introduce a targeted natural language description generation task, and a multiplechoice question answering task, then train and evaluate joint-embedding and generation models." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-121", "text": "Data produced by this paper will be publicly released upon acceptance." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-122", "text": "----------------------------------" }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-123", "text": "**ACKNOWLEDGEMENT**" }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-124", "text": "We thank the vision and language community for feedback regarding this dataset, especially Julia Hockenmaier, Kate Saenko, and Jason Corso." 
}, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-125", "text": "This research is supported by NSF Awards #1417991, 1405822, 144234, and 1452851, and Microsoft Research. where S denotes a set of available candidate prompt completions, x is the image-prompt pair fed to the network, and \u0398 * denotes all the learnt parameters." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-126", "text": "----------------------------------" }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-127", "text": "**EXPERIMENTAL RESULTS**" }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-128", "text": "We evaluate our method on the multiple choice task of the Visual Madlibs dataset." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-148", "text": "Comparison to the state-of-the-art." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-129", "text": "The dataset consists of about 360k descriptions, spanning 12 different categories specified by different types of templates, of about 10k images." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-130", "text": "The selected images from the MS COCO dataset comes with rich annotations." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-131", "text": "In the multi-choice scenario a textual prompt is given (every category follows the same, fixed template) with a blank to be filled, together with 4 candidate completions (see Figure 4 )." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-132", "text": "Every category represents a different type of question including scenes, affordances, emotions, or activities (the full list is shown in the first column of Table 1 )." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-133", "text": "Since each category has fixed prompt, there is no need to include the prompt in the modeling given the training is done per each category." 
}, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-134", "text": "Finally, Visual Madlibs considers an easy and difficult tasks that differ in how the negative 3 candidate completions (distractors) are chosen." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-135", "text": "In the easy task, the distractors are randomly chosen from three descriptions of the same question type from other images." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-136", "text": "In the hard task, 3 distractors are chosen only from these images that contain the same objects as the given question image, and hence it requires a more careful and detailed image understanding." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-137", "text": "We use ADAM gradient descent method [10] with default hyper-parameters." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-138", "text": "Different Non Maxima Suppression Thresholds." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-139", "text": "Table 2 : Accuracies computed for different number of Edge Box proposals on the easy and hard tasks of the Visual Madlibs dataset." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-140", "text": "The NMS threshold 0.75 and mean-pooling is used for all the experiments." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-141", "text": "Results in %." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-142", "text": "Different number of object proposals." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-143", "text": "The maximal number of object proposals is the second factor of Edge Boxes that we study in this work." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-144", "text": "A larger number of proposals tend to cover a larger fraction of the input image." 
}, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-145", "text": "Moreover, the higher number together with the higher NMS threshold can assign proposals to both an object, and its parts, effectively forming a multi-scale and multi-parts object representation." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-146", "text": "Table 2 shows the accuracy of our model with different number of Edge Box proposals." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-183", "text": "Such model even outperforms the prior work that uses ground truth bounding boxes from the MS COCO dataset." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-149", "text": "Guided by the results of the previous experiments, we compare nCCA that uses Edge Boxes object proposals (nCCA (ours)) with the state-ofthe-arts on Visual Madlibs (nCCA [32] )." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-150", "text": "Both models use the same VGG Convolutional Neural Network [23] to encode images (or theirs crops), and word2vec to encode words." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-151", "text": "The models are trained per category (a model trained over all the categories performs inferior on the hard task [32] )." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-152", "text": "As Table 3 shows using a large number of object proposals Table 4 : Accuracies computed for different approaches on the easy and hard task." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-153", "text": "nCCA (ours) uses the representation with object proposals (NMS 0.75, and 100 proposals with mean-pooling)." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-154", "text": "nCCA(bbox) mean-pools over the representations computed on the available ground-truth bounding boxes both at train and test time." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-155", "text": "The averages are computed only over 7 categories." 
}, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-156", "text": "Results in %." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-157", "text": "improves over global, full frame nCCA by 5.9 percentage points on the easy task, and about 1.4 percentage points on the difficult task in average." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-158", "text": "However, our nCCA also consistently outperforms state-of-the-art on every category except the 'Scenes' category." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-159", "text": "This suggests that better localized object oriented representation is beneficial." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-160", "text": "However, Edge Boxes only roughly localize objects." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-161", "text": "This naturally leads to the following question if better localization helps." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-162", "text": "To see the limits, we compare nCCA (ours) against nCCA (bbox) [32] that crops over ground truth bounding boxes from MS COCO segmentations and next averages over theirs representations (Table 3 in [32] shows that ground truth bounding boxes outperforms automatically detected bounding boxes, and hence they can be seen as an upper bound for a detection method trained to detect objects on MS COCO)." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-163", "text": "Surprisingly, nCCA (ours) outperforms nCCA (bbox) by a large margin as Table 4 shows." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-164", "text": "Arguably, object proposals have better recall and captures multi-scale, multi-parts phenomena." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-165", "text": "CNN+LSTM with comparison in the output embedding space." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-166", "text": "On one hand nCCA tops the leaderboard on the Visual Madlibs task [32] ." 
}, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-167", "text": "Table 5 : Comparison between our Embedded CNN+LSTM approach that computes the similarity between input and candidate answers in the embedding space, and the plain CNN+LSTM original approach from [32] ." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-168", "text": "Since the accuracies of CNN+LSTM [32] are unavailable for two categories, we report average over 10 categories in this case." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-169", "text": "Results in %." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-170", "text": "completion of the prompt sentence out of four candidates, the comparison between the candidate completions should be directly done in the output embedding space." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-171", "text": "This contrasts to a post-hoc process used in [32] where an image description architecture (CNN+LSTM) first generates a completion that is next compared against the candidates in the word2vec space (see section 3 for more details)." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-172", "text": "Moreover, since the \"Ask Your Neurons\" architecture [17] is more suitable for the question answering task, we extend that method to do comparisons directly in the embedding space (\"Embedded CNN+LSTM\" in Table 5 )." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-173", "text": "Note that, here we feed the sentence prompt to LSTM even though it is fixed per category." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-174", "text": "Table 5 shows the performance of different methods." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-175", "text": "Our \"Embedded CNN+LSTM\" outperforms other methods on both tasks confirming our hypothesis. 
\"Ask Your Neurons\" [17] is also slightly better than the original CNN+LSTM [32] (on the 10 categories that the results for CNN+LSTM are available it achieves 49.8% accuracy on the easy task, which is 2.1 percentage points higher than CNN+LSTM)." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-176", "text": "----------------------------------" }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-177", "text": "**CONCLUSION**" }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-178", "text": "We study an image representation formed by averaging over representations of object proposals, and show its effectiveness through experimental evaluation on the Visual Madlibs dataset [32] ." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-179", "text": "We achieve state of the art performance on the multi-choice \"filling the blank\" task." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-180", "text": "We have also shown and discussed effects of different parameters that affect how the proposals are obtained." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-181", "text": "Surprisingly, the larger number of proposals the better overall performance." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-182", "text": "Moreover, the model benefits even from highly overlapping proposals." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-184", "text": "The proposed representation can be considered as a valid alternative to 'soft' attention representations such as implemented in recent work of visual question answering using memory networks [31] ." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-185", "text": "Due to its popularity on question answering about images tasks, we also investigate a CNN+LSTM approach that chooses a prompt completion candidate by doing comparisons directly in the embedding space." 
}, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-186", "text": "This approach contrasts with a posthoc solution of the previous work allowing for a tighter integration of the model with the multi-choice task." }, { "sent_id": "866cc7036c626f07fba10ab2a839d8-C001-187", "text": "Acknowledgements: This work is supported by the German Research Foundation (DFG) under the SFB/CRC 1223." } ], "y": { "@USE@": { "gold_contexts": [ [ "866cc7036c626f07fba10ab2a839d8-C001-3" ], [ "866cc7036c626f07fba10ab2a839d8-C001-92" ], [ "866cc7036c626f07fba10ab2a839d8-C001-128", "866cc7036c626f07fba10ab2a839d8-C001-129" ], [ "866cc7036c626f07fba10ab2a839d8-C001-149" ], [ "866cc7036c626f07fba10ab2a839d8-C001-162" ], [ "866cc7036c626f07fba10ab2a839d8-C001-167" ], [ "866cc7036c626f07fba10ab2a839d8-C001-168" ], [ "866cc7036c626f07fba10ab2a839d8-C001-178" ] ], "cite_sentences": [ "866cc7036c626f07fba10ab2a839d8-C001-3", "866cc7036c626f07fba10ab2a839d8-C001-92", "866cc7036c626f07fba10ab2a839d8-C001-128", "866cc7036c626f07fba10ab2a839d8-C001-129", "866cc7036c626f07fba10ab2a839d8-C001-149", "866cc7036c626f07fba10ab2a839d8-C001-162", "866cc7036c626f07fba10ab2a839d8-C001-167", "866cc7036c626f07fba10ab2a839d8-C001-168", "866cc7036c626f07fba10ab2a839d8-C001-178" ] }, "@DIF@": { "gold_contexts": [ [ "866cc7036c626f07fba10ab2a839d8-C001-5" ], [ "866cc7036c626f07fba10ab2a839d8-C001-25" ], [ "866cc7036c626f07fba10ab2a839d8-C001-30" ], [ "866cc7036c626f07fba10ab2a839d8-C001-62", "866cc7036c626f07fba10ab2a839d8-C001-63" ], [ "866cc7036c626f07fba10ab2a839d8-C001-105" ], [ "866cc7036c626f07fba10ab2a839d8-C001-115" ], [ "866cc7036c626f07fba10ab2a839d8-C001-150", "866cc7036c626f07fba10ab2a839d8-C001-151" ], [ "866cc7036c626f07fba10ab2a839d8-C001-162" ], [ "866cc7036c626f07fba10ab2a839d8-C001-165", "866cc7036c626f07fba10ab2a839d8-C001-166" ], [ "866cc7036c626f07fba10ab2a839d8-C001-171" ], [ "866cc7036c626f07fba10ab2a839d8-C001-175" ] ], "cite_sentences": [ "866cc7036c626f07fba10ab2a839d8-C001-5", 
"866cc7036c626f07fba10ab2a839d8-C001-25", "866cc7036c626f07fba10ab2a839d8-C001-30", "866cc7036c626f07fba10ab2a839d8-C001-63", "866cc7036c626f07fba10ab2a839d8-C001-105", "866cc7036c626f07fba10ab2a839d8-C001-115", "866cc7036c626f07fba10ab2a839d8-C001-151", "866cc7036c626f07fba10ab2a839d8-C001-162", "866cc7036c626f07fba10ab2a839d8-C001-166", "866cc7036c626f07fba10ab2a839d8-C001-171", "866cc7036c626f07fba10ab2a839d8-C001-175" ] }, "@MOT@": { "gold_contexts": [ [ "866cc7036c626f07fba10ab2a839d8-C001-5" ], [ "866cc7036c626f07fba10ab2a839d8-C001-25" ] ], "cite_sentences": [ "866cc7036c626f07fba10ab2a839d8-C001-5", "866cc7036c626f07fba10ab2a839d8-C001-25" ] }, "@BACK@": { "gold_contexts": [ [ "866cc7036c626f07fba10ab2a839d8-C001-13" ], [ "866cc7036c626f07fba10ab2a839d8-C001-38", "866cc7036c626f07fba10ab2a839d8-C001-39", "866cc7036c626f07fba10ab2a839d8-C001-40" ], [ "866cc7036c626f07fba10ab2a839d8-C001-52" ], [ "866cc7036c626f07fba10ab2a839d8-C001-128", "866cc7036c626f07fba10ab2a839d8-C001-129" ], [ "866cc7036c626f07fba10ab2a839d8-C001-134" ] ], "cite_sentences": [ "866cc7036c626f07fba10ab2a839d8-C001-13", "866cc7036c626f07fba10ab2a839d8-C001-39", "866cc7036c626f07fba10ab2a839d8-C001-40", "866cc7036c626f07fba10ab2a839d8-C001-52", "866cc7036c626f07fba10ab2a839d8-C001-128", "866cc7036c626f07fba10ab2a839d8-C001-129", "866cc7036c626f07fba10ab2a839d8-C001-134" ] }, "@UNSURE@": { "gold_contexts": [ [ "866cc7036c626f07fba10ab2a839d8-C001-14", "866cc7036c626f07fba10ab2a839d8-C001-15" ], [ "866cc7036c626f07fba10ab2a839d8-C001-88" ], [ "866cc7036c626f07fba10ab2a839d8-C001-107" ], [ "866cc7036c626f07fba10ab2a839d8-C001-118" ] ], "cite_sentences": [ "866cc7036c626f07fba10ab2a839d8-C001-14", "866cc7036c626f07fba10ab2a839d8-C001-15", "866cc7036c626f07fba10ab2a839d8-C001-88", "866cc7036c626f07fba10ab2a839d8-C001-107", "866cc7036c626f07fba10ab2a839d8-C001-118" ] }, "@EXT@": { "gold_contexts": [ [ "866cc7036c626f07fba10ab2a839d8-C001-28" ], [ "866cc7036c626f07fba10ab2a839d8-C001-29" ], [ 
"866cc7036c626f07fba10ab2a839d8-C001-31", "866cc7036c626f07fba10ab2a839d8-C001-32" ], [ "866cc7036c626f07fba10ab2a839d8-C001-171", "866cc7036c626f07fba10ab2a839d8-C001-172" ] ], "cite_sentences": [ "866cc7036c626f07fba10ab2a839d8-C001-28", "866cc7036c626f07fba10ab2a839d8-C001-29", "866cc7036c626f07fba10ab2a839d8-C001-32", "866cc7036c626f07fba10ab2a839d8-C001-171" ] }, "@SIM@": { "gold_contexts": [ [ "866cc7036c626f07fba10ab2a839d8-C001-98" ] ], "cite_sentences": [ "866cc7036c626f07fba10ab2a839d8-C001-98" ] } } }, "ABC_7c3f94a231c83c94b5d93c33ab8bfa_5": { "x": [ { "sent_id": "7c3f94a231c83c94b5d93c33ab8bfa-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "7c3f94a231c83c94b5d93c33ab8bfa-C001-21", "text": "In the second part of the tutorial, we give some analysis on why the framework is effective." }, { "sent_id": "7c3f94a231c83c94b5d93c33ab8bfa-C001-2", "text": "This tutorial discusses a framework for incremental left-to-right structured predication, which makes use of global discriminative learning and beam-search decoding." }, { "sent_id": "7c3f94a231c83c94b5d93c33ab8bfa-C001-3", "text": "The method has been applied to a wide range of NLP tasks in recent years, and achieved competitive accuracies and efficiencies." }, { "sent_id": "7c3f94a231c83c94b5d93c33ab8bfa-C001-4", "text": "We give an introduction to the algorithms and efficient implementations, and discuss their applications to a range of NLP tasks." 
}, { "sent_id": "7c3f94a231c83c94b5d93c33ab8bfa-C001-5", "text": "----------------------------------" }, { "sent_id": "7c3f94a231c83c94b5d93c33ab8bfa-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "7c3f94a231c83c94b5d93c33ab8bfa-C001-7", "text": "This tutorial discusses a framework of online global discriminative learning and beam-search decoding for syntactic processing (Zhang and Clark, 2011b) , which has recently been applied to a wide variety of natural language processing (NLP) tasks, including word segmentation (Zhang and Clark, 2007) , dependency parsing (Zhang and Clark, 2008b; Huang and Sagae, 2010; Zhang and Nivre, 2011; Bohnet and Kuhn, 2012) , context free grammar (CFG) parsing (Collins and Roark, 2004; Zhang and Clark, 2009; Zhu et al., 2013) , combinational categorial grammar (CCG) parsing (Zhang and Clark, 2011a; Xu et al., 2014) and machine translation (Liu, 2013) , achieving stateof-the-art accuracies and efficiencies." }, { "sent_id": "7c3f94a231c83c94b5d93c33ab8bfa-C001-8", "text": "In addition, due to its high efficiencies, it has also been applied to a range of joint structural problems, such as joint segmentation and POS-tagging (Zhang and Clark, 2008a; Zhang and Clark, 2010) , joint POS-tagging and dependency parsing (Hatori et al., 2011; Bohnet and Nivre, 2012) , joint morphological analysis, POS-tagging and dependency parsing (Bohnet et al., 2013) , and joint segmentation, POS-tagging and parsing ." }, { "sent_id": "7c3f94a231c83c94b5d93c33ab8bfa-C001-9", "text": "In addition to the aforementioned tasks, the framework can be applied to all structural prediction tasks for which the output can be constructed using an incremental process." }, { "sent_id": "7c3f94a231c83c94b5d93c33ab8bfa-C001-10", "text": "The advantage of this framework is two-fold." 
}, { "sent_id": "7c3f94a231c83c94b5d93c33ab8bfa-C001-11", "text": "First, beamsearch enables highly efficient decoding, which typically has linear time complexity, depending on the incremental process." }, { "sent_id": "7c3f94a231c83c94b5d93c33ab8bfa-C001-12", "text": "Second, free from DPstyle constraints and Markov-style independence assumptions, the framework allows arbitrary features to be defined to capture structural patterns." }, { "sent_id": "7c3f94a231c83c94b5d93c33ab8bfa-C001-13", "text": "In addition to feature advantages, the high accuracies of this framework are also enabled by direct interactions between learning and search (Daum\u00e9 III and Marcu, 2005; Huang et al., 2012; Zhang and Nivre, 2012) ." }, { "sent_id": "7c3f94a231c83c94b5d93c33ab8bfa-C001-14", "text": "----------------------------------" }, { "sent_id": "7c3f94a231c83c94b5d93c33ab8bfa-C001-15", "text": "**TUTORIAL OVERVIEW**" }, { "sent_id": "7c3f94a231c83c94b5d93c33ab8bfa-C001-16", "text": "In this tutorial, we make an introduction to the framework, illustrating how it can be applied to a range of NLP problems, giving theoretical discussions and demonstrating a software implementation." }, { "sent_id": "7c3f94a231c83c94b5d93c33ab8bfa-C001-17", "text": "We start with a detailed introduction of the framework, describing the averaged perceptron algorithm (Collins, 2002) and its efficient implementation issues (Zhang and Clark, 2007) , as well as beam-search and the early-update strategy (Collins and Roark, 2004) ." }, { "sent_id": "7c3f94a231c83c94b5d93c33ab8bfa-C001-18", "text": "We then illustrate how the framework can be applied to NLP tasks, including word segmentation, joint segmentation & POS-tagging, labeled and unlabeled dependency parsing, joint POS-tagging and dependency parsing, CFG parsing, CCG parsing, and joint segmentation, POS-tagging and parsing." 
}, { "sent_id": "7c3f94a231c83c94b5d93c33ab8bfa-C001-19", "text": "In each case, we illustrate how the task is turned into an incremental left-to-right output-building process, and how rich features are defined to give competitive accuracies." }, { "sent_id": "7c3f94a231c83c94b5d93c33ab8bfa-C001-20", "text": "These examples can serve as guidance in applying the framework to other structural prediction tasks." }, { "sent_id": "7c3f94a231c83c94b5d93c33ab8bfa-C001-22", "text": "We discuss several alternative learning algorithms, 13 and compare beam-search with greedy search on dependency parsing." }, { "sent_id": "7c3f94a231c83c94b5d93c33ab8bfa-C001-23", "text": "We show that accuracy benefits from interaction between learning and search." }, { "sent_id": "7c3f94a231c83c94b5d93c33ab8bfa-C001-24", "text": "Finally, the tutorial concludes with an introduction to ZPar, an open source toolkit that provides optimized C++ implementations of of all the above tasks." }, { "sent_id": "7c3f94a231c83c94b5d93c33ab8bfa-C001-25", "text": "Ting Liu is a professor at HIT-SCIR." }, { "sent_id": "7c3f94a231c83c94b5d93c33ab8bfa-C001-26", "text": "His research interest includes social computing, information retrieval and natural language processing." 
}, { "sent_id": "7c3f94a231c83c94b5d93c33ab8bfa-C001-27", "text": "----------------------------------" }, { "sent_id": "7c3f94a231c83c94b5d93c33ab8bfa-C001-28", "text": "**OUTLINE**" } ], "y": { "@BACK@": { "gold_contexts": [ [ "7c3f94a231c83c94b5d93c33ab8bfa-C001-7" ] ], "cite_sentences": [ "7c3f94a231c83c94b5d93c33ab8bfa-C001-7" ] }, "@MOT@": { "gold_contexts": [ [ "7c3f94a231c83c94b5d93c33ab8bfa-C001-17", "7c3f94a231c83c94b5d93c33ab8bfa-C001-18" ] ], "cite_sentences": [ "7c3f94a231c83c94b5d93c33ab8bfa-C001-17" ] } } }, "ABC_740db031e3fc086dfdb2477caeac66_5": { "x": [ { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-2", "text": "Variational Autoencoder (VAE) is a powerful method for learning representations of highdimensional data." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-3", "text": "However, VAEs can suffer from an issue known as latent variable collapse (or KL loss vanishing), where the posterior collapses to the prior and the model will ignore the latent codes in generative tasks." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-4", "text": "Such an issue is particularly prevalent when employing VAE-RNN architectures for text modelling (Bowman et al., 2016) ." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-5", "text": "In this paper, we present a simple architecture called holistic regularisation VAE (HR-VAE), which can effectively avoid latent variable collapse." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-6", "text": "Compared to existing VAE-RNN architectures, we show that our model can achieve much more stable training process and can generate text with significantly better quality." 
}, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-7", "text": "----------------------------------" }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-9", "text": "Variational Autoencoder (VAE) (Kingma and Welling, 2013 ) is a powerful method for learning representations of high-dimensional data." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-10", "text": "However, recent attempts of applying VAEs to text modelling are still far less successful compared to its application to image and speech (Bachman, 2016; Fraccaro et al., 2016; Semeniuta et al., 2017) ." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-11", "text": "When applying VAEs for text modelling, recurrent neural networks (RNNs) 1 are commonly used as the architecture for both encoder and decoder (Bowman et al., 2016; Xu and Durrett, 2018; Dieng et al., 2019) ." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-12", "text": "While such a VAE-RNN based architecture allows encoding and generating sentences (in the decoding phase) with variablelength effectively, it is also vulnerable to an issue known as latent variable collapse (or KL loss vanishing) , where the posterior collapses to the prior and the model will ignore the latent codes in generative tasks." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-13", "text": "Various efforts have been made to alleviate the latent variable collapse issue." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-14", "text": "Bowman et al. (2016) uses KL annealing, where a variable weight is added to the KL term in the cost function at training time." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-15", "text": "Yang et al. (2017) discovered that there is a trade-off between the contextual capacity of the decoder and effective use of encoding information, and developed a dilated CNN as decoder which can vary the amount of conditioning context." 
}, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-16", "text": "They also introduced a loss clipping strategy in order to make the model more robust." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-17", "text": "Xu and Durrett (2018) addressed the problem by replacing the standard normal distribution for the prior with the von Mises-Fisher (vMF) distribution." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-18", "text": "With vMF, the KL loss only depends on the concentration parameter which is fixed during training and testing, and hence results in a constant KL loss." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-19", "text": "In a more recent work, Dieng et al. (2019) avoided latent variable collapse by including skip connections in the generative model, where the skip connections enforce strong links between the latent variables and the likelihood function." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-20", "text": "Although the aforementioned works show effectiveness in addressing the latent variable collapse issue to some extent, they either require carefully engineering to balance the weight between the reconstruction loss and KL loss (Bowman et al., 2016; , or resort to designing more sophisticated model structures (Yang et al., 2017; Xu and Durrett, 2018; Dieng et al., 2019) ." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-21", "text": "In this paper, we present a simple architecture called holistic regularisation VAE (HR-VAE), which can effectively avoid latent variable collapse." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-22", "text": "In contrast to existing VAE-RNN models for text modelling which merely impose a standard normal distribution prior on the last hidden state of the RNN encoder, our HR-VAE model imposes regularisation for all hidden states of the RNN encoder." 
}, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-23", "text": "Another advantage of our model is that it is generic and can be applied to any existing VAE-RNN-based architectures." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-24", "text": "We evaluate our model against several strong baselines which apply VAE for text modelling (Bowman et al., 2016; Yang et al., 2017; Xu and Durrett, 2018) ." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-25", "text": "We conducted experiments based on two public benchmark datasets, namely, the Penn Treebank dataset (Marcus and Marcinkiewicz, 1993) and the end-to-end (E2E) text generation dataset (Novikova et al., 2017) ." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-26", "text": "Experimental results show that our HR-VAE model not only can effectively mitigate the latent variable collapse issue with a stable training process, but also can give better predictive performance than the baselines, as evidenced by both quantitative (e.g., negative log likelihood and perplexity) and qualitative evaluation." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-27", "text": "The code for our model is available online 2 ." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-28", "text": "----------------------------------" }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-29", "text": "**METHODOLOGY**" }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-30", "text": "----------------------------------" }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-31", "text": "**BACKGROUND OF VAE**" }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-32", "text": "A variational autoencoder (VAE) is a deep generative model, which combines variational inference with deep learning." 
}, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-33", "text": "The VAE modifies the conventional autoencoder architecture by replacing the deterministic latent representation z of an input x with a posterior distribution P (z|x), and imposing a prior distribution on the posterior, such that the model allows sampling from any point of the latent space and yet able to generate novel and plausible output." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-34", "text": "The prior is typically chosen to be standard normal distributions, i.e., P (z) = N (0, 1), such that the KL divergence between posterior and prior can be computed in closed form (Kingma and Welling, 2013) ." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-35", "text": "To train a VAE, we need to optimise the marginal likelihood P \u03b8 (x) = P (z)P \u03b8 (x|z)dz, 2 https://github.com/ruizheliUOA/HR-VAE where the log likelihood can take following form:" }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-36", "text": "Here Q \u03c6 (z|x) is the variational approximation for the true posterior P \u03b8 (z|x)." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-37", "text": "Specifically, Q \u03c6 (z|x) can be regarded as an encoder (a.k.a." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-38", "text": "the recognition model) and P \u03b8 (x|z) the decoder (a.k.a. the generative model)." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-39", "text": "Both encoder and decoder are implemented via neural networks." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-40", "text": "As proved in (Kingma and Welling, 2013) , optimising the marginal log likelihood is essentially equivalent to maximising L(\u03b8, \u03c6; x), i.e., the evidence lower bound (ELBO), which consists of two terms." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-41", "text": "The first term is the expected reconstruction error indicating how well the model can reconstruct data given a latent variable." 
}, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-42", "text": "The the second term is the KL divergence of the approximate posterior from prior, i.e., a regularisation pushing the learned posterior to be as close to the prior as possible." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-43", "text": "----------------------------------" }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-44", "text": "**VARIATIONAL AUTOENDODER WITH HOLISTIC REGULARISATION**" }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-45", "text": "In this section, we discuss the technical details of the proposed holistic regularisation VAE (HR-VAE) model, a general architecture which can effectively mitigate the KL vanishing phenomenon." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-46", "text": "Our model design is motivated by one noticeable defect shared by the VAE-RNN based models in previous works (Bowman et al., 2016; Yang et al., 2017; Xu and Durrett, 2018; Dieng et al., 2019) ." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-47", "text": "That is, all these models, as shown in Figure 1a , only impose a standard normal distribution prior on the last hidden state of the RNN encoder, which potentially leads to learning a suboptimal representation of the latent variable and results in model vulnerable to KL loss vanishing." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-48", "text": "Our hypothesis is that to learn a good representation of data and a good generative model, it is crucial to impose the standard normal prior on all the hidden states of the RNN-based encoder (see Figure 1b ), which allows a better regularisation of the model learning process." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-49", "text": "We implement the HR-VAE model using a twolayer LSTM for both the encoder and decoder." 
}, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-50", "text": "However, one should note that our architecture can be readily applied to other types of RNN such as GRU." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-51", "text": "For each time stamp t (see Figure 1b) , we concatenate the hidden state h t and the cell state c t of the encoder." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-52", "text": "The concatenation (i.e., [h t ; c t ]) is then fed into two linear transformation layers for estimating \u00b5 t and \u03c3 2 t , which are parameters of a normal distribution corresponding to the concatenation of h t and c t ." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-53", "text": "Let Q \u03c6t (z t |x) = N (z t |\u00b5 t , \u03c3 2 t ), we wish Q \u03c6t (z t |x) to be close to a prior P (z t ), which is a standard Gaussian." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-54", "text": "Finally, the KL divergence between these two multivariate Gaussian distributions (i.e., Q \u03c6t and P (z t )) will contribute to the overall KL loss of the ELBO." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-55", "text": "By taking the average of the KL loss at each time stamp t, the resulting ELBO takes the following form" }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-56", "text": "KL(Q \u03c6t (z t |x) P (z t ))." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-57", "text": "( 3) As can be seen in Eq. 3, our solution to the KL collapse issue does not require any engineering for balancing the weight between the reconstruction term and KL loss as commonly the case in existing works (Bowman et al., 2016; ." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-58", "text": "The weight between these two terms of our model is simply 1 : 1." 
}, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-59", "text": "----------------------------------" }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-60", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-61", "text": "----------------------------------" }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-62", "text": "**DATASETS**" }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-63", "text": "We evaluate our model on two public datasets, namely, Penn Treebank (PTB) (Marcus and Marcinkiewicz, 1993) and the end-to-end (E2E) text generation corpus (Novikova et al., 2017) , which have been used in a number of previous works for text generation (Bowman et al., 2016; Xu and Durrett, 2018; Wiseman et al., 2018; Su et al., 2018) ." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-64", "text": "PTB consists of more than 40,000 sentences from Wall Street Journal articles whereas the E2E dataset contains over 50,000 sen-tences of restaurant reviews." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-65", "text": "The statistics of these two datasets are summarised in Table 1 ." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-66", "text": "----------------------------------" }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-67", "text": "**IMPLEMENTATION DETAILS**" }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-68", "text": "For the PTB dataset, we used the train-test split following (Bowman et al., 2016; Xu and Durrett, 2018) ." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-69", "text": "For the E2E dataset, we used the train-test split from the original dataset (Novikova et al., 2017) and indexed the words with a frequency higher than 3." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-70", "text": "We represent input data with 512-dimensional word2vec embeddings (Mikolov et al., 2013) ." 
}, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-71", "text": "We set the dimension of the hidden layers of both encoder and decoder to 256." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-72", "text": "The Adam optimiser (Kingma and Ba, 2014) was used for training with an initial learning rate of 0.0001." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-73", "text": "Each utterance in a mini-batch was padded to the maximum length for that batch, and the maximum batch-size allowed is 128." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-74", "text": "----------------------------------" }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-75", "text": "**BASELINES**" }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-76", "text": "We compare our HR-VAE model with three strong baselines using VAE for text modelling: VAE-LSTM-base 3 : A variational autoencoder model which uses LSTM for both encoder and decoder." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-77", "text": "KL annealing is used to tackled the latent variable collapse issue (Bowman et al., 2016) ; VAE-CNN 4 : A variational autoencoder model with a LSTM encoder and a dilated CNN decoder (Yang et al., 2017) ; vMF-VAE 5 : A variational autoencoder model using LSTM for both encoder and decoder where the prior distribution is the von Mises-Fisher (vMF) distribution rather than a Gaussian distribution (Xu and Durrett, 2018) ." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-78", "text": "the decoder needs to predict the entire sequence with only the help of the given latent variable z." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-79", "text": "In this way, a high-quality representation abstracting the information of the input sentence is much needed for the decoder, and hence enforcing z to learn the required information." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-80", "text": "Overall performance." 
}, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-81", "text": "Table 2 shows the language modelling results of our approach and the baselines." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-82", "text": "We report negative log likelihood (NLL), KL loss, and perplexity (PPL) on the test set." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-83", "text": "As expected, all the models have a higher KL loss in the inputless setting than the standard setting, as z is required to encode more information about the input data for reconstruction." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-84", "text": "In terms of overall performance, our model outperforms all the baselines in both datasets (i.e., PTB and E2E)." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-85", "text": "For instance, when comparing with the strongest baseline vMF-VAE in the standard setting, our model reduces NLL from 96 to 79 and PPL from 98 to 43 in PTB, respectively." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-86", "text": "In the inputless setting, our performance gain is even higher, i.e., NLL reduced from 117 to 85 and PPL from 262 to 54." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-87", "text": "A similar pattern can be observed for the E2E dataset." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-88", "text": "These observations suggest that our approach can learn a better generative model for data." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-89", "text": "Loss analysis." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-90", "text": "To conduct a more thorough evaluation, we further investigate model behaviours in terms of both reconstruction loss and KL loss, as shown in Figure 2 ." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-91", "text": "These plots were obtained based on the E2E training set using the inputless setting." 
}, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-92", "text": "We can see that the KL loss of VAE-LSTMbase, which uses Sigmoid annealing (Bowman et al., 2016) , collapses to zero, leading to a poor generative performance as indicated by the high reconstruction loss." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-93", "text": "The KL loss for both VAE-CNN and vMF-VAE are nonzero, where the former mitigates the KL collapse issue with a KL loss clipping strategy and the latter by replacing the standard normal distribution for the prior with the vMF distribution (i.e., with the vMF distribution, the KL loss only depends on a fixed concentration parameter, and hence results in a constant KL loss)." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-94", "text": "Although both VAE-CNN and vMF-VAE outperform VAE-LSTM-base by a large margin in terms of reconstruction loss as shown in Figure 2 , one should also notice that these two models actually overfit the training data, as their performance on the test set is much worse (cf." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-95", "text": "Table 2 )." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-96", "text": "In contrast to the baselines which mitigate the KL collapse issue by carefully engineering the weight between the reconstruction loss and KL loss or choosing a different choice of prior, we provide a simple and elegant solution through holistic KL regularisation, which can effectively mitigate the KL collapse issue and achieve a better reconstruction error in both training and testing." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-97", "text": "Sentence reconstruction." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-98", "text": "Lastly, we show some sentence examples reconstructed by vMF-VAE (i.e., the best baseline) and our model in the inputless setting using sentences from the E2E test set as input." 
}, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-99", "text": "As shown in Table 3 , the sentences generated by vMF-VAE contain repeated words in quite a few cases, such as 'city city area' and 'blue spice spice'." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-100", "text": "In addition, vMF-VAE also tends to generate unnecessary or unrelated words at the end of sentences, making the generated sentences ungrammatical." }, { "sent_id": "740db031e3fc086dfdb2477caeac66-C001-101", "text": "The sentences reconstructed by our model, in contrast, are more grammatical and more similar to the corresponding ground truth sentences than vMF-VAE." } ], "y": { "@BACK@": { "gold_contexts": [ [ "740db031e3fc086dfdb2477caeac66-C001-11", "740db031e3fc086dfdb2477caeac66-C001-12" ], [ "740db031e3fc086dfdb2477caeac66-C001-17" ], [ "740db031e3fc086dfdb2477caeac66-C001-20", "740db031e3fc086dfdb2477caeac66-C001-21" ] ], "cite_sentences": [ "740db031e3fc086dfdb2477caeac66-C001-11", "740db031e3fc086dfdb2477caeac66-C001-17", "740db031e3fc086dfdb2477caeac66-C001-20" ] }, "@MOT@": { "gold_contexts": [ [ "740db031e3fc086dfdb2477caeac66-C001-20", "740db031e3fc086dfdb2477caeac66-C001-21" ], [ "740db031e3fc086dfdb2477caeac66-C001-46" ] ], "cite_sentences": [ "740db031e3fc086dfdb2477caeac66-C001-20", "740db031e3fc086dfdb2477caeac66-C001-46" ] }, "@USE@": { "gold_contexts": [ [ "740db031e3fc086dfdb2477caeac66-C001-24" ], [ "740db031e3fc086dfdb2477caeac66-C001-63" ], [ "740db031e3fc086dfdb2477caeac66-C001-68" ], [ "740db031e3fc086dfdb2477caeac66-C001-76", "740db031e3fc086dfdb2477caeac66-C001-77" ] ], "cite_sentences": [ "740db031e3fc086dfdb2477caeac66-C001-24", "740db031e3fc086dfdb2477caeac66-C001-63", "740db031e3fc086dfdb2477caeac66-C001-68", "740db031e3fc086dfdb2477caeac66-C001-77" ] } } }, "ABC_247bbc4eb671895222065ed425f968_5": { "x": [ { "sent_id": "247bbc4eb671895222065ed425f968-C001-46", "text": "**HUMAN TRANSCRIPTION EXPERIMENTS**" }, { "sent_id": 
"247bbc4eb671895222065ed425f968-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-2", "text": "With recent advances in deep learning, considerable attention has been given to achieving automatic speech recognition performance close to human performance on tasks like conversational telephone speech (CTS) recognition." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-3", "text": "In this paper we evaluate the usefulness of these proposed techniques on broadcast news (BN), a similar challenging task." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-4", "text": "We also perform a set of recognition measurements to understand how close the achieved automatic speech recognition results are to human performance on this task." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-5", "text": "On two publicly available BN test sets, DEV04F and RT04, our speech recognition system using LSTM and residual network based acoustic models with a combination of n-gram and neural network language models performs at 6.5% and 5.9% word error rate." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-6", "text": "By achieving new performance milestones on these test sets, our experiments show that techniques developed on other related tasks, like CTS, can be transferred to achieve similar performance." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-7", "text": "In contrast, the best measured human recognition performance on these test sets is much lower, at 3.6% and 2.8% respectively, indicating that there is still room for new techniques and improvements in this space, to reach human performance levels." 
}, { "sent_id": "247bbc4eb671895222065ed425f968-C001-8", "text": "----------------------------------" }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-10", "text": "Prior to the recent ubiquitous deployment of automatic speech recognition technology for various device user interfaces, two key domains of interest for application of automatic speech recognition technology were conversational telephone speech (CTS) and broadcast news (BN)." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-11", "text": "Interest in these domains was primarily fueled by various DARPA programs [1] ." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-12", "text": "More recently, by employing various deep learning techniques, performance of speech recognition systems on the CTS task is getting close to human parity." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-13", "text": "Several sites have made significant progress to lower the WER to within the 5%-10% range on the Switchboard-CallHome subsets of the Hub5 2000 evaluation [2, 3, 4, 5] ." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-14", "text": "Given the progress on conversational telephone speech, we focus on the other closely related broadcast news recognition task that received similar attention within the DARPA EARS program." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-15", "text": "One of the key objectives of this study is to understand how deep learning based techniques developed on CTS generalize to the BN task." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-16", "text": "In the BN domain, speech recognition systems need to deal with wide-band signals collected over a wide variety of speakers with different speaking styles, in various background noise conditions, and speaking on a wide variety of news topics." 
}, { "sent_id": "247bbc4eb671895222065ed425f968-C001-17", "text": "Most of the speech is well articulated and is formed similarly to written English." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-18", "text": "In contrast, CTS is spontaneous speech recorded over a telephone channel Fig. 1 ." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-19", "text": "The NIST STT Benchmark Test History -May'09 [7] that introduces additional artifacts in addition to numerous speaking styles." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-20", "text": "Conversational speech is interspersed with portions of overlapping speech, interruptions, restarts and back-channel confirmations between participants." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-21", "text": "In terms of the amount of training data available from the DARPA EARS program for training systems on CTS and BN, there are a few significant differences as well." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-22", "text": "The CTS acoustic training corpus consists of approximately 2000 hours of speech with human transcriptions [2] ." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-23", "text": "On the other hand, for the BN task, only about 140 hours of data is carefully transcribed." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-24", "text": "The remaining \u223c9000 hours of available speech are TV shows with closed captions." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-25", "text": "In other words, models being developed for BN typically use lightly supervised transcripts for training [6] ." 
}, { "sent_id": "247bbc4eb671895222065ed425f968-C001-26", "text": "The EARS program led to significant advances in speech recognition technology for both domains, with the development of techniques that could ingest large quantities of unsupervised or semisupervised training data, discriminative training of recognition models, methods to deal with channel and speaker variabilities in the data, real-time decoding of test data, and also approaches to combine outputs from various systems [8, 9, 10, 11] ." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-27", "text": "Several of these techniques have further been extended to build ASR systems on broadcast news data in various languages [12, 13, 14] ." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-28", "text": "Figure 1 shows progress made in this domain over the past two decades." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-29", "text": "More recently, as part of the MGB Challenge, in addition to the core ASR problem, several other related tasks -speaker diarization and lightly supervised alignment of data have also been studied [15, 16] ." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-30", "text": "In [2, 3] we describe state-of-the-art speech recognition systems on the CTS task using multiple LSTM and ResNet acoustic models trained on various acoustic features along with word and character LSTMs and convolutional WaveNet-style language models." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-31", "text": "This ad-vanced recipe achieves 5.1% and 9.9% on the Switchboard and CallHome subsets of the Hub5 2000 evaluation." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-32", "text": "In this paper we develop a similar but simpler variant for BN." 
}, { "sent_id": "247bbc4eb671895222065ed425f968-C001-33", "text": "As described earlier, by developing this system we investigate how these earlier proposed systems can be trained on BN data which are not human annotated but are created using closed captions." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-34", "text": "To create these systems, instead of adding all the available training data we carefully select a reliable subset." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-35", "text": "We then train LSTM and residual network based acoustic models with a combination of n-gram and neural network language models on this selected data." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-36", "text": "In addition to automatic speech recognition results, similar to [2] , we also present human performance on the same BN test sets." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-37", "text": "These evaluations allow us to properly benchmark our automatic system performance." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-38", "text": "Similar to earlier human performance evaluations on CTS, we observe a significant gap between human and automatic results." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-39", "text": "The rest of the paper is organized as follows." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-40", "text": "In Section 2 we describe the human evaluation experiments on two broadcast news test sets -RT04 and DEV04F." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-41", "text": "We also compare the recognition errors we observe with human and automatic recognition systems." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-42", "text": "Section 3 describes the development of our ASR systems -training data selection, acoustic and language model building." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-43", "text": "In Section 4 we present WER results using the proposed system." 
}, { "sent_id": "247bbc4eb671895222065ed425f968-C001-44", "text": "The paper concludes with a discussion in Section 5." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-45", "text": "----------------------------------" }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-47", "text": "Similar to [2] , human performance measurements on two broadcast news tasks -RT04 and DEV04F -are carried out by Appen." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-48", "text": "For these evaluations we limit the audio from the test sets to only regions of speech that are marked for scoring using the original references and scoring scripts provided during the EARS evaluation." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-49", "text": "After processing, the RT04 test set has 4 hours of BN data from 12 shows with about 230 overlapping speakers across the shows." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-50", "text": "The DEV04F test set is smaller, with about 2 hours of data from 6 shows with close to 100 overlapping speakers across the shows." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-51", "text": "The first round of transcripts was produced by three independent transcribers, followed by quality checking by a fourth senior transcriber." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-52", "text": "All four transcribers are native US English speakers and were selected based on the quality of their work on past transcription projects." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-53", "text": "The transcriptions were produced in line with LDC transcription guidelines for hyphenations, spelled abbreviations, contractions, partial words, non-speech sounds, etc." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-54", "text": "that were used to produce the original transcripts for these test sets." 
}, { "sent_id": "247bbc4eb671895222065ed425f968-C001-55", "text": "The three primary transcribers took 14-16 times real-time (xRT) for the first pass followed by an additional 3xRT for the second quality checking pass (by Transcriber 4)." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-56", "text": "Both passes involved listening to the audio multiple times: around 3-4 times for the first pass and 1-2 times for the second." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-57", "text": "In order to use NISTs scoring tool, sclite [17] , the human annotations were converted into CTM files which have time-marked word boundary information." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-58", "text": "The transcriptions were also filtered to remove non-speech markers, partial words, punctuation marks etc as described in [2] ." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-59", "text": "Table 1 shows the error rates of the three transcribers after quality checking by the fourth transcriber." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-60", "text": "[2] , the word error rate on BN is much lower." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-61", "text": "Although this reduction could be because BN speech is well articulated, the transcribers reported that these test sets were much denser with respect to speech content, had considerable background noise, and a significant number of named entities that required lookup to ensure correctness, compared to traditional CTS test sets." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-62", "text": "The best WER results we obtain, 3.6% and 2.8%, also fit in the expected human transcription error range indicated in Figure 1 ." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-63", "text": "A more detailed error analysis and comparison of human and automatic recognition is presented in the Discussion section." 
}, { "sent_id": "247bbc4eb671895222065ed425f968-C001-64", "text": "----------------------------------" }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-65", "text": "**ASR SYSTEM BUILDING**" }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-66", "text": "As described earlier, one differentiating characteristic of ASR system builds for this BN task is the limited amount of carefully annotated manual transcriptions." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-67", "text": "Prior to the EARS program, LDC released about 144 hours of careful manual annotations for a portion of the Hub4 acoustic training data collected between May 1996 and January 1998." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-68", "text": "In addition to this, several sources of BN data were available for training acoustic models during the EARS program period with just closed caption transcripts." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-69", "text": "These data sources include about 1000 hours of data as part of different data releases collected between 1998-2001 (TDT2 and TDT4) and about 7000 hours of broadcast news released in 2003 as part of the EARS program (BN03)." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-70", "text": "In this paper we use processed versions of these data sources to build deep neural network based acoustic and language models." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-71", "text": "----------------------------------" }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-72", "text": "**TRAINING DATA PREPARATION**" }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-73", "text": "To process the BN data with noisy closed captions, the data is first decoded with multiple off-the-shelf broadband ASR systems using a biased LM created with the available closed captions." 
}, { "sent_id": "247bbc4eb671895222065ed425f968-C001-74", "text": "Based on the confidence scores of these decodes and agreement between the multiple system decodes, we perform a strict filtering of the data to create 2 sub-corpora that we consider have very reliable transcripts -" }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-75", "text": "\u2022 The BN-400 Corpus -This is a corpus of about 430 hours of broadcast news data selected from the data sources described above." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-76", "text": "This data corpus includes 144 hours of carefully transcribed audio along with data with semi-supervised transcripts created via a biased decode of the matching audio." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-77", "text": "\u2022 The BN-1300 Corpus -This corpus is an extended version of the BN-400 corpus with about 900 additional hours of broadcast news." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-78", "text": "----------------------------------" }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-79", "text": "**ACOUSTIC MODEL DEVELOPMENT**" }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-80", "text": "As discussed earlier, one of the key objectives of this work is to verify the usefulness of our earlier proposed system strategy for CTS." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-81", "text": "In [2] , two kinds of acoustic models, a convolutional and a non-convolutional acoustic model with comparable performance, are used since they produce good complementary outputs which can be further combined for improved performance." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-82", "text": "The convolutional network used in that work is a residual network (ResNet) and an LSTM is used as the non-convolutional network." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-83", "text": "The acoustic scores of these systems are subsequently combined for the final decodes." 
}, { "sent_id": "247bbc4eb671895222065ed425f968-C001-84", "text": "Similar to that work, in this paper also we train ResNet and LSTM based acoustic models." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-85", "text": "Both these acoustic models are based on speaker transformed features." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-86", "text": "The ResNet uses 40 dimensional VTL-warped log-mel fea- , we train an LSTM acoustic model with 6 bidirectional layers having 1024 cells per layer (512 per direction), one linear bottleneck layer with 256 units and an output layer with 32K units corresponding to the contextdependent HMM states we derived in the HMM-GMM system build." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-87", "text": "The model is trained using non-overlapping subsequences of 21 frames." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-88", "text": "Subsequences from different utterances are grouped into mini-batches of size 128 for processing speed and reliable gradient estimates." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-89", "text": "After the cross-entropy based training on the BN-1300 Corpus has converged we also sequence train the model using the 144 hours of carefully transcribed audio." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-90", "text": "To complement the LSTM acoustic model, we train a deep Residual Network based on the best performing architecture proposed in [2] ." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-91", "text": "The ResNet has 12 residual blocks followed by 5 fully connected layers." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-92", "text": "To effectively train this network with 25 convolutional layers, a short-cut connection is placed between each residual block to allow for additional flow of information and gradients." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-93", "text": "Each layer has a batch normalization layer as well." 
}, { "sent_id": "247bbc4eb671895222065ed425f968-C001-94", "text": "Table 2 gives a summary of the network architecture of the ResNet model." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-95", "text": "The ResNet consists of several stages with different numbers of feature maps in each stage: 64 in stage 1, 128 in stage 2, 256 in stage 3 and 512 in stage 4." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-96", "text": "Each stage has an initStride which indicates the (frequency \u00d7 time) stride for the first block of that stage as the number of feature maps is increased." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-97", "text": "The stride applies to both the first 3\u00d73 convolution of each block and also the 1\u00d71 convolution in projection shortcut between each block." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-98", "text": "The ResNet acoustic model is trained using the cross-entropy training criterion on the BN-1300 Corpus and then sequence trained using the 144 hours of carefully transcribed audio." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-99", "text": "----------------------------------" }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-100", "text": "**LANGUAGE MODEL DEVELOPMENT**" }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-101", "text": "Similar to the development of acoustic models, several kinds of ngram and neural network based language models are built on this BN task." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-102", "text": "For the initial decode that produces word lattices, an n-gram and a feed forward neural network language model are first built." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-103", "text": "To rescore the word lattices and n-best lists produced by these models, advanced LSTM based NN language models are also constructed." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-104", "text": "Table 4 ." 
}, { "sent_id": "247bbc4eb671895222065ed425f968-C001-105", "text": "LSTM rescoring results (WER%) on RT04 and DEV04F." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-106", "text": "The primary language model training text for all these models consists of a total of 350M words from different publicly available sources released by LDC during the GALE [1] and EARS evaluation periods suitable for broadcast news." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-107", "text": "The baseline language model is a linear interpolation of word 6-gram models, one for each corpus with a vocabulary size of about 80K words." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-108", "text": "We train a feed forward neural network model based on the same data and vocabulary as the n-gram language model described above." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-109", "text": "The neural network model (FFNN-LM) uses an embedding size of 120, a hidden layer size of 1200 and the maxout non-linearity [18] ." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-110", "text": "We use noise contrastive estimation to train this unnormalized NNLM [19] ." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-111", "text": "For decoding experiments, the FF-NNLM is interpolated with the baseline 6-gram arpabo with an interpolation weight set to 0.5." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-112", "text": "In addition to the n-gram and feed forward neural network language models, we also train two different flavors of LSTM language models with the same vocabulary and training data as described above." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-113", "text": "The first LSTM model (LSTM1-LM) consists of one word embedding layer with 256 units, four LSTM layers with 1024 units, one fully-connected layer, and one softmax layer." 
}, { "sent_id": "247bbc4eb671895222065ed425f968-C001-114", "text": "The second to fourth LSTM layers and the fully-connected layer allow residual connections [20] ." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-115", "text": "Dropout is applied to the vertical dimension only and not applied to the time dimension." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-116", "text": "We trained this model to minimize the cross-entropy objective using Adam for learning rate control [21] ." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-117", "text": "The second LSTM based LM (LSTM2-LM) consists of two LSTM layers, each layer with 2048 nodes and a word embedding size of 512." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-118", "text": "Before the softmax-based estimation of an 80K-dimensional posterior vector, the feature space was reduced to 128 by a linear bottleneck layer." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-119", "text": "During the training various dropout techniques were applied [22] ." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-120", "text": "First, the outputs of the embedding and each LSTM layer were masked at a 10% rate." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-121", "text": "Second, 10% dropout was also applied on the embedding weights, and also on the parameters of the recurrent connection of the LSTMs." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-122", "text": "These weight masks were kept constant during processing a mini-batch of sequences." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-123", "text": "In the final step of the training, the model was fine-tuned on the best matching resource, the EARS BN data." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-124", "text": "The SGD based model training uses a batch size of 128 and a Nesterov momentum of 0.9 to optimize model parameters on the cross-entropy criterion." 
}, { "sent_id": "247bbc4eb671895222065ed425f968-C001-125", "text": "----------------------------------" }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-126", "text": "**ASR EXPERIMENTS AND RESULTS**" }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-127", "text": "The acoustic and language models described above are used to decode the RT04 and DEV04F test sets." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-128", "text": "We use the same speech segments that were provided to the human annotators for our various experiments." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-129", "text": "In our first set of experiments we separately test the Table 5 ." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-130", "text": "Overall substitution, deletion and insertion errors of humans and ASR system." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-131", "text": "LSTM and ResNet models in conjunction with the n-gram and FF-NNLM, before combining scores from the two acoustic models." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-132", "text": "Table 3 shows the individual and combined results we obtain on both the test sets." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-133", "text": "In comparison with the results obtained on the CTS evaluation with similar acoustic models [2] , the LSTM and ResNet operate at similar WERs." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-134", "text": "Unlike results observed on the CTS task, no significant reduction in WER is obtained after scores from both the LSTM and ResNet models are combined." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-135", "text": "The LSTM model with an n-gram LM individually performs quite well and its results further improve with the addition of the FF-NNLM." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-136", "text": "For our second set of experiments word lattices are generated after decoding with the LSTM+ResNet+n-gram+FF-NNLM model." 
}, { "sent_id": "247bbc4eb671895222065ed425f968-C001-137", "text": "We generated n-best lists from these lattices and rescored them with the LSTM1-LM." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-138", "text": "LSTM2-LM is also used to rescore word lattices independently." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-139", "text": "Table 4 shows the results after our rescoring experiments." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-140", "text": "We observe significant WER gains after using the LSTM LMs similar to those reported in [2] ." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-141", "text": "By rescoring outputs with both LSTM1 and LSTM2, we achieve new performance milestones with final WERs of 6.5% and 5.9% on DEV04F and RT04 respectively." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-142", "text": "Our ASR results have clearly improved state-of-the-art performance on these test sets compared to the various results reported in [8, 9, 23] ." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-143", "text": "Significant progress has also been made compared to systems developed over the last decade, as shown in Figure 1 ." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-144", "text": "----------------------------------" }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-145", "text": "**DISCUSSIONS**" }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-146", "text": "When compared to the human performance results, the absolute ASR WER is about 3% worse." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-147", "text": "From Table 5 we observe that although the machine and human insertion error rates are comparable, the ASR system has much higher substitution and deletion error rates." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-148", "text": "Tables 6 and 7 list the 10 most frequent errors of each type." 
}, { "sent_id": "247bbc4eb671895222065ed425f968-C001-149", "text": "We draw the following observations based on these errors -1." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-150", "text": "There is a significant overlap in the words that ASR and humans delete, substitute and insert." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-151", "text": "2. Humans seem to be careful about marking hesitations -%hes-itation is the most inserted symbol." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-152", "text": "Hesitations seem to be important in conveying meaning to the sentences in human transcriptions." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-153", "text": "The ASR systems however focus on blind recognition and not in improving the meaning with appropriate pauses, etc." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-154", "text": "To measure the extent of this process, we score the human transcripts without any hesitations in a separate experiment and observe a 0.1% absolute improvement in WER." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-155", "text": "3. Machines have trouble recognizing short function words{the, and, of, a, that} and these get deleted the most." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-156", "text": "Humans, on the other hand, seem to catch most of them." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-157", "text": "It is likely that these words are not fully articulated and hence the machine fails to recognize them while humans are able to infer these words even without full acoustic evidence since they may have a better model of syntax/semantics." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-158", "text": "Table 7 ." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-159", "text": "Most frequent deletion and insertion errors for humans and ASR systems on DEV04F and RT04" }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-160", "text": "4. 
Compared to the telephone conversation confusions recorded in [2] -one symbol that is clearly missing is the back-channel response -this is probably from the very nature of the BN domain." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-161", "text": "5. Similar to telephone conversation confusions reported in [2] , humans performance is much higher because the number of deletions is significantly lower -compare 2.3% vs 0.8%/0.6% for deletion errors in Table 5 ." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-162", "text": "----------------------------------" }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-163", "text": "**CONCLUSION**" }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-164", "text": "We have presented recent improvements on broadcast news transcription based on earlier established techniques shown to be useful on CTS." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-165", "text": "Our experiments on BN show that these techniques can be transferred across domains to provide highly accurate transcriptions." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-166", "text": "For both acoustic and language modeling we have demonstrated the effectiveness of LSTM and ResNet based models." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-167", "text": "To verify the extent of the improvements obtained, human evaluation experiments are also performed on the two test sets of interest." }, { "sent_id": "247bbc4eb671895222065ed425f968-C001-168", "text": "We show that there still exists a significant gap between human and machine performance and demonstrate the need for continued research on broadcast news." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "247bbc4eb671895222065ed425f968-C001-13" ], [ "247bbc4eb671895222065ed425f968-C001-21", "247bbc4eb671895222065ed425f968-C001-22", "247bbc4eb671895222065ed425f968-C001-25" ] ], "cite_sentences": [ "247bbc4eb671895222065ed425f968-C001-13", "247bbc4eb671895222065ed425f968-C001-22" ] }, "@MOT@": { "gold_contexts": [ [ "247bbc4eb671895222065ed425f968-C001-13", "247bbc4eb671895222065ed425f968-C001-14" ], [ "247bbc4eb671895222065ed425f968-C001-30", "247bbc4eb671895222065ed425f968-C001-32" ] ], "cite_sentences": [ "247bbc4eb671895222065ed425f968-C001-13", "247bbc4eb671895222065ed425f968-C001-30" ] }, "@SIM@": { "gold_contexts": [ [ "247bbc4eb671895222065ed425f968-C001-30", "247bbc4eb671895222065ed425f968-C001-32" ], [ "247bbc4eb671895222065ed425f968-C001-36" ], [ "247bbc4eb671895222065ed425f968-C001-47" ], [ "247bbc4eb671895222065ed425f968-C001-81", "247bbc4eb671895222065ed425f968-C001-82", "247bbc4eb671895222065ed425f968-C001-84" ], [ "247bbc4eb671895222065ed425f968-C001-133" ], [ "247bbc4eb671895222065ed425f968-C001-140" ], [ "247bbc4eb671895222065ed425f968-C001-161" ] ], "cite_sentences": [ "247bbc4eb671895222065ed425f968-C001-30", "247bbc4eb671895222065ed425f968-C001-36", "247bbc4eb671895222065ed425f968-C001-47", "247bbc4eb671895222065ed425f968-C001-81", "247bbc4eb671895222065ed425f968-C001-133", "247bbc4eb671895222065ed425f968-C001-140", "247bbc4eb671895222065ed425f968-C001-161" ] }, "@USE@": { "gold_contexts": [ [ "247bbc4eb671895222065ed425f968-C001-58" ], [ "247bbc4eb671895222065ed425f968-C001-90" ] ], "cite_sentences": [ "247bbc4eb671895222065ed425f968-C001-58", "247bbc4eb671895222065ed425f968-C001-90" ] }, "@DIF@": { "gold_contexts": [ [ "247bbc4eb671895222065ed425f968-C001-160" ] ], "cite_sentences": [ "247bbc4eb671895222065ed425f968-C001-160" ] } } }, "ABC_1056d36c5ed22c7a34f6fe82b4962f_5": { "x": [ { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-58", "text": "**SEARCH**" }, { "sent_id": 
"1056d36c5ed22c7a34f6fe82b4962f-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-2", "text": "Recent work on Grammatical Error Correction (GEC) has highlighted the importance of language modeling in that it is certainly possible to achieve good performance by comparing the probabilities of the proposed edits." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-3", "text": "At the same time, advancements in language modeling have managed to generate linguistic output, which is almost indistinguishable from that of human-generated text." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-4", "text": "In this paper, we up the ante by exploring the potential of more sophisticated language models in GEC and offer some key insights on their strengths and weaknesses." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-5", "text": "We show that, in line with recent results in other NLP tasks, Transformer architectures achieve consistently high performance and provide a competitive baseline for future machine learning models." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-6", "text": "----------------------------------" }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-8", "text": "Transformer models (Vaswani et al., 2017) trained on large-scale language modeling datasets have recently proved to be a very effective means of representing the meaning of a sentence, being put to effective use in fine-tuning both sentence-level tasks, such as the GLUE benchmark (Wang et al., 2018) and token-level tasks, such as Named Entity Recognition (Devlin et al., 2019) ." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-9", "text": "Recent work has also found them to produce linguistically valid representations (Goldberg, 2019) , as well as to display excellent performance across multiple downstream NLP tasks (e.g., Houlsby et al. 2019) ." 
}, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-10", "text": "In this work, we explore how such models perform in the task of Grammatical Error Correction (GEC)." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-11", "text": "While there is a substantial amount of work on statistical (Rozovskaya and Roth, 2016; Junczys-Dowmunt and Grundkiewicz, 2014; Yannakoudakis et al., 2017) and neural (Ji et al., 2017; Xie et al., 2016; Yuan and Briscoe, 2016; Chollampatt et al., 2016; Chollampatt and Ng, 2017; Chollampatt and Ng, 2018) machine translation methods for GEC, we follow the approach of Bryant and Briscoe (2018) and explore how such models would fare in this task when treated as simple language models." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-12", "text": "More specifically, Bryant and Briscoe (2018) train a 5-gram language model on the One Billion Word Benchmark (Chelba et al., 2013) dataset and find that it produces competitive baseline results without any supervised training." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-13", "text": "In our work, we extend this work by substituting the n-gram model for several publicly available implementations of state-of-the-art Transformer language models trained on large linguistic corpora and assess their performance on GEC without any supervised training." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-14", "text": "We find that Transformer language models produce results on par with supervised approaches providing a solid baseline system." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-15", "text": "This finding is of particular importance in GEC, where data collection and annotation requires substantial manual effort." 
}, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-16", "text": "----------------------------------" }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-17", "text": "**RELATED WORK**" }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-18", "text": "The idea of using language models is quite fundamental to the task of Grammatical Error Correction, which has fed a substantial body of work over the years." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-19", "text": "More recently, with the availability of web-scale data powering the advances in language modeling, among most of the other advances in NLP, a plethora of language-modeling based approaches have been proposed for the GEC task." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-20", "text": "Gamon et al. (2008); Matthieu Hermet and Szpakowicz (2008) and Yi et al. (2008) were some of the early works to successfully leverage language models trained on large amounts of web-scale data into a GEC system, reinforcing the idea that simple models and a lot of data trump more elaborate models based on annotated data (Halevy et al., 2009) ." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-21", "text": "Since then, multiple works based on languagemodels have been proposed for the GEC task (Park and Levy, 2011; Dahlmeier and Ng, 2012a) , either relying entirely on LMs or using them for fine-tuning their systems." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-22", "text": "Many of the topranked systems in the CoNLL-2013 GEC shared tasks (Ng et al., 2013 , were either based on language models or had them as integral parts of their systems (Kao et al., 2013; Yoshimoto et al., 2013; Xing et al., 2013; Lee and Lee, 2014; Junczys-Dowmunt and Grundkiewicz, 2014) ." 
}, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-23", "text": "LM-only approaches though took a backseat and were only sporadically used after the shared tasks, as Neural Machine Translationbased approaches took over, but LMs remained an integral part of the GEC systems (JunczysDowmunt and Grundkiewicz, 2016; Ji et al., 2017; Xie et al., 2016; Junczys-Dowmunt et al., 2018; Chollampatt and Ng, 2018) ." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-24", "text": "However, Bryant and Briscoe (2018) recently revived the idea, achieving competitive performance with the state-ofthe-art, demonstrating the effectiveness of the approaches to the task without using any annotated data for training." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-25", "text": "----------------------------------" }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-26", "text": "**METHODOLOGY**" }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-27", "text": "In this work, we follow the setup from Bryant and Briscoe (2018) substituting the 5-gram language model for different language models based on the Transformer architecture." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-28", "text": "Specifically, we use Google's BERT (Devlin et al., 2019) and OpenAI's GPT (Radford et al., 2018) and GPT-2 (Radford et al., 2019) ." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-29", "text": "1 While all these are best thought of as language models in that they have been trained to predict an element in a sequence, they use slightly different objectives which does not make them directly comparable." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-30", "text": "Specifically, GPT and GPT-2 have been trained with a classic language modeling objective, whereby they predict the next word in a sequence, whereas BERT has been trained using a masked language modeling objective in which the network attempts to predict masked words in the sentence." 
}, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-31", "text": "We extract the probability of a sentence from BERT, by iteratively masking every word in the sentence and then summing the log probabilities." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-32", "text": "While this approach is far from ideal, it has been shown (Wang and Cho, 2019) that it approximates the log-likelihood of a sentence." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-33", "text": "----------------------------------" }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-34", "text": "**CONFUSION SETS**" }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-35", "text": "Since our systems do not generate novel sequences, we follow Bryant and Briscoe (2018) and use simple heuristics to generate a confusion set of sentences that our language models score." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-57", "text": "----------------------------------" }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-36", "text": "For prepositions and determiners, the confusion set includes the set of all prepositions and determiners plus an empty string \u01eb to remove unnecessary additions." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-37", "text": "For morphological errors (e.g., past tense or pluralization), we use the Automatically Generated Inflection Database (AGID) which contains different morphological forms for each word." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-38", "text": "2 However, we notice that due to the automatic generation, AGID contains errors that might propagate into our scoring." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-39", "text": "The problem with introducing new errors and non-words is that they would be interpreted as unknown words (henceforth [UNK]s) from the model's perspective." 
}, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-40", "text": "An unknown word in some context might give higher probabilities to an erroneous sentence and cause the model not to select the correct alternative." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-41", "text": "To remedy this issue, we generate a vocabulary from all the training sets and make sure that any proposed words which do not exist in the vocabulary are replaced by [UNK]s." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-42", "text": "Note that there is no reason to re-use the vocabulary of the training sets as any large English wordlist would achieve a similar effect." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-43", "text": "Finally, for spelling mistakes, we, again, follow Bryant and Briscoe (2018) and use CyHunSpell 3 to generate alternatives for non-words." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-44", "text": "----------------------------------" }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-45", "text": "**THRESHOLDING**" }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-46", "text": "Given that our confusion set is prone to errors (due to its automatic generation procedure) as well as the fact that we cannot target all potential errors (e.g., insertions), we bias our method to prefer the original sentence unless a much better the alternative is found." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-47", "text": "We quantify this margin by imposing a threshold above which we accept a candidate sentence as a better alternative." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-48", "text": "Concretely, let P (s c ) be the probability of the candidate sentence and P (s o ) the probability of the Table 2 : Results of our Transformer-Language Model approach against similar approaches (Bryant and Briscoe, 2018) and state-of-the-art on Grammatical Error Correction." 
}, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-49", "text": "For each of the datasets, we use the corresponding test set, and we do not train our models on the corpora." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-50", "text": "For BERT, we report the best performing model (12 layers, retaining uppercase characters)." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-51", "text": "In the top part of each dataset, we report the scores of supervised methods and in the bottom the unsupervised ones." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-52", "text": "\u2020 denotes that this system won the shared task competition." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-53", "text": "original sentence, then we accept the candidate if P(s_c) > P(s_o) + \u03c4, where \u03c4 is some threshold parameter which we fit on each development set." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-54", "text": "Note that, practically, this parameter controls the trade-off between precision and recall, as higher \u03c4 values would mean that there is less chance of changing the original sentence (i.e., higher precision) and vice versa." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-55", "text": "We explore different values for \u03c4 \u2208 {0, 2, 4, 6, 8} by, as above, fitting them on the corresponding development set." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-56", "text": "4" }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-59", "text": "Finally, we perform greedy search to find the best alternative sentence by iterating over each sentence multiple times, once for every position for which our heuristics found alternatives." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-60", "text": "If an alternative is selected for the target position, we update the original sentence and proceed to the next position." 
}, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-61", "text": "This pseudo-log-likelihood approximation makes the problem of considering every permutation more computationally tractable." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-62", "text": "----------------------------------" }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-63", "text": "**EXPERIMENTS**" }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-64", "text": "We evaluate our method and report results on two standard publicly available datasets." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-65", "text": "4 Note that the probability of each sentence is in log space." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-66", "text": "Our evaluation is aimed to stay as true to Bryant and Briscoe (2018) (Yannakoudakis et al., 2011)." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-67", "text": "Unfortunately, due to licensing issues, we were unable to obtain permission to use the JFLEG (Napoles et al., 2017) corpus for evaluation." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-68", "text": "Note that in our method, we do not make use of the training sets commonly used with these datasets." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-69", "text": "However, we use the development sets used by Bryant and Briscoe (2018) to tune the hyperparameter \u03c4." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-70", "text": "The number of sentences and tokens for the datasets we used can be found in Table 1." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-71", "text": "Similar to Bryant and Briscoe (2018), we report results on three metrics." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-72", "text": "We use the MaxMatch (M^2) Precision, Recall and F_0.5 (Dahlmeier and Ng, 2012b) and ERRANT Precision, Recall and F_0.5 (Bryant et al., 2017)." 
}, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-73", "text": "Source It will start by a speech from the Director of the conference, followed by a meal." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-74", "text": "----------------------------------" }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-75", "text": "**GOLD**" }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-76", "text": "It will start with a speech by the Director of the conference, followed by a meal." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-77", "text": "BERT It will start with a speech from the Director of the conference, followed by a meal." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-78", "text": "GPT It will start by a speech from the Director of the conference, followed by a meal." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-79", "text": "GPT-2 It will start with a speech from the Director of the conference, followed by a meal." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-80", "text": "Source They all knows where the conference is and when." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-81", "text": "They all know where the conference is and when." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-82", "text": "BERT They all know where the conferencing is and when." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-83", "text": "GPT They all knows where the conference is and when." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-84", "text": "GPT-2 They all know where the conference is and when." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-85", "text": "Table 3 : Source sentences along with the gold edits and the proposed candidates from each of our models." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-86", "text": "Table 2 presents the results of our method comparing them against recent state-of-the-art supervised models and the simple n-gram language model used by Bryant and Briscoe (2018) ." 
}, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-87", "text": "Table 3 shows some qualitative examples on how each model corrects two sentences pulled from the FCE along with the gold annotations." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-88", "text": "The reported results come from the best performing hyperparameter \u03c4 on each dataset." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-89", "text": "For BERT, we also explored different sizes (12 vs. 24 layers) and whether retaining uppercase characters helps in performance." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-90", "text": "The best performing \u03c4 values were \u03c4 = 4 for CoNLL14 for all models; for the FCE dataset: BERT \u03c4 = 4, GPT \u03c4 = 8, and GPT-2 \u03c4 = 6." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-91", "text": "The best 'version' of BERT was the large, cased one (i.e., retaining the lower-/uppercase distinction)." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-92", "text": "----------------------------------" }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-93", "text": "**RESULTS**" }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-94", "text": "A key result of Table 2 is that Transformer Language Models prove to be more than just a competitive baseline to legitimate Grammatical Error Correction systems on their own." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-95", "text": "Across the board, Transformer Models are able to outperform the simple n-gram model and even approach the performance of supervised GEC systems." 
}, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-96", "text": "----------------------------------" }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-97", "text": "**DISCUSSION**" }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-98", "text": "Looking at the performance of the two GPT models more closely, we see that their performance is nearly identical with GPT-2 leading by a small margin in the CoNLL14 dataset." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-99", "text": "Given that the versions we used share the same number of layers (12), we attribute GPT-2's slight advantage to the fact that it was trained on considerably more data." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-100", "text": "5 While we acknowledge the contemporaneous nature of the BEA 2019 Shared Task on GEC and would have liked to report results on the W&I+LOCNESS data, we could not do so because of license limitations." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-101", "text": "Another interesting result is that while BERT surpasses the n-gram baseline overall, it achieves worse performance than the rest in terms of precision and F 0.5 score." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-102", "text": "Considering its overall success at modeling NLP tasks, one might expect BERT to achieve better performance here." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-103", "text": "However, as mentioned above, BERT is not truly a language model in the sense that GPT and GPT-2 are but uses a quasi-language modeling objective which could explain its degraded performance in this setting." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-104", "text": "Note that framing the task differently (e.g., by masking the preposition in a sentence and selecting the one with the highest probability) might give the edge to BERT as it resembles the way it was trained." 
}, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-105", "text": "It is also worth mentioning that despite tuning \u03c4 to each dataset, we do not explore different weights for different kinds of errors (e.g., penalizing spelling mistakes more heavily)." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-106", "text": "Our key motivation was to corroborate and extend the results of Bryant and Briscoe (2018) to current state-of-the-art language models which have been trained in several languages and show that these models are tough baselines to beat for novel GEC systems." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-107", "text": "While the results of the Transformer language models shown in Table 2 demonstrate that they are a tough baseline to beat, it is worth noting that the present approach is not without its limitations." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-108", "text": "We believe that our methodology should not be considered a panacea to GEC." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-109", "text": "For instance, being bound by the confusion sets, our system (1) cannot handle missing words (which make up about 20% of all errors), and (2) is tuned to capture only a subset of the possible mistakes a writer can make (closed class words)." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-110", "text": "It could be argued that since our system makes use of a pre-defined confusion set (even an automatically generated one), it could not be considered a fully unsupervised system." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-111", "text": "In principle, we agree with that statement and we believe that a system which uses, for example, corpus statistics to generate a confusion set on the fly would be a very interesting exercise and could yield similar results." 
}, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-112", "text": "However, the present paper is concerned with highlighting the importance of language modeling in GEC and its potential in aiding in low-resource languages where large parallel datasets are unavailable, but such confusion sets are relatively easily obtainable." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-113", "text": "----------------------------------" }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-114", "text": "**CONCLUSION**" }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-115", "text": "In this work, we advanced on the foundational idea that a simple language modeling-based approach to GEC with no annotated data can challenge the latest neural and machine translation approaches that rely on large quantities of annotated training data." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-116", "text": "To this end, we improve on previous work by leveraging state-of-the-art language modeling techniques and perform a thorough comparison of three state-of-the-art Transformer language models which in turn have been trained on data of the order of hundreds of millions of words." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-117", "text": "We find that merely using pre-trained, and publicly available neural language models improves the performance by a significant margin and comes within striking distance of the state-of-the-art methods." }, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-118", "text": "This work reinforces the strength and robustness of language-model based methods for the task of grammatical error correction." 
}, { "sent_id": "1056d36c5ed22c7a34f6fe82b4962f-C001-119", "text": "While recent state-of-the-art GEC systems are pursuing NMT-based models with huge amounts (millions of sentences) of annotated training data, approaches like this, which require no annotated training data, provide great value to researchers and developers interested in building competitive GEC systems (e.g., in other languages) with limited annotated data." } ], "y": { "@USE@": { "gold_contexts": [ [ "1056d36c5ed22c7a34f6fe82b4962f-C001-11" ], [ "1056d36c5ed22c7a34f6fe82b4962f-C001-35" ], [ "1056d36c5ed22c7a34f6fe82b4962f-C001-43" ], [ "1056d36c5ed22c7a34f6fe82b4962f-C001-48" ], [ "1056d36c5ed22c7a34f6fe82b4962f-C001-68", "1056d36c5ed22c7a34f6fe82b4962f-C001-69" ], [ "1056d36c5ed22c7a34f6fe82b4962f-C001-86" ] ], "cite_sentences": [ "1056d36c5ed22c7a34f6fe82b4962f-C001-11", "1056d36c5ed22c7a34f6fe82b4962f-C001-35", "1056d36c5ed22c7a34f6fe82b4962f-C001-43", "1056d36c5ed22c7a34f6fe82b4962f-C001-48", "1056d36c5ed22c7a34f6fe82b4962f-C001-69", "1056d36c5ed22c7a34f6fe82b4962f-C001-86" ] }, "@EXT@": { "gold_contexts": [ [ "1056d36c5ed22c7a34f6fe82b4962f-C001-12", "1056d36c5ed22c7a34f6fe82b4962f-C001-13" ], [ "1056d36c5ed22c7a34f6fe82b4962f-C001-27" ], [ "1056d36c5ed22c7a34f6fe82b4962f-C001-106" ] ], "cite_sentences": [ "1056d36c5ed22c7a34f6fe82b4962f-C001-12", "1056d36c5ed22c7a34f6fe82b4962f-C001-13", "1056d36c5ed22c7a34f6fe82b4962f-C001-27", "1056d36c5ed22c7a34f6fe82b4962f-C001-106" ] }, "@MOT@": { "gold_contexts": [ [ "1056d36c5ed22c7a34f6fe82b4962f-C001-12", "1056d36c5ed22c7a34f6fe82b4962f-C001-13" ], [ "1056d36c5ed22c7a34f6fe82b4962f-C001-106" ] ], "cite_sentences": [ "1056d36c5ed22c7a34f6fe82b4962f-C001-12", "1056d36c5ed22c7a34f6fe82b4962f-C001-13", "1056d36c5ed22c7a34f6fe82b4962f-C001-106" ] }, "@BACK@": { "gold_contexts": [ [ "1056d36c5ed22c7a34f6fe82b4962f-C001-24" ] ], "cite_sentences": [ "1056d36c5ed22c7a34f6fe82b4962f-C001-24" ] }, "@DIF@": { "gold_contexts": [ [ 
"1056d36c5ed22c7a34f6fe82b4962f-C001-68", "1056d36c5ed22c7a34f6fe82b4962f-C001-69" ] ], "cite_sentences": [ "1056d36c5ed22c7a34f6fe82b4962f-C001-69" ] }, "@SIM@": { "gold_contexts": [ [ "1056d36c5ed22c7a34f6fe82b4962f-C001-71" ] ], "cite_sentences": [ "1056d36c5ed22c7a34f6fe82b4962f-C001-71" ] } } }, "ABC_f6694f359ae948b6e4563b927a672c_5": { "x": [ { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-2", "text": "Mapping word embeddings of different languages into a single space has multiple applications." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-3", "text": "In order to map from a source space into a target space, a common approach is to learn a linear mapping that minimizes the distances between equivalences listed in a bilingual dictionary." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-4", "text": "In this paper, we propose a framework that generalizes previous work, provides an efficient exact method to learn the optimal linear transformation and yields the best bilingual results in translation induction while preserving monolingual performance in an analogy task." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-5", "text": "----------------------------------" }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-7", "text": "Bilingual word embeddings have attracted a lot of attention in recent times (Zou et al., 2013; Ko\u010disk\u00fd et al., 2014; Chandar A P et al., 2014; Gouws et al., 2014; Gouws and S\u00f8gaard, 2015; Luong et al., 2015; Wick et al., 2016) ." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-8", "text": "A common approach to obtain them is to train the embeddings in both languages independently and then learn a mapping that minimizes the distances between equivalences listed in a bilingual dictionary." 
}, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-9", "text": "The learned transformation can also be applied to words missing in the dictionary, which can be used to induce new translations with a direct application in machine translation (Mikolov et al., 2013b; Zhao et al., 2015) ." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-10", "text": "The first method to learn bilingual word embedding mappings was proposed by Mikolov et al. (2013b) , who learn the linear transformation that minimizes the sum of squared Euclidean distances for the dictionary entries." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-11", "text": "Subsequent work has proposed alternative optimization objectives to learn better mappings." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-12", "text": "Xing et al. (2015) incorporate length normalization in the training of word embeddings and try to maximize the cosine similarity instead, introducing an orthogonality constraint to preserve the length normalization after the projection." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-13", "text": "Faruqui and Dyer (2014) use canonical correlation analysis to project the embeddings in both languages to a shared vector space." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-14", "text": "Beyond linear mappings, Lu et al. (2015) apply deep canonical correlation analysis to learn a nonlinear transformation for each language." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-15", "text": "Finally, additional techniques have been used to address the hubness problem in Mikolov et al. (2013b) , both through the neighbor retrieval method and the training itself ." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-16", "text": "We leave the study of non-linear transformation and other additions for further work." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-17", "text": "In this paper, we propose a general framework to learn bilingual word embeddings." 
}, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-18", "text": "We start with a basic optimization objective (Mikolov et al., 2013b) and introduce several meaningful and intuitive constraints that are equivalent or closely related to previously proposed methods (Faruqui and Dyer, 2014; Xing et al., 2015) ." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-19", "text": "Our framework provides a more general view of bilingual word embedding mappings, showing the underlying connection between the existing methods, revealing some flaws in their theoretical justification and providing an alternative theoretical interpretation for them." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-20", "text": "Our experiments on an existing English-Italian word translation induction and an English word analogy task give strong empirical evidence in favor of our theoretical reasoning, while showing that one of our models clearly outperforms previous alternatives." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-21", "text": "----------------------------------" }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-22", "text": "**LEARNING BILINGUAL MAPPINGS**" }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-23", "text": "Let X and Z denote the word embedding matrices in two languages for a given bilingual dictionary so that their ith row X i * and Z i * are the word embeddings of the ith entry in the dictionary." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-24", "text": "Our goal is to find a linear transformation matrix W so that XW best approximates Z, which we formalize minimizing the sum of squared Euclidean distances following Mikolov et al. 
(2013b):" }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-25", "text": "Alternatively, this is equivalent to minimizing the (squared) Frobenius norm of the residual matrix:" }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-26", "text": "Consequently, W will be the so-called least-squares solution of the linear matrix equation XW = Z. This is a well-known problem in linear algebra and can be solved by taking the Moore-Penrose pseudoinverse X^+ = (X^T X)^{-1} X^T as W = X^+ Z, which can be computed using SVD." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-27", "text": "----------------------------------" }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-28", "text": "**ORTHOGONALITY FOR MONOLINGUAL INVARIANCE**" }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-29", "text": "Monolingual invariance is needed to preserve the dot products after mapping, avoiding performance degradation in monolingual tasks (e.g. analogy)." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-30", "text": "This can be obtained by requiring W to be an orthogonal matrix (W^T W = I)." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-31", "text": "The exact solution under such orthogonality constraint is given by W = V U^T, where Z^T X = U \u03a3 V^T is the SVD factorization of Z^T X (cf." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-32", "text": "Appendix A)." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-33", "text": "Thanks to this, the optimal transformation can be efficiently computed in linear time with respect to the vocabulary size." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-34", "text": "Note that orthogonality enforces an intuitive property, and as such it could be useful to avoid degenerated solutions and learn better bilingual mappings, as we empirically show in Section 3." 
}, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-35", "text": "----------------------------------" }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-36", "text": "**LENGTH NORMALIZATION FOR MAXIMUM COSINE**" }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-37", "text": "Normalizing word embeddings in both languages to be unit vectors guarantees that all training instances contribute equally to the optimization goal." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-38", "text": "As long as W is orthogonal, this is equivalent to maximizing the sum of cosine similarities for the dictionary entries, which is commonly used for similarity computations:" }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-39", "text": "This last optimization objective coincides with Xing et al. (2015) , but their work was motivated by an hypothetical inconsistency in Mikolov et al. (2013b) , where the optimization objective to learn word embeddings uses dot product, the objective to learn mappings uses Euclidean distance and the similarity computations use cosine." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-40", "text": "However, the fact is that, as long as W is orthogonal, optimizing the squared Euclidean distance of length-normalized embeddings is equivalent to optimizing the cosine, and therefore, the mapping objective proposed by Xing et al. (2015) is equivalent to that used by Mikolov et al. (2013b) with orthogonality constraint and unit vectors." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-41", "text": "In fact, our experiments show that orthogonality is more relevant than length normalization, in contrast to Xing et al. (2015) , who introduce orthogonality only to ensure that unit length is preserved after mapping." 
}, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-42", "text": "----------------------------------" }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-43", "text": "**MEAN CENTERING FOR MAXIMUM COVARIANCE**" }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-44", "text": "Dimension-wise mean centering captures the intuition that two randomly taken words would not be expected to be semantically similar, ensuring that the expected product of two random embeddings in any dimension and, consequently, their cosine similarity, is zero." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-45", "text": "As long as W is orthogonal, this is equivalent to maximizing the sum of dimension-wise covariance for the dictionary entries:" }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-46", "text": "where C_m denotes the centering matrix. This equivalence reveals that the method proposed by Faruqui and Dyer (2014) is closely related to our framework." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-47", "text": "More concretely, Faruqui and Dyer (2014) use Canonical Correlation Analysis (CCA) to project the word embeddings in both languages to a shared vector space." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-48", "text": "CCA maximizes the dimension-wise covariance of both projections (which is equivalent to maximizing the covariance of a single projection if the transformations are constrained to be orthogonal, as in our case) but adds an implicit restriction to the two mappings, making different dimensions have the same variance and be uncorrelated among themselves 1 :" }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-49", "text": "Therefore, the only fundamental difference between both methods is that, while our model enforces monolingual invariance, Faruqui and Dyer (2014) do change the monolingual embeddings to meet this restriction." 
}, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-50", "text": "In this regard, we think that the restriction they add could have a negative impact on the learning of the bilingual mapping, and it could also degrade the quality of the monolingual embeddings." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-51", "text": "Our experiments (cf. Section 3) show empirical evidence supporting this idea." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-52", "text": "----------------------------------" }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-53", "text": "**EXPERIMENTS**" }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-54", "text": "In this section, we experimentally test the proposed framework and all its variants in comparison with related methods." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-55", "text": "For that purpose, we use the translation induction task introduced by Mikolov et al. (2013b) , which learns a bilingual mapping on a small dictionary and measures its accuracy on predicting the translation of new words." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-56", "text": "Unfortunately, the dataset they use is not public." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-57", "text": "For that reason, we use the English-Italian dataset on the same task provided by 2 ." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-58", "text": "The dataset contains monolingual word embeddings trained with the word2vec toolkit using the CBOW method with negative sampling (Mikolov et al., 2013a) 3 ." 
}, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-59", "text": "The English embeddings were trained on a 2.8 billion word corpus (ukWaC + Wikipedia + BNC), while the 1.6 billion word corpus itWaC was used to train the Italian embeddings. 1 While CCA is typically defined in terms of correlation (thus its name), correlation is invariant to the scaling of variables, so it is possible to constrain the canonical variables to have a fixed variance, as we do, in which case correlation and covariance become equivalent. 2 http://clic.cimec.unitn.it/~georgiana.dinu/down/" }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-60", "text": "3 The context window was set to 5 words, the dimension of the embeddings to 300, the sub-sampling to 1e-05 and the number of negative samples to 10." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-61", "text": "The dataset also contains a bilingual dictionary learned from Europarl, split into a training set of 5,000 word pairs and a test set of 1,500 word pairs, both of them uniformly distributed in frequency bins." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-62", "text": "Accuracy is the evaluation measure." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-63", "text": "Apart from the performance of the projected embeddings in bilingual terms, we are also interested in the monolingual quality of the source language embeddings after the projection." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-64", "text": "For that purpose, we use the word analogy task proposed by Mikolov et al. (2013a), which measures the accuracy on answering questions like \"what is the word that is similar to small in the same sense as biggest is similar to big?\" using simple word vector arithmetic." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-65", "text": "The dataset they use consists of 8,869 semantic and 10,675 syntactic questions of this type, and is publicly available 4 ." 
}, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-66", "text": "In order to speed up the experiments, we follow the authors and perform an approximate evaluation by reducing the vocabulary size according to a frequency threshold of 30,000 (Mikolov et al., 2013a)." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-67", "text": "Since the original embeddings are the same in all the cases and it is only the transformation that is applied to them that changes, this affects all the methods in the exact same way, so the results are perfectly comparable among themselves." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-68", "text": "With these settings, we obtain a coverage of 64.98%." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-69", "text": "We implemented the proposed method in Python using NumPy, and make it available as an open source project 5 ." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-70", "text": "The code for Mikolov et al. (2013b) and Xing et al. (2015) is not publicly available, so we implemented and tested them as part of the proposed framework, which only differs from the original systems in the optimization method (exact solution instead of gradient descent) and the length normalization approach in the case of Xing et al. (2015) (postprocessing instead of constrained training)." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-71", "text": "As for the method by Faruqui and Dyer (2014), we used their original implementation in Python and MATLAB 6 , which we extended to cover cases where the dictionary contains more than one entry for the same word." 
}, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-72", "text": "----------------------------------" }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-73", "text": "**RESULTS OF OUR FRAMEWORK**" }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-74", "text": "The rows in Table 1 show, respectively, the results for the original embeddings, the basic mapping proposed by Mikolov et al. (2013b) (cf. Section 2) and the addition of the orthogonality constraint (cf. Section 2.1), with and without length normalization and, incrementally, mean centering." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-75", "text": "In all the cases, length normalization and mean centering were applied to all embeddings, even if missing from the dictionary." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-76", "text": "The results show that the orthogonality constraint is key to preserving monolingual performance, and it also improves bilingual performance by enforcing a relevant property (monolingual invariance) that the transformation to learn should intuitively have." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-77", "text": "The contribution of length normalization alone is marginal, but when followed by mean centering we obtain further improvements in bilingual performance without hurting monolingual performance." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-78", "text": "Table 2 shows the results for our best performing configuration in comparison to previous work." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-79", "text": "As discussed before, (Mikolov et al., 2013b) and (Xing et al., 2015) were implemented as part of our framework, so they correspond to our unconstrained mapping with no preprocessing and orthogonal mapping with length normalization, respectively." 
}, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-80", "text": "----------------------------------" }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-81", "text": "**COMPARISON TO OTHER WORK**" }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-82", "text": "As it can be seen, the method by Xing et al. (2015) performs better than that of Mikolov et al. (2013b) in the translation induction task, which is in line with what they report in their paper." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-83", "text": "Moreover, thanks to the orthogonality constraint their monolingual performance in the word analogy task does not degrade, whereas the accuracy of Mikolov et al. (2013b) drops by 2.86% in absolute terms with respect to the original embeddings." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-84", "text": "Since Faruqui and Dyer (2014) Mikolov et al. (2013b) 34.93% 73. 80% Xing et al. (2015) 36.87% 76.66% Faruqui and Dyer (2014) CCA to perform dimensionality reduction, we tested several values for it and report the best (180 dimensions)." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-85", "text": "This beats the method by Xing et al. (2015) in the bilingual task, although it comes at the price of a considerable degradation in monolingual quality." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-86", "text": "In any case, it is our proposed method with the orthogonality constraint and a global preprocessing with length normalization followed by dimensionwise mean centering that achieves the best accuracy in the word translation induction task." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-87", "text": "Moreover, it does not suffer from any considerable degradation in monolingual quality, with an anecdotal drop of only 0.07% in contrast with 2.86% for Mikolov et al. (2013b) and 7.02% for Faruqui and Dyer (2014) ." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-88", "text": "When compared to Xing et al. 
(2015) , our results in Table 1 reinforce our theoretical interpretation for their method (cf. Section 2.2), as it empirically shows that its improvement with respect to Mikolov et al. (2013b) comes solely from the orthogonality constraint, and not from solving any inconsistency." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-89", "text": "It should be noted that the implementation by Faruqui and Dyer (2014) also length-normalizes the word embeddings in a preprocessing step." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-90", "text": "Following the discussion in Section 2.3, this means that our best performing configuration is conceptually very close to the method by Faruqui and Dyer (2014) : both maximize the average dimension-wise covariance and length-normalize the embeddings in both languages first, the only difference being that our model enforces monolingual invariance after the normalization, whereas theirs changes the monolingual embeddings to make different dimensions have the same variance and be uncorrelated among themselves." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-91", "text": "However, our model performs considerably better than any configuration from Faruqui and Dyer (2014) in both the monolingual and the bilingual task, supporting our hypothesis that these two constraints that are implicit in their method are not only conceptually confusing, but also have a negative impact." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-92", "text": "----------------------------------" }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-93", "text": "**CONCLUSIONS**" }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-94", "text": "This paper develops a new framework to learn bilingual word embedding mappings, generalizing previous work and providing an efficient exact method to learn the optimal transformation."
}, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-95", "text": "Our experiments show the effectiveness of the proposed model and give strong empirical evidence in favor of our reinterpretation of Xing et al. (2015) and Faruqui and Dyer (2014) ." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-96", "text": "It is the proposed method with the orthogonality constraint and a global preprocessing with length normalization and dimension-wise mean centering that achieves the best overall results both in monolingual and bilingual terms, surpassing those previous methods." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-97", "text": "In the future, we would like to study non-linear mappings (Lu et al., 2015) and the additional techniques in ." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-98", "text": "ish Ministry of Economy and Competitiveness (TADEEP TIN2015-70214-P)." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-99", "text": "Mikel Artetxe enjoys a doctoral grant from the Spanish Ministry of Education, Culture and Sports." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-100", "text": "----------------------------------" }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-101", "text": "**A PROOF OF SOLUTION UNDER ORTHOGONALITY**" }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-102", "text": "Constraining W to be orthogonal (W T W = I), the original minimization problem can be reformulated as follows (cf. Section 2.1): arg min In the above expression, Tr(\u00b7) denotes the trace operator (the sum of all the elements in the main diagonal), and the last equality is given by its cyclic property." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-103", "text": "At this point, we can take the SVD of Z T X as Z T X = U \u03a3V T , so Tr Z T XW = Tr U \u03a3V T W = Tr \u03a3V T W U ." 
}, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-104", "text": "Since V T , W and U are orthogonal matrices, their product V T W U will also be an orthogonal matrix." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-105", "text": "In addition to that, given that \u03a3 is a diagonal matrix, its trace after an orthogonal transformation will be maximal when the values in its main diagonal are preserved after the mapping, that is, when the orthogonal transformation matrix is the identity matrix." }, { "sent_id": "f6694f359ae948b6e4563b927a672c-C001-106", "text": "This will happen when V T W U = I in our case, so the optimal solution will be W = V U T ." } ], "y": { "@BACK@": { "gold_contexts": [ [ "f6694f359ae948b6e4563b927a672c-C001-11", "f6694f359ae948b6e4563b927a672c-C001-12" ], [ "f6694f359ae948b6e4563b927a672c-C001-82", "f6694f359ae948b6e4563b927a672c-C001-83" ] ], "cite_sentences": [ "f6694f359ae948b6e4563b927a672c-C001-12", "f6694f359ae948b6e4563b927a672c-C001-82", "f6694f359ae948b6e4563b927a672c-C001-83" ] }, "@SIM@": { "gold_contexts": [ [ "f6694f359ae948b6e4563b927a672c-C001-18" ], [ "f6694f359ae948b6e4563b927a672c-C001-79" ] ], "cite_sentences": [ "f6694f359ae948b6e4563b927a672c-C001-18", "f6694f359ae948b6e4563b927a672c-C001-79" ] }, "@DIF@": { "gold_contexts": [ [ "f6694f359ae948b6e4563b927a672c-C001-39", "f6694f359ae948b6e4563b927a672c-C001-40", "f6694f359ae948b6e4563b927a672c-C001-41" ], [ "f6694f359ae948b6e4563b927a672c-C001-84", "f6694f359ae948b6e4563b927a672c-C001-85" ], [ "f6694f359ae948b6e4563b927a672c-C001-94", "f6694f359ae948b6e4563b927a672c-C001-95" ] ], "cite_sentences": [ "f6694f359ae948b6e4563b927a672c-C001-39", "f6694f359ae948b6e4563b927a672c-C001-40", "f6694f359ae948b6e4563b927a672c-C001-41", "f6694f359ae948b6e4563b927a672c-C001-84", "f6694f359ae948b6e4563b927a672c-C001-85", "f6694f359ae948b6e4563b927a672c-C001-95" ] }, "@USE@": { "gold_contexts": [ [ "f6694f359ae948b6e4563b927a672c-C001-70" ], [ 
"f6694f359ae948b6e4563b927a672c-C001-79" ] ], "cite_sentences": [ "f6694f359ae948b6e4563b927a672c-C001-70", "f6694f359ae948b6e4563b927a672c-C001-79" ] }, "@EXT@": { "gold_contexts": [ [ "f6694f359ae948b6e4563b927a672c-C001-94", "f6694f359ae948b6e4563b927a672c-C001-95" ] ], "cite_sentences": [ "f6694f359ae948b6e4563b927a672c-C001-95" ] } } }, "ABC_580713b57ae47692af0d0c86a07fd1_5": { "x": [ { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-41", "text": "**METHOD**" }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-2", "text": "Neural networks have shown promising results for relation extraction." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-3", "text": "State-ofthe-art models cast the task as an end-toend problem, solved incrementally using a local classifier." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-4", "text": "Yet previous work using statistical models have demonstrated that global optimization can achieve better performances compared to local classification." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-5", "text": "We build a globally optimized neural model for end-to-end relation extraction, proposing novel LSTM features in order to better learn context representations." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-6", "text": "In addition, we present a novel method to integrate syntactic information to facilitate global learning, yet requiring little background on syntactic grammars thus being easy to extend." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-7", "text": "Experimental results show that our proposed model is highly effective, achieving the best performances on two standard benchmarks." 
}, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-8", "text": "----------------------------------" }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-10", "text": "Extracting entities (Florian et al., 2006 (Florian et al., , 2010 and relations (Zhao and Grishman, 2005; Jiang and Zhai, 2007; Sun et al., 2011; Plank and Moschitti, 2013 ) from unstructured texts have been two central tasks in information extraction (Grishman, 1997; Doddington et al., 2004) ." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-11", "text": "Traditional approaches to relation extraction take entity recognition as a predecessor step in a pipeline (Zelenko et al., 2003; Chan and Roth, 2011) , predicting relations between given entities." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-12", "text": "In recent years, there has been a surge of interest in performing end-to-end relation extraction, jointly recognizing entities and relations given free text inputs (Li and Ji, 2014; Miwa and Sasaki, 2014; Miwa and Bansal, 2016; ." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-13", "text": "End-to-end learning prevents error propagation in the pipeline approach, and allows cross-task dependencies to be modeled explicitly for entity recognition." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-14", "text": "As a result, it gives better relation extraction accuracies compared to pipelines." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-15", "text": "Miwa and Bansal (2016) were among the first to use neural networks for end-to-end relation extraction, showing highly promising results." 
}, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-16", "text": "In particular, they used bidirectional LSTM (Graves et al., 2013) to learn hidden word representations under a sentential context, and further leveraged treestructured LSTM (Tai et al., 2015) to encode syntactic information, given the output of a parser." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-17", "text": "The resulting representations are then used for making local decisions for entity and relation extraction incrementally, leading to much improved results compared with the best statistical model (Li and Ji, 2014) ." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-18", "text": "This demonstrates the strength of neural representation learning for end-to-end relation extraction." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-19", "text": "On the other hand, Miwa and Bansal (2016) 's model is trained locally, without considering structural correspondences between incremental decisions." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-20", "text": "This is unlike existing statistical methods, which utilize well-studied structured prediction methods to address the problem (Li and Ji, 2014; Miwa and Sasaki, 2014) ." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-21", "text": "As has been commonly understood, learning local decisions for structured prediction can lead to label bias (Lafferty et al., 2001) , which prevents globally optimal structures from receiving optimal scores by the model." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-22", "text": "We address this potential issue by building a structural neural model for end-to-end relation extraction, following a recent line of efforts on globally optimized models for neural structured prediction (Zhou et al., 2015; Watanabe and Sumita, 2015; Andor et al., 2016; Wiseman and Rush, 2016) ." 
}, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-23", "text": "In particular, we follow Miwa and Sasaki (2014) , casting the task as an end-to-end tablefilling problem." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-24", "text": "This is different from the actionbased method of Li and Ji (2014) , yet has shown to be more flexible and accurate (Miwa and Sasaki, 2014) ." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-25", "text": "We take a different approach to representation learning, addressing two potential limitations of Miwa and Bansal (2016) ." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-26", "text": "First, Miwa and Bansal (2016) rely on external syntactic parsers for obtaining syntactic information, which is crucial for relation extraction (Culotta and Sorensen, 2004; Zhou et al., 2005; Bunescu and Mooney, 2005; Qian et al., 2008) ." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-27", "text": "However, parsing errors can lead to encoding inaccuracies of tree-LSTMs, thereby hurting relation extraction potentially." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-28", "text": "We take an alternative approach to integrating syntactic information, by taking the hidden LSTM layers of a bi-affine attention parser (Dozat and Manning, 2016) to augment input representations." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-29", "text": "Pretrained for parsing, such hidden layers contain rich syntactic information on each word, yet do not explicitly represent parsing decisions, thereby avoiding potential issues caused by incorrect parses." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-30", "text": "Our method is also free from a particular syntactic formalism, such as dependency grammar, constituent grammar or combinatory categorial grammar, requiring only hidden representations on word that contain syntactic information." 
}, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-31", "text": "In contrast, the method of Miwa and Bansal (2016) must consider tree LSTM formulations that are specific to grammar formalisms, which can be structurally different (Tai et al., 2015) ." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-32", "text": "Second, Miwa and Bansal (2016) did not explicitly learn the representation of segments when predicting entity boundaries or making relation classification decisions, which can be intuitively highly useful, and has been investigated in several studies (Wang and Chang, 2016; ." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-33", "text": "We take the LSTM-Minus method of Wang and Chang (2016) , modelling a segment as the difference between its last and first LSTM hidden vectors." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-34", "text": "This method is highly efficient, yet gives as accurate results as compared to more complex neural network structures to model a span of words (Cross and Huang, 2016) ." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-35", "text": "Evaluation on two benchmark datasets shows that our method outperforms previous methods of Miwa and Bansal (2016) , Li and Ji (2014) and Miwa and Sasaki (2014) , giving the best reported results on both benchmarks." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-36", "text": "Detailed analysis shows that our integration of syntactic features is as effective as traditional approaches based on discrete parser outputs." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-37", "text": "We make our code publicly As shown in Figure 1 , the goal of relation extraction is to mine relations from raw texts." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-38", "text": "It consists of two sub-tasks, namely entity detection, which recognizes valid entities, and relation classification, which determines the relation categories over entity pairs." 
}, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-39", "text": "We follow recent studies and recognize entities and relations as one single task." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-40", "text": "----------------------------------" }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-42", "text": "We follow Miwa and Sasaki (2014) and , treating relation extraction as a tablefilling problem, performing entity detection and relation classification using a single incremental model, which is similar in spirit to Miwa and Bansal (2016) by performing the task end-to-end." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-43", "text": "Formally, given a sentence w 1 w 2 \u00b7 \u00b7 \u00b7 w n , we maintain a table T n\u00d7n , where T (i, j) denotes the relation between w i and w j ." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-44", "text": "When i = j, T (i, j) denotes an entity boundary label." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-45", "text": "We map entity words into labels under the BILOU (Begin, Inside, Last, Outside, Unit) scheme, assuming that there are no overlapping entities in one sentence (Li and Ji, 2014; Miwa and Sasaki, 2014; Miwa and Bansal, 2016) ." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-46", "text": "Only the upper triangular table is necessary for indicating the relations." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-47", "text": "We adopt the close-first left-to-right order (Miwa and Sasaki, 2014) to map the twodimensional table into a sequence, in order to fill the table incrementally." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-48", "text": "As shown in Figure 2 , first {T (i, i)} are filled by growing i, and then the sequence {T (i, i + 1)} is filled, and then {T (i, i + 2)}, \u00b7 \u00b7 \u00b7 , {T (i, i + n)} are filled incrementally, until the table is fully annotated." 
}, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-49", "text": "During the table-filling process, we take two label sets for entity detection (i = j) and relation Table- filling example, where numbers indicate the filling order." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-50", "text": "classification (i < j), respectively." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-51", "text": "The labels for entity detection include {B-*, I-*, L-*, O, U-* }, where * denotes the entity type, and the labels for relation classification are { \u2212 \u2192 * , \u2190 \u2212 * , \u22a5}, where * denotes the relation category and \u22a5 denotes a NULL relation." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-52", "text": "2 At each step, given a partially-filled table T , we determine the most suitable label l for the next step using a scoring function:" }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-53", "text": "where W l is a model parameter and h T is the vector representation of T ." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-54", "text": "Based on the function, we aim to find the best label sequence" }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-55", "text": ", and the resulting sequence of partially-filled tables is T 0 T 1 \u00b7 \u00b7 \u00b7 T m , where T i = FILL(T i\u22121 , l i ), and T 0 is an empty table." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-56", "text": "Different from previous work, we investigate a structural model that is optimized for the label sequence l 1 \u00b7 \u00b7 \u00b7 l m globally, rather than for each l i locally." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-57", "text": "----------------------------------" }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-58", "text": "**REPRESENTATION LEARNING**" }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-59", "text": "At the ith step, we determine the label l i of the next table slot based on the current hypothesis T i\u22121 ." 
}, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-60", "text": "Following Miwa and Bansal (2016) , we use a neural network to learn the vector representation of T i\u22121 , and then use Equation 1 to rank candidate next labels." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-61", "text": "There are two types of input features, including the word sequence w 1 w 2 \u00b7 \u00b7 \u00b7 w n , and the readily filled label sequence l 1 l 2 \u00b7 \u00b7 \u00b7 l i\u22121 ." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-62", "text": "We build a neural network to represent T i\u22121 ." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-63", "text": "----------------------------------" }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-64", "text": "**WORD REPRESENTATION**" }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-65", "text": "Shown in Figure 3 , we represent each word w i by a vector h w i using its word form, POS tag and characters." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-66", "text": "Two different forms of embeddings are used based on the word form, one being obtained by using a randomly initialized look-up table E w , 2 We remove the illegal table-filling labels during decoding for training and testing." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-67", "text": "For example, tuned during training and represented by e w , and the other being a pre-trained external word embedding from E w , which is fixed and represented by e w ." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-68", "text": "3 For a POS tag t, its embedding e t is obtained from a look-up table E t similar to E w ." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-69", "text": "The above two components have also been used by Miwa and Bansal (2016) ." 
}, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-70", "text": "We further enhance the word representation by using its character sequence Lample et al., 2016) , taking a convolution neural network (CNN) to derive a character-based word representation h char , which has been demonstrated effective for several NLP tasks (dos Santos and Gatti, 2014) ." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-71", "text": "We obtain the final h w i based on a non-linear feedforward layer on e w \u2295 e w \u2295 e t \u2295 h char , where \u2295 denotes concatenation." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-72", "text": "----------------------------------" }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-73", "text": "**LABEL REPRESENTATION**" }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-74", "text": "In addition to the word sequence, the history label sequence l 1 l 2 \u00b7 \u00b7 \u00b7 l i\u22121 , and especially the labels representing detected entities, are also useful disambiguation." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-75", "text": "For example, the previous entity boundary label can be helpful to deciding the boundary label of the current word." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-76", "text": "During relation classification, the types of the entities involved can indicate the relation category between them." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-77", "text": "We exploit the diagonal label sequence of partial table T , which denotes entity boundaries, to enhance the representation learning." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-78", "text": "A word's entity boundary label embedding e l is obtained by" }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-79", "text": "Figure 4: Segment representation." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-80", "text": "using a randomly initialized looking-up table E l ." 
}, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-81", "text": "----------------------------------" }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-82", "text": "**LSTM FEATURES**" }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-83", "text": "We follow Miwa and Bansal (2016) , learning global context representations using LSTMs." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-84", "text": "Three basic LSTM structures are used: a leftto-right word LSTM (" }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-85", "text": "and a left-to-right entity boundary label LSTM ( \u2212 \u2212\u2212\u2212 \u2192 LSTM e )." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-86", "text": "Each LSTM derives a sequence of hidden vectors for inputs." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-87", "text": "For example, for" }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-88", "text": "Different from Miwa and Bansal (2016) , who use the output hidden vectors {h i } of LSTMs to represent words, we exploit segment representations as well." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-89", "text": "In particular, for a segment of text [i, j] , the representation is computed by using LSTM-Minus (Wang and Chang, 2016) , shown by Figure 4 , where h j \u2212 h i\u22121 in a left-to-right LSTM and h i \u2212 h j+1 in a right-to-left LSTM are used to represent the segment [i, j]." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-90", "text": "The segment representations can reflect entities in a sentence, and thus can be potentially useful for both entity detection and relation extraction." 
}, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-91", "text": "----------------------------------" }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-92", "text": "**FEATURE REPRESENTATION**" }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-93", "text": "We use separate feature representations for entity detection and relation classification, both of which are extracted from the above three LSTM structures." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-94", "text": "In particular, we first extract a set of base neural features, and then concatenate them and feed them into a non-linear neural layer for entity detection and relation classification, respectively." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-95", "text": "Figure 5 shows the overall representation." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-96", "text": "[Entity Detection] Figure 5 (a) shows the feature representation for the entity detection." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-97", "text": "First, we extract six feature vectors from the three basic LSTMs, three of which are word features, namely h , K&G (2016) refers to the best parser of Kiperwasser and Goldberg (2016) , D&M (2016) refers to Dozat and Manning (2016) , and LAS (labeled attachment score) is the major evaluation metric." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-98", "text": "tity label LSTM, we only use the segment features of entity i and entity j ." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-99", "text": "----------------------------------" }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-100", "text": "**SYNTACTIC FEATURES**" }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-101", "text": "Previous work has shown that syntactic features are useful for relation extraction (Zhou et al., 2005) ." 
}, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-102", "text": "For example, the shortest dependency path has been used by several relation extraction models (Bunescu and Mooney, 2005; Miwa and Bansal, 2016) ." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-103", "text": "Here we propose a novel method to integrate syntax, without need for prior knowledge on concrete syntactic structures." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-104", "text": "In particular, we take state-of-the-art syntactic parsers that use encoder-decoder neural models (Buys and Blunsom, 2015; Kiperwasser and Goldberg, 2016) , where the encoder represents the syntactic features of the input sentences." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-105", "text": "For example, LSTM hidden states over the input word/tag sequences has been used frequently as syntactic features (Kiperwasser and Goldberg, 2016) ." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-106", "text": "Such features represent input words with syntactic information." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-107", "text": "The parser decoder also leverages partially-parsed results, such as features from partial syntactic trees, although we do not use explicit output features." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-108", "text": "Table 1 shows the encoder structures of three state-of-the-art dependency parsers." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-109", "text": "Our method is to leverage trained syntactic parsers, dumping the encoder feature representations given our inputs, using them directly as part of input embeddings in our proposed model." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-110", "text": "Denoting the dumped syntactic features on each word as h In this paper, we exploit the parser of Dozat and Manning (2016) , since it achieves the current best performance for dependency parsing." 
}, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-111", "text": "Our method can be easily generalized to other parsers, which are potentially useful for our task as well." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-112", "text": "For example, we can use a constituent parser in the same way by dumping the implicit encoder features." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-113", "text": "Our exploration of syntactic features has two main advantages over the method of Miwa and Bansal (2016) , where dependency path LSTMs are used for relation classification." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-114", "text": "On the one hand, incorrect dependency paths between entity pairs can propagate to relation classification in Miwa and Bansal (2016) , because these paths rely on explicit discrete outputs from a syntactic parser." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-115", "text": "Our method can avoid the problem since we do not compute parser outputs." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-116", "text": "On the other hand, the computation complexity is largely reduced by using our method since sequential LSTMs are based on inputs only, while the dependency path LSTMs should be computed based on the dynamic entity detection outputs." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-117", "text": "When beam search is exploited during decoding, increasing number of dependency paths can be used by a surge of entity pairs from beam outputs." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-118", "text": "Our method can be extended into neural stacking Wang et al. (2017) , by doing back-propagation training of the parser parameters during model training, which are leave for future work." 
}, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-119", "text": "----------------------------------" }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-120", "text": "**TRAINING AND SEARCH**" }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-121", "text": "----------------------------------" }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-122", "text": "**LOCAL OPTIMIZATION**" }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-123", "text": "Previous work (Miwa and Bansal, 2016; trains model parameters by modeling each step for labeling one input sentence separately." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-124", "text": "Given a partial table T , its neural representation h T is first obtained, and then compute the next label scores {l 1 , l 2 , \u00b7 \u00b7 \u00b7 , l s } using Equation 1." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-125", "text": "The output scores are regularized into a probability distribution {p l 1 , p l 2 , \u00b7 \u00b7 \u00b7 , p ls } by using a softmax layer." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-126", "text": "The training objective is to minimize the cross-entropy loss between this output distribution with the gold-standard distribution:" }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-127", "text": "where l g i is the gold-standard next label for T , and \u0398 is the set of all model parameters." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-128", "text": "We refer this training method as local optimization, because it maximizes the score of the gold-standard label at each step locally." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-129", "text": "During the decoding phase, the greedy search strategy is applied in consistence with the training." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-130", "text": "At each step, we find the highest-scored label based on the current partial table, before going on to the next step." 
}, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-131", "text": "----------------------------------" }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-132", "text": "**GLOBAL OPTIMIZATION**" }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-133", "text": "We exploit the global optimization strategy of Zhou et al. (2015) and Andor et al. (2016) , maximizing the cumulative score of the gold-standard label sequence for one sentence as a unit." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-134", "text": "Global optimization has achieved success for several NLP tasks under the neural setting (Zhou et al., 2015; Watanabe and Sumita, 2015) ." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-208", "text": "----------------------------------" }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-135", "text": "For relation extraction, global learning gives the best performances under the discrete setting (Li and Ji, 2014; Miwa and Sasaki, 2014) ." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-136", "text": "We study such models here for neural network models." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-137", "text": "Given a label sequence of l 1 l 2 \u00b7 \u00b7 \u00b7 l i , the score of T i is defined as follows:" }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-138", "text": "where score(T 0 ) = 0 and score(T i\u22121 , l i ) is computed by Equation 1." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-139", "text": "By this definition, we maximize the scores of all gold-standard partial tables." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-140", "text": "Again cross-entropy loss is used to perform model updates." 
}, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-141", "text": "At each step i, the objective function is defined by:" }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-142", "text": "where x denotes the input sentence, T g i denotes the gold-standard state at step i, and T i are all partial tables that can be reached at step i." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-143", "text": "The major challenge is to compute p T g i" }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-144", "text": ", because we cannot traverse all partial tables that are valid at step i, since their count increases exponentially by the step number." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-145", "text": "We follow Andor et al. (2016) , approximating the probability by using beam search and early-update." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-146", "text": "Shown in Algorithm 1, we use standard beam search, maintaining the B highest-scored partially-filled tables in an agenda at each step." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-147", "text": "When each action of table filling is taken, all hypotheses in the agenda are expanded by enumerating the next labels, and the B highest-scored resulting tables are used to replace the agenda for the next step." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-148", "text": "Search begins with the agenda containing an empty table, and finishes when all cells of the tables in the agenda have been filled." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-149", "text": "When the beam size is 1, the algorithm is the same as greedy decoding." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-150", "text": "When the beam size is larger than 1, however, error propagation is alleviated." 
}, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-151", "text": "For training, the same beam search algorithm is applied to training examples, and early-update (Collins and Roark, 2004 ) is used to fix search errors." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-152", "text": "----------------------------------" }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-153", "text": "**EXPERIMENTS**" }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-154", "text": "----------------------------------" }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-155", "text": "**DATA AND EVALUATION**" }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-156", "text": "We evaluate the proposed model on two datasets, namely the ACE05 data and the corpus of Roth and Yih (2004) (CONLL04) , respectively." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-157", "text": "The ACE05 dataset defines seven coarse-grained entity types and six coarse-grained relation categories, while the CONLL04 dataset defines four entity types and five relation categories." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-158", "text": "For the ACE05 dataset, we follow Li and Ji (2014) and Miwa and Bansal (2016) , splitting and preprocessing the dataset into training, development and test sets." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-159", "text": "5 For the CONLL04 dataset, we follow Miwa and Sasaki (2014) to split the data into training and test corpora, and then divide 10% of the training corpus for development." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-160", "text": "We use the micro F1-measure as the major metric to evaluate model performances, treating an entity as correct when its head region and type are both correct, 6 and regard a relation as correct when the argument entities and the relation category are all correct." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-161", "text": "We exploit pairwise t-test for measuring significance values." 
}, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-162", "text": "----------------------------------" }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-163", "text": "**PARAMETER TUNING**" }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-164", "text": "We update all model parameters by back propagation using Adam (Kingma and Ba, 2014 ) with a learning rate 10 \u22123 , using gradient clipping by a max norm 10 and l 2 -regularization by a parameter 10 \u22125 ." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-165", "text": "The dimension sizes of various vectors in neural network structure are shown in Table 2 ." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-166", "text": "All the hyper-parameters are tuned by development experiments." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-167", "text": "All experiments are conducted using gcc version 4.9.4 (Ubuntu 4.9.4-2ubuntu1 14.04.1), on an Intel(R) Xeon(R) CPU E5-2670 @ 2.60GHz." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-168", "text": "Online training is used to learn parameters, traversing over the entire training examples by 300 iterations." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-169", "text": "We select the best iteration number according to the development results." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-170", "text": "In particular, we exploit pre-training techniques (Wiseman and Rush, 2016 ) to learn better model parameters." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-171", "text": "For the local model, we follow Miwa and Bansal (2016) , training parameters only for entity detection during the first 20 iterations." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-172", "text": "For the global model, we pretrain our model using local optimization for 40 iterations, before conducting beam global optimization." 
}, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-173", "text": "----------------------------------" }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-174", "text": "**DEVELOPMENT EXPERIMENTS**" }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-175", "text": "We conduct several development experiments on the ACE05 development dataset." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-176", "text": "----------------------------------" }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-177", "text": "**FEATURE ABLATION TESTS**" }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-178", "text": "We consider the baseline system with no syntactic features using local training." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-179", "text": "Compared with Miwa and Bansal (2016) features for entity detection." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-180", "text": "Feature ablation experiments are conducted for the two types of features." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-181", "text": "Table 3 shows the experimental results, which demonstrate that the character-level features and the segment features we use are both useful for relation extraction." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-182", "text": "----------------------------------" }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-183", "text": "**LOCAL V.S. GLOBAL TRAINING**" }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-184", "text": "We study the influence of training strategies for relation extraction without using syntactic features." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-209", "text": "**FINAL RESULTS**" }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-185", "text": "For the local model, we apply scheduled sampling (Bengio et al., 2015) , which has been shown to improve the performance of relation extraction by Miwa and Bansal (2016) ." 
}, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-186", "text": "Table 4 shows the results." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-187", "text": "Scheduled sampling achieves improved F-measure scores for the local model." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-188", "text": "With the same greedy search strategy, the globally normalized model gives slightly better results than the local model with scheduled sampling." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-189", "text": "The performance of the global model increases with a larger beam size." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-190", "text": "When beam size 5 is exploited, we obtain a further gain of 1.2% on the relation F-measure, which is significantly better than our baseline local model with scheduled sampling (p \u2248 10 \u22124 )." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-191", "text": "However, the decoding speed becomes intolerably slow when the beam size increases beyond 5." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-192", "text": "Thus we exploit a beam size of 5 for global training considering both performance and efficiency." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-193", "text": "----------------------------------" }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-194", "text": "**SYNTACTIC FEATURES**" }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-195", "text": "We examine the effectiveness of the proposed implicit syntactic features." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-196", "text": "Table 5 shows the development results using both local and global optimization." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-197", "text": "The proposed features improve the relation performances significantly under both settings (p < 10 \u22124 ), demonstrating that our use of syntactic features is highly effective." 
}, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-198", "text": "We also compare our feature integration method with the traditional methods based on syntactic outputs which Miwa and Bansal (2016) and all previous methods use." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-199", "text": "We use the same parser of Dozat and Manning (2016) , building features on its dependency outputs." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-200", "text": "We exploit the bidirectional tree LSTM of to extract neural features, and then exploit a nonlinear feed-forward neural network to combine the two features." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-201", "text": "Similarly, we extract segment features but by using max pooling instead over the sequential outputs of the feed-forward layer, since the vector minus is nonsense here." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-202", "text": "The final relation results are 53.1% and 53.9% for the local and global models, respectively, which have no significantly differences compared with our models." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-203", "text": "On the other hand, our method is relatively more efficient, and flexible to the grammar formalism." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-204", "text": "Miwa and Bansal (2016) , who exploit end-to-end LSTM neural networks with local optimization, and L&J (2014) and M&S (2014) refer to Li and Ji (2014) and Miwa and Sasaki (2014) , respectively, which are both globally optimized models using discrete features, giving the top F-scores among statistical models." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-205", "text": "7 Overall, neural models give better performances than statistical models, and global optimization can give improved performances as well." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-206", "text": "Our final model achieves the best performances on both datasets." 
}, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-207", "text": "Compared with the best reported results, our model gives improvements of 1.9% on ACE05, and 6.8% on CONLL04." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-210", "text": "----------------------------------" }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-211", "text": "**ANALYSIS**" }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-212", "text": "We conduct analysis on the ACE05 test dataset in order to better understand our models, on its two major contributions, first examining the influences of global optimization, and then studying the gains by using the proposed syntactic features." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-213", "text": "Intuitively global optimization should give better accuracies at the sentence level." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-214", "text": "We verify this by examining the sentence-level accuracies, where one sentence is regarded as correct when all the labels in the resulted table are correct." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-215", "text": "Figure 6 shows the result, which is consistent with our intuition." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-216", "text": "The sentence-level accuracies of the globally normalized model are consistently better than the local model." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-217", "text": "In addition, the accuracy decreases sharply as the sentence length increases, with the local model suffering more severely from larger sentences." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-218", "text": "To understand the effectiveness of the proposed syntactic features, we examine the relation Fscores with respect to entity distances." 
}, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-219", "text": "Miwa and Bansal (2016) exploit the shortest dependency path, which can make the distance between two entities closer compared with their sequential dis-tance, thus facilitating relation extraction." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-220", "text": "We verify whether the proposed syntactic features can benefit our model similarly." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-221", "text": "As shown in Figure 7 , the F-scores of entity-pairs with large distances see apparent improvements, demonstrating that our use of syntactic features has a similar effect compared to the shortest dependency path." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-222", "text": "----------------------------------" }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-223", "text": "**RELATED WORK**" }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-224", "text": "Entity recognition (Florian et al., 2004 (Florian et al., , 2006 Ratinov and Roth, 2009; Florian et al., 2010; Kuru et al., 2016) and relation extraction (Zhao and Grishman, 2005; Jiang and Zhai, 2007; Zhou et al., 2007; Qian and Zhou, 2010; Chan and Roth, 2010; Sun et al., 2011; Plank and Moschitti, 2013; Verga et al., 2016) have received much attention in the NLP community." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-225", "text": "The dominant methods treat the two tasks separately, where relation extraction is performed assuming that entity boundaries have been given (Zelenko et al., 2003; Miwa et al., 2009; Chan and Roth, 2011; Lin et al., 2016) ." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-226", "text": "Several studies find that extracting entities and relations jointly can benefit both tasks." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-227", "text": "Early work conducts joint inference for separate models (Ji and Grishman, 2005; Yih, 2004, 2007) ." 
}, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-228", "text": "Recent work shows that joint learning and decoding with a single model brings more benefits for the two tasks (Li and Ji, 2014; Miwa and Sasaki, 2014; Miwa and Bansal, 2016; , and we follow this line of work in the study." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-229", "text": "LSTM features have been extensively exploited for NLP tasks, including tagging Lample et al., 2016) , parsing (Kiperwasser and Goldberg, 2016; Dozat and Manning, 2016) , relation classification Vu et al., 2016; Miwa and Bansal, 2016) and sentiment analysis ." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-230", "text": "Based on the output of LSTM structures, Wang and Chang (2016) introduce segment features, and apply it to dependency parsing." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-231", "text": "The same method is applied to constituent parsing by Cross and Huang (2016) ." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-232", "text": "We exploit this segmental representation for relation extraction." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-233", "text": "Global optimization and normalization has been successfully applied on many NLP tasks that involve structural prediction (Lafferty et al., 2001; Collins, 2002; McDonald et al., 2010; Zhang and Clark, 2011) , using traditional discrete features." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-234", "text": "For neural models, it has recently received increasing interests (Zhou et al., 2015; Andor et al., 2016; Xu, 2016; Wiseman and Rush, 2016) , and improved performances can be achieved with global optimization accompanied by beam search." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-235", "text": "Our work is in line with these efforts." 
}, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-236", "text": "To our knowledge, we are the first to apply globally optimized neural models for end-to-end relation extraction, achieving the best results on standard benchmarks." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-237", "text": "----------------------------------" }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-238", "text": "**CONCLUSION**" }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-239", "text": "We investigated a globally normalized end-to-end relation extraction model using neural network, based on the table-filling framework proposed by Miwa and Sasaki (2014) ." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-240", "text": "Feature representations are learned from several LSTM structures over the inputs, and a novel simple method is used to integrate syntactic information." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-241", "text": "Experiments show the effectiveness of both global normalization and syntactic features." }, { "sent_id": "580713b57ae47692af0d0c86a07fd1-C001-242", "text": "Our final model achieved the best performances on two benchmark datasets." 
} ], "y": { "@MOT@": { "gold_contexts": [ [ "580713b57ae47692af0d0c86a07fd1-C001-12", "580713b57ae47692af0d0c86a07fd1-C001-13" ], [ "580713b57ae47692af0d0c86a07fd1-C001-19", "580713b57ae47692af0d0c86a07fd1-C001-20" ], [ "580713b57ae47692af0d0c86a07fd1-C001-25", "580713b57ae47692af0d0c86a07fd1-C001-26", "580713b57ae47692af0d0c86a07fd1-C001-27", "580713b57ae47692af0d0c86a07fd1-C001-28" ] ], "cite_sentences": [ "580713b57ae47692af0d0c86a07fd1-C001-12", "580713b57ae47692af0d0c86a07fd1-C001-19", "580713b57ae47692af0d0c86a07fd1-C001-25", "580713b57ae47692af0d0c86a07fd1-C001-26" ] }, "@BACK@": { "gold_contexts": [ [ "580713b57ae47692af0d0c86a07fd1-C001-15", "580713b57ae47692af0d0c86a07fd1-C001-16" ], [ "580713b57ae47692af0d0c86a07fd1-C001-19", "580713b57ae47692af0d0c86a07fd1-C001-20" ], [ "580713b57ae47692af0d0c86a07fd1-C001-101", "580713b57ae47692af0d0c86a07fd1-C001-102" ], [ "580713b57ae47692af0d0c86a07fd1-C001-123" ], [ "580713b57ae47692af0d0c86a07fd1-C001-184", "580713b57ae47692af0d0c86a07fd1-C001-185" ], [ "580713b57ae47692af0d0c86a07fd1-C001-229" ] ], "cite_sentences": [ "580713b57ae47692af0d0c86a07fd1-C001-15", "580713b57ae47692af0d0c86a07fd1-C001-16", "580713b57ae47692af0d0c86a07fd1-C001-19", "580713b57ae47692af0d0c86a07fd1-C001-102", "580713b57ae47692af0d0c86a07fd1-C001-123", "580713b57ae47692af0d0c86a07fd1-C001-185", "580713b57ae47692af0d0c86a07fd1-C001-229" ] }, "@DIF@": { "gold_contexts": [ [ "580713b57ae47692af0d0c86a07fd1-C001-24", "580713b57ae47692af0d0c86a07fd1-C001-25" ], [ "580713b57ae47692af0d0c86a07fd1-C001-30", "580713b57ae47692af0d0c86a07fd1-C001-31" ], [ "580713b57ae47692af0d0c86a07fd1-C001-32", "580713b57ae47692af0d0c86a07fd1-C001-33" ], [ "580713b57ae47692af0d0c86a07fd1-C001-35" ], [ "580713b57ae47692af0d0c86a07fd1-C001-88" ], [ "580713b57ae47692af0d0c86a07fd1-C001-102", "580713b57ae47692af0d0c86a07fd1-C001-103" ], [ "580713b57ae47692af0d0c86a07fd1-C001-113" ], [ "580713b57ae47692af0d0c86a07fd1-C001-114", "580713b57ae47692af0d0c86a07fd1-C001-115" 
], [ "580713b57ae47692af0d0c86a07fd1-C001-204", "580713b57ae47692af0d0c86a07fd1-C001-205" ] ], "cite_sentences": [ "580713b57ae47692af0d0c86a07fd1-C001-25", "580713b57ae47692af0d0c86a07fd1-C001-31", "580713b57ae47692af0d0c86a07fd1-C001-32", "580713b57ae47692af0d0c86a07fd1-C001-35", "580713b57ae47692af0d0c86a07fd1-C001-88", "580713b57ae47692af0d0c86a07fd1-C001-102", "580713b57ae47692af0d0c86a07fd1-C001-113", "580713b57ae47692af0d0c86a07fd1-C001-114", "580713b57ae47692af0d0c86a07fd1-C001-204" ] }, "@SIM@": { "gold_contexts": [ [ "580713b57ae47692af0d0c86a07fd1-C001-42" ] ], "cite_sentences": [ "580713b57ae47692af0d0c86a07fd1-C001-42" ] }, "@USE@": { "gold_contexts": [ [ "580713b57ae47692af0d0c86a07fd1-C001-45" ], [ "580713b57ae47692af0d0c86a07fd1-C001-60" ], [ "580713b57ae47692af0d0c86a07fd1-C001-83" ], [ "580713b57ae47692af0d0c86a07fd1-C001-158" ], [ "580713b57ae47692af0d0c86a07fd1-C001-171" ], [ "580713b57ae47692af0d0c86a07fd1-C001-179" ], [ "580713b57ae47692af0d0c86a07fd1-C001-198" ], [ "580713b57ae47692af0d0c86a07fd1-C001-218", "580713b57ae47692af0d0c86a07fd1-C001-219" ], [ "580713b57ae47692af0d0c86a07fd1-C001-228" ] ], "cite_sentences": [ "580713b57ae47692af0d0c86a07fd1-C001-45", "580713b57ae47692af0d0c86a07fd1-C001-60", "580713b57ae47692af0d0c86a07fd1-C001-83", "580713b57ae47692af0d0c86a07fd1-C001-158", "580713b57ae47692af0d0c86a07fd1-C001-171", "580713b57ae47692af0d0c86a07fd1-C001-179", "580713b57ae47692af0d0c86a07fd1-C001-198", "580713b57ae47692af0d0c86a07fd1-C001-219", "580713b57ae47692af0d0c86a07fd1-C001-228" ] }, "@EXT@": { "gold_contexts": [ [ "580713b57ae47692af0d0c86a07fd1-C001-66", "580713b57ae47692af0d0c86a07fd1-C001-69", "580713b57ae47692af0d0c86a07fd1-C001-70" ] ], "cite_sentences": [ "580713b57ae47692af0d0c86a07fd1-C001-69" ] } } }, "ABC_3fd7a249a8fa7a71a4c6aa2e79fecf_5": { "x": [ { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-2", "text": "We investigate the design 
challenges of constructing effective and efficient neural sequence labeling systems, by reproducing twelve neural sequence labeling models, which include most of the state-of-the-art structures, and conducting a systematic model comparison on three benchmarks (i.e. NER, chunking, and POS tagging)." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-3", "text": "Misconceptions and inconsistent conclusions in the existing literature are examined and clarified through statistical experiments." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-4", "text": "In the comparison and analysis process, we reach several practical conclusions which can be useful to practitioners." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-5", "text": "----------------------------------" }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-7", "text": "Sequence labeling models have been used for fundamental NLP tasks such as POS tagging, chunking and named entity recognition (NER)." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-8", "text": "Traditional work uses statistical approaches such as Hidden Markov Models (HMM) and Conditional Random Fields (CRF) (Ratinov and Roth, 2009; Passos et al., 2014; Luo et al., 2015) with handcrafted features and task-specific resources." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-9", "text": "With advances in deep learning, neural models have given state-of-the-art results on many sequence labeling tasks (Ling et al., 2015; Lample et al., 2016; Ma and Hovy, 2016)." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-10", "text": "Words and characters are encoded in distributed representations (Mikolov et al., 2013), and sentence-level features are learned automatically during end-to-end training."
}, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-11", "text": "Many existing state-of-the-art neural sequence labeling models utilize word-level Long Short-Term Memory (LSTM) structures to represent global sequence information and a CRF layer to capture dependencies between neighboring labels Lample et al., 2016; Ma and Hovy, 2016; Peters et al., 2017) ." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-12", "text": "As an alternative, Convolution Neural Network (CNN) (LeCun et al., 1989) has also been used for its ability of parallel computing, leading to an efficient training and decoding process." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-13", "text": "Despite them being dominant in the research literature, reproducing published results for neural models can be challenging, even if the codes are available open source." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-14", "text": "For example, Reimers and Gurevych (2017b) conduct a large number of experiments using the code of Ma and Hovy (2016) , but cannot obtain comparable results as reported in the paper." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-15", "text": "Liu et al. (2018) report lower average F-scores on NER when reproducing the structure of Lample et al. (2016) , and on POS tagging when reproducing Ma and Hovy (2016) ." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-16", "text": "Most literature compares results with others by citing the scores directly Lample et al., 2016) without re-implementing them under the same setting, resulting in less persuasiveness on the advantage of their models." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-17", "text": "In addition, conclusions from different reports can be contradictory." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-85", "text": "Character LSTM." 
}, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-18", "text": "For example, most work observes that stochastic gradient descent (SGD) gives best performance on NER task (Chiu and Nichols, 2016; Lample et al., 2016; Ma and Hovy, 2016) , while Reimers and Gurevych (2017b) report that SGD is the worst optimizer on the same datasets." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-19", "text": "The comparison between different deep neural models is challenging due to sensitivity on experimental settings." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-20", "text": "We list six inconsistent configurations in literature, which lead to difficulties for fair comparison." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-21", "text": "\u2022 Dataset." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-22", "text": "Most work reports sequence labeling results on both CoNLL 2003 English NER (Tjong Kim Sang and De Meulder, 2003) and PTB POS (Marcus et al., 1993) datasets (Collobert et al., 2011; Ma and Hovy, 2016) ." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-23", "text": "Ling et al. (2015) give results only on POS dataset, while some papers (Chiu and Nichols, 2016; Lample et al., 2016; Strubell et al., 2017) report results on the NER dataset only." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-24", "text": "dos Santos et al. (2015) conducts experiments on NER for Portuguese and Spanish." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-25", "text": "Most work uses the development set to select hyperparameters (Lample et al., 2016; Ma and Hovy, 2016) , while others add development set into training set (Chiu and Nichols, 2016; Peters et al., 2017) ." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-26", "text": "Reimers and Gurevych (2017b) use a smaller dataset (13862 vs 14987 sentences)." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-27", "text": "Different from Ma and Hovy (2016) and Liu et al. 
(2018), choose a different data split on the POS dataset." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-28", "text": "Liu et al. (2018) and Hashimoto et al. (2017) use different development sets for chunking." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-29", "text": "\u2022 Preprocessing." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-30", "text": "A typical data preprocessing step is to normalize digit characters (Chiu and Nichols, 2016; Lample et al., 2016; Yang et al., 2016; Strubell et al., 2017)." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-31", "text": "Reimers and Gurevych (2017b) use fine-grained representations for less frequent words." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-32", "text": "Ma and Hovy (2016) do not use preprocessing." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-33", "text": "\u2022 Features." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-34", "text": "Strubell et al. (2017) and Chiu and Nichols (2016) apply word spelling features and further integrate context features." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-35", "text": "Collobert et al. (2011) and use neural features to represent external gazetteer information." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-36", "text": "Besides, Lample et al. (2016) and Ma and Hovy (2016) use end-to-end structures without handcrafted features." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-37", "text": "\u2022 Hyperparameters, including learning rate, dropout rate (Srivastava et al., 2014), number of layers, hidden size, etc., can strongly affect the model performance." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-38", "text": "Chiu and Nichols (2016) search for the hyperparameters for each task and show that the system performance is sensitive to the choice of hyperparameters."
}, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-39", "text": "However, existing models use different parameter settings, which affects the fair comparison." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-40", "text": "\u2022 Evaluation." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-110", "text": "----------------------------------" }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-41", "text": "Some literature reports results using mean and standard deviation under different random seeds (Chiu and Nichols, 2016; Peters et al., 2017; Liu et al., 2018) ." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-42", "text": "Others report the best result among different trials (Ma and Hovy, 2016) , which cannot be compared directly." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-43", "text": "\u2022 Hardware environment can also affect system accuracy." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-44", "text": "Liu et al. (2018) observes that the system gives better accuracy on NER task when trained using GPU as compared to using CPU." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-45", "text": "Besides, the running speeds are highly affected by the hardware environment." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-46", "text": "To address the above concerns, we systematically analyze neural sequence labeling models on three benchmarks: CoNLL 2003 NER (Tjong Kim Sang and De Meulder, 2003) , CoNLL 2000 chunking (Tjong Kim Sang and Buchholz, 2000) and PTB POS tagging (Marcus et al., 1993) ." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-47", "text": "Table 1 shows a summary of the models we investigate, which can be categorized under three settings: (i) character sequence representations ; (ii) word sequence representations; (iii) inference layer." 
}, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-48", "text": "Although various combinations of these three settings have been proposed in the literature, others have not been examined." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-49", "text": "We compare all models in Table 1 , which includes most state-of-the-art methods." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-50", "text": "To make fair comparisons, we build a unified framework 1 to reproduce the twelve neural sequence labeling architectures in Table 1 ." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-51", "text": "Experiments show that our framework gives comparable or better results on reproducing existing works, showing the practicability and reliability of our analysis for practitioners." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-52", "text": "The detailed comparison and analysis show that (i) character information provides a significant improvement in accuracy; (ii) word-based LSTM models outperform CNN models in most cases; (iii) CRF can improve model accuracy on NER and chunking but does not on POS tagging." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-53", "text": "Our framework is based on PyTorch with a batched implementation, which is highly efficient, facilitating quick configurations for new tasks." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-54", "text": "----------------------------------" }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-55", "text": "**RELATED WORK**" }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-56", "text": "Collobert et al. (2011) proposed a seminal neural architecture for sequence labeling." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-57", "text": "It captures word sequence information with a one-layer CNN based on pretrained word embeddings and handcrafted neural features, followed by a CRF output layer." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-58", "text": "dos Santos et al. 
(2015) extended this model by integrating character-level CNN features." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-59", "text": "Strubell et al. (2017) built a deeper dilated CNN architecture to capture larger local features." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-60", "text": "Hammerton (2003) was the first to exploit LSTM for sequence labeling." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-61", "text": "built a BiLSTM-CRF structure, which has been extended by adding character-level LSTM (Lample et al., 2016; Liu et al., 2018) , GRU (Yang et al., 2016) , and CNN (Chiu and Nichols, 2016; Ma and Hovy, 2016) features." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-62", "text": "Yang et al. (2017a) proposed a neural reranking model to improve NER models." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-63", "text": "These models achieve state-of-the-art results in the literature." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-111", "text": "**EXPERIMENTS**" }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-64", "text": "Reimers and Gurevych (2017b) compared several word-based LSTM models on several sequence labeling tasks, reporting the score distributions over multiple runs rather than a single value." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-65", "text": "They investigated the influence of various hyperparameters and configurations." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-66", "text": "Our work is similar in comparing different neural architectures under unified settings, but differs in four main aspects: 1) Their experiments are based on a BiLSTM with handcrafted word features, while our experiments are based on end-to-end neural models without human knowledge." 
}, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-67", "text": "2) Their system gives relatively low performances on standard benchmarks 2 , while ours can give comparable or better results with state-of-the-art models, rendering our observations more informative for practitioners." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-68", "text": "3) Our findings are more consistent with most previous work on configurations such as usefulness of character information (Lample et al., 2016; Ma and Hovy, 2016) , optimizer (Chiu and Nichols, 2016; Lample et al., 2016; Ma and Hovy, 2016) and tag scheme (Ratinov and Roth, 2009; Dai et al., 2015) ." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-69", "text": "In contrast, many results of Reimers and Gurevych (2017b) contradict existing reports." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-70", "text": "4) We conduct a wider range of comparison for word sequence representations, including all combinations of character CNN/LSTM and word CNN/LSTM structures, while Reimers and Gurevych (2017b) studied the word LSTM models only." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-71", "text": "----------------------------------" }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-72", "text": "**NEURAL SEQUENCE LABELING MODELS**" }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-73", "text": "Our neural sequence labeling framework contains three layers, i.e., a character sequence representation layer, a word sequence representation layer and an inference layer, as shown in Figure 1 ." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-74", "text": "Character information has been proven to be critical for sequence labeling tasks (Chiu and Nichols, 2016; Lample et al., 2016; Ma and Hovy, 2016) , with LSTM and CNN being used to model character sequence information (\"Char Rep.\")." 
}, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-75", "text": "Similarly, on the word level, LSTM or CNN structures can be leveraged to capture long-term information or local features (\"Word Rep.\"), respectively." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-76", "text": "Subsequently, the inference layer assigns labels to each word using the hidden states of word sequence representations." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-77", "text": "----------------------------------" }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-78", "text": "**CHARACTER SEQUENCE REPRESENTATIONS**" }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-79", "text": "Character features such as prefix, suffix and capitalization can be represented with embeddings through a feature-based lookup table (Collobert et al., 2011; Strubell et al., 2017) , or neural networks without human-defined features (Lample et al., 2016; Ma and Hovy, 2016) ." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-80", "text": "In this work, we focus on neural character sequence representations without hand-engineered features." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-81", "text": "Character CNN." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-82", "text": "Using a CNN structure to encode character sequences was first proposed by Santos and Zadrozny (2014), and followed by many subsequent investigations (dos Santos et al., 2015; Chiu and Nichols, 2016; Ma and Hovy, 2016) ." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-83", "text": "In our experiments, we take the same structure as Ma and Hovy (2016) , using a one-layer CNN structure with max-pooling to capture character-level representations." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-84", "text": "Figure 2 (a) shows the CNN structure for representing the word \"Mexico\"." 
}, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-86", "text": "As shown in Figure 2 (b), in order to model the global character sequence information of a word \"Mexico\", we utilize a bidirectional LSTM on the character sequence of each word and concatenate the left-to-right final state F_LSTM and the right-to-left final state B_LSTM as character sequence representations." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-87", "text": "Liu et al. (2018) applied one bidirectional LSTM for the character sequence over a sentence rather than each word individually." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-88", "text": "We examined both structures and found that they give comparable accuracies on sequence labeling tasks." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-89", "text": "We choose Lample et al. (2016) 's structure as its character LSTMs can be calculated in parallel, making the system more efficient." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-90", "text": "----------------------------------" }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-91", "text": "**WORD SEQUENCE REPRESENTATIONS**" }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-92", "text": "Similar to character sequences in words, we can model word sequence information through LSTM or CNN structures." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-93", "text": "LSTM has been widely used in sequence labeling (Lample et al., 2016; Ma and Hovy, 2016; Chiu and Nichols, 2016; Liu et al., 2018) ." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-94", "text": "CNN can be much faster than LSTM because the convolution calculation can be parallelized over the input sequence (Collobert et al., 2011; dos Santos et al., 2015; Strubell et al., 2017) ." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-95", "text": "Word CNN." 
}, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-96", "text": "Figure 3(a) shows the multi-layer CNN on the word sequence, where words are represented by embeddings." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-97", "text": "If a character sequence representation layer is used, then word embeddings and character sequence representations are concatenated for word representations." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-98", "text": "For each CNN layer, a window of size 3 slides along the sequence, extracting local features on the word inputs, followed by a ReLU function (Glorot et al., 2011) ." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-99", "text": "We follow Strubell et al. (2017) by using 4 CNN layers." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-100", "text": "Batch normalization (Ioffe and Szegedy, 2015) and dropout (Srivastava et al., 2014) are applied following each CNN layer." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-101", "text": "Word LSTM." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-102", "text": "As shown in Figure 3 (b), word representations are fed into a forward LSTM and a backward LSTM." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-103", "text": "The forward LSTM captures the word sequence information from left to right, while the backward LSTM extracts information in the reversed direction." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-104", "text": "The hidden states of the forward and backward LSTMs are concatenated at each word to give global information of the whole sequence." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-105", "text": "----------------------------------" }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-106", "text": "**INFERENCE LAYER**" }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-107", "text": "The inference layer takes the extracted word sequence representations as features and assigns labels to the word sequence." 
}, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-108", "text": "Independent local decoding with a linear layer mapping word sequence representations to the label vocabulary and performing softmax can be quite effective (Ling et al., 2015) , while for tasks with strong output label dependency, such as NER, CRF is a more appropriate choice." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-109", "text": "In this work, we examine both softmax and CRF as the inference layer on three sequence labeling tasks." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-112", "text": "We investigate the main factors influencing system accuracy, including character sequence representations, word sequence representations, inference algorithm, pretrained embeddings, tag scheme, running environment and optimizer, analyzing system performance from the perspective of decoding speed and accuracies on in-vocabulary (IV) and out-of-vocabulary (OOV) entities/chunks/words." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-113", "text": "----------------------------------" }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-114", "text": "**SETTINGS**" }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-115", "text": "Data." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-116", "text": "The NER dataset follows the standard split of Tjong Kim Sang and De Meulder (2003). For POS tagging, following previous work (Toutanova et al., 2003; Santos and Zadrozny, 2014; Ma and Hovy, 2016; Liu et al., 2018) , we adopt the standard splits by using sections 0-18 as training set, sections 19-21 as development set and sections 22-24 as test set." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-117", "text": "No preprocessing is performed on either dataset except for normalizing digits." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-118", "text": "The dataset statistics are listed in Table 2 ." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-119", "text": "Hyperparameters." 
}, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-120", "text": "Table 3 shows the hyperparameters used in our experiments, which mostly follow Ma and Hovy (2016) , including the learning rate \u03b7 = 0.015 for word LSTM models." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-121", "text": "For word CNN based models, a large \u03b7 leads to convergence problems." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-122", "text": "We take \u03b7 = 0.005 with more epochs (200) instead." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-123", "text": "GloVe 100-dimension (Pennington et al., 2014) is used to initialize word embeddings, and character embeddings are randomly initialized." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-124", "text": "We use mini-batch stochastic gradient descent (SGD) with a decayed learning rate to update parameters." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-125", "text": "For NER and chunking, we use the BIOES tag scheme." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-126", "text": "Evaluation." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-127", "text": "Standard precision (P), recall (R) and F1-score (F) are used as the evaluation metrics for NER and chunking; token accuracy is used to evaluate the performance of the POS tagger." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-128", "text": "Development datasets are used to select the optimal model among all epochs, and we report scores of the selected model on the test dataset." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-129", "text": "To reduce the volatility of the system, we conduct each experiment 5 times under different random seeds, and report the max, mean, and standard deviation for each model." 
}, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-130", "text": "----------------------------------" }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-131", "text": "**RESULTS**" }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-132", "text": "Tables 4, 5 and 6 show the results of the twelve models on NER, chunking and POS datasets, respectively." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-133", "text": "Existing work has also been listed in the tables for comparison." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-134", "text": "To simplify the description, we use \"CLSTM\" and \"CCNN\" to represent character LSTM and character CNN encoder, respectively." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-135", "text": "Similarly, \"WLSTM\" and \"WCNN\" represent word LSTM and word CNN structure, respectively." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-136", "text": "As shown in Table 4 , most NER work focuses on WLSTM+CRF structures with different character sequence representations." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-137", "text": "We re-implement the structure of several reports (Chiu and Nichols, 2016; Ma and Hovy, 2016; Peters et al., 2017) , which take the CCNN+WLSTM+CRF architecture." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-138", "text": "Our reproduced models give slightly better performances." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-139", "text": "The results of Lample et al. (2016) can be reproduced by our CLSTM+WLSTM+CRF." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-140", "text": "In most cases, our \"Nochar\" based models underperform their corresponding prototypes (Strubell et al., 2017) , which utilize hand-crafted features." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-141", "text": "Table 5 shows the results of the chunking task." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-142", "text": "Peters et al. 
(2017) give the best reported single model results (95.00\u00b10.08), and our CLSTM+WLSTM+CRF gives a comparable performance (94.93\u00b10.05)." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-143", "text": "We re-implement Zhai et al. (2017) 's model in our Nochar+WLSTM but cannot reproduce their results; this may be because they use grid search for hyperparameter selection." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-144", "text": "Our Nochar+WCNN+CRF can give comparable results with Collobert et al. (2011) , even though ours does not include character information." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-145", "text": "The results of the POS tagging task are shown in Table 6 ." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-146", "text": "The results of Lample et al. (2016) , Ma and Hovy (2016) and Yang et al. (2017b) can be reproduced by our CLSTM+WLSTM+CRF and CCNN+WLSTM+CRF models." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-147", "text": "Our WLSTM based models give better results than WLSTM+CRF based models, which is consistent with the fact that Ling et al. (2015) take CLSTM+WLSTM without a CRF layer but achieve the best POS accuracy." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-148", "text": "Santos and Zadrozny (2014) build a pure CNN structure on both character and word level, which can be reproduced by our CCNN+WCNN models." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-149", "text": "Based on the above observations, most results in the literature are reproducible." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-150", "text": "Our implementations can achieve comparable or better results with state-of-the-art models." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-151", "text": "We do not fine-tune any hyperparameter to fit the specific task." 
}, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-152", "text": "Results in Tables 4 , 5 and 6 are all under the same hyperparameters, which demonstrates the generalization ability of our framework." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-153", "text": "----------------------------------" }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-154", "text": "**NETWORK SETTINGS**" }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-155", "text": "Character LSTM vs Character CNN." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-156", "text": "Unlike the observations of Reimers and Gurevych (2017b) , in our experiments, character information can significantly (p < 0.01) 3 improve sequence labeling models (by comparing the row of Nochar with CLSTM or CCNN in Tables 4 , 5 and 6), while the difference between CLSTM and CCNN is not significant." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-157", "text": "In most cases, CLSTM and CCNN can give comparable results under different frameworks and different tasks." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-158", "text": "CCNN gives the best NER result under the WLSTM+CRF framework, while CLSTM gets better NER results in all other configurations." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-159", "text": "For chunking and POS tagging, CLSTM consistently outperforms CCNN under all settings, while the difference is statistically insignificant (p > 0.2)." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-186", "text": "Optimizer." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-160", "text": "We conclude that the difference between CLSTM and CCNN is small, which is consistent with the observation of Reimers and Gurevych (2017b) ." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-161", "text": "Word LSTM vs Word CNN." 
}, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-162", "text": "By comparing the performances of WLSTM+CRF and WLSTM with those of WCNN+CRF and WCNN on the three benchmarks, we conclude that word-based LSTM models are significantly (p < 0.01) better than word-based CNN models in most cases." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-163", "text": "This demonstrates that global word context information is necessary for sequence labeling." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-164", "text": "Softmax vs CRF." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-165", "text": "Models with a CRF inference layer consistently outperform models with a softmax layer under all configurations on NER and chunking tasks, proving the effectiveness of label dependency information." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-166", "text": "For POS tagging, the local softmax based models give slightly better accuracies, although the difference is insignificant (p > 0.2)." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-167", "text": "----------------------------------" }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-168", "text": "**EXTERNAL FACTORS**" }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-169", "text": "In addition to model structures, external factors such as pretrained embeddings, tag scheme, and optimizer can significantly influence system performance." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-170", "text": "We investigate a set of external factors on the NER dataset with the two best models: CLSTM+WLSTM+CRF and CCNN+WLSTM+CRF." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-171", "text": "Pretrained embedding." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-172", "text": "Figure 4 (a) shows the F1-scores of the two best models on the NER test set with two different pretrained embeddings, as well as the random initialization." 
}, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-173", "text": "Compared with the random initialization, models using pretrained embeddings give significant improvements (p < 0.01)." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-174", "text": "The GloVe 100-dimension embeddings give higher F1-scores than SENNA (Collobert et al., 2011) on both models, which is consistent with the observation of Ma and Hovy (2016) ." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-175", "text": "Tag scheme." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-176", "text": "We examine two different tag schemes: BIO and BIOES (Ratinov and Roth, 2009) ." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-177", "text": "The results are shown in Figure 4 (b)." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-178", "text": "In our experiments, models using BIOES are significantly (p < 0.05) better than BIO." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-179", "text": "Our observation is consistent with most literature (Ratinov and Roth, 2009; Dai et al., 2015) ." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-180", "text": "Reimers and Gurevych (2017b) report that the difference between the schemes is insignificant." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-181", "text": "Running environment." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-182", "text": "Liu et al. (2018) observe that neural sequence labeling models can give better results on GPU rather than CPU." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-183", "text": "We conduct repeated experiments on both GPU and CPU environments." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-184", "text": "The results are shown in Figure 4 (b)." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-185", "text": "Models run on CPU give a lower mean F1-score than models run on GPU, while the difference is insignificant (p > 0.2)." 
}, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-187", "text": "We compare different optimizers including SGD, Adagrad (Duchi et al., 2011) , Adadelta (Zeiler, 2012) , RMSProp (Tieleman and Hinton, 2012) and Adam (Kingma and Ba, 2014) ." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-188", "text": "The results are shown in Figure 5 5 ." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-189", "text": "In contrast to Reimers and Gurevych (2017b) , who reported that SGD is the worst optimizer, our results show that SGD outperforms all other optimizers significantly (p < 0.01), with a slower convergence process during training." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-190", "text": "Our observation is consistent with most literature (Chiu and Nichols, 2016; Lample et al., 2016; Ma and Hovy, 2016) ." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-191", "text": "----------------------------------" }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-192", "text": "**ANALYSIS**" }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-193", "text": "Decoding speed." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-194", "text": "We test the decoding speeds of the twelve models on the NER dataset using a Nvidia GTX 1080 GPU." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-195", "text": "Figure 6 shows the decoding times on 10000 NER sentences." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-196", "text": "The CRF inference layer severely limits the decoding speed due to the left-to-right inference process, which prevents parallel decoding." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-197", "text": "Character LSTM significantly slows down the system." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-198", "text": "Compared with models without character information, adding character CNN representations does not affect the decoding speed too much but can give significant accuracy improvements (shown in Section 4.3)." 
}, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-199", "text": "Due to the support of parallel computing, word-based CNN models are consistently faster than word-based LSTM models, with close accuracies, leading to large utilization potential in practice." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-200", "text": "Table 7 : Results for OOV analysis." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-201", "text": "OOV error." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-202", "text": "We conduct error analysis on in-vocabulary and out-of-vocabulary words with the CRF based models 6 ." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-203", "text": "Following Ma and Hovy (2016) , words in the test set are divided into four subsets: in-vocabulary words, out-of-training-vocabulary words (OOTV), out-of-embedding-vocabulary words (OOEV) and out-of-both-vocabulary words (OOBV)." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-204", "text": "For NER and chunking, we consider entities or chunks rather than words." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-205", "text": "The OOV entities and chunks are categorized following Ma and Hovy (2016) ." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-206", "text": "Table 7 shows the performances of different OOV splits on three benchmarks." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-207", "text": "The top three rows list the performances of word-based LSTM CRF models, followed by the word-based CNN CRF models." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-208", "text": "The OOEV results on NER remain 100% because there exist only 8 OOEV entities and all are recognized correctly." 
}, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-209", "text": "It is obvious that character LSTM or CNN representations improve OOTV and OOBV the most on both WLSTM+CRF and WCNN+CRF models across all three datasets, proving that the main contribution of neural character sequence representations is to disambiguate OOV words." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-210", "text": "Models with character LSTM representations give the best IV scores across all configurations, which may be because character LSTM can be well trained on IV data, bringing useful global character sequence information." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-211", "text": "On the OOVs, character LSTM and CNN give comparable results." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-212", "text": "----------------------------------" }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-213", "text": "**CONCLUSION**" }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-214", "text": "We built a unified neural sequence labeling framework to reproduce and compare recent state-of-the-art models with different configurations." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-215", "text": "We explored three neural model design decisions: character sequence representations, word sequence representations, and inference layer." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-216", "text": "Experiments show that character information helps to improve model performances, especially on disambiguating OOV words." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-217", "text": "Character-level LSTM and CNN structures give comparable improvements, with the latter being more efficient." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-218", "text": "In most cases, models with word-level LSTM encoders outperform those with CNN, at the expense of longer decoding time." 
}, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-219", "text": "We observed that the CRF inference algorithm is effective on NER and chunking tasks, but does not have the advantage on POS tagging." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-220", "text": "With controlled experiments on the NER dataset, we showed that BIOES tags are better than BIO." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-221", "text": "Besides, pretrained GloVe 100d embedding and SGD optimizer give significantly better performances compared to their competitors." }, { "sent_id": "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-222", "text": "6 We choose the models that give the median performance on the test set for conducting result analysis." } ], "y": { "@BACK@": { "gold_contexts": [ [ "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-9" ], [ "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-11" ], [ "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-15", "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-16" ], [ "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-18" ], [ "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-23" ], [ "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-25" ], [ "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-30" ], [ "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-36" ], [ "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-73", "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-74" ], [ "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-79" ], [ "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-92", "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-93" ] ], "cite_sentences": [ "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-9", "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-11", "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-15", "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-16", "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-18", "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-23", "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-25", "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-30", "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-36", "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-74", "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-79", "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-93" ] }, "@EXT@": { "gold_contexts": [ [ 
"3fd7a249a8fa7a71a4c6aa2e79fecf-C001-61" ] ], "cite_sentences": [ "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-61" ] }, "@SIM@": { "gold_contexts": [ [ "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-68" ], [ "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-92", "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-93" ], [ "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-139" ], [ "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-146" ], [ "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-190" ] ], "cite_sentences": [ "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-68", "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-93", "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-139", "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-146", "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-190" ] }, "@MOT@": { "gold_contexts": [ [ "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-73", "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-74" ] ], "cite_sentences": [ "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-74" ] }, "@USE@": { "gold_contexts": [ [ "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-88", "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-89" ] ], "cite_sentences": [ "3fd7a249a8fa7a71a4c6aa2e79fecf-C001-89" ] } } }, "ABC_13091dd4d06e11957a5cb7785b92d4_5": { "x": [ { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-2", "text": "We consider the problem of making machine translation more robust to character-level variation at the source side, such as typos." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-3", "text": "Existing methods achieve greater coverage by applying subword models such as byte-pair encoding (BPE) and character-level encoders, but these methods are highly sensitive to spelling mistakes." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-4", "text": "We show how training on a mild amount of random synthetic noise can dramatically improve robustness to these variations, without diminishing performance on clean text." 
}, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-5", "text": "We focus on translation performance on natural noise, as captured by frequent corrections in Wikipedia edit logs, and show that robustness to such noise can be achieved using a balanced diet of simple synthetic noises at training time, without access to the natural noise data or distribution." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-6", "text": "----------------------------------" }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-8", "text": "Machine translation systems are generally trained on clean data, without spelling errors." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-9", "text": "Yet machine translation may be used in settings in which robustness to such errors is critical: for example, social media text in which there is little emphasis on standard spelling (Michel and Neubig, 2018) , and interactive settings in which users must enter text on a mobile device." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-10", "text": "Systems that are trained on clean data generally perform poorly when faced with such errors at test time (Heigold et al., 2017; Belinkov and Bisk, 2018) ." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-11", "text": "One potential solution is to introduce noise at training time, an approach that is similar in spirit to the use of adversarial examples in other areas of machine learning (Goodfellow et al., 2014) and natural language processing (Ebrahimi et al., 2018) ." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-12", "text": "So far, using synthetic noise at training time has been found to only improve performance on test data with exactly the same kind of synthetic noise, while at the same time impairing performance on clean test data (Heigold et al., 2017; Belinkov and Bisk, 2018) ." 
}, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-13", "text": "We desire methods that yield good performance on both clean text as well as naturally-occurring noise, but this is beyond the reach of current techniques." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-14", "text": "Drawing inspiration from dropout (Srivastava et al., 2014) and noise-based regularization methods, we explore the space of random noising methods at training time, and evaluate performance on both clean text and text corrupted by natural noise based on real spelling mistakes on Wikipedia (Max and Wisniewski, 2010 )." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-15", "text": "We find that by feeding our translation models a balanced diet of several types of synthetic noise at training time -random character deletions, insertions, substitutions, and swapsit is possible to obtain substantial improvements on such naturally noisy data, with minimal impact on the performance on clean data, and without accessing the test noise data or even its distribution." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-16", "text": "We demonstrate that our method substantially improves the robustness of a transformer-based machine translation model with CNN character encoders to spelling errors across multiple input languages (German, French, and Czech) ." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-17", "text": "Of the different noise types we use at training, we find that random character deletions are particularly useful, followed by character insertions." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-18", "text": "However, noisy training does not seem to improve translations of social media text, as indicated by performance on the recently-introduced MTNT dataset of Reddit posts (Michel and Neubig, 2018) ." 
}, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-19", "text": "This finding aligns with previous work arguing that the distinctive feature of social media text is not noise or orthographical errors, but rather, variation in writing style and vocabulary (Eisenstein, 2013) ." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-20", "text": "----------------------------------" }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-21", "text": "**DELETION**" }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-22", "text": "A character is deleted." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-23", "text": "whale \u2192 whle Insertion A character is inserted into a random position." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-24", "text": "whale \u2192 wxhale Substitution A character is replaced with a random character." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-25", "text": "whale \u2192 whalz Swap Two adjacent characters change position." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-26", "text": "whale \u2192 wahle Table 1 : The synthetic noise types applied during training." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-27", "text": "Noise is applied on a random character, selected from a uniform distribution." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-28", "text": "The right column illustrates the application of each noise type on the word \"whale\"." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-29", "text": "----------------------------------" }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-30", "text": "**NOISE MODELS**" }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-31", "text": "We focus on orthographical noise, which is noise at the character level, affecting the spelling of individual terms." 
}, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-32", "text": "Orthographical noise is obviously problematic for machine translation systems that operate on token-level embeddings, because noised terms are likely to be out-ofvocabulary, even when pre-segmented into subwords using techniques such as byte-pair encoding (Sennrich et al., 2015) ." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-33", "text": "A more subtle issue is that orthographical noise can also pose problems for character-level encoding models." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-34", "text": "Typical character-level encoders are based on models such as convolutional neural networks (Kim et al., 2016) , which learn to match filters against specific character n-grams." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-35", "text": "When these n-grams are disrupted by orthographical noise, the resulting encoding may be radically different from the encoding of a \"clean\" version of the same text." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-36", "text": "Belinkov and Bisk (2018) report significant degradations in performance after applying noise to only a small fraction of input tokens." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-80", "text": "We first add only one type of synthetic noise at 10% (i.e. 90% of the training data is clean), and measure performance." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-37", "text": "1 Table 1 describes the four types of synthetic orthographic noise we used during training." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-38", "text": "Substitutions and swaps were experimented with extensively in previous work (Heigold et al., 2017; Belinkov and Bisk, 2018) , but deletion and insertion were not." 
}, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-39", "text": "Deletion and insertion pose a different challenge to character encoders, since they alter the distances between character sequences in the word, as well as its overall length." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-40", "text": "In Section 3.2, we show that they are indeed the primary contributors in improving our model's robustness to natural noise." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-41", "text": "During training, we used a balanced diet of all four noise types by sampling the noise, for each to-ken, from a multinomial distribution of 60% clean (no noise) and 10% probability for each type of noise." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-42", "text": "The noise was added dynamically, allowing for different mutations of the same example over different epochs." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-43", "text": "2" }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-44", "text": "----------------------------------" }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-45", "text": "**SYNTHETIC NOISE**" }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-46", "text": "----------------------------------" }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-47", "text": "**NATURAL NOISE**" }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-48", "text": "In addition to these forms of synthetic noise, we evaluate our models on natural noise, which is obtained by mining edit histories of Wikipedia in search of local error corrections (for French and German) (Max and Wisniewski, 2010; Zesch, 2012) and manually-corrected essays (for Czech) (\u0160ebesta et al., 2017) ." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-49", "text": "Specifically, these authors have obtained a set of likely spelling error pairs, each involving a clean spelling and a candidate error." 
}, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-50", "text": "These errors are then injected into the source language." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-51", "text": "When there are multiple error forms for a single term, an error is selected randomly (uniformly)." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-52", "text": "Not all words have error candidates, and so even when noise is applied at the maximum rate, only 20-50% of all tokens are noised, depending on the language." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-53", "text": "Natural noise is more representative of what might actually be encountered by a deployed machine translation system, so we reserve it for test data." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-54", "text": "While it is also possible, in theory, to use natural noise for training as well, it is not always realistic." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-55", "text": "For one, significant engineering effort is required to obtain such noise examples, making it difficult to build naturally-noised training sets for any source language." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-56", "text": "Secondly, the kind of spelling mistakes may vary greatly across different demographics and periods, making it difficult to anticipate the exact distribution of noise one might encounter at test time." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-57", "text": "Table 2 : The performance of a machine translation model on the IWSLT 2016 task when different amounts of natural orthographical errors are introduced." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-58", "text": "Noise Probability is the probability of a test-data token to be injected with natural noise, while the Noised Tokens column reflects the number of tokens that were actually distorted in practice; not every word in the vocabulary has a corresponding misspelling." 
}, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-59", "text": "Training on synthetic noise significantly increases robustness in scenarios where large amounts of spelling mistakes are encountered." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-60", "text": "(Junczys-Dowmunt and Birch, 2016) ." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-61", "text": "We translated from three source languages (German, French, Czech) to English." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-62", "text": "Synthetic noise was only added to the training data, and natural noise was only added to the test data; the validation data remained untouched." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-63", "text": "Model We used a transformer-based machine translation model (Vaswani et al., 2017) with the CNN-based character encoder of Kim et al. (2016) on the source encoder." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-64", "text": "The model was implemented in Fairseq." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-65", "text": "3 Hyperparameters We followed the base configuration of the transformer (Vaswani et al., 2017) , with 6 encoder and decoder layers of 512 model dimension, 2048 hidden dimensions, and 8 attention heads per layer." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-66", "text": "Our character embeddings had 256 dimensions and the character CNN's filters followed the specifications of Kim et al. (2016) ." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-67", "text": "We optimized the model with Adam and used the inverse square-root learning rate schedule typically used for transformers, but with a peek learning rate of 0.001." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-68", "text": "Each batch contained a maximum of 8,000 tokens." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-69", "text": "We used a dropout rate of 0.2." 
}, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-70", "text": "We used beam search for generating the translations (5 beams), and computed BLEU scores to measure performance on the test 3 https://github.com/pytorch/fairseq set." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-71", "text": "Table 2 shows the performance of the model on data with varying amounts of natural orthographical errors (see Section 2.2)." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-72", "text": "As observed in prior art (Heigold et al., 2017; Belinkov and Bisk, 2018) , when there are significant amounts of natural noise, the model's performance drops significantly." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-73", "text": "However, training on our synthetic noise cocktail greatly improves performance, regaining between 20% (Czech) and 50% (German) of the BLEU score that was lost to natural noise." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-74", "text": "Moreover, the negative effects of training on synthetic noise seem to be limited to both negative and positive fluctuations that are smaller than 1 BLEU point." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-75", "text": "----------------------------------" }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-76", "text": "**RESULTS**" }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-77", "text": "----------------------------------" }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-78", "text": "**ABLATION ANALYSIS**" }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-79", "text": "To determine the individual contribution of each type of synthetic noise, we conduct an ablation study." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-101", "text": "To achieve robustness to orthographical errors, we require noise that operates on the sequence of characters." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-102", "text": "Heigold et al. 
(2017) demonstrated that synthetic noising operations such as random swaps and replacements can significantly degrade performance when inserted at test time; they also show that some robustness can be obtained by inserting the same synthetic noise at training time." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-103", "text": "Similarly, the impact of speech-like noise is explored by Sperber et al. (2017) ." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-104", "text": "Most relevant for us is the work of Belinkov and Bisk (2018) , who evaluated on natural noise obtained from Wikipedia edit histories (e.g., Max and Wisniewski, 2010) ." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-105", "text": "They find that robustness to natural noise can be obtained by training on the same noise model, but that (a) training on synthetic noise does not yield robustness to natural noise at test time, and (b) training on natural noise significantly impairs performance on clean text." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-106", "text": "In contrast, we show that training on the right kind and the right amount of synthetic noise can yield substantial improvements on natural noise at test time, without significantly impairing performance on clean data." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-107", "text": "Our ablation results suggest that deletion and insertion noise -which were not included by Belinkov and Bisk -are essential to achieving robustness to natural noise." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-108", "text": "An alternative approach is to build robustness to character permutations directly into the design of the character-level encoder." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-109", "text": "Belinkov and Bisk (2018) experiment with a bag of characters, while Sakaguchi et al. (2017) use character RNNs combined with special representations for the first and last characters of each token." 
}, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-110", "text": "These models are particularly suited for specific types of swapping and scrambling noises, but are not robust to natural noise." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-111", "text": "We also conducted preliminary experiments with similar noise-invariant models, but found that training a CNN with synthetic noise to work better." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-112", "text": "----------------------------------" }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-113", "text": "**CONCLUSION**" }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-114", "text": "In this work we take a step towards addressing the challenge of making machine translation robust to character-level noise." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-115", "text": "We show how training on synthetic character-level noise, similar in spirit to dropout, can significantly improve a translation model's robustness to natural spelling mistakes." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-116", "text": "In particular, we find that deleting and inserting random characters play a key role in preparing the model for test-time orthographic variations." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-117", "text": "While our method works well on typos, it does not appear to generalize to non-standard text in social media." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-118", "text": "We conjecture that spelling mistakes constitute a relatively small part of the deviations from standard text, and that the main challenges in this domain stem from other linguistic phenomena." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-81", "text": "We then take the full set of noise types, and remove a single type at each time to see how important it is given the other noises." 
}, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-82", "text": "Table 3 shows the model's performance on the German-to-English dataset when training with various mixtures of noise." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-83", "text": "We find that deletion is by far the most effective synthetic noise in preparing our model for natural orthographical errors, followed by insertion." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-84", "text": "The French and Czech datasets exhibit the same trend." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-85", "text": "We conjecture that the importance of deletion and insertion is that they distort the typical distances between characters, requiring the CNN character encoder to become more invariant to unexpected character movements." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-86", "text": "The fact that we use deletion and insertion also explains why our model was able to regain a significant portion of its original performance when confronted with natural noise at test time, while previous work that trained only on substitutions and swaps was not able to do so (Belinkov and Bisk, 2018) ." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-87", "text": "----------------------------------" }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-88", "text": "**TRANSLATING SOCIAL MEDIA TEXT**" }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-89", "text": "We also apply our synthetic noise training procedure to translation of social media, using the recently-released MTNT dataset of Reddit posts (Michel and Neubig, 2018) , focusing on the English-French translation pair." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-90", "text": "Note that no noise was inserted into the test data in this case; the only source of noise is the non-standard spellings inherent to the dataset." 
}, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-91", "text": "As shown in Table 4 , noised training has minimal impact on performance." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-92", "text": "We did not exhaustively explore the space of possible noising strategies, and so these negative results should be taken only as a preliminary finding." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-93", "text": "Nonetheless, there are reasons to believe that synthetic training noise may not help in this case." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-94", "text": "Michel and Neubig (2018) note that the rate of spelling errors, as reported by a spell check system, is not especially high in MTNT; other differences from standard corpora include the use of entirely new words and names, terms from other languages (especially English), grammar differences, and paralinguistic phenomena such as emoticons." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-95", "text": "These findings align with prior work showing that social media does not feature high rates of misspellings (Rello and Baeza-Yates, 2012) ." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-96", "text": "Furthermore, many of the spelling variants in MTNT have very high edit distance (e.g., catholique \u2192 catho [Fr] )." }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-97", "text": "It is unlikely that training with mild synthetic noise would yield robustness to these variants, which reflect well-understood stylistic patterns rather than random variation at the character level." 
}, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-98", "text": "----------------------------------" }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-99", "text": "**RELATED WORK**" }, { "sent_id": "13091dd4d06e11957a5cb7785b92d4-C001-100", "text": "The use of noise to improve robustness in machine learning has a long history (e.g., Holmstrom and Koistinen, 1992; Wager et al., 2013) , with early work by Bishop (1995) demonstrating a connection between additive noise and regularization." } ], "y": { "@BACK@": { "gold_contexts": [ [ "13091dd4d06e11957a5cb7785b92d4-C001-10" ], [ "13091dd4d06e11957a5cb7785b92d4-C001-36" ], [ "13091dd4d06e11957a5cb7785b92d4-C001-38", "13091dd4d06e11957a5cb7785b92d4-C001-39" ], [ "13091dd4d06e11957a5cb7785b92d4-C001-71", "13091dd4d06e11957a5cb7785b92d4-C001-72" ], [ "13091dd4d06e11957a5cb7785b92d4-C001-104", "13091dd4d06e11957a5cb7785b92d4-C001-105" ], [ "13091dd4d06e11957a5cb7785b92d4-C001-109" ] ], "cite_sentences": [ "13091dd4d06e11957a5cb7785b92d4-C001-10", "13091dd4d06e11957a5cb7785b92d4-C001-36", "13091dd4d06e11957a5cb7785b92d4-C001-38", "13091dd4d06e11957a5cb7785b92d4-C001-72", "13091dd4d06e11957a5cb7785b92d4-C001-104", "13091dd4d06e11957a5cb7785b92d4-C001-105", "13091dd4d06e11957a5cb7785b92d4-C001-109" ] }, "@MOT@": { "gold_contexts": [ [ "13091dd4d06e11957a5cb7785b92d4-C001-10", "13091dd4d06e11957a5cb7785b92d4-C001-11", "13091dd4d06e11957a5cb7785b92d4-C001-12", "13091dd4d06e11957a5cb7785b92d4-C001-13" ], [ "13091dd4d06e11957a5cb7785b92d4-C001-38", "13091dd4d06e11957a5cb7785b92d4-C001-39" ], [ "13091dd4d06e11957a5cb7785b92d4-C001-109", "13091dd4d06e11957a5cb7785b92d4-C001-110" ] ], "cite_sentences": [ "13091dd4d06e11957a5cb7785b92d4-C001-10", "13091dd4d06e11957a5cb7785b92d4-C001-12", "13091dd4d06e11957a5cb7785b92d4-C001-38", "13091dd4d06e11957a5cb7785b92d4-C001-109", "13091dd4d06e11957a5cb7785b92d4-C001-110" ] }, "@DIF@": { "gold_contexts": [ [ "13091dd4d06e11957a5cb7785b92d4-C001-72", 
"13091dd4d06e11957a5cb7785b92d4-C001-73" ], [ "13091dd4d06e11957a5cb7785b92d4-C001-86" ], [ "13091dd4d06e11957a5cb7785b92d4-C001-104", "13091dd4d06e11957a5cb7785b92d4-C001-105", "13091dd4d06e11957a5cb7785b92d4-C001-106", "13091dd4d06e11957a5cb7785b92d4-C001-107" ] ], "cite_sentences": [ "13091dd4d06e11957a5cb7785b92d4-C001-72", "13091dd4d06e11957a5cb7785b92d4-C001-86", "13091dd4d06e11957a5cb7785b92d4-C001-104", "13091dd4d06e11957a5cb7785b92d4-C001-105", "13091dd4d06e11957a5cb7785b92d4-C001-107" ] }, "@EXT@": { "gold_contexts": [ [ "13091dd4d06e11957a5cb7785b92d4-C001-86" ], [ "13091dd4d06e11957a5cb7785b92d4-C001-104", "13091dd4d06e11957a5cb7785b92d4-C001-105", "13091dd4d06e11957a5cb7785b92d4-C001-106", "13091dd4d06e11957a5cb7785b92d4-C001-107" ] ], "cite_sentences": [ "13091dd4d06e11957a5cb7785b92d4-C001-86", "13091dd4d06e11957a5cb7785b92d4-C001-104", "13091dd4d06e11957a5cb7785b92d4-C001-105", "13091dd4d06e11957a5cb7785b92d4-C001-107" ] } } }, "ABC_d0b4d9566f16915cb5a5244f351e61_5": { "x": [ { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-2", "text": "We present a simple method to learn continuous representations of dependency substructures (links), with the motivation of directly working with higher-order, structured embeddings and their hidden relationships, and also to avoid the millions of sparse, template-based word-cluster features in dependency parsing." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-3", "text": "These link embeddings allow a significantly smaller and simpler set of unary features for dependency parsing, while maintaining improvements similar to state-of-the-art, n-ary word-cluster features, and also stacking over them." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-60", "text": "All the final test improvements, i.e., Bucket (92.3) and Bit-string (92.6) w.r.t." 
}, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-4", "text": "Moreover, these link vectors (made publicly available) are directly portable as offthe-shelf, dense, syntactic features in various NLP tasks." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-5", "text": "As one example, we incorporate them into constituent parse reranking, where their small feature set again matches the performance of standard non-local, manuallydefined features, and also stacks over them." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-6", "text": "----------------------------------" }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-8", "text": "Word representations and more recently, word embeddings, learned from large amounts of text have been quite successful as features in various NLP tasks (Koo et al., 2008; Turian et al., 2010; Collobert et al., 2011; Dhillon et al., 2012; Al-Rfou' et al., 2013; Bansal et al., 2014; Guo et al., 2014; Pennington et al., 2014; Yu and Dredze, 2014; Faruqui et al., 2014; Wang et al., 2015) ." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-9", "text": "While these word representations do capture useful, dense relationships among known and unknown words, one still has to work with sparse conjunctions of features on the multiple words involved in the substructure that a task factors on, e.g., head-argument links in dependency parsing." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-10", "text": "Therefore, most statistical dependency parsers still suffer from millions of such conjoined, template-based, n-ary features on word clusters or embeddings (Koo et al., 2008; Bansal et al., 2014) ." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-11", "text": "Some recent work has addressed this issue, via low-rank tensor mappings (Lei et al., 2014) , feature embeddings , or neural network parsers (Chen and Manning, 2014) ." 
}, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-12", "text": "Secondly, it would also be useful to learn dense representations directly for the higher-order substructures (that structured NLP tasks factor on) so as to explicitly capture the useful, hidden relationships among these substructures, instead of relying on the sparse word-conjoined relationships." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-13", "text": "In this work, we propose to address both these issues by learning simple dependency link embeddings on 'head-argument' pairs (as a single concatenated unit), which allows us to work directly with linguistically-intuitive, higher-order substructures, and also fire significantly fewer and simpler features in dependency parsing, as opposed to word cluster and embedding features in previous work (Koo et al., 2008; Bansal et al., 2014) , while still maintaining their strong accuracies." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-14", "text": "Trained using appropriate dependency-based context in word2vec, the fast neural language model of Mikolov et al. (2013a) , these link vectors allow a substantially smaller set of unary link features (as opposed to n-ary, conjoined features) which provide savings in parsing time and memory." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-15", "text": "Moreover, unlike conjoined features, link embeddings allow a tractable set of accurate per-dimension features, making the feature set even smaller and the featuregeneration process orders of magnitude faster (than hierarchical clustering features)." 
}, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-16", "text": "At the same time, these link embedding features maintain dependency parsing improvements similar to the complex, template-based features on word clusters and embeddings by previous work (Koo et al., 2008; Bansal et al., 2014 ) (up to 9% relative error reduction), and also stack statistically significantly over them (up to an additional 5% relative error reduction)." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-17", "text": "Another advantage of this approach (versus previous work on feature embeddings or special neural networks for parsing) is that these link embeddings can be imported as off-the-shelf, dense, syntactic features into various other NLP tasks, similar to word embedding features, but now with richer, structured information, and in tasks where plain word embeddings have not proven useful ." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-18", "text": "As an example, we incorporate them into a constituent parse reranker and see improvements that again match state-of-the-art, manually-defined, non-local reranking features and stack over them statistically significantly." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-59", "text": "Table 3 shows the main UAS (unlabeled attachment score) results on WSJ, where each '+ X' row denotes adding type X features to the MSTParser baseline." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-19", "text": "We make our link embeddings publicly available 1 and hope that they will prove useful in various other NLP tasks in future work, e.g., as dense, syntactic features in sentence classification or as linguistically-intuitive, initial units in vectorspace composition." 
}, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-20", "text": "----------------------------------" }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-21", "text": "**DEPENDENCY LINK EMBEDDINGS**" }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-22", "text": "To train the link embeddings, we use the speedy, skip-gram neural language model of Mikolov et al. (2013a; 2013b) via their toolkit word2vec." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-23", "text": "2 We use the original skip-gram model and simply change the context tuple data on which the model is trained, similar to Bansal et al. (2014) and Levy and Goldberg (2014) ." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-24", "text": "The goal is to learn similar embeddings for links with similar syntactic contextual properties like label, signed distance, ancestors, etc." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-25", "text": "To this end, we first parse the BLLIP corpus (minus the PTB portion) 3 using the baseline MSTParser (McDonald et al., 2005b) ." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-26", "text": "Next, for each predicted link, we create a tuple, consisting of the parent-child pair p-c (concatenated as a single unit, same as p c) and its various properties such as the N.Y.-Yonkers, Md.-Columbia, N.Y.-Bronx, Va.-Reston, Ky.-Lexington, Mich.-Kalamazoo, Calif.-Calabasas, . .. boost-revenue, tap-markets, take-losses, launch-fight, reduce-holdings, terminate-contract, identify-bidders, ... boosting-bid, meeting-schedules, obtaining-order, having-losses, completing-review, governing-industry, ... says-mean, adds-may, explains-have, contend-has, recalls-had, figures-is, asserted-is, notes-would, ... would-Based, is-Besides, was-Like, is-From, are-Despite, said-Besides, says-Despite, reported-As, ... began-Meanwhile, was-Since, are-Often, would-Now, had-During, were-Over, was-Late, have-Until, ... 
Catsimatidis-Mr., Swete-Mr., Case-Mr., Montoya-Mr., Byerlein-Mr., Heard-Mr., Leny-Mr., Graham-Mrs., ... only-1.5, about-170, nearly-eight, approximately-10, almost-15, some-80, Only-two, about-23, roughly-50, ... link's dependency relation label l, the grandparent dependency relation label gl, and the signed, binned distance d:" }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-27", "text": "We then run the skip-gram model on the the above context tuples (Eq. 1) with a window-size of 2, dimension-size of 100, and a min-count cutoff of 4 to give us a vocabulary of around 92K." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-28", "text": "4 We also tried other context settings, e.g., where we add more lexicalized, link-based context to the tuple such as the neighboring grandparent-parent link gp-p:" }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-29", "text": "but the setting in Eq. 1 performs slightly better (based on the development set)." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-30", "text": "Clusters: Table 1 shows example clusters obtained by clustering link embeddings via MAT-LAB's linkage + cluster commands, with 1000 clusters." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-31", "text": "5 We can see that these link embeddings are able to capture useful groups and subtle distinctions directly at the link level (without having to work with all pairs of word types), e.g., based on syntactic properties like capitalization, verb form, position in sentence; and based on topics like location, time, finance, etc." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-32", "text": "----------------------------------" }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-33", "text": "**DEPENDENCY PARSING EXPERIMENTS**" }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-34", "text": "In this section, we will first discuss how we use the link embeddings as features in dependency parsing." 
}, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-35", "text": "Next, we will present empirical results on feature space reduction and on parsing performance on both in-domain and out-of-domain datasets." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-36", "text": "----------------------------------" }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-37", "text": "**FEATURES**" }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-38", "text": "The BROWN cluster features are based on Bansal et al. (2014) , who follow Koo et al. (2008) Koo et al. (2008) )." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-39", "text": "We have another feature that additionally includes the signed, bucketed distance of the particular link in the given sentence." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-40", "text": "Also note the difference of our unary bucket features from the binary bucket features of Bansal et al. (2014) , who had to work with pairwise, conjoined features of the head and the argument." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-41", "text": "Hence, they used features on conjunctions of the two bucket values from the head and argument word vectors, firing one pairwise feature per dimension, because firing features on all dimension pairs (corresponding to an outer product) led to an infeasible number of features." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-42", "text": "The result discussion of these feature differences in presented in \u00a73.2." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-43", "text": "Bit-string features: We first hierarchically cluster the link vectors via MATLAB's linkage function with {method=ward, metric=euclidean} to get 0-1 bit-strings (similar to BROWN)." 
}, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-44", "text": "Next, we again fire a small set of unary indicator features that simply con- sist of the link's bit-string prefix, the prefix-length, and another feature that adds the signed, bucketed distance of that link in the sentence." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-45", "text": "6" }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-46", "text": "----------------------------------" }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-47", "text": "**SETUP AND RESULTS**" }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-48", "text": "For all experiments (unless otherwise noted), we follow the 2nd-order MSTParser setup of Bansal et al. (2014) , in terms of data splits, parameters, preprocessing, and feature thresholding." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-49", "text": "Statistical significance is reported based on the bootstrap test (Efron and Tibshirani, 1994) with 1 million samples." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-50", "text": "First, we compare the number of features in Table 2 ." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-51", "text": "Our dense, unary, link-embedding based Bucket and Bit-string features are substantially fewer than the sparse, n-ary, template-based features used in the MSTParser baseline, in BROWN, and in the word embedding SKIP DEP result of Bansal et al. (2014) ." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-52", "text": "This in turn also improves our parsing speed and memory." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-53", "text": "Moreover, regarding the preprocessing time taken to generate these various feature types, our Bucket features, which just need the fast word2vec training, take 2-3 orders of magnitude lesser time than the BROWN features (15 mins." 
}, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-54", "text": "versus 2.5 days) 7 ; this is also advantageous when 6 We again used prefixes of length 4, 6, 8, 12, same as the BROWN feature setting." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-55", "text": "For unknown links' features, we replace the bucket or bit-string prefix with a special 'UNK' string." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-56", "text": "7 Based on a modern 3.50 GHz desktop and 1 thread." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-57", "text": "The Bit-string features additionally need hierarchical clustering, but are still at least twice as fast as BROWN features." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-58", "text": "training and parsing with representations of new domains or languages." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-61", "text": "Baseline (91.9), and BROWN + Bucket (93.0) and BROWN + Bit-string (93.1) w.r.t." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-62", "text": "BROWN (92.7), are statistically significant at p < 0.01." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-63", "text": "Moreover, the Bit-string result (92.6) is the same, i.e., has no statistically significant difference from the BROWN result (92.7), and also from the Bansal et al. (2014) SKIP DEP result (92.7)." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-64", "text": "Therefore, the main contribution of these link embeddings is that their significantly simpler, smaller, and faster set of unary features can match the performance of complex, template-based BROWN features (and of the dependency-based word embedding features of Bansal et al. (2014) ), and also stack over them." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-65", "text": "We also get similar trends of improvements on the labeled attachment score (LAS) metric." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-66", "text": "8 Moreover, unlike Bansal et al. 
(2014) , our Bucket features achieve statistically significant improvements, most likely because they fired D pairwise, conjoined features, one per dimension d, consisting of the two bucket values from the head and argument word vectors." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-67", "text": "This would disallow the classifier to learn useful linear combinations of the various dimensions." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-68", "text": "Firing D 2 features on all dimension pairs (corresponding to an outer product) would lead to an infeasible number of features." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-69", "text": "On the other hand, we have a single vector for head+argument, allowing us to fire just D features (one per dimension) and still learn useful dimension combinations in linear space." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-70", "text": "We also report out-of-domain performance, in Table 4, on the Web treebank (Petrov and McDonald, 2012 ) test sets, directly using the WSJ-trained models." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-71", "text": "Again, both our Bucket and Bit-string linkembedding features achieve decent improvements over Baseline and they stack over BROWN, while using much fewer features." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-72", "text": "Moreover, one can hopefully achieve bigger gains by training link embeddings on Web or Wikipedia data (since BLLIP is news-domain)." 
}, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-73", "text": "----------------------------------" }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-74", "text": "**OFF-THE-SHELF: CONSTITUENT PARSING**" }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-75", "text": "Finally, these link embeddings are also portable as off-the-shelf, dense, syntactic features into other NLP tasks, either to incorporate missing syntactic information, or to replace sparse (n-ary lexicalized or template-based) parsing features, or where word embedding features are not appropriate and one needs higher-order embeddings, e.g., in constituent parsing (see Andreas and Klein (2014) )." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-76", "text": "Therefore, as a first example, we import our link embedding features into a constituent parse reranker." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-77", "text": "We follow Bansal and Klein (2011) , reranking 50-best lists of the Berkeley parser (Petrov et al., 2006) ." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-78", "text": "We first extract dependency links in each candidate constituent tree based on the head-modifier rules of Collins (2000) ." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-79", "text": "Next, we simply fire our Bit-string features on each link, where the feature again consists of just the prefix bit-string, the prefix length, and the signed, bucketed link distance." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-80", "text": "9 Table 5 shows these reranking results, where 1-best and log p(t|w) are the two Berkeley parser baselines, and where Config is the state-of-the-art, nonlocal, configurational feature set of Huang (2008) , which in turn is a simplified merge of Charniak and Johnson (2005) and Collins (2000) (here configurational)." 
}, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-81", "text": "Again, all our test improvements are statistically significant at p < 0.01: Bit-string (90.9) over both the baselines (90.2, 89.9); and Config + Bit-string (91.4) over Config (91.1)." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-82", "text": "Moreover, the Bit-string result (90.9) is the same (i.e., no statistically significant difference) as the Config result (91.1)." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-83", "text": "Therefore, we can again match the improvements of complex, manually-defined, nonlocal reranking features with a much smaller set of simple, dense, off-the-shelf, link-embedding features, and also complement them statistically significantly." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-84", "text": "----------------------------------" }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-85", "text": "**RELATED WORK**" }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-86", "text": "As mentioned earlier, there has been a lot of useful, previous work on using word embeddings for NLP tasks such as similarity, tagging, NER, sentiment analysis, and parsing (Turian et al., 2010; Collobert et al., 2011; Dhillon et al., 2012; Huang et al., 2012; Al-Rfou' et al., 2013; Hisamoto et al., 2013; Andreas and Klein, 2014; Bansal et al., 2014; Guo et al., 2014; Pennington et al., 2014; Wang et al., 2015) , inter alia." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-87", "text": "In related work, Bansal et al. (2014) also use dependency context to tailor word embeddings to dependency parsing." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-88", "text": "However, their embedding features are still based on the sparse set of n-ary, word-based templates from previous work (McDonald et al., 2005a; Koo et al., 2008) ." 
}, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-89", "text": "Our structured link embeddings achieve similar improvements as theirs (and better in the case of direct, per-dimension bucket features) with a substantially smaller and simpler (unary) set of features that are aimed to directly capture hidden relationships between the substructures that dependency parsing factors on." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-90", "text": "Moreover, we hope that similar to word embeddings, these link embeddings will also prove useful when imported into various other NLP tasks as dense, continuous features, but now with additional syntactic information." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-91", "text": "There has also been some recent, useful work on reducing the sparsity of features in dependency parsing, e.g., via low-rank tensors (Lei et al., 2014) and via neural network parsers that learn tag and label embeddings (Chen and Manning, 2014) ." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-92", "text": "In related work, learn dense feature embeddings for dependency parsing; however, they still work with the large number of manuallydefined feature templates from previous work and train embeddings for all those templates, with an aim to discover hidden, shared information among the large set of sparse features." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-93", "text": "We get similar improvements with a much smaller and simpler set of unary link features; also, our link embeddings are more portable to other NLP tasks than template-based embeddings specific to dependency parsing." 
}, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-94", "text": "Other work includes learning distributed structured output via dense label vectors (Srikumar and Manning, 2014) , learning bilexical operator embeddings (Madhyastha et al., 2014) , and learning joint word embeddings and composition functions based on predicate-argument compositionality (Hashimoto et al., 2014) ." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-95", "text": "Our main goal is to directly learn embeddings on linguistically-intuitive units like dependency links, so that they can be used as non-sparse, unary features in dependency parsing, and also as off-theshelf, dense, syntactic features in other NLP tasks (versus more intrinsic approaches based on feature embeddings or neural network parsers, which are harder to export)." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-96", "text": "----------------------------------" }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-97", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-98", "text": "We presented dependency link embeddings, which provide a small, simple set of unary features for dependency parsing, while maintaining statistically significant improvements, similar and complementary to sparse, n-ary, word-cluster features." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-99", "text": "These link vectors are also portable as off-the-shelf syntactic features in other NLP tasks; we import them into constituent parse reranking, where they again match and stack over state-of-the-art, non-local reranking features." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-100", "text": "We release our link embeddings (available at ttic.edu/bansal) and hope that these will prove useful in various other NLP tasks, e.g., as dense, syntactic features in sentence classification or as linguistically-intuitive, initial units in vectorspace composition." 
}, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-101", "text": "In future work, it will be useful to try obtaining stronger parsing accuracies via newer, better representation learning tools, e.g., GloVe (Pennington et al., 2014) , and by training on larger quantities of automatically-parsed data." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-102", "text": "It will also be useful to perform intrinsic evaluation of these link embeddings on appropriate syntactic datasets and metrics, and extrinsic evaluation via various other NLP tasks such as sentence classification." }, { "sent_id": "d0b4d9566f16915cb5a5244f351e61-C001-103", "text": "Finally, it will be interesting to try parsers or frameworks where we can directly employ the embeddings as features, instead of bucketing or clustering them." } ], "y": { "@BACK@": { "gold_contexts": [ [ "d0b4d9566f16915cb5a5244f351e61-C001-8" ], [ "d0b4d9566f16915cb5a5244f351e61-C001-86" ] ], "cite_sentences": [ "d0b4d9566f16915cb5a5244f351e61-C001-8", "d0b4d9566f16915cb5a5244f351e61-C001-86" ] }, "@DIF@": { "gold_contexts": [ [ "d0b4d9566f16915cb5a5244f351e61-C001-13" ], [ "d0b4d9566f16915cb5a5244f351e61-C001-16", "d0b4d9566f16915cb5a5244f351e61-C001-17" ], [ "d0b4d9566f16915cb5a5244f351e61-C001-40" ], [ "d0b4d9566f16915cb5a5244f351e61-C001-50", "d0b4d9566f16915cb5a5244f351e61-C001-51" ], [ "d0b4d9566f16915cb5a5244f351e61-C001-66" ], [ "d0b4d9566f16915cb5a5244f351e61-C001-87", "d0b4d9566f16915cb5a5244f351e61-C001-88", "d0b4d9566f16915cb5a5244f351e61-C001-89" ] ], "cite_sentences": [ "d0b4d9566f16915cb5a5244f351e61-C001-13", "d0b4d9566f16915cb5a5244f351e61-C001-16", "d0b4d9566f16915cb5a5244f351e61-C001-17", "d0b4d9566f16915cb5a5244f351e61-C001-40", "d0b4d9566f16915cb5a5244f351e61-C001-51", "d0b4d9566f16915cb5a5244f351e61-C001-66", "d0b4d9566f16915cb5a5244f351e61-C001-87", "d0b4d9566f16915cb5a5244f351e61-C001-88", "d0b4d9566f16915cb5a5244f351e61-C001-89" ] }, "@MOT@": { "gold_contexts": [ [ 
"d0b4d9566f16915cb5a5244f351e61-C001-13" ], [ "d0b4d9566f16915cb5a5244f351e61-C001-16", "d0b4d9566f16915cb5a5244f351e61-C001-17" ], [ "d0b4d9566f16915cb5a5244f351e61-C001-87", "d0b4d9566f16915cb5a5244f351e61-C001-88", "d0b4d9566f16915cb5a5244f351e61-C001-89" ] ], "cite_sentences": [ "d0b4d9566f16915cb5a5244f351e61-C001-13", "d0b4d9566f16915cb5a5244f351e61-C001-16", "d0b4d9566f16915cb5a5244f351e61-C001-17", "d0b4d9566f16915cb5a5244f351e61-C001-87", "d0b4d9566f16915cb5a5244f351e61-C001-88", "d0b4d9566f16915cb5a5244f351e61-C001-89" ] }, "@SIM@": { "gold_contexts": [ [ "d0b4d9566f16915cb5a5244f351e61-C001-23" ], [ "d0b4d9566f16915cb5a5244f351e61-C001-63", "d0b4d9566f16915cb5a5244f351e61-C001-64" ] ], "cite_sentences": [ "d0b4d9566f16915cb5a5244f351e61-C001-23", "d0b4d9566f16915cb5a5244f351e61-C001-63", "d0b4d9566f16915cb5a5244f351e61-C001-64" ] }, "@USE@": { "gold_contexts": [ [ "d0b4d9566f16915cb5a5244f351e61-C001-38" ], [ "d0b4d9566f16915cb5a5244f351e61-C001-48" ] ], "cite_sentences": [ "d0b4d9566f16915cb5a5244f351e61-C001-38", "d0b4d9566f16915cb5a5244f351e61-C001-48" ] } } }, "ABC_8a1d4802c170fa8a71504533437e8f_5": { "x": [ { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-2", "text": "In this era of Big Data, due to expeditious exchange of information on the web, words are being used to denote newer meanings, causing linguistic shift." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-176", "text": "Only when all the three features are used, these cases are correctly detected as 'birth'." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-3", "text": "With the recent availability of large amounts of digitized texts, an automated analysis of the evolution of language has become possible." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-4", "text": "Our study mainly focuses on improving the detection of new word senses." 
}, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-5", "text": "This paper presents a unique proposal based on network features to improve the precision of new word sense detection." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-6", "text": "For a candidate word where a new sense (birth) has been detected by comparing the sense clusters induced at two different time points, we further compare the network properties of the subgraphs induced from novel sense cluster across these two time points." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-7", "text": "Using the mean fractional change in edge density, structural similarity and average path length as features in an SVM classifier, manual evaluation gives precision values of 0.86 and 0.74 for the task of new sense detection, when tested on 2 distinct time-point pairs, in comparison to the precision values in the range of 0.23-0.32, when the proposed scheme is not used." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-8", "text": "The outlined method can therefore be used as a new post-hoc step to improve the precision of novel word sense detection in a robust and reliable way where the underlying framework uses a graph structure." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-9", "text": "Another important observation is that even though our proposal is a post-hoc step, it can be used in isolation and that itself results in a very decent performance achieving a precision of 0.54-0.62." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-10", "text": "Finally, we show that our method is able to detect the well-known historical shifts in 80% cases." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-11", "text": "----------------------------------" }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-12", "text": "**INTRODUCTION**" }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-13", "text": "How do words develop new senses? How does one characterize semantic change?" 
}, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-14", "text": "Is it possible to develop algorithms to track semantic change by comparing historical data at scale?" }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-15", "text": "In order to extract meaningful insights from these data, a very important step is to understand the contextual sense of a word, e.g., does the word 'bass' in a particular context refer to fish or is it related to music?" }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-16", "text": "Most data-driven approaches so far have been limited to either word sense induction where the the goal is to automatically induce different senses of a given word in an unsupervised clustering setting, or word sense disambiguation where a fixed sense inventory is assumed to exist, and the senses of a given word are disambiguated relative to the sense inventory." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-17", "text": "However in both these tasks, the assumption is that the number of senses that a word has, is static, and also the senses exist in the sense inventory to compare with." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-18", "text": "They attempt to detect or induce one of these senses depending on the context." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-19", "text": "However, natural language is dynamic, constantly evolving as per the users' needs which leads to change of word meanings over time." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-20", "text": "For example, by late 20 th century, the word 'float' has come up with the 'data type' sense whereas the word 'hot' has started corresponding to the 'attractive personality' sense." 
}, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-21", "text": "----------------------------------" }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-22", "text": "**RECENT ADVANCEMENTS**" }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-23", "text": "Recently, with the arrival of large-scale collections of historic texts and online libraries such as Google books, a new paradigm has been added to this research area, whereby the prime interest is in identifying the temporal scope of a sense [10, 14, 16, 25] which, in turn, can give further insights to the phenomenon of language evolution." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-177", "text": "Edge density, on the other hand is very crucial for improving precision." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-24", "text": "Some recent attempts [5, 8, 11, 12, 15] also have been made to model the dynamics of language in terms of word senses." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-25", "text": "One of the studies in this area has been presented by Mitra et al. [19] where the authors show that at earlier times, the sense of the word 'sick' was mostly associated to some form of illness; however, over the years, a new sense associating the same word to something that is 'cool' or 'crazy' has emerged." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-26", "text": "Their study is based on a unique network representation of the corpus called a distributional thesauri (DT) network built using Google books syntactic n-grams." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-27", "text": "They have used unsupervised clustering techniques to induce a sense of a word and then compared the induced senses of two time periods to get the new sense for a particular target word." 
}, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-28", "text": "----------------------------------" }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-29", "text": "**LIMITATIONS OF THE RECENT APPROACHES**" }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-30", "text": "While Mitra et al. [19] reported a precision close to 0.6 over a random sample of 49 words, we take another random sample of 100 words separately and repeat manual evaluation." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-31", "text": "When we extract the novel senses by comparing the DTs from 1909-1953 and 2002-2005 , the precision obtained for these 100 words is as low as 0.32." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-32", "text": "Similarly if we extract the novel senses comparing the DTs of 1909-1953 with 2006-2008, the precision stands at 0.23." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-33", "text": "We then explore another unsupervised approach presented in Lau et al. [16] over the same Google books corpus 1 , apply topic modeling for sense induction and directly adapt their similarity measure to get the new senses." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-34", "text": "Using a set intersecting with the 100 random samples for Mitra et al. [19] , we obtain the precision values of 0.21 and 0.28, respectively." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-35", "text": "Clearly, none of the precision values are good enough for reliable novel sense detection." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-36", "text": "This motivates us to devise a new approach to improve the precision of the existing approaches." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-37", "text": "Further, being inspired by the recent works of applying complex network theory in NLP applications like co-hyponymy detection [13] , evaluating machine generated summaries [20] , detection of ambiguity in a text [4] , etc. 
we opt for a solution using complex network measures." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-38", "text": "----------------------------------" }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-39", "text": "**OUR PROPOSAL AND THE ENCOURAGING RESULTS**" }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-40", "text": "We propose a method based on the network features to reduce the number of false positives and thereby, increase the overall precision of the method proposed by Mitra et al. [19] ." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-41", "text": "In particular, if a target word qualifies as a 'birth' as per their method, we construct two induced subgraphs of those words that form the cluster corresponding to this 'birth' sense from the corresponding distributional thesauri (DT) networks of the two time points." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-42", "text": "Next we compare the following three network properties: (i) the edge density, (ii) the structural similarity and (iii) the average path length [27, 29] of the two induced subgraphs from the two time points." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-43", "text": "A remarkable observation is that although this is a small set of only three features, for the actual 'birth' cases, each of them has a significantly different value for the later time point and are therefore very discriminative indicators." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-44", "text": "In fact, the features are so powerful that even a small set of training instances is sufficient for making highly accurate predictions." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-45", "text": "Results: Manual evaluation of the results by 3 evaluators shows that this classification achieves an overall precision of 0.86 and 0.74 for the two time point pairs over the same set of samples, in contrast with the precision values of 0.32 and 0.23 by the original method." 
}, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-46", "text": "Note that we would like to stress here that an improvement of more than double in the precision of novel sense detection that we achieve has the potential to be the new stepping stone in many NLP and IR applications that are sensitive to novel senses of a word." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-47", "text": "----------------------------------" }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-48", "text": "**DETECTING KNOWN SHIFTS**" }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-49", "text": "Further we also investigate the robustness of our approach by analyzing the ability to capture known historical shifts in meaning." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-50", "text": "Preparing a list of words that have been suggested by different prior works as having undergone sense change, we see that 80% of those words get detected by our approach." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-51", "text": "We believe that the ability to detect such diachronic shifts in data can significantly enhance various standard studies in natural language evolution and change." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-52", "text": "----------------------------------" }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-53", "text": "**IMPACT**" }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-54", "text": "We stress that our work could have strong repercussions in historical linguistics [1] ." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-55", "text": "Besides, lexicography is also expensive; compiling, editing and updating sense inventory entries frequently remains cumbersome and labor-intensive." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-56", "text": "Time specific knowledge would make the word meaning representations more accurate." 
}, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-57", "text": "A well constructed semantic representation of a word is useful for many natural language processing or information retrieval systems like machine translation, semantic search, disambiguation, Q&A, etc." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-58", "text": "For semantic search, taking into account the newer senses of a word can increase the relevance of the query result." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-59", "text": "Similarly, a disambiguation engine informed with the newer senses of a word can increase the efficiency of disambiguation, and recognize senses not covered by the inventory that would otherwise have to be wrongly assigned to covered senses." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-60", "text": "Above all, a system having the ability to perceive the novel sense of a word can help in automatically updating the sense inventory by taking into account the temporal scope of senses." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-61", "text": "----------------------------------" }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-62", "text": "**RELATED WORK**" }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-63", "text": "Our work falls broadly under data-driven models of language dynamics." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-64", "text": "One of the first attempts in this area was made by Erk [6] , where the author tried to model this problem as an instance of outlier detection, using a simple nearest neighbor-based approach." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-65", "text": "Gulordava and Baroni [10] study the change in the semantic orientation of words using Google book n-grams corpus from different time periods." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-66", "text": "In another work, Mihalcea et al. [18] attempted to quantify the changes in word usage over time." 
}, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-67", "text": "Along similar lines, Jatowt and Duh [14] used the Google n-grams corpus from two different time points and proposed a method to identify semantic change based on the distributional similarity between the word vectors." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-68", "text": "Tahmasebi et al. [25] attempted to track sense changes from a newspaper corpus containing articles between 1785 and 1985." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-69", "text": "Cook et al. [3] prepared the largest corpus-based dataset of diachronic sense differences." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-70", "text": "Lau et al. [17] introduced a topic modeling based word sense induction method to automatically detect words with emergent novel senses; in a subsequent work, Lau et al. [16] extended this task by leveraging the concept of predominant sense." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-71", "text": "The first computational approach to track and detect statistically significant linguistic shifts of words has been proposed by Kulkarni et al. [15] ." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-72", "text": "Recently, Hamilton et al. [12] proposed a method to quantify semantic change by evaluating word embeddings against known historical changes." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-73", "text": "In another work, Hamilton et al. [11] categorized the semantic change into two types and proposed different distributional measures to detect those types." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-74", "text": "Eger et al. [5] analyzed time-series models of embedding vectors as well as time-indexed self-similarity graphs in order to hypothesize the linearity of semantic change." 
}, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-75", "text": "A dynamic Bayesian model of diachronic meaning change has been proposed by Frermann et al. [8] ." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-76", "text": "Recently, researchers have also tried to investigate the reasons behind word sense evolution and have come up with computational models based on chaining [21] ." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-77", "text": "Researchers have also applied dynamic word embeddings to detect language evolution [23, 30] and to analyze temporal word analogies [24] ." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-78", "text": "We now describe the two baselines that are relevant for our work." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-79", "text": "Baseline 1: Mitra et al. [19] : The authors proposed an unsupervised method to automatically identify word sense changes for nouns." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-80", "text": "Datasets and graph construction: The authors used the Google books corpus, consisting of texts from over 3.4 million digitized English books published between 1520 and 2008." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-81", "text": "The authors constructed distributional thesauri (DT) networks from the Google books syntactic n-grams data [9] ." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-82", "text": "In the DT network, each word is a node and there is a weighted edge between a pair of words where the weight of the edge is defined as the number of features that these two words share in common." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-83", "text": "A snapshot of the DT is shown in the leftmost image of Figure 1 ." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-84", "text": "To study word sense changes over time, they divided the dataset across eight time periods; accordingly, DT networks for each of these time periods were constructed separately." 
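The DT construction just described — words as nodes, edge weights equal to the number of shared context features — can be sketched as follows. This is a minimal illustration with networkx; the words and syntactic-context features are made up, not taken from the Google books data:

```python
import networkx as nx

def build_dt_network(word_features):
    """Build a distributional-thesaurus (DT) network: nodes are words, and
    two words share an edge weighted by the number of common features."""
    G = nx.Graph()
    words = list(word_features)
    G.add_nodes_from(words)
    for i, u in enumerate(words):
        for v in words[i + 1:]:
            shared = len(word_features[u] & word_features[v])
            if shared > 0:
                G.add_edge(u, v, weight=shared)
    return G

# Toy example with hypothetical syntactic-context features.
feats = {
    "float": {"amod:free", "dobj:boat", "nsubj:value"},
    "drift": {"amod:free", "dobj:boat"},
    "value": {"nsubj:value", "amod:net"},
}
G = build_dt_network(feats)
print(G["float"]["drift"]["weight"])  # 2 shared features
```

Building one such graph per time period, as the authors do, then amounts to running this construction once per time slice of the corpus.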
}, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-85", "text": "Sense change detection: The Chinese Whispers algorithm [2] is used to produce a set of clusters for each target word by decomposing its neighbourhood in the DT network." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-86", "text": "The hypothesis is that different clusters signify different senses of a target word." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-87", "text": "The clusters for the target word 'float' are shown in the right image of Figure 1 ." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-88", "text": "The authors then compare the sense clusters extracted across two different time points to obtain the suitable signals of sense change." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-89", "text": "Specifically, for a candidate word w, a sense cluster in the later time period is called a 'birth' cluster if at least 80% words of this cluster do not appear in any of the sense clusters from the previous time period." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-90", "text": "The authors then apply multi-stage filtering in order to obtain meaningful candidate words." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-91", "text": "Baseline 2: Lau et al. [16] : The authors proposed an unsupervised approach based on topic modeling for sense induction, and showed novel sense identification as one of its applications." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-92", "text": "For a candidate word, Hierarchical Dirichlet Process [26] is run over a corpus to induce topics." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-93", "text": "The induced topics are represented as word multinomials, and are expressed by the top-N words in descending order of conditional probability." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-94", "text": "Each topic is represented as a sense of the target word." 
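The 'birth' criterion of Baseline 1 — a newer cluster counts as a 'birth' if at least 80% of its words appear in none of the older sense clusters — can be sketched in a few lines. The clusters below are invented for illustration only:

```python
def birth_clusters(old_clusters, new_clusters, threshold=0.8):
    """Flag newer-time-period clusters as 'birth' candidates when at least
    `threshold` of their words are absent from every older cluster."""
    old_words = set().union(*old_clusters) if old_clusters else set()
    births = []
    for cluster in new_clusters:
        novel = sum(1 for w in cluster if w not in old_words)
        if novel / len(cluster) >= threshold:
            births.append(cluster)
    return births

# Hypothetical sense clusters for one target word at two time points.
old = [{"swim", "drift", "sail"}, {"bond", "share", "stock"}]
new = [{"swim", "drift", "bob"},                          # mostly old words
       {"parade", "carnival", "procession", "pageant"}]   # entirely new words
births = birth_clusters(old, new)  # only the second cluster qualifies
```

The multi-stage filtering the authors apply afterwards would then prune this candidate list further.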
}, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-95", "text": "The words having highest probability in each topic represent the sense clusters." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-96", "text": "The authors treated the novel sense detection task as identifying those sense clusters which did not align with any of the recorded senses in a sense repository." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-97", "text": "They used Jensen-Shannon (JS) divergence measure to compute alignment between a sense cluster and a synset." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-98", "text": "They computed JS divergence between the multinomial distribution (over words) of the topic cluster and that of the synset, and converted the divergence value into a similarity score." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-99", "text": "Similarity between topic cluster t j and synset s i is defined as sim(t j , s i ) = 1 \u2212 JS(T \u2225 S)," }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-100", "text": "where T and S are the multinomial distributions over words for topic t j and synset s i , respectively, and JS(X \u2225 Y ) is the Jensen-Shannon divergence between the distributions X and Y ." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-101", "text": "Since we define novel senses while comparing sense clusters across two time points, we use the same JS measure to detect the novel sense of a target word." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-102", "text": "A sense cluster in the newer time period denotes a new sense ('birth') only if its maximum similarity with any of the clusters in the older time period is below a threshold, which we have set to 0.35 based on empirical observation." 
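The divergence-to-similarity conversion described above can be sketched as follows, assuming the similarity score is simply one minus the base-2 Jensen-Shannon divergence (so both quantities lie in [0, 1]); the distributions below are toy values, not induced topics or real synsets:

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logs, so values lie in [0, 1]."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0            # 0 * log(0) is treated as 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def similarity(topic_dist, synset_dist):
    # Convert the divergence into a similarity score in [0, 1].
    return 1.0 - js_divergence(topic_dist, synset_dist)

t = [0.5, 0.3, 0.2, 0.0]   # topic multinomial over a shared vocabulary
s = [0.1, 0.2, 0.3, 0.4]   # synset multinomial over the same vocabulary
is_birth = similarity(t, s) < 0.35   # below-threshold similarity => candidate novel sense
```

With base-2 logs, identical distributions give similarity 1 and fully disjoint ones give similarity 0, which makes a fixed threshold such as 0.35 meaningful.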
}, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-111", "text": "----------------------------------" }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-112", "text": "**PROPOSED NETWORK-CENTRIC APPROACH**" }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-103", "text": "The strength of this connection can be measured if we construct induced subgraphs of S from the two DTs and measure the network properties of these subgraphs; the difference would be more prominent for the actual 'birth' cases (true positives) than for the false 'birth' signals (false positives)." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-104", "text": "Note that by definition, the nodes in an induced subgraph from a DT will be the words in S and there will be an edge between two words if and only if the edge exists in the original DT; we ignore the weight of the edge henceforth." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-105", "text": "Thus, the difference between the two subgraphs (one each from the older and newer DTs) will only be in the edge connections." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-106", "text": "Figure 2 takes one true positive ('registers') and one false positive ('quotes') word from the set of 49 words and shows the induced subgraphs obtained by a subset of their 'birth' clusters across the two time points." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-107", "text": "We can clearly see that connections among the words in S are much stronger in the newer DT than in the older one in the case of 'registers', indicating the emergence of a new sense." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-108", "text": "In the case of 'quotes', however, the connections are not very different across the two time periods." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-109", "text": "We choose three cohesion indicating network properties, (i) the edge density, (ii) the structural similarity and (iii) the average path length, to capture this change." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-110", "text": "Let S = {w 1 , w 2 , . . . , w n } be the 'birth' cluster for w. Once we construct a graph induced by S from the DT, these network properties are measured as follows: Edge Density (ED): ED is given by ED = N a /N p , where N a denotes the number of actual edges between w 1 , w 2 , . . . , w n and N p denotes the maximum possible edges between these, i.e., n(n\u22121)/2" }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-113", "text": "----------------------------------" }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-114", "text": "**STRUCTURAL SIMILARITY (SS)**" }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-115", "text": "For each pair of words (w i , w j ) in the cluster S, the structural similarity SS(w i , w j ) is computed as:" }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-116", "text": "where N c denotes the number of common neighbors of w i and w j in the induced graph and deg(w k ) denotes the degree of w k in the induced graph, for k = i, j. The average structural similarity for the cluster S is computed by averaging the structural similarity of all the word pairs." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-117", "text": "Average Path Length (APL): To compute the average path length of S, we first find the shortest path length between w and the words w i in the induced graph of S. Let spl i denote the shortest path distance from w to w i ." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-118", "text": "The average path length is defined as: APL = (\u2211 i spl i )/n" }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-119", "text": "where n is the number of words in the cluster S. Table 1 notes the values obtained for these network properties for the induced subgraphs of the reported 'birth' clusters for 'registers' and 'quotes' across the two time periods." 
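The three cohesion measures on an induced subgraph can be sketched with networkx as below. Note one assumption: the structural-similarity formula used here (common neighbours normalised by the geometric mean of the two degrees, a standard variant) is a stand-in, since the exact formula is not reproduced above; the toy graph is likewise invented:

```python
import itertools
import math
import networkx as nx

def cluster_features(dt_graph, target, cluster):
    """Cohesion features of a 'birth' cluster S: edge density of the induced
    subgraph, mean structural similarity, and average shortest path length
    from the target word to the words of S."""
    g = dt_graph.subgraph(cluster)
    n = len(cluster)
    ed = g.number_of_edges() / (n * (n - 1) / 2)   # actual / possible edges

    sims = []
    for u, v in itertools.combinations(cluster, 2):
        common = len(set(g[u]) & set(g[v]))         # shared neighbours in g
        d = g.degree(u) * g.degree(v)
        sims.append(common / math.sqrt(d) if d else 0.0)
    ss = sum(sims) / len(sims)

    # Path lengths measured from the target inside S plus the target itself.
    h = dt_graph.subgraph(set(cluster) | {target})
    spl = [nx.shortest_path_length(h, target, w) for w in cluster
           if nx.has_path(h, target, w)]
    apl = sum(spl) / len(spl) if spl else float("inf")
    return ed, ss, apl

# Toy DT graph: target 'w' with a loosely connected 'birth' cluster.
G = nx.Graph([("w", "a"), ("w", "b"), ("w", "c"), ("a", "b")])
ed, ss, apl = cluster_features(G, "w", ["a", "b", "c"])
```

Computing these features once per time point, on the subgraph induced by the same cluster in each DT, yields the per-period values that Table 1 tabulates.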
}, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-120", "text": "The fractional changes observed for the three network properties show a clear demarcation between the two cases." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-121", "text": "Fractional change (\u2206) of any network measure P is defined as \u2206P = (P t 2 \u2212 P t 1 )/P t 1 ," }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-122", "text": "where t 1 and t 2 are the old and new time periods, respectively." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-123", "text": "The change observed for the 'birth' cluster of 'registers' is significantly higher than that in 'quotes' 2 ." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-124", "text": "We now compute these parameter values for all the 49 candidate cases." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-125", "text": "The mean values obtained for the true positives (TP) and false positives (FP) are shown in Table 2 ." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-126", "text": "The findings are consistent with those obtained for 'registers' and 'quotes'." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-127", "text": "2 As we have taken the 'birth' clusters from the new time period (t 2 ), the words in the clusters are always direct neighbors of the target word, resulting in an average path length of 1 in t 2" }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-128", "text": "We, therefore, use the fractional changes in the three network properties over time as three features to classify the remaining candidate 'birth' words into true positives (actual 'birth') or false positives (false 'birth')." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-129", "text": "----------------------------------" }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-130", "text": "**EXPERIMENTAL RESULTS**" }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-131", "text": "For experimental evaluation, we start with the 'birth' cases reported by Mitra et al. [19] ." 
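The three-feature SVM classification over fractional changes can be sketched as below. The training rows are hypothetical (\u0394ED, \u0394SS, \u0394APL) values chosen to mimic the pattern reported above (cohesion rises for true births, APL drops toward 1), not the paper's actual data:

```python
from sklearn import svm

def fractional_change(p_old, p_new):
    """Fractional change of a network measure P between the two time points."""
    return (p_new - p_old) / p_old

# Hypothetical training data: one row of (dED, dSS, dAPL) per candidate
# 'birth' cluster; labels mark true (1) vs. false (0) births.
X_train = [
    [2.1, 1.8, -0.6],   # strong cohesion gain -> true birth
    [1.7, 2.2, -0.5],
    [0.1, 0.0, -0.1],   # little change -> false birth
    [0.2, 0.1,  0.0],
]
y_train = [1, 1, 0, 0]

clf = svm.SVC(kernel="linear").fit(X_train, y_train)
pred = clf.predict([[1.9, 2.0, -0.55]])   # a cluster whose cohesion grew sharply
```

In this sketch a candidate is kept as a 'filtered birth' only if the classifier labels its fractional-change vector as 1.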
}, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-132", "text": "We run Lau et al. [16] over these birth cases to detect 'novel' sense as per their algorithm." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-133", "text": "Separately, we also apply the proposed SVM classification model as a filtering step to obtain 'filtered birth' cases." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-134", "text": "This helps in designing a comparative evaluation of these algorithms as follows." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-135", "text": "From both the time point pairs (T 1 and T 2 ), we take 100 random samples from the birth cases reported by Mitra et al. [19] and get these manually evaluated." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-136", "text": "For the same 100 random samples, we now use the outputs of Lau et al. [16] and the proposed approach, and estimate the precision as well as recall of these." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-137", "text": "To further evaluate the proposed algorithm, we perform two more evaluations." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-138", "text": "First, we take 60 random samples from each time point pair for computing precision of the 'filtered birth' cases." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-139", "text": "Second, we also take 100 random samples for each time point pair for computing precision of our approach independently of Mitra et al. [19] , i.e., the proposed approach is not informed of the 'birth' cluster reported by Mitra et al. [19] ; instead, all the clusters in the old and new time points are shown." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-140", "text": "We perform all the evaluations manually and each of the candidate words is judged by 3 evaluators." 
}, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-141", "text": "These evaluators are graduate/postgraduate students with a good background in Natural Language Processing." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-142", "text": "They are unaware of each other's judgments, making the annotation process completely blind and independent." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-143", "text": "Evaluators are shown the detected 'birth' cluster from the newer time period and all the clusters from the older time period." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-144", "text": "They are asked to make a binary judgment as to whether the 'birth' cluster indicates a new sense of the candidate word, which is not present in any of the sense clusters of the previous time point 3 ." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-145", "text": "Majority decision is taken in case of disagreement." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-146", "text": "In total, we evaluate the system for a set of as many as 520 words 4 , which we believe is significant given the tedious manual judgment involved." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-147", "text": "In this process of manual annotation, we obtain an inter-annotator agreement (Fleiss' kappa [7] ) of 0.745, which is substantial [28] ." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-148", "text": "Table 3 shows three example words from T 1 , their 'birth' clusters as reported in Mitra et al. [19] and the manual evaluation result." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-149", "text": "The first two cases belong to the computer-related senses of 'searches' and 'logging', which were absent from the time period 1909-1953." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-150", "text": "On the other hand, the 'birth' cluster of 'pesticide' represents an old sense which was also present in 1909-1953." 
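The agreement figure reported above is Fleiss' kappa, which can be computed from an items \u00d7 categories count matrix (each row summing to the number of raters per item). A minimal sketch; the counts below are toy binary judgments, not the actual annotation data:

```python
import numpy as np

def fleiss_kappa(ratings):
    """Fleiss' kappa for an items x categories count matrix, where each
    row sums to the (constant) number of raters per item."""
    ratings = np.asarray(ratings, float)
    n_items, _ = ratings.shape
    n_raters = ratings[0].sum()
    # Overall proportion of assignments falling in each category.
    p_cat = ratings.sum(axis=0) / (n_items * n_raters)
    # Per-item agreement: fraction of concordant rater pairs.
    p_item = ((ratings ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar, p_e = p_item.mean(), (p_cat ** 2).sum()
    return (p_bar - p_e) / (1 - p_e)

# 3 raters, binary 'birth' judgment (columns: false, true), 6 toy items.
counts = [[3, 0], [0, 3], [1, 2], [3, 0], [2, 1], [0, 3]]
kappa = fleiss_kappa(counts)
```

Perfect agreement on every item gives kappa 1; values around 0.745 indeed fall in the conventionally "substantial" band.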
}, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-151", "text": "Similarly, Table 4 shows manual evaluation results for 3 example cases, along with their novel sense as captured by Lau et al. [16] ." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-152", "text": "The evaluation results for Lau et al. [16] are presented in Table 5 ." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-153", "text": "Since the reported novel sense cluster can in principle be different from the 'birth' sense reported by the method of Mitra et al. [19] for the same word, we get the novel sense cases manually evaluated by 3 annotators (42 and 28 cases for the two time periods, respectively)." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-154", "text": "Note that for these 100 random samples (that are all marked 'true' by Mitra et al. [19] ), it is possible to find an upper bound on the recall of Lau et al. [16] 's approach automatically." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-155", "text": "While the low recall might be justified because this is a different approach, even the precision is found to be in the same range as that of Mitra et al. [19] ." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-156", "text": "Table 6 presents the evaluation results for the same set of 100 random samples after using the proposed SVM filtering." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-157", "text": "We see that the filtering using SVM classification improves the precision for both the time point pairs (T 1 and T 2 ) significantly, boosting it from the range of 0.23-0.32 to 0.74-0.86. Note that, as per our calculations, the recall of Mitra et al. [19] would indeed be 100% (as we are taking random samples for annotation from the set of reported 'birth' cases by Mitra et al. [19] only)." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-158", "text": "Even then, Mitra et al. [19] 's F-measure ranges from 0.37-0.48 while ours is 0.67-0.68." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-159", "text": "Table 7 presents some of the examples which were declared as 'birth' by Mitra et al. [19] but which SVM filtering correctly flagged as 'false birth'." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-160", "text": "The feature values in the third column clearly show that the network around the words in the detected 'birth' cluster did not change much and therefore, the SVM approach could correctly flag these." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-161", "text": "Considering the small training set, the results are highly encouraging." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-162", "text": "We also obtain decent recall values for the two time point pairs, giving an overall F-measure of 0.67-0.68." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-163", "text": "Further, we check if we can meaningfully combine the results reported by both the methods of Mitra et al. [19] and Lau et al. [16] for more accurate sense detection, and how this compares with our approach; one example is shown in Table 8 , where both the senses look quite similar." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-164", "text": "Table 9 shows the accuracy results obtained using this approach." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-165", "text": "Only 6 and 2 words out of those 100 samples were flagged as 'birth' for the two time points T 1 and T 2 respectively." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-166", "text": "Thus, the recall is very poor." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-167", "text": "While the precision improves slightly for T 1 (4 out of 6 are correct), it is worse for T 2 (only 1 out of 6 words is correct)." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-168", "text": "The results confirm that the proposed SVM classification approach works better than both the approaches, individually as well as combined." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-169", "text": "Feature analysis: We now move on to further feature analysis and error analysis of the proposed approach." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-170", "text": "To validate the usefulness of all the identified features, we perform feature leave-one-out experiments." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-171", "text": "The results for T 1 are presented in Table 10 and 11." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-172", "text": "We see that F-measure drops as we leave out one of the features." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-173", "text": "While {ED, SS } turns out to be the best for precision, {SS, APL} gives the best recall." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-174", "text": "Table 12 provides three examples to illustrate the importance of using all the three features." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-175", "text": "For the word 'newsweek', using only {ED, APL}, and for the word 'caring', using only {ED, SS }, the classifier could not detect these as 'birth'." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-178", "text": "For instance, when only {SS, APL} are used, words like 'moderators' are wrongly flagged as 'true'." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-179", "text": "Such cases are filtered out when all the three features are used." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-180", "text": "Extensive evaluation of the proposed approach: We first take 60 random samples each from the 'birth' cases reported by the SVM filtering for the two time point pairs, T 1 (from 318 cases) and T 2 (from 329 cases)." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-181", "text": "The precision values of this evaluation are found to be 0.87 (52/60) and 0.75 (45/60) respectively, quite consistent with those reported in Table 6 ." 
}, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-182", "text": "We did another experiment in order to estimate the performance of our model for detecting novel sense, independent of the method of Mitra et al. [19] ." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-183", "text": "We take 100 random words from the two time point pairs (T 1 and T 2 ), along with all the induced clusters from the newer time period and run the proposed SVM filtering approach to flag the novel 'birth' senses." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-184", "text": "According to our model, for T 1 and T 2 respectively, 16 and 15 words are flagged as having a novel sense, achieving precision values of 0.54 and 0.62 on manual evaluation, which is quite decent." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-185", "text": "Note that, for some cases, multiple clusters of a single word have been flagged as novel senses and we observe that these clusters hold a similar sense." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-186", "text": "Error analysis: We further analyze the cases, which are labeled as 'true birth' by the SVM but are evaluated as 'false' by the human evaluators." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-187", "text": "We find that in most of such cases, the sense cluster reported as 'birth' contained many new terms (and therefore, the network properties have undergone change) but the implied sense was already present in one of the previous clusters with very few common words (therefore the new cluster contained > 80% new words and was reported as 'birth' by Mitra et al. [19] )." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-188", "text": "Two such examples are given in Table 13 ." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-189", "text": "The split-join algorithm proposed in Mitra et al. [19] needs to be adapted for such cases." 
}, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-190", "text": "We also analyze the 'false negative' cases, which are labeled as 'false birth' by the SVM filtering but are evaluated as 'true' by the human evaluators." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-191", "text": "Two such examples are given in Table 14 ." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-192", "text": "By looking at the feature values of these cases, it is clear that the network structure of the induced subgraph is not changing much, yet these words undergo sense change." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-193", "text": "A probable reason is that the target word was not connected to the induced subgraph in the old time point and enters it in the new time point." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-194", "text": "Our SVM model is so far unable to detect such a single-node injection into the network." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-195", "text": "Handling these cases would be an immediate future step to improve the recall of the system." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-196", "text": "----------------------------------" }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-197", "text": "**DETECTION OF KNOWN SHIFTS**" }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-198", "text": "So far, we have reported experiments on discovering novel senses from data and measured the accuracy of our method using manual evaluation." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-199", "text": "In this section, we evaluate the diachronic validity of our method on another task of detecting known shifts." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-200", "text": "We test whether our proposed method is able to capture known historical shifts in meaning." 
}, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-201", "text": "For this purpose, we create a reference list L of 15 words that have been suggested by prior work [5, 11, 12] as having undergone linguistic change and emerged with a novel sense." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-202", "text": "Note that we focus only on nouns that emerged with a novel sense between 1900 and 1990." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-203", "text": "The goal of this task is to find out how many cases from the list L our method is able to detect as having a novel sense, which in turn would demonstrate the robustness of our method." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-204", "text": "Data: Consistent with the prior work, we use the Corpus of Historical American (COHA) 5 ." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-205", "text": "The COHA corpus is carefully created to be genre-balanced and is a well-constructed representative sample of American English over 200 years, from 1810 to 2000." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-206", "text": "We extract the raw text data of two time slices: 1880-1900 and 1990-2000 for our experiment." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-207", "text": "Experiment details and results: We first construct distributional thesauri (DT) networks [22] for the COHA corpus at two different time points, 1880-1900 and 1990-2000." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-208", "text": "We apply Chinese Whispers algorithm [2] to produce a set of clusters for each target word in the DT network." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-209", "text": "The Chinese Whispers clusters for the target word 'web' are shown in Figure 3 ." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-210", "text": "Note that we have reported only some of the representative words for each cluster." 
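The Chinese Whispers clustering used above can be sketched in a few lines. This is a generic implementation of Biemann's label-propagation algorithm, not the authors' exact code, and the toy graph is invented:

```python
import random
import networkx as nx

def chinese_whispers(G, iterations=20, seed=0):
    """Bottom-up graph clustering: every node starts in its own class and
    repeatedly adopts the class with the highest total edge weight among
    its neighbours (Biemann 2006)."""
    rng = random.Random(seed)
    labels = {v: i for i, v in enumerate(G)}
    nodes = list(G)
    for _ in range(iterations):
        rng.shuffle(nodes)               # visit nodes in random order
        for v in nodes:
            if not G[v]:                 # isolated node keeps its own label
                continue
            scores = {}
            for u in G[v]:
                w = G[v][u].get("weight", 1.0)
                scores[labels[u]] = scores.get(labels[u], 0.0) + w
            labels[v] = max(scores, key=scores.get)
    clusters = {}
    for v, c in labels.items():
        clusters.setdefault(c, set()).add(v)
    return list(clusters.values())

# Two loosely linked neighbourhoods of a polysemous word like 'web'.
G = nx.Graph()
G.add_edges_from([("spider", "weave"), ("spider", "silk"), ("weave", "silk"),
                  ("internet", "browser"), ("browser", "online"),
                  ("internet", "online"), ("silk", "online")])
clusters = chinese_whispers(G)
```

Running this on the neighbourhood of each target word in the DT network yields the per-word sense clusters that are then compared across time points.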
}, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-211", "text": "Each of the clusters represents a particular sense of the target." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-212", "text": "We now compare the sense clusters extracted across two different time points to obtain the suitable signals of sense change following the approach proposed in Mitra et al. [19] ." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-213", "text": "After getting the novel sense clusters, we pick 50 random samples, of which 25 cases are flagged as 'true birth' and the remaining 25 as 'false birth' by manual evaluation." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-214", "text": "We use these 50 samples as our training set for classification using SVM." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-215", "text": "Some of the examples of this training set are presented in Table 15 ." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-216", "text": "We ensure that none of the words in the list L is present in the training set." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-217", "text": "Using this training set for our proposed SVM classifier, we are successfully able to detect 80% of the cases (12 out of 15) from the list L as having novel sense." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-218", "text": "Table 16 presents all of these detected words along with the novel senses and the discriminative network feature." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-219", "text": "Our method is unable to detect three cases - 'gay', 'guy' and 'bush'." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-220", "text": "For 'gay', since there is no sense cluster in the older time period with 'gay' being a noun, cluster comparison does not even detect the 'birth' cluster of 'gay'." 
}, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-221", "text": "The 'birth' sense clusters for 'guy', 'bush' in the new time period, as detected by split-join algorithm contain general terms like \"someone, anyone, men, woman, mother, son\" and \"cloud, air, sky, sunlight\" respectively." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-222", "text": "As the network around these words did not change much over time, our method found it difficult to detect." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-223", "text": "Note that even though COHA corpus is substantially smaller than the Google n-grams corpus, our approach produces promising result, showing the usability of the proposed method with not so large corpus as well." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-224", "text": "----------------------------------" }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-225", "text": "**CONCLUSION**" }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-226", "text": "In this paper, we showed how complex network theory can help improving the performance of otherwise challenging task of novel sense detection." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-227", "text": "This is the first attempt to apply concepts borrowed from complex network theory to deal with the problem of novel sense detection." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-228", "text": "We demonstrated how the change in the network properties of the induced subgraphs from a sense cluster can be used to improve the precision of novel sense detection significantly." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-229", "text": "Manual evaluation over two different time point pairs shows that the proposed SVM classification approach boosts the precision values from 0.23-0.32 to 0.74-0.86." 
}, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-230", "text": "Finally, from the experiments on the COHA corpus, we have also shown that our approach can reliably detect the words known to have sense shifts." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-231", "text": "We have made the human annotated data used in our experiments publicly available which could be used as gold standard dataset to validate systems built for novel sense detection 6 ." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-232", "text": "This framework of precise novel sense detection of a word can be used by lexicographers as well as historical linguistics to design new dictionaries as well as updating the existing sense repositories like WordNet where candidate new senses can be semi-automatically detected and included, thus greatly reducing the otherwise required manual effort." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-233", "text": "Computational methods based on large diachronic corpora are considered to have huge potential to put a light on interesting language evolution phenomenon which can be useful for etymologists as well." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-234", "text": "In future, we plan to apply our methodology to different genres of corpus, like social network data, several product or movie reviews data which are becoming increasingly popular source for opinion tracking, to identify short-term changes in word senses or usages." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-235", "text": "These analyses would also provide insights on the evolution of language in a short span of time." }, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-236", "text": "Apart from that, we plan to extend our work to detect the dying senses of words; the senses which were used in the older texts, but are not being used in newer time anymore." 
}, { "sent_id": "8a1d4802c170fa8a71504533437e8f-C001-237", "text": "Our ultimate goal is to prepare a generalized framework for accurate detection of sense change across languages and investigate the triggering factors behind language evolution as well." } ], "y": { "@BACK@": { "gold_contexts": [ [ "8a1d4802c170fa8a71504533437e8f-C001-24", "8a1d4802c170fa8a71504533437e8f-C001-25", "8a1d4802c170fa8a71504533437e8f-C001-26", "8a1d4802c170fa8a71504533437e8f-C001-27" ], [ "8a1d4802c170fa8a71504533437e8f-C001-78", "8a1d4802c170fa8a71504533437e8f-C001-79" ], [ "8a1d4802c170fa8a71504533437e8f-C001-88" ], [ "8a1d4802c170fa8a71504533437e8f-C001-90" ] ], "cite_sentences": [ "8a1d4802c170fa8a71504533437e8f-C001-25", "8a1d4802c170fa8a71504533437e8f-C001-26", "8a1d4802c170fa8a71504533437e8f-C001-27", "8a1d4802c170fa8a71504533437e8f-C001-79", "8a1d4802c170fa8a71504533437e8f-C001-88", "8a1d4802c170fa8a71504533437e8f-C001-90" ] }, "@EXT@": { "gold_contexts": [ [ "8a1d4802c170fa8a71504533437e8f-C001-30" ], [ "8a1d4802c170fa8a71504533437e8f-C001-40" ], [ "8a1d4802c170fa8a71504533437e8f-C001-139" ] ], "cite_sentences": [ "8a1d4802c170fa8a71504533437e8f-C001-30", "8a1d4802c170fa8a71504533437e8f-C001-40", "8a1d4802c170fa8a71504533437e8f-C001-139" ] }, "@USE@": { "gold_contexts": [ [ "8a1d4802c170fa8a71504533437e8f-C001-34" ], [ "8a1d4802c170fa8a71504533437e8f-C001-135" ], [ "8a1d4802c170fa8a71504533437e8f-C001-148" ], [ "8a1d4802c170fa8a71504533437e8f-C001-154" ], [ "8a1d4802c170fa8a71504533437e8f-C001-157" ], [ "8a1d4802c170fa8a71504533437e8f-C001-159" ], [ "8a1d4802c170fa8a71504533437e8f-C001-163" ], [ "8a1d4802c170fa8a71504533437e8f-C001-212" ] ], "cite_sentences": [ "8a1d4802c170fa8a71504533437e8f-C001-34", "8a1d4802c170fa8a71504533437e8f-C001-135", "8a1d4802c170fa8a71504533437e8f-C001-148", "8a1d4802c170fa8a71504533437e8f-C001-154", "8a1d4802c170fa8a71504533437e8f-C001-157", "8a1d4802c170fa8a71504533437e8f-C001-159", "8a1d4802c170fa8a71504533437e8f-C001-163", 
"8a1d4802c170fa8a71504533437e8f-C001-212" ] }, "@MOT@": { "gold_contexts": [ [ "8a1d4802c170fa8a71504533437e8f-C001-78", "8a1d4802c170fa8a71504533437e8f-C001-79", "8a1d4802c170fa8a71504533437e8f-C001-80", "8a1d4802c170fa8a71504533437e8f-C001-81" ], [ "8a1d4802c170fa8a71504533437e8f-C001-187", "8a1d4802c170fa8a71504533437e8f-C001-188", "8a1d4802c170fa8a71504533437e8f-C001-189" ] ], "cite_sentences": [ "8a1d4802c170fa8a71504533437e8f-C001-79", "8a1d4802c170fa8a71504533437e8f-C001-80", "8a1d4802c170fa8a71504533437e8f-C001-81", "8a1d4802c170fa8a71504533437e8f-C001-187", "8a1d4802c170fa8a71504533437e8f-C001-189" ] }, "@DIF@": { "gold_contexts": [ [ "8a1d4802c170fa8a71504533437e8f-C001-153" ], [ "8a1d4802c170fa8a71504533437e8f-C001-158" ], [ "8a1d4802c170fa8a71504533437e8f-C001-182" ], [ "8a1d4802c170fa8a71504533437e8f-C001-187" ], [ "8a1d4802c170fa8a71504533437e8f-C001-220", "8a1d4802c170fa8a71504533437e8f-C001-221", "8a1d4802c170fa8a71504533437e8f-C001-222" ] ], "cite_sentences": [ "8a1d4802c170fa8a71504533437e8f-C001-153", "8a1d4802c170fa8a71504533437e8f-C001-158", "8a1d4802c170fa8a71504533437e8f-C001-182", "8a1d4802c170fa8a71504533437e8f-C001-187", "8a1d4802c170fa8a71504533437e8f-C001-221" ] }, "@SIM@": { "gold_contexts": [ [ "8a1d4802c170fa8a71504533437e8f-C001-155" ] ], "cite_sentences": [ "8a1d4802c170fa8a71504533437e8f-C001-155" ] }, "@FUT@": { "gold_contexts": [ [ "8a1d4802c170fa8a71504533437e8f-C001-187", "8a1d4802c170fa8a71504533437e8f-C001-188", "8a1d4802c170fa8a71504533437e8f-C001-189" ] ], "cite_sentences": [ "8a1d4802c170fa8a71504533437e8f-C001-187", "8a1d4802c170fa8a71504533437e8f-C001-189" ] } } }, "ABC_88aca1aa7ab73cde8492adc0e7a059_5": { "x": [ { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-2", "text": "In this paper, we present a model for improved discriminative semantic parsing." 
}, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-3", "text": "The model addresses an important limitation associated with our previous stateof-the-art discriminative semantic parsing model -the relaxed hybrid tree model by introducing our constrained semantic forests." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-4", "text": "We show that our model is able to yield new state-of-the-art results on standard datasets even with simpler features." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-5", "text": "Our system is available for download from http://statnlp.org/research/sp/." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-6", "text": "----------------------------------" }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-8", "text": "This paper addresses the problem of parsing natural language sentences into their corresponding semantic representations in the form of formal logical representations." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-9", "text": "Such a task is also known as semantic parsing (Kate and Mooney, 2006; Wong and Mooney, 2007; Lu et al., 2008; Kwiatkowski et al., 2010) ." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-10", "text": "One state-of-the-art model for semantic parsing is our recently introduced relaxed hybrid tree model (Lu, 2014) , which performs integrated lexicon acquisition and semantic parsing within a single framework utilizing efficient algorithms for training and inference." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-11", "text": "The model allows natural language phrases to be recursively mapped to semantic units, where certain long-distance dependencies can be captured." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-12", "text": "It relies on representations called relaxed hybrid trees that can jointly represent both the sentences and semantics." 
}, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-13", "text": "The model is essentially discriminative, and allows rich features to be incorporated." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-14", "text": "Unfortunately, the relaxed hybrid tree model has an important limitation: it essentially does not allow certain sentence-semantics pairs to be jointly encoded using the proposed relaxed hybrid tree representations." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-15", "text": "Thus, the model is unable to identify joint representations for certain sentence-semantics pairs during the training process, and is unable to produce desired outputs for certain inputs during the evaluation process." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-16", "text": "In this work, we propose a solution addressing the above limitation, which makes our model more robust." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-17", "text": "Through experiments, we demonstrate that our improved discriminative model for semantic parsing, even when simpler features are used, is able to obtain new state-of-the-art results on standard datasets." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-18", "text": "----------------------------------" }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-19", "text": "**RELATED WORK**" }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-20", "text": "Semantic parsing has recently attracted a significant amount of attention in the community." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-21", "text": "In this section, we provide a relatively brief discussion of prior work in semantic parsing." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-22", "text": "The hybrid tree model (Lu et al., 2008) and the Bayesian tree transducer based model (Jones et al., 2012) are generative frameworks, which essentially assume natural language and semantics are jointly generated from an underlying generative process." 
}, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-23", "text": "Such models are efficient, but are limited in their predictive power due to the simple independence assumptions made." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-24", "text": "On the other hand, discriminative models are able to exploit arbitrary features and are usually able to give better results." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-25", "text": "Examples of such models include the WASP system (Wong and Mooney, 2006) which regards the semantic parsing problem as a statistical machine translation problem, the UBL system (Kwiatkowski et al., 2010) which performs CCG-based semantic parsing using a log-linear model, as well as the relaxed hybrid tree model (Lu, 2014) which extends the generative hybrid tree model." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-26", "text": "This extension results in a discriminative model that incorporates rich features and allows long-distance dependencies to be captured." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-27", "text": "The relaxed hybrid tree model has achieved the state-of-the-art results on standard benchmark datasets across different languages." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-28", "text": "Performing semantic parsing under other forms of supervision is also possible." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-29", "text": "Clarke et al. (2010) proposed a model that learns a semantic parser for answering questions without relying on semantic annotations." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-30", "text": "Goldwasser et al. (2011) presented a confidence-driven approach to semantic parsing based on self-training." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-31", "text": "Liang et al. (2013) introduced semantic parsers based on dependency based semantics (DCS) that map sentences into their denotations." 
}, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-32", "text": "In this work, we focus on parsing sentences into their formal semantic representations." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-33", "text": "----------------------------------" }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-34", "text": "**RELAXED HYBRID TREES**" }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-35", "text": "We briefly discuss our previously proposed relaxed hybrid tree model (Lu, 2014) in this section." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-36", "text": "The model is a discriminative semantic parsing model which extends the generative hybrid tree model (Lu et al., 2008) ." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-37", "text": "Both systems are publicly available 1 ." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-38", "text": "Let us use m to denote a complete semantic representation, n to denote a complete natural language sentence, and h to denote a complete latent structure that jointly represents both m and n. The model defines the conditional probability for observing a (m, h) pair for a given natural language sentence n using a log-linear approach:" }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-39", "text": "where \u039b is the set of parameters (weights of features) used by the model." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-40", "text": "its corresponding semantic representation." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-41", "text": "Typically, to limit the space of latent structures, certain assumptions have to be made to h. In our work, we assume that h must be from a space consisting of relaxed hybrid tree structures (Lu, 2014) ." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-42", "text": "The relaxed hybrid trees are analogous to the hybrid trees, which was earlier introduced as a generative framework." 
}, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-43", "text": "One major distinction between these two types of representations is that the relaxed hybrid tree representations are able to capture unbounded long-distance dependencies in a principled way." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-44", "text": "Such dependencies were unable to be captured by hybrid tree representations largely due to their generative settings." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-45", "text": "Figure 1 gives an example of a hybrid tree and a relaxed hybrid tree representation encoding the sentence w 1 w 2 w 3 w 4 w 5 w 6 w 7 w 8 w 9 w 10 and the se-" }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-46", "text": "In the hybrid tree structure, each word is strictly associated with a semantic unit." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-47", "text": "For example the word w 3 is associated with the semantic unit m b ." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-48", "text": "In the relaxed hybrid tree, however, each word is not only directly associated with exactly one semantic unit m, but also indirectly associated with all other semantic units that are predecessors of m. For example, the word w 3 now is directly associated with m b , but is also indirectly associated with m a ." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-49", "text": "These indirect associations allow the longdistance dependencies to be captured." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-50", "text": "Both the hybrid tree and relaxed hybrid tree models define patterns at each level of their latent structure which specify how the words and child semantic units are organized at each level." 
}, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-51", "text": "For example, within the semantic unit m a , we have a pattern wXw which states that we first have words that are directly associated with m a , followed by some words covered by its first child semantic unit, then another sequence of words directly associated with m a ." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-52", "text": "----------------------------------" }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-53", "text": "**LIMITATIONS**" }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-54", "text": "One important difference between the hybrid tree representations and the relaxed hybrid tree representations is the exclusion of the pattern X in the latter." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-55", "text": "This ensured relaxed hybrid trees with an infinite number of nodes were not considered (Lu, 2014) when computing the denominator term of Equation 1." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-56", "text": "In relaxed hybrid tree, H(n, m) was implemented as a packed forest representation for exponentially many possible relaxed hybrid trees where pattern X was excluded." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-57", "text": "By allowing pattern X, we allow certain semantic units with no natural language word counter- part to exist in the joint relaxed hybrid tree representation." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-58", "text": "This may lead to possible relaxed hybrid tree representations consisting of an infinite number of internal nodes (semantic units), as seen in Figure 3 (b) ." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-59", "text": "When pattern X is allowed, both m a and m b are not directly associated with any natural language word, so we are able to further insert arbitrarily many (compatible) semantic units between the two units m a and m b while the resulting relaxed hybrid tree remains valid." 
}, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-60", "text": "Therefore we can construct a relaxed hybrid tree representation that contains the given natural language sentence w 1 w 2 with an infinite number of nodes." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-61", "text": "This issue essentially prevents us from computing the denominator term of Equation 1 since it involves an infinite number of possible m and h ." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-62", "text": "To eliminate relaxed hybrid trees consisting of an infinite number of nodes, pattern X is disallowed in the relaxed hybrid trees model (Lu, 2014) ." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-63", "text": "However, disallowing pattern X has led to other issues." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-64", "text": "Specifically, for certain semanticssentence pairs, it is not possible to find relaxed hybrid trees that jointly represent them." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-65", "text": "In the example semantics-sentence pair given in Figure 3 (a) , it is not possible to find any relaxed hybrid tree that contains both the sentence and the semantics since each semantic unit which takes one argument must be associated with at least one word." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-66", "text": "On the other hand, it is still possible to find a hybrid tree representation for both the sentence and the semantics where pattern X is allowed (see Figure 3 (c) )." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-67", "text": "In practice, we can alleviate this issue by extending the lengths of the sentences." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-88", "text": "Our experiments were conducted on the publicly available multilingual GeoQuery dataset." 
}, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-68", "text": "For example, we can append the special beginning-of-sentence symbol s and end-of-sentence symbol /s to all sentences to increase their lengths, allowing the relaxed hybrid trees to be constructed for certain sentence-semantics pairs with short sentences." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-69", "text": "However, such an approach does not resolve the theoretical limitation of the model." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-70", "text": "----------------------------------" }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-71", "text": "**CONSTRAINED SEMANTIC FORESTS**" }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-72", "text": "To address this limitation, we allow pattern X to be included when building our new discriminative semantic parsing model." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-73", "text": "However, as mentioned above, doing so will lead to latent structures (relaxed hybrid tree representations) of infinite heights." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-74", "text": "To resolve such an issue, we instead add an additional constraint -limiting the height of a semantic representation to a fixed constant c, where c is larger than the maximum height of all the trees appearing in the training set." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-75", "text": "Table 1 summarizes the list of patterns that our model considers." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-76", "text": "This is essentially the same as those considered by the hybrid tree model." 
}, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-77", "text": "Our new objective function is as follows:" }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-78", "text": "where M refers to the set of all possible semantic trees whose heights are less than or equal to c, and H (n, m ) refers to the set of possible relaxed hybrid tree representations where the pattern X is allowed." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-79", "text": "The main challenge now becomes the computation of the denominator term in Equation 2, as the set M is still very large." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-80", "text": "To properly handle all such semantic trees in an efficient way, we introduce a constrained semantic forest (CSF) representation of M here." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-81", "text": "Such a constrained semantic forest is a packed forest representation of exponentially many possible unique semantic trees, where we set the height of the forest to c. By contrast, it was not possible in our previous relaxed hybrid tree model to introduce such a compact representation over all possible semantic trees." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-82", "text": "In our previous model's implementation, we directly constructed for each sentence n a different compact representation over all possible relaxed hybrid trees containing n." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-83", "text": "Setting the maximum height to c effectively guarantees that all semantic trees contained in the constrained semantic forest have a height no greater than c. We then constructed the (exponentially many) relaxed hybrid tree representations based on the constrained semantic forest M and each input sentence n. We used a single packed forest representation to represent all such relaxed hybrid tree representations." 
}, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-84", "text": "This allows the computation of the denominator to be performed efficiently using similar dynamic programming algorithms described in (Lu, 2014) ." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-85", "text": "Optimization of the model parameters were done by using L-BFGS (Liu and Nocedal, 1989) , where the gradients were computed efficiently using an analogous dynamic programming algorithm." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-86", "text": "----------------------------------" }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-87", "text": "**EXPERIMENTS**" }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-89", "text": "Various previous works on semantic parsing used this dataset for evaluations (Wong and Mooney, 2006; Kate and Mooney, 2006; Lu et al., 2008; Jones et al., 2012) ." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-90", "text": "The dataset consists of 880 natural language sentences where each sentence is coupled with a formal tree-structured semantic representation." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-91", "text": "The early version of this dataset was annotated with English only (Wong and Mooney, 2006; Kate and Mooney, 2006) , and Jones et al." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-92", "text": "(2012) released a version that is annotated with three additional languages: German, Greek and Thai." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-93", "text": "To make our system directly comparable to previous works, we used the same train/test split used in those works (Jones et al., 2012; Lu, 2014) for evaluation." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-94", "text": "We also followed the standard approach for evaluating the correctness of an output semantic representation from our system." 
}, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-95", "text": "Specifically, we used a standard script to construct Prolog queries based on the outputs, and used the queries to retrieve answers from the GeoQuery database." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-96", "text": "Following previous works, we regarded an output semantic representation as correct if and only if it returned the same answers as the gold standard (Jones et al., 2012; Lu, 2014) ." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-97", "text": "The results of our system as well as those of several previous systems are given in Table 2 ." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-98", "text": "We compared our system's performance against those of several previous works." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-99", "text": "The WASP system (Wong and Mooney, 2006 ) is based on statistical machine translation technique while the HY-BRIDTREE+ system (Lu et al., 2008 ) is based on the generative hybrid tree model augmented with a discriminative re-ranking stage where certain global features are used." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-100", "text": "UBL-S (Kwiatkowski et al., 2010 ) is a CCG-based semantic parsing system." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-101", "text": "TREETRANS (Jones et al., 2012) is the system based on tree transducers." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-102", "text": "RHT (Lu, 2014) is the discriminative semantic parsing system based on relaxed hybrid trees." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-103", "text": "In practice, we set c (the maximum height of a semantic representation) to 20 in our experi- ments, which we determined based on the heights of the semantic trees that appear in the training data." 
}, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-104", "text": "Results showed that our system consistently yielded higher results than all the previous systems, including our state-of-the-art relaxed hybrid tree system (the full model, when all the features are used), in terms of both accuracy score and F 1 -measure." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-105", "text": "We would like to highlight two potential advantages of our new model over the old RHT model." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-106", "text": "First, our model is able to handle certain sentence-semantics pairs which could not be handled by RHT during both training and evaluation as discussed in Section 3.1." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-107", "text": "Second, our model considers the additional pattern X and therefore has the capability to capture more accurate dependencies between the words and semantic units." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-108", "text": "We note that in our experiments we used a small subset of the features used by our relaxed hybrid tree work." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-109", "text": "Specifically, we did not use any long-distance features, and also did not use any character-level features." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-110", "text": "As we have mentioned in (Lu, 2014) , although the RHT model is able to capture unbounded long-distance dependencies, for certain languages such as German such longdistance features appeared to be detrimental to the performance of the system (Lu, 2014, Table 4 )." 
}, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-111", "text": "Here in this work, we only used simple unigram features (concatenation of a semantic unit and an individual word that appears directly below that unit in the joint representation), pattern features (concatenation of a semantic unit and the pattern below that unit) as well as transition features (concatenation of two semantic units that form a parent-child relationship) described in (Lu, 2014) ." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-112", "text": "While additional features could potentially lead to better results, using simpler features would make our model more compact and more interpretable." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-113", "text": "We summarized in Table 3 the number of features used in both the previous RHT system and our system across four different languages." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-114", "text": "It can be seen that our system only required about 2-3% of the Table 3 : Number of features involved for both the RHT system and our new system using constrained semantic forests, across four different languages." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-115", "text": "features used in the previous system." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-116", "text": "We also note that the training time for our model is longer than that of the relaxed hybrid tree model since the space for H (n, m ) is now much larger than the space for H(n, m )." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-117", "text": "In practice, to make the overall training process faster, we implemented a parallel version of the original RHT algorithm." 
}, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-118", "text": "----------------------------------" }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-119", "text": "**CONCLUSION**" }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-120", "text": "In this work, we presented an improved discriminative approach to semantic parsing." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-121", "text": "Our approach does not have the theoretical limitation associated with our previous state-of-the-art approach." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-122", "text": "We demonstrated through experiments that our new model was able to yield new stateof-the-art results on a standard dataset across four different languages, even though simpler features were used." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-123", "text": "Since our new model involves simpler features, including unigram features defined over individual semantic unit -word pairs, we believe our new model would aid the joint modeling of both distributional and logical semantics (Lewis and Steedman, 2013 ) within a single framework." }, { "sent_id": "88aca1aa7ab73cde8492adc0e7a059-C001-124", "text": "We plan to explore this avenue in the future." 
} ], "y": { "@MOT@": { "gold_contexts": [ [ "88aca1aa7ab73cde8492adc0e7a059-C001-2", "88aca1aa7ab73cde8492adc0e7a059-C001-3" ], [ "88aca1aa7ab73cde8492adc0e7a059-C001-10", "88aca1aa7ab73cde8492adc0e7a059-C001-14", "88aca1aa7ab73cde8492adc0e7a059-C001-15", "88aca1aa7ab73cde8492adc0e7a059-C001-16" ], [ "88aca1aa7ab73cde8492adc0e7a059-C001-56", "88aca1aa7ab73cde8492adc0e7a059-C001-57" ], [ "88aca1aa7ab73cde8492adc0e7a059-C001-59", "88aca1aa7ab73cde8492adc0e7a059-C001-60" ], [ "88aca1aa7ab73cde8492adc0e7a059-C001-61", "88aca1aa7ab73cde8492adc0e7a059-C001-62" ], [ "88aca1aa7ab73cde8492adc0e7a059-C001-72", "88aca1aa7ab73cde8492adc0e7a059-C001-73" ], [ "88aca1aa7ab73cde8492adc0e7a059-C001-77", "88aca1aa7ab73cde8492adc0e7a059-C001-78" ] ], "cite_sentences": [ "88aca1aa7ab73cde8492adc0e7a059-C001-3", "88aca1aa7ab73cde8492adc0e7a059-C001-10", "88aca1aa7ab73cde8492adc0e7a059-C001-14", "88aca1aa7ab73cde8492adc0e7a059-C001-15", "88aca1aa7ab73cde8492adc0e7a059-C001-56", "88aca1aa7ab73cde8492adc0e7a059-C001-57", "88aca1aa7ab73cde8492adc0e7a059-C001-59", "88aca1aa7ab73cde8492adc0e7a059-C001-60", "88aca1aa7ab73cde8492adc0e7a059-C001-62", "88aca1aa7ab73cde8492adc0e7a059-C001-73", "88aca1aa7ab73cde8492adc0e7a059-C001-78" ] }, "@DIF@": { "gold_contexts": [ [ "88aca1aa7ab73cde8492adc0e7a059-C001-2", "88aca1aa7ab73cde8492adc0e7a059-C001-3" ], [ "88aca1aa7ab73cde8492adc0e7a059-C001-54", "88aca1aa7ab73cde8492adc0e7a059-C001-55" ], [ "88aca1aa7ab73cde8492adc0e7a059-C001-56", "88aca1aa7ab73cde8492adc0e7a059-C001-57" ], [ "88aca1aa7ab73cde8492adc0e7a059-C001-81" ], [ "88aca1aa7ab73cde8492adc0e7a059-C001-104" ], [ "88aca1aa7ab73cde8492adc0e7a059-C001-105", "88aca1aa7ab73cde8492adc0e7a059-C001-106" ], [ "88aca1aa7ab73cde8492adc0e7a059-C001-110", "88aca1aa7ab73cde8492adc0e7a059-C001-111" ], [ "88aca1aa7ab73cde8492adc0e7a059-C001-113", "88aca1aa7ab73cde8492adc0e7a059-C001-114" ], [ "88aca1aa7ab73cde8492adc0e7a059-C001-116" ] ], "cite_sentences": [ "88aca1aa7ab73cde8492adc0e7a059-C001-3", 
"88aca1aa7ab73cde8492adc0e7a059-C001-54", "88aca1aa7ab73cde8492adc0e7a059-C001-55", "88aca1aa7ab73cde8492adc0e7a059-C001-56", "88aca1aa7ab73cde8492adc0e7a059-C001-57", "88aca1aa7ab73cde8492adc0e7a059-C001-81", "88aca1aa7ab73cde8492adc0e7a059-C001-104", "88aca1aa7ab73cde8492adc0e7a059-C001-105", "88aca1aa7ab73cde8492adc0e7a059-C001-106", "88aca1aa7ab73cde8492adc0e7a059-C001-110", "88aca1aa7ab73cde8492adc0e7a059-C001-111", "88aca1aa7ab73cde8492adc0e7a059-C001-113", "88aca1aa7ab73cde8492adc0e7a059-C001-114", "88aca1aa7ab73cde8492adc0e7a059-C001-116" ] }, "@BACK@": { "gold_contexts": [ [ "88aca1aa7ab73cde8492adc0e7a059-C001-10", "88aca1aa7ab73cde8492adc0e7a059-C001-11", "88aca1aa7ab73cde8492adc0e7a059-C001-12", "88aca1aa7ab73cde8492adc0e7a059-C001-13" ], [ "88aca1aa7ab73cde8492adc0e7a059-C001-25" ], [ "88aca1aa7ab73cde8492adc0e7a059-C001-27" ], [ "88aca1aa7ab73cde8492adc0e7a059-C001-35", "88aca1aa7ab73cde8492adc0e7a059-C001-36", "88aca1aa7ab73cde8492adc0e7a059-C001-38" ], [ "88aca1aa7ab73cde8492adc0e7a059-C001-41", "88aca1aa7ab73cde8492adc0e7a059-C001-42", "88aca1aa7ab73cde8492adc0e7a059-C001-43" ], [ "88aca1aa7ab73cde8492adc0e7a059-C001-45" ], [ "88aca1aa7ab73cde8492adc0e7a059-C001-48", "88aca1aa7ab73cde8492adc0e7a059-C001-50" ], [ "88aca1aa7ab73cde8492adc0e7a059-C001-102" ] ], "cite_sentences": [ "88aca1aa7ab73cde8492adc0e7a059-C001-10", "88aca1aa7ab73cde8492adc0e7a059-C001-11", "88aca1aa7ab73cde8492adc0e7a059-C001-12", "88aca1aa7ab73cde8492adc0e7a059-C001-13", "88aca1aa7ab73cde8492adc0e7a059-C001-25", "88aca1aa7ab73cde8492adc0e7a059-C001-27", "88aca1aa7ab73cde8492adc0e7a059-C001-35", "88aca1aa7ab73cde8492adc0e7a059-C001-36", "88aca1aa7ab73cde8492adc0e7a059-C001-38", "88aca1aa7ab73cde8492adc0e7a059-C001-41", "88aca1aa7ab73cde8492adc0e7a059-C001-42", "88aca1aa7ab73cde8492adc0e7a059-C001-43", "88aca1aa7ab73cde8492adc0e7a059-C001-45", "88aca1aa7ab73cde8492adc0e7a059-C001-48", "88aca1aa7ab73cde8492adc0e7a059-C001-50", "88aca1aa7ab73cde8492adc0e7a059-C001-102" ] }, 
"@UNSURE@": { "gold_contexts": [ [ "88aca1aa7ab73cde8492adc0e7a059-C001-58" ], [ "88aca1aa7ab73cde8492adc0e7a059-C001-64", "88aca1aa7ab73cde8492adc0e7a059-C001-65" ], [ "88aca1aa7ab73cde8492adc0e7a059-C001-68", "88aca1aa7ab73cde8492adc0e7a059-C001-69" ] ], "cite_sentences": [ "88aca1aa7ab73cde8492adc0e7a059-C001-58", "88aca1aa7ab73cde8492adc0e7a059-C001-64", "88aca1aa7ab73cde8492adc0e7a059-C001-65", "88aca1aa7ab73cde8492adc0e7a059-C001-68", "88aca1aa7ab73cde8492adc0e7a059-C001-69" ] }, "@USE@": { "gold_contexts": [ [ "88aca1aa7ab73cde8492adc0e7a059-C001-77", "88aca1aa7ab73cde8492adc0e7a059-C001-78" ], [ "88aca1aa7ab73cde8492adc0e7a059-C001-93" ], [ "88aca1aa7ab73cde8492adc0e7a059-C001-96" ], [ "88aca1aa7ab73cde8492adc0e7a059-C001-108" ], [ "88aca1aa7ab73cde8492adc0e7a059-C001-117" ] ], "cite_sentences": [ "88aca1aa7ab73cde8492adc0e7a059-C001-78", "88aca1aa7ab73cde8492adc0e7a059-C001-93", "88aca1aa7ab73cde8492adc0e7a059-C001-96", "88aca1aa7ab73cde8492adc0e7a059-C001-108", "88aca1aa7ab73cde8492adc0e7a059-C001-117" ] }, "@SIM@": { "gold_contexts": [ [ "88aca1aa7ab73cde8492adc0e7a059-C001-83", "88aca1aa7ab73cde8492adc0e7a059-C001-84" ] ], "cite_sentences": [ "88aca1aa7ab73cde8492adc0e7a059-C001-83", "88aca1aa7ab73cde8492adc0e7a059-C001-84" ] }, "@EXT@": { "gold_contexts": [ [ "88aca1aa7ab73cde8492adc0e7a059-C001-110", "88aca1aa7ab73cde8492adc0e7a059-C001-111" ] ], "cite_sentences": [ "88aca1aa7ab73cde8492adc0e7a059-C001-110", "88aca1aa7ab73cde8492adc0e7a059-C001-111" ] } } }, "ABC_4b0aab00a99547791bff0597aabc06_5": { "x": [ { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-2", "text": "We study how different frame annotations complement one another when learning continuous lexical semantics." 
}, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-3", "text": "We learn the representations from a tensorized skip-gram model that consistently encodes syntactic-semantic content better, with multiple 10% gains over baselines." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-4", "text": "----------------------------------" }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-5", "text": "**INTRODUCTION**" }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-6", "text": "Consider \"Bill\" in Fig. 1 : what is his involvement with the words \"would try,\" and what does this involvement mean? Word embeddings represent such meaning as points in a real-valued vector space (Deerwester et al., 1990; Mikolov et al., 2013) ." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-7", "text": "These representations are often learned by exploiting the frequency that the word cooccurs with contexts, often within a user-defined window (Harris, 1954; Turney and Pantel, 2010) ." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-8", "text": "When built from large-scale sources, like Wikipedia or web crawls, embeddings capture general characteristics of words and allow for robust downstream applications (Kim, 2014; Das et al., 2015) ." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-9", "text": "Frame semantics generalize word meanings to that of analyzing structured and interconnected labeled \"concepts\" and abstractions (Minsky, 1974; Fillmore, 1976 Fillmore, , 1982 ." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-10", "text": "These concepts, or roles, implicitly encode expected properties of that word." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-11", "text": "In a frame semantic analysis of Fig. 1 , the segment \"would try\" triggers the ATTEMPT frame, filling the expected roles AGENT and GOAL with \"Bill\" and \"the same tactic,\" respectively." 
}, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-12", "text": "While frame semantics provide a structured form for analyzing words with crisp, categorically-labeled concepts, the encoded properties and expectations are implicit. What does it mean to fill a frame's role?" }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-13", "text": "Semantic proto-role (SPR) theory, motivated by Dowty (1991) 's thematic proto-role theory, offers an answer to this." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-14", "text": "SPR replaces categorical roles ATTEMPT She said Bill would try the same tactic again." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-15", "text": "We are interested in capturing these SPR-based properties and expectations within word embeddings." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-16", "text": "We present a method that learns frameenriched embeddings from millions of documents that have been semantically parsed with multiple different frame analyzers (Ferraro et al., 2014) ." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-17", "text": "Our method leverages Cotterell et al. (2017) 's formulation of Mikolov et al. (2013) 's popular skip-gram model as exponential family principal component analysis (EPCA) and tensor factorization." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-18", "text": "This paper's primary contributions are: (i) enriching learned word embeddings with multiple, automatically obtained frames from large, disparate corpora; and (ii) demonstrating these enriched embeddings better capture SPR-based properties." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-19", "text": "In so doing, we also generalize Cotterell et al.'s method to arbitrary tensor dimensions." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-20", "text": "This allows us to include an arbitrary amount of semantic information when learning embeddings." 
}, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-21", "text": "Our variable-size tensor factorization code is available at https://github.com/" }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-22", "text": "----------------------------------" }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-23", "text": "**FRAME SEMANTICS AND PROTO-ROLES**" }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-24", "text": "Frame semantics currently used in NLP have a rich history in linguistic literature." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-25", "text": "Fillmore (1976) 's frames are based on a word's context and prototypical concepts that an individual word evokes; they intend to represent the meaning of lexical items by mapping words to real world concepts and shared experiences." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-26", "text": "Frame-based semantics have inspired many semantic annotation schemata and datasets, such as FrameNet (Baker et al., 1998) , PropBank (Palmer et al., 2005) , and Verbnet (Schuler, 2005) , as well as composite resources (Hovy et al., 2006; Palmer, 2009; Banarescu et al., 2012) ." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-27", "text": "1 Thematic Roles and Proto Roles These resources map words to their meanings through discrete/categorically labeled frames and roles; sometimes, as in FrameNet, the roles can be very descriptive (e.g., the DEGREE role for the AF-FIRM OR DENY frame), while in other cases, as in PropBank, the roles can be quite general (e.g., ARG0)." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-28", "text": "Regardless of the actual schema, the roles are based on thematic roles, which map a predicate's arguments to a semantic representation that makes various semantic distinctions among the arguments (Dowty, 1989) ." 
}, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-29", "text": "2 Dowty (1991) claims that thematic role distinctions are not atomic, i.e., they can be deconstructed and analyzed at a lower level." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-30", "text": "Instead of many discrete thematic roles, Dowty (1991) argues for proto-thematic roles, e.g. PROTO-AGENT rather than AGENT, where distinctions in proto-roles are based on clusterings of logical entailments." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-31", "text": "That is, PROTO-AGENTs often have certain properties in common, e.g., manipulating other objects or willingly participating in an action; PROTO-PATIENTs are often changed or affected by some action." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-32", "text": "By decomposing the meaning of roles into properties or expectations that can be reasoned about, proto-roles can be seen as including a form of vector representation within structured frame semantics." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-33", "text": "----------------------------------" }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-34", "text": "**CONTINUOUS LEXICAL SEMANTICS**" }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-35", "text": "Word embeddings represent word meanings as elements of a (real-valued) vector space (Deerwester et al., 1990) ." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-36", "text": "Mikolov et al. (2013) 's word2vec methods-skip-gram (SG) and continuous bag of words (CBOW)-repopularized these methods." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-37", "text": "We focus on SG, which predicts the context i around a word j, with learned representations c i and w j , respectively, as p(context i | word j) \u221d exp (c i w j ) = exp (1 (c i w j )) , where is the Hadamard (pointwise) product." 
}, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-38", "text": "Traditionally, the context words i are those words within a small window of j and are trained with negative sampling (Goldberg and Levy, 2014) ." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-39", "text": "----------------------------------" }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-40", "text": "**SKIP-GRAM AS MATRIX FACTORIZATION**" }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-41", "text": "Levy and Goldberg (2014b), and subsequently Keerthi et al. (2015) , showed how vectors learned under SG with the negative sampling are, under certain conditions, the factorization of (shifted) positive pointwise mutual information." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-42", "text": "Cotterell et al. (2017) showed that SG is a form of exponential family PCA that factorizes the matrix of word/context cooccurrence counts (rather than shifted positive PMI values)." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-43", "text": "With this interpretation, they generalize SG from matrix to tensor factorization, and provide a theoretical basis for modeling higher-order SG (or additional context, such as morphological features of words) within a word embeddings framework." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-44", "text": "Specifically, Cotterell et al. recast higher-order SG as maximizing the log-likelihood" }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-45", "text": "where X ijk is a cooccurrence count 3-tensor of words j, surrounding contexts i, and features k." 
}, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-46", "text": "----------------------------------" }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-47", "text": "**SKIP-GRAM AS N-TENSOR FACTORIZATION**" }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-48", "text": "When factorizing an n-dimensional tensor to include an arbitrary number of L annotations, we replace feature k in Equation (1) and a k in Equation (2) with each annotation type l and vector \u03b1 l included." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-49", "text": "X i,j,k becomes X i,j,l 1 ,...l L , representing the number of times word j appeared in context i with features l 1 through l L ." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-50", "text": "We maximize" }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-51", "text": "----------------------------------" }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-52", "text": "**EXPERIMENTS**" }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-53", "text": "Our end goal is to use multiple kinds of automatically obtained, \"in-the-wild\" frame semantic parses in order to improve the semantic content-specifically SPR-type informationwithin learned lexical embeddings." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-54", "text": "We utilize majority portions of the Concretely Annotated New York Times and Wikipedia corpora from Ferraro et al. (2014) ." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-55", "text": "These have been annotated with three frame semantic parses: FrameNet from Das et al. (2010) , and both FrameNet and PropBank from Wolfe et al. (2016) ." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-56", "text": "In total, we use nearly five million frame-annotated documents." 
}, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-57", "text": "Extracting Counts The baseline extraction we consider is a standard sliding window: for each word w j seen \u2265 T times, extract all words w i two to the left and right of w j ." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-58", "text": "These counts, forming a matrix, are then used within standard word2vec." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-59", "text": "We also follow Cotterell et al. (2017) and augment the above with the signed number of tokens separating w i and w j , e.g., recording that w i appeared two to the left of w j ; these counts form a 3-tensor." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-60", "text": "To turn semantic parses into tensor counts, we first identify relevant information from the parses." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-61", "text": "We consider all parses that are triggered by the target word w j (seen \u2265 T times) and that have at least one role filled by some word in the sentence." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-62", "text": "We organize the extraction around roles and what fills them." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-63", "text": "We extract every word w r that fills all possible triggered frames; each of those frame and role labels; and the distance between filler w r and trigger w j ." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-64", "text": "This process yields a 9-tensor X. 3 Although we always treat the trigger as the \"original\" word (e.g., word j, with vector w j ), later we consider (1) what to include from X, (2) what to predict (what to treat as the \"context\" word i), and (3) what to treat as auxiliary features." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-65", "text": "Data Discussion The baseline extraction methods result in roughly symmetric target and surrounding word counts." 
}, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-66", "text": "This is not the case for the frame extraction." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-67", "text": "Our target words must trigger some semantic parse, so our target words are actually target triggers." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-68", "text": "However, the surrounding context words are those words that fill semantic roles." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-69", "text": "As shown in Table 1 , there are an order-of-magnitude fewer triggers than target words, but up to an order-of-magnitude more surrounding words." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-70", "text": "Implementation We generalize Levy and Goldberg (2014a) to enable any arbitrary dimensional tensor factorization, as described in \u00a73.2." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-71", "text": "We learn 100-dimensional embeddings for words that appear at least 100 times from 15 negative samples." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-72", "text": "4 The implementation is available at https://github." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-73", "text": "com/fmof/tensor-factorization." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-74", "text": "Metric We evaluate our learned (trigger) embeddings w via QVEC (Tsvetkov et al., 2015) ." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-75", "text": "QVEC uses canonical correlation analysis to measure the Pearson correlation between w and a collection of oracle lexical vectors o. These oracle vectors are derived from a human-annotated resource." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-76", "text": "For QVEC, higher is better: a higher score indicates w more closely correlates (positively) with o. Evaluating Semantic Content with SPR Motivated by Dowty (1991)'s proto-role theory, Reisinger et al. (2015) , with a subsequent expansion by White et al. 
(2016) , annotated thousands of predicate-argument pairs (v, a) with (boolean) applicability and (ordinal) likelihoods of well-motivated semantic properties applying to/being true of a. These likelihood judgments, under the SPR framework, are converted from a five-point Likert scale to a 1-5 interval scale." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-77", "text": "Because the predicate-argument pairs were extracted from previously annotated dependency trees, we link each property with the dependency relation joining v and a when forming the oracle vectors; each component of an oracle vector o_v is the unity-normalized sum of likelihood judgments for joint property and grammatical relation, using the interval responses when the property is applicable and discarding non-applicable properties, i.e. treating the response as 0." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-78", "text": "(In preliminary experiments, the occurrence threshold did not change the overall conclusions.) Thus, the combined 20 properties of Reisinger et al. (2015) and White et al. (2016) -together with the four basic grammatical" }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-79", "text": "(Fig. 2 caption: the dashed line represents the 3-tensor baseline of Cotterell et al. (2017) ." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-80", "text": "Each row represents an ablation model: sep means the prediction relies on the token separation distance between the frame and role filler, fn-frame means the prediction uses FrameNet frames, fn-role means the prediction uses FrameNet roles, and filler means the prediction uses the tokens filling the frame role." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-81", "text": "Read from top to bottom, additional contextual features are denoted with a +." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-82", "text": "Note when filler is used, we only predict PropBank roles.)"
}, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-83", "text": "relations nsubj, dobj, iobj and nsubjpass-result in 80-dimensional oracle vectors." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-84", "text": "6 Predict Fillers or Roles?" }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-85", "text": "Since SPR judgments are between predicates and arguments, we predict the words filling the roles, and treat all other frame information as auxiliary features." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-86", "text": "SPR annotations were originally based off of (gold-standard) PropBank annotations, so we also train a model to predict PropBank frames and roles, thereby treating role-filling text and all other frame information as auxiliary features." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-87", "text": "In early experiments, we found it beneficial to treat the FrameNet annotations additively and not distinguish one system's output from another." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-88", "text": "Treating the annotations additively serves as a type of collapsing operation." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-89", "text": "Although X started as a 9-tensor, we only consider up to 6-tensors: trigger, role filler, token separation between the trigger and filler, PropBank frame and role, FrameNet frame, and FrameNet role." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-90", "text": "Results Fig. 2 shows the overall percent change for SPR-QVEC from the filler and role prediction models, on newswire (Fig. 2a) and Wikipedia (Fig. 2b) , across different ablation models." 
}, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-91", "text": "We indicate additional contextual features being used with a +: sep uses the token separation distance between the frame and role filler, fn-frame uses FrameNet frames, fn-role uses FrameNet roles, filler uses the tokens filling the frame role, and none indicates no additional information is used when predicting." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-92", "text": "The 0 line represents a plain word2vec baseline and the dashed line represents the 3-tensor baseline of Cotterell et al. (2017) ." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-93", "text": "Both of these baselines are windowed: they are restricted to a local context and cannot take advantage of frames or any lexical signal that can be derived from frames." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-94", "text": "Overall, we notice that we obtain large improvements from models trained on lexical signals that have been derived from frame output (sep and none), even if the model itself does not incorporate any frame labels." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-95", "text": "The embeddings that predict the role filling lexical items (the green triangles) correlate higher with SPR oracles than the embeddings that predict PropBank frames and roles (red circles)." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-96", "text": "Examining Fig. 2a , we see that both model types outperform both the word2vec and Cotterell et al. (2017) baselines in nearly all model configurations and ablations." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-97", "text": "We see the highest improvement when predicting role fillers given the frame trigger and the number of tokens separating the two (the green triangles in the sep rows)." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-98", "text": "Comparing Fig. 2a to Fig. 2b , we see newswire is more amenable to predicting PropBank frames and roles." 
}, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-99", "text": "We posit this is a type of out-ofdomain error, as the PropBank parser was trained on newswire." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-100", "text": "We also find that newswire is overall more amenable to incorporating limited framebased features, particularly when predicting PropBank using lexical role fillers as part of the con- textual features." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-101", "text": "We hypothesize this is due to the significantly increased vocabulary size of the Wikipedia role fillers (c.f., Tab. 1)." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-102", "text": "Note, however, that by using all available schema information when predicting PropBank, we are able to compensate for the increased vocabulary." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-103", "text": "In Fig. 3 we display the ten nearest neighbors for three randomly sampled trigger words according to two of the highest performing newswire models." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-104", "text": "They each condition on the trigger and the role filler/trigger separation; these correspond to the sep rows of Fig. 2a ." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-105", "text": "The left column of Fig. 3 predicts the role filler, while the right column predicts PropBank annotations." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-106", "text": "We see that while both models learn inflectional relations, this quality is prominent in the model that predicts PropBank information while the model predicting role fillers learns more non-inflectional paraphrases." 
}, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-107", "text": "----------------------------------" }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-108", "text": "**RELATED WORK**" }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-109", "text": "The recent popularity of word embeddings have inspired others to consider leveraging linguistic annotations and resources to learn embeddings." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-110", "text": "Both Cotterell et al. (2017) and Levy and Goldberg (2014a) incorporate additional syntactic and morphological information in their word embeddings." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-111", "text": "Rothe and Sch\u00fctze (2015) 's use lexical resource entries, such as WordNet synsets, to improve pre-computed word embeddings." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-112", "text": "Through generalized CCA, Rastogi et al. (2015) incorporate paraphrased FrameNet training data." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-113", "text": "On the applied side, Wang and Yang (2015) used frame embeddings-produced by training word2vec on tweet-derived semantic frame (names)-as additional features in downstream prediction." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-114", "text": "Teichert et al. (2017) similarly explored the relationship between semantic frames and thematic proto-roles." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-115", "text": "They proposed using a Conditional Random Field (Lafferty et al., 2001) to jointly and conditionally model SPR and SRL." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-116", "text": "Teichert et al. (2017) demonstrated slight improvements in jointly and conditionally predicting PropBank (Bonial et al., 2013) 's semantic role labels and Reisinger et al. (2015) 's proto-role labels." 
}, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-117", "text": "----------------------------------" }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-118", "text": "**CONCLUSION**" }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-119", "text": "We presented a way to learn embeddings enriched with multiple, automatically obtained frames from large, disparate corpora." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-120", "text": "We also presented a QVEC evaluation for semantic proto-roles." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-121", "text": "As demonstrated by our experiments, our extension of Cotterell et al. (2017) 's tensor factorization enriches word embeddings by including syntacticsemantic information not often captured, resulting in consistently higher SPR-based correlations." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-122", "text": "The implementation is available at https: //github.com/fmof/tensor-factorization." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-123", "text": "----------------------------------" }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-124", "text": "**ACKNOWLEDGMENTS**" }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-125", "text": "This work was supported by Johns Hopkins University, the Human Language Technology Center of Excellence (HLTCOE), DARPA DEFT, and DARPA LORELEI." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-126", "text": "We would also like to thank three anonymous reviewers for their feedback." }, { "sent_id": "4b0aab00a99547791bff0597aabc06-C001-127", "text": "The views and conclusions contained in this publication are those of the authors and should not be interpreted as representing official policies or endorsements of DARPA or the U.S. Government." 
} ], "y": { "@USE@": { "gold_contexts": [ [ "4b0aab00a99547791bff0597aabc06-C001-17" ], [ "4b0aab00a99547791bff0597aabc06-C001-59" ], [ "4b0aab00a99547791bff0597aabc06-C001-92", "4b0aab00a99547791bff0597aabc06-C001-93" ], [ "4b0aab00a99547791bff0597aabc06-C001-122" ] ], "cite_sentences": [ "4b0aab00a99547791bff0597aabc06-C001-17", "4b0aab00a99547791bff0597aabc06-C001-59", "4b0aab00a99547791bff0597aabc06-C001-92", "4b0aab00a99547791bff0597aabc06-C001-93", "4b0aab00a99547791bff0597aabc06-C001-122" ] }, "@EXT@": { "gold_contexts": [ [ "4b0aab00a99547791bff0597aabc06-C001-19" ], [ "4b0aab00a99547791bff0597aabc06-C001-121" ] ], "cite_sentences": [ "4b0aab00a99547791bff0597aabc06-C001-19", "4b0aab00a99547791bff0597aabc06-C001-121" ] }, "@BACK@": { "gold_contexts": [ [ "4b0aab00a99547791bff0597aabc06-C001-42", "4b0aab00a99547791bff0597aabc06-C001-43", "4b0aab00a99547791bff0597aabc06-C001-44" ], [ "4b0aab00a99547791bff0597aabc06-C001-110" ] ], "cite_sentences": [ "4b0aab00a99547791bff0597aabc06-C001-42", "4b0aab00a99547791bff0597aabc06-C001-43", "4b0aab00a99547791bff0597aabc06-C001-44", "4b0aab00a99547791bff0597aabc06-C001-110" ] }, "@DIF@": { "gold_contexts": [ [ "4b0aab00a99547791bff0597aabc06-C001-96" ] ], "cite_sentences": [ "4b0aab00a99547791bff0597aabc06-C001-96" ] }, "@MOT@": { "gold_contexts": [ [ "4b0aab00a99547791bff0597aabc06-C001-109", "4b0aab00a99547791bff0597aabc06-C001-110" ] ], "cite_sentences": [ "4b0aab00a99547791bff0597aabc06-C001-110" ] } } }, "ABC_d5144370a9361ff870dd3cb2e064ff_5": { "x": [ { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-2", "text": "I assess the extent to which the recently introduced BERT model captures English syntactic phenomena, using (1) naturally-occurring subject-verb agreement stimuli; (2) \"coloreless green ideas\" subject-verb agreement stimuli, in which content words in natural sentences are randomly replaced with words sharing the same 
part-of-speech and inflection; and (3) manually crafted stimuli for subject-verb agreement and reflexive anaphora phenomena." }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-3", "text": "The BERT model performs remarkably well on all cases." }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-4", "text": "----------------------------------" }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-5", "text": "**INTRODUCTION**" }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-6", "text": "The recently introduced BERT model (Devlin et al., 2018) exhibits strong performance on several language understanding benchmarks." }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-7", "text": "To what extent does it capture syntax-sensitive structures?" }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-8", "text": "Recent work examines the extent to which RNN-based models capture syntax-sensitive phenomena that are traditionally taken as evidence for the existence of hierarchical structure." }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-9", "text": "In particular, in (Linzen et al., 2016) we assess the ability of LSTMs to learn subject-verb agreement patterns in English, and evaluate on naturally occurring wikipedia sentences." }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-10", "text": "(Gulordava et al., 2018) also consider subject-verb agreement, but in a \"colorless green ideas\" setting in which content words in naturally occurring sentences are replaced with random words with the same part-of-speech and inflection, thus ensuring a focus on syntax rather than on selectional-preference-based cues." }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-11", "text": "Marvin and Linzen (2018) consider a wider range of syntactic phenomena (subject-verb agreement, reflexive anaphora, negative polarity items) using manually constructed stimuli, allowing for greater coverage and control than in the naturally occurring setting."
}, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-12", "text": "The BERT model is based on the \"Transformer\" architecture (Vaswani et al., 2017) , which-in contrast to RNNs-relies purely on attention mechanisms, and does not have an explicit notion of word order beyond marking each word with its absolute-position embedding." }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-13", "text": "This reliance on attention may lead one 1 to expect decreased performance on syntax-sensitive tasks compared to RNN (LSTM) models that do model word order directly, and explicitly track states across the sentence." }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-14", "text": "Indeed, Tran et al. (2018) finds that transformerbased models perform worse than LSTM models on the Linzen et al. (2016) agreement prediction dataset." }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-15", "text": "In contrast, (Tang et al., 2018) find that self-attention performs on par with LSTM for syntax sensitive dependencies in the context of machine-translation, and performance on syntactic tasks is correlated with the number of attention heads in multi-head attention." }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-16", "text": "I adapt the evaluation protocol and stimuli of Linzen et al. (2016) , Gulordava et al. (2018) and Marvin and Linzen (2018) to the bidirectional setting required by BERT, and evaluate the pretrained BERT models (both the LARGE and the BASE models)." }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-17", "text": "Surprisingly (at least to me), the out-of-the-box models (without any task-specific fine-tuning) perform very well on all the syntactic tasks." 
}, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-18", "text": "----------------------------------" }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-19", "text": "**METHODOLOGY**" }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-20", "text": "I use the stimuli provided by (Linzen et al., 2016; Gulordava et al., 2018; Marvin and Linzen, 2018) , but change the experimental protocol to adapt it to the bidirectional nature of the BERT model." }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-21", "text": "This requires discarding some of the stimuli, as described below." }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-22", "text": "Thus, the numbers are not strictly comparable to those reported in previous work." }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-23", "text": "----------------------------------" }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-24", "text": "**PREVIOUS SETUPS**" }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-25", "text": "All three previous work use uni-directional language-model-like models." }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-26", "text": "Linzen et al. (2016) start with existing sentences from wikipedia that contain a present-tense verb." }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-27", "text": "They feed each sentence word by word into an LSTM, stop right before the focus verb, and ask the model to predict a binary plural/singular decision (supervised setup) or compare the probability assigned by a pre-trained language model (LM) to the plural vs singular forms of the verb (LM setup)." }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-28", "text": "2 The evaluation is then performed on sentences with \"agreement attractors\" in which at there is at least one noun between the verb and its subject, and all of the nouns between the verb and subject are of the opposite number from the subject." }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-29", "text": "Gulordava et al. 
(2018) also start with existing sentences." }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-30", "text": "However, in order to control for the possibility of the model learning to rely on \"semantic\" selectional-preference cues rather than syntactic ones, they replace each content word with random words from the same part-of-speech and inflection." }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-31", "text": "This results in \"colorless green ideas\" nonce sentences." }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-32", "text": "The evaluation is then performed similarly to the LM setup of Linzen et al. (2016) : the sentence is fed into a pretrained LSTM LM up to the focus verb, and the model is considered correct if the probability assigned to the correct inflection of the original verb form given the prefix is larger than that assigned to the incorrect inflection." }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-33", "text": "Marvin and Linzen (2018) focus on manually constructed and controlled stimuli, which also emphasize linguistic structure over selectional preferences." }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-34", "text": "They construct minimal pairs of grammatical and ungrammatical sentences, feed each one in its entirety into a pre-trained LSTM-LM, and compare the perplexity assigned by the model to the grammatical and ungrammatical sentences." }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-35", "text": "The model is \"correct\" if it assigns the grammatical sentence a higher probability than the ungrammatical one." }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-36", "text": "Since the minimal pairs for most phenomena differ only in a single word (the focus verb), this scoring is very similar to the one used in the two previous works."
}, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-37", "text": "However, it does consider the continuation of the sentence after the fo-cus verb, and also allows for assessing phenomena that require change into two or more words (like negative polarity items)." }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-38", "text": "----------------------------------" }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-39", "text": "**ADAPTATION TO THE BERT MODEL**" }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-40", "text": "In contrast to these works, the BERT model is bi-directional: it is trained to predict the identity of masked words based on both the prefix and suffix surrounding these words." }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-41", "text": "I adapt the unidirectional setup by feeding into BERT the complete sentence, while masking out the single focus verb." }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-42", "text": "I then ask BERT for its word predictions for the masked position, and compare the score 3 assigned to the original correct verb to the score assigned to the incorrect one." }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-43", "text": "For example, for the sentence:" }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-44", "text": "a 2002 systemic review of herbal products found that several herbs , including peppermint and caraway , have anti-dyspeptic effects for non-ulcer dyspepsia with \" encouraging safety profiles \" ." }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-45", "text": "(from (Linzen et al., 2016)) I feed into BERT:" }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-46", "text": "[CLS] a 2002 systemic review of herbal products found that several herbs , including peppermint and caraway , [MASK] anti-dyspeptic effects for non-ulcer dyspepsia with '' encouraging safety profiles '' . and look for the score assigned to the words have and has at the masked position." 
}, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-47", "text": "Similarly, for the pair the game that the guard hates is bad ." }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-48", "text": "the game that the guard hates are bad ." }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-49", "text": "(from (Marvin and Linzen, 2018) ), I feed into BERT:" }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-50", "text": "[CLS] the game that the guard hates [MASK] bad ." }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-51", "text": "and compare the scores predicted for is and are." }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-52", "text": "This differs from Linzen et al. (2016) and Gulordava et al. (2018) by considering the entire sentence (excluding the verb) and not just its prefix leading to the verb, and differs from Marvin and Linzen (2018) by conditioning the focus verb on bidirectional context." }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-53", "text": "I use the PyTorch implementation of BERT, with the pre-trained models supplied by Google." }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-54", "text": "4 I experiment with the bert-large-uncased and bert-base-uncased models." }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-55", "text": "Discarded Material The bi-directional setup precludes using using the NPI stimuli of Marvin and Linzen (2018) , in which the minimal pair differs in two words position, which I discard from the evaluation." }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-56", "text": "I also discard the agreement cases involving the verbs is or are in Linzen et al. (2016) and in Gulordava et al. (2018) , because some of them are copular construction, in which strong agreement hints can be found also on the object following the verb." 
}, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-57", "text": "5 This is not an issue in the manually constructed (Marvin and Linzen, 2018 ) stimuli due to the patterns they chose." }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-58", "text": "Finally, I discard stimuli in which the focus verb or its plural/singular inflection does not appear as a single word in the BERT wordpiece-based vocabulary (and hence cannot be predicted by the model)." }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-59", "text": "This include discarding Marvin and Linzen (2018) stimuli involving the words swims or admires, resulting in 23,368 discarded pairs (out of 152,300)." }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-60", "text": "I similarly discard 680 sentences from (Linzen et al., 2016) where the focus verb or its inflection were one of 108 out-ofvocabulary tokens, 6 and 28 sentence-pairs (8 tokens 7 ) from (Gulordava et al., 2018) ." }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-61", "text": "Limitations The BERT results are not directly comparable to the numbers reported in previous work." }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-62", "text": "Beyond the differences due to bidirectionality and the discarded stimuli, the BERT models are also trained on a different and larger corpus (covering both wikipedia and books)." }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-63", "text": "4 https://github.com/huggingface/pytorch-pretrained-BERT 5 Results are generally a bit higher when not discarding the is/are cases." 
}, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-64", "text": "6 blames, dislike, inhabit, exclude, revolves, governs, delete, composes, overlap, edits, embrace, compose, undertakes, disagrees, redirect, persist, recognise, rotates, accompanies, attach, undertake, earn, communicates, imagine, contradicts, specialize, accuses, obtain, caters, welcomes, interprets, await, communicate, templates, qualify, reverts, achieve, achieves, govern, restricts, violate, behave, emit, contend, adopt, overlaps, reproduces, rotate, defends, submit, revolve, lend, pertain, disagree, concentrate, detects, endorses, detect, predate, persists, consume, locates, earns, predict, interact, merge, consumes, behaves, locate, predates, enhances, predicts, integrates, inhabits, satisfy, contradict, swear, activate, restrict, satisfies, redirects, excludes, violates, interacts, admires, speculate, blame, drag, qualifies, activates, criticize, assures, welcome, depart, characterizes, defend, obtains, lends, strives, accuse, recognises, characterize, contends, perceive, complain, awaits 7 toss, spills, tosses, affirms, spill, melt, approves, affirm Table 2 : Results on the EN NONCE (Gulordava et al., 2018) stimuli." }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-65", "text": "While not strictly comparable, the numbers reported by Gulordava et al. (2018) for the LSTM in this condition (on All) is 74.1 \u00b1 1.6." }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-66", "text": "----------------------------------" }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-67", "text": "**REPRODUCABILITY**" }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-68", "text": "Code is available at https://github.com/yoavg/bert-syntax." 
}, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-69", "text": "----------------------------------" }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-70", "text": "**RESULTS**" }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-71", "text": "Tables 1, 2 and 3 show the results." }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-72", "text": "All cases exhibit high scores-in the vast majority of the cases substantially higher than reported in previous work." }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-73", "text": "8 As discussed above, the results are not directly comparable to previous work: the BERT models are trained on different (and larger) data, are allowed to access the suffix of the sentence in addition to its prefix, and are evaluated on somewhat different data due to discarding OOV items." }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-74", "text": "Still, taken together, the high performance numbers indicate that the purely attention-based BERT models are likely capable of capturing the same kind of syntactic regularities that LSTM-based models are capable of capturing, at least as well as the LSTM models and probably better." }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-75", "text": "Another noticeable and interesting trend is that larger is not necessarily better: the BERT-Base model outperforms the BERT-Large model on many of the syntactic conditions." }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-76", "text": "Marvin and Linzen (2018) ." }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-77", "text": "The BERT and M&L numbers are not directly comparable, as the experimental setup differs in many ways." 
}, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-78", "text": "----------------------------------" }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-79", "text": "**DISCUSSION**" }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-80", "text": "The BERT models perform remarkably well on all the syntactic test cases." }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-81", "text": "I expected the attentionbased mechanism to fail on these (compared to the LSTM-based models), and am surprised by these results." }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-82", "text": "The Gulordava et al. (2018) and Marvin and Linzen (2018) conditions rule out the possibility of overly relying on selectional preference cues or memorizing the wikipedia training data, and suggest real syntactic generalization is taking place." }, { "sent_id": "d5144370a9361ff870dd3cb2e064ff-C001-83", "text": "Exploring the extent to which deep purely-attention-based architectures such as BERT are capable of capturing hierarchy-sensitive and syntactic dependencies-as well as the mechanisms by which this is achieved-is a fascinating area for future research." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "d5144370a9361ff870dd3cb2e064ff-C001-9" ], [ "d5144370a9361ff870dd3cb2e064ff-C001-25", "d5144370a9361ff870dd3cb2e064ff-C001-26", "d5144370a9361ff870dd3cb2e064ff-C001-27" ], [ "d5144370a9361ff870dd3cb2e064ff-C001-32" ] ], "cite_sentences": [ "d5144370a9361ff870dd3cb2e064ff-C001-9", "d5144370a9361ff870dd3cb2e064ff-C001-26", "d5144370a9361ff870dd3cb2e064ff-C001-27", "d5144370a9361ff870dd3cb2e064ff-C001-32" ] }, "@DIF@": { "gold_contexts": [ [ "d5144370a9361ff870dd3cb2e064ff-C001-14" ], [ "d5144370a9361ff870dd3cb2e064ff-C001-52" ] ], "cite_sentences": [ "d5144370a9361ff870dd3cb2e064ff-C001-14", "d5144370a9361ff870dd3cb2e064ff-C001-52" ] }, "@USE@": { "gold_contexts": [ [ "d5144370a9361ff870dd3cb2e064ff-C001-14", "d5144370a9361ff870dd3cb2e064ff-C001-15", "d5144370a9361ff870dd3cb2e064ff-C001-16" ] ], "cite_sentences": [ "d5144370a9361ff870dd3cb2e064ff-C001-14", "d5144370a9361ff870dd3cb2e064ff-C001-16" ] }, "@EXT@": { "gold_contexts": [ [ "d5144370a9361ff870dd3cb2e064ff-C001-20" ], [ "d5144370a9361ff870dd3cb2e064ff-C001-56" ], [ "d5144370a9361ff870dd3cb2e064ff-C001-60" ] ], "cite_sentences": [ "d5144370a9361ff870dd3cb2e064ff-C001-20", "d5144370a9361ff870dd3cb2e064ff-C001-56", "d5144370a9361ff870dd3cb2e064ff-C001-60" ] } } }, "ABC_e5886e138ce8d84a48e44db3f3d6a1_5": { "x": [ { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-14", "text": "As far as meronymy detection is concerned, most of the attempts are pattern based (Berland and Charniak, 1999; Girju et al., 2006; Pantel and Pennacchiotti, 2006) along with some recent works exploring the possibility of using distributional semantic models (Morlane-Hond\u00e8re, 2015) ." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-15", "text": "Similarly, for co-hyponymy detection, researchers have investigated the usefulness of several distributional semantic models." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-16", "text": "One such attempt is made by Weeds et al. 
(2014) , where they proposed a supervised framework and used several vector operations as features for the classification of hypernymy and co-hyponymy." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-17", "text": "Santus et al. (2016) proposed a supervised method based on a Random Forest algorithm to learn taxonomical semantic relations and they have shown that the model performs well for co-hyponymy detection." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-18", "text": "In another attempt, Jana and Goyal (2018b) proposed various complex network measures which can be used as features to build a supervised classifier model for co-hyponymy detection, and showed improvements over other baseline approaches." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-19", "text": "Recently, with the emergence of various network representation learning methods (Perozzi et al., 2014; Tang et al., 2015; Grover and Leskovec, 2016; Ribeiro et al., 2017) , attempts have been made to convert distributional thesaurus networks into low-dimensional vector spaces." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-20", "text": "(Ferret, 2017) applies distributional thesaurus embedding for synonym extraction and expansion tasks, whereas Jana and Goyal (2018a) use it to improve the state-of-the-art performance of word similarity/relatedness tasks, the word analogy task, etc." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-21", "text": "Thus, a natural question arises as to whether network embeddings would be more effective than the handcrafted network features used by Jana and Goyal (2018b) for co-hyponymy detection."
}, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-22", "text": "Being motivated by this connection, we investigate how the information captured by network representation learning methodologies on distributional thesaurus can be used in discriminating word pairs having co-hyponymy relation from the word pairs having hypernymy, meronymy relation or any random pair of words." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-23", "text": "We use the distributional thesaurus (DT) network (Riedl and Biemann, 2013) built using Google books syntactic n-grams." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-24", "text": "As a network representation learning method, we apply node2vec (Grover and Leskovec, 2016) which is an algorithmic framework for learning continuous feature representations for nodes in networks that maximizes the likelihood of preserving network neighborhoods of nodes." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-25", "text": "Thus obtained vectors are then used as feature vectors and plugged into the classifiers according to the state-of-the-art experimental setup." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-26", "text": "Classification model: To distinguish the word pairs having co-hyponymy relation from the word pairs having hypernymy or meronymy relation, or from any random pair of words, we combine the network embeddings of the two words by concatenation (CC) and addition (ADD) operations to provide as features to train classifiers like Support Vector Machine (SVM) and Random Forest (RF)." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-27", "text": "Evaluation results: We evaluate the usefulness of DT embeddings against three benchmark datasets for cohyponymy detection (Weeds et al., 2014; Santus et al., 2016; Jana and Goyal, 2018b) , following their experimental setup." 
}, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-28", "text": "We show that the network embeddings outperform the baselines by a huge margin throughout all the experiments, except for co-hyponyms vs. random pairs, where the baselines already have very high accuracy and network embeddings are able to match the results." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-29", "text": "arXiv:2002.11506v1 [cs.CL] 24 Feb 2020" }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-30", "text": "----------------------------------" }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-31", "text": "**METHODOLOGY**" }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-32", "text": "We take the distributional thesaurus (DT) (Riedl and Biemann, 2013) constructed from the Google books syntactic n-grams data (Goldberg and Orwant, 2013) spanning from 1520 to 2008 as the underlying network where each word's neighborhood is represented by a list of top 200 words that are similar with respect to their bi-gram distribution (Riedl and Biemann, 2013) ." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-33", "text": "Figure 1 : A sample snapshot of distributional thesaurus (DT) network, where each node represents a word and the weight of edge between two nodes is defined as the number of context features that these two words share in common." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-34", "text": "Here the word 'owl' shares more context features with its co-hyponyms -'crow', 'vulture' compared to their hypernym 'animal'." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-35", "text": "The nodes in the network represent words and edges are present between a node and its top 200 similar nodes; the number of features that two nodes share in common is assigned as the weight of the edge connecting them." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-36", "text": "A snapshot of the DT is shown in Figure 1 ." 
}, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-37", "text": "We see that a target word 'owl' is connected with its co-hyponyms, 'crow' and 'vulture' via higher weighted edges, whereas the edge weights with its hypernyms like 'animal' are less." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-38", "text": "It may also happen that hypernyms of a target word are not even present in its neighborhood." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-39", "text": "For example, 'creature' is not present in the neighborhood of 'owl' but it is connected with 'crow' via less weighted edge." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-40", "text": "As per the DT network structure, distributionally similar words are present in a close proximity with similar neighborhood." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-41", "text": "According to the literature dealing with lexical relation detection, words having co-hyponymy relation are distributionally more similar than the words having hypernymy or meronymy relation or any random pair of words." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-42", "text": "This is well captured by the DT." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-43", "text": "In a recent work, Jana and Goyal (2018b) used network features extracted from the DT to detect co-hyponyms." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-44", "text": "In our approach, we attempt to use embeddings obtained through a network representation learning method such as node2vec (Grover and Leskovec, 2016) when applied over the DT network." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-45", "text": "By choosing a flexible notion of a neighborhood and applying a biased random walk procedure, which efficiently explores diverse neighborhoods, node2vec learn representations for each node that organize nodes based on their network roles and/or communities." 
}, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-46", "text": "We use the default setup of node2vec; having walk-length 80, walks per node 10, window size 10 and dimension of vector 128." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-47", "text": "In order to do a qualitative analysis of the obtained vectors, we plot some sample words using t-SNE (Maaten and Hinton, 2008) in Figure 2 ." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-48", "text": "We observe that the relative distance between the co-hyponymy pairs is much smaller than those having hypernymy relations or meronymy relations for the DT embeddings." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-49", "text": "For instance, the cohyponyms of 'owl' like 'crow', 'vulture', 'sparrow' are close to each other whereas hypernyms of 'owl' like 'animal', 'vertebrate', 'creature', as well as meronyms of 'owl' like 'claw','feather', are at distant positions." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-50", "text": "----------------------------------" }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-51", "text": "**FIGURE 2:**" }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-52", "text": "t-Distributed Stochastic Neighbor (t-SNE) (Maaten and Hinton, 2008) plot of DT embedding obtained using node2vec." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-53", "text": "We aim to build a classifier that given a word pair, is able to detect whether or not they hold a co-hyponymy relation." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-54", "text": "Since we intend to explore the use of DT embeddings, we need to come up with specific ways to combine the embeddings of the word pair to be used as features for the classification." 
}, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-55", "text": "Following the literature (Weeds et al., 2014) , we investigate four operations -vector difference (DIFF), vector concatenation (CC), vector pointwise addition (ADD) and vector pointwise multiplication (MUL)." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-56", "text": "From our initial experiments, we find that CC and ADD prove to be the better combination methods overall." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-57", "text": "It is justified, as DIFF and MUL operations are somewhat intersective whereas both CC and ADD effectively come up with the union of the features in different ways and classifier fed with both shared and non-shared features has access to more information leading to better accuracy." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-58", "text": "We only report the performances for CC and ADD for Support Vector Machine (SVM) and Random Forest (RF) classifiers." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-59", "text": "----------------------------------" }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-60", "text": "**EXPERIMENTAL RESULTS AND ANALYSIS**" }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-61", "text": "We perform experiments using three benchmark datasets for co-hyponymy detection (Weeds et al., 2014; Santus et al., 2016; Jana and Goyal, 2018b) ." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-62", "text": "For each of these, we follow the same experimental setup as discussed by the authors and compare our method with the method proposed by the author as well as the state-of-the-art models by Jana and Goyal (2018b) ." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-63", "text": "We perform the analysis of three datasets to investigate the extent of overlap present in these publicly available benchmark datasets and find out that 45.7% word pairs of dataset prepared by Weeds et al. 
(2014) are present in the ROOT9 dataset prepared by Santus et al. (2016) ." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-64", "text": "This intersection set comprises 27.8% of the ROOT9 dataset." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-65", "text": "Similarly, 36.7% of the word pairs in the dataset prepared by Weeds et al. (2014) are present in the whole dataset prepared by Jana and Goyal (2018b) ." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-66", "text": "This intersection set comprises 44.9% of the dataset prepared by Jana and Goyal (2018b) ." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-67", "text": "----------------------------------" }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-68", "text": "**BASELINE MODEL**" }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-69", "text": "Baseline model descriptions: svmDIFF, a linear SVM trained on the vector difference; svmMULT, a linear SVM trained on the pointwise product vector; svmADD, a linear SVM trained on the vector sum; svmCAT, a linear SVM trained on the vector concatenation; svmSING, a linear SVM trained on the vector of the second word in the given word pair; knnDIFF, k nearest neighbours (knn) trained on the vector difference; cosineP," }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-70", "text": "the relation between a word pair holds if the cosine similarity of the word vectors is greater than some threshold p; linP," }, { "sent_id": 
"e5886e138ce8d84a48e44db3f3d6a1-C001-73", "text": "**METHOD**" }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-74", "text": "----------------------------------" }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-75", "text": "**CO-HYP VS RANDOM**" }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-76", "text": "Co-Hyp vs Hyper (Santus et al., 2016) 97.8 95.7 (Jana and Goyal, 2018b) 99 Table 4 : Percentage F1 scores on a ten-fold cross validation of our models along with the best models described in (Santus et al., 2016) and (Jana and Goyal, 2018b) for ROOT9 dataset a set of baseline methodologies, the descriptions of which are presented in Table 1 ." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-77", "text": "Following the same experimental setup, we report the accuracy measure for ten-fold cross validation and compare our models with the baselines in proposed by Weeds et al. (2014) ." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-78", "text": "Table 2 represents the performance of all the baseline models proposed by Weeds et al. (2014) ." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-79", "text": "In Table 3 we show the performance of the best supervised model (svmDIFF) and the best semi-supervised model (cosineP) proposed by Weeds et al. (2014) along with our models." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-80", "text": "Here, the best model proposed by Jana and Goyal (2018b) uses SVM classifier which is fed with structural similarity of the words in the given word pair from the distributional thesaurus network." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-81", "text": "We see that all the 4 proposed methods perform at par or better than the baselines, and using RF CC gives a 15.4% improvement over the best results reported." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-82", "text": "3.2. Experiment-2 (Santus et al., 2016) In the second experiment, we use ROOT9 dataset prepared by Santus et al. 
(2016) , containing 9,600 labeled pairs extracted from three datasets: EVALution (Santus et al., 2015) , Lenci/Benotto (?) and BLESS (Baroni and Lenci, 2011) ." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-83", "text": "There is an even distribution of the three classes (hypernyms, co-hyponyms and random) in the dataset." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-84", "text": "Following the same experimental setup as (Santus et al., 2016) , we report percentage F1 scores on a ten-fold cross validation for binary classification of co-hyponyms vs random pairs, as well as co-hyponyms vs. hypernyms, using both SVM and Random Forest classifiers." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-85", "text": "Table 4 represents the performance comparison of our models with the best state-of-the-art models reported in (Santus et al., 2016) and (Jana and Goyal, 2018b) ." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-86", "text": "Here, the best model proposed by Santus et al. (2016) uses a Random Forest classifier which is fed with nine corpus-based features like frequency of words, co-occurrence frequency etc., and the best model proposed by Jana and Goyal (2018b) uses a Random Forest classifier which is fed with five complex network features like structural similarity, shortest path etc." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-87", "text": "computed from the distributional thesaurus network." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-88", "text": "The results in Table 4 show that, for the binary classification task of co-hyponymy vs random pairs, we achieve a percentage F1 score of 99.0 with RF CC, which is at par with the state-of-the-art models." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-89", "text": "More importantly, both RF CC and RF ADD beat the baselines by significant margins for the classification task of co-hyponymy vs hypernymy pairs." 
}, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-90", "text": "Table 5 : Accuracy scores on a ten-fold cross validation of models (svmSS, rfALL) proposed by Jana and Goyal (2018b) and our models for the dataset prepared by Jana and Goyal (2018b) ." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-91", "text": "----------------------------------" }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-92", "text": "**MODEL**" }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-93", "text": "----------------------------------" }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-94", "text": "**EXPERIMENT-3 (JANA AND GOYAL, 2018B)**" }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-95", "text": "In the third experiment we use the dataset specifically build for co-hyponymy detection in one of the recent works by Jana and Goyal (2018b) ." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-96", "text": "This dataset is extracted from BLESS (Baroni and Lenci, 2011) and divided into three small datasets-Co-Hypo vs Hyper, Co-Hypo vs Mero, Co-Hypo Vs Random." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-97", "text": "Each of these datasets are balanced, containing 1,000 co-hyponymy pairs and 1,000 pairs for the other class." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-98", "text": "Following the same setup, we report accuracy scores for ten-fold cross validation for each of these three datasets of our models along with the best models (svmSS, rfALL) reported by Jana and Goyal (2018b) in Table 5 ." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-99", "text": "Jana and Goyal (2018b) use SVM classifier with structural similarity between words in a word pair as feature to obtain svmSS and use Random Forest classifier with five complex network measures computed from distributional thesaurus network as features to obtain rfALL." 
}, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-100", "text": "From the results presented in Table 5 , RF CC proves to be the best among our proposed models which performs at par with the baselines for Co-Hypo vs Random dataset." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-101", "text": "Interestingly, it beats the baselines comprehensively for Co-Hypo vs Mero and Co-Hypo vs Hyper datasets, providing improvements of 9.88% and 25.64%, respectively." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-102", "text": "----------------------------------" }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-103", "text": "**ERROR ANALYSIS**" }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-104", "text": "We further analyze the cases for which our model produces wrong prediction." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-105", "text": "We point out some example word pairs such as 'screw -screwdriver', 'gorilla -orangutan' from cohyponym BLESS dataset which our model wrongly flags as 'false'." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-106", "text": "We observe a drastic difference in frequency between the words in these words pairs in the corpus from which the DT was constructed; for example 'screw' appears 592,857 times whereas 'screwdriver' has a frequency of 29,748; similarly 'gorilla' has a frequency of 40,212 whereas 'orangutan' has 3,567." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-107", "text": "In the DT network, edge weight depends on the overlap between top 1000 context features, and a drastic frequency difference might not capture this well." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-108", "text": "On the other hand, there are examples like 'potato -peel', 'jacket -zipper' which our model wrongly flags as 'true' co-hyponyms." 
}, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-109", "text": "We observe that the corpus does not contain many co-hyponyms of 'peel' or 'zipper', and thus their neighborhood in the DT network contains words like 'ginger, lemon, onion, garlic' and 'pant, skirt, coat, jeans' which are co-hyponyms of 'potato' and 'jacket', respectively." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-110", "text": "This leads to the false signal by the approach." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-111", "text": "----------------------------------" }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-112", "text": "**CONCLUSION**" }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-113", "text": "In this paper, we have investigated how the distributional thesaurus embeddings obtained using network representation learning can help improve the otherwise difficult task of discriminating co-hyponym pairs from hypernym, meronym and random pairs." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-114", "text": "By extensive experiments, we have shown that while the proposed models are at par with the baselines for detecting co-hyponyms vs. random pairs, they outperform the state-of-the-art models by a huge margin for the binary classification of co-hyponyms vs. hypernyms, as well as co-hyponyms vs. meronyms." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-115", "text": "It clearly shows that network representations can be very effectively utilized for a focused task like relation extraction." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-116", "text": "All the datasets, DT embeddings and codes (with instructions) used in our experiments are made publicly available 1 ." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-117", "text": "The next immediate step is to try out DT embedding to build unsupervised model for co-hyponymy detection." 
}, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-118", "text": "In future, we plan to investigate some more sophisticated network representation learning techniques like path embedding, community embedding techniques to embed the path joining the given pair of words or the subgraph induced by the given pair of words etc. and apply it on distributional thesaurus network for robust detection of lexical relations." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-119", "text": "In this study, our focus has been distinguishing a horizontal relation, co-hyponymy, from parent-child relations like hypernymy and meronymy." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-120", "text": "However, the investigation on discriminating two analogous sibling relations, co-hyponymy and co-meronymy using the proposed method would be one of the interesting future direction." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-121", "text": "Finally, our broad objective is to build a general supervised and unsupervised framework based on complex network theory to detect different lexical relations from a given a corpus with high accuracy." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-2", "text": "Discriminating lexical relations among distributionally similar words has always been a challenge for natural language processing (NLP) community." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-3", "text": "In this paper, we investigate whether the network embedding of distributional thesaurus can be effectively utilized to detect co-hyponymy relations." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-4", "text": "By extensive experiments over three benchmark datasets, we show that the vector representation obtained by applying node2vec on distributional thesaurus outperforms the state-of-the-art models for binary classification of co-hyponymy vs. 
hypernymy, as well as co-hyponymy vs. meronymy, by huge margins." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-5", "text": "----------------------------------" }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-7", "text": "Distributional semantic models are used in a wide variety of tasks like sentiment analysis, word sense disambiguation, predicting semantic compositionality, etc." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-8", "text": "Automatic detection of lexical relations is one such fundamental task which can be leveraged in applications like paraphrasing, ontology building, metaphor detection etc." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-9", "text": "Both supervised and unsupervised methods have been proposed by the researchers to identify lexical relations like hypernymy, co-hyponymy, meronymy etc." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-10", "text": "over the years." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-11", "text": "Recent attempts to solve this task deal with proposing similarity measures based on distributional semantic models (Roller et al., 2014; Weeds et al., 2014; Santus et al., 2016; Shwartz et al., 2017; Roller and Erk, 2016) ." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-12", "text": "For hypernymy detection, several works use distributional inclusion hypothesis (Geffet and Dagan, 2005) , entropy-based distributional measure (Santus et al., 2014) as well as several embedding schemes (Fu et al., 2014; Yu et al., 2015; Nguyen et al., 2017) ." }, { "sent_id": "e5886e138ce8d84a48e44db3f3d6a1-C001-13", "text": "Image generality for lexical entailment detection (Kiela et al., 2015) has also been tried out for the same purpose." 
} ], "y": { "@MOT@": { "gold_contexts": [ [ "e5886e138ce8d84a48e44db3f3d6a1-C001-18" ], [ "e5886e138ce8d84a48e44db3f3d6a1-C001-21" ] ], "cite_sentences": [ "e5886e138ce8d84a48e44db3f3d6a1-C001-18", "e5886e138ce8d84a48e44db3f3d6a1-C001-21" ] }, "@USE@": { "gold_contexts": [ [ "e5886e138ce8d84a48e44db3f3d6a1-C001-27" ], [ "e5886e138ce8d84a48e44db3f3d6a1-C001-43" ], [ "e5886e138ce8d84a48e44db3f3d6a1-C001-61" ], [ "e5886e138ce8d84a48e44db3f3d6a1-C001-71" ], [ "e5886e138ce8d84a48e44db3f3d6a1-C001-76" ], [ "e5886e138ce8d84a48e44db3f3d6a1-C001-85" ], [ "e5886e138ce8d84a48e44db3f3d6a1-C001-86" ], [ "e5886e138ce8d84a48e44db3f3d6a1-C001-90" ], [ "e5886e138ce8d84a48e44db3f3d6a1-C001-95", "e5886e138ce8d84a48e44db3f3d6a1-C001-96" ], [ "e5886e138ce8d84a48e44db3f3d6a1-C001-98" ] ], "cite_sentences": [ "e5886e138ce8d84a48e44db3f3d6a1-C001-27", "e5886e138ce8d84a48e44db3f3d6a1-C001-43", "e5886e138ce8d84a48e44db3f3d6a1-C001-61", "e5886e138ce8d84a48e44db3f3d6a1-C001-71", "e5886e138ce8d84a48e44db3f3d6a1-C001-76", "e5886e138ce8d84a48e44db3f3d6a1-C001-85", "e5886e138ce8d84a48e44db3f3d6a1-C001-86", "e5886e138ce8d84a48e44db3f3d6a1-C001-90", "e5886e138ce8d84a48e44db3f3d6a1-C001-95", "e5886e138ce8d84a48e44db3f3d6a1-C001-96", "e5886e138ce8d84a48e44db3f3d6a1-C001-98" ] }, "@EXT@": { "gold_contexts": [ [ "e5886e138ce8d84a48e44db3f3d6a1-C001-20", "e5886e138ce8d84a48e44db3f3d6a1-C001-21", "e5886e138ce8d84a48e44db3f3d6a1-C001-22", "e5886e138ce8d84a48e44db3f3d6a1-C001-27" ], [ "e5886e138ce8d84a48e44db3f3d6a1-C001-43", "e5886e138ce8d84a48e44db3f3d6a1-C001-44" ] ], "cite_sentences": [ "e5886e138ce8d84a48e44db3f3d6a1-C001-21", "e5886e138ce8d84a48e44db3f3d6a1-C001-27", "e5886e138ce8d84a48e44db3f3d6a1-C001-43" ] }, "@DIF@": { "gold_contexts": [ [ "e5886e138ce8d84a48e44db3f3d6a1-C001-80", "e5886e138ce8d84a48e44db3f3d6a1-C001-81" ] ], "cite_sentences": [ "e5886e138ce8d84a48e44db3f3d6a1-C001-80" ] }, "@BACK@": { "gold_contexts": [ [ "e5886e138ce8d84a48e44db3f3d6a1-C001-99" ] ], "cite_sentences": [ 
"e5886e138ce8d84a48e44db3f3d6a1-C001-99" ] } } }, "ABC_bebcad79900e9a4a25020ed0d886b5_5": { "x": [ { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-2", "text": "I assess the extent to which the recently introduced BERT model captures English syntactic phenomena, using (1) naturally-occurring subject-verb agreement stimuli; (2) \"coloreless green ideas\" subject-verb agreement stimuli, in which content words in natural sentences are randomly replaced with words sharing the same part-of-speech and inflection; and (3) manually crafted stimuli for subject-verb agreement and reflexive anaphora phenomena." }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-3", "text": "The BERT model performs remarkably well on all cases." }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-4", "text": "----------------------------------" }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-5", "text": "**INTRODUCTION**" }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-6", "text": "The recently introduced BERT model (Devlin et al., 2018) exhibits strong performance on several language understanding benchmarks." }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-7", "text": "To what extent does it capture syntax-sensitive structures?" }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-8", "text": "Recent work examines the extent to which RNN-based models capture syntax-sensitive phenomena that are traditionally taken as evidence for the existence in hierarchical structure." }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-9", "text": "In particular, in (Linzen et al., 2016) we assess the ability of LSTMs to learn subject-verb agreement patterns in English, and evaluate on naturally occurring wikipedia sentences." 
}, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-10", "text": "(Gulordava et al., 2018 ) also consider subject-verb agreement, but in a \"colorless green ideas\" setting in which content words in naturally occurring sentences are replaced with random words with the same partof-speech and inflection, thus ensuring a focus on syntax rather than on selectional-preferences based cues." }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-11", "text": "Marvin and Linzen (2018) consider a wider range of syntactic phenomena (subjectverb agreement, reflexive anaphora, negative polarity items) using manually constructed stimuli, allowing for greater coverage and control than in the naturally occurring setting." }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-12", "text": "The BERT model is based on the \"Transformer\" architecture (Vaswani et al., 2017) , which-in contrast to RNNs-relies purely on attention mechanisms, and does not have an explicit notion of word order beyond marking each word with its absolute-position embedding." }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-13", "text": "This reliance on attention may lead one 1 to expect decreased performance on syntax-sensitive tasks compared to RNN (LSTM) models that do model word order directly, and explicitly track states across the sentence." }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-14", "text": "Indeed, Tran et al. (2018) finds that transformerbased models perform worse than LSTM models on the Linzen et al. (2016) agreement prediction dataset." }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-15", "text": "In contrast, (Tang et al., 2018) find that self-attention performs on par with LSTM for syntax sensitive dependencies in the context of machine-translation, and performance on syntactic tasks is correlated with the number of attention heads in multi-head attention." 
}, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-16", "text": "I adapt the evaluation protocol and stimuli of Linzen et al. (2016) , Gulordava et al. (2018) and Marvin and Linzen (2018) to the bidirectional setting required by BERT, and evaluate the pretrained BERT models (both the LARGE and the BASE models)." }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-17", "text": "Surprisingly (at least to me), the out-of-the-box models (without any task-specific fine-tuning) perform very well on all the syntactic tasks." }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-18", "text": "----------------------------------" }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-19", "text": "**METHODOLOGY**" }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-20", "text": "I use the stimuli provided by (Linzen et al., 2016; Gulordava et al., 2018; Marvin and Linzen, 2018) , but change the experimental protocol to adapt it to the bidirectional nature of the BERT model." }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-21", "text": "This requires discarding some of the stimuli, as described below." }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-22", "text": "Thus, the numbers are not strictly comparable to those reported in previous work." }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-23", "text": "----------------------------------" }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-24", "text": "**PREVIOUS SETUPS**" }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-25", "text": "All three previous work use uni-directional language-model-like models." }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-26", "text": "Linzen et al. (2016) start with existing sentences from wikipedia that contain a present-tense verb." 
}, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-27", "text": "They feed each sentence word by word into an LSTM, stop right before the focus verb, and ask the model to predict a binary plural/singular decision (supervised setup) or compare the probability assigned by a pre-trained language model (LM) to the plural vs singular forms of the verb (LM setup)." }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-28", "text": "2 The evaluation is then performed on sentences with \"agreement attractors\" in which at there is at least one noun between the verb and its subject, and all of the nouns between the verb and subject are of the opposite number from the subject." }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-29", "text": "Gulordava et al. (2018) also start with existing sentences." }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-30", "text": "However, in order to control for the possibillity of the model learning to rely on \"semantic\" selectional-preferences cues rather than syntactic ones, they replace each content word with random words from the same part-ofspeech and inflection." }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-31", "text": "This results in \"coloreless green ideas\" nonce sentences." }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-32", "text": "The evaluation is then performed similarly to the LM setup of Linzen et al. (2016) : the sentence is fed into a pretraiend LSTM LM up to the focus verb, and the model is considered correct if the probability assigned to the correct inflection of the original verb form given the prefix is larger than that assigned to the incorrect inflection." }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-33", "text": "Marvin and Linzen (2018) focus on manually constructed and controlled stimuli, that also emphasizes linguistic structure over selectional preferences." 
}, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-34", "text": "They construct minimal pairs of grammatical and ungrammatical sentences, feed each one in its entirety into a pre-trained LSTM-LM, and compare the perplexity assigned by the model to the grammatical and ungrammatical sentences." }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-35", "text": "The model is \"correct\" if it assigns the grammatical sentence a higher probability than to the ungrammatical one." }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-36", "text": "Since the minimal pairs for most phenomena differ only in a single word (the focus verb), this scoring is very similar to the one used in the two previous works." }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-37", "text": "However, it does consider the continuation of the sentence after the fo-cus verb, and also allows for assessing phenomena that require change into two or more words (like negative polarity items)." }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-38", "text": "----------------------------------" }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-39", "text": "**ADAPTATION TO THE BERT MODEL**" }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-40", "text": "In contrast to these works, the BERT model is bi-directional: it is trained to predict the identity of masked words based on both the prefix and suffix surrounding these words." }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-41", "text": "I adapt the unidirectional setup by feeding into BERT the complete sentence, while masking out the single focus verb." }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-42", "text": "I then ask BERT for its word predictions for the masked position, and compare the score 3 assigned to the original correct verb to the score assigned to the incorrect one." 
}, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-43", "text": "For example, for the sentence:" }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-44", "text": "a 2002 systemic review of herbal products found that several herbs , including peppermint and caraway , have anti-dyspeptic effects for non-ulcer dyspepsia with \" encouraging safety profiles \" ." }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-45", "text": "(from (Linzen et al., 2016)) I feed into BERT:" }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-46", "text": "[CLS] a 2002 systemic review of herbal products found that several herbs , including peppermint and caraway , [MASK] anti-dyspeptic effects for non-ulcer dyspepsia with '' encouraging safety profiles '' . and look for the score assigned to the words have and has at the masked position." }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-47", "text": "Similarly, for the pair the game that the guard hates is bad ." }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-48", "text": "the game that the guard hates are bad ." }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-49", "text": "(from (Marvin and Linzen, 2018) ), I feed into BERT:" }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-50", "text": "[CLS] the game that the guard hates [MASK] bad ." }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-51", "text": "and compare the scores predicted for is and are." }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-52", "text": "This differs from Linzen et al. (2016) and Gulordava et al. (2018) by considering the entire sentence (excluding the verb) and not just its prefix leading to the verb, and differs from Marvin and Linzen (2018) by conditioning the focus verb on bidirectional context." }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-53", "text": "I use the PyTorch implementation of BERT, with the pre-trained models supplied by Google." 
}, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-54", "text": "4 I experiment with the bert-large-uncased and bert-base-uncased models." }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-55", "text": "Discarded Material The bi-directional setup precludes using using the NPI stimuli of Marvin and Linzen (2018) , in which the minimal pair differs in two words position, which I discard from the evaluation." }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-56", "text": "I also discard the agreement cases involving the verbs is or are in Linzen et al. (2016) and in Gulordava et al. (2018) , because some of them are copular construction, in which strong agreement hints can be found also on the object following the verb." }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-57", "text": "5 This is not an issue in the manually constructed (Marvin and Linzen, 2018 ) stimuli due to the patterns they chose." }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-58", "text": "Finally, I discard stimuli in which the focus verb or its plural/singular inflection does not appear as a single word in the BERT wordpiece-based vocabulary (and hence cannot be predicted by the model)." }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-59", "text": "This include discarding Marvin and Linzen (2018) stimuli involving the words swims or admires, resulting in 23,368 discarded pairs (out of 152,300)." }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-60", "text": "I similarly discard 680 sentences from (Linzen et al., 2016) where the focus verb or its inflection were one of 108 out-ofvocabulary tokens, 6 and 28 sentence-pairs (8 tokens 7 ) from (Gulordava et al., 2018) ." }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-61", "text": "Limitations The BERT results are not directly comparable to the numbers reported in previous work." 
}, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-62", "text": "Beyond the differences due to bidirectionality and the discarded stimuli, the BERT models are also trained on a different and larger corpus (covering both wikipedia and books)." }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-63", "text": "4 https://github.com/huggingface/pytorch-pretrained-BERT 5 Results are generally a bit higher when not discarding the is/are cases." }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-64", "text": "6 blames, dislike, inhabit, exclude, revolves, governs, delete, composes, overlap, edits, embrace, compose, undertakes, disagrees, redirect, persist, recognise, rotates, accompanies, attach, undertake, earn, communicates, imagine, contradicts, specialize, accuses, obtain, caters, welcomes, interprets, await, communicate, templates, qualify, reverts, achieve, achieves, govern, restricts, violate, behave, emit, contend, adopt, overlaps, reproduces, rotate, defends, submit, revolve, lend, pertain, disagree, concentrate, detects, endorses, detect, predate, persists, consume, locates, earns, predict, interact, merge, consumes, behaves, locate, predates, enhances, predicts, integrates, inhabits, satisfy, contradict, swear, activate, restrict, satisfies, redirects, excludes, violates, interacts, admires, speculate, blame, drag, qualifies, activates, criticize, assures, welcome, depart, characterizes, defend, obtains, lends, strives, accuse, recognises, characterize, contends, perceive, complain, awaits 7 toss, spills, tosses, affirms, spill, melt, approves, affirm Table 2 : Results on the EN NONCE (Gulordava et al., 2018) stimuli." }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-65", "text": "While not strictly comparable, the numbers reported by Gulordava et al. (2018) for the LSTM in this condition (on All) is 74.1 \u00b1 1.6." 
}, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-66", "text": "----------------------------------" }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-67", "text": "**REPRODUCABILITY**" }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-68", "text": "Code is available at https://github.com/yoavg/bert-syntax." }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-69", "text": "----------------------------------" }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-70", "text": "**RESULTS**" }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-71", "text": "Tables 1, 2 and 3 show the results." }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-72", "text": "All cases exhibit high scores-in the vast majority of the cases substantially higher than reported in previous work." }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-73", "text": "8 As discussed above, the results are not directly comparable to previous work: the BERT models are trained on different (and larger) data, are allowed to access the suffix of the sentence in addition to its prefix, and are evaluated on somewhat different data due to discarding OOV items." }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-74", "text": "Still, taken together, the high performance numbers indicate that the purely attention-based BERT models are likely capable of capturing the same kind of syntactic regularities that LSTM-based models are capable of capturing, at least as well as the LSTM models and probably better." }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-75", "text": "Another noticeable and interesting trend is that larger is not necessarily better: the BERT-Base model outperforms the BERT-Large model on many of the syntactic conditions." }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-76", "text": "Marvin and Linzen (2018) ." 
}, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-77", "text": "The BERT and M&L numbers are not directly comparable, as the experimental setup differs in many ways." }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-78", "text": "----------------------------------" }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-79", "text": "**DISCUSSION**" }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-80", "text": "The BERT models perform remarkably well on all the syntactic test cases." }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-81", "text": "I expected the attentionbased mechanism to fail on these (compared to the LSTM-based models), and am surprised by these results." }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-82", "text": "The Gulordava et al. (2018) and Marvin and Linzen (2018) conditions rule out the possibility of overly relying on selectional preference cues or memorizing the wikipedia training data, and suggest real syntactic generalization is taking place." }, { "sent_id": "bebcad79900e9a4a25020ed0d886b5-C001-83", "text": "Exploring the extent to which deep purely-attention-based architectures such as BERT are capable of capturing hierarchy-sensitive and syntactic dependencies-as well as the mechanisms by which this is achieved-is a fascinating area for future research." 
} ], "y": { "@MOT@": { "gold_contexts": [ [ "bebcad79900e9a4a25020ed0d886b5-C001-2" ], [ "bebcad79900e9a4a25020ed0d886b5-C001-52", "bebcad79900e9a4a25020ed0d886b5-C001-53" ] ], "cite_sentences": [ "bebcad79900e9a4a25020ed0d886b5-C001-2", "bebcad79900e9a4a25020ed0d886b5-C001-52" ] }, "@DIF@": { "gold_contexts": [ [ "bebcad79900e9a4a25020ed0d886b5-C001-10", "bebcad79900e9a4a25020ed0d886b5-C001-9" ], [ "bebcad79900e9a4a25020ed0d886b5-C001-56" ] ], "cite_sentences": [ "bebcad79900e9a4a25020ed0d886b5-C001-10", "bebcad79900e9a4a25020ed0d886b5-C001-56" ] }, "@BACK@": { "gold_contexts": [ [ "bebcad79900e9a4a25020ed0d886b5-C001-10" ], [ "bebcad79900e9a4a25020ed0d886b5-C001-29", "bebcad79900e9a4a25020ed0d886b5-C001-30", "bebcad79900e9a4a25020ed0d886b5-C001-31" ], [ "bebcad79900e9a4a25020ed0d886b5-C001-81", "bebcad79900e9a4a25020ed0d886b5-C001-82" ] ], "cite_sentences": [ "bebcad79900e9a4a25020ed0d886b5-C001-10", "bebcad79900e9a4a25020ed0d886b5-C001-29", "bebcad79900e9a4a25020ed0d886b5-C001-30", "bebcad79900e9a4a25020ed0d886b5-C001-31", "bebcad79900e9a4a25020ed0d886b5-C001-82" ] }, "@EXT@": { "gold_contexts": [ [ "bebcad79900e9a4a25020ed0d886b5-C001-16" ], [ "bebcad79900e9a4a25020ed0d886b5-C001-20" ], [ "bebcad79900e9a4a25020ed0d886b5-C001-52", "bebcad79900e9a4a25020ed0d886b5-C001-53" ], [ "bebcad79900e9a4a25020ed0d886b5-C001-56" ], [ "bebcad79900e9a4a25020ed0d886b5-C001-60" ] ], "cite_sentences": [ "bebcad79900e9a4a25020ed0d886b5-C001-16", "bebcad79900e9a4a25020ed0d886b5-C001-20", "bebcad79900e9a4a25020ed0d886b5-C001-52", "bebcad79900e9a4a25020ed0d886b5-C001-56", "bebcad79900e9a4a25020ed0d886b5-C001-60" ] }, "@USE@": { "gold_contexts": [ [ "bebcad79900e9a4a25020ed0d886b5-C001-65" ], [ "bebcad79900e9a4a25020ed0d886b5-C001-81", "bebcad79900e9a4a25020ed0d886b5-C001-82" ] ], "cite_sentences": [ "bebcad79900e9a4a25020ed0d886b5-C001-65", "bebcad79900e9a4a25020ed0d886b5-C001-82" ] } } }, "ABC_666cc3c936358c5e9b2f7d0eb8d0e4_6": { "x": [ { "sent_id": 
"666cc3c936358c5e9b2f7d0eb8d0e4-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-2", "text": "Path-based relational reasoning over knowledge graphs has become increasingly popular due to a variety of downstream applications such as question answering in dialogue systems, fact prediction, and recommender systems." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-3", "text": "In recent years, reinforcement learning (RL) has provided solutions that are more interpretable and explainable than other deep learning models." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-4", "text": "However, these solutions still face several challenges, including large action space for the RL agent and accurate representation of entity neighborhood structure." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-5", "text": "We address these problems by introducing a type-enhanced RL agent that uses the local neighborhood information for efficient path-based reasoning over knowledge graphs." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-6", "text": "Our solution uses graph neural network (GNN) for encoding the neighborhood information and utilizes entity types to prune the action space." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-7", "text": "Experiments on real-world dataset show that our method outperforms state-of-the-art RL methods and discovers more novel paths during the training procedure." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-8", "text": "----------------------------------" }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-10", "text": "Relational reasoning has been long one of the most desirable goals of machine learning and artificial intelligence [10, 16, 8, 32] ." 
}, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-11", "text": "In the context of large-scale knowledge graphs (KG), relational reasoning addresses a number of important applications, such as question answering [30, 3] , dialogue systems [20, 18] , and recommender systems [38, 15, 2] ." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-12", "text": "Most KGs are incomplete, so the problem of inferring missing relations, or KG completion, has become an increasingly popular research domain." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-13", "text": "Several works view this as a link prediction problem and attempt to solve it using network embedding and deep learning approaches [1, 26, 27, 7, 33, 4] ." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-14", "text": "These methods embed the KG into a vector space and use a similarity measure to identify the entities that are likely to be connected." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-15", "text": "However, they are unable to discover the reasoning paths, which are important for interpreting the model." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-16", "text": "Furthermore, they do not provide an explicit explanation during the learning process and often rely on other analytical methods to provide an interpretation for their predictions." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-17", "text": "As a result, it is often hard to trust the predictions made arXiv:2003.06050v1 [cs. LG] 12 Mar 2020 by embedding-based methods." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-18", "text": "Recent advances in the area of deep reinforcement learning (DRL) have inspired reinforcement learning (RL) based solutions for the KG completion problem [21, 3, 30, 13, 19, 22, 14, 29] ." 
}, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-19", "text": "RL-based methods formulate the task of KG completion as a sequential decision-making process in which the goal is to train an RL agent to walk over the graph by taking a sequence of actions (i.e., choosing the next entity) that connects the source to the target entity." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-20", "text": "The sequences of entities and relations can be directly used as a logical reasoning path for interpreting model predictions." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-21", "text": "For example, in order to answer the query (Reggie Miller, plays sport, ?), the agent may find the following reasoning path in the KG: Reggie Miller These RL solutions demonstrate competitive accuracy with other deep learning methods while boasting improved interpretability of the reasoning process." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-22", "text": "However, there are some fundamental and open issues that we address in this work:" }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-23", "text": "Large action space." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-24", "text": "In KGs, facts are represented as binary relations between entities." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-25", "text": "Real-world KGs contain huge numbers of entities and relations." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-26", "text": "As a result, the RL agent often encounters nodes with a high out-degree, which increases the complexity of choosing the next action." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-27", "text": "In these cases, exploring the possible paths to determine the optimal action is computationally expensive, and in many cases beyond the memory limits of a single GPU." 
}, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-28", "text": "Previous studies have shown that type of information can improve the KG completion performance [24, 11] using deep learning approaches." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-29", "text": "To improve the search efficiency, we first propose a representation for the entity type information, which we include in the representation of the state space." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-30", "text": "We then prune the action space based on the type information." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-31", "text": "This guides the RL agent to limit the search to the entities whose type best matches the previously taken actions and, as a result, avoid incorrect reasoning paths." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-32", "text": "In the above example reasoning path for query (Reggie Miller, plays sport, ?), suppose the entity Michael Jordan has a high out-degree, and is connected to several other entities through different relations (e.g., Michael Jordan ." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-33", "text": "None of these additional entities are useful for answering the query and may mislead the agent." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-34", "text": "However, we demonstrate that an agent can learn that next entity's type is most likely a sport rather than a person or a location." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-35", "text": "Accurate representation of entity neighborhood." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-36", "text": "Existing RL-based methods for KG completion do not capture the entity's neighborhood information." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-37", "text": "Previous studies on one-shot fact prediction have shown that the local neighborhood structure improves the fact prediction performance for long-tailed relations [31, 37] ." 
}, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-38", "text": "We propose a graph neural network (GNN) [9] to encode the neighborhood information of the entities and leverage the state representation with the type and neighborhood information of the entities." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-39", "text": "We demonstrate that learning the local heterogeneous neighborhood information improves the performance of RL agent on the long-tailed relations, which in turn significantly improves performance for the KG completion task." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-40", "text": "Our contributions include:" }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-41", "text": "1. Designing an efficient vector representation of entity type-embeddings." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-42", "text": "2. Pruning the action space and improving the choice of next actions using the entity type information." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-43", "text": "3. Proposing a GNN for incorporating the local neighborhood information in the state representation." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-44", "text": "The rest of the paper is organized as follows." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-45", "text": "First, in Section 2 we survey related work." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-46", "text": "Next, in Section 3 we present the details of our model, and in Section 4 present and discuss experimental results." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-47", "text": "Finally, we conclude in Section 5 and discuss opportunities for continued work." 
}, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-48", "text": "----------------------------------" }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-49", "text": "**RELATED WORK**" }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-50", "text": "Relational reasoning over knowledge graphs has attracted significant attention over the past few years." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-51", "text": "Recent works [1, 26, 27, 7, 33, 4, 35] approached this problem by embedding the relation and entities into a vector space and identifying related entities by similarity in the vector space." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-52", "text": "However, these methods have some important drawbacks, including:" }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-53", "text": "1. They cannot perform multi-hop reasoning." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-54", "text": "That is, they only consider pairwise relationships and cannot reason along a path." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-55", "text": "2. They cannot explain the reasoning behind their predictions." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-56", "text": "Because they treat the task as a link prediction problem, the output of their prediction is binary." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-57", "text": "With the recent success of deep reinforcement learning in AlphaGO [25] researchers began to adopt RL to solve a variety of problems that were conventionally addressed by deep learning methods, such as ad recommendation [38, 15, 2] , dialogue systems [20, 18] , and question answering [30, 3] ." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-58", "text": "As a result, more recent methods proposed using RL to solve the multi-hop reasoning problem in knowledge graphs by framing it as a sequential decision-making process [3, 23, 30, 22, 12, 13] ." 
}, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-59", "text": "Deeppath [30] was the first method that used RL to find relation paths between two entities in KGs." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-60", "text": "It walks from the source entity, chooses a relation and translates to any entity in the tail entity set of the relation." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-61", "text": "MINERVA [3] , on the other hand, learns to do multi-hop reasoning by jointly selecting an entity-relation pair via a policy network." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-62", "text": "MARLPaR [12] uses a multi-agent RL approach where two agents are used to perform the relation selection and entity selection iteratively." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-63", "text": "Lin et al. [13] implement reward shaping to address the problem of the sparse reward signal and action dropout to reduce the effect of incorrect paths." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-64", "text": "Xian et al. [29] use KG reasoning for recommender systems and designed both a multi-hop scoring function and a user-conditioned action pruning strategy to improve the efficiency of RL-based recommendation." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-65", "text": "Because these RL models treat the KG completion problem as a path reasoning problem instead of a link prediction problem, they are able to overcome both drawbacks of embedding methods that are outlined above." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-66", "text": "However, the RL models have drawbacks of their own, the most notable of which are computational cost and predictive accuracy." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-67", "text": "Many of these RL methods have tried to combine the representational power of embeddings and reasoning power of RL by training an agent to navigate an embedding space." 
}, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-68", "text": "For example, the authors of [13] build an agent-based model on top of pre-trained embeddings generated by ComplEx [27] or ConvE [4] ." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-69", "text": "While we take a similar modular approach, our solution enriches the embedding space with additional information about entity types and local neighborhood information." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-70", "text": "In light of recent work on heterogeneous networks that have demonstrated the importance of heterogeneous information [5, 36, 24, 11] and local neighborhood information [31, 37] in graph mining, we take a broader approach." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-71", "text": "We propose to include entity type information in the state representation to help the RL agent to take more informed actions considering the heterogeneous context." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-72", "text": "We also learn the heterogeneous neighborhood information simultaneously with training the RL agent to improve the prediction on less frequent relations." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-73", "text": "----------------------------------" }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-74", "text": "**MODEL**" }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-75", "text": "In this section, we formally define the problem of relational reasoning in a KG and explain our RL solution." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-76", "text": "We then introduce our contribution by incorporating type-embeddings and heterogeneous neighbor-encoder." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-77", "text": "An overview of the model is displayed in Figure 1 ." 
}, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-78", "text": "----------------------------------" }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-79", "text": "**PROBLEM FORMULATION**" }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-80", "text": "Knowledge graphs consist of facts represented as triples." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-81", "text": "We formally define a knowledge graph G = {(e s , r, e d )} \u2286 E \u00d7 R \u00d7 E where E is a set of entities and R is a set of relations." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-82", "text": "Given a query (e s , r, ?), e s is called the source entity and r is the query relation." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-83", "text": "Our goal is to predict the target entity e d \u2208 E. In most cases, the output of each query is a list of candidate entities,\u00ca d = {\u00ea 1 , ...,\u00ea n } for some fixed n < |E|, ranked in descending order by probability." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-84", "text": "The prediction can be represented as a function F : E \u00d7 R \u2192 E n ." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-85", "text": "In KGs, we are not only interested in accurate prediction of the target entity e d , but also understanding the reasoning path the model uses to predict e d ." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-86", "text": "This is a key advantage that RL methods offer over embedding-based models, which can make entity predictions but cannot give interpretable justification for them." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-87", "text": "Rather than treating the task as a form of link prediction, RL models instead train an agent to traverse the nodes of a KG via logical reasoning paths." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-88", "text": "Below we provide the details of RL formulations of the problem." 
}, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-89", "text": "----------------------------------" }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-90", "text": "**A REINFORCEMENT LEARNING SOLUTION**" }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-91", "text": "Similar to [3, 13, 30] , we formulate this problem as a Markov Decision Process (MDP), in which the goal is to train a policy gradient agent (using REIN- FORCE [28] ) to learn an optimal reasoning path to answer a given query (e s , r, ?)." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-92", "text": "We express the RL framework as a set of states, actions, rewards, and transitions." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-93", "text": "States." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-94", "text": "The state s t at time t is represented as tuple ((e s , r), e t , h t ), where (e s , r) is the input query, e t is the entity at which the agent is located at time t and e t is the history of the entities and relations traversed by agent until time t. The agent begins at the source entity with initial state s 0 = ((e s , r), e s , h 0 )." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-95", "text": "We refer to the terminal state as s T = ((e s , r), e T , h T ), where e T is the agent's answer to the input query and h T is the full reasoning path." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-96", "text": "Each entity and relation is represented by an embedding vector R d for some constant d \u2208 Z. In our solution, we enrich the state representation with entity type and neighborhood information, which is explain later in Section 4." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-97", "text": "Actions." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-98", "text": "At each time-step, the action is to select an edge (and move to the connecting entity) or stay at the current entity." 
}, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-99", "text": "The action space A t \u2286 A given state s t is thus set of all immediate neighbors of the current node e t , and the node itself, i.e., A t (s t ) = N et \u222a e t , where N e is the set of all neighbors from node e." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-100", "text": "The inclusion of the current node e t in the action space represents the agent's decision to terminate, and so select e t as its answer to the input query." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-101", "text": "Since the graph is directed, N only includes nodes adjacent on out-edges." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-102", "text": "Following previous work [1, 3, 30] , for each edge (triple) (e s , r, e d ), during preprocessing we add an inverse edge (e d , r \u22121 , e s ) in order to facilitate graph traversal." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-103", "text": "Most real-world knowledge graphs are scale-free, meaning that a small percentage number of entities have a high out-degree while the majority of the entities have a small out-degree." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-104", "text": "However, the entities with high out-degree are crucial to query answering." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-105", "text": "For performance reasons, many RL models are forced to cap the size of the action space and do so via a pre-computed heuristic." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-106", "text": "For example, [13] pre-computes PageRank scores for each node, and narrows the action space to a fixed number of highest-ranking neighbors." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-107", "text": "In this work, we use entity type information to limit the search to the entities with best-matching types, given the previous actions." 
}, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-108", "text": "We call the reduced action space A t ." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-109", "text": "We provide more details in Section 3.3." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-110", "text": "Rewards." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-111", "text": "The agent evaluates each action and chooses the one that will maximize a reward." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-112", "text": "Previous works [30, 3] define only a terminal reward of +1 if the agent reaches the correct answer." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-113", "text": "However, since knowledge graphs are incomplete, a binary reward cannot model the potentially missing facts." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-114", "text": "As a result, the agent receives low-quality rewards as it explores the environment." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-115", "text": "Inspired by [13] , we use pre-trained KG embeddings based on existing KG embedding methods to design a soft reward function for the terminal state s T based on [17] :" }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-116", "text": "Where f (e s , r, e T ) is a similarity measure calculated based on pre-trained KG embedding approach [13] ." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-117", "text": "We use different embedding methods depending on different datasets." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-118", "text": "More details is provided in Section 4." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-119", "text": "Transitions." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-120", "text": "At state s t , the agent choses an action a t \u2208 A t based on a policy \u03c0 : S \u2192 A where S and A are the sets of all possible states and and actions, respectively." 
}, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-121", "text": "Following [13] , we use an LST M to encode the history h t = {e t\u2212k , r t\u2212k+1 , ..., e t\u22121 , r t } of the past k steps taken by the agent in solving the query." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-122", "text": "The history embedding for h t is represented as:" }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-123", "text": "We define the policy network \u03c0 with weight parameters \u03b8 as follows:" }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-124", "text": "The transition to a new state is thus given by:" }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-125", "text": "To reduce the potential impact of argmax leading to the overuse of incorrect paths, we utilize random action dropout as described in [13] ." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-126", "text": "The policy network is trained using stochastic gradient descent to maximize the expected reward:" }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-127", "text": "----------------------------------" }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-128", "text": "**ENTITY TYPES**" }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-129", "text": "Many knowledge graphs contain rich semantic information as entity type that can be used as prior information to guide the agent through the reasoning process." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-130", "text": "We argue that the type information can be helpful for reducing the action space, especially for nodes with a high out-degree." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-131", "text": "Entity type information can be used to limit the search only to the entities that are best matching the previously visited entities and actions." 
}, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-132", "text": "To achieve this, we measure the similarity of all possible actions given the entity type embedding of current and possible target entities and only keep the top n candidates." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-133", "text": "In order to build the entity type representation e \u03c4 , we propose to aggregate the vector representation of the entities with a similar type." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-134", "text": "Below we propose two simple mechanisms for doing so:" }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-135", "text": "1. Take the average of the embedding vectors for all the N entities e i that share the same entity type \u03c4 (mean-pooling)." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-136", "text": "2. Take the maximum value of each element in the the embedding vectors for all the N entities e i that share the same entity type (max-pooling)." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-137", "text": "To measure the similarity of the current entity with the candidate actions we use the cosine similarity of two entities with respect to their type-enhanced embedding vectors:" }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-138", "text": "where \u00b7, \u00b7 is the dot product operation, \u2295 represents the vector concatenation operator, e, r \u2208 R d are d-dimensional vector representations of the entity e and relation r, and b e \u2208 R is a bias term for e k ." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-139", "text": "We call e i \u03c4 = [e i ; e i \u03c4 ] the type-enhanced entity representation." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-140", "text": "We then rank the possible actions and prioritizes the ones that are more likely to result in a correct answer based on the g(e 0 , e k ) score." 
}, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-141", "text": "We thus create a pruned action space A t by keeping the nodes with the highest value of g." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-142", "text": "----------------------------------" }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-143", "text": "**HETEROGENEOUS NEIGHBOR ENCODER**" }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-144", "text": "After generating the type embeddings, we feed the type-enhanced embeddings together with the relation embeddings to the heterogeneous neighbor encoder to generate the enriched entity representation." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-145", "text": "Although many works [1, 34] have been proposed to learn entity embeddings using relational information, recent studies [31, 37] have demonstrated that explicitly encoding graph local structure can benefit entity embedding learning in knowledge graph." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-146", "text": "Inspired by this motivation, we propose a heterogeneous neighbor encoder to learn the enriched entity embedding by aggregating entity's neighbors information." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-147", "text": "Specifically, we denote the set of relational neighbors (relation, entity) of given entity h as N h = {(r i , t i )|(h, r i , t i ) \u2208 G}, where G is the background knowledge graph, r i and t i represent the i-th relation and corresponding neighboring entity of h, respectively." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-148", "text": "The heterogeneous neighbor encoder should be able to encode N h and output a feature representation of h by considering different relational neighbors (r i , t i ) \u2208 N h ." 
}, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-149", "text": "To achieve this goal, we formulate the enriched entity embedding as follows:" }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-150", "text": "where \u03c3 denotes activation unit (we use Tanh), \u2295 represents concatenation operator, e \u03c4 ti , e \u03c4 ri \u2208 R d\u00d71 are type-enhanced entity embeddings of t i and r i ." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-151", "text": "Besides, u rt \u2208 R d\u00d71 , W rt \u2208 R d\u00d72d and b rt \u2208 R d\u00d71 (d: pre-trained embedding dimension) are parameters of neighbor encoder." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-152", "text": "----------------------------------" }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-153", "text": "**EXPERIMENTS**" }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-154", "text": "In this section, we describe and discuss the experimental results of our proposed approach." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-155", "text": "We compare against several baseline methods: ConvE (embeddingbased) [4] , ComplEx (embedding-based) [27] , MINERVA (agent-based) [3] , and MultiHopKG (agent-based) using both ConvE and ComplEx for pre-trained embeddings [13] ." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-156", "text": "The experiments utilize the 3 data sets presented in Table 1 ." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-157", "text": "Of the standard data sets used in KG reasoning tasks, NELL-995 is the only one that explicitly encodes entity types." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-158", "text": "Therefore, in addition to NELL, we incorporated two datasets from the Amazon e-commerce collection [6] ." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-159", "text": "Each Amazon data contains a set of users, products, brands, and other information, which the authors of [29] use to make product recommendations to users." 
}, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-160", "text": "This task is a specialized instance of KG completion that only focuses on user-product relations, so we do not include it in our baseline results." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-161", "text": "Additionally, we found these data sets were too large for efficient computation in the broader KG completion task, so we shrunk them for our experiments." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-162", "text": "To do this, we randomly chose 20% of the nodes and induced a subgraph on those nodes." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-163", "text": "While this might result in a sparser graph that makes predictions more difficult, this was the best option given the lack of other relevant data containing type information." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-164", "text": "----------------------------------" }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-165", "text": "**DATA & METRICS**" }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-166", "text": "The full KG is represented by the number of F acts in Table 1 ." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-167", "text": "Before training, we partition F acts into a training set and a test set, which we call Queries." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-168", "text": "In NELL-995, this split already exists as part of the standard data set." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-169", "text": "For the Amazon sets, we populated Queries with 10% of the triples in F acts, chosen at random." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-170", "text": "Each model is then trained on the set (F acts \u2212 Queries), and tested on the set Queries." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-171", "text": "Recall that each fact is a triple in the form (e s , r, e d )." 
}, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-172", "text": "Each triple is presented to the model in the form (e s , r, ?), and, as described in Section 3.2, the model outputs a list of ranked candidate entities\u00ca d = {e 1 , ..., e n }." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-173", "text": "Also recall that we describe the prediction as a function F : (e s , r) \u2192\u00ca d ." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-174", "text": "We measure performance for each experiment with standard KG completion metrics, namely, Hits@k for k={1,3,5,10}, and Mean Reciprocal Rank (MRR)." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-175", "text": "Hits@k is measured as the percentage of test cases in which the correct entity e d appears in the top k candidates in\u00ca d , i.e.:" }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-176", "text": "----------------------------------" }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-177", "text": "**HITS@K =**" }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-178", "text": "|{(e s , r, e d ) \u2208 Q : rank(e d , F(e s , r)) \u2264 k}| |Q| \u00d7 100 (8) where Q = Queries and rank(e d ,\u00ca d ) is a function that returns the position of entity e d in the set of ordered predictions\u00ca d ." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-179", "text": "MRR is a related metric, defined as the multiplicative inverse of the rank of the correct answer, i.e.:" }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-180", "text": "Because none of these models generalize to unknown entities, followed by previous work [3, 13] , we measure Hits@k and MRR only for queries for which both e s and e d have already been seen at least once by the model during training." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-181", "text": "In other words, if either of the query entities is missing from the training set (F acts \u2212 Queries), we discard it from testing." 
}, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-182", "text": "Additionally, we reserve a small portion of the F acts as a development set to estimate performance during training." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-183", "text": "----------------------------------" }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-184", "text": "**PARAMETER SELECTION**" }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-185", "text": "For NELL-995, utilize the same hyperparameters described in [13] when training ConvE, ComplEx, Distmult, and Lin et al [13] baselines." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-186", "text": "For MINERVA, we utilize the same hyperparameters described in [3] and train the model for 3,000 epochs." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-187", "text": "For the two Amazon datasets, we perform a grid search for our method and all baselines and report the best performance for each." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-188", "text": "For all datasets, we train the KG embedding models (ConvE and ComplEx) for 1000 epochs each." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-189", "text": "These embeddings are then used to make predictions directly but also serve as pre-trained inputs for the RL agent, which we train for 30 epochs per experiment for all datasets." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-190", "text": "We tried different embedding methods for the pretrained embeddings and eventually used ComplEx for NELL-995 and Distmult for Amazon cellphones and Amazon Beauty as they resulted in the best performance in each data." 
}, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-191", "text": "----------------------------------" }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-192", "text": "**DATA SET**" }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-193", "text": "NELL-995 Amazon Beauty Amazon Cellphones Metric (%) @1 @5 @10 MRR @1 @5 @10 MRR @1 @5 @10 MRR" }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-194", "text": "----------------------------------" }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-195", "text": "**EXPERIMENTAL RESULTS**" }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-196", "text": "Our experimental results are described in Table 2 ." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-197", "text": "For NELL-995 data, We quote the results reported in [3, 13] ." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-198", "text": "Embedding-based methods show an overall higher performance compared to the RL-based methods." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-199", "text": "We can see that in all three datasets, our results outperform both RL baselines (Lin et al. [13] and MINERVA [3] )." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-200", "text": "Amazon datasets, on the other hand, are far more challenging." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-201", "text": "We notice that even the embedding based methods are struggling with low performance on these datasets." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-202", "text": "On the Amazon data, the performance of all methods is significantly lower." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-203", "text": "Our method results in a 4% improvement in MRR (and 5.43% in Hits@1) over the best RL baseline on Amazon Cellphones and a 3.9% improvement in MRR (and 5.5% in Hits@1) on Amazon Beauty." 
}, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-204", "text": "On the NELL-995 dataset, our method results in 2.8% improvement in MRR and 4% improvement in Hits@1 over the best performing baseline." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-205", "text": "We also performed ablations studies to analyze the effect of each module in our model." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-206", "text": "We removed the type embeddings in Ours (-T) and the heterogeneous neighbor encoder in Ours (-N)." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-207", "text": "We notice that removing the heterogeneous neighbor encoder results in a higher drop in performance in the Amazon datasets." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-208", "text": "This gap is quite smaller in the NELL-955 data." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-209", "text": "Our results show that pruning the action space based on the entity type information results in a larger boost in performance on the Amazon datasets." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-210", "text": "We believe due to the sparsity of these two knowledge graphs, type information was more effective for action space pruning than entity page ranks, as done in [13] ." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-211", "text": "Note that, there are only 5 entity types in the Amazon datasets." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-212", "text": "As a result, the number of entities that will be discarded (due to type mismatch) is higher which assists the agent to discover a better path." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-213", "text": "We generated the type embeddings using max-pooling for the NELL-995 dataset and mean-pooling for both Amazon datasets." 
}, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-214", "text": "----------------------------------" }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-215", "text": "**PATH DIVERSITY AND CONVERGENCE**" }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-216", "text": "We compare the number of unique paths discovered from the development set during the training procedure." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-217", "text": "Figure 2 shows that path diversity (top row) improves across all models as the model performance (bottom row) improves." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-218", "text": "For this analysis, we compare our ablation models (Ours (-N) and Ours (-T)) with the best performing RL baseline by Lin et al. [13] ." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-219", "text": "Our method is more successful in discovering novel paths and obtains a better hit ratio on the development set." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-220", "text": "On the Amazon Beauty data, the number of unique paths discovered by Ours (-T) is higher than both combined (Ours) while in Amazon Cellphones the combined model performs better, but similar to Amazon Beauty, Ours (-N)performs better than Ours (-T)." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-221", "text": "NELL-955 shows a different trend where removing the type information results in a larger drop in the number of unique paths, compared to the heterogeneous neighbor encoder." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-222", "text": "This is intuitive, since NELL-955 contains far more entity type than Amazon datasets, and inclusion of type information may be a positive factor for discovering new paths." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-223", "text": "In terms of convergence, Amazon Beauty and Amazon Cellphones show a similar trend, and removing the type information significantly reduces the hit ratio." 
}, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-224", "text": "This gap is smaller for NELL-995 data, though our model still shows improvement in hit ratio on this dataset." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-225", "text": "----------------------------------" }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-226", "text": "**PERFORMANCE ON SEEN AND UNSEEN QUERIES**" }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-227", "text": "We compare the ablation models along with the best RL baseline performance on seen and unseen queries." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-228", "text": "Note that, percentage of unseen queries is much lower in the Amazon datasets compared to the NELL-995 dataset." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-229", "text": "Table 4 shows that our proposed method performs better on both seen and unseen queries." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-230", "text": "In particular, we notice that removing the neighbor encoder in Amazon Beauty and Amazon Cellphone results in a significant performance drop on unseen queries, while removing the type information had little or no effect." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-231", "text": "We observe a similar trend on the seen queries." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-232", "text": "Although they show a performance drop after removing the type information, the effect is less than removing the neighbor encoder." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-233", "text": "----------------------------------" }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-234", "text": "**PERFORMANCE ON DIFFERENT RELATIONS**" }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-235", "text": "We evaluate our proposed model on different relation types and compare our results with the best performing RL baseline." 
}, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-236", "text": "We take a similar approach as [13] to extract to-many and to-one relations." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-237", "text": "A relation r is considered to-many if queries containing relation r can have more than 1 correct answer, otherwise, it is considered a to-one relation." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-238", "text": "Table 4 shows the MRR values on the development set for all three datasets." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-239", "text": "We notice that most relations in the Amazon datasets are to-many, as these graphs are denser." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-240", "text": "On the other hand, a large portion of the NELL-995 data consists of to-one relations." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-241", "text": "Overall, to-many relations show lower performance, regardless of the model." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-242", "text": "Our proposed model consistently shows a better performance than Lin et al., except for the NELL-995 dataset where the improvement is marginal." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-243", "text": "Both to-one and to-many are more sensitive to removing the neighbor-encoder rather than removing the type information." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-244", "text": "However, for to-one relations MRR does not drop in any dataset after removing the type information." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-245", "text": "----------------------------------" }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-246", "text": "**CASE STUDY**" }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-247", "text": "In this section we present a few case studies that show the strength of our proposed method." 
}, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-248", "text": "In the NELL-995 dataset, our method is more successful when the agent encounters an entity with a high out-degree." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-249", "text": "As an example, for the query (Buffalo Bills (sports team) , [D:315], they are not able to navigate properly after reaching the NFL entity, due to its high out-degree." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-250", "text": "As a result, they are unable to discover the answer Mike Mularkey." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-251", "text": "As another example, we consider the query: (New York (City), organization hired person ,?)." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-252", "text": "Our method discovers the path: New York (City) ." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-253", "text": "Again, other RL baselines struggle with finding the next best step after entity New York." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-254", "text": "Our method uses the location information to find the answer Michael Bloomberg." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-255", "text": "In the Amazon datasets, there are fewer entity and relation types." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-256", "text": "As a result, we observe many frequent patterns that all RL baselines are able to discover." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-257", "text": "Therefore, we focus on the diversity of the relations used in our method and the best performing baseline [13] for the discovered paths in the development set." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-258", "text": "Figure 3 displays the inference results." 
}, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-259", "text": "On the Amazon cellphones data, our method discovers fewer , produced-by and also-bought relations while it utilizes more of other relations, in particular, belongs-to relations." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-260", "text": "Similarly, on the Amazon Beauty data, our method utilizes fewer also-bought, produced-by and relations, while it uses other relations more frequently, especially the bought-together relation." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-261", "text": "We believe one reason for the success of our method is the diverse use of different relation types for discovering new path types." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-262", "text": "----------------------------------" }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-263", "text": "**CONCLUSION**" }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-264", "text": "We proposed a framework for improving the performance of path-based reasoning using reinforcement learning." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-265", "text": "Our results show that incorporating the heterogeneous context and the local neighborhood information results in a better performance for the query answering task." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-266", "text": "Our analysis shows that the type information is important for faster convergence and finding more diverse paths, and the neighborhood information improves the performance on unseen queries." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-267", "text": "In the future, we plan to explore more efficient strategies for action-space pruning to improve the scalability of existing RL solutions." }, { "sent_id": "666cc3c936358c5e9b2f7d0eb8d0e4-C001-268", "text": "Furthermore, we plan to develop more effective type embeddings considering the hierarchical structure of the type information." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "666cc3c936358c5e9b2f7d0eb8d0e4-C001-18" ], [ "666cc3c936358c5e9b2f7d0eb8d0e4-C001-19" ], [ "666cc3c936358c5e9b2f7d0eb8d0e4-C001-36" ], [ "666cc3c936358c5e9b2f7d0eb8d0e4-C001-58" ], [ "666cc3c936358c5e9b2f7d0eb8d0e4-C001-63" ], [ "666cc3c936358c5e9b2f7d0eb8d0e4-C001-65" ], [ "666cc3c936358c5e9b2f7d0eb8d0e4-C001-66" ], [ "666cc3c936358c5e9b2f7d0eb8d0e4-C001-67" ], [ "666cc3c936358c5e9b2f7d0eb8d0e4-C001-68" ], [ "666cc3c936358c5e9b2f7d0eb8d0e4-C001-106" ], [ "666cc3c936358c5e9b2f7d0eb8d0e4-C001-253" ] ], "cite_sentences": [ "666cc3c936358c5e9b2f7d0eb8d0e4-C001-18", "666cc3c936358c5e9b2f7d0eb8d0e4-C001-19", "666cc3c936358c5e9b2f7d0eb8d0e4-C001-36", "666cc3c936358c5e9b2f7d0eb8d0e4-C001-58", "666cc3c936358c5e9b2f7d0eb8d0e4-C001-63", "666cc3c936358c5e9b2f7d0eb8d0e4-C001-65", "666cc3c936358c5e9b2f7d0eb8d0e4-C001-66", "666cc3c936358c5e9b2f7d0eb8d0e4-C001-67", "666cc3c936358c5e9b2f7d0eb8d0e4-C001-68", "666cc3c936358c5e9b2f7d0eb8d0e4-C001-106", "666cc3c936358c5e9b2f7d0eb8d0e4-C001-253" ] }, "@MOT@": { "gold_contexts": [ [ "666cc3c936358c5e9b2f7d0eb8d0e4-C001-36", "666cc3c936358c5e9b2f7d0eb8d0e4-C001-37", "666cc3c936358c5e9b2f7d0eb8d0e4-C001-38" ], [ "666cc3c936358c5e9b2f7d0eb8d0e4-C001-115" ] ], "cite_sentences": [ "666cc3c936358c5e9b2f7d0eb8d0e4-C001-36", "666cc3c936358c5e9b2f7d0eb8d0e4-C001-115" ] }, "@SIM@": { "gold_contexts": [ [ "666cc3c936358c5e9b2f7d0eb8d0e4-C001-68", "666cc3c936358c5e9b2f7d0eb8d0e4-C001-69" ], [ "666cc3c936358c5e9b2f7d0eb8d0e4-C001-91" ], [ "666cc3c936358c5e9b2f7d0eb8d0e4-C001-236" ] ], "cite_sentences": [ "666cc3c936358c5e9b2f7d0eb8d0e4-C001-68", "666cc3c936358c5e9b2f7d0eb8d0e4-C001-69", "666cc3c936358c5e9b2f7d0eb8d0e4-C001-91", "666cc3c936358c5e9b2f7d0eb8d0e4-C001-236" ] }, "@EXT@": { "gold_contexts": [ [ "666cc3c936358c5e9b2f7d0eb8d0e4-C001-68", "666cc3c936358c5e9b2f7d0eb8d0e4-C001-69" ] ], "cite_sentences": [ "666cc3c936358c5e9b2f7d0eb8d0e4-C001-68", "666cc3c936358c5e9b2f7d0eb8d0e4-C001-69" ] }, 
"@USE@": { "gold_contexts": [ [ "666cc3c936358c5e9b2f7d0eb8d0e4-C001-116" ], [ "666cc3c936358c5e9b2f7d0eb8d0e4-C001-121" ], [ "666cc3c936358c5e9b2f7d0eb8d0e4-C001-125" ], [ "666cc3c936358c5e9b2f7d0eb8d0e4-C001-155" ], [ "666cc3c936358c5e9b2f7d0eb8d0e4-C001-180" ], [ "666cc3c936358c5e9b2f7d0eb8d0e4-C001-185" ], [ "666cc3c936358c5e9b2f7d0eb8d0e4-C001-187" ], [ "666cc3c936358c5e9b2f7d0eb8d0e4-C001-197" ], [ "666cc3c936358c5e9b2f7d0eb8d0e4-C001-210" ], [ "666cc3c936358c5e9b2f7d0eb8d0e4-C001-218" ], [ "666cc3c936358c5e9b2f7d0eb8d0e4-C001-227" ], [ "666cc3c936358c5e9b2f7d0eb8d0e4-C001-235" ], [ "666cc3c936358c5e9b2f7d0eb8d0e4-C001-257" ] ], "cite_sentences": [ "666cc3c936358c5e9b2f7d0eb8d0e4-C001-116", "666cc3c936358c5e9b2f7d0eb8d0e4-C001-121", "666cc3c936358c5e9b2f7d0eb8d0e4-C001-125", "666cc3c936358c5e9b2f7d0eb8d0e4-C001-155", "666cc3c936358c5e9b2f7d0eb8d0e4-C001-180", "666cc3c936358c5e9b2f7d0eb8d0e4-C001-185", "666cc3c936358c5e9b2f7d0eb8d0e4-C001-187", "666cc3c936358c5e9b2f7d0eb8d0e4-C001-197", "666cc3c936358c5e9b2f7d0eb8d0e4-C001-210", "666cc3c936358c5e9b2f7d0eb8d0e4-C001-218", "666cc3c936358c5e9b2f7d0eb8d0e4-C001-227", "666cc3c936358c5e9b2f7d0eb8d0e4-C001-235", "666cc3c936358c5e9b2f7d0eb8d0e4-C001-257" ] }, "@DIF@": { "gold_contexts": [ [ "666cc3c936358c5e9b2f7d0eb8d0e4-C001-198" ], [ "666cc3c936358c5e9b2f7d0eb8d0e4-C001-199" ], [ "666cc3c936358c5e9b2f7d0eb8d0e4-C001-203" ], [ "666cc3c936358c5e9b2f7d0eb8d0e4-C001-204" ], [ "666cc3c936358c5e9b2f7d0eb8d0e4-C001-218", "666cc3c936358c5e9b2f7d0eb8d0e4-C001-219" ], [ "666cc3c936358c5e9b2f7d0eb8d0e4-C001-227", "666cc3c936358c5e9b2f7d0eb8d0e4-C001-229" ], [ "666cc3c936358c5e9b2f7d0eb8d0e4-C001-242" ], [ "666cc3c936358c5e9b2f7d0eb8d0e4-C001-253", "666cc3c936358c5e9b2f7d0eb8d0e4-C001-254" ] ], "cite_sentences": [ "666cc3c936358c5e9b2f7d0eb8d0e4-C001-198", "666cc3c936358c5e9b2f7d0eb8d0e4-C001-199", "666cc3c936358c5e9b2f7d0eb8d0e4-C001-203", "666cc3c936358c5e9b2f7d0eb8d0e4-C001-204", "666cc3c936358c5e9b2f7d0eb8d0e4-C001-218", 
"666cc3c936358c5e9b2f7d0eb8d0e4-C001-227", "666cc3c936358c5e9b2f7d0eb8d0e4-C001-242", "666cc3c936358c5e9b2f7d0eb8d0e4-C001-253" ] }, "@UNSURE@": { "gold_contexts": [ [ "666cc3c936358c5e9b2f7d0eb8d0e4-C001-256" ] ], "cite_sentences": [ "666cc3c936358c5e9b2f7d0eb8d0e4-C001-256" ] }, "@FUT@": { "gold_contexts": [ [ "666cc3c936358c5e9b2f7d0eb8d0e4-C001-267" ] ], "cite_sentences": [ "666cc3c936358c5e9b2f7d0eb8d0e4-C001-267" ] } } }, "ABC_0c233d68fb2ccdf033fc6a08c8f4bf_6": { "x": [ { "sent_id": "0c233d68fb2ccdf033fc6a08c8f4bf-C001-2", "text": "The goal of the Penn Discourse Treebank (PDTB) project is to develop a large-scale corpus, annotated with coherence relations marked by discourse connectives." }, { "sent_id": "0c233d68fb2ccdf033fc6a08c8f4bf-C001-3", "text": "Currently, the primary application of the PDTB annotation has been to news articles." }, { "sent_id": "0c233d68fb2ccdf033fc6a08c8f4bf-C001-4", "text": "In this study, we tested whether the PDTB guidelines can be adapted to a different genre." }, { "sent_id": "0c233d68fb2ccdf033fc6a08c8f4bf-C001-5", "text": "We annotated discourse connectives and their arguments in one 4,937-token full-text biomedical article." }, { "sent_id": "0c233d68fb2ccdf033fc6a08c8f4bf-C001-6", "text": "Two linguist annotators showed an agreement of 85% after simple conventions were added." }, { "sent_id": "0c233d68fb2ccdf033fc6a08c8f4bf-C001-7", "text": "For the remaining 15% cases, we found that biomedical domain-specific knowledge is needed to capture the linguistic cues that can be used to resolve inter-annotator disagreement." }, { "sent_id": "0c233d68fb2ccdf033fc6a08c8f4bf-C001-8", "text": "We found that the two annotators were able to reach an agreement after discussion." }, { "sent_id": "0c233d68fb2ccdf033fc6a08c8f4bf-C001-9", "text": "Thus our experiments suggest that the PDTB annotation can be adapted to new domains by minimally adjusting the guidelines and by adding some further domain-specific linguistic cues." 
}, { "sent_id": "0c233d68fb2ccdf033fc6a08c8f4bf-C001-10", "text": "----------------------------------" }, { "sent_id": "0c233d68fb2ccdf033fc6a08c8f4bf-C001-11", "text": "**INTRODUCTION**" }, { "sent_id": "0c233d68fb2ccdf033fc6a08c8f4bf-C001-12", "text": "Large scale annotated corpora, e.g., the Penn TreeBank (PTB) project (Marcus et al. 1993) , have played an important role in text-mining." }, { "sent_id": "0c233d68fb2ccdf033fc6a08c8f4bf-C001-13", "text": "The Penn Discourse Treebank (PDTB) (http://www.seas.upenn.edu/~pdtb) (Prasad et al. 2008a) annotates the argument structure, semantics, and attribution of discourse connectives and their arguments." }, { "sent_id": "0c233d68fb2ccdf033fc6a08c8f4bf-C001-14", "text": "The current release of PDTB-2.0 contains the annotations of 1,808 Wall Street Journal articles (~1 million words) from the Penn TreeBank (Marcus et al. 1993 ) II distribution and a total of 40,600 discourse connective tokens (Prasad et al. 2008b )." }, { "sent_id": "0c233d68fb2ccdf033fc6a08c8f4bf-C001-15", "text": "This work examines whether the PDTB annotation guidelines can be adapted to a different genre, the biomedical literature." }, { "sent_id": "0c233d68fb2ccdf033fc6a08c8f4bf-C001-16", "text": "----------------------------------" }, { "sent_id": "0c233d68fb2ccdf033fc6a08c8f4bf-C001-17", "text": "**NOTATION**" }, { "sent_id": "0c233d68fb2ccdf033fc6a08c8f4bf-C001-18", "text": "A discourse connective can be defined as a word or multiword expression that signals a discourse relation." }, { "sent_id": "0c233d68fb2ccdf033fc6a08c8f4bf-C001-19", "text": "Discourse connectives can be subordinating conjunctions (e.g., because, when, although), coordinating conjunctions (e.g., but, or, nor) and adverbials (e.g., however, as a result, for example)." }, { "sent_id": "0c233d68fb2ccdf033fc6a08c8f4bf-C001-20", "text": "A discourse connective takes in two arguments, Arg1 and Arg2." 
}, { "sent_id": "0c233d68fb2ccdf033fc6a08c8f4bf-C001-21", "text": "Arg2 is the argument that appears in the clause that is syntactically bound to the connective and Arg1 is the other argument." }, { "sent_id": "0c233d68fb2ccdf033fc6a08c8f4bf-C001-22", "text": "In the sentence \"John failed the exam because he was lazy\" the discourse connective is underlined, Arg1 appears in italics and Arg2 appears in bold." }, { "sent_id": "0c233d68fb2ccdf033fc6a08c8f4bf-C001-23", "text": "----------------------------------" }, { "sent_id": "0c233d68fb2ccdf033fc6a08c8f4bf-C001-24", "text": "**A PILOT ANNOTATION**" }, { "sent_id": "0c233d68fb2ccdf033fc6a08c8f4bf-C001-25", "text": "Following the PDTB annotation manual (Prasad et al. 2008b ), we conducted a pilot annotation of discourse connectivity in biomedical text." }, { "sent_id": "0c233d68fb2ccdf033fc6a08c8f4bf-C001-26", "text": "As an initial step, we only annotated the three most important components of a discourse relation; namely, a discourse connective and its two arguments; we did not annotate attribution." }, { "sent_id": "0c233d68fb2ccdf033fc6a08c8f4bf-C001-27", "text": "Two linguist annotators independently annotated one full-text biomedical article (Verpy et al. 1999 ) that we randomly selected." }, { "sent_id": "0c233d68fb2ccdf033fc6a08c8f4bf-C001-28", "text": "The article is 4,937 tokens long." }, { "sent_id": "0c233d68fb2ccdf033fc6a08c8f4bf-C001-29", "text": "When the annotation work was completed, we measured the inter-annotator agreement, following the PDTB exact match criterion (Miltsakaki et al. 2004 )." }, { "sent_id": "0c233d68fb2ccdf033fc6a08c8f4bf-C001-30", "text": "According to this criterion, a discourse relation is in disagreement if there is disagreement on any textspan (i.e., the discourse connective or any of its two arguments)." 
}, { "sent_id": "0c233d68fb2ccdf033fc6a08c8f4bf-C001-31", "text": "In addition, we also measured the agreement in the components (i.e., discourse connectives and the arguments)." }, { "sent_id": "0c233d68fb2ccdf033fc6a08c8f4bf-C001-32", "text": "We discussed the annotation results and made suggestions to adapt the PDTB guidelines to biomedical text." }, { "sent_id": "0c233d68fb2ccdf033fc6a08c8f4bf-C001-33", "text": "----------------------------------" }, { "sent_id": "0c233d68fb2ccdf033fc6a08c8f4bf-C001-34", "text": "**RESULTS AND DISCUSSION**" }, { "sent_id": "0c233d68fb2ccdf033fc6a08c8f4bf-C001-35", "text": "The first annotator identified 74 discourse connectives, and the second annotator identified 75, 68 of which were the same as those identified by the first annotator." }, { "sent_id": "0c233d68fb2ccdf033fc6a08c8f4bf-C001-36", "text": "The combined total number of discourse connectives was 81." }, { "sent_id": "0c233d68fb2ccdf033fc6a08c8f4bf-C001-37", "text": "The overall agreement in discourse connective identification was 68/81=84%." }, { "sent_id": "0c233d68fb2ccdf033fc6a08c8f4bf-C001-38", "text": "Of the 68 discourse connectives that were annotated by both annotators, 31 were an exact match, 31 had an exact match for Arg1, and 54 had an exact match for Arg2." }, { "sent_id": "0c233d68fb2ccdf033fc6a08c8f4bf-C001-39", "text": "The overall agreement for the 68 discourse relations is 45.6% for exact match, 45.6% for Arg1, and 79.4% for Arg2." }, { "sent_id": "0c233d68fb2ccdf033fc6a08c8f4bf-C001-40", "text": "The PDTB also reported a higher level of agreement in annotating Arg2 than in annotating Arg1 (Miltsakaki et al. 2004) ." }, { "sent_id": "0c233d68fb2ccdf033fc6a08c8f4bf-C001-41", "text": "We manually analyzed the cases with disagreement." }, { "sent_id": "0c233d68fb2ccdf033fc6a08c8f4bf-C001-42", "text": "We found the disagreements are nearly all related to the annotation of citation references, supplementary clauses, and other conventions." 
}, { "sent_id": "0c233d68fb2ccdf033fc6a08c8f4bf-C001-43", "text": "When a few conventions for these cases were added, the inter-annotator agreement went up to 85%." }, { "sent_id": "0c233d68fb2ccdf033fc6a08c8f4bf-C001-44", "text": "We also found that different interpretation of a relation and its arguments by annotators plays an important role for the remaining 15% inconsistency, and domain-specific knowledge is necessary to resolve such cases." }, { "sent_id": "0c233d68fb2ccdf033fc6a08c8f4bf-C001-45", "text": "----------------------------------" }, { "sent_id": "0c233d68fb2ccdf033fc6a08c8f4bf-C001-46", "text": "**NEW CONVENTIONS**" }, { "sent_id": "0c233d68fb2ccdf033fc6a08c8f4bf-C001-47", "text": "After the completion of the pilot annotation and the discussion, we decided to add the following conventions to the PDTB annotation guidelines to address the characteristics of biomedical text:" }, { "sent_id": "0c233d68fb2ccdf033fc6a08c8f4bf-C001-48", "text": "i. Citation references are to be annotated as a part of an argument because the inclusion will benefit many text-mining tasks including identifying the semantic relations among citations." }, { "sent_id": "0c233d68fb2ccdf033fc6a08c8f4bf-C001-49", "text": "ii." }, { "sent_id": "0c233d68fb2ccdf033fc6a08c8f4bf-C001-50", "text": "Clausal supplements (e.g., relative or parenthetical constructions) that modify arguments but are not minimally necessary for the interpretation of the relation, are annotated as part of the arguments." }, { "sent_id": "0c233d68fb2ccdf033fc6a08c8f4bf-C001-51", "text": "iii." }, { "sent_id": "0c233d68fb2ccdf033fc6a08c8f4bf-C001-52", "text": "We will annotate a wider variety of nominalizations as arguments than allowed by the PDTB guidelines." }, { "sent_id": "0c233d68fb2ccdf033fc6a08c8f4bf-C001-53", "text": "We anticipate that these changes will both decrease the amount of effort required for annotation and increase the reliability of the annotation." 
}, { "sent_id": "0c233d68fb2ccdf033fc6a08c8f4bf-C001-1", "text": "**ABSTRACT**" } ], "y": { "@BACK@": { "gold_contexts": [ [ "0c233d68fb2ccdf033fc6a08c8f4bf-C001-2", "0c233d68fb2ccdf033fc6a08c8f4bf-C001-3" ], [ "0c233d68fb2ccdf033fc6a08c8f4bf-C001-13" ], [ "0c233d68fb2ccdf033fc6a08c8f4bf-C001-40" ] ], "cite_sentences": [ "0c233d68fb2ccdf033fc6a08c8f4bf-C001-2", "0c233d68fb2ccdf033fc6a08c8f4bf-C001-3", "0c233d68fb2ccdf033fc6a08c8f4bf-C001-13", "0c233d68fb2ccdf033fc6a08c8f4bf-C001-40" ] }, "@MOT@": { "gold_contexts": [ [ "0c233d68fb2ccdf033fc6a08c8f4bf-C001-4" ], [ "0c233d68fb2ccdf033fc6a08c8f4bf-C001-15" ] ], "cite_sentences": [ "0c233d68fb2ccdf033fc6a08c8f4bf-C001-4", "0c233d68fb2ccdf033fc6a08c8f4bf-C001-15" ] }, "@USE@": { "gold_contexts": [ [ "0c233d68fb2ccdf033fc6a08c8f4bf-C001-4", "0c233d68fb2ccdf033fc6a08c8f4bf-C001-5" ], [ "0c233d68fb2ccdf033fc6a08c8f4bf-C001-25" ], [ "0c233d68fb2ccdf033fc6a08c8f4bf-C001-29" ] ], "cite_sentences": [ "0c233d68fb2ccdf033fc6a08c8f4bf-C001-4", "0c233d68fb2ccdf033fc6a08c8f4bf-C001-5", "0c233d68fb2ccdf033fc6a08c8f4bf-C001-25", "0c233d68fb2ccdf033fc6a08c8f4bf-C001-29" ] }, "@EXT@": { "gold_contexts": [ [ "0c233d68fb2ccdf033fc6a08c8f4bf-C001-9" ], [ "0c233d68fb2ccdf033fc6a08c8f4bf-C001-32" ], [ "0c233d68fb2ccdf033fc6a08c8f4bf-C001-47", "0c233d68fb2ccdf033fc6a08c8f4bf-C001-48" ], [ "0c233d68fb2ccdf033fc6a08c8f4bf-C001-52" ] ], "cite_sentences": [ "0c233d68fb2ccdf033fc6a08c8f4bf-C001-9", "0c233d68fb2ccdf033fc6a08c8f4bf-C001-32", "0c233d68fb2ccdf033fc6a08c8f4bf-C001-47", "0c233d68fb2ccdf033fc6a08c8f4bf-C001-52" ] }, "@DIF@": { "gold_contexts": [ [ "0c233d68fb2ccdf033fc6a08c8f4bf-C001-39", "0c233d68fb2ccdf033fc6a08c8f4bf-C001-40", "0c233d68fb2ccdf033fc6a08c8f4bf-C001-41" ] ], "cite_sentences": [ "0c233d68fb2ccdf033fc6a08c8f4bf-C001-40" ] } } }, "ABC_e8c60c9fc3a2d74df632f3b423adae_6": { "x": [ { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-2", "text": "The 
chain-structured long short-term memory (LSTM) has been shown to be effective in a wide range of problems such as speech recognition and machine translation." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-3", "text": "In this paper, we propose to extend it to tree structures, in which a memory cell can reflect the history memories of multiple child cells or multiple descendant cells in a recursive process." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-4", "text": "We call the model S-LSTM, which provides a principled way of considering long-distance interaction over hierarchies, e.g., language or image parse structures." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-5", "text": "We leverage the model for semantic composition to understand the meaning of text, a fundamental problem in natural language understanding, and show that it outperforms a state-of-the-art recursive model by replacing its composition layers with the S-LSTM memory blocks." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-6", "text": "We also show that utilizing the given structures is helpful in achieving a performance better than that without considering the structures." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-7", "text": "----------------------------------" }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-9", "text": "Recent years have seen a revival of the long short-term memory (LSTM) (Hochreiter & Schmidhuber, 1997) , with its effectiveness being demonstrated on a wide range of problems such as speech recognition (Graves et al., 2013) , machine translation (Sutskever et al., 2014; Cho et al., 2014) , and image-to-text conversion (on February 6th, 2015, this work was submitted to the International Conference on Machine Learning (ICML))."
}, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-10", "text": "among many others, in which history is summarized and coded in the memory cell in a full-order time sequence." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-11", "text": "Recursion is a fundamental process associated with many problems-a recursive process and hierarchical structure so formed are common in different modalities." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-12", "text": "For example, the semantics of sentences in human languages is believed to be carried not merely by a linear concatenation of words; instead, sentences have parse structures (Manning & Sch\u00fctze, 1999) ." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-13", "text": "Image understanding, as another example, benefits from recursive modeling over structures, which yielded the state-of-the-art performance on tasks like scene segmentation (Socher et al., 2011) ." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-14", "text": "In this paper, we extend LSTM to tree structures, in which we learn memory cells that can reflect the history memories of multiple child cells and hence multiple descendant cells." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-15", "text": "We call the model S-LSTM." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-16", "text": "Compared with previous recursive neural networks (Socher et al., 2013; 2012) , S-LSTM has the potential to avoid gradient vanishing and hence may model long-distance interaction over trees." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-17", "text": "This is a desirable characteristic as many of such structures are deep." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-18", "text": "S-LSTM can be considered as bringing the merits of a recursive neural network and a recurrent neural network together. We evaluate it on the Stanford Sentiment Tree Bank (Socher et al., 2013) to determine the sentiment for different granularities of phrases in a tree."
}, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-19", "text": "The dataset has favorable properties: in addition to being a benchmark for much previous work, it provides human annotations at all nodes of the trees, enabling us to comprehensively explore the properties of S-LSTM." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-20", "text": "We experimentally show that S-LSTM outperforms a state-of-the-art recursive model by simply replacing the original tensor-enhanced composition with the S-LSTM memory block we propose here." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-21", "text": "We show that utilizing the given structures is helpful in achieving a better performance than that without considering the structures." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-22", "text": "----------------------------------" }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-23", "text": "**RELATED WORK**" }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-24", "text": "Recursive neural networks Recursion is a fundamental process in different modalities." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-25", "text": "In recent years, recursive neural networks (RvNN) have been introduced and demonstrated to achieve state-of-the-art performances on different problems such as semantic analysis in natural language processing and image segmentation (Socher et al., 2013; 2011) ." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-26", "text": "These networks are defined over recursive tree structures-a tree node is a vector computed from its children." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-27", "text": "In a recursive fashion, the information from the leaf nodes of a tree and its internal nodes are combined in a bottom-up manner through the tree." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-28", "text": "Derivatives of errors are computed with backpropagation over structures (Goller & K\u00fcchler, 1996) ."
}, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-29", "text": "In addition, the literature has also included many other efforts of applying feedforward-based neural networks over structures, including (Goller & K\u00fcchler, 1996; Chater, 1992; Starzyk et al.; Hammer et al., 2004) , amongst others." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-30", "text": "For instance, Legrand and Collobert leverage neural networks over greedy syntactic parsing (Pinheiro & Collobert, 2014) ." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-31", "text": "In (Irsoy & Cardie, 2014) , a deep recursive neural network is proposed." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-32", "text": "Nevertheless, over the often deep structures, the networks are potentially subject to the vanishing gradient problem, resulting in difficulties in leveraging long-distance dependencies in the structures." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-33", "text": "In this paper, we propose the S-LSTM model that wires memory blocks in recursive structures." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-34", "text": "We compare our model with the RvNN models presented in (Socher et al., 2013) , as we directly replaced the tensor-enhanced composition layer at each tree node with a S-LSTM memory block." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-35", "text": "We show the advantages of our proposed model in achieving significantly better results." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-36", "text": "Recurrent neural networks and LSTM Unlike a feedforward network, a recurrent neural network (RNN) shares its hidden states across time." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-37", "text": "The sequential history is summarized in a hidden vector." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-38", "text": "RNNs also suffer from the decaying of gradient or, less frequently, the blowing-up of gradient problem."
}, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-39", "text": "LSTM replaces the hidden vector of a recurrent neural network with memory blocks which are equipped with gates; it can in principle keep long-term memory by training proper gating weights (refer to (Graves, 2008) for intuitive illustrations and good discussions), and it has proved very useful in practice, achieving the state of the art on a range of problems including speech recognition (Graves et al., 2013) and digit handwriting recognition (Liwicki et al., 2007; Graves, 2012) , and achieving interesting results on statistical machine translation (Sutskever et al., 2014; Cho et al., 2014) and music composition (Eck & Schmidhuber, 2002b; a) ." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-40", "text": "In (Graves et al., 2013) , a deep LSTM network achieved the state-of-the-art results on the TIMIT phoneme recognition benchmark." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-41", "text": "In (Sutskever et al., 2014; Cho et al., 2014) , a pair of LSTM networks are trained to encode and decode human language for automatic machine translation, which is particularly effective for the more challenging long-sentence translation." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-42", "text": "In (Liwicki et al., 2007; Graves, 2012) , LSTM networks are found to be very useful for digit handwriting recognition because of the network's capability of memorizing context information in a long sequence." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-43", "text": "In (Eck & Schmidhuber, 2002b; a) , LSTM networks are trained to effectively capture global structures of the temporal data." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-44", "text": "With the memory cells, LSTM is able to keep track of temporally distant events that indicate global music structures."
}, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-45", "text": "As a result, LSTM can be successfully trained to compose music where other RNNs have failed." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-46", "text": "Although promising results have been observed by applying chain-structured LSTM, many other interesting problems are inherently associated with input structures that are more complicated than a sequence." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-47", "text": "For example, the meaning of sentences in human languages is believed to be carried not merely by a linear sequence of words; instead, meaning is thought to interweave with structures." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-48", "text": "While a sequential application of LSTM may capture structural information implicitly, in practice it sometimes lacks the claimed power." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-49", "text": "For example, even simply reversing the input sequences may result in significant differences in modeling performances, in tasks such as machine translation and speech recognition." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-50", "text": "Unlike in previous work, we propose here to directly wire memory blocks in recursive structures." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-51", "text": "We show the proposed S-LSTM model does utilize the structures and achieves results better than those ignoring such a priori structures." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-52", "text": "----------------------------------" }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-53", "text": "**THE MODEL**" }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-54", "text": "Model brief In this paper, we extend LSTM to structures, in which a memory cell can reflect the history memories of multiple child cells and hence multiple descendant cells in a hierarchical structure."
}, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-55", "text": "As intuitively shown in Figure 1 , the root of the tree can in principle consider information from long-distance interactions over the tree (in this figure, the gray and light-blue leaves)." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-56", "text": "In the figure, the small circle (\"\u2022\") or short line (\"\u2212\") at each arrowhead indicates pass and block of information, respectively." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-57", "text": "Note that the figure shows a binary case, while in real models a soft version of gating is applied, where a gating signal is in the range of [0, 1] , often enforced with a logistic sigmoid function." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-58", "text": "Through learning the gating signals, as detailed later in this section, S-LSTM provides a principled way of considering long-distance interplays over the input structures." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-59", "text": "Figure 1 ." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-60", "text": "An example of S-LSTM, a long-short term memory network on tree structures." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-61", "text": "A tree node can consider information from multiple descendants." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-62", "text": "Information from the other nodes, shown in white, is blocked." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-63", "text": "The small circle (\"\u2022\") or short line (\"\u2212\") at each arrowhead indicates a pass or block of information, respectively, while in the real model a soft version of gating is applied." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-64", "text": "Each node in Figure 1 is composed of a S-LSTM memory block." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-65", "text": "We present a specific wiring of such a block in Figure 2 ."
}, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-66", "text": "Each memory block contains one input gate and one output gate." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-67", "text": "The number of forget gates depends on the structure, i.e., the number of children of a node." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-68", "text": "In this paper, we assume there are two children at each node, the same as in (Socher et al., 2013) , and therefore we use their data in our experiments." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-69", "text": "That is, we have two forget gates." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-70", "text": "Extension of the model to handle more children is rather straightforward." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-71", "text": "----------------------------------" }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-72", "text": "**THE MEMORY BLOCK IN EACH NODE**" }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-73", "text": "As shown in the figure, the hidden vectors of the two children, denoted as h L t\u22121 and h R t\u22121 , are taken in as the input of the current block, each multiplied by a different W in the formulas below." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-74", "text": "Different from the process in a regular LSTM, the cell here considers the copies from both children's cell vectors (c L t\u22121 , c R t\u22121 ), gated with separate forget gates." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-75", "text": "The left and right forget gates can be controlled independently, allowing the pass-through of information from children's cell vectors." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-76", "text": "The output gate o t considers the hidden vectors from the children and the current cell vector."
}, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-77", "text": "In turn, the hidden vector h t and the cell vector c t of the current block are passed to the parent and are used depending on whether the current block is a left or right child of its parent." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-78", "text": "In this way, the memory cell, through merging the gated cell vectors of the children, can reflect multiple direct or indirect descendant cells." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-79", "text": "As a result, the long-distance interplays over the structures can be captured." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-80", "text": "More specifically, the forward computation of a S-LSTM memory block is specified in the following equations." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-81", "text": "where \u03c3 is the element-wise logistic function used to confine the gating signals to be in the range of [0, 1]; f L and f R are the left and right forget gates, respectively; b denotes the biases and W the network weight matrices; the sign \u2297 denotes the Hadamard product, i.e., the element-wise product." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-82", "text": "The subscripts of the weight matrices indicate what they are used for." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-83", "text": "For example, W ho is a matrix mapping a hidden vector to an output gate." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-84", "text": "Backpropagation over structures During training, the gradient of the objective function with respect to each parameter can be calculated efficiently via backpropagation over structures (Goller & K\u00fcchler, 1996; Socher et al., 2013) ."
}, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-85", "text": "The major difference from that of (Socher et al., 2013) is that we use LSTM-like backpropagation, where, unlike in a regular LSTM, the pass of error needs to discriminate between the left and right children (or, in a topology with more than two children, between the children)." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-86", "text": "Obtaining the backprop formulas is tedious, but we list them below to facilitate duplication of our work 2 ." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-87", "text": "We will discuss the specific objective function later in experiments." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-88", "text": "For each memory block, assume that the error passed to the hidden vector is \u01eb" }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-89", "text": "2 The code will be published at www.icml-placeholderonly.com. Here \u03c3 \u2032 (x) is the element-wise derivative of the logistic function over vector x. Since it can be computed with the activation of x, we abuse the notation a bit to write it over the activated vectors in these equations." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-90", "text": "\u01eb c t is the derivative over the cell vector." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-91", "text": "So if the current node is the left child of its parent, we use Equation (13) to calculate \u01eb c t , otherwise Formula (14) is used:" }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-92", "text": "where g \u2032 (x) is the element-wise derivative of the tanh function." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-93", "text": "It can also be directly calculated from the tanh activation of x. The superscript T over the weight matrices means matrix transpose."
}, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-94", "text": "With derivatives at each gate computed, the derivatives of the weight matrices used in Formulas (1)-(7) can be calculated accordingly, which is omitted here." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-95", "text": "We checked the correctness of the S-LSTM implementation with the standard approximated gradient approach." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-96", "text": "Objective over trees The objective function defined over structures can be complicated, which could consider the output structures depending on the properties of the problem." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-97", "text": "Following (Socher et al., 2013) , the overall objective function we used to learn S-LSTM in this paper is simply minimizing the overall cross-entropy errors, a sum of the errors at all nodes." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-98", "text": "----------------------------------" }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-99", "text": "**EXPERIMENT SET-UP**" }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-100", "text": "As discussed earlier, recursion is a basic process inherent to many problems." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-101", "text": "In this paper, we leverage the proposed model to solve semantic composition for the meanings of pieces of text, a fundamental problem in understanding human languages." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-102", "text": "We specifically attempt to determine the sentiment of different granularities of phrases in a tree, within the Stanford Sentiment Tree Bank benchmark data (Socher et al., 2013) ."
}, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-103", "text": "In obtaining the sentiment of a long piece of text, early work often factorized the problem to consider smaller pieces of component words or phrases with bag-of-words or bag-of-phrases models (Pang & Lee, 2008; Liu & Zhang, 2012) ." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-104", "text": "More recent work has started to model composition (Moilanen & Pulman, 2007; Choi & Cardie, 2008; Socher et al., 2012; Kalchbrenner et al., 2014) , a more principled approach to modeling the formation of semantics." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-105", "text": "In this paper, we put the proposed LSTM memory blocks at tree nodes-we replaced the tensor-enhanced composition layer at each tree node presented in (Socher et al., 2013) with a S-LSTM memory block." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-106", "text": "We used the same dataset, the Stanford Sentiment Tree Bank, to evaluate the performances of the models." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-107", "text": "In addition to being a benchmark for much previous work, the data provide human annotations at all nodes of the trees, facilitating a more comprehensive exploration of the properties of S-LSTM." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-108", "text": "----------------------------------" }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-109", "text": "**DATA SET**" }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-159", "text": "In Figure 4 , we further depict the performances of models on different levels of nodes in the trees." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-110", "text": "The Stanford Sentiment Tree Bank (Socher et al., 2013) contains about 11,800 sentences from the movie reviews that were originally discussed in (Pang & Lee, 2005) ."
}, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-111", "text": "The sentences were parsed with the Stanford parser (Klein & Manning, 2003) ." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-112", "text": "Phrases at all the tree nodes were manually annotated with sentiment values." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-113", "text": "We use the same split of the training and test data as in (Socher et al., 2013) to predict the sentiment categories of the roots (sentences) and all phrases (including sentences)." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-114", "text": "For the root sentiment, the training, development, and test sets contain 8544, 1101, and 2210 sentences, respectively." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-115", "text": "The phrase sentiment task includes 318582, 41447, and 82600 phrases for the three sets." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-116", "text": "Following (Socher et al., 2013) , we also use the classification accuracy to measure the performances." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-117", "text": "----------------------------------" }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-118", "text": "**TRAINING DETAILS**" }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-119", "text": "As mentioned before, we follow (Socher et al., 2013) to minimize the cross-entropy error for all nodes or for roots only, depending on specific experiment settings." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-120", "text": "For all phrases, the error is calculated as a regularized sum:" }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-121", "text": "where y sen i \u2208 R c\u00d71 is the predicted distribution and t i \u2208 R c\u00d71 the target distribution."
}, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-122", "text": "c is the number of classes or categories, and j \u2208 c denotes the j-th element of the multinomial target distribution; i iterates over nodes, \u03b8 are model parameters, and \u03bb is a regularization parameter." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-123", "text": "We tuned our model against the development data set as split in (Socher et al., 2013) ." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-124", "text": "----------------------------------" }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-125", "text": "**RESULTS**" }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-126", "text": "To understand the modeling advantages of S-LSTM over the structures, we conducted four sets of experiments." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-127", "text": "----------------------------------" }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-128", "text": "**DEFAULT SETTING**" }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-129", "text": "In the default setting, we conducted experiments as in (Socher et al., 2013) ." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-130", "text": "Table 1 shows the accuracies of different models on the test set of the Stanford Sentiment Tree Bank." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-131", "text": "We present the results on 5-category sentiment prediction at both the sentence level (i.e., the ROOTS column in the table) and for all phrases including roots (the PHRASES column) 3 ." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-132", "text": "In Table 1 , NB and SVM are naive Bayes and support vector machine classifiers, respectively; RvNN corresponds to RNN in (Socher et al., 2013) ." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-133", "text": "As described earlier, we refer to recursive neural networks as RvNN to avoid confusion with recurrent neural networks."
}, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-134", "text": "RNTN is different from RvNN in that when merging two nodes to obtain the hidden vector of their parent, a tensor is used to obtain the second-degree polynomial interactions." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-135", "text": "Table 1 shows that S-LSTM achieved the best predictive performance, when compared to all the models reported in (Socher et al., 2013) ." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-136", "text": "The S-LSTM results reported here were obtained by setting the size of the hidden units to be 100, batch size to be 10, and learning rate to be 0.1." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-137", "text": "In our experiments, we only tuned these hyper-parameters, and we feel that finer tuning, such as discriminating the classification weights between the leaves (word embedding) and other nodes, using different numbers of hidden units for the memory blocks (e.g., for the hidden layers of words), or different initializations of word embedding, may further improve the performances reported here." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-138", "text": "To evaluate the S-LSTM model's convergence behavior, Figure 3 depicts the converging time during training." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-139", "text": "More specifically, we show two sub-figures: one for roots (upper) and one for all phrases (lower); S-LSTM has much fewer parameters than RNTN, and the forward and backward propagation can be computed efficiently." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-140", "text": "----------------------------------" }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-141", "text": "**MORE REAL-LIFE SETTINGS**" }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-142", "text": "We further compare S-LSTM with RNTN in two more experimental settings."
}, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-143", "text": "In the first setting we only keep the training signals at the roots to train S-LSTM and RNTN, depicted as models (1) and (2) in Table 2 ." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-144", "text": "ROOT LBLS beside the model names stands for root labels; that is, only the gold labels of the sentence level are used to train the model." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-145", "text": "In most sentiment analysis circumstances, phrase level annotations are not available: most nodes in a tree are fragments that may not be that interesting; e.g., the fragment \"of a good movie\" 4 ." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-146", "text": "Also, annotating all phrases is expensive." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-147", "text": "However, these should not be regarded as comments on the value of the Sentiment Tree Bank." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-148", "text": "Detailed annotations in the tree bank enable much interesting work to be possible, e.g., the study of the effect of negation in changing sentiment (Zhu et al., 2014) ." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-149", "text": "The second setting, corresponding to models (3) and (4) in Table 2 , is only slightly different, in which we keep annotation for the tree leaves as well, to simulate that a sentiment lexicon is available and covers all leaves (words) (LEAF LBLS beside the model names stands for leaf labels), and so there is no out-of-vocabulary concern." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-150", "text": "Using real sentiment lexicons is expected to yield a performance between the two settings here." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-151", "text": "Results in the table show that in both settings, S-LSTM outperforms RNTN by a large margin."
}, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-152", "text": "When only root labels are used to train the models, S-LSTM obtains an accuracy of 43.5, compared with 29.1 of RNTN." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-153", "text": "When the leaf labels are also used, S-LSTM achieves an accuracy of 44.1 and RNTN 34.9." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-154", "text": "All these improvements are statistically significant (p < 0.05)." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-155", "text": "For the RNTN, without supervising signals from the internal nodes, the composition parameters may not be learned well, potentially because the tensor has much more parameters to learn." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-156", "text": "On the other hand, through controlling its gates, the S-LSTM shows a very good ability to learn from the trees." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-157", "text": "Table 2 ." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-158", "text": "Performances of models trained with only root labels (the first two rows) and models that use both root and leaf labels (the last two rows)." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-160", "text": "In the Figure, the x-axis corresponds to different depths or lengths and y-axis is accuracy." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-161", "text": "The depth here is defined as the longest distance between the root of a phrase and their descendant leafs." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-162", "text": "The Length is simply the number of words of a node, where depth is not necessarily to be length-e.g., a balanced tree with 4 leafs has different depths than the unbalanced tree with the same number of leafs." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-163", "text": "The trends of the two figure are similar." 
}, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-164", "text": "In both figures, S-LSTM performs better at all depths, showing its advantages on nodes at depth." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-165", "text": "As the deeper levels of the tree tend to have more complicated syntax and semantics, S-LSTM can model such more complicated syntax and semantics better." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-166", "text": "Explicit structures vs. no structures Some efforts in the literature attempt to learn distributed representation by utilizing input structures when available, and others prefer to assume chain-structured recurrent neural networks can actually capture the structures implicitly though a linear coding process." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-167", "text": "In this paper, we attempt to give some empirical evidences in our experiment setting by comparing several different models." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-168", "text": "First, a special case for the S-LSTM model is considered, in which no sentential structures are given." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-169", "text": "Instead, words are read from left to right and combined in that order." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-170", "text": "We call it left recursive S-LSTM, or S- LSTM-LR in short." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-171", "text": "Similarly, we also experimented with a right recursive S-LSTM, S-LSTM-RR, in which words are read from right to left instead." 
}, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-172", "text": "Since for these models, phrase-level training signals are not available-the nodes here do not correspond to that in the original Standford Sentiment Tree Bank, but the roots and leafs annotations are still the same, so we run two versions of our experiments: one uses only training signals from roots and the other includes also leaf annotations." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-173", "text": "It can be observed from Table 3 that the given parsing structure helps improve the predictive accuracy." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-174", "text": "In the case of using only root labels, the left recursive S-LSTM and right recursive S-LSTM have similar performance (40.2 and 40.3, respectively), both inferior to S-LSTM (43.5)." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-175", "text": "When using gold leaf labels, the gaps are smaller, but still, using the parse structure are better." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-176", "text": "Note that in real applications, where there is no out-of-vocabulary issue (i.e., some leafs are not seen in the sentiment dictionaries), the difference between S-LSTM and the recursive version without using the structures are expected to be between the gaps we observed here." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-177", "text": "----------------------------------" }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-178", "text": "**CONCLUSIONS**" }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-179", "text": "We aim to extend the conventional chain-structured long short-term memory to explicitly consider structures." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-180", "text": "In this paper we particularly study tree structures, in which Table 3 ." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-181", "text": "Performances of models that do not use the given sentence structures." 
}, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-182", "text": "S-LSTM-LR is a degenerated version of S-LSTM that reads input words from left to right, and S-LSTM-RR reads words from right to left." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-183", "text": "the proposed S-LSTM memory cell can reflect the history memories of multiple descendants through gated copying of memory vectors." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-184", "text": "The model provides a principled way to consider long-distance interplays over the structures." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-185", "text": "We leveraged the model to learn distributed sentiment representations for texts, and showed that it outperforms a stateof-the-art recursive model by replacing its tensor-enhanced composition layers with the S-LSTM memory blocks." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-186", "text": "We showed that the structure information is useful in helping S-LSTM achieve the state-of-the-art performance." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-187", "text": "----------------------------------" }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-188", "text": "**MODELS**" }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-189", "text": "The research community seems to contain two lines of wisdom; one attempts to learn distributed representation by utilizing structures when available, and the other prefers to believe recurrent neural networks can actually capture the structures implicitly through a linear-chain coding process." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-190", "text": "In this paper, we also attempt to give some empirical evidences toward answering the question." }, { "sent_id": "e8c60c9fc3a2d74df632f3b423adae-C001-191", "text": "It is at least for the settings of our experiments that the explicit input structures are helpful in inferring the high-level (e.g., root) semantics." 
} ], "y": { "@MOT@": { "gold_contexts": [ [ "e8c60c9fc3a2d74df632f3b423adae-C001-14", "e8c60c9fc3a2d74df632f3b423adae-C001-16" ], [ "e8c60c9fc3a2d74df632f3b423adae-C001-18", "e8c60c9fc3a2d74df632f3b423adae-C001-19" ] ], "cite_sentences": [ "e8c60c9fc3a2d74df632f3b423adae-C001-16", "e8c60c9fc3a2d74df632f3b423adae-C001-18", "e8c60c9fc3a2d74df632f3b423adae-C001-19" ] }, "@DIF@": { "gold_contexts": [ [ "e8c60c9fc3a2d74df632f3b423adae-C001-14", "e8c60c9fc3a2d74df632f3b423adae-C001-16" ], [ "e8c60c9fc3a2d74df632f3b423adae-C001-34", "e8c60c9fc3a2d74df632f3b423adae-C001-35" ], [ "e8c60c9fc3a2d74df632f3b423adae-C001-85" ], [ "e8c60c9fc3a2d74df632f3b423adae-C001-134" ], [ "e8c60c9fc3a2d74df632f3b423adae-C001-135" ], [ "e8c60c9fc3a2d74df632f3b423adae-C001-172" ] ], "cite_sentences": [ "e8c60c9fc3a2d74df632f3b423adae-C001-16", "e8c60c9fc3a2d74df632f3b423adae-C001-34", "e8c60c9fc3a2d74df632f3b423adae-C001-85", "e8c60c9fc3a2d74df632f3b423adae-C001-134", "e8c60c9fc3a2d74df632f3b423adae-C001-135", "e8c60c9fc3a2d74df632f3b423adae-C001-172" ] }, "@BACK@": { "gold_contexts": [ [ "e8c60c9fc3a2d74df632f3b423adae-C001-25", "e8c60c9fc3a2d74df632f3b423adae-C001-26" ], [ "e8c60c9fc3a2d74df632f3b423adae-C001-110" ] ], "cite_sentences": [ "e8c60c9fc3a2d74df632f3b423adae-C001-25", "e8c60c9fc3a2d74df632f3b423adae-C001-26", "e8c60c9fc3a2d74df632f3b423adae-C001-110" ] }, "@USE@": { "gold_contexts": [ [ "e8c60c9fc3a2d74df632f3b423adae-C001-34" ], [ "e8c60c9fc3a2d74df632f3b423adae-C001-68" ], [ "e8c60c9fc3a2d74df632f3b423adae-C001-84" ], [ "e8c60c9fc3a2d74df632f3b423adae-C001-97" ], [ "e8c60c9fc3a2d74df632f3b423adae-C001-102" ], [ "e8c60c9fc3a2d74df632f3b423adae-C001-106", "e8c60c9fc3a2d74df632f3b423adae-C001-107" ], [ "e8c60c9fc3a2d74df632f3b423adae-C001-113" ], [ "e8c60c9fc3a2d74df632f3b423adae-C001-116" ], [ "e8c60c9fc3a2d74df632f3b423adae-C001-119" ], [ "e8c60c9fc3a2d74df632f3b423adae-C001-123" ], [ "e8c60c9fc3a2d74df632f3b423adae-C001-129" ] ], "cite_sentences": [ 
"e8c60c9fc3a2d74df632f3b423adae-C001-34", "e8c60c9fc3a2d74df632f3b423adae-C001-68", "e8c60c9fc3a2d74df632f3b423adae-C001-84", "e8c60c9fc3a2d74df632f3b423adae-C001-97", "e8c60c9fc3a2d74df632f3b423adae-C001-102", "e8c60c9fc3a2d74df632f3b423adae-C001-106", "e8c60c9fc3a2d74df632f3b423adae-C001-107", "e8c60c9fc3a2d74df632f3b423adae-C001-113", "e8c60c9fc3a2d74df632f3b423adae-C001-116", "e8c60c9fc3a2d74df632f3b423adae-C001-119", "e8c60c9fc3a2d74df632f3b423adae-C001-123", "e8c60c9fc3a2d74df632f3b423adae-C001-129" ] }, "@EXT@": { "gold_contexts": [ [ "e8c60c9fc3a2d74df632f3b423adae-C001-105" ] ], "cite_sentences": [ "e8c60c9fc3a2d74df632f3b423adae-C001-105" ] }, "@UNSURE@": { "gold_contexts": [ [ "e8c60c9fc3a2d74df632f3b423adae-C001-130" ], [ "e8c60c9fc3a2d74df632f3b423adae-C001-132" ], [ "e8c60c9fc3a2d74df632f3b423adae-C001-147" ] ], "cite_sentences": [ "e8c60c9fc3a2d74df632f3b423adae-C001-130", "e8c60c9fc3a2d74df632f3b423adae-C001-132", "e8c60c9fc3a2d74df632f3b423adae-C001-147" ] }, "@FUT@": { "gold_contexts": [ [ "e8c60c9fc3a2d74df632f3b423adae-C001-148" ] ], "cite_sentences": [ "e8c60c9fc3a2d74df632f3b423adae-C001-148" ] }, "@SIM@": { "gold_contexts": [ [ "e8c60c9fc3a2d74df632f3b423adae-C001-172" ] ], "cite_sentences": [ "e8c60c9fc3a2d74df632f3b423adae-C001-172" ] } } }, "ABC_7c8ec9e38bf4c0c458d60af014e102_6": { "x": [ { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-2", "text": "Choi, Wiemer-Hastings, and Moore (2001)" }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-3", "text": "----------------------------------" }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-4", "text": "**IMPROVING TEXT SEGMENTATION BY USING COMPLEMENTARY SEMANTIC KNOWLEDGE**" }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-96", "text": "The Within space is based on the whole 1997-1998 corpus." 
}, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-5", "text": "For the last ten years, many methods have been proposed for the segmentation of texts in topically related units on the basis of lexical cohesion." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-6", "text": "The major distinction between these methods is in the contrast between the approaches based exclusively on the information contained in the text to be segmented, such as lexical repetition (e.g., Choi 2000; Hearst 1997; Heinonen 1998; Kehagias, Pavlina, and Petridis 2003; Utiyama and Isahara 2001) , and those approaches that rest on complementary semantic knowledge extracted from dictionaries and thesauruses (e.g., Kozima 1993; Lin et al. 2004; Morris and Hirst 1991) , or from collocations collected in large corpora (Bolshakov and Gelbukh 2001; Brants, Chen, and Tsochantaridis 2002; Choi et al. 2001; Ferret 2002; Kaufmann 1999; Ponte and Croft 1997) ." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-7", "text": "According to their authors, methods that use additional knowledge allow for a solution to problems encountered when sentences belonging to a unique topic do not share common words due to the use of hyperonyms or synonyms and allow words that are semantically related to be taken as positive evidence for topic continuity." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-8", "text": "Empirical arguments in favor of these methods have been provided recently by Choi et al. (2001) in a study using Latent Semantic Analysis (Latent Semantic Indexing, Deerwester et al. 1990 ) to extract a semantic space from a corpus allowing determination of the similarity of meanings of words, sentences, or paragraphs." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-9", "text": "By comparing the accuracy of the very same algorithm according to whether or not it takes into account complementary semantic knowledge, they were able to show the benefit derived from such knowledge." 
}, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-10", "text": "However, implications of Choi et al.'s study for text segmentation and for the use of LSA in natural language processing are unclear due to the methodology employed." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-11", "text": "In their experiments, semantic knowledge was acquired from a corpus containing the materials to be segmented in the test phase." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-12", "text": "One could speculate whether the largest part of the benefit obtained thanks to the addition of semantic knowledge was not due to this hyper-specificity of the LSA corpus (i.e., the inclusion of the test materials)." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-13", "text": "If this were the case, it would call into question the possibility of using LSA to acquire generic semantic knowledge that can be used to segment new texts." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-14", "text": "A priori, the problem does not seem serious for at least two reasons." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-15", "text": "First, Choi et al.'s segmentation procedure does not rely on supervised learning in which a system learns how to efficiently segment a text from training data." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-16", "text": "The LSA corpus only intervenes in an indirect manner by allowing the extraction of semantic proximities between words that are then used to compute similarities between parts of the text to segment (see Section 2 for details)." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-17", "text": "Second, Choi et al. employed a large number of small test samples to evaluate their algorithm, each making up-on average-0.15% of the LSA corpus." 
}, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-18", "text": "The present study shows, however, that the presence of the test materials in the LSA corpus has an important effect, but also that the generic semantic knowledge derived from large corpora clearly improves the segmentation accuracy." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-19", "text": "This conclusion is drawn from two experiments in which the presence or absence of the test materials in the LSA corpus is manipulated." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-20", "text": "The first experiment is based on the original materials from Choi et al., which consisted of a small corpus (1,000,000 words)." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-21", "text": "The second experiment is based on a much larger corpus (25,000,000 words)." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-22", "text": "Before reporting these experiments, Choi's algorithm and the use of LSA within this framework are described." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-23", "text": "----------------------------------" }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-24", "text": "**THE TWO VERSIONS OF CHOI'S ALGORITHM**" }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-25", "text": "The segmentation algorithm proposed by Choi (2000) is made up of the three steps usually found in any segmentation procedure based on lexical cohesion." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-26", "text": "Firstly, the document to be segmented is divided into minimal textual units, usually sentences." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-27", "text": "Then, a similarity index between every pair of adjacent units is calculated." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-28", "text": "Each raw similarity value is cast on an ordinal scale by taking the proportion of neighboring values that are smaller than it." 
}, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-29", "text": "Lastly, the document is segmented recursively according to the boundaries between the units that maximize the sum of the average similarities inside the segments thus comprised (divisive clustering)." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-30", "text": "The step of greatest interest here is the one that calculates the inter-sentence similarities." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-31", "text": "The procedure initially proposed by Choi (2000) , C99, rests exclusively on the information contained in the text to be segmented." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-32", "text": "According to the vector space model, each sentence is represented by a vector of word frequency count, and the similarity between two sentences is calculated by means of the cosine measure between the corresponding vectors." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-33", "text": "In a first evaluation based on the procedure described below, Choi showed that its algorithm outperforms several other approaches such as TextTiling (Hearst 1997) and Segmenter (Kan, Klavans, and McKeown 1998) ." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-34", "text": "Choi et al. (2001) claimed that it was possible to improve the inter-sentence similarities index by taking into account the semantic proximities between words estimated on the basis of Latent Semantic Analysis (LSA)." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-35", "text": "Briefly stated, LSA rests on the thesis that analyzing the contexts in which words occur permits an estimation of their similarity in meaning (Deerwester et al. 1990; Landauer and Dumais 1997) ." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-36", "text": "The first step in the analysis is to construct a lexical table containing an information-theoretic weighting of the frequencies of the words occurrence in each document (i.e. 
sentence, paragraph, or text) included in the corpus." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-37", "text": "This frequency table undergoes a Singular Value Decomposition that extracts the most important orthogonal dimensions, and, consequently, discards the small sources of variability in term usage." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-38", "text": "After this step, every word is represented by a vector of weights indicating its strength of association with each of the dimensions." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-39", "text": "This makes it possible to measure the semantic proximity between any two words by using, for instance, the cosine measure between the corresponding vectors." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-40", "text": "Proximity between any two sentences (or any other textual units), even if these sentences were not present in the original corpus, can be estimated by computing a vector for each of these units-which corresponds to the weighted sum of the vectors of the words that compose it-and then by computing the cosine between these vectors (Deerwester et al. 1990 )." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-41", "text": "Choi et al. (2001) have shown that using this procedure to compute the inter-sentence similarities results in the previous version of the algorithm (based solely on word repetition) being outperformed." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-42", "text": "----------------------------------" }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-43", "text": "**EXPERIMENT 1**" }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-44", "text": "The aim of this experiment is to determine the impact of the presence of the test materials in the LSA corpus on the results obtained by Choi et al. (2001) ." 
}, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-45", "text": "Does semantic knowledge acquired from a corpus that does not include the test materials also improve the segmentation accuracy?" }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-46", "text": "----------------------------------" }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-47", "text": "**METHOD**" }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-48", "text": "This experiment was based on the procedure and test materials designed by Choi (2000) , which was also used by several authors as a benchmark for comparing segmentation systems (Brants et al. 2002; Ferret 2002; Kehagias et al. 2003; Utiyama and Isahara 2001) ." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-49", "text": "The task consists in finding the boundaries between concatenated texts." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-50", "text": "Each test sample is a concatenation of ten text segments." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-51", "text": "Each segment consisted in the first n sentences of a randomly selected text from two sub-sections of the Brown corpus." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-52", "text": "For the present experiment, I used the most general test materials built by Choi (2000) , in which the size of the segments within each sample varies randomly from 3 to 11 sentences." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-53", "text": "It is composed of 400 samples." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-54", "text": "The analysis related to the comparison between the accuracy of the algorithm when the test materials were included in the LSA corpus (Within) and when it was not (Without)." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-55", "text": "One Within semantic space, which corresponds to the one used by Choi et al., was built using the entire Brown corpus as the LSA corpus." 
}, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-56", "text": "Four hundred different Without spaces were built, one for each test sample, by each time removing from the Brown corpus only the sentences that make this sample." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-57", "text": "To extract the LSA space and to apply the segmentation algorithm, a series of parameters had to be set." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-97", "text": "Four hundred different Without spaces were built as described in Experiment 1." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-58", "text": "First of all, paragraphs were used as documents for building the lexical tables because Choi et al. observed that such middle-sized units were more effective than shorter units (i.e., sentences)." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-59", "text": "The words on Choi's stoplist were removed, as were those that appeared only once in the whole corpus." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-60", "text": "Words were not stemmed, as in Choi et al. (2001) ." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-61", "text": "To build the LSA space, the singular value decomposition was realized using the program SVDPACKC (Berry 1992; Berry et al. 1993) , and the first 300 singular vectors were retained." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-62", "text": "Concerning the segmentation algorithm, I used the version in which the number of boundaries to be found is imposed, and thus fixed at nine." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-63", "text": "An 11 \u00d7 11 rank mask was used for the ordinal transformation, as recommended by Choi (2000) ." 
}, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-64", "text": "----------------------------------" }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-65", "text": "**RESULTS**" }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-66", "text": "The segmentation accuracy was evaluated by means of the index reported by Choi et al. (2001) : the Pk measure of segmentation inaccuracy (Beeferman, Berger, and Lafferty 1999) , which gives the proportion of sentences that are wrongly predicted to belong to the same segment or wrongly predicted to belong to different segments." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-67", "text": "I also report, for potential future comparison, Pevzner and Hearst's (2002) WindowDiff index, which remedies several problems in the Pk measure." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-68", "text": "Results are provided in Table 1 ." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-69", "text": "1 Compared with the Within condition, the performance in the Without condition is definitely worse, as confirmed by t tests for paired sample (each test sample being used as an observation) that are significant for an alpha smaller than 0.0001." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-70", "text": "The C99 algorithm, which does not employ LSA to estimate the similarities between the sentences, produces a Pk of 0.13 (Choi et al. 2001, Table 3, line 3: No stemming) ." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-71", "text": "It appears that although the Without condition is still better than C99, the benefit is very small." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-72", "text": "Before concluding that the presence of the test materials in the LSA corpus strongly modified the semantic space, an alternative explanation must be considered." 
}, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-73", "text": "The loss of accuracy in the Without condition could potentially be due to the fact that the words indexed in the corresponding LSA spaces are systematically slightly fewer than those present in the Within space." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-74", "text": "Removing each test sample led to the loss-on averageof 23 different words out of the 25,847 words that are indexed in the Within space." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-75", "text": "In the Without spaces, these words are no longer available to estimate the similarity of the sentences, whereas they are employed in the Within space." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-98", "text": "Finally, a Former space was built from the 1995-1996 corpus." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-76", "text": "In order to determine whether this factor can explain the difference in performance, a complementary analysis was carried out on the Within space in which, for each test sample, only the words present in the corresponding Without space were taken into account." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-77", "text": "In this manner, only the semantic relations can come into play." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-78", "text": "Compared with the complete Within space, almost no drop in performance was observed: the Pk error rate went from 0.084 to 0.085 in the new analysis." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-79", "text": "This result indicates that it is not the words selected for the calculation of the proximities that matter, but the semantic relations in the spaces extracted from the word co-occurrences by the Singular Value Decomposition." 
}, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-80", "text": "----------------------------------" }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-81", "text": "**EXPERIMENT 2**" }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-82", "text": "Experiment 1 was conducted on the Choi et al. (2001) LSA corpus, a 1,000,000-word collection of texts from very different genres and with varied themes." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-83", "text": "The smallness of the corpus and diversity of the texts could have affected the results at two levels." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-84", "text": "First, removing a few sentences of a text should have less impact if the corpus contains a lot of texts on similar topics." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-85", "text": "Second, a larger corpus would probably also permit the extraction of a more stable and efficient semantic space." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-86", "text": "This could produce a greater difference between the LSA version of the algorithm and the version that does not use additional semantic knowledge (C99)." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-87", "text": "For these reasons, a second experiment was conducted on the basis of a much larger corpus consisting of the articles published during 1997 and 1998 in the Belgian French-speaking newspaper Le Soir (roughly 52,000 articles and 26,000,000 words)." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-88", "text": "In this corpus, the test materials from each sample account for-on average-0.0066% of the complete corpus." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-89", "text": "This second experiment also made it possible to compare the Within and Without spaces with a Former space composed of articles published in the same newspaper, but during the years 1995 and 1996 (roughly 50,000 articles and more than 22,000,000 words)." 
}, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-90", "text": "This condition will show the possibility of using LSA to build even more generic semantic knowledge, since the LSA corpus is earlier than the text to segment." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-91", "text": "----------------------------------" }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-92", "text": "**METHOD**" }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-93", "text": "The test materials were extracted from the 1997-1998 corpus following the guidelines given in Choi (2000) ." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-94", "text": "It is composed of 400 samples of ten segments, of which the length varies randomly from 3 to 11 sentences." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-95", "text": "Three types of LSA space were composed." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-99", "text": "The parameters employed to build the semantic spaces are identical to those used in Experiment 1 with one exception: in order to reduce the size of the lexical tables the whole articles and not the paragraphs were used as documents." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-100", "text": "----------------------------------" }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-101", "text": "**RESULTS**" }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-102", "text": "Although the results are mostly similar to those obtained in Experiment 1, Table 2 shows some interesting differences." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-103", "text": "The discrepancy between the Within and Without condition is much smaller, even if it remains statistically significant (p < 0.0001)." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-104", "text": "Using a corpus from the same source, but with earlier years, still returns a poorer performance (p < 0.0001)." 
}, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-105", "text": "The C99 algorithm, which is not based on LSA, produces a Pk error rate of 0.150, a value definitely worse than those obtained with the Without and Former spaces." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-106", "text": "This confirms the usefulness of semantic knowledge acquired from large corpora in estimating inter-sentence similarities." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-107", "text": "----------------------------------" }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-108", "text": "**CONCLUSION**" }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-109", "text": "The two experiments showed that the presence of the test materials in the LSA corpus increases the algorithm accuracy even when a corpus of more than 25,000,000 words is used." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-110", "text": "They also showed that the use of independent semantic knowledge improves the segmentation accuracy and that this can be observed even when the semantic knowledge is extracted from former years of the same source." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-111", "text": "This observation underlines the possibility of building relatively generic semantic knowledge; that is, knowledge which could be employed to process new linguistic data, as has been recently proposed in a anaphora resolution algorithm, in a continuous speech recognition system, or in machine translation (Bellegarda 2000; Klebanov and Wiemer-Hastings 2002; Kim, Chang, and Zhang 2003) ." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-112", "text": "A question the present study does not answer concerns the possibility of employing a corpus drawn from another source, such as another newspaper." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-113", "text": "Bellegarda (2000) observed in speech recognition tasks that such a semantic space is definitely less effective." 
}, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-114", "text": "It is nevertheless possible that evaluating the semantic proximity between two sentences is less affected by the style of composition of the source than predicting the next word of a statement." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-115", "text": "Recently, several authors have proposed segmentation algorithms, based mainly on dynamic programming, that equal or even outperform Choi's results (Ji and Zha 2003, Kehagias et al. 2003; Utiyama and Isahara 2001) ." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-116", "text": "These algorithms do not rest on additional semantic knowledge." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-117", "text": "According to the results of the present study, they could still be improved by taking into account such knowledge." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-118", "text": "Finally, this study allows a more general conclusion about the use of LSA for natural language processing." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-119", "text": "If one's objective is to analyze a linguistic phenomenon in a large corpus such as for instance the factors determining the use of causal connectives (Degand, Spooren, and Bestgen 2004) , it is preferable to extract the semantic space from the corpus at hand." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-120", "text": "The two experiments did indeed show that such specific corpora allow the extraction of a more efficient semantic space." }, { "sent_id": "7c8ec9e38bf4c0c458d60af014e102-C001-121", "text": "However, if the objective is to test the effectiveness of an algorithm intended to process new linguistic data on the basis of a semantic space built beforehand, one must avoid including the material to analyze in the LSA corpus since that would produce an over-estimate of the effectiveness of the procedure." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "7c8ec9e38bf4c0c458d60af014e102-C001-6" ], [ "7c8ec9e38bf4c0c458d60af014e102-C001-8" ], [ "7c8ec9e38bf4c0c458d60af014e102-C001-9" ], [ "7c8ec9e38bf4c0c458d60af014e102-C001-10", "7c8ec9e38bf4c0c458d60af014e102-C001-11" ], [ "7c8ec9e38bf4c0c458d60af014e102-C001-15", "7c8ec9e38bf4c0c458d60af014e102-C001-17" ], [ "7c8ec9e38bf4c0c458d60af014e102-C001-34" ], [ "7c8ec9e38bf4c0c458d60af014e102-C001-41" ], [ "7c8ec9e38bf4c0c458d60af014e102-C001-55" ], [ "7c8ec9e38bf4c0c458d60af014e102-C001-57", "7c8ec9e38bf4c0c458d60af014e102-C001-58" ] ], "cite_sentences": [ "7c8ec9e38bf4c0c458d60af014e102-C001-6", "7c8ec9e38bf4c0c458d60af014e102-C001-8", "7c8ec9e38bf4c0c458d60af014e102-C001-9", "7c8ec9e38bf4c0c458d60af014e102-C001-10", "7c8ec9e38bf4c0c458d60af014e102-C001-11", "7c8ec9e38bf4c0c458d60af014e102-C001-15", "7c8ec9e38bf4c0c458d60af014e102-C001-17", "7c8ec9e38bf4c0c458d60af014e102-C001-34", "7c8ec9e38bf4c0c458d60af014e102-C001-41", "7c8ec9e38bf4c0c458d60af014e102-C001-55", "7c8ec9e38bf4c0c458d60af014e102-C001-58" ] }, "@DIF@": { "gold_contexts": [ [ "7c8ec9e38bf4c0c458d60af014e102-C001-15", "7c8ec9e38bf4c0c458d60af014e102-C001-16", "7c8ec9e38bf4c0c458d60af014e102-C001-17", "7c8ec9e38bf4c0c458d60af014e102-C001-18" ], [ "7c8ec9e38bf4c0c458d60af014e102-C001-20", "7c8ec9e38bf4c0c458d60af014e102-C001-21" ] ], "cite_sentences": [ "7c8ec9e38bf4c0c458d60af014e102-C001-15", "7c8ec9e38bf4c0c458d60af014e102-C001-17", "7c8ec9e38bf4c0c458d60af014e102-C001-20" ] }, "@USE@": { "gold_contexts": [ [ "7c8ec9e38bf4c0c458d60af014e102-C001-20" ], [ "7c8ec9e38bf4c0c458d60af014e102-C001-57", "7c8ec9e38bf4c0c458d60af014e102-C001-58" ], [ "7c8ec9e38bf4c0c458d60af014e102-C001-60" ], [ "7c8ec9e38bf4c0c458d60af014e102-C001-66" ], [ "7c8ec9e38bf4c0c458d60af014e102-C001-70" ], [ "7c8ec9e38bf4c0c458d60af014e102-C001-82" ] ], "cite_sentences": [ "7c8ec9e38bf4c0c458d60af014e102-C001-20", "7c8ec9e38bf4c0c458d60af014e102-C001-58", 
"7c8ec9e38bf4c0c458d60af014e102-C001-60", "7c8ec9e38bf4c0c458d60af014e102-C001-66", "7c8ec9e38bf4c0c458d60af014e102-C001-70", "7c8ec9e38bf4c0c458d60af014e102-C001-82" ] }, "@MOT@": { "gold_contexts": [ [ "7c8ec9e38bf4c0c458d60af014e102-C001-41", "7c8ec9e38bf4c0c458d60af014e102-C001-44" ] ], "cite_sentences": [ "7c8ec9e38bf4c0c458d60af014e102-C001-41", "7c8ec9e38bf4c0c458d60af014e102-C001-44" ] } } }, "ABC_9567cb276162a6e9d445f13f06f5a2_6": { "x": [ { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-2", "text": "This work incorporates Selectional Preferences (SP) into a Semantic Role (SR) Classification system." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-3", "text": "We learn separate selectional preferences for noun phrases and prepositional phrases and we integrate them in a state-of-the-art SR classification system both in the form of features and individual class predictors." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-4", "text": "We show that the inclusion of the refined SPs yields statistically significant improvements on both in domain and out of domain data (14.07% and 11.67% error reduction, respectively)." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-5", "text": "The key factor for success is the combination of several SP methods with the original classification model using metaclassification." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-6", "text": "----------------------------------" }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-7", "text": "**IMPROVING SEMANTIC ROLE CLASSIFICATION WITH SELECTIONAL PREFERENCES**" }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-8", "text": "Be\u00f1at Zapirain, Eneko Agirre IXA NLP Group Basque Country Univ." 
}, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-9", "text": "----------------------------------" }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-11", "text": "Semantic Role Labeling (SRL) is the process of extracting simple event structures, i.e., \"who\" did \"what\" to \"whom\", \"when\" and \"where\"." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-12", "text": "Current systems usually perform SRL in two pipelined steps: argument identification and argument classification." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-13", "text": "While identification is mostly syntactic, classification requires semantic knowledge to be taken into account." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-14", "text": "Semantic information is usually captured through lexicalized features on the predicate and the head-word of the argument to be classified." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-15", "text": "Since lexical features tend to be sparse, SRL systems are prone to overfit the training data and generalize poorly to new corpora." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-16", "text": "Indeed, the SRL evaluation exercises at CoNLL-2004 (Carreras and M\u00e0rquez, 2005 observed that all systems showed a significant performance degradation (\u223c10 F 1 points) when applied to test data from a different genre of that of the training set." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-17", "text": "Pradhan et al. (2008) showed that this performance degradation is essentially caused by the argument classification subtask, and suggested the lexical data sparseness as one of the main reasons." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-82", "text": "+SP x stands for SwiRL plus each SP model." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-83", "text": "C = 0.01 (tuned in development)." 
}, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-18", "text": "The same authors studied the contribution of the different feature types in SRL and concluded that the lexical features were the most salient features in argument classification (Pradhan et al., 2007) ." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-19", "text": "In recent work, we showed (Zapirain et al., 2009 ) how automatically generated selectional preferences (SP) for verbs were able to perform better than pure lexical features in a role classification experiment, disconnected from a full-fledged SRL system." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-20", "text": "SPs introduce semantic generalizations on the type of arguments preferred by the predicates and, thus, they are expected to improve results on infrequent and unknown words." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-21", "text": "The positive effect was especially relevant for out-of-domain data." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-22", "text": "In this paper we advance (Zapirain et al., 2009) in two directions: (1) We learn separate SPs for prepositions and verbs, showing improvement over using SPs for verbs alone." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-23", "text": "(2) We integrate the information of several SP models in a state-of-the-art SRL system (SwiRL 1 ) and show significant improvements in SR classification." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-24", "text": "The key for the improvement lies in a metaclassifier, trained to select among the predictions provided by several role classification models." 
}, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-25", "text": "----------------------------------" }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-26", "text": "**SPS FOR SR CLASSIFICATION**" }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-27", "text": "SPs have been widely believed to be an important knowledge source when parsing and performing SRL, especially role classification." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-28", "text": "Still, present parsers and SRL systems use just lexical features, which can be seen as the most simple form of SP, where the headword needs to be seen in the training data, and otherwise the SP is not satisfied." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-29", "text": "Gildea and Jurafsky (2002) showed barely significant improvements in semantic role classification of NPs for FrameNet roles using distributional clusters." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-30", "text": "In (Erk, 2007) a number of SP models are tested in a pseudo-task related to SRL." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-31", "text": "More recently, we showed (Zapirain et al., 2009 ) that several methods to automatically generate SPs generalize well and outperform lexical match in a large dataset for semantic role classification, but the impact on a full system was not explored." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-32", "text": "In this work we apply a subset of the SP methods proposed in (Zapirain et al., 2009 )." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-33", "text": "These methods can be split in two main families, depending on the resource used to compute similarity: WordNetbased methods and distributional methods." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-34", "text": "Both families define a similarity score between a word (the headword of the argument to be classified) and a set of words (the headwords of arguments of a given role)." 
}, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-35", "text": "WordNet-based similarity: One of the models that we used is based on Resnik's similarity measure (1993), referring to it as res." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-36", "text": "The other model is an in-house method (Zapirain et al., 2009 ), referred as wn, which only takes into account the depth of the most common ancestor, and returns SPs that are as specific as possible." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-37", "text": "Distributional similarity: Following (Zapirain et al., 2009) we considered both first order and second order similarity." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-38", "text": "In first order similarity, the similarity of two words was computed using the cosine (or Jaccard measure) of the co-occurrence vectors of the two words." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-39", "text": "Co-occurrence vectors where constructed using freely available software (Pad\u00f3 and Lapata, 2007) run over the British National Corpus." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-40", "text": "We used the optimal parameters (Pad\u00f3 and Lapata, 2007, p. 179 )." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-41", "text": "We will refer to these similarities as sim cos and sim Jac , respectively." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-42", "text": "In contrast, second order similarity uses vectors of similar words, i.e., the similarity of two words was computed using the cosine (or Jaccard measure) between the thesaurus entries of those words in Lin's thesaurus (Lin, 1998) ." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-43", "text": "We refer to these as sim 2 cos and sim 2 Jac ." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-44", "text": "Given a target sentence with a verb and its arguments, the task of SR classification is to assign the correct role to each of the arguments." 
}, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-45", "text": "When using SPs alone, we only use the headwords of the arguments, and each argument is classified independently of the rest." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-46", "text": "For each headword, we select the role (r) of the verb (c) which fits best the head word (w), where the goodness of fit (SP sim (v, r, w)) is modeled using one of the similarity models above, between the headword w and the headwords seen in training data for role r of verb v. This selection rule is formalized as follows:" }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-47", "text": "In our previous work (Zapirain et al., 2009 ), we modelled SPs for pairs of predicates (verbs) and arguments, independently of the fact that the argument is a core argument (typically a noun) or an adjunct argument (typically a prepositional phrase)." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-48", "text": "In contrast, (Litkowski and Hargraves, 2005) show that prepositions have SPs of their own, especially when functioning as adjuncts." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-49", "text": "We therefore decided to split SPs according to whether the potential argument is a Prepositional Phrase (PP) or a Noun Phrase (NP)." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-50", "text": "For NPs, which tend to be core arguments 2 , we use the SPs of the verb (as formalized above)." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-51", "text": "For PPs, which have an even distribution between core and adjunct arguments, we use the SPs of the prepositions alone, ignoring the verbs." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-52", "text": "Implementation wise, this means that in Eq. (1), we change v for p, where p is the preposition heading the PP." 
}, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-53", "text": "----------------------------------" }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-54", "text": "**EXPERIMENTS WITH SPS IN ISOLATION**" }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-55", "text": "In this section we evaluate the use of SPs for classification in isolation, i.e., we use formula 1, and no other information." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-56", "text": "In addition we contrast the use of both verb-role and preposition-role SPs, as compared to the use of verb-role SPs alone." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-57", "text": "The dataset used in these experiments (and in Section 4) is the same as provided by the CoNLL-2005 shared task on SRL (Carreras and M\u00e0rquez, 2005) ." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-58", "text": "This dataset comprises several sections of the PropBank corpus (news from the WSJ) as well as an extract of the Brown Corpus." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-59", "text": "Sections 02-21 are used for generating the SPs and training, Section 00 for development, and Section 23 for testing, as customary." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-60", "text": "The Brown Corpus is used for out-of-domain testing, but due to the limited size of the provided section, we extended it with instances from SemLink 3 ." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-61", "text": "Since the focus of this work is on argument (Zapirain et al., 2009 )." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-62", "text": "The differences are due to the fact that we do not discard roles like MOD, DIS, NEG and that our previous work used only the subset of the data that could be mapped to VerbNet (around 50%)." 
}, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-63", "text": "All in all, the table shows that splitting SPs into verb and preposition SPs yields better results, both in precision and recall, improving F 1 up to 10 points in some cases." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-64", "text": "----------------------------------" }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-65", "text": "**INTEGRATING SPS IN A SRL SYSTEM**" }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-66", "text": "For these experiments we modified SwiRL (Surdeanu et al., 2007) : (a) we matched the gold boundaries against syntactic constituents predicted internally using the Charniak parser (Charniak, 2000) ; and (b) we classified these constituents with their semantic role using a modified version of SwiRL's feature set." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-67", "text": "We explored two different strategies for integrating SPs in SwiRL." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-68", "text": "The first, obvious method is to extend SwiRL's feature set with features that model the preferences of the SPs, i.e., for each SP model SP i we add a feature whose value is R i ." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-69", "text": "The second method combines SwiRL's classification model and our SP models using meta-classification." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-70", "text": "We opted for a binary classification approach: first, for each constituent we generate n datums, one for each distinct role label proposed by the pool of base models; then we use a binary meta-classifier to label each candidate role as correct or incorrect." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-71", "text": "Table 2 lists the features of the meta-classifier." 
}, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-72", "text": "We trained the meta-classifier on the usual PropBank training partition, using cross-validation to generate outputs for the base models that require the same training material." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-73", "text": "At prediction time, for each candidate constituent we selected the role label that was classified as correct with the highest confidence." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-74", "text": "Table 3 compares the performance of both combination approaches against the standalone SwiRL classifier." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-75", "text": "We show results for both core arguments (Core), adjunct arguments (Arg) and all arguments combined (All)." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-76", "text": "In the table, the SwiRL+SP * models stand for SwiRL classifiers enhanced with one feature from the corresponding SP." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-77", "text": "Adding more than one SP-based feature to SwiRL did not improve results." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-78", "text": "Our conjecture is that the SwiRL classifier enhanced with SPbased features does not learn relevant weights for these features because their signal is \"drowned\" by SwiRL's large initial feature set and the correlation between the different SPs." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-79", "text": "This observation motivated the development of the meta-classifier." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-80", "text": "The meta-classifier shown in the table combines the output of the SwiRL+SP * models with the predictions of SP models used in isolation." 
}, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-81", "text": "We implemented the meta-classifier using Support Vector Machines (SVM) 4 with a quadratic polynomial kernel, and Table 3 : Classification accuracy for the combination approaches." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-84", "text": "Table 3 indicates that four out of the six SwiRL+SP * models perform better than SwiRL in domain (WSJ-test), and all of them outperform SwiRL out of domain (Brown)." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-85", "text": "However, the improvements are small and, generally, not statistically significant." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-86", "text": "On the other hand, the meta-classifier outperforms SwiRL both in domain (14.07% error reduction) and out of domain (11.67% error reduction), and the differences are statistically significant (measured using two-tailed paired t-test at 99% confidence interval on 100 samples generated using bootstrap resampling)." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-87", "text": "We also implemented two unsupervised voting baselines, one unweighted (each base model has the same weight) and one weighted (each base model is weighted by its accuracy in development)." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-88", "text": "However, none of these baselines outperformed the standalone SwiRL classifier." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-89", "text": "This is further proof that, for SR classification, metaclassification is crucial because it can learn the distinct specializations of the various base models." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-90", "text": "Finally, Table 3 shows that our approach yields consistent improvements for both core and adjunct arguments." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-91", "text": "Out of domain, we see a bigger accuracy improvement for adjunct arguments (5.64 absolute points) vs. 
core arguments (1.78 points)." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-92", "text": "This is to be expected, as most core arguments fall under the Arg0 and Arg1 classes, which can typically be disambiguated based on syntactic information, i.e., subject vs. object." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-93", "text": "On the other hand, there are no syntactic hints for adjunct arguments, so the system learns to rely more on SP information in this case." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-94", "text": "----------------------------------" }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-95", "text": "**CONCLUSIONS**" }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-96", "text": "This paper is the first work to show that SPs improve a state-of-the-art SR classification system." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-97", "text": "Several decisions were crucial for success: (a) we deployed separate SP models for verbs and prepositions, which in conjunction outperform SP models for verbs alone; (b) we incorporated SPs into SR classification using a meta-classification approach that combines eight base models, developed from variants of a state-of-the-art SRL system and the above SP models." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-98", "text": "We show that the resulting system outperforms the original SR classification system for arguments mapped to nominal or prepositional constituents." }, { "sent_id": "9567cb276162a6e9d445f13f06f5a2-C001-99", "text": "The improvements are statistically significant both on in-domain and out-of-domain data sets." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "9567cb276162a6e9d445f13f06f5a2-C001-19" ], [ "9567cb276162a6e9d445f13f06f5a2-C001-31" ], [ "9567cb276162a6e9d445f13f06f5a2-C001-47" ] ], "cite_sentences": [ "9567cb276162a6e9d445f13f06f5a2-C001-19", "9567cb276162a6e9d445f13f06f5a2-C001-31", "9567cb276162a6e9d445f13f06f5a2-C001-47" ] }, "@MOT@": { "gold_contexts": [ [ "9567cb276162a6e9d445f13f06f5a2-C001-19", "9567cb276162a6e9d445f13f06f5a2-C001-20", "9567cb276162a6e9d445f13f06f5a2-C001-21" ], [ "9567cb276162a6e9d445f13f06f5a2-C001-31" ] ], "cite_sentences": [ "9567cb276162a6e9d445f13f06f5a2-C001-19", "9567cb276162a6e9d445f13f06f5a2-C001-31" ] }, "@EXT@": { "gold_contexts": [ [ "9567cb276162a6e9d445f13f06f5a2-C001-22" ], [ "9567cb276162a6e9d445f13f06f5a2-C001-32" ], [ "9567cb276162a6e9d445f13f06f5a2-C001-47", "9567cb276162a6e9d445f13f06f5a2-C001-48", "9567cb276162a6e9d445f13f06f5a2-C001-49" ] ], "cite_sentences": [ "9567cb276162a6e9d445f13f06f5a2-C001-22", "9567cb276162a6e9d445f13f06f5a2-C001-32", "9567cb276162a6e9d445f13f06f5a2-C001-47" ] }, "@USE@": { "gold_contexts": [ [ "9567cb276162a6e9d445f13f06f5a2-C001-35", "9567cb276162a6e9d445f13f06f5a2-C001-36", "9567cb276162a6e9d445f13f06f5a2-C001-37" ] ], "cite_sentences": [ "9567cb276162a6e9d445f13f06f5a2-C001-36", "9567cb276162a6e9d445f13f06f5a2-C001-37" ] }, "@DIF@": { "gold_contexts": [ [ "9567cb276162a6e9d445f13f06f5a2-C001-47", "9567cb276162a6e9d445f13f06f5a2-C001-48", "9567cb276162a6e9d445f13f06f5a2-C001-49" ] ], "cite_sentences": [ "9567cb276162a6e9d445f13f06f5a2-C001-47" ] } } }, "ABC_c856f5ce5d2cdcfc71027d6fa4c6b3_6": { "x": [ { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-2", "text": "Choi, Wiemer-Hastings, and Moore (2001)" }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-3", "text": "----------------------------------" }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-4", "text": "**IMPROVING TEXT SEGMENTATION BY USING 
COMPLEMENTARY SEMANTIC KNOWLEDGE**" }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-5", "text": "For the last ten years, many methods have been proposed for the segmentation of texts in topically related units on the basis of lexical cohesion." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-6", "text": "The major distinction between these methods is in the contrast between the approaches based exclusively on the information contained in the text to be segmented, such as lexical repetition (e.g., Choi 2000; Hearst 1997; Heinonen 1998; Kehagias, Pavlina, and Petridis 2003; Utiyama and Isahara 2001) , and those approaches that rest on complementary semantic knowledge extracted from dictionaries and thesauruses (e.g., Kozima 1993; Lin et al. 2004; Morris and Hirst 1991) , or from collocations collected in large corpora (Bolshakov and Gelbukh 2001; Brants, Chen, and Tsochantaridis 2002; Choi et al. 2001; Ferret 2002; Kaufmann 1999; Ponte and Croft 1997) ." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-7", "text": "According to their authors, methods that use additional knowledge allow for a solution to problems encountered when sentences belonging to a unique topic do not share common words due to the use of hyperonyms or synonyms and allow words that are semantically related to be taken as positive evidence for topic continuity." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-8", "text": "Empirical arguments in favor of these methods have been provided recently by Choi et al. (2001) in a study using Latent Semantic Analysis (Latent Semantic Indexing, Deerwester et al. 1990 ) to extract a semantic space from a corpus allowing determination of the similarity of meanings of words, sentences, or paragraphs." 
}, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-9", "text": "By comparing the accuracy of the very same algorithm according to whether or not it takes into account complementary semantic knowledge, they were able to show the benefit derived from such knowledge." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-10", "text": "However, implications of Choi et al.'s study for text segmentation and for the use of LSA in natural language processing are unclear due to the methodology employed." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-11", "text": "In their experiments, semantic knowledge was acquired from a corpus containing the materials to be segmented in the test phase." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-12", "text": "One could speculate whether the largest part of the benefit obtained thanks to the addition of semantic knowledge was not due to this hyper-specificity of the LSA corpus (i.e., the inclusion of the test materials)." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-13", "text": "If this were the case, it would call into question the possibility of using LSA to acquire generic semantic knowledge that can be used to segment new texts." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-14", "text": "A priori, the problem does not seem serious for at least two reasons." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-15", "text": "First, Choi et al.'s segmentation procedure does not rely on supervised learning in which a system learns how to efficiently segment a text from training data." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-16", "text": "The LSA corpus only intervenes in an indirect manner by allowing the extraction of semantic proximities between words that are then used to compute similarities between parts of the text to segment (see Section 2 for details)." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-17", "text": "Second, Choi et al. 
employed a large number of small test samples to evaluate their algorithm, each making up, on average, 0.15% of the LSA corpus." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-18", "text": "The present study shows, however, that the presence of the test materials in the LSA corpus has an important effect, but also that the generic semantic knowledge derived from large corpora clearly improves the segmentation accuracy." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-19", "text": "This conclusion is drawn from two experiments in which the presence or absence of the test materials in the LSA corpus is manipulated." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-20", "text": "The first experiment is based on the original materials from Choi et al., which consisted of a small corpus (1,000,000 words)." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-21", "text": "The second experiment is based on a much larger corpus (25,000,000 words)." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-22", "text": "Before reporting these experiments, Choi's algorithm and the use of LSA within this framework are described." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-23", "text": "----------------------------------" }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-24", "text": "**THE TWO VERSIONS OF CHOI'S ALGORITHM**" }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-25", "text": "The segmentation algorithm proposed by Choi (2000) is made up of the three steps usually found in any segmentation procedure based on lexical cohesion." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-26", "text": "Firstly, the document to be segmented is divided into minimal textual units, usually sentences." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-27", "text": "Then, a similarity index between every pair of adjacent units is calculated."
}, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-28", "text": "Each raw similarity value is cast on an ordinal scale by taking the proportion of neighboring values that are smaller than it." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-29", "text": "Lastly, the document is segmented recursively according to the boundaries between the units that maximize the sum of the average similarities inside the segments thus formed (divisive clustering)." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-30", "text": "The step of greatest interest here is the one that calculates the inter-sentence similarities." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-31", "text": "The procedure initially proposed by Choi (2000) , C99, rests exclusively on the information contained in the text to be segmented." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-32", "text": "According to the vector space model, each sentence is represented by a vector of word frequency counts, and the similarity between two sentences is calculated by means of the cosine measure between the corresponding vectors." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-33", "text": "In a first evaluation based on the procedure described below, Choi showed that his algorithm outperforms several other approaches such as TextTiling (Hearst 1997) and Segmenter (Kan, Klavans, and McKeown 1998) ." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-34", "text": "Choi et al. (2001) claimed that it was possible to improve the inter-sentence similarities index by taking into account the semantic proximities between words estimated on the basis of Latent Semantic Analysis (LSA)." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-35", "text": "Briefly stated, LSA rests on the thesis that analyzing the contexts in which words occur permits an estimation of their similarity in meaning (Deerwester et al. 1990; Landauer and Dumais 1997) ." 
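The two C99 steps described above, cosine similarity between sentence word-count vectors and the ordinal rank transform, can be sketched as follows. This is a minimal illustration with my own function and parameter names (`mask_size` corresponds to the rank mask discussed later), assuming numpy; it is not the original implementation.

```python
import numpy as np

def cosine_sim_matrix(sentence_vectors):
    """Pairwise cosine similarities between sentence word-count vectors."""
    X = np.asarray(sentence_vectors, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for empty sentences
    Xn = X / norms
    return Xn @ Xn.T

def rank_transform(sim, mask_size=11):
    """Replace each similarity by the proportion of neighboring values
    (inside a mask_size x mask_size window) that are strictly smaller."""
    n = sim.shape[0]
    r = mask_size // 2
    ranked = np.zeros_like(sim)
    for i in range(n):
        for j in range(n):
            lo_i, hi_i = max(0, i - r), min(n, i + r + 1)
            lo_j, hi_j = max(0, j - r), min(n, j + r + 1)
            window = sim[lo_i:hi_i, lo_j:hi_j]
            neighbors = window.size - 1  # exclude the cell itself
            if neighbors > 0:
                ranked[i, j] = (window < sim[i, j]).sum() / neighbors
    return ranked
```

The rank transform makes the subsequent divisive clustering depend only on the local ordering of similarities, not on their absolute values, which is why C99 is robust to variation in raw cosine magnitudes.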
}, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-36", "text": "The first step in the analysis is to construct a lexical table containing an information-theoretic weighting of the frequencies of word occurrences in each document (i.e. sentence, paragraph, or text) included in the corpus." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-37", "text": "This frequency table undergoes a Singular Value Decomposition that extracts the most important orthogonal dimensions, and, consequently, discards the small sources of variability in term usage." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-38", "text": "After this step, every word is represented by a vector of weights indicating its strength of association with each of the dimensions." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-39", "text": "This makes it possible to measure the semantic proximity between any two words by using, for instance, the cosine measure between the corresponding vectors." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-40", "text": "Proximity between any two sentences (or any other textual units), even if these sentences were not present in the original corpus, can be estimated by computing a vector for each of these units (which corresponds to the weighted sum of the vectors of the words that compose it) and then by computing the cosine between these vectors (Deerwester et al. 1990 )." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-41", "text": "Choi et al. (2001) have shown that using this procedure to compute the inter-sentence similarities results in the previous version of the algorithm (based solely on word repetition) being outperformed." 
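A compact sketch of the LSA pipeline just described: weight a term-by-document matrix, truncate its SVD, and compare words or sentence vectors (sums of word vectors) by cosine. The `log1p` weighting below is a simple stand-in for the information-theoretic weighting mentioned above, and all names are illustrative, not from the original system.

```python
import numpy as np

def lsa_word_vectors(term_doc_counts, k=2):
    """Truncated SVD of a weighted term-by-document matrix; each row of the
    result holds a word's coordinates on the k strongest dimensions."""
    # log1p is a crude stand-in for the information-theoretic weighting
    A = np.log1p(np.asarray(term_doc_counts, dtype=float))
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] * s[:k]  # word vectors, scaled by the singular values

def sentence_vector(word_vecs, word_index, sentence):
    """A sentence vector is the (here unweighted) sum of its known words."""
    vecs = [word_vecs[word_index[w]] for w in sentence if w in word_index]
    return np.sum(vecs, axis=0) if vecs else np.zeros(word_vecs.shape[1])

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0
```

Because sentence vectors are built from the word vectors, similarities can be computed for sentences that never occurred in the LSA corpus, which is exactly what the segmentation task requires.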
}, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-42", "text": "----------------------------------" }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-43", "text": "**EXPERIMENT 1**" }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-44", "text": "The aim of this experiment is to determine the impact of the presence of the test materials in the LSA corpus on the results obtained by Choi et al. (2001) ." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-45", "text": "Does semantic knowledge acquired from a corpus that does not include the test materials also improve the segmentation accuracy?" }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-46", "text": "----------------------------------" }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-47", "text": "**METHOD**" }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-48", "text": "This experiment was based on the procedure and test materials designed by Choi (2000) , which were also used by several authors as a benchmark for comparing segmentation systems (Brants et al. 2002; Ferret 2002; Kehagias et al. 2003; Utiyama and Isahara 2001) ." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-49", "text": "The task consists in finding the boundaries between concatenated texts." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-50", "text": "Each test sample is a concatenation of ten text segments." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-51", "text": "Each segment consisted of the first n sentences of a randomly selected text from two sub-sections of the Brown corpus." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-52", "text": "For the present experiment, I used the most general test materials built by Choi (2000) , in which the size of the segments within each sample varies randomly from 3 to 11 sentences." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-53", "text": "The set is composed of 400 samples." 
}, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-54", "text": "The analysis concerned the comparison between the accuracy of the algorithm when the test materials were included in the LSA corpus (Within) and when they were not (Without)." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-55", "text": "One Within semantic space, which corresponds to the one used by Choi et al., was built using the entire Brown corpus as the LSA corpus." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-56", "text": "Four hundred different Without spaces were built, one for each test sample, by removing from the Brown corpus, in each case, only the sentences that make up that sample." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-57", "text": "To extract the LSA space and to apply the segmentation algorithm, a series of parameters had to be set." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-58", "text": "First of all, paragraphs were used as documents for building the lexical tables because Choi et al. observed that such middle-sized units were more effective than shorter units (i.e., sentences)." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-59", "text": "The words on Choi's stoplist were removed, as were those that appeared only once in the whole corpus." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-60", "text": "Words were not stemmed, as in Choi et al. (2001) ." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-61", "text": "To build the LSA space, the singular value decomposition was performed using the program SVDPACKC (Berry 1992; Berry et al. 1993) , and the first 300 singular vectors were retained." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-62", "text": "Concerning the segmentation algorithm, I used the version in which the number of boundaries to be found is imposed, and thus fixed at nine." 
}, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-63", "text": "An 11 \u00d7 11 rank mask was used for the ordinal transformation, as recommended by Choi (2000) ." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-64", "text": "----------------------------------" }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-65", "text": "**RESULTS**" }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-66", "text": "The segmentation accuracy was evaluated by means of the index reported by Choi et al. (2001) : the Pk measure of segmentation inaccuracy (Beeferman, Berger, and Lafferty 1999) , which gives the proportion of sentences that are wrongly predicted to belong to the same segment or wrongly predicted to belong to different segments." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-67", "text": "I also report, for potential future comparison, Pevzner and Hearst's (2002) WindowDiff index, which remedies several problems in the Pk measure." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-68", "text": "Results are provided in Table 1 ." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-69", "text": "1 Compared with the Within condition, the performance in the Without condition is definitely worse, as confirmed by paired-sample t tests (each test sample being used as an observation), which are significant at an alpha smaller than 0.0001." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-70", "text": "The C99 algorithm, which does not employ LSA to estimate the similarities between the sentences, produces a Pk of 0.13 (Choi et al. 2001, Table 3, line 3: No stemming) ." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-71", "text": "It appears that although the Without condition is still better than C99, the benefit is very small." 
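The Pk measure just described can be computed with a sliding probe of width k (conventionally half the mean reference segment length), counting disagreements about whether the two probe endpoints fall in the same segment. The function below is my own minimal sketch, taking segmentations as lists of segment lengths in sentences; it is not the evaluation code used in the study.

```python
def pk(reference, hypothesis, k=None):
    """Pk segmentation error (Beeferman, Berger, and Lafferty 1999).
    `reference` and `hypothesis` are lists of segment lengths."""
    def labels(seg_lengths):
        # Expand segment lengths into one segment id per sentence.
        out = []
        for seg_id, length in enumerate(seg_lengths):
            out.extend([seg_id] * length)
        return out

    ref, hyp = labels(reference), labels(hypothesis)
    assert len(ref) == len(hyp), "segmentations must cover the same text"
    if k is None:
        # Half of the mean reference segment length, as is conventional.
        k = max(1, round(len(ref) / len(reference) / 2))
    probes = len(ref) - k
    errors = sum(
        (ref[i] == ref[i + k]) != (hyp[i] == hyp[i + k])
        for i in range(probes)
    )
    return errors / probes
```

A perfect segmentation scores 0; note that Pk, unlike WindowDiff, penalizes false positives and near misses unevenly, which is the problem Pevzner and Hearst's index remedies.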
}, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-72", "text": "Before concluding that the presence of the test materials in the LSA corpus strongly modified the semantic space, an alternative explanation must be considered." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-73", "text": "The loss of accuracy in the Without condition could potentially be due to the fact that the words indexed in the corresponding LSA spaces are systematically slightly fewer than those present in the Within space." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-74", "text": "Removing each test sample led to the loss, on average, of 23 different words out of the 25,847 words that are indexed in the Within space." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-75", "text": "In the Without spaces, these words are no longer available to estimate the similarity of the sentences, whereas they are employed in the Within space." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-76", "text": "In order to determine whether this factor can explain the difference in performance, a complementary analysis was carried out on the Within space in which, for each test sample, only the words present in the corresponding Without space were taken into account." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-77", "text": "In this manner, only the semantic relations can come into play." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-78", "text": "Compared with the complete Within space, almost no drop in performance was observed: the Pk error rate went from 0.084 to 0.085 in the new analysis." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-79", "text": "This result indicates that it is not the words selected for the calculation of the proximities that matter, but the semantic relations in the spaces extracted from the word co-occurrences by the Singular Value Decomposition." 
}, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-80", "text": "----------------------------------" }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-81", "text": "**EXPERIMENT 2**" }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-82", "text": "Experiment 1 was conducted on the Choi et al. (2001) LSA corpus, a 1,000,000-word collection of texts from very different genres and with varied themes." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-83", "text": "The small size of the corpus and the diversity of the texts could have affected the results at two levels." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-84", "text": "First, removing a few sentences of a text should have less impact if the corpus contains a lot of texts on similar topics." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-85", "text": "Second, a larger corpus would probably also permit the extraction of a more stable and efficient semantic space." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-86", "text": "This could produce a greater difference between the LSA version of the algorithm and the version that does not use additional semantic knowledge (C99)." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-87", "text": "For these reasons, a second experiment was conducted on the basis of a much larger corpus consisting of the articles published during 1997 and 1998 in the Belgian French-speaking newspaper Le Soir (roughly 52,000 articles and 26,000,000 words)." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-88", "text": "In this corpus, the test materials from each sample account for, on average, 0.0066% of the complete corpus." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-89", "text": "This second experiment also made it possible to compare the Within and Without spaces with a Former space composed of articles published in the same newspaper, but during the years 1995 and 1996 (roughly 50,000 articles and more than 22,000,000 words)." 
}, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-90", "text": "This condition will show the possibility of using LSA to build even more generic semantic knowledge, since the LSA corpus predates the texts to be segmented." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-91", "text": "----------------------------------" }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-92", "text": "**METHOD**" }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-93", "text": "The test materials were extracted from the 1997-1998 corpus following the guidelines given in Choi (2000) ." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-94", "text": "They are composed of 400 samples of ten segments, whose lengths vary randomly from 3 to 11 sentences." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-95", "text": "Three types of LSA space were composed." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-96", "text": "The Within space is based on the whole 1997-1998 corpus." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-97", "text": "Four hundred different Without spaces were built as described in Experiment 1." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-98", "text": "Finally, a Former space was built from the 1995-1996 corpus." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-99", "text": "The parameters employed to build the semantic spaces are identical to those used in Experiment 1 with one exception: in order to reduce the size of the lexical tables, whole articles rather than paragraphs were used as documents." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-100", "text": "----------------------------------" }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-101", "text": "**RESULTS**" }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-102", "text": "Although the results are mostly similar to those obtained in Experiment 1, Table 2 shows some interesting differences." 
}, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-103", "text": "The discrepancy between the Within and Without condition is much smaller, even if it remains statistically significant (p < 0.0001)." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-104", "text": "Using a corpus from the same source, but with earlier years, still yields a poorer performance (p < 0.0001)." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-105", "text": "The C99 algorithm, which is not based on LSA, produces a Pk error rate of 0.150, a value definitely worse than those obtained with the Without and Former spaces." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-106", "text": "This confirms the usefulness of semantic knowledge acquired from large corpora in estimating inter-sentence similarities." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-107", "text": "----------------------------------" }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-108", "text": "**CONCLUSION**" }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-109", "text": "The two experiments showed that the presence of the test materials in the LSA corpus increases the algorithm accuracy even when a corpus of more than 25,000,000 words is used." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-110", "text": "They also showed that the use of independent semantic knowledge improves the segmentation accuracy and that this can be observed even when the semantic knowledge is extracted from former years of the same source." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-111", "text": "This observation underlines the possibility of building relatively generic semantic knowledge; that is, knowledge which could be employed to process new linguistic data, as has been recently proposed in an anaphora resolution algorithm, in a continuous speech recognition system, or in machine translation (Bellegarda 2000; Klebanov and Wiemer-Hastings 2002; Kim, Chang, and Zhang 2003) ." 
}, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-112", "text": "A question the present study does not answer concerns the possibility of employing a corpus drawn from another source, such as another newspaper." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-113", "text": "Bellegarda (2000) observed in speech recognition tasks that such a semantic space is definitely less effective." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-114", "text": "It is nevertheless possible that evaluating the semantic proximity between two sentences is less affected by the style of composition of the source than predicting the next word of a statement." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-115", "text": "Recently, several authors have proposed segmentation algorithms, based mainly on dynamic programming, that equal or even outperform Choi's results (Ji and Zha 2003; Kehagias et al. 2003; Utiyama and Isahara 2001) ." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-116", "text": "These algorithms do not rest on additional semantic knowledge." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-117", "text": "According to the results of the present study, they could still be improved by taking into account such knowledge." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-118", "text": "Finally, this study allows a more general conclusion about the use of LSA for natural language processing." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-119", "text": "If one's objective is to analyze a linguistic phenomenon in a large corpus, such as the factors determining the use of causal connectives (Degand, Spooren, and Bestgen 2004) , it is preferable to extract the semantic space from the corpus at hand." }, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-120", "text": "The two experiments did indeed show that such specific corpora allow the extraction of a more efficient semantic space." 
}, { "sent_id": "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-121", "text": "However, if the objective is to test the effectiveness of an algorithm intended to process new linguistic data on the basis of a semantic space built beforehand, one must avoid including the material to analyze in the LSA corpus since that would produce an over-estimate of the effectiveness of the procedure." } ], "y": { "@BACK@": { "gold_contexts": [ [ "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-6" ], [ "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-22" ], [ "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-25" ], [ "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-31" ], [ "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-33" ] ], "cite_sentences": [ "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-6", "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-22", "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-25", "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-31", "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-33" ] }, "@DIF@": { "gold_contexts": [ [ "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-6" ] ], "cite_sentences": [ "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-6" ] }, "@USE@": { "gold_contexts": [ [ "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-48" ], [ "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-52" ], [ "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-63" ], [ "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-93" ] ], "cite_sentences": [ "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-48", "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-52", "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-63", "c856f5ce5d2cdcfc71027d6fa4c6b3-C001-93" ] } } }, "ABC_ff7bafb8f21118ca3c908603ef32d0_6": { "x": [ { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-2", "text": "This paper presents and evaluates several original techniques for the latent classification of biographic attributes such as gender, age and native language, in diverse genres (conversation transcripts, email) and languages (Arabic, English)." 
}, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-3", "text": "First, we present a novel partner-sensitive model for extracting biographic attributes in conversations, given the differences in lexical usage and discourse style such as those observed between same-gender and mixed-gender conversations." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-4", "text": "Then, we explore a rich variety of novel sociolinguistic and discourse-based features, including mean utterance length, passive/active usage, percentage domination of the conversation, speaking rate and filler word usage." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-5", "text": "Cumulatively, up to 20% error reduction is achieved relative to the standard Boulis and Ostendorf (2005) algorithm for classifying individual conversations on Switchboard, and accuracy for gender detection on the Switchboard corpus (aggregate) and Gulf Arabic corpus exceeds 95%." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-6", "text": "----------------------------------" }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-8", "text": "Speaker attributes such as gender, age, dialect, native language and educational level may be (a) stated overtly in metadata, (b) derivable indirectly from metadata such as a speaker's phone number or userid, or (c) derivable from acoustic properties of the speaker, including pitch and f0 contours (Bocklet et al., 2008) ." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-9", "text": "In contrast, the goal of this paper is to model and classify such speaker attributes from only the latent information found in textual transcripts." 
}, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-10", "text": "In particular, we are interested in modeling and classifying biographic attributes such as gender and age based on lexical and discourse factors including lexical choice, mean utterance length, patterns of participation in the conversation and filler word usage." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-11", "text": "Furthermore, a speaker's lexical choice and discourse style may differ substantially depending on the gender/age/etc." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-12", "text": "of the speaker's interlocutor, and hence improvements may be achieved via dyadic modeling or stacked classifiers." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-13", "text": "There has been substantial work in the sociolinguistics literature investigating discourse style differences due to speaker properties such as gender (Coates, 1997; Eckert, McConnell-Ginet, 2003) ." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-14", "text": "Analyzing such differences is not only interesting from the sociolinguistic and psycholinguistic point of view of language understanding, but also from an engineering perspective, given the goal of predicting latent author/speaker attributes in various practical applications such as user authentication, call routing, user and population profiling on social networking websites such as Facebook, and gender/age-conditioned language models for machine translation and speech recognition." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-15", "text": "While most of the prior work in sociolinguistics has been approached from a non-computational perspective, Koppel et al. (2002) employed the use of a linear model for gender classification with manually assigned weights for a set of linguistically interesting words as features, focusing on a small development corpus." 
}, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-16", "text": "Another computational study for gender classification using approximately 30 weblog entries was done by Herring and Paolillo (2006) , making use of a logistic regression model to study the effect of different features." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-17", "text": "While small-scale sociolinguistic studies on monologues have shed some light on important features, we focus on modeling attributes from spoken conversations, building upon the work of Boulis and Ostendorf (2005), and show how gender and other attributes can be accurately predicted based on the following original contributions:" }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-18", "text": "1. Modeling Partner Effect: A speaker may adapt his or her conversation style depending on the partner and we show how conditioning on the predicted partner class using a stacked model can provide further performance gains in gender classification." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-19", "text": "2." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-20", "text": "Sociolinguistic features: The paper explores a rich set of lexical and non-lexical features motivated by the sociolinguistic literature for gender classification, and shows how they can effectively augment the standard ngram-based model of Boulis and Ostendorf (2005) ." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-21", "text": "3. Application to Arabic Language: We also report results for the Arabic language and show that the ngram model gives reasonably high accuracy for Arabic as well." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-22", "text": "Furthermore, we also obtain consistent performance gains due to the partner effect and sociolinguistic features, as observed in English." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-23", "text": "4. 
Application to Email Genre: We show how the models explored in this paper extend to the email genre, showing the wide applicability of general text-based features." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-24", "text": "5." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-25", "text": "Application to new attributes: We show how the lexical model of Boulis and Ostendorf (2005) can be extended to Age and Native vs. Non-native prediction, with further improvements gained from our partner-sensitive models and novel sociolinguistic features." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-26", "text": "----------------------------------" }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-27", "text": "**RELATED WORK**" }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-28", "text": "Much attention has been devoted in the sociolinguistics literature to detection of age, gender, social class, religion, education, etc." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-29", "text": "from conversational discourse and monologues starting as early as the 1950s, making use of morphological features such as the choice between the -ing and the -in variants of the present participle ending of the verb (Fisher, 1958) , and phonological features such as the pronunciation of the \"r\" sound in words such as far, four, cards, etc." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-30", "text": "(Labov, 1966) ." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-31", "text": "Gender differences have been one of the primary areas of sociolinguistic research, including work such as Coates (1998) and Eckert and McConnell-Ginet (2003) ." 
}, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-32", "text": "There has also been some work in developing computational models based on linguistically interesting clues suggested by the sociolinguistic literature for detecting gender on formal written texts (Singh, 2001; Koppel et al., 2002; Herring and Paolillo, 2006) but it has been primarily focused on using a small number of manually selected features, and on a small number of formal written texts." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-33", "text": "Another relevant line of work has been on the blog domain, using a bag of words feature set to discriminate age and gender (Schler et al., 2006; Burger and Henderson, 2006; Nowson and Oberlander, 2006) ." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-34", "text": "Conversational speech presents a challenging domain due to the interaction of genders, recognition errors and sudden topic shifts." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-35", "text": "While prosodic features have been shown to be useful in gender/age classification (e.g. Shafran et al., 2003) , this work makes use of speech transcripts along the lines of Boulis and Ostendorf (2005) in order to build a general model that can be applied to electronic conversations as well." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-36", "text": "While Boulis and Ostendorf (2005) observe that the gender of the partner can have a substantial effect on their classifier accuracy, given that same-gender conversations are easier to classify than mixed-gender conversations, they do not utilize this observation in their work." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-37", "text": "In Section 5.3, we show how the predicted gender/age etc." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-38", "text": "of the partner/interlocutor can be used to improve overall performance via both dyadic modeling and classifier stacking." 
}, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-39", "text": "Boulis and Ostendorf (2005) have also constrained themselves to lexical n-gram features, while we show improvements via the incorporation of non-lexical features such as the percentage domination of the conversation, degree of passive usage, usage of subordinate clauses, speaking rate, usage profiles for filler words (e.g. \"umm\"), mean-utterance length, and other such properties." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-40", "text": "We also report performance gains of our models for a new genre (email) and a new language (Arabic), indicating the robustness of the models explored in this paper." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-41", "text": "Finally, we also explore and evaluate original model performance on additional latent speaker attributes including age and native vs. non-native English speaking status." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-42", "text": "----------------------------------" }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-43", "text": "**CORPUS DETAILS**" }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-44", "text": "Consistent with Boulis and Ostendorf (2005) , we utilized the Fisher telephone conversation corpus (Cieri et al., 2004) and we also evaluated performance on the standard Switchboard conversational corpus (Godfrey et al., 1992) , both collected and annotated by the Linguistic Data Consortium." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-45", "text": "In both cases, we utilized the provided metadata (including true speaker gender, age, native language, etc.) only as class labels for both training and evaluation, but never as features in the classification." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-46", "text": "The primary task we employed was identical to Boulis and Ostendorf (2005) , namely the classification of gender, etc." 
}, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-47", "text": "of each speaker in an isolated conversation, but we also evaluate performance when classifying speaker attributes given the combination of multiple conversations in which the speaker has participated." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-48", "text": "The Fisher corpus contains a total of 11971 speakers and each speaker participated in 1-3 conversations, resulting in a total of 23398 conversation sides (i.e. the transcript of a single speaker in a single conversation)." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-49", "text": "We followed the preprocessing steps and experimental setup of Boulis and Ostendorf (2005) as closely as possible given the details presented in their paper, although some details such as the exact training/test partition were not obtainable from either the paper or personal communication." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-50", "text": "This resulted in a training set of 9000 speakers with 17587 conversation sides and a test set of 1000 speakers with 2008 conversation sides." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-51", "text": "The Switchboard corpus was much smaller and consisted of 543 speakers, with 443 speakers used for training and 100 speakers used for testing, resulting in a total of 4062 conversation sides for training and 808 conversation sides for testing." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-52", "text": "----------------------------------" }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-53", "text": "**MODELING GENDER VIA NGRAM FEATURES (BOULIS AND OSTENDORF, 2005)**" }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-54", "text": "As our reference algorithm, we used the current state-of-the-art system developed by Boulis and Ostendorf (2005) using unigram and bigram features in an SVM framework." 
}, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-55", "text": "We reimplemented this model as our reference for gender classification, further details of which are given below:" }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-56", "text": "----------------------------------" }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-57", "text": "**TRAINING VECTORS**" }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-58", "text": "For each conversation side, a training example was created using unigram and bigram features with tf-idf weighting, as done in standard text classification approaches." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-59", "text": "However, stopwords were retained in the feature set as various sociolinguistic studies have shown that use of some of the stopwords, for instance, pronouns and determiners, are correlated with age and gender." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-60", "text": "Also, only the ngrams with frequency greater than 5 were retained in the feature set following Boulis and Ostendorf (2005) ." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-61", "text": "This resulted in a total of 227,450 features for the Fisher corpus and 57,914 features for the Switchboard corpus." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-62", "text": "----------------------------------" }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-63", "text": "**MODEL**" }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-64", "text": "After extracting the ngrams, a SVM model was trained via the SVM light toolkit (Joachims, 1999) using the linear kernel with the default toolkit settings." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-65", "text": "Table 1 shows the most discriminative ngrams for gender based on the weights assigned by the linear SVM model." 
}, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-66", "text": "It is interesting that some of the gender-correlated words proposed by sociolinguistics are also found by this empirical approach, including the frequent use of \"oh\" by females and also obvious indicators of gender such as \"my wife\" or \"my husband\", etc." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-67", "text": "Also, named entity \"Mike\" shows up as a discriminative unigram, this maybe due to the self-introduction at the beginning of the conversations and \"Mike\" being a common male name." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-68", "text": "For compatibility with Boulis and Ostendorf (2005) , no special pre- processing for names is performed, and they are treated as just any other unigrams or bigrams 1 ." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-69", "text": "Furthermore, the ngram-based approach scales well with varying the amount of conversation utilized in training the model as shown in Figure 1 ." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-70", "text": "The \"Boulis and Ostendorf, 05\" rows in Table 3 show the performance of this reimplemented algorithm on both the Fisher (90.84%) and Switchboard (90.22%) corpora, under the identical training and test conditions used elsewhere in our paper for direct comparison with subsequent results 2 ." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-71", "text": "----------------------------------" }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-72", "text": "**EFFECT OF PARTNER'S GENDER**" }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-73", "text": "Our original contribution in this section is the successful modeling of speaker properties (e.g. gender/age) based on the prior and joint modeling of the partner speaker's gender/age in the same discourse." 
}, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-74", "text": "The motivation here is that people tend to use stronger gender-specific, age-specific or dialect-specific word/phrase usage and discourse properties when speaking with someone of a similar gender/age/dialect than when speaking with someone of a different gender/age/dialect, when they may adapt a more neutral speaking style." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-75", "text": "Also, discourse properties such as relative use of the passive and percentage of the conversation dominated may vary depending on the gender or age relationship with the speaking partner." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-76", "text": "We employ several varieties of classifier stacking and joint modeling to be effectively sensitive to these differences." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-77", "text": "To illustrate the significance of 1 A natural extension of this work, however, would be to do explicit extraction of self introductions and then do tablelookup-based gender classification, although we did not do so for consistency with the reference algorithm." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-78", "text": "2 The modest differences with their reported results may be due to unreported details such as the exact training/test splits or SVM parameterizations, so for the purposes of assessing the relative gain of our subsequent enhancements we base all reported experiments on the internally-consistent configurations as (re-)implemented here. the \"partner effect\", Table 2 shows the difference in the standard algorithm performance between same-gender conversations (when gender-specific style flourishes) and mixed-gender conversations (where more neutral styles are harder to classify)." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-79", "text": "Table 3 shows the classwise performance of classifying the entire conversation into four possible categories." 
}, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-80", "text": "We can see that the mixed-gender cases are also significantly harder to classify on a conversation level granularity." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-81", "text": "----------------------------------" }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-82", "text": "**ORACLE EXPERIMENT**" }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-83", "text": "To assess the potential gains from full exploitation of partner-sensitive modeling, we first report the result from an oracle experiment, where we assume we know whether the conversation is homogeneous (same gender) or heterogeneous (different gender)." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-84", "text": "In order to effectively utilize this information, we classify both the test conversation side and the partner side, and if the classifier is more confident about the partner side then we choose the gender of the test conversation side based on the heterogeneous/homogeneous information." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-85", "text": "The overall accuracy improves to 96.46% on the Fisher corpus using this oracle (from 90.84%), leading us to the experiment where the oracle is replaced with a non-oracle SVM model trained on a subset of training data such that all test conversation sides (of the speaker and the partner) are excluded from the training set." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-86", "text": "----------------------------------" }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-87", "text": "**REPLACING ORACLE BY A HOMOGENEOUS VS**" }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-88", "text": "Heterogenous Classifier Given the substantial improvement using the Oracle information, we initially trained another bi-nary classifier for classifying the conversation as mixed or single-gender." 
}, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-89", "text": "It turns out that this task is much harder than the single-side gender classification, task and achieved only a low accuracy value of 68.35% on the Fisher corpus." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-90", "text": "Intuitively, the homogeneous vs. hetereogeneous partition results in a much harder classification task because the two diverse classes of male-male and femalefemale conversations are grouped into one class (\"homogeneous\") resulting in linearly inseparable classes 3 ." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-91", "text": "This subsequently lead us to create two different classifiers for conversations, namely, male-male vs rest and female-female vs rest 4 used in a classifier combination framework as follows: 5.3 Modeling partner via conditional model and whole-conversation model The following classifiers were trained and each of their scores was used as a feature in a meta SVM classifier:" }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-92", "text": "1. Male-Male vs Rest: Classifying the entire conversation (using test speaker and partner's sides) as male-male or other 5 . 2. Female-Female vs Rest: Classifying the entire conversation (using test speaker and partner's sides) as female-female or other." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-93", "text": "3." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-94", "text": "Conditional model of gender given most likely partner's gender: Two separate classifiers were trained for classifying the gender of a given conversation side, one where the partner is male and other where the partner is female." 
}, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-95", "text": "Given a test conversation side, we first choose the most likely gender of the partner's conversation side using the ngrambased model 6 and then choose the gender of the test conversation side using the appropriate conditional model." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-96", "text": "4. Ngram model as explained in Section 4." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-97", "text": "The row labeled \"+ Partner Model\" in Table 4 shows the performance gain obtained via this meta-classifier incorporating conversation type and partner-conditioned models." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-98", "text": "3 Even non-linear kernels were not able to find a good classification boundary 4 We also explored training a 3-way classifier, male-male, female-female, mixed and the results were similar to that of the binarized setup 5 For classifying the conversations as male-male vs rest or female-female vs rest, all the conversations with either the speaker or the partner present in any of the test conversations were eliminated from the training set, thus creating a disjoint training and test conversation partitions." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-99", "text": "6 All the partner conversation sides of test speakers were removed from the training data and the ngram-based model was retrained on the remaining subset." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-100", "text": "----------------------------------" }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-101", "text": "**INCORPORATING SOCIOLINGUISTIC FEATURES**" }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-102", "text": "The sociolinguistic literature has shown gender differences for speakers due to features such as speaking rate, pronoun usage and filler word usage." 
}, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-103", "text": "While ngram features are able to reasonably predict speaker gender due to their high detail and coverage and the overall importance of lexical choice in gender differences while speaking, the sociolinguistics literature suggests that other nonlexical features can further help improve performance, and more importantly, advance our understanding of gender differences in discourse." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-104", "text": "Thus, on top of the standard Boulis and Ostendorf (2005) model, we also investigated the following features motivated by the sociolinguistic literature on gender differences in discourse (Macaulay, 2005) :" }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-105", "text": "1. % of conversation spoken: We measured the speaker's fraction of conversation spoken via three features extracted from the transcripts: % of words, utterances and time." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-106", "text": "2." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-107", "text": "Speaker rate: Some studies have shown that males speak faster than females (Yuan et al., 2006) as can also be observed in Figure 2 showing empirical data obtained from Switchboard corpus." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-108", "text": "The speaker rate was measured in words/sec., using starting and ending time-stamps for the discourse." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-109", "text": "3. % of pronoun usage: Macaulay (2005) argues that females tend to use more third-person male/female pronouns (he, she, him, her and his) as compared to males." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-110", "text": "4. % of back-channel responses such as \"(laughter)\" and \"(lipsmacks)\"." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-111", "text": "5. 
% of passive usage: Passives were detected by extracting a list of past-participle verbs from Penn Treebank and using occurences of \"form of \"to be\" + past participle\"." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-112", "text": "6. % of short utterances (<= 3 words)." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-113", "text": "7. % of modal auxiliaries, subordinate clauses." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-114", "text": "8. % of \"mm\" tokens such as \"mhm\", \"um\", \"uh-huh\", \"uh\", \"hm\", \"hmm\",etc." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-115", "text": "9." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-116", "text": "Type-token ratio 10." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-117", "text": "Mean inter-utterance time: Avg. time taken between utterances of the same speaker." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-118", "text": "11. % of \"yeah\" occurences." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-119", "text": "12. % of WH-question words." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-120", "text": "13. % Mean word and utterance length." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-121", "text": "The above classes resulted in a total of 16 sociolinguistic features which were added based on feature ablation studies as features in the meta SVM classifier along with the 4 features as explained previously in Section 5.3." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-122", "text": "The rows in Table 4 labeled \"+ (any sociolinguistic feature)\" show the performance gain using the respective features described in this section." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-123", "text": "Each row indicates an additive effect in the feature ablation, showing the result of adding the current sociolinguistic feature with the set of features mentioned in the rows above." 
}, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-124", "text": "Table 4 combines the results of the experiments reported in the previous sections, assessed on both the Fisher and Switchboard corpora for gender classification." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-125", "text": "The evaluation measure was the standard classifier accuracy, that is, the fraction of test conversation sides whose gender was correctly predicted." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-126", "text": "Baseline performance (always guessing female) yields 57.47% and 51.6% on Fisher and Switchboard respectively." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-127", "text": "As noted before, the standard reference algorithm is Boulis and Ostendorf (2005) , and all cited relative error reductions are based on this established standard, as implemented in this paper." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-128", "text": "Also, as a second reference, performance is also cited for the popular \"Gender Genie\", an online gender-detector 7 , based on the manually weighted word-level sociolinguistic features discussed in Argamon et al. (2003) ." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-129", "text": "The additional table rows are described in Sections 4-6, and cumulatively yield substantial improvements over the Boulis and Ostendorf (2005) with the work reported by Boulis and Ostendorf (2005) ), all of the above models can be easily extended to per-speaker evaluation by pooling in the predictions from multiple conversations of the same speaker." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-130", "text": "Table 5 shows the result of each model on a per-speaker basis using a majority vote of the predictions made on the individual conversations of the respective speaker." 
}, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-131", "text": "The consensus model when applied to Switchboard corpus show larger gains as it has 9.38 conversations per speaker on average as compared to 1.95 conversations per speaker on average in Fisher." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-132", "text": "The results on Switchboard corpus show a very large reduction in error rate of more than 57% with respect to the standard algorithm, further indicating the usefulness of the partner-sensitive model and richer sociolinguistic features when more conversational evidence is available." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-133", "text": "----------------------------------" }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-134", "text": "**GENDER CLASSIFICATION RESULTS**" }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-135", "text": "----------------------------------" }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-136", "text": "**APPLICATION TO ARABIC LANGUAGE**" }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-137", "text": "It would be interesting to see how the Boulis and Ostendorf (2005)" }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-138", "text": "----------------------------------" }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-139", "text": "**APPLICATION TO EMAIL GENRE**" }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-140", "text": "A primary motivation for using only the speaker transcripts as compared to also using acoustic properties of the speaker (Bocklet et al., 2008) was to enable the application of the models to other new genres." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-141", "text": "In order to empirically support this motivation, we also tested the performance of the models explored in this paper on the Enron email corpus (Klimt and Yang, 2004) ." 
}, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-142", "text": "We manually annotated the sender's gender on a random collection of emails taken from the corpus." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-143", "text": "The resulting training and test sets after preprocessing for header information, reply-to's, forwarded messages consisted of 1579 and 204 emails respectively." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-144", "text": "In addition to ngram features, a subset of sociolinguistic features that could be extracted for email were also utilized." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-145", "text": "Based on the prior distribution, always guessing the most likely class (\"male\") resulted in 63.2% accuracy." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-146", "text": "We can see from Table 7 that the Boulis and Ostendorf (2005) Model Acc." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-147", "text": "Error Reduc." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-148", "text": "Gulf Arabic (52.5% sides are male) Ngram (Boulis & Ostendorf, 05) Results for Age and Native/Non-Native: Based on the prior distribution, always guessing the most likely class for age ( age less-than-orequal-to 40) results in 62.59% accuracy and always guessing the most likely class for native language (non-native) yields 50.59% accuracy." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-149", "text": "Table 9 shows the results for age and native/nonnative speaker status." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-150", "text": "We can see that the ngrambased approach for gender also gives reasonable performance on other speaker attributes, and more importantly, both the partner-model and sociolinguistic features help in reducing the error rate on age and native language substantially, indicating their usefulness not just on gender but also on other diverse latent attributes." 
}, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-151", "text": "Table 8 shows the most discriminative ngrams for binary classification of age, it is interesting to see the use of \"well\" right on top of the list for older speakers, also found in the sociolinguistic studies for age (Macaulay, 2005) ." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-152", "text": "We also see that older speakers talk about their children (\"my daughter\") and younger speakers talk about their parents (\"my mom\"), the use of words such as \"wow\", \"kinda\" and \"cool\" is also common in younger speakers." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-153", "text": "To give maximal consistency/benefit to the Boulis and Ostendorf (2005) n-gram-based model, we did not filter the self-reporting n-grams such as \"im forty\" and \"im thirty\", putting our sociolinguisticliterature-based and discourse-style-based features at a relative disadvantage." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-154", "text": "----------------------------------" }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-155", "text": "**CONCLUSION**" }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-156", "text": "This paper has presented and evaluated several original techniques for the latent classification of speaker gender, age and native language in diverse genres and languages." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-157", "text": "A novel partner-sensitve model shows performance gains from the joint modeling of speaker attributes along with partner speaker attributes, given the differences in lexical usage and discourse style such as observed between same-gender and mixed-gender conversations." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-158", "text": "The robustness of the partner-model is substantially supported based on the consistent performance gains achieved in diverse languages and attributes." 
}, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-159", "text": "This paper has also explored a rich variety of novel sociolinguistic and discourse-based features, including mean utterance length, passive/active usage, percentage domination of the conversation, speaking rate and filler word usage." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-160", "text": "In addition to these novel models, the paper also shows how these models and the previous work extend to new languages and genres." }, { "sent_id": "ff7bafb8f21118ca3c908603ef32d0-C001-161", "text": "Cumulatively up to 20% error reduction is achieved relative to the standard Boulis and Ostendorf (2005) algorithm for classifying individual conversations on Switchboard, and accuracy for gender detection on the Switchboard corpus (aggregate) and Gulf Arabic exceeds 95%." } ], "y": { "@DIF@": { "gold_contexts": [ [ "ff7bafb8f21118ca3c908603ef32d0-C001-4", "ff7bafb8f21118ca3c908603ef32d0-C001-5" ], [ "ff7bafb8f21118ca3c908603ef32d0-C001-129" ], [ "ff7bafb8f21118ca3c908603ef32d0-C001-153" ], [ "ff7bafb8f21118ca3c908603ef32d0-C001-161" ] ], "cite_sentences": [ "ff7bafb8f21118ca3c908603ef32d0-C001-5", "ff7bafb8f21118ca3c908603ef32d0-C001-129", "ff7bafb8f21118ca3c908603ef32d0-C001-153", "ff7bafb8f21118ca3c908603ef32d0-C001-161" ] }, "@EXT@": { "gold_contexts": [ [ "ff7bafb8f21118ca3c908603ef32d0-C001-17" ], [ "ff7bafb8f21118ca3c908603ef32d0-C001-20" ], [ "ff7bafb8f21118ca3c908603ef32d0-C001-25" ], [ "ff7bafb8f21118ca3c908603ef32d0-C001-104" ] ], "cite_sentences": [ "ff7bafb8f21118ca3c908603ef32d0-C001-17", "ff7bafb8f21118ca3c908603ef32d0-C001-20", "ff7bafb8f21118ca3c908603ef32d0-C001-25", "ff7bafb8f21118ca3c908603ef32d0-C001-104" ] }, "@BACK@": { "gold_contexts": [ [ "ff7bafb8f21118ca3c908603ef32d0-C001-35", "ff7bafb8f21118ca3c908603ef32d0-C001-36", "ff7bafb8f21118ca3c908603ef32d0-C001-39" ] ], "cite_sentences": [ "ff7bafb8f21118ca3c908603ef32d0-C001-35", "ff7bafb8f21118ca3c908603ef32d0-C001-36", 
"ff7bafb8f21118ca3c908603ef32d0-C001-39" ] }, "@MOT@": { "gold_contexts": [ [ "ff7bafb8f21118ca3c908603ef32d0-C001-35", "ff7bafb8f21118ca3c908603ef32d0-C001-36", "ff7bafb8f21118ca3c908603ef32d0-C001-39", "ff7bafb8f21118ca3c908603ef32d0-C001-40" ] ], "cite_sentences": [ "ff7bafb8f21118ca3c908603ef32d0-C001-35", "ff7bafb8f21118ca3c908603ef32d0-C001-36", "ff7bafb8f21118ca3c908603ef32d0-C001-39" ] }, "@USE@": { "gold_contexts": [ [ "ff7bafb8f21118ca3c908603ef32d0-C001-44" ], [ "ff7bafb8f21118ca3c908603ef32d0-C001-46" ], [ "ff7bafb8f21118ca3c908603ef32d0-C001-49" ], [ "ff7bafb8f21118ca3c908603ef32d0-C001-54", "ff7bafb8f21118ca3c908603ef32d0-C001-55" ], [ "ff7bafb8f21118ca3c908603ef32d0-C001-60" ], [ "ff7bafb8f21118ca3c908603ef32d0-C001-68" ], [ "ff7bafb8f21118ca3c908603ef32d0-C001-70" ], [ "ff7bafb8f21118ca3c908603ef32d0-C001-127" ] ], "cite_sentences": [ "ff7bafb8f21118ca3c908603ef32d0-C001-44", "ff7bafb8f21118ca3c908603ef32d0-C001-46", "ff7bafb8f21118ca3c908603ef32d0-C001-49", "ff7bafb8f21118ca3c908603ef32d0-C001-54", "ff7bafb8f21118ca3c908603ef32d0-C001-55", "ff7bafb8f21118ca3c908603ef32d0-C001-60", "ff7bafb8f21118ca3c908603ef32d0-C001-68", "ff7bafb8f21118ca3c908603ef32d0-C001-70", "ff7bafb8f21118ca3c908603ef32d0-C001-127" ] } } }, "ABC_1c51e45e2917268e0ab5ce43a69655_6": { "x": [ { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-2", "text": "Hate speech in the form of racism and sexism is commonplace on the internet (Waseem and Hovy, 2016) ." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-3", "text": "For this reason, there has been both an academic and an industry interest in detection of hate speech." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-4", "text": "The volume of data to be reviewed for creating data sets encourages a use of crowd sourcing for the annotation efforts." 
}, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-5", "text": "In this paper, we provide an examination of the influence of annotator knowledge of hate speech on classification models by comparing classification results obtained from training on expert and amateur annotations." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-6", "text": "We provide an evaluation on our own data set and run our models on the data set released by Waseem and Hovy (2016)." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-7", "text": "We find that amateur annotators are more likely than expert annotators to label items as hate speech, and that systems trained on expert annotations outperform systems trained on amateur annotations." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-8", "text": "----------------------------------" }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-10", "text": "Large amounts of hate speech on exists on platforms that allow for user generated documents, which creates a need to detect and filter it (Nobata et al., 2016) , and to create data sets that contain hate speech and are annotated for the occurrence of hate speech." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-11", "text": "The need for corpus creation must be weighted against the psychological tax of being exposed to large amounts of abusive language (Chen, 2012) ." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-12", "text": "A number of studies on profanity and hate speech detection, have crowdsourced their annotations due to the resources required to annotate large data sets and the possibility of distributing the load onto the crowd (Warner and Hirschberg, 2012; Nobata et al., 2016) ." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-13", "text": "Ross et al. 
(2016) investigate annotator reliability for hate speech annotation, concluding that \"hate speech is a fuzzy construct that requires significantly better definitions and guidelines in order to be annotated reliably\"." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-14", "text": "Hate speech is hard to detect for humans (Sue et al., 2007) , which warrants a thorough understanding of the benefits and pitfalls of crowdsourced annotation." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-15", "text": "This need is reinforced by previous studies, which utilize crowdsourcing of hate speech without knowledge of the quality of crowdsourced annotations for hate speech labeling." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-16", "text": "In addition, it is important to understand how different manners of obtaining labels can influence the classification models and how it is possible to obtain good annotations while ensuring that annotators are not likely to experience adverse effects of annotating hate speech." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-17", "text": "**Our contribution** We provide annotations of 6,909 tweets for hate speech by annotators from CrowdFlower and annotators that have a theoretical and applied knowledge of hate speech, henceforth amateur and expert annotators 1 ." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-18", "text": "Our data set extends the Waseem and Hovy (2016) data set by 4,033 tweets." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-19", "text": "We also illustrate how amateur and expert annotations influence classification efforts." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-20", "text": "Finally, we show the effects of allowing majority voting on classification and agreement between the amateur and expert annotators." 
}, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-21", "text": "----------------------------------" }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-22", "text": "**DATA**" }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-23", "text": "Our data set is obtained by sampling tweets from the 130k tweets extracted by Waseem and Hovy (2016) ." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-24", "text": "The order of the tweets is selected by our database connection, thus allowing for an overlap with the data set released by Waseem and Hovy (2016) ." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-25", "text": "We find that there is an overlap of 2, 876 tweets (see Table 1) between the two data sets." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-26", "text": "Racism Sexism Neither Count 1 95 2780 Given the distribution of the labels in Waseem and Hovy (2016) and our annotated data set (see Table 2 ), it is to be expected the largest overlap occurs with tweets annotated as negative for hate speech." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-27", "text": "Observing Table 2 , we see that the label distribution in our data set generally differs from the distribution in Waseem and Hovy (2016) ." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-28", "text": "In fact, we see that the amateur majority voted labels is the only distribution that tends towards a label distribution similar to Waseem and Hovy (2016) Our annotation effort deviates from Waseem and Hovy (2016) ." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-29", "text": "In addition to \"racism\", \"sexism\", and \"neither\", we add the label \"both\" for tweets that contain both racism and sexism." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-30", "text": "We add this label, as the intersection of multiple oppressions can differ from the forms of oppression it consists of (Crenshaw, 1989) , and as such becomes a unique form of oppression." 
}, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-31", "text": "Thus, we introduce a labeling scheme that follows an intersectional approach (Crenshaw, 1989) ." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-32", "text": "We do not require annotators to follow links." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-33", "text": "Instead, we ask them to annotate tweets containing only links as \"Neither\"." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-34", "text": "----------------------------------" }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-35", "text": "**EXPERT ANNOTATIONS**" }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-36", "text": "We recruit feminist and antiracism activists to annotate the data set." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-37", "text": "We present the annotators with the tests from Waseem and Hovy (2016) ." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-38", "text": "If a tweet fails any of the tests, the annotators are instructed to label it as the relevant form of hate speech." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-39", "text": "Expert annotators are given the choice of skipping tweets if they are not confident which label to assign, and a \"Noise\" label in case the annotators are presented with non-English tweets." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-40", "text": "Due to privacy concerns, all expert annotators are treated as a single entity." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-41", "text": "Amateur Annotations Amateur annotators are recruited on CrowdFlower without any selection, to mitigate selection biases." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-42", "text": "They are presented with 6,909 tweets that have been annotated by the expert annotators." 
}, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-43", "text": "The amateur annotators are not provided with the option to skip tweets, as they are not presented tweets the experts had skipped or labeled as \"Noise\"." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-44", "text": "Annotator agreement Considering annotator agreement, we find that the inter-annotator agreement among the amateur annotators is \u03ba = 0.57 (\u03c3 = 0.08)." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-45", "text": "Agreement with the expert annotations (Table 2): Majority Vote 0.34, Full Agreement 0.70. The low agreement in Table 2 provides further evidence for the claim by Ross et al. (2016) that annotation of hate speech is a hard task." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-46", "text": "Table 2 suggests that if only cases of full agreement are considered, it is possible to obtain good annotations using crowdsourcing." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-47", "text": "Overlap Considering the overlap with the Waseem and Hovy (2016) data set, we see that the agreement is extremely low (mean pairwise \u03ba = 0.14 between all annotator groups and Waseem and Hovy (2016) )." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-48", "text": "Interestingly, we see that the vast majority of disagreements between our annotators and Waseem and Hovy (2016) are disagreements where our annotators do not find hate speech but Waseem and Hovy (2016) do. We examine the influence of the features listed in Table 4 for each annotator group." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-49", "text": "Model Selection We perform a grid search over all possible feature combinations to find the best performing features." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-50", "text": "We find that the features with the highest overall scores are not necessarily the features with the best performance across all classes." 
}, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-51", "text": "For instance, token unigrams obtains the highest F1-score, precision, and the second highest recall on the amateur annotations, yet this feature fails to classify the minority classes." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-52", "text": "Features We use a range of features focusing on both the textual information given in the tweets as well as extra-linguistic information including POS tags obtained using Gimpel et al. (2011) and Spacy 2 ." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-53", "text": "In Table 4 3 , we see that the most significant features trained on majority voted amateur annotations emphasize extra-linguistic features while the most significant features trained on expert annotations emphasize the content of the tweets." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-54", "text": "----------------------------------" }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-55", "text": "**BROWN CLUSTERS AND LENGTH**" }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-56", "text": "We highlight the use of Brown Clusters (Brown et al., 1992) and length features (as inspired by Nobata et al. (2016) ), as these are the only two features that classify the minority classes for both amateur and expert annotators." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-57", "text": "We use an in-house mapping of brown clusters, replacing unigrams with cluster identifiers." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-58", "text": "2 www.spacy.io 3 Italics signify the best performing feature on expert annotations, bold signify the best performing features on amateur annotations (majority voting)." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-59", "text": "These best performing features are then used for the respective \"best\" feature sets." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-60", "text": "We follow Nobata et al. 
(2016) in their use of the length of comments in tokens, and the average length of the words in a tweet." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-61", "text": "Author Historical Salient Terms Given the promising results obtained for sarcasm detection (Bamman and Smith, 2015) , we calculate the Author Historical Salient Terms (AHST)." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-62", "text": "We obtain up to 3200 tweets for each user in our data set, calculate the TF-IDF scores, and identify the top 100 terms." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-63", "text": "We then add a binary feature signifying the occurrence of each of these 100 terms." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-64", "text": "Interestingly, this feature performs worse than any other feature." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-65", "text": "This is particularly true when trained on expert annotations, suggesting that hate speech may be more situational, or that users engaging in hate speech do not only, or even primarily, engage in hate speech." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-66", "text": "Gender Following the indication that gender can positively influence classification scores (Waseem and Hovy, 2016) , we compute the gender of the users in our data set." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-67", "text": "To counteract the low coverage in Waseem and Hovy (2016) , we use a lexicon trained on Twitter (Sap et al., 2014) to calculate the probability of gender." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-68", "text": "Using these probabilities we assign binary gender." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-69", "text": "Both the probability of a gender for a user and the binary gender are used as individual features." 
}, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-70", "text": "We find that using gender information only contributes to the classification score for amateur annotators." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-71", "text": "Minority Class Misclassification We find that some features trained on expert and amateur annotations result in misclassification on the minority classes, including identifying no instances of the minority classes (see Table 4 )." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-72", "text": "These misclassifications of the minority classes are largely due to the small number of instances in those classes." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-73", "text": "In spite of this, we do not believe that only boosting the size of the minority classes is a good approach, as we should seek to mimic reality in our data sets for hate speech detection." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-74", "text": "Results Running our system on the Waseem and Hovy (2016) data set, we find that our best performing system does not substantially outperform Waseem and Hovy (2016) on the binary classification task. Interestingly, the main cause of error is false positives." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-75", "text": "This holds true using both amateur and expert annotations." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-76", "text": "We mitigate personal bias in our annotations, as multiple people have participated in the annotation process." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-77", "text": "Waseem and Hovy (2016) may suffer from personal bias, as only the authors annotated, and only the annotations positive for hate speech were reviewed by one other person." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-78", "text": "It is our contention that hate speech corpora should reflect real life, in that hate speech is comparatively rare." 
}, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-79", "text": "Given that some of our features obtain high F1-scores in spite of not classifying the minority classes, we suggest that the unweighted F1-score may not be an appropriate metric to evaluate classification on hate speech corpora." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-80", "text": "----------------------------------" }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-81", "text": "**RELATED WORK**" }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-82", "text": "Most related work in the field of abusive language detection has focused on detecting profanity using list-based methods to identify offensive words (Sood et al., 2012; Chen et al., 2012) ." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-83", "text": "These methods traditionally suffer from poor recall and do not address hate speech." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-84", "text": "While Sood et al. (2012) incorporate edit distances to find variants of slurs, they are not able to find terms that do not occur in these lists." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-85", "text": "Nobata et al. (2016) address this by using comprehensive lists of slurs obtained from Hatebase 4 ." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-86", "text": "Waseem and Hovy (2016) and Ross et al. (2016) focus on building corpora which they annotate for containing hate speech." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-87", "text": "Our work closely resembles Waseem and Hovy (2016) , as they also run classification experiments on a hate speech data set." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-88", "text": "Waseem and Hovy (2016) obtain an F1-score of 73.91 on their data set, using character n-grams and gender information." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-89", "text": "Nobata et al. 
(2016) employ a wide array of features for abusive language detection, including but not limited to POS tags, the number of blacklisted words in a document, n-gram features including token and character n-grams, and length features." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-90", "text": "The primary challenge this paper presents is the need for good annotation guidelines if one wishes to detect specific subsets of abusive language." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-91", "text": "----------------------------------" }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-92", "text": "**CONCLUSION**" }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-93", "text": "We find that using expert annotations can produce models that perform comparably to previous classification efforts." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-94", "text": "Our best model is on par with previous work on the Waseem and Hovy (2016) data set for the binary classification task but under-performs for the multi-class classification task." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-95", "text": "We suggest that a weighted F1-score be applied in evaluation of classification efforts on hate speech corpora, such that misclassification on minority classes is penalized." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-96", "text": "Our annotation and classification results expand on the claim of Ross et al. (2016) that hate speech is hard to annotate without intimate knowledge of hate speech." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-97", "text": "Furthermore, we find that considering only cases of full agreement among amateur annotators can produce relatively good annotations as compared to expert annotators." 
}, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-98", "text": "This can allow for a significant decrease in the annotation burden of expert annotators by asking them to primarily consider the cases in which amateur annotators have disagreed." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-99", "text": "Future Work We will seek to further investigate socio-linguistic features such as gender and location." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-100", "text": "Furthermore, we will expand to more forms of hate speech." }, { "sent_id": "1c51e45e2917268e0ab5ce43a69655-C001-101", "text": "Finally, we will review the negative class in Waseem and Hovy (2016) ." } ], "y": { "@BACK@": { "gold_contexts": [ [ "1c51e45e2917268e0ab5ce43a69655-C001-2" ], [ "1c51e45e2917268e0ab5ce43a69655-C001-86" ], [ "1c51e45e2917268e0ab5ce43a69655-C001-88" ] ], "cite_sentences": [ "1c51e45e2917268e0ab5ce43a69655-C001-2", "1c51e45e2917268e0ab5ce43a69655-C001-86", "1c51e45e2917268e0ab5ce43a69655-C001-88" ] }, "@USE@": { "gold_contexts": [ [ "1c51e45e2917268e0ab5ce43a69655-C001-6" ], [ "1c51e45e2917268e0ab5ce43a69655-C001-23" ], [ "1c51e45e2917268e0ab5ce43a69655-C001-37", "1c51e45e2917268e0ab5ce43a69655-C001-38" ], [ "1c51e45e2917268e0ab5ce43a69655-C001-66" ] ], "cite_sentences": [ "1c51e45e2917268e0ab5ce43a69655-C001-6", "1c51e45e2917268e0ab5ce43a69655-C001-23", "1c51e45e2917268e0ab5ce43a69655-C001-37", "1c51e45e2917268e0ab5ce43a69655-C001-38", "1c51e45e2917268e0ab5ce43a69655-C001-66" ] }, "@EXT@": { "gold_contexts": [ [ "1c51e45e2917268e0ab5ce43a69655-C001-18" ], [ "1c51e45e2917268e0ab5ce43a69655-C001-67" ] ], "cite_sentences": [ "1c51e45e2917268e0ab5ce43a69655-C001-18", "1c51e45e2917268e0ab5ce43a69655-C001-67" ] }, "@SIM@": { "gold_contexts": [ [ "1c51e45e2917268e0ab5ce43a69655-C001-24" ], [ "1c51e45e2917268e0ab5ce43a69655-C001-26" ], [ "1c51e45e2917268e0ab5ce43a69655-C001-28" ], [ "1c51e45e2917268e0ab5ce43a69655-C001-47" ], [ "1c51e45e2917268e0ab5ce43a69655-C001-74" 
], [ "1c51e45e2917268e0ab5ce43a69655-C001-87" ], [ "1c51e45e2917268e0ab5ce43a69655-C001-94" ] ], "cite_sentences": [ "1c51e45e2917268e0ab5ce43a69655-C001-24", "1c51e45e2917268e0ab5ce43a69655-C001-26", "1c51e45e2917268e0ab5ce43a69655-C001-28", "1c51e45e2917268e0ab5ce43a69655-C001-47", "1c51e45e2917268e0ab5ce43a69655-C001-74", "1c51e45e2917268e0ab5ce43a69655-C001-87", "1c51e45e2917268e0ab5ce43a69655-C001-94" ] }, "@DIF@": { "gold_contexts": [ [ "1c51e45e2917268e0ab5ce43a69655-C001-27" ], [ "1c51e45e2917268e0ab5ce43a69655-C001-48" ], [ "1c51e45e2917268e0ab5ce43a69655-C001-76", "1c51e45e2917268e0ab5ce43a69655-C001-77" ], [ "1c51e45e2917268e0ab5ce43a69655-C001-94" ] ], "cite_sentences": [ "1c51e45e2917268e0ab5ce43a69655-C001-27", "1c51e45e2917268e0ab5ce43a69655-C001-48", "1c51e45e2917268e0ab5ce43a69655-C001-77", "1c51e45e2917268e0ab5ce43a69655-C001-94" ] }, "@MOT@": { "gold_contexts": [ [ "1c51e45e2917268e0ab5ce43a69655-C001-101" ] ], "cite_sentences": [ "1c51e45e2917268e0ab5ce43a69655-C001-101" ] } } }, "ABC_04b525b91b48e31258287a015d0401_6": { "x": [ { "sent_id": "04b525b91b48e31258287a015d0401-C001-60", "text": "Wikipedia." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-2", "text": "Requiring only category names as user input is a highly attractive, yet hardly explored, setting for text categorization." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-3", "text": "Earlier bootstrapping results relied on similarity in LSA space, which captures rather coarse contextual similarity." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-4", "text": "We suggest improving this scheme by identifying concrete references to the category name's meaning, obtaining a special variant of lexical expansion." 
}, { "sent_id": "04b525b91b48e31258287a015d0401-C001-5", "text": "----------------------------------" }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-7", "text": "Topical Text Categorization (TC), the task of classifying documents by pre-defined topics, is most commonly addressed as a supervised learning task." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-8", "text": "However, the supervised setting requires a substantial amount of manually labeled documents, which is often impractical in real-life settings." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-9", "text": "Keyword-based TC methods (see Section 2) aim at a more practical setting." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-10", "text": "Each category is represented by a list of characteristic keywords, which should capture the category meaning." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-11", "text": "Classification is then based on measuring similarity between the category keywords and the classified documents, typically followed by a bootstrapping step." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-12", "text": "The manual effort is thus reduced to providing a keyword list per category, which was partly automated in some works through clustering." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-13", "text": "The keyword-based approach still requires non-negligible manual work in creating a representative keyword list per category." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-14", "text": "(Gliozzo et al., 2005) succeeded in eliminating this requirement by using the category name alone as the initial keyword, yet obtaining superior performance within the keyword-based approach." 
}, { "sent_id": "04b525b91b48e31258287a015d0401-C001-15", "text": "This was achieved by measuring similarity between category names and documents in Latent Semantic space (LSA), which implicitly captures contextual similarities for the category name through unsupervised dimensionality reduction." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-16", "text": "Requiring only category names as user input seems very attractive, particularly when labeled training data is too costly while modest performance (relative to supervised methods) is still useful." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-17", "text": "The goal of our research is to further improve the scheme of text categorization from category name, which was hardly explored in prior work." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-18", "text": "When analyzing the behavior of the LSA representation of (Gliozzo et al., 2005) we noticed that it captures two types of similarities between the category name and document terms." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-19", "text": "One type regards words which refer specifically to the category name's meaning, such as pitcher for the category Baseball." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-20", "text": "However, typical context words for the category which do not necessarily imply its specific meaning, like stadium, also come up as similar to baseball in LSA space." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-21", "text": "This limits the method's precision, due to false-positive classifications of contextually-related documents that do not discuss the specific category topic (such as other sports documents wrongly classified to Baseball)." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-22", "text": "This behavior is quite typical for query expansion methods, which expand a query with contextually correlated terms." 
}, { "sent_id": "04b525b91b48e31258287a015d0401-C001-23", "text": "We propose a novel scheme that models separately these two types of similarity." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-24", "text": "For one, it identifies words that are likely to refer specifically to the category name's meaning (Glickman et al., 2006) , based on certain relations in WordNet and Wikipedia." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-25", "text": "In tandem, we assess the general contextual fit of the category topic using an LSA model, to overcome lexical ambiguity and passing references." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-26", "text": "The evaluations show that tracing lexical references indeed increases classification precision, which in turn improves the eventual classifier obtained through bootstrapping." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-27", "text": "----------------------------------" }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-28", "text": "**BACKGROUND: KEYWORD-BASED TEXT CATEGORIZATION**" }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-29", "text": "The majority of keyword-based TC methods fit the general bootstrapping scheme outlined in Figure 1 , which is cast in terms of a vector-space model." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-30", "text": "The simplest version for step 1 is manual generation of the keyword lists (McCallum and Nigam, 1999) ." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-31", "text": "(Ko and Seo, 2004; Liu et al., 2004) partly automated this step, using clustering to generate candidate keywords." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-32", "text": "These methods employed a standard term-space representation in step 2." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-33", "text": "As described in Section 1, the keyword list in (Gliozzo et al., 2005) consisted of the category name alone." 
}, { "sent_id": "04b525b91b48e31258287a015d0401-C001-34", "text": "This was accompanied by representing the category names and documents (step 2) in LSA space, obtained through cooccurrence-based dimensionality reduction." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-35", "text": "In this space, words that tend to cooccur together, or occur in similar contexts, are represented by similar vectors." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-36", "text": "Thus, vector similarity in LSA space (in step 3) captures implicitly the similarity between the category name and contextually related words within the classified documents." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-84", "text": "**INITIAL CLASSIFICATION**" }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-37", "text": "Step 3 yields an initial similarity-based classification that assigns a single (most similar) category to each document, with Sim(c, d) typically being the cosine between the corresponding vectors." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-38", "text": "This classification is used, in the subsequent bootstrapping step, to train a standard supervised classifier (either single-or multi-class), yielding the eventual classifier for the category set." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-39", "text": "----------------------------------" }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-40", "text": "**INTEGRATING REFERENCE AND CONTEXT**" }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-41", "text": "Our goal is to augment the coarse contextual similarity measurement in earlier work with the identification of concrete references to the category name's meaning." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-42", "text": "We were mostly inspired by (Glickman et al., 2006) , which coined the term lexical reference to denote concrete references in text to the specific meaning of a given term." 
}, { "sent_id": "04b525b91b48e31258287a015d0401-C001-43", "text": "They further showed that an entailing text (in the textual entailment setting) typically includes a concrete reference to each term in the entailed statement." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-44", "text": "Analogously, we assume that a relevant document for a category typically includes concrete terms that refer specifically to the category name's meaning." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-45", "text": "We thus extend the scheme in Figure 1 by creating two vectors per category (in steps 1 and 2): a reference vector c ref in term space, consisting of referring terms for the category name; and a context vector c con , representing the category name in LSA space, as in (Gliozzo et al., 2005) ." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-46", "text": "Step 3 then computes a combined similarity score for categories and documents based on the two vectors." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-47", "text": "----------------------------------" }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-48", "text": "**REFERENCES TO CATEGORY NAMES**" }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-49", "text": "Referring terms are collected from WordNet and Wikipedia, by utilizing relations that are likely to correspond to lexical reference." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-50", "text": "Table 1 illustrates that WordNet provides mostly referring terms of general terminology while Wikipedia provides more specific terms." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-51", "text": "While these resources were used previously for text categorization, it was mostly for enhancing document representation in supervised settings, e.g. (Rodr\u00edguez et al., 2000) ." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-52", "text": "WordNet." 
}, { "sent_id": "04b525b91b48e31258287a015d0401-C001-53", "text": "Referring terms were found in WordNet starting from relevant senses of the category name and transitively following relation types that correspond to lexical reference." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-54", "text": "To that end, we specified for each category name those senses which fit the category's meaning, such as the outer space sense for the category Space." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-55", "text": "1 A category name sense is first expanded by its synonyms and derivations, all of which are then expanded by their hyponyms." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-56", "text": "When a term has no hyponyms it is expanded by its meronyms instead, since we observed that in such cases they often specify unique components that imply the holonym's meaning, such as Egypt for Middle East." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-57", "text": "However, when a term is not a leaf in the hyponymy hierarchy then its meronyms often refer to generic sub-parts, such as door for car." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-58", "text": "Finally, the hyponyms and meronyms are expanded by their derivations." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-59", "text": "As a common heuristic, we considered only the most frequent senses (top 4) of referring terms, avoiding low-ranked (rare) senses which are likely to introduce noise." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-61", "text": "We utilized a subset of a lexical reference resource extracted from Wikipedia (anonymous reference)." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-62", "text": "For each category name we extracted referring terms of two types, capturing hyponyms and synonyms." 
}, { "sent_id": "04b525b91b48e31258287a015d0401-C001-63", "text": "Terms of the first type are Wikipedia page titles for which the first definition sentence includes a syntactic \"is-a\" pattern whose complement is the category name, such as Chevrolet for the category Autos." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-64", "text": "Terms of the second type are extracted from Wikipedia's redirect links, which capture synonyms such as x11 for Windows-X." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-65", "text": "----------------------------------" }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-66", "text": "**INCORPORATING CONTEXT SIMILARITY**" }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-67", "text": "Our key motivation is to utilize Sim_ref as the basis for classification in step 3 (Figure 1 )." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-68", "text": "However, this may yield false positive classifications in two cases: (a) inappropriate sense of an ambiguous referring term, e.g., the narcotic sense of drug should not yield classification to Medicine; (b) a passing reference, e.g., an analogy to cars in a software document, should not yield classification to Autos." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-69", "text": "In both these cases the overall context in the document is expected to be atypical for the triggered category." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-70", "text": "We therefore measure the contextual similarity between a category c and a document d utilizing LSA space, replicating the method in (Gliozzo et al., 2005) : c_con and d_LSA are taken as the LSA vectors of the category name and the document, respectively, yielding Sim_con(c, d) = cos(c_con, d_LSA)." 
}, { "sent_id": "04b525b91b48e31258287a015d0401-C001-71", "text": "2 The overall similarity score of step 3 is defined as Sim(c, d) = Sim_ref(c, d) * Sim_con(c, d)." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-72", "text": "This formula fulfils the requirement of finding at least one referring term in the document; otherwise Sim_ref(c, d) would be zero." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-73", "text": "Sim_con(c, d) is computed in the reduced LSA space and is thus practically non-zero, and would downgrade Sim(c, d) when there is low contextual similarity between the category name and the document." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-74", "text": "Documents for which Sim(c, d) = 0 for all categories are omitted." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-75", "text": "----------------------------------" }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-76", "text": "**RESULTS AND CONCLUSIONS**" }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-77", "text": "We tested our method on the two corpora used in (Gliozzo et al., 2005) : 20-NewsGroups, classified by a single-class scheme (single category per document), and Reuters-10 3 , of a multi-class scheme." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-78", "text": "As in their work, non-standard category names were adjusted, such as Foreign exchange for Money-fx." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-79", "text": "Table 2 presents the results of the initial classification (step 3)." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-80", "text": "The first 4 lines refer to classification based on Sim_ref alone." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-81", "text": "As a baseline, including only the category name in the reference vector (CatName) yields particularly low recall." 
}, { "sent_id": "04b525b91b48e31258287a015d0401-C001-82", "text": "Expansion by WordNet is notably more powerful than by the automatically extracted Wikipedia resource; still, the latter consistently provides a small marginal improvement when using both resources (Reference), indicating their complementary nature." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-83", "text": "----------------------------------" }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-85", "text": "As we hypothesized, the Reference model achieves much better precision than the Context model from (Gliozzo et al., 2005) resources, yielding a lower F1." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-86", "text": "Yet, its higher precision pays off for the bootstrapping step (Section 4.2)." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-87", "text": "Finally, when the two models are Combined a small precision improvement is observed." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-88", "text": "----------------------------------" }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-89", "text": "**FINAL BOOTSTRAPPING RESULTS**" }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-90", "text": "The output of step 3 was fed as standard training for a binary SVM classifier for each category (step 4)." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-91", "text": "We used the default setting for SVM-light, apart from the j parameter which was set to the number of categories in each data set, as suggested by (Morik et al., 1999) ." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-92", "text": "For Reuters-10, classification was determined independently by the classifier of each category, allowing multiple classes per document." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-93", "text": "For 20-NewsGroups, the category which yielded the highest classification score was chosen (one-versusall), fitting the single-class setting." 
}, { "sent_id": "04b525b91b48e31258287a015d0401-C001-94", "text": "We experimented with two document representations for the supervised step: either as vectors in tf-idf weighted term space or as vectors in LSA space." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-95", "text": "Table 3 shows the final classification results." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-96", "text": "4 First, we observe that for the noisy bootstrapping training data LSA document representation is usually preferred." }, { "sent_id": "04b525b91b48e31258287a015d0401-C001-97", "text": "Most importantly, our Reference and Combined models clearly improve over the earlier" } ], "y": { "@BACK@": { "gold_contexts": [ [ "04b525b91b48e31258287a015d0401-C001-14" ], [ "04b525b91b48e31258287a015d0401-C001-18", "04b525b91b48e31258287a015d0401-C001-19" ], [ "04b525b91b48e31258287a015d0401-C001-33", "04b525b91b48e31258287a015d0401-C001-34", "04b525b91b48e31258287a015d0401-C001-35" ] ], "cite_sentences": [ "04b525b91b48e31258287a015d0401-C001-14", "04b525b91b48e31258287a015d0401-C001-18", "04b525b91b48e31258287a015d0401-C001-19", "04b525b91b48e31258287a015d0401-C001-33", "04b525b91b48e31258287a015d0401-C001-34", "04b525b91b48e31258287a015d0401-C001-35" ] }, "@MOT@": { "gold_contexts": [ [ "04b525b91b48e31258287a015d0401-C001-14", "04b525b91b48e31258287a015d0401-C001-17" ], [ "04b525b91b48e31258287a015d0401-C001-18", "04b525b91b48e31258287a015d0401-C001-19", "04b525b91b48e31258287a015d0401-C001-20", "04b525b91b48e31258287a015d0401-C001-21", "04b525b91b48e31258287a015d0401-C001-22", "04b525b91b48e31258287a015d0401-C001-23" ] ], "cite_sentences": [ "04b525b91b48e31258287a015d0401-C001-14", "04b525b91b48e31258287a015d0401-C001-18", "04b525b91b48e31258287a015d0401-C001-19", "04b525b91b48e31258287a015d0401-C001-21", "04b525b91b48e31258287a015d0401-C001-22" ] }, "@USE@": { "gold_contexts": [ [ "04b525b91b48e31258287a015d0401-C001-45" ], [ "04b525b91b48e31258287a015d0401-C001-70" ], [ 
"04b525b91b48e31258287a015d0401-C001-77", "04b525b91b48e31258287a015d0401-C001-78" ] ], "cite_sentences": [ "04b525b91b48e31258287a015d0401-C001-45", "04b525b91b48e31258287a015d0401-C001-70", "04b525b91b48e31258287a015d0401-C001-77", "04b525b91b48e31258287a015d0401-C001-78" ] }, "@DIF@": { "gold_contexts": [ [ "04b525b91b48e31258287a015d0401-C001-85" ] ], "cite_sentences": [ "04b525b91b48e31258287a015d0401-C001-85" ] } } }, "ABC_5e34591c2a7b1664e1275372c40b79_6": { "x": [ { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-79", "text": "**CONCLUSION AND PERSPECTIVES**" }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-2", "text": "Distant supervision has attracted recent interest for training information extraction systems because it does not require any human annotation but rather employs existing knowledge bases to heuristically label a training corpus." }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-3", "text": "However, previous work has failed to address the problem of false negative training examples mislabeled due to the incompleteness of knowledge bases." }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-4", "text": "To tackle this problem, we propose a simple yet novel framework that combines a passage retrieval model using coarse features into a state-of-the-art relation extractor using multi-instance learning with fine features." }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-5", "text": "We adapt the information retrieval technique of pseudorelevance feedback to expand knowledge bases, assuming entity pairs in top-ranked passages are more likely to express a relation." }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-6", "text": "Our proposed technique significantly improves the quality of distantly supervised relation extraction, boosting recall from 47.7% to 61.2% with a consistently high level of precision of around 93% in the experiments." 
}, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-7", "text": "----------------------------------" }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-9", "text": "A recent approach for training information extraction systems is distant supervision, which exploits existing knowledge bases instead of annotated texts as the source of supervision (Craven and Kumlien, 1999; Mintz et al., 2009; Nguyen and Moschitti, 2011) ." }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-10", "text": "To combat the noisy training data produced by heuristic labeling in distant supervision, researchers (Bunescu and Mooney, 2007; Riedel et al., 2010; Hoffmann et al., 2011; Surdeanu et al., 2012) exploited multi-instance learning models." }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-11", "text": "Only a few studies have directly examined the influence of the quality of the training data and attempted to enhance it (Sun et al., 2011; Wang et al., 2011; Takamatsu et al., 2012) ." }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-12", "text": "However, their methods are handicapped by the built-in assumption that a sentence does not express a relation unless it mentions two entities which participate in the relation in the knowledge base, leading to false negatives." }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-13", "text": "In reality, knowledge bases are often incomplete, giving rise to numerous false negatives in the training data." }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-14", "text": "We sampled 1834 sentences that contain two entities in the New York Times 2006 corpus and manually evaluated whether they express any of a set of 50 common Freebase 1 relations." 
}, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-15", "text": "As shown in Figure 1 , of the 133 (7.3%) sentences that truly express one of these relations, only 32 (1.7%) are covered by Freebase, leaving 101 (5.5%) false negatives." }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-16", "text": "Even for one of the most complete relations in Freebase, Employee-of (with more than 100,000 entity pairs), 6 out of 27 sentences with the pattern 'PERSON executive of ORGANIZATION' contain a fact that is not included in Freebase and are thus mislabeled as negative." }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-17", "text": "These mislabelings dilute the discriminative capability of useful features and confuse the models." }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-18", "text": "In this paper, we will show how reducing this source of noise can significantly improve the performance of distant supervision." }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-19", "text": "In fact, our system corrects the relation labels of the above 6 sentences before training the relation extractor." }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-20", "text": "(1) matches relation instances to sentences and (2) learns a passage retrieval model to (3) provide relevance feedback on sentences; Relevant sentences (4) yield new relation instances which are added to the knowledge base; Finally, instances are again (5) matched to sentences to (6) create training data for relation extraction." }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-77", "text": "The performance improvement by using pseudo-feedback is significant (p < 0.05) in McNemar's test for both datasets." 
}, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-21", "text": "Encouraged by the recent success of simple methods for coreference resolution (Raghunathan et al., 2010) and inspired by pseudo-relevance feedback (Xu and Croft, 1996; Lavrenko and Croft, 2001; Matveeva et al., 2006; Cao et al., 2008) in the field of information retrieval, which expands or reformulates query terms based on the highest ranked documents of an initial query, we propose to increase the quality and quantity of training data generated by distant supervision for information extraction task using pseudo feedback." }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-22", "text": "As shown in Figure 2 , we expand an original knowledge base with possibly missing relation instances with information from the highest ranked sentences returned by a passage retrieval model trained on the same data." }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-23", "text": "We use coarse features for our passage retrieval model to aggressively expand the knowledge base for maximum recall; at the same time, we exploit a multi-instance learning model with fine features for relation extraction to handle the newly introduced false positives and maintain high precision." }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-24", "text": "Similar to iterative bootstrapping techniques (Yangarber, 2001) , this mechanism uses the outputs of the first trained model to expand training data for the second model, but unlike bootstrapping it does not require iteration and avoids the problem of semantic drift." 
}, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-25", "text": "We further note that iterative bootstrapping over a single distant supervision system is difficult, because state-of-the-art systems (Surdeanu et al., 2012; Hoffmann et al., 2011; Riedel et al., 2010; Mintz et al., 2009) , detect only few false negatives in the training data due to their high-precision low-recall features, which were originally proposed by Mintz et al. (2009) ." }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-26", "text": "We present a reliable and novel way to address these issues and achieve significant improvement over the MULTIR system (Hoffmann et al., 2011) , increasing recall from 47.7% to 61.2% at comparable precision." }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-27", "text": "The key to this success is the combination of two different views as in co-training (Blum and Mitchell, 1998) : an information extraction technique with fine features for high precision and an information retrieval technique with coarse features for high recall." }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-28", "text": "Our work is developed in parallel with Min et al. (2013) , who take a very different approach by adding additional latent variables to a multi-instance multi-label model (Surdeanu et al., 2012) to solve this same problem." }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-29", "text": "----------------------------------" }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-30", "text": "**SYSTEM DETAILS**" }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-31", "text": "In this section, we first introduce some formal notations then describe in detail each component of the proposed system in Figure 2 ." 
}, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-32", "text": "----------------------------------" }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-33", "text": "**DEFINITIONS**" }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-34", "text": "A relation instance is an expression r(e1, e2) where r is a binary relation, and e1 and e2 are two entities having such a relation, for example CEO-of(Tim Cook, Apple)." }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-35", "text": "The knowledge-based distant supervised learning problem takes as input (1) \u03a3, a training corpus, (2) E, a set of entities mentioned in that corpus, (3) R, a set of relation names, and (4) \u2206, a set of ground facts of relations in R. To generate our training data, we further assume (5) T , a set of entity types, as well as type signature r(E1, E2) for relations." }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-36", "text": "We define the positive data set P OS(r) to be the set of sentences in which any related pair of entities of relation r (according to the knowledge base) is mentioned." }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-37", "text": "The negative data set RAW (r) is the rest of the training data, which contain two entities of the required types in the knowledge base, e.g. one person and one organization for the CEO-of relation in Freebase." }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-76", "text": "Table 1 shows the overall precision and recall computed against these two test datasets, with and without adding lexical features into multi-instance learning models." }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-38", "text": "Another negative data set with more conservative sense N EG(r) is defined as the set of sentences which contain the primary entity e1 (e.g. person in any CEO-of relation in the knowledge base) and any secondary entity e2 of required type (e.g. 
organization for the CEO-of relation) but the relation does not hold for this pair of entities in the knowledge base." }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-39", "text": "----------------------------------" }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-40", "text": "**DISTANTLY SUPERVISED PASSAGE RETRIEVAL**" }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-41", "text": "We extend learning-to-rank techniques (Liu, 2011) to the distant supervision setting to create a robust passage retrieval system." }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-42", "text": "While relation extraction systems exploit rich and complex features that are necessary to extract the exact relation (Mintz et al., 2009; Riedel et al., 2010; Hoffmann et al., 2011) , passage retrieval components use coarse features in order to provide different and complementary feedback to information extraction models." }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-43", "text": "We exploit two types of lexical features: Bag-of-Words and Word-Position." }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-44", "text": "The two types of simple binary features are shown in the following example:" }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-45", "text": "Sentence: Apple founder Steve Jobs died." }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-46", "text": "For each relation r, we assume each sentence has a binary relevance label to form distantly supervised training data: sentences in P OS(r) are relevant and sentences in N EG(r) are irrelevant." }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-47", "text": "As a pointwise learning-to-rank approach (Nallapati, 2004) , the probabilities of relevance estimated by SVMs (Platt and others, 1999) are used for ranking all the sentences in the original training corpus for each relation respectively."
}, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-48", "text": "We use Lib-SVM 2 (Chang and Lin, 2011) in our implementation." }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-49", "text": "----------------------------------" }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-50", "text": "**PSUEDO-RELEVANCE RELATION FEEDBACK**" }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-51", "text": "In the field of information retrieval, pseudorelevance feedback assumes that the top-ranked documents from an initial retrieval are likely relevant, and extracts relevant terms to expand the original query (Xu and Croft, 1996; Lavrenko and Croft, 2001; Cao et al., 2008) ." }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-52", "text": "Analogously, our assumption is that entity pairs that appear in more relevant and more sentences are more likely to express the relation, and can be used to expand knowledge base and reduce false negative noise in the training data for information extraction." }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-53", "text": "We identify the most likely relevant entity pairs as follows:" }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-54", "text": "initialize \u2206 \u2190\u2212 \u2206 for each relation type r \u2208 R do learn a passage (sentence) retrieval model L(r) using coarse features and P OS(r)\u222aN EG(r) as training data score the sentences in the RAW (r) by L(r) score the entity pairs according to the scores of sentences they are involved in select the top ranked pairs of entities, then add the relation r to their label in \u2206 end for" }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-55", "text": "We select the entity pairs whose average score of the sentences they are involved in is greater than p, where p is a parameter tuned on development data." 
}, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-56", "text": "3 The relation extraction model is then trained using (\u03a3, E, R, \u2206 ) with a more complete database than the original knowledge base \u2206." }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-57", "text": "----------------------------------" }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-58", "text": "**DISTANTLY SUPERVISED RELATION EXTRACTION**" }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-59", "text": "We use a state-of-the-art open-source system, MULTIR (Hoffmann et al., 2011) , as the relation extraction component." }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-60", "text": "MULTIR is based on multi-instance learning, which assumes that at least one sentence of those matching a given entity-pair contains the relation of interest (Riedel et al., 2010) in the given knowledge base to tolerate false positive noise in the training data and superior than previous models (Riedel et al., 2010; Mintz et al., 2009 ) by allowing overlapping relations." }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-61", "text": "MULTIR uses features which are based on Mintz et al. (2009) and consist of conjunctions of named entity tags, syntactic dependency paths between arguments, and lexical information." }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-62", "text": "----------------------------------" }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-63", "text": "**EXPERIMENTS**" }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-64", "text": "For evaluating extraction accuracy, we follow the experimental setup of Hoffmann et al. (2011) , and use their implementation of MULTIR 4 with 50 training iterations as our baseline." }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-65", "text": "Our complete system, which we call IRMIE, combines our passage retrieval component with MULTIR." 
}, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-66", "text": "We use the same datasets as in Hoffmann et al. (2011) and Riedel et al. (2010) , which include 3-years of New York Times articles aligned with Freebase." }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-67", "text": "The sentential extraction evaluation is performed on a small amount of manually annotated sentences, sampled from the union of matched sentences and Table 1 : Overall sentential extraction performance evaluated on the original test set of Hoffmann et al. (2011) and our corrected test set: Our proposed relevance feedback technique yields a substantial increase in recall." }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-68", "text": "system predictions." }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-69", "text": "We define S e as the sentences where some system extracted a relation and S F as the sentences that match the arguments of a fact in \u2206. The sentential precision and recall is computed on a randomly sampled set of sentences from S e \u222a S F , in which each sentence is manually labeled whether it expresses any relation in R. Figure 3 shows the precision/recall curves for MULTIR with and without pseudo-relevance feedback computed on the test dataset of 1000 sentence used by Hoffmann et al. (2011) ." }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-70", "text": "With the pseudo-relevance feedback from passage retrieval, IRMIE achieves significantly higher recall at a consistently high level of precision." }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-71", "text": "At the highest recall point, IRMIE reaches 78.5% precision and 59.2% recall, for an F1 score of 68.9%." 
}, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-72", "text": "Because the two types of lexical features used in our passage retrieval models are not used in MUL-TIR, we created another baseline MULTIRLEX by adding these features into MULTIR in order to rule out the improvement from additional information." }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-73", "text": "Note that the sentences are sampled from the union of Freebase matches and sentences from which some systems in Hoffmann et al. (2011) extracted a relation." }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-74", "text": "It underestimates the improvements of the newly developed systems in this paper." }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-75", "text": "We therefore also created a new test set of 1000 sentences by sampling from the union of Freebase matches and sentences where MULTIR-LEX or IRMIELEX extracted a relation." }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-78", "text": "----------------------------------" }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-80", "text": "This paper proposes a novel approach to address an overlooked problem in distant supervision: the knowledge base is often incomplete causing nu- merous false negatives in the training data." }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-81", "text": "It greatly improves a state-of-the-art multi-instance learning model by correcting the most likely false negatives in the training data based on the ranking of a passage retrieval model." }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-82", "text": "In the future, we would like to more tightly integrate a coarser featured estimator of sentential relevance and a finer featured relation extractor, such that a single joint-model can be learned." 
}, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-83", "text": "----------------------------------" }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-84", "text": "**ACKNOWLEDGMENTS**" }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-85", "text": "Supported in part by NSF grant IIS-1018317, the Air Force Research Laboratory (AFRL) under prime contract number FA8750-09-C-0181 and the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center contract number D11PC20154." }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-86", "text": "The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon." }, { "sent_id": "5e34591c2a7b1664e1275372c40b79-C001-87", "text": "Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of AFRL, IARPA, DoI/NBC, or the U.S. Government." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "5e34591c2a7b1664e1275372c40b79-C001-10" ], [ "5e34591c2a7b1664e1275372c40b79-C001-25" ], [ "5e34591c2a7b1664e1275372c40b79-C001-42" ], [ "5e34591c2a7b1664e1275372c40b79-C001-60", "5e34591c2a7b1664e1275372c40b79-C001-61" ] ], "cite_sentences": [ "5e34591c2a7b1664e1275372c40b79-C001-10", "5e34591c2a7b1664e1275372c40b79-C001-25", "5e34591c2a7b1664e1275372c40b79-C001-42", "5e34591c2a7b1664e1275372c40b79-C001-60", "5e34591c2a7b1664e1275372c40b79-C001-61" ] }, "@MOT@": { "gold_contexts": [ [ "5e34591c2a7b1664e1275372c40b79-C001-10", "5e34591c2a7b1664e1275372c40b79-C001-11", "5e34591c2a7b1664e1275372c40b79-C001-12" ], [ "5e34591c2a7b1664e1275372c40b79-C001-25" ] ], "cite_sentences": [ "5e34591c2a7b1664e1275372c40b79-C001-10", "5e34591c2a7b1664e1275372c40b79-C001-25" ] }, "@EXT@": { "gold_contexts": [ [ "5e34591c2a7b1664e1275372c40b79-C001-25", "5e34591c2a7b1664e1275372c40b79-C001-26" ], [ "5e34591c2a7b1664e1275372c40b79-C001-41", "5e34591c2a7b1664e1275372c40b79-C001-42" ], [ "5e34591c2a7b1664e1275372c40b79-C001-72" ] ], "cite_sentences": [ "5e34591c2a7b1664e1275372c40b79-C001-25", "5e34591c2a7b1664e1275372c40b79-C001-26", "5e34591c2a7b1664e1275372c40b79-C001-42", "5e34591c2a7b1664e1275372c40b79-C001-72" ] }, "@USE@": { "gold_contexts": [ [ "5e34591c2a7b1664e1275372c40b79-C001-59" ], [ "5e34591c2a7b1664e1275372c40b79-C001-64" ], [ "5e34591c2a7b1664e1275372c40b79-C001-65" ], [ "5e34591c2a7b1664e1275372c40b79-C001-66" ], [ "5e34591c2a7b1664e1275372c40b79-C001-67" ], [ "5e34591c2a7b1664e1275372c40b79-C001-69" ], [ "5e34591c2a7b1664e1275372c40b79-C001-73" ] ], "cite_sentences": [ "5e34591c2a7b1664e1275372c40b79-C001-59", "5e34591c2a7b1664e1275372c40b79-C001-64", "5e34591c2a7b1664e1275372c40b79-C001-65", "5e34591c2a7b1664e1275372c40b79-C001-66", "5e34591c2a7b1664e1275372c40b79-C001-67", "5e34591c2a7b1664e1275372c40b79-C001-69", "5e34591c2a7b1664e1275372c40b79-C001-73" ] } } }, "ABC_7ce85e3c3f58cee33015409b74f99e_6": { "x": [ { 
"sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-2", "text": "In Distributional Semantic Models (DSMs), Vector Cosine is widely used to estimate similarity between word vectors, although this measure was noticed to suffer from several shortcomings." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-3", "text": "The recent literature has proposed other methods which attempt to mitigate such biases." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-4", "text": "In this paper, we intend to investigate APSyn, a measure that computes the extent of the intersection between the most associated contexts of two target words, weighting it by context relevance." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-5", "text": "We evaluated this metric in a similarity estimation task on several popular test sets, and our results show that APSyn is in fact highly competitive, even with respect to the results reported in the literature for word embeddings." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-6", "text": "On top of it, APSyn addresses some of the weaknesses of Vector Cosine, performing well also on genuine similarity estimation." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-7", "text": "----------------------------------" }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-9", "text": "Word similarity is one of the most important and most studied problems in Natural Language Processing (NLP), as it is fundamental for a wide range of tasks, such as Word Sense Disambiguation (WSD), Information Extraction (IE), Paraphrase Generation (PG), as well as the automatic creation of semantic resources." 
}, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-10", "text": "Most of the current approaches to word similarity estimation rely on some version of the Distributional Hypothesis (DH), which claims that words occurring in the same contexts tend to have similar meanings (Harris, 1954; Firth, 1957; Sahlgren, 2008) ." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-11", "text": "Such hypothesis provides the theoretical ground for Distributional Semantic Models (DSMs), which represent word meaning by means of high-dimensional vectors encoding corpus-extracted co-occurrences between targets and their linguistic contexts (Turney and Pantel, 2010) ." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-12", "text": "Traditional DSMs initialize vectors with cooccurrence frequencies." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-13", "text": "Statistical measures, such as Positive Pointwise Mutual Information (PPMI) or its variants (Church and Hanks, 1990; Bullinaria and Levy, 2012; Levy et al., 2015) , have been adopted to normalize these values." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-14", "text": "Also, these models have exploited the power of dimensionality reduction techniques, such as Singular Value Decomposition (SVD; Landauer and Dumais, 1997) and Random Indexing (Sahlgren, 2005) ." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-15", "text": "These first-generation models are currently referred to as count-based, as distinguished from the contextpredicting ones, which have been recently proposed in the literature (Bengio et al., 2006; Collobert and Weston, 2008; Turian et al., 2010; Huang et al., 2012; Mikolov et al., 2013) ." 
}, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-16", "text": "More commonly known as word embeddings, these secondgeneration models learn meaning representations through neural network training: the vectors dimensions are set to maximize the probability for the contexts that typically occur with the target word." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-17", "text": "Vector Cosine is generally adopted by both types of models as a similarity measure." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-18", "text": "However, this metric has been found to suffer from several problems (Li and Han, 2013; Faruqui et al., 2016) , such as a bias towards features with higher values and the inability of considering how many features are actually shared by the vectors." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-19", "text": "Finally, Cosine is affected by the hubness effect Schn-abel et al., 2015) , i.e. the fact that words with high frequency tend to be universal neighbours." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-20", "text": "Even though other measures have been proposed in the literature (Deza and Deza, 2009 ), Vector Cosine is still by far the most popular one (Turney and Pantel, 2010) ." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-21", "text": "However, in a recent paper of Santus et al. (2016b) , the authors have claimed that Vector Cosine is outperformed by APSyn (Average Precision for Synonymy), a metric based on the extent of the intersection between the most salient contexts of two target words." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-22", "text": "The measure, tested on a window-based DSM, outperformed Vector Cosine on the ESL and on the TOEFL datasets." 
}, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-23", "text": "In the present work, we perform a systematic evaluation of APSyn, testing it on the most popular test sets for similarity estimation, namely WordSim-353 (Finkelstein et al., 2001), MEN (Bruni et al., 2014) and SimLex-999 (Hill et al., 2015)." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-24", "text": "For comparison, Vector Cosine is also calculated on several count-based DSMs." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-25", "text": "We implement a total of twenty-eight models with different parameter settings, each of which differs according to corpus size, context window width, weighting scheme and SVD application." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-26", "text": "The new metric is shown to outperform Vector Cosine in most settings, except when the latter metric is applied on a PPMI-SVD reduced matrix (Bullinaria and Levy, 2012), against which APSyn still obtains competitive performances." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-27", "text": "The results are also discussed in relation to the state-of-the-art DSMs, as reported in Hill et al. (2015)." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-28", "text": "In such a comparison, the best settings of our models outperform the word embeddings in almost all datasets." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-29", "text": "A pilot study was also carried out to investigate whether APSyn is scalable." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-30", "text": "Results prove its high performance also when calculated on large corpora, such as those used by ." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-31", "text": "On top of its performance, APSyn seems not to be subject to some of the biases that affect Vector Cosine." 
}, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-32", "text": "Finally, considering the debate about the ability of DSMs to calculate genuine similarity as opposed to word relatedness (Turney, 2001; Agirre et al., 2009; Hill et al., 2015), we test the ability of the models to quantify genuine semantic similarity." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-33", "text": "----------------------------------" }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-34", "text": "**BACKGROUND**" }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-35", "text": "----------------------------------" }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-36", "text": "**DSMS, MEASURES OF ASSOCIATION AND DIMENSIONALITY REDUCTION**" }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-37", "text": "Count-based DSMs are built in an unsupervised way." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-38", "text": "Starting from large preprocessed corpora, a matrix M (m\u00d7n) is built, in which each row is a vector representing a target word in a vocabulary of size m, and each column is one of the n potential contexts (Turney and Pantel, 2010; Levy et al., 2015)." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-39", "text": "The vector dimensions are counters recording how many times the contexts co-occur with the target words." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-40", "text": "Since raw frequency is highly skewed, most DSMs have adopted more sophisticated association measures, such as Positive PMI (PPMI; Church and Hanks, 1990; Bullinaria and Levy, 2012; Levy et al., 2015) and Local Mutual Information (LMI; Evert, 2005)." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-41", "text": "PPMI compares the observed joint probability of co-occurrence of w and c with their probability of co-occurrence assuming statistical independence." 
}, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-42", "text": "It is defined as:" }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-43", "text": "PMI(w, c) = log [ P(w, c) / ( P(w) P(c) ) ] = log [ ( |w, c| \u00b7 |D| ) / ( |w| \u00b7 |c| ) ] (2), where w is the target word, c is the given context, P(w, c) is the probability of co-occurrence, and D is the collection of observed word-context pairs; PPMI replaces negative PMI values with zero." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-44", "text": "Unlike frequency, PPMI was found to have a bias towards rare events." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-45", "text": "LMI can therefore be used to reduce such bias: it consists of multiplying the PPMI of the pair by its co-occurrence frequency." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-46", "text": "Since target words may occur in hundreds of thousands of contexts, most of which are not informative, methods for dimensionality reduction have been investigated, such as truncated SVD (Deerwester et al., 1990; Landauer and Dumais, 1997; Turney and Pantel, 2010; Levy et al., 2015)." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-47", "text": "SVD has been regarded as a method for noise reduction and for the discovery of latent dimensions of meaning, and it has been shown to improve similarity measurements when combined with PPMI (Bullinaria and Levy, 2012; Levy et al., 2015)." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-48", "text": "As we will see in the next section, APSyn applies another type of reduction, which consists of selecting only the top-ranked contexts in a relevance-sorted context list for each word vector." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-49", "text": "Such reduction complies with the principle of cognitive economy (i.e. only the most relevant contexts are elaborated; see Finton, 2002) and with the results of behavioural studies, which supported feature saliency (Smith et al., 1974)." 
}, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-50", "text": "Since APSyn was defined for linguistic contexts (Santus et al., 2016b), we did not test it on SVD-reduced spaces, leaving such a test to further studies." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-51", "text": "----------------------------------" }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-52", "text": "**SIMILARITY MEASURES**" }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-53", "text": "Vector Cosine, by far the most common distributional similarity metric (Turney and Pantel, 2010; Landauer and Dumais, 1997; Jarmasz and Szpakowicz, 2004; Mikolov et al., 2013; Levy et al., 2015), looks at the normalized correlation between the dimensions of two word vectors, w1 and w2, and returns a score between -1 and 1." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-54", "text": "It is described by the following equation:" }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-55", "text": "cos(w1, w2) = \u03a3_i ( f_i^{w1} \u00b7 f_i^{w2} ) / ( \u221a(\u03a3_i (f_i^{w1})^2) \u00b7 \u221a(\u03a3_i (f_i^{w2})^2) ), where f_i^x is the i-th dimension in the vector x. Despite its extensive usage, Vector Cosine has been recently criticized for its hypersensitivity to features with high values and for its inability to identify the actual feature intersection (Li and Han, 2013; Schnabel et al., 2015)." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-56", "text": "Recalling an example by Li and Han (2013), the Vector Cosine for the toy-vectors a = [1, 2, 0] and b = [0, 1, 0] (i.e. 0.8944) is unexpectedly higher than the one for a and c = [2, 1, 0] (i.e. 0.8000), and even higher than the one for the toy-vectors a and d = [1, 2, 1] (i.e. 0.6325), which instead share a larger feature intersection." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-57", "text": "Since the Vector Cosine is a distance measure, it is also subject to the hubness problem, which was shown by Radovanovic et al. (2010) to be an inherent property of data distributions in high-dimensional vector space." 
}, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-58", "text": "The problem consists in the fact that vectors with high frequency tend to get high scores with a large number of other vectors, thus becoming universal nearest neighbours (Schnabel et al., 2015; Faruqui et al., 2016)." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-59", "text": "Another measure of word similarity, named APSyn (scripts and information can be found at https://github.com/esantus/APSyn), has been recently introduced in Santus et al. (2016a) and Santus et al. (2016b), and it was shown to outperform the Vector Cosine on the TOEFL (Landauer and Dumais, 1997) and on the ESL (Turney, 2001) test sets." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-60", "text": "This measure is based on the hypothesis that words carrying similar meanings share their most relevant contexts in higher proportion compared to less similar words." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-61", "text": "The authors define APSyn as the extent of the weighted intersection between the top most salient contexts of the target words, weighting it by the average rank of the intersected features in the PPMI-sorted context lists of the target words: APSyn(w1, w2) = \u03a3_{f \u2208 N(f1) \u2229 N(f2)} 1 / ( ( rank1(f) + rank2(f) ) / 2 )" }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-62", "text": "meaning: for every feature f included in the intersection between the top N features of w1 and the top N features of w2 (i.e. N(f1) and N(f2)), add 1 divided by the average rank of the feature in the PPMI-ranked features of w1 (i.e. rank1) and w2 (i.e. rank2)." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-63", "text": "According to the authors, N is a parameter, generally ranging between 100 and 1000." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-64", "text": "Results are shown to be relatively stable when N varies in this range, while they become worse if larger values of N are used, as low-informative features are also introduced." 
}, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-65", "text": "Santus et al. (2016a) have also used LMI instead of PPMI as the weighting function, achieving lower results." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-66", "text": "With respect to the limitations mentioned above for the Vector Cosine, APSyn has some advantages." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-67", "text": "First of all, it is by definition able to identify the extent of the intersection." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-68", "text": "Second, its sensitivity to features with high values can be kept under control by tuning the value of N. On top of it, feature values (i.e. their weights) do not directly affect the similarity score, as they are only used to build the feature rank." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-69", "text": "With reference to the toy-vectors presented above, APSyn would in fact assign completely different scores." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-70", "text": "The highest score would be assigned to a and d, as they share two relevant features out of three." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-71", "text": "The second highest score would be assigned to a and c, for the same reason as above." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-72", "text": "The lowest score would instead be assigned to a and b, as they only share one non-salient feature." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-73", "text": "In section 3.4, we briefly discuss the hubness problem." 
}, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-74", "text": "----------------------------------" }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-75", "text": "**DATASETS**" }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-76", "text": "For our evaluation, we used three widely popular datasets: WordSim-353 (Finkelstein et al., 2001), MEN (Bruni et al., 2014), SimLex-999 (Hill et al., 2015)." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-77", "text": "These datasets have a different history, but all of them consist of word pairs with an associated score, which should represent either word association or word similarity." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-78", "text": "WordSim-353 (Finkelstein et al., 2001) was proposed as a word similarity dataset containing 353 pairs annotated with scores between 0 and 10." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-79", "text": "However, Hill et al. (2015) claimed that the instructions to the annotators were ambiguous with respect to similarity and association, so that the subjects assigned high similarity scores to entities that are only related by virtue of frequent association (e.g. coffee and cup; movie and theater)." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-80", "text": "Moreover, WordSim-353 does not provide the POS-tags for the 439 words that it contains, forcing users to decide which POS to assign to the ambiguous words (e.g. [white, rabbit] and [run, marathon])." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-81", "text": "An extension of this dataset resulted from the subclassification carried out by Agirre et al. (2009), which discriminated between similar and associated word pairs." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-82", "text": "Such discrimination was done by asking annotators to classify all pairs according to the semantic relation they hold (i.e. identical, synonymy, antonymy, hypernymy, meronymy and none-of-the-above)." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-83", "text": "The annotation was then used to group the pairs in three categories: similar pairs (those classified as identical, synonyms, antonyms and hypernyms), associated pairs (those classified as meronyms and none-of-the-above, with an average similarity greater than 5), and non-associated pairs (those classified as none-of-the-above, with an average similarity below or equal to 5)." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-84", "text": "Two gold standards were finally produced: i) one for similarity, containing 203 word pairs resulting from the union of similar and non-associated pairs; ii) one for relatedness, containing 252 word pairs resulting from the union of associated and non-associated pairs." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-85", "text": "Even though such a classification made a clear distinction between the two types of relations (i.e. similarity and association), Hill et al. (2015) argue that these gold standards still carry the scores they had in WordSim-353, which are known to be ambiguous in this regard." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-86", "text": "The MEN Test Collection (Bruni et al., 2014) includes 3,000 word pairs divided in two sets (one for training and one for testing) together with human judgments, obtained through Amazon Mechanical Turk." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-87", "text": "The construction was performed by asking subjects to rate which pair, among two of them, was the more related one (i.e. the most associated)." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-88", "text": "Every couple of pairs was proposed only once, and a final score out of 50 was attributed to each pair, according to how many times it was rated as the most related." 
}, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-89", "text": "According to Hill et al. (2015), the major weakness of this dataset is that it does not encode word similarity, but a more general notion of association." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-90", "text": "SimLex-999 is the dataset introduced by Hill et al. (2015) to address the above-mentioned criticism of confusion between similarity and association." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-91", "text": "The dataset consists of 999 pairs containing 1,028 words, which were also evaluated in terms of POS-tags and concreteness." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-92", "text": "The pairs were annotated with a score between 0 and 10, and the instructions strictly required the identification of word similarity, rather than word association." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-93", "text": "Hill et al. (2015) claim that, differently from other datasets, the SimLex-999 inter-annotator agreement has not been surpassed by any automatic approach." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-94", "text": "----------------------------------" }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-95", "text": "**STATE OF THE ART VECTOR SPACE MODELS**" }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-96", "text": "In order to compare our results with state-of-the-art DSMs, we report the scores for the Vector Cosine calculated on the neural language models (NLM) by Hill et al. (2015), who used the code (or directly the embeddings) shared by the original authors." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-97", "text": "As we trained our models on almost the same corpora used by Hill and colleagues, the results are perfectly comparable." 
}, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-98", "text": "The three models we compare our results to are: i) the convolutional neural network of Collobert and Weston (2008), which was trained on 852 million words of Wikipedia; ii) the neural network of Huang et al. (2012), which was trained on 990 million words of Wikipedia; and iii) the word2vec of Mikolov et al. (2013), which was trained on 1000 million words of Wikipedia and on the RCV Vol." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-99", "text": "1 Corpus (Lewis et al., 2004); the scores for these models are those reported in Hill et al. (2015)." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-100", "text": "----------------------------------" }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-101", "text": "**EXPERIMENTS**" }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-102", "text": "In this section, we describe our experiments, starting from the training corpora (Section 3.1), to move to the implementation of twenty-eight DSMs (Section 3.2), following with the application and evaluation of the measures (Section 3.3), up to the performance analysis (Section 3.4) and the scalability test (Section 3.5)." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-103", "text": "----------------------------------" }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-104", "text": "**CORPORA AND PREPROCESSING**" }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-105", "text": "We used two different corpora for our experiments: RCV vol." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-106", "text": "1 (Lewis et al., 2004) and the Wikipedia corpus (Baroni et al., 2009), respectively containing 150 and 820 million words." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-107", "text": "The RCV Vol." 
}, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-108", "text": "1 and Wikipedia were automatically tagged, respectively, with the POS tagger described in Dell'Orletta (2009) and with the TreeTagger (Schmid, 1994)." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-109", "text": "----------------------------------" }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-110", "text": "**DSMS**" }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-111", "text": "For our experiments, we implemented twenty-eight DSMs, but for reasons of space only sixteen of them are reported in the tables." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-112", "text": "All of them include the POS-tagged target words used in the three datasets (i.e. MEN, WordSim-353 and SimLex-999) and the POS-tagged contexts having frequency above 100 in the two corpora." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-113", "text": "We considered as contexts the content words (i.e. nouns, verbs and adjectives) within a window of 2, 3 and 5, although the latter was abandoned due to its poor performance." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-114", "text": "As for SVD factorization, we found out that the best results were always achieved when the number of latent dimensions was between 300 and 500." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-115", "text": "We report here only the scores for k = 300, since 300 is one of the most common choices for the dimensionality of SVD-reduced spaces and it is always close to an optimal value for the parameter." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-116", "text": "Fourteen out of twenty-eight models were developed for RCV1, while the others were developed for Wikipedia." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-117", "text": "For each corpus, the models differed according to the window size (i.e. 2 and 3), to the statistical association measure used as a weighting scheme (i.e. none, PPMI and LMI) and to the application of SVD to the previous combinations." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-118", "text": "----------------------------------" }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-119", "text": "**MEASURING WORD SIMILARITY AND RELATEDNESS**" }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-120", "text": "Given the twenty-eight DSMs, for each dataset we have measured the Vector Cosine and APSyn between the words in the test pairs." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-121", "text": "Table 2 : Spearman correlation scores for our eight models trained on Wikipedia, in the three datasets Simlex-999, WordSim-353 and MEN." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-122", "text": "At the bottom are the performances of the state-of-the-art models of Collobert and Weston (2008), Huang et al. (2012) and Mikolov et al. (2013), as reported in Hill et al. (2015)." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-123", "text": "The Spearman correlation between our scores and the gold standard was then computed for every model and it is reported in Table 1 and Table 2 ." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-124", "text": "In particular, Table 1 describes the performances on SimLex-999, WordSim-353 and MEN for the measures applied on RCV Vol." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-125", "text": "1 models." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-126", "text": "Table 2 , instead, describes the performances of the measures on the three datasets for the Wikipedia models." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-127", "text": "Concurrently, Table 3 and Table 4 describe the performances of the measures respectively on the RCV Vol." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-128", "text": "1 and Wikipedia models, tested on the subsets of WordSim-353 extracted by Agirre et al. (2009)." 
}, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-129", "text": "Table 1 shows the Spearman correlation scores for Vector Cosine and APSyn on the three datasets for the eight most representative DSMs built using RCV Vol. 1; Table 2 does the same for the DSMs built using Wikipedia." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-130", "text": "For the sake of comparison, we also report the results of the state-of-the-art DSMs mentioned in Hill et al. (2015) (see Section 2.5)." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-131", "text": "----------------------------------" }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-132", "text": "**PERFORMANCE ANALYSIS**" }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-133", "text": "A glance at the tables shows that the measures perform particularly well in two models: i) APSyn, when applied on the PPMI-weighted DSM (henceforth, APSynPPMI); ii) Vector Cosine, when applied on the SVD-reduced PPMI-weighted matrix (henceforth, CosSVDPPMI)." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-134", "text": "These two models perform consistently and in a comparable way across the datasets, generally outperforming the state-of-the-art DSMs, with an exception for the Wikipedia-trained models in WordSim-353." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-135", "text": "Some further observations are: i) corpus size strongly affects the results; ii) PPMI strongly outperforms LMI for both Vector Cosine and APSyn; iii) SVD boosts the Vector Cosine, especially when it is combined with PPMI; iv) N has some impact on the performance of APSyn, which generally achieves the best results for N=500." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-136", "text": "As a note about iii), the results of using SVD jointly with LMI spaces are less predictable than when combining it with PPMI." 
}, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-137", "text": "Also, we can notice that the smaller window (i.e. 2) does not always perform better than the larger one (i.e. 3)." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-138", "text": "The former appears to perform better on SimLex-999, while the latter seems to have some advantages on the other datasets." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-139", "text": "This might depend on the different type of similarity encoded in SimLex-999 (i.e. genuine similarity). (Table 3: Spearman correlation scores for our eight models trained on RCV1, in the two subsets of WordSim-353.)" }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-140", "text": "On top of it, despite Hill et al.'s (2015) claim that no evidence supports the hypothesis that smaller context windows improve the ability of models to capture similarity (Agirre et al., 2009; Kiela and Clark, 2014), we need to mention that window 5 was abandoned because of its low performance." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-141", "text": "With reference to the hubness effect, we have conducted a pilot study inspired by the one carried out by Schnabel et al. (2015), using the words of the SimLex-999 dataset as query words and collecting for each of them the top 1000 nearest neighbors." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-142", "text": "Given all the neighbors at rank r, we have checked their rank in the frequency list extracted from our corpora." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-143", "text": "Figure 1 shows the relation between the rank in the nearest neighbor list and the rank in the frequency list." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-144", "text": "It can be easily noticed that the highest-ranked nearest neighbors tend to have a higher rank also in the frequency list, supporting the idea that frequent words are more likely to be nearest neighbors." 
}, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-145", "text": "APSyn does not seem to be able to overcome such bias, which seems to be in fact an inherent property of the DSMs (Radovanovic et al., 2010)." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-146", "text": "Further investigation is needed to see whether variations of APSyn can tackle this problem." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-147", "text": "Finally, a few words are needed with regard to the ability of calculating genuine similarity, as distinguished from word relatedness (Turney, 2001; Agirre et al., 2009; Hill et al., 2015)." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-148", "text": "Table 3 and Table 4 show the Spearman correlation scores for the two measures calculated on the models respectively trained on RCV1 and Wikipedia, tested on the subsets of WordSim-353 extracted by Agirre et al. (2009)." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-149", "text": "It can be easily noticed that our best models work better on the similarity subset." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-150", "text": "In particular, APSynPPMI performs about 20-30% better on the similarity subset than on the relatedness one (see Table 3 ), as do both APSynPPMI and CosSVDPPMI on Wikipedia (see Table 4 )." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-151", "text": "Table 4 : Spearman correlation results for our eight models trained on Wikipedia, in the subsets of WordSim-353." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-152", "text": "----------------------------------" }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-153", "text": "**SCALABILITY**" }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-154", "text": "In order to evaluate the scalability of APSyn, we have performed a pilot test on WordSim-353 and MEN with the same corpus used by , which consists of about 2.8B words (i.e. about 3 times Wikipedia and almost 20 times RCV1)." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-155", "text": "The best scores were obtained with APSyn, N=1000, on a 2-window PPMI-weighted DSM." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-156", "text": "In such a setting, we obtain a Spearman correlation of 0.72 on WordSim and 0.77 on MEN." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-157", "text": "These results are much higher than those reported by for the count-based models (i.e. 0.62 on WordSim and 0.72 on MEN) and slightly lower than those reported for the predicting ones (i.e. 0.75 on WordSim and 0.80 on MEN)." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-158", "text": "----------------------------------" }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-159", "text": "**CONCLUSIONS**" }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-160", "text": "In this paper, we have presented the first systematic evaluation of APSyn, comparing it to Vector Cosine in the task of word similarity identification." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-161", "text": "We developed twenty-eight count-based DSMs, each implementing different hyperparameters." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-162", "text": "PPMI emerged as the most efficient association measure: it works particularly well with Vector Cosine, when combined with SVD, and it boosts APSyn." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-163", "text": "APSyn showed extremely promising results, despite its conceptual simplicity." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-164", "text": "It outperforms the Vector Cosine in almost all settings, except when the latter is used on a PPMI-weighted SVD-reduced DSM." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-165", "text": "Even in this case, anyway, its performance is very competitive." 
}, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-166", "text": "Interestingly, our best models achieve results that are comparable to - or even better than - those reported by Hill et al. (2015) for the state-of-the-art word embedding models." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-167", "text": "In Section 3.5 we show that APSyn is scalable, outperforming the state-of-the-art count-based models reported in ." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-168", "text": "On top of it, APSyn does not suffer from some of the problems reported for the Vector Cosine, such as the inability to identify the number of shared features." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-169", "text": "It still, however, seems to be affected by the hubness issue, and more research should be carried out to tackle it." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-170", "text": "Concerning the discrimination between similarity and association, the good performance of APSyn on SimLex-999 (which was built with a specific attention to genuine similarity) and the large difference in performance between the two subsets of WordSim-353 described in Table 3 and Table 4 make us conclude that APSyn is indeed efficient in quantifying genuine similarity." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-171", "text": "To conclude, being a linguistically and cognitively grounded metric, APSyn offers the possibility for further improvements, by simply combining it with other properties that were not yet considered in its definition." }, { "sent_id": "7ce85e3c3f58cee33015409b74f99e-C001-172", "text": "A natural extension would be to verify whether the APSyn hypothesis and implementation hold on SVD-reduced matrices and word embeddings." 
} ], "y": { "@USE@": { "gold_contexts": [ [ "7ce85e3c3f58cee33015409b74f99e-C001-23" ], [ "7ce85e3c3f58cee33015409b74f99e-C001-27" ], [ "7ce85e3c3f58cee33015409b74f99e-C001-76" ], [ "7ce85e3c3f58cee33015409b74f99e-C001-96" ], [ "7ce85e3c3f58cee33015409b74f99e-C001-111", "7ce85e3c3f58cee33015409b74f99e-C001-112" ], [ "7ce85e3c3f58cee33015409b74f99e-C001-121" ], [ "7ce85e3c3f58cee33015409b74f99e-C001-122" ], [ "7ce85e3c3f58cee33015409b74f99e-C001-124" ], [ "7ce85e3c3f58cee33015409b74f99e-C001-130" ], [ "7ce85e3c3f58cee33015409b74f99e-C001-139" ], [ "7ce85e3c3f58cee33015409b74f99e-C001-141" ], [ "7ce85e3c3f58cee33015409b74f99e-C001-147" ], [ "7ce85e3c3f58cee33015409b74f99e-C001-170" ] ], "cite_sentences": [ "7ce85e3c3f58cee33015409b74f99e-C001-23", "7ce85e3c3f58cee33015409b74f99e-C001-27", "7ce85e3c3f58cee33015409b74f99e-C001-76", "7ce85e3c3f58cee33015409b74f99e-C001-96", "7ce85e3c3f58cee33015409b74f99e-C001-112", "7ce85e3c3f58cee33015409b74f99e-C001-121", "7ce85e3c3f58cee33015409b74f99e-C001-122", "7ce85e3c3f58cee33015409b74f99e-C001-124", "7ce85e3c3f58cee33015409b74f99e-C001-130", "7ce85e3c3f58cee33015409b74f99e-C001-139", "7ce85e3c3f58cee33015409b74f99e-C001-141", "7ce85e3c3f58cee33015409b74f99e-C001-147", "7ce85e3c3f58cee33015409b74f99e-C001-170" ] }, "@BACK@": { "gold_contexts": [ [ "7ce85e3c3f58cee33015409b74f99e-C001-32" ], [ "7ce85e3c3f58cee33015409b74f99e-C001-79" ], [ "7ce85e3c3f58cee33015409b74f99e-C001-85" ], [ "7ce85e3c3f58cee33015409b74f99e-C001-89" ], [ "7ce85e3c3f58cee33015409b74f99e-C001-90" ], [ "7ce85e3c3f58cee33015409b74f99e-C001-93" ] ], "cite_sentences": [ "7ce85e3c3f58cee33015409b74f99e-C001-32", "7ce85e3c3f58cee33015409b74f99e-C001-79", "7ce85e3c3f58cee33015409b74f99e-C001-85", "7ce85e3c3f58cee33015409b74f99e-C001-89", "7ce85e3c3f58cee33015409b74f99e-C001-90", "7ce85e3c3f58cee33015409b74f99e-C001-93" ] }, "@SIM@": { "gold_contexts": [ [ "7ce85e3c3f58cee33015409b74f99e-C001-97" ], [ "7ce85e3c3f58cee33015409b74f99e-C001-166" ] ], 
"cite_sentences": [ "7ce85e3c3f58cee33015409b74f99e-C001-97", "7ce85e3c3f58cee33015409b74f99e-C001-166" ] }, "@DIF@": { "gold_contexts": [ [ "7ce85e3c3f58cee33015409b74f99e-C001-136", "7ce85e3c3f58cee33015409b74f99e-C001-137", "7ce85e3c3f58cee33015409b74f99e-C001-138" ], [ "7ce85e3c3f58cee33015409b74f99e-C001-140" ] ], "cite_sentences": [ "7ce85e3c3f58cee33015409b74f99e-C001-138", "7ce85e3c3f58cee33015409b74f99e-C001-140" ] } } }, "ABC_14529822630fb469f5fc8f37aaf473_6": { "x": [ { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-80", "text": "(1)" }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-2", "text": "Non-compositional phrases such as red herring and weakly compositional phrases such as spelling bee are an integral part of natural language (Sag et al., 2002) ." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-3", "text": "They are also phrases that are difficult, or even impossible, for good compositional distributional models of semantics to handle." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-4", "text": "Compositionality detection therefore provides a good testbed for compositional methods." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-5", "text": "We compare an integrated compositional distributional approach, using sparse high-dimensional representations, with the ad-hoc compositional approach of applying simple composition operations to state-of-the-art neural embeddings." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-6", "text": "----------------------------------" }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-8", "text": "One current focus within the field of distributional semantics is enabling systems to make inferences about phrase-level or sentence-level similarity." 
}, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-9", "text": "One popular approach (Mitchell and Lapata, 2010 ) is to build phrase or sentence-level representations by composing word-level representations and then measuring similarity directly." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-10", "text": "Success is usually measured in terms of correlation with human similarity judgments." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-11", "text": "However, evaluating measures of phrase-level similarity directly against human judgments of similarity ignores the problem that it is not always possible to determine meaning in a compositional manner." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-12", "text": "If we compose the meaning representations for red and herring, we might expect to get a very different representation from the one which could be directly inferred from corpus observations of the phrase red herring." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-13", "text": "Thus any judgements of the similarity of two composed phrases may be confounded by the degree to which those phrases are compositional." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-14", "text": "In this paper, we use a compound noun compositionality dataset (Reddy et al., 2011) to investigate the extent to which the underlying definition of context has an effect on a model's ability to support composition." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-15", "text": "We compare the Anchored Packed Tree (APT) model (Weir et al., 2016) , where composition is an integral part of the distributional model, with the commonly employed approach of applying na\u00efve compositional operations to state-of-the-art distributional representations." 
}, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-16", "text": "Consider the occurrence of the word student in the sentence \"The recently graduated student folded the dry clothes.\" Different distributional representations leverage the context, e.g., the fact that the target word student has occurred in the context folded, in different ways." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-17", "text": "Table 1 illustrates the contextual features which might be generated for student given different definitions of context." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-18", "text": "The most commonly used definition of context, in both traditional count-based representations and in more recent distributed embeddings, is proximity, i.e., the contextual features of a word occurrence are all those words which occur within a certain context window around the occurrence." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-19", "text": "However, contextual features may also be defined in terms of dependency relations." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-20", "text": "For example, in a dependency parse of the sentence we would expect to see a direct-object relation from folded to student." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-21", "text": "Contextual features based on dependency relations may be typed (i.e., include the name of the dependency relation) or untyped (Baroni and Lenci, 2010) ." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-22", "text": "Pad\u00f3 and Lapata (2007) proposed using dependency paths to define untyped contextual features; here any word in the context which has a dependency path to the target is considered a contextual feature." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-23", "text": "Weeds et al. (2014) proposed using dependency paths to define typed contextual features which could be used to align representations before composition." 
}, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-24", "text": "This idea is further refined in the APT framework of Weir et al. (2016) ." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-25", "text": "----------------------------------" }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-26", "text": "**BACKGROUND**" }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-27", "text": "Na\u00efve composition of distributional representations, e.g., using pointwise addition and multiplication, has proved very popular and effective." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-28", "text": "In an evaluation across 3 different benchmark tasks (Dinu et al., 2013) , the lexical function model (Baroni and Zamparelli, 2010) was shown to be consistently the best-performing, but in the composition of adjective-noun phrases, simple additive and multiplicative models were highly competitive." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-29", "text": "Milajevs et al. (2014) compared neural word representations with count-based vectors on 4 different tasks using a variety of na\u00efve and tensor-based compositional models." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-30", "text": "The neural word representations consistently outperformed the traditional count-based vectors." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-31", "text": "Considering the results for the neural word representations, pointwise addition outperformed all of the other compositional models considered on 3 of the tasks." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-32", "text": "Typed distributional representations cannot be straightforwardly composed using na\u00efve operations (Weeds et al., 2014) ." 
}, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-33", "text": "The APT approach (Weir et al., 2016) overcomes this problem by defining contextual features in terms of complete dependency paths and then ensuring that the representations of target words are properly aligned before composition." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-34", "text": "For example, to carry out the composition of student with folded in the example sentence, it is necessary to align the representations." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-35", "text": "This can be done by offsetting all of the features of student by its dependency relation (NSUBJ) with folded." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-36", "text": "Intuitively we are viewing the representation of student from the perspective of actions (i.e., verbs) which are likely to be carried out by students." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-37", "text": "This view can be straightforwardly composed with the representation of folded because the representations are aligned, i.e., they have features of the same type (e.g., DOBJ)." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-38", "text": "----------------------------------" }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-39", "text": "**COMPOSITIONALITY OF COMPOUND NOUNS**" }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-40", "text": "Compositionality detection (Reddy et al., 2011) involves deciding whether a given multiword expression is compositional or not, i.e., whether the meaning can be understood from the literal meaning of its parts." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-41", "text": "Reddy et al. (2011) introduced a dataset consisting of 90 compound nouns along with human judgments of their literality or compositionality at both the constituent and the phrase level." 
}, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-42", "text": "All judgments are given on a scale of 0 to 5, where 5 is high." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-43", "text": "For example, the phrase spelling bee is deemed to have high literalness in its use of the first constituent, low literalness in its use of the second constituent and a medium level of literalness with respect to the whole phrase." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-44", "text": "Assuming the distributional hypothesis (Harris, 1954) , the observed co-occurrences of compositional target phrases are highly likely to have occurred with one or both of the constituents independently." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-45", "text": "On the other hand, the observed co-occurrences of non-compositional target phrases are much less likely to have occurred with either of the constituents independently." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-46", "text": "Thus, a good compositionality function, without any access to the observed co-occurrences of the target phrases, is highly likely to return vectors which are similar to observed phrasal vectors for compositional phrases but much less likely to return similar vectors for non-compositional phrases." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-47", "text": "Accordingly, as observed elsewhere (Reddy et al., 2011; Salehi et al., 2015; Yazdani et al., 2015) , compositional methods can be evaluated by correlating the similarity of composed and observed phrase representations with the human judgments of compositionality." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-48", "text": "A similar idea is also explored by Kiela and Clark (2013) who detect non-compositional phrases by comparing the neighbourhoods of phrases where individual words have been substituted for similar words." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-49", "text": "Reddy et al. 
(2011) carried out experiments with a vector space model built from ukWaC (Ferraresi et al., 2008) using untyped co-occurrences (window size=100)." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-50", "text": "Using 3-fold cross-validation, they found that weighted addition outperformed multiplication as a compositionality function." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-51", "text": "With their optimal settings, they achieved a Spearman's rank correlation coefficient of 0.714 with the human judgments, which remains the state-of-the-art on this dataset 1 ." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-52", "text": "For consistency with the experiments of Reddy et al. (2011) , the corpus used in this experiment is the same fully-annotated version of the web-derived ukWaC corpus (Ferraresi et al., 2008) ." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-53", "text": "This corpus has been tokenised, POS-tagged and lemmatised with TreeTagger (Schmid, 1994) and dependency-parsed with the Malt Parser (Nivre, 2004) ." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-54", "text": "It contains about 1.9 billion tokens." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-55", "text": "In order to create a corpus which contains compound nouns, we further preprocessed the corpus by identifying occurrences of the 90 target compound nouns and recombining them into a single lexical item." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-56", "text": "We then created a number of elementary representations for every token in the corpus." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-57", "text": "----------------------------------" }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-58", "text": "**UNTYPED CONTEXTUAL FEATURES**" }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-59", "text": "For each word and compound phrase, neural representations were constructed using the word2vec tool (Mikolov et al., 2013) ." 
}, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-60", "text": "Whilst it is not possible or appropriate to carry out an exhaustive parameter search, we experiment with a number of commonly used and recommended parameter settings." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-61", "text": "We investigate both the cbow and skip-gram models with 50, 100 and 300 dimensions and experiment with the subsampling threshold, trying 10 \u22123 , 10 \u22124 and 10 \u22125 ." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-62", "text": "As recommended in the documentation, we use a window size of 5 for cbow and of 10 for skip-gram." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-63", "text": "Early experiments with different composition operations showed add to be the only promising option." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-64", "text": "Similarity between composed and observed representations is computed using the cosine measure." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-65", "text": "----------------------------------" }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-66", "text": "**TYPED CONTEXTUAL FEATURES**" }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-67", "text": "For each word and compound phrase, elementary APT representations were constructed using the method and recommended settings of Weir et al. (2016) ." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-68", "text": "For efficiency, we did not consider paths of length 3 or more." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-69", "text": "In relation to the construction of the elementary APTs, the most obvious parameter is the nature of the weight associated with each feature." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-70", "text": "We consider both the use of probabilities 2 and positive pointwise mutual information (PPMI)" }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-71", "text": "1 Hermann et al. 
(2012) proposed using generative models for modeling the compositionality of noun-noun compounds." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-72", "text": "Using interpolation to mitigate the sparse data problem, their model beat the baseline of weighted addition on the Reddy et al. (2011) evaluation task when trained on the BNC." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-73", "text": "However, these results were still significantly lower than those reported by Reddy et al. (2011) using the larger ukWaC corpus." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-74", "text": "2 referred to as normalised counts by Weir et al. (2016) values." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-75", "text": "Levy et al. (2015) showed that the use of context distribution smoothing (\u03b1 = 0.75) in the PMI calculation can lead to performance comparable with state-of-the-art word embeddings on word similarity tasks." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-76", "text": "We use this modified definition of PMI and experiment with \u03b1 = 0.75 and \u03b1 = 1." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-77", "text": "3 Having constructed elementary APTs, the APT composition process involves aligning and composing these elementary APTs." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-78", "text": "We investigate using INT, which takes the minimum of each of the constituent's feature values, and UNI, which performs pointwise addition." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-79", "text": "Following Reddy et al. (2011) , when using the UNI operation, we experiment with weighting the contributions of each constituent to the composed APT representation using the parameter h. For example, if A 2 is the APT associated with the head of the phrase and A \u03b4 1 is the properly aligned APT associated with the modifier where \u03b4 is the dependency path from the head to the modifier (e.g. 
NMOD or AMOD), the composition operations can be defined as:" }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-81", "text": "We have also considered composition without alignment of the modifier's APT, i.e., using A 1 :" }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-82", "text": "In general, one would expect there to be little overlap between APTs which have not been properly aligned." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-83", "text": "However, in the case where \u03b4 is the NMOD relation, i.e., the internal relation in the vast majority of the compound phrases, both modifier and head are nouns and therefore there may well be considerable overlap between their unaligned dependency features." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-84", "text": "In order to examine the contribution of both the aligned and unaligned APTs in the composition process, we used a hybrid method where the composed representation is defined as: Table 2: Average \u03c1 using neural word embeddings" }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-85", "text": "In the case where representations consist of APT weights which are probabilities, PPMI is estimated after composition." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-86", "text": "Therefore we refer to this as compose-first (CF) in contrast to compose-second (CS) where composition is carried out after PPMI calculations." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-87", "text": "In both cases, the cosine measure is applied to vectors made up of PPMI values in order to calculate the similarity of the observed and composed representations." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-88", "text": "----------------------------------" }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-89", "text": "**RESULTS**" }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-90", "text": "We used repeated 3-fold cross-validation to enable us to estimate 4 the model parameters h and q. 
Results for all models are then reported in terms of average Spearman rank correlation scores (\u03c1) of phrase compositionality scores with human judgements on the corresponding testing samples." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-91", "text": "We used a sufficiently large number of repetitions that errors are all small (\u2264 0.0015) and thus any difference observed which is greater than 0.005 is statistically significant at the 95% level." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-92", "text": "Boldface is used to indicate the best performing configuration of parameters for a particular model." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-93", "text": "Table 2 summarises results for different parameter settings for the neural word embeddings." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-94", "text": "Looking at the results in Table 2 , we see that the cbow model significantly outperforms the skip-gram model." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-95", "text": "Using the cbow model with 100 dimensions and a subsampling threshold of t = 10 \u22123 gives a performance of 0.74 which is significantly higher than the previous state-of-the-art reported in Reddy et al. (2011) ." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-96", "text": "Since both of these models are based on untyped co-occurrences, this performance gain can be seen as the result of implicit parameter optimisation." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-97", "text": "Table 3: Average \u03c1 using APT representations." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-98", "text": "APT representations." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-99", "text": "We see that the results using standard PPMI (\u03b1 = 1) significantly outperform the result reported in Reddy et al. (2011) , which demonstrates the superiority of a typed dependency space over an untyped dependency space." 
}, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-100", "text": "Smoothing the PPMI calculation with a value of \u03b1 = 0.75 generally has a further small positive effect." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-101", "text": "On average, the results when probabilities are composed and PPMI is calculated as part of the similarity calculation (CF) are slightly higher than the results when PPMI weights are composed (CS) ." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-102", "text": "Regarding different composition operations, UNI generally outperforms INT ." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-103", "text": "In general, the unaligned model outperforms the aligned model." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-104", "text": "However, a small but statistically significant performance gain is generally made using the hybrid model." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-105", "text": "Therefore aligned APT composition and unaligned APT composition are predicting different contexts for compound nouns which all contribute to a better estimate of the compositionality of the phrase." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-106", "text": "----------------------------------" }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-107", "text": "**CONCLUSIONS AND FURTHER WORK**" }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-108", "text": "We have shown that combining traditional compositional methods with state-of-the-art low-dimensional word representations can improve results over the state-of-the-art." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-109", "text": "Further improvements can be achieved using an integrated compositional distributional approach based on APT representations." 
}, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-110", "text": "This approach maintains syntactic structure within the contextual features of words which is then central to the compositional process." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-111", "text": "We argue that some knowledge of syntactic structure is crucial in the fine-grained understanding of language." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-112", "text": "Since compositionality detection also provides a way of evaluating compositional methods without confounding judgements of phrase similarity with judgements of compositionality, it appears that the APT approach to composition is reasonably promising." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-113", "text": "Further work is of course needed with other datasets and other types of phrase." }, { "sent_id": "14529822630fb469f5fc8f37aaf473-C001-114", "text": "For example, it would be interesting to apply these models in German and evaluate their performance on a German noun-noun compound compositionality dataset (Schulte im Walde et al., 2013; Schulte im Walde et al., 2016) ." 
} ], "y": { "@USE@": { "gold_contexts": [ [ "14529822630fb469f5fc8f37aaf473-C001-14" ], [ "14529822630fb469f5fc8f37aaf473-C001-79" ] ], "cite_sentences": [ "14529822630fb469f5fc8f37aaf473-C001-14", "14529822630fb469f5fc8f37aaf473-C001-79" ] }, "@BACK@": { "gold_contexts": [ [ "14529822630fb469f5fc8f37aaf473-C001-40", "14529822630fb469f5fc8f37aaf473-C001-41" ], [ "14529822630fb469f5fc8f37aaf473-C001-47" ], [ "14529822630fb469f5fc8f37aaf473-C001-49", "14529822630fb469f5fc8f37aaf473-C001-50", "14529822630fb469f5fc8f37aaf473-C001-51" ], [ "14529822630fb469f5fc8f37aaf473-C001-52" ], [ "14529822630fb469f5fc8f37aaf473-C001-71", "14529822630fb469f5fc8f37aaf473-C001-72", "14529822630fb469f5fc8f37aaf473-C001-73" ] ], "cite_sentences": [ "14529822630fb469f5fc8f37aaf473-C001-40", "14529822630fb469f5fc8f37aaf473-C001-41", "14529822630fb469f5fc8f37aaf473-C001-47", "14529822630fb469f5fc8f37aaf473-C001-49", "14529822630fb469f5fc8f37aaf473-C001-50", "14529822630fb469f5fc8f37aaf473-C001-51", "14529822630fb469f5fc8f37aaf473-C001-52", "14529822630fb469f5fc8f37aaf473-C001-72", "14529822630fb469f5fc8f37aaf473-C001-73" ] }, "@MOT@": { "gold_contexts": [ [ "14529822630fb469f5fc8f37aaf473-C001-52", "14529822630fb469f5fc8f37aaf473-C001-53", "14529822630fb469f5fc8f37aaf473-C001-55" ] ], "cite_sentences": [ "14529822630fb469f5fc8f37aaf473-C001-52" ] }, "@DIF@": { "gold_contexts": [ [ "14529822630fb469f5fc8f37aaf473-C001-95" ], [ "14529822630fb469f5fc8f37aaf473-C001-99" ] ], "cite_sentences": [ "14529822630fb469f5fc8f37aaf473-C001-95", "14529822630fb469f5fc8f37aaf473-C001-99" ] } } }, "ABC_4176674f83dec5389a23d9d45654c7_6": { "x": [ { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-13", "text": "**APPROACH**" }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-2", "text": "We describe the submission from the Columbia Arabic & Dialect Modeling group (CADIM) for the Shared Task at the Fourth Workshop on Statistical 
Parsing of Morphologically Rich Languages (SPMRL'2013)." }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-3", "text": "We participate in the Arabic Dependency parsing task for predicted POS tags and features." }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-4", "text": "Our system is based on Marton et al. (2013) ." }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-5", "text": "----------------------------------" }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-7", "text": "In this paper, we discuss the system that the Columbia Arabic & Dialect Modeling group (CADIM) submitted to the 2013 Shared Task on Parsing Morphologically Rich Languages (Seddah et al., 2013) ." }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-8", "text": "We used a system for Arabic dependency parsing which we had previously developed, but retrained it on the training data splits used in this task." }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-9", "text": "We only participated in the Arabic dependency parsing track, and in it, only optimized for predicted (non-gold) POS tags and features." }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-10", "text": "We first summarize our previous work (Section 2)." }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-11", "text": "We then discuss our submission and the results (Section 3)." }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-12", "text": "----------------------------------" }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-14", "text": "In this section, we summarize Marton et al. (2013) ." }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-15", "text": "We first present some background information on Arabic morphology and then discuss our methodology and main results." 
}, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-16", "text": "We present our best performing set of features, which we also use in our SPMRL'2013 submission." }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-17", "text": "----------------------------------" }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-18", "text": "**BACKGROUND**" }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-19", "text": "Morphology interacts with syntax in two ways: agreement and assignment." }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-20", "text": "In agreement, there is coordination between the morphological features of two words in a sentence based on their syntactic configuration (e.g., subject-verb or noun-adjective agreement in GENDER and/or NUMBER)." }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-21", "text": "In assignment, specific morphological feature values are assigned in certain syntactic configurations (e.g., CASE assignment for the subject or direct object of a verb)." }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-22", "text": "The choice of optimal linguistic features for a parser depends on three factors: relevance, redundancy and accuracy." }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-23", "text": "A feature has relevance if it is useful in making an attachment (or labeling) decision." }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-24", "text": "A particular feature may or may not be relevant to parsing." }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-25", "text": "For example, the GENDER feature may help parse the Arabic phrase / bAb AlsyArh Aljdyd/Aljdydh 1 'door the-car the-new masc.sg/fem.sg [lit.]' using syntactic agreement: if the-new is masculine (Aljdyd ), it should attach to the masculine door (bAb ), resulting in the meaning 'the car's new door'; if the-new is feminine (Aljdydh ), it should attach to the feminine the-car (AlsyArh ), resulting in 'the door of the new car'." 
}, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-26", "text": "In contrast, the ASPECT feature does not constrain any syntactic decision. 1 Arabic orthographic transliteration is presented in the HSB scheme." }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-27", "text": "2 Even if relevant, a feature may not necessarily contribute to optimal performance since it may be redundant with other features that surpass it in relevance." }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-28", "text": "For example, the DET and STATE features alone both help parsing because they help identify the idafa construction (the modification of a nominal by a genitive noun phrase), but they are redundant with each other and the DET feature is more helpful since it also helps with adjectival modification of nouns." }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-29", "text": "Finally, the accuracy of automatically predicting the feature values (ratio of correct predictions out of all predictions) of course affects the value of a feature on unseen text." }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-30", "text": "Even if relevant and non-redundant, a feature may be hard to predict with sufficient accuracy by current technology, in which case it will be of little or no help for parsing, even if helpful when its gold values are provided." }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-31", "text": "The CASE feature is very relevant and not redundant, but it cannot be predicted with high accuracy and overall it is not useful." }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-32", "text": "Different languages vary with respect to which features may be most helpful given various tradeoffs among these three factors." }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-33", "text": "It has been shown previously that if the relevant morphological features in assignment configurations can be recognized well enough, then they contribute to parsing accuracy." 
}, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-34", "text": "For example, modeling CASE in Czech improves Czech parsing (Collins et al., 1999) : CASE is relevant, not redundant, and can be predicted with sufficient accuracy." }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-35", "text": "However, it had been more difficult showing that agreement morphology helps parsing, with negative results for dependency parsing in several languages Eryigit et al., 2008; Nivre, 2009) ." }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-36", "text": "In contrast to these negative results, Marton et al. (2013) showed positive results for using agreement morphology for Arabic." }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-37", "text": "----------------------------------" }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-38", "text": "**METHODOLOGY**" }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-39", "text": "In Marton et al. (2013) , we investigated morphological features for dependency parsing of Modern Standard Arabic (MSA)." }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-40", "text": "The goal was to find a set of relevant, accurate and non-redundant features." }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-41", "text": "We used both the MaltParser (Nivre, 2008) and the Easy-First Parser (Goldberg and Elhadad, 2010) ." }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-42", "text": "Since the Easy-First Parser performed better, we use it in all experiments reported in this paper." }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-43", "text": "For MSA, the space of possible morphological features is quite large." }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-44", "text": "We determined which morphological features help by performing a search through the feature space." }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-45", "text": "In order to do this, we separated part-of-speech (POS) from the morphological features." 
}, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-46", "text": "We defined a core set of 12 POS features, and then explored combinations of morphological features in addition to this POS tagset." }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-47", "text": "This core set of POS tags is similar to those proposed in cross-lingual work (Rambow et al., 2006; Petrov et al., 2012) ." }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-48", "text": "We performed this search independently for Gold input features and predicted input features." }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-49", "text": "We used our MADA+TOKAN system (Habash and Rambow, 2005; Habash et al., 2009; for the prediction." }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-50", "text": "As the EasyFirst Parser predicts links separately before labels, we first optimized for unlabeled attachment score, and then optimized the Easy-First Parser labeler for label score." }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-51", "text": "As had been found in previous results, assignment features, specifically CASE and STATE, are very helpful in MSA." }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-52", "text": "However, in MSA this is true only under gold conditions: since CASE is rarely explicit in the typically undiacritized written MSA, it has a dismal accuracy rate, which makes it useless when used in machine-predicted (real, non-gold) condition." }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-53", "text": "In contrast with previous results, we showed that agreement features are quite helpful in both gold and predicted conditions." }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-54", "text": "This is likely a result of MSA having a rich agreement system, covering both verb-subject and noun-adjective relations." 
}, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-55", "text": "Additionally, almost all work to date in MSA morphological analysis and part-of-speech (POS) tagging has concentrated on the morphemic form of the words." }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-56", "text": "However, often the functional morphology (which is relevant to agreement, and relates to the meaning of the word) is at odds with the \"surface\" (form-based) morphology; a well-known example of this are the \"broken\" (irregular) plurals of nominals, which often have singular-form morphemes but are in fact plurals and show plural agreement if the referent is rational." }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-57", "text": "In Marton et al. (2013) , we showed that by modeling the functional morphology rather than the form-based morphology, we obtain a further increase in parsing performance (2013) test (old split) 81.0 84.0 92.7 Table 2 : Results of our system on Shared Task test data, Gold Tokenization, Predicted Morphological Tags; and for reference also on the data splits used in our previous work (Marton et al., 2013) ; \"\u2264 70\" refers to the test sentences with 70 or fewer words." }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-58", "text": "Training Set Test Set Labeled Tedeval Score Unlabeled Tedeval Score 5K (SPMRL'2013) test \u2264 70 86.4 89.9 All (SPMRL'2013) test \u2264 70 87.8 90.8 Table 3 : Results of our system on on Shared Task test data, Predicted Tokenization, Predicted Morphological Tags; \"\u2264 70\" refers to the test sentences with 70 or fewer words (again, both when using gold and when using predicted POS and morphological features)." }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-59", "text": "We also showed that for parsing with predicted POS and morphological features, training on a combination of gold and predicted POS and morphological feature values outperforms the alternative training scenarios." 
}, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-60", "text": "----------------------------------" }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-61", "text": "**BEST PERFORMING FEATURE SET**" }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-62", "text": "The best performing set of features on non-gold input, obtained in Marton et al. (2013) , are shown in Table 1 ." }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-63", "text": "The features are clustered into three types." }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-64", "text": "\u2022 First is part-of-speech, represented using a \"core\" 12-tag set." }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-65", "text": "\u2022 Second are the inflectional morphological features: determiner clitic, person and functional gender and number." }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-66", "text": "\u2022 Third are the rationality (humanness) feature, which participates in morphosyntactic agreement in Arabic (Alkuhlani and Habash, 2011) , and a form of the lemma, which abstract over all inflectional morphology." }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-67", "text": "For the training corpus, we use a combination of the gold and predicted features." }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-68", "text": "----------------------------------" }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-69", "text": "**OUR SUBMISSION**" }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-70", "text": "----------------------------------" }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-71", "text": "**DATA PREPARATION**" }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-72", "text": "The data split used in the shared task is different from the data split we used in (Marton et al., 2013) , so we retrained our models on the new splits (Diab et al., 2013) ." 
}, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-73", "text": "The data released for the Shared Task showed inconsistent availability of lemmas across gold and predicted input, so we used the ALMOR analyzer (Habash, 2007) with the SAMA databases (Graff et al., 2009 ) to determine a lemma given the word form and the provided (gold or predicted) POS tags." }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-74", "text": "In addition to the lemmas, the ALMOR analyzer also provides morphological features in the feature-value representation our approach requires." }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-75", "text": "Finally, we ran our existing converter (Alkuhlani and Habash, 2012) over this representation to obtain functional number and gender, as well as the rationality feature." }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-76", "text": "3 For simplicity reasons, we used the MLE:W2+CATiB model (Alkuhlani and Habash, 2012) , which was the best performing model on seen words, as opposed to the combination system that used a syntactic component with better results on unseen words." }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-77", "text": "We did not perform Alif or Ya normalization on the data." }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-78", "text": "We trained two models: one on 5,000 sentences of training data and one on the entire training data." }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-79", "text": "----------------------------------" }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-80", "text": "**RESULTS**" }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-81", "text": "Our performance in the Shared Task for Arabic Dependency, Gold Tokenization, Predicted Tags, is shown in Table 2 ." }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-82", "text": "Our performance in the Shared Task for Arabic Dependency, Predicted Tokenization, Predicted Tags, is shown in Table 3 ." 
}, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-83", "text": "For predicted tokenization, only the IMS/Szeged system which uses system combination (Run 2) outperformed our parser on all measures; our parser performed better than all other single-parser systems." }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-84", "text": "For gold tokenization, our system is the second best single-parser system after the IMS/Szeged single system (Run 1)." }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-85", "text": "For gold tokenization and predicted morphology (Table 2) , we also give the performance reported in our previous work (Marton et al., 2013) ." }, { "sent_id": "4176674f83dec5389a23d9d45654c7-C001-86", "text": "The increase over the previously reported work may simply be due to the different split for training and test, but it may also be due to improvements to the functional feature prediction (Alkuhlani and Habash, 2012) , and the predicted features provided by the Shared Task organizers." 
} ], "y": { "@USE@": { "gold_contexts": [ [ "4176674f83dec5389a23d9d45654c7-C001-4" ], [ "4176674f83dec5389a23d9d45654c7-C001-39", "4176674f83dec5389a23d9d45654c7-C001-41", "4176674f83dec5389a23d9d45654c7-C001-42" ], [ "4176674f83dec5389a23d9d45654c7-C001-62" ], [ "4176674f83dec5389a23d9d45654c7-C001-85" ] ], "cite_sentences": [ "4176674f83dec5389a23d9d45654c7-C001-4", "4176674f83dec5389a23d9d45654c7-C001-39", "4176674f83dec5389a23d9d45654c7-C001-41", "4176674f83dec5389a23d9d45654c7-C001-62", "4176674f83dec5389a23d9d45654c7-C001-85" ] }, "@BACK@": { "gold_contexts": [ [ "4176674f83dec5389a23d9d45654c7-C001-14" ], [ "4176674f83dec5389a23d9d45654c7-C001-36" ], [ "4176674f83dec5389a23d9d45654c7-C001-39" ], [ "4176674f83dec5389a23d9d45654c7-C001-57" ] ], "cite_sentences": [ "4176674f83dec5389a23d9d45654c7-C001-14", "4176674f83dec5389a23d9d45654c7-C001-36", "4176674f83dec5389a23d9d45654c7-C001-39", "4176674f83dec5389a23d9d45654c7-C001-57" ] }, "@MOT@": { "gold_contexts": [ [ "4176674f83dec5389a23d9d45654c7-C001-35", "4176674f83dec5389a23d9d45654c7-C001-36" ] ], "cite_sentences": [ "4176674f83dec5389a23d9d45654c7-C001-36" ] }, "@DIF@": { "gold_contexts": [ [ "4176674f83dec5389a23d9d45654c7-C001-72" ], [ "4176674f83dec5389a23d9d45654c7-C001-86" ] ], "cite_sentences": [ "4176674f83dec5389a23d9d45654c7-C001-72", "4176674f83dec5389a23d9d45654c7-C001-86" ] } } }, "ABC_a6564c4b215e6c5ad4f53eeb5dd69c_6": { "x": [ { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-8", "text": "----------------------------------" }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-2", "text": "In human conversational interactions, turn-taking exchanges can be coordinated using cues from multiple modalities." 
}, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-3", "text": "To design spoken dialog systems that can conduct fluid interactions it is desirable to incorporate cues from separate modalities into turn-taking models." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-4", "text": "We propose that there is an appropriate temporal granularity at which modalities should be modeled." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-5", "text": "We design a multiscale RNN architecture to model modalities at separate timescales in a continuous manner." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-6", "text": "Our results show that modeling linguistic and acoustic features at separate temporal rates can be beneficial for turn-taking modeling." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-7", "text": "We also show that our approach can be used to incorporate gaze features into turn-taking models." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-10", "text": "The design of naturalistic spoken dialog systems (SDSs) that can interact with users in a human-like manner, requires models that control when it is an appropriate time for the system to speak." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-11", "text": "These turn-taking models (e.g. [7, 9, 10] ) are used to make decisions as to whether the current speaker will continue to hold the floor (HOLD) or not (SHIFT)." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-12", "text": "A naive approach to modeling these decisions is to identify turn exchange points using silence thresholding." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-13", "text": "This approach is limited in its naturalness since the chosen threshold can potentially be too short (leading to interruptions) or too long (leading to unnaturally long pauses)." 
}, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-14", "text": "These traditional thresholdbased approaches also cannot model rapid turn-switches, which Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-15", "text": "Copyrights for components of this work owned by others than ACM must be honored." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-16", "text": "Abstracting with credit is permitted." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-17", "text": "To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-18", "text": "Request permissions from permissions@acm.org." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-19", "text": "ICMI '18, October 16-20, 2018, Boulder, CO, USA \u00a9 2018 Association for Computing Machinery." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-20", "text": "ACM ISBN 978-1-4503-5692-3/18/10. . . $15.00 https://doi.org /10.1145/3242969.3242997 can potentially involve a brief period of overlap during the turn exchange [8] ." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-21", "text": "It has been proposed in the projection theory of Sacks [12] that, rather than reacting to silence, humans often form predictions as to when their conversational partner will finish their turn." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-22", "text": "Using these predictions we are able to anticipate utterance endpoints, and start our turns accordingly." 
}, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-23", "text": "LSTM-based turn-taking models that operate in a similar predictive fashion were proposed by Skantze in [13] ." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-24", "text": "In these models LSTMs are used to make continuous predictions of a person's speech activity at each individual time step of 50ms." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-25", "text": "The networks are trained to predict a vector of probability scores for speech activity in each individual frame within a set future window." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-26", "text": "Rather than designing classifiers to make specific decisions, these continuous models are able to capture general information about turn-taking in the data that they are trained on." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-27", "text": "They can therefore be applied to a wide variety of turn-taking prediction tasks and have been shown to outperform traditional classifiers when applied to HOLD/SHIFT predictions." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-28", "text": "A downside to the approach in [13] is that, since a single LSTM is being used, all input features must be processed at the same rate." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-29", "text": "When considering linguistic features, the the relevant information for turn-taking prediction happens at a coarse temporal granularity in comparison with acoustic features, where much of the relevant information occurs at the sub-word prosodic level." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-30", "text": "When using a single LSTM, this requires either averaging the acoustic features to represent them at a word-level temporal resolution, or upsampling the linguistic features to represent them at the acoustic temporal resolution." 
}, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-31", "text": "Both of these options have their drawbacks." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-32", "text": "In the case of averaging the acoustic features, we lose the finer-grained prosodic inflections that are important to forming turn-taking predictions." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-33", "text": "In the case of upsampling the linguistic features, the model has a harder time with the longer term dependencies that exist between linguistic features." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-34", "text": "Because the linguistic features are sampled at a higher rate, the model is more susceptible to the vanishing gradient problem." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-35", "text": "We propose a way to address this problem by using a multiscale RNN architecture, in which modalities are modeled in separate sub-network LSTMs that are allowed to operate at their own independent timescales." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-36", "text": "A master LSTM is used fuse the modalities and form predictions at a regular rate by taking as input a concatenation of the current hidden states of the sub-networks." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-37", "text": "This allows the hidden states of the sub-networks to be updated at an appropriate temporal rate." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-38", "text": "In this paper we present significant extensions to the original work of Skantze in [13] ." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-39", "text": "We investigate the performance of our proposed multiscale architecture on two datasets that contain two different combinations of modalities." 
}, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-40", "text": "We look at the influence of modeling modalities in separate sub-networks and using separate Session 5: Human Behavior ICMI'18, October 16-20, 2018, Boulder, CO, USA timescales." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-41", "text": "We find that there are significant performance benefits to modeling linguistic features at a slower temporal rate, and in a separate sub-network from acoustic features." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-42", "text": "We also find that our approach can be used to incorporate gaze features into turn-taking models, a task that has been previously found to be difficult [4] ." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-43", "text": "----------------------------------" }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-44", "text": "**CONTINUOUS TURN-TAKING PREDICTION**" }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-45", "text": "The main objective of continuous turn-taking prediction as proposed in [13] is to predict the future speech activity annotations of one of the speakers in a dyadic conversation using input speech features from both speakers (x t )." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-46", "text": "At each timestep t of frame-size 50ms, speech features are extracted and input into an LSTM that is used to predict the future speech activity (y t ) of one of the speakers." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-47", "text": "The future speech activity is a 3 second window comprising of 60 frames of the binary annotations for frames t + 1 to t + 60." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-48", "text": "The output layer of the network uses an element-wise sigmoid activation function to predict a probability score for the target speaker's speech activity at each future frame." 
}, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-49", "text": "To represent linguistic features in this model the word-rate features are upsampled to the 50ms acoustic feature rate." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-50", "text": "They use one-hot encodings, where the feature is \"switched on\" for a single frame, 100ms after the word is finished (to simulate real-world ASR conditions)." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-51", "text": "To model more fine-grained prosodic inflections the 50ms acoustic feature rate could be increased." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-52", "text": "However, doing this leads to a sparser linguistic feature representation since all features input into an LSTM must be processed at the same rate." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-53", "text": "This makes it more difficult for the network to model longer-term dependencies that exist in the linguistic modality." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-54", "text": "----------------------------------" }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-55", "text": "**MULTISCALE CONTINUOUS TURN-TAKING PREDICTION**" }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-56", "text": "To address this problem we modify the original network architecture to include a variable number of sub-network LSTM cells that process features from separate modalities independently." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-57", "text": "The sub-networks are allowed to process the input features from the separate modalities at different timescales." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-58", "text": "An example network configuration that uses a master LSTM (h 0 ) and two sub-network LSTMs (h 1 ,h 2 , each assigned a modality) is shown in Fig. 1 ." 
}, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-59", "text": "We use superscripts to denote the index of modalities m \u2208 M, and subscripts to index timesteps (represented using the notation t m )." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-60", "text": "At each timestep of the master LSTM (t 0 ), the current states of the sub-network LSTMs are concatenated and fed into the master LSTM." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-61", "text": "The hidden state update process for the network is shown in Algorithm 1." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-62", "text": "By feeding the current states of the sub-networks into the master LSTM, we are effectively performing a sampling operation, represented in the algorithm by the step h m t 0 +1 \u2190\u2212 h m t m +1 ." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-63", "text": "The sampling operation can either increase or decrease the temporal resolution of the individual modalities, depending on the timescales used." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-64", "text": "For example, in Fig. 1 , the temporal resolution of the first sub-network (h 1 ) is decreased since we sample it at a regular rate every five t 1 timesteps." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-65", "text": "The temporal resolution of the second modality could either be increased or decreased by the sampling process since the features have irregular timesteps." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-66", "text": "The processing of features at a slower update rate potentially allows the network to better retain information." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-67", "text": "The model was implemented using the PyTorch framework and our code is available online 1 ." 
}, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-68", "text": "----------------------------------" }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-69", "text": "**EXPERIMENTAL DESIGN**" }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-70", "text": "To assess the performance of our multiscale approach, we test it on two different datasets." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-71", "text": "In each dataset, features from two separate modalities are investigated by training models using a variety of different network configurations." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-72", "text": "The HCRC map-task corpus (MTC) [1] is used to examine linguistic and acoustic modalities while the Mahnob Mimicry Database (MMD) [3] is used to examine visual and acoustic modalities." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-73", "text": "In this section we discuss the details of the datasets and how features were extracted." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-74", "text": "We then discuss the evaluation metrics used to assess network performance." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-75", "text": "We then outline our experimental procedure." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-76", "text": "----------------------------------" }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-77", "text": "**MAP-TASK CORPUS**" }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-78", "text": "The MTC is a corpus of 128 dyadic task-based dialogues totaling 18 hours in length." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-79", "text": "We used 96 conversations as training and 32 conversations for testing." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-80", "text": "Additionally, we used 32 conversations of the training set as a held-out test set during hyperparameter searches." 
}, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-81", "text": "The speech transcriptions supplied with the corpus were used as the ground-truth for the speech-activity predictions." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-82", "text": "Acoustic Features." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-83", "text": "As acoustic features, we use the low-level descriptors from the eGeMAPs [5] feature set extracted with the OpenSmile toolkit [6] ." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-84", "text": "The features were all normalized using zscores on a per-file basis." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-85", "text": "We extract the features at two different temporal resolutions: 10ms and 50ms." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-86", "text": "We use these two different temporal resolutions to investigate which one is more useful for our turn-taking models." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-87", "text": "In our results tables and discussion we refer to these as \"Acous 10ms\" and \"Acous 50ms\"." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-88", "text": "Linguistic Features." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-89", "text": "For linguistic features we use the word annotations supplied with the corpus." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-90", "text": "The words were represented as an enumerated vocabulary where the raw word features were transformed into a linear embedding of size 64 that is jointly trained with the rest of the network." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-91", "text": "In an effort to simulate the conditions of a real-time system, the linguistic features were not provided to the system until 100ms after the end of the word." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-92", "text": "Three different temporal rates for the processing of linguistic features are tested in our experiments." 
}, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-93", "text": "In our discussion and results below, \"Ling 50ms\" refers to using word features that have been sampled at regular 50ms intervals, as was proposed in [13] ." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-94", "text": "\"Ling 10ms\" refers to using word features that are sampled at a faster rate of 10ms." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-95", "text": "\"Ling Asynch\" refers to using an irregular update rate, where the LSTM only processes the linguistic features when a new word is available." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-96", "text": "----------------------------------" }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-97", "text": "**MAHNOB MIMICRY DATABASE**" }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-98", "text": "The MMD is an audio-visual corpus of 54 dyadic conversations totaling 11 hours in length." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-99", "text": "The participants are either assigned discussion topics or roles to play in a loosely defined role-playing scenario." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-100", "text": "When splitting the data into training and test sets we balanced the number of role-playing and discussion conditions in the training and test set." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-101", "text": "We used 39 conversations for training and 15 for testing." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-102", "text": "Since there are no speech transcriptions available for the dataset, we manually labeled the dataset for speech activity." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-103", "text": "The procedure we used for extracting acoustic for MMD was the same as that followed in for the MTC in section 4.1." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-104", "text": "Visual Features." 
}, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-105", "text": "We automatically extract visual features using the OpenFace toolkit [2] ." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-106", "text": "During informal exploratory experiments we found that the automatically extracted gaze features performed better than other features extracted with the toolkit (e.g. facial action units, pose)." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-107", "text": "We therefore used the gaze features (a six dimensional vector of eye gaze directions) along with a confidence score as our visual input feature." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-108", "text": "The video in the MMD uses a high frame rate of 58Hz." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-109", "text": "We perform a comparison of using features at this high frame rate and using features that are averaged over 50ms frame windows." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-110", "text": "In the results tables and the discussion below we refer to the high frame rate video and the averaged video features as \"Visual 58Hz\" and \"Visual 50ms\" respectively." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-111", "text": "----------------------------------" }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-112", "text": "**EVALUATION METRICS**" }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-113", "text": "To evaluate the performance of the different network configurations we use two kinds of evaluation metrics that were proposed in [13] ." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-114", "text": "These two kinds of metrics represent the prediction performance on two common types of turn-taking decisions that are pertinent to SDSs." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-115", "text": "Prediction at Pauses." 
}, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-116", "text": "The prediction at pauses metric represents the standard turn-taking decision made at brief pauses in the interaction to predict whether the person holding the floor will continue speaking (HOLD) or the interlocutor will take a turn (SHIFT)." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-117", "text": "To make this decision, we find all points where there is a pause of a minimum set length." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-118", "text": "We then select all of these instances where only one person continued within a one second window directly after the pause." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-119", "text": "We average the predicted output probabilities within the window for each of the speakers at the frame directly after the pause." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-120", "text": "The speaker with the higher score is selected as the predicted next speaker, giving us a binary HOLD/SHIFT classification for which we report F-scores." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-121", "text": "We test predictions at pauses of both 500ms (PAUSE 500) and 50ms (PAUSE 50)." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-122", "text": "The majority vote F-score for the MTC (always HOLD) is 0.5052 and 0.4608 for PAUSE 50 and PAUSE 500 respectively." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-123", "text": "For MMD the corresponding values are 0.7298 and 0.8185." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-124", "text": "Prediction at Onsets." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-125", "text": "Prediction at onsets (ONSET) represents a prediction of the length of an utterance after an initial period of speech." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-126", "text": "It represents a useful decision for estimating how long the upcoming utterance will be." 
}, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-127", "text": "It categorizes onset predictions into SHORT and LONG, where SHORT utterances can be considered similar to backchannels." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-128", "text": "For an utterance to be classified as short, 1.5 seconds of silence by a participant has to be followed by a maximum of 1 second of speech, after which the speaker must be silent for at least 5 seconds." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-129", "text": "For the utterance to be classified as long, 1.5 seconds of silence must be followed by at least 2.5 seconds of speech." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-130", "text": "The point at which the predictions are made is 500ms from the start of the utterance." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-131", "text": "The prediction is made by taking the mean of the 60 output nodes from the sigmoid layer and comparing them to a threshold value." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-132", "text": "The majority vote F-score for the MTC (always SHORT) is 0.3346." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-133", "text": "For the MMD the corresponding value is 0.5346." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-134", "text": "----------------------------------" }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-135", "text": "**EXPERIMENTAL PROCEDURE**" }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-136", "text": "In our experiments, the networks were trained to minimize binary cross entropy (BCE) loss which was shown to produce good results in [11] ." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-137", "text": "We test the impact of using three different network configurations with multiple combinations of modalities at different temporal resolutions." 
}, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-138", "text": "The three network configurations are: \"no subnets\", which corresponds to an early fusion approach in which the modalities are fed directly into a single LSTM; \"one subnet\", which corresponds to the use of only one sub-network LSTM; and \"two subnets\", which corresponds to the use of separate LSTM subnetworks for the individual modalities." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-139", "text": "We note that combinations such as \"Ling 50ms\" with \"Acous 10ms\" are not possible when using the \"no subnets\" and \"one subnet\" configurations since the features are being input into the same LSTM and cannot operate at different temporal resolutions." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-140", "text": "Grid searches for three hyperparameters (hidden node size, dropout, and L2 regularization) were performed for each network configuration." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-141", "text": "In order to limit the influence of parameter count changes between the different network configurations, the hidden node count in a given network was limited to a sum of 150." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-142", "text": "Once the hyperparameters for a network are chosen, we train the network five times and report the mean values of the different evaluation metrics in Tables 1 and 2 ." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-143", "text": "The best performing modality combination for a given network configuration is shown in bold and the best overall performance is shown in italics." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-144", "text": "In our discussion below we use two-tailed t-tests to report on the difference between the means of metrics." 
}, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-145", "text": "----------------------------------" }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-146", "text": "**DISCUSSION**" }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-147", "text": "Looking at the results from the fusion of linguistic and acoustic modalities shown in Table 1 , it is clear that there are significant benefits in modeling acoustic and linguistic modalities separately using different timescales." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-148", "text": "Our best performance on all evaluation metrics is achieved using our multiscale approach where features from the two modalities are modeled at separate rates." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-149", "text": "Comparing the BCE loss of the best performing early-fusion result (7) with the best multiscale result (11) gives a statistically significant improvement (P < .001)." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-150", "text": "Comparing the performance of the acoustic feature timescales, we observe that the faster rate of 10ms consistently performs better than the slower 50ms rate." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-151", "text": "Looking at the performance of the three different linguistic timescales in (3, 4, 5) , we see that processing linguistic features at the slower regular rate of 50ms achieves the best performance." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-152", "text": "Comparing the BCE loss of (3) and (5) suggests that sampling linguistic features at a fast temporal rate makes it difficult for the network to model longer term dependencies (P = .004)." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-153", "text": "The effect of processing modalities on their own in separate sub-networks without the added gain of using separate timescales is inconclusive when we compare (6) and (10) ." 
}, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-154", "text": "Using a single subnet as an added layer also does not yield significant differences to the early fusion approach." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-155", "text": "We conclude that the main advantage in using our multiscale approach on a combination of acoustic and linguistic modalities is its ability to fuse the two modalities when the linguistic features are operating at a slow 50ms timescale and the acoustic features are operating at a fast 10ms timescale." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-156", "text": "Comparing our results with previously published baselines reported on the same dataset by Skantze in [13] , our best result on the PAUSE 500 task of 0.8553 is a large improvement over his reported score of 0.762. Looking at the results from the fusion of visual and acoustic modalities shown in in Table 2 , we were able to achieve our best BCE loss using our multiscale approach to fuse acoustic features at a 10ms timescale and visual features at a 58Hz timescale." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-157", "text": "Comparing this result (9) with our best \"no subnets\" result (2) gives a statistically significant improvement (P = 0.035)." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-158", "text": "We note that using early fusion with gaze features (5) does not add any value when compared to acoustic features on their own (1) ." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-159", "text": "The results also indicate that the faster 58Hz gaze features perform better than the averaged 50ms visual features when used in conjunction with the acoustic features." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-160", "text": "This suggests that we loose relevant information by averaging the gaze features within a timestep." 
}, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-161", "text": "----------------------------------" }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-162", "text": "**CONCLUSION**" }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-163", "text": "In conclusion, we have shown that there are considerable benefits in using multiscale architectures for continuous turn-taking prediction." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-164", "text": "When fusing linguistic and acoustic modalities, the architecture allows acoustic features to be modeled at a fast temporal rate (which is better-suited to capturing prosodic inflections) while modeling linguistic features at a slower rate (which is better-suited to capturing long-term linguistic dependencies)." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-165", "text": "When fusing visual and acoustic modalities, our multiscale approach allowed the use of high frame-rate visual features without resorting to averaging." }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-166", "text": "----------------------------------" }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-167", "text": "**ACKNOWLEDGEMENTS**" }, { "sent_id": "a6564c4b215e6c5ad4f53eeb5dd69c-C001-168", "text": "The ADAPT Centre for Digital Content Technology is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "a6564c4b215e6c5ad4f53eeb5dd69c-C001-23" ], [ "a6564c4b215e6c5ad4f53eeb5dd69c-C001-45" ], [ "a6564c4b215e6c5ad4f53eeb5dd69c-C001-139" ] ], "cite_sentences": [ "a6564c4b215e6c5ad4f53eeb5dd69c-C001-23", "a6564c4b215e6c5ad4f53eeb5dd69c-C001-45", "a6564c4b215e6c5ad4f53eeb5dd69c-C001-139" ] }, "@MOT@": { "gold_contexts": [ [ "a6564c4b215e6c5ad4f53eeb5dd69c-C001-22", "a6564c4b215e6c5ad4f53eeb5dd69c-C001-23", "a6564c4b215e6c5ad4f53eeb5dd69c-C001-24", "a6564c4b215e6c5ad4f53eeb5dd69c-C001-25", "a6564c4b215e6c5ad4f53eeb5dd69c-C001-26", "a6564c4b215e6c5ad4f53eeb5dd69c-C001-27", "a6564c4b215e6c5ad4f53eeb5dd69c-C001-28" ] ], "cite_sentences": [ "a6564c4b215e6c5ad4f53eeb5dd69c-C001-23", "a6564c4b215e6c5ad4f53eeb5dd69c-C001-24", "a6564c4b215e6c5ad4f53eeb5dd69c-C001-25", "a6564c4b215e6c5ad4f53eeb5dd69c-C001-26", "a6564c4b215e6c5ad4f53eeb5dd69c-C001-27", "a6564c4b215e6c5ad4f53eeb5dd69c-C001-28" ] }, "@EXT@": { "gold_contexts": [ [ "a6564c4b215e6c5ad4f53eeb5dd69c-C001-38" ] ], "cite_sentences": [ "a6564c4b215e6c5ad4f53eeb5dd69c-C001-38" ] }, "@USE@": { "gold_contexts": [ [ "a6564c4b215e6c5ad4f53eeb5dd69c-C001-93" ], [ "a6564c4b215e6c5ad4f53eeb5dd69c-C001-156" ] ], "cite_sentences": [ "a6564c4b215e6c5ad4f53eeb5dd69c-C001-93", "a6564c4b215e6c5ad4f53eeb5dd69c-C001-156" ] }, "@DIF@": { "gold_contexts": [ [ "a6564c4b215e6c5ad4f53eeb5dd69c-C001-156" ] ], "cite_sentences": [ "a6564c4b215e6c5ad4f53eeb5dd69c-C001-156" ] } } }, "ABC_780b96afa8d417aa241e01ad594ce9_6": { "x": [ { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-2", "text": "Public speakings play important roles in schools and work places and properly using humor contributes to effective presentations." 
}, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-3", "text": "For the purpose of automatically evaluating speakers' humor usage, we build a presentation corpus containing humorous utterances based on TED talks." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-4", "text": "Compared to previous data resources supporting humor recognition research, ours has several advantages, including (a) both positive and negative instances coming from a homogeneous data set, (b) containing a large number of speakers, and (c) being open." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-5", "text": "Focusing on using lexical cues for humor recognition, we systematically compare a newly emerging text classification method based on Convolutional Neural Networks (CNNs) with a wellestablished conventional method using linguistic knowledge." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-6", "text": "The advantages of the CNN method are both getting higher detection accuracies and being able to learn essential features automatically." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-7", "text": "----------------------------------" }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-9", "text": "The ability to make effective presentations has been found to be linked with success at school and in the workplace (Hill and Storey, 2003; Stevens, 2005) ." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-10", "text": "Humor plays an important role in successful public speaking, e.g., helping to reduce public speaking anxiety often regarded as the most prevalent type of social phobia, generating shared amusement to boost persuasive power, and serving as a means to attract attention and reduce tension (Xu, 2016) ." 
}, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-11", "text": "Automatically simulating an audience's reactions to humor will not only be useful for presentation training, but also improve conversational systems by giving machines more empathetic power." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-12", "text": "The present study reports our efforts in recognizing utterances that cause laughter in presentations." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-13", "text": "These include building a corpus from TED talks and using Convolutional Neural Networks (CNNs) in the recognition." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-14", "text": "The remainder of the paper is organized as follows: Section 2 briefly reviews the previous related research; Section 3 describes the corpus we collected from TED talks; Section 4 describes the text classification methods; Section 5 reports on our experiments; finally, Section 6 discusses the findings of our study and plans for future work." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-15", "text": "----------------------------------" }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-16", "text": "**PREVIOUS RESEARCH**" }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-17", "text": "Humor recognition refers to the task of deciding whether a sentence/spoken-utterance expresses a certain degree of humor." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-18", "text": "In most of the previous studies (Mihalcea and Strapparava, 2005; Purandare and Litman, 2006; Yang et al., 2015) , humor recognition was modeled as a binary classification task." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-19", "text": "In the seminal work (Mihalcea and Strapparava, 2005) , a corpus of 16,000 \"one-liners\" was created using daily joke websites to collect humorous instances while using formal writing resources (e.g., news titles) to obtain non-humorous instances." 
}, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-20", "text": "Three humor-specific stylistic features, including alliteration, antonymy, and adult slang were utilized together with content-based features to build classifiers." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-21", "text": "In a recent work (Yang et al., 2015) , a new corpus was constructed from the Pun of the Day website." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-22", "text": "Yang et al. (2015) explained and computed stylistic features based on the following four aspects: (a) Incongruity, (b) Ambiguity, (c) Interpersonal Effect, and (d) Phonetic Style." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-23", "text": "In addition, Word2Vec (Mikolov et al., 2013) distributed representations were utilized in the model building." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-24", "text": "Beyond lexical cues from text inputs, other research has also utilized speakers' acoustic cues (Purandare and Litman, 2006; Bertero and Fung, 2016b) ." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-25", "text": "These studies have typically used audio tracks from TV shows and their corresponding captions in order to categorize characters' speaking turns as humorous or non-humorous based on canned laughter." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-26", "text": "Convolutional Neural Networks (CNNs) have recently been successfully used in several text categorization tasks (e.g., review rating, sentiment recognition, and question type recognition)." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-27", "text": "Kim (2014) ; Johnson and Zhang (2015) ; Zhang and Wallace (2015) suggested that using a simple CNN setup, which entails one layer of convolution on top of word embedding vectors, achieves excellent results on multiple tasks." 
}, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-28", "text": "Deep learning recently has been applied to computational humor research (Bertero and Fung, 2016b,a) ." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-29", "text": "In Bertero and Fung (2016b) , CNN was found to be the best model that uses both acoustic and lexical cues for humor recognition." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-30", "text": "However, it did not outperform the Logistical Regression (LR) model when using text inputs exclusively." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-31", "text": "Beyond treating humor detection as a binary classification task, Bertero and Fung (2016a) formulated the recognition to be a sequential labeling task and utilized Recurrent Neural Networks (RNNs) (Hochreiter and Schmidhuber, 1997) on top of CNN models (serving as feature extractors) to utilize context information among utterances." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-32", "text": "From the brief review, it is clear that there is a great need for an open corpus that can support investigating humor in presentations." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-33", "text": "1 CNNbased text categorization methods have been applied to humor recognition (e.g., in (Bertero and Fung, 2016b) ) but with limitations: (a) a rigorous comparison with the state-of-the-art conventional method examined in Yang et al. (2015) is missing; (b) CNN's performance in the previous research is not quite clear; and (c) some important techniques that can improve CNN performance (e.g., using varied-sized filters and dropout regularization (Hinton et al., 2012)) were not applied." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-34", "text": "Therefore, the present study is meant to address these limitations." 
}, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-35", "text": "----------------------------------" }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-36", "text": "**TED TALK DATA**" }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-37", "text": "TED Talks 2 are recordings from TED conferences and other special TED programs." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-38", "text": "Many effects in a presentation can cause audience laugh, such as speaking content, presenters' nonverbal behaviors, and so on." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-39", "text": "In the present study, we focused on the transcripts of the talks." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-40", "text": "Most transcripts of the talks contain the markup '(Laughter)', which represents where audiences laughed aloud during the talks." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-41", "text": "This special markup was used to determine utterance labels." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-42", "text": "We collected 1,192 TED Talk transcripts 3 ." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-43", "text": "An example transcription is given in Figure 1 ." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-44", "text": "The collected transcripts were split into sentences using the Stanford CoreNLP tool (Manning et al., 2014) ." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-45", "text": "In this study, sentences containing or immediately followed by '(Laughter)' were used as 'Laughter' sentences, as shown in Figure 1 ; all other sentences were defined as 'No-Laughter' sentences." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-46", "text": "Following Mihalcea and Strapparava (2005) and Yang et al. (2015) , we selected the same numbers (n = 4726) of 'Laughter' and 'NoLaughter' sentences." 
}, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-47", "text": "To minimize possible topic shifts between positive and negative instances, for each positive instance, we randomly picked one negative instance nearby (the context window was 7 sentences in this study)." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-48", "text": "For example, in Figure 1 , a negative instance (corresponding to 'sent-2') was selected from the nearby sentences ranging from 'sent-7' to 'sent+7'." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-49", "text": "More details about this data set can refer to Lee et al. (2016) ." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-50", "text": "The TED data set can be obtained by contacting the authors." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-51", "text": "----------------------------------" }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-52", "text": "**METHODS**" }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-53", "text": "----------------------------------" }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-54", "text": "**CONVENTIONAL MODEL**" }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-55", "text": "Following Yang et al. (2015) , we applied Random Forest (Breiman, 2001 ) to perform humor recognition by using the following two groups of features." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-56", "text": "The first group are humor-specific stylistic features covering the following 4 categories 4 : Incongruity (2), Ambiguity (6), Interpersonal Effect (4), and Phonetic Pattern (4)." 
}, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-57", "text": "The second group are semantic distance features, including the humor label classes from 5 sentences in the training set that are closest to the sentence being evaluated (found by using a k-Nearest Neighbors (kNN)" }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-58", "text": "----------------------------------" }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-59", "text": "**CNN MODEL**" }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-60", "text": "Our CNN-based text classification's setup follows Kim (2014) ." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-61", "text": "Figure 2 depicts the model's details." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-62", "text": "From the left side's input texts to the right side's prediction labels, different shapes of tensors flow through the entire network for solving the classification task in an end-to-end mode." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-63", "text": "Firstly, tokenized text strings were converted to a 2D tensor with shape (L \u00d7 d), where L represents sentences' maximum length while d represents the word-embedding dimension." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-64", "text": "In this study, we utilized the Word2Vec (Mikolov et al., 2013) embedding vectors (d = 300) that were trained on 100 billion words of Google News." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-65", "text": "Next, the embedding matrix was fed into a 1D convolution network with multiple filters." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-66", "text": "To cover varied reception fields, we used filters of sizes of f w \u2212 1, f w , and f w + 1." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-67", "text": "For each filter size, n f filters were utilized." 
}, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-68", "text": "Then, max pooling, which stands for finding the largest value from a vector, was applied to each feature map (total 3 \u00d7 n f feature maps) output by the 1D convolution." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-69", "text": "Finally, maximum values from all of 3 \u00d7 n f filters were formed as a flattened vector to go through a fully connected (FC) layer to predict two possible labels (Laughter vs. No-Laughter) ." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-70", "text": "Note that for 1D convolution and FC layer's input, we applied 'dropout' (Hinton et al., 2012) regularization, which entails randomly setting a proportion of network weights to be zero during model training, to overcome over-fitting." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-71", "text": "By using cross-entropy as the learning metric, the whole sequential network (all weights and bias) could be optimized by using any SGD optimization, e.g., Adam (Kingma and Ba, 2014) , Adadelta (Zeiler, 2012) , and so on." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-72", "text": "----------------------------------" }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-73", "text": "**EXPERIMENTS**" }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-74", "text": "We used two corpora: the TED Talk corpus (denoted as TED) and the Pun of the Day corpus 5 (denoted as Pun)." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-75", "text": "Note that we normalized words in the Pun data to lowercase to avoid a possibly elevated result caused by a special pattern: in the original format, all negative instances started with capital letters." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-76", "text": "The Pun data allows us to verify that our implementation of the conventional model is consistent with the work reported in Yang et al. (2015) ." 
}, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-77", "text": "In our experiment, we firstly divided each corpus into two parts." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-78", "text": "The smaller part (the Dev set) was used for setting various hyper-parameters used in text classifiers." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-79", "text": "The larger portion (the CV set) was then formulated as a 10-fold crossvalidation setup for obtaining a stable and comprehensive model evaluation result." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-80", "text": "For the PUN data, the Dev contains 482 sentences, while the CV set contains 4344 sentences." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-81", "text": "For the TED data, the Dev set contains 1046 utterances, while the CV set contains 8406 utterances." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-82", "text": "Note that, with a goal of building a speaker-independent humor detector, when partitioning our TED data set, we always kept all utterances of a single talk within the same partition." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-83", "text": "When building conventional models, we developed our own feature extraction scripts and used the SKLL 6 python package for building Random Forest models." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-84", "text": "When implementing CNN, Bergstra et al. (2013) ." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-85", "text": "After running 200 iterations of tweaking, we ended up with the following selection: f w is 6 (entailing that the various filter sizes are (5, 6, 7)), n f is 100, dropout 1 is 0.7 and dropout 2 is 0.35, optimization uses Adam (Kingma and Ba, 2014) ." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-86", "text": "When training the CNN model, we randomly selected 10% of the training data as the validation set for using early stopping to avoid over-fitting." 
}, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-87", "text": "On the Pun data, the CNN model shows consistent improved performance over the conventional model, as suggested in Yang et al. (2015) ." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-88", "text": "In particular, precision has been greatly increased from 0.762 to 0.864." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-89", "text": "On the TED data, we also observed that the CNN model helps to increase precision (from 0.515 to 0.582) and accuracy (from 52.0% to 58.9%)." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-90", "text": "The empirical evaluation results suggest that the CNN-based model has an advantage on the humor recognition task." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-91", "text": "In addition, focusing on the system development time, gener-7 Our code implementation was based on https://github.com/shagunsodhani/ CNN-Sentence-Classifier ating and implementing those features in the conventional model would take days or even weeks." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-92", "text": "However, the CNN model automatically learns its optimal feature representation and can adjust the features automatically across data sets." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-93", "text": "This makes the CNN model quite versatile for supporting different tasks and data domains." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-94", "text": "Compared with the humor recognition results on the Pun data, the results on the TED data are still quite low, and more research is needed to fully handle humor in authentic presentations." 
}, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-95", "text": "----------------------------------" }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-96", "text": "**DISCUSSION**" }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-97", "text": "For the purpose of monitoring how well speakers can use humor during their presentations, we have created a corpus from TED talks." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-98", "text": "Compared to the existing corpora, ours has the following advantages: (a) it was collected from authentic talks, rather than from TV shows performed by professional actors based on scripts; (b) it contains about 100 times more speakers compared to the limited number of actors in existing corpora." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-99", "text": "We compared two types of leading text-based humor recognition methods: a conventional classifier (e.g., Random Forest) based on human-engineered features vs. an end-to-end CNN method, which relies on its inherent representation learning." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-100", "text": "We found that the CNN method has better performance." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-101", "text": "More importantly, the representation learning of the CNN method makes it very efficient when facing new data sets." }, { "sent_id": "780b96afa8d417aa241e01ad594ce9-C001-102", "text": "Stemming from the present study, we envision that more research is worth pursuing: (a) for presentations, cues from other modalities such as audio or video will be included, similar to Bertero and Fung (2016b) ; (b) context information from multiple utterances will be modeled by using sequential modeling methods." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "780b96afa8d417aa241e01ad594ce9-C001-18" ], [ "780b96afa8d417aa241e01ad594ce9-C001-21", "780b96afa8d417aa241e01ad594ce9-C001-22" ], [ "780b96afa8d417aa241e01ad594ce9-C001-33" ] ], "cite_sentences": [ "780b96afa8d417aa241e01ad594ce9-C001-18", "780b96afa8d417aa241e01ad594ce9-C001-21", "780b96afa8d417aa241e01ad594ce9-C001-22", "780b96afa8d417aa241e01ad594ce9-C001-33" ] }, "@MOT@": { "gold_contexts": [ [ "780b96afa8d417aa241e01ad594ce9-C001-33", "780b96afa8d417aa241e01ad594ce9-C001-34" ] ], "cite_sentences": [ "780b96afa8d417aa241e01ad594ce9-C001-33" ] }, "@USE@": { "gold_contexts": [ [ "780b96afa8d417aa241e01ad594ce9-C001-46" ], [ "780b96afa8d417aa241e01ad594ce9-C001-55" ] ], "cite_sentences": [ "780b96afa8d417aa241e01ad594ce9-C001-46", "780b96afa8d417aa241e01ad594ce9-C001-55" ] }, "@SIM@": { "gold_contexts": [ [ "780b96afa8d417aa241e01ad594ce9-C001-76" ], [ "780b96afa8d417aa241e01ad594ce9-C001-87" ] ], "cite_sentences": [ "780b96afa8d417aa241e01ad594ce9-C001-76", "780b96afa8d417aa241e01ad594ce9-C001-87" ] } } }, "ABC_b27150a3506730c61dc78b3034887e_6": { "x": [ { "sent_id": "b27150a3506730c61dc78b3034887e-C001-51", "text": "----------------------------------" }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-52", "text": "**PREPROCESSING**" }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-2", "text": "This paper describes our system for SemEval-2019 Task 4: Hyperpartisan News Detection (Kiesel et al., 2019) ." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-3", "text": "We use pretrained BERT (Devlin et al., 2018) architecture and investigate the effect of different fine tuning regimes on the final classification task." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-4", "text": "We show that additional pretraining on news domain improves the performance on the Hyperpartisan News Detection task." 
}, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-5", "text": "Our system 1 ranked 8th out of 42 teams with 78.3% accuracy on the held-out test dataset." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-6", "text": "----------------------------------" }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-8", "text": "With the rapid spread of the Internet and nextgeneration media development, people started to follow news through the Internet by abandoning de facto sources such as television and radio." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-9", "text": "Recent studies reveal that 43% of Americans report often getting news online (Shearer and Gottfried, 2017) ." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-10", "text": "In parallel with that, there also has been a massive improvement in the NLP research in news domain to keep the content true, fair and unbiased." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-11", "text": "SemEval-2019 Task 4: Hyperpartisan News Detection, is yet another attempt under this objective." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-12", "text": "Hyperpartisan is defined as being extremely biased in favor of a political party (Bastos and Mercea, 2017) and the aim of the shared task is to detect hyperpartisan argumentation in news text." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-13", "text": "Though it is an important task by itself, hyperpartisan argument detection is also considered as a very first step (or even replacement) of fake news detection, because it has been shown by (Potthast et al., 2018) that there is a high positive correlation between having a hyperpartisan argumentation and being fake for news items." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-14", "text": "In this shared task, we seek to model this problem as a text classification task." 
}, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-15", "text": "In general, the * equal contribution 1 https://github.com/ozanarkancan/hyperpartisan task aims to label the text in the question with one or more classes or categories." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-16", "text": "The main question of text classification is how to mathematically represent the words/tokens such that they retain their original meaning in the context they appear." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-17", "text": "This question has been tried to be answered in many different ways so far." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-18", "text": "In earlier work, people mainly used the \"bag of words\" approach in algorithms such as Naive Bayes, Decision Tree, and SVM." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-19", "text": "Then, (Mikolov et al., 2013) advanced the field further by introducing word embeddings, capturing a somewhat meaningful representation of words." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-20", "text": "However, recent studies (Peters et al., 2018; Radford et al., 2018; Devlin et al., 2018) showed that contextual word embeddings perform quite better than traditional word embeddings in many different NLP tasks as a result of their superior capacity of meaning representation." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-21", "text": "Among those, BERT attracts researchers most because of (i) its transformer based architecture enabling faster training and (ii) state of the art results in many different tasks." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-22", "text": "Though it is quite new, BERT has been tried in many different domains than the one proposed in Devlin et al. (2018) ." 
}, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-23", "text": "However, almost all of these studies have two things in common: they don't start training BERT from scratch and the target domain contains very limited data (Zhu et al., 2018; Yang et al., 2019; Alberti et al., 2019) ." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-24", "text": "In this study, on the other hand, we address (1) the performance of BERT by comparing its domain specific pre-trained and fine-tuned performances, and (2) in the setting where the target domain has extensively more data." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-25", "text": "In the following sections, we first summarize the BERT architecture, then give details of shared task data set, and then describe experimental setups we used to train BERT model." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-26", "text": "In the results section, we compare the performance of BERT under different settings and share our submission results for the shared task." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-27", "text": "----------------------------------" }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-28", "text": "**METHOD**" }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-29", "text": "Transformer 2 (Vaswani et al., 2017) originally came out as a machine translation architecture and it uses the idea of self attention mechanism (Parikh et al., 2016; Lin et al., 2017) ." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-30", "text": "It has an encoderdecoder design and both parts use the same novel multi-head attention mechanism." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-31", "text": "The encoder part takes an input sentence and derives a representation from it using this attention mechanism." 
}, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-91", "text": "**SHARED TASK RESULTS**" }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-32", "text": "Afterwards, the decoder generates the target sentence by performing multi-headed attention over the encoder stack." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-33", "text": "Devlin et al. (2018) introduced two unsupervised tasks to pretrain this architecture, Next Sentence Prediction and Masked Language Modeling." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-34", "text": "In Next Sentence Prediction task, the goal is to determine whether the sentence comes after the specified previous sentence or not." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-35", "text": "It takes two sentences as input, the latter being in its original form 50% of the time, while other times it can be any random sentence from the corpus." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-36", "text": "In Masked Language Modeling task, 15% of the words in the input sentences are masked and the model tries to predict these words." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-37", "text": "Training takes place with the combined loss of these two unsupervised tasks." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-38", "text": "Resulting representations can be further fine-tuned with a task specific layer on the top for a number of NLP tasks using appropriate supervised data." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-39", "text": "In this study, we use an open source PyTorch implementation 3 of BERT architecture." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-40", "text": "We make use of BERT-Base pretrained model provided by Devlin et al. (2018) in order to avoid pretraining from scratch." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-41", "text": "Similar to Devlin et al. (2018) , we use the representation obtained from the last layer for the first token (i.e. 
\"[CLS]\") for the sentence representation and a softmax classifier on top of it for predicting hyperpartisanship." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-42", "text": "----------------------------------" }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-43", "text": "**EXPERIMENTS**" }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-44", "text": "In this section, we first introduce data provided by the shared task and the data preprocessing step." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-45", "text": "Then, we give the details of our experiments and results with BERT under pretraining and finetuning settings." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-46", "text": "----------------------------------" }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-47", "text": "**DATA**" }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-48", "text": "Task provides data that consist of 750.000 articles labelled portal-wise and 645 articles labelled manually, and they divide the former into 600.000 and 150.000 as train and development set." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-49", "text": "Portal-wise data is labelled as hyperpartisan or not, according to publishers known affinities provided by BuzzFeed journalists or MediaBiasFactCheck.com." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-50", "text": "In our experiments, we first shuffled and then split the portal-wise data into three: 705.000, 40.000, 5.000 articles for train, development and test respectively." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-53", "text": "For all our experiments we remove some unwanted text from the articles." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-54", "text": "We replaced HTML character reference for ampersand and numeric entity reference, and removed adjacent underscore characters which is possibly used as a replacement for classified information in data." 
}, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-55", "text": "We also removed lines, solely containing \"*\" characters, used for separation of different news in the same article." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-56", "text": "----------------------------------" }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-57", "text": "**INPUT REPRESENTATION**" }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-58", "text": "BERT restricts the input length to a maximum of 512 tokens." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-59", "text": "We select the first n tokens from the beginning of the article, because using the lead sentences of a news article has been found to be more effective for some NLP tasks (Wasson, 1998) ." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-60", "text": "We use the same tokenization method and embeddings as Devlin et al. (2018) to represent the words." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-61", "text": "----------------------------------" }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-62", "text": "**FINE-TUNING ONLY**" }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-63", "text": "In order to show how BERT performs in news domain, our first attempt was to use the training data to only fine-tune the pretrained model for classification." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-64", "text": "We used BERT-Base which consists of 12 transformer blocks on top of each other applying 12 headed attention mechanism, hidden size of 768 and a total of 110 million parameters." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-65", "text": "We set 16 as our batch size and 2e-5 as our learning rate as recommended by Devlin et al. (2018) Table 1 : Classification results on our portal-wise data splits with fine-tuned BERT." 
}, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-66", "text": "We performed experiments using 128, 256 and 512 as our maximum sequence lengths and found out that 256 gives us the best test results, as shown in Table 1 ." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-67", "text": "Although the results for experiments with maximum sequence lengths of 256 and 512 are relatively close to each other, we chose 256 for computational efficiency." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-68", "text": "From these results, we can argue that for news articles, the first 128 tokens do not carry enough information." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-69", "text": "----------------------------------" }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-70", "text": "**PRETRAINING + FINE-TUNING**" }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-71", "text": "For the pretraining step, the data used by two unsupervised tasks need to be generated." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-92", "text": "The evaluation of SemEval-2019 Task 4, Hyperpartisan News Detection task is done through the online platform of TIRA (?)." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-72", "text": "For the Next Sentence Prediction task, originally, one would go over the articles sentence by sentence to generate pretraining data, but our data is not made of split sentences." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-73", "text": "To avoid using a tool for sentence splitting, as it would take too much time in large scale, for each document from the training data, we extract a chunk of text with a random length sampled from a uniform distribution defined as an interval between %15 and %85 of the maximum sequence length." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-74", "text": "The reason for this is to make the model more robust to non-sentential input and leave space for the second sentence." 
}, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-75", "text": "As the second sentence, 50% of the time, we select the chunk following the original one with a length that is complementing the first chunk's length up to maximum sequence length." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-76", "text": "Other times, when we need the next sentence to be random, we take a random chunk from other documents." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-77", "text": "We extract more than one sample from a single docu-" }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-78", "text": "----------------------------------" }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-79", "text": "**MODEL**" }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-80", "text": "Combined Loss BERT-Base 3.65 Our Version 1.79 ment, avoiding overlapping between chunks." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-81", "text": "For Masked LM task, we follow the same approach with Devlin et al. (2018) ." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-82", "text": "At the end of pretraining data generation process, we accumulated near 3.5 million samples, only running the process once on our train split, so without any duplication unlike Devlin et al. (2018) because of time restrictions." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-83", "text": "We also generated a small held-out dataset using our test split to use in evaluation." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-84", "text": "Starting from the pretrained model of BERT-Base instead of a cold start, we trained the model with a learning rate of 3e-5 and 256 as the maximum sequence length for 290k iterations." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-85", "text": "Table 2 presents the combined loss of two unsupervised tasks on the held-out data for original BERT-Base and further pretrained model with the generated data." 
}, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-86", "text": "Results show that pretraining BERT further with data from an unseen domain greatly increases its representational power." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-87", "text": "Table 3 : Comparison of fine-tuning only and pretraining + fine-tuning models." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-88", "text": "After this step, we applied the same fine-tuning as previous section with the same parameters." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-89", "text": "Table 3 demonstrates that pretraining BERT with domain specific data using unsupervised tasks improves the performance of the model on the supervised classificiation task." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-90", "text": "----------------------------------" }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-93", "text": "It serves as a means of blind evaluation of the submitted model." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-94", "text": "Accuracy is used as the official evaluation metric and the deciding test set is an another manually labelled news articles set named \"by-article-test-set\" which was kept hidden from the participants." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-95", "text": "In our first attempt, we fine-tuned BERT with portal-wise train split using development set to get the best model." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-96", "text": "After this we further train it with 645 manually labeled data (i.e. \"by-article-trainset\"), because it comes from the same sample as test data." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-97", "text": "In our last attempt, we pretrained BERT with our portal-wise train split, and then fine-tune it as described before." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-98", "text": "Again, we further fine-tune our model with \"by-article-train-set\" data." 
}, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-99", "text": "The results of our two attempts can be seen in Table 4 ." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-100", "text": "The third model in the table is to show the effect of the last fine-tuning step on \"by-article-train-set\"." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-101", "text": "Looking at the results of second and third models on \"by-article-test-set\" shows us, although we fine-tune BERT with supervised data for the same classification task, fine-tuning on \"byarticle-train-set\" improves the results drastically." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-102", "text": "This may be rooted from the domain difference in between \"by-article-test-set\" and portal-wise train data." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-103", "text": "Although our experiments ( Table 3) show us that pretraining BERT further with data from news domain has a positive effect on overall accuracy, we are not able to observe the similar effect on \"by-article-test-set\"." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-104", "text": "The second model adapts to the publisher domain more than the first model does because of the extensive pretraining before fine-tuning." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-105", "text": "As the difference between publisher and article is highly notable from the findings before, overfitting to the publisher domain might end up hurting the generalization of the model." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-106", "text": "So, this would explain the unexpected drop of performance between the second model and the first model." 
}, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-107", "text": "----------------------------------" }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-108", "text": "**CONCLUSION**" }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-109", "text": "We presented a BERT baseline for the Hyperpartisan News Detection task." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-110", "text": "We demonstrated that pretraining BERT in an unseen domain improves the performance of the model on the domain specific supervised task." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-111", "text": "We also showed that the difference in news source affects the generalization." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-112", "text": "Our best performing system ranked 8th out of 42 teams with 78.3% accuracy on the held-out test dataset." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-113", "text": "From our findings, we believe that domain adaptation is important for the BERT architecture and we would like to investigate the effect of from scratch unsupervised pretraining on the supervised task as future work." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-114", "text": "Beale 4 is a news anchor who decides to commit suicide on live air." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-115", "text": "Instead, he gives his famous speech about modern American life and convinces American people to scream his words: \"I'm as mad as hell, and I'm not going to take this any more!\". But the media sees his breakdown as an opportunity for huge ratings." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-116", "text": "We believe that the speech is now more than ever relevant to our media." }, { "sent_id": "b27150a3506730c61dc78b3034887e-C001-117", "text": "Choosing \"Howard Beale\" as the team name is our scream from the windows of Academia." 
} ], "y": { "@USE@": { "gold_contexts": [ [ "b27150a3506730c61dc78b3034887e-C001-3" ], [ "b27150a3506730c61dc78b3034887e-C001-25", "b27150a3506730c61dc78b3034887e-C001-26" ], [ "b27150a3506730c61dc78b3034887e-C001-39" ], [ "b27150a3506730c61dc78b3034887e-C001-40" ], [ "b27150a3506730c61dc78b3034887e-C001-45" ], [ "b27150a3506730c61dc78b3034887e-C001-60" ], [ "b27150a3506730c61dc78b3034887e-C001-63" ], [ "b27150a3506730c61dc78b3034887e-C001-64" ], [ "b27150a3506730c61dc78b3034887e-C001-65" ], [ "b27150a3506730c61dc78b3034887e-C001-81" ], [ "b27150a3506730c61dc78b3034887e-C001-84" ], [ "b27150a3506730c61dc78b3034887e-C001-85" ], [ "b27150a3506730c61dc78b3034887e-C001-109" ] ], "cite_sentences": [ "b27150a3506730c61dc78b3034887e-C001-3", "b27150a3506730c61dc78b3034887e-C001-25", "b27150a3506730c61dc78b3034887e-C001-26", "b27150a3506730c61dc78b3034887e-C001-39", "b27150a3506730c61dc78b3034887e-C001-40", "b27150a3506730c61dc78b3034887e-C001-45", "b27150a3506730c61dc78b3034887e-C001-60", "b27150a3506730c61dc78b3034887e-C001-63", "b27150a3506730c61dc78b3034887e-C001-64", "b27150a3506730c61dc78b3034887e-C001-65", "b27150a3506730c61dc78b3034887e-C001-81", "b27150a3506730c61dc78b3034887e-C001-84", "b27150a3506730c61dc78b3034887e-C001-85", "b27150a3506730c61dc78b3034887e-C001-109" ] }, "@BACK@": { "gold_contexts": [ [ "b27150a3506730c61dc78b3034887e-C001-18", "b27150a3506730c61dc78b3034887e-C001-20" ], [ "b27150a3506730c61dc78b3034887e-C001-21", "b27150a3506730c61dc78b3034887e-C001-22" ], [ "b27150a3506730c61dc78b3034887e-C001-33" ], [ "b27150a3506730c61dc78b3034887e-C001-58" ] ], "cite_sentences": [ "b27150a3506730c61dc78b3034887e-C001-20", "b27150a3506730c61dc78b3034887e-C001-21", "b27150a3506730c61dc78b3034887e-C001-22", "b27150a3506730c61dc78b3034887e-C001-33", "b27150a3506730c61dc78b3034887e-C001-58" ] }, "@MOT@": { "gold_contexts": [ [ "b27150a3506730c61dc78b3034887e-C001-22", "b27150a3506730c61dc78b3034887e-C001-23", "b27150a3506730c61dc78b3034887e-C001-24" ] ], 
"cite_sentences": [ "b27150a3506730c61dc78b3034887e-C001-22", "b27150a3506730c61dc78b3034887e-C001-23", "b27150a3506730c61dc78b3034887e-C001-24" ] }, "@SIM@": { "gold_contexts": [ [ "b27150a3506730c61dc78b3034887e-C001-41" ] ], "cite_sentences": [ "b27150a3506730c61dc78b3034887e-C001-41" ] }, "@DIF@": { "gold_contexts": [ [ "b27150a3506730c61dc78b3034887e-C001-82" ], [ "b27150a3506730c61dc78b3034887e-C001-86", "b27150a3506730c61dc78b3034887e-C001-89" ], [ "b27150a3506730c61dc78b3034887e-C001-101" ], [ "b27150a3506730c61dc78b3034887e-C001-103" ], [ "b27150a3506730c61dc78b3034887e-C001-110" ] ], "cite_sentences": [ "b27150a3506730c61dc78b3034887e-C001-82", "b27150a3506730c61dc78b3034887e-C001-86", "b27150a3506730c61dc78b3034887e-C001-89", "b27150a3506730c61dc78b3034887e-C001-101", "b27150a3506730c61dc78b3034887e-C001-103", "b27150a3506730c61dc78b3034887e-C001-110" ] }, "@EXT@": { "gold_contexts": [ [ "b27150a3506730c61dc78b3034887e-C001-86", "b27150a3506730c61dc78b3034887e-C001-89" ], [ "b27150a3506730c61dc78b3034887e-C001-95" ], [ "b27150a3506730c61dc78b3034887e-C001-97" ], [ "b27150a3506730c61dc78b3034887e-C001-113" ] ], "cite_sentences": [ "b27150a3506730c61dc78b3034887e-C001-86", "b27150a3506730c61dc78b3034887e-C001-89", "b27150a3506730c61dc78b3034887e-C001-95", "b27150a3506730c61dc78b3034887e-C001-97", "b27150a3506730c61dc78b3034887e-C001-113" ] }, "@FUT@": { "gold_contexts": [ [ "b27150a3506730c61dc78b3034887e-C001-113" ] ], "cite_sentences": [ "b27150a3506730c61dc78b3034887e-C001-113" ] } } }, "ABC_eb5ef34dd9c3845cd27c33242d5316_6": { "x": [ { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-2", "text": "Existing state of the art neural entity linking models employ attention-based bag-of-words context model and pre-trained entity embeddings bootstrapped from word embeddings to assess topic level context compatibility." 
}, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-3", "text": "However, the latent entity type information in the immediate context of the mention is neglected, which causes the models often link mentions to incorrect entities with incorrect type." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-4", "text": "To tackle this problem, we propose to inject latent entity type information into the entity embeddings based on pre-trained BERT." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-5", "text": "In addition, we integrate a BERT-based entity similarity score into the local context model of a state-of-the-art model to better capture latent entity type information." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-6", "text": "Our model significantly outperforms the state-of-the-art entity linking models on standard benchmark (AIDA-CoNLL)." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-7", "text": "Detailed experiment analysis demonstrates that our model corrects most of the type errors produced by the direct baseline." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-8", "text": "----------------------------------" }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-10", "text": "Entity Linking (EL) is the task of disambiguating textual mentions to their corresponding entities in a reference knowledge base (e.g., Wikipedia)." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-11", "text": "An accurate entity linking system is crucial for many knowledge related tasks such as question answering (Yih et al. 2015) and information extraction (Hoffmann et al. 2011) ." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-12", "text": "Traditional entity linkers mainly depend on manually designed features to evaluate the local context compatibility and document-level global coherence of referent entities (Cheng and Roth 2013; Durrett and Klein 2014) ." 
}, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-13", "text": "The design of such features requires entity-specific domain knowledge." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-14", "text": "These features can not fully capture relevant statistical dependencies and interactions." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-15", "text": "One recent notable work (Ganea and Hofmann 2017) instead pioneers to rely on pre-trained entity embeddings, learnable context representation and differentiable joint inference stage to learn basic features and their combinations from scratch." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-36", "text": "So it is natural for the model of Ganea and Hofmann (2017) to make type errors when it is trained to fit such entity embeddings." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-37", "text": "As is argued in (Zhou et al. 2018) , \"context consistency is a strong proxy for type compatibility\"." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-38", "text": "Based on this claim, the mention's immediate context is a proxy of its type." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-39", "text": "For example, we consider the following context from Wikipedia linking to the entity APPLE in which the mention is replaced with the [MASK] token." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-40", "text": "Fruits that tend to be more popular in this area are [MASK] , pears , and berries ." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-41", "text": "By reading the context surrounding the [MASK] token, we can easily determine that the entities fitting this context should be a kind of fruit." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-42", "text": "In this paper, we propose to inject latent entity type information into the entity embeddings by modeling the immediate context surrounding the mention." 
}, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-43", "text": "Specifically, we apply pre-trained BERT (Devlin et al. 2019) to represent the entity context and build a shared entity representation by aggregating all the entity contexts linking to the same entity via average pooling." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-44", "text": "Pre-trained BERT models naturally fit our purpose to represent the entity context surrounding the [MASK] token as it is trained with masked language model objective." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-45", "text": "What's more, we integrate a BERT-based entity similarity feature into the local model of Ganea and Hofmann (2017) to better capture entity type information." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-46", "text": "This can leverage both the pre-trained entity embeddings from BERT and the domain adaption capability of BERT via fine-tuning." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-47", "text": "We conduct entity linking experiments on standard benchmark datasets: AIDA-CoNLL and five out-domain test sets." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-48", "text": "Our model achieves an absolute improvement of 1.32% F1 on AIDA-CoNLL test set and average 0.80% F1 on five out-domain test sets over five different runs." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-49", "text": "In addition, we conduct detailed experiment analysis on AIDA-CoNLL development set which shows our proposed model can reduce 67.03% type errors of the state-of-the-art model (Ganea and Hofmann 2017) and more than 90% of the remaining type error cases are due to over estimation of prior and global modeling problem which we leave as the further work." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-50", "text": "Our contributions can be summarized as follows." 
}, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-51", "text": "\u2022 We show current state-of-the-art (SOTA) neural entity linking models based on attention-based bag-of-words context model often produce type errors and analyze the possible causes." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-52", "text": "\u2022 We propose a novel entity embedding method based on pre-trained BERT to better capture latent entity type information." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-53", "text": "\u2022 We integrate a BERT-based entity similarity into the local model of a SOTA model (Ganea and Hofmann 2017) ." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-54", "text": "\u2022 We verify the effectiveness of our model on standard benchmark datasets and achieve significant improvement over the baseline. And the detailed experiment analysis demonstrates that our method truly corrects most of the type errors produced by the baseline." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-55", "text": "----------------------------------" }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-56", "text": "**BACKGROUND ENTITY LINKING PROBLEM**" }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-57", "text": "Formally, given a document D consisting of a list of entity mentions m 1 , ..., m n ." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-58", "text": "The goal of an entity linking system is to assign each m i a KB entity e i or predict that no corresponding entity in the KB (i.e., e i =NIL)." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-59", "text": "Due to potentially very large entity space (e.g. Wikipedia has more than 4 million entities), standard entity linking is often divided into two stages: candidate generation which chooses potential candidates C i = (e i1 , ..., e ili ) using a heuristic and entity disambiguation which learns to select the best entity from the candidates using a statistical model." 
}, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-60", "text": "In this work, we focus on the second stage entity disambiguation." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-61", "text": "As for entity disambiguation, two different kinds of information can be leveraged: local context compatibility and document-level global coherence which respectively corresponds to the local model and the global model." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-62", "text": "Next, we introduce the general formulation of entity linking problem with a focus on the well known DeepED model (Ganea and Hofmann 2017) ." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-63", "text": "General Formulation An entity linking model integrating both local and global features can be formulated as a conditional random field." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-64", "text": "Formally, we can define a scoring function g to evaluate the entity assignment e 1 , ..., e n to mentions m 1 , ..., m n in a document D." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-65", "text": "g(e 1 , ..., e n |D)" }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-66", "text": "where the first term scores how well an entity fits its local context and the second one measures the global coherence." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-67", "text": "Local Model Following Ganea and Hofmann (2017) , we instantiate the local model as an attention model based on pre-trained word and entity embeddings." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-68", "text": "Specifically, for each mention m i , a pruned candidate set C i = (e i1 , ..., e ili ) is identified in the candidate generation stage." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-69", "text": "We compute a local context score for each e \u2208 C i based on the K-word (in practice, K is set to 100 and stop words are removed.) local context c = {w 1 , ..., w K } surrounding m i ." 
}, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-70", "text": "where B is a learnable diagonal matrix, x e is the embeddings of entity e, h(c) applies a hard attention mechanism to context words in c to obtain the representation of the context." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-71", "text": "Besides, Ganea and Hofmann (2017) combined this context score with the priorp(e|m) (computed by mixing mention-entity hyperlink count statistics from Wikipedia, a large Web corpus and YAGO." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-72", "text": "2 ) using a two-layer feedforward neural network in the local model." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-73", "text": "\u03a8(e, m, c) = f (\u03a8 long (e, c), logp(e|m))" }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-74", "text": "(3)" }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-75", "text": "Global Model The second term in Equation 1 is given by:" }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-76", "text": "where C is a diagonal matrix." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-77", "text": "The model defined by Equation 1 is a fully-connected pairwise conditional random field." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-78", "text": "Exact maximum-a-posteriori inference on this CRF, needed both at training and testing phrase, is NP-hard (Wainwright, Jordan, and others 2008) ." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-79", "text": "So they used maxproduct loopy belief propagation (LBP) to estimate the maxmarginal probabilit\u0177" }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-80", "text": "..,ei\u22121 ei+1,...,en g(e 1 , ..., e n |D)" }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-81", "text": "for each mention m i ." 
}, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-82", "text": "The final score for m i is given by:" }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-83", "text": "where f is another two-layer neural network andp(e|m i ) is the prior feature." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-84", "text": "----------------------------------" }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-85", "text": "**RELATED WORK**" }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-86", "text": "Our work focuses on improving entity linking by capturing latent entity type information with BERT." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-87", "text": "Specifically, our work related to previous approaches in three aspects." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-88", "text": "Entity Embedding The entity linking task is essentially a zero-shot task where the answer of test cases may not exist in the training data." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-89", "text": "3 So we need to build a shared entity embedding space for all entities which allows neural entity linking models to generalize to both seen and unseen entities during test time." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-90", "text": "Based on the distributional hypothesis (Harris 1954) , an entity is characterized by its contexts." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-91", "text": "Different methods to characterize an entity's context result in different information its entity embedding can capture." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-92", "text": "Previous work (Yamada et al. 2016 ; Ganea and Hofmann 2017) on learning entity representation are mostly extensions of the embedding methods proposed by (Mikolov et al. 2013 )." 
}, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-93", "text": "An entity's context is a bag-ofwords representation which mainly captures topic level entity relatedness rather than entity type relatedness." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-94", "text": "In contrast, we propose a simple method to build entity embeddings directly from pre-trained BERT (Devlin et al. 2019) which can better capture entity type information." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-95", "text": "Type Information Previous work attempt to integrate type information into the entity linking task mostly by jointly modeling named entity recognition and entity linking." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-96", "text": "Specifically, a line of work (Durrett and Klein 2014; Luo et al. 2015; Nguyen, Theobald, and Weikum 2016) jointly model entity linking and named entity recognition to capture the mutual dependency between them using structured CRF." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-97", "text": "These methods mainly differ in the design of hand-engineered features." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-98", "text": "Recently, Martins, Marinho, and Martins (2019) perform multi-task learning using learned features by extending Stack-LSTM (Dyer et al. 2015) ." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-99", "text": "However, all of these work rely on extensive annotation of the type of mentions which are difficult to obtain on most of the entity linking datasets." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-100", "text": "In contrast, based on the assumption that \"context consistency is a strong proxy for type compatibility\" from Zhou et al. (2018) , we propose to model a mention's immediate context using BERT (Devlin et al. 2019) to capture its contextual latent entity type information." 
}, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-101", "text": "Applications of BERT Since the advent of the wellknown BERT models (Devlin et al. 2019) , it has been applied successfully to and has achieved state-of-the-art performance on many NLP tasks." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-102", "text": "The main challenges which the entity linking task has over other tasks e.g. sentence classification, named entity recognition, where BERT has been applied are: (1) a very large label space, i.e. every mention has many target entities and (2) the zero-shot nature of the entity linking task." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-103", "text": "Training label embeddings from a small labeled dataset could not generalize to cover unseen entities in test time." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-104", "text": "To tackle this problem, we introduce a novel method to build entity embeddings from BERT by modeling the immediate context of an entity." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-105", "text": "----------------------------------" }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-106", "text": "**MODEL**" }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-107", "text": "Our model consists of two phrases: (1) Build entity embeddings from BERT (2) Add a BERT-based entity similarity component to the local model." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-108", "text": "Next we will describe each phrase in the following sections." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-109", "text": "----------------------------------" }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-110", "text": "**ENTITY EMBEDDINGS FROM BERT**" }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-111", "text": "Given lists of mention context 4 {c i1 , c i2 , ..., c iN } in Wikipedia for every entity e i \u2208 E, we build the entity embeddings map B : E \u2192 R d ." 
}, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-112", "text": "Here, the anchor context" }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-113", "text": "where m ij is the mention, lctx ij is the left context and rctx ij is the right context." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-114", "text": "Context Representation A mention's immediate context is a proxy for its type." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-115", "text": "Here, the mention's immediate context is a sequence of tokens where the mention m ij is replaced with a single [MASK] token." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-116", "text": "Then, we represent the immediate entity context by extracting the upper most layer representation of pre-trained BERT (Devlin et al. 2019) corresponding to the [MASK] token." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-117", "text": "Entity Representation For each entity e i \u2208 E, we randomly sample at most N anchor contexts {c i1 , c i2 , ..., c iN } from Wikipedia." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-118", "text": "Then the entity representation of e i is computed by aggregating all the context representation {c i1 , c i2 , ..., c iN } via average pooling." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-119", "text": "As will be shown in the analysis section, the entity embeddings from BERT better capture entity type information than those from Ganea and Hofmann (2017) ." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-120", "text": "----------------------------------" }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-121", "text": "**BERT-BASED ENTITY SIMILARITY**" }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-122", "text": "The local context model of Ganea and Hofmann (2017) mainly captures the topic level entity relatedness information based on a long range bag-of-words context." 
}, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-123", "text": "To capture latent entity type information, we design a BERT-based entity similarity score \u03a8 BERT (e, c)." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-124", "text": "Specifically, given a short range (the immediate context where the mention m lies) context c = (lctx, m, rctx), we firstly encode c using the same method defined by Equation 7." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-125", "text": "Then we define the BERT-based entity similarity as the cosine similarity 5 between the context representation c and the entity representation B e ." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-126", "text": "Finally, as for the local disambiguation model, we integrate the BERT-based entity similarity \u03a8 BERT (e, c) with the local context score \u03a8 long (e, c) (defined in Equation 2) and the priorp(e|m i ) with two fully connected layers of 100 hidden units and ReLU non-linearities following the same feature composition methods as Ganea and Hofmann (2017) ." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-127", "text": "As for the global disambiguation model, we firstly define the local context score \u03a8 localctx (e, c) by combining \u03a8 long (e, c) and \u03a8 BERT (e, c)." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-128", "text": "6" }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-129", "text": "Then we adopt exactly the same global model as Ganea and Hofmann (2017) which is already introduced in the Background section." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-130", "text": "Specifically, we adopt loopy belief propagation (LBP) to estimate the max-marginal probability\u011d i (e|D) and then combine it with the priorp(e|m i ) using a two-layer neural network to get the final score \u03c1 i (e) for m i ." 
}, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-131", "text": "----------------------------------" }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-132", "text": "**MODEL TRAINING**" }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-133", "text": "We minimize the following max-margin ranking loss:" }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-134", "text": "In order to discourage the model from biasing toward a particular feature, we add a L2 regularization term (\u03bb||\u03b1|| 2 2 ) w.r.t parameters \u03b1 in feature composition function f to the loss function in Equation 16 , where \u03bb is set 10 \u22127 ." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-135", "text": "----------------------------------" }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-136", "text": "**EXPERIMENTS DATASETS**" }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-137", "text": "In order to verify the effectiveness of our model, we conduct experiments on standard benchmark datasets considering both in-domain and out-domain settings." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-138", "text": "For in-domain setting, we use AIDA-CoNLL dataset (Hoffart et al. 2011) for training, validation and testing." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-139", "text": "For out-domain setting, we evaluate the model trained with AIDA-CoNLL on five popular out-domain test sets: MSNBC, AQUAINT, ACE 2004 datasets cleaned and updated by Guo and Barbosa (2016) and WNED-CWEB (CWEB), WNED-WIKI (WIKI) automatically extracted from ClueWeb and Wikipedia (Guo and Barbosa 2016) ." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-140", "text": "Following previous work (Ganea and Hofmann 2017), we only consider in-KB mentions." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-141", "text": "Besides, our candidate generation strategy follows that of Ganea and Hofmann (2017) to make our results comparable." 
}, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-142", "text": "----------------------------------" }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-143", "text": "**SETUP**" }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-144", "text": "The main goal of this work is to introduce a BERT-based entity similarity to capture latent entity type information which is supplementary to existing SOTA local context model (Ganea and Hofmann 2017) ." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-145", "text": "So we evaluate the performance when integrating the BERT-based entity similarity into the local context model of Ganea and Hofmann (2017) ." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-146", "text": "We also evaluate our model with or without global modeling method of Ganea and Hofmann (2017) ." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-147", "text": "In addition, we further compare our methods with other state-of-theart models (Yamada et al. 2016; Le and Titov 2018) ." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-148", "text": "To verify the contribution of our proposed BERT-based entity embeddings, we also compare with a straightforward baseline which directly replaces the encoder of Ganea and Hofmann (2017) utilizing pre-trained BERT." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-149", "text": "To do so, we introduce a 768 \u00d7 300 dimensional matrix W which projects BERT-based context representation c into Ganea and Hofmann (2017)'s entity embeddings space when calculating the similarity score." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-150", "text": "----------------------------------" }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-151", "text": "**HYPER-PARAMETER SETTING**" }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-152", "text": "The resources (word and entity embeddings) used to train the local context model of Ganea and Hofmann (2017) are obtained from DeepED 7 ." 
}, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-153", "text": "For each entity, we randomly sam-7 https://github.com/dalab/deep-ed/ Methods AIDA-B Local models priorp(e|m) 71.9 Lazic et al. (2015) 86.4 Globerson et al. (2016) 87.9 Yamada et al. (2016) 87.2 Ganea and Hofmann (2017) 88.8 Ganea and Hofmann (2017) (reproduce) 88.75 \u00b1 0.30 BERT-Entity-Sim (local)" }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-154", "text": "90.06 \u00b1 0.22 Local & Global models Huang, Heck, and Ji (2015) 86.6 Ganea et al. (2016) 87.6 Chisholm and Hachey (2015) 88.7 Guo and Barbosa (2016) 89.0 Globerson et al. (2016) 91.0 Yamada et al. (2016) 91.5 Ganea and Hofmann (2017) 92.22 \u00b1 0.14 Le and Titov (2018) 93 ." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-155", "text": "ple at most 100 anchor contexts from Wikipedia 8 to build the entity representation from BERT." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-156", "text": "We discard any articles appearing in WIKI dataset when building the entity representation from BERT." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-157", "text": "We take the anchor context as the surrounding sentence where the mention lies and replace the mention with a single [MASK] token." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-158", "text": "Each context is truncated to 128 tokens after WordPiece tokenization." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-159", "text": "We use the PyTorch implementation of pre-trained BERT models 9 and choose the BERT-base-cased version." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-160", "text": "We adopt the Adam (Kingma and Ba 2014) implemented by BERT with \u03b2 1 = 0.9, \u03b2 2 = 0.999." 
}, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-161", "text": "Empirically, we found that it is helpful to set parameters in BERT a small initial learning rate and not BERT related parameters a larger initial learning rate to avoid the whole model biasing toward the BERT feature and disregarding other model components." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-162", "text": "In our experiments, pre-trained BERT model is fine-tuned with initial learning rate 10 \u22125 whereas not BERT related parameters are trained with 10 \u22123 ." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-163", "text": "Similar learning rate usage can be found in the recent work by (Hwang et al. 2019 )." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-164", "text": "Similar to Ganea and Hofmann (2017) , all the entity embeddings are fixed during fine-tuning." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-165", "text": "We randomly initialize the not BERT related parameters using Gaussian distribution N (0.0, 0.02) and the bias term is zeroed." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-166", "text": "Note that all the hyper-parameters used in the local context and global model of Ganea and Hofmann (2017) were set to the same values as theirs for direct comparison purpose." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-167", "text": "Detailed hyper-parameters setting is described in the appendices." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-168", "text": "Our model is trained with 4 NVIDIA Tesla P100 GPUs." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-169", "text": "We run each of our model five times with different random seeds, and the performance is reported in the form of average \u00b1 standard deviation." 
}, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-170", "text": "Table 1 shows the micro F1 scores on in-domain AIDA-B dataset of the SOTA methods and ours, which all use Wikipedia and YAGO mention-entity index." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-171", "text": "The models are divided into two groups: local models and local & global models." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-172", "text": "As we can see, our proposed model, BERT-Entity-Sim, outperforms all previous methods." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-173", "text": "Our local model achieves a 1.31 improvement in terms of F1 over its corresponding baseline (Ganea and Hofmann 2017) , yielding a very competitive local model with an average 90.06 F1 score even surpassing the performance of four local & global models." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-174", "text": "Equipped with the global modeling method of Ganea and Hofmann (2017) , the performance of our model further increase to 93.54 with an average 1.32 improvement in terms of F1 over Ganea and Hofmann (2017) ." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-175", "text": "In addition, our method outperforms Le and Titov (2018) model by 0.47 point." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-176", "text": "The model of Le and Titov (2018) is a multirelational extension of Ganea and Hofmann (2017)'s global modeling method while keeps exactly the same local context model." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-177", "text": "Our better local context model should be orthogonal with them and has potential more applications on short texts (e.g. tweets) where global modeling has little benefits." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-178", "text": "Moreover, BERT+G&Hs embeddings performs significantly worse than the baseline (Ganea and Hofmann 2017) and our proposed BERT-Entity-Sim model." 
}, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-179", "text": "The reason is that BERT-based context representation space and Ganea and Hofmanns entity embeddings space are heterogeneous." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-180", "text": "Ganea and Hofmanns entity embeddings are bootstrapped from word embeddings which mainly capture topic level entity relatedness, while BERT-based context representation is derived from BERT which naturally captures type information." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-181", "text": "The non-parallel information in both context and entity sides makes it difficult to learn the alignment parameter W and results in the poor generalization performance." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-182", "text": "----------------------------------" }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-183", "text": "**RESULTS**" }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-184", "text": "To evaluate the robustness of our model, Table 2 shows the performance of our method and SOTA methods on five out-domain test sets." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-185", "text": "On average, our proposed model (BERT-Entity-Sim) outperforms the local & global version of Ganea and Hofmann; Le and Titov (2017; by an average 0.80 and 0.51 on F1." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-186", "text": "----------------------------------" }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-187", "text": "**ANALYSIS**" }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-188", "text": "We conduct experiment analysis to answer the following questions:" }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-189", "text": "\u2022 Do the entity embeddings from BERT better capture latent entity type information than that of Ganea and Hofmann (2017) ?" 
}, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-190", "text": "\u2022 Does the proposed model correct the type errors in the baseline (Ganea and Hofmann 2017)?" }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-191", "text": "\u2022 Can straightforward integration of state-of-the-art fine grained entity typing systems improve entity linking performance?" }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-192", "text": "\u2022 Can better global model further boost the performance of the proposed model?" }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-193", "text": "Effectiveness of BERT-based and Ganea & Hofmann (2017) Ganea and Hofmann (2017) 93.7 \u00b1 0.1 88.5 \u00b1 0.4 88.5 \u00b1 0.3 77.9 \u00b1 0.1 77.5 \u00b1 0.1 85.22 Le and Titov (2018) 93.9 \u00b1 0.2 88.3 \u00b1 0.6 89.9 \u00b1 0.8 77." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-194", "text": "Table 5 : Performance of two state-of-the-art fine grained entity typing systems on AIDA-CoNLL development set order to verify our claim that the entity embeddings from BERT better capture entity type information than those from Ganea and Hofmann (2017) , we carry out an entity type prediction task based on its entity embedding." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-195", "text": "Specifically, we randomly sample 100K entities from Wikipedia, and randomly split them into training set (80K), development set (10K) and test set (10K)." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-196", "text": "For each entity, we obtain its entity types from three typing systems: FIGER (Ling and Weld 2012), BBN (Weischedel and Brunstein 2005) and OntoNotes fine (Gillick et al. 2014 ) via the entity type mapping provided by Zhou et al. (2018) ." 
}, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-197", "text": "The entity type prediction model is a simple linear classification model 10 using the entity embedding of an entity as features; limiting its capacity enables us to focus on whether type information can be easily extracted from the entity embeddings." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-198", "text": "We evaluate the model using standard entity typing metrics: Strict Accuracy (Acc.), Micro F1 (F 1 mi ) and Macro F1 (F 1 ma )." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-199", "text": "As shown in Table 3 , our proposed entity embedding from BERT significantly outperforms the entity embedding proposed by Ganea and Hofmann (2017) on three typing sys-tems FIGER, BBN and OntoNotes fine ." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-200", "text": "Specifically, our method improves over the baseline with an absolute 8.31, 10.43 and 9.33 F 1 mi point than the baseline on three typing systems respectively." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-201", "text": "This demonstrates that our proposed entity embeddings from BERT indeed capture better latent entity type information than Ganea and Hofmann (2017) ." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-202", "text": "Type Errors Correction As we have mentioned in the introduction section, more than half of the baseline model's errors on the AIDA-A dataset are type errors." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-203", "text": "Type errors are error cases 11 where (1) the predicted entity's type is different from the golden entity's type; (2) contextual cue predictive of the type of the mention exists; (3) errors are not due to annotation errors." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-204", "text": "By doing so, we collect 185 type error cases which cover 57.45% of all (322) error cases." 
}, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-205", "text": "This indicates that Ganea and Hofmann (2017) produces many type errors due to its inability to consider the entity type information in mention context." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-206", "text": "By integrating the BERTbased entity similarity, our proposed model can correct 124 out of 185 (67.03%) type error cases of the baseline model which demonstrates that we correct more than two third of the type errors produced by the baseline." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-207", "text": "We have further examined and categorized the remaining 61 type error cases into three categories: (i) Due to prior: golden entities with very lowp(e|m i ) prior, (ii) Due to global: both the local context score and prior score support predicting the golden entity, but the overall score supports predicting incorrect entity due to global modeling, (iii) Due to local context: the local context score misleads the model predicting the wrong entity, this is potentially due to the mention context can be misleading, e.g. a document discussing cricket will favor resolving the mention \"Australian\" in context \"impressed by the positive influence of Australian coach Dave Gilbert\" to Methods AIDA-B MSNBC AQUAINT ACE2004 CWEB WIKI Avg Ganea and Hofmann (2017) 92.22 \u00b1 0.14 93.7 \u00b1 0.1 88.5 \u00b1 0.4 88.5 \u00b1 0.3 77.9 \u00b1 0.1 77.5 \u00b1 0.1 85.22 Le and Titov (2018) 93.07 \u00b1 0.27 93.9 \u00b1 0.2 88.3 \u00b1 0.6 89.9 \u00b1 0.8 77.5 \u00b1 0.1 78.0 \u00b1 0.1 85.51 Yang et al. (2019) 94 ." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-208", "text": "the entity AUSTRALIA NATIONAL CRICKET TEAM instead of the gold entity AUSTRALIA." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-209", "text": "As shown in Table 4 , 67.21% of the remaining type error cases are due to prior problem which are hard to solve in the current feature combination framework." 
}, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-210", "text": "We argue that prior should be considered as the final resort, only relying on it when the model can not make decision based on other features." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-211", "text": "Besides, there are 22.95% remaining type errors which are due to global modeling problem which shows the limitation of the global modeling method of Ganea and Hofmann (2017) ." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-212", "text": "Finally, 9.84% type error cases are due to local context problem that our BERT-based solution cannot address." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-213", "text": "We leave this to future work." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-214", "text": "----------------------------------" }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-215", "text": "**INCORPORATING EXPLICIT ENTITY TYPES**" }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-216", "text": "We have shown that our BERT-based local context model which implicitly captures entity type information and is effective in correcting two third of type error cases." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-217", "text": "It is nature to conjecture that we can also correct type errors by incorporating explicit type information into Ganea and Hofmann (2017) ." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-218", "text": "We investigate this approach in this section." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-219", "text": "Assuming that we have types for each mention and candidate entity, we calculate the Jaccard similarity between them and use it as a feature for local disambiguation model." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-220", "text": "where T m and T e are the type sets of the mention m and candidate entity e respectively." 
}, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-221", "text": "The new local context score function considering explicit type information is defined as:" }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-222", "text": "\u03a8 localctx (e, m, c) = f \u03a8 long (e, c), JaccardSim(e, m, c)" }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-223", "text": "We consider both Oracle setting and Predict setting." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-224", "text": "In the oracle setting, the mention's types are set as the golden entity's types." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-225", "text": "12 As for the entity's types, we use two sources: one is the ultra-fine type sets from Choi et al. (2018) consisting of more than 10,000 ultra-fine grained types; the other one is the FIGER type sets (Ling and Weld 2012) consisting of 112 fine grained types." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-226", "text": "In the predict setting, we use two state-of-the-art fine grained entity typing systems: 1) Ultrafine (Choi et al. 2018) which predicts types in ultra-fine type sets; 2) ZOE (Zhou et al. 2018 ) which can predict types in FIGER type sets." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-227", "text": "As we can see from both Table 1 and Table 2 , in the oracle setting, the best model outperforms all the state-of-theart entity linking models by a large margin, even surpass Le and Titov (2018) by 3.28 F1 points on AIDA-CoNLL test set." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-228", "text": "This result shows that a better type prediction system can further improve upon the state-of-the-state entity linking systems." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-229", "text": "However, in the predict setting, the type injection models have worse performance than the baseline." 
}, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-230", "text": "The degradation might be attributed to the poor performance of the two state-of-the-art fine grained entity typing systems." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-231", "text": "To verify this, we measure the performance of the two typing systems on AIDA-CoNLL development set." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-232", "text": "13 As shown in Table 5 , the ultra-fine entity typing system (Choi et al. 2018) only achieves 26.52% F 1 mi score while the ZOE system (Zhou et al. 2018 ) achieves 66.12% F 1 mi score 14 which are insufficient to improve state-of-the-art entity linking system with more than 92% F1 score." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-233", "text": "Better Global Model In order to investigate whether better global model can further boost the performance of our model, we incorporate the recent proposed Dynamic Context Augmentation (DCA) 15 (Yang et al. 2019) ." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-234", "text": "DCA is a global entity linking model featuring better efficiency and effectiveness than that of Ganea and Hofmann (2017) by breaking the \"all-mention coherence\" assumption." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-235", "text": "Compared to BERT-Entity-Sim equipped with Ganea and Hofmann (2017)'s global model in Table 8 Ganea and Hofmann (2017) )." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-236", "text": "We found that Yang et al. (2019) includes an explicit type similarity which is based on a typing system 16 trained with AIDA-train NER annotation." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-237", "text": "This explicit type similarity feature is tailored for AIDA-CoNLL data set and doesn't achieve good generalization performance on out-domain test sets." 
}, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-238", "text": "In contrast, our BERT-Entity-Sim model capturing latent type information has potential better generalization performance with an average 2.10 F1 improvement over them." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-239", "text": "----------------------------------" }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-240", "text": "**CASE STUDY**" }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-241", "text": "We demonstrate the effectiveness of our proposed model by retrieving the nearest neighbours in both the context representation space and entity representation space." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-242", "text": "----------------------------------" }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-243", "text": "**NEAREST CONTEXTS**" }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-244", "text": "We follow Papernot and Mc-Daniel (2018) Ganea and Hofmann (2017) and BERT based entity representation space neighbour in the context representation space." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-245", "text": "For our model, we use the context representation c defined by Equation 9." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-246", "text": "For the baseline model, we use the attentionbased context representation h(c) defined in Equation 2." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-247", "text": "This can reveal which training instances support the prediction of a model." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-248", "text": "As shown in Table 7 , for the example in Figure 1 , the most similar contexts retrieved by our model's context representation are all with preposition \"In\" ahead of the mention and the golden entities of them are all American cities." 
}, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-249", "text": "In contrast, the baseline's local context is a bag-of-words representation which we denote using the top 10 attended contextual words sorted by attention weights." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-250", "text": "The most similar contexts retrieved by baseline's context representation share common words like \"games\", \"victory\" and the golden entities of them are all baseball teams." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-251", "text": "This explains why the baseline model incorrectly links the mention \"Milwaukee\" to MILWAUKEE BREWERS while our model can link to the correct entity MILWAUKEE." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-252", "text": "----------------------------------" }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-253", "text": "**NEAREST ENTITIES**" }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-254", "text": "We also retrieve nearest entities in the embedding space of Ganea and Hofmann (2017) and ours." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-255", "text": "As we can see, we query STEVE JOBS, the nearest entity in Ganea and Hofmann (2017) is APPLE INC." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-256", "text": "which is a different type." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-257", "text": "In contrast, all the entities retrieved by our approach share the same types like person, entrepreneur etc." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-258", "text": "Another example is when we query NA-TIONAL BASKETBALL ASSOCIATION, the most similar entities in Ganea and Hofmann (2017) are NBA teams which are topically related, while the entities retrieved by our approach are all basketball leagues." 
}, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-259", "text": "----------------------------------" }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-260", "text": "**CONCLUSION**" }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-261", "text": "In this paper, we propose to improve entity linking by capturing latent entity type information with BERT." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-262", "text": "Firstly, we build entity embeddings from BERT by averaging all the context representation extracted from pre-trained BERT." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-263", "text": "Then we integrate a BERT-based entity similarity into the local model of the state-of-the-art method by (Ganea and Hofmann 2017) ." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-264", "text": "The experiment results show that our model significantly outperforms the baseline with an absolute improvement of 1.32% F1 on in-domain AIDA-CoNLL test set and average 0.80% F1 on five out-domain test datasets." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-265", "text": "The detailed experiment analysis shows that our method corrects most of the type errors produced by the baseline." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-266", "text": "In the future, we would like to design global modeling methods which can take advantage of the BERT architecture and investigate other ways to use the prior feature." 
}, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-267", "text": "----------------------------------" }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-268", "text": "**APPENDICES CLASSIFICATION MODEL OF ENTITY PREDICTION TASK**" }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-269", "text": "Given an entity e, we firstly retrieve its entity embedding e, then compute the probability for each type in the typeset T :" }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-270", "text": "where \u03c3 is the sigmoid function, w j and b are respectively the weight and bias parameter." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-271", "text": "For each entity e, it is labeled with t e , a binary vector of all types where t e j = 1 if the j th type is in the set of gold types of e and 0 otherwise." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-272", "text": "We optimize a multi-label binary cross entropy objective:" }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-273", "text": "L type = \u2212 j t e j log p e j + (1 \u2212 t e j ) log(1 \u2212 p e j )" }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-274", "text": "We optimize the model with Adam with an initial learning rate of 1e-3." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-275", "text": "Each model is trained for up to 200 epoches and training stops when the performance on the development set does not improve for 6 consecutive epoches." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-276", "text": "Detailed Hyper-parameters Setting (" }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-16", "text": "Such model design allows to learn useful regularities in an end-to-end fashion and eliminates the need for extensive feature engineering." 
}, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-17", "text": "It also substantially outperforms In Milwaukee , Marc Newfield homered off Jose Parra leading off the bottom of the 12th as the Brewers rallied for a 5-4 victory over the Minnesota Twins ." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-18", "text": "----------------------------------" }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-19", "text": "**WIKIPEDIA TITLE**" }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-20", "text": "Local context score Golden Figure 1 : One error case on AIDA-CoNLL development set of the full model of Ganea and Hofmann (2017) ." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-21", "text": "(a) Immediate context; (b) Attended contextual words sorted by attention weights." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-22", "text": "The preposition \"In\" is a strong cue predictive of the type of mention \"Milwaukee\" which is not captured by the local context model." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-23", "text": "the traditional methods on standard benchmark (e.g., AIDA-CoNLL) ." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-24", "text": "A line of follow-up work (Le and Titov 2018; 2019a; 2019b) investigate potential improvement solution or other task settings based on that." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-25", "text": "Such state-of-the-art entity linking models (Ganea and Hofmann 2017; Le and Titov 2018) employ attention-based bag-of-words context model and pre-trained entity embeddings bootstrapped from word embeddings to assess topic level context compatibility." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-26", "text": "However, the latent entity type information in the immediate context of the mention is neglected." 
}, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-27", "text": "We suspect this may sometimes cause the models link mentions to incorrect entities with incorrect type." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-28", "text": "To verify this, we conduct error analysis of the well known DeepED 1 model (Ganea and Hofmann 2017) on the development set of AIDA-CoNLL (Hoffart et al. 2011) , and found that more than half of their error cases fall into the category of type errors where the predicted entity's type is different from the golden entity's type, although some predictive contextual cue for them can be found in their local context." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-29", "text": "As shown in Fig. 1 , the full model of Ganea and Hofmann (2017) incorrectly links the mention \"Milwaukee\" to the entity MILWAUKEE BREWERS." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-30", "text": "However, the prepo-sition \"In\" is a strong cue predictive of the type (location) of mention \"Milwaukee\" which is helpful for disambiguation." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-31", "text": "The reason why the local context model of Ganea and Hofmann (2017) couldn't capture such apparent cue is two folds." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-32", "text": "On one hand, the context encoding module adopts a bag-of-words encoding scheme which is position agnostic." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-33", "text": "As shown in Fig. 1(b) , the attention mechanism is helpful for selecting predictive words (e.g. \"Milwaukee\", \"games\" etc.), but does not capture the pattern that the previous word \"In\" of the mention \"Milwaukee\" which very likely refers to an entity with location type." }, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-34", "text": "On the other hand, the pre-trained entity embedding of Ganea and Hofmann (2017) is not very sensitive to entity types." 
}, { "sent_id": "eb5ef34dd9c3845cd27c33242d5316-C001-35", "text": "For example, as shown in Table 8 , when we query the most similar entities with the entity STEVE JOBS, the top one returned entity is APPLE INC., which is a different type but releated at topic level." } ], "y": { "@BACK@": { "gold_contexts": [ [ "eb5ef34dd9c3845cd27c33242d5316-C001-15" ], [ "eb5ef34dd9c3845cd27c33242d5316-C001-20" ], [ "eb5ef34dd9c3845cd27c33242d5316-C001-25" ], [ "eb5ef34dd9c3845cd27c33242d5316-C001-31" ], [ "eb5ef34dd9c3845cd27c33242d5316-C001-34" ], [ "eb5ef34dd9c3845cd27c33242d5316-C001-36" ], [ "eb5ef34dd9c3845cd27c33242d5316-C001-71" ], [ "eb5ef34dd9c3845cd27c33242d5316-C001-92" ], [ "eb5ef34dd9c3845cd27c33242d5316-C001-122" ], [ "eb5ef34dd9c3845cd27c33242d5316-C001-178", "eb5ef34dd9c3845cd27c33242d5316-C001-179", "eb5ef34dd9c3845cd27c33242d5316-C001-180" ] ], "cite_sentences": [ "eb5ef34dd9c3845cd27c33242d5316-C001-15", "eb5ef34dd9c3845cd27c33242d5316-C001-20", "eb5ef34dd9c3845cd27c33242d5316-C001-25", "eb5ef34dd9c3845cd27c33242d5316-C001-31", "eb5ef34dd9c3845cd27c33242d5316-C001-34", "eb5ef34dd9c3845cd27c33242d5316-C001-36", "eb5ef34dd9c3845cd27c33242d5316-C001-71", "eb5ef34dd9c3845cd27c33242d5316-C001-92", "eb5ef34dd9c3845cd27c33242d5316-C001-122", "eb5ef34dd9c3845cd27c33242d5316-C001-178", "eb5ef34dd9c3845cd27c33242d5316-C001-179", "eb5ef34dd9c3845cd27c33242d5316-C001-180" ] }, "@MOT@": { "gold_contexts": [ [ "eb5ef34dd9c3845cd27c33242d5316-C001-25", "eb5ef34dd9c3845cd27c33242d5316-C001-26", "eb5ef34dd9c3845cd27c33242d5316-C001-27", "eb5ef34dd9c3845cd27c33242d5316-C001-28" ], [ "eb5ef34dd9c3845cd27c33242d5316-C001-144" ] ], "cite_sentences": [ "eb5ef34dd9c3845cd27c33242d5316-C001-25", "eb5ef34dd9c3845cd27c33242d5316-C001-28", "eb5ef34dd9c3845cd27c33242d5316-C001-144" ] }, "@USE@": { "gold_contexts": [ [ "eb5ef34dd9c3845cd27c33242d5316-C001-28", "eb5ef34dd9c3845cd27c33242d5316-C001-29" ], [ "eb5ef34dd9c3845cd27c33242d5316-C001-45" ], [ 
"eb5ef34dd9c3845cd27c33242d5316-C001-53" ], [ "eb5ef34dd9c3845cd27c33242d5316-C001-62" ], [ "eb5ef34dd9c3845cd27c33242d5316-C001-67" ], [ "eb5ef34dd9c3845cd27c33242d5316-C001-126" ], [ "eb5ef34dd9c3845cd27c33242d5316-C001-129" ], [ "eb5ef34dd9c3845cd27c33242d5316-C001-130" ], [ "eb5ef34dd9c3845cd27c33242d5316-C001-140" ], [ "eb5ef34dd9c3845cd27c33242d5316-C001-141" ], [ "eb5ef34dd9c3845cd27c33242d5316-C001-145", "eb5ef34dd9c3845cd27c33242d5316-C001-146" ], [ "eb5ef34dd9c3845cd27c33242d5316-C001-148", "eb5ef34dd9c3845cd27c33242d5316-C001-149" ], [ "eb5ef34dd9c3845cd27c33242d5316-C001-152" ], [ "eb5ef34dd9c3845cd27c33242d5316-C001-166" ], [ "eb5ef34dd9c3845cd27c33242d5316-C001-205" ], [ "eb5ef34dd9c3845cd27c33242d5316-C001-211" ], [ "eb5ef34dd9c3845cd27c33242d5316-C001-244" ], [ "eb5ef34dd9c3845cd27c33242d5316-C001-254", "eb5ef34dd9c3845cd27c33242d5316-C001-255" ], [ "eb5ef34dd9c3845cd27c33242d5316-C001-263" ] ], "cite_sentences": [ "eb5ef34dd9c3845cd27c33242d5316-C001-28", "eb5ef34dd9c3845cd27c33242d5316-C001-29", "eb5ef34dd9c3845cd27c33242d5316-C001-45", "eb5ef34dd9c3845cd27c33242d5316-C001-53", "eb5ef34dd9c3845cd27c33242d5316-C001-62", "eb5ef34dd9c3845cd27c33242d5316-C001-67", "eb5ef34dd9c3845cd27c33242d5316-C001-126", "eb5ef34dd9c3845cd27c33242d5316-C001-129", "eb5ef34dd9c3845cd27c33242d5316-C001-130", "eb5ef34dd9c3845cd27c33242d5316-C001-140", "eb5ef34dd9c3845cd27c33242d5316-C001-141", "eb5ef34dd9c3845cd27c33242d5316-C001-145", "eb5ef34dd9c3845cd27c33242d5316-C001-146", "eb5ef34dd9c3845cd27c33242d5316-C001-148", "eb5ef34dd9c3845cd27c33242d5316-C001-149", "eb5ef34dd9c3845cd27c33242d5316-C001-152", "eb5ef34dd9c3845cd27c33242d5316-C001-166", "eb5ef34dd9c3845cd27c33242d5316-C001-205", "eb5ef34dd9c3845cd27c33242d5316-C001-211", "eb5ef34dd9c3845cd27c33242d5316-C001-244", "eb5ef34dd9c3845cd27c33242d5316-C001-254", "eb5ef34dd9c3845cd27c33242d5316-C001-255", "eb5ef34dd9c3845cd27c33242d5316-C001-263" ] }, "@DIF@": { "gold_contexts": [ [ 
"eb5ef34dd9c3845cd27c33242d5316-C001-49" ], [ "eb5ef34dd9c3845cd27c33242d5316-C001-92", "eb5ef34dd9c3845cd27c33242d5316-C001-93", "eb5ef34dd9c3845cd27c33242d5316-C001-94" ], [ "eb5ef34dd9c3845cd27c33242d5316-C001-119" ], [ "eb5ef34dd9c3845cd27c33242d5316-C001-173" ], [ "eb5ef34dd9c3845cd27c33242d5316-C001-174" ], [ "eb5ef34dd9c3845cd27c33242d5316-C001-185" ], [ "eb5ef34dd9c3845cd27c33242d5316-C001-194" ], [ "eb5ef34dd9c3845cd27c33242d5316-C001-199" ], [ "eb5ef34dd9c3845cd27c33242d5316-C001-201" ], [ "eb5ef34dd9c3845cd27c33242d5316-C001-234" ], [ "eb5ef34dd9c3845cd27c33242d5316-C001-258" ] ], "cite_sentences": [ "eb5ef34dd9c3845cd27c33242d5316-C001-49", "eb5ef34dd9c3845cd27c33242d5316-C001-92", "eb5ef34dd9c3845cd27c33242d5316-C001-119", "eb5ef34dd9c3845cd27c33242d5316-C001-173", "eb5ef34dd9c3845cd27c33242d5316-C001-174", "eb5ef34dd9c3845cd27c33242d5316-C001-185", "eb5ef34dd9c3845cd27c33242d5316-C001-194", "eb5ef34dd9c3845cd27c33242d5316-C001-199", "eb5ef34dd9c3845cd27c33242d5316-C001-201", "eb5ef34dd9c3845cd27c33242d5316-C001-234", "eb5ef34dd9c3845cd27c33242d5316-C001-258" ] }, "@SIM@": { "gold_contexts": [ [ "eb5ef34dd9c3845cd27c33242d5316-C001-164" ], [ "eb5ef34dd9c3845cd27c33242d5316-C001-178" ] ], "cite_sentences": [ "eb5ef34dd9c3845cd27c33242d5316-C001-164", "eb5ef34dd9c3845cd27c33242d5316-C001-178" ] }, "@EXT@": { "gold_contexts": [ [ "eb5ef34dd9c3845cd27c33242d5316-C001-176" ], [ "eb5ef34dd9c3845cd27c33242d5316-C001-217" ] ], "cite_sentences": [ "eb5ef34dd9c3845cd27c33242d5316-C001-176", "eb5ef34dd9c3845cd27c33242d5316-C001-217" ] }, "@UNSURE@": { "gold_contexts": [ [ "eb5ef34dd9c3845cd27c33242d5316-C001-189", "eb5ef34dd9c3845cd27c33242d5316-C001-190" ], [ "eb5ef34dd9c3845cd27c33242d5316-C001-235" ] ], "cite_sentences": [ "eb5ef34dd9c3845cd27c33242d5316-C001-189", "eb5ef34dd9c3845cd27c33242d5316-C001-190", "eb5ef34dd9c3845cd27c33242d5316-C001-235" ] } } }, "ABC_d91913d0c8e669153e8d477e19aef2_6": { "x": [ { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-1", 
"text": "**ABSTRACT**" }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-2", "text": "Unsupervised machine translation-i.e., not assuming any cross-lingual supervision signal, whether a dictionary, translations, or comparable corpora-seems impossible, but nevertheless, Lample et al. (2018a) recently proposed a fully unsupervised machine translation (MT) model." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-3", "text": "The model relies heavily on an adversarial, unsupervised alignment of word embedding spaces for bilingual dictionary induction (Conneau et al., 2018), which we examine here." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-4", "text": "Our results identify the limitations of current unsupervised MT: unsupervised bilingual dictionary induction performs much worse on morphologically rich languages that are not dependent marking, when monolingual corpora from different domains or different embedding algorithms are used." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-5", "text": "We show that a simple trick, exploiting a weak supervision signal from identical words, enables more robust induction, and establish a near-perfect correlation between unsupervised bilingual dictionary induction performance and a previously unexplored graph similarity metric." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-6", "text": "----------------------------------" }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-8", "text": "Cross-lingual word representations enable us to reason about word meaning in multilingual contexts and facilitate cross-lingual transfer (Ruder et al., 2018) ." 
}, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-9", "text": "Early cross-lingual word embedding models relied on large amounts of parallel data (Klementiev et al., 2012; Mikolov et al., 2013a) , but more recent approaches have tried to minimize the amount of supervision necessary Levy et al., 2017; Artetxe et al., 2017) ." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-10", "text": "Some researchers have even presented unsupervised methods that do not rely on any form of cross-lingual supervision at all (Barone, 2016; Conneau et al., 2018; Zhang et al., 2017) ." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-11", "text": "Unsupervised cross-lingual word embeddings hold promise to induce bilingual lexicons and machine translation models in the absence of dictionaries and translations (Barone, 2016; Zhang et al., 2017; Lample et al., 2018a) , and would therefore be a major step toward machine translation to, from, or even between low-resource languages." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-12", "text": "Unsupervised approaches to learning crosslingual word embeddings are based on the assumption that monolingual word embedding graphs are approximately isomorphic, that is, after removing a small set of vertices (words) (Mikolov et al., 2013b; Barone, 2016; Zhang et al., 2017; Conneau et al., 2018) ." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-13", "text": "In the words of Barone (2016):" }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-14", "text": ". . . we hypothesize that, if languages are used to convey thematically similar information in similar contexts, these random processes should be approximately isomorphic between languages, and that this isomorphism can be learned from the statistics of the realizations of these processes, the monolingual corpora, in principle without any form of explicit alignment." 
}, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-15", "text": "Our results indicate this assumption is not true in general, and that approaches based on this assumption have important limitations." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-16", "text": "Contributions We focus on the recent stateof-the-art unsupervised model of Conneau et al. (2018) ." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-17", "text": "1 Our contributions are: (a) In \u00a72, we show that the monolingual word embeddings used in Conneau et al. (2018) are not approximately isomorphic, using the VF2 algorithm (Cordella et al., 2001 ) and we therefore introduce a metric for quantifying the similarity of word embeddings, based on Laplacian eigenvalues." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-37", "text": "Note that (approximately) isospectral graphs need not be (approximately) isomorphic, but (approximately) isomorphic graphs are always (approximately) isospectral (Gordon et al., 1992) ." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-18", "text": "(b) In \u00a73, we identify circumstances under which the unsupervised bilingual dictionary induction (BDI) algorithm proposed in Conneau et al. (2018) does not lead to good performance." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-19", "text": "(c) We show that a simple trick, exploiting a weak supervision signal from words that are identical across languages, makes the algorithm much more robust." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-20", "text": "Our main finding is that the performance of unsupervised BDI depends heavily on all three factors: the language pair, the comparability of the monolingual corpora, and the parameters of the word embedding algorithms." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-21", "text": "2 How similar are embeddings across languages?" 
}, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-22", "text": "As mentioned, recent work focused on unsupervised BDI assumes that monolingual word embedding spaces (or at least the subgraphs formed by the most frequent words) are approximately isomorphic." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-23", "text": "In this section, we show, by investigating the nearest neighbor graphs of word embedding spaces, that word embeddings are far from isomorphic." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-24", "text": "We therefore introduce a method for computing the similarity of non-isomorphic graphs." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-25", "text": "In \u00a74.7, we correlate our similarity metric with performance on unsupervised BDI." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-26", "text": "Isomorphism To motivate our study, we first establish that word embeddings are far from graph isomorphic 2 -even for two closely re-2 Two graphs that contain the same number of graph vertices connected in the same way are said to be isomorphic." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-27", "text": "In the context of weighted graphs such as word embeddings, a lated languages, English and German, and using embeddings induced from comparable corpora (Wikipedia) with the same hyper-parameters." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-28", "text": "If we take the top k most frequent words in English, and the top k most frequent words in German, and build nearest neighbor graphs for English and German using the monolingual word embeddings used in Conneau et al. (2018) , the graphs are of course very different." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-29", "text": "This is, among other things, due to German case and the fact that the translates into der, die, and das, but unsupervised alignment does not have access to this kind of information." 
}, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-30", "text": "Even if we consider the top k most frequent English words and their translations into German, the nearest neighbor graphs are not isomorphic." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-31", "text": "Figure 1a-b shows the nearest neighbor graphs of the top 10 most frequent English words on Wikipedia, and their German translations." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-32", "text": "Word embeddings are particularly good at capturing relations between nouns, but even if we consider the top k most frequent English nouns and their translations, the graphs are not isomorphic; see Figure 1c -d." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-33", "text": "We take this as evidence that word embeddings are not approximately isomorphic across languages." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-34", "text": "We also ran graph isomorphism checks on 10 random samples of frequent English nouns and their translations into Spanish, and only in 1/10 of the samples were the corresponding nearest neighbor graphs isomorphic." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-35", "text": "Eigenvector similarity Since the nearest neighbor graphs are not isomorphic, even for frequent translation pairs in neighboring languages, we want to quantify the potential for unsupervised BDI using a metric that captures varying degrees of graph similarity." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-36", "text": "Eigenvalues are compact representations of global properties of graphs, and we introduce a spectral metric based on Laplacian eigenvalues (Shigehalli and Shettar, 2011 ) that quantifies the extent to which the nearest neighbor graphs are isospectral." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-38", "text": "Let A 1 and A 2 be the adjacency matrices of the nearest neighbor graphs G 1 and G 2 of our two word embeddings, respectively." 
}, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-39", "text": "Let L 1 = D 1 \u2212 A 1 and L 2 = D 2 \u2212 A 2 be the Laplacians of the nearest neighbor graphs, where D 1 and D 2 are the corresponding diagonal matrices of degrees." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-40", "text": "We now weak version of this is to require that the underlying nearest neighbor graphs for the most frequent k words are isomorphic." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-41", "text": "compute the eigensimilarity of the Laplacians of the nearest neighbor graphs, L 1 and L 2 ." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-42", "text": "For each graph, we find the smallest k such that the sum of the k largest Laplacian eigenvalues is <90% of the Laplacian eigenvalues." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-43", "text": "We take the smallest k of the two, and use the sum of the squared differences between the largest k Laplacian eigenvalues \u2206 as our similarity metric." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-44", "text": "where k is chosen s.t." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-45", "text": "Note that \u2206 = 0 means the graphs are isospectral, and the metric goes to infinite." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-46", "text": "Thus, the higher \u2206 is, the less similar the graphs (i.e., their Laplacian spectra)." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-47", "text": "We discuss the correlation between unsupervised BDI performance and approximate isospectrality or eigenvector similarity in \u00a74.7." 
}, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-48", "text": "3 Unsupervised cross-lingual learning" }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-49", "text": "----------------------------------" }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-50", "text": "**LEARNING SCENARIOS**" }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-51", "text": "Unsupervised neural machine translation relies on BDI using cross-lingual embeddings (Lample et al., 2018a; Artetxe et al., 2018) , which in turn relies on the assumption that word embedding graphs are approximately isomorphic." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-52", "text": "The work of Conneau et al. (2018) , which we focus on here, also makes several implicit assumptions that may or may not be necessary to achieve such isomorphism, and which may or may not scale to low-resource languages." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-53", "text": "The algorithms are not intended to be limited to learning scenarios where these assumptions hold, but since they do in the reported experiments, it is important to see to what extent these assumptions are necessary for the algorithms to produce useful embeddings or dictionaries." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-54", "text": "We focus on the work of Conneau et al. (2018) , who present a fully unsupervised approach to aligning monolingual word embeddings, induced using fastText (Bojanowski et al., 2017) ." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-55", "text": "We describe the learning algorithm in \u00a73.2." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-56", "text": "Conneau et al. (2018) consider a specific set of learning scenarios:" }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-57", "text": "(a) The authors work with the following languages: English-{French, German, Chinese, Russian, Spanish}. These languages, except French, are dependent marking (Table 1) ." 
}, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-58", "text": "3 We evaluate Conneau et al. (2018) on (English to) Estonian (ET), Finnish (FI), Greek (EL), Hungarian (HU), Polish (PL), and Turkish (TR) in \u00a74.2, to test whether the selection of languages in the original study introduces a bias." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-101", "text": "Only words with more than 5 occurrences are retained for training." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-59", "text": "(b) The monolingual corpora in their experiments are comparable; Wikipedia corpora are used, except for an experiment in which they include Google Gigawords." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-60", "text": "We evaluate across different domains, i.e., on all combinations of Wikipedia, EuroParl, and the EMEA medical corpus, in \u00a74.3." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-61", "text": "We believe such scenarios are more realistic for low-resource languages." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-62", "text": "(c) The monolingual embedding models are induced using the same algorithms with the same hyper-parameters." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-63", "text": "We evaluate Conneau et al. (2018) on pairs of embeddings induced with different hyper-parameters in \u00a74.4." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-64", "text": "While keeping hyper-parameters fixed is always possible, it is of practical interest to know whether the unsupervised methods work on any set of pre-trained word embeddings." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-65", "text": "We also investigate the sensitivity of unsupervised BDI to the dimensionality of the monolingual word embeddings in \u00a74.5." 
}, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-66", "text": "The motivation for this is that dimensionality reduction will alter the geometric shape and remove characteristics of the embedding graphs that are important for alignment; but on the other hand, lower dimensionality introduces regularization, which will make the graphs more similar." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-67", "text": "Finally, in \u00a74.6, we investigate the impact of different types of query test words on performance, including how performance varies across part-of-speech word classes and on shared vocabulary items." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-68", "text": "----------------------------------" }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-69", "text": "**SUMMARY OF CONNEAU ET AL. (2018)**" }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-70", "text": "We now introduce the method of Conneau et al. (2018) ." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-71", "text": "4 The approach builds on existing work on learning a mapping between monolingual word embeddings (Mikolov et al., 2013b; Xing et al., 2015) and consists of the following steps: 1) Monolingual word embeddings: An off-the-shelf word embedding algorithm (Bojanowski et al., 2017 ) is used to learn source and target language spaces X and Y ." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-72", "text": "2) Adversarial mapping: A translation matrix W is learned between the spaces X and Y using adversarial techniques (Ganin et al., 2016) ." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-73", "text": "A discriminator is trained to discriminate samples from the translated source space W X from the target space Y , while W is trained to prevent this." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-74", "text": "This, again, is motivated by the assumption that source and target language word embeddings are approximately isomorphic." 
}, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-75", "text": "3) Refinement (Procrustes analysis): W is used to build a small bilingual dictionary of frequent words, which is pruned such that only bidirectional translations are kept ." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-76", "text": "A new translation matrix W that translates between the spaces X and Y of these frequent word pairs is then induced by solving the Orthogonal Procrustes problem:" }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-77", "text": "This step can be used iteratively by using the new matrix W to create new seed translation pairs." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-78", "text": "It requires frequent words to serve as reliable anchors for learning a translation matrix." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-79", "text": "In the experiments in Conneau et al. (2018) , as well as in ours, the iterative Procrustes refinement improves performance across the board." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-167", "text": "**IMPACT OF DIMENSIONALITY**" }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-80", "text": "4) Cross-domain similarity local scaling (CSLS) is used to expand high-density areas and condense low-density ones, for more accurate nearest neighbor calculation, CSLS reduces the hubness problem in high-dimensional spaces (Radovanovi\u0107 et al., 2010; Dinu et al., 2015) ." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-81", "text": "It relies on the mean similarity of a source language embedding x to its K target language nearest neighbours (K = 10 suggested) nn 1 , . . . , nn K :" }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-82", "text": "where cos is the cosine similarity." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-83", "text": "mnn S (y) is defined in an analogous manner for any target language embedding y. 
CSLS(x, y) is then calculated as follows:" }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-84", "text": "3.3 A simple supervised method" }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-85", "text": "Instead of learning cross-lingual embeddings completely without supervision, we can extract inexpensive supervision signals by harvesting identically spelled words as in, e.g. (Artetxe et al., 2017; Smith et al., 2017) ." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-86", "text": "Specifically, we use identically spelled words that occur in the vocabularies of both languages as bilingual seeds, without employing any additional transliteration or lemmatization/normalization methods." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-87", "text": "Using this seed dictionary, we then run the refinement step using Procrustes analysis of Conneau et al. (2018) ." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-88", "text": "----------------------------------" }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-89", "text": "**EXPERIMENTS**" }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-90", "text": "In the following experiments, we investigate the robustness of unsupervised cross-lingual word embedding learning, varying the language pairs, monolingual corpora, hyper-parameters, etc., to obtain a better understanding of when and why unsupervised BDI works." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-91", "text": "Task: Bilingual dictionary induction After the shared cross-lingual space is induced, given a list of N source language words x u,1 , . . . , x u,N , the task is to find a target language word t for each query word x u relying on the representations in the space." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-92", "text": "t i is the target language word closest to the source language word x u,i in the induced cross-lingual space, also known as the cross-lingual nearest neighbor." 
}, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-93", "text": "The set of learned N (x u,i , t i ) pairs is then run against a gold standard dictionary." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-94", "text": "We use bilingual dictionaries compiled by Conneau et al. (2018) as gold standard, and adopt their evaluation procedure: each test set in each language consists of 1500 gold translation pairs." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-95", "text": "We rely on CSLS for retrieving the nearest neighbors, as it consistently outperformed the cosine similarity in all our experiments." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-96", "text": "Following a standard evaluation practice (Vuli\u0107 and Moens, 2013; Mikolov et al., 2013b; Conneau et al., 2018) , we report Precision at 1 scores (P@1): how many times one of the correct translations of a source word w is retrieved as the nearest neighbor of w in the target language." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-97", "text": "----------------------------------" }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-98", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-99", "text": "Our default experimental setup closely follows the setup of Conneau et al. (2018) ." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-100", "text": "For each language we induce monolingual word embeddings for all languages from their respective tokenized and lowercased Polyglot Wikipedias (Al-Rfou et al., 2013) using fastText (Bojanowski et al., 2017) ." 
}, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-102", "text": "Our fastText setup relies on skip-gram with negative sampling (Mikolov et al., 2013a) with standard hyper-parameters: bag-of-words contexts with the window size 2, 15 negative samples, subsampling rate 10 \u22124 , and character n-gram length Table 2 : Bilingual dictionary induction scores (P@1\u00d7100%) using a) the unsupervised method with adversarial training; b) the supervised method with a bilingual seed dictionary consisting of identical words (shared between the two languages)." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-103", "text": "The third columns lists eigenvector similarities between 10 randomly sampled source language nearest neighbor subgraphs of 10 nodes and the subgraphs of their translations, all from the benchmark dictionaries in Conneau et al. (2018) ." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-104", "text": "3-6." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-105", "text": "All embeddings are 300-dimensional." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-106", "text": "As we analyze the impact of various modeling assumptions in the following sections (e.g., domain differences, algorithm choices, hyper-parameters), we also train monolingual word embeddings using other corpora and different hyper-parameter choices." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-107", "text": "Quick summaries of each experimental setup are provided in the respective subsections." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-108", "text": "et al. (2018) present results for several target languages: Spanish, French, German, Russian, Chinese, and Esperanto." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-109", "text": "All languages but Esperanto are isolating or exclusively concatenating languages from a morphological point of view." 
}, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-110", "text": "All languages but French are dependent-marking." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-111", "text": "Table 1 lists three important morphological properties of the languages involved in their/our experiments." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-112", "text": "----------------------------------" }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-113", "text": "**IMPACT OF LANGUAGE SIMILARITY**" }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-114", "text": "----------------------------------" }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-115", "text": "**CONNEAU**" }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-116", "text": "Agglutinative languages with mixed or double marking show more morphological variance with content words, and we speculate whether unsupervised BDI is challenged by this kind of morphological complexity." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-117", "text": "To evaluate this, we experiment with Estonian and Finnish, and we include Greek, Hungarian, Polish, and Turkish to see how their approach fares on combinations of these two morphological traits." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-118", "text": "We show results in the left column of Table 2 ." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-119", "text": "The results are quite dramatic." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-120", "text": "The approach achieves impressive performance for Spanish, one of the languages Conneau et al. (2018) include in their paper." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-121", "text": "For the languages we add here, performance is less impressive." 
}, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-122", "text": "For the languages with dependent marking (Hungarian, Polish, and Turkish), P@1 scores are still reasonable, with Turkish being slightly lower (0.327) than the others." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-123", "text": "However, for Estonian and Finnish, the method fails completely." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-124", "text": "Only in less than 1/1000 cases does a nearest neighbor search in the induced embeddings return a correct translation of a query word." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-125", "text": "5 The sizes of Wikipedias naturally vary across languages: e.g., fastText trains on approximately 16M sentences and 363M word tokens for Spanish, while it trains on 1M sentences and 12M words for Finnish." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-126", "text": "However, the difference in performance cannot be explained by the difference in training data sizes." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-127", "text": "To verify that near-zero performance in Finnish is not a result of insufficient training data, we have conducted another experiment using the large Finnish WaC corpus (Ljube\u0161i\u0107 et al., 2016) containing 1.7B words in total (this is similar in size to the English Polyglot Wikipedia)." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-128", "text": "However, even with this large Finnish corpus, the model does not induce anything useful: P@1 equals 0.0." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-129", "text": "We note that while languages with mixed marking may be harder to align, it seems unsupervised BDI is possible between similar, mixed marking languages." 
}, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-130", "text": "So while unsupervised learning fails for English-Finnish and English-Estonian, performance is reasonable and stable for the more similar Estonian-Finnish pair (Table 2 )." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-131", "text": "In general, unsupervised BDI, using the approach in Conneau et al. (2018) , seems challenged when pairing En-glish with languages that are not isolating and do not have dependent marking." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-132", "text": "6 The promise of zero-supervision models is that we can learn cross-lingual embeddings even for low-resource languages." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-133", "text": "On the other hand, a similar distribution of embeddings requires languages to be similar." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-134", "text": "This raises the question whether we need fully unsupervised methods at all." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-135", "text": "In fact, our supervised method that relies on very naive supervision in the form of identically spelled words leads to competitive performance for similar language pairs and better results for dissimilar pairs." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-136", "text": "The fact that we can reach competitive and more robust performance with such a simple heuristic questions the true applicability of fully unsupervised approaches and suggests that it might often be better to rely on available weak supervision." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-137", "text": "----------------------------------" }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-138", "text": "**IMPACT OF DOMAIN DIFFERENCES**" }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-139", "text": "Monolingual word embeddings used in Conneau et al. (2018) are induced from Wikipedia, a nearparallel corpus." 
}, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-140", "text": "In order to assess the sensitivity of unsupervised BDI to the comparability and domain similarity of the monolingual corpora, we replicate the experiments in Conneau et al. (2018) using combinations of word embeddings extracted from three different domains: 1) parliamentary proceedings from EuroParl.v7 (Koehn, 2005) , 2) Wikipedia (Al- Rfou et al., 2013) , and 3) the EMEA corpus in the medical domain (Tiedemann, 2009) ." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-141", "text": "We report experiments with three language pairs: English{Spanish, Finnish, Hungarian}." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-142", "text": "To control for the corpus size, we restrict each corpus in each language to 1.1M sentences in total (i.e., the number of sentences in the smallest, EMEA corpus)." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-143", "text": "300-dim fastText vectors are induced as in \u00a74.1, retaining all words with more than 5 occurrences in the training data." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-144", "text": "For each pair of monolingual corpora, we compute their domain (dis)similarity by calculating the Jensen-Shannon divergence (El-Gamal, 1991) , based on term distributions." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-145", "text": "7 The domain similarities are displayed in Figures 2a-c ." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-146", "text": "8 We show the results of unsupervised BDI in Figures 2g-i ." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-147", "text": "For Spanish, we see good performance in all three cases where the English and Spanish corpora are from the same domain." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-148", "text": "When the two corpora are from different domains, performance is close to zero." 
}, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-149", "text": "For Finnish and Hungarian, performance is always poor, suggesting that more data is needed, even when domains are similar." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-150", "text": "This is in sharp contrast with the results of our minimally supervised approach (Figures 2d-f ) based on identical words, which achieves decent performance in many set-ups." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-151", "text": "We also observe a strong decrease in P@1 for English-Spanish (from 81.19% to 46.52%) when using the smaller Wikipedia corpora." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-152", "text": "This result indicates the importance of procuring large monolingual corpora from similar domains in order to enable unsupervised dictionary induction." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-153", "text": "However, resource-lean languages, for which the unsupervised method was designed in the first place, cannot be guaranteed to have as large monolingual training corpora as available for English, Spanish or other major resource-rich languages." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-154", "text": "----------------------------------" }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-155", "text": "**IMPACT OF HYPER-PARAMETERS**" }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-156", "text": "Conneau et al. (2018) use the same hyperparameters for inducing embeddings for all languages." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-157", "text": "This is of course always practically possible, but we are interested in seeing whether their approach works on pre-trained embeddings induced with possibly very different hyper-parameters." 
}, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-158", "text": "We focus on two hyper-parameters: context windowsize (win) and the parameter controlling the number of n-gram features in the fastText model (chn), while at the same time varying the underlying algorithm: skip-gram vs. cbow." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-159", "text": "The results for EnglishSpanish are listed in Table 3 ." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-160", "text": "The small variations in the hyper-parameters with the same underlying algorithm (i.e., using skipgram or cbow for both EN and ES) yield only slight drops in the final scores." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-161", "text": "Still, the best scores are obtained with the same configuration on both sides." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-162", "text": "Our main finding here is that unsupervised BDI fails (even) for EN-ES when the two monolingual embedding spaces are induced by two different algorithms (see the results of the entire Spanish cbow column)." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-163", "text": "9 In sum, this means that the unsupervised approach is unlikely to work on pre-trained word embeddings unless they are induced on same-9 We also checked if this result might be due to a lowerquality monolingual ES space." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-164", "text": "However, monolingual word similarity scores on available datasets in Spanish show performance comparable to that of Spanish skip-gram vectors: e.g., Spearman's \u03c1 correlation is \u2248 0.7 on the ES evaluation set from SemEval-2017 Task 2 (Camacho-Collados et al., 2017)." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-165", "text": "or comparable-domain, reasonably-sized training data using the same underlying algorithm." 
}, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-166", "text": "----------------------------------" }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-168", "text": "We also perform an experiment on 40-dimensional monolingual word embeddings." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-169", "text": "This leads to reduced expressivity, and can potentially make the geometric shapes of embedding spaces harder to align; on the other hand, reduced dimensionality may also lead to less overfitting." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-170", "text": "Table 4 : P@1 \u00d7 100% scores for query words with different parts-of-speech." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-171", "text": "----------------------------------" }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-172", "text": "**IMPACT OF EVALUATION PROCEDURE**" }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-173", "text": "BDI models are evaluated on a held-out set of query words." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-174", "text": "Here, we analyze the performance of the unsupervised approach across different parts-ofspeech, frequency bins, and with respect to query words that have orthographically identical counterparts in the target language with the same or a different meaning." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-175", "text": "----------------------------------" }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-176", "text": "**PART-OF-SPEECH**" }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-177", "text": "We show the impact of the partof-speech of the query words in Table 4 ; again on a representative subset of our languages." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-178", "text": "The results indicate that performance on verbs is lowest across the board." 
}, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-179", "text": "This is consistent with research on distributional semantics and verb meaning (Schwartz et al., 2015; Gerz et al., 2016) ." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-180", "text": "Frequency We also investigate the impact of the frequency of query words." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-181", "text": "We calculate the word frequency of English words based on Google's Trillion Word Corpus: query words are divided in groups based on their rank -i.e., the first group contains the top 100 most frequent words, the second one contains the 101th-1000th most frequent words, etc." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-182", "text": "-and plot performance (P@1) relative to rank in Figure 3 ." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-183", "text": "For EN-FI, P@1 was 0 across all frequency ranks." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-184", "text": "The plot shows sensitivity to frequency for HU, but less so for ES." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-185", "text": "Table 5 : Scores (P@1 \u00d7 100%) for query words with same and different spellings and meanings." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-186", "text": "whether these are representative or harder to align than other words." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-187", "text": "Table 5 lists performance for three sets of query words: (a) source words that have homographs (words that are spelled the same way) with the same meaning (homonyms) in the target language, e.g., many proper names; (b) source words that have homographs that are not homonyms in the target language, e.g., many short words; and (c) other words." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-188", "text": "Somewhat surprisingly, words which have translations that are homographs, are associated with lower precision than other words." 
}, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-189", "text": "This is probably due to loan words and proper names, but note that using homographs as supervision for alignment, we achieve high precision for this part of the vocabulary for free." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-190", "text": "----------------------------------" }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-191", "text": "**EVALUATING EIGENVECTOR SIMILARITY**" }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-212", "text": "**CONCLUSION**" }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-192", "text": "Finally, in order to get a better understanding of the limitations of unsupervised BDI, we correlate the graph similarity metric described in \u00a72 (right column of Table 2 ) with performance across languages (left column)." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-193", "text": "Since we already established that the monolingual word embeddings are far from isomorphic-in contrast with the intuitions motivating previous work (Mikolov et al., 2013b; Barone, 2016; Zhang et al., 2017; Conneau et al., 2018 )-we would like to establish another diagnostic metric that identifies embedding spaces for which the approach in Conneau et al. (2018) is likely to work." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-194", "text": "Differences in morphology, domain, or embedding parameters seem to be predictive of poor performance, but a metric that is independent of linguistic categorizations and the characteristics of the monolingual corpora would be more widely applicable." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-195", "text": "We plot the values in Table 2 in Figure 4 ." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-196", "text": "Recall that our graph similarity metric returns a value in the half-open interval [0, \u221e)." 
}, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-197", "text": "The correlation between BDI performance and graph similarity is strong (\u03c1 \u223c 0.89)." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-198", "text": "----------------------------------" }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-199", "text": "**RELATED WORK**" }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-200", "text": "Cross-lingual word embeddings Cross-lingual word embedding models typically, unlike Conneau et al. (2018) , require aligned words, sentences, or documents (Levy et al., 2017) ." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-201", "text": "Most approaches based on word alignments learn an explicit mapping between the two embedding spaces (Mikolov et al., 2013b; Xing et al., 2015) ." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-202", "text": "Recent approaches try to minimize the amount of supervision needed Artetxe et al., 2017; Smith et al., 2017) ." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-203", "text": "See Upadhyay et al. (2016) and Ruder et al. (2018) for surveys." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-204", "text": "Unsupervised cross-lingual learning Haghighi et al. (2008) were first to explore unsupervised BDI, using features such as context counts and orthographic substrings, and canonical correlation analysis." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-205", "text": "Recent approaches use adversarial learning (Goodfellow et al., 2014) and employ a discriminator, trained to distinguish between the translated source and the target language space, and a generator learning a translation matrix (Barone, 2016)." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-206", "text": "Zhang et al. (2017) , in addition, use different forms of regularization for convergence, while Conneau et al. (2018) uses additional steps to refine the induced embedding space." 
}, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-207", "text": "Unsupervised machine translation Research on unsupervised machine translation (Lample et al., 2018a; Artetxe et al., 2018; Lample et al., 2018b) has generated a lot of interest recently with a promise to support the construction of MT systems for and between resource-poor languages." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-208", "text": "All unsupervised NMT methods critically rely on accurate unsupervised BDI and back-translation." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-209", "text": "Models are trained to reconstruct a corrupted version of the source sentence and to translate its translated version back to the source language." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-210", "text": "Since the crucial input to these systems are indeed cross-lingual word embedding spaces induced in an unsupervised fashion, in this paper we also implicitly investigate one core limitation of such unsupervised MT techniques." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-211", "text": "----------------------------------" }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-213", "text": "We investigated when unsupervised BDI (Conneau et al., 2018 ) is possible and found that differences in morphology, domains or word embedding algorithms may challenge this approach." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-214", "text": "Further, we found eigenvector similarity of sampled nearest neighbor subgraphs to be predictive of unsupervised BDI performance." }, { "sent_id": "d91913d0c8e669153e8d477e19aef2-C001-215", "text": "We hope that this work will guide further developments in this new and exciting field." 
} ], "y": { "@USE@": { "gold_contexts": [ [ "d91913d0c8e669153e8d477e19aef2-C001-3" ], [ "d91913d0c8e669153e8d477e19aef2-C001-16" ], [ "d91913d0c8e669153e8d477e19aef2-C001-17", "d91913d0c8e669153e8d477e19aef2-C001-18" ], [ "d91913d0c8e669153e8d477e19aef2-C001-20" ], [ "d91913d0c8e669153e8d477e19aef2-C001-25" ], [ "d91913d0c8e669153e8d477e19aef2-C001-28" ], [ "d91913d0c8e669153e8d477e19aef2-C001-47" ], [ "d91913d0c8e669153e8d477e19aef2-C001-52", "d91913d0c8e669153e8d477e19aef2-C001-53" ], [ "d91913d0c8e669153e8d477e19aef2-C001-54", "d91913d0c8e669153e8d477e19aef2-C001-55" ], [ "d91913d0c8e669153e8d477e19aef2-C001-58" ], [ "d91913d0c8e669153e8d477e19aef2-C001-63" ], [ "d91913d0c8e669153e8d477e19aef2-C001-70" ], [ "d91913d0c8e669153e8d477e19aef2-C001-87" ], [ "d91913d0c8e669153e8d477e19aef2-C001-90" ], [ "d91913d0c8e669153e8d477e19aef2-C001-94" ], [ "d91913d0c8e669153e8d477e19aef2-C001-96" ], [ "d91913d0c8e669153e8d477e19aef2-C001-99" ], [ "d91913d0c8e669153e8d477e19aef2-C001-103" ], [ "d91913d0c8e669153e8d477e19aef2-C001-131" ], [ "d91913d0c8e669153e8d477e19aef2-C001-140" ], [ "d91913d0c8e669153e8d477e19aef2-C001-144", "d91913d0c8e669153e8d477e19aef2-C001-146" ], [ "d91913d0c8e669153e8d477e19aef2-C001-192" ], [ "d91913d0c8e669153e8d477e19aef2-C001-213" ] ], "cite_sentences": [ "d91913d0c8e669153e8d477e19aef2-C001-3", "d91913d0c8e669153e8d477e19aef2-C001-16", "d91913d0c8e669153e8d477e19aef2-C001-17", "d91913d0c8e669153e8d477e19aef2-C001-18", "d91913d0c8e669153e8d477e19aef2-C001-20", "d91913d0c8e669153e8d477e19aef2-C001-25", "d91913d0c8e669153e8d477e19aef2-C001-28", "d91913d0c8e669153e8d477e19aef2-C001-47", "d91913d0c8e669153e8d477e19aef2-C001-52", "d91913d0c8e669153e8d477e19aef2-C001-53", "d91913d0c8e669153e8d477e19aef2-C001-54", "d91913d0c8e669153e8d477e19aef2-C001-55", "d91913d0c8e669153e8d477e19aef2-C001-58", "d91913d0c8e669153e8d477e19aef2-C001-63", "d91913d0c8e669153e8d477e19aef2-C001-70", "d91913d0c8e669153e8d477e19aef2-C001-87", 
"d91913d0c8e669153e8d477e19aef2-C001-90", "d91913d0c8e669153e8d477e19aef2-C001-94", "d91913d0c8e669153e8d477e19aef2-C001-96", "d91913d0c8e669153e8d477e19aef2-C001-99", "d91913d0c8e669153e8d477e19aef2-C001-103", "d91913d0c8e669153e8d477e19aef2-C001-131", "d91913d0c8e669153e8d477e19aef2-C001-140", "d91913d0c8e669153e8d477e19aef2-C001-146", "d91913d0c8e669153e8d477e19aef2-C001-192", "d91913d0c8e669153e8d477e19aef2-C001-213" ] }, "@BACK@": { "gold_contexts": [ [ "d91913d0c8e669153e8d477e19aef2-C001-10" ], [ "d91913d0c8e669153e8d477e19aef2-C001-12" ], [ "d91913d0c8e669153e8d477e19aef2-C001-22" ], [ "d91913d0c8e669153e8d477e19aef2-C001-51" ], [ "d91913d0c8e669153e8d477e19aef2-C001-56" ], [ "d91913d0c8e669153e8d477e19aef2-C001-70", "d91913d0c8e669153e8d477e19aef2-C001-71" ], [ "d91913d0c8e669153e8d477e19aef2-C001-139" ], [ "d91913d0c8e669153e8d477e19aef2-C001-156" ], [ "d91913d0c8e669153e8d477e19aef2-C001-173" ], [ "d91913d0c8e669153e8d477e19aef2-C001-200" ], [ "d91913d0c8e669153e8d477e19aef2-C001-206" ], [ "d91913d0c8e669153e8d477e19aef2-C001-208" ] ], "cite_sentences": [ "d91913d0c8e669153e8d477e19aef2-C001-10", "d91913d0c8e669153e8d477e19aef2-C001-12", "d91913d0c8e669153e8d477e19aef2-C001-22", "d91913d0c8e669153e8d477e19aef2-C001-51", "d91913d0c8e669153e8d477e19aef2-C001-56", "d91913d0c8e669153e8d477e19aef2-C001-70", "d91913d0c8e669153e8d477e19aef2-C001-71", "d91913d0c8e669153e8d477e19aef2-C001-139", "d91913d0c8e669153e8d477e19aef2-C001-156", "d91913d0c8e669153e8d477e19aef2-C001-173", "d91913d0c8e669153e8d477e19aef2-C001-200", "d91913d0c8e669153e8d477e19aef2-C001-206", "d91913d0c8e669153e8d477e19aef2-C001-208" ] }, "@MOT@": { "gold_contexts": [ [ "d91913d0c8e669153e8d477e19aef2-C001-15", "d91913d0c8e669153e8d477e19aef2-C001-16" ], [ "d91913d0c8e669153e8d477e19aef2-C001-17", "d91913d0c8e669153e8d477e19aef2-C001-18" ], [ "d91913d0c8e669153e8d477e19aef2-C001-22", "d91913d0c8e669153e8d477e19aef2-C001-23" ], [ "d91913d0c8e669153e8d477e19aef2-C001-35" ], [ 
"d91913d0c8e669153e8d477e19aef2-C001-58" ], [ "d91913d0c8e669153e8d477e19aef2-C001-65" ], [ "d91913d0c8e669153e8d477e19aef2-C001-116" ], [ "d91913d0c8e669153e8d477e19aef2-C001-117", "d91913d0c8e669153e8d477e19aef2-C001-119", "d91913d0c8e669153e8d477e19aef2-C001-120", "d91913d0c8e669153e8d477e19aef2-C001-121" ], [ "d91913d0c8e669153e8d477e19aef2-C001-156", "d91913d0c8e669153e8d477e19aef2-C001-157" ], [ "d91913d0c8e669153e8d477e19aef2-C001-173", "d91913d0c8e669153e8d477e19aef2-C001-174" ], [ "d91913d0c8e669153e8d477e19aef2-C001-193" ] ], "cite_sentences": [ "d91913d0c8e669153e8d477e19aef2-C001-16", "d91913d0c8e669153e8d477e19aef2-C001-17", "d91913d0c8e669153e8d477e19aef2-C001-18", "d91913d0c8e669153e8d477e19aef2-C001-22", "d91913d0c8e669153e8d477e19aef2-C001-35", "d91913d0c8e669153e8d477e19aef2-C001-58", "d91913d0c8e669153e8d477e19aef2-C001-65", "d91913d0c8e669153e8d477e19aef2-C001-116", "d91913d0c8e669153e8d477e19aef2-C001-120", "d91913d0c8e669153e8d477e19aef2-C001-156", "d91913d0c8e669153e8d477e19aef2-C001-157", "d91913d0c8e669153e8d477e19aef2-C001-173", "d91913d0c8e669153e8d477e19aef2-C001-193" ] }, "@DIF@": { "gold_contexts": [ [ "d91913d0c8e669153e8d477e19aef2-C001-22", "d91913d0c8e669153e8d477e19aef2-C001-23" ], [ "d91913d0c8e669153e8d477e19aef2-C001-193" ] ], "cite_sentences": [ "d91913d0c8e669153e8d477e19aef2-C001-22", "d91913d0c8e669153e8d477e19aef2-C001-193" ] }, "@SIM@": { "gold_contexts": [ [ "d91913d0c8e669153e8d477e19aef2-C001-79" ] ], "cite_sentences": [ "d91913d0c8e669153e8d477e19aef2-C001-79" ] }, "@EXT@": { "gold_contexts": [ [ "d91913d0c8e669153e8d477e19aef2-C001-117", "d91913d0c8e669153e8d477e19aef2-C001-119", "d91913d0c8e669153e8d477e19aef2-C001-120", "d91913d0c8e669153e8d477e19aef2-C001-121" ] ], "cite_sentences": [ "d91913d0c8e669153e8d477e19aef2-C001-120" ] }, "@UNSURE@": { "gold_contexts": [ [ "d91913d0c8e669153e8d477e19aef2-C001-129" ], [ "d91913d0c8e669153e8d477e19aef2-C001-162" ], [ "d91913d0c8e669153e8d477e19aef2-C001-197" ] ], 
"cite_sentences": [ "d91913d0c8e669153e8d477e19aef2-C001-129", "d91913d0c8e669153e8d477e19aef2-C001-162", "d91913d0c8e669153e8d477e19aef2-C001-197" ] }, "@FUT@": { "gold_contexts": [ [ "d91913d0c8e669153e8d477e19aef2-C001-214", "d91913d0c8e669153e8d477e19aef2-C001-215" ] ], "cite_sentences": [ "d91913d0c8e669153e8d477e19aef2-C001-214" ] } } }, "ABC_c3f71bea55f85633568c7ba57f6fd5_6": { "x": [ { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-43", "text": "----------------------------------" }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-2", "text": "We introduce the task of acoustic question answering (AQA) in the area of acoustic reasoning." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-3", "text": "In this task an agent learns to answer questions on the basis of acoustic context." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-4", "text": "In order to promote research in this area, we propose a data generation paradigm adapted from CLEVR [11] ." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-5", "text": "We generate acoustic scenes by leveraging a bank of elementary sounds." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-6", "text": "We also provide a number of functional programs that can be used to compose questions and answers that exploit the relationships between the attributes of the elementary sounds in each scene." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-7", "text": "We provide AQA datasets of various sizes as well as the data generation code." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-8", "text": "As a preliminary experiment to validate our data, we report the accuracy of current state of the art visual question answering models when they are applied to the AQA task without modifications." 
}, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-9", "text": "Although there is a plethora of question answering tasks based on text, image or video data, to our knowledge, we are the first to propose answering questions directly on audio streams." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-10", "text": "We hope this contribution will facilitate the development of research in the area." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-11", "text": "----------------------------------" }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-12", "text": "**INTRODUCTION AND RELATED WORK**" }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-13", "text": "Question answering (QA) problems have attracted increasing interest in the machine learning and artificial intelligence communities." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-14", "text": "These tasks usually involve interpreting and answering text based questions in the view of some contextual information, often expressed in a different modality." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-15", "text": "Text-based QA, use text corpora as context ( [19, 20, 17, 9, 10, 16] ); in visual question answering (VQA), instead, the questions are related to a scene depicted in still images (e.g. [11, 2, 25, 7, 1, 23, 8, 10, 16] ." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-16", "text": "Finally, video question answering attempts to use both the visual and acoustic information in video material as context (e.g. [5, 6, 22, 13, 14, 21] )." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-17", "text": "In the last case, however, the acoustic information is usually expressed in text form, either with manual transcriptions (e.g. subtitles) or by automatic speech recognition, and is limited to linguistic information [24] ." 
}, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-18", "text": "The task presented in this paper differs from the above by answering questions directly on audio streams." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-19", "text": "We argue that the audio modality contains important information that has not been exploited in the question answering domain." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-20", "text": "This information may allow QA systems to answer relevant questions more accurately, or even to answer questions that are not approachable from the visual domain alone." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-44", "text": "**SCENES AND ELEMENTARY SOUNDS**" }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-21", "text": "Examples of potential applications are the detection of anomalies in machinery where the moving parts are hidden, the detection of threatening or hazardous events, industrial and social robotics." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-22", "text": "Current question answering methods require large amounts of annotated data." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-23", "text": "In the visual domain, several strategies have been proposed to make this kind of data available to the community [11, 2, 25, 7] ." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-24", "text": "Agrawal et al. [1] noted that the way the questions are created has a huge impact on what information a neural network uses to answer them (this is a well known problem that can arise with In what part of the scene is the clarinet playing a G note that is before the third violin sound?" }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-25", "text": "beginning, middle, end (of the scene) 3" }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-26", "text": "Total 47 all neural network based systems)." 
}, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-27", "text": "This motivated research [23, 8, 11] on how to reduce the bias in VQA datasets." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-28", "text": "The complexity around gathering good labeled data forced some authors [23, 8] to constrain their work to yes/no questions." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-29", "text": "Johnson et al. [11] made their way around this constraint by using synthetic data." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-30", "text": "To generate the questions, they first generate a semantic representation that describes the reasoning steps needed in order to answer the question." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-31", "text": "This gives them full control over the labelling process and a better understanding of the semantic meaning of the questions." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-32", "text": "They leverage this ability to reduce the bias in the synthesized data." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-33", "text": "For example, they ensure that none of the generated questions contains hints about the answer." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-34", "text": "Inspired by the work on CLEVR [11] , we propose an acoustical question answering (AQA) task by defining a synthetic dataset that comprises audio scenes composed by sequences of elementary sounds and questions relating properties of the sounds in each scene." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-35", "text": "We provide the adapted software for AQA data generation as well as a version of the dataset based on musical instrument sounds." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-36", "text": "We also report preliminary experiments using the FiLM architecture derived from the VQA domain." 
}, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-37", "text": "----------------------------------" }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-38", "text": "**DATASET**" }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-39", "text": "This section presents the dataset and the generation process 1 ." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-40", "text": "In this first version (version 1.0) we created multiple instances of the dataset with 1000, 10000 and 50000 acoustic scenes for which we generated 20 to 40 questions and answers per scene." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-41", "text": "In total, we generated six instances of the dataset." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-42", "text": "To represent questions, we use the same semantic representation through functional programs that is proposed in [11, 12] ." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-45", "text": "An acoustic scene is composed by a sequence of elementary sounds, that we will call just sounds in the following." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-46", "text": "The sounds are real recordings of musical notes from the Good-Sounds database [3] ." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-47", "text": "We use five families of musical instruments: cello, clarinet, flute, trumpet and violin." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-48", "text": "Each recording of an instrument has a different musical note (pitch) on the MIDI scale." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-49", "text": "The data generation process, however, is independent of the specific sounds, so that future versions of the data may include speech, animal vocalizations and environmental sounds." 
}, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-50", "text": "Each sound is described by an n-tuple [Instrument family, Brightness, Loudness, Musical note, Absolute Position, Relative Position, Global Position, Duration] (see Table 1 for a summary of attributes and values)." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-51", "text": "Where Brightness can be either bright or dark; Loudness can be quiet or loud; Musical note can take any of the 12 values on the fourth octave of the Western chromatic scale 2 ." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-52", "text": "The Absolute Position gives the position of the sound within the acoustic scene (between first and tenth), the Relative Position gives the position of a sound relatively to the other sounds that are in the same category (e.g. \"the third cello sound\")." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-53", "text": "Global Position refers Figure 1 : Example of an acoustic scene." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-54", "text": "We show the spectrogram, the waveform and the annotation of the instrument for each elementary sounds." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-55", "text": "A possible question on this scene could be \"What is the position of the flute that plays after the second clarinet?\", and the corresponding answer would be \"Fifth\". Note that the agent must answer based on the spectrogram (or waveform) alone." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-56", "text": "to the approximate position of the sound within the scene and can be either beginning, middle or end." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-57", "text": "We start by generating a clean acoustic scene as following: first the encoding of the original sounds (sampled at 48kHz) is converted from 24 to 16 bits." 
}, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-58", "text": "Then silence is detected and removed when the energy, computed as 10 log 10 i x 2 i over windows of 100 msec, falls below -50 dB, where x i are the sound samples normalized between \u00b11. Then we measure the perceptual loudness of the sounds in dB LUFS using the method described in the ITU-R BS.1770-4 international normalization standard [4] and implemented in [18] ." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-59", "text": "We attenuate sounds that are in an intermediate range of -24 dB LUFS and -30.5 dB LUFS by -10 dB, to increase the separation between loud and quiet sounds." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-60", "text": "We obtain a bank of 56 elementary sounds." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-61", "text": "Each clean acoustic scene is generated by concatenating 10 sounds chosen randomly from this bank." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-62", "text": "Once a clean acoustic scene has been created it is post-processed to generate a more difficult and realistic scene." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-63", "text": "A white uncorrelated uniform noise is first added to the scene." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-64", "text": "The amplitude range of the noise is first set to the maximum values allowed by the encoding." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-65", "text": "Then the amplitude is attenuated by a factor f randomly sampled from a uniform distribution between -80 dB and -90 dB (20 log 10 f )." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-66", "text": "The noise is then added to the scene." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-67", "text": "Although the noise is weak and almost imperceptible to the human ear, it guaranties that there is no pure silence between each elementary sounds." 
}, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-68", "text": "The scene obtained this way is finally filtered to simulate room reverberation using SoX 3 ." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-69", "text": "For each scene, a different room reverberation time is chosen from a uniform distribution between [50ms, 400ms]." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-70", "text": "----------------------------------" }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-71", "text": "**QUESTIONS**" }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-72", "text": "Questions are structured in a logical tree introduced in CLEVR [11] as a functional program." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-73", "text": "A functional program, defines the reasoning steps required to answer a question given a scene definition." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-74", "text": "We adapted the original work of Johnson et al. [11] to our acoustical context by updating the function catalog and the relationships between the objects of the scene." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-75", "text": "For example we added the before and after temporal relationships." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-76", "text": "In natural language, there is more than one way to ask a question that has the same meaning." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-77", "text": "For example, the question \"Is the cello as loud as the flute?\" is equivalent to \"Does the cello play as loud as the flute?\"." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-78", "text": "Both of these questions correspond to the same functional program even though their text representation is different." 
}, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-79", "text": "Therefore the structures we use include, for each question, a functional representation, and possibly many text representations used to maximize language diversity and minimize the bias in the questions." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-80", "text": "We have defined 942 such structures." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-81", "text": "A template can be instantiated using a large number of combinations of elements." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-82", "text": "Not all of them generate valid questions." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-83", "text": "For example \"Is the flute louder than the flute?\" is invalid because it does not provide enough information to compare the correct sounds regardless of the structure of the scene." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-84", "text": "Similarly, the question \"What is the position of the violin playing after the trumpet?\" would be ill-posed if there are several violins playing after the trumpet." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-85", "text": "The same question would be considered degenerate if there is only one violin sound in the scene, because it could be answered without taking into account the relation \"after the trumpet\"." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-86", "text": "A validation process [11] is responsible for rejecting both ill-posed and degenerate questions during the generation phase." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-87", "text": "Thanks to the functional representation we can use the reasoning steps of the questions to analyze the results." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-88", "text": "This would be difficult if we were only using the text representation without human annotations." 
}, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-89", "text": "If we consider the kind of answer, questions can be organized into 9 families as illustrated in Table 1 ." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-90", "text": "For example, the question \"What is the third instrument playing?\" would translate to the \"Query Instrument\" family as its function is to retrieve the instrument's name." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-91", "text": "On the other hand we could classify the questions based on the relationships they required to be answered." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-92", "text": "For example, \"What is the instrument after the trumpet sound that is playing the C note?\" is still a \"query_instrument\" question, but compared to the previous example, requires more complex reasoning." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-93", "text": "The appendix reports and analyzes statistics and properties of the database." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-94", "text": "----------------------------------" }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-95", "text": "**PRELIMINARY EXPERIMENTS**" }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-96", "text": "To evaluate our dataset, we performed preliminary experiments with a FiLM network [15] ." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-97", "text": "It is a good candidate as it has been shown to work well on the CLEVR VQA task [11] that shares the same structure of questions as our CLEAR dataset." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-98", "text": "To represent acoustic scenes in a format compatible with FiLM, we computed spectrograms (log amplitude of the spectrum at regular intervals in time) and treated them as images." 
}, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-99", "text": "Each scene corresponds to a fixed resolution image because we have designed the dataset to include acoustic scenes of the same length in time." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-100", "text": "The best results were obtained with a training on 35000 scenes and 1400000 questions/answers." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-101", "text": "It yields a 89.97% accuracy on the test set that comprises 7500 scenes and 300000 questions." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-102", "text": "For the same test set a classifier choosing always the majority class would obtain as little as 7.6% accuracy." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-103", "text": "----------------------------------" }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-104", "text": "**CONCLUSION**" }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-105", "text": "We introduce the new task of acoustic question answering (AQA) as a means to stimulate AI and reasoning research on acoustic scenes." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-106", "text": "We also propose a paradigm for data generation that is an extension of the CLEVR paradigm: The acoustic scenes are generated by combining a number of elementary sounds, and the corresponding questions and answers are generated based on the properties of those sounds and their mutual relationships." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-107", "text": "We generated a preliminary dataset comprising 50k acoustic scenes composed of 10 musical instrument sounds, and 2M corresponding questions and answers." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-108", "text": "We also tested the FiLM model on the preliminary dataset obtaining at best 89.97% accuracy predicting the right answer from the question and the scene." 
}, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-109", "text": "Although these preliminary results are very encouraging, we consider this as a first step in creating datasets that will promote research in acoustic reasoning." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-110", "text": "The following is a list of limitations that we intend to address in future versions of the dataset." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-111", "text": "----------------------------------" }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-112", "text": "**LIMITATIONS AND FUTURE DIRECTIONS**" }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-113", "text": "In order to be able to use models that were designed for VQA, we created acoustic scenes that have the same length in time." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-114", "text": "This allows us to represent the scenes as images (spectrograms) of fixed resolution." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-115", "text": "In order to promote models that can handle sounds more naturally, we should release this assumption and create scenes of variable lenghts." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-116", "text": "Another simplifying assumption (somewhat related to the first) is that every scene includes an equal number of elementary sounds." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-117", "text": "This assumption should also be released in future versions of the dataset." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-118", "text": "In the current implementation, consecutive sounds follow each other without overlap." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-119", "text": "In order to implement something similar to occlusions in visual domain, we should let the sounds overlap." 
}, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-120", "text": "The number of instruments is limited to five and all produce sustained notes, although with different sound sources (bow, for cello and violin, reed vibration for the clarinet, fipple for the flute and lips for the trumpet)." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-121", "text": "We should increase the number of instruments and consider percussive and decaying sounds as in drums and piano, or guitar." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-122", "text": "We also intend to consider other types of sounds (ambient and speech, for example) to increase the generality of the data." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-123", "text": "Finally, the complexity of the task can always be increased by adding more attributes to the elementary sounds, adding complexity to the questions, or introducing different levels of noise and distortions in the acoustic data." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-124", "text": "----------------------------------" }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-125", "text": "**4**" }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-126", "text": "----------------------------------" }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-127", "text": "**A STATISTICS ON THE DATA SET**" }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-128", "text": "This appendix reports some statistics on the properties of the data set." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-129", "text": "We have considered the data set comprising 50k scenes and 2M questions and answers to produce the analysis." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-130", "text": "Figure 2 reports the distribution of the correct answer to each of the 2M questions." 
}, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-131", "text": "Figure 3 and 4 reports the distribution of question types and available template types respectively." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-132", "text": "The fact that those two distributions are very similar means that the available templates are sampled uniformly when generating the questions." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-133", "text": "Finally, Figure 5 shows the distribution of sound attributes in the scenes." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-134", "text": "It can be seen that most attributes are nearly evenly distributed." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-135", "text": "In the case of brightness, calculated in terms of spectral centroids, sounds were divided into clearly bright, clearly dark and ambiguous cases (referred to by \"None\" in the figure)." }, { "sent_id": "c3f71bea55f85633568c7ba57f6fd5-C001-136", "text": "We only instantiated questions about the brightness on the clearly separable cases." 
} ], "y": { "@EXT@": { "gold_contexts": [ [ "c3f71bea55f85633568c7ba57f6fd5-C001-4" ], [ "c3f71bea55f85633568c7ba57f6fd5-C001-74" ], [ "c3f71bea55f85633568c7ba57f6fd5-C001-106" ] ], "cite_sentences": [ "c3f71bea55f85633568c7ba57f6fd5-C001-4", "c3f71bea55f85633568c7ba57f6fd5-C001-74", "c3f71bea55f85633568c7ba57f6fd5-C001-106" ] }, "@BACK@": { "gold_contexts": [ [ "c3f71bea55f85633568c7ba57f6fd5-C001-15" ], [ "c3f71bea55f85633568c7ba57f6fd5-C001-23" ], [ "c3f71bea55f85633568c7ba57f6fd5-C001-27" ], [ "c3f71bea55f85633568c7ba57f6fd5-C001-28", "c3f71bea55f85633568c7ba57f6fd5-C001-29" ] ], "cite_sentences": [ "c3f71bea55f85633568c7ba57f6fd5-C001-15", "c3f71bea55f85633568c7ba57f6fd5-C001-23", "c3f71bea55f85633568c7ba57f6fd5-C001-27", "c3f71bea55f85633568c7ba57f6fd5-C001-29" ] }, "@MOT@": { "gold_contexts": [ [ "c3f71bea55f85633568c7ba57f6fd5-C001-34" ] ], "cite_sentences": [ "c3f71bea55f85633568c7ba57f6fd5-C001-34" ] }, "@USE@": { "gold_contexts": [ [ "c3f71bea55f85633568c7ba57f6fd5-C001-42" ], [ "c3f71bea55f85633568c7ba57f6fd5-C001-72" ], [ "c3f71bea55f85633568c7ba57f6fd5-C001-86", "c3f71bea55f85633568c7ba57f6fd5-C001-87" ], [ "c3f71bea55f85633568c7ba57f6fd5-C001-96", "c3f71bea55f85633568c7ba57f6fd5-C001-97" ] ], "cite_sentences": [ "c3f71bea55f85633568c7ba57f6fd5-C001-42", "c3f71bea55f85633568c7ba57f6fd5-C001-72", "c3f71bea55f85633568c7ba57f6fd5-C001-86", "c3f71bea55f85633568c7ba57f6fd5-C001-97" ] } } }, "ABC_b7278824bdae498021b899fbc6c638_6": { "x": [ { "sent_id": "b7278824bdae498021b899fbc6c638-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-2", "text": "Key to named entity recognition, the manual gazetteering of entity lists is a costly, errorprone process that often yields results that are incomplete and suffer from sampling bias." 
}, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-3", "text": "Exploiting current sources of structured information, we propose a novel method for extending minimal seed lists into complete gazetteers." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-4", "text": "Like previous approaches, we value WIKIPEDIA as a huge, well-curated, and relatively unbiased source of entities." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-5", "text": "However, in contrast to previous work, we exploit not only its content, but also its structure, as exposed in DBPEDIA." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-6", "text": "We extend gazetteers through Wikipedia categories, carefully limiting the impact of noisy categorizations." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-7", "text": "The resulting gazetteers easily outperform previous approaches on named entity recognition." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-8", "text": "----------------------------------" }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-10", "text": "Automatically learning gazetteers with minimal supervision is a long standing problem in named entity recognition." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-11", "text": "We propose EAGER as a novel approach to extending automatically gazetteers for entity recognition, utilizing DBPEDIA (Bizer et al., 2009 ) rather than WIKIPEDIA." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-12", "text": "DBPEDIA serves as a much better foundation than WIKIPEDIA, because all the information used in previous approaches (and much more) is already provided as a structured database of facts and articles." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-13", "text": "The extraction is more robust and complete than ad-hoc methods and maintained by a large community." 
}, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-14", "text": "E.g., navigating the category hierarchy is much easier and reliable with DBPE- DIA." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-15", "text": "To summarize, EAGER's main contributions are" }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-16", "text": "(1) A novel gazetteer expansion algorithm that adds new entities from DBPEDIA." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-17", "text": "EAGER adds entities that have several categories in common with the seed terms, addressing noisy categorizations through a sophisticated category pruning technique." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-18", "text": "(2) EAGER also extracts categories from DBPE-DIA abstracts using dependency analysis." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-19", "text": "Finally, EAGER extracts plural forms and synonyms from redirect information." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-20", "text": "(3) For entity recognition, we integrate the gazetteer with a simple, but effective machine learning classifier, and experimentally show that the extended gazetteers improve the F 1 score between 7% and 12% over our baseline approach and outperform (Zhang and Iria, 2009 ) on all learned concepts (subject, location, temporal)." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-21", "text": "----------------------------------" }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-22", "text": "**RELATED WORK**" }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-23", "text": "We divide the related work in automatic gazetteer population into three groups: (1) Machine learning approaches (2) Pattern driven approaches Finally, like our own work, (3) knowledge driven approaches Knowledge Driven." 
}, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-24", "text": "In any case, machine learning and pattern driven approaches extract their terms from unstructured sources -despite the fact that large, general knowledge bases became available in the last years." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-25", "text": "One of the first knowledgedriven methods (Magnini et al., 2002) employed WORDNET to identify trigger words and candidate gazetteer terms with its word-class and -instance relations." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-26", "text": "As WORDNET covers domain specific vocabularies only to a limited extent, this approach is also limited in its general applicability." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-27", "text": "In (Toral and Mu\u00f1oz, 2006) , gazetteers are built from the noun phrases in the first sentences of WIKIPEDIA articles by mapping these phrases to WORDNET and adding further terms found along the hypernymy relations." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-28", "text": "The approach presented in (Kazama and Torisawa, 2007; Kazama and Torisawa, 2008 ) relies solely on WIKIPEDIA, producing gazetteers without explicitly named concepts, arguing that consistent but anonymous labels are still useful." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-29", "text": "Most closely related to our own work, the authors of (Zhang and Iria, 2009 ) build an approach solely on WIKIPEDIA which does not only exploit the article text but also analyzes the structural elements of WIKIPEDIA:" }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-30", "text": "3 Automatically Extending Gazetteer Lists 3.1 Extraction Algorithm: Overview Algorithm 1 shows an outline of the gazetteer expansion algorithm used in EAGER." 
}, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-31", "text": "To extend an initial seed set S EAGER proceeds, roughly, in three steps: First, it identifies DBPEDIA articles for seed entities and extracts implicit category and synonym information from abstracts and redirect information (Lines 1-11)." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-32", "text": "Second, it finds additional categories from the DBPEDIA category hierarchy ." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-33", "text": "Finally, it uses the categories from the first two steps to extract additional entities (Lines 21-24)." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-34", "text": "In the following, we consider the three steps separately." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-35", "text": "----------------------------------" }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-36", "text": "**IMPLICIT: ABSTRACT AND REDIRECTS**" }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-37", "text": "Before EAGER can analyse abstract and redirect information for an article, we need to find the corresponding DBPEDIA articles (Lines 1-3) for each seed entry in S ." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-38", "text": "There may be one or more such entry." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-39", "text": "Here, we observe the first advantage of DBPEDIA's more structured information: DB-PEDIA already contains plain text labels such as \"Barack Obama\" and we can directly query (using the SPARQL endpoint) all articles with a label equal (or starting with) an entity in our seed set." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-40", "text": "This allows for more precise article matching and avoids complex URL encodings as necessary in previous, add all labels of a to G ; WIKIPEDIA-based approaches such as (Kazama and Torisawa, 2007) ." 
}, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-41", "text": "As (Zhang and Iria, 2009 ), we reject redirection entries in this step as ambiguous." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-42", "text": "With the articles identified, we can proceed to extract category information from the abstracts and new entities from the redirect information." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-43", "text": "In the dependency analysis of article abstracts (Lines 6-9), we aim to extract category (or, more generally, hypernym) information from the abstracts of articles on the ssed list." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-44", "text": "We perform a standard dependency analysis on the sentences of the abstract and return all nouns that stand in nsubj relation to a seed entity or (directly or indirectly) in conj (correlative conjunction) relation to a noun that stands in nsubj relation to a seed entity." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-45", "text": "This allows us to extract, e.g., both \"general\" and \"statesman\" as categories from a sentence such as \"Julius Caesar was a Roman general and statesman\"." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-46", "text": "This analysis is inspired by (Zhang and Iria, 2009 ), but performed on the entire abstract which is clearly dis- (Zhang and Iria, 2009) , where this is applied only to the first sentence (as WIKIPEDIA does not directly provide a concept of \"abstract\")." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-47", "text": "All categories thus obtained are added to P and will be used in Section 3.4 to generate additional entities." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-48", "text": "Finally, we are interested in redirection information (Lines 10-11) about an article for a seed entity as that provides such with synonyms, plural forms, different spellings, etc." 
}, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-49", "text": "Fortunately, DB-PEDIA provides this information directly by means of the dbpedia-owl:wikiPageRedirects property." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-50", "text": "The labels of all redirect articles with this property pointing to a seed entity articles are directly added to the Gazetteer." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-51", "text": "----------------------------------" }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-52", "text": "**EXPLICIT: CATEGORY GRAPH**" }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-53", "text": "In addition to categories from the abstract analysis, we also use the category graph of DBPEDIA." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-54", "text": "It has been previously observed, (Zhang and Iria, 2009 ) and (Strube and Ponzetto, 2006) , that the category graph of poor quality." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-55", "text": "DBPEDIA improves little on that fact." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-56", "text": "However, EAGER uses a sophisticated analysis of categories related to seed entities that allows us to prune most of the noise in the category graph." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-57", "text": "Biased towards precision over recall, Section 4 shows that combined with the category extraction from abstracts it provides a significantly extended Gazetteer without introducing substantial noise." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-58", "text": "The fundamental contribution of EAGER is a category pruning based on finding a connected component in the graph of related categories that is supported by as many different entities from the seed list as possible." 
}, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-59", "text": "Figure 1 illustrates this further: From the articles for the seed entities, we compute (Line 12) the direct categories (via subject edges) and associate them to their seed entities e via Cats(e)." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-60", "text": "We extent this set (Lines 13-14) with all categories in the k-neighbourhood (here, we use k = 3), i.e., connected via up to k broader edges traversed in any direction, again maintaining via Cats(e) which categories are reached from which seed entity e. In the resulting graph of all such categories, we identify the connected component with maximum support (Lines 15-19) ." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-61", "text": "The support of a component is the sum of the support of its categories." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-62", "text": "The support of a category c is the number of seed entities with c \u2208 Cats(e)." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-63", "text": "For Figure 1 , this yields the category graph of the blue and black categories of the figure." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-64", "text": "The blue categories form the connected component with maximum support and are thus retained (in P), the black categories are dropped." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-65", "text": "----------------------------------" }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-66", "text": "**ENTITIES FROM CATEGORIES**" }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-67", "text": "Finally, in Lines 21-24, EAGER completes the gazetteer extension by extracting the labels of all articles of categories in P if they are sufficiently unambiguous." 
}, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-68", "text": "An article is called sufficiently unambiguous, if it is categorised only with categories from P up to a threshold \u03b8 (here, set to 5) of non-P categories." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-69", "text": "This avoids adding very general entities that tend to have large number of categories in WIKIPEDIA and thus DBPEDIA." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-70", "text": "The output of Algorithm 3 is the extended gazetteers G ." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-71", "text": "----------------------------------" }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-72", "text": "**EVALUATION**" }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-73", "text": "To evaluate the impact of EAGER on entity recognition, we performed a large set of experiments (on the archeology domain)." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-74", "text": "The experiment domains and (Zhang and Iria, 2009 ), which we outperform for all entity types, in some cases up to 5% in F 1 score." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-75", "text": "----------------------------------" }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-76", "text": "**EVALUATION SETUP**" }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-77", "text": "In this experiment, we consider entity recognition in the domain of archaeology." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-78", "text": "As part of this effort, (Jeffrey et al., 2009) identified three types of entities that are most useful for archaeological research; Subject(SUB), Temporal Terms(TEM), Location (LOC)." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-79", "text": "In this evaluation, we use the same setup as in (Zhang and Iria, 2009 ): A corpus of 30 full length UK archaeological reports archived by the Arts and Humanities Data Service (AHDS)." 
}, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-80", "text": "The length of the documents varies from 4 to 120 pages." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-81", "text": "The corpus is inter-annotated by three archaeologists." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-82", "text": "----------------------------------" }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-83", "text": "**RESULT**" }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-84", "text": "For the evaluation, we perform a 5-fold validation on the above corpus." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-85", "text": "The evaluate the performance (in terms of precision, recall and F 1 score) for entity recognition of the baseline system as well as the baseline system extended with a gazetteer feature." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-86", "text": "For the latter, we consider full EAGER as described in Section 3 as well as only the entities derived from dependency analysis of abstracts, from the category graph, and from redirection information." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-87", "text": "Finally, we also include the performance numbers report in (Zhang and Iria, 2009 ) for comparison (since we share their evaluation settings)." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-88", "text": "Table 1 show the results of the comparison: EA-GER significantly improves precision and recall over the baseline system and outperforms (Zhang and Iria, 2009 ) in all cases." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-89", "text": "Furthermore, the impact of all three types of information (dependencies from abstract, category, redirection) of EAGER individually is quite notable with a slight disadvantage for category information." 
}, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-90", "text": "However, in all cases the combination of all three types as proposed in EAGER shows a significant further increase in performance." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-91", "text": "----------------------------------" }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-92", "text": "**CONCLUSION**" }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-93", "text": "At its heart, EAGER is a novel algorithm for extending sets of entities of a specific type with additional entities of that type extracted from DBPE-DIA." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-94", "text": "It is based on a new strategy for pruning the category graph in DBPEDIA (and thus WIKIPEDIA), necessary to address the inherent noise." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-95", "text": "Our evaluation shows that EAGER can significantly improve the performance of entity recognition and outperforms existing systems in all cases." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-96", "text": "Unlike previous approaches, our approach makes use of richer content and structural elements of DBpedia." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-97", "text": "We believe that EAGER is a strong indicator that DBPEDIA provides a much richer, yet easier to use foundation for NLP tasks in general than WIKIPEDIA." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-98", "text": "The extensibility and domain adaptability of our methods still need further investigation." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-99", "text": "We are currently extending the evaluation to several other domains, including property descriptions in real estate and classified adds." }, { "sent_id": "b7278824bdae498021b899fbc6c638-C001-100", "text": "We are also investigating more targeted means of detecting and addressing noise in the category graph." 
} ], "y": { "@DIF@": { "gold_contexts": [ [ "b7278824bdae498021b899fbc6c638-C001-20" ], [ "b7278824bdae498021b899fbc6c638-C001-46" ], [ "b7278824bdae498021b899fbc6c638-C001-74" ], [ "b7278824bdae498021b899fbc6c638-C001-88" ] ], "cite_sentences": [ "b7278824bdae498021b899fbc6c638-C001-20", "b7278824bdae498021b899fbc6c638-C001-46", "b7278824bdae498021b899fbc6c638-C001-74", "b7278824bdae498021b899fbc6c638-C001-88" ] }, "@SIM@": { "gold_contexts": [ [ "b7278824bdae498021b899fbc6c638-C001-29" ], [ "b7278824bdae498021b899fbc6c638-C001-41" ], [ "b7278824bdae498021b899fbc6c638-C001-53", "b7278824bdae498021b899fbc6c638-C001-54", "b7278824bdae498021b899fbc6c638-C001-55" ] ], "cite_sentences": [ "b7278824bdae498021b899fbc6c638-C001-29", "b7278824bdae498021b899fbc6c638-C001-41", "b7278824bdae498021b899fbc6c638-C001-54" ] }, "@MOT@": { "gold_contexts": [ [ "b7278824bdae498021b899fbc6c638-C001-46" ] ], "cite_sentences": [ "b7278824bdae498021b899fbc6c638-C001-46" ] }, "@BACK@": { "gold_contexts": [ [ "b7278824bdae498021b899fbc6c638-C001-54" ] ], "cite_sentences": [ "b7278824bdae498021b899fbc6c638-C001-54" ] }, "@USE@": { "gold_contexts": [ [ "b7278824bdae498021b899fbc6c638-C001-79" ], [ "b7278824bdae498021b899fbc6c638-C001-87" ] ], "cite_sentences": [ "b7278824bdae498021b899fbc6c638-C001-79", "b7278824bdae498021b899fbc6c638-C001-87" ] } } }, "ABC_6ed955baf28ad1c7fd6d590e660c20_7": { "x": [ { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-45", "text": "." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-2", "text": "Dependency-based Compositional Semantics (DCS) provides a precise and expressive way to model semantics of natural language queries on relational databases, by simple dependency-like trees." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-3", "text": "Recently abstract denotation is proposed to enable generic logical inference on DCS." 
}, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-4", "text": "In this paper, we discuss some other possibilities to equip DCS with logical inference, and we discuss further on how logical inference can help textual entailment recognition, or other semantic precessing tasks." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-5", "text": "----------------------------------" }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-7", "text": "Dependency-based Compositional Semantics (DCS) was proposed as an interface for querying relational databases by natural language." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-8", "text": "It features DCS trees as semantic representation, with a structure similar to dependency trees." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-9", "text": "In its basic version, a node of a DCS tree indicates a table in the database, and an edge indicates a join relation." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-10", "text": "Both ends of an edge are labeled by a field of the corresponding table (Liang et al., 2011) ." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-11", "text": "However, when DCS is applied to logical inference on unrestricted texts, it is unrealistic to assume an explicit database, because we cannot prepare a database for everything in the world." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-12", "text": "For this reason, DCS trees are detached from any specific relational database, in a way that each node of a DCS tree indicates a content word in a sentence (thus no fixed set of possible word labels for a DCS tree node), and each edge indicates a semantic relation between two words." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-13", "text": "Labels on the two ends of an edge, initially indicating fields of tables in a database, are considered as semantic roles of the corresponding words." 
}, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-14", "text": "Abstract denotation is proposed to capture the meaning of this abstract version of DCS tree, and a textual inference system based on abstract denotation is built (Tian et al., 2014) ." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-15", "text": "It is quite natural to apply DCS trees, a simple and expressive semantic representation, to textual inference; however the use of abstract denotations to convey logical inference is somehow unusual." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-16", "text": "There are two seemingly obvious way to equip DCS with logical inference: (i) at the tree level, by defining a set of logically sound transformations of DCS trees; or (ii) at the logic level, by converting DCS trees to first order predicate logic (FOL) formulas and then utilizing a theorem prover." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-17", "text": "For (i), it may not be easy to enumerate all types of logically sound transformations, but tree transformations can be seen as an approximation of logical inference." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-18", "text": "For (ii), abstract denotation is more efficient than FOL formula, because abstract denotation eliminates quantifiers and meanings of natural language texts can be represented by atomic sentences." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-19", "text": "To elaborate the above discussion and to provide more topics to the literature, in this paper we discuss the following four questions: ( \u00a72) How well can tree transformation approximate logical inference? ( \u00a73) With rigorous inference on DCS trees, where does logic contribute in the system of Tian et al. (2014) In the tree transformation based approach to RTE, it has been realized that some gaps between T and H cannot be filled even by a large number of tree transformation rules extracted from corpus (BarHaim et al., 2007a) ." 
}, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-20", "text": "For example in Figure 1 , it is possible to extract the rule blamed for death \u2192 cause loss of life, but not easy to extract tropical storm Debby \u2192 storm, because \"Debby\" could be an arbitrary name which may not even appear in the corpus." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-21", "text": "This kind of gaps was typically addressed by approximate matching methods, for example by counting common sub-graphs of T and H, or by computing a cost of tree edits that convert T to H. In the example of Figure 1 , we would expect that T is \"similar enough\" (i.e. has many common sub-graphs) with H, or the cost to convert T into H (e.g. by deleting the node Debby and then add the node storm) is low." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-22", "text": "As for how similar is enough, or how the cost is evaluated, we will need a statistical model to train on RTE development set." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-23", "text": "It was neglected that some combinations of tree edits are logical (while some are not)." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-24", "text": "The entailment pair in Figure 1 can be easily treated by logical inference, as long as the apposition tropical storm = Debby is appropriately handled." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-25", "text": "In contrast to graph matching or tree edit models which theoretically admit arbitrary tree transformation, logical inference clearly discriminate sound transformations from unsound ones." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-26", "text": "In this sense, there would be no need to train on RTE data." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-27", "text": "When coreference is considered, logically sound tree transformations can be quite complicated." 
}, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-28", "text": "The following is a modified example from RTE2-dev:" }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-29", "text": "T: Hurricane Isabel, which caused significant damage, was a tropical storm when she entered Virginia." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-30", "text": "Note the coreference between Hurricane Isabel and she, suggesting us to copy the subtree of Hurricane Isabel to she, in a tree edit approach." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-31", "text": "This is not enough yet, because the head storm in T is not placed at the subject of cause." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-32", "text": "The issue is indeed very logical: from \"Hurricane Isabel = she\", \"Hurricane Isabel = storm\", \"she = subject of enter\" and \"Hurricane Isabel = subject of cause\", we can imply that \"storm = subject of enter = subject of cause\"." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-33", "text": "3 Alignment with logical clues Tian et al. (2014) proposed a way to generate onthe-fly knowledge to fill knowledge gaps: if H is not proven, compare DCS trees of T and H to generate path alignments (e.g. blamed for death \u223c cause loss of life, as underscored in Figure 1) ; evaluate the path alignments by a similarity score function; and path alignments with a score greater than a threshold (0.4) are accepted and converted to inference rules." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-34", "text": "The word vectors Tian et al. (2014) use to calculate similarities are reported able to capture semantic compositions by simple additions and subtractions (Mikolov et al., 2013) ." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-35", "text": "This is also the case when used as knowledge resource for RTE, for example the similarities between blamed+death and cause+loss+life, or between found+shot+dead and killed, are computed > 0.4." 
}, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-36", "text": "However, generally such kind of similarity is very noisy." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-37", "text": "Tian et al. (2014) used some logical clues to filter out irrelevant path alignments, which helps to keep a high precision." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-38", "text": "To evaluate the effect of such logical filters, we compare it with some other alignment strategies, the performance of which on RTE5-test data is shown in Table 1 ." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-39", "text": "Each strategy is described in the following." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-40", "text": "LexNoun + Inference The same system as above, except that we only align paths between lexically aligned nouns." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-41", "text": "Two nouns are aligned if and only if they are synonyms, hyponyms or derivatively related in WordNet." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-42", "text": "LexNoun + Coverage As above, paths between lexically aligned nouns are aligned, and aligned paths with similarity score > 0.4 are accepted." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-43", "text": "If all nodes in H can be covered by some accepted path alignments, then output \"Y\"." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-44", "text": "This is very similar to the system described in Bar-Haim et al." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-46", "text": "NoFilter + Coverage Same as above, but all paths alignments with similarity score > 0.4 are accepted." 
}, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-47", "text": "----------------------------------" }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-48", "text": "**HOW CAN LOGICAL INFERENCE HELP RTE?**" }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-49", "text": "Logical inference is shown to be useful for RTE, as Tian et al. (2014) demonstrates a system with competitive results." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-50", "text": "However, despite the expectation that all entailment matters can be explained logically, our observation is that currently logical inference only fills very limited short gaps from T to H. The logical phenomena easily addressed by Tian et al. (2014) Table 2 : Proportion (%) of exit status of Prover9" }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-51", "text": "The system of Tian et al. (2014) generated onthe-fly knowledge to join several fragments in T and wrongly proved H. In examples of such complexity, distributional similarity is no longer reliable." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-52", "text": "However, it may be possible to build a priori logical models at the meta level, such as on epistemic, intentional and reportive attitudes." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-53", "text": "The models then can provide signals for semantic parsing to connect the logic to natural language, such as the words \"grant\", \"decertify\", and \"accuse\" in the above example." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-54", "text": "We hope this approach can bring new progress to RTE and other semantic processing tasks." 
}, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-55", "text": "----------------------------------" }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-56", "text": "**EFFICIENCY OF ABSTRACT DENOTATIONS**" }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-57", "text": "To evaluate the efficiency of logical inference on abstract denotations, we took 110 true entailment pairs from RTE5 development set, which are also pairs that can be proven with on-the-fly knowledge." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-58", "text": "We plot the running time of Tian et al. (2014) 's inference engine (single-threaded) on a 2.27GHz Xeon CPU, with respect to the weighted sum of all statements 2 , as shown in Figure 3 ." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-59", "text": "The graph shows all pairs can be proven in 6 seconds, and proof time scales logarithmically on weight of statements." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-60", "text": "On the other hand, we converted statements on abstract denotations into FOL formulas, and tried to prove the same pairs using Prover9, 3 a popu-lar FOL theorem prover." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-61", "text": "As the result turns out (Table 2) , only 8% of the pairs can be proven in 3 seconds (the \"Orig." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-62", "text": "3 Sec.\" column), and only 16% pairs can be proven in 5 minutes (the \"Orig." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-63", "text": "5 Min.\" column), showing severe difficulties for an FOL prover to handle textual inferences with many (usually hundreds of) on-the-fly rules." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-64", "text": "As such, we use Tian et al. (2014) 's inference engine to pin down statements that are actually needed for proving H (usually just 2 or 3 statements), and try to prove H by Prover9 again, using only necessary statements." 
}, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-65", "text": "Proven pairs in 5 minutes then jump to 82% (the \"Red. 5 Min.\" column), showing that a large number of on-the-fly rules may drastically increase computation cost." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-66", "text": "Still, nearly 20% pairs cannot be proven even in this setting, suggesting that traditional FOL prover is not suited for textual inference." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-67", "text": "----------------------------------" }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-68", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-69", "text": "We have discussed the role that logical inference could play in RTE task, and the efficiency of performing inference on abstract denotations." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-70", "text": "Though currently logical inference contributes at places that are somehow inconspicuous, there is the possibility that with some meta level logical models and the methodology of semantic parsing, we can build systems that understand natural language texts deeply: logic implies (in)consistency, which is in turn used as signals to produce more accurate semantic interpretation." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-71", "text": "And after all, as there may be many possible variations of semantic representations, it is good to have an efficient inference framework that has the potential to connect them." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-72", "text": "It would be exciting if we can combine different types of structured data with natural language in semantic processing tasks." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-73", "text": "Directions of our future work are described below." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-74", "text": "Improvement of similarity score To calculate phrase similarities, Tian et al. 
(2014) use the cosine similarity of sums of word vectors, which ignores syntactic information." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-75", "text": "We plan to add syntactic information to words by some supertags, and learn a vector space embedding for this structure." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-76", "text": "Integration of FreeBase to RTE It would be exciting if we can utilize the huge amount of FreeBase data in RTE task." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-77", "text": "Using the framework of abstract denotation, meanings of sentences can be explained as relational database queries; to convert it to FreeBase data queries is like relational to ontology schema matching." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-78", "text": "In order to make effective use of FreeBase data, we also need to recognize entities and relations in natural language sentences." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-79", "text": "Previous research on semantic parsing will be very helpful for learning such mapping." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-80", "text": "Winograd Schema Challenge (WSC) As the RTE task, WSC (Levesque et al., 2012 ) also provides a test bed for textual inference systems." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-81", "text": "A Winograd schema is a pair of similar sentences but contain an ambiguity of pronouns that is resolved in opposite ways." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-82", "text": "A complicated partial example is:" }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-83", "text": "Michael decided to freeze himself in cryo-stasis even though his father was against it, because he hopes to be unfrozen in the future when there is a cure available." 
}, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-84", "text": "The logical interplay among decided, hopes, even though, because, and the realization that he is coreferent to Michael (but not his father) is intriguing." }, { "sent_id": "6ed955baf28ad1c7fd6d590e660c20-C001-85", "text": "By working on the task, we hope to gain further understanding on how knowledge can be gathered and applied in natural language reasoning." } ], "y": { "@USE@": { "gold_contexts": [ [ "6ed955baf28ad1c7fd6d590e660c20-C001-14" ], [ "6ed955baf28ad1c7fd6d590e660c20-C001-58" ], [ "6ed955baf28ad1c7fd6d590e660c20-C001-64" ] ], "cite_sentences": [ "6ed955baf28ad1c7fd6d590e660c20-C001-14", "6ed955baf28ad1c7fd6d590e660c20-C001-58", "6ed955baf28ad1c7fd6d590e660c20-C001-64" ] }, "@BACK@": { "gold_contexts": [ [ "6ed955baf28ad1c7fd6d590e660c20-C001-19" ], [ "6ed955baf28ad1c7fd6d590e660c20-C001-33" ], [ "6ed955baf28ad1c7fd6d590e660c20-C001-34" ], [ "6ed955baf28ad1c7fd6d590e660c20-C001-37" ], [ "6ed955baf28ad1c7fd6d590e660c20-C001-49" ], [ "6ed955baf28ad1c7fd6d590e660c20-C001-50" ], [ "6ed955baf28ad1c7fd6d590e660c20-C001-51" ], [ "6ed955baf28ad1c7fd6d590e660c20-C001-74" ] ], "cite_sentences": [ "6ed955baf28ad1c7fd6d590e660c20-C001-19", "6ed955baf28ad1c7fd6d590e660c20-C001-33", "6ed955baf28ad1c7fd6d590e660c20-C001-34", "6ed955baf28ad1c7fd6d590e660c20-C001-37", "6ed955baf28ad1c7fd6d590e660c20-C001-49", "6ed955baf28ad1c7fd6d590e660c20-C001-50", "6ed955baf28ad1c7fd6d590e660c20-C001-51", "6ed955baf28ad1c7fd6d590e660c20-C001-74" ] }, "@UNSURE@": { "gold_contexts": [ [ "6ed955baf28ad1c7fd6d590e660c20-C001-58", "6ed955baf28ad1c7fd6d590e660c20-C001-59" ] ], "cite_sentences": [ "6ed955baf28ad1c7fd6d590e660c20-C001-58" ] }, "@SIM@": { "gold_contexts": [ [ "6ed955baf28ad1c7fd6d590e660c20-C001-58", "6ed955baf28ad1c7fd6d590e660c20-C001-59" ] ], "cite_sentences": [ "6ed955baf28ad1c7fd6d590e660c20-C001-58" ] }, "@EXT@": { "gold_contexts": [ [ "6ed955baf28ad1c7fd6d590e660c20-C001-74", 
"6ed955baf28ad1c7fd6d590e660c20-C001-75" ] ], "cite_sentences": [ "6ed955baf28ad1c7fd6d590e660c20-C001-74" ] } } }, "ABC_926e7df3c367ae29da574ba465504f_7": { "x": [ { "sent_id": "926e7df3c367ae29da574ba465504f-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-2", "text": "Topics generated by topic models are usually represented by lists of t terms or alternatively using short phrases and images." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-3", "text": "The current state-of-the-art work on labeling topics using images selects images by re-ranking a small set of candidates for a given topic." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-4", "text": "In this paper, we present a more generic method that can estimate the degree of association between any arbitrary pair of an unseen topic and image using a deep neural network." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-5", "text": "Our method has better runtime performance O(n) compared to O(n 2 ) for the current state-of-the-art method, and is also significantly more accurate." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-6", "text": "----------------------------------" }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-8", "text": "Topic models (Blei et al., 2003) are a popular method for organizing and interpreting large document collections by grouping documents into various thematic subjects (e.g. sports, politics or lifestyle) called topics." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-9", "text": "Topics are multinomial distributions over a predefined vocabulary whereas documents are represented as probability distributions over topics." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-10", "text": "Topic models have proved to be an elegant way to build exploratory interfaces (i.e. topic browsers) for visualizing document collections." 
}, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-11", "text": "Topic browsers present to the users lists of topics (Gretarsson et al., 2012; Chaney and Blei, 2012; Ganguly et al., 2013; Snyder et al., 2013; Sievert and Shirley, 2014; Smith et al., 2014) where they can select documents of a particular topic of interest." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-12", "text": "A topic is traditionally represented by a list of t terms with the highest probability." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-13", "text": "In recent works, short phrases (Lau et al., 2011) or images (Aletras and Stevenson, 2013) have been used as alternatives." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-14", "text": "Particularly, images offer a language independent representation of the topic which can also be complementary to textual labels." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-15", "text": "Aletras et al. (2015) showed that the visual representation of a topic is as effective as the textual labels on retrieving information using a topic browser while it can be understood quickly by the users." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-16", "text": "The method presented by Aletras and Stevenson (2013) selects an image from a small set of candidates by re-ranking them using an unsupervised graph-based method." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-17", "text": "It is an iterative method that has a runtime complexity of O(n 2 ) which makes it infeasible to run over large number of images." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-18", "text": "Hence, for efficiency the candidate images are selected a priori using an information retrieval engine." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-19", "text": "Thus the scope of this method gets limited to solving a local problem of re-ordering a small set of candidate images for a given topic." 
}, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-20", "text": "Furthermore, its accuracy is limited by the recall of the information retrieval engine." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-21", "text": "In this work, we present a more generic method that directly estimates the appropriateness of any arbitrary pair of topic and image." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-22", "text": "We refer to this method as a global method to differentiate it from the localized approach described above." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-23", "text": "We utilize a Deep Neural Network (DNN) to estimate the suitability of an image for labeling a given topic." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-24", "text": "DNNs have proved to be effective in various NLP tasks (Collobert and Weston, 2008; Socher et al., 2011) ." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-25", "text": "They combine multiple layers that perform non-linear transformations to the data allowing the automatic learning of high-level abstractions." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-26", "text": "Our proposed feed-forward network has a sequential architecture." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-27", "text": "It takes as input the topic, the visual and the associated textual information (i.e. caption) of an image." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-28", "text": "Topic and image textual information are represented as the mean vector of their constituent word embeddings while visual information of the image is also represented as a dense vector." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-29", "text": "The interconnected hidden layers allow to model nonlinearities while the output layer is a scalar which gives the estimate of how good the image is as a label for the topic." 
}, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-30", "text": "At run-time our method computes dot products between various features and the model weights to obtain the relevance score, that gives it an order complexity of O(n)." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-31", "text": "Hence, it is suitable for using it over large image sources such as Flickr 1 , Getty 2 or ImageNet (Deng et al., 2009) ." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-32", "text": "We evaluate our model on a standard data set of topics and images and obtain state-of-the-art results for labeling completely unseen topics with images." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-33", "text": "----------------------------------" }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-34", "text": "**MODEL**" }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-35", "text": "For a topic T and an image I, we want to compute a real value s \u2208 R that denotes how good the image I is for representing the topic T ." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-36", "text": "T consists of ten terms (t) with the highest probability for the topic." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-37", "text": "We denote the visual information for the image as V ." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-38", "text": "Furthermore, the image also is associated with textual information from its caption, C." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-39", "text": "Textual Representation." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-40", "text": "For the topic T = {t 1 , t 2 , ..., t 10 } and the image caption C = {c 1 , c 2 , ..., c n }, each term is transformed into a vector x \u2208 R d where d is the dimensionality of the distributed semantic space." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-41", "text": "We use pre-computed dependency-based word embeddings (Levy and Goldberg, 2014) whose d is 300." 
}, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-42", "text": "The resulting representations of T and C are the mean vectors of their constituent words, x t and x c respectively." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-43", "text": "Visual Representation." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-44", "text": "The visual information from the image V is converted into a dense vectorized representation, x v ." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-45", "text": "That is the output of the publicly available 16-layer VGG-net (Simonyan and Zisserman, 2014) trained over the ImageNet dataset (Deng et al., 2009 )." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-46", "text": "VGG-net provides a 1000 di-mensional vector which is the soft-max classification output of ImageNet classes." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-47", "text": "Input Layer." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-48", "text": "The input to the network is the concatenation of topic, caption and visual vectors." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-49", "text": "i.e.," }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-50", "text": "This results in a 1600-dimensional input vector." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-51", "text": "Hidden Layers." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-52", "text": "Then, X is passed through a series of four hidden layers, H 1 , ..., H 4 , with non-layer activations to capture higher level similarities." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-53", "text": "In this way the network learns a combined representation of topics and images and the non-linear relationships that they share." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-54", "text": "where g is the rectified linear unit (ReLU) and h 0 = X. The output of each hidden layer is regularized using dropout (Srivastava et al., 2014 Output Layer." 
}, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-55", "text": "The output layer of the network maps the input to a real value s \u2208 R that denotes how good the image I is for the topic T ." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-56", "text": "The network is trained by minimizing the mean absolute error:" }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-57", "text": "where s g is the ground-truth relevance value." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-58", "text": "The network is optimized using a standard mini-batch gradient descent method with RMSProp adaptive learning rate algorithm (Tieleman and Hinton, 2012) ." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-59", "text": "The architecture of our network is shown in Figure 1 ." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-60", "text": "----------------------------------" }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-61", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-62", "text": "Data." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-63", "text": "We evaluate our model on the publicly available data set provided by Aletras and Stevenson (2013) ." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-64", "text": "It consists of 300 topics generated using Wikipedia articles and news articles taken from the New York Times." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-65", "text": "Each topic is represented by ten terms with the highest probability." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-66", "text": "They are also associated with 20 candidate image labels and their human ratings between 0 (lowest) and 3 (highest) denoting the appropriateness of these images for the topic." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-67", "text": "That results into a total of 6K images and their associated textual metadata which are considered as captions." 
}, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-68", "text": "The task is to choose the image with the highest rating from the set of the 20 candidates for a given topic." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-69", "text": "Negative examples sampling." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-70", "text": "The 20 candidate image labels per topic are collected by Aletras and Stevenson (2013) using an information retrieval engine (Google)." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-71", "text": "Hence most of them are expected to be relevant to the topic." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-72", "text": "This jeopardizes the training of our supervised model due to the lack of sufficient negative examples." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-73", "text": "To address this issue we generate extra negative examples." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-74", "text": "For each topic we sample another 20 images from random topics in the training set and assign them a relevance score of 0." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-75", "text": "These extra images are added in the training data." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-76", "text": "Evaluation Metrics." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-77", "text": "Our evaluation follows prior work (Lau et al., 2011; Aletras and Stevenson, 2013) using two metrics." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-78", "text": "The Top-1 average rating is the average human rating assigned to the top-ranked label proposed by the topic labeling method." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-79", "text": "This metric provides an indication of the overall quality of the label selected and takes values from 0 (irrelevant) to 3 (relevant)." 
}, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-80", "text": "The normalized discounted cumulative gain (nDCG) compares the label ranking proposed by the labeling method to the goldstandard ranking provided by the human annotators (J\u00e4rvelin and Kek\u00e4l\u00e4inen, 2002; Croft et al., 2009) ." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-81", "text": "Its value varies between 0 (lowest) to 1 (highest)." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-82", "text": "Model Parameters." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-83", "text": "We set the dropout value to 0.2 which randomly sets 20% of the input units to 0 at each update during the training time." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-84", "text": "We train the model in a 5-fold cross-validation for 30 epochs and set the batch size for training data to 16." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-85", "text": "In each fold, data from 240 topics are used for training and the rest completely unseen 60 topics are used for testing." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-86", "text": "----------------------------------" }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-87", "text": "**RESULTS AND DISCUSSION**" }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-88", "text": "We compare our approach to the state-of-the-art method that uses Personalized PageRank (Aletras and Stevenson, 2013) to re-rank image candidates." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-89", "text": "We also test a relevant approach originally proposed for image annotation that learns a joint model of text and image features (Weston et al., 2010) ." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-90", "text": "Finally, we test two versions of our own DNN using only either the caption (DNN (Topic+Caption)) or the visual information of the image (DNN (Topic+VGG))." 
}, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-91", "text": "We adapt the original method of Aletras and Stevenson (2013) to compute the PageRank scores of all the available images in the test set of each fold for each topic (Global PPR)." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-92", "text": "We also compare with their original local method where the graph consists of only 20 candidate images per topic (Local PPR)." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-93", "text": "We also train a WSABIE model, adapted to our task, by optimizing the WARP loss (Weston et al., 2010) ." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-94", "text": "To predict the rating or preference score for an unseen item, we compute the dot product of the topic terms and image metadata embeddings in the usual way for a matrix factorization recommender, but then we add the dot product of the topic embeddings and the low-dimensional mapping of the image embeddings." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-95", "text": "Table 1 shows the Top-1 average and nDCG scores obtained." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-96", "text": "First, we observe that the DNN methods perform better for both the evaluation metrics compared to the other methods." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-97", "text": "They achieve a Top-1 average rating between 1.94 and 2.12 com-" }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-98", "text": "----------------------------------" }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-99", "text": "**MODEL**" }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-100", "text": "Top-1 aver. rating nDCG-1 nDCG-3 nDCG-5 Global PPR (Aletras and Stevenson, 2013) 1 (Aletras and Stevenson, 2013) 2.24 --- The DNN (Topic+Caption) model that uses only textual information, obtains a Top-1 Average performance of 1.94." 
}, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-101", "text": "Incorporating visual information (VGG) improves it to 2.12 (DNN (Topic+Caption+VGG))." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-102", "text": "An interesting finding is that using only the visual information (DNN (Topic+VGG)) achieves better results (2.04) compared to using only text." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-103", "text": "This demonstrates that images contain less noisy information compared to their captions for this particular task." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-104", "text": "The DNN models also provide a better ranking for the image candidates." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-105", "text": "The nDCG scores for the majority of the DNN methods are higher than the other methods." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-106", "text": "DNN (Topic+Caption+VGG) consistently obtains the best nDCG scores, 0.79, 0.80 and 0.81 respectively." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-107", "text": "Figure 2 shows two topics and the top-3 images selected by the DNN (Topic+Caption+VGG) model from the candidate set." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-108", "text": "The labels selected for the topic #288 are all very relevant to a Surgical operation." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-109", "text": "On the other hand, the images selected for topic #99 are irrelevant to Wedding photography." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-110", "text": "The main problem is that the set of candidate images do not contain any relevant ones." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-111", "text": "However, in this situation our model can identify other images that might be good labels which do not belong in the original set of candidates." 
}, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-112", "text": "----------------------------------" }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-113", "text": "**CONCLUSION**" }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-114", "text": "We presented a deep neural network that jointly models textual and visual information for the task of topic labeling with images." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-115", "text": "Our model is generic and works for any unseen pair of topic and image." }, { "sent_id": "926e7df3c367ae29da574ba465504f-C001-116", "text": "Our evaluation results show that our proposed approach significantly outperforms the state-of-the-art method of Aletras and Stevenson (2013) and a relevant method originally utilized for image annotation proposed by Weston et al. (2010) ." } ], "y": { "@BACK@": { "gold_contexts": [ [ "926e7df3c367ae29da574ba465504f-C001-13" ], [ "926e7df3c367ae29da574ba465504f-C001-16", "926e7df3c367ae29da574ba465504f-C001-17", "926e7df3c367ae29da574ba465504f-C001-19", "926e7df3c367ae29da574ba465504f-C001-20" ], [ "926e7df3c367ae29da574ba465504f-C001-70" ] ], "cite_sentences": [ "926e7df3c367ae29da574ba465504f-C001-13", "926e7df3c367ae29da574ba465504f-C001-16", "926e7df3c367ae29da574ba465504f-C001-17", "926e7df3c367ae29da574ba465504f-C001-19", "926e7df3c367ae29da574ba465504f-C001-20", "926e7df3c367ae29da574ba465504f-C001-70" ] }, "@USE@": { "gold_contexts": [ [ "926e7df3c367ae29da574ba465504f-C001-63" ], [ "926e7df3c367ae29da574ba465504f-C001-77" ], [ "926e7df3c367ae29da574ba465504f-C001-88" ] ], "cite_sentences": [ "926e7df3c367ae29da574ba465504f-C001-63", "926e7df3c367ae29da574ba465504f-C001-77", "926e7df3c367ae29da574ba465504f-C001-88" ] }, "@EXT@": { "gold_contexts": [ [ "926e7df3c367ae29da574ba465504f-C001-91" ] ], "cite_sentences": [ "926e7df3c367ae29da574ba465504f-C001-91" ] }, "@DIF@": { "gold_contexts": [ [ "926e7df3c367ae29da574ba465504f-C001-100" ], [ 
"926e7df3c367ae29da574ba465504f-C001-116" ] ], "cite_sentences": [ "926e7df3c367ae29da574ba465504f-C001-100", "926e7df3c367ae29da574ba465504f-C001-116" ] } } }, "ABC_2eeffe385539b28c7d31eeb176e926_7": { "x": [ { "sent_id": "2eeffe385539b28c7d31eeb176e926-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "2eeffe385539b28c7d31eeb176e926-C001-2", "text": "The use of large pretrained neural networks to create contextualized word embeddings has drastically improved performance on several natural language processing (NLP) tasks." }, { "sent_id": "2eeffe385539b28c7d31eeb176e926-C001-3", "text": "These computationally expensive models have begun to be applied to domainspecific NLP tasks such as re-hospitalization prediction from clinical notes." }, { "sent_id": "2eeffe385539b28c7d31eeb176e926-C001-4", "text": "This paper demonstrates that using large pretrained models produces excellent results on common learning analytics tasks." }, { "sent_id": "2eeffe385539b28c7d31eeb176e926-C001-5", "text": "Pre-training deep language models using student forum data from a wide array of online courses improves performance beyond the state of the art on three text classification tasks." }, { "sent_id": "2eeffe385539b28c7d31eeb176e926-C001-6", "text": "We also show that a smaller, distilled version of our model produces the best results on two of the three tasks while limiting computational cost." }, { "sent_id": "2eeffe385539b28c7d31eeb176e926-C001-7", "text": "We make both models available to the research community at large." 
}, { "sent_id": "2eeffe385539b28c7d31eeb176e926-C001-8", "text": "1" }, { "sent_id": "2eeffe385539b28c7d31eeb176e926-C001-9", "text": "----------------------------------" }, { "sent_id": "2eeffe385539b28c7d31eeb176e926-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "2eeffe385539b28c7d31eeb176e926-C001-11", "text": "In the past year, the field of Natural Language Processing (NLP) has seen the rise of pretrained language models such as as ELMo (Peters et al., 2018) , ULMFiT (Howard and Ruder, 2018) and BERT (Devlin et al., 2019) ." }, { "sent_id": "2eeffe385539b28c7d31eeb176e926-C001-12", "text": "These approaches train a deep-learning language model on large volumes of unlabeled text, which is subsequently fine-tuned for particular NLP tasks." }, { "sent_id": "2eeffe385539b28c7d31eeb176e926-C001-13", "text": "Applying these models to the General Language Understanding Evaluation (GLUE) benchmark introduced by Wang et al. (2018) has achieved the best performance to date on tasks ranging from sentiment classification to question answering (Devlin et al., 2019) ." }, { "sent_id": "2eeffe385539b28c7d31eeb176e926-C001-35", "text": "Training of the models was performed on a Titan X GPU." }, { "sent_id": "2eeffe385539b28c7d31eeb176e926-C001-14", "text": "The benefit of these models has also been demonstrated in specialized NLP domains." }, { "sent_id": "2eeffe385539b28c7d31eeb176e926-C001-15", "text": "BioBERT (Lee et al., 2019) , a version of BERT trained exclusively on biomedical text, was able to significantly increase performance on biomedical named entity recognition." }, { "sent_id": "2eeffe385539b28c7d31eeb176e926-C001-16", "text": "Further refining this model on clinical text produced an increase in performance in medical natural language inference (Alsentzer et al. 2019) ." 
}, { "sent_id": "2eeffe385539b28c7d31eeb176e926-C001-17", "text": "While large pretrained models offer significantly increased performance, they come with their own constraints, as the number of parameters in the classic BERT-base model exceeds 100 million." }, { "sent_id": "2eeffe385539b28c7d31eeb176e926-C001-18", "text": "As such, their computational cost can thus be prohibitively high at both training and prediction time (Devlin et al., 2019) ." }, { "sent_id": "2eeffe385539b28c7d31eeb176e926-C001-19", "text": "More recent work has addressed this challenge by 'distilling' the models, training smaller versions of BERT which reduce the number of parameters to train by 40% while retaining more than 95% of the full model performance and even outperforming it on two out of eleven GLUE tasks ." }, { "sent_id": "2eeffe385539b28c7d31eeb176e926-C001-20", "text": "This paper shows that using pretrained models in learning analytics holds great potential for advancing the field." }, { "sent_id": "2eeffe385539b28c7d31eeb176e926-C001-21", "text": "We apply the BERT approach to the following three previously explored LAK tasks on MOOC forum data (Wei et al., 2017) : Confusion detection, urgency of teacher intervention and sentimentality classification." }, { "sent_id": "2eeffe385539b28c7d31eeb176e926-C001-22", "text": "In all three of these tasks, we are able to improve performance past the state of the art." 
}, { "sent_id": "2eeffe385539b28c7d31eeb176e926-C001-23", "text": "----------------------------------" }, { "sent_id": "2eeffe385539b28c7d31eeb176e926-C001-24", "text": "**METHOD**" }, { "sent_id": "2eeffe385539b28c7d31eeb176e926-C001-25", "text": "----------------------------------" }, { "sent_id": "2eeffe385539b28c7d31eeb176e926-C001-26", "text": "**DATA:**" }, { "sent_id": "2eeffe385539b28c7d31eeb176e926-C001-27", "text": "We trained the language model on a large unannotated data set from two sources: student forum data from the Stanford MOOCPosts dataset (Agrawal and Paepcke, 2014) which includes about 30,000 forum posts from 11 courses among three subject domains; and forum data from multiple instances of 18 courses from large public universities in the UK and USA." }, { "sent_id": "2eeffe385539b28c7d31eeb176e926-C001-28", "text": "In total, this dataset is comprised of more than 12 million tokens." }, { "sent_id": "2eeffe385539b28c7d31eeb176e926-C001-29", "text": "The data used for the classification tasks was from the same Stanford MOOCPosts dataset." }, { "sent_id": "2eeffe385539b28c7d31eeb176e926-C001-30", "text": "The posts are annotated by domain experts and given scores for sentiment (the degree of emotionality exhibited by the post), confusion expressed by the student and urgency for the post to receive a response from an instructor." }, { "sent_id": "2eeffe385539b28c7d31eeb176e926-C001-31", "text": "Scores are given on a Likert scale from 1 (low) to 7 (high)." }, { "sent_id": "2eeffe385539b28c7d31eeb176e926-C001-32", "text": "Language Models: We constructed two models, EduBERT and EduDistilBERT, which respectively refine BERT-base and DistilBERT , both of which were trained on general domain text from books and Wikipedia (Devlin et al., 2019) ." }, { "sent_id": "2eeffe385539b28c7d31eeb176e926-C001-33", "text": "Both models are initialized from their base model and finetuned on educational data, using the Transformers library ." 
}, { "sent_id": "2eeffe385539b28c7d31eeb176e926-C001-34", "text": "The fine-tuning step allows the model to better capture how words are used in an educational context." }, { "sent_id": "2eeffe385539b28c7d31eeb176e926-C001-36", "text": "We set the maximum input sequence length to the default value (512); the learning rate was set to 5e-5; the batch size (the number of input sequences processed at one time) was set to 8 for EduBERT and 16 for EduDistilBERT." }, { "sent_id": "2eeffe385539b28c7d31eeb176e926-C001-37", "text": "The best performance was achieved after 5 training epochs." }, { "sent_id": "2eeffe385539b28c7d31eeb176e926-C001-38", "text": "----------------------------------" }, { "sent_id": "2eeffe385539b28c7d31eeb176e926-C001-39", "text": "**CLASSIFICATION TASKS:**" }, { "sent_id": "2eeffe385539b28c7d31eeb176e926-C001-40", "text": "To encourage easily comparable results, we evaluated the models on three wellexplored classification tasks on the StandfordMOOC dataset." }, { "sent_id": "2eeffe385539b28c7d31eeb176e926-C001-41", "text": "Following previous work by Guo et al. (2019) , we split the data into a 2/3 training set and 1/3 test set and consider a post to express sentiment, urgency or confusion if and only if its respective score is \u2265 4." }, { "sent_id": "2eeffe385539b28c7d31eeb176e926-C001-42", "text": "We compare between the four classifiers BERT-base, DistilBERT, EduBERT and EduDistilBERT." }, { "sent_id": "2eeffe385539b28c7d31eeb176e926-C001-43", "text": "We evaluated multiple sets of parameters." }, { "sent_id": "2eeffe385539b28c7d31eeb176e926-C001-44", "text": "Best results for these tasks were achieved with the following parameters: two learning epochs, maximal sequence length of 300 (BERT-base, EDUBERT) and 512 for the distilled models, all other parameter values were equal to the ones used for pre-training." 
}, { "sent_id": "2eeffe385539b28c7d31eeb176e926-C001-45", "text": "Table 1 compares EduBERT, EduDistilBERT to their base versions, as well as the state-of-the-art (SoA) for urgency detection (Guo et al. 2019) ." }, { "sent_id": "2eeffe385539b28c7d31eeb176e926-C001-46", "text": "The table shows that all pretraining approaches outperformed the SoA for F1 and weighted F1 measures, with our distilled model EduDistilBERT achieving the best overall performance." }, { "sent_id": "2eeffe385539b28c7d31eeb176e926-C001-47", "text": "Table 2 compares all of the models for all three tasks to the SoA using the same measures of accuracy as Wei et al. (2017) ." }, { "sent_id": "2eeffe385539b28c7d31eeb176e926-C001-48", "text": "Here too, all the pretraining approaches outperform the SoA. EduDistilBERT obtains the best results on both urgency and confusion prediction while EduBERT performs the best for sentimentality classification." }, { "sent_id": "2eeffe385539b28c7d31eeb176e926-C001-49", "text": "However, EduDistilBERT has a lower memory footprint and is noticeably faster at inference time, allowing for a 30% speedup." }, { "sent_id": "2eeffe385539b28c7d31eeb176e926-C001-50", "text": "----------------------------------" }, { "sent_id": "2eeffe385539b28c7d31eeb176e926-C001-51", "text": "**RESULTS & DISCUSSIONS**" }, { "sent_id": "2eeffe385539b28c7d31eeb176e926-C001-52", "text": ". EduBERT and EduDistilBERT are fine-tuned on millions of tokens, in contrast to the billions of tokens required to make the most of the architecture potential (Devlin et al., 2019) ." }, { "sent_id": "2eeffe385539b28c7d31eeb176e926-C001-53", "text": "We are actively seeking more data to train models even more capable of producing contextualized word representations in the educational domain." }, { "sent_id": "2eeffe385539b28c7d31eeb176e926-C001-54", "text": "We are making EduBERT and EduDistilBERT publicly available in the hope that they will facilitate learning analytics research at large." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "2eeffe385539b28c7d31eeb176e926-C001-11", "2eeffe385539b28c7d31eeb176e926-C001-12", "2eeffe385539b28c7d31eeb176e926-C001-13", "2eeffe385539b28c7d31eeb176e926-C001-14" ], [ "2eeffe385539b28c7d31eeb176e926-C001-17" ], [ "2eeffe385539b28c7d31eeb176e926-C001-18" ], [ "2eeffe385539b28c7d31eeb176e926-C001-19" ], [ "2eeffe385539b28c7d31eeb176e926-C001-52" ] ], "cite_sentences": [ "2eeffe385539b28c7d31eeb176e926-C001-11", "2eeffe385539b28c7d31eeb176e926-C001-12", "2eeffe385539b28c7d31eeb176e926-C001-13", "2eeffe385539b28c7d31eeb176e926-C001-14", "2eeffe385539b28c7d31eeb176e926-C001-17", "2eeffe385539b28c7d31eeb176e926-C001-18", "2eeffe385539b28c7d31eeb176e926-C001-19", "2eeffe385539b28c7d31eeb176e926-C001-52" ] }, "@EXT@": { "gold_contexts": [ [ "2eeffe385539b28c7d31eeb176e926-C001-21" ], [ "2eeffe385539b28c7d31eeb176e926-C001-32" ] ], "cite_sentences": [ "2eeffe385539b28c7d31eeb176e926-C001-21", "2eeffe385539b28c7d31eeb176e926-C001-32" ] }, "@USE@": { "gold_contexts": [ [ "2eeffe385539b28c7d31eeb176e926-C001-42" ], [ "2eeffe385539b28c7d31eeb176e926-C001-44" ], [ "2eeffe385539b28c7d31eeb176e926-C001-45" ] ], "cite_sentences": [ "2eeffe385539b28c7d31eeb176e926-C001-42", "2eeffe385539b28c7d31eeb176e926-C001-44", "2eeffe385539b28c7d31eeb176e926-C001-45" ] } } }, "ABC_c7c9266b5063ec85494fde45d1dce1_7": { "x": [ { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-35", "text": "(1) CNN: Inspired by Kim et." }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-2", "text": "Hate speech detection on Twitter is critical for applications like controversial event extraction, building AI chatterbots, content recommendation, and sentiment analysis." }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-3", "text": "We define this task as being able to classify a tweet as racist, sexist or neither." 
}, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-4", "text": "The complexity of the natural language constructs makes this task very challenging." }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-5", "text": "We perform extensive experiments with multiple deep learning architectures to learn semantic word embeddings to handle this complexity." }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-6", "text": "Our experiments on a benchmark dataset of 16K annotated tweets show that such deep learning methods outperform state-of-the-art char/word n-gram methods by \u223c18 F1 points." }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-7", "text": "----------------------------------" }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-9", "text": "With the massive increase in social interactions on online social networks, there has also been an increase of hateful activities that exploit such infrastructure." }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-10", "text": "On Twitter, hateful tweets are those that contain abusive speech targeting individuals (cyber-bullying, a politician, a celebrity, a product) or particular groups (a country, LGBT, a religion, gender, an organization, etc.)." }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-11", "text": "Detecting such hateful speech is important for analyzing public sentiment of a group of users towards another group, and for discouraging associated wrongful activities." }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-12", "text": "It is also useful to filter tweets before content recommendation, or learning AI chatterbots from tweets 1 ." }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-13", "text": "The manual way of filtering out hateful tweets is not scalable, motivating researchers to identify automated ways." 
}, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-14", "text": "In this work, we focus on the problem of classifying a tweet as racist, sexist or neither." }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-15", "text": "The task is quite challenging due to the inherent complexity of the natural language constructsdifferent forms of hatred, different kinds of targets, different ways of representing the same meaning." }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-16", "text": "Most of the earlier work revolves either around manual feature extraction [6] or use representation learning methods followed by a linear classifier [1, 4] of complex problems in speech, vision and text applications." }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-17", "text": "To the best of our knowledge, we are the first to experiment with deep learning architectures for the hate speech detection task." }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-18", "text": "In this paper, we experiment with multiple classifiers such as Logistic Regression, Random Forest, SVMs, Gradient Boosted Decision Trees (GBDTs) and Deep Neural Networks(DNNs)." }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-19", "text": "The feature spaces for these classifiers are in turn defined by task-specific embeddings learned using three deep learning architectures: FastText, Convolutional Neural Networks (CNNs), Long Short-Term Memory Networks (LSTMs)." }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-20", "text": "As baselines, we compare with feature spaces comprising of char n-grams [6] , TF-IDF vectors, and Bag of Words vectors (BoWV)." }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-21", "text": "Main contributions of our paper are as follows: (1) We investigate the application of deep learning methods for the task of hate speech detection." 
}, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-22", "text": "(2) We explore various tweet semantic embeddings like char n-grams, word Term FrequencyInverse Document Frequency (TF-IDF) values, Bag of Words Vectors (BoWV) over Global Vectors for Word Representation (GloVe), and task-specific embeddings learned using FastText, CNNs and LSTMs." }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-23", "text": "(3) Our methods beat stateof-the-art methods by a large margin (\u223c18 F1 points better)." }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-24", "text": "----------------------------------" }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-25", "text": "**PROPOSED APPROACH**" }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-26", "text": "We first discuss a few baseline methods and then discuss the proposed approach." }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-27", "text": "In all these methods, an embedding is generated for a tweet and is used as its feature representation with a classifier." }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-28", "text": "Baseline Methods: As baselines, we experiment with three broad representations." }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-29", "text": "(1) Char n-grams: It is the state-ofthe-art method [6] which uses character n-grams for hate speech detection." }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-30", "text": "(2) TF-IDF: TF-IDF are typical features used for text classification." }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-31", "text": "(3) BoWV: Bag of Words Vector approach uses the average of the word (GloVe) embeddings to represent a sentence." }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-32", "text": "We experiment with multiple classifiers for both the TF-IDF and the BoWV approaches." 
}, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-33", "text": "Proposed Methods: We investigate three neural network architectures for the task, described as follows." }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-34", "text": "For each of the three methods, we initialize the word embeddings with either random embeddings or GloVe embeddings." }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-36", "text": "al [3] 's work on using CNNs for sentiment classification, we leverage CNNs for hate speech detection." }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-37", "text": "We use the same settings for the CNN as described in [3] ." }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-38", "text": "(2) LSTM: Unlike feed-forward neural networks, recurrent neural networks like LSTMs can use their internal memory to process arbitrary sequences of inputs." }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-39", "text": "Hence, we use LSTMs to capture long range dependencies in tweets, which may play a role in hate speech detection." }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-40", "text": "----------------------------------" }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-41", "text": "**METHOD**" }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-42", "text": "Prec Recall F1 All of these networks are trained (fine-tuned) using labeled data with back-propagation." }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-43", "text": "Once the network is learned, a new tweet is tested against the network which classifies it as racist, sexist or neither." }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-44", "text": "Besides learning the network weights, these methods also learn task-specific word embeddings tuned towards the hate speech labels." 
}, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-45", "text": "Therefore, for each of the networks, we also experiment by using these embeddings as features and various other classifiers like SVMs and GBDTs as the learning method." }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-46", "text": "----------------------------------" }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-47", "text": "**EXPERIMENTS**" }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-48", "text": "----------------------------------" }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-49", "text": "**DATASET AND EXPERIMENTAL SETTINGS**" }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-50", "text": "We experimented with a dataset of 16K annotated tweets made available by the authors of [6] ." }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-51", "text": "Of the 16K tweets, 3383 are labeled as sexist, 1972 as racist, and the remaining are marked as neither sexist nor racist." }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-52", "text": "For the embedding based methods, we used the GloVe [5] pre-trained word embeddings." }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-53", "text": "GloVe embeddings 2 have been trained on a large tweet corpus (2B tweets, 27B tokens, 1.2M vocab, uncased)." }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-54", "text": "We experimented with multiple word embedding sizes for our task." }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-55", "text": "We observed similar results with different sizes, and hence due to lack of space we report results using embedding size=200." }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-56", "text": "We performed 10-Fold Cross Validation and calculated weighted macro precision, recall and F1-scores." }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-57", "text": "We use 'adam' for CNN and LSTM, and 'RMS-Prop' for FastText as our optimizer." 
}, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-58", "text": "We perform training in batches of size 128 for CNN & LSTM and 64 for FastText." }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-59", "text": "More details on the experimental setup can be found from our publicly available source code 3 ." }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-60", "text": "Table 1 shows the results of various methods on the hate speech detection task." }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-61", "text": "Part A shows results for baseline methods." }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-62", "text": "Parts B and C focus on the proposed methods where part B contains methods using neural networks only, while part C uses average of word embeddings learned by DNNs as features for GBDTs." }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-63", "text": "We experimented with mul- tiple classifiers but report results mostly for GBDTs only, due to lack of space." }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-64", "text": "----------------------------------" }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-65", "text": "**RESULTS AND ANALYSIS**" }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-66", "text": "As the table shows, our proposed methods in part B are significantly better than the baseline methods in part A. Among the baseline methods, the word TF-IDF method is better than the character n-gram method." }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-67", "text": "Among part B methods, CNN performed better than LSTM which was better than FastText." }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-68", "text": "Surprisingly, initialization with random embeddings is slightly better than initialization with GloVe embeddings when used along with GBDT." }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-69", "text": "Finally, part C methods are better than part B methods." 
}, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-70", "text": "The best method is \"LSTM + Random Embedding + GBDT\" where tweet embeddings were initialized to random vectors, LSTM was trained using back-propagation, and then learned embeddings were used to train a GBDT classifier." }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-71", "text": "Combinations of CNN, LSTM, FastText embeddings as features for GBDTs did not lead to better results." }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-72", "text": "Also note that the standard deviation for all these methods varies from 0.01 to 0.025." }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-73", "text": "To verify the task-specific nature of the embeddings, we show top few similar words for a few chosen words in Table 2 using the original GloVe embeddings and also embeddings learned using DNNs." }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-74", "text": "The similar words obtained using deep neural network learned embeddings clearly show the \"hatred\" towards the target words, which is in general not visible at all in similar words obtained using GloVe." }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-75", "text": "----------------------------------" }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-76", "text": "**CONCLUSIONS**" }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-77", "text": "In this paper, we investigated the application of deep neural network architectures for the task of hate speech detection." }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-78", "text": "We found them to significantly outperform the existing methods." }, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-79", "text": "Embeddings learned from deep neural network models when combined with gradient boosted decision trees led to best accuracy values." 
}, { "sent_id": "c7c9266b5063ec85494fde45d1dce1-C001-80", "text": "In the future, we plan to explore the importance of the user network features for the task." } ], "y": { "@BACK@": { "gold_contexts": [ [ "c7c9266b5063ec85494fde45d1dce1-C001-16" ], [ "c7c9266b5063ec85494fde45d1dce1-C001-29" ] ], "cite_sentences": [ "c7c9266b5063ec85494fde45d1dce1-C001-16", "c7c9266b5063ec85494fde45d1dce1-C001-29" ] }, "@USE@": { "gold_contexts": [ [ "c7c9266b5063ec85494fde45d1dce1-C001-20" ], [ "c7c9266b5063ec85494fde45d1dce1-C001-50" ] ], "cite_sentences": [ "c7c9266b5063ec85494fde45d1dce1-C001-20", "c7c9266b5063ec85494fde45d1dce1-C001-50" ] } } }, "ABC_55bcdca5052745160dc861e22e7401_7": { "x": [ { "sent_id": "55bcdca5052745160dc861e22e7401-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-2", "text": "Abstract." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-3", "text": "We present experiments on detecting hyperpartisanship in news using a 'masking' method that allows us to assess the role of style vs. content for the task at hand." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-4", "text": "Our results corroborate previous research on this task in that topic related features yield better results than stylistic ones." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-5", "text": "We additionally show that competitive results can be achieved by simply including higher-length n-grams, which suggests the need to develop more challenging datasets and tasks that address implicit and more subtle forms of bias." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-6", "text": "----------------------------------" }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-8", "text": "Media such as radio, TV channels, and newspapers control which information spreads and how it does it." 
}, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-9", "text": "The aim is often not only to inform readers but also to influence public opinion on specific topics from a hyperpartisan perspective." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-10", "text": "Social media, in particular, have become the default channel for many people to access information and express ideas and opinions." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-11", "text": "The most relevant and positive effect is the democratization of information and knowledge but there are also undesired effects." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-12", "text": "One of them is that social media foster information bubbles: every user may end up receiving only the information that matches her personal biases, beliefs, tastes and points of view." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-13", "text": "Because of this, social media are a breeding ground for the propagation of fake news: when a piece of news outrages us or matches our beliefs, we tend to share it without checking its veracity; and, on the other hand, content selection algorithms in social media give credit to this type of popularity because of the click-based economy on which their business are based." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-14", "text": "Another harmful effect is that the relative anonymity of social networks facilitates the propagation of toxic, hate and exclusion messages." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-15", "text": "Therefore, social media contribute to the misinformation and polarization of society, as we have recently witnessed in the last presidential elections in USA or the Brexit referendum." 
}, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-16", "text": "Clearly, the polarization of society and its underlying discourses are not limited to social media, but rather reflected also in political dynamics (e.g., like those found in the US Congress [1] ): even in this domain, however, social media can provide a useful signal to estimate partisanship [4] ." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-17", "text": "Closely related to the concept of controversy and the \"filter bubble effect\" is the concept of bias [2] , which refers to the presentation of information according to the standpoints or interests of the journalists and the news agencies." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-18", "text": "Detecting bias is very important to help users to acquire balanced information." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-19", "text": "Moreover, how a piece of information is reported has the capacity to evoke different sentiments in the audience, which may have large social implications (especially in very controversial topics such as terror attacks and religion issues)." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-20", "text": "In this paper, we approach this very broad topic by focusing on the problem of detecting hyperpartisan news, namely news written with an extreme manipulation of the reality on the basis of an underlying, typically extreme, ideology." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-21", "text": "This problem has received little attention in the context of the automatic detection of fake news, despite the potential correlation between them." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-22", "text": "Seminal work from [5] presents a comparative style analysis of hyperpartisan news, evaluating features such as characters n-grams, stop words, part-of-speech, readability scores, and ratios of quoted words and external links." 
}, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-23", "text": "The results indicate that a topic-based model outperforms a style-based one to separate the left, right and mainstream orientations." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-24", "text": "We build upon previous work and use the dataset from [5] : this way we can investigate hyperpartisan-biased news (i.e., extremely one-sided) that have been manually fact-checked by professional journalists from BuzzFeed." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-25", "text": "The articles originated from 9 well-known political publishers, three each from the mainstream, the hyperpartisan left-wing, and the hyperpartisan right-wing." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-26", "text": "To detect hyperpartisanship, we apply a masking technique that transforms the original texts in a form where the textual structure is maintained, while letting the learning algorithm focus on the writing style or the topic-related information." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-27", "text": "This technique makes it possible for us to corroborate previous results that content matters more than style." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-28", "text": "However, perhaps surprisingly, we are able to achieve the overall best performance by simply using higher-length n-grams than those used in the original work from [5] : this seems to indicate a strong lexical overlap between different sources with the same orientation, which, in turn, calls for more challenging datasets and task formulations to encourage the development of models covering more subtle, i.e., implicit, forms of bias." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-29", "text": "The rest of the paper is structured as follows." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-30", "text": "In Section 2 we describe our method to hyperpartisan news detection based on masking." 
}, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-31", "text": "Section 3 presents details on the dataset, experimental results and a discussion of our results." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-32", "text": "Finally, Section 4 concludes with some directions for future work." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-33", "text": "----------------------------------" }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-34", "text": "**INVESTIGATING MASKING FOR HYPERPARTISANSHIP DETECTION**" }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-35", "text": "The masking technique that we propose here for the hyperpartisan news detection task has been applied to text clustering [3] , authorship attribution [7] , and recently to deception detection [6] with encouraging results." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-36", "text": "The main idea of the proposed method is to transform the original texts to a form where the textual structure, related to a general style (or topic), is maintained while content-related (or style-related) words are masked." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-37", "text": "To this end, all the occurrences (in both training and test corpora) of non-desired terms are replaced by symbols." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-38", "text": "Table 1 : Examples of masking style-related information or topic-related information." 
}, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-39", "text": "----------------------------------" }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-40", "text": "**ORIGINAL TEXT**" }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-41", "text": "Masking topicrelated words" }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-42", "text": "Masking stylerelated words Officers went after Christopher Few after watching an argument between him and his girlfriend outside a bar just before the 2015 shooting * went after * Few after * an * between him and his * a * just before the # * Officers * * Christopher * * watching * argument * * * * girlfriend outside * bar * * * 2015 shooting Let W k be the set of the k most frequent words, we mask all the occurrences of a word w \u2208 W k if we want to learn a topic-related model, or we mask all w / \u2208 W k if we want to learn a style-based model." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-43", "text": "Whatever the case, the way in which we mask the terms in this work is called Distorted View with Single Asterisks and consists in replacing w with a single asterisk or a single # symbol if the term is a word or a number, respectively." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-44", "text": "For further masking methods, refer to [7] ." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-45", "text": "Table 1 shows a fragment of an original text and the result of masking style-related information or topic-related information." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-46", "text": "With the former we obtain distorted texts that allow for learning a topic-based model; on the other hand, with the latter, it is possible to learn a style-based model." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-47", "text": "One of the options to choose the terms to be masked or maintained without masking is to take the most frequent words of the target language [7] ." 
}, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-48", "text": "In the original text from the table, we highlight some of the more frequent words in English." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-49", "text": "----------------------------------" }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-50", "text": "**EXPERIMENTS**" }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-51", "text": "We used the BuzzedFeed-Webis Fake News Corpus 2016 collected by [5] whose articles were labeled with respect to three political orientations: mainstream, left-wing, and right-wing (see Table 2 )." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-52", "text": "Each article was taken from one of 9 publishers known as hyperpartisan left/right or mainstream in a period close to the US presidential elections of 2016." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-53", "text": "Therefore, the content of all the articles is related to the same topic." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-54", "text": "During initial data analysis and prototyping we identified a variety of issues with the original dataset: we cleaned the data excluding articles with empty or bogus texts, e.g. 'The document has moved here' (23 and 14 articles respectively)." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-55", "text": "Additionally, we removed duplicates (33) and files with the same text but inconsistent labels (2) ." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-56", "text": "As a result, we obtained a new dataset with 1555 articles out of 1627." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-57", "text": "4 Following the settings of [5] , we balance the training set using random duplicate oversampling." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-58", "text": "4 The dataset is available at https://github.com/jjsjunquera/ UnmaskingBiasInNews." 
}, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-59", "text": "----------------------------------" }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-60", "text": "**MASKING CONTENT VS. STYLE IN HYPERPARTISAN NEWS**" }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-61", "text": "In this section, we reported the results of the masking technique from two different perspectives." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-62", "text": "In one setting, we masked topic-related information in order to maintain the predominant writing style used in each orientation." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-63", "text": "We call this approach a stylebased model." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-64", "text": "With that intention we selected the k most frequent words from the target language, and then we transformed the texts by masking the occurrences of the rest of the words." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-65", "text": "In another setting, we masked style-related information to allow the system to focus only on the topic-related differences between the orientations." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-66", "text": "We call this a topic-based model." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-67", "text": "For this, we masked the k most frequent words and maintained intact the rest." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-68", "text": "After the text transformation by the masking process in both the training and test sets, we represented the documents with character n-grams and compared the results obtained with the style-based and the topic-related models." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-69", "text": "Machine (SVM) and Random Forest (RF); for the three classifiers we used the versions implemented in sklearn with the parameters set by default." 
}, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-70", "text": "Evaluation: We performed 3-fold cross-validation with the same configuration used in [5] ." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-71", "text": "Therefore, each fold comprised one publisher from each orientation (the classifiers did not learn a publisher's style)." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-72", "text": "We used macro F 1 as the evaluation measure since the test set is unbalanced with respect to the three classes." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-73", "text": "In order to compare our results with those reported in [5] , we also used accuracy, precision, and recall." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-74", "text": "Baseline: Our baseline method is based on the same text representation with the character n-grams features, but without masking any word." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-75", "text": "Table 3 shows the results of the proposed method." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-76", "text": "We compare with [5] against their topic and style-based methods." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-77", "text": "In order to compare our results with those reported in [5] , we report the same measures the authors used." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-78", "text": "We also include the macro F 1 score because of the unbalance test set." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-79", "text": "For these experiments we extract the character 5-grams from the transformed texts, taking into account that as more narrow is the domain more sense has the use of longer n-grams." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-80", "text": "We follow the steps of [8] and set k = 500 for this comparison results." 
}, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-81", "text": "The last two rows show the results obtained by applying the system from [5] 6 to our cleaned dataset (Section 3)." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-82", "text": "Similar to [5] , the topic-based model achieves better results than the style-related model." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-83", "text": "However, the differences between the results of the two evaluated approaches are much higher (0.66 vs. 0.57 according to Macro F 1 ) than those shown in [5] ." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-84", "text": "The highest scores were consistently achieved using the SVM classifier and masking the style-related information (i.e., the topic-related model)." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-85", "text": "This could be due to the fact that all the articles are about the same political event in a very limited period of time." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-86", "text": "----------------------------------" }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-87", "text": "**RESULTS AND DISCUSSION**" }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-88", "text": "In line with what was already pointed out in [5] , the left-wing orientation is harder to predict, possibly because this class is represented with fewer examples in the dataset." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-89", "text": "Another reason why our masking approach achieves better results could be that we use a higher length of character n-grams." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-90", "text": "In fact, comparing the results of [5] against our baseline model, it is possible to note that even without masking any word, the classifier obtains better results." 
}, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-91", "text": "This suggests that the good results are due to the length of the character n-grams rather than the use of the masking technique." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-92", "text": "Robustness of the approach to different values of k and n. With the goals of: (i) understanding the robustness of the approach to different parameter values; and to see if (ii) it is possible to overcome the F 1 = 0.70 from the baseline model, we vary the values of k and n and evaluate the macro F 1 using SVM." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-93", "text": "Figures 1 shows the results of the variation of k \u2208 {100, 200, ..., 5000}. When k > 5000, we clearly can see that the topic-related model, in which the k most frequent terms are masked, is decreasing the performance." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-94", "text": "This could be explained by the fact that relevant topic-related terms start to be masked too." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-95", "text": "However, a different behavior is seen in the style-related model, in which we tried to maintain only the style-related words without masking them." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-96", "text": "In this model, the higher is k the better is the performance." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-97", "text": "This confirms that for the used dataset, taking into account only style-related information is not good, and observing also topic-related information benefits the classification." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-98", "text": "When k tends to the vocabulary size, the style-related model tends to behave like the baseline model, which we already saw in Table 3 that achieves the best results." 
}, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-99", "text": "From this experiment, we conclude that: (i) the topic-related model is less sensitive than the style-related model when k < 500, i.e. the k most frequent terms are stylerelated ones; and (ii) when we vary the value of k, both models achieve worse results than our baseline." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-100", "text": "On the other hand, the results of extracting character 5-grams are higher than extracting smaller n-grams, as can be seen in Figures 2." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-101", "text": "These results confirm that perhaps the performance of our approach overcomes the models proposed in [5] because of the length of the n-grams 7 ." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-102", "text": "Relevant features." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-103", "text": "Table 4 shows the features with the highest weights from the SVM (we use scikit-learn's method to collect feature weights)." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-104", "text": "It is possible to note that the mention of cnn was learned as a discriminative feature when the news from that publisher were used in the training (in the topic-based model)." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-105", "text": "However, this feature is infrequent in the test set where no news from CNN publisher was included." 
}, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-106", "text": "Baseline model left main right imag cnn e are that said lary e tru said your e don y con n pla here ry co e thi s of cnn s to e tru n ame your for h said illar donal said hilla racis ore t llary here said and s kill story hill that said let trum tory comm trump ed be lary Style-based model left main right but * n thi y ** out w s * s out a t ** how as to you h at he o you t and m * t ell * is a * * u and n h * a e # * hat w of # and * * # # or hi ** * it t for h t the e of ** * and * o you in o * tw n it hat * * two and n f ** s ** all * so * onl f * w Topic-based model left main right ant * cnn hilla imag cs * * da lies ics * als * * ex sday * le etty ed be dail donal cnn * te n * c day * * ame onald cs * * am ying ics * illar thing * * e llary * * ed be * le e * y con * ri n * tory hill t * story bomb imag d bel * * r Features like donal and onal are related to Donald Trump, while illar and llary refer to Hillary Clinton." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-107", "text": "Each of these names is more frequent in one of the hyperpartisan Table 5 : Fragments of original texts and their transformation by masking the k most frequent terms." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-108", "text": "Some of the features from Table 4 using the topic-related model are highlighted." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-109", "text": "Topic-related model left (...)which his son pretty much confirmed in a foolish statement." 
}, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-110", "text": "The content of those tax returns has been the subject of much speculation, but given Trumps long history of tax evasion and political bribery, it doesnt take much imagination to assume hes committing some kind of fraud * * son pretty * confirmed * foolish statement * content * * tax returns * * * subject * * speculation * * Trump * * * tax evasion * * bribery * doesn * * imagination * assume * committing * * * fraud are frequent in the three classes (e.g., out, you, and, of ) even if the combination between function words and other characters can lightly differ in different orientations." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-111", "text": "----------------------------------" }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-112", "text": "**CONCLUSIONS**" }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-113", "text": "In this paper we presented initial experiments on the task of hyperpartisan news detection: for this, we explored the use of masking techniques to boost the performance of a lexicalized classifier." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-114", "text": "Our results corroborate previous research on the importance of content features to detect extreme content: masking, in addition, shows the benefits of reducing data sparsity for this task comparing our results with the state of the art." }, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-115", "text": "We evaluated different values of the parameters and see that finally our baseline model, in which we extract character 5-grams without applying any masking process, achieves the better results." 
}, { "sent_id": "55bcdca5052745160dc861e22e7401-C001-116", "text": "As future work we plan to explore more complex learning architectures (e.g., representation learning of masked texts), as well as the application and adaptation of unsupervised methods for detecting ideological positioning from political texts 'in the wild' for the online news domain." } ], "y": { "@BACK@": { "gold_contexts": [ [ "55bcdca5052745160dc861e22e7401-C001-22" ] ], "cite_sentences": [ "55bcdca5052745160dc861e22e7401-C001-22" ] }, "@USE@": { "gold_contexts": [ [ "55bcdca5052745160dc861e22e7401-C001-24" ], [ "55bcdca5052745160dc861e22e7401-C001-51" ], [ "55bcdca5052745160dc861e22e7401-C001-57" ], [ "55bcdca5052745160dc861e22e7401-C001-70" ], [ "55bcdca5052745160dc861e22e7401-C001-73" ], [ "55bcdca5052745160dc861e22e7401-C001-76" ], [ "55bcdca5052745160dc861e22e7401-C001-77" ], [ "55bcdca5052745160dc861e22e7401-C001-81" ] ], "cite_sentences": [ "55bcdca5052745160dc861e22e7401-C001-24", "55bcdca5052745160dc861e22e7401-C001-51", "55bcdca5052745160dc861e22e7401-C001-57", "55bcdca5052745160dc861e22e7401-C001-70", "55bcdca5052745160dc861e22e7401-C001-73", "55bcdca5052745160dc861e22e7401-C001-76", "55bcdca5052745160dc861e22e7401-C001-77", "55bcdca5052745160dc861e22e7401-C001-81" ] }, "@EXT@": { "gold_contexts": [ [ "55bcdca5052745160dc861e22e7401-C001-28" ], [ "55bcdca5052745160dc861e22e7401-C001-51", "55bcdca5052745160dc861e22e7401-C001-54" ] ], "cite_sentences": [ "55bcdca5052745160dc861e22e7401-C001-28", "55bcdca5052745160dc861e22e7401-C001-51", "55bcdca5052745160dc861e22e7401-C001-54" ] }, "@SIM@": { "gold_contexts": [ [ "55bcdca5052745160dc861e22e7401-C001-82" ], [ "55bcdca5052745160dc861e22e7401-C001-88" ] ], "cite_sentences": [ "55bcdca5052745160dc861e22e7401-C001-82", "55bcdca5052745160dc861e22e7401-C001-88" ] }, "@DIF@": { "gold_contexts": [ [ "55bcdca5052745160dc861e22e7401-C001-83" ], [ "55bcdca5052745160dc861e22e7401-C001-90" ], [ "55bcdca5052745160dc861e22e7401-C001-101" ] ], 
"cite_sentences": [ "55bcdca5052745160dc861e22e7401-C001-83", "55bcdca5052745160dc861e22e7401-C001-90", "55bcdca5052745160dc861e22e7401-C001-101" ] } } }, "ABC_ec0ae4e56c069e3efb4a2dc12199cd_7": { "x": [ { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-2", "text": "Active learning (AL) is getting more and more popular as a methodology to considerably reduce the annotation effort when building training material for statistical learning methods for various NLP tasks." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-3", "text": "A crucial issue rarely addressed, however, is when to actually stop the annotation process to profit from the savings in efforts." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-4", "text": "This question is tightly related to estimating the classifier performance after a certain amount of data has already been annotated." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-5", "text": "While learning curves are the default means to monitor the progress of the annotation process in terms of classifier performance, this requires a labeled gold standard which -in realistic annotation settings, at least -is often unavailable." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-6", "text": "We here propose a method for committee-based AL to approximate the progression of the learning curve based on the disagreement among the committee members." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-7", "text": "This method relies on a separate, unlabeled corpus and is thus well suited for situations where a labeled gold standard is not available or would be too expensive to obtain." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-8", "text": "Considering named entity recognition as a test case we provide empirical evidence that this approach works well under simulation as well as under real-world annotation conditions." 
}, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-9", "text": "----------------------------------" }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-11", "text": "State-of-the-art NLP components are increasingly based on supervised machine learning methods." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-12", "text": "This raises the need for large amounts of training data." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-13", "text": "While for the general language English newspaper domain syntactic (Marcus et al., 1993) , semantic (Palmer et al., 2005; Pustejovsky et al., 2003) , and even discourse (Carlson et al., 2003; Miltsakaki et al., 2008) annotations are increasingly made available, any language, domain, or genre shift pushes the severe burden on developers of NLP systems to supply comparably sized high-quality annotations." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-14", "text": "Even inner-domain shifts, such as, e.g., moving from hematology (Ohta et al., 2002) to the genetics of cancer (Kulick et al., 2004) within the field of molecular biology may have drastic consequences in the sense that entirely new meta data sets have to produced by annotation teams." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-15", "text": "Thus, reducing the human efforts for the creation of adequate training material is a major challenge." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-16", "text": "Active learning (AL) copes with this problem as it intelligently selects the data to be labeled." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-17", "text": "It is a sampling strategy where the learner has control over the training material to be manually annotated by selecting those examples which are of high utility for the learning process." 
}, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-18", "text": "AL has been successfully applied to speed up the annotation process for many NLP tasks without sacrificing annotation quality (Engelson and Dagan, 1996; Ngai and Yarowsky, 2000; Hwa, 2001; Tomanek et al., 2007a) ." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-19", "text": "Once we decide to use AL for meta-data annotation and a reasonable, stable level of annotation quality is reachedafter having run through only a fraction of the documents compared with the traditional annotation approach where a randomly and independently selected amount of documents is sequentially annotated -an obvious question turns up: When do we stop the annotation process to cash in the time savings?" }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-20", "text": "Stopping after a certain amount of time has elapsed or a certain amount of data has been annotated is clearly not the best choice since such criteria, easily applicable though, do not take into account how well a classifier trained on the annotated data really performs." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-21", "text": "An optimal stopping condition for any annotation would be to locate that point in time when no further improvement in terms of classifier performance can be achieved by additional annotations." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-22", "text": "Since learning curves show the classifier performance at different time steps, i.e., for different amounts of annotated training examples, we can observe that progression." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-23", "text": "Given this observation data we may stop the annotation process when the learning curve completely converges and is not ascending any more." 
}, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-24", "text": "In most real-world annotation scenarios, however, such a well-defined stopping point based on the convergence of classifier performance does not exist." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-25", "text": "Instead, additional annotations often result in slight improvements of the classifier's performance." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-26", "text": "Accordingly, one should rather consider the trade-off between further annotation efforts and gains in classifier performance to decide whether additional annotations are worth the effort for targeted application." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-27", "text": "This trade-off can be read from the learning curve which, unfortunately, will not always be available." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-28", "text": "Re-sampling strategies, e.g., cross-validation or bootstrapping, usually applied to estimate classifier performance, assume independently and identically distributed (i.i.d.) examples to sample from. But examples selected by means of AL do not meet this requirement." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-29", "text": "So, to estimate classifier performance a separately annotated gold standard with i.i.d." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-30", "text": "examples is often used to obtain a learning curve for AL." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-31", "text": "Yet, this solution comes with expensive extra annotation work." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-32", "text": "We present an approach to approximate the progression of the learning curve without the need for a labeled gold standard." 
}, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-33", "text": "We situate our discussion in the context of a simulation and a real-world annotation scenario and will find out that the second scenario imposes some restrictions on the configuration of the approach." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-34", "text": "The paper is structured as follows: In Section 2., we describe our approach in detail." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-35", "text": "Other work on stopping conditions for AL-based annotation is discussed in Section 3." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-36", "text": "Experimental results for the task of named entity recognition are presented in Section 4." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-37", "text": "----------------------------------" }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-38", "text": "**APPROXIMATING THE LEARNING CURVE**" }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-39", "text": "Given the idea that from the learning curve one can read the trade-off between annotation effort and classifier performance gain, we here propose an approach to approximate the progression of the learning curve which comes at no extra annotation costs." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-40", "text": "This approach is designed for use in committee-based AL (Seung et al., 1992) ." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-41", "text": "A committee consists of k classifiers of the same type trained on different subsets of the already labeled (training) data." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-42", "text": "Each committee member then makes its predictions on the pool of unlabeled examples, and those examples on which the committee members express the highest disagreement are considered most informative for learning and are thus selected for manual annotation." 
}, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-43", "text": "To calculate the disagreement among the committee members several metrics have been proposed including the vote entropy (Engelson and Dagan, 1996) as possibly the most well-known one." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-44", "text": "Our approach to approximating the learning curve is based on the disagreement within a committee." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-45", "text": "However, it is independent of the actual metric used to calculate the disagreement." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-46", "text": "Although in our experiments we considered the NLP task of named entity recognition (NER) only, our approach is not limited to this scenario and can be expected to be applicable to other tasks as well." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-47", "text": "In Tomanek et al. (2007a) we introduced the selection agreement (SA) curve -the average agreement amongst the selected examples plotted over time." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-48", "text": "When the SA values are close to '1', the committee members almost perfectly agree." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-49", "text": "So, any further AL iteration would resemble a random selection." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-50", "text": "Experiments have shown that at the point where the SA curve converges on values close to '1' the respective learning curve converges on its maximum value as well so that further annotations would have (almost) no impact on the classifier performance." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-51", "text": "As a result, we concluded that we can derive, from the SA curve, the point where the classifier performance is not increased any more by further annotation efforts." 
}, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-52", "text": "Hence, when this curve approaches values of '1' it can be interpreted as a stopping signal for annotation." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-53", "text": "However, this positive finding is due to an inherent feature of AL simulations." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-54", "text": "In typical simulation settings, the pool of annotation items is of a very limited size -normally only a few thousand examples." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-55", "text": "This is so because for simulations, a pre-annotated corpus is used and the manual annotation is simulated by just moving selected examples from the pool to the training set unveiling the labels." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-56", "text": "As a consequence, the total number of positive and hard examples, which are preferentially selected by AL, is rather limited." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-57", "text": "In the NER scenario, examples to be selected are complete sentences." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-58", "text": "Sentences containing (many) entity mentions can be considered as \"positive\" ones." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-59", "text": "Especially when very infrequent entity classes are to be annotated, a corpus will consist of a large proportion of \"negative\" examples which contain no entity mentions at all." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-60", "text": "In our experiments, we observed that sentences which contained many and complex entity mentions were already selected in early AL iterations." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-61", "text": "Thus, the more AL iterations are run, the less hard and positive examples are left in the pool." 
}, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-62", "text": "As a consequence, only in early iterations, AL really has choices to select useful examples." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-63", "text": "The SA curve is directly affected by this simulation effect and thus cannot be used as a reliable approximation of the learning curve in a real-world annotation scenario where the pool will be much larger and much more diverse." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-64", "text": "In such a setting there will always be useful (and, by this, hard) examples which AL may find, thus keeping the selection agreement constantly high." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-65", "text": "The solution we propose is to calculate the average agreement for each AL iteration on a separate validation set which should reflect the real data distribution and must not be used in the annotation process itself." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-66", "text": "As for most NLP tasks there is no limit to unlabeled data and no annotations are required, the validation set comes at no extra costs." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-67", "text": "Plotted over time we get the validation set agreement (VSA) curve." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-68", "text": "This curve is based on the same data in each AL iteration making the agreement values comparable between different AL iterations." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-69", "text": "Since the examples of the validation set are not used in the annotation process we can further guarantee that this curve is only affected by the benefit the selected and labeled examples have on training a classifier." 
}, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-70", "text": "Now, from a VSA curve which is only slightly ascending between selected measurement points we can infer that the respective learning curve has only a low slope at these positions, too." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-71", "text": "Although interpreting the actual agreement values of the VSA curve is still problematic, its progression behavior can be used to estimate whether further annotation is worth the human labeling effort." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-72", "text": "In Section 4., we will provide empirical evidence that the VSA curve is indeed an adequate approximation of the progression of the learning curve and that the SA curve fails in the real-world annotation scenario where examples are selected from a much larger pool." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-73", "text": "----------------------------------" }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-74", "text": "**RELATED WORK**" }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-75", "text": "While there is a large body of work on AL proper, there are only few papers reporting on stopping criteria or methods to monitor the progress of AL-driven annotations." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-76", "text": "Schohn and Cohn (2000) consider an AL approach for Support Vector Machines (SVM) where examples are selected according to their proximity to the hyperplane." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-77", "text": "They propose to stop the annotation process when, in the current AL iteration, none of the unlabeled examples are closer to the hyperplane than the support vectors." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-78", "text": "While this approach is restricted to AL for SVMs, Vlachos (2008) presents a stopping criterion for uncertainty-based AL (Cohn et al., 1996) in general." 
}, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-79", "text": "The confidence of the classifier at the current AL iteration is estimated on a large, separate validation set." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-80", "text": "The author reports that such a confidence curve follows a rise-peak-drop pattern: It rises in the beginning, then reaches its maximum values, and after that it constantly drops." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-81", "text": "The stopping condition is then defined as the point when the confidence curve starts dropping, i.e., the point when the learning curve has converged." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-82", "text": "This approach is similar to ours in that it employs the usefulness measure of the AL selection and in that it applies a separate validation set to calculate the confidence curve on." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-83", "text": "However, while it provides an exact stopping condition, it cannot provide a means to estimate the progression of the learning curve." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-84", "text": "This is equally important since, in practice, one might want to stop the annotation before such a final stopping condition is met, e.g., when the trade-off between additional annotation costs and gain in classifier performance is falling below some threshold." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-85", "text": "For uncertainty-based AL, further stopping criteria employing a confidence estimate of the current classifier were proposed by Zhu et al. (2008) ." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-86", "text": "The first one is based on an uncertainty measurement on all unlabeled examples of a pool, the second one uses the prediction accuracy on the selected examples, and the final one builds on the classifier's expected error on all unlabeled examples." 
}, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-87", "text": "Since these approaches are not based on a separate validation set we assume that their reported success rates are largely due to the simulation effect, i.e., the limited number of 'hard' examples in a simulation data set." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-88", "text": "Whereas the first and the third criterion could also be applied in a separate, unlabeled validation set to avoid this shortcoming, the second one would require an annotated validation set -not really an advantage over plotting a learning curve." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-89", "text": "Further on, Zhu et al. use their approaches as stopping condition by comparing the respective values against a fixed threshold." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-90", "text": "We find this problematic because a priori chosen or heuristically determined values are highly task-and data-dependent." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-91", "text": "In a real-world annotation scenario it is almost impossible to adequately define such values in advance." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-92", "text": "While all the above-mentioned approaches focus on singleclassifier AL strategies, ours is tailored to committee-based AL." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-93", "text": "----------------------------------" }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-94", "text": "**EXPERIMENTS**" }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-95", "text": "To empirically test whether our proposed approach works well as an approximation of the learning curves we ran several experiments both in a pure simulation mode, where the manual annotation was simulated by unveiling the labels already assigned in the simulation corpus, and in a realworld scenario where human annotators were asked to annotate the sentences selected by AL." 
}, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-96", "text": "For both scenarios the selection agreement (SA) and the validation set agreement (VSA) was calculated for each AL iteration." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-97", "text": "----------------------------------" }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-98", "text": "**EXPERIMENTAL SETTINGS**" }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-99", "text": "For our experiments on approximating the learning curves for AL-based selection, we chose named entity recognition (NER) as the annotation task in focus." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-100", "text": "We employed the committee-based AL approach described in Tomanek et al. (2007a) ." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-101", "text": "The committee consists of k = 3 Maximum Entropy (ME) classifiers (Berger et al., 1996) ." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-102", "text": "In each AL iteration, each classifier is trained on a randomly" }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-103", "text": ", L being the set of all examples seen so far." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-104", "text": "Disagreement is measured by vote entropy (Engelson and Dagan, 1996) ." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-105", "text": "In our NER scenario, complete sentences are selected by AL." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-106", "text": "While we made use of ME classifiers during the selection, we employed a NE tagger based on Conditional Random Fields (CRFs) (Lafferty et al., 2001 ) during evaluation time to determine the learning curves." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-107", "text": "We have already shown that in this scenario, ME classifiers perform equally well for AL-driven selection as CRFs when using the same features." 
}, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-108", "text": "This effect is truly beneficial, especially for real-world annotation projects, due to much lower training times and, by this, shorter annotator idle times (Tomanek et al., 2007a) ." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-109", "text": "For the AL simulation, we employed two simulation corpora: The CONLL corpus, based on the English data set of the CoNLL-2003 shared task (Tjong Kim Sang and De Meulder, 2003) , which consists of newspaper articles annotated with respect to person, location, and organisation entities." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-110", "text": "This pool consists of about 14,000 sentences." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-111", "text": "As validation set and as gold standard for plotting the learning curve we used CoNLL's evaluation corpus which sums up to 3,453 sentences." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-112", "text": "The PBVAR corpus consists of biomedical abstracts and was derived from the PENNBIOIE corpus (Kulick et al., 2004) by keeping only those annotations related to variation event mentions." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-113", "text": "We have randomly split this corpus into a pool set and a validation/gold set." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-114", "text": "In our simulations, 20 sentences were selected in each AL iteration and the simulations were started with a random seed set of 20 sentences." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-115", "text": "Our results are averaged over three independent runs." 
}, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-116", "text": "For the real-world annotation scenario, we considered two sub-corpora from the entity annotations described in (Hahn et al., 2008) : The cytokine and growth factor receptors corpus (CYTOREC) is annotated with various entity subclasses of special receptor entities, while the antigens corpus (CDANTIGEN) contains annotations of various immunologically relevant antigen entities." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-117", "text": "For both annotation projects, the pool from which AL selected the examples to be labeled consisted of approximately 2 million sentences taken from PUBMED 2 abstracts, the validation set and gold standard was composed of 2,165 sentences." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-118", "text": "In each AL iteration, 30 sentences were selected for manual annotation." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-119", "text": "The corresponding seed sets were considerably larger than in our simulations and were assembled by the heuristic described by Tomanek et al. (2007b) ." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-120", "text": "Table 1 summarizes the corpora used for our experiments." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-121", "text": "Figures 1 and 2 display the learning and agreement curves for the CONLL and the PBVAR corpus, respectively." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-122", "text": "The learning curves are depicted for both AL (solid line) and random selection (dashed line) revealing the increase in annotation efficiency when AL is used to select the examples to be annotated." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-123", "text": "As for the agreement curves, we plot both the exact agreement values (dots) and a curve obtained by local polynomial regression fitting (solid line)." 
}, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-124", "text": "On the CONLL corpus, the learning curve converges on its maximum f-score (\u2248 84%) after about 125,000 tokens." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-125", "text": "This is reflected by the SA curve which is not ascending any more at about the same number of tokens." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-126", "text": "A similar pattern is depicted in the VSA curve though it provides an even clearer picture of the progression of the learning curve." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-127", "text": "It is only slightly ascending after about 50,000 tokens, i.e., at a time when the slope of the learning curve already becomes very low." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-128", "text": "From both the learning and the VSA curve we can read that after 50,000 tokens any additional annotation is very costly compared to its benefits in terms of increased classifier performance." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-129", "text": "----------------------------------" }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-130", "text": "**RESULTS**" }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-131", "text": "On the PBVAR corpus, the maximal f-score (\u2248 80%) is reached after approximately 50,000 tokens, then there is a small decline which after about 100,000 tokens stabilizes at the maximum value." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-132", "text": "The SA curve reached values around '1' after about 100, 000 tokens, but is misleading here since it does not reflect that the learning curve had already reached a maximum before." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-133", "text": "The VSA curve, however, more comprehensively approximates the behavior of the learning curve." 
}, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-134", "text": "It has a clear bend after some 50,000 tokens and converges after approximately 100,000 tokens." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-135", "text": "Figures 3 and 4 display the learning and agreement curves for our experiments in the real-world annotation scenario." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-136", "text": "No learning curve for random selection is shown since only AL selection was performed to avoid unnecessary human efforts." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-137", "text": "Further, in this scenario, the agreement was not calculated during the selection to keep selection time as short as possible but was calculated afterwards for this experiment." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-138", "text": "3 On both corpora, agreement as well as learning curves start with the complete seed set (256 sentences, with about 10,00 tokens for CYTOREC and 853 sentences, with some 35,000 tokens for CDANTIGEN)." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-139", "text": "On the CDANTIGEN corpus, after 80, 000 tokens being annotated the learning curve has not completely converged but additional annotations do not pay off either." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-140", "text": "The VSA curve mirrors this behavior since it keeps on ascending with a low slope, though the SA curve remains quite obscure, here." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-141", "text": "A similar behavior can be observed for the CYTOREC corpus." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-142", "text": "The learning curve is only slightly ascending after about 65,000 tokens have been annotated." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-143", "text": "This is nicely mirrored by the VSA curve." 
}, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-144", "text": "Again, the SA curve is almost impossible to interpret: Though its slope decreases a bit after roughly 40,000 tokens, it keeps ascending thereafter." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-145", "text": "Both SA curves exhibit an oscillating behavior that does not contain any clue to guide stopping decisions." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-146", "text": "We have seen that in the simulation scenario the two agreement curves (SA and VSA) share a similar curve progression due to the simulation effect (cf. Figure 1 . But even in the simulation scenario the SA curve might be problematic and hence misleading as can be concluded from our experiments on the PBVAR corpus." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-147", "text": "In the real-world annotation scenario these SA curves are clueless to approximate the progression of the learning curve." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-148", "text": "However, our experiments suggest (see Figures 3 and 4 ) that the VSA curve is a good estimator for the progression of the learning curve and also works in practice, while the SA curve fails as a reliable predictor in our realworld annotation scenario." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-149", "text": "Still, even the more predictive VSA curves merely guide but do not finalize stopping decisions." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-150", "text": "So it is left to the annotation manager's over-all assessment to balance the trade-off between annotation costs and expectable quality gains for the learner." 
}, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-151", "text": "----------------------------------" }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-152", "text": "**CONCLUSIONS**" }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-153", "text": "In this paper, we discussed an approach to approximate the progression of the learning curve for AL-driven annotation." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-154", "text": "Such an approximation can be used to estimate the relative quality gains of further annotation efforts." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-155", "text": "This might render valuable decision support for the question when to actually stop an annotation process, in practice, and is especially helpful when a learning curve is not available due to the absence of a labeled gold standard." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-156", "text": "We have deliberately refrained from defining a fixed stopping condition for AL-driven annotations." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-157", "text": "In practice, further annotation efforts will mostly result in some, although mild, classifier improvement." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-158", "text": "Whether the respective gain justifies the efforts (and costs) depends on the task at hand." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-159", "text": "As far as the learning curve and its approximation is concerned, the relative gains can be estimated." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-160", "text": "Such an approach might be more adequate for practical use cases rather than a single-point stopping condition which does not incorporate trade-off considerations of any sort." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-161", "text": "Further, we have discussed that AL simulations are subject to the simulation effect." 
}, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-162", "text": "From our experiments we conclude that approaches to monitor the progress (in whatever manner) of AL-driven annotation should always be based on a separate validation set instead of the material directly involved in the AL training process." }, { "sent_id": "ec0ae4e56c069e3efb4a2dc12199cd-C001-163", "text": "As the validation set does not need to be labeled and for almost all NLP applications unlabeled material is available in virtually unlimited volumes this approach comes at not extra costs." } ], "y": { "@BACK@": { "gold_contexts": [ [ "ec0ae4e56c069e3efb4a2dc12199cd-C001-18" ], [ "ec0ae4e56c069e3efb4a2dc12199cd-C001-47" ], [ "ec0ae4e56c069e3efb4a2dc12199cd-C001-108" ] ], "cite_sentences": [ "ec0ae4e56c069e3efb4a2dc12199cd-C001-18", "ec0ae4e56c069e3efb4a2dc12199cd-C001-47", "ec0ae4e56c069e3efb4a2dc12199cd-C001-108" ] }, "@USE@": { "gold_contexts": [ [ "ec0ae4e56c069e3efb4a2dc12199cd-C001-100" ] ], "cite_sentences": [ "ec0ae4e56c069e3efb4a2dc12199cd-C001-100" ] } } }, "ABC_74e7b114ae968e196ea87f529f5eff_7": { "x": [ { "sent_id": "74e7b114ae968e196ea87f529f5eff-C001-37", "text": "----------------------------------" }, { "sent_id": "74e7b114ae968e196ea87f529f5eff-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "74e7b114ae968e196ea87f529f5eff-C001-2", "text": "In this paper, we describe ROOT13, a supervised system for the classification of hypernyms, co-hyponyms and random words." }, { "sent_id": "74e7b114ae968e196ea87f529f5eff-C001-3", "text": "The system relies on a Random Forest algorithm and 13 unsupervised corpus-based features." }, { "sent_id": "74e7b114ae968e196ea87f529f5eff-C001-4", "text": "We evaluate it with a 10-fold cross validation on 9,600 pairs, equally distributed among the three classes and involving several Parts-OfSpeech (i.e. adjectives, nouns and verbs)." 
}, { "sent_id": "74e7b114ae968e196ea87f529f5eff-C001-5", "text": "When all the classes are present, ROOT13 achieves an F1 score of 88.3%, against a baseline of 57.6% (vector cosine)." }, { "sent_id": "74e7b114ae968e196ea87f529f5eff-C001-6", "text": "When the classification is binary, ROOT13 achieves the following results: hypernyms-co-hyponyms (93.4% vs. 60.2%), hypernymsrandom (92.3% vs. 65.5%) and co-hyponyms-random (97.3% vs. 81.5%)." }, { "sent_id": "74e7b114ae968e196ea87f529f5eff-C001-7", "text": "Our results are competitive with stateof-the-art models." }, { "sent_id": "74e7b114ae968e196ea87f529f5eff-C001-8", "text": "----------------------------------" }, { "sent_id": "74e7b114ae968e196ea87f529f5eff-C001-9", "text": "**INTRODUCTION AND RELATED WORK**" }, { "sent_id": "74e7b114ae968e196ea87f529f5eff-C001-10", "text": "Distinguishing hypernyms (e.g. dog-animal) from cohyponyms (e.g. dog-cat) and, in turn, discriminating them from random words (e.g. dog-fruit) is a fundamental task in Natural Language Processing (NLP)." }, { "sent_id": "74e7b114ae968e196ea87f529f5eff-C001-11", "text": "Hypernymy in fact represents a key organization principle of semantic memory (Murphy, 2002) , the backbone of taxonomies and ontologies, and one of the crucial inferences supporting lexical entailment (Geffet and Dagan, 2005) ." }, { "sent_id": "74e7b114ae968e196ea87f529f5eff-C001-12", "text": "Cohyponymy (or coordination), on the other hand, is the relation held by words sharing a close hypernym, which are therefore attributionally similar (Weeds et al., 2014) ." }, { "sent_id": "74e7b114ae968e196ea87f529f5eff-C001-13", "text": "The ability of discriminating hypernymy, co-hyponymy and random words has potentially infinite applications, including automatic thesauri creation, paraphrasing, textual entailment, sentiment analysis and so on (Weeds et al., 2014) ." 
}, { "sent_id": "74e7b114ae968e196ea87f529f5eff-C001-14", "text": "For this reason, in the last decades, numerous methods, datasets and shared tasks have been proposed to improve computers' ability in such discrimination, generally achieving promising results (Weeds et al., 2014; Rimmel, 2014; Geffet and Dagan, 2005) ." }, { "sent_id": "74e7b114ae968e196ea87f529f5eff-C001-15", "text": "Both supervised and unsupervised approaches have been investigated." }, { "sent_id": "74e7b114ae968e196ea87f529f5eff-C001-16", "text": "The former have been shown to outperform the latter in Weeds et al. (2014) , even though Levy et al. (2015) have recently claimed that these methods may learn whether a term y is a prototypical hypernym, regardless of its actual relation with a term x." }, { "sent_id": "74e7b114ae968e196ea87f529f5eff-C001-17", "text": "In this paper, we propose a supervised method, based on a Random Forest algorithm and 13 corpus-based features." }, { "sent_id": "74e7b114ae968e196ea87f529f5eff-C001-18", "text": "In our evaluation, carried out using the 10-fold cross validation on 9,600 pairs, we achieved an accuracy of 88.3% when the three classes are present, and of 92.3% and 97.3% when only two classes are present." }, { "sent_id": "74e7b114ae968e196ea87f529f5eff-C001-19", "text": "Such results are competitive with the state-of-the-art (Weeds et al., 2014) ." }, { "sent_id": "74e7b114ae968e196ea87f529f5eff-C001-20", "text": "----------------------------------" }, { "sent_id": "74e7b114ae968e196ea87f529f5eff-C001-21", "text": "**METHOD AND EVALUATION**" }, { "sent_id": "74e7b114ae968e196ea87f529f5eff-C001-22", "text": "ROOT13 uses the Random Forest algorithm implemented in Weka (Breiman, 2001) , with the default settings." }, { "sent_id": "74e7b114ae968e196ea87f529f5eff-C001-23", "text": "It relies on 13 features that are described below." 
}, { "sent_id": "74e7b114ae968e196ea87f529f5eff-C001-24", "text": "Each of them is automatically extracted from a window-based Vector Space Model (VSM), built on a combination of ukWaC and WaCkypedia corpora (around 2.7 billion words) and recording word co-occurrences within the 5 nearest content words to the left and right of each target." }, { "sent_id": "74e7b114ae968e196ea87f529f5eff-C001-25", "text": "FEATURES." }, { "sent_id": "74e7b114ae968e196ea87f529f5eff-C001-26", "text": "The feature set was designed to identify several distributional properties characterizing the terms in the pairs." }, { "sent_id": "74e7b114ae968e196ea87f529f5eff-C001-27", "text": "On top of the standard features (e.g. vector cosine, co-occurrence and frequencies), we have added several features capturing the generality of the terms and of their contexts 1 , plus two unsupervised measures for capturing similarity (Santus et al., 2014b-c) ." }, { "sent_id": "74e7b114ae968e196ea87f529f5eff-C001-28", "text": "All the features are normalized in the range 0-1:" }, { "sent_id": "74e7b114ae968e196ea87f529f5eff-C001-29", "text": "\u2022 Cos: vector cosine (Turney and Pantel, 2010 );" }, { "sent_id": "74e7b114ae968e196ea87f529f5eff-C001-30", "text": "\u2022 Cooc: co-occurrence frequency;" }, { "sent_id": "74e7b114ae968e196ea87f529f5eff-C001-31", "text": "\u2022 Freq 1, 2: two features storing the frequency the terms;" }, { "sent_id": "74e7b114ae968e196ea87f529f5eff-C001-32", "text": "\u2022 Entr 1, 2: two features storing the entropy of the terms;" }, { "sent_id": "74e7b114ae968e196ea87f529f5eff-C001-33", "text": "\u2022 Shared: extent of the intersection between the top 1k most mutually related contexts of the two terms 2 ; \u2022 APSyn: for every context in the intersection between the top 1k most mutually related contexts of the two terms, this measure adds 1, divided by its average rank (Santus et al. 
2014b-c); \u2022 Diff Freqs: difference between the terms' frequencies; \u2022 Diff Entrs: difference between the terms' entropies; \u2022 C-Freq 1, 2: two features storing the average frequency among the top 1k most mutually related contexts for each term; \u2022 C-Entr 1, 2: two features storing the average entropy among the top 1k most mutually related contexts for each term." }, { "sent_id": "74e7b114ae968e196ea87f529f5eff-C001-34", "text": "DATASET." }, { "sent_id": "74e7b114ae968e196ea87f529f5eff-C001-35", "text": "We have used 9,600 pairs, randomly extracted from three datasets: Lenci/Benotto (Santus et al., 2014b), BLESS (Baroni and Lenci, 2011) and EVALution (Santus et al., 2015)." }, { "sent_id": "74e7b114ae968e196ea87f529f5eff-C001-36", "text": "The pairs are equally distributed among the three classes (i.e. hypernyms, co-hyponyms and random words) and involve several Parts-Of-Speech." }, { "sent_id": "74e7b114ae968e196ea87f529f5eff-C001-38", "text": "**TASKS. FOUR CLASSIFICATION TASKS HAVE BEEN CARRIED OUT.**" }, { "sent_id": "74e7b114ae968e196ea87f529f5eff-C001-39", "text": "One involves all three classes and the other three involve only two of them." }, { "sent_id": "74e7b114ae968e196ea87f529f5eff-C001-40", "text": "F1 score on a 10-fold cross validation was chosen as the accuracy measure." }, { "sent_id": "74e7b114ae968e196ea87f529f5eff-C001-41", "text": "BASELINE." }, { "sent_id": "74e7b114ae968e196ea87f529f5eff-C001-42", "text": "Vector cosine is used as the baseline." }, { "sent_id": "74e7b114ae968e196ea87f529f5eff-C001-43", "text": "It achieves a reasonable accuracy, which is nevertheless far below the results obtained by our model." }, { "sent_id": "74e7b114ae968e196ea87f529f5eff-C001-44", "text": "As will be shown below, our model does not benefit from its use." }, { "sent_id": "74e7b114ae968e196ea87f529f5eff-C001-45", "text": "RESULTS." 
}, { "sent_id": "74e7b114ae968e196ea87f529f5eff-C001-46", "text": "Table 1 describes the features' contributions in the four classification tasks." }, { "sent_id": "74e7b114ae968e196ea87f529f5eff-C001-47", "text": "As can be seen, when all the classes are involved, every feature contributes for an increment between 0.4% and 4.8%, except for the feature Shared, whose contribution is +12.5%." }, { "sent_id": "74e7b114ae968e196ea87f529f5eff-C001-48", "text": "The relevance of this feature is confirmed also in the other three tasks." }, { "sent_id": "74e7b114ae968e196ea87f529f5eff-C001-49", "text": "Interestingly, instead, the vector cosine does not contribute to our score." }, { "sent_id": "74e7b114ae968e196ea87f529f5eff-C001-50", "text": "It instead penalizes it in three tasks out of four." }, { "sent_id": "74e7b114ae968e196ea87f529f5eff-C001-51", "text": "The only task in which it actually contributes for 0.1% is the discrimination between co-hyponyms and randoms, which is its main function." }, { "sent_id": "74e7b114ae968e196ea87f529f5eff-C001-52", "text": "----------------------------------" }, { "sent_id": "74e7b114ae968e196ea87f529f5eff-C001-53", "text": "**CONCLUSIONS**" }, { "sent_id": "74e7b114ae968e196ea87f529f5eff-C001-54", "text": "In this paper, we have described ROOT13, a classifier for hypernyms, co-hyponyms and random words." }, { "sent_id": "74e7b114ae968e196ea87f529f5eff-C001-55", "text": "The classifier, based on the Random Forest algorithm, uses only 13 unsupervised corpus-based features, which have been described and their contribution reported." }, { "sent_id": "74e7b114ae968e196ea87f529f5eff-C001-56", "text": "Our classifier is competitive with the state-of-the-art (Weeds et al., 2014) ." }, { "sent_id": "74e7b114ae968e196ea87f529f5eff-C001-57", "text": "In a second run of tests, we have noticed the Levy et al. (2015) 's effect, that is the classification of switched hypernyms as hypernyms (e.g. dog-vehicle, car-animal) ." 
}, { "sent_id": "74e7b114ae968e196ea87f529f5eff-C001-58", "text": "How-ever, we were able to remove it -without any sensible loss in accuracy -by training the model also on switched hypernyms labeled as randoms." } ], "y": { "@BACK@": { "gold_contexts": [ [ "74e7b114ae968e196ea87f529f5eff-C001-12" ], [ "74e7b114ae968e196ea87f529f5eff-C001-13" ], [ "74e7b114ae968e196ea87f529f5eff-C001-14" ], [ "74e7b114ae968e196ea87f529f5eff-C001-15", "74e7b114ae968e196ea87f529f5eff-C001-16" ] ], "cite_sentences": [ "74e7b114ae968e196ea87f529f5eff-C001-12", "74e7b114ae968e196ea87f529f5eff-C001-13", "74e7b114ae968e196ea87f529f5eff-C001-14", "74e7b114ae968e196ea87f529f5eff-C001-16" ] }, "@DIF@": { "gold_contexts": [ [ "74e7b114ae968e196ea87f529f5eff-C001-18", "74e7b114ae968e196ea87f529f5eff-C001-19" ], [ "74e7b114ae968e196ea87f529f5eff-C001-56" ] ], "cite_sentences": [ "74e7b114ae968e196ea87f529f5eff-C001-19", "74e7b114ae968e196ea87f529f5eff-C001-56" ] }, "@USE@": { "gold_contexts": [ [ "74e7b114ae968e196ea87f529f5eff-C001-18", "74e7b114ae968e196ea87f529f5eff-C001-19" ], [ "74e7b114ae968e196ea87f529f5eff-C001-56" ] ], "cite_sentences": [ "74e7b114ae968e196ea87f529f5eff-C001-19", "74e7b114ae968e196ea87f529f5eff-C001-56" ] } } }, "ABC_681c3e59adbfc09a28d267a4885598_7": { "x": [ { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-2", "text": "Most existing relation extraction models assume a fixed set of relations and are unable to adapt to exploit newly available supervision data to extract new relations." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-3", "text": "In order to alleviate such problems, there is the need to develop approaches that make relation extraction models capable of continuous adaptation and learning." 
}, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-4", "text": "We investigate and present results for such an approach, based on a combination of ideas from lifelong learning and optimizationbased meta-learning." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-5", "text": "We evaluate the proposed approach on two recent lifelong relation extraction benchmarks, and demonstrate that it markedly outperforms current state-of-the-art approaches." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-6", "text": "----------------------------------" }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-8", "text": "The majority of existing supervised relation extraction models can only extract a fixed set of relations which has been specified at training time." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-9", "text": "They are unable to detect an evolving set of novel relations observed after training without substantial retraining, which can be computationally expensive and may lead to catastrophic forgetting of previously learned relations." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-10", "text": "Zero-shot relation extraction approaches (Rockt\u00e4schel et al., 2015; Demeester et al., 2016; Levy et al., 2017; Obamuyide and Vlachos, 2018) can extract unseen relations, but at lower performance levels, and are unable to continually exploit newly available supervision to improve performance without considerable retraining." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-11", "text": "These limitations also extend to approaches to extracting relations in other limited supervision settings, for instance in the oneshot setting (Obamuyide and Vlachos, 2017) ." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-12", "text": "It is therefore desirable for relation extraction models to have the capability to learn continuously without catastrophic forgetting of previously learned relations." 
}, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-13", "text": "This would enable them exploit newly available supervision to both identify novel relations and improve performance without substantial retraining." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-14", "text": "Recently, Wang et al. (2019) introduced an embedding alignment approach to enable continual learning for relation extraction models." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-15", "text": "They consider a setting with streaming tasks, where each task consists of a number of distinct relations, and proposed to align the representation of relation instances in the embedding space to enable continual learning of new relations without forgetting knowledge from past relations." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-16", "text": "While they obtained promising results, a key weakness of the approach is that the use of an alignment model introduces additional parameters to already overparameterized relation extraction models, which may in turn lead to an increase in the quantity of supervision required for training." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-17", "text": "In addition, the approach can only align embeddings between observed relations, and does not have any explicit objective that encourages the model to transfer and exploit knowledge gathered from previously observed relations to facilitate the efficient learning of yet to be observed relations." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-18", "text": "In this work, we extend the work of Wang et al. (2019) by exploiting ideas from both lifelong learning and meta-learning." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-19", "text": "We propose to consider lifelong relation extraction as a metalearning challenge, to which the machinery of current optimization-based meta-learning algorithms can be applied." 
}, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-20", "text": "Unlike the use of a separate alignment model as proposed in Wang et al. (2019) , the proposed approach does not introduce additional parameters." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-21", "text": "In addition, the proposed approach is more data efficient since it explicitly optimizes for the transfer of knowledge from past relations, while avoiding the catastrophic forgetting of previously learned relations." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-22", "text": "Empirically, we evaluate on lifelong versions of the datasets by Bordes et al. (2015) and Han et al. (2018) and demonstrate con-siderable performance improvements over prior state-of-the-art approaches." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-23", "text": "----------------------------------" }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-24", "text": "**BACKGROUND**" }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-25", "text": "Lifelong Learning In the lifelong learning setting, also referred to as continual learning (Ring, 1994; Thrun, 1996; Zhao and Schmidhuber, 1996) , a model f \u03b8 is presented with a sequence of tasks {T t } t=1,2,3..,T , one task per round, and the goal is to learn model parameters {\u03b8 t } t=1,2,3,..,T with the best performance on the observed tasks." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-26", "text": "Each task T can be a conventional supervised task with its own distinct train (T train ), development (T dev ) and test (T test ) splits." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-27", "text": "At each round t, the model is allowed to exploit knowledge gained from the previous t \u2212 1 tasks to enhance performance on the current task." 
}, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-28", "text": "In addition, the model is also allowed to have a small-sized buffer memory B, which can be used to store a limited amount of data from previously observed tasks." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-29", "text": "A prominent line of work in lifelong learning research is developing approaches that enable models learn new tasks without forgetting knowledge from previous tasks, i.e. avoiding catastrophic forgetting of old tasks (McCloskey and Cohen, 1989; Ratcliff, 1990; McClelland et al., 1995; French, 1999) ." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-30", "text": "Approaches proposed to address this problem include memory-based approaches (Lopez-Paz and Ranzato, 2017; Rebuffi et al., 2017; Chaudhry et al., 2019) ; parameter consolidation approaches (Kirkpatrick et al., 2017; Zenke et al., 2017) ; and dynamic model architecture approaches (Xiao et al., 2014; Rusu et al., 2016; Fernando et al., 2017) ." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-31", "text": "Meta-Learning Meta-learning, or learning to learn (Schmidhuber, 1987; Naik and Mammone, 1992; Thrun and Pratt, 1998) , aims to develop algorithms that learn a generic knowledge of how to solve tasks from a given distribution of tasks, by generalizing from solving related tasks from that distribution." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-32", "text": "Given tasks T sampled from a distribution of tasks p(T ), and a learner model f (x; \u03b8) parameterized by \u03b8, gradient-based meta-learning methods, such as MAML (Finn et al., 2017) , learn a prior initialization of the parameters of the model which, at meta-test time, can be quickly adapted to achieve good performance on a new task using a few steps of gradient descent." 
}, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-33", "text": "During adaptation to the new task, the model parameters \u03b8 are updated to task-specific parameters \u03b8 with good performance on the task." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-34", "text": "Formally, the meta-learning algorithms optimize for the meta-objective:" }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-35", "text": "( 1) where L T is the loss and D T is training data from task T , and U is a fixed gradient descent learning rule, such as vanilla SGD." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-36", "text": "While these algorithms were proposed and evaluated in the context of fewshot learning, here we demonstrate their effectiveness when utilized in the lifelong learning setting for relation extraction, following similar intuition as recent work by Finn et al. (2019) ." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-37", "text": "----------------------------------" }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-38", "text": "**META-LEARNING FOR LIFELONG RELATION EXTRACTION**" }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-39", "text": "It can be inferred from the previous section that a lot of lifelong learning research has focused on approaches to avoid catastrophic forgetting (i.e. negative backward transfer of knowledge) while recent meta-learning studies have focused on effective approaches for positive forward transfer of knowledge (for few-shot tasks)." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-40", "text": "Given the complementary strengths of the approaches from the two learning settings, we propose to embed metalearning into the lifelong learning process for relation extraction." 
}, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-41", "text": "While we can utilize the MAML algorithm to directly optimize the meta-objective in Equation 1 for our purpose, doing so requires the computation of second-order derivatives, which can be computationally expensive." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-42", "text": "Nichol et al. (2018) proposed REPTILE, a first-order alternative to MAML, which uses only first-order derivatives." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-43", "text": "Similar to MAML, REPTILE works by repeatedly sampling tasks, training on those tasks and moving the initialization towards the adapted weights on those tasks." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-44", "text": "Here we adopt the REPTILE algorithm for meta-learning." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-45", "text": "Our algorithm for lifelong relation extraction is illustrated in Algorithm 1." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-46", "text": "We start by randomly initializing the parameters of the relation extraction model (the learner) (line 1)." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-47", "text": "Then, as new tasks arrive, we augment their training set with randomly sampled task exemplars from the buffer memory B (lines 2-9)." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-48", "text": "We then sample a batch of relations from the augmented training set (line 10)." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-49", "text": "Then for each sampled relation R i , we sample a batch of supervision instances D train R i from its training set (line 11-12)." 
}, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-50", "text": "We then obtain the adapted model parameters \u03b8 i t on the relation by first computing the gradient of the training loss on the sampled relation instances (line 13) and backpropagating the gradients with a gradient-based optimization algorithm (such as SGD or Adagrad (Duchi et al., 2011) ) (line 14)." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-51", "text": "At the end of the learning iteration, the adapted parameters on all sampled relations in the batch are averaged, and an update is made on the task parameters \u03b8 t (line 16)." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-52", "text": "This is done until convergence on the current task, after which exemplars of the current task are added to the buffer memory (line 18)." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-53", "text": "Task exemplars are obtained by first clustering all training instances of the current task into 50 clusters using K-Means, then selecting an instance from each cluster with a representation closest to the cluster prototype." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-54", "text": "Finally, the model parameters are updated to the current task's adapted parameters (line 19)." 
}, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-55", "text": "----------------------------------" }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-56", "text": "**ALGORITHM 1 META-LEARNING FOR LIFELONG RELATION EXTRACTION (MLLRE)**" }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-57", "text": "end for 16:" }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-58", "text": "Update task parameters:" }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-59", "text": "Add exemplars of Tt to B 19:" }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-60", "text": "Update \u03b8 \u2190 \u03b8t 20: end while" }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-61", "text": "----------------------------------" }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-62", "text": "**RELATION CLASSIFICATION MODEL**" }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-63", "text": "In principle the learner model f \u03b8 could be any gradient-optimized relation extraction model." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-64", "text": "In order to use the same number of parameters and ensure fair comparison to Wang et al. (2019) , we adopt as the relation extraction model f \u03b8 the Hier- arachical Residual BiLSTM (HR-BiLSTM) model of Yu et al. (2017) , which is the same model used by Wang et al. (2019) for their experiments." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-65", "text": "The HR-BILSTM is a relation classifier which accepts as input a sentence and a candidate relation, then utilizes two Bidirectional Long Short-Term Memory (Hochreiter and Schmidhuber, 1997; Graves and Schmidhuber, 2005) (BiLSTM) units with shared parameters to process the Glove (Pennington et al., 2014) embeddings of words in the sentence and relation names, then selects the relation with the maximum cosine similarity to the sentence as its response." 
}, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-66", "text": "Hyperparameters Apart from the hyperparameters specific to meta-learning (such as the step size ), all other hyperparameters we use for the learner model are the same as used by Wang et al. (2019) ." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-67", "text": "We also use the same buffer memory size (50) for each task." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-68", "text": "Note that the meta-learning algorithm uses SGD as the update rule (U), and does not add any additional trainable parameters to the learner model." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-69", "text": "----------------------------------" }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-70", "text": "**EXPERIMENTS**" }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-71", "text": "----------------------------------" }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-72", "text": "**SETUP**" }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-73", "text": "We conduct experiments in two settings." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-74", "text": "In the full supervision setting, we provide all models with all supervision available in the training set of each task." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-75", "text": "In the second, we limit the amount of supervision for each task to measure how the models are able to cope with limited supervision." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-76", "text": "Each experiment is run five (5) times and we report the average result." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-77", "text": "----------------------------------" }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-78", "text": "**DATASETS**" }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-79", "text": "We conduct experiments on Lifelong FewRel and Lifelong SimpleQuestions datasets, both introduced in Wang et al. (2019) ." 
}, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-80", "text": "Lifelong FewRel is derived from the FewRel (Han et al., 2018) dataset, by partitioning its 80 relations into 10 distinct clusters made up of 8 relations each, with each cluster serving as a task where a sentence must be labeled with the correct relation." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-81", "text": "The 8 relations in each cluster were obtained by clustering the averaged Glove word embeddings of the relation names in the FewRel dataset." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-82", "text": "Each instance of the dataset contains a sentence, the relation it expresses and a set of randomly sampled negative relations." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-83", "text": "Lifelong SimpleQuestions was similarly obtained from the SimpleQuestions (Bordes et al., 2015) dataset, and is made up of 20 clusters of relations, with each cluster serving as a task." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-84", "text": "----------------------------------" }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-85", "text": "**EVALUATION METRICS**" }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-86", "text": "We report two measures, ACC whole and ACC avg , both introduced in Wang et al. (2019) ." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-87", "text": "ACC whole measures accuracy on the test set of all tasks and gives a balanced measure of model performance on both observed (seen) and unobserved (unseen) tasks, and is the primary metric we report for all experiments." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-88", "text": "We also report ACC avg , which measures the average accuracy on the test set of only observed (seen) tasks." 
}, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-89", "text": "----------------------------------" }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-90", "text": "**RESULTS AND DISCUSSION**" }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-91", "text": "Full Supervision Results Table 1 gives both the ACC whole and ACC avg results of our approach compared to other approaches including Episodic Memory Replay (EMR) and its various embedding-aligned variants EA-EMR as proposed in Wang et al. (2019) ." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-92", "text": "Across all metrics, our approach outperforms the previous approaches, demonstrating its effectiveness in this setting." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-93", "text": "This result is likely because our approach is able to efficiently learn new relations by exploiting knowledge from previously observed relations." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-94", "text": "----------------------------------" }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-95", "text": "**LIMITED SUPERVISION RESULTS**" }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-96", "text": "The aim of our limited supervision experiments is to compare the use of an alignment module as proposed by Wang et al. (2019) to using our approach when only limited supervision is available for all tasks." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-97", "text": "We compare three approaches, Full EA-EMR (which uses their alignment module), its variant without the alignment module (EA-EMR NoAlign) and our approach (MLLRE)." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-98", "text": "Figures 1(a) and 1(b) show results obtained using 100 supervision instances for each task on Lifelong FewRel and Lifelong SimpleQuestions." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-99", "text": "Figures 2(a) and 2(b) show the corresponding plots using 200 supervision instances for each task." 
}, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-100", "text": "From the figures, we observe that the use of a separate alignment model results in only minor gains when supervision for the tasks is limited, whereas the use of our approach leads to wide gains on both datasets." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-101", "text": "In summary, because our approach explicitly encourages the model to learn to share and transfer knowledge between relations (by means of the meta-learning objective), the model is able to learn to exploit common structures across relations in different tasks to efficiently learn new relations over time." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-102", "text": "This leads to the performance improvements obtained by our approach." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-103", "text": "----------------------------------" }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-104", "text": "**CONCLUSION**" }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-105", "text": "We investigated the effectiveness of utilizing a gradient-based meta-learning algorithm within a lifelong learning setting to enable relation extraction models that are able to learn continually." }, { "sent_id": "681c3e59adbfc09a28d267a4885598-C001-106", "text": "We show the effectiveness of this approach, both when provided full supervision for new tasks and when provided limited supervision for new tasks, and demonstrated that the proposed approach outperformed current state-of-the-art approaches." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "681c3e59adbfc09a28d267a4885598-C001-14", "681c3e59adbfc09a28d267a4885598-C001-15", "681c3e59adbfc09a28d267a4885598-C001-16" ], [ "681c3e59adbfc09a28d267a4885598-C001-79", "681c3e59adbfc09a28d267a4885598-C001-80" ], [ "681c3e59adbfc09a28d267a4885598-C001-86", "681c3e59adbfc09a28d267a4885598-C001-87", "681c3e59adbfc09a28d267a4885598-C001-88" ] ], "cite_sentences": [ "681c3e59adbfc09a28d267a4885598-C001-14", "681c3e59adbfc09a28d267a4885598-C001-15", "681c3e59adbfc09a28d267a4885598-C001-16", "681c3e59adbfc09a28d267a4885598-C001-79", "681c3e59adbfc09a28d267a4885598-C001-80", "681c3e59adbfc09a28d267a4885598-C001-86", "681c3e59adbfc09a28d267a4885598-C001-87", "681c3e59adbfc09a28d267a4885598-C001-88" ] }, "@EXT@": { "gold_contexts": [ [ "681c3e59adbfc09a28d267a4885598-C001-18" ], [ "681c3e59adbfc09a28d267a4885598-C001-20" ] ], "cite_sentences": [ "681c3e59adbfc09a28d267a4885598-C001-18", "681c3e59adbfc09a28d267a4885598-C001-20" ] }, "@USE@": { "gold_contexts": [ [ "681c3e59adbfc09a28d267a4885598-C001-64" ], [ "681c3e59adbfc09a28d267a4885598-C001-66" ], [ "681c3e59adbfc09a28d267a4885598-C001-79" ], [ "681c3e59adbfc09a28d267a4885598-C001-91" ], [ "681c3e59adbfc09a28d267a4885598-C001-96" ] ], "cite_sentences": [ "681c3e59adbfc09a28d267a4885598-C001-64", "681c3e59adbfc09a28d267a4885598-C001-66", "681c3e59adbfc09a28d267a4885598-C001-79", "681c3e59adbfc09a28d267a4885598-C001-91", "681c3e59adbfc09a28d267a4885598-C001-96" ] } } }, "ABC_6fbfc9f887e736472510bce30c9228_7": { "x": [ { "sent_id": "6fbfc9f887e736472510bce30c9228-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "6fbfc9f887e736472510bce30c9228-C001-2", "text": "Instant messaging dialogue is real-time, text-based computer-mediated communication conducted over the Internet." 
}, { "sent_id": "6fbfc9f887e736472510bce30c9228-C001-3", "text": "Messages sent over instant messaging can be encoded We propose a method of using dialogue acts to predict utterances in taskoriented dialogue." }, { "sent_id": "6fbfc9f887e736472510bce30c9228-C001-4", "text": "Dialogue acts provide a semantic representation of utterances in a dialogue." }, { "sent_id": "6fbfc9f887e736472510bce30c9228-C001-5", "text": "An evaluation using a dialogue simulation program shows that our proposed method of predicting responses provides useful suggestions for almost all response types." }, { "sent_id": "6fbfc9f887e736472510bce30c9228-C001-6", "text": "----------------------------------" }, { "sent_id": "6fbfc9f887e736472510bce30c9228-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "6fbfc9f887e736472510bce30c9228-C001-8", "text": "Support services in many domains have traditionally been provided over the telephone: when customers have queries, they dial a support number and speak to a support representative." }, { "sent_id": "6fbfc9f887e736472510bce30c9228-C001-9", "text": "Recent years have seen an increasing trend in support services provided over the Internet." }, { "sent_id": "6fbfc9f887e736472510bce30c9228-C001-10", "text": "Many companies have web sites with Frequently Asked Questions (FAQs), and also offer e-mail support." }, { "sent_id": "6fbfc9f887e736472510bce30c9228-C001-11", "text": "More recently, real-time support via online chat sessions is being offered where customers and support representatives type short messages to each other." }, { "sent_id": "6fbfc9f887e736472510bce30c9228-C001-12", "text": "Chat sessions are conducted over a network, such as the Internet, where textual messages can be sent and received between interlocutors in real-time." }, { "sent_id": "6fbfc9f887e736472510bce30c9228-C001-13", "text": "These chat sessions are commonly referred to as instant messaging." 
}, { "sent_id": "6fbfc9f887e736472510bce30c9228-C001-14", "text": "Support services that are conducted via instant messaging vary from being person-person dialogue, similar to traditional call centres, through to being entirely automated where customers engage in dialogue with a computer program." }, { "sent_id": "6fbfc9f887e736472510bce30c9228-C001-15", "text": "Commercial software is available to partially automated online support by suggesting responses to a human agent, which may then be accepted or overwritten." }, { "sent_id": "6fbfc9f887e736472510bce30c9228-C001-16", "text": "The research presented in this paper aims to provide a degree of natural language understanding to assist in automating task-oriented dialogue, such as support services, by suggesting utterances during the dialogue." }, { "sent_id": "6fbfc9f887e736472510bce30c9228-C001-17", "text": "We apply various probabilistic methods to improve discourse modelling in the support services domain." }, { "sent_id": "6fbfc9f887e736472510bce30c9228-C001-18", "text": "In previous work, we collected a small corpus of task-oriented dialogues between customers and support representatives from the MSN Shopping online support service (Ivanovic, 2005b) ." }, { "sent_id": "6fbfc9f887e736472510bce30c9228-C001-19", "text": "The service is designed to assist potential customers with finding various items for sale on the MSN Shopping web site." }, { "sent_id": "6fbfc9f887e736472510bce30c9228-C001-20", "text": "A sample from one of the dialogues in this corpus is shown in Table 1 ." }, { "sent_id": "6fbfc9f887e736472510bce30c9228-C001-21", "text": "The research presented here advances previous work which examined various models and tech-niques to predict dialogue acts in task-oriented instant messaging." }, { "sent_id": "6fbfc9f887e736472510bce30c9228-C001-22", "text": "In Ivanovic (2005b) , the MSN Shopping corpus was collected and a gold standard produced by labelling the utterances with dialogue acts." 
}, { "sent_id": "6fbfc9f887e736472510bce30c9228-C001-23", "text": "Probabilistic models were then trained to predict dialogue acts given a sequence of utterances." }, { "sent_id": "6fbfc9f887e736472510bce30c9228-C001-24", "text": "Ivanovic (2005a) examined probabilistic and linguistic methods to automatically segment messages from the same corpus into utterances." }, { "sent_id": "6fbfc9f887e736472510bce30c9228-C001-25", "text": "The present paper concludes this work by applying the models to a dialogue simulation program which suggests utterance responses during a dialogue." }, { "sent_id": "6fbfc9f887e736472510bce30c9228-C001-26", "text": "The performance of the suggested utterances is then evaluated." }, { "sent_id": "6fbfc9f887e736472510bce30c9228-C001-27", "text": "----------------------------------" }, { "sent_id": "6fbfc9f887e736472510bce30c9228-C001-28", "text": "**BACKGROUND**" }, { "sent_id": "6fbfc9f887e736472510bce30c9228-C001-29", "text": "Our dialogue act tag set contains 12 dialogue acts, which are intended to represent the illocutionary force of an utterance." }, { "sent_id": "6fbfc9f887e736472510bce30c9228-C001-30", "text": "The tags were derived in Ivanovic (2005b) by manually labelling the MSN Shopping corpus using the tags that seemed appropriate from a list of 42 tags in Stolcke et al. (2000) ." }, { "sent_id": "6fbfc9f887e736472510bce30c9228-C001-31", "text": "The MSN Shopping corpus we use comprises approximately 550 utterances and 6,500 words." }, { "sent_id": "6fbfc9f887e736472510bce30c9228-C001-32", "text": "Ivanovic (2005b) describes the manual process of segmenting the messages into utterances and labelling the utterances with dialogue act tags to produce a gold standard version of the data." }, { "sent_id": "6fbfc9f887e736472510bce30c9228-C001-33", "text": "Kappa analysis on both the labelling and segmentation tasks was conducted with results showing high interannotator agreement (Ivanovic, 2005a )." 
}, { "sent_id": "6fbfc9f887e736472510bce30c9228-C001-34", "text": "----------------------------------" }, { "sent_id": "6fbfc9f887e736472510bce30c9228-C001-35", "text": "**EVALUATION AND RESULTS**" }, { "sent_id": "6fbfc9f887e736472510bce30c9228-C001-36", "text": "As part of a high-level, end-to-end evaluation of dialogue act prediction and their usefulness in semiautomated dialogue systems, we developed a program that simulates a live conversation while suggesting responses." }, { "sent_id": "6fbfc9f887e736472510bce30c9228-C001-37", "text": "The suggested utterances are ranked by their respective probabilities given the dialogue history." }, { "sent_id": "6fbfc9f887e736472510bce30c9228-C001-38", "text": "We use cross-validation by training the system on all but one dialogue in our corpus." }, { "sent_id": "6fbfc9f887e736472510bce30c9228-C001-39", "text": "Following training, a customer support scenario is played out using the one dialogue that was not used for training, known as the target dialogue." }, { "sent_id": "6fbfc9f887e736472510bce30c9228-C001-40", "text": "The aim is to replicate substantially all of the utterances in the target dialogue." }, { "sent_id": "6fbfc9f887e736472510bce30c9228-C001-41", "text": "The process is repeated for each dialogue in our corpus." }, { "sent_id": "6fbfc9f887e736472510bce30c9228-C001-42", "text": "Our interface displays a ranked list of suggested dialogue acts and utterances." }, { "sent_id": "6fbfc9f887e736472510bce30c9228-C001-43", "text": "The dialogue acts are ranked from highest to lowest probability as determined by the naive Bayes model." }, { "sent_id": "6fbfc9f887e736472510bce30c9228-C001-44", "text": "The utterances within the dialogue acts are ranked by their frequency count during training." }, { "sent_id": "6fbfc9f887e736472510bce30c9228-C001-45", "text": "However, many utterances are only seen once, in which case the ordering is assumed random as their frequencies are equal." 
}, { "sent_id": "6fbfc9f887e736472510bce30c9228-C001-46", "text": "Our evaluation is only focussed on the dialogue-act rankings, not the utterance rankings." }, { "sent_id": "6fbfc9f887e736472510bce30c9228-C001-47", "text": "When a dialogue act is selected in the \"Suggestions\" list, the list of utterances is updated to show the relevant utterances for that dialogue act." }, { "sent_id": "6fbfc9f887e736472510bce30c9228-C001-48", "text": "Our support dialogue simulation program showed that it is possible to accurately predict many utterances using dialogue acts; 61% of utterances were correctly predicated within the top three ranked dialogues: 22% were in the first rank and 27% in the second." } ], "y": { "@BACK@": { "gold_contexts": [ [ "6fbfc9f887e736472510bce30c9228-C001-18" ], [ "6fbfc9f887e736472510bce30c9228-C001-22" ], [ "6fbfc9f887e736472510bce30c9228-C001-32" ] ], "cite_sentences": [ "6fbfc9f887e736472510bce30c9228-C001-18", "6fbfc9f887e736472510bce30c9228-C001-22", "6fbfc9f887e736472510bce30c9228-C001-32" ] }, "@USE@": { "gold_contexts": [ [ "6fbfc9f887e736472510bce30c9228-C001-18" ], [ "6fbfc9f887e736472510bce30c9228-C001-29", "6fbfc9f887e736472510bce30c9228-C001-30" ] ], "cite_sentences": [ "6fbfc9f887e736472510bce30c9228-C001-18", "6fbfc9f887e736472510bce30c9228-C001-30" ] } } }, "ABC_c126f8b9a5fcb2687494a8c0b1e859_7": { "x": [ { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-2", "text": "We present a diacritization system for written Arabic which is based on a lexical resource." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-3", "text": "It combines a tagger and a lexeme language model." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-4", "text": "It improves on the best results reported in the literature." 
}, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-5", "text": "----------------------------------" }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-7", "text": "Arabic is written without certain orthographic symbols, called diacritics, which represent among other things short vowels." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-8", "text": "1 The restoration of diacritics to written Arabic is an important processing step for several natural language processing applications, including training language models for automatic speech recognition, text-to-speech generation, and so on." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-9", "text": "For a discussion of the role of diacritization, see (Maamouri et al., 2006) ." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-10", "text": "In this paper, we present a new diacritization module that outperforms the best previously published results, using a new combination of techniques." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-11", "text": "A more detailed presentation can be found in (Habash and Rambow 2007) ." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-12", "text": "----------------------------------" }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-13", "text": "**DIACRITIZATION IN ARABIC: LINGUISTIC DESCRIPTION**" }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-14", "text": "Arabic script consists of two classes of symbols: letters and diacritics." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-15", "text": "Letters are always written whereas diacritics are optional: written Arabic can be fully diacritized, it can have some diacritics (to disambiguate certain words), or it can be entirely undiacritized." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-16", "text": "There are three types of diacritics: vowel, nunation, and shadda." 
}, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-17", "text": "Vowel diacritics represent Arabic's three short vowels and the absence of any vowel." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-18", "text": "kAtab 'to correspond' distinguish between the meanings of the word rather than their inflections." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-19", "text": "Thus, there are lexemes that look alike when undiacritized but are spelled differently when diacritized." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-20", "text": "Note that there are also distinct lexemes that are always spelled the same way, even when diacritizedtheir difference is only a difference in word sense." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-21", "text": "Inflectional diacritics distinguish different inflected forms of the same lexeme." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-22", "text": "For instance, the final diacritics in katabta 'you wrote' distinguish the person of the subject of the verb." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-23", "text": "We further distinguish be-2 Arabic orthography calls for adding a silent Alif ( # ) in conjunction with $ % in words ending with a consonant." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-24", "text": "3 A lexeme is an abstraction over inflected wordforms which groups together all those wordforms that differ only in terms of one of the morphological categories such as number, gender, aspect, or voice." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-25", "text": "The lemma is the distinguished word form which serves as citation form." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-26", "text": "tween two types of inflectional diacritics: variant inflectional diacritics and invariant inflectional diacritics." 
}, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-27", "text": "The distinction is made with respect to two morphosyntactic features: nominal case and verbal mood." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-28", "text": "The variant inflectional diacritics need not always appear at the end of the word." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-29", "text": "For instance, the variant inflectional diacritics at the penultimate positions of the following two words distinguish their case:" }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-30", "text": "kAtibuhu 'his writer [nominative]' and" }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-31", "text": "----------------------------------" }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-32", "text": "**THE MADA-D SYSTEM**" }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-33", "text": "In a previous publication, we described the Morphological Analysis and Disambiguation of Arabic (MADA) system (Habash and Rambow, 2005) ." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-34", "text": "The basic approach used in MADA is inspired by the work of Haji\u010d (2000) for tagging morphologically rich languages, which was extended to Arabic independently by Haji\u010d et al. (2005) ." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-35", "text": "In this approach, a set of taggers are trained for individual linguistic features which are components of the full morphological tag (such as core part-of-speech, tense, number, and so on)." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-36", "text": "In Arabic, we have ca." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-37", "text": "2,000 to 20,000 morphological tags, depending on how we count." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-38", "text": "The Buckwalter Arabic Morphological Analyzer (BAMA) (Buckwalter, 2004 ) is consulted to produce a list of possible analyses for a word." 
}, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-39", "text": "BAMA returns, given an undiacritized inflected word form, all possible morphological analyses, including full diacritization for each analysis." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-40", "text": "The results of the individual taggers are used to choose among these possible analyses." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-41", "text": "The algorithm we proposed in (Habash and Rambow, 2005) for choosing the best BAMA analysis simply counts the number of predicted values for the set of linguistic features in each candidate analysis." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-42", "text": "Haji\u010d et al. (2005) , however, weigh the predicted values by their probability or confidence measure." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-43", "text": "To our knowledge, no results on diacritization have been previously reported using this particular approach to tagging." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-44", "text": "4 In this paper, we extend our basic MADA system in the following ways: First, we follow Haji\u010d et al. (2005) in including case, mood, and nunation as features, because of its importance to diacritization." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-45", "text": "Second, we replace the YAMCHA (Kudo and Matsumoto, 2003) implementation of Support Vector Machines (SVMs) with SVMTool (Gim\u00e9nez and M\u00e0rquez, 2004) as our machine learning tool, for reasons of speed, at the cost of a slight decrease in accuracy." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-46", "text": "Like Haji\u010d et al. (2005) , we do not use Viterbi decoding." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-47", "text": "Finally, we introduce a specialized module for resolving residual ambiguity after the basic tagging is done." 
}, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-48", "text": "We explain this module in detail next." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-49", "text": "We train our classifiers on the exact training set defined by Zitouni et al. (2006) , a subpart of the third segment of the Penn Arabic Treebank (Maamouri et al., 2004 ) (\"ATB3-Train\", 288,000 words)." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-50", "text": "We also (reluctantly) follow them in having a single set for development and testing (\"ATB3-Devtest\", 52,000 words), rather than separate development and test sets (as is common), in order to be able to compare our results to theirs." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-51", "text": "Up until this point, MADA-D has narrowed the list of possible analyses of a word (supplied by BAMA) down to a small number." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-52", "text": "This number can sometimes be greater than one for two reasons: first, the way in which we use the output of the taggers to choose among the analyses may yield a tie among several analyses; second, there may be lexeme-based diacritic ambiguity, and the morphological taggers cannot disambiguate lexemic diacritization." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-53", "text": "To address the residual ambiguity, we implemented a second component." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-54", "text": "Ideally, this would be (or include) a full word sense disambiguation (WSD) system, but WSD is a hard problem." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-55", "text": "Instead, we approximate WSD using standard n-gram language models." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-56", "text": "We use two types of data for training: fully diacritized word forms, and data in which we have replaced the inflected word by the diacritized citation form of the lexeme." 
}, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-57", "text": "Note that this procedure conflates lexemes that differ only in meaning, not in diacritization, as we are not actually interested in WSD for its own sake in this paper." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-58", "text": "The training corpus is the same corpus we use for the classifiers, ATB3-Train." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-59", "text": "This means that the diacritization and the choice of lexeme are done by hand, but it also means that the training set is quite small by the standards of language models." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-60", "text": "We build an open-vocabulary language model with Kneser-Ney smoothing using the SRILM toolkit (Stolcke, 2002) ." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-61", "text": "We will call the resulting language models XLM-n, where X is \"D\" for the fully diacritized word forms, or \"L\" for the lexeme citation forms, and n is the order of the n-grams (n = 1, 2, 3)." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-62", "text": "When all candidate tokens (diacritized word or lexeme citation form) are unknown (out-of-vocabulary), the language model does not actually make a choice among them." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-63", "text": "We then use a diacritization unigram model, and then finally random choice." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-64", "text": "In the case of a preceding DLM-n model, this simply amounts to random choice, but in the case of a preceding LLM-n model, the diacritization model may actually make a non-random choice." 
}, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-65", "text": "----------------------------------" }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-66", "text": "**RELATED WORK**" }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-67", "text": "We review three approaches that are directly relevant to us; we refer to the excellent literature review in (Zitouni et al., 2006) for a general review." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-68", "text": "Vergyri and Kirchhoff (2004) follow an approach similar to ours in that they choose from the diacritizations proposed by BAMA." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-69", "text": "However, they train a single tagger using unannotated data and EM, which necessarily leads to a lower performance." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-70", "text": "The most salient difference, however, is that they are motivated by the goal of improving automatic speech recognition, and have an acoustic signal parallel to the undiacritized text." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-71", "text": "All their experiments use acoustic models." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-72", "text": "They show that WER for diacritization decreases by nearly 50% (from 50%) when BAMA is added to the acoustic information, but the tagger does not help." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-73", "text": "It would be interesting to investigate ways of incorporating acoustic model information in our approach." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-74", "text": "Ananthakrishnan et al. (2005) also work on diacritization with the goal of improving ASR." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-75", "text": "They use a word-based language model (using both diacritized and undiacritized words in the context) but back off to a character-based model for unseen words." 
}, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-76", "text": "They consult BAMA to narrow possible diacritizations for unseen words, but BAMA does not provide much improvement used in this manner." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-77", "text": "Zitouni et al. (2006) use a maximum entropy classifier to assign a set of diacritics to the letters of each word." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-78", "text": "They use the output of a tokenizer (segmenter) and a part-of-speech tagger (which presumably tags the output of the tokenizer)." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-79", "text": "They then use segment n-grams, segment position of the character being diacritized, the POS of the current segment, along with lexical features, including letter and word n-grams." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-80", "text": "Thus, while many of the same elements are used in their and our work (word n-grams, features related to morphological analysis), the basic approach is quite different: while we have one procedure that chooses a correct analysis (including to- Figure 1: Diacritization Results (all followed by single-choice-diac model); our best results are shown in boldface; Only-DLM-1 is the baseline; \"Zitouni\" is (Zitouni et al., 2006) kenization, morphological tag, and diacritization), they have a pipeline of processors." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-81", "text": "Furthermore, Zitouni et al. (2006) do not use a morphological lexicon." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-82", "text": "To our knowledge, their system is the best performing currently, and we have set up our experiments to allow us to compare our results directly to their results." 
}, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-83", "text": "----------------------------------" }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-84", "text": "**RESULTS**" }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-85", "text": "There are several ways of defining metrics for diacritization." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-86", "text": "In order to assure maximal comparability with the work of Zitouni et al. (2006) , we adopt their metric." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-87", "text": "5 We count all words, including numbers and punctuation." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-88", "text": "Each letter (or digit) in a word is a potential host for a set of diacritics; we count all diacritics on a single letter as a single binary choice." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-89", "text": "So, for example, if we correctly predict a shadda but get the vowel wrong, it counts as a wrong choice." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-90", "text": "We approximate non-variant diacritization by removing all diacritics from the final letter (Ignore Last), while counting that letter in the evaluation." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-91", "text": "We give diacritic error rate (DER) which tells us for how many letters we incorrectly restored all diacritics, and word error rate (WER), which tells us how many words had at least one DER." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-92", "text": "The results are shown in Figure 1 . Going top to bottom, we first see the baseline, Only-DLM-1, which is simply a diacritization unigram model with random choice for unseen words." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-93", "text": "We then show the results using the morphological tagger along with a language model." 
}, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-94", "text": "We first show results for the diacritization model, with 1-, 2-, and 3-grams." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-95", "text": "As we can see, the bigram language model helps slightly." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-96", "text": "The next three lines are the three lexeme n-gram models." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-97", "text": "Here we see that the unigram model performs worse than the unigram diacritization model, while the bigram and trigram models perform better (the trigram lexeme model is our best result)." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-98", "text": "We interpret this as meaning that the lexeme model is useful only when context is taken into account, because it is actually performing a rudimentary form of WSD." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-99", "text": "We tease apart the contribution of the tagger and of the language model with two further experiments, in the next two lines: using just the lexeme language model (trigrams), and running just the tagger, followed by random choice." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-100", "text": "We can see that the tagger alone does as well as the tagger with the unigram lexeme model, while the lexeme model on its own does much worse." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-101", "text": "However, as expected, the lexeme model on its own for the Ignore Last measure is much closer to the performance of the tagger on its own." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-102", "text": "We conclude from this that the quite simple lexeme model is in fact contributing to the correct choice of the lexemic diacritics." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-103", "text": "Finally, we give the results of Zitouni et al. 
(2006) on the last line, which we understand to be the best published results currently." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-104", "text": "We see that we improve on their results in all categories." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-105", "text": "We can see the effect of our different approaches to diacritization in the numbers: while for WER we reduce the Zitouni et al. error by 17.2%, the DER error reduction is only 10.9%." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-106", "text": "This is because we are choosing among complete diacritization options for white space-tokenized words, while Zitouni et al. (2006) make choices for each diacritic." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-107", "text": "This means that when we make a mistake, it may well affect several diacritics at once, so that the diacritic errors are concentrated in fewer words." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-108", "text": "This effect is even stronger when we disregard the final letter (30.4% reduction in WER versus 12.0% reduction in DER), suggesting that singleton errors in words tend to be in the final position (case, mood), as it is often hard for the tagger to determine these features." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-109", "text": "----------------------------------" }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-110", "text": "**CONCLUSION**" }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-111", "text": "We have shown that a diacritizer that uses a lexical resource can outperform a highly optimized ad-hoc diacritization system that draws on a large number of features." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-112", "text": "We speculate that additional work on WSD could further improve our results." 
}, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-113", "text": "We also note the issue of unknown words, which will affect our system much more than that of (Zitouni et al., 2006) ." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-114", "text": "It is possible to construct a combined system which uses a lexicon, but backs off to a Zitouni-style system for unknown words." }, { "sent_id": "c126f8b9a5fcb2687494a8c0b1e859-C001-115", "text": "However, a large portion of the unknown words are in fact foreign words and names, and it is not clear whether the models learned handle such words well." } ], "y": { "@USE@": { "gold_contexts": [ [ "c126f8b9a5fcb2687494a8c0b1e859-C001-49" ], [ "c126f8b9a5fcb2687494a8c0b1e859-C001-67" ], [ "c126f8b9a5fcb2687494a8c0b1e859-C001-82" ], [ "c126f8b9a5fcb2687494a8c0b1e859-C001-86" ], [ "c126f8b9a5fcb2687494a8c0b1e859-C001-103" ] ], "cite_sentences": [ "c126f8b9a5fcb2687494a8c0b1e859-C001-49", "c126f8b9a5fcb2687494a8c0b1e859-C001-67", "c126f8b9a5fcb2687494a8c0b1e859-C001-82", "c126f8b9a5fcb2687494a8c0b1e859-C001-86", "c126f8b9a5fcb2687494a8c0b1e859-C001-103" ] }, "@BACK@": { "gold_contexts": [ [ "c126f8b9a5fcb2687494a8c0b1e859-C001-77", "c126f8b9a5fcb2687494a8c0b1e859-C001-78", "c126f8b9a5fcb2687494a8c0b1e859-C001-79" ], [ "c126f8b9a5fcb2687494a8c0b1e859-C001-81" ], [ "c126f8b9a5fcb2687494a8c0b1e859-C001-82" ] ], "cite_sentences": [ "c126f8b9a5fcb2687494a8c0b1e859-C001-77", "c126f8b9a5fcb2687494a8c0b1e859-C001-78", "c126f8b9a5fcb2687494a8c0b1e859-C001-79", "c126f8b9a5fcb2687494a8c0b1e859-C001-81", "c126f8b9a5fcb2687494a8c0b1e859-C001-82" ] }, "@SIM@": { "gold_contexts": [ [ "c126f8b9a5fcb2687494a8c0b1e859-C001-77", "c126f8b9a5fcb2687494a8c0b1e859-C001-80" ] ], "cite_sentences": [ "c126f8b9a5fcb2687494a8c0b1e859-C001-77", "c126f8b9a5fcb2687494a8c0b1e859-C001-80" ] }, "@DIF@": { "gold_contexts": [ [ "c126f8b9a5fcb2687494a8c0b1e859-C001-77", "c126f8b9a5fcb2687494a8c0b1e859-C001-80" ], [ "c126f8b9a5fcb2687494a8c0b1e859-C001-106" ], [ 
"c126f8b9a5fcb2687494a8c0b1e859-C001-105" ], [ "c126f8b9a5fcb2687494a8c0b1e859-C001-113" ] ], "cite_sentences": [ "c126f8b9a5fcb2687494a8c0b1e859-C001-77", "c126f8b9a5fcb2687494a8c0b1e859-C001-80", "c126f8b9a5fcb2687494a8c0b1e859-C001-106", "c126f8b9a5fcb2687494a8c0b1e859-C001-105", "c126f8b9a5fcb2687494a8c0b1e859-C001-113" ] }, "@UNSURE@": { "gold_contexts": [ [ "c126f8b9a5fcb2687494a8c0b1e859-C001-114" ] ], "cite_sentences": [ "c126f8b9a5fcb2687494a8c0b1e859-C001-114" ] } } }, "ABC_fd5a6307b398f37d8729c21cfce6c1_7": { "x": [ { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-181", "text": "Running environment." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-2", "text": "We investigate the design challenges of constructing effective and efficient neural sequence labeling systems, by reproducing twelve neural sequence labeling models, which include most of the state-of-the-art structures, and conduct a systematic model comparison on three benchmarks (i.e. NER, Chunking, and POS tagging)." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-3", "text": "Misconceptions and inconsistent conclusions in existing literature are examined and clarified under statistical experiments." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-4", "text": "In the comparison and analysis process, we reach several practical conclusions which can be useful to practitioners." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-5", "text": "----------------------------------" }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-7", "text": "Sequence labeling models have been used for fundamental NLP tasks such as POS tagging, chunking and named entity recognition (NER)." 
}, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-8", "text": "Traditional work uses statistical approaches such as Hidden Markov Models (HMM) and Conditional Random Fields (CRF) (Ratinov and Roth, 2009; Passos et al., 2014; Luo et al., 2015) with handcrafted features and task-specific resources." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-9", "text": "With advances in deep learning, neural models have given state-of-the-art results on many sequence labeling tasks (Ling et al., 2015; Lample et al., 2016; Ma and Hovy, 2016) ." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-10", "text": "Words and characters are encoded in distributed representations (Mikolov et al., 2013) and sentence-level features are learned automatically during end-to-end training." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-11", "text": "Many existing state-of-the-art neural sequence labeling models utilize word-level Long Short-Term Memory (LSTM) structures to represent global sequence information and a CRF layer to capture dependencies between neighboring labels Lample et al., 2016; Ma and Hovy, 2016; Peters et al., 2017) ." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-12", "text": "As an alternative, Convolution Neural Network (CNN) (LeCun et al., 1989) has also been used for its ability of parallel computing, leading to an efficient training and decoding process." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-13", "text": "Despite them being dominant in the research literature, reproducing published results for neural models can be challenging, even if the codes are available open source." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-60", "text": "Hammerton (2003) was the first to exploit LSTM for sequence labeling." 
}, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-14", "text": "For example, Reimers and Gurevych (2017b) conduct a large number of experiments using the code of Ma and Hovy (2016) , but cannot obtain comparable results as reported in the paper." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-15", "text": "Liu et al. (2018) report lower average F-scores on NER when reproducing the structure of Lample et al. (2016) , and on POS tagging when reproducing Ma and Hovy (2016) ." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-16", "text": "Most literature compares results with others by citing the scores directly Lample et al., 2016) without re-implementing them under the same setting, resulting in less persuasiveness on the advantage of their models." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-17", "text": "In addition, conclusions from different reports can be contradictory." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-18", "text": "For example, most work observes that stochastic gradient descent (SGD) gives best performance on NER task (Chiu and Nichols, 2016; Lample et al., 2016; Ma and Hovy, 2016) , while Reimers and Gurevych (2017b) report that SGD is the worst optimizer on the same datasets." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-19", "text": "The comparison between different deep neural models is challenging due to sensitivity on experimental settings." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-20", "text": "We list six inconsistent configurations in literature, which lead to difficulties for fair comparison." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-21", "text": "\u2022 Dataset." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-22", "text": "Most work reports sequence labeling results on both CoNLL 2003 English NER (Tjong Kim Sang and De Meulder, 2003) and PTB POS (Marcus et al., 1993) datasets (Collobert et al., 2011; Ma and Hovy, 2016) ." 
}, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-23", "text": "Ling et al. (2015) give results only on POS dataset, while some papers (Chiu and Nichols, 2016; Lample et al., 2016; Strubell et al., 2017) report results on the NER dataset only." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-24", "text": "dos Santos et al. (2015) conducts experiments on NER for Portuguese and Spanish." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-25", "text": "Most work uses the development set to select hyperparameters (Lample et al., 2016; Ma and Hovy, 2016) , while others add development set into training set (Chiu and Nichols, 2016; Peters et al., 2017) ." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-26", "text": "Reimers and Gurevych (2017b) use a smaller dataset (13862 vs 14987 sentences)." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-27", "text": "Different from Ma and Hovy (2016) and Liu et al. (2018) , choose a different data split on the POS dataset." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-28", "text": "Liu et al. (2018) and Hashimoto et al. (2017) use different development sets for chunking." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-29", "text": "\u2022 Preprocessing." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-30", "text": "A typical data preprocessing step is to normize digit characters (Chiu and Nichols, 2016; Lample et al., 2016; Yang et al., 2016; Strubell et al., 2017) ." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-31", "text": "Reimers and Gurevych (2017b) use fine-grained representations for less frequent words." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-32", "text": "Ma and Hovy (2016) do not use preprocessing." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-33", "text": "\u2022 Features." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-34", "text": "Strubell et al. 
(2017) and Chiu and Nichols (2016) apply word spelling features and further integrate context features." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-35", "text": "Collobert et al. (2011) and use neural features to represent external gazetteer information." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-36", "text": "Besides, Lample et al. (2016) and Ma and Hovy (2016) use end-to-end structures without handcrafted features." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-37", "text": "\u2022 Hyperparameters, including learning rate, dropout rate (Srivastava et al., 2014), number of layers, hidden size, etc., can strongly affect the model performance." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-38", "text": "Chiu and Nichols (2016) search for the hyperparameters for each task and show that the system performance is sensitive to the choice of hyperparameters." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-39", "text": "However, existing models use different parameter settings, which hinders fair comparison." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-40", "text": "\u2022 Evaluation." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-41", "text": "Some literature reports results using the mean and standard deviation under different random seeds (Chiu and Nichols, 2016; Peters et al., 2017; Liu et al., 2018)." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-42", "text": "Others report the best result among different trials (Ma and Hovy, 2016), which cannot be compared directly." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-43", "text": "\u2022 Hardware environment can also affect system accuracy." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-44", "text": "Liu et al. (2018) observe that the system gives better accuracy on the NER task when trained on a GPU as compared to a CPU." 
}, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-45", "text": "Besides, the running speeds are highly affected by the hardware environment." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-46", "text": "To address the above concerns, we systematically analyze neural sequence labeling models on three benchmarks: CoNLL 2003 NER (Tjong Kim Sang and De Meulder, 2003) , CoNLL 2000 chunking (Tjong Kim Sang and Buchholz, 2000) and PTB POS tagging (Marcus et al., 1993) ." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-47", "text": "Table 1 shows a summary of the models we investigate, which can be categorized under three settings: (i) character sequence representations ; (ii) word sequence representations; (iii) inference layer." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-48", "text": "Although various combinations of these three settings have been proposed in the literature, others have not been examined." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-49", "text": "We compare all models in Table 1 , which includes most state-of-the-art methods." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-50", "text": "To make fair comparisons, we build a unified framework 1 to reproduce the twelve neural sequence labeling architectures in Table 1 ." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-51", "text": "Experiments show that our framework gives comparable or better results on reproducing existing works, showing the practicability and reliability of our analysis for practitioners." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-52", "text": "The detailed comparison and analysis show that (i) Character information provides a significant improvement on accuracy; (ii) Word-based LSTM models outperforms CNN models in most cases; (iii) CRF can improve model accuracy on NER and chunking but does not on POS tagging." 
}, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-53", "text": "Our framework is based on PyTorch with batched implementation, which is highly efficient, facilitating quick configurations for new tasks." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-54", "text": "----------------------------------" }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-55", "text": "**RELATED WORK**" }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-56", "text": "Collobert et al. (2011) proposed a seminal neural architecture for sequence labeling." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-57", "text": "It captures word sequence information with a one-layer CNN based on pretrained word embeddings and handcrafted neural features, followed with a CRF output layer." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-58", "text": "dos Santos et al. (2015) extended this model by integrating character-level CNN features." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-59", "text": "Strubell et al. (2017) built a deeper dilated CNN architecture to capture larger local features." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-61", "text": "built a BiLSTM-CRF structure, which has been extended by adding character-level LSTM (Lample et al., 2016; Liu et al., 2018) , GRU (Yang et al., 2016) , and CNN (Chiu and Nichols, 2016; Ma and Hovy, 2016) features." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-62", "text": "Yang et al. (2017a) proposed a neural reranking model to improve NER models." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-63", "text": "These models achieve state-of-the-art results in the literature." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-64", "text": "Reimers and Gurevych (2017b) compared several word-based LSTM models for several sequence labeling tasks, reporting the score distributions over multiple runs rather than single value." 
}, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-65", "text": "They investigated the influence of various hyperparameters and configurations." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-66", "text": "Our work is similar in comparing different neural architectures under unified settings, but differs in four main aspects: 1) Their experiments are based on a BiLSTM with handcrafted word features, while our experiments are based on end-to-end neural models without human knowledge." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-67", "text": "2) Their system gives relatively low performances on standard benchmarks 2 , while ours can give comparable or better results with state-of-the-art models, rendering our observations more informative for practitioners." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-68", "text": "3) Our findings are more consistent with most previous work on configurations such as usefulness of character information (Lample et al., 2016; Ma and Hovy, 2016) , optimizer (Chiu and Nichols, 2016; Lample et al., 2016; Ma and Hovy, 2016) and tag scheme (Ratinov and Roth, 2009; Dai et al., 2015) ." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-69", "text": "In contrast, many results of Reimers and Gurevych (2017b) contradict existing reports." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-70", "text": "4) We conduct a wider range of comparison for word sequence representations, including all combinations of character CNN/LSTM and word CNN/LSTM structures, while Reimers and Gurevych (2017b) studied the word LSTM models only." 
}, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-71", "text": "----------------------------------" }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-72", "text": "**NEURAL SEQUENCE LABELING MODELS**" }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-73", "text": "Our neural sequence labeling framework contains three layers, i.e., a character sequence representation layer, a word sequence representation layer and an inference layer, as shown in Figure 1 ." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-74", "text": "Character information has been proven to be critical for sequence labeling tasks (Chiu and Nichols, 2016; Lample et al., 2016; Ma and Hovy, 2016) , with LSTM and CNN being used to model character sequence information (\"Char Rep.\")." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-75", "text": "Similarly, on the word level, LSTM or CNN structures can be leveraged to capture long-term information or local features (\"Word Rep.\"), respectively." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-76", "text": "Subsequently, the inference layer assigns labels to each word using the hidden states of word sequence representations." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-77", "text": "----------------------------------" }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-78", "text": "**CHARACTER SEQUENCE REPRESENTATIONS**" }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-79", "text": "Character features such as prefix, suffix and capitalization can be represented with embeddings through a feature-based lookup table (Collobert et al., 2011; Strubell et al., 2017) , or neural networks without human-defined features (Lample et al., 2016; Ma and Hovy, 2016) ." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-80", "text": "In this work, we focus on neural character sequence representations without hand-engineered features." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-81", "text": "Character CNN." 
}, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-82", "text": "Using a CNN structure to encode character sequences was firstly proposed by Santos and Zadrozny (2014), and followed by many subsequent investigations (dos Santos et al., 2015; Chiu and Nichols, 2016; Ma and Hovy, 2016) ." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-83", "text": "In our experiments, we take the same structure as Ma and Hovy (2016) , using one layer CNN structure with max-pooling to capture character-level representations." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-84", "text": "Figure 2 (a) shows the CNN structure on representing word \"Mexico\"." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-85", "text": "Character LSTM." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-86", "text": "Shown as Figure 2 (b), in order to model the global character sequence information of a word \"Mexico\", we utilize a bidirectional LSTM on the character sequence of each word and concatenate the left-to-right final state F LST M and the right-to-left final state B LST M as character sequence representations." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-87", "text": "Liu et al. (2018) applied one bidirectional LSTM for the character sequence over a sentence rather than each word individually." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-88", "text": "We examined both structures and found that they give comparable accuracies on sequence labeling tasks." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-89", "text": "We choose Lample et al. (2016) 's structure as its character LSTMs can be calculated in parallel, making the system more efficient." 
}, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-90", "text": "----------------------------------" }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-91", "text": "**WORD SEQUENCE REPRESENTATIONS**" }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-92", "text": "Similar to character sequences in words, we can model word sequence information through LSTM or CNN structures." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-93", "text": "LSTM has been widely used in sequence labeling (Lample et al., 2016; Ma and Hovy, 2016; Chiu and Nichols, 2016; Liu et al., 2018) ." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-94", "text": "CNN can be much faster than LSTM due to the fact that convolution calculation can be parallel on the input sequence (Collobert et al., 2011; dos Santos et al., 2015; Strubell et al., 2017) ." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-95", "text": "Word CNN." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-96", "text": "Figure 3(a) shows the multi-layer CNN on word sequence, where words are represented by embeddings." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-97", "text": "If a character sequence representation layer is used, then word embeddings and character sequence representations are concatenated for word representations." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-98", "text": "For each CNN layer, a window of size 3 slides along the sequence, extracting local features on the word inputs and a ReLU function (Glorot et al., 2011 ) is followed." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-99", "text": "We follow Strubell et al. (2017) by using 4 CNN layers." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-100", "text": "Batch normalization (Ioffe and Szegedy, 2015) and dropout (Srivastava et al., 2014) are applied following each CNN layer." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-101", "text": "Word LSTM." 
}, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-102", "text": "Shown in Figure 3 (b), word representations are fed into a forward LSTM and backward LSTM, respectively." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-103", "text": "The forward LSTM captures the word sequence information from left to right, while the backward LSTM extracts information in a reversed direction." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-104", "text": "The hidden states of the forward and backward LSTMs are concatenated at each word to give global information of the whole sequence." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-105", "text": "----------------------------------" }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-106", "text": "**INFERENCE LAYER**" }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-107", "text": "The inference layer takes the extracted word sequence representations as features and assigns labels to the word sequence." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-108", "text": "Independent local decoding with a linear layer mapping word sequence representations to label vocabulary and performing softmax can be quite effective (Ling et al., 2015) , while for tasks that with strong output label dependency, such as NER, CRF is a more appropriate choice." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-109", "text": "In this work, we examine both softmax and CRF as inference layer on three sequence labeling tasks." 
}, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-110", "text": "----------------------------------" }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-111", "text": "**EXPERIMENTS**" }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-112", "text": "We investigate the main influencing factors to system accuracy, including the character sequence representations, word sequence representations, inference algorithm, pretrained embeddings, tag scheme, running environment and optimizer; analyzing system performances from the perspective of decoding speed and accuracies on in-vocabulary (IV) and out-of-vocabulary (OOV) entities/chunks/words." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-113", "text": "----------------------------------" }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-114", "text": "**SETTINGS**" }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-115", "text": "Data." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-116", "text": "The NER dataset has been standardly split in Tjong Kim Sang and De Meulder (2003 (Toutanova et al., 2003; Santos and Zadrozny, 2014; Ma and Hovy, 2016; Liu et al., 2018) , we adopt the standard splits by using sections 0-18 as training set, sections 19-21 as development set and sections 22-24 as test set." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-117", "text": "No preprocessing is performed on either dataset except for normalizing digits." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-118", "text": "The dataset statistics are listed in Table 2 ." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-119", "text": "Hyperparameters." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-120", "text": "Table 3 shows the hyperparameters used in our experiments, which mostly follow Ma and Hovy (2016) , including the learning rate \u03b7 = 0.015 for word LSTM models." 
}, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-121", "text": "For word CNN based models, a large \u03b7 leads to convergence problem." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-122", "text": "We take \u03b7 = 0.005 with more epochs (200) instead." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-123", "text": "GloVe 100-dimension (Pennington et al., 2014 ) is used to initialize word embeddings and character embeddings are randomly initialized." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-124", "text": "We use mini-batch stochastic gradient descent (SGD) with a decayed learning rate to update parameters." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-125", "text": "For NER and chunking, we the BIOES tag scheme." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-126", "text": "Evaluation." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-127", "text": "Standard precision (P), recall (R) and F1-score (F) are used as the evaluation metrics for NER and chunking; token accuracy is used to evaluate the performance of POS tagger." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-128", "text": "Development datasets are used to select the optimal model among all epochs, and we report scores of the selected model on the test dataset." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-129", "text": "To reduce the volatility of the system, we conduct each experiment 5 times under different random seeds, and report the max, mean, and standard deviation for each model." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-130", "text": "----------------------------------" }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-131", "text": "**RESULTS**" }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-132", "text": "Tables 4, 5 and 6 show the results of the twelve models on NER, chunking and POS datasets, respectively." 
}, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-133", "text": "Existing work has also been listed in the tables for comparison." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-134", "text": "To simplify the description, we use \"CLSTM\" and \"CCNN\" to represent character LSTM and character CNN encoder, respectively." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-135", "text": "Similarly, \"WLSTM\" and \"WCNN\" represent word LSTM and word CNN structure, respectively." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-136", "text": "As shown in Table 4 , most NER work focuses on WLSTM+CRF structures with different character sequence representations." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-137", "text": "We re-implement the structure of several reports (Chiu and Nichols, 2016; Ma and Hovy, 2016; Peters et al., 2017) , which take the CCNN+WLSTM+CRF architecture." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-138", "text": "Our reproduced models give slightly better performances." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-139", "text": "The results of Lample et al. (2016) can be reproduced by our CLSTM+WLSTM+CRF." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-140", "text": "In most cases, our \"Nochar\" based models underperform their corresponding prototypes Strubell et al., 2017) , which utilize the hand-crafted features." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-141", "text": "Table 5 shows the results of the chunking task." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-142", "text": "Peters et al. (2017) give the best reported single model results (95.00\u00b10.08), and our CLSTM+WLSTM+CRF gives a comparable performance (94.93\u00b10.05)." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-143", "text": "We re-implement Zhai et al. (2017) 's model in our Nochar+WLSTM but cannot reproduce their results, this may because that they use grid search for hyperparameter selection." 
}, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-144", "text": "Our Nochar+WCNN+CRF can give comparable results with Collobert et al. (2011) , even ours does not include character information." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-145", "text": "The results of the POS tagging task is shown in Table 6 ." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-146", "text": "The results of Lample et al. (2016) , Ma and Hovy (2016) and Yang et al. (2017b) can be reproduced by our CLSTM+WLSTM+CRF and CCNN+WLSTM+CRF models." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-147", "text": "Our WLSTM based models give better results than WLSTM+CRF based models, this is consistent with the fact that Ling et al. (2015) take CLSTM+WLSTM without CRF layer but achieve the best POS accuracy." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-148", "text": "Santos and Zadrozny (2014) build a pure CNN structure on both character and word level, which can be reproduced by our CCNN+WCNN models." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-149", "text": "Based on above observations, most results in the literature are reproducible." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-150", "text": "Our implementations can achieve the comparable or better results with state-of-the-art models." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-151", "text": "We do not fine-tune any hyperparameter to fit the specific task." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-152", "text": "Results on Table 4 , 5 and 6 are all under the same hyperparameters, which demonstrates the generalization ability of our framework." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-153", "text": "----------------------------------" }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-154", "text": "**NETWORK SETTINGS**" }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-155", "text": "Character LSTM vs Character CNN." 
}, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-180", "text": "Reimers and Gurevych (2017b) report that the difference between the schemes is insignificant." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-156", "text": "Unlike the observations of Reimers and Gurevych (2017b) , in our experiments, character information can significantly (p < 0.01) 3 improve sequence labeling models (by comparing the row of Nochar with CLSTM or CCNN on Table 4 , 5 and 6), while the difference between CLSTM and CCNN is not significant." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-157", "text": "In most cases, CLSTM and CCNN can give comparable results under different frameworks and different tasks." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-158", "text": "CCNN gives the best NER result under the WL-STM+CRF framework, while CLSTM gets better NER results in all other configurations." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-159", "text": "For chunking and POS tagging, CLSTM consistently outperforms CCNN under all settings, while the difference is statistically insignificant (p > 0.2)." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-160", "text": "We conclude that the difference between CLSTM and CCNN is small, which is consistent with the observation of Reimers and Gurevych (2017b) ." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-161", "text": "Word LSTM vs Word CNN." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-162", "text": "By comparing the performances of WLSTM+CRF, WLSTM with WCNN+CRF, WCNN on the three benchmarks, we conclude that word-based LSTM models are significantly (p < 0.01) better than word-based CNN models in most cases." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-163", "text": "It demonstrates that global word context information is necessary for sequence labeling." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-164", "text": "Softmax vs CRF." 
}, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-165", "text": "Models with CRF inference layer can consistently outperform the models with softmax layer under all configurations on NER and chunking tasks, proving the effectiveness of label dependency information." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-166", "text": "While for POS tagging, the local softmax based models give slightly better accuracies while the difference is insignificant (p > 0.2)." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-167", "text": "----------------------------------" }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-168", "text": "**EXTERNAL FACTORS**" }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-169", "text": "In addition to model structures, external factors such as pretrained embeddings, tag scheme, and optimizer can significantly influence system performance." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-170", "text": "We investigate a set of external factors on the NER dataset with the two best models: CLSTM+WLSTM+CRF and CCNN+WLSTM+CRF." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-171", "text": "Pretrained embedding." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-172", "text": "Figure 4 (a) shows the F1-scores of the two best models on the NER test set with two different pretrained embeddings, as well as the random initialization." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-173", "text": "Compared with the random initialization, models using pretrained embeddings give significant improvements (p < 0.01)." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-174", "text": "The GloVe 100-dimension embeddings give higher F1-scores than SENNA (Collobert et al., 2011) on both models, which is consistent with the observation of Ma and Hovy (2016) ." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-175", "text": "Tag scheme." 
}, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-176", "text": "We examine two different tag schemes: BIO and BIOES (Ratinov and Roth, 2009) ." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-177", "text": "The results are shown in Figure 4 (b)." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-178", "text": "In our experiments, models using BIOES are significantly (p < 0.05) better than BIO." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-179", "text": "Our observation is consistent with most literature (Ratinov and Roth, 2009; Dai et al., 2015) ." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-182", "text": "Liu et al. (2018) observe that neural sequence labeling models can give better results on GPU rather than CPU." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-183", "text": "We conduct repeated experiments on both GPU and CPU environments." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-184", "text": "The results are shown in Figure 4 (b)." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-185", "text": "Models run on CPU give a lower mean F1-score than models run on GPU, while the difference is insignificant (p > 0.2)." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-186", "text": "Optimizer." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-187", "text": "We compare different optimizers including SGD, Adagrad (Duchi et al., 2011 ), Adadelta (Zeiler, 2012 , RMSProp (Tieleman and Hinton, 2012) and Adam (Kingma and Ba, 2014) ." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-188", "text": "The results are shown in Figure 5 5 ." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-189", "text": "In contrast to Reimers and Gurevych (2017b) , who reported that SGD is the worst optimizer, our results show that SGD outperforms all other optimizers significantly (p < 0.01), with a slower convergence process during training." 
}, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-190", "text": "Our observation is consistent with most literature (Chiu and Nichols, 2016; Lample et al., 2016; Ma and Hovy, 2016) ." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-191", "text": "----------------------------------" }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-192", "text": "**ANALYSIS**" }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-193", "text": "Decoding speed." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-194", "text": "We test the decoding speeds of the twelve models on the NER dataset using a Nvidia GTX 1080 GPU." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-195", "text": "Figure 6 shows the decoding times on 10000 NER sentences." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-196", "text": "The CRF inference layer severely limits the decoding speed due to the left-to-right inference process, which disables the parallel decoding." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-197", "text": "Character LSTM significantly slows down the system." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-198", "text": "Compared with models without character information, adding character CNN representations does not affect the decoding speed too much but can give significant accuracy improvements (shown in Section 4.3)." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-199", "text": "Due to the support of parallel computing, word-based CNN models are consistently faster than word-based LSTM models, with close accuracies, leading to large utilization potential in practice." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-200", "text": "Table 7 : Results for OOV analysis." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-201", "text": "OOV error." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-202", "text": "We conduct error analysis on in-vocabulary and out-of-vocabulary words with the CRF based models 6 ." 
}, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-203", "text": "Following Ma and Hovy (2016) , words in the test set are divided into four subsets: in-vocabulary words, out-of-training-vocabulary words (OOTV), out-of-embedding-vocabulary words (OOEV) and out-of-both-vocabulary words (OOBV)." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-204", "text": "For NER and chunking, we consider entities or chunks rather than words." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-205", "text": "The OOV entities and chunks are categorized following Ma and Hovy (2016) ." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-206", "text": "Table 7 shows the performances of different OOV splits on three benchmarks." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-207", "text": "The top three rows list the performances of word-based LSTM CRF models, followed by the word-based CNN CRF models." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-208", "text": "The results of OOEV in NER keep 100% because of there exist only 8 OOEV entities and all are recognized correctly." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-209", "text": "It is obvious that character LSTM or CNN representations improve OOTV and OOBV the most on both WLSTM+CRF and WCNN+CRF models across all three datasets, proving that the main contribution of neural character sequence representations is to disambiguate the OOV words." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-210", "text": "Models with character LSTM representations give the best IV scores across all configurations, which may be because character LSTM can be well trained on IV data, bringing the useful global character sequence information." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-211", "text": "On the OOVs, character LSTM and CNN gives comparable results." 
}, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-212", "text": "----------------------------------" }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-213", "text": "**CONCLUSION**" }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-214", "text": "We built a unified neural sequence labeling framework to reproduce and compare recent state-of-theart models with different configurations." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-215", "text": "We explored three neural model design decisions: character sequence representations, word sequence representations, and inference layer." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-216", "text": "Experiments show that character information helps to improve model performances, especially on disambiguating OOV words." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-217", "text": "Character-level LSTM and CNN structures give comparable improvements, with the latter being more efficient." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-218", "text": "In most cases, models with word-level LSTM encoders outperform those with CNN, at the expense of longer decoding time." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-219", "text": "We observed that the CRF inference algorithm is effective on NER and chunking tasks, but does not have the advantage on POS tagging." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-220", "text": "With controlled experiments on the NER dataset, we showed that BIOES tags are better than BIO." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-221", "text": "Besides, pretrained GloVe 100d embedding and SGD optimizer give significantly better performances compared to their competitors." }, { "sent_id": "fd5a6307b398f37d8729c21cfce6c1-C001-222", "text": "6 We choose the models that give the median performance on the test set for conducting result analysis." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "fd5a6307b398f37d8729c21cfce6c1-C001-18" ], [ "fd5a6307b398f37d8729c21cfce6c1-C001-23" ], [ "fd5a6307b398f37d8729c21cfce6c1-C001-25" ], [ "fd5a6307b398f37d8729c21cfce6c1-C001-30" ], [ "fd5a6307b398f37d8729c21cfce6c1-C001-34" ], [ "fd5a6307b398f37d8729c21cfce6c1-C001-38" ], [ "fd5a6307b398f37d8729c21cfce6c1-C001-41" ], [ "fd5a6307b398f37d8729c21cfce6c1-C001-61", "fd5a6307b398f37d8729c21cfce6c1-C001-63" ], [ "fd5a6307b398f37d8729c21cfce6c1-C001-74" ], [ "fd5a6307b398f37d8729c21cfce6c1-C001-82" ], [ "fd5a6307b398f37d8729c21cfce6c1-C001-93" ] ], "cite_sentences": [ "fd5a6307b398f37d8729c21cfce6c1-C001-18", "fd5a6307b398f37d8729c21cfce6c1-C001-23", "fd5a6307b398f37d8729c21cfce6c1-C001-25", "fd5a6307b398f37d8729c21cfce6c1-C001-30", "fd5a6307b398f37d8729c21cfce6c1-C001-34", "fd5a6307b398f37d8729c21cfce6c1-C001-38", "fd5a6307b398f37d8729c21cfce6c1-C001-41", "fd5a6307b398f37d8729c21cfce6c1-C001-61", "fd5a6307b398f37d8729c21cfce6c1-C001-63", "fd5a6307b398f37d8729c21cfce6c1-C001-74", "fd5a6307b398f37d8729c21cfce6c1-C001-82", "fd5a6307b398f37d8729c21cfce6c1-C001-93" ] }, "@SIM@": { "gold_contexts": [ [ "fd5a6307b398f37d8729c21cfce6c1-C001-68" ], [ "fd5a6307b398f37d8729c21cfce6c1-C001-190" ] ], "cite_sentences": [ "fd5a6307b398f37d8729c21cfce6c1-C001-68", "fd5a6307b398f37d8729c21cfce6c1-C001-190" ] }, "@USE@": { "gold_contexts": [ [ "fd5a6307b398f37d8729c21cfce6c1-C001-137" ] ], "cite_sentences": [ "fd5a6307b398f37d8729c21cfce6c1-C001-137" ] } } }, "ABC_5428f8c196308c90618abfdbdf856a_7": { "x": [ { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-2", "text": "We investigate applying repurposed generic QA data and models to a recently proposed relation extraction task." 
}, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-3", "text": "We find that training on SQuAD produces better zero-shot performance and more robust generalisation compared to the task specific training set." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-4", "text": "We also show that standard QA architectures (e.g. FastQA or BiDAF) can be applied to the slot filling queries without the need for model modification." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-5", "text": "----------------------------------" }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-7", "text": "Knowledge Base Population (KBP, e.g.: Riedel et al., 2013; Sterckx et al., 2016) attempts to identify facts within raw text and convert them into triples consisting of a subject, object and the relation between them." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-8", "text": "One common form of this task is slot filling (Surdeanu and Heng, 2014) , in which a knowledge base (KB) query, such as place of birth(Obama, ?) is applied to a set of documents and a set of slot fillers is returned." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-9", "text": "By converting such KB queries to natural language questions, Levy et al. (2017) showed that a question answering (QA) system could be effectively applied to this task." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-10", "text": "However, their approach relied on a modified QA model architecture and a dedicated slot-filling training corpus." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-11", "text": "Here, we investigate the utility of standard QA data and models for this task." 
}, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-12", "text": "Our results show that this approach is effective in the zero-shot and low-resource cases, and is more robust on a set of test instances that challenge the models' ability to identify relations between subject and object." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-13", "text": "Figure 1 gives an overview of using QA on the slot-filling task." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-14", "text": "Starting at the top right, a KB query is translated into a natural language question, which can then be fed into a QA model that has been trained on an appropriate resource." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-15", "text": "When applied to a set of texts, this model needs to predict the correct answer within each text, including the possibility that a text contains no answer." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-16", "text": "Within this framework, we consider different models and training and test datasets, but we keep the translation of KB queries into natural language questions fixed, based on the crowd-sourced templates used by Levy et al. (2017) ." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-17", "text": "----------------------------------" }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-18", "text": "**PERFORMANCE ON THE ORIGINAL TASK**" }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-19", "text": "In our first experiment, we examine the utility of a standard QA dataset as training data for the slotfilling model of Levy et al. (2017) ." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-20", "text": "Their zeroshot model generalised from seen relations to unseen relations by translating all relations into natural language question templates, such as Where was XXX born? for the relation place of birth." 
}, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-21", "text": "Identifying an instance of such a relation in text is then equivalent to finding an answer to the relevant question template, instantiated with the appropriate entity." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-22", "text": "However, such a model also needs to be able to identify when no answer is found in the text, and to achieve this they trained a slightly modified version of BiDAF (Seo et al., 2016) both positive examples, containing answers, and negative examples, without answers." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-23", "text": "These examples were derived from a pre-existing relation extraction resource, as their intention was to show the utility of the QA model." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-24", "text": "In this section, we evaluate whether the same model trained on QA data, specifically SQuAD (Rajpurkar et al., 2016) , can be applied to the relation extraction task." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-25", "text": "We first investigate the zeroshot case, where no examples of the relations are available, and then evaluate how performance improves as more data becomes available." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-26", "text": "Data We compare two sources of training data: The University of Washington relation extraction (UWRE) dataset created by Levy et al. (2017) and the Stanford Question Answering Dataset (SQuAD) created by Rajpurkar et al. (2016) ." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-27", "text": "The UWRE data is derived from WikiReading (Hewlett et al., 2016) , which is itself derived from WikiData (Vrande\u010di\u0107, 2012) , and consists of a set of positive and negative examples for relation extraction from Wikipedia sentences." 
}, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-28", "text": "Each instance consists of an entity, a relation, a question template for the relation and a sentence drawn from the wikipedia article for that entity which may or may not answer the question." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-29", "text": "Under the assumption that each relation triple found in a Wikipedia info-box is also expressed in the text of its article, the positive examples contain the first sentence from the article that contains both the subject and object of the triple." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-30", "text": "The negative examples also contain the subject entity of the relation, but express a different relation." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-31", "text": "Levy et al. (2017) provide a number of train/dev/test splits, to allow them to evaluate a variety of modes of generalisation." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-32", "text": "Here we use the relation and entity splits." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-33", "text": "The former tests the ability to generalise from one set of relations to another, i.e. to do zero-shot learning for the unseen relations in the test set." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-34", "text": "In contrast, the latter tests on the easier task of generalising from one set of entities to another for the same set of relations." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-35", "text": "We use this dataset to investigate how having access to various quantities of data about the test set relations changes performance." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-36", "text": "To build a dataset using SQuAD (Rajpurkar et al., 2016) , we construct negative examples by removing sentences that contain the answer, based on the spans provided by the annotators." 
}, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-37", "text": "In other words, we are left with the original question and a paragraph relevant to the topic of that question, but which typically no longer contains sentences answering it." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-38", "text": "Alongside these negative examples, we also retain the original SQuAD instances as positive examples." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-39", "text": "This process is applied to both the train and dev sets, allowing us to evaluate a model that uses only question answering data at training time." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-40", "text": "We also construct a series of datasets that combine increasing quantities of the UWRE entity split training set into the SQuAD training set, to evaluate the benefits of SQuAD when dedicated relation extraction data is limited." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-41", "text": "Random samples of 10 3 , 10 4 , 10 5 and 10 6 UWRE instances are added to our SQuAD training set, while leaving the SQuAD dev dataset untouched." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-42", "text": "Models We employ the same modified BiDAF (Seo et al., 2016 ) model as Levy et al. (2017) , which uses an additional bias term to allow the model to signal when no answer is predicted within the text." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-43", "text": "Evaluation Following the approach of Levy et al. (2017) , we report F1 scores on the answers returned by the model." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-44", "text": "Under this measure, predicting correctly that a negative instance has no answer does not contribute to either precision or recall." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-45", "text": "However, returning an answer for such an instance does reduce precision." 
}, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-46", "text": "Results Table 1 reports the F1 scores for zero-shot relation extraction on the relation split test set, using models trained on the original UWRE and SQuAD datasets." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-47", "text": "As can be seen, BiDAF is actually more effective at answering the questions for the unseen relation types in the UWRE test set when it is trained on a standard QA dataset, rather than a dedicated relation extraction dataset." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-48", "text": "Figure 2 plots how performance improves as more data becomes available about the relations in the entity split test set." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-49", "text": "We compare training purely on UWRE instances to those same instances combined with the whole SQuAD dataset." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-50", "text": "As can be seen, when only small amounts of relation extraction data are available, combining this with the QA data gives a substantial boost to performance." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-51", "text": "Discussion The SQuAD-trained model appears to be effective in the limited data and zero-shot cases, but contributes little when large numbers of examples of the relations of interest are available." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-52", "text": "In this case, the dedicated relation extraction model is able to achieve an F1 of around 90%, with or without augmentation with SQuAD." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-53", "text": "This level of performance suggests that such a model would be accurate enough for practical applications." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-54", "text": "However, test set performance may not be a reliable indicator of the model's ability to generalise to more challenging examples (Jia and Liang, 2017)."
}, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-55", "text": "----------------------------------" }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-56", "text": "**GENERALISATION TO A CHALLENGE TEST SET**" }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-57", "text": "In this second experiment, we want to test the ability of the models described above to generalise to data beyond the UWRE test set." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-58", "text": "In particular, we want to verify that the BiDAF model is able to recognise the assertion of a relation between the entity and the answer, rather than just recognising an answer phrase of the right type." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-59", "text": "Data We construct a challenge test set of negative examples based on sentences which are about the wrong entity but which do contain potential answers that are valid for the question and relation type." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-60", "text": "Thus, each positive example from the original UWRE entity split test set is turned into a negative example by pairing the sentence with an equivalent question about another entity." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-61", "text": "A model that has merely learnt to identify answer spans of the right form, irrespective of their relation to the rest of the sentence, is likely to return the original span rather than recognise that the sentence no longer contains an answer." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-62", "text": "We then build new train, dev and test sets (UWRE+) from the original entity split datasets in which half the original negative instances have been replaced with these more challenging instances." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-63", "text": "As before, a series of datasets combining SQuAD with increasing amounts of this new data is also constructed."
}, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-64", "text": "Models We re-use the UWRE- and SQuAD-trained models in addition to training on the UWRE+ datasets described in the previous section." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-65", "text": "Evaluation Here, F1 is not an appropriate measure, as there are no positive instances in the challenge data." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-66", "text": "Instead, we use accuracy of the predictions, which in this case is just the number of 'no answer' predictions divided by the total number of instances." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-67", "text": "Table 3: Zero-shot Precision, Recall and F1 on the UWRE relation split test set." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-68", "text": "Results Table 2 reports the accuracy of predictions on the challenge test set of negative examples." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-69", "text": "Although the original UWRE model achieved an F1 of around 90% on the unmodified entity split test set, here it gets only 2% of its predictions correct." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-70", "text": "In contrast, the modified UWRE+ training data results in a model that is much more accurate, predicting over 70% of the negative examples correctly." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-71", "text": "Nonetheless, the performance of the SQuAD-trained model is stronger still, even without modification to address this problem." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-72", "text": "Figure 3 shows the accuracy on the challenge test set as increasing quantities of relation extraction instances are added to SQuAD." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-73", "text": "Looking first at the effect of adding the original UWRE training instances, performance drops dramatically as the size of this expansion increases."
}, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-74", "text": "In contrast, as the quantity of UWRE+ data grows, performance improves, peaking at around 100,000 instances, which is roughly the same size as SQuAD." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-75", "text": "----------------------------------" }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-76", "text": "**DISCUSSION**" }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-77", "text": "The results on our challenge test set suggest that the model does not learn to examine the relation between the answer span and the relation subject unless the training data requires it." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-78", "text": "In the case of SQuAD, the multi-sentence paragraph structure around the answer provides enough potential distractors to overcome this issue." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-79", "text": "Other models may show different patterns of strength and weakness, but investigating and exploiting further QA systems quickly would require a means of producing 'no answer' predictions without the need to modify the model implementation." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-80", "text": "**USING AN UNMODIFIED QA MODEL FOR SLOT FILLING** Levy et al. (2017) modify the BiDAF architecture to produce an additional output representing the probability that no answer is present in the text." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-81", "text": "In this experiment, we investigate whether it is possible to adapt a QA model to the slot filling task without having to understand and modify its internal structure and implementation." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-82", "text": "Our approach merely requires prefixing all texts with a dummy token that stands in for the answer when no real answer is present."
}, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-83", "text": "Data We train our models on a modified version of SQuAD, which has been augmented with negative examples by removing answer spans, as described in Section 2, and then had the token NoAnswerFound inserted into every text and as the answer for the negative examples, as described above." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-84", "text": "Models We train both BiDAF (Seo et al., 2016) and FastQA (Weissenborn et al., 2017) models on the modified SQuAD training data, using their standard architectures and hyperparameters." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-85", "text": "Evaluation We evaluate F1 on the same zero-shot evaluation considered in Section 2 and also accuracy on the challenge test set from Section 3." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-86", "text": "Results Table 3 reveals that the unmodified BiDAF model is almost as effective as the Levy et al. (2017) model in terms of zero-shot F1 on the original UWRE test set." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-87", "text": "In contrast, FastQA's performance is substantially worse." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-88", "text": "However, Table 4 reveals that FastQA is extremely accurate on the challenge test set, while BiDAF's performance is comparable to the modified model trained on SQuAD." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-89", "text": "Discussion The unmodified BiDAF and FastQA architectures have complementary strengths on the two evaluations." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-90", "text": "FastQA's strong performance on the challenge instances may be related to its use of binary features indicating whether a word was present in the question."
}, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-91", "text": "----------------------------------" }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-92", "text": "**CONCLUSION**" }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-93", "text": "We showed that standard QA models and data can be easily reused on the slot-filling task, using some straightforward data pre-processing." }, { "sent_id": "5428f8c196308c90618abfdbdf856a-C001-94", "text": "These recycled models were reasonably effective in the reduced data regime and robust on a new test set containing challenging examples." } ], "y": { "@BACK@": { "gold_contexts": [ [ "5428f8c196308c90618abfdbdf856a-C001-10", "5428f8c196308c90618abfdbdf856a-C001-7", "5428f8c196308c90618abfdbdf856a-C001-8", "5428f8c196308c90618abfdbdf856a-C001-9" ], [ "5428f8c196308c90618abfdbdf856a-C001-19", "5428f8c196308c90618abfdbdf856a-C001-20", "5428f8c196308c90618abfdbdf856a-C001-22", "5428f8c196308c90618abfdbdf856a-C001-23" ], [ "5428f8c196308c90618abfdbdf856a-C001-27" ], [ "5428f8c196308c90618abfdbdf856a-C001-31" ], [ "5428f8c196308c90618abfdbdf856a-C001-80" ] ], "cite_sentences": [ "5428f8c196308c90618abfdbdf856a-C001-10", "5428f8c196308c90618abfdbdf856a-C001-9", "5428f8c196308c90618abfdbdf856a-C001-19", "5428f8c196308c90618abfdbdf856a-C001-20", "5428f8c196308c90618abfdbdf856a-C001-22", "5428f8c196308c90618abfdbdf856a-C001-23", "5428f8c196308c90618abfdbdf856a-C001-27", "5428f8c196308c90618abfdbdf856a-C001-31", "5428f8c196308c90618abfdbdf856a-C001-80" ] }, "@EXT@": { "gold_contexts": [ [ "5428f8c196308c90618abfdbdf856a-C001-16" ], [ "5428f8c196308c90618abfdbdf856a-C001-40" ], [ "5428f8c196308c90618abfdbdf856a-C001-41" ], [ "5428f8c196308c90618abfdbdf856a-C001-47" ], [ "5428f8c196308c90618abfdbdf856a-C001-57" ], [ "5428f8c196308c90618abfdbdf856a-C001-59", "5428f8c196308c90618abfdbdf856a-C001-60" ], [ "5428f8c196308c90618abfdbdf856a-C001-64" ] ], "cite_sentences": [ "5428f8c196308c90618abfdbdf856a-C001-16", 
"5428f8c196308c90618abfdbdf856a-C001-40", "5428f8c196308c90618abfdbdf856a-C001-41", "5428f8c196308c90618abfdbdf856a-C001-47", "5428f8c196308c90618abfdbdf856a-C001-57", "5428f8c196308c90618abfdbdf856a-C001-60", "5428f8c196308c90618abfdbdf856a-C001-64" ] }, "@USE@": { "gold_contexts": [ [ "5428f8c196308c90618abfdbdf856a-C001-19" ], [ "5428f8c196308c90618abfdbdf856a-C001-26" ], [ "5428f8c196308c90618abfdbdf856a-C001-42" ], [ "5428f8c196308c90618abfdbdf856a-C001-43" ], [ "5428f8c196308c90618abfdbdf856a-C001-46" ], [ "5428f8c196308c90618abfdbdf856a-C001-49" ], [ "5428f8c196308c90618abfdbdf856a-C001-64" ], [ "5428f8c196308c90618abfdbdf856a-C001-67" ], [ "5428f8c196308c90618abfdbdf856a-C001-69" ], [ "5428f8c196308c90618abfdbdf856a-C001-73" ], [ "5428f8c196308c90618abfdbdf856a-C001-86" ] ], "cite_sentences": [ "5428f8c196308c90618abfdbdf856a-C001-19", "5428f8c196308c90618abfdbdf856a-C001-26", "5428f8c196308c90618abfdbdf856a-C001-42", "5428f8c196308c90618abfdbdf856a-C001-43", "5428f8c196308c90618abfdbdf856a-C001-46", "5428f8c196308c90618abfdbdf856a-C001-49", "5428f8c196308c90618abfdbdf856a-C001-64", "5428f8c196308c90618abfdbdf856a-C001-67", "5428f8c196308c90618abfdbdf856a-C001-69", "5428f8c196308c90618abfdbdf856a-C001-73", "5428f8c196308c90618abfdbdf856a-C001-86" ] } } }, "ABC_c684a2be8ca8ed8db25be6e080f921_7": { "x": [ { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-2", "text": "Cross-lingual dependency parsing involves transferring syntactic knowledge from one language to another." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-3", "text": "It is a crucial component for inducing dependency parsers in low-resource scenarios where no training data for a language exists." 
}, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-4", "text": "Using Faroese as the target language, we compare two approaches using annotation projection: first, projecting from multiple monolingual source models; second, projecting from a single polyglot model which is trained on the combination of all source languages." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-5", "text": "Furthermore, we reproduce multi-source projection (Tyers et al., 2018), in which dependency trees of multiple sources are combined." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-6", "text": "Finally, we apply multi-treebank modelling to the projected treebanks, in addition to, or as an alternative to, polyglot modelling on the source side." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-7", "text": "We find that polyglot training on the source languages produces an overall trend of better results on the target language but the single best result for the target language is obtained by projecting from monolingual source parsing models and then training multi-treebank POS tagging and parsing models on the target side." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-8", "text": "----------------------------------" }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-10", "text": "Cross-lingual transfer methods, i. e. methods that transfer knowledge from one or more source languages to a target language, have led to substantial improvements for low-resource dependency parsing (Rosa and Mare\u010dek, 2018; Guo et al., 2015; Lynn et al., 2014; McDonald et al., 2011; Hwa et al., 2005) and part-of-speech (POS) tagging (Plank and Agi\u0107, 2018)." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-11", "text": "In low-resource scenarios, there may not be enough data for data-driven models to learn how to parse."
}, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-12", "text": "In cases where no annotated data is available, knowledge is often transferred from annotated data in other languages, and when there is only a small amount of annotated data, additional knowledge can be induced from external corpora, for example by learning distributed word representations (Mikolov et al., 2013; Al-Rfou' et al., 2013) and their more recent contextualized variants (Devlin et al., 2019)." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-13", "text": "This work focuses on dependency parsing for low-resource languages by means of annotation projection (Yarowsky et al., 2001) and synthetic treebank creation (Tiedemann and Agi\u0107, 2016)." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-14", "text": "We build on recent work by Tyers et al. (2018), who show that in the absence of annotated training data for the target language, a lexicalized treebank can be created by translating a target language corpus into a number of related source languages and parsing the translations using models trained on the source language treebanks." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-15", "text": "These annotations are then projected to the target language using separate word alignments for each source language, combined into a single graph for each sentence and decoded (Sagae and Lavie, 2006), resulting in a treebank for the target language, Faroese in the case of Tyers et al.'s and our experiments." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-16", "text": "Inspired by recent literature involving multilingual learning (Mulcaire et al., 2019; Vilares et al., 2016), we investigate whether additional improvements can be made by:" }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-17", "text": "1. 
using a single polyglot parsing model which is trained on the combination of all source languages to create synthetic source treebanks (which are subsequently projected to the target language). In this paper, source language and target language always refer to the projection, not the direction of translation." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-18", "text": "We adopt the same terminology used in Mulcaire et al. (2019), who use the term cross-lingual transfer to describe methods involving the use of one or more source languages to process a target language." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-19", "text": "They reserve the term polyglot learning for training a single model on multiple languages, where parameters are shared between languages." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-20", "text": "For the polyglot learning technique applied to multiple treebanks of a single language, we use the term multi-treebank learning." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-21", "text": "2. training a multi-treebank model on the individually projected treebanks and the treebank produced with multi-source projections." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-22", "text": "The former differs from the approach of Tyers et al. (2018), who use multiple discrete, monolingual models to parse the translated sentences, whereas in this work we use a single model trained on multiple source treebanks." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-23", "text": "The latter differs from training on the target treebank produced by multi-source projection in that the information of the individual projections is still available and training data is not reduced to cases where all source languages provide a projection."
}, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-24", "text": "In other words, we aim to investigate whether the current state-of-the-art approach for Faroese, which relies on cross-lingual transfer, can be improved upon by adopting an approach based on source-side polyglot learning and/or target-side multi-treebank learning." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-25", "text": "We hypothesize that a polyglot model can exploit similarities in morphology and syntax across the included source languages, which will result in a better model to provide annotations for projection." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-26", "text": "On the target side, we expect that combining different sources of information will result in a more robust target model." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-27", "text": "We evaluated our various models on the Faroese test set and observed considerable gains for three of the four source languages (Danish, Norwegian Bokm\u00e5l and Swedish) by adopting a polyglot model." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-28", "text": "However, for Norwegian Nynorsk, a stronger monolingual model was able to outperform the polyglot approach." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-29", "text": "When we extended multi-treebank learning to the target side, we observed additional gains in all cases." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-30", "text": "Our best result of 71.5, an absolute improvement of 7.2 points over the result reported by Tyers et al. (2018), was achieved with multi-treebank target learning over the monolingual projections." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-31", "text": "Tyers et al. (2018) describe a method for creating synthetic treebanks for Faroese based on previous work which uses machine translation and word alignments to transfer trees from source language(s) to the target language."
}, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-32", "text": "Sentences from Faroese are translated into the four source languages Danish, Swedish, Norwegian Nynorsk and Norwegian Bokm\u00e5l." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-33", "text": "The translated sentences are then tokenized, POS tagged and parsed using the relevant source language model trained on the source language treebank." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-34", "text": "The resulting trees are projected back to the Faroese sentences using word alignments." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-35", "text": "The four trees for each sentence are combined into a graph with edge scores from one to four (the number of trees that support them), from which a single tree per sentence is produced using the Chu-Liu-Edmonds algorithm (Chu and Liu, 1965; Edmonds, 1967)." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-36", "text": "The resulting trees make up a synthetic treebank for Faroese which is then used to train a Faroese parsing model." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-37", "text": "The parser output is evaluated using the gold-standard Faroese test treebank developed by Tyers et al. (2018)." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-38", "text": "The approach is compared to a delexicalized baseline, which it outperforms by a large margin." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-39", "text": "It is also shown that, for Faroese, a combination of the four source languages (multi-source projection) is superior to individual language projection."
}, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-40", "text": "----------------------------------" }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-41", "text": "**BACKGROUND**" }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-42", "text": "The idea of annotation projection using word alignments originates with Yarowsky et al. (2001), who used word alignments to transfer information such as POS tags from source to target languages." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-43", "text": "This method was later used in dependency parsing by Hwa et al. (2005), who project dependencies to a target language and use a set of heuristics to form dependency trees in the target language." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-44", "text": "A parser is then trained on the projected treebank and evaluated against gold-standard treebanks." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-45", "text": "Zeman and Resnik (2008) introduced the idea of delexicalized dependency parsing, whereby a parser is trained using only POS information and is then applied to a target language." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-46", "text": "McDonald et al. (2011) perform delexicalized dependency parsing using direct transfer and show that this approach outperforms unsupervised approaches for grammar induction." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-47", "text": "Importantly, this approach can be extended to the multi-source case by training on multiple source languages and predicting a target language." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-48", "text": "In an additional experiment, they combine annotation projection and multi-source transfer." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-49", "text": "Tiedemann and Agi\u0107 (2016) present a thorough comparison of pre-neural cross-lingual parsing."
}, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-50", "text": "Various forms of projected annotation methods are compared to delexicalized baselines, and the use of machine translation instead of parallel corpora to produce synthetic treebanks in the target language is explored." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-51", "text": "In contrast to Tyers et al. (2018), they translate a target sentence and project the source parse tree back to the target at test time, instead of using this approach to obtain training data for the target language." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-52", "text": "leverage massively multilingual parallel corpora such as translations of the Bible and web-scraped data from the Watchtower Online Library website for low-resource POS tagging and dependency parsing using annotation projection." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-53", "text": "They project weight matrices (as opposed to decoded dependency arcs) from multiple source languages and average the matrices weighted by word alignment confidences." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-54", "text": "They then decode the weight matrices into dependency trees on the target side, which are then used to train a parser." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-55", "text": "This approach utilizes dense information from multiple source languages, which helps reduce noise from source-side predictions, but to the best of our knowledge, the source-side parsing models learn from each source language independently and the cross-lingual interaction only occurs when projecting the edge scores into multi-source weight matrices."
}, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-56", "text": "The idea of projecting dense information in the form of a weighted graph has been further extended by Schlichtkrull and S\u00f8gaard (2017), who bypass the need to train the target parser on decoded trees and develop a parser which can be trained directly on weighted graphs." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-57", "text": "Plank and Agi\u0107 (2018) use annotation projection for POS tagging." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-58", "text": "They find that choosing high-quality training instances results in higher accuracy than randomly sampling a larger training set." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-59", "text": "To this end, they rank the target sentences by the percentage of words covered by word alignments across all source languages and choose the top k covered sentences for training." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-60", "text": "Meechan-Maddon and Nivre (2019) carry out an evaluation on cross-lingual parsing for three low-resource languages which are supported by related languages." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-61", "text": "They include three experiments: first, training a monolingual model on a small number of sentences in the target language; second, training a cross-lingual model on related source languages which is then applied to the target data; and lastly, training a multilingual model which includes target data as well as data from the related support languages." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-62", "text": "They found that training a monolingual model on the target data was always superior to training a cross-lingual model." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-63", "text": "Interestingly, they found that the best results were achieved by training a model on the various support languages as well as the target data, i. e. 
their multilingual model." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-64", "text": "While we do not combine" }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-65", "text": "----------------------------------" }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-66", "text": "**METHOD**" }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-67", "text": "We outline the process used for creating a synthetic treebank for cross-lingual dependency parsing." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-68", "text": "We use the following resources: raw Faroese sentences taken from Wikipedia, a machine translation system to translate these sentences into all source languages (Danish, Swedish, Norwegian Nynorsk and Norwegian Bokm\u00e5l), a word-aligner to provide word alignments between the words in the target and source sentences, treebanks for the four source languages on which to train parsing models, POS tagging and parsing tools, and, lastly a target language test set." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-69", "text": "We use the same raw corpus, alignments and tokenized and segmented versions of the source translations 4 as Tyers et al. (2018) who release all of their data." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-70", "text": "5 In this way, the experimental pipeline is the same as theirs but we predict POS tags and dependency annotations using our own models." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-71", "text": "----------------------------------" }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-72", "text": "**TARGET LANGUAGE CORPUS**" }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-73", "text": "We use the target corpus built by Tyers et al. (2018) which comprises 28,862 sentences which were extracted from Faroese Wikipedia dumps 6 using the WikiExtractor script 7 and further pre-processed to remove any non-Faroese texts and other forms of unsuitable sentences." 
}, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-74", "text": "Machine Translation As noted by Tyers et al. (2018), popular repositories for developing machine translation systems such as OPUS (Tiedemann, 2016) contain an insufficient number of sentences to train a data-driven machine translation system for Faroese." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-75", "text": "For instance, there are fewer than 7,000 sentence pairs between Faroese and Danish, Faroese and English, Faroese and Norwegian and Faroese and Swedish." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-76", "text": "Consequently, to create parallel source sentences, Tyers et al. (2018) use a rule-based machine translation system available in Apertium to translate from Faroese to Norwegian Bokm\u00e5l." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-77", "text": "There also exist translation systems from Norwegian Bokm\u00e5l to Norwegian Nynorsk, Swedish and Danish in Apertium." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-78", "text": "As a result, the authors use pivot translation from Norwegian Bokm\u00e5l into the other source languages." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-79", "text": "The process is illustrated in Fig. 1." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-80", "text": "For a more thorough description of the machine translation process and for resource creation in general, see the work of Tyers et al. (2018)." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-81", "text": "----------------------------------" }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-82", "text": "**WORD ALIGNMENTS**" }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-83", "text": "We use word alignments between the Faroese text and the source translations generated by Tyers et al. (2018) using fast_align (Dyer et al., 2013), a word alignment tool based on IBM Model 2."
}, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-84", "text": "[9]" }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-85", "text": "Source Treebanks We use the Universal Dependencies v2.2 treebanks to train our source parsing models." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-86", "text": "This is the version used for the 2018 CoNLL shared task on Parsing Universal Dependencies." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-87", "text": "Source Tagging and Parsing Models In order for our parsers to work well with predicted POS tags, we follow the same steps as used in the 2018 CoNLL shared task for creating training and development treebanks with automatically predicted POS tags (henceforth referred to as silver POS)." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-88", "text": "Since we are required to parse translated text, which only has lexical features available, we [9: Note that previous related work reports better results using IBM Model 1 with a more diverse language setup." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-89", "text": "They claim that IBM Model 2 introduces a bias towards more closely related languages." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-90", "text": "As we are working with related languages, and translations and alignments are largely word-for-word, we expect that this will have less of an impact on our experiments, although IBM Model 1 should also be tried in future work.]" }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-91", "text": "disregard lemmas, language-specific POS (XPOS) and morphological features and only use the word form and universal POS (UPOS) tag as input features to our parsers." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-92", "text": "We develop our POS tagging and parsing models using the AllenNLP library." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-93", "text": "We use jackknife resampling to predict the UPOS tags for the training treebanks."
}, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-94", "text": "We split the training treebank into ten parts, train models on nine parts and predict UPOS for the excluded part." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-95", "text": "The process is repeated until all ten parts are predicted and they are then combined to recreate the treebank with silver POS tags." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-96", "text": "Only token features are used to predict the UPOS tag. [10]" }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-97", "text": "Finally, we train a model per source language on the full training data to check performance on the respective development set and to POS tag the source language translations before parsing." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-98", "text": "We train two variants of parsing models." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-99", "text": "The first is a monolingual biaffine dependency parser (Dozat and Manning, 2017) trained on the individual source treebanks." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-100", "text": "The second is a polyglot model trained on all source treebanks using the multilingual parser of Schuster et al. (2019), which is the same graph-based biaffine dependency parser, extended to enable parsing with multiple treebanks." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-101", "text": "We additionally include a treebank embedding (Ammar et al., 2016) in the input of the polyglot parser to help the parser differentiate between the source languages." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-102", "text": "We optimize the model for average development set LAS across the included languages." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-103", "text": "The process is illustrated in Fig. 2."
}, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-104", "text": "To ensure that our parser is realistic, we add a pre-trained monolingual word embedding to each monolingual parser, giving a considerable improvement in accuracy on the development sets of the source languages." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-105", "text": "We use the precomputed Word2Vec embeddings [11] released as part of the 2017 CoNLL shared task on UD parsing (Zeman et al., 2017), which were trained on CommonCrawl and Wikipedia." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-106", "text": "In order to use pre-trained word embeddings for the polyglot setting, we need to consider that a polyglot model uses a shared vocabulary across all input languages." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-107", "text": "In our experiments, we simply use the union of the word embeddings and average the word vector for words that occur in more than one language. [10: We observe slightly lower POS tagging scores on fully annotated test sets than UDPipe, which uses gold lemmas, XPOS and morphological features to predict the UPOS label and therefore cannot be applied to the translated text without also building predictors for these features.]" }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-108", "text": "[11: https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1989]" }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-109", "text": "Future work should explore cross-lingual word embeddings with a limited amount of parallel data or use aligned contextual embeddings as in Schuster et al. (2019)." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-110", "text": "Synthetic Source Treebanks Source translations are tokenized with UDPipe (Straka and Strakov\u00e1, 2017) by Tyers et al. (2018)."
}, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-111", "text": "For each source language, the POS model trained on the full training data (see previous section) is used to tag the tokenized translations." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-112", "text": "Once the text is tagged, we predict dependency arcs and labels with the parsing models of the previous section, and use annotation projection (described below) to provide syntactic annotations for the target sentences." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-113", "text": "Annotation Projection Once the synthetic source treebanks are compiled, i.e. the translations are parsed, the annotations are then projected from the source translations to the target language using the word alignments and Tyers et al.'s projection tool, resulting in a Faroese treebank." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-114", "text": "In some cases, not all tokens are aligned, and Tyers et al. (2018) work around this by falling back to a 1:1 mapping between the target index and the source index." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-115", "text": "There are also cases where there is a mismatch in length between the source and target sentences and some dependency structures cannot be projected to the target language." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-116", "text": "Tyers et al.'s projection setup removes unsuitable projected trees containing e.g. more than one root token, a token that is its own head or a token with a head outside the range of the sentence." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-117", "text": "Multi-source Projection For multi-source projection, the four source-language dependency trees for a Faroese sentence are projected into a single graph, scoring edges according to the number of trees that contain them (Sagae and Lavie, 2006; Nivre et al., 2007)."
}, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-118", "text": "The dependency structure is first built by voting over the directed edges." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-119", "text": "Afterward, dependency labels and POS tags are decided using the same voting procedure." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-120", "text": "The process is illustrated in Fig. 3." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-121", "text": "Target Tagging and Parsing Models At this stage we have Faroese treebanks to train our POS tagging and parsing models." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-122", "text": "The Faroese treebanks come in two variants: the result of projection from source trees produced by either 1) a monolingual, or 2) the polyglot model." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-123", "text": "For each case, we train our POS tagging and parsing models directly on these synthetic treebanks and do not make use of word embeddings, as we do not have them for Faroese." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-124", "text": "Multi-treebank Target Parsing Since we have several synthetic Faroese treebanks, we have the option of training on a single treebank or using a multi-treebank approach where we train on all target treebanks in the same way as we did for inducing the polyglot source model." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-125", "text": "The process is illustrated in Fig. 4." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-126", "text": "When training a multi-treebank target model, for each target treebank, we add a treebank embedding denoting the source model used to project annotations to the target treebank." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-127", "text": "At prediction time, we must include one of these treebank embeddings as input to the model."
}, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-128", "text": "As we do not have real Faroese data in our target training treebanks, we must choose the treebank embedding of one of the synthetic target treebanks." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-129", "text": "We use the term \"proxy treebank\" to refer to cases where the test treebank is not in the training set and a treebank embedding from the training set must be used instead." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-130", "text": "----------------------------------" }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-131", "text": "**EXPERIMENTS**" }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-132", "text": "In this section, we describe our experiments, which include a replication of the main findings of Tyers et al. (2018), using AllenNLP for POS tagging and parsing instead of UDPipe (Straka and Strakov\u00e1, 2017). [Figure 3: Multi-source projection." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-133", "text": "The source language is listed in brackets.]" }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-134", "text": "----------------------------------" }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-135", "text": "**DETAILS**" }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-136", "text": "The hyper-parameters of our POS tagging and parsing models are given in Table 1." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-137", "text": "For POS tagging, we adopt a standard architecture with a word and character-level Bi-LSTM (Graves and Schmidhuber, 2005) to learn context-sensitive representations of our words." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-138", "text": "These representations are passed to a multilayer perceptron (MLP) classifier followed by a softmax function to choose the tag with the highest probability."
}, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-139", "text": "For both the POS tagging and parsing models, we use a word embedding dimension of size 100 and a character embedding dimension of size 64." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-140", "text": "POS tag embeddings of dimension 50 are included in the parser." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-141", "text": "We train our Faroese models for fifty epochs." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-142", "text": "We do not split the synthetic Faroese treebanks into training/development portions, though we suspect doing so would help the models avoid overfitting the training data." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-143", "text": "For all experiments we report labelled attachment scores produced by the official CoNLL 2018 evaluation script." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-144", "text": "[13: https://github.com/ufal/conll2018/blob/master/evaluation_script/conll18_ud_eval.py]" }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-145", "text": "----------------------------------" }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-146", "text": "**RESULTS**" }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-147", "text": "The development results of our monolingual and polyglot models on the source language treebanks are shown in Table 2." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-148", "text": "The results for the polyglot multilingual-parsing model are better for three out of four source languages, whereas for Norwegian Nynorsk, the monolingual model marginally outperforms the polyglot one." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-149", "text": "These results suggest that the polyglot model will contribute better syntactic annotations for Faroese treebanks."
}, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-150", "text": "The statistics of the filtered Faroese treebanks obtained via projection with our source parsing models are given in Table 3." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-151", "text": "The treebank sizes are fairly similar regardless of whether source annotations are provided by a monolingual or a polyglot model, which is expected because the word alignments are the major factor in determining whether a projection is successful." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-152", "text": "There are proportionally fewer sentences for multi-source projection." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-153", "text": "This is because this method only uses the intersection of sentences which are present across all synthetic treebanks after filtering." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-154", "text": "The treebank originating from Norwegian Bokm\u00e5l has the highest number of valid sentences, suggesting that it could be a good candidate for projection to Faroese." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-155", "text": "It also has the highest source language parsing accuracy ( Table 2) ." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-156", "text": "The results of training on our various synthetic Faroese treebanks and predicting the Faroese test set are shown in the first result column of Table 4 (SINGLE)." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-157", "text": "In terms of monolingual vs. polyglot, we find that projecting from a polyglot model helps with four out of the five possible treebanks (with three of them being statistically significant); the exception is the treebank formed by the monolingual model using Norwegian Nynorsk for projection, though the difference is not statistically significant."
}, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-158", "text": "On the source side, the monolingual Norwegian Nynorsk model also performed slightly better than the polyglot model (Table 2). The source language with the highest LAS (Norwegian Bokm\u00e5l) is also the best choice for projection (in this single target model setting)." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-159", "text": "The multi-source approach was not as effective in our case, and some individually better sources were able to surpass this combination approach." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-160", "text": "One could argue that this may be due to the lower amount of training data when using the multi-source treebank." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-161", "text": "We test this hypothesis by only including those sentences which contributed to multi-source projection in the single-source synthetic treebanks." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-162", "text": "The results are given in Table 5." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-163", "text": "Comparing the results in Tables 4 and 5, we see that LAS scores tend to be slightly lower than on the version which included all target sentences, indicating that we did lose some information by filtering out a large number of sentences. [Table 6 (WORK / RESULT): Rosa and Mare\u010dek (2018): 49.4; Tyers et al. (2018): 64.4; Our implementation of Tyers et al. (2018): 68.0; Our Best Model: 71.5]" }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-164", "text": "However, Norwegian Nynorsk still outperforms the multi-source model for the monolingual setting and both Norwegian models perform better than the multi-source model in the polyglot setting, suggesting that size alone does not explain the under-performance of the multi-source model."
}, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-165", "text": "It is also worth noting that polyglot training is superior to all monolingual models, which hints that for Norwegian Nynorsk (the previously better performing model), the monolingual model was not able to achieve its full potential with the reduced data, while the polyglot model was able to provide richer annotations." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-166", "text": "Another reason why the multi-source model does not work as well in our experiments as it does in those of Tyers et al. (2018) might be that we use pre-trained embeddings whereas Tyers et al. (2018) do not." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-167", "text": "In this way, our monolingual models are stronger and likely do not benefit as much from voting." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-168", "text": "The second result column (MULTI) of Table 4 shows the effect of training a multi-treebank POS tagger and parser on the Faroese treebanks created by each of the four source languages as well as the treebank which is produced by multi-source projection." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-169", "text": "This experiment is orthogonal to the experiment using a polyglot model on the source side and so we also test a combination of polyglot source side parsing and multi-treebank target side parsing." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-170", "text": "We see improvements over the single treebank setting for all cases. [15]" }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-171", "text": "Table 6 places our systems in the context of previous results on the same Faroese test set." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-172", "text": "The highest scoring system in the 2018 CoNLL shared task was that of Rosa and Mare\u010dek (2018), who achieved a LAS score of 49.4 on the Faroese test set."
}, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-173", "text": "Note that they use predicted tokenization and segmentation whereas our experiments and Tyers et al.'s use gold tokenization and segmentation, which provides a small artificial boost." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-174", "text": "Tyers et al. (2018) report an LAS of 64.43 with a monolingual multi-source approach." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-175", "text": "Our implementation, which uses a different parser (AllenNLP versus UDPipe) and pre-trained word embeddings, achieves an LAS of 68.0." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-176", "text": "Our highest score of 71.51 is achieved through the combination of projecting from strong monolingual source models and then training multi-treebank POS tagging and parsing models on the outputs." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-177", "text": "----------------------------------" }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-178", "text": "**CONCLUSION**" }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-179", "text": "We have presented parsing results on Faroese, a low-resource language, using annotation projection from multiple monolingual sources versus a single polyglot model." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-180", "text": "We also extended the idea of multi-treebank learning to the target treebanks." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-181", "text": "The results of our experiments show that the use of a polyglot source model helps in four out of five cases using single treebank target models." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-182", "text": "The two source languages that have the lowest LAS when using monolingual parsers, namely Danish and Swedish, see significant improvements when switching to a polyglot model."
}, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-183", "text": "Our best performing single target model is trained on Faroese trees projected from Norwegian Bokm\u00e5l trees produced by a polyglot model." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-184", "text": "However, the strongest language with monolingual modelling, Norwegian Nynorsk, does not benefit from switching to a polyglot model." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-185", "text": "When we filtered the target treebank to the subset of sentences selected by multi-source projection, the polyglot model was superior to all five monolingual models, even outperforming the Norwegian Nynorsk model." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-186", "text": "One explanation of the improvements seen with polyglot modelling is that it introduces a new interaction point for cross-lingual features via the feature extractor of the polyglot parser." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-187", "text": "With monolingual source models, cross-lingual features only interact indirectly in the graph-decoding stage of multi-source projection." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-188", "text": "We found that it always helps to also perform multi-treebank training for the POS tagger." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-189", "text": "We also applied the multi-treebank approach to the target-side POS tagger and parser and saw improvements for all settings." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-190", "text": "The overall best result is with the setting that uses monolingual source models to create the source trees that are projected to Faroese and combined in a multi-treebank model." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-191", "text": "The proxy treebank for the multi-treebank model is the treebank that also gave the best results with single treebank target models, projected from Norwegian Nynorsk."
}, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-192", "text": "We presented a simple solution to deal with using multiple pre-trained embeddings in a model with a shared vocabulary." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-193", "text": "This was a rather na\u00efve solution, and we want to explore the use of available cross-lingual word embedding tools." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-194", "text": "Additionally, the use of contextual embeddings such as ELMo or multilingual BERT (Devlin et al., 2019) would likely provide better representations, with the effect of contributing better annotations for the target language." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-195", "text": "Indeed, recent work has already shown promising results in this area (Schuster et al., 2019; Kondratyuk, 2019)." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-196", "text": "In the multi-source projection experiments, our criterion for filtering is based on whether the sentence was present across all target treebanks, and more sophisticated approaches could be used to select better training instances, as in Plank and Agi\u0107 (2018)." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-197", "text": "More generally, we would like to investigate how our findings might change when the number of source languages or treebanks is changed and how the observations carry over to languages other than Faroese." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-198", "text": "It would also be interesting to use multiple sources of arc weights in a dense graph as in previous work, but with models induced from training on multiple source languages together." }, { "sent_id": "c684a2be8ca8ed8db25be6e080f921-C001-199", "text": "To work with language pairs with more divergent word orders and/or translations that are not word-for-word, the choice of word alignment algorithm and the projection algorithm may have to be revised."
} ], "y": { "@USE@": { "gold_contexts": [ [ "c684a2be8ca8ed8db25be6e080f921-C001-5" ], [ "c684a2be8ca8ed8db25be6e080f921-C001-37" ], [ "c684a2be8ca8ed8db25be6e080f921-C001-69" ], [ "c684a2be8ca8ed8db25be6e080f921-C001-73" ], [ "c684a2be8ca8ed8db25be6e080f921-C001-83" ], [ "c684a2be8ca8ed8db25be6e080f921-C001-132" ], [ "c684a2be8ca8ed8db25be6e080f921-C001-163" ] ], "cite_sentences": [ "c684a2be8ca8ed8db25be6e080f921-C001-5", "c684a2be8ca8ed8db25be6e080f921-C001-37", "c684a2be8ca8ed8db25be6e080f921-C001-69", "c684a2be8ca8ed8db25be6e080f921-C001-73", "c684a2be8ca8ed8db25be6e080f921-C001-83", "c684a2be8ca8ed8db25be6e080f921-C001-132", "c684a2be8ca8ed8db25be6e080f921-C001-163" ] }, "@EXT@": { "gold_contexts": [ [ "c684a2be8ca8ed8db25be6e080f921-C001-14" ] ], "cite_sentences": [ "c684a2be8ca8ed8db25be6e080f921-C001-14" ] }, "@SIM@": { "gold_contexts": [ [ "c684a2be8ca8ed8db25be6e080f921-C001-15" ], [ "c684a2be8ca8ed8db25be6e080f921-C001-69" ], [ "c684a2be8ca8ed8db25be6e080f921-C001-173" ] ], "cite_sentences": [ "c684a2be8ca8ed8db25be6e080f921-C001-15", "c684a2be8ca8ed8db25be6e080f921-C001-69", "c684a2be8ca8ed8db25be6e080f921-C001-173" ] }, "@DIF@": { "gold_contexts": [ [ "c684a2be8ca8ed8db25be6e080f921-C001-22" ], [ "c684a2be8ca8ed8db25be6e080f921-C001-30" ], [ "c684a2be8ca8ed8db25be6e080f921-C001-51" ], [ "c684a2be8ca8ed8db25be6e080f921-C001-69", "c684a2be8ca8ed8db25be6e080f921-C001-70" ], [ "c684a2be8ca8ed8db25be6e080f921-C001-163" ], [ "c684a2be8ca8ed8db25be6e080f921-C001-166" ] ], "cite_sentences": [ "c684a2be8ca8ed8db25be6e080f921-C001-22", "c684a2be8ca8ed8db25be6e080f921-C001-30", "c684a2be8ca8ed8db25be6e080f921-C001-51", "c684a2be8ca8ed8db25be6e080f921-C001-69", "c684a2be8ca8ed8db25be6e080f921-C001-70", "c684a2be8ca8ed8db25be6e080f921-C001-163", "c684a2be8ca8ed8db25be6e080f921-C001-166" ] }, "@BACK@": { "gold_contexts": [ [ "c684a2be8ca8ed8db25be6e080f921-C001-31" ], [ "c684a2be8ca8ed8db25be6e080f921-C001-74" ], [ "c684a2be8ca8ed8db25be6e080f921-C001-76" ], [ 
"c684a2be8ca8ed8db25be6e080f921-C001-80" ], [ "c684a2be8ca8ed8db25be6e080f921-C001-110" ], [ "c684a2be8ca8ed8db25be6e080f921-C001-113" ], [ "c684a2be8ca8ed8db25be6e080f921-C001-114" ], [ "c684a2be8ca8ed8db25be6e080f921-C001-116" ], [ "c684a2be8ca8ed8db25be6e080f921-C001-174" ] ], "cite_sentences": [ "c684a2be8ca8ed8db25be6e080f921-C001-31", "c684a2be8ca8ed8db25be6e080f921-C001-74", "c684a2be8ca8ed8db25be6e080f921-C001-76", "c684a2be8ca8ed8db25be6e080f921-C001-80", "c684a2be8ca8ed8db25be6e080f921-C001-110", "c684a2be8ca8ed8db25be6e080f921-C001-113", "c684a2be8ca8ed8db25be6e080f921-C001-114", "c684a2be8ca8ed8db25be6e080f921-C001-116", "c684a2be8ca8ed8db25be6e080f921-C001-174" ] } } }, "ABC_05fe3e9c1598f5b36b6efa79216309_7": { "x": [ { "sent_id": "05fe3e9c1598f5b36b6efa79216309-C001-35", "text": "**INTERPRETATION MODEL**" }, { "sent_id": "05fe3e9c1598f5b36b6efa79216309-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "05fe3e9c1598f5b36b6efa79216309-C001-2", "text": "An interesting challenge for explainable recommender systems is to provide successful interpretation of recommendations using structured sentences." }, { "sent_id": "05fe3e9c1598f5b36b6efa79216309-C001-3", "text": "It is well known that user-generated reviews, have strong influence on the users' decision." }, { "sent_id": "05fe3e9c1598f5b36b6efa79216309-C001-4", "text": "Recent techniques exploit user reviews to generate natural language explanations." }, { "sent_id": "05fe3e9c1598f5b36b6efa79216309-C001-5", "text": "In this paper, we propose a character-level attention-enhanced long short-term memory model to generate natural language explanations." }, { "sent_id": "05fe3e9c1598f5b36b6efa79216309-C001-6", "text": "We empirically evaluated this network using two real-world review datasets." 
}, { "sent_id": "05fe3e9c1598f5b36b6efa79216309-C001-7", "text": "The generated text is readable and similar to a real user's writing, due to its ability to reproduce negation, misspellings, and domain-specific vocabulary." }, { "sent_id": "05fe3e9c1598f5b36b6efa79216309-C001-8", "text": "----------------------------------" }, { "sent_id": "05fe3e9c1598f5b36b6efa79216309-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "05fe3e9c1598f5b36b6efa79216309-C001-10", "text": "A recommender system should provide accurate and relevant recommendations, but a good recommendation must be supported by interpretation." }, { "sent_id": "05fe3e9c1598f5b36b6efa79216309-C001-11", "text": "The explanation is the key factor to gain the trust of the user." }, { "sent_id": "05fe3e9c1598f5b36b6efa79216309-C001-12", "text": "An interpretable system has significant influence on a user's decision [6], and users tend to trust the opinion of others, especially when they describe personal experiences [3]." }, { "sent_id": "05fe3e9c1598f5b36b6efa79216309-C001-13", "text": "Current explainable recommendations propose to mine users' reviews to generate explanations." }, { "sent_id": "05fe3e9c1598f5b36b6efa79216309-C001-36", "text": "The character-level explanation model presents (1) three modules: LSTM network, attention layer, and generator module; and (2) two input sources: review text and concatenated word embeddings of user, item, and rating, as presented in Fig. 1." }, { "sent_id": "05fe3e9c1598f5b36b6efa79216309-C001-14", "text": "Nonetheless, they lack the ability to generate natural language expressions; hence the sentences are produced in a modular way." }, { "sent_id": "05fe3e9c1598f5b36b6efa79216309-C001-19", "text": "IUI'18, Mar 07-11, 2018, Tokyo, Japan. ACM 978-1-4503-5571-1/18/03. $15.00. DOI: https://doi.org/10.1145/3180308.3180366" }, { "sent_id": "05fe3e9c1598f5b36b6efa79216309-C001-20", "text": "We aim to generate natural language explanations from reviews, aligning explanations and textual features such as aspects and sentiments, which influence the recommendation of different items." }, { "sent_id": "05fe3e9c1598f5b36b6efa79216309-C001-21", "text": "We exploit deep neural networks at character-level to generate explanations." }, { "sent_id": "05fe3e9c1598f5b36b6efa79216309-C001-22", "text": "These networks have recently shown good performance at generating sentences, as presented by Karpathy et al.," }, { "sent_id": "05fe3e9c1598f5b36b6efa79216309-C001-23", "text": "who use a variant of LSTM cells to generate text [2]." }, { "sent_id": "05fe3e9c1598f5b36b6efa79216309-C001-24", "text": "Karpathy et" }, { "sent_id": "05fe3e9c1598f5b36b6efa79216309-C001-25", "text": "al." }, { "sent_id": "05fe3e9c1598f5b36b6efa79216309-C001-26", "text": "encoded rating vectors of reviews in the training phase, allowing the system to calculate the probability of the next character based on the given rating." }, { "sent_id": "05fe3e9c1598f5b36b6efa79216309-C001-27", "text": "Later, Dong et"
}, { "sent_id": "05fe3e9c1598f5b36b6efa79216309-C001-28", "text": "al." }, { "sent_id": "05fe3e9c1598f5b36b6efa79216309-C001-29", "text": "presented an efficient method to generate the next word in a sequence when it is added an attention mechanism, improving the performance for long textual sequences [1] ." }, { "sent_id": "05fe3e9c1598f5b36b6efa79216309-C001-30", "text": "In this paper, we propose a character-level attention-enhanced long short-term memory (LSTM) model to generate personalized natural language explanations based on user-generated reviews." }, { "sent_id": "05fe3e9c1598f5b36b6efa79216309-C001-31", "text": "The model is trained using two real-world datasets: BeerAdvocate [5] and Amazon book reviews [1] ." }, { "sent_id": "05fe3e9c1598f5b36b6efa79216309-C001-32", "text": "The datasets present user reviews describing their opinion about items in natural language." }, { "sent_id": "05fe3e9c1598f5b36b6efa79216309-C001-33", "text": "The explanations are adaptively composed by an encoder-side context vector, because our model learns soft alignments between generated characters and user-item relations, for example, ratings from a user to an item." }, { "sent_id": "05fe3e9c1598f5b36b6efa79216309-C001-34", "text": "----------------------------------" }, { "sent_id": "05fe3e9c1598f5b36b6efa79216309-C001-37", "text": "First, users and items embeddings are learned from doc2vec model, where characters of reviews are encoded as one-hot vectors, corresponding to the input time-steps of LSTM network." }, { "sent_id": "05fe3e9c1598f5b36b6efa79216309-C001-38", "text": "Second, the embeddings are concatenated with the outputs of LSTM, which are inputs for the following attention layer." }, { "sent_id": "05fe3e9c1598f5b36b6efa79216309-C001-39", "text": "Finally, the generator module produce sentences as explanations using outputs from the attention layer." 
}, { "sent_id": "05fe3e9c1598f5b36b6efa79216309-C001-40", "text": "\u2022 LSTM network LSTM is an enhanced recurrent neural network (RNN) where information is transmitted from a neuron to the next neuron, and the corresponding neuron in the next layer simultaneously, as presented in Fig. 1 ." }, { "sent_id": "05fe3e9c1598f5b36b6efa79216309-C001-41", "text": "LSTM Figure 2 ." }, { "sent_id": "05fe3e9c1598f5b36b6efa79216309-C001-42", "text": "Rating Text Samples, from poorly rated (1) to highly rated (5) ." }, { "sent_id": "05fe3e9c1598f5b36b6efa79216309-C001-43", "text": "was introduced to solve the long-term dependency problem, which causes vanishing gradient in conventional RNN [2] ." }, { "sent_id": "05fe3e9c1598f5b36b6efa79216309-C001-44", "text": "\u2022 Attention mechanism The attention mechanism, adaptively learns soft alignments c t between character dependencies H t and attention inputs a. Eq. 1 formally defines the new character dependencies using attention layer H attention t [1] ." }, { "sent_id": "05fe3e9c1598f5b36b6efa79216309-C001-45", "text": "\u2022 Generating Text The explanation is generated character by character." }, { "sent_id": "05fe3e9c1598f5b36b6efa79216309-C001-46", "text": "The characters are given by maximizing the softmax conditional probability p, based on the new character dependencies H attention t [1] , as presented in Eq. 2 p = softmax(H attention t W + b), char = arg max p (2)" }, { "sent_id": "05fe3e9c1598f5b36b6efa79216309-C001-47", "text": "----------------------------------" }, { "sent_id": "05fe3e9c1598f5b36b6efa79216309-C001-48", "text": "**RESULTS**" }, { "sent_id": "05fe3e9c1598f5b36b6efa79216309-C001-49", "text": "The model was evaluated using two real-world datasets: Beer-Advocate and Amazon book reviews." 
}, { "sent_id": "05fe3e9c1598f5b36b6efa79216309-C001-50", "text": "The first experiment presents generated explanations given by the rating as attention mechanism to generate explanations with different sentiments, as presented in Fig. 2 ." }, { "sent_id": "05fe3e9c1598f5b36b6efa79216309-C001-51", "text": "The second experiment generates explanations for particular user-item pairs presented in Fig. 3 , where the user opinion about an item is generated in natural language." }, { "sent_id": "05fe3e9c1598f5b36b6efa79216309-C001-52", "text": "Finally, evaluating the generated explanations based on readability metrics in Fig. 4 ." }, { "sent_id": "05fe3e9c1598f5b36b6efa79216309-C001-53", "text": "The readability metrics [4] measure how understandable the generated text is, where lower values correspond to an easy and understandable text." }, { "sent_id": "05fe3e9c1598f5b36b6efa79216309-C001-54", "text": "----------------------------------" }, { "sent_id": "05fe3e9c1598f5b36b6efa79216309-C001-55", "text": "**CONCLUSION**" }, { "sent_id": "05fe3e9c1598f5b36b6efa79216309-C001-56", "text": "The work provides preliminary results in automatically generating natural language explanations." }, { "sent_id": "05fe3e9c1598f5b36b6efa79216309-C001-57", "text": "The model differs from recent works [1, 6] , due to the use of attention layer combined with character-level LSTM." }, { "sent_id": "05fe3e9c1598f5b36b6efa79216309-C001-58", "text": "The proposed model improves We would like to improve the model considering: (1) personalizing explanations to benefit the users' preferences based on their expressed sentiments; and (2) testing the model in larger and more varied review domains such as hotels and restaurants." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "05fe3e9c1598f5b36b6efa79216309-C001-29" ] ], "cite_sentences": [ "05fe3e9c1598f5b36b6efa79216309-C001-29" ] }, "@USE@": { "gold_contexts": [ [ "05fe3e9c1598f5b36b6efa79216309-C001-31" ], [ "05fe3e9c1598f5b36b6efa79216309-C001-44" ], [ "05fe3e9c1598f5b36b6efa79216309-C001-46" ] ], "cite_sentences": [ "05fe3e9c1598f5b36b6efa79216309-C001-31", "05fe3e9c1598f5b36b6efa79216309-C001-44", "05fe3e9c1598f5b36b6efa79216309-C001-46" ] }, "@DIF@": { "gold_contexts": [ [ "05fe3e9c1598f5b36b6efa79216309-C001-57" ] ], "cite_sentences": [ "05fe3e9c1598f5b36b6efa79216309-C001-57" ] } } }, "ABC_008d5261ee7385a2b7e39772938f51_7": { "x": [ { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-2", "text": "Sentiment analysis of citations in scientific papers and articles is a new and interesting problem which can open up many exciting new applications in bibliographic search and bibliometrics." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-3", "text": "Current work on citation sentiment detection focuses on only the citation sentence." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-4", "text": "In this paper, we address the problem of context-enhanced citation sentiment detection." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-5", "text": "We present a new citation sentiment corpus which has been annotated to take the dominant sentiment in the entire citation context into account." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-6", "text": "We believe that this gold standard is closer to the truth than annotation that looks only at the citation sentence itself." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-7", "text": "We then explore the effect of context windows of different lengths on the performance of a stateof-the-art citation sentiment detection system when using this context-enhanced gold standard definition." 
}, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-8", "text": "----------------------------------" }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-10", "text": "Sentiment analysis of citations in scientific papers and articles is a new and interesting problem." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-11", "text": "It can open up many exciting new applications in bibliographic search and in bibliometrics, i.e., the automatic evaluation of the influence and impact of individuals and journals via citations." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-12", "text": "Automatic detection of citation sentiment can also be used as a first step to scientific summarisation (Abu-Jbara and Radev, 2011) ." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-13", "text": "Alternatively, it can help researchers during search, e.g., by identifying problems with a particular approach, or by helping to recognise unaddressed issues and possible gaps in the current research." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-14", "text": "However, there is a problem with the expression of sentiment in scientific text." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-15", "text": "Conventionally, the writing style in scientific writing is meant to be objective." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-16", "text": "Any personal bias by authors has to be hedged (Hyland, 1995) ." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-17", "text": "Negative sentiment is politically particularly dangerous (Ziman, 1968) , and some authors have documented the strategy of prefacing the intended criticism by slightly disingenuous praise (MacRoberts and MacRoberts, 1984) ." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-18", "text": "This makes the problem of identifying such opinions particularly challenging." 
}, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-19", "text": "This non-local expression of sentiment has been observed in other genres as well (Wilson et al., 2009; Polanyi and Zaenen, 2006) ." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-20", "text": "Figure 1 ." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-21", "text": "While the first sentence praises some aspects of the cited paper, the remaining sentences list its shortcomings." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-22", "text": "It is clear that criticism is the intended sentiment, but if we define our gold standard only by looking at the citation sentence, we lose a significant amount of sentiment hidden in the text." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-23", "text": "Given that most citations are neutral (Spiegel-Rosing, 1977; Teufel et al., 2006) , this makes it ever more important to recover what explicit sentiment there is from the context of the citation." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-24", "text": "However, the dominant assumption in current citation identification methods (Ritchie et al., 2008; Radev et al., 2009 ) is that the sentiment present in the citation sentence represents the true sentiment of the author towards the cited paper." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-25", "text": "This is due to the difficulty of determining the relevant context, whereas it is substantially easier to identify the citation sentence." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-26", "text": "In our example above, however, such an approach would lead to the wrong prediction of praise or neutral sentiment." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-27", "text": "In this paper, we address the problem of contextenhanced citation sentiment detection." 
}, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-28", "text": "We present a new citation sentiment corpus where each citation has been annotated according to the dominant sentiment in the corresponding citation context." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-29", "text": "We claim that this corpus is closer to the truth than annotation that considers only the citation sentence itself." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-30", "text": "We show that it increases citation sentiment coverage, particularly for negative sentiment." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-31", "text": "Using this gold standard, we explore the effect of assuming context windows of different but fixed lengths on the performance of a state-of-the-art citation sentiment detection system where the sentiment of citation is considered in the entire context of the citation and more than one single sentiment can be assigned." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-32", "text": "Previous approaches neither detect citation sentiment and context simultaneously nor use as large a corpus as we do." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-33", "text": "We selected a four-class scheme for annotation." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-34", "text": "Every sentence that is in a window of 4 sentences of the citation and does not contain any direct or indirect mention of the citation was labelled as being excluded (x)." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-35", "text": "The window length was motivated by recent research (Qazvinian and Radev, 2010) which shows the best score for a four-sentence boundary when detecting non-explicit citation." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-36", "text": "The rest of the sentences were marked either positive (p), negative (n) or objective/neutral (o)." 
}, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-37", "text": "A total of 1,741 citations were annotated." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-38", "text": "Although this annotation was performed by the first author only, we know from previous work that similar styles of annotation can achieve acceptable interannotator agreement (Teufel et al., 2006 )." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-39", "text": "An example annotation for Smadja (1993) is given in Figure 2 , where the first column shows the line number and the second one shows the class label." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-40", "text": "To compare our work with Athar (2011) , we also applied a three-class annotation scheme." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-41", "text": "In this method of annotation, we merge the citation context into a single sentence." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-42", "text": "Since the context introduces more than one sentiment per citation, we marked the citation sentiment with the last sentiment mentioned in the context window as this is pragmatically most likely to be the real intention (MacRoberts and MacRoberts, 1984) ." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-43", "text": "As is evident from Table 1 , including the 4 sentence window around the citation more than doubles the instances of subjective sentiment, and in the case of negative sentiment, this proportion rises to 3." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-44", "text": "In light of the overall sparsity of detectable citation sentiment in a paper, and of the envisaged applica-tions, this is a very positive result." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-45", "text": "The reason for this effect is most likely \"sweetened criticism\" -authors' strategic behaviour of softening the effect of criticism among their peers (Hornsey et al., 2008) ." 
}, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-46", "text": "----------------------------------" }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-47", "text": "**EXPERIMENTS AND RESULTS**" }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-48", "text": "We represent each citation as a feature set in a Support Vector Machine (SVM) (Cortes and Vapnik, 1995) framework and use n-grams of length 1 to 3 as well as dependency triplets as features." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-49", "text": "The dependency triplets are constructed by merging the relation, governor and dependent in a single string, for instance, the relation nsubj(failed, method) is represented as nsubj failed method ." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-50", "text": "This setup has been shown to produce good results earlier as well (Pang et al., 2002; Athar, 2011) ." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-51", "text": "The first set of experiments focuses on simultaneous detection of sentiment and context sentences." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-52", "text": "For this purpose, we use the four-class annotated corpus described earlier." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-53", "text": "While the original annotations were performed for a window of length 4, we also experiment with asymmetrical windows of l sentences preceding the citation and r sentences succeeding it." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-54", "text": "The detailed results are given in Table 2 ." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-55", "text": "Table 2 : Results for joint context and sentiment detection." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-56", "text": "Because of the skewed class distribution, we use both the F macro and F micro scores with 10-fold cross-validation." 
}, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-57", "text": "The baseline score, shown in bold, is obtained with no context window and is comparable to the results reported by Athar (2011) ." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-58", "text": "However, we can observe that the F scores decrease as more context is introduced." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-59", "text": "This may be attributed to the increase in the vocabulary size of the n-grams and a consequent reduction in the discriminating power of the decision boundaries." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-60", "text": "These results show that the task of jointly detecting sentiment and context is a hard problem." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-61", "text": "For our second set of experiments, we use the three-class annotation scheme." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-62", "text": "We merge the text of the sentences in the context windows as well as their dependency triplets to obtain the features." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-63", "text": "The results are reported in Table 3 with best results in bold." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-64", "text": "Although these results are not better than the context-less baseline, the reason might be data sparsity since existing work on citation sentiment analysis uses more data (Athar, 2011) ." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-65", "text": "----------------------------------" }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-66", "text": "**RELATED WORK**" }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-67", "text": "While different schemes have been proposed for annotating citations according to their function (Spiegel-Rosing, 1977; Nanba and Okumura, 1999; Garzone and Mercer, 2000) , the only recent work on citation sentiment detection using a relatively large corpus is by Athar (2011) ." 
}, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-68", "text": "However, this work does not handle citation context." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-69", "text": "Piao et al. (2007) proposed a system to attach sentiment information to the citation links between biomedical papers by using existing semantic lexical resources." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-70", "text": "A common approach for sentiment detection is to use a labelled lexicon to score sentences (Hatzivassiloglou and McKeown, 1997; Turney, 2002; Yu and Hatzivassiloglou, 2003) ." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-71", "text": "However, such approaches have been found to be highly topic dependent (Engstr\u00f6m, 2004; Gamon and Aue, 2005; Blitzer et al., 2007) ." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-72", "text": "Teufel et al. (2006) worked on a 2,829 sentence citation corpus using a 12-class classification scheme." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-73", "text": "Although they used context in their annotation, their focus was on determining the author's reason for citing a given paper." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-74", "text": "This task differs from citation sentiment, which is in a sense a \"lower level\" of analysis." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-75", "text": "For implicit citation extraction, Kaplan et al. (2009) explore co-reference chains for citation extraction using a combination of co-reference resolution techniques." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-76", "text": "However, their corpus consists of only 94 sentences of citations to 4 papers which is likely to be too small to be representative." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-77", "text": "The most relevant work is by Qazvinian and Radev (2010) who extract only the non-explicit citations for a given paper." 
}, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-78", "text": "They model each sentence as a node in a graph and experiment with various window boundaries to create edges between neighbouring nodes." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-79", "text": "However, their dataset consists of only 10 papers and their annotation scheme differs from our four-class annotation as they do not deal with any sentiment." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-80", "text": "----------------------------------" }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-81", "text": "**CONCLUSION**" }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-82", "text": "In this paper, we focus on automatic detection of citation sentiment using the citation context." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-83", "text": "We present a new corpus and show that ignoring the citation context would result in loss of a lot of sentiment, specially criticism towards the cited paper." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-84", "text": "We also report the results of the state-of-the-art citation sentiment detection systems on this corpus when using this context-enhanced gold standard definition." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-85", "text": "Future work directions may include improving the detection algorithms by filtering the context sentences more intelligently." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-86", "text": "For this purpose, existing work on coreference resolution (Lee et al., 2011) may prove to be useful." }, { "sent_id": "008d5261ee7385a2b7e39772938f51-C001-87", "text": "Context features may also be used for first filtering citations which have been mentioned only in passing, and then applying context based sentiment classification to the remaining significant citations." 
} ], "y": { "@USE@": { "gold_contexts": [ [ "008d5261ee7385a2b7e39772938f51-C001-40" ] ], "cite_sentences": [ "008d5261ee7385a2b7e39772938f51-C001-40" ] }, "@BACK@": { "gold_contexts": [ [ "008d5261ee7385a2b7e39772938f51-C001-50" ], [ "008d5261ee7385a2b7e39772938f51-C001-67" ] ], "cite_sentences": [ "008d5261ee7385a2b7e39772938f51-C001-50", "008d5261ee7385a2b7e39772938f51-C001-67" ] }, "@SIM@": { "gold_contexts": [ [ "008d5261ee7385a2b7e39772938f51-C001-57" ] ], "cite_sentences": [ "008d5261ee7385a2b7e39772938f51-C001-57" ] }, "@DIF@": { "gold_contexts": [ [ "008d5261ee7385a2b7e39772938f51-C001-64" ] ], "cite_sentences": [ "008d5261ee7385a2b7e39772938f51-C001-64" ] } } }, "ABC_d06a49ad232f73328874282d91cde0_7": { "x": [ { "sent_id": "d06a49ad232f73328874282d91cde0-C001-42", "text": "----------------------------------" }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-43", "text": "**BACKGROUND**" }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-2", "text": "In this paper we propose a partial parsing model which achieves robust parsing with a large HPSG grammar." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-3", "text": "Constraint-based precision grammars, like the HPSG grammar we are using for the experiments reported in this paper, typically lack robustness, especially when applied to real world texts." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-4", "text": "To maximally recover the linguistic knowledge from an unsuccessful parse, a proper selection model must be used." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-5", "text": "Also, the efficiency challenges usually presented by the selection model must be answered." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-6", "text": "Building on the work reported in Zhang et al. 
(2007a) , we further propose a new partial parsing model that splits the parsing process into two stages, both of which use the bottom-up chart-based parsing algorithm." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-7", "text": "The algorithm is implemented and a preliminary experiment shows promising results." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-8", "text": "----------------------------------" }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-10", "text": "Linguistically motivated precision grammars are highly valuable language resources which provide in-depth modeling of complex language phenomena." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-11", "text": "Based on sound linguistic theoretical backgrounds and rigid mathematical formalisations, such approaches to natural language processing are capable of delivering highly accurate analyses when compared to shallower NLP systems." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-12", "text": "As pivotal central parts of continuous efforts on grammar engineering over the last decade, several of such grammars have achieved broad coverage on various linguistic phenomena in recent years, and have been successfully integrated in several NLP applications including information extraction, question answering, grammar checking, machine translation, and intelligent information retrieval, among others." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-13", "text": "However, being highly restricted rule systems, these grammars are typically vulnerable to noisy inputs, and perform badly in terms of robustness." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-14", "text": "This is one of the major reasons why, despite being highly valuable language resources, precision grammars have been very much underused in real world applications in the past decades." 
}, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-15", "text": "Baldwin et al. (2004) reported that the jun-04 version of the English Resource Grammar (ERG; Flickinger (2002) ) achieves full lexical span 1 over a mere 32% of a random sample of 20K BNC strings." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-16", "text": "Among these inputs, 57% receive at least one analysis." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-17", "text": "Through a series of parsing coverage tests, Zhang and Kordoni (2006) also showed that, at least for grammars similar to the ERG, incomplete lexicon is one of the major sources of parsing failures, with the other major source being missing grammar constructions." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-18", "text": "Targeting the missing lexical coverage in hand-crafted lexica of manually developed linguistically motivated precision grammars, like the ones mentioned above, several deep lexical acquisition approaches have been proposed (cf., Baldwin (2005) , Zhang and Kordoni (2006) )." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-41", "text": "Section 7. concludes the paper." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-19", "text": "The general idea shared among such approaches is to use available language resources (either derived from the grammar out-puts themselvers -the so called in vivo deep lexical acquisition approaches -, or from external language resourcesthe so called in vitro lexical acquisition approaches) in order to automatically acquire the required linguistic knowledge and extend the lexicon." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-20", "text": "While the lexical coverage has been proven to largely improve with statistical lexical type prediction models like the one proposed in Zhang and Kordoni (2006) , for instance, the missing constructions present a more serious coverage gap, as also briefly mentioned above." 
}, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-21", "text": "More specifically, in (Zhang, 2007) , a coverage test run with chronologically different versions of the ERG has shown that, with the increased efforts invested into grammar engineering, the coverage of the specific grammar has shown a very promising improvement over the years." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-22", "text": "However, it is still unlikely for the specific precision large-scale grammar to achieve full coverage on unseen data without extra robust processing techniques." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-23", "text": "Also, the cost of manually extending the grammar would be too high to be easily acceptable for other precision grammar-based parsing systems." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-24", "text": "In (Zhang et al., 2007a) , we have pointed out that most applications are only interested in certain aspects of parsing results." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-25", "text": "Full analyses are preferable, but not always necessary." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-26", "text": "In fact, most of the contemporary deep parsing systems provide as outputs either semantic representations that reflect the \"meaning\" of the input, or rather abstract syntactic structures." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-27", "text": "Full representations with all detailed linguistic features (e.g., typed feature structures in HPSG) are almost never used either as output format or in real applications." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-28", "text": "Take the DELPH-IN HPSG grammars, for instance: Minimal Recursion Semantics (MRS, Copestake et al. (2005) ) is used as the semantic representation in these grammars." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-29", "text": "For recording syntactic structures, derivation trees are usually used." 
}, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-30", "text": "Based on this fact, (Zhang et al., 2007a) have proposed to use partial parsing models to recover the most useful fragment analyses from the intermediate parsing results in cases of unsuccessful parses." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-31", "text": "To this effect, two statistical partial parse selection models are formulated, implemented, and evaluated." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-32", "text": "Along the lines of the analysis presented in (Zhang et al., 2007a) , in this paper we propose a more elaborated par-tial parsing model, in order to further simplify the training procedure, so that full parse disambiguation models can be reused in partial parsing." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-33", "text": "Moreover, this new model enables us to obtain complete derivation trees, instead of a set of subtrees." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-34", "text": "Furthermore, with robust semantic composition rules, the fragment semantic representations can be put together in a robust, yet informative way." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-35", "text": "The rest of the paper is structured as follows." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-36", "text": "Section 2. provides background knowledge about the DELPH-IN HPSG grammars, the semantic and syntactic representations, and the partial parsing model presented in Kasper et al. (1999) and Zhang et al. (2007a) ." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-37", "text": "Section 3. presents the new proposed two-stage robust parsing model." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-38", "text": "Section 4. further elaborates on the implementation details of the twostage parsing model, including a detailed presentation of the efficient processing techniques." 
}, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-39", "text": "Section 5. presents a preliminary evaluation with the ERG using the PARC 700 Dependency Treebank (King et al. (2003) ) sentences." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-40", "text": "In Section 6. we discuss the advantages of our model, as well as the remaining questions for future work." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-44", "text": "Head-driven Phrase Structure Grammar (HPSG, Pollard and Sag (1994) ) is a well-known constraint-based grammar formalism." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-45", "text": "Being a highly consistent grammar framework, HPSG is a linguistic theory formulated purely with Typed Feature Structures (TFSes, cf., Carpenter (1992) )." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-46", "text": "Due to its rigid mathematical foundation, HPSG has been widely adopted in the development of linguistically motivated large-scale precision grammars for different languages." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-47", "text": "Head-driven Phrase Structure Grammar is also at the heart of DELPH-IN, a community effort on deep linguistic processing with HPSG, which has delivered the most promising multilingual parallel grammar development with HPSG to date." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-48", "text": "With a complete software tool-chain, ranging from a grammar engineering platform, the LKB system (cf., Copestake (2002) ), to performance profiling and teebanking systems, the [incr tsdb()] platform (cf., Oepen (2001) ), an efficient parser, PET (Callmeier, 2001) , and a hybrid processing middle-ware, the HoG architecture (Callmeier et al., 2004) , linguists and computer scientists are able to work together to develop language resources and applications with profound linguistic knowledge." 
}, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-49", "text": "One of the most well-developed grammars in DELPH-IN (also among hand-crafted grammars in any other framework) is the English Resource Grammar (ERG, Flickinger (2002) )." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-50", "text": "The grammar achieves broad coverage on various linguistic phenomena, but still remains rather restricted." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-51", "text": "Therefore, it is not only used for parsing, but also for text generation tasks." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-52", "text": "On the semantic level, the English Resource Grammar outputs representations in the form of Minimal Recursion Semantics (MRS, Copestake et al. (2005) ), a representation framework for computational semantics." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-53", "text": "The main assumption behind MRS is that the interesting linguistic units for computational semantics are the elementary predications (EPs), which are single relations with associated arguments." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-54", "text": "The flat (non-recursive) structure of MRS is especially suitable for situations where semantic composition is desired." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-55", "text": "Moreover, it can be easily integrated with the HPSG grammar by embedding the MRS structure into the typed feature structures." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-56", "text": "On the syntactic level, on the other hand, a complete typed feature structure should be used, in principle." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-57", "text": "However, this is not necessary, for most of the features in the TFS are considered internal to the grammar, and not suitable as output format." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-58", "text": "2 In practice, the derivation trees are used." 
}, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-59", "text": "For the DELPH-IN grammars, a derivation tree is composed of leaf notes, each of which corresponds to a lexical entry, and intermediate nodes, each of which corresponds to a grammar rule." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-60", "text": "Given an input and a grammar, a derivation tree records how an analysis is derived." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-61", "text": "By applying grammar rules on the lexical entries in the way indicated by a derivation tree, one can easily recreate the whole typed feature structure." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-62", "text": "For this reason, the DELPH-IN treebanks Bond et al., 2004) only record derivation trees." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-63", "text": "Theoretically, the computational complexity in unificationbased parsing is exponential to the length of the input." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-64", "text": "Given large-scale grammars like the ERG, it is crucial to have an efficient parser that can discover analyses licensed by the grammar." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-65", "text": "With continuous development in recent years, the PET (Callmeier, 2001 ) parser has grown to be one of the central components in the DELPH-IN software tool-chain." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-66", "text": "PET is based on a bottom-up chart-based algorithm, equipped with various efficient processing techniques, including quick-check, ambiguity packing and selective unpacking, among others." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-67", "text": "The robust parsing model proposed in this paper has been implemented as an extension to the PET parser." 
}, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-68", "text": "We should point out that this is not the first work to propose a partial parsing model in order to improve the robustness of a hand-crafted grammar." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-69", "text": "Although the idea is usually to construct meaningful output structures from intermediate unsuccessful parsing results, the definition of a partial parse is not consentaneous." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-70", "text": "It is largely dependent on the paradigm of the parsing model." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-71", "text": "For instance, with bottomup chart-based parsing, Kasper et al. (1999) proposed to define a partial parse as a set of consecutive non-overlapping passive parsing edges that together cover the entire input." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-72", "text": "In cases where a multiple partial parse exists, a selection criterion is required to decide which one is more preferable." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-73", "text": "In other words, a partial parse selection model is required." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-74", "text": "One of the simplest and most commonly used criterion is to prefer the partial parses which contain an edge that covers the largest fragment of the input." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-75", "text": "However, there is no strong motivation that makes this a good selection model." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-76", "text": "An alternative selection model proposed by Kasper et al. (1999) is to consider the parsing chart as a directed graph, with vertex being all the positions between input tokens, and arcs being passive parsing edges on the chart." 
}, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-77", "text": "Then a best partial parse (as a set of arcs in the graph) connects the shortest path from the beginning to the end of the input." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-78", "text": "Kasper et al. (1999) pointed out that the weights of the arcs can be assigned by an estimation function in order to indicate the preference over different fragment analyses." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-79", "text": "The discovery of such a path can be done in linear time (O(|V |+|E|)) with the DAG-shortest-path algorithm (Cormen et al., 1990) ." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-80", "text": "However, it is not clear (apart from some simple heuristics) how the estimation function can be acquired." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-81", "text": "Moreover, by its additive nature, the shortest-path, such a model makes an implicit independence assumption of the estimation function in different edge contexts." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-82", "text": "Based on a similar definition of partial parse, Zhang et al. (2007a) formulated the following statistical model:" }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-83", "text": "The above model contains two probabilistic components: i) P (\u2126|w) is the conditional probability of a segmentation \u2126 given the input sequence w; and ii) P (t i |w i ) is the conditional probability of an analysis t i for a given subsequence w i in the segmentation." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-84", "text": "The empirical results have shown that this selection model significantly outperforms the shortest-path based baseline selection model proposed by Kasper et al. (1999) ." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-85", "text": "The evaluation was done using multiple metrics." 
}, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-86", "text": "While there is no gold-standard corpus for the purpose of partial parse evaluation, Zhang et al. (2007a) manually compared the parser's partial derivation trees with the Penn Treebank annotation for syntactic similarity." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-87", "text": "Furthermore, Zhang et al. (2007a) evaluated the fragment semantic outputs based on a practical estimation of RMRS similarities described by Dridan and Bond (2006) ." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-88", "text": "The semantic outputs of different partial parse selection models were compared to the RMRS outputs from the RASP system (Briscoe et al., 2006) ." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-89", "text": "If taken comparatively, all the results suggested that the model in (2.) performed much better than the baseline. But they failed to tell a clear story about the quality of the partial parse selection model." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-90", "text": "Unfortunately, the model is approximate because of the independence assumption between the two components (for simplification)." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-91", "text": "Also, due to the lack of training data, the parameters of the two components were estimated over different data sets in the experiment, which has added further doubt on the consistency of the resulting model." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-92", "text": "Moreover, it is generally not desirable to have different statistical models for full and partial parse selection." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-93", "text": "Ideally, a uniform disambiguation model should be used in both cases." 
}, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-94", "text": "----------------------------------" }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-95", "text": "**A TWO-STAGE ROBUST PARSING MODEL**" }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-96", "text": "One common shortcoming of the partial parsing models proposed in both (Kasper et al., 1999) and (Zhang et al., 2007a) is that the results of partial parsing are sets of disjoint sub-analyses, either in the form of derivation subtrees, or in the form of MRS fragments." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-97", "text": "It is not informative enough to show the interconnection across the fragment boundaries." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-98", "text": "It is not enough, either, to tell why a full analysis is not derived for the given input." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-99", "text": "Ideally, the partial parsing model should not only tell us which are good sub-analyses, but also predict what the missing parts from a full analysis are, should the input be licensed by the grammar." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-100", "text": "In a bottom-up chart-based parser, when a full analysis is not derived, the parser stops at a stage where no more grammar rule can be applied to either combine or create new edges on the chart." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-101", "text": "At this stage, all the passive edges on the parsing chart represent a licensed local analysis for the tokens within its span." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-102", "text": "Typically, for a broad coverage precision grammar with a well-formed input, certain rules fail to apply because some constraints are too strict." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-103", "text": "By relaxing the constraints in grammar rules, more robustness can be achieved." 
}, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-104", "text": "The basic idea of the robust parsing model we propose in this paper is to use a set of less restrictive grammar rules to continue parsing with the passive parsing edges created with HPSG rules and lexical entries during the unsuccessful parse." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-105", "text": "To differentiate these less restrictive grammar rules from the original HPSG rules, we call them robust rules." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-106", "text": "Several different ways of acquiring robust rules exist." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-107", "text": "In this paper, we use a context-free backbone grammar to simulate the behaviour of original HPSG rules." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-108", "text": "By choosing the CFG backbone, we will ignore the constraints encoded as typed feature structures." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-109", "text": "This allow us to generalise the approach beyond the specific grammar." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-110", "text": "Also, the robust parsing model we are concerned with in this paper focuses on improving constructional coverage." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-111", "text": "Therefore, only syntactic phrase structure rules are extracted." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-112", "text": "The missing lexical entries, together with the lexical rules should be captured through the lexical acquisition process." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-113", "text": "Figure 1 gives an example HPSG derivation tree and the corresponding CFG backbone." 
}, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-114", "text": "Using these rules, together with the passive parsing edges create with HPSG rules in the first parsing stage, we are likely to be able to build larger analysis trees during the second parsing stage when the TFS unification-based parsing is substituted by CFG parsing." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-115", "text": "All the TFSes created are ignored (but still kept along with the passive edges created during the fist stage)." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-116", "text": "Only the rules symbols are used as the category of the edge." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-117", "text": "Since the CFG backbone grammar uses the HPSG grammar rules names for its non-terminal nodes, the resulting parse trees are very similar to the HPSG derivation trees." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-118", "text": "The only difference is that a valid TFS cannot be recreated for those nodes constructed with CFG rules." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-119", "text": "We call such trees pseudo-derivation trees." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-120", "text": "not have a proper noun entry for Lakers, this will be falsely analysed as a plur noun during the first parsing stage." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-121", "text": "The first parsing stage stalls at the point where the HPSG headsubject fails to apply because of the disagreement on the number of the subject and the head phrase." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-122", "text": "With the CFG rule:" }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-123", "text": "subjh \u2192 hspec third sg fin verb a CFG passive edge subj is constructed during the second parsing stage; this covers the entire input, and completes the pseudo-derivation tree." 
}, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-124", "text": "Constructing pseudo-derivation trees does not only predict the structure of full analyses, but it also helps simplify the partial parse disambiguation process." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-125", "text": "In recent years, the log-linear model shown in (3.) has been widely used in many parsing systems." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-126", "text": "proposed an inventory of features that perform well in HPSG parse selection." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-127", "text": "For the DELPH-IN grammars, the best performing features comprise the depth-one sub-trees (or portions of these) with grammar rule names as node labels, plus optionally a chain of one or more dominating nodes (i.e., levels of grandparents)." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-128", "text": "All these feature can be gathered from the derivation trees without consulting the TFSes." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-129", "text": "Therefore, the same discriminative model can be also applied to rank pseudoderivation trees." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-130", "text": "One potential risk of reusing the full parse disambiguation model is that the model P (t|w) is conditional." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-131", "text": "Depending on the difference on the possible analyses (T ) licensed by the grammar, the model is not guaranteed to be consistent when trained on a HPSG treebank and applied on CFGbased pseudo-derivation trees (a similar issue pointed out by Abney (1997))." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-132", "text": "A potential solution for this is discussed in Section 6.. However, we find that the full parse disambiguation model works very well in practice, for the CFG backbone extracted from the HPSG treebank closely mimics the behaviour of HPSG rules." 
}, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-133", "text": "In the experiment of this paper, a full parse disambiguation model trained on HPSG treebanks is directly used for partial parse ranking." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-134", "text": "----------------------------------" }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-135", "text": "**SOME NOTES ON IMPLEMENTATION**" }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-136", "text": "The two-stage robust parsing model is implemented as an extension to the PET parser working with the jul-07 version of the ERG." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-137", "text": "The modified parser starts parsing with HPSG rules and TFS unification as usual." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-138", "text": "The second parsing starts when there is no full analysis found during the first stage." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-139", "text": "At the beginning of the second parsing stage, a new parsing chart is initiated with all passive parsing edges copied from the chart in the first stage." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-140", "text": "CFG rules are used to combine the passive edges and create new ones using an agendadriven bottom-up algorithm." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-141", "text": "Extra checking must guarantee that new edges will not duplicate the existing passive edges (with same daughters and rule name) in the old chart." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-142", "text": "For efficiency considerations, the PET parser uses subsumption-based ambiguity packing to effectively represent the local ambiguities." 
}, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-143", "text": "During the second parsing stage, there is no TFS for CFG passive edges; we use equivalencebased packing (i.e., two edges are packed together if they have the same span and share the same rule name)." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-144", "text": "During unpacking, we use the selective unpacking algorithm proposed by Carroll and Oepen (2005) and Zhang et al. (2007b) to efficiently extract the most probable pseudoderivation trees." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-145", "text": "The unpacking algorithm is slightly modified so that it will not try to instantiate the TFS for CFG edges." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-146", "text": "The rest parts of the unpacking algorithm remain the same, and extraction of exact n-best readings is guaranteed." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-147", "text": "The CFG backbone grammar for ERG is extracted from the LOGON treebank (Oepen et al., 2004) ." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-148", "text": "We only extract syntactic rules that occur at least 5 times in the treebank." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-149", "text": "This gives us a CFG backbone grammar with about 2.5K unary and binary rules." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-150", "text": "For unary rules, we further filter out those that may lead to infinite recursion." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-151", "text": "We should point out that the decision of which CFG rules to extract is still an open question." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-152", "text": "Currently we only extract frequent rules, for they are more likely to be used in the ERG derivation trees." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-153", "text": "Moreover, by reducing the number of CFG rules, the second parsing stage becomes much more efficient." 
}, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-154", "text": "For parse disambiguation, we use the model trained on the LOGON treebank with depth-one tree features with up to 3 levels of grandparents, which has so far worked reasonably well in different application scenarios." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-155", "text": "----------------------------------" }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-156", "text": "**EVALUATION**" }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-157", "text": "As Zhang et al. (2007a) have also pointed out, the evaluation of a partial parser is a very difficult task as such, due to the lack of gold-standard annotation for sentences that are not fully analysed by the grammar." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-158", "text": "For the purpose of evaluation, Zhang et al. (2007a) compared the partial derivation tree to the Penn Treebank bracketing, and partial RMRS fragments to the RASP RMRS outputs." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-159", "text": "Although the results have shown that the proposed partial parsing model performs comparatively better than the baseline model, it is not convincing in relation i) to how informative it is to compare HPSG derivations with Penn Treebank bracketings; and ii) to whether RASP RMRS output should be considered for evaluation comparison in the first place at all." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-160", "text": "For these reasons, a manual evaluation has been carried out for the new proposed partial parsing model in this paper." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-161", "text": "For the experiment, we selected a subset of 267 sentences from the PARC 700 Dependency Bank (King et al., 2003) , which have full lexical span licensed by the ERG." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-162", "text": "Among these sentences, 213 are parsed out of the box." 
}, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-163", "text": "For the remaining 54 sentences, the two-stage partial parsing model built pseudo-derivation trees for 41 of them." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-164", "text": "The remaining sentences are either not well-formed (exhibiting among them, for instance, garbage strings, incomplete utterances, etc.), or the parser is missing appropriate lexical entries." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-165", "text": "Among those sentences for which pseudo-derivation trees could be constructed, 13 of them are completely correct, and another 18 have no more than 2 cross-bracketings." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-166", "text": "In about half of the cases where the pseudo-derivation tree is wrong, there is a key lexical entry missing in the grammar lexicon." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-167", "text": "This indicates that an automatic lexical acquisition model should be used in combination with the partial parsing model." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-168", "text": "Some errors in the pseudo-derivation trees indicate that the rule names symbols (as used in the derivation trees) are not informative enough for the CFG parser in the second stage in order for good predictions to be made." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-169", "text": "----------------------------------" }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-170", "text": "**DISCUSSION**" }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-171", "text": "Although the evaluation shows promising improvement on the grammar coverage, it is noticed that the type of the robust rules in use plays a significant role in our robust parsing model." 
}, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-172", "text": "As pointed out in Section 3., the choice of robust rules is not limited to context-free grammars directly extracted from derivation trees." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-173", "text": "The flexibility allows us to achieve different levels of robustness, while maintaining the desired accuracy." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-174", "text": "In extreme cases, the robust rule may allow any sub-structures to be combined." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-175", "text": "But then it merely has any prediction power, and is practically equivalent to the shortest-path model." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-176", "text": "A context-free backbone grammar seems to be a reasonable choice, for it can be easily acquired from parser outputs, and can be used for efficient parsing." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-177", "text": "With rule symbols as CFG non-terminals, it appears to be too abstracted in some cases, and may lead to overgeneration." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-178", "text": "One solution to this would be to modify the CFG rules symbols with phrase categories (i.e., NP, VP, AP, PP, etc)." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-179", "text": "In Section 3. we have also mentioned that the parse disambiguation model trained on HPSG treebanks is not guaranteed to be consistent when used for pseudo-derivation tree disambiguation." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-180", "text": "The main reason is that some of the pseudo-derivation trees produced by the CFG are not licensed by the HPSG rules." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-181", "text": "It can be expected that with a set of relative strict robust rules the discrepancy would be relatively small." 
}, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-182", "text": "For rule sets which are much more relaxed than the HPSG rules, one could update the disambiguation model by extending the training HPSG treebank with the extra trees licensed by the robust rules." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-183", "text": "Another interesting topic that we have not discussed so far is that the two-stage parsing model opens the possibility of achieving robust semantic composition." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-184", "text": "In HPSG, the semantic compositions are carried out simultaneously with the syntactic analyses." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-185", "text": "However, most of the composition can be done without the lexicalised syntactic information." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-186", "text": "By encoding the general semantic composition rules into the robust parsing rules, the fragment semantic representations can be connected." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-187", "text": "Although this paper focuses on the robustness issue in relation to constructions, the fact that HPSG is a highly lexicalised framework entails that the lack of robustness in the lexicon may also lead to parsing failures (cf., Figure 2 )." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-188", "text": "If we think of the two-stage parsing model as a top-down approach to predict the upper part of a parse tree, then the automatic lexical acquisition model will serve as a bottomup predictor that fills in the knowledge gaps about words." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-189", "text": "Exploring the interconnection between the two prediction models would be another interesting topic for our future work." 
}, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-190", "text": "----------------------------------" }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-191", "text": "**CONCLUSION**" }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-192", "text": "In this paper, we have proposed a two-stage model for robust parsing with a large HPSG grammar." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-193", "text": "The model uses a less restrictive grammar derived from the HPSG parser outputs to continue parsing based on the fragment analyses produced by the HPSG rules." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-194", "text": "With the pseudo-derivation trees constructed by the partial parsing model, the full parse disambiguation model is applied in partial parse selection." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-195", "text": "The approach also opens the possibility of achieving robust semantic composition which remains to be explored in the future work." }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-196", "text": "----------------------------------" }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-197", "text": "**REFERENCES**" }, { "sent_id": "d06a49ad232f73328874282d91cde0-C001-198", "text": "Steven Abney." 
} ], "y": { "@EXT@": { "gold_contexts": [ [ "d06a49ad232f73328874282d91cde0-C001-6" ], [ "d06a49ad232f73328874282d91cde0-C001-32" ] ], "cite_sentences": [ "d06a49ad232f73328874282d91cde0-C001-6", "d06a49ad232f73328874282d91cde0-C001-32" ] }, "@BACK@": { "gold_contexts": [ [ "d06a49ad232f73328874282d91cde0-C001-24" ], [ "d06a49ad232f73328874282d91cde0-C001-30" ], [ "d06a49ad232f73328874282d91cde0-C001-36" ], [ "d06a49ad232f73328874282d91cde0-C001-82" ], [ "d06a49ad232f73328874282d91cde0-C001-86" ], [ "d06a49ad232f73328874282d91cde0-C001-87" ], [ "d06a49ad232f73328874282d91cde0-C001-96" ], [ "d06a49ad232f73328874282d91cde0-C001-157" ], [ "d06a49ad232f73328874282d91cde0-C001-158" ] ], "cite_sentences": [ "d06a49ad232f73328874282d91cde0-C001-24", "d06a49ad232f73328874282d91cde0-C001-30", "d06a49ad232f73328874282d91cde0-C001-36", "d06a49ad232f73328874282d91cde0-C001-82", "d06a49ad232f73328874282d91cde0-C001-86", "d06a49ad232f73328874282d91cde0-C001-87", "d06a49ad232f73328874282d91cde0-C001-96", "d06a49ad232f73328874282d91cde0-C001-157", "d06a49ad232f73328874282d91cde0-C001-158" ] } } }, "ABC_9731b4cea1405b7cbf3792aed5b1e4_7": { "x": [ { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-2", "text": "We consider the problem of recognizing mentions of human senses in text." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-3", "text": "Our contribution is a method for acquiring labeled data, and a learning method that is trained on this data." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-4", "text": "Experiments show the effectiveness of our proposed data labeling approach and our learning model on the task of sense recognition in text." 
}, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-5", "text": "----------------------------------" }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-7", "text": "Information extraction methods produce structured data in the form of knowledge bases of factual assertions." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-8", "text": "Such knowledge bases are useful for supporting inference, question answering, and reasoning (Bollacker et al., 2008; Hoffart et al., 2012)." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-9", "text": "However, progress on the common sense front, as opposed to named entities such as locations and people, is still limited (Havasi et al., 2007; Tandon et al., 2011) ." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-10", "text": "In this paper, we study entity recognition of common sense concepts." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-11", "text": "Our goal is to detect mentions of concepts that are discernible by sense." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-12", "text": "For example, recognize that \"chirping birds\" is a mention of an audible concept (sound), and \"burning rubber\" is a mention of an olfactible concept (smell)." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-13", "text": "We aim to detect mentions of concepts without performing co-reference resolution or clustering mentions." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-14", "text": "Therefore, our setting resembles the established task of entity recognition (Finkel et al., 2005; Ratinov and Roth, 2009) , with the difference being that we focus on un-named entities." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-15", "text": "Contribution." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-16", "text": "One of the factors impeding progress in common sense information extraction is the lack of training data." 
}, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-17", "text": "It is relatively easy to obtain labeled data for named entities such as companies and people." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-18", "text": "Examples of such named entities can be found in structured forms on the Web, such as HTML lists and tables, and Wikipedia infoboxes (Wu and Weld, 2008; Wang and Cohen, 2008) ." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-19", "text": "This is not the case for common sense concepts." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-20", "text": "We therefore propose a data labeling method that leverages crowdsourcing and large corpora." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-21", "text": "This approach provides the flexibility to control the size and accuracy of the available labeled data for model training." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-22", "text": "Additionally, we propose and train several sequence models, including variations of recurrent neural networks, that learn to recognize mentions of sound and smell concepts in text." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-23", "text": "In our experiments, we show that the combination of our mixture labeling approach and a suitable learning model is an effective solution to sense recognition in text." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-24", "text": "----------------------------------" }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-25", "text": "**PROBLEM DEFINITION**" }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-26", "text": "We would like to detect mentions of concepts discernible by sense." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-27", "text": "In this paper, we focus on mentions of audible (sound) and olfactible (smell) concepts." 
}, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-28", "text": "We treat sense recognition in text as a sequence labeling task where each sentence is a sequence of tokens labeled using the BIO tagging scheme (Ratinov and Roth, 2009 )." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-29", "text": "The BIO labels denote tokens at the beginning, inside, and outside of a relevant mention, respectively." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-30", "text": "Example BIO tagged sentences are shown in Figure 1 ." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-31", "text": "----------------------------------" }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-32", "text": "**DATA LABELING METHODOLOGIES**" }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-33", "text": "There is a lack of easy to identify labeled data on the Web for common sense information extraction, an issue which affects named-entity centric information extraction to a lesser degree (Wang and Cohen, 2008; Wu and Weld, 2008) ." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-34", "text": "We consider three data labeling approaches: i) Automatically generate training data using judiciously specified patterns." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-35", "text": "ii) Solicit input on crowd-sourcing platforms." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-36", "text": "iii) Leverage both i) and ii) in order to overcome their respective limitations." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-37", "text": "----------------------------------" }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-38", "text": "**PATTERN-BASED CORPUS LABELING**" }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-39", "text": "To label data with patterns, we begin by specifying patterns that we apply to a large corpus." 
}, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-40", "text": "For our concepts of interest, sound and smell, we specify the following two patterns." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-41", "text": "\"sound of\" and \"smell of\". We then apply these patterns to a large corpus." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-42", "text": "In our experiments, we used the English part of ClueWeb09." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-43", "text": "The result is a large collection of occurrences such as \"sound of breaking glass\", \"smell of perfume\", etc." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-44", "text": "The collection contains 134,473 sound phrases and 18,183 smell phrases." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-45", "text": "Figure 2 shows a 2D projection of the 300-dimensional word vectors of the discovered audible and olfactible phrases." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-46", "text": "We see a strong hint of two clusters." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-47", "text": "We later provide a quantitative analysis of this data." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-48", "text": "----------------------------------" }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-49", "text": "**CROWD-SOURCED SUPERVISION**" }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-50", "text": "The second way of obtaining labeled data that we consider is crowd-sourcing." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-51", "text": "We used the Amazon Mechanical Turk crowd-sourcing platform." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-52", "text": "Crowd Task Definition." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-53", "text": "To obtain labeled examples, we could do a \"cold call\" and ask crowd workers to list examples of phrases that refer to senses." 
}, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-54", "text": "However, such an approach requires crowd workers to think of examples without clues or memory triggers." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-55", "text": "This is time-consuming and error-prone." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-56", "text": "Additionally, the monetary cost we would have to pay to the crowd-sourcing platform could be substantial." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-57", "text": "We propose to exploit a large corpus to obtain preliminary labeled data." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-58", "text": "This way, we only need crowd workers to filter the data through a series of \"yes/no/not sure\" questions." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-59", "text": "These types of questions require little effort from crowd workers while mitigating the amount of noisy input that one could get from open-ended, cold-call questions." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-60", "text": "We randomly selected 1000 phrases labeled as sound/smell phrases by the pattern approach described in Section 3.1, 500 for each sense type." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-61", "text": "Each phrase was given to 3 different workers to annotate \"yes/no/not sure\"." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-62", "text": "We consider a phrase to be a true mention of the labeled sense if the majority of the participants chose \"yes\"." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-63", "text": "This annotation task serves two purposes: i) to provide us with human-labeled examples of sound and smell concepts; ii) to provide a quantitative evaluation of pattern-generated labels." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-64", "text": "Crowd Annotation Results." 
}, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-65", "text": "Table 1 is a summary of the annotation results." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-66", "text": "First, we can see that the accuracy of the patterns is quite high, as already hinted by Figure 2 ." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-67", "text": "Second, the inter-annotator agreement rates are moderate, and lower for olfactible phrases." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-68", "text": "This is also reflected by the fact that there were around 3 times as many \"not sure\" responses in the smell annotations as there were in the sound annotation task (27 vs 10)." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-69", "text": "Nonetheless, the output of these tasks provides us with another source of labeled data that we can use to train our models." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-70", "text": "----------------------------------" }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-71", "text": "**JOINT PATTERN & CROWD-SOURCED LABELING**" }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-72", "text": "A third way of obtaining labeled data is to leverage both pattern-based and crowd-sourced labeling approaches." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-73", "text": "One central question pertains to how we can combine the two sources in a way that exploits the advantages of each approach while mitigating their limitations." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-74", "text": "We seek to start with the crowd-sourced labeled data, which is small but more accurate, and expand it with the pattern-generated labeled data, which is large but less accurate." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-75", "text": "We show the results in Figure 3 ." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-76", "text": "When \u03b1 = 1, that is, when we use only the crowd-sourced labeled data, performance is at its worst." 
}, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-77", "text": "This is because even though the human labeled data is more accurate, it is much smaller, leading to potential model overfitting problems." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-78", "text": "A more subtle finding is that with low \u03b1 values (i.e., <0.4 for audible concepts), we have the highest recall but not the best precision. This can be explained by the fact that low \u03b1 values allow more of the automatically labeled data into the training data, thereby potentially adding noise to the model." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-79", "text": "However, the advantage of the mixture approach is that there comes a point where precision goes up and recall slightly degrades, but we obtain the best F1 score." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-80", "text": "In Figure 3, we see these points at \u03b1 = 0.6 and \u03b1 = 0.4 for the audible and olfactible concepts respectively." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-81", "text": "We use these values to generate the labeled data used to train models described in the rest of the paper." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-82", "text": "----------------------------------" }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-83", "text": "**LEARNING MODELS**" }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-84", "text": "We treat sense recognition in text as a sequence prediction problem: we would like to estimate P(y_i | x_{i\u2212k}, ..., x_{i+k}; y_{i\u2212l}, ..., y_{i\u22121}). (Figure 4: Our neural network architecture for the task of recognizing concepts that are discernible by senses.) Here x refers" }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-85", "text": "to words, and y refers to BIO labels." 
}, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-86", "text": "Conditional Random Fields (CRFs) (Lafferty et al., 2001 ) have been widely used for named entity recognition (Ratinov and Roth, 2009; Finkel et al., 2005) , a task similar to our own." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-87", "text": "While the CRF models performed reasonably well on our task, we sought to obtain improvements by proposing and training variations of Long Short-Term Memory (LSTM) recurrent neural networks (Hochreiter and Schmidhuber, 1997)." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-88", "text": "We found our variations of LSTM sequence classifiers to do better than the CRF model, and also better than standard LSTMs." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-89", "text": "Word and Character Features." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-90", "text": "As input, the LSTM neural network model takes a sentence and, as output, produces a probability distribution over the BIO tags for each word in the sentence." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-91", "text": "To BIO tag each word in the sentence, we use word features." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-92", "text": "We chose the word features to be their word embeddings." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-93", "text": "As additional features, we model the character composition of words in order to capture morphology." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-94", "text": "Neural encodings of character-level features have been shown to yield performance gains in natural language tasks (Ling et al., 2015; Chiu and Nichols, 2016) ." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-95", "text": "In all our experiments, we initialize the word embeddings with the Google news pre-trained word embeddings." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-96", "text": "The character embeddings are learned from scratch." 
}, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-97", "text": "Prediction and Output Layer Recurrence." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-98", "text": "We represent each word within a short context window of length m. We use the LSTM to encode these window contexts in order to make a prediction for each word." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-99", "text": "(Table 2: Examples of sound and smell concepts recognized by our method. Sound: honking cars, snoring, gunshots, live music. Smell: burning rubber, chlorine, citrus blossoms, fresh paint.) The LSTM window" }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-100", "text": "encoding is then used to make predictions over the BIO labels." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-101", "text": "The output for each word is decoded by a linear layer and a softmax layer into probabilities over the BIO tag labels." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-102", "text": "Crucially, we modify the standard LSTM to model temporal dependencies by introducing a recurrence in the output layer." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-103", "text": "Therefore, the prediction d_t at time step t takes into account the prediction d_{t\u22121} at the previous time step t\u22121." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-104", "text": "Formally, we have:" }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-105", "text": "We illustrate the model in Figure 4 ." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-106", "text": "We found this model to consistently perform well on the senses of sound and smell." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-107", "text": "Model Evaluation." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-108", "text": "To evaluate the models, we set aside 200 of the 1000 crowd-annotated phrases as test data, meaning we have 100 test instances for each sense type (sound/smell)." 
}, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-109", "text": "The rest of the data, 400 phrases per sense type, was used for generating training data using the combined crowd and pattern approach described in Section 3.3." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-110", "text": "We set \u03b1 = 0.6 and \u03b1 = 0.4, based on Figure 3, for audible and olfactible concepts respectively." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-111", "text": "With these \u03b1 values, the combination approach produced 1,962 and 1,702 training instances for audible and olfactible concepts respectively. Performance of the various models is shown in Table 3 ." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-112", "text": "The abbreviations denote the following: LSTM refers to a vanilla LSTM model, using only word embeddings as features; + OR refers to the LSTM plus the output recurrence; + CHAR refers to the LSTM plus the character embeddings as features." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-113", "text": "+ OR + CHAR refers to the LSTM plus the output recurrence and character embeddings as features." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-114", "text": "For the CRF, we use the features commonly used for named entity recognition: words, prefixes/suffixes, and part-of-speech tags (Ratinov and Roth, 2009 )." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-115", "text": "We can see that for both senses, the model that uses both character embedding features and an output recurrence layer yields the best F1 score." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-116", "text": "Examples of sounds and smells our method can recognize are shown in Table 2 ." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-117", "text": "Table 3 : Performance of the various models on the task of sense recognition." 
}, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-118", "text": "----------------------------------" }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-119", "text": "**RELATED WORK**" }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-120", "text": "Our task is related to entity recognition; however, in this paper we focus on novel types of entities, which can be used to improve extraction of common sense knowledge." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-121", "text": "Entity recognition systems are traditionally based on a sequential model, for example a CRF, and involve feature engineering (Lafferty et al., 2001; Ratinov and Roth, 2009 )." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-122", "text": "More recently, neural approaches have been used for named entity recognition (Hammerton, 2003; Collobert et al., 2011; dos Santos and Guimar\u00e3es, 2015; Chiu and Nichols, 2016; Shimaoka et al., 2016) ." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-123", "text": "Like other neural approaches, our approach does not require feature engineering; the only features we use are word and character embeddings." }, { "sent_id": "9731b4cea1405b7cbf3792aed5b1e4-C001-124", "text": "Related to our proposed recurrence in the output layer is the work of Lample et al. (2016), which introduced a CRF on top of an LSTM for the task of named entity recognition." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "9731b4cea1405b7cbf3792aed5b1e4-C001-14" ], [ "9731b4cea1405b7cbf3792aed5b1e4-C001-86" ], [ "9731b4cea1405b7cbf3792aed5b1e4-C001-121" ] ], "cite_sentences": [ "9731b4cea1405b7cbf3792aed5b1e4-C001-14", "9731b4cea1405b7cbf3792aed5b1e4-C001-86", "9731b4cea1405b7cbf3792aed5b1e4-C001-121" ] }, "@DIF@": { "gold_contexts": [ [ "9731b4cea1405b7cbf3792aed5b1e4-C001-14" ] ], "cite_sentences": [ "9731b4cea1405b7cbf3792aed5b1e4-C001-14" ] }, "@USE@": { "gold_contexts": [ [ "9731b4cea1405b7cbf3792aed5b1e4-C001-28" ], [ "9731b4cea1405b7cbf3792aed5b1e4-C001-114" ] ], "cite_sentences": [ "9731b4cea1405b7cbf3792aed5b1e4-C001-28", "9731b4cea1405b7cbf3792aed5b1e4-C001-114" ] } } }, "ABC_a2b945e18ab6b73b4021a2db8bda4f_7": { "x": [ { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-2", "text": "Scoring the quality of persuasive essays is an important goal of discourse analysis, addressed most recently with high-level persuasion-related features such as thesis clarity, or opinions and their targets." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-3", "text": "We investigate whether argumentation features derived from a coarse-grained argumentative structure of essays can help predict essays scores." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-4", "text": "We introduce a set of argumentation features related to argument components (e.g., the number of claims and premises), argument relations (e.g., the number of supported claims) and typology of argumentative structure (chains, trees)." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-5", "text": "We show that these features are good predictors of human scores for TOEFL essays, both when the coarse-grained argumentative structure is manually annotated and automatically predicted." 
}, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-6", "text": "----------------------------------" }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-8", "text": "Persuasive essays are frequently used to assess students' understanding of subject matter and to evaluate their argumentation skills and language proficiency." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-9", "text": "For instance, the prompt for a TOEFL (Test of English as a Foreign Language) persuasive writing task is: Do you agree or disagree with the following statement?" }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-10", "text": "It is better to have broad knowledge of many academic subjects than to specialize in one specific subject." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-11", "text": "Use specific reasons and examples to support your answer." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-12", "text": "Automatic essay scoring systems generally use features based on grammar usage, spelling, style, and content (e.g., topics, discourse) (Attali and Burstein, 2006; Burstein, 2003) ." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-33", "text": "To measure the inter-annotator agreement we calculated P/R/F1 measures, which are used to account for fuzzy boundaries (Wiebe et al., 2005) ." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-13", "text": "However, recent work has begun to explore the impact of high-level persuasion-related features, such as opinions and their targets, thesis clarity and argumentation schemes (Farra et al., 2015; Song et al., 2014; Ong et al., 2014; Persing and Ng, 2015) ." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-14", "text": "In this paper, we investigate whether argumentation features derived from a coarse-grained, general argumentative structure of essays are good predictors of holistic essay scores." 
}, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-15", "text": "We use the argumentative structure proposed by Stab and Gurevych (2014a) : argument components (major claims, claims, premises) and argument relations (support, attack)." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-16", "text": "Figure 1 (i) shows an extract from an essay written in response to the above prompt, labeled with a claim and two premises." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-17", "text": "The advantage of having a simple annotation scheme is two-fold: it allows for more reliable human annotations and it enables better performance for argumentation mining systems designed to automatically identify the argumentative structure (Stab and Gurevych, 2014b) ." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-18", "text": "The paper has two main contributions." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-19", "text": "First, we introduce a set of argumentation features related to three main dimensions of argumentative structure: 1) features related to argument components such as the number of claims in an essay, number of premises, fraction of sentences containing argument components; 2) features related to argument relations such as the number and percentage of supported and unsupported claims; and 3) features related to the typology of argumentative structure such as number of chains (see Figure 1(ii) for an example of a chain) and trees (Section 3)." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-20", "text": "On a dataset of 107 TOEFL essays manually annotated with the argumentative structure proposed by Stab and Gurevych (2014a) (Section 2), we show that using all the argumentation features predicts essay scores that are highly correlated with human scores (Section 3)." 
}, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-21", "text": "(Figure 1 : (i) Essay extract showing a claim and two premises and (ii) the corresponding argumentative structure, i.e., chain.) We discuss what features are correlated with high scoring essays" }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-22", "text": "vs. low scoring essays." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-23", "text": "Second, we show that the argumentation features extracted based on argumentative structures automatically predicted by a state-of-the-art argumentation mining system (Stab and Gurevych, 2014b) are also good predictors of essay scores (Section 4)." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-24", "text": "1" }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-25", "text": "----------------------------------" }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-26", "text": "**DATA AND ANNOTATION**" }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-27", "text": "We use a set of 107 essays from the TOEFL11 corpus that was proposed for the first shared task of Native Language Identification (Blanchard et al., 2013) ." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-28", "text": "The essays are sampled from 2 prompts: P1 (shown in the Introduction) and P3. Each essay is associated with a score: high, medium, or low." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-29", "text": "From prompt P1, we selected 25 high, 21 medium, and 16 low essays, while for prompt P3 we selected 15 essays for each of the three scores." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-30", "text": "For annotation, we used the coarse-grained argumentative structure proposed by Stab and Gurevych (2014a) : argument components (major claim, claim, premises) and argument relations (support/attack)." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-31", "text": "The unit of annotation is a clause." 
}, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-32", "text": "Our annotated dataset, T OEF L arg , includes 107 major claims, 468 claims, 603 premises, and 641 sentences that do not contain any argument component." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-34", "text": "The F1 measure for overlap matches (between two annotators) for argument components is 73.98% and for argument relations is 67.56%." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-35", "text": "----------------------------------" }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-36", "text": "**ARGUMENTATION FEATURES FOR PREDICTING ESSAYS SCORES**" }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-37", "text": "A major contribution of this paper is a thorough analysis of the key features derived from a coarse-grained argumentative structure that are correlated with essay scores." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-38", "text": "Based on our annotations, we propose three groups of features (Table 1) ." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-39", "text": "The first group consists of features related to argument components (AC) such as the number of claims, number of premises, fraction of sentences containing argument components." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-40", "text": "One hypothesis is that an essay with a higher percentage of argumentative sentences will have a higher score." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-41", "text": "The second group consists of features related to argument relations (AR), such as the number and percentage of supported claims (i.e., claims that are supported by at least one premise) and the number and percentage of dangling claims (i.e., claims with no supporting premises)." 
}, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-42", "text": "In low scoring essays, test takers often fail to justify their claims with proper premises and this phenomenon is captured by the dangling claims feature." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-43", "text": "By contrast, in high scoring essays, it is common to find many claims that are justified by premises." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-44", "text": "We also consider the number of attack relations and attacks against the major claim." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-45", "text": "Finally, the third group consists of features related to the typology of argument structures (TS) such as the number of argument chains (Chain), number of argument trees of height = 1 (Tree h=1 ) and the number of argument trees of height > 1 (Tree h>1 )." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-46", "text": "We define an argument chain when a claim is supported by a chain of premises." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-47", "text": "We define Tree h=1 as a tree structure of height 1 with more than one leaf, where the root is a claim and the leaves are premises. Figure 2 shows examples of a Tree h>1 structure, a Chain structure, and a Tree h=1 structure." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-48", "text": "The dark nodes represent claims (C), lighter nodes can be either claims or premises (C/P) and white nodes are premises (P)." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-49", "text": "Figure 1 shows an extract from an essay and the corresponding Chain structure." 
}, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-50", "text": "To measure the effectiveness of the above features in predicting the holistic essay scores (high/medium/low) we use Logistic Regression (LR) learners and evaluate the learners using quadratic-weighted kappa (QWK) against the human scores, a methodology generally used for essay scoring (Farra et al., 2015) ." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-51", "text": "QWK corrects for chance agreement between the system prediction and the human prediction, and it takes into account the extent of the disagreement between labels." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-52", "text": "Table 2 reports the performance for the three feature groups as well as their combination." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-53", "text": "Our baseline feature (bl) is the number of sentences in the essay, since essay length has been shown to be generally highly correlated with essay scores (Chodorow and Burstein, 2004) ." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-54", "text": "We found that all three feature groups individually are strongly correlated with the human scores, much better than the baseline feature, and the AC features have the highest correlation." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-55", "text": "We also see that although the number of claims and premises can affect the score of an essay, the argumentative structures (i.e., how the claims and premises are connected in an essay) are also important." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-56", "text": "Combining all features gives the highest QWK score (0.803)." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-57", "text": "We also looked at what features are associated with high scoring essays vs. low scoring essays." 
}, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-58", "text": "Based on the regression coefficients, we observe that the high \"number and % of dangling claims\" are strong features for low scoring essays, whereas the \"fraction of sentences containing argument components\" (AC feature), \"number of supported claims\" (AR feature), and \"number of T ree h=1 structures\" and \"number of T ree h>1 structures\" (TS features) have the highest correlation with high scoring essays." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-59", "text": "For example, in a good persuasive essay, test takers are inclined to use multiple premises (e.g., reasons or examples) to support a claim, which is captured by the TS and AR features." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-60", "text": "In addition, we notice that attack relations are sparse, as was the case in Stab and Gurevych (2014b) dataset and thus the coefficients for attack relations features (#10, #11 in Table 1 ) are negligible." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-61", "text": "In summary, our findings contribute to research on essay scoring, showing that argumentation features are good predictors of essay scores, besides spelling, grammar, and stylistic properties of text." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-62", "text": "----------------------------------" }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-63", "text": "**AUTOMATIC EXTRACTION OF ARGUMENTATION FEATURES FOR PREDICTING ESSAY SCORES**" }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-64", "text": "To automatically generate the argumentation features (Table 1) , we first need to identify the argumentative structures: argument components (major claim, claim, and premise) and relations (support/attack)." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-65", "text": "We use the approach proposed by Stab and Gurevych (2014b) ." 
}, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-66", "text": "2 For argument component identification, we categorize clauses to one of the four classes (major claim (M C), claim (C), premise (P ), and N one)." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-67", "text": "For argument relation identification, given a pair of argument clauses Arg 1 and Arg 2 the classifier decides whether the pair holds a support (S) or non-support (N S) relation (binary classification)." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-68", "text": "For each essay, we extract all possible combinations of Arg 1 and Arg 2 from each paragraph as training data (654 S and 2503 N S instances; attack relations are few and included in N S)." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-69", "text": "We do not consider relations that may span over multiple paragraphs to reduce number of non-support instances." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-70", "text": "For both tasks we use Lexical features (e.g., unigrams, bigrams, trigrams, modal verbs, adverbs, word-pairs for relation identification), Structural features (e.g., number of tokens/punctuations in argument, as well as in the sentence containing the argument, argument position in essay, paragraph position (paragraph that contains the argument)), Syntactic features (e.g., production rules from parse trees, number of clauses in the argument), and Indicators (discourse markers selected from the three top-level Penn Discourse Tree Bank (PDTB) relation senses: Comparison, Contingency, and Expansion (Prasad et al., 2008) )." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-71", "text": "We use two settings for the classification experiments using libSVM (Chang and Lin, 2011) for both argument component and relation identification." 
}, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-72", "text": "In the first setting, we used the dataset of 90 high quality persuasive essays from (Stab and Gurevych, 2014b ) (S&G) as training and use T OEF L arg for testing (out-of-domain setting)." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-73", "text": "In the second setting (in-domain), we randomly split the T OEF L arg into 80% training and 20% for testing (sampled equally from each category (M C, C, P , and N one for argument components; S and N S for relations))." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-74", "text": "Table 3 and 4 present the classification results for identifying ar-2 In future work, we plan to use the authors' improved approach and larger dataset released after the acceptance of this paper (Stab and Gurevych, 2016 Table 4 : F1 for argument components (in-domain setting) gument components in the first and second setting, respectively." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-75", "text": "We ran experiments for all different features groups and observe that with the exception of the P class, the F1 scores for all the other classes is comparable to the results reported by Stab and Gurevych (2014b) ." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-76", "text": "One explanation of having lower performance on the P (premise) category is that the S&G dataset used for training has higher quality essays, while 2/3 of our T OEF L arg dataset consists of medium and low scoring essays (the writing style for providing reasons or example can differ between high and low scoring essays)." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-77", "text": "When we select the top 100 features (\"top100\") using Information Gain (Hall et al., 2009 ) the F1 scores for the P class improves." 
}, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-78", "text": "The results in Table 4 show that when training and testing on same type of essays the results are better for all categories except for M C when using the \"top100\" setup." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-79", "text": "Table 5 shows the results for relation identification in the first setting (out-of-domain)." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-80", "text": "The F1 score of identifying support relations is 84.3% (or 89% using top100), much higher than reported by Stab and Gurevych (2014b) ." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-81", "text": "We obtain similar results when training and testing on T OEF L arg ." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-82", "text": "We observe that two specific feature groups, Structural and Lexical, individually achieve high F1 scores and when combined with other features, they assist the classifier in reaching F1 scores in high 80s%." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-83", "text": "There can be two explanations for this: 1) essays in T OEF L arg have multiple short paragraphs where the position features such as position of the arguments in the essay and paragraph (Structural group) are strong indicators for argument relations; and 2) due to short paragraphs, the percentage of N S instances are less than in the S&G dataset, hence the Lexical features (i.e., word-pairs between Arg 1 and Arg 2 ) perform very well." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-84", "text": "Table 6 : Correlation of LR (10 fold CV) with predicted results." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-85", "text": "Based on the automatic identification of the argument components and relations, we generate the argumentation features to see whether they still predict essays scores that are highly correlated with human scores." 
}, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-86", "text": "Since our goal is to compare with the manual annotation setup, we use the first setting, where we train on the S&G dataset and test on our T OEF L arg dataset." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-87", "text": "We select the best system setup (top100 for both tasks; Table 3 and 5)." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-88", "text": "We ran Logistic Regression learners and evaluated their performance using QWK scores." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-89", "text": "Table 6 shows that the argumentative features related to argument relations (AR) and the typology of argument structures (TS) extracted based on the automatically predicated argumentative structure perform worse compared to the scores based on manual annotations (Table 2) ." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-90", "text": "Our error analysis shows that this is due to the wrong prediction of argument components, specifically wrongly labeling claims as premises (Table 3) ." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-91", "text": "AR and TS features rely on correctly identifying the claims, and thus a wrong prediction affects the features in these two groups, even if the accuracy of supports relations is high." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-92", "text": "This also explains why the argument components (AC) features still have a high correlation with human scores (0.669)." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-93", "text": "When we extracted the argumentation features using goldstandard argument components and predicted argument relations, the correlation of AR and TS features improved to 0.576 and 0.504, respectively and the correlation of all features reached 0.769." 
}, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-94", "text": "----------------------------------" }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-95", "text": "**RELATED WORK**" }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-96", "text": "Researchers have begun to study the impact of features specific to persuasive construct on student essay scores (Farra et al., 2015; Song et al., 2014; Ong et al., 2014; Persing and Ng, 2013; Persing and Ng, 2015) ." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-97", "text": "Farra et al. (2015) investigate the impact of opinion and target features on TOEFL essays scores." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-98", "text": "Our work looks a step further by exploring argumentation features." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-99", "text": "Song et al. (2014) show that adding features related to argumentation schemes (from manual annotation) as part of an automatic scoring system increases the correlation with human scores." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-100", "text": "We show that argumentation features are good predictors of human scores for TOEFL essays, both when the coarsegrained argumentative structure is manually annotated and automatically predicted." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-101", "text": "Persing and Ng (2015) proposed a feature-rich approach for modeling argument strength in student essays, where the features are related to argument components." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-102", "text": "Our work explores features related to argument components, relations and typology of argument structures, showing that argument relation features show best correlation with human scores (based on manual annotation)." 
}, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-103", "text": "----------------------------------" }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-104", "text": "**CONCLUSION**" }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-105", "text": "We show that argumentation features derived from a coarse-grained, argumentative structure of essays are helpful in predicting essays scores that have a high correlation with human scores." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-106", "text": "Our manual annotation study shows that features related to argument relations are particularly useful." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-107", "text": "Our experiments using current methods for the automatic identification of argumentative structure confirms that distinguishing between claim and premises is a particularly hard task." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-108", "text": "This led to lower performance in predicting the essays scores using automatically generate argumentation features, especially for features related to argument relations and typology of structure." }, { "sent_id": "a2b945e18ab6b73b4021a2db8bda4f-C001-109", "text": "As future work we plan to improve the automatic methods for identifying argument components similar to Stab and Gurevych (2016) , and to use the dataset introduced by Persing and Ng (2015) to investigate how our argumentation features impact the argument strength score rather than the holistic essay score." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "a2b945e18ab6b73b4021a2db8bda4f-C001-17" ] ], "cite_sentences": [ "a2b945e18ab6b73b4021a2db8bda4f-C001-17" ] }, "@USE@": { "gold_contexts": [ [ "a2b945e18ab6b73b4021a2db8bda4f-C001-23" ], [ "a2b945e18ab6b73b4021a2db8bda4f-C001-65" ], [ "a2b945e18ab6b73b4021a2db8bda4f-C001-72" ], [ "a2b945e18ab6b73b4021a2db8bda4f-C001-86" ] ], "cite_sentences": [ "a2b945e18ab6b73b4021a2db8bda4f-C001-23", "a2b945e18ab6b73b4021a2db8bda4f-C001-65", "a2b945e18ab6b73b4021a2db8bda4f-C001-72", "a2b945e18ab6b73b4021a2db8bda4f-C001-86" ] }, "@SIM@": { "gold_contexts": [ [ "a2b945e18ab6b73b4021a2db8bda4f-C001-60" ], [ "a2b945e18ab6b73b4021a2db8bda4f-C001-75" ] ], "cite_sentences": [ "a2b945e18ab6b73b4021a2db8bda4f-C001-60", "a2b945e18ab6b73b4021a2db8bda4f-C001-75" ] }, "@DIF@": { "gold_contexts": [ [ "a2b945e18ab6b73b4021a2db8bda4f-C001-75", "a2b945e18ab6b73b4021a2db8bda4f-C001-76" ], [ "a2b945e18ab6b73b4021a2db8bda4f-C001-80" ], [ "a2b945e18ab6b73b4021a2db8bda4f-C001-82", "a2b945e18ab6b73b4021a2db8bda4f-C001-83" ] ], "cite_sentences": [ "a2b945e18ab6b73b4021a2db8bda4f-C001-75", "a2b945e18ab6b73b4021a2db8bda4f-C001-76", "a2b945e18ab6b73b4021a2db8bda4f-C001-80", "a2b945e18ab6b73b4021a2db8bda4f-C001-83" ] } } }, "ABC_78afdf391c70d7992200b4071e4ac2_7": { "x": [ { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-2", "text": "We propose a way to automatically improve the annotation of verbal complex predicates in PropBank which until now has been treating language mostly in a compositional manner." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-3", "text": "In order to minimize the manual re-annotation effort, we build on the recently introduced concept of aliasing complex predicates to existing PropBank rolesets which encompass the same meaning and argument structure." 
}, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-4", "text": "We suggest to find aliases automatically by applying a multilingual distributional model that uses the translations of simple and complex predicates as features." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-5", "text": "Furthermore, we set up an annotation effort to obtain a frequency balanced, realistic test set for this task." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-6", "text": "Our method reaches an accuracy of 44% on this test set and 72% for the more frequent test items in a lenient evaluation, which is not far from the upper bounds from human annotation." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-7", "text": "----------------------------------" }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-9", "text": "Semantic Role Labeling (SRL) aims at determining 'who' did 'what' to 'whom' in sentences by identifying and associating predicates with their semantic arguments." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-10", "text": "This information is useful for many downstream applications, for example for question answering (Shen, 2007) ." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-11", "text": "The PropBank corpus (PB) (Palmer et al., 2005 ) is one of the most widely used resources for training SRL systems." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-12", "text": "It provides senses of (mostly verbal) predicates with their typical semantic arguments annotated in a corpus and accompanied by a lexical resource." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-13", "text": "The sense of a predicate is referred to as a 'roleset' because it lists all required and possible semantic roles for the predicate used in a specific sense." 
}, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-14", "text": "The 12K rolesets in PB describe mostly single word predicates, to a great part leaving aside multiword expressions (MWEs)." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-15", "text": "Complex predicates (CPs), 'predicates which are multi-headed: they are composed of more than one grammatical element' (Ramisch, 2012) , are most relevant in the context of SRL." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-16", "text": "Light verb constructions (LVCs), e.g. take care, and verb particle constructions (VPCs), e.g. watch out, are the most frequently occurring types of CPs." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-17", "text": "As Bonial et al. (2014) stated 'PB has previously treated language as if it were purely compositional, and has therefore lumped the majority of MWEs in with lexical verb usages'." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-18", "text": "For example the predicates in the CPs take a hard line, take time and many others are all annotated with a sense of take, meaning acquire, come to have, chose, bring with you from somewhere." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-19", "text": "This results in a loss of semantic information in the PB annotations." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-20", "text": "This is especially critical because CPs are a frequent phenomenon." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-21", "text": "The Wiki50 corpus (Vincze et al., 2011) , which provides a full coverage MWE annotation, counts 814 occurrences of LVCs and VPCs in 4350 sentences." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-22", "text": "This makes for one CP in every fifth sentence." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-23", "text": "Recently, Bonial et al. (2014) have introduced an approach to improve the handling of MWEs in PB while keeping annotation costs low." 
}, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-24", "text": "The process is called aliasing." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-25", "text": "Instead of creating new frames for CPs, human annotators map them to existing PB rolesets which encompass the same semantic and argument structure." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-26", "text": "For example, the CP give (a) talk could be mapped to the alias lecture.01." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-27", "text": "While this method significantly reduces the effort to create new rolesets, the time consuming manual mapping is still required." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-28", "text": "To address this problem, our work extends this approach by proposing a method to find the aliases automatically." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-29", "text": "One way to find the most suitable alias roleset for a given CP is to group predicates by their rolesets assigned by an automatic SRL system and compute the most similar roleset group by searching for (near-) synonymous predicates of the CP." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-30", "text": "The roleset of the most similar roleset group is selected as alias for the CP." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-31", "text": "Finding synonyms, both single-word and multiword, from corpora has been done successfully with the multilingual variant of the distributional hypothesis (Van der Plas and Tiedemann, 2006; Van der Plas et al., 2011) ." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-32", "text": "The idea behind this approach is that words or MWEs that share many translations are probably synonymous." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-33", "text": "We use the word alignments in a parallel corpus to find the translations of CPs and single predicates." 
}, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-34", "text": "The predicates are automatically annotated with rolesets by an SRL system." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-35", "text": "This allows us to compute the most suitable roleset for a given CP fully automatically." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-36", "text": "Our contributions are as follows: To the best of our knowledge, this work is the first to address the handling of CPs for SRL in an automatic way." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-37", "text": "We are thus able to scale up previous work that relies on manual intervention." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-38", "text": "In addition, we set up an annotation effort to gather a frequency-balanced, data-driven evaluation set that is larger and more diverse than the annotated set provided by Bonial et al. (2014) ." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-39", "text": "----------------------------------" }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-40", "text": "**REPRESENTING CPS FOR SRL**" }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-41", "text": "Previous work on representing CPs for SRL has mostly focused on PB." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-42", "text": "The currently available version of the PB corpus represents most CPs as if they were lexical usages of the verb involved in the predicate." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-43", "text": "Figure 1 shows an example for the annotation of the LVC take care in PB." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-44", "text": "1 The CP is split up into its two components that are each assigned their own roleset." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-45", "text": "This annotation ignores the semantic unity of the CP and is unable to capture its single meaning of being concerned with or caring for something." 
}, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-46", "text": "1 We show an excerpt of the original sentence found in the currently available version of PB (Proposition Bank I)." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-47", "text": "Figure 2: Improved representation of the CP take care adopted from (Hwang et al., 2010; Duran et al., 2011) In contrast to this, Hwang et al. (2010) suggest a new annotation scheme for LVCs that assigns the argument structure of the LVC independently from the argument structure of its components." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-48", "text": "First, the arguments of the light verb and true predicate are annotated with roles regarding their relationship to the combination of the light verb and true predicate." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-49", "text": "Then, the light verb and predicate lemmas are joined into a single predicate." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-50", "text": "The result of this process is shown in Figure 2 ." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-51", "text": "Duran et al. (2011) discuss the analysis of Brazilian Portuguese CPs." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-52", "text": "Similarly to Hwang et al. (2010) they argue that CPs should be treated as single predicates, not only for LVCs but for all CPs." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-53", "text": "They automatically extract CP candidates from a corpus and represent, if possible, the meaning of the CPs with one or more single-verb paraphrases." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-54", "text": "Atkins et al. (2003) describe a way in which LVCs can be annotated in FrameNet (Baker et al., 1998) , a framework that describes the semantic argument structure of predicates with semantic roles specific to the meaning of the predicate." 
}, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-55", "text": "In contrast to the proposals for PB by Hwang et al. (2010) and Duran et al. (2011) , they suggest to annotate the light verb and its counterpart separately." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-56", "text": "The aliasing process introduced by Bonial et al. (2014) tries to extend the coverage of PB for CPs while keeping the number of rolesets that should be newly created to a minimum." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-57", "text": "Bonial et al. (2014) conducted a pilot study re-annotating 138 CPs involving the verb take." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-58", "text": "As a first step, the annotators determined the meaning(s) of the CP by looking at their usage in corpora." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-59", "text": "If they found that the CP is already adequately represented by the existing rolesets for take, no further action was needed (18/138)." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-60", "text": "Otherwise, they were instructed to propose as alias an existing PB entry that encompasses the same semantics and argument structure as the CP (100/138)." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-61", "text": "If unable to find an alias, they could suggest to create a new roleset for this CP (20/138)." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-62", "text": "Expressions for which the annotators were unable to determine the meaning were marked as idiomatic expressions that need further treatment (4/138)." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-63", "text": "2 According to this process, take care could be aliased to the existing PB roleset care.01 whose entry is shown in Figure 3 ." 
}, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-88", "text": "----------------------------------" }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-64", "text": "This alias replaces (take+care).01 shown in Figure 2 and thus avoids the creation of a new roleset." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-65", "text": "Roleset id: care.01, to be concerned Arg0: carer, agent Arg1: thing cared for/about Encouraged by the high proportion of CPs that could successfully be aliased in the pilot study by Bonial et al. (2014) , we created a method to automatically find aliases for CPs in order to decrease the amount of human intervention, thereby scaling up the coverage of CPs in PB." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-66", "text": "----------------------------------" }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-67", "text": "**METHOD**" }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-68", "text": "The task of finding aliases for CPs automatically is related to finding (near-) synonymous predicates and their accompanying roleset for the CPs." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-69", "text": "To find the near-synonyms, we apply the distributional hypothesis which states that we can assess the similarity of expressions by looking at their contexts (Firth, 1957) ." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-70", "text": "As previous work (Van der Plas and Tiedemann, 2006) has shown that multilingual contexts work better for synonym acquisition than monolingual syntactic contexts, we use the translations of the CPs and other predicates to all 20 languages available via the word alignments in a multilingual parallel corpus as context." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-71", "text": "Figure 4 shows an overview of the architecture of our system." 
}, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-72", "text": "First, we extract the CPs and all predicates that share a PB roleset (PB roleset groups) from the parallel corpus." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-73", "text": "For example, all verbs that were assigned to the roleset care.01 by the SRL system belong to the PB roleset group of care.01." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-74", "text": "The CPs stem from the gold standard MWE annotation in the Wiki50 corpus (Vincze et al., 2011) ." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-75", "text": "We parsed this corpus to obtain lemmas, POS and syntactic dependencies and extracted this information for all VPCs and LVCs annotated in the corpus." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-76", "text": "3 Figure 5 shows the two patterns we identified that the majority of the CPs followed." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-77", "text": "4 We used these two patterns to search for occurrences of the CPs in Europarl." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-78", "text": "Next, we build a co-occurrence matrix containing as head terms the CP and all PB roleset groups found in the parallel corpus." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-79", "text": "Figure 6 shows a toy example of such a matrix for the CP take care." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-80", "text": "The head words are listed in the rows, the translations (i.e. features) in the columns." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-81", "text": "Note that in contrast to previous work on distributional semantics we include PB roleset groups as head words." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-82", "text": "These contain several distinct verbal predicates but they share the same sense." 
}, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-83", "text": "Consequently, polysemous verbs are found in several distinct PB roleset groups." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-84", "text": "ter cuidado (es) Finally, we measure the similarity between CPs and roleset groups using the cosine similarity because it worked best in previous experiments for finding synonyms ( Van der Plas, 2008) ." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-85", "text": "This results in a similarity ranking of PB roleset groups for each CP, from which we select the roleset with the highest cosine value as alias." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-86", "text": "----------------------------------" }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-87", "text": "**EXPERIMENTS**" }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-89", "text": "**TOOLS AND DATA**" }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-90", "text": "We processed the English section of the Europarl corpus (Koehn, 2005) (about 2 million sentences) with the MATE tools (Bj\u00f6rkelund et al., 2010 ) to obtain lemmas, part-of-speech (POS) tags, dependency structures and semantic role labels." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-91", "text": "These annotations are used to find occurrences of the CPs and words assigned with PB rolesets in the English part." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-92", "text": "The word alignments produced with the grow-diagfinal-and-heuristics (Koehn et al., 2003) provided by the OPUS project (Tiedemann, 2012) are then used to find their alignments to all other 20 languages in the corpus and exploited as features in the distributional model." 
}, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-93", "text": "----------------------------------" }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-94", "text": "**EVALUATION FRAMEWORK**" }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-95", "text": "Human Annotation." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-96", "text": "In order to evaluate our system, we set up an annotation effort loosely following the guidelines provided by Bonial et al. (2014) ." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-97", "text": "We selected 50 LVCs and 50 VPCs from the Wiki50 corpus (Vincze et al., 2011) divided equally over two frequency groups: Half of the expressions occur only once in the Wiki50 corpus (low-frequency subgroup) and the other half occur at least twice (high-frequency subgroup)." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-98", "text": "All occurrences of these 100 CP types in the corpus were selected to account for the polysemy of CPs." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-99", "text": "Different instances of the same CP could get assigned to different aliases." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-100", "text": "This resulted in a total of 197 annotated instances." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-101", "text": "Four annotators were presented with the CP in their original sentence context and were asked to propose one or several PB aliases which encompass the same meaning and argument structure." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-102", "text": "One annotator (A, one of the authors of this article) labeled the whole set of 100 expressions." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-103", "text": "The three other annotators (B,C,D) each labeled one third of the expressions assigned randomly, so that every expression was annotated by two annotators." 
}, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-104", "text": "First, they were asked to decide if there is already an appropriate PB roleset for the CP and then provide it." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-105", "text": "The annotators were requested to divide these cases into semantically compositional CPs (e.g. obtain permission with the roleset obtain.01) and uncompositional CPs for which PB already provides a multi-word predicate (e.g. open.03 for open up)." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-106", "text": "For the remaining CPs, they were asked to suggest PB rolesets (aliases) that share the same semantics and argument structure as the CP." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-107", "text": "The simple inter-annotator agreement 5 was 67% for annotator A%B, 51% for A&C and 44% for A&D. These agreement figures are higher than the figures in Bonial et al. (2014) , and actual agreement is probably even higher, because synonymous rolesets are regarded as disagreements." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-108", "text": "Annotator A discussed the annotations with the other annotators and they were able to reach a consensus that resulted in a final agreed-upon test set." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-109", "text": "Table 1 shows the final decisions with respect to the complete set of 197 expressions." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-110", "text": "In line with the results from Bonial et al. (2014) who aliased 100 out of 138 uncompositional take MWEs, we were also able to alias most of the CPs in our annotation set." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-111", "text": "The final Wiki50 set consists of 154 7 instances of CPs from the categories 'aliased' and 'multi-word PB predicate' (low-frequency: 34, high-frequency: 120)." 
}, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-112", "text": "The latter were included because the predicted roleset of the SRL only coincides with the gold standard for 23 out of 60 instances." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-113", "text": "This means that for the majority of the CPs, even if an adequate PB roleset exists, this roleset was not selected by the SRL system." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-114", "text": "We hope to also improve these cases with our method." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-115", "text": "All CPs were labeled with one to four appropriate PB alias rolesets." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-116", "text": "In addition, we evaluated our system on the dataset from Bonial et al. (2014) , restricted to the type of CP our system handles (LVCs and VPCs) and verb aliases (as opposed to aliases being a noun or adjective roleset)." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-117", "text": "We used 70 of the 100 MWEs from their annotations." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-118", "text": "Evaluation Measures and Baseline." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-119", "text": "We report the accuracy of our system's predictions as compared to the gold standard." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-120", "text": "For the STRICT AC-CURACY, an alias is counted as correct if it corresponds exactly to one of the gold aliases." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-121", "text": "This evaluation is very rigid and regards synonymous rolesets as incorrect." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-122", "text": "Thus, we also compute a more LE-NIENT ACCURACY, which counts an alias as correct if it belongs to the same VerbNet (Kipper-Schuler, 2006 ) verb class as the gold alias." 
}, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-123", "text": "VerbNet (VN) is a hierarchically organized lexicon of English verbs." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-124", "text": "It consists of syntactically and semantically coherent verb classes, which are extensions of the classes proposed by Levin (1993) ." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-125", "text": "For the PB-VN mappings, we rely on the resource provided by the SemLink project 8 (Loper et al., 2007) and use the mostspecific (deepest) layer of the verb classes." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-126", "text": "Since the mapping provided in SemLink is not complete (only 58% of the rolesets found in PB have a mapping to a corresponding VN class), we discard rolesets that are not found in SemLink, unless they are correct 8 http://verbs.colorado.edu/semlink/ according to the gold standard in the first place." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-127", "text": "We compared our system with a baseline system that distinguishes between VPCs and LVCs." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-128", "text": "For VPCs, it checks whether there exists a PB multiword predicate for the expression and selects the first roleset of that predicate (e.g. there exists a predicate called open up (open.03) for the VPC 'open up')." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-129", "text": "For LVCs, it checks whether the noun has a corresponding verb predicate in PB and selects the first roleset of this predicate (e.g. walk.01 for take a walk)." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-130", "text": "Note that this is an informed baseline that is very hard to beat and only fails in case of lack in coverage." 
}, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-131", "text": "----------------------------------" }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-132", "text": "**RESULTS AND DISCUSSION**" }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-133", "text": "We evaluated our approach on the 160 CPs annotated in the course of this work (Wiki50 set), as well as on the 70 take CPs from Bonial et al. (2014) (take set) and compare our results to the baseline." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-134", "text": "Table 2 shows percentage coverage, accuracy and the harmonic mean of coverage and accuracy for our system and the baseline." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-135", "text": "We report results on the two evaluation sets in the strict and lenient evaluation." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-136", "text": "The first five rows of Table 2 show the results for the Wiki50 set and its subsets." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-137", "text": "We see that our system scores 44.1 accuracy on the whole test set in the strict evaluation and 69.0 in the lenient evaluation." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-138", "text": "These numbers seem quite low, but they are not that far apart from the micro averaged IAA from our annotation effort (53%)." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-139", "text": "Our system outperforms the baseline with very high coverage numbers." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-140", "text": "It beats the baseline in terms of the harmonic mean for all subsets except the multiword PB predicate subset." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-141", "text": "This is not surprising as the test items in this subset have a corresponding multiword PB predicate and all the baseline has to do is select the right sense." 
}, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-142", "text": "The high performance of the baseline on the multiword PB predicates leads to the high accuracy numbers for the baseline in all (sub-)sets except from the alias subset, which contains the expressions for which a true alias was provided." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-143", "text": "Our system beats the baseline in terms of strict accuracy for the alias subset." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-144", "text": "This is good news because the actual task is to find new aliases for CPs that are not covered in PB." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-145", "text": "The performance on the low-frequency subset is lower than on the high-frequency subset, as expected for a distributional method." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-146", "text": "The results on the take set are shown in the last row of Table 2 ." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-147", "text": "Compared to the Wiki50 set, they are substantially lower." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-148", "text": "We would like to stress that the take set is far from what we expect to find in an actual corpus." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-149", "text": "This set comprises only CPs that contain the word take." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-150", "text": "Many test items have been extracted from WordNet and possibly have a very low frequency in a general corpus." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-151", "text": "This is reflected in the coverage number, which shows the proportion of CPs for which our system was able to suggest at least one alias: It is above 94% for all Wiki50 (sub)sets, but only 67% for the take set." 
}, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-152", "text": "We constructed the Wiki50 set to allow us to get a better estimate of how our method would fare in a natural setting." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-153", "text": "----------------------------------" }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-154", "text": "**ERROR ANALYSIS**" }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-155", "text": "We examined all expressions from the full Wiki50 set for which the top ranked predicted alias was incorrect." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-156", "text": "Due to space limitations we only mention the main reasons for errors we identified." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-157", "text": "First of all, the limited language domain of the Europarl corpus caused a low frequency of some rolesets selected as gold alias, like fuse.01 ('melt into lump') for the VPC melt down." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-158", "text": "This problem could be solved by adding more parallel data from different domains." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-159", "text": "Another source of errors is the fact that our approach requires the output of an SRL system which, in turn, we want to improve." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-160", "text": "For 45 out of 160 CPs our system suggested the roleset as alias that was assigned to the verb by the SRL system, e.g. leave.02 for leave for. But the automatically attributed roleset is only correct in 21 cases, which means that we reproduced the errors of the SRL in 24 cases." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-161", "text": "Some LVCs keep their light verb structure in other languages, i.e. they receive multi-word translations." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-162", "text": "This diminishes the overlap of translations between the LVC and the PB roleset groups." 
}, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-163", "text": "PB rolesets are assigned to simplex verbs and therefore predominantly receive simplex translations." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-164", "text": "As more frequent rolesets have more diverse translations that contain more MWEs, these are promoted as aliases." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-165", "text": "Applying frequency weights to the roleset matrix could remedy this problem." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-166", "text": "Lastly, our system adheres to the most frequent sense baseline due to lack of word sense disambiguation of the CPs and assigns the alias that fits the most dominant sense of the CP in the corpus." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-167", "text": "----------------------------------" }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-168", "text": "**CONCLUSIONS**" }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-169", "text": "We have presented an approach to handle CPs in SRL that extends on work from Bonial et al. (2014) ." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-170", "text": "We automatically link VPCs and LVCs to the PB roleset that best describes their meaning, by relying on word alignments in parallel corpora and distributional methods." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-171", "text": "We set up an annotation effort to gather a frequency-balanced, contextualized evaluation set that is more natural, varied and larger than the pilot annotations provided by Bonial et al. (2014) ." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-172", "text": "Our method can be used to alleviate the manual annotation effort by providing a correct alias in 44% of the cases (up to 72% for the more frequent test items when taking synonymous rolesets into account)." 
}, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-173", "text": "These results are not too far from the upper bounds we calculate from human annotations." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-174", "text": "In future work, we would like to improve our method by incorporating the methods discussed in the error analysis section." }, { "sent_id": "78afdf391c70d7992200b4071e4ac2-C001-175", "text": "Additionally, we plan to evaluate the impact of the new CP representation on downstream applications by retraining an SRL system on the new annotations." } ], "y": { "@BACK@": { "gold_contexts": [ [ "78afdf391c70d7992200b4071e4ac2-C001-17" ], [ "78afdf391c70d7992200b4071e4ac2-C001-23" ], [ "78afdf391c70d7992200b4071e4ac2-C001-56" ], [ "78afdf391c70d7992200b4071e4ac2-C001-57" ] ], "cite_sentences": [ "78afdf391c70d7992200b4071e4ac2-C001-17", "78afdf391c70d7992200b4071e4ac2-C001-23", "78afdf391c70d7992200b4071e4ac2-C001-56", "78afdf391c70d7992200b4071e4ac2-C001-57" ] }, "@DIF@": { "gold_contexts": [ [ "78afdf391c70d7992200b4071e4ac2-C001-38" ], [ "78afdf391c70d7992200b4071e4ac2-C001-107" ] ], "cite_sentences": [ "78afdf391c70d7992200b4071e4ac2-C001-38", "78afdf391c70d7992200b4071e4ac2-C001-107" ] }, "@MOT@": { "gold_contexts": [ [ "78afdf391c70d7992200b4071e4ac2-C001-65" ] ], "cite_sentences": [ "78afdf391c70d7992200b4071e4ac2-C001-65" ] }, "@EXT@": { "gold_contexts": [ [ "78afdf391c70d7992200b4071e4ac2-C001-96" ], [ "78afdf391c70d7992200b4071e4ac2-C001-169" ], [ "78afdf391c70d7992200b4071e4ac2-C001-171" ] ], "cite_sentences": [ "78afdf391c70d7992200b4071e4ac2-C001-96", "78afdf391c70d7992200b4071e4ac2-C001-169", "78afdf391c70d7992200b4071e4ac2-C001-171" ] }, "@SIM@": { "gold_contexts": [ [ "78afdf391c70d7992200b4071e4ac2-C001-110" ] ], "cite_sentences": [ "78afdf391c70d7992200b4071e4ac2-C001-110" ] }, "@USE@": { "gold_contexts": [ [ "78afdf391c70d7992200b4071e4ac2-C001-116" ], [ "78afdf391c70d7992200b4071e4ac2-C001-133" ] ], "cite_sentences": [ 
"78afdf391c70d7992200b4071e4ac2-C001-116", "78afdf391c70d7992200b4071e4ac2-C001-133" ] } } }, "ABC_950263323d351bcb483be7cdf15a7e_8": { "x": [ { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-2", "text": "In addition to the expression of positive and negative sentiments in the reviews, customers often tend to express wishes and suggestions regarding improvements in a product/service, which could be worth extracting." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-3", "text": "Subjunctive mood is often present in sentences which speak about a possibility or action that has not yet occurred." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-4", "text": "While this phenomena poses challenges to the identification of positive and negative sentiments hidden in a text, it can be helpful to identify wishes and suggestions." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-5", "text": "In this paper, we extract features from a small dataset of subjunctive mood, and use those features to identify wishes and suggestions in opinionated text." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-6", "text": "Our study validates that subjunctive features can be good features for the detection of wishes." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-7", "text": "However, with the given dataset, such features did not perform well for suggestion detection." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-8", "text": "----------------------------------" }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-10", "text": "In the context of Sentiment Analysis, presence of a variety of linguistic phenomena poses challenges for the identification of underlying sentiment in an opinionated text." 
}, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-11", "text": "Subjunctive mood is one such phenomena (Liu et al. (2013) ; Bloom (2011) )." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-12", "text": "It is a commonly occurring language phenomenon in Indo-European languages, which is a verb mood typically used in subordinate clauses to express action that has not yet occurred, in the form of a wish, possibility, necessity etc." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-13", "text": "(Guan, 2012) ." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-14", "text": "Oxford dictionary defines it as, Relating to or denoting a mood of verbs expressing what is imagined or wished or possible." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-15", "text": "Sentiment terms present in such sentences may not necessarily contribute to the actual sentiment of the sentence, for example 'I wish it tasted as amazing as it looked' is not positive." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-16", "text": "While this is considered as a challenge for sentiment analysis, we adopt a different perspective, and discover benefits of the presence of subjunctive mood in opinionated text." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-17", "text": "Apart from the expression of criticism and satisfaction in customer reviews, reviews might include suggestions for improvements." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-18", "text": "Suggestions can either be expressed explicitly (Brun, 2013) , or by expressing wishes regarding new features and improvements (Ramanand et al., 2010) (Table 1) ." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-19", "text": "Extraction of suggestions goes beyond the scope of sentiment analysis, and also complements it by providing another valuable information that is worth analyzing." 
}, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-20", "text": "Table 1 presents some examples of occurrence of subjunctive mood collected from different forums on English grammar 1 . There seems to be a high probability of the occurrence of subjunctive mood in wish and suggestion expressing sentences." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-21", "text": "This observation can be exploited for the tasks of wish detection (Ramanand et al., 2010) , and suggestion extraction (Brun, 2013) ." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-22", "text": "To the best of our knowledge, subjunctive mood has never been analysed in the context of wish and suggestion detection." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-23", "text": "We collect a sample dataset comprising of example sentences of subjunctive mood, and identify features of subjunctive mood." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-24", "text": "We then employ a state of the art statistical classifier, and use subjunctive features in order to perform two kind of tasks on a given set of sentences: 1. Detect wish expressing sentences, and 2." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-25", "text": "Detect suggestion expressing sentences." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-26", "text": "Description Examples Suggestion bearing wishes in product reviews I wanted a dvd player that had basic features and would be able to play dvd or format discs that I had made myself." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-27", "text": "I wish canon would work out some way for that issue." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-28", "text": "Direct suggestions in product reviews They should improve their user interface." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-29", "text": "Wishes in political discussions I wish someone said that to teddy at the meeting yesterday. Perhaps I should have stopped at 8 or 9 years old." 
}, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-30", "text": "I would like to know if you re a purist or a hypocrite." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-31", "text": "Sentences containing subjunctive mood I wish it were summer." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-32", "text": "I suggest that Dawn drive the car. But if it weren't so big, it wouldn't be nearly so fun." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-33", "text": "Mood and Modality: Modality is a grammatical category that allows the expression of aspects related to the attitude of a speaker towards his statement, in terms of degree of certainty, reliability, subjectivity, sources of information, and perspective (Morante and Sporleder, 2012) ." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-34", "text": "Subjunctive mood originated from the typological studies of modality (Palmer, 1986; Dudman, 1988; Portner, 2009) ." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-35", "text": "Some works equate its presence with 'counterfactuality' (Palmer, 1986) , while some do not (Anderson, 1951) ." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-36", "text": "Other concepts like 'event modality', 'irrealis' (Palmer, 1986) , have definitions similar to that of subjunctive mood." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-37", "text": "Benamara et al. (2012) studied modality and negation for French language, with an objective to examine its effect on sentiment polarity." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-38", "text": "Narayanan et al. (2009) performed sentiment analysis on conditional sentences." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-39", "text": "Our objective however is inclined towards wish and suggestion detection, rather than sentiment analysis." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-40", "text": "Wish Detection: Goldberg et al. 
(2009) performed wish detection on datasets obtained from political discussion forums and product reviews." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-41", "text": "They automatically extracted sentence templates from a corpus of new year wishes, and used them as features with a statistical classifier." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-42", "text": "Suggestion Detection: Ramanand et al. (2010) pointed out that wish is a broader category, which might not bear suggestions every time." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-43", "text": "They performed suggestion detection, where they focussed only on suggestion bearing wishes, and used manually formulated syntactic patterns for their detection." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-44", "text": "Brun (2013) also extracted suggestions from product reviews and used syntactico-semantic patterns for suggestion detection." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-45", "text": "None of these works on suggestion detection used a statistical classifier." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-46", "text": "None of these works aligned the problem of wish and suggestion detection with subjunctive mood, or identified features related to it." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-47", "text": "Wish and suggestion detection remain young problems, and our work contributes towards the same." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-48", "text": "----------------------------------" }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-49", "text": "**DATASETS**" }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-50", "text": "Following are the datasets which we use for our experiments." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-51", "text": "\u2022 Wish Detection Oxford dictionary defines the noun wish as, A desire or hope for something to happen." 
}, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-52", "text": "Goldberg et al. (2009) follow this definition of wish and provide manually annotated datasets, where each sentence is labelled as wish or non-wish." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-53", "text": "Following two datasets are made available: a. Political Discussions: 6379 sentences, out of which 34% are annotated wishes." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-54", "text": "b. Product Reviews: 1235 sentences, out of which 12% are annotated as wishes." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-55", "text": "Table 1 presents some examples from these datasets." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-56", "text": "Ramanand et al. (2010) worked on product review dataset of the wish corpus, with an objective to extract suggestions for improvements." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-57", "text": "They considered suggestions as a subset of wishes, and thus retained the labels of only suggestion bearing wishes." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-58", "text": "They also annotated additional product reviews, but their data is not available for open research." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-59", "text": "\u2022 Suggestion Detection Product reviews (new): We re-annotated the product review dataset from Goldberg et al. (2009) , for suggestions." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-60", "text": "This also includes wishes for improvements and new features." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-61", "text": "Out of 1235 sentences, 6% are annotated as suggestions." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-62", "text": "Table 1 presents some examples from this dataset." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-63", "text": "Annotation Details: We had 2 annotators annotate each sentence with a suggestion or non-suggestion tag." 
}, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-64", "text": "We support the observation of Ramanand et al. (2010) that wishes for improvements and new features are implicit expression of suggestions." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-65", "text": "Therefore, annotators were also asked to annotate suggestions which were expressed as wishes." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-66", "text": "For inter-annotator agreement, a kappa value of 0.874 was obtained." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-67", "text": "In the final dataset, we only retained the sentences where both the annotators agree." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-68", "text": "----------------------------------" }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-69", "text": "**SUBJUNCTIVE FEATURE EXTRACTION**" }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-70", "text": "Subjunctive Mood Dataset (new): Since we did not come across any corpus of subjunctive mood, we collected example sentences of subjunctive mood from various grammar websites and forums 2 , which resulted in a sample dataset of 229 sentences." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-71", "text": "Table 1 shows examples from this dataset." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-72", "text": "We use this dataset for manual and automatic identification of features of subjunctive mood." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-73", "text": "----------------------------------" }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-74", "text": "**APPROACH**" }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-75", "text": "We use a statistical classifier to detect wishes and suggestions in corresponding datasets." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-76", "text": "We obtain the following set of features from the subjunctive mood dataset." 
}, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-77", "text": "----------------------------------" }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-78", "text": "**LEXICAL FEATURES:**" }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-79", "text": "\u2022 Condition indicator 'if': This is a binary feature, whose value depends on the presence and absence of 'if' in a sentence." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-80", "text": "\u2022 Suggestion and Wish Verbs: We collect some suggestion and wish indicator verbs observed in the subjunctive mood dataset." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-81", "text": "We then expand this set of verbs by using VerbNet 3.2 (Schuler, 2005) ." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-82", "text": "VerbNet is a wide coverage verb lexicon, which places verbs into classes whose members have common syntactic and semantic properties." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-83", "text": "We collect all members of the VerbNet verb classes advice, wish, want, urge, require; 28 different verbs were obtained." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-84", "text": "Ramanand et al. (2010) also used a similar but much smaller subset {love, like, prefer and suggest} in their rules." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-85", "text": "----------------------------------" }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-86", "text": "**SYNTACTIC FEATURES:**" }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-87", "text": "\u2022 Frequent POS sequences: This is a set of 3,4 length sequences of Part Of Speech (POS) tags, which are automatically extracted from the subjunctive mood dataset." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-88", "text": "Words in the sentences are replaced by their corresponding POS tag, and top 200 sequences are extracted based on their weight." 
}, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-89", "text": "The weight of each sequence is a product of Term Frequency (TF) and Inverse Document Frequency (IDF)." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-90", "text": "In order to apply the concept of TF and IDF to POS tag sequences, every 3 and 4 length tag sequence occurring in the corpus is treated as a term." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-91", "text": "We separate tags within a sequence with an underscore." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-92", "text": "An example of a sequence of length 3 would be PRP VB PRP ie." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-93", "text": "Personal Pronoun Base form of Verb Personal pronoun." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-94", "text": "\u2022 Frequent Dependency Relations: These are a set of dependency relations (Marneffe and Manning, 2008) ." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-95", "text": "Using the same method as the part of speech tags, we identify 5 most frequent dependency relations which occur in the subjunctive mood dataset." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-96", "text": "In order to apply the concept of TF/IDF, each dependency relation occurring in the corpus is treated as a term." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-97", "text": "The top 5 relations were: advmod, aux, ccomp, mark and nsubj." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-98", "text": "Goldberg et.al (2009) templates n/a n/a 0.47 unigrams,templates n/a n/a 0.56 We also obtain classification results of the combination of these features with the standard unigram features (Table 2, 3) ." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-99", "text": "To obtain the part of speech and dependency information, we use Stanford Parser 3.3.1 (Klein and Manning, 2003) ." 
}, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-100", "text": "Word stemming is not performed." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-101", "text": "We use the LibSVM implementation of SVM classifier (EL-Manzalawy and Honavar, 2005) ." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-102", "text": "The parameter values of SVM classifiers are: SVM type = C-SVC, Kernel Function = Radial Basis Function." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-103", "text": "Features are ranked using the Info-Gain feature selection algorithm (Mitchell, 1997) ." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-104", "text": "Top 1000 features are used in all the experiments ie." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-105", "text": "the size of feature vector is not more than 1000." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-106", "text": "5 Subjunctive Feature Evaluation Goldberg et al. (2009) evaluated their approach using a 10 fold cross validation on their datasets." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-107", "text": "In order to compare subjunctive features against their wish template features, we also perform 10 fold cross validation on their wish datasets (politics and products)." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-108", "text": "The evaluation metrics include Precision, Recall, and Area Under Curve (AUC) for the positive class." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-109", "text": "AUC was also used by Goldberg et al. (2009) ." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-110", "text": "To the best of our knowledge, statistical classification based approach have not yet been employed to detect suggestions in reviews." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-111", "text": "Our experiment which uses subjunctive features for suggestion detection, is the first in this regard." 
}, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-112", "text": "Table 2 compares the AUC values obtained with unigrams, subjunctive features, a combination of both, and the results from Goldberg et al. (2009) for wish detection." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-113", "text": "Table 3 compares the AUC values obtained with unigrams, subjunctive features, and a combination of both for suggestion detection." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-114", "text": "Table 4 presents some of the top features used by the classifier." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-115", "text": "----------------------------------" }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-116", "text": "**RESULTS AND DISCUSSION**" }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-117", "text": "Wish Detection: Unigrams vs Subjunctive: One probable reason for the better performance of subjunctive features over unigrams in the case of product dataset, could be the small size of the dataset." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-118", "text": "In the case of politics dataset, similar reason (big dataset) can be attributed for the better performance of unigrams over subjunctive features." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-119", "text": "Goldberg et al. (2009) perform better than our subjunctive features for the politics data." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-120", "text": "However, subjunctive features perform much better with product data as compared to the wish templates (Table 3 )." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-121", "text": "This may lead to the conclusion that wish templates need larger training corpus, since they failed for the smaller dataset of product reviews (AUC less than 0.5)." 
}, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-122", "text": "One additional benefit of subjunctive features could be that subjunctive mood appears in many languages, and thus such features can be easily extended to multi-lingual wish detection." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-123", "text": "----------------------------------" }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-124", "text": "**SUGGESTION DETECTION:**" }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-125", "text": "Subjunctive features perform better than unigrams in this case too." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-126", "text": "An overall decrease in classifier performance for the task of suggestion detection can be attributed to the fact that not all wishes are suggestions, and therefore are not tagged in this dataset." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-127", "text": "Some of these untagged wishes would contain subjunctive mood, which reduced the performance of subjunctive features, as compared to the task of wish detection." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-128", "text": "----------------------------------" }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-129", "text": "**CONCLUSION**" }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-130", "text": "From the results of feature evaluation, we conclude that subjunctive features are not effective for suggestion detection, but are considerably effective for the task of wish detection." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-131", "text": "This work contributes towards both, analysis and methodology for wish detection." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-132", "text": "On the analysis part, we validate that a considerable amount of wishes in opinionated text contain subjunctive mood." 
}, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-133", "text": "On the methodology part, we use subjunctive mood features as effective features for the detection of wishes." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-134", "text": "We also provide datasets for this kind of study." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-135", "text": "Since we only deal with 2 domains here, further experiments can be performed over data from different domains." }, { "sent_id": "950263323d351bcb483be7cdf15a7e-C001-136", "text": "In the continuation of this work, we intend to extend the datasets and explore more syntactic and semantic features for wish and suggestion detection." } ], "y": { "@BACK@": { "gold_contexts": [ [ "950263323d351bcb483be7cdf15a7e-C001-40" ], [ "950263323d351bcb483be7cdf15a7e-C001-51", "950263323d351bcb483be7cdf15a7e-C001-52", "950263323d351bcb483be7cdf15a7e-C001-53", "950263323d351bcb483be7cdf15a7e-C001-54", "950263323d351bcb483be7cdf15a7e-C001-55" ], [ "950263323d351bcb483be7cdf15a7e-C001-56" ], [ "950263323d351bcb483be7cdf15a7e-C001-106" ], [ "950263323d351bcb483be7cdf15a7e-C001-109" ] ], "cite_sentences": [ "950263323d351bcb483be7cdf15a7e-C001-40", "950263323d351bcb483be7cdf15a7e-C001-52", "950263323d351bcb483be7cdf15a7e-C001-53", "950263323d351bcb483be7cdf15a7e-C001-54", "950263323d351bcb483be7cdf15a7e-C001-55", "950263323d351bcb483be7cdf15a7e-C001-56", "950263323d351bcb483be7cdf15a7e-C001-106", "950263323d351bcb483be7cdf15a7e-C001-109" ] }, "@MOT@": { "gold_contexts": [ [ "950263323d351bcb483be7cdf15a7e-C001-40", "950263323d351bcb483be7cdf15a7e-C001-41", "950263323d351bcb483be7cdf15a7e-C001-46", "950263323d351bcb483be7cdf15a7e-C001-47" ] ], "cite_sentences": [ "950263323d351bcb483be7cdf15a7e-C001-40", "950263323d351bcb483be7cdf15a7e-C001-41" ] }, "@EXT@": { "gold_contexts": [ [ "950263323d351bcb483be7cdf15a7e-C001-59" ] ], "cite_sentences": [ "950263323d351bcb483be7cdf15a7e-C001-59" ] }, "@USE@": { "gold_contexts": [ [ 
"950263323d351bcb483be7cdf15a7e-C001-106", "950263323d351bcb483be7cdf15a7e-C001-107" ] ], "cite_sentences": [ "950263323d351bcb483be7cdf15a7e-C001-106", "950263323d351bcb483be7cdf15a7e-C001-107" ] }, "@SIM@": { "gold_contexts": [ [ "950263323d351bcb483be7cdf15a7e-C001-112" ] ], "cite_sentences": [ "950263323d351bcb483be7cdf15a7e-C001-112" ] }, "@DIF@": { "gold_contexts": [ [ "950263323d351bcb483be7cdf15a7e-C001-112" ], [ "950263323d351bcb483be7cdf15a7e-C001-117" ], [ "950263323d351bcb483be7cdf15a7e-C001-118" ], [ "950263323d351bcb483be7cdf15a7e-C001-119" ] ], "cite_sentences": [ "950263323d351bcb483be7cdf15a7e-C001-112", "950263323d351bcb483be7cdf15a7e-C001-117", "950263323d351bcb483be7cdf15a7e-C001-118", "950263323d351bcb483be7cdf15a7e-C001-119" ] } } }, "ABC_211b889125682f2596f708be1e83b9_8": { "x": [ { "sent_id": "211b889125682f2596f708be1e83b9-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-2", "text": "Question-answer driven Semantic Role Labeling (QA-SRL) has been proposed as an attractive open and natural form of SRL, easily crowdsourceable for new corpora." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-3", "text": "Recently, a large-scale QA-SRL corpus and a trained parser were released, accompanied by a densely annotated dataset for evaluation." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-4", "text": "Trying to replicate the QA-SRL annotation and evaluation scheme for new texts, we observed that the resulting annotations were lacking in quality and coverage, particularly insufficient for creating gold standards for evaluation." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-5", "text": "In this paper, we present an improved QA-SRL annotation protocol, involving crowd-worker selection and training, followed by data consolidation." 
}, { "sent_id": "211b889125682f2596f708be1e83b9-C001-6", "text": "Applying this process, we release a new gold evaluation dataset for QA-SRL, yielding more consistent annotations and greater coverage." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-7", "text": "We believe that our new annotation protocol and gold standard will facilitate future replicable research of natural semantic annotations." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-8", "text": "Around 47 people could be arrested, including the councillor." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-9", "text": "(1) Who might be arrested? 47 people | the councillor Perry called for the DAs resignation, and when she did not resign, cut funding to a program she ran." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-10", "text": "(2) Why was something cut by someone? she did not resign (3) Who cut something?" }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-11", "text": "Perry" }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-12", "text": "----------------------------------" }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-13", "text": "**INTRODUCTION**" }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-14", "text": "Semantic Role Labeling (SRL) provides explicit annotation of predicate-argument relations, which have been found useful in various downstream tasks (Shen and Lapata, 2007; Chen et al., 2013; Wang et al., 2015; Marcheggiani et al., 2018) ." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-15", "text": "Question-Answer driven Semantic Role Labeling (QA-SRL) (He et al., 2015) is an SRL scheme in which roles are captured by natural language questions, while arguments represent their answers, making the annotations intuitive, semantically rich, and easily attainable by laymen." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-16", "text": "For example, in Table 1 , the question Who cut something captures the traditional \"agent\" role." 
}, { "sent_id": "211b889125682f2596f708be1e83b9-C001-17", "text": "Previous attempts to annotate QA-SRL initially involved trained annotators (He et al., 2015) but later resorted to crowdsourcing (Fitzgerald et al., 2018) to achieve scalability." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-18", "text": "Naturally, employing crowd workers raises challenges when annotating semantic structures like SRL." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-19", "text": "As Fitzgerald et al. (2018) acknowledged, the main shortage of the large-scale 2018 dataset is the lack of recall, estimated by experts to be in the lower 70s." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-20", "text": "In light of this and other annotation inconsistencies, we propose an improved QA-SRL crowdsourcing protocol for high-quality annotation, allowing for substantially more reliable performance evaluation of QA-SRL parsers." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-21", "text": "To address worker quality, we systematically screen workers, provide concise yet effective guidelines, and perform a short training procedure, all within a crowd-sourcing platform." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-22", "text": "To address coverage, we employ two independent workers plus an additional one for consolidation -similar to conventional expert-annotation practices." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-23", "text": "In addition to yielding 25% more roles, our coverage gain is demonstrated by evaluating against expertly annotated data and comparison with PropBank (Section 4)." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-24", "text": "To foster future research, we release an assessed high-quality gold dataset along with our reproducible protocol and evaluation scheme, and report the performance of the existing parser (Fitzgerald et al., 2018) as a baseline." 
}, { "sent_id": "211b889125682f2596f708be1e83b9-C001-25", "text": "2 Background -QA-SRL Specifications In QA-SRL, a role question adheres to a 7-slot template, with slots corresponding to a WH-word, the verb, auxiliaries, argument placeholders (SUBJ, OBJ), and prepositions, where some slots are optional (He et al., 2015) (see appendix for examples)." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-26", "text": "Such question captures the corresponding semantic role with a natural easily understood expression." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-27", "text": "The set of all non-overlapping answers for the question is then considered as the set of arguments associated with that role." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-28", "text": "This broad question-based definition of roles captures traditional cases of syntacticallylinked arguments, but also additional semantic arguments clearly implied by the sentence meaning (see example (2) in Table 1 )." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-29", "text": "Corpora The original 2015 QA-SRL dataset (He et al., 2015) was annotated by non-expert workers after completing a brief training procedure." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-30", "text": "They annotated 7.8K verbs, reporting an average of 2.4 QA pairs per predicate." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-31", "text": "Even though multiple annotators were shown to produce greater coverage, their released dataset was produced using only a single annotator per verb." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-32", "text": "In subsequent work, Fitzgerald et al. (2018) constructed a large-scale corpus and used it to train a parser." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-33", "text": "1 They crowdsourced 133K verbs with 2.0 QA pairs per verb on average." 
}, { "sent_id": "211b889125682f2596f708be1e83b9-C001-34", "text": "Since crowd-workers had no prior training, quality was established using an additional validation step, where workers had to ascertain the validity of the question, but not of its answers." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-35", "text": "Instead, the validator provided additional answers, independent of the other annotators." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-36", "text": "Each verb in the corpus was annotated by a single QA-generating worker and validated by two others." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-37", "text": "In a reserved part of the corpus (Dense), targeted for parser evaluation, verbs were densely validated with 5 workers, approving questions judged as valid by at least 4/5 validators." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-38", "text": "Notably, adding validators to the Dense annotation pipeline accounts mostly for precision errors, while role coverage solely relies upon the single generator's set of questions." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-39", "text": "As both 2015 and 2018 datasets use a single question generator, both struggle with maintaining coverage." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-40", "text": "Also noteworthy, is that while traditional SRL annotations contain a single authoritative and nonredundant annotation, the 2018 dataset provides the raw annotations of all annotators." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-41", "text": "These include many overlapping or noisy answers, without settling on consolidation procedures to provide a single gold reference." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-42", "text": "We found that these characteristics of the dataset impede its utility for future development of parsers." 
}, { "sent_id": "211b889125682f2596f708be1e83b9-C001-43", "text": "1 http://qasrl.org/ 3 Annotation and Evaluation Methods" }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-44", "text": "----------------------------------" }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-45", "text": "**CROWDSOURCING METHODOLOGY**" }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-46", "text": "Screening and Training Our pool of annotators is selected after several short training rounds, with up to 15 predicates per round, in which they received extensive personal feedback." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-47", "text": "1 out of 3 participants were selected after exhibiting good performance, tested against expert annotations." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-48", "text": "Annotation We adopt the annotation machinery of (Fitzgerald et al., 2018) implemented using Amazon's Mechanical Turk, 2 and annotate each predicate by 2 trained workers independently, while a third consolidates their annotations into a final set of roles and arguments." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-49", "text": "In this consolidation task, the worker validates questions, merges, splits or modifies answers for the same role according to guidelines, and removes redundant roles by picking the more naturally phrased questions." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-50", "text": "For example, in Table 1 ex." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-51", "text": "1, one worker could have chosen \"47 people\", while another chose \"the councillor\"; in this case the consolidator would include both of those answers." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-52", "text": "In Section 4, we show that this process yields better coverage." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-53", "text": "3 For example annotations, please refer to the appendix." 
}, { "sent_id": "211b889125682f2596f708be1e83b9-C001-54", "text": "----------------------------------" }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-55", "text": "**GUIDELINES REFINEMENTS**" }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-56", "text": "We refine the previous guidelines by emphasizing several semantic features: correctly using modal verbs and negations in the question, and choosing answers that coincide with a single entity (example 1 in Table 1 )." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-57", "text": "----------------------------------" }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-58", "text": "**DATA & COST**" }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-59", "text": "We annotated a sample taken from the Dense set on Wikinews and Wikipedia domains, each with 1000 sentences, equally divided between development and test." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-60", "text": "QA generating annotators are paid the same as in Fitzgerald et al. (2018) , while the consolidator is rewarded 5\u00a2 per verb and 3\u00a2 per question." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-61", "text": "Per predicate, on average, our cost is 54.2\u00a2, yielding 2.9 roles, compared to reported 2.3 valid roles with an approximated cost of 51\u00a2 per predicate for Dense." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-62", "text": "----------------------------------" }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-63", "text": "**EVALUATION METRICS**" }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-64", "text": "Evaluation in QA-SRL involves aligning predicted and ground truth argument spans and evaluating role label equivalence." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-65", "text": "Since detecting question paraphrases is still an open challenge, we propose both unlabeled and labeled evaluation metrics." 
}, { "sent_id": "211b889125682f2596f708be1e83b9-C001-66", "text": "Unlabeled Argument Detection (UA) Inspired by the method presented in (Fitzgerald et al., 2018) , arguments are matched using a span matching criterion of intersection over union \u2265 0.5 ." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-67", "text": "To credit each argument only once, we employ maximal bipartite matching 4 between the two sets of arguments, drawing an edge for each pair that passes the above mentioned criterion." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-68", "text": "The resulting maximal matching determines the true-positive set, while remaining non-aligned arguments become false-positives or false-negatives." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-69", "text": "----------------------------------" }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-70", "text": "**LABELED ARGUMENT DETECTION (LA)**" }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-71", "text": "All aligned arguments from the previous step are inspected for label equivalence, similar to the joint evaluation reported in (Fitzgerald et al., 2018) ." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-72", "text": "There may be many correct questions for a role." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-73", "text": "For example, What was given to someone? and What has been given by someone?" }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-74", "text": "both refer to the same semantic role but diverge in grammatical tense, voice, and presence of a syntactical object or subject." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-75", "text": "Aiming to avoid judging non-equivalent roles as equivalent, we propose STRICT-MATCH to be an equivalence on the following template slots: WH, SUBJ, OBJ, as well as on negation, voice, and modality 5 extracted from the question." 
}, { "sent_id": "211b889125682f2596f708be1e83b9-C001-76", "text": "Final reported numbers on labelled argument detection rates are based on bipartite aligned arguments passing STRICT-MATCH." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-77", "text": "We later manually estimate the rate of correct equivalences missed by this conservative method." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-78", "text": "As we will see, our evaluation heuristics, adapted from those in Fitzgerald et al. (2018) , significantly underestimate agreement between annotations, hence reflecting performance lower bounds." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-79", "text": "Devising more tight evaluation measures remains a challenge for future research." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-80", "text": "----------------------------------" }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-81", "text": "**EVALUATING REDUNDANT ANNOTATIONS**" }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-82", "text": "We extend our metric for evaluating manual or automatic redundant annotations, like the Dense dataset or the parser in (Fitzgerald et al., 2018) , which predicts argument spans independently of each other." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-83", "text": "To that end, we ignore predicted arguments that match ground-truth but are not selected by the bipartite matching due to redundancy." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-84", "text": "After con-necting unmatched predicted arguments that overlap, we count one false positive for every connected component to avoid penalizing precision too harshly when predictions are redundant." 
}, { "sent_id": "211b889125682f2596f708be1e83b9-C001-85", "text": "6" }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-86", "text": "----------------------------------" }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-87", "text": "**DATASET QUALITY ANALYSIS**" }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-88", "text": "Inter-Annotator Agreement (IAA) To estimate dataset consistency across different annotations, we measure F1 using our UA metric with 5 generators per predicate." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-89", "text": "Individual worker-vsworker agreement yields 79.8 F1 over 10 experiments with 150 predicates, indicating high consistency across our annotators, inline with results by other structured semantic annotations (e.g. (Abend and Rappoport, 2013) )." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-90", "text": "Overall consistency of the dataset is assessed by measuring agreement between different consolidated annotations, obtained by disjoint triplets of workers, which achieves F1 of 84.1 over 4 experiments, each with 35 distinct predicates." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-91", "text": "Notably, consolidation boosts agreement, suggesting it is a necessity for semantic annotation consistency." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-92", "text": "----------------------------------" }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-93", "text": "**DATASET ASSESSMENT AND COMPARISON**" }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-94", "text": "We assess both our gold standard set and the recent Dense set against an integrated expert annotated sample of 100 predicates." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-95", "text": "To construct the expert set, we blindly merged the Dense set with our worker annotations and manually corrected them." 
}, { "sent_id": "211b889125682f2596f708be1e83b9-C001-96", "text": "We further corrected the evaluation decisions, accounting for some automatic evaluation mistakes introduced by the span-matching and question paraphrasing criteria." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-97", "text": "As seen in Table 2 , our gold set yields comparable precision with significantly higher recall, which is in line with our 25% higher yield." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-98", "text": "Examining disagreements between our gold and Dense, we observe that our workers successfully produced more roles, both implied and explicit." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-99", "text": "To a lesser extent, they split more arguments into independent answers, as emphasized by our guidelines, an issue which was left under-specified in the previous annotation guidelines." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-100", "text": "Agreement with PropBank Data It is illuminating to observe the agreement between QA-SRL and PropBank (CoNLL-2009) annotations (Haji\u010d et al., 2009 )." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-101", "text": "In Table 3 , we replicate the experiments in (He et al., 2015, Section 3.4) for both our gold set and theirs, over a sample of 200 sentences from Wall Street Journal (agreement evaluation is automatic and the metric is somewhat similar to our UA)." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-102", "text": "We report macroaveraged (over predicates) precision and recall for all roles, including core and adjuncts, 7 while considering the PropBank data as the reference set." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-103", "text": "Our recall of the PropBank roles is notably high, reconfirming the coverage obtained by our annotation protocol." 
}, { "sent_id": "211b889125682f2596f708be1e83b9-C001-104", "text": "The measured precision with respect to Prop-Bank is low for adjuncts due to the fact that our annotators were capturing many correct arguments not covered in PropBank." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-105", "text": "To examine this, we analyzed 100 false positive arguments." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-106", "text": "Only 32 of those were due to wrong or incomplete QA annotations in our gold, while most others were outside of PropBank's scope, capturing either implied arguments or roles not covered in PropBank." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-107", "text": "Extrapolating from this manual analysis estimates our true precision (on all roles) to be about 91%, which is consistent with the 88% precision figure in Table 2 ." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-108", "text": "Compared with 2015, our QA-SRL gold yielded 1593 annotations, with 989 core and 604 adjuncts, while theirs yielded 1315 annotations, 979 core and 336 adjuncts." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-109", "text": "Overall, the comparison to PropBank reinforces the quality of our gold dataset and shows its better coverage relative to the 2015 dataset." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-110", "text": "----------------------------------" }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-111", "text": "**BASELINE PARSER EVALUATION**" }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-112", "text": "To illustrate the effectiveness of our new goldstandard, we use its Wikinews development set to evaluate the currently available parser from (Fitzgerald et al., 2018) ." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-113", "text": "For each predicate, the parser classifies every span for being an argument, independently of the other spans." 
}, { "sent_id": "211b889125682f2596f708be1e83b9-C001-114", "text": "Unlike many other SRL systems, this policy often produces outputs with redundant arguments (see appendix for examples)." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-115", "text": "Results for~1200 predicates are reported in Table 4 , demonstrating reasonable performance along with substantial room for improvement, especially with respect to coverage." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-116", "text": "As expected, the parser's recall against our gold is substantially lower than the 84.2 recall reported in (Fitzgerald et al., 2018) against Dense, due to the limited recall of Dense relative to our gold set." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-117", "text": "----------------------------------" }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-118", "text": "**ERROR ANALYSIS**" }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-119", "text": "We sample and evaluate 50 predicates to detect correct argument and paraphrase pairs that are skipped by the IOU and STRICT-MATCH criteria." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-120", "text": "Based on this inspection, the parser completely misses 23% of the 154 roles present in the gold-data, out of which, 17% are implied." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-121", "text": "While the parser correctly predicts 82% of non-implied roles, it skips half of the implied ones." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-122", "text": "----------------------------------" }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-123", "text": "**CONCLUSION**" }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-124", "text": "We introduced a refined crowdsourcing pipeline and a corresponding evaluation methodology for QA-SRL." 
}, { "sent_id": "211b889125682f2596f708be1e83b9-C001-125", "text": "It enabled us to release a new gold standard for evaluations, notably of much higher coverage of core and implied roles than the previous Dense evaluation dataset." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-126", "text": "We believe that our annotation methodology and dataset would facilitate future research on natural semantic annotations and QA-SRL parsing." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-127", "text": "----------------------------------" }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-128", "text": "**ANNOTATION PIPELINE**" }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-129", "text": "As described in section 3 The consolidator receives two sets of QA annotations and merges them according to the guidelines to produce an exhaustive and consistent QA set." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-130", "text": "See Table 6 for examples." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-131", "text": "A1: Who identified something?" }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-132", "text": "The U.S. Geological Survey (USGS) A2: Who identified something?" }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-133", "text": "The U.S. Geological Survey C:" }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-134", "text": "Who identified something The U.S. Geological Survey | USGS A1: What might contain something? that basin A2: What contains something? that basin C:" }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-135", "text": "What might contain something? that basin Table 6 : The consolidation task -A1, A2 refer to the original annotator QAs, C refers to the consolidator selected question and corrected answers." 
}, { "sent_id": "211b889125682f2596f708be1e83b9-C001-136", "text": "----------------------------------" }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-137", "text": "**REDUNDANT PARSER OUTPUT**" }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-138", "text": "As mentioned in the paper body, the Fitzgerald et al. parser generates redundant role questions and answers." }, { "sent_id": "211b889125682f2596f708be1e83b9-C001-139", "text": "The first two rows in Table 7" } ], "y": { "@BACK@": { "gold_contexts": [ [ "211b889125682f2596f708be1e83b9-C001-17" ], [ "211b889125682f2596f708be1e83b9-C001-32" ], [ "211b889125682f2596f708be1e83b9-C001-66" ], [ "211b889125682f2596f708be1e83b9-C001-112", "211b889125682f2596f708be1e83b9-C001-113", "211b889125682f2596f708be1e83b9-C001-116" ], [ "211b889125682f2596f708be1e83b9-C001-138" ] ], "cite_sentences": [ "211b889125682f2596f708be1e83b9-C001-17", "211b889125682f2596f708be1e83b9-C001-32", "211b889125682f2596f708be1e83b9-C001-66", "211b889125682f2596f708be1e83b9-C001-112", "211b889125682f2596f708be1e83b9-C001-113", "211b889125682f2596f708be1e83b9-C001-116", "211b889125682f2596f708be1e83b9-C001-138" ] }, "@MOT@": { "gold_contexts": [ [ "211b889125682f2596f708be1e83b9-C001-19", "211b889125682f2596f708be1e83b9-C001-20" ], [ "211b889125682f2596f708be1e83b9-C001-32", "211b889125682f2596f708be1e83b9-C001-40" ], [ "211b889125682f2596f708be1e83b9-C001-112", "211b889125682f2596f708be1e83b9-C001-120", "211b889125682f2596f708be1e83b9-C001-121" ], [ "211b889125682f2596f708be1e83b9-C001-138" ] ], "cite_sentences": [ "211b889125682f2596f708be1e83b9-C001-19", "211b889125682f2596f708be1e83b9-C001-32", "211b889125682f2596f708be1e83b9-C001-40", "211b889125682f2596f708be1e83b9-C001-112", "211b889125682f2596f708be1e83b9-C001-120", "211b889125682f2596f708be1e83b9-C001-121", "211b889125682f2596f708be1e83b9-C001-138" ] }, "@USE@": { "gold_contexts": [ [ "211b889125682f2596f708be1e83b9-C001-24" ], [ "211b889125682f2596f708be1e83b9-C001-48" ], [ 
"211b889125682f2596f708be1e83b9-C001-60" ], [ "211b889125682f2596f708be1e83b9-C001-71" ], [ "211b889125682f2596f708be1e83b9-C001-78" ], [ "211b889125682f2596f708be1e83b9-C001-82" ], [ "211b889125682f2596f708be1e83b9-C001-112", "211b889125682f2596f708be1e83b9-C001-113", "211b889125682f2596f708be1e83b9-C001-116" ] ], "cite_sentences": [ "211b889125682f2596f708be1e83b9-C001-24", "211b889125682f2596f708be1e83b9-C001-48", "211b889125682f2596f708be1e83b9-C001-60", "211b889125682f2596f708be1e83b9-C001-71", "211b889125682f2596f708be1e83b9-C001-78", "211b889125682f2596f708be1e83b9-C001-82", "211b889125682f2596f708be1e83b9-C001-112", "211b889125682f2596f708be1e83b9-C001-113", "211b889125682f2596f708be1e83b9-C001-116" ] }, "@EXT@": { "gold_contexts": [ [ "211b889125682f2596f708be1e83b9-C001-66" ] ], "cite_sentences": [ "211b889125682f2596f708be1e83b9-C001-66" ] }, "@SIM@": { "gold_contexts": [ [ "211b889125682f2596f708be1e83b9-C001-82" ] ], "cite_sentences": [ "211b889125682f2596f708be1e83b9-C001-82" ] }, "@DIF@": { "gold_contexts": [ [ "211b889125682f2596f708be1e83b9-C001-112", "211b889125682f2596f708be1e83b9-C001-116" ] ], "cite_sentences": [ "211b889125682f2596f708be1e83b9-C001-112", "211b889125682f2596f708be1e83b9-C001-116" ] } } }, "ABC_d72f0608fddd1bf1cdef7ca6a20bdf_8": { "x": [ { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-2", "text": "This paper explores the use of multi-view features and their discriminative transforms in a convolutional deep neural network (CNN) architecture for a continuous large vocabulary speech recognition task." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-3", "text": "Mel-filterbank energies and perceptually motivated forced damped oscillator coefficient (DOC) features are used after feature-space maximum-likelihood linear regression (fMLLR) transforms, which are combined and fed as a multi-view feature to a single CNN acoustic model." 
}, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-4", "text": "Use of multi-view feature representation demonstrated significant reduction in word error rates (WERs) compared to the use of individual features by themselves." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-5", "text": "In addition, when articulatory information was used as an additional input to a fused deep neural network (DNN) and CNN acoustic model, it was found to demonstrate further reduction in WER for the Switchboard subset and the CallHome subset (containing partly non-native accented speech) of the NIST 2000 conversational telephone speech test set, reducing the error rate by 12% relative to the baseline in both cases." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-6", "text": "This work shows that multi-view features in association with articulatory information can improve speech recognition robustness to spontaneous and non-native speech." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-7", "text": "----------------------------------" }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-9", "text": "Spontaneous speech typically contains a significant amount of variation, which makes it difficult to model in automatic speech recognition (ASR) systems." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-10", "text": "Such variability stems from varying speakers, pronunciation variations, speaker stylistic differences, varying recording conditions and many other factors." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-11", "text": "Recognizing words from conversational telephone speech (CTS) can be quite difficult due to the spontaneous nature of speech, its informality, speaker variations, hesitations, disfluencies etc." 
}, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-12", "text": "The Switchboard and Fisher [1] data collections are large collection of CTS datasets that have been used extensively by researchers working on conversational speech recognition [2, 3, 4, 5, 6] ." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-13", "text": "Recent trends in speech recognition [7, 8, 9] have demonstrated impressive performance on Switchboard and Fisher data." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-14", "text": "Deep neural network (DNN) based acoustic modeling has become the state-of-the-art in automatic speech recognition (ASR) systems [10, 11] ." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-15", "text": "It has demonstrated impressive performance gains for almost all tried languages and ___________________________________________________________ *The author performed this work while at SRI International and is currently working at Apple Inc. acoustic conditions." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-16", "text": "Advanced variants of DNNs, such as convolutional neural nets (CNNs) [12] , recurrent neural nets (RNNs) [13] , long short-term memory nets (LSTMs) [14] , time-delay neural nets (TDNNs) [15, 29] , VGG-nets [8] , have significantly improved recognition performance, bringing them closer to human performance [9] ." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-17", "text": "Both abundance of data and sophistication of deep learning algorithms have symbiotically contributed to the advancement of speech recognition performance." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-18", "text": "The role of acoustic features has not been explored in comparable detail, and their potential contribution to performance gains is unknown." 
}, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-19", "text": "This paper focuses on acoustic features and investigates how their selection improves recognition performance using benchmark training datasets: Switchboard and Fisher, when evaluated on the NIST 2000 CTS test set [2] ." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-20", "text": "We investigated a traditional CNN model and explored the following: (1) Use of multiple features both in isolation and in combination." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-21", "text": "Our experiments demonstrated that the use of feature combinations helped to improve performance over individual features in isolation and over traditionally used mel-filterbank (MFB) features." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-22", "text": "Articulatory features were found to be useful for improving recognition performance on both Switchboard and CallHome subsets of the NIST 2000 CTS test set." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-23", "text": "These findings indicate that the use of better acoustic features can help improve speech recognition performance when using standard acoustic modeling techniques, and can demonstrate performance as good as those obtained from more sophisticated acoustic models that exploit temporal memory." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-24", "text": "For the sake of simplicity, we used a CNN acoustic model in our experiment, where the baseline system's performance is directly comparable to the state-of-the-art CNN performance reported in [8] ." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-25", "text": "We expect our results using the CNN to carry over into other neural network architectures as well." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-26", "text": "The outline of the paper is as follows." 
}, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-27", "text": "In Section 2 we present the dataset and the recognition task." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-28", "text": "In Section 3 we describe the acoustic features and the articulatory features that were used in our experiments." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-29", "text": "Section 4 presents the acoustic and language models used in our experiments, followed by experimental results in Section 5 and conclusion and future directions in Section 6." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-30", "text": "----------------------------------" }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-31", "text": "**DATA AND TASK**" }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-32", "text": "The acoustic models in our experiments were trained using the CTS Switchboard (SWB) [16] and Fisher (FSH) corpora." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-33", "text": "We first investigated contributions of the features on models trained only with the SWB dataset, where the training data consisted of ~360 hours of speech data." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-34", "text": "We then evaluated the contributions of the features using acoustic models trained with a combination of both SWB and FSH (~2000 hours)." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-35", "text": "The models were evaluated using the NIST 2000 CTS test set, which consists of 2.1 hours (21.4K words, 40 speakers) of SWB audio and 1.6 hours (21.6K words, 40 speakers) of the CallHome (CH) audio." 
}, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-36", "text": "The language model training data included 3M words from Switchboard, CallHome, and Switchboard Cellular transcripts, 20M words from Fisher transcripts, 150M words from Hub4 broadcast news transcripts and language model training data, and 191M words of \"conversational\" text retrieved from the Web by searching for conversational n-grams extracted from the CTS transcripts [25] ." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-37", "text": "A 4-gram language model (LM) was generated based on word probability estimates from a SuperARV language model, which is a class-based language model with classes derived from Constraint Dependency Grammar parses [26] ." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-38", "text": "For first pass decoding the 4-gram LM was pruned to improve efficiency, and the full 4-gram LM was used to rescore lattices generated from the first pass." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-39", "text": "----------------------------------" }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-40", "text": "**FEATURES**" }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-41", "text": "We used mel-filterbank energies (MFBs) as the baseline feature, where the features were generated using the implementation distributed with the Kaldi toolkit [17] ." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-42", "text": "The second acoustic feature was Damped Oscillator Coefficients (DOCs) [18] ." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-43", "text": "The DOC features model the auditory hair cells using a bank of forced damped oscillators, where gammatone filtered band-limited subband speech signals are used as the forcing function." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-44", "text": "The oscillation energy from the damped oscillators was used as the DOC features after power-law compression." 
}, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-45", "text": "We performed the fMLLR transform on the acoustic features, where we trained Gaussian Mixture Models (GMMs) to generate alignments on the training dataset to learn the fMLLR transform for the feature sets." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-46", "text": "We investigated two approaches: (1) we directly learned the fMLLR transforms on the 40-dimensional filterbank features, and (2) we investigated learning the fMLLR transform using the cepstral version of the features." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-47", "text": "The cepstral version of the features helps decorrelate the features, which in turn adheres to the diagonal covariance assumption of the GMMs." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-48", "text": "In (2) the fMLLR transform was learned using 40 dimensional cepstral features (using all the cepstral dimensions extracted from 40 dimensional filterbanks)." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-49", "text": "After the fMLLR transform was performed, an IDCT of the features was obtained to generate the fMLLR version of filterbank features." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-50", "text": "The articulatory features were estimated using the CNN system described in [19, 20] , where the CNN performs speech-to-articulatory inversion or simply speechinversion." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-51", "text": "During speech-inversion, the acoustic features extracted from the speech signal, in this case modulation features [19] , are used to predict the articulatory trajectories." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-52", "text": "The articulatory features contain time domain articulatory trajectories, with eight dimensions reflecting: glottal aperture, velic opening, lip aperture, lip protrusion, tongue tip location and degree, tongue body location and degree." 
}, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-53", "text": "More details regarding the articulatory features and their extraction are provided in [19] ." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-54", "text": "----------------------------------" }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-55", "text": "**RECOGNITION SYSTEM**" }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-56", "text": "We trained CNN acoustic models for the speech recognition tasks." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-57", "text": "To generate the alignments necessary for training the CNN system, a Gaussian Mixture Model -Hidden Markov Model (GMM-HMM) based acoustic model was first trained with flat-start, which was used to produce the senone labels." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-58", "text": "Altogether, the GMM-HMM system produced 5.6K context-dependent (CD) states for the SWB training set." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-59", "text": "A fully connected DNN model was then trained using MFB features, which in turn was used to generate the senone alignments to train the baseline and other acoustic models presented in this work." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-60", "text": "The input features to the acoustic models were formed using a context window of 15 frames (7 frames on either side of the current frame)." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-61", "text": "The acoustic models were trained by using cross-entropy (CE) followed by sequence training using maximum mutual information (MMI) criterion [17, 21] ." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-62", "text": "For the CNN model, 200 convolutional filters of size 8 were used in the convolutional layer, and the pooling size was set to 3 without overlap." 
}, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-63", "text": "The subsequent, fully connected network had five hidden layers, with 2048 nodes per hidden layer, and the output layer included as many nodes as the number of CD states for the given dataset." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-64", "text": "The networks were trained using an initial four iterations with a constant learning rate of 0.008, followed by learning-rate halving based on the cross-validation error decrease." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-65", "text": "Training stopped when no further significant reduction in crossvalidation error was noted or when cross-validation error started to increase." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-66", "text": "Backpropagation was performed using stochastic gradient descent with a mini-batch of 256 training examples." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-67", "text": "In this work, we investigated a modified deep neural network architecture to jointly model the acoustic and the articulatory spaces, as shown in Figure 1 ." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-68", "text": "In this modified architecture, two parallel input layers are used to accept acoustic features and articulatory features." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-69", "text": "The input layer tied to the acoustic feature consists of a convolutional layer, with 200 filters and the input layer tied to the articulatory features is a feed-forward layer with 100 neurons." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-70", "text": "The feature maps from the convolutional layer and the outputs from the feed-forward layer are fed to a fully connected DNN consisting of 5 hidden layers and 2048 neurons in each layer, as shown in figure 1." 
}, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-71", "text": "----------------------------------" }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-72", "text": "**RESULTS**" }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-73", "text": "We initially validated the performance of the features (MFB, DOC and TVs) using the 360 hours SWB training dataset." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-74", "text": "The baseline DNN and CNN models had six and five hidden layers respectively, with 2048 neurons in each layer, and were trained with MFB features and its fMLLR transformed version (MFB+fMLLR)." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-75", "text": "The NIST RT-04 dev04 dataset (3 hour test set from Fisher, containing 36 conversations) [2] was used as the cross-validation set during the acoustic model training step." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-76", "text": "Table 1 presents the word error rates (WER) from the baseline CNN model trained with the SWB data when evaluated on the NIST 2000 CTS test set, for both cross-entropy (CE) training and sequence training (ST) using MMI." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-77", "text": "Table 1 also shows the results obtained from the DOC features with and without a fMLLR transform." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-78", "text": "We present results from ST as they were found to be always better than the results CE training." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-79", "text": "We explored learning the fMLLR transform directly from the filterbank features (MFB_fMLLR and DOC_fMLLR) and learning the fMLLR transforms on the full dimensional cepstral versions of the features, applying the transform and then performing IDCT (MFB+fMLLR and DOC+fMLLR)." 
}, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-80", "text": "Table 1 shows that the performance of fMLLR transforms learned from the cepstral version of the features are better than the ones directly from the filterbank features, which is expected, as the cepstral features are uncorrelated, which adheres to the diagonal covariance assumption of the GMM models used to learn those transforms." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-81", "text": "Table 1 demonstrates that the fMLLR transformed features always performed better than the features without fMLLR transform." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-82", "text": "Also, the CNN models always gave better results, confirming similar observations from studies reported earlier [8] ." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-83", "text": "Also, note that Table 1 shows that the DOC features performed slightly better than the MFB features after the fMLLR transform, where the performance improvement was more pronounced for the CH subset of the NIST 2000 CTS test set." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-84", "text": "As a next step, we investigated the efficacy of feature combination and focused only on the CNN acoustic models." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-85", "text": "We appended the articulatory features (TVs) extracted from the SWB training set, dev04 and NIST 2000 CTS test sets, and combined them with MFB+fMLLR and DOC+fMLLR features, respectively." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-86", "text": "Finally, we combined the MFB+fMLLR and DOC+fMLLR features and added the TVs to them." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-87", "text": "Table 2 Table 2 shows that the use of articulatory features helped to lower the WER in all the cases." 
}, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-88", "text": "The DOC feature was always found to perform slightly better than the MFBs and the best results were obtained when all the features were combined together, indicating the benefit of using multiview features." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-89", "text": "Note that only 100 additional neurons were used to accommodate the TV features, hence all the models were of comparable sizes." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-90", "text": "The benefit of the articulatory features stemmed from the complementary information that they contain (reflecting degree and location of articulatory constrictions in the vocal tract), as demonstrated by earlier studies [22] [23] [24] ." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-91", "text": "Overall the f-CNN-DNN system trained with the combined feature set, MFB+fMLLR + DOC+fMLLR + TV, demonstrated a relative reduction in WER of 7% and 9% compared to the MFB+fMLLR CNN baseline for SWB and CH subsets of the NIST 2000 CTS test set." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-92", "text": "Table 1 and 2 also demonstrates that sequence training always gave additive performance gain over crossentropy training, supporting the in [8, 21] ." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-93", "text": "As a next step, we focused on training the acoustic models using the 2000-hour SWB+FSH CTS data, focusing on the CNN acoustic models and multi-view features." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-94", "text": "Note that the MFB DNN baseline model was used to generate the alignments for the FSH part of the 2000 hours CTS training set and as a consequence the number of senone labels remained the same as the 360-hour SWB models." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-95", "text": "Table 3 presents the results from the 2000 hours CTS trained models." 
}, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-96", "text": "The model configurations and their parameter size were kept the same as the 360-hour SWB models." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-97", "text": "Figure 3 shows that the use of the additional FSH training data resulted in significant performance improvement for both SWB and the CH subsets of the NIST 2000 CTS test set." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-98", "text": "Adding the FSH dataset resulted in relative WER reduction of 4.4% and 12% respectively for SWB and CH subsets of the NIST 2000 CTS test set, using MFB+fMLLR features." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-99", "text": "Similar improvement was observed from the DOC+fMLLR features as well, where 8% and 12% relative reduction in WER for SWB and CH subsets was observed when FSH data was added to the training data." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-100", "text": "Note that the CH subset of the NIST 2000 CTS test set was more challenging than the SWB subset, as it contains non-native speakers of English, hence introducing accented speech into the evaluation set." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-101", "text": "The use of articulatory features helped to reduce the error rates for both SWB and CH test sets, indicating their robustness to model spontaneous speech in both native (SWB) and non-native (CH) speaking styles." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-102", "text": "The FSH corpus contains speech from quite a diverse set of speakers, helping to reduce the WER of the CH subset more significantly than the SWB subset, a trend reflected in results reported in the literature [8] ." 
}, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-103", "text": "Table 4 shows the system fusion results after dumping 2000-best lists from the rescored lattices from each individual system of different front-end features with fMLLR, i.e., MFB, DOC, MFB+DOC, MFB+DOC+TV, then conducting M-way combination of the subsystems using N-best ROVER [27] implemented in SRILM [28] ." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-104", "text": "In this system fusion experiment, all subsystems have equal weights for N-best ROVER." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-105", "text": "As can be seen from the table, N-best ROVER based 2-way and 3-way system fusion produced a further 2% and 4% relative reduction in WER compared to the best single system (MFB+fMLLR + DOC+fMLLR + TV), for SWB and CH evaluation sets respectively." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-106", "text": "Note that the first row of Table 4 is the last row of Table 3 , i.e., the best single system." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-107", "text": "The last row 4way fusion is from combining the 4 individual systems presented in Table 3 ." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-108", "text": "----------------------------------" }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-109", "text": "**CONCLUSION**" }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-110", "text": "We reported the results exploring multiple features for ASR on English CTS data." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-111", "text": "We observed that the fMLLR transform helped reduce the WER of the baseline system significantly." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-112", "text": "We observed that using multiple acoustic features helped in improving the overall accuracy of the system." 
}, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-113", "text": "Use of robust features and articulatory features significantly reduced the WER for the more challenging CallHome subset of the NIST 2000 CTS evaluation set, with accented speech in that subset." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-114", "text": "We developed a fused-CNN-DNN architecture, where input convolution was only performed on the acoustic features and the articulatory features were process by a feed-forward layer." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-115", "text": "We found this architecture effective for combining acoustic features and articulatory features." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-116", "text": "The robust features and articulatory features capture complementary information, and the addition of them resulted in the best single system performance, with 12% relative reduction of WER on SWB and CH evaluation sets respectively, compared to the MFB+fMLLR CNN baseline." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-117", "text": "Note that in this study the language model has not been optimized." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-118", "text": "Future studies should investigate RNN or other neural network-based language modeling techniques that are known to perform better than word n-gram LMs." }, { "sent_id": "d72f0608fddd1bf1cdef7ca6a20bdf-C001-119", "text": "Also, advanced acoustic modeling, through the use of timedelayed neural nets (TDNNs), long short-term memory neural nets (LSTMs), and the VGG nets, should also be explored as their performance has been mostly reported using MFB features, and the use of multi-view features can help further improve their performance." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "d72f0608fddd1bf1cdef7ca6a20bdf-C001-13" ], [ "d72f0608fddd1bf1cdef7ca6a20bdf-C001-16" ], [ "d72f0608fddd1bf1cdef7ca6a20bdf-C001-82" ] ], "cite_sentences": [ "d72f0608fddd1bf1cdef7ca6a20bdf-C001-13", "d72f0608fddd1bf1cdef7ca6a20bdf-C001-16", "d72f0608fddd1bf1cdef7ca6a20bdf-C001-82" ] }, "@USE@": { "gold_contexts": [ [ "d72f0608fddd1bf1cdef7ca6a20bdf-C001-24" ], [ "d72f0608fddd1bf1cdef7ca6a20bdf-C001-76", "d72f0608fddd1bf1cdef7ca6a20bdf-C001-82" ], [ "d72f0608fddd1bf1cdef7ca6a20bdf-C001-92" ] ], "cite_sentences": [ "d72f0608fddd1bf1cdef7ca6a20bdf-C001-24", "d72f0608fddd1bf1cdef7ca6a20bdf-C001-82", "d72f0608fddd1bf1cdef7ca6a20bdf-C001-92" ] }, "@SIM@": { "gold_contexts": [ [ "d72f0608fddd1bf1cdef7ca6a20bdf-C001-76", "d72f0608fddd1bf1cdef7ca6a20bdf-C001-82" ], [ "d72f0608fddd1bf1cdef7ca6a20bdf-C001-92" ], [ "d72f0608fddd1bf1cdef7ca6a20bdf-C001-102" ] ], "cite_sentences": [ "d72f0608fddd1bf1cdef7ca6a20bdf-C001-82", "d72f0608fddd1bf1cdef7ca6a20bdf-C001-92", "d72f0608fddd1bf1cdef7ca6a20bdf-C001-102" ] }, "@FUT@": { "gold_contexts": [ [ "d72f0608fddd1bf1cdef7ca6a20bdf-C001-118", "d72f0608fddd1bf1cdef7ca6a20bdf-C001-119" ] ], "cite_sentences": [ "d72f0608fddd1bf1cdef7ca6a20bdf-C001-119" ] } } }, "ABC_264bdb348c13f167768fd859b047e8_8": { "x": [ { "sent_id": "264bdb348c13f167768fd859b047e8-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-2", "text": "In this paper, we extend the persona-based sequenceto-sequence (Seq2Seq) neural network conversation model to multi-turn dialogue by modifying the state-of-the-art hredGAN architecture." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-3", "text": "To achieve this, we introduce an additional input modality into the encoder and decoder of hredGAN to capture other attributes such as speaker identity, location, sub-topics, and other external attributes that might be available from the corpus of human-to-human interactions." 
}, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-4", "text": "The resulting persona hredGAN (phredGAN) shows better performance than both the existing persona-based Seq2Seq and hredGAN models when those external attributes are available in a multi-turn dialogue corpus." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-5", "text": "This superiority is demonstrated on TV drama series with character consistency (such as Big Bang Theory and Friends) and customer service interaction datasets such as the Ubuntu Dialogue Corpus in terms of perplexity, BLEU, ROUGE, and distinct n-gram scores." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-6", "text": "----------------------------------" }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-7", "text": "**I. INTRODUCTION**" }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-8", "text": "Recent advances in machine learning, especially with deep neural networks, have led to tremendous progress in natural language processing and dialogue modeling research [1] - [3] ." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-9", "text": "Nevertheless, developing a good conversation model capable of fluent interactions between a human and a machine is still in its infancy." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-10", "text": "Most existing work relies on a limited dialogue history to produce responses with the assumption that the model parameters will capture all the modalities within a dataset." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-11", "text": "However, this is often not the case, as dialogue corpora tend to be strongly multi-modal, and practical neural network models find it difficult to disambiguate characteristics such as speaker personality, location, and sub-topic in the data." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-12", "text": "Most work in this domain has primarily focused on optimizing dialogue consistency."
}, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-13", "text": "For example, Serban et al. [3] - [5] and Xing et al. [6] introduce a Hierarchical Recurrent Encoder-Decoder (HRED) network architecture that combines a series of recurrent neural networks to capture long-term context state within a dialogue." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-14", "text": "However, the HRED system suffers from a lack of diversity and does not support any guarantees on the generator output since the output conditional probability is not calibrated." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-15", "text": "Olabiyi et al. [7] tackle this problem by training a modified HRED generator alongside an adversarial discriminator in order to provide a stronger guarantee to the generator's output." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-16", "text": "While the hredGAN system improves upon response quality, it does not capture speaker and other attribute modalities within a dataset and fails to generate persona-specific responses in datasets with multiple modalities." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-17", "text": "At the same time, there has been some recent work on introducing persona into dialogue models." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-18", "text": "For example, Li et al. [8] integrate learnable speaker attributes into a single-turn (Seq2Seq) generative dialogue model." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-41", "text": "The hredGAN proposed by Olabiyi et" }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-42", "text": "al. [7] contains three major components." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-19", "text": "In this work, Li et al. consider persona models: one with Speaker-only representation and another with Speaker and Addressee representations (Speaker-Addressee model), both of which capture aspects of speaker identity and interactions."
}, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-20", "text": "Nguyen et al. [9] continue the same line of thought by considering a Seq2Seq dialogue model with Responder-only representations." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-21", "text": "In both cases, the attribute representation is learned during the system training." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-22", "text": "Zhang et al. [10] propose a slightly different approach." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-23", "text": "Here, the attributes are a set of sentences describing the profile of the speaker." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-24", "text": "In this case, the attribute representation is not learned." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-25", "text": "The system however learns how to attend to different parts of the attributes during training." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-26", "text": "Still, the above persona-based models have a limited dialogue history (single-turn), suffer from exposure bias, which worsens the trade-off between personalization and conversation quality, and cannot generate multiple responses given a dialogue context." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-27", "text": "This is evident in the relatively short and generic responses, even though they generally capture the persona of the speaker." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-28", "text": "To overcome these limitations, we propose phredGAN, a multi-modal hredGAN dialogue system which additionally conditions the adversarial framework proposed by Olabiyi et al. [7] on speaker and/or utterance attributes in order to maintain the response quality of hredGAN while still capturing speaker and other modalities within a conversation." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-29", "text": "The attributes can be seen as another input modality alongside the utterance."
}, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-30", "text": "The attribute representation is an embedding that is learned together with the rest of the model parameters, similar to [8] ." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-31", "text": "The introduction of attributes allows the model to generate responses conditioned on particular attribute(s) across conversation turns." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-32", "text": "Since the attributes are discrete, it also allows for exploring what-if scenarios of model responses." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-33", "text": "We train and sample the proposed phredGAN following a procedure similar to that of hredGAN [7] ." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-34", "text": "To demonstrate model capability, we train on customer service related data such as the Ubuntu Dialogue Corpus (UDC), which is strongly bimodal between question poster and answerer, and on character-consistent TV scripts from two popular series, The Big Bang Theory and Friends, with quantitative and qualitative analysis." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-35", "text": "We demonstrate system superiority over hredGAN and the state-of-the-art persona conversational model in terms of perplexity, BLEU, ROUGE, and distinct n-gram scores." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-36", "text": "----------------------------------" }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-37", "text": "**II. MODEL ARCHITECTURE**" }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-38", "text": "In this section, we briefly introduce the state-of-the-art hredGAN model and subsequently show how we derive the persona version by combining it with the distributed representation of the dialogue speaker and utterance attributes." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-39", "text": "----------------------------------" }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-40", "text": "**A. HREDGAN: ADVERSARIAL LEARNING FRAMEWORK**" }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-43", "text": "Encoder: The encoder consists of three RNNs: the aRNN, which encodes an utterance for attention memory; the eRNN, which encodes an utterance for dialogue context; and the cRNN, which encodes a multi-turn dialogue context from the eRNN outputs." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-44", "text": "The final states of aRNN and cRNN are concatenated to form the final encoder state." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-45", "text": "Generator: The generator contains a single decoder RNN, dRNN, initialized with the final encoder state." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-46", "text": "The dRNN inputs consist of the cRNN output, the embedding of the ground-truth or previously generated tokens, as well as noise samples and the attention over the aRNN outputs." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-47", "text": "Discriminator: The discriminator also contains a single RNN, DRNN, initialized by the final encoder state." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-48", "text": "In the case of hredGAN [7] , it is a bidirectional RNN that discriminates at the word level to capture both the syntactic and semantic difference between the ground truth and the generator output." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-49", "text": "Problem Formulation: The hredGAN [7] formulates multi-turn dialogue response generation as follows: given a dialogue history consisting of a sequence of utterances X_i = (X_1, X_2, ..., X_i), generate a response Y_i = (Y_i^1, Y_i^2, ..., Y_i^{T_i})," }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-50", "text": "where T_i is the number of generated tokens." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-51", "text": "The framework uses a conditional GAN structure to learn a mapping from an observed dialogue history to a sequence of output tokens."
}, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-52", "text": "The generator, G, is trained to produce sequences that cannot be distinguished from the ground truth by an adversarially trained discriminator, D, akin to a two-player min-max optimization problem." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-53", "text": "The generator is also trained to minimize the cross-entropy loss L_MLE(G) between the ground truth X_{i+1} and the generator output Y_i." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-54", "text": "The following objective summarizes both goals:" }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-55", "text": "where \u03bb_G and \u03bb_M are hyperparameters, and L_cGAN(G, D) and L_MLE(G) are defined in Eqs. (5) and (7) of [7] , respectively." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-56", "text": "Note that the generator G and discriminator D share the same encoder and embedding representation of the word tokens." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-57", "text": "----------------------------------" }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-58", "text": "**B. PHREDGAN : PERSONA ADVERSARIAL LEARNING FRAMEWORK**" }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-59", "text": "The proposed architecture of phredGAN is very similar to that of hredGAN summarized above." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-60", "text": "The only difference is that the dialogue history is now X_i = (X_1, C_1), (X_2, C_2), \u00b7\u00b7\u00b7, (X_i, C_i), where C_i is an additional input that represents the speaker and/or utterance attributes." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-61", "text": "Note that C_i can either be a sequence of tokens or a single token such that C_i^j \u2208 V_c for attribute vocabulary V_c. The embedding for attribute tokens is also learned, similar to that of word tokens."
}, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-62", "text": "The modified system is as follows:" }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-63", "text": "Encoder: In addition to the three RNNs in the encoder of hredGAN, if the attribute C_i is a sequence of tokens, then another RNN, sattRNN, is used to summarize the token embeddings; otherwise a single attribute embedding is concatenated with the output of eRNN as shown in Fig. 1 ." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-64", "text": "Generator: In addition to the dRNN in the generator of hredGAN, if the attribute C_{i+1} is a sequence of tokens, then another RNN, tattRNN, is used to summarize the attribute token embeddings; otherwise a single attribute embedding is concatenated with the other inputs of dRNN as shown in Fig. 1 ." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-65", "text": "Discriminator: In addition to the DRNN in the discriminator of hredGAN, if the attribute C_{i+1} is a sequence of tokens, then the same tattRNN is used to summarize the attribute token embeddings; otherwise the single attribute embedding is concatenated with the other inputs of DRNN as in Fig. 1 of [7] ." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-66", "text": "Noise Injection: Although [7] demonstrated that injecting noise at the word level seems to perform better than at the utterance level for hredGAN, we found that this is dataset-dependent for phredGAN." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-67", "text": "The phredGAN models with utterance-level and word-level noise injection are tagged phredGAN_u and phredGAN_w, respectively." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-68", "text": "The losses L_cGAN(G, D) and L_MLE(G) in Eq. (1) are then respectively updated as:" }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-69", "text": "The addition of speaker or utterance attributes allows the dialogue model to exhibit personality traits, giving consistent responses across style, gender, location, and so on." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-70", "text": "----------------------------------" }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-71", "text": "**III. MODEL TRAINING AND INFERENCE**" }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-72", "text": "----------------------------------" }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-73", "text": "**A. MODEL TRAINING**" }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-74", "text": "We train both the generator and the discriminator (with a shared encoder) of phredGAN using the same training procedure as in Algorithm 1 of [7] , with \u03bb_G = \u03bb_M = 1." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-75", "text": "Since the encoder, word embeddings and attribute embeddings are shared, we are able to train the system end-to-end with backpropagation." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-76", "text": "Encoder: The encoder RNNs aRNN and eRNN are bidirectional, while cRNN is unidirectional." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-77", "text": "All RNN units are 3-layer GRU cells with a hidden state size of 512." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-78", "text": "We use a word vocabulary of size V = 50,000, with a word embedding size of 512." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-79", "text": "The number of attributes, V_c, is dataset-dependent. From Algorithm 1 of [7]: compute the generator output similar to Eq. (11) in [7] ." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-80", "text": "Sample a corresponding mini-batch of utterances Y_i."
}, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-81", "text": "Y_i \u223c P_{\u03b8_G}(Y_i | Z_i, X_i, C_{i+1}); end for. Compute the adversarial discriminator accuracy D_acc over N \u2212 1 utterances." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-82", "text": "If D_acc < acc_D_th, then update phredGAN's \u03b8_D with the gradient of the discriminator loss. We use an attribute embedding size of 512." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-83", "text": "In this study, we use only one attribute per utterance, so there is no need to use the RNNs sattRNN and tattRNN to combine the attribute embeddings." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-84", "text": "Generator: The generator RNN, dRNN, is also a 3-layer GRU cell with a hidden state size of 512." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-85", "text": "The aRNN outputs are connected to the dRNN input using an additive attention mechanism [11] ." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-86", "text": "Discriminator: The discriminator RNN, DRNN, is a bidirectional RNN, with each 3-layer GRU cell having a hidden state size of 512." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-87", "text": "The output of both the forward and the backward cells for each word are concatenated and passed to a fully-connected layer with binary output." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-88", "text": "The output is the probability that the word is from the ground truth given the past and future words of the sequence as well as the responding speaker's embedding." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-89", "text": "Others: All parameters are initialized with Xavier uniform random initialization [12] ." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-90", "text": "Due to the large word vocabulary size, we use sampled softmax loss [13] for the MLE loss to expedite the training process."
}, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-91", "text": "However, we use full softmax for model evaluation." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-92", "text": "The parameter update is conditioned on the discriminator accuracy performance as in [7] , with acc_D_th = 0.99 and acc_G_th = 0.75." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-93", "text": "The model is trained end-to-end using the stochastic gradient descent algorithm." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-94", "text": "Finally, the model is implemented, trained, and evaluated using the TensorFlow deep learning framework." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-95", "text": "----------------------------------" }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-96", "text": "**B. MODEL INFERENCE**" }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-97", "text": "We use an inference strategy similar to the approach in Olabiyi et" }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-98", "text": "al. [7] ." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-99", "text": "The only differences between training and inference are: (i) the generator is run in autoregressive mode with greedy decoding by passing the previously generated word token to the input of the dRNN at the next step." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-100", "text": "(ii) A modified noise sample N (0, \u03b1I) is passed into the generator input." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-101", "text": "For the modified noise sample, we perform a linear search for \u03b1 with sample size L = 1 based on the average discriminator loss, \u2212log D(G(\u00b7)) [7] , using trained models run in autoregressive mode to reflect performance in actual deployment." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-102", "text": "The optimum \u03b1 value is then used for all inferences and evaluations."
}, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-103", "text": "During inference, we condition the dialogue response generation on the encoder outputs, noise samples, word embeddings, and the attribute embedding of the intended responder." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-104", "text": "With multiple noise samples, L = 64, we rank the generator outputs with the discriminator, which is also conditioned on the encoder outputs and the intended responder's embedding." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-105", "text": "The final response is the response ranked highest by the discriminator." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-106", "text": "----------------------------------" }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-107", "text": "**IV. EXPERIMENTS AND RESULTS**" }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-108", "text": "In this section, we explore phredGAN's results on two conversational datasets and compare its performance to the persona system in Li et al. [8] and hredGAN [7] in terms of quantitative and qualitative measures." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-109", "text": "----------------------------------" }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-110", "text": "**A. DATASETS**" }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-111", "text": "TV Series Transcripts dataset [3] ." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-112", "text": "We train our model on transcripts from the two popular TV drama series, Big Bang Theory and Friends." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-113", "text": "Following a preprocessing setup similar to [8] , we collect utterances from the top 12 speakers from both series to construct a corpus of 5,008 lines of multi-turn dialogue."
}, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-114", "text": "We split the corpus into training, development, and test sets with 94%, 3%, and 3% proportions, respectively. We also use the MTC dataset [14] , which consists of 240,000 dialogue triples." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-115", "text": "We pre-train our model on this dataset to initialize our model parameters and avoid overfitting on the relatively small persona TV series dataset." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-116", "text": "After pre-training on MTC, we reinitialize the attribute embeddings in the generator from a uniform distribution following a Xavier initialization [12] for training on the combined persona TV series dataset." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-117", "text": "Ubuntu Dialogue Corpus (UDC) dataset [4] ." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-118", "text": "We train our model on 1.85 million conversations of multi-turn dialogue from the Ubuntu community hub, with an average of 5 utterances per conversation." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-119", "text": "We assign two types of speaker IDs to utterances in this dataset: questioner and helper." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-120", "text": "We follow the same training, development, and test split as the UDC dataset in [7] , with 90%, 5%, and 5% proportions, respectively." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-121", "text": "While the overwhelming majority of utterances in UDC follow two speaker types, the dataset does include utterances that are not classified under either a questioner or helper speaker type." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-122", "text": "To remain consistent, we assume that there are only two speaker types within this dataset and that the first utterance of every dialogue is from a questioner."
}, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-123", "text": "This simplifying assumption does introduce a degree of noise into the model's ability to construct attribute embeddings." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-124", "text": "However, our experimental results demonstrate that our model is still able to differentiate between the larger two speaker types in the dataset." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-125", "text": "----------------------------------" }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-126", "text": "**B. EVALUATION METRICS**" }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-127", "text": "We use evaluation metrics similar to those in [7] , including perplexity, BLEU [15] , ROUGE [16] , and distinct n-gram [17] scores." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-128", "text": "----------------------------------" }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-129", "text": "**C. BASELINE**" }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-130", "text": "We compare our system to [8] , which uses a Seq2Seq framework in conjunction with learnable persona embeddings." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-131", "text": "Their work explores two persona models to incorporate vector representations of speaker interaction and speaker attributes into the decoder of their Seq2Seq model, i.e., the Speaker and Speaker-Addressee models." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-132", "text": "While we compare with both models quantitatively, we mostly compare with the Speaker-Addressee model qualitatively." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-133", "text": "Our quantitative comparison uses perplexity and BLEU-4 scores, as those are the ones reported in [8] ." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-134", "text": "In addition, we measure our model performance in terms of ROUGE and distinct n-gram scores for completeness."
}, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-135", "text": "For fair comparison, we use the same TV drama series dataset used in their study." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-136", "text": "We also compare our system to hredGAN from [7] in terms of perplexity, ROUGE, and distinct n-gram scores." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-137", "text": "In [7] , the authors recommend the version with word-level noise injection, hredGAN_w, so we use this version in our comparison." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-138", "text": "Also for fair comparison, we use the same UDC dataset as reported in [7] ." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-139", "text": "The only addition we made is to add the speaker attribute to the utterances of the dataset as described in the Dataset subsection." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-140", "text": "----------------------------------" }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-141", "text": "**D. HYPERPARAMETER SEARCH**" }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-142", "text": "Prior to evaluation, we determine the noise injection method and the optimum noise variance \u03b1 that are suitable for the two datasets." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-143", "text": "We consider the two variations of phredGAN, that is, phredGAN_u for utterance-level and phredGAN_w for word-level noise injection." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-144", "text": "We notice a partial mode collapse with phredGAN_w on the combined TV transcripts, likely due to the high variation of word-level perturbation on a very limited dataset." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-145", "text": "However, this issue was rectified by phredGAN_u."
}, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-146", "text": "Therefore, we use phredGAN_u for the combined TV series dataset and phredGAN_w for the UDC dataset." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-147", "text": "We perform a linear search for optimal noise variance values between 1 and 30 at an increment of 1, with a sample size of L = 1." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-148", "text": "We obtained an optimal \u03b1 of 2 with phredGAN_u for the combined TV series dataset and an optimal \u03b1 of 5 with phredGAN_w for the much larger UDC dataset." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-149", "text": "----------------------------------" }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-150", "text": "**E. RESULTS**" }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-151", "text": "After training the phredGAN models on the TV series and UDC datasets, we ran inference on some example dialogue contexts." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-152", "text": "The responses and their discriminator scores from phredGAN are listed in Table I ." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-153", "text": "The table shows that phredGAN (i) can handle multi-turn dialogue context with utterances and corresponding persona attributes; (ii) generates responses conditioned on a persona attribute; and (iii) generates multiple responses per dialogue context and scores their human likelihood with the discriminator." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-154", "text": "We observe that the discriminator score is generally reasonable, with longer, more informative, and more persona-related responses receiving higher scores." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-155", "text": "It is worth noting that this behavior, although similar to the behavior of a human judge, is learned without supervision."
}, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-156", "text": "Furthermore, we observe that phredGAN responses retain contextual consistency, sometimes referencing background information inherent in the conversation between two speakers." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-157", "text": "For example, in the second TV series sample in Table I , the phredGAN generator, conditioned as Leonard, refers to Sheldon by name." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-158", "text": "Also, in the third sample, phredGAN, conditioned as Raj, refers to Penny when responding to Leonard, who happens to be Penny's boyfriend." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-159", "text": "We observe similar persona-based response generation for the UDC dataset, with distinct communication styles between the asker and the helper." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-160", "text": "We now present performance comparisons of phredGAN against the baselines, hredGAN and Li et al.'s persona Seq2Seq models." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-161", "text": "1) Quantitative Analysis: We first report the performance of phredGAN_u on the TV series transcripts in Table II ." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-162", "text": "Our system performs slightly worse than both the Speaker and Speaker-Addressee systems in [8] in terms of the perplexity measure." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-163", "text": "This is because the entropy of multi-turn dialogue is higher than that of single-turn." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-164", "text": "Similar observations have been made by Serban et al. [3] about Seq2Seq and HRED dialogue models." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-165", "text": "However, phredGAN_u gives a significantly larger BLEU-4 score than both the Speaker and Speaker-Addressee models."
}, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-166", "text": "We attribute this improvement to (i) the multi-turn dialogue context, and (ii) training in an adversarial framework, which forces the model to produce longer, more informative, and diverse responses that have high topic relevance even with a limited dataset." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-167", "text": "Also, unlike the Speaker-Addressee model, which suffers from lower response quality due to persona conditioning, conditioning the generator and discriminator of phredGAN on speaker embeddings does not compromise the system's ability to produce diverse responses." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-168", "text": "This problem might have been alleviated by the adversarial training too." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-169", "text": "We also compare phredGAN with the recommended variant of hredGAN that includes word-level noise injection at the decoder on the UDC dataset." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-170", "text": "The results are summarized in Table III ." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-171", "text": "We note an improvement in a variety of evaluation metrics including perplexity, ROUGE, and distinct n-grams, with the exception of distinct 2-grams." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-172", "text": "This is expected, as phredGAN should generally be less diverse than hredGAN since the number of distinct data distribution modes is larger for the phredGAN dataset due to the persona attributes." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-173", "text": "However, this leads to better response quality with persona, something not achievable with hredGAN."
}, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-174", "text": "Also, the better ROUGE(f1) score indicates that phredGAN is able to strike a better balance between diversity and precision while still capturing the characteristics of the speaker attribute modality in the UDC dataset." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-175", "text": "2) Qualitative Analysis: A qualitative assessment of these results are in Table IV with responses from several characters in the TV series dataset and the two characters in UDC." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-176", "text": "We see that for TV drama series, phredGAN responses are comparatively more informative than that of the Speaker-Addressee model of [8] ." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-177", "text": "For example, with Speaker-Addressee model, nearly all the characters in the TV series respond with \"Of course I love you.\" to the dialogue context, \"Do Helper i ' m not to be able to get the grub to the grub menu you love me?\" despite the fact that some of the responders sometimes have unfriendly relationship with the addressee." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-178", "text": "Many of the novel situations explored by phredGAN are unachievable with the Speaker-Addressee model due to a lack of informative responses." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-179", "text": "For example, by conditioning as Sheldon from The Big Bang Theory and asking \"Do you like me?\", our model responds with annoyance if conditioned as Penny (\"No, you don't understand." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-180", "text": "You're an idiot\"), brevity with Leonard (\"Yes?\"), and sarcasm with Raj (\"Well , you know , we could be a little more than my friend's friends.\") The wide range of responses indicate our model's ability to construct distinct attribute embeddings for each character even from a limited dataset." 
}, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-181", "text": "The other interesting responses in Table IV indicate our model's ability to infer not only the context of the conversation but important character information about the Addressee." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-182", "text": "We also see similar results with our model's output on UDC." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-183", "text": "We demonstrate that by conditioning as either a helper or questioner from the UDC dataset, our model is able to respond differently to input utterances as well as stay close to the context of the conversation." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-184", "text": "----------------------------------" }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-185", "text": "**V. CONCLUSION AND FUTURE WORK**" }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-186", "text": "In this article, we improve upon state-of-the-art personabased response generation models by adding attribute embeddings to hredGAN in order to capture a representation of speaker identity and style in a multi-turn dialogue." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-187", "text": "Our model outperforms the existing Speaker-Model and Speaker-Addressee models from [8] with respect to BLEU score and even improves upon hredGAN with respect to ROUGE, distinct n-grams, and perplexity scoring." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-188", "text": "We also see a qualitative improvement in response quality that more clearly separates speaker identity from even limited datasets." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-189", "text": "In the future, we hope to carry out a human evaluation to confirm qualitative improvement of our model's outputs." 
}, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-190", "text": "We also hope to extend this work to Multi-class GAN by optimizing over two separate adversarial losses, one for speaker attribute classification and the other for dialogue generation real/fake discrimination." }, { "sent_id": "264bdb348c13f167768fd859b047e8-C001-191", "text": "This will allow us to further improve on persona distinctions without the loss of response quality." } ], "y": { "@BACK@": { "gold_contexts": [ [ "264bdb348c13f167768fd859b047e8-C001-14", "264bdb348c13f167768fd859b047e8-C001-15", "264bdb348c13f167768fd859b047e8-C001-16", "264bdb348c13f167768fd859b047e8-C001-17" ], [ "264bdb348c13f167768fd859b047e8-C001-41", "264bdb348c13f167768fd859b047e8-C001-42" ], [ "264bdb348c13f167768fd859b047e8-C001-48" ], [ "264bdb348c13f167768fd859b047e8-C001-49", "264bdb348c13f167768fd859b047e8-C001-50", "264bdb348c13f167768fd859b047e8-C001-51" ], [ "264bdb348c13f167768fd859b047e8-C001-52", "264bdb348c13f167768fd859b047e8-C001-53", "264bdb348c13f167768fd859b047e8-C001-54", "264bdb348c13f167768fd859b047e8-C001-55" ], [ "264bdb348c13f167768fd859b047e8-C001-65" ] ], "cite_sentences": [ "264bdb348c13f167768fd859b047e8-C001-15", "264bdb348c13f167768fd859b047e8-C001-48", "264bdb348c13f167768fd859b047e8-C001-49", "264bdb348c13f167768fd859b047e8-C001-55", "264bdb348c13f167768fd859b047e8-C001-65" ] }, "@MOT@": { "gold_contexts": [ [ "264bdb348c13f167768fd859b047e8-C001-14", "264bdb348c13f167768fd859b047e8-C001-15", "264bdb348c13f167768fd859b047e8-C001-16" ] ], "cite_sentences": [ "264bdb348c13f167768fd859b047e8-C001-15" ] }, "@USE@": { "gold_contexts": [ [ "264bdb348c13f167768fd859b047e8-C001-28" ], [ "264bdb348c13f167768fd859b047e8-C001-33" ], [ "264bdb348c13f167768fd859b047e8-C001-62", "264bdb348c13f167768fd859b047e8-C001-65" ], [ "264bdb348c13f167768fd859b047e8-C001-74" ], [ "264bdb348c13f167768fd859b047e8-C001-79" ], [ "264bdb348c13f167768fd859b047e8-C001-92" ], [ "264bdb348c13f167768fd859b047e8-C001-101" 
], [ "264bdb348c13f167768fd859b047e8-C001-120" ], [ "264bdb348c13f167768fd859b047e8-C001-127" ], [ "264bdb348c13f167768fd859b047e8-C001-137" ], [ "264bdb348c13f167768fd859b047e8-C001-138" ] ], "cite_sentences": [ "264bdb348c13f167768fd859b047e8-C001-28", "264bdb348c13f167768fd859b047e8-C001-33", "264bdb348c13f167768fd859b047e8-C001-65", "264bdb348c13f167768fd859b047e8-C001-74", "264bdb348c13f167768fd859b047e8-C001-79", "264bdb348c13f167768fd859b047e8-C001-92", "264bdb348c13f167768fd859b047e8-C001-101", "264bdb348c13f167768fd859b047e8-C001-120", "264bdb348c13f167768fd859b047e8-C001-127", "264bdb348c13f167768fd859b047e8-C001-137", "264bdb348c13f167768fd859b047e8-C001-138" ] }, "@EXT@": { "gold_contexts": [ [ "264bdb348c13f167768fd859b047e8-C001-28" ], [ "264bdb348c13f167768fd859b047e8-C001-33", "264bdb348c13f167768fd859b047e8-C001-35" ], [ "264bdb348c13f167768fd859b047e8-C001-59", "264bdb348c13f167768fd859b047e8-C001-60" ], [ "264bdb348c13f167768fd859b047e8-C001-62", "264bdb348c13f167768fd859b047e8-C001-66" ] ], "cite_sentences": [ "264bdb348c13f167768fd859b047e8-C001-28", "264bdb348c13f167768fd859b047e8-C001-33", "264bdb348c13f167768fd859b047e8-C001-66" ] }, "@SIM@": { "gold_contexts": [ [ "264bdb348c13f167768fd859b047e8-C001-33" ], [ "264bdb348c13f167768fd859b047e8-C001-59" ], [ "264bdb348c13f167768fd859b047e8-C001-79" ], [ "264bdb348c13f167768fd859b047e8-C001-108" ], [ "264bdb348c13f167768fd859b047e8-C001-136" ] ], "cite_sentences": [ "264bdb348c13f167768fd859b047e8-C001-33", "264bdb348c13f167768fd859b047e8-C001-79", "264bdb348c13f167768fd859b047e8-C001-108", "264bdb348c13f167768fd859b047e8-C001-136" ] }, "@DIF@": { "gold_contexts": [ [ "264bdb348c13f167768fd859b047e8-C001-35" ], [ "264bdb348c13f167768fd859b047e8-C001-59", "264bdb348c13f167768fd859b047e8-C001-60" ], [ "264bdb348c13f167768fd859b047e8-C001-66" ], [ "264bdb348c13f167768fd859b047e8-C001-108" ], [ "264bdb348c13f167768fd859b047e8-C001-136" ] ], "cite_sentences": [ 
"264bdb348c13f167768fd859b047e8-C001-66", "264bdb348c13f167768fd859b047e8-C001-108", "264bdb348c13f167768fd859b047e8-C001-136" ] } } }, "ABC_668e8967d702d4538c85935de083f7_8": { "x": [ { "sent_id": "668e8967d702d4538c85935de083f7-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-2", "text": "Translating Japanese to English is difficult because they belong to different language families." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-3", "text": "Na\u00efve phrase-based statistical machine translation (SMT) often fails to address syntactic difference between Japanese and English." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-4", "text": "Preordering methods are one of the simple but effective approaches that can model reordering in a long distance, which is crucial in translating Japanese and English." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-5", "text": "Thus, we apply a predicate-argument structure-based preordering method to the Japanese-English statistical machine translation task of scientific papers." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-6", "text": "Our method is based on the method described in (Hoshino et al., 2013) , and extends their rules to handle abbreviation and passivization frequently found in scientific papers." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-7", "text": "Experimental results show that our proposed method improves performance of both (Hoshino et al., 2013) 's system and our phrase-based SMT baseline without preordering." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-8", "text": "----------------------------------" }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-10", "text": "Preordering method is one of the popular techniques in statistical machine translation." 
}, { "sent_id": "668e8967d702d4538c85935de083f7-C001-11", "text": "Preordering the word order of source language in advance can enhance alignments on a pair of languages with a large difference in syntax like japanese and English, and thus improve performance of machine translation system." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-49", "text": "----------------------------------" }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-50", "text": "**PARENTHESIS PREORDERING**" }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-12", "text": "One of the advantages of preordering is that it can incorporate rich linguistic information on the source side, whilst off-the-shelf SMT toolkit can be plugged in without any modification." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-13", "text": "Preordering methods employ various kinds of linguistic information to achieve better alignment between source and target languages." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-14", "text": "Specifically, previous work in the literature uses morphological analysis (Katz-Brown and Collins, 2008) , dependency structure (Katz-Brown and Collins, 2008) and predicate-argument structure (Komachi et al., 2006; Hoshino et al., 2013) for preordering in Japanese-English statistical machine translation." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-15", "text": "However, these preordering methods are tested on limited domains: travel (Komachi et al., 2006) and patent (Katz-Brown and Collins, 2008; Hoshino et al., 2013) corpora." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-16", "text": "Translating Japanese to English in a different domain such as scientific papers is still a big challenge for preordering-based approach." 
}, { "sent_id": "668e8967d702d4538c85935de083f7-C001-17", "text": "For example, academic writing in English traditionally relies on passive voice to give an objective impression, but one can use either passive construction or a zeropronoun in the Japanese translation of passive construction on the English side." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-18", "text": "It is not clear whether existing preordering rules are applicable to scientific domain due to such stylistic difference." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-19", "text": "Predicate-argument structure-based preordering is one of the promising approaches that can solve syntactic and stylistic difference between a language pair." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-20", "text": "Predicate-argument structure analysis identifies who does what to whom and generalizes grammatical relations such as active and passive construction." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-21", "text": "Following (Hoshino et al., 2013) , we perform predicate-argument structure analysis on the Japanese side to preorder Japanese sentences to form an SVO-like word order." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-22", "text": "We propose three modifications to the preordering rules to extend their model to better handle translation of scientific papers." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-23", "text": "The main contribution of this work is as follows:" }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-24", "text": "\u2022 We propose an extension to (Hoshino et al., 2013) in order to deal with abbreviation and passivization frequently found in scientific papers." 
}, { "sent_id": "668e8967d702d4538c85935de083f7-C001-25", "text": "----------------------------------" }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-26", "text": "**PREVIOUS WORK**" }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-27", "text": "There are several related work that take preordering approaches to Japanese-English statistical machine translation." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-28", "text": "First, Komachi et al. (2006) suggested a preordering approach for Japanese-English speech translation in travel domain based on predicateargument structure." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-29", "text": "They used an in-house predicate-argument structure analyzer and reordered Japanese sentences using heuristic rules." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-30", "text": "They only performed preordering at sentencelevel, whereas other Japanese-English preordering methods we will describe below perform preordering at both sentence-and phrase-level 1 ." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-31", "text": "Second, Katz-Brown and Collins (2008) presented two preordering methods for JapaneseEnglish patent translation based on morphological analysis and dependency structure, respectively." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-32", "text": "Morphological analysis-based method splits sentences into segments by punctuation and a topic marker (\" \"), and then reverses the segments." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-33", "text": "Dependency analysis-based method reorders segments into a head-initial sentence, and moves verbs to make an SVO-like structure." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-34", "text": "Unlike (Komachi et al., 2006) , they also reverse all words in each phrase." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-35", "text": "Third, Hoshino et al. 
(2013) proposed predicate-argument structure-based preordering rules in two-level for the Japanese-English patent translation task." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-36", "text": "The first is sentence-level and the second is phrase-level." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-37", "text": "Furthermore, sentence-level preordering rules are divided into three parts." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-38", "text": "In total, sentences are reordered sequentially by four rules." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-39", "text": "Since this method is the one we re-implemented in this paper, we will describe their method in detail below." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-40", "text": "Pseudo head-initialization Since Japanese is a head-final language but English is a head-initial language, this rule transforms a Japanese dependency tree as to become a head-initial phrase sequence." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-41", "text": "Concretely, we move the last phrase, which is a predicate of a Japanese sentence in almost all cases, to the beginning of the sentence." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-42", "text": "We then order each phrase as their children located immediately after them." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-43", "text": "Inter-chunk preordering We move a predicate of a sentence to an adequate place." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-44", "text": "If a subject exists in a sentence 2 , the predicate is moved imme-Inter-chunk normalization We restore the order of coordinate expressions which are reversed by the first rule." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-45", "text": "Also, since a full stop is moved along with the predicate, we restore it back to the end of the sentence." 
}, { "sent_id": "668e8967d702d4538c85935de083f7-C001-46", "text": "Intra-chunk preordering We apply the phraselevel rule, which swaps function words and content words in a phrase." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-47", "text": "It will improve alignments because function words in Japanese (e.g. postposition) appear after content words while those in English (e.g. preposition) appear before content words." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-48", "text": "3 Extension to (Hoshino et al., 2013) Our proposed preordering model is based on (Hoshino et al., 2013) with three extensions to better handle academic writing in scientific papers." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-51", "text": "Scientific papers often include parenthetical expressions." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-52", "text": "The training data (1,000,000 parallel sentences, hereafter referred to as 1M training corpus) contains 168,336 (16.8%) parentheses on Japanese side, and 187,400 parentheses on English side." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-53", "text": "However, Japanese dependency analyzer fails to parse parenthetical expressions appropriately." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-54", "text": "In particular, if a parenthesis is used to show apposition (e.g. abbreviation and acronym), the pseudo head-initialization described in the last section swaps an acronym and its full spelling, which is not desirable." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-55", "text": "Therefore, we modify the pseudo head-initialization rule to restore the position of parenthetical expressions." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-56", "text": "Figures 1a and 1b illustrate how parenthesis preordering transforms original sentences." 
}, { "sent_id": "668e8967d702d4538c85935de083f7-C001-57", "text": "Parenthesis preordering rule handles not only single phrase parenthetical expressions but also multiple phrase parenthetical expressions." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-58", "text": "----------------------------------" }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-59", "text": "**PASSIVE VOICE PREORDERING**" }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-60", "text": "In scientific papers, zero-pronouns in Japanese are often translated into passive construction in English." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-61", "text": "The number of passive construction in the 1M training corpus is 166,057 (17%), whereas the number of active construction starting with \"They . . .\" and \"It is . . .\" are 4,700 and 17,104 (2%), respectively." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-62", "text": "Hence, we move a predicate to the end of the sentence if there exists no subject in active voice." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-63", "text": "Figure 1c describes how this rule transforms a Japanese sentence with a zero-pronoun." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-64", "text": "Even though the Japanese side is in active voice, English translation is expressed in passive voice." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-65", "text": "Note that a Japanese sentence in active voice may be translated into different expressions even in the same passive construction (e.g. \". . ." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-66", "text": "(explained . . .)\" can be either \". ." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-67", "text": ". was explained\" or \"It was explained that . . .\".)." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-68", "text": "Hoshino et al. (2013) proposed to move a predicate after the subject (inter-chunk preordering)." 
}, { "sent_id": "668e8967d702d4538c85935de083f7-C001-69", "text": "However, if a subject is modified by other phrases, this rule moves the predicate to the middle of a subjective phrase composed of multiple phrases." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-70", "text": "Thus, we move a predicate to the end of the subjective phrase." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-71", "text": "Table 1d depicts how subject preordering moves a predicate in a sentence." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-72", "text": "As we can see, this rule prevents subjective phrase \" | (Ventilation increase of the COPD patients)\" to be split by the predicate movement." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-73", "text": "----------------------------------" }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-74", "text": "**SUBJECT PREORDERING**" }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-75", "text": "----------------------------------" }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-76", "text": "**EXPERIMENTS**" }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-77", "text": "We compared translation performance using a standard phrase-based statistical machine translation technique with three kinds of data:" }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-78", "text": "\u2022 original data (baseline)," }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-79", "text": "\u2022 preordered data by our re-implementation of (Hoshino et al., 2013) , and" }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-80", "text": "\u2022 preordered data by our proposed methods." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-81", "text": "We analyzed predicate-argument structure of only the last predicate for each sentence, regardless of the number of predicates in a sentence." 
}, { "sent_id": "668e8967d702d4538c85935de083f7-C001-82", "text": "Also, following (Hoshino et al., 2013) , we did not consider event nouns as predicates." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-83", "text": "----------------------------------" }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-84", "text": "**EXPERIMENTAL SETTINGS**" }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-85", "text": "We used 1M Japanese-English parallel sentences extracted from scientific papers (train-1.txt) from the Asian Scientific Paper Excerpt Corpus (ASPEC) 3 ." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-86", "text": "We varied the size of the training corpus and used the best size determined by preliminary experiments." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-87", "text": "We identified predicate-argument structure in Japanese by SynCha 4 0.3." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-88", "text": "It uses MeCab 5 0.996 with IPADic 2.7.0 for morphological analysis and CaboCha 6 0.68 for dependency parsing." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-89", "text": "We used SRILM 7 1.7.0 for language model, GIZA++ 8 1.0.7 for word alignment, and Moses 9 2.1.1 for decoding." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-90", "text": "We set distortion limits to default value 6 for all systems 10 ." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-91", "text": "Translation quality is evaluated in terms of BLEU (Papineni et al., 2002) and RIBES (Isozaki et al., 2010) , as determined by the workshop organizers (Nakazawa et al., 2014) ." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-92", "text": "We performed minimum error rate training (Och, 2003) optimized for BLEU using the development set (dev.txt) of the ASPEC corpus." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-93", "text": "We conducted all the experiments using the scripts distributed at KFTT Moses Baseline v1.4 11 ." 
}, { "sent_id": "668e8967d702d4538c85935de083f7-C001-94", "text": "Table 1 shows the experimental results." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-95", "text": "In terms of BLEU, our re-implementation of (Hoshino et al., 2013) is below the baseline method while our proposed methods better than the baseline." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-96", "text": "In terms of RIBES, all preordering methods outperform the baseline, and our proposed method archieve the highest score." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-97", "text": "----------------------------------" }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-98", "text": "**EXPERIMENTAL RESULTS**" }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-99", "text": "All methods including parenthesis preordering outperform the baseline method, and when we subtract three modifications one by one from proposed method, the parenthesis rule has the largest impact on the translation quality." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-100", "text": "----------------------------------" }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-101", "text": "**DISCUSSION**" }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-102", "text": "Some of the errors found in a translation result are due to the errors in predicate-argument structure analysis." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-103", "text": "We found that it is hard for predicateargument structure analyzer trained on a newswire Table 1 : Comparison of the preordering methods." }, { "sent_id": "668e8967d702d4538c85935de083f7-C001-104", "text": "All the preordering models using (Hoshino et al., 2013) are our re-implementation of their paper." 
} ], "y": { "@USE@": { "gold_contexts": [ [ "668e8967d702d4538c85935de083f7-C001-6" ], [ "668e8967d702d4538c85935de083f7-C001-21" ], [ "668e8967d702d4538c85935de083f7-C001-39" ], [ "668e8967d702d4538c85935de083f7-C001-48" ], [ "668e8967d702d4538c85935de083f7-C001-77", "668e8967d702d4538c85935de083f7-C001-79" ], [ "668e8967d702d4538c85935de083f7-C001-82" ], [ "668e8967d702d4538c85935de083f7-C001-104" ] ], "cite_sentences": [ "668e8967d702d4538c85935de083f7-C001-6", "668e8967d702d4538c85935de083f7-C001-21", "668e8967d702d4538c85935de083f7-C001-39", "668e8967d702d4538c85935de083f7-C001-48", "668e8967d702d4538c85935de083f7-C001-79", "668e8967d702d4538c85935de083f7-C001-82", "668e8967d702d4538c85935de083f7-C001-104" ] }, "@EXT@": { "gold_contexts": [ [ "668e8967d702d4538c85935de083f7-C001-6" ], [ "668e8967d702d4538c85935de083f7-C001-24" ], [ "668e8967d702d4538c85935de083f7-C001-48" ] ], "cite_sentences": [ "668e8967d702d4538c85935de083f7-C001-6", "668e8967d702d4538c85935de083f7-C001-24", "668e8967d702d4538c85935de083f7-C001-48" ] }, "@DIF@": { "gold_contexts": [ [ "668e8967d702d4538c85935de083f7-C001-7" ], [ "668e8967d702d4538c85935de083f7-C001-95" ] ], "cite_sentences": [ "668e8967d702d4538c85935de083f7-C001-7", "668e8967d702d4538c85935de083f7-C001-95" ] }, "@BACK@": { "gold_contexts": [ [ "668e8967d702d4538c85935de083f7-C001-14" ], [ "668e8967d702d4538c85935de083f7-C001-35" ], [ "668e8967d702d4538c85935de083f7-C001-68" ] ], "cite_sentences": [ "668e8967d702d4538c85935de083f7-C001-14", "668e8967d702d4538c85935de083f7-C001-35", "668e8967d702d4538c85935de083f7-C001-68" ] }, "@MOT@": { "gold_contexts": [ [ "668e8967d702d4538c85935de083f7-C001-15" ], [ "668e8967d702d4538c85935de083f7-C001-24" ] ], "cite_sentences": [ "668e8967d702d4538c85935de083f7-C001-15", "668e8967d702d4538c85935de083f7-C001-24" ] } } }, "ABC_e9e733d38affa8a39a633ffb4d9d71_8": { "x": [ { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-15", "text": "The digitisation efforts have made the Sanskrit 
manuscripts easily available in the public domain." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-16", "text": "However, the accessibility of such digitised manuscripts is still limited." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-197", "text": "----------------------------------" }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-118", "text": "The typed paths are designed with human supervision which is also linguistically involved." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-2", "text": "There is abundance of digitised texts available in Sanskrit." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-3", "text": "However, the word segmentation task in such texts are challenging due to the issue of Sandhi." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-4", "text": "In Sandhi, words in a sentence often fuse together to form a single chunk of text, where the word delimiter vanishes and sounds at the word boundaries undergo transformations, which is also reflected in the written text." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-5", "text": "Here, we propose an approach that uses a deep sequence to sequence (seq2seq) model that takes only the sandhied string as the input and predicts the unsandhied string." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-6", "text": "The state of the art models are linguistically involved and have external dependencies for the lexical and morphological analysis of the input." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-7", "text": "Our model can be trained \"overnight\" and be used for production." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-8", "text": "In spite of the knowledge lean approach, our system preforms better than the current state of the art by gaining a percentage increase of 16.79 % than the current state of the art." 
}, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-9", "text": "----------------------------------" }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-11", "text": "Sanskrit had profound influence as the knowledge preserving language for centuries in India." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-12", "text": "The tradition of learning and teaching Sanskrit, though limited, still exists in India." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-13", "text": "There have been tremendous advancements in digitisation of ancient manuscripts in Sanskrit in the last decade." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-14", "text": "Numerous initiatives such as the Digital Corpus of Sanskrit 1 , GRETIL 2 , The Sanskrit Library 3 and others from the Sanskrit Linguistic and Computational Linguistic community is a fine example of such efforts (Goyal et al., 2012; Krishna et al., 2017) ." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-17", "text": "Numerous technical challenges in indexing and retrieval of such resources in a digital repository arise due to the linguistic peculiarities posed by the language." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-18", "text": "Word Segmentation in Sanskrit is an important yet non-trivial prerequisite for facilitating efficient processing of Sanskrit texts." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-19", "text": "Sanskrit has been primarily communicated orally." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-20", "text": "Due to its oral tradition, the phonemes in Sanskrit undergo euphonic assimilation in spoken format." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-21", "text": "This gets reflected in writing as well and leads to the phenomena of Sandhi (Goyal and Huet, 2016) ." 
}, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-22", "text": "Sandhi leads to phonetic transformations at word boundaries of a written chunk, and the sounds at the end of a word join together to form a single chunk of character sequence." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-23", "text": "This not only makes the word boundaries indistinguishable, but also transforms the characters at the word boundaries." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-24", "text": "The transformations can be deletion, insertion or substitution of one or more sounds at the word ends." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-25", "text": "There are about 281 sandhi rules, each denoting a unique combination of phonetic transformations, documented in the grammatical tradition of Sanskrit." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-26", "text": "The proximity between two compatible sounds as per any one of the 281 rules is the sole criterion for sandhi. [Footnotes: *The first two authors contributed equally. 1 http://kjc-sv013.kjc.uni-heidelberg.de/dcs/ 2 http://gretil.sub.uni-goettingen.de/ 3 http://sanskritlibrary.org/]" }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-27", "text": "Sandhi does not make any syntactic or semantic changes to the words involved." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-28", "text": "Sandhi is an optional operation relying solely on the discretion of the writer." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-29", "text": "While the Sandhi formation is deterministic, the analysis of Sandhi is non-deterministic and leads to a high level of ambiguity." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-30", "text": "For example, the chunk 'gardabhas ." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-31", "text": "c\u0101\u015bva\u015bca' (the ass and the horse) has 625 possible phonetically and lexically valid splits (Hellwig, 2015) ."
}, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-32", "text": "Now, the correct split relies on the semantic compatibility between the split words." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-33", "text": "The word segmentation problem is a well-studied problem across various languages where segmentation is non-trivial." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-34", "text": "For languages such as Chinese and Japanese, where there are no explicit boundary markers between the words (Xue, 2003) , numerous sequence labelling approaches have been proposed." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-35", "text": "In Sanskrit, it can be seen that the merging of word boundaries is at the discretion of the writer." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-36", "text": "In this work, we propose a purely engineering-based pipeline for segmentation of Sanskrit sentences." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-37", "text": "The word segmentation problem is a structured prediction problem and we propose a deep sequence to sequence (seq2seq) model to solve the task." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-38", "text": "We use an encoder-decoder framework where the sandhied (unsegmented) and the unsandhied (segmented) sequences are treated as the input at the encoder and the output at the decoder, respectively." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-39", "text": "We train the model so as to maximise the conditional probability of predicting the unsandhied sequence given its corresponding sandhied sequence (Cho et al., 2014) ." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-40", "text": "We propose a knowledge-lean data-centric approach for the segmentation task." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-145", "text": "Table 1 shows the performance of the competing systems."
}, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-41", "text": "Our approach will help to scale the segmentation process, avoiding the challenges posed by the knowledge-involved processes in the current systems (Krishna et al., 2017) ." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-42", "text": "We only use parallel segmented and unsegmented sentences during training." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-43", "text": "At run-time, we only require the input sentence." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-44", "text": "Our model can literally be trained overnight." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-45", "text": "Our best performing model takes less than 12 hours to train on a 'Titan X' system with 12 GB memory and 3584 GPU cores." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-46", "text": "Our title for the paper is inspired by the title of the work by Wang et al. (2015) ." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-47", "text": "As with the original paper, we want to emphasise the ease with which our system can be used for training and at runtime, as it does not require any linguistically involved preprocessing." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-48", "text": "Such requirements often limit the scalability of a system, and the tediousness involved in the process limits the usability of a system." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-49", "text": "Since Sanskrit is a resource scarce language, we use sentencepiece (Schuster and Nakajima, 2012) , an unsupervised text tokeniser, to obtain a new vocabulary for a corpus that maximises the likelihood of the language model so learnt." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-50", "text": "We propose a pipeline for finding the semantically most valid segmented word-forms for a given sentence."
}, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-51", "text": "Our model uses multiple layers of LSTM cells with attention." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-52", "text": "Our model outperforms the current state of the art by 16.79 %." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-53", "text": "----------------------------------" }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-54", "text": "**MODELS FOR WORD SEGMENTATION IN SANSKRIT**" }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-55", "text": "A number of methods have been proposed for word segmentation in Sanskrit." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-56", "text": "Hellwig (2015) treats the problem as a character level RNN sequence labelling task." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-57", "text": "The author, in addition to reporting sandhi splits for up to 155 cases, categorises the rules into 5 different types." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-58", "text": "Since the results reported by the author are not at word level, as is standard with word segmentation systems in general, a direct comparison with the other systems is not meaningful." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-59", "text": "Mittal (2010) proposed a method based on Finite State Transducers by incorporating rules of sandhi." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-60", "text": "The system generates all possible splits and then provides a ranking of various splits, based on probabilistic ranking inferred from a dataset of 25,000 split points." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-61", "text": "Using the same dataset, Natarajan and Charniak (2011) proposed a sandhi splitter for Sanskrit." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-62", "text": "The method is an extension of the Bayesian word segmentation approach by Goldwater et al. (2006) ."
}, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-63", "text": "Krishna et al. (2016) is currently the state of the art in Sanskrit word segmentation." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-64", "text": "The system treats the problem as an iterative query expansion problem." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-65", "text": "Using a shallow parser for Sanskrit (Goyal et al., 2012) , an input sentence is first converted to a graph of possible candidates and desirable nodes are iteratively selected using Path Constrained Random Walks (Lao and Cohen, 2010) ." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-198", "text": "**CONCLUSION**" }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-66", "text": "To further catalyse the research in word segmentation for Sanskrit, Krishna et al. (2017) has released a dataset for the word segmentation task." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-67", "text": "The work releases a dataset of 119,000 sentences in Sanskrit along with the lexical and morphological analysis from a shallow parser." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-68", "text": "The work emphasises the need for not just predicting the inflected word form but also the prediction of the associated morphological information of the word." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-69", "text": "The additional information will be beneficial in further processing of Sanskrit texts, such as dependency parsing or summarisation (Krishna et al., 2017) . So far, no system successfully predicts the morphological information of the words in addition to the final word form." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-70", "text": "Though Krishna et al. (2016) designed their system with this requirement in mind and outlined a possible extension of their system for the purpose, the system currently only predicts the final word-form."
}, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-71", "text": "----------------------------------" }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-72", "text": "**METHOD**" }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-73", "text": "We use an encoder-decoder framework for tackling our segmentation problem, and propose a deep seq2seq model using LSTMs for our prediction task." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-74", "text": "Our model follows the architecture from Wu et al. (2016) , originally proposed for neural machine translation." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-75", "text": "We consider the pair of sandhied and unsandhied sentences as source and target sentences, respectively." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-76", "text": "Following the insights from Sutskever et al. (2014), we reverse the sequence order at the input and find that this reversal leads to improvement in the results." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-77", "text": "We also use a deep architecture with 3 layers each at the encoder and decoder, as it is shown that deeper models perform better than shallow LSTM models." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-78", "text": "We also experiment with models with and without attention and find that the model with attention leads to considerable improvement in performance of the system (Wu et al., 2016) ." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-79", "text": "Given the training set S, our training objective is to maximise the log probability of the segmented sequences T where the unsegmented sequences S are given." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-80", "text": "The training objective is to maximise (1/|S|) \u2211_(T,S)\u2208S log P(T|S), i.e., the average log probability of the correct segmented sequence over the training set." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-81", "text": "For a new sentence, we need to output a sequence T with maximum likelihood for the given input."
}, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-82", "text": "LSTMs are used both at the encoder and decoder." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-83", "text": "We use a softmax layer at the decoder and perform greedy decoding to obtain the final prediction." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-84", "text": "The outputs are then passed to the loss function which calculates the log-perplexity over the data samples in the batch." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-85", "text": "We then update the parameters via backpropagation and use the Adam optimiser (Kingma and Ba, 2015) for our model." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-86", "text": "Vocabulary enhancement for the model - Since Sanskrit is a resource-poor language, the major challenge is to obtain enough data for the supervised task." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-87", "text": "While there are plenty of sandhied texts available for Sanskrit, it is hard to find parallel or unsandhied texts alone, as it is deterministic to get sandhied text from unsandhied texts." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-88", "text": "In our case we use 105,000 parallel strings from the Digital Corpus of Sanskrit as released in Krishna et al. (2017) ." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-89", "text": "To handle the data sparsity, we adopt a purely engineering-based approach for our model." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-196", "text": "We leave this for future work." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-90", "text": "Rather than relying on the real word boundaries, we use the 'sentencepiece' model, an unsupervised text tokeniser (Schuster and Nakajima, 2012) to obtain a new vocabulary for the corpus."
}, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-91", "text": "The method was originally proposed for the segmentation problem in Japanese and Korean speech recognition systems." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-92", "text": "In the method, a greedy approach is used to identify new word units from a corpus that maximise the likelihood of the language model so learnt (Schuster and Nakajima, 2012) ." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-93", "text": "Figure 1 shows an instance of words learnt from the sentencepiece model corresponding to the original input from the corpus." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-94", "text": "In the sentencepiece model, the 'space' in the original input is also treated as a character and is replaced with the special symbol ' '." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-95", "text": "So 'am . r\u0101ma' is a word in our vocabulary, which is originally part of two words." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-96", "text": "Our model is fed only the 'words' from the new vocabulary, henceforth to be referred to as 'GibberishVocab'." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-97", "text": "Note that the decoder also outputs words from GibberishVocab." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-98", "text": "The output from the decoder is then converted to the original vocabulary for evaluating the outputs." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-99", "text": "This is trivially done by reclaiming the original 'space' as the delimiter for the old vocabulary."
}, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-100", "text": "----------------------------------" }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-101", "text": "**EXPERIMENTS**" }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-102", "text": "----------------------------------" }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-103", "text": "**DATASET**" }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-104", "text": "We used a dataset of 107,000 sentences from the Sanskrit Word Segmentation Dataset (Krishna et al., 2017) ." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-105", "text": "The dataset is a subset of the Digital Corpus of Sanskrit." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-106", "text": "From the dataset we only use the input sentences and the ground truth inflected word-forms." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-107", "text": "We ignore all the other morphological and lemma information available in the dataset." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-108", "text": "----------------------------------" }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-109", "text": "**BASELINES**" }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-110", "text": "We compare the performance of our system with two other baseline systems." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-111", "text": "supervisedPCRW - This is the current state of the art for word segmentation in Sanskrit." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-112", "text": "The method treats the problem as an iterative query expansion task." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-113", "text": "This is a linguistically involved approach, as a lexicon-driven shallow parser is first used to obtain all the phonetically valid segments for a given sentence." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-114", "text": "The sentence is then converted into a graph with the segments as the nodes."
}, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-115", "text": "The edges are formed between every pair of nodes which can co-exist in a sentence and are not competing for the same input position." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-116", "text": "The edge weights are formed by a weighted sum of random walks across typed paths." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-117", "text": "The authors use typed paths to obtain extended contexts about the word pairs from the candidate pool." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-119", "text": "GraphCRF - We use a structured prediction approach using graph based Conditional Random Fields, where we first obtain the possible candidate segments using the shallow parser and then convert the segments into a graph." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-120", "text": "For every node segment, we learn a word vector using fastText (Bojanowski et al., 2016) ." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-121", "text": "segSeq2Seq - This is the proposed model as described in Section 3, but without attention." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-122", "text": "attnSegSeq2Seq - This is the proposed model as described in Section 3, with attention." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-123", "text": "We report all our results on a test data of 4,200 sentences which was not used in any part of the training." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-124", "text": "From the dataset we ignore about 7,500 sentences which are neither part of the training nor the test set." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-125", "text": "We used 214,000 strings from both the input and output sides of the training data to obtain the GibberishVocab using the sentencepiece model."
}, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-126", "text": "We use string-wise micro-averaged precision, recall and F-Score to evaluate our model, as is standard when evaluating word segmentation models." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-127", "text": "We find that the default vocabulary size of 8,000 for the GibberishVocab works best." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-128", "text": "Of the 8,000 'words', the encoder vocabulary size is 7,944 and the decoder vocabulary size is 7,464." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-129", "text": "This shows the high overlap in the vocabulary in GibberishVocab at both input and output sides, in spite of the difference in phonetic transformations due to sandhi." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-130", "text": "Originally the training data contained 60,308 segmented words at the output side." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-131", "text": "By reducing the vocabulary size at the decoder side to 7,464, we make the probability distribution (softmax) at the decoder layer denser." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-132", "text": "Even if we had followed a linguistic approach, there would have been 16,473 unique lemmas in the training dataset." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-133", "text": "----------------------------------" }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-134", "text": "**TRAINING PROCEDURE AND HYPERPARAMETERS**" }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-135", "text": "Our models have 3 layers at both the encoder and decoder." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-136", "text": "The models contain an embedding layer which is a trainable matrix with individual word vectors of size 128." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-137", "text": "Our LSTM layers consist of 128 cells at both the encoder and decoder layers."
}, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-138", "text": "We train with a batch size of 128 and set the sequence length of each sequence to 35." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-139", "text": "The initial learning rate was set at 0.001 and we trained our system for 80 epochs after which the network parameters converged." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-140", "text": "We used the Adam optimiser with parameter values \u03b21 and \u03b22 set to 0.9 and 0.999, respectively." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-141", "text": "We use dropout in the hidden layers with different settings from 0.1 to 0.4 in step sizes of 0.1." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-142", "text": "We find that a dropout of 0.2 is the best performing configuration." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-143", "text": "Dropout helps to avoid over-fitting (Srivastava et al., 2014) ." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-144", "text": "Both the 'segSeq2Seq' and 'attnSegSeq2Seq' models follow the same architecture and have the same hyperparameter settings and vary only on the attention mechanism." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-146", "text": "We can find that the system 'attnSegSeq2Seq' outperforms the current state of the art with a percent increase of 16.29 % in F-Score." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-147", "text": "The model 'segSeq2Seq' falls short of the current state of the art with a percent decrease of 6.29 % in F-Score." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-148", "text": "It needs to be noted that the systems 'attnSegSeq2Seq' and 'segSeq2Seq' are exactly the same architecture, differing only in the addition of attention in the former. Yet there is a percentage increase of 23.61 % between the two systems."
}, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-149", "text": "One probable reason for this is the free word order nature of sentences in Sanskrit." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-150", "text": "Since there are multiple permutations of words in a sentence which are valid syntactically and convey the same semantic meaning, the entire input context is required to understand the meaning of a sentence for any distributional semantic model." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-151", "text": "Figure 2 shows the results of the competing systems on strings of different lengths in terms of words in the sentence." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-152", "text": "This should not be confused with sequence length." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-153", "text": "Here, 'word' refers to the original vocabulary and is common to all the competing systems." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-154", "text": "For all the strings with up to 10 words, our system 'attnSegSeq2Seq' consistently outperforms all the systems in terms of both precision and recall." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-155", "text": "The current state of the art performs slightly better than our system, for sentences with more than 10 words." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-156", "text": "It needs to be noted that the average length of a string in the Digital Corpus of Sanskrit is 6.7 (Krishna et al., 2016) ." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-157", "text": "The proportion of sentences with more than 10 words in our dataset is less than 1 %." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-158", "text": "The test dataset has slightly more than 4 % sentences with 10 or more words."
}, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-159", "text": "The 'segSeq2Seq' model performs better than the state of the art for both Precision and Recall for strings with less than or equal to 6 words." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-160", "text": "Figure 2a shows the proportion of sentences in the test data based on the frequency of words in it." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-161", "text": "Figure 2b shows the proportion of strings in the test dataset based on the number of words in the strings." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-162", "text": "Our attnSegSeq2Seq system takes 11 hours 40 minutes overall for 80 epochs on a system with a 'Titan X' GPU (12GB GPU memory, 3584 GPU cores), 62GB RAM and an Intel Xeon E5-2620 2.40GHz CPU." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-163", "text": "For segSeq2Seq, it takes 7 hours in the same setting." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-164", "text": "----------------------------------" }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-165", "text": "**RESULTS**" }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-166", "text": "----------------------------------" }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-167", "text": "**DISCUSSION**" }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-168", "text": "The purpose of our proposed model is purely to identify the word splits and the correct inflected word forms from a sandhied string." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-169", "text": "The word-level indexing in retrieval systems is often affected by phonetic transformations in words due to sandhi." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-170", "text": "For example, the term 'parame\u015bvarah . ' is split as 'parama' (ultimate) and '\u012b\u015bvarah ." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-171", "text": "' (god)."
}, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-172", "text": "Now, a search for instances of the word '\u012b\u015bvarah . ' might lead to missing search results without proper indexing." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-173", "text": "String matching approaches often result in low precision results." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-174", "text": "Using a lexicon driven system might alleviate the said issues, but can lead to possible splits which are not semantically compatible." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-175", "text": "For parame\u015bvarah ." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-176", "text": "', it can be split as 'parama' (ultimate), '\u015bva' (dog) and 'rah . ' (to facilitate)." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-177", "text": "Though this is not semantically meaningful, it is lexically valid." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-178", "text": "Such tools are put to use by some of the existing systems (Krishna et al., 2016; Mittal, 2010 ) to obtain additional morphological or syntactic information about the sentences." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-179", "text": "This limits the scalability of those systems, as they cannot handle out of vocabulary words." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-180", "text": "Scalability of such systems is further restricted as the sentences often need to undergo linguistically involved preprocessing steps that lead to human in the loop processing." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-181", "text": "The systems by Krishna et al. (2016) and Krishna et al. (2017) assume that the parser by Goyal et al. (2012) identifies all the possible candidate chunks." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-182", "text": "Our proposed model is built with precisely one purpose in mind, which is to predict the final word-forms in a given sequence."
}, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-183", "text": "Krishna et al. (2017) states that it is desirable to predict the morphological information of a word along with the final word-form, as the information will be helpful in further processing of Sanskrit." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-184", "text": "The segmentation task is seen as a means and not an end itself." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-185", "text": "Here, we overlook this aspect and see the segmentation task as an end in itself." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-186", "text": "So we achieve scalability at the cost of missing out on providing valuable linguistic information." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-187", "text": "Models that use linguistic resources are at an advantage here." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-188", "text": "Systems such as Krishna et al. (2016) could be extended to identify morphological tags, as they currently store the morphological information of predicted candidates, but do not use it for evaluation as of now." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-189", "text": "Currently, no system exists that performs the prediction of wordform and morphological information jointly for Sanskrit." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-190", "text": "In our case, since we learn a new vocabulary altogether, the real word boundaries are opaque to the system." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-191", "text": "The decoder predicts from its own vocabulary." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-192", "text": "But predicting morphological information requires the knowledge of exact word boundaries." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-193", "text": "This should be seen as a multi-task learning setup."
}, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-194", "text": "One possible solution is to learn 'GibberishVocab' on the set of words rather than sentences. But this leads to an increased vocabulary at the decoder, which is not beneficial given the scarcity of the data we have." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-195", "text": "Given the importance of morphological segmentation in morphologically rich languages such as Hebrew and Arabic (Seeker and \u00c7etinoglu, 2015) , the same applies to the morphologically rich Sanskrit as well (Krishna et al., 2017) ." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-199", "text": "In this work we presented a model for word segmentation in Sanskrit using a purely engineering-based approach." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-200", "text": "Our model with attention outperforms the current state of the art (Krishna et al., 2016) ." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-201", "text": "Since we tackle the problem with a non-linguistic approach, we hope to extend the work to other Indic languages where sandhi is prevalent, such as Hindi, Marathi, Malayalam, Telugu, etc." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-202", "text": "Since we find that the inclusion of attention is highly beneficial in improving the performance of the system, we intend to experiment with recent advances in encoder-decoder architectures, such as Vaswani et al. (2017) and Gehring et al. (2017) , where different novel approaches to using attention are experimented with." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-203", "text": "Our experiments, in line with the measures reported in Krishna et al. (2016) , show that our system performs robustly across strings of varying word size."
}, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-204", "text": "----------------------------------" }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-205", "text": "**CODE AND DATASET**" }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-206", "text": "All our working code can be downloaded at https: //github.com/cvikasreddy/skt." }, { "sent_id": "e9e733d38affa8a39a633ffb4d9d71-C001-207", "text": "The dataset for training can be downloaded at https://zenodo.org/ record/803508#.WTuKbSa9UUs" } ], "y": { "@BACK@": { "gold_contexts": [ [ "e9e733d38affa8a39a633ffb4d9d71-C001-63" ], [ "e9e733d38affa8a39a633ffb4d9d71-C001-156" ], [ "e9e733d38affa8a39a633ffb4d9d71-C001-181" ], [ "e9e733d38affa8a39a633ffb4d9d71-C001-188" ] ], "cite_sentences": [ "e9e733d38affa8a39a633ffb4d9d71-C001-63", "e9e733d38affa8a39a633ffb4d9d71-C001-156", "e9e733d38affa8a39a633ffb4d9d71-C001-181", "e9e733d38affa8a39a633ffb4d9d71-C001-188" ] }, "@MOT@": { "gold_contexts": [ [ "e9e733d38affa8a39a633ffb4d9d71-C001-69", "e9e733d38affa8a39a633ffb4d9d71-C001-70" ], [ "e9e733d38affa8a39a633ffb4d9d71-C001-169", "e9e733d38affa8a39a633ffb4d9d71-C001-173", "e9e733d38affa8a39a633ffb4d9d71-C001-174", "e9e733d38affa8a39a633ffb4d9d71-C001-177", "e9e733d38affa8a39a633ffb4d9d71-C001-178", "e9e733d38affa8a39a633ffb4d9d71-C001-179", "e9e733d38affa8a39a633ffb4d9d71-C001-180", "e9e733d38affa8a39a633ffb4d9d71-C001-181", "e9e733d38affa8a39a633ffb4d9d71-C001-182" ], [ "e9e733d38affa8a39a633ffb4d9d71-C001-188" ] ], "cite_sentences": [ "e9e733d38affa8a39a633ffb4d9d71-C001-70", "e9e733d38affa8a39a633ffb4d9d71-C001-178", "e9e733d38affa8a39a633ffb4d9d71-C001-179", "e9e733d38affa8a39a633ffb4d9d71-C001-180", "e9e733d38affa8a39a633ffb4d9d71-C001-181", "e9e733d38affa8a39a633ffb4d9d71-C001-188" ] }, "@DIF@": { "gold_contexts": [ [ "e9e733d38affa8a39a633ffb4d9d71-C001-188", "e9e733d38affa8a39a633ffb4d9d71-C001-189", "e9e733d38affa8a39a633ffb4d9d71-C001-190" ], [ "e9e733d38affa8a39a633ffb4d9d71-C001-200" ] ], "cite_sentences": 
[ "e9e733d38affa8a39a633ffb4d9d71-C001-188", "e9e733d38affa8a39a633ffb4d9d71-C001-200" ] }, "@USE@": { "gold_contexts": [ [ "e9e733d38affa8a39a633ffb4d9d71-C001-203" ] ], "cite_sentences": [ "e9e733d38affa8a39a633ffb4d9d71-C001-203" ] } } }, "ABC_fdfb8fbdb8544dca17b1aeba768124_8": { "x": [ { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-2", "text": "An empirical comparison of CFG filtering techniques for LTAG and HPSG is presented." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-3", "text": "We demonstrate that an approximation of HPSG produces a more effective CFG filter than that of LTAG." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-4", "text": "We also investigate the reason for that difference." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-5", "text": "----------------------------------" }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-7", "text": "Various parsing techniques have been developed for lexicalized grammars such as Lexicalized Tree Adjoining Grammar (LTAG) (Schabes et al., 1988) , and Head-Driven Phrase Structure Grammar (HPSG) (Pollard and Sag, 1994) ." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-8", "text": "Along with the independent development of parsing techniques for individual grammar formalisms, some of them have been adapted to other formalisms (Schabes et al., 1988; van Noord, 1994; Yoshida et al., 1999; Torisawa et al., 2000) ." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-9", "text": "However, these realizations sometimes exhibit quite different performance in each grammar formalism (Yoshida et al., 1999; )." 
}, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-10", "text": "If we could identify an algorithmic difference that causes performance difference, it would reveal advantages and disadvantages of the different realizations." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-11", "text": "This should also allow us to integrate the advantages of the realizations into one generic parsing technique, which yields the further advancement of the whole parsing community." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-12", "text": "In this paper, we compare CFG filtering techniques for LTAG (Harbusch, 1990; Poller and Becker, 1998) and HPSG (Torisawa et al., 2000; Kiefer and Krieger, 2000) , following an approach to parsing comparison among different grammar formalisms )." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-13", "text": "The key idea of the approach is to use strongly equivalent grammars, which generate equivalent parse results for the same input, obtained by a grammar conversion as demonstrated by ." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-14", "text": "The parsers with CFG filtering predict possible parse trees by a CFG approximated from a given grammar." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-15", "text": "Comparison of those parsers are interesting because effective CFG filters allow us to bring the empirical time complexity of the parsers close to that of CFG parsing." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-16", "text": "Investigating the difference between the ways of context-free (CF) approximation of LTAG and HPSG will thereby enlighten a way of further optimization for both techniques." 
}, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-17", "text": "We performed a comparison between the existing CFG filtering techniques for LTAG (Poller and Becker, 1998) and HPSG (Torisawa et al., 2000) , using strongly equivalent grammars obtained by converting LTAGs extracted from the Penn Treebank (Marcus et al., 1993) into HPSG-style." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-18", "text": "We compared the parsers with respect to the size of the approximated CFG and its effectiveness as a filter." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-19", "text": "----------------------------------" }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-20", "text": "**BACKGROUND**" }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-21", "text": "In this section, we introduce a grammar conversion ) and CFG filtering (Harbusch, 1990; Poller and Becker, 1998; Torisawa et al., 2000; Kiefer and Krieger, 2000) ." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-22", "text": "----------------------------------" }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-23", "text": "**GRAMMAR CONVERSION**" }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-24", "text": "The grammar conversion consists of a conversion of LTAG elementary trees to HPSG lexical entries and an emulation of substitution and adjunction by We can perform a comparison between LTAG and HPSG parsers using strongly equivalent grammars obtained by the above conversion." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-25", "text": "This is because strongly equivalent grammars can be a substitute for the same grammar in different grammar formalisms." 
}, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-26", "text": "----------------------------------" }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-27", "text": "**CFG FILTERING TECHNIQUES**" }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-28", "text": "An initial offline step of CFG filtering is performed to approximate a given grammar with a CFG." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-29", "text": "The obtained CFG is used as an efficient device to compute the necessary conditions for parse trees." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-30", "text": "The CFG filtering generally consists of two steps." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-31", "text": "In phase 1, the parser first predicts possible parse trees using the approximated CFG, and then filters out irrelevant edges by a top-down traversal starting from roots of successful context-free derivations." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-32", "text": "In phase 2, it then eliminates invalid parse trees by using constraints in the given grammar." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-33", "text": "We call the remaining edges that are used for the phase 2 parsing essential edges." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-34", "text": "The parsers with CFG filtering used in our experiments follow the above parsing strategy, but are different in the way the CF approximation and the elimination of impossible parse trees in phase 2 are performed." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-35", "text": "In the following sections, we briefly describe the CF approximation and the elimination of impossible parse trees in each realization." 
}, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-36", "text": "----------------------------------" }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-37", "text": "**CF APPROXIMATION OF LTAG**" }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-38", "text": "In CFG filtering techniques for LTAG (Harbusch, 1990; Poller and Becker, 1998) , every branching of elementary trees in a given grammar is extracted as a CFG rule as shown in Figure 1 ." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-39", "text": "----------------------------------" }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-40", "text": "**CFG RULES**" }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-41", "text": "Figure 2: Extraction of CFG from HPSG Because the obtained CFG can reflect only local constraints given in each local structure of the elementary trees, it generates invalid parse trees that connect local trees in different elementary trees." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-42", "text": "In order to eliminate such parse trees, a link between branchings is preserved as a node number which records a unique node address (a subscript attached to each node in Figure 1 )." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-43", "text": "We can eliminate these parse trees by traversing essential edges in a bottomup manner and recursively propagating ok-flag from a node number x to a node number y when a connection between x and y is allowed in the LTAG grammar." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-44", "text": "We call this propagation ok-prop." 
}, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-45", "text": "----------------------------------" }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-46", "text": "**CF APPROXIMATION OF HPSG**" }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-47", "text": "In CFG filtering techniques for HPSG (Torisawa et al., 2000; Kiefer and Krieger, 2000) , the extraction process of a CFG from a given HPSG grammar starts by recursively instantiating daughters of a grammar rule with lexical entries and generated feature structures until new feature structures are not generated as shown in Figure 2 ." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-48", "text": "We must impose restrictions on values of some features (i.e., ignoring them) and/or the number of rule applications in order to guarantee the termination of the rule application." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-49", "text": "A CFG is obtained by regarding each initial and generated feature structures as nonterminals and transition relations between them as CFG rules." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-50", "text": "Although the obtained CFG can reflect local and global constraints given in the whole structure of lexical entries, it generates invalid parse trees because they do not reflect upon constraints given by the values of features that are ignored in phase 1." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-51", "text": "These parse trees are eliminated in phase 2 by applying a grammar rule that corresponds to the applied CFG rule." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-52", "text": "We call this rule application rule-app." 
}, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-53", "text": "----------------------------------" }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-54", "text": "**COMPARISON WITH CFG FILTERING**" }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-74", "text": "A generated feature structure in the approximation process thus corresponds to the whole unprocessed portion of a canonical elementary tree." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-55", "text": "In this section, we compare a pair of CFG filtering techniques for LTAG (Poller and Becker, 1998) and HPSG (Torisawa et al., 2000) described in Section 2.2.1 and 2.2.2." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-56", "text": "We hereafter refer to PB and TNT for the C++ implementations of the former and a valiant 1 of the latter, respectively." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-57", "text": "2 We first acquired LTAGs by a method proposed in Miyao et al. (2003) from Sections 2-21 of the Wall Street Journal (WSJ) in the Penn Treebank (Marcus et al., 1993) and its subsets." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-58", "text": "3 We then converted them into strongly equivalent HPSG-style grammars using the grammar conversion described in Section 2.1." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-59", "text": "Table 1 shows the size of CFG approximated from the strongly equivalent grammars." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-60", "text": "G x , CFG PB , and CFG TNT henceforth refer to the LTAG extracted from Section x of WSJ and CFGs approximated from G x by PB and TNT, respectively." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-61", "text": "The size of CFG TNT is much larger than that of CFG PB ." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-62", "text": "By investigating parsing performance using these CFGs, we show that the larger size of CFG TNT resulted in better parsing performance." 
}, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-63", "text": "Table 2 shows the parse time with 254 sentences of length n (\u226410) from Section 2 of WSJ (the average length is 6.72 words)." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-64", "text": "4 This result shows not only that TNT achieved a drastic speed-up against PB, but also that performance difference between them increases with the larger size of the grammars." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-65", "text": "In order to estimate the degree of CF approximation, we measured the number of essential (inactive) edges of phase 1." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-66", "text": "Table 3 shows the number of the essential edges." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-67", "text": "The number of essential edges produced by PB is much larger than that produced by TNT." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-68", "text": "We then investigated the effect on phase 2 as caused by the different number of the essential edges." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-69", "text": "Table 4 shows the success rate of ok-prop and rule-app." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-70", "text": "The success rate of rule-app is 100%, 5 whereas that of ok-prop is quite low." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-71", "text": "6 These results indicate that CFG TNT is superior to CFG PB with respect to the degree of the CF approximation." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-72", "text": "We can explain the reason for this difference by investigating how TNT approximates HPSG-style grammars converted from LTAGs." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-73", "text": "As described in Section 2.1, the grammar conversion preserves the whole structure of each elementary tree (precisely, a canonical elementary tree) in a stack, and grammar rules manipulate a head element of the stack." 
}, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-75", "text": "This implies that successful context-free derivations obtained by CFG TNT basically involve elementary trees in which all substitution and adjunction have succeeded." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-76", "text": "However, CFG PB (also a CFG produced by the other work (Harbusch, 1990) ) cannot avoid generating invalid parse trees that connect two lo-cal structures where adjunction takes place between them." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-77", "text": "We measured with G 2-21 the proportion of the number of ok-prop between two node numbers of nodes that take adjunction and its success rate." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-78", "text": "It occupied 87% of the total number of ok-prop and its success rate was only 22%." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-79", "text": "These results suggest that the global contexts in a given grammar is essential to obtain an effective CFG filter." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-80", "text": "It should be noted that the above investigation also tells us another way of CF approximation of LTAG." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-81", "text": "We first define a unique way of tree traversal such as head-corner traversal (van Noord, 1994) on which we can perform a sequential application of substitution and adjunction." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-82", "text": "We then recursively apply substitution and adjunction on that traversal to an elementary tree and a generated tree structure." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-83", "text": "Because the processed portions of generated tree structures are no longer used later, we regard the unprocessed portions of the tree structures as nonterminals of CFG." 
}, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-84", "text": "We can thereby construct another CFG filtering for LTAG by combining this CFG filter with an existing LTAG parsing algorithm (van Noord, 1994) ." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-85", "text": "----------------------------------" }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-86", "text": "**CONCLUSION AND FUTURE DIRECTION**" }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-87", "text": "We presented an empirical comparison of LTAG and HPSG parsers with CFG filtering." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-88", "text": "We compared the parsers with strongly equivalent grammars obtained by converting LTAGs extracted from the Penn Treebank into HPSG-style." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-89", "text": "Experimental results showed that the existing CF approximation of HPSG (Torisawa et al., 2000) produced a more effective filter than that of LTAG (Poller and Becker, 1998) ." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-90", "text": "By investigating the different ways of CF approximation, we concluded that the global constraints in a given grammar is essential to obtain an effective filter." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-91", "text": "We are going to integrate the advantage of the CF approximation of HPSG into that of LTAG in order to establish another CFG filtering for LTAG." }, { "sent_id": "fdfb8fbdb8544dca17b1aeba768124-C001-92", "text": "We will also conduct experiments on trade-offs between the degree of CF approximation and the size of approximated CFGs as in Maxwell III and Kaplan (1993) ." 
} ], "y": { "@USE@": { "gold_contexts": [ [ "fdfb8fbdb8544dca17b1aeba768124-C001-12" ], [ "fdfb8fbdb8544dca17b1aeba768124-C001-2" ], [ "fdfb8fbdb8544dca17b1aeba768124-C001-17" ], [ "fdfb8fbdb8544dca17b1aeba768124-C001-21" ], [ "fdfb8fbdb8544dca17b1aeba768124-C001-27", "fdfb8fbdb8544dca17b1aeba768124-C001-28" ], [ "fdfb8fbdb8544dca17b1aeba768124-C001-34" ], [ "fdfb8fbdb8544dca17b1aeba768124-C001-38" ], [ "fdfb8fbdb8544dca17b1aeba768124-C001-55" ], [ "fdfb8fbdb8544dca17b1aeba768124-C001-84" ], [ "fdfb8fbdb8544dca17b1aeba768124-C001-91" ] ], "cite_sentences": [ "fdfb8fbdb8544dca17b1aeba768124-C001-12", "fdfb8fbdb8544dca17b1aeba768124-C001-2", "fdfb8fbdb8544dca17b1aeba768124-C001-17", "fdfb8fbdb8544dca17b1aeba768124-C001-21", "fdfb8fbdb8544dca17b1aeba768124-C001-27", "fdfb8fbdb8544dca17b1aeba768124-C001-28", "fdfb8fbdb8544dca17b1aeba768124-C001-34", "fdfb8fbdb8544dca17b1aeba768124-C001-38", "fdfb8fbdb8544dca17b1aeba768124-C001-55", "fdfb8fbdb8544dca17b1aeba768124-C001-84", "fdfb8fbdb8544dca17b1aeba768124-C001-91" ] }, "@MOT@": { "gold_contexts": [ [ "fdfb8fbdb8544dca17b1aeba768124-C001-16", "fdfb8fbdb8544dca17b1aeba768124-C001-17" ] ], "cite_sentences": [ "fdfb8fbdb8544dca17b1aeba768124-C001-17" ] }, "@BACK@": { "gold_contexts": [ [ "fdfb8fbdb8544dca17b1aeba768124-C001-21" ], [ "fdfb8fbdb8544dca17b1aeba768124-C001-28" ], [ "fdfb8fbdb8544dca17b1aeba768124-C001-30" ], [ "fdfb8fbdb8544dca17b1aeba768124-C001-38" ] ], "cite_sentences": [ "fdfb8fbdb8544dca17b1aeba768124-C001-21", "fdfb8fbdb8544dca17b1aeba768124-C001-28", "fdfb8fbdb8544dca17b1aeba768124-C001-30", "fdfb8fbdb8544dca17b1aeba768124-C001-38" ] }, "@EXT@": { "gold_contexts": [ [ "fdfb8fbdb8544dca17b1aeba768124-C001-34" ] ], "cite_sentences": [ "fdfb8fbdb8544dca17b1aeba768124-C001-34" ] }, "@DIF@": { "gold_contexts": [ [ "fdfb8fbdb8544dca17b1aeba768124-C001-34" ], [ "fdfb8fbdb8544dca17b1aeba768124-C001-89" ] ], "cite_sentences": [ "fdfb8fbdb8544dca17b1aeba768124-C001-34", "fdfb8fbdb8544dca17b1aeba768124-C001-89" ] 
}, "@FUT@": { "gold_contexts": [ [ "fdfb8fbdb8544dca17b1aeba768124-C001-91" ] ], "cite_sentences": [ "fdfb8fbdb8544dca17b1aeba768124-C001-91" ] } } }, "ABC_3e2fb3d4c1e224c084117c22a5db78_8": { "x": [ { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-2", "text": "As offensive language has become a rising issue for online communities and social media platforms, researchers have been investigating ways of coping with abusive content and developing systems to detect its different types: cyberbullying, hate speech, aggression, etc." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-3", "text": "With a few notable exceptions, most research on this topic so far has dealt with English." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-4", "text": "This is mostly due to the availability of language resources for English." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-5", "text": "To address this shortcoming, this paper presents the first Greek annotated dataset for offensive language identification: the Offensive Greek Tweet Dataset (OGTD)." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-6", "text": "OGTD is a manually annotated dataset containing 4,779 posts from Twitter annotated as offensive and not offensive." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-7", "text": "Along with a detailed description of the dataset, we evaluate several computational models trained and tested on this data." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-8", "text": "----------------------------------" }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-10", "text": "In the age of social media, offensive content online has become prevalent in recent years." 
}, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-11", "text": "There are many types of offensive content online such as racist and sexist posts and insults and threats targeted at individuals or groups." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-12", "text": "As such content increasingly occurs online, it has become a growing issue for online communities." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-13", "text": "This has come to the attention of social media platforms and authorities underlining the urgency to moderate and deal with such content." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-14", "text": "Several studies in NLP have approached offensive language identification applying machine learning and deep learning systems on annotated data to identify such content." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-15", "text": "Researchers in the field have worked with different definitions of offensive language with hate speech being the most studied among these types ." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-106", "text": "Overall, it is clear that deep learning models with word embedding feature provide better results than the classical machine learning models." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-16", "text": "(Waseem et al., 2017) investigate the similarity between these subtasks." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-17", "text": "With a few noteworthy exceptions, most research so far has dealt with English, due to the availability of language resources." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-18", "text": "This gap in the literature recently started to be addressed with studies on Spanish (Arag\u00f3n et al., 2018) , Hindi (Kumar et al., 2018) , and German (Wiegand et al., 2018) , to name a few." 
}, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-19", "text": "In this paper we contribute in this direction presenting the first Greek annotated dataset for offensive language identification: the Offensive Greek Tweet Dataset (OGTD)." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-20", "text": "OGTD uses a working definition of offensive language inspired by the OLID dataset for English (Zampieri et al., 2019a) used in the recent OffensEval (SemEval-2019 Task 6) (Zampieri et al., 2019b) ." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-21", "text": "In its version, 1.0 OGTD contains nearly 4,800 posts collected from Twitter and manually annotated by a team of volunteers, resulting in a highquality annotated dataset." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-22", "text": "We trained a number of systems on this dataset and our best results have been obtained from a system using LSTMs and GRU with attention which achieved 0.89 F1 score." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-23", "text": "----------------------------------" }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-24", "text": "**RELATED WORK**" }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-25", "text": "The bulk of work on detecting abusive posts online addressed particular types of such language like textual attacks and hate speech (Malmasi and Zampieri, 2017) , ag-gression (Kumar et al., 2018) , and others." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-26", "text": "OGTD considers a more general definition of offensiveness inspired by the first layer of the hierarchical annotation model described in (Zampieri et al., 2019a) ." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-27", "text": "(Zampieri et al., 2019a) model distinguishes targeted from general profanity, and considers the target of offensive posts as indicators of potential hate speech posts (insults targeted at groups) and cyberbulling posts (insults targeted at individuals)." 
}, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-28", "text": "Offensive Language: Previous work presented a dataset with sentences labelled as flame (i.e. attacking or containing abusive words) or okay (Razavi et al., 2010) with a Na\u00efve Bayes hybrid classifier and a user offensiveness estimation using an offensive lexicon and sentence syntactic structures (Chen et al., 2012) ." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-29", "text": "A dataset of 3.3M comments from the Yahoo Finance and News website, labelled as abusive or clean, was utilized in several experiments using ngrams, linguistic and syntactic features, combined with different types of word and comment embeddings as distributional semantics features (Nobata et al., 2016) ." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-30", "text": "The usefulness of character n-grams for abusive language detection was explored on the same dataset with three different methods ." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-31", "text": "The most recent project expanded on existing ideas for defining offensive language and presented the OLID (Offensive Language Identification Dataset), a corpus of Twitter posts hierarchically annotated on three levels, whether they contain offensive language or not, whether the offense is targeted and finally, the target of the offense (Zampieri et al., 2019a) ." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-32", "text": "A CNN (Convolutional neural network) deep learning approach outperformed every model trained, with pre-trained FastText embeddings and updateable embeddings learned by the model as features." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-33", "text": "In OffensEval (SemEval-2019 Task 6), participants had the opportunity to use the OLID to train their own systems, with the top teams outperforming the original models trained on the dataset." 
}, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-34", "text": "Hate Speech: A study dataset of tweets posted after the murder of Drummer Lee Rigby in the UK, manually annotated as offensive or antagonistic in terms of race ethnicity or religion for hate speech identification with multiple clas-sifiers (Burnap and Williams, 2015) ." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-35", "text": "A logistic regression classifier trained with paragraph2vec 1 word representations of comments from Yahoo Finance (Djuric et al., 2015) ." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-36", "text": "The latest approaches in detecting hate speech include a dataset of Twitter posts, labelled as hateful, offensive or clean, used to train a logistic regression classifier with part-of-speech and word n-grams and a sentiment lexicon and a linear SVM trained on character 4-grams, with an extra RBF SVM meta-classifier that boosts accuracy in hateful language detection ." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-37", "text": "Both attempts tried to distinguish offensive language and hate speech, with the hate class being the hardest to classify." 
}, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-38", "text": "----------------------------------" }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-39", "text": "**NON-ENGLISH DATASETS**" }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-40", "text": "Research on other languages includes datasets such as: A Dutch corpus of posts from the social networking site Ask.fm for the detection of cyberbullying (Van Hee et al., 2015) , a German Twitter corpus exploring the issue of hate speech targeted to refugees (Ross et al., 2016) , another Dutch corpus using data from two anti-Islamic groups in Facebook (Tulkens et al., 2016) , a hate speech corpus in Italian (Pelosi et al., 2017) , an abusive language corpus in Arabic (Mubarak et al., 2017) , a corpus of offensive comments from Facebook and Reddit in Danish (Sigurbergsson and Derczynski, 2020) , another Twitter corpus in German (Wiegand et al., 2018) for GermEval2018, a second Italian corpus from Facebook and Twitter (Bosco et al., 2018) , an aggressive post corpus from Mexican Twitter in Spanish (Arag\u00f3n et al., 2018) and finally an aggressive comments corpus from Facebook in Hindi (Kumar et al., 2018) ." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-41", "text": "Se-mEval 2019 presented a novel task: Multilingual detection of hate speech specifically against immigrants and women with a dataset from Twitter, in English and Spanish (Basile et al., 2019) ." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-42", "text": "----------------------------------" }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-43", "text": "**THE OGTD DATASET**" }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-44", "text": "The posts in OGTD v1.0 were collected between May and June, 2019." 
}, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-45", "text": "We used the Twitter API initially collecting tweets from popular and trending hashtags in Greece, including television programs such as series, reality and entertainment shows." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-46", "text": "Due to the municipal, regional as well as the European Parliament election taking place at the time, many hashtags included tweets discussing the elections." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-47", "text": "The intuition behind this approach is that Twitter as a microblogging service often gathers complaints and profane comments on widely viewed television and politics, and as such, this period was a good opportunity for data collection." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-48", "text": "Following the methodology described in (Zampieri et al., 2019a) and others, including a recent comparable Danish dataset (Sigurbergsson and Derczynski, 2020) , we collected tweets using keywords such as sensitive or obscene language." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-49", "text": "Queries for tweets containing common curse words and expressions usually found in offensive messages in Greek as keywords (such as the well-known word for \"asshole\", \"\u03bc\u03b1\u03bb\u03ac\u03ba\u03b1\u03c2\" (malakas) or \"go to hell\", \"\u03c3\u03c4\u03bf \u03b4\u03b9\u03ac\u03bf\u03bb\u03bf\" (sto diaolo), etc.) returned a large number of tweets." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-50", "text": "Aiming to compile a dataset including offensive tweets of diverse types (sexist, racist, etc.) targeted at various social groups, the Twitter API was queried with expletives such as \"\u03c0\u03bf\u03c5\u03c4\u03ac\u03bd\u03b1\" (poutana, \"whore\"), \"\u03ba\u03b1\u03c1\u03b9\u03cc\u03bb\u03b1\" (kariola, \"bitch\"), \"\u03c0\u03bf\u03cd\u03c3\u03c4\u03b7\u03c2\" (poustis, \"faggot\"), etc. 
and their plural forms, to explore the semantic and pragmatic differences of these expletives in their different contextual environments." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-51", "text": "The challenge is to distinguish between ironic and insulting uses of these swear words, a common phenomenon in Greek." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-52", "text": "The final query for data collection was for tweets containing \"\u03b5\u03af\u03c3\u03b1\u03b9\" (eisai, \"you are\") as a keyword, inspired by (Zampieri et al., 2019a)." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-53", "text": "This particular keyword is considered a stop word, as it is quite frequent in the language, but it was expected to prove helpful for building the dataset for this particular project, as offensive language often follows the structure: auxiliary verb (be) + noun/adjective." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-54", "text": "The immediacy of social media, and specifically Twitter, provides the opportunity to investigate targeted insults by mining tweets that include \"you are\" as a keyword." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-55", "text": "In fact, many tweets present in the dataset showed users verbally insulting other users or famous people and TV personas, confirming that \"\u03b5\u03af\u03c3\u03b1\u03b9\" was a facilitating keyword for the task in question." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-56", "text": "----------------------------------" }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-57", "text": "**PRE-PROCESSING AND ANNOTATION**" }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-58", "text": "We collected a set of 49,154 tweets."
}, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-59", "text": "URLs, Emojis and Emoticons were removed, while usernames and user mentions were filtered as @USER following the same methodology described in OLID (Zampieri et al., 2019a) ." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-60", "text": "Duplicate punctuation such as question and exclamation marks was normalized." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-61", "text": "After removing duplicate tweets, the dataset was comprised of 46,218 tweets of which 5,000 were randomly sampled for annotation." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-62", "text": "We used Light-Tag 2 to annotate the dataset due to its simple and straightforward user interface and limitless annotations, provided by the software creators." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-63", "text": "Based on explicit annotation guidelines written in Greek and our proposal of the definition of offensive language, a team of three volunteers were asked to classify each tweet found in the dataset with one of the following tags: Offensive, Not Offensive and Spam, which was introduced to filter out spam from the dataset." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-64", "text": "Inter-annotator agreement was subsequently calculated and labels with 100% agreement were deemed acceptable annotations." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-65", "text": "In cases of disagreement, labels with majority agreement above 66% were selected as the actual annotations of the tweets in question." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-66", "text": "For labels with complete disagreement between annotators, one of the authors of this paper reviewed the tweets with two extra human judges, to get the desired majority agreement above 66%." 
}, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-67", "text": "Figure 1 is a confusion matrix that shows the inter-annotator agreement or reliability, statistically measured by Cohen's kappa coefficient." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-68", "text": "The benchmark annotated dataset produced contained 4,779 tweets, containing over 29% offensive content." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-69", "text": "----------------------------------" }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-70", "text": "**METHODS**" }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-71", "text": "Before experimenting with OGTD, an unique aspect of Greek which is the accentuation of characters for correct pronunciation needed to be normalized." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-72", "text": "When posting a tweet, many users omit accents due to their haste, resulting in a mixed dataset containing fully accented tweets, partially-accented tweets, and non-accented tweets." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-73", "text": "To achieve data uniformity and to avoid ambiguity, every word is lower-cased and then normalized to its non-accented equivalent." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-74", "text": "Several experiments were conducted with the OGTD, each one utilizing a different combination from a pool of features (e.g. TF/IDF unigrams, bigrams, POS and dependency relation tags) to train machine learning models." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-75", "text": "These features were selected based on previous methodology used by researchers and taking the dataset size into consideration." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-76", "text": "The TF-IDF weighted features are often used for text classification and are useful for determining how important a word is to a post in a corpus." 
}, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-77", "text": "The threshold for corpus specific words was set to 80%, ignoring terms appearing in more than 80% of the documents while the minimum document frequency was set to 6, and both unigrams and bigrams were tested." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-78", "text": "Given the consistent use of linguistic features for training machine learning models and results from previous work for offensive language detection, partof-speech (POS) and dependency relation tags were considered as additional features." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-79", "text": "Using the spaCy 3 pipeline for Greek, POS-tags and dependency relations were extracted for every token in a tweet, which were then transformed to count matrices." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-80", "text": "A sentiment lexicon was considered, but one suitable for this project is as of yet unavailable for Greek." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-81", "text": "For the first six deep learning models we used Greek word embeddings trained on a large Greek web corpus (Outsios" }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-82", "text": "----------------------------------" }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-83", "text": "**. CLASSICAL MACHINE LEARNING MODELS**" }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-84", "text": "Every classical model was considered on the condition it could take matrices as input for fitting and was trained with the default settings because of the size of the dataset." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-85", "text": "Five models were trained: Two SVMs, one with linear kernel and the other with a radial basis function kernel (RBF), both with a value of 1 in the penalty parameter C of the error term." 
}, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-86", "text": "The gamma value of the RBF SVM which indicates how much influence a single training example has, was set to 2." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-87", "text": "The third classifier trained was another linear classifier with Stochastic Gradient Descent (SGDC) learning." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-88", "text": "The gradient of the loss is estimated each sample at a time and the SGDC is updated along the way with a decreasing learning rate." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-89", "text": "The parameters for maximum epochs and the stopping criterion were defined using the default values in scikit-learn." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-90", "text": "The final classifier was two models based on the Bayes theorem: Multinomial Na\u00efve Bayes, which works with occurrence counts, and Bernoulli Na\u00efve Bayes, which is designed for binary features." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-91", "text": "----------------------------------" }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-92", "text": "**DEEP LEARNING MODELS**" }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-93", "text": "Six different deep learning models were considered." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-94", "text": "All of these models have been used in an aggression detection task." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-95", "text": "The models are Pooled GRU (" }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-96", "text": "----------------------------------" }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-97", "text": "**RESULTS**" }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-98", "text": "The performance of individual classifiers for offensive language identification with TF/IDF unigram features is demonstrated in table 2 below." 
}, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-99", "text": "We can see that both linear classifiers (SVM and SGDC) outperform the other classifiers in terms of macro-F1, which does not take label imbalance into account." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-100", "text": "The Linear SVM and SGDC per-" }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-101", "text": "----------------------------------" }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-102", "text": "**DISCUSSION**" }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-103", "text": "The data annotated in OGTD proved to be facilitating in offensive language detection with a significant success for Greek, taking into consideration its size and label distribution, with the best model (LSTM and GRU with Attention) achieving a F1-macro of 0.89." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-104", "text": "Among the classical machine learning approaches, the linear SVM model achieved the best results, 0.80, whereas the the Stochastic Gradient Descent (SGD) learning classifier yielded the best recall score for the Offensive class, at 0.61." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-105", "text": "In terms of features used, TF/IDF matrices of word unigrams proved to work work well with multiple classical ML classifiers." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-107", "text": "Of the linguistic features, POS tags improved the performance of the Linear SVM marginally in terms of recall for the Offensive class, other classifiers deteriorated in their performance." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-108", "text": "It is not yet clear whether this is due to the accuracy of the Greek model available for spaCy in producing such tags or the tags themselves as features and is a subject that can be explored with further improvements of spaCy or other NLP tools developed for Greek." 
}, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-109", "text": "The dataset itself contains many instances with neologisms, creative uses of language or and even rare slang words, therefore training the existing model with such instances could improve both spaCy's accuracy for POS and dependency relation tags and the Linear SVM's performance in text classification for Greek." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-110", "text": "----------------------------------" }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-111", "text": "**CONCLUSION**" }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-112", "text": "This paper presented the Offensive Greek Tweet Dataset (OGTD), a manually annotated dataset for offensive language identification and the first Greek dataset of its kind." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-113", "text": "The OGTD v1.0 contains a total of 4,779 tweets, encompassing posts related to an array of topics popular among Greek people (e.g. political elections, TV shows, etc.)." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-114", "text": "Tweets were manually annotated by a team volunteers through an annotation platform." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-115", "text": "We used the same guidelines used in the annotation of the English OLID dataset (Zampieri et al., 2019a) ." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-116", "text": "Finally, we run several machine learning and deep learning classifiers and the best results were achieved by a LSTM and GRU with Attention model." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-117", "text": "----------------------------------" }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-118", "text": "**ONGOING -OGTD V2.0 AND OFFENSEVAL 2020**" }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-119", "text": "We have recently released OGTD v2.0 as training data for OffensEval 2020 (SemEval-2020 Task 12) (Zampieri et al., 2020) ." 
}, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-120", "text": "6 The reasoning behind the expansion of the dataset was to have a larger Greek dataset for the competition." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-121", "text": "New posts were collected in November 2019 following the same approach we used to compile v1.0 described in this paper." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-122", "text": "This second batch of tweets included tweets with trending hashtags, shows and topics from Greece at the time." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-123", "text": "Additionally, keywords that proved to retrieve interesting tweets in the first version were once again used in the search, along with new keywords like pejorative terms." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-124", "text": "When the collection was finished, 5,508 tweets were randomly sampled to be then annotated by a team of volunteers." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-125", "text": "The annotation guidelines were the same ones we used for v1.0." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-126", "text": "OGTD v2.0 combines the existing with the newly annotated tweets in a larger dataset of 10,287 instances." }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-127", "text": "Finally, both OGTD v1.0 and v2.0 provide the opportunity for researchers to test cross-lingual learning methods as it can be used in conjunction with the English OLID and other datasets annotated using the same guidelines such as the one by Sigurbergsson and Derczynski (2020) for Danish and by \u00c7\u00f6ltekin (2020) for Turkish while simultaneously facilitating the development of language resources for NLP in Greek." 
}, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-128", "text": "----------------------------------" }, { "sent_id": "3e2fb3d4c1e224c084117c22a5db78-C001-129", "text": "**LABELS**" } ], "y": { "@USE@": { "gold_contexts": [ [ "3e2fb3d4c1e224c084117c22a5db78-C001-19", "3e2fb3d4c1e224c084117c22a5db78-C001-20" ], [ "3e2fb3d4c1e224c084117c22a5db78-C001-26" ], [ "3e2fb3d4c1e224c084117c22a5db78-C001-48" ], [ "3e2fb3d4c1e224c084117c22a5db78-C001-52" ], [ "3e2fb3d4c1e224c084117c22a5db78-C001-59" ], [ "3e2fb3d4c1e224c084117c22a5db78-C001-115" ] ], "cite_sentences": [ "3e2fb3d4c1e224c084117c22a5db78-C001-20", "3e2fb3d4c1e224c084117c22a5db78-C001-26", "3e2fb3d4c1e224c084117c22a5db78-C001-48", "3e2fb3d4c1e224c084117c22a5db78-C001-52", "3e2fb3d4c1e224c084117c22a5db78-C001-59", "3e2fb3d4c1e224c084117c22a5db78-C001-115" ] }, "@BACK@": { "gold_contexts": [ [ "3e2fb3d4c1e224c084117c22a5db78-C001-27" ], [ "3e2fb3d4c1e224c084117c22a5db78-C001-31" ] ], "cite_sentences": [ "3e2fb3d4c1e224c084117c22a5db78-C001-27", "3e2fb3d4c1e224c084117c22a5db78-C001-31" ] }, "@SIM@": { "gold_contexts": [ [ "3e2fb3d4c1e224c084117c22a5db78-C001-25", "3e2fb3d4c1e224c084117c22a5db78-C001-26" ] ], "cite_sentences": [ "3e2fb3d4c1e224c084117c22a5db78-C001-26" ] } } }, "ABC_91c82c4a49815fb2de300d99312754_8": { "x": [ { "sent_id": "91c82c4a49815fb2de300d99312754-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-2", "text": "Recently, to incorporate external Knowledge Base (KB) information, one form of world knowledge, several end-to-end task-oriented dialog systems have been proposed." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-3", "text": "These models, however, tend to confound the dialog history with KB tuples and simply store them into one memory." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-4", "text": "Inspired by the psychological studies on working memory, we propose a working memory model (WMM2Seq) for dialog response generation." 
}, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-5", "text": "Our WMM2Seq adopts a working memory to interact with two separated long-term memories, which are the episodic memory for memorizing dialog history and the semantic memory for storing KB tuples." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-6", "text": "The working memory consists of a central executive to attend to the aforementioned memories, and a short-term storage system to store the \"activated\" contents from the longterm memories." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-7", "text": "Furthermore, we introduce a context-sensitive perceptual process for the token representations of the dialog history, and then feed them into the episodic memory." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-8", "text": "Extensive experiments on two task-oriented dialog datasets demonstrate that our WMM2Seq significantly outperforms the state-of-the-art results in several evaluation metrics." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-9", "text": "----------------------------------" }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-11", "text": "Task-oriented dialog systems, such as hotel booking or technical support service, help users to achieve specific goals with natural language." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-12", "text": "Compared with traditional pipeline solutions (Williams and Young, 2007; Young et al., 2013; Wen et al., 2017) , end-to-end approaches recently gain much attention (Zhao et al., 2017; Eric and Manning, 2017a; Lei et al., 2018) , because they directly map dialog history to the output responses and consequently reduce human effort for modular designs and hand-crafted state labels." 
}, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-13", "text": "To effectively incorporate KB information and perform knowledge- * Corresponding Author based reasoning, memory augmented models have been proposed (Bordes et al., 2017; Seo et al., 2017; Eric and Manning, 2017b; Madotto et al., 2018; Raghu et al., 2018; Reddy et al., 2019; Wu et al., 2019) ." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-14", "text": "Bordes et al. (2017) and Seo et al. (2017) attended to retrieval models, lacking the ability of generation, while others incorporated the memory (i.e. end-to-end memory networks, abbreviated as MemNNs, Sukhbaatar et al. (2015) ) and copy mechanism (Gu et al., 2016 ) into a sequential generative architecture." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-15", "text": "However, most models tended to confound the dialog history with KB tuples and simply stored them into one memory." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-16", "text": "A shared memory forces the memory reader to reason over the two different types of data, which makes the task harder, especially when the memory is large." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-17", "text": "To explore this problem, Reddy et al. (2019) very recently proposed to separate memories for modeling dialog context and KB results." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-18", "text": "In this paper, we adopt working memory to interact with two longterm memories." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-19", "text": "Furthermore, compared to Reddy et al. (2019) , we leverage the reasoning ability of MemNNs to instantiate the external memories." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-20", "text": "Our intuition comes from two aspects." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-21", "text": "First, psychologists tend to break down the long-term memory 1 into episodic memory for events (e.g. 
visual and textual perceptual inputs) and semantic memory for facts (world knowledge, such as KB information), as not all memory of experiences is the same (Gazzaniga and Ivry, 2013)." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-65", "text": "The two copy distributions are obtained by augmenting the MemNNs with the copy mechanism, i.e. P_{E\u00b7ptr} = p^K_{E,t} and P_{S\u00b7ptr} = p^K_{S,t}." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-22", "text": "Second, a successful task-oriented dialog system needs more intelligence, and recent works suggest that a critical component of intelligence may be working memory (Sternberg and Sternberg, 2016)." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-23", "text": "Hence, leveraging the knowledge from psychological studies (Baddeley and Hitch, 1974; Baddeley, 2000; Dosher, 2003), we explore working memory for dialog response generation." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-24", "text": "Our contributions are summarized as follows:" }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-25", "text": "Firstly, inspired by the psychological studies on working memory, we propose WMM2Seq for dialog generation, which separates the storage of dialog history and KB information by using the episodic and semantic memories and then leverages the working memory to interact with them." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-26", "text": "Secondly, we leverage two kinds of transformations (CNN and biGRU) to incorporate the context information for better token representations." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-27", "text": "This procedure can be seen as a part of the perceptual processes before the episodic memory storage, and can alleviate the Out-Of-Vocabulary (OOV) problem."
}, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-28", "text": "Finally, our WMM2Seq outperforms the existing methods on several evaluation metrics in two task-oriented dialog datasets and shows a better reasoning ability in the OOV situation." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-29", "text": "Figure 1 illustrates the flow of our WMM2Seq for dialog response generation." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-30", "text": "WMM2Seq can be seen as an encoder-decoder model, where decoder is the Working Memory (WM) which could interact with two long-term memories (the episodic memory memorizing dialog history and semantic memory storing KB information)." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-31", "text": "As MemNN is well-known for its multiple hop reasoning ability, we instantiate the encoder and the two memories with three different MemNNs (MemNN Encoder, E-MemNN and S-MemNN)." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-32", "text": "Furthermore, we augment E-MemNN and S-MemNN with copy mechanism from where we need to copy tokens or entities." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-33", "text": "The encoder encodes the dialog history to obtain the high-level signal, a distributed intent vector." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-34", "text": "The WM consists of a Short-Term Storage system (STS) and a Central-EXE including an Attention Controller (Attn-Ctrl) and a rule-based word selection strategy." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-35", "text": "The Attn-Ctrl dynamically generates the attention control vector to query and reason over the two long memories and then stores three \"activated\" distributions into STS." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-36", "text": "Finally a generated token is selected from the STS under the word selection strategy at each decoder step." 
}, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-37", "text": "----------------------------------" }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-38", "text": "**MODEL DESCRIPTION**" }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-39", "text": "The symbols are defined in Table 1 , and more details can be found in the supplementary material." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-40", "text": "We omit the subscript E or S 2 , following Madotto et al. (2018) to define each pointer index set:" }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-41", "text": "Symbol Definition xi or yi a token in the dialog history or system response $ a special token used as a sentinel (Madotto et al., 2018 ) X X = {x1, . . . , xn, $}, the dialog history Y Y = {y1, \u00b7 \u00b7 \u00b7 , ym}, the expected response bi one KB tuple, actually the corresponding entity B B = {b1, \u00b7 \u00b7 \u00b7 , b l , $}, the KB tuples P T RE = {ptrE,1, \u00b7 \u00b7 \u00b7 , ptrE,m}, dialog pointer index set." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-42", "text": "P T RE supervised information for copying words in dialog history P T RS = {ptrS,1, \u00b7 \u00b7 \u00b7 , ptrS,m}, KB pointer index set." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-43", "text": "P T RS supervised information for copying entities in KB tuples Table 1 : Notation Table. where xb z \u2208 X or B is the dialog history or KB tuples according to the subscript (E or S) and n xb + 1 is the sentinel position index as n xb is equal to the dialog history length n or the number of KB triples l. The idea behind Eq. 1 is that we can obtain the positions of where to copy by matching the target text with the dialog history or KB information." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-44", "text": "Furthermore, we hope this provides the model with an accurate guidance of how to activate the two long-term memories." 
}, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-45", "text": "----------------------------------" }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-46", "text": "**MEMNN ENCODER**" }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-47", "text": "Here, on the context of our task, we give a brief description of K-hop MemNN with adjacent weight tying and more details can be found in (Sukhbaatar et al., 2015) ." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-48", "text": "The memory of MemNN is represented by a set of trainable embedding matrices C = {C 1 , . . . , C K+1 }." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-49", "text": "Given input tokens in the dialog history X, MemNN first writes them into memories by Eq. 2 and then uses a query to iteratively read from them with multi hops to reason about the required response by Eq. 3 and Eq. 4." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-50", "text": "For each hop k, we update the query by Eq. 5 and the initial query is a learnable vector as like Yang et al. (2016) ." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-51", "text": "The MemNN encoder finally outputs a user intent vector o K ." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-52", "text": "To incorporate the context information, we explore two context-aware transformation TRANS(\u00b7) by replacing Eq. 2 with A k i = TRANS(C k (x i )), which is defined as follows:" }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-53", "text": "where h i is the context-aware representation, and \u03c6 e is a trainable embedding function." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-54", "text": "We combine MemNNs with TRANS(\u00b7) to alleviate the OOV problem when reasoning about memory contents." 
}, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-55", "text": "----------------------------------" }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-56", "text": "**WORKING MEMORY DECODER**" }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-57", "text": "Inspired by the studies on the working memory, we design our decoder as an attentional control system for dialog generation which consists of the working memory and two long-term memories." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-58", "text": "As shown in Figure 1 , we adopt the E-MemNN to memorize the dialog history X as described in Section 2.1, and then store KB tuples into the S-MemNN without TRANS(\u00b7)." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-59", "text": "We also incorporate additional temporal information and speaker information into dialog utterances as (Madotto et al., 2018) and adopt a (subject, relation, object) representation of KB information as (Eric and Manning, 2017b) ." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-60", "text": "More details can be found in the supplementary material." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-61", "text": "Having written dialog history and KB tuples into E-MemNN and S-MemNN, we then use the WM to interact with them (to query and reason over them) to generate the response." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-62", "text": "At each decoder step, the Attn-Ctrl, instantiated as a GRU, dynamically generates the query vector q t as follows:" }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-63", "text": "Here, query q t is used to access E-MemNN activating the final query q E = o K E , vocabulary distribution P vocab by Eq. 9 and copy distribution for dialog history P E\u00b7ptr ." 
}, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-64", "text": "When querying S-MemNN, we consider the dialog history by using query q t = q E + q t and then obtain the copy distribution for KB entities P S\u00b7ptr ." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-66", "text": "Now, three distributions, P vocab , P E\u00b7ptr and P S\u00b7ptr , are activated and moved into the STS, and then a proper word is generated from the activated distributions." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-67", "text": "We here use a rule-based word selection strategy by extending the sentinel idea in (Madotto et al., 2018) , which is shown in Figure 1 ." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-68", "text": "If the expected word is not appearing either in the episodic memory or the semantic memory, the two copy pointers are trained to produce the sentinel token and our WMM2Seq generates the token from P vocab ; otherwise, the token is generated by copying from either the dialog history or KB tuples and this is done by comparing the two copy distributions." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-69", "text": "We always select the other distribution if one of the two distributions points to the sentinel or select to copy the token corresponding to the biggest probability of the two distributions." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-70", "text": "Hence, during the training stage, all the parameters are jointly learned by minimizing the sum of three standard cross-entropy losses with the corresponding targets (Y , P T R E and P T R S )." 
}, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-71", "text": "----------------------------------" }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-72", "text": "**EXPERIMENTS**" }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-73", "text": "We conduct experiments on the simulated bAbI Dialogue dataset (Bordes et al., 2017) and the Dialog State Tracking Challenge 2 (DSTC2) (Henderson et al., 2014) ." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-74", "text": "We actually adopt the refined version of DSTC2 from Bordes et al. (2017) and their statistics are given in the supplementary material." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-75", "text": "Our model is trained end-to-end using Adam optimizer (Kingma and Ba, 2014) , and the responses are generated using greedy search without any rescoring techniques." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-76", "text": "The shared size of embedding and hidden units is selected from [64, 512] and the default hop K = 3 is used for all MemNNs." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-77", "text": "The learning rate is simply fixed to 0.001 and the dropout ratio is sampled from [0.1, 0.4]." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-78", "text": "Furthermore, we randomly mask some memory cells with the same dropout ratio to simulate the OOV situation for both episodic and semantic memories." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-79", "text": "The hyper-parameters for best models are given in the supplementary material." 
}, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-80", "text": "----------------------------------" }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-81", "text": "**RESULTS AND ANALYSIS**" }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-82", "text": "We use Per-response/dialog Accuracy (Bordes et al., 2017) , BLEU (Papineni et al., 2002) and Entity F1 (Madotto et al., 2018) to compare the performance of different models." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-83", "text": "And the baseline models are Seq2Seq+Attn (Luong et al., 2015) , Pointer to Unknown (Ptr-Unk, Gulcehre et al. (2016) ), Mem2Seq (Madotto et al., 2018) , Hierarchical Pointer Generator Memory Network (HyP-MN, Raghu et al. (2018) ) and Global-to-Local Memory Pointer (GLMP, Wu et al. (2019) )." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-84", "text": "Automatic Evaluation: The results on the bAbI dialog dataset are given in Table 2 ." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-85", "text": "We can see that our model does much better on the OOV situation and is on par with the best results on T5." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-86", "text": "Moreover, our model can perfectly issue API calls (task 1), update API calls (task 2) and provide extra information (task 4)." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-87", "text": "As task 5 is a combination of tasks 1-4, our best performance on T5-OOV exhibits the powerful reasoning ability to the unseen dialog history and KB tuples." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-88", "text": "And this reasoning ability is also proved by the performance improvements on the DSTC2 dataset according to several metrics in Table 3 ." 
}, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-89", "text": "Especially, a significant improvement on entity F1 scores indicates that our model can choose the right entities and incorporate them into responses more naturally (with highest BLEU scores)." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-90", "text": "Furthermore, there is no significant difference between the two kinds of the transformation TRANS(\u00b7)." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-91", "text": "Ablation Study: To better understand the components used in our model, we report our ablation studies from three aspects." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-92", "text": "First, we remove the context-sensitive transformation TRANS(\u00b7) and then find significant performance degradation." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-93", "text": "This suggests that perceptual processes are a necessary step before storing perceptual information (the dialog history) into the episodic memory and it is important for the performance of working memory." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-94", "text": "Second, we find that WMM2Seq outperforms Mem2Seq, which uses a unified memory to store dialog history and KB information." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-95", "text": "We can safely conclude that the separation of context memory and KB memory benefits the performance, as WMM2Seq performs well with less parameters than Mem2Seq on task 5." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-96", "text": "Finally, we additionally analysis how the multi-hop attention mechanism helps by showing the performance differences between the hop K = 1 and the default hop K = 3." 
}, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-97", "text": "Though multi-hop attention strengthens the reasoning ability and improves the results, we find that the performance difference between the hops K = 1 and K = 3 is not so obvious as shown in (Madotto et al., 2018; Wu et al., 2019) ." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-98", "text": "Furthermore, our model performs well even with one hop, which we mainly attribute to the reasoning ability of working memory." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-99", "text": "The separation of memories and stacking S-MemNN on E-MemNN also help a lot, because the whole external memory, consisting of the episodic and semantic memories, can be seen as a multi-hop (two-level) structure (the first level is the episode memory and the second level is the semantic memory)." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-100", "text": "Attention Visualization: As an intuitive way to show the model's dynamics, attention weight visualization is also used to understand how the Central-EXE controls the access to the two long-term memories (E-MemNN and S-MemNN)." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-101", "text": "Figure 2 shows the episodic and semantic memory attention vectors at the last hop for each generated token." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-102", "text": "Firstly, our model generates a different but still correct response as the customer wants a moderately priced restaurant in the west and does not care about the type of food." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-103", "text": "Secondly, the generated response has tokens from the vocabulary (e.g. \"is\" and \"a\"), dialog history (e.g. \"west\" and \"food\") and KB information (e.g. \"saint johns chop house\" and \"british\"), indicating that our model learns to interact well with the two long-term memories by two sentinels." 
}, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-104", "text": "Human Evaluation: Following the methods in (Eric and Manning, 2017b; Wu et al., 2019) , we report human evaluation of the generated responses in Table 4 ." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-105", "text": "We adopt Mem2Seq as the baseline for human evaluation considering its good performance and code release 3 ." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-106", "text": "First we randomly select 100 samples from the DSTC2 test set, then generate the corresponding responses using WMM2Seq and Mem2Seq, and finally ask two human subjects to judge the quality of the generated responses according to the appropriateness and humanlikeness on a scale from 1 to 5." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-107", "text": "As shown in Table 4 , WMM2Seq outperforms Mem2Seq in both measures, which is coherent to the automatic evaluation." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-108", "text": "More details about human evaluation are reported in the supplementary material." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-109", "text": "----------------------------------" }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-110", "text": "**CONCLUSION**" }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-111", "text": "We leverage the knowledge from the psychological studies and propose our WMM2Seq for dialog response generation." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-112", "text": "First, the storage separation of the dialog history and KB information is very important and we explore two context-sensitive perceptual processes for the word-level representations of the dialog history." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-113", "text": "Second, working memory is adopted to interact with the long-term memories and then generate the responses." 
}, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-114", "text": "Finally, the improved performance on two task-oriented datasets demonstrates the contributions from the separated storage and the reasoning ability of working memory." }, { "sent_id": "91c82c4a49815fb2de300d99312754-C001-115", "text": "Our future work will focus on how to transfer the long-term memory across different tasks." } ], "y": { "@BACK@": { "gold_contexts": [ [ "91c82c4a49815fb2de300d99312754-C001-13" ] ], "cite_sentences": [ "91c82c4a49815fb2de300d99312754-C001-13" ] }, "@USE@": { "gold_contexts": [ [ "91c82c4a49815fb2de300d99312754-C001-40" ], [ "91c82c4a49815fb2de300d99312754-C001-41" ], [ "91c82c4a49815fb2de300d99312754-C001-59" ], [ "91c82c4a49815fb2de300d99312754-C001-82" ], [ "91c82c4a49815fb2de300d99312754-C001-83" ], [ "91c82c4a49815fb2de300d99312754-C001-105" ], [ "91c82c4a49815fb2de300d99312754-C001-106" ] ], "cite_sentences": [ "91c82c4a49815fb2de300d99312754-C001-40", "91c82c4a49815fb2de300d99312754-C001-41", "91c82c4a49815fb2de300d99312754-C001-59", "91c82c4a49815fb2de300d99312754-C001-82", "91c82c4a49815fb2de300d99312754-C001-83", "91c82c4a49815fb2de300d99312754-C001-105", "91c82c4a49815fb2de300d99312754-C001-106" ] }, "@EXT@": { "gold_contexts": [ [ "91c82c4a49815fb2de300d99312754-C001-67" ], [ "91c82c4a49815fb2de300d99312754-C001-95" ] ], "cite_sentences": [ "91c82c4a49815fb2de300d99312754-C001-67", "91c82c4a49815fb2de300d99312754-C001-95" ] }, "@DIF@": { "gold_contexts": [ [ "91c82c4a49815fb2de300d99312754-C001-94" ], [ "91c82c4a49815fb2de300d99312754-C001-95" ], [ "91c82c4a49815fb2de300d99312754-C001-97" ], [ "91c82c4a49815fb2de300d99312754-C001-107" ] ], "cite_sentences": [ "91c82c4a49815fb2de300d99312754-C001-94", "91c82c4a49815fb2de300d99312754-C001-95", "91c82c4a49815fb2de300d99312754-C001-97", "91c82c4a49815fb2de300d99312754-C001-107" ] } } }, "ABC_10f17930192132077f0d4526e7d755_8": { "x": [ { "sent_id": "10f17930192132077f0d4526e7d755-C001-1", "text": 
"**ABSTRACT**" }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-2", "text": "Mental health poses a significant challenge for an individual's well-being." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-3", "text": "Text analysis of rich resources, like social media, can contribute to deeper understanding of illnesses and provide means for their early detection." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-4", "text": "We tackle a challenge of detecting social media users' mental status through deep learningbased models, moving away from traditional approaches to the task." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-5", "text": "In a binary classification task on predicting if a user suffers from one of nine different disorders, a hierarchical attention network outperforms previously set benchmarks for four of the disorders." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-6", "text": "Furthermore, we explore the limitations of our model and analyze phrases relevant for classification by inspecting the model's word-level attention weights." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-7", "text": "----------------------------------" }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-9", "text": "Mental health is a serious issue of the modern-day world." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-10", "text": "According to the World Health Organization's 2017 report and Wykes et al. (2015) more than a quarter of Europe's adult population suffers from an episode of a mental disorder in their life." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-11", "text": "The problem grows bigger with the fact that as much as 35-50% of those affected go undiagnosed and receive no treatment for their illness." 
}, { "sent_id": "10f17930192132077f0d4526e7d755-C001-12", "text": "In line with WHO's Mental Health Action Plan (Saxena et al., 2013) , the natural language processing community helps the gathering of information and evidence on mental conditions, focusing on text analysis of authors affected by mental illnesses." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-13", "text": "Researchers can utilize large amounts of text on social media sites to get a deeper understanding of mental health and develop models for early detection of various mental disorders (De Choudhury et al., 2013a; Coppersmith et al., 2014; Gkotsis et al., 2016; Benton et al., 2017; Sekuli\u0107 et al., 2018; Zomick et al., 2019) ." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-14", "text": "In this work, we experiment with the Self-reported Mental Health Diagnoses (SMHD) dataset (Cohan et al., 2018) , consisting of thousands of Reddit users diagnosed with one or more mental illnesses." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-15", "text": "The contribution of our work is threefold." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-16", "text": "First, we adapt a deep neural model, proved to be successful in large-scale document classification, for user classification on social media, outperforming previously set benchmarks for four out of nine disorders." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-17", "text": "In contrast to the majority of preceding studies on mental health prediction in social media, which relied mostly on traditional classifiers, we employ Hierarchical Attention Network (HAN) (Yang et al., 2016) ." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-18", "text": "Second, we explore the limitations of the model in terms of data needed for successful classification, specifically, the number of users and number of posts per user." 
}, { "sent_id": "10f17930192132077f0d4526e7d755-C001-19", "text": "Third, through the attention mechanism of the model, we analyze the most relevant phrases for the classification and compare them to previous work in the field." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-20", "text": "We find similarities between lexical features and ngrams identified by the attention mechanism, supporting previous analyses." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-21", "text": "----------------------------------" }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-22", "text": "**DATASET AND THE MODEL**" }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-23", "text": "----------------------------------" }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-24", "text": "**SELF-REPORTED MENTAL HEALTH DIAGNOSES DATASET**" }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-25", "text": "The SMHD dataset (Cohan et al., 2018) is a largescale dataset of Reddit posts from users with one or multiple mental health conditions." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-26", "text": "The users were identified by constructing patterns for discovering self-reported diagnoses of nine different mental disorders." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-27", "text": "For example, if a user writes \"I was officially diagnosed with depression last year\", she/he/other would be considered to suffer from depression." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-28", "text": "Nine or more control users, which are meant to represent general population, are selected for each diagnosed user by their similarity, i.e., by their number of posts and the subreddits (sub-forums on Reddit) they post in." 
}, { "sent_id": "10f17930192132077f0d4526e7d755-C001-29", "text": "Diagnosed users' language is normalized by removing posts with specific mental health signals and discussions, in order to analyze the language of general discussions and to be more comparable to the control groups." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-30", "text": "The nine disorders and the number of users per disorder, as well as average number of posts per user, are shown in Table 1 ." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-31", "text": "For each disorder, Cohan et al. (2018) analyze the differences in language use between diagnosed users and their respective control groups." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-32", "text": "They also provide benchmark results for the binary classification task of predicting whether the user belongs to the diagnosed or the control group." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-33", "text": "We reproduce their baseline models for each disorder and compare to our deep learning-based model, explained in Section 2.3." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-34", "text": "----------------------------------" }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-35", "text": "**SELECTING THE CONTROL GROUP**" }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-36", "text": "Cohan et al. (2018) select nine or more control users for each diagnosed user and run their experiments with these mappings." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-37", "text": "With this exact mapping not being available, for each of the nine conditions, we had to select the control group ourselves." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-38", "text": "For each diagnosed user, we draw exactly nine control users from the pool of 335,952 control users present in SMHD and proceed to train and test our binary classifiers on the newly created sub-datasets." 
}, { "sent_id": "10f17930192132077f0d4526e7d755-C001-39", "text": "In order to create a statistically-fair comparison, we run the selection process multiple times, as well as reimplement the benchmark models used in Cohan et al. (2018) ." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-40", "text": "Multiple sub-datasets with different control groups not only provide us with unbiased results, but also show how results of a binary classification can differ depending on the control group." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-41", "text": "----------------------------------" }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-42", "text": "**HIERARCHICAL ATTENTION NETWORK**" }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-43", "text": "We adapt a Hierarchical Attention Network (HAN) (Yang et al., 2016) , originally used for document classification, to user classification on social media." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-44", "text": "A HAN consists of a word sequence encoder, a word-level attention layer, a sentence encoder and a sentence-level attention layer." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-45", "text": "It employs GRU-based sequence encoders (Cho et al., 2014) on sentence and document level, yielding a document representation in the end." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-46", "text": "The word sequence encoder produces a representation of a given sentence, which then is forwarded to a sentence sequence encoder that, given a sequence of encoded sentences, returns a document representation." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-47", "text": "Both, word sequence and sentence sequence encoders, apply attention mechanisms on top to help the encoder more accurately aggregate the representation of given sequence." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-48", "text": "For details of the architecture we refer the interested readers to Yang et al. (2016) ." 
}, { "sent_id": "10f17930192132077f0d4526e7d755-C001-49", "text": "In this work, we model a user as a document, enabling an intuitive adaptation of the HAN." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-50", "text": "Just as a document is a sequence of sentences, we propose to model a social media user as a sequence of posts." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-51", "text": "Similarly, we identify posts as sentences, both being a sequence of tokens." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-52", "text": "This interpretation enables us to apply the HAN, which had great success in document classification, to user classification on social media." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-53", "text": "----------------------------------" }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-54", "text": "**RESULTS**" }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-55", "text": "----------------------------------" }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-56", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-57", "text": "The HAN uses two layers of bidirectional GRU units with hidden size of 150, each of them followed by a 100 dimensional attention mechanism." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-58", "text": "The first layer encodes posts, while the second one encodes a user as a sequence of encoded posts." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-59", "text": "The output layer is 50-dimensional fullyconnected network, with binary cross entropy as a loss function." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-60", "text": "We initialize the input layer with 300 dimensional GloVe word embeddings (Pennington et al., 2014) ." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-61", "text": "We train the model with Adam (Kingma and Ba, 2014), with an initial learning rate of 10 \u22124 and a batch size of 32 for 50 epochs." 
}, { "sent_id": "10f17930192132077f0d4526e7d755-C001-62", "text": "The model that proves best on the development set is selected." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-63", "text": "We implement the baselines as in Cohan et al. (2018) ." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-64", "text": "Logistic regression and the linear SVM were trained on tf-idf weighted bag-of-words features, where users' posts are all concatenated and all the tokens lower-cased." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-65", "text": "Optimal parameters were found on the development set, and models were evaluated on the test set." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-66", "text": "FastText (Joulin et al., 2016) was trained for 100 epochs, using character n-grams of size 3 to 6, with a 100 dimensional hidden layer." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-67", "text": "We take diagnosed users from predefined train-dev-test split, and select the control group as described in Subsection 2.2." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-68", "text": "To ensure unbiased results and fair comparison to the baselines, we repeat the process of selecting the control group five times for each disorder and report the average of the runs." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-69", "text": "----------------------------------" }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-70", "text": "**BINARY CLASSIFICATION PER DISORDER**" }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-71", "text": "We report the F 1 measures per disorder in Table 2, in the task of binary classification of users, with the diagnosed class as a positive one." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-72", "text": "Our model outperforms the baseline models for Depression, ADHD, Anxiety, and Bipolar disorder, while it proves insufficient for PTSD, Autism, OCD, Schizophrenia, and Eating disorder." 
}, { "sent_id": "10f17930192132077f0d4526e7d755-C001-73", "text": "We hypothesize that the reason for this are the sizes of particular sub-datasets, which can be seen in Table 1." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-74", "text": "We observe higher F 1 score for the HAN in disorders with sufficient data, suggesting once again that deep neural models are data-hungry (Sun et al., 2017) ." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-75", "text": "Logistic regression and linear SVM achieve higher scores where there is a smaller number of diagnosed users." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-76", "text": "In contrast to Cohan et al. (2018) , supervised FastText yields worse results than tuned linear models." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-77", "text": "We further investigate the impact of the size of the dataset on the final results of classification." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-78", "text": "We limit the number of posts per user available to the model to examine the amount needed for reasonable performance." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-79", "text": "The results for 50, 100, 150, 200, and 250 posts per user available are presented in Figure 1 ." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-80", "text": "Experiments were run three times for each disorder and each number of available posts, every time with a different control group selected." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-81", "text": "We observe a positive correlation between the data provided to the model and the performance, although we find an upper bound to this tendency." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-82", "text": "As the average number of posts per user is roughly 160 (Table 1) , it is reasonable to expect of a model to perform well with similar amounts of data available." 
}, { "sent_id": "10f17930192132077f0d4526e7d755-C001-83", "text": "However, further analysis is required to see if the model reaches the plateau because a large amount of data is not needed for the task, or due to it not being expressive enough." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-84", "text": "----------------------------------" }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-85", "text": "**ATTENTION WEIGHTS ANALYSIS**" }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-86", "text": "The HAN, through attention mechanism, provides a clear way to identify posts, and words or phrases in those posts, relevant for classification." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-87", "text": "We examine attention weights on a word level and compare the most attended words to prior research on depression." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-88", "text": "Depression is selected as the most prevalent disorder in the SMHD dataset with a number of studies in the field (Rude et al., 2004; Chung and Pennebaker, 2007; De Choudhury et al., 2013b; Park et al., 2012) ." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-89", "text": "For each post, we extracted two words with the highest attention weight as being the most relevant for the classification." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-90", "text": "If the two words are appearing next to each other in a post we consider them as bigram." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-91", "text": "Some of the top 100 most common unigrams and bigrams are presented in Table 3 , aggregated under the most common LIWC categories." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-92", "text": "We observe similar patterns in features shown Table 3 : Unigrams and bigrams most often given the highest weight by attention mechanism in depression classification." 
}, { "sent_id": "10f17930192132077f0d4526e7d755-C001-93", "text": "relevant by the HAN and previous research on signals of depression in language." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-94", "text": "The importance of personal pronouns in distinguishing depressed authors from the control group is supported by multiple studies (Rude et al., 2004; Chung and Pennebaker, 2007; De Choudhury et al., 2013b; Cohan et al., 2018) ." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-95", "text": "In the categories Affective processes, Social processes, and Biological processes, Cohan et al. (2018) report significant differences between depressed and control group, similar to some other disorders." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-96", "text": "Except the above mentioned words and their abbreviations, among most commonly attended are swear words, as well as other forms of informal language." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-97", "text": "The attention mechanism's weighting suggests that words and phrases proved important in previous studies, using lexical features and linear models, are relevant for the HAN as well." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-98", "text": "----------------------------------" }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-99", "text": "**RELATED WORK**" }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-100", "text": "In recent years, social media has been a valuable source for psychological research." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-101", "text": "While most studies use Twitter data (Coppersmith et al., 2015a (Coppersmith et al., , 2014 Benton et al., 2017; Coppersmith et al., 2015b) , a recent stream turns to Reddit as a richer source of high-volume data (De Choudhury and De, 2014; Shen and Rudzicz, 2017; Gjurkovi\u0107 and\u0160najder, 2018; Cohan et al., 2018; Sekuli\u0107 et al., 2018; Zirikly et al., 2019) ." 
}, { "sent_id": "10f17930192132077f0d4526e7d755-C001-102", "text": "Previous approaches to author's mental health prediction usually relied on linguistic and stylistic features, e.g., Linguistic Inquiry and Word Count (LIWC) (Pennebaker et al., 2001 ) -a widely used feature extractor for various studies regarding mental health (Rude et al., 2004; Coppersmith et al., 2014; Sekuli\u0107 et al., 2018; Zomick et al., 2019) ." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-103", "text": "Recently, Song et al. (2018) built a feature attention network for depression detection on Reddit, showing high interpretability, but low improvement in accuracy." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-104", "text": "Orabi et al. (2018) concatenate all the tweets of a Twitter user in a single document and experiment with various deep neu-ral models for depression detection." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-105", "text": "Some of the previous studies use deep learning methods on a post level to infer general information about a user (Kshirsagar et al., 2017; Ive et al., 2018; Ruder et al., 2016) , or detect different mental health concepts in posts themselves (Rojas-Barahona et al., 2018) , while we focus on utilizing all of the users' text." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-106", "text": "Yates et al. (2017) use a CNN on a postlevel to extract features, which are then concatenated to get a user representation used for selfharm and depression assessment." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-107", "text": "A CNN requires a fixed length of posts, putting constraints on the data available to the model, while a HAN utilizes all of the data from posts of arbitrary lengths." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-108", "text": "A social media user can be modeled as collection of their posts, so we look at neural models for large-scale text classification." 
}, { "sent_id": "10f17930192132077f0d4526e7d755-C001-109", "text": "Liu et al. (2018) split a document into chunks and use a combination of CNNs and RNNs for document classification." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-110", "text": "While this approach proves to be successful for scientific paper categorization, it is unintuitive to use in social media text due to an unclear way of splitting user's data into equally sized chunks of text." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-111", "text": "Yang et al. (2016) use a hierarchical attention network for document classification, an approach that we adapt for Reddit." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-112", "text": "A step further would be adding another hierarchy, similar to Jiang et al. (2019) , who use a multi-depth attention-based hierarchical RNN to tackle the problem of longlength document semantic matching." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-113", "text": "----------------------------------" }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-114", "text": "**ETHICAL CONSIDERATIONS**" }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-115", "text": "tion weights on word level suggested similarities to previous studies of depressed authors." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-116", "text": "Embedding mental health-specific insights from previous work could benefit the model in general." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-117", "text": "Future work includes analysis of post-level attention weights, with a goal of finding patterns in the relevance of particular posts, and, through them, time periods when a user is in distress." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-118", "text": "As some of the disorders share similar symptoms, e.g., depressive episodes in bipolar disorder, exploiting correlations between labels through multi-task or transfer learning techniques might prove useful." 
}, { "sent_id": "10f17930192132077f0d4526e7d755-C001-119", "text": "In order to improve the classification accuracy, a transformerbased model for encoding users' posts should be tested." }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-120", "text": "----------------------------------" }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-121", "text": "**ACKNOWLEDGMENTS**" }, { "sent_id": "10f17930192132077f0d4526e7d755-C001-122", "text": "This work has been funded by the Erasmus+ programme of the European Union and the Klaus Tschira Foundation." } ], "y": { "@USE@": { "gold_contexts": [ [ "10f17930192132077f0d4526e7d755-C001-14" ], [ "10f17930192132077f0d4526e7d755-C001-31", "10f17930192132077f0d4526e7d755-C001-32", "10f17930192132077f0d4526e7d755-C001-33" ], [ "10f17930192132077f0d4526e7d755-C001-38" ], [ "10f17930192132077f0d4526e7d755-C001-39" ], [ "10f17930192132077f0d4526e7d755-C001-63" ], [ "10f17930192132077f0d4526e7d755-C001-87", "10f17930192132077f0d4526e7d755-C001-88" ] ], "cite_sentences": [ "10f17930192132077f0d4526e7d755-C001-14", "10f17930192132077f0d4526e7d755-C001-31", "10f17930192132077f0d4526e7d755-C001-32", "10f17930192132077f0d4526e7d755-C001-33", "10f17930192132077f0d4526e7d755-C001-38", "10f17930192132077f0d4526e7d755-C001-39", "10f17930192132077f0d4526e7d755-C001-63", "10f17930192132077f0d4526e7d755-C001-88" ] }, "@BACK@": { "gold_contexts": [ [ "10f17930192132077f0d4526e7d755-C001-25" ], [ "10f17930192132077f0d4526e7d755-C001-31" ], [ "10f17930192132077f0d4526e7d755-C001-36" ], [ "10f17930192132077f0d4526e7d755-C001-94" ], [ "10f17930192132077f0d4526e7d755-C001-95" ], [ "10f17930192132077f0d4526e7d755-C001-101" ] ], "cite_sentences": [ "10f17930192132077f0d4526e7d755-C001-25", "10f17930192132077f0d4526e7d755-C001-31", "10f17930192132077f0d4526e7d755-C001-36", "10f17930192132077f0d4526e7d755-C001-94", "10f17930192132077f0d4526e7d755-C001-95", "10f17930192132077f0d4526e7d755-C001-101" ] }, "@EXT@": { "gold_contexts": [ [ 
"10f17930192132077f0d4526e7d755-C001-36", "10f17930192132077f0d4526e7d755-C001-37" ], [ "10f17930192132077f0d4526e7d755-C001-38" ] ], "cite_sentences": [ "10f17930192132077f0d4526e7d755-C001-36", "10f17930192132077f0d4526e7d755-C001-38" ] }, "@DIF@": { "gold_contexts": [ [ "10f17930192132077f0d4526e7d755-C001-36", "10f17930192132077f0d4526e7d755-C001-37" ], [ "10f17930192132077f0d4526e7d755-C001-76" ] ], "cite_sentences": [ "10f17930192132077f0d4526e7d755-C001-36", "10f17930192132077f0d4526e7d755-C001-76" ] } } }, "ABC_2f3e2c81bed66fd020731b2475bb98_8": { "x": [ { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-2", "text": "In this paper, we study the problem of extracting technical paraphrases from a parallel software corpus, namely, a collection of duplicate bug reports." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-3", "text": "Paraphrase acquisition is a fundamental task in the emerging area of text mining for software engineering." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-4", "text": "Existing paraphrase extraction methods are not entirely suitable here due to the noisy nature of bug reports." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-5", "text": "We propose a number of techniques to address the noisy data problem." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-6", "text": "The empirical evaluation shows that our method significantly improves an existing method by up to 58%." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-7", "text": "----------------------------------" }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-9", "text": "Using natural language processing (NLP) techniques to mine software corpora such as code comments and bug reports to assist software engineering (SE) is an emerging and promising research direction Tan et al., 2007) ." 
}, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-10", "text": "Paraphrase extraction is one of the fundamental problems that have not been addressed in this area." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-11", "text": "It has many applications including software ontology construction and query expansion for retrieving relevant technical documents." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-12", "text": "In this paper, we study automatic paraphrase extraction from a large collection of software bug reports." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-13", "text": "Most large software projects have bug tracking systems, e.g., Bugzilla 1 , to help global users to describe and report the bugs they encounter when using the software." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-14", "text": "However, since the same bug may be seen by many users, many duplicate bug reports are sent to bug tracking systems." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-15", "text": "The duplicate bug reports are manually tagged and associated to the original bug report by either the system manager or software developers." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-16", "text": "These families of duplicate bug reports form a semi-parallel 1 http://www.bugzilla.org/ Parallel bug reports with a pair of true paraphrases 1: connector extend with a straight line in full screen mode 2: connector show straight line in presentation mode Non-parallel bug reports referring to the same bug 1: Settle language for part of text and spellchecking part of text 2: Feature requested to improve the management of a multi-language document Context-peculiar paraphrases (shown in italics) 1: status bar appear in the middle of the screen 2: maximizing window create phantom status bar in middle of document However, bug reports have characteristics that raise many new challenges." 
}, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-17", "text": "Different from many other parallel corpora, bug reports are noisy." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-18", "text": "We observe at least three types of noise common in bug reports." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-19", "text": "First, many bug reports have many spelling, grammatical and sentence structure errors." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-20", "text": "To address this we extend a suitable stateof-the-art technique that is robust to such corpora, i.e. (Barzilay and McKeown, 2001) ." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-21", "text": "Second, many duplicate bug report families contain sentences that are not truly parallel." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-22", "text": "An example is shown in Table 1 (middle)." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-23", "text": "We handle this by considering lexical similarity between duplicate bug reports." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-24", "text": "Third, even if the bug reports are parallel, we find many cases of context-peculiar paraphrases, i.e., a pair of phrases that have the same meaning in a very narrow context." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-25", "text": "An example is shown in Table 1 (bottom)." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-26", "text": "To address this, we introduce two notions of global context-based score and co-occurrence based score which take into account all good and bad occurrences of the phrases in a candidate paraphrase in the corpus." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-27", "text": "These scores are then used to identify and remove context-peculiar paraphrases." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-28", "text": "The contributions of our work are twofold." 
}, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-29", "text": "First, we studied the important problem of paraphrase extraction from a noisy semi-parallel software corpus, which has not been studied either in the NLP or the SE community." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-30", "text": "Second, taking into consideration the special characteristics of our noisy data, we proposed several improvements to an existing general paraphrase extraction method, resulting in a significant performance gain -up to 58% relative improvement in precision." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-31", "text": "----------------------------------" }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-32", "text": "**RELATED WORK**" }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-33", "text": "In the area of text mining for software engineering, paraphrases have been used in many tasks, e.g., Tan et al., 2007) ." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-34", "text": "However, most paraphrases used are obtained manually." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-35", "text": "A recent study using synonyms from WordNet highlights the fact that these are not effective in software engineering tasks due to domain specificity (Sridhara et al., 2008) ." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-36", "text": "Therefore, an automatic way to derive technical paraphrases specific to software engineering is desired." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-37", "text": "Paraphrases can be extracted from non-parallel corpora using contextual similarity (Lin, 1998) ." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-38", "text": "They can also be obtained from parallel corpora if such data is available (Barzilay and McKeown, 2001; Ibrahim et al., 2003) ." 
}, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-39", "text": "Recently, there are also a number of studies that extract paraphrases from multilingual corpora (Bannard and CallisonBurch, 2005; Zhao et al., 2008) ." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-40", "text": "The approach in (Barzilay and McKeown, 2001) does not use deep linguistic analysis and therefore is suitable to noisy corpora like ours." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-41", "text": "Due to this reason, we build our technique on top of theirs." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-42", "text": "The following provides a summary of their technique." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-43", "text": "Two types of paraphrase patterns are defined: (1) Syntactic patterns which consist of the POS tags of the phrases." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-44", "text": "For example, the paraphrases \"a VGA monitor\" and \"a monitor\" are represented as \"DT 1 JJ NN 2 \" \u2194 \"DT 1 NN 2 \", where the subscripts denote common words." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-45", "text": "(2) Contextual patterns which consist of the POS tags before and after the phrases." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-46", "text": "For example, the contexts \"in the middle of\" and \"in middle of\" in Table 1 (bottom) are represented as \"" }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-47", "text": "During pre-processing, the parallel corpus is aligned to give a list of parallel sentence pairs." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-48", "text": "The sentences are then processed by a POS tagger and a chunker." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-49", "text": "The authors first used identical words and phrases as seeds to find and score contextual patterns." 
}, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-50", "text": "The patterns are scored based on the following formula: (n+)/n, in which, n+ refers to the number of positively labeled paraphrases satisfying the patterns and n refers to the number of all paraphrases satisfying the patterns." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-51", "text": "Only patterns with scores above a threshold are considered." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-52", "text": "More paraphrases are identified using these contextual patterns, and more patterns are then found and scored using the newly-discovered paraphrases." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-53", "text": "This co-training algorithm is employed in an iterative fashion to find more patterns and positively labeled paraphrases." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-54", "text": "----------------------------------" }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-55", "text": "**METHODOLOGY**" }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-56", "text": "Our paraphrase extraction method consists of three components: sentence selection, global context-based scoring and co-occurrence-based scoring." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-57", "text": "We marry the three components together into a holistic solution." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-58", "text": "Selection of Parallel Sentences Our corpus consists of short bug report summaries, each containing one or two sentences only, grouped by the bugs they report." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-59", "text": "Each group corresponds to reports pertaining to a single bug and are duplicate of one another." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-60", "text": "Therefore, reports belonging to the same group can be naturally regarded as parallel sentences." 
}, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-61", "text": "However, these sentences are only partially parallel because two users may describe the same bug in very different ways." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-62", "text": "An example is shown in Table 1 (middle)." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-63", "text": "This kind of sentence pairs should not be regarded as parallel." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-64", "text": "To address this problem, we take a heuristic approach and only select sentence pairs that have strong similarities." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-65", "text": "Our similarity score is based on the number of common words, bigrams and trigrams shared between two parallel sentences." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-66", "text": "We use a threshold of 5 to filter out non-parallel sentences." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-67", "text": "Global Context-Based Scoring Our contextbased paraphrase scoring method is an extension of (Barzilay and McKeown, 2001 ) described in Sec. 2." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-68", "text": "Parallel bug reports are usually noisy." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-69", "text": "At times, some words might be detected as paraphrases incidentally due to the noise." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-70", "text": "In (Barzilay and McKeown, 2001 ), a paraphrase is reported as long as there is a single good supporting pair of sentences." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-71", "text": "Although this works well for a relatively clean parallel corpus considered in their work, i.e., novels, this does not work well for bug reports." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-72", "text": "Consider the context-peculiar example in Table 1 (bottom)." 
}, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-73", "text": "For a context-peculiar para-phrase, there can be many sentences containing the pair of phrases but very few support them to be a paraphrase." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-74", "text": "We develop a technique to offset this noise by computing a global context-based score for two phrases being a paraphrase over all their parallel occurrences." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-75", "text": "This is defined by the following formula:" }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-76", "text": "where n is the number of parallel bug reports with the two phrases occurring in parallel, and s i is the score for the i'th occurrence." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-77", "text": "s i is computed as follows: 1." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-78", "text": "We compute the set of patterns with affixed pattern scores based on (Barzilay and McKeown, 2001 )." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-79", "text": "2. For the i'th parallel occurrence of the pair of phrases we want to score, we try to find a pattern that matches the occurrence and assign the pattern score to the pair of phrases as s i ." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-80", "text": "If no such pattern exists, we set s i to 0." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-81", "text": "By taking the average of s i as the global score for a pair of phrases, we do not rely much on a single s i and can therefore prevent context-peculiar paraphrases to some degree." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-82", "text": "Co-occurrence-Based Scoring We also consider another global co-occurrence-based score that is commonly used for finding collocations." 
}, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-83", "text": "A general observation is that noise tends to appear in random but random things do not occur in the same way often." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-84", "text": "It is less likely for randomly paired words or paraphrases to co-occur together many times." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-85", "text": "To compute the likelihood of two phrases occurring together, we use the following commonly used co-occurrence-based score:" }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-86", "text": "The expression P (w 1 , w 2 ) refers to the probability of a pair of phrases w 1 and w 2 appearing together." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-87", "text": "It is estimated based on the proportion of the corpus containing both w 1 and w 2 in parallel." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-88", "text": "Similarly, P (w 1 ) and P (w 2 ) each corresponds to the probability of w 1 and w 2 appearing respectively." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-89", "text": "We normalize the S c score to the range of 0 to 1 by dividing it with the size of the corpus." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-90", "text": "Holistic Solution We employ the parallel sentence selection as a pre-processing step, and merge co-occurrence-based scoring with global contextbased scoring." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-91", "text": "For each parallel sentence pairs, a chunker is used to get chunks from each sentence." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-92", "text": "All possible pairings of chunks are then formed." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-93", "text": "This set of chunk pairs are later fed to the method in (Barzilay and McKeown, 2001 ) to produce a set of patterns with affixed scores." 
}, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-94", "text": "With this we compute our global-context based scores." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-95", "text": "The cooccurrence based scores are computed following the approach described above." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-96", "text": "Two thresholds are used and candidate paraphrases whose scores are below the respective thresholds are removed." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-97", "text": "Alternatively, one of the score is used as a filter, while the other is used to rank the candidates." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-98", "text": "The next section describes our experimental results." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-99", "text": "----------------------------------" }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-100", "text": "**EVALUATION**" }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-101", "text": "Data Set Our bug report corpus is built from OpenOffice 2 ." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-102", "text": "OpenOffice is a well-known open source software which has similar functionalities as Microsoft Office." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-103", "text": "We use the bug reports that are submitted before Jan 1, 2008." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-104", "text": "Also, we only use the summary part of the bug reports." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-105", "text": "We build our corpus in the following steps." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-106", "text": "We collect a total of 13,898 duplicate bug reports from OpenOffice." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-107", "text": "Each duplicate bug report is associated to a master report-there is one master report for each unique bug." 
}, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-108", "text": "From this information, we create duplicate bug report groups where each member of a group is a duplicate of all other members in the same group." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-109", "text": "Finally, we extract duplicate bug report pairs by pairing each two members of each group." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-110", "text": "We get in total 53,363 duplicate bug report pairs." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-111", "text": "As the first step, we employ parallel sentence selection, described in Sec. 3, to remove nonparallel duplicate bug report pairs." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-112", "text": "After this step, we find 5,935 parallel duplicate bug report pairs." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-113", "text": "Experimental Setup The baseline method we consider is the one in (Barzilay and McKeown, 2001 ) without sentence alignment -as the bug reports are usually of one sentence long." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-114", "text": "We call it BL." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-115", "text": "As described in Sec. 2, BL utilizes a threshold to control the number of patterns mined." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-116", "text": "These patterns are later used to select paraphrases." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-117", "text": "In the experiment, we find that running BL using their default threshold of 0.95 on the 5,935 parallel bug reports only gives us 18 paraphrases." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-118", "text": "This number is too small for practical purposes." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-119", "text": "Therefore, we reduce the threshold to get more paraphrases." 
}, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-120", "text": "For each threshold in the range of 0.45-0.95 (step size: 0.05), we extract paraphrases and compute the corresponding precision." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-121", "text": "In our approach, we first form chunk pairs from the 5,935 pairs of parallel sentences and then use the baseline approach at a low threshold to ob-tain patterns." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-122", "text": "Using these patterns we compute the global context-based scores S g ." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-123", "text": "We also compute the co-occurrence scores S c ." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-124", "text": "We rank and extract top-k paraphrases based on these scores." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-125", "text": "We consider 4 different methods: We can use either S g or S c to rank the discovered paraphrases." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-126", "text": "We call them Rk-S g and Rk-S c ." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-127", "text": "We also consider using one of the scores for ranking and the other for filtering bad candidate paraphrases." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-128", "text": "A threshold of 0.05 is used for filtering." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-129", "text": "We call these two methods Rk-S c +Ft-S g and Rk-S g +Ft-S c ." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-130", "text": "With ranked lists from these 4 methods, we can compute precision@k for the top-k paraphrases." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-131", "text": "Results The comparison among these methods is plotted in Figure 1 ." 
}, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-132", "text": "From the figure we can see that our holistic approach using global-context score to rank and co-occurrence score to filter (i.e., Rk-S g +Ft-S c ) has higher precision than the baseline approach (i.e., BL) in all ks." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-133", "text": "In general, the other holistic configuration (i.e., Rk-S c +Ft-S g ) also works well for most of the ks considered." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-134", "text": "Interestingly, the graph shows that using only one of the scores alone (i.e., Rk-S g and Rk-S c ) does not result in a significantly higher precision than the baseline approach." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-135", "text": "A holistic approach by merging global-context score and co-occurrence score is needed to yield higher precision." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-136", "text": "In Table 2 , we show some examples of the paraphrases our algorithm extracted from the bug report corpus." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-137", "text": "As we can see, most of the paraphrases are very technical and only make sense in the software domain." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-138", "text": "It demonstrates the effectiveness of our method." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-139", "text": "----------------------------------" }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-140", "text": "**CONCLUSION**" }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-141", "text": "In this paper, we develop a new technique to extract paraphrases of technical terms from software bug reports." 
}, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-142", "text": "Paraphrases of technical terms have been shown to be useful for various software enthe edit-field \u2194 input line field presentation mode \u2194 full screen mode word separator \u2194 a word delimiter application \u2194 app freeze \u2194 crash mru file list \u2194 recent file list multiple monitor \u2194 extended desktop xl file \u2194 excel file Table 2 : Examples of paraphrases of technical terms mined from bug reports." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-143", "text": "gineering tasks." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-144", "text": "These paraphrases could not be obtained via general purpose thesaurus e.g., WordNet." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-145", "text": "Interestingly, there is a wealth of text data, in particular bug reports, available for analysis in open-source software repositories." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-146", "text": "Despite their availability, a good technique is needed to extract paraphrases from these corpora as they are often noisy." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-147", "text": "We develop several approaches to address noisy data via parallel sentence selection, globalcontext based scoring and co-occurrence based scoring." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-148", "text": "To show the utility of our approach, we experimented with many parallel bug reports from a large software project." }, { "sent_id": "2f3e2c81bed66fd020731b2475bb98-C001-149", "text": "The preliminary experiment result is promising as it could significantly improves an existing method by up to 58%." 
} ], "y": { "@MOT@": { "gold_contexts": [ [ "2f3e2c81bed66fd020731b2475bb98-C001-20" ], [ "2f3e2c81bed66fd020731b2475bb98-C001-40", "2f3e2c81bed66fd020731b2475bb98-C001-41" ], [ "2f3e2c81bed66fd020731b2475bb98-C001-70", "2f3e2c81bed66fd020731b2475bb98-C001-71" ], [ "2f3e2c81bed66fd020731b2475bb98-C001-113", "2f3e2c81bed66fd020731b2475bb98-C001-114", "2f3e2c81bed66fd020731b2475bb98-C001-115", "2f3e2c81bed66fd020731b2475bb98-C001-117", "2f3e2c81bed66fd020731b2475bb98-C001-118" ] ], "cite_sentences": [ "2f3e2c81bed66fd020731b2475bb98-C001-20", "2f3e2c81bed66fd020731b2475bb98-C001-40", "2f3e2c81bed66fd020731b2475bb98-C001-41", "2f3e2c81bed66fd020731b2475bb98-C001-70", "2f3e2c81bed66fd020731b2475bb98-C001-71", "2f3e2c81bed66fd020731b2475bb98-C001-113", "2f3e2c81bed66fd020731b2475bb98-C001-114", "2f3e2c81bed66fd020731b2475bb98-C001-115", "2f3e2c81bed66fd020731b2475bb98-C001-117" ] }, "@BACK@": { "gold_contexts": [ [ "2f3e2c81bed66fd020731b2475bb98-C001-37", "2f3e2c81bed66fd020731b2475bb98-C001-38" ], [ "2f3e2c81bed66fd020731b2475bb98-C001-40" ], [ "2f3e2c81bed66fd020731b2475bb98-C001-70" ] ], "cite_sentences": [ "2f3e2c81bed66fd020731b2475bb98-C001-38", "2f3e2c81bed66fd020731b2475bb98-C001-40", "2f3e2c81bed66fd020731b2475bb98-C001-70" ] }, "@USE@": { "gold_contexts": [ [ "2f3e2c81bed66fd020731b2475bb98-C001-40", "2f3e2c81bed66fd020731b2475bb98-C001-42", "2f3e2c81bed66fd020731b2475bb98-C001-49" ], [ "2f3e2c81bed66fd020731b2475bb98-C001-78" ], [ "2f3e2c81bed66fd020731b2475bb98-C001-93" ], [ "2f3e2c81bed66fd020731b2475bb98-C001-113", "2f3e2c81bed66fd020731b2475bb98-C001-114", "2f3e2c81bed66fd020731b2475bb98-C001-115", "2f3e2c81bed66fd020731b2475bb98-C001-117", "2f3e2c81bed66fd020731b2475bb98-C001-121" ] ], "cite_sentences": [ "2f3e2c81bed66fd020731b2475bb98-C001-40", "2f3e2c81bed66fd020731b2475bb98-C001-42", "2f3e2c81bed66fd020731b2475bb98-C001-49", "2f3e2c81bed66fd020731b2475bb98-C001-78", "2f3e2c81bed66fd020731b2475bb98-C001-93", "2f3e2c81bed66fd020731b2475bb98-C001-113", 
"2f3e2c81bed66fd020731b2475bb98-C001-114", "2f3e2c81bed66fd020731b2475bb98-C001-115", "2f3e2c81bed66fd020731b2475bb98-C001-117", "2f3e2c81bed66fd020731b2475bb98-C001-121" ] }, "@EXT@": { "gold_contexts": [ [ "2f3e2c81bed66fd020731b2475bb98-C001-67" ], [ "2f3e2c81bed66fd020731b2475bb98-C001-113", "2f3e2c81bed66fd020731b2475bb98-C001-114", "2f3e2c81bed66fd020731b2475bb98-C001-115", "2f3e2c81bed66fd020731b2475bb98-C001-117", "2f3e2c81bed66fd020731b2475bb98-C001-118", "2f3e2c81bed66fd020731b2475bb98-C001-119", "2f3e2c81bed66fd020731b2475bb98-C001-121" ] ], "cite_sentences": [ "2f3e2c81bed66fd020731b2475bb98-C001-67", "2f3e2c81bed66fd020731b2475bb98-C001-113", "2f3e2c81bed66fd020731b2475bb98-C001-114", "2f3e2c81bed66fd020731b2475bb98-C001-115", "2f3e2c81bed66fd020731b2475bb98-C001-117", "2f3e2c81bed66fd020731b2475bb98-C001-121" ] }, "@DIF@": { "gold_contexts": [ [ "2f3e2c81bed66fd020731b2475bb98-C001-113", "2f3e2c81bed66fd020731b2475bb98-C001-132", "2f3e2c81bed66fd020731b2475bb98-C001-134" ] ], "cite_sentences": [ "2f3e2c81bed66fd020731b2475bb98-C001-113", "2f3e2c81bed66fd020731b2475bb98-C001-132", "2f3e2c81bed66fd020731b2475bb98-C001-134" ] }, "@SIM@": { "gold_contexts": [ [ "2f3e2c81bed66fd020731b2475bb98-C001-113", "2f3e2c81bed66fd020731b2475bb98-C001-134" ] ], "cite_sentences": [ "2f3e2c81bed66fd020731b2475bb98-C001-113", "2f3e2c81bed66fd020731b2475bb98-C001-134" ] } } }, "ABC_45ba2841e91a2fd62f0534aeaf7491_8": { "x": [ { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-17", "text": "----------------------------------" }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-18", "text": "**THE ORIGINAL HYTER METRIC**" }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-2", "text": "We propose a variant of a well-known machine translation (MT) evaluation metric, HyTER (Dreyer and Marcu, 2012) , which exploits reference translations enriched with meaning equivalent expressions." 
}, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-3", "text": "The original HyTER metric relied on hand-crafted paraphrase networks which restricted its applicability to new data." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-4", "text": "We test, for the first time, HyTER with automatically built paraphrase lattices." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-5", "text": "We show that although the metric obtains good results on small and carefully curated data with both manually and automatically selected substitutes, it achieves medium performance on much larger and noisier datasets, demonstrating the limits of the metric for tuning and evaluation of current MT systems." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-6", "text": "----------------------------------" }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-8", "text": "Human translators and MT systems can produce multiple plausible translations for input texts." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-9", "text": "To reward meaning-equivalent but lexically divergent translations, MT evaluation metrics exploit synonyms and paraphrases, or multiple references (Papineni et al., 2002; Doddington, 2002; Denkowski and Lavie, 2010; Lo et al., 2012) ." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-10", "text": "The HyTER metric (Dreyer and Marcu, 2012 ) relies on massive reference networks encoding an exponential number of correct translations for parts of a given sentence, proposed by human annotators." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-11", "text": "The manually built networks attempt to encode the set of all correct translations for a sentence, and HyTER rewards high quality hypotheses by measuring their minimum edit distance to the set of possible translations." 
}, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-12", "text": "HyTER spurred a lot of enthusiasm but the need for human annotations heavily reduced its applicability to new data." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-13", "text": "We propose to use an embedding-based lexical substitution model (Melamud et al., 2015) for building this type of reference networks and test, for the first time, the metric with automatically generated lattices (hereafter HyTERA)." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-14", "text": "We show that HyTERA strongly correlates with HyTER with hand-crafted lattices, and approximates the hTER score (Snover et al., 2006) as measured using post-edits made by human annotators." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-15", "text": "Furthermore, we generate lattices for standard datasets from a recent WMT Metrics Shared Task and perform the first evaluation of HyTER on large and noisier datasets." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-16", "text": "The results show that it still remains an interesting solution for MT evaluation, but highlight its limits when used to evaluate recent MT systems that make far less errors of lexical choice than older systems." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-19", "text": "The HyTER metric (Dreyer and Marcu, 2012) computes the similarity between a translation hypothesis and a reference lattice that compactly encodes millions of meaning-equivalent translations." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-20", "text": "Formally HyTER is defined as:" }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-21", "text": "where Y is a set of references that can be encoded as a finite state automaton such as the one represented in Figure 1 , x is a translation hypothesis and LS is the standard Levenshtein distance, defined as the minimum number of substitutions, deletions and insertions required to transform x into y. 
We use, in all our experiments, our own implementation of HyTER 1 that relies on the OpenFst framework (Allauzen et al., 2007) ." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-22", "text": "Contrary to the original HyTER implementation, we do not consider permutations when transforming x into y as previous results (cf." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-23", "text": "Table 3 in (Dreyer and Marcu, 2012) ) have shown that permutations have only very little impact while significantly increasing the computational complexity of HyTER computation." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-24", "text": "2 We also use an exact search rather than an A* search to minimize Equation (1)." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-25", "text": "The HyTER metric has already been successfully used in MT evaluation but only with hand-crafted lattices." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-26", "text": "To the best of our knowledge, this is the first time it is tested with lattices built automatically." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-27", "text": "----------------------------------" }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-28", "text": "**AUTOMATIC LATTICE CREATION**" }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-29", "text": "We propose an alternative to the costly manual annotation of reference translations which exploits an embedding-based model of lexical substitution proposed by Melamud et al. (2015) (called AddCos)." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-30", "text": "The original AddCos implementation selects substitutes for words in context from the whole vocabulary." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-31", "text": "Here, we restrict candidate substitutes to paraphrases of words in the Paraphrase Database (PPDB) XXL package (Ganitkevitch et al., 2013) ." 
}, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-32", "text": "3 AddCos quantifies the fit of substitute word s for target word t in context C by measuring the semantic similarity of the substitute to the target, and the similarity of the substitute to the context: (2) AddCos(s,t,C) = cos(s,t) + \u2211 c\u2208C cos(s, c)" }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-33", "text": "----------------------------------" }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-34", "text": "**|C|+1**" }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-35", "text": "The vectors s and t are word embeddings of the substitute and target generated by the skip-gram with negative sampling model (Mikolov et al., 2013b,a) ." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-36", "text": "4 The context C is the set of context embeddings for words appearing within a fixedwidth window of the target t in a sentence (we use 1 The code is available at https://bitbucket." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-37", "text": "org/gwisniewski/hytera/ 2 Note that as permutations of interest can be compactly encoded in a fine-state graph (Kumar and Byrne, 2005) , the MOVE operation can be easily considered in our code by applying the substitutions to the permutation lattice rather than to the sentence." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-38", "text": "3 PPDB paraphrases come into packages of different sizes (going from S to XXXL): small packages contain highprecision paraphrases while larger ones have high coverage." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-39", "text": "All are available from paraphrase.org 4 For the moment, we focus on individual content words." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-40", "text": "In future work, we plan to also annotate longer text segments in the references with multi-word PPDB paraphrases." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-41", "text": "a window width of 1)." 
}, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-42", "text": "The embeddings c are context embeddings generated by skip-gram." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-43", "text": "5 In our implementation, we train 300-dimensional word and context embeddings over the 4B words in the Annotated Gigaword (AGiga) corpus (Napoles et al., 2012) using the gensim word2vec package (Mikolov et al., 2013b,a; \u0158eh\u016f\u0159ek and Sojka, 2010) ." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-44", "text": "6 Each content word token in a sentence is expanded to include all its possible substitutes selected by AddCos in this specific context, and the lattice can take any path from the expanded start token to the expanded end token." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-45", "text": "We filter the paraphrase candidates according to: a) their PPDB2.0 score, an out-of-context measure of paraphrase confidence which denotes the strength of the relation between the paraphrase and the target word (hereafter, PPDBSc) (Pavlick et al., 2015) ; b) the substitution score assigned to paraphrases in context by the AddCos model (hereafter, AddCosSc), which shows whether the paraphrase is a good fit for the target word in a specific context." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-46", "text": "7 Figure 1 shows the four highest ranked paraphrases proposed by AddCos for words in the English reference sentence: Matt Damon downplays diversity in filmmaking." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-47", "text": "The sentences Matt Damon underestimates richness in cinematography and Matt Damon belittles pluralism in cinema. are included among the 48 references encoded in this lattice." 
}, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-48", "text": "----------------------------------" }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-49", "text": "**EVALUATING HYTER WITH AUTOMATIC SUBSTITUTIONS**" }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-50", "text": "We assess the quality of HyTERA to evaluate the quality of MT output both at the sentence and the system level." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-51", "text": "We first use the setting of Dreyer and Marcu (2012) , in Section \u00a7 5.1, to compare the score estimated by HyTER and HyTERA to hTER scores." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-52", "text": "In Section \u00a7 5.2, we explore whether HyTERA can reliably predict human translation quality scores from the WMT16 Metrics Shared Task." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-53", "text": "5 Comparing HyTER and HyTERA" }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-54", "text": "----------------------------------" }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-55", "text": "**OPEN MT NIST EVALUATION**" }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-56", "text": "To evaluate the performance of HyTER, Dreyer and Marcu (2012) examine whether it can approximate the hTER score (Snover et al., 2006) that measures the number of edits required to change a system output into its post-edition." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-57", "text": "hTER scores are a good estimate of translation quality and usefulness, but require each translation hypothesis to be corrected by a human annotator." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-58", "text": "Dreyer and Marcu (2012) show that it can be closely approximated by HyTER scores." 
}, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-59", "text": "In this section, we reproduce their experiments with HyTERA to see whether it is possible to use automatically-built rather than hand-crafted references to approximate hTER scores." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-60", "text": "Experimental Setting We build meaningequivalent lattices by applying the lexical substitution method described in Section 3 to each of the four references associated with a sentence, and considering the union of the resulting lattices." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-61", "text": "We report results for two kinds of lattices: lattices encoding all lexical substitutes available for a word in PPDB (allPars) and lattices of substitutes with PPDBSc>2.3 (allParsFiltered) and AddCosSc\u22650." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-62", "text": "As expected, the allPars lattices are much larger than the manual and the filtered lattices (cf. Table 1 )." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-63", "text": "In all our experiments, all corpora are down-cased and tokenized using standard Moses scripts." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-64", "text": "hTER scores are computed using TERp (Snover et al., 2009 )." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-65", "text": "score, estimated by the arithmetic mean of 1 to 4-gram precisions." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-66", "text": "9 In all cases, there is a high correlation between HyTER, HyTERA and hTER, significantly higher than the correlation between BLEU and hTER." 
}, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-67", "text": "This observation shows that replacing the handcrafted lattices with automatically built ones has only a moderate impact on the HyTER metric quality: automatic lattices result in a small drop of the correlation when evaluating hypotheses translated from Chinese, and slightly improve it for the Arabic to English condition." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-68", "text": "Overall HyTERA scores are highly correlated with HyTER scores (\u03c1 = 0.766 for Arabic and \u03c1 = 0.756 for Chinese)." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-69", "text": "More importantly, considering the filtered lattices allows to significantly reduce computation time compared to the allPars ones without hurting the quality estimation capacity of the metric." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-70", "text": "Figure 2 shows how the five MT systems are ranked by the different metrics we consider, when translating from Arabic to English." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-71", "text": "All metrics rank the systems in the same order, except from HyTER with allParsFiltered that only inverts two systems." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-72", "text": "Note that the tested systems were selected by NIST to cover a variety of system architectures (statistical, rule-based, hybrid) and performances (Dreyer and Marcu, 2012) , which makes distinction between them an easy task for all metrics." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-73", "text": "The benefits of using a metric like HyTER, which focuses on the word level, are much clearer in the sentence-based evaluation (Table 2) ." 
}, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-74", "text": "----------------------------------" }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-75", "text": "**SENTENCE LEVEL EVALUATION**" }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-76", "text": "----------------------------------" }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-77", "text": "**SYSTEM LEVEL EVALUATION**" }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-78", "text": "----------------------------------" }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-79", "text": "**WMT METRICS EVALUATION**" }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-80", "text": "In our second set of experiments, we explore the ability of HyTERA to predict direct human judgments at the sentence level using the setting of the WMT16 Metrics Shared Task (Bojar et al., 2016) ." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-81", "text": "We measure the correlation between ad- equacy scores collected on Amazon Mechanical Turk following the method advocated by and the translation quality estimated by applying HyTERA to the official WMT reference." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-82", "text": "Table 3 reports the results achieved by HyTERA on the six language pairs of the WMT16 Shared Task and its rank among the other metrics tested in the competition." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-83", "text": "HyTERA obtains medium performance on the WMT16 dataset, which is much larger and noisier than the dataset used for evaluation in (Dreyer and Marcu, 2012) : it is made, for each language, of 560 translations sampled from outputs of all systems taking part in the WMT15 campaign." 
}, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-84", "text": "It is important to note that the hTER scores used in the initial HyTER evaluation were produced by experienced LDC annotators, while the WMT16 Direct Assessment (DA) adequacy judgments were collected from non-experts through crowd-sourcing (Bojar et al., 2016) ." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-85", "text": "HyTERA achieves higher performance than the SENTBLEU baseline in four language pairs (cs/de/ru/tr-en)." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-86", "text": "It obtains slightly lower correlation than SENTBLEU for fi-en and ro-en, the language pairs in which correlation was lower for all metrics." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-87", "text": "Among the metrics tested at the WMT16 shared task we find combination metrics, and metrics that have been tuned on a development dataset." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-88", "text": "The metric that performs best for most languages in the segment-level WMT16 evaluation, DPMFCOMB, combines 57 individual metrics (Yu et al., 2015) ." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-89", "text": "Similarly, the second highest ranked metric, METRICSF, combines BLEU, METEOR, the alignment-based metric UPF-COMBALT (Fomicheva et al., 2016) , and fluency features." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-90", "text": "The BEER metric, found in fifth position, is a trained evaluation metric with a linear model that combines features capturing character n-grams and permutation trees (Stanojevi\u0107 and Sima'an, 2015) ." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-91", "text": "We report the rank of HyTERA among all metrics (single and combined), and among the single ones." 
}, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-92", "text": "It is important to note that HyTERA needs no tuning, is straightforward to use and very fast to compute, especially with filtered lattices (on average 6s)." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-93", "text": "The lower performance of the metric on this dataset is also due to the different nature of the MT systems tested." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-94", "text": "While in the (Dreyer and Marcu, 2012) evaluation, the systems came from the 2010 Open MT NIST evaluation and were selected to cover a variety of architectures and performances, the systems that participated in WMT15 are, for the large part, neural MT systems (Bojar et al., 2015) ." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-95", "text": "As reported by Bentivogli et al. (2016) , Neural MT systems make at least 17% fewer lexical choice errors than phrase-based systems, which limits the potential of HyTERA, primarily focused on capturing correct lexical choice." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-96", "text": "----------------------------------" }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-97", "text": "**CONCLUSION**" }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-98", "text": "We have proposed a method for automatic paraphrase lattice creation which makes the HyTER metric applicable to new datasets." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-99", "text": "We provide the first evaluation of HyTER on data from a recent WMT Metrics Shared task." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-100", "text": "We show that although the metric achieves high correlation with human judgments of translation quality on small and carefully curated data, with both manual and automatically constructed paraphrase networks, it obtains medium performance on recent WMT data." 
}, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-101", "text": "The lower performance is mainly due to the noisier nature of the data and to the higher quality lexical choices made by current neural MT systems, compared to phrase-based and transfer systems, which limits the potential of the metric for system evaluation and tuning." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-102", "text": "In its current form, the paraphrase substitution mechanism supports only lexical substitutions." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-103", "text": "It would be straightforward to extend the AddCos method to handle multi-word paraphrases by training embeddings for multi-word phrases, keeping in mind that longer substitutions might require restructuring the produced sentences to preserve grammaticality." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-104", "text": "----------------------------------" }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-105", "text": "**ACKNOWLEDGMENTS**" }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-106", "text": "We would like to thank Markus Dreyer for sharing with us the original HyTER code." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-107", "text": "This work has been supported by the French National Research Agency under project ANR-16-CE33-0013." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-108", "text": "This material is based in part on research sponsored by DARPA under grant number FA8750-13-2-0017 (the DEFT program) and HR0011-15-C-0115 (the LORELEI program)." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-109", "text": "The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes." }, { "sent_id": "45ba2841e91a2fd62f0534aeaf7491-C001-110", "text": "The views and conclusions contained in this publication are those of the authors and should not be interpreted as representing official policies or endorsements of DARPA and the U.S. 
Government." } ], "y": { "@EXT@": { "gold_contexts": [ [ "45ba2841e91a2fd62f0534aeaf7491-C001-2", "45ba2841e91a2fd62f0534aeaf7491-C001-3", "45ba2841e91a2fd62f0534aeaf7491-C001-5" ], [ "45ba2841e91a2fd62f0534aeaf7491-C001-22", "45ba2841e91a2fd62f0534aeaf7491-C001-23" ], [ "45ba2841e91a2fd62f0534aeaf7491-C001-67" ], [ "45ba2841e91a2fd62f0534aeaf7491-C001-98" ] ], "cite_sentences": [ "45ba2841e91a2fd62f0534aeaf7491-C001-2", "45ba2841e91a2fd62f0534aeaf7491-C001-3", "45ba2841e91a2fd62f0534aeaf7491-C001-5", "45ba2841e91a2fd62f0534aeaf7491-C001-22", "45ba2841e91a2fd62f0534aeaf7491-C001-23", "45ba2841e91a2fd62f0534aeaf7491-C001-67", "45ba2841e91a2fd62f0534aeaf7491-C001-98" ] }, "@MOT@": { "gold_contexts": [ [ "45ba2841e91a2fd62f0534aeaf7491-C001-2", "45ba2841e91a2fd62f0534aeaf7491-C001-3", "45ba2841e91a2fd62f0534aeaf7491-C001-5" ], [ "45ba2841e91a2fd62f0534aeaf7491-C001-10", "45ba2841e91a2fd62f0534aeaf7491-C001-12" ], [ "45ba2841e91a2fd62f0534aeaf7491-C001-73" ] ], "cite_sentences": [ "45ba2841e91a2fd62f0534aeaf7491-C001-2", "45ba2841e91a2fd62f0534aeaf7491-C001-3", "45ba2841e91a2fd62f0534aeaf7491-C001-5", "45ba2841e91a2fd62f0534aeaf7491-C001-10", "45ba2841e91a2fd62f0534aeaf7491-C001-12", "45ba2841e91a2fd62f0534aeaf7491-C001-73" ] }, "@USE@": { "gold_contexts": [ [ "45ba2841e91a2fd62f0534aeaf7491-C001-2", "45ba2841e91a2fd62f0534aeaf7491-C001-4" ], [ "45ba2841e91a2fd62f0534aeaf7491-C001-12", "45ba2841e91a2fd62f0534aeaf7491-C001-13", "45ba2841e91a2fd62f0534aeaf7491-C001-14", "45ba2841e91a2fd62f0534aeaf7491-C001-15" ], [ "45ba2841e91a2fd62f0534aeaf7491-C001-51" ], [ "45ba2841e91a2fd62f0534aeaf7491-C001-72" ], [ "45ba2841e91a2fd62f0534aeaf7491-C001-99" ] ], "cite_sentences": [ "45ba2841e91a2fd62f0534aeaf7491-C001-2", "45ba2841e91a2fd62f0534aeaf7491-C001-4", "45ba2841e91a2fd62f0534aeaf7491-C001-12", "45ba2841e91a2fd62f0534aeaf7491-C001-14", "45ba2841e91a2fd62f0534aeaf7491-C001-15", "45ba2841e91a2fd62f0534aeaf7491-C001-51", "45ba2841e91a2fd62f0534aeaf7491-C001-72", 
"45ba2841e91a2fd62f0534aeaf7491-C001-99" ] }, "@BACK@": { "gold_contexts": [ [ "45ba2841e91a2fd62f0534aeaf7491-C001-10", "45ba2841e91a2fd62f0534aeaf7491-C001-11", "45ba2841e91a2fd62f0534aeaf7491-C001-66", "45ba2841e91a2fd62f0534aeaf7491-C001-68" ], [ "45ba2841e91a2fd62f0534aeaf7491-C001-19" ], [ "45ba2841e91a2fd62f0534aeaf7491-C001-25" ], [ "45ba2841e91a2fd62f0534aeaf7491-C001-56" ], [ "45ba2841e91a2fd62f0534aeaf7491-C001-58" ], [ "45ba2841e91a2fd62f0534aeaf7491-C001-84" ], [ "45ba2841e91a2fd62f0534aeaf7491-C001-94" ] ], "cite_sentences": [ "45ba2841e91a2fd62f0534aeaf7491-C001-10", "45ba2841e91a2fd62f0534aeaf7491-C001-11", "45ba2841e91a2fd62f0534aeaf7491-C001-66", "45ba2841e91a2fd62f0534aeaf7491-C001-68", "45ba2841e91a2fd62f0534aeaf7491-C001-19", "45ba2841e91a2fd62f0534aeaf7491-C001-25", "45ba2841e91a2fd62f0534aeaf7491-C001-56", "45ba2841e91a2fd62f0534aeaf7491-C001-58", "45ba2841e91a2fd62f0534aeaf7491-C001-84", "45ba2841e91a2fd62f0534aeaf7491-C001-94" ] }, "@SIM@": { "gold_contexts": [ [ "45ba2841e91a2fd62f0534aeaf7491-C001-12", "45ba2841e91a2fd62f0534aeaf7491-C001-13", "45ba2841e91a2fd62f0534aeaf7491-C001-14", "45ba2841e91a2fd62f0534aeaf7491-C001-66", "45ba2841e91a2fd62f0534aeaf7491-C001-68" ], [ "45ba2841e91a2fd62f0534aeaf7491-C001-67" ], [ "45ba2841e91a2fd62f0534aeaf7491-C001-71" ] ], "cite_sentences": [ "45ba2841e91a2fd62f0534aeaf7491-C001-12", "45ba2841e91a2fd62f0534aeaf7491-C001-14", "45ba2841e91a2fd62f0534aeaf7491-C001-66", "45ba2841e91a2fd62f0534aeaf7491-C001-68", "45ba2841e91a2fd62f0534aeaf7491-C001-67", "45ba2841e91a2fd62f0534aeaf7491-C001-71" ] }, "@DIF@": { "gold_contexts": [ [ "45ba2841e91a2fd62f0534aeaf7491-C001-22", "45ba2841e91a2fd62f0534aeaf7491-C001-23" ], [ "45ba2841e91a2fd62f0534aeaf7491-C001-25" ], [ "45ba2841e91a2fd62f0534aeaf7491-C001-51" ], [ "45ba2841e91a2fd62f0534aeaf7491-C001-83" ], [ "45ba2841e91a2fd62f0534aeaf7491-C001-94", "45ba2841e91a2fd62f0534aeaf7491-C001-95" ] ], "cite_sentences": [ "45ba2841e91a2fd62f0534aeaf7491-C001-22", 
"45ba2841e91a2fd62f0534aeaf7491-C001-23", "45ba2841e91a2fd62f0534aeaf7491-C001-25", "45ba2841e91a2fd62f0534aeaf7491-C001-51", "45ba2841e91a2fd62f0534aeaf7491-C001-83", "45ba2841e91a2fd62f0534aeaf7491-C001-94" ] } } }, "ABC_c6bae8dbdb66092865945e776148e6_8": { "x": [ { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-45", "text": "**ENTAILMENT CLASSIFIER**" }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-2", "text": "We present a simple sequential sentence encoder for multi-domain natural language inference." }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-22", "text": "Our model mainly consists of two separate components, a sentence encoder and an entailment classifier." }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-3", "text": "Our encoder is based on stacked bidirectional LSTM-RNNs with shortcut connections and fine-tuning of word embeddings." }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-4", "text": "The overall supervised model uses the above encoder to encode two input sentences into two vectors, and then uses a classifier over the vector combination to label the relationship between these two sentences as that of entailment, contradiction, or neural." }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-5", "text": "Our Shortcut-Stacked sentence encoders achieve strong improvements over existing encoders on matched and mismatched multi-domain natural language inference (top non-ensemble single-model result in the EMNLP RepEval 2017 Shared Task (Nangia et al., 2017))." }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-6", "text": "Moreover, they achieve the new state-of-theart encoding result on the original SNLI dataset (Bowman et al., 2015) ." 
}, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-7", "text": "----------------------------------" }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-8", "text": "**INTRODUCTION AND BACKGROUND**" }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-9", "text": "Natural language inference (NLI) or recognizing textual entailment (RTE) is a fundamental semantic task in the field of natural language processing." }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-10", "text": "The problem is to determine whether a given hypothesis sentence can be logically inferred from a given premise sentence." }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-11", "text": "Recently released datasets such as the Stanford Natural Language Inference Corpus (Bowman et al., 2015) (SNLI) and the Multi-Genre Natural Language Inference Corpus ) (Multi-NLI) have not only encouraged several end-to-end neural network approaches to NLI, but have also served as an evaluation resource for general representation learning of natural language." }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-12", "text": "Depending on whether a model will first encode a sentence into a fixed-length vector without any incorporating information from the other sentence, the several proposed models can be categorized into two groups: (1) encoding-based models (or sentence encoders), such as Tree-based CNN encoders (TBCNN) in Mou et al. (2015) or Stack-augmented Parser-Interpreter Neural Network (SPINN) in Bowman et al. (2016) , and (2) joint, pairwise models that use cross-features between the two sentences to encode them, such as the Enhanced Sequential Inference Model (ESIM) in Chen et al. (2017) or the bilateral multiperspective matching (BiMPM) model Wang et al. (2017) ." }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-13", "text": "Moreover, common sentence encoders can again be classified into tree-based encoders such as SPINN in Bowman et al. 
(2016) which we mentioned before, or sequential encoders such as the biLSTM model by Bowman et al. (2015) ." }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-14", "text": "In this paper, we follow the former approach of encoding-based models, and propose a novel yet simple sequential sentence encoder for the Multi-NLI problem." }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-15", "text": "Our encoder does not require any syntactic information of the sentence." }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-16", "text": "It also does not contain any attention or memory structure." }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-17", "text": "It is basically a stacked (multi-layered) bidirectional LSTM-RNN with shortcut connections (feeding all previous layers' outputs and word embeddings to each layer) and word embedding fine-tuning." }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-18", "text": "The overall supervised model uses these shortcut-stacked encoders to encode two input sentences into two vectors, and then we use a classifier over the vector combination to label the relationship between these two sentences as that of entailment, contradiction, or neutral (similar to the classifier setup of Bowman et al. (2015) and Conneau et al. (2017) )." }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-19", "text": "Our simple shortcut-stacked encoders achieve strong improvements over existing encoders due to their multi-layered and shortcut-connected properties, on both matched and mismatched evaluation settings for multi-domain natural language inference, as well as on the original SNLI dataset." }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-20", "text": "It is the top single-model (non-ensemble) result in the EMNLP RepEval 2017 Multi-NLI Shared Task, and the new state-of-the-art for encoding-based results on the SNLI dataset (Bowman et al., 2015) ." 
}, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-21", "text": "Github Code Link: https://github.com/ easonnie/multiNLI_encoder 2 Model" }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-23", "text": "The sentence encoder compresses each source sentence into a vector representation and the classifier makes a three-way classification based on the two vectors of the two source sentences." }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-24", "text": "The model follows the 'encoding-based rule', i.e., the encoder will encode each source sentence into a fixed length vector without any information or function based on the other sentence (e.g., cross-attention or memory comparing the two sentences)." }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-25", "text": "In order to fully explore the generalization of the sentence encoder, the same encoder is applied to both the premise and the hypothesis with shared parameters projecting them into the same space." }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-26", "text": "This setting follows the idea of Siamese Networks in Bromley et al. (1994) ." }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-27", "text": "Figure 1 shows the overview of our encoding model (the standard classifier setup is not shown here; see Bowman et al. (2015) and Conneau et al. (2017) for that)." }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-28", "text": "----------------------------------" }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-29", "text": "**SENTENCE ENCODER**" }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-30", "text": "Our sentence encoder is simply composed of multiple stacked bidirectional LSTM (biLSTM) layers with shortcut connections followed by a max pooling layer." 
}, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-31", "text": "Let bilstm i represent the ith biLSTM layer, which is defined as:" }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-32", "text": "where h i t is the output of the ith biLSTM at time t over input sequence (x i 1 , x i 2 , ..., x i n )." }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-33", "text": "In a typical stacked biLSTM structure, the input of the next LSTM-RNN layer is simply the output sequence of the previous LSTM-RNN layer." }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-34", "text": "In our settings, the input sequences for the ith biLSTM layer are the concatenated outputs of all the previous layers, plus the original word embedding sequence." }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-35", "text": "This gives a shortcut connection style setup, related to the widely used idea of residual connections in CNNs for computer vision (He et al., 2016) , highway networks for RNNs in speech processing , and shortcut connections in hierarchical multitasking learning (Hashimoto et al., 2016) ; but in our case we feed in all the previous layers' output se-quences as well as the word embedding sequence to every layer." }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-36", "text": "Let W = (w 1 , w 2 , ..., w n ) represent words in the source sentence." }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-37", "text": "We assume w i \u2208 R d is a word embedding vector which are initialized using some pre-trained vector embeddings (and is then fine-tuned end-to-end via the NLI supervision)." }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-38", "text": "Then, the input of ith biLSTM layer at time t is defined as:" }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-39", "text": "Then, assuming we have m layers of biLSTM, the final vector representation will be obtained by applying row-max-pool over the output of the last biLSTM layer, similar to Conneau et al. (2017) ." 
}, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-40", "text": "The final layer is defined as:" }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-41", "text": ", d m is the dimension of the hidden state of the last forward and backward LSTM layers, and v is the final vector representation for the source sentence (which is later fed to the NLI classifier)." }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-42", "text": "The closest encoder architecture to ours is that of Conneau et al. (2017) , whose model consists of a single-layer biLSTM with a max-pooling layer, which we treat as our starting point." }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-43", "text": "Our experiments (Section 4) demonstrate that our enhancements of the stacked-biRNN with shortcut connections provide significant gains on top of this baseline (for both SNLI and Multi-NLI)." }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-44", "text": "----------------------------------" }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-46", "text": "After we obtain the vector representation for the premise and hypothesis sentence, we apply three matching methods to the two vectors (i) concatenation (ii) element-wise distance and (iii) elementwise product for these two vectors and then concatenate these three match vectors (based on the heuristic matching presented in Mou et al. (2015) )." }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-47", "text": "Let v p and v h be the vector representations for premise and hypothesis, respectively." }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-48", "text": "The matching vector is then defined as:" }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-49", "text": "At last, we feed this final concatenated result m into a MLP layer and use a softmax layer to make final classification." 
}, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-50", "text": "Table 5 , we train on only the SNLI training set (and we also verify that the tuning decisions hold true on the SNLI dev set)." }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-51", "text": "----------------------------------" }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-52", "text": "**PARAMETER SETTINGS**" }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-53", "text": "We use cross-entropy loss as the training objective with Adam-based (Kingma and Ba, 2014) opti-Model Accuracy SNLI Multi-NLI Matched Multi-NLI Mismatched CBOW 80.6 65.2 64.6 biLSTM Encoder 81.5 67.5 67.1 300D Tree-CNN Encoder (Mou et al., 2015) 82.1 --300D SPINN-PI Encoder (Bowman et al., 2016) 83.2 --300D NSE Encoder (Munkhdalai and Yu, 2016) 84.6 --biLSTM-Max Encoder (Conneau et al., 2017) 84." }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-54", "text": "mization with 32 batch size." }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-55", "text": "The starting learning rate is 0.0002 with half decay every two epochs." }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-56", "text": "The number of hidden units for MLP in classifier is 1600." }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-57", "text": "Dropout layer is also applied on the output of each layer of MLP, with dropout rate set to 0.1." }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-58", "text": "We used pre-trained 300D Glove 840B vectors (Pennington et al., 2014) to initialize the word embeddings." }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-59", "text": "Tuning decisions for word embedding training strategy, the hyperparameters of dimension and number of layers for biLSTM, and the activation type and number of layers for MLP, are all explained in Section 4." 
}, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-60", "text": "----------------------------------" }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-61", "text": "**RESULTS AND ANALYSIS**" }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-62", "text": "----------------------------------" }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-63", "text": "**ABLATION ANALYSIS RESULTS**" }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-64", "text": "We now investigate the effectiveness of each of the enhancement components in our overall model." }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-65", "text": "These ablation results are shown in Tables 1, 2, 3 and 4, all based on the Multi-NLI development sets." }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-66", "text": "Finally, Table 5 shows results for different encoders on SNLI and Multi-NLI test sets." }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-67", "text": "First, Table 1 shows the performance changes for different number of biLSTM layers and their varying dimension size." }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-68", "text": "The dimension size of a biLSTM layer is referring to the dimension of the hidden state for both the forward and backward LSTM-RNNs." }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-69", "text": "As shown, each added layer model improves the accuracy and we achieve a substantial improvement in accuracy (around 2%) on both matched and mismatched settings, compared to the single-layer biLSTM in Conneau et al. (2017) ." }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-70", "text": "We only experimented with up to 3 layers with 512, 1024, 2048 dimensions each, so the model still has potential to improve the result further with a larger dimension and more layers." 
}, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-71", "text": "Next, in Table 2 , we show that the shortcut connections among the biLSTM layers is also an important contributor to accuracy improvement (around 1.5% on top of the full 3-layered stacked-RNN model)." }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-72", "text": "This demonstrates that simply stacking the biLSTM layers is not sufficient to handle a complex task like Multi-NLI and it is significantly better to have the higher layer connected to both the output and the original input of all the previous layers (note that Table 1 results are based on multi-layered models with shortcut connections)." }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-73", "text": "Next, in Table 3 , we show that fine-tuning the word embeddings also improves results, again for both the in-domain task and cross-domain tasks (the ablation results are based on a smaller model with a 128+256 2-layer biLSTM)." }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-74", "text": "Hence, all our models were trained with word embeddings being fine-tuned." }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-75", "text": "The last ablation in Table 4 shows that a classifier with two layers of relu is preferable than other options." }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-76", "text": "Thus, we use that setting for our strongest encoder." }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-77", "text": "----------------------------------" }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-78", "text": "**MULTI-NLI AND SNLI TEST RESULTS**" }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-79", "text": "Finally, in Table 5 , we report the test results for MNLI and SNLI." }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-80", "text": "First for Multi-NLI, we improve substantially over the CBOW and biL-STM Encoder baselines reported in the dataset paper ." 
}, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-81", "text": "We also show that our final shortcut-based stacked encoder achieves around 3% improvement as compared to the 1layer biLSTM-Max Encoder in the second last row (using the exact same classifier and optimizer settings)." }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-82", "text": "Our shortcut-encoder was also the top singe-model (non-ensemble) result on the EMNLP RepEval Shared Task leaderboard." }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-83", "text": "Next, for SNLI, we compare our shortcutstacked encoder with the current state-of-the-art encoders from the SNLI leaderboard (https:// nlp.stanford.edu/projects/snli/)." }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-84", "text": "We also compare to the recent biLSTM-Max Encoder of Conneau et al. (2017) , which served as our model's 1-layer starting point." }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-85", "text": "1 The results indicate that 'Our Shortcut-Stacked Encoder' sur-passes all the previous state-of-the-art encoders, and achieves the new best encoding-based result on SNLI, suggesting the general effectiveness of simple shortcut-connected stacked layers in sentence encoders." }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-86", "text": "----------------------------------" }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-87", "text": "**CONCLUSION**" }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-88", "text": "We explored various simple combinations and connections of biLSTM-RNN layered architectures and developed a Shortcut-Stacked Sentence Encoder for natural language inference." }, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-89", "text": "Our model is the top single result in the EMNLP RepEval 2017 Multi-NLI Shared Task, and it also surpasses the state-of-the-art encoders for the SNLI dataset." 
}, { "sent_id": "c6bae8dbdb66092865945e776148e6-C001-90", "text": "In future work, we are also evaluating the effectiveness of shortcut-stacked sentence encoders on several other semantic tasks." } ], "y": { "@USE@": { "gold_contexts": [ [ "c6bae8dbdb66092865945e776148e6-C001-84" ], [ "c6bae8dbdb66092865945e776148e6-C001-18" ], [ "c6bae8dbdb66092865945e776148e6-C001-39" ] ], "cite_sentences": [ "c6bae8dbdb66092865945e776148e6-C001-84", "c6bae8dbdb66092865945e776148e6-C001-18", "c6bae8dbdb66092865945e776148e6-C001-39" ] }, "@DIF@": { "gold_contexts": [ [ "c6bae8dbdb66092865945e776148e6-C001-84", "c6bae8dbdb66092865945e776148e6-C001-85" ], [ "c6bae8dbdb66092865945e776148e6-C001-69" ] ], "cite_sentences": [ "c6bae8dbdb66092865945e776148e6-C001-84", "c6bae8dbdb66092865945e776148e6-C001-69" ] }, "@SIM@": { "gold_contexts": [ [ "c6bae8dbdb66092865945e776148e6-C001-18" ], [ "c6bae8dbdb66092865945e776148e6-C001-39" ], [ "c6bae8dbdb66092865945e776148e6-C001-42" ] ], "cite_sentences": [ "c6bae8dbdb66092865945e776148e6-C001-18", "c6bae8dbdb66092865945e776148e6-C001-39", "c6bae8dbdb66092865945e776148e6-C001-42" ] }, "@BACK@": { "gold_contexts": [ [ "c6bae8dbdb66092865945e776148e6-C001-27" ] ], "cite_sentences": [ "c6bae8dbdb66092865945e776148e6-C001-27" ] }, "@EXT@": { "gold_contexts": [ [ "c6bae8dbdb66092865945e776148e6-C001-42" ] ], "cite_sentences": [ "c6bae8dbdb66092865945e776148e6-C001-42" ] } } }, "ABC_1542325bbf9bed87c22d34d12ee40e_8": { "x": [ { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-2", "text": "Left-to-right (LR) decoding (Watanabe et al., 2006 ) is promising decoding algorithm for hierarchical phrase-based translation (Hiero) that visits input spans in arbitrary order producing the output translation in left to right order." 
}, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-3", "text": "This leads to far fewer language model calls, but while LR decoding is more efficient than CKY decoding, it is unable to capture some hierarchical phrase alignments reachable using CKY decoding and suffers from lower translation quality as a result." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-4", "text": "This paper introduces two improvements to LR decoding that make it comparable in translation quality to CKY-based Hiero." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-5", "text": "----------------------------------" }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-7", "text": "Hierarchical phrase-based translation (Hiero) (Chiang, 2007 ) uses a lexicalized synchronous context-free grammar (SCFG) extracted from word and phrase alignments of a bitext." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-8", "text": "Decoding for Hiero is typically done with CKY-style decoding with time complexity O(n 3 ) for source input with n words." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-9", "text": "Computing the language model score for each hypothesis within CKY decoding requires two histories, the left and the right edge of each span, due to the fact that the target side is built inside-out from sub-spans Heafield et al., 2013) ." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-10", "text": "LR-decoding algorithms exist for phrasebased (Koehn, 2004; Galley and Manning, 2010) and syntax-based (Huang and Mi, 2010; Feng et al., 2012 ) models and also for hierarchical phrasebased models (Watanabe et al., 2006; Siahbani et al., 2013) , which is our focus in this paper." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-11", "text": "Watanabe et al. 
(2006) first proposed left-to-right (LR) decoding for Hiero (LR-Hiero henceforth), which uses beam search and runs in O(n^2 b) in practice, where n is the length of the source sentence and b is the size of the beam (Huang and Mi, 2010) ." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-12", "text": "To simplify target generation, SCFG rules are constrained to be prefix-lexicalized on the target side, aka Greibach Normal Form (GNF)." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-13", "text": "Throughout this paper we abuse the notation for simplicity and use the term GNF grammars for such SCFGs." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-14", "text": "This constraint drastically reduces the size of the grammar for LR-Hiero in comparison to the Hiero grammar (Siahbani et al., 2013) ." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-15", "text": "However, the original LR-Hiero decoding algorithm does not perform well in comparison to current state-of-the-art Hiero and phrase-based translation systems." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-16", "text": "Siahbani et al. (2013) propose an augmented version of LR decoding to address some limitations of the original LR-Hiero algorithm in terms of translation quality and time efficiency." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-17", "text": "Although LR-Hiero decodes much faster than Hiero and obtains BLEU scores comparable to phrase-based translation systems on some language pairs, there is still a notable gap between CKY-Hiero and LR-Hiero (Siahbani et al., 2013) ." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-18", "text": "We show in this paper, using instructive examples, that CKY-Hiero can capture some complex phrasal re-orderings that are observed in language pairs such as Chinese-English that LR-Hiero cannot (cf. Sec. 3)."
}, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-19", "text": "We introduce two improvements to LR decoding of GNF grammars: (1) We add queue diversity to the cube pruning algorithm for LR-Hiero, and (2) We extend the LR-Hiero decoder to capture all the hierarchical phrasal alignments that are reachable in CKY-Hiero (restricted to using GNF grammars)." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-20", "text": "We evaluate our modifications on three language pairs and show that LR-Hiero can reach the translation scores comparable to CKY-Hiero in two language pairs, and reduce the gap between Hiero and LR-Hiero on the third one." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-21", "text": "----------------------------------" }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-22", "text": "**LR DECODING WITH QUEUE DIVERSITY**" }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-23", "text": "LR-Hiero uses a constrained lexicalized SCFG which we call a GNF grammar: X \u2192 \u03b3,b \u03b2 where \u03b3 is a string of non-terminal and terminal symbols,b is a string of terminal symbols and \u03b2 is a possibly empty sequence of non-terminals." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-24", "text": "This ensures that as each rule is used in a derivation, Add h to hypList 29:" }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-25", "text": "return hypList the target string is generated from left to right." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-26", "text": "The rules are obtained from a word and phrase aligned bitext using the rule extraction algorithm in (Watanabe et al., 2006) ." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-27", "text": "LR-Hiero decoding uses a top-down depth-first search, which strictly grows the hypotheses in target surface ordering." 
}, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-28", "text": "Search on the source side follows an Earley-style search (Earley, 1970) , the dot jumps around on the source side of the rules based on the order of nonterminals on the target side." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-29", "text": "This search is integrated with beam search or cube pruning to find the k-best translations." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-30", "text": "Algorithm 1 shows the pseudocode for LRHiero decoding with cube pruning (Chiang, 2007) (CP)." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-31", "text": "LR-Hiero with CP was introduced in (Siahbani et al., 2013) ." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-32", "text": "In this pseudocode, we have introduced the notion of queue diversity (explained below)." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-33", "text": "However to understand our change we need to understand the algorithm in more detail." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-34", "text": "----------------------------------" }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-35", "text": "**S I**" }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-36", "text": "Figure 1: Cubes (grids) are fed to a priority queue (triangle) and generated hypotheses are iteratively popped from the queue and added to stack Si." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-37", "text": "Lower scores are better." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-38", "text": "Scores of rules and hypotheses appear on the top and left side of the grids respectively." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-39", "text": "Shaded entries are hypotheses in the queue and black ones are popped from the queue and added to Si." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-40", "text": "Each source side non-terminal is instantiated with the legal spans given the input source string, e.g. 
if there is a Hiero rule \u27e8aX_1 , a X_1\u27e9 and a only occurs at position 3 in the input, then this rule can be applied to span [3, i] for all i, 4 < i \u2264 n, for input of length n, and the source side X_1 is instantiated to span [4, i] ." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-41", "text": "A worked out example of how the decoder works is shown in Figure 2 ." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-42", "text": "Each partial hypothesis h is a 4-tuple (h_t, h_s, h_cov, h_c), consisting of a translation prefix h_t, a (LIFO-ordered) list h_s of uncovered spans, a source-word coverage set h_cov, and the hypothesis cost h_c." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-43", "text": "The initial hypothesis is a null string with just a sentence-initial marker <s> and the list h_s containing a span of the whole sentence, [0, n]." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-44", "text": "The hypotheses are stored in stacks S_0, ..., S_n, where S_p contains hypotheses covering p source words, just like in stack decoding for phrase-based SMT (Koehn et al., 2003) ." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-45", "text": "To fill stack S_i we consider hypotheses in each stack S_p, which are first partitioned into a set of groups {G}, based on their first uncovered span (line 9)." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-46", "text": "Each group g is a 2-tuple (g_span, g_hyps), where g_hyps is a list of hypotheses which share the same first uncovered span g_span." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-47", "text": "Rules matching the span g_span are obtained from the routine GetSpanRules." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-48", "text": "Each g_hyps and possible R_s create a cube which is added to cubeList." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-49", "text": "The Merge routine gets the best hypotheses from all cubes (see Fig. 1 )."
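The hypothesis 4-tuple and the partitioning of a stack into groups by first uncovered span can be sketched as follows. This is a simplified illustration: the field names mirror the 4-tuple in the text, and treating the end of the Python list as the top of the LIFO span list is an assumption of the sketch.

```python
from collections import defaultdict, namedtuple

# A partial hypothesis (h_t, h_s, h_cov, h_c): translation prefix,
# LIFO list of uncovered spans, number of covered source words, cost.
Hyp = namedtuple('Hyp', ['prefix', 'spans', 'cov', 'cost'])

def group_by_first_uncovered_span(stack):
    """Partition one stack's hypotheses into groups g = (g_span, g_hyps)
    sharing the same first uncovered span (line 9 of Algorithm 1)."""
    groups = defaultdict(list)
    for h in stack:
        groups[h.spans[-1]].append(h)   # top of the LIFO span list
    return dict(groups)

stack = [Hyp('<s> Thailand wants', [(3, 15)], 3, 1.2),
         Hyp('<s> Thailand hopes', [(3, 15)], 3, 1.9),
         Hyp('<s> in Thailand', [(7, 15)], 3, 2.4)]
groups = group_by_first_uncovered_span(stack)
print({span: len(hyps) for span, hyps in groups.items()})
# {(3, 15): 2, (7, 15): 1}
```

Rules matching each group's span would then be paired with the group's hypotheses to form the cubes.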
}, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-50", "text": "Hypotheses (rows) and columns (rules) are sorted based on their scores." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-51", "text": "GetBestHypotheses((H, R), F, d) uses current hypothesis H and rule R to produce new hypotheses." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-52", "text": "The first best hypothesis, h along with its score h c and corresponding cube (H, R) is placed in a priority queue heapQ (triangle in Figure 1 and line 23 in Algorithm 1)." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-53", "text": "Iteratively the K best 9) xiang jingji/to the economy s Thailand wants [3, 15] s Thailand wants to utilize [4, 15] s Thailand wants to utilize this money [7, 15] s Thailand wants to utilize this money to inject more [12,15][7,9] s Thailand wants to utilize this money to inject more circulating [13,15][7,9] s Thailand wants to utilize this money to inject more circulating capital [14,15][7,9] s Thailand wants to utilize this money to inject more circulating capital ." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-54", "text": "[7, 9] s Thailand wants to utilize this money to inject more circulating capital ." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-55", "text": "to the economy /s Figure 2 : The process of translating the Chinese sentence in Figure 3 (b) in LR-Hiero." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-56", "text": "Left side shows the rules used in the derivation (G indicates glue rules as defined in (Watanabe et al., 2006) )." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-57", "text": "The hypotheses column shows the translation prefix and the ordered list of yet-to-be-covered spans." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-58", "text": "hypotheses in the queue are popped (line 26) and for each hypothesis its neighbours in the cube are added to the priority queue (line 27)." 
}, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-59", "text": "Decoding finishes when stack S n has been filled." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-60", "text": "The language model (LM) score violates the hypotheses generation assumption of CP and can cause search errors." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-61", "text": "In Figure 1 , the topmost and leftmost entry of the right cube has a score worse than many hypotheses in the left cube due to the LM score." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-62", "text": "This means the right cube has hypotheses that are ignored." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-63", "text": "This type of search error hurts LR-Hiero more than CKYHiero, due to the fact that hypotheses scores in LR-Hiero rely on a future cost, while CKY-Hiero uses the inside score for each hypothesis." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-64", "text": "To solve this issue for LR-Hiero we introduce the notion of queue diversity which is the parameter d in GetBestHypotheses((H, R), F, d)." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-65", "text": "This parameter guarantees that each cube will produce at least d candidate hypotheses for the priority queue." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-66", "text": "d=1 in standard cube pruning for LR-Hiero (Siahbani et al., 2013) ." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-67", "text": "We apply the idea of diversity at queue level, before generating K best hypothesis, such that the GetBestHypotheses routine generates d best hypotheses from each cube and all these hypotheses are pushed to the priority queue (line 22-23)." 
}, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-68", "text": "We fill each stack differently from CKY-Hiero and so queue diversity is different from lazy cube pruning (Pust and Knight, 2009) or cube growing (Huang and Chiang, 2007; Vilar and Ney, 2009; Xu and Koehn, 2012) ." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-69", "text": "Figure 2 where rule #5 is matched to span [7, 15] ." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-70", "text": "During decoding LR-Hiero maintains a stack (lastin-first-out) of yet-to-be-covered spans and tries to translate the first uncovered span (span [7, 15] in Step 5)." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-71", "text": "LR-Hiero should match rule #5 to span [7, 15] , therefore X 2 is forced to match span [12, 15] which leads to the translation of span [7, 9] (corresponding to X 1 ) being reordered around it In Figure 3 (a) monotonic translations after span [6, 9] are out of reach of the LR-Hiero decoder which has to use the non-terminals to support the reordering within span [6, 9] ." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-72", "text": "In this example the first few phrases are translated monotonically, then for span [6, 18] we have to apply rule muqian X 1 wending, is now in stable X 1 to obtain the correct translation." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-73", "text": "But this rule cannot be matched to span [6, 18] and the decoder fails to generate the correct translation." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-74", "text": "While CKYHiero can apply this rule to span [6, 9] , generate correct translation for this span and monotonically combine it with translation of other spans ([0, 6] , [9, 18] )." 
}, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-75", "text": "----------------------------------" }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-76", "text": "**CAPTURING MISSING ALIGNMENTS**" }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-77", "text": "In both these cases, CKY-Hiero has no difficulty in reaching the target sentence with the same GNF rules." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-78", "text": "The fact that we have to process spans as they appear in the stack in LR-Hiero means that we cannot combine arbitrary adjacent spans to deal with such cases." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-79", "text": "So purely bottom-up decoders such as CKY-Hiero can capture the alignments in Figure 3 but LR-Hiero cannot." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-80", "text": "We extend the LR-Hiero decoder to handle such cases by making the GNF grammar more expressive." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-81", "text": "Rules are partitioned to three types based on the right boundary in the source and target side." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-82", "text": "The rhs after the \u21d2 shows the new rules we create within the decoder using a new non-terminal X r to match the right boundary." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-83", "text": "where \u03b3 is a string of terminals and non-terminals, a andb are terminal sequences of source and target respectively, \u03b2 is a possibly empty sequence of non-terminals and X n and X m are different non-terminals distinct from X r 3 ." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-84", "text": "The extra nonterminal X r lets us add a new yet-to-be-covered span to the bottom of the stack at each rule application which lets us match any two adjacent spans just as in CKY-Hiero." 
}, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-85", "text": "This captures the missing alignments that could not be previously captured in the LR-Hiero decoder 4 ." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-86", "text": "In Table 4 we translated devset sentences using forced decoding to show that our modifications to LR-Hiero in this section improves the alignment coverage when compared to CKY-Hiero." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-87", "text": "We use a 5-gram LM trained on the Gigaword corpus and use KenLM (Heafield, 2011) ." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-88", "text": "We tune weights by minimizing BLEU loss on the dev set through MERT (Och, 2003) and report BLEU scores on the test set." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-89", "text": "Pop limit for Hiero and LRHiero+CP is 500 and beam size LR-Hiero is 500." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-90", "text": "Other extraction and decoder settings such as maximum phrase length, etc. were identical across settings." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-91", "text": "To make the results comparable we use the same feature set for all baselines, Hiero as well (including new features proposed by (Siahbani et al., 2013) )." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-92", "text": "We use 3 baselines: (i) our implementation of (Watanabe et al., 2006) : LR-Hiero with beam search (LR-Hiero) and (ii) LR-Hiero with cube pruning (Siahbani et al., 2013) : (LR-Hiero+CP); and (iii) Kriya, an open-source implementation of Hiero in Python, which performs comparably to other open-source Hiero systems (Sankaran et al., 2012) ." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-93", "text": "Table 3 shows model sizes for LR-Hiero (GNF) and Hiero (SCFG)." 
}, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-94", "text": "Typical Hiero rule extraction excludes phrase-pairs with unaligned words on boundaries (loose phrases)." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-95", "text": "We use similar rule extraction as Hiero, except that exclude non-GNF rules and include loose phrase-pairs as terminal rules." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-96", "text": "Table 2a shows the translation quality of different systems in terms of BLEU score." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-97", "text": "Row 3 is from (Siahbani et al., 2013) 5 . As we discussed in Section 2, LR-Hiero+CP suffers from severe search errors on Zh-En (1.5 BLEU) but using queue diversity (QD=15) we fill this gap." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-98", "text": "We use the same QD(=15) in next rows for Zh-en." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-99", "text": "For Cs-En and De-En we use regular cube pruning (QD=1), as it works as well as beam search (compare rows 4 and 2)." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-100", "text": "We measure the benefit of the new modified rules from Section 3: (ab): adding modifications for rules type (a) and (b); (abc): modification of all rules." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-101", "text": "We can see that for all language pairs (ab) constantly improves performance of LRHiero, significantly better than LR-Hiero+CP and LR-Hiero (p-value<0.05) on Cs-En and Zh-En, evaluated by MultEval (Clark et al., 2011) ." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-102", "text": "But modifying rule type (c) does not show any improvement due to spurious ambiguity created by 5 We report results on Cs-En and De-En in (Siahbani et al., 2013) ." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-103", "text": "Row 4 is the same translation system as row 3 (LR-Hiero+CP)." 
}, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-104", "text": "We achieve better results than our previous work (Siahbani et al., 2013) type (c) rules." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-105", "text": "Figure 2b shows the results in terms of average number of language model queries on a sample set of 50 sentences from test sets." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-106", "text": "All of the baselines use the same wrapper to KenLM (Heafield, 2011) to query the language model, and we have instrumented the wrapper to count the statistics." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-107", "text": "In (Siahbani et al., 2013) we discuss that LR-Hiero with beam search (Watanabe et al., 2006) does not perform at the same level of state-of-the-art Hiero (more LM calls and less translation quality)." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-108", "text": "As we can see in this figure, adding new modified rules slightly increases the number of language model queries on Cs-En and De-En so that LR-Hiero+CP still works 2 to 3 times faster than Hiero." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-109", "text": "On Zh-En, LR-Hiero+CP applies queue diversity (QD=15) which reduces search errors and improves translation quality but increases the number of hypothesis generation as well." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-110", "text": "LRHiero+CP with our modifications works substantially faster than LR-Hiero while obtain significantly better translation quality on Zh-En." }, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-111", "text": "Comparing Table 2a with Figure 2b we can see that overall our modifications to LR-Hiero decoder significantly improves the BLEU scores compared to previous LR decoders for Hiero." 
}, { "sent_id": "1542325bbf9bed87c22d34d12ee40e-C001-112", "text": "We obtain comparable results to CKY-Hiero for Cs-En and De-En and remarkably improve results on Zh-En, while at the same time making 2 to 3 times less LM calls on Cs-En and De-En compared to CKYHiero." } ], "y": { "@BACK@": { "gold_contexts": [ [ "1542325bbf9bed87c22d34d12ee40e-C001-10" ], [ "1542325bbf9bed87c22d34d12ee40e-C001-14" ], [ "1542325bbf9bed87c22d34d12ee40e-C001-16" ], [ "1542325bbf9bed87c22d34d12ee40e-C001-31" ], [ "1542325bbf9bed87c22d34d12ee40e-C001-89" ], [ "1542325bbf9bed87c22d34d12ee40e-C001-107" ] ], "cite_sentences": [ "1542325bbf9bed87c22d34d12ee40e-C001-10", "1542325bbf9bed87c22d34d12ee40e-C001-14", "1542325bbf9bed87c22d34d12ee40e-C001-16", "1542325bbf9bed87c22d34d12ee40e-C001-31", "1542325bbf9bed87c22d34d12ee40e-C001-89", "1542325bbf9bed87c22d34d12ee40e-C001-107" ] }, "@USE@": { "gold_contexts": [ [ "1542325bbf9bed87c22d34d12ee40e-C001-10" ], [ "1542325bbf9bed87c22d34d12ee40e-C001-13", "1542325bbf9bed87c22d34d12ee40e-C001-14" ], [ "1542325bbf9bed87c22d34d12ee40e-C001-19" ], [ "1542325bbf9bed87c22d34d12ee40e-C001-17" ], [ "1542325bbf9bed87c22d34d12ee40e-C001-66" ], [ "1542325bbf9bed87c22d34d12ee40e-C001-89" ], [ "1542325bbf9bed87c22d34d12ee40e-C001-91" ], [ "1542325bbf9bed87c22d34d12ee40e-C001-92" ], [ "1542325bbf9bed87c22d34d12ee40e-C001-102", "1542325bbf9bed87c22d34d12ee40e-C001-97" ], [ "1542325bbf9bed87c22d34d12ee40e-C001-107", "1542325bbf9bed87c22d34d12ee40e-C001-108" ] ], "cite_sentences": [ "1542325bbf9bed87c22d34d12ee40e-C001-10", "1542325bbf9bed87c22d34d12ee40e-C001-14", "1542325bbf9bed87c22d34d12ee40e-C001-19", "1542325bbf9bed87c22d34d12ee40e-C001-17", "1542325bbf9bed87c22d34d12ee40e-C001-66", "1542325bbf9bed87c22d34d12ee40e-C001-89", "1542325bbf9bed87c22d34d12ee40e-C001-91", "1542325bbf9bed87c22d34d12ee40e-C001-92", "1542325bbf9bed87c22d34d12ee40e-C001-102", "1542325bbf9bed87c22d34d12ee40e-C001-97", "1542325bbf9bed87c22d34d12ee40e-C001-107", 
"1542325bbf9bed87c22d34d12ee40e-C001-108" ] }, "@EXT@": { "gold_contexts": [ [ "1542325bbf9bed87c22d34d12ee40e-C001-19" ], [ "1542325bbf9bed87c22d34d12ee40e-C001-31" ] ], "cite_sentences": [ "1542325bbf9bed87c22d34d12ee40e-C001-19", "1542325bbf9bed87c22d34d12ee40e-C001-31" ] }, "@SIM@": { "gold_contexts": [ [ "1542325bbf9bed87c22d34d12ee40e-C001-91" ], [ "1542325bbf9bed87c22d34d12ee40e-C001-102" ] ], "cite_sentences": [ "1542325bbf9bed87c22d34d12ee40e-C001-91", "1542325bbf9bed87c22d34d12ee40e-C001-102" ] }, "@DIF@": { "gold_contexts": [ [ "1542325bbf9bed87c22d34d12ee40e-C001-101", "1542325bbf9bed87c22d34d12ee40e-C001-102", "1542325bbf9bed87c22d34d12ee40e-C001-103", "1542325bbf9bed87c22d34d12ee40e-C001-104", "1542325bbf9bed87c22d34d12ee40e-C001-96", "1542325bbf9bed87c22d34d12ee40e-C001-97" ], [ "1542325bbf9bed87c22d34d12ee40e-C001-110" ], [ "1542325bbf9bed87c22d34d12ee40e-C001-109" ] ], "cite_sentences": [ "1542325bbf9bed87c22d34d12ee40e-C001-101", "1542325bbf9bed87c22d34d12ee40e-C001-102", "1542325bbf9bed87c22d34d12ee40e-C001-103", "1542325bbf9bed87c22d34d12ee40e-C001-104", "1542325bbf9bed87c22d34d12ee40e-C001-97", "1542325bbf9bed87c22d34d12ee40e-C001-110", "1542325bbf9bed87c22d34d12ee40e-C001-109" ] }, "@MOT@": { "gold_contexts": [ [ "1542325bbf9bed87c22d34d12ee40e-C001-97" ] ], "cite_sentences": [ "1542325bbf9bed87c22d34d12ee40e-C001-97" ] } } }, "ABC_52c52f6ce3663de49d5784630af1e7_8": { "x": [ { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-65", "text": "Figure 2: Arc-factored feature templates for graph-based parsing." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-90", "text": "First, all but one template include the arc direction." 
}, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-2", "text": "We study multi-source transfer parsing for resource-poor target languages; specifically methods for target language adaptation of delexicalized discriminative graph-based dependency parsers." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-3", "text": "We first show how recent insights on selective parameter sharing, based on typological and language-family features, can be applied to a discriminative parser by carefully decomposing its model features." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-4", "text": "We then show how the parser can be relexicalized and adapted using unlabeled target language data and a learning method that can incorporate diverse knowledge sources through ambiguous labelings." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-5", "text": "In the latter scenario, we exploit two sources of knowledge: arc marginals derived from the base parser in a self-training algorithm, and arc predictions from multiple transfer parsers in an ensemble-training algorithm." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-6", "text": "Our final model outperforms the state of the art in multi-source transfer parsing on 15 out of 16 evaluated languages." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-7", "text": "----------------------------------" }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-9", "text": "Many languages still lack access to core NLP tools, such as part-of-speech taggers and syntactic parsers." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-10", "text": "This is largely due to the reliance on fully supervised learning methods, which require large quantities of manually annotated training data." 
}, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-11", "text": "Recently, methods for cross-lingual transfer have appeared as a promising avenue for overcoming this hurdle for both part-of-speech tagging (Yarowsky et al., 2001; Das and Petrov, 2011) and syntactic dependency parsing (Hwa et al., 2005; Zeman and Resnik, 2008; Ganchev et al., 2009; McDonald et al., 2011; , * Work primarily carried out while at Google, NY. 2012)." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-12", "text": "While these methods do not yet compete with fully supervised approaches, they can drastically outperform both unsupervised methods (Klein and Manning, 2004 ) and weakly supervised methods (Naseem et al., 2010; Berg-Kirkpatrick and Klein, 2010) ." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-13", "text": "A promising approach to cross-lingual transfer of syntactic dependency parsers is to use multiple source languages and to tie model parameters across related languages." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-63", "text": "Notably, we use the mapping proposed by Naseem et al. (2010) to map from fine-grained treebank specific part-of-speech tags to coarse-grained \"universal\" tags, rather than the more recent mapping proposed by Petrov et al. (2012) ." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-55", "text": "Let x denote an input sentence and let y \u2208 Y(x) denote a dependency tree, where Y(x) is the set of well-formed dependency trees spanning x. Henceforth, we restrict Y(x) to projective dependency trees, but all our methods are equally applicable in the nonprojective case." 
}, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-56", "text": "Provided a vector of model parameters \u03b8, the probability of a dependency tree y \u2208 Y(x), conditioned on a sentence x, has the following form:" }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-57", "text": "Without loss of generality, we restrict ourselves to first-order models, where the feature function \u03a6(x, y) factors over individual arcs (h, m) in y, such that" }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-58", "text": "We use the standard gradient-based L-BFGS algorithm (Liu and Nocedal, 1989) to maximize the loglikelihood." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-59", "text": "Eisner's algorithm (Eisner, 1996) is used for inference of the Viterbi parse and arc-marginals." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-60", "text": "----------------------------------" }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-61", "text": "**DATA SETS AND EXPERIMENTAL SETUP**" }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-62", "text": "To facilitate comparison with the state of the art, we use the same treebanks and experimental setup as Naseem et al. (2012) ." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-64", "text": "For" }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-14", "text": "This idea was first explored for weakly supervised learning (Cohen and Smith, 2009; Snyder et al., 2009; Berg-Kirkpatrick and Klein, 2010) and recently by Naseem et al. (2012) for multisource cross-lingual transfer." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-15", "text": "In particular, Naseem et al. showed that by selectively sharing parameters based on typological features of each language, substantial improvements can be achieved, compared to using a single set of parameters for all languages." 
}, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-16", "text": "However, these methods all employ generative models with strong independence assumptions and weak feature representations, which upper bounds their accuracy far below that of feature-rich discriminative parsers (McDonald et al., 2005; Nivre, 2008) ." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-17", "text": "In this paper, we improve upon the state of the art in cross-lingual transfer of dependency parsers from multiple source languages by adapting feature-rich discriminatively trained parsers to a specific target language." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-18", "text": "First, in \u00a74 we show how selective sharing of model parameters based on typological traits can be incorporated into a delexicalized discriminative graph-based parsing model." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-19", "text": "This requires a careful decomposition of features into language-generic and language-specific sets in order to tie specific target language parameters to their relevant source language counterparts." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-20", "text": "The resulting parser outperforms the method of Naseem et al. (2012) on 12 out of 16 evaluated languages." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-21", "text": "Second, in \u00a75 we introduce a train-ing method that can incorporate diverse knowledge sources through ambiguously predicted labelings of unlabeled target language data." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-22", "text": "This permits effective relexicalization and target language adaptation of the transfer parser." 
}, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-23", "text": "Here, we experiment with two different knowledge sources: arc sets, which are filtered by marginal probabilities from the cross-lingual transfer parser, are used in an ambiguity-aware self-training algorithm ( \u00a75.2); these arc sets are then combined with the predictions of a different transfer parser in an ambiguity-aware ensemble-training algorithm ( \u00a75.3)." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-24", "text": "The resulting parser provides significant improvements over a strong baseline parser and achieves a 13% relative error reduction on average with respect to the best model of Naseem et al. (2012) , outperforming it on 15 out of the 16 evaluated languages." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-25", "text": "----------------------------------" }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-26", "text": "**MULTI-SOURCE DELEXICALIZED TRANSFER**" }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-27", "text": "The methods proposed in this paper fall into the delexicalized transfer approach to multilingual syntactic parsing (Zeman and Resnik, 2008; McDonald et al., 2011; Cohen et al., 2011; S\u00f8gaard, 2011) ." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-28", "text": "In contrast to annotation projection approaches (Yarowsky et al., 2001; Hwa et al., 2005; Ganchev et al., 2009; Spreyer and Kuhn, 2009) , delexicalized transfer methods do not rely on any bitext." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-29", "text": "Instead, a parser is trained on annotations in a source language, relying solely on features that are available in both the source and the target language, such as \"universal\" part-ofspeech tags (Zeman and Resnik, 2008; Naseem et al., 2010; Petrov et al., 2012) , cross-lingual word clusters or type-level features derived from bilingual dictionaries (Durrett et al., 2012) ." 
}, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-30", "text": "1 This parser is then directly used to parse the target language." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-31", "text": "For languages with similar typology, this method can be quite accurate, especially when compared to purely unsupervised methods." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-32", "text": "For instance, a parser trained on English with only part-of-speech features can correctly parse the Greek sentence in Figure 1 , even without knowledge of the lexical items since the sequence of part-of-speech tags determines the syntactic structure almost unambiguously." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-33", "text": "Learning with multiple languages has been shown to benefit unsupervised learning (Cohen and Smith, 1 Note that and Durrett et al. (2012) do require bitext or a bilingual dictionary." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-34", "text": "The same holds for most cross-lingual representations, e.g., Klementiev et al. (2012) ." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-35", "text": "\u039f \u03a4\u03b6\u03cc\u03bd \u03ad\u03b4\u03c9\u03c3\u03b5 \u03c3\u03c4\u03b7\u03bd \u039c\u03b1\u03c1\u03af\u03b1 \u03c4\u03bf \u03b2\u03b9\u03b2\u03bb\u03af\u03bf ." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-36", "text": "(The) (John) (gave) (to-the) (Maria) (the) (book) . 2009; Snyder et al., 2009; Berg-Kirkpatrick and Klein, 2010) ." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-37", "text": "Annotations in multiple languages can be combined in delexicalized transfer as well, as long as the parser features are available across the involved languages." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-38", "text": "This idea was explored by McDonald et al. (2011) , who showed that target language accuracy can be improved by simply concatenating delexicalized treebanks in multiple languages." 
}, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-39", "text": "In similar work, Cohen et al. (2011) proposed a mixture model in which the parameters of a generative target language parser is expressed as a linear interpolation of source language parameters, whereas S\u00f8gaard (2011) showed that target side language models can be used to selectively subsample training sentences to improve accuracy." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-40", "text": "Recently, inspired by the phylogenetic prior of Berg-Kirkpatrick and Klein (2010) , S\u00f8gaard and Wulff (2012) proposed -among other ideas -a typologically informed weighting heuristic for linearly interpolating source language parameters." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-41", "text": "However, this weighting did not provide significant improvements over uniform weighting." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-42", "text": "The aforementioned approaches work well for transfer between similar languages." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-43", "text": "However, their assumptions cease to hold for typologically divergent languages; a target language can rarely be described as a linear combination of data or model parameters from a set of source languages, as languages tend to share varied typological traits; this critical insight is discussed further in \u00a74." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-44", "text": "To account for this issue, Naseem et al. (2012) recently introduced a novel generative model of dependency parsing, in which the generative process is factored into separate steps for the selection of dependents and their ordering." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-45", "text": "The parameters used in the selection step are all language independent, capturing only head-dependent attachment preferences." 
}, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-46", "text": "In the ordering step, however, parameters are selectively shared between subsets of Naseem et al. (2012) restricts its potential performance." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-47", "text": "----------------------------------" }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-48", "text": "**BASIC MODELS AND EXPERIMENTAL SETUP**" }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-49", "text": "Inspired by the superiority of discriminative graphbased parsing in the supervised scenario, we investigate whether the insights of Naseem et al. (2012) on selective parameter sharing can be incorporated into such models in the transfer scenario." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-50", "text": "We first review the basic graph-based parser framework and the experimental setup that we will use throughout." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-51", "text": "We then delve into details on how to incorporate selective sharing in this model in \u00a74." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-52", "text": "In \u00a75, we show how learning with ambiguous labelings in this parser can be used for further target language adaptation, both through self-training and through ensemble-training." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-53", "text": "----------------------------------" }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-54", "text": "**DISCRIMINATIVE GRAPH-BASED PARSER**" }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-66", "text": "Direction: d \u2208 {LEFT, RIGHT}; dependency length: l \u2208 {1, 2, 3, 4, 5+}; part of speech of head / dependent / words between head and dependent: h.p / m.p / between.p \u2208 {NOUN, VERB, ADJ, ADV, PRON, DET, ADP, NUM, CONJ, PRT, PUNC, X}; token to the left / right of z: z \u22121 / z +1 ; WALS features: w." 
}, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-67", "text": "X for X = 81A, 85A, 86A, 87A (see Table 1 )." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-68", "text": "[\u00b7] denotes an optional template, e.g.," }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-69", "text": "p, so that the template also falls back on its undirectional variant." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-70", "text": "each target language evaluated, the treebanks of the remaining languages are used as labeled training data, while the target language treebank is used for testing only (in \u00a75 a different portion of the target language treebank is additionally used as unlabeled training data)." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-71", "text": "We refer the reader to Naseem et al. (2012) for detailed information on the different treebanks." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-72", "text": "Due to divergent treebank annotation guidelines, which makes fine-grained evaluation difficult, all results are evaluated in terms of unlabeled attachment score (UAS)." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-73", "text": "In line with Naseem et al. (2012), we use gold part-of-speech tags and evaluate only on sentences of length 50 or less excluding punctuation." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-74", "text": "----------------------------------" }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-75", "text": "**BASELINE MODELS**" }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-76", "text": "We compare our models to two multi-source baseline models." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-77", "text": "The first baseline, NBG, is the generative model with selective parameter sharing from Naseem et al. (2012) ." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-78", "text": "3 This model is trained without target language data, but we investigate the use of such data in \u00a75.4." 
}, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-79", "text": "The second baseline, Delex, is a delexicalized projective version of the well-known graph-based MSTParser (McDonald et al., 2005) ." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-80", "text": "The feature templates used by this model are shown to the left in Figure 2 ." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-81", "text": "Note that there is no selective sharing in this model." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-82", "text": "The second and third columns of Table 2 show the unlabeled attachment scores of the baseline models for each target language." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-83", "text": "We see that Delex performs well on target languages that are related to the majority of the source languages." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-84", "text": "However, for languages 3 Model \"D-,To\" in Table 2 from Naseem et al. (2012) ." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-85", "text": "that diverge from the Indo-European majority family, the selective sharing model, NBG, achieves substantially higher accuracies." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-86", "text": "----------------------------------" }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-87", "text": "**FEATURE-BASED SELECTIVE SHARING**" }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-88", "text": "The results for the baseline models are not surprising considering the feature templates used by Delex." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-89", "text": "There are two fundamental issues with these features when used for direct transfer." 
}, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-91", "text": "Second, some features are sensitive to local word order; e.g.," }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-92", "text": "p, which models direction as well as word order in the local contexts of the head and the dependent." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-93", "text": "Such features do not transfer well across typologically different languages." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-94", "text": "In order to verify that these issues are the cause of the poor performance of the Delex model, we remove all directional features and all features that model local word order from Delex." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-95", "text": "The feature templates of the resulting Bare model are shown in the center of Figure 2 ." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-96", "text": "These features only model selectional preferences and dependency length, analogously to the selection component of NBG." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-97", "text": "The performance of Bare is shown in the fourth column of Table 2 ." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-98", "text": "The removal of most of the features results in a performance drop on average." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-99", "text": "However, for languages outside of the Indo-European family, Bare is often more accurate, especially for Basque, Hungarian and Japanese, which supports our hypothesis." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-100", "text": "----------------------------------" }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-101", "text": "**SHARING BASED ON TYPOLOGICAL FEATURES**" }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-102", "text": "After removing all directional features, we now carefully reintroduce them." 
}, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-103", "text": "Inspired by Naseem et al. Table 2 from Naseem et al. (2012) ." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-104", "text": "(2012), we make use of the typological features from WALS (Dryer and Haspelmath, 2011), listed in Table 1, to selectively share directional parameters between languages." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-105", "text": "As a natural first attempt at sharing parameters, one might consider forming the crossproduct of all features of Delex with all WALS properties, similarly to a common domain adaptation technique (Daum\u00e9 III, 2007; Finkel and Manning, 2009 )." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-106", "text": "However, this approach has two issues." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-107", "text": "First, it results in a huge number of features, making the model prone to overfitting." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-108", "text": "Second, and more critically, it ties together languages via features for which they are not typologically similar." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-109", "text": "Consider English and French, which are both prepositional and thus have the same value for WALS property 85A." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-110", "text": "These languages will end up sharing a parameter for the feature" }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-111", "text": "85A; yet they have the exact opposite direction of attachment preference when it comes to nouns and adjectives." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-112", "text": "This problem applies to any method for parameter mixing that treats all the parameters as equal." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-113", "text": "Like Naseem et al. (2012) , we instead share parameters more selectively." 
}, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-114", "text": "Our strategy is to use the relevant part-of-speech tags of the head and dependent to select which parameters to share, based on very basic linguistic knowledge." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-115", "text": "The resulting features are shown to the right in Figure 2 ." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-116", "text": "For example, there is a shared directional feature that models the order of Subject, Object and Verb by conjoining WALS feature 81A with the arc direction and an indicator feature that fires only if the head is a verb and the dependent is a noun." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-117", "text": "These features would not be very useful by themselves, so we combine them with the Bare features." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-118", "text": "The accuracy of the resulting Share model is shown in column five of Table 2 ." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-119", "text": "Although this model still performs worse than NBG, it is an improvement over the Delex baseline and actually outperforms the former on 5 out of the 16 languages." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-120", "text": "----------------------------------" }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-121", "text": "**SHARING BASED ON LANGUAGE GROUPS**" }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-122", "text": "While Share models selectional preferences and arc directions for a subset of dependency relations, it does not capture the rich local word order information captured by Delex." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-123", "text": "We now consider two ways of selectively including such information based on language similarity." 
}, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-124", "text": "While more complex sharing could be explored (Berg-Kirkpatrick and Klein, 2010) , we use a flat structure and consider two simple groupings of the source and target languages." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-125", "text": "First, the Similar model consists of the features used by Share together with the features from Delex in Figure 2 ." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-126", "text": "The latter are conjoined with an indicator feature that fires only when the source and target languages share values for all the WALS features in Table 1 ." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-127", "text": "This is accomplished by adding the template" }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-128", "text": "for each template f in Delex." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-129", "text": "This groups: 1) Catalan, Italian, Portuguese and Spanish; 2) Bulgarian, Czech and English; 3) Dutch, German and Greek; and 4) Japanese and Turkish." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-130", "text": "The remaining languages do not share all WALS properties with at least one source language and thus revert to Share, since they cannot exploit these grouped features." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-131", "text": "Second, instead of grouping languages according to WALS, the Family model is based on a simple subdivision into Indo-European languages (Bulgarian, Catalan, Czech, Greek, English, Spanish, Italian, Dutch, Portuguese, Swedish) and Altaic languages (Japanese, Turkish)." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-132", "text": "This is accomplished with indicator features analogous to those used in Similar." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-133", "text": "The remaining languages are again treated as isolates and revert to Similar." 
}, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-134", "text": "The results for these models are given in the last two columns of Table 2 ." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-135", "text": "We see that by adding these rich features back into the fold, but having them fire only for languages in the same group, we can significantly increase the performance -from 57.4% to 62.0% on average when considering Family. If we consider our original Delex baseline, we see an absolute improvement of 6.9% on average and a relative error reduction of 15%." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-136", "text": "Particular gains are seen for non-Indo-European languages; e.g., Japanese increases from 38.9% to 65.9%." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-137", "text": "Furthermore, Family achieves a 7% relative error reduction over the NBG baseline and outperforms it on 12 of the 16 languages." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-138", "text": "This shows that a discriminative graph-based parser can achieve higher accuracies compared to generative models when the features are carefully constructed." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-139", "text": "----------------------------------" }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-140", "text": "**TARGET LANGUAGE ADAPTATION**" }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-141", "text": "While some higher-level linguistic properties of the target language have been incorporated through selective sharing, so far no features specific to the target language have been employed." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-142", "text": "Cohen et al. (2011) and Naseem et al. (2012) have shown that using expectation-maximization (EM) to this end can in some cases bring substantial accuracy gains." 
}, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-143", "text": "For discriminative models, self-training has been shown to be quite effective for adapting monolingual parsers to new domains (McClosky et al., 2006) , as well as for relexicalizing delexicalized parsers using unlabeled target language data (Zeman and Resnik, 2008) ." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-144", "text": "Similarly T\u00e4ckstr\u00f6m (2012) used self-training to adapt a multi-source direct transfer named-entity recognizer to different target languages, \"relexicalizing\" the model with word cluster features." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-145", "text": "However, as discussed in \u00a75.2, standard self-training is not optimal for target language adaptation." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-146", "text": "----------------------------------" }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-147", "text": "**AMBIGUITY-AWARE TRAINING**" }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-148", "text": "In this section, we propose a related training method: ambiguity-aware training." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-149", "text": "In this setting a discriminative probabilistic model is induced from automatically inferred ambiguous labelings over unlabeled target language data, in place of gold-standard dependency trees." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-150", "text": "The ambiguous labelings can combine multiple sources of evidence to guide the estimation or simply encode the underlying uncertainty from the base parser." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-151", "text": "This uncertainty is marginalized out during training." 
}, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-152", "text": "The structure of the output space, e.g., projectivity and single-headedness constraints, along with regularities in the feature space, can together guide the estimation, similar to what occurs with the expectation-maximization algorithm." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-153", "text": "Core to this method is the idea of an ambiguous labeling\u1ef9(x) \u2286 Y(x), which encodes a set of possible dependency trees for an input sentence x. In subsequent sections we describe how to define such labelings." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-154", "text": "Critically,\u1ef9(x) should be large enough to capture the correct labeling, but on the other hand small enough to provide concrete guidance for model estimation." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-155", "text": "Ideally,\u1ef9(x) will capture heterogenous knowledge that can aid the parser in target language adaptation." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-156", "text": "In a first-order arc-factored model, we define\u1ef9(x) in terms of a collection of ambiguous arc sets A(x) = {A(x, m)} |x| m=1 , where A(x, m) denotes the set of ambiguously specified heads for the mth token in x. Then,\u1ef9(x) is defined as the set of all projective dependency trees spanning x that can be assembled from the arcs in A(x)." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-157", "text": "Methods for learning with ambiguous labelings have previously been proposed in the context of multi-class classification (Jin and Ghahramani, 2002) , sequence-labeling (Dredze et al., 2009 ), log-linear LFG parsing (Riezler et al., 2002) , as well as for discriminative reranking of generative constituency parsers (Charniak and Johnson, 2005) ." 
}, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-158", "text": "In contrast to Dredze et al., who allow for weights to be assigned to partial labels, we assume that the ambiguous arcs are weighted uniformly." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-159", "text": "For target language adaptation, these weights would typically be derived from unreliable sources and we do not want to train the model to simply mimic their beliefs." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-160", "text": "Furthermore, with this assumption, learning is simply achieved by maximizing the marginal log-likelihood of the ambiguous training set" }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-161", "text": "In maximizing the marginal log-likelihood, the model is free to distribute probability mass among the trees in the ambiguous labeling to its liking, as long as the marginal log-likelihood improves." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-162", "text": "The same objective function is used by Riezler et al. (2002) and Charniak and Johnson (2005) ." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-163", "text": "A key difference is that in these works, the ambiguity is constrained through a supervised signal, while we use ambiguity as a way to achieve self-training, using the base-parser itself, or some other potentially noisy knowledge source as the sole constraints." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-164", "text": "Note that we have introduced an 2 -regularizer, weighted by \u03bb." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-165", "text": "This is important as we are now training lexicalized target language models which can easily overfit." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-166", "text": "In all experiments, we optimize parameters with L-BFGS." 
}, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-167", "text": "Note also that the marginal likelihood is non-concave, so that we are only guaranteed to find a local maximum." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-168", "text": "----------------------------------" }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-169", "text": "**AMBIGUITY-AWARE SELF-TRAINING**" }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-170", "text": "In standard self-training -hereafter referred to as Viterbi self-training -a base parser is used to label each unlabeled sentence with its most probable parse tree to create a self-labeled data set, which is subsequently used to train a supervised parser." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-171", "text": "There are two reasons why this simple approach may work." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-172", "text": "First, if the base parser's errors are not too systematic and if the self-training model is not too expressive, self-training can reduce the variance on the new domain." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-173", "text": "Second, self-training allows for features in the new domain with low support -or no support in the case of lexicalized features -in the base parser to be \"filled in\" by exploiting correlations in the feature representation." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-174", "text": "However, a potential pitfall of this approach is that the self-trained parser is encouraged to blindly mimic the base parser, which leads to error reinforcement." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-175", "text": "This may be particularly problematic when relexicalizing a transfer parser, since the lexical features provide the parser with increased power and thereby an increased risk of overfitting to the noise." 
}, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-176", "text": "To overcome this potential problem, we propose an ambiguity-aware self-training (AAST) method that is able to take the noise of the base parser into account." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-177", "text": "We use the arc-marginals of the base parser to construct the ambiguous labeling\u1ef9(x) for a sentence x. For each token m \u2208 [1, |x|], we first sort the set of arcs in which m is the dependent, {(h, m)} |x| h=0 , by the marginal probabilities of the arcs:" }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-178", "text": "We next construct the ambiguous arc set A(x, m) by adding arcs (h, m) in order of decreasing probability, until their cumulative probability exceeds \u03c3, i.e. until" }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-179", "text": "Lower values of \u03c3 result in more aggressive pruning, with \u03c3 = 0 corresponding to including no arc and \u03c3 = 1 corresponding to including all arcs." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-180", "text": "We always add the highest scoring tree\u0177 to\u1ef9(x) to ensure that it contains at least one complete projective tree." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-181", "text": "Figure 3 outlines an example of how (and why) AAST works." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-182", "text": "In the Greek example, the genitive phrase \u0397 \u03c0\u03b1\u03c1\u03b1\u03bc\u03bf\u03bd\u03ae \u03c3\u03ba\u03b1\u03c6\u03ce\u03bd (the stay of vessels) is incorrectly analyzed as a flat noun phrase." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-183", "text": "This is not surprising given that the base parser simply observes this phrase as DET NOUN NOUN." 
}, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-184", "text": "However, looking at the arc marginals we can see that the correct analysis is available during AAST, although the actual marginal probabilities are quite misleading." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-185", "text": "Furthermore, the genitive noun \u03c3\u03ba\u03b1\u03c6\u03ce\u03bd also appears in other less ambiguous contexts, where the base parser correctly predicts it to modify a noun and not a verb." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-186", "text": "This allows the training process to add weight to the corresponding lexical feature pairing \u03c3\u03ba\u03b1\u03c6\u03ce\u03bd with a noun head and away from the feature pairing it with a verb." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-187", "text": "The resulting parser correctly predicts the genitive construction." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-188", "text": "----------------------------------" }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-189", "text": "**AMBIGUITY-AWARE ENSEMBLE-TRAINING**" }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-190", "text": "While ambiguous labelings can be used as a means to improve self-training, any information that can be expressed as hard arc-factored constraints can be incorporated, including linguistic expert knowledge and annotation projected via bitext." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-191", "text": "Here we explore another natural source of information: the predictions of other transfer parsers." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-192", "text": "It is well known that combining several diverse predictions in an ensemble often leads to improved predictions." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-193", "text": "However, in most ensemble methods there is typically no learning involved once the base learners have been trained (Sagae and Lavie, 2006 )." 
}, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-194", "text": "An exception is the method of Sagae and Tsujii (2007) , who combine the outputs of many parsers on unlabeled data to train a parser for a new domain." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-195", "text": "However, in that work the learner is not exposed to the underlying ambiguity of the base parsers; it is only given the Viterbi parse of the combination system as the gold standard." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-196", "text": "In contrast, Middle: The ambiguous labeling\u1ef9(x), which is used as supervision in AAST." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-197", "text": "Additional non-Viterbi arcs are present in\u1ef9(x); for clarity, these are not shown." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-198", "text": "When learning with AAST, probability mass will be pushed towards any tree consistent with\u1ef9(x)." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-199", "text": "Marginal probabilities are ignored at this stage, so that all arcs in\u1ef9(x) are treated as equals." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-200", "text": "Bottom: The Viterbi parse of the AAST model, which has selected the correct arcs from\u1ef9(x)." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-201", "text": "we propose an ambiguity-aware ensemble-training (AAET) method that treats the union of the ensemble predictions for a sentence x as an ambiguous labeling y(x)." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-202", "text": "An additional advantage of this approach is that the ensemble is compiled into a single model and therefore does not require multiple models to be stored and used at runtime." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-203", "text": "It is straightforward to construct\u1ef9(x) from multiple parsers." 
}, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-204", "text": "Let A k (x, m) be the set of arcs for the mth token in x according to the kth parser in the ensemble." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-205", "text": "When arc-marginals are used to construct the ambiguity set, |A k (x, m)| \u2265 1, but when the Viterbiparse is used, A k (x, m) is a singleton." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-206", "text": "We next form , m) as the ensemble arc ambiguity set from which\u1ef9(x) is assembled." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-207", "text": "In this study, we combine the arc sets of two base parsers: first, the arc-marginal ambiguity set of the base parser ( \u00a75.2); and second, the Viterbi arc set from the NBG parser of Naseem et al. (2012) in Table 2 ." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-208", "text": "4 Thus, the latter will have singleton arc ambiguity sets, but when combined with the arc-marginal ambiguity sets of our base parser, the result will encode uncertainty derived from both parsers." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-209", "text": "----------------------------------" }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-210", "text": "**ADAPTATION EXPERIMENTS**" }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-211", "text": "We now study the different approaches to target language adaptation empirically." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-212", "text": "As in Naseem et al. (2012) , we use the CoNLL training sets, stripped of all dependency information, as the unlabeled target language data in our experiments." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-213", "text": "We use the Family model as the base parser, which is used to label the unlabeled target data with the Viterbi parses as well as with the ambiguous labelings." 
}, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-214", "text": "The final model is then trained on this data using standard lexicalized features (McDonald et al., 2005) ." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-215", "text": "Since labeled training data is unavailable in the target language, we cannot tune any hyper-parameters and simply set \u03bb = 1 and \u03c3 = 0.95 throughout." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-216", "text": "Although the latter may suggest that\u1ef9(x) contains a high degree of ambiguity, in reality, the marginal distributions of the base model have low entropy and after filtering with \u03c3 = 0.95, the average number of potential heads per dependent ranges from 1.4 to 3.2, depending on the target language." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-217", "text": "The ambiguity-aware training methods, that is ambiguity-aware self-training (AAST) and ambiguityaware ensemble-training (AAET), are compared to three baseline systems." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-218", "text": "First, NBG+EM is the generative model of Naseem et al. (2012) trained with expectation-maximization on additional unlabeled target language text." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-219", "text": "Second, Family is the best discriminative model from the previous section." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-220", "text": "Third, Viterbi is the basic Viterbi self-training model." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-221", "text": "The results of each of these models are shown in Table 3 ." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-222", "text": "There are a number of things that can be observed." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-223", "text": "First, Viterbi self-training helps slightly on average, but the gains are not consistent and there are even drops in accuracy for some languages." 
}, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-224", "text": "Second, AAST outperforms the Viterbi variant on all languages and nearly always improves on the base parser, although it sees a slight drop for Italian." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-225", "text": "AAST improves the accuracy over the base model by 2% absolute on average and by as much as 5% absolute for Turkish." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-226", "text": "Comparing this model to the NBG+EM baseline, we observe an improvement by 3.6% absolute, outperforming it on 14 of the 16 languages." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-227", "text": "Furthermore, ambiguity-aware self-training appears to help more than expectation-maximization for generative (unlexicalized) models." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-228", "text": "Naseem et al. observed an increase from 59.3% to 60.4% on average by adding unlabeled target language data and the gains were not consistent across languages." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-229", "text": "AAST, on the other hand, achieves consistent gains, rising from 62.0% to 64.0% on average." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-230", "text": "Third, as shown in the rightmost column of Table 3 , ambiguity-aware ensemble-training is indeed a successful strategy; AAET outperforms the previous best self-trained model on 13 and NB&G+EM on 15 out of 16 languages." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-231", "text": "The relative error reduction with respect to the base Family model is 9% on average, while the average reduction with respect to NBG+EM is 13%." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-232", "text": "Before concluding, two additional points are worth making." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-233", "text": "First, further gains may potentially be achievable with feature-rich discriminative models." 
}, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-234", "text": "While the best generative transfer model of Naseem et al. (2012) approaches its upper-bounding supervised accuracy (60.4% vs. 67.1%), our relaxed selftraining model is still far below its supervised counterpart (64.0% vs. 84.1%)." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-235", "text": "One promising statistic along these lines is that the oracle accuracy for the ambiguous labelings of AAST is 75.7%, averaged across languages, which suggests that other training algorithms, priors or constraints could improve the accuracy substantially." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-236", "text": "Second, relexicalization is a key component of self-training." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-237", "text": "If we use delexicalized features during self-training, we only observe a small average improvement from 62.0% to 62.1%." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-238", "text": "----------------------------------" }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-239", "text": "**CONCLUSIONS**" }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-240", "text": "We contributed to the understanding of multi-source syntactic transfer in several complementary ways." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-241", "text": "First, we showed how selective parameter sharing, based on typological features and language family membership, can be incorporated in a discriminative graph-based model of dependency parsing." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-242", "text": "We then showed how ambiguous labelings can be used to integrate heterogenous knowledge sources in parser training." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-243", "text": "Two instantiations of this framework were explored." 
}, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-244", "text": "First, an ambiguity-aware self-training method that can be used to effectively relexicalize and adapt a delexicalized transfer parser using unlabeled target language data." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-245", "text": "Second, an ambiguityaware ensemble-training method, in which predictions from different parsers can be incorporated and further adapted." }, { "sent_id": "52c52f6ce3663de49d5784630af1e7-C001-246", "text": "On average, our best model provides a relative error reduction of 13% over the state-ofthe-art model of Naseem et al. (2012) , outperforming it on 15 out of 16 evaluated languages." } ], "y": { "@BACK@": { "gold_contexts": [ [ "52c52f6ce3663de49d5784630af1e7-C001-13", "52c52f6ce3663de49d5784630af1e7-C001-14" ], [ "52c52f6ce3663de49d5784630af1e7-C001-15" ], [ "52c52f6ce3663de49d5784630af1e7-C001-44" ], [ "52c52f6ce3663de49d5784630af1e7-C001-71" ], [ "52c52f6ce3663de49d5784630af1e7-C001-142" ], [ "52c52f6ce3663de49d5784630af1e7-C001-207" ], [ "52c52f6ce3663de49d5784630af1e7-C001-228" ] ], "cite_sentences": [ "52c52f6ce3663de49d5784630af1e7-C001-14", "52c52f6ce3663de49d5784630af1e7-C001-15", "52c52f6ce3663de49d5784630af1e7-C001-44", "52c52f6ce3663de49d5784630af1e7-C001-71", "52c52f6ce3663de49d5784630af1e7-C001-142", "52c52f6ce3663de49d5784630af1e7-C001-207", "52c52f6ce3663de49d5784630af1e7-C001-228" ] }, "@DIF@": { "gold_contexts": [ [ "52c52f6ce3663de49d5784630af1e7-C001-20" ], [ "52c52f6ce3663de49d5784630af1e7-C001-24" ], [ "52c52f6ce3663de49d5784630af1e7-C001-77", "52c52f6ce3663de49d5784630af1e7-C001-78" ], [ "52c52f6ce3663de49d5784630af1e7-C001-83", "52c52f6ce3663de49d5784630af1e7-C001-84", "52c52f6ce3663de49d5784630af1e7-C001-85" ], [ "52c52f6ce3663de49d5784630af1e7-C001-119" ], [ "52c52f6ce3663de49d5784630af1e7-C001-137" ], [ "52c52f6ce3663de49d5784630af1e7-C001-226" ], [ "52c52f6ce3663de49d5784630af1e7-C001-228", "52c52f6ce3663de49d5784630af1e7-C001-229" 
], [ "52c52f6ce3663de49d5784630af1e7-C001-231" ], [ "52c52f6ce3663de49d5784630af1e7-C001-234" ], [ "52c52f6ce3663de49d5784630af1e7-C001-246" ] ], "cite_sentences": [ "52c52f6ce3663de49d5784630af1e7-C001-20", "52c52f6ce3663de49d5784630af1e7-C001-24", "52c52f6ce3663de49d5784630af1e7-C001-77", "52c52f6ce3663de49d5784630af1e7-C001-78", "52c52f6ce3663de49d5784630af1e7-C001-84", "52c52f6ce3663de49d5784630af1e7-C001-85", "52c52f6ce3663de49d5784630af1e7-C001-119", "52c52f6ce3663de49d5784630af1e7-C001-137", "52c52f6ce3663de49d5784630af1e7-C001-226", "52c52f6ce3663de49d5784630af1e7-C001-228", "52c52f6ce3663de49d5784630af1e7-C001-231", "52c52f6ce3663de49d5784630af1e7-C001-234", "52c52f6ce3663de49d5784630af1e7-C001-246" ] }, "@MOT@": { "gold_contexts": [ [ "52c52f6ce3663de49d5784630af1e7-C001-46" ], [ "52c52f6ce3663de49d5784630af1e7-C001-141", "52c52f6ce3663de49d5784630af1e7-C001-142", "52c52f6ce3663de49d5784630af1e7-C001-145" ] ], "cite_sentences": [ "52c52f6ce3663de49d5784630af1e7-C001-46", "52c52f6ce3663de49d5784630af1e7-C001-142" ] }, "@USE@": { "gold_contexts": [ [ "52c52f6ce3663de49d5784630af1e7-C001-49" ], [ "52c52f6ce3663de49d5784630af1e7-C001-62" ], [ "52c52f6ce3663de49d5784630af1e7-C001-73" ], [ "52c52f6ce3663de49d5784630af1e7-C001-77" ], [ "52c52f6ce3663de49d5784630af1e7-C001-103", "52c52f6ce3663de49d5784630af1e7-C001-104" ], [ "52c52f6ce3663de49d5784630af1e7-C001-112", "52c52f6ce3663de49d5784630af1e7-C001-113" ], [ "52c52f6ce3663de49d5784630af1e7-C001-207" ], [ "52c52f6ce3663de49d5784630af1e7-C001-212" ], [ "52c52f6ce3663de49d5784630af1e7-C001-217", "52c52f6ce3663de49d5784630af1e7-C001-218" ] ], "cite_sentences": [ "52c52f6ce3663de49d5784630af1e7-C001-49", "52c52f6ce3663de49d5784630af1e7-C001-62", "52c52f6ce3663de49d5784630af1e7-C001-73", "52c52f6ce3663de49d5784630af1e7-C001-77", "52c52f6ce3663de49d5784630af1e7-C001-103", "52c52f6ce3663de49d5784630af1e7-C001-113", "52c52f6ce3663de49d5784630af1e7-C001-207", "52c52f6ce3663de49d5784630af1e7-C001-212", 
"52c52f6ce3663de49d5784630af1e7-C001-218" ] }, "@SIM@": { "gold_contexts": [ [ "52c52f6ce3663de49d5784630af1e7-C001-73" ], [ "52c52f6ce3663de49d5784630af1e7-C001-95", "52c52f6ce3663de49d5784630af1e7-C001-96" ], [ "52c52f6ce3663de49d5784630af1e7-C001-112", "52c52f6ce3663de49d5784630af1e7-C001-113" ], [ "52c52f6ce3663de49d5784630af1e7-C001-212" ] ], "cite_sentences": [ "52c52f6ce3663de49d5784630af1e7-C001-73", "52c52f6ce3663de49d5784630af1e7-C001-96", "52c52f6ce3663de49d5784630af1e7-C001-113", "52c52f6ce3663de49d5784630af1e7-C001-212" ] }, "@EXT@": { "gold_contexts": [ [ "52c52f6ce3663de49d5784630af1e7-C001-77", "52c52f6ce3663de49d5784630af1e7-C001-78" ], [ "52c52f6ce3663de49d5784630af1e7-C001-218" ] ], "cite_sentences": [ "52c52f6ce3663de49d5784630af1e7-C001-77", "52c52f6ce3663de49d5784630af1e7-C001-78", "52c52f6ce3663de49d5784630af1e7-C001-218" ] } } }, "ABC_87af486eb2e968d2055eeab094b3f9_8": { "x": [ { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-50", "text": "**VISUAL GROUNDING**" }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-98", "text": "----------------------------------" }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-99", "text": "**ATTENTION MECHANISM AT WORK**" }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-2", "text": "Sentence representation models trained only on language could potentially suffer from the grounding problem." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-3", "text": "Recent work has shown promising results in improving the qualities of sentence representations by jointly training them with associated image features." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-26", "text": "Joint Learning of Language and Vision." 
}, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-4", "text": "However, the grounding capability is limited due to distant connection between input sentences and image features by the design of the architecture." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-5", "text": "In order to further close the gap, we propose applying self-attention mechanism to the sentence encoder to deepen the grounding effect." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-6", "text": "Our results on transfer tasks show that self-attentive encoders are better for visual grounding, as they exploit specific words with strong visual associations." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-7", "text": "----------------------------------" }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-9", "text": "Recent NLP studies have thrived on distributional hypothesis." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-10", "text": "More recently, there have been efforts in applying the intuition to larger semantic units, such as sentences, or documents." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-11", "text": "However, approaches based on distributional semantics are limited by the grounding problem [7] , which calls for techniques to ground certain conceptual knowledge in perceptual information." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-12", "text": "Both NLP and vision communities have proposed various multi-modal learning methods to bridge the gap between language and vision." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-13", "text": "However, how general sentence representations can be benefited from visual grounding has not been fully explored yet." 
}, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-14", "text": "Very recently, [8] proposed a multi-modal encoder-decoder framework that, given an image caption, jointly predicts another caption and the features of associated image." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-15", "text": "The work showed promising results for further improving general sentence representations by grounding them visually." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-16", "text": "However, according to the model, visual association only occurs at the final hidden state of the encoder, potentially limiting the effect of visual grounding." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-17", "text": "Attention mechanism helps neural networks to focus on specific input features relevant to output." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-18", "text": "In the case of visually grounded multi-modal framework, applying such attention mechanism could help the encoder to identify visually significant words or phrases." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-19", "text": "We hypothesize that a language-attentive multi-modal framework has an intuitive basis on how humans mentally visualize certain concepts in sentences during language comprehension." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-20", "text": "In this paper, we propose an enhanced multi-modal encoder-decoder model, in which the encoder attends to the input sentence and the decoders predict image features and the target sentence." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-21", "text": "We train the model on images and respective captions from COCO5K dataset [9] ." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-22", "text": "We augment the state-of-the-art sentence representations with those produced by our model and conduct a series of experiments on transfer tasks to test the quality of sentence representations." 
}, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-23", "text": "Through detailed analysis, we confirm our hypothesis that self-attention help our model produce more feature-rich visually grounded sentence representations." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-24", "text": "[4] to log-bilinear models [2, 3] ." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-25", "text": "A recent work proposed using supervised learning of a specific task as a leverage to obtain general sentence representation [10] ." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-27", "text": "Convergence between computer vision and NLP researches have increasingly become common." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-28", "text": "Image captioning [11] [12] [13] [14] and image synthesis [15] are two common tasks." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-29", "text": "There have been significant studies focusing on improving word embeddings [16, 17] , phrase embeddings [18] , sentence embeddings [8, 19] , language models [20] through multi-modal learning of vision and language." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-30", "text": "Among all studies, [8] is the first to apply skip-gram-like intuition (predicting multiple modalities from langauge) to joint learning of language and vision in the perspective of general sentence representations." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-31", "text": "Attention Mechanism in Multi-Modal Semantics." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-32", "text": "Attention mechanism was first introduced in [21] for neural machine translation." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-33", "text": "Similar intuitions have been applied to various NLP [5, 22, 23] and vision tasks [11] ." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-34", "text": "[11] applied attention mechanism to images to bind specific visual features to language." 
}, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-35", "text": "Recently, self-attention mechanism [5] has been proposed for situations where there are no extra source of information to \"guide the extraction of sentence embedding\"." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-36", "text": "In this work, we propose a novel sentence encoder for the multi-modal encoder-decoder framework that leverages the self-attention mechanism." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-37", "text": "To the best of our knowledge, such attempt is the first among studies on joint learning of language and vision." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-38", "text": "----------------------------------" }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-39", "text": "**PROPOSED METHOD**" }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-40", "text": "Given a data sample (X, Y, h I ) \u2208 D, where X is the source caption, Y is the target caption, and h I is the hidden representation of the image, our goal is to predict Y and h I with X, and the hidden representation in the middle serves as the general sentence representation." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-41", "text": "----------------------------------" }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-42", "text": "**VISUALLY GROUNDED ENCODER-DECODER FRAMEWORK**" }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-43", "text": "We base our model on the encoder-decoder framework introduced in [8] ." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-44", "text": "A bidirectional Long Short-Term Memory (LSTM) [24] encodes an input sentence and produces a sentence representation for the input." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-45", "text": "A pair of LSTM cells encodes the input sequence in both directions and produce two final hidden states: h t and h t ." 
}, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-46", "text": "The hidden representation of the entire sequence is produced by selecting maximum elements between the two hidden states: h S = max( h t , h t )." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-47", "text": "The decoder calculates the probability of a target word y t at each time step t, conditional to the sentence representation h S and all target words before t. P (y t | y [8, 20, 25] , we find that log-exp-sum pairwise ranking [26] yields better results in terms of evaluation performance and efficiency." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-54", "text": "Thus, the objective for ranking" }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-55", "text": "where N is the set of negative examples and sim is cosine similarity." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-56", "text": "----------------------------------" }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-57", "text": "**VISUAL GROUNDING WITH SELF-ATTENTION**" }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-58", "text": "Let h t \u2208 R d h be the encoder hidden state at timestep t concatenated from two opposite directional LSTMs (d h is the dimensionality of sentence representations)." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-59", "text": "Let H \u2208 R d h \u00d7T be the hidden state matrix where t-th column of H is h t ." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-60", "text": "The self-attention mechanism aims to learn attention weight \u03b1 t , i.e. how much attention must be paid to hidden state h t , based on all hidden states H. Since there could be multiple ways to attend depending on desired features, we allow multiple attention vectors to be learned." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-61", "text": "Attention matrix A \u2208 R na\u00d7T is a stack of n a attention vectors, obtained through attention layers: A = sof tmax (W a2 tanh (W a1 H))." 
}, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-62", "text": "W a1 \u2208 R da\u00d7d h and W a2 \u2208 R na\u00d7da are attention parameters and d a is a hyperparameter." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-63", "text": "The context matrix C \u2208 R na\u00d7d h is obtained by C = AH." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-64", "text": "Finally, we compress the context matrix into a fixed size representation h A by max-pooling all context vectors: h A = max (c 1 , c 2 , . . . , c na ) ." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-65", "text": "Attended representation h A and encoder-decoder representation h S are concatenated into the final self-attentive sentence representation h. This hybrid representation replaces h S and is used to predict image features (Section 3.2) and target caption (Section 3.1)." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-66", "text": "----------------------------------" }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-67", "text": "**LEARNING OBJECTIVES**" }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-68", "text": "Following the experimental design of [8] , we conduct experiments on three different learning objectives: CAP2ALL, CAP2CAP, CAP2IMG." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-69", "text": "Under CAP2ALL, the model is trained to predict both the target caption and the associated image: L = L C + L V G ." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-70", "text": "Under CAP2CAP, the model is trained to predict only the target caption (L = L C ) and, under CAP2IMG, only the associated image (L = L V G )." 
}, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-71", "text": "----------------------------------" }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-72", "text": "**EXPERIMENTS**" }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-73", "text": "----------------------------------" }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-74", "text": "**IMPLEMENTATION DETAILS**" }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-75", "text": "Word embeddings W E are initialized with GloVe [27] ." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-76", "text": "The hidden dimension of each encoder and decoder LSTM cell (d h ) is 1024 1 ." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-77", "text": "We use Adam optimizer [28] and clip the gradients to between -5 and 5." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-78", "text": "Number of layers, dropout, and non-linearity for image feature prediction layers are 4, 0.3 and ReLU [29] respectively." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-79", "text": "Dimensionality of hidden attention layers (d a ) is 350 and number of attentions (n a ) is 30." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-80", "text": "We employ orthogonal initialization [30] for recurrent weights and xavier initialization [31] for all others." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-81", "text": "For the datasets, we use Karpathy and Fei-Fei's split for MS-COCO dataset [13] ." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-82", "text": "Image features are prepared by extracting hidden representations at the final layer of ResNet-101 [32] ." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-83", "text": "We evaluate sentence representation quality using SentEval 2 [8, 10] scripts." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-84", "text": "Mini-batch size is 128 and negative samples are prepared from remaining data samples in the same mini-batch." 
}, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-85", "text": "----------------------------------" }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-86", "text": "**EVALUATION**" }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-87", "text": "Adhering to the experimental settings of [8] , we concatenate sentence representations produced from our model with those obtained from the state-of-the-art unsupervised learning model (Layer Normalized Skip-Thoughts, ST-LN) [33] ." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-88", "text": "We evaluate the quality of sentence representations produced from different variants of our encoders on well-known transfer tasks: movie review sentiment (MR) [34] , customer reviews (CR) [35] , subjectivity (SUBJ) [36] , opinion polarity (MPQA) [37] , paraphrase identification (MSRP) [38] , binary sentiment classification (SST) [39] , SICK entailment and SICK relatedness [40] ." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-89", "text": "----------------------------------" }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-90", "text": "**RESULTS**" }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-91", "text": "Results are shown in Table 1 ." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-92", "text": "Results show that incorporating self-attention mechanism in the encoder is beneficial for most tasks." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-93", "text": "However, original models were better in some tasks (CR, MPQA, MRPC), suggesting that self-attention mechanism could sometimes introduce noise in sentence features." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-94", "text": "Overall, utilizing self-attentive sentence representation further improves performances in 1 However, for baseline models (without self-attention), we use d h = 2048 to match the dimensionality (2048) of sentence representations produced by our proposed models." 
}, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-95", "text": "2 https://github.com/facebookresearch/SentEval 5 out of 8 tasks." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-96", "text": "Considering that models with self-attention employ smaller LSTM cells (1024) than those without (2048) (Section 4.1), the performance improvements are significant." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-97", "text": "Results on COCO5K image and caption retrieval tasks (not included in the paper due to limited space) show comparable performances to other more specialized methods [13, 41] ." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-100", "text": "In order to study the effects of incorporating self-attention mechanism in joint prediction of image and language features, we examine attention vectors for selected samples from MS-COCO dataset and compare them to associated images ( Figure 1 )." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-101", "text": "For example, given the sentence \"man in black shirt is playing guitar\", our model identifies words that have association with strong visual imagery, such as \"man\", \"black\" and \"guitar\"." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-102", "text": "Given the second sentence, our model learned to attend to visually significant words such as \"cat\" and \"bowl\"." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-103", "text": "These findings show that visually grounding self-attended sentence representations helps to expose word-level visual features onto sentence representations [8] ." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-104", "text": "Figure 1: Activated attention weights on two samples from MS-COCO dataset." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-105", "text": "Vertical axis shows attention vectors learned by our model (compressed due to space limit)." 
}, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-106", "text": "Note how the sentence encoder learned to identify words with strong visual associations." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-107", "text": "----------------------------------" }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-108", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-109", "text": "In this paper, we proposed a novel encoder that exploits self-attention mechanism." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-110", "text": "We trained the model using MS-COCO dataset and evaluated sentence representations produced by our model (combined with universal sentence representations) on several transfer tasks." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-111", "text": "Results show that the self-attention mechanism not only improves the qualities of general sentence representations but also guides the encoder to emphasize certain visually associable words, which helps to make visual features more prominent in the sentence representations." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-112", "text": "As future work, we intend to explore crossmodal attention mechanism to further intertwine language and visual information for the purpose of improving sentence representation quality." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-113", "text": "(21A20151113068)." }, { "sent_id": "87af486eb2e968d2055eeab094b3f9-C001-114", "text": "Also, this work would not be possible without invaluable discussions with knowledgeable and helpful colleagues." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "87af486eb2e968d2055eeab094b3f9-C001-14" ], [ "87af486eb2e968d2055eeab094b3f9-C001-29" ] ], "cite_sentences": [ "87af486eb2e968d2055eeab094b3f9-C001-14", "87af486eb2e968d2055eeab094b3f9-C001-29" ] }, "@MOT@": { "gold_contexts": [ [ "87af486eb2e968d2055eeab094b3f9-C001-14", "87af486eb2e968d2055eeab094b3f9-C001-15", "87af486eb2e968d2055eeab094b3f9-C001-16" ] ], "cite_sentences": [ "87af486eb2e968d2055eeab094b3f9-C001-14", "87af486eb2e968d2055eeab094b3f9-C001-15", "87af486eb2e968d2055eeab094b3f9-C001-16" ] }, "@USE@": { "gold_contexts": [ [ "87af486eb2e968d2055eeab094b3f9-C001-43" ], [ "87af486eb2e968d2055eeab094b3f9-C001-68" ], [ "87af486eb2e968d2055eeab094b3f9-C001-83" ], [ "87af486eb2e968d2055eeab094b3f9-C001-87" ] ], "cite_sentences": [ "87af486eb2e968d2055eeab094b3f9-C001-43", "87af486eb2e968d2055eeab094b3f9-C001-68", "87af486eb2e968d2055eeab094b3f9-C001-83", "87af486eb2e968d2055eeab094b3f9-C001-87" ] }, "@DIF@": { "gold_contexts": [ [ "87af486eb2e968d2055eeab094b3f9-C001-53" ], [ "87af486eb2e968d2055eeab094b3f9-C001-100", "87af486eb2e968d2055eeab094b3f9-C001-103" ] ], "cite_sentences": [ "87af486eb2e968d2055eeab094b3f9-C001-53", "87af486eb2e968d2055eeab094b3f9-C001-103" ] }, "@EXT@": { "gold_contexts": [ [ "87af486eb2e968d2055eeab094b3f9-C001-100", "87af486eb2e968d2055eeab094b3f9-C001-103" ] ], "cite_sentences": [ "87af486eb2e968d2055eeab094b3f9-C001-103" ] } } }, "ABC_dcf84cf05e3e7950cabbdd8d8f304c_8": { "x": [ { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-2", "text": "In this paper we propose the first model for multiword expression (MWE) compositionality prediction based on character-level neural network language models." 
}, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-3", "text": "Experimental results on two kinds of MWEs (noun compounds and verb-particle constructions) and two languages (English and German) suggest that character-level neural network language models capture knowledge of multiword expression compositionality, in particular for English noun compounds and the particle component of English verb-particle constructions." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-4", "text": "In contrast to many other approaches to MWE compositionality prediction, this character-level approach does not require token-level identification of MWEs in a training corpus, and can potentially predict the compositionality of out-of-vocabulary MWEs." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-5", "text": "----------------------------------" }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-7", "text": "Multiword expressions (MWEs) are lexical items that are composed of multiple words, and exhibit some degree of idiomaticity , for example semantic idiomaticity, in which the meaning of an MWE is not entirely transparent from the meanings of its component words, as in spill the beans, which has an idiomatic meaning of 'reveal a secret'." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-8", "text": "Compositionality is the degree to which the meaning of an MWE is predictable from the meanings of its component words." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-9", "text": "It is typically viewed as lying on a continuum, with expressions such as speed limit and gravy train lying towards the compositional and non-compositional ends of the spectrum, respectively, and expressions such as rush hour and fine line falling somewhere in between as semi-compositional." 
}, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-10", "text": "1 Compositionality can also be viewed with respect to an individual component word of an MWE, where an MWE component word is compositional if its meaning is reflected in the meaning of the expression." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-11", "text": "For example, in spelling bee and grandfather clock, the first and second component words, respectively, are compositional, while the others are not." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-12", "text": "Knowledge of multiword expressions is important for natural language processing (NLP) tasks such as parsing (Korkontzelos and Manandhar, 2010) and machine translation (Carpuat and Diab, 2010) ." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-13", "text": "In the case of translation, compositionality is particularly important because a word-for-word translation would typically be incorrect for a non-compositional expression." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-14", "text": "Much research has therefore focused on compositionality prediction of MWEs, primarily at the type level." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-15", "text": "One common approach to measuring compositionality is to compare distributional representations of an MWE and its component words (e.g., Schone and Jurafsky, 2001; Baldwin et al., 2003; Katz and Giesbrecht, 2006; Reddy et al., 2011; Schulte im Walde et al., 2013; Salehi et al., 2015) ." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-16", "text": "The hypothesis behind this line of work is that the representation of a compositional MWE will be more similar to the representations of its component words than the representation of a non-compositional MWE will be to those of its component words." 
}, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-17", "text": "One issue faced by such approaches is that token-level instances of MWEs must be identified in a corpus in order to form distributional representations of them." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-18", "text": "Token-level MWE identification has been studied for specific types of MWEs such as verb-particle constructions (e.g., and verb-noun idioms (e.g., Salton et al., 2016) ." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-19", "text": "Broad coverage MWE identification has also been studied, and remains a challenge (Schneider et al., 2014; Gharbieh et al., 2017) ." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-20", "text": "Language models are common throughout NLP in tasks including machine translation (Brants et al., 2007) , speech recognition (Collins et al., 2005) , and question answering (Chen et al., 2006) ." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-21", "text": "Although word-level language models are widely used, and their performance can be higher than character-level language models, character-level models have the advantage that they can model out-of-vocabulary words (Mikolov et al., 2012) ." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-22", "text": "Owing to this advantage, character-level language models have been applied in a range of NLP tasks, including authorship attribution, (Peng et al., 2003) , part-of-speech tagging (Santos and Zadrozny, 2014), case restoration (Susanto et al., 2016) , and stock price prediction (dos Santos Pinheiro and Dras, 2017)." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-23", "text": "Moreover, character-level information can be composed to form representations of words (Ling et al., 2015) ." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-24", "text": "In this paper we consider whether character-level neural network language models capture knowledge of MWE compositionality." 
}, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-25", "text": "We train character-level language models based on recurrent neural networks -including long short-term memory (LSTM, Hochreiter and Schmidhuber, 1997) and gated recurrent unit (GRU, Cho et al., 2014) ." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-26", "text": "We then use these language models to form continuous vector representations of MWEs and their component words." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-27", "text": "Following prior work, we then use these representations to predict the compositionality of MWEs." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-28", "text": "This method overcomes the limitation of previous work in this vein of having to identify token instances of MWEs in a corpus in order to form a distributional representation of them." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-29", "text": "Moreover, this approach could potentially be applied to predict the compositionality of out-of-vocabulary expressions that were not seen in the corpus on which the language model was trained." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-30", "text": "To the best of our knowledge, this is the first work to apply character-level neural network language models to predict MWE compositionality." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-31", "text": "Our experiments on two kinds of MWEs (noun compounds and verb-particle constructions) and two languages (English and German) produce mixed results, but suggest that character-level neural network language models do indeed capture some knowledge of multiword expression compositionality, in particular for English noun compounds and the particle component of English verb-particle constructions." 
}, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-32", "text": "----------------------------------" }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-33", "text": "**A CHARACTER-LEVEL MODEL FOR MWE COMPOSITIONALITY**" }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-34", "text": "If an MWE is compositional, it is expected to be similar in meaning to its component words." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-35", "text": "Since the vector representation of a word/MWE is taken as a proxy for its meaning, we expect the vector representation of a compositional MWE to be similar to its component words' vectors." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-36", "text": "In order to obtain vectors representing each of an MWE and its component words through a character-level neural network language model, each of the MWE and its component words are considered as a sequence of characters." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-37", "text": "Each of these character sequences includes a special end-of-sequence character." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-38", "text": "In the case of an MWE, the character sequence includes a space character between the component words." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-39", "text": "For example, the MWE ivory tower is represented as the sequence < i, v, o, r, y, , t, o, w, e, r, END >." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-40", "text": "These character sequences are fed to the neural network language model, and the hidden state of the neural network at the end of the sequence is taken as the vector representation for that sequence." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-41", "text": "2 Once vector representations of an MWE and its component words are obtained, following Salehi et al. 
(2015) , the following equations are then used to compute the compositionality of an MWE:" }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-42", "text": "where MWE is the vector representation of the MWE, and C 1 and C 2 are vector representations for the first and second components of the MWE, respectively." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-43", "text": "3 In both cases, we use cosine as the similarity measure." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-44", "text": "comp 1 is based on Reddy et al. (2011) ." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-45", "text": "As shown in equation (1), the compositionality of an MWE is computed based on measuring the similarity of the MWE and each of its component words, and then combining these two similarities into an overall compositionality score." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-46", "text": "comp 2 is based on Mitchell and Lapata (2010) and measures compositionality by considering the similarity between the MWE and the summation of its component words' vectors." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-47", "text": "----------------------------------" }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-48", "text": "**MATERIALS AND METHODS**" }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-49", "text": "In this section, we describe the language model and corpus it was trained on, as well as the evaluation dataset and methodology." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-50", "text": "----------------------------------" }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-51", "text": "**LANGUAGE MODEL**" }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-52", "text": "We use a publicly available TensorFlow implementation of a character-level RNN language model." 
}, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-53", "text": "4 We use the following parameter settings as defaults: a two-layer LSTM with one-hot character embeddings and a hidden layer size of 128 dimensions." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-54", "text": "The batch size, learning rate, and dropout are set 20, 0.002, and 0, respectively." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-55", "text": "5 We consider some alternative parameter settings to these defaults in section \u00a74." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-56", "text": "----------------------------------" }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-57", "text": "**TRAINING CORPUS**" }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-58", "text": "We train language models over a portion of English and German Wikipedia dumps -following Salehi et al. (2015) -from 20 January 2018." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-59", "text": "The raw dumps are preprocessed using WP2TXT 6 to remove wikimarkup, metadata, and XML and HTML tags." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-60", "text": "The text from Wikipedia contains many characters that are not typically found in MWEs, for example, non-ASCII characters." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-61", "text": "Such characters drastically increase the size of the vocabulary of the language model, which leads to very long training times." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-62", "text": "We therefore remove all non-ASCII characters from the English dump, and all non-ASCII characters other than\u00e4,\u00c4,\u00f6,\u00d6,\u00fc,\u00dc, \u00df from the German dump." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-63", "text": "Training the character-level language model over the Wikipedia dumps in their entirety would take a prohibitively long time due to their size." 
}, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-64", "text": "We therefore instead carry out experiments training on a 1% sample of the English dump, and a 2% sample of the German dump (to give a corpus of similar size to the English one)." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-65", "text": "Details of the resulting training corpora are provided in table 1." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-66", "text": "----------------------------------" }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-67", "text": "**EVALUATION DATA**" }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-68", "text": "The proposed model is evaluated over the same three datasets as Salehi et al. (2015) , which cover two languages (English and German) and two kinds of MWEs (noun compounds and verb-particle constructions)." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-69", "text": "ENC This dataset contains 90 English noun compounds (e.g., game plan, gravy train) which are annotated on a scale of [0, 5] for both their overall compositionality, and the compositionality of each of their component words (Reddy et al., 2011) ." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-70", "text": "(Mikolov et al., 2013) , are also shown." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-71", "text": "EVPC This dataset consists of 160 English verb-particle constructions (e.g., add up, figure out) which are rated on a binary scale for the compositionality of each of the verb and particle component words (Bannard, 2006) by multiple annotators; no ratings for the overall compositionality of MWEs are provided in this dataset." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-72", "text": "The binary compositionality judgements are converted to continuous values as in Salehi et al. (2015) by dividing the number of judgements that an expression is compositional by the total number of judgements." 
}, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-73", "text": "GNC This dataset contains 244 German noun compounds (e.g., Ahornblatt 'maple leaf', Knoblauch 'garlic') which are annotated on a scale of [1, 7] for their overall compositionality, and the compositionality of each component word (von der Heide and Borgwaldt, 2009)." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-74", "text": "----------------------------------" }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-75", "text": "**EVALUATION METHODOLOGY**" }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-76", "text": "We evaluate our proposed approach following Salehi et al. (2015) by computing Pearson's correlation between the predicted compositionality (i.e., from either comp 1 or comp 2 ) and human ratings for overall compositionality." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-77", "text": "For EVPC, no overall compositionality ratings are provided." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-78", "text": "In this case we report the correlation between the predicted compositionality scores and both the verb and particle compositionality judgements." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-79", "text": "7" }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-80", "text": "----------------------------------" }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-81", "text": "**RESULTS**" }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-82", "text": "We begin by considering results using the default settings (described in section \u00a73.1) using both comp 1 and comp 2 ." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-83", "text": "For comp 1 , we set \u03b1 to 0.7 for ENC and GNC following Salehi et al. (2015) ; for EVPC we set \u03b1 to 0.5." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-84", "text": "Results are shown in table 2." 
}, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-85", "text": "For ENC, and the particle component of EVPC, both comp 1 and comp 2 achieve significant correlations (i.e., p < 0.05)." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-86", "text": "However, for GNC, and the verb component of EVPC, neither approach to predicting compositionality gives significant correlations." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-87", "text": "These correlations are well below those of previous work." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-88", "text": "For example, using comp 1 with representations of the MWE and component words obtained from word2vec (Mikolov et al., 2013) , Salehi et al. (2015) achieve correlations of 0.717, 0.289, and 0.400 for ENC, the verb component of EVPC, and GNC, respectively." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-89", "text": "8 Nevertheless, the results in table 2, and in particular the significant correlations for ENC and the particle component of EVPC, indicate that character-level neural network language models do capture some information about the compositionality of MWEs, at least for certain types of expressions." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-90", "text": "We now consider the compositionality of individual component words." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-91", "text": "Because of the low correlations on GNC in the previous experiments, we do not consider it further here." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-92", "text": "In this case, we compute the compositionality of a specific component word as below, where C is the vector representation of a component word." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-93", "text": "Note that this corresponds to comp 1 with \u03b1 = 1 or 0, in the case of the first and second component words, respectively." 
}, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-94", "text": "We compare these compositionality predictions with the human judgements for Table 4 : Pearson's correlation (r) for MWEs that are attested, and unattested, in each dataset, using comp 1 and comp 2 ." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-95", "text": "Significant correlations (p < 0.05) are indicated with *." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-96", "text": "The number of attested and unattested MWEs in each dataset is also shown." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-97", "text": "the compositionality of the corresponding component word." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-98", "text": "Results are shown in table 3." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-99", "text": "For EVPC, the results are perhaps not surprising given the previous findings, with a significant correlation being achieved for the particle (word 2) but not the verb (word 1)." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-100", "text": "In the case of ENC, a significant correlation is also achieved for the second component word, but not the first." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-101", "text": "The above results suggests that the model is better able to predict the compositionality of the second component word of an MWE than the first." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-102", "text": "To determine whether there is a relationship between the directionality of a character-level language model and the compositionality information it can capture, we also consider a backward LSTM that was trained by reversing the training corpus." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-103", "text": "The MWE and its component words were then reversed when computing compositionality." 
}, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-104", "text": "However, none of the correlations from this approach were significant." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-105", "text": "One interesting aspect of our proposed model is that it can potentially predict the compositionality of out-of-vocabulary expressions that are not observed in the training corpus." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-106", "text": "In table 4 we present results for each dataset, in the same setup as for table 2, but computing the correlation separately for MWEs that are attested, and unattested, in the training corpus." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-107", "text": "For ENC, both compositionality measures achieve significant correlations for attested expressions, but not for unattested ones, suggesting that the model cannot predict the compositionality of unseen expressions." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-108", "text": "In the case of the compositionality of the particle component of EVPC, for both comp 1 and comp 2 , the correlations for the unattested expressions are higher than for the attested ones, although for unattested expressions the correlations are not significant." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-109", "text": "The relatively small number of unattested expressions in EVPC (13) could play a role in this finding." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-110", "text": "To further investigate this, we focused on expressions in EVPC with less than 5 usages in the training corpus." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-111", "text": "There are 71 such expressions." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-112", "text": "For the compositionality of the particle component, comp 1 and comp 2 achieve correlations of 0.327 and 0.308, respectively." 
}, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-113", "text": "These correlations are significant (p < 0.05)." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-114", "text": "Word embedding models -such as that used in the approach to predicting compositionality of Salehi et al. (2015) -typically do not learn representations for low frequency items." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-115", "text": "9 These results demonstrate that the proposed model is able to predict the compositionality for low frequency items, that would not typically be in-vocabulary for word embedding models, and for which compositionality models based only on word embeddings would not be able to make predictions." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-116", "text": "10 For GNC, and the verb component of EVPC, in line with the previous results over the entire dataset, neither compositionality measure gives significant correlations, with the exception of the verb component of EVPC using comp 2 for unattested expressions, although again the number of expressions here is relatively small." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-117", "text": "In an effort to improve on the default setup we considered a range of model variations." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-118", "text": "In particular we considered an RNN and GRU (instead of an LSTM), character embeddings of size 25 and 50 (instead of a one-hot representation), increasing the batch size to 100 (from 20), using dropout between 0.2-0.6, and using a bi-directional LSTM." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-119", "text": "None of these variations led to consistent improvements over the default setup." 
}, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-120", "text": "----------------------------------" }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-121", "text": "**CONCLUSIONS**" }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-122", "text": "In this paper we proposed an approach to predicting the compositionality of multiword expressions based on a character-level neural network language model." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-123", "text": "To the best of our knowledge, this is the first work to consider such character-level models for this task." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-124", "text": "Our proposed character-level approach has an advantage over prior approaches to compositionality prediction based on distributed representations of words in that we do not require token-level identification of MWEs in order to form representations of them." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-125", "text": "Our proposed approach can furthermore potentially predict the compositionality of out-of-vocabulary MWEs that are not observed in the training corpus." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-126", "text": "We carried out experiments over three compositionality datasets: English and German noun compounds, and English verb-particle constructions." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-127", "text": "Our experimental results indicate that character-level neural network models do capture knowledge of multiword expression compositionality, at least in the case of English noun compounds and the particle component of English verb-particle constructions." 
}, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-128", "text": "We further find that our proposed model captures knowledge of the compositionality of the particle component of English verb-particle constructions that are low frequency or not observed in the training corpus, but not of the compositionality of unobserved English noun compounds." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-129", "text": "In future work we intend to further explore the various parameter settings of the language modelsuch as the batch size, learning rate, and dropout -to better understand their impact on MWE compositionality prediction." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-130", "text": "We also intend to train the language model on larger corpora." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-131", "text": "Finally, we intend to combine our character-level approach to compositionality prediction with approaches based on other sources of information, for example distributed representations of words and knowledge from translation dictionaries (Salehi et al., 2014) ." }, { "sent_id": "dcf84cf05e3e7950cabbdd8d8f304c-C001-132", "text": "Specifically, we intend to determine whether the compositionality information from character-level neural network language models is complementary to that in these other approaches." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "dcf84cf05e3e7950cabbdd8d8f304c-C001-15", "dcf84cf05e3e7950cabbdd8d8f304c-C001-16", "dcf84cf05e3e7950cabbdd8d8f304c-C001-17", "dcf84cf05e3e7950cabbdd8d8f304c-C001-21" ], [ "dcf84cf05e3e7950cabbdd8d8f304c-C001-88" ], [ "dcf84cf05e3e7950cabbdd8d8f304c-C001-114" ] ], "cite_sentences": [ "dcf84cf05e3e7950cabbdd8d8f304c-C001-15", "dcf84cf05e3e7950cabbdd8d8f304c-C001-16", "dcf84cf05e3e7950cabbdd8d8f304c-C001-17", "dcf84cf05e3e7950cabbdd8d8f304c-C001-88", "dcf84cf05e3e7950cabbdd8d8f304c-C001-114" ] }, "@MOT@": { "gold_contexts": [ [ "dcf84cf05e3e7950cabbdd8d8f304c-C001-15", "dcf84cf05e3e7950cabbdd8d8f304c-C001-17", "dcf84cf05e3e7950cabbdd8d8f304c-C001-21" ] ], "cite_sentences": [ "dcf84cf05e3e7950cabbdd8d8f304c-C001-15", "dcf84cf05e3e7950cabbdd8d8f304c-C001-17" ] }, "@USE@": { "gold_contexts": [ [ "dcf84cf05e3e7950cabbdd8d8f304c-C001-41", "dcf84cf05e3e7950cabbdd8d8f304c-C001-42" ], [ "dcf84cf05e3e7950cabbdd8d8f304c-C001-58" ], [ "dcf84cf05e3e7950cabbdd8d8f304c-C001-68" ], [ "dcf84cf05e3e7950cabbdd8d8f304c-C001-72" ], [ "dcf84cf05e3e7950cabbdd8d8f304c-C001-76" ], [ "dcf84cf05e3e7950cabbdd8d8f304c-C001-83" ] ], "cite_sentences": [ "dcf84cf05e3e7950cabbdd8d8f304c-C001-41", "dcf84cf05e3e7950cabbdd8d8f304c-C001-58", "dcf84cf05e3e7950cabbdd8d8f304c-C001-68", "dcf84cf05e3e7950cabbdd8d8f304c-C001-72", "dcf84cf05e3e7950cabbdd8d8f304c-C001-76", "dcf84cf05e3e7950cabbdd8d8f304c-C001-83" ] }, "@DIF@": { "gold_contexts": [ [ "dcf84cf05e3e7950cabbdd8d8f304c-C001-87", "dcf84cf05e3e7950cabbdd8d8f304c-C001-88" ], [ "dcf84cf05e3e7950cabbdd8d8f304c-C001-113", "dcf84cf05e3e7950cabbdd8d8f304c-C001-114", "dcf84cf05e3e7950cabbdd8d8f304c-C001-115" ] ], "cite_sentences": [ "dcf84cf05e3e7950cabbdd8d8f304c-C001-88", "dcf84cf05e3e7950cabbdd8d8f304c-C001-114" ] } } }, "ABC_048944feaff977c8cf057d52594c72_8": { "x": [ { "sent_id": "048944feaff977c8cf057d52594c72-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-2", "text": 
"Sentence Similarity [SS] computes a similarity score between two sentences." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-3", "text": "The SS task differs from document level semantics tasks in that it features the sparsity of words in a data unit, i.e. a sentence." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-4", "text": "Accordingly it is crucial to robustly model each word in a sentence to capture the complete semantic picture of the sentence." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-5", "text": "In this paper, we hypothesize that by better modeling lexical semantics we can obtain better sentential semantics." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-6", "text": "We incorporate both corpus-based (selectional preference information) and knowledge-based (similar words extracted in a dictionary) lexical semantics into a latent variable model." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-7", "text": "The experiments show state-of-the-art performance among unsupervised systems on two SS datasets." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-8", "text": "----------------------------------" }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-10", "text": "Sentence Similarity [SS] is emerging as a crucial step in many NLP tasks that focus on sentence level semantics such as word sense disambiguation (Guo and Diab, 2010; Guo and Diab, 2012a) , summarization (Zhou et al., 2006 ), text coherence (Lapata and Barzilay, 2005) , tweet clustering (Sankaranarayanan et al., 2009; Jin et al., 2011) , etc." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-11", "text": "SS operates in a very small context, on average 11 words per sentence in Semeval-2012 dataset (Agirre et al., 2012) , resulting in inadequate evidence to generalize to robust sentential semantics." 
}, { "sent_id": "048944feaff977c8cf057d52594c72-C001-12", "text": "Weighted Textual Matrix Factorization [WTMF] (Guo and Diab, 2012b ) is a latent variable model that outperforms Latent Semantic Analysis [LSA] (Deerwester et al., 1990) and Latent Dirichelet Allocation [LDA] (Blei et al., 2003) models by a large margin in the SS task, yielding state-of-the-art performance on the LI06 (Li et al., 2006 ) SS dataset." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-13", "text": "However, all of these models make harsh simplifying assumptions on how a token is generated: (1) in LSA/WTMF, a token is generated by the inner product of the word latent vector and the document latent vector; (2) in LDA, all the tokens in a document are sampled from the same document level topic distribution." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-14", "text": "Under this framework, they ignore rich linguistic phenomena such as inter-word dependency, semantic scope of words, etc." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-15", "text": "This is a result of simply using document IDs as features to represent a word." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-16", "text": "Modeling quality lexical semantics in latent variable models does not draw enough attention in the community, since people usually apply dimension reduction techniques for documents, which have abundant words for extracting the document level semantics." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-17", "text": "However, in the SS setting, it is crucial to make good use of each word, given the limited number of words in a sentence." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-18", "text": "We believe a reasonable word generation story will avoid introducing noise in sentential semantics, encouraging robust lexical semantics which can further boost the sentential semantics." 
}, { "sent_id": "048944feaff977c8cf057d52594c72-C001-19", "text": "In this paper, we explicitly encode lexical semantics, both corpus-based and knowledge-based information, in the WTMF model, by which we are able to achieve even better results in SS task." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-20", "text": "The additional corpus-based information we exploit is selectional preference semantics (Resnik, 1997) , a feature already existing in the data yet ignored by most latent variable models." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-21", "text": "Selectional preference focuses on the admissible arguments for a word, thus capturing more nuanced semantics than the sentence IDs (when applied to a corpus of sentences as opposed to documents)." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-22", "text": "Consider the following example: In WTMF/LSA/LDA, a word will receive semantics from all the other words in a sentence, hence, the word oil, in the above example, will be assigned the incorrect finance topic that reflects the sentence level semantics." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-23", "text": "Moreover, the problem worsens for adjectives, adverbs and verbs, which have a much narrower semantic scope than the whole sentence." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-24", "text": "For example, the verb say should only be associated with analyst (only receiving semantics from analyst), as it is not related to other words in the sentence." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-25", "text": "In contrast, oil, according to its selectional preference, should be associated with crude indicating the resource topic." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-26", "text": "We believe modeling selectional preference capturing local evidence completes the semantic picture for words, hence further rendering better sentential semantics." 
}, { "sent_id": "048944feaff977c8cf057d52594c72-C001-27", "text": "To our best knowledge, this is the first work to model selectional preference for sentence/document semantics." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-28", "text": "We also integrate knowledge-based semantics in the WTMF framework." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-29", "text": "Knowledge-based semantics, a human-annotated clean resource, is an important complement to corpus-based noisy cooccurrence information." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-30", "text": "We extract similar word pairs from Wordnet (Fellbaum, 1998) ." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-31", "text": "Leveraging these pairs, an infrequent word such as purchase can exploit robust latent vectors from its synonyms such as buy." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-32", "text": "Similar words pairs can be seamlessly modeled in WTMF, since in the matrix factorization framework a latent vector profile is explicitly created for each word, while in LDA all the data structures are designed for documents/sentences." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-33", "text": "We construct a graph to connect words according to the extracted similar word pairs, to encourage similar words to share similar latent vector profiles." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-34", "text": "We will refer to our proposed novel model as WTMF+PK." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-35", "text": "----------------------------------" }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-36", "text": "**WEIGHTED TEXTUAL MATRIX FACTORIZATION**" }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-37", "text": "Our previous work (Guo and Diab, 2012b ) models the sentences in the weighted matrix factorization framework ( Figure 1 )." 
}, { "sent_id": "048944feaff977c8cf057d52594c72-C001-38", "text": "The corpus is stored in an M \u00d7 N matrix X, with each cell containing the TF-IDF values of words." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-39", "text": "The rows of X are M distinct words and columns are N sentences." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-40", "text": "As in Figure 1 , X is approximated by the product of a K \u00d7 M matrix P and a K \u00d7 N matrix Q. Accordingly, each sentence s j is represented by a K dimensional latent vector Q \u00b7,j ." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-41", "text": "Similarly a word w i is generalized by P \u00b7,i ." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-42", "text": "P and Q is optimized by minimize the objective function:" }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-43", "text": "where \u03bb is a regularization term." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-44", "text": "Missing tokens are modeled by assigning a different weight w m for each 0 cell in the matrix X. We can see the inner product of a word vector P \u00b7,i and a sentence vector Q \u00b7,j is used to approximate the cell X ij ." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-45", "text": "The graphical model of WTMF is illustrated in Figure 2a ." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-46", "text": "A w i /s j node is a latent vector P \u00b7,i /Q \u00b7,j , corresponding to a word/sentence, respectively." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-47", "text": "A shaded node is a non-zero cell in X, representing an observed token in a sentence." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-48", "text": "For simplicity, the missing tokens and weights are not shown in the graph." 
}, { "sent_id": "048944feaff977c8cf057d52594c72-C001-49", "text": "----------------------------------" }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-50", "text": "**CORPUS-BASED SEMANTICS: SELECTIONAL PREFERENCE**" }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-51", "text": "In this paper, we focus on selectional preference that reflects the association of two words: if two words form a bigram, then the two words should share similar latent dimensions." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-52", "text": "In the previous example, crude and oil form a bigram, and they share the resource topic." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-53", "text": "In our framework, this is implemented by adding extra columns in X, so that each additional column corresponds to a bigram, treating each bigram as a pseudo-sentence for the two words." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-54", "text": "The graphical model is illustrated in Figure 2b ." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-55", "text": "Therefore, oil will receive more resource topic from crude through the bigram crude oil, instead of only finance topic from the sentence as a whole." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-56", "text": "Each non-zero cell in the new columns of X, i.e. an observed token in a bigram (pseudo-sentence), is given a different weight: f req(j) denotes the frequency of bigram j appearing in the corpus, hence the strength of association is differentiated such that higher weights are assigned on the more probable bigrams." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-57", "text": "The coefficient \u03b3 is the importance of selectional preference." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-58", "text": "A larger \u03b3 indicates that we trust the selectional preference over the global sentential semantics." 
}, { "sent_id": "048944feaff977c8cf057d52594c72-C001-59", "text": "4 Knowledge-based Semantics: Similar Word Pairs" }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-60", "text": "We first extract synonym pairs from WordNet, which are words associated with the same sense, synset." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-61", "text": "We further expand the set by exploiting the relations defined in WordNet." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-62", "text": "For the extracted words, we consider the first sense of each word, and if it is connected to other senses by any of the WordNet defined relations (hypernym, similar words, etc.), then we treat the words associated with the other senses as similar words." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-63", "text": "In total, we are able to discover 80K pairs of similar words for the 46K distinct words in our corpus." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-64", "text": "Given a pair of similar words w i 1 /w i 2 , we want the two corresponding latent vectors P \u00b7,i 1 /P \u00b7,i 2 to be as close as possible, namely the cosine similarity to be close to 1." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-65", "text": "Accordingly, a term is added in equation 1 for each similar word pair w i 1 /w i 2 :" }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-66", "text": "|P \u00b7,i | denotes the length of the vector P \u00b7,i ." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-67", "text": "The coefficient \u03b4, analogous to \u03b3, denotes the importance of the knowledge-based evidence." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-68", "text": "The Figure 2c shows the final WTMF+PK model." 
}, { "sent_id": "048944feaff977c8cf057d52594c72-C001-69", "text": "----------------------------------" }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-70", "text": "**INFERENCE**" }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-71", "text": "In (Guo and Diab, 2012b) we use Alternating Least Square [ALS] for inference, which is to set the derivative of equation 1 for P/Q to 0 and iteratively compute P/Q by fixing the other matrix (Srebro and Jaakkola, 2003) ." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-72", "text": "However, it is no longer applicable with the new term (equation 2) involving the length of word vectors |P \u00b7,i |." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-73", "text": "Therefore we approximate the objective function by treating the vector length |P \u00b7,i | as fixed values during the ALS iterations:" }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-74", "text": "where P \u00b7,s(i) are the latent vectors of similar words of word i; the length of these vectors in the current iteration are stored in L s(i) (similarly L i is the current length of P \u00b7,i ) (cf. (Steck, 2010; Guo and Diab, 2012b) for optimization details)." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-75", "text": "----------------------------------" }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-76", "text": "**EXPERIMENTAL SETTING**" }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-77", "text": "We build the model WTMF+PK on the same corpora as used in our previous work (Guo and Diab, 2012b) , comprising the following: Brown corpus (each sentence is treated as a document), sense definitions from Wiktionary and Wordnet (only definitions without target words and usage examples)." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-78", "text": "We follow the preprocessing steps in (Guo and Diab, 2012c) : tokenization, pos-tagging, lemmatization and further merge lemmas." 
}, { "sent_id": "048944feaff977c8cf057d52594c72-C001-79", "text": "The corpus is used for building matrix X. The evaluation datasets are LI06 dataset and Semeval-2012 STS [STS12] (Agirre et al., 2012) dataset." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-80", "text": "LI06 consists of 30 sentence pairs (dictionary definitions)." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-81", "text": "For STS12, 1 the training data (2000 pairs) are used as the tuning set for setting the parameters of our models." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-82", "text": "This data comprises msrpar, msr-vid, smt-eur." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-83", "text": "Once the models are tuned, we evaluate them on the STS12 test data that comprises 3150 sentence pairs from msr-par, msr-vid, smt-eur, smt-news, On-WN." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-84", "text": "It is worth noting that smt-news and On-WN are not part of the tuning data." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-85", "text": "We use cosine similarity to measure the similarity scores between two sentences." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-86", "text": "Pearson correlation between the system's answer and gold standard similarity scores is used as the evaluation metric." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-87", "text": "We include three baselines LSA, LDA and WTMF using the setting described in (Guo and Diab, 2012b) ." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-88", "text": "We run Gibbs Sampling based LDA for 2000 iterations and average the model over the last 10 iterations." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-89", "text": "For WTMF, we run 20 iterations and fix the missing words weight at w m = 0.01 with a regularization coefficient set at \u03bb = 20, which is the best condition found in (Guo and Diab, 2012b) ." 
}, { "sent_id": "048944feaff977c8cf057d52594c72-C001-90", "text": "Table 1 summarizes the results at dimension K = 100 (the dimension of latent vectors)." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-91", "text": "To remove randomness, each reported number is the averaged results of 10 runs." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-92", "text": "Based on the STS tuning set, we experiment with different values for the selectional preference weight (\u03b3 = {0, 1, 2}), and likewise for the similar word pairs weight varying the \u03b4 value as follows \u03b4 = {0, 0.1, 0.3, 0.5, 0.7}. The performance on STS12 tuning and test dataset as well as on the LI06 dataset are illustrated in Figures 3a, 3b and 3d ." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-93", "text": "The parameters of model 6 in Table 1 (\u03b3 = 2, \u03b4 = 0.3) are the chosen values based on tuning set performance." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-94", "text": "Table 1 shows WTMF is already a very strong baseline: it outperforms LSA and LDA by a large margin." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-95", "text": "Same as in (Guo and Diab, 2012b) , LSA performance degrades dramatically when trained on a corpus of sentence sized documents, yielding results worse than the surface words baseline 31% (Agirre et al., 2012) ." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-96", "text": "Using corpus-based selectional preference semantics alone (model 4 WTMF+P in Table 1 ) boosts the performance of WTMF by +1.17% on the test set, while using knowledge-based semantics alone (model 5 WTMF+K) improves the over the WTMF results by an absolute +2.31%." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-97", "text": "Combining them (model 6 WTMF+PK) yields the best results, with an absolute increase of +3.39%, which suggests that the two sources of semantic evidence are useful, but more importantly, they are complementary for each other." 
}, { "sent_id": "048944feaff977c8cf057d52594c72-C001-98", "text": "Table 1 also presents the performance on each individual dataset." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-99", "text": "The gain on each individual source is not as much as the overall gain, which suggests part of the overall gain comes from the correct ranking of intra-source pairs." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-100", "text": "Note that WTMF+PK improves all individual datasets except smt-eur." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-101", "text": "This may be caused by too many overlapping words in the sentence pairs in smt-eur, while our approach focuses on extracting similarity between different words." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-102", "text": "----------------------------------" }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-103", "text": "**EXPERIMENTS**" }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-104", "text": "----------------------------------" }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-105", "text": "**EVALUATION ON THE STS12 DATASETS**" }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-106", "text": "Observing the performance using different values of weights in figure 3a and 3b, we can conclude that the selectional preference and similar word pairs yield very promising results." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-107", "text": "The trends hold in different parameter conditions with a consistent improvement." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-108", "text": "Figure 3c illustrates the impact of dimension K = {50, 75, 100, 125, 150} on WTMF and WTMF+PK." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-109", "text": "Generally a larger K leads to a higher Pearson correlation, but the improvement is tiny when K \u2265 100 (0.1% increase)." 
}, { "sent_id": "048944feaff977c8cf057d52594c72-C001-110", "text": "Compared to all the unsupervised systems that participated in Semeval STS 2012 task, WTMF+PK yields state-of-the-art performance (70.70%)." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-111", "text": "2 In (Guo and Diab, 2012c) we also apply WTMF (K = 100) on STS12, achieving a correlation of 69.5%." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-112", "text": "However, additional data is incorporated in the training corpora: (1) STS12 tuning set; (2) for WordNet and Wiktionary data, the target words are also included in the definitions (hence synonym pairs were used); (3) the usage examples of target words were also appended to the definitions." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-113", "text": "3 While trained with this experimental setting, our model WTMF+PK (\u03b3 = 2, \u03b4 = 0.3, K = 100) is able to reach an even higher correlation of 72.0%." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-114", "text": "Figure 3d presents the results obtained on the LI06 data set at different weight values for the corpusbased selectional preference semantics \u03b3 and for the knowledge-based semantics \u03b4." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-115", "text": "Our previous experiments (Guo and Diab, 2012b) show that WTMF is the state-of-the-art model on LI06." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-116", "text": "With lexical semantics explicitly modeled, WTMF+PK yields better results than WTMF (see Table 1 )." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-117", "text": "It should be noted that LI06 prefers a smaller similar word pair weight ( a \u03b4 = 0.1 yields the best performance around of 90.75%), yet in almost all conditions WTMF+PK outperforms WTMF as shown in Figure 3d ." 
}, { "sent_id": "048944feaff977c8cf057d52594c72-C001-118", "text": "----------------------------------" }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-119", "text": "**RELATED WORK**" }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-120", "text": "SS has progressed immensely in recent years, especially with the establishment of the Semantic Textual Similarity task in SEMEVAL 2012." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-121", "text": "Early work in SS focused on word pair similarity in the high dimensional space (Li et al., 2006; Liu et al., 2007; Islam and Inkpen, 2008; Tsatsaronis et al., 2010; Ho et al., 2010) , where co-occurrence information was not efficiently exploited." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-122", "text": "Researchers (O'Shea et al., 2008) find LSA does not yield good performance." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-123", "text": "In (Guo and Diab, 2012b; Guo and Diab, 2012c) , we show the superiority of the latent space approach in WTMF." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-124", "text": "In this paper, we improve the WTMF model and achieve state-of-the-art Pearson correlation on two standard SS datasets." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-125", "text": "There are latent variable models designed for lexical semantics, such as word senses (Boyd-Graber et al., 2007; Guo and Diab, 2011) , function words (Griffiths et al., 2005) , selectional preference (Ritter et al., 2010) , synonyms and antonyms (Yih et al., 2012) , etc." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-126", "text": "However little improvement is shown on document/sentence level semantics: (Ritter et al., 2010) and (Yih et al., 2012) focus on selectional preference and antonym identification, respectively; in (Griffiths et al., 2005 ) the LDA performance degrades in the text categorization task including the modeling of function words." 
}, { "sent_id": "048944feaff977c8cf057d52594c72-C001-127", "text": "Rather, we concentrate on nuanced lexical semantics phenomena that could benefit sentential semantics." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-128", "text": "----------------------------------" }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-129", "text": "**CONCLUSION**" }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-130", "text": "We incorporate corpus-based (selectional preference) and knowledge-based (similar word pairs) lexical semantics into a latent variable model." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-131", "text": "Our system yields state-of-the-art unsupervised performance on two most popular and standard SS datasets." }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-132", "text": "----------------------------------" }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-133", "text": "**ACKNOWLEDGMENT**" }, { "sent_id": "048944feaff977c8cf057d52594c72-C001-134", "text": "This work is supported by the IARPA SCIL program." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "048944feaff977c8cf057d52594c72-C001-12" ], [ "048944feaff977c8cf057d52594c72-C001-37" ], [ "048944feaff977c8cf057d52594c72-C001-32" ], [ "048944feaff977c8cf057d52594c72-C001-71" ], [ "048944feaff977c8cf057d52594c72-C001-87", "048944feaff977c8cf057d52594c72-C001-94" ], [ "048944feaff977c8cf057d52594c72-C001-115" ], [ "048944feaff977c8cf057d52594c72-C001-116", "048944feaff977c8cf057d52594c72-C001-117" ], [ "048944feaff977c8cf057d52594c72-C001-123" ] ], "cite_sentences": [ "048944feaff977c8cf057d52594c72-C001-12", "048944feaff977c8cf057d52594c72-C001-37", "048944feaff977c8cf057d52594c72-C001-32", "048944feaff977c8cf057d52594c72-C001-71", "048944feaff977c8cf057d52594c72-C001-87", "048944feaff977c8cf057d52594c72-C001-94", "048944feaff977c8cf057d52594c72-C001-115", "048944feaff977c8cf057d52594c72-C001-116", "048944feaff977c8cf057d52594c72-C001-117", "048944feaff977c8cf057d52594c72-C001-123" ] }, "@MOT@": { "gold_contexts": [ [ "048944feaff977c8cf057d52594c72-C001-12", "048944feaff977c8cf057d52594c72-C001-13" ], [ "048944feaff977c8cf057d52594c72-C001-19" ], [ "048944feaff977c8cf057d52594c72-C001-32", "048944feaff977c8cf057d52594c72-C001-33" ], [ "048944feaff977c8cf057d52594c72-C001-71", "048944feaff977c8cf057d52594c72-C001-72" ] ], "cite_sentences": [ "048944feaff977c8cf057d52594c72-C001-12", "048944feaff977c8cf057d52594c72-C001-13", "048944feaff977c8cf057d52594c72-C001-19", "048944feaff977c8cf057d52594c72-C001-32", "048944feaff977c8cf057d52594c72-C001-71" ] }, "@USE@": { "gold_contexts": [ [ "048944feaff977c8cf057d52594c72-C001-19" ], [ "048944feaff977c8cf057d52594c72-C001-22" ], [ "048944feaff977c8cf057d52594c72-C001-32", "048944feaff977c8cf057d52594c72-C001-33" ], [ "048944feaff977c8cf057d52594c72-C001-45" ], [ "048944feaff977c8cf057d52594c72-C001-73", "048944feaff977c8cf057d52594c72-C001-74" ], [ "048944feaff977c8cf057d52594c72-C001-77" ], [ "048944feaff977c8cf057d52594c72-C001-87" ], [ 
"048944feaff977c8cf057d52594c72-C001-89" ], [ "048944feaff977c8cf057d52594c72-C001-108" ] ], "cite_sentences": [ "048944feaff977c8cf057d52594c72-C001-19", "048944feaff977c8cf057d52594c72-C001-22", "048944feaff977c8cf057d52594c72-C001-32", "048944feaff977c8cf057d52594c72-C001-45", "048944feaff977c8cf057d52594c72-C001-74", "048944feaff977c8cf057d52594c72-C001-77", "048944feaff977c8cf057d52594c72-C001-87", "048944feaff977c8cf057d52594c72-C001-89", "048944feaff977c8cf057d52594c72-C001-108" ] }, "@EXT@": { "gold_contexts": [ [ "048944feaff977c8cf057d52594c72-C001-28" ], [ "048944feaff977c8cf057d52594c72-C001-71", "048944feaff977c8cf057d52594c72-C001-72", "048944feaff977c8cf057d52594c72-C001-73", "048944feaff977c8cf057d52594c72-C001-74" ], [ "048944feaff977c8cf057d52594c72-C001-96" ], [ "048944feaff977c8cf057d52594c72-C001-124" ] ], "cite_sentences": [ "048944feaff977c8cf057d52594c72-C001-28", "048944feaff977c8cf057d52594c72-C001-71", "048944feaff977c8cf057d52594c72-C001-74", "048944feaff977c8cf057d52594c72-C001-96", "048944feaff977c8cf057d52594c72-C001-124" ] }, "@DIF@": { "gold_contexts": [ [ "048944feaff977c8cf057d52594c72-C001-94" ], [ "048944feaff977c8cf057d52594c72-C001-96" ], [ "048944feaff977c8cf057d52594c72-C001-108" ], [ "048944feaff977c8cf057d52594c72-C001-116", "048944feaff977c8cf057d52594c72-C001-117", "048944feaff977c8cf057d52594c72-C001-34" ], [ "048944feaff977c8cf057d52594c72-C001-124" ] ], "cite_sentences": [ "048944feaff977c8cf057d52594c72-C001-94", "048944feaff977c8cf057d52594c72-C001-96", "048944feaff977c8cf057d52594c72-C001-108", "048944feaff977c8cf057d52594c72-C001-116", "048944feaff977c8cf057d52594c72-C001-117", "048944feaff977c8cf057d52594c72-C001-124" ] }, "@SIM@": { "gold_contexts": [ [ "048944feaff977c8cf057d52594c72-C001-95" ] ], "cite_sentences": [ "048944feaff977c8cf057d52594c72-C001-95" ] } } }, "ABC_0f5c87e5434785a612c6578244543d_8": { "x": [ { "sent_id": "0f5c87e5434785a612c6578244543d-C001-1", "text": "**ABSTRACT**" }, { "sent_id": 
"0f5c87e5434785a612c6578244543d-C001-2", "text": "Mapping word embeddings of different languages into a single space has multiple applications." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-3", "text": "In order to map from a source space into a target space, a common approach is to learn a linear mapping that minimizes the distances between equivalences listed in a bilingual dictionary." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-4", "text": "In this paper, we propose a framework that generalizes previous work, provides an efficient exact method to learn the optimal linear transformation and yields the best bilingual results in translation induction while preserving monolingual performance in an analogy task." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-5", "text": "----------------------------------" }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-7", "text": "Bilingual word embeddings have attracted a lot of attention in recent times (Zou et al., 2013; Ko\u010disk\u00fd et al., 2014; Chandar A P et al., 2014; Gouws et al., 2014; Gouws and S\u00f8gaard, 2015; Luong et al., 2015; Wick et al., 2016) ." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-8", "text": "A common approach to obtain them is to train the embeddings in both languages independently and then learn a mapping that minimizes the distances between equivalences listed in a bilingual dictionary." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-9", "text": "The learned transformation can also be applied to words missing in the dictionary, which can be used to induce new translations with a direct application in machine translation (Mikolov et al., 2013b; Zhao et al., 2015) ." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-29", "text": "Monolingual invariance is needed to preserve the dot products after mapping, avoiding performance degradation in monolingual tasks (e.g. 
analogy)." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-10", "text": "The first method to learn bilingual word embedding mappings was proposed by Mikolov et al. (2013b) , who learn the linear transformation that minimizes the sum of squared Euclidean distances for the dictionary entries." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-11", "text": "Subsequent work has proposed alternative optimization objectives to learn better mappings." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-12", "text": "Xing et al. (2015) incorporate length normalization in the training of word embeddings and try to maximize the cosine similarity instead, introducing an orthogonality constraint to preserve the length normalization after the projection." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-13", "text": "Faruqui and Dyer (2014) use canonical correlation analysis to project the embeddings in both languages to a shared vector space." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-14", "text": "Beyond linear mappings, Lu et al. (2015) apply deep canonical correlation analysis to learn a nonlinear transformation for each language." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-15", "text": "Finally, additional techniques have been used to address the hubness problem in Mikolov et al. (2013b) , both through the neighbor retrieval method and the training itself ." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-16", "text": "We leave the study of non-linear transformation and other additions for further work." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-17", "text": "In this paper, we propose a general framework to learn bilingual word embeddings." 
}, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-18", "text": "We start with a basic optimization objective (Mikolov et al., 2013b) and introduce several meaningful and intuitive constraints that are equivalent or closely related to previously proposed methods (Faruqui and Dyer, 2014; Xing et al., 2015) ." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-19", "text": "Our framework provides a more general view of bilingual word embedding mappings, showing the underlying connection between the existing methods, revealing some flaws in their theoretical justification and providing an alternative theoretical interpretation for them." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-20", "text": "Our experiments on an existing English-Italian word translation induction and an English word analogy task give strong empirical evidence in favor of our theoretical reasoning, while showing that one of our models clearly outperforms previous alternatives." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-21", "text": "----------------------------------" }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-22", "text": "**LEARNING BILINGUAL MAPPINGS**" }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-23", "text": "Let X and Z denote the word embedding matrices in two languages for a given bilingual dictionary so that their ith row X i * and Z i * are the word embeddings of the ith entry in the dictionary." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-24", "text": "Our goal is to find a linear transformation matrix W so that XW best approximates Z, which we formalize minimizing the sum of squared Euclidean distances following Mikolov et al. 
(2013b):" }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-25", "text": "Alternatively, this is equivalent to minimizing the (squared) Frobenius norm of the residual matrix:" }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-26", "text": "Consequently, W will be the so-called least-squares solution of the linear matrix equation XW = Z. This is a well-known problem in linear algebra and can be solved by taking the Moore-Penrose pseudoinverse X^+ = (X^T X)^{-1} X^T, so that W = X^+ Z, which can be computed using SVD." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-27", "text": "----------------------------------" }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-28", "text": "**ORTHOGONALITY FOR MONOLINGUAL INVARIANCE**" }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-30", "text": "This can be obtained by requiring W to be an orthogonal matrix (W^T W = I)." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-31", "text": "The exact solution under such orthogonality constraint is given by W = VU^T, where Z^T X = U\u03a3V^T is the SVD factorization of Z^T X (cf." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-32", "text": "Appendix A)." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-33", "text": "Thanks to this, the optimal transformation can be efficiently computed in linear time with respect to the vocabulary size." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-34", "text": "Note that orthogonality enforces an intuitive property, and as such it could be useful to avoid degenerated solutions and learn better bilingual mappings, as we empirically show in Section 3." 
}, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-35", "text": "----------------------------------" }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-36", "text": "**LENGTH NORMALIZATION FOR MAXIMUM COSINE**" }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-37", "text": "Normalizing word embeddings in both languages to be unit vectors guarantees that all training instances contribute equally to the optimization goal." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-38", "text": "As long as W is orthogonal, this is equivalent to maximizing the sum of cosine similarities for the dictionary entries, which is commonly used for similarity computations:" }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-39", "text": "This last optimization objective coincides with Xing et al. (2015) , but their work was motivated by an hypothetical inconsistency in Mikolov et al. (2013b) , where the optimization objective to learn word embeddings uses dot product, the objective to learn mappings uses Euclidean distance and the similarity computations use cosine." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-40", "text": "However, the fact is that, as long as W is orthogonal, optimizing the squared Euclidean distance of length-normalized embeddings is equivalent to optimizing the cosine, and therefore, the mapping objective proposed by Xing et al. (2015) is equivalent to that used by Mikolov et al. (2013b) with orthogonality constraint and unit vectors." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-41", "text": "In fact, our experiments show that orthogonality is more relevant than length normalization, in contrast to Xing et al. (2015) , who introduce orthogonality only to ensure that unit length is preserved after mapping." 
}, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-42", "text": "----------------------------------" }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-43", "text": "**MEAN CENTERING FOR MAXIMUM COVARIANCE**" }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-44", "text": "Dimension-wise mean centering captures the intuition that two randomly taken words would not be expected to be semantically similar, ensuring that the expected product of two random embeddings in any dimension and, consequently, their cosine similarity, is zero." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-45", "text": "As long as W is orthogonal, this is equivalent to maximizing the sum of dimensionwise covariance for the dictionary entries:" }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-46", "text": "where C m denotes the centering matrix This equivalence reveals that the method proposed by Faruqui and Dyer (2014) is closely related to our framework." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-47", "text": "More concretely, Faruqui and Dyer (2014) use Canonical Correlation Analysis (CCA) to project the word embeddings in both languages to a shared vector space." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-48", "text": "CCA maximizes the dimension-wise covariance of both projections (which is equivalent to maximizing the covariance of a single projection if the transformations are constrained to be orthogonal, as in our case) but adds an implicit restriction to the two mappings, making different dimensions have the same variance and be uncorrelated among themselves 1 :" }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-49", "text": "Therefore, the only fundamental difference between both methods is that, while our model enforces monolingual invariance, Faruqui and Dyer (2014) do change the monolingual embeddings to meet this restriction." 
}, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-50", "text": "In this regard, we think that the restriction they add could have a negative impact on the learning of the bilingual mapping, and it could also degrade the quality of the monolingual embeddings." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-51", "text": "Our experiments (cf. Section 3) show empirical evidence supporting this idea." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-52", "text": "----------------------------------" }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-53", "text": "**EXPERIMENTS**" }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-54", "text": "In this section, we experimentally test the proposed framework and all its variants in comparison with related methods." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-55", "text": "For that purpose, we use the translation induction task introduced by Mikolov et al. (2013b) , which learns a bilingual mapping on a small dictionary and measures its accuracy on predicting the translation of new words." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-56", "text": "Unfortunately, the dataset they use is not public." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-57", "text": "For that reason, we use the English-Italian dataset on the same task provided by 2 ." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-58", "text": "The dataset contains monolingual word embeddings trained with the word2vec toolkit using the CBOW method with negative sampling (Mikolov et al., 2013a) 3 ." 
}, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-59", "text": "The English embeddings were trained on a 2.8 billion word corpus (ukWaC + Wikipedia + BNC), while the 1.6 billion word corpus itWaC was used to train the Italian 1 While CCA is typically defined in terms of correlation (thus its name), correlation is invariant to the scaling of variables, so it is possible to constrain the canonical variables to have a fixed variance, as we do, in which case correlation and covariance become equivalent 2 http://clic.cimec.unitn.it/\u02dcgeorgiana." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-60", "text": "dinu/down/ 3 The context window was set to 5 words, the dimension of the embeddings to 300, the sub-sampling to 1e-05 and the number of negative samples to 10 embeddings." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-61", "text": "The dataset also contains a bilingual dictionary learned from Europarl, split into a training set of 5,000 word pairs and a test set of 1,500 word pairs, both of them uniformly distributed in frequency bins." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-62", "text": "Accuracy is the evaluation measure." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-63", "text": "Apart from the performance of the projected embeddings in bilingual terms, we are also interested in the monolingual quality of the source language embeddings after the projection." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-64", "text": "For that purpose, we use the word analogy task proposed by Mikolov et al. (2013a) , which measures the accuracy on answering questions like \"what is the word that is similar to small in the same sense as biggest is similar to big?\" using simple word vector arithmetic." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-65", "text": "The dataset they use consists of 8,869 semantic and 10,675 syntactic questions of this type, and is publicly available 4 ." 
}, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-66", "text": "In order to speed up the experiments, we follow the authors and perform an approximate evaluation by reducing the vocabulary size according to a frequency threshold of 30,000 (Mikolov et al., 2013a) ." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-67", "text": "Since the original embeddings are the same in all the cases and it is only the transformation that is applied to them that changes, this affects all the methods in the exact same way, so the results are perfectly comparable among themselves." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-68", "text": "With these settings, we obtain a coverage of 64.98%." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-69", "text": "We implemented the proposed method in Python using NumPy, and make it available as an open source project 5 ." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-70", "text": "The code for Mikolov et al. (2013b) and Xing et al. (2015) is not publicly available, so we implemented and tested them as part of the proposed framework, which only differs from the original systems in the optimization method (exact solution instead of gradient descent) and the length normalization approach in the case of Xing et al. (2015) (postprocessing instead of constrained training)." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-71", "text": "As for the method by Faruqui and Dyer (2014) , we used their original implementation in Python and MAT-LAB 6 , which we extended to cover cases where the dictionary contains more than one entry for the same word." 
}, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-72", "text": "----------------------------------" }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-73", "text": "**RESULTS OF OUR FRAMEWORK**" }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-74", "text": "The rows in Table 1 show, respectively, the results for the original embeddings, the basic mapping proposed by Mikolov et al. (2013b) (cf. Section 2) and the addition of orthogonality constraint (cf. Section 2.1), with and without length normalization and, incrementally, mean centering." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-75", "text": "In all the cases, length normalization and mean centering were applied to all embeddings, even if missing from the dictionary." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-76", "text": "The results show that the orthogonality constraint is key to preserve monolingual performance, and it also improves bilingual performance by enforcing a relevant property (monolingual invariance) that the transformation to learn should intuitively have." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-77", "text": "The contribution of length normalization alone is marginal, but when followed by mean centering we obtain further improvements in bilingual performance without hurting monolingual performance." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-78", "text": "Table 2 shows the results for our best performing configuration in comparison to previous work." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-79", "text": "As discussed before, (Mikolov et al., 2013b) and (Xing et al., 2015) were implemented as part of our framework, so they correspond to our uncostrained mapping with no preprocessing and orthogonal mapping with length normalization, respectively." 
}, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-80", "text": "----------------------------------" }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-81", "text": "**COMPARISON TO OTHER WORK**" }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-82", "text": "As it can be seen, the method by Xing et al. (2015) performs better than that of Mikolov et al. (2013b) in the translation induction task, which is in line with what they report in their paper." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-83", "text": "Moreover, thanks to the orthogonality constraint their monolingual performance in the word analogy task does not degrade, whereas the accuracy of Mikolov et al. (2013b) drops by 2.86% in absolute terms with respect to the original embeddings." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-84", "text": "Since Faruqui and Dyer (2014) Mikolov et al. (2013b) 34.93% 73. 80% Xing et al. (2015) 36.87% 76.66% Faruqui and Dyer (2014) CCA to perform dimensionality reduction, we tested several values for it and report the best (180 dimensions)." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-85", "text": "This beats the method by Xing et al. (2015) in the bilingual task, although it comes at the price of a considerable degradation in monolingual quality." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-86", "text": "In any case, it is our proposed method with the orthogonality constraint and a global preprocessing with length normalization followed by dimensionwise mean centering that achieves the best accuracy in the word translation induction task." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-87", "text": "Moreover, it does not suffer from any considerable degradation in monolingual quality, with an anecdotal drop of only 0.07% in contrast with 2.86% for Mikolov et al. (2013b) and 7.02% for Faruqui and Dyer (2014) ." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-88", "text": "When compared to Xing et al. 
(2015), our results in Table 1 reinforce our theoretical interpretation for their method (cf. Section 2.2), as it empirically shows that its improvement with respect to Mikolov et al. (2013b) comes solely from the orthogonality constraint, and not from solving any inconsistency." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-89", "text": "It should be noted that the implementation by Faruqui and Dyer (2014) also length-normalizes the word embeddings in a preprocessing step." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-90", "text": "Following the discussion in Section 2.3, this means that our best performing configuration is conceptually very close to the method by Faruqui and Dyer (2014) , as they both coincide on maximizing the average dimension-wise covariance and length-normalize the embeddings in both languages first, the only difference being that our model enforces monolingual invariance after the normalization while theirs does change the monolingual embeddings to make different dimensions have the same variance and be uncorrelated among themselves." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-91", "text": "However, our model performs considerably better than any configuration from Faruqui and Dyer (2014) in both the monolingual and the bilingual task, supporting our hypothesis that these two constraints that are implicit in their method are not only conceptually confusing, but also have a negative impact." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-92", "text": "----------------------------------" }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-93", "text": "**CONCLUSIONS**" }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-94", "text": "This paper develops a new framework to learn bilingual word embedding mappings, generalizing previous work and providing an efficient exact method to learn the optimal transformation." 
}, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-95", "text": "Our experiments show the effectiveness of the proposed model and give strong empirical evidence in favor of our reinterpretation of Xing et al. (2015) and Faruqui and Dyer (2014) ." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-96", "text": "It is the proposed method with the orthogonality constraint and a global preprocessing with length normalization and dimension-wise mean centering that achieves the best overall results both in monolingual and bilingual terms, surpassing those previous methods." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-97", "text": "In the future, we would like to study non-linear mappings (Lu et al., 2015) and the additional techniques in ." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-98", "text": "ish Ministry of Economy and Competitiveness (TADEEP TIN2015-70214-P)." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-99", "text": "Mikel Artetxe enjoys a doctoral grant from the Spanish Ministry of Education, Culture and Sports." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-100", "text": "----------------------------------" }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-101", "text": "**A PROOF OF SOLUTION UNDER ORTHOGONALITY**" }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-102", "text": "Constraining W to be orthogonal (W T W = I), the original minimization problem can be reformulated as follows (cf. Section 2.1): arg min In the above expression, Tr(\u00b7) denotes the trace operator (the sum of all the elements in the main diagonal), and the last equality is given by its cyclic property." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-103", "text": "At this point, we can take the SVD of Z T X as Z T X = U \u03a3V T , so Tr Z T XW = Tr U \u03a3V T W = Tr \u03a3V T W U ." 
}, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-104", "text": "Since V^T, W and U are orthogonal matrices, their product V^T WU will also be an orthogonal matrix." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-105", "text": "In addition to that, given that \u03a3 is a diagonal matrix, its trace after an orthogonal transformation will be maximal when the values in its main diagonal are preserved after the mapping, that is, when the orthogonal transformation matrix is the identity matrix." }, { "sent_id": "0f5c87e5434785a612c6578244543d-C001-106", "text": "This will happen when V^T WU = I in our case, so the optimal solution will be W = VU^T." } ], "y": { "@BACK@": { "gold_contexts": [ [ "0f5c87e5434785a612c6578244543d-C001-13" ] ], "cite_sentences": [ "0f5c87e5434785a612c6578244543d-C001-13" ] }, "@USE@": { "gold_contexts": [ [ "0f5c87e5434785a612c6578244543d-C001-18" ], [ "0f5c87e5434785a612c6578244543d-C001-71" ], [ "0f5c87e5434785a612c6578244543d-C001-84" ] ], "cite_sentences": [ "0f5c87e5434785a612c6578244543d-C001-18", "0f5c87e5434785a612c6578244543d-C001-71", "0f5c87e5434785a612c6578244543d-C001-84" ] }, "@EXT@": { "gold_contexts": [ [ "0f5c87e5434785a612c6578244543d-C001-18" ], [ "0f5c87e5434785a612c6578244543d-C001-71" ] ], "cite_sentences": [ "0f5c87e5434785a612c6578244543d-C001-18", "0f5c87e5434785a612c6578244543d-C001-71" ] }, "@DIF@": { "gold_contexts": [ [ "0f5c87e5434785a612c6578244543d-C001-18", "0f5c87e5434785a612c6578244543d-C001-19" ], [ "0f5c87e5434785a612c6578244543d-C001-46", "0f5c87e5434785a612c6578244543d-C001-47", "0f5c87e5434785a612c6578244543d-C001-49" ], [ "0f5c87e5434785a612c6578244543d-C001-86", "0f5c87e5434785a612c6578244543d-C001-87" ], [ "0f5c87e5434785a612c6578244543d-C001-91" ], [ "0f5c87e5434785a612c6578244543d-C001-95" ] ], "cite_sentences": [ "0f5c87e5434785a612c6578244543d-C001-18", "0f5c87e5434785a612c6578244543d-C001-46", "0f5c87e5434785a612c6578244543d-C001-47", "0f5c87e5434785a612c6578244543d-C001-49",
"0f5c87e5434785a612c6578244543d-C001-87", "0f5c87e5434785a612c6578244543d-C001-91", "0f5c87e5434785a612c6578244543d-C001-95" ] }, "@SIM@": { "gold_contexts": [ [ "0f5c87e5434785a612c6578244543d-C001-46" ], [ "0f5c87e5434785a612c6578244543d-C001-89" ], [ "0f5c87e5434785a612c6578244543d-C001-90" ] ], "cite_sentences": [ "0f5c87e5434785a612c6578244543d-C001-46", "0f5c87e5434785a612c6578244543d-C001-89", "0f5c87e5434785a612c6578244543d-C001-90" ] } } }, "ABC_520437f612e678dcd4ec9c043cf701_8": { "x": [ { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-53", "text": "We first discuss the results for the TripAdvisorGold dataset shown in Table 2 ." }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-52", "text": "**TRIPADVISOR-GOLD**" }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-2", "text": "Most previous studies in computerized deception detection have relied only on shallow lexico-syntactic patterns." }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-3", "text": "This paper investigates syntactic stylometry for deception detection, adding a somewhat unconventional angle to prior literature." }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-4", "text": "Over four different datasets spanning from the product review to the essay domain, we demonstrate that features driven from Context Free Grammar (CFG) parse trees consistently improve the detection performance over several baselines that are based only on shallow lexico-syntactic features." }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-5", "text": "Our results improve the best published result on the hotel review data (Ott et al., 2011) reaching 91.2% accuracy with 14% error reduction." 
}, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-6", "text": "----------------------------------" }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-8", "text": "Previous studies in computerized deception detection have relied only on shallow lexico-syntactic cues." }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-9", "text": "Most are based on dictionary-based word counting using LIWC (Pennebaker et al., 2007 ) (e.g., Hancock et al. (2007) , Vrij et al. (2007) ), while some recent ones explored the use of machine learning techniques using simple lexico-syntactic patterns, such as n-grams and part-of-speech (POS) tags (Mihalcea and Strapparava (2009) , Ott et al. (2011) )." }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-10", "text": "These previous studies unveil interesting correlations between certain lexical items or categories and deception that may not be readily apparent to human judges." }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-11", "text": "For instance, the work of Ott et al. (2011) in the hotel review domain results in very insightful observations that deceptive reviewers tend to use verbs and personal pronouns (e.g., \"I\", \"my\") more often, while truthful reviewers tend to use more nouns, adjectives, and prepositions." }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-12", "text": "In parallel to these shallow lexical patterns, might there be deep syntactic structures that are lurking in deceptive writing?" }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-13", "text": "This paper investigates syntactic stylometry for deception detection, adding a somewhat unconventional angle to prior literature."
}, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-14", "text": "Over four different datasets spanning from the product review domain to the essay domain, we find that features driven from Context Free Grammar (CFG) parse trees consistently improve the detection performance over several baselines that are based only on shallow lexico-syntactic features." }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-15", "text": "Our results improve the best published result on the hotel review data of Ott et al. (2011) reaching 91.2% accuracy with 14% error reduction." }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-16", "text": "We also achieve substantial improvement over the essay data of Mihalcea and Strapparava (2009) , obtaining up to 85.0% accuracy." }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-17", "text": "----------------------------------" }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-18", "text": "**FOUR DATASETS**" }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-19", "text": "To explore different types of deceptive writing, we consider the following four datasets spanning from the product review to the essay domain:" }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-20", "text": "I. TripAdvisor-Gold: Introduced in Ott et al. (2011) , this dataset contains 400 truthful reviews obtained from www.tripadviser.com and 400 deceptive reviews gathered using Amazon Mechanical Turk, evenly distributed across 20 Chicago hotels." }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-21", "text": "II. TripAdvisor-Heuristic: This dataset contains 400 truthful and 400 deceptive reviews harvested from www.tripadviser.com, based on fake review detection heuristics introduced in Feng et al. (2012) ." }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-22", "text": "1" }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-23", "text": "----------------------------------" }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-24", "text": "**III. 
YELP:**" }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-25", "text": "This dataset is our own creation using www.yelp.com." }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-26", "text": "We collect 400 filtered reviews and 400 displayed reviews for 35 Italian restaurants with average ratings in the range of [3.5, 4.0]." }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-27", "text": "Class labels are based on the metadata, which tells us whether each review is filtered by Yelp's automated review filtering system or not." }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-28", "text": "We expect that filtered reviews roughly correspond to deceptive reviews, and displayed reviews to truthful ones, but not without considerable noise." }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-29", "text": "We only collect 5-star reviews to avoid unwanted noise from varying degrees of sentiment." }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-30", "text": "----------------------------------" }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-31", "text": "**IV. ESSAYS**" }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-32", "text": "Introduced in Mihalcea and Strapparava (2009), this corpus contains truthful and deceptive essays collected using Amazon Mechanical Turk for the following three topics: \"Abortion\" (100 essays per class), \"Best Friend\" (98 essays per class), and \"Death Penalty\" (98 essays per class)." }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-33", "text": "----------------------------------" }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-34", "text": "**FEATURE ENCODING**" }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-35", "text": "Words Previous work has shown that bag-of-words features are effective in detecting domain-specific deception (Ott et al., 2011; Mihalcea and Strapparava, 2009 )."
}, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-36", "text": "We consider unigram, bigram, and the union of the two as features." }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-37", "text": "Shallow Syntax As has been used in many previous studies in stylometry (e.g., Argamon-Engelson et al. (1998) , Zhao and Zobel (2007) ), we utilize part-of-speech (POS) tags to encode shallow syntactic information." }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-38", "text": "Note that Ott et al. (2011) found that even though POS tags are effective in detecting fake product reviews, they are not as effective as words." }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-39", "text": "Therefore, we strengthen POS features with unigram features." }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-40", "text": "Deep syntax We experiment with four different encodings of production rules based on the Probabilistic Context Free Grammar (PCFG) parse trees as follows:" }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-41", "text": "\u2022 r: unlexicalized production rules (i.e., all production rules except for those with terminal nodes), e.g., NP2 \u2192 NP3 SBAR." }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-42", "text": "\u2022 r*: lexicalized production rules (i.e., all production rules), e.g., PRP \u2192 \"you\"." }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-43", "text": "\u2022 r\u0302: unlexicalized production rules combined with the grandparent node, e.g., NP2\u02c6VP. Table 2: Deception Detection Accuracy (%)." }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-44", "text": "\u2022 r\u0302*: lexicalized production rules (i.e., all production rules) combined with the grandparent node, e.g., PRP\u02c6NP4 \u2192 \"you\"."
}, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-45", "text": "----------------------------------" }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-46", "text": "**EXPERIMENTAL RESULTS**" }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-47", "text": "For all classification tasks, we use an SVM classifier, with 80% of the data for training and 20% for testing, with 5-fold cross validation." }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-48", "text": "2 All features are encoded as tf-idf values." }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-49", "text": "We use the Berkeley PCFG parser (Petrov and Klein, 2007) to parse sentences." }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-50", "text": "Table 2 presents the classification performance using various features across four different datasets introduced earlier." }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-51", "text": "----------------------------------" }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-54", "text": "As reported in Ott et al. (2011) , bag-of-words features achieve surprisingly high performance, reaching up to 89.6% accuracy." }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-55", "text": "Deep syntactic features, encoded as r\u0302*, slightly improve this performance, achieving 90.4% accuracy." }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-56", "text": "When these syntactic features are combined with unigram features, we attain the best performance of 91.2% accuracy, 2 We use LIBLINEAR (Fan et al., 2008) with L2-regularization, parameters optimized over the 80% training data (3 folds for training, 1 fold for testing)." }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-57", "text": "3 Numbers in italic are classification results reported in Ott et al. (2011) and Mihalcea and Strapparava (2009)." }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-58", "text": "yielding 14% error reduction over the word-only features."
}, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-59", "text": "Given the power of word-based features, one might wonder whether the PCFG-driven features are useful only due to their lexical production rules." }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-60", "text": "To address such doubts, we include experiments with unlexicalized rules, r and r\u0302." }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-61", "text": "These features achieve 78.5% and 74.8% accuracy respectively, which are significantly higher than that of a random baseline (\u223c50.0%), confirming statistical differences in deep syntactic structures." }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-62", "text": "See Section 4.4 for concrete example rules." }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-63", "text": "Another question one might have is whether the performance gain of PCFG features is mostly from local sequences of POS tags, indirectly encoded in the production rules." }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-64", "text": "Comparing the performance of [shallow syntax+words] and [deep syntax+words] in Table 2 , we find statistical evidence that deep syntax based features offer information that is not available in simple POS sequences." }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-65", "text": "----------------------------------" }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-66", "text": "**TRIPADVISOR-HEURISTIC & YELP**" }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-67", "text": "The performance is generally lower than that of the previous dataset, due to the noisy nature of these datasets." }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-68", "text": "Nevertheless, we find similar trends as those seen in the TripAdvisor-Gold dataset, with respect to the relative performance differences across different approaches."
}, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-69", "text": "The significance of these results comes from the fact that these two datasets consist of real (fake) reviews in the wild, rather than manufactured ones that might invite unwanted signals that can unexpectedly help with classification accuracy." }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-70", "text": "In sum, these results indicate the existence of the statistical signals hidden in deep syntax even in real product reviews with noisy gold standards." }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-71", "text": "----------------------------------" }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-72", "text": "**ESSAY**" }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-73", "text": "Finally, the last dataset in Table 2 , Essay, confirms the same trends again: the deep syntactic features consistently improve the performance over several baselines based only on shallow lexico-syntactic features." }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-74", "text": "The final results, reaching accuracy as high as 85%, substantially outperform what has been previously reported in Mihalcea and Strapparava (2009) ." }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-75", "text": "How robust are the syntactic cues in the cross topic setting?" }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-76", "text": "Table 4 compares the results of Mihalcea and Strapparava (2009) and ours, demonstrating that syntactic features achieve substantially, and surprisingly, more robust results."
}, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-77", "text": "----------------------------------" }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-78", "text": "**DISCRIMINATIVE PRODUCTION RULES**" }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-79", "text": "To give more concrete insights, we provide 10 most discriminative unlexicalized production rules (augmented with the grandparent node) for each class in Table 1 ." }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-80", "text": "We order the rules based on the feature weights assigned by the LIBLINEAR classifier." }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-81", "text": "Notice that the two production rules in bold -[SBAR\u02c6NP \u2192 S] and [NP VP \u2192 NP SBAR] -are parts of the parse tree shown in Figure 1 , whose sentence is taken from an actual fake review." }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-82", "text": "Table 3 shows the most discriminative phrasal tags in the PCFG parse trees for each class." }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-83", "text": "Interestingly, we find more frequent use of VP, SBAR (clause introduced by subordinating conjunction), and WHADVP in deceptive reviews than truthful reviews." }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-84", "text": "----------------------------------" }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-85", "text": "**RELATED WORK**" }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-86", "text": "Much of the previous work for detecting deceptive product reviews focused on related, but slightly different problems, e.g., detecting duplicate reviews or review spam (e.g., Jindal and Liu (2008), Mukherjee et al. (2011)) due to the notable difficulty in obtaining gold standard labels."
}, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-87", "text": "4 The Yelp data we explored in this work shares a similar spirit in that gold standard labels are harvested from existing metadata, which are not guaranteed to align well with the true hidden labels of deceptive vs. truthful reviews." }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-88", "text": "Two previous works obtained more precise gold standard labels by hiring Amazon turkers to write deceptive articles (e.g., Mihalcea and Strapparava (2009), Ott et al. (2011) ), both of which have been examined in this study with respect to their syntactic characteristics." }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-89", "text": "Although we are not aware of any prior work that dealt with syntactic cues in deceptive writing directly, prior work on hedge detection (e.g., Greene and Resnik (2009), Li et al. (2010) ) relates to our findings." }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-90", "text": "----------------------------------" }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-91", "text": "**CONCLUSION**" }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-92", "text": "We investigated syntactic stylometry for deception detection, adding a somewhat unconventional angle to previous studies." }, { "sent_id": "520437f612e678dcd4ec9c043cf701-C001-93", "text": "Experimental results consistently find statistical evidence of deep syntactic patterns that are helpful in discriminating deceptive writing."
} ], "y": { "@DIF@": { "gold_contexts": [ [ "520437f612e678dcd4ec9c043cf701-C001-5" ], [ "520437f612e678dcd4ec9c043cf701-C001-53", "520437f612e678dcd4ec9c043cf701-C001-54", "520437f612e678dcd4ec9c043cf701-C001-55" ] ], "cite_sentences": [ "520437f612e678dcd4ec9c043cf701-C001-5", "520437f612e678dcd4ec9c043cf701-C001-54" ] }, "@BACK@": { "gold_contexts": [ [ "520437f612e678dcd4ec9c043cf701-C001-9" ], [ "520437f612e678dcd4ec9c043cf701-C001-11" ], [ "520437f612e678dcd4ec9c043cf701-C001-35" ], [ "520437f612e678dcd4ec9c043cf701-C001-38" ], [ "520437f612e678dcd4ec9c043cf701-C001-54" ], [ "520437f612e678dcd4ec9c043cf701-C001-57" ], [ "520437f612e678dcd4ec9c043cf701-C001-88" ] ], "cite_sentences": [ "520437f612e678dcd4ec9c043cf701-C001-9", "520437f612e678dcd4ec9c043cf701-C001-11", "520437f612e678dcd4ec9c043cf701-C001-35", "520437f612e678dcd4ec9c043cf701-C001-38", "520437f612e678dcd4ec9c043cf701-C001-54", "520437f612e678dcd4ec9c043cf701-C001-57", "520437f612e678dcd4ec9c043cf701-C001-88" ] }, "@USE@": { "gold_contexts": [ [ "520437f612e678dcd4ec9c043cf701-C001-15" ], [ "520437f612e678dcd4ec9c043cf701-C001-19", "520437f612e678dcd4ec9c043cf701-C001-20" ], [ "520437f612e678dcd4ec9c043cf701-C001-88" ] ], "cite_sentences": [ "520437f612e678dcd4ec9c043cf701-C001-15", "520437f612e678dcd4ec9c043cf701-C001-20", "520437f612e678dcd4ec9c043cf701-C001-88" ] }, "@MOT@": { "gold_contexts": [ [ "520437f612e678dcd4ec9c043cf701-C001-38", "520437f612e678dcd4ec9c043cf701-C001-39" ] ], "cite_sentences": [ "520437f612e678dcd4ec9c043cf701-C001-38" ] }, "@SIM@": { "gold_contexts": [ [ "520437f612e678dcd4ec9c043cf701-C001-53", "520437f612e678dcd4ec9c043cf701-C001-54" ] ], "cite_sentences": [ "520437f612e678dcd4ec9c043cf701-C001-54" ] } } }, "ABC_425148e63eb84bba50326e362cc5b8_9": { "x": [ { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-2", "text": "Background: Unstructured and textual data is increasing rapidly and 
Latent Dirichlet Allocation (LDA) topic modeling is a popular data analysis method for it." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-3", "text": "Past work suggests that instability of LDA topics may lead to systematic errors." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-4", "text": "Aim: We propose a method that relies on replicated LDA runs, clustering, and providing a stability metric for the topics." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-5", "text": "Method: We generate k LDA topics and replicate this process n times resulting in n*k topics." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-6", "text": "Then we use K-medoids to cluster the n*k topics to k clusters." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-7", "text": "The k clusters now represent the original LDA topics and we present them like normal LDA topics, showing the ten most probable words." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-8", "text": "For the clusters, we try multiple stability metrics, out of which we recommend Rank-Biased Overlap, showing the stability of the topics inside the clusters." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-9", "text": "Results: We provide an initial validation where our method is used for 270,000 Mozilla Firefox commit messages with k=20 and n=20." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-10", "text": "We show how our topic stability metrics are related to the contents of the topics." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-11", "text": "Conclusions: Advances in text mining enable us to analyze large masses of text in software engineering but non-deterministic algorithms, such as LDA, may lead to unreplicable conclusions." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-12", "text": "Our approach makes LDA stability transparent and is also complementary rather than an alternative to many prior works that focus on LDA parameter tuning."
}, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-13", "text": "----------------------------------" }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-14", "text": "**INTRODUCTION**" }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-15", "text": "Latent Dirichlet Allocation (LDA) is a topic modeling technique for textual data [5] that is widely applied in software engineering [1-4, 6, 10, 11, 14-16, 19, 24, 25] for different tasks such as requirements engineering [15] , software architecture [10] , source code analysis [9] , defect reports [16] , testing [14] and bibliometric analysis of software engineering literature [11, 22] ." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-16", "text": "A survey on topic modelling in software engineering has been conducted [24] and a book titled \"The art and science of analyzing software data\" [4] devoted a chapter to LDA analysis [6] ." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-17", "text": "Many sources give methodological guidance on how to apply LDA topic modeling in software engineering [1, 3, 19] ." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-18", "text": "Given all this, we think it is fair to say that LDA topic modelling is a relevant data analysis technique in empirical software engineering research." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-19", "text": "The quality of the resulting topic model can be evaluated with multiple metrics, some inspired by mathematics, such as the posterior probability of the topic model given the data [13] , the perplexity measure on the test data [13] , or the Silhouette coefficient of the resulting topics [19] ." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-20", "text": "Other target metrics are based on empirical observations such as coherence, which measures topic model quality using word co-occurrences in publicly available texts [23] , or stability which investigates similarity of topics between different runs [1] ."
}, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-21", "text": "Recently, Agrawal et al. [1] published a paper titled \"What is wrong with topic modeling? And how to fix it using search-based software engineering\", where they claimed that the instability of topics is one major shortcoming of this technique." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-22", "text": "Indeed, studies could result in wrong conclusions if the results are based on unstable topics." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-23", "text": "They proposed using a differential evolution search algorithm to find the input parameters which maximize the topic model stability measured as the similarity of topics between multiple runs." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-24", "text": "This method reduces instability by finding optimal input parameter settings, but only uses the result of one LDA run which can still have some unstable topics." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-25", "text": "In this paper, we address the stability of topic models, but rather than optimizing input parameters we propose making stability (or instability) transparent to the user." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-26", "text": "We achieve this by performing replicated runs of LDA topic modeling and clustering the results." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-27", "text": "Subsequently, we present the clustering results as any topic modeling results by adding an additional metric of stability." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-28", "text": "Our method is not an alternative to the ones presented by Agrawal et al. [1] but is additive." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-29", "text": "Thus, we may use both methods at the same time."
}, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-30", "text": "However, a benefit of our method is that the topic models may also be optimized towards other targets than stability." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-31", "text": "For example, a user may choose to optimize the topic model input parameters for coherence [23] or perplexity and still use our approach in the end to provide information about topic stability." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-32", "text": "This paper is structured as follows." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-33", "text": "In Section 2, we first present LDA in more detail and then our method for making the stability transparent." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-34", "text": "In Section 3 we demonstrate our results, while Section 4 provides conclusions and discusses future improvements." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-35", "text": "----------------------------------" }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-36", "text": "**METHOD 2.1 LDA TOPIC MODELLING**" }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-37", "text": "LDA (Latent Dirichlet Allocation) is a soft clustering algorithm that is ideal for text [5] but also for other purposes such as genetics [21] where a relationship between a gene and a genotype can be considered similar to a relationship between a word and a document." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-38", "text": "Given a set of documents, LDA models which topics this set of documents may have been created from." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-39", "text": "As opposed to hard clustering where each document would be assigned to a single topic only, LDA soft clustering assigns each document a list of topics and probabilities for the topics."
}, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-40", "text": "A topic in LDA is a collection of words and their probability estimates for each topic." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-41", "text": "In order to summarize, after running LDA we have the following." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-42", "text": "\u2022 For all documents m there is a vector \u03b8 which is the topic distribution for that document." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-43", "text": "\u2022 For all topics k there is a vector \u03d5 which is the word distribution for that topic." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-44", "text": "Before topic generation, LDA requires that we set the input parameters such as the number of topics k, and hyper priors \u03b1 and \u03b2." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-45", "text": "Past work in software engineering has used different techniques to find optimal input parameters such as genetic algorithms [19] or differential evolution [1] ." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-46", "text": "As pointed out in Section 1, what is optimal can be measured with many metrics such as perplexity [13] , stability [1] , or coherence [23] ." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-47", "text": "The stability of a topic model can be defined as the model's ability to replicate its solutions [8] ." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-48", "text": "Instability (the lack of stability) is caused by the non-deterministic nature of Monte-Carlo simulation that is part of the LDA algorithm [1] ." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-49", "text": "Past work has shown different stability measures and how to optimize the input parameters to provide a stable topic model [1, 8, 12] ." 
}, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-50", "text": "We think using the results of a single LDA run, whether optimized for stability or not, is dangerous as perfect stability is impossible to reach." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-51", "text": "The next section shows a method that can be used to make more informed decisions." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-52", "text": "----------------------------------" }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-53", "text": "**TRANSPARENT STABILITY**" }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-54", "text": "To make LDA topic stability transparent, we suggest performing replicated LDA runs, clustering the topics, and giving a measure of stability." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-55", "text": "The R code of our approach is available 1 ." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-56", "text": "Section 2.2.1 describes the approach, Section 2.2.2 explains how to show the clusters, and finally Section 2.2.3 presents different stability measures." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-57", "text": "2.2.1 Clustering LDA topics." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-58", "text": "As previously described, an LDA topic is a list of words with the probabilities of each word appearing in that topic." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-59", "text": "When we cluster replicated LDA runs we have n replicated runs, and each run contains k number of topics." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-60", "text": "Therefore the total number of topics is t = nk." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-61", "text": "Our word list is represented by w where \u03d5 is the vector of word distribution for each topic."
}, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-62", "text": "Thus, we have a topic-word matrix T with dimensions t \u00d7 w that we want to cluster back into k clusters, as k was the number of topics in our LDA setting." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-63", "text": "We take advantage of the word embeddings produced by GloVe [20] , whereby our entire word list w, which typically has thousands of words, is converted to a word vector space with typically 200-400 elements." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-64", "text": "It has been shown that in this word vector space, semantically similar words appear close to one another [20] and we have previously used it for searching software-engineering-specific synonyms [17] ." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-65", "text": "Thus, we form a word vector space with w words and their vectors as a matrix V with dimensions (w \u00d7 e), where e is the embedding dimension." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-66", "text": "Then we convert our topic-word matrix T (t \u00d7 w) to a topic-vector matrix W (t \u00d7 e) via the matrix multiplication T (t \u00d7 w)V (w \u00d7 e)." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-67", "text": "An additional benefit is that the W (t \u00d7 e) matrix is much smaller than the T (t \u00d7 w) matrix, resulting in faster clustering." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-68", "text": "Finally, we use K-medoids clustering to cluster our topics W (t \u00d7 e) into k clusters." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-69", "text": "----------------------------------" }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-70", "text": "**2.2.2**" }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-71", "text": "Showing clustered LDA topics." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-72", "text": "We want to deviate as little as possible from standard LDA topic modeling when presenting the results." 
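The pipeline of Section 2.2.1 (project the topic-word matrix into the embedding space, then cluster with K-medoids) can be sketched as follows. This is a minimal illustrative Python sketch with toy matrices; the paper's actual implementation is in R, and the naive K-medoids below merely stands in for a proper PAM implementation.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def dist(u, v):
    """Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def k_medoids(points, k, iters=20):
    """Naive K-medoids: alternate assignment and medoid update."""
    medoids = points[:k]  # simple deterministic initialization
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: dist(p, medoids[i]))].append(p)
        new_medoids = []
        for c, m in zip(clusters, medoids):
            if not c:
                new_medoids.append(m)
                continue
            # The medoid is the member minimizing total distance to its cluster.
            new_medoids.append(min(c, key=lambda cand: sum(dist(cand, q) for q in c)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, clusters

# Toy data: 4 topics (2 replicated runs x 2 topics) over 3 words,
# with 2-dimensional word embeddings.
T = [[0.9, 0.1, 0.0], [0.0, 0.1, 0.9], [0.8, 0.2, 0.0], [0.1, 0.0, 0.9]]
V = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
W = matmul(T, V)  # topic-vector matrix (t x e), smaller than (t x w)
medoids, clusters = k_medoids(W, k=2)
```

On this toy input the two topics about the first word and the two topics about the third word land in separate clusters, mirroring how replicated topics about the same content should group together.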
}, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-73", "text": "We form the list of top ten words for each cluster, since in LDA each topic is typically represented by its top ten words." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-74", "text": "To compute the top ten words for a cluster that has multiple topics, we sum up the word distributions \u03d5 for all topics in that cluster; the top ten words are the ones with the highest sums." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-75", "text": "----------------------------------" }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-76", "text": "**2.2.3**" }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-77", "text": "Topic stability measure." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-78", "text": "At this point, our results would appear like any LDA topic model to the user." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-79", "text": "However, as we want to give the user transparency into topic stability, we need to add a measure describing topic stability." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-80", "text": "Obviously, the user can investigate each cluster in detail, but the topic stability measure can help the user focus on specific clusters." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-81", "text": "We propose several measures of topic stability, i.e. whether a set of topics is actually about the same content." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-82", "text": "When two topics contain the same top 10 words in the same order, we can consider them to be about exactly the same content, and they should receive the maximum score." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-83", "text": "On the other hand, any deviations from this should result in a lower score." 
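The cluster-summary computation of Section 2.2.2 amounts to summing the per-topic word distributions within a cluster and ranking the words. A minimal Python sketch (the vocabulary and \u03d5 values below are hypothetical; the paper's code is in R):

```python
def top_words(cluster_phis, vocab, n=10):
    """Sum the word distributions (phi) of all topics in a cluster and
    return the n words with the highest summed probability."""
    totals = [sum(phi[i] for phi in cluster_phis) for i in range(len(vocab))]
    ranked = sorted(zip(vocab, totals), key=lambda wt: -wt[1])
    return [w for w, _ in ranked[:n]]

# Toy cluster with two replicated topics over a four-word vocabulary.
vocab = ["fix", "build", "test", "revert"]
cluster = [[0.1, 0.5, 0.1, 0.3],   # phi of the topic from run 1
           [0.2, 0.4, 0.1, 0.3]]   # phi of the topic from run 2
summary = top_words(cluster, vocab, n=2)  # -> ["build", "revert"]
```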
}, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-84", "text": "First, Silhouette is a well-established measure for cluster validation that considers both how similar each object is to its own cluster (cohesion) and how different it is from other clusters (separation)." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-85", "text": "It has been used in LDA optimization before [18, 19] ." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-86", "text": "The average silhouette is produced by the K-medoids clustering performed earlier." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-87", "text": "However, cluster separation is not interesting for the user, as the user mainly cares about whether a particular cluster has similar elements, i.e. high stability." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-88", "text": "Furthermore, this measure is based on the absolute values of word probabilities rather than the ranks that are presented to the user." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-89", "text": "Second, to model whether the same top words are present and whether they are in the same order, we can use Spearman rank correlation between the top words of any two topics." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-90", "text": "Any words that are present in the top word list of one topic, but not the other, are assigned the lowest rank in the other topic." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-91", "text": "A problem occurs if two topics have the same words but in reverse order: the rank correlation between the topics would be -1, while one would still consider these two topics somewhat similar due to the same top words." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-92", "text": "Another anomaly is that for two topics with no intersecting top ten words, we would get a better Spearman correlation value than -1 (-0.86)." 
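The Spearman variant above, and both anomalies it suffers from, can be reproduced with a short sketch. We assume here (an assumption on our part, not stated explicitly above) that words missing from the other topic's list are all tied at the bottom and receive the standard average tie rank; under that assumption the reverse-order case gives -1 and fully disjoint top-ten lists give about -0.86, as noted above.

```python
def spearman_top_words(top_a, top_b):
    """Spearman correlation over the union of two top-word lists; words
    absent from a list share the (average) lowest rank in that list."""
    union = sorted(set(top_a) | set(top_b))
    def rank_vector(words):
        tie = (len(words) + 1 + len(union)) / 2  # average rank of absent words
        pos = {w: i + 1 for i, w in enumerate(words)}
        return [pos.get(w, tie) for w in union]
    ra, rb = rank_vector(top_a), rank_vector(top_b)
    n = len(union)
    ma, mb = sum(ra) / n, sum(rb) / n
    # Pearson correlation of the two rank vectors.
    cov = sum((a - ma) * (b - mb) for a, b in zip(ra, rb))
    sa = sum((a - ma) ** 2 for a in ra) ** 0.5
    sb = sum((b - mb) ** 2 for b in rb) ** 0.5
    return cov / (sa * sb)

words = ["w%d" % i for i in range(10)]
reverse_case = spearman_top_words(words, words[::-1])                      # -1.0
disjoint_case = spearman_top_words(words, ["v%d" % i for i in range(10)])  # about -0.86
```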
}, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-93", "text": "Third, we can measure Jaccard similarity between the top words of any two topics." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-94", "text": "Extended Jaccard measures have been used in LDA stability task optimization [1, 12] ." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-95", "text": "When two topics have all the same top words, the Jaccard similarity would be 1." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-96", "text": "On the other hand, the worst case (when all the top words are different) would result in a Jaccard similarity of 0." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-97", "text": "An undesirable property of the Jaccard similarity is that any variations in ordering are not reflected in the measure." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-98", "text": "Obviously, a measure of topic stability should take into account both the difference in word intersection and the difference in rank between two topics." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-99", "text": "Luckily, a paper published in 2010 presented such a measure, known as rank-biased overlap (RBO) [26] , which seems ideal for LDA topic comparisons." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-100", "text": "We use the extrapolated version of RBO from the Bioconductor R-package gespeR 2 , which is computed as follows:" }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-101", "text": "RBO(T 1 , T 2 , p, d) = (X d / d) \u00b7 p^d + ((1 - p) / p) \u00b7 \u03a3 i=1..d (X i / i) \u00b7 p^i , where T 1 and T 2 are two ranked lists, in our case two topics represented by their top words; d is the evaluation depth, in our case 10 as we wish to compare the top ten words; and X i is the size of the intersection of T 1 and T 2 at depth i. RBO ranges between 0 and 1. RBO is zero when none of the top words are the same and one when all top words are the same and in the same order." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-102", "text": "The effect of order, i.e. top-weightedness, is controlled by p. When p is 1, the order has no effect and only the intersection is considered." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-103", "text": "A smaller p gives more weight to the order of words." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-104", "text": "We set p to 0.9, as this value is suggested by the RBO authors [26] and it seems to offer a good balance of the impact of the different ranks of the top ten words." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-105", "text": "For an illustration, let us consider two topics with the same top ten words but in reverse order." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-106", "text": "This pair would result in the extreme values of Spearman correlation (-1) and Jaccard similarity (1), while RBO (p=0.9) has a value close to the middle of its 0 to 1 range (0.51)." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-107", "text": "----------------------------------" }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-108", "text": "**RESULTS**" }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-109", "text": "----------------------------------" }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-110", "text": "**DATA, PARAMETER TUNING, AND LDA RUNS**" }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-111", "text": "We demonstrate our approach on 271,236 commit messages from Mozilla Firefox." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-112", "text": "We did some minimal preprocessing by excluding the words appearing fewer than ten times and the ones that appeared in 30% or more of the documents." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-113", "text": "Additionally, we removed common stopwords and individuals' names, which appeared because Firefox commit messages are also used for assigning reviewers." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-114", "text": "We consider this preprocessing very conservative." 
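Before moving to the results, the extrapolated RBO measure of Section 2.2.3 can be sketched in a few lines of Python (an illustrative re-implementation, not the gespeR code). With p = 0.9 it reproduces the values discussed earlier: 1 for identical lists, 0 for disjoint lists, and about 0.51 for the same ten words in reverse order.

```python
def rbo_ext(T1, T2, p=0.9):
    """Extrapolated rank-biased overlap of two equal-length ranked lists,
    evaluated at depth d = len(T1)."""
    d = len(T1)
    seen1, seen2 = set(), set()
    weighted_sum = 0.0
    for i in range(1, d + 1):
        seen1.add(T1[i - 1])
        seen2.add(T2[i - 1])
        x_i = len(seen1 & seen2)          # overlap at depth i
        weighted_sum += (x_i / i) * p ** i
    x_d = len(seen1 & seen2)              # overlap at full depth d
    return (x_d / d) * p ** d + ((1 - p) / p) * weighted_sum

top = list("abcdefghij")
identical = rbo_ext(top, top)                 # 1.0
reversed_order = rbo_ext(top, top[::-1])      # about 0.51
disjoint = rbo_ext(top, list("0123456789"))   # 0.0
```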
}, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-115", "text": "We used a fixed number of topics (k = 20) and the differential evolution algorithm (population=20, CR = 0.5, F = 0.8, iterations = 10) from the R-package DEoptim to optimize the hyper parameters \u03b1 (0.167) and \u03b2 (0.076) with the target of minimizing perplexity in the holdout set." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-116", "text": "As previously stated, our approach is not affected by hyper parameter tuning, so future work may use any training targets or algorithms they consider appropriate." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-117", "text": "We performed 20 LDA run replications with the R-package text2vec (1,000 iterations)." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-118", "text": "This package includes a very efficient LDA implementation (WarpLDA [7] ), and training a single model with our data takes less than 2 minutes on a laptop computer." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-119", "text": "Table 1 shows the best, worst, and median (eleventh) clusters in terms of LDA topic stability out of the 20 clusters ranked with the RBO stability metric." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-120", "text": "All four stability metrics are highly correlated with each other in our data, and the Pearson correlations range between 0.91 and 0.98." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-121", "text": "To demonstrate the details, Tables 2 to 4 show five topics from the best, worst, and median clusters with their five top words." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-122", "text": "All computations of Table 1 were performed with all topics and the top 10 words, but Tables 2 to 4 only show a smaller sample to keep the paper within the four-page limit." 
}, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-123", "text": "Table 1 shows that for the cluster with the best topic stability, all topics have all the words in nearly the same order, as the average Spearman rank correlation is very close to 0.95." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-124", "text": "The average Jaccard similarity between topics is only 0.813; however, we need to remember that if two topics differ by one of the top ten words, this already results in a Jaccard similarity of 0.818 (9/11)." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-125", "text": "Manual inspection of the details in Table 2 shows that for the five topics shown, the top 5 words appear in the same order." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-126", "text": "We can further confirm that this is true for all topics in this cluster." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-127", "text": "Deviations in word rank and occurrence exist in positions 6 to 10." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-128", "text": "This topic is about commits that revert (back out) previous changes." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-129", "text": "----------------------------------" }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-130", "text": "**STABLE AND UNSTABLE TOPICS**" }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-131", "text": "As expected, the cluster with median stability has a lower topic stability than the best cluster in all stability metrics." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-132", "text": "We can also notice that the number of topics in this cluster is higher than 20, i.e. the number of replicated runs we performed." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-133", "text": "This means that from a single LDA run, more than one topic is part of this cluster." 
}, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-134", "text": "Our K-medoids clustering took all 400 (20*20) topics as input, and we did not try to force it to pick one topic from each LDA run for each cluster." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-135", "text": "This is something we may want to investigate in the future." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-136", "text": "Table 3 shows that this cluster is about build file usage or updates." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-137", "text": "Table 4 shows a detailed sample of the worst cluster, and we see variations in the word order and occurrence." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-138", "text": "It appears that this topic is about updates, additions, fixes and changes." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-139", "text": "Since all these are very common words in a version control context, it is hard to make a meaningful interpretation of the topics in this cluster." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-140", "text": "----------------------------------" }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-141", "text": "**CONCLUSIONS AND FUTURE WORK**" }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-142", "text": "Past work in software engineering [1] and machine learning [12] points out that LDA instability may lead to incorrect conclusions and proposes input parameter optimization to alleviate the problem." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-143", "text": "This paper suggests performing replicated runs, clustering the results, and measuring the topic stability." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-144", "text": "These approaches are not alternatives but are additive." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-145", "text": "Our approach can be combined with any LDA optimization technique that relies on input parameter optimization." 
}, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-146", "text": "Finally, our approach makes topic stability visible by providing a metric of topic stability and allowing further investigation of the clusters when desired." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-147", "text": "This paper presents multiple metrics of topic stability in Table 1 that are highly correlated with each other in our data set." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-148", "text": "Based on theoretical metric properties (see Section 2.2.3) we recommend using RBO [26] ." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-149", "text": "In the future, one should empirically establish which p value setting of the RBO metric most accurately matches user expectations of topic stability, as in this paper we only used the default (p = 0.9)." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-150", "text": "We should also study how the topic clusters can be used in downstream NLP tasks in software engineering." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-151", "text": "Furthermore, to demonstrate our idea we also made other design choices but did not investigate their impact." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-152", "text": "For example, we considered only 20 replication runs, which might be too few, and we only generated 20 LDA topics for each run." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-153", "text": "We also clustered our topics in the word vector space produced by GloVe, as prior work suggested it would produce better results than clustering in word space." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-154", "text": "All these choices could be challenged." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-155", "text": "Zeller's 2018 ICSE talk [27] has warned us about the dangers of adding complexity." 
}, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-156", "text": "Our approach adds complexity but eventually hides it behind an RBO metric showing the stability of each topic cluster." }, { "sent_id": "425148e63eb84bba50326e362cc5b8-C001-157", "text": "If the stability of topics is an issue, users of topic models should be made aware of it, but with minimal added complexity, as we have tried to do in this paper." } ], "y": { "@BACK@": { "gold_contexts": [ [ "425148e63eb84bba50326e362cc5b8-C001-15" ], [ "425148e63eb84bba50326e362cc5b8-C001-17" ], [ "425148e63eb84bba50326e362cc5b8-C001-20" ], [ "425148e63eb84bba50326e362cc5b8-C001-21" ], [ "425148e63eb84bba50326e362cc5b8-C001-23" ], [ "425148e63eb84bba50326e362cc5b8-C001-45" ], [ "425148e63eb84bba50326e362cc5b8-C001-46" ], [ "425148e63eb84bba50326e362cc5b8-C001-48" ], [ "425148e63eb84bba50326e362cc5b8-C001-49" ], [ "425148e63eb84bba50326e362cc5b8-C001-94" ], [ "425148e63eb84bba50326e362cc5b8-C001-142" ] ], "cite_sentences": [ "425148e63eb84bba50326e362cc5b8-C001-15", "425148e63eb84bba50326e362cc5b8-C001-17", "425148e63eb84bba50326e362cc5b8-C001-20", "425148e63eb84bba50326e362cc5b8-C001-21", "425148e63eb84bba50326e362cc5b8-C001-23", "425148e63eb84bba50326e362cc5b8-C001-45", "425148e63eb84bba50326e362cc5b8-C001-46", "425148e63eb84bba50326e362cc5b8-C001-48", "425148e63eb84bba50326e362cc5b8-C001-49", "425148e63eb84bba50326e362cc5b8-C001-94", "425148e63eb84bba50326e362cc5b8-C001-142" ] }, "@UNSURE@": { "gold_contexts": [ [ "425148e63eb84bba50326e362cc5b8-C001-28" ] ], "cite_sentences": [ "425148e63eb84bba50326e362cc5b8-C001-28" ] } } }, "ABC_04f6b9d4296dee4bbf965f9911bf98_9": { "x": [ { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-2", "text": "Transformer is a powerful architecture that achieves superior performance on various sequence learning tasks, including neural machine translation, language understanding, and sequence prediction." 
}, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-3", "text": "At the core of the Transformer is the attention mechanism, which concurrently processes all inputs in the streams." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-4", "text": "In this paper, we present a new formulation of attention via the lens of the kernel." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-5", "text": "To be more precise, we realize that the attention can be seen as applying a kernel smoother over the inputs, with the kernel scores being the similarities between inputs." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-6", "text": "This new formulation gives us a better way to understand individual components of the Transformer's attention, such as a better way to integrate the positional embedding." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-7", "text": "Another important advantage of our kernel-based formulation is that it paves the way to a larger space of composing Transformer's attention." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-8", "text": "As an example, we propose a new variant of Transformer's attention which models the input as a product of symmetric kernels." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-9", "text": "This approach achieves performance competitive with the current state-of-the-art model with less computation." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-10", "text": "In our experiments, we empirically study different kernel construction strategies on two widely used tasks: neural machine translation and sequence prediction." 
}, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-11", "text": "----------------------------------" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-12", "text": "**INTRODUCTION**" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-13", "text": "Transformer (Vaswani et al., 2017 ) is a relatively new architecture that outperforms traditional deep learning models such as Recurrent Neural Networks (RNNs) (Sutskever et al., 2014) and Temporal Convolutional Networks (TCNs) (Bai et al., 2018) for sequence modeling tasks across neural machine translation (Vaswani et al., 2017) , language understanding (Devlin et al., 2018) , sequence prediction (Dai et al., 2019) , image generation (Child et al., 2019) , video activity classification (Wang et al., 2018) , music generation (Huang et al., 2018a) , and multimodal sentiment analysis (Tsai et al., 2019a) ." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-14", "text": "Instead of performing recurrence (e.g., RNN) or convolution (e.g., TCN) over the sequences, Transformer is a feed-forward model that concurrently processes the entire sequence." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-15", "text": "At the core of the Transformer is its attention mechanism, which is proposed to integrate the dependencies between the inputs." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-29", "text": "Among all the components, we argue that the most important one is the construction of the kernel function." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-16", "text": "There are up to three types of attention within the full Transformer model, as exemplified by the neural machine translation application (Vaswani et al., 2017) : 1) Encoder self-attention considers the source sentence as input, generating a sequence of encoded representations, where each encoded token has a global dependency with other tokens in the input sequence." 
}, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-17", "text": "2) Decoder self-attention considers the target sentence (e.g., predicted target sequence for translation) as input, generating a sequence of decoded representations 1 , where each decoded token depends on previous decoded tokens." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-18", "text": "3) Decoder-encoder attention considers both encoded and decoded sequences, generating a sequence with the same length as the decoded sequence." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-19", "text": "It should be noted that some applications have only the decoder self-attention, such as sequence prediction (Dai et al., 2019) ." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-20", "text": "In all cases, the Transformer's attentions follow the same general mechanism." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-21", "text": "At the high level, the attention can be seen as a weighted combination of the input sequence, where the weights are determined by the similarities between elements of the input sequence." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-22", "text": "We note that this operation is order-agnostic with respect to permutations of the input sequence (order is encoded with an extra positional embedding (Vaswani et al., 2017; Shaw et al., 2018; Dai et al., 2019) )." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-23", "text": "The above observation inspires us to connect Transformer's attention to kernel learning (Scholkopf and Smola, 2001) : they both concurrently and order-agnostically process all inputs by calculating the similarity between the inputs." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-24", "text": "Therefore, in this paper, we present a new formulation for Transformer's attention via the lens of kernel." 
}, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-25", "text": "To be more precise, the new formulation can be interpreted as a kernel smoother (Wasserman, 2006) over the inputs in a sequence, where the kernel measures how similar two different inputs are." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-26", "text": "The main advantage of connecting attention to kernel is that it opens up a new family of attention mechanisms that can relate to the well-established literature in kernel learning (Scholkopf and Smola, 2001) ." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-27", "text": "As a result, we develop a new variant of attention which simply considers a product of symmetric kernels when modeling non-positional and positional embedding." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-28", "text": "Furthermore, our proposed formulation naturally highlights the main components of Transformer's attention, enabling a better understanding of this mechanism: recent variants of Transformers (Shaw et al., 2018; Huang et al., 2018b; Dai et al., 2019; Child et al., 2019; Wang et al., 2018; Tsai et al., 2019a) can be expressed through these individual components." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-30", "text": "We empirically study multiple kernel forms and the ways to integrate positional embedding in neural machine translation (NMT) using the IWSLT'14 German-English (De-En) dataset (Edunov et al., 2017) and sequence prediction (SP) using the WikiText-103 dataset (Merity et al., 2016) ." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-31", "text": "----------------------------------" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-32", "text": "**ATTENTION**" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-33", "text": "This section aims at providing an understanding of attention in Transformer via the lens of kernel." 
}, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-34", "text": "The inspiration for connecting the kernel (Scholkopf and Smola, 2001 ) and attention stems from the observation that both operations concurrently process all inputs and calculate the similarity between the inputs." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-35", "text": "We first introduce the background (i.e., the original formulation) of attention and then provide a new reformulation within the class of kernel smoothers (Wasserman, 2006) ." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-36", "text": "Next, we show that this new formulation allows us to explore a new family of attention while at the same time offering a framework to categorize previous attention variants (Vaswani et al., 2017; Shaw et al., 2018; Huang et al., 2018b; Dai et al., 2019; Child et al., 2019; Wang et al., 2018; Tsai et al., 2019a) ." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-37", "text": "Last, we present a new form of attention, which requires fewer parameters and empirically reaches competitive performance as the state-of-the-art models." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-38", "text": "For notation, we use lowercase representing a vector (e.g., x), bold lowercase representing a matrix (e.g., x), calligraphy letter denoting a space (e.g., X ), and S denoting a set." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-39", "text": "To relate the notations in sequence to sequence learning (Vaswani et al., 2017 ), x represents a specific element of a sequence, x = [x 1 , x 2 , \u22ef, x T ] denotes a sequence of features, S x = {x 1 , x 2 , \u22ef, x T } represents the set with its elements being the features in sequence x, and we refer to the space of the set S x as S." 
}, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-40", "text": "----------------------------------" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-41", "text": "**TECHNICAL BACKGROUND**" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-42", "text": "Unlike recurrent computation (Sutskever et al., 2014 ) (i.e., RNNs) and temporal convolutional computation (Bai et al., 2018 ) (i.e., TCNs), Transformer's attention is an order-agnostic operation given the order in the inputs (Vaswani et al., 2017) ." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-43", "text": "Hence, in the presentation of the paper, we consider the inputs as a set instead of a sequence." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-44", "text": "When viewing a sequence as a set, we lose the temporal (positional) information in the inputs, which is often crucial for sequence modeling (Sutskever et al., 2014) ." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-45", "text": "As a result, Transformer (Vaswani et al., 2017) introduced positional embedding to indicate the positional relation for the inputs." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-46", "text": "Formally, a sequence x = [x 1 , x 2 , \u22ef, x T ] defines each element as x i = (f i , t i ) with f i \u2208 F being the non-temporal feature at time i and t i \u2208 T being a temporal feature (or, as we call it, the positional embedding)." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-47", "text": "Note that f i can be the word representation (in neural machine translation (Vaswani et al., 2017) ), a pixel in a frame (in video activity recognition (Wang et al., 2018) ), or a music unit (in music generation (Huang et al., 2018b) )." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-48", "text": "t i can be a mixture of sine and cosine functions (Vaswani et al., 2017) or parameters that can be learned during back-propagation (Dai et al., 2019; Ott et al., 2019) ." 
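As an illustration of the first option for t i, the sinusoidal positional embedding of Vaswani et al. (2017) can be sketched as follows (a minimal Python sketch; each pair of dimensions shares a geometrically increasing wavelength):

```python
import math

def positional_embedding(pos, d_model):
    """Sinusoidal positional embedding (Vaswani et al., 2017): dimension
    2i uses sin, dimension 2i+1 uses cos, with wavelength 10000^(2i/d_model)."""
    t = []
    for i in range(d_model):
        angle = pos / (10000 ** (2 * (i // 2) / d_model))
        t.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return t

t0 = positional_embedding(0, 8)  # position 0 -> [0.0, 1.0, 0.0, 1.0, ...]
```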
}, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-49", "text": "The feature vectors are defined over the joint space X \u2236= (F \u00d7 T )." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-50", "text": "The resulting permutation-invariant set is S x = {(f 1 , t 1 ), (f 2 , t 2 ), \u22ef, (f T , t T )}." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-51", "text": "Following the definition by Vaswani et al. (2017) , we use queries (q), keys (k), and values (v) to represent the inputs for the attention." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-52", "text": "To be more precise, x {q,k,v} is used to denote a query/key/value element in the query/key/value sequence x {q,k,v} (x {q,k,v} \u2208 S x {q,k,v} ), with S x {q,k,v} being its set representation." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-53", "text": "We note that the input sequences are the same (x q = x k ) for self-attention and are different (x q from the decoder and x k from the encoder) for encoder-decoder attention." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-54", "text": "Given the introduced notation, the attention mechanism in the original Transformer (Vaswani et al., 2017) can be presented as: Attention(x q ; S x k ) = softmax((x q W q )(x k W k )^T / \u221a d k ) x k W v , (1)" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-55", "text": "with" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-56", "text": "softmax((x q W q )(x k W k )^T / \u221a d k ) being the weight, and d k being the feature dimension of x k W k ." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-57", "text": "Decoder self-attention further introduces a mask to block the visibility of elements in S x k to x q ." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-58", "text": "Particularly, decoder self-attention considers the decoded sequence as inputs (x k = x q ), where the decoded token at time t is not allowed to access the future decoded tokens (i.e., tokens decoded at time greater than t)." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-59", "text": "On the contrary, encoder self-attention and decoder-encoder attention apply no additional mask to Eq. (1)." 
}, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-60", "text": "Recent work (Shaw et al., 2018; Dai et al., 2019; Huang et al., 2018b; Child et al., 2019; Parmar et al., 2018; Tsai et al., 2019a) proposed modifications to the Transformer for the purpose of better modeling inputs positional relation (Shaw et al., 2018; Huang et al., 2018b; Dai et al., 2019) , appending additional keys in S x k (Dai et al., 2019) , modifying the mask applied to Eq. (1) (Child et al., 2019) , or applying to distinct feature types Parmar et al., 2018; Tsai et al., 2019a) ." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-61", "text": "These works adopt different designs of attention as comparing to the original form (Eq. (1))." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-62", "text": "In our paper, we aim at providing an unified view via the lens of kernel." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-63", "text": "----------------------------------" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-64", "text": "**REFORMULATION VIA THE LENS OF KERNEL**" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-65", "text": "We now provide the intuition to reformulate Eq." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-66", "text": "(1) via the lens of kernel." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-67", "text": "First, the softmax function can be realized as a probability function for x q observing the keys {x k }s in S x k (S x k is the set representation of sequence x k )." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-68", "text": "The probability is determined by the dot product between x q and x k with additional mappings W q W k and scaling by d k , which we note the dot-product operation is an instance of kernel function." 
}, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-69", "text": "We also introduce a set filtering function M (x q , S x k ) \u2236 X \u00d7 S \u2192 S which returns a set with its elements that operate with (or are connected/visible to) x q ." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-70", "text": "The filtering function M (\u22c5, \u22c5) plays as the role of the mask in decoder self-attention (Vaswani et al., 2017) ." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-71", "text": "Putting these altogether, we re-represent Eq. (1) into the following definition." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-72", "text": ", and a value function v(\u22c5) \u2236 X \u2192 Y, the Attention function taking the input of a query feature x q \u2208 X is defined as" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-73", "text": "The Definition 1 is a class of linear smoothers (Wasserman, 2006) with kernel smoothing:" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-74", "text": "where v(x k ) outputs the \"values\" and" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-75", "text": "is a probability function depends on k and N when k(\u22c5, \u22c5) is always positive." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-76", "text": "In the prior work (Vaswani et al., 2017)" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-77", "text": "Note that the kernel form k(x q , x k ) in the original Transformer (Vaswani et al., 2017 ) is a asymmetric exponential kernel with additional mapping W q and W k (Wilson et al., 2016; Li et al., 2017) 2 ." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-78", "text": "The new formulation defines a larger space for composing attention by manipulating its individual components, and at the same time it is able to categorize different variants of attention in prior work (Shaw et al., 2018; Huang et al., 2018b; Dai et al., 2019; Child et al., 2019; Wang et al., 2018; Tsai et al., 2019a) ." 
}, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-79", "text": "In the following, we study these components by dissecting Eq. (2) into: 1) kernel feature space X , 2) kernel construction k(\u22c5, \u22c5), 3) value function v(\u22c5), and 4) set filtering function M (\u22c5, \u22c5)." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-80", "text": "2.2.1 Kernel Feature Space X In Eq. (2), to construct a kernel on X , the first thing is to identify the kernel feature space X ." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-81", "text": "In addition to modeling sequences like word sentences (Vaswani et al., 2017) or music signals (Huang et al., 2018b) , the Transformer can also be applied to images (Parmar et al., 2018) , sets , and multimodal sequences (Tsai et al., 2019a) ." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-82", "text": "Due to distinct data types, these applications admit various kernel feature space: (Vaswani et al., 2017; Dai et al., 2019) :" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-83", "text": "with F being non-positional feature space and T being the positional embedding space of the position in the sequence." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-84", "text": "(ii) Image Transformer (Parmar et al., 2018) :" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-85", "text": "with F being non-positional feature space, H being the positional space of the height in an image, and W being the positional space of the width in an image." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-86", "text": "(iii) Set Transformer and NonLocal Neural Networks (Wang et al., 2018) :" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-87", "text": "with no any positional information present." 
}, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-88", "text": "(iv) Multimodal Transformer (Tsai et al., 2019a) :" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-89", "text": "with F \u2113 representing the language feature space, F v representing the vision feature space, F a representing the audio feature space, and T representing the temporal indicator space." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-90", "text": "For the rest of the paper, we will focus on the setting for sequence Transformer X = (F \u00d7 T ) and discuss the kernel construction on it." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-91", "text": "----------------------------------" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-92", "text": "**KERNEL CONSTRUCTION AND THE ROLE OF**" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-93", "text": "Positional Embedding k(\u22c5, \u22c5) The kernel construction on X = (F \u00d7 T ) has distinct design in variants of Transformers (Vaswani et al., 2017; Dai et al., 2019; Huang et al., 2018b; Shaw et al., 2018; Child et al., 2019) ." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-94", "text": "Since now the kernel feature space considers a joint space, we will first discuss the kernel construction on F (the non-positional feature space) and then discuss how different variants integrate the positional embedding (with the positional feature space T ) into the kernel." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-95", "text": "Kernel construction on F. 
All the work considered the scaled asymmetric exponential kernel with the mapping W q and W k (Wilson et al., 2016; Li et al., 2017) for non-positional features f q and f k :" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-96", "text": "Note that the usage of asymmetric kernel is also commonly used in various machine learning tasks (Yilmaz, 2007; Tsuda, 1999; Kulis et al., 2011) , where they observed the kernel form can be flexible and even non-valid (i.e., a kernel that is not symmetric and positive semi-definite)." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-97", "text": "In Section 3, we show that symmetric design of the kernel has similar performance for various sequence learning tasks, and we also examine different kernel choices (i.e., linear, polynomial, and rbf kernel)." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-98", "text": "Kernel construction on X = (F \u00d7 T )." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-99", "text": "The designs for integrating the positional embedding t q and t k are listed in the following." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-100", "text": "(i) Absolute Positional Embedding (Vaswani et al., 2017; Dai et al., 2019; Ott et al., 2019) : For the original Transformer (Vaswani et al., 2017) , each t i is represented by a vector with each dimension being sine or cosine functions." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-101", "text": "For learned positional embedding (Dai et al., 2019; Ott et al., 2019) , each t i is a learned parameter and is fixed for the same position for different sequences." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-102", "text": "These works defines the feature space as the direct sum of its temporal and non-temporal space: X = F \u2295 T ." 
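The scaled exponential kernel on F just described, and the symmetric variant examined in Section 3 (forcing W_q = W_k), can be sketched as below; the exact Eq. (3) is elided in this extraction, so this assumes the standard form exp(⟨f_q W_q, f_k W_k⟩ / √d_k).

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
Wq, Wk = rng.standard_normal((d, d)), rng.standard_normal((d, d))

def exp_kernel(fq, fk, Wq, Wk):
    """Scaled exponential kernel on the non-positional features:
    exp(<fq Wq, fk Wk> / sqrt(d)).  Asymmetric when Wq != Wk."""
    return np.exp((fq @ Wq) @ (fk @ Wk) / np.sqrt(d))

fq, fk = rng.standard_normal(d), rng.standard_normal(d)
asym = exp_kernel(fq, fk, Wq, Wk)  # k(fq, fk) != k(fk, fq) in general
sym = exp_kernel(fq, fk, Wq, Wq)   # W_q = W_k: symmetric, valid kernel
```

Sharing the map makes the kernel symmetric by construction (a dot product commutes), which is exactly the parameter-saving symmetric design evaluated later.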
}, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-103", "text": "Via the lens of kernel, the kernel similarity is defined as" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-104", "text": "(ii) Relative Positional Embedding in Transformer-XL (Dai et al., 2019) : t represents the indicator of the position in the sequence, and the kernel is chosen to be asymmetric of mixing sine and cosine functions:" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-105", "text": "with k fq t q , t k being an asymmetric kernel with coefficients inferred by f q : log k fq t q , t k = \u2211 (iii) Relative Positional Embedding of Shaw et al. (2018) and Music Transformer (Huang et al., 2018b) : t \u22c5 represents the indicator of the position in the sequence, and the kernel is modified to be indexed by a look-up table:" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-106", "text": "where L tq\u2212t k ,fq = exp(f q W q a tq\u2212t k ) with a \u22c5 being a learnable matrix having matrix width to be the length of the sequence." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-107", "text": "We refer readers to Shaw et al. (2018) for more details." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-108", "text": "Dai et al. (2019) showed that the way to integrate positional embedding is better through Eq. (5) than through Eq. (6) and is better through Eq. (6) than through Eq. (4)." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-109", "text": "We argue the reason is that if viewing f i and t i as two distinct spaces X \u2236= (F \u00d7 T ) , the direct sum x i = f i + t i may not be optimal when considering the kernel score between x q and x k ." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-110", "text": "In contrast, Eq. (5) represents the kernel as a product of two kernels (one for f i and another for t i ), which is able to capture the similarities for both temporal and non-temporal components." 
}, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-111", "text": "----------------------------------" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-112", "text": "**VALUE FUNCTION V(\u22c5)**" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-113", "text": "The current Transformers consider two different value function construction: (Vaswani et al., 2017) and Sparse Transformer (Child et al., 2019) :" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-114", "text": "(ii) Transformer-XL (Dai et al., 2019) , Music Transformer (Huang et al., 2018b) , Self-Attention with Relative Positional Embedding (Shaw et al., 2018) :" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-115", "text": "Compared Eq. (7) to Eq. (8), Eq. (7) takes the positional embedding into account for constructing the value function." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-116", "text": "In Section 3, we empirically observe that constructing value function with Eq. (8) constantly outperforms the construction with Eq. (7), which suggests that we do not need positional embedding for value function." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-117", "text": "----------------------------------" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-118", "text": "**SET FILTERING FUNCTION M (\u22c5, \u22c5)**" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-119", "text": "In Eq. (2), the returned set by the set filtering function M (x q , S x k ) defines how many keys and which keys are operating with x q ." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-120", "text": "In the following, we itemize the corresponding designs for the variants in Transformers: (i) Encoder Self-Attention in original Transformer (Vaswani et al., 2017) : For each query x q in the encoded sequence, M (x q , S x k ) = S x k contains the keys being all the tokens in the encoded sequence." 
}, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-121", "text": "Note that encoder self-attention considers x q = x k with x q being the encoded sequence." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-122", "text": "(ii) Encoder-Decoder Attention in original Transformer (Vaswani et al., 2017) : For each query x q in decoded sequence, M (x q , S x k ) = S x k contains the keys being all the tokens in the encoded sequence." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-123", "text": "Note that encode-decoder attention considers x q \u2260 x k with x q being the decoded sequence and x k being the encoded sequence." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-124", "text": "(iii) Decoder Self-Attention in original Transformer (Vaswani et al., 2017) : For each query x q in the decoded sequence, M (x q , S x k ) returns a subset of S x k (M (x q , S x k ) \u2282 S x k )." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-125", "text": "Note that decoder self-attention considers x q = x k with x q being the decoded sequence." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-126", "text": "Since the decoded sequence is the output for previous timestep, the query at position i can only observe the keys being the tokens that are decoded with position < i. For convenience, let us define S 1 as the set returned by original Transformer (Vaswani et al., 2017 ) from M (x q , S x k ), which we will use it later." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-127", "text": "(iv) Decoder Self-Attention in Transformer-XL (Dai et al., 2019) : For each query x q in the decoded sequence, M (x q , S x k ) returns a set containing S 1 and additional memories (M (x q , S x k ) = S 1 + S mem , M (x q , S x k ) \u2283 S 1 )." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-128", "text": "S mem refers to additional memories." 
}, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-129", "text": "(v) Decoder Self-Attention in Sparse Transformer (Child et al., 2019) : For each query x q in the decoded sentence, M (x q , S x k ) returns a subset of S 1 (M (x q , S x k ) \u2282 S 1 )." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-130", "text": "To compare the differences for various designs, we see the computation time is inversely proportional to the number of elements in M (x q , S x k )." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-131", "text": "For performance-wise comparisons, Transformer-XL (Dai et al., 2019) showed that, the additional memories in M (x q , S x k ) are able to capture longer-term dependency than the original Transformer (Vaswani et al., 2017) and hence results in better performance." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-132", "text": "Sparse Transformer (Child et al., 2019) showed that although having much fewer elements in M (x q , S x k ), if the elements are carefully chosen, the attention can still reach the same performance as Transformer-XL (Dai et al., 2019) ." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-133", "text": "----------------------------------" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-134", "text": "**EXPLORING THE DESIGN OF ATTENTION**" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-135", "text": "So far, we see how Eq. (2) connects to the variants of Transformers." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-136", "text": "By changing the kernel construction in Section 2.2.2, we can define a larger space for composing attention." 
}, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-137", "text": "In this paper, we present a new form of attention with a kernel that is 1) valid (i.e., a kernel that is symmetric and positive semi-definite) and 2) delicate in the sense of constructing a kernel on a joint space (i.e., X = (F \u00d7 T )):" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-138", "text": "where W F and W T are weight matrices." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-139", "text": "The new form considers product of kernels with the first kernel measuring similarity between non-temporal features and the second kernel measuring similarity between temporal features." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-140", "text": "Both kernels are symmetric exponential kernel." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-141", "text": "Note that t i here is chosen as the mixture of sine and cosine functions as in the prior work (Vaswani et al., 2017; Ott et al., 2019) ." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-142", "text": "In our experiment, we find it reaching competitive performance as comparing to the current state-of-the-art designs (Eq. (5) by Dai et al. (2019) )." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-143", "text": "We fix the size of the weight matrices W \u22c5 in Eq. (9) and Eq. (5) which means we save 33% of the parameters in attention from Eq. (9)" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-144", "text": "----------------------------------" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-145", "text": "**EXPERIMENTS**" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-146", "text": "By viewing the attention mechanism with Eq. (2), we aims at answering the following questions regarding the Transformer's designs:" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-147", "text": "Q1. What is the suggested way for incorporating positional embedding in the kernel function?" 
}, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-148", "text": "Q2." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-149", "text": "What forms of kernel are recommended to choose in the attention mechanism?" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-150", "text": "Can we replace the asymmetric kernel with the symmetric version?" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-151", "text": "Q3." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-152", "text": "Is there any exception that the attention mechanism is not order-agnostic with respect to inputs?" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-153", "text": "If so, can we downplay the role of positional embedding?" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-154", "text": "Q4." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-155", "text": "Is positional embedding required in value function?" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-156", "text": "We conduct experiments on neural machine translation (NMT) and sequence prediction (SP) tasks since these two tasks are commonly chosen for studying Transformers (Vaswani et al., 2017; Dai et al., 2019) ." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-157", "text": "Note that NMT has three different types of attentions (e.g., encoder selfattention, decoder-encoder attention, decoder selfattention) and SP has only one type of attention (e.g., decoder self-attention)." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-188", "text": "**ORDER-INVARIANCE IN ATTENTION**" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-158", "text": "For the choice of datasets, we pick IWSLT'14 German-English (De-En) dataset (Edunov et al., 2017) for NMT and WikiText-103 dataset (Merity et al., 2016) for SP as suggested by Edunov et al. (Edunov et al., 2017) and Dai et al. (Dai et al., 2019) ." 
}, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-159", "text": "For fairness of comparisons, we train five random initializations and report test accuracy with the highest validation score." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-160", "text": "We fix the position-wise operations in Transformer 3 and only change the attention mechanism." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-161", "text": "Similar to prior work (Vaswani et al., 2017; Dai et al., 2019) , we report BLEU score for NMT and perplexity for SP." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-162", "text": "----------------------------------" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-163", "text": "**INCORPORATING POSITIONAL EMBEDDING**" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-164", "text": "In order to find the best way to integrate positional embedding (PE), we study different PE incorporation in the kernel function k(\u22c5, \u22c5) in Eq. (2)." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-165", "text": "Referring to Sections 2.2.2 and 2.3, we consider four cases: 1) PE as direct sum in the feature space (see Eq. (4)), 2) PE as a look-up table (see Eq. (6)), 3) PE in product kernel with asymmetric kernel (see Eq. (5)), and 4) PE in product kernel with symmetric kernel (see Eq. (9))." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-166", "text": "We present the results in Table 1 ." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-167", "text": "First, we see that by having PE as a look-up (Edunov et al., 2017) and SP stands for sequence prediction on WikiText-103 dataset (Merity et al., 2016) ." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-168", "text": "\u2191 means the upper the better and \u2193 means the lower the better." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-169", "text": "Table 2 : Kernel Types." 
}, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-170", "text": "Other than manipulating the kernel choice of the non-positional features, we fix the configuration by Vaswani et al. (2017) for NMT and the configuration by Dai et al. (2019) for SP." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-171", "text": "34.14 24.13 24.21 table, it outperforms the case with having PE as direct-sum in feature space, especially for SP task." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-172", "text": "Note that the look-up table is indexed by the relative position (i.e., t q \u2212 t k ) instead of absolute position." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-173", "text": "Second, we see that PE in the product kernel proposed by Dai et al. (Dai et al., 2019) may not constantly outperform the other integration types (it has lower BLEU score for NMT)." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-174", "text": "Our proposed product kernel reaches the best result in NMT and is competitive to the best result in SP." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-175", "text": "----------------------------------" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-176", "text": "**KERNEL TYPES**" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-177", "text": "To find the best kernel form in the attention mechanism, in addition to the exponential kernel (see Eq. (3)), we compare different kernel forms (i.e., linear, polynomial, and rbf kernel) for the non-positional features." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-178", "text": "We also provide the results for changing asymmetric to the symmetric kernel, when forcing W q = W k , so that the resulting kernel is a valid kernel (Scholkopf and Smola, 2001) ." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-179", "text": "The numbers are shown in Table 2 ." 
}, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-180", "text": "Note that, for fairness, other than manipulating the kernel choice of the non-positional features, we fix the configuration by Vaswani et al. (Vaswani et al., 2017) for NMT and the configuration by Dai et al. (Dai et al., 2019) for SP." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-181", "text": "We first observe that the linear kernel does not converge for both NMT and SP." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-182", "text": "We argue the reason is that the linear kernel may have negative value and thus it violates the assumption in kernel smoother that the kernel score must be positive (Wasserman, 2006) ." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-183", "text": "Next, we observe the kernel with infinite feature space (i.e., exponential and rbf kernel) outperforms the kernel with finite feature space (i.e., polynomial kernel). And we see rbf kernel performs the best for NMT and exponential kernel performs the best for SP." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-184", "text": "We conclude that the choice of kernel matters for the design of attention in Transformer." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-185", "text": "Also, we see no much performance difference when comparing asymmetric to symmetric kernel." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-186", "text": "In the experiment, we fix the size of W \u22c5 in the kernel, and thus adopting the symmetric kernel benefits us from saving parameters." 
}, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-187", "text": "----------------------------------" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-189", "text": "The need of the positional embedding (PE) in the attention mechanism is based on the argument that the attention mechanism is an order-agnostic (or, permutation equivariant) operation (Vaswani et al., 2017; Shaw et al., 2018; Huang et al., 2018b; Dai et al., 2019; Child et al., 2019) ." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-190", "text": "However, we show that, for decoder self-attention, the operation is not order-agnostic." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-191", "text": "For clarification, we are not attacking the claim made by the prior work (Vaswani et al., 2017; Shaw et al., 2018; Huang et al., 2018b; Dai et al., 2019; Child et al., 2019 ), but we aim at providing a new look at the order-invariance problem when considering the attention mechanism with masks (masks refer to the set filtering function in our kernel formulation)." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-192", "text": "In other words, previous work did not consider the mask between queries and keys when discussing the order-invariance problem (P\u00e9rez et al., 2019) ." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-193", "text": "To put it formally, we first present the definition by for a permutation equivariance function:" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-194", "text": "Definition 2." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-195", "text": "Denote \u03a0 as the set of all permutations over [n] = {1, \u22ef, n}. A function f unc \u2236 X n \u2192 Y n is permutation equivariant iff for any permutation \u03c0 \u2208 \u03a0, f unc(\u03c0x) = \u03c0f unc(x). showed that the standard attention (encoder self-attention (Vaswani et al., 2017; Dai et al., 2019) ) is permutation equivariant." 
}, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-196", "text": "Here, we present the non-permutation-equivariant problem on the decoder self-attention: (Vaswani et al., 2017; Dai et al., 2019) is not permutation equivariant." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-197", "text": "To proceed the proof, we need the following definition and propositions." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-198", "text": "----------------------------------" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-199", "text": "**PROPOSITION 2. ATTENTION WITH THE SET FILTERING FUNC-**" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-200", "text": "Proof." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-201", "text": "It is easy to show that if M (x q , S x k ) = S x k , Eq. (2) remains unchanged for any permutation \u03c0 performed on S x k ." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-202", "text": "\u220e Proposition 3." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-203", "text": "Attention with the set difference" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-204", "text": "Then, we construct a permutation \u03c0 such that" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-205", "text": "It is obvious that Eq. (2) changes after this permutation and thus" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-206", "text": "not permutation equivariant w.r.t." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-207", "text": "S x k equals to showing Attention not permutation equivariant." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-208", "text": "Then, since the decoder self-attention considers masking (i.e., M (x q , S x k ) returns a subset of S x k ), by Proposition 3, the decoder self-attention is not permutation equivariant." 
}, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-209", "text": "\u220e In fact, not only being a permutation inequivariant process, the decoding process in the decoder self-attention already implies the order information from the data." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-210", "text": "To show this, take the decoded sequence y = [init, y 1 , y 2 , y 3 , y 4 ] as an example." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-211", "text": "init stands for the initial token." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-212", "text": "When determining the output y 1 from init, the set filtering function is M (init, S y ) = {init}. Similarly, we will have M (y 1 , S y ), M (y 2 , S y ), M (y 3 , S y ) to be {init, y 1 }, {init, y 1 , y 2 }, {init, y 1 , y 2 , y 3 }." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-213", "text": "Then, it raises a concern: do we require PE in decoder self-attention?" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-214", "text": "By removing PE in decoder selfattention, we present the results in Table 3 ." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-215", "text": "From the table, we can see that, for NMT, removing PE only in decoder self-attention results in slight performance drop (from 34.71 to 34.49)." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-216", "text": "However, removing PE in the entire model greatly degrades the performance (from 34.71 to 14.47)." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-217", "text": "On the other hand, for SP, removing PE from our proposed attention variant dramatically degrades the performance (from 24.28 to 30.92)." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-218", "text": "Nonetheless, the performance is slightly better than considering PE from the original Transformer (Vaswani et al., 2017) ." 
}, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-219", "text": "----------------------------------" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-220", "text": "**POSITIONAL EMBEDDING IN VALUE FUNCTION**" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-221", "text": "To determine the need of positional embedding (PE) in value function, we conduct the experiments by adopting Eq. (7) or Eq. (8) in the attention mechanism." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-222", "text": "The results are presented in Table 4 ." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-223", "text": "From the table, we find that considering PE in value function (Eq. (7)) does not gain performance as compared to not considering PE in value function (Eq. (8))." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-224", "text": "----------------------------------" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-225", "text": "**TAKE-HOME MESSAGES**" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-226", "text": "Based on the results and discussions, we can now answer the questions given at the beginning of this section." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-227", "text": "The answers are summarized into the takehome messages in the following." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-228", "text": "A1." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-229", "text": "We show that integrating the positional embedding in the form of product kernel (Eq. (5) or Eq. (9)) gives us best performance." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-230", "text": "A2." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-231", "text": "The kernel form does matter." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-232", "text": "Adopting kernel form with infinite feature dimension (i.e., exponential kernel or rbf kernel) gives us best results." 
}, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-233", "text": "The symmetric design of the kernel may benefit us from saving parameters and barely sacrifice the performance as compared to the non-symmetric one." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-234", "text": "A3." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-235", "text": "The decoder self-attention is not an orderagnostic operation with respect to the order of inputs." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-236", "text": "However, incorporating positional embedding into the attention mechanism may still improve performance." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-237", "text": "A4." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-238", "text": "We find that there is no much performance difference by considering or not considering the positional embedding in value function." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-239", "text": "----------------------------------" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-240", "text": "**RELATED WORK**" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-241", "text": "Other than relating Transformer's attention mechanism with kernel methods, the prior work (Wang et al., 2018; Shaw et al., 2018; Tsai et al., 2019b ) related the attention mechanism with graph-structured learning." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-242", "text": "For example, Non-Local Neural Networks (Wang et al., 2018) made a connection between the attention and the non-local operation in image processing (Buades et al., 2005) ." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-243", "text": "Others (Shaw et al., 2018; Tsai et al., 2019b) linked the attention to the message passing in graphical models." 
}, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-244", "text": "In addition to the fundamental difference between graph-structured learning and kernel learning, the prior work (Wang et al., 2018; Shaw et al., 2018; Tsai et al., 2019b) focused on presenting Transformer for its particular application (e.g., video classification (Wang et al., 2018) and neural machine translation (Shaw et al., 2018) )." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-245", "text": "Alternatively, our work focuses on presenting a new formulation of Transformer's attention mechanism that gains us the possibility for understanding the attention mechanism better." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-246", "text": "----------------------------------" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-247", "text": "**CONCLUSIONS**" }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-248", "text": "In this paper, we presented a kernel formulation for the attention mechanism in Transformer, which allows us to define a larger space for designing attention." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-249", "text": "As an example, we proposed a new variant of attention which reaches competitive performance when compared to previous state-of-the-art models." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-250", "text": "Via the lens of the kernel, we were able to better understand the role of individual components in Transformer's attention and categorize previous attention variants in a unified formulation." }, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-251", "text": "Among these components, we found the construction of the kernel function acts the most important role, and we studied different kernel forms and the ways to integrate positional embedding on neural machine translation and sequence prediction." 
}, { "sent_id": "04f6b9d4296dee4bbf965f9911bf98-C001-252", "text": "We hope our empirical study may potentially allow others to design better attention mechanisms given their particular applications." } ], "y": { "@BACK@": { "gold_contexts": [ [ "04f6b9d4296dee4bbf965f9911bf98-C001-2" ], [ "04f6b9d4296dee4bbf965f9911bf98-C001-3" ], [ "04f6b9d4296dee4bbf965f9911bf98-C001-13" ], [ "04f6b9d4296dee4bbf965f9911bf98-C001-14" ], [ "04f6b9d4296dee4bbf965f9911bf98-C001-15" ], [ "04f6b9d4296dee4bbf965f9911bf98-C001-16" ], [ "04f6b9d4296dee4bbf965f9911bf98-C001-20" ], [ "04f6b9d4296dee4bbf965f9911bf98-C001-22" ], [ "04f6b9d4296dee4bbf965f9911bf98-C001-28" ], [ "04f6b9d4296dee4bbf965f9911bf98-C001-36" ], [ "04f6b9d4296dee4bbf965f9911bf98-C001-42" ], [ "04f6b9d4296dee4bbf965f9911bf98-C001-45" ], [ "04f6b9d4296dee4bbf965f9911bf98-C001-47" ], [ "04f6b9d4296dee4bbf965f9911bf98-C001-48" ], [ "04f6b9d4296dee4bbf965f9911bf98-C001-54" ], [ "04f6b9d4296dee4bbf965f9911bf98-C001-60" ], [ "04f6b9d4296dee4bbf965f9911bf98-C001-70" ], [ "04f6b9d4296dee4bbf965f9911bf98-C001-77" ], [ "04f6b9d4296dee4bbf965f9911bf98-C001-81" ], [ "04f6b9d4296dee4bbf965f9911bf98-C001-82" ], [ "04f6b9d4296dee4bbf965f9911bf98-C001-93" ], [ "04f6b9d4296dee4bbf965f9911bf98-C001-100" ], [ "04f6b9d4296dee4bbf965f9911bf98-C001-104" ], [ "04f6b9d4296dee4bbf965f9911bf98-C001-105" ], [ "04f6b9d4296dee4bbf965f9911bf98-C001-113" ], [ "04f6b9d4296dee4bbf965f9911bf98-C001-120" ], [ "04f6b9d4296dee4bbf965f9911bf98-C001-122" ], [ "04f6b9d4296dee4bbf965f9911bf98-C001-124" ], [ "04f6b9d4296dee4bbf965f9911bf98-C001-126" ], [ "04f6b9d4296dee4bbf965f9911bf98-C001-127" ], [ "04f6b9d4296dee4bbf965f9911bf98-C001-129" ], [ "04f6b9d4296dee4bbf965f9911bf98-C001-131" ], [ "04f6b9d4296dee4bbf965f9911bf98-C001-132" ], [ "04f6b9d4296dee4bbf965f9911bf98-C001-156" ], [ "04f6b9d4296dee4bbf965f9911bf98-C001-189" ], [ "04f6b9d4296dee4bbf965f9911bf98-C001-195" ], [ "04f6b9d4296dee4bbf965f9911bf98-C001-241" ], [ 
"04f6b9d4296dee4bbf965f9911bf98-C001-244" ] ], "cite_sentences": [ "04f6b9d4296dee4bbf965f9911bf98-C001-2", "04f6b9d4296dee4bbf965f9911bf98-C001-3", "04f6b9d4296dee4bbf965f9911bf98-C001-13", "04f6b9d4296dee4bbf965f9911bf98-C001-14", "04f6b9d4296dee4bbf965f9911bf98-C001-15", "04f6b9d4296dee4bbf965f9911bf98-C001-16", "04f6b9d4296dee4bbf965f9911bf98-C001-20", "04f6b9d4296dee4bbf965f9911bf98-C001-22", "04f6b9d4296dee4bbf965f9911bf98-C001-28", "04f6b9d4296dee4bbf965f9911bf98-C001-36", "04f6b9d4296dee4bbf965f9911bf98-C001-42", "04f6b9d4296dee4bbf965f9911bf98-C001-45", "04f6b9d4296dee4bbf965f9911bf98-C001-47", "04f6b9d4296dee4bbf965f9911bf98-C001-48", "04f6b9d4296dee4bbf965f9911bf98-C001-54", "04f6b9d4296dee4bbf965f9911bf98-C001-60", "04f6b9d4296dee4bbf965f9911bf98-C001-70", "04f6b9d4296dee4bbf965f9911bf98-C001-77", "04f6b9d4296dee4bbf965f9911bf98-C001-81", "04f6b9d4296dee4bbf965f9911bf98-C001-82", "04f6b9d4296dee4bbf965f9911bf98-C001-93", "04f6b9d4296dee4bbf965f9911bf98-C001-100", "04f6b9d4296dee4bbf965f9911bf98-C001-104", "04f6b9d4296dee4bbf965f9911bf98-C001-105", "04f6b9d4296dee4bbf965f9911bf98-C001-113", "04f6b9d4296dee4bbf965f9911bf98-C001-120", "04f6b9d4296dee4bbf965f9911bf98-C001-122", "04f6b9d4296dee4bbf965f9911bf98-C001-124", "04f6b9d4296dee4bbf965f9911bf98-C001-126", "04f6b9d4296dee4bbf965f9911bf98-C001-127", "04f6b9d4296dee4bbf965f9911bf98-C001-129", "04f6b9d4296dee4bbf965f9911bf98-C001-131", "04f6b9d4296dee4bbf965f9911bf98-C001-132", "04f6b9d4296dee4bbf965f9911bf98-C001-156", "04f6b9d4296dee4bbf965f9911bf98-C001-189", "04f6b9d4296dee4bbf965f9911bf98-C001-195", "04f6b9d4296dee4bbf965f9911bf98-C001-241", "04f6b9d4296dee4bbf965f9911bf98-C001-244" ] }, "@MOT@": { "gold_contexts": [ [ "04f6b9d4296dee4bbf965f9911bf98-C001-6" ], [ "04f6b9d4296dee4bbf965f9911bf98-C001-7" ], [ "04f6b9d4296dee4bbf965f9911bf98-C001-23" ], [ "04f6b9d4296dee4bbf965f9911bf98-C001-24" ], [ "04f6b9d4296dee4bbf965f9911bf98-C001-191" ], [ "04f6b9d4296dee4bbf965f9911bf98-C001-245" ] ], 
"cite_sentences": [ "04f6b9d4296dee4bbf965f9911bf98-C001-6", "04f6b9d4296dee4bbf965f9911bf98-C001-7", "04f6b9d4296dee4bbf965f9911bf98-C001-23", "04f6b9d4296dee4bbf965f9911bf98-C001-24", "04f6b9d4296dee4bbf965f9911bf98-C001-191", "04f6b9d4296dee4bbf965f9911bf98-C001-245" ] }, "@EXT@": { "gold_contexts": [ [ "04f6b9d4296dee4bbf965f9911bf98-C001-8" ] ], "cite_sentences": [ "04f6b9d4296dee4bbf965f9911bf98-C001-8" ] }, "@UNSURE@": { "gold_contexts": [ [ "04f6b9d4296dee4bbf965f9911bf98-C001-33" ], [ "04f6b9d4296dee4bbf965f9911bf98-C001-39" ], [ "04f6b9d4296dee4bbf965f9911bf98-C001-90" ], [ "04f6b9d4296dee4bbf965f9911bf98-C001-135" ], [ "04f6b9d4296dee4bbf965f9911bf98-C001-146" ], [ "04f6b9d4296dee4bbf965f9911bf98-C001-160" ], [ "04f6b9d4296dee4bbf965f9911bf98-C001-184" ], [ "04f6b9d4296dee4bbf965f9911bf98-C001-196" ], [ "04f6b9d4296dee4bbf965f9911bf98-C001-250" ] ], "cite_sentences": [ "04f6b9d4296dee4bbf965f9911bf98-C001-33", "04f6b9d4296dee4bbf965f9911bf98-C001-39", "04f6b9d4296dee4bbf965f9911bf98-C001-90", "04f6b9d4296dee4bbf965f9911bf98-C001-135", "04f6b9d4296dee4bbf965f9911bf98-C001-146", "04f6b9d4296dee4bbf965f9911bf98-C001-160", "04f6b9d4296dee4bbf965f9911bf98-C001-184", "04f6b9d4296dee4bbf965f9911bf98-C001-196", "04f6b9d4296dee4bbf965f9911bf98-C001-250" ] }, "@USE@": { "gold_contexts": [ [ "04f6b9d4296dee4bbf965f9911bf98-C001-51" ], [ "04f6b9d4296dee4bbf965f9911bf98-C001-141" ], [ "04f6b9d4296dee4bbf965f9911bf98-C001-170" ], [ "04f6b9d4296dee4bbf965f9911bf98-C001-180" ] ], "cite_sentences": [ "04f6b9d4296dee4bbf965f9911bf98-C001-51", "04f6b9d4296dee4bbf965f9911bf98-C001-141", "04f6b9d4296dee4bbf965f9911bf98-C001-170", "04f6b9d4296dee4bbf965f9911bf98-C001-180" ] }, "@SIM@": { "gold_contexts": [ [ "04f6b9d4296dee4bbf965f9911bf98-C001-161" ] ], "cite_sentences": [ "04f6b9d4296dee4bbf965f9911bf98-C001-161" ] }, "@DIF@": { "gold_contexts": [ [ "04f6b9d4296dee4bbf965f9911bf98-C001-170" ], [ "04f6b9d4296dee4bbf965f9911bf98-C001-180" ], [ 
"04f6b9d4296dee4bbf965f9911bf98-C001-218" ], [ "04f6b9d4296dee4bbf965f9911bf98-C001-248" ] ], "cite_sentences": [ "04f6b9d4296dee4bbf965f9911bf98-C001-170", "04f6b9d4296dee4bbf965f9911bf98-C001-180", "04f6b9d4296dee4bbf965f9911bf98-C001-218", "04f6b9d4296dee4bbf965f9911bf98-C001-248" ] } } }, "ABC_c14d918f3b1b1248dc1d25a7e0b2e4_9": { "x": [ { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-2", "text": "This paper describes a log-linear model with an n-gram reference distribution for accurate probabilistic HPSG parsing." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-3", "text": "In the model, the n-gram reference distribution is simply defined as the product of the probabilities of selecting lexical entries, which are provided by the discriminative method with machine learning features of word and POS n-gram as defined in the CCG/HPSG/CDG supertagging." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-4", "text": "Recently, supertagging becomes well known to drastically improve the parsing accuracy and speed, but supertagging techniques were heuristically introduced, and hence the probabilistic models for parse trees were not well defined." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-5", "text": "We introduce the supertagging probabilities as a reference distribution for the log-linear model of the probabilistic HPSG." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-6", "text": "This is the first model which properly incorporates the supertagging probabilities into parse tree's probabilistic model." 
}, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-7", "text": "----------------------------------" }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-9", "text": "For the last decade, fast, accurate and wide-coverage parsing for real-world text has been pursued in sophisticated grammar formalisms, such as headdriven phrase structure grammar (HPSG) (Pollard and Sag, 1994) , combinatory categorial grammar (CCG) (Steedman, 2000) and lexical function grammar (LFG) (Bresnan, 1982) ." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-10", "text": "They are preferred because they give precise and in-depth analyses for explaining linguistic phenomena, such as passivization, control verbs and relative clauses." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-11", "text": "The main difficulty of developing parsers in these formalisms was how to model a well-defined probabilistic model for graph structures such as feature structures." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-12", "text": "This was overcome by a probabilistic model which provides probabilities of discriminating a correct parse tree among candidates of parse trees in a log-linear model or maximum entropy model (Berger et al., 1996) with many features for parse trees (Abney, 1997; Johnson et al., 1999; Malouf and van Noord, 2004; Kaplan et al., 2004; ." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-13", "text": "Following this discriminative approach, techniques for efficiency were investigated for estimation (Geman and Johnson, 2002; Miyao and Tsujii, 2002; Malouf and van Noord, 2004) and parsing (Clark and Curran, 2004b; Clark and Curran, 2004a; ." 
}, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-14", "text": "An interesting approach to the problem of parsing efficiency was using supertagging (Clark and Cur-ran, 2004b; Clark and Curran, 2004a; Wang, 2003; Wang and Harper, 2004; Nasr and Rambow, 2004; Ninomiya et al., 2006; , which was originally developed for lexicalized tree adjoining grammars (LTAG) (Bangalore and Joshi, 1999) ." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-15", "text": "Supertagging is a process where words in an input sentence are tagged with 'supertags,' which are lexical entries in lexicalized grammars, e.g., elementary trees in LTAG, lexical categories in CCG, and lexical entries in HPSG." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-16", "text": "The concept of supertagging is simple and interesting, and the effects of this were recently demonstrated in the case of a CCG parser (Clark and Curran, 2004a) with the result of a drastic improvement in the parsing speed." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-17", "text": "Wang and Harper (2004) also demonstrated the effects of supertagging with a statistical constraint dependency grammar (CDG) parser by showing accuracy as high as the state-of-the-art parsers, and and reported that accuracy was significantly improved by incorporating the supertagging probabilities into manually tuned Weighted CDG." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-18", "text": "Ninomiya et al. (2006) showed the parsing model using only supertagging probabilities could achieve accuracy as high as the probabilistic model for phrase structures." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-19", "text": "This means that syntactic structures are almost determined by supertags as is claimed by Bangalore and Joshi (1999) ." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-20", "text": "However, supertaggers themselves were heuristically used as an external tagger." 
}, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-21", "text": "They filter out unlikely lexical entries just to help parsing (Clark and Curran, 2004a) , or the probabilistic models for phrase structures were trained independently of the supertagger's probabilistic models (Wang and Harper, 2004; Ninomiya et al., 2006) ." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-22", "text": "In the case of supertagging of Weighted CDG , parameters for Weighted CDG are manually tuned, i.e., their model is not a well-defined probabilistic model." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-23", "text": "We propose a log-linear model for probabilistic HPSG parsing in which the supertagging probabilities are introduced as a reference distribution for the probabilistic HPSG." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-24", "text": "The reference distribution is simply defined as the product of the probabilities of selecting lexical entries, which are provided by the discriminative method with machine learning features of word and part-of-speech (POS) n-gram as defined in the CCG/HPSG/CDG supertagging." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-25", "text": "This is the first model which properly incorporates the supertagging probabilities into parse tree's probabilistic model." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-26", "text": "We compared our model with the probabilistic model for phrase structures ." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-27", "text": "This model uses word and POS unigram for its reference distribution, i.e., the probabilities of unigram supertagging." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-28", "text": "Our model can be regarded as an extension of a unigram reference distribution to an n-gram reference distribution with features that are used in supertagging." 
}, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-29", "text": "We also compared with a probabilistic model in (Ninomiya et al., 2006) ." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-30", "text": "The probabilities of their model are defined as the product of probabilities of supertagging and probabilities of the probabilistic model for phrase structures, but their model was trained independently of supertagging probabilities, i.e., the supertagging probabilities are not used for reference distributions." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-31", "text": "----------------------------------" }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-32", "text": "**HPSG AND PROBABILISTIC MODELS**" }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-33", "text": "HPSG (Pollard and Sag, 1994 ) is a syntactic theory based on lexicalized grammar formalism." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-34", "text": "In HPSG, a small number of schemata describe general construction rules, and a large number of lexical entries express word-specific characteristics." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-35", "text": "The structures of sentences are explained using combinations of schemata and lexical entries." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-36", "text": "Both schemata and lexical entries are represented by typed feature structures, and constraints represented by feature structures are checked with unification." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-37", "text": "An example of HPSG parsing of the sentence \"Spring has come\" is shown in Figure 1 ." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-38", "text": "First, each of the lexical entries for \"has\" and \"come\" is unified with a daughter feature structure of the Head-Complement Schema." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-39", "text": "Unification provides the phrasal sign of the mother." 
}, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-40", "text": "The sign of the larger constituent is obtained by repeatedly applying schemata to lexical/phrasal signs." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-41", "text": "Finally, the parse result is output as a phrasal sign that dominates the sentence." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-42", "text": "Given a set W of words and a set F of feature structures, an HPSG is formulated as a tuple, G = L, R , where L = {l = w, F |w \u2208 W, F \u2208 F} is a set of lexical entries, and R is a set of schemata; i.e., r \u2208 R is a partial function:" }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-43", "text": "Given a sentence, an HPSG computes a set of phrasal signs, i.e., feature structures, as a result of parsing." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-44", "text": "Note that HPSG is one of the lexicalized grammar formalisms, in which lexical entries determine the dominant syntactic structures." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-45", "text": "Previous studies (Abney, 1997; Johnson et al., 1999; Malouf and van Noord, 2004; Kaplan et al., 2004; ) defined a probabilistic model of unification-based grammars including HPSG as a log-linear model or maximum entropy model (Berger et al., 1996) ." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-46", "text": "The probability that a parse result T is assigned to a given sentence w = w 1 , . . . , w n is (Probabilistic HPSG)" }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-47", "text": "where \u03bb u is a model parameter, f u is a feature function that represents a characteristic of parse tree T , and Z w is the sum over the set of all possible parse trees for the sentence." 
}, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-48", "text": "Intuitively, the probability is defined as the normalized product of the weights exp(\u03bb u ) when a characteristic corresponding to f u appears in parse result T ." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-49", "text": "The model parameters, \u03bb u , are estimated using numerical optimization methods (Malouf, 2002) to maximize the log-likelihood of the training data." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-50", "text": "However, the above model cannot be easily estimated because the estimation requires the computation of p(T |w) for all parse candidates assigned to sentence w. Because the number of parse candidates is exponentially related to the length of the sentence, the estimation is intractable for long sentences." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-51", "text": "To make the model estimation tractable, Geman and Johnson (Geman and Johnson, 2002) and Miyao and Tsujii (Miyao and Tsujii, 2002) proposed a dynamic programming algorithm for estimating p(T |w)." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-52", "text": "also introduced a preliminary probabilistic model p 0 (T |w) whose estimation does not require the parsing of a treebank." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-53", "text": "This model is introduced as a reference distribution (Jelinek, 1998; of the probabilistic HPSG model; i.e., the computation of parse trees given low probabilities by the model is omitted in the estimation stage , or a probabilistic model can be augmented by several distributions estimated from the larger and simpler corpus ." 
}, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-54", "text": "In , p 0 (T |w) is defined as the product of probabilities of selecting lexical entries with word and POS unigram features: 's model)" }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-55", "text": "where l i is a lexical entry assigned to word w i in T and p(l i |w i ) is the probability of selecting lexical entry l i for w i ." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-56", "text": "In the experiments, we compared our model with other two types of probabilistic models using a supertagger (Ninomiya et al., 2006) ." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-57", "text": "The first one is the simplest probabilistic model, which is defined with only the probabilities of lexical entry selection." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-58", "text": "It is defined simply as the product of the probabilities of selecting all lexical entries in the sentence; i.e., the model does not use the probabilities of phrase structures like the probabilistic models explained above." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-59", "text": "Given a set of lexical entries, L, a sentence, w = w 1 , . . . , w n , and the probabilistic model of lexical entry selection, p(l i \u2208 L|w, i), the first model is formally defined as follows:" }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-60", "text": "where l i is a lexical entry assigned to word w i in T and p(l i |w, i) is the probability of selecting lexical entry l i for w i ." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-61", "text": "The probabilities of lexical entry selection, p(l i |w, i), are defined as follows:" }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-62", "text": "where Z w is the sum over all possible lexical entries for the word w i ." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-63", "text": "The second model is a hybrid model of supertagging and the probabilistic HPSG." 
}, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-64", "text": "The probabilities are given as the product of Ninomiya et al. (2006) 's model 1 and the probabilistic HPSG." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-65", "text": "(Ninomiya et al. (2006)" }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-66", "text": "In the experiments, we compared our model with Figure 2 ." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-67", "text": "----------------------------------" }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-68", "text": "**PROBABILISTIC HPSG WITH AN N-GRAM REFERENCE DISTRIBUTION**" }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-69", "text": "In this section, we propose a probabilistic model with an n-gram reference distribution for probabilistic HPSG parsing." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-70", "text": "This is an extension of 's model by replacing the unigram reference distribution with an n-gram reference distribution." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-71", "text": "Our model is formally defined as follows:" }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-72", "text": "combinations , d, c, hw, hp, hl , r, d, c, hw, hp , r, d, c, hw, hl , r, d, c, sy, hw , r, c, sp, hw, hp, hl , r, c, sp, hw, hp , r, c, sp, hw, hl , r, c, sp, sy, hw , r, d, c, hp, hl , r, d, c, hp , r, d, c, hl , r, d, c, sy , r, c, sp, hp, hl , r, c, sp, hp , r, c, sp, hl , r, c, sp, sy combinations of feature templates for funary r, hw, hp, hl , r, hw, hp , r, hw, hl , r, sy, hw , r, hp, hl , r, hp , r, hl , r, sy combinations of feature templates for f root hw, hp, hl , hw, hp , hw, hl , sy, hw , hp, hl , hp , hl (Probabilistic HPSG with an n-gram reference distribution)" }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-73", "text": "In our model, Ninomiya et al. (2006) 's model 1 is used as a reference distribution." 
}, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-74", "text": "The probabilistic model of lexical entry selection and its feature templates are the same as defined in Ninomiya et al. (2006) 's model 1." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-75", "text": "The formula of our model is the same as Ninomiya et al. (2006)'s model 3. But, their model is not a probabilistic model with a reference distribution." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-76", "text": "Both our model and their model consist of the probabilities for lexical entries (= p model1 (T |w)) and the probabilities for phrase structures (= the rest of each formula)." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-77", "text": "The only difference between our model and their model is the way of how to train model parameters for phrase structures." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-78", "text": "In both our model and their model, the parameters for lexical entries (= the parameters of p model1 (T |w)) are first estimated from the word and POS sequences independently of the parameters for phrase structures." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-79", "text": "That is, the estimated parameters for lexical entries are the same in both models, and hence the probabilities of p model1 (T |w) of both models are the same." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-100", "text": "The experiments were conducted on an AMD Opteron server with a 2.4-GHz CPU." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-80", "text": "Note that the parameters for lexical entries will never be updated after this estimation stage; i.e., the parameters for lexical entries are not estimated in the same time with the parameters for phrase structures." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-81", "text": "The difference of our model and their model is the estimation of parameters for phrase structures." 
}, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-82", "text": "In our model, given the probabilities for lexical entries, the parameters for phrase structures are estimated so as to maximize the entire probabilistic model (= the product of the probabilities for lexical entries and the probabilities for phrase structures) in the training corpus." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-83", "text": "In their model, the parameters for phrase structures are trained without using the probabilities for lexical entries, i.e., the parameters for phrase structures are estimated so as to maximize the probabilities for phrase structures only." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-84", "text": "That is, the parameters for lexical entries and the parameters for phrase structures are trained independently in their model." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-85", "text": "'s model also uses a reference distribution, but with word and POS unigram features, as is explained in the previous section." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-86", "text": "The only difference between our model and 's model is that our model uses sequences of word and POS tags as n-gram features for selecting lexical entries in the same way as supertagging does." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-87", "text": "----------------------------------" }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-88", "text": "**EXPERIMENTS**" }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-89", "text": "We evaluated the speed and accuracy of parsing by using Enju 2.1, the HPSG grammar for English ." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-90", "text": "The lexicon of the grammar was extracted from Sections 02-21 of the Penn Treebank (Marcus et al., 1994) abilistic models were trained using the same portion of the treebank." 
}, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-91", "text": "We used beam thresholding, global thresholding (Goodman, 1997) , preserved iterative parsing and quick check (Malouf et al., 2000) ." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-92", "text": "We measured the accuracy of the predicateargument relations output of the parser." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-93", "text": "A predicate-argument relation is defined as a tuple \u03c3, w h , a, w a , where \u03c3 is the predicate type (e.g., adjective, intransitive verb), w h is the head word of the predicate, a is the argument label (MODARG, ARG1, ..., ARG4), and w a is the head word of the argument." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-94", "text": "Labeled precision (LP)/labeled recall (LR) is the ratio of tuples correctly identified by the parser 2 ." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-95", "text": "Unlabeled precision (UP)/unlabeled recall (UR) is the ratio of tuples without the predicate type and the argument label." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-96", "text": "This evaluation scheme was the same as used in previous evaluations of lexicalized grammars (Hockenmaier, 2003; The HPSG treebank is used for training the probabilistic model for lexical entry selection, and hence, those lexical entries that do not appear in the treebank are rarely selected by the probabilistic model." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-97", "text": "The 'effective' tag set size, therefore, is around 1,361, the number of lexical entries without those never-seen lexical entries." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-98", "text": "2 When parsing fails, precision and recall are evaluated, although nothing is output by the parser; i.e., recall decreases greatly." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-99", "text": "and Curran, 2004b; ." 
}, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-101", "text": "Section 22 of the Treebank was used as the development set, and the performance was evaluated using sentences of \u2264 100 words in Section 23." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-102", "text": "The performance of each model was analyzed using the sentences in Section 24 of \u2264 100 words." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-103", "text": "Table 3 details the numbers and average lengths of the tested sentences of \u2264 100 words in Sections 23 and 24, and the total numbers of sentences in Sections 23 and 24." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-104", "text": "The parsing performance for Section 23 is shown in Table 4 ." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-105", "text": "The upper half of the table shows the performance using the correct POSs in the Penn Treebank, and the lower half shows the performance using the POSs given by a POS tagger ." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-106", "text": "LF and UF in the figure are labeled F-score and unlabeled F-score." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-107", "text": "F-score is the harmonic mean of precision and recall." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-108", "text": "We evaluated our model in two settings." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-109", "text": "One is implemented with a narrow beam width ('our model 1' in the figure) , and the other is implemented with a wider beam width ('our model 2' in the figure) 3 . 'our model Figure 3 : F-score versus average parsing time for sentences in Section 24 of \u2264 100 words." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-110", "text": "1' was introduced to measure the performance with balanced F-score and speed, which we think appropriate for practical use." 
}, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-111", "text": "'our model 2' was introduced to measure how high the precision and recall could reach by sacrificing speed." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-112", "text": "Our models increased the parsing accuracy." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-113", "text": "'our model 1' was around 2.6 times faster and had around 2.65 points higher F-score than 's model." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-114", "text": "'our model 2' was around 2.3 times slower but had around 2.9 points higher F-score than 's model." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-115", "text": "We must admit that the difference between our models and Ninomiya et al. (2006) 's model 3 was not as great as the difference from 's model, but 'our model 1' achieved 0.56 points higher F-score, and 'our model 2' achieved 0.8 points higher F-score." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-116", "text": "When the automatic POS tagger was introduced, Fscore dropped by around 2.4 points for all models." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-117", "text": "We also compared our model with Matsuzaki et al. (2007) 's model." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-118", "text": "Matsuzaki et al. (2007)" }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-119", "text": "----------------------------------" }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-120", "text": "**PRO-**" }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-121", "text": "The terms \u03ba and \u03b4 are the thresholds of the number of phrasal signs in the chart cell and the beam width for signs in the chart cell." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-122", "text": "The terms \u03b1 and \u03b2 are the thresholds of the number and the beam width of lexical entries, and \u03b8 is the beam width for global thresholding (Goodman, 1997) ." 
}, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-123", "text": "The terms with suffixes 0 are the initial values." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-124", "text": "The parser iterates parsing until it succeeds to generate a parse tree." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-125", "text": "The parameters increase for each iteration by the terms prefixed by \u2206, and parsing finishes when the parameters reach the terms with suffixes last." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-126", "text": "Details of the parameters are written in ." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-127", "text": "The beam thresholding parameters for 'our model 2' are \u03b1 0 = 18, \u2206\u03b1 = 6, \u03b1 last = 42, \u03b2 0 = 9.0, \u2206\u03b2 = 3.0, \u03b2 last = 21.0, \u03b4 0 = 18, \u2206\u03b4 = 6, \u03b4 last = 42, \u03ba 0 = 9.0, \u2206\u03ba = 3.0, \u03ba last = 21.0." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-128", "text": "In 'our model 2', the global thresholding was not used." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-129", "text": "posed a technique for efficient HPSG parsing with supertagging and CFG filtering." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-130", "text": "Their results with the same grammar and servers are also listed in the lower half of Table 4 ." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-131", "text": "They achieved drastic improvement in efficiency." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-132", "text": "Their parser ran around 6 times faster than Ninomiya et al. (2006) 's model 3, 9 times faster than 'our model 1' and 60 times faster than 'our model 2.' Instead, our models achieved better accuracy." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-133", "text": "'our model 1' had around 0.5 higher F-score, and 'our model 2' had around 0.8 points higher F-score." 
}, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-134", "text": "Their efficiency is mainly due to elimination of ungrammatical lexical entries by the CFG filtering." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-135", "text": "They first parse a sentence with a CFG grammar compiled from an HPSG grammar, and then eliminate lexical entries that are not in the parsed CFG trees." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-136", "text": "Obviously, this technique can also be applied to the HPSG parsing of our models." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-137", "text": "We think that efficiency of HPSG parsing with our models will be drastically improved by applying this technique." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-138", "text": "The average parsing time and labeled F-score curves of each probabilistic model for the sentences in Section 24 of \u2264 100 words are graphed in Figure 3." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-139", "text": "The graph clearly shows the difference of our model and other models." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-140", "text": "As seen in the graph, our model achieved higher F-score than other model when beam threshold was widen." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-141", "text": "This implies that other models were probably difficult to reach the Fscore of 'our model 1' and 'our model 2' for Section 23 even if we changed the beam thresholding parameters." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-142", "text": "However, F-score of our model dropped eas-ily when we narrow down the beam threshold, compared to other models." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-143", "text": "We think that this is mainly due to its bad implementation of parser interface." 
}, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-144", "text": "The n-gram reference distribution is incorporated into the kernel of the parser, but the n-gram features and a maximum entropy estimator are defined in other modules; n-gram features are defined in a grammar module, and a maximum entropy estimator for the n-gram reference distribution is implemented with a general-purpose maximum entropy estimator module." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-145", "text": "Consequently, strings that represent the ngram information are very frequently changed into feature structures and vice versa when they go in and out of the kernel of the parser." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-146", "text": "On the other hand, Ninomiya et al. (2006)'s model 3 uses the supertagger as an external module." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-147", "text": "Once the parser acquires the supertagger's outputs, the n-gram information never goes in and out of the kernel." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-148", "text": "This advantage of Ninomiya et al. (2006)'s model can apparently be implemented in our model, but this requires many parts of rewriting of the implemented parser." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-149", "text": "We estimate that the overhead of the interface is around from 50 to 80 ms/sentence." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-150", "text": "We think that re-implementation of the parser will improve the parsing speed as estimated." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-151", "text": "In Figure 3 , the line of our model crosses the line of Ninomiya et al. (2006) 's model." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-152", "text": "If the estimation is correct, our model will be faster and more accurate so that the lines in the figure do not cross." 
}, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-153", "text": "Speed-up in our model is left as a future work." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-154", "text": "----------------------------------" }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-155", "text": "**CONCLUSION**" }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-156", "text": "We proposed a probabilistic model in which supertagging is consistently integrated into the probabilistic model for HPSG." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-157", "text": "In the model, the n-gram reference distribution is simply defined as the product of the probabilities of selecting lexical entries with machine learning features of word and POS ngram as defined in the CCG/HPSG/CDG supertagging." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-158", "text": "We conducted experiments on the Penn Treebank with a wide-coverage HPSG parser." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-159", "text": "In the experiments, we compared our model with the probabilistic HPSG with a unigram reference distribution and the probabilistic HPSG with supertagging (Ninomiya et al., 2006 )." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-160", "text": "Though our model was not as fast as Ninomiya et al. (2006) 's models, it achieved the highest accuracy among them." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-161", "text": "Our model had around 2.65 points higher F-score than 's model and around 0.56 points higher F-score than the Ninomiya et al. (2006) 's model 3." }, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-162", "text": "When we sacrifice parsing speed, our model achieved around 2.9 points higher F-score than 's model and around 0.8 points higher F-score than Ninomiya et al. (2006) 's model 3." 
}, { "sent_id": "c14d918f3b1b1248dc1d25a7e0b2e4-C001-163", "text": "Our model achieved higher F-score because parameters for phrase structures in our model are trained with the supertagging probabilities, which are not in other models." } ], "y": { "@BACK@": { "gold_contexts": [ [ "c14d918f3b1b1248dc1d25a7e0b2e4-C001-14" ], [ "c14d918f3b1b1248dc1d25a7e0b2e4-C001-18" ], [ "c14d918f3b1b1248dc1d25a7e0b2e4-C001-21" ], [ "c14d918f3b1b1248dc1d25a7e0b2e4-C001-29", "c14d918f3b1b1248dc1d25a7e0b2e4-C001-30" ], [ "c14d918f3b1b1248dc1d25a7e0b2e4-C001-132" ], [ "c14d918f3b1b1248dc1d25a7e0b2e4-C001-146" ] ], "cite_sentences": [ "c14d918f3b1b1248dc1d25a7e0b2e4-C001-14", "c14d918f3b1b1248dc1d25a7e0b2e4-C001-18", "c14d918f3b1b1248dc1d25a7e0b2e4-C001-21", "c14d918f3b1b1248dc1d25a7e0b2e4-C001-29", "c14d918f3b1b1248dc1d25a7e0b2e4-C001-30", "c14d918f3b1b1248dc1d25a7e0b2e4-C001-132", "c14d918f3b1b1248dc1d25a7e0b2e4-C001-146" ] }, "@UNSURE@": { "gold_contexts": [ [ "c14d918f3b1b1248dc1d25a7e0b2e4-C001-29" ], [ "c14d918f3b1b1248dc1d25a7e0b2e4-C001-56" ], [ "c14d918f3b1b1248dc1d25a7e0b2e4-C001-64" ], [ "c14d918f3b1b1248dc1d25a7e0b2e4-C001-148" ], [ "c14d918f3b1b1248dc1d25a7e0b2e4-C001-151" ], [ "c14d918f3b1b1248dc1d25a7e0b2e4-C001-159" ] ], "cite_sentences": [ "c14d918f3b1b1248dc1d25a7e0b2e4-C001-29", "c14d918f3b1b1248dc1d25a7e0b2e4-C001-56", "c14d918f3b1b1248dc1d25a7e0b2e4-C001-64", "c14d918f3b1b1248dc1d25a7e0b2e4-C001-148", "c14d918f3b1b1248dc1d25a7e0b2e4-C001-151", "c14d918f3b1b1248dc1d25a7e0b2e4-C001-159" ] }, "@USE@": { "gold_contexts": [ [ "c14d918f3b1b1248dc1d25a7e0b2e4-C001-73" ] ], "cite_sentences": [ "c14d918f3b1b1248dc1d25a7e0b2e4-C001-73" ] }, "@SIM@": { "gold_contexts": [ [ "c14d918f3b1b1248dc1d25a7e0b2e4-C001-74" ], [ "c14d918f3b1b1248dc1d25a7e0b2e4-C001-75" ] ], "cite_sentences": [ "c14d918f3b1b1248dc1d25a7e0b2e4-C001-74", "c14d918f3b1b1248dc1d25a7e0b2e4-C001-75" ] }, "@DIF@": { "gold_contexts": [ [ "c14d918f3b1b1248dc1d25a7e0b2e4-C001-75" ], [ 
"c14d918f3b1b1248dc1d25a7e0b2e4-C001-115" ], [ "c14d918f3b1b1248dc1d25a7e0b2e4-C001-160" ], [ "c14d918f3b1b1248dc1d25a7e0b2e4-C001-161" ], [ "c14d918f3b1b1248dc1d25a7e0b2e4-C001-162" ] ], "cite_sentences": [ "c14d918f3b1b1248dc1d25a7e0b2e4-C001-75", "c14d918f3b1b1248dc1d25a7e0b2e4-C001-115", "c14d918f3b1b1248dc1d25a7e0b2e4-C001-160", "c14d918f3b1b1248dc1d25a7e0b2e4-C001-161", "c14d918f3b1b1248dc1d25a7e0b2e4-C001-162" ] } } }, "ABC_0c8a99cac11953f26308128bfc058b_9": { "x": [ { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-89", "text": "We will see later in this section, by this evaluation, some facts covered by the bakeoff evaluation can be illustrated by our new evaluation metric." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-90", "text": "Here, we repeat two experiments described in (Zhang et al., 2006a) , namely dictionary-based approach and subword-based tagging." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-91", "text": "For CT method, top 2000 most frequent multi-character words and all single characters in training corpus are selected as subwords and the feature templates used for CRF model is listed in Table 3 ." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-92", "text": "We present all the segmentation results in Table 6 to see the strength and weakness of each method conveniently." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-93", "text": "Based on IV and OOV recall as we show in Table 1 , Zhang argues that the DS performs better on IV word identification while CT performs better on OOV words. But we can see from the results in Table 6 (the lines about DS and CT), the IV precision of DS approach is much lower than that of CT on all the four corpora, which also causes a lower F measure of IV." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-94", "text": "The reason for low IV precision of DS is that many OOV words are segmented into two IV words by DS." 
}, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-95", "text": "For example, OOV word \"\u6b4c\u5531\u73ed(choral)\" is segmented into\" \u6b4c\u5531(sing) \u73ed(class)\" by DS." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-96", "text": "These wrongly identified IV words increase the number of all IV words in the segmenter\"s output and cause the low IV precision of the DS result." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-97", "text": "Since the F measure of IV is a more reasonable metric of performance of IV than IV recall only, Table 6 shows that CT method outperforms the DS on IV word segmentation over all four corpora." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-98", "text": "The comparison also shows that CT outperforms the DS on OOV and overall segmentation as well." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-99", "text": "Previous two, current and next two subword Bigram" }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-100", "text": "Previous character and next subwords Table 3 Feature templates used for CRF in our experiments 3 Balance between IV and OOV Performance" }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-101", "text": "There are other strategies such as (Goh et al., 2005) trying to seek balance between IV and OOV performance." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-102", "text": "In (Goh et al, 2005) , information in a dictionary is used in a statistical model." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-103", "text": "In this way, the dictionary-based approach and the statistical model are combined." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-104", "text": "We choose the confidence measure to study because it is straight-forward." 
}, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-2", "text": "Abstract" }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-3", "text": "Since the first Chinese Word Segmentation (CWS) Bakeoff on 2003, CWS has experienced a prominent flourish because Bakeoff provides a platform for the participants, which helps them recognize the merits and drawbacks of their segmenters." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-4", "text": "However, the evaluation metric of bakeoff is not sufficient enough to measure the performance thoroughly, sometimes even misleading." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-5", "text": "One typical example caused by this insufficiency is that there is a popular belief existing in the research field that segmentation based on word can yield a better result than character-based tagging (CT) on in-vocabulary (IV) word segmentation even within closed tests of Bakeoff." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-127", "text": "Therefore, the CM formula seems somewhat unreasonable." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-6", "text": "Many efforts were paid to balance the performance on IV and out-ofvocabulary (OOV) words by combining these two methods according to this belief." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-7", "text": "In this paper, we provide a more detailed evaluation metric of IV and OOV words than Bakeoff to analyze CT method and combination method, which is a typical way to seek such balance." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-8", "text": "Our evaluation metric shows that CT outperforms dictionary-based (or so called word-based in general) segmentation on both IV and OOV words within Bakeoff * The work is done when the first author is working in MSRA as an intern." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-9", "text": "closed tests." 
}, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-10", "text": "Furthermore, our analysis shows that using confidence measure to combine the two segmentation results should be under certain limitation." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-11", "text": "----------------------------------" }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-12", "text": "****" }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-13", "text": "Since the first Chinese Word Segmentation (CWS) Bakeoff on 2003, CWS has experienced a prominent flourish because Bakeoff provides a platform for the participants, which helps them recognize the merits and drawbacks of their segmenters." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-14", "text": "However, the evaluation metric of bakeoff is not sufficient enough to measure the performance thoroughly, sometimes even misleading." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-15", "text": "One typical example caused by this insufficiency is that there is a popular belief existing in the research field that segmentation based on word can yield a better result than character-based tagging (CT) on in-vocabulary (IV) word segmentation even within closed tests of Bakeoff." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-16", "text": "Many efforts were paid to balance the performance on IV and out-ofvocabulary (OOV) words by combining these two methods according to this belief." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-17", "text": "In this paper, we provide a more detailed evaluation metric of IV and OOV words than Bakeoff to analyze CT method and combination method, which is a typical way to seek such balance." 
}, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-18", "text": "Our evaluation metric shows that CT outperforms dictionary-based (or so called word-based in general) segmentation on both IV and OOV words within Bakeoff" }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-19", "text": "----------------------------------" }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-20", "text": "**INTRODUCTION**" }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-21", "text": "Chinese Word Segmentation (CWS) has been witnessed a prominent progress in the last three Bakeoffs (Sproat and Emerson, 2003) , (Emerson, 2005) , (Levow, 2006) ." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-22", "text": "One of the reasons for this progress is that Bakeoff provides standard corpora and objective metric, which makes the result of each system comparable." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-23", "text": "Through those evaluations researchers can recognize the advantage and disadvantage of their methods and improve their systems accordingly." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-24", "text": "However, in the evaluation metric of Bakeoff, only the overall F measure, precision, recall, IV (invocabulary) recall and OOV (out-of-vocabulary) recall are included and such a metric is not sufficient to give a completely measure on the performance, especially when the performance on IV and OOV word segmentation need to be evaluated." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-25", "text": "An important issue is that segmentation based on which, word or character, can yield the better performance on IV words." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-26", "text": "We give a detailed explanation about this issue as following." 
}, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-27", "text": "Since CWS was firstly treated as a characterbased tagging task (we call it \"CT\" for short hereafter) in (Xue and Converse, 2002) , this method has been widely accepted and further developed by researchers (Peng et al., 2004) , (Tseng et al., 2005) , (Low et al., 2005) , (Zhao et al., 2006) ." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-28", "text": "Relatively to dictionary-based segmentation (we call it \"DS\" for short hereafter), CT method can achieve a higher accuracy on OOV word recognition and a better performance of segmentation in whole." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-29", "text": "Thus, CT has drawn more and more attention and became the dominant method in the Bakeoff 2005 and 2006." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-30", "text": "Although CT has shown its merits in word segmentation task, some researchers still hold the belief that on IV words DS can perform better than CT even in the restriction of Bakeoff closed test." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-31", "text": "Consequently, many strategies are proposed to balance the IV and OOV performance (Goh et al., 2005) , (Zhang et al., 2006a) ." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-32", "text": "Among these strategies, the confidence measure used to combine the results of CT and DS is a straight-forward one, which is introduced in (Zhang et al., 2006a) ." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-33", "text": "The basic assumption of such combination is that DS method performs better on IV words and Zhang derives this belief from the fact that DS achieves higher IV recall rate as Table 1 shows." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-34", "text": "In which AS, CityU, MSRA and PKU are four corpora used in Bakeoff 2005 (also see Table 2 for detail)." 
}, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-35", "text": "We provide a more detailed evaluation metric to analyze these two methods, including precision and F measure of IV and OOV respectively and our experiments show that CT outperforms DS on both IV and OOV words within Bakeoff closed test." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-36", "text": "The precision and F measure are existing metrics and the definitions of them are clear." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-37", "text": "Here we just employ them to evaluate segmentation results." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-38", "text": "Furthermore, our error analysis on the results of combination reveals that confidence measure in (Zhang et al., 2006a) has a representation flaw and we propose an EIV tag method to revise it." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-39", "text": "Finally, we give an empirical comparison between existing pure CT method and combination, which shows that pure CT method can produce state-of-the-art results on both IV word and overall segmentation." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-40", "text": "(Zhang et al., 2006a) The rest of this paper is organized as follows." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-41", "text": "In Section 2, we give a brief introduction to Zhang\"s DS method and subword-based tagging, which is a special CT method. And by comparing the results of this special CT method and DS according our detailed metric, we show that CT performs better on both IV and OOV." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-42", "text": "We review in Section 3 how confidence measure works and indicate its representation flaw." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-43", "text": "Furthermore, an \"EIV\" tag method is proposed to revise the confidence measure." 
}, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-44", "text": "In Section 4, the experimental results of existing pure CT method are demonstrated to compare with combination result, based on which we discuss the related work." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-45", "text": "In Section 5, we conclude the contributions of this paper and discuss the future work." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-46", "text": "----------------------------------" }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-47", "text": "**COMPARISON BETWEEN DS AND CT BASED ON DETAILED METRIC**" }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-48", "text": "We proposed a detailed evaluation metric for IV and OOV word identification in this section and experiments based on the new metric show that CT outperforms DS not only on OOV words but also on IV words with F-measure of IV." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-49", "text": "All the experiments in this paper conform to the constraints of closed test in Bakeoff 2005 (Emerson, 2005 ." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-50", "text": "It means that any resource beyond the training corpus is excluded." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-51", "text": "We first review how DS and CT work and then present our evaluation metric and experiment results." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-52", "text": "There is one thing should be emphasized, by comparing DS and CT result we just want to verify that our new metric can show the performance on IV words more objectively." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-53", "text": "Since either DS or CT implementation has specific setting here we should not extend the comparison result to a general sense between those generative models and discriminative models." 
}, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-54", "text": "----------------------------------" }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-55", "text": "**DICTIONARY-BASED SEGMENTATION**" }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-56", "text": "For the dictionary-based word segmentation, we collect a dictionary from training corpus first." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-57", "text": "Instead of Maximum Match, trigram language model 2 trained on training corpus is employed for disambiguation." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-58", "text": "During the disambiguation procedure, a beam search decoder is used to seek the most possible segmentation." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-59", "text": "Since the setting in our paper is consistent with the closed test of Bakeoff, we can only use the information we learn from training corpus though other open resources may be helpful to improve the performance further." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-60", "text": "For detail, the decoder reads characters from the input sentence one at a time, and generates candidate segmentations incrementally." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-61", "text": "At each stage, the next incoming character is combined with an existing candidate in two different ways to generate new candidates: it is either appended to the last word in the candidate, or taken as the start of a new word." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-62", "text": "This method guarantees exhaustive generation of possible segmentations for any input sentence." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-63", "text": "However, the exponential time and space of the length of the input sentence are needed for such a search and it is always intractable in practice." 
}, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-64", "text": "Thus, we use the trigram language model to select top B (B is a constant predefined before search and in our experiment 3 is used) best candidates with highest probability at each stage so that the search algorithm can work in practice." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-65", "text": "Finally, when the whole sentence has been read, the best candidate with the highest probability will be selected as the segmentation result." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-66", "text": "Here, the term \"dictionary-based\" is exactly the method implemented in (Zhang et al., 2006a) , it does not mean the generative language model in general." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-67", "text": "----------------------------------" }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-68", "text": "**CHARACTER-BASED TAGGING**" }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-69", "text": "Under CT scheme, each character in one sentence is labeled as \"B\" if it is the beginning of a word, \"O\" tag means the current character is a single-character word, other character is labeled as \"I\"." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-70", "text": "For example, \"\u5168\u4e2d\u56fd (whole China)\" is labeled as \"" }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-71", "text": "In (Zhang et al., 2006a) , the above CT method is developed as subword-based tagging." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-72", "text": "First, the most frequent multi-character words and all single characters in training corpus are collected as subwords." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-73", "text": "During the subwordbased tagging, a subword is viewed as an unit instead of several separate characters and given only one tag." 
}, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-74", "text": "For example, in subword-based tagging, \"\u5168\u4e2d\u56fd (whole China)\" is labeled as \" \u5168 (whole)/O \u4e2d\u56fd (China)/O\", if the word \"\u4e2d \u56fd (China)\" is collected as a subword." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-75", "text": "As the preprocessing, both training and test corpora are segmented by maximum match with subword set as dictionary." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-76", "text": "After this preprocessing, every sentence in both training and test corpora becomes subword sequence." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-77", "text": "Finally, the tagger is trained by CRFs approach 3 on the training data." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-78", "text": "Although word information is integrated into this method, it still works in the scheme of \"IOB\" tagging." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-79", "text": "Thus, we still call subwordbased tagging as a special CT method and in the reminder of this paper \"CT\" means subwordbased tagging in Zhang\"s paper and \"Pure CT\" means CT without subword." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-80", "text": "----------------------------------" }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-81", "text": "**A DETAILED EVALUATION METRIC**" }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-82", "text": "In this paper, data provided by Bakeoff 2005 is used in our experiments in order to compare with the published results in (Zhang et al., 2006a) ." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-83", "text": "The statistics of the corpora for Bakeoff 2005 are listed in Evaluation standard is also provided by Bakeoff, including overall precision, recall, F measure, IV recall and OOV recall (Sproat and Emerson, 2003) , (Emerson, 2005) ." 
}, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-84", "text": "However, some important metrics, such as F measure and precision of both IV and OOV words are omitted, which are necessary when the performance of IV or OOV word identification need to be judged." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-85", "text": "Thus, in order to judge the results of each experiment, a more detailed evaluation with precision and F measure of both IV and OOV words included is used." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-86", "text": "To calculate the IV and OOV precision and recall, we firstly divide words of the segmenter\"s output and gold data into IV word and OOV word sets respectively with the dictionary collected from the training corpus." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-87", "text": "Then, for IV and OOV word sets respectively, the IV (or OOV) recall is the proportion of the correctly segmented IV (or OOV) word tokens to all IV (or OOV) word tokens in the gold data, and IV (or OOV) precision is the proportion of the correctly segmented IV (or OOV) word tokens to all IV (or OOV) word tokens in the segmenter\"s output." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-88", "text": "One thing have to be emphasized is that the single character in test corpus will be defined as OOV if it does not appear in training corpus." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-168", "text": "**CONCLUSIONS AND FUTURE WORK**" }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-105", "text": "We show in this section that there is a representation flaw in the formula of confidence measure in (Zhang et al., 2006a )." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-106", "text": "And we propose an \"EIV\" tag method to solve this problem." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-107", "text": "Our experiments show that confidence measure with EIV tag outperforms CT and DS alone." 
}, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-108", "text": "----------------------------------" }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-109", "text": "**CONFIDENCE MEASURE**" }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-110", "text": "Confidence Measure (CM) means to seek an optimal tradeoff between performance on IV and OOV words." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-111", "text": "The basic idea of CM comes from the belief that CT performs better on OOV words while DS performs better on IV words." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-112", "text": "When both results of CT and DS are available, the CM can be calculated according to the following formula in (Zhang et al., 2006a) :" }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-113", "text": "Here, w is a subword, iob t is \"IOB\" tag given by CT and w t is \"IOB\" tag generated by DS." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-114", "text": "In the first term of the right hand side of the formula," }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-115", "text": "is the marginal probability of iob t (we call this marginal probability \"MP\" for short will be kept, otherwise it will be replaced with w t ." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-116", "text": "Thus, the CM ultimately is the marginal probability of the \"IOB\" tag (MP)." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-117", "text": "In the experiment of this paper, MP is used as CM because it is equivalent to Zhang\"s CM but more convenient to express." 
}, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-118", "text": "----------------------------------" }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-119", "text": "**EXPERIMENTS AND ERROR ANALYSIS ABOUT COMBINATION**" }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-120", "text": "We repeat the experiments about CM in Zhang\"s paper (Zhang et al., 2006a) and show that there is a representation flaw in the CM formula." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-121", "text": "Furthermore, we propose an EIV tag method to make CM yield a better result." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-122", "text": "In this paper, \uf061 = 0.8 and t = 0.7 (Parameters in two papers, Zhang et al. 2006a and Zhang et al. 2006b , are different. And our parameters are consistent with Zhang et al. 2006b which is confirmed by Dr Zhang through email) are used in CM, namely MP= 0.875 is the threshold." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-123", "text": "Here, in Table 4 , we provide some statistics on the results of CT when MP is less than 0.875." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-124", "text": "From Table 4 we can see that even with MP less than 0.875, most of the subwords are still tagged correctly by CT and should not be revised by DS result." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-125", "text": "Besides, lots of the subwords with low MP contained by OOV words in test data, especially for the corpus whose OOV rate is high (i.e. on CityU corpus more than one third subwords with low MP belong to OOV word) and performance on OOV recognition is the advantage of CT rather than that of DS approach." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-126", "text": "Thus when combining the results of the two methods, it is the iob t should be maintained if the subword is contained by an OOV word." 
}, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-128", "text": "The error analysis about how many original errors are eliminated and how many new errors are introduced by CM is provided in Table 5 (the columns about CM)." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-129", "text": "Table 5 illustrates that, after combining the two results, most original errors on IV words are corrected because DS can achieve higher IV recall as described in Zhang\"s paper. But on OOV part, more new errors are introduced by CM and these new errors decrease the precision of the IV words." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-130", "text": "For example, the OOV words \"\u8b66\u536b\u961f\u5458 (guard member)\" and \" \u8bbe\u8ba1\u8d39 (design fee)\" is recognized correctly by CT but with low CM." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-131", "text": "In the combining procedure, these words are wrongly split as IV errors: \"\u8b66\u536b (guard) \u961f\u5458 (member)\" and \"\u8bbe\u8ba1 (design) \u8d39 (fee)\"." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-132", "text": "Thus, for two corpora (i.e. CityU and AS), F measure of IV and overall F measure decreases since there are more new errors introduced than original ones eliminated and only on the other two corpora (MSRA and PKU), overall F measure of combination method is higher than CT alone, which is shown in Table 6 by the lines about combination." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-133", "text": "----------------------------------" }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-134", "text": "**EIV TAG METHOD**" }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-135", "text": "Since combining the two results by CM may produce an even worse performance in some case, it is worthy to study how to use this CM to get an enhanced result." 
}, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-136", "text": "Intuitively, if we can change only the CT tags of the subwords which contained in IV word while keep the CT tags of those contained in OOV words unchanged, we will improve the final result according to our error analysis in Table 5 ." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-137", "text": "Unfortunately, only from the test data, we can get the information whether a subword contained in an IV word, just as what we do to get Table 4 ." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-138", "text": "However, we can get an approximate estimation from DS result." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-139", "text": "When using subwords to re-segment DS result 4 , all the fractions re-segmented out of multiplecharacter words, including both multiplecharacter words and single characters, will be given an \"EIV\" tag, which means that the current multiple-character word or single character is contained in an IV word with high probability." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-140", "text": "For example, \"\u4eba\u529b\u8d44\u6e90 (human resource)\" in DS result is a whole word." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-141", "text": "However, only \"\u8d44\u6e90 (resource)\" belongs to the subword set, so during the re-segmentation \"\u4eba\u529b\u8d44\u6e90 (human resource)\" will be re-segmented as \"\u4eba (people) \u529b (force) \u8d44\u6e90 (resource)\"." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-142", "text": "All these three fractions will be labeled with an \"EIV\" tag respectively." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-143", "text": "It is reasonable because all the multiplecharacter words in the DS result can match an IV word." 
}, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-144", "text": "After this procedure, when combining Table 5 Error analysis of confidence measure with and without EIV tag the two results, only the CT tag with EIV tags and low MP will be replaced by DS tag, otherwise the original CT tag will be maintained." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-145", "text": "Under this condition the errors introduced by OOV will not happen and enhanced results are listed in Table 6 lines about EIV." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-146", "text": "We can see that on all four corpora the overall F measure of EIV result is higher than that of CT alone, which show that our EIV method works well." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-147", "text": "Now, let\"s check what changes happened in the number of error tags after EIV condition added into the CM." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-148", "text": "We can see from the Table 5 columns about EIV, there are more errors eliminated than the new errors introduced after EIV condition added into CM and most CT tags of subwords contained in OOV words maintained unchanged as we supposed. And then, our results (in Table 6 lines about EIV) are comparable with that in Zhang\"s paper." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-149", "text": "Thus, there may be some similar strategies in Zhang\"s CM too but not presented in Zhang\"s paper." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-150", "text": "----------------------------------" }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-151", "text": "**DISCUSSION AND RELATED WORKS**" }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-152", "text": "Although the method such as confidence measure can be helpful at some circumstance, our experiment shows that pure character-based tagging (pure CT) can work well with reasonable features and tag set." 
}, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-153", "text": "In (Zhao et al., 2006) , an enhanced CRF tag set is proposed to distinguish different positions in the multi-character words when the word length is less than 6." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-154", "text": "In this method, feature templates are almost the same as shown in Table 3 with a 3-character window and a 6-tag set {B, B2, B3, M, E, O} is used." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-155", "text": "Here, tag B and E stand for the first and the last position in a multi-character word, respectively." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-156", "text": "S stands up a single-character word." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-157", "text": "B2 and B3 stand for the second and the third position in a multi-character word, whose length is larger than two-character or three-character." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-158", "text": "M stands for the fourth or more rear position in a multicharacter word, whose length is larger than fourcharacter." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-159", "text": "In Table 6 , the lines about \"pure CT\" provide the results generated by pure CT with 6-tag set." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-160", "text": "We can see from the Table 6 this pure CT approach achieves the state-of-the-art results on all the corpora." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-161", "text": "On three of the four corpora (AS, MSRA and PKU) this pure CT method gets the best result." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-162", "text": "Even on IV word, this pure CT approach outperforms Zhang\"s CT method and produces comparable results with combination with EIV tags, which shows that pure CT method can perform well on IV words too." 
}, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-163", "text": "Moreover, this character-based tagging approach is more clear and simple than the confidence measure method." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-164", "text": "Although character-based tagging became mainstream approach in the last two Bakeoffs, it does not mean that word information is valueless in Chinese word segmentation." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-165", "text": "A word-based perceptron algorithm is proposed recently (Zhang and Clark, 2007) , which views Chinese word segmentation task from a new angle instead of character-based tagging and gets comparable results with the best results of Bakeoff." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-166", "text": "Table 6 Results of different approach used in our experiments (White background lines are the results we repeat Zhang\"s methods and they have some trivial difference with Table 1. ) Therefore, the most important thing worth to pay attention in future study is how to integrate linguistic information into the statistical model effectively, no matter character or word information." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-167", "text": "----------------------------------" }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-169", "text": "In this paper, we first provided a detailed evaluation metric, which provides the necessary information to judge the performance of each method on IV and OOV word identification." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-170", "text": "Second, by this evaluation metric, we show that characterbased tagging outperforms dictionary-based segmentation not only on OOV words but also on IV words within Bakeoff closed tests." 
}, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-171", "text": "Furthermore, our experiments show that confidence measure in Zhang\"s paper has a representation flaw and we propose an EIV tag method to revise the combination." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-172", "text": "Finally, our experiments show that pure character-based approach also can achieve good IV word and overall performance." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-173", "text": "Perhaps, there are two reasons that existing combination results don\"t outperform the pure CT." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-174", "text": "One is that most information contained in statistic language model is already captured by the CT feature templates in CRF framework." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-175", "text": "The other is that confidence measure may not be the effective way to combine the DS and CT results." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-176", "text": "In the future work, our research will focus on how to integrate word information into CRF features rather than using it to modify the results of CRF tagging." }, { "sent_id": "0c8a99cac11953f26308128bfc058b-C001-177", "text": "In this way, we can capture the word information meanwhile avoid destroying the optimal output of CRF tagging." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "0c8a99cac11953f26308128bfc058b-C001-31" ], [ "0c8a99cac11953f26308128bfc058b-C001-32" ], [ "0c8a99cac11953f26308128bfc058b-C001-33" ], [ "0c8a99cac11953f26308128bfc058b-C001-71" ], [ "0c8a99cac11953f26308128bfc058b-C001-112" ], [ "0c8a99cac11953f26308128bfc058b-C001-129" ] ], "cite_sentences": [ "0c8a99cac11953f26308128bfc058b-C001-31", "0c8a99cac11953f26308128bfc058b-C001-32", "0c8a99cac11953f26308128bfc058b-C001-33", "0c8a99cac11953f26308128bfc058b-C001-71", "0c8a99cac11953f26308128bfc058b-C001-112", "0c8a99cac11953f26308128bfc058b-C001-129" ] }, "@MOT@": { "gold_contexts": [ [ "0c8a99cac11953f26308128bfc058b-C001-38" ], [ "0c8a99cac11953f26308128bfc058b-C001-105" ], [ "0c8a99cac11953f26308128bfc058b-C001-120" ], [ "0c8a99cac11953f26308128bfc058b-C001-171" ] ], "cite_sentences": [ "0c8a99cac11953f26308128bfc058b-C001-38", "0c8a99cac11953f26308128bfc058b-C001-105", "0c8a99cac11953f26308128bfc058b-C001-120", "0c8a99cac11953f26308128bfc058b-C001-171" ] }, "@UNSURE@": { "gold_contexts": [ [ "0c8a99cac11953f26308128bfc058b-C001-41" ], [ "0c8a99cac11953f26308128bfc058b-C001-79" ], [ "0c8a99cac11953f26308128bfc058b-C001-82" ], [ "0c8a99cac11953f26308128bfc058b-C001-149" ] ], "cite_sentences": [ "0c8a99cac11953f26308128bfc058b-C001-41", "0c8a99cac11953f26308128bfc058b-C001-79", "0c8a99cac11953f26308128bfc058b-C001-82", "0c8a99cac11953f26308128bfc058b-C001-149" ] }, "@USE@": { "gold_contexts": [ [ "0c8a99cac11953f26308128bfc058b-C001-66" ], [ "0c8a99cac11953f26308128bfc058b-C001-90" ], [ "0c8a99cac11953f26308128bfc058b-C001-120" ], [ "0c8a99cac11953f26308128bfc058b-C001-166" ] ], "cite_sentences": [ "0c8a99cac11953f26308128bfc058b-C001-66", "0c8a99cac11953f26308128bfc058b-C001-90", "0c8a99cac11953f26308128bfc058b-C001-120", "0c8a99cac11953f26308128bfc058b-C001-166" ] }, "@DIF@": { "gold_contexts": [ [ "0c8a99cac11953f26308128bfc058b-C001-93" ], [ "0c8a99cac11953f26308128bfc058b-C001-122" ], [ 
"0c8a99cac11953f26308128bfc058b-C001-162" ], [ "0c8a99cac11953f26308128bfc058b-C001-166" ] ], "cite_sentences": [ "0c8a99cac11953f26308128bfc058b-C001-93", "0c8a99cac11953f26308128bfc058b-C001-122", "0c8a99cac11953f26308128bfc058b-C001-162", "0c8a99cac11953f26308128bfc058b-C001-166" ] }, "@SIM@": { "gold_contexts": [ [ "0c8a99cac11953f26308128bfc058b-C001-117" ], [ "0c8a99cac11953f26308128bfc058b-C001-148" ] ], "cite_sentences": [ "0c8a99cac11953f26308128bfc058b-C001-117", "0c8a99cac11953f26308128bfc058b-C001-148" ] } } }, "ABC_64d11e9efeaa735f74585d6998bab7_9": { "x": [ { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-2", "text": "We investigate the utility of pre-existing question answering models and data for a recently proposed relation extraction task." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-3", "text": "We find that in the low-resource and zeroshot cases, such resources are surprisingly useful." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-4", "text": "Moreover, the resulting models show robust performance on a new test set we create from the task's original datasets." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-5", "text": "----------------------------------" }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-7", "text": "Knowledge Base Population (KBP, e.g.: Riedel et al., 2013; Sterckx et al., 2016) attempts to identify facts within raw text and convert them into triples consisting of a subject, object and the relation between them." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-8", "text": "One common form of this task is slot filling (Surdeanu and Heng, 2014) , in which a knowledge base (KB) query, such as place of birth(Obama, ?) is applied to a set of documents and a set of slot fillers is returned." 
}, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-9", "text": "By converting such KB queries to natural language questions, Levy et al. (2017) showed that a question answering (QA) system could be effectively applied to this task." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-10", "text": "However, their approach relied on a modified QA model architecture and a dedicated slot-filling training corpus." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-11", "text": "Here, we investigate the utility of standard QA data and models for this task." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-12", "text": "Our results show that this approach is effective in the zero-shot and low-resource cases, and is more robust on a set of test instances designed to challenge the models' ability to identify relations between subject and object." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-13", "text": "Figure 1 gives an overview of using QA on the slot-filling task." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-14", "text": "Starting at the top right, a KB query is translated into a natural language question, which can then be fed into a QA model that has been trained on an appropriate resource." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-15", "text": "When applied to a set of texts, this model needs to predict the correct answer within each text, including the possibility that a text contains no answer." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-16", "text": "Within this framework, we consider different models and training and test datasets, but we keep the translation of KB queries into natural language questions fixed, based on the crowd-sourced templates used by Levy et al. (2017) ." 
}, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-17", "text": "----------------------------------" }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-18", "text": "**PERFORMANCE ON THE ORIGINAL TASK**" }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-19", "text": "In our first experiment, we examine the utility of a QA dataset in relation to the slot-filling task of Levy et al. (2017) ." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-40", "text": "Random samples of 10 3 , 10 4 , 10 5 and 10 6 UWRE instances are added to our SQuAD training set, while leaving the SQuAD dev dataset untouched." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-20", "text": "Their zero-shot model generalised from seen relations to unseen relations by translating all relations into natural language question templates, such as Where was XXX born? for the relation place of birth." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-21", "text": "Identifying an instance of such a relation in text is then equivalent to finding an answer to the relevant question template, instantiated with the appropriate entity." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-22", "text": "However, such a model also needs to be able to identify when no answer is found in the text, and to achieve this they trained a slightly modified version of BiDAF (Seo et al., 2016) (Rajpurkar et al., 2016) , can be applied to the relation extraction task." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-23", "text": "This is motivated both by curiosity about the generalisation abilities of such QA models and also a practical interest in relation extraction applications." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-24", "text": "We first investigate the zero-shot case, where no examples of the relations are available, and then evaluate how performance improves as more data becomes available." 
}, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-25", "text": "Data We compare two sources of training data: The University of Washington relation extraction (UWRE) dataset created by Levy et al. (2017) and the Stanford Question Answering Dataset (SQuAD) created by Rajpurkar et al. (2016) ." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-26", "text": "The UWRE data is derived from WikiReading (Hewlett et al., 2016) , which is itself derived from WikiData (Vrande\u010di\u0107, 2012) , and consists of a set of positive and negative examples for relation extraction from Wikipedia sentences." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-27", "text": "Each instance consists of an entity, a relation, a question template for the relation and a sentence drawn from the wikipedia article for that entity which may or may not answer the question." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-28", "text": "Under the assumption that each relation triple found in a Wikipedia info-box is also expressed in the text of its article, the positive examples contain the first sentence from the article that contains both the subject and object of the triple." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-29", "text": "The negative examples also contain the subject entity of the relation, but express a different relation." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-30", "text": "Levy et al. (2017) provide a number of train/dev/test splits, to allow them to evaluate a variety of modes of generalisation." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-31", "text": "Here we use the relation and entity splits." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-32", "text": "The former tests the ability to generalise from one set of relations to another, i.e. to do zero-shot learning for the unseen relations in the test set." 
}, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-33", "text": "The latter tests the ability to generalise from one set of entities to another for the complete set of relations." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-34", "text": "We use this dataset to investigate how having access to various quantities of data about the test set relations changes performance." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-35", "text": "To build a dataset using SQuAD (Rajpurkar et al., 2016) , we construct negative examples by removing sentences that contain the answer, based on the spans provided by the annotators." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-36", "text": "In other words, we are left with the original question and a paragraph relevant to the topic of that question, but which typically no longer contains sentences answering it." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-37", "text": "Alongside these negative examples, we also retain the original SQuAD instances as positive examples." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-38", "text": "This process is applied to both the train and dev sets, allowing us to evaluate a model that uses only question answering data at training time." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-39", "text": "We also construct a series of datasets that combine increasing quantities of the UWRE entity split training set into the SQuAD training set, to evaluate the benefits of SQuAD when dedicated relation extraction data is limited." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-41", "text": "Models We employ the same modified BiDAF (Seo et al., 2016) model as Levy et al. (2017) , which uses an additional bias term to allow the model to signal when no answer is predicted within the text." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-42", "text": "Evaluation Following the approach of Levy et al. 
(2017) , we report F1 scores on the answers returned by the model." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-43", "text": "Under this measure, predicting correctly that a negative instance has no answer does not contribute to either precision or recall." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-44", "text": "However, returning an answer for such an instance does reduce precision." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-45", "text": "Results Table 1 reports the F1 scores for zeroshot relation extraction, using models trained on the original UWRE and SQuAD datasets." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-46", "text": "As can be seen, BIDAF is actually more effective at answering the questions for the unseen relation types in the UWRE test set when it is trained on a standard QA dataset, rather than a dedicated relation extraction dataset." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-47", "text": "Figure 2 plots how performance improves as more data is available about the relations in the test set." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-48", "text": "We compare training purely on UWRE instances to those same instances combined with the whole SQuAD dataset." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-49", "text": "As can be seen, when only small amounts of relation extraction data is available, combining this with the QA data gives a substantial boost to performance." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-50", "text": "Discussion The SQuAD trained model appears to be effective in the limited data and zero-shot cases, but contributes little when large numbers of examples of the relations of interest are available." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-51", "text": "In this case, the dedicated relation extraction model is able to achieve an F1 of around 90%, with or without augmentation with SQuAD." 
}, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-52", "text": "This level of performance suggests that such a model would be accurate enough for practical applications." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-53", "text": "However, test set performance may not be a reliable indicator of the model's ability to generalise to more challenging examples (Jia and Liang, 2017) ." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-54", "text": "----------------------------------" }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-55", "text": "**GENERALISATION TO A CHALLENGE TEST SET**" }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-56", "text": "In this second experiment, we want to test the ability of the models decribed above to generalise to data beyond the UWRE test set." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-57", "text": "In particular, we want to verify that the BiDAF model is able to recognise the assertion of a relation between the entity and the answer, rather than just recognising an answer phrase of the right type." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-58", "text": "Data We construct a challenge test set of negative examples based on sentences which are about the wrong entity but which do contain potential answers that are valid for the question and relation type." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-59", "text": "Thus, each positive example from the original UWRE entity test set is turned into a negative example by pairing the sentence with an equivalent question about another entity." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-60", "text": "A model that has merely learnt to identify answer spans of the right form, irrespective of their relation to the rest of the sentence, is likely to return the original span rather than recognise that the sentence no longer contains an answer." 
}, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-61", "text": "We then build new train, dev and test sets (UWRE+) from the original entity split datasets in which half the original negative instances have been replaced with these more challenging instances." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-62", "text": "As before, a series of datasets combining SQuAD with increasing amounts of this new data is also constructed." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-63", "text": "Models We re-use the UWRE and SQuAD trained models in addition to training on the UWRE+ datasets described in the previous section." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-64", "text": "Evaluation On the challenge test set, we are interested in the ability of the models to correctly identify that there is no answer in the sentence." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-65", "text": "In this case, F1 is not an appropriate measure, as there are no positive instances." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-66", "text": "Instead, we use accuracy of the predictions, which in this case is just the number of 'no answer' predictions divided by the total number of instances." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-67", "text": "Model P R F1 BiDAF 0.40 0.34 0.37 FastQA 0.49 0.19 0.28 Results Table 2 reports the accuracy of predictions on the challenge test set of negative examples." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-68", "text": "Although the original UWRE model achieved an F1 of around 90% on the unmodified test set, here it only manages to get 2% of its predictions correct." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-69", "text": "In contrast, the modified UWRE+ training data results in a model that is much more accurate, predicting over 70% of the negative examples correctly." 
}, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-70", "text": "Nonetheless, the performance of the SQuAD trained model is stronger still, even without modification to address this problem." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-71", "text": "Figure 3 shows the accuracy on the challenge test set as increasing quantities of relation extraction instances are added to SQuAD." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-72", "text": "Looking first at the effect of adding the original UWRE training instances, performance drops dramatically as the size of this expansion increases." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-73", "text": "In contrast, as the quantity of UWRE+ data grows, performance improves, peaking at around 100,000 instances, which is around the same size as SQuAD." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-74", "text": "----------------------------------" }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-75", "text": "**DISCUSSION**" }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-76", "text": "The results on our challenge test set suggest that the model does not learn to examine the relation between the answer span and the relation subject unless the training data requires it." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-77", "text": "In the case of SQuAD, the fact that the answer has to be found within a multi-sentence paragraph provides enough potential distractors to overcome this issue." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-78", "text": "Other models may show different patterns of strength and weakness, but to be able to investigate and exploit further QA systems quickly would require a means of producing 'no answer' predictions without the need to modify the model implementation." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-79", "text": "4 Using an unmodified QA model for slot filling Levy et al. 
(2017) modify the BiDAF architecture to produce an additional output representing the probability that no answer is present in the text." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-80", "text": "In this experiment, we investigate whether it is possible to adapt a QA model to the slot filling task without having to understand and modify its internal structure and implementation." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-81", "text": "[Table: Model Acc | BiDAF 0.82 | FastQA 0.99] Our approach merely requires prefixing all texts with a dummy token that stands in for the answer when no real answer is present." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-82", "text": "Data We train our models on a modified version of SQuAD, which has been augmented with negative examples by removing answer spans, as described in Section 2, and then had the token NoAnswerFound inserted into every text and as the answer for the negative examples, as described above." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-83", "text": "Models We train both BiDAF (Seo et al., 2016) and FastQA (Weissenborn et al., 2017) models on the modified SQuAD training data, using their standard architectures and hyperparameters." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-84", "text": "Evaluation We evaluate F1 on the same zero-shot evaluation considered in Section 2 and also accuracy on the challenge test set from Section 3." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-85", "text": "Results Table 3 reveals that the unmodified BiDAF model is almost as effective as the Levy et al. (2017) model in terms of zero-shot F1 on the original UWRE test set." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-86", "text": "In contrast, FastQA's performance is substantially worse." 
}, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-87", "text": "However, Table 4 reveals that FastQA is extremely accurate on the challenge test set, while BiDAF's performance is comparable to the modified model trained on SQuAD." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-88", "text": "The unmodified BiDAF and FastQA architectures have complementary strengths on the two evaluations." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-89", "text": "FastQA's strong performance on the challenge instances may be related to its use of binary features indicating whether a word was present in the question." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-90", "text": "----------------------------------" }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-91", "text": "**CONCLUSION**" }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-92", "text": "We showed that standard QA models and data can be easily applied to the slot-filling task, using some straightforward data pre-processing." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-93", "text": "These models were reasonably effective in a zero-shot evaluation and robust on a new test set containing challenging examples." }, { "sent_id": "64d11e9efeaa735f74585d6998bab7-C001-94", "text": "Future work will investigate how to dispense with the need for translation of KB queries into questions using crowd-sourced templates." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "64d11e9efeaa735f74585d6998bab7-C001-9" ], [ "64d11e9efeaa735f74585d6998bab7-C001-19" ], [ "64d11e9efeaa735f74585d6998bab7-C001-25" ], [ "64d11e9efeaa735f74585d6998bab7-C001-30" ] ], "cite_sentences": [ "64d11e9efeaa735f74585d6998bab7-C001-9", "64d11e9efeaa735f74585d6998bab7-C001-19", "64d11e9efeaa735f74585d6998bab7-C001-25", "64d11e9efeaa735f74585d6998bab7-C001-30" ] }, "@MOT@": { "gold_contexts": [ [ "64d11e9efeaa735f74585d6998bab7-C001-10", "64d11e9efeaa735f74585d6998bab7-C001-9" ], [ "64d11e9efeaa735f74585d6998bab7-C001-79", "64d11e9efeaa735f74585d6998bab7-C001-80" ] ], "cite_sentences": [ "64d11e9efeaa735f74585d6998bab7-C001-10", "64d11e9efeaa735f74585d6998bab7-C001-9", "64d11e9efeaa735f74585d6998bab7-C001-79" ] }, "@USE@": { "gold_contexts": [ [ "64d11e9efeaa735f74585d6998bab7-C001-16" ], [ "64d11e9efeaa735f74585d6998bab7-C001-41" ], [ "64d11e9efeaa735f74585d6998bab7-C001-42" ] ], "cite_sentences": [ "64d11e9efeaa735f74585d6998bab7-C001-16", "64d11e9efeaa735f74585d6998bab7-C001-41", "64d11e9efeaa735f74585d6998bab7-C001-42" ] }, "@SIM@": { "gold_contexts": [ [ "64d11e9efeaa735f74585d6998bab7-C001-41" ] ], "cite_sentences": [ "64d11e9efeaa735f74585d6998bab7-C001-41" ] }, "@DIF@": { "gold_contexts": [ [ "64d11e9efeaa735f74585d6998bab7-C001-85" ] ], "cite_sentences": [ "64d11e9efeaa735f74585d6998bab7-C001-85" ] } } }, "ABC_eb51af7d0487fc0795616aecfae9fb_9": { "x": [ { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-103", "text": "We decayed the learning rate by half if the log-likelihood on the validation set did not improve for an epoch." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-104", "text": "Hyper-parameters we selected were D = 200, H = 400, N = 200, E = 50, C = 5, and Q = 2." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-105", "text": "We re-normalized the embedding after each epoch (Hinton et al., 2012) ." 
}, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-2", "text": "Neural network-based encoder-decoder models are among recent attractive methodologies for tackling natural language generation tasks." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-3", "text": "This paper investigates the usefulness of structural syntactic and semantic information additionally incorporated in a baseline neural attention-based model." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-4", "text": "We encode results obtained from an abstract meaning representation (AMR) parser using a modified version of Tree-LSTM." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-5", "text": "Our proposed attention-based AMR encoder-decoder model improves headline generation benchmarks compared with the baseline neural attention-based model." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-6", "text": "----------------------------------" }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-8", "text": "Neural network-based encoder-decoder models are cutting-edge methodologies for tackling natural language generation (NLG) tasks, i.e., machine translation (Cho et al., 2014) , image captioning (Vinyals et al., 2015) , video description (Venugopalan et al., 2015) , and headline generation (Rush et al., 2015) ." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-9", "text": "This paper also shares a similar goal and motivation to previous work: improving the encoderdecoder models for natural language generation." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-10", "text": "There are several directions for enhancement." 
}, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-11", "text": "This paper respects the fact that NLP researchers have expended an enormous amount of effort to develop fundamental NLP techniques such as POS tagging, dependency parsing, named entity recognition, and semantic role labeling." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-12", "text": "Intuitively, this structural, syntactic, and semantic information underlying input text has the potential for improving the quality of NLG tasks." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-13", "text": "However, to the best of our knowledge, there is no clear evidence that syntactic and semantic information can enhance the recently developed encoder-decoder models in NLG tasks." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-14", "text": "To answer this research question, this paper proposes and evaluates a headline generation method based on an encoder-decoder architecture on Abstract Meaning Representation (AMR)." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-15", "text": "The method is essentially an extension of attention-based summarization (ABS) (Rush et al., 2015) ." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-16", "text": "Our proposed method encodes results obtained from an AMR parser by using a modified version of Tree-LSTM encoder (Tai et al., 2015) as additional information of the baseline ABS model." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-17", "text": "Conceptually, the reason for using AMR for headline generation is that information presented in AMR, such as predicate-argument structures and named entities, can be effective clues when producing shorter summaries (headlines) from original longer sentences." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-18", "text": "We expect that the quality of headlines will improve with this reasonable combination (ABS and AMR)." 
}, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-19", "text": "----------------------------------" }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-20", "text": "**ATTENTION-BASED SUMMARIZATION (ABS)**" }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-21", "text": "ABS proposed in Rush et al. (2015) has achieved state-of-the-art performance on the benchmark data of headline generation including the DUC-2004 dataset (Over et al., 2007) ." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-22", "text": "Figure 1 illustrates the model structure of ABS." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-23", "text": "The model predicts a word sequence (summary) based on the combination of the neural network language model and an input sentence encoder." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-24", "text": "Let V be a vocabulary." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-25", "text": "x i is the i-th indicator vector corresponding to the i-th word in the input sentence." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-26", "text": "Suppose we have M words of an input sentence." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-27", "text": "X represents an input sentence, which canadian prime \u2026 year canada \u2026 nato" }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-28", "text": "input sentence headline is represented as a sequence of indicator vectors, whose length is M ." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-29", "text": "That is, x i \u2208 {0, 1} |V | , and" }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-30", "text": "whose length is L. Here, we assume L < M ." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-31", "text": "Y C,i is a short notation of the list of vectors, which consists of the sub-sequence in Y from y i\u2212C+1 to y i ." 
}, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-32", "text": "We assume a one-hot vector for a special start symbol, such as \"\u27e8S\u27e9\", when i < 1." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-33", "text": "Then, ABS outputs a summary\u0176 given an input sentence X as follows:" }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-34", "text": "where nnlm(Y C,i ) is a feed-forward neural network language model proposed in (Bengio et al., 2003) , and enc(X, Y C,i ) is an input sentence encoder with attention mechanism." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-35", "text": "This paper uses D and H as denoting sizes (dimensions) of vectors for word embedding and hidden layer, respectively." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-36", "text": "Let E \u2208 R D\u00d7|V | be an embedding matrix of output words." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-37", "text": "Moreover, let U \u2208 R H\u00d7(CD) and O \u2208 R |V |\u00d7H be weight matrices of hidden and output layers, respectively 1 ." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-38", "text": "Using the above notations, nnlm(Y C,i ) in Equation 3 can be written as follows:" }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-39", "text": "where\u1ef9 c is a concatenation of output embedding vectors from i \u2212 C + 1 to i, that is,\u1ef9 c = (Ey i\u2212C+1 \u00b7 \u00b7 \u00b7 Ey i )." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-40", "text": "Therefore,\u1ef9 c is a (CD) dimensional vector." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-41", "text": "Next, F \u2208 R D\u00d7|V | and E \u2032 \u2208 R D\u00d7|V | denote embedding matrices of input and output words, respectively." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-42", "text": "O \u2032 \u2208 R |V |\u00d7D is a weight matrix for the output layer." 
}, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-43", "text": "P \u2208 R D\u00d7(CD) is a weight matrix for mapping embedding of C output words onto embedding of input words." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-44", "text": "X is a matrix form of a list of input embeddings, namely," }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-45", "text": ") is defined as the following equations:" }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-46", "text": "where\u1ef9 \u2032 c is a concatenation of output embedding vectors from i \u2212 C + 1 to i similar to\u1ef9 c , that is," }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-47", "text": "Moreover,X is a matrix form of a list of averaged input word embeddings within window size Q," }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-48", "text": "Qx q ." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-49", "text": "Equation 6 is generally referred to as the attention model, which is introduced to encode a relationship between input words and the previous C output words." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-50", "text": "For example, if the previous C output words are assumed to align to x i , then the surrounding Q words (x i\u2212Q , ." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-51", "text": ". . , x i+Q ) are highly weighted by Equation 5." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-52", "text": "----------------------------------" }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-53", "text": "**PROPOSED METHOD**" }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-54", "text": "Our assumption here is that syntactic and semantic features of an input sentence can greatly help for generating a headline." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-55", "text": "For example, the meanings of subjects, predicates, and objects in a generated headline should correspond to the ones appearing in an input sentence." 
}, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-56", "text": "Thus, we incorporate syntactic and semantic features into the framework of headline generation." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-57", "text": "This paper uses an AMR as a case study of the additional features." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-58", "text": "----------------------------------" }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-59", "text": "**AMR**" }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-60", "text": "An AMR is a rooted, directed, acyclic graph that encodes the meaning of a sentence." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-61", "text": "Nodes in an AMR graph represent 'concepts', and directed edges represent a relationship between nodes." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-62", "text": "Concepts consist of English words, PropBank event predicates, and special labels such as \"person\"." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-63", "text": "For edges, AMR has approximately 100 relations (Banarescu et al., 2013) including semantic roles based on the PropBank annotations in OntoNotes (Hovy et al., 2006) ." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-64", "text": "To acquire AMRs for input sentences, we use the state-of-the-art transition-based AMR parser (Wang et al., 2015) ." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-65", "text": "Figure 2 shows a brief sketch of the model structure of our attention-based AMR encoder model." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-66", "text": "We utilize a variant of child-sum Tree-LSTM originally proposed in (Tai et al., 2015) to encode syntactic and semantic information obtained from output of the AMR parser into certain fixed-length embedding vectors." 
}, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-67", "text": "To simplify the computation, we transform a DAG structure of AMR parser output to a tree structure, which we refer to as \"tree-converted AMR structure\"." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-68", "text": "This transformation can be performed by separating multiple head nodes, which often appear for representing coreferential concepts, to a corresponding number of out-edges to head nodes." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-69", "text": "Then, we straightforwardly modify Tree-LSTM to also encode edge labels since AMR provides both node and edge labels, and original Tree-LSTM only encodes node labels." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-70", "text": "----------------------------------" }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-71", "text": "**ATTENTION-BASED AMR ENCODER**" }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-72", "text": "Let n j and e j be N and E dimensional embeddings for labels assigned to the j-th node, and the out-edge directed to its parent node 2 ." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-73", "text": "W in , W f n , W on , W un \u2208 R D\u00d7N are weight matrices 2 We prepare a special edge embedding for a root node." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-74", "text": "for node embeddings n j 3 ." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-75", "text": "Similarly, W ie , W f e , W oe , W ue \u2208 R D\u00d7E are weight matrices for edge embeddings e j ." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-76", "text": "W ih , W f h , W oh , W uh \u2208 R D\u00d7D are weight matrices for output vectors connected from child nodes." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-77", "text": "B(j) represents a set of nodes, which have a direct edge to the j-th node in our treeconverted AMR structure." 
}, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-78", "text": "Then, we define embedding a j obtained at node j in tree-converted AMR structure via Tree-LSTM as follows:" }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-79", "text": "Let J represent the number of nodes in treeconverted AMR structure obtained from a given input sentence." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-80", "text": "We introduce A \u2208 R D\u00d7J as a matrix form of a list of hidden states a j for all j, namely, A = [a 1 , . . . , a J ]." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-81", "text": "Let O \u2032\u2032 \u2208 R |V |\u00d7D be a weight matrix for the output layer." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-82", "text": "Let S \u2208 R D\u00d7(CD) be a weight matrix for mapping the context embedding of C output words onto embeddings obtained from nodes." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-83", "text": "Then, we define the attention-based AMR encoder 'encAMR(A, Y C,i )' as follows:" }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-84", "text": "Finally, we combine our attention-based AMR encoder shown in Equation 14 as an additional term of Equation 3 to build our headline generation system." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-85", "text": "----------------------------------" }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-86", "text": "**EXPERIMENTS**" }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-87", "text": "To demonstrate the effectiveness of our proposed method, we conducted experiments on benchmark data of the abstractive headline generation task described in Rush et al. (2015) ." 
}, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-88", "text": "----------------------------------" }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-89", "text": "**DUC-2004**" }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-90", "text": "Gigaword test data used Gigaword in (Rush et al., 2015) Our sampled test data (Rush et al., 2015)" }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-91", "text": "For a fair comparison, we followed their evaluation setting." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-92", "text": "The training data was obtained from the first sentence and the headline of a document in the annotated Gigaword corpus (Napoles et al., 2012) 4 ." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-93", "text": "The development data is DUC-2003 data, and test data are both DUC-2004 (Over et al., 2007) and sentence-headline pairs obtained from the annotated Gigaword corpus as well as training data 5 ." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-94", "text": "All of the generated headlines were evaluated by ROUGE (Lin, 2004) 6 ." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-95", "text": "For evaluation on DUC-2004, we removed strings after 75-characters for each generated headline as described in the DUC-2004 evaluation." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-96", "text": "For evaluation on Gigaword, we forced the system outputs to be at most 8 words as in Rush et al. (2015) since the average length of headline in Gigaword is 8.3 words." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-97", "text": "For the preprocessing for all data, all letters were converted to lower case, all digits were replaced with '#', and words appearing less than five times with 'UNK'." 
}, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-98", "text": "Note that, for further evaluation, we prepared 2,000 sentence-headline pairs randomly sampled from the test data section of the Gigaword corpus as our additional test data." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-99", "text": "In our experiments, we refer to the baseline neural attention-based abstractive summarization method described in Rush et al. (2015) as \"ABS\", and our proposed method of incorporating AMR structural information by a neural encoder to the baseline method described in Section 3 as \"ABS+AMR\"." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-100", "text": "Additionally, we also evaluated the performance of the AMR encoder without the attention mechanism, which we refer to as \"ABS+AMR(w/o attn)\", to investigate the contribution of the attention mechanism on the AMR encoder." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-101", "text": "For the parameter estimation (training), we used stochastic gradient descent to learn parameters." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-102", "text": "We tried several values for the initial learning rate, and selected the value that achieved the best performance for each method." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-106", "text": "For ABS+AMR, we used the two-step training scheme to accelerate the training speed." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-107", "text": "The first phase learns the parameters of the ABS." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-108", "text": "The second phase trains the parameters of the AMR encoder by using 1 million training pairs while the parameters of the baseline ABS were fixed and unchanged to prevent overfitting." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-109", "text": "Table 1 shows the recall of ROUGE (Lin, 2004 ) on each dataset." 
}, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-110", "text": "ABS (re-run) represents the performance of ABS re-trained by the distributed scripts 7 ." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-111", "text": "We can see that the proposed method, ABS+AMR, outperforms the baseline ABS on all datasets." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-112", "text": "In particular, ABS+AMR achieved statistically significant gain from ABS (re-run) for ROUGE-1 and ROUGE-2 on DUC-2004." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-113", "text": "However in contrast, we observed that the improvements on Gigaword (the same test data as Rush et al. (2015) ) seem to be limited compared with the DUC-2004 dataset." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-114", "text": "We assume that this limited gain is caused largely by the quality of AMR parsing results." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-115", "text": "This means that the 7 https://github.com/facebook/NAMAS I(1): crown prince abdallah ibn abdel aziz left saturday at the head of saudi arabia 's delegation to the islamic summit in islamabad , the official news agency spa reported ." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-116", "text": "G: saudi crown prince leaves for islamic summit A: crown prince leaves for islamic summit in saudi arabia P: saudi crown prince leaves for islamic summit in riyadh I(2): a massive gothic revival building once christened the lunatic asylum west of the was auctioned off for $ #." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-117", "text": "# million -lrbeuro# ." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-118", "text": "# million -rrb-." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-119", "text": "G: massive ##th century us mental hospital fetches $ #." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-120", "text": "# million at auction A: west african art sells for $ #." 
}, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-121", "text": "# million in P: west african art auctioned off for $ #." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-122", "text": "# million I(3): brooklyn , the new bastion of cool for many new yorkers , is poised to go mainstream chic ." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-123", "text": "G: high-end retailers are scouting sites in brooklyn A: new yorkers are poised to go mainstream with chic P: new york city is poised to go mainstream chic Figure 3 : Examples of generated headlines on Gigaword." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-124", "text": "I: input, G: true headline, A: ABS (re-run), and P: ABS+AMR." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-125", "text": "Gigaword test data provided by Rush et al. (2015) is already pre-processed." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-126", "text": "Therefore, the quality of the AMR parsing results seems relatively worse on this pre-processed data since, for example, many low-occurrence words in the data were already replaced with \"UNK\"." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-127", "text": "To provide evidence of this assumption, we also evaluated the performance on our randomly selected 2,000 sentence-headline test data also taken from the test data section of the annotated Gigaword corpus." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-128", "text": "\"Gigaword (randomly sampled)\" in Table 1 shows the results of this setting." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-129", "text": "We found the statistical difference between ABS(re-run) and ABS+AMR on ROUGE-1 and ROUGE-2." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-130", "text": "We can also observe that ABS+AMR achieved the best ROUGE-1 scores on all of the test data." 
}, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-131", "text": "According to this fact, ABS+AMR tends to successfully yield semantically important words." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-132", "text": "In other words, embeddings encoded through the AMR encoder are useful for capturing important concepts in input sentences." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-133", "text": "Figure 3 supports this observation." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-134", "text": "For example, ABS+AMR successfully added the correct modifier 'saudi' to \"crown prince\" in the first example." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-135", "text": "Moreover, ABS+AMR generated a consistent subject in the third example." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-136", "text": "The comparison between ABS+AMR(w/o attn) and ABS+AMR (with attention) suggests that the attention mechanism is necessary for AMR encoding." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-137", "text": "In other words, the encoder without the attention mechanism tends to be overfitting." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-138", "text": "----------------------------------" }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-139", "text": "**RELATED WORK**" }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-140", "text": "Recently, the Recurrent Neural Network (RNN) and its variant have been applied successfully to various NLP tasks." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-141", "text": "For headline generation tasks, Chopra et al. (2016) exploited the RNN decoder (and its variant) with the attention mechanism instead of the method of Rush et al. (2015) : the combination of the feed-forward neural network language model and attention-based sentence encoder. also adapted the RNN encoder-decoder with attention for headline generation tasks." 
}, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-142", "text": "Moreover, they made some efforts such as hierarchical attention to improve the performance." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-143", "text": "In addition to using a variant of RNN, proposed a method to handle infrequent words in natural language generation." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-144", "text": "Note that these recent developments do not conflict with our method using the AMR encoder." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-145", "text": "This is because the AMR encoder can be straightforwardly incorporated into their methods as we have done in this paper, incorporating the AMR encoder into the baseline." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-146", "text": "We believe that our AMR encoder can possibly further improve the performance of their methods." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-147", "text": "We will test that hypothesis in future study." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-148", "text": "----------------------------------" }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-149", "text": "**CONCLUSION**" }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-150", "text": "This paper mainly discussed the usefulness of incorporating structural syntactic and semantic information into novel attention-based encoder-decoder models on headline generation tasks." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-151", "text": "We selected abstract meaning representation (AMR) as syntactic and semantic information, and proposed an attention-based AMR encoder-decoder model." }, { "sent_id": "eb51af7d0487fc0795616aecfae9fb-C001-152", "text": "The experimental results of headline generation benchmark data showed that our attention-based AMR encoder-decoder model successfully improved standard automatic evaluation measures of headline generation tasks, ROUGE-1, ROUGE-2, and ROUGE-L. 
We believe that our results provide empirical evidence that syntactic and semantic information obtained from an automatic parser can help to improve the neural encoder-decoder approach in NLG tasks." } ], "y": { "@BACK@": { "gold_contexts": [ [ "eb51af7d0487fc0795616aecfae9fb-C001-8" ], [ "eb51af7d0487fc0795616aecfae9fb-C001-21" ], [ "eb51af7d0487fc0795616aecfae9fb-C001-22" ], [ "eb51af7d0487fc0795616aecfae9fb-C001-33" ], [ "eb51af7d0487fc0795616aecfae9fb-C001-106" ], [ "eb51af7d0487fc0795616aecfae9fb-C001-107" ], [ "eb51af7d0487fc0795616aecfae9fb-C001-108" ], [ "eb51af7d0487fc0795616aecfae9fb-C001-110" ], [ "eb51af7d0487fc0795616aecfae9fb-C001-141" ] ], "cite_sentences": [ "eb51af7d0487fc0795616aecfae9fb-C001-8", "eb51af7d0487fc0795616aecfae9fb-C001-21", "eb51af7d0487fc0795616aecfae9fb-C001-22", "eb51af7d0487fc0795616aecfae9fb-C001-33", "eb51af7d0487fc0795616aecfae9fb-C001-106", "eb51af7d0487fc0795616aecfae9fb-C001-107", "eb51af7d0487fc0795616aecfae9fb-C001-108", "eb51af7d0487fc0795616aecfae9fb-C001-110", "eb51af7d0487fc0795616aecfae9fb-C001-141" ] }, "@EXT@": { "gold_contexts": [ [ "eb51af7d0487fc0795616aecfae9fb-C001-15" ], [ "eb51af7d0487fc0795616aecfae9fb-C001-16" ] ], "cite_sentences": [ "eb51af7d0487fc0795616aecfae9fb-C001-15", "eb51af7d0487fc0795616aecfae9fb-C001-16" ] }, "@UNSURE@": { "gold_contexts": [ [ "eb51af7d0487fc0795616aecfae9fb-C001-18" ], [ "eb51af7d0487fc0795616aecfae9fb-C001-90" ], [ "eb51af7d0487fc0795616aecfae9fb-C001-100" ], [ "eb51af7d0487fc0795616aecfae9fb-C001-130" ], [ "eb51af7d0487fc0795616aecfae9fb-C001-131" ], [ "eb51af7d0487fc0795616aecfae9fb-C001-134" ], [ "eb51af7d0487fc0795616aecfae9fb-C001-135" ], [ "eb51af7d0487fc0795616aecfae9fb-C001-136" ] ], "cite_sentences": [ "eb51af7d0487fc0795616aecfae9fb-C001-18", "eb51af7d0487fc0795616aecfae9fb-C001-90", "eb51af7d0487fc0795616aecfae9fb-C001-100", "eb51af7d0487fc0795616aecfae9fb-C001-130", "eb51af7d0487fc0795616aecfae9fb-C001-131", "eb51af7d0487fc0795616aecfae9fb-C001-134", 
"eb51af7d0487fc0795616aecfae9fb-C001-135", "eb51af7d0487fc0795616aecfae9fb-C001-136" ] }, "@USE@": { "gold_contexts": [ [ "eb51af7d0487fc0795616aecfae9fb-C001-87" ], [ "eb51af7d0487fc0795616aecfae9fb-C001-92" ], [ "eb51af7d0487fc0795616aecfae9fb-C001-93" ], [ "eb51af7d0487fc0795616aecfae9fb-C001-96" ], [ "eb51af7d0487fc0795616aecfae9fb-C001-98" ], [ "eb51af7d0487fc0795616aecfae9fb-C001-99" ], [ "eb51af7d0487fc0795616aecfae9fb-C001-113" ], [ "eb51af7d0487fc0795616aecfae9fb-C001-125" ], [ "eb51af7d0487fc0795616aecfae9fb-C001-127" ] ], "cite_sentences": [ "eb51af7d0487fc0795616aecfae9fb-C001-87", "eb51af7d0487fc0795616aecfae9fb-C001-92", "eb51af7d0487fc0795616aecfae9fb-C001-93", "eb51af7d0487fc0795616aecfae9fb-C001-96", "eb51af7d0487fc0795616aecfae9fb-C001-98", "eb51af7d0487fc0795616aecfae9fb-C001-99", "eb51af7d0487fc0795616aecfae9fb-C001-113", "eb51af7d0487fc0795616aecfae9fb-C001-125", "eb51af7d0487fc0795616aecfae9fb-C001-127" ] }, "@SIM@": { "gold_contexts": [ [ "eb51af7d0487fc0795616aecfae9fb-C001-96" ], [ "eb51af7d0487fc0795616aecfae9fb-C001-113" ] ], "cite_sentences": [ "eb51af7d0487fc0795616aecfae9fb-C001-96", "eb51af7d0487fc0795616aecfae9fb-C001-113" ] }, "@DIF@": { "gold_contexts": [ [ "eb51af7d0487fc0795616aecfae9fb-C001-111" ], [ "eb51af7d0487fc0795616aecfae9fb-C001-112" ], [ "eb51af7d0487fc0795616aecfae9fb-C001-129" ] ], "cite_sentences": [ "eb51af7d0487fc0795616aecfae9fb-C001-111", "eb51af7d0487fc0795616aecfae9fb-C001-112", "eb51af7d0487fc0795616aecfae9fb-C001-129" ] } } }, "ABC_0a93feafef3ba2d4bb5360ff215171_9": { "x": [ { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-2", "text": "We introduce MASSES, a simple evaluation metric for the task of Visual Question Answering (VQA)." 
}, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-3", "text": "In its standard form, the VQA task is operationalized as follows: Given an image and an open-ended question in natural language, systems are required to provide a suitable answer." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-4", "text": "Currently, model performance is evaluated by means of a somehow simplistic metric: If the predicted answer is chosen by at least 3 human annotators out of 10, then it is 100% correct." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-5", "text": "Though intuitively valuable, this metric has some important limitations." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-6", "text": "First, it ignores whether the predicted answer is the one selected by the Majority (MA) of annotators." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-7", "text": "Second, it does not account for the quantitative Subjectivity (S) of the answers in the sample (and dataset)." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-8", "text": "Third, information about the Semantic Similarity (SES) of the responses is completely neglected." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-9", "text": "Based on such limitations, we propose a multi-component metric that accounts for all these issues." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-10", "text": "We show that our metric is effective in providing a more fine-grained evaluation both on the quantitative and qualitative level." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-11", "text": "----------------------------------" }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-12", "text": "**INTRODUCTION**" }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-13", "text": "Since its introduction, the task of Visual Question Answering (VQA) [4] has received considerable attention in the Vision and Language community." 
}, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-14", "text": "The task is straightforward: Given an image and a question in natural language, models are asked to output the correct answer." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-15", "text": "This is usually treated as a classification problem, where answers are categories that are inferred using features from imagequestion pairs." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-16", "text": "Traditionally, two main versions of the tasks * Shailza and Sandro share the first authorship." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-17", "text": "have been proposed: One, multiple-choice, requires models to pick up the correct answer among a limited set of options; the other, open-ended, challenges systems to guess the correct answer from the whole vocabulary." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-18", "text": "Several metrics have been proposed recently for evaluating VQA systems (see section 2), but accuracy is still the most commonly used evaluation criterion [4, 11, 23, 42, 44, 1, 5, 14, 45, 2] ." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-19", "text": "In the multiple-choice setting, where only one answer is correct, accuracy is given by the proportion of correctly-predicted cases." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-20", "text": "In the open-ended setting, accuracy is instead based on human annotations for the question: ACC = min( humans that said answer 3 , 1)" }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-21", "text": "Using the official VQA Evaluation Tool, that averages accuracy over all 10 choose 9 sets of human annotators, an answer is considered as 100% accurate if at least 4 workers out of 10 voted for it, 90% if the annotators were 3, 60% if they were 2, 30% if the answer was chosen by just one Figure 2 ." 
}, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-22", "text": "Examples of VQA questions and answers in the open-ended setting." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-23", "text": "Given the image on the left and the third question 'How is the veggies being cut?', currently a model gets accuracy 100% in case it outputs 'diced' (4 occurrences), 60% if it outputs either 'cubed' or 'squares' (2), 30% for 'with knife' (1) , and 0% for any other response." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-24", "text": "The overall accuracy is obtained by averaging through samples." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-25", "text": "worker, 0% in case no one opted for it." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-26", "text": "1 Being based on the responses provided by 10 different workers, the evaluation of VQA in this setting is therefore driven by a wisdom of the crowd [12] criterion: The answer is 'perfectly' correct if more than one third annotators agree on that, 'almost' correct if the agreement involves one fifth of the workers, 'a bit' correct if provided by only one worker." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-27", "text": "That is, the degree of correctness is a function of annotators agreement." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-28", "text": "Though intuitively valuable, this metric has some important limitations." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-29", "text": "First, it ignores whether the predicted answer is the one selected by the majority of annotators or by just a smaller fraction of them." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-30", "text": "For example, in the second question in Figure 2 a model gets a 100% accuracy by answering 'yes', though this is not the most-voted option, which is 'no'." 
}, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-31", "text": "Second, it does not account for the quantitative subjectivity of the responses for a given question." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-32", "text": "Based on the number of unique responses assigned by annotators, for example, the first question in Figure 2 (2 unique responses) looks intuitively less subjective compared to the third (5), but this aspect does not play any role in the evaluation." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-33", "text": "Third, information about semantic similarity of responses is completely neglected." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-34", "text": "That is, samples where the responses are very semantically similar (e.g., first question in Figure 2 ) are not considered differently from cases where they are less similar (e.g., third question) or completely dissimilar (e.g., second question)." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-35", "text": "Based on such limitations, we focus on open-ended VQA and propose MASSES, 2 a simple multi-component metric 1 From now on, we will report accuracy values as obtained with VQA Evaluation Tool: https://github.com/GT-Vision-Lab/VQA 2 Details and the code for computing MASSES will be available at the project page: https://sapmlresearch.github.io/MaSSeS/ that jointly accounts for all these issues (see Figure 1 )." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-36", "text": "In particular, MASSES combines a Majority component (MA) with a Subjectivity component (S) both endowed with Semantic Similarity (SES)." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-37", "text": "Similarly to the current evaluations, the output of the metric is a single score that measures the accuracy in the task." 
}, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-38", "text": "By means of thorough analyses, we show that jointly considering this information is quantitatively and qualitatively better than using current evaluations." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-39", "text": "Moreover, our findings reveal that better exploiting the 'wisdom of the crowd' available in human annotation is beneficial to gain a fine-grained understanding of VQA." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-40", "text": "----------------------------------" }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-41", "text": "**RELATED WORK**" }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-42", "text": "In recent years, a number of VQA datasets have been proposed: VQA 1.0 [4] , VQA-abstract [1] , VQA 2.0 [47, 14] , FM-IQA [13] , DAQUAR [24] , COCO-QA [30] , Visual Madlibs [46] , Visual Genome [20] , VizWiz [16] , Visual7W [48] , TDIUC [18] , CLEVR [17] , SHAPES [3] , Visual Reasoning [34] , Embodied QA [7] . What all these resources have in common is the task for which they were designed: Given an image (either real or abstract) and a question in natural language, models are asked to correctly answer the question." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-43", "text": "Depending on the characteristics of the dataset and the models proposed, various ways to evaluate performance have been explored." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-44", "text": "Accuracy is the most common metric." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-45", "text": "Traditionally, VQA is treated as a classification task, either in a multiplechoice (limited set of answers) or open-ended (whole vocabulary) setting." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-46", "text": "In the multiple-choice setting, there is just one correct (or ground-truth) answer among a number of alternatives called decoys [4, 46, 48, 20] ." 
}, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-47", "text": "As such, ac-curacy is simply computed by counting the predictions of the model that match the ground-truth answer. What can affect the difficulty of the task in this setting is the type of decoys selected." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-48", "text": "Indeed, recent work has proposed methods to harvest more challenging alternatives on the basis of their consistency and semantic similarity with the correct response [6] ." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-49", "text": "Similar approaches have been exploited in the domains of visual dialogue [8] and multiple-choice image captioning [10] ." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-50", "text": "In the open-ended setting, accuracy can be computed in terms of Exact Matching between predicted and ground-truth answer [20, 3, 17, 34] ." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-51", "text": "Though suitable for synthetic datasets where there is just one, automatically-generated answer, this approach cannot be applied to datasets where various answers have been provided by multiple human annotators." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-52", "text": "To account for the variability among 10 crowdsourced answers, [4] proposed a metric which considers as 100% correct an answer that was provided by more than 3 annotators out of 10." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-53", "text": "If 3, 2 or 1 voted for it, the model accuracy is 90%, 60%, and 30%, respectively." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-54", "text": "Being simple to compute and interpret, this metric (hence, VQA3+) is the standard evaluation criterion for open-ended VQA [4, 1, 16, 47, 14] ." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-55", "text": "However, it has some important limitations." 
}, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-56", "text": "(a) It ignores whether an answer that was chosen more than 3 annotators is the most frequent or not." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-57", "text": "As such, it considers it as 100% correct even if e.g. 6 annotators converged on a different answer (see second question in Figure 2 )." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-58", "text": "(b) It is heavily dependent on the number of answers for a given question." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-59", "text": "While the 3+ criterion is valid with 10 annotations, this might not be the case when, e.g., 5 or 20 answers are available." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-60", "text": "(c) It does not account for the quantitative variability among the answers." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-61", "text": "(d) There is no focus on the semantic similarity between the answers." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-62", "text": "(e) Model performance and dataset features (frequency of answers) are intertwined." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-63", "text": "That is, a perfect model cannot achieve a 100% accuracy on the task." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-64", "text": "Arithmetic and Harmonic Means are two accuracybased metrics proposed by [18] ." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-65", "text": "The core idea is to compute an overall accuracy which takes into account the skewed question-type distribution observed in the TDIUC dataset." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-66", "text": "The harmonic mean-per-type accuracy (Harmonic MPT), in particular, is designed to capture the ability of a system to obtain high scores across all question-types, being skewed towards lowest performing categories." 
}, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-67", "text": "A normalized version is also provided to better account for rare answers." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-68", "text": "Though fine-grained, these metrics are only suitable for datasets with only one ground-truth answer." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-69", "text": "WUPS is a metric proposed by [24] to take into account semantic similarity in the evaluation of model predictions." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-70", "text": "The core idea is that, when evaluating performance in the exact-matching setting (i.e., only one ground-truth answer), a model should not be heavily penalized if its prediction is semantically close to the ground truth (e.g., 'carton' and 'box')." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-71", "text": "This intuition is implemented using Wu-Palmer similarity [41] , which computes the similarity between two words in terms of their longest common subsequence in the taxonomy tree." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-72", "text": "In practice, the predicted answer is considered as correct when its similarity with the ground truth exceeds a threshold, which in [24] is set to either 0.9 (strict) or 0.0 (tolerant)." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-73", "text": "This metric has been extended by [25] to account for settings where more than one ground-truth answer is available." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-74", "text": "Two versions were proposed: In one, WUPS-ACM, the overall score comes from the average of all pairwise similarities and thus considers inter-annotator agreement; in the other, WUPS-MCM, the pair with the highest similarity is taken as representative of the pattern." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-75", "text": "As observed by [19] , the measure of similarity embedded in WUPS has some shortcomings." 
}, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-76", "text": "In particular, it is shown to produce high scores even for answers which are semantically very different, leading to significantly higher accuracies in both [24] and [30] ." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-77", "text": "Moreover, it only works with rigid semantic concepts, making it not suitable for phrasal or sentence answers that can be found in [4, 1, 16, 47, 14] ." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-78", "text": "Visual Turing Test has been proposed as a human-based evaluation metric for VQA by [13] ." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-79", "text": "Based on the characteristics of the FM-IQA dataset, whose answers are often long and complex sentences, the authors tackled the task as an answer-generation rather than a classification problem (see also [49, 39, 40, 36, 37] )." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-80", "text": "Given this setting, one option is to use standard metrics for the evaluation of automaticallygenerated language, such as BLEU [28] , METEOR [21] , ROUGE [22] or CIDEr [35] , as [16] did." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-81", "text": "However, these metrics turned out not to be suitable for VQA evaluation due to their inability to properly handle semantically relevant words [13] ." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-82", "text": "Therefore, [13] asked humans to judge whether the generated answers were provided by a human or a model." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-83", "text": "If annotators believed the answer was 'human', and thus implicitly good, the answer was considered as correct." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-84", "text": "Else, it failed the Visual Turing Test and considered as wrong." 
}, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-85", "text": "Intuitively, this evaluation procedure is very costly and heavily dependent on subjective opinions of annotators." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-86", "text": "Mean Rank Finally, in the recent work by [7] the performance of the embodied agent is evaluated via mean rank of the ground-truth answer in the predictions of the model." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-87", "text": "This implies that only one ground-truth answer is given." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-88", "text": "----------------------------------" }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-89", "text": "**OUR METRIC**" }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-90", "text": "Based on the limitations of the current metrics, we propose MASSES, a novel, multi-component metric for the evaluation of open-ended VQA." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-91", "text": "Each component is aimed at evaluating various aspects of either the performance of a given model or the characteristics of the dataset." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-92", "text": "In particular, one component (MA) evaluates the correctness of the answer predicted by the model and is thus model-specific." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-93", "text": "Two modules (S, SES) evaluate the pattern of human responses for a given question and are thus data-specific." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-94", "text": "By jointly combining these 3 modules, one single score is pro- vided." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-95", "text": "Below, we describe and motivate each component." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-96", "text": "Majority (MA): It is the core component of our metric, aimed at evaluating the performance of a given model in the task." 
}, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-97", "text": "It is based on two simple assumptions: First, the most frequent answer (hence, MAX) is considered as 100% correct regardless of its absolute frequency." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-98", "text": "Second, all other answers receive a score which is dependent on the frequency of MAX." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-99", "text": "Given a predicted answer, the score is given by dividing its frequency by the frequency of MAX." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-100", "text": "Consider the third example in Figure 2 ." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-101", "text": "If the predicted answer is 'diced' (MAX), the score is 1. If it is 'cubed' or 'squares' (2 occurrences), the score is 0.5. If it is one among the others (1), then the score is 0.25." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-102", "text": "The method used for calculating MA is reported in (1): MA = f requency of predicted answer f requency of MAX (1) where the numerator is an integer ranging from 0 to number of annotators (#ann), and the denominator an integer from 1 to #ann." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-103", "text": "MA is a continuous value ranging from 0 to 1. MA overcomes some important shortcomings of the other metrics." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-144", "text": "However, by applying SES 0.7 , reliability increases to 1 in all examples." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-104", "text": "Similarly to Exact Matching and in contrast with VQA3+, MA assumes that there is always at least one answer that is 100% correct for the question." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-105", "text": "As a consequence, a model is allowed to achieve 100% accuracy." 
}, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-106", "text": "Similarly to VQA3+, it modulates the score on the basis of the frequency of the answer." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-107", "text": "However, in contrast to VQA3+, our score is dependent on the frequency of MAX and not on a fixed threshold (e.g. 4)." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-108", "text": "Moreover, MA is continuous (i.e., it ranges from 0 to 1) rather than discrete (VQA3+ assigns just 5 possible scores: 0%, 30%, 60%, 90%, 100%), thus allowing a more flexible and fine-grained evaluation of the predictions." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-109", "text": "Subjectivity (S): This component evaluates the subjectivity of a given pattern of responses on the basis of the quantitative agreement between annotators, irrespectively of the prediction of the model." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-110", "text": "Our intuition is that highly skewed distributions would indicate more subjective and thus less reliable samples." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-111", "text": "Therefore, we should put more 'trust' to distributions that reflect a high agreement compared to those where a high variability is observed." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-112", "text": "Here, we operationalize S in terms of Wasserstein Distance (hence, WD) [29] , a method applied to transportation problems using efficient algorithms like network simplex algorithm [27] ." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-113", "text": "Given its ability to operate on variablelength representations, WD is more robust in comparison to other histogram-matching techniques and has been used, for example, in the domain of content-based image retrieval [32, 31] ." 
}, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-114", "text": "Applied to discrete probability distributions, WD (also known as Earth Mover's Distance [32] ) is used to compute the minimum amount of work that is needed for transforming one distribution into another." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-115", "text": "In our case, the work we measure is that required to transform a given distribution of frequencies into a uniform distribution where all elements have MAX frequency." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-116", "text": "In particular, we use WD as a measure of 'reliability' of the sample, based on the observation that highly skewed distributions require a smaller amount of work (low WD) compared to 'peaky' ones (high WD)." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-117", "text": "This is intuitive since, in the former case, all elements are closer to the MAX than in the latter." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-118", "text": "As a consequence, patterns where all annotators converge on one single answer will get a S score equal to 1 (highest reliability), whereas uniformly-distributed patterns (i.e., all answers have frequency 1) will get 0 (no reliability at all)." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-119", "text": "Consider the examples in Figure 2 ." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-120", "text": "In the first and second, S is 0.55." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-121", "text": "In the third, more subjective, S is 0.33." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-122", "text": "The method used for computing S is shown in (2):" }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-123", "text": "where the formula represents the standard way for computing WD, u,v are two different probability distributions, and \u0393(u, v) is the set of (probability) distributions." 
}, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-124", "text": "The value of S is further normalized to range from 0 to 1." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-125", "text": "Introducing such component allows us to take into account the subjectivity of a sample (and a dataset)." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-126", "text": "This is crucial since, as shown in Figure 3 , in current datasets the proportion of samples with a perfect inter-annotator agreement (i.e., 1 unique answer) is relatively low: 35% in VQA 1.0 [4] , 33% in VQA 2.0 [14] , 43% in VQA-abstract [1] , and only 3% in VizWiz [16] ." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-127", "text": "Moreover, we compute this score independently from the predictions of the models, thus providing a self-standing measure for the analysis of any VQA dataset." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-128", "text": "As clearly depicted in Figure 3 , subjectivity is indeed a property of the datasets: In VizWiz, only 30% of samples display 3 or less unique answers, whereas this percentage exceeds 70% in the other datasets." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-129", "text": "The motivation behind proposing this component is loosely similar to [15] , who tackle the task of predicting the degree of agreement between annotators, and very close to [43] , who model subjectivity of samples in terms of the entropy of the response pattern (ranging from 0 to 3.32)." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-130", "text": "Compared to [43] , we believe ours to be an essentially equivalent measure, though simpler and more intuitive." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-131", "text": "Finally, subjectivity is indirectly taken into account in WUPS-ACM, where the score is given by the average of the pairwise distances between the elements." 
}, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-132", "text": "However, this measure mixes quantitative (frequency) and qualitative (semantic similarity) information, while S specifically focuses on the former." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-133", "text": "----------------------------------" }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-134", "text": "**SEMANTIC SIMILARITY (SES):**" }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-135", "text": "This component is aimed at evaluating the semantic similarity between the answers in the sample." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-136", "text": "The rationale is that samples where the answers are overall semantically similar should be considered as more reliable (less subjective) compared to those including semantically diverse answers." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-137", "text": "Intuitively, a pattern containing e.g. 'plane', 'airplane', and 'aircraft' would be more consistent than one including e.g. 'plane', 'train', 'motorbike'." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-138", "text": "We operationalize this intuition by using pre-trained word embeddings [26] to re-organize the frequency distribution of the answers in the pattern." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-139", "text": "As a consequence, SES can be seen as a semantics-aware version of S. Technically, SES is obtained as follows: (a) we compute an average representation of each answer (similarly to [6] ); (b) we use these unique representations to build a centroid of the pattern aimed at encoding its overall semantics, irrespective of the relative frequency of the items (we want to account for the long tail of distributions); (c) we compute the cosine similarity between centroid and each unique answer; (d) we group together the answers whose cosine similarity value exceeds a given threshold, and sum their frequencies accordingly." 
}, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-140", "text": "This way, we obtain an updated frequency distribution, on the top of which S can be computed." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-141", "text": "Notably, this is the only component of MASSES that can be 'adjusted'." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-142", "text": "In particular, using 'strict' thresholds (e.g. 0.9) will generate lower scores compared to using more 'tolerant' ones (e.g. 0.7)." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-143", "text": "To illustrate, if we apply a SES 0.9 to the examples in Figure 2 , only the reliability of the first example increases (from S 0.55 to SES 1)." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-145", "text": "Though the third question is quantitatively more subjective than the others, it becomes as reliable as them when considering its semantics." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-146", "text": "Semantic similarity is computed as in (3):" }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-147", "text": "where for each ground truth answer, centroid pair we obtain a similarity score sim ranging from 0 to 1 (we set negative values to 0)." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-148", "text": "Answers for which sim is equal to or higher than a threshold t (0-1) are grouped together by summing their frequencies." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-149", "text": "To obtain SES, namely a semantics-aware measure of subjectivity, we compute (2) on the resulting distributions u sim ,v sim ." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-150", "text": "To obtain the overall MASSES score, we simply compute an updated MA (1) which is based on these distributions, and we further multiply it by SES." 
}, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-151", "text": "Similarly to WUPS, our metric acknowledges the importance of taking semantic similarity into account in the evaluation of VQA." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-152", "text": "However, SES differs from WUPS in two main regards: (a) We use word embeddings instead of taxonomies trees, which makes our metric more flexible, intuitive, and convenient to compute." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-153", "text": "Moreover, it can account for phrasal and sentence answers." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-154", "text": "(b) As reported by [19] , WUPS tends to be very 'forgiving' by assigning high scores to distant concepts (e.g., 'raven' and 'writing desk' have a WUPS score of 0.4)." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-155", "text": "In contrast, word embeddings provide a more fine-grained semantic information." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-156", "text": "It is worth mentioning that, in the domain of VQA, word embeddings have been used in various ways, e.g. for selecting challenging decoys [6] , or to implement nearest-neighbors baseline models [9] ." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-157", "text": "As for the procedure of aggregating various responses into one based on their semantic similarity, we were inspired by previous work on crowd consensus doing the same on the basis of various criteria [33, 38] ." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-158", "text": "----------------------------------" }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-159", "text": "**EXPERIMENTS**" }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-160", "text": "We tested the validity of our metric by experimenting with four VQA datasets: VQA 1.0 [4] , VQA 2.0 [14] , VQA-abstract [1] , and VizWiz [16] ." 
}, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-161", "text": "To enable a fair comparison across the datasets, for each dataset we followed the same pipeline: The standard VQA model used in [1] was trained on the training split and tested on the validation split." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-162", "text": "Model predictions were evaluated by means of three metrics: VQA3+ [4] (using the evaluation tools), WUPS [25] , and our MASSES." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-163", "text": "WUPS was tested in both its consensus versions, i.e. ACM and MCM with a threshold of 0.9." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-164", "text": "As for MASSES, we computed its overall score as well as the scores provided by each of its components." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-165", "text": "The impact of 'tuning' semantic similarity is evaluated by exploring two thresholds: a strict 0.9 and a more tolerant 0.7." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-166", "text": "----------------------------------" }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-167", "text": "**QUANTITATIVE RESULTS**" }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-168", "text": "Results are reported in Table 1 . Note that columns VQA3+, WUPS-ACM, WUPS-MCM, MA, MAS, and MASSES are accuracies, while S and SES are reliability scores." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-169", "text": "As can be noted, accuracies obtained with both versions of MASSES are generally lower compared to those of VQA3+, with the drop being particularly accentuated for VizWiz." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-170", "text": "This can be observed in Figure 4 , which compares the distributions of accuracies scored by VQA3+ and MASSES 0.9 in VQA 1.0 (left) and VizWiz (right)." 
}, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-171", "text": "As can be seen, the scores produced by our metric (blue) are 'distributed' across the x-axis (from 0 to 1), while those produced by VQA3+ (red) are grouped into 5 'classes'." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-172", "text": "Moreover, we observe that our metric is much more reluctant to output score 1." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-173", "text": "Part of this differences can be explained by looking at the values of MA (Table 1) , which are slightly lower than those of VQA3+ due to their finer-grained nature (recall that if an element is not MAX it is not considered as 100% correct by MA)." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-174", "text": "This drop is further accentuated by multiplying MA by either S (to obtain MAS) or SES (to obtain MASSES)." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-175", "text": "Since the values of these components cannot exceed 1, the resulting score will be lowered according to the degree of subjectivity of the dataset." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-176", "text": "Bearing this in mind, it is worth focusing on the scores of S and SES in each dataset." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-177", "text": "As reported in Table 1 , S is relatively high for the first three datasets (ranging from 0.70 to 0.78), extremely low for VizWiz (0.46)." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-178", "text": "These numbers, in line with the descriptive statistics depicted in Figure 3 , clearly indicate that answers in VizWiz are extremely skewed, with annotators rarely agreeing on the same answer(s)." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-179", "text": "This information can also be observed in Figure 5 , Table 2 ." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-180", "text": "Examples from the validation splits of VQA 1.0 (top) and VizWiz (bottom)." 
}, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-181", "text": "For each example, we report the pattern of answers provided by annotators (unique answer: frequency), the prediction of the model, and the scores (note that ACM, SES, MASSES are computed using threshold 0.9)." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-182", "text": "Answers that are grouped together by SES are included in square brackets." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-183", "text": "which depicts the distribution of S (red bars) and SES 0.9 (blue bars) in VQA 1.0 3 (left) and VizWiz (right)." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-184", "text": "As can be noticed, S in VQA is relatively high, with most of the answers being grouped in the rightmost bars (0.8 or more)." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-185", "text": "In contrast, we observe an almost normal distribution of S in VizWiz, with very few answers being scored with high values." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-186", "text": "When injecting semantic information into subjectivity (SES 0.9 ), however, the distribution changes." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-187", "text": "Indeed, we observe much less cases scored with extremely low values and much many cases with high values." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-188", "text": "In numbers, this is reflected in an overall increase of 8 points from S (0.46) to SES (0.54)." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-189", "text": "A similar pattern is also observed in VQA 1.0 (+5 points)." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-190", "text": "It is worth mentioning that using a lowest similarity threshold (0.7) makes the increase between S and SES even bigger." 
}, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-191", "text": "This, in turn, makes the MASSES score significantly higher and comparable to VQA3+ in the three VQA-based datasets (not for VizWiz)." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-192", "text": "As for WUPS, we observe that ACM scores are significantly lower than VQA3+ ones, while MCM ones are generally higher." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-193", "text": "This is intuitive since MCM only considers the most similar answers, while ACM, similarly to ours, considers the whole set." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-194", "text": "Compared to our metric, we notice that ACM 0.9 scores are somehow in between those of MASSES 0.7 and MASSES 0.9 in the VQA-based datasets." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-195", "text": "In contrast, they are very different in VizWiz, where our metric versions 'outperform' ACM 0.9 by around 13 and 7 points, respectively." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-196", "text": "We believe this gap is due to the main differences between WUPS and MASSES: (a) In WUPS the predictions of the model are intertwined with the properties of the data, while in ours the two components are disentangled." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-197", "text": "(b) The type of semantic similarity used by MASSES and its role in the metric allows capturing finer-grained relations between the answers compared to taxonomy trees." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-198", "text": "----------------------------------" }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-199", "text": "**QUALITATIVE RESULTS**" }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-200", "text": "To better understand the functioning of our metric, we analyze several cases extracted from the validation splits of VQA 1.0 and VizWiz (see Table 2 )." 
}, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-201", "text": "Starting from VQA 1.0, we notice that examples 1 and 2 are considered as 100% correct by both VQA3+ and MASSES." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-202", "text": "The former metric assigns this score because 'yellow' and 'refrigerator' have frequency equal to or greater than 4." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-203", "text": "As for MASSES, this score is produced because (a) the two answers have MAX frequency, and (b) the SES score assigned to the response pattern is the highest (i.e. 1.0) due to their semantic consistency." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-204", "text": "That is, all the answers are grouped together since their cosine similarity with the centroid is equal or greater than 0.9." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-205", "text": "Notably, ACM produces a similar score in example 2, but very different (i.e., much lower) in example 1, though the words involved are semantically very similar (very similar colors)." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-206", "text": "Moving to example 3, we observe that MASSES assigns a lower score (0.67) compared to VQA3+ (1.0) since SES makes a fine-grained distinction between generic 'rackets' and specific ones (i.e., for 'tennis')." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-207", "text": "This proves the validity and precision our semantic similarity component, especially in comparison with ACM, whose high score does not account for such distinction (0.98)." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-208", "text": "As for example 4, the score output by MASSES (1.0) turns out to be higher than both VQA3+ (0.6) and ACM (0.7) due to the extremely high semantic consistency of the answers." 
}, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-209", "text": "As for VizWiz, we observe that examples 1 and 2, which receive highest accuracy from VQA3+, are assigned a lower score by MASSES." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-210", "text": "In the former case, the drop is minor due to the high reliability of the pattern; in the latter, the drop is bigger since the predicted answer, 'white', appears in a pattern where the other responses are semantically very similar to each other and thus grouped together by SES." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-211", "text": "That is, the items in the long tail of the distribution, though not quantitatively dominant, are semantically prevalent in the pattern." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-212", "text": "As such, the reliability of the pattern is only partial, and lowers the overall score." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-213", "text": "As for example 3, VQA3+ assigns a relatively high score to the prediction (0.60), while MASSES (as ACM) penalizes this choice mainly due to the non-MAX nature of the predicted answer, though the pattern has a high reliability due the semantic consistency of the alternatives (all grouped together by SES)." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-214", "text": "Finally, in example 4 the prediction of the model ('unanswerable') is not present in the pattern and thus scored 0 by all metrics." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-215", "text": "However, it is worth mentioning that, according to SES, this pattern is highly reliable due to the high semantic consistency of its elements." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-216", "text": "As a consequence, a model predicting e.g. 'beef' would get 1.0 by MASSES, but only 0.5 by ACM." 
}, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-217", "text": "To further understand the qualitative difference between VQA3+ and MASSES, we analyze several cases from VQA 1.0 (see Figure 6 ) where the former metric outputs a higher score than the latter (left), and vice versa (right)." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-218", "text": "In the two leftmost examples, the higher values produced by VQA3+ seem intuitively more correct than those output by MASSES, whose scores are affected by a valuable but somehow strict semantic criterion which penalizes the presence of other answers in the pattern." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-219", "text": "In contrast, the higher accuracies produced by MASSES in the rightmost cases look intuitively better than those by VQA3+." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-220", "text": "In these cases, the subjectivity of the pattern is compensated by the high semantic consistency among the answers, which makes MASSES to output the highest score." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-221", "text": "Overall, it is straightforward that taking semantics into account allows our metric to produce finer-grained evaluations." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-222", "text": "----------------------------------" }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-223", "text": "**EVALUATING DATASET 'FEASIBILITY' WITH SES**" }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-224", "text": "SES is a component evaluating the subjectivity of a sample while also taking into account the semantic relation between the answers." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-225", "text": "As such, the score it provides is a measure of reliability of a sample (and of a dataset)." 
}, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-226", "text": "Since a high reliable sample is one where annotators either converge on the same answer or pick up semantically related answers, we might take SES as an indirect measure of dataset feasibility: The higher the score assigned to a sample, the higher the probability to guess the correct answer." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-227", "text": "We test this intuition by analyzing VQA3+ accuracy against SES." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-228", "text": "If SES captures the degree of feasibility of a sample, we should observe a higher accuracy in correspondence to high values of our component." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-229", "text": "Our intuition is fully confirmed for VQA 1.0 (Figure 7, left) , where accuracies increase on par with SES." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-230", "text": "In contrast, a different pattern is observed for VizWiz (right), where the highest accuracy is obtained in samples with moderate SES and monotonically decreases with increasingly-reliable scores." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-231", "text": "This pattern, we conjecture, might be due to the low number of cases having high SES in VizWiz." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-232", "text": "----------------------------------" }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-233", "text": "**DISCUSSION**" }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-234", "text": "We proposed MASSES, a novel multi-component metric for the evaluation of VQA." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-235", "text": "We showed the potential of such evaluation tool for gaining a higher-level, fine-grained understanding of models and data." 
}, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-236", "text": "Crucially, our metric can be used one component at a time: MA for evaluating model predictions only, S and SES for analyzing the quantitative and semantic reliability of a dataset, respectively." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-237", "text": "Overall, MASSES provides a single accuracy score that makes it comparable to other metrics such as VQA3+ or WUPS." }, { "sent_id": "0a93feafef3ba2d4bb5360ff215171-C001-238", "text": "Further investigation is needed to explore the functioning of our metric with other VQA models, as well as the impact of using various word embeddings techniques and similarity thresholds on the overall score." } ], "y": { "@BACK@": { "gold_contexts": [ [ "0a93feafef3ba2d4bb5360ff215171-C001-18" ], [ "0a93feafef3ba2d4bb5360ff215171-C001-42" ], [ "0a93feafef3ba2d4bb5360ff215171-C001-54" ], [ "0a93feafef3ba2d4bb5360ff215171-C001-77" ], [ "0a93feafef3ba2d4bb5360ff215171-C001-126" ] ], "cite_sentences": [ "0a93feafef3ba2d4bb5360ff215171-C001-18", "0a93feafef3ba2d4bb5360ff215171-C001-42", "0a93feafef3ba2d4bb5360ff215171-C001-54", "0a93feafef3ba2d4bb5360ff215171-C001-77", "0a93feafef3ba2d4bb5360ff215171-C001-126" ] }, "@USE@": { "gold_contexts": [ [ "0a93feafef3ba2d4bb5360ff215171-C001-160" ], [ "0a93feafef3ba2d4bb5360ff215171-C001-161" ] ], "cite_sentences": [ "0a93feafef3ba2d4bb5360ff215171-C001-160", "0a93feafef3ba2d4bb5360ff215171-C001-161" ] } } }, "ABC_1ab7893c2a930bc5af3c34a5912dd2_9": { "x": [ { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-357", "text": "----------------------------------" }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-358", "text": "**VALIDATION**" }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-461", "text": "**CONCLUSIONS**" }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-2", "text": "A dialog act is a representation of an intention 
transmitted in the form of words." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-3", "text": "In this sense, when someone wants to transmit some intention, it is revealed both in the selected words and in how they are combined to form a structured segment." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-4", "text": "Furthermore, the intentions of a speaker depend not only on her intrinsic motivation, but also on the history of the dialog and the expectation she has of its future." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-5", "text": "In this article we explore multiple representation approaches to capture cues for intention at different levels." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-6", "text": "Recent approaches on automatic dialog act recognition use Word2Vec embeddings for word representation." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-7", "text": "However, these are not able to capture segment structure information nor morphological traits related to intention." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-8", "text": "Thus, we also explore the use of dependencybased word embeddings, as well as character-level tokenization." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-9", "text": "To generate the segment representation, the top performing approaches on the task use either RNNs that are able to capture information concerning the sequentiality of the tokens or CNNs that are able to capture token patterns that reveal function." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-10", "text": "However, both aspects are important and should be captured together." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-11", "text": "Thus, we also explore the use of an RCNN." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-12", "text": "Finally, context information concerning turn-taking, as well as that provided by the surrounding segments has been proved important in previous studies." 
}, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-13", "text": "However, the representation approaches used for the latter in those studies are not appropriate to capture sequentiality, which is one of the most important characteristics of the segments in a dialog." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-14", "text": "Thus, we explore the use of approaches able to capture that information." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-15", "text": "By combining the best approaches for each aspect, we achieve results that surpass the previous state-of-the-art in a dialog system context and similar to human-level in an annotation context on the Switchboard Dialog Act Corpus, which is the most explored corpus for the task." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-16", "text": "----------------------------------" }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-17", "text": "**INTRODUCTION**" }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-18", "text": "In order to interpret its conversational partners' utterances, it is valuable for a dialog system to identify the generic intention behind the uttered words, as it provides an important clue concerning how each segment should be interpreted." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-19", "text": "That intention is revealed by dialog acts, which are the minimal units of linguistic communication (Searle, 1969) ." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-20", "text": "Similarly to other text classification tasks, such as news categorization and sentiment analysis (Kim, 2014; Conneau, Schwenk, Barrault, & Lecun, 2017) , most of the recent approaches on dialog act recognition take advantage of different Neural Network (NN) architectures." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-21", "text": "In that context, and considering the findings of previous studies on dialog acts, there are three main aspects of the task to be explored." 
}, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-329", "text": "Thus, the advantages of those architectures are not considered during the adaptation." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-22", "text": "First, at what level should a segment be tokenized and how can those tokens be represented in order to provide maximum information? Then, how can those token representations be combined to generate a representation of the whole segment, while keeping information about the original tokens and the relations between them?" }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-23", "text": "Finally, how can relevant context information from different sources, such as the surrounding segments or the speaker, be combined with the segment representation to achieve the best possible performance on the task?" }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-24", "text": "Not much effort has been put into the first aspect and most NN-based dialog act recognition approaches perform tokenization at the word-level and generate the token representations using pre-trained Word2Vec embeddings (Mikolov et al., 2013) ." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-25", "text": "This approach captures information concerning words that commonly appear together and is able to map semantic relations between different words in the embedding space." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-26", "text": "However, many dialog acts are related to word functions or segment structure, which this approach is not adequate to capture." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-27", "text": "Furthermore, there are cues for intention at a sub-word level, both in lemmas and affixes, as well as in word abstractions, such as syntactic units." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-28", "text": "However, none of these have been explored in previous studies." 
}, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-29", "text": "The second aspect is that on which most variations between existing approaches occur." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-30", "text": "Multiple Deep Neural Network (DNN) architectures have been explored to combine the token representations into a single segment representation that captures relevant information for the task." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-31", "text": "Of the two state-of-the-art approaches on dialog act recognition, one uses a deep stack of Recurrent Neural Networks (RNNs) (Schmidhuber, 1990) to capture long distance relations between tokens (Khanpour et al., 2016) , while the other uses multiple parallel temporal Convolutional Neural Networks (CNNs) (Fukushima, 1980) to capture relevant functional patterns with different length (Liu et al., 2017) ." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-32", "text": "Although these approaches focus on capturing different information, both have been proved successful on the task." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-33", "text": "Thus, an approach able to capture both kinds of information is expected to outperform both of these approaches." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-34", "text": "Finally, concerning the third aspect, a dialog act is not only related to the words in a segment, but also to the whole dialog context." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-35", "text": "That is, the intention of a speaker is influenced by the dialog history, as well as the expectation of its future direction." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-36", "text": "Furthermore, it is also influenced by the speaker's intrinsic characteristics." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-37", "text": "The latter are hard to capture and are usually not available for a dialog system." 
}, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-38", "text": "Thus, only speaker information that is directly related to the dialog, such as turn-taking (Liu et al., 2017) , is typically considered." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-39", "text": "Concerning information from the surrounding segments, its influence, especially that of preceding segments, has been thoroughly explored in at least two studies (Ribeiro et al., 2015; Liu et al., 2017) ." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-98", "text": "The best results were achieved when using 150-dimensional embeddings trained on Wikipedia data using Word2Vec." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-40", "text": "However, in both cases, although it is one of its most important characteristics, that information was represented in ways that are not appropriate to capture its sequentiality." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-41", "text": "In this article we use the Switchboard Dialog Act Corpus (Jurafsky et al., 1997) , which is the most explored corpus for dialog act recognition, to study and compare different solutions concerning the three aspects." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-42", "text": "To do so, we focus on exploring different representations at the three-levels -token, segment, and context -, taking previous studies and the previously referred limitations into account." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-43", "text": "In the remainder of the article we start by describing the Switchboard corpus and its dialog act annotations in Section 2." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-44", "text": "Then, Section 3 provides an overview of related work on dialog act recognition on that corpus using DNNs." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-45", "text": "In Section 4 we describe our experimental setup, which defines a common ground for the multiple experiments in the study." 
}, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-46", "text": "In Section 5, we start by exploring different segment representation approaches, since it is the step that introduces higher variation in the network." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-47", "text": "Then, in Section 6, we explore token representation at different levels." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-48", "text": "To conclude the experiments, in Section 7, we assess the influence of context information on the task and explore different representations for that information." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-49", "text": "Finally, Section 8 states the most important conclusions of this work and provides pointers for future work." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-50", "text": "----------------------------------" }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-51", "text": "**DATASET**" }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-52", "text": "Switchboard (Godfrey et al., 1992 ) is a corpus consisting of about 2,400 telephone conversations among 543 American English speakers (302 male and 241 female)." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-53", "text": "Each pair of speakers was automatically attributed a topic for discussion, from 70 different ones." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-54", "text": "Furthermore, speaker pairing and topic attribution were constrained so that no two speakers would be paired with each other more than once and no one spoke more than once on a given topic." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-55", "text": "The Switchboard Dialog Act Corpus (Jurafsky et al., 1997 ) is a subset of this corpus, consisting of 1,155 manually transcribed conversations, containing 223,606 segments." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-56", "text": "An excerpt of a transcription is shown in Figure 1 ." 
}, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-57", "text": "The corpus was annotated for dialog acts using the SWBD-DAMSL tag set, which was structured so that the annotators were able to label the conversations from transcriptions alone." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-58", "text": "It contains over 200 unique tags." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-59", "text": "However, in order to obtain a higher inter-annotator agreement and higher example frequencies per class, a less fine-grained set of 44 tags was devised." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-60", "text": "As shown in Table 1 , the class distribution is highly unbalanced, with the three most frequent classes -Statement-opinion, Acknowledgement, and Statement-non-opinion -covering 68% of the corpus." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-61", "text": "The set can be reduced to 43 or 42 categories (Stolcke et al., 2000; Rotaru, 2002; Gamb\u00e4ck et al., 2011) , if the Abandoned and Uninterpretable categories are merged, and depending on how the Segment category, used when the current segment is the continuation of the previous one by the same speaker, is treated." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-62", "text": "By analyzing the data, we came to the conclusion that merging segments labeled as Segment with the previous segment by the same speaker is the best approach, since some of the attributed labels only make sense when the segments are merged." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-63", "text": "Also, it makes sense to merge the Abandoned and Uninterpretable categories, because both represent disruptions in the dialog flow, which interfere with the typical dialog act sequence." 
}, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-64", "text": "There is also a 41-category variant of the tag set (Webb & Ferguson, 2010) , which merges the Statement-opinion and Statement-non-opinion categories, making this the most frequent class, covering 49% of the corpus." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-65", "text": "Jurafsky et al. (1997) report an average pairwise Kappa (Carletta, 1996) of .80, while Stolcke et al. (2000) refer to an inter-annotator agreement of 84%, which we assume to be the average pairwise percent agreement." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-66", "text": "As previously stated, we selected this corpus for our experiments because it is the most explored for the dialog act recognition task, since it contains a large amount of annotated data which can lead to solid results." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-67", "text": "Furthermore, since its tag set is domain-independent, the probability of drawing conclusions that depend on the domain of the corpus is reduced." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-68", "text": "----------------------------------" }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-69", "text": "**RELATED WORK**" }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-70", "text": "Dialog act recognition on the Switchboard Dialog Act Corpus has been widely explored using multiple machine learning approaches." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-71", "text": "The primordial approach by Stolcke et al. (2000) relied on Hidden Markov Models (HMMs) (Baum & Petrie, 1966) using word n-grams as features." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-72", "text": "This approach achieved 71.0% accuracy when applied to the manual transcriptions and 64.8% when applied to automatic transcriptions with 41% Word Error Rate (WER) on the test set." 
}, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-73", "text": "Since then, many other approaches have been explored." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-74", "text": "For instance, Rotaru (2002) used the k-Nearest Neighbors (k-NN) algorithm (Cover & Hart, 1967) , with the distance between neighbors being measured as the number of common bigrams between segments." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-75", "text": "Sridhar et al. (2009) used a maximum entropy model combining lexical, syntactic, and prosodic features with context information from the surrounding segments." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-76", "text": "Webb and Ferguson (2010) applied a classification approach based on cue phrases, that is, phrases that are highly indicative of a particular dialog act." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-77", "text": "Gamb\u00e4ck et al. (2011) used Support Vector Machines (SVMs) (Cortes & Vapnik, 1995) with word n-grams, wh-words, punctuations, and context information from the preceding segments as features, together with an Active Learning (AL) approach to select the most informative subset of the training data." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-78", "text": "The article by Kr\u00e1l and Cerisara (2010) provides an overview of most of these approaches." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-79", "text": "Here, we focus on the most recent studies, which take advantage of DNNs." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-80", "text": "To our knowledge, the first to do so was that by Kalchbrenner and Blunsom (2013 (Pennington, Socher, & Manning, 2014) pre-trained on Twitter data." 
}, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-81", "text": "In order to generate the corresponding dialog act classifications, the segment representations were then fed to a 2-layer feed-forward network, in which the first layer normalizes the representations and the second selects the class with higher probability." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-82", "text": "In their experiments, the CNN-based approach consistently led to better results than the LSTM-based one." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-83", "text": "The architecture was also used to provide context information from up to two preceding segments at two levels." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-84", "text": "The first level refers to the concatenation of the representations of the preceding segments with that of the current segment before providing it to the feed-forward network." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-85", "text": "The second refers to the concatenation of the normalized representations before providing them to the output layer." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-86", "text": "The best results on the test set -69.5% accuracy for the LSTM-based approach and 71.4% for the CNN-based approach -were obtained when using information from two preceding segments at the first level." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-87", "text": "Ji et al. (2016) used a Discourse Relation Language Model (DRLM) with a hybrid architecture that combined a Recurrent Neural Network Language Model (RNNLM) (Mikolov, Karafi\u00e1t, Burget, Cernock\u00fd, & Khudanpur, 2010) with a latent variable model over shallow discourse structure, in order to combine positive aspects of neural network architectures and probabilistic graphical models." 
}, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-88", "text": "More specifically, the model can learn vector representations trained discriminatively, while maintaining a probabilistic representation of the targeted linguistic element which, in this context, is the dialog act." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-89", "text": "The architecture is similar to that of a Latent Variable Recurrent Neural Network (LVRNN) (Chung, Kastner, Dinh, Goel, Courville, & Bengio, 2015) but with discrete variables that are at least partially observable and correspond to linguistically meaningful elements, which may be predicted or marginalized, depending on the situation." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-90", "text": "In order to function as a classifier, the model was trained to maximize the conditional probability of a sequence of dialog acts given a sequence of segments, achieving 77.0% accuracy." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-91", "text": "The previous studies explored the use of a single recurrent or convolutional layer to generate the segment representation from those of its words." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-92", "text": "However, as stated in Section 1, the approaches which currently have top performance on the task explore the use of multiple of those layers." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-93", "text": "On the recurrent side, Khanpour et al. (2016) achieved their best results using a segment representation generated by concatenating the outputs of a stack of 10 LSTM units at the last time step." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-94", "text": "This way, the model is able to capture long distance relations between tokens." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-95", "text": "On the convolutional side, Liu et al. 
(2017) generated the segment representation by combining the outputs of three parallel CNNs with different context window sizes, in order to capture different functional patterns." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-96", "text": "In both cases, pre-trained word-embeddings were used as input to the network." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-97", "text": "Khanpour et al. (2016) compared the performance of embeddings with different dimensionality trained on multiple corpora using GloVe and Word2Vec (Mikolov et al., 2013) ." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-99", "text": "Liu et al. (2017) used 200-dimensional Word2Vec embeddings trained on Facebook data." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-100", "text": "Overall, from the reported results, it is not possible to state which is the top performing segment representation approach since the evaluation was performed on different subsets of the corpus." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-101", "text": "Still, Khanpour et al. (2016) reported 73.9% accuracy on the validation set and 80.1% on the test set, while Liu et al. (2017) reported 74.5% and 76.9% accuracy on the two sets used to evaluate their experiments." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-102", "text": "Additionally, Liu et al. (2017) explored the use of context information concerning speaker changes and from the surrounding segments." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-103", "text": "The first was provided as a flag and concatenated to the segment representation." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-104", "text": "Concerning the latter, they explored the use of discourse models, as well as of approaches that concatenated the context information directly to the segment representation." 
}, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-105", "text": "The discourse models transform the model into a hierarchical one by generating a sequence of dialog act classifications from the sequence of segment representations." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-106", "text": "Thus, when predicting the classification of a segment, the surrounding ones are also taken into account." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-107", "text": "However, when the discourse model is based on a CNN or a bidirectional LSTM unit, it considers information from future segments, which is not available for a dialog system." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-108", "text": "Still, even when relying on future information, the approaches based on discourse models performed worse than those that concatenated the context information directly to the segment representation." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-109", "text": "In this sense, similarly to our previous study using SVMs (Ribeiro et al., 2015) , Liu et al. (2017) concluded that providing that information in the form of the classification of the surrounding segments leads to better results than using their words, even when those classifications are obtained automatically." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-110", "text": "Furthermore, both studies have shown that the first preceding segment is the most important and that the influence decays with the distance." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-111", "text": "Using the setup with gold standard labels from three preceding segments, Liu et al. (2017) achieved 79.6% and 81.8% on the two sets used to evaluate the approach." 
}, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-112", "text": "----------------------------------" }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-113", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-114", "text": "In order to set a common ground for result comparison, we use the generic architecture shown in Figure 2 , which is based on those of the top performing approaches referred to in Section 3." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-115", "text": "Below, we describe each of its components, as well as the training and evaluation procedures." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-116", "text": "----------------------------------" }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-117", "text": "**EMBEDDING LAYER**" }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-118", "text": "The input of the network is a tokenized segment, which is passed to an embedding layer." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-119", "text": "As a baseline, we perform the tokenization at the word level and do not use pre-trained word embeddings." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-120", "text": "Thus, the weights of the embedding layer are initialized randomly and updated along with the rest of the network." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-121", "text": "The resulting word embeddings are 200-dimensional as in the study by Liu et al. (2017) ." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-122", "text": "Different tokenization and embedding approaches are explored in Section 6." 
}, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-123", "text": "----------------------------------" }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-124", "text": "**SEGMENT REPRESENTATION**" }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-125", "text": "The segment representation step processes and combines the token embeddings to generate a vectorial representation of the segment." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-126", "text": "This is the step that introduces higher variability in the network and in which the main differences between previous dialog act recognition approaches occur." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-127", "text": "Initially, we reduce this step to a max pooling operation over the token embeddings, that is, the representation of a segment is the area in the embedding space that contains it." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-128", "text": "Section 5 explores the use of the complex state-of-the-art approaches, as well as a Recurrent Convolutional Neural Network (RCNN)-based approach which has not been applied to the task before." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-129", "text": "----------------------------------" }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-130", "text": "**DIMENSIONALITY REDUCTION LAYER**" }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-131", "text": "In order to avoid result differences caused by using representations with different dimensionality, the network includes a dimensionality reduction layer." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-132", "text": "This is a dense layer which maps the segment representations into a 100-dimensional space, as in the study by Liu et al. (2017) ." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-133", "text": "Furthermore, during the training phase, we use dropout with 50% probability in this layer, in order to reduce the probability of overfitting." 
}, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-134", "text": "Before passing the segment representation to this layer, additional features can be concatenated to it." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-135", "text": "The approaches described in Section 7 take advantage of this to provide context information to the network." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-136", "text": "----------------------------------" }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-137", "text": "**OUTPUT LAYER**" }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-138", "text": "Finally, the output layer maps the 100-dimensional representation into a dialog act label." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-139", "text": "To do so, we use a dense layer with number of units equal to the number of labels." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-140", "text": "This layer uses the softmax activation to obtain the class probabilities." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-141", "text": "The class with higher probability is then selected for the segment." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-142", "text": "Since this is a multiclass classification problem, we use the categorical cross entropy loss function." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-143", "text": "Furthermore, for performance reasons, we use the Adam optimizer (Kingma & Ba, 2015) ." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-144", "text": "----------------------------------" }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-145", "text": "**TRAINING & EVALUATION**" }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-146", "text": "We used Keras (Chollet et al., 2015) with the TensorFlow (Abadi et al., 2015) backend to implement our networks." 
}, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-147", "text": "We used a fixed random seed to avoid result differences caused by different initializations." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-148", "text": "However, there is still some non-determinism introduced by the optimization of certain operations for running on Graphics Processing Unit (GPU)." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-149", "text": "Thus, the results we present in this paper refer to the average (\u00b5) and standard deviation (\u03c3) of the results obtained by training and testing the networks over 10 runs." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-150", "text": "The mini-batch size was 512." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-151", "text": "The training phase stopped after 10 epochs without improvement on the validation set." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-152", "text": "In this sense, although we present results on both the validation and test sets, all the decisions were based on the results on the validation set." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-153", "text": "Considering the baseline described in this section, it achieves an average of 75.60% accuracy on the validation set and 71.78% on the test set." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-154", "text": "----------------------------------" }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-155", "text": "**SEGMENT REPRESENTATION**" }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-156", "text": "As stated in the previous section, the segment representation step is the one which introduces higher variability in the network." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-157", "text": "Consequently, it is where the main differences between previous dialog act recognition approaches occur." 
}, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-158", "text": "Thus, we start our study by exploring different approaches for this step." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-159", "text": "As stated in Section 3, of the two state-of-the-art approaches on dialog act recognition, one uses a RNN-based approach (Khanpour et al., 2016) for segment representation, while the other uses one based on CNNs (Liu et al., 2017) ." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-160", "text": "Both have their own advantages, as while the first focuses on capturing information from relevant sequences of tokens, the latter focuses on the context surrounding each token and, thus, captures information concerning neighboring tokens." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-330", "text": "This explains the larger performance decay of dependency-based embeddings in comparison to that of CBOW embeddings." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-161", "text": "Concerning the task at hand, this is relevant since, among other aspects, while some dialog acts are distinguishable due to the order of the tokens in the segment (e.g. subject-auxiliary inversion in questions), others are distinguishable due to the presence of certain tokens or sequences of tokens independently of where they occur in the segment (e.g. greetings)." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-162", "text": "Thus, and since the two approaches were not compared directly and were evaluated on different sets, we included both of them in our experiments." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-163", "text": "However, since each approach is expected to capture a different kind of information, but both kinds are relevant for the task, a third approach that merges at least some of the capabilities of the other two is expected to perform better on the task." 
}, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-164", "text": "That is exactly what the RCNN-based approach by Lai et al. (2015) was designed to do." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-165", "text": "This approach achieved state-of-art performance on text classification tasks, such as document topic classification and movie review rating, but was not applied to dialog act recognition before." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-166", "text": "Thus, we also included it in our experiments." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-167", "text": "The characteristics of each approach and our adaptations of their original versions are described below." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-168", "text": "----------------------------------" }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-169", "text": "**RNN-BASED SEGMENT REPRESENTATION**" }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-170", "text": "As described in Section 3, the recurrent approach by Khanpour et al. (2016) uses a stack of 10 LSTM units." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-171", "text": "The segment representation is given by the concatenation of the outputs of the 10 LSTM units at the last time step, that is, after processing all the tokens in the segment." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-172", "text": "This process is shown in Figure 3 ." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-173", "text": "Using the output at the last time step instead of other pooling operation makes sense, since the recurrent units process the tokens sequentially." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-174", "text": "Thus, that output contains information from the whole segment." 
}, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-175", "text": "In our experiments we attempted to replace the LSTM units with Gated Recurrent Units (GRUs) (Chung et al., 2014) , in order to reduce the amount of memory required." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-176", "text": "However, the results were not satisfactory." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-177", "text": "Furthermore, we attempted to use bidirectional LSTM units, but there were no benefits." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-178", "text": "Still, contrarily to what Khanpour et al. (2016) stated in their paper, we were able to improve the results by applying dropout with 50% probability to the input of each LSTM unit during the training phase." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-179", "text": "----------------------------------" }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-180", "text": "**CNN-BASED SEGMENT REPRESENTATION**" }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-181", "text": "As described in Section 3, the convolutional approach by Liu et al. (2017) uses a set of parallel temporal CNNs with different window size, each followed by a max pooling operation." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-182", "text": "The segment representation is given by the concatenation of the results of the pooling operations." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-183", "text": "This way, the representation contains information concerning groups of tokens with different sizes." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-184", "text": "To achieve the results presented in their paper, Liu et al. (2017) used three CNNs with 100 filters and 1, 2, and 3 as context window sizes." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-185", "text": "In a previous study using the same architecture for different tasks, Kim (2014) used 3, 4, and 5 as window sizes." 
}, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-186", "text": "Both setups performed similarly in our experiments." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-187", "text": "However, their combination, that is, five parallel CNNs with window sizes between 1 and 5, led to better results, which means that both small and large groups of tokens provide relevant information." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-188", "text": "The process to generate a segment representation using this approach is shown in Figure 4 ." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-189", "text": "The CNN-based segment representation approach." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-190", "text": "e i corresponds to the embedding representation of the i-th token." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-191", "text": "----------------------------------" }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-192", "text": "**RCNN-BASED SEGMENT REPRESENTATION**" }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-193", "text": "As previously stated, the previous approaches focus on capturing different kinds of information, both of them relevant for the task." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-194", "text": "The RCNN-based approach by Lai et al. (2015) combines some of the advantages of RNN-and CNN-based segment representation approaches in order to capture both kinds of information." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-195", "text": "This approach achieved interesting results on multiple text classification tasks but had not been applied to dialog act recognition yet." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-196", "text": "Thus, we included it in our study with some adaptations." 
}, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-197", "text": "One of its advantages is that it removes the need for selecting an appropriate context window size for convolution by using a bidirectional recurrent approach which captures context information from all the tokens that appear before and after each token." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-198", "text": "The embedding representation of each token is then extended by surrounding it with that context information." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-199", "text": "A linear transformation together with the tanh activation is applied to each of those token embeddings to reduce their dimensionality and normalize their representation." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-200", "text": "Finally, a max pooling operation is performed over the sequence of token representations to obtain the segment representation." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-201", "text": "As shown in Figure 5 , we replaced the simple RNNs used to obtain token context information with GRUs, which are able to capture the importance of distant tokens." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-202", "text": "In our experiments we obtained the best results when using a GRU with a number of neurons in each direction equal to the dimensionality of the embedding space." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-203", "text": "We also explored the use of LSTM units, as well as of a stack of recurrent units to extract context information at different levels." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-204", "text": "However, that did not improve the results." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-205", "text": "Furthermore, we explored the use of different number of neurons in the token representation dimensionality reduction layer." 
}, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-206", "text": "In this aspect, the best results were achieved when using a number of neurons equal to the dimensionality of the embedding space." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-207", "text": "Finally, applying dropout with 50% probability to the input of each GRU also improved the results." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-208", "text": "----------------------------------" }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-209", "text": "**RESULTS**" }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-210", "text": "In Table 2 we can see that the RNN and CNN approaches perform similarly, with a difference of .15 percentage points between average accuracy results on the validation set and .02 on the test set." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-211", "text": "However, it is important to refer that the training of the RNN approach is much more resource consuming than the CNN counterpart." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-212", "text": "It requires around 9 times more memory and each training epoch takes around 31 times longer." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-213", "text": "Table 2 : Accuracy results using different segment representation approaches." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-214", "text": "Considering the RCNN approach, we can see that, as expected, it is able to improve the results of the remaining approaches." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-215", "text": "On the validation set, the improvement is of .56 percentage points." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-216", "text": "However, that improvement is not reflected on the test set." 
}, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-217", "text": "Still, the RCNN approach is that with lowest standard deviation results among the three, which suggests that the model is more stable and less prone to variation than those generated by the other approaches." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-218", "text": "The average accuracy difference between the best approach and the much simpler max pooling baseline is just 1.36 percentage points on the validation set and .99 percentage points on the test set." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-219", "text": "These differences are not very expressive, which suggests that the overall improvement that can be achieved by improving the segment representation is reaching a saturation point and that improvement should be sought on the other steps." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-220", "text": "Since the RCNN approach is that with best performance, we selected it for the segment representation step when exploring the approaches for the remaining steps described in the following sections." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-221", "text": "----------------------------------" }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-222", "text": "**TOKEN EMBEDDING**" }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-223", "text": "A segment is formed by multiple constituents." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-224", "text": "Thus, as shown in the previous section, its representation is obtained through a combination of the representations of those constituents." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-225", "text": "In this section we explore different means to represent those constituents or tokens." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-226", "text": "First, it is important to refer that, as revealed by the studies described in Section 3, a segment is typically seen as a sequence of words." 
}, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-227", "text": "However, it can be also be seen at other levels." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-228", "text": "For instance, from a finer-grained point of view, a segment can be seen as a sequence of characters." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-229", "text": "On the other hand, it can also be seen as a sequence of syntactic units, or other abstractions from words." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-230", "text": "Below, we explore these three levels." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-231", "text": "Since we are focusing on the multiple steps of dialog act recognition approaches using DNNs, we only approach embedding representations, that is, those that represent a token as a vector of coordinates in a certain embedding space (Lavelli et al., 2004) ." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-232", "text": "----------------------------------" }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-233", "text": "**WORD-LEVEL EMBEDDING**" }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-234", "text": "As previously stated, segments are typically seen as sequences of words." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-235", "text": "Thus, there has been extensive research on means to generate word-embedding representations that capture relevant word semantics (Collobert et al., 2011; Mikolov et al., 2013; Pennington et al., 2014; Levy & Goldberg, 2014) ." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-236", "text": "In addition to the methods themselves, this effort has led to the generation of sets of publicly available pre-trained word embeddings, which can be used to obtain baseline or benchmark results." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-237", "text": "Below, we discuss which word-embedding approaches we used in our experiments and why they are relevant for the task." 
}, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-238", "text": "However, first, we explore the dimensionality of the embedding space, that is, the trade-off between ambiguity and sparseness." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-239", "text": "----------------------------------" }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-240", "text": "**EMBEDDING SPACE DIMENSIONALITY**" }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-241", "text": "The dimensionality of the embedding space is the factor that defines the trade-off between ambiguity and sparseness." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-242", "text": "On the one hand, higher dimensionality leads to increased sparseness and memory requirements." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-243", "text": "At the limit, one can have dimensionality equal to the number of words in the vocabulary and use a one-hot approach to represent words." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-244", "text": "On the other hand, lower dimensionality leads to ambiguity in the representation." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-245", "text": "However, up to a certain level, ambiguity is not necessarily harmful." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-246", "text": "As stated in Section 3, Khanpour et al. (2016) explored embedding spaces with dimensionality 75, 150, and 300 together with different embedding approaches." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-247", "text": "In every case, the embedding space with dimensionality 150 led to the best results." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-248", "text": "Liu et al. (2017) used a different dimensionality value, 200, in their study." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-249", "text": "In our experiments we explored all four of these dimensionality values (75, 150, 200, and 300)."
}, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-250", "text": "----------------------------------" }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-251", "text": "**PRE-TRAINED EMBEDDINGS**" }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-252", "text": "As previously stated, there has been extensive research on means to generate word-embedding representations." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-253", "text": "The most common approaches are the Continuous Bag of Words (CBOW) model (Mikolov et al., 2013) , commonly known as Word2Vec, and GloVe (Pennington et al., 2014)." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-254", "text": "Khanpour et al. (2016) used pre-trained embeddings generated by both approaches in their study and achieved their best results using Word2Vec embeddings trained on Wikipedia data." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-255", "text": "Liu et al. (2017) also used Word2Vec embeddings, but trained on Facebook data." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-256", "text": "Since we have access to the embeddings trained on Wikipedia data, but not to those trained on Facebook data, we used the former in our experiments." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-257", "text": "The CBOW model generates word representations based on the co-occurrence of adjacent words." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-258", "text": "However, as previously stated, many dialog acts are related to the structure of the segment and not sequences of specific words." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-259", "text": "Thus, we also explored the dependency-based embedding approach by Levy and Goldberg (2014) , which takes that structure into account and not only word co-occurrences."
}, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-260", "text": "It does so by introducing information concerning syntactic dependencies between words in the embedding generation process." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-261", "text": "First, it generalizes the Skip-Gram model (Mikolov et al., 2013) by allowing it to use arbitrary contexts instead of just those based on adjacent words." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-262", "text": "Then, it uses the syntactic contexts derived from automatically produced dependency parse-trees." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-263", "text": "That is, the embedding generated for a given word is based on the syntactic relations it participates in." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-264", "text": "Thus, embeddings generated using this approach seem appropriate for the dialog act recognition task." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-265", "text": "In our experiments we used the pre-trained set provided by Levy and Goldberg (2014) , which was trained on Wikipedia data." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-266", "text": "Using pre-trained embeddings typically leads to results that generalize better, since they are trained on large amounts of data and not only on a reduced set focused on a particular domain." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-267", "text": "However, this also means that their generation does not take their future use on a specific task into account." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-268", "text": "In their study, Liu et al. (2017) used pre-trained embeddings but let them adapt to the task during the training phase." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-269", "text": "However, they did not perform a comparison with the case where the embeddings are not adaptable." 
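The distinction between fixed and adaptable pre-trained embeddings can be sketched in a few lines. This is an illustration of the mechanism, not the training code used in the study: the embedding layer is a weight matrix initialized from pre-trained vectors, and "adaptable" simply means the matrix receives gradient updates during training, while "fixed" means it is left untouched.

```python
import numpy as np

# Sketch of fixed vs. adaptable pre-trained embeddings (illustrative values).
pretrained = np.ones((3, 2))      # stand-in for Word2Vec/dependency vectors
E = pretrained.copy()             # embedding layer weights
grad = np.full_like(E, 0.1)       # hypothetical gradient from backpropagation
lr = 0.5

adaptable = False
if adaptable:
    E -= lr * grad                # embeddings drift toward the task data
# With adaptable = False, E remains identical to the pre-trained vectors,
# so the representation keeps its generic, large-corpus semantics.
```

In a framework such as Keras this choice would typically correspond to a trainability flag on the embedding layer; the update rule above is a plain SGD stand-in.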
}, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-270", "text": "Thus, in our study we experimented with both fixed and adaptable embeddings." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-271", "text": "----------------------------------" }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-272", "text": "**EMBEDDING COMBINATIONS**" }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-273", "text": "In the previous section, we have discussed that different embedding approaches capture different kinds of information." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-274", "text": "For instance, while the CBOW model captures information concerning sets of words that occur together frequently, the dependency-based approach captures information concerning syntactic dependencies." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-275", "text": "Furthermore, if no pre-trained embeddings are used, the generated representation is task specific, while when pre-trained embeddings are used, the representation captures information from larger amounts of data and is more generic." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-276", "text": "Adaptable embeddings attempt to balance the trade-off between the two." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-277", "text": "However, all those kinds of information may be complementary and relevant for the task." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-278", "text": "Thus, we assessed whether it is advantageous to combine multiple embedding representations." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-279", "text": "To do so, we used the approach suggested by Kim (2014) in the paper that inspired the CNN approach used by Liu et al. (2017) ." 
}, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-280", "text": "Considering the architecture in Figure 2 , our approach replicates all the steps up to and including segment representation for each embedding approach, and then concatenates the generated segment representations before passing them to the dimensionality reduction layer." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-281", "text": "In our experiments, we combined both the fixed Word2Vec and dependency-based embeddings with their adaptable counterparts, as well as with each other and the version without pre-trained embeddings." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-282", "text": "----------------------------------" }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-283", "text": "**CHARACTER-LEVEL EMBEDDING**" }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-284", "text": "Although the majority of models used in Natural Language Processing (NLP) are word-based, there are also character-based models achieving interesting performance on text classification tasks, such as news article categorization and review rating (Zhang et al., 2015) ." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-285", "text": "The main advantage of these models is that they are able to capture information concerning the morphology of words, such as their lemmas and affixes." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-286", "text": "Furthermore, they typically generalize better since they do not have to deal with the out-of-vocabulary problem." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-287", "text": "On the other hand, when using an embedding space with the same dimensionality, character models are much slower, since the number of tokens increases considerably."
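The token blow-up behind the slowdown can be made concrete with a small count; the example segment is invented for the sketch.

```python
# Why character-level models are slower: the same segment yields far more
# tokens when tokenized as characters than as words (illustrative segment).
segment = "i think so too"

word_tokens = segment.split()   # word-level tokenization
char_tokens = list(segment)     # character-level tokenization

ratio = len(char_tokens) / len(word_tokens)
# With the same embedding dimensionality, every extra token adds work to
# the layers that build the segment representation, hence the slower
# training and prediction.
```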
}, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-288", "text": "Considering the task at hand, dialog acts are related to an intention which is transmitted in the form of words." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-289", "text": "Thus, the selected words are expected to have a function related to that intention." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-290", "text": "In this sense, the function of a word is typically related to its morphology." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-291", "text": "For instance, there are cases, such as adverbs of manner and negatives, in which the function, and hence the intention, of a word is related to its affixes." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-292", "text": "On the other hand, there are cases in which considering multiple forms of the same lexeme independently does not provide additional information concerning intention and the lemma suffices." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-293", "text": "Since a character-level model is able to capture that information, we decided to assess its performance on the task." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-294", "text": "To do so, we performed an experiment using the same segment representation approach as for the word-level approach, but using the characters as tokens." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-295", "text": "----------------------------------" }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-296", "text": "**FUNCTIONAL-LEVEL EMBEDDING**" }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-297", "text": "As previously stated, there are some dialog acts that are more related to the structure of the segment or the functions of its words than to the presence of specific words or sets of words."
}, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-298", "text": "Thus, it makes sense to explore tokenization at a level that abstracts those words." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-299", "text": "Considering what was said in the previous sections concerning the importance of suffixes, it is interesting to assess how important they are for the task." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-300", "text": "Since automatic suffix extraction is not straightforward, we performed the negative experiment, that is, we replaced the words with their lemmas and assessed the loss of performance of both word- and character-level embeddings in comparison to the case when the original words were used." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-301", "text": "In order to obtain the lemmas we used the spaCy parser (Honnibal & Montani, 2017) ." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-302", "text": "A replacement of the words by the corresponding syntactic units in the form of Part of Speech (POS) tags leads to a representation that captures the syntactic function of each word, as well as the structure of the segment when sets of words are considered together." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-303", "text": "Thus, we performed experiments with embedding at the POS-tag level." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-304", "text": "We used both the coarse-grained Google Universal POS tag set (Petrov et al., 2012) and the fine-grained Penn Treebank tag set (Marcus et al., 1993) ." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-305", "text": "While the former only covers the word type, the latter also includes information regarding morphological features of the words." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-306", "text": "Similarly to the lemmas, the POS tags were obtained using the spaCy parser."
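The lemma- and POS-level replacements can be sketched as follows. In the study these come from the spaCy parser (`token.lemma_` for lemmas, `token.pos_` for the coarse-grained universal tags); the parsed triples below are hard-coded stand-ins so the example is self-contained.

```python
# Hard-coded (word, lemma, POS) triples standing in for spaCy output.
parsed = [("He", "he", "PRON"), ("likes", "like", "VERB"),
          ("cats", "cat", "NOUN")]

lemma_level = [lemma for _, lemma, _ in parsed]  # abstracts away affixes
pos_level = [pos for _, _, pos in parsed]        # abstracts away the words

def abstract(parsed, classes):
    """Replace only words of the given POS classes with their tag."""
    return [pos if pos in classes else word for word, _, pos in parsed]

# Replacing only the hypothesized classes (nouns, proper nouns, numerals,
# adjectives) keeps the function words intact:
mixed = abstract(parsed, {"NOUN", "PROPN", "NUM", "ADJ"})
```

The `abstract` helper also covers the per-class replacement experiments described below, by passing a single-class set.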
}, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-307", "text": "Note that, as previously stated, there are also dialog acts that are related to specific words or sets of words." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-308", "text": "Thus, the described replacement of words with POS tags impairs the detection of such dialog acts, since those tags are only able to capture information that is generic to the class and not that which is specific to the words." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-309", "text": "To avoid this problem it is important to understand which classes of words can be abstracted with the corresponding POS tags and which cannot." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-310", "text": "We hypothesize that nouns, proper nouns, numerals, and adjectives are the classes that fall in the first category." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-311", "text": "Still, in addition to replacing this set of classes, we performed experiments replacing each word class individually to assess the relevance of specific words of that class." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-312", "text": "----------------------------------" }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-313", "text": "**RESULTS**" }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-314", "text": "Starting with the dimensionality of the embedding space, in Table 3 we can see that using an embedding space with 200 dimensions, such as in the study by Liu et al.
(2017) , leads to better results than any of the dimensionality values used by Khanpour et al. (2016) ." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-315", "text": "Furthermore, considering the three values they used, our experiments revealed better performance when using 300 dimensions than when using 150 dimensions, which was the best value in their study." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-316", "text": "However, their experiments on dimensionality also considered pre-trained embeddings, while we used a random initialization." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-317", "text": "We also performed some experiments that compared different values for the embedding space dimensionality while using pre-trained embeddings and, in every case, the embedding space with 200 dimensions led to better results." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-318", "text": "Concerning the use of different kinds of pre-trained embeddings, in Table 4 we can see that dependency-based embeddings outperform the widely used Word2Vec embeddings." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-319", "text": "Furthermore, both embeddings were trained on the English Wikipedia." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-320", "text": "Thus, the result difference was not caused by different training data." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-321", "text": "This confirms that the segment structure information included in the dependency-based representation is relevant for the task." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-322", "text": "The original embeddings had dimensionality 300." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-323", "text": "Since we use dimensionality 200 in our experiments, we discarded the extra dimensions, as was done by Khanpour et al. (2016) in their study.
}, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-324", "text": "Table 4 : Accuracy results using pre-trained word embeddings." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-325", "text": "In Table 4 we can see that, in both cases, using adaptable embeddings impairs performance." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-326", "text": "This was partially expected since the dialogs in the Switchboard corpus do not have a common domain nor common participants." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-327", "text": "Thus, since the separation into training, validation, and test sets is performed at the dialog level, what happens is that by letting the embeddings adapt, they overfit to the training dialogs and do not generalize well." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-328", "text": "Furthermore, since our embedding layer corresponds to a matrix of weights that is initialized with the values of the pre-trained embeddings, it does not reproduce the architectures used to produce the pre-trained embeddings, but only their output." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-335", "text": "A surprising result was that, as shown in Table 5 , none of the embedding combinations led to better performance." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-336", "text": "In fact, all of those including an adaptable approach led to performance decreases above .90 percentage points in comparison to the simpler fixed embedding scenarios." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-337", "text": "This was unexpected since each embedding approach in the combinations was supposed to provide relevant information." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-338", "text": "Still, it can be explained by the overfitting of the adaptable embedding approaches." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-339", "text": "The concatenation of the representations generated by each approach before the dimensionality reduction layer contains the information provided by the fixed embedding representation and any additional relevant information provided by the adaptable representation."
}, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-340", "text": "However, the dimensionality reduction layer ends up selecting information that is overfit to the training examples." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-341", "text": "This phenomenon can be assessed by removing that layer or by increasing the final dimensionality." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-342", "text": "We leave that for future work." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-343", "text": "On the other hand, the combination of the fixed versions of the CBOW and dependency-based embeddings led to results above those obtained using the CBOW embeddings alone, but below the ones obtained using the dependency-based embeddings alone, which suggests that the latter cover all the information provided by the former. Table 5 : Accuracy results using word embedding combinations." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-344", "text": "Less surprising was the fact that character-level embeddings led to better results than word-level embeddings with random initialization." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-345", "text": "The concrete results are shown in Table 6." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-346", "text": "This confirms that there is relevant information for the task at a sub-word level." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-347", "text": "The results in Table 7 further confirm that part of that information is provided by affixes." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-348", "text": "However, the character-level embeddings were not able to achieve the performance of the pre-trained word embeddings." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-349", "text": "Furthermore, as explained in Section 6.2, the training and prediction processes are slower given the larger amount of tokens."
}, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-350", "text": "Still, it is important to note that we used the same embedding space dimensionality at the character and word levels." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-351", "text": "However, dimensionality 200 is higher than that required to use a one-hot representation at the character level, especially considering the English language." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-352", "text": "Thus, embedding at the character level can be performed using a lower dimensionality, which speeds up the process." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-353", "text": "Furthermore, we did not explore whether the RCNN-based segment representation approach is the best when using character-level tokenization, nor whether the information captured by character- and word-level approaches is complementary." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-354", "text": "Still, we leave a thorough study of these aspects for future work." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-355", "text": "As previously stated, the experiments using lemmatized segments have shown that affixes provide relevant information for the task." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-359", "text": "Table 6 : Accuracy results using character-level embeddings (validation: \u00b5 = .7735, \u03c3 = .0008; test: \u00b5 = .7348, \u03c3 = .0021)." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-356", "text": "In Table 7 we can see the results obtained when using both word- and character-level tokenization on lemmatized segments."
}, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-361", "text": "It is interesting to see that the accuracy losses between the four results and their non-lemmatized counterparts vary between 1.56 and 1.91 percentage points which, given the standard deviation of the results when using lemmatized segments, can be considered similar values." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-362", "text": "This means that affixes are as important for the task when using word-level tokenization as when using character-level tokenization." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-363", "text": "Still, the fact that information provided by affixes accounts for less than 2 percentage points accuracy shows that most of the information concerning intention is in the lemmas and that the affixes only provide additional information in specific cases." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-364", "text": "Table 7 : Accuracy results using lemmatized words." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-365", "text": "Finally, considering embeddings at the POS tag-level, in Table 8 we can see that the fine-grained set leads to better results than the coarse-grained one." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-366", "text": "However, they are both far from the results obtained using word-level embeddings with random initialization." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-367", "text": "This shows that, as expected, there are specific words that are relevant for the correct identification of dialog acts and that the task cannot be performed based on the structure of the segment alone." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-368", "text": "In fact, this is consistent with the findings of previous studies on other discourse-related tasks (Marcu, 2000) ."
}, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-369", "text": "Still, our experiments replacing a single class of words at a time have shown that there are classes for which the specific words have no influence on the task." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-370", "text": "In this sense, similar or better results than those of the word-level embeddings with random initialization were achieved for nouns, proper nouns, conjunctions, numerals, determiners, adpositions, and particles." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-371", "text": "However, replacing all of these classes at once led to worse results." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-372", "text": "Furthermore, the set of classes we hypothesized could be replaced, consisting of nouns, proper nouns, numerals, and adjectives, also led to worse results, as shown in the last row of Table 8 ." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-373", "text": "Still, as future work, we want to pre-train embeddings using this replacement and assess whether the information provided from larger amounts of data can improve the performance." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-374", "text": "----------------------------------" }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-375", "text": "**CONTEXT INFORMATION**" }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-376", "text": "Although a dialog act represents the intention behind a set of words, that intention is not constrained to a specific segment and its context provides relevant cues." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-377", "text": "Table 8 : Accuracy results using POS tags. As described in Section 3, previous studies have shown that the most important source of context information"
}, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-378", "text": "for dialog act recognition is the dialog history, with influence decaying with distance (Ribeiro et al., 2015; Lee & Dernoncourt, 2016; Liu et al., 2017) ." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-379", "text": "However, information concerning the speakers and, more specifically, turn-taking has also proved to be important (Liu et al., 2017) ." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-380", "text": "Thus, in our study, we explore both the surrounding segments and speaker information as sources of context information." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-381", "text": "----------------------------------" }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-382", "text": "**SURROUNDING SEGMENTS**" }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-383", "text": "A dialog is a structured sequence of segments, in which each segment typically depends on both what has been said before and what is expected to be said in the future." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-384", "text": "Thus, the surrounding segments are the most important sources of context information for dialog act recognition." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-385", "text": "In the context of a dialog system identifying its conversational partner's intention, the system only has access to the preceding segments." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-386", "text": "Thus, we focus on approaches able to capture information from those segments." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-387", "text": "Still, in order to assess the importance of future information and simulate the annotation environment, we also performed some experiments using that information."
}, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-388", "text": "As stated in Section 3, considering the preceding segments, we have shown in a previous study (Ribeiro et al., 2015) that providing information in the form of segment classifications leads to better results than in the form of words." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-389", "text": "Liu et al. (2017) further showed that using a single label per segment is better than using the probability of each class." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-390", "text": "Furthermore, both studies showed that using automatic predictions leads to a decrease in performance of around 2 percentage points in comparison to using the manual annotations." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-391", "text": "Thus, in order to simplify the experiments and obtain an upper bound for the approach, in this study we just use the manual annotations." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-392", "text": "In our previous study, we have used up to five preceding segments and showed that the gain becomes smaller as the number of preceding segments increases, which supports the claim that the closest segments are the most relevant." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-393", "text": "Liu et al. (2017) stopped at three preceding segments, but noticed a similar pattern." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-394", "text": "In this study we explore up to five preceding segments, as well as the entire dialog history." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-395", "text": "Although both our previous study and that by Liu et al. (2017) used the classifications of preceding segments as context information, neither took into account that those segments have a sequential nature, simply flattening the sequence before appending it to the segment representation."
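The flat label-sequence context used in those studies can be sketched as follows: the labels of the preceding segments are one-hot encoded, flattened, and appended to the segment representation. The sizes are illustrative assumptions.

```python
import numpy as np

# Sketch of the flat label-sequence context (illustrative sizes).
n_classes = 4

def flat_context(prev_labels, segment_repr):
    """Append the flattened one-hot labels of the preceding segments."""
    one_hot = np.eye(n_classes)[prev_labels]        # (n_prev, n_classes)
    return np.concatenate([segment_repr, one_hot.ravel()])

segment_repr = np.zeros(10)                         # stand-in representation
extended = flat_context([2, 0, 1], segment_repr)    # three preceding labels
# The flattening discards the fact that the labels form a sequence.
```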
}, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-396", "text": "However, as previously stated, both studies have shown that each label in that sequence is related to those that precede it." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-397", "text": "Thus, an approach that captures that information is expected to lead to better performance." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-398", "text": "To do so, we introduce a recurrent layer that processes the label sequence before appending it to the segment representation." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-399", "text": "Since the closest segments are expected to have more influence but there still may be important information in distant segments, we use a GRU to process the label sequence." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-400", "text": "We used a number of neurons equal to the number of classes and no dropout." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-401", "text": "Each element in the sequence outputted by the recurrent layer consists of information about the sequence of labels up to that element." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-402", "text": "Thus, on the one hand, even if it is flattened in a similar manner to the label sequence in the previous approaches, it now contains sequentiality information." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-403", "text": "On the other hand, the last element of the sequence outputted by the recurrent layer can be seen as a summary of the sequence of labels of the preceding segments." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-404", "text": "We performed experiments using both approaches." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-405", "text": "Finally, concerning future information, in terms of representation, the sequence of labels of the following segments can be seen as a mirrored version of the sequence of labels of the preceding segments." 
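The recurrent treatment of the label sequence can be sketched in plain numpy. This is a minimal illustration, not the trained model: a GRU with as many units as there are classes consumes the one-hot labels of the preceding segments, and its last state serves as a summary of the dialog history. The weights are random stand-ins.

```python
import numpy as np

# Minimal GRU over the preceding label sequence (random stand-in weights).
rng = np.random.default_rng(0)
n_classes = 4
Wz = rng.normal(scale=0.1, size=(2 * n_classes, n_classes))
Wr = rng.normal(scale=0.1, size=(2 * n_classes, n_classes))
Wh = rng.normal(scale=0.1, size=(2 * n_classes, n_classes))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_summary(labels):
    """Run a single GRU layer over one-hot labels; return the final state."""
    h = np.zeros(n_classes)
    for idx in labels:
        x = np.eye(n_classes)[idx]
        xh = np.concatenate([x, h])
        z = sigmoid(xh @ Wz)                                 # update gate
        r = sigmoid(xh @ Wr)                                 # reset gate
        h_new = np.tanh(np.concatenate([x, r * h]) @ Wh)     # candidate state
        h = (1 - z) * h + z * h_new
    return h

summary = gru_summary([0, 2, 2, 1])  # labels of four preceding segments
```

The gating lets close segments dominate the state while still allowing information from distant segments to persist, which matches the motivation for choosing a GRU here.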
}, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-406", "text": "Thus, we propose using the same approach based on the processing of the sequence by a recurrent layer." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-407", "text": "However, in this case, the sequence is processed backwards." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-440", "text": "That is, we provide future context information in the form of a summary of the sequence of labels of the following segments." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-408", "text": "To assess the impact of this future information on the task, we performed experiments both independently of and in combination with the use of preceding segment information." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-409", "text": "----------------------------------" }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-410", "text": "**SPEAKER INFORMATION**" }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-411", "text": "As previously stated, information concerning the speakers is also relevant for dialog act recognition." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-412", "text": "However, specific information about speaker characteristics may not be available." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-413", "text": "Still, information about the speaker of each segment, that is, who said what, is always available." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-414", "text": "For instance, intentions may vary depending on whether two sequential segments are uttered by the same or by different speakers." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-415", "text": "Thus, turn-taking information is relevant for dialog act recognition." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-416", "text": "In fact, this has been confirmed in the study by Liu et al. (2017) ." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-417", "text": "Thus, we also use turn-taking information in this study." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-418", "text": "It is provided as a flag that states whether the speaker is different from that of the preceding segment."
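Computing the turn-taking flag is straightforward; the speaker labels below are invented for the example.

```python
# Turn-taking flag: 1 when a segment's speaker differs from that of the
# preceding segment, 0 otherwise (the first segment has no predecessor).
speakers = ["A", "A", "B", "B", "A"]
flags = [0] + [int(cur != prev) for prev, cur in zip(speakers, speakers[1:])]
```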
}, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-419", "text": "----------------------------------" }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-420", "text": "**RESULTS**" }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-440", "text": "That is, we provide future context information in the form of a summary of the sequence of labels of the following segments." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-421", "text": "Starting with the reproduction of the flat label sequence approach, in Table 9 we can see that the results follow the same pattern as in our previous study and that by Liu et al. (2017) ." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-422", "text": "The first preceding segment is the most important, leading to a 3.21 percentage point improvement on the validation set and 3.56 on the test set." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-423", "text": "An additional .85 percentage points on the validation set and .94 on the test set can be obtained adding additional segments, up to the fourth." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-424", "text": "However, adding the second preceding segment accounts for an improvement of .61 and .83 percentage points, respectively." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-425", "text": "This supports the claim that the influence of preceding segments decays with distance." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-426", "text": "Adding the fifth preceding segment decreases performance by .1 percentage points on the validation set and .31 on the test set." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-427", "text": "Using the whole dialog history further harms the performance by 1.36 percentage points on the validation set and 2.92 on the test set." 
}, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-428", "text": "When looking at these results, one may be tempted to assume that using large context windows not only does not provide relevant information but also harms the performance." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-429", "text": "However, the decay in performance is due to the adaptations required to use this approach in the context of a NN, as when using a context window of a given size, in order to obtain sequences with the same length, the preceding label sequence for segments which do not have enough preceding segments must be padded." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-430", "text": "The amount of required padding increases with the size of the context window, leading to a reduction in the percentage of relevant information included in the representation provided to the dimensionality reduction layer." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-431", "text": "However, the results achieved when using the whole dialog history still surpass those obtained without information from the preceding segments." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-432", "text": "Table 9 : Accuracy results using a flat label sequence of the preceding segments." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-433", "text": "Table 10 shows the results obtained when using the recurrent layer to process the preceding label sequence before concatenating it with the segment representation." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-434", "text": "Using the flattened version of the outputted sequence also suffers from the previously described padding problem." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-435", "text": "Still, comparing both approaches when using the whole dialog history shows that the recurrent layer captures relevant information." 
}, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-436", "text": "The approach that uses only the last element of the sequence outputted by the recurrent layer as a summary of the preceding context minimizes the padding problem, as when generating the output for the last element of the sequence, the GRU is able to discard that information." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-437", "text": "This is the scenario that proves the importance of the recurrent layer to capture sequentiality information between the labels, leading to results above the best obtained by the approach that does not take the sequentiality information into account." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-438", "text": "Table 10 : Accuracy results using a recurrent layer to generate preceding segment information." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-439", "text": "Concerning context information from future segments, we ended up using the representation approach that uses the last element outputted by the recurrent layer, as it was the one with best performance for preceding segments." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-441", "text": "The results in Table 11 confirm that future information is also able to provide important cues for the task, with an improvement of 2.77 percentage points on the validation set and 2.98 on the test set, in comparison to when no context information is used." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-442", "text": "However, these improvement values are lower than the 4.32 and 4.69 obtained when using context information from the preceding segments." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-443", "text": "This is consistent with the fact that when a segment is uttered, its speaker is aware of the dialog history, but only has an expectation of its future direction." 
}, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-444", "text": "Furthermore, when information from both preceding and future segments is used, the improvement in comparison with the case when only the preceding segments are used is of 1.46 percentage points on the validation set and 1.55 on the test set." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-445", "text": "Since the values are lower than the improvement provided by future information in comparison with the case without context information, this shows that there are some dialog acts that can be identified both by those that precede it and those that follow it." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-446", "text": "----------------------------------" }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-447", "text": "**VALIDATION**" }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-448", "text": "Test \u00b5 \u03c3 \u00b5 \u03c3 Future .8058 .0024 .7730 .0014 Table 11 : Accuracy results using future segment information." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-449", "text": "As for speaker information concerning turn-taking, it had been proved important for the task before." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-450", "text": "The results in Table 12 further confirm that importance." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-451", "text": "However, it leads to better results in conjunction with context information from the surrounding segments." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-452", "text": "On its own, speaker change information accounts for a .16 percentage point improvement on the validation set, but actually harms performance on the test set." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-453", "text": "When combined with information from the preceding segments, it accounts for an improvement of .27 percentage points on the validation set and .41 on the test set." 
}, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-454", "text": "Finally, when used in combination with context information from all surrounding segments it accounts for an improvement of .74 percentage points on the validation set and .53 on the test set." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-455", "text": "This improved importance in combination with information from the surrounding segments supports the claim that similar sequences of segments may have different intentions according to the involved speakers." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-456", "text": "As future work, we intend to explore the use of information concerning the whole turn-taking history instead of a single flag stating whether the speaker is different from that of the preceding segment." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-457", "text": "Table 12 : Accuracy results using speaker information." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-458", "text": "Finally, considering that during the segment annotation process the annotators had access to the complete dialogs and to information about the speakers, the scenario that includes information from all the surrounding segments and turn-taking information is the one closest to the annotation environment." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-459", "text": "In this sense, on the validation set, our approach reached the 84% inter-annotator agreement, which means that our classifier has performance similar to that of a human annotator." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-460", "text": "----------------------------------" }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-462", "text": "In this paper we explored multiple approaches on token, segment, and context information representation in the context of automatic dialog act recognition using DNNs." 
}, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-463", "text": "All the experiments were performed on the Switchboard Dialog Act Corpus (Jurafsky et al., 1997) , which is the most explored for the task." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-464", "text": "We started with the segment representation approaches, since that is the step with higher variation among the previous studies described in Section 3 and that which introduces more changes in the overall architecture." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-465", "text": "We used adaptations of the approaches with top performance in previous studies, namely the RNN-based approach by Khanpour et al. (2016) and the CNN-based approach by Liu et al. (2017) ." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-466", "text": "However, those approaches focus on capturing different kinds of information, both of which are relevant for the task." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-467", "text": "Thus, we introduced the use of the RCNN-based approach by Lai et al. (2015) and adapted it to capture relevant relations between distant tokens." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-468", "text": "This approach outperformed the other two, proving that both token sequences and the contexts that surround each token are relevant for the task." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-469", "text": "In terms of token embedding, we have explored approaches at the character, word, and functional levels." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-470", "text": "Starting with the typically used word-level, we have shown that using an embedding space with 200 dimensions as used by Liu et al. (2017) in their study leads to better results than any of the dimensionality values used by Khanpour et al. (2016) ." 
}, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-471", "text": "Furthermore, we have shown that, since the dialogs in the Switchboard Dialog Act Corpus have multiple domains, using fixed pre-trained word embeddings leads to better results than letting them be trained along with the network." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-472", "text": "In this sense, we have shown that dependency-based embeddings outperform those generated by Word2Vec, which is the most used embedding approach." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-473", "text": "This is in accordance with the task, since many dialog acts are related to the structure of the segment and, thus, the dependencies between tokens." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-474", "text": "However, the experiments that replaced the words with the corresponding POS tags did not perform as well as the word-based approaches, which shows that dialog acts are also related to specific words." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-475", "text": "Furthermore, our experiments at the character-level produced results similar to those of word-level approaches using pre-trained embeddings." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-476", "text": "This proves that there is important information for the task at a sub-word level." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-477", "text": "More specifically, the character-level approach is able to capture morphological aspects of the words, such as affixes and lemmas, which reveal their function and provide an important cue to identify intention." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-478", "text": "Concerning context information, we focused on that provided by preceding segments, since those are the ones available to a dialog system attempting to identify its conversational partner's intention." 
}, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-479", "text": "In this sense, previous dialog act recognition studies have shown that the best way to represent relevant context information from preceding segments is in the form of their classifications and not their words." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-480", "text": "However, in those studies, the sequentiality of the preceding segments, which is one of their main characteristics, was not appropriately represented." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-481", "text": "We approached this gap by reducing the representation of information from preceding segments to a summary of the sequence of their classifications, generated by a recurrent approach." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-482", "text": "Additionally, to simulate the annotation environment, in which the annotators have access to the whole dialog, we performed experiments that provided information from future segments in the same manner." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-483", "text": "In this sense, when information concerning turn-taking was combined with that extracted from all surrounding segments, our approach reached the inter-annotator agreement of 84% on the validation set, which means that our classifier has performance similar to that of a human annotator." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-484", "text": "Direct comparison with the results reported in previous studies is not straightforward." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-485", "text": "The results presented by Khanpour et al. (2016) should be comparable with ours, since we use the same test set." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-486", "text": "However, although we replicated their RNN-based approach achieving results in line with those reported for the validation set, we were not able to achieve the results they reported for the test set." 
}, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-487", "text": "Thus, we assume that the discrepancy of over 6 percentage points between the results presented for the two sets in their paper was due to the fact they considered the outcome of a single run, with a specific initialization." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-488", "text": "Furthermore, our study has shown that their approach can be improved in many aspects." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-489", "text": "In the case of the study by Liu et al. (2017) , direct result comparison with those reported is not possible since they were obtained on different sets." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-490", "text": "However, the result differences between overlapping steps in our experiments are consistent with those described in their paper." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-491", "text": "Thus, we can safely state that their approach can be improved by using five parallel CNNs, dependency-based word embeddings, and the summary representation of context information." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-492", "text": "Still, it does not perform as well as the RCNN-based approach in the same conditions." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-493", "text": "In terms of future work, there are still some aspects that can be explored, especially in terms of token representation." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-494", "text": "These are mainly related to the character-and functionallevel approaches." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-495", "text": "Considering the first, we intend to perform a more in-depth study to assess the information that character-level approaches are able to capture and whether it can be combined with that captured by word-level approaches." 
}, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-496", "text": "As for the latter, we intend to explore the use of pre-trained embeddings on a large amount of data with specific word classes replaced by the corresponding POS tags." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-497", "text": "In terms of segment representation, we believe that the room for improvement is reduced, since our best approach only leads to an improvement of a single percentage point in comparison to the much simpler max pooling baseline, which suggests that the overall improvement that can be achieved by improving the segment representation is reaching a saturation point." }, { "sent_id": "1ab7893c2a930bc5af3c34a5912dd2-C001-498", "text": "Finally, concerning context information, we intend to explore the use of the whole turn-taking history, as well as other sources that may be relevant for the task, such as the domain and context of the dialog." } ], "y": { "@BACK@": { "gold_contexts": [ [ "1ab7893c2a930bc5af3c34a5912dd2-C001-31" ], [ "1ab7893c2a930bc5af3c34a5912dd2-C001-38" ], [ "1ab7893c2a930bc5af3c34a5912dd2-C001-39" ], [ "1ab7893c2a930bc5af3c34a5912dd2-C001-95" ], [ "1ab7893c2a930bc5af3c34a5912dd2-C001-99" ], [ "1ab7893c2a930bc5af3c34a5912dd2-C001-101" ], [ "1ab7893c2a930bc5af3c34a5912dd2-C001-102" ], [ "1ab7893c2a930bc5af3c34a5912dd2-C001-109" ], [ "1ab7893c2a930bc5af3c34a5912dd2-C001-111" ], [ "1ab7893c2a930bc5af3c34a5912dd2-C001-159" ], [ "1ab7893c2a930bc5af3c34a5912dd2-C001-181" ], [ "1ab7893c2a930bc5af3c34a5912dd2-C001-184" ], [ "1ab7893c2a930bc5af3c34a5912dd2-C001-246", "1ab7893c2a930bc5af3c34a5912dd2-C001-247", "1ab7893c2a930bc5af3c34a5912dd2-C001-248" ], [ "1ab7893c2a930bc5af3c34a5912dd2-C001-254", "1ab7893c2a930bc5af3c34a5912dd2-C001-255" ], [ "1ab7893c2a930bc5af3c34a5912dd2-C001-378" ], [ "1ab7893c2a930bc5af3c34a5912dd2-C001-379" ], [ "1ab7893c2a930bc5af3c34a5912dd2-C001-389" ] ], "cite_sentences": [ "1ab7893c2a930bc5af3c34a5912dd2-C001-31", 
"1ab7893c2a930bc5af3c34a5912dd2-C001-38", "1ab7893c2a930bc5af3c34a5912dd2-C001-39", "1ab7893c2a930bc5af3c34a5912dd2-C001-95", "1ab7893c2a930bc5af3c34a5912dd2-C001-99", "1ab7893c2a930bc5af3c34a5912dd2-C001-101", "1ab7893c2a930bc5af3c34a5912dd2-C001-102", "1ab7893c2a930bc5af3c34a5912dd2-C001-109", "1ab7893c2a930bc5af3c34a5912dd2-C001-111", "1ab7893c2a930bc5af3c34a5912dd2-C001-159", "1ab7893c2a930bc5af3c34a5912dd2-C001-181", "1ab7893c2a930bc5af3c34a5912dd2-C001-184", "1ab7893c2a930bc5af3c34a5912dd2-C001-248", "1ab7893c2a930bc5af3c34a5912dd2-C001-255", "1ab7893c2a930bc5af3c34a5912dd2-C001-378", "1ab7893c2a930bc5af3c34a5912dd2-C001-379", "1ab7893c2a930bc5af3c34a5912dd2-C001-389" ] }, "@SIM@": { "gold_contexts": [ [ "1ab7893c2a930bc5af3c34a5912dd2-C001-121" ], [ "1ab7893c2a930bc5af3c34a5912dd2-C001-132" ], [ "1ab7893c2a930bc5af3c34a5912dd2-C001-314" ], [ "1ab7893c2a930bc5af3c34a5912dd2-C001-392", "1ab7893c2a930bc5af3c34a5912dd2-C001-393" ], [ "1ab7893c2a930bc5af3c34a5912dd2-C001-421" ], [ "1ab7893c2a930bc5af3c34a5912dd2-C001-489", "1ab7893c2a930bc5af3c34a5912dd2-C001-490" ] ], "cite_sentences": [ "1ab7893c2a930bc5af3c34a5912dd2-C001-121", "1ab7893c2a930bc5af3c34a5912dd2-C001-132", "1ab7893c2a930bc5af3c34a5912dd2-C001-314", "1ab7893c2a930bc5af3c34a5912dd2-C001-393", "1ab7893c2a930bc5af3c34a5912dd2-C001-421", "1ab7893c2a930bc5af3c34a5912dd2-C001-489", "1ab7893c2a930bc5af3c34a5912dd2-C001-490" ] }, "@MOT@": { "gold_contexts": [ [ "1ab7893c2a930bc5af3c34a5912dd2-C001-268", "1ab7893c2a930bc5af3c34a5912dd2-C001-269", "1ab7893c2a930bc5af3c34a5912dd2-C001-270" ], [ "1ab7893c2a930bc5af3c34a5912dd2-C001-395" ], [ "1ab7893c2a930bc5af3c34a5912dd2-C001-415", "1ab7893c2a930bc5af3c34a5912dd2-C001-416", "1ab7893c2a930bc5af3c34a5912dd2-C001-417" ] ], "cite_sentences": [ "1ab7893c2a930bc5af3c34a5912dd2-C001-268", "1ab7893c2a930bc5af3c34a5912dd2-C001-395", "1ab7893c2a930bc5af3c34a5912dd2-C001-416" ] }, "@UNSURE@": { "gold_contexts": [ [ "1ab7893c2a930bc5af3c34a5912dd2-C001-279" ] ], 
"cite_sentences": [ "1ab7893c2a930bc5af3c34a5912dd2-C001-279" ] }, "@EXT@": { "gold_contexts": [ [ "1ab7893c2a930bc5af3c34a5912dd2-C001-465" ] ], "cite_sentences": [ "1ab7893c2a930bc5af3c34a5912dd2-C001-465" ] }, "@USE@": { "gold_contexts": [ [ "1ab7893c2a930bc5af3c34a5912dd2-C001-470" ] ], "cite_sentences": [ "1ab7893c2a930bc5af3c34a5912dd2-C001-470" ] }, "@DIF@": { "gold_contexts": [ [ "1ab7893c2a930bc5af3c34a5912dd2-C001-489" ] ], "cite_sentences": [ "1ab7893c2a930bc5af3c34a5912dd2-C001-489" ] } } }, "ABC_e803782890224294066ce447671981_9": { "x": [ { "sent_id": "e803782890224294066ce447671981-C001-6", "text": "----------------------------------" }, { "sent_id": "e803782890224294066ce447671981-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "e803782890224294066ce447671981-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "e803782890224294066ce447671981-C001-2", "text": "We describe an algorithm for recovering non-local dependencies in syntactic dependency structures." }, { "sent_id": "e803782890224294066ce447671981-C001-3", "text": "The patternmatching approach proposed by Johnson (2002) for a similar task for phrase structure trees is extended with machine learning techniques." }, { "sent_id": "e803782890224294066ce447671981-C001-4", "text": "The algorithm is essentially a classifier that predicts a nonlocal dependency given a connected fragment of a dependency structure and a set of structural features for this fragment." }, { "sent_id": "e803782890224294066ce447671981-C001-5", "text": "Evaluating the algorithm on the Penn Treebank shows an improvement of both precision and recall, compared to the results presented in (Johnson, 2002) ." }, { "sent_id": "e803782890224294066ce447671981-C001-8", "text": "Non-local dependencies (also called long-distance, long-range or unbounded) appear in many frequent linguistic phenomena, such as passive, WHmovement, control and raising etc." 
}, { "sent_id": "e803782890224294066ce447671981-C001-9", "text": "Although much current research in natural language parsing focuses on extracting local syntactic relations from text, nonlocal dependencies have recently started to attract more attention." }, { "sent_id": "e803782890224294066ce447671981-C001-10", "text": "In (Clark et al., 2002) long-range dependencies are included in parser's probabilistic model, while Johnson (2002) presents a method for recovering non-local dependencies after parsing has been performed." }, { "sent_id": "e803782890224294066ce447671981-C001-11", "text": "More specifically, Johnson (2002) describes a pattern-matching algorithm for inserting empty nodes and identifying their antecedents in phrase structure trees or, to put it differently, for recovering non-local dependencies." }, { "sent_id": "e803782890224294066ce447671981-C001-12", "text": "From a training corpus with annotated empty nodes Johnson's algorithm first extracts those local fragments of phrase trees which connect empty nodes with their antecedents, thus \"licensing\" corresponding non-local dependencies." }, { "sent_id": "e803782890224294066ce447671981-C001-13", "text": "Next, the extracted tree fragments are used as patterns to match against previously unseen phrase structure trees: when a pattern is matched, the algorithm introduces a corresponding non-local dependency, inserting an empty node and (possibly) coindexing it with a suitable antecedent." }, { "sent_id": "e803782890224294066ce447671981-C001-14", "text": "In (Johnson, 2002 ) the author notes that the biggest weakness of the algorithm seems to be that it fails to robustly distinguish co-indexed and free empty nodes and it is lexicalization that may be needed to solve this problem." }, { "sent_id": "e803782890224294066ce447671981-C001-15", "text": "Moreover, the author suggests that the algorithm may suffer from overlearning, and using more abstract \"skeletal\" patterns may be helpful to avoid this." 
}, { "sent_id": "e803782890224294066ce447671981-C001-16", "text": "In an attempt to overcome these problems we developed a similar approach using dependency structures rather than phrase structure trees, which, moreover, extends bare pattern matching with machine learning techniques." }, { "sent_id": "e803782890224294066ce447671981-C001-17", "text": "A different definition of pattern allows us to significantly reduce the number of patterns extracted from the same corpus." }, { "sent_id": "e803782890224294066ce447671981-C001-18", "text": "Moreover, the patterns we obtain are quite general and in most cases directly correspond to specific linguistic phenomena." }, { "sent_id": "e803782890224294066ce447671981-C001-19", "text": "This helps us to understand what information about syntactic structure is important for the recovery of non-local dependencies and in which cases lexicalization (or even semantic analysis) is required." }, { "sent_id": "e803782890224294066ce447671981-C001-20", "text": "On the other hand, using these simplified patterns, we may loose some structural information important for recovery of non-local dependencies." }, { "sent_id": "e803782890224294066ce447671981-C001-21", "text": "To avoid this, we associate patterns with certain structural features and use statistical classifi- cation methods on top of pattern matching." }, { "sent_id": "e803782890224294066ce447671981-C001-22", "text": "The evaluation of our algorithm on data automatically derived from the Penn Treebank shows an increase in both precision and recall in recovery of non-local dependencies by approximately 10% over the results reported in (Johnson, 2002) ." }, { "sent_id": "e803782890224294066ce447671981-C001-23", "text": "However, additional work remains to be done for our algorithm to perform well on the output of a parser." 
}, { "sent_id": "e803782890224294066ce447671981-C001-24", "text": "----------------------------------" }, { "sent_id": "e803782890224294066ce447671981-C001-25", "text": "**FROM THE PENN TREEBANK TO A DEPENDENCY TREEBANK**" }, { "sent_id": "e803782890224294066ce447671981-C001-26", "text": "This section describes the corpus of dependency structures that we used to evaluate our algorithm." }, { "sent_id": "e803782890224294066ce447671981-C001-27", "text": "The corpus was automatically derived from the Penn Treebank II corpus (Marcus et al., 1993) , by means of the script chunklink.pl (Buchholz, 2002) that we modified to fit our purposes." }, { "sent_id": "e803782890224294066ce447671981-C001-28", "text": "The script uses a sort of head percolation table to identify heads of constituents, and then converts the result to a dependency format." }, { "sent_id": "e803782890224294066ce447671981-C001-29", "text": "We refer to (Buchholz, 2002) for a thorough description of the conversion algorithm, and will only emphasize the two most important modifications that we made." }, { "sent_id": "e803782890224294066ce447671981-C001-30", "text": "One modification of the conversion algorithm concerns participles and reduced relative clauses modifying NPs." }, { "sent_id": "e803782890224294066ce447671981-C001-31", "text": "Regular participles in the Penn Treebank II are simply annotated as VPs adjoined to the modified NPs (see Figure 1(a) )." }, { "sent_id": "e803782890224294066ce447671981-C001-32", "text": "These participles (also called reduced relative clauses, as they lack auxiliary verbs and complementizers) are both syntactically and semantically similar to full relative clauses, but the Penn annotation does not introduce empty complementizers, thus preventing coindexing of a trace with any antecedent." 
}, { "sent_id": "e803782890224294066ce447671981-C001-33", "text": "We perform a simple heuristic modification while converting the Treebank to the dependency format: when we encounter an NP modified by a VP headed by a past participle, an object dependency is introduced between the head of the VP and the head of the NP." }, { "sent_id": "e803782890224294066ce447671981-C001-34", "text": "Figure 1(b) shows an example, with solid arrows denoting local and dotted arrows denoting non-local dependencies." }, { "sent_id": "e803782890224294066ce447671981-C001-35", "text": "Arrows are marked with dependency labels and go from dependents to heads." }, { "sent_id": "e803782890224294066ce447671981-C001-36", "text": "This simple heuristics does not allow us to handle all reduced relative clauses, because some of them correspond to PPs or NPs rather than VPs, but the latter are quite rare in the Treebank." }, { "sent_id": "e803782890224294066ce447671981-C001-37", "text": "The second important change to Buchholz' script concerns the structure of VPs." }, { "sent_id": "e803782890224294066ce447671981-C001-38", "text": "For every verb cluster, we choose the main verb as the head of the cluster, and leave modal and auxiliary verbs as dependents of the main verb." }, { "sent_id": "e803782890224294066ce447671981-C001-39", "text": "A similar modification was used by Eisner (1996) for the study of dependency parsing models." }, { "sent_id": "e803782890224294066ce447671981-C001-40", "text": "As will be described below, this allows us to \"factor out\" tense and modality of finite clauses from our patterns, making the patterns more general." 
}, { "sent_id": "e803782890224294066ce447671981-C001-41", "text": "----------------------------------" }, { "sent_id": "e803782890224294066ce447671981-C001-42", "text": "**PATTERN EXTRACTION AND MATCHING**" }, { "sent_id": "e803782890224294066ce447671981-C001-43", "text": "After converting the Penn Treebank to a dependency treebank, we first extracted non-local dependency patterns." }, { "sent_id": "e803782890224294066ce447671981-C001-44", "text": "As in (Johnson, 2002) , our patterns are minimal connected fragments containing both nodes involved in a non-local dependency." }, { "sent_id": "e803782890224294066ce447671981-C001-45", "text": "However, in our case these fragments are not connected sets of local trees, but shortest paths in local dependency graphs, leading from heads to non-local dependents." }, { "sent_id": "e803782890224294066ce447671981-C001-46", "text": "Patterns do not include POS tags of the involved words, but only labels of the dependencies." }, { "sent_id": "e803782890224294066ce447671981-C001-47", "text": "Thus, a pattern is a directed graph with labeled edges, and two distinguished nodes: the head and the dependent of a corresponding non-local dependency." }, { "sent_id": "e803782890224294066ce447671981-C001-48", "text": "When several patterns intersect, as may be the case, for example, when a word participates in more than one nonlocal dependency, these patterns are handled independently." }, { "sent_id": "e803782890224294066ce447671981-C001-49", "text": "Figure 2 shows examples of dependency graphs (above) and extracted patterns (below, with filled bullets corresponding to the nodes of a nonlocal dependency)." }, { "sent_id": "e803782890224294066ce447671981-C001-50", "text": "As before, dotted lines denote non-local dependencies." 
}, { "sent_id": "e803782890224294066ce447671981-C001-51", "text": "The definition of a structure matching a pattern, and the algorithms for pattern matching and pattern extraction from a corpus are straightforward and similar to those described in (Johnson, 2002) ." }, { "sent_id": "e803782890224294066ce447671981-C001-52", "text": "The total number of non-local dependencies found in the Penn WSJ is 57325." }, { "sent_id": "e803782890224294066ce447671981-C001-53", "text": "The number of different extracted patterns is 987." }, { "sent_id": "e803782890224294066ce447671981-C001-54", "text": "The 80 most frequent patterns (those that we used for the evaluation of our algorithm) cover 53700 out of all 57325 nonlocal dependencies (93,7%)." }, { "sent_id": "e803782890224294066ce447671981-C001-55", "text": "These patterns were further cleaned up manually, e.g., most Penn functional tags (-TMP, -CLR etc., but not -OBJ, -SBJ, -PRD) were removed." }, { "sent_id": "e803782890224294066ce447671981-C001-56", "text": "Thus, we ended up with 16 structural patterns (covering the same 93,7% of the Penn Treebank)." }, { "sent_id": "e803782890224294066ce447671981-C001-57", "text": "Table 1 shows some of the patterns found in the Penn Treebank." }, { "sent_id": "e803782890224294066ce447671981-C001-58", "text": "The column Count gives the number of times a pattern introduces non-local dependencies in the corpus." }, { "sent_id": "e803782890224294066ce447671981-C001-59", "text": "The Match column is the number of times a pattern actually occurs in the corpus (whether it introduces a non-local dependency or not)." }, { "sent_id": "e803782890224294066ce447671981-C001-60", "text": "The patterns are shown as dependency graphs with labeled arrows from dependents to heads." }, { "sent_id": "e803782890224294066ce447671981-C001-61", "text": "The column Dependency shows labels and directions of introduced non-local dependencies." 
}, { "sent_id": "e803782890224294066ce447671981-C001-62", "text": "Clearly, an occurrence of a pattern alone is not enough for inserting a non-local dependency and determining its label, as for many patterns Match is significantly greater than Count." }, { "sent_id": "e803782890224294066ce447671981-C001-63", "text": "For this reason we introduce a set of other structural features, associated with patterns." }, { "sent_id": "e803782890224294066ce447671981-C001-64", "text": "For every occurrence of a pattern and for every word of this occurrence, we extract the following features:" }, { "sent_id": "e803782890224294066ce447671981-C001-65", "text": "pos, the POS tag of the word; class, the simplified word class (similar to (Eisner, 1996) ); fin, whether the word is a verb and a head of a finite verb cluster (as opposed to infinitives, gerunds or participles); subj, whether the word has a dependent (probably not included in the pattern) with a dependency label NP-SBJ; and obj, the same for NP-OBJ label." }, { "sent_id": "e803782890224294066ce447671981-C001-66", "text": "Thus, an occurrence of a pattern is associated with a sequence of symbolic features: five features for each node in the pattern." }, { "sent_id": "e803782890224294066ce447671981-C001-67", "text": "E.g., a pattern consisting of 3 nodes will have a feature vector with 15 elements." }, { "sent_id": "e803782890224294066ce447671981-C001-68", "text": "Table 2 : Examples of the patterns." }, { "sent_id": "e803782890224294066ce447671981-C001-69", "text": "The \"support\" words, i.e. words that appear in a pattern but are neither heads nor non-local dependents, are in italic; they correspond to empty bullets in patterns in Table 1 ." }, { "sent_id": "e803782890224294066ce447671981-C001-70", "text": "Boldfaced words correspond to filled bullets in Table 1 ." 
}, { "sent_id": "e803782890224294066ce447671981-C001-71", "text": "Given a pattern instance and its feature vector, our task now is to determine whether the pattern introduces a non-local dependency and, if so, what the label of this dependency is." }, { "sent_id": "e803782890224294066ce447671981-C001-72", "text": "In many cases this is not a binary decision, since one pattern may introduce several possible labeled dependencies (e.g., the pattern Table 1 )." }, { "sent_id": "e803782890224294066ce447671981-C001-73", "text": "Our task is a classification task: an instance of a pattern must be assigned to two or more classes, corresponding to several possible dependency labels (or absence of a dependency)." }, { "sent_id": "e803782890224294066ce447671981-C001-74", "text": "We train a classifier on instances extracted from a corpus, and then apply it to previously unseen instances." }, { "sent_id": "e803782890224294066ce447671981-C001-75", "text": "The procedure for finding non-local dependencies now consists of the two steps:" }, { "sent_id": "e803782890224294066ce447671981-C001-76", "text": "1. given a local dependency structure, find matching patterns and their feature vectors; 2. for each pattern instance found, use the classifier to identify a possible non-local dependency." }, { "sent_id": "e803782890224294066ce447671981-C001-77", "text": "----------------------------------" }, { "sent_id": "e803782890224294066ce447671981-C001-78", "text": "**EXPERIMENTS AND EVALUATION**" }, { "sent_id": "e803782890224294066ce447671981-C001-79", "text": "In our experiments we used sections 02-22 of the Penn Treebank as the training corpus and section 23 as the test corpus." }, { "sent_id": "e803782890224294066ce447671981-C001-80", "text": "First, we extracted all non-local patterns from the Penn Treebank, which resulted in 987 different (pattern, non-local dependency) pairs." 
}, { "sent_id": "e803782890224294066ce447671981-C001-81", "text": "As described in Section 3, after cleaning up we took 16 of the most common patterns." }, { "sent_id": "e803782890224294066ce447671981-C001-82", "text": "For each of these 16 patterns, instances of the pattern, pattern features, and a non-local dependency label (or the special label \"no\" if no dependency was introduced by the instance) were extracted from the training and test corpora." }, { "sent_id": "e803782890224294066ce447671981-C001-83", "text": "We performed experiments with two statistical classifiers: the decision tree induction system C4.5 (Quinlan, 1993) and the Tilburg Memory-Based Learner (TiMBL) (Daelemans et al., 2002) ." }, { "sent_id": "e803782890224294066ce447671981-C001-84", "text": "In most cases TiBML performed slightly better." }, { "sent_id": "e803782890224294066ce447671981-C001-85", "text": "The results described in this section were obtained using TiMBL." }, { "sent_id": "e803782890224294066ce447671981-C001-86", "text": "For each of the 16 structural patterns, a separate classifier was trained on the set of (feature-vector, label) pairs extracted from the training corpus, and then evaluated on the pairs from the test corpus." }, { "sent_id": "e803782890224294066ce447671981-C001-87", "text": "Table 1 shows the results for some of the most frequent patterns, using conventional metrics: precision (the fraction of the correctly labeled dependencies among all the dependencies found), recall (the fraction of the correctly found dependencies among all the dependencies with a given label) and f-score (harmonic mean of precision and recall)." }, { "sent_id": "e803782890224294066ce447671981-C001-88", "text": "The table also shows the number of times a pattern (together with a specific non-local dependency label) actually occurs in the whole Penn Treebank corpus (the column Dependency count)." 
}, { "sent_id": "e803782890224294066ce447671981-C001-89", "text": "In order to compare our results to the results presented in (Johnson, 2002) , we measured the overall performance of the algorithm across patterns and non-local dependency labels." }, { "sent_id": "e803782890224294066ce447671981-C001-90", "text": "This corresponds to the row \"Overall\" of Table 4 in (Johnson, 2002) , repeated here in Table 4 ." }, { "sent_id": "e803782890224294066ce447671981-C001-91", "text": "We also evaluated the procedure on NP traces across all patterns, i.e., on nonlocal dependencies with NP-SBJ, NP-OBJ or NP-PRD labels." }, { "sent_id": "e803782890224294066ce447671981-C001-92", "text": "This corresponds to rows 2, 3 and 4 of Table 4 in (Johnson, 2002) ." }, { "sent_id": "e803782890224294066ce447671981-C001-93", "text": "Our results are presented in Table 3 ." }, { "sent_id": "e803782890224294066ce447671981-C001-94", "text": "The first three columns show the results for those non-local dependencies that are actually covered by our 16 patterns (i.e., for 93.7% of all non-local dependencies)." }, { "sent_id": "e803782890224294066ce447671981-C001-95", "text": "The last three columns present the evaluation with respect to all non-local dependencies, thus the precision is the same, but recall drops accordingly." }, { "sent_id": "e803782890224294066ce447671981-C001-116", "text": "However, many less frequent phenomena appear to be much harder." }, { "sent_id": "e803782890224294066ce447671981-C001-96", "text": "These last columns give the results that can be compared to Johnson's results for section 23 (Table 4) Table 3 : Overall performance of our algorithm." 
}, { "sent_id": "e803782890224294066ce447671981-C001-97", "text": "----------------------------------" }, { "sent_id": "e803782890224294066ce447671981-C001-98", "text": "**ON SECTION 23**" }, { "sent_id": "e803782890224294066ce447671981-C001-99", "text": "On parser output P R f P R f Overall 0.80 0.70 0.75 0.73 0.63 0.68 Table 4 : Results from (Johnson, 2002) ." }, { "sent_id": "e803782890224294066ce447671981-C001-100", "text": "It is difficult to make a strict comparison of our results and those in (Johnson, 2002) ." }, { "sent_id": "e803782890224294066ce447671981-C001-101", "text": "The two algorithms are designed for slightly different purposes: while Johnson's approach allows one to recover free empty nodes (without antecedents), we look for nonlocal dependencies, which corresponds to identification of co-indexed empty nodes (note, however, the modifications we describe in Section 2, when we actually transform free empty nodes into co-indexed empty nodes)." }, { "sent_id": "e803782890224294066ce447671981-C001-102", "text": "----------------------------------" }, { "sent_id": "e803782890224294066ce447671981-C001-103", "text": "**DISCUSSION**" }, { "sent_id": "e803782890224294066ce447671981-C001-104", "text": "The results presented in the previous section show that it is possible to improve over the simple pattern matching algorithm of (Johnson, 2002) , using dependency rather than phrase structure information, more skeletal patterns, as was suggested by Johnson, and a set of features associated with instances of patterns." }, { "sent_id": "e803782890224294066ce447671981-C001-105", "text": "One of the reasons for this improvement is that our approach allows us to discriminate between different syntactic phenomena involving non-local dependencies." }, { "sent_id": "e803782890224294066ce447671981-C001-106", "text": "In most cases our patterns correspond to linguistic phenomena." 
}, { "sent_id": "e803782890224294066ce447671981-C001-107", "text": "That helps to understand why a particular construction is easy or difficult for our approach, and in many cases to make the necessary modifications to the algorithm (e.g., adding other features to instances of patterns)." }, { "sent_id": "e803782890224294066ce447671981-C001-108", "text": "For example, for patterns 11 and 12 (see Tables 1 and 2 ) our classifier distinguishes subject and object reasonably well, apparently, because the feature has a local object is explicitly present for all instances (for the examples 11 and 12 in Table 2 , expand has a local object, but do doesn't)." }, { "sent_id": "e803782890224294066ce447671981-C001-109", "text": "Another reason is that the patterns are general enough to factor out minor syntactic differences in linguistic phenomena (e.g., see example 4 in Table 2)." }, { "sent_id": "e803782890224294066ce447671981-C001-110", "text": "Indeed, the most frequent 16 patterns cover 93.7% of all non-local dependencies in the corpus." }, { "sent_id": "e803782890224294066ce447671981-C001-111", "text": "This is mainly due to our choices in the dependency representation, such as making the main verb a head of a verb phrase." }, { "sent_id": "e803782890224294066ce447671981-C001-112", "text": "During the conversion to a dependency treebank and extraction of patterns some important information may have been lost (e.g., the finiteness of a verb cluster, or presence of subject and object); for that reason we had to associate patterns with additional features, encoding this information and providing it to the classifier." }, { "sent_id": "e803782890224294066ce447671981-C001-113", "text": "In other words, we first take an \"oversimplified\" representation of the data, and then try to find what other data features can be useful." 
}, { "sent_id": "e803782890224294066ce447671981-C001-114", "text": "This strategy appears to be successful, because it allows us to identify which information is important for the recovery of non-local dependencies." }, { "sent_id": "e803782890224294066ce447671981-C001-115", "text": "More generally, the reasonable overall performance of the algorithm is due to the fact that for the most common non-local dependencies (extraction in relative clauses and reduced relative clauses, passivization, control and raising) the structural information we extract is enough to robustly identify non-local dependencies in a local dependency graph: the most frequent patterns in Table 1 are also those with best scores." }, { "sent_id": "e803782890224294066ce447671981-C001-117", "text": "For example, performance for relative clauses with extracted objects or adverbs is much worse than for subject relative clauses (e.g., patterns 2 and 3 vs. 1 in Table 1 )." }, { "sent_id": "e803782890224294066ce447671981-C001-118", "text": "Apparently, in most cases this is not due to the lack of training data, but because structural information alone is not enough and lexical preferences, subcategorization information, or even semantic properties should be considered." }, { "sent_id": "e803782890224294066ce447671981-C001-119", "text": "We think that the aproach allows us to identify those \"hard\" cases." }, { "sent_id": "e803782890224294066ce447671981-C001-120", "text": "The natural next step in evaluating our algorithm is to work with the output of a parser instead of the original local structures from the Penn Treebank." }, { "sent_id": "e803782890224294066ce447671981-C001-121", "text": "Obviously, because of parsing errors the performance drops significantly: e.g., in the experiments reported in (Johnson, 2002 ) the overall fscore decreases from 0.75 to 0.68 when evaluating on parser output (see Table 4 )." 
}, { "sent_id": "e803782890224294066ce447671981-C001-122", "text": "While experimenting with Collins' parser (Collins, 1999) , we found that for our algorithm the accuracy drops even more dramatically, when we train the classifier on Penn Treebank data and test it on parser output." }, { "sent_id": "e803782890224294066ce447671981-C001-123", "text": "One of the reasons is that, since we run our algorithm not on the parser's output itself but on the output automatically converted to dependency structures, conversion errors also contribute to the performance drop." }, { "sent_id": "e803782890224294066ce447671981-C001-124", "text": "Moreover, the conversion script is highly tailored to the Penn Treebank annotation (with functional tags and empty nodes) and, when run on the parser's output, produces structures with somewhat different dependency labels." }, { "sent_id": "e803782890224294066ce447671981-C001-125", "text": "Since our algorithm is sensitive to the exact labels of the dependencies, it suffers from these systematic errors." }, { "sent_id": "e803782890224294066ce447671981-C001-126", "text": "One possible solution to that problem could be to extract patterns and train the classification algorithm not on the training part of the Penn Treebank, but on the parser output for it." }, { "sent_id": "e803782890224294066ce447671981-C001-127", "text": "This would allow us to train and test our algorithm on data of the same nature." }, { "sent_id": "e803782890224294066ce447671981-C001-128", "text": "----------------------------------" }, { "sent_id": "e803782890224294066ce447671981-C001-129", "text": "**CONCLUSIONS AND FUTURE WORK**" }, { "sent_id": "e803782890224294066ce447671981-C001-130", "text": "We have presented an algorithm for recovering longdistance dependencies in local dependency structures." 
}, { "sent_id": "e803782890224294066ce447671981-C001-131", "text": "We extend the pattern matching approach of Johnson (2002) with machine learning techniques, and use dependency structures instead of constituency trees." }, { "sent_id": "e803782890224294066ce447671981-C001-132", "text": "Evaluation on the Penn Treebank shows an increase in accuracy." }, { "sent_id": "e803782890224294066ce447671981-C001-133", "text": "However, we do not have yet satisfactory results when working on a parser output." }, { "sent_id": "e803782890224294066ce447671981-C001-134", "text": "The conversion algorithm and the dependency labels we use are largely based on the Penn Treebank annotation, and it seems difficult to use them with the output of a parser." }, { "sent_id": "e803782890224294066ce447671981-C001-135", "text": "A parsing accuracy evaluation scheme based on grammatical relations (GR), presented in (Briscoe et al., 2002) , provides a set of dependency labels (grammatical relations) and a manually annotated dependency corpus." }, { "sent_id": "e803782890224294066ce447671981-C001-136", "text": "Non-local dependencies are also annotated there, although no explicit difference is made between local and non-local dependencies." }, { "sent_id": "e803782890224294066ce447671981-C001-137", "text": "Since our classification algorithm does not depend on a particular set of dependency labels, we can also use the set of labels described by Briscoe et al, if we convert Penn Treebank to a GR-based dependency treebank and use it as the training corpus." }, { "sent_id": "e803782890224294066ce447671981-C001-138", "text": "This will allow us to make the patterns independent of the Penn Treebank annotation details and simplify testing the algorithm with a parser'u output." }, { "sent_id": "e803782890224294066ce447671981-C001-139", "text": "We will also be able to use the flexible and parameterizable scoring schemes discussed in (Briscoe et al., 2002) ." 
}, { "sent_id": "e803782890224294066ce447671981-C001-140", "text": "We also plan to develop the approach by using iteration of our non-local relations extraction algorithm, i.e., by running the algorithm, inserting the found non-local dependencies, running it again etc., until no new dependencies are found." }, { "sent_id": "e803782890224294066ce447671981-C001-141", "text": "While raising an important and interesting issue of the order in which we examine our patterns, we believe that this will allow us to handle very long extraction chains, like the one in sentence \"Aichi revised its tax calculations after being challenged for allegedly failing to report. . . \", where Aichi is a (non-local) dependent of five verbs." }, { "sent_id": "e803782890224294066ce447671981-C001-142", "text": "Iteration of the algorithm will also help to increase the coverage (which is 93,7% with our 16 non-iterated patterns)." } ], "y": { "@EXT@": { "gold_contexts": [ [ "e803782890224294066ce447671981-C001-3" ], [ "e803782890224294066ce447671981-C001-131" ] ], "cite_sentences": [ "e803782890224294066ce447671981-C001-3", "e803782890224294066ce447671981-C001-131" ] }, "@DIF@": { "gold_contexts": [ [ "e803782890224294066ce447671981-C001-5" ], [ "e803782890224294066ce447671981-C001-22" ], [ "e803782890224294066ce447671981-C001-101" ], [ "e803782890224294066ce447671981-C001-104" ], [ "e803782890224294066ce447671981-C001-131" ] ], "cite_sentences": [ "e803782890224294066ce447671981-C001-5", "e803782890224294066ce447671981-C001-22", "e803782890224294066ce447671981-C001-101", "e803782890224294066ce447671981-C001-104", "e803782890224294066ce447671981-C001-131" ] }, "@BACK@": { "gold_contexts": [ [ "e803782890224294066ce447671981-C001-10" ], [ "e803782890224294066ce447671981-C001-11" ], [ "e803782890224294066ce447671981-C001-12" ], [ "e803782890224294066ce447671981-C001-121" ] ], "cite_sentences": [ "e803782890224294066ce447671981-C001-10", "e803782890224294066ce447671981-C001-11", 
"e803782890224294066ce447671981-C001-12", "e803782890224294066ce447671981-C001-121" ] }, "@MOT@": { "gold_contexts": [ [ "e803782890224294066ce447671981-C001-14" ] ], "cite_sentences": [ "e803782890224294066ce447671981-C001-14" ] }, "@USE@": { "gold_contexts": [ [ "e803782890224294066ce447671981-C001-44" ] ], "cite_sentences": [ "e803782890224294066ce447671981-C001-44" ] }, "@SIM@": { "gold_contexts": [ [ "e803782890224294066ce447671981-C001-44" ], [ "e803782890224294066ce447671981-C001-51" ], [ "e803782890224294066ce447671981-C001-89", "e803782890224294066ce447671981-C001-90" ], [ "e803782890224294066ce447671981-C001-91", "e803782890224294066ce447671981-C001-92" ] ], "cite_sentences": [ "e803782890224294066ce447671981-C001-44", "e803782890224294066ce447671981-C001-51", "e803782890224294066ce447671981-C001-89", "e803782890224294066ce447671981-C001-90", "e803782890224294066ce447671981-C001-92" ] }, "@UNSURE@": { "gold_contexts": [ [ "e803782890224294066ce447671981-C001-89" ], [ "e803782890224294066ce447671981-C001-96" ], [ "e803782890224294066ce447671981-C001-100" ] ], "cite_sentences": [ "e803782890224294066ce447671981-C001-89", "e803782890224294066ce447671981-C001-96", "e803782890224294066ce447671981-C001-100" ] } } }, "ABC_5fa570cf5f37c7aae3b428a17de3e3_9": { "x": [ { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-2", "text": "Abstract." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-3", "text": "Generating descriptions for videos has many applications including assisting blind people and human-robot interaction." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-28", "text": "Many of them rely on Recurrent Neural Networks (RNNs) and in particular on Long-Short Term Memory networks (LSTMs)." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-81", "text": "Now, how do we select visual labels for our semantic groups?" 
}, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-27", "text": "Multiple works have been proposed [6, 8, 16, 17, 23, 35, 37] ." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-4", "text": "The recent advances in image captioning as well as the release of large-scale movie description datasets such as MPII-MD [28] allow to study this task in more depth." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-5", "text": "Many of the proposed methods for image captioning rely on pre-trained object classifier CNNs and Long-Short Term Memory recurrent networks (LSTMs) for generating descriptions." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-6", "text": "While image description focuses on objects, we argue that it is important to distinguish verbs, objects, and places in the challenging setting of movie description." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-7", "text": "In this work we show how to learn robust visual classifiers from the weak annotations of the sentence descriptions." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-8", "text": "Based on these visual classifiers we learn how to generate a description using an LSTM." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-9", "text": "We explore different design choices to build and train the LSTM and achieve the best performance to date on the challenging MPII-MD dataset." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-10", "text": "We compare and analyze our approach and prior work along various dimensions to better understand the key challenges of the movie description task." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-11", "text": "----------------------------------" }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-12", "text": "**INTRODUCTION**" }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-13", "text": "Automatic description of visual content has lately received a lot of interest in our community." 
}, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-14", "text": "Multiple works have successfully addressed the image captioning problem [6, 16, 17, 35] ." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-15", "text": "Many of the proposed methods rely on Long-Short Term Memory networks (LSTMs) [13] ." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-16", "text": "In the meanwhile, two large-scale movie description datasets have been proposed, namely MPII Movie Description (MPII-MD) [28] and Montreal Video Annotation Dataset (M-VAD) [31] ." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-17", "text": "Both are based on movies with associated textual descriptions and allow studying the problem how to generate movie description for visually disabled people." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-18", "text": "Works addressing these datasets [28, 33, 39] show that they are indeed challenging in terms of visual recognition and automatic description." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-19", "text": "This results in a significantly lower performance then on simpler video datasets (e.g. MSVD [2] ), but a detailed analysis of the difficulties is missing." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-20", "text": "In this work we address this by taking a closer look at the performance of existing methods on the movie description task." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-21", "text": "This work contributes a) an approach to build robust visual classifiers which distinguish verbs, objects, and places extracted from weak sentence annotations; b) based on the visual classifiers we evaluate different design choices to train an LSTM for generating descriptions." 
}, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-22", "text": "This outperforms related work on the MPII-MD dataset, both using automatic and human evaluation; c) we perform a detailed analysis of prior work and our approach to understand the challenges of the movie description task." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-23", "text": "----------------------------------" }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-24", "text": "**RELATED WORK**" }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-25", "text": "Image captioning." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-26", "text": "Automatic image description has been studied in the past [9, 19, 20, 24] , however it regained attention just recently." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-29", "text": "Also new datasets have been released, Flickr30k [40] and MS COCO Captions [3] , where [3] additionally presents a standardized setup for image captioning evaluation." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-30", "text": "There are also attempts to analyze the performance of recent methods." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-31", "text": "E.g. [5] compares them with respect to the novelty of generated descriptions and additionally proposes a nearest neighbor baseline that improves over recent methods." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-32", "text": "Video description." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-33", "text": "In the past video description has been addressed in semirealistic settings [1, 18] , on a small scale [4, 11, 30] or in constrained scenarios like cooking [27, 29] ." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-34", "text": "Most works (with a few exceptions, e.g. [27] ) study the task of describing a short clip with a single sentence." 
}, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-35", "text": "[6] first proposed to describe videos using an LSTM, relying on precomputed CRF scores from [27] ." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-36", "text": "[34] extended this work to extract CNN features from frames which are max-pooled over time." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-37", "text": "They show the benefit of pre-training the LSTM network for image captioning and fine-tuning it to video description." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-38", "text": "[25] proposes a framework that consists of a 2-D and/or 3-D CNN and the LSTM is trained jointly with a visual-semantic embedding to ensure better coherence between video and text." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-39", "text": "[38] jointly addresses the language generation and video/language retrieval tasks by learning a joint embedding model for a deep video model and compositional semantic language model." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-40", "text": "----------------------------------" }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-41", "text": "**MOVIE DESCRIPTION.**" }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-42", "text": "Recently two large-scale movie description datasets have been proposed, MPII Movie Description (MPII-MD) [28] and Montreal Video Annotation Dataset (M-VAD) [31] ." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-43", "text": "Given that they are based on movies, they cover a much broader domain then previous video description datasets." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-44", "text": "Consequently they are much more varied and challenging with respect to the visual content and the associated description." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-45", "text": "They also do not have any additional annotations, as e.g. 
TACoS Multi-Level [27] , thus one has to rely on the weak annotations of the sentence descriptions." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-46", "text": "To handle this challenging scenario [39] proposes an attention based model which selects the most relevant temporal segments in a video and incorporates 3-D CNN and generates a sentence using an LSTM." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-47", "text": "[33] proposes an encoder-decoder framework, where a single LSTM encodes the input video frame by frame and decodes it into a sentence, outperforming [39] ." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-48", "text": "Our approach for sentence generation is most similar to [6] and we also rely on their LSTM implementation based on Caffe [15] ." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-49", "text": "However, we analyze different aspects and variants of this architecture for movie description." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-50", "text": "To extract labels from sentences we rely on the semantic parser of [28] , however we treat the labels differently to handle the weak supervision (see Section 3.1)." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-51", "text": "We show that this improves over [28] and [33] ." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-52", "text": "Fig. 1 : Overview of our approach." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-53", "text": "We first train the visual classifiers for verbs, objects and places, using different visual features: DT (dense trajectories [36] ), LSDA (large scale object detector [14] ) and PLACES (Places-CNN [41] )." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-54", "text": "Next, we concatenate the scores from a subset of selected robust classifiers and use them as input to our LSTM." 
}, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-55", "text": "----------------------------------" }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-56", "text": "**APPROACH**" }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-57", "text": "In this section we present our two-step approach to video description." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-58", "text": "The first step performs visual recognition, while the second step generates textual descriptions." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-59", "text": "For the visual recognition we propose to use the visual classifiers trained according to the labels' semantics and \"visuality\"." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-60", "text": "For the language generation we rely on a LSTM network which has been successfully used for image and video description [6, 33] ." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-61", "text": "We discuss various design choices for building and training the LSTM." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-62", "text": "An overview of our approach is given in Figure 1 ." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-63", "text": "----------------------------------" }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-64", "text": "**VISUAL LABELS FOR ROBUST VISUAL CLASSIFIERS**" }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-65", "text": "For training we rely on a parallel corpus of videos and weak sentence annotations." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-66", "text": "As in [28] we parse the sentences to obtain a set of labels (single words or short phrases, e.g. look up) to train our visual classifiers." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-67", "text": "However, in contrast to [28] we do not want to keep all of these initial labels as they are noisy, but select only visual ones which actually can be robustly recognized." 
}, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-68", "text": "Avoiding parser failure." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-69", "text": "Not all sentences can be parsed successfully, as e.g. some sentences are incomplete or grammatically incorrect." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-70", "text": "To avoid losing the potential labels in these sentences, we match our set of initial labels to the sentences which the parser failed to process." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-71", "text": "Semantic groups." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-72", "text": "Our labels correspond to different semantic groups." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-73", "text": "In this work we consider the three most important groups: verbs (actions), objects and places, as they are typically visual." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-74", "text": "One could also consider e.g. groups like mood or emotions, which are naturally harder for visual recognition." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-75", "text": "We propose to treat each label group independently." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-76", "text": "First, we rely on a different representation for each semantic group, targeted to the specific group." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-77", "text": "Namely, we use the activity recognition feature Improved Dense Trajectories (DT) [36] for verbs, large scale object detector responses (LSDA) [14] for objects and scene classification scores (PLACES) [41] for places." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-78", "text": "Second, we train one-vs-all SVM classifiers for each group separately." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-79", "text": "The intuition behind this is to discard \"wrong negatives\" (e.g. using the object \"bed\" as a negative for the place \"bedroom\")."
}, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-80", "text": "Visual labels." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-82", "text": "In order to find the verbs among the labels we rely on the semantic parser of [28] ." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-83", "text": "Next, we look up the list of \"places\" used in [41] and search for corresponding words among our labels." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-84", "text": "We look up the object classes used in [14] and search for these \"objects\", as well as their base forms (e.g. \"domestic cat\" and \"cat\")." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-85", "text": "We discard all the labels that do not belong to any of our three groups of interest as we assume that they are likely not visual and thus are difficult to recognize." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-86", "text": "Finally, we discard labels which the classifiers could not learn, as these are likely to be noisy or not visual." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-87", "text": "For this we require the classifiers to have a minimum area under the ROC curve (Receiver Operating Characteristic)." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-88", "text": "----------------------------------" }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-89", "text": "**LSTM FOR SENTENCE GENERATION**" }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-90", "text": "We rely on the basic LSTM architecture proposed in [6] for video description." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-91", "text": "As shown in Figures 1 and 2 (a), at each time step, an LSTM generates a word and receives the visual classifiers (input-vis) as well as the previously generated word (input-lang) as input."
}, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-92", "text": "To handle natural words we encode each word with a one-hot-vector according to its index in a dictionary and a lower dimensional embedding." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-93", "text": "The embedding is jointly learned during training of the LSTM." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-94", "text": "[6] compares three variants: (a) an encoder-decoder architecture, (b) a decoder architecture with visual max predictions, and (c) a decoder architecture with visual probabilistic predictions." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-95", "text": "In this work we rely on variant (c) which was shown to work best as it can rely on the richest visual input." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-96", "text": "We analyze the following aspects for this architecture:" }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-97", "text": "Layer structure: We compare a 1-layer architecture with a 2-layer architecture." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-98", "text": "In the 2-layer architecture, the output of the first layer is used as input for the second layer ( Figure 2b ) and was used by [6] for video description." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-99", "text": "Additionally we also compare to a 2-layer factored architecture [6] , where the first layer only gets the language as input and the second gets the output of the first layer as well as the visual input." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-100", "text": "Dropout placement: To learn a more robust network which is less likely to overfit we rely on dropout [12] ." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-101", "text": "Using dropout, a ratio r of randomly selected units is set to 0 during training (while all others are scaled by 1/(1-r))."
}, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-102", "text": "We explore different ways to place dropout in the network, i.e. either for language input (lang-drop) or visual (vis-drop) input only, for both inputs (concat-drop) or for the LSTM output (lstm-drop), see Figure 2 (d)." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-103", "text": "While the default dropout ratio is r = 0.5, we evaluate the effect of different ratios." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-104", "text": "Learning strategy: By default we rely on a step-based learning strategy, where the learning rate is halved after a certain number of steps." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-105", "text": "We find the best learning rate and step size on the validation set." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-106", "text": "Additionally we compare this to a polynomial learning strategy, where the learning rate is continuously decreased." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-236", "text": "----------------------------------" }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-107", "text": "The polynomial learning strategy has been shown to give good results faster, without tweaking the step size, for the GoogleNet implementation by Sergio Guadarrama in Caffe [15] ." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-108", "text": "----------------------------------" }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-109", "text": "**EVALUATION**" }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-110", "text": "In this section we first analyze our approach on the MPII-MD [28] dataset and explore different design choices." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-111", "text": "Then, we compare our best system to prior work."
}, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-112", "text": "----------------------------------" }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-113", "text": "**ANALYSIS OF OUR APPROACH**" }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-114", "text": "Experimental setup." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-115", "text": "We build on the labels discovered by our semantic parser [28] and additionally match these labels to sentences which the parser failed to process." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-116", "text": "To be able to learn classifiers we select the labels that appear at least 30 times, resulting in 1,263 labels." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-117", "text": "The parser additionally tells us whether the label is a verb." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-118", "text": "We use the visual features (DT, LSDA, PLACES) provided with the MPII-MD dataset [28] ." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-119", "text": "The LSTM output/hidden units as well as the memory cells each have 500 dimensions." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-120", "text": "We train the SVM classifiers on the Training set (56,861 clips)." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-121", "text": "We evaluate our method on the validation set (4,930 clips) using the METEOR [21] score, which, according to [7, 32] , outperforms other popular measures, such as BLEU [26] and ROUGE [22] , in terms of agreement with human judgments." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-122", "text": "The authors of CIDEr [32] showed that METEOR also outperforms CIDEr when the number of references is small, and in the case of MPII-MD we typically have only a single reference."
}, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-123", "text": "----------------------------------" }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-124", "text": "**ROBUST VISUAL CLASSIFIERS.**" }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-125", "text": "In a first set of experiments we analyze our proposal to consider groups of labels to learn different classifiers and also to use different visual representations for these groups (see Section 3.1)." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-126", "text": "In Table 1 we evaluate our generated sentences using different input features to the LSTM." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-127", "text": "The second part of the table demonstrates the effect of introducing different semantic label groups." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-128", "text": "We first split the labels into \"Verbs\" and all remaining." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-129", "text": "Given that some labels appear in both roles, the total number of labels increases to 1,328." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-130", "text": "We analyze two settings of training the classifiers." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-131", "text": "In the case of \"Retrieved\" we retrieve the classifier scores from the general classifiers trained in the previous step. \"Trained\" corresponds to training the SVMs specifically for each label type (e.g. for \"verbs\")." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-132", "text": "Next, we further divide the non-verbal labels into \"Places\" and \"Others\", and finally into \"Places\" and \"Objects\"." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-133", "text": "We discard the unused labels and end up with 913 labels."
}, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-134", "text": "Out of these labels, we select the labels where the classifier obtains an area under the ROC curve of at least 0.7 (threshold selected on the validation set)." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-135", "text": "After this we obtain 263 labels and the best performance in the \"Trained\" setting." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-136", "text": "To support our intuition about the importance of the label discrimination (i.e. using different features for different semantic groups of labels), we propose another baseline (last line in the table)." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-137", "text": "Here we use the same set of 263 labels but provide the same feature for all of them, namely the best performing combination DT + LSDA + PLACES." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-138", "text": "As we see, this results in an inferior performance." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-139", "text": "We make several observations from Table 1 on how to obtain robust visual classifiers from the weak sentence annotations." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-140", "text": "a) It is beneficial to select features based on the label semantics." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-141", "text": "b) Training one-vs-all SVMs for specific label groups consistently improves the performance as it avoids \"wrong\" negatives." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-142", "text": "c) Focusing on more \"visual\" labels helps: we reduce the LSTM input dimensionality to 263 while improving the performance." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-143", "text": "LSTM architectures." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-144", "text": "Now, as described in Section 3.2, we look at different LSTM architectures and training configurations."
}, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-145", "text": "In the following we use the best performing \"Visual Labels\" approach, Table 1" }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-146", "text": "----------------------------------" }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-147", "text": "**(8).**" }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-148", "text": "We start with examining the architecture, where we explore different configurations of LSTM and dropout layers." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-149", "text": "Table 2a shows the performance of three different networks: \"1 layer\", \"2 layers unfactored\" and \"2 layers factored\" introduced in Section 3.2." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-150", "text": "As we see, the \"1 layer\" and \"2 layers unfactored\" perform equally well, while \"2 layers factored\" is inferior to them." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-151", "text": "In the following experiments we use the simplest \"1 layer\" network." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-152", "text": "We then compare different dropout placements as illustrated in Figure 2 (d)." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-153", "text": "We obtain the best result when applying dropout after the LSTM layer (\"lstm-drop\"), while having no dropout or applying it only to language leads to stronger over-fitting to the visual features." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-154", "text": "Putting dropout after the LSTM (and prior to a final prediction layer) makes the entire system more robust." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-155", "text": "As for the best dropout ratio, we find that 0.5 works best with lstm-drop (Table 2c)." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-156", "text": "We compare different learning rates and learning strategies in Tables 3a and 3b ."
}, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-157", "text": "We find that the best learning rate in the step-based learning is 0.01, while step 4000 slightly improves over step 2000 (which we used in Table 1 )." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-158", "text": "We explore an alternative learning strategy, namely decreasing learning rate according to a polynomial decay." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-159", "text": "We experiment with different exponents (0.5 and 0.7) and numbers of iterations (25K and 10K), using the base-learning rate 0.01." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-160", "text": "Our results show that the step-based learning is superior to the polynomial learning." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-161", "text": "In most of the experiments we trained our networks for 25,000 iterations." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-162", "text": "After looking at the METEOR performance for intermediate iterations we found that for the step size 4000 at iteration 15,000 we achieve the best overall performance." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-163", "text": "Additionally we train multiple LSTMs with different random orderings of the training data." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-164", "text": "In our experiments we combine three in an ensemble, averaging the resulting word predictions." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-165", "text": "In most cases the ensemble improves over the single networks in terms of METEOR score (see Table 4 )."
}, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-166", "text": "To summarize, the most important aspects that decrease over-fitting and lead to a better sentence generation are: (a) a correct learning rate and step size, (b) dropout after the LSTM layer, (c) choosing the training iteration based on METEOR score as opposed to only looking at the LSTM accuracy/loss which can be misleading, and (d) building ensembles of multiple networks with different random initializations." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-167", "text": "In the following section we evaluate our best ensemble (last line of Table 4 ) on the test set of MPII-MD." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-168", "text": "----------------------------------" }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-169", "text": "**COMPARISON TO RELATED WORK**" }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-170", "text": "Experimental setup." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-171", "text": "We compare the best method of [28] , the recently proposed method S2VT [33] and our proposed \"Visual Labels\"-LSTM on the test set of the MPII-MD dataset (6,578 clips)." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-172", "text": "We report all popular automatic evaluation measures, CIDEr [32] , BLEU [26] , ROUGE [22] and METEOR [21] , computed using the evaluation code of [3] ." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-173", "text": "We also perform a human evaluation, by randomly selecting 1300 video snippets and asking AMT turkers to rank three systems (the best SMT of [28] , S2VT [33] and ours) with respect to Correctness, Grammar and Relevance, similar to [28] ." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-174", "text": "Results." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-175", "text": "Moreover, we improve over the recent approach of [33] , which also uses LSTM to generate video descriptions." 
}, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-176", "text": "Exploring different strategies for label selection and classifier training, as well as various LSTM configurations, allows us to obtain the best result to date on the MPII-MD dataset." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-177", "text": "Human evaluation mainly agrees with the automatic measures." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-178", "text": "We outperform both prior works in terms of Correctness and Relevance; however, we lose to S2VT in terms of Grammar." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-179", "text": "This is due to the fact that S2VT produces overall shorter (7.4 versus 8.7 words per sentence) and simpler sentences, while our system generates longer sentences and therefore is more likely to make mistakes." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-180", "text": "We also propose a retrieval upper bound (last line in Table 5 )." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-181", "text": "For every test sentence we retrieve the closest training sentence according to the METEOR score." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-182", "text": "The rather low METEOR score of 19.43 reflects the difficulty of the dataset." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-183", "text": "A closer look at the sentences produced by all three methods gives us additional insights." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-184", "text": "An interesting characteristic is the output vocabulary size, which is 94 for [28] , 86 for [33] and 605 for our method, while the test set contains 6422 unique words." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-185", "text": "This clearly shows a higher diversity of our output."
}, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-186", "text": "Among the words generated by our system and absent in the outputs of others are verbs such as grab, drive, sip, climb, follow, objects such as suit, chair, cigarette, mirror, bottle and places such as kitchen, corridor, restaurant." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-187", "text": "We showcase some qualitative results in Figure 3 ." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-188", "text": "Here, e.g. the verb pour, object drink and place courtyard only appear in our output." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-189", "text": "We attribute this, on one hand, to our diverse and robust visual classifiers." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-190", "text": "On the other hand, the architecture and parameter choices of our LSTM allow us to learn a better correspondence between words and the visual classifiers' scores." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-191", "text": "----------------------------------" }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-192", "text": "**ANALYSIS**" }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-193", "text": "Despite the recent advances in the video description domain, including our proposed approach, the video description performance on the movie description datasets (MPII-MD [28] and M-VAD [31] ) remains relatively low." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-194", "text": "In this section we want to take a closer look at three methods, best SMT of [28] , S2VT [33] and ours, in order to understand where these methods succeed and where they fail." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-195", "text": "In the following we evaluate all three methods on the MPII-MD test set."
}, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-196", "text": "----------------------------------" }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-197", "text": "**APPROACH SENTENCE**" }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-198", "text": "SMT [28] Someone is a man, someone is a man." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-199", "text": "S2VT [33] Someone looks at him, someone turns to someone." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-200", "text": "Our Someone is standing in the crowd, a little man with a little smile." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-201", "text": "Reference Someone, back in elf guise, is trying to calm the kids." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-202", "text": "SMT [28] The car is a water of the water." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-203", "text": "S2VT [33] On the door, opens the door opens." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-204", "text": "Our" }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-205", "text": "The fellowship are in the courtyard." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-206", "text": "Reference They cross the quadrangle below and run along the cloister." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-207", "text": "SMT [28] Someone is down the door, someone is a back of the door, and someone is a door." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-208", "text": "S2VT [33] Someone shakes his head and looks at someone." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-209", "text": "Our Someone takes a drink and pours it into the water." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-210", "text": "Reference Someone grabs a vodka bottle standing open on the counter and liberally pours some on the hand." 
}, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-211", "text": "----------------------------------" }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-212", "text": "**DIFFICULTY VERSUS PERFORMANCE**" }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-213", "text": "As a first study we sort the reference sentences (from the test set) by difficulty, where difficulty is defined in multiple ways." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-214", "text": "Sentence length and Word frequency." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-215", "text": "Two of the simplest sentence difficulty measures are sentence length and average word frequency." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-216", "text": "When sorting the data by difficulty (increasing sentence length or decreasing average word frequency), we find that all three methods have the same tendency to obtain lower METEOR score as the difficulty increases (Figures 4a and 4b) ." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-217", "text": "For word frequency the correlation is stronger." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-218", "text": "Our method consistently outperforms the other two, most notably as the difficulty increases." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-219", "text": "Textual and Visual Nearest Neighbors." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-220", "text": "Next, for each reference test sentence we search for the closest training sentence (in terms of the METEOR score)." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-221", "text": "We use the obtained best scores to sort the reference sentences by textual difficulty, i.e. the \"easy\" sentences are more likely to be retrieved." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-222", "text": "If we consider all training sentences, we obtain a Textual Nearest Neighbor."
}, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-223", "text": "We sort the test sentences according to these scores (decreasing) and plot the performance of three methods in Figure 5a ." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-224", "text": "All methods \"agree\" and ours is best throughout the difficulty range, in particular in the more challenging part of the plot." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-225", "text": "We can also use visual features to find the k Nearest Neighbors in the Training set, select the best one (in terms of the METEOR score) and use this score to sort the reference sentences." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-226", "text": "We call this a Visual k Nearest Neighbor." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-227", "text": "The intuition behind it is to consider a video clip as visually \"easy\" if the most similar training clips also have similar descriptions (the \"difficult\" clip might have no close visual neighbors)." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-228", "text": "We rely on our best visual representation (8) from Table 1 and the cosine similarity measure to define the Visual kNN and sort the reference sentences according to it with k = 10 ( Figure 5b )." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-229", "text": "We see a clear correlation between the visual difficulty and the performance of all methods (Figure 5b )." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-230", "text": "Summary." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-231", "text": "a) All methods perform better on shorter, common sentences and our method notably wins on longer sentences." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-232", "text": "b) Our method also wins on sentences that are more difficult to retrieve."
}, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-233", "text": "c) Visual difficulty, defined by cosine similarity and representation (8) from Table 1 , strongly correlates with the performance of all methods." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-234", "text": "d) When comparing all four plots (Figures 4a and 4b, Figures 5a and 5b) , we find that the strongest correlation between the methods' performance and the difficulty is observed for the Textual difficulty, while the weakest correlation is observed for the Sentence length." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-235", "text": "Table 6 : Entropy and top 5 frequent verbs of each WordNet topic in the MPII-MD." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-237", "text": "**SEMANTIC ANALYSIS**" }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-238", "text": "WordNet Verb Topics." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-239", "text": "We analyze the test sentences more closely with respect to different verbs." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-240", "text": "For this we rely on WordNet topics (high level entries in the WordNet ontology, e.g. \"motion\", \"perception\", \"competition\", \"emotion\"), defined for most synsets in WordNet [10] ." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-241", "text": "We obtain the sense information from the semantic parser of [28] , thus senses might be noisy." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-242", "text": "We showcase the 5 most frequent verbs for each topic in Table 6 ." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-243", "text": "We select sentences with a single verb, group them according to the verb topic and compute an average METEOR score for each topic, see Figure 6 ." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-244", "text": "We find that our method is best for all topics except \"communication\", where [28] wins."
}, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-245", "text": "The most frequent verbs in this topic are \"look up\" and \"nod\", which are also frequent in the dataset and in the sentences produced by [28] ." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-246", "text": "The best performing topic, \"cognition\", is highly biased towards the verb \"look at\"." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-247", "text": "The most frequent topics, \"motion\" and \"contact\", which are also visual (e.g. \"turn\", \"walk\", \"open\", \"sit\"), are nevertheless quite challenging, which we attribute to their high diversity (see their entropy w.r.t." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-248", "text": "different verbs and their frequencies in Table 6 )." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-249", "text": "At the same time \"perception\" is far less diverse and mainly focuses on verbs like \"look\" or \"stare\", which are quite frequent in the dataset, resulting in better performance." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-250", "text": "Topics with more abstract verbs (e.g. \"be\", \"have\", \"start\") tend to get lower scores." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-251", "text": "Top 100 best and worst sentences." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-252", "text": "We look at the 100 test sentences where our method obtains the highest and lowest METEOR scores." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-253", "text": "Out of 100 best sentences 44 contain the verb \"look\" (including verb phrases such as \"look at\")." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-254", "text": "The other frequent verbs are \"walk\", \"turn\", \"smile\", \"nod\", \"shake\", \"stare\", \"sit\", i.e. mainly visual verbs." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-255", "text": "Overall the sentences are simple and common."
}, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-256", "text": "Among the 100 lowest scoring sentences we observe more diversity: 12 sentences contain no verb, 10 mention unusual words (specific to the movie), 24 contain no subject, 29 have a non-human subject." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-257", "text": "Altogether this leads to a lower performance, in particular, as most training sentences contain \"Someone\" as the subject and the generated sentences are biased towards it." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-258", "text": "Summary." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-259", "text": "a) The test sentences that mention the verb \"look\" (and similar) get higher METEOR scores due to their high frequency in the dataset." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-260", "text": "b) The sentences with more \"visual\" verbs tend to get higher scores." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-261", "text": "c) The sentences without verbs (e.g. describing a scene), without subjects or with non-human subjects get lower scores, which can be explained by a dataset bias towards \"Someone\" as the subject." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-262", "text": "----------------------------------" }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-263", "text": "**CONCLUSION**" }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-264", "text": "We propose an approach to automatic movie description which trains visual classifiers and uses the classifier scores as input to an LSTM." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-265", "text": "To handle the weak sentence annotations we rely on three main ingredients."
}, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-266", "text": "First, we distinguish three semantic groups of labels (verbs, objects and places), second we train them discriminatively, removing potentially noisy negatives, and third, we select only a small number of the most reliable classifiers." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-267", "text": "For sentence generation we show the benefits of exploring different LSTM architectures and learning configurations." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-268", "text": "As the result we obtain the highest performance on the MPII-MD dataset as shown by all automatic evaluation measures and extensive human evaluation." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-269", "text": "We analyze the challenges in the movie description task using our and two prior works." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-270", "text": "We find that the factors which contribute to higher performance include: presence of frequent words, sentence length and simplicity as well as presence of \"visual\" verbs (e.g. \"nod\", \"walk\", \"sit\", \"smile\")." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-271", "text": "Textual and visual difficulties of sentences/clips strongly correlate with the performance of all methods." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-272", "text": "We observe a high bias in the data towards humans as subjects and verbs similar to \"look\"." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-273", "text": "Future work has to focus on dealing with less frequent words and handle less visual descriptions." }, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-274", "text": "This potentially requires to consider external text corpora, modalities other than video, such as audio and dialog, and to look across multiple sentences." 
}, { "sent_id": "5fa570cf5f37c7aae3b428a17de3e3-C001-275", "text": "This would allow exploiting long-and shortrange context and thus understanding and describing the story of the movie." } ], "y": { "@BACK@": { "gold_contexts": [ [ "5fa570cf5f37c7aae3b428a17de3e3-C001-4" ], [ "5fa570cf5f37c7aae3b428a17de3e3-C001-16" ], [ "5fa570cf5f37c7aae3b428a17de3e3-C001-18" ], [ "5fa570cf5f37c7aae3b428a17de3e3-C001-42" ] ], "cite_sentences": [ "5fa570cf5f37c7aae3b428a17de3e3-C001-4", "5fa570cf5f37c7aae3b428a17de3e3-C001-16", "5fa570cf5f37c7aae3b428a17de3e3-C001-18", "5fa570cf5f37c7aae3b428a17de3e3-C001-42" ] }, "@UNSURE@": { "gold_contexts": [ [ "5fa570cf5f37c7aae3b428a17de3e3-C001-9" ], [ "5fa570cf5f37c7aae3b428a17de3e3-C001-111" ], [ "5fa570cf5f37c7aae3b428a17de3e3-C001-122" ], [ "5fa570cf5f37c7aae3b428a17de3e3-C001-194" ], [ "5fa570cf5f37c7aae3b428a17de3e3-C001-198" ], [ "5fa570cf5f37c7aae3b428a17de3e3-C001-202" ], [ "5fa570cf5f37c7aae3b428a17de3e3-C001-207" ], [ "5fa570cf5f37c7aae3b428a17de3e3-C001-269" ] ], "cite_sentences": [ "5fa570cf5f37c7aae3b428a17de3e3-C001-9", "5fa570cf5f37c7aae3b428a17de3e3-C001-111", "5fa570cf5f37c7aae3b428a17de3e3-C001-122", "5fa570cf5f37c7aae3b428a17de3e3-C001-194", "5fa570cf5f37c7aae3b428a17de3e3-C001-198", "5fa570cf5f37c7aae3b428a17de3e3-C001-202", "5fa570cf5f37c7aae3b428a17de3e3-C001-207", "5fa570cf5f37c7aae3b428a17de3e3-C001-269" ] }, "@DIF@": { "gold_contexts": [ [ "5fa570cf5f37c7aae3b428a17de3e3-C001-21", "5fa570cf5f37c7aae3b428a17de3e3-C001-22" ], [ "5fa570cf5f37c7aae3b428a17de3e3-C001-50" ], [ "5fa570cf5f37c7aae3b428a17de3e3-C001-67" ], [ "5fa570cf5f37c7aae3b428a17de3e3-C001-173", "5fa570cf5f37c7aae3b428a17de3e3-C001-174", "5fa570cf5f37c7aae3b428a17de3e3-C001-175", "5fa570cf5f37c7aae3b428a17de3e3-C001-176", "5fa570cf5f37c7aae3b428a17de3e3-C001-177", "5fa570cf5f37c7aae3b428a17de3e3-C001-178" ], [ "5fa570cf5f37c7aae3b428a17de3e3-C001-184" ], [ "5fa570cf5f37c7aae3b428a17de3e3-C001-244" ] ], "cite_sentences": [ 
"5fa570cf5f37c7aae3b428a17de3e3-C001-22", "5fa570cf5f37c7aae3b428a17de3e3-C001-50", "5fa570cf5f37c7aae3b428a17de3e3-C001-67", "5fa570cf5f37c7aae3b428a17de3e3-C001-173", "5fa570cf5f37c7aae3b428a17de3e3-C001-176", "5fa570cf5f37c7aae3b428a17de3e3-C001-178", "5fa570cf5f37c7aae3b428a17de3e3-C001-184", "5fa570cf5f37c7aae3b428a17de3e3-C001-244" ] }, "@USE@": { "gold_contexts": [ [ "5fa570cf5f37c7aae3b428a17de3e3-C001-50" ], [ "5fa570cf5f37c7aae3b428a17de3e3-C001-66" ], [ "5fa570cf5f37c7aae3b428a17de3e3-C001-82" ], [ "5fa570cf5f37c7aae3b428a17de3e3-C001-110" ], [ "5fa570cf5f37c7aae3b428a17de3e3-C001-115" ], [ "5fa570cf5f37c7aae3b428a17de3e3-C001-118" ], [ "5fa570cf5f37c7aae3b428a17de3e3-C001-167" ], [ "5fa570cf5f37c7aae3b428a17de3e3-C001-171" ], [ "5fa570cf5f37c7aae3b428a17de3e3-C001-176" ], [ "5fa570cf5f37c7aae3b428a17de3e3-C001-193" ], [ "5fa570cf5f37c7aae3b428a17de3e3-C001-195" ], [ "5fa570cf5f37c7aae3b428a17de3e3-C001-241" ], [ "5fa570cf5f37c7aae3b428a17de3e3-C001-268" ] ], "cite_sentences": [ "5fa570cf5f37c7aae3b428a17de3e3-C001-50", "5fa570cf5f37c7aae3b428a17de3e3-C001-66", "5fa570cf5f37c7aae3b428a17de3e3-C001-82", "5fa570cf5f37c7aae3b428a17de3e3-C001-110", "5fa570cf5f37c7aae3b428a17de3e3-C001-115", "5fa570cf5f37c7aae3b428a17de3e3-C001-118", "5fa570cf5f37c7aae3b428a17de3e3-C001-167", "5fa570cf5f37c7aae3b428a17de3e3-C001-171", "5fa570cf5f37c7aae3b428a17de3e3-C001-176", "5fa570cf5f37c7aae3b428a17de3e3-C001-193", "5fa570cf5f37c7aae3b428a17de3e3-C001-195", "5fa570cf5f37c7aae3b428a17de3e3-C001-241", "5fa570cf5f37c7aae3b428a17de3e3-C001-268" ] }, "@SIM@": { "gold_contexts": [ [ "5fa570cf5f37c7aae3b428a17de3e3-C001-173" ], [ "5fa570cf5f37c7aae3b428a17de3e3-C001-245" ] ], "cite_sentences": [ "5fa570cf5f37c7aae3b428a17de3e3-C001-173", "5fa570cf5f37c7aae3b428a17de3e3-C001-245" ] } } }, "ABC_5d68c07f716cd3c9861921d7e515ea_9": { "x": [ { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-108", "text": "**EVALUATION AND RESULTS**" }, { "sent_id": 
"5d68c07f716cd3c9861921d7e515ea-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-2", "text": "We follow the step-by-step approach to neural data-to-text generation we proposed in Moryossef et al. (2019) , in which the generation process is divided into a text-planning stage followed by a plan-realization stage." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-3", "text": "We suggest four extensions to that framework: (1) we introduce a trainable neural planning component that can generate effective plans several orders of magnitude faster than the original planner; (2) we incorporate typing hints that improve the model's ability to deal with unseen relations and entities; (3) we introduce a verification-by-reranking stage that substantially improves the faithfulness of the resulting texts; (4) we incorporate a simple but effective referring expression generation module." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-4", "text": "These extensions result in a generation process that is faster, more fluent, and more accurate." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-5", "text": "----------------------------------" }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-7", "text": "In the data-to-text generation task (D2T), the input is data encoding facts (e.g., a table, a set of tuples, or a small knowledge graph), and the output is a natural language text representing those facts." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-8", "text": "1 In neural D2T, the common approaches train a neural end-to-end encoder-decoder system that encodes the input data and decodes an output text." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-9", "text": "In recent work (Moryossef et al., 2019) we proposed to adopt ideas from \"traditional\" language generation approaches (i.e. Reiter and Dale (2000) ; Walker et al. 
(2007) ; Gatt and Krahmer (2017) ) that separate the generation into a planning stage that determines the order and structure of the expressed facts, and a realization stage that maps the plan to natural language text." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-10", "text": "We show that by breaking the task this way, one can achieve the same fluency of neural generation systems while being able to better control the form of the generated text and to improve its correctness by reducing missing facts and \"hallucinations\", common in neural systems." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-11", "text": "In this work we adopt the step-by-step framework of Moryossef et al. (2019) and propose four independent extensions that improve aspects of our original system: we suggest a new plan generation mechanism, based on a trainable-yetverifiable neural decoder, that is orders of magnitude faster than the original one ( \u00a73); we use knowledge of the plan structure to add typing information to plan elements." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-12", "text": "This improves the system's performance on unseen relations and entities ( \u00a74); the separation of planning from realizations allows the incorporation of a simple output verification heuristic that drastically improves the correctness of the output ( \u00a75); and finally we incorporate a post-processing referring expression generation (REG) component, as proposed but not implemented in our previous work, to improve the naturalness of the resulting output ( \u00a76)." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-13", "text": "----------------------------------" }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-14", "text": "**STEP-BY-STEP GENERATION**" }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-15", "text": "We provide a brief overview of the step-by-step system." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-16", "text": "See Moryossef et al. 
(2019) for further details." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-17", "text": "The system works in two stages." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-18", "text": "The first stage (planning) maps the input facts (encoded as a directed, labeled graph, where nodes represent entities and edges represent relations) to text plans, while the second stage (realization) maps the text plans to natural language text." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-19", "text": "The text plans are a sequence of sentence plans-each of which is a tree-representing the ordering of facts and entities within the sentence." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-20", "text": "In other words, the plans determine the separation of facts into sentences, the ordering of sentences, and the ordering of facts and entities within each sentence." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-21", "text": "This stage is completely verifiable: the text plans are guaranteed to faithfully encode all and only the facts from the input." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-22", "text": "The realization stage then translates the plans into natural language sentences, using a neural sequenceto-sequence system, resulting in fluent output." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-23", "text": "----------------------------------" }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-24", "text": "**FAST AND VERIFIABLE PLANNER**" }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-25", "text": "The data-to-plan component in Moryossef et al. (2019) exhaustively generates all possible plans, scores them using a heuristic, and chooses the highest scoring one for realization." 
}, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-26", "text": "While this is feasible with the small input graphs in the WebNLG challenge (Colin et al., 2016) , it is also very computationally intensive, growing exponentially with the input size." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-27", "text": "We propose an alternative planner which works in linear time in the size of the graph and remains verifiable: generated plans are guaranteed to represent the input faithfully." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-28", "text": "The original planner works by first enumerating over all possible splits into sentences (subgraphs), and for each sub-graph enumerating over all possible undirected, unordered, Depth First Search (DFS) traversals, where each traversal corresponds to a sentence plan." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-29", "text": "Our planner combines these into a single process." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-30", "text": "It works by performing a series of what we call random truncated DFS traversals." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-31", "text": "In a DFS traversal, a node is visited, then its children are visited recursively in order." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-32", "text": "Once all children are visited, the node \"pops\" back to the parent." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-33", "text": "In a random truncated traversal, the choice of which children to visit next, as well as whether to go to the next children or to \"pop\", is non-deterministic (in practice, our planner decides by using a neural-network controller)." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-34", "text": "Popping at a node before visiting all its children truncates the DFS: further descendants of that node will not be visited in this traversal." 
}, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-35", "text": "It behaves as a DFS on a graph where edges to these descendants do not exist." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-36", "text": "Popping the starting node terminates the traversal." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-37", "text": "Our planner works by choosing a node with a non-zero degree and performing a truncated DFS traversal from that node." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-38", "text": "Then, all edges visited in the traversal are removed from the input graph, and the process repeats (performing another truncated DFS) until no more edges remain." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-39", "text": "Each truncated DFS traversal corresponds to a sentence plan, following the DFS-to-plan procedure of Moryossef et al. (2019) : the linearized plan is generated incrementally at each step of the traversal." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-40", "text": "This process is linear in the number of edges in the graph." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-41", "text": "At training time, we use the plan-to-DFS mapping to perform the correct sequence of traversals, and train a neural classifier to act as a controller, choosing which action to perform at each step." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-42", "text": "At test time, we use the controller to guide the truncated DFS process." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-43", "text": "This mechanism is inspired by transition based parsing (Nivre and McDonald, 2008) ." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-44", "text": "The action set at each stage is dynamic." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-45", "text": "During traversal, it includes the available children at each stage and POP." 
}, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-46", "text": "Before traversals, it includes a choose-i action for each available node n i ." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-47", "text": "We assign a score to each action, normalize with softmax, and train to choose the desired one using cross-entropy loss." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-48", "text": "At test time, we either greedily choose the best action, or we can sample plans by sampling actions according to their assigned probabilities." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-49", "text": "Feature Representation and action scoring." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-50", "text": "Each graph node n i corresponds to an entity x n i , and has an associated embedding vector x n i ." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-51", "text": "Each relation r i is associated with an embedding vector r i ." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-52", "text": "Each labeled input graph edge e k = (n i , r , n j ) is represented as a projected concatenated vector e k = E(x n i ; r ; x n j ), where E is a projection matrix." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-53", "text": "Finally, each node n i is then represented as a vector n i = V[x n i ; e j \u2208\u03c0(i) e j ; e j \u2208\u03c0 \u22121 (i) e j ], where \u03c0(i) and \u03c0 \u22121 (i) are the incoming and outgoing edges from node n i ." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-54", "text": "The traverse-to-childvia-edge-e j action is represented as e j , choosenode-i is represented as n i and pop-to-node-i is represented as n i + p where p is a learned vector." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-55", "text": "The score for an action a at time t is calculated as a dot-product between the action representation and the LSTM state over the symbols generated in the plan so far." 
}, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-56", "text": "Thus, each decision takes into account the immediate surrounding of the node in the graph, and the plan structure generated so far." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-57", "text": "Speed On a 7 edges graph, the planner of Moryossef et al. (2019) takes an average of 250 seconds to generate a plan, while our planner takes 0.0025 seconds, 5 orders of magnitude faster." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-58", "text": "----------------------------------" }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-59", "text": "**INCORPORATING TYPING INFORMATION FOR UNSEEN ENTITIES AND RELATIONS**" }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-60", "text": "In Moryossef et al. (2019) , the sentence plan trees were linearized into strings that were then fed to a neural machine translation decoder (Open-NMT) (Klein et al., 2017) with a copy mecha-nism." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-61", "text": "This linearization process is lossy, in the sense that the linearized strings do not explicitly distinguish between symbols that represent entities (e.g., BARACK OBAMA) and symbols that represent relations (e.g., works-for)." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-62", "text": "While this information can be deduced from the position of the symbol within the structure, there is a benefit in making it more explicit." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-63", "text": "In particular, the decoder needs to act differently when decoding relations and entities: entities are copied, while relations need to be verbalized." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-64", "text": "By making the typing information explicit to the decoder, we make it easier for it to generalize this behavior distinction and apply it also for unseen entities and relations." 
}, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-65", "text": "We thus expect the typing information to be especially useful for the unseen part of the evaluation set." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-66", "text": "We incorporate typing information by concatenating to the embedding vector of each input symbol one of three embedding vectors, S, E or R, where S is concatenated to structural elements (opening and closing brackets), E to entity symbols and R to relation symbols." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-67", "text": "----------------------------------" }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-68", "text": "**OUTPUT VERIFICATION**" }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-69", "text": "While the plan generation stage is guaranteed to be faithful to the input, the translation process from plans to text is based on a neural seq2seq model and may suffer from known issues with such models: hallucinating facts that do not exist in the input, repeating facts, or dropping facts." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-70", "text": "While the clear mapping between plans and text helps to reduce these issues greatly, the system in Moryossef et al. (2019) still has 2% errors of these kinds." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-71", "text": "Existing approaches: soft encouragement via neural modules." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-72", "text": "Recent work in neural text generation and summarization attempt to address these issues by trying to map the textual outputs back to structured predicates, and comparing these predicates to the input data." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-73", "text": "Kiddon et al. (2016) uses a neural checklist model to avoid the repetition of facts and improve coverage." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-74", "text": "Agarwal et al. 
(2018) generate k-best output candidates with beam search, and then try to map each candidate output back to the input structure using a reverse seq2seq model trained on the same data." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-75", "text": "They then select the highest scoring output candidate that best translates back to the input." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-76", "text": "Mohiuddin and Joty (2019) reconstructs the input in training time, by jointly learning a back-translation model and enforcing the back-translation to re-construct the input." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-77", "text": "Both of these approaches are \"soft\" in the sense that they crucially rely on the internal dynamics or on the output of a neural network module that may or may not be correct." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-78", "text": "Our proposal: explicit verification." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-79", "text": "The separation between planning and realization provided by the step-by-step framework allows incorporating a robust and straightforward verification step, that does not rely on brittle information extraction procedures or trust neural network models." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-80", "text": "The plan-to-text generation handles each sentence individually and translates entities as copy operations." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-81", "text": "We thus have complete knowledge of the generated entities and their locations." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-82", "text": "We can then assess the correctness of an output sentence by comparing 2 its sequence of entities to the entity sequence in the corresponding sentence plan, which is guaranteed to be complete." 
}, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-83", "text": "We then decode k-best outputs and rerank them based on their correctness scores, tie-breaking using model scores." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-84", "text": "We found empirically that, with a beam of size 5 we find at least one candidate with an exact match to the plan's entity sequence in 99.82% of the cases for seen entities and relations compared to 98.48% at 1-best, and 72.3% for cases of unseen entities and relations compared to 58.06% at 1-best." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-85", "text": "In the remaining cases, we set the system to continue searching by trying other plans, by going down the list of plans (when using the exhaustive planner of Moryossef et al. (2019) ) or by sampling a new plan (when using the linear time planner suggested in this paper)." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-86", "text": "----------------------------------" }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-87", "text": "**REFERRING EXPRESSIONS**" }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-88", "text": "The step-by-step system generates entities by first generating an indexed entity symbols, and then lexicalizing each symbol to the string associated with this entity in the input structure (i.e., all occurrences of the entity 11TH MISSISSIPPI IN-FANTRY MONUMENT will be lexicalized with the full name rather than \"it\" or \"the monument\")." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-89", "text": "This results in correct but somewhat unnatural structures." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-90", "text": "In contrast, end-to-end neural generation systems are trained on text that includes referring expressions, and generate them naturally as part of the decoding process, resulting in natural looking text." 
}, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-91", "text": "However, the generated referring expressions are sometimes incorrect." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-92", "text": "Moryossef et al. (2019) suggests the possibility of handling this with a post-processing referring-expression generation step (REG)." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-93", "text": "Here, we propose a concrete REG module and demonstrate its effectiveness." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-94", "text": "One option is to use a supervised REG module (Ferreira et al., 2018) , that is trained to lexicalize in-context mentions." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-95", "text": "Such an approach is suboptimal for our setup as it is restricted to the entities and contexts it seen in training, and is prone to error on unseen entities and contexts." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-96", "text": "Our REG solution lexicalizes the first mention of each entity as its associated string and attempts to generate referring expressions to subsequent mentions." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-97", "text": "The generated referring expressions can take the form \"PRON\", \"X\" or \"THE X\" where PRON is a pronoun 3 , and X is a word appearing in the entity's string (allowing, e.g., John, or the monument)." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-98", "text": "We also allow referring to its entity with its entire associated string." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-99", "text": "We restrict the set of allowed pronouns for each entity according to its type (male, female, plural-animate, unknown-animate, inanimate)." 
}, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-100", "text": "4 We then take, for each entity mention individually, the referring expression that receives the best language model score in context, using a strong unsupervised neural LM (BERT (Devlin et al., 2018) )." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-101", "text": "The system is guaranteed to be correct in the sense that it will not generate wrong pronouns." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-102", "text": "It also has failure modes: it is possible for the system to generate ambiguous referring expressions (e.g., John is Bob's father." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-103", "text": "He works as a nurse.), and may lexicalize Boston University as Boston." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-104", "text": "We find that the second kind of mistake is rare as it is handled well by the language model." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-105", "text": "It can also be controlled by manually restricting the set of possible referring expression to each entity." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-107", "text": "----------------------------------" }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-106", "text": "Similarly, it is easy to extend the system to support other lexicalizations of entities by extending the sets of allowed lexicalizations (for example, supporting abbreviations, initials or nicknames) either as user-supplied inputs or using heuristics." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-109", "text": "We evaluate each of the introduced components separately." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-110", "text": "Tables listing their interactions are available in the appendix." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-111", "text": "The appendix also lists some qualitative outputs." 
}, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-112", "text": "The main trends that we observe are:" }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-113", "text": "\u2022 The new planner causes a small drop in BLEU, but is orders of magnitude faster ( \u00a77.1)." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-114", "text": "\u2022 Typing information causes a negligible drop in BLEU overall, but improves results substantially for the unseen portion of the dataset ( \u00a77.2)." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-115", "text": "\u2022 The verification step is effective at improving the faithfulness of the output, practically eliminating omitted and overgenerated facts, reducing the number of wrong facts, and increasing the number of correctly expressed facts." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-116", "text": "This is based on both manual and automatic evaluations." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-117", "text": "( \u00a77.3)." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-118", "text": "\u2022 The referring expression module is effective, with an intrinsic correctness of 92.2%." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-119", "text": "It substantially improves BLEU scores." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-120", "text": "( \u00a77.4)." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-121", "text": "----------------------------------" }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-122", "text": "**SETUP**" }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-123", "text": "We evaluate on the WebNLG dataset (Colin et al., 2016) , comparing to the step-bystep systems described in Moryossef et al. (2019) , which are state of the art." 
}, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-124", "text": "Due to randomness inherent in neural training, our reported automatic evaluation measures are based on an average of 5 training runs of each system (neural planner and neural realizer), each run with a different random seed." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-125", "text": "----------------------------------" }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-126", "text": "**NEURAL PLANNER VS EXHAUSTIVE PLANNER**" }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-127", "text": "We compare the exhaustive planner from Moryossef et al. (2019) to our neural planner, by replacing the planner component in the Moryossef et al. (2019) system." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-128", "text": "Moving to the neural planner exhibits a small drop in BLEU (46.882 dropped to 46.506)." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-129", "text": "However, figure 1 indicates 5 orders of magnitude (100,000x) speedup for graphs with 7 edges, and a linear growth in time for number of edges compared to exponential time for the exhaustive planner." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-130", "text": "----------------------------------" }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-131", "text": "**EFFECT OF TYPE INFORMATION**" }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-132", "text": "We repeat the coverage experiment in (Moryossef et al., 2019) , counting the number of output texts that contain all the entities in the input graph, and, of these text, counting the ones in which the entities appear in the exact same order as the plan." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-133", "text": "Incorporating typing information reduced the number of texts not containing all entities by 18% for the seen part of the test set, and 16% for the unseen part." 
}, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-134", "text": "Moreover, for the text containing all entities, the number of texts that did not follow the plan's entity order is reduced by 46% for the seen part of the test set, and by 35% for the unseen part." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-135", "text": "We also observe a small drop in BLEU scores, which we attribute to some relations being verbalized more freely (though correctly)." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-136", "text": "----------------------------------" }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-137", "text": "**EFFECT OF OUTPUT VERIFICATION**" }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-138", "text": "The addition of output verification resulted in negligible changes in BLEU, reinforcing that automatic metrics are not sensitive enough to output accuracy." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-139", "text": "We thus performed manual analysis, following the procedure in Moryossef et al. (2019) ." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-140", "text": "We manually inspect 148 samples from the seen part of the test set, containing 440 relations, counting expressed, omitted, wrong and over-generated (hallucinated) facts." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-141", "text": "5 We compare to the StrongNeural and BestPlan systems from Moryossef et al. (2019) ." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-142", "text": "Results in Table 1 indicate that the effectiveness of the verification process in ensuring correct output, reducing the already small number of ommited and overgenerated facts to 0 (with the exhaustive planner) and keeping it small (with the fast neural planner)." 
}, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-143", "text": "----------------------------------" }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-144", "text": "**REFERRING EXPRESSION MODULE**" }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-145", "text": "Intrinsic evaluation of the REG module." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-146", "text": "We manually reviewed 1,177 pairs of entities and referring expressions generated by the system." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-147", "text": "We find that 92.2% of the generated referring expressions refer to the correct entity." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-148", "text": "From the generated expressions, 325 (27.6%) were pronouns, 192 (16.3%) are repeating a onetoken entity as is, and 505 (42.9%) are generating correct shortening of a long entity." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-149", "text": "In 63 (5.6%) of the cases the system did not find a good substitute and kept the entire entity intact." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-150", "text": "Finally, 92 (7.82%) are wrong referrals." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-151", "text": "Overall, 73.3% of the non-first mentions of entities were replaced with suitable shorter and more fluent expressions." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-152", "text": "Effect on BLEU scores." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-153", "text": "As can be seen in Table 2 , using the REG module increases BLEU scores for both the exhaustive and the neural planner." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-154", "text": "----------------------------------" }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-155", "text": "**CONCLUSIONS**" }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-156", "text": "We adopt the planning-based neural generation framework of Moryossef et al. 
(2019) and extend it to be orders of magnitude faster and produce more correct and more fluent text." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-157", "text": "We conclude that these extensions not only improve the system of Moryossef et al. (2019) but also highlight the flexibility and advantages of the step-by-step framework for text generation." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-158", "text": "----------------------------------" }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-159", "text": "**APPENDICES**" }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-160", "text": "We report some additional results." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-161", "text": "----------------------------------" }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-162", "text": "**A INTERACTION OF DIFFERENT COMPONENTS**" }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-163", "text": "We introduced 4 components: neural planner instead of exhaustive one, adding type information, adding output verification stage, and incorporating a referring expression generation (REG)." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-164", "text": "In Table 3 we report BLEU scores (Papineni et al., 2002) for all 16 combinations of components." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-165", "text": "The numbers are averages of 5 runs with different random seeds." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-166", "text": "----------------------------------" }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-167", "text": "**B REG ERROR ANALYSIS**" }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-168", "text": "We perform further analysis of the errors of the unsupervised LM based REG module." 
}, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-169", "text": "We categorise all entities into 3 groups: (1) names of people;" }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-170", "text": "(2) locations (cities / counties / countries); and (3) places and objects." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-171", "text": "For person names, the module did not produce any errors, selecting either a correct pronoun, or either the first or last name of a person, all valid refferences." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-172", "text": "For location names, we observe two distinct error types, both relating to our module's restriction to predict a single MASK token." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-173", "text": "The first type is in cases like \"city, country\" or \"county, country\", where the more specific location is not in the LM vocabulary, and cannot be predicted with a single token." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-174", "text": "For example, in \"Punjab, Pakistan\", Punjab is not contained in the vocabulary as a single token, causing the model to select \"Pakistan\", which we consider a mistake." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-175", "text": "The second type is when a city name is longer than a single token, as in \"New York\"." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-176", "text": "While it is common to refer to \"New Jersey\" as \"Jersey\", it is wrong to refer to \"New York\" as either \"New\" or \"York\", and as BERT can only fill in one MASK token, it chooses only one (in this case \"York\")." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-177", "text": "Finally, for places and objects, we also identify to mistake types." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-178", "text": "The first occurs for multi-token entities." 
}, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-179", "text": "While for some cases it is possible to select the correct one (i.e., \"Agra Airport\" \u2192 \"The Airport\" or \"Boston University\" \u2192 \"The University\"), in other cases it is not possible (i.e., \"Baked Alaska\", where choosing either word does not produce a useful reference)." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-180", "text": "The second type occurs with names of objects, like books titles." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-181", "text": "For example, for the entity \"A Severed Wasp\" we would like the model to predict \"The Book\"." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-182", "text": "However, as we only allow either pronouns or words from the original entity, the model cannot produce \"The book\", producing the erroneous \"The Wasp\" instead." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-183", "text": "----------------------------------" }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-184", "text": "**C OUTPUT EXAMPLES**" }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-185", "text": "The following output examples demonstrate the kinds of texts produces by the final system." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-186", "text": "The following outputs are correct, expressing all and only the facts from their input graphs." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-187", "text": "We enumerate them as number of facts:" }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-188", "text": "1. The leader of Azerbaijan is Artur Rasizade." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-189", "text": "2. Baked Alaska, containing Sponge Cake, is from France." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-190", "text": "3. Above The Veil, written by Garth Nix, is available in Hardcover and has 248 pages." 
}, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-191", "text": "----------------------------------" }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-192", "text": "**THE AKITA MUSEUM OF ART IS LOCATED IN**" }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-193", "text": "Japan where the Brazilians In Japan are an ethnic group." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-194", "text": "The Museum is located in Akita, Akita which is part of Akita Prefecture ." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-195", "text": "5. The AWH Engineering College in Kuttikkattoor, Kerala has Mah, India to its northwest ." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-196", "text": "The College was established in 2001 and has a staff of 250." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-197", "text": "An example where the system failed, producing a wrong lexicalization of a fact is: \"The AWH Engineering College is located in the state of Kerala, Kochi, in India." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-198", "text": "Table 3 : Average BLEU score for every combination of methods (avg of 5 independent runs)." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-199", "text": "Mumbai and the river is the Ganges\"." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-200", "text": "In this example, the input entity Kochi refers to the leader of Kerala, and not tpo the location (although there is also a location by that name)." }, { "sent_id": "5d68c07f716cd3c9861921d7e515ea-C001-201", "text": "The text lexicalizes this fact such that Kerala and Kochi are related, but with a relation of part-of, implying Kerala is in Kochi." 
} ], "y": { "@USE@": { "gold_contexts": [ [ "5d68c07f716cd3c9861921d7e515ea-C001-2" ], [ "5d68c07f716cd3c9861921d7e515ea-C001-11" ], [ "5d68c07f716cd3c9861921d7e515ea-C001-39" ], [ "5d68c07f716cd3c9861921d7e515ea-C001-132" ], [ "5d68c07f716cd3c9861921d7e515ea-C001-139" ], [ "5d68c07f716cd3c9861921d7e515ea-C001-156" ] ], "cite_sentences": [ "5d68c07f716cd3c9861921d7e515ea-C001-2", "5d68c07f716cd3c9861921d7e515ea-C001-11", "5d68c07f716cd3c9861921d7e515ea-C001-39", "5d68c07f716cd3c9861921d7e515ea-C001-132", "5d68c07f716cd3c9861921d7e515ea-C001-139", "5d68c07f716cd3c9861921d7e515ea-C001-156" ] }, "@EXT@": { "gold_contexts": [ [ "5d68c07f716cd3c9861921d7e515ea-C001-2", "5d68c07f716cd3c9861921d7e515ea-C001-3" ], [ "5d68c07f716cd3c9861921d7e515ea-C001-11" ], [ "5d68c07f716cd3c9861921d7e515ea-C001-156" ] ], "cite_sentences": [ "5d68c07f716cd3c9861921d7e515ea-C001-2", "5d68c07f716cd3c9861921d7e515ea-C001-3", "5d68c07f716cd3c9861921d7e515ea-C001-11", "5d68c07f716cd3c9861921d7e515ea-C001-156" ] }, "@BACK@": { "gold_contexts": [ [ "5d68c07f716cd3c9861921d7e515ea-C001-9" ], [ "5d68c07f716cd3c9861921d7e515ea-C001-25" ], [ "5d68c07f716cd3c9861921d7e515ea-C001-60" ], [ "5d68c07f716cd3c9861921d7e515ea-C001-70" ], [ "5d68c07f716cd3c9861921d7e515ea-C001-92" ] ], "cite_sentences": [ "5d68c07f716cd3c9861921d7e515ea-C001-9", "5d68c07f716cd3c9861921d7e515ea-C001-25", "5d68c07f716cd3c9861921d7e515ea-C001-60", "5d68c07f716cd3c9861921d7e515ea-C001-70", "5d68c07f716cd3c9861921d7e515ea-C001-92" ] }, "@UNSURE@": { "gold_contexts": [ [ "5d68c07f716cd3c9861921d7e515ea-C001-16" ], [ "5d68c07f716cd3c9861921d7e515ea-C001-85" ], [ "5d68c07f716cd3c9861921d7e515ea-C001-123" ], [ "5d68c07f716cd3c9861921d7e515ea-C001-127" ], [ "5d68c07f716cd3c9861921d7e515ea-C001-141" ] ], "cite_sentences": [ "5d68c07f716cd3c9861921d7e515ea-C001-16", "5d68c07f716cd3c9861921d7e515ea-C001-85", "5d68c07f716cd3c9861921d7e515ea-C001-123", "5d68c07f716cd3c9861921d7e515ea-C001-127", "5d68c07f716cd3c9861921d7e515ea-C001-141" ] 
}, "@DIF@": { "gold_contexts": [ [ "5d68c07f716cd3c9861921d7e515ea-C001-57" ], [ "5d68c07f716cd3c9861921d7e515ea-C001-157" ] ], "cite_sentences": [ "5d68c07f716cd3c9861921d7e515ea-C001-57", "5d68c07f716cd3c9861921d7e515ea-C001-157" ] } } }, "ABC_c8cf2d615cc47395a55bc8737cd9fd_9": { "x": [ { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-20", "text": "All rights reserved." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-65", "text": "In , they proposed a framework of encoderdecoder models, called skip-thoughts." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-18", "text": "However, for the question part, we couldn't find any acceptable method to measure the robustness of VQA algorithms after extensive literature review." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-2", "text": "Visual Question Answering (VQA) models should have both high robustness and accuracy." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-3", "text": "Unfortunately, most of the current VQA research only focuses on accuracy because there is a lack of proper methods to measure the robustness of VQA models." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-4", "text": "There are two main modules in our algorithm." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-5", "text": "Given a natural language question about an image, the first module takes the question as input and then outputs the ranked basic questions, with similarity scores, of the main given question." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-6", "text": "The second module takes the main question, image and these basic questions as input and then outputs the text-based answer of the main question about the given image." 
}, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-7", "text": "We claim that a robust VQA model is one, whose performance is not changed much when related basic questions as also made available to it as input." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-8", "text": "We formulate the basic questions generation problem as a LASSO optimization, and also propose a large scale Basic Question Dataset (BQD) and Rscore (novel robustness measure), for analyzing the robustness of VQA models." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-9", "text": "We hope our BQD will be used as a benchmark for to evaluate the robustness of VQA models, so as to help the community build more robust and accurate VQA models." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-10", "text": "----------------------------------" }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-11", "text": "**INTRODUCTION**" }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-12", "text": "Visual Question Answering (VQA) is a challenging and young research field, which can help machines achieve one of the ultimate goals in computer vision, holistic scene understanding (Yao, Fidler, and Urtasun 2012) ." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-13", "text": "VQA is a computer vision task: a system is given an arbitrary textbased question about an image, and then tasked to output the text-based answer of the given question about the image." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-14", "text": "Currently, most of the state-of-the-art VQA models (Antol et al. 2015; Chen et al. 2016; Lu et al. 2016; Ben-younes et al. 2017; Fukui et al. 2016; Kim et al. 2017) only focus on how to improve accuracy." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-15", "text": "However, accuracy is not the only metric to score a given VQA model." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-16", "text": "Robustness is also a crucial property." 
}, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-17", "text": "For the image part, there is already a rapidly growing research on evaluating the robustness of deep learning models (Fawzi, Moosavi Dezfooli, and Frossard 2017; Carlini and Wagner 2017; Xu, Caramanis, and Mannor 2009) ." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-19", "text": "To Copyright c 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org)." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-195", "text": "ii.) Who Can Affect The Quality of BQs?" }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-21", "text": "Regarding the input question of Module 2, it is the direct concatenation of a given main question with 3 basic questions of the main question." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-22", "text": "Here, \"\u2295\" denotes the direct concatenation of basic questions." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-23", "text": "the best of our knowledge, this paper is the first work to analyze the robustness of VQA models." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-24", "text": "When we directly add or replace some words or phrases by similar words or phrases to the given main question, the VQA model should output the same or a very similar answer." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-25", "text": "In some sense, we can consider the similar words or phrases as small perturbation or noise to the input." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-26", "text": "If the model can output the same answer as before in the presence of small noise, we can say that the model is robust." 
}, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-27", "text": "Moreover, if a human is presented with only the main question or the main question accompanied with some highly similar questions to the main question (called basic questions of the main question), he/she tends to produce the same or a highly similar answer in both cases." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-28", "text": "Based on this observation, we consider the basic question of given main question as noise to the given main question." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-29", "text": "That is to say, a basic question with larger similarity score to the main question is considered smaller noise to the main question and a basic question with smaller similarity score is larger noise to the main question." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-30", "text": "The above interesting concepts inspire us to propose a method, Visual Question Answering by Basic Questions (VQABQ) illustrated in Figure 1 , to measure the robustness of VQA models." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-31", "text": "In the VQABQ model, there are two modules: the basic question generation module (Module 1) and the visual question answering module (Module 2)." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-32", "text": "We take the query question, called the main question (MQ), encoded by Skip-Thought Vectors , as the input of Module 1." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-33", "text": "In Module 1, we encode all of the questions, also by Skip-Thought Vectors, from the training and validation sets of VQA (Antol et al. 2015) dataset as a 4800 by 186027 dimension basic question (BQ) matrix, and then solve the LASSO optimization problem (Huang, Alfadly, and Ghanem 2017) , with MQ, to find the top 3 similar BQ of MQ." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-34", "text": "These BQ are the output of Module 1." 
}, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-35", "text": "Moreover, we take the direct concatenation of MQ and BQ and the given image as the input of Module 2, the general VQA module, and then it will output the answer prediction of MQ." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-36", "text": "We claim that the BQ of given MQ can be considered as the small noise of MQ and it will affect the accuracy of VQA model." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-37", "text": "Then, we verify the above claim by the detailed experiments and use the VQABQ method to analyze the robustness of 6 available pretrained stateof-the-art VQA models, provided by papers' authors, (Antol et al. 2015; Lu et al. 2016; Ben-younes et al. 2017; Fukui et al. 2016; Kim et al. 2017) ." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-38", "text": "Note that the available pretrained VQA models can be categorized into two main categories, attention-based and non-attention-based VQA models." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-39", "text": "According to the results of our experiments, we discover that attention-based models not only have the higher accuracy but also the better robustness compared with nonattention-based models." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-40", "text": "In this work, our main contributions are summarized below:" }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-41", "text": "\u2022 We propose a new method to generate the basic questions of the given main question and utilize these basic questions to analyze the robustness of 6 available pretrained state-of-the-art VQA models." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-42", "text": "\u2022 Also, we propose a novel large scale Basic Question Dataset (BQD) generated by our basic question generation algorithm and demonstrate how to use it with our R score (novel robustness measure) for the robustness analysis of VQA models." 
}, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-43", "text": "\u2022 According to our experiments, we also discover that attention-based mechanism not only can help the accuracy but also the robustness of VQA models." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-44", "text": "The rest of this paper is organized as the following." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-45", "text": "We first review the related work in Section 2." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-46", "text": "In Section 3, we discuss the detailed methodology and shortly introduce the proposed basic question dataset." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-47", "text": "Finally, the experimental results are demonstrated in Section 4." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-48", "text": "----------------------------------" }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-49", "text": "**RELATED WORK**" }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-50", "text": "Recently, there are many papers (Antol et al. 2015; Shih, Singh, and Hoiem 2016; Chen et al. 2016; Kafle and Kanan 2016; Ma, Lu, and Li 2016; Ren, Kiros, and Zemel 2015; Zhu et al. 2016; Wu et al. 2016; Lu et al. 2016; Ben-younes et al. 2017; Fukui et al. 2016; Kim et al. 2017) have proposed methods to solve the challenging VQA task." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-51", "text": "Our VQABQ method involves in different areas in Machine Learning, Natural Language Processing (NLP) and Computer Vision." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-52", "text": "The following, we discuss recent works related to our approach." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-53", "text": "Sequence Modeling by Recurrent Neural Networks." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-54", "text": "Recurrent Neural Networks (RNN) can handle the sequences of flexible length." 
}, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-55", "text": "Long Short Term Memory (LSTM) (Hochreiter and Schmidhuber 1997) is a particular variant of RNN and in natural language tasks, such as machine translation (Sutskever, Vinyals, and Le 2014; , LSTM is a successful application." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-56", "text": "In (Ren, Kiros, and Zemel 2015) , they exploit RNN and Convolutional Neural Network (CNN) to build a question generation algorithm, but the generated question sometimes has invalid grammar." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-57", "text": "The input in (Malinowski, Rohrbach, and Fritz 2015; is the concatenation of each word embedding with the same feature vector of the image." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-58", "text": "encodes the input question sentence by LSTM and join the image feature to the final output." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-59", "text": "(Ma, Lu, and Li 2016) groups the neighboring word and image feature by doing convolution." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-60", "text": "In (Noh, Hongsuck Seo, and Han 2016) , the question is encoded by Gated Recurrent Unit (GRU) (Chung et al. 2014) similar to LSTM and they also introduce a dynamic parameter layer in CNN whose weights are adaptively predicted by the encoded question feature." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-61", "text": "Sentence Encoding." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-62", "text": "In order to analyze the relationship among words, phrases and sentences, several works, such as (Pennington, Socher, and Manning 2014; Kiros et al. 2015; Mikolov et al. 2013) , proposed methods about how to map text into vector space." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-63", "text": "After we have the vector representation of text, we can exploit the vector analysis skill to analyze the relationship among text." 
}, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-64", "text": "(Pennington, Socher, and Manning 2014; Mikolov et al. 2013 ) try to map words to vector space, and if the words share common contexts in the corpus, their encoded vectors will close to each other in the vector space." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-66", "text": "In this model, they exploit an RNN encoder with GRU activations (Chung et al. 2014) and an RNN decoder with a conditional GRU (Chung et al. 2014) ." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-67", "text": "Because skip-thoughts model emphasizes more on whole sentence encoding, in our work, we encode the whole question sentences into vector space by skip-thoughts model and use these skip-thought vectors to do further analysis of question sentences." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-68", "text": "Image Captioning." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-69", "text": "In some sense, VQA is related to image captioning Karpathy and Fei-Fei 2015; Vinyals et al. 2015; Fang et al. 2015) ." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-70", "text": "(Fang et al. 2015 ) used a language model to combine a set of possible words detected in several regions of the image and generate image description." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-71", "text": "In (Vinyals et al. 2015) , they used CNN to extract the high-level image features and considered them as the first input of the recurrent network to generate the caption of the image." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-72", "text": "proposed an algorithm to generate one word at a time by paying attention to local image regions related to the currently predicted word." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-73", "text": "In (Karpathy and Fei-Fei 2015) , their deep neural network can learn to embed language and visual information into a common multimodal space." 
}, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-74", "text": "However, the current image captioning algorithms only can generate the rough description of an image and there is no so called proper metric to evaluate the quality of image caption , even though BLEU (Papineni et al. 2002) can be used to evaluate the image caption." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-75", "text": "Attention-based VQA." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-76", "text": "There are several VQA models that have the ability to focus on specific image regions related to the input question by integrating the image attention mechanism (Shih, Singh, and Hoiem 2016; Chen et al. 2016; Yang et al. 2016; Li and Jia 2016) ." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-77", "text": "In (Li and Jia 2016) , in the pooling step, they exploited an image attention mechanism to help determine the relevance of original questions and updated ones." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-78", "text": "Before , no work applied language attention mechanism to VQA, but the researchers in NLP they had modeled language attention." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-79", "text": "In , they proposed a co-attention mechanism that jointly performs language attention and image attention." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-80", "text": "Because both question and image information are important in VQA, adding the co-attention mechanism in VQA models is very reasonable and intuitive." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-81", "text": "According to our experimental results, is the state-of-the-art VQA model in robustness." 
}, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-82", "text": "----------------------------------" }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-83", "text": "**MULTIPLE MODALITIES FUSION STRATEGIES IN VQA.**" }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-84", "text": "Because VQA task contains the image and text features, it is a kind of multimodal feature fusion tasks." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-85", "text": "Recently, there are some works (Kiros, Salakhutdinov, and Zemel 2014; Ben-younes et al. 2017; Fukui et al. 2016; Kim et al. 2017; Lin, RoyChowdhury, and Maji 2015) focus on modeling the interactions between two embedding spaces." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-86", "text": "(Kiros, Salakhutdinov, and Zemel 2014; Lin, RoyChowdhury, and Maji 2015) show that bilinear interactions can have great success in deep learning for multimodal language modeling and fine-grained classification." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-87", "text": "In (Fukui et al. 2016) , Multimodal Compact Bilinear (MCB) pooling uses an outer product between textural and visual embeddings." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-196", "text": "In our model, \u03bb is one of the most important factors to affect the quality of BQ." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-88", "text": "(Kim et al. 2017) , Multimodal Low-rank Bilinear (MLB) pooling, exploits a tensor to parametrize the full bilinear interactions between image and question spaces." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-89", "text": "The newest state-ofthe-art VQA work, (Ben-younes et al. 2017), can efficiently parametrize bilinear interactions between visual and textual representations and the authors prove that MCB and MLB are the special cases of (Ben-younes et al. 2017)." 
}, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-90", "text": "----------------------------------" }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-91", "text": "**METHODOLOGY**" }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-92", "text": "In Section 3, we mainly discuss how to encode questions and generate BQ and how do we exploit BQ to do robustness analysis on the 6 available state-of-the-art VQA models." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-93", "text": "The overall architecture of our VQABQ model can be referred to Figure 1 ." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-94", "text": "The model has two main parts, Module 1 and Module 2." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-95", "text": "Regarding Module 1, it will take the encoded MQ as the input and uses the encoded BQ matrix to output the BQ of query question." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-96", "text": "Then, Module 2 is a VQA model we want to analyze, and it will take the concatenation of the output of Module 1 and MQ, and the given image as input and then outputs the final answer of MQ." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-97", "text": "The detailed architecture of Module 1 also can be referred to Figure 1 ." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-98", "text": "----------------------------------" }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-99", "text": "**QUESTION DATA PREPROCESSING**" }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-100", "text": "We take the most popular VQA dataset (Antol et al. 2015) to develop our BQD." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-101", "text": "At the beginning, we take all of the training and validation questions from the VQA dataset (Antol et al. 2015) to be our basic question candidates." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-102", "text": "Then, we take all of the testing questions from the VQA dataset (Antol et al. 
2015) to be our main question candidates." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-103", "text": "That is to say, our main question candidates are exactly the same, in both content and number, as the testing question set of the VQA dataset (Antol et al. 2015)." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-104", "text": "Because we model the basic question generation problem by LASSO, we remove repeated basic question candidates and any basic question candidate that is identical to a main question candidate." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-105", "text": "This step guarantees that our LASSO formulation works well." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-106", "text": "----------------------------------" }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-107", "text": "**QUESTION ENCODING**" }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-108", "text": "There are many popular text encoders, such as Word2Vec (Mikolov et al. 2013), GloVe (Pennington, Socher, and Manning 2014) and Skip-Thoughts." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-109", "text": "Among these encoders, Skip-Thoughts can capture not only the word-to-word meaning but also the semantic meaning of the whole sentence." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-110", "text": "So, we choose Skip-Thoughts as our question encoding method." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-111", "text": "The Skip-Thoughts model uses an RNN encoder with GRU (Chung et al. 2014) activations, and we use this encoder to map an English sentence into a vector." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-112", "text": "GRU has been shown to perform as well as LSTM (Hochreiter and Schmidhuber 1997) on sequence modeling applications while being conceptually simpler, because GRU units have only 2 gates and do not need a memory cell."
}, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-113", "text": "Question Encoder. Let w 1 i , ..., w N i be the words in question s i and N is the total number of words in s i . Note that w t i denotes the t-th word for s i and x t i denotes its word embedding." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-114", "text": "The question encoder at each time step generates a hidden state h t i ." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-115", "text": "It can be considered as the representation of the sequence w 1 i , ..., w t i ." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-116", "text": "So, the hidden state h N i can represent the whole question." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-117", "text": "For convenience, here we drop the index i and iterate the following sequential equations to encode a question:" }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-118", "text": ", where U r , U z , W r , W z , U and W are the matrices of weight parameters.h t is the state update at time step t, r t is the reset gate, denotes an element-wise product and z t is the update gate." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-119", "text": "These two update gates take the values between zero and one." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-120", "text": "----------------------------------" }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-121", "text": "**PROBLEM FORMULATION**" }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-122", "text": "Our idea is the BQ generation for MQ and, at the same time, we only want the minimum number of BQ to represent the MQ, so modeling our problem as LASSO optimization problem is an appropriate way: (Antol et al. 2015) . 
\"-\" indicates the results are not available, \"-std\" means the accuracy of VQA model evaluated on the complete testing set of BQD and VQA dataset and \"-dev\" means the accuracy of VQA model evaluated on the partial testing set of BQD and VQA dataset." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-123", "text": "In addition, dif f = Original dev All \u2212 X dev All , where X is equal to \"First\", \"Second\",..., \"Seventh\"." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-124", "text": ", where A is the matrix of encoded BQ, b is the encode MQ and \u03bb is a parameter of the regularization term." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-125", "text": "----------------------------------" }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-126", "text": "**BASIC QUESTION GENERATION**" }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-127", "text": "We now describe how to generate the BQ of a query question, illustrated in Figure 1 ." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-128", "text": "According to the above subsections, Question Encoding and Problem Formulation, we can encode all basic question candidates from the training and validation question sets of VQA dataset (Antol et al. 2015) by Skip-Thought Vectors, and then we have a matrix of basic question candidates." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-129", "text": "Each column of the matrix is a vector representation, 4800 by 1 dimensions, of a specific basic question candidate and we have 186027 columns." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-130", "text": "That is, the dimension of BQ matrix, called A, is 4800 by 186027." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-131", "text": "Also, we encode the given query question as a column vector, 4800 by 1 dimensions, by Skip-Thought Vectors, called b. Regarding the selection of the parameter, \u03bb, we will discuss this in Section 4. 
Now, we can solve the LASSO optimization problem, formulated in the Problem Formulation subsection above, to get the solution x. Here, we consider the elements of the solution vector x as the similarity scores of the corresponding BQ in the BQ matrix A. The first element of x corresponds to the first column, i.e., the first BQ, of A. Then, we rank all of the similarity scores in x and pick the 21 largest weights with their corresponding BQ to be the ranked BQ of the given query question." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-132", "text": "Intuitively, if a BQ has a larger similarity score to the given MQ, then this BQ can be considered as a question more similar to the given MQ." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-133", "text": "Finally, we find the ranked BQ of all 244302 testing questions from the testing question set of the VQA dataset (Antol et al. 2015) and collect them together, with the format {Image, MQ, 21 (BQ + corresponding similarity score)}, as our Basic Question Dataset (BQD)." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-134", "text": "----------------------------------" }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-135", "text": "**BASIC QUESTION DATASET**" }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-136", "text": "We propose a novel large-scale dataset, called Basic Question Dataset (BQD). (Table caption: evaluation results on BQD and VQA dataset (Antol et al. 2015). \"-\" indicates the results are not available, \"-std\" means the accuracy of the VQA model evaluated on the complete testing set of BQD and the VQA dataset and \"-dev\" means the accuracy of the VQA model evaluated on the partial testing set of BQD and the VQA dataset.)" }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-137", "text": "In addition, diff = Original dev-All \u2212 X dev-All, where X is equal to \"First\", \"Second\", ..., \"Seventh\"." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-138", "text": "(Antol et al.
2015), and the corresponding similarity scores of BQ are generated by our basic question generation method described in Section 3. Note that we preprocess the training and validation question datasets from the VQA dataset (Antol et al. 2015), so the total number of basic question candidates is less than the total number of training and validation questions in the VQA dataset (Antol et al. 2015)." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-139", "text": "In BQD, we have 81434 images, 244302 MQ and 5130342 (BQ + corresponding similarity score)." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-140", "text": "Furthermore, we exploit the BQD to do robustness analysis of the 6 available pretrained state-of-the-art VQA models (Antol et al. 2015; Lu et al. 2016; Ben-younes et al. 2017; Fukui et al. 2016; Kim et al. 2017) in the next subsection." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-141", "text": "----------------------------------" }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-142", "text": "**ROBUSTNESS ANALYSIS BY BASIC QUESTIONS**" }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-143", "text": "To measure the robustness of any model, we should evaluate it on clean and noisy input and compare the performance." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-144", "text": "The noise can be completely random, have a specific structure, and/or be semantically relevant to the final task." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-145", "text": "For VQA, the input is an image-question pair, and therefore the noise should be introduced to both." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-146", "text": "The noise added to the question should not be random; it should have some contextual semantics for the measure to be informative."
}, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-147", "text": "For the image part, there is already a rapidly growing research on evaluating the robustness of deep learning models (Fawzi, Moosavi Dezfooli, and Frossard 2017; Table 3 : MUTAN without Attention model evaluation results on BQD and VQA dataset (Antol et al. 2015) . \"-\" indicates the results are not available, \"-std\" means the accuracy of VQA model evaluated on the complete testing set of BQD and VQA dataset and \"-dev\" means the accuracy of VQA model evaluated on the partial testing set of BQD and VQA dataset." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-148", "text": "In addition, dif f = Original dev All \u2212 X dev All , where X is equal to \"First\", \"Second\",..., \"Seventh\"." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-149", "text": "Carlini and Wagner 2017; Xu, Caramanis, and Mannor 2009) ." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-150", "text": "However, for the question part, we couldn't find any acceptable method to measure the robustness of visual question answering algorithms after extensive literature review." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-151", "text": "Here we propose a novel robustness measure for VQA by introducing semantically relevant noise to the questions where we can control the strength of noisiness." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-152", "text": "First, we measure the accuracy of the model on the clean VQA dataset (Antol et al. 2015) and we call it Acc vqa ." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-153", "text": "Then, we append the top ranked k BQs to each of the MQs in the clean dataset and recompute the accuracy of the model on this noisy input and we call it Acc bqd ." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-154", "text": "Finally, we compute the absolute difference Acc dif f = |Acc vqa \u2212 Acc bqd | and we report the robustness score R score ." 
}, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-155", "text": ",where 0 \u2264 t < m \u2264 100." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-156", "text": "The parameters t and m are the tolerance and the maximum robustness limit, respectively, i.e., the robustness score R score decreases smoothly between 1 and 0 as Acc dif f moves from t to m and remain constant out side this range." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-157", "text": "The rate of change of this transition is exponentially decreasing from exponential to sublinear in the range [t, m] ." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-158", "text": "The reasoning behind this is that we want the score to be sensitive if the difference is small, but not before t, and less sensitive if it is large, but not after m." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-159", "text": "----------------------------------" }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-160", "text": "**EXPERIMENT**" }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-161", "text": "In Section 4, we describe the details of our implementation and experimental results of the proposed method." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-162", "text": "----------------------------------" }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-163", "text": "**DATASETS**" }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-164", "text": "We conduct our experiments on BQD and VQA (Antol et al. 2015) dataset." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-165", "text": "VQA dataset is based on the MS COCO Table 4 : MUTAN with Attention model evaluation results on BQD and VQA dataset (Antol et al. 2015) ." 
}, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-166", "text": "\"-\" indicates the results are not available, \"-std\" means the accuracy of VQA model evaluated on the complete testing set of BQD and VQA dataset and \"-dev\" means the accuracy of VQA model evaluated on the partial testing set of BQD and VQA dataset." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-167", "text": "In addition, dif f = Original dev All \u2212 X dev All , where X is equal to \"First\", \"Second\",..., \"Seventh\"." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-168", "text": "dataset (Lin et al. 2014) and it contains a large number of questions." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-169", "text": "There are questions, 248349 for training, 121512 for validation and 244302 for testing." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-170", "text": "In the VQA dataset, each question is associated with 10 answers annotated by different people from Amazon Mechanical Turk (AMT)." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-171", "text": "About 98% of answers do not exceed 3 words and 90% of answers have single words." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-172", "text": "Note that we only develop our work on the open-ended case in VQA dataset because it is the most popular task and we also think the open-ended task is closer to the real situation than multiple-choice one." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-173", "text": "Setup In our LASSO model, we use \u03bb = 10 \u22126 to be our parameter and in the later subsection, we will discuss how the \u03bb affects the quality of BQ." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-174", "text": "Furthermore, although we rank all of the basic question candidates for each MQ, we only collect top 21 BQ to put into our BQD." 
}, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-175", "text": "The most important reason is that the similarity scores are too small after twentyfirst BQ." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-176", "text": "Regarding the limit of number of words of question input, for most of available pretrained state-of-the-art VQA models they are trained under the condition maximum number of words of input 26 words." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-177", "text": "Based on the above limitation, we divide each 21 ranked BQs into 7 partitions to do detailed analysis, referring Table 1 to Table 5 , because the total number of words of each MQ with 3 BQs is less than or equal to 26 words." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-178", "text": "----------------------------------" }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-179", "text": "**EVALUATION METRICS**" }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-180", "text": "VQA dataset provides multiple-choice and open-ended task for evaluation." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-181", "text": "Regarding open-ended task, the answer can be any phrase or word." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-182", "text": "However, in the multiple-choice task, an answer should be chosen from 18 candidate answers." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-183", "text": "For both cases, answers are evaluated by accuracy which can reflect human consensus." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-184", "text": "The accuracy is given by the following: Table 5 : MLB with Attention model evaluation results on BQD and VQA dataset (Antol et al. 2015) . \"-\" indicates the results are not available, \"-std\" means the accuracy of VQA model evaluated on the complete testing set of BQD and VQA dataset and \"-dev\" means the accuracy of VQA model evaluated on the partial testing set of BQD and VQA dataset." 
}, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-185", "text": "In addition, dif f = Original dev All \u2212 X dev All , where X is equal to \"First\", \"Second\",..., \"Seventh\"." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-186", "text": ", where N is the total number of examples, I[\u00b7] denotes an indicator function, a i is the predicted answer and T i is an answer set of the i th example." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-187", "text": "That is, a predicted answer is considered as a correct one if at least 3 annotators agree with it and the score depends on the total number of agreements when the predicted answer is not correct." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-188", "text": "----------------------------------" }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-189", "text": "**RESULTS AND ANALYSIS**" }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-190", "text": "We describe final results and analysis by the following: i.) Are The Rankings of BQs Effective? The answer is yes. According to the Figure 2 , we divide the top 21 ranked BQs into 7 partitions and each partition contains 3 ranked BQs and the accuracy is decreasing from the first partition to the seventh partition." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-191", "text": "Moreover, based on Figure 3 , we can discover that the difference of accuracy of VQA models is increasing from the first partition to the seventh partition." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-192", "text": "These trends imply that the similarity of BQs to the given MQ is decreasing from the first partition to the seventh partition." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-193", "text": "That is to say, the noise strength is increasing from the first partition to the seventh partition because we assume that a BQ with lower similarity score to the given MQ means the BQ is a larger noise for the given MQ." 
}, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-194", "text": "That's why we can say that the rankings of BQs are effective." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-197", "text": "Note that if a BQ can have highly enough similarity score and provide enough extra useful information to a given MQ, then we say that the BQ has the well enough quality." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-198", "text": "We discover that when \u03bb is greater than 10 \u22125 or less than 10 \u22126 , the quality of BQ is obviously not good based on the common sense knowledge." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-199", "text": "However, if we compare the quality of BQ of \u03bb equal to 10 \u22125 with \u03bb equal to 10 \u22126 , we think the BQ quality of \u03bb equal to 10 \u22126 is slightly better than \u03bb equal to 10 \u22125 based on our common sense knowledge." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-200", "text": "We will put some randomly selected BQ examples from our BQD in the supplementary material for references." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-201", "text": "Note that Figure 2 : The accuracy of state-of-the-art VQA models evaluated on BQD and VQA dataset (Antol et al. 2015) ." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-202", "text": "Note that we divide the top 21 ranked BQs into 7 partitions and each partition contains 3 ranked BQs." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-203", "text": "Here, \"First top 3\" means the first partition, \"Second top 3\" means the second partition,..., and \"Seventh top 3\" means the seventh partition." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-204", "text": "Model LQI HAV HAR MU MUA MLB R score 0.19 0.48 0.45 0.30 0.34 0.36 we use the state-of-the-art question encoder, skip-thoughts , to encode all questions." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-205", "text": "iii.) Who Is The Robustest VQA Model? 
According to Table 6, there are two categories of VQA models, attention-based and non-attention-based." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-206", "text": "We can see that attention-based VQA models are more robust than non-attention-based VQA models." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-207", "text": "Furthermore, the only difference between MU and MUA is the attention mechanism." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-208", "text": "Based on the above observation, we can say that the attention mechanism can help robustness." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-209", "text": "Finally, we can also see that HieCoAtt is the most robust VQA model." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-210", "text": "In addition, when we compare HAV with HAR, the only difference between them is the neural network used to extract image features." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-211", "text": "Although using ResNet-200 instead of VGG-16 can help the accuracy of the HieCoAtt VQA model, it reduces the robustness of the model." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-212", "text": "We conjecture that if a convolutional neural network is too deep, it can probably harm the robustness of VQA models." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-213", "text": "In the next subsection, we will do further analysis on the most robust HieCoAtt VQA model." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-214", "text": "iv.) Can Basic Questions Directly Help Accuracy? According to Table 6, we know that HieCoAtt is the most robust VQA model." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-215", "text": "We want to do more advanced analysis on this model, so we claim that if the quality of [Figure 3: The accuracy decrement of state-of-the-art VQA models evaluated on BQD and the VQA dataset (Antol et al. 2015).]"
}, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-216", "text": "Note that we divide the top 21 ranked BQs into 7 partitions and each partition contains 3 ranked BQs." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-217", "text": "Here, \"First top 3\" means the first partition, \"Second top 3\" means the second partition,..., and \"Seventh top 3\" means the seventh partition." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-218", "text": "score1 score2/score1 score3/score2 avg 0.33 0.61 0.73 std 0.20 0.27 0.21 Table 7 : \"avg\" denotes average and \"std\" denotes standard deviation." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-219", "text": "BQs are well enough, then even we use the naive concatenation, i.e. direct concatenation, method to concatenate MQ and BQs, it still can directly help the accuracy of VQA models." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-220", "text": "For justifying our claim, we propose a thresholding criterion, referring to Table 8 , to select the BQ with good quality." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-221", "text": "v.) Basic Question Concatenation." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-222", "text": "In this subsection, we propose a thresholding criterion to select the BQ with good quality." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-223", "text": "In BQD, each MQ has 21 corresponding BQ with scores." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-224", "text": "We can have the following format, {M Q, (BQ1, score1), ..., (BQ21, score21)}, and these scores are all between 0 and 1 with the following order:" }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-225", "text": "and we define 3 thresholds, s1, s2 and s3." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-226", "text": "Note that, for convenience, we only take the top 3 BQs to do selection because our BQs are ranked." 
}, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-227", "text": "Moreover, we compute the following 3 averages (avg) and 3 standard deviations (std) to score1, score2/score1 and score3/score2, respectively, and then use avg \u00b1 std, referring to Table 7 , to be the initial estimation of proper thresholds." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-228", "text": "The BQ utilization process can be explained as the Table 8 ." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-229", "text": "Does BQ can directly help accuracy?" }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-230", "text": "The answer probably is yes. In our experiment, we use the avg \u00b1 std, referring to Table 7 , to be the initial guess of proper thresholds of s1, s2 and s3." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-231", "text": "We discover that when s1 = 0.60, s2 = 0.58 Algorithm 1 Basic Question Concatenation Note that s1, s2, s3 are thresholds we can choose." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-232", "text": "1: if score1 >s1 2:" }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-233", "text": "Append BQ1 with the largest score 3:" }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-234", "text": "if score2/score1 >s2 4:" }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-235", "text": "Append BQ2 with the second large score 5:" }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-236", "text": "if score3/score2 >s3 6:" }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-237", "text": "Append BQ3 with the third large score Table 9 : The number of BQs are appended." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-238", "text": "Here, \"X BQ\" means X BQs are appended by MQ, where X = 0, 1, 2, 3, and \"# Q\" denoted the number of questions. and s3 = 0.41, we can get the BQs, referring to Table 9 , who can directly help accuracy with the naive concatenation method." 
}, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-239", "text": "Accordingly to the Table 9 , 96.84% of testing questions from VQA dataset cannot find the proper basic questions to help the accuracy by naive concatenation method." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-240", "text": "Although we only have 3.16% of testing questions can benefit from the basic questions, our method still can improve the state-of-the-art accuracy from 60.32% to 60.34%, referring to supplementary material." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-241", "text": "Then, we have 244302 testing questions, so that means the number of correctly answering questions of our method is more than state-of-the-art method 49 questions." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-242", "text": "In other words, if we have well enough basic question dataset, we can increase accuracy more, especially in the counting-type question, referring to supplementary material." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-243", "text": "We conjecture that because the Co-Attention Mechanism is good at localizing, the counting-type question is improved more than others." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-244", "text": "Accordingly, based on our experiment, we believe that basic questions with well enough quality can directly help accuracy by only using the naive concatenation method." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-245", "text": "----------------------------------" }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-246", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-247", "text": "In this paper, we propose a novel VQABQ method and Basic Question Dataset (BQD) for robustness analysis of VQA models." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-248", "text": "The VQABQ method has two main modules, Basic Question Generation Module and VQA Module." 
}, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-249", "text": "The former one can generate the basic questions for the query question, and the latter one can take an image, basic and query questions as the input and then output the text-based answer of the query question about the given image." }, { "sent_id": "c8cf2d615cc47395a55bc8737cd9fd-C001-250", "text": "Furthermore, we can use the proposed BQD, R score and VQA dataset (Antol et al. 2015) to measure the robustness of VQA models." } ], "y": { "@MOT@": { "gold_contexts": [ [ "c8cf2d615cc47395a55bc8737cd9fd-C001-14" ], [ "c8cf2d615cc47395a55bc8737cd9fd-C001-172" ] ], "cite_sentences": [ "c8cf2d615cc47395a55bc8737cd9fd-C001-14", "c8cf2d615cc47395a55bc8737cd9fd-C001-172" ] }, "@USE@": { "gold_contexts": [ [ "c8cf2d615cc47395a55bc8737cd9fd-C001-33" ], [ "c8cf2d615cc47395a55bc8737cd9fd-C001-37" ], [ "c8cf2d615cc47395a55bc8737cd9fd-C001-100" ], [ "c8cf2d615cc47395a55bc8737cd9fd-C001-101" ], [ "c8cf2d615cc47395a55bc8737cd9fd-C001-102" ], [ "c8cf2d615cc47395a55bc8737cd9fd-C001-103" ], [ "c8cf2d615cc47395a55bc8737cd9fd-C001-128" ], [ "c8cf2d615cc47395a55bc8737cd9fd-C001-133" ], [ "c8cf2d615cc47395a55bc8737cd9fd-C001-136" ], [ "c8cf2d615cc47395a55bc8737cd9fd-C001-138" ], [ "c8cf2d615cc47395a55bc8737cd9fd-C001-140" ], [ "c8cf2d615cc47395a55bc8737cd9fd-C001-147" ], [ "c8cf2d615cc47395a55bc8737cd9fd-C001-152" ], [ "c8cf2d615cc47395a55bc8737cd9fd-C001-164" ], [ "c8cf2d615cc47395a55bc8737cd9fd-C001-165" ], [ "c8cf2d615cc47395a55bc8737cd9fd-C001-166" ], [ "c8cf2d615cc47395a55bc8737cd9fd-C001-172" ], [ "c8cf2d615cc47395a55bc8737cd9fd-C001-180" ], [ "c8cf2d615cc47395a55bc8737cd9fd-C001-184" ], [ "c8cf2d615cc47395a55bc8737cd9fd-C001-201" ], [ "c8cf2d615cc47395a55bc8737cd9fd-C001-215" ], [ "c8cf2d615cc47395a55bc8737cd9fd-C001-250" ] ], "cite_sentences": [ "c8cf2d615cc47395a55bc8737cd9fd-C001-33", "c8cf2d615cc47395a55bc8737cd9fd-C001-37", "c8cf2d615cc47395a55bc8737cd9fd-C001-100", "c8cf2d615cc47395a55bc8737cd9fd-C001-101", 
"c8cf2d615cc47395a55bc8737cd9fd-C001-102", "c8cf2d615cc47395a55bc8737cd9fd-C001-103", "c8cf2d615cc47395a55bc8737cd9fd-C001-128", "c8cf2d615cc47395a55bc8737cd9fd-C001-133", "c8cf2d615cc47395a55bc8737cd9fd-C001-136", "c8cf2d615cc47395a55bc8737cd9fd-C001-138", "c8cf2d615cc47395a55bc8737cd9fd-C001-140", "c8cf2d615cc47395a55bc8737cd9fd-C001-147", "c8cf2d615cc47395a55bc8737cd9fd-C001-152", "c8cf2d615cc47395a55bc8737cd9fd-C001-164", "c8cf2d615cc47395a55bc8737cd9fd-C001-165", "c8cf2d615cc47395a55bc8737cd9fd-C001-166", "c8cf2d615cc47395a55bc8737cd9fd-C001-172", "c8cf2d615cc47395a55bc8737cd9fd-C001-180", "c8cf2d615cc47395a55bc8737cd9fd-C001-184", "c8cf2d615cc47395a55bc8737cd9fd-C001-201", "c8cf2d615cc47395a55bc8737cd9fd-C001-215", "c8cf2d615cc47395a55bc8737cd9fd-C001-250" ] }, "@BACK@": { "gold_contexts": [ [ "c8cf2d615cc47395a55bc8737cd9fd-C001-50" ], [ "c8cf2d615cc47395a55bc8737cd9fd-C001-122" ], [ "c8cf2d615cc47395a55bc8737cd9fd-C001-165" ], [ "c8cf2d615cc47395a55bc8737cd9fd-C001-170" ] ], "cite_sentences": [ "c8cf2d615cc47395a55bc8737cd9fd-C001-50", "c8cf2d615cc47395a55bc8737cd9fd-C001-122", "c8cf2d615cc47395a55bc8737cd9fd-C001-165", "c8cf2d615cc47395a55bc8737cd9fd-C001-170" ] }, "@SIM@": { "gold_contexts": [ [ "c8cf2d615cc47395a55bc8737cd9fd-C001-103" ] ], "cite_sentences": [ "c8cf2d615cc47395a55bc8737cd9fd-C001-103" ] }, "@UNSURE@": { "gold_contexts": [ [ "c8cf2d615cc47395a55bc8737cd9fd-C001-239" ] ], "cite_sentences": [ "c8cf2d615cc47395a55bc8737cd9fd-C001-239" ] } } }, "ABC_40d73d5fc22686c13a14946946dd18_9": { "x": [ { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-2", "text": "This paper presents DAMESRL 1 , a flexible and open source framework for deep semantic role labeling (SRL)." }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-3", "text": "DAMESRL aims to facilitate easy exploration of model structures for multiple languages with different characteristics." 
}, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-4", "text": "It provides flexibility in its model construction in terms of word representation, sequence representation, output modeling, and inference styles and comes with clear output visualization." }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-5", "text": "Additionally, it handles various input and output formats and comes with clear output visualization." }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-6", "text": "The framework is available under the Apache 2.0 license." }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-7", "text": "----------------------------------" }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-9", "text": "During the first decade of the 21 st century, mapping from the syntactic analysis of a sentence to its semantic representation has received a central interest in the natural language processing (NLP) community." }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-10", "text": "Semantic role labeling, which is a sentence-level semantic task aimed at identifying \"Who did What to Whom, and How, When and Where?\" (Palmer et al., 2010) , has strengthened this focus." }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-11", "text": "Recently, several neural mechanisms have been used to train end-to-end SRL models that do not require task-specific feature engineering as the traditional SRL models do." }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-12", "text": "Zhou and Xu (2015) introduced the first deep end-to-end model for SRL using a stacked Bi-LSTM network with a conditional random field (CRF) as the top layer." }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-13", "text": "He et al. (2017) simplified their architecture using a highway Bi-LSTM network." }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-14", "text": "More recently, Tan et al. 
(2018) replaced the common recurrent architecture with a self-attention network, directly capturing relationships between tokens regardless of their distance, resulting in better results and faster training." }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-15", "text": "The work in deep end-to-end SRL has focused heavily on applying deep learning advances without considering the multilingual aspect." }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-16", "text": "However, language-specific characteristics and the available amount of training data highly influence the optimal model structure." }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-17", "text": "DAMESRL facilitates exploration and fair evaluation of new SRL models for different languages by providing flexible neural model construction on different modeling levels, the handling of various input and output formats, and clear output visualization." }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-18", "text": "Beyond the existing state-of-the-art models (Zhou and Xu, 2015; He et al., 2017; Tan et al., 2018 ), we exploit character-level modeling, beneficial when considering multiple languages." }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-19", "text": "To demonstrate the merits of easy cross-lingual exploration and evaluation of model structures for SRL provided by DAMESRL, we report the performance of several distinct models integrated into our framework for English, German and Arabic, as these languages have very different linguistic characteristics." }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-20", "text": "by w_p." }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-21", "text": "Here, words outside argument spans have the tag O, and words at the beginning and inside of argument spans with role r have the tags B_r and I_r, respectively."
}, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-22", "text": "For example, the sentence \"the cat chases the dog .\" should be annotated as \"the B\u2212A0 cat I\u2212A0 chases B\u2212V the B\u2212A1 dog I\u2212A1 . O \"." }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-23", "text": "DAMESRL's architecture (see Fig. 1 ) facilitates the construction of models that prioritize certain language-dependent linguistic properties, such as the importance of word order and inflection, or that adapt to the amount of available training data." }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-24", "text": "The framework, implemented in Python 3.5 using TensorFlow, can be used to train new models, or make predictions with the provided pre-trained models." }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-25", "text": "----------------------------------" }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-26", "text": "**INPUT AND OUTPUT**" }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-27", "text": "The input/output format of DAMESRL is a shortened version of the CoNLL'05 format, which only contains the Words, Targets and (possibly) Props columns 2 ." }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-28", "text": "DAMESRL also provides an HTML format that can be directly visualized in the web browser (as in Fig. 2 )." }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-29", "text": "----------------------------------" }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-30", "text": "**MODEL CONSTRUCTION MODULES**" }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-31", "text": "As can be seen in Fig. 1 , the framework divides model construction in four phases: (I) word representation, (II) sentence representation, (III) output modeling, and (IV) inference." 
}, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-32", "text": "Phase I: The word representation of a word w i consist of three optional concatenated components: a word-embedding, a Boolean indicating if w i is the predicate of the semantic frame (w p ), and a character representation." }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-33", "text": "DAMESRL provides a Bi-LSTM network to learn character-level word representations helping for languages where important SRL cues are given through inflections, such as case markings in German and Arabic." }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-34", "text": "Despite the foreseen importance, character-level embeddings have not been used in previous work (Zhou and Xu, 2015; He et al., 2017; Tan et al., 2018) ." }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-35", "text": "Phase II: As core sequence representation component, users can choose between a self-attention encoding (Tan et al., 2018) , a regular Bi-LSTM (Hochreiter and Schmidhuber, 1997) or a highway Bi-LSTM (Zhang et al., 2016; He et al., 2017) ." }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-36", "text": "Phase III: To compute model probabilities, users can choose a regular softmax, or a linear chain CRF as proposed by (Zhou and Xu, 2015) , which can be useful for languages where word order is an important SRL cue, such as English, or when less training data is available (shown in Section 4)." }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-37", "text": "Phase IV: The inference phase provides two options for label inference from the computed model probabilities including greedy prediction and Viterbi decoding." 
}, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-38", "text": "----------------------------------" }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-39", "text": "**EXPERIMENTS 4.1 SETTINGS**" }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-40", "text": "To evaluate our framework, and show the benefits of choosing certain model components, we construct five models: HLstm, Char, CRFm, Att, and CharAtt, whose configurations are shown in Tab." }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-41", "text": "1." }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-42", "text": "The selected models are evaluated in three languages: English, German and Arabic (see Tab. 2) using the standard CoNLL'05 metrics." }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-43", "text": "Information about the used SRL data is shown in Tab." }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-44", "text": "2. We initialize the weights of all sub-layers as random orthogonal matrices." }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-45", "text": "The learning rate is fixed in the first N 1 training epochs, and halved after every next N 2 epochs." }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-46", "text": "Detailed settings and the word embeddings used to initialize the word representation layer used per language are found in Tab." }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-47", "text": "3. Table 6 : Results on CoNLL'05 English data: precision (P), recall (R), and F1." }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-48", "text": "We compare our results with other state-of-the-art deep single models." 
}, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-49", "text": "----------------------------------" }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-50", "text": "**MODEL**" }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-51", "text": "Development Out-Of-Domain Evaluation P R F1 P R F1 P R F1 Lstm + CRF (Zhou and Xu, 2015) 79.7 79.4 79.6 70.7 68.2 69.4 82.9 82.8 82.8 HLstm (He et al., 2017) 81.6 81.6 81.6 72.9 71.4 72.1 83.1 83.0 83.1 Att (Tan et al., 2018) 82.6 83." }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-52", "text": "----------------------------------" }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-53", "text": "**RESULTS AND DISCUSSION**" }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-54", "text": "In Tab. 5-6, we compare the five models on English, German and Arabic." }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-55", "text": "The proposed CharAtt outperforms all other models in almost all cases except the English out-of-domain setting." }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-56", "text": "As can be seen in Tab." }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-57", "text": "6, our implementation achieves competitive performance to other state-of-the-art systems for English." }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-58", "text": "To the best of our knowledge, we report the first SRL results (in CoNLL'05 metrics) for German and Arabic without using linguistic features." }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-59", "text": "In general, we find that using character embeddings improves the performance of HLstm and Att, although at a cost of increased processing time." }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-60", "text": "Interestingly, using character embeddings is particularly effective for the Att model." 
}, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-61", "text": "One explanation could be that character embeddings are important for learning good attention masks as they encode information about the syntax of words and the sentence, e.g., it facilitates the system in learning that the number (singular/plural) of a subject and its verb should match." }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-62", "text": "Among the three languages, the performance gain by character-level representations is larger for German and Arabic than for English." }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-63", "text": "This can be explained by the much larger vocabularies for German and Arabic combined with the smaller training datasets (#sentences, and #predicates) for these languages." }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-64", "text": "Moreover, many grammatical cases, which are very strong predictors for semantic roles, are explicitly marked through use of inflection in German and Arabic." }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-65", "text": "To evaluate the influence of the training size on model performance, we train the models on a random sample of 2000 sentences for each language (see Tab." }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-66", "text": "7)." }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-67", "text": "Intriguingly, the attention models perform worst in this setting, indicating their need of large datasets." }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-68", "text": "A reason for this could be that the attention models consider the sequential dependency between hidden states to a lesser degree than recurrent models do." }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-69", "text": "In contrast, CRFm achieves the best results for English and Arabic, and the second best result for German." 
}, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-70", "text": "In fact, CRFm exploits not only the input sequence -using the LSTM -but also the sequential output dependencies, to compute output probabilities." }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-71", "text": "We can see that this is very beneficial when less training data is available, especially when word order is a strong cue for SRL, which applies well for a strict word order language like English." }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-72", "text": "For such cases the output dependencies can be learned even from less training data, which results in the CRFm model to excel." }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-73", "text": "As can be seen in Tab." }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-74", "text": "7, when comparing Char with HLstm-ours and CharAtt with Att-ours, the benefit of using character embeddings is demonstrated on small datasets as well." }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-75", "text": "----------------------------------" }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-76", "text": "**CONCLUSIONS**" }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-77", "text": "We introduced an open source SRL framework, DAMESRL, which provides flexible model construction, using state-of-the-art model components, handles various input and output formats, and which comes with clear output visualization." }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-78", "text": "Using our framework, we slightly improve the state-of-the-art results of single end-to-end deep systems on the English CoNLL'05, and report the first experimental end-to-end deep SRL results for German 5 and Arabic 5 ." 
}, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-79", "text": "We have shown that the flexible model construction provided by the framework is crucial for exploring good model structures when considering different languages with different characteristics, especially when training data is limited." }, { "sent_id": "40d73d5fc22686c13a14946946dd18-C001-80", "text": "DAMESRL is made available under the Apache 2.0 license." } ], "y": { "@BACK@": { "gold_contexts": [ [ "40d73d5fc22686c13a14946946dd18-C001-14" ], [ "40d73d5fc22686c13a14946946dd18-C001-18" ], [ "40d73d5fc22686c13a14946946dd18-C001-34" ], [ "40d73d5fc22686c13a14946946dd18-C001-35" ] ], "cite_sentences": [ "40d73d5fc22686c13a14946946dd18-C001-14", "40d73d5fc22686c13a14946946dd18-C001-18", "40d73d5fc22686c13a14946946dd18-C001-34", "40d73d5fc22686c13a14946946dd18-C001-35" ] } } }, "ABC_28eeecadd8d3348de6daec3c801ae4_9": { "x": [ { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-2", "text": "Style transfer is the task of transferring an attribute of a sentence (e.g., formality) while maintaining its semantic content." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-3", "text": "The key challenge in style transfer is to strike a balance between the competing goals, one to preserve meaning and the other to improve the style transfer accuracy." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-4", "text": "Prior research has identified that the task of meaning preservation is generally harder to attain and evaluate." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-5", "text": "This paper proposes two extensions of the state-ofthe-art style transfer models aiming at improving the meaning preservation in style transfer." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-6", "text": "Our evaluation shows that these extensions help to ground meaning better while improving the transfer accuracy." 
}, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-7", "text": "----------------------------------" }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-9", "text": "Consider the following two comments about a movie: (1) I entered the theater in the bloom of youth and emerged with a family of field mice living in my long, white mustache; 1 and (2) The movie was very long." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-10", "text": "Although the meaning of the two sentences is similar, their styles are very different." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-11", "text": "Style transfer is the task of transferring the attributes of a sentence (e.g., 'sarcastic' and 'not-sarcastic') without changing its content." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-12", "text": "It is important for dialog systems such as personalized agents, customer service agents and smart home assistants to generate responses that are fluent and fit the social setting." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-13", "text": "Advances in text generation has motivated recent work on style transfer with non-parallel corpora (Shen et al., 2017; Hu et al., 2017; Fu et al., 2018; Li et al., 2018) ." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-14", "text": "Shen et al. (2017) propose a novel method which leverages the refined alignment of latent representations to perform style transfer." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-15", "text": "The paper introduces cross-aligned autoencoder with discriminators." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-16", "text": "Hu et al. (2017) 1 From https://www.thestranger.com/ movies/1210980/sex-and-the-city-2 learn a disentangled latent representation and use a code to generate a sentence." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-17", "text": "Fu et al. 
(2018) explore two models for style transfer which use multiple decoders or style embeddings to augment the encoded representations." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-18", "text": "Prabhumoye et al. (2018) propose to transfer style through back-translation." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-19", "text": "The latter method is simpler to train and it attains the state-of-the-art performance in style transfer accuracy, confirming the efficacy of back-translation in grounding meaning." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-20", "text": "The goal of the current study is to investigate alternative back-translation setups that attain a better balance between meaning preservation and style transfer." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-21", "text": "We introduce two approaches which extend the back-translation models proposed by Prabhumoye et al. (2018) , exploring back-translation setups that preserve the content of the sentence better." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-22", "text": "The first approach explores multilingual pivoting, hypothesizing that transfer through several languages will help ground meaning better than transfer through one language." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-23", "text": "We follow Johnson et al.'s (2017) setup." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-24", "text": "The second approach investigates including a term in the loss function that corresponds to preserving the semantic content of the sentence: we add a feedback loss to the generative models." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-25", "text": "We evaluate our models along three dimensions: style transfer accuracy, fluency and preservation of meaning."
}, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-26", "text": "We compare the results with the cross-aligned auto-encoder (Shen et al., 2017) and the back-translation model with one pivot language (Prabhumoye et al., 2018) ." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-27", "text": "We find that both extensions improve the accuracy of style transfer without reduction in preservation of meaning." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-28", "text": "----------------------------------" }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-29", "text": "**GROUNDING MEANING IN BACK-TRANSLATION**" }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-30", "text": "While the previous work (Prabhumoye et al., 2018) focuses on creating a representation by translating to a pivot language, preserving meaning in the generated sentences is still an unsolved question." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-31", "text": "In this work, we try to tackle this question by extending their model in two directions:" }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-32", "text": "(1) To improve the latent representation such that it grounds the meaning better and (2) Providing the generative models with a feedback which represents how good the generator performs in preserving the meaning." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-33", "text": "Both the extensions are marked in Figure 1 ." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-34", "text": "Notation." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-35", "text": "Given a dataset X in which each instance is labeled with a style s 1 or s 2 , the goal of style transfer is to generate sentences of the target style without changing the meaning of the original sentence." 
}, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-36", "text": "Let the set of sentences in X which belong to s 1 be X 1 = {x" }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-37", "text": "1 } and the sentences which belong to s 2 be X 2 = {x" }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-38", "text": "We denote the sentences of X 1 transferred to style s 2 asX 12 = {x (1) 12 , . . . ,x (n) 12 } and the sentences of X 2 transferred to style s 1 b\u0177" }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-39", "text": "Style transfer through Back-translation." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-40", "text": "Prabhumoye et al. (2018) introduces the technique of back-translation to perform style transfer." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-41", "text": "They first transfer a sentence to one pivot language and use the encoding of the sentence in the pivot language to train the generative models corresponding to the two styles." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-42", "text": "They also use feedback from a pre-trained classifier to guide the generators to generate the desired style." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-43", "text": "The objective function of the generative models is:" }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-44", "text": "This model is denoted as Back-translated Style Transfer (BST) in the future." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-45", "text": "Grounding meaning with multilingual backtranslation." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-46", "text": "Johnson et al. (2017) showed that multi-lingual neural machine translation systems using one-to-many and many-to-one frameworks can perform zero-shot learning." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-47", "text": "We want to leverage this approach to ground meaning in style transfer using multiple pivot languages." 
}, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-48", "text": "We have trained a one to many translation system (Johnson et al., 2017) where we have a encoder-decoder network for one source language and two target languages." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-49", "text": "We also train a many to one translation system (Johnson et al., 2017) where we have a encoder-decoder network for two source languages and one target language." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-50", "text": "We use these translation systems for training the style specific decoders following the procedure in (Prabhumoye et al., 2018) ." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-51", "text": "Specifically, a sentence is first translated from English to two pivot languages." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-52", "text": "We create the latent representation of the sentence by encoding the sentence in both pivot languages using the many to one translation system." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-53", "text": "where, Encoder mo is the encoder of the many to one translate system, x l 1 is the sentence in pivot language 1 and x l 2 is the sentence in pivot language 2." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-54", "text": "The final representation is given by elementwise average of the two representations." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-55", "text": "This model is denoted as Multi-lingual Back-translated Style Transfer (MBST) in the future." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-56", "text": "Grounding meaning with feedback." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-57", "text": "This approach adds a loss function to the generative models which guides them to generate sentences that are closer to the original sentence and hence to preserve meaning better." 
}, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-58", "text": "The models trained by the back-translation approach are fine tuned to include the feedback loss function." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-59", "text": "To provide this feedback to the generators, we first generate th\u00ea X 12 andX 21 using the generative models corresponding to style s 2 and s 1 respectively." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-60", "text": "Dat\u00e2 X 12 is now representative of style s 2 and dat\u00e2 X 21 is representative of style s 1 ." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-61", "text": "Hence, now we have train data that is transferred from style s 1 to s 2 and transferred from style s 2 to s 1 ." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-62", "text": "We use this data to fine-tune the models." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-63", "text": "We use the same back-translation procedure as described in Section 2 to first translate this data to the two pivot languages and then create a latent representation of the data." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-64", "text": "While fine-tuning the generative model for style s 1 , we transfer dataX 12 to style s 1 . Let the data generated in this process be denoted byX 121 ." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-65", "text": "Our loss function compares the generated dataX 121 with the original s 1 data X 1 ." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-66", "text": "Similarly, while fine-tuning the generative model for style s 2 , we transfer dataX 21 to style s 2 . Let the data generated in this process be denoted b\u0177 X 212 ." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-67", "text": "Our loss function compares the generated dataX 212 with the original s 2 data X 2 . Let \u03b8 gen denote the parameters of the generators." 
}, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-68", "text": "The generative loss L gen is then given by:" }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-69", "text": "where L recon is the reconstruction loss, L class is the binary cross entropy loss, L f eed is the reconstruction loss for feedback and \u03bb c , \u03bbf are the balancing parameters." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-70", "text": "This model is denoted as Multi-lingual Back-translated Style Transfer + Feedback (MBST+F) in the future." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-71", "text": "----------------------------------" }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-72", "text": "**EXPERIMENTS**" }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-73", "text": "----------------------------------" }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-74", "text": "**STYLE TRANSFER TASKS**" }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-75", "text": "We use three tasks described in (Prabhumoye et al., 2018) to evaluate our models." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-76", "text": "The three tasks correspond to: (1) gender transfer: we transfer the style of writing reviews of Yelp from male and female authors (Reddy and Knight, 2016) ." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-77", "text": "(2) political slant transfer: we transfer the style of addressing comments to the two political parties namely democratic and republican (Voigt et al., 2018) and (3) sentiment modification: here we focus on only two sentiments -positive and negative." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-78", "text": "The goal is to modify the sentiment of the sentence while preserving the content." 
}, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-79", "text": "----------------------------------" }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-80", "text": "**BASELINES**" }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-81", "text": "Our baseline model is a Cross-aligned AutoEncoder (CAE) from (Shen et al., 2017) ." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-82", "text": "We use the off-the-shelf trained model for sentiment modification task and we separately train this model for the gender and political slant tasks." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-83", "text": "We also compare our results with the back-translation model using only one pivot language and with no feedback loss (BST model)." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-84", "text": "----------------------------------" }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-85", "text": "**EVALUATION TASKS**" }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-86", "text": "Style Transfer Accuracy." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-87", "text": "We measure the accuracy of style transfer as described in (Shen et al., 2017) ." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-88", "text": "We have reproduced the classifiers described in (Prabhumoye et al., 2018) ." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-89", "text": "The classifier has an accuracy of 82% for the gender- annotated corpus, 92% accuracy for the political slant dataset and 93.23% accuracy for the sentiment dataset." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-90", "text": "We use these classifiers to test the generated sentences for the desired style." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-91", "text": "Perplexity." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-92", "text": "To measure the fluency of the generative models automatically, we use perplexity measure." 
}, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-93", "text": "We create separate language models for each of the three tasks using only the training data." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-94", "text": "We use only ngrams up to an order of 3 to create the language model 2 ." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-95", "text": "Meaning Preservation." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-96", "text": "We follow the procedure described in (Bennett, 2005) to perform A/B testing." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-97", "text": "We reuse the instructions provided by (Prabhumoye et al., 2018) for the three tasks. But unlike (Prabhumoye et al., 2018), we perform our evaluation on Amazon Mechanical Turk." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-98", "text": "We annotate 200 samples per task and ask 3 unique workers to annotate each sample." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-99", "text": "We take the majority vote as the final label." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-100", "text": "The results in (Prabhumoye et al., 2018) were reproduced for comparing the CAE model with the BST model." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-101", "text": "As reported by them, the BST model performs better in preservation of meaning for the tasks of gender and political slant transfer." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-102", "text": "We present the results for the comparison between BST and MBST models; and the MBST and the MBST+F models." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-103", "text": "Fluency." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-104", "text": "We asked human annotators on Mechanical Turk to measure the fluency of the generated sentences on a scale of 1 to 4. 1 is unreadable and 4 is perfectly readable." 
}, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-105", "text": "We annotate 120 samples for each model and each sample is annotated by three unique workers." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-106", "text": "The 120 samples of each model has an equal distribution of samples from the three tasks." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-107", "text": "----------------------------------" }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-108", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-109", "text": "We used data from Workshop in Statistical Machine Translation 2015 (WMT15) (Bojar et al., 2015) and sequence-sequence framework (Sutskever et al., 2014; Bahdanau et al., 2015; Karpathy and Fei-Fei, 2015) to train our translation models." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-110", "text": "Table 1 shows the style transfer accuracy results in percentages for the generated test sentences." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-111", "text": "We can see the accuracy is boosted for all three tasks by the two extensions." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-112", "text": "The MBST+F model performs the best in all three tasks." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-113", "text": "Table 2 shows the perplexity of the generative models for each of the three tasks." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-114", "text": "We observe that both MBST and MBST+F models are better than the NMT and CAE models but there is no significant difference between the two models." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-115", "text": "Table 3 shows the results for human evaluation of the models MBST and MBST+F for preservation of meaning." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-116", "text": "Perhaps confusingly, these results show no clear preference between the models." 
}, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-117", "text": "This is a positive result as it means that these extensions do not degrade the meaning, in spite of them improving the style transfer accuracy." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-118", "text": "Although we observe that MBST may be slightly preferred over MBST+F. Table 3 : Human preference for meaning preservation % the four models for the three tasks." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-119", "text": "We averaged the scores over the 120 samples and 3 annotators per sample." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-120", "text": "MBST+F performs better than the other models in 2 out of 3 tasks and MBST performs the best in one task -political slant." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-121", "text": "The over-all averaged scores for the two models MBST and MBST+F is the same 3.08, whereas it is much lower 2.79 for BST and 2.57 for CAE." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-122", "text": "Table 4 : Fluency in generated sentences." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-123", "text": "----------------------------------" }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-124", "text": "**MODEL**" }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-125", "text": "----------------------------------" }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-126", "text": "**CONCLUSION**" }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-127", "text": "We have presented in this paper two extensions of the back-translation model." }, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-128", "text": "The first extension focused on creating latent representation which is better grounded in meaning and the second extension targeted to provide a feedback to the generator which guides it to produce sentences similar to the original sentence." 
}, { "sent_id": "28eeecadd8d3348de6daec3c801ae4-C001-129", "text": "Both the extensions allow us to boost the style transfer accuracy for all the three tasks considerably, while still preserving the meaning." } ], "y": { "@BACK@": { "gold_contexts": [ [ "28eeecadd8d3348de6daec3c801ae4-C001-18" ], [ "28eeecadd8d3348de6daec3c801ae4-C001-40" ], [ "28eeecadd8d3348de6daec3c801ae4-C001-41" ], [ "28eeecadd8d3348de6daec3c801ae4-C001-42" ], [ "28eeecadd8d3348de6daec3c801ae4-C001-44" ], [ "28eeecadd8d3348de6daec3c801ae4-C001-101" ] ], "cite_sentences": [ "28eeecadd8d3348de6daec3c801ae4-C001-18", "28eeecadd8d3348de6daec3c801ae4-C001-40", "28eeecadd8d3348de6daec3c801ae4-C001-41", "28eeecadd8d3348de6daec3c801ae4-C001-42", "28eeecadd8d3348de6daec3c801ae4-C001-44", "28eeecadd8d3348de6daec3c801ae4-C001-101" ] }, "@EXT@": { "gold_contexts": [ [ "28eeecadd8d3348de6daec3c801ae4-C001-21" ] ], "cite_sentences": [ "28eeecadd8d3348de6daec3c801ae4-C001-21" ] }, "@UNSURE@": { "gold_contexts": [ [ "28eeecadd8d3348de6daec3c801ae4-C001-26" ], [ "28eeecadd8d3348de6daec3c801ae4-C001-83" ], [ "28eeecadd8d3348de6daec3c801ae4-C001-100" ], [ "28eeecadd8d3348de6daec3c801ae4-C001-102" ] ], "cite_sentences": [ "28eeecadd8d3348de6daec3c801ae4-C001-26", "28eeecadd8d3348de6daec3c801ae4-C001-83", "28eeecadd8d3348de6daec3c801ae4-C001-100", "28eeecadd8d3348de6daec3c801ae4-C001-102" ] }, "@MOT@": { "gold_contexts": [ [ "28eeecadd8d3348de6daec3c801ae4-C001-30" ] ], "cite_sentences": [ "28eeecadd8d3348de6daec3c801ae4-C001-30" ] }, "@USE@": { "gold_contexts": [ [ "28eeecadd8d3348de6daec3c801ae4-C001-50" ], [ "28eeecadd8d3348de6daec3c801ae4-C001-75" ], [ "28eeecadd8d3348de6daec3c801ae4-C001-88" ], [ "28eeecadd8d3348de6daec3c801ae4-C001-97" ] ], "cite_sentences": [ "28eeecadd8d3348de6daec3c801ae4-C001-50", "28eeecadd8d3348de6daec3c801ae4-C001-75", "28eeecadd8d3348de6daec3c801ae4-C001-88", "28eeecadd8d3348de6daec3c801ae4-C001-97" ] }, "@DIF@": { "gold_contexts": [ [ "28eeecadd8d3348de6daec3c801ae4-C001-97" ], [ 
"28eeecadd8d3348de6daec3c801ae4-C001-121" ] ], "cite_sentences": [ "28eeecadd8d3348de6daec3c801ae4-C001-97", "28eeecadd8d3348de6daec3c801ae4-C001-121" ] } } }, "ABC_b71da01fb46900d81162b3a3c3cd41_9": { "x": [ { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-2", "text": "We present an improved training strategy for dependency parsers that use online reordering to handle non-projective trees." }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-3", "text": "The new strategy improves both efficiency and accuracy by reducing the number of swap operations performed on non-projective trees by up to 80%." }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-4", "text": "We present state-ofthe-art results for five languages with the best ever reported results for Czech." }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-5", "text": "----------------------------------" }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-7", "text": "Recent work on dependency parsing has resulted in considerable progress with respect to both accuracy and efficiency, not least in the treatment of discontinuous syntactic constructions, usually modeled by non-projective dependency trees." }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-8", "text": "While nonprojective dependency relations tend to be rare in most languages (Kuhlmann and Nivre, 2006) , it is not uncommon that up to 25% of the sentences have a structure that involves at least one non-projective relation, a relation that may be crucial for an adequate analysis of predicate-argument structure." }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-9", "text": "This makes the treatment of non-projectivity central for accurate dependency parsing." 
}, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-10", "text": "Unfortunately, parsing with unrestricted non-projective structures is a hard problem, for which exact inference is not possible in polynomial time except under drastic independence assumptions (McDonald and Satta, 2007) , and most data-driven parsers therefore use approximate methods McDonald et al., 2006) ." }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-11", "text": "One recently explored approach is to perform online reordering by swapping adjacent words of the input sentence while building the dependency structure." }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-12", "text": "Using this technique, the system of Nivre (2009) processes unrestricted non-projective structures with state-ofthe-art accuracy in observed linear time." }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-13", "text": "The normal procedure for training a transitionbased parser is to use an oracle that predicts an optimal transition sequence for every dependency tree in the training set, and then approximate this oracle by a classifier." }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-14", "text": "In this paper, we show that the oracle used for training by Nivre (2009) is suboptimal because it eagerly swaps words as early as possible and therefore makes a large number of unnecessary transitions, which potentially affects both efficiency and accuracy." }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-15", "text": "We propose an alternative oracle that reduces the number of transitions by building larger structures before swapping, but still handles arbitrary non-projective structures." 
}, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-16", "text": "----------------------------------" }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-17", "text": "**BACKGROUND**" }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-18", "text": "The fundamental reason why sentences with nonprojective dependency trees are hard to parse is that they contain dependencies between non-adjacent substructures." }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-19", "text": "The basic idea in online reordering is to allow the parser to swap input words so that all dependency arcs can be constructed between adjacent subtrees." }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-20", "text": "This idea is implemented in the transition system proposed by Nivre (2009) ." }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-21", "text": "The first three transitions of this system (LEFT-ARC, RIGHT-ARC, and SHIFT) are familiar from many systems for transition-based dependency parsing (Nivre, 2008) ." }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-22", "text": "The only novelty is the SWAP transition, which permutes two nodes by moving the second-topmost node from the stack back to the input buffer while leaving the top node on the stack." }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-23", "text": "To understand how we can parse sentences with non-projective dependency trees, in spite of the fact that dependencies can only be added between nodes that are adjacent on the stack, note that, for any sentence x with dependency tree G, there is always some permutation x of x such that G is projective with respect to x ." }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-24", "text": "There may be more than one such permutation, but Nivre (2009) defines the canonical projective order < G for x given G as the order given by an inorder traversal of G that respects the order < between a node and its direct dependents." 
}, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-25", "text": "This is illustrated in Figure 1 , where the words of a sentence with a non-projective tree have been annotated with their positions in the projective order; reading the words in this order gives the permuted string Did you send the letter who to?" }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-26", "text": "----------------------------------" }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-27", "text": "**TRAINING ORACLES**" }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-28", "text": "In order to train classifiers for transition-based parsing, we need a training oracle, that is, a function that maps every dependency tree T in the training set to a transition sequence that derives T ." }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-29", "text": "While every complete transition sequence determines a unique dependency tree, the inverse does not necessarily hold." }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-30", "text": "This also means that it may be possible to construct different training oracles." }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-31", "text": "For simple systems that are restricted to projective dependency trees, such differences are usually trivial, but for a system that allows online reordering there may be genuine differences that can affect both the efficiency and accuracy of the resulting parsers." }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-32", "text": "----------------------------------" }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-33", "text": "**THE OLD ORACLE**" }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-34", "text": "Figure 2 defines the original training oracle \u03c4 1 proposed by Nivre (2009) ." }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-35", "text": "This oracle follows an eager reordering strategy; it predicts SWAP in every configuration where this is possible." 
}, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-36", "text": "The basic insight in this paper is that, by postponing swaps and building as much of the tree structure as possible before swapping, we can significantly decrease the length of the transition sequence for a given sentence and tree." }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-37", "text": "This may benefit the efficiency of the parser trained using the oracle, as each transition takes a certain time to predict and to execute." }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-38", "text": "Longer transition sequences may also be harder to learn than shorter ones, which potentially affects the accuracy of the parser." }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-39", "text": "----------------------------------" }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-40", "text": "**A NEW ORACLE**" }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-41", "text": "While it is desirable to delay a SWAP transition for as long as possible, it is not trivial to find the right time point to actually do the swap." }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-42", "text": "To see this, consider the dependency tree in Figure 1 ." }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-43", "text": "In a parse of this tree, the first configuration in which swapping is possible is when who 6 and did 1 are the two top nodes on the stack." }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-44", "text": "In this configuration we can delay the swap until did has combined with its subject you by means of a RIGHT-ARC transition, but if we do not swap in the second configuration where this is possible, we eventually end up with the stack [ROOT 0 , who 6 , did 1 , send 3 , to 7 ]." }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-45", "text": "Here we cannot attach who to to by means of a LEFT-ARC transition and get stuck." 
}, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-46", "text": "In order to define the new oracle, we introduce an auxiliary concept." }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-47", "text": "Consider a modification of the oracle \u03c4 1 from Figure 2 that cannot predict SWAP transitions." }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-48", "text": "This oracle will be able to produce valid transition sequences only for projective target trees; for non-projective trees, it will fail to reconstruct all dependency arcs." }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-49", "text": "More specifically, a parse with this oracle will end up in a configuration in which the set of constructed dependency arcs forms a set of projective dependency trees, not necessarily a single such tree." }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-50", "text": "We call the elements of this set the maximal projective components of the target tree." }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-51", "text": "To illustrate the notion, we have drawn boxes around nodes in the same component in Figures 1." }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-52", "text": "Based on the concept of maximal projective components, we define a new training oracle \u03c4 2 , which delays swapping as long as the next node in the input is in the same maximal projective component as the top node on the stack." }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-53", "text": "The definition of the new oracle \u03c4 2 is identical to that of \u03c4 1 except that the third line is replaced by \"SWAP if c = ([\u03c3|i, j], [k|\u03b2], A c ), j < G i, and MPC(j) = MPC(k)\", where MPC(i) is the maximal projective component of node i. As a special case, \u03c4 2 predicts SWAP if j < G i and the buffer B is empty." 
}, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-54", "text": "For example, in extracting the transition sequence for the target tree in Figure 1 , the new oracle will postpone swapping of did when you is the next node in the input, but not postpone when the next node is send." }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-55", "text": "We can show that a parser informed by the new training oracle can always proceed to a terminal configuration, and still derive all (even non-projective) dependency trees." }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-56", "text": "----------------------------------" }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-57", "text": "**EXPERIMENTS**" }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-58", "text": "We now test the hypothesis that the new training oracle can improve both the accuracy and the efficiency of a transition-based dependency parser." }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-59", "text": "Our experiments are based on the same five data sets as Nivre (2009) ." }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-60", "text": "The training sets vary in size from 28,750 tokens (1,534 sentences) for Slovene to 1,249,408 tokens (72,703 sentences) for Czech, while the test sets all consist of about 5,000 tokens." }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-61", "text": "----------------------------------" }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-62", "text": "**NUMBER OF TRANSITIONS**" }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-63", "text": "For each language, we first parsed the training set with both the old and the new training oracle." }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-64", "text": "This allowed us to compare the number of SWAP transitions needed to parse all sentences with the two oracles, shown in Table 1 ." 
}, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-65", "text": "We see that the reduction is very substantial, ranging from 55% (for Czech) to almost 84% (for Arabic)." }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-66", "text": "While this difference does not affect the asymptotic complexity of parsing, it may reduce the number of calls to the classifier, which is where transition-based parsers spend most of their time." }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-67", "text": "----------------------------------" }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-68", "text": "**PARSING ACCURACY**" }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-69", "text": "In order to assess whether the reduced number of SWAP transitions also has a positive effect on parsing accuracy, we trained two parsers for each of the five languages, one for each oracle." }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-70", "text": "All systems use SVM classifiers with a polynomial kernel with features and parameters optimized separately for each language and training oracle." }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-71", "text": "The training data for these classifiers consist only of the sequences derived by the oracles, which means that the parser has no explicit notion of projective order or maximal projective components at parsing time." }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-72", "text": "Table 2 shows the labeled parsing accuracy of the parsers measured by the overall attachment score (AS), as well as labeled precision, recall and (balanced) F-score for non-projective dependencies." }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-73", "text": "1 For comparison, we also give results for the two best performing systems in the original CoNLL-X shared task, Malt and MST (McDonald et al., 2006) , as well as the combo system MST Malt , (Nivre and McDonald, 2008) ." 
}, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-74", "text": "Looking first at the overall labeled attachment score, we see that the new training oracle consistently gives higher accuracy than the old one, with differences of up to 0.5 percentage points (for Arabic and Slovene), which is substantial given that the frequency of non-projective dependencies is only 0.4-1.9%." }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-75", "text": "Because the test sets are so small, none of the differences is statistically significant (McNemar's test, \u03b1 = .05), but the consistent improvement over all languages nevertheless strongly suggests that this is a genuine difference." }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-76", "text": "In relation to the state of the art, we note that the parsers with online reordering significantly outperform Malt and MST on Czech and Slovene, and MST on Turkish, and have significantly lower scores than the combo system MST Malt only for Arabic and Danish." }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-77", "text": "For Czech, the parser with the new oracle actually has the highest attachment score ever reported, although the difference with respect to MST Malt is not statistically significant." }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-78", "text": "Turning to the scores for non-projective dependencies, we again see that the new oracle consistently gives higher scores than the old oracle, with Table 2 : Labeled attachment score (AS) overall; precision (P), recall (R) and balanced F-score (F) for non-projective dependencies." }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-79", "text": "Old = \u03c4 1 ; New = \u03c4 2 ; Malt = Nivre et al. the single exception that the old one has marginally higher recall for Czech." 
}, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-80", "text": "Moreover, the reordering parser with the new oracle has higher F-score than any other system for all languages except Danish." }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-81", "text": "Especially the result for Czech, with 79.3% precision and 71.0% recall, is remarkably good, making the parser almost as accurate for non-projective dependencies as it is for projective dependencies." }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-82", "text": "It seems likely that the good results for Czech are due to the fact that Czech has the highest percentage of non-projective structures in combination with the (by far) largest training set." }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-83", "text": "----------------------------------" }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-84", "text": "**CONCLUSION**" }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-85", "text": "We have presented a new training oracle for the transition system originally presented in Nivre (2009) ." }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-86", "text": "This oracle postpones swapping as long as possible but still fulfills the correctness criterion." }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-87", "text": "Our experimental results show that the new training oracle can reduce the necessary number of swaps by more than 80%, and that parsers trained in this way achieve higher precision and recall on nonprojective dependency arcs as well as higher attachment score overall." }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-88", "text": "The results are particularly good for languages with a high percentage of nonprojective dependencies, with an all-time best over all metrics for Czech." 
}, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-89", "text": "An interesting theoretical question is whether the new oracle defined in this paper is optimal with respect to minimizing the number of swaps." }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-90", "text": "The answer turns out to be negative, and it is possible to reduce the number of swaps even further by generalizing the notion of maximal projective components to maximal components that may be non-projective." }, { "sent_id": "b71da01fb46900d81162b3a3c3cd41-C001-91", "text": "However, the characterization of these generalized maximal components is non-trivial, and is therefore an important problem for future research." } ], "y": { "@BACK@": { "gold_contexts": [ [ "b71da01fb46900d81162b3a3c3cd41-C001-12" ], [ "b71da01fb46900d81162b3a3c3cd41-C001-20" ], [ "b71da01fb46900d81162b3a3c3cd41-C001-24" ] ], "cite_sentences": [ "b71da01fb46900d81162b3a3c3cd41-C001-12", "b71da01fb46900d81162b3a3c3cd41-C001-20", "b71da01fb46900d81162b3a3c3cd41-C001-24" ] }, "@DIF@": { "gold_contexts": [ [ "b71da01fb46900d81162b3a3c3cd41-C001-14" ] ], "cite_sentences": [ "b71da01fb46900d81162b3a3c3cd41-C001-14" ] }, "@UNSURE@": { "gold_contexts": [ [ "b71da01fb46900d81162b3a3c3cd41-C001-34" ], [ "b71da01fb46900d81162b3a3c3cd41-C001-85" ] ], "cite_sentences": [ "b71da01fb46900d81162b3a3c3cd41-C001-34", "b71da01fb46900d81162b3a3c3cd41-C001-85" ] }, "@USE@": { "gold_contexts": [ [ "b71da01fb46900d81162b3a3c3cd41-C001-59" ] ], "cite_sentences": [ "b71da01fb46900d81162b3a3c3cd41-C001-59" ] }, "@SIM@": { "gold_contexts": [ [ "b71da01fb46900d81162b3a3c3cd41-C001-59" ] ], "cite_sentences": [ "b71da01fb46900d81162b3a3c3cd41-C001-59" ] } } }, "ABC_59d9c7640e69fd0e19b12b6dbc392c_9": { "x": [ { "sent_id": "59d9c7640e69fd0e19b12b6dbc392c-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "59d9c7640e69fd0e19b12b6dbc392c-C001-2", "text": "FrameNet is a lexical resource that provides rich semantic representations of the core English 
vocabulary based on Fillmore's Frame Semantics, with more than 200k manually annotated examples." }, { "sent_id": "59d9c7640e69fd0e19b12b6dbc392c-C001-3", "text": "Resources based on FrameNet have now been created for roughly a dozen languages." }, { "sent_id": "59d9c7640e69fd0e19b12b6dbc392c-C001-4", "text": "This workshop will present current research on aligning Frame Semantic resources across languages and automatic frame semantic parsing in English and other languages." }, { "sent_id": "59d9c7640e69fd0e19b12b6dbc392c-C001-5", "text": "We will explore the extent to which semantic frames are similar across languages and the implications for theories of semantic universals, the practice of translation (whether human or machine), and multilingual knowledge representation." }, { "sent_id": "59d9c7640e69fd0e19b12b6dbc392c-C001-6", "text": "Does not require prior familiarity with Frame Semantics." }, { "sent_id": "59d9c7640e69fd0e19b12b6dbc392c-C001-7", "text": "The FrameNet Project at the International Computer Science Institute (ICSI, http://framenet." }, { "sent_id": "59d9c7640e69fd0e19b12b6dbc392c-C001-8", "text": "icsi.berkeley.edu) has created a detailed lexical database of contemporary English, (currently more than 13,000 lexical units in 1,200 semantic frames) based on Frame Semantics (Fillmore (1977 (Fillmore ( ), (1985 )." }, { "sent_id": "59d9c7640e69fd0e19b12b6dbc392c-C001-9", "text": "NLP researchers have shown that such representations are useful in diverse applications such as question answering, text-to-scene systems, dialog systems, and social network extraction." }, { "sent_id": "59d9c7640e69fd0e19b12b6dbc392c-C001-10", "text": "Separate research projects have now developed Frame Semantic lexical databases for roughly a dozen languages (including Brazilian Portuguese, Chinese, Hebrew, Japanese, Korean, and Swedish) based in varying degrees on the original FrameNet structure, methodology and annotation practices (Ruppenhofer et al., 2016) ." 
}, { "sent_id": "59d9c7640e69fd0e19b12b6dbc392c-C001-11", "text": "Most have taken the ICSI English frames as a starting point and have found that the majority of target-language words fit comfortably in those frames." }, { "sent_id": "59d9c7640e69fd0e19b12b6dbc392c-C001-12", "text": "The FrameNet team is developing alignments across these FrameNets, seeking to understand crosslinguistic similarities and differences in framing." }, { "sent_id": "59d9c7640e69fd0e19b12b6dbc392c-C001-13", "text": "Going beyond alignments between frames, we also use techniques such as multilingual word vectors (Hermann and Blunsom, 2014) to cluster and align lexical units (each a single sense of a word in a frame) across languages." }, { "sent_id": "59d9c7640e69fd0e19b12b6dbc392c-C001-14", "text": "The underlying research questions include: To what extent are semantic frames the same cross-culturally and cross-linguistically? Where they differ, what are the reasons for these differences?" }, { "sent_id": "59d9c7640e69fd0e19b12b6dbc392c-C001-15", "text": "Will translations to be \"frame preserving\"? This tutorial will discuss the methodology and status of the alignment effort, and of a recently launched parallel manual annotation task, and theoretical issues that have emerged in this area of research, including the interplay between semantic frames and constructions in different languages." }, { "sent_id": "59d9c7640e69fd0e19b12b6dbc392c-C001-16", "text": "We will also report on the state of the art in automatic frame semantic role labeling for English (Swayamdipta et al., 2017) and for other languages." }, { "sent_id": "59d9c7640e69fd0e19b12b6dbc392c-C001-17", "text": "This can be regarded as a structured prediction task which maps a sentence to a graph with nodes for each predicator and its arguments and adjuncts, linked by arcs representing the frame semantic roles." 
}, { "sent_id": "59d9c7640e69fd0e19b12b6dbc392c-C001-18", "text": "Recent approaches rely on neural architectures to learn representations which enforce global consistency for each classification decision and can learn from disparate data." }, { "sent_id": "59d9c7640e69fd0e19b12b6dbc392c-C001-19", "text": "Participants in this tutorial will learn about:" }, { "sent_id": "59d9c7640e69fd0e19b12b6dbc392c-C001-20", "text": "1. Multilingual FrameNet, its methodology and practices 2." }, { "sent_id": "59d9c7640e69fd0e19b12b6dbc392c-C001-21", "text": "Cross-linguistic similarities and differences among the languages 9" }, { "sent_id": "59d9c7640e69fd0e19b12b6dbc392c-C001-22", "text": "----------------------------------" }, { "sent_id": "59d9c7640e69fd0e19b12b6dbc392c-C001-23", "text": "**DESCRIPTION**" }, { "sent_id": "59d9c7640e69fd0e19b12b6dbc392c-C001-24", "text": "The FrameNet Project at the International Computer Science Institute (ICSI, http://framenet." }, { "sent_id": "59d9c7640e69fd0e19b12b6dbc392c-C001-25", "text": "icsi.berkeley.edu) has created a detailed lexical database of contemporary English, (currently more than 13,000 lexical units in 1,200 semantic frames) based on Frame Semantics (Fillmore (1977 (Fillmore ( ), (1985 )." }, { "sent_id": "59d9c7640e69fd0e19b12b6dbc392c-C001-26", "text": "NLP researchers have shown that such representations are useful in diverse applications such as question answering, text-to-scene systems, dialog systems, and social network extraction." }, { "sent_id": "59d9c7640e69fd0e19b12b6dbc392c-C001-27", "text": "Separate research projects have now developed Frame Semantic lexical databases for roughly a dozen languages (including Brazilian Portuguese, Chinese, Hebrew, Japanese, Korean, and Swedish) based in varying degrees on the original FrameNet structure, methodology and annotation practices (Ruppenhofer et al., 2016) ." 
}, { "sent_id": "59d9c7640e69fd0e19b12b6dbc392c-C001-28", "text": "Most have taken the ICSI English frames as a starting point and have found that the majority of target-language words fit comfortably in those frames." }, { "sent_id": "59d9c7640e69fd0e19b12b6dbc392c-C001-49", "text": "She has a Masters degree from Columbia University, and was a research intern at Google New York." }, { "sent_id": "59d9c7640e69fd0e19b12b6dbc392c-C001-29", "text": "The FrameNet team is developing alignments across these FrameNets, seeking to understand crosslinguistic similarities and differences in framing." }, { "sent_id": "59d9c7640e69fd0e19b12b6dbc392c-C001-30", "text": "Going beyond alignments between frames, we also use techniques such as multilingual word vectors (Hermann and Blunsom, 2014) to cluster and align lexical units (each a single sense of a word in a frame) across languages." }, { "sent_id": "59d9c7640e69fd0e19b12b6dbc392c-C001-31", "text": "The underlying research questions include: To what extent are semantic frames the same cross-culturally and cross-linguistically? Where they differ, what are the reasons for these differences?" }, { "sent_id": "59d9c7640e69fd0e19b12b6dbc392c-C001-32", "text": "Will translations to be \"frame preserving\"? This tutorial will discuss the methodology and status of the alignment effort, and of a recently launched parallel manual annotation task, and theoretical issues that have emerged in this area of research, including the interplay between semantic frames and constructions in different languages." }, { "sent_id": "59d9c7640e69fd0e19b12b6dbc392c-C001-33", "text": "We will also report on the state of the art in automatic frame semantic role labeling for English (Swayamdipta et al., 2017) and for other languages." 
}, { "sent_id": "59d9c7640e69fd0e19b12b6dbc392c-C001-34", "text": "This can be regarded as a structured prediction task which maps a sentence to a graph with nodes for each predicator and its arguments and adjuncts, linked by arcs representing the frame semantic roles." }, { "sent_id": "59d9c7640e69fd0e19b12b6dbc392c-C001-35", "text": "Recent approaches rely on neural architectures to learn representations which enforce global consistency for each classification decision and can learn from disparate data." }, { "sent_id": "59d9c7640e69fd0e19b12b6dbc392c-C001-36", "text": "Participants in this tutorial will learn about:" }, { "sent_id": "59d9c7640e69fd0e19b12b6dbc392c-C001-37", "text": "1. Multilingual FrameNet, its methodology and practices 2." }, { "sent_id": "59d9c7640e69fd0e19b12b6dbc392c-C001-38", "text": "Cross-linguistic similarities and differences among the languages Gilardi and Baker (2017) ), aligning FrameNet to other lexical resources (Fellbaum and Baker (2013) , (2008), Ferr\u00e1ndez et al. (2010) ) and linking to ontologies and reasoning (Scheffczyk et al., 2010) ." }, { "sent_id": "59d9c7640e69fd0e19b12b6dbc392c-C001-39", "text": "----------------------------------" }, { "sent_id": "59d9c7640e69fd0e19b12b6dbc392c-C001-40", "text": "**MICHAEL ELLSWORTH**" }, { "sent_id": "59d9c7640e69fd0e19b12b6dbc392c-C001-41", "text": "Michael Ellsworth (Addeco/International Computer Science Institute, (infinity@icsi.berkeley.edu, https://berkeley.academia.edu/MichaelEllsworth) has been involved in lexical semantic research for nearly 20 years, and has been a key member of the FrameNet team, involved in frame definition, annotation, annotator training, and data-integrity checking; he has been in charge of the ontology-like hierarchy that organizes the frames since 2002." 
}, { "sent_id": "59d9c7640e69fd0e19b12b6dbc392c-C001-42", "text": "Publication topics include the differences between FrameNet and other annotation projects (Ellsworth et al., 2004) , the FrameNet hierarchy and ontologies (Dolbey et al., 2006) , the principles behind FrameNet annotation, paraphrasing using FrameNet (Ellsworth and Janin, 2007) , and various English grammatical constructions (Lee-Goldman et al., 2009 (Petruck (2009 (Petruck ( ),(1996 ), lexical semantics , knowledge base development, grammar and lexis, semantics, Frame Semantics and Construction Grammar, particularly as these linguistic theories support advances in NLU and NLP." }, { "sent_id": "59d9c7640e69fd0e19b12b6dbc392c-C001-43", "text": "She is a frequent invited speaker, lecturing internationally about Frame Semantics, Construction Grammar, and FrameNet." }, { "sent_id": "59d9c7640e69fd0e19b12b6dbc392c-C001-44", "text": "Petruck is currently working on a manuscript about FrameNet and NLP." }, { "sent_id": "59d9c7640e69fd0e19b12b6dbc392c-C001-45", "text": "----------------------------------" }, { "sent_id": "59d9c7640e69fd0e19b12b6dbc392c-C001-46", "text": "**SWABHA SWAYAMDIPTA**" }, { "sent_id": "59d9c7640e69fd0e19b12b6dbc392c-C001-47", "text": "Swabha Swayamdipta (swabha@cs.cmu.edu, http://www.cs.cmu.edu/\u02dcsswayamd) is a PhD student at the Language Technologies Institute at Carnegie Mellon University (currently a visiting student at U Washington)." }, { "sent_id": "59d9c7640e69fd0e19b12b6dbc392c-C001-48", "text": "She works with Noah Smith and Chris Dyer on developing efficient algorithms for broad-coverage semantic parsing, with a focus on exploiting the relationship between syntax and semantics (Swayamdipta et al., 2017) ." }, { "sent_id": "59d9c7640e69fd0e19b12b6dbc392c-C001-50", "text": "Her research interests also include applications of broad-coverage semantics for tasks such as entailment and coreference." 
} ], "y": { "@UNSURE@": { "gold_contexts": [ [ "59d9c7640e69fd0e19b12b6dbc392c-C001-16" ], [ "59d9c7640e69fd0e19b12b6dbc392c-C001-33" ], [ "59d9c7640e69fd0e19b12b6dbc392c-C001-48" ] ], "cite_sentences": [ "59d9c7640e69fd0e19b12b6dbc392c-C001-16", "59d9c7640e69fd0e19b12b6dbc392c-C001-33", "59d9c7640e69fd0e19b12b6dbc392c-C001-48" ] } } }, "ABC_7b4a976ba6a43b5ba42cc350b4d132_9": { "x": [ { "sent_id": "7b4a976ba6a43b5ba42cc350b4d132-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "7b4a976ba6a43b5ba42cc350b4d132-C001-2", "text": "Explainable Neural Networks, Kernel Methods, Nystrom Method AI systems are currently used in a wide variety of applications, with several levels of societal impact, and are expected to be soon deployed in safety-critical fields, e.g., autonomous driving." }, { "sent_id": "7b4a976ba6a43b5ba42cc350b4d132-C001-3", "text": "Hence, a natural need for ethical accountability of such systems is gaining importance." }, { "sent_id": "7b4a976ba6a43b5ba42cc350b4d132-C001-4", "text": "A central issue lies in designing systems whose decisions are transparent [6], i.e., they must be easily interpretable by humans, as users must be able to suitably weight and trust their assistance." }, { "sent_id": "7b4a976ba6a43b5ba42cc350b4d132-C001-5", "text": "Deep neural networks are clearly problematic in this regard: their high non-linearity, despite allowing for state-of-theart performances in several challenging problems also amplifies the epistemological opaqueness of the decision-flow and limits its interpretability." }, { "sent_id": "7b4a976ba6a43b5ba42cc350b4d132-C001-6", "text": "The concept of transparency of a machine learning model spans multiple definitions, focusing on different aspects, from the simplicity of the model, e.g., the number of nodes in a decision tree, to the intuitiveness of its parameters and computations [4] ." 
}, { "sent_id": "7b4a976ba6a43b5ba42cc350b4d132-C001-7", "text": "In this context, an important capability of an AI system is the ability of providing post-hoc explanations in terms of evidences supporting the provided decisions: although they usually do not formally elucidate how a model works, post-hoc explanations often have the nice property of being quite intuitive, conveying useful information also to end-users without any AI or machine learning expertise [8] ." }, { "sent_id": "7b4a976ba6a43b5ba42cc350b4d132-C001-8", "text": "In semantic inference tasks (e.g., text classification), an explanation model producing post-hoc explanations should hence be able to trace back connections between the output categories and the semantic and syntactic properties of the input texts." }, { "sent_id": "7b4a976ba6a43b5ba42cc350b4d132-C001-9", "text": "Such models should have three desired properties: semantic transparency, informativeness w.r.t." }, { "sent_id": "7b4a976ba6a43b5ba42cc350b4d132-C001-10", "text": "the system decision and effectiveness in enabling auditing processes against the system." }, { "sent_id": "7b4a976ba6a43b5ba42cc350b4d132-C001-11", "text": "In this work we focus on a specific post-hoc mechanism which is to provide, along with the prediction, a comparison with one or more other examples, namely landmarks, that share task-relevant linguistic properties with the input." }, { "sent_id": "7b4a976ba6a43b5ba42cc350b4d132-C001-12", "text": "From an argument theory perspective, this corresponds to supporting decisions through an \"argument by analogy\" schema [9]: a user exposed to such a kind of argument will endow a different level of trust into the machine decision according to the linguistic plausibility of the analogy." 
}, { "sent_id": "7b4a976ba6a43b5ba42cc350b4d132-C001-13", "text": "In fact, he/she will implicitly gauge the evidence from the linguistic properties shared between the input sentence (or its parts) and the" }, { "sent_id": "7b4a976ba6a43b5ba42cc350b4d132-C001-14", "text": "----------------------------------" }, { "sent_id": "7b4a976ba6a43b5ba42cc350b4d132-C001-15", "text": "****" }, { "sent_id": "7b4a976ba6a43b5ba42cc350b4d132-C001-16", "text": "one used for comparison as well their importance with respect to the output decision." }, { "sent_id": "7b4a976ba6a43b5ba42cc350b4d132-C001-17", "text": "Let us consider, for example, the following prediction in question classification (QC) [7] : \"What is the capital of Zimbabwe?\" refers to a Location." }, { "sent_id": "7b4a976ba6a43b5ba42cc350b4d132-C001-18", "text": "We would like the system to motivate its decision with an argument such as: ...since it recalls me of \"What is the capital of California?\" which also refers to a Location." }, { "sent_id": "7b4a976ba6a43b5ba42cc350b4d132-C001-19", "text": "Notice that a decision explaining task is quite different from relevance ranking, and semantic similarity plays here a minor role: clear and trustful analogies may exist between training examples that are semantically different but such that their properties imply similar causal relationships between the input and the decision." }, { "sent_id": "7b4a976ba6a43b5ba42cc350b4d132-C001-20", "text": "Recent work has been inspired by efforts in improving model's interpretability in image processing tasks, in particular by the Layerwise Relevance Propagation (LRP) [3] ." }, { "sent_id": "7b4a976ba6a43b5ba42cc350b4d132-C001-21", "text": "In LRP, the classification decision of a deep neural network is decomposed backward across the network layers and evidence about the contribution to the final decision brought by individual input fragments (i.e., pixels of the input image) is gathered." 
}, { "sent_id": "7b4a976ba6a43b5ba42cc350b4d132-C001-22", "text": "We propose here to extend the LRP application to a linguistically motivated network architecture, known as Kernel-Based Deep Architecture (KDA) [5] , which frames semantic information captured by linguistic Tree Kernel [2] methods within the neural-based learning paradigm." }, { "sent_id": "7b4a976ba6a43b5ba42cc350b4d132-C001-23", "text": "The result is a mechanism that, for each system's prediction such as in question classification, generates an argument-by-analogy explanation based on real training examples, not necessarily similar to the input." }, { "sent_id": "7b4a976ba6a43b5ba42cc350b4d132-C001-24", "text": "We also propose here a novel approach to evaluate numerically the interpretability of any explanation-enriched model applied in semantic inference tasks." }, { "sent_id": "7b4a976ba6a43b5ba42cc350b4d132-C001-25", "text": "By defining a specific audit process, we derive a synthetic metric, i.e. Auditing Accuracy, that takes into account the properties of transparency, informativeness and effectiveness." }, { "sent_id": "7b4a976ba6a43b5ba42cc350b4d132-C001-26", "text": "The evaluation of the proposed methodology shows the meaningful impact of LRP-based explanation models: users faced with explanations are systematically oriented to accept (or reject) the system decisions, so that post-hoc judgments may even help in improving the overall application accuracy." }, { "sent_id": "7b4a976ba6a43b5ba42cc350b4d132-C001-27", "text": "This work has been accepted for publication at the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP2019 [1] )." } ], "y": { "@USE@": { "gold_contexts": [ [ "7b4a976ba6a43b5ba42cc350b4d132-C001-22" ] ], "cite_sentences": [ "7b4a976ba6a43b5ba42cc350b4d132-C001-22" ] } } }, "ABC_9ee702243b3976ee4261f433d75528_9": { "x": [ { "sent_id": "9ee702243b3976ee4261f433d75528-C001-12", "text": "In addition, their computational costs are expensive." 
}, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-33", "text": "**GOLD-STANDARD LABELS FOR PREORDERING**" }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-124", "text": "Consequently, preordering failed." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-2", "text": "The word order between source and target languages significantly influences the translation quality in machine translation." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-3", "text": "Preordering can effectively address this problem." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-4", "text": "Previous preordering methods require a manual feature design, making language dependent design costly." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-5", "text": "In this paper, we propose a preordering method with a recursive neural network that learns features from raw inputs." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-6", "text": "Experiments show that the proposed method achieves comparable gain in translation quality to the state-of-the-art method but without a manual feature design." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-7", "text": "----------------------------------" }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-9", "text": "The word order between source and target languages significantly influences the translation quality in statistical machine translation (SMT) (Tillmann, 2004; Hayashi et al., 2013; Nakagawa, 2015) ." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-10", "text": "Models that adjust orders of translated phrases in decoding have been proposed to solve this problem (Tillmann, 2004; Koehn et al., 2005; Nagata et al., 2006) ." 
}, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-11", "text": "However, such reordering models do not perform well for long-distance reordering." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-13", "text": "To address these problems, preordering (Xia and McCord, 2004; Wang et al., 2007; Xu et al., 2009; Isozaki et al., 2010b; Gojun and Fraser, 2012; Nakagawa, 2015) and postordering (Goto et al., 2012 (Goto et al., , 2013 Hayashi et al., 2013 ) models have been proposed." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-14", "text": "Preordering reorders source sentences before translation, while post-ordering reorders sentences translated without considering the word order after translation." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-15", "text": "In particular, preordering effectively improves the translation quality because it solves long-distance reordering and computational complexity issues (Jehl et al., 2014; Nakagawa, 2015) ." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-16", "text": "Rule-based preordering methods either manually create reordering rules (Wang et al., 2007; Xu et al., 2009; Isozaki et al., 2010b; Gojun and Fraser, 2012) or extract reordering rules from a corpus (Xia and McCord, 2004; Genzel, 2010) ." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-17", "text": "On the other hand, studies in (Neubig et al., 2012; Lerner and Petrov, 2013; Hoshino et al., 2015; Nakagawa, 2015) apply machine learning to the preordering problem." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-18", "text": "Hoshino et al. (2015) proposed a method that learns whether child nodes should be swapped at each node of a syntax tree." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-19", "text": "Neubig et al. (2012) and Nakagawa (2015) proposed methods that construct a binary tree and reordering simultaneously from a source sentence." 
}, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-20", "text": "These methods require a manual feature design for every language pair, which makes language dependent design costly." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-21", "text": "To overcome this challenge, methods based on feed forward neural networks that do not require a manual feature design have been proposed (de Gispert et al., 2015; Botha et al., 2017) ." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-22", "text": "However, these methods decide whether to reorder child nodes without considering the sub-trees, which contains important information for reordering." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-23", "text": "As a preordering method that is free of manual feature design and makes use of information in sub-trees, we propose a preordering method with a recursive neural network (RvNN)." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-24", "text": "RvNN calculates reordering in a bottom-up manner (from the leaf nodes to the root) on a source syntax tree." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-25", "text": "Thus, preordering is performed considering the entire sub-trees." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-26", "text": "Specifically, RvNN learns whether to reorder nodes of a syntax tree 1 with a vector representation of sub-trees and syntactic categories." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-27", "text": "We evaluate the proposed method for English-to-Japanese translations using both phrase-based SMT (PBSMT) and neural MT (NMT)." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-28", "text": "The results confirm that the proposed method achieves comparable translation quality to the state-of-the-art preordering method (Nakagawa, 2015) that requires a manual feature design." 
}, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-29", "text": "----------------------------------" }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-30", "text": "**PREORDERING WITH A RECURSIVE NEURAL NETWORK**" }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-31", "text": "We explain our design of the RvNN to conduct preordering after describing how to obtain goldstandard labels for preordering." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-32", "text": "----------------------------------" }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-34", "text": "We created training data for preordering by labeling whether each node of the source-side syntax tree has reordered child nodes against a targetside sentence." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-35", "text": "The label is determined based on Kendall's \u03c4 (Kendall, 1938) as in (Nakagawa, 2015) , which is calculated by Equation (1)." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-36", "text": "where y is a vector of target word indexes that are aligned with source words." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-37", "text": "The value of Kendall's \u03c4 is in [\u22121, 1]." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-38", "text": "When it is 1, it means the sequence of y is in a complete ascending order, i.e., target sentence has the same word order with the source in terms of word alignment." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-39", "text": "At each node, if Kendall's \u03c4 increases by reordering child nodes, an \"Inverted\" label is assigned; otherwise, a \"Straight\" label, which means the child nodes do not need to be reordered, is assigned." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-40", "text": "When a source word of a child node does not have an alignment, a \"Straight\" label is assigned." 
}, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-41", "text": "----------------------------------" }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-42", "text": "**PREORDERING MODEL**" }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-43", "text": "RvNN is constructed given a binary syntax tree." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-44", "text": "It predicts the label determined in Section 2.1 at each node." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-45", "text": "RvNN decides whether to reorder the child nodes by considering the sub-tree." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-46", "text": "The vector of the sub-tree is calculated in a bottom-up manner from the leaf nodes." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-47", "text": "Figure 1 shows an example of preordering of an English sentence \"My parents live in London.\" At the VP node corresponding to \"live in London,\" the vector of the node is calculated by Equation (2), considering its child nodes correspond to \"live\" and \"in London.\" where f is a rectifier, W \u2208 R 2\u03bb\u00d7\u03bb is a weight matrix, p l and p r are vector representations of the left and right child nodes, respectively." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-48", "text": "[\u00b7; \u00b7] denotes the concatenation of two vectors." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-49", "text": "W s \u2208 R \u03bb\u00d72 is a weight matrix for the output layer, and b, b s \u2208 R \u03bb are the biases." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-50", "text": "s \u2208 R 2 calculated by Equation (3) is a weight vector for each label, which is fed into a softmax function to calculate the probabilities of the \"Straight\" and \"Inverted\" labels." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-51", "text": "At a leaf node, a word embedding calculated by Equations (4) and (5) is fed into Equation (2)." 
}, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-52", "text": "where x \u2208 R N is a one-hot vector of an input word with a vocabulary size of N , W E \u2208 R N \u00d7\u03bb is an embedding matrix, and b l \u2208 R \u03bb is the bias." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-53", "text": "The loss function is the cross entropy defined by Equation (6)." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-54", "text": "where \u03b8 is the parameters of the model, n is the node of a syntax tree T , K is a mini batch size, and l n k is the label of the n-th node in the k-th syntax tree in the mini batch." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-55", "text": "With the model using POS tags and syntactic categories, we use Equation (7) instead of Equation (2)." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-56", "text": "where e t represents a vector of POS tags or syntactic categories, W t \u2208 R 3\u03bb\u00d7\u03bb is a weight matrix, and b t \u2208 R \u03bb is the bias." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-57", "text": "e t is calculated in the same manner as Equations (4) and (5), but the input is a one-hot vector of the POS tags or syntactic categories at each node." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-58", "text": "\u03bb is tuned on a development set, whose effects are investigated in Section 3.2." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-59", "text": "----------------------------------" }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-60", "text": "**EXPERIMENTS 3.1 SETTINGS**" }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-61", "text": "We conducted English-to-Japanese translation experiments using the ASPEC corpus (Nakazawa et al., 2016) ." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-62", "text": "This corpus provides 3M sentence pairs as training data, 1, 790 sentence pairs as development data, and 1, 812 sentence pairs as test data." 
}, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-63", "text": "We used Stanford CoreNLP 2 for tokenization and POS tagging, Enju 3 for parsing of English, and MeCab 4 for tokenization of Japanese." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-64", "text": "For word alignment, we used MGIZA." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-65", "text": "5 Source-totarget and target-to-source word alignments were calculated using IBM model 1 and hidden Markov model, and they were combined with the intersection heuristic following (Nakagawa, 2015) ." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-66", "text": "We implemented our RvNN preordering model with Chainer." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-67", "text": "6 The ASPEC corpus was created using the sentence alignment method proposed in (Utiyama and Isahara, 2007) and was sorted based on the alignment confidence scores." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-68", "text": "In this paper, we used 100k sentences sampled from the top 500k sentences as training data for preordering." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-69", "text": "The vocabulary size N was set to 50k." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-70", "text": "We used Adam (Kingma and Ba, 2015) with a weight decay and gradient clipping for optimization." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-71", "text": "The mini batch size K was set to 500." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-72", "text": "We compared our model with the state-of-theart preordering method proposed in (Nakagawa, 2015) , which is hereafter referred to as BTG." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-73", "text": "We used its publicly available implementation, 7 and trained it on the same 100k sentences as our model." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-74", "text": "We used the 1.8M source and target sentences as training data for MT." 
}, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-75", "text": "We excluded part of the sentence pairs whose lengths were longer than 50 words or the source to target length ratio exceeded 9." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-76", "text": "For SMT, we used Moses." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-77", "text": "8 We trained the 5-gram language model on the target side of the training corpus with KenLM." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-78", "text": "9 Tuning was performed by minimum error rate training (Och, 2003)." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-79", "text": "We repeated tuning and testing of each model 3 times and reported the average of scores." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-80", "text": "For NMT, we used the attention-based encoderdecoder model of (Luong et al., 2015) with 2-layer LSTM implemented in OpenNMT." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-81", "text": "10 The sizes of the vocabulary, word embedding, and hidden layer were set to 50k, 500, and 500, respectively." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-82", "text": "The batch size was set to 64, and the number of epochs was set to 13." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-83", "text": "The translation quality was evaluated using BLEU (Papineni et al., 2002) and RIBES (Isozaki et al., 2010a) using the bootstrap resampling method (Koehn, 2004) for the significance test." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-84", "text": "Figure 2 shows the learning curve of our preordering model with \u03bb = 200." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-85", "text": "11 Both the training and the development losses decreased until 2 epochs." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-86", "text": "However, the development loss started to increase after 3 epochs." 
}, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-87", "text": "Therefore, the number of epochs was set up to 5, and we chose the model with the lowest development loss." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-88", "text": "The source sentences in the translation evaluation were preordered with 10 http://opennmt.net/ 11 The learning curve behaves similarly for different \u03bb values." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-89", "text": "----------------------------------" }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-90", "text": "**RESULTS**" }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-91", "text": "Avogadro 's hypothesis ( 1811 ) contributed to the development in since then Figure 4 : Example of a syntax tree with a parse-error (the phrase \"(1811)\" was divided in two phrases by mistake)." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-92", "text": "Our preordering result was affected by such parse-errors." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-93", "text": "(Nodes with a horizontal Table 2 : BLEU and RIBES scores on the test set." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-94", "text": "(All models are trained on the entire training corpus of 1.8M sentence pairs.) Numbers in bold indicate the best systems and the systems that are statistically insignificant at p < 0.05 from the best systems." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-95", "text": "this model." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-96", "text": "Next, we investigated the effect of \u03bb." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-97", "text": "Table 1 shows the BLEU scores with different \u03bb values, as well as the BLEU score without preordering." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-98", "text": "In this experiment, PBSMT was trained with a 500k subset of training data, and the distortion limit was set to 6." 
}, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-99", "text": "Our RvNNs consistently outperformed the plain PBSMT without preordering." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-100", "text": "The BLEU score improved as \u03bb increased when only word embedding was considered." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-101", "text": "In addition, RvNNs involving POS tags and syntactic categories achieved even higher BLEU scores." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-102", "text": "This result shows the effectiveness of POS tags and syntactic categories in reordering." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-103", "text": "For these models, setting \u03bb larger than 200 did not contribute to the translation quality." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-104", "text": "Based on these, we further evaluated the RvNN with POS tags and syntactic categories where \u03bb = 200." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-105", "text": "Table 2 shows BLEU and RIBES scores of the test set on PBSMT and NMT trained on the entire training data of 1.8M sentence pairs." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-106", "text": "The distortion limit of SMT systems trained using preordered sentences by RvNN and BTG was set to 0, while that without preordering was set to 6." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-107", "text": "Compared to the plain PBSMT without preordering, both BLEU and RIBES increased significantly with preordering by RvNN and BTG." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-108", "text": "These scores were comparable (statistically insignificant at p < 0.05) between RvNN and BTG, 12 indicating that the proposed method achieves a translation quality comparable to BTG." 
}, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-109", "text": "In contrast to the case of PBSMT, NMT without preordering achieved a significantly higher BLEU score than NMT models with preordering by RvNN and BTG." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-110", "text": "This is the same phenomenon as in the Chinese-to-Japanese translation experiment reported in (Sudoh and Nagata, 2016)." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-111", "text": "We assume that one reason is the isolation between the preordering and NMT models, where both models are trained using independent optimization functions." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-112", "text": "In the future, we will investigate this problem and consider a model that unifies preordering and translation in a single model." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-113", "text": "Figure 3 shows the distribution of Kendall's \u03c4 in the original training data as well as the distributions after preordering by RvNN and BTG." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-114", "text": "The ratio of high Kendall's \u03c4 largely increased in the case of RvNN, suggesting that the proposed method learns preordering properly." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-115", "text": "Table 4: Example of a parse-error that disturbed preordering in our method. (Literal translations are given in parentheses under the Japanese sentences.)" }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-116", "text": "Furthermore, the ratio of high Kendall's \u03c4 by RvNN is higher than that of BTG, implying that preordering by RvNN is better than that by BTG." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-117", "text": "We also manually investigated the preordering and translation results." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-118", "text": "We found that our model improved both."
}, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-119", "text": "Table 3 shows a successful preordering and translation example on PBSMT." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-120", "text": "The word order is notably different between the source and reference sentences." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-121", "text": "After preordering, the word order between the source and reference sentences became the same." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-122", "text": "Because RvNN depends on parsing, sentences with parse-errors tended to fail in preordering." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-123", "text": "For example, the phrase \"(1811)\" in Figure 4 was divided into two phrases by mistake." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-125", "text": "Table 4 shows preordering and translation examples for the sentence in Figure 4." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-126", "text": "Compared to the translation without preordering, the translation quality after preordering improved and delivered the correct meaning." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-127", "text": "----------------------------------" }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-128", "text": "**CONCLUSION**" }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-129", "text": "In this paper, we proposed a preordering method for MT that requires no manual feature design." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-130", "text": "The experiments confirmed that the proposed method achieved a translation quality comparable to the state-of-the-art preordering method that requires manual feature design." }, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-131", "text": "As future work, we plan to develop a model that jointly parses and preorders a source sentence."
}, { "sent_id": "9ee702243b3976ee4261f433d75528-C001-132", "text": "In addition, we plan to integrate preordering into the NMT model." } ], "y": { "@BACK@": { "gold_contexts": [ [ "9ee702243b3976ee4261f433d75528-C001-9" ], [ "9ee702243b3976ee4261f433d75528-C001-13" ], [ "9ee702243b3976ee4261f433d75528-C001-15" ], [ "9ee702243b3976ee4261f433d75528-C001-17" ], [ "9ee702243b3976ee4261f433d75528-C001-19" ] ], "cite_sentences": [ "9ee702243b3976ee4261f433d75528-C001-9", "9ee702243b3976ee4261f433d75528-C001-13", "9ee702243b3976ee4261f433d75528-C001-15", "9ee702243b3976ee4261f433d75528-C001-17", "9ee702243b3976ee4261f433d75528-C001-19" ] }, "@SIM@": { "gold_contexts": [ [ "9ee702243b3976ee4261f433d75528-C001-28" ], [ "9ee702243b3976ee4261f433d75528-C001-35" ], [ "9ee702243b3976ee4261f433d75528-C001-108" ], [ "9ee702243b3976ee4261f433d75528-C001-130" ] ], "cite_sentences": [ "9ee702243b3976ee4261f433d75528-C001-28", "9ee702243b3976ee4261f433d75528-C001-35", "9ee702243b3976ee4261f433d75528-C001-108", "9ee702243b3976ee4261f433d75528-C001-130" ] }, "@USE@": { "gold_contexts": [ [ "9ee702243b3976ee4261f433d75528-C001-35" ], [ "9ee702243b3976ee4261f433d75528-C001-65" ], [ "9ee702243b3976ee4261f433d75528-C001-107" ] ], "cite_sentences": [ "9ee702243b3976ee4261f433d75528-C001-35", "9ee702243b3976ee4261f433d75528-C001-65", "9ee702243b3976ee4261f433d75528-C001-107" ] }, "@UNSURE@": { "gold_contexts": [ [ "9ee702243b3976ee4261f433d75528-C001-72" ], [ "9ee702243b3976ee4261f433d75528-C001-106" ], [ "9ee702243b3976ee4261f433d75528-C001-109" ], [ "9ee702243b3976ee4261f433d75528-C001-113" ] ], "cite_sentences": [ "9ee702243b3976ee4261f433d75528-C001-72", "9ee702243b3976ee4261f433d75528-C001-106", "9ee702243b3976ee4261f433d75528-C001-109", "9ee702243b3976ee4261f433d75528-C001-113" ] }, "@DIF@": { "gold_contexts": [ [ "9ee702243b3976ee4261f433d75528-C001-116" ] ], "cite_sentences": [ "9ee702243b3976ee4261f433d75528-C001-116" ] } } }, "ABC_27aeffca1f7a9a6b40743284a2871d_9": { "x": [ { 
"sent_id": "27aeffca1f7a9a6b40743284a2871d-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "27aeffca1f7a9a6b40743284a2871d-C001-2", "text": "Context-free grammars have been a cornerstone of theoretical computer science and computational linguistics since their inception over half a century ago." }, { "sent_id": "27aeffca1f7a9a6b40743284a2871d-C001-3", "text": "Topic models are a newer development in machine learning that play an important role in document analysis and information retrieval." }, { "sent_id": "27aeffca1f7a9a6b40743284a2871d-C001-4", "text": "It turns out there is a surprising connection between the two that suggests novel ways of extending both grammars and topic models." }, { "sent_id": "27aeffca1f7a9a6b40743284a2871d-C001-5", "text": "After explaining this connection, I go on to describe extensions which identify topical multiword collocations and automatically learn the internal structure of named-entity phrases." }, { "sent_id": "27aeffca1f7a9a6b40743284a2871d-C001-6", "text": "The adaptor grammar framework is a nonparametric extension of probabilistic context-free grammars (Johnson et al., 2007), which was initially intended to allow fast prototyping of models of unsupervised language acquisition (Johnson, 2008), but it has been shown to have applications in text data mining and information retrieval as well (Hardisty et al., 2010)." }, { "sent_id": "27aeffca1f7a9a6b40743284a2871d-C001-7", "text": "We'll see how learning the referents of words and learning the roles of social cues in language acquisition (Johnson et al., 2012) can be viewed as a kind of topic modelling problem that can be reduced to a grammatical inference problem using the techniques described in this talk."
}, { "sent_id": "27aeffca1f7a9a6b40743284a2871d-C001-8", "text": "----------------------------------"
}, { "sent_id": "27aeffca1f7a9a6b40743284a2871d-C001-16", "text": "----------------------------------" }, { "sent_id": "27aeffca1f7a9a6b40743284a2871d-C001-17", "text": "**ABOUT THE SPEAKER**" } ], "y": { "@BACK@": { "gold_contexts": [ [ "27aeffca1f7a9a6b40743284a2871d-C001-6" ] ], "cite_sentences": [ "27aeffca1f7a9a6b40743284a2871d-C001-6" ] } } }, "ABC_9b203bfa690c4a79c1324360a4b8dc_11": { "x": [ { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-2", "text": "Discriminating lexical relations among distributionally similar words has always been a challenge for the natural language processing (NLP) community." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-3", "text": "In this paper, we investigate whether the network embedding of a distributional thesaurus can be effectively utilized to detect co-hyponymy relations." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-4", "text": "By extensive experiments over three benchmark datasets, we show that the vector representation obtained by applying node2vec on a distributional thesaurus outperforms the state-of-the-art models for binary classification of co-hyponymy vs. hypernymy, as well as co-hyponymy vs. meronymy, by huge margins." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-5", "text": "----------------------------------" }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-7", "text": "Distributional semantic models are used in a wide variety of tasks like sentiment analysis, word sense disambiguation, predicting semantic compositionality, etc." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-8", "text": "Automatic detection of lexical relations is one such fundamental task which can be leveraged in applications like paraphrasing, ontology building, metaphor detection, etc."
}, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-9", "text": "Both supervised and unsupervised methods have been proposed by the researchers to identify lexical relations like hypernymy, co-hyponymy, meronymy etc." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-10", "text": "over the years." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-11", "text": "Recent attempts to solve this task deal with proposing similarity measures based on distributional semantic models (Roller et al., 2014; Weeds et al., 2014; Santus et al., 2016; Shwartz et al., 2017; Roller and Erk, 2016) ." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-12", "text": "For hypernymy detection, several works use distributional inclusion hypothesis (Geffet and Dagan, 2005) , entropy-based distributional measure (Santus et al., 2014) as well as several embedding schemes (Fu et al., 2014; Yu et al., 2015; Nguyen et al., 2017) ." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-13", "text": "Image generality for lexical entailment detection (Kiela et al., 2015) has also been tried out for the same purpose." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-14", "text": "As far as meronymy detection is concerned, most of the attempts are pattern based (Berland and Charniak, 1999; Girju et al., 2006; Pantel and Pennacchiotti, 2006) along with some recent works exploring the possibility of using distributional semantic models (Morlane-Hond\u00e8re, 2015) ." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-15", "text": "Similarly, for co-hyponymy detection, researchers have investigated the usefulness of several distributional semantic models." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-16", "text": "One such attempt is made by Weeds et al. (2014) , where they proposed a supervised framework and used several vector operations as features for the classification of hypernymy and co-hyponymy." 
}, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-17", "text": "Santus et al. (2016) proposed a supervised method based on a Random Forest algorithm to learn taxonomical semantic relations, and showed that the model performs well for co-hyponymy detection." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-18", "text": "In another attempt, Jana and Goyal (2018b) proposed various complex network measures which can be used as features to build a supervised classifier model for co-hyponymy detection, and showed improvements over other baseline approaches." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-19", "text": "Recently, with the emergence of various network representation learning methods (Perozzi et al., 2014; Tang et al., 2015; Grover and Leskovec, 2016; Ribeiro et al., 2017), attempts have been made to convert distributional thesaurus networks into a low-dimensional vector space." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-20", "text": "Ferret (2017) applies distributional thesaurus embeddings to synonym extraction and expansion tasks, whereas Jana and Goyal (2018a) use them to improve the state-of-the-art performance on word similarity/relatedness tasks, the word analogy task, etc." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-21", "text": "Thus, a natural question arises as to whether network embeddings should be more effective than the handcrafted network features used by Jana and Goyal (2018b) for co-hyponymy detection." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-22", "text": "Being motivated by this connection, we investigate how the information captured by network representation learning methodologies on a distributional thesaurus can be used in discriminating word pairs having a co-hyponymy relation from word pairs having a hypernymy or meronymy relation, or from any random pair of words."
}, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-23", "text": "We use the distributional thesaurus (DT) network (Riedl and Biemann, 2013) built using Google books syntactic n-grams." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-24", "text": "As a network representation learning method, we apply node2vec (Grover and Leskovec, 2016), which is an algorithmic framework for learning continuous feature representations for nodes in networks that maximizes the likelihood of preserving network neighborhoods of nodes." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-25", "text": "The vectors thus obtained are then used as feature vectors and plugged into the classifiers according to the state-of-the-art experimental setup." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-26", "text": "Classification model: To distinguish word pairs having a co-hyponymy relation from word pairs having a hypernymy or meronymy relation, or from any random pair of words, we combine the network embeddings of the two words by concatenation (CC) and addition (ADD) operations to provide as features to train classifiers like Support Vector Machine (SVM) and Random Forest (RF)." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-27", "text": "Evaluation results: We evaluate the usefulness of DT embeddings against three benchmark datasets for co-hyponymy detection (Weeds et al., 2014; Santus et al., 2016; Jana and Goyal, 2018b), following their experimental setup." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-28", "text": "We show that the network embeddings outperform the baselines by a huge margin throughout all the experiments, except for co-hyponyms vs. random pairs, where the baselines already have very high accuracy and network embeddings are able to match the results."
}, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-29", "text": "arXiv:2002.11506v1 [cs.CL] 24 Feb 2020" }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-30", "text": "----------------------------------" }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-31", "text": "**METHODOLOGY**" }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-32", "text": "We take the distributional thesaurus (DT) (Riedl and Biemann, 2013) constructed from the Google books syntactic n-grams data (Goldberg and Orwant, 2013) spanning from 1520 to 2008 as the underlying network where each word's neighborhood is represented by a list of top 200 words that are similar with respect to their bi-gram distribution (Riedl and Biemann, 2013) ." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-33", "text": "Figure 1 : A sample snapshot of distributional thesaurus (DT) network, where each node represents a word and the weight of edge between two nodes is defined as the number of context features that these two words share in common." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-34", "text": "Here the word 'owl' shares more context features with its co-hyponyms -'crow', 'vulture' compared to their hypernym 'animal'." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-35", "text": "The nodes in the network represent words and edges are present between a node and its top 200 similar nodes; the number of features that two nodes share in common is assigned as the weight of the edge connecting them." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-36", "text": "A snapshot of the DT is shown in Figure 1 ." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-37", "text": "We see that a target word 'owl' is connected with its co-hyponyms, 'crow' and 'vulture' via higher weighted edges, whereas the edge weights with its hypernyms like 'animal' are less." 
}, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-38", "text": "It may also happen that hypernyms of a target word are not even present in its neighborhood." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-39", "text": "For example, 'creature' is not present in the neighborhood of 'owl' but it is connected with 'crow' via a low-weighted edge." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-40", "text": "As per the DT network structure, distributionally similar words are present in close proximity with similar neighborhoods." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-41", "text": "According to the literature dealing with lexical relation detection, words having a co-hyponymy relation are distributionally more similar than words having a hypernymy or meronymy relation or any random pair of words." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-42", "text": "This is well captured by the DT." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-43", "text": "In a recent work, Jana and Goyal (2018b) used network features extracted from the DT to detect co-hyponyms." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-44", "text": "In our approach, we attempt to use embeddings obtained through a network representation learning method such as node2vec (Grover and Leskovec, 2016) applied over the DT network." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-45", "text": "By choosing a flexible notion of a neighborhood and applying a biased random walk procedure, which efficiently explores diverse neighborhoods, node2vec learns representations for each node that organize nodes based on their network roles and/or communities." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-46", "text": "We use the default setup of node2vec: walk length 80, 10 walks per node, window size 10 and vector dimension 128."
}, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-47", "text": "In order to do a qualitative analysis of the obtained vectors, we plot some sample words using t-SNE (Maaten and Hinton, 2008) in Figure 2." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-48", "text": "We observe that the relative distance between the co-hyponymy pairs is much smaller than those having hypernymy relations or meronymy relations for the DT embeddings." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-49", "text": "For instance, the co-hyponyms of 'owl' like 'crow', 'vulture', 'sparrow' are close to each other, whereas hypernyms of 'owl' like 'animal', 'vertebrate', 'creature', as well as meronyms of 'owl' like 'claw', 'feather', are at distant positions." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-50", "text": "----------------------------------" }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-51", "text": "**FIGURE 2:**" }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-52", "text": "t-Distributed Stochastic Neighbor (t-SNE) (Maaten and Hinton, 2008) plot of the DT embedding obtained using node2vec." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-53", "text": "We aim to build a classifier that, given a word pair, is able to detect whether or not they hold a co-hyponymy relation." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-54", "text": "Since we intend to explore the use of DT embeddings, we need to come up with specific ways to combine the embeddings of the word pair to be used as features for the classification." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-55", "text": "Following the literature (Weeds et al., 2014), we investigate four operations: vector difference (DIFF), vector concatenation (CC), vector pointwise addition (ADD) and vector pointwise multiplication (MUL)."
}, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-56", "text": "From our initial experiments, we find that CC and ADD prove to be the better combination methods overall." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-57", "text": "This is justified: DIFF and MUL are somewhat intersective operations, whereas CC and ADD effectively form the union of the features in different ways, and a classifier fed with both shared and non-shared features has access to more information, leading to better accuracy." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-58", "text": "We only report the performances for CC and ADD with Support Vector Machine (SVM) and Random Forest (RF) classifiers." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-59", "text": "----------------------------------" }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-60", "text": "**EXPERIMENTAL RESULTS AND ANALYSIS**" }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-61", "text": "We perform experiments using three benchmark datasets for co-hyponymy detection (Weeds et al., 2014; Santus et al., 2016; Jana and Goyal, 2018b)." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-62", "text": "For each of these, we follow the same experimental setup as discussed by the authors and compare our method with the method proposed by the authors as well as the state-of-the-art models by Jana and Goyal (2018b)." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-63", "text": "We analyze the three datasets to investigate the extent of overlap present in these publicly available benchmark datasets and find that 45.7% of the word pairs of the dataset prepared by Weeds et al. (2014) are present in the ROOT9 dataset prepared by Santus et al. (2016)." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-64", "text": "This intersection set comprises 27.8% of the ROOT9 dataset."
}, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-65", "text": "Similarly, 36.7% of the word pairs of the dataset prepared by Weeds et al. (2014) are present in the whole dataset prepared by Jana and Goyal (2018b)." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-66", "text": "This intersection set comprises 44.9% of the dataset prepared by Jana and Goyal (2018b)." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-67", "text": "----------------------------------" }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-68", "text": "**BASELINE MODEL**" }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-69", "text": "Model: Description. svmDIFF: a linear SVM trained on the vector difference. svmMULT: a linear SVM trained on the pointwise product vector. svmADD: a linear SVM trained on the vector sum. svmCAT: a linear SVM trained on the vector concatenation. svmSING: a linear SVM trained on the vector of the second word in the given word pair. knnDIFF: k nearest neighbours (knn) trained on the vector difference. cosineP: the relation between a word pair holds if the cosine similarity of the word vectors is greater than some threshold p." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-70", "text": "linP: the relation between a word pair holds if the lin similarity (Lin, 1998) of the word vectors is greater than some threshold p." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-71", "text": "Table 3: Accuracy scores on a ten-fold cross validation for the co-hyponym BLESS dataset of our models along with the top two baseline models (one supervised, one semi-supervised) described in (Weeds et al., 2014) and models described in (Jana and Goyal, 2018b)." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-72", "text": "----------------------------------" }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-73", "text": "**METHOD**" }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-74", "text": "----------------------------------" }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-75",
"text": "**CO-HYP VS RANDOM**" }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-76", "text": "Table 4: Percentage F1 scores on a ten-fold cross validation of our models along with the best models described in (Santus et al., 2016) and (Jana and Goyal, 2018b) for the ROOT9 dataset. A set of baseline methodologies, the descriptions of which are presented in Table 1, is used for comparison." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-77", "text": "Following the same experimental setup, we report the accuracy measure for ten-fold cross validation and compare our models with the baselines proposed by Weeds et al. (2014)." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-78", "text": "Table 2 represents the performance of all the baseline models proposed by Weeds et al. (2014)." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-79", "text": "In Table 3 we show the performance of the best supervised model (svmDIFF) and the best semi-supervised model (cosineP) proposed by Weeds et al. (2014) along with our models." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-80", "text": "Here, the best model proposed by Jana and Goyal (2018b) uses an SVM classifier which is fed with the structural similarity of the words in the given word pair from the distributional thesaurus network." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-81", "text": "We see that all the four proposed methods perform at par with or better than the baselines, and using RF CC gives a 15.4% improvement over the best reported results." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-82", "text": "3.2. Experiment-2 (Santus et al., 2016): In the second experiment, we use the ROOT9 dataset prepared by Santus et al. (2016), containing 9,600 labeled pairs extracted from three datasets: EVALution (Santus et al., 2015), Lenci/Benotto (?) and BLESS (Baroni and Lenci, 2011)."
}, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-83", "text": "There is an even distribution of the three classes (hypernyms, co-hyponyms and random) in the dataset." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-84", "text": "Following the same experimental setup as (Santus et al., 2016), we report percentage F1 scores on a ten-fold cross validation for binary classification of co-hyponyms vs. random pairs, as well as co-hyponyms vs. hypernyms, using both SVM and Random Forest classifiers." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-85", "text": "Table 4 represents the performance comparison of our models with the best state-of-the-art models reported in (Santus et al., 2016) and (Jana and Goyal, 2018b)." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-86", "text": "Here, the best model proposed by Santus et al. (2016) uses a Random Forest classifier which is fed with nine corpus-based features like frequency of words, co-occurrence frequency, etc., and the best model proposed by Jana and Goyal (2018b) uses a Random Forest classifier which is fed with five complex network features like structural similarity, shortest path, etc." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-87", "text": "computed from the distributional thesaurus network." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-88", "text": "The results in Table 4 show that, for the binary classification task of co-hyponymy vs. random pairs, we achieve a percentage F1 score of 99.0 with RF CC, which is at par with the state-of-the-art models." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-89", "text": "More importantly, both RF CC and RF ADD beat the baselines by significant margins for the classification task of co-hyponymy vs. hypernymy pairs."
}, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-90", "text": "Table 5: Accuracy scores on a ten-fold cross validation of the models (svmSS, rfALL) proposed by Jana and Goyal (2018b) and our models for the dataset prepared by Jana and Goyal (2018b)." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-91", "text": "----------------------------------" }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-92", "text": "**MODEL**" }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-93", "text": "----------------------------------" }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-94", "text": "**EXPERIMENT-3 (JANA AND GOYAL, 2018B)**" }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-95", "text": "In the third experiment, we use the dataset specifically built for co-hyponymy detection in one of the recent works by Jana and Goyal (2018b)." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-96", "text": "This dataset is extracted from BLESS (Baroni and Lenci, 2011) and divided into three small datasets: Co-Hypo vs Hyper, Co-Hypo vs Mero, Co-Hypo vs Random." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-97", "text": "Each of these datasets is balanced, containing 1,000 co-hyponymy pairs and 1,000 pairs for the other class." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-98", "text": "Following the same setup, we report accuracy scores for ten-fold cross validation for each of these three datasets of our models along with the best models (svmSS, rfALL) reported by Jana and Goyal (2018b) in Table 5." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-99", "text": "Jana and Goyal (2018b) use an SVM classifier with the structural similarity between the words in a word pair as a feature to obtain svmSS, and use a Random Forest classifier with five complex network measures computed from the distributional thesaurus network as features to obtain rfALL."
}, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-100", "text": "From the results presented in Table 5, RF CC proves to be the best among our proposed models, performing at par with the baselines for the Co-Hypo vs Random dataset." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-101", "text": "Interestingly, it beats the baselines comprehensively for the Co-Hypo vs Mero and Co-Hypo vs Hyper datasets, providing improvements of 9.88% and 25.64%, respectively." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-102", "text": "----------------------------------" }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-103", "text": "**ERROR ANALYSIS**" }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-104", "text": "We further analyze the cases for which our model produces wrong predictions." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-105", "text": "We point out some example word pairs, such as 'screw - screwdriver' and 'gorilla - orangutan', from the co-hyponym BLESS dataset which our model wrongly flags as 'false'." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-106", "text": "We observe a drastic difference in frequency between the words in these word pairs in the corpus from which the DT was constructed; for example, 'screw' appears 592,857 times whereas 'screwdriver' has a frequency of 29,748; similarly, 'gorilla' has a frequency of 40,212 whereas 'orangutan' has 3,567." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-107", "text": "In the DT network, edge weight depends on the overlap between the top 1000 context features, and a drastic frequency difference might prevent this overlap from being captured well." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-108", "text": "On the other hand, there are examples like 'potato - peel' and 'jacket - zipper' which our model wrongly flags as 'true' co-hyponyms."
}, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-109", "text": "We observe that the corpus does not contain many co-hyponyms of 'peel' or 'zipper', and thus their neighborhood in the DT network contains words like 'ginger, lemon, onion, garlic' and 'pant, skirt, coat, jeans' which are co-hyponyms of 'potato' and 'jacket', respectively." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-110", "text": "This leads to the false signal by the approach." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-111", "text": "----------------------------------" }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-112", "text": "**CONCLUSION**" }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-113", "text": "In this paper, we have investigated how the distributional thesaurus embeddings obtained using network representation learning can help improve the otherwise difficult task of discriminating co-hyponym pairs from hypernym, meronym and random pairs." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-114", "text": "By extensive experiments, we have shown that while the proposed models are at par with the baselines for detecting co-hyponyms vs. random pairs, they outperform the state-of-the-art models by a huge margin for the binary classification of co-hyponyms vs. hypernyms, as well as co-hyponyms vs. meronyms." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-115", "text": "It clearly shows that network representations can be very effectively utilized for a focused task like relation extraction." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-116", "text": "All the datasets, DT embeddings and codes (with instructions) used in our experiments are made publicly available 1 ." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-117", "text": "The next immediate step is to try out DT embedding to build unsupervised model for co-hyponymy detection." 
}, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-118", "text": "In future, we plan to investigate some more sophisticated network representation learning techniques like path embedding, community embedding techniques to embed the path joining the given pair of words or the subgraph induced by the given pair of words etc. and apply it on distributional thesaurus network for robust detection of lexical relations." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-119", "text": "In this study, our focus has been distinguishing a horizontal relation, co-hyponymy, from parent-child relations like hypernymy and meronymy." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-120", "text": "However, the investigation on discriminating two analogous sibling relations, co-hyponymy and co-meronymy using the proposed method would be one of the interesting future direction." }, { "sent_id": "9b203bfa690c4a79c1324360a4b8dc-C001-121", "text": "Finally, our broad objective is to build a general supervised and unsupervised framework based on complex network theory to detect different lexical relations from a given a corpus with high accuracy." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "9b203bfa690c4a79c1324360a4b8dc-C001-11" ], [ "9b203bfa690c4a79c1324360a4b8dc-C001-17" ], [ "9b203bfa690c4a79c1324360a4b8dc-C001-86" ] ], "cite_sentences": [ "9b203bfa690c4a79c1324360a4b8dc-C001-11", "9b203bfa690c4a79c1324360a4b8dc-C001-17", "9b203bfa690c4a79c1324360a4b8dc-C001-86" ] }, "@USE@": { "gold_contexts": [ [ "9b203bfa690c4a79c1324360a4b8dc-C001-27" ], [ "9b203bfa690c4a79c1324360a4b8dc-C001-61" ], [ "9b203bfa690c4a79c1324360a4b8dc-C001-63" ], [ "9b203bfa690c4a79c1324360a4b8dc-C001-76", "9b203bfa690c4a79c1324360a4b8dc-C001-77" ], [ "9b203bfa690c4a79c1324360a4b8dc-C001-82" ], [ "9b203bfa690c4a79c1324360a4b8dc-C001-84" ], [ "9b203bfa690c4a79c1324360a4b8dc-C001-85" ] ], "cite_sentences": [ "9b203bfa690c4a79c1324360a4b8dc-C001-27", "9b203bfa690c4a79c1324360a4b8dc-C001-61", "9b203bfa690c4a79c1324360a4b8dc-C001-63", "9b203bfa690c4a79c1324360a4b8dc-C001-76", "9b203bfa690c4a79c1324360a4b8dc-C001-82", "9b203bfa690c4a79c1324360a4b8dc-C001-84", "9b203bfa690c4a79c1324360a4b8dc-C001-85" ] }, "@DIF@": { "gold_contexts": [ [ "9b203bfa690c4a79c1324360a4b8dc-C001-27", "9b203bfa690c4a79c1324360a4b8dc-C001-28" ] ], "cite_sentences": [ "9b203bfa690c4a79c1324360a4b8dc-C001-27" ] }, "@UNSURE@": { "gold_contexts": [ [ "9b203bfa690c4a79c1324360a4b8dc-C001-76" ] ], "cite_sentences": [ "9b203bfa690c4a79c1324360a4b8dc-C001-76" ] }, "@SIM@": { "gold_contexts": [ [ "9b203bfa690c4a79c1324360a4b8dc-C001-86", "9b203bfa690c4a79c1324360a4b8dc-C001-88" ] ], "cite_sentences": [ "9b203bfa690c4a79c1324360a4b8dc-C001-86" ] } } }, "ABC_3128481fa4e5d2c4af7deba2c28950_11": { "x": [ { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-31", "text": "----------------------------------" }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-32", "text": "**PREVIOUS WORK**" }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-2", "text": "This paper explores the segmentation of 
tutorial dialogue into cohesive topics." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-3", "text": "A latent semantic space was created using conversations from human to human tutoring transcripts, allowing cohesion between utterances to be measured using vector similarity." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-4", "text": "Previous cohesion-based segmentation methods that focus on expository monologue are reapplied to these dialogues to create benchmarks for performance." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-5", "text": "A novel moving window technique using orthonormal bases of semantic vectors significantly outperforms these benchmarks on this dialogue segmentation task." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-6", "text": "----------------------------------" }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-8", "text": "Ever since Morris and Hirst's (1991) groundbreaking paper, topic segmentation has been a steadily growing research area in computational linguistics, with applications in summarization (Barzilay and Elhadad, 1997), information retrieval (Salton and Allan, 1994), and text understanding (Kozima, 1993)." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-123", "text": "----------------------------------" }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-9", "text": "Topic segmentation likewise has multiple educational applications, such as question answering, detecting student initiative, and assessing student answers." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-10", "text": "There have been essentially two approaches to topic segmentation in the past." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-11", "text": "The first of these, lexical cohesion, may be used for either linear segmentation (Morris and Hirst, 1991; Hearst, 1997) or hierarchical segmentation (Yaari, 1997; Choi, 2000)."
}, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-12", "text": "The essential idea behind the lexical cohesion approaches is that different topics will have different vocabularies." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-13", "text": "Therefore the lexical cohesion within topics will be higher than the lexical cohesion between topics, and gaps in cohesion may mark topic boundaries." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-14", "text": "The second major approach to topic segmentation looks for distinctive textual or acoustic markers of topic boundaries, e.g. referential noun phrases or pauses (Passonneau and Litman, 1993; Passonneau and Litman, 1997) ." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-15", "text": "By using multiple markers and machine learning methods, topic segmentation algorithms may be developed using this second approach that have a higher accuracy than methods using a single marker alone (Passonneau and Litman, 1997) ." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-16", "text": "The primary technique used in previous studies, lexical cohesion, is no stranger to the educational NLP community." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-17", "text": "Lexical cohesion measured by latent semantic analysis (LSA) (Landauer and Dumais, 1997; Dumais, 1993; Manning and Sch\u00fctze, 1999) has been used in automated essay grading (Landauer, Foltz, and Laham, 1998) and in understanding student input during tutorial dialogue (Graesser et al., 2001 )." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-18", "text": "The present paper investigates an orthonormal basis of LSA vectors, currently used by the AutoTutor ITS to assess student answers (Hu et al., 2003) , and how it may be used to segment tutorial dialogue." 
}, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-19", "text": "The focus on dialogue distinguishes our work from virtually all previous work on topic segmentation: prior studies have focused on monologue rather than dialogue." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-20", "text": "Without dialogue, previous approaches have only limited relevance to interactive educational applications such as intelligent tutoring systems (ITS)." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-21", "text": "The only existing work on topic segmentation in dialogue, Galley et al. (2003) , segments recorded speech between multiple persons using both lexical cohesion and dis-tinctive textual and acoustic markers." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-22", "text": "The present work differs from Galley et al. (2003) in two respects, viz. we focus solely on textual information and we directly address the problem of tutorial dialogue." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-23", "text": "In this study we apply the methods of Foltz et al. (1998) , Hearst (1994 Hearst ( , 1997 , and a new technique utilizing an orthonormal basis to topic segmentation of tutorial dialogue." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-24", "text": "All three are vector space methods that measure lexical cohesion to determine topic shifts." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-25", "text": "Our results show that the new using an orthonormal basis significantly outperforms the other methods." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-26", "text": "Section 2 reviews previous work, and Section 3 reviews the vector space model." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-27", "text": "Section 4 introduces an extension of the vector space model which uses an orthonormal basis." 
}, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-28", "text": "Section 5 outlines the task domain of tutorial dialogue, and Section 6 presents the results of previous and the current method on this task domain." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-29", "text": "A discussion and comparison of these results takes place in Section 7." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-30", "text": "Section 8 concludes." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-33", "text": "Though the idea of using lexical cohesion to segment text has the advantages of simplicity and intuitive appeal, it lacks a unique implementation." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-34", "text": "An implementation must define how to represent units of text, compare the cohesion between units, and determine whether the results of comparison indicate a new text segment." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-35", "text": "Both Hearst (1994 Hearst ( , 1997 and Foltz et al. (1998) use vector space methods discussed below to represent and compare units of text." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-36", "text": "The comparisons can be characterized by a moving window, where successive overlapping comparisons are advanced by one unit of text." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-37", "text": "However, Hearst (1994 Hearst ( , 1997 and Foltz et al. (1998) differ on how text units are defined and on how to interpret the results of a comparison." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-38", "text": "The text unit's definition in Hearst (1994 Hearst ( , 1997 and Foltz et al. (1998) is generally task dependent, depending on what size gives the best results." 
}, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-39", "text": "For example, when measuring comprehension, use the unit of the sentence, as opposed to the more standard unit of the proposition, because LSA is most correlated with comprehension at that level." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-40", "text": "However, when using LSA to segment text, Foltz et al. (1998) use the paragraph as the unit, to \"smooth out\" the local changes in cohesion and become more sensitive to more global changes of cohesion." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-41", "text": "Hearst likewise chooses a large unit, 6 token-sequences of 20 tokens (Hearst, 1994) , but varies these parameters dependent on the characteristics of the text to be segmented, e.g. paragraph size." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-42", "text": "Under a vector space model, comparisons are performed by calculating the cosine of vectors representing text." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-43", "text": "As stated previously, these comparisons reflect the cohesion between units of text." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-44", "text": "In order to use these comparisons to segment text, however, one must have a criterion in place." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-45", "text": "Foltz et al. (1998) , noting mean cosines of .16 for boundaries and .43 for non-boundaries, choose a threshold criterion of .15, which is two standard deviations below the boundary mean of .43." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-46", "text": "Using LSA and this criterion, Foltz et al. (1998) detected chapter boundaries with an F-measure of .33 (see Manning and Sch\u00fctze (1999) for a definition of Fmeasure)." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-47", "text": "Hearst (1994 Hearst ( , 1997 in contrast uses a relative comparison of cohesion, by recasting vector comparisons as depth scores." 
}, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-48", "text": "A depth score is computed as the difference between a given vector comparison and its surrounding peaks, i.e. the local maxima of vector comparisons on either side of the given vector comparison." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-49", "text": "The greater the difference between a given comparison and its surrounding peaks, the higher the depth score." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-50", "text": "Once all the depth scores are calculated for a text, those that are higher than one standard deviation below the mean are taken as topic boundaries." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-51", "text": "Using a vector space method without singular value decomposition, Hearst (1997) reports an F-measure of .70 when detecting topic shifts between paragraphs." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-52", "text": "Thus previous work suggests that the Hearst (1997) method is superior to that of Foltz et al. (1998) , having roughly twice the accuracy indicated by F-measure." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-53", "text": "Although these two results used different data sets and are therefore not directly comparable, one would predict based on this limited evidence that the Hearst algorithm would outperform the Foltz algorithm on other topic segmentation tasks." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-54", "text": "----------------------------------" }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-55", "text": "**THE VECTOR SPACE MODEL**" }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-56", "text": "The vector space model is a statistical technique that represents the similarity between collections of words as a cosine between vectors (Manning and Sch\u00fctze, 1999) ." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-57", "text": "The process begins by collecting text into a corpus." 
}, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-58", "text": "A matrix is created from the corpus, having one row for each unique word in the corpus and one column for each document or paragraph." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-59", "text": "The cells of the matrix consist of a simple count of the number of times word i appeared in document j. Since many words do not appear in any given document, the matrix is often sparse." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-60", "text": "Weightings are applied to the cells that take into account the frequency of word i in document j and the frequency of word i across all documents, such that distinctive words that appear infrequently are given the most weight." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-61", "text": "Two collections of words of arbitrary size are compared by creating two vectors." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-62", "text": "Each word is associated with a row vector in the matrix, and the vector of a collection is simply the sum of all the row vectors of words in that collection." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-63", "text": "Vectors are compared geometrically by the cosine of the angle between them." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-64", "text": "LSA (Landauer and Dumais, 1997; Dumais 1993 ) is an extension of the vector space model that uses singular value decomposition (SVD)." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-65", "text": "SVD is a technique that creates an approximation of the original word by document matrix." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-66", "text": "After SVD, the original matrix is equal to the product of three matrices, word by singular value, singular value by singular value, and singular value by document." 
}, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-67", "text": "The size of each singular value corresponds to the amount of variance captured by a particular dimension of the matrix." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-68", "text": "Because the singular values are ordered in decreasing size, it is possible to remove the smaller dimensions and still account for most of the variance." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-69", "text": "The approximation to the original matrix is optimal, in the least squares sense, for any number of dimensions one would choose." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-70", "text": "In addition, the removal of smaller dimensions introduces linear dependencies between words that are distinct only in dimensions that account for the least variance." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-71", "text": "Consequently, two words that were distant in the original space can be near in the compressed space, causing the inductive machine learning and knowledge acquisition effects reported in the literature (Landauer and Dumais, 1997) ." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-72", "text": "----------------------------------" }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-73", "text": "**AN ORTHONORMAL BASIS**" }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-74", "text": "Cohesion can be measured by comparing the cosines of two successive sentences or paragraphs (Foltz, Kintsch, and Landauer, 1998) ." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-75", "text": "However, cohesion is a crude measure: repetitions of a single sentence will be highly cohesive (cosine of 1) even though no new information is introduced." 
}, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-76", "text": "A variation of the LSA algorithm using orthonormalized vectors provides two new measures, \"informativity\" and \"relevance\", which can detect how much new information is added and how relevant it is in a context (Hu et al., 2003) ." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-77", "text": "The essential idea is to represent context by an orthonormalized basis of vectors, one vector for each utterance." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-78", "text": "The basis is a subspace of the higher dimensional LSA space, in the same way as a plane or line is a subspace of 3D space." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-79", "text": "The basis is created by projecting each utterance vector onto the basis of previous utterance vectors using a method known as the GramSchmidt process (Anton, 2000) ." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-80", "text": "Each projected utterance vector has two components, a component parallel to the basis and a component perpendicular to the basis." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-81", "text": "These two components represent \"informativity\" and \"relevance\", respectively. Let us first consider \"relevance\"." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-82", "text": "Since each vector in the basis is orthogonal, the basis represents all linear combinations of what has been previously said." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-83", "text": "Therefore the component of a new utterance vector that is parallel to the basis is already represented by a linear combination of the existing vectors." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-84", "text": "\"Informativity\" follows similarly: it is the perpendicular component of a new utterance vector that can not be represented by the existing basis vectors." 
}, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-85", "text": "For example, in Figure 1" }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-86", "text": "----------------------------------" }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-87", "text": "**PROCEDURE**" }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-88", "text": "The task domain is a subset of conversations from human-human computer mediated tutoring sessions on Newton's Three Laws of Motion, in which tutor and tutee engaged in a chat room-style conversation." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-89", "text": "The benefits of this task domain are twofold." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-90", "text": "Firstly, the conversations are already transcribed." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-91", "text": "Additionally, tutors were instructed to introduce problems using a fixed set of scripted problem statements." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-92", "text": "Therefore each topic shift corresponds to a distinct problem introduced by the tutor." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-93", "text": "Clearly this problem would be trivial for a cue phrase based approach, which could learn the finite set of problem introductions." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-94", "text": "However, the current lexical approach does not have this luxury: words in the problem statements recur throughout the following dialogue." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-95", "text": "Human to human computer mediated physics tutoring transcripts first were removed of all markup, translated to lower case, and each utterance was broken into a separate paragraph." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-96", "text": "An LSA space was made with these paragraphs alone, approximately one megabyte of text." 
}, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-97", "text": "The conversations were then randomly assigned to training (21 conversations) and testing (22 conversations)." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-98", "text": "The average number of utterances per topic, 16 utterances, and the average number of words per utterance, 32 words, were calculated to determine the parameters of the segmentation methods." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-99", "text": "For example, a moving window size greater than 16 utterances implies that, in the majority of occurrences, the moving window straddles three topics as opposed to the desired two." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-100", "text": "To replicate Foltz et al. (1998) , software was written in Java that created a moving window of varying sizes on the input text, and the software retrieved the LSA vector and calculated the cosine of each window." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-101", "text": "Hearst (1994 Hearst ( , 1997 was replicated using the JTextTile (Choi, 1999 ) Java software." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-102", "text": "A variant of Hearst (1994 Hearst ( , 1997 was created by using LSA instead of the standard vector space method." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-103", "text": "The orthonormal basis method also used a moving window; however, in contrast to the previous methods, the window is not treated just as a large block of text." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-104", "text": "Instead, the window consists of two orthonormal bases, one on either side of an utterance." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-105", "text": "That is, a region of utterances above the test utterance is projected, utterance by utterance, into an orthonormal basis, and likewise a region of utterances below the test utterance is projected into another orthonormal basis." 
}, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-106", "text": "Then the test utterance is projected into each orthonormal basis, yielding measures of \"relevance\" and \"informativity\" with respect to each." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-107", "text": "Next the elements that make up each orthonormal basis are aggregated into a block, and a cosine is calculated between the test utterance and the blocks on either side, producing a total of six measures." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-108", "text": "Each tutoring session consists of the same 10 problems, discussed between one of a set of 4 tutors and one of 18 subjects." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-109", "text": "The redundancy provides a variety of speaking and interaction styles on the same topic." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-110", "text": "Tutor: A clown is riding a unicycle in a straight line." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-111", "text": "She accidentally drops an egg beside her as she continues to move with constant velocity." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-112", "text": "Where will the egg land relative to the point where the unicycle touches the ground? Explain." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-113", "text": "Student: The egg should land right next to the unicycle." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-114", "text": "The egg has a constant horizontal velocity." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-115", "text": "The vertical velocity changes and decreases as gravity pulls the egg downward at a rate of 9.8m/s^2." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-116", "text": "The egg should therefore land right next to the unicycle." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-117", "text": "Tutor: Good! There is only one thing I would like to know. 
What can you say about the horizontal velocity of the egg compared to the horizontal velocity of the clown? Student: Aren't they the same?" }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-118", "text": "All of the 10 problems are designed to require application of Newton's Laws to be solved, and therefore conversations share many terms such as force, velocity, acceleration, gravity, etc." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-119", "text": "----------------------------------" }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-120", "text": "**RESULTS**" }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-121", "text": "For each method, the development set was first used to establish the parameters such as text unit size and classification criterion." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-122", "text": "The methods, tuned to these parameters, were then applied to the testing data." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-148", "text": "Again, the unit size and window size had to be calculated." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-124", "text": "**FOLTZ ET AL. (1998)**" }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-125", "text": "In order to replicate Foltz et al.'s results, a text unit size and window size needed to be chosen." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-126", "text": "The utterance was chosen as the text unit size, which included single word utterances, full sentences, and multi-sentence utterances." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-127", "text": "To determine the most appropriate window size, results from all sizes between 1 and 16 (the average number of utterances between topic shifts) were gathered." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-128", "text": "The greatest difference between the means for utterances that introduce a topic shift versus non-shift utterances occurs when the window contains four utterances." 
}, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-129", "text": "The standard deviation is uniformly low for windows containing more than two utterances and therefore can be disregarded in choosing a window size." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-130", "text": "The optimal cosine threshold for classification was found using logistic regression (Garson, 2003) which establishes a relationship between the cosine threshold and the log odds of classification." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-131", "text": "The optimal cutoff was found to be shift odds = .17 with associated F-measure of .49." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-132", "text": "The logistic equation of best fit is: cosine) (-13.345 1.887 odds) ln(shift \u22c5 + = F-measure of .49 is 48% higher than the Fmeasure reported by Foltz et al. (1998) for segmenting monologue." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-133", "text": "On the testing corpus the Fmeasure is .52, which demonstrates good generalization for the logistic equation given." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-134", "text": "Compared the F-measure of .33 reported by Foltz et al. (1998) , the current result is 58% higher." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-135", "text": "----------------------------------" }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-136", "text": "**HEARST (1994, 1997)**" }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-137", "text": "The JTextTile software was used to implement Hearst (1994) on dialogue." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-138", "text": "As with Foltz et al. (1998) , a text unit and window size had to be determined for dialogue." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-139", "text": "Hearst (1994) recommends using the average paragraph size as the window size." 
}, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-140", "text": "Using the development corpus's average topic length of 16 utterances as a reference point, F-measures were calculated for the combinations of window size and text unit size in Table 1 ." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-141", "text": "The optimal combination of parameters (Fmeasure = .17) is a unit size of 16 words and a window size of 16 units." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-142", "text": "This combination matches Hearst (1994) 's heuristic of choosing the window size to be the average paragraph length." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-143", "text": "On the test set, this combination of parameters yielded an F-measure of .14 as opposed to the Fmeasure for monologue reported by Hearst (1997) , .70." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-144", "text": "For dialogue, the algorithm is 20% as effective as it is for monologue." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-145", "text": "It is unclear, however, exactly what part of the algorithm contributes to this poor performance." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-146", "text": "The two most obvious possibilities are the segmentation criterion, i.e. depth scores, or the standard vector space method." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-147", "text": "To further explore these possibilities, the Hearst method was augmented with LSA." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-149", "text": "As with Foltz, the unit size was taken to be the utterance." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-150", "text": "The window size was determined by computing F-measures on the development corpus for all sizes between 1 and 16." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-151", "text": "The optimal window size is 9, F-measure = .22." 
}, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-152", "text": "Given the smaller number of test cases, 22, this F-measure of .22 is not significantly different from .17." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-153", "text": "However, the Foltz method is significantly higher than both of these, p < .10." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-154", "text": "----------------------------------" }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-155", "text": "**ORTHONORMAL BASIS**" }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-156", "text": "The text unit used in the orthonormal basis is the single utterance." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-157", "text": "The optimal window size, i.e. the orthonormal basis size, was determined by creating a logistic regression to calculate the maximum Fmeasure for several orthonormal basis sizes." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-158", "text": "The findings of this procedure are listed in Table 2 ." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-159", "text": "F-measure for orthonormal basis sizes F-measure monotonically increases until the orthonormal basis holds six elements and holds relatively steady for larger orthonormal basis sizes." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-160", "text": "Since F-measure does not increase much over .72 for greater orthonormal basis sizes, 6 was chosen as the most computationally efficient size for the strength of the effect." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-161", "text": "The logistic equation of best fit is: Where the index of 1 indicates a measure on the window preceding the utterance, and an index of 2 indicates a measure on the window following the utterance." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-162", "text": "In the regression, the cosine between the utterance and the preceding window was not significant, p = .86." 
}, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-163", "text": "This finding reflects the intuition that the cosine to the following window varies according to whether the following window is on a new topic, whereas the cosine to the preceding window is always high." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-164", "text": "Additionally, measures of \"relevance\" and \"informativity\" correspond to vector length; all other measures did not contribute significantly to the model and so were not included." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-165", "text": "The sign of the metrics illuminates their role in the model." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-166", "text": "The negative sign on the coefficients for relevance 1 , informativity 1 , and relevance 2 indicates that they are inversely correlated with an utterance signaling the start of a new topic." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-167", "text": "The only surprising feature is that informativity 1 is negatively correlated instead of positively correlated: one would expect a topic shift to introduce new information." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-168", "text": "There is possibly some edge effect here, since the last move of a topic is often a summarizing move that shares many of the physics terms present in the introduction of a new topic." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-169", "text": "On the other hand, the positive sign on cosine 2 and informativity 2 indicates that the start of a new topic should have elements in common with the following material and add new information to that material, as an overview would." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-170", "text": "Beyond the sign, the exponentials of these values indicate how the two basis metrics are weighted." 
}, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-171", "text": "For example, when informativity 2 is raised by one unit, a topic shift is 16 times more likely." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-172", "text": "On the testing corpus the F-measure of the orthonormal basis method is .67, which is significantly different from the performance of all three methods mentioned above, p < .05." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-173", "text": "----------------------------------" }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-174", "text": "**DISCUSSION**" }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-175", "text": "The relative ranking of these results is not altogether surprising given the relationships between inferencing and LSA and between inferencing and dialogue." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-176", "text": "Foltz et al. (1998) found that LSA makes simple bridging inferences in addition to detecting lexical cohesion." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-177", "text": "These bridging inferences are a kind of collocational cohesion (Halliday and Hassan, 1976) whereby words that cooccur in similar contexts become highly related in the LSA space." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-178", "text": "Therefore in applications where this kind of inferencing is required, one might expect an LSA based method to excel." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-179", "text": "Similarly to van Dijk and Kintsch's model of comprehension (van Dijk and Kintsch, 1983) , dialogue can require inferences to maintain coherence." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-180", "text": "According to Grice's Co-operative Principle, utterances lacking semantic coherence flout the Maxim of Relevance and license an inference (Grice, 1975) : S1: Let's go dancing." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-181", "text": "S2: I have an exam tomorrow." 
}, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-182", "text": "The \"inference\" in the sense of Foltz, Kintsch, and Landauer (1998) would be represented by a high cosine between these utterances, even though they don't share any of the same words." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-183", "text": "Dialogue generally tends to be less lexically cohesive and require more inferencing than expository mono- logue, so one might predict that LSA would excel in dialogue applications." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-184", "text": "However, LSA has a weakness: the cosine measure between two vectors does not change monotonically as new word vectors are added to either of the two vectors." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-185", "text": "Accordingly, the addition of a word vector can cause the cosine between two text units to dramatically increase or decrease." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-186", "text": "Therefore the distinctive properties of individual words can be lost with the addition of more words to a text unit." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-187", "text": "This problem can be addressed by using an orthonormal basis (Hu et al., 2003) ." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-188", "text": "By using a basis, each utterance is kept independent, so \"inferencing\" can extend over both the entire set of utterances and the linear combination of any of its subsets." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-189", "text": "Accordingly, when \"inferencing\" over the entire text unit is required, one would expect a basis method using LSA vectors to outperform a standard LSA method." 
}, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-190", "text": "This expectation has been put to the test recently by Olney & Cai (2005) , who find that an orthonormal basis can significantly predict entailment on test data supplied by the PASCAL Textual Entailment Challenge (PASCAL, 2004) ." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-191", "text": "Beyond relative performance rankings, more support for the above reasoning can be found in the difference between Hearst and Hearst + LSA." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-192", "text": "Recall that in monologue, Hearst (1997) reports a much larger F-measure than Foltz et al. (1998) , .70 vs. .33, albeit on different data sets." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-193", "text": "In the present dialogue corpus, these roles are reversed, .14 vs. .52." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-194", "text": "Possible reasons for this reversal are the segmentation criterion, the vector space method, or the fact that Foltz has been trained on similar data via regression and Hearst has not." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-195", "text": "However, comparing the Hearst algorithm with the Hearst + LSA algorithm indicates that a 57% improvement stems from the addition of LSA, keeping all other factors constant." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-196", "text": "While this result is not statistically significant, the direction of the result supports the use of an \"inferencing\" vector space method for segmenting dialogue." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-197", "text": "Unfortunately, the large difference in F-measure between the Foltz algorithm and the Hearst + LSA algorithm is more difficult to explain." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-198", "text": "These two methods differ by their segmentation criterion and by their training (Foltz is a regression model and Hearst is not)." 
}, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-199", "text": "It may be that Hearst (1994 Hearst ( , 1997 )'s segmentation criterion, i.e. depth scores, do not translate well to dialogue." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-200", "text": "Perhaps the assignment of segment boundaries based on the relative difference between a candidate score and its surrounding peaks is highly sensitive to cohesion gaps created by conversational implicatures." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-201", "text": "On the other hand the differences between these two methods may be entirely attributable to the amount of training they received." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-202", "text": "One way to separate the contributions of the segmentation criterion and training would be to create a logistic model using the Hearst + LSA method and to compare this to Foltz." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-203", "text": "The increased effectiveness of the orthonormal basis method over the Foltz algorithm can also be explained in terms of \"inferencing\"." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-204", "text": "Since \"inferencing\" is overwhelmed by lexical cohesion , the increase in window size for the Foltz algorithm deteriorates performance for a window size greater than 4." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-205", "text": "In contrast, the orthonormal basis method becomes most effective as the orthonormal basis size increases past 4." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-206", "text": "This dichotomy illustrates that the Foltz algorithm is not complementary to an \"inferencing\" approach in general." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-207", "text": "Use of an orthonormal basis, on the other hand, increases sensitivity to collocational cohesion without sacrificing lexical cohesion." 
}, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-208", "text": "----------------------------------" }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-209", "text": "**CONCLUSION**" }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-210", "text": "This study explored the segmentation of tutorial dialogue using techniques that have previously been applied to expository monologue and using a new orthonormal basis technique." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-211", "text": "The techniques previously applied to monologue reversed their roles of effectiveness when applied to dialogue." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-212", "text": "This role reversal suggests the predominance of collocational cohesion, requiring \"inferencing\", present in this tutorial dialogue." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-213", "text": "The orthonormal basis method, which we suggest has an increased capacity for \"inferencing\", outperformed both of the techniques previously applied to monologue, and demonstrates that segmentation of these tutorial dialogues most benefits from a method sensitive to lexical and collocational cohesion over large text units." }, { "sent_id": "3128481fa4e5d2c4af7deba2c28950-C001-214", "text": "conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DoD, ONR, or NSF." 
} ], "y": { "@USE@": { "gold_contexts": [ [ "3128481fa4e5d2c4af7deba2c28950-C001-23" ], [ "3128481fa4e5d2c4af7deba2c28950-C001-137" ], [ "3128481fa4e5d2c4af7deba2c28950-C001-142" ] ], "cite_sentences": [ "3128481fa4e5d2c4af7deba2c28950-C001-23", "3128481fa4e5d2c4af7deba2c28950-C001-137", "3128481fa4e5d2c4af7deba2c28950-C001-142" ] }, "@BACK@": { "gold_contexts": [ [ "3128481fa4e5d2c4af7deba2c28950-C001-35" ], [ "3128481fa4e5d2c4af7deba2c28950-C001-37" ], [ "3128481fa4e5d2c4af7deba2c28950-C001-38" ], [ "3128481fa4e5d2c4af7deba2c28950-C001-41" ] ], "cite_sentences": [ "3128481fa4e5d2c4af7deba2c28950-C001-35", "3128481fa4e5d2c4af7deba2c28950-C001-37", "3128481fa4e5d2c4af7deba2c28950-C001-38", "3128481fa4e5d2c4af7deba2c28950-C001-41" ] }, "@DIF@": { "gold_contexts": [ [ "3128481fa4e5d2c4af7deba2c28950-C001-102" ], [ "3128481fa4e5d2c4af7deba2c28950-C001-197", "3128481fa4e5d2c4af7deba2c28950-C001-198", "3128481fa4e5d2c4af7deba2c28950-C001-199" ] ], "cite_sentences": [ "3128481fa4e5d2c4af7deba2c28950-C001-102", "3128481fa4e5d2c4af7deba2c28950-C001-199" ] }, "@UNSURE@": { "gold_contexts": [ [ "3128481fa4e5d2c4af7deba2c28950-C001-199" ] ], "cite_sentences": [ "3128481fa4e5d2c4af7deba2c28950-C001-199" ] } } }, "ABC_1c1b524d2bfe00c62a5a2e1a05ffc7_11": { "x": [ { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-21", "text": "This allows for increased parallel computation and reduces time to convergence." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-45", "text": "We describe both these components below." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-2", "text": "State-of-the-art results on neural machine translation often use attentional sequence-to-sequence models with some form of convolution or recursion." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-3", "text": "Vaswani et al. (2017) propose a new architecture that avoids recurrence and convolution completely." 
}, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-4", "text": "Instead, it uses only self-attention and feed-forward layers." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-5", "text": "While the proposed architecture achieves state-of-the-art results on several machine translation tasks, it requires a large number of parameters and training iterations to converge." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-6", "text": "We propose Weighted Transformer, a Transformer with modified attention layers, that not only outperforms the baseline network in BLEU score but also converges 15 \u2212 40% faster." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-7", "text": "Specifically, we replace the multi-head attention by multiple self-attention branches that the model learns to combine during the training process." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-8", "text": "Our model improves the state-of-the-art performance by 0.5 BLEU points on the WMT 2014 English-to-German translation task and by 0.4 on the English-to-French translation task." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-9", "text": "----------------------------------" }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-11", "text": "Recurrent neural networks (RNNs), such as long short-term memory networks (LSTMs) (Hochreiter & Schmidhuber, 1997) , form an important building block for many tasks that require modeling of sequential data." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-12", "text": "RNNs have been successfully employed for several such tasks including language modeling (Melis et al., 2017; Merity et al., 2017) , speech recognition Graves et al., 2013) , and machine translation (Wu et al., 2016; ." 
}, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-13", "text": "RNNs make output predictions at each time step by computing a hidden state vector h t based on the current input token and the previous states." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-14", "text": "This sequential computation underlies their ability to map arbitrary inputoutput sequence pairs." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-15", "text": "However, because of their auto-regressive property of requiring previous hidden states to be computed before the current time step, they cannot benefit from parallelization." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-16", "text": "Variants of recurrent networks that use strided convolutions eschew the traditional time-step based computation (Kaiser & Bengio, 2016; Lei & Zhang, 2017; Bradbury et al., 2016; Gehring et al., 2016; 2017; Kalchbrenner et al., 2016) ." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-17", "text": "However, in these models, the operations needed to learn dependencies between distant positions can be difficult to learn (Hochreiter et al., 2001; Hochreiter, 1998) ." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-18", "text": "Attention mechanisms, often used in conjunction with recurrent models, have become an integral part of complex sequential tasks because they facilitate learning of such dependencies (Luong et al., 2015; Parikh et al., 2016; Paulus et al., 2017; Kim et al., 2017) ." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-19", "text": "In Vaswani et al. (2017) , the authors introduce the Transformer network, a novel architecture that avoids the recurrence equation and maps the input sequences into hidden states solely using attention." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-20", "text": "Specifically, the authors use positional encodings in conjunction with a multi-head attention mechanism." 
}, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-22", "text": "The authors report results for neural machine translation that show the Transformer networks achieves state-of-the-art performance on the WMT 2014 English-to-German and English-to-French tasks while being orders-of-magnitude faster than prior approaches." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-23", "text": "Transformer networks still require a large number of parameters to achieve state-of-the-art performance." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-24", "text": "In the case of the newstest2013 English-to-German translation task, the base model required 65M parameters, and the large model required 213M parameters." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-25", "text": "We propose a variant of the Transformer network which we call Weighted Transformer that uses self-attention branches in lieu of the multi-head attention." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-26", "text": "The branches replace the multiple heads in the attention mechanism of the original Transformer network, and the model learns to combine these branches during training." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-27", "text": "This branched architecture enables the network to achieve comparable performance at a significantly lower computational cost." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-28", "text": "Indeed, through this modification, we improve the state-of-the-art performance by 0.5 and 0.4 BLEU scores on the WMT 2014 English-to-German and English-to-French tasks, respectively." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-29", "text": "Finally, we present evidence that suggests a regularizing effect of the proposed architecture." 
}, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-30", "text": "----------------------------------" }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-31", "text": "**RELATED WORK**" }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-32", "text": "Most architectures for neural machine translation (NMT) use an encoder and a decoder that rely on deep recurrent neural networks like the LSTM (Luong et al., 2015; Wu et al., 2016; Barone et al., 2017; ." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-33", "text": "Several architectures have been proposed to reduce the computational load associated with recurrence-based computation (Gehring et al., 2016; 2017; Kaiser & Bengio, 2016; Kalchbrenner et al., 2016) ." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-34", "text": "Self-attention, which relies on dot-products between elements of the input sequence to compute a weighted sum (Lin et al., 2017; Parikh et al., 2016; Kim et al., 2017) , has also been a critical ingredient in modern NMT architectures." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-35", "text": "The Transformer network (Vaswani et al., 2017) avoids the recurrence completely and uses only self-attention." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-36", "text": "We propose a modified Transformer network wherein the multi-head attention layer is replaced by a branched self-attention layer." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-37", "text": "The contributions of the various branches is learned as part of the training procedure." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-38", "text": "The idea of multi-branch networks has been explored in several domains (Ahmed & Torresani, 2017; Gastaldi, 2017; Xie et al., 2016) ." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-39", "text": "To the best of our knowledge, this is the first model using a branched structure in the Transformer network." 
}, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-40", "text": "In , the authors use a large network, with billions of weights, in conjunction with a sparse expert model to achieve competitive performance." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-41", "text": "Ahmed & Torresani (2017) analyze learned branching, through gates, in the context of computer vision while in Gastaldi (2017) , the author analyzes a two-branch model with randomly sampled weights in the context of image classification." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-42", "text": "----------------------------------" }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-43", "text": "**TRANSFORMER NETWORK**" }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-44", "text": "The original Transformer network uses an encoder-decoder architecture with each layer consisting of a novel attention mechanism, which the authors call multi-head attention, followed by a feedforward network." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-46", "text": "From the source tokens, learned embeddings of dimension d model are generated which are then modified by an additive positional encoding." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-47", "text": "The positional encoding is necessary since the network does not otherwise possess any means of leveraging the order of the sequence since it contains no recurrence or convolution." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-48", "text": "The authors use additive encoding which is defined as:" }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-49", "text": "where pos is the position of a word in the sentence and i is the dimension of the vector." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-50", "text": "The authors also experiment with learned embeddings (Gehring et al., 2016; 2017) but found no benefit in doing so." 
}, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-51", "text": "The encoded word embeddings are then used as input to the encoder which consists of N layers each containing two sub-layers: (a) a multi-head attention mechanism, and (b) a feed-forward network." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-52", "text": "A multi-head attention mechanism builds upon scaled dot-product attention, which operates on a query Q, key K and a value V :" }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-53", "text": "where d k is the dimension of the key." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-54", "text": "In the first layer, the inputs are concatenated such that each of (Q, K, V ) is equal to the word vector matrix." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-55", "text": "This is identical to dot-product attention except for the scaling factor d k , which improves numerical stability." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-56", "text": "Multi-head attention mechanisms obtain h different representations of (Q, K, V ), compute scaled dot-product attention for each representation, concatenate the results, and project the concatenation with a feed-forward layer." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-57", "text": "This can be expressed in the same notation as Equation (1):" }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-58", "text": "where the W i and W O are parameter projection matrices that are learned." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-59", "text": "Note that" }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-60", "text": "where h denotes the number of heads in the multi-head attention." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-61", "text": "Vaswani et al. (2017) proportionally reduce d k = d v = d model so that the computational load of the multi-head attention is the same as simple self-attention." 
}, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-62", "text": "The second component of each layer of the Transformer network is a feed-forward network." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-63", "text": "The authors propose using a two-layered network with a ReLU activation." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-64", "text": "Given trainable weights W 1 , W 2 , b 1 , b 2 , the sub-layer is defined as:" }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-65", "text": "The dimension of the inner layer is d f f which is set to 2048 in their experiments." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-66", "text": "For the sake of brevity, we refer the reader to Vaswani et al. (2017) for additional details regarding the architecture." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-67", "text": "For regularization and ease of training, the network uses layer normalization (Ba et al., 2016) after each sub-layer and a residual connection around each full layer ." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-68", "text": "Analogously, each layer of the decoder contains the two sub-layers mentioned above as well as an additional multi-head attention sub-layer that receives as inputs (V, K) from the output of the corresponding encoding layer." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-69", "text": "In the case of the decoder multi-head attention sub-layers, the scaled dot-product attention is masked to prevent future positions from being attended to, or in other words, to prevent illegal leftward-ward information flow." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-70", "text": "One natural question regarding the Transformer network is why self-attention should be preferred to recurrent or convolutional models." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-71", "text": "Vaswani et al. 
(2017) state three reasons for the preference: (a) computational complexity of each layer, (b) concurrency, and (c) path length between long-range dependencies." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-72", "text": "Assuming a sequence length of n and vector dimension d, the complexity of each layer is O(n^2 d) for self-attention layers while it is O(nd^2) for recurrent layers." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-73", "text": "Given that typically d > n, the complexity of self-attention layers is lower than that of recurrent layers." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-74", "text": "Further, the number of sequential computations is O(1) for self-attention layers and O(n) for recurrent layers." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-75", "text": "This helps improve utilization of parallel computing architectures." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-76", "text": "Finally, the maximum path length between dependencies is O(1) for the self-attention layer while it is O(n) for the recurrent layer." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-77", "text": "This difference is instrumental in impeding recurrent models' ability to learn long-range dependencies." 
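The per-layer complexity claim can be checked with a simple operation count. This is schematic: constants are ignored and only the leading terms O(n²d) versus O(nd²) are compared.

```python
def self_attention_ops(n: int, d: int) -> int:
    """Leading term for a self-attention layer: every pair of the n
    positions interacts through a d-dimensional dot product."""
    return n * n * d

def recurrent_ops(n: int, d: int) -> int:
    """Leading term for a recurrent layer: each of the n steps applies
    a d x d recurrence matrix to the hidden state."""
    return n * d * d

# In the typical regime d > n (e.g. subword sentences with d = 512),
# self-attention does less work per layer.
n, d = 50, 512
assert self_attention_ops(n, d) < recurrent_ops(n, d)
```

The crossover is exactly at n = d: for sequences longer than the model dimension, per-layer self-attention becomes the more expensive of the two.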
}, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-82", "text": "For the Weighted Transformer, we propose a branched attention that modifies the entire attention layer in the Transformer network (including both the multi-head attention and the feed-forward network)." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-83", "text": "The proposed attention layer can be described as:" }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-84", "text": "where M denotes the total number of branches, \u03ba i , \u03b1 i \u2208 R + are learned parameters and W Oi \u2208 R dv\u00d7dmodel ." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-85", "text": "The FFN function above is identical to Equation (4)." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-86", "text": "Further, we require that \u03ba i = 1 and \u03b1 i = 1 so that Equation (7) is a weighted sum of the individual branch attention values." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-87", "text": "In the equations above, \u03ba can be interpreted as a learned concatenation weight and \u03b1 as the learned addition weight." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-88", "text": "Indeed, \u03ba scales the contribution of the various branches before \u03b1 is used to sum them in a weighted fashion." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-89", "text": "We ensure that all bounds are respected during each training step by projection." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-90", "text": "While it is possible that \u03b1 and \u03ba could be merged into one variable and trained, we found better training outcomes by separating them." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-91", "text": "It also improves the interpretability of the models gives that (\u03b1, \u03ba) can be thought of as probability masses on the various branches." 
}, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-92", "text": "It can be shown that if \u03ba i = 1 and \u03b1 i = 1 for all i, we recover the equation for the multi-head attention (3)." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-93", "text": "However, given the i \u03ba i = 1 and i \u03b1 i = 1 bounds, these values are not permissible in the Weighted Transformer." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-94", "text": "One interpretation of our proposed architecture is that it replaces the multi-head attention by a multi-branch attention." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-95", "text": "Rather than concatenating the contributions of the different heads, they are instead treated as branches that a multi-branch network learns to combine." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-96", "text": "This mechanism adds O(M ) trainable weights." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-97", "text": "This is an insignificant increase compared to the total number of weights." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-98", "text": "Indeed, in our experiments, the proposed mechanism added 192 weights to a model containing 213M weights already." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-99", "text": "Without these additional trainable weights, the proposed mechanism is identical to the multi-head attention mechanism in the Transformer." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-100", "text": "The proposed attention mechanism is used in both the encoder and decoder layers and is masked in the decoder layers as in the Transformer network." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-101", "text": "Similarly, the positional encoding, layer normalization, and residual connections in the encoder-decoder layers are retained." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-102", "text": "We eliminate these details from Figure 1 for clarity." 
}, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-103", "text": "Instead of using (\u03b1, \u03ba) learned weights, it is possible to also use a mixture-ofexperts normalization via a softmax layer ." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-104", "text": "However, we found this to perform worse than our proposal." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-105", "text": "Unlike the Transformer, which weighs all heads equally, the proposed mechanism allows for ascribing importance to different heads." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-106", "text": "This in turn prioritizes their gradients and eases the optimization process." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-107", "text": "Further, as is known from multi-branch networks in computer vision (Gastaldi, 2017) , such mechanisms tend to cause the branches to learn decorrelated input-output mappings." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-108", "text": "This reduces co-adaptation and improves generalization." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-109", "text": "This observation also forms the basis for mixture-ofexperts models ." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-110", "text": "----------------------------------" }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-111", "text": "**EXPERIMENTS 4.1 TRAINING DETAILS**" }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-112", "text": "The weights \u03ba and \u03b1 are initialized randomly, as with the rest of the Transformer weights." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-113", "text": "In addition to the layer normalization and residual connections, we use label smoothing with ls = 0.1, attention dropout, and residual dropout with probability P drop = 0.1." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-114", "text": "Attention dropout randomly drops out elements (Srivastava et al., 2014) from the softmax in (1)." 
}, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-115", "text": "As in Vaswani et al. (2017) , we used the Adam optimizer (Kingma & Ba, 2014) with (\u03b2 1 , \u03b2 2 ) = (0.9, 0.98) and = 10 \u22129 ." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-116", "text": "We also use the learning rate warm-up strategy for Adam wherein the learning rate lr takes on the form:" }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-117", "text": "for the all parameters except (\u03b1, \u03ba) and" }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-118", "text": "for (\u03b1, \u03ba)." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-119", "text": "This corresponds to the warm-up strategy used for the original Transformer network except that we use a larger peak learning rate for (\u03b1, \u03ba) to compensate for their bounds." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-120", "text": "Further, we found that freezing the weights (\u03ba, \u03b1) in the last 10K iterations aids convergence." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-121", "text": "During this time, we continue training the rest of the network." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-122", "text": "We hypothesize that this freezing process helps stabilize the rest of the network weights given the weighting scheme." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-123", "text": "We note that the number of iterations required for convergence to the final score is substantially reduced for the Weighted Transformer." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-124", "text": "We found that Weighted Transformer converges 15-40% faster as measured by the total number of iterations to achieve optimal performance." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-125", "text": "We train the baseline model for 100K steps for the smaller variant and 300K for the larger." 
}, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-126", "text": "We train the Weighted Transformer for the respective variants for 60K and 250K iterations." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-127", "text": "We found that the objective did not significantly improve by running it for longer." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-128", "text": "Further, we do not use any averaging strategies employed in Vaswani et al. (2017) and simply return the final model for testing purposes." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-129", "text": "In order to reduce the computational load associated with padding, sentences were batched such that they were approximately of the same length." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-130", "text": "All sentences were encoded using byte-pair encoding (Sennrich et al., 2015) and shared a common vocabulary." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-131", "text": "Weights for word embeddings were tied to corresponding entries in the final softmax layer (Inan et al., 2016; Press & Wolf, 2016) ." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-132", "text": "We trained all our networks on NVIDIA K80 GPUs with a batch containing roughly 25,000 source and target tokens." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-133", "text": "----------------------------------" }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-134", "text": "**RESULTS ON BENCHMARK DATA SETS**" }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-135", "text": "We benchmark our proposed architecture on the WMT 2014 English-to-German and English-toFrench tasks." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-136", "text": "The WMT 2014 English-to-German data set contains 4.5M sentence pairs." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-137", "text": "The English-to-French contains 36M sentence pairs." 
}, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-138", "text": "Results of our experiments are summarized in Table 1 ." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-139", "text": "The Weighted Transformer achieves a 1.1 BLEU score improvement over the state-of-the-art on the English-to-German task for the smaller network and 0.5 BLEU improvement for the larger network." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-140", "text": "In the case of the larger English-toFrench task, we note a 0.8 BLEU improvement for the smaller model and a 0.4 improvement for the larger model." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-141", "text": "Also, note that the performance of the smaller model for Weighted Transformer is close to that of the larger baseline model, especially for the English-to-German task." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-142", "text": "This suggests that the Weighted Transformer better utilizes available model capacity since it needs only 30% of the parameters as the baseline transformer for matching its performance." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-143", "text": "Our relative improvements do not hinge on using the BLEU scores for comparison; experiments with the GLEU score proposed in Wu et al. (2016) also yielded similar improvements." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-144", "text": "Finally, we comment on the regularizing effect of the Weighted Transformer." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-145", "text": "Given the improved results, a natural question is whether the results stem from improved regularization of the model." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-146", "text": "To investigate this, we report the testing loss of the Weighted Transformer and the baseline Transformer against the training loss in Figure 2 ." 
}, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-147", "text": "Models which have a regularizing effect tend to have lower" }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-148", "text": "----------------------------------" }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-149", "text": "**MODEL EN-DE BLEU EN-FR BLEU**" }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-150", "text": "Transformer (small) (Vaswani et al., 2017) 27.3 38.1 Weighted Transformer (small) 28.4 38.9" }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-151", "text": "Transformer (large) (Vaswani et al., 2017) 28.4 41.0 Weighted Transformer (large) 28.9 41.4" }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-152", "text": "ByteNet (Kalchbrenner et al., 2016) 23.7 -Deep-Att+PosUnk (Zhou et al., 2016) -39.2 GNMT+RL (Wu et al., 2016) 24.6 39.9 ConvS2S (Gehring et al., 2017) 25.2 40.5 MoE 26.0 40.6 Table 1 : Experimental results on the WMT 2014 English-to-German (EN-DE) and English-toFrench (EN-FR) translation tasks." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-153", "text": "Our proposed model outperforms the state-of-the-art models including the Transformer (Vaswani et al., 2017) ." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-154", "text": "The small model corresponds to configuration (A) in Table 2 while large corresponds to configuration (B)." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-155", "text": "testing losses for the same training loss." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-156", "text": "We see this effect in our experiments suggesting that the proposed architecture may have better regularizing properties." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-157", "text": "This is not unexpected given similar outcomes for other branching-based strategies such as Shake-Shake Gastaldi (2017) and mixtureof-experts ." 
}, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-158", "text": "----------------------------------" }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-159", "text": "**SENSITIVITY ANALYSIS**" }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-160", "text": "In Table 2 , we report sensitivity results on the newstest2013 English-to-German task." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-161", "text": "Specifically, we vary the number of layers in the encoder/decoder and compare the performance of the Weighted Transformer and the Transformer baseline." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-162", "text": "The results clearly demonstrate the benefit of the branched attention; for every experiment, the Weighted Transformer outperforms the baseline transformer, in some cases by up to 1.3 BLEU points." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-163", "text": "As in the case of the baseline Transformer, increasing the number of layers does not necessarily improve performance; a modest improvement is seen when the number of layers N is increased from 2 to 4 and 4 to 6 but the performance degrades when N is increased to 8." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-164", "text": "Increasing the number of heads from 8 to 16 in configuration (A) yielded an even better BLEU score." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-165", "text": "However, preliminary experiments with h = 16 and h = 32, like in the case with N , degrade the performance of the model." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-166", "text": "(Vaswani et al., 2017) architecture and our proposed Weighted Transformer." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-167", "text": "Reported BLEU scores are evaluated on the English-to-German translation development set, newstest2013." 
}, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-168", "text": "Figure 3: Convergence of the (\u03b1, \u03ba) weights for the second encoder layer of Configuration (C) for the English-to-German newstest2013 task." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-169", "text": "We smoothen the curves using a mean filter." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-170", "text": "This shows that the network does prioritize some branches more than others and that the architecture does not exploit a subset of the branches while ignoring others." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-171", "text": "In Figure 3 , we present the behavior of the weights (\u03b1, \u03ba) for the second encoder layer of the configuration (C) for the English-to-German newstest2013 task." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-172", "text": "The figure shows that, in terms of relative weights, the network does prioritize some branches more than others; circumstantially by as much as 2\u00d7." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-173", "text": "Further, the relative ordering of the branches changes over time suggesting that the network is not purely exploitative." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-174", "text": "A purely exploitative network, which would learn to exploit a subset of the branches at the expense of the rest, would not be preferred since it would effectively reduce the number of available parameters and limit the representational power." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-175", "text": "Similar results are seen for other layers, including the decoder layers; we omit them for brevity." 
}, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-176", "text": "----------------------------------" }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-177", "text": "**RANDOMIZATION BASELINE**" }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-178", "text": "The proposed modification can also be interpreted as a form of Shake-Shake regularization proposed in Gastaldi (2017) ." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-179", "text": "In this regularization strategy, random weights are sampled during forward and backward passes for weighing the various branches in a multi-branch network." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-180", "text": "During test time, they are weighed equally." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-181", "text": "In our strategy, the weights are learned instead of being sampled randomly." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-182", "text": "Consequently, no changes to the model are required during test time." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-183", "text": "Weights (\u03b1, \u03ba) BLEU Learned 24.8 Random 21.1 Uniform 23.4 Table 3 : Performance of the architecture with random and uniform normalization weights on the newstest2013 English-to-German task for configuration (C)." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-184", "text": "This shows that the learned (\u03b1, \u03ba) weights of the Weighted Transformer are crucial to its performance." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-185", "text": "In order to better understand whether the network benefits from the learned weights or if, at test time, random or uniform weights suffice, we propose the following experiment: the weights for the Weighted Transformer, including (\u03b1, \u03ba) are trained as before, but, during test time, we replace them with (a) randomly sampled weights, and (b) 1/M where M is the number of incoming branches." 
}, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-186", "text": "In Table 3 , we report experimental results on the configuration (C) of the Weighted Transformer on the English-to-German newstest2013 data set (see Table 2 for details regarding the configuration)." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-187", "text": "It is evident that random or uniform weights cannot replace the learned weights during test time." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-188", "text": "Preliminary experiments suggest that a Shake-Shake-like strategy where the weights are sampled randomly during training also leads to inferior performance." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-189", "text": "----------------------------------" }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-190", "text": "**GATING**" }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-191", "text": "In order to analyze whether a hard (discrete) choice through gating will outperform our normalization strategy, we experimented with using gates instead of the proposed concatenation-addition strategy." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-192", "text": "Specifically, we replaced the summation in Equation (7) by a gating structure that sums up the contributions of the top k branches with the highest probabilities." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-193", "text": "This is similar to the sparselygated mixture of experts model in ." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-194", "text": "Despite significant hyper-parameter tuning of k and M , we found that this strategy performs worse than our proposed mechanism by a large margin." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-195", "text": "We hypothesize that this is due to the fact that the number of branches is low, typically less than 16." 
}, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-196", "text": "Hence, sparsely-gated models lose representational power due to reduced capacity in the model." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-197", "text": "We plan to investigate the setup with a large number of branches and sparse gates in future work." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-198", "text": "----------------------------------" }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-199", "text": "**CONCLUSIONS**" }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-200", "text": "We present the Weighted Transformer that trains faster and achieves better performance than the original Transformer network." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-201", "text": "The proposed architecture replaces the multi-head attention in the Transformer network by a multiple self-attention branches whose contributions are learned as a part of the training process." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-202", "text": "We report numerical results on the WMT 2014 English-to-German and English-to-French tasks and show that the Weighted Transformer improves the state-of-the-art BLEU scores by 0.5 and 0.4 points respectively." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-203", "text": "Further, our proposed architecture trains 15 \u2212 40% faster than the baseline Transformer." }, { "sent_id": "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-204", "text": "Finally, we present evidence suggesting the regularizing effect of the proposal and emphasize that the relative improvement in BLEU score is observed across various hyper-parameter settings for both small and large models." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-3" ], [ "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-19" ], [ "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-35" ], [ "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-61" ], [ "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-66" ], [ "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-71" ] ], "cite_sentences": [ "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-3", "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-19", "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-35", "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-61", "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-66", "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-71" ] }, "@EXT@": { "gold_contexts": [ [ "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-19", "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-25" ], [ "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-35", "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-36" ] ], "cite_sentences": [ "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-19", "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-35" ] }, "@USE@": { "gold_contexts": [ [ "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-81" ], [ "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-150" ], [ "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-151" ] ], "cite_sentences": [ "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-81", "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-150", "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-151" ] }, "@SIM@": { "gold_contexts": [ [ "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-115" ] ], "cite_sentences": [ "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-115" ] }, "@DIF@": { "gold_contexts": [ [ "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-128" ], [ "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-153" ] ], "cite_sentences": [ "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-128", "1c1b524d2bfe00c62a5a2e1a05ffc7-C001-153" ] } } }, "ABC_79e96060492c3978dc5a7a0d5f293f_11": { "x": [ { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-124", "text": "----------------------------------" }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-2", 
"text": "In this paper we undertake a large crossdomain investigation of sentiment domain adaptation, challenging the practical necessity of sentiment domain adaptation algorithms." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-3", "text": "We first show that across a wide set of domains, a simple \"all-in-one\" classifier that utilizes all available training data from all but the target domain tends to outperform published domain adaptation methods." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-4", "text": "A very simple ensemble classifier also performs well in these scenarios." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-5", "text": "Combined with the fact that labeled data nowadays is inexpensive to come by, the \"kitchen sink\" approach, while technically nonglamorous, might be perfectly adequate in practice." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-6", "text": "We also show that the common anecdotal evidence for sentiment terms that \"flip\" polarity across domains is not borne out empirically." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-7", "text": "----------------------------------" }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-9", "text": "Automatic detection and analysis of sentiment around products, brands, political issues etc." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-10", "text": "has triggered a large amount of research in the past 15 -20 years (for a recent overview see Lee 2008 and Liu 2012) ." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-11", "text": "Early work focused on algorithms for mining sentiment dictionaries (Hatzivassiloglou and McKeown 1997, Turney 2002) ; this was followed by the exploration of supervised techniques (Pang et al. 2002) and, somewhat more recently, by investigations of domain adaptation techniques." 
}, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-12", "text": "Also more recently, the focus has broadened from the detection of polarity (negative/positive sentiment) to more nuanced approaches that try to identify targets and holders of sentiment, sentiment strength, or finer-grained mood distinctions (e.g. Wilson et al. 2006, Kim and Hovy 2006) ." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-13", "text": "Within the polarity detection paradigm, a number of common assumptions have been shared in the community and are frequently repeated in the literature." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-14", "text": "Two of these fundamental assumptions are:" }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-15", "text": "1. Obtaining sufficient labeled data for supervised training is expensive 2." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-16", "text": "Sentiment models trained on one domain tend to perform poorly on new, unseen domains A conclusion that is often drawn from these assumptions is that domain adaptation of sentiment models from a domain with sufficient labeled data to a new domain with little labeled data is an important problem and requires new and sophisticated algorithms." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-17", "text": "In this paper, we empirically re-examine the assumptions above." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-18", "text": "Based on a wide range of experiments on 27 different domains, we challenge the conclusion that domain adaptation for polarity detection necessarily requires novel and sophisticated machinery." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-19", "text": "It is important to keep in mind, however, that our claims are strictly limited to the problem under investigation, namely polarity detection." 
}, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-20", "text": "We do not make any claims whatsoever about domain adaptation for other sentiment-related problems or general problems in machine learning." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-21", "text": "Based on readily available data from 27 domains, we show that a \"kitchen sink\" approach where all source domain data are combined to train a single classifier sets a surprisingly high baseline for polarity identification accuracy across domains." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-22", "text": "We also show on a previously released data set of four domains that the result is competitive with a state-of-theart domain adaptation approach using Structural Correspondence Learning." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-23", "text": "We then show that a straightforward ensemble learner can, for some domains, improve results further, without any need for specialized learning algorithms." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-24", "text": "Since most work in domain-adaptation only provides published results on pairwise adaptation between domains and not on multi-domain adaptation, we hope to establish a new baseline for future adaptation techniques to compare against." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-25", "text": "----------------------------------" }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-26", "text": "**RELATED WORK**" }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-27", "text": "Of direct importance to the discussion in this paper are results from domain adaptation in polarity detection." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-28", "text": "One of the earlier successful approaches (Blitzer et al. 2006 (Blitzer et al. , 2007 involved Structural Correspondence Learning (SCL)." 
}, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-29", "text": "SCL identifies \"pivot\" features that are both highly discriminative in the labeled source domain data and also frequent in the unlabeled target domain data." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-30", "text": "In a subsequent step, linear predictors for the pivot terms are learned from the unlabeled target data and from the source data." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-31", "text": "Daum\u00e9 (2007) approached domain adaptation from a fully labeled source domain to a partially labeled target domain by augmenting the feature space." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-32", "text": "Instead of using a single, general, feature set for source and target, three distinct feature sets are created: the general set of features, a source-domain specific version of the feature set, and a target-specific version of the feature set." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-33", "text": "Li and Zong (NLP-KE 2008) explore a classifier combination technique they call \"MultipleLabel Consensus Training\" which results in better accuracy than non-adapted models on the data sets used in Blitzer et al. (2007) ." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-34", "text": "They also addressed the multi-domain sentiment analysis problem using feature -level fusion and classifier-level fusion approaches in Li and Zong (ACL 2008) ." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-35", "text": "Dredze and Crammer (2008) have proposed a multi-domain online learning framework based on parameter combination from multiple Confidence Weighted (CW) classifiers." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-36", "text": "Their MultiDomain Regularization (MDR) framework seeks to learn domain specific parameters guided by the shared parameter across domains." 
}, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-37", "text": "Samdani and Yih (2011) propose an ensemble learner that consists of classifiers trained on different feature groups." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-38", "text": "The feature groups are Chen et al. (2011) use a specific co-training algorithm for domain adaptation on the Blitzer et al. (2007) data set." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-39", "text": "In averaged pair-wise comparisons they establish gains over a source-plustarget logistic regression baseline." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-40", "text": "Glorot et al. (2011) investigate a deep learning approach to domain adaptation and report increased accuracy across domains both on the Blitzer et al. (2007) 4-domain data set and the larger Amazon review data set (25 domains) also made available in that release." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-41", "text": "They also introduce a new metric for transfer learning: Transfer Ratio." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-42", "text": "----------------------------------" }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-43", "text": "**DATASETS & EXPERIMENTAL SETUP**" }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-44", "text": "This section illustrates the datasets, the methods and the setup of our experiments." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-45", "text": "----------------------------------" }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-46", "text": "**DATASETS**" }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-47", "text": "The datasets we used in our experiments have been obtained from three sources: 1." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-48", "text": "Amazon reviews 1 : this dataset contains more than 5.8 million reviews." 
}, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-49", "text": "It has been used in previous work on sentiment analysis (see Glorot et al. (2011) )." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-50", "text": "The Amazon reviews include 25 domains as shown in Table 1 . 2. Hotel reviews 2 : this dataset includes full reviews of hotels in 10 different cities (Dubai, Beijing, London, New York City, New Delhi, San Francisco, Shanghai, Montreal, Las Vegas, Chicago)." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-51", "text": "There are about 80-700 hotels in each city." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-52", "text": "The extracted fields include date, review title and the full review." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-53", "text": "The total number of reviews is 259,000. 3." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-54", "text": "Twitter: this dataset has been obtained and annotated in Choudhury et al. (AAAI 2012) over a 1 year period of time from Nov. 1, 2010 to Oct. 31, 2011." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-55", "text": "The dataset has been originally annotated for affects." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-56", "text": "We mapped the positive affects \"joviality\" and \"serenity\" to positive sentiment and the negative affects \"fatigue\", \"hostility\", and \"sadness\" to negative sentiment." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-57", "text": "We selected a balanced dataset of 2,000 tweets from the various months of the collected tweets." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-58", "text": "The average review length for the Amazon and hotel reviews is 437 characters and 97 words." 
}, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-59", "text": "In total, we used 27 domains namely the identified based on how stable the feature distribution is across domains, which can either be estimated from the data directly or can be hypothesized based on domain knowledge." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-60", "text": "25 Amazon domains, the hotel domain and the Twitter domain." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-61", "text": "We considered Twitter as a domain though the content of tweets spans multiple domains since it has different characteristics from the product reviews." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-62", "text": "Tweets are constrained log-likelihood ratio (LLR)." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-63", "text": "Further, we used the accuracy metric to indicate the performance of each of the above four domain adaptation techniques." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-64", "text": "We also employed the Transfer Ratio metric proposed by Glorot et al. (2011) The Amazon reviews and the hotel reviews are rated between 1 and 5 on a 5 point scale where 1 is the most negative and 5 is the most positive." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-65", "text": "We have extracted only the reviews that are rated 5 and 1 to represent the positive and negative reviews respectively." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-66", "text": "Further, we ensured that the datasets we extracted and used for training are balanced between positive and negative reviews." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-67", "text": "Table 1 summarizes the 27 domains and their dataset sizes including the balanced datasets we used for training." 
}, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-68", "text": "----------------------------------" }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-69", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-70", "text": "In our experiments, we employed the datasets of the 27 domains mentioned in section 3.1." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-71", "text": "In each experiment, we have employed one domain for testing while the other 26 domains have been used for training." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-72", "text": "We compared four domain adaptation techniques: 1." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-73", "text": "One classifier trained on all source domains." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-74", "text": "2." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-75", "text": "An ensemble of classifiers, each trained on a single source domain." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-76", "text": "3." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-77", "text": "The domain adaptation approach proposed in Daum\u00e9 (2007)." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-78", "text": "4. We also compared the results of approaches 1 and 2 to published results on Structural Correspondence Learning (SCL) by using the same datasets as in Blitzer et al. (2007)." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-79", "text": "In all our experiments, we employed Maximum Entropy-based classification with vanilla parameter settings and feature reduction using log-likelihood ratio (LLR)." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-80", "text": "The rest of the subsection illustrates the experimental setup for each of the above four approaches." 
}, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-81", "text": "----------------------------------" }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-82", "text": "**IN-DOMAIN CLASSIFIERS**" }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-83", "text": "To establish a \"ceiling\" performance we built an in-domain classifier for each of the 27 domains." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-84", "text": "The in-domain classifier is trained with a dataset of that one domain and tested on the same domain (using cross-validation)." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-85", "text": "This standard in-domain supervised setup establishes an upper bound for classification performance (although in some cases we will see that other techniques can outperform this upper bound)." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-86", "text": "Features consist of binary unigram and bigram features." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-87", "text": "On average, the total number of features in each domain is 52,039." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-88", "text": "Feature reduction was performed using LLR, retaining only the top 20,000 most predictive features as established on the training set." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-89", "text": "We compare the results obtained from testing each domain with the three approaches to its in-domain classifier results." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-90", "text": "----------------------------------" }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-91", "text": "**ALL-IN-ONE CLASSIFIER**" }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-92", "text": "The all-in-one classifier is a maximum entropy classifier trained with the source domain datasets merged together." 
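The in-domain feature pipeline described above (binary unigram and bigram features, reduced to the top 20,000 most predictive features by log-likelihood ratio) can be sketched as follows. This is a minimal illustration, not the authors' code; the function names and the 2x2 contingency-table convention for LLR are our own assumptions.

```python
import math

def unigram_bigram_features(text):
    # Binary unigram + bigram features over whitespace tokens
    toks = text.lower().split()
    return set(toks) | {" ".join(b) for b in zip(toks, toks[1:])}

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table:
    k11 = positive-class docs containing the feature,
    k12 = positive-class docs without it,
    k21 = negative-class docs containing it,
    k22 = negative-class docs without it.
    G^2 = 2 * sum(O * ln(O / E)) over the four cells."""
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    obs = ((k11, k12), (k21, k22))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = obs[i][j]
            if o > 0:
                e = rows[i] * cols[j] / total
                g2 += o * math.log(o / e)
    return 2.0 * g2

def top_k_features(stats, k):
    # stats: {feature: (k11, k12, k21, k22)}; keep the k highest-LLR features
    ranked = sorted(stats, key=lambda f: llr(*stats[f]), reverse=True)
    return ranked[:k]
```

In the paper's setup, `top_k_features` would be called with k = 20,000 over counts gathered from the training split only.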
}, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-93", "text": "In this setting, the classifier is trained with data from multiple domains, which exposes it to multiple sentiment vocabularies at training time, creating a somewhat domain-independent and general model." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-94", "text": "The all-in-one classifier is trained with the 26 source-domain datasets while being tested on the held-out 27th domain." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-95", "text": "----------------------------------" }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-96", "text": "**ENSEMBLE CLASSIFIERS**" }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-97", "text": "One approach to address the problem of domain adaptation is to construct an ensemble of classifiers, all of which contribute partially to the final result (see Dietterich (1997) for an overview)." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-98", "text": "We constructed an ensemble of in-domain sentiment classifiers, one for each source domain." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-99", "text": "There are various techniques to combine the contribution of each classifier in the ensemble." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-100", "text": "We employed three techniques in our experiment settings: 1. Majority vote: the results are obtained by taking the majority of votes from the multiple classifiers in the ensemble." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-101", "text": "For example, if 20 classifiers vote positive and only 6 classifiers vote negative, the final result is positive." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-102", "text": "2. Sum of weights: the results are obtained by summing up the class probabilities from each classifier." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-103", "text": "3. 
Meta-classification: the results are obtained by combining the weight of each classifier's vote in a meta-classifier." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-104", "text": "The meta-classifier weights are learned through a machine learning model trained on a small labeled set of data from the target domain." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-105", "text": "We used both logistic regression and SVM to train the meta-classifier." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-106", "text": "We have experimented with multiple sizes of labeled target data ranging from 5 positive and 5 negative meta-training examples to 50 positive and 50 negative examples." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-107", "text": "The following steps are used to train the meta-classifier." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-108", "text": "a) For each review r in the set of labeled data in target domain D that is used to train the meta-classifier, we create a vector V consisting of the vote of each source-domain classifier on r and the label of r. b) We construct a matrix M from the set of vectors V created in step (a). c) We employ either logistic regression or SVM." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-109", "text": "We used the SVMlight implementation to train the ensemble using SVM with the matrix M and a radial basis kernel function." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-110", "text": "For K source domains, the augmented feature space consists of K+1 copies of the original feature space." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-111", "text": "However, creating a copy of each feature for every domain plus a shared copy inflates the feature space considerably, which is prohibitive in a many-domain adaptation scenario such as ours, with a total of 27 domains." 
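The three ensemble-combination strategies above (majority vote, sum of weights, and a meta-classifier trained on the vote vectors from steps a-c) can be sketched as follows. This is an illustrative sketch, not the paper's implementation: the paper uses logistic regression or SVMlight for the meta-classifier, while here a small hand-rolled logistic regression stands in for it.

```python
import math

def majority_vote(votes):
    # votes: +1/-1 predictions from the per-domain classifiers; ties go positive
    return 1 if sum(votes) >= 0 else -1

def sum_of_weights(probas):
    # probas: P(positive) from each classifier; sum class probabilities per class
    pos = sum(probas)
    neg = sum(1 - p for p in probas)
    return 1 if pos >= neg else -1

def train_meta(vote_matrix, labels, lr=0.1, epochs=200):
    """Logistic-regression meta-classifier over classifier votes:
    one weight per source-domain classifier, plus a bias.
    vote_matrix rows are the vectors V of step (a); labels are in {0, 1}."""
    k = len(vote_matrix[0])
    w = [0.0] * k
    b = 0.0
    for _ in range(epochs):
        for v, y in zip(vote_matrix, labels):
            z = b + sum(wi * vi for wi, vi in zip(w, v))
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y  # gradient of log loss w.r.t. z
            b -= lr * g
            w = [wi - lr * g * vi for wi, vi in zip(w, v)]
    return w, b

def meta_predict(w, b, votes):
    z = b + sum(wi * vi for wi, vi in zip(w, votes))
    return 1 if z >= 0 else -1
```

With only a handful of labeled target examples, the meta-classifier effectively learns how much to trust each source-domain classifier, which matches the paper's observation that 5+5 feedback examples already perform close to 50+50.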
}, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-112", "text": "We addressed this challenge by considering the 26 source domains as a single source domain being adapted to the target domain." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-113", "text": "This setup along with feature reduction enabled us to apply Daum\u00e9's approach without too much of an inflation of the feature space." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-114", "text": "However, we also recognize that this likely compromises the power of the feature augmentation approach." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-115", "text": "Blitzer et al. (2007) employ the Structural Correspondence Learning (SCL) algorithm for sentiment domain adaptation." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-116", "text": "Blitzer et al. evaluate the SCL domain adaptation on four publicly released datasets from Amazon product reviews: books, DVDs, electronics and kitchen appliances." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-117", "text": "In these four datasets, reviews with rating > 3 were labeled positive, those with rating < 3 were labeled negative, and the rest discarded because their polarity was ambiguous." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-118", "text": "1000 positive and 1000 negative labeled examples were used for each domain." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-119", "text": "Some unlabeled data were additionally used including 3685 (DVDs) and 5945 (kitchen)." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-120", "text": "Each labeled dataset was split into 1600 instances for training and 400 instances for testing." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-121", "text": "The baseline in Blitzer et al. (2007) is a linear classifier trained without adaptation, while their ceiling reference is the same as ours, which is the in-domain classifier trained and tested on the same domain." 
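Daumé's feature augmentation, referenced above, can be sketched in a few lines; the dict-based feature representation and the "general" key name are our own assumptions. The paper's simplification of treating all 26 source domains as a single source corresponds to calling the function with just two domain labels, e.g. "source" and "target".

```python
def augment(features, domain):
    """Daum\u00e9 III's 'frustratingly easy' feature augmentation:
    each feature appears once in a shared (general) copy and once in a
    domain-specific copy, giving K+1 copies of the space for K domains."""
    out = {}
    for f, v in features.items():
        out[("general", f)] = v  # shared copy, active in every domain
        out[(domain, f)] = v     # domain-specific copy
    return out
```

A classifier trained on the augmented vectors can then put weight on the shared copy for features that behave consistently across domains, and on the domain-specific copy for features that do not.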
}, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-122", "text": "----------------------------------" }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-123", "text": "**HAL DAUM\u00c9'S DOMAIN-ADAPTATION APPROACH**" }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-125", "text": "**BLITZER'S STRUCTURAL CORRESPONDENCE LEARNING**" }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-126", "text": "We conducted a set of experiments employing the four datasets used for SCL domain adaptation." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-127", "text": "In these experiments, we compare the results of our all-in-one classifier and the ensemble classifier trained and tested on the four datasets to the results of SCL and its variation SCL-MI as reported by Blitzer et al. (2007), along with their baseline and ceiling in-domain classifiers for the four domains." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-128", "text": "----------------------------------" }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-129", "text": "**RESULTS & DISCUSSION**" }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-130", "text": "This section summarizes the results of the experiments described in section 3.2 while further scrutinizing the comparison between the four domain adaptation sentiment analysis techniques." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-131", "text": "We also report the Transfer Ratio results of the all-in-one and ensemble classifiers." 
}, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-132", "text": "Generally, the all-in-one classifier is closely comparable to the in-domain classifier of each domain." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-133", "text": "----------------------------------" }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-134", "text": "**RESULTS**" }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-135", "text": "In this section, we summarize the various results obtained from the set of experiments described in section 3.2." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-136", "text": "In the summary of each experiment results, we also plot the in-domain classifier results of each domain as the ceiling of comparison." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-137", "text": "----------------------------------" }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-138", "text": "**ALL-IN-ONE CLASSIFIER EXPERIMENTS**" }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-139", "text": "In the all-in-one classifier experiments, the sentiment classifier is trained with 26 domain datasets and tested on the held-out 27th domain." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-140", "text": "Table 3 summarizes the results." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-141", "text": "The results of the all-in-one classifier are very close to the in-domain classifiers in most domains, except for apparel, beauty, magazines, outdoor living, office products and software." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-142", "text": "----------------------------------" }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-143", "text": "**ENSEMBLE CLASSIFIER EXPERIMENTS**" }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-144", "text": "We produced the results of the ensemble of classifiers using the three settings: majority votes, sum of weights, and meta-classification using both logistic regression and SVM." 
}, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-145", "text": "Table 3 summarizes the results of the three settings used in the ensemble." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-146", "text": "Table 3 shows that the ensembles with sum of weights and meta-training (SVM sigmoid kernel) are the most comparable to the in-domain classifier of each domain." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-147", "text": "We also experimented with variations of logistic regression and SVM for meta-training." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-148", "text": "The non-linear (RBF kernel) SVM meta-classifier outperforms the linear logistic regression model." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-149", "text": "We have employed two kernel variations of SVM, namely a radial basis function kernel with gamma 0.01 and a sigmoid kernel." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-150", "text": "In most domains, the SVM model trained with 50 positive and 50 negative feedback examples is not far off the one trained with 5 positive and 5 negative feedback examples." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-151", "text": "This shows that even with little labeled data in the target domain, the ensemble could effectively combine the weights of the classifier votes." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-152", "text": "We expect the ensemble to achieve steady but slow performance gains over time while collecting more feedback examples." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-153", "text": "----------------------------------" }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-154", "text": "**HAL DAUM\u00c9'S DOMAIN-ADAPTATION APPROACH**" }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-155", "text": "We compared the performance of the all-in-one and ensemble classifiers to Daum\u00e9's feature augmentation algorithm." 
}, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-156", "text": "Table 3 shows that the all-in-one classifier exceeds Daum\u00e9's approach in all 27 domains given our current implementation of Daum\u00e9's approach." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-157", "text": "The ensemble exceeds Daum\u00e9's approach on all domains except office, kitchen & housewares, magazines, office products, and tweets." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-158", "text": "----------------------------------" }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-159", "text": "**STRUCTURAL CORRESPONDENCE LEARNING (SCL)**" }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-160", "text": "We employed the four domains datasets used in Blitzer et al. (2007) to train and test the all-in one and the ensemble classifiers." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-161", "text": "We also replicated the in-domain results of these four datasets using our maximum entropy classifier." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-162", "text": "We compare the results of the all-in-one and the ensemble classifier to the SCL and its variation SCL-MI adaptation techniques using the four datasets used to evaluate SCL and SCL-MI in Blitzer et al. (2007) ." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-163", "text": "Note that the results published in Blitzer's work represent pairwise domain-adaptation, while our ensemble and all-in-one results are based on training on three of Blitzer's domains and testing on the held-out fourth domain." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-164", "text": "This makes it impossible to draw a direct comparison, but we can still observe that in general, it is best to simply combine as many domains as possible in an all-in-one or ensemble approach as compared to carefully adapting a single domain." 
}, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-165", "text": "Table 2 summarizes the results of the comparison." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-166", "text": "Here e(S,T) is the transfer error, defined as the test error obtained by a method trained on the source domain S and tested on the target domain T." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-167", "text": "e_b(T,T) is the test error obtained by the baseline in-domain method, and the transfer loss is t(S,T) = e(S,T) - e_b(T,T)." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-168", "text": "The transfer ratio Q also characterizes the transfer but is defined by replacing the difference with a quotient in t: Q = (1/n) * sum over couples (S,T) with S \u2260 T of e(S,T) / e_b(T,T), where n is the number of such couples." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-169", "text": "----------------------------------" }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-170", "text": "**DOMAIN**" }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-171", "text": "InDomain" }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-172", "text": "----------------------------------" }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-173", "text": "**ALL-IN-ONE**" }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-174", "text": "Ensemble (sum of weights): The all-in-one classifier had a 1.12 transfer ratio across domains, which is very close to the best result of ~1.07 in Glorot et al. (2011). The ensemble with the sigmoid kernel of SVM trained on 50 positive and 50 negative feedback examples from the target domain had a 1.81 transfer ratio." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-175", "text": "The ensemble with the radial basis function kernel (gamma=0.01) trained on 5 positive and 5 negative feedback examples from the target domain had a 1.85 transfer ratio." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-176", "text": "Note that the transfer ratio of the in-domain classifier, which is used as the baseline for calculating the transfer ratio, is 1." 
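The transfer loss and transfer ratio of Glorot et al. (2011), as used in the evaluation, can be computed as follows; the variable names are ours.

```python
def transfer_loss(e_st, e_b):
    # t(S,T) = e(S,T) - e_b(T,T): transfer error minus the in-domain
    # baseline error on the target domain
    return e_st - e_b

def transfer_ratio(errors):
    """Q = (1/n) * sum of e(S,T) / e_b(T,T) over the n couples (S,T)
    with S != T. `errors` is a list of (e(S,T), e_b(T,T)) pairs.
    Q = 1 means transfer matches the in-domain baseline; larger is worse."""
    return sum(e / eb for e, eb in errors) / len(errors)
```

For example, the paper's all-in-one classifier reaches Q = 1.12, i.e. its cross-domain test error is on average 12% higher than the in-domain baseline error.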
}, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-177", "text": "The transfer ratio of the all-in-one classifier is better than the transfer ratio of the ensemble with its two variations." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-178", "text": "----------------------------------" }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-179", "text": "**DISCUSSION**" }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-180", "text": "The results in the previous section indicate that both the all-in-one and the ensemble approaches exceed both Daum\u00e9's domain adaptation technique on the 27 datasets (given our current implementation of Daum\u00e9's approach) and SCL on the four datasets in Blitzer et al. (2007) and that the all-in-one approach achieves comparable results in terms of transfer ratio to Glorot et al. (2011) ." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-181", "text": "The ensemble approach exceeds the all-in-one in some domains like apparel and automotive." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-182", "text": "They both are very close in some other domains. When comparing the all-in-one and the ensemble approaches on the four datasets in Blitzer et al. (2007), the all-in-one exceeds the ensemble only in the DVD domain." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-183", "text": "The ensemble exceeds the all-in-one in electronics and kitchen & housewares." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-184", "text": "They both perform at the same accuracy level on the books domain." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-185", "text": "We also applied McNemar's significance test between pairs of the all-in-one, the ensemble and Daum\u00e9's approaches on the 27 domains." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-186", "text": "Table 4 shows the significant differences between the approach combinations." 
}, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-187", "text": "Finally, we would like to do some initial exploration of the role of features across domains." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-188", "text": "The commonly held belief is that sentiment indicators such as \"hot\" can change their polarity from domain to domain (e.g., it is positive in the food domain while it may be negative in another domain), contributing to the need for domain adaptation." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-189", "text": "On the other hand, the success of the all-in-one classifier indicates that a greater number of observed sentiment features and more solid statistics on those features are more important than capturing domain-specific polarity changes." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-190", "text": "In order to gather evidence for or against these hypotheses, we first calculated the number of overlapping features between each pair of domains within the 27 domains." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-191", "text": "The average percentage of features that overlap between pairs of domain is only 12.48%." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-192", "text": "Furthermore, only a very small set of the highly sentiment-correlated features overlap." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-193", "text": "Only 16 features overlap among all 27 domains, which accounts for only 0.08% of the features." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-194", "text": "Examples of positive overlapping feature are \"highly\", \"excellent\", and \"great\"." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-195", "text": "Negative overlapping features are \"waste\", \"terrible\", and \"worst\"." 
}, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-196", "text": "This low feature overlap of sentiment-bearing features lends some support to the hypothesis that in order to capture a general, large-scale sentiment vocabulary nothing beats diverse and plentiful training data." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-197", "text": "The low feature overlap also justifies why the all-in-one classifier exceeds the ensemble, though the latter has access to some labeled data in the target domain. Second, we examined the question of polarity-changing sentiment features." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-198", "text": "Among the top 1000 features in each domain ranked by LLR, we counted the common features among multiple domains." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-199", "text": "The number of common features among 15 domains is 42 features." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-200", "text": "Only 13 features are common among 20 domains while there are no common features from the highest 1000 likelihood ratio features among the 27 domains." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-201", "text": "Most features do not flip polarity across domains." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-202", "text": "For example the word \"waste\" is common among 20 domains and maintains a negative polarity across the domains." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-203", "text": "Very few features flip polarity across domains." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-204", "text": "The word \"highly\" is shared across 23 domains." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-205", "text": "It maintains a positive polarity in all domains except Tools & Hardware, where it flips." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-206", "text": "The word \"refund\" is shared in 20 domains." 
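The feature-overlap analysis above (pairwise overlap percentages and features common to at least k domains) can be sketched as follows. The paper does not state how the overlap percentage is normalized; normalizing by the union of the two feature sets is our assumption.

```python
from collections import Counter

def overlap_pct(feats_a, feats_b):
    """Percentage of shared features between two domains' feature sets,
    relative to their union (one plausible normalizer)."""
    return 100.0 * len(feats_a & feats_b) / len(feats_a | feats_b)

def common_across(domain_feature_sets, min_domains):
    # Features appearing in at least `min_domains` of the given feature sets
    counts = Counter(f for s in domain_feature_sets for f in s)
    return {f for f, c in counts.items() if c >= min_domains}
```

Running `overlap_pct` over all pairs of the 27 domains and averaging would reproduce the paper's 12.48% figure; `common_across` with the per-domain top-1000 LLR features reproduces the counts of features shared by 15, 20, or all 27 domains.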
}, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-207", "text": "It maintains a negative polarity in almost all domains except Gourmet Food." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-208", "text": "----------------------------------" }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-209", "text": "**CONCLUSION**" }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-210", "text": "In this paper, we empirically re-examine the assumption that adapting one or multiple domains with plenty of labeled sentiment polarity data to one domain with little labeled data requires new and sophisticated algorithms." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-211", "text": "We evaluate four domain adaptation techniques on a wide variety of domains in two major groups of state-of-the-art datasets." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-212", "text": "Our experiments show that overall, simple domain adaptation techniques like the all-in-one classifier do comparably well, if not better than, more sophisticated domain adaptation techniques." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-213", "text": "Combined with the fact that labeled sentiment data tends to be cheap to come by through either the collection of product reviews from the web or inexpensive crowd-sourced labeling, this indicates that in practice, domain-adaptation for sentiment detection might be of less importance than previously claimed." }, { "sent_id": "79e96060492c3978dc5a7a0d5f293f-C001-214", "text": "We also show that the often anecdotally observed \"polarity-flip\" of sentiment terms from one domain to another in practice is a rather rare occurrence and might not be as detrimental to sentiment domain adaptation as assumed in much of the literature." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "79e96060492c3978dc5a7a0d5f293f-C001-28" ], [ "79e96060492c3978dc5a7a0d5f293f-C001-33" ], [ "79e96060492c3978dc5a7a0d5f293f-C001-38" ], [ "79e96060492c3978dc5a7a0d5f293f-C001-40" ], [ "79e96060492c3978dc5a7a0d5f293f-C001-115" ] ], "cite_sentences": [ "79e96060492c3978dc5a7a0d5f293f-C001-28", "79e96060492c3978dc5a7a0d5f293f-C001-33", "79e96060492c3978dc5a7a0d5f293f-C001-38", "79e96060492c3978dc5a7a0d5f293f-C001-40", "79e96060492c3978dc5a7a0d5f293f-C001-115" ] }, "@USE@": { "gold_contexts": [ [ "79e96060492c3978dc5a7a0d5f293f-C001-78" ], [ "79e96060492c3978dc5a7a0d5f293f-C001-127" ], [ "79e96060492c3978dc5a7a0d5f293f-C001-160" ], [ "79e96060492c3978dc5a7a0d5f293f-C001-162" ], [ "79e96060492c3978dc5a7a0d5f293f-C001-182" ] ], "cite_sentences": [ "79e96060492c3978dc5a7a0d5f293f-C001-78", "79e96060492c3978dc5a7a0d5f293f-C001-127", "79e96060492c3978dc5a7a0d5f293f-C001-160", "79e96060492c3978dc5a7a0d5f293f-C001-162", "79e96060492c3978dc5a7a0d5f293f-C001-182" ] }, "@SIM@": { "gold_contexts": [ [ "79e96060492c3978dc5a7a0d5f293f-C001-121" ] ], "cite_sentences": [ "79e96060492c3978dc5a7a0d5f293f-C001-121" ] }, "@DIF@": { "gold_contexts": [ [ "79e96060492c3978dc5a7a0d5f293f-C001-180" ] ], "cite_sentences": [ "79e96060492c3978dc5a7a0d5f293f-C001-180" ] } } }, "ABC_9aa9fa6b94aa24939b50effa0e575b_11": { "x": [ { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-12", "text": "----------------------------------" }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-13", "text": "**INTRODUCTION**" }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-90", "text": "----------------------------------" }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-2", "text": "This paper investigates the problem of image-caption retrieval using joint visualsemantic embeddings." 
}, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-3", "text": "We introduce a very simple change to the loss function used in the original formulation by Kiros et al. (2014) , which leads to drastic improvements in the retrieval performance." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-4", "text": "In particular, the original paper uses the rank loss which computes the sum of violations across the negative training examples." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-5", "text": "Instead, we penalize the model according to the hardest negative examples." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-6", "text": "We then make several additional modifications according to the current best practices in image-caption retrieval." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-7", "text": "We showcase our model on the MS-COCO and Flickr30K datasets through comparisons and ablation studies." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-8", "text": "On MS-COCO, we improve caption retrieval by 21% in R@1 with respect to the original formulation." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-9", "text": "Our results outperform the state-of-the-art results by 8.8% in caption retrieval and 11.3% in image retrieval at R@1." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-10", "text": "On Flickr30K, we more than double R@1 as reported by Kiros et al. (2014) in both image and caption retrieval, and achieve near state-of-the-art performance." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-11", "text": "We further show that similar improvements also apply to the Order-embeddings by Vendrov et al. (2015) which builds on a similar loss function." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-14", "text": "The problem of image-caption retrieval has received significant attention over the past few years, quickly gearing towards more semantic search engines." 
}, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-15", "text": "Several different approaches have been proposed, most sharing the core idea of embedding images and language in a common space." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-16", "text": "This allows us to easily search for semantically meaningful neighbors in either modality." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-17", "text": "Learning such embeddings using powerful neural networks has led to significant advancements in image-caption retrieval and generation Kiros et al. (2014) ; Karpathy & Fei-Fei (2015) , video-to-text alignment Zhu et al. (2015) , and question-answering Malinowski et al. (2015) ." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-18", "text": "This paper investigates the visual-semantic embeddings (VSE) of Kiros et al. (2014) for imagecaption retrieval." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-19", "text": "We propose a set of simple modifications to the original formulation that prove to be extremely effective." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-20", "text": "In particular, we change the rank loss used in the original formulation to penalize the model according to the hardest negative training exemplars instead of averaging the individual violations across the negatives." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-21", "text": "This is a sensible modification because it is the hardest negative that affects nearest neighbor recall." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-22", "text": "We refer to this model as VSE++." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-23", "text": "We achieve further improvements by fine-tuning a more powerful network, exploiting more data, and employing a multi-crop trick from Klein et al. (2015) ." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-24", "text": "Our results on MS-COCO show a dramatic increase in caption retrieval performance over VSE." 
}, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-25", "text": "The new loss function alone outperforms the original model by 8.6%." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-26", "text": "With all introduced changes, VSE++ achieves an absolute improvement of 21% in R@1, which corresponds to a 49% relative improvement." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-27", "text": "We outperform the best reported result on MS-COCO by almost 9%." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-28", "text": "To ensure reproducibility, our code is publicly available." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-29", "text": "----------------------------------" }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-30", "text": "**RELATED WORK**" }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-31", "text": "The task of image-caption retrieval is considered as a benchmark for image and language understanding (Hodosh et al. (2013))." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-32", "text": "The most common approach to the retrieval task is to use joint embedding spaces." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-33", "text": "In such an approach, we learn two mappings, one for images and another for captions, that embed the two modalities in a joint space." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-34", "text": "Given a similarity measure in this space, the task of image retrieval can be formulated as a nearest neighbor search problem." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-35", "text": "Works such as Kiros et al. (2014), Karpathy & Fei-Fei (2015), Zhu et al. (2015), Socher et al. (2014) use a rank loss to learn the joint visual-semantic embedding." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-36", "text": "Klein et al. 
(2015) and Eisenschtat & Wolf (2016) use Canonical Correlation Analysis (CCA) to compute a linear projection for two views to a common space where the correlation of the transformed views is maximized." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-37", "text": "Older work (Lin et al. (2014a) ) performed matching between words and objects based on classification scores." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-38", "text": "Recent methods that learn visual-semantic embeddings propose new model architectures for computing the embedding vectors or computing the similarity score between the embedding vectors." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-39", "text": "Wang et al. (2017) propose an embedding network to fully replace the similarity measure used for the rank loss." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-40", "text": "An attention mechanism on both image and caption is used by Nam et al. (2016) , where the authors sequentially and selectively focus on a subset of words and image regions to compute the similarity." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-41", "text": "In Huang et al. (2016) , the authors use a multi-modal context-modulated attention mechanism to compute the matching score between an image and a caption." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-42", "text": "Our work builds upon the work by Kiros et al. (2014) , in which the authors use a rank loss to optimize the embedding." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-43", "text": "Works such as Karpathy & Fei-Fei (2015) , Socher et al. (2014) , Vendrov et al. (2015) , Wang et al. (2017) , Huang et al. (2016) , Nam et al. (2016) use a similar loss to optimize more sophisticated models." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-44", "text": "Our modifications are orthogonal to most of their approaches and can potentially lead to improvements of these models as well." 
}, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-45", "text": "----------------------------------" }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-46", "text": "**IMPROVING VISUAL-SEMANTIC EMBEDDINGS**" }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-47", "text": "Our work builds on Visual-Semantic Embeddings Kiros et al. (2014) ." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-48", "text": "In what follows, we first define the task of image-caption retrieval and summarize the original model and its loss function." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-49", "text": "Then we introduce our new loss." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-50", "text": "----------------------------------" }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-51", "text": "**IMAGE-CAPTION RETRIEVAL**" }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-52", "text": "We focus on the image-caption retrieval task where one is given either a caption as a query and the goal is to retrieve the most relevant image(s) from a database, or similarly the query is an image and we need to retrieve relevant captions." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-53", "text": "For this task, we typically aim to maximize recall (the most relevant item is ranked among the top K items), which is the conventional performance measure." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-54", "text": "Let S = {(i n , c n )} N n=1 denote a training set of images and their captions." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-55", "text": "We refer to (i n , c n ) as positive pairs and (i n , c m =n ) as negative pairs." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-56", "text": "We define a scoring function s(i, c) \u2208 R (where i denotes an image and c a caption) which aims to score the positive pairs higher than the negative pairs." 
}, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-57", "text": "In caption retrieval, we consider images as queries and rank a database of captions with respect to each query according to the scoring function." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-58", "text": "Recall at K (R@K) is the percentage of queries for which their corresponding positive caption is ranked in the top K highest scored captions." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-59", "text": "----------------------------------" }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-60", "text": "**VISUAL-SEMANTIC EMBEDDING**" }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-61", "text": "We define the similarity measure s(i, c) in the joint embedding space following Kiros et al. (2014) ." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-62", "text": "Let \u03c6(i; \u03b8 \u03c6 ) \u2208 R D \u03c6 be the representation of the image (e.g. the representation before logits in VGG19 Simonyan & Zisserman (2014) or ResNet152 He et al. (2016) )." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-63", "text": "Similarly, let \u03c8(c; \u03b8 \u03c8 ) \u2208 R D \u03c8 be the embedding of a caption c in a caption embedding space (e.g. a GRU-based text encoder)." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-64", "text": "Here \u03b8 \u03c6 , \u03b8 \u03c8 denote the model parameters of the image and caption representations." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-65", "text": "The mappings of each modality into the joint embedding space are defined as follows:" }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-66", "text": "Here W f \u2208 R D \u03c6 \u00d7D , W g \u2208 R D \u03c8 \u00d7D represent the two mappings." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-67", "text": "We further normalize \u03c6(i), f (i; W f , \u03b8 \u03c6 ), and g(c; W g , \u03b8 \u03c8 ), so that they lie on a unit hypersphere." 
}, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-68", "text": "Finally, the similarity measure is defined as s(i, c) = f (i; W f , \u03b8 \u03c6 ) \u00b7 g(c; W g , \u03b8 \u03c8 )." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-69", "text": "Let \u03b8 = {W f , W g , \u03b8 \u03c8 } be the model parameters." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-70", "text": "We will include \u03b8 \u03c6 in \u03b8 when the aim is also to fine-tune the image encoder." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-71", "text": "We define e(\u03b8, S) = 1 N N n=1 (i n , c n ) to be the empirical loss of the model, parametrized by \u03b8, over the training samples S = {(i n , c n )} N n=1 , where (i n , c n ) is a loss of a single example." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-72", "text": "The rank loss used in Kiros et al. (2014) , Socher et al. (2014) , and Karpathy & Fei-Fei (2015) is defined as follows:" }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-73", "text": "where \u03b1 represents the margin,\u0109 denotes a negative caption for the query image i, and\u00ee a negative image for the query caption c. Here, we used a shorthand notation [x] + \u2261 max(x, 0)." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-74", "text": "This loss is composed of two symmetric terms, by considering i and c as individual queries." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-75", "text": "In each term, the loss is defined by the sum of the violations for each negative sample." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-76", "text": "----------------------------------" }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-77", "text": "**LOSS FUNCTION: MAXIMUM VIOLATION**" }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-78", "text": "We now introduce our new loss function, and discuss the relation with respect to the original rank loss in the following subsection." 
}, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-79", "text": "Given an embedding pair (i, c), we denote the hardest negative samples by i and c , where i = arg max j =i s(j, c) (i.e. i is the most similar negative image sample to c), and c = arg max d =c s(i, d) (i.e. c is the most similar negative caption sample to i)." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-80", "text": "We define our loss function on a single pair (i, c) as" }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-81", "text": "Similar to Eq. 3, we have two symmetric terms, by considering i and c as individual queries." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-82", "text": "In the first term, taking i as the query and c its correct caption, the loss is defined by finding c as the maximum violating caption." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-83", "text": "The loss is zero if c is more similar to i than c by \u03b1 or more." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-84", "text": "The second term is a similar loss where each c is considered as the query." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-85", "text": "Our loss in Eq. 4 is a triplet rank loss." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-86", "text": "We could define a pairwise loss by incurring a loss for all positive pairs that have a distance more than a margin \u03b1 1 and another loss for all negative pairs that have a distance less than a margin \u03b1 2 , where \u03b1 1 < \u03b1 2 ." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-87", "text": "This is a stronger loss that restraints the model by pairwise distances." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-88", "text": "However, we only care about the rank of the points with respect to a query rather than their exact distance." 
}, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-89", "text": "Therefore, compared to a pairwise loss, this triplet loss gives us more flexibility in the embedding and can be easier to optimize." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-118", "text": "A similar difficulty is observed when large mini-batches are used for SGD (Goyal et al. (2017) )." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-91", "text": "**MAXIMUM VIOLATION INSTEAD OF SUM OF VIOLATIONS**" }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-92", "text": "This loss function differs from Eq. (3) in an important way." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-93", "text": "In Eq. (4), we use a max operation instead of a sum over the negative examples." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-94", "text": "The main difference between the two formulations is thus the number of negative triplets that affect the loss at each step of stochastic gradient descent (SGD)." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-95", "text": "In Eq." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-96", "text": "(3), the loss sums the violations across all negatives, while our loss function only considers the penalty incurred by the hardest negative." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-97", "text": "One way to compare Eq. 3 and Eq. 4 is to consider how they prioritize two typical training examples." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-98", "text": "Fig. 1 shows two illustrations for typical examples." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-99", "text": "The positive pair is at the same location in both examples, while the negative samples are at different distances." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-100", "text": "This illustration shows that while only the hardest negative c is important to R@1, the sum loss focuses on moving all the violating negative samples at the same time." 
}, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-101", "text": "Note that since we use SGD to learn our parameters, we take the max over each mini-batch (this is similar to Kiros et al. (2014) , while empty circles are negative samples for the query i. The dashed circles on the two sides are drawn at the same radii." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-102", "text": "The example on the left has a higher loss with the sum loss compared to the example on the right." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-103", "text": "The max loss assigns a higher loss to the example on the right." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-104", "text": "Notice that the hardest negative example c is closer to i in the right example." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-105", "text": "give us the hardest negative example in the entire training set but it is effective as long as we get enough violating triplets with non-zero loss." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-106", "text": "Using sum, there are potentially 2M (M \u2212 1) non-negative terms in the loss, where M is the size of the mini-batch." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-107", "text": "With max, there will be at most 2M non-negative terms." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-108", "text": "Note that the set of terms in the sum formulation is always a superset of the set in max." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-109", "text": "Let us interpret the rank loss in Eq." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-110", "text": "(3) and compare it with our loss function." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-111", "text": "Consider a positive pair (i, c) and the set\u0109 m of all negative samples where i is given as the query. 
Suppose that the values \u03b1 + s" }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-112", "text": "where p is the probability of a negative sample not incurring any penalty." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-113", "text": "Sampling M \u2212 1 negative samples from this distribution is expected to give us (M \u2212 1)(1 \u2212 p) non-negative penalty terms." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-114", "text": "A generalization of the central limit theorem to the truncated normal distributions (Johnson et al. (1970) ) tells us that if there are enough samples, the distribution of the sum of the random variables will be normal." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-115", "text": "Thus, the sum loss is actually minimizing the mean of the non-negative terms." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-116", "text": "In doing so, it is aggregating the subtle gradient signal from many samples." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-117", "text": "Thus the gradient updates are no longer noisy and SGD may not be capable of jumping out of local minima." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-119", "text": "The max loss reduces the contributing terms and considers only the hardest negatives." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-120", "text": "While this potentially makes our model more prone to outliers in the training data, we experimentally show that this loss is much more effective on the image-captioning datasets." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-121", "text": "----------------------------------" }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-122", "text": "**EXPERIMENTS**" }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-123", "text": "We perform experiments with our VSE++ and compare it to the original formulation of Kiros et al. (2014) (referred to as VSE), as well as state-of-the-art approaches." 
}, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-124", "text": "We re-implemented VSE with the help of the authors' open-source code 2 ." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-125", "text": "For comparison, we present both results and refer to our re-implementation (which improves over the original VSE implementation) as VSE0." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-126", "text": "We experiment with two image encoders: VGG19 by Simonyan & Zisserman (2014) and ResNet152 by He et al. (2016) ." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-127", "text": "Previous works have extracted the image features mostly by pre-computing the FC7 features (the penultimate fully connected layer) of VGG19." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-128", "text": "We thus explicitly indicate methods that use ResNet152." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-129", "text": "The dimensionality of the image embedding, D \u03c6 , is 4096 for VGG19 and 2048 for ResNet152." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-130", "text": "The image features are extracted by first resizing the image to 256 \u00d7 256, then using either a single center crop of size 224 \u00d7 224 or the mean of feature vectors for 10 crops of similar size as used by Klein et al. (2015) and Vendrov et al. (2015) ." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-131", "text": "We refer to training with one center crop as 1C and training with 10 crops as 10C." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-132", "text": "We also try training using random crops which we denote by RC." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-133", "text": "For RC, we have the full VGG19 model and extract features over a single randomly chosen cropped patch on the fly as opposed to pre-computing the image features once and reusing them." 
}, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-134", "text": "For the caption encoder, we use a GRU similar to the one used in Kiros et al. (2014) ." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-135", "text": "We set the dimensionality of the GRU, D \u03c8 , and the joint embedding space, D, to 1024." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-136", "text": "The dimensionality of the word embeddings that are input to the GRU is set to 300." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-137", "text": "We further note that in Kiros et al. (2014) , the caption embedding is normalized, while the image embedding is not." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-138", "text": "The normalization of both vectors ensures that the similarity measure is cosine similarity." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-139", "text": "In VSE++ we normalize both vectors." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-140", "text": "Not normalizing the image embedding changes the importance of samples and so can be helpful or harmful." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-141", "text": "In our experiments, not normalizing the image embedding helped the original VSE formulation find a better solution." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-142", "text": "However, VSE++ is not significantly affected by this normalization." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-143", "text": "----------------------------------" }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-144", "text": "**DATASETS**" }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-145", "text": "We evaluate our method on the Microsoft COCO dataset (Lin et al. (2014b) ) and the Flickr30K dataset (Young et al. (2014) )." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-146", "text": "Flickr30K has a standard 30, 000 images for training." 
}, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-147", "text": "Following Karpathy & Fei-Fei (2015) , we use 1000 images for validation and 1000 images for testing." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-148", "text": "We also use the splits of Karpathy & Fei-Fei (2015) for MS-COCO." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-149", "text": "In this split, the training set contains 82, 783 images, 5000 validation and 5000 test images." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-150", "text": "However, there are also 30, 504 images that were originally in the validation set of MS-COCO but have been left out in this split." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-151", "text": "We refer to this set as rV." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-152", "text": "----------------------------------" }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-153", "text": "**DETAILS OF TRAINING**" }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-154", "text": "We use the Adam optimizer Kingma & Ba (2014) to train the models." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-155", "text": "We train most of the models by running 15 epochs with learning rate 0.0002 and then 15 epochs with 0.00002." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-156", "text": "The fine-tuned models are trained by taking a model that is trained for 30 epochs with a fixed image encoder and then training it for 15 epochs with a learning rate of 0.00002." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-157", "text": "We set the margin to 0.2 for most of the experiments." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-158", "text": "We use a mini-batch size of 128 in all our experiments." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-159", "text": "We do early stopping based on a sum of the recalls on the validation set." 
}, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-160", "text": "We did not see a difference in the performance by changing the mini-batch size if we run for the full 30 epochs." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-161", "text": "Notice that since the size of the training set for different models is different, the actual number of iterations in each epoch can vary." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-162", "text": "----------------------------------" }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-163", "text": "**RESULTS ON MS-COCO**" }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-164", "text": "The results on the MS-COCO dataset are presented in Table 1 ." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-165", "text": "To ensure transparency with respect to all additional modifications, we report an ablation study for the baseline VSE including these modifications in Table 2 ." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-166", "text": "Our best result is achieved by using ResNet152 and fine-tuning the image encoder (row 1.11), where we see 21.2% improvement in R@1 for caption retrieval and 21% improvement in R@1 for image retrieval compared to the original VSE results (row 1.1)." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-167", "text": "Notice that using ResNet152 and fine-tuning can only lead to 12.6% improvement using the original formulation (row 2.6), while our introduced loss function brings a significant gain of 8.6%." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-168", "text": "Comparing VSE++ (ResNet152, fine-tuned) to the current state-of-the-art on MS-COCO, 2WayNet (row 1.5), we see approximately 8.8% improvement in R@1 for caption retrieval and compared to sm-LSTM (row 1.4), 11.3% improvement in image retrieval." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-169", "text": "We also report results on the full 5K test set of MS-COCO in row 1.13 and row 1.14." 
}, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-170", "text": "Effect of the training set." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-171", "text": "We compare VSE and VSE++ by incrementally improving the training data." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-172", "text": "Comparing the models trained on 1C (row 1.1 and row 1.6), we only see 2.7% improvement in R@1 for image retrieval but no improvement in caption retrieval performance." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-173", "text": "However, when we train using RC (row 1.7 and row 2.2) or RC+rV (row 1.8 and row 2.3), we see that VSE++ gains an improvement of 5.9% and 5.1%, respectively, in R@1 for caption retrieval compared to VSE0." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-174", "text": "This shows that VSE++ can better exploit the additional data." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-175", "text": "Effect of a better image encoding." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-176", "text": "We also investigate the effect of a better image encoder on the models." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-177", "text": "Row 1.9 and row 2.4 show the effect of fine-tuning the VGG19 image encoder." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-178", "text": "We see that the gap between VSE0 and VSE++ increases to 6.1%." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-179", "text": "If we use ResNet152 instead of VGG19 (row 1.10 and row 2.5), the gap is 5.6%." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-180", "text": "As for our best result, if we use ResNet152 and also fine-tune the image encoder (row 1.11 and row 2.6) the gap becomes 8.6%." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-181", "text": "The increase in the performance gap shows that the improved loss of VSE++ can better guide the optimization when a more powerful image encoder is used." 
}, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-182", "text": "----------------------------------" }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-183", "text": "**IMPROVING ORDER EMBEDDINGS**" }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-184", "text": "The formulation of Order-embeddings Vendrov et al. (2015) , is very similar to VSE with a difference in the similarity measure." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-185", "text": "Order-embeddings use the asymmetric similarity measure s(i, c) = \u2212 max(0, g(c; W g , \u03b8 \u03c8 ) \u2212 f (i; W f , \u03b8 \u03c6 )) 2 ." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-186", "text": "Note that, similar to the reported results of order-embeddings, we report the results on the training set 10C+rV. We use the hyper-parameters reported by Order0 to reproduce their results and we get slightly better results (row 3.1 and row 3.3)." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-187", "text": "We use a learning rate of 0.001 for 15 epochs and 0.0001 for another 15 epochs." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-188", "text": "The margin is set to 0.05." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-189", "text": "Additionally, Vendrov et al. (2015) takes the absolute value of embeddings before computing the similarity measure." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-190", "text": "We do not do this in Order++, and stick with our original embedding." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-191", "text": "We use the same learning schedule and margin as our other experiments." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-192", "text": "In Table 3 , we report the results when replacing the sum loss used in Order-embeddings to our max loss." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-193", "text": "We can again see that our loss leads to increased performance." 
}, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-194", "text": "We observe 4.5% improvement from Order0 to Order++ in R@1 for caption retrieval (row 3.3 and row 3.5)." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-195", "text": "Compared to the improvement from VSE0 to VSE++, where the improvement on the 10C+rV training set is 1.8%, we gain an even higher improvement here." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-196", "text": "This shows that our modification can help different methods that use a similar loss function to Eq. (3)." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-197", "text": "----------------------------------" }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-198", "text": "**INVESTIGATE THE BEHAVIOR OF THE LOSS FUNCTIONS**" }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-199", "text": "We investigate the behavior incurred by our loss during training in Figure 2 ." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-200", "text": "In particular, we show results during training the VSE and VSE++ with ResNet152 and Order-embeddings using 10C+rV." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-201", "text": "We notice that our max loss can take a couple of epochs to warm-up." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-202", "text": "Observe that initially, sum starts off faster, but after approximately 2 epochs max surpasses sum." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-203", "text": "In order to explain this, notice that the max loss only depends on triplets, rather than a larger set in the case of sum." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-204", "text": "Since we randomly initialize the parameters, there can be more than one hard negative." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-205", "text": "However, the gradient of the max loss, will only be influenced by one of them." 
}, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-206", "text": "As such, it can take longer to train a better model than when using the sum loss." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-207", "text": "Figure 2 : Comparison between the recall performance on the validation set of MS-COCO for the max loss and the sum loss." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-208", "text": "The left plot compares VSE and VSE++ with ResNet152 as the image encoder with/without fine-tuning (Table 1 , row 1.10 and row 1.11 compared to Table 2 , row 2.5 and row 2.6)." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-209", "text": "The right plot compares the two formulations for Order-embeddings (Table 3 , row 3.3 and row 3.5)." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-210", "text": "Notice that for Order-embeddings, in the first 2 epochs the original loss function achieves a better performance, however, from there-on our loss function leads to much higher recall rates." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-211", "text": "----------------------------------" }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-212", "text": "**RESULTS ON FLICKR30K**" }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-213", "text": "We report the results on the Flickr30K dataset in Tables 4.1 and 4.13." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-214", "text": "Here, we obtain 23.1% improvement in R@1 for caption retrieval and 17.6% improvement in R@1 for image retrieval (row 4.1 and row 4.13)." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-215", "text": "Since the size of the Flickr30K training data is small, we observed that VSE++ overfits when the pre-computed features of a single center crop is used." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-216", "text": "We do early stopping to stop before over-fitting happens." 
}, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-217", "text": "We run for a fixed number of epochs and save checkpoints at the end of each epoch." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-218", "text": "The checkpoint with the maximum sum of the recalls on the validation set is taken as the best model for comparison to other models." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-219", "text": "The over-fitting is resolved when the model is trained using RC." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-220", "text": "Our results show that the improvements incurred by our simple modification to the loss function persists across datasets, as well as across models." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-221", "text": "These are drastic improvements, and we hope to see similar behavior on other tasks in the future." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-222", "text": "Figure 3 : Examples of test images and the top 1 retrieved captions for VSE0 and VSE++ (ResNet)finetune." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-223", "text": "The value in brackets is the rank of the highest ranked ground-truth caption." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-224", "text": "----------------------------------" }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-225", "text": "**CONCLUSION**" }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-226", "text": "In this paper, we focused on the task of image-caption retrieval and investigated visual-semantic embeddings." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-227", "text": "We have shown that a new loss function that uses only violation incurred by hard negatives drastically improves performance over the typical loss that sums the violations across the negatives, typically used in previous work (Kiros et al. (2014) ; Vendrov et al. (2015) )." 
}, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-228", "text": "We performed experiments on the MS-COCO and Flickr30K datasets." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-229", "text": "We observed that the improved loss can better guide a more powerful image encoder, ResNet152, and also guide better when fine-tuning an image encoder." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-230", "text": "With all modifications, our VSE++ model achieves state-of-the-art performance on the MS-COCO dataset, and is slightly below the best recent model on the Flickr30K dataset." }, { "sent_id": "9aa9fa6b94aa24939b50effa0e575b-C001-231", "text": "Our proposed improvement can be used to train more sophisticated models that have been using a similar rank loss for training." } ], "y": { "@EXT@": { "gold_contexts": [ [ "9aa9fa6b94aa24939b50effa0e575b-C001-3" ], [ "9aa9fa6b94aa24939b50effa0e575b-C001-42" ], [ "9aa9fa6b94aa24939b50effa0e575b-C001-47" ] ], "cite_sentences": [ "9aa9fa6b94aa24939b50effa0e575b-C001-3", "9aa9fa6b94aa24939b50effa0e575b-C001-42", "9aa9fa6b94aa24939b50effa0e575b-C001-47" ] }, "@DIF@": { "gold_contexts": [ [ "9aa9fa6b94aa24939b50effa0e575b-C001-10" ], [ "9aa9fa6b94aa24939b50effa0e575b-C001-137", "9aa9fa6b94aa24939b50effa0e575b-C001-139" ], [ "9aa9fa6b94aa24939b50effa0e575b-C001-227" ] ], "cite_sentences": [ "9aa9fa6b94aa24939b50effa0e575b-C001-10", "9aa9fa6b94aa24939b50effa0e575b-C001-137", "9aa9fa6b94aa24939b50effa0e575b-C001-227" ] }, "@BACK@": { "gold_contexts": [ [ "9aa9fa6b94aa24939b50effa0e575b-C001-17" ], [ "9aa9fa6b94aa24939b50effa0e575b-C001-35" ], [ "9aa9fa6b94aa24939b50effa0e575b-C001-72" ], [ "9aa9fa6b94aa24939b50effa0e575b-C001-137" ] ], "cite_sentences": [ "9aa9fa6b94aa24939b50effa0e575b-C001-17", "9aa9fa6b94aa24939b50effa0e575b-C001-35", "9aa9fa6b94aa24939b50effa0e575b-C001-72", "9aa9fa6b94aa24939b50effa0e575b-C001-137" ] }, "@USE@": { "gold_contexts": [ [ "9aa9fa6b94aa24939b50effa0e575b-C001-18" ], [ 
"9aa9fa6b94aa24939b50effa0e575b-C001-61" ], [ "9aa9fa6b94aa24939b50effa0e575b-C001-72", "9aa9fa6b94aa24939b50effa0e575b-C001-73" ], [ "9aa9fa6b94aa24939b50effa0e575b-C001-123" ] ], "cite_sentences": [ "9aa9fa6b94aa24939b50effa0e575b-C001-18", "9aa9fa6b94aa24939b50effa0e575b-C001-61", "9aa9fa6b94aa24939b50effa0e575b-C001-72", "9aa9fa6b94aa24939b50effa0e575b-C001-123" ] }, "@SIM@": { "gold_contexts": [ [ "9aa9fa6b94aa24939b50effa0e575b-C001-101" ], [ "9aa9fa6b94aa24939b50effa0e575b-C001-134" ] ], "cite_sentences": [ "9aa9fa6b94aa24939b50effa0e575b-C001-101", "9aa9fa6b94aa24939b50effa0e575b-C001-134" ] } } }, "ABC_4a90cd18be0df0c41a94febe2f68ef_11": { "x": [ { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-2", "text": "Rapid progress has been made in the field of reading comprehension and question answering, where several systems have achieved human parity in some simplified settings." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-3", "text": "However, the performance of these models degrades significantly when they are applied to more realistic scenarios, where answers are involved with various types, multiple text strings are correct answers, or discrete reasoning abilities are required." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-4", "text": "In this paper, we introduce the Multi-Type Multi-Span Network (MTMSN), a neural reading comprehension model that combines a multi-type answer predictor designed to support various answer types (e.g., span, count, negation, and arithmetic expression) with a multi-span extraction method for dynamically producing one or multiple text spans." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-5", "text": "In addition, an arithmetic expression reranking mechanism is proposed to rank expression candidates for further confirming the prediction." 
}, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-6", "text": "Experiments show that our model achieves 79.9 F1 on the DROP hidden test set, creating new state-of-the-art results." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-7", "text": "Source code 1 is released to facilitate future work." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-8", "text": "----------------------------------" }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-10", "text": "This paper considers the reading comprehension task in which some discrete-reasoning abilities are needed to correctly answer questions." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-11", "text": "Specifically, we focus on a new reading comprehension dataset called DROP (Dua et al., 2019) , which requires Discrete Reasoning Over the content of Paragraphs to obtain the final answer." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-12", "text": "Unlike previous benchmarks such as CNN/DM (Hermann et al., 2015) and SQuAD (Rajpurkar et al., 2016) that have been well solved Devlin et al., 2019) , DROP is substantially more challenging in three ways." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-13", "text": "First, the answers to 1 https://github.com/huminghao16/MTMSN the questions involve a wide range of types such as numbers, dates, or text strings." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-14", "text": "Therefore, various kinds of prediction strategies are required to successfully find the answers." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-140", "text": "----------------------------------" }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-34", "text": "To make a fair comparison, we also construct a baseline that uses the same BERT-based encoder." 
}, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-15", "text": "Second, rather than restricting the answer to be a span of text, DROP loosens the constraint so that answers may be a set of multiple text strings." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-16", "text": "Third, for questions that require discrete reasoning, a system must have a more comprehensive understanding of the context and be able to perform numerical operations such as addition, counting, or sorting." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-17", "text": "Existing approaches, when applied to this more realistic scenario, have three problems." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-18", "text": "First, to produce various answer types, Dua et al. (2019) extend previous one-type answer prediction (Seo et al., 2017) to multi-type prediction that supports span extraction, counting, and addition/subtraction." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-19", "text": "However, they have not fully considered all potential types." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-20", "text": "Take the question \"What percent are not non-families?\" and the passage snippet \"39.9% were non-families\" as an example, a negation operation is required to infer the answer." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-21", "text": "Second, previous reading comprehension models (Wang et al., 2017; Yu et al., 2018; Hu et al., 2018) are designed to produce one single span as the answer. But for some questions such as \"Which ancestral groups are smaller than 11%?\", there may exist several spans as correct answers (e.g., \"Italian\", \"English\", and \"Polish\"), which can not be well handled by these works." 
}, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-22", "text": "Third, to support numerical reasoning, prior work (Dua et al., 2019) learns to predict signed numbers for obtaining an arithmetic expression that can be executed by a symbolic system." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-23", "text": "Nevertheless, the prediction of each signed number is isolated, and the expression's context information has not been considered." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-24", "text": "As a result, obviously-wrong expressions, such as all predicted signs are either minus or zero, are likely produced." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-25", "text": "To address the above issues, we introduce the Multi-Type Multi-Span Network (MTMSN), a neural reading comprehension model for predicting various types of answers as well as dynamically extracting one or multiple spans." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-26", "text": "MTMSN utilizes a series of pre-trained Transformer blocks (Devlin et al., 2019) to obtain a deep bidirectional context representation." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-27", "text": "On top of it, a multi-type answer predictor is proposed to not only support previous prediction strategies such as span, count number, and arithmetic expression, but also add a new type of logical negation." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-28", "text": "This results in a wider range of coverage of answer types, which turns out to be crucial to performance." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-29", "text": "Besides, rather than always producing one single span, we present a multi-span extraction method to produce multiple answers." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-30", "text": "The model first predicts the number of answers, and then extracts non-overlapped spans to the specific amount." 
}, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-31", "text": "In this way, the model can learn to dynamically extract one or multiple spans, thus being beneficial for multi-answer cases." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-32", "text": "In addition, we propose an arithmetic expression reranking mechanism to rank expression candidates that are decoded by beam search, so that their context information can be considered during reranking to further confirm the prediction." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-33", "text": "Our MTMSN model outperforms all existing approaches on the DROP hidden test set by achieving 79.9 F1 score, a 32.9% absolute gain over prior best work at the time of submission." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-35", "text": "Again, MTMSN surpasses it by obtaining a 13.2 F1 increase on the development set." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-36", "text": "We also provide an in-depth ablation study to show the effectiveness of our proposed methods, analyze performance breakdown by different answer types, and give some qualitative examples as well as error analysis." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-37", "text": "----------------------------------" }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-38", "text": "**TASK DESCRIPTION**" }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-39", "text": "In the reading comprehension task that requires discrete reasoning, a passage and a question are given." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-40", "text": "The goal is to predict an answer to the question by reading and understanding the passage." 
}, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-41", "text": "Unlike previous dataset such as SQuAD (Rajpurkar et al., 2016) where the answer is limited to be a single span of text, DROP loosens the constraint so that the answer involves various types such as number, date, or span of text (Figure 1) ." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-42", "text": "Moreover, the answer can be multiple text strings instead of single continuous span (A 2 )." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-43", "text": "To suc- Figure 1: Question-answer pairs along with a passage from the DROP dataset." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-44", "text": "cessfully find the answer, some discrete reasoning abilities, such as sorting (A 1 ), subtraction (A 3 ), and negation (A 4 ), are required." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-45", "text": "Figure 2 gives an overview of our model that aims to combine neural reading comprehension with numerical reasoning." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-46", "text": "Our model uses BERT (Devlin et al., 2019) as encoder: we map word embeddings into contextualized representations using pre-trained Transformer blocks (Vaswani et al., 2017 ) ( \u00a73.1)." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-47", "text": "Based on the representations, we employ a multi-type answer predictor that is able to produce four answer types: (1) span from the text; (2) arithmetic expression; (3) count number; (4) negation on numbers ( \u00a73.2)." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-48", "text": "Following Dua et al. (2019) , we first predict the answer type of a given passage-question pair, and then adopt individual prediction strategies." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-49", "text": "To support multispan extraction ( \u00a73.3), the model explicitly predicts the number of answer spans." 
}, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-50", "text": "It then outputs non-overlapped spans until the specific amount is reached." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-51", "text": "Moreover, we do not directly use the arithmetic expression that possesses the maximum probability, but instead re-rank several expression candidates that are decoded by beam search to further confirm the prediction ( \u00a73.4)." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-52", "text": "Finally, the model is trained under weakly-supervised signals to maximize the marginal likelihood over all possible annotations ( \u00a73.5)." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-53", "text": "----------------------------------" }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-54", "text": "**OUR APPROACH**" }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-55", "text": "----------------------------------" }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-56", "text": "**BERT-BASED ENCODER**" }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-141", "text": "**EXPERIMENTS**" }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-57", "text": "To obtain a universal representation for both the question and the passage, we utilize BERT (Devlin et al., 2019) , a pre-trained deep bidirectional Transformer model that achieves state-of-the-art performance across various tasks, as the encoder." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-58", "text": "Specifically, we first tokenize the question and Figure 2 : An illustration of MTMSN architecture." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-59", "text": "The multi-type answer predictor supports four kinds of answer types including span, addition/subtraction, count, and negation." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-60", "text": "A multi-span extraction method is proposed to dynamically produce one or several spans." 
}, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-61", "text": "The arithmetic expression reranking mechanism aims to rank expression candidates that are decoded by beam search for further validating the prediction." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-62", "text": "the passage using the WordPiece vocabulary (Wu et al., 2016) , and then generate the input sequence by concatenating a [CLS] token, the tokenized question, a [SEP] token, the tokenized passage, and a final [SEP] token." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-63", "text": "For each token in the sequence, its input representation is the elementwise addition of WordPiece embeddings, positional embeddings, and segment embeddings (Devlin et al., 2019) ." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-64", "text": "As a result, a list of input embeddings H 0 \u2208 R T \u00d7D can be obtained, where D is the hidden size and T is the sequence length." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-65", "text": "A series of L pre-trained Transformer blocks are then used to project the input embeddings into contextualized representations H i as:" }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-66", "text": "Here, we omit a detailed introduction of the block architecture and refer readers to Vaswani et al. (2017) for more details." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-67", "text": "----------------------------------" }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-68", "text": "**MULTI-TYPE ANSWER PREDICTOR**" }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-69", "text": "Rather than restricting the answer to always be a span of text, the discrete-reasoning reading comprehension task involves different answer types (e.g., number, date, span of text)." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-70", "text": "Following Dua et al. 
(2019), we design a multi-type answer predictor to selectively produce different kinds of answers such as span, count number, and arithmetic expression." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-71", "text": "To further increase answer coverage, we propose adding a new answer type to support logical negation." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-72", "text": "Moreover, unlike prior work that separately predicts passage spans and question spans, our approach directly extracts spans from the input sequence." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-73", "text": "Answer type prediction Inspired by the Augmented QANet model (Dua et al., 2019), we use the contextualized token representations from the last four blocks (H_{L-3}, ..., H_L) as the inputs to our answer predictor, which are denoted as M_0, M_1, M_2, M_3, respectively." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-74", "text": "To predict the answer type, we first split the representation M_2 into a question representation Q_2 and a passage representation P_2 according to the index of the intermediate [SEP] token." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-75", "text": "Then the model computes two vectors h^Q_2 and h^P_2 that summarize the question and passage information respectively:" }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-76", "text": "where h^P_2 is computed in a similar way over P_2." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-77", "text": "Next, we calculate a probability distribution to represent the choices of different answer types as:" }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-78", "text": "Here, h_CLS is the first vector in the final contextualized representation M_3, and FFN denotes a feed-forward network consisting of two linear projections with a GeLU activation (Hendrycks and Gimpel, 2016) followed by a layer normalization (Lei Ba et al., 2016) in between."
}, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-79", "text": "Span To extract the answer either from the passage or from the question, we combine the gating mechanism of Wang et al. (2017) with the standard decoding strategy of Seo et al. (2017) to predict the starting and ending positions across the entire sequence." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-80", "text": "Specifically, we first compute three vectors, namely g Q 0 , g Q 1 , g Q 2 , that summarize the question information among different levels of question representations:" }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-81", "text": "where g Q 0 and g Q 1 are computed over Q 0 and Q 1 respectively, in a similar way as described above." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-82", "text": "Then we compute the probabilities of the starting and ending indices of the answer span from the input sequence as:" }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-83", "text": "where \u2297 denotes the outer product between the vector g and each token representation in M." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-84", "text": "Arithmetic expression In order to model the process of performing addition or subtraction among multiple numbers mentioned in the passage, we assign a three-way categorical variable (plus, minus, or zero) for each number to indicate its sign, similar to Dua et al. (2019) ." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-85", "text": "As a result, an arithmetic expression that has a number as the final answer can be obtained and easily evaluated." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-86", "text": "Specifically, for each number mentioned in the passage, we gather its corresponding representation from the concatenation of M 2 and M 3 , eventually yielding U = (u 1 , ..., u N ) \u2208 R N \u00d72 * D where N numbers exist." 
}, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-87", "text": "Then the probabilities of the i-th number being assigned a plus, minus or zero is computed as:" }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-88", "text": "Count We consider the ability of counting entities and model it as a multi-class classification problem." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-89", "text": "To achieve this, the model first produces a vector h U that summarizes the important information among all mentioned numbers, and then computes a counting probability distribution as:" }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-90", "text": "Negation One obvious but important linguistic phenomenon that prior work fails to capture is negation." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-91", "text": "We find there are many cases in DROP that require to perform logical negation on numbers." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-92", "text": "The question (Q 4 ) in Figure 1 gives a qualitative example of this phenomenon." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-93", "text": "To model this phenomenon, we assign a two-way categorical variable for each number to indicate whether a negation operation should be performed." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-94", "text": "Then we compute the probabilities of logical negation on the i-th number as:" }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-95", "text": "----------------------------------" }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-96", "text": "**MULTI-SPAN EXTRACTION**" }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-97", "text": "Although existing reading comprehension tasks focus exclusively on finding one span of text as the final answer, DROP loosens the restriction so that the answer to the question may be several text spans." 
}, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-98", "text": "Therefore, specific adaption should be made to extend previous single-span extraction to multi-span scenario." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-99", "text": "To do this, we propose directly predicting the number of spans and model it as a classification problem." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-100", "text": "This is achieved by computing a probability distribution on span amount as" }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-101", "text": "To extract non-overlapped spans to the specific amount, we adopt the non-maximum suppression (NMS) algorithm (Rosenfeld and Thurston, 1971) that is widely used in computer vision for pruning redundant bounding boxes, as shown in Algorithm 1." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-102", "text": "Concretely, the model first proposes a set of top-K spans S according to the descending order of the span score, which is computed as p start k p end l for the span (k, l)." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-103", "text": "It also predicts the amount of extracted spans t from p span , and initializes a new setS. Next, we add the span s i that possesses the maximum span score to the setS, and remove it from S. We also delete any remaining span s j that overlaps with s i , where the degree of overlap is measured using the text-level F1 function." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-104", "text": "This process is repeated for remaining spans in S, until S is empty or the size ofS reaches t." 
}, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-105", "text": "----------------------------------" }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-106", "text": "**ARITHMETIC EXPRESSION RERANKING**" }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-107", "text": "As discussed in \u00a73.2, we model the phenomenon of discrete reasoning on numbers by learning to predict a plus, minus, or zero for each number in the passage." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-108", "text": "In this way, an arithmetic expression composed of signed numbers can be obtained, where the final answer can be deduced by performing simple arithmetic computation." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-109", "text": "However, since the sign of each number is only determined by the number representation and some coarsegrained global representations, the context information of the expression itself has not been considered." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-110", "text": "As a result, the model may predict some for si in S do 7:" }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-111", "text": "Add span si toS 8:" }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-112", "text": "Remove span si from S 9:" }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-113", "text": "for sj in S do 10:" }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-114", "text": "if f1(si, sj) > 0 then 11:" }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-115", "text": "Remove span sj from S 12: returnS obviously wrong expressions (e.g., the signs that have maximum probabilities are either minus or zero, resulting in a large negative value)." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-116", "text": "Therefore, in order to further validate the prediction, it is necessary to rank several highly confident expression candidates using the representation summarized from the expression's context." 
}, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-117", "text": "Specifically, we use beam search to produce top-ranked arithmetic expressions, which are sent back to the network for reranking." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-118", "text": "Since each expression consists of several signed numbers, we construct an expression representation by taking both the numbers and the signs into account." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-119", "text": "For each number in the expression, we gather its corresponding vector from the representation U. As for the signs, we initialize an embedding matrix E \u2208 R 3\u00d72 * D , and find the sign embeddings for each signed number." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-120", "text": "In this way, given the i-th expression that contains M signed numbers at most, we can obtain number vectors V i \u2208 R M \u00d72 * D as well as sign embeddings C i \u2208 R M \u00d72 * D ." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-121", "text": "Then the expression representation along with the reranking probability can be calculated as:" }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-122", "text": "----------------------------------" }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-123", "text": "**TRAINING AND INFERENCE**" }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-124", "text": "Since DROP does not indicate the answer type but only provides the answer string, we therefore adopt the weakly supervised annotation scheme, as suggested in Berant et al. (2013); Dua et al. (2019) ." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-125", "text": "We find all possible annotations that point to the gold answer, including matching spans, arithmetic expressions, correct count numbers, negation operations, and the number of spans." 
}, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-126", "text": "We use simple rules to search over all mentioned numbers to find potential negations." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-127", "text": "That is, if 100 minus a number is equal to the answer, then a negation occurs on this number." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-128", "text": "Besides, we only search the addition/subtraction of three numbers at most due to the exponential search space." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-129", "text": "To train our model, we propose using a twostep training method composed of an inference step and a training step." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-130", "text": "In the first step, we use the model to predict the probabilities of sign assignments for numbers." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-131", "text": "If there exists any annotation of arithmetic expressions, we run beam search to produce expression candidates and label them as either correct or wrong, which are later used for supervising the reranking component." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-132", "text": "In the second step, we adopt the marginal likelihood objective function (Clark and Gardner, 2018) , which sums over the probabilities of all possible annotations including the above labeled expressions." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-133", "text": "Notice that there are two objective functions for the multi-span component: one is a distantly-supervised loss that maximizes the probabilities of all matching spans, and the other is a classification loss that maximizes the probability on span amount." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-134", "text": "At test time, the model first chooses the answer type and then performs specific prediction strategies." 
}, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-135", "text": "For the span type, we use Algorithm 1 for decoding." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-136", "text": "If the type is addition/subtraction, arithmetic expression candidates will be proposed and further reranked." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-137", "text": "The expression with the maximum product of cumulative sign probability and reranking probability is chosen." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-138", "text": "As for the counting type, we choose the number that has the maximum counting probability." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-139", "text": "Finally, if the type is negation, we find the number that possesses the largest negation probability, and then output the answer as 100 minus this number." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-142", "text": "----------------------------------" }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-143", "text": "**IMPLEMENTATION DETAILS**" }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-144", "text": "Dataset We consider the reading comprehension benchmark that requires Discrete Reasoning Over Paragraphs (DROP) (Dua et al., 2019) prehensive understanding of the context as well as the ability of numerical reasoning are required." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-145", "text": "Model settings We build our model upon two publicly available uncased versions of BERT: BERT BASE and BERT LARGE 2 , and refer readers to Devlin et al. (2019) for details on model sizes." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-146", "text": "We use Adam optimizer with a learning rate of 3e-5 and warmup over the first 5% steps to train." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-147", "text": "The maximum number of epochs is set to 10 for base models and 5 for large models, while the batch size is 12 or 24 respectively." 
}, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-148", "text": "A dropout probability of 0.1 is used unless stated otherwise." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-149", "text": "The number of counting class is set to 10, and the maximum number of spans is 8." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-150", "text": "The beam size is 3 by default, while the maximum amount of signed numbers M is set to 4." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-151", "text": "All texts are tokenized using Word-2 BERTBASE is the original version while BERTLARGE is the model augmented with n-gram masking and synthetic self-training:" }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-152", "text": "https://github.com/ google-research/bert." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-153", "text": "Piece vocabulary (Wu et al., 2016) , and truncated to sequences no longer than 512 tokens." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-154", "text": "----------------------------------" }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-155", "text": "**MODEL**" }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-156", "text": "Baselines Following the implementation of Augmented QANet (NAQANet) (Dua et al., 2019) , we introduce a similar baseline called Augmented BERT (NABERT)." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-157", "text": "The main difference is that we replace the encoder of QANet (Yu et al., 2018) with the pre-trained Transformer blocks (Devlin et al., 2019) ." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-158", "text": "Moreover, it also supports the prediction of various answer types such as span, arithmetic expression, and count number." 
}, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-159", "text": "----------------------------------" }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-160", "text": "**MAIN RESULTS**" }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-161", "text": "Two metrics, namely Exact Match (EM) and F1 score, are utilized to evaluate models." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-162", "text": "We use the official script to compute these scores." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-163", "text": "Since the test set is hidden, we only submit the best single model to obtain test results." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-164", "text": "----------------------------------" }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-165", "text": "**ABLATION STUDY**" }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-166", "text": "Component ablation To analyze the effect of the proposed components, we conduct ablation studies on the development set." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-167", "text": "As illustrated in Table 2 , the use of addition and subtraction is extremely crucial: the EM/F1 performance of both the base and large models drop drastically by more than 20 points if it is removed." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-168", "text": "Predicting count numbers is also an important component that contributes nearly 5% gain on both metrics." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-169", "text": "Moreover, enhancing the model with the negation type significantly increases the F1 by roughly 9 percent on both models." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-170", "text": "In brief, the above results show that multi-type answer prediction is vitally important for handling different forms of answers, especially in cases where discrete reasoning abilities are required." 
}, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-171", "text": "We also report the performance after removing the multi-span extraction method." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-172", "text": "The results reveal that it has a more negative impact on the F1 score." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-173", "text": "We interpret this phenomenon as follows: producing multiple spans that are partially matched with ground-truth answers is much easier than generating an exactly-matched set of multiple answers." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-174", "text": "Hence for multi-span scenarios, the gain of our method on F1 is relatively easier to obtain than the one on EM." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-175", "text": "Finally, to ablate arithmetic expression reranking, we simply use the arithmetic expression that has the maximum cumulative sign Table 5 : Performance breakdown of NABERT LARGE and MTMSN LARGE by predicted answer types." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-176", "text": "probability instead." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-177", "text": "We find that our reranking mechanism gives 1.8% gain on both metrics for the large model." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-178", "text": "This confirms that validating expression candidates with their context information is beneficial for filtering out highly-confident but wrong predictions." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-179", "text": "Architecture ablation We further conduct a detailed ablation in Table 3 to evaluate our architecture designs." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-180", "text": "First, we investigate the effects of some \"global vectors\" used in our model." 
}, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-181", "text": "Specifically, we find that removing the question and passage vectors from all involved computation leads to 1.3 % drop on F1." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-182", "text": "Ablating the representation of [CLS] token leads to even worse results." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-183", "text": "We also try to use the last hidden representation (denoted as M 3 ) to calculate question and passage vectors, but find that does not work." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-184", "text": "Next, we remove the gating mechanism used during span prediction, and observe a nearly 0.8% decline on both metrics." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-185", "text": "Finally, we share parameters between the arithmetic expression component and the negation component, and find the performance drops by 1.1% on F1." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-186", "text": "----------------------------------" }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-187", "text": "**ANALYSIS AND DISCUSSION**" }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-188", "text": "Performance breakdown We now provide a quantitative analysis by showing performance breakdown on the development set." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-189", "text": "Table 4 shows that our gains mainly come from the most frequent number type, which requires various types of symbolic, discrete reasoning operations." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-190", "text": "Moreover, significant improvements are also obtained in the multi-span category, where the F1 score increases by more than 40 points." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-191", "text": "This result further proves the validity of our multi-span extraction method." 
}, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-192", "text": "We also give the performance statistics that are categorized according to the predicted answer types in Table 5 ." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-193", "text": "As shown in the Table, the main improvements are due to the addition/subtraction and negation types." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-194", "text": "We conjecture that there are two reasons for these improvements." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-195", "text": "First, our proposed expression reranking mechanism helps validate candidate expressions." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-196", "text": "Second, a new inductive bias that enables the model to perform logical negation has been introduced." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-197", "text": "The impressive performance on the negation type confirms our judgement, and suggests that the model is able to find most of negation operations." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-198", "text": "In addition, we also observe promising gains brought by the span and count types." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-199", "text": "We think the gains are mainly due to the multi-span extraction method as well as architecture designs." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-200", "text": "Effect of maximum number of spans To investigate the effect of maximum number of spans on multi-span extraction, we conduct an experiment on the dev set and show the curves in Figure 3 ." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-201", "text": "We vary the value from 2 to 12, increased by 2, and also include the extreme value 1." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-202", "text": "According to the Figure, the best results are obtained at 8." 
}, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-203", "text": "A higher value could potentially increase the answer recall but damage the precision by making more predictions, and a smaller value may force the model to produce limited number of answers, resulting in high precision but low recall." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-204", "text": "Therefore, a value of 8 turns out to be a good trade-off between recall and precision." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-205", "text": "Moreover, when the value decreases to 1, the multi-span extraction degrades to previous single-span scenario, and the performance drops significantly." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-206", "text": "Effect of beam size and M We further investigate the effect of beam size and maximum amount of signed numbers in Figure 4 ." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-207", "text": "As we can see, a beam size of 3 leads to the best performance, likely because a larger beam size might confuse the model as too many candidates are ranked, on the other hand, a small size could be not sufficient to cover the correct expression." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-208", "text": "In addition, we find that the performance constantly decreases as the maximum threshold M increases, suggesting that most of expressions only contain two or three signed numbers, and setting a larger threshold could bring in additional distractions." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-209", "text": "----------------------------------" }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-210", "text": "**ANNOTATION STATISTICS**" }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-211", "text": "We list the annotation statistics on the DROP train set in Table 6 ." 
}, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-212", "text": "As we can see, only annotating matching spans results in a labeled ratio of 56.4%, indicating that DROP includes various answer types beyond text spans." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-213", "text": "By further considering the arithmetic expression, the ratio increase sharply to 91.7%, suggesting more than 35% answers need to be inferred with numeral reasoning." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-214", "text": "Continuing adding counting leads to a percentage of 94.4%, and a final 97.9% coverage is achieved by additionally taking negation into account." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-215", "text": "More importantly, the F1 score constantly increases as more answer types are considered." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-216", "text": "This result is consistent with our observations in ablation study." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-217", "text": "Error analysis Finally, to better understand the remaining challenges, we randomly sample 100 incorrectly predicted examples based on EM and categorize them into 7 classes." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-218", "text": "38% of errors are incorrect arithmetic computations, 18% require sorting over multiple entities, 13% are due to mistakes on multi-span extraction, 10% are singlespan extraction problems, 8% involve miscounting, another 8% are wrong predictions on span number, the rest (5%) are due to various reasons such as incorrect preprocessing, negation error, and so on." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-219", "text": "See Appendix for some examples of the above error cases." 
}, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-220", "text": "----------------------------------" }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-221", "text": "**RELATED WORK**" }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-222", "text": "Reading comprehension benchmarks Promising advancements have been made for reading comprehension due to the creation of many large datasets." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-223", "text": "While early research used cloze-style tests (Hermann et al., 2015; Hill et al., 2016) , most of recent works (Rajpurkar et al., 2016; Joshi et al., 2017) are designed to extract answers from the passage." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-224", "text": "Despite their success, these datasets only require shallow pattern matching and simple logical reasoning, thus being well solved Devlin et al., 2019) ." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-225", "text": "Recently, Dua et al. (2019) released a new benchmark named DROP that demands discrete reasoning as well as deeper paragraph understanding to find the answers." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-226", "text": "Saxton et al. (2019) introduced a dataset consisting of different types of mathematics problems to focuses on mathematical computation." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-227", "text": "We choose to work on DROP to test both the numerical reasoning and linguistic comprehension abilities." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-228", "text": "Neural reading models Previous neural reading models, such as BiDAF (Seo et al., 2017) , R-Net (Wang et al., 2017) , QANet (Yu et al., 2018) , Reinforced Mreader (Hu et al., 2018) , are usually designed to extract a continuous span of text as the answer." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-229", "text": "Dua et al. 
(2019) enhanced prior single-type prediction to support various answer types such as span, count number, and addition/subtraction." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-230", "text": "Different from these approaches, our model additionally supports a new negation type to increase answer coverage, and learns to dynamically extract one or multiple spans." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-231", "text": "Morevoer, answer reranking has been well studied in several prior works (Cui et al., 2016; Wang et al., 2018a,b,c; Hu et al., 2019) ." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-232", "text": "We follow this line of work, but propose ranking arithmetic expressions instead of candidate answers." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-233", "text": "End-to-end symbolic reasoning Combining neural methods with symbolic reasoning was considered by Graves et al. (2014) ; Sukhbaatar et al. (2015) , where neural networks augmented with external memory are trained to execute simple programs." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-234", "text": "Later works on program induction (Reed and De Freitas, 2016; Neelakantan et al., 2016; Liang et al., 2017) extended this idea by using several built-in logic operations along with a keyvalue memory to learn different types of compositional programs such as addition or sorting." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-235", "text": "In contrast to these works, MTMSN does not model various types of reasoning with a universal memory mechanism but instead deals each type with individual predicting strategies." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-236", "text": "Visual question answering In computer vision community, the most similar work to our approach is Neural Module Networks (Andreas et al., 2016b) , where a dependency parser is used to lay out a neural network composed of several pre-defined modules." 
}, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-237", "text": "Later, Andreas et al. (2016a) proposed dynamically choosing an optimal layout structure from a list of layout candidates that are produced by off-the-shelf parsers." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-238", "text": "Hu et al. (2017) introduced an end-to-end module network that learns to predict instance-specific network layouts without the aid of a parser." }, { "sent_id": "4a90cd18be0df0c41a94febe2f68ef-C001-239", "text": "Compared to these approaches, MTMSN has a static network layout that can not be changed during training and evaluation, where pre-defined \"modules\" are used to handle different types of answers." } ], "y": { "@USE@": { "gold_contexts": [ [ "4a90cd18be0df0c41a94febe2f68ef-C001-11" ], [ "4a90cd18be0df0c41a94febe2f68ef-C001-48" ], [ "4a90cd18be0df0c41a94febe2f68ef-C001-70" ], [ "4a90cd18be0df0c41a94febe2f68ef-C001-73" ], [ "4a90cd18be0df0c41a94febe2f68ef-C001-124" ], [ "4a90cd18be0df0c41a94febe2f68ef-C001-144" ], [ "4a90cd18be0df0c41a94febe2f68ef-C001-225", "4a90cd18be0df0c41a94febe2f68ef-C001-227" ] ], "cite_sentences": [ "4a90cd18be0df0c41a94febe2f68ef-C001-11", "4a90cd18be0df0c41a94febe2f68ef-C001-48", "4a90cd18be0df0c41a94febe2f68ef-C001-70", "4a90cd18be0df0c41a94febe2f68ef-C001-73", "4a90cd18be0df0c41a94febe2f68ef-C001-124", "4a90cd18be0df0c41a94febe2f68ef-C001-144", "4a90cd18be0df0c41a94febe2f68ef-C001-225" ] }, "@BACK@": { "gold_contexts": [ [ "4a90cd18be0df0c41a94febe2f68ef-C001-18", "4a90cd18be0df0c41a94febe2f68ef-C001-22" ], [ "4a90cd18be0df0c41a94febe2f68ef-C001-225" ] ], "cite_sentences": [ "4a90cd18be0df0c41a94febe2f68ef-C001-18", "4a90cd18be0df0c41a94febe2f68ef-C001-22", "4a90cd18be0df0c41a94febe2f68ef-C001-225" ] }, "@MOT@": { "gold_contexts": [ [ "4a90cd18be0df0c41a94febe2f68ef-C001-17", "4a90cd18be0df0c41a94febe2f68ef-C001-18", "4a90cd18be0df0c41a94febe2f68ef-C001-19", "4a90cd18be0df0c41a94febe2f68ef-C001-20", 
"4a90cd18be0df0c41a94febe2f68ef-C001-21", "4a90cd18be0df0c41a94febe2f68ef-C001-22", "4a90cd18be0df0c41a94febe2f68ef-C001-23", "4a90cd18be0df0c41a94febe2f68ef-C001-24", "4a90cd18be0df0c41a94febe2f68ef-C001-25" ] ], "cite_sentences": [ "4a90cd18be0df0c41a94febe2f68ef-C001-18", "4a90cd18be0df0c41a94febe2f68ef-C001-22" ] }, "@DIF@": { "gold_contexts": [ [ "4a90cd18be0df0c41a94febe2f68ef-C001-17", "4a90cd18be0df0c41a94febe2f68ef-C001-18", "4a90cd18be0df0c41a94febe2f68ef-C001-19", "4a90cd18be0df0c41a94febe2f68ef-C001-20", "4a90cd18be0df0c41a94febe2f68ef-C001-21", "4a90cd18be0df0c41a94febe2f68ef-C001-22", "4a90cd18be0df0c41a94febe2f68ef-C001-23", "4a90cd18be0df0c41a94febe2f68ef-C001-24", "4a90cd18be0df0c41a94febe2f68ef-C001-25" ] ], "cite_sentences": [ "4a90cd18be0df0c41a94febe2f68ef-C001-18", "4a90cd18be0df0c41a94febe2f68ef-C001-22" ] }, "@SIM@": { "gold_contexts": [ [ "4a90cd18be0df0c41a94febe2f68ef-C001-84" ], [ "4a90cd18be0df0c41a94febe2f68ef-C001-156" ] ], "cite_sentences": [ "4a90cd18be0df0c41a94febe2f68ef-C001-84", "4a90cd18be0df0c41a94febe2f68ef-C001-156" ] } } }, "ABC_fa3d20d5975ec59454abfca68f8935_11": { "x": [ { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-6", "text": "A framework that encodes the source text first with a transformer, then a sequence-to-sequence (seq2seq) model." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-7", "text": "We find that the transformer and seq2seq model complement themselves adequately, making for a richer encoded vector representation." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-8", "text": "We also find that paying more attention to the vocabulary of target words during abstraction improves performance." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-23", "text": "We also present a simple approach to reduce this bias." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-46", "text": "We describe our summarization model in two modules -Extraction and Abstraction." 
}, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-2", "text": "We propose a system that improves performance on single document summarization task using the CNN/DailyMail and Newsroom datasets." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-3", "text": "It follows the popular encoderdecoder paradigm, but with an extra focus on the encoder." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-4", "text": "The intuition is that the probability of correctly decoding an information significantly lies in the pattern and correctness of the encoder." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-5", "text": "Hence we introduce, encodeencode -decode." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-9", "text": "We experiment our hypothesis and framework on the task of extractive and abstractive single document summarization and evaluate using the standard CNN/DailyMail dataset and the recently released Newsroom dataset." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-10", "text": "----------------------------------" }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-11", "text": "**INTRODUCTION**" }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-12", "text": "Document summarization has been an active area of research, especially on the CNN/DailyMail dataset." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-13", "text": "Even with recent progress (Gehrmann et al., 2018; Chen and Bansal, 2018) , there is still some work to be done in the field." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-14", "text": "Although extractive summarization seem to be less challenging because new words are not generated, identifying salient parts of the document without any guide in the form of a query, is a substantial problem to tackle." 
}, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-15", "text": "Earlier approaches for extractive summarization use manual-feature engineering implemented with graphs Erkan and Radev, 2004) , integer linear programming (ILP) (Boudin et al., 2015; Nayeem and Chali, 2017) ." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-16", "text": "More recent approaches are data-driven and implement a variety of neural networks (Jadhav and Rajan, 2018; Narayan et al., 2017) majorly with an encoder-decoder framework (Narayan et al., 2018; Cheng and Lapata, 2016) ." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-17", "text": "Similar to the work of Nallapati et al. (2017) , we consider the extractive summarization task as a sequence classification problem." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-18", "text": "A major challenge with this approach, is the fact that the training data is not sequentially labelled." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-19", "text": "Hence creating one from the abstractive ground-truth summary, is crucial." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-20", "text": "We improve on Nallapati et al. (2017) 's approach to generate this labelled data, and evaluation shows that our extractive labels are more accurate." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-21", "text": "Another hurdle in this task, is the imbalance in the created data, that is, most of the document's sentences are labelled 0 (excluded from the summary) than 1, because just a few sentences actually make up a summary." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-22", "text": "Hence the neural extractor tends to be biased and suffer from a lot of false-negative labels." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-24", "text": "Most importantly, our neural extractor uses the recent bidirectional transformer encoder (Vaswani et al., 2017) with details provided in Section 3.1." 
}, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-25", "text": "More interesting than extractive summaries, abstractive summaries correlate better with summaries that a human would present." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-26", "text": "Abstractive summarization does not simply reproduce salient parts of the document verbatim, but rewrites them in a concise form, usually introducing novel words along the way by utilizing some key abstraction techniques such as paraphrasing (Gupta et al., 2018) , compression (Filippova et al., 2015) or sentence fusion (Barzilay and McKeown, 2005) ." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-27", "text": "However, it is met with major challenges like grammatical correctness and repetition of words especially when generating long-worded sentences." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-28", "text": "Nonetheless remarkable progress have been achieved with the use of seq2seq models (Gehrmann et al., 2018; See et al., 2017; Chopra et al., 2016; Rush et al., 2015) and a reward instead of loss function via deep-reinforcement learning (Chen and Bansal, 2018; Paulus et al., 2017; Ranzato et al., 2015) ." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-29", "text": "We see abstractive summarization in same light as several other authors (Chen and Bansal, 2018; Hsu et al., 2018; Liu et al., 2018 ) -extract salient sentences and then abstract; thus sharing similar advantages as the popular divide-and-conquer algorithm." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-30", "text": "More-so, it mitigates the problem of information redundancy, since the mini-source, ie extracted document, contains distinct salient sentences." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-31", "text": "Our abstractive model is a blend of the transformer and seq2seq model." 
}, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-32", "text": "We notice improvements using this framework in the abstractive setting." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-33", "text": "This is because, to generate coherent and grammatically correct sentences, we need to be able to learn long-term dependency relations." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-34", "text": "The transformer complements the seq2seq model in this regard with its multi-head self attention." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-35", "text": "Also the individual attention heads in the transformer model mimics behavior related to the syntactic and semantic structure of the sentence (Vaswani et al., 2017 (Vaswani et al., , 2018 ." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-36", "text": "Hence, the transformer produces a richer meaningful vector representation of the input, from which we can encode a fixed state vector for decoding." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-37", "text": "The main contributions of this work are:" }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-38", "text": "\u2022 We present a simple algorithm for building a sentence-labelled corpus for extractive summarization training that produces more accurate results." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-39", "text": "\u2022 We propose a novel framework for the task of extractive single document summarization that improves the current state-of-the-art on two specific datasets." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-40", "text": "\u2022 We introduce the encode -encode -decode paradigm using two complementary models, transformer and seq2seq for generating abstractive summaries that improves current top performance on two specific datasets." 
}, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-41", "text": "----------------------------------" }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-42", "text": "**TASK DEFINITION**" }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-43", "text": "Given a document D = (S 1 , ..., S n ) with n sentences comprising of a set of words D W = {d 1 , ..., d w }, the task is to produce an extractive (S E ) or abstractive (S A ) summary that contains salient information in D, where S E \u2286 D W and S A = {w 1 , ..., w s } | \u2203w i \u2208 D W ." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-44", "text": "----------------------------------" }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-45", "text": "**METHOD**" }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-47", "text": "The abstraction module simply learns to paraphrase and compress the output of the extracted document sentences." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-48", "text": "----------------------------------" }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-49", "text": "**EXTRACTION**" }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-50", "text": "As illustrated in Figure 1 , our model classifies each sentence in a document as being summaryworthy or not." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-51", "text": "However, in order to enhance this sequence classification process, we encode the input document with a TRANSFORMER." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-52", "text": "A logistic classifier then learns to label each sentence in the transformed document." 
}, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-53", "text": "----------------------------------" }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-54", "text": "**TRANSFORMER ENCODER**" }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-55", "text": "The input to the Transformer is the document representation, which is a concatenation of the vector representation of its sentences." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-56", "text": "Each sentence representation is obtained by averaging the vector representation of its constituent words." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-57", "text": "The transformer encoder is composed of 6 stacked identical layers." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-58", "text": "Each layer contains 2 sub-layers with multi-head self attention and position-wise fully connected feed-forward network respectively." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-59", "text": "Full details with implementation are provided in (Vaswani et al., 2017 (Vaswani et al., , 2018 ." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-60", "text": "The bidirectional Transformer often referred to as the Transformer encoder learns a rich representation of the document that captures long-range syntactic and semantic dependency between the sentences." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-61", "text": "----------------------------------" }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-62", "text": "**SENTENCE EXTRACTION**" }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-63", "text": "The final layer of our extraction model is a softmax layer which performs the classification." 
}, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-64", "text": "We learn the probability of including a sentence in the summary," }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-65", "text": "where W and b are trainable parameters and S ' i is the transformed representation of the i th sentence in document D j , by minimizing the cross-entropy loss" }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-66", "text": "between the predicted probabilities, y p and true sentence-labels, y t during training." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-67", "text": "----------------------------------" }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-68", "text": "**EXTRACTIVE TRAINING**" }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-69", "text": "Filtering Currently, no extractive summarization dataset exists." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-70", "text": "Hence it is customary to create one from the abstractive ground-truth summaries (Chen and Bansal, 2018; Nallapati et al., 2017) ." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-71", "text": "We observe however, that some summaries are more abstractive than others." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-72", "text": "Since the extractive labels are usually gotten by doing some n-gram overlap matching, the greater the abstractiveness of the ground-truth the more inaccurate the tuned extractive labels are." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-73", "text": "We filter out such samples 1 as illustrated in Table 1 ." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-74", "text": "In our work, we consider a reference summary R j as overly abstractive if it has zero bigram overlap with the corresponding document D j , excluding stop words." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-96", "text": "Chen and Bansal (2018) introduced a stop criterion in their reinforcement learning process." 
}, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-97", "text": "We implemented a simple heuristic based on the dataset." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-98", "text": "Since the gold summaries are typically 3 or 4 sentences long, we extract the top 3 sentences by default, but additionally extract a 4th sentence if the confidence score from the softmax function is greater than 0.55." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-99", "text": "----------------------------------" }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-100", "text": "**ABSTRACTION**" }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-101", "text": "The input to our abstraction module is a subset of the document's sentences, comprising the output of the extraction phase from Section 3.1.2." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-102", "text": "For each document D_j, initially comprising n sentences, we abstract its extracted sentences," }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-103", "text": "where m < n and S^E_j \u2286 D_j, by learning to jointly paraphrase (Gupta et al., 2018) and compress (Filippova et al., 2015)." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-104", "text": "We add one more encoding layer to the standard encoder-aligner-decoder (Luong et al., 2015), i.e., encode-encode-align-decode." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-105", "text": "The intuition is to improve the performance of the decoder by providing an interpretable and richly encoded sequence." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-106", "text": "For this, we interleave two efficient models: the transformer (Vaswani et al., 2017) and sequence-to-sequence (Sutskever et al., 2014), specifically the GRU-RNN (Chung et al., 2014)." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-107", "text": "Details are presented in subsequent subsections." 
}, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-108", "text": "----------------------------------" }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-109", "text": "**ENCODER -TRANSFORMER**" }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-110", "text": "The transformer encoder has the same implementation as in Vaswani et al. (2017), explained in Section 3.1.1, except that the inputs are sentence-level vector representations rather than document-level ones." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-111", "text": "Also, the sentence representations in this module are not averaged constituent word representations, as in the extraction module, but concatenated ones." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-112", "text": "That is, for each i-th sentence in equation 7, its vector representation is the concatenation of its constituent word embeddings" }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-113", "text": "The output of equation 8 serves as the input vector representation to the transformer encoder." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-114", "text": "We use the transformer encoder during abstraction as a kind of pre-training module for the input sentence." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-115", "text": "----------------------------------" }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-116", "text": "**ENCODER -GRU-RNN**" }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-117", "text": "We use a single-layer uni-directional GRU-RNN whose input is the output of the transformer." 
}, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-118", "text": "The GRU-RNN encoder (Chung et al., 2014) produces a fixed-state vector representation of the transformed input sequence using the following equations:" }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-119", "text": "where r and z are the reset and update gates respectively, W and U are the network's parameters, x_t is the hidden state vector at timestep t, s_t is the input vector and \u2299 represents the Hadamard product." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-120", "text": "----------------------------------" }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-121", "text": "**EXTRACTIVE MODEL**" }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-122", "text": "R-1/R-2/R-L: LEAD (See et al., 2017) 40.3/17.7/36.5; LEAD (Narayan et al., 2018) 39.6/17.7/36.2; LEAD (ours) 40.1/17.6/36.0; (Nallapati et al., 2017) 39.6/16.2/35.3; REFRESH (Narayan et al., 2018) 40.0/18.2/36.6; FAST (Chen and Bansal, 2018) 41.4/18.7/37.7; NEUSUM (Zhou et al., 2018) 41.6/19.0/37.0; Content Selector (Gehrmann et al., 2018)" }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-123", "text": "----------------------------------" }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-124", "text": "**DECODER -GRU-RNN**" }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-125", "text": "The fixed-state vector representation produced by the GRU-RNN encoder is used as the initial state for the decoder." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-126", "text": "At each time step, the decoder receives the previously generated word y_{t-1} and hidden state s_{t-1} at time step t-1." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-127", "text": "The output word y_t at each time step is a softmax probability of the vector in equation 11 over the set of vocabulary words, V." 
}, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-128", "text": "----------------------------------" }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-129", "text": "**EXPERIMENTS**" }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-130", "text": "We used pre-trained 300-dimensional GloVe word embeddings (Pennington et al., 2014)." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-131", "text": "The transformer encoder was set up with the transformer_base hyperparameter setting from the tensor2tensor library (Vaswani et al., 2018), but the hidden size and dropout were reset to 300 and 0.0 respectively." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-132", "text": "We also use 300 hidden units for the GRU-RNN encoder." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-133", "text": "The tensor2tensor library comes with pre-processed/tokenized versions of the dataset; however, we perform these operations independently." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-134", "text": "For abstraction, our target vocabulary is a set of approximately 50,000 and 80,000 words for the CNN/DM and Newsroom corpora respectively (GloVe: https://nlp.stanford.edu/projects/glove/; tensor2tensor: https://github.com/tensorflow/tensor2tensor)." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-135", "text": "It contains the words in our target training and test sets that occur at least twice." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-136", "text": "Experiments showed that using this subset of vocabulary words, as opposed to the over 320,000 vocabulary words contained in GloVe, improves both the training time and the performance of the model." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-137", "text": "During the abstractive training, we match each summary sentence with its corresponding extracted document sentence using equation 6 and learn to minimize the seq2seq loss implemented in the TensorFlow API with AdamOptimizer (Kingma and Ba, 2014)." 
}, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-138", "text": "We employ early stopping when the validation loss does not decrease after 5 epochs." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-139", "text": "We apply gradient clipping at 5.0 (Pascanu et al., 2013)." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-140", "text": "We use greedy decoding during training and validation and set the maximum number of iterations to 5 times the target sentence length." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-141", "text": "Beam-search decoding is used during inference." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-142", "text": "----------------------------------" }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-143", "text": "**DATASETS**" }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-144", "text": "We evaluate our models on the non-anonymized version of the CNN-DM corpus (Hermann et al., 2015; Nallapati et al., 2016) and the recent Newsroom dataset (Grusky et al., 2018) released by the Connected Experiences Lab." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-148", "text": "Abstractive Model R-1/R-2/R-L: RL+Intra-Att (Paulus et al., 2017) 41.16/15.75/39.08; KIGN+Pred (Li et al., 2018) 38.95/17.12/35.68; FAST (Chen and Bansal, 2018) 40.88/17.80/38.54; Bottom-Up (Gehrmann et al., 2018). The Newsroom corpus (Grusky et al., 2018) contains over 1.3M news articles together with various metadata such as the title, summary, coverage and compression ratio." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-149", "text": "CNN/DM summaries are twice as long as Newsroom summaries, with average word lengths of 66 and 26 respectively." 
}, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-150", "text": "----------------------------------" }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-151", "text": "**EVALUATION**" }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-152", "text": "Following previous work (See et al., 2017; Nallapati et al., 2017; Chen and Bansal, 2018), we evaluate both datasets on standard ROUGE-1, ROUGE-2 and ROUGE-L (Lin, 2004)." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-153", "text": "ROUGE calculates the appropriate n-gram word overlap between the reference and system summaries." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-154", "text": "----------------------------------" }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-155", "text": "**RESULTS ANALYSIS**" }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-156", "text": "We used the official pyrouge script (https://github.com/andersjo/pyrouge/tree/master/tools/ROUGE-1.5.5) with the options '-n 2 -w 1.2 -m -a -c 95'." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-157", "text": "Tables 3 and 5 present extractive and abstractive results on the CNN/DM dataset respectively, while Tables 4 and 6 do so for the Newsroom dataset." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-158", "text": "For clarity, we present results separately for each model and dataset." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-159", "text": "Our baseline non-filtered extractive (TRANS-ext) model is highly competitive with top models." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-160", "text": "Our TRANS-ext + filter produces an average of about +1 and +9 points across the reported ROUGE variants on the CNN/DM and Newsroom datasets respectively, showing that our model does a better job at identifying the most salient parts of the document than existing state-of-the-art extractive models." 
}, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-161", "text": "We observe a large margin in the Newsroom dataset results, as the existing baselines are just LEAD-3 and the TEXTRANK of Barrios et al. (2016)." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-162", "text": "The Newsroom dataset was recently released and is yet to be thoroughly explored; however, it is a larger dataset and contains more diverse summaries, as analyzed by Grusky et al. (2018)." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-163", "text": "We also experimented with the empirical outcome of using imbalanced extractive labels, which usually leads to bias towards the majority class." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-164", "text": "Interestingly, our extractive model achieves a +20% F-score increase when trained with balanced labels." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-165", "text": "Switching the transformer encoder with a seq2seq encoder resulted in a drop of about 2 ROUGE points, showing that the transformer encoder does learn features that add meaning to the vector representation of our input sequence." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-166", "text": "Our baseline non-filtered abstractive (TRANS-ext + abs) model is also highly competitive with top models, with a drop of -0.81 ROUGE-2 points against Gehrmann et al. (2018)'s model, which is the current state of the art." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-167", "text": "Our TRANS-ext + filter + abs produces an average of about +0.5 and +7 points across the reported ROUGE variants on the CNN/DM and Newsroom datasets respectively, showing empirically that our model improves over existing abstractive summarization models." 
}, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-168", "text": "On the abstractiveness of our summaries: after aligning with the ground-truth as explained in Section 3.2, about 60% of our extracted document sentences were paraphrased and compressed." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-169", "text": "O: the two clubs, who occupy the top two spots in spain's top flight, are set to face each other at the nou camp on sunday." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-170", "text": "G: real madrid face barcelona in the nou camp R: real madrid will travel to the nou camp to face barcelona on sunday." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-171", "text": "O: dangelo conner, from new york, filmed himself messing around with the powerful weapon in a friend's apartment, first waving it around, then sending volts coursing through a coke can ." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-172", "text": "G: dangelo conner from new york was fooling around with his gun R: dangelo conner, from new york ,was fooling around with stun gun." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-173", "text": "O: jamie peacock broke his try drought with a double for leeds in their win over salford on sunday." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-174", "text": "G: jamie adam scored to win over salford for leeds R: jamie peacock scored two tries for leeds in their win over salford." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-175", "text": "O: britain's lewis hamilton made the perfect start to his world title defense by winning the opening race of the f1 season in australia sunday to lead a mercedes one-two in melbourne ." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-176", "text": "G: lewis hamilton wins first race of season in australia R: lewis hamilton wins opening race of 2015 f1 season in australia ." 
}, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-177", "text": "We highlight examples of some of the generated paraphrases in Table 7." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-178", "text": "Table 7 shows that our paraphrases are well formed, abstractive (e.g., powerful weapon - gun, messing around - fooling around), capable of performing syntactic manipulations (e.g., for leeds in their win over salford - win over salford for leeds) and compression, as seen in all the examples." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-179", "text": "----------------------------------" }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-180", "text": "**RELATED WORK**" }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-181", "text": "Summarization has remained an interesting and important NLP task for years due to its diverse applications: news headline generation, weather forecasting, email filtering, medical cases, recommendation systems, machine reading comprehension (MRC) and so forth (Khargharia et al., 2018)." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-182", "text": "Early summarization models were mostly extractive and manual-feature engineered (Knight and Marcu, 2000; Jing and McKeown, 2000; Dorr et al., 2003; Berg-Kirkpatrick et al., 2011)." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-183", "text": "With the introduction of neural networks (Sutskever et al., 2014) and the availability of large training data, deep learning became a viable approach (Rush et al., 2015; Chopra et al., 2016)." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-184", "text": "Extraction has been handled at different levels of granularity - word (Cheng and Lapata, 2016), phrase (Bui et al., 2016; Gehrmann et al., 2018) and sentence (Cheng and Lapata, 2016; Nallapati et al., 2016, 2017) - each with its challenges." 
}, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-185", "text": "Word- and phrase-level extraction, although more concise, usually suffers from grammatical incorrectness, while sentence-level extractions are too lengthy and sometimes contain redundant information." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-186", "text": "Hence Berg-Kirkpatrick et al. (2011), Filippova et al. (2015) and Durrett et al. (2016) learn to extract and compress at the sentence level." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-187", "text": "Identifying the likely most salient part of the text as summary-worthy is crucial." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-188", "text": "Some authors have employed integer linear programming (Martins and Smith, 2009; Gillick and Favre, 2009; Boudin et al., 2015), graph concepts (Erkan and Radev, 2004), ranking with reinforcement learning (Narayan et al., 2018) and, most related to our work, binary classification (Shen et al., 2007; Nallapati et al., 2017; Chen and Bansal, 2018). Our binary classification architecture differs significantly from existing models because it uses a transformer as the building block instead of a bidirectional GRU-RNN (Nallapati et al., 2017) or a bidirectional LSTM-RNN (Chen and Bansal, 2018)." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-189", "text": "To the best of our knowledge, our utilization of the transformer encoder model as a building block for binary classification is novel, although the transformer has been successfully used for language understanding (Devlin et al., 2018), machine translation (MT) (Vaswani et al., 2017) and paraphrase generation." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-190", "text": "For the generation of abstractive summaries, before the ubiquitous use of neural nets, manually crafted rules and graph techniques were utilized with considerable success." 
}, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-191", "text": "Barzilay and McKeown (2005) and Cheung and Penn (2014) fused two sentences into one using their dependency parse trees." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-192", "text": "Recently, sequence-to-sequence models (Sutskever et al., 2014) with attention (Chopra et al., 2016), copy mechanism (Gu et al., 2016), pointer-generator (See et al., 2017) and graph-based attention (Tan et al., 2017) have been explored." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-193", "text": "Since the system-generated summaries are usually evaluated on ROUGE, it has been beneficial to directly optimize this metric during training via a suitable policy using reinforcement learning (Paulus et al., 2017; Celikyilmaz et al., 2018)." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-194", "text": "Similar to Rush et al. (2015) and Chen and Bansal (2018), we abstract by simplifying our extracted sentences." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-195", "text": "We jointly learn to paraphrase and compress but, unlike existing models based purely on RNNs, we implement a blend of two proven efficient models: the transformer encoder and the GRU-RNN." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-196", "text": "paraphrased with a transformer-decoder, we find that using the GRU-RNN decoder but with a two-level stack of hybrid encoders (transformer and GRU-RNN) gives better performance." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-197", "text": "To the best of our knowledge, this architectural blend is novel." 
}, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-198", "text": "----------------------------------" }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-199", "text": "**CONCLUSION**" }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-200", "text": "We proposed two frameworks for extractive and abstractive summarization and demonstrated that they each improve results over the existing state of the art." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-201", "text": "Our models are simple to train, and the intuition/hypothesis behind the formulation is straightforward and logical." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-202", "text": "The approach is well grounded, as parts of our model architecture have been used in other NLG-related tasks, such as MT, with state-of-the-art results." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-75", "text": "See et al. (2017) and Paulus et al. (2017) truncate source documents to 400 tokens and target summaries to 100 tokens. (Filtering is used only for the training set, to ensure that evaluation comparisons on the test set with existing models are fair.)" }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-76", "text": "We totally exclude documents with more than 30 sentences and truncate or pad as necessary to 20 sentences per document." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-77", "text": "From the over 280,000 and 1.3M training pairs in the CNN/DM and Newsroom training sets respectively, our filtering yields abstractive summarization sub-datasets of approximately 150,000 and 250,000 pairs." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-78", "text": "We report evaluation scores using the training sets as-is versus our filtered training sets, to show that filtering the training samples does improve results." 
}, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-79", "text": "Document: world-renowned chef, author and emmy winning television personality anthony bourdain visits quebec in the next episode of \" anthony bourdain : parts unknown, \" airing sunday, may 5, at 9 p.m. et. follow the show on twitter and facebook." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-80", "text": "Summary: 11 things to know about quebec." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-81", "text": "o canada! our home and delicious land.' Tuning: We use a very simple approach to create extractive labels for our neural extractor." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-82", "text": "We hypothesize that each reference summary sentence originates from at least one document sentence." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-83", "text": "The goal is to identify the most-likely document sentence." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-84", "text": "Different from Nallapati et al. (2017)'s approach of greedily adding sentences to the summary that maximize the ROUGE score, our approach is more similar to Chen and Bansal (2018)'s model, which calculates the individual reference sentence-level score as per its similarity with each sentence in the corresponding document." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-85", "text": "However, our sentence-level similarity score is based on its bigram overlap:" }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-86", "text": "for each t-th sentence in the reference summary R_j, per i-th sentence in document D_j, in contrast to Chen and Bansal (2018)'s, which uses the ROUGE-L recall score." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-87", "text": "Additionally, whenever both words in an overlapping bigram are stopwords, we decrement the similarity score by 1; for example, (on, the) is an invalid bigram overlap while (the, President) is valid." 
}, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-88", "text": "We do this to capture more important similarities instead of trivial ones." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-89", "text": "For statistical purposes, we evaluate our extractive trainer for tuning the document's sentences to 0's and 1's against Nallapati et al. (2017). Table 2 reports the ROUGE-F1 (%) scores of manually crafted extractive trainers for producing sentence-level extractive labels for CNN/DM." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-90", "text": "We apply our tuned dataset to the neural extractive summarizer explained in Sections 3.1.1 and 3.1.2 and report results in Tables 3 and 4." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-91", "text": "Imbalanced Extractive Labels: Because a summary is a snippet of the document, the majority of the labels are rightly 0 (excluded from the summary)." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-92", "text": "Hence a high classification accuracy does not necessarily translate to a highly salient summary." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-93", "text": "Therefore, we consider the F1 score, which is a weighted average of precision and recall, and apply an early stopping criterion when minimizing the loss if the F1 score does not increase after a set number of training epochs." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-94", "text": "Additionally, during training, we synthetically balance the labels by forcing some random sentences to be labelled as 1 and subsequently masking their weights." }, { "sent_id": "fa3d20d5975ec59454abfca68f8935-C001-95", "text": "Number of sentences to extract: The number of extracted sentences is not trivial, as it significantly affects the summary length and hence the evaluation scores." 
} ], "y": { "@MOT@": { "gold_contexts": [ [ "fa3d20d5975ec59454abfca68f8935-C001-13" ] ], "cite_sentences": [ "fa3d20d5975ec59454abfca68f8935-C001-13" ] }, "@BACK@": { "gold_contexts": [ [ "fa3d20d5975ec59454abfca68f8935-C001-28" ], [ "fa3d20d5975ec59454abfca68f8935-C001-96" ] ], "cite_sentences": [ "fa3d20d5975ec59454abfca68f8935-C001-28", "fa3d20d5975ec59454abfca68f8935-C001-96" ] }, "@USE@": { "gold_contexts": [ [ "fa3d20d5975ec59454abfca68f8935-C001-29" ], [ "fa3d20d5975ec59454abfca68f8935-C001-70" ], [ "fa3d20d5975ec59454abfca68f8935-C001-143", "fa3d20d5975ec59454abfca68f8935-C001-144", "fa3d20d5975ec59454abfca68f8935-C001-148" ], [ "fa3d20d5975ec59454abfca68f8935-C001-152" ] ], "cite_sentences": [ "fa3d20d5975ec59454abfca68f8935-C001-29", "fa3d20d5975ec59454abfca68f8935-C001-70", "fa3d20d5975ec59454abfca68f8935-C001-148", "fa3d20d5975ec59454abfca68f8935-C001-152" ] }, "@SIM@": { "gold_contexts": [ [ "fa3d20d5975ec59454abfca68f8935-C001-84" ], [ "fa3d20d5975ec59454abfca68f8935-C001-194" ] ], "cite_sentences": [ "fa3d20d5975ec59454abfca68f8935-C001-84", "fa3d20d5975ec59454abfca68f8935-C001-194" ] }, "@DIF@": { "gold_contexts": [ [ "fa3d20d5975ec59454abfca68f8935-C001-86" ], [ "fa3d20d5975ec59454abfca68f8935-C001-188" ] ], "cite_sentences": [ "fa3d20d5975ec59454abfca68f8935-C001-86", "fa3d20d5975ec59454abfca68f8935-C001-188" ] }, "@UNSURE@": { "gold_contexts": [ [ "fa3d20d5975ec59454abfca68f8935-C001-148" ] ], "cite_sentences": [ "fa3d20d5975ec59454abfca68f8935-C001-148" ] } } }, "ABC_81bdddc7d6b04c88407537f57c0580_11": { "x": [ { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-2", "text": "Given a question and a set of answer candidates, answer triggering determines whether the candidate set contains any correct answers." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-3", "text": "If yes, it then outputs a correct one." 
}, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-4", "text": "In contrast to existing pipeline methods, which first consider individual candidate answers separately and then make a prediction based on a threshold, we propose an end-to-end deep neural network framework, which is trained by a novel group-level objective function that directly optimizes the answer triggering performance." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-5", "text": "Our objective function penalizes three potential types of error and allows training the framework in an end-to-end manner." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-6", "text": "Experimental results on the WIKIQA benchmark show that our framework outperforms the state of the art by a 6.6% absolute gain under the F1 measure." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-7", "text": "----------------------------------" }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-9", "text": "Question Answering (QA) aims at automatically responding to natural language questions with direct answers (Heilman and Smith, 2010; Severyn and Moschitti, 2013; Yao et al., 2013; Berant and Liang, 2014; Sun et al., 2015; Miller et al., 2016; Sun et al., 2016)." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-10", "text": "Most existing QA systems always output an answer for any question, no matter whether their answer candidate set contains correct answers or not (Feng et al., 2015; Severyn and Moschitti, 2015; Yang et al., 2016; Rao et al., 2016)." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-11", "text": "In practice, however, this can greatly hurt user experience, especially when it is hard for users to judge answer correctness." 
}, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-12", "text": "In this paper, we study the critical yet under-addressed Answer Triggering (Yang et al., 2015) problem: Given a question and a set of answer candidates, determine whether the candidate set contains any correct answer, and if so, select a correct answer as system output." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-13", "text": "The answer triggering problem can be logically divided into two sub-problems: P_1: Build an individual-level model to rank answer candidates so that a correct one (if it exists) gets the highest score." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-14", "text": "P_2: Make a group-level binary prediction on the existence of correct answers within the candidate set." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-15", "text": "Previous work (Yang et al., 2015; Jurczyk et al., 2016) attacks the problem via a pipeline approach: first solve P_1 as a ranking task and then solve P_2 by choosing an optimal threshold upon the previous step's highest ranking score." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-16", "text": "However, the yielded answer triggering performance is far from satisfactory, with F1 between 32% and 36%." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-17", "text": "An alternative pipeline approach is to first solve P_2 and then P_1, i.e., first determine whether there's a correct answer in the candidate set and then rank all candidates to find a correct one." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-18", "text": "However, as we will show using state-of-the-art Multiple Instance Learning (MIL) algorithms in Section 4, P_2 by itself is currently a very challenging task, partly because of the difficulty of extracting features from a set of candidate answers that are effective for answer triggering." 
}, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-19", "text": "Because both the P_1 and P_2 performances are far from perfect, the above pipeline approaches also suffer from error propagation (Finkel et al., 2006; Zeng et al., 2015)." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-20", "text": "We propose Group-level Answer Triggering (GAT), an end-to-end framework for jointly optimizing P_1 and P_2." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-21", "text": "Our key contribution in GAT is a novel group-level objective function, which aggregates individual-level information and penalizes three potential error types in answer triggering as a group-level task." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-22", "text": "By optimizing this objective function, we can directly back-propagate the final answer triggering errors to the entire framework and learn all the parameters simultaneously." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-23", "text": "We conduct evaluation using the same dataset and measure as in previous work (Yang et al., 2015; Jurczyk et al., 2016), and our framework improves the F1 score by 6.6% (from 36.65% to 43.27%), compared with the state of the art." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-24", "text": "----------------------------------" }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-25", "text": "**FRAMEWORK**" }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-26", "text": "Notations: Let i and j respectively be the indices of the question and answer candidate, l_{i,j} be the binary label of the j-th answer candidate for question q_i, and l_i be the group label of the answer candidate set of q_i (1 if it contains any correct answer; 0 otherwise)." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-27", "text": "m_{i,j} denotes an individual-level matching score, measuring how likely question q_i can be correctly addressed by its j-th answer candidate." 
}, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-28", "text": "The GAT framework is illustrated in Figure 1 , which consists of three components: (1) Encoder." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-29", "text": "Two separate encoders process questions and answer candidates respectively, mapping them from token sequences into two different vector spaces." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-30", "text": "(2) QA Matching." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-31", "text": "For each question and answer candidate pair, we concatenate their encoded vectors and pass the result through a feed-forward neural network with a binary softmax output layer." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-32", "text": "The output is an individual-level matching score, i.e., m i,j . (3) Signed Max Pooling." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-33", "text": "Max pooling is applied on all the matching scores in a candidate set." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-34", "text": "During training, when each candidate is positively/negatively labeled on whether it can answer the question, we use the labels to divide the scores into two disjoint subsets and perform max pooling separately:" }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-35", "text": "where m + i is the maximum score among correct answers (if there's any) and m \u2212 i is that among wrong ones." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-36", "text": "At testing time when labels are unavailable, it reduces to normal max pooling and pools a single score m i = max j m i,j ." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-37", "text": "The answer triggering prediction is then made by comparing m i with a predefined threshold (0.5) to decide whether to return the top-scored answer candidate to the user."
}, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-38", "text": "The GAT framework design is generic in that the Encoder component can be instantiated with different network architectures." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-39", "text": "In this paper, we implement it with Bidirectional RNNs (Bi-RNN) (Schuster and Paliwal, 1997) with GRU cells (Cho et al., 2014) , and use the temporal average pooling over the hidden states as the encoding representation." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-40", "text": "We choose Bi-RNN mainly because of its good performance in many QA problems (Wang and Nyberg, 2015; Wang et al., 2016) ." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-41", "text": "----------------------------------" }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-42", "text": "**LEARNING**" }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-43", "text": "The cost functions for negative groups (answer candidate sets without correct answers) and positive groups (those with correct answers) are treated differently." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-44", "text": "For each negative group, the highest QA matching score is penalized by a hinge loss:" }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-45", "text": "where the maximum matching score m \u2212 i is compared with 0.5, a fixed threshold for our framework." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-46", "text": "The variable d \u2212 here, as well as d + and d \u00b1 that will appear shortly after, are all margin hyperparameters." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-47", "text": "O 1 is normalized by N neg , which is the number of negative groups (with l i = 0)." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-48", "text": "We use O 1 to reduce false-positive answer existence predictions by penalizing the top matching score that is not safely below the 0.5 threshold."
}, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-49", "text": "For a positive group, it is more complicated because answer triggering prediction can have the following two error types: (1) the top matching score is below the threshold, or (2) the top ranked answer candidate is a wrong answer." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-50", "text": "We design loss terms O 2 and O 3 to penalize these two types of error, respectively." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-51", "text": "O 2 is a hinge loss that penalizes the case where the highest score among the correct answers in a group is not large enough to signify answer existence." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-52", "text": "O 3 is to penalize the case where the highest score is obtained by an incorrect candidate answer." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-53", "text": "Formally:" }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-54", "text": "Finally, the overall objective function in Equation 1 is a linear combination of the three loss terms and a standard \u2113 2 -regularization." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-55", "text": "\u0398 denotes all the trainable parameters in the framework." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-56", "text": "\u03b1, \u03b2 and \u03bb are hyper-parameters." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-57", "text": "----------------------------------" }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-58", "text": "**A NAIVE OBJECTIVE BASELINE**" }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-59", "text": "For comparison, we provide an alternative objective formulation, which treats positive and negative groups equivalently, and does not explicitly penalize cases where an incorrect candidate answer obtains the highest QA matching score in a positive group."
}, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-60", "text": "Here d + is a margin and \u03b1 * , \u03bb * are weights." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-61", "text": "We hypothesize this formulation will work worse than the objective in Equation 1, and will use experiments to verify it." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-62", "text": "----------------------------------" }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-63", "text": "**EXPERIMENTS 3.1 DATASET**" }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-64", "text": "We use the WIKIQA dataset (Yang et al., 2015) for evaluation." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-65", "text": "It contains 3,047 questions from Bing query logs, each associated with a group of candidate answer sentences from Wikipedia and manually labeled via crowdsourcing." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-66", "text": "Several intuitive features are also included in WIKIQA: two word matching features (IDF-weighted and unweighted word-overlapping counts between questions and candidate answers, denoted as Cnt), the length of a question (QLen), and the length of a candidate answer (SLen)." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-67", "text": "As in previous work, we also test the effect of these features by combining them with other features as input to the Softmax layer in our framework." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-68", "text": "We use the standard 70% (train), 10% (dev), and 20% (test) split of WIKIQA." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-69", "text": "We also use the same data pre-processing steps for fair comparison: Truncate questions and sentences to a maximum length of 40 tokens and initialize the 300-dimensional word vectors using pretrained word2vec embeddings (Mikolov et al., 2013) ."
}, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-70", "text": "----------------------------------" }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-71", "text": "**IMPLEMENTATION DETAILS**" }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-72", "text": "We implement our full framework using TensorFlow (Abadi et al., 2016) and train it using the AdaDelta optimizer (Zeiler, 2012) with learning rate 0.1 and decay factor 0.95." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-73", "text": "Dropout is used during training to prevent overfitting." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-74", "text": "The default threshold in Signed Max Pooling is set at 0.5." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-75", "text": "We select the hyper-parameters using the dev set and set \u03b1=1.2, \u03b2=1." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-77", "text": "The RNN's hidden state size is 200 in both directions." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-78", "text": "The feed-forward network in QA Matching has two layers of 400 hidden units." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-79", "text": "----------------------------------" }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-80", "text": "**EVALUATION METRICS**" }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-81", "text": "We use precision, recall, and F 1 , defined in the same way as in previous work." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-82", "text": "A question is treated as a positive case only if it contains one or more correct answers in its candidate set." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-83", "text": "For the prediction of a question, only the candidate with the highest matching score is considered."
}, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-84", "text": "A true positive prediction shall meet two criteria: (1) the score is above a threshold (0.5 for our framework; tuned on dev set in other work), and (2) the candidate is labeled as a correct answer to the question." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-85", "text": "----------------------------------" }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-86", "text": "**RESULTS A. COMPARISON WITH BASELINES**" }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-87", "text": "We evaluate the effectiveness of the proposed GAT framework by comparing with several baseline models." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-88", "text": "To the best of our knowledge, there has only been limited work so far on answer triggering, and these are the first two baselines below." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-89", "text": "The results are summarized in Table 1 ." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-90", "text": "We can see that GAT combined with Cnt features improves the F 1 score over Yang et al. (2015) and Jurczyk et al. (2016) by around 11.1% and 6.6% (from 32.17 and 36.65 to 43.27), which shows the effectiveness of our framework." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-91", "text": "We denote this configuration as our full framework." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-92", "text": "Through the comparison between Naive and GAT, we can see that our proposed objective function has a great advantage over the Naive one, which does not model the complexity of answer triggering for positive candidate sets." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-93", "text": "Different from Yang et al. 
(2015)'s results, combining with the QLen feature does not further improve the performance in our case, possibly because we choose Bi-RNN as our encoder, which may capture some question characteristics better than a length feature." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-94", "text": "----------------------------------" }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-95", "text": "**B. FRAMEWORK BREAKDOWN**" }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-96", "text": "Now we conduct further analysis in order to better understand the contribution of each component in our full framework." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-97", "text": "Since the code from (Yang et al., 2015) is available, we use it (rather than (Jurczyk et al., 2016) ) to assist our analysis." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-98", "text": "We first test a variant of our full framework by replacing the Encoder and QA Matching component with the CNN based model from (Yang et al., 2015) 2 , denoted as GAT w/ CNN, and train it with our objective." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-99", "text": "From the first two rows in Table 2 , we observe that: (1) Using our current design of Bi-RNN and feed-forward NN improves F 1 from 35.03% to 43.27% in comparison with the CNN based model, partly because their CNN only consists of one convolution layer and one average pooling layer." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-100", "text": "However, we leave more advanced encoder and QA matching design for future work, and anticipate that more complex CNN based models can achieve similar or better results than our current design, as in many other QA-related work (Hu et al., 2014; . (2) Compared with the best result from (Yang et al., 2015) in Table 1 , training the CNN based model end-to-end using our objective improves from 32.17% to 35.03%."
}, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-101", "text": "This directly shows that an end-to-end learning strategy works better than the pipeline approach in (Yang et al., 2015) ." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-102", "text": "Now we detach the Encoder component ENC (footnote 2: where the QA matching score is obtained first through CNN encoding and then a bilinear model)" }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-103", "text": "Table 2 : GAT framework breakdown." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-104", "text": "All variants are trained with our proposed objective function (Equation 1)." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-105", "text": "from our end-to-end full framework." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-106", "text": "To obtain semantic vectors of questions and candidate answers as input to the subsequent QA Matching component, we leverage Yang et al. (2015)'s released code to train the Encoder component (with CNN) through their well-tuned individual-level optimization, and use their learnt semantic vectors." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-107", "text": "Then our framework without ENC, i.e., -ENC, is trained and tested as before." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-108", "text": "We further detach the QA matching component QAM in a similar way: We directly use the matching score between a question and a candidate answer obtained by Yang et al. (2015) , and concatenate it with Cnt features as input to the Softmax layer, which is our framework without ENC or QAM, denoted as -ENC -QAM, and trained by our group-level objective." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-109", "text": "By comparing them with our end-to-end frameworks on both dev and test sets, we can see that it is beneficial to jointly train the entire framework."
}, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-110", "text": "----------------------------------" }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-111", "text": "**ERROR ANALYSIS**" }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-112", "text": "We now demonstrate some typical mistake types made by our framework to inspire future improvements." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-113", "text": "Q: What city was the convention when Gerald Ford was nominated?" }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-114", "text": "A: Held in Kemper arena in Kansas City , Missouri , the convention nominated president Gerald Ford for a full term, but only after narrowly defeating a strong challenge from former California governor Ronald Reagan." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-115", "text": "In this case, A is correct, but our framework made a false negative prediction." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-116", "text": "Although it was already the highest ranked in a set of 4 candidate answers, A only got a score of 0.134, possibly due to its complicated semantic structure (attribute clause) and the extra irrelevant information (defeating Reagan)." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-117", "text": "----------------------------------" }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-118", "text": "**Q: WHAT CAN SQL 2005 DO?**" }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-119", "text": "A1: Microsoft SQL server is a relational database management system developed by Microsoft."
}, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-120", "text": "A2: As a database , it is a software product whose primary function is to store and retrieve data as requested by other software applications, be it those on the same computer or those running on another computer across a [TRUNCATED END]" }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-121", "text": "The incorrect answer A1 is ranked higher than the correct answer A2, both with scores above 0.5." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-122", "text": "This is a false positive case, with incorrect ranking as well." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-123", "text": "Possible reasons are that the detailed functionality of SQL explained in A2 is hard to capture and relate to the question, and that A2 is truncated to 40 tokens in our experiments." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-124", "text": "On the other hand, the \"database management system\" phrase in A1 sounds close to an explanation of functionality, if not carefully distinguished." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-125", "text": "Both cases above show that the semantic relation between a question and its answer is hard to capture." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-126", "text": "For future research, more advanced models can be incorporated in the Encoder and QA Matching components of our framework." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-127", "text": "----------------------------------" }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-128", "text": "**RELATED WORK**" }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-129", "text": "Answer Selection." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-130", "text": "Answer selection (a.k.a., answer sentence selection) is the task of assigning answer candidates with individual-level ranking scores given a question, which is similar to P1 defined in Section 1."
}, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-131", "text": "Existing QA systems based on answer selection just select the top-scored candidate as the answer, without considering the possibility that the true answer doesn't even exist." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-132", "text": "However, many neural network models recently explored in the answer existence literature (Hu et al., 2014; Wang and Nyberg, 2015; Feng et al., 2015) could be utilized for answer selection as well in the future." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-133", "text": "For example, explore the respective advantages of different network architectures such as Long Short-Term Memory Networks (LSTMs) and CNNs." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-134", "text": "They also develop hybrid models for answer selection." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-135", "text": "Various attention mechanisms have been proposed such as (Wang et al., 2016) for RNNs and (Yin et al., 2015; for CNNs." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-136", "text": "Answer selection is also formulated as a sentence similarity measurement problem or a pairwise ranking problem as in (Severyn and Moschitti, 2015; Yang et al., 2016; Rao et al., 2016) ." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-137", "text": "Multiple Instance Learning. We have briefly mentioned MIL (Babenko et al., 2011; Amores, 2013; Cheplygina et al., 2015) in Section 1." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-138", "text": "Many MIL algorithms cannot be directly applied to answer triggering, because individual-level annotations and predictions are often assumed unavailable and unnecessary in MIL (Maron and Lozano-P\u00e9rez, 1998; Babenko et al., 2011; Amores, 2013; Cheplygina et al., 2015) , but not in the answer triggering setting, where the correctness of each answer candidate is annotated during training and needs to be predicted during testing."
}, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-139", "text": "We experimented with two popular MIL algorithms that explicitly discriminate individual-level labels: MI-SVM (Andrews et al., 2003) and Sb-MIL (Bunescu and Mooney, 2007) implemented in one of the state-of-the-art MIL toolkits (Doran and Ray, 2014), where we represented each question/answer with encoder vectors as in Section 3.4." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-140", "text": "Unfortunately, both algorithms predict no correct answer exists for any question, possibly because the training data are biased towards negative groups and the input features are not effective enough." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-141", "text": "This indicates that using MIL for answer triggering is challenging and still open for future research." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-142", "text": "----------------------------------" }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-143", "text": "**CONCLUSION**" }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-144", "text": "In conclusion, we address the critical answer triggering challenge with an effective framework based on deep neural networks." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-145", "text": "We propose a novel objective function to optimize the entire framework end-to-end, where we focus more on the group-level prediction and take into account multiple important factors." }, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-146", "text": "In particular, the objective function explicitly penalizes three potential errors in answer triggering: (1) false-positive and (2) false-negative predictions of the existence of a correct answer, as well as (3) ranking incorrect answers higher than correct ones." 
}, { "sent_id": "81bdddc7d6b04c88407537f57c0580-C001-147", "text": "We experimented with different objective function settings and show that our GAT framework outperforms the previous state of the art by a remarkable margin." } ], "y": { "@USE@": { "gold_contexts": [ [ "81bdddc7d6b04c88407537f57c0580-C001-12" ], [ "81bdddc7d6b04c88407537f57c0580-C001-23" ], [ "81bdddc7d6b04c88407537f57c0580-C001-64" ], [ "81bdddc7d6b04c88407537f57c0580-C001-90" ], [ "81bdddc7d6b04c88407537f57c0580-C001-97" ], [ "81bdddc7d6b04c88407537f57c0580-C001-98" ], [ "81bdddc7d6b04c88407537f57c0580-C001-106" ], [ "81bdddc7d6b04c88407537f57c0580-C001-108" ] ], "cite_sentences": [ "81bdddc7d6b04c88407537f57c0580-C001-12", "81bdddc7d6b04c88407537f57c0580-C001-23", "81bdddc7d6b04c88407537f57c0580-C001-64", "81bdddc7d6b04c88407537f57c0580-C001-90", "81bdddc7d6b04c88407537f57c0580-C001-97", "81bdddc7d6b04c88407537f57c0580-C001-98", "81bdddc7d6b04c88407537f57c0580-C001-106", "81bdddc7d6b04c88407537f57c0580-C001-108" ] }, "@BACK@": { "gold_contexts": [ [ "81bdddc7d6b04c88407537f57c0580-C001-15" ] ], "cite_sentences": [ "81bdddc7d6b04c88407537f57c0580-C001-15" ] }, "@EXT@": { "gold_contexts": [ [ "81bdddc7d6b04c88407537f57c0580-C001-15", "81bdddc7d6b04c88407537f57c0580-C001-16", "81bdddc7d6b04c88407537f57c0580-C001-17", "81bdddc7d6b04c88407537f57c0580-C001-18", "81bdddc7d6b04c88407537f57c0580-C001-19", "81bdddc7d6b04c88407537f57c0580-C001-20" ] ], "cite_sentences": [ "81bdddc7d6b04c88407537f57c0580-C001-15" ] }, "@DIF@": { "gold_contexts": [ [ "81bdddc7d6b04c88407537f57c0580-C001-92" ], [ "81bdddc7d6b04c88407537f57c0580-C001-93" ], [ "81bdddc7d6b04c88407537f57c0580-C001-100" ], [ "81bdddc7d6b04c88407537f57c0580-C001-101" ] ], "cite_sentences": [ "81bdddc7d6b04c88407537f57c0580-C001-92", "81bdddc7d6b04c88407537f57c0580-C001-93", "81bdddc7d6b04c88407537f57c0580-C001-100", "81bdddc7d6b04c88407537f57c0580-C001-101" ] } } }, "ABC_69857bcd5ba67cb7ca0b4344a3a85f_11": { "x": [ { "sent_id": 
"69857bcd5ba67cb7ca0b4344a3a85f-C001-54", "text": "In this framework, we have a set of g feature functions" }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-78", "text": "We refer the reader to (Soricut, 2006) for additional details regarding the optimality and the theoretical run-time behavior of these algorithms." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-79", "text": "----------------------------------" }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-80", "text": "**UTILITY-BASED TRAINING**" }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-2", "text": "We describe a generic framework for integrating various stochastic models of discourse coherence in a manner that takes advantage of their individual strengths." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-3", "text": "An integral part of this framework is a set of algorithms for searching and training these stochastic coherence models." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-4", "text": "We evaluate the performance of our models and algorithms and show empirically that utility-trained log-linear coherence models outperform each of the individual coherence models considered." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-5", "text": "----------------------------------" }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-28", "text": "----------------------------------" }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-29", "text": "**LOCAL MODELS OF DISCOURSE COHERENCE**" }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-152", "text": "These figures account for combined model and search performance." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-53", "text": "We can model the probability of a text using a log-linear model that combines the discourse coherence models presented above."
}, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-7", "text": "Various theories of discourse coherence (Mann and Thompson, 1988; Grosz et al., 1995) have been applied successfully in discourse analysis (Marcu, 2000; Forbes et al., 2001 ) and discourse generation (Scott and de Souza, 1990; Kibble and Power, 2004) ." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-8", "text": "Most of these efforts, however, have limited applicability." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-9", "text": "Those that use manually written rules model only the most visible discourse constraints (e.g., the discourse connective \"although\" marks a CONCESSION relation), while being oblivious to fine-grained lexical indicators." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-10", "text": "And the methods that utilize manually annotated corpora (Carlson et al., 2003; Karamanis et al., 2004) and supervised learning algorithms have high costs associated with the annotation procedure, and cannot be easily adapted to different domains and genres." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-11", "text": "In contrast, more recent research has focused on stochastic approaches that model discourse coherence at the local lexical (Lapata, 2003) and global levels (Barzilay and Lee, 2004) , while preserving regularities recognized by classic discourse theories (Barzilay and Lapata, 2005) ." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-12", "text": "These stochastic coherence models use simple, non-hierarchical representations of discourse, and can be trained with minimal human intervention, using large collections of existing human-authored documents." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-13", "text": "These models are attractive due to their increased scalability and portability." 
}, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-14", "text": "As each of these stochastic models captures different aspects of coherence, an important question is whether we can combine them in a model capable of exploiting all coherence indicators." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-15", "text": "A frequently used testbed for coherence models is the discourse ordering problem, which occurs often in text generation, complex question answering, and multi-document summarization: given a set of discourse units, what is the most coherent ordering of them (Marcu, 1996; Lapata, 2003; Barzilay and Lee, 2004; Barzilay and Lapata, 2005) ? Because the problem is NP-complete (Althaus et al., 2005) , it is critical how coherence model evaluation is intertwined with search: if the search for the best ordering is greedy and has many errors, one is not able to properly evaluate whether a model is better than another." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-16", "text": "If the search is exhaustive, the ordering procedure may take too long to be useful." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-17", "text": "In this paper, we propose an A* search algorithm for the discourse ordering problem that comes with strong theoretical guarantees." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-18", "text": "For a wide range of practical problems (discourse ordering of up to 15 units), the algorithm finds an optimal solution in reasonable time (on the order of seconds)." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-19", "text": "A beam search version of the algorithm enables one to find good, approximate solutions for very large reordering tasks." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-20", "text": "These algorithms enable us not only to compare head-to-head, for the first time, a set of coherence models, but also to combine these models so as to benefit from their complementary strengths."
}, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-21", "text": "The model combination is accomplished using statistically well-founded utility training procedures which automatically optimize the contributions of the individual models on a development corpus." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-22", "text": "We empirically show that utility-based models of discourse coherence outperform each of the individual coherence models considered." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-23", "text": "In the following section, we describe previously-proposed and new coherence models." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-24", "text": "Then, we present our search algorithms and the input representation they use." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-25", "text": "Finally, we show evaluation results and discuss their implications." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-26", "text": "----------------------------------" }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-27", "text": "**STOCHASTIC MODELS OF DISCOURSE COHERENCE**" }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-30", "text": "Stochastic local models of coherence work under the assumption that well-formed discourse can be characterized in terms of specific distributions of local recurring patterns." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-31", "text": "These distributions can be defined at the lexical level or entity-based levels." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-32", "text": "Word-Cooccurrence Coherence Models." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-33", "text": "We propose a new coherence model, inspired by (Knight, 2003) , that models the intuition that the usage of certain words in a discourse unit (sentence) tends to trigger the usage of other words in subsequent discourse units."
}, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-34", "text": "(A similar intuition holds for the Machine Translation models generically known as the IBM models (Brown et al., 1993) , which assume that certain words in a source language sentence tend to trigger the usage of certain words in a target language translation of that sentence.) We train models able to recognize local recurring patterns of word usage across sentences in an unsupervised manner, by running an ExpectationMaximization (EM) procedure over pairs of consecutive sentences extracted from a large collection of training documents 1 ." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-35", "text": "We expect EM to detect and assign higher probabilities to recurring word patterns compared to casually occurring word patterns." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-36", "text": "A local coherence model based on IBM Model 1 assigns the following probability to a text consisting of" }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-37", "text": "Entity-based Coherence Models." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-38", "text": "Barzilay and Lapata (2005) recently proposed an entity-based coherence model that aims to learn abstract coherence properties, similar to those stipulated by Centering Theory (Grosz et al., 1995) ." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-39", "text": "Their model learns distribution patterns for transitions between discourse entities that are abstracted into their syntactic roles -subject (S), object (O), other (X), missing (-)." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-40", "text": "The feature values are computed using an entity-grid representation for the discourse that records the syntactic role of each entity as it appears in each sentence." 
}, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-41", "text": "Also, salient entities are differentiated from casually occurring entities, based on the widely used assumption that occurrence frequency correlates with discourse prominence (Morris and Hirst, 1991; Grosz et al., 1995) ." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-42", "text": "We exclude the coreference information from this model, as the discourse ordering problem cannot accommodate current coreference solutions, which assume a pre-specified order (Ng, 2005) ." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-43", "text": "In the jargon of (Barzilay and Lapata, 2005) , the model we implemented is called Syntax+Salience." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-44", "text": "The probability assigned to a text e ( e \u00a2 f \u00a3 4 \u00a6 \u00a6 F \u00a2 \u00a9 by this Entity-Based model (henceforth called EB) can be locally computed (i.e., at sentence transition level) using g feature functions, as follows:" }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-45", "text": "are feature values, and q U are weights trained to discriminate between coherent, human-authored documents and examples assumed to have lost some degree of coherence (scrambled versions of the original documents)." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-46", "text": "Barzilay and Lee (2004)" }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-47", "text": "----------------------------------" }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-48", "text": "**GLOBAL MODELS OF DISCOURSE COHERENCE**" }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-49", "text": ", models the probability of changing from topic 5 2 4 \u00a3 to topic 5 ." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-50", "text": "The second term, , models the probability of generating sentences from topic 5 ." 
}, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-51", "text": "----------------------------------" }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-52", "text": "**COMBINING LOCAL AND GLOBAL MODELS OF DISCOURSE COHERENCE**" }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-55", "text": "Under this model, finding the most probable text is equivalent with solving Equation 1, and therefore we do not need to be concerned about computing expensive normalization factors." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-56", "text": "In this framework, we distinguish between the modeling problem, which amounts to finding appropriate feature functions for the discourse coherence task, and the training problem, which amounts to finding appropriate values for 0 e # , P f % ' % g ." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-57", "text": "We address the modeling problem by using as feature functions the discourse coherence models presented in the previous sections." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-58", "text": "In Section 3, we address the training problem by performing a discriminative training procedure of the 0 # parameters, using as utility functions a metric that measures how different a training instance is from a given reference." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-59", "text": "----------------------------------" }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-60", "text": "**SEARCH ALGORITHMS FOR COHERENT DISCOURSES AND UTILITY-BASED TRAINING**" }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-61", "text": "The algorithms we propose use as input representation the IDL-expressions formalism (Nederhof and Satta, 2004; Soricut and Marcu, 2005) ." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-62", "text": "We use here the IDL formalism (which stands for Interleave, Disjunction, Lock, after the names of its operators) to define finite sets of possible discourses over given discourse units." 
}, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-63", "text": "Without losing generality, we will consider sentences as discourse units in our examples and experiments." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-64", "text": "----------------------------------" }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-65", "text": "**INPUT REPRESENTATION**" }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-66", "text": "Consider the discourse units A-C presented in Figure 1(a) ." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-67", "text": "Each of these units undergoes various processing stages in order to provide the information needed by our coherence models." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-68", "text": "The entity-based model (EB) (Section 2), for instance, makes use of a syntactic parser to determine the syntactic role played by each detected entity (Figure 1(b) )." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-69", "text": "For example, the string SSXXXXOX------------(first row of the grid in Figure 1(" }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-70", "text": "----------------------------------" }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-71", "text": "**SEARCH ALGORITHMS**" }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-72", "text": "Algorithms that operate on IDL-graphs have been recently proposed by Soricut and Marcu (2005) ." 
}, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-73", "text": "We extend these algorithms to take as input IDLgraphs over non-atomic symbols (such that the coherence models can operate inside terms like We also define an admissible heuristic function (Russell and Norvig, 1995) , which is used to compute an admissible future cost" }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-74", "text": "is the set of future (visible) events for state 4 , which can be computed directly from an input IDL-graph, as the set of all Q -edge-labels between the vertices of state (sorted according to total cost, computed as current I admissible) to control the unfolding of an input IDL-graph, by processing, at each unfolding step, the most inexpensive state (extracted from the top of q )." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-75", "text": "The admissibility of the future costs and the monotonicity property enforced by the priority queue guarantees that IDL-CH-A\u00a1 finds an optimal solution to Equation 1 (Russell and Norvig, 1995) ." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-76", "text": "The IDL-CH-HB# algorithm uses a histogram beam $ to control the unfolding of an input IDLgraph, by processing, at each unfolding step, the top $ most inexpensive states (according to total cost)." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-77", "text": "This algorithm can be tuned (via $ ) to achieve good trade-off between speed and accuracy." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-81", "text": "In addition to the modeling problem, we must also address the training problem, which amounts to finding appropriate values for the 0 # parameters from Equation 1." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-82", "text": "The solution we employ here is the discriminative training procedure of Och (2003) ." 
}, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-83", "text": "This procedure learns an optimal setting of the 0 e # parameters using as optimality criterion the utility of the proposed solution." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-84", "text": "There are two necessary ingredients to implement Och's (2003) training procedure." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-85", "text": "First, it needs a search algorithm that is able to produce ranked \" -best lists of the most promising candidates in a reasonably fast manner (Huang and Chiang, 2005) ." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-86", "text": "We accommodate \" -best computation within the IDL-CH-HB \u00a3 V r V algorithm, which decodes bag-of-units IDL-expressions at an average speed of 75.4 sec./exp." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-87", "text": "on a 3.0 GHz CPU Linux machine, for an average input of 11.5 units per expression." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-88", "text": "Second, it needs a criterion which can automatically assess the quality of the proposed candidates." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-89", "text": "To this end, we employ two different metrics, such that we can measure the impact of using different utility functions on performance." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-90", "text": "----------------------------------" }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-91", "text": "**TAU (KENDALL'S ).**" }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-92", "text": "One of the most frequently used metrics for the automatic evaluation of document coherence is Kendall's (Lapata, 2003; Barzilay and Lee, 2004) ." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-93", "text": "TAU measures the minimum number of adjacent transpositions needed to transform a proposed order into a reference order." 
}, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-94", "text": "The range of the TAU metric is between -1 (the worst) to 1 (the best)." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-95", "text": "----------------------------------" }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-96", "text": "**BLEU.**" }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-97", "text": "One of the most successful metrics for judging machine-generated text is BLEU (Papineni et al., 2002) ." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-98", "text": "It counts the number of unigram, bigram, trigram, and four-gram matches between hypothesis and reference, and combines them using geometric mean." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-99", "text": "For the discourse ordering problem, we represent hypotheses and references by index sequences (e.g., \"4 2 3 1\" is a hypothesis order over four discourse units, in which the first and last units have been swapped with respect to the reference order)." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-100", "text": "The range of BLEU scores is between 0 (the worst) and 1 (the best)." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-101", "text": "We run different discriminative training sessions using TAU and BLEU, and train two different sets of the" }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-102", "text": "----------------------------------" }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-103", "text": "**EXPERIMENTS**" }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-104", "text": "We evaluate empirically two different aspects of our work." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-105", "text": "First, we measure the performance of our search algorithms across different models." 
}, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-106", "text": "Second, we compare the performance of each individual coherence model, and also the performance of the discriminatively trained log-linear models." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-107", "text": "We also compare the overall performance (model & decoding strategy) obtained in our framework with previously reported results." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-108", "text": "----------------------------------" }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-109", "text": "**EVALUATION SETTING**" }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-110", "text": "The task on which we conduct our evaluation is information ordering (Lapata, 2003; Barzilay and Lee, 2004; Barzilay and Lapata, 2005) ." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-111", "text": "In this task, a pre-selected set of information-bearing document units (in our case, sentences) needs to be arranged in a sequence which maximizes some specific information quality (in our case, document coherence)." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-112", "text": "We use the information-ordering task as a means to measure the performance of our algorithms and models in a well-controlled setting." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-113", "text": "As described in Section 3, our framework can be used in applications such as multi-document summarization." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-114", "text": "In fact, Barzilay et al. (2002) formulate the multi-document summarization problem as an information ordering problem, and show that naive ordering algorithms such as majority ordering (select most frequent orders across input documents) and chronological ordering (order facts according to publication date) do not always yield coherent summaries." 
}, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-115", "text": "----------------------------------" }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-116", "text": "**DATA.**" }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-117", "text": "For training and testing, we use documents from two different genres: newspaper articles and accident reports written by government officials (Barzilay and Lapata, 2005 Table 2 : Evaluation of stochastic models for document coherence, for both EARTHQUAKES and ACCIDENTS genre, using IDL-CH-HB \u00a3 V r V ." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-118", "text": "Board's database." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-119", "text": "For both collections, we used 100 documents for training and 100 documents for testing." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-120", "text": "A fraction of 40% of the training documents was temporarily removed and used as a development set, on which we performed the discriminative training procedure." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-121", "text": "----------------------------------" }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-122", "text": "**EVALUATION OF SEARCH ALGORITHMS**" }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-123", "text": "We evaluated the performance of several search algorithms across four stochastic models of document coherence: the IBM \u00a3 and IBM \u00a3 coherence models, the content model of Barzilay and Lee (2004) (CM) , and the entity-based model of Barzilay and Lapata (2005) (EB) (Section 2)." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-124", "text": "We measure search performance using an Estimated Search Error (ESE) figure, which reports the percentage of times when the search algorithm proposes a sentence order which scores lower than Overall performance TAU QUAKES ACCID." 
}, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-125", "text": "Lapata (2003) 0.48 0.07 Barzilay & Lee (2004) 0.81 0.44 Barzilay & Lee (reproduced) 0.39 0.36 Barzilay & Lapata (2005) 0 the original sentence order (OSO)." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-126", "text": "We also measure the quality of the proposed documents using TAU and BLEU, using as reference the OSO." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-127", "text": "In Table 1 , we report the performance of four search algorithms." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-153", "text": "We first note that, unfortunately, we failed to accurately reproduce the model of Barzilay and Lee (2004) ." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-128", "text": "The first three, IDL-CH-A\u00a1 , IDL-CH-HB \u00a3 V r V , and IDL-CH-HB \u00a3 are the IDLbased search algorithms of Section 3, implementing A\u00a1 search, histogram beam search with a beam of 100, and histogram beam search with a beam of 1, respectively." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-129", "text": "We compare our algorithms against the greedy algorithm used by Lapata (2003) ." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-130", "text": "We note here that the comparison is rendered meaningful by the observation that this algorithm performs search identically with algorithm IDL-CH-HB \u00a3 (histogram beam 1), when setting the heuristic function for future costs 3 to constant 0." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-131", "text": "The results in Table 1 clearly show the superiority of the IDL-CH-A\u00a1 and IDL-CH-HB \u00a3 V r V algo-rithms." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-132", "text": "Across all models considered, they consistently propose documents with scores at least as good as OSO (0% Estimated Search Error)." 
}, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-133", "text": "As the original documents were coherent, it follows that the proposed document realizations also exhibit coherence." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-134", "text": "In contrast, the greedy algorithm of Lapata (2003) makes grave search errors." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-135", "text": "As the comparison between IDL-CH-HB \u00a3 V r V and IDL-CH-HB \u00a3 shows, the superiority of the IDL-CH algorithms depends more on the admissible heuristic function 3 than in the ability to maintain multiple hypotheses while searching." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-136", "text": "----------------------------------" }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-137", "text": "**EVALUATION OF LOG-LINEAR MODELS**" }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-138", "text": "For this round of experiments, we held constant the search procedure (IDL-CH-HB \u00a3 V r V ), and varied the 0 # parameters of Equation 1." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-139", "text": "The utility-trained log-linear models are compared here against a baseline log-linear model loglinear\u00a1" }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-140", "text": ", for which all 0 C # parameters are set to 1, and also against the individual models." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-141", "text": "The results are presented in Table 2 ." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-142", "text": "If not properly weighted, the log-linear combination may yield poorer results than those of individual models (average TAU of .34 for loglinear\u00a1" }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-143", "text": ", versus .38 for IBM \u00a3 and .39 for CM, on the EARTHQUAKES domain)." 
}, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-144", "text": "The highest TAU accuracy is obtained when using TAU to perform utility-based training of the 0 e # parameters (.47 for EARTHQUAKES, .50 for ACCIDENTS)." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-145", "text": "The highest BLEU accuracy is obtained when using BLEU to perform utility-based training of the 0 # parameters (.16 for EARTHQUAKES, .24 for the ACCIDENTS)." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-146", "text": "For both genres, the differences between the highest accuracy figures (in bold) and the accuracy of the individual models are statistically significant at 95% confidence (using bootstrap resampling)." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-147", "text": "----------------------------------" }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-148", "text": "**OVERALL PERFORMANCE EVALUATION**" }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-149", "text": "The last comparison we provide is between the performance provided by our framework and previously-reported performance results (Table 3) ." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-150", "text": "We are able to provide this comparison based on the TAU figures reported in (Barzilay and Lee, 2004) ." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-151", "text": "The training and test data for both genres is the same, and therefore the figures can be directly compared." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-154", "text": "Our reproduction has an average TAU figure of only .39 versus the original figure of .81 for EARTHQUAKES, and .36 versus .44 for ACCIDENTS." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-155", "text": "On the other hand, we reproduced successfully the model of Barzilay and Lapata (2005) , and the average TAU figure is .19 for EARTHQUAKES, and .12 for ACCIDENTS 3 ." 
}, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-156", "text": "The large difference on the EARTHQUAKES corpus between the performance of Barzilay and Lee (2004) and our reproduction of their model is responsible for the overall lower performance (0.47) of our log-linear \u00a9 model and IDL-CH-HB \u00a3 V r V search algorithm, which is nevertheless higher than that of its component model CM (0.39)." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-157", "text": "On the other hand, we achieve the highest accuracy figure (0.50) on the ACCIDENTS corpus, outperforming the previous-highest figure (0.44) of Barzilay and Lee (2004) ." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-158", "text": "These result empirically show that utility-trained log-linear models of discourse coherence outperform each of the individual coherence models considered." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-159", "text": "----------------------------------" }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-160", "text": "**DISCUSSION AND CONCLUSIONS**" }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-161", "text": "We presented a generic framework that is capable of integrating various stochastic models of discourse coherence into a more powerful model that combines the strengths of the individual models." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-162", "text": "An important ingredient of this framework are the search algorithms based on IDL-expressions, which provide a flexible way of solving discourse generation problems using stochastic models." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-163", "text": "Our generation algorithms are fundamentally different from previously-proposed algorithms for discourse generation." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-164", "text": "The genetic algorithms of Mellish et al. 
(1998) and Karamanis and Manurung (2002) , as well as the greedy algorithm of Lapata (2003) , provide no theoretical guarantees on the optimality of the solutions they propose." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-165", "text": "At the other end of the spectrum, the exhaustive search of Barzilay and Lee (2004) , while ensuring optimal solutions, is prohibitively expensive, and cannot be used to perform utility-based training." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-166", "text": "The linear programming algorithm of Althaus et al. (2005) is the only proposal that achieves both good speed and accuracy." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-167", "text": "Their algorithm, however, cannot handle models with hidden states, cannot compute n-best lists, and does not have the representation flexibility provided by IDL-expressions, which is crucial for coherence decoding in realistic applications such as multi-document summarization." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-168", "text": "For each of the coherence model combinations that we have utility-trained, we obtained improved results on the discourse ordering problem compared to the individual models." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-169", "text": "This is important for two reasons." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-170", "text": "Our improvements can have an immediate impact on multi-document summarization applications (Barzilay et al., 2002) ." }, { "sent_id": "69857bcd5ba67cb7ca0b4344a3a85f-C001-171", "text": "Also, our framework provides a solid foundation for subsequent research on discourse coherence models and related applications."
} ], "y": { "@BACK@": { "gold_contexts": [ [ "69857bcd5ba67cb7ca0b4344a3a85f-C001-11" ], [ "69857bcd5ba67cb7ca0b4344a3a85f-C001-92" ], [ "69857bcd5ba67cb7ca0b4344a3a85f-C001-165" ] ], "cite_sentences": [ "69857bcd5ba67cb7ca0b4344a3a85f-C001-11", "69857bcd5ba67cb7ca0b4344a3a85f-C001-92", "69857bcd5ba67cb7ca0b4344a3a85f-C001-165" ] }, "@MOT@": { "gold_contexts": [ [ "69857bcd5ba67cb7ca0b4344a3a85f-C001-15" ] ], "cite_sentences": [ "69857bcd5ba67cb7ca0b4344a3a85f-C001-15" ] }, "@USE@": { "gold_contexts": [ [ "69857bcd5ba67cb7ca0b4344a3a85f-C001-110" ], [ "69857bcd5ba67cb7ca0b4344a3a85f-C001-123" ], [ "69857bcd5ba67cb7ca0b4344a3a85f-C001-124", "69857bcd5ba67cb7ca0b4344a3a85f-C001-125" ], [ "69857bcd5ba67cb7ca0b4344a3a85f-C001-150" ] ], "cite_sentences": [ "69857bcd5ba67cb7ca0b4344a3a85f-C001-110", "69857bcd5ba67cb7ca0b4344a3a85f-C001-123", "69857bcd5ba67cb7ca0b4344a3a85f-C001-125", "69857bcd5ba67cb7ca0b4344a3a85f-C001-150" ] }, "@DIF@": { "gold_contexts": [ [ "69857bcd5ba67cb7ca0b4344a3a85f-C001-153" ], [ "69857bcd5ba67cb7ca0b4344a3a85f-C001-156" ], [ "69857bcd5ba67cb7ca0b4344a3a85f-C001-157" ], [ "69857bcd5ba67cb7ca0b4344a3a85f-C001-163", "69857bcd5ba67cb7ca0b4344a3a85f-C001-165" ] ], "cite_sentences": [ "69857bcd5ba67cb7ca0b4344a3a85f-C001-153", "69857bcd5ba67cb7ca0b4344a3a85f-C001-156", "69857bcd5ba67cb7ca0b4344a3a85f-C001-157", "69857bcd5ba67cb7ca0b4344a3a85f-C001-165" ] } } }, "ABC_5d3c08596677a1f8ac48fa17766bb4_11": { "x": [ { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-2", "text": "In this paper we focus on practical issues of data representation for dependency parsing." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-3", "text": "We carry out an experimental comparison of (a) three syntactic dependency schemes; (b) three data-driven dependency parsers; and (c) the influence of two different approaches to lexical category disambiguation (aka tagging) prior to parsing." 
}, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-4", "text": "Comparing parsing accuracies in various setups, we study the interactions of these three aspects and analyze which configurations are easier to learn for a dependency parser." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-5", "text": "----------------------------------" }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-7", "text": "Dependency parsing is one of the mainstream research areas in natural language processing." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-8", "text": "Dependency representations are useful for a number of NLP applications, for example, machine translation (Ding and Palmer, 2005) , information extraction (Yakushiji et al., 2006) , analysis of typologically diverse languages (Bunt et al., 2010) and parser stacking (\u00d8vrelid et al., 2009 )." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-9", "text": "There were several shared tasks organized on dependency parsing (CoNLL 2006 (CoNLL -2007 and labeled dependencies (CoNLL 2008 (CoNLL -2009 ) and there were a number of attempts to compare various dependencies intrinsically, e.g. (Miyao et al., 2007) , and extrinsically, e.g. (Wu et al., 2012) ." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-10", "text": "In this paper we focus on practical issues of data representation for dependency parsing." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-11", "text": "The central aspects of our discussion are (a) three dependency formats: two 'classic' representations for dependency parsing, namely, Stanford Basic (SB) and CoNLL Syntactic Dependencies (CD), and bilexical dependencies from the HPSG English Resource Grammar (ERG), so-called DELPH-IN Syntactic Derivation Tree (DT), proposed recently by Ivanova et al. 
(2012) ; (b) three state-of-the-art statistical parsers: Malt (Nivre et al., 2007) , MST (McDonald et al., 2005) and the parser of Bohnet and Nivre (2012) ; (c) two approaches to word-category disambiguation, namely, exploiting common PTB tags and using supertags (i.e. specialized ERG lexical types)." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-12", "text": "We parse the formats and compare accuracies in all configurations in order to determine how parsers, dependency representations and grammatical tagging methods interact with each other in application to automatic syntactic analysis." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-13", "text": "SB and CD are derived automatically from phrase structures of Penn Treebank to accommodate the needs of fast and accurate dependency parsing, whereas DT is rooted in the formal grammar theory HPSG and is independent from any specific treebank." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-14", "text": "For DT we gain more expressivity from the underlying linguistic theory, which challenges parsing with statistical tools." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-15", "text": "The structural analysis of the schemes in Ivanova et al. (2012) leads to the hypothesis that CD and DT are more similar to each other than SB to DT." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-16", "text": "We recompute similarities on a larger treebank and check whether parsing results reflect them." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-17", "text": "The paper has the following structure: an overview of related work is presented in Section 2; treebanks, tagsets, dependency schemes and parsers used in the experiments are introduced in Section 3; analysis of parsing results is discussed in Section 4; conclusions and future work are outlined in Section 5." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-18", "text": "Schwartz et al. 
(2012) investigate which dependency representations of several syntactic structures are easier to parse with supervised versions of the Klein and Manning (2004) parser, ClearParser (Choi and Nicolov, 2009) , MST Parser, Malt and the Easy First Non-directional parser (Goldberg and Elhadad, 2010)." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-19", "text": "----------------------------------" }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-20", "text": "**RELATED WORK**" }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-21", "text": "----------------------------------" }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-22", "text": "**TREEBANKS**" }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-23", "text": "For the experiments in this paper we used the Penn Treebank (Marcus et al., 1993) and the DeepBank." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-24", "text": "The latter is comprised of roughly 82% of the sentences of the first 16 sections of the Penn Treebank annotated with full HPSG analyses from the English Resource Grammar (ERG)." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-25", "text": "The DeepBank annotations are created on top of the raw text of the PTB." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-26", "text": "Due to imperfections of the automatic tokenization, there are some token mismatches between DeepBank and PTB." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-27", "text": "We had to filter out such sentences to have a consistent number of tokens in the DT, SB and CD formats." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-28", "text": "For our experiments we had available a training set of 22209 sentences and a test set of 1759 sentences (from Section 15)."
}, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-29", "text": "----------------------------------" }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-30", "text": "**PARSERS**" }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-31", "text": "In the experiments described in Section 4 we used parsers that adopt different approaches and implement various algorithms." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-32", "text": "Malt (Nivre et al., 2007) : transition-based dependency parser with local learning and greedy search." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-33", "text": "MST (McDonald et al., 2005) : graph-based dependency parser with global near-exhaustive search." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-34", "text": "Bohnet and Nivre (2012) parser: transition-based dependency parser with joint tagger that implements global learning and beam search." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-35", "text": "----------------------------------" }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-36", "text": "**DEPENDENCY SCHEMES**" }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-37", "text": "In this work we extract DeepBank data in the form of bilexical syntactic dependencies, the DELPH-IN Syntactic Derivation Tree (DT) format." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-38", "text": "We obtain the exact same sentences in Stanford Basic (SB) format from the automatic conversion of the PTB with the Stanford parser (de Marneffe et al., 2006) and in the CoNLL Syntactic Dependencies (CD) representation using the LTH Constituent-to-Dependency Conversion Tool for Penn-style Treebanks (Johansson and Nugues, 2007) ." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-39", "text": "SB and CD represent ways to convert the PTB to bilexical dependencies; in contrast, DT is grounded in linguistic theory and captures decisions taken in the grammar." 
}, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-40", "text": "Figure 1 demonstrates the differences between the formats on the coordination structure." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-41", "text": "According to Schwartz et al. (2012) , analysis of coordination in SB and CD is easier for a statistical parser to learn; however, as we will see in Section 4.3, DT has more expressive power distinguishing structural ambiguities illustrated by the classic example old men and women." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-42", "text": "----------------------------------" }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-43", "text": "**PART-OF-SPEECH TAGS**" }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-44", "text": "We experimented with two tag sets: PTB tags and lexical types of the ERG grammar, i.e. supertags." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-45", "text": "PTB tags determine the part of speech (PoS) and some morphological features, such as number for nouns, degree of comparison for adjectives and adverbs, tense and agreement with the person and number of the subject for verbs, etc." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-46", "text": "Supertags are composed of part-of-speech, valency in the form of an ordered sequence of complements, and annotations that encompass category-internal subdivisions, e.g. mass vs. count vs. proper nouns, intersective vs. scopal adverbs, or referential vs. expletive pronouns." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-47", "text": "Example of a supertag: v_np_is_le (verb \"is\" that takes a noun phrase as a complement)." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-48", "text": "There are 48 tags in the PTB tagset and 1091 supertags in the set of lexical types of the ERG." 
}, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-49", "text": "The state-of-the-art accuracy of PoS-tagging on in-domain test data using gold-standard tokenization is roughly 97% for the PTB tagset and approximately 95% for the ERG supertags (Ytrest\u00f8l, 2011) ." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-50", "text": "Supertagging for the ERG grammar is an ongoing research effort and an off-the-shelf supertagger for the ERG is not currently available." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-51", "text": "----------------------------------" }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-52", "text": "**EXPERIMENTS**" }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-53", "text": "In this section we give a detailed analysis of parsing into SB, CD and DT dependencies with Malt, MST and the Bohnet and Nivre (2012) parser." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-54", "text": "----------------------------------" }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-55", "text": "**SETUP**" }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-56", "text": "For Malt and MST we perform the experiments on gold PoS tags, whereas the Bohnet and Nivre (2012) parser predicts PoS tags during testing." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-57", "text": "Prior to each experiment with Malt, we used MaltOptimizer to obtain settings and a feature model; for MST we used the default configuration; for the Bohnet and Nivre (2012) parser we set the beam parameter to 80 and otherwise employed the default setup." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-58", "text": "With regard to evaluation metrics, we use labeled attachment score (LAS), unlabeled attachment score (UAS) and label accuracy (LACC), excluding punctuation." 
}, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-59", "text": "Our results cannot be directly compared to the state-of-the-art scores on the Penn Treebank because we train on sections 0-13 and test on section 15 of the WSJ." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-60", "text": "Also, our results are not strictly inter-comparable because the setups we use are different." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-61", "text": "----------------------------------" }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-62", "text": "**DISCUSSION**" }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-63", "text": "The results that we are going to analyze are presented in Tables 1 and 2." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-64", "text": "Statistical significance was assessed using Dan Bikel's parsing evaluation comparator 1 at the 0.001 significance level." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-65", "text": "We inspect three different aspects in the interpretation of these results: parser, dependency format and tagset." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-66", "text": "Below we will look at these three angles in detail." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-67", "text": "From the parser perspective, Malt and MST are not very different in the traditional setup with gold PTB tags (Table 1, Gold PTB tags)." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-68", "text": "The Bohnet and Nivre (2012) parser outperforms Malt on CD and DT and MST on SB, CD and DT with PTB tags even though it does not receive gold PTB tags during the test phase but predicts them (Table 2 , Predicted PTB tags)." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-69", "text": "This is explained by the fact that the Bohnet and Nivre (2012) parser implements a novel approach to parsing: a beam-search algorithm with global structure learning." 
}, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-70", "text": "MST \"loses\" more than Malt when parsing SB with gold supertags (Table 1, Gold supertags)." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-71", "text": "This parser exploits context features \"POS tag of each intervening word between head and dependent\" (McDonald et al., 2006) ." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-72", "text": "Due to the far larger size of the supertag set compared to the PTB tagset, such features are sparse and have low frequencies." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-73", "text": "This leads to lower parsing accuracy scores for MST." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-74", "text": "For the Bohnet and Nivre (2012) parser the complexity of supertag prediction has a significant negative influence on the attachment and labeling accuracies (Table 2 , Predicted supertags)." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-75", "text": "The addition of gold PTB tags as a feature lifts the performance of the Bohnet and Nivre (2012) parser to the level of performance of Malt and MST on CD with gold supertags and Malt on SB with gold supertags (compare Table 2 , Predicted supertags + gold PTB, and Table 1 , Gold supertags)." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-76", "text": "Both Malt and MST benefit slightly from the combination of gold PTB tags and gold supertags (Table 1 , Gold PTB tags + gold supertags)." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-77", "text": "For the Bohnet and Nivre (2012) parser we also observe a small rise in accuracy when gold supertags are provided as a feature for the prediction of PTB tags (compare Predicted PTB tags and Predicted PTB tags + gold supertags sections of Table 2 )." 
}, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-78", "text": "The parsers have different running times: it takes minutes to run an experiment with Malt, about 2 hours with MST and up to a day with the Bohnet and Nivre (2012) parser." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-79", "text": "From the point of view of the dependency format, SB has the highest LACC and CD has the highest UAS for all three parsers in most of the configurations (Tables 1 and 2)." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-80", "text": "This means that SB is easier to label and CD is easier to parse structurally." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-81", "text": "DT appears to be a more difficult target format because it is both hard to label and hard to attach in most configurations." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-82", "text": "This is not an unexpected result, since SB and CD are both derived from PTB phrase-structure trees and are designed to ease the dependency parsing task." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-83", "text": "DT is not custom-designed for dependency parsing and in this sense is independent of parsing considerations." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-84", "text": "Unlike SB and CD, it is linguistically informed by the underlying, full-fledged HPSG grammar." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-85", "text": "The Jaccard similarity on our training set is 0.57 for SB and CD, 0.564 for CD and DT, and 0.388 for SB and DT." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-86", "text": "These similarity values show that CD and DT are structurally closer to each other than SB and DT." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-87", "text": "Contrary to our expectations, the accuracy scores of parsers do not suggest that CD and DT are particularly similar to each other in terms of parsing." 
}, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-88", "text": "Inspecting the aspect of tagset we conclude that traditional PTB tags are compatible with SB and CD but do not fit the DT scheme well, while ERG supertags are specific to the ERG framework and do not seem to be appropriate for SB and CD." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-89", "text": "Neither of these findings seems surprising, as PTB tags were developed as part of the treebank from which CD and SB are derived; whereas ERG supertags are closely related to the HPSG syntactic structures captured in DT." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-90", "text": "PTB tags were designed to simplify PoS-tagging whereas supertags were developed to capture information that is required to analyze the syntax of HPSG." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-91", "text": "For each PTB tag we collected corresponding supertags from the gold-standard training set." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-92", "text": "For open word classes such as nouns, adjectives, adverbs and verbs the relation between PTB tags and supertags is many-to-many." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-93", "text": "A unique one-to-many correspondence holds only for the possessive wh-pronoun and punctuation." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-94", "text": "Thus, supertags do not provide an extra level of detail for PTB tags, but PTB tags and supertags are complementary." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-95", "text": "As discussed in Section 3.4, they contain bits of information that are different." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-96", "text": "For this reason their combination results in a slight increase in accuracy for all three parsers on all dependency formats (Table 1 , Gold PTB tags + gold supertags, and Table 2 , Predicted PTB + gold supertags and Predicted supertags + gold PTB)." 
}, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-97", "text": "The Bohnet and Nivre (2012) parser predicts supertags with an average accuracy of 89.73%, which is significantly lower than the state-of-the-art 95% (Ytrest\u00f8l, 2011) ." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-98", "text": "When we consider punctuation in the evaluation, all scores rise significantly for DT and at the same time decrease for SB and CD for all three parsers." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-99", "text": "This is explained by the fact that punctuation in DT is always attached to the nearest token, which is easy to learn for a statistical parser." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-100", "text": "----------------------------------" }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-101", "text": "**ERROR ANALYSIS**" }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-102", "text": "Using the CoNLL-07 evaluation script 2 on our test set, for each parser we obtained the error rate distribution over CPOSTAG on SB, CD and DT." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-103", "text": "VBP, VBZ and VBG." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-104", "text": "VBP (verb, non-3rd person singular present), VBZ (verb, 3rd person singular present) and VBG (verb, gerund or present participle) are the PTB tags whose error rates are among the 10 highest for each parser (Malt, MST and the Bohnet and Nivre (2012) parser), with each dependency format (SB, CD and DT) and with each PoS tag set (PTB PoS and supertags), when PTB tags are included as the CPOSTAG feature." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-105", "text": "We automatically collected all sentences that contain 1) attachment errors, 2) label errors, 3) attachment and label errors for VBP, VBZ and VBG made by the Malt parser on the DT format with PTB PoS. 
For each of these three lexical categories we manually analyzed a random sample of sentences with errors and their corresponding gold-standard versions." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-106", "text": "In many cases such errors are related to the root of the sentence when the verb is either treated as a complement or adjunct instead of having root status or vice versa." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-107", "text": "Errors with these groups of verbs mostly occur in complex sentences that contain several verbs." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-108", "text": "Sentences with coordination are particularly difficult for the correct attachment and labeling of the VBP (see Figure 2 for an example)." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-109", "text": "Coordination." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-110", "text": "The error rate of Malt, MST and the Bohnet and Nivre (2012) parser for coordination is not very high for SB and CD (1% and 2% respectively, with MaltParser, PTB tags), whereas for DT the error rate on the CPOSTAGS is especially high (26% with MaltParser, PTB tags)." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-111", "text": "This means that there are many errors on incoming dependency arcs for coordinating conjunctions when parsing DT." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-112", "text": "On outgoing arcs parsers also make more mistakes on DT than on SB and CD." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-113", "text": "This is related to the difference in choice of annotation principle (see Figure 1) ." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-114", "text": "As shown in (Schwartz et al., 2012), it is harder to parse coordination headed by a coordinating conjunction." 
}, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-115", "text": "Although the approach used in DT is harder for a parser to learn, it has some advantages: using SB and CD annotations, we cannot distinguish the two cases illustrated with sentences (a) and (b): a) The fight is putting a tight squeeze on profits of many, threatening to drive the smallest ones out of business and straining relations between the national fast-food chains and their franchisees." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-116", "text": "b) Proceeds from the sale will be used for remodelling and refurbishing projects, as well as for the planned MGM Grand hotel/casino and theme park." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-117", "text": "In the sentence a) \"the national fast-food\" refers only to the conjunct \"chains\", while in the sentence b) \"the planned\" refers to both conjuncts and \"MGM Grand\" refers only to the first conjunct." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-118", "text": "The Bohnet and Nivre (2012) parser succeeds in finding the correct conjuncts (shown in bold font) on DT and makes mistakes on SB and CD in some difficult cases like the following ones: a) <. . ." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-119", "text": "> investors hoard gold and help underpin its price <. . ." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-120", "text": "> b) Then take the expected return and subtract one standard deviation." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-121", "text": "CD and SB wrongly suggest \"gold\" and \"help\" to be conjoined in the first sentence and \"return\" and \"deviation\" in the second." 
}, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-122", "text": "----------------------------------" }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-123", "text": "**CONCLUSIONS AND FUTURE WORK**" }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-124", "text": "In this survey we gave a comparative experimental overview of (i) parsing three dependency schemes, viz., Stanford Basic (SB), CoNLL Syntactic Dependencies (CD) and DELPH-IN Syntactic Derivation Tree (DT), (ii) with three leading dependency parsers, viz., Malt, MST and the Bohnet and Nivre (2012) parser, and (iii) exploiting two different tagsets, viz., PTB tags and supertags." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-125", "text": "From the parser perspective, the Bohnet and Nivre (2012) parser performs better than Malt and MST not only on conventional formats but also on the new representation, although this parser solves a harder task than Malt and MST." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-126", "text": "From the dependency format perspective, DT appears to be a more difficult target dependency representation than SB and CD." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-127", "text": "This suggests that the expressivity that we gain from the grammar theory (e.g. for coordination) is harder to learn with state-of-the-art dependency parsers." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-128", "text": "CD and DT are structurally closer to each other than SB and DT; however, we did not observe sound evidence of a correlation between structural similarity of CD and DT and their parsing accuracies. Regarding the tagset aspect, it is natural that PTB tags are good for SB and CD, whereas the more fine-grained set of supertags fits DT better." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-129", "text": "PTB tags and supertags are complementary, and for all three parsers we observe slight benefits from being supplied with both types of tags." 
}, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-130", "text": "As future work we would like to run more experiments with predicted supertags." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-131", "text": "In the absence of a specialized supertagger, we can follow the pipeline of (Ytrest\u00f8l, 2011), who reached the state-of-the-art supertagging accuracy of 95%." }, { "sent_id": "5d3c08596677a1f8ac48fa17766bb4-C001-132", "text": "Another area of our interest is an extrinsic evaluation of SB, CD and DT, e.g. applied to semantic role labeling and question-answering, in order to find out whether the usage of the DT format, grounded in computational grammar theory, is beneficial for such tasks." } ], "y": { "@USE@": { "gold_contexts": [ [ "5d3c08596677a1f8ac48fa17766bb4-C001-11" ], [ "5d3c08596677a1f8ac48fa17766bb4-C001-31", "5d3c08596677a1f8ac48fa17766bb4-C001-34" ], [ "5d3c08596677a1f8ac48fa17766bb4-C001-53" ], [ "5d3c08596677a1f8ac48fa17766bb4-C001-57" ], [ "5d3c08596677a1f8ac48fa17766bb4-C001-74" ], [ "5d3c08596677a1f8ac48fa17766bb4-C001-104" ], [ "5d3c08596677a1f8ac48fa17766bb4-C001-110" ], [ "5d3c08596677a1f8ac48fa17766bb4-C001-118" ], [ "5d3c08596677a1f8ac48fa17766bb4-C001-124" ] ], "cite_sentences": [ "5d3c08596677a1f8ac48fa17766bb4-C001-11", "5d3c08596677a1f8ac48fa17766bb4-C001-34", "5d3c08596677a1f8ac48fa17766bb4-C001-53", "5d3c08596677a1f8ac48fa17766bb4-C001-57", "5d3c08596677a1f8ac48fa17766bb4-C001-74", "5d3c08596677a1f8ac48fa17766bb4-C001-104", "5d3c08596677a1f8ac48fa17766bb4-C001-110", "5d3c08596677a1f8ac48fa17766bb4-C001-118", "5d3c08596677a1f8ac48fa17766bb4-C001-124" ] }, "@UNSURE@": { "gold_contexts": [ [ "5d3c08596677a1f8ac48fa17766bb4-C001-34" ] ], "cite_sentences": [ "5d3c08596677a1f8ac48fa17766bb4-C001-34" ] }, "@DIF@": { "gold_contexts": [ [ "5d3c08596677a1f8ac48fa17766bb4-C001-56" ], [ "5d3c08596677a1f8ac48fa17766bb4-C001-68" ], [ "5d3c08596677a1f8ac48fa17766bb4-C001-78" ], [ "5d3c08596677a1f8ac48fa17766bb4-C001-97" ], [ 
"5d3c08596677a1f8ac48fa17766bb4-C001-125" ] ], "cite_sentences": [ "5d3c08596677a1f8ac48fa17766bb4-C001-56", "5d3c08596677a1f8ac48fa17766bb4-C001-68", "5d3c08596677a1f8ac48fa17766bb4-C001-78", "5d3c08596677a1f8ac48fa17766bb4-C001-97", "5d3c08596677a1f8ac48fa17766bb4-C001-125" ] }, "@BACK@": { "gold_contexts": [ [ "5d3c08596677a1f8ac48fa17766bb4-C001-69" ] ], "cite_sentences": [ "5d3c08596677a1f8ac48fa17766bb4-C001-69" ] }, "@EXT@": { "gold_contexts": [ [ "5d3c08596677a1f8ac48fa17766bb4-C001-75" ], [ "5d3c08596677a1f8ac48fa17766bb4-C001-77" ] ], "cite_sentences": [ "5d3c08596677a1f8ac48fa17766bb4-C001-75", "5d3c08596677a1f8ac48fa17766bb4-C001-77" ] } } }, "ABC_547551e556d8aa919f731da99424c9_11": { "x": [ { "sent_id": "547551e556d8aa919f731da99424c9-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-2", "text": "A lot of the recent success in natural language processing (NLP) has been driven by distributed vector representations of words trained on large amounts of text in an unsupervised manner." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-3", "text": "These representations are typically used as general purpose features for words across a range of NLP problems." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-4", "text": "However, extending this success to learning representations of sequences of words, such as sentences, remains an open problem." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-5", "text": "Recent work has explored unsupervised as well as supervised learning techniques with different training objectives to learn general purpose fixed-length sentence representations." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-6", "text": "In this work, we present a simple, effective multi-task learning framework for sentence representations that combines the inductive biases of diverse training objectives in a single model." 
}, { "sent_id": "547551e556d8aa919f731da99424c9-C001-7", "text": "We train this model on several data sources with multiple training objectives on over 100 million sentences." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-8", "text": "Extensive experiments demonstrate that sharing a single recurrent sentence encoder across weakly related tasks leads to consistent improvements over previous methods." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-9", "text": "We present substantial improvements in the context of transfer learning and low-resource settings using our learned general-purpose representations." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-129", "text": "We observe in Table 2 that adding more tasks improves the transfer performance of our model." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-130", "text": "Increasing the capacity of our sentence encoder with more hidden units (+L) as well as an additional layer (+2L) also leads to improved transfer performance." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-131", "text": "We observe gains of 1.1-2.0% on the sentiment classification tasks (MR, CR, SUBJ & MPQA) over Infersent." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-10", "text": "----------------------------------" }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-11", "text": "**INTRODUCTION**" }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-12", "text": "Transfer learning has driven a number of recent successes in computer vision and NLP." 
}, { "sent_id": "547551e556d8aa919f731da99424c9-C001-13", "text": "Computer vision tasks like image captioning (Xu et al., 2015) and visual question answering typically use CNNs pretrained on ImageNet (Krizhevsky et al., 2012; Simonyan & Zisserman, 2014) to extract representations of the image, while several natural language tasks such as reading comprehension and sequence labeling (Lample et al., 2016) have benefited from pretrained word embeddings (Mikolov et al., 2013; Pennington et al., 2014) that are either fine-tuned for a specific task or held fixed." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-14", "text": "Many neural NLP systems are initialized with pretrained word embeddings but learn their representations of words in context from scratch, in a task-specific manner from supervised learning signals." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-15", "text": "However, learning these representations reliably from scratch is not always feasible, especially in low-resource settings, where we believe that using general purpose sentence representations will be beneficial." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-16", "text": "Some recent work has addressed this by learning general-purpose sentence representations Wieting et al., 2015; Hill et al., 2016; Conneau et al., 2017; McCann et al., 2017; Jernite et al., 2017; Nie et al., 2017; Pagliardini et al., 2017) ." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-17", "text": "However, there exists no clear consensus yet on what training objective or methodology is best suited to this goal." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-18", "text": "Understanding the inductive biases of distinct neural models is important for guiding progress in representation learning." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-19", "text": "Shi et al. (2016) and Belinkov et al. 
(2017) demonstrate that neural machine translation (NMT) systems appear to capture morphology and some syntactic properties." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-20", "text": "Shi et al. (2016) also present evidence that sequence-to-sequence parsers more strongly encode source language syntax." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-21", "text": "Similarly, Adi et al. (2016) probe representations extracted by sequence autoencoders, word embedding averages, and skip-thought vectors with a multi-layer perceptron (MLP) classifier to study whether sentence characteristics such as length, word content and word order are encoded." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-22", "text": "To generalize across a diverse set of tasks, it is important to build representations that encode several aspects of a sentence." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-23", "text": "Neural approaches to tasks such as skip-thoughts, machine translation, natural language inference, and constituency parsing likely have different inductive biases." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-24", "text": "Our work exploits this in the context of a simple one-to-many multi-task learning (MTL) framework, wherein a single recurrent sentence encoder is shared across multiple tasks." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-25", "text": "We hypothesize that sentence representations learned by training on a reasonably large number of weakly related tasks will generalize better to novel tasks unseen during training, since this process encodes the inductive biases of multiple models." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-26", "text": "This hypothesis is based on the theoretical work of Baxter et al. (2000) ." 
}, { "sent_id": "547551e556d8aa919f731da99424c9-C001-27", "text": "While our work aims at learning fixed-length distributed sentence representations, it is not always practical to assume that the entire \"meaning\" of a sentence can be encoded into a fixed-length vector." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-28", "text": "We merely hope to capture some of its characteristics that could be of use in a variety of tasks." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-29", "text": "The primary contribution of our work is to combine the benefits of diverse sentence-representation learning objectives into a single multi-task framework." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-30", "text": "To the best of our knowledge, this is the first large-scale reusable sentence representation model obtained by combining a set of training objectives with the level of diversity explored here, i.e. multi-lingual NMT, natural language inference, constituency parsing and skip-thought vectors." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-31", "text": "We demonstrate through extensive experimentation that representations learned in this way lead to improved performance across a diverse set of novel tasks not used in the learning of our representations." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-32", "text": "Such representations facilitate low-resource learning as exhibited by significant improvements to model performance for new tasks in the low labelled data regime, achieving comparable performance to a few models trained from scratch using only 6% of the available training set on the Quora duplicate question dataset." 
}, { "sent_id": "547551e556d8aa919f731da99424c9-C001-33", "text": "----------------------------------" }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-34", "text": "**RELATED WORK**" }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-35", "text": "The problem of learning distributed representations of phrases and sentences dates back over a decade." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-36", "text": "For example, Mitchell & Lapata (2008) present an additive and multiplicative linear composition function of the distributed representations of individual words." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-37", "text": "Clark & Pulman (2007) combine symbolic and distributed representations of words using tensor products." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-38", "text": "Advances in learning better distributed representations of words (Mikolov et al., 2013; Pennington et al., 2014) combined with deep learning have made it possible to learn complex non-linear composition functions of an arbitrary number of word embeddings using convolutional or recurrent neural networks (RNNs)." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-39", "text": "A network's representation of the last element in a sequence, which is a non-linear composition of all inputs, is typically assumed to contain a squashed \"summary\" of the sentence." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-40", "text": "Most work in supervised learning for NLP builds task-specific representations of sentences rather than general-purpose ones." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-41", "text": "Notably, skip-thought vectors , an extension of the skip-gram model for word embeddings (Mikolov et al., 2013) to sentences, learn re-usable sentence representations from weakly labeled data." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-42", "text": "Unfortunately, these models take weeks or often months to train." 
}, { "sent_id": "547551e556d8aa919f731da99424c9-C001-43", "text": "Hill et al. (2016) address this by considering faster alternatives such as sequential denoising autoencoders and shallow log-linear models." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-44", "text": "Arora et al. (2016) , however, demonstrate that simple word embedding averages are comparable to more complicated models like skip-thoughts." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-45", "text": "More recently, Conneau et al. (2017) show that a completely supervised approach to learning sentence representations from natural language inference data outperforms all previous approaches on transfer learning benchmarks." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-46", "text": "Here we use the terms \"transfer learning performance\" on \"transfer tasks\" to mean the performance of sentence representations evaluated on tasks unseen during training." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-47", "text": "McCann et al. (2017) demonstrated that representations learned by state-of-the-art large-scale NMT systems also generalize well to other tasks." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-48", "text": "However, their use of an attention mechanism prevents the learning of a single fixed-length vector representation of a sentence." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-49", "text": "As a result, they present a bi-attentive classification network that composes information present in all of the model's hidden states to achieve improvements over a corresponding model trained from scratch." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-50", "text": "Jernite et al. (2017) and Nie et al. (2017) demonstrate that discourse-based objectives can also be leveraged to learn good sentence representations." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-51", "text": "Our work is most similar to that of Luong et al.
(2015) , who train a many-to-many sequence-to-sequence model on a diverse set of weakly related tasks that includes machine translation, constituency parsing, image captioning, sequence autoencoding, and intra-sentence skip-thoughts." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-52", "text": "There are two key differences between that work and our own." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-53", "text": "First, like McCann et al. (2017) , their use of an attention mechanism prevents learning a fixed-length vector representation for a sentence." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-54", "text": "Second, their work aims for improvements on the same tasks on which the model is trained, as opposed to learning re-usable sentence representations that transfer elsewhere." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-55", "text": "We further present a fine-grained analysis of how different tasks contribute to the encoding of different information signals in our representations following work by Shi et al. (2016) and Adi et al. (2016) ." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-56", "text": "Hashimoto et al. (2016) similarly present a multi-task framework for textual entailment with task supervision at different levels of learning. \"Universal\" multi-task models have also been successfully explored in the context of computer vision problems (Kokkinos, 2016; Eigen & Fergus, 2015) ." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-57", "text": "----------------------------------" }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-58", "text": "**SEQUENCE-TO-SEQUENCE LEARNING**" }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-59", "text": "Five out of the six tasks that we consider for multi-task learning are formulated as sequence-to-sequence problems."
}, { "sent_id": "547551e556d8aa919f731da99424c9-C001-60", "text": "Briefly, sequence-to-sequence models are a specific case of encoder-decoder models where the inputs and outputs are sequential." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-61", "text": "They directly model the conditional distribution of outputs given inputs P (y|x)." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-62", "text": "The input x and output y are sequences x_1, x_2, . . . , x_m and y_1, y_2, . . . , y_n." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-63", "text": "The encoder produces a fixed length vector representation h x of the input, which the decoder then conditions on to generate an output." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-64", "text": "The decoder is auto-regressive and breaks down the joint probability of outputs into a product of conditional probabilities via the chain rule: P (y|x) = \u220f_{t=1}^{n} P (y_t | y_1, . . . , y_{t-1}, x)." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-65", "text": "Cho et al. (2014) and related work use encoders and decoders parameterized as RNN variants such as Long Short-term Memory (LSTMs) (Hochreiter & Schmidhuber, 1997) or Gated Recurrent Units (GRUs) (Chung et al., 2014) ." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-66", "text": "The hidden representation h x is typically the last hidden state of the encoder RNN. Later work alleviates the gradient bottleneck between the encoder and the decoder by introducing an attention mechanism that allows the decoder to condition on every hidden state of the encoder RNN instead of only the last one." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-67", "text": "In this work, as in Hill et al. (2016) and related work, we do not employ an attention mechanism." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-68", "text": "This enables us to obtain a single, fixed-length, distributed sentence representation."
}, { "sent_id": "547551e556d8aa919f731da99424c9-C001-69", "text": "To diminish the effects of vanishing gradients, we condition every decoding step on the encoder hidden representation h x ." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-70", "text": "We use a GRU for the encoder and decoder in the interest of computational speed." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-71", "text": "The encoder is a bidirectional GRU while the decoder is a unidirectional conditional GRU whose parameterization is as follows: r_t = \u03c3(W_r x_t + U_r h_{t-1} + C_r h_x), z_t = \u03c3(W_z x_t + U_z h_{t-1} + C_z h_x), h\u0303_t = tanh(W_d x_t + U_d (r_t \u2299 h_{t-1}) + C_d h_x), h_t = (1 \u2212 z_t) \u2299 h_{t-1} + z_t \u2299 h\u0303_t." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-72", "text": "The encoder representation h x is provided as conditioning information to the reset gate, update gate and hidden state computation in the GRU via the parameters C r , C z and C d to avoid attenuation of information from the encoder." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-73", "text": "**3.1 MULTI-TASK SEQUENCE-TO-SEQUENCE LEARNING** Dong et al. (2015) present a simple one-to-many multi-task sequence-to-sequence learning model for NMT that uses a shared encoder for English and task-specific decoders for multiple target languages." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-74", "text": "Luong et al. (2015) extend this by also considering many-to-one (many encoders, one decoder) and many-to-many architectures." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-75", "text": "In this work, we consider a one-to-many model since it lends itself naturally to the idea of combining inductive biases from different training objectives." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-76", "text": "The same bidirectional GRU encodes the input sentences from different tasks into a compressed summary h x which is then used to condition a task-specific GRU to produce the output sentence."
}, { "sent_id": "547551e556d8aa919f731da99424c9-C001-77", "text": "----------------------------------" }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-78", "text": "**TRAINING OBJECTIVES & EVALUATION**" }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-79", "text": "Our motivation for multi-task training stems from theoretical insights presented in Baxter (2000) ." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-80", "text": "We refer readers to that work for a detailed discussion of results, but the conclusions most relevant to this discussion are (i) that learning multiple related tasks jointly results in good generalization as measured by the number of training examples required per task; and (ii) that inductive biases learned on sufficiently many training tasks are likely to be good for learning novel tasks drawn from the same environment." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-81", "text": "We select the following training objectives to learn general-purpose sentence embeddings." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-82", "text": "Our desiderata for the task collection were: sufficient diversity, existence of fairly large datasets for training, and success as standalone training objectives for sentence representations." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-83", "text": "**Skip-thought vectors** Skip-thought vectors are an extension of skip-gram word embedding models to sentences." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-84", "text": "The task typically requires a corpus of contiguous sentences, for which we use the BookCorpus (Zhu et al., 2015) ." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-85", "text": "The learning objective is to simultaneously predict the next and previous sentences from the current sentence."
}, { "sent_id": "547551e556d8aa919f731da99424c9-C001-86", "text": "The encoder for the current sentence and the decoders for the previous (STP) and next sentence (STN) are typically parameterized as separate RNNs." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-132", "text": "We demonstrate substantial gains on TREC (6% over Infersent and roughly 2% over the CNN-LSTM), outperforming even a competitive supervised baseline." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-87", "text": "We also consider training skip-thoughts by predicting only the next sentence given the current, motivated by results in Tang et al. (2017) where it is demonstrated that predicting the next sentence alone leads to comparable performance." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-90", "text": "----------------------------------" }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-91", "text": "**NEURAL MACHINE TRANSLATION**" }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-88", "text": "Prior work demonstrated that NMT can be formulated as a sequence-to-sequence learning problem where the input is a sentence in the source language and the output is its corresponding translation in the target language. We use a parallel corpus of around 4.5 million English-German (De) sentence pairs from WMT15 and 40 million English-French (Fr) sentence pairs from WMT14." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-89", "text": "We train our representations using multiple target languages motivated by improvements demonstrated by Dong et al. (2015) ." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-92", "text": "**Constituency Parsing (linearized parse tree construction)** Earlier work showed that a sequence-to-sequence approach to constituency parsing is viable." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-93", "text": "The input to the encoder is the sentence itself and the decoder produces its linearized parse tree."
}, { "sent_id": "547551e556d8aa919f731da99424c9-C001-94", "text": "We train on 3 million weakly labeled parses obtained by parsing a random subset of the 1-billion word corpus with the Puck GPU parser 2 along with gold parses from sections 0-21 of the WSJ section of Penn Treebank." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-95", "text": "Gold parses are duplicated 5 times and shuffled in with the weakly labeled parses to have a roughly 1:5 ratio of gold to noisy parses." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-96", "text": "**Natural Language Inference** Natural language inference (NLI) is a 3-way classification problem." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-97", "text": "Given a premise and a hypothesis sentence, the objective is to classify their relationship as either entailment, contradiction, or neutral." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-98", "text": "In contrast to the previous tasks, we do not formulate this as sequence-to-sequence learning." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-99", "text": "We use the shared recurrent sentence encoder to encode both the premise and hypothesis into fixed length vectors u and v, respectively." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-100", "text": "We then feed the vector [u; v; |u \u2212 v|; u * v], which is a concatenation of the premise and hypothesis vectors and their respective absolute difference and Hadamard product, to an MLP that performs the 3-way classification." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-101", "text": "This is the same classification strategy adopted by Conneau et al. (2017) ." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-102", "text": "We train on a collection of about 1 million sentence pairs from the SNLI (Bowman et al., 2015) and MultiNLI (Williams et al., 2017) corpora." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-103", "text": "Dong et al.
(2015) use periodic task alternations with equal training ratios for every task." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-104", "text": "In contrast, Luong et al. (2015) alter the training ratios for each task based on the size of their respective training sets." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-105", "text": "Specifically, the training ratio for a particular task, \u03b1 i , is the fraction of the number of training examples in that task to the total number of training samples across all tasks." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-106", "text": "The authors then perform \u03b1 i * N parameter updates on task i before selecting a new task at random proportional to the training ratios, where N is a predetermined constant." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-133", "text": "We see similar gains (2.3%) on paraphrase identification (MRPC), closing the gap on supervised approaches trained from scratch." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-134", "text": "The addition of constituency parsing improves performance on sentence relatedness (SICK-R) and entailment (SICK-E) consistent with observations made by Bowman et al. (2016) ." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-135", "text": "----------------------------------" }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-136", "text": "**EXPERIMENTAL RESULTS & DISCUSSION**" }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-137", "text": "----------------------------------" }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-138", "text": "**IT IS EVIDENT FROM**" }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-139", "text": "In Table 4 , we show that simply training an MLP on top of our fixed sentence representations outperforms several strong & complex supervised approaches that use attention mechanisms, even on this fairly large dataset."
}, { "sent_id": "547551e556d8aa919f731da99424c9-C001-140", "text": "For example, we observe a 0.2-0.5% improvement over the decomposable attention model (Parikh et al., 2016) ." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-141", "text": "When using only a small fraction of the training data, indicated by the columns 1k-25k, we are able to outperform the Siamese and Multi-Perspective CNN using roughly 6% of the available training set." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-142", "text": "We also outperform the Deconv LVM model proposed by Shen et al. (2017) in this low-resource setting." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-107", "text": "We take a simpler approach and pick a new sequence-to-sequence task to train on, sampled uniformly at random, after every parameter update." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-108", "text": "An NLI minibatch is interspersed after every ten parameter updates on sequence-to-sequence tasks (this was chosen so as to complete roughly 6 epochs of the dataset after 7 days of training)." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-109", "text": "Our approach is described formally in the Algorithm below." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-110", "text": "Model details can be found in section 7 in the Appendix." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-111", "text": "Require: a set of k tasks with a common source language, a shared encoder E across all tasks, and a set of k task-specific decoders D1 . . . Dk." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-112", "text": "Let \u03b8 denote each model's parameters and \u03b1 = (p1 . . . pk) a probability vector denoting the probability of sampling each task, such that \u03a3 i pi = 1." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-113", "text": "Further, let IP1 . . . IPk denote the datasets for each task" }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-114", "text": "and L a loss function."
}, { "sent_id": "547551e556d8aa919f731da99424c9-C001-115", "text": "----------------------------------" }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-116", "text": "**EVALUATION STRATEGIES, EXPERIMENTAL RESULTS & DISCUSSION**" }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-117", "text": "In this section, we describe our approach to evaluate the quality of our learned representations, present the results of our evaluation and discuss our findings." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-118", "text": "----------------------------------" }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-119", "text": "**EVALUATION STRATEGY**" }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-120", "text": "We follow a similar evaluation protocol to those presented in Hill et al. (2016) and Conneau et al. (2017) , which is to use our learned representations as features for a low complexity classifier (typically linear) on a novel supervised task/domain unseen during training without updating the parameters of our sentence representation model." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-121", "text": "We also consider such a transfer learning evaluation in an artificially constructed low-resource setting." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-122", "text": "In addition, we evaluate the quality of our learned individual word representations using standard benchmarks (Faruqui & Dyer, 2014; Tsvetkov et al., 2015) ." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-123", "text": "The choice of transfer tasks and evaluation framework 3 are borrowed largely from Conneau et al. (2017) ." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-124", "text": "We provide a condensed summary of the tasks in section 10 in the Appendix but refer readers to their paper for a more detailed description."
}, { "sent_id": "547551e556d8aa919f731da99424c9-C001-125", "text": "Table 2 presents the results of training logistic regression on 10 different supervised transfer tasks using different fixed-length sentence representations." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-126", "text": "Supervised approaches trained from scratch on some of these tasks are also presented for comparison." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-127", "text": "We present performance ablations when adding more tasks and increasing the number of hidden units in our GRU (+L)." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-128", "text": "Ablation specifics are presented in section 9 of the Appendix." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-143", "text": "Unlike Conneau et al. (2017) , who use pretrained GloVe word embeddings, we learn our word embeddings from scratch." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-144", "text": "Somewhat surprisingly, in Table 3 we observe that the learned word embeddings are competitive with popular methods such as GloVe, word2vec, and fasttext (Bojanowski et al., 2016) on the benchmarks presented by Faruqui & Dyer (2014) and Tsvetkov et al. (2015) ." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-145", "text": "In Table 5 , we probe our sentence representations to determine if certain sentence characteristics and syntactic properties can be inferred following work by Adi et al. (2016) and Shi et al. (2016) ." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-146", "text": "We observe that syntactic properties are better encoded with the addition of multi-lingual NMT and parsing." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-147", "text": "Representations learned solely from NLI do appear to encode syntax, but incorporation into our multi-task framework does not amplify this signal."
}, { "sent_id": "547551e556d8aa919f731da99424c9-C001-148", "text": "Similarly, we observe that sentence characteristics such as length and word order are better encoded with the addition of parsing." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-149", "text": "In Appendix Table 6 , we note that our sentence representations outperform skip-thoughts and are on par with Infersent for image-caption retrieval." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-150", "text": "Table 2 : Results are reported across all 10 tasks." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-151", "text": "For MRPC and STSB we consider only the F1 score and Spearman correlation respectively, and we multiply the SICK-R scores by 100 so that all differences are on the same scale." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-152", "text": "Bold numbers indicate the best performing transfer model on a given task." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-153", "text": "Underlines are used for each task to indicate both our best performing model as well as the best performing transfer model that isn't ours." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-154", "text": "We also observe that comparing sentences using cosine similarity correlates reasonably well with their relatedness on semantic textual similarity benchmarks (Appendix Table 7 )." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-155", "text": "We also present qualitative analysis of our learned representations by visualizations using dimensionality reduction techniques ( Figure 1 ) and nearest neighbor exploration (Appendix Table 8 )." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-156", "text": "Figure 1 shows t-SNE plots of our sentence representations on three different datasets: SUBJ, TREC and DBpedia." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-157", "text": "DBpedia is a large corpus of sentences from Wikipedia labeled by category and used by Zhang et al. (2015) ."
}, { "sent_id": "547551e556d8aa919f731da99424c9-C001-158", "text": "Sentences appear to cluster reasonably well according to their labels." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-159", "text": "The clustering also appears better than that demonstrated in Figure 2 of earlier work on TREC and SUBJ." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-160", "text": "Appendix Table 8 contains sentences from the BookCorpus and their nearest neighbors." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-161", "text": "Sentences with some lexical overlap and similar discourse structure appear to be clustered together." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-162", "text": "Mrk\u0161i\u0107 et al. (2017) respectively." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-163", "text": "We also report QVEC benchmarks (Tsvetkov et al., 2015) . Table 4 : Supervised & low-resource classification accuracies on the Quora duplicate question dataset, with columns for models trained on 1k, 5k, 10k, 25k and all (400k) examples." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-164", "text": "Accuracies are reported corresponding to the number of training examples used." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-165", "text": "The first 6 rows are taken from earlier work, the next 4 are from Tomar et al. (2017) , the next 5 from Shen et al. (2017) and the last 4 rows are our experiments using Infersent (Conneau et al., 2017) and our models." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-166", "text": "----------------------------------" }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-167", "text": "**CONCLUSION & FUTURE WORK**" }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-168", "text": "We present a multi-task framework for learning general-purpose fixed-length sentence representations."
}, { "sent_id": "547551e556d8aa919f731da99424c9-C001-169", "text": "Our primary motivation is to encapsulate the inductive biases of several diverse training signals used to learn sentence representations into a single model." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-170", "text": "Our multi-task framework includes a combination of sequence-to-sequence tasks such as multi-lingual NMT, constituency parsing and skip-thought vectors as well as a classification task: natural language inference." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-171", "text": "We demonstrate that the learned representations yield competitive or superior results to previous general-purpose sentence representation methods." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-172", "text": "We also observe that this approach produces good word embeddings." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-173", "text": "Table 5 : Evaluation of sentence representations by probing for certain sentence characteristics and syntactic properties." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-174", "text": "Sentence length, word content & word order from Adi et al. (2016) and sentence active/passive, tense and top level syntactic sequence (TSS) from Shi et al. (2016) ." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-175", "text": "Numbers reported are the accuracy with which the models were able to predict certain characteristics." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-176", "text": "In future work, we would like to understand and interpret the inductive biases that our model learns, and observe how they change with the addition of different tasks beyond just our simple analysis of sentence characteristics and syntax." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-177", "text": "Having a rich, continuous sentence representation space could allow the application of state-of-the-art generative models of images such as that of Nguyen et al.
(2016) to language." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-178", "text": "One could also consider controllable text generation by directly manipulating the sentence representations and realizing it by decoding with a conditional language model." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-179", "text": "----------------------------------" }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-180", "text": "**APPENDIX 7 MODEL TRAINING**" }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-181", "text": "We present some architectural specifics and training details of our multi-task framework." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-182", "text": "Our shared encoder uses a common word embedding lookup table and GRU." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-183", "text": "We experiment with unidirectional, bidirectional and 2 layer bidirectional GRUs (details in Appendix section 9)." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-184", "text": "For each task, every decoder has its separate word embedding lookups, conditional GRUs and fully connected layers that project the GRU hidden states to the target vocabularies." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-185", "text": "The last hidden state of the encoder is used as the initial hidden state of the decoder and is also presented as input to all the gates of the GRU at every time step." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-186", "text": "For natural language inference, the same encoder is used to encode both the premise and hypothesis and a concatenation of their representations along with the absolute difference and Hadamard product (as described in Conneau et al. (2017) ) are given to a single layer MLP with a dropout (Srivastava et al., 2014) rate of 0.3." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-187", "text": "All models use word embeddings of 512 dimensions and GRUs with either 1500 or 2048 hidden units."
}, { "sent_id": "547551e556d8aa919f731da99424c9-C001-188", "text": "We used minibatches of 48 examples and the Adam optimizer (Kingma & Ba, 2014) with a learning rate of 0.002." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-189", "text": "Models were trained for 7 days on an Nvidia Tesla P100-SXM2-16GB GPU." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-190", "text": "While earlier work reports close to a month of training, we train for only 7 days, made possible by advancements in GPU hardware and software (cuDNN RNNs)." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-191", "text": "We did not tune any of the architectural details and hyperparameters because we were unable to identify any clear criterion on which to tune them." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-192", "text": "Gains in performance on a specific task do not often translate to better transfer performance." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-193", "text": "----------------------------------" }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-194", "text": "**VOCABULARY EXPANSION & REPRESENTATION POOLING**" }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-195", "text": "In addition to performing 10-fold cross-validation to determine the L2 regularization penalty on the logistic regression models, we also tune the way in which our sentence representations are generated from the hidden states corresponding to words in a sentence." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-196", "text": "For example, some models use the last hidden state while Conneau et al. (2017) perform max-pooling across all of the hidden states." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-197", "text": "We consider both of these approaches and pick the one with better performance on the validation set."
}, { "sent_id": "547551e556d8aa919f731da99424c9-C001-198", "text": "We note that max-pooling works best on sentiment tasks such as MR, CR, SUBJ and MPQA, while the last hidden state works better on all other tasks." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-199", "text": "We also employ vocabulary expansion on all tasks as in prior work by training a linear regression to map from the space of pre-trained word embeddings (GloVe) to our model's word embeddings." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-200", "text": "----------------------------------" }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-201", "text": "**MULTI-TASK MODEL DETAILS**" }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-202", "text": "This section describes the specifics of our multi-task ablations in the experiments section." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-203", "text": "These definitions hold for all tables except for 3 and 5." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-204", "text": "We refer to skip-thought next as STN, French and German NMT as Fr and De, natural language inference as NLI, skip-thought previous as STP and parsing as Par." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-205", "text": "+STN +Fr +De : The sentence representation h x is the concatenation of the final hidden vectors from a forward GRU with 1500-dimensional hidden vectors and a bidirectional GRU, also with 1500-dimensional hidden vectors." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-206", "text": "+STN +Fr +De +NLI : The sentence representation h x is the concatenation of the final hidden vectors from a bidirectional GRU with 1500-dimensional hidden vectors and another bidirectional GRU with 1500-dimensional hidden vectors trained without NLI." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-207", "text": "In Tables 3 and 5 we do not concatenate the representations of multiple models. Prior work and Conneau et al.
(2017) provide a detailed description of tasks that are typically used to evaluate sentence representations." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-208", "text": "We provide a condensed summary and refer readers to their work for a more thorough description." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-209", "text": "----------------------------------" }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-210", "text": "**DESCRIPTION OF EVALUATION TASKS**" }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-211", "text": "----------------------------------" }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-212", "text": "**TEXT CLASSIFICATION**" }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-213", "text": "We evaluate on text classification benchmarks -sentiment classification on movie reviews (MR), product reviews (CR) and Stanford sentiment (SST), question type classification (TREC), subjectivity/objectivity classification (SUBJ) and opinion polarity (MPQA)." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-214", "text": "Representations are used to train a logistic regression classifier with 10-fold cross validation to tune the L2 weight penalty." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-215", "text": "The evaluation metric for all these tasks is classification accuracy." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-216", "text": "----------------------------------" }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-217", "text": "**PARAPHRASE IDENTIFICATION**" }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-218", "text": "We also evaluate on pairwise text classification tasks such as paraphrase identification on the Microsoft Research Paraphrase Corpus (MRPC) corpus." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-219", "text": "This is a binary classification problem to identify if two sentences are paraphrases of each other." 
}, { "sent_id": "547551e556d8aa919f731da99424c9-C001-220", "text": "The evaluation metric is classification accuracy and F1." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-221", "text": "----------------------------------" }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-222", "text": "**ENTAILMENT AND SEMANTIC RELATEDNESS**" }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-223", "text": "To test if similar sentences share similar representations, we evaluate on the SICK relatedness (SICK-R) task where a linear model is trained to output a score from 1 to 5 indicating the relatedness of two sentences." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-224", "text": "We also evaluate using the entailment labels in the same dataset (SICK-E) which is a binary classification problem." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-225", "text": "The evaluation metric for SICK-R is Pearson correlation and classification accuracy for SICK-E." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-226", "text": "----------------------------------" }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-227", "text": "**SEMANTIC TEXTUAL SIMILARITY**" }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-228", "text": "In this evaluation, we measure the relatedness of two sentences using only the cosine similarity between their representations." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-229", "text": "We use the similarity textual similarity (STS) benchmark tasks from 2012-2016 (STS12, STS13, STS14, STS15, STS16, STSB)." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-230", "text": "The STS dataset contains sentences from a diverse set of data sources." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-231", "text": "The evaluation criteria is Pearson correlation." 
}, { "sent_id": "547551e556d8aa919f731da99424c9-C001-232", "text": "----------------------------------" }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-233", "text": "**IMAGE-CAPTION RETRIEVAL**" }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-234", "text": "Image-caption retrieval is typically formulated as a ranking task wherein images are retrieved and ranked based on a textual description and vice-versa." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-235", "text": "We use 113k training images from MSCOCO with 5k images for validation and 5k for testing." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-236", "text": "Image features are extracted using a pre-trained 110 layer ResNet." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-237", "text": "The evaluation criterion is Recall@K and the median K across 5 different splits of the data." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-238", "text": "----------------------------------" }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-239", "text": "**QUORA DUPLICATE QUESTION CLASSIFICATION**" }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-240", "text": "In addition to the above tasks which were considered by Conneau et al. (2017) , we also evaluate on the recently published Quora duplicate question dataset 6 since it is an order of magnitude larger than the others (approximately 400,000 question pairs)." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-241", "text": "The task is to correctly identify question pairs that are duplicates of one another, which we formulate as a binary classification problem." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-242", "text": "We use the same data splits as in ." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-243", "text": "Given the size of this data, we consider a more expressive classifier on top of the representations of both questions." 
}, { "sent_id": "547551e556d8aa919f731da99424c9-C001-244", "text": "Specifically, we train a 4 layer MLP with 1024 hidden units, with a dropout rate of 0.5 after every hidden layer." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-245", "text": "The evaluation criterion is classification accuracy." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-246", "text": "We also artificially create a low-resource setting by reducing the number of training examples between 1,000 and 25,000 using the same splits as Shen et al. (2017) ." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-247", "text": "----------------------------------" }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-248", "text": "**SENTENCE CHARACTERISTICS & SYNTAX**" }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-249", "text": "In an attempt to understand what information is encoded in by sentence representations, we consider six different classification tasks where the objective is to predict sentence characteristics such as length, word content and word order (Adi et al., 2016) or syntactic properties such as active/passive, tense and the top syntactic sequence (TSS) from the parse tree of a sentence (Shi et al., 2016) ." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-250", "text": "The sentence characteristic tasks are setup in the same way as described in Adi et al. (2016) ." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-251", "text": "The length task is an 8-way classification problem where sentence lengths are binned into 8 ranges." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-252", "text": "The content task is formulated as a binary classification problem that takes a concatenation of a sentence representation u \u2208 R k and a word representation w \u2208 R d to determine if the word is contained in the sentence." 
}, { "sent_id": "547551e556d8aa919f731da99424c9-C001-253", "text": "The order task is an extension of the content task where a concatenation of the sentence representation and word representations of two words in sentence is used to determine if the first word occurs before or after the second." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-254", "text": "We use a random subset of the 1-billion-word dataset for these experiments that were not used to train our multi-task representations." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-255", "text": "The syntactic properties tasks are setup in the same way as described in Shi et al. (2016) .The passive and tense tasks are characterized as binary classification problems given a sentence's representation." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-256", "text": "The former's objective is to determine if a sentence is written in active/passive voice while the latter's objective is to determine if the sentence is in the past tense or not." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-257", "text": "The top syntactic sequence (TSS) is a 20-way classification problem with 19 most frequent top syntactic sequences and 1 miscellaneous class." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-258", "text": "We use the same dataset as the authors but different training, validation and test splits." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-259", "text": "Table 7 : Evaluation of sentence representations on the semantic textual similarity benchmarks." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-260", "text": "Numbers reported are Pearson Correlations x100." }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-261", "text": "Skipthought, GloVe average, GloVe TF-IDF, GloVe + WR (U) and all supervised numbers were taken from Arora et al. (2016) and Wieting et al. (2015) and Charagram-phrase numbers were taken from Wieting et al. (2016) ." 
}, { "sent_id": "547551e556d8aa919f731da99424c9-C001-262", "text": "Other numbers were obtained from the evaluation suite provided by Conneau et al. (2017)" }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-263", "text": "----------------------------------" }, { "sent_id": "547551e556d8aa919f731da99424c9-C001-264", "text": "**CAPTION**" } ], "y": { "@BACK@": { "gold_contexts": [ [ "547551e556d8aa919f731da99424c9-C001-16" ], [ "547551e556d8aa919f731da99424c9-C001-45" ], [ "547551e556d8aa919f731da99424c9-C001-196" ] ], "cite_sentences": [ "547551e556d8aa919f731da99424c9-C001-16", "547551e556d8aa919f731da99424c9-C001-45", "547551e556d8aa919f731da99424c9-C001-196" ] }, "@USE@": { "gold_contexts": [ [ "547551e556d8aa919f731da99424c9-C001-101" ], [ "547551e556d8aa919f731da99424c9-C001-123" ], [ "547551e556d8aa919f731da99424c9-C001-165" ], [ "547551e556d8aa919f731da99424c9-C001-186" ], [ "547551e556d8aa919f731da99424c9-C001-195", "547551e556d8aa919f731da99424c9-C001-196" ], [ "547551e556d8aa919f731da99424c9-C001-207" ], [ "547551e556d8aa919f731da99424c9-C001-262" ] ], "cite_sentences": [ "547551e556d8aa919f731da99424c9-C001-101", "547551e556d8aa919f731da99424c9-C001-123", "547551e556d8aa919f731da99424c9-C001-165", "547551e556d8aa919f731da99424c9-C001-186", "547551e556d8aa919f731da99424c9-C001-196", "547551e556d8aa919f731da99424c9-C001-207", "547551e556d8aa919f731da99424c9-C001-262" ] }, "@SIM@": { "gold_contexts": [ [ "547551e556d8aa919f731da99424c9-C001-120" ] ], "cite_sentences": [ "547551e556d8aa919f731da99424c9-C001-120" ] }, "@DIF@": { "gold_contexts": [ [ "547551e556d8aa919f731da99424c9-C001-143" ] ], "cite_sentences": [ "547551e556d8aa919f731da99424c9-C001-143" ] }, "@EXT@": { "gold_contexts": [ [ "547551e556d8aa919f731da99424c9-C001-240" ] ], "cite_sentences": [ "547551e556d8aa919f731da99424c9-C001-240" ] } } }, "ABC_28038a4fa4182ccdc6134f2138c0da_11": { "x": [ { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-1", "text": "**ABSTRACT**" }, { "sent_id": 
"28038a4fa4182ccdc6134f2138c0da-C001-2", "text": "Defining words in a textual context is a useful task both for practical purposes and for gaining insight into distributed word representations." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-3", "text": "Building on the distributional hypothesis, we argue here that the most natural formalization of definition modeling is to treat it as a sequenceto-sequence task, rather than a word-tosequence task: given an input sequence with a highlighted word, generate a contextually appropriate definition for it." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-4", "text": "We implement this approach in a Transformerbased sequence-to-sequence model." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-5", "text": "Our proposal allows to train contextualization and definition generation in an end-to-end fashion, which is a conceptual improvement over earlier works." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-6", "text": "We achieve stateof-the-art results both in contextual and non-contextual definition modeling." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-7", "text": "----------------------------------" }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-9", "text": "The task of definition modeling, introduced by Noraset et al. (2017) , consists in generating the dictionary definition of a specific word: for instance, given the word \"monotreme\" as input, the system would need to produce a definition such as \"any of an order (Monotremata) of egg-laying mammals comprising the platypuses and echidnas\"." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-10", "text": "1 Following the tradition set by lexicographers, we call the word being defined a definiendum (pl. definienda), whereas a word occurring in its definition is called a definiens (pl. definientia)." 
}, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-11", "text": "Definition modeling can prove useful in a variety of applications." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-12", "text": "Systems trained for the task may generate dictionaries for low resource languages, or extend the coverage of existing lexicographic resources where needed, e.g. of domainspecific vocabulary." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-13", "text": "Such systems may also be 1 Definition from Merriam-Webster." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-14", "text": "able to provide reading help by giving definitions for words in the text." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-15", "text": "A major intended application of definition modeling is the explication and evaluation of distributed lexical representations, also known as word embeddings (Noraset et al., 2017) ." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-16", "text": "This evaluation procedure is based on the postulate that the meaning of a word, as is captured by its embedding, should be convertible into a human-readable dictionary definition." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-17", "text": "How well the meaning is captured must impact the ability of the model to reproduce the definition, and therefore embedding architectures can be compared according to their downstream performance on definition modeling." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-18", "text": "This intended usage motivates the requirement that definition modeling architectures take as input the embedding of the definiendum and not retrain it." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-19", "text": "From a theoretical point of view, usage of word embeddings as representations of meaning (cf." 
}, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-20", "text": "Lenci, 2018; Boleda, 2019 , for an overview) is motivated by the distributional hypothesis (Harris, 1954) ." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-21", "text": "This framework holds that meaning can be inferred from the linguistic context of the word, usually seen as co-occurrence data." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-22", "text": "The context of usage is even more crucial for characterizing meanings of ambiguous or polysemous words: a definition that does not take disambiguating context into account will be of limited use (Gadetsky et al., 2018) ." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-23", "text": "We argue that definition modeling should preserve the link between the definiendum and its context of occurrence." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-24", "text": "The most natural approach to this task is to treat it as a sequence-to-sequence task, rather than a word-to-sequence task: given an input sequence with a highlighted word, generate a contextually appropriate definition for it (cf." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-25", "text": "sections 3 & 4) ." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-26", "text": "We implement this approach in a Transformer-based sequence-to-sequence model that achieves state-of-the-art performances (sections 5 & 6)." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-27", "text": "arXiv:1911.05715v1 [cs.CL] 13 Nov 2019 2 Related Work" }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-28", "text": "In their seminal work on definition modeling, Noraset et al. (2017) likened systems generating definitions to language models, which can naturally be used to generate arbitrary text." 
}, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-29", "text": "They built a sequential LSTM seeded with the embedding of the definiendum; its output at each time-step was mixed through a gating mechanism with a feature vector derived from the definiendum." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-30", "text": "Gadetsky et al. (2018) stressed that a definiendum outside of its specific usage context is ambiguous between all of its possible definitions." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-31", "text": "They proposed to first compute the AdaGram vector (Bartunov et al., 2016) for the definiendum, to then disambiguate it using a gating mechanism learned over contextual information, and finally to run a language model over the sequence of definientia embeddings prepended with the disambiguated definiendum embedding." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-32", "text": "In an attempt to produce a more interpretable model, map the definiendum to a sparse vector representation." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-33", "text": "Their architecture comprises four modules." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-34", "text": "The first encodes the context in a sentence embedding, the second converts the definiendum into a sparse vector, the third combines the context embedding and the sparse representation, passing them on to the last module which generates the definition." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-35", "text": "Related to these works, specifically tackle definition modeling in the context of Chinese-whereas all previous works on definition modeling studied English." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-36", "text": "In a Transformer-based architecture, they incorporate \"sememes\" as part of the representation of the definiendum to generate definitions." 
}, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-37", "text": "On a more abstract level, definition modeling is related to research on the analysis and evaluation of word embeddings (Levy and Goldberg, 2014a,b; Arora et al., 2018; Batchkarov et al., 2016; Swinger et al., 2018, e.g.) ." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-38", "text": "It also relates to other works associating definitions and embeddings, like the \"reverse dictionary task\" (Hill et al., 2016 )-retrieving the definiendum knowing its definition, which can be argued to be the opposite of definition modeling-or works that derive embeddings from definitions (Wang et al., 2015; Tissier et al., 2017; Bosc and Vincent, 2018) ." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-39", "text": "3 Definition modeling as a sequence-to-sequence task Gadetsky et al. (2018) remarked that words are often ambiguous or polysemous, and thus generating a correct definition requires that we either use sense-level representations, or that we disambiguate the word embedding of the definiendum." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-40", "text": "The disambiguation that Gadetsky et al. (2018) proposed was based on a contextual cue-ie." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-41", "text": "a short text fragment." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-42", "text": "As notes, the cues in Gadetsky et al.'s (2018) dataset do not necessarily contain the definiendum or even an inflected variant thereof." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-43", "text": "For instance, one training example disambiguated the word \"fool\" using the cue \"enough horsing around-let's get back to work!\"." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-44", "text": "Though the remark that definienda must be disambiguated is pertinent, the more natural formulation of such a setup would be to disambiguate the definiendum using its actual context of occurrence." 
}, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-45", "text": "In that respect, the definiendum and the contextual cue would form a linguistically coherent sequence, and thus it would make sense to encode the context together with the definiendum, rather than to merely rectify the definiendum embedding using a contextual cue." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-46", "text": "Therefore, definition modeling is by its nature a sequence-to-sequence task: mapping contexts of occurrence of definienda to definitions." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-47", "text": "This remark can be linked to the distributional hypothesis (Harris, 1954) ." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-48", "text": "The distributional hypothesis suggests that a word's meaning can be inferred from its context of usage; or, more succinctly, that \"you shall know a word by the company it keeps\" (Firth, 1957) ." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-49", "text": "When applied to definition modeling, the hypothesis can be rephrased as follows: the correct definition of a word can only be given when knowing in what linguistic context(s) it occurs." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-50", "text": "Though different kinds of linguistic contexts have been suggested throughout the literature, we remark here that sentential context may sometimes suffice to guess the meaning of a word that we don't know (Lazaridou et al., 2017) ." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-51", "text": "Quoting from the example above, the context \"enough around-let's get back to work!\" sufficiently characterizes the meaning of the omitted verb to allow for an approximate definition for it even if the blank is not filled (Taylor, 1953; Devlin et al., 2018) ." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-52", "text": "This reformulation can appear contrary to the original proposal by Noraset et al. 
(2017) , which conceived definition modeling as a \"word-tosequence task\"." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-53", "text": "They argued for an approach related to, though distinct from sequence-to-sequence architectures." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-54", "text": "Concretely, a specific encoding procedure was applied to the definiendum, so that it could be used as a feature vector during generation." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-55", "text": "In the simplest case, vector encoding of the definiendum consists in looking up its vector in a vocabulary embedding matrix." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-56", "text": "We argue that the whole context of a word's usage should be accessible to the generation algorithm rather than a single vector." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-57", "text": "To take a more specific case of verb definitions, we observe that context explicitly represents argument structure, which is obviously useful when defining the verb." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-58", "text": "There is no guarantee that a single embedding, even if it be contextualized, would preserve this wealth of information-that is to say, that you can cram all the information pertaining to the syntactic context into a single vector." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-59", "text": "Despite some key differences, all of the previously proposed architectures we are aware of (Noraset et al., 2017; Gadetsky et al., 2018; followed a pattern similar to sequence-to-sequence models." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-60", "text": "They all implicitly or explicitly used distinct submodules to encode the definiendum and to generate the definientia." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-61", "text": "In the case of Noraset et al. 
(2017) , the encoding was the concatenation of the embedding of the definiendum, a vector representation of its sequence of characters derived from a characterlevel CNN, and its \"hypernym embedding\"." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-62", "text": "Gadetsky et al. (2018) used a sigmoid-based gating module to tweak the definiendum embedding." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-63", "text": "The architecture proposed by is comprised of four modules, only one of which is used as a decoder: the remaining three are meant to convert the definiendum as a sparse embedding, select some of the sparse components of its meaning based on a provided context, and encode it into a representation adequate for the decoder." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-64", "text": "Aside from theoretical implications, there is another clear gain in considering definition modeling as a sequence-to-sequence task." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-65", "text": "Recent advances in embedding designs have introduced contextual embeddings (McCann et al., 2017; Peters et al., 2018; Devlin et al., 2018) ; and these share the particularity that they are a \"function of the entire sentence\" (Peters et al., 2018) : in other words, vector representations are assigned to tokens rather than to word types, and moreover semantic information about a token can be distributed over other token representations." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-66", "text": "To extend definition modeling to contextual embeddings therefore requires that we devise architectures able to encode a word in its context; in that respect sequence-to-sequence architectures are a natural choice." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-67", "text": "A related point is that not all definienda are comprised of a single word: multi-word expressions include multiple tokens, yet receive a single definition." 
}, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-68", "text": "Word embedding architectures generally require a pre-processing step to detect these expressions and merge them into a single token." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-69", "text": "However, as they come with varying degrees of semantic opacity (Cordeiro et al., 2016) , a definition modeling system would benefit from directly accessing the tokens they are made up from." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-70", "text": "Therefore, if we are to address the entirety of the language and the entirety of existing embedding architectures in future studies, reformulating definition modeling as a sequence-to-sequence task becomes a necessity." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-71", "text": "----------------------------------" }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-72", "text": "**FORMALIZATION**" }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-73", "text": "A sequence-to-sequence formulation of definition modeling can formally be seen as a mapping between contexts of occurrence of definienda and their corresponding definitions." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-74", "text": "It moreover requires that the definiendum be formally distinguished from the remaining context: otherwise the definition could not be linked to any particular word of the contextual sequence, and thus would need to be equally valid for any word of the contextual sequence." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-75", "text": "We formalize definition modeling as mapping to sequences of definientia from sequences of pairs w 1 , i 1 , . . . , w n , i n , where w k is the k th word in the input and i k \u2208 {0, 1} indicates whether the k th token is to be defined." 
}, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-76", "text": "As only one element of the sequence should be highlighted, we expect the set of all indicators to contain only two elements: the one, i d = 1, to mark the definiendum, the other, i c = 0, to mark the context; this entails that we encode this marking using one bit only." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-77", "text": "2 To treat definition modeling as a sequence-tosequence task, the information from each pair w k , i k has to be integrated into a single repre-sentation marked k :" }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-78", "text": "This marking function can theoretically take any form." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-79", "text": "Considering that definition modeling uses the embedding of the definiendum w d = e(w d ), in this work we study a multiplicative and an additive mechanism, as they are conceptually the simplest form this marking can take in a vector space." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-80", "text": "They are given schematically in Figure 1 , and formally defined as:" }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-81", "text": "The last point to take into account is where to set the marking." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-82", "text": "Two natural choices are to set it either before or after encoded representations were obtained." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-83", "text": "We can formalize this using either of the following equation, with E the model's encoder:" }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-84", "text": "----------------------------------" }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-85", "text": "**MULTIPLICATIVE MARKING: SELECT**" }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-86", "text": "The first option we consider is to use scalar multiplication to distinguish the word to define." 
}, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-87", "text": "In such a scenario, the marked token encoding is" }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-88", "text": "As we use bit information as indicators, this form of marking entails that only the representation of the definiendum be preserved and that all other contextual representations are set to 0 = (0, \u00b7 \u00b7 \u00b7 , 0): thus multiplicative marking amounts to selecting just the definiendum embedding and discarding other token embeddings." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-89", "text": "The contextualized definiendum encoding bears the trace of its context, but detailed information is irreparably lost." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-90", "text": "Hence, we refer to such an integration mechanism as a SELECT marking of the definiendum." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-91", "text": "When to apply marking, as introduced by eq. 4, is crucial when using the multiplicative marking scheme SELECT." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-92", "text": "Should we mark the definiendum before encoding, then only the definiendum embedding is passed into the encoder: the resulting system provides out-of-context definitions, like in Noraset et al. (2017) where the definition is not linked to the context of a word but to its definiendum only." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-93", "text": "For context to be taken into account under the multiplicative strategy, tokens w k must be encoded and contextualized before integration with the indicator i k ." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-94", "text": "Figure 1a presents the contextual SELECT mechanism visually." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-95", "text": "It consists in coercing the decoder to attend only to the contextualized representation for the definiendum." 
}, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-96", "text": "To do so, we encode the full context and then select only the encoded representation of the definiendum, dropping the rest of the context, before running the decoder." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-97", "text": "In the case of the Transformer architecture, this is equivalent to using a multiplicative marking on the encoded representations: vectors that have been zeroed out are ignored during attention and thus cannot influence the behavior of the decoder." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-98", "text": "This SELECT approach may seem intuitive and naturally interpretable, as it directly controls what information is passed to the decoder-we carefully select only the contextualized definiendum, thus the only remaining zone of uncertainty would be how exactly contextualization is performed." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-99", "text": "It also seems to provide a strong and reasonable bias for training the definition generation system." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-100", "text": "Such an approach, however, is not guaranteed to excel: forcibly omitted context could contain important information that might not be easily incorporated in the definiendum embedding." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-168", "text": "We use perplexity, a standard metric in definition modeling, to evaluate and compare our models." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-167", "text": "**RESULTS**" }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-101", "text": "Being simple and natural, the SELECT approach resembles architectures like that of Gadetsky et al. (2018) and : the full encoder is dedicated to altering the embedding of the definiendum on the basis of its context; in that, the encoder may be seen as a dedicated contextualization sub-module." 
}, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-102", "text": "----------------------------------" }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-103", "text": "**ADDITIVE MARKING: ADD**" }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-104", "text": "We also study an additive mechanism shown in Figure 1b (henceforth ADD)." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-105", "text": "It concretely consists in embedding the word w k and its indicator bit i k in the same vector space and adding the corresponding vectors:" }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-106", "text": "In other words, under ADD we distinguish the definiendum by adding a vector D to the definiendum embedding, and another vector C to the remaining context token embeddings; both markers D and C are learned during training." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-107", "text": "In our implementation, markers are added to the input of the encoder, so that the encoder has access to this information; we leave the question of whether to integrate indicators and words at other points of the encoding process, as suggested in eq. 4, to future work." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-108", "text": "Additive marking of substantive features has its precedents." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-109", "text": "For example, BERT embeddings (Devlin et al., 2018) are trained using two sentences at once as input; sentences are distinguished with added markers called \"segment encodings\"." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-110", "text": "Tokens from the first sentence are all marked with an added vector seg A , whereas tokens from second sentences are all marked with an added vector seg B ." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-111", "text": "The main difference here is that we only mark one item with the marker D, while all others are marked with C." 
}, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-112", "text": "This ADD marking is more expressive than the SELECT architecture." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-113", "text": "Sequence-to-sequence decoders typically employ an attention to the input source (Bahdanau et al., 2014) , which corresponds to a re-weighting of the encoded input sequence based on a similarity between the current state of the decoder (the 'query') and each member of the input sequence (the 'keys')." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-114", "text": "This re-weighting is normalized with a softmax function, producing a probability distribution over keys." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-115", "text": "However, both non-contextual definition modeling and the SELECT approach produce singleton encoded sequences: in such scenarios the attention mechanism assigns a single weight of 1 and thus devolves into a simple linear transformation of the value and makes the attention mechanism useless." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-116", "text": "Using an additive marker, rather than a selective mechanism, will prevent this behavior." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-117", "text": "----------------------------------" }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-118", "text": "**EVALUATION**" }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-119", "text": "We implement several sequence to sequence models with the Transformer architecture (Vaswani et al., 2017) , building on the OpenNMT library (Klein et al., 2017) with adaptations and modifications when necessary." 
}, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-120", "text": "3 Throughout this work, we use GloVe vectors (Pennington et al., 2014) and freeze weights of all embeddings for a fairer comparison with previous models; words not in GloVe but observed in train or validation data and missing definienda in our test sets were randomly initialized with components drawn from a normal distribution N (0, 1)." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-121", "text": "We train a distinct model for each dataset." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-122", "text": "We batch examples by 8,192, using gradient accumulation to circumvent GPU limitations." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-123", "text": "We optimize the network using Adam with \u03b2 1 = 0.99, \u03b2 2 = 0.998, a learning rate of 2, label smoothing of 0.1, Noam exponential decay with 2000 warmup steps, and dropout rate of 0.4." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-124", "text": "The parameters are initialized using Xavier." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-125", "text": "Models were trained for up to 120,000 steps with checkpoints at each 1000 steps; we stopped training if perplexity on the validation dataset stopped improving." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-126", "text": "We report results from checkpoints performing best on validation." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-127", "text": "----------------------------------" }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-128", "text": "**IMPLEMENTATION OF THE NON-CONTEXTUAL DEFINITION MODELING SYSTEM**" }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-129", "text": "In non-contextual definition modeling, definienda are mapped directly to definitions." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-130", "text": "As the source corresponds only to the definiendum, we conjecture that few parameters are required for the encoder." 
}, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-131", "text": "We use 1 layer for the encoder, 6 for the decoder, 300 dimensions per hidden representations and 6 heads for multi-head attention." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-132", "text": "We do not share vocabularies between the encoder and the decoder: therefore output tokens can only correspond to words attested as definientia." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-133", "text": "4 The dropout rate and warmup steps number were set using a hyperparameter search on the dataset from Noraset et al. (2017) , during which encoder and decoder vocabulary were merged for computational simplicity and models stopped after 12,000 steps." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-134", "text": "We first fixed dropout to 0.1 and tested warmup step values between 1000 and 10,000 by increments of 1000, then focused on the most promising span (1000-4000 steps) and exhaustively tested dropout rates from 0.2 to 0.8 by increments of 0.1." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-135", "text": "----------------------------------" }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-136", "text": "**IMPLEMENTATION OF CONTEXTUALIZED DEFINITION MODELING SYSTEMS**" }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-137", "text": "To compare the effects of the two integration strategies that we discussed in section 4, we implement both the additive marking approach (ADD, cf. section 4.2) and the alternative 'encode and select' approach (SELECT, cf." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-138", "text": "section 4.1)." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-139", "text": "To match with the complex input source, we define encoders with 6 layers; we reemploy the set of hyperparameters previously found for the non-contextual system." 
}, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-140", "text": "Other implementation details, initialization strategies and optimization algorithms are kept the same as described above for the non-contextual version of the model." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-141", "text": "We stress that the two approaches we compare for contextualizing the definiendum are applicable to almost any sequence-to-sequence neural architecture with an attention mechanism to the input source." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-142", "text": "5 Here we chose to rely on a Transformerbased architecture (Vaswani et al., 2017) , which has set the state of the art in a wide range of tasks, from language modeling (Dai et al., 2019) to machine translation (Ott et al., 2018) ." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-143", "text": "It is therefore expected that the Transformer architecture will also improve performances for definition modeling, if our arguments for treating it as a sequence to sequence task are on the right track." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-144", "text": "----------------------------------" }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-145", "text": "**DATASETS**" }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-146", "text": "We train our models on three distinct datasets, which are all borrowed or adapted from previous works on definition modeling." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-147", "text": "As a consequence, our experiments focus on the English language." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-148", "text": "The dataset of Noraset et al. (2017) (henceforth D Nor ) maps definienda to their respective definientia, as well as additional information not used here." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-149", "text": "In the dataset of Gadetsky et al. 
(2018) (henceforth D Gad ), each example consists of a definiendum, the definientia for one of its meanings and a contextual cue sentence." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-150", "text": "D Nor contains on average shorter definitions than D Gad ." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-151", "text": "Definitions in D Nor have a mean length of 6.6 and a standard deviation of 5.78, whereas those in D Gad have a mean length of 11.01 and a standard deviation of 6.96." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-152", "text": "It bears stressing that the dataset D Gad includes many examples where the definiendum is absent from the associated cue." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-153", "text": "About half of these cues do not contain an exact match for the corresponding definiendum, but up to 80% contain either an exact match or an inflected form of the definiendum according to lemmatization by the NLTK toolkit (Loper and Bird, 2002)." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-154", "text": "To cope with this problematic characteristic, we converted the dataset into the word-in-context format assumed by our model by concatenating the definiendum with the cue." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-155", "text": "To illustrate this, consider the actual input from D Gad comprised of the definiendum \"fool\" and its associated cue \"enough horsing around-let's get back to work!\": to convert this into a single sequence, we simply prepend the definiendum to the cue, which results in the sequence \"fool enough horsing around-let's get back to work!\"." 
}, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-156", "text": "Hence the input sequences of D Gad do not constitute linguistically coherent sequences, but it does guarantee that our sequenceto-sequence variants have access to the same input as previous models; therefore the inclusion of this dataset in our experiments is intended mainly for comparison with previous architectures." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-157", "text": "We also note that this conversion procedure entails that our examples have a very regular structure: the word marked as a definiendum is always the first word in the input sequence." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-158", "text": "Our second strategy was to restrict the dataset by selecting only cues where the definiendum (or its inflected form) is present." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-159", "text": "The curated dataset (henceforth D Ctx ) contains 78,717 training examples, 9,413 for validation and 9,812 for testing." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-160", "text": "In each example, the first occurrence of the definiendum is annotated as such." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-161", "text": "D Ctx thus differs from D Gad in two ways: some definitions have been removed, and the exact citation forms of the definienda are not given." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-162", "text": "Models trained on D Ctx implicitly need to lemmatize the definiendum, since inflected variants of a given word are to be aligned to a common representation; thus they are not directly comparable with models trained with the citation form of the definiendum that solely use context as a cue-viz." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-163", "text": "Gadetsky et al. (2018 ." 
}, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-164", "text": "All this makes D Ctx harder, but at the same time closer to a realistic application than the other two datasets, since each word appears inflected and in a specific sentential context." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-165", "text": "For applications of definition modeling, it would only be beneficial to take up these challenges; for example, the output \"monotremes: plural of monotreme\" 6 would not have been self-contained, necessitating a second query for \"monotreme\"." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-166", "text": "----------------------------------" }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-169", "text": "Informally, perplexity assesses the model's confidence in producing the ground-truth output when presented the source input." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-170", "text": "It is formally defined as the exponentiation of cross-entropy." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-171", "text": "We do not report BLEU or ROUGE scores due to the fact that an important number of ground-truth definitions are comprised of a single word, in particular in D Nor (\u2248 25%)." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-172", "text": "Single word outputs can either be assessed as entirely correct or entirely wrong using BLEU or ROUGE." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-173", "text": "However consider for instance the word \"elation\": that it be defined either as \"mirth\" or \"joy\" should only influence our metric slightly, and not be discounted as a completely wrong prediction. , as they did not report the perplexity of their system and focused on a different dataset; likewise, consider only the Chinese variant of the task." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-174", "text": "Perplexity measures for Noraset et al. (2017) and Gadetsky et al. 
(2018) are taken from the authors' respective publications." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-175", "text": "All our models perform better than previous proposals, by a margin of 4 to 10 points, for a relative improvement of 11-23%." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-176", "text": "Part of this improvement may be due to our use of Transformer-based architectures (Vaswani et al., 2017), which are known to perform well on semantic tasks (Radford, 2018; Cer et al., 2018; Devlin et al., 2018; Radford et al., 2019, eg.)." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-177", "text": "Like Gadetsky et al. (2018), we conclude that disambiguating the definiendum, when done correctly, improves performance: our best-performing contextual model outranks the non-contextual variant by 5 to 6 points." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-178", "text": "The marking of the definiendum out of its context (ADD vs. SELECT) also impacts results." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-179", "text": "Note also that we do not rely on task-specific external resources (unlike Noraset et al., 2017) or on pre-training (unlike Gadetsky et al., 2018)." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-180", "text": "Our contextual systems trained on the D Gad dataset used the concatenation of the definiendum and the contextual cue as inputs." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-181", "text": "The definiendum was always at the start of the training example." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-182", "text": "This regular structure proved useful for the models' performance: all models perform significantly worse on the more realistic data of D Ctx than on D Gad ." 
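The perplexity metric reported in these comparisons is the exponentiation of cross-entropy; as an illustrative sketch (the function name is ours):

```python
import math

def perplexity(token_log_probs):
    """Exponentiated cross-entropy over the natural-log probabilities
    the model assigned to each ground-truth token."""
    cross_entropy = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(cross_entropy)

# A model assigning probability 0.5 to each ground-truth token
# has perplexity exp(-log 0.5) = 2.
ppl = perplexity([math.log(0.5)] * 4)
```

Lower perplexity means the model places higher probability on the reference definitions.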
}, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-183", "text": "The D Ctx dataset is intrinsically harder for other reasons as well: it requires some form of lemmatization in every three out of eight training examples, and contains less data than other datasets, only half as many examples as D Nor , and 20% less than D Gad ." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-184", "text": "The surprisingly poor results of SELECT on the D Ctx dataset may be partially blamed on the absence of a regular structure in D Ctx ." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-185", "text": "Unlike D Gad , where the model must only learn to contextualize the first element of the sequence, in D Ctx the model has to single out the definiendum which may appear anywhere in the sentence." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-186", "text": "Any information stored only in representations of contextual tokens will be lost to the decoders." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-187", "text": "The SELECT model therefore suffers of a bottleneck, which is highly regular in D Gad and that it may therefore learn to cope with; however predicting where in the input sequence the bottleneck will appear is far from trivial in the D Ctx dataset." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-188", "text": "We also attempted to retrain this model with various settings of hyperparameters, modifying dropout rate, number of warmup steps, and number of layers in the encoder-but to no avail." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-189", "text": "An alternative explanation may be that in the case of the D Gad dataset, the regular structure of the input entails that the first positional encoding is used as an additive marking device: only definienda are marked with the positional encoding pos(1), and thus the architecture does not purely embrace a selective approach but a mixed one." 
}, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-190", "text": "In any event, even on the D Gad dataset where the margin is very small, the perplexity of the additive marking approach ADD is better than that of the SELECT model." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-191", "text": "This lends empirical support to our claim that definition modeling is a nontrivial sequence-to-sequence task, which can be better treated with sequence methods." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-192", "text": "The stability of the performance improvement over the noncontextual variant in both contextual datasets also highlights that our proposed additive marking is fairly robust, and functions equally well when confronted to somewhat artificial inputs, as in D Gad , or to linguistically coherent sequences, as in D Ctx ." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-193", "text": "A manual analysis of definitions produced by our system reveals issues similar to those discussed by Noraset et al. (2017) , namely selfreference, 7 POS-mismatches, over-and underspecificity, antonymy, and incoherence." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-194", "text": "Annotating distinct productions from the validation set, for the non-contextual model trained on D Nor , we counted 9.9% of self-references, 11.6% POSmismatches, and 1.3% of words defined as their antonyms." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-195", "text": "We counted POS-mismatches whenever the definition seemed to fit another part-of-speech than that of the definiendum, regardless of both of their meanings; cf." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-196", "text": "Table 2 for examples." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-197", "text": "For comparison, we annotated the first 1000 productions of the validation set from our ADD model trained on D Ctx ." 
}, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-198", "text": "We counted 18.4% POS mismatches and 4.4% of self-referring definitions; examples are shown in Table 3 ." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-199", "text": "The higher rate of POS-mismatch may be due to the model's hardship in finding which word is to be defined since the model is not presented with the definiendum alone: access to the full context may confuse it." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-200", "text": "On the other hand, the lower number of self-referring definitions may also be linked to this richer, more varied input: this would allow the model not to fall 7 Self-referring definitions are those where a definiendum is used as a definiens for itself." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-201", "text": "Dictionaries are expected to be exempt of such definitions: as readers are assumed not to know the meaning of the definiendum when looking it up. back on simply reusing the definiendum as its own definiens." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-202", "text": "Self-referring definitions highlight that our models equate the meaning of the definiendum to the composed meaning of its definientia." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-203", "text": "Simply masking the corresponding output embedding might suffice to prevent this specific problem; preliminary experiments in that direction suggest that this may also help decrease perplexity further." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-204", "text": "As for POS-mismatches, we do note that the work of Noraset et al. (2017) had a much lower rate of 4.29%: we suggest that this may be due to the fact that they employ a learned character-level convolutional network, which arguably would be able to capture orthography and rudiments of morphology." 
}, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-205", "text": "Adding such a sub-module to our proposed architecture might diminish the number of mistagged definienda." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-206", "text": "Another possibility would be to pre-train the model, as was done by Gadetsky et al. (2018) : in our case in particular, the encoder could be trained for POS-tagging or lemmatization." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-207", "text": "Lastly, one important kind of mistakes we observed is hallucinations." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-208", "text": "Consider for instance this production by the ADD model trained on D Ctx , for the word \"beta\": \"the twentieth letter of the Greek alphabet (\u03ba), transliterated as 'o'.\"." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-209", "text": "Nearly everything it contains is factually wrong, though the general semantics are close enough to deceive an unaware reader." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-210", "text": "8 We conjecture that filtering out hallucinatory productions will be a main challenge for future definition modeling architectures, for two main reasons: firstly, the tools and metrics necessary to assess and handle such hallucinations have yet to be developed; secondly, the input given to the system being word embeddings, research will be faced with the problem of grounding these distributional representations-how can we ensure that \"beta\" is correctly defined as \"the second letter of the Greek alphabet, transliterated as 'b'\", if we only have access to a representation derived from its contexts of usage?" }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-211", "text": "Integration of word embeddings with structured knowledge bases might be needed for accurate treatment of such cases." 
}, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-212", "text": "----------------------------------" }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-213", "text": "**ERROR TYPE**" }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-214", "text": "Context (definiendum in bold)" }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-215", "text": "Production POS-mismatch her major is linguistics most important or important Self-reference he wrote a letter of apology to the hostess a formal expression of apology" }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-216", "text": "----------------------------------" }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-217", "text": "**CONCLUSION**" }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-218", "text": "We introduced an approach to generating word definitions that allows the model to access rich contextual information about the word token to be defined." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-219", "text": "Building on the distributional hypothesis, we naturally treat definition generation as a sequence-to-sequence task of mapping the word's context of usage (input sequence) into the contextappropriate definition (output sequence)." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-220", "text": "We showed that our approach is competitive against a more naive 'contextualize and select' pipeline." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-221", "text": "This was demonstrated by comparison both to the previous contextualized model by Gadetsky et al. (2018) and to the Transformerbased SELECT variation of our model, which differs from the proposed architecture only in the context encoding pipeline." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-222", "text": "While our results are encouraging, given the existing benchmarks we were limited to perplexity measurements in our quantitative evaluation." 
}, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-223", "text": "A more nuanced semantically driven methodology might be useful in the future to better assess the merits of our system in comparison to alternatives." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-224", "text": "Our model opens several avenues of future explorations." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-225", "text": "One could straightforwardly extend it to generate definitions of multiword expressions or phrases, or to analyze vector compositionality models by generating paraphrases for vector representations produced by these algorithms." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-226", "text": "Another strength of our approach is that it can provide the basis for a standardized benchmark for contextualized and non-contextual embeddings alike: downstream evaluation tasks for embeddings systems in general either apply to non-contextual embeddings (Gladkova et al., 2016, eg.) or to contextual embeddings (Wang et al., 2019, eg.) exclusively, redefining definition modeling as a sequence-tosequence task will allow in future works to compare models using contextual and non-contextual embeddings in a unified fashion." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-227", "text": "Lastly, we also intend to experiment on languages other than English, especially considering that the required resources for our model only amount to a set of pretrained embeddings and a dataset of definitions, either of which are generally simple to obtain." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-228", "text": "While there is a potential for local improvements, our approach has demonstrated its ability to account for contextualized word meaning in a principled way, while training contextualized token encoding and definition generation end-toend." 
}, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-229", "text": "Our implementation is efficient and fast, building on free open source libraries for deep learning, and shows good empirical results." }, { "sent_id": "28038a4fa4182ccdc6134f2138c0da-C001-230", "text": "Our code, trained models, and data will be made available to the community." } ], "y": { "@BACK@": { "gold_contexts": [ [ "28038a4fa4182ccdc6134f2138c0da-C001-9" ], [ "28038a4fa4182ccdc6134f2138c0da-C001-15" ], [ "28038a4fa4182ccdc6134f2138c0da-C001-28" ], [ "28038a4fa4182ccdc6134f2138c0da-C001-52" ], [ "28038a4fa4182ccdc6134f2138c0da-C001-61" ], [ "28038a4fa4182ccdc6134f2138c0da-C001-174" ] ], "cite_sentences": [ "28038a4fa4182ccdc6134f2138c0da-C001-9", "28038a4fa4182ccdc6134f2138c0da-C001-15", "28038a4fa4182ccdc6134f2138c0da-C001-28", "28038a4fa4182ccdc6134f2138c0da-C001-52", "28038a4fa4182ccdc6134f2138c0da-C001-61", "28038a4fa4182ccdc6134f2138c0da-C001-174" ] }, "@DIF@": { "gold_contexts": [ [ "28038a4fa4182ccdc6134f2138c0da-C001-50", "28038a4fa4182ccdc6134f2138c0da-C001-51", "28038a4fa4182ccdc6134f2138c0da-C001-52" ], [ "28038a4fa4182ccdc6134f2138c0da-C001-148" ], [ "28038a4fa4182ccdc6134f2138c0da-C001-174", "28038a4fa4182ccdc6134f2138c0da-C001-175" ], [ "28038a4fa4182ccdc6134f2138c0da-C001-204" ] ], "cite_sentences": [ "28038a4fa4182ccdc6134f2138c0da-C001-52", "28038a4fa4182ccdc6134f2138c0da-C001-148", "28038a4fa4182ccdc6134f2138c0da-C001-174", "28038a4fa4182ccdc6134f2138c0da-C001-204" ] }, "@SIM@": { "gold_contexts": [ [ "28038a4fa4182ccdc6134f2138c0da-C001-59" ], [ "28038a4fa4182ccdc6134f2138c0da-C001-92" ], [ "28038a4fa4182ccdc6134f2138c0da-C001-193" ] ], "cite_sentences": [ "28038a4fa4182ccdc6134f2138c0da-C001-59", "28038a4fa4182ccdc6134f2138c0da-C001-92", "28038a4fa4182ccdc6134f2138c0da-C001-193" ] }, "@USE@": { "gold_contexts": [ [ "28038a4fa4182ccdc6134f2138c0da-C001-133" ], [ "28038a4fa4182ccdc6134f2138c0da-C001-146", "28038a4fa4182ccdc6134f2138c0da-C001-148" ] ], "cite_sentences": [ 
"28038a4fa4182ccdc6134f2138c0da-C001-133", "28038a4fa4182ccdc6134f2138c0da-C001-148" ] } } }, "ABC_fa7475b6025d010dd6814dfb3905ef_11": { "x": [ { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-77", "text": "**BERT-OF-THESEUS**" }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-2", "text": "In this paper, we propose a novel model compression approach to effectively compress BERT by progressive module replacing." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-3", "text": "Our approach first divides the original BERT into several modules and builds their compact substitutes." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-4", "text": "Then, we randomly replace the original modules with their substitutes to train the compact modules to mimic the behavior of the original modules." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-5", "text": "We progressively increase the probability of replacement through the training." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-6", "text": "In this way, our approach brings a deeper level of interaction between the original and compact models, and smooths the training process." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-7", "text": "Compared to the previous knowledge distillation approaches for BERT compression, our approach leverages only one loss function and one hyper-parameter, liberating human effort from hyper-parameter tuning." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-8", "text": "Our approach outperforms existing knowledge distillation approaches on GLUE benchmark, showing a new perspective of model compression." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-9", "text": "2 * Equal contribution." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-10", "text": "Work done during these two authors' internship at Microsoft Research Asia." 
}, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-11", "text": "2 Code and pretrained model are available at https://github.com/JetRunner/BERT-of-Theseus 3 https://en.wikipedia.org/wiki/Ship_of_Theseus" }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-12", "text": "----------------------------------" }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-13", "text": "**INTRODUCTION**" }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-14", "text": "With the prevalence of deep learning, many huge neural models have been proposed and achieve state-of-the-art performance in various fields [12, 38] ." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-15", "text": "Specifically, in Natural Language Processing (NLP), pretraining and fine-tuning have become the new norm of most tasks." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-16", "text": "Transformer-based pretrained models [4, 21, 42, 31, 6] have dominated the field of both Natural Language Understanding (NLU) and Natural Language Generation (NLG)." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-17", "text": "These models benefit from their \"overparameterized\" nature [24] and contain millions or even billions of parameters, making it computationally expensive and inefficient considering both memory consumption and high latency." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-18", "text": "This drawback enormously hinders the applications of these models in production." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-19", "text": "To resolve this problem, many techniques have been proposed to compress a neural network." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-20", "text": "Generally, these techniques can be categorized into Quantization [10] , Weights Pruning [11] and Knowledge Distillation (KD) [14] ." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-21", "text": "Among them, KD has received much attention for compressing pretrained language models." 
}, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-22", "text": "KD exploits a large teacher model to \"teach\" a compact student model to mimic the teacher's behavior." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-23", "text": "In this way, the knowledge embedded in the teacher model can be transferred into the smaller model." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-24", "text": "However, the retained performance of the student model relies on a well-designed distillation loss function which forces the student model to behave as the teacher." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-25", "text": "Recent studies on KD [33, 15] even leverage more sophisticated model-specific distillation loss functions for better performance." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-26", "text": "Different from previous KD studies which explicitly exploit a distillation loss to minimize the distance between the teacher model and the student model, we propose a new genre of model compression." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-27", "text": "Inspired by the famous thought experiment \"Ship of Theseus 3 \" in Philosophy, where all components of a ship are gradually replaced by new ones until no original component exists, we propose Theseus Compression for BERT (BERT-of-Theseus), which progressively substitutes modules of BERT with modules of fewer parameters." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-28", "text": "We call the original model and compressed model predecessor and successor, in correspondence to the concepts of teacher and student in KD, respectively." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-29", "text": "As shown in Figure 1 , we first specify a substitute (successor module) for each predecessor module (i.e., modules in the predecessor model)." 
}, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-30", "text": "Then, we randomly replace each predecessor module with its corresponding successor module by a probability and make them work together in the training phase." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-31", "text": "After convergence, we combine all successor modules to be the successor model for inference." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-32", "text": "In this way, the large predecessor model can be compressed into the compact successor model." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-33", "text": "----------------------------------" }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-34", "text": "**< L A T E X I T S H A 1 _ B A S E 6 4 = \" O S M A F M J G T X Y 5 S N J U W A 8 V C Z E C 8 O U = \" > A A A B 7 H I C B Z D N S G M X F I V V / K M 1 W Q 1 2 6 S Z Y B B D S Z T Q F L G T U X F Z W 2 K I 7 L E Y A A U M Z M S H J C M P Q Z 3 D J Q H G 3 P P A 7 H 0 Y W / V L O 6 4 H A X Z N 3 K N T V K A I U J E N 8 O A 3 T N D 3 C X N G / D H B Y P J Q U N J X 2 D J W Q Y J W A I 1 J 1 A Q K Z 4 J J 5 H H V B E O L I J A O E 6 W B T 2 3 N E F W R K 8 1 G + M C X H F K T G K O E C E M M T T 1 M 6 B A 4 R N A F U L I Q 3 W V 1 B R Y U G 1 E 9 Y I W S P K 5 + D U U Z T I E L D**" }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-35", "text": "Theseus Compression shares a similar idea with KD, which encourages the compressed model to behave like the original, but holds many merits." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-36", "text": "First, we only use the task-specific loss function in the compression process." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-37", "text": "However, KD-based methods use task-specific loss, together with one or multiple distillation losses as its optimization objective." 
}, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-38", "text": "The use of only one loss function throughout the whole compression process allows us to unify the different phases and keep the compression in a total end-to-end fashion." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-39", "text": "Also, selecting various loss functions and balancing the weights of each loss for different tasks and datasets are always laborious [33, 28] ." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-40", "text": "Second, different from recent work [15] , Theseus Compression does not use Transformer-specific features for compression thus is potential to compress a wide spectrum of models." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-41", "text": "Third, instead of using the original model only for inference in KD, our approach allows the predecessor model to work in association with the compressed successor model, enabling a deeper gradient-level interaction and a smoother training process." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-42", "text": "Moreover, the different module permutations mixing both predecessor and successor modules adds extra regularization, similar to Dropout [32] ." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-43", "text": "With a Curriculum Learning [1] driven replacement scheduler, our approach achieves great performance compressing BERT [4] , a large pretrained Transformer model." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-44", "text": "To summarize, our contribution is two-fold: (1) We propose a novel approach, Theseus Compression, revealing a new pathway to model compression, with only one loss function and one hyper-parameter." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-45", "text": "(2) Our compressed BERT model is 1.94\u00d7 faster while retaining more than 98% performance of the original model, outperforming other KD-based compression baselines." 
}, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-46", "text": "----------------------------------" }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-47", "text": "**RELATED WORK**" }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-48", "text": "Model Compression." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-49", "text": "Model compression aims to reduce the size and computational cost of a large model while retaining as much performance as possible." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-50", "text": "Conventional explanations [3, 43] claim that the large number of weights is necessary for the training of deep neural network but a high degree of redundancy exists after training." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-51", "text": "Recent work [8] proposes The Lottery Ticket Hypothesis claiming that dense, randomly-initialized and feed-forward networks contain subnetworks that can be recognized and trained to get a comparable test accuracy to the original network." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-52", "text": "Quantization [10] reduces the number of bits used to represent a number in a model." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-53", "text": "Weights Pruning [11, 13] conducts a binary classification to decide which weights to be trimmed from the model." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-54", "text": "Knowledge Distillation (KD) [14] aims to train a compact model which behaves like the original one." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-55", "text": "FitNets [27] demonstrates that \"hints\" learned by the large model can benefit the distillation process." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-56", "text": "Born-Again Neural Network [9] reveals that ensembling multiple identical-parameterized students can outperform a teacher model." 
}, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-57", "text": "LIT [17] introduces block-wise intermediate representation training." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-58", "text": "Liu et al. [20] distilled knowledge from ensemble models to improve the performance of a single model on NLU tasks." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-59", "text": "Tan et al. [35] exploited KD for multi-lingual machine translation." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-60", "text": "Different from KD-based methods, our proposed Theseus Compression is the first approach to mix the original model and compact model for training." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-61", "text": "Also, no additional loss is used throughout the whole compression procedure which eliminates the tricky hyper-parameter tuning for various losses." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-62", "text": "Faster BERT." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-63", "text": "Very recently, many attempts have been made to speed up a large pretrained language model (e.g., BERT [4] )." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-64", "text": "Michel et al. [22] reduced the parameters of a BERT model by pruning unnecessary heads in the Transformer." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-65", "text": "Shen et al. [29] quantized BERT to 2-bit using Hessian information." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-66", "text": "Also, substantial modification has been made to Transformer architecture." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-67", "text": "Fan et al. [7] exploited a structure dropping mechanism to train a BERT-like model which is resilient to pruning." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-68", "text": "ALBERT [18] leverages matrix decomposition and parameter sharing." 
}, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-69", "text": "However, these models cannot exploit ready-made model weights and require a full retraining." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-70", "text": "Tang et al. [36] used a BiLSTM architecture to extract task-specific knowledge from BERT." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-71", "text": "DistilBERT [28] applies a naive Knowledge Distillation on the same corpus used to pretrain BERT." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-72", "text": "Patient Knowledge Distillation (PKD) [33] designs multiple distillation losses between the module hidden states of the teacher and student models." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-73", "text": "Pretrained Distillation [37] pretrains the student model with a self-supervised masked LM objective on a large corpus first, then performs a standard KD on supervised tasks." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-74", "text": "TinyBERT [15] conducts the Knowledge Distillation twice with data augmentation." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-75", "text": "MobileBERT [34] devises a more computationally efficient architecture and applies knowledge distillation with a bottom-to-top layer training procedure." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-76", "text": "----------------------------------" }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-78", "text": "In this section, we introduce module replacing, the technique proposed for BERT-of-Theseus." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-79", "text": "Further, we introduce a Curriculum Learning driven scheduler to obtain better performance." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-80", "text": "The workflow is shown in Figure 1 ." 
}, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-81", "text": "----------------------------------" }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-82", "text": "**MODULE REPLACING**" }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-83", "text": "The basic idea of Theseus Compression is very similar to KD." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-84", "text": "We want the successor model to act like a predecessor model." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-85", "text": "KD explicitly defines a loss to measure the similarity of the teacher and student." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-86", "text": "However, the performance greatly relies on the design of the loss function [14, 33, 15] ." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-87", "text": "This loss function needs to be combined with taskspecific loss [33, 17] ." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-88", "text": "Different from KD, Theseus Compression only requires one task-specific loss function (e.g., Cross Entropy), which closely resembles a fine-tuning procedure." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-89", "text": "Inspired by Dropout [32] , we propose module replacing, a novel technique for model compression." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-90", "text": "We call the original model and the target model predecessor and successor, respectively." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-91", "text": "First, we specify a successor module for each module in the predecessor." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-92", "text": "For example, in the context of BERT compression, we let one Transformer layer to be the successor module for two Transformer layers." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-93", "text": "Consider a predecessor model P which has n modules and a successor model S which has n predefined modules." 
}, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-94", "text": "Let P = {prd 1 , .., prd n } denote the predecessor model, prd i and scc i denote the the predecessor modules and their corresponding substitutes, respectively." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-95", "text": "The output vectors of the i-th module is denoted as y i ." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-96", "text": "Thus, the forward operation can be described in the form of:" }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-97", "text": "During compression, we apply module replacing." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-98", "text": "First, for (i + 1)-th module, r i+1 is an independent Bernoulli random variable which has probability p to be 1 and 1 \u2212 p to be 0." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-99", "text": "Then, the output of the (i + 1)-th model is calculated as:" }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-100", "text": "where * denotes the element-wise multiplication, r i+1 \u2208 {0, 1}. In this way, the predecessor modules and successor modules work together in the training." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-101", "text": "Since the permutation of the hybrid model is random, it adds extra noises as a regularization for the training of the successor, similar to Dropout [32] ." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-102", "text": "During training, similar to a fine-tuning process, we optimize a regular task-specific loss, e.g., Cross Entropy: where x j \u2208 X is the i-th training sample; z j is its corresponding ground-truth label; c and C denote a class label and the set of class labels, respectively." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-103", "text": "For back-propagation, the weights of all predecessor modules are frozen and only weights of successor will be updated." 
}, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-104", "text": "For both the embedding layer and output layer of the predecessor model are weight-frozen and directly adopted for the successor model in this training phase." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-105", "text": "In this way, the gradient can be calculated across both the predecessor and successor modules, allowing the interaction on a deeper level." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-106", "text": "----------------------------------" }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-107", "text": "**SUCCESSOR FINE-TUNING AND INFERENCE**" }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-108", "text": "Since all successor modules have not been combined for training yet, we further carry out a post-replacement fine-tuning phase." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-109", "text": "After the replacing compression converges, we collect all successor modules and combine them to be the successor model S:" }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-110", "text": "Since each scc i is smaller than prd i in size, the predecessor model P is in essence compressed into a smaller model S." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-111", "text": "Then, we fine-tune the successor model by optimizing the same loss of Equation 4." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-112", "text": "The whole procedure including module replacing and successor fine-tuning is illustrated in Figure 2 (a)." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-113", "text": "Finally, we use the fine-tuned successor for inference as Equation 5." 
}, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-114", "text": "----------------------------------" }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-115", "text": "**CURRICULUM REPLACEMENT**" }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-116", "text": "Although setting a constant replacement rate p can meet the need for compressing a model, we further highlight a Curriculum Learning [1] driven replacement scheduler, which helps progressively substitute the modules in a model." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-117", "text": "Similar to Curriculum Dropout [23] , we devise a replacement scheduler to dynamically tune the replacement rate p." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-118", "text": "Here, we leverage a simple linear scheduler \u03b8(t) to output the dynamic replacement rate p d for step t." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-119", "text": "where k > 0 is the coefficient and b is the basic replacement rate." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-120", "text": "The replacing rate curve with a replacement scheduler is illustrated in Figure 2 (b)." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-121", "text": "In this way, we unify the two previously separated training stages and encourage an end-to-end easy-to-hard learning process." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-122", "text": "First, with more predecessor modules present, the model would more likely to correctly predict thus have a relatively small cross-entropy loss, which is helpful for smoothing the learning process." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-123", "text": "Then, at a later time of compression, more modules can be present together, encouraging the model to gradually learn to predict with less guidance from the predecessor and steadily transit to the successor fine-tuning stage." 
}, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-124", "text": "Second, at the beginning of the compression, when \u03b8(t) < 1, considering the average learning rate for all n successor modules, the expected number of replaced modules is n \u00b7 p d and the expected average learning rate is:" }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-125", "text": "where lr is the constant learning rate set for the compression and lr is the equivalent learning rate considering all successor modules." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-196", "text": "Liu et al. [21] found that a model fine-tuned on MNLI can successfully transfer to other sentence classification tasks." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-126", "text": "Thus, when applying a replacement scheduler, a warm-up mechanism [25] is essentially adopted at the same time, which helps the training of a Transformer." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-127", "text": "----------------------------------" }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-128", "text": "**EXPERIMENTS**" }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-129", "text": "In this section, we introduce the experiments of Theseus Compression for BERT [4] compression." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-130", "text": "We compare BERT-of-Theseus with other compression methods and further conduct experiments to analyze the results." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-131", "text": "----------------------------------" }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-132", "text": "**DATASETS**" }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-133", "text": "We evaluate our proposed approach on GLUE benchmark [39] ." 
}, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-134", "text": "Specifically, we test on Microsoft Research Paraphrase Matching (MRPC) [5] , Quora Question Pairs (QQP) 4 and STS-B [2] for Paraphrase Similarity Matching; Stanford Sentiment Treebank (SST-2) [30] for Sentiment Classification; Multi-Genre Natural Language Inference Matched (MNLI-m), Multi-Genre Natural Language Inference Mismatched (MNLI-mm) [41] , Question Natural Language Inference (QNLI) [26] and Recognizing Textual Entailment (RTE) [39] for the Natural Language Inference (NLI) task;" }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-135", "text": "The Corpus of Linguistic Acceptability (CoLA) [40] for Linguistic Acceptability. Note that we exclude WNLI [19] from GLUE since the original BERT [4] paper excludes this task as well for convergence problems." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-136", "text": "The accuracy is used as the metric for SST-2, MNLI-m, MNLI-mm, QNLI and RTE." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-137", "text": "The" }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-138", "text": "----------------------------------" }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-139", "text": "**EXPERIMENTAL SETTINGS**" }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-140", "text": "We test our approach under a task-specific compression setting [33, 37] instead of a pretraining compression setting [28, 34] ." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-141", "text": "That is to say, we use no external unlabeled corpus but only the training set of each task in GLEU to compress the model." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-142", "text": "The reason why we test our model under such a setting is that we intend to straightforwardly verify the effectiveness of our generic compression approach." 
}, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-143", "text": "The fast training process of task-specific compression (e.g., no longer than 20 GPU hours for any task of GLUE) computationally enables us to conduct more analytical experiments." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-144", "text": "For comparision, DistillBERT [28] takes 720 GPU hours to train." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-145", "text": "Plus, in real-world applications, this setting provides with more flexibility when selecting from different pretrained LMs (e.g., BERT, RoBERTa [21] ) for various downstream tasks and it is easy to adopt a newly released model, without a time-consuming pretraining compression." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-146", "text": "On the other hand, we acknowledge that a general-purpose compressed BERT can better facilitate the downstream applications in the community since it requires less computational resource to simply fine-tune a small model than compressing a large one." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-147", "text": "Thus, we release a general-purpose compressed BERT as well." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-148", "text": "Formally, we define the task of compression as trying to retain as much performance as possible when compressing the officially released BERT-base (uncased) 5 to a 6-layer compact model with the same hidden size, following the settings in [28, 33, 37] ." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-149", "text": "Under this setting, the compressed model has 24M parameters for the token embedding (identical to the original model) and 42M parameters for the Transformer layers and obtains a 1.94\u00d7 speed-up for inference." 
}, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-150", "text": "----------------------------------" }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-151", "text": "**TRAINING DETAILS**" }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-152", "text": "We fine-tune BERT-base as the predecessor model for each task with the batch size of 32, the learning rate of 2 \u00d7 10 \u22125 , and the number of epochs as 4." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-153", "text": "As a result, we are able to obtain a predecessor model with comparable performance with that reported in previous studies [28, 33, 15] ." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-154", "text": "Afterward, for training successor models, following [28, 33] , we use the first 6 layers of BERT-base to initialize the successor model since the over-parameterized nature of Transformer [38] could cause the model unable to converge while training on small datasets." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-155", "text": "During module replacing, We fix the batch size as 32 for all evaluated tasks to reduce the search space." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-156", "text": "All r variables only sample once for a training batch." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-157", "text": "The maximum sequence length is set to 256 on QNLI and 128 for the other tasks." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-158", "text": "We perform grid search over the sets of learning rate lr as {1e-5, 2e-5}, the basic replacing rate b as {0.1, 0.3}, the scheduler coefficient k making the dynamic replacing rate increase to 1 within the first {1000, 5000, 10000, 30000} training steps." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-159", "text": "We apply an early stopping mechanism and select the model with the best performance on the dev set." 
}, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-160", "text": "We conduct our experiments on a single Nvidia V100 16GB GPU." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-161", "text": "The peak memory usage is identical to fine-tuning a BERT-base, since there would be at most 12 layers training at the same time." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-162", "text": "The training time for each task varies depending on the different sizes of training sets." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-163", "text": "For example, it takes 20 hours to train on MNLI but less than 30 minutes on MRPC." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-164", "text": "----------------------------------" }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-165", "text": "**BASELINES**" }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-166", "text": "As shown in Table 1 , we compare the layer numbers, parameter numbers, loss function, external data usage and model agnosticism of our proposed approach to existing methods." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-167", "text": "We set up a baseline of vanilla Knowledge Distillation [14] as in [33] ." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-168", "text": "Additionally, we directly fine-tune an initialized 6-layer BERT model on GLUE tasks to obtain a natural fine-tuning baseline." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-169", "text": "Under the setting of compressing 12-layer BERT-base to a 6-layer compact model, we choose BERT-PKD [33] , PD-BERT [37] , and DistillBERT [28] as strong baselines." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-170", "text": "Note that DistillBERT [28] is not directly comparable here since it uses a pretraining compression setting." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-171", "text": "Both PD-BERT and DistillBERT use external unlabeled corpus." 
}, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-221", "text": "A replacing rate in the range between 0.5 and 0.7 can always lead to a satisfying performance on all GLUE tasks." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-172", "text": "We do not include TinyBERT [15] since it has a different size setting, conducts distillation twice, and leverages extra augmented data for GLUE tasks." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-173", "text": "We also exclude MobileBERT [34] , due to its redesigned Transformer block and different model size." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-174", "text": "Besides, in these two studies, the loss functions are not architecture-agnostic thus limit their applications on other models." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-175", "text": "----------------------------------" }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-176", "text": "**EXPERIMENTAL RESULTS**" }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-177", "text": "We report the experimental results on the dev set of GLUE in Table 2 and submit our predictions to the GLUE test server and obtain the results from the official leaderboard as shown in Table 3 ." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-178", "text": "Note that DistillBERT does not report on test set." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-179", "text": "The BERT-base performance reported on GLUE dev set is the predecessor fine-tuned by us." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-180", "text": "The results of BERT-PKD on dev set are reproduced by us using the official implementation." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-181", "text": "In the original paper of BERT-PKD, the results of CoLA and STS-B on test set are not reported, thus we reproduce these two results." 
}, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-182", "text": "Fine-tuning and Vanilla KD baselines are both implemented by us." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-183", "text": "All other results are from the original papers." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-184", "text": "The macro scores here are calculated in the same way as the official leaderboard but are not directly comparable with GLUE leaderboard since we exclude WNLI from the calculation." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-185", "text": "Overall, our BERT-of-Theseus retains 98.4% and 98.3% of the BERT-base performance on GLUE dev set and test set, respectively." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-186", "text": "On every task of GLUE, our model dramatically outperforms the fine-tuning baseline, indicating that with the same loss function, our proposed approach can effectively transfer knowledge from the predecessor to the successor." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-187", "text": "Also, our model obviously outperforms the vanilla KD [14] and Patient Knowledge Distillation (PKD) [33] , showing its supremacy over the KD-based compression approaches." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-188", "text": "On MNLI, our model performs better than BERT-PKD but slightly lower than PD-BERT [37] ." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-189", "text": "However, PD-BERT exploits an additional corpus which provides much more samples for knowledge transferring." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-190", "text": "Also, we would like to highlight that on RTE, our model achieves nearly identical performance to BERT-base and on QQP our model even outperforms BERT-base." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-191", "text": "To analyze, a moderate model size may help generalize and prevent overfitting on downstream tasks." 
}, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-192", "text": "Notably, on both large datasets with more than 350K samples (e.g., MNLI and QQP) and small datasets with fewer than 4K samples (e.g., MRPC and RTE), our model can consistently achieve good performance, verifying the robustness of our approach." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-193", "text": "----------------------------------" }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-194", "text": "**GENERAL-PURPOSE MODEL**" }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-195", "text": "Although our approach achieves good performance under a task-specific setting, it requires more memory to fine-tune a full-size predecessor than a compact BERT (e.g., DistillBERT [28] )." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-197", "text": "Thus, we release our compressed model by conducting compression on MNLI as a general-purpose compact BERT to facilitate downstream applications." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-198", "text": "After compression, we fine-tune the successor model on other sentence classification tasks and compare the results with DistillBERT [28] in Table 4 ." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-199", "text": "Our general-purpose model achieves an identical performance on MRPC and remarkably outperforms DistillBERT on the other sentence-level tasks." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-200", "text": "----------------------------------" }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-201", "text": "**ANALYSIS**" }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-202", "text": "In this section, we conduct extensive experiments to analyze our BERT-of-Theseus." 
}, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-203", "text": "----------------------------------" }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-204", "text": "**IMPACT OF MODULE REPLACEMENT**" }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-205", "text": "As pointed out in previous work [7] , different layers of a Transformer play imbalanced roles for inference." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-206", "text": "To explore the effect of different module replacements, we iteratively use one compressed successor module (constant replacing rate, without successor fine-tuning) to replace its corresponding predecessor module on QNLI, MNLI and QQP, as shown in Table 5 ." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-207", "text": "We illustrate the average performance drop on three tasks in Figure 3 ." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-208", "text": "Surprisingly, different from a similar study for the importance of different Transformer layers in [7] , which is basically a U-curve, our results show that the replacement of the last two modules have only a trivial influence on the overall performance while the replacement of the first module significantly harms the performance." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-209", "text": "To analyze, the linguistic features are mainly extracted by the first few layers." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-210", "text": "Therefore, the reduced representation capability becomes the bottleneck for the following layers." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-211", "text": "On the contrary, high-quality low-level features can help the following layers, thus the reduced module size has only a limited influence on the final results." 
}, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-212", "text": "----------------------------------" }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-213", "text": "**IMPACT OF REPLACING RATE**" }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-214", "text": "We attempt to adopt different replacing rates on GLUE tasks." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-215", "text": "First, we fix the batch size to be 32 and learning rate lr to be 2 \u00d7 10 \u22125 and conduct compression on each task." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-216", "text": "On the other hand, as we analyzed in Section 3.3, the equivalent learning rate lr is affected by the replacing rate." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-217", "text": "To further eliminate the influence of learning rate, we fix the equivalent learning rate lr to be 2 \u00d7 10 \u22125 and adjust learning rate lr for different replacing rates by lr = lr /p." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-218", "text": "We illustrate the results with different replacing rates on two representative tasks (MRPC and RTE) in Figure 4 ." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-219", "text": "The trivial gap between two curves in both figures indicate that the effect of different replacing rates on equivalent learning rate is not the main factor for the performance differences." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-220", "text": "Generally speaking, BERT-of-Theseus is not very sensitive to different replacing rates." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-222", "text": "However, a significant performance drop can be observed on all tasks if the replacing rate is too small (e.g., p = 0.1)." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-223", "text": "On the other hand, the best replacing rate differs across tasks." 
}, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-224", "text": "Figure 4 : Performance of different replacing rate on MRPC and RTE. \"LR\" and \"ELR\" denote that the learning rate and equivalent learning rate are fixed, respectively." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-225", "text": "----------------------------------" }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-226", "text": "**IMPACT OF REPLACEMENT SCHEDULER**" }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-227", "text": "To study the impact of our curriculum replacement strategy, we compare the results of BERT-of-Theseus compressed with a constant replacing rate and with a replacement scheduler." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-228", "text": "The constant replacing rate for the baseline is searched over {0.5, 0.7, 0.9}. Additionally, we implement an \"anti-curriculum\" baseline, similar to the one in [23] ." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-229", "text": "For each task, we adopt the same coefficient k and basic replacing rate b to calculate the p d as Equation 6 for both curriculum replacement and anti-curriculum." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-230", "text": "However, we use 1 \u2212 p d as the dynamic replacing rate for anti-curriculum baseline." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-231", "text": "Thus, we can determine whether the improvement of curriculum replacement is simply due to an inconstant replacing rate or an easy-to-hard curriculum design." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-232", "text": "As shown in Table 6 , our model compressed with curriculum scheduler consistently outperforms a model compressed with a constant replacing rate." 
}, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-233", "text": "On the contrary, a substantial performance drop is observed on the model compressed with an anti-curriculum scheduler, which further verifies the effectiveness of the curriculum replacement strategy." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-234", "text": "----------------------------------" }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-235", "text": "**DISCUSSION AND CONCLUSION**" }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-236", "text": "In this paper, we propose Theseus Compression, a novel model compression approach." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-237", "text": "We use this approach to compress BERT to a compact model, which outperforms other models compressed by Knowledge Distillation, with only one hyper-parameter, one loss function and no external data." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-238", "text": "Our work highlights a new genre of model compression and reveals a new path towards model compression." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-239", "text": "A known limitation of our approach is that to allow a successor module to replace a predecessor module, they must have the same input and output sizes." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-240", "text": "First, given this restriction, we can still perform depth reduction (i.e., reducing the number of layers)." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-241", "text": "As analyzed in [28] , the hidden size dimension has a smaller impact on computational efficiency than the depth, for a fixed parameter budget." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-242", "text": "Second, there have been many developed in-place substitutes (e.g., ShuffleNet unit [44] for ResBlock [12] , Reformer Layer [16] for Transformer Layer [38] ), which can be directly adopted as the successor." 
}, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-243", "text": "Third, it is possible to use a feed-forward neural network to map features between the hidden spaces of different sizes [15] ." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-244", "text": "Although our model has achieved good performance compressing BERT, it would be interesting to explore its possible applications in other neural models." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-245", "text": "As summarized in Table 1 , our model does not rely on any model-specific features to compress BERT." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-246", "text": "Therefore, it is potential to apply Theseus Compression to other large models (e.g., ResNet [12] in Computer Vision)." }, { "sent_id": "fa7475b6025d010dd6814dfb3905ef-C001-247", "text": "For the future work, we would like to conduct Theseus Compression on Convolutional Neural Network and Graph Neural Network." } ], "y": { "@BACK@": { "gold_contexts": [ [ "fa7475b6025d010dd6814dfb3905ef-C001-25" ], [ "fa7475b6025d010dd6814dfb3905ef-C001-39" ], [ "fa7475b6025d010dd6814dfb3905ef-C001-72" ], [ "fa7475b6025d010dd6814dfb3905ef-C001-86" ], [ "fa7475b6025d010dd6814dfb3905ef-C001-87" ] ], "cite_sentences": [ "fa7475b6025d010dd6814dfb3905ef-C001-25", "fa7475b6025d010dd6814dfb3905ef-C001-39", "fa7475b6025d010dd6814dfb3905ef-C001-72", "fa7475b6025d010dd6814dfb3905ef-C001-86", "fa7475b6025d010dd6814dfb3905ef-C001-87" ] }, "@MOT@": { "gold_contexts": [ [ "fa7475b6025d010dd6814dfb3905ef-C001-25", "fa7475b6025d010dd6814dfb3905ef-C001-26" ] ], "cite_sentences": [ "fa7475b6025d010dd6814dfb3905ef-C001-25" ] }, "@USE@": { "gold_contexts": [ [ "fa7475b6025d010dd6814dfb3905ef-C001-140" ], [ "fa7475b6025d010dd6814dfb3905ef-C001-148" ], [ "fa7475b6025d010dd6814dfb3905ef-C001-154" ], [ "fa7475b6025d010dd6814dfb3905ef-C001-167" ], [ "fa7475b6025d010dd6814dfb3905ef-C001-169" ] ], "cite_sentences": [ "fa7475b6025d010dd6814dfb3905ef-C001-140", 
"fa7475b6025d010dd6814dfb3905ef-C001-148", "fa7475b6025d010dd6814dfb3905ef-C001-154", "fa7475b6025d010dd6814dfb3905ef-C001-167", "fa7475b6025d010dd6814dfb3905ef-C001-169" ] }, "@SIM@": { "gold_contexts": [ [ "fa7475b6025d010dd6814dfb3905ef-C001-153" ] ], "cite_sentences": [ "fa7475b6025d010dd6814dfb3905ef-C001-153" ] }, "@DIF@": { "gold_contexts": [ [ "fa7475b6025d010dd6814dfb3905ef-C001-187" ] ], "cite_sentences": [ "fa7475b6025d010dd6814dfb3905ef-C001-187" ] } } }, "ABC_d51bf6d22d21dcd91e080f6f0b5dcb_11": { "x": [ { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-2", "text": "The multilingual Paraphrase Database (PPDB) is a freely available automatically created resource of paraphrases in multiple languages." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-3", "text": "In statistical machine translation, paraphrases can be used to provide translation for out-of-vocabulary (OOV) phrases." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-4", "text": "In this paper, we show that a graph propagation approach that uses PPDB paraphrases can be used to improve overall translation quality." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-5", "text": "We provide an extensive comparison with previous work and show that our PPDB-based method improves the BLEU score by up to 1.79 percent points." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-6", "text": "We show that our approach improves on the state of the art in three different settings: when faced with limited amount of parallel training data; a domain shift between training and test data; and handling a morphologically complex source language." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-7", "text": "Our PPDB-based method outperforms the use of distributional profiles from monolingual source data." 
}, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-8", "text": "----------------------------------" }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-10", "text": "Translation coverage is a major concern in statistical machine translation (SMT) which relies on large amounts of parallel, sentence-aligned text." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-11", "text": "In (Callison-Burch et al., 2006) , even with a training data size of 10 million word tokens, source vocabulary coverage in unseen data does not go above 90%." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-12", "text": "The problem is worse with multi-word OOV phrases." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-13", "text": "Copying OOVs to the output is the most common solution." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-14", "text": "However, even noisy translations of OOVs can improve reordering and language model scores (Zhang et al., 2012) ." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-15", "text": "Transliteration is useful but not a panacea for the OOV problem (Irvine and Callison-Burch, 2014b) ." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-16", "text": "We find and remove the named entities, dates, etc." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-17", "text": "in the source and focus on the use of paraphrases to help translate the remaining OOVs." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-18", "text": "In Sec. 5.2 we show that handling such OOVs correctly does improve translation scores." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-19", "text": "In this paper, we build on the following research: Bilingual lexicon induction is the task of learning translations of words from monolingual data in source and target languages (Schafer and Yarowsky, 2002; Koehn and Knight, 2002; Haghighi et al., 2008) ." 
}, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-20", "text": "The distributional profile (DP) approach uses context vectors to link words as potential paraphrases to translation candidates (Rapp, 1995; Koehn and Knight, 2002; Haghighi et al., 2008; Garera et al., 2009) ." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-21", "text": "DPs have been used in SMT to assign translation candidates to OOVs (Marton et al., 2009; Daum\u00e9 and Jagarlamudi, 2011; Irvine and Callison-Burch, 2014a) ." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-22", "text": "Graph-based semisupervised methods extend this approach and propagate translation candidates across a graph with phrasal nodes connected via weighted paraphrase relationships (Razmara et al., 2013; Saluja et al., 2014; Zhao et al., 2015) ." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-23", "text": "Saluja et al. (2014) extend paraphrases for SMT from the words to phrases, which we also do in this work." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-24", "text": "Bilingual pivoting uses parallel data instead of context vectors for paraphrase extraction (Mann and Yarowsky, 2001; Schafer and Yarowsky, 2002; Bannard and Callison-Burch, 2005; CallisonBurch et al., 2006; Zhao et al., 2008; CallisonBurch, 2008 )." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-25", "text": "Ganitkevitch and Callison-Burch (2014) published a large-scale multilingual Paraphrase Database (PPDB) http://paraphrase. org which includes lexical, phrasal, and syntactic paraphrases (available for 22 languages with up to 170 million paraphrases each)." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-26", "text": "To our knowledge, this paper is the first comprehensive study of the use of PPDB for statistical machine translation model training." 
}, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-27", "text": "Our framework has three stages: 1) a novel graph construction approach for PPDB paraphrases linked with phrases from parallel training data." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-28", "text": "2) Graph propagation that uses PPDB paraphrases." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-29", "text": "3) An SMT model that incorporates new translation candidates." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-30", "text": "Sec. 3 explains these three stages in detail." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-31", "text": "Using PPDB has several advantages: 1) Resources such as PPDB can be built and used for many different tasks including but not limited to SMT." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-32", "text": "2) PPDB contains many features that are useful to rank the strength of a paraphrase connection and with more information than distributional profiles." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-33", "text": "3) Paraphrases in PPDB are often better than paraphrases extracted from monolingual or comparable corpora because a large-scale multilingual paraphrase database such as PPDB can pivot through a large amount of data in many different languages." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-34", "text": "It is not limited to using the source language data for finding paraphrases which distinguishes it from previous uses of paraphrases for SMT." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-35", "text": "PPDB is a natural resource for paraphrases." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-36", "text": "However, PPDB was not built with the specific application to SMT in mind." 
}, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-37", "text": "Other applications such as text-to-text generation have used PPDB (Ganitkevitch et al., 2011) but SMT brings along a specific set of concerns when using paraphrases: translation candidates should be transferred suitably across paraphrases." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-38", "text": "There are many cases, e.g. when faced with different word senses where transfer of a translation is not appropriate." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-39", "text": "Our proposed methods of using PPDB use graph propagation to transfer translation candidates in a way that is sensitive to SMT concerns." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-40", "text": "In our experiments (Sec. 5) we compare our approach with the state-of-the-art in three different settings in SMT: 1) when faced with limited amount of parallel training data; 2) a domain shift between training and test data; and 3) handling a morphologically complex source language." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-41", "text": "In each case, we show that our PPDB-based approach outperforms the distributional profile approach." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-42", "text": "----------------------------------" }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-43", "text": "**PARAPHRASE EXTRACTION**" }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-44", "text": "Our goal is to produce translations for OOV phrases by exploiting paraphrases from the multilingual PPDB (Ganitkevitch and Callison-Burch, 2014 ) by using graph propagation." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-45", "text": "Since our approach relies on phrase-level paraphrases we compare with the current state of the art approaches that use monolingual data and distributional profiles to construct paraphrases and use graph propagation (Razmara et al., 2013; Saluja et al., 2014) ." 
}, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-46", "text": "----------------------------------" }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-47", "text": "**PARAPHRASES FROM DISTRIBUTIONAL PROFILES**" }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-48", "text": "A distributional profile (DP) of a word or phrase was first proposed in (Rapp, 1995) for SMT." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-49", "text": "Given a word f , its distributional profile is:" }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-50", "text": "V is the vocabulary and the surrounding words w i are taken from a monolingual corpus using a fixed window size." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-51", "text": "We use a window size of 4 words based on the experiments in (Razmara et al., 2013) ." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-52", "text": "DPs need an association measure A(\u00b7, \u00b7) to compute distances between potential paraphrases." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-53", "text": "A comparison of different association measures appears in (Marton et al., 2009; Razmara et al., 2013; Saluja et al., 2014) and our preliminary experiments validated the choice of the same association measure as in these papers, namely Pointwise Mutual Information (Lin, 1998) (PMI) ." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-54", "text": "For each potential context word w i :" }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-55", "text": "To evaluate the similarity between two phrases we use cosine similarity." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-56", "text": "The cosine coefficient of two phrases f 1 and f 2 is:" }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-57", "text": "where V is the vocabulary." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-58", "text": "Note that in Eqn." 
}, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-59", "text": "(2) w i 's are the words that appear in the context of f 1 or f 2 , otherwise the PMI values would be zero." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-60", "text": "Considering all possible candidate paraphrases is very expensive." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-61", "text": "Thus, we use the heuristic applied in previous works (Marton et al., 2009; Razmara et al., 2013; Saluja et al., 2014) to reduce the search space." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-62", "text": "For each phrase we keep candidate paraphrases which appear in one of the surrounding context (e.g. Left Right) among all occurrences of the phrase." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-63", "text": "----------------------------------" }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-64", "text": "**PARAPHRASES FROM BILINGUAL PIVOTING**" }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-65", "text": "Bilingual pivoting uses parallel corpora between the source language, F , and a pivot language T ." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-66", "text": "If two phrases, f 1 and f 2 , in a same language are paraphrases, then they share a translation in other languages with p(f 1 |f 2 ) as a paraphrase score: (1 and 6) are phrases from the SMT phrase table (unfilled nodes are not)." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-67", "text": "Edge weights are set using a log-linear combination of scores from PPDB." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-135", "text": "----------------------------------" }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-68", "text": "Phrase #6 has different senses ('gold' or 'left'); and it has a paraphrase in phrase #7 for the 'gold' sense and a paraphrase in phrase #2 for the 'left' sense." 
}, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-69", "text": "After propagation, phrase #2 receives translation candidates from phrase #6 and phrase #1 reducing the probability of translation from unrelated senses (like the 'gold' sense)." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-70", "text": "Phrase #8 is a misspelling of phrase #7 and is also captured as a paraphrase." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-71", "text": "Phrase #6 propagates translation candidates to phrase #8 through phrase #7." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-72", "text": "Morphological variants of phrase #6 (shown in bold) also receive translation candidates through graph propagation giving translation candidates for morphologically rich OOVs." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-73", "text": "Figure 1: English paraphrases extracted by pivoting over German shared translation (Bannard and Callison-Burch, 2005 )." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-74", "text": "where t is a phrase in language T ." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-75", "text": "p(f 1 |t) and p(t|f 2 ) are taken from the phrase table extracted from parallel data for languages F and T ." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-76", "text": "In Fig. 1 from (Bannard and Callison-Burch, 2005) we see that paraphrase pairs like (in check, under control) can be extracted by pivoting over the German phrase unter kontrolle." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-77", "text": "The multilingual Paraphrase Database (PPDB) (Ganitkevitch and Callison-Burch, 2014 ) is a published resource for paraphrases extracted using bilingual pivoting." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-78", "text": "It leverages syntactic information and other resources to filters and scores each paraphrase pair using a large set of features." 
}, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-79", "text": "These features can be used by a log linear model to score paraphrases (Zhao et al., 2008) ." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-80", "text": "We used a linear combination of these features using the equation in Sec. 3 of (Ganitkevitch and Callison-Burch, 2014) to score paraphrase pairs." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-81", "text": "PPDB version 1 is broken into different levels of coverage." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-82", "text": "The smaller sizes contain only better-scoring, high-precision paraphrases, while larger sizes aim for high coverage." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-83", "text": "----------------------------------" }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-84", "text": "**METHODOLOGY**" }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-85", "text": "After paraphrase extraction we have paraphrase pairs, (f 1 , f 2 ) and a score S(f 1 , f 2 ) we can induce new translation rules for OOV phrases using the steps in Algo." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-86", "text": "(1): 1) A graph of source phrases is constructed as in (Razmara et al., 2013) ; 2) translations are propagated as labels through the graph as explained in Fig. 2 ; and 3) new translation rules obtained from graph-propagation are integrated with the original phrase table." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-87", "text": "----------------------------------" }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-88", "text": "**GRAPH CONSTRUCTION**" }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-89", "text": "We construct a graph G(V, E, W ) over all source phrases in the paraphrase database and the source language phrases from the SMT phrase table extracted from the available parallel data." 
}, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-136", "text": "**GRAPH PRUNING AND PPDB SIZES**" }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-90", "text": "V corresponds to the set of vertices (source phrases), E is the set of edges between phrases, and W is the weight of each edge, given by the score function S defined in Sec. 2." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-91", "text": "V has two types of nodes: seed (labeled) nodes, V s , from the SMT phrase table, and regular nodes, V r ." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-92", "text": "Note that in this step OOVs are part of these regular nodes, and we try to find translations in the propagation step for all of these regular nodes." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-93", "text": "In graph construction and propagation, we do not know which phrasal nodes correspond to OOVs in the dev and test set." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-94", "text": "Fig. 2 shows a small slice of the actual graph used in one of our experiments; this graph is constructed using the paraphrase database on the right side of the figure." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-95", "text": "Filled nodes have a distribution over translations (the possible \"labels\" for that node)." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-96", "text": "In our setting, we consider the translation e to be the \"label\", so we propagate the labeling distribution p(e|f ), taken from the corresponding feature function of the SMT log-linear model in the phrase table, to the unlabeled nodes in the graph." 
}, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-97", "text": "----------------------------------" }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-98", "text": "**GRAPH PROPAGATION**" }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-99", "text": "Considering the translation candidates of known phrases in the SMT phrase table as the \"labels\", we apply a soft label propagation algorithm in order to assign translation candidates to \"unlabeled\" nodes in the graph, which include our OOV phrases." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-100", "text": "As described by the example in Fig. 2 , we wish for two outcomes: 1) transfer of translations (or \"labels\") to unlabeled nodes (OOV phrases) from labeled nodes, and 2) smoothing of the label distribution at each node." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-101", "text": "We use the Modified Adsorption (MAD) algorithm (Talukdar and Crammer, 2009) for graph propagation." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-102", "text": "Suppose we have m different possible labels plus one dummy label; a soft label \u0176 \u2208 \u2206 m+1 is an (m+1)-dimensional probability vector." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-103", "text": "The dummy label is used when there is low confidence in the correct labels." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-104", "text": "Based on MAD, we want to find soft label vectors for each node by optimizing the objective function below:" }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-105", "text": "In this objective function, \u00b5 i and P i,v are hyperparameters (\u2200v : \u03a3 i P i,v = 1)." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-106", "text": "R v \u2208 \u2206 m+1 is our prior belief about the labeling." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-107", "text": "The first component of the function tries to minimize the difference between the new distribution and the original distribution for the seed nodes." 
}, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-108", "text": "The second component ensures that nearby neighbours have similar distributions, and the final component makes sure that the distribution does not stray from a prior distribution." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-109", "text": "At the end of propagation, we wish to find a label distribution for our OOV phrases." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-110", "text": "We describe in Sec. 4.2.2 the reasons for choosing MAD over other graph propagation algorithms." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-111", "text": "The MAD graph propagation generalizes the approach used in (Razmara et al., 2013)." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-112", "text": "The Structured Label Propagation (SLP) algorithm was used in (Saluja et al., 2014; Zhao et al., 2015), which uses a graph structure on the target-side phrases as well." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-113", "text": "However, we have found that in our diverse experimental settings (see Sec. 5) MAD has two properties we need that SLP lacks: graph random walks, which allow us to control the spread of translation candidates, and the ability to penalize nodes with a large number of edges (also see Sec. 4.2.2)." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-114", "text": "----------------------------------" }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-115", "text": "**PHRASE TABLE INTEGRATION**" }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-116", "text": "After propagation, for each potential OOV phrase we have a list of possible translations with corresponding probabilities." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-117", "text": "A potential OOV is any phrase which does not appear in training, but could appear in unseen data." 
}, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-118", "text": "We do not look at the dev or test data to produce the augmented phrase table." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-119", "text": "The original phrase table is now augmented with new entries providing translation candidates for potential OOVs; the last column in Table 2 shows how many entries have been added to the phrase table for each experimental setting." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-120", "text": "A new feature is added to the standard SMT log-linear discriminative model and introduced into the phrase table." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-121", "text": "This new feature is set either to 1.0 for the phrase table entries that already existed, or to the log probability (from graph propagation) of translation candidate i for potential OOVs." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-122", "text": "In case the dummy label exists with high probability or the label distribution is uniform, an identity rule is added to the phrase table (copying the source over to the target)." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-123", "text": "----------------------------------" }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-124", "text": "**ANALYSIS OF THE FRAMEWORK**" }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-125", "text": "----------------------------------" }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-126", "text": "**PROPAGATION OF POOR TRANSLATIONS**" }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-127", "text": "Automatic paraphrase extraction generates many possible paraphrase candidates, and many of them are likely to be false positives for finding translation candidates for OOVs." 
}, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-128", "text": "Distributional profiles rely on context information which is not sufficient to derive accurate paraphrases for many phrases and this results in many low quality paraphrase candidates." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-129", "text": "Bilingual pivoting uses word alignments which can also introduce errors depending on the size and quality of the bilingual data used." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-130", "text": "Alignment errors also introduce poor translations." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-131", "text": "In graph propagation, these errors may be propagated and result in poor translations for OOVs." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-132", "text": "We could address this issue by aggressively pruning the potential paraphrase candidates to improve the precision." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-133", "text": "However, this results in a dramatic drop in coverage and many OOV phrases do not obtain any translation candidates." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-134", "text": "We use a combination of the following three steps to augment our graph propagation framework." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-137", "text": "Pruning the graph avoids error propagation by removing unreliable edges." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-138", "text": "Pruning removes edges with an edge weight lower than a minimum threshold or by limiting the number of neighbours to the top-K edges ." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-139", "text": "PPDB has different sizes with different levels of accuracy and coverage." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-140", "text": "We can do graph pruning simply by choosing to use different sizes of PPDB." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-141", "text": "As we can see in Fig. 
3, results vary from language to language depending on the pruning used." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-142", "text": "For instance, the L size results in the best score for French-English." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-143", "text": "We choose the best size of PPDB for each language based on a separate held-out set and independently from each of the SMT-based tasks in our experimental results." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-144", "text": "Our conclusion from our experiments with the different sizes of PPDB is that removing phrases (or nodes in our graph) is not desirable." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-145", "text": "However, removing unreliable edges is useful." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-146", "text": "As seen in Table 1 , increasing the size of PPDB leads to a rapid increase in nodes followed by a larger number of edges in the very large PPDB sizes." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-147", "text": "----------------------------------" }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-148", "text": "**PRUNING THE TRANSLATION CANDIDATES**" }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-149", "text": "Another solution to the error propagation issue is to propagate all translation candidates, but keep only the top-L candidates when providing translations to OOVs in the final phrase table (Koehn et al., 2003)." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-150", "text": "Based on a development set, separate from the test sets we used, we found that the best value of L was 10." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-151", "text": "----------------------------------" }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-152", "text": "**EXTERNAL RESOURCES FOR FILTERING**" }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-153", "text": "Applying more informative filters can also improve paraphrase quality." 
}, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-154", "text": "This can be done through additional features for paraphrase pairs." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-155", "text": "For example, edit distance can be used to capture misspelled paraphrases." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-156", "text": "We use a Named Entity Recognizer to exclude names, numbers and dates from the paraphrase candidates." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-157", "text": "Even after removing these tokens, 3.32% of the test-set tokens are still OOVs." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-158", "text": "In addition, we use a list of stop words to remove nodes which have too many connections." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-159", "text": "These two filters improve our results (more in Sec. 5)." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-160", "text": "----------------------------------" }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-161", "text": "**PATH SENSITIVITY**" }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-184", "text": "By reducing the value of P cont and increasing P abnd , we can control the label propagation process to optimize the quality of translations for OOV phrases." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-185", "text": "Again, this is done on a held-out development set and not on the test data." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-186", "text": "The optimal values in our experiments for these probabilities are P inj = 0.9, P cont = 0.001, P abnd = 0.01." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-187", "text": "----------------------------------" }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-188", "text": "**EARLY STOPPING OF PROPAGATION**" }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-189", "text": "In Modified Adsorption (MAD) (see Sec. 3.2), nodes in the graph that are closely linked will tend toward similar label distributions as the number of iterations increases (even when the path lengths increase)." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-190", "text": "In our setting, smoothing the label distribution helps in the first few iterations, but is harmful as the number of iterations increases, due to the factors shown in Fig. 4." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-191", "text": "We use early stopping which limits the number of iterations." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-192", "text": "We varied the number of iterations from 1 to 10 on a held-out dev set and found that 5 iterations was optimal." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-193", "text": "----------------------------------" }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-194", "text": "**EVALUATION**" }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-195", "text": "We first show the effect of OOVs on translation quality, then evaluate our approach in three different SMT settings: low resource SMT, domain shift, and morphologically complex languages." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-196", "text": "In each case, we compare results of using paraphrases extracted by Distributional Profile (DP) and PPDB in an end-to-end SMT system." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-197", "text": "Important: no subset of the test data sentences is used in the bilingual corpora for the paraphrase extraction process." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-198", "text": "----------------------------------" }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-199", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-200", "text": "We use CDEC 1 (Dyer et al., 2010) as an end-to-end SMT pipeline with its standard features 2 . 
fast align (Dyer et al., 2013) is used for word alignment, and weights are tuned by minimizing BLEU loss on the dev set using MIRA (Crammer and Singer, 2003)." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-201", "text": "This setup is used for most of our experiments: oracle (Sec. 5.2), domain adaptation (Sec. 5.4) and morphologically complex languages (Sec. 5.5)." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-202", "text": "But as we wish to fairly compare our approach with Razmara et al. (2013) in the low resource setting, we follow their setup in Sec. 5.3: Moses (Koehn et al., 2007) as the SMT pipeline, GIZA++ (Och and Ney, 2003) for word alignment and MERT (Och, 2003) for tuning." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-203", "text": "We add our own feature to the SMT log-linear model as described in the phrase table integration section. KenLM (Heafield, 2011) is used to train a 5-gram language model on English Gigaword (V5: LDC2011T07)." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-204", "text": "For scalable graph propagation we use the Junto framework 3 ." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-205", "text": "We use a maximum phrase length of 10." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-206", "text": "For our experiments we use the Hadoop distributed computing framework executed on a cluster with 12 nodes (each node has 8 cores and 16GB of RAM)." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-207", "text": "Each graph propagation iteration takes about 3 minutes." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-208", "text": "For French, we apply a simple heuristic to detect named entities: words that are capitalized in the original dev/test set and do not appear at the beginning of a sentence are treated as named entities." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-209", "text": "Based on manual inspection of the results, this heuristic works very well on our data." 
}, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-210", "text": "For Arabic, AQMAR is used to exclude named entities (Mohit et al., 2012)." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-211", "text": "For each of the experimental settings below we show the OOV statistics in Table 2 ." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-212", "text": "----------------------------------" }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-213", "text": "**IMPACT OF OOVS: ORACLE EXPERIMENT**" }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-214", "text": "This oracle experiment shows that translation of OOVs beyond named entities, dates, etc. is potentially very useful in improving output translation." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-215", "text": "We trained an SMT system on 10K French-English sentences from the Europarl corpus (v7) (Koehn, 2005)." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-216", "text": "WMT 2011 and WMT 2012 are used as dev and test data respectively." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-217", "text": "Table 4 shows the results in terms of BLEU on dev and test." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-218", "text": "The first row is the baseline, which simply copies OOVs to the output." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-219", "text": "The second and third rows show the result of augmenting the phrase table by adding translations for single-word OOVs and phrases containing OOVs." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-220", "text": "The last row shows the oracle result where dev and test sentences exist inside the training data and all the OOVs are known (even full observation cannot avoid model and search errors)." 
}, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-221", "text": "----------------------------------" }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-222", "text": "**CASE 1: LIMITED PARALLEL DATA**" }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-223", "text": "In this experiment we use a setup similar to Razmara et al. (2013): 10K French-English parallel sentences, randomly chosen from Europarl, are used to train the translation system, as reported in (Razmara et al., 2013)." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-224", "text": "ACL/WMT 2005 4 is used for dev and test data." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-225", "text": "We re-implement their paraphrase extraction method (DP) to extract paraphrases from the French side of Europarl (2M sentences)." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-226", "text": "We use unigram nodes to construct graphs for both DP and PPDB." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-227", "text": "In bipartite graphs, each node is connected to at most 20 nodes." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-228", "text": "For tripartite graphs, each node is connected to 15 labeled and 5 unlabeled nodes." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-229", "text": "For intrinsic evaluation, we use Mean Reciprocal Rank (MRR) and Recall." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-230", "text": "MRR is the mean of the reciprocal ranks of the candidate list compared to the gold list (Eqn. 5)." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-231", "text": "Recall shows the percentage of the gold list covered by the candidate list (Eqn. 6)." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-232", "text": "Gold translations for OOVs are given by concatenating the test data to the training data and running a word aligner." 
}, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-233", "text": "to show how well our PPDB approach does compared to the DP approach in terms of MRR and recall; and 3) to show the applicability of our approach to a low-resource language." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-234", "text": "However, we used French instead of a truly resource-poor language due to the lack of available paraphrases for such languages, e.g. Malagasy." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-235", "text": "----------------------------------" }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-236", "text": "**CASE 2: DOMAIN ADAPTATION**" }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-237", "text": "Domain adaptation is another case that suffers from a massive number of OOVs." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-238", "text": "We compare our approach with Marginal Matching, a state-of-the-art approach in SMT domain adaptation." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-239", "text": "We use their setup and data and compare our results to their reported results; the medical domain corpus is from (Tiedemann, 2009), and for the science domain a corpus of scientific articles (Carpuat et al., 2012) has been used." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-240", "text": "Unigram paraphrases using DP are extracted from the French side of Europarl." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-241", "text": "Table 6 compares the results in terms of BLEU score." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-242", "text": "In both the medical and science domains, the graph-propagation approach using PPDB (large) performs significantly better than DP (p < 0.02), and has comparable results to Marginal Matching." 
}, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-243", "text": "Marginal Matching performs better in the science domain, but the graph-propagation approach with PPDB outperforms it in the medical domain, getting a +1.79 BLEU score improvement over the baseline." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-244", "text": "----------------------------------" }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-245", "text": "**CASE 3: MORPHOLOGICALLY RICH LANGUAGES**" }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-246", "text": "Both Distribution Profiling and Bilingual Pivoting propose morphological variants of a word as paraphrase pairs." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-247", "text": "This is even more common in PPDB due to pivoting over English." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-248", "text": "We choose the Arabic-English task for this experiment." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-249", "text": "We train the SMT system on 685K sentence pairs (randomly selected from LDC2007T08 and LDC2008T09) and use NIST OpenMT 2012 for dev and test data." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-250", "text": "The Arabic side of 1M sentences of LDC2007T08 and LDC2008T09 is used to extract unigram paraphrases for DP." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-251", "text": "Table 7 shows that PPDB (large; with phrases) resulted in a +1.53 BLEU score improvement over DP, which only slightly improved over the baseline." 
}, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-252", "text": "----------------------------------" }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-253", "text": "**RELATED WORK**" }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-254", "text": "Sentence-level paraphrasing has been used for generating alternative reference translations (Madnani et al., 2007; Kauchak and Barzilay, 2006), or augmenting the training data with sentential paraphrases (Bond et al., 2008; Nakov, 2008; Mirkin et al., 2009)." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-255", "text": "Phrase-level paraphrasing was done using crowdsourcing or by using paraphrases in lattice decoding (Onishi et al., 2010; Du et al., 2010)." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-256", "text": "Daum\u00e9 and Jagarlamudi (2011) apply a generative model to domain adaptation based on canonical correlation analysis (Haghighi et al., 2008)." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-257", "text": "However, they use artificially created monolingual corpora closely related to the domain of the test data." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-258", "text": "Irvine and Callison-Burch (2014a) generate a large, noisy phrase table by composing unigram translations which are obtained by a supervised method (Irvine and Callison-Burch, 2013)." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-259", "text": "Comparable monolingual data is used to re-score and filter the phrase table." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-260", "text": "Zhang and Zong (2013) use a large manually generated lexicon for domain adaptation." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-261", "text": "In contrast to these methods, our method is unsupervised." 
}, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-262", "text": "Alexandrescu and Kirchhoff (2009) use a graph-based semi-supervised model to determine similarities between sentences, then use it to rerank the n-best translation hypotheses." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-263", "text": "Liu et al. (2012) extend this model to derive some features to be used during decoding." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-264", "text": "These approaches are orthogonal to our approach." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-265", "text": "Saluja et al. (2014) use Structured Label Propagation (Liu et al., 2012) in two parallel graphs constructed on source and target paraphrases." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-266", "text": "In their case the graph construction is extremely expensive." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-267", "text": "Leveraging a morphological analyzer, they reach significant improvement on Arabic." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-268", "text": "We cannot directly compare our results to (Saluja et al., 2014) because they exploit several external resources, such as a morphological analyzer, and also use different training and test set sizes." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-269", "text": "In our experiments (Sec. 5) we obtained comparable BLEU score improvement on Arabic-English by using bilingual pivoting only on source phrases." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-270", "text": "Saluja et al. (2014) also use methods similar to (Habash, 2008) that expand the phrase table with spelling and morphological variants of OOVs in the test data." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-271", "text": "We do not use the dev/test data to augment the phrase table." 
}, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-272", "text": "Using comparable corpora to extract parallel sentences and phrases (Munteanu and Marcu, 2006; Smith et al., 2010; Tamura et al., 2012) is orthogonal to the approach we discuss here." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-273", "text": "Bilingual and multilingual word and phrase representations using neural networks have been applied to machine translation (Zou et al., 2013; Mikolov et al., 2013a; Zhang et al., 2014)." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-274", "text": "However, most of these methods focus on frequent words or an available bilingual phrase table (Zou et al., 2013; Zhang et al., 2014; Gao et al., 2014)." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-275", "text": "Mikolov et al. (2013a) learn a global linear projection from source to target using representations of frequent words on both sides." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-276", "text": "This model can be used to generate translations for new words, but a large amount of bilingual data is required to create such a model." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-277", "text": "Mikolov et al. (2013b) also use bilingual data to project new translation rules." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-278", "text": "Zhao et al. (2015) extend Mikolov's model to learn one local linear projection for each phrase." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-279", "text": "Their model reaches comparable results to Saluja et al. (2014) while running faster." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-280", "text": "Alkhouli et al. (2014) use neural network phrase representations for paraphrasing OOVs and find translations for them using a phrase table created from limited parallel data." 
}, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-281", "text": "Our experimental setting is different from those in (Alkhouli et al., 2014; Mikolov et al., 2013a; Mikolov et al., 2013b)." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-282", "text": "----------------------------------" }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-283", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-284", "text": "In future work, we would like to include translations for infrequent phrases which are not OOVs." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-285", "text": "We would like to explore new propagation methods that can directly use confidence estimates and control propagation based on label sparsity." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-286", "text": "We would also like to extend this work to morphologically rich languages by exploiting other resources such as a morphological analyzer, and compare our approach to the current state-of-the-art approaches which use these types of resources." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-287", "text": "In conclusion, we have shown significant improvements to the quality of statistical machine translation in three different cases: low resource SMT, domain shift, and morphologically complex languages." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-288", "text": "Through the use of semi-supervised graph propagation, a large-scale multilingual paraphrase database can be used to improve the quality of statistical machine translation." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-162", "text": "Graph propagation has been used in many NLP tasks like POS tagging, parsing, etc., but propagating translations as labels in a graph is much more challenging." 
}, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-163", "text": "Due to the huge number of possible labels (translations) and many low-quality edges, it is very likely that many wrong translations are rapidly propagated in a few steps." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-164", "text": "Razmara et al. (2013) show that unlabeled nodes inside the graph, called bridge nodes, are useful for the transfer of translations when there is no other connection between an OOV phrase and a node with known translation candidates." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-165", "text": "However, they show that using the full graph with long paths of bridge nodes hurts performance." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-166", "text": "Thus the propagation has to be constrained using path sensitivity." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-167", "text": "Figure 4: Sensitivity issue in graph propagation for translations. \"Lager\" is a translation candidate for \"stock\", which is transferred to \"majority\" after 3 iterations." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-168", "text": "phrase graph." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-169", "text": "After three iterations, the German translation \"Lager\" reaches \"majority\", which is totally irrelevant as a translation candidate." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-170", "text": "Transfer of translation candidates should prefer close neighbours, reaching other nodes in the graph only with a very low probability." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-171", "text": "----------------------------------" }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-172", "text": "**PRE-STRUCTURING THE GRAPH**" }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-173", "text": "Razmara et al. (2013) avoid a fully connected graph structure." 
}, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-174", "text": "They pre-structure the graph into bipartite graphs (only connections between phrases with known translation and OOV phrases) and tripartite graphs (connections can also go from a known phrasal node to an OOV phrasal node through one node that is a paraphrase of both but does not have translations, i.e. it is an unlabeled node)." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-175", "text": "In these pre-structured graphs there are no connections between nodes of the same type (known, OOV or unlabeled)." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-176", "text": "We apply this method in our low resource setting experiments (Sec. 5.3) to compare our bipartite and tripartite results to Razmara et al. (2013) ." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-177", "text": "In the rest of the experiments we use the tripartite approach since it outperforms the bipartite approach." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-178", "text": "----------------------------------" }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-179", "text": "**GRAPH RANDOM WALKS**" }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-180", "text": "Our goal is to limit the number of hops in the propagation of translation candidates preferring closely connected and highly probable edge weights." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-181", "text": "Optimization for the Modified Adsorption (MAD) objective function in Sec. 3.2 can be viewed as a controlled random walk (Talukdar et al., 2008; Talukdar and Crammer, 2009 )." }, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-182", "text": "This is formalized as three actions: inject, continue and abandon with corresponding pre-defined probabilities P inj , P cont and P abnd respectively as in (Talukdar and Crammer, 2009) ." 
}, { "sent_id": "d51bf6d22d21dcd91e080f6f0b5dcb-C001-183", "text": "A random walk through the graph will transfer labels from one node to another node, and probabilities P cont and P abnd control exploration of the graph." } ], "y": { "@USE@": { "gold_contexts": [ [ "d51bf6d22d21dcd91e080f6f0b5dcb-C001-51" ], [ "d51bf6d22d21dcd91e080f6f0b5dcb-C001-61" ], [ "d51bf6d22d21dcd91e080f6f0b5dcb-C001-86" ], [ "d51bf6d22d21dcd91e080f6f0b5dcb-C001-111" ], [ "d51bf6d22d21dcd91e080f6f0b5dcb-C001-176" ], [ "d51bf6d22d21dcd91e080f6f0b5dcb-C001-202" ], [ "d51bf6d22d21dcd91e080f6f0b5dcb-C001-223" ] ], "cite_sentences": [ "d51bf6d22d21dcd91e080f6f0b5dcb-C001-51", "d51bf6d22d21dcd91e080f6f0b5dcb-C001-61", "d51bf6d22d21dcd91e080f6f0b5dcb-C001-86", "d51bf6d22d21dcd91e080f6f0b5dcb-C001-111", "d51bf6d22d21dcd91e080f6f0b5dcb-C001-176", "d51bf6d22d21dcd91e080f6f0b5dcb-C001-202", "d51bf6d22d21dcd91e080f6f0b5dcb-C001-223" ] }, "@SIM@": { "gold_contexts": [ [ "d51bf6d22d21dcd91e080f6f0b5dcb-C001-53" ] ], "cite_sentences": [ "d51bf6d22d21dcd91e080f6f0b5dcb-C001-53" ] } } }, "ABC_b124e65938672691a5589fb5cdb21e_11": { "x": [ { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-2", "text": "Named entity recognition (NER) is a well-established task of information extraction which has been studied for decades." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-3", "text": "More recently, studies reporting NER experiments on social media texts have emerged." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-4", "text": "On the other hand, stance detection is a considerably new research topic usually considered within the scope of sentiment analysis." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-5", "text": "Stance detection studies are mostly applied to texts of online debates where the stance of the text owner for a particular target, either explicitly or implicitly mentioned in text, is explored." 
}, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-6", "text": "In this study, we investigate the possible contribution of named entities to the stance detection task in tweets." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-7", "text": "We report the evaluation results of NER experiments as well as that of the subsequent stance detection experiments using named entities, on a publicly-available stance-annotated data set of tweets." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-8", "text": "Our results indicate that named entities obtained with a high-performance NER system can contribute to stance detection performance on tweets." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-9", "text": "----------------------------------" }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-11", "text": "With the emergence of social media applications like Twitter and Facebook, a considerable body of social media texts has accumulated." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-12", "text": "The need for the utilization of this user-generated content, for several purposes like trend analysis, has accelerated research on social media analysis." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-13", "text": "As acknowledged in the related literature, the language used in social media texts is usually different from well-formed texts such as news articles, where the latter has been one of the most common text genre targeted by natural language processing (NLP) research so far." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-14", "text": "As a result, approaches proposed for well-formed texts have suffered from the porting problem, revealed with poor performance on social media texts." 
}, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-15", "text": "One of these NLP problems is named entity recognition (NER) which targets at the extraction and classification of named entities like person, location, and organization names in texts (Nadeau and Sekine, 2007) ." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-16", "text": "Several recent studies on NER report performance results on social media texts and propose customized systems for this text genre (Ritter et al., 2011) ." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-17", "text": "Another important research topic regarding social media analysis is sentiment analysis (Pang and Lee, 2008) where the opinion or sentiment of the text owner is explored." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-18", "text": "Stance detection is a considerably recent research field usually considered as a subproblem of sentiment analysis (Mohammad et al., 2016b) ." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-19", "text": "In stance detection, the aim is to determine the stance of the text owner (as Favor, Against, or Neither) for a particular target either explicitly or implicitly mentioned in the text (Mohammad et al., 2016b) ." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-20", "text": "The most common text genres used by stance detection studies are on-line debates and social media texts like tweets." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-41", "text": "The evaluation results of the NER tool on the tweet data set are presented in Table 2 , in terms of the corresponding metrics of precision (P), recall (R), and F-Measure (F), without giving credit to partial extractions." 
}, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-21", "text": "Some of the recent studies on this topic report performance evaluation results of different classifiers using different feature sets (Mohammad et al., 2016b ) while others present publicly-available stance-annotated data sets (Mohammad et al., 2016a; Sobhani et al., 2017; K\u00fc\u00e7\u00fck, 2017) ." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-22", "text": "In this study, we present our experiments of using named entities for the purposes of improved stance detection in tweets." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-23", "text": "We have used the publicly-available tweet data set in Turkish annotated with stance information, together with the results of the corresponding SVM classifiers using unigrams as features in (K\u00fc\u00e7\u00fck, 2017) as the baselines." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-24", "text": "We first perform NER on this data set and next use the named entities as additional features during our SVM-based stance detection experiments." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-25", "text": "Our findings are particularly encouraging as they provide evidence for the contribution of a high-performance NER system to the subsequent stance detection procedure using the extracted named entities." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-26", "text": "The rest of the paper is organized as follows: In Section 2, the evaluation results of an existing NER system on the data set are presented, the subsequent stance detection experiments using the named entities from the data set are described in Section 3, and finally Section 4 summarizes the paper together with future research directions." 
}, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-27", "text": "----------------------------------" }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-28", "text": "**NAMED ENTITY RECOGNITION IN TWEETS**" }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-29", "text": "NER is an information extraction task that has been studied for decades, especially on well-formed texts like news articles." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-30", "text": "More recently, considerable number of studies on NER target at social media texts, like tweets." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-31", "text": "Yet, as emphasized in the related literature (Ritter et al., 2011) , porting existing NER systems to social media texts results in poor performance." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-32", "text": "Therefore, related studies usually propose customized systems and/or annotated data sets (to be used during the training of supervised systems) for this new text genre." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-33", "text": "In this study, we have used the stance-annotated tweet data set in Turkish (K\u00fc\u00e7\u00fck, 2017) which includes 700 random tweets related to two sports clubs and these clubs constitute the stance targets." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-34", "text": "The data set is a balanced one in the sense that 175 tweets are in favor of Target-1 and 175 tweets are against Target-1, while 175 tweets are in favor of and 175 are against Target-2." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-35", "text": "There are no tweet instances annotated with the class Neither in the data set (K\u00fc\u00e7\u00fck, 2017) ." 
}, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-36", "text": "The names of the two target clubs are explicitly mentioned in all of the tweets although some of these mentions are neologisms or contracted forms while some others have writing errors, as expected due to the peculiarities of language use in Twitter (K\u00fc\u00e7\u00fck and Steinberger, 2014) ." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-37", "text": "In order to create the NER answer key for this data set, we have annotated it with person, location, and organization names." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-38", "text": "The resulting named entity statistics are provided in Table 1 ." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-39", "text": "As the NER tool, we have used the extended version of the rule-based tool (K\u00fc\u00e7\u00fck and Yaz\u0131c\u0131, 2009) proposed for news articles in order to perform better on tweets, as presented in (K\u00fc\u00e7\u00fck and Steinberger, 2014) ." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-40", "text": "These extensions include relaxing the capitalization constraint to extract entities all in lowercase as well and diacriticsbased extension of the lexical resources of the tool since characters with diacritics such as \u00e7, \u0131,\u00f6, \u015f are commonly replaced with the corresponding characters such as c, i, o, s in tweets." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-42", "text": "That is, a named entity extraction is considered as correct if both its type and all of its tokens are correctly identified by the system." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-43", "text": "The NER results obtained are not favorable compared to the NER results on news articles in Turkish (usually over 85% in F-Measure) but as reported in the literature, NER performance on tweets is usually considerably lower due to the particular language use in tweets." 
}, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-44", "text": "For instance, using the same version of the NER tool, the best F-Measure rate obtained on another tweet data set in Turkish is 48.13% (K\u00fc\u00e7\u00fck and Steinberger, 2014) ." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-45", "text": "The F-Measure rates given in Table 2 are all higher than 48.13%, reaching to 68.78% for the tweets marked as in favor of Target-1, and therefore these rates can be considered encouraging." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-46", "text": "Yet, we should also note a difference between the features of the tweet data sets in (K\u00fc\u00e7\u00fck and Steinberger, 2014 ) and in the current study." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-47", "text": "As pointed out at the beginning of this section, in each of the tweets of the stance data set used in the current study, the sports clubs (a named entity of organization type) which are the stance targets are explicitly mentioned while there is no such restriction for the data sets used in (K\u00fc\u00e7\u00fck and Steinberger, 2014) ." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-48", "text": "Hence, the data set used in the current study can better be considered as a \"targeted tweet data set\" and this may be one of the reasons for the high rates compared to those reported in (K\u00fc\u00e7\u00fck and Steinberger, 2014) ." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-49", "text": "The results in Table 2 also indicate that NER performance is higher for the tweets marked as in favor of their targets when compared with the tweets marked as against." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-50", "text": "The recall rates are particularly low for this latter set of tweets and one of the reasons for this observation may be the users' common employment of neologisms (in the negative sense) especially for the tweet targets, which are missed by the NER tool." 
}, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-51", "text": "To summarize, we have annotated the stance data set with named entities in order to create the NER answer key and performed NER evaluations on this data set." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-52", "text": "The results obtained are promising and even higher than those results reported for similar settings in the literature." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-53", "text": "The evaluation results obtained in the current study are significant as a new set of NER evaluations on a proprietary \"targeted\" tweet data set in Turkish, in addition to the data sets used in studies such as (K\u00fc\u00e7\u00fck and Steinberger, 2014) , since NER on tweets is still an important research field of NLP." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-54", "text": "----------------------------------" }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-55", "text": "**USING NAMED ENTITIES FOR STANCE DETECTION IN TWEETS**" }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-56", "text": "In the current study, we have used the stance-annotated tweet data set described in (K\u00fc\u00e7\u00fck, 2017) ." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-57", "text": "Also presented in (K\u00fc\u00e7\u00fck, 2017) are the results of the following experiments on this data set:" }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-58", "text": "\u2022 SVM classifiers using unigrams as features," }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-59", "text": "\u2022 SVM classifiers using bigrams as features," }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-60", "text": "\u2022 SVM classifiers using unigrams and the existence of hashtags in tweets as features." 
}, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-61", "text": "The corresponding results have indicated that using unigrams as features leads to favorable performance rates and using unigrams together with hashtag features improves these results further, while using bigrams as features of the SVM classifiers results in poor performance (K\u00fc\u00e7\u00fck, 2017) ." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-62", "text": "The favorable results corresponding to the former two settings are provided in Table 3 as excerpted from (K\u00fc\u00e7\u00fck, 2017) , in order to be used as reference results for comparison purposes." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-63", "text": "These are 10-fold cross validation results on the data set." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-64", "text": "As can be observed in Table 3 , using the existence of hashtags as an additional feature improves stance detection performance in terms of average F-Measure for Target-2 although it leads to a slight decrease in F-Measure for Target-1 (K\u00fc\u00e7\u00fck, 2017) ." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-65", "text": "In this study, we investigate the possible contribution of named entities to stance detection in tweets." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-66", "text": "We do not consider named entity types as features, but instead we use named entities as additional features for SVM classifiers which have used unigrams as features." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-67", "text": "Named entities which are not inflected and which comprise only single tokens are no different than existing unigrams." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-68", "text": "However, there are named entities in the form of ngrams." 
}, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-69", "text": "Additionally, Turkish is an agglutinative language and the particular NER tool that we have employed extracts named entities in their bare forms by excluding the sequence of suffixes attached to named entities, making the corresponding single-token named entities different from unigrams." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-70", "text": "We should note that during the named entity annotation procedure to create the answer key (as explained in Section 2), bare forms of the entities are annotated, making the system results and the annotations consistent." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-71", "text": "Similar to the settings in (K\u00fc\u00e7\u00fck, 2017) , we have used the SVM classifier based on the SMO algorithm (Platt, 1999) , available in the Weka tool (Hall et al., 2009 ), during our stance detection experiments." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-72", "text": "10-fold crossvalidation results of the classifiers using the named entities extracted by the employed NER tool from the data set are provided in Table 4 while the corresponding results of the classifiers using the named entities in the manually-annotated version of the data set (i.e., the answer key for the NER procedure) are provided in Table 5 ." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-73", "text": "Based on the results presented in Tables 3, 4 , and 5, the following conclusions can be drawn:" }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-74", "text": "\u2022 Using named entities as additional features improves the stance detection performance considerably for Target-2 although a slight performance decrease is observed for Target-1, when compared with the case in which only the unigrams are used as features." 
}, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-75", "text": "These results indicate that stance detection task can benefit from the outputs of NER tools." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-76", "text": "\u2022 Using manually annotated named entities (in the answer key) improves stance detection performance for both targets and for both stance classes when compared with using named entities extracted with a NER tool." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-77", "text": "This is an expected result, since there are errors in the output of the NER tool (as quantified in Table 2) where some entities are missed and some other token sequences are incorrectly extracted as named entities." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-78", "text": "This finding provides evidence for the contribution of a high-performance NER tool to the stance detection task." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-79", "text": "The less errors the employed NER tool makes, the more successful the stance detection system, utilizing the output of the NER tool, will be." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-80", "text": "\u2022 Joint use of named entities and existence of hashtags as additional features to unigrams improves stance detection performance slightly for Target-1 while slight decreases are observed for Target-2 in this settings." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-81", "text": "Hence, further experiments are necessary to make sound conclusions regarding joint utilization of named entities and hastags as features of the SVM classifiers for the stance detection task." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-82", "text": "\u2022 As has been reported in (K\u00fc\u00e7\u00fck, 2017) , the overall evaluation results of the stance detection task are considerably higher for the Favor class when compared with the results for the Against class, in all settings given in Table 4 and 5." 
}, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-83", "text": "----------------------------------" }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-84", "text": "**CONCLUSION**" }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-85", "text": "Stance detection is a relatively recent research topic similar in nature to sentiment analysis." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-86", "text": "The aim of stance detection is to determine the stance of a text owner for a particular target and stance detection modules can be used within several different social media analysis applications." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-87", "text": "In this study, we have first performed NER on a stance-annotated tweet data set in Turkish and then investigated the possible contribution of named entities as SVM features for the stance detection task." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-88", "text": "During this procedure, we have first reported the evaluation results of a NER tool that has been extended to be perform better on tweets and then have utilized the extracted named entities as additional features to SVM classifiers which already employ unigrams as features." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-89", "text": "Our results indicate that named entities can considerably improve the stance detection performance when used together with unigrams." }, { "sent_id": "b124e65938672691a5589fb5cdb21e-C001-90", "text": "Future work includes testing the same classifier settings on larger data sets, and performing similar tests by using stance-annotated data sets and NER tools for other languages such as English and compare the corresponding test results with the ones in the current study." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "b124e65938672691a5589fb5cdb21e-C001-21" ], [ "b124e65938672691a5589fb5cdb21e-C001-35" ], [ "b124e65938672691a5589fb5cdb21e-C001-57" ], [ "b124e65938672691a5589fb5cdb21e-C001-61" ] ], "cite_sentences": [ "b124e65938672691a5589fb5cdb21e-C001-21", "b124e65938672691a5589fb5cdb21e-C001-35", "b124e65938672691a5589fb5cdb21e-C001-57", "b124e65938672691a5589fb5cdb21e-C001-61" ] }, "@USE@": { "gold_contexts": [ [ "b124e65938672691a5589fb5cdb21e-C001-23" ], [ "b124e65938672691a5589fb5cdb21e-C001-33" ], [ "b124e65938672691a5589fb5cdb21e-C001-56" ], [ "b124e65938672691a5589fb5cdb21e-C001-62" ], [ "b124e65938672691a5589fb5cdb21e-C001-64" ] ], "cite_sentences": [ "b124e65938672691a5589fb5cdb21e-C001-23", "b124e65938672691a5589fb5cdb21e-C001-33", "b124e65938672691a5589fb5cdb21e-C001-56", "b124e65938672691a5589fb5cdb21e-C001-62", "b124e65938672691a5589fb5cdb21e-C001-64" ] }, "@SIM@": { "gold_contexts": [ [ "b124e65938672691a5589fb5cdb21e-C001-71" ], [ "b124e65938672691a5589fb5cdb21e-C001-82" ] ], "cite_sentences": [ "b124e65938672691a5589fb5cdb21e-C001-71", "b124e65938672691a5589fb5cdb21e-C001-82" ] } } }, "ABC_76476d80e1d3f65818592ec4caab0e_11": { "x": [ { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-2", "text": "In models to generate program source code from natural language, representing this code in a tree structure has been a common approach." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-3", "text": "However, existing methods often fail to generate complex code correctly due to a lack of ability to memorize large and complex structures." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-28", "text": "Given an NL description q, our purpose is to generate code (e.g. Python) represented as an AST a. 
In this work, we start with the syntactic code generation model by Yin and Neubig (2017) , which uses sequences of actions to generate the AST before converting it to surface code." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-29", "text": "Formally, we want to find the best generated AST \u00e2 given by:" }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-30", "text": "where y_t is the action taken at time step t. We refer readers to Yin and Neubig (2017) for more detail of the neural model, which consists of a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) encoder-decoder with action embeddings, context vectors, parent feeding, and a copy mechanism using pointer networks." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-37", "text": "----------------------------------" }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-38", "text": "**RECODE: RETRIEVAL-BASED NEURAL CODE GENERATION**" }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-39", "text": "We propose RECODE, a method for retrieval-based neural syntactic code generation, using retrieved action subtrees." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-40", "text": "Following 's method for neural machine translation, these retrieved subtrees act as templates that bias the generation of output code."
}, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-41", "text": "Our pipeline at test time is as follows:" }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-42", "text": "\u2022 retrieve from the training set NL descriptions that are most similar with our input sentence ( \u00a73.1), \u2022 extract n-gram action subtrees from these retrieved sentences' corresponding target ASTs ( \u00a73.2)," }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-43", "text": "\u2022 alter the copying actions in these subtrees, by substituting words of the retrieved sentence with corresponding words in the input sentence ( \u00a73.3), and \u2022 at every decoding step, increase the probability of actions that would lead to having these subtrees in the produced tree ( \u00a73.4)." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-44", "text": "----------------------------------" }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-45", "text": "**RETRIEVAL OF TRAINING INSTANCES**" }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-46", "text": "For every retrieved NL description q m from training set (or retrieved sentence for short), we compute its similarity with input q, using a sentence similarity formula (Gu et al., 2016; :" }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-47", "text": "where d is the edit distance." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-48", "text": "We retrieve only the top M sentences according to this metric where M is a hyperparameter." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-49", "text": "These scores will later be used to increase action probabilities accordingly." 
}, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-50", "text": "----------------------------------" }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-51", "text": "**EXTRACTING N -GRAM ACTION SUBTREES**" }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-52", "text": "In , they collect n-grams from the output side of the retrieved sentences and encourage the model to generate these n-grams." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-53", "text": "Word n-grams are obvious candidates when generating a sequence of words as output, as in NMT." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-54", "text": "However, in syntax-based code generation, the generation target is ASTs with no obvious linear structure." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-55", "text": "To resolve this problem, we instead use retrieved pieces of n-gram subtrees from the target code corresponding to the retrieved NL descriptions." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-56", "text": "Though we could select successive nodes in the AST as retrieved pieces, such as [assign; expr * (targets); expr] from Figure 1 , we would miss important structural information from the rules that are used." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-57", "text": "Thus, we choose to exploit actions in the generation model rather than AST nodes themselves to be candidates for our retrieved pieces." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-58", "text": "In the action tree (Figure 1 ), we considered only successive actions, such as subtrees where each node has one or no children, to avoid overly rigid structures or combinatorial explosion of the number of retrieved pieces the model has to consider." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-59", "text": "For example, such an action subtree would be given by [assign \u2192 expr * (targets), expr (value) ; expr(value) \u2192 List; List \u2192 epsilon]." 
}, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-60", "text": "As the node in the action tree holds structural information about its children, we set the subtrees to have a fixed depth, linear in the size of the tree." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-61", "text": "These can be considered \"n-grams of actions\", emphasizing the comparison with machine translation which uses n-grams of words." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-62", "text": "n is a hyperparameter to be tuned." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-63", "text": "----------------------------------" }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-64", "text": "**WORD SUBSTITUTION IN COPY ACTIONS**" }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-65", "text": "Using the retrieved subtree without modification is problematic if it contains at least one node corresponding to a COPY action because copied tokens from the retrieved sentence may be different from those in the input." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-66", "text": "Figure 1 shows an example when the input and retrieved sentence have four common words, but the object names are different." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-67", "text": "The extracted action n-gram would contain the rule that copies the second word (\"lst\") of the retrieved sentence while we want to copy the first word (\"params\") from the input." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-68", "text": "By computing word-based edit distance between the input description and the retrieved sentence, we implement a one-to-one sentence alignment method that infers correspondences between uncommon words." 
}, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-69", "text": "For unaligned words, we alter all COPY rules in the extracted n-grams to copy tokens by their aligned counterpart, such as replace \"params\" with \"lst\", and delete the n-gram subtree, as it is not likely to be relevant in the predicted tree." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-70", "text": "Thus, in the example in Figure 1 , the GENTOKEN(LST) action in t 5 will not be executed." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-71", "text": "----------------------------------" }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-72", "text": "**RETRIEVAL-GUIDED CODE GENERATION**" }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-73", "text": "N -gram subtrees from all retrieved sentences are assigned a score, based on the best similarity score Yin and Neubig (2017) of all instances where they appeared." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-74", "text": "We normalize the scores for each input sentence by subtracting the average over the training dataset." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-75", "text": "At decoding time, incorporate these retrievalderived scores into beam search: for a given time step, all actions that would result in one of the retrieved n-grams u to be in the prediction tree has its log probability log(p(y t | y t\u22121 1 )) increased by \u03bb * score(u) where \u03bb is a hyperparameter, and score(u) is the maximal sim(q, q m ) from which u is extracted." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-76", "text": "The probability distribution is then renormalized." 
}, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-77", "text": "----------------------------------" }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-78", "text": "**DATASETS AND EVALUATION METRICS**" }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-79", "text": "We evaluate RECODE with the Hearthstone (HS) (Ling et al., 2016) and Django (Oda et al., 2015) datasets, as preprocessed by Yin and Neubig (2017) ." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-80", "text": "HS consists of Python classes that implement Hearthstone card descriptions while Django contains pairs of Python source code and English pseudo-code from Django web framework." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-81", "text": "Table 1 summarizes dataset statistics." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-82", "text": "For evaluation metrics, we use accuracy of exact match and the BLEU score following Yin and Neubig (2017) ." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-83", "text": "----------------------------------" }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-84", "text": "**EXPERIMENTS**" }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-85", "text": "For the neural code generation model, we use the settings explained in Yin and Neubig (2017) ." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-86", "text": "For the retrieval method, we tuned hyperparameters and achieved best result when we set n max = 4 and \u03bb = 3 for both datasets 3 ." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-87", "text": "For HS, we set M = 3 and M = 10 for Django." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-88", "text": "We compare our model with Yin and Neubig (2017)'s model that we call YN17 for brevity, and a sequence-to-sequence (SEQ2SEQ) model that we implemented." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-89", "text": "SEQ2SEQ is an attentionenabled encoder-decoder model (Bahdanau et al., 2015) ." 
}, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-90", "text": "The encoder is a bidirectional LSTM and the decoder is an LSTM." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-91", "text": "We ran statistical significance tests for RECODE and YN17, using bootstrap resampling with N = 10,000." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-92", "text": "For the BLEU scores of both datasets, p < 0.001." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-93", "text": "For the exact match accuracy, p < 0.001 for Django dataset, but for Hearthstone, p > 0.3, showing that the retrieval-based model is on par with YN17." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-94", "text": "It is worth noting, though, that HS consists of long and complex code, and that generating exact matches is very difficult, making exact match accuracy a less reliable metric." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-95", "text": "----------------------------------" }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-96", "text": "**RESULTS**" }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-97", "text": "We also compare RECODE with Rabinovich et al. (2017) 's Abstract Syntax Networks with supervision (ASN+SUPATT) which is the state-of-the-art system for HS." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-98", "text": "RECODE exceeds ASN without extra supervision though ASN+SUPATT has a slightly better result." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-99", "text": "However, ASN+SUPATT is trained with supervised attention extracted through heuristic exact word matches while our attention is unsupervised." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-100", "text": "----------------------------------" }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-101", "text": "**DISCUSSION AND ANALYSIS**" }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-102", "text": "From our observation and as mentioned in Rabinovich et al. 
(2017) , HS contains classes with similar structure, so the code generation task could be simply matching the tree structure and filling the terminal tokens with correct variables and values." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-103", "text": "However, when the code consists of complex logic, partial implementation errors occur, leading to low exact match accuracy (Yin and Neubig, 2017) ." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-104", "text": "Analyzing our result, we find this intuition to be true not only for HS but also for Django." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-105", "text": "Examining the generated output for the Django dataset in Table 3 , we can see that in the first example, our retrieval model can successfully generate the correct code when YN17 fails." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-106", "text": "This difference suggests that our retrieval model benefits from the action subtrees from the retrieved sentences." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-107", "text": "In the second example, although our generated code does not perfectly match the reference code, it has a higher BLEU score compared Example 1 \"if offset is lesser than integer 0, sign is set to '-', otherwise sign is '+' \" Input sign = offset < 0 or '-' YN17 sign = '-' if offset < 0 else '+' RECODE sign = '-' if offset < 0 else '+' to the output of YN17 because our model can predict part of the code (timesince(d, now, reversed)) correctly." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-108", "text": "The third example shows where our method fails to apply the correct action as it cannot cast s to str type while YN17 can at least cast s into a type (bool)." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-109", "text": "Another common type of error that we found RECODE's generated outputs is incorrect variable copying, similarly to what is discussed in Yin and Neubig (2017) and Rabinovich et al. 
(2017) ." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-110", "text": "Table 4 presents a result on the HS dataset 4 ." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-111", "text": "We can see that our retrieval model can handle complex code more effectively." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-112", "text": "----------------------------------" }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-113", "text": "**RELATED WORK**" }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-114", "text": "Several works on code generation focus on domain specific languages (Raza et al., 2015; Kushman and Barzilay, 2013) ." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-115", "text": "For general purpose code generation, some data-driven work has been done for predicting input parsers (Lei et al., 2013) or a set of relevant methods (Raghothaman et al., 2016) ." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-116", "text": "Some attempts using neural networks have used sequence-to-sequence models (Ling et al., 2016) or tree-based architectures (Dong and Lapata, 2016; Alvarez-Melis and Jaakkola, 2017) ." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-117", "text": "Ling et al. (2016) ; Jia and Liang (2016) ; Locascio et al. (2016) treat semantic parsing as a sequence generation task by linearizing trees." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-118", "text": "The closest work to ours are Yin and Neubig (2017) and Rabinovich et al. (2017) which represent code as an AST." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-119", "text": "Another close work is Dong and Lapata (2018) , which uses a two-staged structure-aware neural architecture." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-120", "text": "They initially generate a lowlevel sketch and then fill in the missing information using the NL and the sketch." 
}, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-121", "text": "Recent works on retrieval-guided neural machine translation have been presented by Gu et al. (2018); Amin Farajian et al. (2017) ; Li et al. (2018) ; ." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-122", "text": "Gu et al. (2018) use the retrieved sentence pairs as extra inputs to the NMT model." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-123", "text": "employ a simpler and faster retrieval method to guide neural MT where translation pieces are n-grams from retrieved target sentences." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-124", "text": "We modify 's method from textual n-grams to n-grams over subtrees to exploit the code structural similarity, and propose methods to deal with complex statements and rare words." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-125", "text": "In addition, some previous works have used subtrees in structured prediction tasks." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-126", "text": "For example, Galley et al. (2006) used them in syntaxbased translation models." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-127", "text": "In Galley et al. (2006) , subtrees of the input sentence's parse tree are associated with corresponding words in the output sentence." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-128", "text": "----------------------------------" }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-129", "text": "**CONCLUSION**" }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-130", "text": "We proposed an action subtree retrieval method at test time on top of an AST-driven neural model for generating general-purpose code." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-131", "text": "The predicted surface code is syntactically correct, and the retrieval component improves the performance of a previously state-of-the-art model." 
}, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-132", "text": "Our successful result suggests that our idea of retrieval-based generation can be potentially applied to other treestructured prediction tasks." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-4", "text": "We introduce RECODE, a method based on subtree retrieval that makes it possible to explicitly reference existing code examples within a neural code generation model." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-5", "text": "First, we retrieve sentences that are similar to input sentences using a dynamicprogramming-based sentence similarity scoring method." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-6", "text": "Next, we extract n-grams of action sequences that build the associated abstract syntax tree." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-7", "text": "Finally, we increase the probability of actions that cause the retrieved n-gram action subtree to be in the predicted code." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-8", "text": "We show that our approach improves the performance on two code generation tasks by up to +2.6 BLEU." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-9", "text": "----------------------------------" }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-10", "text": "**1**" }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-11", "text": "----------------------------------" }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-12", "text": "**INTRODUCTION**" }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-13", "text": "Natural language to code generation, a subtask of semantic parsing, is the problem of converting natural language (NL) descriptions to code (Ling et al., 2016; Yin and Neubig, 2017; Rabinovich et al., 2017) ." 
}, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-14", "text": "This task is challenging because it has a well-defined structured output and the input structure and output structure are in different forms." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-15", "text": "A number of neural network approaches have been proposed to solve this task." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-16", "text": "Sequential approaches (Ling et al., 2016; Jia and Liang, 2016; Locascio et al., 2016) convert the target code into a sequence of symbols and apply a sequence-tosequence model, but this approach does not ensure that the output will be syntactically correct." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-17", "text": "1 Code available at https://github.com/ sweetpeach/ReCode Tree-based approaches (Yin and Neubig, 2017; Rabinovich et al., 2017) represent code as Abstract Syntax Trees (ASTs), which has proven effective in improving accuracy as it enforces the well-formedness of the output code." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-18", "text": "However, representing code as a tree is not a trivial task, as the number of nodes in the tree often greatly exceeds the length of the NL description." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-19", "text": "As a result, tree-based approaches are often incapable of generating correct code for phrases in the corresponding NL description that have low frequency in the training data." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-20", "text": "In machine translation (MT) problems Gu et al., 2018; Amin Farajian et al., 2017; Li et al., 2018) , hybrid methods combining retrieval of salient examples and neural models have proven successful in dealing with rare words." 
}, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-21", "text": "Following the intuition of these models, we hypothesize that our model can benefit from querying pairs of NL descriptions and AST structures from training data." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-22", "text": "In this paper, we propose RECODE, and adaptation of 's retrieval-based approach neural MT method to the code generation problem by expanding it to apply to generation of tree structures." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-23", "text": "Our main contribution is to introduce the use of retrieval methods in neural code generation models." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-24", "text": "We also propose a dynamic programming-based sentence-tosentence alignment method that can be applied to similar sentences to perform word substitution and enable retrieval of imperfect matches." }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-25", "text": "These contributions allow us to improve on previous stateof-the-art results." 
}, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-26", "text": "----------------------------------" }, { "sent_id": "76476d80e1d3f65818592ec4caab0e-C001-27", "text": "**SYNTACTIC CODE GENERATION**" } ], "y": { "@BACK@": { "gold_contexts": [ [ "76476d80e1d3f65818592ec4caab0e-C001-13" ], [ "76476d80e1d3f65818592ec4caab0e-C001-17" ], [ "76476d80e1d3f65818592ec4caab0e-C001-36" ], [ "76476d80e1d3f65818592ec4caab0e-C001-103" ] ], "cite_sentences": [ "76476d80e1d3f65818592ec4caab0e-C001-13", "76476d80e1d3f65818592ec4caab0e-C001-17", "76476d80e1d3f65818592ec4caab0e-C001-36", "76476d80e1d3f65818592ec4caab0e-C001-103" ] }, "@MOT@": { "gold_contexts": [ [ "76476d80e1d3f65818592ec4caab0e-C001-17", "76476d80e1d3f65818592ec4caab0e-C001-18", "76476d80e1d3f65818592ec4caab0e-C001-19" ] ], "cite_sentences": [ "76476d80e1d3f65818592ec4caab0e-C001-17" ] }, "@USE@": { "gold_contexts": [ [ "76476d80e1d3f65818592ec4caab0e-C001-28" ], [ "76476d80e1d3f65818592ec4caab0e-C001-73" ], [ "76476d80e1d3f65818592ec4caab0e-C001-79" ], [ "76476d80e1d3f65818592ec4caab0e-C001-82" ], [ "76476d80e1d3f65818592ec4caab0e-C001-85" ], [ "76476d80e1d3f65818592ec4caab0e-C001-88" ] ], "cite_sentences": [ "76476d80e1d3f65818592ec4caab0e-C001-28", "76476d80e1d3f65818592ec4caab0e-C001-73", "76476d80e1d3f65818592ec4caab0e-C001-79", "76476d80e1d3f65818592ec4caab0e-C001-82", "76476d80e1d3f65818592ec4caab0e-C001-85", "76476d80e1d3f65818592ec4caab0e-C001-88" ] }, "@EXT@": { "gold_contexts": [ [ "76476d80e1d3f65818592ec4caab0e-C001-103", "76476d80e1d3f65818592ec4caab0e-C001-104" ] ], "cite_sentences": [ "76476d80e1d3f65818592ec4caab0e-C001-103" ] }, "@SIM@": { "gold_contexts": [ [ "76476d80e1d3f65818592ec4caab0e-C001-109" ], [ "76476d80e1d3f65818592ec4caab0e-C001-118" ] ], "cite_sentences": [ "76476d80e1d3f65818592ec4caab0e-C001-109", "76476d80e1d3f65818592ec4caab0e-C001-118" ] } } }, "ABC_66392c3b6fa3744de79f056f615a75_11": { "x": [ { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-50", "text": "**MOTIVATION**" }, { 
"sent_id": "66392c3b6fa3744de79f056f615a75-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-2", "text": "This paper investigates the task of noun compound interpretation, building on the sense collocation approach proposed by Moldovan et al. (2004) ." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-3", "text": "Our primary task is to evaluate the impact of similar words on the sense collocation method, and decrease the sensitivity of the classifiers by expanding the range of sense collocations via different semantic relations." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-4", "text": "Our method combines hypernyms, hyponyms and sister words of the component nouns, based on WordNet." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-5", "text": "The data used in our experiments was taken from the nominal pair interpretation task of SEMEVAL-2007 (4th International Workshop on Semantic Evaluation 2007." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-6", "text": "In the evaluation, we test 7-way and 2-way class data, and show that the inclusion of hypernyms improves the performance of the sense collocation method, while the inclusion of hyponym and sister word information leads to a deterioration in performance." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-7", "text": "----------------------------------" }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-9", "text": "This paper investigates the automatic interpretation of noun compounds (i.e. NCs), such as paper submission and computer science department." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-10", "text": "NC interpretation is a well-known problem that aims to predict the semantic relation (i.e. SR) between a head noun and its modifying nominal(s) (i.e. modifiers)." 
}, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-11", "text": "SRs, simply put, encapsulate how the head noun and the other nominals in a noun compound are related." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-12", "text": "As English noun phrases are rightheaded, the head noun occurs after all modifying nouns." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-13", "text": "For example, brick house is interpreted as a house that is modified by the word brick, which exhibits a PRODUCT-PRODUCER relationship between the two nouns in the compound." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-14", "text": "In contrast, the modifier and head in house brick exhibits a PART-WHOLE relationship, which is interpreted as a brick from a house, rather than the former interpretation of a house made of bricks." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-15", "text": "The set of SRs that we are concerned with in this paper is defined in Section 5.1." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-16", "text": "Research on NCs can be categorised into four main groups: defining SRs, disambiguating the syntax of NCs, disambiguating the semantics of NCs, and interpreting NCs via SRs." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-17", "text": "Each task is detailed in Section 8." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-18", "text": "Interpreting NCs has received much attention of late, and the problem has been addressed in areas of machine translation (MT), information extraction (IE), and applications such as question-answering (QA)." 
}, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-19", "text": "Processing NCs exhibits many challenges due to the following issues associated with the task (Lapata, 2002) : (1) the compounding process is extremely productive; (2) the semantic relationship between the head noun and its modifier(s) is implicit; and (3) the interpretation of an NC can vary due to contextual and pragmatic factors." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-20", "text": "Due to these challenges, current NC interpretation methods are too error-prone to employ directly in NLP applications without any human intervention or preprocessing." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-21", "text": "In this paper, we investigate the task of NC interpretation based on sense collocation." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-22", "text": "It has been shown that NCs with semantically similar compo-nents share the same SR ; this is encapsulated by the phrase coined as sense collocation in Moldovan et al. (2004) ." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-23", "text": "For example, NCs such as apple pie have the same interpretation as banana cake, where the modifiers of both NCs are semantically similar (they are both classified as fruit), and the head nouns of both NCs are a type of baked edible concoction." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-24", "text": "Given that the modifier is a fruit and the head noun is a baked concoction, then we can interpret NCs with this sense collocation as having the PRODUCT-PRODUCER SR." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-25", "text": "However, unlike the method of , where new data was induced by substituting the components of the NCs with semantically similar terms, our approach adds related terms as features for the classifier." 
}, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-26", "text": "The related terms we add are the NC components' hypernyms, hyponyms, and sister words." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-27", "text": "We expect that these similar words can contribute to disambiguate the SR when classifying NCs." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-28", "text": "The remainder of this paper is structured as follows." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-29", "text": "In Section 2 we present previous research on NC interpretation." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-30", "text": "Section 3 and Section 4 describe our motivation and method, respectively." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-31", "text": "We describe the data used in our experiments in Section 5 and present the results of our experiments in Sections 6 and 7." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-32", "text": "In Section 8, we briefly outline some of the issues associated with NC research." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-33", "text": "Finally, we conclude our work in Section 9." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-34", "text": "----------------------------------" }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-35", "text": "**RELATED WORK**" }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-36", "text": "A majority of research undertaken in interpreting NCs have been based on two statistical methods: SEMANTIC SIMILARITY (Barker and Szpakowicz, 1998; Rosario, 2001; Moldovan et al., 2004; Kim and Baldwin, 2005; Nastase, 2006; Girju, 2007; and SEMANTIC INTER-PRETABILITY (Vanderwende, 1994; Lapata, 2002; Kim and Baldwin, 2006; Nakov, 2006) ." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-37", "text": "Our work, based on an extension of the sense collocation approach, corresponds to the semantic similarity category." 
}, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-38", "text": "A significant contribution to this area is by Moldovan et al. (2004) , who used the sense collocation (i.e. pair-of-word-senses) as their primary feature in disambiguating NCs." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-39", "text": "Many subsequent studies have been based on this sense collocation method, with the addition of other performance-improving features." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-40", "text": "For example, Girju (2007) added contextual information (e.g. the grammatical role and POS) and cross-lingual information from 5 European languages as features to her model." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-41", "text": "In contrast, utilise sense collocations in a different way: instead of adding additional features in their model, they increase the size of their training data by substituting components of existing training instances to generate additional training instances (which is assumed to have the same SR as the original)." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-42", "text": "For an SR to be preserved the newly-generated NC must be semantically similar and hence maintain the same sense collocation as the original NC on which it was based." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-43", "text": "Rosario (2001; Kim and Baldwin (2005; Nastase (2006) attempted to interpret NCs by applying implicit sense collocations." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-44", "text": "In particular, they used various ways to retrieve sense collocations instead of manually assigning word senses to the NCs." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-45", "text": "Hence, their methods do not require direct use of word senses." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-46", "text": "Rosario (2001) retrieved the sense pairs in the context of a hierarchical class set for the biomedical domain." 
}, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-47", "text": "Kim and Baldwin (2005) used the similarity measure to express the sense collocation of NCs." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-48", "text": "Nastase (2006) listed the hypernyms of components as sense features." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-49", "text": "----------------------------------" }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-51", "text": "As mentioned above, Moldovan et al. (2004) showed that the sense collocation of NCs is a key feature when interpreting NCs." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-52", "text": "Further research in this area has shown that not only synonymous NCs share the same SR, but NCs whose components are replaced with similar words also have the same SR as the original NCs ." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-53", "text": "For example, car factory, automobile factory and truck factory substituted with a synonym, hypernym and sister word, respectively, share the same SR of PRODUCT-PRODUCER." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-54", "text": "Figure 3 shows an example of semantic neighbours for the two NCs car key and apple pie." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-55", "text": "Car key can be interpreted as PRODUCT-PRODUCER by referring to the training NC automobile key, since they have the same sense collocation." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-56", "text": "With apple juice, the sense collocation method tries to locate matching sense collocations in the training data, and finds that fruit juice matches closely, with the mod- ifier being a hypernym of apple." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-57", "text": "From this, we can hope to correctly interpret apple juice as having the SR PRODUCT-PRODUCER." 
}, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-58", "text": "In order to achieve this, we require some means of comparing nouns taxonomically, both vertically to capute hypernyms and hyponyms, and horizontally to capture sister words." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-59", "text": "As intimated above, our motivation in conducting this research is to be able to include hypernym, hyponym and sister word information without using direct substitution over the training instances, but still preserving the essence of the sense collocation approach." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-60", "text": "The disadvantage of the method employed by is that noise will inevitably infect the training data, skewing the classifier performance." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-61", "text": "The original method described in Moldovan et al. (2004) only relies on observed sense collocations." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-62", "text": "The components of the NCs are represented as specific synsets in WordNet, and the model does not capture related words." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-63", "text": "Hence, in this paper, we aimed to develop a model that can take advantage of relatedness between WordNet synsets via hypernyms, hyponyms and sister words, without the risk of losing semantic granularity or introducing noisy training data." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-64", "text": "Note that in Kim and Baldwin (2007), we used synonyms, hypernyms and sister words." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-65", "text": "As synonyms have the identical sense collocation (i.e. pairing of synsets) within WordNet, we ignore it in this research." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-66", "text": "Instead, we add hyponyms as a means of broadening the range of sense collocation." 
}, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-67", "text": "----------------------------------" }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-68", "text": "**METHOD**" }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-69", "text": "At first, we describe the principal idea of sense collocation method on NC interpretation and the probability model proposed in (Moldovan et al., 2004) ." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-70", "text": "Then we present our method using hypernyms, hyponyms and sister words in order to extend sense collocation method." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-71", "text": "----------------------------------" }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-72", "text": "**SENSE COLLOCATION**" }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-73", "text": "The basic idea behind sense collocation method in Moldovan et al. (2004) was based on the \"pair-ofword-senses\" from the component nouns in noun compounds as features of the classifier." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-74", "text": "They also introduced a probability model called semantic scattering, as detailed in Equations 1 and 2 below as a supervised classification technique." 
}, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-75", "text": "In essence, the probability P (r|f i f j ) (simplified to P (r|f ij )) of a modifier and head noun with word sense f i and f j , respectively, occurring with SR r is calculated based on simple maximum likelihood estimation:" }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-76", "text": "The preferred SR r * for the given sense combination is that which maximizes the probability:" }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-77", "text": "(2)" }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-78", "text": "----------------------------------" }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-79", "text": "**ADDING SIMILAR WORDS**" }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-80", "text": "We extend the approach of Moldovan et al. (2004) by adding similar words as features focusing on hypernyms, hyponyms and sister words of the modifier and head noun." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-81", "text": "We accumulate the features for semantic relations based on different taxonomic relation types, from which we construct a feature vector to build a classifier over." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-82", "text": "The features of each taxonomic relation types are listed below." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-83", "text": "The first is features used in the original sense collocation method." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-84", "text": "The second, third and fourth are our experimental features added hypernyms, hyponyms and sister words respectively." 
}, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-85", "text": "N 1 is the cause of N 2 virus flu, hormone growth, inhalation death Instrument-Agency (IA) N 1 is the instrument of N 2 , N 2 uses N 1 laser printer, ax murderer, sump pump drainage Product-Producer (PP) N 1 is a product of N 2 , N 2 produces N 1 honey bee, music clock, supercomputer business Origin-Entity (OE) N 1 is the origin of N 2 bacon grease, desert storm, peanut butter Theme-Tool (TT) N 2 is intended for N 1 reorganization process, copyright law, work force Part-Whole (PW) N 1 is part of N 2 table leg, daisy flower, tree forest Content-Container (CC) N 1 is store or carried inside N 2 apple basket, wine bottle, plane carge where mod is the modifier, head is the head noun, W S mod is the WordNet synset of the modifier, W S head is the WordNet synset of the head, H i is an ith-degree ancestor (with direct hypernyms corresponding to H 1 ), O is a hyponym and S is a sister word." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-86", "text": "We include up to the 7th-degree ancestor (i.e. H 7 ) 1 in our model." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-87", "text": "Note that while a given synset has a unique hypernym in WordNet (assuming no cycles, or the ability to remove cycles by precompiling a tree structure), it can have arbitrarily many hyponyms and sister words." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-88", "text": "Here, we take the cross product of the different hyponym and sister word candidates for a given synset." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-89", "text": "We build our final classifier with TiMBL v5.16.0, a memory-based learner (Daelemans et al., 2004) ." 
}, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-90", "text": "----------------------------------" }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-91", "text": "**DATA**" }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-92", "text": "----------------------------------" }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-93", "text": "**SEMANTIC RELATION**" }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-94", "text": "The SR between a head and its modifier(s) in a NC tells us how to (default) interpret the NC." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-95", "text": "For example door knob corresponds to the PART-WHOLE relation, which means we can interpret knob as being part of a door." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-96", "text": "We sidestep the considerable challenge of developing an optimal set of semantic relation categories by using the set of SRs and data from the SEMEVAL-2007 nominal pair interpretation task ." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-97", "text": "The SRs defined for the task are: CAUSE-EFFECT(CE), CONTENT-CONTAINER(CC), INSTRUMENT-AGENCY(IA), ORIGIN-ENTITY(OE), PART-WHOLE(PW), PRODUCT-PRODUCER(PP) and THEME-TOOL(TT)." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-98", "text": "Table 1 provides a definition of each SR along with example NCs." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-99", "text": "1 We adopted 7th-degree ancester from the previous study in (Nastase, 2006) ." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-100", "text": "----------------------------------" }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-101", "text": "**DATA COLLECTION**" }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-102", "text": "From the SEMEVAL-2007 annotated data (Girju et al., 2007) , we collect two sets of data: a 2-class dataset and a 7-class dataset." 
}, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-103", "text": "The 2-class dataset is taken from the original SEMEVAL-2007 task, and comprises a set of positive and negative instances for each of the 7 SRs." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-104", "text": "The 7-class dataset is derived from this, by combining all positive NCs across the 7 SRs." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-105", "text": "The taxonomic relations are derived from WordNet3.0." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-106", "text": "In each of the two sets, we use each of hypernyms, hyponyms and sister words." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-107", "text": "Table 2 shows the number of hyponyms and sister words in each dataset." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-108", "text": "----------------------------------" }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-109", "text": "**7-WAY CLASSIFICATION EXPERIMENT**" }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-110", "text": "We ran our first experiment over the 7-class dataset." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-111", "text": "The baseline was computed using a Zero-R classifier (i.e. majority vote)." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-112", "text": "2 The performance of the original method proposed in Moldovan et al. (2004) is considered as a benchmark." 
}, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-113", "text": ".217 .496 .544 .552 .573 .562 .588 .568 .557 .197 .142 .547 .547 533 .573 .600 .606 .586 .607 .630 .467 .453 IA .507 .581 .595 .608 .649 .671 .653 .629 .645 .500 .500 PP .655 .667 .679 .691 .679 .737 .700 .690 .687 .655 .655 OE .558 .636 .623 .610 .662 .645 .662 .625 .712 .558 .558 TT .636 .697 .727 .712 .742 .766 .732 .717 .650 .515 .394 PW .634 .620 .690 .690 .629 .657 .585 .731 .630 .633 .634 CC .514 .676 .703 .689 .689 .676 .667 .647 .698 .446 .514 All .579 .632 .649 .653 .662 .679 .654 .661 .667 .541 .534 Table 4 : Results for each of the 2-way classification tasks: B = baseline, M+ = Moldovan et al. (2004) method, H i = ith-order Hypernym, O = Hyponym and S = Sister word; the best performing system is indicated in boldface and that of extended sense collocation as proposed in this paper." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-114", "text": "Table 3 shows that our method, combined with hypernyms outperforms the original sense collocation method, with the highest accuracy of .5880 achieved with 5th-degree ancestors of the head noun and modifier." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-115", "text": "This confirms that hypernyms are valuable in extending the range of sense collocation for NC interpretation." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-116", "text": "In stark contrast to the results for hypernyms, the results for hyponyms and sister words significantly reduced the accuracy." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-117", "text": "The reason for this anomaly is that hypernyms are able to generalize the sense collocation without losing key discriminative features (i.e. the hypernyms always, by definition, subsume the original semantic information), while hyponyms and sister words add many sense collocations for which we have no direct evidence (i.e. we indiscriminately specialise the semantics without any motivation)." 
}, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-118", "text": "Hence, hyponyms and sister words drastically blur the sense collocation ." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-119", "text": "The reason that the accuracy of the hypernym method drops in the result tables a certain level is that the semantic collocations start to blend in together, and lose their power of discrimination." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-120", "text": "----------------------------------" }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-121", "text": "**2-WAY CLASSIFICATION EXPERIMENT**" }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-122", "text": "Our second experiment was to run the systems over the original data from SEMEVAL-2007, in the form of a binary classifier for each of the 7 SRs." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-123", "text": "The performance of each of the 2-way classification tasks is shown in Table 4 ." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-124", "text": "As we can see in Table 4 , the basic pattern of the results is the same as for the 7-way classification task in Table 3 ." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-125", "text": "Adding hypernyms enhances performance, peaking for 4th-degree ancestors in this case at .6786." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-126", "text": "As with the 7-way classification task, hyponyms and sister words degraded performance, for the same reasons as before." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-127", "text": "Looking at the performance of each SR, we found that some SRs are more easy to interpret than others." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-128", "text": "Notably, PRODUCT-PRODUCER and THEME-TOOL were high performers, while CAUSE-EFFECT was considerably harder to classify." 
}, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-129", "text": "These trends coincide with the system results for the SEMEVAL-2007 task." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-130", "text": "analyze this effect in terms of the intrinsic semantic complexity of the different SRs, and also the relative size of the training data for each SR." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-131", "text": "These effects are also observable in the breakdown of precision and recall of each SR in Figure 1 ." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-132", "text": "As we used the same data for SEMEVAL-2007 task, we are able to directly compare the performance of our method with the competition results." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-133", "text": "The Table 5 shows the three baselines provided by the SEMEVAL-2007 organisers." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-134", "text": "We also present the best-performing system and the average performance within group B from the SEMEVAL-2007 task as our method uses only word senses." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-135", "text": "The methods for computing the baselines are described in ." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-136", "text": "All True is computed to guess \"true\" for all relations which maximizes recall." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-137", "text": "Probability is computed by randomly assigning \"true\"(or \"false\") with probability matching the distribution of \"true\"(or \"false\") in the testing dataset for the given relation." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-138", "text": "It intends to balance precision and recall." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-139", "text": "Majority is computed by assigning \"true\" or \"false\" to which ever is the majority in the testing dataset for the given relation." 
}, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-140", "text": "As shown in Table 5 : Results of 2-way classification (P=precision, R=recall, F=F-score, A=accuracy) pernyms outperformed all of three baselines." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-141", "text": "The performances using hyponyms and sister words only exceeded the baselines computed by \"all true\" and \"probability\" measure as these toxomomies tend to decrease the performances." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-142", "text": "The interesting point here is that although the method is meant for general-purpose interpretation not for binary decision task, our proposed method with hypernyms simply achieved better than the baselines and relatively performed well compared other systems in the competition (average accuracy of group B = .656 vs. our best = .679)." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-143", "text": "Therefore, we conclude that sense collocation integrated with hypernyms is a potential to extend sense collocation as wellas improve the performance for NC interpretation task." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-144", "text": "----------------------------------" }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-145", "text": "**NC RESEARCH PRIMER**" }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-146", "text": "In this section, apart from previous works related to NC interpretation task described in Section ??, we present a short description of the computational tasks and previous attempts to provide the broad comprehension of tasks related to NCs." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-147", "text": "The major tasks related to NCs involve syntactic and semantic disambiguation." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-148", "text": "Related to the task of semantic disambiguation is the task of defining what relations exist in NCs." 
}, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-149", "text": "This has gained much attention in recent decades, as well as controversy, (Downing, 1977; Levi, 1979; Finin, 1980) ." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-150", "text": "In the study conducted by Levi (1979) , it was claimed that there were 9 distinct SRs, which could be discretely defined and interpreted within NCs, while Finin (1980) claimed an unlimited number of SRs." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-151", "text": "The problems surrounding this task involve the issue of granularity versus coverage, which to date remains widely debated." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-152", "text": "Syntactic disambiguation (called bracketing) is required when NCs are composed of more than 2 components, such as in the case of computer science department, introducing the need for phrasal disambiguation (Lauer, 1995; Nakov, 2005) ." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-153", "text": "Lauer (1995) proposed probabilistic models (based on dependency and adjacency analyses of the data)." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-154", "text": "Later Nakov (2005) built upon this by adding linguistic features into these probabilistic models." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-155", "text": "Methods employed in word sense disambiguation (WSD) have also been used to enhance NC interpretation; the noun components that comprise the NCs are disambiguated using these WSD techniques (SparckJones, 1983; . carried out experiments on automatically modeling WSD and attested the usefulness of conducting analysis of the word senses in the NC in determining its SR." 
}, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-156", "text": "In the automatic interpretation of NCs, many claims have been made for the increase in performance, but these works make their own assumptions for interpretation (Barker and Szpakowicz, 1998; Moldovan et al., 2004; Kim and Baldwin, 2005; Girju, 2007; Seaghdha, 2007) ." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-157", "text": "----------------------------------" }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-158", "text": "**CONCLUSION**" }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-159", "text": "In this paper, we have investigated the impact of using different taxonomic relations to expand a sense collocation method of NC interpretation." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-160", "text": "That is, we experimented with the integration of similar terms into a sense collocation model." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-161", "text": "We added up to 7 hypernyms, direct hyponyms and direct sister words terms as features to a classifier." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-162", "text": "We ran experiments on 7-way and 2-way classification tasks using data from SEMEVAL-2007, and found that the inclusion of hypernym information significantly improved accuracy, while hyponyms and sister words degraded performance by arbitrarily overspecialising the sense information." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-163", "text": "Our study has shown that adding features broadens the scope of sense collocation." }, { "sent_id": "66392c3b6fa3744de79f056f615a75-C001-164", "text": "While intuitively all of hypernyms, hyponyms and sister words would appear to provide rich features for a sense collocation method, further research is needed to develop ways of successfully incorporating hyponyms and sister words into the NC interpretation task." 
} ], "y": { "@EXT@": { "gold_contexts": [ [ "66392c3b6fa3744de79f056f615a75-C001-2" ], [ "66392c3b6fa3744de79f056f615a75-C001-80" ] ], "cite_sentences": [ "66392c3b6fa3744de79f056f615a75-C001-2", "66392c3b6fa3744de79f056f615a75-C001-80" ] }, "@BACK@": { "gold_contexts": [ [ "66392c3b6fa3744de79f056f615a75-C001-22" ], [ "66392c3b6fa3744de79f056f615a75-C001-36" ], [ "66392c3b6fa3744de79f056f615a75-C001-38" ], [ "66392c3b6fa3744de79f056f615a75-C001-51" ], [ "66392c3b6fa3744de79f056f615a75-C001-73" ], [ "66392c3b6fa3744de79f056f615a75-C001-156" ] ], "cite_sentences": [ "66392c3b6fa3744de79f056f615a75-C001-22", "66392c3b6fa3744de79f056f615a75-C001-36", "66392c3b6fa3744de79f056f615a75-C001-38", "66392c3b6fa3744de79f056f615a75-C001-51", "66392c3b6fa3744de79f056f615a75-C001-73", "66392c3b6fa3744de79f056f615a75-C001-156" ] }, "@USE@": { "gold_contexts": [ [ "66392c3b6fa3744de79f056f615a75-C001-36", "66392c3b6fa3744de79f056f615a75-C001-37" ], [ "66392c3b6fa3744de79f056f615a75-C001-69" ], [ "66392c3b6fa3744de79f056f615a75-C001-112" ], [ "66392c3b6fa3744de79f056f615a75-C001-113" ] ], "cite_sentences": [ "66392c3b6fa3744de79f056f615a75-C001-36", "66392c3b6fa3744de79f056f615a75-C001-69", "66392c3b6fa3744de79f056f615a75-C001-112", "66392c3b6fa3744de79f056f615a75-C001-113" ] }, "@MOT@": { "gold_contexts": [ [ "66392c3b6fa3744de79f056f615a75-C001-38", "66392c3b6fa3744de79f056f615a75-C001-39" ], [ "66392c3b6fa3744de79f056f615a75-C001-61" ] ], "cite_sentences": [ "66392c3b6fa3744de79f056f615a75-C001-38", "66392c3b6fa3744de79f056f615a75-C001-61" ] } } }, "ABC_74623c8d812e3c84e7bc6b46e982f5_11": { "x": [ { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-2", "text": "Disentangling conversations mixed together in a single stream of messages is a difficult task, made harder by the lack of large manually annotated datasets." 
}, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-3", "text": "We created a new dataset of 77,563 messages manually annotated with reply-structure graphs that both disentangle conversations and define internal conversation structure." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-4", "text": "Our dataset is 16 times larger than all previously released datasets combined, the first to include adjudication of annotation disagreements, and the first to include context." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-5", "text": "We use our data to re-examine prior work, in particular, finding that 80% of conversations in a widely used dialogue corpus are either missing messages or contain extra messages." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-6", "text": "Our manually-annotated data presents an opportunity to develop robust data-driven methods for conversation disentanglement, which will help advance dialogue research." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-7", "text": "----------------------------------" }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-9", "text": "When a group of people communicate in a common channel there are often multiple conversations occurring concurrently." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-10", "text": "Often there is no explicit structure identifying conversations or their structure, such as in Internet Relay Chat (IRC), Google Hangout, and comment sections on websites." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-11", "text": "Even when structure is provided it often has limited depth, such as threads in Slack, which provide one layer of branching." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-12", "text": "In all of these cases, conversations are entangled: all messages appear together, with no indication of separate conversations." 
}, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-13", "text": "Automatic disentanglement could be used to provide more interpretable results when searching over chat logs, and to help users understand what is happening when they join a channel." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-14", "text": "Over a decade of research has considered conversation disentanglement (Shen et al., 2006) , but using datasets that are either small (2,500 messages, Elsner and Charniak, 2008) or not released (Adams and Martell, 2008) ." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-15", "text": "* jkummerf@umich.edu" }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-16", "text": "We introduce a conversation disentanglement dataset of 77,563 messages of IRC manually annotated with reply-to relations between messages." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-17", "text": "1 Our data is sampled from a technical support channel at 173 points in time between 2004 and 2018, providing a diverse set of speakers and topics, while remaining in a single domain." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-18", "text": "Our data is the first to include context, which differentiates messages that start a conversation from messages that are responding to an earlier point in time." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-19", "text": "We are also the first to adjudicate disagreements in disentanglement annotations, producing higher quality development and test sets." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-20", "text": "We also developed a simple model that is more effective than prior work, and showed that having diverse data makes it perform better and more consistently." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-21", "text": "We also analyze prior disentanglement work." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-22", "text": "In particular, a recent approach from Lowe et al. (2015 Lowe et al. 
( , 2017 ." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-23", "text": "By applying disentanglement to an enormous log of IRC messages, they developed a resource that has been widely used (over 315 citations), indicating the value of disentanglement in dialogue research." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-24", "text": "However, they lacked annotated data to evaluate the conversations produced by their method." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-25", "text": "We find that 20% of the conversations are completely right or a prefix of a true conversation; 58% are missing messages, 3% contain messages from other conversations, and 19% have both issues." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-26", "text": "As a result, systems trained on the data will not be learning from accurate humanhuman dialogues." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-27", "text": "----------------------------------" }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-28", "text": "**TASK DEFINITION**" }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-29", "text": "We consider a shared channel in which a group of people are communicating by sending messages that are visible to everyone." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-30", "text": "We label this data with a graph in which messages are nodes and edges indicate that one message is a response to another." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-31", "text": "Each connected component is a conversation." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-32", "text": "Figure 1 : #Ubuntu IRC log sample, earliest message first." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-33", "text": "Curved lines are our graph annotations of reply structure, which define two conversations shown with blue solid edges and green dashed edges." 
}, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-34", "text": "Figure 1 shows an example of two entangled conversations and their graph structure." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-35", "text": "It includes a message that receives multiple responses, when multiple people independently help BurgerMann, and the inverse, when the last message responds to multiple messages." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-36", "text": "We also see two of the users, delire and Seveas, simultaneously participating in two conversations." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-37", "text": "This multi-conversation participation is common." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-38", "text": "The example also shows two aspects of IRC we will refer to later." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-39", "text": "Directed messages, an informal practice in which a participant is named in the message." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-40", "text": "These cues are useful for understanding the discussion, but only around 48% of messages have them." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-41", "text": "System messages, which indicate actions like users entering the channel." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-42", "text": "These all start with ===, but not all messages starting with === are system messages, as shown by the second message in Figure 1 ." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-43", "text": "3 Related Work IRC Disentanglement Data: The most significant work on conversation disentanglement is a line of papers developing data and models for the #Linux IRC channel (Elsner and Charniak, 2008; Elsner and Schudy, 2009; Charniak, 2010, 2011) ." 
}, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-44", "text": "Until now, their dataset was the only publicly available set of messages with annotated conversations (partially re-annotated by Mehri and Carenini (2017) with reply-structure graphs), and has been used for training and evaluation in subsequent work (Wang and Oard, 2009; Mehri and Carenini, 2017; Jiang et al., 2018) ." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-45", "text": "We are aware of three other IRC disentanglement datasets." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-46", "text": "First, Adams and Martell (2008) studied disentanglement and topic identification, but did not release their data." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-47", "text": "Second, Riou et al. (2015) annotated conversations and discourse relations in the #Ubuntu-fr channel (French Ubuntu support)." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-48", "text": "Third, Lowe et al. (2015 Lowe et al. ( , 2017 heuristically extracted conversations from the #Ubuntu channel." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-49", "text": "2 Their work opened up a new research opportunity by providing 930,000 disentangled conversations, and has already been the basis of many papers (315 citations), particularly on developing dialogue agents." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-50", "text": "This is far beyond the size of resources previously collected, even with crowdsourcing (Lasecki et al., 2013) ." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-51", "text": "Using our data we provide the first empirical evaluation of their method." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-52", "text": "Other Disentanglement Data: IRC is not the only form of synchronous group conversation online." 
}, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-53", "text": "Other platforms with similar communication formats have been studied in settings such as classes (Wang et al., 2008; Dulceanu, 2016) , support communities (Mayfield et al., 2012) , and customer service (Du et al., 2017) ." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-54", "text": "Unfortunately, only one of these resources (Dulceanu, 2016) is available, possibly due to privacy concerns." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-55", "text": "Another stream of research has used userprovided structure to get conversation labels (Shen et al., 2006; Domeniconi et al., 2016) and replyto relations (Wang and Ros\u00e9, 2010; Wang et al., 2011a; Aumayr et al., 2011; Balali et al., 2013 Balali et al., , 2014 Chen et al., 2017a) ." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-56", "text": "By removing these labels and mixing conversations they create a disentanglement problem." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-57", "text": "While convenient, this risks introducing a bias, as people write differently when explicit structure is defined, and only a few papers have released data (Abbott et al., 2016; Zhang et al., 2017; Louis and Cohen, 2015) ." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-58", "text": "Models: Elsner and Charniak (2008) explored various message-pair feature sets and linear classifiers, combined with local and global inference methods." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-59", "text": "Their system is the only publicly released statistical model for disentanglement of chat conversation, but most of the other work cited above applied similar models." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-60", "text": "We evaluate their model on both our data and our re-annotated version of their data." 
}, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-61", "text": "Recent work has applied neural networks (Mehri and Carenini, 2017; Guo et al. (2017) 1,500 1 48 hr 5 n/a 2 Table 1 : Annotated disentanglement dataset comparison." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-62", "text": "Our data is much larger than prior work, one of the only released sets, and the only one with context and adjudication. '+a' indicates there was an adjudication step to resolve disagreements. '?' indicates the value is not in the paper and the authors no longer have access to the data." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-63", "text": "----------------------------------" }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-64", "text": "**2018), WITH SLIGHT GAINS IN PERFORMANCE.**" }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-65", "text": "Graph Structure: Within a conversation, we define a graph of reply-to relations." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-66", "text": "Almost all prior work with annotated graph structures has been for threaded web forums (Schuth et al., 2007; Kim et al., 2010; Wang et al., 2011b) , which do not exhibit the disentanglement problem we explore." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-67", "text": "Studies that do consider graphs for disentanglement have used small datasets (Dulceanu, 2016; Mehri and Carenini, 2017) that are not always released (Wang et al., 2008; Guo et al., 2017) ." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-68", "text": "----------------------------------" }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-69", "text": "**DATA**" }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-70", "text": "We introduce a manually annotated dataset of 77,563 messages: 74,963 from the #Ubuntu IRC channel, 3 and 2,600 messages from the #Linux IRC channel." 
}, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-71", "text": "4 Annotating the #Linux data enables comparison with Elsner and Charniak (2008) , while the #Ubuntu channel has over 34 million messages, making it an interesting largescale resource for dialogue research." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-72", "text": "It also allows us to evaluate Lowe et al. (2015 Lowe et al. ( , 2017 's widely used heuristically disentangled conversations." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-73", "text": "When choosing samples we had to strike a balance between the number of samples and the size of each one." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-74", "text": "We sampled the training set in three ways: (1) 95 uniform length samples, (2) 10 smaller samples to check annotator agreement, and (3) 48 time spans of one hour that are diverse in terms of the number of messages, the number of participants, and what percentage of messages are directed." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-75", "text": "For additional details of the data selection process, see the supplementary material." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-76", "text": "----------------------------------" }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-77", "text": "**DATASET COMPARISON**" }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-78", "text": "Table 1 presents properties of our data and prior work on disentanglement in real-time chat." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-79", "text": "Availability: Only one other dataset, annotated twice, has been publicly released, and two others were shared when we contacted the authors." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-80", "text": "Scale: Our dataset is 31 times larger than almost any other dataset, the exception being one that was not released." 
}, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-81", "text": "As well as being larger, our data is also based on many different points in time." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-82", "text": "This is crucial because a single sample presents a biased view of the task." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-83", "text": "Having multiple samples also means our training and evaluation sets are from different points in time, preventing overfitting to specific users or topics of conversation." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-84", "text": "Context: We are the first to consider the fact that IRC data is sampled from a continuous stream and the context prior to the sample is important." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-85", "text": "In prior work, a message with no antecedent could either be the start of a conversation or a response to a message that occurs prior to the sample." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-86", "text": "Adjudication: Our labeling method is similar to prior work, but we are the first to perform adjudication of annotations." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-87", "text": "While some cases were ambiguous, often one option was clearly incorrect." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-88", "text": "By performing adjudication we can reduce these errors, creating high quality sets." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-89", "text": "----------------------------------" }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-90", "text": "**METHODOLOGY**" }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-91", "text": "Guidelines: We developed annotation guidelines through three rounds of pilot annotations in which annotators labeled a set of messages and discussed all disagreements." 
}, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-92", "text": "We instructed annotators to link each message to the one or more messages it is a response to." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-93", "text": "If a message started a new conversation it was linked to itself." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-94", "text": "We also described a series of subtle cases, using one to three examples to tease out differences." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-95", "text": "These included when a question is repeated, when a user responds multiple times, interjections, etc." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-96", "text": "For our full guidelines, see the supplementary material." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-97", "text": "All annotations were performed using SLATE (Kummerfeld, 2019), a custom-built tool with features designed specifically for this task." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-98", "text": "5 Adjudication: Table 1 shows the number of annotators for each subset of our data." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-99", "text": "For the development, test, out-of-domain data, and a small set of the training data, we labeled each sample multiple times and then resolved all disagreements in an adjudication step." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-100", "text": "During adjudication, there was no indication of who had given which annotation, and there was the option to choose a different annotation entirely." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-101", "text": "In order to maximize the volume annotated, we did not perform adjudication for most of the training data." 
}, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-102", "text": "Also, the 18,924 training message set initially only had 100 messages of context per sample, and we later added another 900 lines and checked every message that was not a reply to see if it was a response to something in the additional context." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-103", "text": "Annotators: The annotators were all fluent English speakers with a background in computer science (necessary to understand the technical content): a postdoc, a master's student, and three CS undergraduates." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-104", "text": "All adjudication was performed by the postdoc, who is a native English speaker." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-105", "text": "Time: Annotations took between 7 and 11 seconds per message depending on the complexity of the discussion, and adjudication took 5 seconds 5 https://jkk.name/slate" }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-106", "text": "----------------------------------" }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-107", "text": "**ANNOTATION QUALITY**" }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-108", "text": "Our annotations define two levels of structure: (1) links between pairs of messages, and (2) sets of messages, where each set is one conversation." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-109", "text": "Annotators label (1), from which (2) can be inferred." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-110", "text": "Table 2 presents inter-annotator agreement measures for both cases." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-111", "text": "These are measured in the standard manner, by comparing the labels from different annotators on the same data." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-112", "text": "We also include measurements for annotations in prior work." 
}, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-113", "text": "Figure 2 shows ambiguous examples from our data to provide some intuition for the source of disagreements." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-114", "text": "In both examples the disagreement involves one link, but the conversation structure in the second case is substantially changed." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-115", "text": "Some disagreements in our data are mistakes, where one annotation is clearly incorrect, and some are ambiguous cases, such as these." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-116", "text": "In Channel Two, we also see mistakes and ambiguous cases, including a particularly long discussion about a user's financial difficulties that could be divided in multiple ways (also noted by Elsner and Charniak (2008) )." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-117", "text": "Graphs: We measure agreement on the graph structure annotation using Cohen (1960) 's \u03ba." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-118", "text": "This measure of inter-rater reliability corrects for chance agreement, accounting for the class imbalance between linked and not-linked pairs." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-119", "text": "Values are in the good agreement range proposed by Altman (1990) , and slightly higher than for Mehri and Carenini (2017)'s annotations." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-120", "text": "Results are not shown for Elsner and Charniak (2008) because they did not annotate graphs." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-121", "text": "----------------------------------" }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-122", "text": "**CONVERSATIONS:**" }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-123", "text": "We consider three metrics: 6 (1) Variation of Information (VI, Meila, 2007) ." 
}, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-124", "text": "A measure of information gained or lost when going from one clustering to another." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-125", "text": "It is the sum of conditional entropies H(Y |X) + H(X|Y ), where X and Y are clusterings of the same set of items." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-126", "text": "We consider a scaled version, using the bound for n items that VI(X; Y ) \u2264 log(n), and present 1\u2212VI so that larger values are better." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-127", "text": "(2) One-to-One Overlap (1-1, Elsner and Charniak, 2008) ." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-128", "text": "Percentage overlap when conversations from two annotations are optimally paired up using the max-flow algorithm." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-129", "text": "We follow Mehri and Carenini (2017) and keep system messages." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-130", "text": "(3) Exact Match F 1 ." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-131", "text": "Calculated using the number of perfectly matching conversations, excluding conversations with only one message (mostly system messages)." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-132", "text": "This is an extremely challenging metric." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-133", "text": "We include it because it is easy to understand and it directly measures a desired value (perfectly extracted conversations)." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-134", "text": "Our scores are higher in 4 cases and lower in 5." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-135", "text": "Interestingly, while \u03ba was higher for us than Mehri and Carenini (2017) , our scores for conversations are lower." 
}, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-136", "text": "This is possible because a single link can merge two conversations, meaning a single disagreement in links can cause a major difference in conversations." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-137", "text": "This may reflect the fact that our annotation guide was developed for the Ubuntu channel, which differs in conversation style from the Channel Two data." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-138", "text": "Manually comparing the annotations, there was no clear differences in the types of disagreements." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-139", "text": "Agreement is lower on the Channel Two data, particularly on its test set." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-140", "text": "From this we conclude that there is substantial variation in the difficulty of conversation disentanglement across datasets." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-141", "text": "7" }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-142", "text": "----------------------------------" }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-143", "text": "**EVALUATING DISENTANGLEMENT QUALITY**" }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-144", "text": "In this section, we propose new simple disentanglement models that perform better than prior methods, and re-examine prior work." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-145", "text": "The models we consider are: Previous: Each message is linked to the most recent non-system message before it." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-146", "text": "6 Metrics such as Cohen's \u03ba and Krippendorff's \u03b1 are not applicable to conversations because there is no clear mapping from one set of conversations to another." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-147", "text": "7 Riou et al. 
(2015) also observe this, noting that their French IRC data is less entangled than Elsner's, making it possible to achieve an agreement level of 0.95." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-148", "text": "(Pennington et al., 2014) ." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-149", "text": "Union: Run 10 FF models trained with different random seeds and combine their output by keeping all edges predicted." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-150", "text": "Vote: Run 10 FF models and combine output by keeping the edges they all agree on." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-151", "text": "Link messages with no agreed antecedent to themselves." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-152", "text": "Intersect: Conversations that 10 FF models agree on, and other messages as singleton conversations." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-153", "text": "For Channel Two we also compare to Wang and Oard (2009) and Mehri and Carenini (2017) , but their code was unavailable, preventing evaluation on our data." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-154", "text": "We exclude Jiang et al. (2018) as they substantially modified the dataset." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-155", "text": "For details of models, including hyperparameters tuned on the development set, see the supplementary material." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-156", "text": "Table 5 : Performance with different training conditions on the Ubuntu test set." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-157", "text": "For Graph-F, * indicates a significant difference at the 0.01 level compared to Standard." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-158", "text": "Results are averages over 10 runs, varying the data and random seeds." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-159", "text": "The standard deviation is shown in parentheses." 
}, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-160", "text": "----------------------------------" }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-161", "text": "**GRAPH CONVERSATION**" }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-187", "text": "In both cases, the heuristic from Lowe et al. (2015 Lowe et al. ( , 2017 performs poorly." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-188", "text": "We suspect our model trained only on Channel Two data is overfitting, as the graph F-score on the training data is 94, whereas on the Ubuntu data it is 80." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-189", "text": "Data Comparison: Comparing the same models in the top and bottom section, scores are consistently higher for our annotations, except for the Lowe et al. (2015 Lowe et al. ( , 2017 heuristic." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-190", "text": "Comparing the annotations, we find that their annotators identified between 250 and 328 conversations (mean 281), while we identify 257." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-191", "text": "Beyond this difference it is hard to identify consistent variations in the annotations." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-192", "text": "Another difference is the nature of the evaluation." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-193", "text": "On Elsner's data, evaluation is performed by measuring relative to each annotators labels and averaging the scores." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-194", "text": "On our data, we adjudicated the annotations, providing a single gold standard." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-162", "text": "----------------------------------" }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-163", "text": "**RESULTS**" }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-164", "text": "Graphs: Table 3 presents precision, recall, and F-score over links." 
}, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-165", "text": "Our models perform much better than the baseline." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-166", "text": "As we would expect, vote has higher precision, while union has higher recall." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-167", "text": "Vote has higher recall than a single feedforward model because it identifies more of the selflink cases (its default when there is no agreement)." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-168", "text": "Conversations: Table 4 presents results on the metrics defined in Section 4.3." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-169", "text": "There are three regions of performance." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-170", "text": "First, the baseline has consistently low scores since it forms a single conversation containing all messages." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-171", "text": "Second, Elsner and Charniak (2008) and Lowe et al. (2017) per-form similarly, with one doing better on VI and the other on 1-1, though Elsner and Charniak (2008) do consistently better across the exact conversation extraction metrics." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-172", "text": "Third, our methods do best, with x10 vote best in all cases except precision, where the intersect approach is much better." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-173", "text": "Dataset Variations: Table 5 shows results for the feedforward model with several modifications to the training set, designed to test corpus design decisions." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-174", "text": "Removing context does not substantially impact results." 
}, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-175", "text": "Decreasing the data size to match Elsner and Charniak (2008) 's training set leads to worse results, both if the sentences are from diverse contexts (3rd row), and if they are from just two contexts (bottom row)." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-176", "text": "We also see a substantial increase in the standard deviation when only two samples are used, indicating that performance is not robust when the data is not widely sampled." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-177", "text": "----------------------------------" }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-178", "text": "**CHANNEL TWO RESULTS**" }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-179", "text": "For channel Two, we consider two annotations of the same underlying text: ours and Elsner and Charniak (2008)'s." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-180", "text": "To compare with prior work, we use the metrics defined by Shen et al. (2006, Shen) and Elsner and Charniak (2008, Loc) ." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-181", "text": "8 We do not use these for our data as they have been superseded by more rigorously studied metrics (VI for Shen) or make strong assumptions about the data (Loc)." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-182", "text": "We do not evaluate on graphs because Elsner and Charniak (2008) 's annotations do not include them." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-183", "text": "This also prevents us from training our method on their data." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-184", "text": "Model Comparison: For Elsner's annotations (top section of Table 6 ), their approach remains the most effective with just Channel Two data." 
}, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-185", "text": "However, training on our Ubuntu data, treating Channel Two as an out-of-domain sample, yields substantially higher performance on two metrics and comparable performance on the third." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-186", "text": "On our annotations (bottom section), we see the same trend." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-195", "text": "Evaluating our Channel-Two-trained Feedforward model on our two preadjudication annotations and averaging scores, the results are lower by 3.1, 1.8, and 4.3 on 1-1, Loc and Shen respectively." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-196", "text": "This suggests that our adjudication process removes annotator mistakes that introduce noise into the evaluation." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-197", "text": "Lowe et al. (2015 Lowe et al. ( , 2017 The previous section showed that only 10.8% of the conversations extracted by the heuristic in Lowe et al. (2015 Lowe et al. ( , 2017 are correct (P in Table 4 )." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-198", "text": "We focus on precision because the primary use of their method has been to extract conversations to train and test dialogue systems, which will be impacted by errors in the conversations." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-199", "text": "Recall errors (measuring missed conversations) are not as serious a problem because the Ubuntu chat logs are so large that even with low recall a large number of conversations will still be extracted." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-200", "text": "Additional Metrics: First, we must check this is Figure 3 : An example conversation extracted by the heuristic from Lowe et al. (2015 Lowe et al. ( , 2017 with the messages it misses and the one it incorrectly includes." 
}, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-201", "text": "not an artifact of our test set." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-202", "text": "On our development set, P, R, and F are slightly higher (11.6, 8.1 and 9.5), but VI and 1-1 are slightly lower (80.0 and 51.7)." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-203", "text": "We can also measure performance as the distribution of scores over all of the samples we annotated." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-204", "text": "The average precision was 10, and varied from 0 to 50, with 19% of cases at 0 and 95% below 23." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-205", "text": "To avoid the possibility that we made a mistake running their code, we also considered evaluating their released conversations." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-206", "text": "On the data that overlapped with our annotations, the precision was 9%." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-207", "text": "These results indicate that the test set performance is not an aberration: the heuristic's results are consistently low, with only about 10% of output conversations completely right." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-208", "text": "----------------------------------" }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-209", "text": "**EVALUATING**" }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-210", "text": "Error Types: Figure 3 shows an example heuristic output with several types of errors." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-211", "text": "The initial question was missed, as was the final resolution, and in the middle there is a message from a separate conversation." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-212", "text": "67% of conversations were a subset of a true conversation (ie., only missed messages), and 3% were a superset of a true conversation (ie., only had extra messages)." 
}, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-213", "text": "The subset cases were missing 1-187 messages (missing 56% of the conversation on average) and the superset cases had 1-3 extra messages (an extra 31% of the conversation on average)." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-214", "text": "The first message is particularly important because it is usually the question being resolved." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-215", "text": "In 47% of cases the first message is not the true start of a conversation." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-216", "text": "It is important to note that the dialogue task the conversations were intended for only uses a prefix of each conversation." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-217", "text": "For this purpose, missing the end of a conversation is not a problem." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-218", "text": "In 9% of cases, the conversation is a true prefix of a gold conversation." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-244", "text": "Table 7 show results when varying the training and test datasets." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-245", "text": "Training and testing on the same dataset leads to higher performance than training on one and testing on the other." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-246", "text": "This is true even though the heuristic data contains nine times as many training conversations." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-247", "text": "This is evidence that our conversations are fundamentally different despite being derived from the same resource and filtered in the same way." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-248", "text": "This indicates that our changes lead to quantitatively different downstream models." 
}, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-249", "text": "Fortunately, the relative performance of the two models remains consistent across the two datasets." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-250", "text": "----------------------------------" }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-251", "text": "**RE-EXAMINING DISENTANGLEMENT RESEARCH**" }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-252", "text": "Using our data we also investigate other assumptions made in prior work." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-253", "text": "The scale of our data provides a more robust test of these ideas." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-254", "text": "Number of samples: Table 1 shows that all prior work with available data has considered a small number of samples." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-255", "text": "In Table 5 , we saw that training on less diverse data samples led to models that performed worse and with higher variance." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-256", "text": "We can also investigate this by looking at performance on the different samples in our test set." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-257", "text": "The difficulty of samples varies considerably, with the F-score of our model varying from 11 to 40 and annotator agreement scores before adjudication varying from 0.65 to 0.78." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-258", "text": "The model performance and agreement levels are also strongly correlated, with a Spearman's rank correlation of 0.77." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-259", "text": "This demonstrates the importance of evaluating on data from more than one point in time to get a robust estimate of performance." 
}, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-260", "text": "How far apart consecutive messages in a conversation are: Elsner and Charniak (2008) and Mehri and Carenini (2017) use a limit of 129 seconds, Jiang et al. (2018) limit to within 1 hour, Guo et al. (2017) limit to within 8 messages, and we limit to within 100 messages." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-261", "text": "Figure 4 shows the distribution of time differences in our conversations." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-262", "text": "94.9% are within 2 minutes, and almost all are within an hour." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-263", "text": "88.3% are 8 messages or less apart, and 99.4% are 100 or less apart." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-264", "text": "This suggests that the lower limits in prior work are too low." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-265", "text": "However, in Channel Two, 98% of messages are within 2 minutes, suggesting this property is channel and sample dependent." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-266", "text": "Concurrent conversations: Adams and Martell (2008) forced annotators to label at most 3 conversations, while Jiang et al. (2018) remove conversations to ensure there are no more than 10 at once." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-267", "text": "We find there are 3 or fewer 46.4% of the time and 10 or fewer 97.3% of the time (where time is in terms of messages, not minutes, and we ignore system messages), Presumably the annotators in Adams and Martell (2008) would have proposed changes if the 3 conversation limit was problematic, suggesting that their data is less entangled than ours." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-268", "text": "Conversation and message length: Adams and Martell (2008) annotate blocks of 200 messages." 
}, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-269", "text": "If such a limit applied to our data, 13.7% of conversations would not finish before the cutoff point." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-270", "text": "This suggests that their conversations are typi-cally shorter, which is consistent with the previous conclusion that their conversations are less entangled." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-271", "text": "Jiang et al. (2018) remove conversations with fewer than 10 messages, describing them as outliers, and remove messages shorter than 5 words, arguing that they were not part of real conversations." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-272", "text": "Not counting conversations with only system messages, 83.4% of our conversations have fewer than 10 messages, 40.8% of which have multiple authors." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-273", "text": "88.5% of messages with less than 5 words are in conversations with more than one author." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-274", "text": "These values suggest that these messages and conversations are real and not outliers." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-275", "text": "Overall: This analysis indicates that working from a small number of samples can lead to major bias in system design for disentanglement." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-276", "text": "There is substantial variation across channels, and across time within a single channel." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-277", "text": "----------------------------------" }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-278", "text": "**CONCLUSION**" }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-279", "text": "Conversation disentanglement has been understudied because of a lack of public, annotated datasets." 
}, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-280", "text": "We introduce a new corpus that is larger and more diverse than any prior corpus, and the first to include context and adjudicated annotations." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-281", "text": "Using our data, we perform the first empirical analysis of Lowe et al. (2015 Lowe et al. ( , 2017 's widely used data, finding that only 20% of the conversations their method produces are true prefixes of conversations." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-282", "text": "The models we develop have already enabled new directions in dialogue research, providing disentangled conversations for DSTC 7 track 1 (Gunasekara et al., 2019; Yoshino et al., 2018) and will be used in DSTC 8." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-283", "text": "We also show that diversity is particularly important for the development of robust models." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-284", "text": "This work fills a key gap that has limited research, providing a new opportunity for understanding synchronous multiparty conversation online." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-219", "text": "Combined with the exact match cases, that means 20% of the conversations are accurate as used in the next utterance selection task." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-220", "text": "A further 9% of cases are a continuous chunk of a conversation, but missing one or more messages at the start." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-221", "text": "Long Distance Links: One issue we observed is that conversations often spanned days." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-222", "text": "We manually inspected a random sample: 20 conversations 12 to 24 hours long, and 20 longer than 24 hours." 
}, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-223", "text": "All of the longer conversations and 17 of the shorter ones were clearly incorrect." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-224", "text": "9 This issue is not measured in the analysis above because our samples do not span days (they are 5.5 hours long on average when including context)." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-225", "text": "The original work notes this issue, but claims that it is rare." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-226", "text": "We measured the time between consecutive messages in conversations and plot the frequency of each value in Figure 4 ." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-227", "text": "10 The figure indicates that the conversations often extend over days, or even more than a month apart (note the point in the topright corner)." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-228", "text": "In contrast, our annotations rarely contain links beyond an hour, and the output of our model rarely contains links longer than 2 hours." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-229", "text": "Causes: To investigate possible reasons for these issues, we measured several properties of our data to test assumptions in the heuristic." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-230", "text": "First, the heuristic assumes if all directed messages from a user are in one conversation, all undirected messages from the user are in the same conversation." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-231", "text": "9 The exceptions were two cases where a user thanked another user for their help the previous day, and one case where a user asked if another user ended up resolving their question." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-232", "text": "10 In 68,002 conversations there was a negative time difference because a message was out of order." 
}, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-233", "text": "To resolve this, we sorted the messages in each conversation by timestamp." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-234", "text": "We find this is true 52.2% of the time." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-235", "text": "Second, it assumes that it is rare for two people to respond to an initial question." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-236", "text": "In our data, of the messages that start a conversation and receive a response, 37.7% receive multiple responses." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-237", "text": "Third, that a directed message can start a conversation, which we find in 6.8% of cases." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-238", "text": "Fourth, that the first response to a question is within 3 minutes, which we find is true in 94.8% of conversations." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-239", "text": "Overall, these assumptions have mixed support from our data, which may be why the heuristic produces so few accurate conversations." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-240", "text": "Dialogue Modeling: Most of the work building on Lowe et al. (2017) uses the conversations to train and evaluate dialogue systems." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-241", "text": "To see the impact on downstream work, we constructed a next utterance selection task as described in their work, disentangling the entire #Ubuntu logs with our feedforward model." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-242", "text": "We tried two dialogue models: a dual-encoder (Lowe et al., 2017) , and Enhanced Long Short-Term Memory (Chen et al., 2017b) ." }, { "sent_id": "74623c8d812e3c84e7bc6b46e982f5-C001-243", "text": "For full details of the task and model hyperparameters, see the supplementary material." 
} ], "y": { "@MOT@": { "gold_contexts": [ [ "74623c8d812e3c84e7bc6b46e982f5-C001-14" ] ], "cite_sentences": [ "74623c8d812e3c84e7bc6b46e982f5-C001-14" ] }, "@BACK@": { "gold_contexts": [ [ "74623c8d812e3c84e7bc6b46e982f5-C001-14" ], [ "74623c8d812e3c84e7bc6b46e982f5-C001-43" ], [ "74623c8d812e3c84e7bc6b46e982f5-C001-58" ] ], "cite_sentences": [ "74623c8d812e3c84e7bc6b46e982f5-C001-14", "74623c8d812e3c84e7bc6b46e982f5-C001-43", "74623c8d812e3c84e7bc6b46e982f5-C001-58" ] }, "@USE@": { "gold_contexts": [ [ "74623c8d812e3c84e7bc6b46e982f5-C001-43", "74623c8d812e3c84e7bc6b46e982f5-C001-51" ], [ "74623c8d812e3c84e7bc6b46e982f5-C001-58", "74623c8d812e3c84e7bc6b46e982f5-C001-60" ], [ "74623c8d812e3c84e7bc6b46e982f5-C001-71" ], [ "74623c8d812e3c84e7bc6b46e982f5-C001-123", "74623c8d812e3c84e7bc6b46e982f5-C001-127" ], [ "74623c8d812e3c84e7bc6b46e982f5-C001-171" ], [ "74623c8d812e3c84e7bc6b46e982f5-C001-175" ], [ "74623c8d812e3c84e7bc6b46e982f5-C001-179" ], [ "74623c8d812e3c84e7bc6b46e982f5-C001-182" ] ], "cite_sentences": [ "74623c8d812e3c84e7bc6b46e982f5-C001-43", "74623c8d812e3c84e7bc6b46e982f5-C001-58", "74623c8d812e3c84e7bc6b46e982f5-C001-71", "74623c8d812e3c84e7bc6b46e982f5-C001-127", "74623c8d812e3c84e7bc6b46e982f5-C001-171", "74623c8d812e3c84e7bc6b46e982f5-C001-175", "74623c8d812e3c84e7bc6b46e982f5-C001-179", "74623c8d812e3c84e7bc6b46e982f5-C001-182" ] }, "@SIM@": { "gold_contexts": [ [ "74623c8d812e3c84e7bc6b46e982f5-C001-116" ] ], "cite_sentences": [ "74623c8d812e3c84e7bc6b46e982f5-C001-116" ] }, "@DIF@": { "gold_contexts": [ [ "74623c8d812e3c84e7bc6b46e982f5-C001-120" ], [ "74623c8d812e3c84e7bc6b46e982f5-C001-175" ], [ "74623c8d812e3c84e7bc6b46e982f5-C001-260" ] ], "cite_sentences": [ "74623c8d812e3c84e7bc6b46e982f5-C001-120", "74623c8d812e3c84e7bc6b46e982f5-C001-175", "74623c8d812e3c84e7bc6b46e982f5-C001-260" ] } } }, "ABC_e0df566d073649431c3454a52813e9_11": { "x": [ { "sent_id": "e0df566d073649431c3454a52813e9-C001-114", "text": "**RESULTS**" }, { "sent_id": 
"e0df566d073649431c3454a52813e9-C001-139", "text": "----------------------------------" }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-140", "text": "**PER-CONNECTIVE ANALYSIS**" }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-2", "text": "Abstract." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-3", "text": "Discourse connectives (e.g. however, because) are terms that can explicitly convey a discourse relation within a text." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-4", "text": "While discourse connectives have been shown to be an effective clue to automatically identify discourse relations, they are not always used to convey such relations, thus they should first be disambiguated between discourse-usage and non-discourse-usage." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-5", "text": "In this paper, we investigate the applicability of features proposed for the disambiguation of English discourse connectives for French." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-6", "text": "Our results with the French Discourse Treebank (FDTB) show that syntactic and lexical features developed for English texts are as effective for French and allow the disambiguation of French discourse connectives with an accuracy of 94.2%." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-7", "text": "----------------------------------" }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-9", "text": "Discourse connectives are often used to signal discourse relations between two textual units." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-10", "text": "For example, in (1) the discourse connective 'ainsi' conveys a result relation between the two textural units marked in italic and bold." 
}, { "sent_id": "e0df566d073649431c3454a52813e9-C001-11", "text": "(1) L'\u00e9lan r\u00e9formateur, lanc\u00e9 depuis Moscou en 1985, revient, tel un boomerang, vers l'URSS." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-12", "text": "La Lituanie, la Lettonie et l'Estonie s'ouvrent ainsi au multipartisme." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-13", "text": "1 The reformative push, initiated from Moscow in 1985, comes back like a boomerang towards the USSR." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-14", "text": "Lithuania, Latvia and Estonia thus open themselves to the multiparty system." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-15", "text": "2 However, discourse connectives do not always mark the presence of discourse relations." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-16", "text": "For example, while the word 'et' is not a discourse connective in (1) , it signals a continuation relation in (2)." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-17", "text": "(2) La f\u00e9d\u00e9ration CGT des transports s'est \u00e9lev\u00e9e contre \"l'absence de concertation\" et estime que les salari\u00e9s \"n'ont rien de bon \u00e0 attendre de cette restructuration\"." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-18", "text": "1 The CGT transport federation have risen against \"the lack of consultation\" and consider that employees have \"nothing positive to expect from this restructuring.\" 2 While studies have shown that discourse usage of discourse connectives can be accurately identified for English [13, 20] , only a few studies have focused on the disambiguation of discourse connectives in other languages." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-19", "text": "In this paper, we investigate the usefulness of features proposed in the literature for the disambiguation of English discourse connectives for French discourse connectives." 
}, { "sent_id": "e0df566d073649431c3454a52813e9-C001-20", "text": "3 This paper is organized as follow: Section 2 reviews related work." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-21", "text": "Section 3 describes our approach to disambiguate French discourse connectives." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-22", "text": "Section 4 reports our results, and finally Section 5 presents our conclusions and future work." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-23", "text": "----------------------------------" }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-24", "text": "**RELATED WORK**" }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-25", "text": "With respect to discourse organization, discourse connectives constitute the most basic way of signaling the speaker's or writer's intentions." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-26", "text": "They provide an important clue to disambiguate discourse relations whose interpretations would be opaque without them [5, 8, [16] [17] [18] ." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-27", "text": "Hence, lexicons of discourse connectives associated with the relations that they express can be very useful for discourse studies (e.g. developing discourse annotated corpora [2, 7, 21, 22] , automatic discourse analysis [13, 25] , etc.) and have been developed for English [10] , Spanish [3] , German [24] and French [6] ." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-28", "text": "Discourse connectives can be ambiguous at two levels: first, they can be used in discourse-usage or non-discourseusage, and second, they may be used to signal more than one discourse relation." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-29", "text": "To automatically disambiguate discourse connectives, discourse annotated corpora such as the Penn Discourse Treebank (PDTB) [22] are instrumented." 
}, { "sent_id": "e0df566d073649431c3454a52813e9-C001-30", "text": "The PDTB is the largest corpus of discourse annotated texts." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-31", "text": "It contains articles from the Wall Street Journal, where discourse connectives that are used in discourse-usage have been annotated by the discourse relation that they signal." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-32", "text": "The same approach has been used in the French Discourse Treebank (FDTB) [7] , however to date, only discourse-usage and non-discourse-usage of French discourse connectives have been annotated in the FDTB." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-33", "text": "Most of previous work on the disambiguation of discourse connectives have focused on English discourse connectives [13, 14, 20] ." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-34", "text": "One of earliest and pioneer work on the disambiguation of discourse connectives, Pitler and Nenkova [20] , showed that four syntactic features (see Section 3.4 for details about the features) and the connective itself can disambiguate discourse connectives with an accuracy of 95.04% within the PDTB [22] ." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-35", "text": "Their approach used these features not only to disambiguate discourse connectives between discourse-usage and non-discourse-usage, but also to tag the discourse relation signalled by the discourse connectives." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-36", "text": "Later, Lin et al. [13] used the context of the connective (i.e. the previous and the following word of the connective) and added seven lexico-syntactic features to the feature set proposed by Pitler and Nenkova [20] ." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-37", "text": "In doing so, Lin et al. achieved an accuracy of 97.34% for the disambiguation of discourse connectives in the PDTB." 
}, { "sent_id": "e0df566d073649431c3454a52813e9-C001-38", "text": "On the other hand, the disambiguation of discourse connectives in languages other than English has received much less attention." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-39", "text": "Due to syntactic differences across languages and different discourse annotation methodologies, the techniques developed for one language may not be as effective in another." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-40", "text": "For example, English discourse connectives include mostly subordinating conjunctions (e.g. when) or coordinating conjunctions (e.g. but)." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-41", "text": "In addition, only a few connectives are disjoint (e.g. On the one hand ... On the other hand)." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-42", "text": "This is not the case for Chinese which uses many more disjoint connectives [26] ." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-43", "text": "Inspired by Pitler and Nenkova [20] , Alsaif and Markert [4] proposed an approach for the disambiguation of Arabic Discourse connectives." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-44", "text": "Alsaif and Markert have shown that the features proposed by Pitler and Nenkova [20] work well for Arabic with an accuracy of 91.2%." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-45", "text": "Moreover, they further improved the result of their system by considering Arabic-specific morphological features and achieved an accuracy of 92.4%." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-46", "text": "Today, due to the availability of discourse annotated corpora such as the French Discourse Treebank [6] , it is possible to analyse how the features developed for English behave when applied to French." 
}, { "sent_id": "e0df566d073649431c3454a52813e9-C001-47", "text": "----------------------------------" }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-48", "text": "**CORPUS**" }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-49", "text": "To evaluate the disambiguation of French discourse connectives, we used the French Discourse Treebank (FDTB) [6] which constitutes the largest publicly available discourse annotated corpus for French." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-50", "text": "The corpus contains the annotation of more than 10K connectives used in discourse usage in the French Treebank corpus [1] ." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-51", "text": "The FDTB uses the French discourse connectives of the LEXCONN resource 4 [23] , a lexicon of 328 French discourse connectives (e.g. 'alors que') and their morphological variations (e.g. 'alors qu' ')." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-52", "text": "Out of 328 discourse connectives listed in LEXCONN, 88 connectives appeared in the FDTB." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-53", "text": "For training and testing, we used the annotated discourse connectives of the FDTB as positive instances and all other occurrences of the connectives that were not annotated were used as negative instances." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-54", "text": "To compare the size of the FDTB dataset, we constructed a similar dataset from the PDTB." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-55", "text": "Table 1 shows the size of the datasets extracted from both the FDTB and the PDTB." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-56", "text": "As Table 1 shows, while the FDTB is smaller than the PDTB (10K instances of connectives in the FDTB vs. 14K instances in the PDTB), more types of discourse connectives are annotated in the FDTB (see Table 2 )." 
}, { "sent_id": "e0df566d073649431c3454a52813e9-C001-57", "text": "The FDTB contains the annotations of 372 different discourse connectives while the PDTB contains the annotations of 100 different discourse connectives." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-58", "text": "Table 2 shows the distribution of the discourse connectives in both corpora along with their frequency." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-59", "text": "Not surprisingly, 61% (= 25% + 36%) of the French discourse connectives appear less than 10 times." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-60", "text": "This constitutes a large portion of French discourse connectives if we compare this number to its English counterpart in the PDTB (i.e. 18%)." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-61", "text": "This entails that it will be more difficult to learn an accurate model for the disambiguation of such low frequent discourse connectives." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-62", "text": "----------------------------------" }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-63", "text": "**ENTROPY OF FRENCH DISCOURSE CONNECTIVES**" }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-64", "text": "To estimate the difficulty of the task for French compared to English, we compared the ambiguity of discourse connectives in the two languages by calculating the entropy of each discourse connective." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-65", "text": "Table 3 shows the top three most ambiguous and the top three least ambiguous discourse connectives (based on entropy) in the FDTB and the PDTB 5 ." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-66", "text": "As Table 3 shows, the French discourse connectives 'effectivement' and 'sinon' are used as often in a discourse/non-discourse context (yielding an entropy of 1.0)." 
}, { "sent_id": "e0df566d073649431c3454a52813e9-C001-67", "text": "On the other hand, in English 'on the other hand', 'particularly' and 'upon' are the least ambiguous (entropy = 0.0) as they are always used to signal a discourse relation." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-68", "text": "Table 3 also shows the weighted average entropy of discourse connectives for each language." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-69", "text": "The entropy of French discourse connectives is 0.39 while the entropy of English discourse connectives is 0.51." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-70", "text": "This seems to indicate that the disambiguation of French discourse connectives can be considered a slightly easier task than the disambiguation of English discourse connectives." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-71", "text": "To make a more detailed comparison, it would be preferable to align French and English discourse connectives with the same meaning and then compare the entropy of the mapped discourse connectives." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-72", "text": "Unfortunately, discourse connectives are language specific and cannot be easily aligned." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-73", "text": "For example, the accuracy of alignments achieved from statistical word-alignment models are very low for discourse connectives [12] ." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-74", "text": "To the best of our knowledge, a cross-lingual alignment of discourse connectives is available only for casual discourse connectives [27] ." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-75", "text": "Zufferey and Cartoni [27] manually aligned a few hundred occurrences of a discourse connective with their translation over the Europarl [11] parallel texts." 
}, { "sent_id": "e0df566d073649431c3454a52813e9-C001-76", "text": "Then, they created an English-French dictionary for discourse connectives based on the similarities and discrepancies between the discourse connectives and their most appropriate translation." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-77", "text": "Table 4 shows the entropy of French and English discourse connectives that signal the Cause relation with their most likely translations 6 as identified by Zufferey and Cartoni [27] ." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-78", "text": "As Table 4 shows, there does not seem to be a direct rela- 6 Note that some translations of discourse connectives such as '\u00e9tant donn\u00e9 que' are not considered discourse connectives in the FDTB and the PDTB because they do not fit the formal definition of discourse connectives." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-79", "text": "Therefore, we do not list their entropy in Table 4 . (1) tionship between the entropy of the mapped discourse connectives." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-80", "text": "For example, while the French discourse connective 'car' has an entropy of 0.05 (i.e. 'car' is more than 99% of the time used in discourse-usage in the FDTB), its translations in English (i.e. 'because', 'since', and 'as') are very ambiguous." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-81", "text": "The disparity between the entropy of discourse connectives in the FDTB and the PDTB can be explained by the differences between the languages and the annotation methodology." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-82", "text": "Regardless of its source, this disparity motivated us to investigate the applicability of features proposed for the disambiguation of English discourse connectives for French." 
}, { "sent_id": "e0df566d073649431c3454a52813e9-C001-83", "text": "----------------------------------" }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-84", "text": "**FEATURES**" }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-85", "text": "As mentioned in Section 2, Pitler and Nenkova [20] have shown that the context of discourse connectives in the syntactic tree is very discriminating for the disambiguation of English discourse connectives." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-86", "text": "They proposed four syntactic features:" }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-87", "text": "1. SelfCat: The highest node in the parse tree that covers the connective words but nothing more. 2." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-88", "text": "SelfCatParent: The parent of the SelfCat." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-89", "text": "To illustrate these four features, consider the parse tree of the second sentence in Example (1) shown in Figure 1 and the discourse connective 'ainsi'." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-90", "text": "The SelfCat node is the 'ADV' node in the parse tree and its parent, left and right siblings are the 'S', 'VN' and 'PP' nodes, respectively." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-91", "text": "In addition to the four features above, Pitler and Nenkova [20] used the discourse connective itself (case sensitive) as an additional feature for the classifier." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-92", "text": "The purpose of using the case sensitive version is to distinguish connectives positioned at the beginning of sentences." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-93", "text": "We slightly modified this feature by using the case-folded version of the discourse connective (called the Conn Feature)." 
}, { "sent_id": "e0df566d073649431c3454a52813e9-C001-94", "text": "However, we created a new feature (called the Pos Feature) to explicitly indicate the position of the discourse connective within the sentence (i.e." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-95", "text": "----------------------------------" }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-96", "text": "**AT-THE-BEGINNING OR NOT-AT-THE-BEGINNING).**" }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-97", "text": "These two features are as informative as the case-sensitive connective string proposed by Pitler and Nenkova [20] , however, separating these features gives the classifier more flexibility when building its model." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-98", "text": "In Example (1), these two features are 'ainsi' and 'not-at-the-beginning', respectively." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-99", "text": "----------------------------------" }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-100", "text": "**DATA PREPARATION**" }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-101", "text": "Although the focus of the work is the disambiguation of French discourse connectives, we performed the same experiments with English discourse connectives as well." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-102", "text": "This allowed us to better analyse the results and make a comparison between the performance of our model for French and for English." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-103", "text": "For our experiments, we used the FDTB and the PDTB corpora for gold discourse annotation for French and English texts respectively." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-104", "text": "We also used the French Treebank (FTB) [1] and Penn Treebank (PTB) [15] to obtain the parse trees of the French and English texts." 
}, { "sent_id": "e0df566d073649431c3454a52813e9-C001-105", "text": "To prepare the FDTB for our experiments, we used the French Treebank (FTB) [1] to obtain the syntactic trees of the FDTB sentences." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-106", "text": "Next, we converted the FDTB sentences along with their syntactic trees into the CoNLL-2015 shared task format [25] ." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-107", "text": "The English version of our experiments were performed on the CoNLL 2015 shared task dataset [25] which contains the annotations of the PDTB and parse trees of the PTB." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-108", "text": "Similarly to Pitler et al. [19] , we used sections 2-22 of the PDTB for our experiments." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-109", "text": "To annotate discourse connectives, the input sentences were first searched for terms that match a discourse connective." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-110", "text": "Then, inspired by Pitler et al. [19] , a binary classifier was built using six local syntactic and lexical features to classify discourse connectives as in-discourse-usage or notin-discourse-usage." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-111", "text": "----------------------------------" }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-112", "text": "**RESULTS AND ANALYSIS**" }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-113", "text": "----------------------------------" }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-115", "text": "Similarly to Pitler and Nenkova [20] , we report results using a maximum entropy classifier using ten-fold cross-validation over the extracted datasets." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-116", "text": "We used the off-the-shelf implementation of the maximum entropy classifier available in WEKA [9] for our experiments." 
}, { "sent_id": "e0df566d073649431c3454a52813e9-C001-117", "text": "Table 5 shows the overall performance of the classifier for the disambiguation of French and English discourse connectives." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-118", "text": "The results show that the classifier can distinguish between discourse-usage and non-discourse-usage of French discourse connectives with an accuracy of 94.2% and an FMeasure of 86.2%." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-119", "text": "This is close to the results achieved for English discourse connectives over the PDTB (accuracy of 93.6% and F-score of 88.9%)." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-120", "text": "----------------------------------" }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-121", "text": "**FEATURE ANALYSIS**" }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-122", "text": "To evaluate the contribution of each feature, we ranked the features by their information gain for both languages." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-123", "text": "As Table 6 shows, with our datasets, the syntactic features provide less information about discourse-usage or non-discourseusage for French discourse connectives than they do for English." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-124", "text": "For example, the Selfcat feature has a significantly lower information gain than the Conn feature for the disambiguation of French discourse connectives while this is not the case for English discourse connectives." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-125", "text": "This seems to indicate that the English discourse connectives tend to appear in more restricted syntactic contexts than French discourse connectives." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-126", "text": "We also experimented with different features combinations." 
}, { "sent_id": "e0df566d073649431c3454a52813e9-C001-127", "text": "We ranked the features by their information gain and added one feature at a time." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-128", "text": "Table 7 shows the accuracy of the classifier for each subset of features for French discourse connectives." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-129", "text": "The differences between the accuracies were evaluated with the Student t test, with P < 0.05." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-130", "text": "Statistically significant increases are marked with \u21d1 in the table while a lack of statistically significant increase is indicated with a ." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-131", "text": "As Table 7 shows, using only the connective text (the Conn feature) gives an overall accuracy of 89.10%, which is a reasonably high baseline." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-132", "text": "Adding the SelfCat, SelfCatLeftSibling and SelfCatParent features gradually improves the accuracy from 89.10% to 94.21%." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-133", "text": "Table 7 shows that the effects of the Pos and SelfCatRightSibling features are negligible and without these two features the accuracy of the classifier (i.e. 93.21%) does not yield a statistically lower accuracy than the overall result (94.52%) which combines all features." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-134", "text": "As Table 7 also shows, the effect of each feature is different for English." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-135", "text": "Using only the Conn feature gives an accuracy of 85.38% which is lower than the accuracy achieved for French (i.e. 89.10%)." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-136", "text": "Adding the SelfCat feature improves the accuracy from 85.38% to 93.15%." 
}, { "sent_id": "e0df566d073649431c3454a52813e9-C001-137", "text": "The effect of the rest of the features is negligible and only improves the accuracy to 93.52%." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-138", "text": "This again seems to show that English discourse connectives tend to appear in more restricted syntactic contexts than French discourse connectives." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-141", "text": "The overall results of Table 5 showed that the features proposed for English can also accurately disambiguate French discourse connectives." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-142", "text": "However, if we analyse the results for each connective, many seem to be very well classified with the features used; while a few are more difficult to disambiguate." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-143", "text": "If we use as a baseline the assignment of the most likely class based only on the Conn feature, many connectives obtained statistically significant improvements with the classifier learned over the proposed features." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-144", "text": "Table 8 shows the accuracy of the classifier for the French discourse connectives which achieved the greatest improvements over the baseline." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-145", "text": "Again, the differences between the accuracies were evaluated with the Student t test, with P < .05 considered statistically significant and marked with \u21d1 and lack of statistical increase is indicated by in the table." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-146", "text": "As Table 8 shows, for these connectives, the classifier can disambiguate discourse-usage versus non-discourse-usage with a much better accuracy than the baseline." 
}, { "sent_id": "e0df566d073649431c3454a52813e9-C001-147", "text": "For example, the classifier can disambiguate 'sinon', which is among the top tree ambiguous French discourse connectives (see Table 3 ), with an accuracy of 88.89%, yet only 27 instances of this connective are available in the dataset." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-148", "text": "While the accuracy of the classifier is high for many discourse connectives, there are a few discourse connectives that the classifier cannot disambiguate." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-149", "text": "The five discourse connectives 7 that achieve the lowest accuracy are listed in Table 9 ." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-150", "text": "All the discourse connectives in Table 9 have very high entropy." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-151", "text": "For example, both 'effectivement' and 'alors' are among the top three ambiguous discourse connectives (see Table 3 )." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-152", "text": "Even though the accuracy of the classifier is higher than the baseline (except for the 'maintenant' discourse connective), the increase is small or not statistically significant." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-153", "text": "For example, the accuracy for the discourse connective 'effectivement' is 55.56% which is not statistically better than the baseline." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-154", "text": "These results show that for some connectives, the features proposed for English are sufficient (see Table 8 ), but for others, using only the connective and the syntactic features is not sufficient." 
}, { "sent_id": "e0df566d073649431c3454a52813e9-C001-155", "text": "----------------------------------" }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-156", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-157", "text": "In this paper, we have investigated the applicability of the syntactic and lexical features proposed by Pitler and Nenkova [20] for the disambiguation of English discourse connectives for French." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-158", "text": "Our experiments on the French Discourse Treebank (FDTB) show that even though the syntactic features are less informative for the disambiguation of French discourse connectives than for English discourse connectives, overall the features can effectively disambiguate French discourse connectives between discourse-usage and non-discourseusage as well in French as in English." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-159", "text": "The fact that the local syntactic features proposed for English can be used almost as effectively for French and Arabic [4] suggests that lexicalized discourse connectives share certain common structural features cross-linguistically and that these structures are potentially an important component in discourse processing." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-160", "text": "However, our analysis also shows that the features are not as effective for all connectives." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-161", "text": "Some high entropy connectives such as 'sinon' have a very high accuracy whereas others such as 'effectivement' or 'maintenant' require additional features." }, { "sent_id": "e0df566d073649431c3454a52813e9-C001-162", "text": "As future work, we would like to investigate features that do not need parse trees (such as the features proposed by Lin et al. [13] ) for the disambiguation of discourse connectives." 
}, { "sent_id": "e0df566d073649431c3454a52813e9-C001-163", "text": "We believe that such features would be useful for languages that lack robust syntactic parsers." } ], "y": { "@BACK@": { "gold_contexts": [ [ "e0df566d073649431c3454a52813e9-C001-18" ], [ "e0df566d073649431c3454a52813e9-C001-33" ], [ "e0df566d073649431c3454a52813e9-C001-34" ], [ "e0df566d073649431c3454a52813e9-C001-36" ], [ "e0df566d073649431c3454a52813e9-C001-43" ], [ "e0df566d073649431c3454a52813e9-C001-44" ], [ "e0df566d073649431c3454a52813e9-C001-85" ], [ "e0df566d073649431c3454a52813e9-C001-91" ] ], "cite_sentences": [ "e0df566d073649431c3454a52813e9-C001-18", "e0df566d073649431c3454a52813e9-C001-33", "e0df566d073649431c3454a52813e9-C001-34", "e0df566d073649431c3454a52813e9-C001-36", "e0df566d073649431c3454a52813e9-C001-43", "e0df566d073649431c3454a52813e9-C001-44", "e0df566d073649431c3454a52813e9-C001-85", "e0df566d073649431c3454a52813e9-C001-91" ] }, "@MOT@": { "gold_contexts": [ [ "e0df566d073649431c3454a52813e9-C001-18" ] ], "cite_sentences": [ "e0df566d073649431c3454a52813e9-C001-18" ] }, "@USE@": { "gold_contexts": [ [ "e0df566d073649431c3454a52813e9-C001-82", "e0df566d073649431c3454a52813e9-C001-85" ] ], "cite_sentences": [ "e0df566d073649431c3454a52813e9-C001-85" ] }, "@EXT@": { "gold_contexts": [ [ "e0df566d073649431c3454a52813e9-C001-91", "e0df566d073649431c3454a52813e9-C001-93" ], [ "e0df566d073649431c3454a52813e9-C001-157" ] ], "cite_sentences": [ "e0df566d073649431c3454a52813e9-C001-91", "e0df566d073649431c3454a52813e9-C001-157" ] }, "@SIM@": { "gold_contexts": [ [ "e0df566d073649431c3454a52813e9-C001-97" ], [ "e0df566d073649431c3454a52813e9-C001-115" ] ], "cite_sentences": [ "e0df566d073649431c3454a52813e9-C001-97", "e0df566d073649431c3454a52813e9-C001-115" ] }, "@DIF@": { "gold_contexts": [ [ "e0df566d073649431c3454a52813e9-C001-97" ] ], "cite_sentences": [ "e0df566d073649431c3454a52813e9-C001-97" ] } } }, "ABC_eee36102d3feac0f673cd33562d40f_11": { "x": [ { "sent_id": 
"eee36102d3feac0f673cd33562d40f-C001-6", "text": "----------------------------------" }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-8", "text": "To date, standard approaches to named entity classification rely on supervised models, that typically require a large-scale annotated corpus and a widecoverage dictionary." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-9", "text": "However, since new named entities arise regularly, it becomes increasingly difficult to maintain an up-to-date dictionary and/or adapt a named entity classifier to a new domain; for example, sequence labeling techniques that use feature templates (Finkel et al., 2005; Sarawagi and Cohen, 2004) are not robust for unknown named entities because their feature space is very sparse (Primadhanty et al., 2015) ." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-10", "text": "This problem worsens when we attempt to use a combination of features for sparse named entity classification." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-11", "text": "Therefore, in this paper, we propose the use of matrix factorization for named entity classification to consider the relationships between sparse features." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-12", "text": "Through our experiments, we achieved competitive accuracy to models developed in previous works in terms of using fewer features and compactness using factorization machines (Rendle, 2010) ." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-13", "text": "The main contributions of this paper are as follows:" }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-14", "text": "\u2022 We address the data sparseness problem in unknown named entity classification using factorization machines." 
}, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-15", "text": "\u2022 We demonstrate that factorization machines achieve state-of-the-art performance in sparse named entity classification task using a reduced feature set and a compact model." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-16", "text": "----------------------------------" }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-17", "text": "**RELATED WORK**" }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-18", "text": "A standard approach to named entity classification is to formulate a task as a sequence labeling problem and use a supervised method, such as conditional random fields (Lafferty et al., 2001; Finkel et al., 2005; Sarawagi and Cohen, 2004) ." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-19", "text": "These studies heavily rely on feature templates for learning combinations of features; however, since combinations of features in conventional supervised learning are treated independently, this approach is not robust for named entities that do not appear in the training data." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-20", "text": "To address the task of unknown named entity classification, Primadhanty et al. (2015) explored the use of sparse combinatorial features." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-21", "text": "They proposed a log-bilinear model that defines a score function considering interactions between features; the score function is regularized via a nuclear norm on a feature weight matrix." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-22", "text": "Further, heir method employs singular value decomposition (SVD)-based regularization to handle the combination of features." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-23", "text": "They reported that their regularization achieved higher accuracy than L1 and L2 regularization, frequently used in natural language processing (Okanohara and Tsujii, 2009 )." 
}, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-24", "text": "However, nuclear norm regularization (i.e., SVDbased regularization) is not necessarily the best way to incorporate interactions between features, because it does not directly optimize classification accuracy." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-25", "text": "Therefore, our proposed method treats sparse features using matrix factorization from a different perspective: we decompose a feature weight matrix using factorization machines as to directly optimize classification accuracy using a large margin method similar to support vector machines (SVMs) and passive-agressive algorithms (Vapnik, 1995; Crammer et al., 2006) ." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-26", "text": "----------------------------------" }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-27", "text": "**FACTORIZATION MACHINES**" }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-28", "text": "In this paper, we propose the use of factorization machines (Rendle, 2010) for unknown named entity classification." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-29", "text": "Using this approach, we can employ the same objective function as SVMs and yet performs matrix factorization to handle sparse combinatorial features." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-30", "text": "Matrix factorization yields better generalizations over a sparse feature matrix (Madhyastha et al., 2014) ." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-31", "text": "Factorization machines with interaction degree d = 2 use the following equation for prediction:" }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-32", "text": "Here, x is an instance, x i represents the i-th dimension of the feature x, n is the number of features, w \u2208 R n is a weight vector, and w 0 \u2208 R is a bias term." 
}, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-33", "text": "Factorization machines incorporate interactions between variables v i , v j as the third term of Equation (1)." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-34", "text": "Here, \u00b7, \u00b7 is the inner product of two vectors of size k, i.e.," }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-35", "text": "where v i is the i-th element of matrix V \u2208 R n\u00d7k and k is a hyperparameter representing the dimension of matrix decomposition." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-36", "text": "product of a decomposed matrix n \u00d7 k times." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-37", "text": "Therefore, we do not incur high computational costs even though the number of interacting features is large." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-38", "text": "Note that even though the polynomial kernel of SVMs take combinations of features into account, it treats them independently." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-39", "text": "Conversely, factorization machines take advantage of interactions between features using a low-dimensional feature matrix via matrix factorization." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-40", "text": "Because factorization machines can learn combinations of infrequent features thanks to matrix factorization, we expect that factorization machines will correctly classify named entities that seldom appear in the training corpus." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-41", "text": "In a binary classification task, factorization machines use hinge loss to optimize parameters." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-42", "text": "Here, parameter learning can be accomplished via Markov chain Monte Carlo or stochastic gradient descent." 
}, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-43", "text": "----------------------------------" }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-44", "text": "**EXPERIMENTS**" }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-45", "text": "As described above, we aim to classify named entities that rarely appear in a given training corpus." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-46", "text": "We compared factorization machines with a log-linear model, a polynomial-kernel SVM, and a state-ofthe-art log-bilinear model using nuclear norm for regularization (Primadhanty et al., 2015) ." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-47", "text": "----------------------------------" }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-48", "text": "**SETTINGS**" }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-49", "text": "Data." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-50", "text": "We used the dataset provided by Primadhanty et al. (2015) ; this dataset was created for evaluating unknown named entity classification and is context features: Right and left contexts of the candidate in a sentence (do not take the order into account)." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-51", "text": "cap=1, cap=0: Whether the first letter of the candidate is uppercase, or not. all-low=1, all-low=0: Whether all letters of the candidate are lowercase, or not." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-52", "text": "all-cap1=1, all-cap1=0: Whether all letters of the candidate are uppercase, or not." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-53", "text": "all-cap2=1, all-cap2=0: Whether all letters of the candidate are uppercase and periods, or not." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-54", "text": "num-tokens=1, num-tokens=2, num-tokens>2: Whether the candidate consists of 1, 2, or more tokens." 
}, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-55", "text": "dummy: Dummy feature to capture context features." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-56", "text": "based on the CoNLL-2003 English dataset, which omits named entity candidates that appear in the training data from the development and test data." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-57", "text": "Table 1 shows the number of tokens and types in the given dataset." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-58", "text": "This dataset contains five tags: person (PER), location (LOC), organization (ORG), miscellaneous (MISC), and non-entities (O)." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-59", "text": "Features." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-60", "text": "We used a subset of features from experiments performed by Primadhanty et al. (2015) ." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-61", "text": "Table 3 summarizes the features used in our experiment, including context and entity features." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-62", "text": "Tools." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-63", "text": "In terms of tools, we used scikit-learn 0.17 to implement a log-linear model and polynomial kernel in an SVM." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-64", "text": "Further, we employed libFM 1.4.2 1 (Rendle, 2012) to build a named entity classifier using factorization machines." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-65", "text": "In the interaction of both the SVM and the factorization machine, we fixed the degree of the polynomial kernel to d = 2." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-66", "text": "We also tuned other parameters such as learning methods, learning rate and regularization methods based on development data." 
}, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-67", "text": "Further, we used a one-versus-all strategy to build a multiclass classifier." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-68", "text": "Evaluation metrics." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-69", "text": "For our evaluation, we used precision, recall, and F1-score." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-70", "text": "The scores were calculated on all tags except for non-entities (O)." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-71", "text": "----------------------------------" }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-72", "text": "**RESULTS**" }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-73", "text": "Table 2 presents results of our experiments." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-74", "text": "Note that Primadhanty et al. (2015) used additional features such as Brown clustering and parts-of-speech (POS) features, which we did not use." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-75", "text": "Table 4 and Figure 2 show the performance and precision-recall curves of named entity classification for each tag, respectively." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-76", "text": "We observed here that, aside from LOC, we obtained competitive results to the state-of-the-art named entity classifier proposed by Primadhanty et al. (2015) with fewer features." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-77", "text": "Overall, the microaveraged F1 score improved by 1.4 points." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-78", "text": "From these results, we conclude that unknown named entity classification can be successfully achieved by taking combinatorial features into account using factorization machines." 
}, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-79", "text": "----------------------------------" }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-80", "text": "**DISCUSSION**" }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-81", "text": "Experimental results show that performance on ORG was improved." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-82", "text": "For example, the term \"VicePresident\" appears in both contexts of ORG and O, and our method correctly handled this sparse combination of context and entity features." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-83", "text": "The accuracy of LOC, however, was lower than that of the log-bilinear model (Primadhanty et al., 2015) ." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-84", "text": "Upon investigating the confusion matrix, we found that the LOC tag was often misclassified as PER." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-85", "text": "We therefore conclude here that clustering and POS features are necessary to distinguish these tags." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-86", "text": "Figure 1 plots the F1-score of our proposed method as dimension k changes for matrix factor- ization using the same development data as that of Primadhanty et al. (2015) ." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-87", "text": "Our method yielded the best F1-score (i.e., 57.1) at k = 5, whereas the log-bilinear model achieved the best F1-score (i.e., 61.73) at k = 40." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-88", "text": "These results show that factorization machines require a compact model to achieve state-of-the-art results on the test set of this corpus." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-89", "text": "It would be interesting to point out that the performance of our factorization machines approach on the development dataset was lower than that of the log-bilinear model by 4.6 points." 
}, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-90", "text": "This phenomenon may occur because the log-bilinear model overfits to sparse combinatorial features even with nuclear norm regularization; further factorization machines typically have better generalization abilities than those of nuclear norm regularization." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-91", "text": "Both our approach and the methods of Primadhanty et al. (2015) address the problem of incorporating sparse combinatorial features by dimension reduction (i.e., matrix factorization); however, they differ in terms of the objective function to be optimized." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-92", "text": "Primadhanty et al. (2015) use maximum likelihood estimation as an objective function; whereas other objective functions such as hinge loss can be used in factorization machines." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-93", "text": "----------------------------------" }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-94", "text": "**CONCLUSION**" }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-95", "text": "In this paper, we proposed the use of factorization machines to handle the combinations of sparse features in unknown named entity classification." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-96", "text": "Our experimental results showed that we were able to achieve competitive accuracy to state-of-the-art methods using fewer features and a compact model." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-97", "text": "For future work, we aim to extend this framework to sequence labeling, thereby improving overall named entity recognition." 
}, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-2", "text": "Named entity classification is the task of classifying text-based elements into various categories , including places, names, dates, times, and monetary values." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-3", "text": "A bottleneck in named entity classification, however, is the data problem of sparseness, because new named entities continually emerge, making it rather difficult to maintain a dictionary for named entity classification." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-4", "text": "Thus, in this paper, we address the problem of named entity classification using matrix factorization to overcome the problem of feature sparsity." }, { "sent_id": "eee36102d3feac0f673cd33562d40f-C001-5", "text": "Experimental results show that our proposed model, with fewer features and a smaller size, achieves competitive accuracy to state-of-the-art models." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "eee36102d3feac0f673cd33562d40f-C001-9" ], [ "eee36102d3feac0f673cd33562d40f-C001-20" ], [ "eee36102d3feac0f673cd33562d40f-C001-92" ] ], "cite_sentences": [ "eee36102d3feac0f673cd33562d40f-C001-9", "eee36102d3feac0f673cd33562d40f-C001-20", "eee36102d3feac0f673cd33562d40f-C001-92" ] }, "@MOT@": { "gold_contexts": [ [ "eee36102d3feac0f673cd33562d40f-C001-11", "eee36102d3feac0f673cd33562d40f-C001-9" ] ], "cite_sentences": [ "eee36102d3feac0f673cd33562d40f-C001-9" ] }, "@USE@": { "gold_contexts": [ [ "eee36102d3feac0f673cd33562d40f-C001-46" ], [ "eee36102d3feac0f673cd33562d40f-C001-50" ], [ "eee36102d3feac0f673cd33562d40f-C001-60" ], [ "eee36102d3feac0f673cd33562d40f-C001-86" ] ], "cite_sentences": [ "eee36102d3feac0f673cd33562d40f-C001-46", "eee36102d3feac0f673cd33562d40f-C001-50", "eee36102d3feac0f673cd33562d40f-C001-60", "eee36102d3feac0f673cd33562d40f-C001-86" ] }, "@DIF@": { "gold_contexts": [ [ "eee36102d3feac0f673cd33562d40f-C001-74" ], [ "eee36102d3feac0f673cd33562d40f-C001-83" ], [ "eee36102d3feac0f673cd33562d40f-C001-91" ] ], "cite_sentences": [ "eee36102d3feac0f673cd33562d40f-C001-74", "eee36102d3feac0f673cd33562d40f-C001-83", "eee36102d3feac0f673cd33562d40f-C001-91" ] }, "@SIM@": { "gold_contexts": [ [ "eee36102d3feac0f673cd33562d40f-C001-76" ], [ "eee36102d3feac0f673cd33562d40f-C001-91" ] ], "cite_sentences": [ "eee36102d3feac0f673cd33562d40f-C001-76", "eee36102d3feac0f673cd33562d40f-C001-91" ] } } }, "ABC_950cc4a7fa2db3aa6786cc0ae802b5_12": { "x": [ { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-2", "text": "We present a simple yet powerful approach to non-factoid answer reranking whereby question-answer pairs are represented by concatenated distributed representation vectors and a multilayer perceptron is used to compute the score for an answer." 
}, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-3", "text": "Despite its simplicity, our approach achieves state-of-the-art performance on a public dataset of How questions, outperforming systems which employ sophisticated feature sets." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-4", "text": "We attribute this good performance to the use of paragraph instead of word vector representations and to the use of suitable data for training these representations." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-5", "text": "----------------------------------" }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-7", "text": "In contrast to factoid question answering (QA), nonfactoid QA is concerned with questions whose answer is not easily expressed as an entity or list of entities and can instead be quite complex -compare, for example, the factoid question Who is the secretary general of the UN? with the non-factoid manner question How is the secretary general of the UN chosen?" }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-8", "text": "A significant amount of research has been carried out on factoid QA, with non-factoid questions receiving less attention." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-9", "text": "This is changing, however, with the popularity of community-based question answering (CQA) sites such as Yahoo! Answers 1 , Quora 2 and the StackExchange 3 family of forums." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-10", "text": "The ability of users to vote for their favourite answer makes these sites a valuable source of training data for open-domain non-factoid QA systems." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-11", "text": "In this paper, we present a neural approach to open-domain non-factoid QA, focusing on the subtask of answer reranking, i.e. 
given a list of candidate answers to a question, order the answers according to their relevance to the question." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-12", "text": "We test our approach on the Yahoo! Answers dataset of manner or How questions introduced by Jansen et al. (2014) , who describe answer reranking experiments on this dataset using a diverse range of features incorporating syntax, lexical semantics and discourse." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-13", "text": "In particular, they show how discourse information (obtained either via a discourse parser or using shallow techniques based on discourse markers) can complement distributed lexical semantic information." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-14", "text": "Sharp et al. (2015) show how discourse structure can be used to generate artificial questionanswer training pairs from documents, and test their approach on the same dataset." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-15", "text": "The best performance on this dataset -33.01 P@1 and 53.96 MRR -is reported by Fried et al. (2015) who improve on the lexical semantic models of Jansen et al. (2014) by exploiting indirect associations between words using higher-order models." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-16", "text": "In contrast, our approach is very simple and requires no feature engineering." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-17", "text": "Question-answer pairs are represented by concatenated distributed representation vectors and a multilayer perceptron is used to compute the score for an answer (the probability of an answer being the best answer to the question)." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-18", "text": "Despite its simplicity, we achieve state-of-theart performance on this dataset -37.17 P@1 and 56.82 MRR." 
}, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-19", "text": "We attribute this improved performance to the use of paragraph vector representations (Le and Mikolov, 2014) instead of averaging over word vectors, and to the use of suitable data for training these representations." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-20", "text": "----------------------------------" }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-21", "text": "**APPROACH**" }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-22", "text": "----------------------------------" }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-23", "text": "**LEARNING ALGORITHM**" }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-24", "text": "We use a simple feedforward neural network, i.e. a multilayer perceptron, to predict the best answer." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-25", "text": "As shown in Figure 1 , the first layer of the network is a projection layer that transforms question-answer pairs into their vector representations." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-26", "text": "The vector representation for a question-answer pair (q, a) is a concatenation of the distributed representations q and a for the question and the answer respectively." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-27", "text": "Each representation is a real-valued vector of a fixed dimensionality d, which is a parameter to be tuned." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-28", "text": "The projection layer is followed by one or more hidden layers; the number of layers and units in each of these layers are also parameters to be experimentally tuned." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-29", "text": "We use the rectified linear (ReLU) activation function." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-30", "text": "Finally, a softmax layer is used to compute the output probability p, i.e. 
the probabilities p 1 and p 2 of the negative (i.e. not best answer) and positive (i.e. best answer) classes respectively." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-31", "text": "For each question, all its user-generated answers are ranked according to their probability of being the best answer, as predicted by the network." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-32", "text": "Given a question-answer pair (q, a), the possible values for the ground-truth label are 1 (best answer) and 0 (not a best answer)." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-33", "text": "The network is trained by minimizing the L2-regularized cross-entropy loss function between the ground-truth labels and the network predictions on the training set." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-34", "text": "We use stochastic gradient descent to minimize the loss over the training set." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-35", "text": "The development set is used for early stopping." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-36", "text": "----------------------------------" }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-37", "text": "**DOCUMENT REPRESENTATIONS**" }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-38", "text": "Our approach requires question-answer pairs to be represented as a fixed-size vector." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-39", "text": "We experimentally evaluate the Paragraph Vector model (PV) proposed by Le and Mikolov (2014) ." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-40", "text": "The PV is an extension of the widely used continuous bag-of-words (CBOW) and skip-gram word embedding models, known as word2vec." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-41", "text": "However, in contrast to CBOW and skip-gram models that only learn word embeddings, the PV is able to learn representations for pieces of text of arbitrary length, e.g. 
sentences, paragraphs or documents." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-42", "text": "The PV includes (1) the distributed memory (DM) model, that predicts the next word using the concatenation of the previous words and the paragraph vector, that is shared among all words in the same paragraph (or sentence); (2) the distributed bag-of-words (DBOW) model, that -similar to the skip-gram model -predicts words randomly sampled from the paragraph, given the paragraph vector." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-43", "text": "We experiment with both DM and DBOW models, as well as their combination." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-44", "text": "For comparison with recent work in answer reranking (Jansen et al., 2014; Sharp et al., 2015) , we also evaluate the averaged word embedding vectors obtained with the skip-gram model (Mikolov et al., 2013 ) (henceforth referred to as the SkipAvg model)." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-45", "text": "----------------------------------" }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-46", "text": "**EXPERIMENTS**" }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-47", "text": "----------------------------------" }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-48", "text": "**DATA**" }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-49", "text": "In order to be able to compare our work with previous research, we use the Yahoo! Answers dataset that was first introduced by Jansen et al. (2014) and was later used by Sharp et al. (2015) and Fried et al. (2015) ." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-50", "text": "This dataset contains 10K How questions from Yahoo! Answers." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-51", "text": "Each question has at least four user-generated answers, and the average number of answers per question is nine." 
}, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-52", "text": "50% of the dataset is used for training, 25% for development and 25% for testing." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-53", "text": "Further information about the dataset can be found in Jansen et al. (2014) ." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-54", "text": "Our approach requires unlabelled data for unsupervised pre-training of the word and paragraph vectors." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-55", "text": "For these purposes we use the L6 Yahoo! Answers Comprehensive Questions and Answers corpus obtained via Webscope." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-75", "text": "However, these improvements over the 200-dimension DBOW model are not statistically significant." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-56", "text": "This dataset contains about 4.5M questions from Yahoo! Answers along with their user-generated answers, and was provided as training data at the recent TREC LiveQA competition (Agichtein et al., 2015) , the goal of which was to answer open-domain questions coming from real users in real time." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-57", "text": "The Yahoo! Answers manner question dataset prepared by Jansen et al. (2014) and described in the previous paragraph was initially sampled from this larger dataset." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-58", "text": "We want to emphasize that the L6 dataset is only used for unsupervised pretraining -no meta-information is used in our experiments." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-59", "text": "We also experiment with the English Gigaword corpus, which contains data from several English newswire sources." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-60", "text": "Jansen et al. (2014) used this corpus to train word embeddings, which were then included as features in their answer reranker." 
}, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-61", "text": "----------------------------------" }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-62", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-63", "text": "Following Jansen et al. (2014) and Fried et al. (2015) , we implement two baselines: the baseline that selects an answer randomly and the candidate retrieval (CR) baseline." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-64", "text": "The CR baseline uses the same scoring as in Jansen et al. (2014) : the questions and the candidate answers are represented using tf-idf (Salton, 1991) over lemmas; the candidate answers are ranked according to their cosine similarity to the respective question." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-65", "text": "We use the gensim implementation of the DBOW and DM paragraph vector models." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-66", "text": "The word embeddings for the SkipAvg model are obtained with word2vec." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-67", "text": "The data was tokenized with the Stanford tokenizer and then lowercased." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-68", "text": "To evaluate our models, we use standard implementations of the P@1 and mean reciprocal rank (MRR) evaluation metrics." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-69", "text": "To evaluate whether the difference between two models is statistically significant, we perform one-tailed bootstrap resampling with 10,000 iterations." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-70", "text": "Improvements are considered to be statistically significant at the 5% significance level (p < 0.05)." 
}, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-71", "text": "----------------------------------" }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-72", "text": "**RESULTS**" }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-73", "text": "In Table 1 , we report the best development P@1 and MRR of the multilayer perceptron trained on the Yahoo! Answers dataset (Jansen et al., 2014) , with the 200-dimension DBOW model outperforming the rest by a small margin." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-74", "text": "The multilayer perceptron with combinations of different distributed representations reaches slightly higher P@1 and MRR on the development set." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-76", "text": "Table 2 presents the results on the test set." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-77", "text": "We only evaluate the 200-dimension DBOW model and its combinations with other models, comparing these to the baselines and the previous results on the same dataset (we use the same train/dev/test split as Jansen et al. (2014) )." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-78", "text": "The DBOW outperforms the baselines by a statistically significant margin." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-79", "text": "The combination of the DBOW, DM and SkipAvg models provides slightly better results, but the improvement over the DBOW is not statistically significant." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-80", "text": "Jansen et al. (2014) report that answer reranking benefits from lexical semantic models, and describe experiments using SkipAvg embeddings pretrained on the English Gigaword corpus." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-81", "text": "Here we compare the performance of the reranker with distributed representations pretrained on a large \"out-of-domain\" newswire corpus (Gigaword), versus a smaller \"in-domain\" non-factoid QA one (L6 Yahoo)." 
}, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-82", "text": "Figure 2 shows the development P@1 and MRR of the multilayer perceptron with DBOW model on the Yahoo! Answers dataset pretrained on 30M random paragraphs from the English Gigaword corpus versus the multilayer perceptron with DBOW model pretrained on the Yahoo L6 corpus containing about 8.5M paragraphs." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-83", "text": "We also evaluate the combination of the two models." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-84", "text": "The results highlight the importance of finding a suitable source of unlabelled training data, since vectors pretrained on reasonably large amounts of Yahoo! Answers data are more beneficial than using a much larger Gigaword dataset." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-85", "text": "----------------------------------" }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-86", "text": "**ANALYSIS**" }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-87", "text": "Even our best model is still far from perfect: for about 60% of questions, the answer selected as best by the author of the question is not assigned the highest rank by our system." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-88", "text": "We believe one reason for this is that the choice of best answer rests solely with the question's author and may be subjective (see Table 3 )." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-89", "text": "A possible useful direction for future research is to incorporate user-level information into the neural reranking model." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-90", "text": "This approach has recently been found beneficial in the task of sentiment analysis." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-91", "text": "Another potential source of error lies in the user-generated nature of the data." 
}, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-92", "text": "Yahoo! Answers contains a large number of spelling and grammar mistakes (e.g. how do i thaw fozen [sic] chicken?) and nonstandard spelling and punctuation (e.g. Booorrrriinng!!!!!)." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-93", "text": "A common way to deal with this problem is normalization (Baldwin et al., 2015) ." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-94", "text": "To determine whether this might be helpful, we normalized the data following the strategy described by Le Roux et al. (2012)." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-95", "text": "----------------------------------" }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-96", "text": "**OTHER ANSWERS**" }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-97", "text": "Losen it." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-98", "text": "Close your eyes, grab some scissors, and GO CRAZY! I think you should scrunch it! It looks awesome." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-99", "text": "Just tip ur head over and put jell in ur hands and like scrunch! just brush it and go, it always works for me when i can't figure out what to do with it." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-100", "text": "pull it up in a high pony tail & small curls falling down! make it into braids. (Table 3 : Example question from the Yahoo! Answers dataset.) Although our preliminary experiments show that applying lexical normalization results in significantly lower performance, further study is needed." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-101", "text": "One direction is to use character-level embeddings, which have proven promising for user-generated content because of their ability to better handle spelling variation." 
}, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-102", "text": "----------------------------------" }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-103", "text": "**CONCLUSIONS**" }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-104", "text": "We have conducted answer reranking experiments for open-domain non-factoid QA and achieved state-of-the-art performance on the Yahoo! Answers manner question corpus using a very straightforward neural approach which involves representing question-answer pairs as paragraph vectors and training a multilayer perceptron to order candidate answers." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-105", "text": "Our experiments show that representing the question-answer pair as a paragraph vector is clearly superior to the use of averaged word vectors." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-106", "text": "We have also shown that a smaller amount of unlabelled data taken from a CQA site is more useful for training representations than a larger newswire set." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-107", "text": "In this paper, we use general purpose distributed document representations provided by Paragraph Vector models to represent question-answer pairs." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-108", "text": "Then a machine learning algorithm is used to rank the pairs." }, { "sent_id": "950cc4a7fa2db3aa6786cc0ae802b5-C001-109", "text": "One possible direction for future research is in learning distributed document representations and the ranking simultaneously and applying more sophisticated recurrent models such as long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997 ) neural networks, that have been shown to be effective in similar tasks (Wang and Nyberg, 2015; Zhou et al., 2015) ." 
} ], "y": { "@USE@": { "gold_contexts": [ [ "950cc4a7fa2db3aa6786cc0ae802b5-C001-12" ], [ "950cc4a7fa2db3aa6786cc0ae802b5-C001-44" ], [ "950cc4a7fa2db3aa6786cc0ae802b5-C001-49" ], [ "950cc4a7fa2db3aa6786cc0ae802b5-C001-55", "950cc4a7fa2db3aa6786cc0ae802b5-C001-57" ], [ "950cc4a7fa2db3aa6786cc0ae802b5-C001-63" ], [ "950cc4a7fa2db3aa6786cc0ae802b5-C001-64" ], [ "950cc4a7fa2db3aa6786cc0ae802b5-C001-73" ], [ "950cc4a7fa2db3aa6786cc0ae802b5-C001-77" ] ], "cite_sentences": [ "950cc4a7fa2db3aa6786cc0ae802b5-C001-12", "950cc4a7fa2db3aa6786cc0ae802b5-C001-44", "950cc4a7fa2db3aa6786cc0ae802b5-C001-49", "950cc4a7fa2db3aa6786cc0ae802b5-C001-57", "950cc4a7fa2db3aa6786cc0ae802b5-C001-63", "950cc4a7fa2db3aa6786cc0ae802b5-C001-64", "950cc4a7fa2db3aa6786cc0ae802b5-C001-73", "950cc4a7fa2db3aa6786cc0ae802b5-C001-77" ] }, "@BACK@": { "gold_contexts": [ [ "950cc4a7fa2db3aa6786cc0ae802b5-C001-15" ], [ "950cc4a7fa2db3aa6786cc0ae802b5-C001-44" ], [ "950cc4a7fa2db3aa6786cc0ae802b5-C001-53" ], [ "950cc4a7fa2db3aa6786cc0ae802b5-C001-57" ] ], "cite_sentences": [ "950cc4a7fa2db3aa6786cc0ae802b5-C001-15", "950cc4a7fa2db3aa6786cc0ae802b5-C001-44", "950cc4a7fa2db3aa6786cc0ae802b5-C001-53", "950cc4a7fa2db3aa6786cc0ae802b5-C001-57" ] } } }, "ABC_d68bb5264d157cc4c2d9fa9c8f82b6_12": { "x": [ { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-2", "text": "Here we test Neutral models against the evolution of English word frequency and vocabulary at the population scale, as recorded in annual word frequencies from three centuries of English language books." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-43", "text": "As the \"bar\" is raised, words are more likely to 'die' before they ever reach the bar by stochastic walk [43] ." 
}, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-3", "text": "Against these data, we test both static and dynamic predictions of two neutral models, including the relation between corpus size and vocabulary size, frequency distributions, and turnover within those frequency distributions." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-4", "text": "Although a commonly used Neutral model fails to replicate all these emergent properties at once, we find that a modified two-stage Neutral model does replicate the static and dynamic properties of the corpus data." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-5", "text": "This two-stage model is meant to represent a relatively small corpus (population) of English books, analogous to a 'canon', sampled by an exponentially increasing corpus of books in the wider population of authors." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-6", "text": "More broadly, this model-a smaller neutral model within a larger neutral model-could represent situations where mass attention is focused on a small subset of the cultural variants." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-7", "text": "English has evolved continually over the centuries, in the branching off from antecedent languages in Indo-European prehistory [34, 39] , in the rates of regularisation of verbs [34] and in the waxing and waning of the popularity of individual words [3, 13, 37] ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-8", "text": "At a much finer scale of time and population, languages change through modifications and errors in the learning process [14, 27] ." 
}, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-9", "text": "This continual change and diversity contrasts with the simplicity and consistency of Zipf's law, by which the frequency of a word, f , is inversely proportional to its rank k, as f \u223c k^\u2212\u03b3 , and Heaps law, by which vocabulary size scales sub-linearly with total number of words, across diverse textual and spoken samples [32, 41, 46, 49, 15, 21, 48, 42] ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-10", "text": "The Google Ngram corpus [37] provides new support for these statistical regularities in word frequency dynamics at timescales from decades to centuries [22, 41, 42, 1, 28] . Consisting of n-grams -an n-gram being n consecutive character strings, separated by spaces -derived from millions of books over multiple centuries [35] , the n-gram data now covers English books from the year 1500 to year 2008." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-11", "text": "In English, the Zipf's law in the n-gram data [41] exhibits two regimes: one among words with frequencies above about 0.01% (Zipf's exponent \u03b3 \u2248 1) and another (\u03b3 \u2248 1.4) among words with frequency below 0.0001% [42] ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-12", "text": "The latter Zipf's law exponent \u03b3 of 1.4 is equivalent to a probability distribution function (PDF) exponent, \u03b1, of about 1.7 (\u03b1 = 1 + 1/\u03b3)." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-13", "text": "In addition to the well-known Zipf's law, word frequency data have at least two other statistical properties." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-14", "text": "One, known as Heaps law, refers to the way that vocabulary size scales sub-linearly with corpus size (raw word count)." 
}, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-15", "text": "The n-gram data show Heaps law in that, if N t is corpus size and v t is vocabulary size at time t, then v t \u2248 N t^\u03b2 , with \u03b2 \u2248 0.5, for all English words in the corpus [42] ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-16", "text": "If the n-gram corpus is truncated by a minimum word count, then as that minimum is raised the Heaps scaling exponent increases from \u03b2 < 0.5, approaching (but remaining below) 1 [42] ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-17", "text": "The other statistical property is dynamic turnover in the ranked list of most commonly used words." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-18", "text": "This can be measured in terms of how many words are replaced through time on \"Top y\" ranked lists of different sizes y of most frequently-used words [12, 17, 19, 23] ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-19", "text": "We can define this turnover z y (t) as the number of new words to have entered the top y most common words in year t, which is equivalent to the number of words to have dropped out of the top y in that year." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-20", "text": "The plotting of turnover z y for different list sizes y can therefore be useful in characterising turnover dynamics [2] ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-21", "text": "Many functional or network models readily yield the static Zipf distribution [21, 15] and Heaps law [36] , but not the dynamic aspects such as turnover." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-22", "text": "Here we focus on how Heaps law and Zipf's law can be modeled together with continual turnover of words within the rankings by frequency [4, 23] ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-23", "text": "We focus on the 1-grams in Google's English 2012 data set, which samples English language books published in any country [25] ." 
}, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-24", "text": "One promising, parsimonious approach incorporates the class of neutral evolutionary models [11, 12, 7, 24, 38] that are now proving insightful for language transmission [13, 10, 45] ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-25", "text": "The null hypothesis of a Neutral model is that copying is undirected, without biases or different 'fitnesses' of the words being replicated [2, 29] ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-26", "text": "A basic neutral model, which we will call the full-sampling Neutral model (FNM), would assume simply that authors choose to write words by copying those published in the past and occasionally inventing or introducing new words." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-27", "text": "As shown in Fig 1a, the FNM represents each word choice by an author as selecting at random among the N t words that were published in the previous year [45, 10] ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-28", "text": "This copying occurs with probability 1 \u2212 \u00b5, where \u00b5 \u226a 1 is the fixed, dimensionless probability that an author invents a new word (even if the word had originated somewhere 'outside' books, e.g. in spoken slang)." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-29", "text": "Each newly-invented word enters with frequency one, regardless of N t ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-30", "text": "In terms of the modeled corpus, a total of about \u00b5N t unique new words are invented per time step." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-31", "text": "Note that N t represents the total number of written words, or corpus size, for year t, which contrasts with the smaller \"vocabulary\" size, v t , defined as the number of different words in each year t regardless of their frequency of usage." 
}, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-32", "text": "As has been well demonstrated, the FNM readily yields Zipf's law [11, 9, 47] , which can also be shown analytically (see Appendix 1)." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-33", "text": "Also, simulations of the FNM show that the resulting Zipf distribution undergoes dynamic turnover [12] ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-34", "text": "Extensive simulations [19] show that when list size y is small compared to the corpus (0.15y < N t \u00b5), this neutral turnover z y per time step is more precisely approximated by:" }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-35", "text": "where n is the number of words per time interval." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-36", "text": "This prediction can be visualized by plotting the measured turnover z y for different list sizes y. The FNM predicts the results to follow z y \u221d y^0.86 , such that departures from this expected curve can be identified to indicate biases such as conformity or anti-conformity [2] ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-37", "text": "It would appear from eq. 1 that turnover should increase with corpus size." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-38", "text": "This is the nominal equilibrium for FNM with constant N t ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-39", "text": "If corpus size N t in the FNM is growing exponentially with time, however, then there may be no such nominal equilibrium." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-40", "text": "In this case we predict that the turnover z y can actually decrease with time as N t increases." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-41", "text": "This is because newly invented words start with frequency one, and under the neutral model they must essentially make a stochastic walk into the top 100, say." 
}, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-42", "text": "As N t grows, so does the minimum frequency needed to break into the top 100." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-44", "text": "As a result, turnover in the Top y can slow down over time as N t grows." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-45", "text": "The FNM does not, however, readily yield Heaps law (v t = N t^\u03b2 , where \u03b2 < 1), for which \u03b2 \u2248 0.5 among the 1-gram data for English [42] ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-46", "text": "In the FNM, the expected exponent \u03b2 is 1.0, as the number of different variants (vocabulary) normally scales linearly with \u00b5N t [11] ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-47", "text": "While the FNM has been a powerful null model, in the case of books, we can make a notable improvement to account for the fact that most published material goes unnoticed while a relatively small portion of the corpus is highly visible." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-48", "text": "To name a few examples across the centuries, literally billions of copies of the Bible and the works of Shakespeare have been read since the seventeenth century, as well as tens or hundreds of millions of copies of works by Voltaire, Swift, Austen, Dickens, Tolkien, Fleming, Rawling and so on." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-49", "text": "While these and hundreds more books become considered part of the \"Western Canon,\" that canon is constantly evolving [28] and many books that were enormously popular in their time -e.g., Arabian Nights or the works of Fanny Burney-fall out of favour." 
}, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-50", "text": "As the published corpus has grown exponentially over the centuries, early authors were more able to sample the full range of historically published works, whereas contemporary authors sample from an increasingly small and more recent fraction of the corpus, simply due to its exponential expansion [28, 40] ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-51", "text": "As a simple way of capturing this, we propose a modified neutral model, called the partial-sampling Neutral model (PNM), of an evolving \"canon\" that is sampled by an exponentially-growing corpus of books." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-52", "text": "As shown in Fig 1b, the PNM represents an exponentially growing number of books that sample words from a fixed-size canon over all previous years since 1700." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-53", "text": "Our PNM represents a world where there exists an evolving canonical literature as a relatively small subset of the world's books on which all writers are educated." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-54", "text": "As new contributions to the canon are contributed, authors sample from the recent generation of writers with occasional innovation." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-55", "text": "Because the canon is a high-visibility subset of all books, only a fixed, constant number of words of text per year are allowed into a year's canon." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-56", "text": "The rest of the population learns from the cumulative canon since our chosen reference year of 1700." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-57", "text": "The average results from 100 runs of each of the FNM and PNM were used to match summary statistics with the 1-gram data." 
}, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-58", "text": "Several key statistical results emerge from analysis of the 1-gram data, in terms of which we compare the FNM to the PNM: (1) Heaps law, which is the sublinear scaling of vocabulary size with corpus size, (2) a Zipf's law frequency distribution for unique words, (3) a rate of turnover that decreases exponentially with time, and (4) a turnover vs popular-list-size relation that is approximately linear." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-59", "text": "Here we describe our results in terms of rank-frequency distributions, turnover and corpus and vocabulary size." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-60", "text": "We compare the partial-sampling Neutral model (PNM) to the full 1-gram data for English." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-61", "text": "First, we check that the model replicates the Zipf's law that characterizes the 1-gram frequencies in multiple languages [41] ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-62", "text": "Our own maximum likelihood determinations, applying available code [15] to the Google 1-gram data, confirm that the mean \u03b1 = 1.75 \u00b1 0.12 for the Zipf's law over all English words in the hundred years from 1700 to 1800 (beyond 1800, the corpus size becomes too large for our computation)." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-63", "text": "We normalise by the word count [21] . (Fig 1 caption: each word, represented by different colored circles in each box, is copied (arrows) from the previous year t \u2212 1 with probability 1 \u2212 \u00b5, or newly-invented with probability \u00b5. The FNM shown in (a) has a corpus size N t that grows through time.)" }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-64", "text": "In (b) the PNM samples from all previous results of the FNM since the initial time step representing year 1700." 
}, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-65", "text": "The PNM population grows exponentially (N 0 e 0.021t ) through time, from 3000 to 1.5 million." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-66", "text": "As the PNM samples from all previous years of FNM population, the PNM samples from a corpus that increases linearly (by 10,000 words per year) from 10,000 words in year 1700 to 3 million words by year 2000." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-67", "text": "For the PNM, the big blue arrows represent how each generation can sample any year of the canon randomly, all the way back to 1700, the smaller arrows representing individual sampling events." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-68", "text": "distribution is virtually identical for each year of the dataset, reaching eight orders of magnitude by the year 2000 (Fig 2a) ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-69", "text": "The FNM replicates the Zipf (Fig 2b) but the PNM replicates it better and over more orders of magnitude (Fig 2c) ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-70", "text": "It was not computationally possible with either the FNM or PNM to replicate the Zipf across all nine orders of magnitude, as the modeled corpus size N t grows exponentially (Fig 2d) ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-71", "text": "Fig 3a illustrates the relationship between corpus size and vocabulary size in our partial-sampling Neutral model." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-72", "text": "Due to the exponentially increasing sample size, the ratio of vocabulary size over corpus size becomes increasingly small, thus the model gives us the sub-linear relationship described by v t = N \u03b2 t , where \u03b2 < 1." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-73", "text": "On the double-logarithmic plot in Fig 3a, the Heaps law exponent is equivalent to the slope of the data series." 
}, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-74", "text": "The PNM matches the 1-gram data with Heaps exponent (slope) of about 0.5, whereas the FNM, with exponent about 1.0, does not." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-75", "text": "Fig 3b shows how 100 runs of the PNM yields a Heaps law exponent within the range derived by [42] for several different n-grams corpora (all English, English fiction, English GB, English US and English 1M)." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-76", "text": "We also The PNM yields Heaps law exponent \u03b2 \u2248 0.52 \u00b1 0.006, within the range of English corpora, whereas the FNM yields a mismatch with the data of \u03b2 \u2248 1 \u00b1 0.002 (Fig 3b) ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-77", "text": "In Fig 3a, there is a constant offset on the y-axis between vocabulary size in the PNM (\u03b1 = 0.02, N = 10000) versus the 1-gram data." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-78", "text": "Both data series follow Heaps exponent b \u2248 0.5, but the coefficient, A, is several times larger for the 1-gram data than for the PNM." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-79", "text": "We do not think this is due to our choice of canon size N in the PNM, because if we halve it to 5000, the resulting A does not significantly change." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-80", "text": "The difference could be resolved, however, with larger The Heaps law exponents, \u03b2, for the data series on the left, as well as additional data series, using Table 1 in [42] : all English 1-grams: 0.54 \u00b1 0.01; English fiction: 0.49 \u00b1 0.01; English GB: 0.44 \u00b1 0.01; English US: 0.51 \u00b1 0.01." 
}, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-81", "text": "The 100 independent runs of each neutral model, using parameters listed in the text, yielded \u03b2 = 0.52 \u00b1 0.07 for the PNM, and \u03b2 = 1.00 \u00b1 0.002 for the FNM (not shown)." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-82", "text": "exponential growth in PNM corpus size, S t , over the 300 time steps." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-83", "text": "Computationally, we could only model the PNM with growth exponent \u03b1 = 0.02-using \u03b1 = 0.03, as would fit the actual growth of the n-gram corpus over 300 years [8] , makes the PNM too large to compute." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-84", "text": "Nevertheless, we can roughly estimate the effect; when we reduce \u03b1 from 0.02 to 0.01, while keeping N = 10000, we find that A averaged over one hundred PNM runs is reduced from 6.3 \u00b1 0.5 to 1.4 \u00b1 0.3." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-85", "text": "Given an exponential relationship, increasing alpha to 0.03 would increase A to about 20, which is within the magnitude of offset we see in Fig 3a." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-86", "text": "Of course, this question can be resolved precisely when the much larger PNM can be simulated." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-87", "text": "Regarding dynamic turnover, we consider turnover in ranked lists of size y, varying the list size y from the top 1000 most common words down to the top 10 (the top 1 word has been \"the\" since before the year 1700)." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-88", "text": "We measure turnover in the word-frequency rankings by determining the top y rankings independently for each year, and then counting the number of new words to appear on the list from one year to the next." 
}, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-89", "text": "Fig 4 shows the number of 1-grams to drop out of the top 1000, top 500 and top 200 per year in the 1-gram data." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-90", "text": "Annual turnover among the top 1000 and the top 500 decreased exponentially from the year 1700 to 2000, proportional to e \u22120.012t (r 2 > 0.91 for both), where t is years since 1700." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-91", "text": "This exponential decay equates to roughly a halving of turnover per century." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-92", "text": "Since the corpus size was increasing with time, Fig 4 effectively also shows how turnover in top y list decreases as corpus size increases in the partial-sampling Neutral model, where the corpus size grows faster than the number relative to speakers over the years." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-93", "text": "The exponential decay in turnover in the partial-sampling Neutral model is markedly different than the base Neutral model, in which turnover would be growing as corpus size grew, due to term n 0.013 s in equation 1." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-94", "text": "Finally, we also look at the \"turnover profile\", plotting list size y versus turnover z y for different time slices (Fig 5) ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-95", "text": "For all words, z y \u221d y 1.26 for different time periods (Fig 5) ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-96", "text": "We can then compare the turnover profile for the 1-grams to the prediction from eq. 1 that turnover will be proportional to y 0.86 , as shown in Fig 5b ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-97", "text": "Table 1 lays out the specific predictions of each of the models and how they fare against empirical data." 
}, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-98", "text": "Bands indicate 95% range of simulated values." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-99", "text": "While the predictions for the FNM and PNM are similar for y = 50 and for the year 1800 (Fig 4a and Fig 5a) , they do differ (Fig 4c and Fig 5c) ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-100", "text": "Although the FNM can fit Zipf's Law with the right parameters, it cannot also fit Heaps law or the turnover patterns at the same time as matching Zipf's Law." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-101", "text": "In contrast, the PNM can fit Zipf's law, Heaps law exponent (Fig 3a) , and the 2000 series in Fig 4 (but starts to breakdown at y > 150)." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-102", "text": "Neither the FNM nor the PNM does very well at y = 200." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-103", "text": "We have explored how 'neutral' models of word choice could replicate a series of static and dynamic observations from a historical 1-gram corpora: corpus size, frequency distributions, and turnover within those frequency distributions." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-104", "text": "Our goal was to capture two static and three dynamic properties of word frequency statistics in one model." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-105", "text": "The static properties are not only the well-known (a) Zipf's law, which a range of proportionate-advantage models can replicate, but also (b) Heaps law." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-106", "text": "The dynamic properties are (c) the continual turnover in words ranked by 7 popularity, (d) the decline in that turnover rate through time, and (e) the relationship between list size and turnover, which we call the turnover profile." 
}, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-107", "text": "We found that, although the full-sample Neutral model (FNM) predicts the Zipf's law in ranked word frequencies, the FNM does not replicate Heaps law between corpus and vocabulary size, or the concavity in the non-linear relationship between list size y and turnover z y , or the slowing of this turnover through time among English words." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-108", "text": "It is notable that we found it impossible to capture all five of these properties at once with the FNM." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-109", "text": "It was a bit like trying to juggle five balls, as soon as the FNM could replicate some of those properties, it dropped the others." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-110", "text": "Having explored the FNM under broad range of under a range of parameter combinations, we ultimately determined that it could never replicate all these properties at once." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-111", "text": "This is mainly because both vocabulary size in the FNM is proportional to corpus size (rather than roughly the square root of corpus size as in Heaps law) and also because turnover in FNM should increase slightly with growing population, not decrease as we see in the 1-gram data over 300 years." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-112", "text": "Other hypotheses to modify the FNM, such as introducing a conformity bias [2], can also be ruled out." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-113", "text": "In the case of conformity bias-where agents choose high-frequency words with even greater probability than just in proportion to frequency-both the Zipf law and turnover deteriorate under strong conformity in ways that mis-match with the data." 
}, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-114", "text": "What did ultimately work very well was our partial-sampling Neutral model, or PNM (Fig 1b) , which models a growing sample from a fixed-sized FNM." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-115", "text": "Our PNM, which takes exponentially increasing sample sizes from a neutrally evolved latent population, replicated the Zipf's law, Heaps law, and turnover patterns in the 1-gram data." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-116", "text": "Although it did not replicate exactly the particular 1-gram corpus we used here, the Heaps law exponent yielded by the PNM does fall within the range-from 0.44 to 0.54-observed in different English 1-gram corpora [42] ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-117", "text": "Among all features we attempted to replicate, the one mismatch between PNM and the 1-gram data is that the PNM yielded an order of magnitude fewer vocabulary words for a given corpus size, while increasing with corpus size according to the same Heaps law exponent." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-118", "text": "The reason for this mismatch appears to be a computational constraint: we could not run the PNM with exponential growth quite as large as that of the actual 300 years of exponential growth in the real English corpus." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-119", "text": "As a heuristic device, we consider the fixed-size FNM to represent a canonical literature, while the growing sample represents the real world of exponentially growing numbers of books published ever year in English." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-120", "text": "Of course, the world is not as simple as our model; there is no official fixed canon, that canon does not strictly copy words from the previous year only and there are plenty of words being invented that occur outside this canon." 
}, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-121", "text": "Our canonical model of the PNM differs somewhat from the explanation by [42] , in which a \"decreasing marginal need for additional words\" as the corpus grows is underlain by the \"dependency network between the common words ... and their more esoteric counterparts." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-122", "text": "\" In our PNM representation, there is no network structure between words at all, such as \"inter-word statistical dependencies\" [44] or grammar as a hierarchical network structure between words [20] ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-123", "text": "Since the PNM performed quite well in replicating multiple static and dynamic statistical properties of 1-grams simultaneously, which the FNM could not do, we find two insights." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-124", "text": "The first is that the FNM remains a powerful representation of word usage dynamics [13, 45, 26, 24, 9, 5] , but it may need to be embedded in a larger sampling process in order to represent the world." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-125", "text": "Case studies where the PNM succeeds and the FNM fails could represent situations where mass attention is focused on a small subset of the cultural variants." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-126", "text": "The same idea seems appropriate for a digital world, where many cultural choices are pre-sorted in ranked lists [24] ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-127", "text": "In the present century, published books contain only a few percent of the verbiage recorded online, with the volume of digital data doubling about every three years." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-128", "text": "Centuries of prior evolution in published English word use provides valuable context for future study of this digital transition." 
}, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-129", "text": "8" }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-130", "text": "Our aim is to compare key summary statistics from simulated data generated by the hypothetical FNM and PNM processes with summary statistics from Google 1-gram data." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-131", "text": "See Acknowledgements for data source address and the repository location for the Python code used to generate the FNM and PNM." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-132", "text": "The FNM assumes words in a population at time t are selected at random from the population of books at time t \u2212 1." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-133", "text": "The population size Nt increases exponentially, N0e 0.021t , through time to simulate the exponentially increasing corpus size observed in the Google n-grams data [8] ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-134", "text": "We ran a genetic algorithm (described in the Appendix 2) to search the model state space to obtain parameter combinations-latent corpus size Nt, innovation fraction \u00b5 and initial population size N0-that yielded similar summary statistics to the 1-gram data." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-135", "text": "With the corpus growth exponent fixed at 0.021, initial corpus size, N0, was constrained by computational capacity." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-136", "text": "Following the genetic algorithm search, the model was initialized with population size N0 = 3000 and invention fraction \u00b5 = 0.003." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-137", "text": "Once steady state was achieved, we permitted the population size in each successive generation to increase at an exponential growth rate comparable to the average annual growth rate of Google 1-gram data until it finally reached N300 = 1.5 million by time step t = 301." 
}, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-138", "text": "At each time t in the FNM, a new set of Nt words enter the modeled corpus." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-139", "text": "Each word in the corpus, at time t, is either a copy of a word from the previous generation of books, with probability 1 \u2212 \u00b5, or else invented as a new word with probability \u00b5. Each of the copied words is selected from vt\u22121 possible words (the vocabulary in the previous time step), which follow a discrete Zipf's law distribution with the probability a word is selected being proportional to the number of copies the word had in the previous population in time step t \u2212 1 [7] ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-140", "text": "The PNM, represented schematically in Fig 1, draws an exponentially increasing sample (with replacement) from a latent neutrally-evolving canon." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-141", "text": "We designate the number of words in the sample as St, and the cumulative number of words in the canon as Nt, which grows by a fixed number of words in each time step." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-142", "text": "This exponentially increasing sample, S0e \u03b1t , has an initial population size S0 = 3000, growth exponent \u03b1 = 0.021, yielding a final sample size S300 = 1.5 million, matching the FNM." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-143", "text": "The latent population evolves by the rules of the FNM, but with a constant population size of 10000 for each year t (representing a canonical literature from which the main body of authors sample)." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-144", "text": "The cumulative canon, Nt, thus grows by 10,000 words per year." 
}, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-145", "text": "The partial sample, St, at time t can copy words from all canonical literature, Nt, up to that time step." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-146", "text": "We set \u00b5 = 0.003 and run for t = 301 time steps representing years between 1700 and 2000, which are the same parameters used in the FNM." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-147", "text": "The 1-gram data are available as csv files directly from Google's Ngrams site [25] ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-148", "text": "As in a previous study [1], we removed 1-grams that are common symbols or numbers, and 1-grams containing the same consonant three or more times consecutively." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-149", "text": "As in our other studies [1, 8, 6 ], we normalized the count of 1-grams using the yearly occurrences of the most common English word, the." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-150", "text": "Although we track 1-grams from the year 1700, for turnover statistics we follow other studies [42] in being cautious about the n-grams record before the year 1800, due to misspelled words before 1800 that were surely digital scanning errors related to antique printing styles of that may conflate letters such as 's' and 'f' (e.g., myfelf, yourfelf, provifions, increafe, afked etc)." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-151", "text": "The code used for modeling is available at: https://github.com/dr2g08/Neutral-evolution-and-turnover-over-centuries-of-English-word-popularity." 
}, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-152", "text": "----------------------------------" }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-153", "text": "**INTRODUCTION**" }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-154", "text": "English has evolved continually over the centuries, in the branching off from antecedent languages in Indo-European prehistory [34, 39] , in the rates of regularisation of verbs [34] and in the waxing and waning in the popularity of individual words [3, 13, 37] ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-155", "text": "At a much finer scale of time and population, languages change through modifications and errors in the learning process [14, 27] ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-156", "text": "This continual change and diversity contrasts with the simplicity and consistency of Zipf's law, by which the frequency a word, f , is inversely proportional to its rank k, as f \u223c k \u2212\u03b3 and Heaps law, by which vocabulary size scales sub-linearly with total number of words, across diverse textual and spoken samples [32, 41, 46, 49, 15, 21, 48, 42] ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-157", "text": "The Google Ngram corpus [37] provides new support for these statistical regularities in word frequency dynamics at timescales from decades to centuries [22, 41, 42, 1, 28] ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-158", "text": "With annual counts of n-grams -an n-gram being n consecutive character strings, separated by spaces -derived from millions of books over multiple centuries [35] , the n-gram data now covers English books from the year 1500 to year 2008." 
}, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-159", "text": "In English, the Zipf's law in the n-gram data [41] exhibits two regimes: one among words with frequencies above about 0.01% (Zipf's exponent \u03b3 \u2248 1) and another (\u03b3 \u2248 1.4) among words with frequency below 0.0001% [42] ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-160", "text": "The latter Zipf's law exponent \u03b3 of 1.4 is equivalent to a probability distribution function (PDF) exponent, \u03b1, of about 1.7 (\u03b1 = 1 + 1/\u03b3)." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-161", "text": "In addition to the well-known Zipf's law, word frequency data have at least two other statistical properties." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-162", "text": "One, known as Heaps law, refers to the way that vocabulary size scales sub-linearly with corpus size (raw word count)." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-163", "text": "The n-gram data show Heaps law in that, if N t is corpus size and v t is vocabulary size at time t, then v t \u2248 N \u03b2 t , with \u03b2 \u2248 0.5, for all English words in the corpus [42] ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-164", "text": "If the n-gram corpus is truncated by a minimum word count, then as that minimum is raised the Heaps scaling exponent increases from \u03b2 < 0.5, approaching \u03b2 < 1 [42] ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-165", "text": "The other statistical property is dynamic turnover in the ranked list of most commonly used words." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-166", "text": "This can be measured in terms of how many words are replaced through time on \"Top y\" ranked lists of different sizes y of most frequently-used words [12, 17, 19, 23] ." 
}, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-167", "text": "We can define this turnover z y (t) as the number of new words to have entered the top y most common words in year t, which is equivalent to the the top y in that year." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-168", "text": "The plotting of turnover z y for different list sizes y can therefore be useful in characterising turnover dynamics [2] ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-169", "text": "Many functional or network models readily yield the static Zipf distribution [21, 15] and Heaps law [36] , but not the dynamic aspects such as turnover." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-170", "text": "Here we focus on how Heaps law and Zipf's law can be modeled together with continual turnover of words within the rankings by frequency [4, 23] ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-171", "text": "We focus on the 1-grams in Google's English 2012 data set, which samples English language books published in any country [25] ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-172", "text": "----------------------------------" }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-173", "text": "**NEUTRAL MODELS OF VOCABULARY CHANGE**" }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-174", "text": "One promising, parsimonious approach incorporates the class of neutral evolutionary models [11, 12, 7, 24, 38] that are now proving insightful for language transmission [13, 10, 45] ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-175", "text": "The null hypothesis of a Neutral model is that copying is undirected, without biases or different 'fitnesses' of the words being replicated [2, 29] ." 
}, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-176", "text": "A basic neutral model, which we will call the full-sampling Neutral model (FNM), would assume simply that authors choose to write words by copying those published in the past and occasionally inventing or introducing new words." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-177", "text": "As shown in Fig 1a, the FNM represents each word choice by an author as selecting at random among the N t words that were published in the previous year [45, 10] ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-178", "text": "This copying occurs with probability 1 \u2212 \u00b5, where \u00b5 1 is the fixed, dimensionless probability that an author invents a new word (even the word had originated somewhere 'outside' books, e.g. in spoken slang)." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-179", "text": "Each newly-invented word enters with frequency one, regardless of N t ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-180", "text": "In terms of the modeled corpus, a total of about \u00b5N t unique new words are invented per time step." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-181", "text": "Note that N t represents the total number of written words, or corpus size, for year t, which contrasts with the smaller \"vocabulary\" size, v t , defined as the number of different words in each year t regardless of their frequency of usage." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-182", "text": "As has been well demonstrated, the FNM readily yields Zipf's law [11, 9, 47] , which can also be shown analytically (see Appendix 1) ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-183", "text": "Also, simulations of the FNM show that the resulting Zipf distribution undergoes dynamic turnover [12] ." 
}, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-184", "text": "Extensive simulations [19] show that when list size y is small compared to the corpus (0.15y < N t \u00b5), this neutral turnover z y per time step is more precisely approximated by:" }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-185", "text": "where n is the number of words per time interval." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-186", "text": "This prediction can be visualized by plotting the measured turnover z y for different list sizes y. The FNM predicts the results to follow z y \u221d y 0.86 , such that departures from this expected curve can be identified to indicate biases such as conformity or anti-conformity [2] ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-187", "text": "It would appear from eq. 1 that turnover should increase with corpus size." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-188", "text": "This is the nominal equilibrium for FNM with constant N t ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-189", "text": "If corpus size N t in the FNM is growing exponentially with time, however, then there may be no such nominal equilibrium." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-190", "text": "In this case we predict that the turnover z y can actually decrease with time as N t increases." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-191", "text": "This is because newly invented words start with frequency one, and under the neutral model they must essentially make a stochastic walk into the top 100, say." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-192", "text": "As N t grows, so does the minimum frequency needed to break into the top 100." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-193", "text": "As the \"bar\" is raised, words are more likely to 'die' before they ever reach the bar by stochastic walk [43] ." 
}, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-194", "text": "As a result, turnover in the Top y can slow down over time and growth of N t ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-195", "text": "The FNM does not, however, readily yield Heaps law (v t = N \u03b2 t , where \u03b2 < 1), for which \u03b2 \u2248 0.5 among the 1-gram data for English [42] ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-196", "text": "In the FNM, the expected exponent \u03b2 is 1.0, as the number of different variants (vocabulary) normally scales linearly with \u00b5N t [11] ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-197", "text": "While the FNM has been a powerful null model, in the case of books, we can make a notable improvement to account for the fact that most published material goes unnoticed while a relatively small portion of the corpus is highly visible." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-198", "text": "To name a few examples across the centuries, literally billions of copies of the Bible and the works of Shakespeare have been read since the seventeenth century, as well as tens or hundreds of millions of copies of works by Voltaire, Swift, Austen, Dickens, Tolkien, Fleming, Rawling and so on." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-199", "text": "While these and hundreds more books become considered part of the \"Western Canon,\" that canon is constantly evolving [28] and many books that were enormously popular in their time -e.g., Arabian Nights or the works of Fanny Burney-fall out of favour." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-200", "text": "As the published corpus has grown exponentially over the centuries, early authors were more able to sample the full range of historically published works, whereas contemporary authors sample from an increasingly small and more recent fraction of the corpus, simply due to its exponential expansion [28, 40] ." 
}, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-201", "text": "As a simple way of capturing this, we propose a modified neutral model, called the partial-sampling Neutral model (PNM), of an evolving \"canon\" that is sampled by an exponentially-growing corpus of books." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-202", "text": "As shown in Fig 1b, the PNM represents an exponentially growing number of books that sample words from a fixed size canon over all previous years since 1700." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-203", "text": "Our PNM represents a world where there exists an evolving canonical literature as a relatively small subset of the world's books on which all writers are educated." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-204", "text": "As new contributions to the canon are contributed, authors sample from the recent generation of writers with occasional innovation." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-205", "text": "Because the canon is a high-visibility subset of all books, only a fixed, constant number words of text per year are allowed into a year's canon." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-206", "text": "The rest of the population learns from the cumulative canon since our chosen reference year of 1700." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-207", "text": "----------------------------------" }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-208", "text": "**RESULTS**" }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-209", "text": "The average result from 100 runs in each of the FNM and PNM were used to match summary statistics with the 1-gram data." 
}, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-210", "text": "Several key statistical results emerge from analysis of the 1-gram data which we compare the FNM to the PNM in terms of these results: (1) Heaps law, which is the sublinear scaling of vocabulary size with corpus size, (2) a Zipf's law frequency distribution for unique words, (3) a rate of turnover that decreases exponentially with time and a turnover vs popular list size that is approximately linear." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-211", "text": "Here we describe our results in terms of rank-frequency distributions, turnover and corpus and vocabulary size." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-212", "text": "We compare the partial-sample Neutral model (PNM) to the full 1-gram data for English." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-213", "text": "First, we check that the model replicates the Zipf's law that characterizes the 1-gram frequencies in multiple languages [41] ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-214", "text": "Our own maximum likelihood determinations, applying available code [15] to the Google 1-gram data, confirm that the mean \u03b1 = 1.75 \u00b1 0.12 for the Zipf's law over all English words in the hundred years from 1700 to 1800 (beyond 1800, the corpus size becomes too large for our computation)." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-215", "text": "Normalising by the word count [21] , the form of the Zipf represented by different colored circles in each box, is copied (arrows) from from the previous year t \u2212 1 with probability 1 \u2212 \u00b5, or newly-invented with probability \u00b5. The FNM shown in (a) has a corpus size N t that grows through time." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-216", "text": "In (b) the PNM samples from all previous results of the FNM since the initial time step representing year 1700." 
}, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-217", "text": "The PNM population grows exponentially (N 0 e 0.021t ) through time, from 3000 to 1.5 million." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-218", "text": "As the PNM samples from all previous years of FNM population, the PNM samples from a corpus that increases linearly (by 10,000 words per year) from 10,000 words in year 1700 to 3 million words by year 2000." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-219", "text": "For the PNM, the big blue arrows represent how each generation can sample any year of the canon randomly, all the way back to 1700, the smaller arrows representing individual sampling events." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-220", "text": "distribution is virtually identical for each year of the dataset, reaching eight orders of magnitude by the year 2000 (Fig 2a) ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-221", "text": "The FNM replicates the Zipf (Fig 2b) but the PNM replicates it better and over more orders of magnitude (Fig 2c) ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-222", "text": "It was not computationally possible with either the FNM or PNM to replicate the Zipf across all nine orders of magnitude, as the modeled corpus size N t grows exponentially (Fig 2d) ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-223", "text": "Fig 3a illustrates the relationship between corpus size and vocabulary size in our partial-sampling Neutral model." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-224", "text": "Due to the exponentially increasing sample size, the ratio of vocabulary size over corpus size becomes increasingly small, thus the model gives us the sub-linear relationship described by v t = N \u03b2 t , where \u03b2 < 1." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-225", "text": "On the double-logarithmic plot in Fig 3a, the Heaps law exponent is equivalent to the slope of the data series." 
}, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-226", "text": "The PNM matches the 1-gram data with Heaps exponent (slope) of about 0.5, whereas the FNM, with exponent about 1.0, does not." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-227", "text": "Fig 3b shows how 100 runs of the PNM yields a Heaps law exponent within the range derived by [42] for several different n-grams corpora (all English, English fiction, English GB, English US and English 1M)." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-228", "text": "We also The PNM yields Heaps law exponent \u03b2 \u2248 0.52 \u00b1 0.006, within the range of English corpora, whereas the FNM yields a mismatch with the data of \u03b2 \u2248 1 \u00b1 0.002 (Fig 3b) ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-229", "text": "In Fig 3a, there is a constant offset on the y-axis between vocabulary size in the PNM (\u03b1 = 0.02, N = 10000) versus the 1-gram data." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-230", "text": "Both data series follow Heaps exponent b \u2248 0.5, but the coefficient, A, is several times larger for the 1-gram data than for the PNM." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-231", "text": "We do not think this is due to our choice of canon size N in the PNM, because if we halve it to 5000, the resulting A does not significantly change." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-232", "text": "The difference could be resolved, however, with larger The Heaps law exponents, \u03b2, for the data series on the left, as well as additional data series, using Table 1 in [42] : all English 1-grams: 0.54 \u00b1 0.01; English fiction: 0.49 \u00b1 0.01; English GB: 0.44 \u00b1 0.01; English US: 0.51 \u00b1 0.01." 
}, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-233", "text": "The 100 independent runs of each neutral model, using parameters listed in the text, yielded \u03b2 = 0.52 \u00b1 0.07 for the PNM, and \u03b2 = 1.00 \u00b1 0.002 for the FNM (not shown)." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-234", "text": "exponential growth in PNM corpus size, S t , over the 300 time steps." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-235", "text": "Computationally, we could only model the PNM with growth exponent \u03b1 = 0.02-using \u03b1 = 0.03, as would fit the actual growth of the n-gram corpus over 300 years [8] , makes the PNM too large to compute." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-236", "text": "Nevertheless, we can roughly estimate the effect; when we reduce \u03b1 from 0.02 to 0.01, while keeping N = 10000, we find that A averaged over one hundred PNM runs is reduced from 6.3 \u00b1 0.5 to 1.4 \u00b1 0.3." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-237", "text": "Given an exponential relationship, increasing alpha to 0.03 would increase A to about 20, which is within the magnitude of offset we see in Fig 3a." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-238", "text": "Of course, this question can be resolved precisely when the much larger PNM can be simulated." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-239", "text": "Regarding dynamic turnover, we consider turnover in ranked lists of size y, varying the list size y from the top 1000 most common words down to the top 10 (the top 1 word has been \"the\" since before the year 1700)." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-240", "text": "We measure turnover in the word-frequency rankings by determining the top y rankings independently for each year, and then counting the number of new words to appear on the list from one year to the next." 
}, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-241", "text": "Fig 4 shows the number of 1-grams to drop out of the top 1000, top 500 and top 200 per year in the 1-gram data." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-242", "text": "Annual turnover among the top 1000 and the top 500 decreased exponentially from the year 1700 to 2000, proportional to e \u22120.012t (r 2 > 0.91 for both), where t is years since 1700." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-243", "text": "This exponential decay equates to roughly a halving of turnover per century." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-244", "text": "Since the corpus size was increasing with time, Fig 4 effectively also shows how turnover in top y list decreases as corpus size increases in the partial-sampling Neutral model, where the corpus size grows faster than the number relative to speakers over the years." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-245", "text": "The exponential decay in turnover in the partial-sampling Neutral model is markedly different than the base Neutral model, in which turnover would be growing as corpus size grew, due to term n 0.013 s in equation 1." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-246", "text": "Finally, we also look at the \"turnover profile\", plotting list size y versus turnover z y for different time slices (Fig 5) ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-247", "text": "For all words, z y \u221d y 1.26 for different time periods (Fig 5) ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-248", "text": "We can then compare the turnover profile for the 1-grams to the prediction from eq. 1 that turnover will be proportional to y 0.86 , as shown in Fig 5b ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-249", "text": "Table 1 lays out the specific predictions of each of the models and how they fare against empirical data." 
}, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-250", "text": "Bands indicate 95% range of simulated values." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-251", "text": "While the predictions for the FNM and PNM are similar for y = 50 and for the year 1800 (Fig 4a and Fig 5a) , they do differ (Fig 4c and Fig 5c) ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-252", "text": "Although the FNM can fit Zipf's Law with the right parameters, it cannot also fit Heaps law or the turnover patterns at the same time as matching Zipf's Law." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-253", "text": "In contrast, the PNM can fit Zipf's law, Heaps law exponent (Fig 3a) , and the 2000 series in Fig 4 (but starts to breakdown at y > 150)." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-254", "text": "Neither the FNM nor the PNM does very well at y = 200." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-255", "text": "----------------------------------" }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-256", "text": "**DISCUSSION**" }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-257", "text": "We have explored how 'neutral' models of word choice could replicate a series of static and dynamic observations from a historical 1-gram corpora: corpus size, frequency distributions, and turnover within those frequency distributions." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-258", "text": "Our goal was to capture two static and three dynamic properties of word frequency statistics in one model." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-259", "text": "The static properties are not only the well-known (a) Zipf's law, which a range of proportionate-advantage models can replicate, but also (b) Heaps law." 
}, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-260", "text": "The dynamic properties are (c) the continual turnover in words ranked by popularity, (d) the decline in that turnover rate through time, and (e) the relationship between list size and turnover, which we call the turnover profile." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-261", "text": "We found that, although the full-sample Neutral model (FNM) predicts the Zipf's law in ranked word frequencies, the FNM does not replicate Heaps law between corpus and vocabulary size, or the concavity in the non-linear relationship between list size y and turnover z y , or the slowing of this turnover through time among English words." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-262", "text": "It is notable that we found it impossible to capture all five of these properties at once with the FNM." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-263", "text": "It was a bit like trying to juggle five balls, as soon as the FNM could replicate some of those properties, it dropped the others." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-264", "text": "Having explored the FNM under broad range of under a range of parameter combinations, we ultimately determined that it could never replicate all these properties at once." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-265", "text": "This is mainly because both vocabulary size in the FNM is proportional to corpus size (rather than roughly the square root of corpus size as in Heaps law) and also because turnover in FNM should increase slightly with growing population, not decrease as we see in the 1-gram data over 300 years." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-266", "text": "Other hypotheses to modify the FNM, such as introducing a conformity bias [2] , can also be ruled out." 
}, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-267", "text": "In the case of conformity bias-where agents choose high-frequency words with even greater probability than just in proportion to frequency-both the Zipf law and turnover deteriorate under strong conformity in ways that mis-match with the data." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-268", "text": "What did ultimately work very well was our partial-sampling Neutral model, or PNM (Fig 1b) , which models a growing sample from a fixed-sized FNM." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-269", "text": "Our PNM, which takes exponentially increasing sample sizes from a neutrally evolved latent population, replicated the Zipf's law, Heaps law, and turnover patterns in the 1-gram data." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-270", "text": "Although it did not replicate exactly the particular 1-gram corpus we used here, the Heaps law exponent yielded by the PNM does fall within the range-from 0.44 to 0.54-observed in different English 1-gram corpora [42] ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-271", "text": "Among all features we attempted to replicate, the one mismatch between PNM and the 1-gram data is that the PNM yielded an order of magnitude fewer vocabulary words for a given corpus size, while increasing with corpus size according to the same Heaps law exponent." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-272", "text": "The reason for this mismatch appears to be a computational constraint: we could not run the PNM with exponential growth quite as large as that of the actual 300 years of exponential growth in the real English corpus." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-273", "text": "As a heuristic device, we consider the fixed-size FNM to represent a canonical literature, while the growing sample represents the real world of exponentially growing numbers of books published ever year in English." 
}, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-274", "text": "Of course, the world is not as simple as our model; there is no official fixed canon, that canon does not strictly copy words from the previous year only and there are plenty of words being invented that occur outside this canon." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-275", "text": "Our canonical model of the PNM differs somewhat from the explanation by [42] , in which a \"decreasing marginal need for additional words\" as the corpus grows is underlain by the \"dependency network between the common words ... and their more esoteric counterparts." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-276", "text": "\" In our PNM representation, there is no network structure between words at all, such as \"inter-word statistical dependencies\" [44] or grammar as a hierarchical network structure between words [20] ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-277", "text": "----------------------------------" }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-278", "text": "**CONCLUSION**" }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-279", "text": "Since the PNM performed quite well in replicating multiple static and dynamic statistical properties of 1-grams simultaneously, which the FNM could not do, we find two insights." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-280", "text": "The first is that the FNM remains a powerful representation of word usage dynamics [13, 45, 26, 24, 9, 5] , but it may need to be embedded in a larger sampling process in order to represent the world." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-281", "text": "Case studies where the PNM succeeds and the FNM fails could represent situations where mass attention is focused on a small subset of the cultural variants." 
}, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-282", "text": "The same idea seems appropriate for a digital world, where many cultural choices are pre-sorted in ranked lists [24] ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-283", "text": "In the present century, published books contain only a few percent of the verbiage recorded online, with the volume of digital data doubling about every three years." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-284", "text": "Centuries of prior evolution in published English word use provides valuable context for future study of this digital transition." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-285", "text": "----------------------------------" }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-286", "text": "**MODELS AND DATA**" }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-287", "text": "Our aim is to compare key summary statistics from simulated data generated by the hypothetical FNM and PNM processes with summary statistics from Google 1-gram data." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-288", "text": "See Acknowledgements for data source address and the repository location for the Python code used to generate the FNM and PNM." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-289", "text": "----------------------------------" }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-290", "text": "**NEUTRAL MODELS**" }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-291", "text": "The FNM assumes words in a population at time t are selected at random from the population of books at time t \u2212 1." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-292", "text": "The population size Nt increases exponentially, N0e 0.021t , through time to simulate the exponentially increasing corpus size observed in the Google n-grams data [8] ." 
}, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-293", "text": "We ran a genetic algorithm (described in the Appendix 2) to search the model state space to obtain parameter combinations-latent corpus size Nt, innovation fraction \u00b5 and initial population size N0-that yielded similar summary statistics to the 1-gram data." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-294", "text": "With the corpus growth exponent fixed at 0.021, initial corpus size, N0, was constrained by computational capacity." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-295", "text": "Following the genetic algorithm search, the model was initialized with population size N0 = 3000 and invention fraction \u00b5 = 0.003." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-296", "text": "Once steady state was achieved, we permitted the population size in each successive generation to increase at an exponential growth rate comparable to the average annual growth rate of Google 1-gram data until it finally reached N300 = 1.5 million by time step t = 301." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-297", "text": "At each time t in the FNM, a new set of Nt words enter the modeled corpus." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-298", "text": "Each word in the corpus, at time t, is either a copy of a word from the previous generation of books, with probability 1 \u2212 \u00b5, or else invented as a new word with probability \u00b5. Each of the copied words is selected from vt\u22121 possible words (the vocabulary in the previous time step), which follow a discrete Zipf's law distribution with the probability a word is selected being proportional to the number of copies the word had in the previous population in time step t \u2212 1 [7] ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-299", "text": "The PNM, represented schematically in Fig 1, draws an exponentially increasing sample (with replacement) from a latent neutrally-evolving canon." 
}, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-300", "text": "We designate the number of words in the sample as St, and the cumulative number of words in the canon as Nt, which grows by a fixed number of words in each time step." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-301", "text": "This exponentially increasing sample, S0e \u03b1t , has an initial population size S0 = 3000, growth exponent \u03b1 = 0.021, yielding a final sample size S300 = 1.5 million, matching the FNM." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-302", "text": "The latent population evolves by the rules of the FNM, but with a constant population size of 10000 for each year t (representing a canonical literature from which the main body of authors sample)." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-303", "text": "The cumulative canon, Nt, thus grows by 10,000 words per year." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-304", "text": "The partial sample, St, at time t can copy words from all canonical literature, Nt, up to that time step." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-305", "text": "We set \u00b5 = 0.003 and run for t = 301 time steps representing years between 1700 and 2000, which are the same parameters used in the FNM." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-306", "text": "----------------------------------" }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-307", "text": "**1-GRAM DATA**" }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-308", "text": "The 1-gram data are available as csv files directly from Google's Ngrams site [25] ." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-309", "text": "As in a previous study [1] , we removed 1-grams that are common symbols or numbers, and 1-grams containing the same consonant three or more times consecutively." 
}, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-310", "text": "As in our other studies [1, 8, 6] , we normalized the count of 1-grams using the yearly occurrences of the most common English word, the." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-311", "text": "Although we track 1-grams from the year 1700, for turnover statistics we follow other studies [42] in being cautious about the n-grams record before the year 1800, due to misspelled words before 1800 that were surely digital scanning errors related to antique printing styles of that may conflate letters such as 's' and 'f' (e.g., myfelf, yourfelf, provifions, increafe, afked etc)." }, { "sent_id": "d68bb5264d157cc4c2d9fa9c8f82b6-C001-312", "text": "The code used for modeling is available at: https://github.com/dr2g08/Neutral-evolution-and-turnover-over-centuries-of-English-word-popularity." } ], "y": { "@BACK@": { "gold_contexts": [ [ "d68bb5264d157cc4c2d9fa9c8f82b6-C001-9" ], [ "d68bb5264d157cc4c2d9fa9c8f82b6-C001-10" ], [ "d68bb5264d157cc4c2d9fa9c8f82b6-C001-11" ], [ "d68bb5264d157cc4c2d9fa9c8f82b6-C001-15" ], [ "d68bb5264d157cc4c2d9fa9c8f82b6-C001-16" ], [ "d68bb5264d157cc4c2d9fa9c8f82b6-C001-45" ], [ "d68bb5264d157cc4c2d9fa9c8f82b6-C001-75" ], [ "d68bb5264d157cc4c2d9fa9c8f82b6-C001-156" ], [ "d68bb5264d157cc4c2d9fa9c8f82b6-C001-157" ], [ "d68bb5264d157cc4c2d9fa9c8f82b6-C001-159" ], [ "d68bb5264d157cc4c2d9fa9c8f82b6-C001-163" ], [ "d68bb5264d157cc4c2d9fa9c8f82b6-C001-164" ], [ "d68bb5264d157cc4c2d9fa9c8f82b6-C001-227" ] ], "cite_sentences": [ "d68bb5264d157cc4c2d9fa9c8f82b6-C001-9", "d68bb5264d157cc4c2d9fa9c8f82b6-C001-10", "d68bb5264d157cc4c2d9fa9c8f82b6-C001-11", "d68bb5264d157cc4c2d9fa9c8f82b6-C001-15", "d68bb5264d157cc4c2d9fa9c8f82b6-C001-16", "d68bb5264d157cc4c2d9fa9c8f82b6-C001-45", "d68bb5264d157cc4c2d9fa9c8f82b6-C001-75", "d68bb5264d157cc4c2d9fa9c8f82b6-C001-156", "d68bb5264d157cc4c2d9fa9c8f82b6-C001-157", "d68bb5264d157cc4c2d9fa9c8f82b6-C001-159", 
"d68bb5264d157cc4c2d9fa9c8f82b6-C001-163", "d68bb5264d157cc4c2d9fa9c8f82b6-C001-164", "d68bb5264d157cc4c2d9fa9c8f82b6-C001-227" ] }, "@SIM@": { "gold_contexts": [ [ "d68bb5264d157cc4c2d9fa9c8f82b6-C001-75" ], [ "d68bb5264d157cc4c2d9fa9c8f82b6-C001-116" ], [ "d68bb5264d157cc4c2d9fa9c8f82b6-C001-227" ], [ "d68bb5264d157cc4c2d9fa9c8f82b6-C001-270" ] ], "cite_sentences": [ "d68bb5264d157cc4c2d9fa9c8f82b6-C001-75", "d68bb5264d157cc4c2d9fa9c8f82b6-C001-116", "d68bb5264d157cc4c2d9fa9c8f82b6-C001-227", "d68bb5264d157cc4c2d9fa9c8f82b6-C001-270" ] }, "@USE@": { "gold_contexts": [ [ "d68bb5264d157cc4c2d9fa9c8f82b6-C001-80" ], [ "d68bb5264d157cc4c2d9fa9c8f82b6-C001-150" ], [ "d68bb5264d157cc4c2d9fa9c8f82b6-C001-232" ], [ "d68bb5264d157cc4c2d9fa9c8f82b6-C001-311" ] ], "cite_sentences": [ "d68bb5264d157cc4c2d9fa9c8f82b6-C001-80", "d68bb5264d157cc4c2d9fa9c8f82b6-C001-150", "d68bb5264d157cc4c2d9fa9c8f82b6-C001-232", "d68bb5264d157cc4c2d9fa9c8f82b6-C001-311" ] }, "@DIF@": { "gold_contexts": [ [ "d68bb5264d157cc4c2d9fa9c8f82b6-C001-121" ], [ "d68bb5264d157cc4c2d9fa9c8f82b6-C001-156" ], [ "d68bb5264d157cc4c2d9fa9c8f82b6-C001-195" ], [ "d68bb5264d157cc4c2d9fa9c8f82b6-C001-275" ] ], "cite_sentences": [ "d68bb5264d157cc4c2d9fa9c8f82b6-C001-121", "d68bb5264d157cc4c2d9fa9c8f82b6-C001-156", "d68bb5264d157cc4c2d9fa9c8f82b6-C001-195", "d68bb5264d157cc4c2d9fa9c8f82b6-C001-275" ] } } }, "ABC_392cbe849c1b8a69aae9923ade41aa_12": { "x": [ { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-2", "text": "To improve grammatical function labelling for German, we augment the labelling component of a neural dependency parser with a decision history." 
}, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-3", "text": "We present different ways to encode the history, using different LSTM architectures, and show that our models yield significant improvements, resulting in a LAS for German that is close to the best result from the SPMRL 2014 shared task (without the reranker)." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-4", "text": "----------------------------------" }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-5", "text": "**INTRODUCTION**" }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-6", "text": "For languages with a non-configurational word order and rich(er) morphology, such as German, grammatical function (GF) labels are essential for interpreting the meaning of a sentence." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-7", "text": "Case syncretism in the German case paradigm makes GF labelling a challenging task." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-8", "text": "See (1) for an example where the nouns in the sentence are ambiguous between different cases, which makes it hard for a statistical parser to recover the correct reading." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-9", "text": "We approach the problem of GF labelling as a subtask of dependency parsing, where we first generate unlabelled trees and, in the second step, try to find the correct labels." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-10", "text": "This pipeline architechture gives us more flexibility, allowing us to use the labeller in combination with our parser, but also to apply it to the unlabelled output of other parsing systems without the need to change or retraining the parsers." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-11", "text": "The approach also makes it straightforward to test different architectures for GF labelling." 
}, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-12", "text": "We are especially interested in the influence of different input structures representing different (surface versus structural) orders of the input." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-13", "text": "In particular, we compare models where we present the unlabelled tree in linear order with a model where we encode the parser output as a tree." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-14", "text": "We show that all models are able to learn GFs with a similar overall LAS, but the model where the tree is encoded in a breadth-first order outperforms all other models on labelling core argument GFs." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-15", "text": "----------------------------------" }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-16", "text": "**RELATED WORK**" }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-17", "text": "Grammatical function labelling is commonly integrated into syntactic parsing." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-18", "text": "Few studies have adressed the issue as a separate classification task." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-19", "text": "While most of them assign grammatical functions on top of constituency trees (Blaheta and Charniak, 2000; Jijkoun and de Rijke, 2004; Chrupa\u0142a and van Genabith, 2006; Klenner, 2007; Seeker et al., 2010) , less work has tried to predict GF labels for unlabelled dependency trees." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-20", "text": "One of them is McDonald et al. (2006) who first generate the unlabelled trees using a graph-based parser, and then model the assignment of dependency labels as a sequence labelling task." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-21", "text": "Another approach has been proposed by Zhang et al. 
(2017) who present a simple, yet efficient and accurate parsing model that generates unlabelled trees by identifying the most probable head for each token in the input." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-22", "text": "Then, in a post-processing step, they assign labels to each head-dependent pair, using a two-layer rectifier network." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-23", "text": "Dependency Parsing as Head Selection. Our labelling model is an extension of the parsing model of Zhang et al. (2017)." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-24", "text": "We use our own implementation of the head-selection parser and focus on the grammatical function labelling part." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-25", "text": "The parser uses a bidirectional LSTM to extract a dense, positional representation a_i of the word w_i at position i in a sentence:" }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-26", "text": "x_i is the input at position i, which is the concatenation of the word embeddings and the tag embeddings of word w_i." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-27", "text": "An artificial root token w_0 is added at the beginning of each sentence." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-28", "text": "The unlabelled tree is then built by selecting the most probable head for each word." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-29", "text": "The score of word w_j being the head of word w_i is computed by a single hidden layer neural network on their representations a_j and a_i." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-30", "text": "An additional classifier with two rectified hidden layers is used to predict dependency labels, and is trained separately from the unlabeled parsing component, in a pipeline architecture." 
}, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-31", "text": "The classifier predictions are based on the representations of the head and the dependent, b j and b i , which are the concatenation of the input and the bidirectional LSTM-based representations:" }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-32", "text": "Despite its simplicity and the lack of global optimisation, Zhang et al. (2017) report competitive results for English, Czech, and German." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-33", "text": "----------------------------------" }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-34", "text": "**LABELING DEPENDENCIES WITH HISTORY**" }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-35", "text": "Although the labelling approach in Zhang et al. (2017) is simple and efficient, looking at head and dependent only when assigning the labels comes with some disadvantages." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-36", "text": "First, some labels are easier to predict when we also take context into account, e.g. the parent and grandparent nodes or the siblings of the head or dependent." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-37", "text": "Consider, for example, the following sentence: Is this the future of chamber music? and its syntactic structure (figure 1)." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-38", "text": "If we only consider the nodes this and future, there is a chance that the edge between them is labelled as det (determiner)." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-39", "text": "However, if we also look at the local context, we know that node the to the left of future is more likely to be the determiner, and thus this should be assigned a different label." 
}, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-40", "text": "Second, when looking at the parser output, we notice some errors that are well-known from other local parsing models, such as the assignment of duplicate subjects for the same predicate." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-41", "text": "To address this issue, we propose an extended labelling model that incorporates a decision history." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-42", "text": "To that end, we design different LSTM architectures for the labelling task and compare their performance on German, Czech and English." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-43", "text": "Label prediction as a sequence labelling task Presenting the input to the labeller in sequential surface order does not seem very intuitive when we want to assign labels to a tree." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-44", "text": "This approach, however, was adapted by McDonald et al. (2006) ." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-45", "text": "In their work, they consider all dependents x j1 , ..., x jM of a head x i and label those edges (i, j1), ..., (i, jM ) in a sequence." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-46", "text": "We argue, however, that it is not enough to know the labels of the siblings, but that we also need to consider nodes at different levels in the tree." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-47", "text": "Therefore, when predicting the label for the current node, we consider all label decisions in the history, and feed them to a bidirectional LSTM." 
}, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-48", "text": "Given a sequence of nodes S = (w 1 , ..., w N ) and their corresponding head (h 1 , ..., h N ), at each recurrent step, we input the learned representation of the head and the dependent:" }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-49", "text": "After that, the concatenated hidden states [h" }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-50", "text": "] are projected to a softmax layer to predict the label." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-51", "text": "When presenting a tree as a sequence, we experiment with two different input orders:" }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-52", "text": "\u2022 BILSTM(L): Tree nodes are ordered according to their surface order in the sentence (linear order; figure 2)." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-53", "text": "\u2022 BILSTM(B): Tree nodes are ordered according to a breadth-first traversal (BFS) of the tree, starting from the root node." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-54", "text": "Here, siblings are closer to each other in the history." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-55", "text": "Is, future this, future the, future future, ROOT of, future chamber, music music, of ?, future" }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-56", "text": "future, ROOT is, future this, future the, future of, future ?, future music, of chamber, music" }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-57", "text": "Figure 2 Top-down tree LSTM Intuitively, it seems more natural to present the input as a tree structure when trying to predict the dependency labels." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-58", "text": "We do that by adopting the top-down tree LSTM model (Zhang et al., 2016 ) that processes nodes linked through dependency paths in a top-down manner." 
}, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-59", "text": "To make it comparable to the previous LSTM models, we only use one LSTM instead of four, and do not stack LSTMs." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-60", "text": "The hidden state is computed as follow:" }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-61", "text": "After that, we proceed as we did for the BI-LSTM models (see above)." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-62", "text": "Note that the processing order i is also the BFS order." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-63", "text": "We call this model TREELSTM (figure 3)." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-64", "text": "----------------------------------" }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-65", "text": "**EXPERIMENTS**" }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-66", "text": "Our interest is focussed on German, but to put our work in context, we follow Zhang et al. (2017) and report results also for English, which has a configurational word order, and for Czech, which has a free word order, rich morphology, and less ambiguity in the case paradigm than German." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-67", "text": "For English, we use the Penn Treebank (PTB) (Marcus et al., 1993) with standard training/dev/test splits." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-68", "text": "The POS tags are assigned using the Stanford POS tagger (Toutanova et al., 2003) with ten-way jackknifing, and constituency trees are converted to Stanford basic dependencies (De Marneffe et al., 2006) ." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-69", "text": "The German and Czech data come from the CoNLL-X shared task (Buchholz and Marsi, 2006) and our data split follows Zhang et al. (2017) ." 
}, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-70", "text": "As the CoNLL-X testsets are rather small (\u223c 360 sentences), we also" }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-71", "text": "----------------------------------" }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-72", "text": "**SETUP**" }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-73", "text": "We test different labelling models on top of the unlabelled trees produced by our re-implementation of the parsing as head selection model ( \u00a72)." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-74", "text": "We first train the unlabelled parsing models for the three languages." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-75", "text": "Unless stated otherwise, all parameters are set according to Zhang et al. (2017) , and tag embedding size was set to 40 for all languages." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-76", "text": "Please note that we do not use pre-trained embeddings in our experiments." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-77", "text": "In the next step, we train four different labelling models: the labeller of Zhang et al. (2017) that uses a rectifier neural network with two hidden layers (baseline), two bidirectional LSTM models (BILSTM(L) and BILSTM(B)), and one tree LSTM model (TREELSTM) ( \u00a73)." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-78", "text": "The hidden layer dimension in all LSTM models was set to 200." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-79", "text": "The models were trained for 10 epochs, and were optimized using Adam ( Kingma and Ba, 2015) with default parameters (initial learning rate 0.001, first momentum coefficient 0.9, second momentum coefficient 0.999)." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-80", "text": "We used L2 regularization with a coefficient of 10 \u22123 and max-norm regularization with an upper bound of 5.0." 
}, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-81", "text": "The dropout (Srivastava et al., 2014) rate was set to 0.05 for the input connections, and 0.5 for the rest." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-82", "text": "Table 1 shows the unlabelled attachment score (UAS) for the unlabelled trees and the labelled attachment scores (LAS) for the different labellers (excluding punctuation)." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-83", "text": "All history-based labelling models perform significantly better than the local baseline model, 1 but for English the improvements are smaller (0.3%) than for the nonconfigurational languages (\u223c0.7%)." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-84", "text": "While we tried to reimplement the model of Zhang et al. (2017) following the details in the paper, our reimplemented model yields higher scores for German, compared to the results in the paper." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-85", "text": "The scores for English are slightly lower since, in contrast to Zhang et al. (2017) , we do not use pre-trained embeddings." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-86", "text": "When using our historybased labellers, we get similar results for English (91.9%) and higher results for both Czech (84.1% vs. 81.7%) and German (91.0% vs. 89.6%) on the same data without using pre-trained embeddings or post-processing." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-87", "text": "----------------------------------" }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-88", "text": "**RESULTS**" }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-89", "text": "On the SPMRL 2014 shared task data, our results are only 0.3% lower than the ones of the winning system (Bj\u00f6rkelund et al., 2014) fectiveness of our models, we also ran our labeller on the unlabelled output of the SPMRL 2014 winning system and on unlabelled gold trees." 
}, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-90", "text": "On the output of the blended system LAS slightly improves from 88.62% to 88.76% (TREELSTM)." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-91", "text": "3 When applied to unlabelled gold trees, the distance between our models and the baseline becomes larger and the best of our history-based models (BILSTM(B), 97.38%) outperforms the original labeller of Zhang et al. (2017) (96.15%) by more than 1%." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-92", "text": "We would like to emphasize that our historybased LSTM labeller is practically simple and computationally inexpensive (as compared to global training or inference), so our model manages to preserve simplicity while significantly improving labelling performance." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-93", "text": "----------------------------------" }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-94", "text": "**DISCUSSION**" }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-95", "text": "Most strikingly, all three models seem to perform roughly the same, and the TREELSTM model fails to outperform the other two models." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-96", "text": "However, in comparison to the BILSTM models, the TREE-LSTM model has a smaller number of parameters, and the history only flows in one direction." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-97", "text": "The tree model also has a shorter history chain since nodes are linked by paths from the root (figure 3), which might explain why it does not yield better results than the linear LSTM models." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-98", "text": "The overall results suggest that the order in which the nodes are presented in the history does not have any impact on the labelling results." 
}, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-99", "text": "However, when looking at results for individual core argument functions (subject, direct object, etc.), a (table 2) ." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-100", "text": "4 Here we see the benefit of encoding the siblings close to each other in the history: For all core argument functions, the BILSTM(B) model outperforms the other models." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-101", "text": "To find out why the history-based models work better for Czech and German than for English, we compared the average dependency length as well as the variability in head direction (how often e.g. the head of a subject is positioned to the left, in relation to the total number of subjects)." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-102", "text": "Table 3 suggests that the success of the history-based models is not due to a better handling of long dependencies but that they are better in dealing with the uncertainty in head direction (also see Gulordava and Merlo (2016) )." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-103", "text": "----------------------------------" }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-104", "text": "**CONCLUSIONS**" }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-105", "text": "We have shown that GF labelling, which is of crucial importance for languages like German, can be improved by combining LSTM models with a decision history." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-106", "text": "All our models outperform the original labeller of Zhang et al. (2017) and give results in the same range as the best system from the SPMRL-2014 shared task (without the reranker), but with a much simpler model." }, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-107", "text": "Our results show that the history is especially important for languages that show more word order variation." 
}, { "sent_id": "392cbe849c1b8a69aae9923ade41aa-C001-108", "text": "Here, presenting the input in a structured BFS order not only significantly outperforms the baseline, but also yields improvements over the other LSTM models on core grammatical functions." } ], "y": { "@BACK@": { "gold_contexts": [ [ "392cbe849c1b8a69aae9923ade41aa-C001-21" ], [ "392cbe849c1b8a69aae9923ade41aa-C001-32" ], [ "392cbe849c1b8a69aae9923ade41aa-C001-35" ] ], "cite_sentences": [ "392cbe849c1b8a69aae9923ade41aa-C001-21", "392cbe849c1b8a69aae9923ade41aa-C001-32", "392cbe849c1b8a69aae9923ade41aa-C001-35" ] }, "@EXT@": { "gold_contexts": [ [ "392cbe849c1b8a69aae9923ade41aa-C001-23" ] ], "cite_sentences": [ "392cbe849c1b8a69aae9923ade41aa-C001-23" ] }, "@MOT@": { "gold_contexts": [ [ "392cbe849c1b8a69aae9923ade41aa-C001-35" ] ], "cite_sentences": [ "392cbe849c1b8a69aae9923ade41aa-C001-35" ] }, "@USE@": { "gold_contexts": [ [ "392cbe849c1b8a69aae9923ade41aa-C001-66" ], [ "392cbe849c1b8a69aae9923ade41aa-C001-69" ], [ "392cbe849c1b8a69aae9923ade41aa-C001-75" ], [ "392cbe849c1b8a69aae9923ade41aa-C001-77" ] ], "cite_sentences": [ "392cbe849c1b8a69aae9923ade41aa-C001-66", "392cbe849c1b8a69aae9923ade41aa-C001-69", "392cbe849c1b8a69aae9923ade41aa-C001-75", "392cbe849c1b8a69aae9923ade41aa-C001-77" ] }, "@DIF@": { "gold_contexts": [ [ "392cbe849c1b8a69aae9923ade41aa-C001-84" ], [ "392cbe849c1b8a69aae9923ade41aa-C001-85" ], [ "392cbe849c1b8a69aae9923ade41aa-C001-91" ], [ "392cbe849c1b8a69aae9923ade41aa-C001-106" ] ], "cite_sentences": [ "392cbe849c1b8a69aae9923ade41aa-C001-84", "392cbe849c1b8a69aae9923ade41aa-C001-85", "392cbe849c1b8a69aae9923ade41aa-C001-91", "392cbe849c1b8a69aae9923ade41aa-C001-106" ] } } }, "ABC_2c3a2999390b82f4e29b00d59f90f2_12": { "x": [ { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-2", "text": "We describe the CoNLL-2003 shared task: language-independent named entity recognition." 
}, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-3", "text": "We give background information on the data sets (English and German) and the evaluation method, present a general overview of the systems that have taken part in the task and discuss their performance." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-4", "text": "----------------------------------" }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-5", "text": "**INTRODUCTION**" }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-6", "text": "Named entities are phrases that contain the names of persons, organizations and locations." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-7", "text": "Example: [ORG U.N. ] This sentence contains three named entities: Ekeus is a person, U.N. is a organization and Baghdad is a location." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-8", "text": "Named entity recognition is an important task of information extraction systems." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-9", "text": "There has been a lot of work on named entity recognition, especially for English (see Borthwick (1999) for an overview)." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-10", "text": "The Message Understanding Conferences (MUC) have offered developers the opportunity to evaluate systems for English on the same data in a competition." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-11", "text": "They have also produced a scheme for entity annotation (Chinchor et al., 1999) ." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-12", "text": "More recently, there have been other system development competitions which dealt with different languages (IREX and CoNLL-2002) ." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-13", "text": "The shared task of CoNLL-2003 concerns language-independent named entity recognition." 
}, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-14", "text": "We will concentrate on four types of named entities: persons, locations, organizations and names of miscellaneous entities that do not belong to the previous three groups." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-15", "text": "The shared task of CoNLL-2002 dealt with named entity recognition for Spanish and Dutch (Tjong Kim Sang, 2002) ." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-16", "text": "The participants of the 2003 shared task have been offered training and test data for two other European languages: English and German." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-17", "text": "They have used the data for developing a named-entity recognition system that includes a machine learning component." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-18", "text": "The shared task organizers were especially interested in approaches that made use of resources other than the supplied training data, for example gazetteers and unannotated data." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-19", "text": "----------------------------------" }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-20", "text": "**DATA AND EVALUATION**" }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-21", "text": "In this section we discuss the sources of the data that were used in this shared task, the preprocessing steps we have performed on the data, the format of the data and the method that was used for evaluating the participating systems." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-22", "text": "----------------------------------" }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-23", "text": "**DATA**" }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-24", "text": "The CoNLL-2003 named entity data consists of eight files covering two languages: English and German 1 ." 
}, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-25", "text": "For each of the languages there is a training file, a development file, a test file and a large file with unannotated data." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-26", "text": "The learning methods were trained with the training data." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-27", "text": "The development data could be used for tuning the parameters of the learning methods." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-28", "text": "The challenge of this year's shared task was to incorporate the unannotated data in the learning process in one way or another." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-29", "text": "When the best parameters were found, the method could be trained on the training data and tested on the test data." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-30", "text": "The results of the different learning methods on the test sets are compared in the evaluation of the shared task." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-31", "text": "The split between development data and test data was chosen to avoid systems being tuned to the test data." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-32", "text": "The English data was taken from the Reuters Corpus 2 ." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-33", "text": "This corpus consists of Reuters news stories" }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-34", "text": "----------------------------------" }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-35", "text": "**ENGLISH DATA**" }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-36", "text": "Articles Sentences Tokens Training set 946 14,987 203,621 Development set 216 3,466 51,362 Test set 231 3, The text for the German data was taken from the ECI Multilingual Text Corpus 3 ." 
}, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-37", "text": "This corpus consists of texts in many languages." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-38", "text": "The portion of data that was used for this task, was extracted from the German newspaper Frankfurter Rundshau." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-39", "text": "All three of the training, development and test sets were taken from articles written in one week at the end of August 1992." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-40", "text": "The raw data were taken from the months of September to December 1992." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-41", "text": "Table 1 contains an overview of the sizes of the data files." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-42", "text": "The unannotated data contain 17 million tokens (English) and 14 million tokens (German)." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-43", "text": "----------------------------------" }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-44", "text": "**DATA PREPROCESSING**" }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-45", "text": "The participants were given access to the corpus after some linguistic preprocessing had been done: for all data, a tokenizer, part-of-speech tagger, and a chunker were applied to the raw data." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-46", "text": "We created two basic language-specific tokenizers for this shared task." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-47", "text": "The English data was tagged and chunked by the memory-based MBT tagger (Daelemans et al., 2002) ." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-48", "text": "The German data was lemmatized, tagged and chunked by the decision tree tagger Treetagger (Schmid, 1995) ." 
}, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-49", "text": "Named entity tagging of English and German training, development, and test data, was done by hand at the University of Antwerp." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-50", "text": "Mostly, MUC conventions were followed (Chinchor et al., 1999 )." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-51", "text": "An extra named entity category called MISC was added to denote all names which are not already in the other categories." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-52", "text": "This includes adjectives, like Italian, and events, like 1000 Lakes Rally, making it a very diverse category." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-53", "text": "All data files contain one word per line with empty lines representing sentence boundaries." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-54", "text": "At the end of each line there is a tag which states whether the current word is inside a named entity or not." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-55", "text": "The tag also encodes the type of named entity." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-56", "text": "Here is an example sentence:" }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-57", "text": "Each line contains four fields: the word, its partof-speech tag, its chunk tag and its named entity tag." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-58", "text": "Words tagged with O are outside of named entities and the I-XXX tag is used for words inside a named entity of type XXX." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-59", "text": "Whenever two entities of type XXX are immediately next to each other, the first word of the second entity will be tagged B-XXX in order to show that it starts another entity." 
}, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-60", "text": "The data contains entities of four types: persons (PER), organizations (ORG), locations (LOC) and miscellaneous names (MISC)." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-61", "text": "This tagging scheme is the IOB scheme originally put forward by Ramshaw and Marcus (1995) ." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-62", "text": "We assume that named entities are non-recursive and non-overlapping." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-63", "text": "When a named entity is embedded in another named entity, usually only the top level entity has been annotated." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-64", "text": "Table 2 contains an overview of the number of named entities in each data file." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-65", "text": "----------------------------------" }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-66", "text": "**EVALUATION**" }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-67", "text": "The performance in this task is measured with F \u03b2=1 rate:" }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-68", "text": "lex pos aff pre ort gaz chu pat cas tri bag quo doc Florian Van Rijsbergen, 1975) ." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-69", "text": "Precision is the percentage of named entities found by the learning system that are correct." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-70", "text": "Recall is the percentage of named entities present in the corpus that are found by the system." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-71", "text": "A named entity is correct only if it is an exact match of the corresponding entity in the data file." 
}, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-72", "text": "----------------------------------" }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-73", "text": "**PARTICIPATING SYSTEMS**" }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-74", "text": "Sixteen systems have participated in the CoNLL-2003 shared task." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-75", "text": "They employed a wide variety of machine learning techniques as well as system combination." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-76", "text": "Most of the participants have attempted to use information other than the available training data." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-77", "text": "This information included gazetteers and unannotated data, and there was one participant who used the output of externally trained named entity recognition systems." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-78", "text": "----------------------------------" }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-79", "text": "**LEARNING TECHNIQUES**" }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-80", "text": "The most frequently applied technique in the CoNLL-2003 shared task is the Maximum Entropy Model." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-81", "text": "Five systems used this statistical learning method." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-82", "text": "Three systems used Maximum Entropy Models in isolation (Bender et al., 2003; Chieu and Ng, 2003; Curran and Clark, 2003) ." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-83", "text": "Two more systems used them in combination with other techniques (Florian et al., 2003; Klein et al., 2003) ." 
}, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-84", "text": "Maximum Entropy Models seem to be a good choice for this kind of task: the top three results for English and the top two results for German were obtained by participants who employed them in one way or another." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-85", "text": "Hidden Markov Models were employed by four of the systems that took part in the shared task (Florian et al., 2003; Klein et al., 2003; Mayfield et al., 2003; Whitelaw and Patrick, 2003) ." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-86", "text": "However, they were always used in combination with other learning techniques." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-87", "text": "Klein et al. (2003) also applied the related Conditional Markov Models for combining classifiers." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-88", "text": "Learning methods that were based on connectionist approaches were applied by four systems." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-89", "text": "Zhang and Johnson (2003) used robust risk minimization, which is a Winnow technique." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-90", "text": "Florian et al. (2003) employed the same technique in a combination of learners." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-91", "text": "Voted perceptrons were applied to the shared task data by Carreras et al. (2003a) and Hammerton used a recurrent neural network (Long Short-Term Memory) for finding named entities." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-92", "text": "Other learning approaches were employed less frequently." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-93", "text": "Two teams used AdaBoost." 
}, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-94", "text": "MH (Carreras et al., 2003b; Wu et al., 2003) and two other groups employed memory-based learning (De Meulder and Daelemans, 2003; Hendrickx and Van den Bosch, 2003) ." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-117", "text": "A reasonable number of groups have also employed unannotated data for obtaining capitalization features for words." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-95", "text": "Transformation-based learning (Florian et al., 2003) , Support Vector Machines (Mayfield et al., 2003) and Conditional Random Fields (McCallum and Li, 2003) were applied by one system each." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-96", "text": "Combination of different learning systems has proven to be a good method for obtaining excellent results." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-97", "text": "Five participating groups have applied system combination." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-98", "text": "Florian et al. (2003) tested different methods for combining the results of four systems and found that robust risk minimization worked best." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-99", "text": "Klein et al. (2003) employed a stacked learning system which contains Hidden Markov Models, Maximum Entropy Models and Conditional Markov Models." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-100", "text": "Mayfield et al. (2003) stacked two learners and obtained better performance." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-101", "text": "Wu et al. (2003) applied both stacking and voting to three learners." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-102", "text": "Munro et al. (2003) employed both voting and bagging for combining classifiers." 
}, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-103", "text": "----------------------------------" }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-104", "text": "**FEATURES**" }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-105", "text": "The choice of the learning approach is important for obtaining a good system for recognizing named entities." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-106", "text": "However, in the CoNLL-2002 shared task we found out that choice of features is at least as important." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-107", "text": "An overview of some of the types of features chosen by the shared task participants, can be found in Table 3 ." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-108", "text": "All participants used lexical features (words) except for Whitelaw and Patrick (2003) who implemented a character-based method." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-109", "text": "Most of the systems employed part-of-speech tags and two of them have recomputed the English tags with better taggers (Hendrickx and Van den Bosch, 2003; Wu et al., 2003) ." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-110", "text": "Othographic information, affixes, gazetteers and chunk information were also incorporated in most systems although one group reports that the available chunking information did not help (Wu et al., 2003) Other features were used less frequently." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-111", "text": "Table 3 does not reveal a single feature that would be ideal for named entity recognition." 
}, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-112", "text": "----------------------------------" }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-113", "text": "**EXTERNAL RESOURCES**" }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-114", "text": "Eleven of the sixteen participating teams have attempted to use information other than the training data that was supplied for this shared task." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-115", "text": "All included gazetteers in their systems." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-116", "text": "Four groups examined the usability of unannotated data, either for extracting training instances (Bender et al., 2003; Hendrickx and Van den Bosch, 2003) or obtaining extra named entities for gazetteers (De Meulder and Daelemans, 2003; McCallum and Li, 2003) ." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-118", "text": "One participating team has used externally trained named entity recognition systems for English as a part in a combined system (Florian et al., 2003) ." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-119", "text": "with extra information compared to while using only the available training data." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-120", "text": "The inclusion of extra named entity recognition systems seems to have worked well (Florian et al., 2003) ." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-121", "text": "Generally the systems that only used gazetteers seem to gain more than systems that have used unannotated data for other purposes than obtaining capitalization information." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-122", "text": "However, the gain differences between the two approaches are most obvious for English for which better gazetteers are available." 
}, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-123", "text": "With the exception of the result of Zhang and Johnson (2003) , there is not much difference in the German results between the gains obtained by using gazetteers and those obtained by using unannotated data." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-124", "text": "----------------------------------" }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-125", "text": "**PERFORMANCES**" }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-126", "text": "A baseline rate was computed for the English and the German test sets." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-127", "text": "It was produced by a system which only identified entities which had a unique class in the training data." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-128", "text": "If a phrase was part of more than one entity, the system would select the longest one." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-129", "text": "All systems that participated in the shared task have outperformed the baseline system." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-130", "text": "For all the F \u03b2=1 rates we have estimated significance boundaries by using bootstrap resampling (Noreen, 1989) ." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-131", "text": "From each output file of a system, 250 random samples of sentences have been chosen and the distribution of the F \u03b2=1 rates in these samples is assumed to be the distribution of the performance of the system." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-132", "text": "We assume that performance A is significantly different from performance B if A is not within the center 90% of the distribution of B." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-133", "text": "The performances of the sixteen systems on the two test data sets can be found in Table 5 ." 
}, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-134", "text": "For English, the combined classifier of Florian et al. (2003) achieved the highest overall F \u03b2=1 rate." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-135", "text": "However, the difference between their performance and that of the Maximum Entropy approach of Chieu and Ng (2003) is not significant." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-136", "text": "An important feature of the best system that other participants did not use, was the inclusion of the output of two externally trained named entity recognizers in the combination process." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-137", "text": "Florian et al. (2003) have also obtained the highest F \u03b2=1 rate for the German data." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-138", "text": "Here there is no significant difference between them and the systems of Klein et al. (2003) and Zhang and Johnson (2003) ." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-139", "text": "We have combined the results of the sixteen system in order to see if there was room for improvement." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-140", "text": "We converted the output of the systems to the same IOB tagging representation and searched for the set of systems from which the best tags for the development data could be obtained with majority voting." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-141", "text": "The optimal set of systems was determined by performing a bidirectional hill-climbing search (Caruana and Freitag, 1994) with beam size 9, starting from zero features." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-142", "text": "A majority vote of five systems (Chieu and Ng, 2003; Florian et al., 2003; Klein et al., 2003; McCallum and Li, 2003; Whitelaw and Patrick, 2003) performed best on the English development data." 
}, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-143", "text": "Another combination of five systems (Carreras et al., 2003b; Mayfield et al., 2003; McCallum and Li, 2003; Munro et al., 2003; Zhang and Johnson, 2003) obtained the best result for the German development data." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-144", "text": "We have performed a majority vote with these sets of systems on the related test sets and obtained F \u03b2=1 rates of 90.30 for English (14% error reduction compared with the best system) and 74.17 for German (6% error reduction)." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-145", "text": "----------------------------------" }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-146", "text": "**CONCLUDING REMARKS**" }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-147", "text": "We have described the CoNLL-2003 shared task: language-independent named entity recognition." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-148", "text": "Sixteen systems have processed English and German named entity data." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-149", "text": "The best performance for both languages has been obtained by a combined learning system that used Maximum Entropy Models, transformation-based learning, Hidden Markov Models as well as robust risk minimization (Florian et al., 2003) ." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-150", "text": "Apart from the training data, this system also employed gazetteers and the output of two externally trained named entity recognizers." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-151", "text": "The performance of the system of Chieu et al. (2003) was not significantly different from the best performance for English and the method of Klein et al. (2003) and the approach of Zhang and Johnson (2003) were not significantly worse than the best result for German." 
}, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-152", "text": "Eleven teams have incorporated information other than the training data in their system." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-153", "text": "Four of them have obtained error reductions of 15% or more for English and one has managed this for German." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-154", "text": "The resources used by these systems, gazetteers and externally trained named entity systems, still require a lot of manual work." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-155", "text": "Systems that employed unannotated data, obtained performance gains around 5%." }, { "sent_id": "2c3a2999390b82f4e29b00d59f90f2-C001-156", "text": "The search for an excellent method for taking advantage of the fast amount of available raw text, remains open." } ], "y": { "@BACK@": { "gold_contexts": [ [ "2c3a2999390b82f4e29b00d59f90f2-C001-80", "2c3a2999390b82f4e29b00d59f90f2-C001-82", "2c3a2999390b82f4e29b00d59f90f2-C001-83" ], [ "2c3a2999390b82f4e29b00d59f90f2-C001-85" ], [ "2c3a2999390b82f4e29b00d59f90f2-C001-89", "2c3a2999390b82f4e29b00d59f90f2-C001-90" ], [ "2c3a2999390b82f4e29b00d59f90f2-C001-95" ], [ "2c3a2999390b82f4e29b00d59f90f2-C001-98" ], [ "2c3a2999390b82f4e29b00d59f90f2-C001-118", "2c3a2999390b82f4e29b00d59f90f2-C001-119" ], [ "2c3a2999390b82f4e29b00d59f90f2-C001-120" ], [ "2c3a2999390b82f4e29b00d59f90f2-C001-134" ], [ "2c3a2999390b82f4e29b00d59f90f2-C001-137" ], [ "2c3a2999390b82f4e29b00d59f90f2-C001-142" ], [ "2c3a2999390b82f4e29b00d59f90f2-C001-149" ] ], "cite_sentences": [ "2c3a2999390b82f4e29b00d59f90f2-C001-83", "2c3a2999390b82f4e29b00d59f90f2-C001-85", "2c3a2999390b82f4e29b00d59f90f2-C001-90", "2c3a2999390b82f4e29b00d59f90f2-C001-95", "2c3a2999390b82f4e29b00d59f90f2-C001-98", "2c3a2999390b82f4e29b00d59f90f2-C001-118", "2c3a2999390b82f4e29b00d59f90f2-C001-120", "2c3a2999390b82f4e29b00d59f90f2-C001-134", "2c3a2999390b82f4e29b00d59f90f2-C001-137", 
"2c3a2999390b82f4e29b00d59f90f2-C001-142", "2c3a2999390b82f4e29b00d59f90f2-C001-149" ] } } }, "ABC_e3ee86bbaca6ae00906e7ec64f0ac0_12": { "x": [ { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-2", "text": "Lifelong machine learning is a novel machine learning paradigm which continually learns tasks and accumulates knowledge for reusing." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-3", "text": "The knowledge extracting and reusing abilities enable the lifelong machine learning to understand the knowledge for solving a task and obtain the ability to solve the related problems." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-4", "text": "In sentiment classification, traditional approaches like Na\u00efve Bayes focus on the probability for each words with positive or negative sentiment." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-5", "text": "However, the lifelong machine learning in this paper will investigate this problem in a different angle and attempt to discover which words determine the sentiment of a review." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-6", "text": "We will pay all attention to obtain knowledge during learning for future learning rather than just solve a current task." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-7", "text": "\u2022 Computing methodologies \u2192 Theory of mind." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-8", "text": "----------------------------------" }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-10", "text": "Over the past 30 years, machine learning have achieved a significant development." 
}, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-11", "text": "However, we are still in a era of \"weak AI\" rather than \"strong AI\" which due to the algorithms of AI only know how to solve a problem but have no idea why these approaches work and when can reuse them to solve other problems." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-12", "text": "Hence, the lifelong machine learning (simply said as lifelong learning or \"LML\" below) [7] was raised to build a new learning paradigm to learn with knowledge accumulation and reusing." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-13", "text": "With the knowledge, the AI is able to solve new problems totally unsupervised or semi-supervised." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-14", "text": "Thinking forward, the knowledge discovering and reusing becomes the core learning goal rather than to solve a specific problem under the lifelong learning setting." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-15", "text": "For instance, in the sentiment classification we can to predict the sentiment (positive or negative) of a sentence or a document by Na\u00efve Bayes." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-16", "text": "To solve this problem, we need to know the probability of each word that appears in positive or negative content." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-17", "text": "For different sentiment classification tasks with different domain, traditional approaches will calculate the probability of each word to be positive or negative in individual domain to achieve a good performance due to one word can be positive in a domain while be negative in another domain." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-18", "text": "Hence, for each domain we need to collect data for supervised learning." 
}, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-19", "text": "In this way, the algorithm will arXiv:1905.01988v1 [cs.CL] 30 Apr 2019 never know how to solve a problem without new labeled data and teaching." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-20", "text": "This is what a typical \"weak AI\"." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-21", "text": "To achieve the goal of \"strong AI\", we need to convert our learning goal to discover which words have sentiment orientation and check whether the orientation just valid in specific domains." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-22", "text": "If we can achieve this learning goal, the algorithms will be able to solve new tasks without teaching and explore new domain to find some new words with sentiment orientation." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-23", "text": "Zhiyuan Chen [2] ever proposed a approach to determine which domain dose a word have the sentiment orientation to achieve the goal of lifelong learning." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-24", "text": "He made a big progress but the supervised learning still is needed." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-25", "text": "Hence, we will make it forward to let the learning to star with supervised learning but continue with unsupervised learning in the future tasks." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-26", "text": "----------------------------------" }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-27", "text": "**LIFELONG MACHINE LEARNING**" }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-28", "text": "It was firstly called as lifelong machine learning since 1995 by Thrun [6, 8] ." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-29", "text": "Efficient Lifelong Machine Learning (ELLA) [5] raised by Ruvolo and Eaton." 
}, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-30", "text": "Comparing with the multi-task learning [1] , ELLA is much more efficient." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-31", "text": "Zhiyuan and Bing [2] improved the sentiment classification by involving knowledge." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-32", "text": "The object function was modified with two penalty terms which corresponding with previous tasks." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-33", "text": "The knowledge system contains the following components:" }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-34", "text": "----------------------------------" }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-35", "text": "**COMPONENTS OF LML**" }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-36", "text": "\u2022 Knowledge Base (KB): The knowledge Base [2] mainly used to maintain the previous knowledge." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-37", "text": "Based on the type of knowledge, it could be divided as Past Information Store (PIS), Meta-Knowledge Miner (MKM) and Meta-Knowledge Store (MKS)." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-38", "text": "\u2022 Knowledge Reasoner (KR): The knowledge reasoner is designed to generate new knowledge upon the achieve knowledge by logic inference." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-39", "text": "A strict logic design is required so the most of the LML algorithms lack of the component." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-40", "text": "\u2022 Knowledge-Base Learner (KBL): The Knowledge-Based Learner [2] aims to retrieve and transfer previous knowledge to the current task." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-41", "text": "Hence, it contains two parts: task knowledge miner and leaner." 
}, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-42", "text": "The miner seeks and determines which knowledge could be reused, and the learner transfer such knowledge to the current task." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-43", "text": "----------------------------------" }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-44", "text": "**SENTIMENT CLASSIFICATION**" }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-45", "text": "Hong and etc. [3] ever discussed that the NLP field is most suitable for the researches of the lifelong learning due to it is easier to extract knowledge and be understood by human." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-46", "text": "Previous classical paper [2] chose the sentiment classification as the learning target because it is could be regarded as a task as well as a group of subtasks in different domain." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-47", "text": "These sub-tasks related to each other but a model trained on a domain is unable to perform well in rest domains." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-48", "text": "The sub-tasked is related means that the knowledge transform among tasks is possible to improve performance. And the distribution of distribution is different requires that our algorithms could know when the knowledge can be used and when can not." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-49", "text": "Known these, an algorithm can be called as \"lifelong\" due to it is able to transfer previous knowledge to new tasks to improve performance." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-50", "text": "By using Naive Bayes to solve sentiment classification, we need to know the probability of each words that shows in positive or negative content." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-51", "text": "We also need to know well that some words may only have sentiment orientation in some specific domains." 
}, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-52", "text": "\"Lifelong Sentiment Classification\" (\"LSC\" for simple below) [2] records that which domain does a word have the sentiment orientation." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-53", "text": "If a word always has sentiment orientation or has significant orientation in current domain, a high weight will sign to it more than other words." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-54", "text": "This approach contains a knowledge transfer operation and a knowledge validation operation." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-55", "text": "----------------------------------" }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-56", "text": "**CONTRIBUTION OF THIS PAPER**" }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-57", "text": "Although LSC [2] already raised a lifelong approach, it only aims to improve the classification accuracy." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-58", "text": "It will not deliver a summary that which words are influence sentiment which is most important to us and can be used for future learning." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-59", "text": "In addition, it still limits in supervised learning and unable to handle the tasks without labeled data." 
}, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-60", "text": "Our paper advances the lifelong learning in sentiment classification and have two main contributions:" }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-61", "text": "\u2022 We introduce an novel approach to discover and store the words with sentiment orientation for reuse \u2022 A improved lifelong learning paradigm is proposed to solve the sentiment classification problem under unsupervised learning setting with previous knowledge" }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-62", "text": "----------------------------------" }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-63", "text": "**SENTIMENT ORIENTATION WORDS 4.1 NA\u00cfVE BAYESIAN TEXT CLASSIFICATION**" }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-64", "text": "In this paper, we define a word has sentiment orientation by calculating the probability that it will appears in a positive or negative content (sentence or document)." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-65", "text": "If a word has high probability with sentiment orientation, it also will leads to the document have higher probability of sentiment orientation based on the Na\u00efve Bayesian (NB) formula." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-66", "text": "NB text classification [4] will calculate the probability of each word w given a sentiment orientation (positive or negative)." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-67", "text": "Use use the same formula as LSC [2] used below." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-68", "text": "P(w |c j ) is the probability of a word appears in a class:" }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-69", "text": "Where c j is either positive (+) or negative (-) sentiment orientation." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-70", "text": "N c j ,w is the frequency of word w in documents of class c j ." 
}, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-71", "text": "|V| is the size of vocabulary V and \u03bb(0 \u2a7d \u03bb \u2a7d 1) is used for smoothing ( set as 1 for Laplace smoothing in this paper)." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-72", "text": "Given a document, we can calculate the probability of it for different classes by:" }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-73", "text": "Where d i is the given document, n w , d i is the frequence of a word appears in this document." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-74", "text": "To predict the class of a document, we only need to calculate P(c + |d i ) \u2212 P(c \u2212 |d i )." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-75", "text": "If the difference is lager than 0, the document should be predict as positive orientation:" }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-76", "text": "As we only need to know whether P(c + |d i ) \u2212 P(c \u2212 |d i ) is lager that 0, so the formula could be simplify to:" }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-77", "text": "----------------------------------" }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-78", "text": "**DISCOVER SENTIMENT ORIENTATION WORDS**" }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-79", "text": "Ideally, if we know the P(c + ), P(c \u2212 ) and P(w |c j ) for all words, we can predict the sentiment orientation for all documents." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-80", "text": "However, above three key components are different in different domains." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-81", "text": "LSC [2] discussed a possible solution of P(w |c j )." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-82", "text": "As we known, not all words have sentimental orientation like \"a\", \"one\" and etc." 
}, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-83", "text": "while some words always have like \"good\", \"hate\", \"excellent\" and so on." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-84", "text": "In addition, some words only have sentiment orientation in specific domains." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-85", "text": "For example, \"tough\" in reviews of the diamond may mean that the diamond have a good quality while \"tough\" in the comments for food normally shows that it is hard to chew." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-86", "text": "Hence, in order to achieve the goal of the lifelong learning." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-87", "text": "We need to find the words always have sentiment orientation and be careful for those words only shows orientation in specific domains." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-88", "text": "----------------------------------" }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-89", "text": "**LIFELONG SEMI-SUPERVISED LEARNING FOR SENTIMENT CLASSIFICATION**" }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-90", "text": "Although LSC [2] considered the difference among domains, it still is a typical supervised learning approach." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-91", "text": "In this paper, we proposed to learn as two stages:" }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-92", "text": "(1) Initial Learning Stage: to explore a basic set of sentiment orientation words." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-93", "text": "After that, the model should be able to basically classify a new domain with a good performance." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-94", "text": "(2) Self-study Stage: Use the knowledge accumulated from the initial stage to handle new domains, also fine-tune and consolidate the knowledge generated from initial learning stage." 
}, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-95", "text": "----------------------------------" }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-96", "text": "**INITIAL LEARNING STAGE**" }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-97", "text": "In this stage, we need to train the model to remember some sentiment orientation words." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-98", "text": "This requires us to find the words with sentiment orientation in each domain." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-99", "text": "We need to answer two question here:" }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-100", "text": "(1) How to determine whether a word has orientation? (2) How much domain do we need for the initial learning stage?" }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-101", "text": "For the first question, we need to find which words mainly show at the positive or negative documents." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-102", "text": "This means for a word w with positive orientation, P(+|w) >> P(\u2212|w) or P(+|w) >> P(+)." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-103", "text": "In this paper, we will use O(w) = P(+|w)/P(+) to represent the orientation." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-104", "text": "This because that the P(c j |w)/P(w) is easy to extend to multi-classes classification problem." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-105", "text": "According to the Bayesian formula, P(+|w)/P(+) = P(w |+)/P(w)." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-106", "text": "----------------------------------" }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-107", "text": "**SELF-STUDY STAGE**" }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-108", "text": "In this stage, our main task is to explore which words have orientation." 
}, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-109", "text": "We will use the these words to predict the new domains and assign the pseudo-labels to them." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-110", "text": "With the pseudo labels, we are able to discover the new words with orientation." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-111", "text": "Following the the procedure for self-study:" }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-112", "text": "(1) Using the orientation words accumulate from previous tasks to predict a new domain, and assign the prediction result as pseudo labels." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-113", "text": "(2) Using the reviews and pseudo labels the new domain as new training data to run Na\u00efve model." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-114", "text": "(3) Update the orientation words knowledge base." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-115", "text": "----------------------------------" }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-116", "text": "**EXPERIMENT 6.1 DATASETS**" }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-117", "text": "In the experiment, we use the same datasets as LSC [2] used." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-118", "text": "It contains the reviews from 20 domains crawled from Amazon.com and each domain has 1,000 reviews (the distribution of positive and negative reviews is imbalanced)." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-119", "text": "----------------------------------" }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-120", "text": "**WORD ORIENTATION ANALYSIS**" }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-121", "text": "To answer the first question for the initial learning stage, we need to know which words exactly influence the sentiment classification." 
}, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-122", "text": "Table 1 : F1 Score of Na\u00efve Bayesian Classifiers under Decreasing Word Usage Percentage Firstly, we calculate P(w |+) and P(w |\u2212) for each words." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-123", "text": "Then, we define the orientation by O(w) = P(w |+)/P(w)." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-124", "text": "Finally, we only choose a specific percentage words to predict and see whether the performance decreases." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-125", "text": "In addition, we also only consider the words that at least show over average 5 times in per domain." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-126", "text": "This because that we did not delete the symbols and numbers in the data, and these characters may be noise in the training data." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-127", "text": "We firstly sorted the words or symbols (no data pre-processing to the corpus in this paper) by the orientation O(w) and then choose a specific percentage words or symbols from the whole words to only 10%." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-128", "text": "From Table 1 we can see that using more than 70% obtains the best average result." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-129", "text": "We also noticed that the performance will not significantly decrease until we removed more than 80% of words and symbols." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-130", "text": "This means that the most of words and symbols do not have obvious sentiment orientation." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-131", "text": "Hence, we will only keep 20% of words for Na\u00efve Bayes model and it still keeps around 96% f1 score." 
}, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-132", "text": "Although the performance decrease on a single domain, the better global performance will achieve with the orientation words." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-133", "text": "In addition" }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-134", "text": "----------------------------------" }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-135", "text": "**REQUIREMENT FOR THE INITIAL LEARNING**" }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-136", "text": "For the second question of the initial learning stage, the answer depends on the tasks." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-137", "text": "In the practice, all of the labeled data definitely need to be used for training." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-138", "text": "The only question should be conceded is that how much domains is insufficient if there only are a few labeled data." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-139", "text": "For this sentiment classification task, one domain is absolutely insufficient." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-140", "text": "Based on the experiment result, the initial learning stage at least needs two domains, and can achieve much better performance with three domains." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-141", "text": "Increase more domains will not significant influence performance." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-142", "text": "Hence, three domains is enough for this task." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-143", "text": "For different tasks, two labeled domains are necessary." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-144", "text": "More labeled domains are suggested to continue collect until the performance on the new domains tends to steady." 
}, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-145", "text": "----------------------------------" }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-146", "text": "**SELF-STUDY LEARNING**" }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-147", "text": "In the self-study learning stage, our assume that we are under unsupervised learning setting." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-148", "text": "In this stage, there is not any labeled data." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-149", "text": "Instead of that, we will use the model generate from the initial learning stage to predict each use domain and assign pseudo labels." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-150", "text": "After that, the model will regards the pseudo labels as real label and continue training on the new domain." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-151", "text": "With this method, self-study learning stage can learn new domains well without any labeled data." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-152", "text": "Table 2 is the F1 score of three models on 17 domains." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-153", "text": "The first three domains was used for the initial learning stage. And we use the Macro-F1 score because the datasets are imbalanced and it can show the performance on the minor classes." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-154", "text": "We compared our model (Semi-Unsupervised Learning, SU-LML for short) with Na\u00efve Bayes model which only trained on the first three (source) domains (NB-S) and Na\u00efve Bayes model trained on each domain with labels by 5-fold cross validation (NB-T)." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-155", "text": "We can see that our approach is significantly better than other two approaches." 
}, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-156", "text": "It even perform better than the NB-T, a typically supervised learning." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-157", "text": "The figure 2 shows the result more clearly." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-158", "text": "Table 2 : F1 Score for NB-S, NB-T, SU-LML" }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-159", "text": "----------------------------------" }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-160", "text": "**KNOWLEDGE GENERATED DURING LEARNING**" }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-161", "text": "In this paper, we do one more important things is that we are learning which words have sentiment orientation." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-162", "text": "If a word was regarded with sentiment orientation, we increase the orientation score of it plus one." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-163", "text": "In addition, we will plus an additional score from 0 to 1 to 1 based on the O(w) rank." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-164", "text": "From" }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-165", "text": "----------------------------------" }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-166", "text": "**CONCLUSION AND OUTLOOK**" }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-167", "text": "We proposed a semi-unsupervised lifelong sentiment classification approach in this paper." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-168", "text": "It can accumulate knowledge from previous learning and turn to self-study." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-169", "text": "Very few labeled data required in our approach so it is very suitable for the industry scenario." 
}, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-170", "text": "The performance of it even exceeds the supervised learning, which shows that the knowledge reusing of the lifelong learning is useful." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-171", "text": "Although we only show two classes classification here, but the ideal is also suitable for the multi-classes classification." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-172", "text": "All text classification can use approach, not only sentiment classification." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-173", "text": "Our model already know which words have sentiment orientation and can use them to classify, which uses the same approach of our human being." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-174", "text": "We shows that to focus the goal behind the learning tasks is more meaningful than just to find a solution." }, { "sent_id": "e3ee86bbaca6ae00906e7ec64f0ac0-C001-175", "text": "We should learn the knowledge and skills for all tasks rather than a solution for one task." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "e3ee86bbaca6ae00906e7ec64f0ac0-C001-23" ], [ "e3ee86bbaca6ae00906e7ec64f0ac0-C001-31" ], [ "e3ee86bbaca6ae00906e7ec64f0ac0-C001-36" ], [ "e3ee86bbaca6ae00906e7ec64f0ac0-C001-40" ], [ "e3ee86bbaca6ae00906e7ec64f0ac0-C001-46" ], [ "e3ee86bbaca6ae00906e7ec64f0ac0-C001-52" ], [ "e3ee86bbaca6ae00906e7ec64f0ac0-C001-81" ] ], "cite_sentences": [ "e3ee86bbaca6ae00906e7ec64f0ac0-C001-23", "e3ee86bbaca6ae00906e7ec64f0ac0-C001-31", "e3ee86bbaca6ae00906e7ec64f0ac0-C001-36", "e3ee86bbaca6ae00906e7ec64f0ac0-C001-40", "e3ee86bbaca6ae00906e7ec64f0ac0-C001-46", "e3ee86bbaca6ae00906e7ec64f0ac0-C001-52", "e3ee86bbaca6ae00906e7ec64f0ac0-C001-81" ] }, "@MOT@": { "gold_contexts": [ [ "e3ee86bbaca6ae00906e7ec64f0ac0-C001-23", "e3ee86bbaca6ae00906e7ec64f0ac0-C001-24" ], [ "e3ee86bbaca6ae00906e7ec64f0ac0-C001-46", "e3ee86bbaca6ae00906e7ec64f0ac0-C001-47" ], [ "e3ee86bbaca6ae00906e7ec64f0ac0-C001-57" ], [ "e3ee86bbaca6ae00906e7ec64f0ac0-C001-90" ] ], "cite_sentences": [ "e3ee86bbaca6ae00906e7ec64f0ac0-C001-23", "e3ee86bbaca6ae00906e7ec64f0ac0-C001-46", "e3ee86bbaca6ae00906e7ec64f0ac0-C001-57", "e3ee86bbaca6ae00906e7ec64f0ac0-C001-90" ] }, "@USE@": { "gold_contexts": [ [ "e3ee86bbaca6ae00906e7ec64f0ac0-C001-67" ], [ "e3ee86bbaca6ae00906e7ec64f0ac0-C001-117" ] ], "cite_sentences": [ "e3ee86bbaca6ae00906e7ec64f0ac0-C001-67", "e3ee86bbaca6ae00906e7ec64f0ac0-C001-117" ] } } }, "ABC_c705c0533600b9b93d2c89bcbc292b_12": { "x": [ { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-2", "text": "Vision-and-Language Navigation (VLN) requires grounding instructions, such as turn right and stop at the door, to routes in a visual environment." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-3", "text": "The actual grounding can connect language to the environment through multiple modalities, e.g. 
stop at the door might ground into visual objects, while turn right might rely only on the geometric structure of a route." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-4", "text": "We investigate where the natural language empirically grounds under two recent state-of-the-art VLN models." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-5", "text": "Surprisingly, we discover that visual features may actually hurt these models: models which only use route structure, ablating visual features, outperform their visual counterparts in unseen new environments on the benchmark Room-to-Room dataset." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-6", "text": "To better use all the available modalities, we propose to decompose the grounding procedure into a set of expert models with access to different modalities (including object detections) and ensemble them at prediction time, improving the performance of state-of-the-art models on the VLN task." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-7", "text": "----------------------------------" }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-9", "text": "The Vision-and-Language Navigation (VLN) task (Anderson et al., 2018) requires an agent to navigate to a particular location in a real-world environment, following complex, context-dependent instructions written by humans (e.g. go down the second hallway on the left, enter the bedroom and stop by the mirror)." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-10", "text": "The agent must navigate through the environment, conditioning on the instruction as well as the visual imagery that it observes along the route, to stop at the location specified by the instruction (e.g. the mirror)." 
}, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-11", "text": "Recent state-of-the-art models (Wang et al., 2018; Fried et al., 2018b; Ma et al., 2019) have demonstrated large gains in accuracy on the VLN task." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-12", "text": "However, it is unclear which modality these go past the couch \u2026 Figure 1 : We factor the grounding of language instructions into visual appearance, route structure, and object detections using a mixture-of-experts approach." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-13", "text": "substantial increases in task metrics can be attributed to, and, in particular, whether the gains in performance are due to stronger grounding into visual context or e.g. simply into the discrete, geometric structure of possible routes, such as turning left or moving forward (see Fig. 1" }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-14", "text": "----------------------------------" }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-15", "text": "**, TOP VS. MIDDLE).**" }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-16", "text": "First, we analyze to what extent VLN models ground language into visual appearance and route structure by training versions of two state-ofthe-art models without visual features, using the benchmark Room-to-Room (R2R) dataset (Anderson et al., 2018) ." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-17", "text": "We find that while grounding into route structure is useful, the models with visual features fail to learn generalizable visual grounding." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-18", "text": "Surprisingly, when trained without visual features, their performance on unseen environments is comparable or even better." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-19", "text": "We hypothesize that the low-level, pixel-based CNN features in the visual models contribute to their failure to generalize." 
}, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-20", "text": "To address this, we introduce a high-level object-based visual representation to ground language into visual context in a more generalizable way, using the symbolic output of a pretrained object detection system." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-21", "text": "For example, while a concept table could ground into visual appearance of a specific table in a given environment, detecting tables and other objects in scenes, mapping them into symbols, and grounding the text mentions into these symbols should generalize better to unseen environments." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-22", "text": "Finally, inspired by the complementary errors of visual and non-visual agents, we decompose the grounding process through a mixture-of-experts approach." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-23", "text": "We train separate visual and non-visual agents, encouraging each one to focus on a separate modality, and combine their predictions as an ensemble (see Fig. 1 )." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-24", "text": "Our mixture-of-experts outperforms the individual agents, and is also better than the ensembles of multiple agents of the same modality (e.g. both visual or both non-visual)." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-25", "text": "Adding our object representation and mixtureof-experts approach to both state-of-the-art models improves their success rate by over 10% (absolute) in novel environments, obtaining a 51.9% success rate on the val-unseen split of the benchmark R2R dataset (Anderson et al., 2018) ." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-26", "text": "----------------------------------" }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-27", "text": "**RELATED WORK**" }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-28", "text": "Vision and Language Navigation." 
}, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-29", "text": "Vision-andLanguage Navigation (VLN) (Anderson et al., 2018; Chen et al., 2019) unites two lines of work: first, of following natural language navigational instructions in an environmental context (MacMahon et al., 2006; Vogel and Jurafsky, 2010; Tellex et al., 2011; Chen and Mooney, 2011; Artzi and Zettlemoyer, 2013; Andreas and Klein, 2015; Mei et al., 2016; Fried et al., 2018a; Misra et al., 2018) , and second, of vision-based navigation tasks (Mirowski et al., 2017; Yang et al., 2019; Mirowski et al., 2018; Cirik et al., 2018 ) that use visually-rich real-world imagery (Chang et al., 2017) ." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-30", "text": "A number of methods for the VLN task have been recently proposed." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-31", "text": "Wang et al. (2018) use model-based and model-free reinforcement learning to learn an environmental model and optimize directly for navigation success." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-32", "text": "Fried et al. (2018b) use a separate instruction generation model to synthesize new instructions as data augmentation during training, and perform pragmatic inference at test time." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-33", "text": "Most recently, Ma et al. (2019) introduce a visual and textual co-attention mechanism and a route progress predictor." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-34", "text": "These approaches have significantly improved performance on the VLN task, when evaluated by metrics such as success rate." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-35", "text": "However, it is unclear where the high performance comes from." 
}, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-36", "text": "In this paper, we find that agents without any visual input can achieve competitive performance, matching or even outperforming their vision-based counterparts under two state-of-theart model models (Fried et al., 2018b; Ma et al., 2019) ." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-37", "text": "We also explore two approaches to make the agents better utilize their visual inputs." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-38", "text": "The role of vision in vision-and-language tasks." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-39", "text": "In several vision-and-language tasks, high performance can be achieved without effective modeling of the visual modality." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-40", "text": "Devlin et al. (2015) find that image captioning models can exploit regularity in the captions, showing that a nearestneighbor matching approach can achieve competitive performance to sophisticated language generation models. and find that neural captioning models often ground object mentions into incorrect objects due to correlations in the training data, and can hallucinate non-existing objects." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-41", "text": "Recent work has also investigated singlemodality performance in vision-and-language embodiment tasks." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-42", "text": "Anand et al. (2018) find that stateof-the-art results can be achieved on the EmbodiedQA task (Das et al., 2018 ) using an agent without visual inputs." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-43", "text": "Work concurrent to ours evaluates the performance of single-modality models for several embodied tasks including VLN (Thomason et al., 2019) , finding that high performance can be achieved on the R2R dataset using a non-visual version of the baseline model (Anderson et al., 2018) ." 
}, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-44", "text": "In this paper, we show that the same trends hold for two recent state-of-the-art architectures (Ma et al., 2019; Fried et al., 2018b) for the VLN task; we also analyze to what extent object-based representations and mixture-ofexperts methods can address these issues." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-45", "text": "in a connectivity graph determined by line-of-sight in the physical environment." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-46", "text": "See the top row of Fig. 1 for a top-down environment illustration." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-47", "text": "In the VLN task, a virtual agent is placed at a particular viewpoint in an environment, and is given a natural language instruction (written by a human annotator) to follow." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-48", "text": "At each timestep, the agent receives the panoramic image for the viewpoint it is currently located at, and either predicts to move to one of the adjacent connected viewpoints, or to stop." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-49", "text": "When the agent predicts the stop action, it is evaluated on whether it has correctly reached the end of the route that the human annotator was asked to describe." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-50", "text": "In this work, we analyze two recent VLN models, which typify the visual grounding approaches of VLN work: the panoramic \"follower\" model from the Speaker-Follower (SF) system of Fried et al. (2018b) and the Self-Monitoring (SM) model of Ma et al. (2019) ." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-51", "text": "These models obtained stateof-the-art results on the R2R dataset." 
}, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-52", "text": "Both models are based on the encoder-decoder approach (Cho et al., 2014 ) and map an instruction to a sequence of actions in context by encoding the instruction with an LSTM, and outputting actions using an LSTM decoder that conditions on the encoded instruction and visual features summarizing the agent's environmental context." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-53", "text": "Compared to the SF model, the SM model introduces an improved visual-textual co-attention mechanism and a progress monitor component." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-54", "text": "We refer to the original papers for details on the two models." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-55", "text": "To analyze the models' visual grounding ability, we focus on their core encoder-decoder components." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-56", "text": "In our experiments, we use models trained without data augmentation, and during inference predict actions with greedy search (i.e. without beam search, pragmatic, or progress monitorbased inference)." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-57", "text": "For SF, we use the publicly released code." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-58", "text": "For SM, we use a reimplementation without the progress monitor, which was shown to be most important for search in inference (Ma et al., 2019) ." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-59", "text": "We investigate how well these models ground instructions into visual features of the environment, by training and evaluating them without access to the visual context: setting their visual feature vectors to zeroes during training and testing." 
}, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-60", "text": "We compare performance on the validation sets of the R2R dataset: the val-seen split, consisting of the same environments as in training, and the val- Table 1 : Success rate (SR) of the vision-based full agent (\"RN\", using ResNet) and the non-visual agent (\"no vis.\", setting all visual features to zero) on the R2R dataset under different model architectures (SpeakerFollower (SF) (Fried et al., 2018b) and Self-Monitoring (SM) (Ma et al., 2019) ) and training schemes." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-61", "text": "unseen split of novel environments." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-62", "text": "Since we aim to evaluate how well the agents generalize to the unseen environments, we focus on the val-unseen split." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-63", "text": "For both the SF and SM models, we train two versions of the agents, using either the studentforcing or teacher-forcing approaches of Anderson et al. (2018) 1 , and select the best training snapshot on the val-seen split." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-64", "text": "2 The results are shown in Table 1 ." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-65", "text": "In each block, the two rows show the agent's performance (under the specific model architecture and training approach) with or without access to the visual features (\"RN\": ResNet-152 network (He et al., 2016) , \"no vis.\": non-visual)." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-66", "text": "While visual features improve performance on environments seen during training, we see that for the SF architecture the non-visual agent (lines 1 and 3) outperforms the visual agent (lines 2 and 4) on unseen environments under both studentforcing and teacher-forcing training." 
}, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-67", "text": "For SM, the non-visual agent (lines 5 and 7) has a success rate very close to the visual agent (lines 6 and 8)." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-68", "text": "This indicates that these models do not learn generalizable visual perception, so that the visual features may actually hurt them in unseen environments." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-69", "text": "----------------------------------" }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-70", "text": "**OBJECT REPRESENTATION FOR BETTER GROUNDING AND GENERALIZATION**" }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-71", "text": "In both the SF and SM architectures, the agents use visual features from a pretrained ResNet-152 CNN (He et al., 2016) ." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-72", "text": "As the training data for the R2R dataset contains only 61 distinct environments, the agents may overfit to the appearance of the training environments and thus struggle to gen-eralize." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-73", "text": "For example, for the instruction go down the staircase, a model may learn to ground staircase into a specific staircase in a given training environment, and fail to generalize to staircases with different appearances or in different contexts in unseen environments." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-74", "text": "We thus propose an objectbased representation, where object detection results from a pretrained large-scale object detector are used as the environment representation." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-75", "text": "The object-based representation is intended to prevent overfitting to training scenes and to transfer to new environments better than CNN features." 
}, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-76", "text": "Both the SF and SM models represent the visual appearance at each location with a set of visual features {x img,i }, where x img,i is a vector extracted from an image patch at a particular orientation i using a CNN." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-77", "text": "Both models also use a visual attention mechanism to extract an attended visual feature x img,att from {x img,i }." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-78", "text": "For our objectbased representation, we use a Faster R-CNN (Ren et al., 2015) object detector trained on the Visual Genome dataset (Krishna et al., 2017) ." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-79", "text": "We construct a set of vectors {x obj,j } representing detected objects and their attributes." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-80", "text": "Each vector x obj,j (j-th detected object in the scene) is a concatenation of summed GloVe vectors (Pennington et al., 2014) for the detected object label (e.g. door) and attribute labels (e.g. white) and a location vector from the object's bounding box coordinates." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-81", "text": "We then use the same visual attention mechanism as in Fried et al. (2018b) and Ma et al. (2019) to obtain an attended object representation x obj,att over these {x obj,j } vectors." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-82", "text": "We either substitute the ResNet CNN features x img,att (\"RN\") with our object representation x obj,att (\"Obj\"), or concatenate x img,att and x obj,att (\"RN+Obj\")." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-83", "text": "Then we train the SF model or the SM model using this object representation, with results shown in Table 2 . 
For SF (lines 1-4), object representations substantially improve generalization ability: using either the object representation (\"Obj\") or the combined representation (\"RN+Obj\") obtains a higher success rate on unseen environments than using only the ResNet features (\"RN\"), and the combined representation (\"RN+Obj\") obtains the highest overall performance." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-84", "text": "Table 2 : Success rate (SR) of agents with different visual inputs on the R2R dataset (\"RN\": ResNet CNN, \"Obj\": objects, \"no vis.\": no visual representation)." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-85", "text": "Models: Speaker-Follower (SF) (Fried et al., 2018b) and Self-Monitoring (SM) (Ma et al., 2019) ." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-86", "text": "For SM (lines 5-8), the model that uses only the object representation achieves the best performance (line 7)." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-87", "text": "Here the success rates across the four settings are closer, and the improvement from object representation is smaller than for SF." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-88", "text": "However, in Sec. 5 we find that object representation can be combined with other inputs to further improve the performance." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-89", "text": "----------------------------------" }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-90", "text": "**MIXTURE-OF-EXPERTS MAKES BETTER USE OF ALL AVAILABLE INFORMATION**" }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-91", "text": "While the agent with CNN visual features does not outperform its non-visual counterpart (Sec. 3) on average, it often succeeds on individual instructions where the non-visual model fails, indicating the visual and non-visual modalities are complementary." 
}, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-92", "text": "To encourage grounding into both modalities, we ensemble visual and non-visual models in a mixture-of-experts approach." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-93", "text": "----------------------------------" }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-94", "text": "**SEPARATE TRAINING**" }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-95", "text": "We first ensemble the models from Sec. 3 and Sec. 4 at test time (after training them separately) by combining their predictions at each timestep." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-96", "text": "4 Lines 9-22 of Table 3 show ensembles of two models." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-97", "text": "Compared to single-model performance (line 1-8 in Table 2 ), an ensemble of a visual and a non-visual agent outperforms the individual agents for both the SF and the SM models." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-98", "text": "The best performing setting is the combination of \"RN\" and \"no vis.\" (non-visual) in line 20 under the SM model." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-99", "text": "While it is unsurprising that the mixture-of-experts can boost performance, it is interesting to see that the best mixture in line 20 outperforms mixtures of two agents of the same type (two non-visual agents in line 16, two visual agents in line 17, trained from distinct random parameter initializations), confirming that two agents Table 3 : Success rate (SR) of different mixtureof-experts ensembles." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-100", "text": "Models: Speaker-Follower (SF) (Fried et al., 2018b) and Self-Monitoring (SM) (Ma et al., 2019) ; \"RN\": ResNet CNN, \"Obj\": objects, \"no vis.\": no visual representation." 
}, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-101", "text": "with access to different modalities can complement each other, especially in the SM model." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-102", "text": "We also experiment with a 3-way mixture in the SM model, combining a visual agent with ResNet CNN features, a visual agent with object features, and a non-visual agent (line 23)." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-103", "text": "This mixture outperforms all the 2-way mixtures by a noticeable margin, showing that the CNN and object-based visual representations are also complementary." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-104", "text": "----------------------------------" }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-105", "text": "**JOINT TRAINING**" }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-106", "text": "Finally, given the success of this simple test-time ensemble, we also explore jointly training these models by building a single agent which uses a single instruction encoder shared between multiple (visual and non-visual) jointly-trained decoders." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-107", "text": "During joint training, each decoder is supervised to predict the true actions, applying the same loss function as in separate training." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-108", "text": "During testing, actions are predicted by averaging logits from the separate decoders as in Sec. 5.1." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-109", "text": "We experiment with jointly training the agents in each of the two best-performing combinations (RN, no vis.) and (RN, Obj, no vis.) under the SM architecture (line 24 and 25 of Table 3 )." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-110", "text": "From line 24 vs. 20 and line 25 vs. 23, joint training gives higher performance than training each model separately and combining them only at test time." 
}, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-111", "text": "Overall, we obtain a 51.9% final success rate on the val-unseen split (line 25), which is over 10% (absolute) higher than the SF or SM baselines using a single decoder with CNN features (lines 2 and 6 in Table 2)." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-113", "text": "----------------------------------" }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-114", "text": "**DISCUSSION**" }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-115", "text": "The success of non-visual versions of two recent state-of-the-art VLN models, often outperforming their vision-based counterparts in unseen environments on the benchmark R2R dataset, shows that these models do not use the visual inputs in a generalizable way." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-116", "text": "Our intuition is that, while language has rich, high-level symbolic meaning, which can be easily matched to the modality of the route structures, pixel-based visual representations, even those extracted via CNNs, are a lower-level modality which requires more data to learn, and so a model trained on both modalities may learn to mostly rely on the route structure." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-117", "text": "This is also supported by the results in Table 3 (line 23 vs. line 20), where adding higher-level object representations improves the success rate by 2.6%." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-118", "text": "Notably, an agent in the R2R environment is only able to move to a discrete set of locations in the environment, and at each point in time it only has a small number of actions available, determined by the environment's connectivity graph (i.e., moving to the adjacent locations)."
}, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-119", "text": "These constraints on possible routes help explain our findings that language in the VLN instructions often grounds into geometric route structure in addition to visual context along the route." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-120", "text": "For example, if an instruction says turn left at the couch, and the route structure only allows the agent to turn left at a single location, it may not need to perceive the couch." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-121", "text": "Other instructions, such as go straight for 5 meters and stop may also be carried out without access to visual perception." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-122", "text": "The improvement of our mixture-of-experts approach over single models suggests that it is challenging to learn to ground language into multiple modalities in one model." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-123", "text": "The \"RN+Obj\" model ( Table 2 , line 8) has access to the same information as our best result in Table 3 , line 25, but obtains much lower success rate (39.5% vs. 51.9%)." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-124", "text": "Thus, splitting the prediction task across several models, where each has access to a different input modality, is an effective way to inject an inductive bias that encourages the model to ground into each of the modalities." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-125", "text": "Supplementary material to \"Are You Looking? 
Grounding to Multiple Modalities in Vision-and-Language Navigation\"" }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-126", "text": "----------------------------------" }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-127", "text": "**A DETAILS ON THE COMPARED VLN MODELS**" }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-128", "text": "The Speaker-Follower (SF) model (Fried et al., 2018b) and the Self-Monitoring (SM) model (Ma et al., 2019), which we analyze, both use a sequence-to-sequence model (Cho et al., 2014) with attention (Bahdanau et al., 2015) as their base instruction-following agent." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-129", "text": "Both use an encoder LSTM (Hochreiter and Schmidhuber, 1997) to represent the instruction text, and a decoder LSTM to predict actions sequentially." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-130", "text": "At each timestep, the decoder LSTM conditions on the action previously taken, a representation of the visual context at the agent's current location, and an attended representation of the encoded instruction." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-131", "text": "While at a high level these models are similar (at least in terms of the base sequence-to-sequence models; both papers additionally develop techniques to select routes from these base models during search-based inference, either using a separate language generation model in SF, or a progress monitor in SM), they differ in the mechanism by which they combine representations of the text instruction and visual input." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-132", "text": "The SM model uses a co-grounded attention mechanism, where both the visual attention on image features and the textual attention on the instruction words are generated based on the previous decoder LSTM hidden state h t\u22121 , and then the attended visual and textual features are used as LSTM inputs to produce h t ."
}, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-133", "text": "The SF model only uses attended visual features as LSTM inputs and then produces textual attention based on the updated LSTM state h t ." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-134", "text": "Also, the visual attention weights are calculated with an MLP and batch-normalization in SM, while only a linear dot-product visual attention is used in SF." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-135", "text": "Empirically these differences produce large performance improvements for the SM model, which may contribute to the smaller gap between the SM model and its non-visual counterparts." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-136", "text": "**B DETAILS ON THE TRAINING MECHANISMS** Anderson et al. (2018) compare two methods for training agents, which subsequent work on VLN has also used." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-137", "text": "These methods differ in whether they allow the agent to visit viewpoints which are not part of the true routes at training time." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-138", "text": "In the first training setup, teacher-forcing, the agent visits each viewpoint in a given true route in sequence, and is supervised at each viewpoint with the action necessary to reach the next viewpoint in the true route." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-139", "text": "In the second training setup, student-forcing, the agent takes actions by sampling from its predicted distribution at each timestep, which results in exploring viewpoints that are not part of the true routes." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-140", "text": "At each viewpoint, supervision is provided by an oracle that returns the action which would take the agent along the shortest path to the goal."
}, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-141", "text": "Empirically, student-forcing works better in nearly all settings in Table 1 (except for the non-visual version of the SF model), which is likely because it reduces the discrepancy between training and testing by allowing the agent to sample from its own predictions during training." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-142", "text": "Teacher-forcing works better for the non-visual version of the SF model, and we hypothesize that following the ground-truth routes during training allows the SF model to better preserve the geometric structures of the routes and match them to the instructions for the non-visual setting." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-143", "text": "----------------------------------" }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-144", "text": "**C DETAILS ON THE OBJECT REPRESENTATION**" }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-145", "text": "In our object representation, we use the top-150 detected objects (with the highest detection confidence) at each location in the environment." }, { "sent_id": "c705c0533600b9b93d2c89bcbc292b-C001-146", "text": "The detection results are obtained from a Faster R-CNN detector (Ren et al., 2015) pretrained on the Visual Genome dataset (Krishna et al., 2017)."
} ], "y": { "@BACK@": { "gold_contexts": [ [ "c705c0533600b9b93d2c89bcbc292b-C001-11" ], [ "c705c0533600b9b93d2c89bcbc292b-C001-33" ] ], "cite_sentences": [ "c705c0533600b9b93d2c89bcbc292b-C001-11", "c705c0533600b9b93d2c89bcbc292b-C001-33" ] }, "@MOT@": { "gold_contexts": [ [ "c705c0533600b9b93d2c89bcbc292b-C001-33", "c705c0533600b9b93d2c89bcbc292b-C001-34", "c705c0533600b9b93d2c89bcbc292b-C001-35" ] ], "cite_sentences": [ "c705c0533600b9b93d2c89bcbc292b-C001-33" ] }, "@DIF@": { "gold_contexts": [ [ "c705c0533600b9b93d2c89bcbc292b-C001-36" ] ], "cite_sentences": [ "c705c0533600b9b93d2c89bcbc292b-C001-36" ] }, "@USE@": { "gold_contexts": [ [ "c705c0533600b9b93d2c89bcbc292b-C001-36" ], [ "c705c0533600b9b93d2c89bcbc292b-C001-44" ], [ "c705c0533600b9b93d2c89bcbc292b-C001-50" ], [ "c705c0533600b9b93d2c89bcbc292b-C001-58" ], [ "c705c0533600b9b93d2c89bcbc292b-C001-60" ], [ "c705c0533600b9b93d2c89bcbc292b-C001-81" ], [ "c705c0533600b9b93d2c89bcbc292b-C001-128" ] ], "cite_sentences": [ "c705c0533600b9b93d2c89bcbc292b-C001-36", "c705c0533600b9b93d2c89bcbc292b-C001-44", "c705c0533600b9b93d2c89bcbc292b-C001-50", "c705c0533600b9b93d2c89bcbc292b-C001-58", "c705c0533600b9b93d2c89bcbc292b-C001-60", "c705c0533600b9b93d2c89bcbc292b-C001-81", "c705c0533600b9b93d2c89bcbc292b-C001-128" ] } } }, "ABC_35233406ffd78d87743478454432d5_12": { "x": [ { "sent_id": "35233406ffd78d87743478454432d5-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "35233406ffd78d87743478454432d5-C001-2", "text": "Product Community Question Answering (PCQA) provides useful information about products and their features (aspects) that may not be well addressed by product descriptions and reviews." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-3", "text": "We observe that a product's compatibility issues with other products are frequently discussed in PCQA and such issues are more frequently addressed in accessories, i.e., via a yes/no question \"Does this mouse work with windows 10?\"." 
}, { "sent_id": "35233406ffd78d87743478454432d5-C001-4", "text": "In this paper, we address the problem of extracting compatible and incompatible products from yes/no questions in PCQA." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-5", "text": "This problem can naturally have a two-stage framework: first, we perform Complementary Entity (product) Recognition (CER) on yes/no questions; second, we identify the polarities of yes/no answers to assign the complementary entities a compatibility label (compatible, incompatible or unknown)." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-6", "text": "We leverage an existing unsupervised method for the first stage and a 3-class classifier by combining a distant PU-learning method (learning from positive and unlabeled examples) together with a binary classifier for the second stage." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-7", "text": "The benefit of using distant PU-learning is that it can help to expand more implicit yes/no answers without using any human annotated data." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-8", "text": "We conduct experiments on 4 products to show that the proposed method is effective." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-9", "text": "----------------------------------" }, { "sent_id": "35233406ffd78d87743478454432d5-C001-10", "text": "****" }, { "sent_id": "35233406ffd78d87743478454432d5-C001-11", "text": "1 Introduction E-commerce websites like Amazon.com incorporate Product Community Question Answering (PCQA) into their websites to provide additional information about their products." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-12", "text": "Questions are usually posted by customers before their purchases and answers are provided by existing product owners or sellers." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-13", "text": "Compatibility issues are one popular topic in PCQA." 
}, { "sent_id": "35233406ffd78d87743478454432d5-C001-14", "text": "As shown in Figure 1, one customer may write a question like \"Will it work with Surface Pro 3?\"; an existing customer may reply with \"Yes.\"." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-15", "text": "From those 4 QA pairs discussing a Microsoft mouse, we know that the Microsoft mouse is compatible with \"Microsoft Surface Pro 3\" and \"Windows 10\" but incompatible with \"iPad\"." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-16", "text": "Furthermore, we have no idea whether \"Samsung Galaxy Tab 2 10.0\" is compatible or not with this mouse." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-17", "text": "Similar to our previous work on product reviews (Xu et al., 2016), we call the mouse the target entity and those 4 products complementary entities of the target entity." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-18", "text": "Each complementary entity forms a complementary relation with the target entity." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-19", "text": "Each yes/no answer further assigns a compatibility label to each complementary entity." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-20", "text": "Knowing which entity is compatible and which one is not is important because customers need to buy compatible ones and avoid incompatible ones." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-21", "text": "It is also important for manufacturers to realize the compatibility issues of their products." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-22", "text": "Further, recommender systems need to be aware of such issues to avoid recommending incompatible products to their valued customers." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-23", "text": "Problem Statement: We deal with the problem of identifying compatible and incompatible products from QA pairs in PCQA."
}, { "sent_id": "35233406ffd78d87743478454432d5-C001-24", "text": "More specifically, given a yes/no QA pair, we want to recognize complementary entities from questions and assign compatibility labels (compatible, incompatible or unknown) to them according to the polarity (yes, no or neutral) of the answers." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-25", "text": "Figure 1: An example of QA pairs under a mouse product: complementary products are underlined in questions, \"Yes/No\" answers are bolded and the \"Neutral\" answer is italicized; \"Microsoft Surface Pro 3\" and \"Windows 10\" are compatible products; \"iPad\" is an incompatible product; \"Samsung Galaxy Tab 2 10.0\" is unknown regarding compatibility issues." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-26", "text": "We observe that compatibility issues are mostly discussed via yes/no questions rather than open questions." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-27", "text": "This is because customers tend to ask specific questions in PCQA." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-28", "text": "We leave the work of mining compatible/incompatible products on open questions to future work." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-29", "text": "Given the structure of a QA pair, our method naturally has a two-stage framework: Complementary Entity Recognition (CER) (Xu et al., 2016) and yes/no answer classification." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-30", "text": "For the first stage, we employ a similar approach as in (Xu et al., 2016); for the second stage, it is reduced to a yes/no answer classification problem (McAuley and Yang, 2016)." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-31", "text": "We observe that the second stage provides a further research opportunity since the polarities of many yes/no answers are implicit."
}, { "sent_id": "35233406ffd78d87743478454432d5-C001-32", "text": "For example, \"Will it work with Surface Pro 3? It works.\" has no explicit \"Yes\" but it is still a yes answer." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-33", "text": "Therefore, exploiting implicit yes/no answers can further help to identify even more compatible/incompatible entities." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-34", "text": "To the best of our knowledge, there are no largely annotated implicit yes/no answers for PCQA." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-35", "text": "To save time-intensive annotation efforts, we leverage a distant PU-learning (learning from positive and unlabeled examples) method (Liu et al., 2003; Elkan and Noto, 2008) without using any human annotated answer." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-36", "text": "This is possible due to a simple observation: the beginning \"Yes\" or \"No\" word in explicit yes/no answers can serve as distant labels and can be used to expand implicit yes/no answers." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-37", "text": "For example, \"Yes, it works\" and \"It works\" have the same polarity. But the first answer is explicit and the second one is implicit." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-38", "text": "So the beginning word \"Yes\" can label \"Yes, it works\" as a yes answer and further the implicit answer \"It works\" may also be labeled as a yes answer due to its similarity with the former explicit yes answer." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-39", "text": "Besides yes and no answers, we assume that there are also neutral answers." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-40", "text": "For example, the last answer in Figure 1 is a neutral answer and we have no obvious distant label for that type of answers." 
}, { "sent_id": "35233406ffd78d87743478454432d5-C001-41", "text": "The framework of PU-learning (learning from positive and unlabeled examples) comes to the rescue since it only requires positive examples and we already have many unlabeled answers." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-42", "text": "The idea of obtaining positive examples is simple: we leverage explicit answers (both yes and no answers) as positive examples and those explicit answers can expand to implicit answers via the PU-learning framework." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-43", "text": "Since all the explicit answers are distantly labeled, we have no human annotation effort at all." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-44", "text": "To further separate yes and no answers, we utilize a binary classifier trained from explicit yes/no answers to classify all positive examples labeled by PU-learning." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-45", "text": "The major contribution of this paper can be summarized as follows: we propose the problem of mining compatible/incompatible products from PCQA; we propose a two-stage framework to solve this problem without using any human annotated data." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-46", "text": "The rest of this paper is organized as follows: we describe related work in Section 2; in Sections 3.2 and 4 we describe the proposed two-stage framework; we conduct experiments in Section 5 and then draw our conclusion." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-47", "text": "----------------------------------" }, { "sent_id": "35233406ffd78d87743478454432d5-C001-48", "text": "**RELATED WORKS**" }, { "sent_id": "35233406ffd78d87743478454432d5-C001-49", "text": "The problem of Complementary Entity Recognition (CER) was first proposed by Xu et" }, { "sent_id": "35233406ffd78d87743478454432d5-C001-50", "text": "al. (Xu et al., 2016)."
}, { "sent_id": "35233406ffd78d87743478454432d5-C001-51", "text": "However, our previous work focuses on product reviews and considers CER as a special kind of aspect extraction problem (Liu, 2015)." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-52", "text": "Determining the polarities of compatibility is reduced to a traditional sentiment classification problem." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-53", "text": "This paper focuses on yes/no QAs in PCQA, where determining the polarity of compatibility is a yes/no answer classification problem." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-54", "text": "CER is closely related to entity recognition (e.g., the Named Entity Recognition (NER) problem (Nadeau and Sekine, 2007; Zhou and Su, 2002))." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-55", "text": "The major differences are that many complementary entities are not named entities and CER heavily relies on the context of an entity (e.g., \"iPhone\" in \"I like my iPhone\" is not a complementary entity)." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-56", "text": "Complementary entities are also studied as a social network problem in recommender systems (Zheng et al., 2009; McAuley et al., 2015)." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-57", "text": "We discussed the benefits of CER over the social network formulation in (Xu et al., 2016), so we omit that discussion here but keep a performance comparison in Section 5." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-58", "text": "Community Question and Answering (CQA) has been well studied in the literature (Liu et al., 2008; Nam et al., 2009; Li and King, 2010; Anderson et al., 2012)." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-59", "text": "More specifically, product Community Question and Answering (PCQA) is studied in (McAuley and Yang, 2016; Liu et al., 2016)."
}, { "sent_id": "35233406ffd78d87743478454432d5-C001-60", "text": "They both try to find relevance between reviews and questions." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-61", "text": "(McAuley and Yang, 2016) takes questions from PCQA as queries and retrieves relevant reviews that can answer those queries." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-62", "text": "(Liu et al., 2016) considers questions in PCQA as summaries of reviews to help customers to identify relevant reviews." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-63", "text": "Extracting compatible/incompatible products from PCQA is very important." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-64", "text": "Based on our experience of annotating PCQA, we notice that PCQA usually addresses compatibility issues that are not well addressed by product descriptions." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-65", "text": "This is because the number of complementary products for a target product can be unlimited, so it is impractical to cover all of them." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-66", "text": "We also bring in the test dataset used in (Xu et al., 2016) for a comparison (Section 5)." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-67", "text": "We notice that PCQA addresses compatibility issues from a different perspective compared to product reviews." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-68", "text": "PCQA tends to be specific on compatibility issues; reviewers are free to talk about their experiences (e.g., opinions on features/aspects)." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-69", "text": "For example, customers tend to ask more specific questions like \"Will it work with Surface Pro 3\" rather than \"Will it work with my tablet?\" since the latter question is pointless; reviews are typical datasets for opinion mining and aspect extraction (Liu, 2015)."
}, { "sent_id": "35233406ffd78d87743478454432d5-C001-70", "text": "Also, it is common to see general complementary products like \"It works with my tablet.\" in reviews since reviewers do not need to specify which tablet they have." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-71", "text": "Determining the polarity of a yes/no answer is closely related to answer summarization subtask B in SemEval-2015 Task 3 (M\u00e0rquez et al., 2015)." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-72", "text": "The proposed problem differs from this subtask B in that our problem indirectly utilizes the polarity of an answer to classify complementary entity rather than directly summarizes the usefulness of an answer to a question." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-73", "text": "McAuley et" }, { "sent_id": "35233406ffd78d87743478454432d5-C001-74", "text": "al. (McAuley and Yang, 2016) classify the polarity of a PCQA answer by simply training an SVM on unigrams of labeled answers." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-75", "text": "From their predictions, we observe that they may only label explicit yes/no answers (e.g., answers begin with a \"Yes\" or \"No\") and put many implicit answers (e.g., \"I think it works.\" implies a yes answer) as uncertain." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-76", "text": "Identifying more implicit yes or no answers is crucial to the proposed problems since a complementary entity does not provide much information without its compatibility label (compatible, incompatible or uncertain)." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-77", "text": "The proposed method utilizes the PU-learning framework (Liu et al., 2003; Elkan and Noto, 2008), which can be used to expand positive examples (both yes and no answers)."
}, { "sent_id": "35233406ffd78d87743478454432d5-C001-78", "text": "We demonstrate that PU-learning can improve the recall by exploiting more implicit answers in Section 5." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-79", "text": "----------------------------------" }, { "sent_id": "35233406ffd78d87743478454432d5-C001-80", "text": "**TWO-STAGE FRAMEWORK AND CER**" }, { "sent_id": "35233406ffd78d87743478454432d5-C001-81", "text": "In this section, we first introduce the two-stage framework of the proposed method." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-82", "text": "Then we briefly introduce the method for CER in (Xu et al., 2016)." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-83", "text": "----------------------------------" }, { "sent_id": "35233406ffd78d87743478454432d5-C001-84", "text": "**TWO-STAGE FRAMEWORK**" }, { "sent_id": "35233406ffd78d87743478454432d5-C001-85", "text": "Since complementary entities are mentioned in yes/no questions and their compatibility polarities are in answers, the proposed method naturally has a two-stage framework: Complementary Entity Recognition: we extract complementary entities from questions using dependency paths almost the same as in (Xu et al., 2016)." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-86", "text": "It utilizes a large amount of unlabeled reviews under the same category as the target entity to expand knowledge about domain-specific verbs." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-87", "text": "Identifying Polarities of Yes/No Answers: then we determine the polarity (yes, no or neutral) of yes/no answers for each question with a complementary entity and assign a compatibility label (compatible, incompatible or unknown) to it." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-88", "text": "We form this 3-class classification via PU-learning and a binary SVM classifier in Section 4."
}, { "sent_id": "35233406ffd78d87743478454432d5-C001-89", "text": "----------------------------------" }, { "sent_id": "35233406ffd78d87743478454432d5-C001-90", "text": "**COMPLEMENTARY ENTITY RECOGNITION**" }, { "sent_id": "35233406ffd78d87743478454432d5-C001-91", "text": "We briefly introduce the method used in (Xu et al., 2016) and how the dependency paths can be used in questions of PCQA (details of dependency paths can be found in the original paper)." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-92", "text": "The basic idea is to use dependency paths to identify the context of complementary relations around complementary entities." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-93", "text": "Dependency paths can match dependency relations parsed through dependency parsing, which parses a sentence into a set of dependency relations." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-94", "text": "In our previous work, we notice that the verbs used to indicate a complementary relation can be unlimited and product specific." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-95", "text": "So we utilize another novel set of dependency paths that have high precision but low recall to expand knowledge about complementary entities on a large amount of unlabeled reviews." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-96", "text": "We use similar ideas in this paper since verbs in questions of PCQA are also unlimited and product specific. But we do not incorporate candidate complementary entities into dependency paths when performing extractions because complementary entities are rather specific and diverse in PCQA and general entities are rarely mentioned." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-97", "text": "We still keep candidate complementary entities when expanding knowledge about domain-specific verbs."
}, { "sent_id": "35233406ffd78d87743478454432d5-C001-98", "text": "The knowledge expansion process is the same as in our previous work." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-99", "text": "We start with seed verbs \"work\" and \"fit\"." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-100", "text": "Then we first expand candidate complementary entities on the large unlabeled reviews." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-101", "text": "Then we use those candidate complementary entities to expand domain-specific verbs, e.g., \"insert\" for micro SD card and \"hold\" for tablet stand." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-102", "text": "The idea of using reviews rather than questions in PCQA to expand domain knowledge is that reviews contain a lot of the same general complementary entities (e.g. \"tablet\") that can easily appear in different reviews." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-103", "text": "However, \"Samsung Galaxy S6\" may be in low frequency in PCQA." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-104", "text": "We use Stanford CoreNLP (http://stanfordnlp.github.io/CoreNLP/) as our dependency parser." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-105", "text": "----------------------------------" }, { "sent_id": "35233406ffd78d87743478454432d5-C001-106", "text": "**IDENTIFYING THE POLARITIES OF YES/NO ANSWERS**" }, { "sent_id": "35233406ffd78d87743478454432d5-C001-107", "text": "After CER, we need to identify whether a product is compatible or not with the target product." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-108", "text": "We assume a yes/no answer can clearly identify the polarities of the compatibility of a complementary entity for the target entity." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-109", "text": "We only classify the polarities of answers for successful extractions of complementary entities."
}, { "sent_id": "35233406ffd78d87743478454432d5-C001-110", "text": "----------------------------------" }, { "sent_id": "35233406ffd78d87743478454432d5-C001-111", "text": "**MOTIVATIONS**" }, { "sent_id": "35233406ffd78d87743478454432d5-C001-112", "text": "We assume that largely annotated yes and no answers are not available for training." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-113", "text": "We observe that the explicit mentions of \"Yes\" or \"No\" at the beginning of each answer are indicators of yes or no answers respectively." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-114", "text": "So they can be used for prediction directly." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-115", "text": "However, not every answer in PCQA begins with an explicit \"Yes\" or \"No\" word, but the polarity of the answer can still be implicitly expressed." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-116", "text": "For example, \"Yes, it works.\" and \"It works.\" have the same yes polarity, but the latter answer does not have an explicit word \"Yes\"." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-117", "text": "From the test data in Section 5, we observe that using explicit mentions of \"Yes\" or \"No\" contributes about 60% of the accuracy of yes/no answer classification." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-118", "text": "Without identifying those implicitly mentioned polarities, the polarities of compatibility for many complementary entities are uncertain." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-119", "text": "----------------------------------" }, { "sent_id": "35233406ffd78d87743478454432d5-C001-120", "text": "**DISTANT PU-LEARNING CLASSIFIER AND BINARY CLASSIFIER**" }, { "sent_id": "35233406ffd78d87743478454432d5-C001-121", "text": "We distribute the classification task across 2 classifiers."
}, { "sent_id": "35233406ffd78d87743478454432d5-C001-122", "text": "First, we use PU learning to train a classifier that can separate yes or no answers from neutral answers." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-123", "text": "Second, we train a yes or no binary classifier by using the explicit yes/no examples." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-124", "text": "From the previous examples of \"Yes, it works.\" and \"It works.\", we observe that the beginning word \"Yes\" or \"No\" is optional for a yes or no answer respectively." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-125", "text": "So \"Yes\" can be served as a distant label for the training example \"It works\"." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-126", "text": "We select all answers beginning with \"Yes\" or \"No\" as training examples and take the first words as distant labels and transform the remaining words of the answer to features." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-127", "text": "However, we notice that there is no obvious distant label for neutral answers (e.g., \"I am not sure.\")." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-128", "text": "Therefore, it is impossible to train a 3-class classifier directly." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-129", "text": "Instead, we utilize PU-learning framework (Liu et al., 2003; Elkan and Noto, 2008) to first separate implicit yes/no answers from neutral answers." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-130", "text": "PUlearning is a machine learning method using only positive and unlabeled examples (no negative examples are labeled)." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-131", "text": "To get positive examples, we first combine all examples distantly labeled by \"Yes\" or \"No\" (the first word in an answer) together." 
}, { "sent_id": "35233406ffd78d87743478454432d5-C001-132", "text": "Unlabeled examples can be easily collected from PCQA answers as long as the first word is not \"Yes\" or \"No\". Please note that unlabeled examples contain both implicit yes/no answers and neutral answers." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-133", "text": "We utilize the implementation of PU learning described in (Liu et al., 2003) ." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-134", "text": "Lastly, we train a yes/no binary classifier using the same positive examples (explicit \"Yes\" or \"No\" answers) used in PU learning. But this time we separate the distant labels \"Yes\" and \"No\"." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-135", "text": "By combining a PU-learning classifier and a binary classifier, we actually build a 3-class classifier for implicit yes/no answer classification." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-136", "text": "During testing, we ensemble the results from the first \"Yes\" or \"No\" word prediction, the PU-learning classifier and the binary classifier together." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-137", "text": "We first detect whether the answer is an explicit yes/no answer by checking the first word." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-138", "text": "If the first word is a \"Yes\" (or \"No\"), we output label yes (or no); otherwise we use PU-learning classifier to predict whether the answer is an implicit yes/no answer or neutral; if it outputs negative, we consider the answer as neutral; otherwise we consider it as an implicit yes/no answer and use the binary classifier to predict yes or no. We demonstrate this method using SVM as the base classifier for both the PU-learning classifier and the binary classifier in Section 5." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-139", "text": "In reality, other base classifiers can also be adopted." 
}, { "sent_id": "35233406ffd78d87743478454432d5-C001-140", "text": "----------------------------------" }, { "sent_id": "35233406ffd78d87743478454432d5-C001-141", "text": "**EXPERIMENTAL RESULTS**" }, { "sent_id": "35233406ffd78d87743478454432d5-C001-142", "text": "In this section, we first describe the dataset used for testing; then we introduce the evaluation methods and the baselines; lastly, we analyze the results." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-143", "text": "----------------------------------" }, { "sent_id": "35233406ffd78d87743478454432d5-C001-144", "text": "**DATASET**" }, { "sent_id": "35233406ffd78d87743478454432d5-C001-145", "text": "We crawl questions with at least one answers from product Community Question and Answering of Amazon.com and choose 4 products for test purpose." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-146", "text": "The 4 products are \"stylus\", \"micro SD card\", \"mouse\" and \"tablet stand\"." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-147", "text": "We label complementary entities mentioned in each question and the answers as yes, no or neutral." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-148", "text": "The whole test dataset is labeled by 3 annotators independently." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-149", "text": "The initial agreement is 93%." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-150", "text": "Then disagreements are discussed and final agreements are reached among all annotators." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-151", "text": "To obtain knowledge about domainspecific verbs, we use 6000 reviews for each product similar as in (Xu et al., 2016) ." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-152", "text": "We also select about 220 reviews for each product and label them in a similar way to show the difference between product QA community and reviews." 
}, { "sent_id": "35233406ffd78d87743478454432d5-C001-153", "text": "The agreement for reviews is 82%." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-154", "text": "The statistics of the datasets 2 can be found in Table 1 ." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-155", "text": "We observe that PCQA has higher densities (complementary products per sentence) of mentions of complementary entities." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-156", "text": "Further, PCQA has unique complementary entities since repeatedly asking the same question does not make sense." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-157", "text": "So identifying complementary entities from PCQA is much effective than that from customer reviews." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-158", "text": "Based on our experience of annotation, complementary entities mentioned in PCQA and in customer reviews are different." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-159", "text": "In PCQA, potential buyers frequently mention specific complementary entities as named entities (e.g., \"Microsoft Surface Pro 3\") to make their questions more accurate; in customer reviews, complementary entities can be general complementary products like \"tablet\", \"phone\", which is much less meaningful than specific products." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-160", "text": "We also read the product descriptions of these 4 products and count the number of compatible products, including general products like \"Android tablets\"." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-161", "text": "There are 13, 9, 5 and 55 compatible products for the stylus, micro SD card, mouse, tablet stand respectively." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-162", "text": "No incompatible products are mentioned in descriptions." 
}, { "sent_id": "35233406ffd78d87743478454432d5-C001-163", "text": "So we can conclude that PCQA provides more information about compatibility issues." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-164", "text": "Table 1 : Statistics of test QA dataset and corresponding reviews: we list number of QA pairs and reviews (Q/R), number of sentences in questions and sentences in reviews (Q/RSent.), number of complementary product mentions (CP), number of complementary products per question sentence or review sentence (Density), number of unique complementary products (Uniq. CP), number of positive/negative/neutral complementary products mentions (Pos./Neg./Neu.) and a few example sentences." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-165", "text": "----------------------------------" }, { "sent_id": "35233406ffd78d87743478454432d5-C001-166", "text": "**COMPARED METHODS AND EVALUATION METHODS**" }, { "sent_id": "35233406ffd78d87743478454432d5-C001-167", "text": "We first perform separate evaluations on CER and yes/no answer classification." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-168", "text": "Then we combine those two stages together to evaluate the overall accuracy." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-169", "text": "For CER, we count true positive, false positive and false negative to compute precision P, recall R and F1-score F 1 ." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-170", "text": "We consider each question as an instance and the dependency paths are applied to each sentence in that question and then the extractions combined to form one prediction." 
}, { "sent_id": "35233406ffd78d87743478454432d5-C001-171", "text": "A prediction contributes to one count of true positive when the extracted complementary products match the labeled complementary entities in one question; more or less predicted complementary entities in one question are treated as one count of false positive; failed extraction from one question is treated as one count of false negative." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-172", "text": "Noun Phrase Chunker: Most of the product names mentioned in questions of PCQA are noun phrases, we use the same noun phrase chunking pattern as the proposed method to extract noun phrases directly from questions and take them as complementary products." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-173", "text": "UIUC NER: We use UIUC Named Entity Tagger (Ratinov and Roth, 2009 ) to perform Named Entity Recognition (NER) on questions in PCQA." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-174", "text": "UIUC NER has 18 labels in total and we consider words or phrases labeled as \"PRODUCT\" or \"ORG\" as predictions of complementary products." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-175", "text": "Sceptre: We also retrieve the top 25 complements for the same 4 products from (McAuley et al., 2015) 's Sceptre and adapt their results for a comparison." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-176", "text": "Direct comparison is impossible since they deal with a link prediction problem and consider \"Items also bought\" as complementary products for training/testing." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-177", "text": "We label and compute the precision for the top 25 predictions and assume annotators have the same background knowledge for both their datasets and ours." 
}, { "sent_id": "35233406ffd78d87743478454432d5-C001-178", "text": "We observe that the predicted complementary products are irrelevant products like \"network cables\", \"mother board\", etc. and all 4 products have similar complementary products." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-179", "text": "We mostly consider \"Windows\" as complementary products for \"Mouse\"." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-180", "text": "CER6K: This method is the method proposed in (Xu et al., 2016) ." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-181", "text": "Specifically, it uses 6000 reviews to expand domain-specific verbs." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-182", "text": "Next, we perform a separate evaluation on yes/no answer classification." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-183", "text": "We assume the accuracies of complementary entities extraction are 100% and errors do not affect answer classification." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-184", "text": "We only classify answers to questions that have labeled complementary entities." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-185", "text": "Yes/No: This simple baseline predicts the polarities of yes/no answers based on the first \"Yes\" or \"No\" word in an answer; if the first word is not \"Yes\" or \"No\", it predicts the answer as neutral." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-186", "text": "Sentiment Parser: We utilize the RNN-based sentiment parser (Socher et al., 2013) to get the sentiment polarities of the first sentences in answers." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-187", "text": "We observe that opinions expressed in answers can indicate the polarities of answers." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-188", "text": "For example, \"It works well.\" indicates a positive answer." 
}, { "sent_id": "35233406ffd78d87743478454432d5-C001-189", "text": "We use We train a 3-class SVM classifier using the answer predictions from (McAuley and Yang, 2016) ." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-190", "text": "Their method originally uses 1000 labeled data as the training data for answer prediction." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-191", "text": "Since the labeled training data is not available, we use their predictions as the training data." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-192", "text": "We select 4000 examples for each class as training examples and ensemble the results with Yes/No baseline." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-193", "text": "PU SVM(Bigram): This is described in Section 4." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-194", "text": "We use 3000 yes answers and 3000 no answers as positive examples and 6000 unlabeled answers." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-195", "text": "We use bigrams as features for prediction and PU learning method in (Liu et al., 2003) as the implementation of PU learner." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-196", "text": "3 http://scikit-learn.org/ Finally, we combine the results of CER6K and PU SVM to get the Overall Results." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-197", "text": "----------------------------------" }, { "sent_id": "35233406ffd78d87743478454432d5-C001-198", "text": "**RESULT ANALYSIS**" }, { "sent_id": "35233406ffd78d87743478454432d5-C001-199", "text": "CER: From Table 2 , we can see that CER6K performs the best." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-200", "text": "NP chunker performs better than UIUC NER because NER heavily relies on capital letters as features but PCQA tends to have typos in lower case (e.g., \"samsung\" instead of \"Samsung\")." 
}, { "sent_id": "35233406ffd78d87743478454432d5-C001-201", "text": "The precision of Sceptre is low because the \"Items also bought\" training data tend to be noisy for accessories." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-202", "text": "We further observe that the recall of \"tablet stand\" is relatively low." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-203", "text": "We examine the data and find that many errors are due to parsing errors (e.g., the POS tagger treats \"stand\" as a verb, which makes dependency parsing incorrect.)." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-204", "text": "Yes/No Answer Classification: In Table 3 , we compare the results of yes/no answer classification." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-205", "text": "The first 5 methods are performed only on the answers with questions that have human-annotated complementary entities." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-206", "text": "The last column is the combined results of CER6K and PU SVM(Bigram)." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-207", "text": "All numbers are accuracies of classification for yes, no and neutral." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-208", "text": "Given many explicit yes/no answers, the Yes/No baseline performs relatively good." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-209", "text": "Sentiment parser performs worse than the Yes/No baseline." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-210", "text": "We examine the results and find that sentiment parser tends to produce more errors on negative opinions." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-211", "text": "3-class SVM(Bigram) does not have much improvement over Yes/No baseline." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-212", "text": "This is because (McAuley and Yang, 2016) 's predictions are mostly explicit yes or no answers." 
}, { "sent_id": "35233406ffd78d87743478454432d5-C001-213", "text": "We guess they mostly label implicit yes or no answers as neutral." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-214", "text": "PU learning performs better than One-class SVM because PU learning also leverages unlabeled data, even though the size of training data is smaller." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-215", "text": "In the overall results, we achieve accuracy around 70%." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-216", "text": "----------------------------------" }, { "sent_id": "35233406ffd78d87743478454432d5-C001-217", "text": "**CONCLUSIONS**" }, { "sent_id": "35233406ffd78d87743478454432d5-C001-218", "text": "In this paper, we propose the problem of mining compatible and incompatible products from product Community Question and Answering (PCQA)." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-219", "text": "We propose a two-stage framework to solve this problem." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-220", "text": "We first extract complementary entities from each question using a dependency rule-based method; then we determine the labels of compatibility for complementary entities from the polarities of yes/no answers." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-221", "text": "We leverage a distant PU learning method to identify extra implicit polarities of yes/no answers without using any human-labeled training data." }, { "sent_id": "35233406ffd78d87743478454432d5-C001-222", "text": "Experiments show that the proposed method can exploit more implicit answers." 
} ], "y": { "@SIM@": { "gold_contexts": [ [ "35233406ffd78d87743478454432d5-C001-17" ], [ "35233406ffd78d87743478454432d5-C001-30" ] ], "cite_sentences": [ "35233406ffd78d87743478454432d5-C001-17", "35233406ffd78d87743478454432d5-C001-30" ] }, "@BACK@": { "gold_contexts": [ [ "35233406ffd78d87743478454432d5-C001-17" ], [ "35233406ffd78d87743478454432d5-C001-49", "35233406ffd78d87743478454432d5-C001-50" ], [ "35233406ffd78d87743478454432d5-C001-57" ] ], "cite_sentences": [ "35233406ffd78d87743478454432d5-C001-17", "35233406ffd78d87743478454432d5-C001-50", "35233406ffd78d87743478454432d5-C001-57" ] }, "@USE@": { "gold_contexts": [ [ "35233406ffd78d87743478454432d5-C001-29" ], [ "35233406ffd78d87743478454432d5-C001-30" ], [ "35233406ffd78d87743478454432d5-C001-66" ], [ "35233406ffd78d87743478454432d5-C001-82" ], [ "35233406ffd78d87743478454432d5-C001-85" ], [ "35233406ffd78d87743478454432d5-C001-91" ], [ "35233406ffd78d87743478454432d5-C001-151" ], [ "35233406ffd78d87743478454432d5-C001-180", "35233406ffd78d87743478454432d5-C001-181", "35233406ffd78d87743478454432d5-C001-182" ] ], "cite_sentences": [ "35233406ffd78d87743478454432d5-C001-29", "35233406ffd78d87743478454432d5-C001-30", "35233406ffd78d87743478454432d5-C001-66", "35233406ffd78d87743478454432d5-C001-82", "35233406ffd78d87743478454432d5-C001-85", "35233406ffd78d87743478454432d5-C001-91", "35233406ffd78d87743478454432d5-C001-151", "35233406ffd78d87743478454432d5-C001-180" ] }, "@MOT@": { "gold_contexts": [ [ "35233406ffd78d87743478454432d5-C001-49", "35233406ffd78d87743478454432d5-C001-50", "35233406ffd78d87743478454432d5-C001-51", "35233406ffd78d87743478454432d5-C001-52", "35233406ffd78d87743478454432d5-C001-53" ] ], "cite_sentences": [ "35233406ffd78d87743478454432d5-C001-50" ] }, "@EXT@": { "gold_contexts": [ [ "35233406ffd78d87743478454432d5-C001-91", "35233406ffd78d87743478454432d5-C001-92" ] ], "cite_sentences": [ "35233406ffd78d87743478454432d5-C001-91" ] } } }, "ABC_808e0a94b877182dc06447c8682a63_12": 
{ "x": [ { "sent_id": "808e0a94b877182dc06447c8682a63-C001-95", "text": "**RESULTS**" }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-49", "text": "A uniform cutoff of 40 applies to all n-grams in this corpus." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-94", "text": "----------------------------------" }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-5", "text": "Our best system combines the two feature sets, achieving up to 0.8% absolute UAS improvements on newswire and 1.4% on web text." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-6", "text": "----------------------------------" }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-2", "text": "We develop novel first-and second-order features for dependency parsing based on the Google Syntactic Ngrams corpus, a collection of subtree counts of parsed sentences from scanned books." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-3", "text": "We also extend previous work on surface n-gram features from Web1T to the Google Books corpus and from first-order to second-order, comparing and analysing performance over newswire and web treebanks." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-4", "text": "Surface and syntactic n-grams both produce substantial and complementary gains in parsing accuracy across domains." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-8", "text": "Current state-of-the-art parsers score over 90% on the standard newswire evaluation, but the remaining errors are difficult to overcome using only the training corpus." 
}, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-9", "text": "Features from n-gram counts over resources like Web1T (Brants and Franz, 2006 ) have proven to be useful proxies for syntax (Bansal and Klein, 2011; Pitler, 2012) , but they enforce linear word order, and are unable to distinguish between syntactic and non-syntactic co-occurrences." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-10", "text": "Longer n-grams are also noisier and sparser, limiting the range of potential features." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-11", "text": "In this paper we develop new features for the graph-based MSTParser (McDonald and Pereira, 2006) from the Google Syntactic Ngrams corpus (Goldberg and Orwant, 2013) , a collection of Stanford dependency subtree counts." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-12", "text": "These features capture information collated across millions of subtrees produced by a shift-reduce parser, trading off potential systemic parser errors for data that is better aligned with the parsing task." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-13", "text": "We compare the performance of our syntactic n-gram features against the surface n-gram features of Bansal and Klein (2011) in-domain on newswire and out-of-domain on the English Web Treebank (Petrov and McDonald, 2012) across CoNLL-style (LTH) dependencies." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-14", "text": "We also extend the first-order surface n-gram features to second-order, and compare the utility of Web1T and the Google Books Ngram corpus (Lin et al., 2012) as surface n-gram sources." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-15", "text": "We find that surface and syntactic n-grams provide statistically significant and complementary accuracy improvements in-and out-of-domain." 
}, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-16", "text": "Our best LTH system combines the two feature sets to achieve 92.5% unlabeled attachment score (UAS) on newswire and 85.2% UAS averaged over web text on a baseline of 91.7% and 83.8%." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-17", "text": "Our analysis shows that the combined system is able to draw upon the strengths of both surface and syntactic features whilst avoiding their weaknesses." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-18", "text": "----------------------------------" }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-19", "text": "**SYNTACTIC N-GRAM FEATURES**" }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-20", "text": "The Google Syntactic Ngrams English (2013) corpus 1 contains counts of dependency tree fragments over a 345 billion word selection of the Google Books data, parsed with a beam-search shift-reduce parser and Stanford dependencies (Goldberg and Orwant, 2013 )." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-21", "text": "The parser is trained over substantially more annotated data than typically used in dependency parsing." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-22", "text": "Unlike surface n-grams, syntactic n-grams are not restricted to linear word order or affected by non- syntactic co-occurrence." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-23", "text": "Given a head-argument ambiguity, we extract different combinations of word, POS tag, and directionality, and search the Syntactic Ngrams corpus for matching subtrees." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-24", "text": "To reduce the impact of this search during run time, we extract all possible combinations in the training and test corpora ahead of time and total the frequencies of each configuration, storing these in a lookup table that is used by the parser at run-time to compute feature values." 
}, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-25", "text": "We did not use any features based on the dependency label as these are assigned in a separate pass in MSTParser." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-26", "text": "Table 1 summarizes the first-order features extracted from the dependency hold \u2192 hearing depicted in Figure 1 ." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-27", "text": "The final feature encodes the POS tags of the head and argument, directionality, the binned distance between the head and argument, and a bucketed frequency of the syntactic n-gram calculated as per Equation 1, creating bucket labels from 0 in increments of 5 (0, 5, 10, etc.) ." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-28", "text": "Additional features for each bucket value up to the maximum are also encoded." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-29", "text": "We also develop paraphrase-style features like those of Bansal and Klein (2011) based on the most frequently occurring words and POS tags before, in between, and after each head-argument ambiguity (see Section 3.2)." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-30", "text": "Figure 1 depicts the potential context words available the hold \u2192 hearing dependency." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-31", "text": "We experiment with a number of second-order features, mirroring those extracted for surface ngrams in Section 3.3." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-32", "text": "We extract all triple and sibling word and POS structures considered by the parser in the training and test corpora (following the factorization depicted in Figure 2 ), and counted their frequency in the Syntactic Ngrams corpus." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-33", "text": "Importantly, we require that matching subtrees in the Syntactic Ngrams corpus maintain the position of the parent relative to its children." 
}, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-34", "text": "We generate separate features encoding the word and POS tag variants of each triple and sibling structure." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-35", "text": "Similar to the surface n-gram features (Section 3), counts for our syntactic n-gram features are precomputed to improve the run-time efficiency of the parser." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-36", "text": "Experiments on the development set led to a minimum cutoff frequency of 10,000 for each feature to avoid noise from parser and OCR errors." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-37", "text": "3 Surface n-gram Features Bansal and Klein (2011) demonstrate that features generated from bucketing simple surface n-gram counts and collecting the top paraphrase-based contextual words over Web1T are useful for almost all attachment decisions, boosting dependency parsing accuracy by up to 0.6%." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-38", "text": "However, this technique is restricted to counts based purely on the linear order of the adjacent words, and is unable to incorporate disambiguating information such as POS tags to avoid spurious counts." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-39", "text": "Bansal and Klein (2011) also tested only on in-domain text, though these external count features should be useful out of domain." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-40", "text": "We extract Bansal and Klein (2011) 's affinity and paraphrase-style first-order features from the Google Books English Ngrams corpus, and compare their performance against Web1T counts." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-41", "text": "Both corpora are very large, contain different types of noise, and are sourced from very different underlying texts." 
}, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-42", "text": "We also extend Bansal and Klein's affinity and paraphrase features to second-order." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-43", "text": "----------------------------------" }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-44", "text": "**SURFACE N-GRAM CORPORA**" }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-45", "text": "The Web1T corpus contains counts of 1 to 5-grams over 1 trillion words of web text (Brants and Franz, 2006) ." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-46", "text": "Unigrams must appear at least 200 times in the source text before being included in the corpus, while longer n-grams have a cutoff of 40." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-47", "text": "Pitler et al. (2010) has documented a number of sources of noise in the corpus, including duplicate sentences (such as legal disclaimers and boilerplate text), disproportionately short or long sentences, and primarily alphanumeric sentences." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-48", "text": "The Google Books Ngrams English corpus (2012) contains counts of 1 to 5-grams over 468 billion words sourced from scanned books published across three centuries (Michel et al., 2011) ." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-50", "text": "This corpus is affected by the accuracy of OCR and digitization tools; the changing typography of books across time is one issue that may create spurious cooccurrences and counts (Lin et al., 2012) ." 
}, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-51", "text": "----------------------------------" }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-52", "text": "**FIRST-ORDER SURFACE N-GRAM FEATURES**" }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-53", "text": "Affinity features rely on the intuition that frequently co-occurring words in large unlabeled text collections are likely to be in a syntactic relationship (Nakov and Hearst, 2005; Bansal and Klein, 2011) ." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-54", "text": "N-gram resources such as Web1T and Google Books provide large offline collections from which these co-occurrence statistics can be harvested; given each head and argument ambiguity in a training and test corpus, the corpora can be linearly scanned ahead of parsing time to reduce the impact of querying in the parser." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-55", "text": "When scanning, the head and argument word may appear immediately adjacent to one another in linear order (CONTIG), or with up to three intervening words (GAP1, GAP2, and GAP3) as the maximum n-gram length is five." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-56", "text": "The total count is then discretized as per Equation 1 previously." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-57", "text": "The final parser features include the POS tags of the potential head and argument, the discretized count, directionality, and the binned length of the dependency." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-58", "text": "Additional cumulative features are generated using each bucket from the pre-calculated up to the maximum bucket size." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-59", "text": "Paraphrase-style surface n-gram features attempt to infer attachments indirectly." 
}, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-60", "text": "Nakov and Hearst (2005) propose several static patterns to resolve a variety of nominal and prepositional attachment ambiguities." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-61", "text": "For example, they give the example of sentence (1) below, paraphrase it into sentence (2), and examine how frequent the paraphrase is. If it should happen sufficiently often, this serves as evidence for the nominal attachment to demands in sentence (1) rather than the verbal attachment to meet." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-62", "text": "----------------------------------" }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-63", "text": "**MEET DEMANDS FROM CUSTOMERS 2. MEET THE CUSTOMERS DEMANDS**" }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-64", "text": "In Bansal and Klein (2011) , paraphrase features are generated for all full-parse attachment ambiguities from the surface n-gram corpus." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-65", "text": "For each attachment ambiguity, 3-grams of the form ( q 1 q 2 ), (q 1 q 2 ), and (q 1 q 2 ) are extracted, where q 1 and q 2 are the head and argument in their linear order of appearance in the original sentence, and is any single context word appearing before, in between, or after the query words." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-66", "text": "Then the most frequent words appearing in each of these configurations for each head-argument ambiguity is encoded as a feature with the POS tags of the head and argument 2 ." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-67", "text": "Given the arc hold \u2192 hearing in Figure 2 , public is the most frequent word appearing in the n-gram (hold hearing) in Web1T." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-68", "text": "Thus, the final encoded feature is POS (hold) \u2227 POS (hearing) \u2227 public \u2227 mid." 
}, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-69", "text": "Further generalization is achieved by using a unigram POS tagger trained on the WSJ data to tag each context word, and encoding features using each unique tag of the most frequent context words." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-70", "text": "----------------------------------" }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-71", "text": "**SECOND-ORDER SURFACE N-GRAM FEATURES**" }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-72", "text": "We extend the first-order surface n-gram features to new features over second-order structures." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-73", "text": "We experimented with triple and sibling features, reflecting the second-order factorization used in MSTParser (see Figure 2) ." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-74", "text": "As with first-order features, we convert all triple and sibling structures from the training and test 2 The top 20 words in between and top 5 words before and after are used for all paraphrase-style features in this paper." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-75", "text": "data into query n-grams, maintaining their linear order." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-76", "text": "In Figure 2 , the corresponding n-grams are hold hearing Tuesday , and hearing Tuesday." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-77", "text": "We then scan the n-gram corpus for each query n-gram and sum the frequency of each." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-78", "text": "Frequencies are summed over each configuration (including intervening words) that the query n-gram may appear in, as depicted below." 
}, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-79", "text": "\u2022 (q 1 q 2 q 3 )" }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-80", "text": "where q 1 , q 2 , and q 3 are the words of the triple in their linear order, and is a single intervening word of any kind." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-81", "text": "We encode the POS tags of the parent and children (or just the children for sibling features), along with the bucketed count, directionality, and the binned distance between the two children." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-82", "text": "We also extract paraphrase-style features for siblings in the same way as for first-order n-grams, and cumulative variants up to the maximum bucket size." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-83", "text": "----------------------------------" }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-84", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-85", "text": "As with Bansal and Klein (2011) and Pitler (2012) , we convert the Penn Treebank to dependencies using pennconverter 3 (Johansson and Nugues, 2007) (henceforth LTH) and generate POS tags with MX-POST (Ratnaparkhi, 1996) ." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-86", "text": "We used sections 02-21 of the WSJ for training, 22 for development, and 23 for final testing." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-87", "text": "The test sections of the answers, newsgroups, and reviews sections of the English Web Treebank as per the SANCL 2012 Shared Task (Petrov and McDonald, 2012) were converted to LTH and used for out-of-domain evaluation." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-88", "text": "We used MSTParser (McDonald and Pereira, 2006) , trained with the parameters order:2, training-k:5, iters:10, and loss-type:nopunc." 
}, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-89", "text": "We omit labeled attachment scores in this paper for brevity, but they are consistent with the reported UAS scores." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-90", "text": "On the out-of-domain web treebank, surface and syntactic features each improve over the baseline by an average of roughly 0.8 -1.0% on the test sets." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-91", "text": "All of our results are also statistically significant improvements over the baseline." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-92", "text": "While our syntactic n-gram counts are computed over Stanford dependencies and almost certainly include substantial parser and OCR errors, they still provide a significant performance improvement for LTH parsing." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-93", "text": "Additionally, the Syntactic Ngrams dataset is drawn from a wide variety of genres, but helps with both newswire and web text parsing." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-96", "text": "The best results on LTH dependencies used second-order sibling features in addition to the firstorder features for both surface and syntactic ngrams." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-97", "text": "A combined system of Google Books surface n-gram features and syntactic n-gram features (which performed individually best on the development set) produces absolute UAS improvements of 0.8% over the baseline on the WSJ test set, and 1.4% over the baseline averaged across the three web tree- bank testing domains." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-98", "text": "These results are significantly higher than any feature set in isolation, showing that surface and syntactic n-gram features are complementary and individually address different types of errors being made by the parser." 
}, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-99", "text": "Figure 3 gives an error breakdown by highfrequency gold argument POS tag on LTH dependencies for the baseline, Web1T surface n-grams, syntactic n-grams, and combined systems reported in Table 2 ." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-100", "text": "For almost every POS tag, the combined system outperforms the baseline and makes equal or fewer errors than either the surface or syntactic n-gram features in isolation." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-101", "text": "Syntactic n-grams are worse relative to surface n-grams on noun, adjectival, and prepositional parts of speech -constructions which are known to be difficult to parse." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-102", "text": "Table 3 : Correct attachments by gold argument POS tag and the percentage of the overall error reduction over WSJ section 22 for the baseline and combined systems in Table 2. that we have discussed as helping resolve these issues, it is unsurprising that syntactic n-gram features using the counts from the Goldberg and Orwant (2013) parser are less effective." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-103", "text": "In comparison, surface n-grams are worse on conjunctive and verbal parts of speech, suggesting that the localized nature of these features is less useful for the idiosyncrasies of coordination representations and longerrange subject/object relationships." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-104", "text": "----------------------------------" }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-105", "text": "**ANALYSIS**" }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-106", "text": "Whilst Web1T and Google Books features perform similarly overall, Books n-grams are more effective for noun structures, and Web1T n-grams are slightly better in predicting PP attachment sites." 
}, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-107", "text": "alone fare worst." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-108", "text": "However, the system finds less improvement in coordinators, determiners, and cardinal numbers, all of which are also components of noun phrases." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-109", "text": "This shows the difficulty of correctly identifying a head noun in a nominal to attach modifiers to, and the general difficulty of representing and parsing coordination." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-110", "text": "Web1T contains approximately double the total number of n-grams as Google Books." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-111", "text": "Table 4 shows that 27% and 32.5% of the n-gram queries from the WSJ sections 2-23 and the entire English Web Treebank do not receive features from Web1T and Google Books respectively." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-112", "text": "The intersection of these queries is 24.7% of the total, showing that the two corpora have small but substantive differences in word distributions; this may partially explain why our combined feature experiments work so well." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-113", "text": "However, the similar performance of surface n-gram features extracted from these sources suggests Web1T contains substantial noise." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-114", "text": "We had expected our syntactic n-gram features to perform better than they did since they address many of the shortcomings of using surface n-grams." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-115", "text": "Syntactic features are sensitive to the quality of the parser used to produce them, but in this case the parser is difficult to assess as the source corpus is enormous and extracted using OCR from scanned books." 
}, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-116", "text": "Even if the parser is state of the art, it is being used to parse diverse texts spanning multiple genres across a wide time period, compounded by potential scanning and digitization errors." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-117", "text": "Additionally, a post-hoc analysis of the types of errors present in the corpus is impossible due to the exclusion of the full parse trees, though Goldberg and Orwant (2013) note that this data would almost certainly be computationally prohibitive to process." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-118", "text": "Despite this, our work has shown that counts from this corpus provide useful features for parsing." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-119", "text": "Futhermore, these features stack with surface n-gram features, providing substantial overall performance improvements." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-120", "text": "----------------------------------" }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-121", "text": "**FUTURE WORK**" }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-122", "text": "A combination of features from all of the sources used in this work would be interesting avenues for further investigation, especially since these features seem strongly complementary." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-123", "text": "We could also explore more of the POS and head-modifier annotations available in the Google Books Ngram corpus to develop features which are a middle ground between surface and syntactic n-gram features." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-124", "text": "The Google Books and Syntactic Ngrams corpora both provide frequencies by date, and it would be interesting to explore how well features extracted from different date ranges would perform -particularly on text from roughly the same eras." 
}, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-125", "text": "Resampling Web1T to reduce it to a comparable corpus that is the same size as Google Books would also provide better insights on how many n-grams are noise." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-126", "text": "----------------------------------" }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-127", "text": "**RELATED WORK**" }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-128", "text": "Surface n-gram counts from large web corpora have been used to address NP and PP attachment errors (Volk, 2001; Nakov and Hearst, 2005) Aside from Bansal and Klein (2011) , other feature-based approaches to improving dependency parsing include Pitler (2012) , who exploits Brown clusters and point-wise mutual information of surface n-gram counts to specifically address PP and coordination errors." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-129", "text": "Chen et al. (2013) describe a novel way of generating meta-features that work to emphasise important feature types used by the parser." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-130", "text": "Chen et al. (2009) generate subtree-based features that are similar to ours." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-131", "text": "However, they use the in-domain BLLIP newswire corpus to generate their subtree counts, whereas the Syntactic Ngrams corpus is out-of-domain and an order of magnitude larger." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-132", "text": "They also use the same underlying parser to generate the BLLIP subtree counts and as the final test-time parser, while Syntactic Ngrams is parsed with a simpler, shift-reduce parser compared to the graph-based MSTParser used during test time." 
}, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-133", "text": "They also evaluate only on newswire text, whilst our work systematically explores various configurations of surface and syntactic n-gram features in-and outof-domain." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-134", "text": "We developed features for dependency parsing using subtree counts from 345 billion parsed words of scanned English books." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-135", "text": "We extended existing work on surface n-grams from first to second-order, and investigated the utility of web text and scanned books as sources of surface n-grams." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-136", "text": "Our individual feature sets all perform similarly, providing significant improvements in parsing accuracy of about 0.5% on newswire and up to 1.0% averaged across the web treebank domains." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-137", "text": "They are also complementary, with our best system combining surface and syntactic n-gram features to achieve up to 1.3% UAS improvements on newswire and 1.6% on web text." }, { "sent_id": "808e0a94b877182dc06447c8682a63-C001-138", "text": "We hope that our work will encourage further efforts to unify different sources of unlabeled and automatically parsed data for dependency parsing, addressing the relative strengths and weaknesses of each source." 
} ], "y": { "@MOT@": { "gold_contexts": [ [ "808e0a94b877182dc06447c8682a63-C001-9" ], [ "808e0a94b877182dc06447c8682a63-C001-37", "808e0a94b877182dc06447c8682a63-C001-38" ] ], "cite_sentences": [ "808e0a94b877182dc06447c8682a63-C001-9", "808e0a94b877182dc06447c8682a63-C001-37" ] }, "@BACK@": { "gold_contexts": [ [ "808e0a94b877182dc06447c8682a63-C001-9" ], [ "808e0a94b877182dc06447c8682a63-C001-37" ], [ "808e0a94b877182dc06447c8682a63-C001-53" ], [ "808e0a94b877182dc06447c8682a63-C001-64" ], [ "808e0a94b877182dc06447c8682a63-C001-128" ] ], "cite_sentences": [ "808e0a94b877182dc06447c8682a63-C001-9", "808e0a94b877182dc06447c8682a63-C001-37", "808e0a94b877182dc06447c8682a63-C001-53", "808e0a94b877182dc06447c8682a63-C001-64", "808e0a94b877182dc06447c8682a63-C001-128" ] }, "@USE@": { "gold_contexts": [ [ "808e0a94b877182dc06447c8682a63-C001-13" ], [ "808e0a94b877182dc06447c8682a63-C001-29" ], [ "808e0a94b877182dc06447c8682a63-C001-42" ], [ "808e0a94b877182dc06447c8682a63-C001-85" ] ], "cite_sentences": [ "808e0a94b877182dc06447c8682a63-C001-13", "808e0a94b877182dc06447c8682a63-C001-29", "808e0a94b877182dc06447c8682a63-C001-42", "808e0a94b877182dc06447c8682a63-C001-85" ] }, "@SIM@": { "gold_contexts": [ [ "808e0a94b877182dc06447c8682a63-C001-29" ] ], "cite_sentences": [ "808e0a94b877182dc06447c8682a63-C001-29" ] }, "@EXT@": { "gold_contexts": [ [ "808e0a94b877182dc06447c8682a63-C001-40" ] ], "cite_sentences": [ "808e0a94b877182dc06447c8682a63-C001-40" ] } } }, "ABC_da2429450c8d1f1f3e72383c86ec73_12": { "x": [ { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-2", "text": "We describe a classifier to predict the message-level sentiment of English microblog messages from Twitter." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-3", "text": "This paper describes the classifier submitted to the SemEval-2014 competition (Task 9B)." 
}, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-4", "text": "Our approach was to build up on the system of the last year's winning approach by NRC Canada 2013 (Mohammad et al., 2013) , with some modifications and additions of features, and additional sentiment lexicons." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-5", "text": "Furthermore, we used a sparse ( 1 -regularized) SVM, instead of the more commonly used 2 -regularization, resulting in a very sparse linear classifier." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-6", "text": "----------------------------------" }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-8", "text": "With the immense growth of user generated text online, the interest in automatic sentiment analysis of text has greatly increased recently in both academia and industry." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-9", "text": "In this paper, we describe our approach for a modified SVM based classifier for short text as in Twitter messages." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-10", "text": "Our system has participated in the SemEval-2014 Task 9 competition, \"Sentiment Analysis in Twitter, Subtask-B Message Polarity Classification\" (Rosenthal et al., 2014) ." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-11", "text": "The goal is to classify a tweet (on the full message level) into the three classes positive, negative, and neutral." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-12", "text": "An almost identical competition was already run in 2013." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-13", "text": "Our Results in the Competition." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-14", "text": "Our approach was ranked on the 8th place out of the 50 participating submissions, with an F1-score of 67." 
}, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-15", "text": "(The more detailed rankings of our approach were 4th rank on the LiveJournal data, 5th on the SMS data (2013), 18th on Twitter-2013, and 16th on Twitter Sarcasm, see (Rosenthal et al., 2014) for full details and all results)." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-16", "text": "Data." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-17", "text": "In the competition, the tweets for training and development were only provided as tweet IDs." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-18", "text": "A fraction (10-15%) of the tweets were no longer available on twitter, which makes the results of the competition not fully comparable." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-19", "text": "For testing, in addition to last years data (tweets and SMS), new tweets and data from a surprise domain were provided." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-20", "text": "An overview of the data, which we were able to download, is shown in Table 1 . (Tweets) 1417 494 286 637 Test: Twitter2014 1853 982 202 669 Test: Twitter2013 3813 1572 601 1640 Test: SMS2013 2093 492 394 1207 Test: Tw2014Sarcasm 86 33 40 13 Test: LiveJournal2014 1142 427 304 411 2 Description of Our Approach" }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-21", "text": "Compared to the previous NRC Canada 2013 approach (Mohammad et al., 2013) , our main changes are the following three: First we use sparse linear classifiers instead of classical dense ones." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-22", "text": "Secondly, we drop n-gram features completely, in favor of what we call part-of-speech n-grams, which are n-grams where up to two tokens are the original ones, and the rest of the tokens is replaced by their corresponding POS tag (noun, verb, punctuation etc) ." 
}, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-23", "text": "Third, we added two new sentiment lexicons, containing numerical scores associated for all 3 classes (positive, neutral, negative), instead of just 2 as in classical po-larity lexicons." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-24", "text": "All changes are described in more detail in Sections 4 and 3 below." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-25", "text": "Performance." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-26", "text": "We tried to reproduce the same classifier as in (Mohammad et al., 2013) as a baseline for comparison." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-27", "text": "Trying to quantify our contributions, when adding all our additional features and tricks described below, the score of our method increases from the baseline of 63.25 to 64.81 (on the Twitter-2013 test set), which is a gain of 1.56 points in F1." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-28", "text": "Baseline Approach by NRC Canada 2013." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-29", "text": "Unfortunately our replica system of Mohammad et al. (2013) only achieved an F1-score of 63.25 on the Twitter-2013 test set, while their score in the 2013 competition on the same test set was 69.02, nearly 6 points higher in F1." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-30", "text": "Part of this big difference might be explained by the fact that the exact same training sets are not available anymore." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-31", "text": "Other possibly more important differences are the SVM classifier variant used and class weighting (described in Section 4)." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-32", "text": "Furthermore, we didn't implement all features in the exactly same way, see the more detailed description in Section 3.1.2 below." 
}, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-33", "text": "Although we had the impression that these changes individually had only a relatively minor effect, it might be that the changes together with the different training set add up to the difference in score." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-34", "text": "----------------------------------" }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-35", "text": "**FEATURES**" }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-36", "text": "Before we describe the linear classifier in Section 4, we detail the used features for each tweet message." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-37", "text": "On average, we generated 843 features per tweet." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-38", "text": "For comparison, the average in our NRC Canada 2013 replica system was only 285." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-39", "text": "Most of the increase in features comes from the fact that we allowed for slightly longer n-grams (6 instead of 4), and substrings (length 6 instead of 5)." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-40", "text": "----------------------------------" }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-41", "text": "**NEW FEATURES**" }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-42", "text": "----------------------------------" }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-43", "text": "**PART OF SPEECH N-GRAMS**" }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-44", "text": "We used the ArkTweetNLP structured prediction POS tagger provided by Owoputi et al. (2013) together with their provided standard model (model.20120919) suitable for twitter data." 
}, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-45", "text": "Part of speech n-grams are n-grams where up to two tokens are kept as the original ones, and all other tokens are replaced by their corresponding POS tag (noun, verb, punctuation etc)." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-46", "text": "We generated these modified n-grams for all possible positions of the one or two original tokens within the n positions, for 3 \u2264 n \u2264 6." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-47", "text": "As features for a classifier, we found POS ngrams at least as useful (with some more robustness) as the n-grams themselves." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-48", "text": "In our final approach, we dropped the use of n-grams completely, and only used POS n-grams instead." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-49", "text": "The idea of replacing some of the tokens by their POS tag is also investigated by Joshi and Penstein-Ros\u00e9 (2009) , where the authors used n \u2264 3." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-50", "text": "----------------------------------" }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-51", "text": "**VARIOUS CHANGES COMPARED TO NRC**" }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-52", "text": "Canada 2013" }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-53", "text": "\u2022 We do not allow n-grams (or POS n-grams) to span over sentence boundaries." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-54", "text": "\u2022 Substrings of length up to 6 (instead of 5)." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-55", "text": "\u2022 Substring features are weighted increasingly by their length (weights 0.7 \u00b7 {1.0, 1.1, 1.2, 1.4, 1.6, 1.9} for lengths 3, 4, . . . )" }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-56", "text": "\u2022 Instead of the score itself, we used the sigmoid value s(t) = 1/(1 + e \u2212t )) of each lexicon score." 
}, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-57", "text": "For each lexicon, the 4 scores were the same as in (Mohammad et al., 2013) , i.e. per tweet, we use the number of tokens appearing in the lexicon, the sum and the max of the scores, and the last non-zero score." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-58", "text": "We skipped some features from the baseline approach (because their effect was not significant in our setting): Elongated words (number of words with one character repeated more than two times), and word clustering." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-59", "text": "Also, we had a slightly simplified variant of how to use the lexicon scores." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-60", "text": "We didn't count the lexicon scores separately per emotion (pos and neg), but only altogether." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-61", "text": "----------------------------------" }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-62", "text": "**EXISTING FEATURES**" }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-63", "text": "Text Preprocessing." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-64", "text": "A good tokenization seems very important for twitter data." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-65", "text": "We used the popular tokenizer ArkTweetNLP (Owoputi et al., 2013) which is suitable for tweets." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-66", "text": "All text was transformed to lowercase (except for those features in (Mohammad et al., 2013) which use case information)." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-67", "text": "As usual, URLs were normalized to http://someurl and twitter user IDs to @someuser." 
}, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-68", "text": "We also employed the usual marking of negated contexts of a sentence as in (Pang et al., 2002) , using the list of negation words from Christopher Potts' sentiment tutorial 1 ." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-69", "text": "----------------------------------" }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-70", "text": "**CLASSIFIER**" }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-71", "text": "We used a linear support vector machine (SVM) classifier, which is standard for text data." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-72", "text": "The LibLinear package (Fan et al., 2008) was employed for training the multi-class classifier." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-73", "text": "Multi-Class Formulation, and Class Weights." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-74", "text": "We found significant performance changes depending on which type of multi-class SVM, and also which regularizer ( 1 -or 2 -norm) is used." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-75", "text": "For the multi-class variant, we found the oneagainst-all models to perform slightly better than the Crammer and Singer (2001) formulation." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-76", "text": "More importantly, since the 3 classes (positive, negative and neutral) are quite unbalanced in size in the training set, it is crucial to set a good weight for each class in the SVM." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-77", "text": "We used (4.52, 1.38, 1.80), which corresponds to the twice the ratio of each class compared to the average class size." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-78", "text": "Sparse Linear Classifiers." 
}, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-79", "text": "In our setting, an 1 -regularized squared loss SVM (one-against-all) performed best (this is mode L1R L2LOSS SVC in LibLinear), despite the fact that 2 -regularization is generally more commonly used in text applications." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-80", "text": "We used C = 0.055 for the regularization parameter, and \u03b5 = 0.003 as the optimization stopping criterion." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-81", "text": "We did not employ any kernel, but always used linear classifiers." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-82", "text": "Another benefit of the 1 -regularization is that the resulting classifier is extremely sparse and compact, which significantly accelerates the evaluation of the classifier on large amounts of text, e.g. for testing." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-83", "text": "Our final classifier only uses 1985 non-zero features (1427 unigram/substrings, and 558 other features, such as lexicon scores, ngrams, POS n-grams, as explained in the previous Section 3)." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-84", "text": "As the resulting classifier is so small, it is also relatively easy to read and interpret." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-85", "text": "We have made our final classifier weights publicly available for download as a text file 2 ." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-86", "text": "Every line contains the feature description followed by the 3 weights corresponding to the 3 sentiment classes." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-87", "text": "1 http://sentiment.christopherpotts." 
}, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-88", "text": "net/lingstruc.html 2 http://www.m8j.net/sentiment/ Our final classifier was trained on 9641 tweets, which are all we could download from the IDs given in this years train and dev set." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-89", "text": "----------------------------------" }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-90", "text": "**LEXICONS**" }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-91", "text": "A sentiment lexicon is a mapping from words (or n-grams) to an association score corresponding to positive or negative sentiment." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-92", "text": "Such lists can be constructed either from manually labeled data (supervised), or automatically labeled data (unsupervised) as for example tweets with a positive or negative smiley." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-93", "text": "We used the same set of lexicons as in (Mohammad et al., 2013) , with one addition:" }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-94", "text": "----------------------------------" }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-95", "text": "**A LEXICON FOR 3-CLASS CLASSIFICATION**" }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-96", "text": "Our main new addition was another type of lexicon, which not only provides one score per word, but 3 of them, (being the association to positive, negative and neutral)." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-97", "text": "The idea here is to improve on the discrimination quality, especially for neutral text, and treat all 3 labels in this multi-class task the same way, instead of just 2 as in the previous approaches." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-98", "text": "Data." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-99", "text": "We found it challenging to find good datasets to build such a lexicon." 
}, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-100", "text": "We again used the Sentiment140 corpus (Go et al., 2009 ) (containing tweets with positive or negative emoticons)." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-101", "text": "Using a subset of 100k positive and 100k negative ones, we added a set of 100k arbitrary (hopefully neutral) tweets." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-102", "text": "The neutral set was chosen randomly from the thinknook.com dataset 3 of 1.5mio tweets (from which we ignored the provided labels, and counted the tweets as neutral)." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-103", "text": "We did the same with the movie reviews from the recent kaggle competition on annotated reviews from the rotten-tomatoes website 4 ." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-104", "text": "We automatically built a lexicon from 100k texts in this dataset, with the data balanced equally for the three classes." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-105", "text": "Features Used in the Lexicon." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-106", "text": "To construct the lexicon, we extracted the POS n-grams (as we described in Section 3.1.1 above) from all texts." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-107", "text": "In comparison, Mohammad et al. (2013) used noncontiguous n-grams (unigram-unigram, unigrambigram, and bigram-bigram pairs) ." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-108", "text": "We only used POS n-grams with 2 tokens kept original, and the remaining ones replaced by their POS tag, with n ranging from 3 to 6." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-109", "text": "Building the Lexicon." 
}, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-110", "text": "While in (Mohammad et al., 2013) , the score for each n-gram was computed using point-wise mutual information (PMI) with the labels, we trained a linear classifier on the same labels instead." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-111", "text": "The lexicon weights are set as the resulting classifier weights for our (POS) n-grams." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-112", "text": "We used the same type of sparse SVM trained with LibLinear, for 3 classes, as in the final classifier." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-113", "text": "Download of the Lexicons." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-114", "text": "We built 4 lexicons as described above." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-115", "text": "Thanks to the sparsity of the linear weights from the SVM, they are again relatively small, analogous to the final classifier." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-116", "text": "We also provide the lexicons for download as text files 5 ." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-117", "text": "----------------------------------" }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-118", "text": "**EXISTING LEXICONS**" }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-119", "text": "Lexicons from Manually Labeled Data." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-120", "text": "We used the same 3 existing sentiment lexicons as in (Mohammad et al., 2013) ." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-121", "text": "All lexicons give a single score for each word (if present in the lexicon)." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-122", "text": "Those existing lexicons are: NRC Emotion Lexicon (about 14k words), the MPQA Lexicon (about 8k words), and the Bing Liu Lexicon (about 7k words)." 
}, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-123", "text": "Lexicons from Automatically Labeled Data." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-124", "text": "The NRC hashtag sentiment lexicon was generated automatically from a set of 775k tweets containing a hashtag of a small predefined list of positive and negative hashtags (Mohammad et al., 2013) ." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-125", "text": "Lexicon scores were trained via PMI (point-wise mutual information)." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-126", "text": "Scores are not only available for words, but also unigramunigram, unigram-bigram, and bigram-bigram pairs (that can be non-contiguous in the text)." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-127", "text": "The Sentiment140 lexicon (Go et al., 2009 ) was generated automatically from a set of 1.6 million tweets containing a positive or negative emoticon." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-128", "text": "This uses the same features and scoring as above." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-129", "text": "----------------------------------" }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-130", "text": "**CONCLUSION**" }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-131", "text": "We have described an SVM classifier to detect the sentiment of short texts such as tweets." }, { "sent_id": "da2429450c8d1f1f3e72383c86ec73-C001-132", "text": "Our system is built up on the approach of NRC Canada (Mohammad et al., 2013) , with several modifications and extensions (e.g. 
sparse linear classifiers," } ], "y": { "@EXT@": { "gold_contexts": [ [ "da2429450c8d1f1f3e72383c86ec73-C001-4" ], [ "da2429450c8d1f1f3e72383c86ec73-C001-21" ], [ "da2429450c8d1f1f3e72383c86ec73-C001-93" ], [ "da2429450c8d1f1f3e72383c86ec73-C001-132" ] ], "cite_sentences": [ "da2429450c8d1f1f3e72383c86ec73-C001-4", "da2429450c8d1f1f3e72383c86ec73-C001-21", "da2429450c8d1f1f3e72383c86ec73-C001-93", "da2429450c8d1f1f3e72383c86ec73-C001-132" ] }, "@USE@": { "gold_contexts": [ [ "da2429450c8d1f1f3e72383c86ec73-C001-4" ], [ "da2429450c8d1f1f3e72383c86ec73-C001-26" ], [ "da2429450c8d1f1f3e72383c86ec73-C001-66" ], [ "da2429450c8d1f1f3e72383c86ec73-C001-93" ], [ "da2429450c8d1f1f3e72383c86ec73-C001-120" ], [ "da2429450c8d1f1f3e72383c86ec73-C001-124" ], [ "da2429450c8d1f1f3e72383c86ec73-C001-132" ] ], "cite_sentences": [ "da2429450c8d1f1f3e72383c86ec73-C001-4", "da2429450c8d1f1f3e72383c86ec73-C001-26", "da2429450c8d1f1f3e72383c86ec73-C001-66", "da2429450c8d1f1f3e72383c86ec73-C001-93", "da2429450c8d1f1f3e72383c86ec73-C001-120", "da2429450c8d1f1f3e72383c86ec73-C001-124", "da2429450c8d1f1f3e72383c86ec73-C001-132" ] }, "@DIF@": { "gold_contexts": [ [ "da2429450c8d1f1f3e72383c86ec73-C001-29" ], [ "da2429450c8d1f1f3e72383c86ec73-C001-106", "da2429450c8d1f1f3e72383c86ec73-C001-107", "da2429450c8d1f1f3e72383c86ec73-C001-108" ], [ "da2429450c8d1f1f3e72383c86ec73-C001-110" ] ], "cite_sentences": [ "da2429450c8d1f1f3e72383c86ec73-C001-29", "da2429450c8d1f1f3e72383c86ec73-C001-107", "da2429450c8d1f1f3e72383c86ec73-C001-110" ] }, "@SIM@": { "gold_contexts": [ [ "da2429450c8d1f1f3e72383c86ec73-C001-57" ], [ "da2429450c8d1f1f3e72383c86ec73-C001-120" ] ], "cite_sentences": [ "da2429450c8d1f1f3e72383c86ec73-C001-57", "da2429450c8d1f1f3e72383c86ec73-C001-120" ] } } }, "ABC_06db17253d76150772c0926e11131d_12": { "x": [ { "sent_id": "06db17253d76150772c0926e11131d-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "06db17253d76150772c0926e11131d-C001-2", "text": "Visual Question Answering (VQA) is 
the task of answering questions about an image." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-3", "text": "Some VQA models often exploit unimodal biases to provide the correct answer without using the image information." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-4", "text": "As a result, they suffer from a huge drop in performance when evaluated on data outside their training set distribution." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-5", "text": "This critical issue makes them unsuitable for real-world settings." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-6", "text": "We propose RUBi, a new learning strategy to reduce biases in any VQA model." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-7", "text": "It reduces the importance of the most biased examples, i.e. examples that can be correctly classified without looking at the image." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-8", "text": "It implicitly forces the VQA model to use the two input modalities instead of relying on statistical regularities between the question and the answer." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-9", "text": "We leverage a question-only model that captures the language biases by identifying when these unwanted regularities are used." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-10", "text": "It prevents the base VQA model from learning them by influencing its predictions." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-11", "text": "This leads to dynamically adjusting the loss in order to compensate for biases." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-12", "text": "We validate our contributions by surpassing the current state-of-the-art results on VQA-CP v2." 
}, { "sent_id": "06db17253d76150772c0926e11131d-C001-13", "text": "This dataset is specifically designed to assess the robustness of VQA models when exposed to different question biases at test time than what was seen during training." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-14", "text": "Our code is available: github.com/cdancette/rubi.bootstrap.pytorch" }, { "sent_id": "06db17253d76150772c0926e11131d-C001-15", "text": "----------------------------------" }, { "sent_id": "06db17253d76150772c0926e11131d-C001-16", "text": "**INTRODUCTION**" }, { "sent_id": "06db17253d76150772c0926e11131d-C001-17", "text": "The recent Deep Learning success in computer vision [1] and natural language understanding [2] allowed researchers to tackle multimodal tasks that combine visual and textual modalities [3] [4] [5] [6] [7] ." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-18", "text": "Among these tasks, Visual Question Answering (VQA) attracts increasing attention." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-19", "text": "The goal of the VQA task is to answer a question about an image." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-20", "text": "It requires a high-level understanding of the visual scene and the question, but also to ground the textual concepts in the image and to use both modalities adequately." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-21", "text": "Solving the VQA task could have tremendous impacts on real-world applications such as aiding visually impaired users in understanding their physical and online surroundings, searching through large quantities of visual data via natural language interfaces, or even communicating with robots using more efficient and intuitive interfaces." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-22", "text": "Several large real image VQA datasets have recently emerged [8] [9] [10] [11] [12] [13] [14] ." 
}, { "sent_id": "06db17253d76150772c0926e11131d-C001-23", "text": "Each one of them targets specific abilities that a VQA model would need to be used in real-world settings such as fine-grained recognition, object detection, counting, activity recognition, commonsense reasoning, etc." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-24", "text": "Current end-to-end VQA models [15] [16] [17] [18] [19] [20] [21] [22] achieve impressive results on most of these benchmarks and are even able to surpass the human accuracy on a specific benchmark accounting for compositional Figure 1 : Our RUBi approach aims at reducing the amount of unimodal biases learned by a VQA model during training." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-25", "text": "As depicted, current VQA models often rely on unwanted statistical correlations between the question and the answer instead of using both modalities." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-26", "text": "reasoning [23] ." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-27", "text": "However, it has been shown that they tend to exploit statistical regularities between answer occurrences and certain patterns in the question [24, 10, 25, 23, 13] ." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-28", "text": "While they are designed to merge information from both modalities, in practice they often answer without considering the image modality." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-29", "text": "When most of the bananas are yellow, a model does not need to learn the correct behavior to reach a high accuracy for questions asking about the color of bananas." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-30", "text": "Instead of looking at the image, detecting a banana and assessing its color, it is much easier to learn from the statistical shortcut linking the words what, color and bananas with the most occurring answer yellow." 
}, { "sent_id": "06db17253d76150772c0926e11131d-C001-31", "text": "One way to quantify the amount of statistical shortcuts from each modality is to train unimodal models." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-32", "text": "For instance, a question-only model trained on the widely used VQA v2 dataset [9] predicts the correct answer approximately 44% of the time over the test set." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-33", "text": "VQA models are not discouraged to exploit these statistical shortcuts from the question modality, because their training set often follows the same distribution as their testing set." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-34", "text": "However, when evaluated on a test set that displays different statistical regularities, they usually suffer from a significant drop in accuracy [10, 25] ." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-35", "text": "Unfortunately, these statistical regularities are hard to avoid when collecting real datasets." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-36", "text": "As illustrated in Figure 1 , there is a crucial need to develop new strategies to reduce the amount of biases coming from the question modality in order to learn better behaviors." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-37", "text": "We propose RUBi, a training strategy to reduce the amount of biases learned by VQA models." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-38", "text": "Our strategy reduces the importance of the most biased examples, i.e. examples that can be correctly classified without looking at the image modality." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-39", "text": "It implicitly forces the VQA model to use the two input modalities instead of relying on statistical regularities between the question and the answer." 
}, { "sent_id": "06db17253d76150772c0926e11131d-C001-40", "text": "We take advantage of the fact that question-only models are by design biased towards the question modality." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-41", "text": "We add a question-only branch on top of a base VQA model during training only." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-42", "text": "This branch influences the VQA model, dynamically adjusting the loss to compensate for biases." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-43", "text": "As a result, the gradients backpropagated through the VQA model are reduced for the most biased examples and increased for the less biased." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-44", "text": "At the end of the training, we simply remove the question-only branch." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-45", "text": "We run extensive experiments on VQA-CP v2 [10] and demonstrate the ability of RUBi to surpass current state-of-the-art results from a significant margin." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-46", "text": "This dataset has been specifically designed to assess the capacity of VQA models to be robust to biases from the question modality." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-47", "text": "We show that our RUBi learning framework provides gains when applied on several VQA architectures such as Stacked Attention Networks [26] and Top-Down Bottom-Up Attention [15] ." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-48", "text": "We also show that RUBi is competitive on the standard VQA v2 dataset [9] when compared to approaches that reduce unimodal biases." 
}, { "sent_id": "06db17253d76150772c0926e11131d-C001-49", "text": "----------------------------------" }, { "sent_id": "06db17253d76150772c0926e11131d-C001-50", "text": "**RELATED WORK**" }, { "sent_id": "06db17253d76150772c0926e11131d-C001-51", "text": "Real-world datasets display some form of inherent biases due to their collection process [27] [28] [29] ." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-52", "text": "As a result, machine learning models tend to reflect these biases because they capture often undesirable correlations between the inputs and the ground truth annotations [30] [31] [32] ." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-53", "text": "Procedures exist to identify certain kinds of biases and to reduce them." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-54", "text": "For instance, some methods are focused on gender biases [33, 34] , some others on the human reporting biases [35] , and also on the shift in distribution between lab-curated data and real-world data [36] ." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-55", "text": "In the following, we discuss about related works that assess and reduce unimodal biases learned by VQA models." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-56", "text": "Assessing unimodal biases in datasets and models Despite being designed to merge the two input modalities, it has been found that VQA models often rely on superficial correlations between inputs from one modality and the answers without considering the other modality [37, 32] ." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-57", "text": "An interesting way to quantify the amount of unimodal biases that can potentially be learned by a VQA model consists in training models using only one of the two modalities [8, 9] ." 
}, { "sent_id": "06db17253d76150772c0926e11131d-C001-58", "text": "The question-only model is a particularly strong baseline because of the large amount of statistical regularities that can be leveraged from the question modality." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-59", "text": "With the RUBi learning strategy, we take advantage of this baseline model to prevent VQA models from learning question biases." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-60", "text": "Unfortunately, biased models that exploit statistical shortcuts from one modality usually reach impressive accuracy on most of the current benchmarks." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-61", "text": "VQA-CP v2 and VQA-CP v1 [10] were recently introduced as diagnostic datasets containing different answer distributions for each questiontype between train and test splits." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-62", "text": "Consequentially, models biased towards the question modality fail on these benchmarks." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-63", "text": "We use the more challenging VQA-CP v2 dataset extensively in order to show the ability of our approach to reduce the learning of biases coming from the question modality." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-64", "text": "Balancing datasets to avoid unimodal biases Once the unimodal biases have been identified, one method to overcome these biases is to create more balanced datasets." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-65", "text": "For instance, the synthetic datasets for VQA [23, 13] minimize question-conditional biases via rejection sampling within families of related questions to avoid simple shortcuts to the correct answer." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-66", "text": "Doing rejection sampling in real VQA datasets is usually not possible due to the cost of annotations." 
}, { "sent_id": "06db17253d76150772c0926e11131d-C001-67", "text": "Another solution is to collect complementary examples to increase the difficulty of the task." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-68", "text": "For instance, VQA v2 [9] has been introduced to weaken language priors in the VQA v1 dataset [8] by identifying complementary images." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-69", "text": "For a given VQA v1 question, VQA v2 also contains a similar image with a different answer to the same question." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-70", "text": "However, even with this additional balancing, statistical biases from the question remain and can be leveraged [10] ." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-71", "text": "That is why we propose an approach to reduce unimodal biases during training." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-72", "text": "It is designed to learn unbiased models from biased datasets." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-73", "text": "Our learning strategy dynamically modifies the loss values to reduce biases from the question." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-74", "text": "By doing so, we reduce the importance of certain examples, similarly to the rejection sampling approach, while increasing the importance of complementary examples which are already in the training set." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-75", "text": "Architectures and learning strategies to reduce unimodal biases In parallel of these previous works on balancing datasets, an important effort has been carried out to design VQA models to overcome biases from datasets." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-76", "text": "[10] proposed a hand-designed architecture called Grounded VQA model (GVQA)." 
}, { "sent_id": "06db17253d76150772c0926e11131d-C001-77", "text": "It breaks the task of VQA down into a first step of locating and recognizing the visual regions needed to answer the question, and a second step of identifying the space of plausible answers based on a question-only branch." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-78", "text": "This approach requires training multiple sub-models separately." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-79", "text": "In contrast, our learning strategy is end-to-end." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-80", "text": "Their complex design is not straightforward to apply on different architectures while our approach is model-agnostic." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-81", "text": "While we rely on a question-only branch, we remove it at the end of the training." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-82", "text": "The work most related to ours in terms of approach is [25] ." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-83", "text": "The authors propose a learning strategy to overcome language priors in VQA models." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-84", "text": "They first introduce an adversary question-only branch." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-85", "text": "It takes as input the question encoding from the VQA model and produces a question-only loss." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-86", "text": "They use a gradient negation of this loss to discourage the question encoder to capture unwanted biases that could be exploited by the VQA model." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-87", "text": "They also propose a loss based on the difference of entropies between the VQA model and the question-only branch output distributions." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-88", "text": "These two losses are only backpropagated to the question encoder." 
}, { "sent_id": "06db17253d76150772c0926e11131d-C001-89", "text": "In contrast, our learning strategy targets the full VQA model parameters to reduce the impact of unwanted biases more effectively." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-90", "text": "Instead of relying on these two additional losses, we use the question-only branch to dynamically adapt the value of the classification loss in order to reduce the learning of biases in the VQA model." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-91", "text": "----------------------------------" }, { "sent_id": "06db17253d76150772c0926e11131d-C001-92", "text": "**REDUCING UNIMODAL BIASES APPROACH**" }, { "sent_id": "06db17253d76150772c0926e11131d-C001-93", "text": "We consider the common formulation of the Visual Question Answering (VQA) task as a multi-class classification problem." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-94", "text": "Given a dataset D consisting of n triplets (v i , q i , a i ) i\u2208 [1,n] with v i \u2208 V an image, q i \u2208 Q a question in natural language and a i \u2208 A an answer, one must optimize the parameters \u03b8 of the function f : V \u00d7 Q \u2192 R |A| to produce accurate predictions." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-95", "text": "For a single example, VQA models use an image encoder e v : V \u2192 R nv\u00d7dv to output a set of n v vectors of dimension d v , a question encoder e q : Q \u2192 R nq\u00d7dq to output a set of n q vectors of dimension d q , a multimodal fusion m : R nv\u00d7dv \u00d7 R nq\u00d7dq \u2192 R dm , and a classifier c : R dm \u2192 R |A| ." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-96", "text": "These functions are composed as follows:" }, { "sent_id": "06db17253d76150772c0926e11131d-C001-97", "text": "Each one of them can be defined to instantiate most of the state of the art models, such as [26, 38, 19, 39, 17, 40, 16] to cite a few." 
}, { "sent_id": "06db17253d76150772c0926e11131d-C001-98", "text": "Classical learning strategy and pitfall The classical learning strategy of VQA models, depicted in Figure 2 , consists in minimizing the standard cross-entropy criterion over a dataset of size n." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-99", "text": "VQA models are inclined to learn unimodal biases from the datasets [10] ." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-100", "text": "This can be shown by evaluating models on datasets that have different distributions of answers for the test set, such as VQA-CP v2." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-101", "text": "In other words, they rely on statistical regularities from one modality to provide accurate predictions without having to consider the other modality." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-102", "text": "As an extreme example, strongly biased models towards the question modality always output yellow to the question what color is the banana." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-103", "text": "They do not learn to use the image information because there are too few examples in the dataset where the banana is not yellow." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-104", "text": "Once trained, their inability to use the two modalities adequately makes them inoperable on data coming from different distributions such as real-world data." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-105", "text": "Our contribution consists in modifying this cost function to avoid the learning of these biases." 
}, { "sent_id": "06db17253d76150772c0926e11131d-C001-106", "text": "----------------------------------" }, { "sent_id": "06db17253d76150772c0926e11131d-C001-107", "text": "**RUBI LEARNING STRATEGY**" }, { "sent_id": "06db17253d76150772c0926e11131d-C001-108", "text": "Learning biases with a question-only branch One way to measure the unimodal biases in VQA datasets is to train an unimodal model which takes only one of the two modalities as input." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-109", "text": "A question-only model can be formalized as a function f Q : Q \u2192 R |A| parameterized by \u03b8 Q , and composed of a question encoder e q and a classifier c q ." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-110", "text": "All parameters are learned with the standard cross-entropy criterion from Equation (2) ." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-111", "text": "In the second row, we illustrate how RUBi increases the loss for examples that cannot be answered without using both modalities." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-112", "text": "The key idea of our approach, depicted in Figure 2 , is to use the question-only model to capture the question biases, allowing the model to focus on the examples that cannot be answered correctly using the question modality only." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-113", "text": "For any VQA model of the form presented in Equation (1), we add a question-only branch, i.e. a classifier using the encoder e q from the VQA model." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-114", "text": "During training, the branch acts as a proxy preventing the VQA model from learning biases." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-115", "text": "At the end of the training, we simply remove the branch and use the predictions from the base VQA model." 
}, { "sent_id": "06db17253d76150772c0926e11131d-C001-116", "text": "Preventing biases by masking predictions Before passing the predictions of our base VQA model to the loss function defined in Equation (2), we merge them with the output of the question-only branch." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-117", "text": "This output is a mask containing a scalar value between 0 and 1 for each answer." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-118", "text": "It is obtained by passing the output of c q through a sigmoid function \u03c3." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-119", "text": "The goal of this mask is to dynamically alter the loss by modifying the predictions of the VQA model." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-120", "text": "To obtain the new predictions, we simply compute an element-wise product between the mask and the original predictions as defined in the following equation." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-121", "text": "We use L QM to refer to the cross-entropy loss associated with these predictions." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-122", "text": "We use \u03b8 QM to refer to the union of the parameters of the base model and the question classifier c q ." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-123", "text": "We modify the predictions in this specific way to prevent the VQA model to learn biases from the question." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-124", "text": "To better understand the impact of our approach on the learning, we examine two scenarios." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-125", "text": "First, we reduce the importance of the most biased examples, i.e. examples that can be correctly classified without using the image modality." 
}, { "sent_id": "06db17253d76150772c0926e11131d-C001-126", "text": "To do so, the question-only branch outputs a mask to increase the score of the correct answer while decreasing the scores of the others." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-127", "text": "As a result, the loss is much lower for these biased examples." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-128", "text": "In other words, the gradients backpropagated through the VQA model are smaller, thereby reducing the importance of these examples in the learning." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-129", "text": "As illustrated in the first row of Figure 3, given the question what color is the banana, the mask takes a high value of 0.8 for the answer yellow which is the most likely answer for this question in the training set." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-130", "text": "On the other hand, the value for the other answers green and white are smaller." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-131", "text": "We see that the mask influences the VQA model to produce new predictions where the score associated with the answer yellow increases from 0.8 to 0.94." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-132", "text": "Compared to the classical learning approach, the loss is smaller with RUBi and decreases from 0.22 to 0.06." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-133", "text": "Secondly, we increase the importance of examples that cannot be answered without using both modalities." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-134", "text": "For these examples, the question-only branch outputs a mask that increases the score of the wrong answer." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-135", "text": "As a result, the loss is much higher and the VQA model is encouraged to learn from these examples." 
}, { "sent_id": "06db17253d76150772c0926e11131d-C001-136", "text": "We illustrate this behavior in the second row of Figure 3 for the same question about the color of the banana." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-137", "text": "When the image contains a green banana, RUBi increases the loss from 0.69 to 1.20." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-138", "text": "We also add a classifier c q with a question-only cross-entropy loss L QO on top of the question-only branch." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-139", "text": "By doing so, we further improve the unimodal branch ability to capture biases." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-140", "text": "It reduces the amount of unimodal biases learned by the VQA model." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-141", "text": "With this loss, we optimize \u03b8 QO containing the parameters of the question-only branch modules, c q and c q ." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-142", "text": "Note that we don't backpropagate this loss to the question encoder to prevent it from directly learning question biases." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-143", "text": "We obtain our final loss L RUBi by summing the two losses together in the following equation:" }, { "sent_id": "06db17253d76150772c0926e11131d-C001-144", "text": "----------------------------------" }, { "sent_id": "06db17253d76150772c0926e11131d-C001-145", "text": "**BASELINE ARCHITECTURE**" }, { "sent_id": "06db17253d76150772c0926e11131d-C001-146", "text": "Most VQA architectures from the state of the art are compatible with our RUBi learning strategy." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-147", "text": "To test our strategy, we design a fast and simple architecture inspired from [16] ." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-148", "text": "This baseline architecture is detailed in the supplementary material." 
}, { "sent_id": "06db17253d76150772c0926e11131d-C001-149", "text": "As common in the state of the art, our baseline architecture encodes the image as a bag of n v visual features v i \u2208 R dv using the pretrained Faster R-CNN by [15] , and encodes the question as a vector q \u2208 R dq using a GRU, pretrained on the skipthought task [3] ." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-150", "text": "The VQA model consists of a Bilinear BLOCK fusion [17] which merge the question representation q with the features v i of each region of the image." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-151", "text": "The output is aggregated using a max pooling on the n v regions." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-152", "text": "The resulting vector is then fed into a MLP classifier which outputs the final predictions." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-153", "text": "While most of our experiments are done with this fast and simple baseline architecture, we experimentally demonstrate that the RUBi learning strategy is effective on other VQA architectures." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-154", "text": "----------------------------------" }, { "sent_id": "06db17253d76150772c0926e11131d-C001-155", "text": "**EXPERIMENTS**" }, { "sent_id": "06db17253d76150772c0926e11131d-C001-156", "text": "Experimental setup We train and evaluate our models on VQA-CP v2 [10] ." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-157", "text": "This dataset was developed to evaluate the models robustness to question biases." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-158", "text": "We follow the same training and evaluation protocol as [25] , who also propose a learning strategy to reduce biases." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-159", "text": "For each model, we report the standard VQA evaluation metric [8] ." 
}, { "sent_id": "06db17253d76150772c0926e11131d-C001-160", "text": "We also evaluate our models on the standard VQA v2 [9] ." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-161", "text": "Further implementation details are included in the supplementary materials." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-162", "text": "----------------------------------" }, { "sent_id": "06db17253d76150772c0926e11131d-C001-163", "text": "**RESULTS**" }, { "sent_id": "06db17253d76150772c0926e11131d-C001-164", "text": "State-of-the-art comparison In Table 1 , we compare our approach consisting of our baseline architecture trained with RUBi on VQA-CP v2 against the state of the art." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-165", "text": "To be fair, we only report approaches that use the strong visual features from [15] ." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-166", "text": "We compute the average accuracy over 5 experiments with different random seeds." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-167", "text": "Our RUBi approach reaches an average overall accuracy of 47.11% with a low standard deviation of \u00b10.51." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-168", "text": "This accuracy corresponds to a gain of +5.94 percentage points over the current state-of-the-art UpDn + Q-Adv + DoE. It also corresponds to a gain of +15.88 over GVQA [10] , which is a specific architecture designed for VQA-CP." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-169", "text": "RUBi reaches a +8.65 improvement over our baseline model trained with the classical cross-entropy." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-170", "text": "In comparison, the second best approach UpDn + Q-Adv + DoE only achieves a +1.43 gain in overall accuracy over their baseline UpDn." 
}, { "sent_id": "06db17253d76150772c0926e11131d-C001-171", "text": "In addition, our approach does not significantly reduce the accuracy over our baseline for the answer type Other, while the second best approach reduces it by 10.57 point." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-172", "text": "Architecture agnostic RUBi can be used on existing VQA models without changing the underlying architecture." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-173", "text": "In Table 2 , we experimentally demonstrate the generality and effectiveness of our Table 1 : State-of-the-art results on VQA-CP v2 test." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-174", "text": "All reported models use the same features from [15] ." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-175", "text": "Models with * have been trained by [25] ." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-176", "text": "Models with ** have been trained by [41] . . [26] and Bottom-Up and Top-Down Attention (UpDn) [15] ." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-177", "text": "First, we show that applying RUBi on these architectures leads to important gains over the baselines trained with their original learning strategy." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-178", "text": "We report a gain of +11.73 accuracy point for SAN and +4.5 for UpDn." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-179", "text": "This lower gap in accuracy may show that UpDn is less driven by biases than SAN." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-180", "text": "This is consistent with results from [25] ." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-181", "text": "Secondly, we show that these architectures trained with RUBi obtain better accuracy than with the state-of-the-art strategy from [25] ." 
}, { "sent_id": "06db17253d76150772c0926e11131d-C001-182", "text": "We report a gain of +3.4 with SAN + RUBi over SAN + Q-Adv + DoE, and +3.06 with UpDn + RUBi over UpDn + Q-Adv + DoE." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-183", "text": "----------------------------------" }, { "sent_id": "06db17253d76150772c0926e11131d-C001-184", "text": "**MODEL OVERALL**" }, { "sent_id": "06db17253d76150772c0926e11131d-C001-185", "text": "----------------------------------" }, { "sent_id": "06db17253d76150772c0926e11131d-C001-186", "text": "**IMPACT ON VQA V2**" }, { "sent_id": "06db17253d76150772c0926e11131d-C001-187", "text": "We report the impact of our method on the standard VQA v2 dataset in Table 3 ." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-188", "text": "It uses the same data as VQA-CP v2, but includes the same statistical regularities in its train, val and test sets." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-189", "text": "In this context, we usually observe a drop in accuracy using approaches focused on reducing biases." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-190", "text": "This is due to the fact that exploiting unwanted correlations from the VQA v2 train set is not discouraged and often leads to a higher accuracy on the test set." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-191", "text": "Nevertheless, our RUBi approach leads to a comparable drop to what can be seen in the state-of-the-art." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-192", "text": "We report a drop of 1.94 percentage points with respect to our baseline, while [10] report a drop of 3.78 between GVQA and their SAN baseline." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-193", "text": "[25] report drops of 0.05, 0.73 and 2.95 for their three learning strategies with the UpDn architecture which uses the same visual features as RUBi." 
}, { "sent_id": "06db17253d76150772c0926e11131d-C001-194", "text": "As shown in this section, RUBi improves the accuracy on VQA-CP v2 from a large margin, while maintaining competitive performance on the standard VQA v2 dataset compared to similar approaches." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-195", "text": "----------------------------------" }, { "sent_id": "06db17253d76150772c0926e11131d-C001-196", "text": "**QUALITATIVE ANALYSIS**" }, { "sent_id": "06db17253d76150772c0926e11131d-C001-197", "text": "To better understand the impact of our RUBi approach, we compare in Figure 4 the answer distribution on VQA-CP v2 for some specific question patterns." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-198", "text": "We also display interesting behaviors on some examples using attention maps extracted as in [16] ." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-199", "text": "In the first row, we show the ability of RUBi to reduce biases for the is this person skiing question pattern." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-200", "text": "Most examples in the train set have the answer yes, while in the test set, they have the answer no. Nevertheless, RUBi outputs 80% of no, while the baseline almost always outputs yes." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-201", "text": "Interestingly, the best scoring region from the attention map of both models is localized on the shoes." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-202", "text": "To get the answer right, RUBi seems to reason about the absence of skis in this region." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-203", "text": "It seems that our baseline gets it wrong by not seeing that the skis are not locked under the ski boots." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-204", "text": "This unwanted behavior could be due to the question biases." 
}, { "sent_id": "06db17253d76150772c0926e11131d-C001-205", "text": "In the second row, similar behaviors occur for the what color are the bananas question pattern." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-206", "text": "80% of the answers from the train set are yellow, while most of them are green in the test set." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-207", "text": "We show that the amount of green and white answers from RUBi are much closer to the ones from the test set than with our baseline." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-208", "text": "In the example, it seems that RUBi relies on the color of the banana, while our baseline misses it." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-209", "text": "In the third row, it seems that RUBi is able to ground the textual concepts such as top part of the fire hydrant and color on the right visual region, while the baseline relies on the correlations between the fire hydrant, the yellow color of its core and the answer yellow." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-210", "text": "Similarly on the fourth row, RUBi grounds color, star, fire hydrant on the right region, while our baseline relies on correlations between color, fire hydrant, the yellow color of the top part region and the answer yellow." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-211", "text": "Interestingly, there is no similar question that involves the color of a star on a fire hydrant in the training set." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-212", "text": "It shows the capacity of RUBi to generalize to unseen examples by composing and grounding existing visual and textual concepts from other kinds of question patterns." 
}, { "sent_id": "06db17253d76150772c0926e11131d-C001-213", "text": "On the left, we display distributions of answers for the train set, the baseline evaluated on the test set, RUBi on the test set and the ground truth answers from the test set." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-214", "text": "For each row, we filter questions in a certain way." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-215", "text": "In the first row, we keep the questions that exactly match the string is this person skiing." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-216", "text": "In the three other rows, we filter questions that respectively include the following words: what color bananas, what color fire hydrant and what color star hydrant." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-217", "text": "On the right, we display examples that contains the pattern from the left." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-218", "text": "For each example, we display the answer of our baseline and RUBi, as well as the best scoring region from their attention map." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-219", "text": "----------------------------------" }, { "sent_id": "06db17253d76150772c0926e11131d-C001-220", "text": "**CONCLUSION**" }, { "sent_id": "06db17253d76150772c0926e11131d-C001-221", "text": "We propose RUBi to reduce unimodal biases learned by Visual Question Answering (VQA) models." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-222", "text": "RUBi is a simple learning strategy designed to be model agnostic." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-223", "text": "It is based on a question-only branch that captures unwanted statistical regularities from the question modality." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-224", "text": "This branch influences the base VQA model to prevent the learning of unimodal biases from the question." 
}, { "sent_id": "06db17253d76150772c0926e11131d-C001-225", "text": "We demonstrate a significant gain of +5.94 percentage point in accuracy over the state-of-the-art result on VQA-CP v2, a dataset specifically designed to account for question biases." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-226", "text": "We also show that RUBi is effective with different kinds of common VQA models." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-227", "text": "In future works, we would like to extend our approach on other multimodal tasks." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-228", "text": "----------------------------------" }, { "sent_id": "06db17253d76150772c0926e11131d-C001-229", "text": "**IMPLEMENTATION DETAILS**" }, { "sent_id": "06db17253d76150772c0926e11131d-C001-230", "text": "Image encoder We use the pretrained Faster R-CNN by [15] to extract object features." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-231", "text": "We use the setup that extracts 36 regions for each image." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-232", "text": "We do not fine-tune the image extractor." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-233", "text": "Question encoder We use the same preprocessing as in [16] ." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-234", "text": "We apply a lower case transformation and remove the punctuation." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-235", "text": "We only consider the most frequent 3000 answers for both VQA v2 and VQA CP v2." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-236", "text": "We then use a pretrained Skip-thought encoder with a two-glimpses self attention mechanism." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-237", "text": "The final embedding is of size 4800." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-238", "text": "We fine-tune the question encoder during training." 
}, { "sent_id": "06db17253d76150772c0926e11131d-C001-239", "text": "Baseline architecture Our baseline architecture is a simplified version of the MuRel architecture [16] ." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-240", "text": "First, it computes a bilinear fusion between the question vector and the visual features for each region." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-241", "text": "The bilinear fusion module is a BLOCK [17] composed of 15 chunks, each of rank 15." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-242", "text": "The dimension of the projection space is 1000, and the output dimension is 2048." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-243", "text": "The output of the bilinear fusion is aggregated using a max pooling over n v regions." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-244", "text": "The resulting vector is then fed into a MLP classifier composed of three layers of size (2048, 2048, 3000), with ReLU activations." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-245", "text": "It outputs the predictions over the space of the 3000 answers." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-246", "text": "Question-only branch The RUBi question-only branch feeds the question into a first MLP composed of three layers, of size (2048, 2048, 3000), with ReLU activations." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-247", "text": "First, this output vector goes through a sigmoid to compute the mask that will alter the predictions of the VQA model." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-248", "text": "Secondly, this same output vector goes through a single linear layer of size 3000." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-249", "text": "We use these question-only predictions to compute the question-only loss." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-250", "text": "Optimization process We train all our models with the Adam optimizer." 
}, { "sent_id": "06db17253d76150772c0926e11131d-C001-251", "text": "We train our baseline architecture with the learning rate scheduler of [16] ." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-252", "text": "We use a learning rate of 1.5 \u00d7 10 \u22124 and a batch size of 256." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-253", "text": "During the first 7 epochs, we linearly increase the learning rate to 6 \u00d7 10 \u22124 ." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-254", "text": "After epoch 14, we apply a learning rate decay strategy which multiplies the learning rate by 0.25 every two epochs." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-255", "text": "We train our models until convergence as we do not have a validation set for VQA-CP v2." }, { "sent_id": "06db17253d76150772c0926e11131d-C001-256", "text": "For the UpDn and SAN architectures, we follow the optimization procedure described in [25] ." } ], "y": { "@BACK@": { "gold_contexts": [ [ "06db17253d76150772c0926e11131d-C001-22" ], [ "06db17253d76150772c0926e11131d-C001-61" ], [ "06db17253d76150772c0926e11131d-C001-99" ] ], "cite_sentences": [ "06db17253d76150772c0926e11131d-C001-22", "06db17253d76150772c0926e11131d-C001-61", "06db17253d76150772c0926e11131d-C001-99" ] }, "@MOT@": { "gold_contexts": [ [ "06db17253d76150772c0926e11131d-C001-27", "06db17253d76150772c0926e11131d-C001-28" ], [ "06db17253d76150772c0926e11131d-C001-34" ], [ "06db17253d76150772c0926e11131d-C001-70" ] ], "cite_sentences": [ "06db17253d76150772c0926e11131d-C001-27", "06db17253d76150772c0926e11131d-C001-34", "06db17253d76150772c0926e11131d-C001-70" ] }, "@USE@": { "gold_contexts": [ [ "06db17253d76150772c0926e11131d-C001-45" ], [ "06db17253d76150772c0926e11131d-C001-61", "06db17253d76150772c0926e11131d-C001-62", "06db17253d76150772c0926e11131d-C001-63" ], [ "06db17253d76150772c0926e11131d-C001-156" ] ], "cite_sentences": [ "06db17253d76150772c0926e11131d-C001-45", 
"06db17253d76150772c0926e11131d-C001-61", "06db17253d76150772c0926e11131d-C001-156" ] }, "@DIF@": { "gold_contexts": [ [ "06db17253d76150772c0926e11131d-C001-168" ], [ "06db17253d76150772c0926e11131d-C001-192" ] ], "cite_sentences": [ "06db17253d76150772c0926e11131d-C001-168", "06db17253d76150772c0926e11131d-C001-192" ] } } }, "ABC_7b9fc52e4479dc5ff9b8796a558981_12": { "x": [ { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-2", "text": "Recent embedding-based methods in bilingual lexicon induction show good results, but do not take advantage of orthographic features, such as edit distance, which can be helpful for pairs of related languages." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-3", "text": "This work extends embedding-based methods to incorporate these features, resulting in significant accuracy gains for related languages." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-4", "text": "----------------------------------" }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-5", "text": "**INTRODUCTION**" }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-6", "text": "Over the past few years, new methods for bilingual lexicon induction have been proposed that are applicable to low-resource language pairs, for which very little sentence-aligned parallel data is available." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-7", "text": "Parallel data can be very expensive to create, so methods that require less of it or that can utilize more readily available data are desirable." 
}, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-8", "text": "One prevalent strategy involves creating multilingual word embeddings, where each language's vocabulary is embedded in the same latent space (Vuli\u0107 and Moens, 2013; Mikolov et al., 2013a; Artetxe et al., 2016) ; however, many of these methods still require a strong cross-lingual signal in the form of a large seed dictionary." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-9", "text": "More recent work has focused on reducing that constraint." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-10", "text": "Vuli\u0107 and Moens (2016) and Vulic and Korhonen (2016) use document-aligned data to learn bilingual embeddings instead of a seed dictionary." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-11", "text": "Artetxe et al. (2017) use a very small, automatically-generated seed lexicon of identical numerals as the initialization in an iterative self-learning framework to learn a linear mapping between monolingual embedding spaces; Zhang et al. (2017) use an adversarial training method to learn a similar mapping." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-12", "text": "Lample et al. (2018a) use a series of techniques to align monolingual embedding spaces in a completely unsupervised way; their method is used by Lample et al. (2018b) as the initialization for a completely unsupervised machine translation system." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-13", "text": "These recent advances in unsupervised bilingual lexicon induction show promise for use in low-resource contexts." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-14", "text": "However, none of them make use of linguistic features of the languages themselves (with the arguable exception of syntactic/semantic information encoded in the word embeddings)." 
}, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-15", "text": "This is in contrast to work that predates many of these embedding-based methods that leveraged linguistic features such as edit distance and orthographic similarity: Dyer et al. (2011) and Berg-Kirkpatrick et al. (2010) investigate using linguistic features for word alignment, and Haghighi et al. (2008) use linguistic features for unsupervised bilingual lexicon induction." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-16", "text": "These features can help identify words with common ancestry (such as the English-Italian pair agile-agile) and borrowed words (macaronimaccheroni)." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-17", "text": "The addition of linguistic features led to increased performance in these earlier models, especially for related languages, yet these features have not been applied to more modern methods." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-18", "text": "In this work, we extend the modern embeddingbased approach of Artetxe et al. (2017) with orthographic information in order to leverage similarities between related languages for increased accuracy in bilingual lexicon induction." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-19", "text": "----------------------------------" }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-20", "text": "**BACKGROUND**" }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-21", "text": "This work is directly based on the work of Artetxe et al. (2017) ." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-22", "text": "Following their work, let X \u2208 R |Vs|\u00d7d and Z \u2208 R |Vt|\u00d7d be the word embedding matrices of two distinct languages, referred to respectively as the source and target, such that each row corresponds to the d-dimensional embedding of a single word." 
}, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-23", "text": "We refer to the ith row of one of these matrices as X i * or Z i * ." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-24", "text": "The vocabularies for each language are V s and V t , respectively." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-25", "text": "Also let D \u2208 {0, 1} |Vs|\u00d7|Vt| be a binary matrix representing a dictionary such that D ij = 1 if the ith word in the source language is aligned with the jth word in the target language." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-26", "text": "We wish to find a mapping matrix W \u2208 R d\u00d7d that maps source embeddings onto their aligned target embeddings." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-27", "text": "Artetxe et al. (2017) define the optimal mapping matrix W * with the following equation," }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-28", "text": "which minimizes the sum of the squared Euclidean distances between mapped source embeddings and their aligned target embeddings." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-29", "text": "By normalizing and mean-centering X and Z, and enforcing that W be an orthogonal matrix (W T W = I), the above formulation becomes equivalent to maximizing the dot product between the mapped source embeddings and target embeddings, such that" }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-30", "text": "where Tr(\u00b7) is the trace operator, the sum of all diagonal entries." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-31", "text": "The optimal solution to this equation is W * = U V T , where X T DZ = U \u03a3V T is the singular value decomposition of X T DZ." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-32", "text": "This formulation requires a seed dictionary." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-33", "text": "To reduce the need for a large seed dictionary, Artetxe et al. 
(2017) propose an iterative, self-learning framework that determines W as above, uses it to calculate a new dictionary D, and then iterates until convergence." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-34", "text": "In the dictionary induction step, they set" }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-35", "text": "We propose two methods for extending this system using orthographic information, described in the following two sections." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-36", "text": "----------------------------------" }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-37", "text": "**ORTHOGRAPHIC EXTENSION OF WORD EMBEDDINGS**" }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-38", "text": "This method augments the embeddings for all words in both languages before using them in the self-learning framework of Artetxe et al. (2017) ." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-39", "text": "To do this, we append to each word's embedding a vector of length equal to the size of the union of the two languages' alphabets." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-40", "text": "Each position in this vector corresponds to a single letter, and its value is set to the count of that letter within the spelling of the word." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-41", "text": "This letter count vector is then scaled by a constant before being appended to the base word embedding." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-42", "text": "After appending, the resulting augmented vector is normalized to have magnitude 1." 
}, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-43", "text": "Mathematically, let A be an ordered set of characters (an alphabet), containing all characters appearing in both language's alphabets:" }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-44", "text": "Let O source and O target be the orthographic extension matrices for each language, containing counts of the characters appearing in each word w i , scaled by a constant factor c e :" }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-45", "text": "Then, we concatenate the embedding matrices and extension matrices:" }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-46", "text": "Finally, in the normalized embedding matrices X and Z , each row has magnitude 1:" }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-47", "text": "These new matrices are used in place of X and Z in the self-learning process." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-48", "text": "----------------------------------" }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-49", "text": "**ORTHOGRAPHIC SIMILARITY ADJUSTMENT**" }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-50", "text": "This method modifies the similarity score for each word pair during the dictionary induction phase of the self-learning framework of Artetxe et al. (2017) , which uses the dot product of two words' embeddings to quantify similarity." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-51", "text": "We modify this similarity score by adding a measure of orthographic similarity, which is a function of the normalized string edit distance of the two words." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-52", "text": "The normalized edit distance is defined as the Levenshtein distance (L(\u00b7, \u00b7)) (Levenshtein, 1966) divided by the length of the longer word." 
}, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-53", "text": "The Levenshtein distance represents the minimum number of insertions, deletions, and substitutions required to transform one word into the other." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-54", "text": "The normalized edit distance function is denoted as NL(\u00b7, \u00b7)." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-55", "text": "We define the orthographic similarity of two words w 1 and w 2 as log(2.0\u2212NL(w 1 , w 2 ))." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-56", "text": "These similarity scores are used to form an orthographic similarity matrix S, where each entry corresponds to a source-target word pair." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-57", "text": "Each entry is first scaled by a constant factor c s ." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-58", "text": "This matrix is added to the standard similarity matrix, XW Z T ." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-59", "text": "The vocabulary for each language is 200,000 words, so computing a similarity score for each pair would involve 40 billion edit distance calculations." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-60", "text": "Also, the vast majority of word pairs are orthographically very dissimilar, resulting in a normalized edit distance close to 1 and an orthographic similarity close to 0, having little to no effect on the overall estimated similarity." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-61", "text": "Therefore, we only calculate the edit distance for a subset of possible word pairs." 
}, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-62", "text": "Thus, the actual orthographic similarity matrix that we use is as follows:" }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-63", "text": "This subset of word pairs was chosen using an adaptation of the Symmetric Delete spelling correction algorithm described by Garbe (2012), which we denote as symDelete(\u00b7,\u00b7,\u00b7)." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-64", "text": "This algorithm takes as arguments the target vocabulary, source vocabulary, and a constant k, and identifies all source-target word pairs that are identical after k or fewer deletions from each word; that is, all pairs where each is reachable from the other with no more than k insertions and k deletions." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-65", "text": "For example, the Italian-English pair modernomodern will be identified with k = 1, and the pair tollerante-tolerant will be identified with k = 2." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-66", "text": "The algorithm works by computing all strings formed by k or fewer deletions from each target word, stores them in a hash table, then does the same for each source word and generates sourcetarget pairs that share an entry in the hash table." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-67", "text": "The complexity of this algorithm can be expressed as O(|V |l k ), where V = V t \u222a V s is the combined vocabulary and l is the length of the longest word in V ." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-68", "text": "This is linear with respect to the vocabulary size, as opposed to the quadratic complexity required for computing the entire matrix." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-69", "text": "However, the algorithm is sensitive to both word length and the choice of k. 
In our experiments, we found that ignoring all words of length greater than 30 allowed the algorithm to complete very quickly while skipping less than 0.1% of the data." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-70", "text": "We also used small values of k (0 < k < 4), and used k = 1 for our final results, finding no significant benefit from using a larger value." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-71", "text": "----------------------------------" }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-72", "text": "**EXPERIMENTS**" }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-73", "text": "We use the datasets used by Artetxe et al. (2017) , consisting of three language pairs: English-Italian, English-German, and English-Finnish." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-74", "text": "The English-Italian dataset was introduced in Dinu and Baroni (2014) ; the other datasets were created by Artetxe et al. (2017) ." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-75", "text": "Each dataset includes monolingual word embeddings (trained with word2vec (Mikolov et al., 2013b) ) for both languages and a bilingual dictionary, separated into a training and test set." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-76", "text": "We do not use the training set as the input dictionary to the system, instead using an automatically-generated dictionary consisting only of numeral identity translations (such as 2-2, 3-3, et cetera) as in Artetxe et al. (2017) ." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-77", "text": "1 However, because the methods presented in this work feature tunable hyperparameters, we use a portion of the training set as development data. Table 1: Comparison of methods on test data." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-78", "text": "Scaling constants c_e and c_s were selected based on performance on development data over all three language pairs."
}, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-79", "text": "The last two rows report the results of using both methods together." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-80", "text": "----------------------------------" }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-81", "text": "**RESULTS AND DISCUSSION**" }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-82", "text": "For our experiments with orthographic extension of word embeddings, each embedding was extended by the size of the union of the alphabets of both languages." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-83", "text": "The size of this union was 199 for English-Italian, 200 for English-German, and 287 for English-Finnish." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-84", "text": "These numbers are perhaps unintuitively high." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-85", "text": "However, the corpora include many other characters, including diacritical markings and various symbols (%, [, !, etc.) that are an indication that tokenization of the data could be improved." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-86", "text": "We did not filter these characters in this work." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-87", "text": "For our experiments with orthographic similarity adjustment, the heuristic identified approximately 2 million word pairs for each language pair out of a possible 40 billion, resulting in significant computation savings. and c s = 1 as our hyperparameters." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-88", "text": "The local optima were not identical for all three languages, but we felt that these values struck the best compromise among them." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-89", "text": "Table 1 compares our methods against the system of Artetxe et al. (2017) , using scaling factors selected based on development data results." 
}, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-90", "text": "Because approximately 20% of source-target pairs in the dictionary were identical, we also extended all systems to guess the identity translation if the source word appeared in the target vocabulary." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-91", "text": "This improved accuracy in most cases, with some exceptions for English-Italian." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-92", "text": "We also experimented with both methods together, and found that this was the best of the settings that did not include the identity translation component; with the identity component included, however, the embedding extension method alone was best for EnglishFinnish." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-93", "text": "The fact that Finnish is the only language here that is not in the Indo-European family (and has fewer words borrowed from English or its ancestors) may explain why the performance trends for English-Finnish were different than those of the other two language pairs." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-94", "text": "In addition to identifying orthographically similar words, the extension method is capable of learning a mapping between source and target letters, which could partially explain its improved performance over our edit distance method." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-95", "text": "Table 2 shows some correct translations from our system that were missed by the baseline." 
}, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-96", "text": "----------------------------------" }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-97", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-98", "text": "In this work, we presented two techniques (which can be combined) for improving embedding-based bilingual lexicon induction for related languages using orthographic information and no parallel data, allowing their use with low-resource language pairs." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-99", "text": "These methods increased accuracy in our experiments, with both the combined and embedding extension methods providing significant gains over the baseline system." }, { "sent_id": "7b9fc52e4479dc5ff9b8796a558981-C001-100", "text": "In the future, we want to extend this work to related languages with different alphabets (experimenting with transliteration or phonetic transcription) and to extend other unsupervised bilingual lexicon induction systems, such as that of Lample et al. (2018a) ." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "7b9fc52e4479dc5ff9b8796a558981-C001-11" ], [ "7b9fc52e4479dc5ff9b8796a558981-C001-27", "7b9fc52e4479dc5ff9b8796a558981-C001-28", "7b9fc52e4479dc5ff9b8796a558981-C001-33" ] ], "cite_sentences": [ "7b9fc52e4479dc5ff9b8796a558981-C001-11", "7b9fc52e4479dc5ff9b8796a558981-C001-27", "7b9fc52e4479dc5ff9b8796a558981-C001-33" ] }, "@MOT@": { "gold_contexts": [ [ "7b9fc52e4479dc5ff9b8796a558981-C001-11", "7b9fc52e4479dc5ff9b8796a558981-C001-13", "7b9fc52e4479dc5ff9b8796a558981-C001-14" ] ], "cite_sentences": [ "7b9fc52e4479dc5ff9b8796a558981-C001-11" ] }, "@EXT@": { "gold_contexts": [ [ "7b9fc52e4479dc5ff9b8796a558981-C001-18" ], [ "7b9fc52e4479dc5ff9b8796a558981-C001-33", "7b9fc52e4479dc5ff9b8796a558981-C001-35" ], [ "7b9fc52e4479dc5ff9b8796a558981-C001-50", "7b9fc52e4479dc5ff9b8796a558981-C001-51" ] ], "cite_sentences": [ "7b9fc52e4479dc5ff9b8796a558981-C001-18", "7b9fc52e4479dc5ff9b8796a558981-C001-33", "7b9fc52e4479dc5ff9b8796a558981-C001-50" ] }, "@USE@": { "gold_contexts": [ [ "7b9fc52e4479dc5ff9b8796a558981-C001-21", "7b9fc52e4479dc5ff9b8796a558981-C001-22", "7b9fc52e4479dc5ff9b8796a558981-C001-27", "7b9fc52e4479dc5ff9b8796a558981-C001-28", "7b9fc52e4479dc5ff9b8796a558981-C001-33" ], [ "7b9fc52e4479dc5ff9b8796a558981-C001-38" ], [ "7b9fc52e4479dc5ff9b8796a558981-C001-50" ], [ "7b9fc52e4479dc5ff9b8796a558981-C001-73" ], [ "7b9fc52e4479dc5ff9b8796a558981-C001-76" ], [ "7b9fc52e4479dc5ff9b8796a558981-C001-89" ] ], "cite_sentences": [ "7b9fc52e4479dc5ff9b8796a558981-C001-21", "7b9fc52e4479dc5ff9b8796a558981-C001-27", "7b9fc52e4479dc5ff9b8796a558981-C001-33", "7b9fc52e4479dc5ff9b8796a558981-C001-38", "7b9fc52e4479dc5ff9b8796a558981-C001-50", "7b9fc52e4479dc5ff9b8796a558981-C001-73", "7b9fc52e4479dc5ff9b8796a558981-C001-76", "7b9fc52e4479dc5ff9b8796a558981-C001-89" ] }, "@SIM@": { "gold_contexts": [ [ "7b9fc52e4479dc5ff9b8796a558981-C001-76" ] ], "cite_sentences": [ "7b9fc52e4479dc5ff9b8796a558981-C001-76" ] }, "@DIF@": { 
"gold_contexts": [ [ "7b9fc52e4479dc5ff9b8796a558981-C001-89", "7b9fc52e4479dc5ff9b8796a558981-C001-90", "7b9fc52e4479dc5ff9b8796a558981-C001-91" ] ], "cite_sentences": [ "7b9fc52e4479dc5ff9b8796a558981-C001-89" ] } } }, "ABC_c34bbed419bddb6d63b3e3bccf595d_12": { "x": [ { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-2", "text": "Recognizing bridging anaphora is difficult due to the wide variation within the phenomenon, the resulting lack of easily identifiable surface markers and their relative rarity." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-3", "text": "We develop linguistically motivated discourse structure, lexico-semantic and genericity detection features and integrate these into a cascaded minority preference algorithm that models bridging recognition as a subtask of learning finegrained information status (IS)." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-4", "text": "We substantially improve bridging recognition without impairing performance on other IS classes." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-5", "text": "----------------------------------" }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-7", "text": "In bridging or associative anaphora (Clark, 1975; Prince, 1981; Gundel et al., 1993) , the antecedent and anaphor are not coreferent but are linked via a variety of contiguity relations." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-8", "text": "1 In Example 1, the phrases a resident, the stairs and the lobby are bridging anaphors with the antecedent One building." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-9", "text": "2 (1) One building was upgraded to red status while people were taking things out, and a resident called up the stairs to his girlfriend, telling her to keep sending things down to the lobby." 
}, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-10", "text": "Bridging is an important problem as it affects linguistic theory and applications alike." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-11", "text": "For example, without bridging resolution, entity coherence between the first and second coordinated clause in Example 1 cannot be established." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-12", "text": "This is a problem both for coherence theories such as Centering (Grosz et al., 1995) (where bridging is therefore incorporated as an indirect realization of previous entities) as well as applications relying on entity coherence modelling, such as readability assessment or sentence ordering (Barzilay and Lapata, 2008) ." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-13", "text": "Full bridging resolution needs (i) recognition that a bridging anaphor is present and (ii) identification of the antecedent and contiguity relation." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-14", "text": "In recent work, these two tasks have been tackled separately, with bridging recognition handled as part of information status (IS) classification (Markert et al., 2012; Cahill and Riester, 2012; Rahman and Ng, 2012) ." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-15", "text": "Each mention in a text gets assigned one IS class that describes its accessibility to the reader at a given point in a text, bridging being one possible class." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-16", "text": "We stay within this framework." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-17", "text": "Bridging recognition is a difficult task, so that we had to report very low results on this IS class in previous work (Markert et al., 2012) ." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-18", "text": "This is due to the phenomenon's variety, leading to a lack of clear surface features for recognition." 
}, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-19", "text": "Instead, we formulate in this paper novel discourse structure and lexicosemantic features as well as features that distinguish bridging from generics (see Section 3)." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-20", "text": "In addition, making up between 5% and 20% of definite descriptions (Gardent and Manu\u00e9lian, 2005; Caselli and Prodanof, 2006) and around 6% of all NPs (Markert et al., 2012) , bridging is still less frequent than many other IS classes and recognition of minority classes is well known to be more difficult." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-21", "text": "We therefore use a cascaded classification algorithm to address this problem (Omuya et al., 2013) ." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-22", "text": "----------------------------------" }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-23", "text": "**RELATED WORK**" }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-24", "text": "Most bridging research concentrates on antecedent selection only (Poesio and Vieira, 1998; Poesio et al., 2004a; Lassalle and Denis, 2011; Hou et al., 2013) , assuming that bridging recognition has already been performed." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-25", "text": "Previous work on recognition is either limited to definite NPs based on heuristics evaluated on small datasets (Hahn et al., 1996; Vieira and Poesio, 2000) , or models it as a subtask of learning fine-grained IS (Rahman and Ng, 2012; Markert et al., 2012; Cahill and Riester, 2012) ." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-26", "text": "Results within this latter framework for bridging have been mixed: We reported in Markert et al. (2012) low results for bridging in written news text whereas Rahman and Ng (2012) report high results for the four subcategories of bridging annotated in the Switchboard dialogue corpus by Nissim et al. (2004) ." 
}, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-27", "text": "We believe this discrepancy to be due to differences in corpus size and genre as well as in bridging definition." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-28", "text": "Bridging in Switchboard includes non-anaphoric, syntactically linked part-of and set-member relationships (such as the building's lobby), as well as comparative anaphora, the latter being marked by surface indicators such as other, another etc." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-29", "text": "Both types are much easier to identify than anaphoric bridging cases." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-30", "text": "3 In addition, many non-anaphoric lexical cohesion cases have been annotated as bridging in Switchbard as well." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-31", "text": "We also separate bridging recognition and antecedent selection." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-32", "text": "One could argue that a joint model is more attractive as potential antecedents such as building \"trigger\" subsequent bridging cases such as stairs (Example 1)." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-33", "text": "However, bridging can be indicated by referential patterns without world knowledge about the anaphor/antecedent NPs, as the nonsense example 2 shows: the wug is clearly a bridging anaphor although we do not know the antecedent." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-34", "text": "4 (2) The blicket couldn't be connected to the dax." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-35", "text": "The wug failed." 
}, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-36", "text": "Similarly, Clark (1975) distinguishes between bridging via necessary, probable and inducible parts/roles and argues that only in the first and maybe the second case the antecedent triggers the 3 See also the high results for our specific category for comparative anaphora (Markert et al., 2012) ." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-37", "text": "4 We thank an anonymous reviewer for pointing this out." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-38", "text": "bridging anaphor in the sense that we already spontaneously think of the anaphor when we read the antecedent." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-39", "text": "Also, bridging recognition on its own can be valuable for applications: for example, prosody is influenced by IS status without needing antecedent knowledge (Baumann and Riester, 2013) ." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-40", "text": "3 Characterizing Bridging Anaphora for Automatic Recognition" }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-41", "text": "----------------------------------" }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-42", "text": "**PROPERTIES OF BRIDGING ANAPHORA**" }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-43", "text": "Bridging anaphors are rarely marked by surface features." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-44", "text": "Indeed, even the common practice (Vieira and Poesio, 2000; Lassalle and Denis, 2011; Cahill and Riester, 2012) to limit bridging to definite NPs does not seem to be correct: We report in previous work (Hou et al., 2013 ) that less than 40% of the bridging anaphora in our corpus are definites." 
}, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-45", "text": "Instead, bridging is diverse with regard to syntactic form and function: bridging anaphora can be definite NPs (Examples 4 and 6), indefinite NPs (Example 5) or bare NPs (Examples 3, 8 and 9)." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-46", "text": "The only frequent syntactic property shared is that bridging NPs tend to have a simple internal structure with regards to modification." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-47", "text": "Bridging is also easily confused with generics: friends is used as bridging anaphor in Example 9 but generically in Example 10. (6) . . . his truck . . ." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-48", "text": "The farmer at the next truck shouts, \"Wheat!\" (7) . . . the firms . . . Crime was the reason that 26% reported difficulty recruiting personnel and that 19% said they were considering moving." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-49", "text": "(8) . . . the company . . . His father was chairman and chief executive until his death in an accident five years ago." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-50", "text": "(9) . . . Josephine Baker . . . Friends pitched in." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-51", "text": "(10) Friends are part of the glue that holds life and faith together." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-52", "text": "Bridging anaphora can have almost limitless variation." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-53", "text": "However, we observe that bridging anaphors are often licensed because of discourse structure Markert et al. (2012)" }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-54", "text": "f 13 Precedes (r) and/or lexical or world knowledge." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-55", "text": "With regard to discourse structure, Grosz et al. 
(1995) observe that bridging is often needed to establish entity coherence between two adjacent sentences (Examples 1, 2, 4, 5, 6, 7 and 9)." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-56", "text": "With regard to lexical and world knowledge, relational noun phrases (Examples 3, 4, 8 and 9), building parts (Example 1), set membership elements (Example 7), or, more rarely, temporal/spatial modification (Example 6) may favor a bridging reading." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-57", "text": "Motivated by these observations, we develop discourse structure and lexico-semantic features indicating bridging anaphora as well as features designed to separate genericity from bridging." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-58", "text": "----------------------------------" }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-59", "text": "**FEATURES**" }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-60", "text": "In Markert et al. (2012) we classify eight fine-grained IS categories for NPs in written text: old, new and 6 mediated categories (syntactic, worldKnowledge, bridging, comparative, aggregate and function)." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-61", "text": "This feature set (Table 1 , f 1-f 13) works well to identify old, new and several mediated categories." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-62", "text": "However, it fails to recognize most bridging anaphora, which we try to remedy in this work by including more diverse features." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-63", "text": "Discourse structure features (Table 2 , f 1-f 3)." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-64", "text": "Bridging occurs frequently in sentences where otherwise there would be no entity coherence to previous sentences/clauses (see Grosz et al. (1995) and Poesio et al. (2004b) for discussions about bridging, entity coherence and centering transitions in the Centering framework)."
}, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-65", "text": "This is especially true for topic NPs (Halliday and Hasan, 1976) in such sentences." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-66", "text": "We follow these insights by identifying coherence gap sentences (see Examples 1, 4, 5, 6, 7, 9 and also 2): a sentence has a coherence gap (f 1) if it has none new local features for bridging discourse of the following three coherence elements: (1) entity coreference to previous sentences, as approximated via string match or presence of pronouns, (2) comparative anaphora approximated by mentions modified via a small set of comparative markers (see also Table 1 , f 10 PreModByCompMarker), or (3) proper names." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-67", "text": "We approximate the topic of a sentence via the first mention (f 2)." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-68", "text": "f 3 models that bridging anaphors do not appear at the beginning of a text." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-69", "text": "Table 2 , f 4-f 10)." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-70", "text": "In contrast to generic patterns, our semantic features capture lexical properties of nouns that make them more likely to be the head of a bridging NP." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-71", "text": "We create f 4-f 8 to capture four kinds of bridging anaphora." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-72", "text": "L\u00f6bner (1985) distinguishes between relational nouns that take on at least one obligatory semantic role (such as friend) and sortal nouns." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-73", "text": "It is likely that relational nouns are more frequently used as bridging than sortal nouns (see Examples 3, 4, 8 and 9)." 
}, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-74", "text": "We extract a list containing around 4,000 relational nouns from WordNet and a list containing around 500 nouns that specify professional roles from the General Inquirer lexicon (Stone et al., 1966) , then determine whether the NP head appears in these lists or not (f 4 and f 5)." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-75", "text": "The obligatory semantic role for a relational noun can of course also be filled NP internally instead of anaphorically and we use the features f 10 (for instances such as the Egyptian president) and f 18 (for complex NPs that are likely to fill needed roles NP internally) to address this." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-76", "text": "----------------------------------" }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-77", "text": "**SEMANTIC FEATURES (**" }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-78", "text": "Because part-of relations are typical bridging relations (see Example 1 and Clark (1975) ), we use f 6 to determine whether the NP is a part of the building or not, using again a list extracted from Inquirer." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-79", "text": "f 7 is used to identify set membership bridging cases (see Example 7), by checking whether the NP head is a number or indefinite pronoun (such as none, one, some) or modified by each, one." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-80", "text": "However, not all numbers are bridging cases (such as 1976) and we use f 9 to exclude such cases." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-81", "text": "Lassalle and Denis (2011) note that some bridging anaphors are indicated by spatial or temporal modifications (see Example 6)." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-82", "text": "We use f 8 to detect this by compiling 20 such adjectives from Inquirer." 
}, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-83", "text": "Features to detect generic nouns (Table 2, f 11-f 15)." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-84", "text": "Generic NPs (Example 10) are easily confused with bridging anaphora." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-85", "text": "Inspired by Reiter and Frank (2010) who build on linguistic research, we develop features (f 11-f 15) to exclude generics." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-86", "text": "First, hypothetical entities are likely to refer to generic entities (Mitchell et al., 2002) , We approximate this by determining whether the NP appears in an if-clause (f 11)." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-87", "text": "Also the clause tense and mood may play a role to decide genericity (Reiter and Frank, 2010) ." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-88", "text": "This is often reflected by the main verb of a clause, so we extract its POS tag (f 12)." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-89", "text": "Some NPs are commonly used generically, such as children, men, or the dollar." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-90", "text": "The ACE-2 corpus (distinct from our corpus) contains generic annotation ." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-91", "text": "We collect all NPs from ACE-2 that are always used generically (f 13)." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-92", "text": "We also try to learn NPs that are uniquely identifiable without further description or anaphoric links such as the sun or the pope." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-93", "text": "We do this by extracting common nouns which are annotated as worldKnowledge from the training part of our corpus 5 and use these as lexical features (f 14)." 
}, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-94", "text": "Finally, motivated by the ACE-2 annotation guidelines, we identify six quantifiers that may indicate genericity, such as all, no, neither (f 15)." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-95", "text": "Other features for bridging (Table 2, f 16-f 18) ." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-96", "text": "Following Rahman and Ng (2012) , we use unigrams (f 16)." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-97", "text": "We also extract heads of bridging anaphors from the training data as lexical features (f 17) to learn typical nouns used for bridging that we did not cover in lexicon extraction (f 4 to f 6)." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-98", "text": "Feature f 18 models that bridging anaphora most often have a simple internal structure and usually do not contain any other NPs." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-99", "text": "Features for other IS categories (Table 2 , f 19-f 21)." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-100", "text": "We propose three features to improve other IS categories." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-101", "text": "In the relational feature f 19, we separate coordination parent-child from other parentchild relations to help with the class aggregate." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-102", "text": "f 20 determines whether a number is the object of an increase/decrease verb (using a list extracted from Inquirer) and therefore likely to be the IS class function." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-103", "text": "Frequent proper names are more likely to be hearer old and hence of the class worldKnowledge." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-104", "text": "f 21 extracts proper names that occur in at least 100 documents in the Tipster corpus to approximate this." 
}, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-105", "text": "----------------------------------" }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-106", "text": "**EXPERIMENTS AND RESULTS**" }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-107", "text": "Experimental setup." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-108", "text": "We perform experiments on the corpus provided in Markert et al. (2012) 6 ." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-109", "text": "It consists of 50 texts taken from the WSJ portion of the OntoNotes corpus (Weischedel et al., 2011) with almost 11,000 NPs annotated for information status including 663 bridging NPs and their antecedents." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-110", "text": "All experiments are performed via 10-fold crossvalidation on documents." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-111", "text": "We use gold standard mentions and the OntoNotes named entity and syntactic annotation layers for feature extraction." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-112", "text": "Reimplemented baseline system (rbls)." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-113", "text": "rbls uses the same features as Markert et al. (2012) (Table 1) but replaces the local decision tree classifier with LibSVM as we will need to include lexical features." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-114", "text": "Tables 1 and 2 and apply them in order from rarest to more frequent category." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-115", "text": "Whenever a minority classifier predicts true, this class is assigned." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-116", "text": "When all minority classifiers say false, we back off to the multiclass rbls+newfeat system." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-117", "text": "cmps \u2212 Table 2 feature set (cmps\u2212newfeat)." 
}, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-118", "text": "To test the effect of using the minority preference system without additional features, we employ a cmps system with baseline features from Table 1 only." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-119", "text": "(Table 3) ." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-120", "text": "Our novel features in rbls+newfeat show improvements for worldKnowledge, aggregate and function as well as bridging categories compared to both baseline systems, although the performance for bridging is still low." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-121", "text": "In addition, the overall accuracy is significantly better than the two baseline systems (at the level of 1% using McNemar's test)." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-122", "text": "Using the cascaded minority preference system cmps in addition improves bridging results substantially while the performance on other categories does not worsen." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-123", "text": "The algorithm needs both our novel feature classes as well as cascaded modelling to achieve this improvement as the comparison to cmps\u2212newfeat shows: the latter lowers overall accuracy as it tends to overgenerate rare 7 Parameter against data imbalance is set according to the ratio between positive and negative instances in the training set." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-124", "text": "----------------------------------" }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-125", "text": "**RESULTS AND DISCUSSION**" }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-126", "text": "classes (including bridging) with low precision if the features are not strong enough." 
}, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-127", "text": "Our novel features (addressing linguistic properties of bridging) and the cascaded algorithm (addressing data sparseness) appear to be complementary." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-128", "text": "To look at the impact of features in our best system, we performed an ablation study." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-129", "text": "Lexical features as well as semantic ones have the most impact." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-130", "text": "Discourse structure and genericity information features have less of an impact." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-131", "text": "We believe the latter to be due to noise involved in extracting these features (such as approximating coreference for the coherence gap feature) as well as genericity recognition still being in its infancy (Reiter and Frank, 2010) ." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-132", "text": "----------------------------------" }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-133", "text": "**CONCLUSIONS**" }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-134", "text": "This paper aims to recognize bridging anaphora in written text." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-135", "text": "We develop discourse structure, lexicosemantic and genericity features based on linguistic intuition and corpus research." }, { "sent_id": "c34bbed419bddb6d63b3e3bccf595d-C001-136", "text": "By using a cascading minority preference system, we show that our approach outperforms the bridging recognition in Markert et al. (2012) by a large margin without impairing the performance on other IS classes." 
} ], "y": { "@DIF@": { "gold_contexts": [ [ "c34bbed419bddb6d63b3e3bccf595d-C001-136" ] ], "cite_sentences": [ "c34bbed419bddb6d63b3e3bccf595d-C001-136" ] }, "@BACK@": { "gold_contexts": [ [ "c34bbed419bddb6d63b3e3bccf595d-C001-14" ], [ "c34bbed419bddb6d63b3e3bccf595d-C001-17" ], [ "c34bbed419bddb6d63b3e3bccf595d-C001-20" ], [ "c34bbed419bddb6d63b3e3bccf595d-C001-36" ], [ "c34bbed419bddb6d63b3e3bccf595d-C001-60" ] ], "cite_sentences": [ "c34bbed419bddb6d63b3e3bccf595d-C001-14", "c34bbed419bddb6d63b3e3bccf595d-C001-17", "c34bbed419bddb6d63b3e3bccf595d-C001-20", "c34bbed419bddb6d63b3e3bccf595d-C001-36", "c34bbed419bddb6d63b3e3bccf595d-C001-60" ] }, "@MOT@": { "gold_contexts": [ [ "c34bbed419bddb6d63b3e3bccf595d-C001-17" ], [ "c34bbed419bddb6d63b3e3bccf595d-C001-20" ], [ "c34bbed419bddb6d63b3e3bccf595d-C001-25" ], [ "c34bbed419bddb6d63b3e3bccf595d-C001-60", "c34bbed419bddb6d63b3e3bccf595d-C001-61", "c34bbed419bddb6d63b3e3bccf595d-C001-62" ] ], "cite_sentences": [ "c34bbed419bddb6d63b3e3bccf595d-C001-17", "c34bbed419bddb6d63b3e3bccf595d-C001-20", "c34bbed419bddb6d63b3e3bccf595d-C001-25", "c34bbed419bddb6d63b3e3bccf595d-C001-60" ] }, "@USE@": { "gold_contexts": [ [ "c34bbed419bddb6d63b3e3bccf595d-C001-53" ], [ "c34bbed419bddb6d63b3e3bccf595d-C001-108" ] ], "cite_sentences": [ "c34bbed419bddb6d63b3e3bccf595d-C001-53", "c34bbed419bddb6d63b3e3bccf595d-C001-108" ] }, "@EXT@": { "gold_contexts": [ [ "c34bbed419bddb6d63b3e3bccf595d-C001-113" ] ], "cite_sentences": [ "c34bbed419bddb6d63b3e3bccf595d-C001-113" ] } } }, "ABC_bcf19914bb67ded47785d298969a7a_12": { "x": [ { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-108", "text": "On the other hand, text-only models achieve a very high accuracy." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-25", "text": "An analysis of the models, providing insights into the reasons for their strong performance (Section 5)." 
}, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-81", "text": "We also note that the performance is better than human performance." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-2", "text": "We address the task of detecting foiled image captions, i.e. identifying whether a caption contains a word that has been deliberately replaced by a semantically similar word, thus rendering it inaccurate with respect to the image being described." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-3", "text": "Solving this problem should in principle require a fine-grained understanding of images to detect linguistically valid perturbations in captions." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-4", "text": "In such contexts, encoding sufficiently descriptive image information becomes a key challenge." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-5", "text": "In this paper, we demonstrate that it is possible to solve this task using simple, interpretable yet powerful representations based on explicit object information." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-6", "text": "Our models achieve stateof-the-art performance on a standard dataset, with scores exceeding those achieved by humans on the task." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-7", "text": "We also measure the upperbound performance of our models using gold standard annotations." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-8", "text": "Our analysis reveals that the simpler model performs well even without image information, suggesting that the dataset contains strong linguistic bias." 
}, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-9", "text": "----------------------------------" }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-11", "text": "Models tackling vision-to-language (V2L) tasks, for example Image Captioning (IC) and Visual Question Answering (VQA), have demonstrated impressive results in recent years in terms of automatic metric scores." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-12", "text": "However, whether or not these models are actually learning to address the tasks they are designed for is questionable." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-13", "text": "For example, Hodosh and Hockenmaier (2016) showed that IC models do not understand images sufficiently, as reflected by the generated captions." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-14", "text": "As a consequence, in the last few years many diagnostic tasks and datasets have been proposed aiming at investigating the capabilities of such models in more detail to determine whether and how these models are capable of exploiting visual and/or linguistic information (Shekhar et al., 2017b; Johnson et al., 2017; Antol et al., 2015; Chen et al., 2015; Gao et al., 2015; Yu et al., 2015; Zhu et al., 2016) ." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-15", "text": "FOIL (Shekhar et al., 2017b ) is one such dataset." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-16", "text": "It was proposed to evaluate the ability of V2L models in understanding the interplay of objects and their attributes in the images and their relations in an image captioning framework." 
}, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-17", "text": "This is done by replacing a word in MSCOCO (Lin et al., 2014) captions with a 'foiled' word that is semantically similar or related to the original word (substituting dog with cat), thus rendering the image caption unfaithful to the image content, while yet linguistically valid." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-18", "text": "Shekhar et al. (2017b) report poor performance for V2L models in classifying captions as foiled (or not)." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-19", "text": "They suggested that their models (using image embeddings as input) are very poor at encoding structured visuallinguistic information to spot the mismatch between a foiled caption and the corresponding content depicted in the image." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-20", "text": "In this paper, we focus on the foiled captions classification task (Section 2), and propose the use of explicit object detections as salient image cues for solving the task." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-21", "text": "In contrast to methods from previous work that make use of word based information extracted from captions (Heuer et al., 2016; Yao et al., 2016; Wu et al., 2018) , we use explicit object category information directly extracted from the images." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-22", "text": "More specifically, we use an interpretable bag of objects as image representation for the classifier." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-23", "text": "Our hypothesis is that, to truly 'understand' the image, V2L models should exploit information about objects and their relations in the image and not just global, low-level image embeddings as used by most V2L models." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-24", "text": "Our main contributions are: 3." 
}, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-26", "text": "Our results reveal that the FOIL dataset has a very strong linguistic bias, and that the proposed simple object-based models are capable of finding salient patterns to solve the task." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-27", "text": "----------------------------------" }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-28", "text": "**BACKGROUND**" }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-29", "text": "In this section we describe the foiled caption classification task and dataset." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-30", "text": "We combine the tasks and data from Shekhar et al. (2017b) and Shekhar et al. (2017a) ." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-31", "text": "Given an image and a caption, in both cases the task is to learn a model that can distinguish between a REAL caption that describes the image, and a FOILed caption where a word from the original caption is swapped such that it no longer describes the image accurately." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-32", "text": "There are several sets of 'foiled captions' where words from specific parts of speech are swapped:" }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-33", "text": "\u2022 Foiled Noun: In this case a noun word in the original caption is replaced with another similar noun, such that the resultant caption is not the correct description for the image." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-34", "text": "The foiled noun is obtained from list of object annotations from MSCOCO (Lin et al., 2014) and nouns are constrained to the same supercategory;" }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-35", "text": "\u2022 Foiled Verb: Here, verb is foiled with a similar verb." 
}, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-36", "text": "The similar verb is extracted using external resources;" }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-37", "text": "\u2022 Foiled Adjective and Adverb: Adjectives and adverbs are replaced with similar adjectives and adverbs." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-38", "text": "Here, the notion of similarity again is obtained from external resources;" }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-39", "text": "\u2022 Foiled Preposition: Prepositions are directly replaced with functionally similar prepositions." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-40", "text": "The Verb, Adjective, Adverb and Preposition subsets were obtained using a slightly different methodology (see Shekhar et al. (2017a) ) than that used for Nouns (Shekhar et al., 2017b) ." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-41", "text": "Therefore, we evaluate these two groups separately." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-42", "text": "----------------------------------" }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-43", "text": "**PROPOSED MODEL**" }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-44", "text": "For the foiled caption classification task (Section 3.1), our proposed model uses information from explicit object detections as an object-based image representation along with textual representations (Section 3.2) as input to several different classifiers (Section 3.3)." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-45", "text": "----------------------------------" }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-46", "text": "**MODEL DEFINITION**" }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-47", "text": "Let y \u2208 {REAL, FOIL} denote binary class labels." 
}, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-48", "text": "The objective is to learn a model that computes P (y|I; C), where I and C correspond to the image and caption respectively." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-49", "text": "Our model seeks to maximize a scoring function \u03b8:" }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-50", "text": "----------------------------------" }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-51", "text": "**REPRESENTATIONS**" }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-52", "text": "Our scoring function \u03b8 takes in image features and text features (from captions) and concatenates them." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-53", "text": "We experiment with various types of features." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-54", "text": "For the image side, we propose a bag of objects representation for 80 pre-defined MSCOCO categories." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-55", "text": "We consider two variants: (a) Object Mention: A binary vector where we encode the presence/absence of instances of each object category for a given image; (b) Object Frequency: A histogram vector where we encode the number of instances of each object category in a given image." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-56", "text": "For both features, we use Gold MSCOCO object annotations as well as Predicted object detections using YOLO (Redmon and Farhadi, 2017) pre-trained on MSCOCO to detect instances of the 80 categories." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-57", "text": "As comparison, we also compute a standard CNN-based image representation, using the POOL5 layer of a ResNet-152 (He et al., 2016) CNN pre-trained on ImageNet." 
}, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-58", "text": "We posit that our object-based representation will better capture semantic information corresponding to the text compared to the CNN embeddings used directly as a feature by most V2L models." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-59", "text": "For the language side, we explore two features: (a) a simple bag of words (BOW) representation for each caption; (b) an LSTM classifier based model trained on the training part of the dataset." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-60", "text": "Our intuition is that an image description/caption is essentially a result of the interaction between important objects in the image (this includes spatial relations, co-occurrences, etc.)." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-61", "text": "Thus, representations explicitly encoding objectlevel information are better suited for the foiled caption classification task." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-62", "text": "----------------------------------" }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-63", "text": "**CLASSIFIERS**" }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-64", "text": "Three types of classifiers are explored: (a) Multilayer Perceptron (MLP): For BOW-based text representations, a two 100-dimensional hidden layer MLP with ReLU activation function is used with cross-entropy loss, and is optimized with Adam (learning rate 0.001); (b) LSTM Classifier: For LSTM-based text representations, a uni-directional LSTM classifier is used with 100-dimensional word embeddings and 200-dimensional hidden representations." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-65", "text": "We train it using cross-entropy loss and optimize it using Adam (learning rate 0.001)." 
}, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-66", "text": "Image representations are appended to the final hidden state of the LSTM; (c) Multimodal LSTM (MM-LSTM) Classifier: As above, except that we initialize the LSTM with the image representation instead of appending it to its output." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-67", "text": "This can also be seen as am image grounded LSTM based classifier." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-68", "text": "----------------------------------" }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-69", "text": "**EXPERIMENTS**" }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-70", "text": "Data: We use the dataset for nouns from Shekhar et al. (2017b) 1 and the datasets for other parts of speech from Shekhar et al. (2017a) 2 ." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-71", "text": "Statistics about the dataset are given in Table 1 ." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-72", "text": "The evaluation metric is accuracy per class and the average (overall) accuracy over the two classes." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-73", "text": "----------------------------------" }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-74", "text": "**PERFORMANCE ON NOUNS:**" }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-75", "text": "The results of our experiments with foiled nouns are summarized in Table 2." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-76", "text": "First, we note that the models that use Gold 1 https://foilunitn.github.io/ 2 The authors have kindly provided us the datasets." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-77", "text": "Table 2 : Accuracy on Nouns dataset." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-78", "text": "\u2020 are taken directly from Shekhar et al. (2017b) ." 
}, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-79", "text": "HieCoAtt is the state of the art reported in the paper." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-80", "text": "bag of objects information are the best performing models across classifiers." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-82", "text": "We hypothesize the following reasons for this: (a) human responses were crowd-sourced, which could have resulted in some noisy annotations; (b) our gold object-based features closely resembles the information used for data-generation as described in Shekhar et al. (2017b) for the foil noun dataset." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-83", "text": "The models using Predicted bag of objects from a detector are very close to the performance of Gold." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-84", "text": "The performance of models using simple bag of words (BOW) sentence representations and an MLP is better than that of models that use LSTMs." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-85", "text": "Also, the accuracy of the bag of objects model with Frequency counts is higher than with the binary Mention vector, which only encodes the presence of objects." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-86", "text": "The Multimodal LSTM (MM-LSTM) has a slightly better performance than LSTM classifiers." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-87", "text": "In all cases, we observe that the performance is on par with human-level accuracy." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-88", "text": "Our overall accuracy is substantially higher than that reported in Shekhar et al. (2017b) ." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-89", "text": "Interestingly, our implementation of CNN+LSTM produced better results than their equivalent model (they reported 61.07% vs. our 87.45%)." 
}, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-90", "text": "We investigate this further in Section 5." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-91", "text": "Performance on other parts of speech: For other parts of speech, we fix the image representation to Gold Frequency, and compare results using the BOW-based MLP and MM-LSTM." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-92", "text": "We also compare the scores to the state of the art reported in Shekhar et al. (2017a) Table 3 : Accuracy on Verb, Adjective, Adverb and Preposition datasets, using Gold Frequency as the image representation." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-93", "text": "\u2020 is the best performing model as reported in Shekhar et al. (2017a) ." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-94", "text": "model does not use gold object information and may thus not be directly comparable -we however recall that only a slight drop in accuracy was found for our models when using predicted object detections rather than gold ones." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-95", "text": "Our findings are summarized in Table 3 ." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-96", "text": "The classification performance is not as high as it was for the nouns dataset." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-97", "text": "Noteworthy is the performance on adverbs, which is significantly lower than the performance across other parts of speech." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-98", "text": "We hypothesize that this is because of the imbalanced distribution of foiled and real captions in the dataset." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-99", "text": "We also found that the performance of LSTM-based models on other parts of speech datasets are almost always better than BOW-based models, indicating the necessity of more sophisticated features." 
}, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-100", "text": "----------------------------------" }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-101", "text": "**ANALYSIS**" }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-102", "text": "In this section, we attempt to better understand why our models achieve such a high accuracy." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-103", "text": "----------------------------------" }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-104", "text": "**ABLATION ANALYSIS**" }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-105", "text": "We first perform ablation experiments with our proposed models over the Nouns dataset (FOIL)." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-106", "text": "We compute image-only models (CNN or Gold Frequency) and text-only models (BOW or LSTM), and investigate which components of our model (text or image/objects) contribute to the strong classification performance (Table 4) ." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-107", "text": "As expected, we cannot classify foiled captions given only image information (global or object-level), resulting in chance-level performance." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-109", "text": "This is a central finding, suggesting that foiled captions are easy to detect even without image information." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-110", "text": "We also observe that the performance of BOW improves by adding object Frequency image information, but not CNN image embeddings." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-111", "text": "We posit that this is because there is a tighter correspondence between the bag of objects and bag of word models." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-112", "text": "In the case of LSTMs, adding either image information helps slightly." 
}, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-113", "text": "The accuracy of our models is substantially higher than that reported in Shekhar et al. (2017b) , even for equivalent models." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-114", "text": "We note, however, that while the trends of image information is similar for other parts of speech datasets, the performance of BOW based models are lower than the performance of LSTM based models." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-115", "text": "The anomaly of improved performance of BOW based models seems heavily pronounced in the nouns dataset." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-116", "text": "Thus, we further analyze our model in the next section to shed light on whether the high performance is due to the models or the dataset itself." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-117", "text": "Table 4 : Ablation study on FOIL (Nouns)." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-118", "text": "----------------------------------" }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-119", "text": "**FEATURE IMPORTANCE ANALYSIS**" }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-120", "text": "We apply Local Interpretable Model-agnostic Explanations (Ribeiro et al., 2016) to further understand the strong performance of our simple classifier on the Nouns dataset (FOIL) without any image information." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-121", "text": "We present an example in Figure 1 ." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-122", "text": "We use MLP with BOW only (no image information) as our classifier." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-123", "text": "As the caption is correctly predicted to be foiled, we observe that the most important feature for classification is the information on the word ball, which also happens to be the foiled word." 
}, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-124", "text": "We further analyzed the chances of this happening on the entire test set." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-125", "text": "We found that 96.56% of the time the most important classification feature happens to be the foiled word." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-126", "text": "This firmly indicates that there is a very strong linguistic bias in the training data, despite The classifier is able to correctly classify the foiled caption and uses the foiled word as the trigger for classification." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-127", "text": "the claim in Shekhar et al. (2017b) that special attention was paid to avoid linguistic biases in the dataset." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-128", "text": "3 We note that we were not able to detect the linguistic bias in the other parts of speech datasets." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-129", "text": "----------------------------------" }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-130", "text": "**CONCLUSIONS**" }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-131", "text": "We presented an object-based image representation derived from explicit object detectors/gold annotations to tackle the task of classifying foiled captions." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-132", "text": "The hypothesis was that such models provide the necessary semantic information for the task, while this informaiton is not explicitly present in CNN image embeddings commonly used in V2L tasks." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-133", "text": "We achieved stateof-the-art performance on the task, and also provided a strong upper-bound using gold annotations." 
}, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-134", "text": "A significant finding is that our simple models, especially for the foiled noun dataset, perform well even without image information." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-135", "text": "This could be partly due to the strong linguistic bias in the foiled noun dataset, which was revealed by our analysis on our interpretable object-based models." }, { "sent_id": "bcf19914bb67ded47785d298969a7a-C001-136", "text": "We release our analysis and source code at https://github.com/ sheffieldnlp/foildataset.git." } ], "y": { "@BACK@": { "gold_contexts": [ [ "bcf19914bb67ded47785d298969a7a-C001-14" ], [ "bcf19914bb67ded47785d298969a7a-C001-15" ], [ "bcf19914bb67ded47785d298969a7a-C001-40" ] ], "cite_sentences": [ "bcf19914bb67ded47785d298969a7a-C001-14", "bcf19914bb67ded47785d298969a7a-C001-15", "bcf19914bb67ded47785d298969a7a-C001-40" ] }, "@USE@": { "gold_contexts": [ [ "bcf19914bb67ded47785d298969a7a-C001-30" ], [ "bcf19914bb67ded47785d298969a7a-C001-40", "bcf19914bb67ded47785d298969a7a-C001-41" ], [ "bcf19914bb67ded47785d298969a7a-C001-70" ], [ "bcf19914bb67ded47785d298969a7a-C001-81", "bcf19914bb67ded47785d298969a7a-C001-82" ] ], "cite_sentences": [ "bcf19914bb67ded47785d298969a7a-C001-30", "bcf19914bb67ded47785d298969a7a-C001-40", "bcf19914bb67ded47785d298969a7a-C001-70", "bcf19914bb67ded47785d298969a7a-C001-82" ] }, "@EXT@": { "gold_contexts": [ [ "bcf19914bb67ded47785d298969a7a-C001-30" ] ], "cite_sentences": [ "bcf19914bb67ded47785d298969a7a-C001-30" ] }, "@SIM@": { "gold_contexts": [ [ "bcf19914bb67ded47785d298969a7a-C001-82" ] ], "cite_sentences": [ "bcf19914bb67ded47785d298969a7a-C001-82" ] }, "@DIF@": { "gold_contexts": [ [ "bcf19914bb67ded47785d298969a7a-C001-88" ], [ "bcf19914bb67ded47785d298969a7a-C001-113" ] ], "cite_sentences": [ "bcf19914bb67ded47785d298969a7a-C001-88", "bcf19914bb67ded47785d298969a7a-C001-113" ] } } }, "ABC_710ec6f6d6d4c7c8c148833c0adfef_12": { "x": [ { 
"sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-2", "text": "We describe an efficient neural network method to automatically learn sentiment lexicons without relying on any manual resources." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-3", "text": "The method takes inspiration from the NRC method, which gives the best results in SemEval13 by leveraging emoticons in large tweets, using the PMI between words and tweet sentiments to define the sentiment attributes of words." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-4", "text": "We show that better lexicons can be learned by using them to predict the tweet sentiment labels." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-5", "text": "By using a very simple neural network, our method is fast and can take advantage of the same data volume as the NRC method." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-6", "text": "Experiments show that our lexicons give significantly better accuracies on multiple languages compared to the current best methods." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-7", "text": "----------------------------------" }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-9", "text": "Sentiment lexicons contain the sentiment polarity and/or the strength of words or phrases (Baccianella et al., 2010; Taboada et al., 2011; Tang et al., 2014a; Ren et al., 2016a) ." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-10", "text": "They have been used for both rule-based (Taboada et al., 2011) and unsupervised (Turney, 2002; Hu and Liu, 2004; or supervised (Mohammad et al., 2013; Tang et al., 2014b; Vo and Zhang, 2015) machine-learning-based sentiment analysis." 
}, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-11", "text": "As a result, constructing sentiment lexicons is one important research task in sentiment analysis." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-12", "text": "Many approaches have been proposed to construct sentiment lexicons." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-13", "text": "Traditional methods manually label the sentiment attributes of words (Hu and Liu, 2004; Wilson et al., 2005; Taboada et al., 2011) ." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-14", "text": "One benefit of such lexicons is high quality." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-15", "text": "On the other hand, the methods are timeconsuming, requiring language and domain expertise." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-16", "text": "Recently, statistical methods have been exploited to learn sentiment lexicons automatically (Esuli and Sebastiani, 2006; Baccianella et al., 2010; Mohammad et al., 2013) ." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-17", "text": "Such methods leverage knowledge resources (Bravo-Marquez et al., 2015) or labeled sentiment data (Tang et al., 2014a) , giving significantly better coverage compared to manual lexicons." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-18", "text": "Among the automatic methods, Mohammad et al. (2013) proposed to use tweets with emoticons or hashtags as training data." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-19", "text": "The main advantage is that such training data are abundant, and manual annotation can be avoided." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-20", "text": "Despite that emoticons or hashtags can be noisy in indicating the sentiment of a tweet, existing research (Go et al., 2009; Pak and Paroubek, 2010; Agarwal et al., 2011; Kalchbrenner et al., 2014; Ren et al., 2016b) has shown that effectiveness of such data when used to supervise sentiment classifiers." 
}, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-21", "text": "Mohammad et al. (2013) collect sentiment lexicons by calculating pointwise mutual information (PMI) between words and emoticons." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-22", "text": "The resulting lexicons give the best results in a SemEval13 benchmark (Nakov et al., 2013) ." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-23", "text": "In this paper, we show that a better lexicon can be learned by directly optimizing the prediction accuracy, taking the lexicon as input and emoticon as the output." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-24", "text": "The correlation between our method and the method of Mohammad et al. (2013) is analogous to the \"predicting\" vs \"counting\" correlation between distributional and distributed word representations (Baroni et al., 2014) ." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-25", "text": "We follow Esuli and Sebastiani (2006) in using two simple attributes to represent each sentiment word, and take inspiration from Mikolov et al. (2013) in using a very simple neural network for sentiment prediction." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-26", "text": "The method can leverage the same data as Mohammad et al. (2013) and therefore benefits from both scale and annotation independence." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-27", "text": "Experiments show that the neural model gives the best results on standard benchmarks across multiple languages." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-28", "text": "Our code and lexicons are publicly available at https://github.com/duytinvo/acl2016." 
}, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-29", "text": "----------------------------------" }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-30", "text": "**RELATED WORK**" }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-31", "text": "Existing methods for automatically learning sentiment lexicons can be classified into three main categories." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-32", "text": "The first category augments existing lexicons with sentiment information." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-33", "text": "For example, Esuli and Sebastiani (2006) and Baccianella et al. (2010) use a tuple (pos, neg, neu) to represent each word, where pos, neg and neu stand for possibility, negativity and neutrality, respectively, training these attributes by extracting features from WordNet." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-34", "text": "These methods rely on the taxonomic structure of existing lexicons, which are limited to specific languages." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-35", "text": "The second approach expands existing lexicons, which are typically manually labeled." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-36", "text": "For example, Tang et al. (2014a) apply a neural network to learn sentiment-oriented embeddings from a small amount of annotated tweets, and then expand a set of seed sentiment words by measuring vector space distances between words." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-37", "text": "Bravo-Marquez et al. (2015) extend an existing lexicon by classifying words using manual features." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-38", "text": "These methods are also limited to domains and languages with manual resources." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-39", "text": "The third line of methods constructs lexicons from scratch by accumulating statistical information over large data." 
}, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-40", "text": "Turney (2002) proposes to estimate the sentiment polarity of words by calculating PMI between seed words and search hits." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-41", "text": "Mohammad et al. (2013) improve the method by computing sentiment scores using distance-supervised data from emoticon-baring tweets instead of seed words." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-42", "text": "This approach can be used to automatically extract multilingual sentiment lexicons Mohammad et al., 2015) without using manual resources, which makes it more flexible compared to the first two methods." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-43", "text": "We consider it as our baseline." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-44", "text": "We use the same data source as Mohammad et al. (2013) to train lexicons." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-45", "text": "However, rather than relying on PMI, we take a machine-learning method in optimizing the prediction accuracy of emoticons using the lexicons." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-46", "text": "To leverage large data, we use a very simple neural network to train the lexicons." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-47", "text": "----------------------------------" }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-48", "text": "**BASELINE**" }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-49", "text": "Mohammad et al. (2013) employ emoticons and relevant hashtags contained in a tweet as the sentiment label of the tweet." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-50", "text": "Given a set of tweets with their labels, the sentiment score (SS) for a token w was computed as:" }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-51", "text": "where pos represents the positive label and neg represents the negative label." 
}, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-52", "text": "PMI stands for pointwise mutual information, which is" }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-53", "text": "Here freq(w , pos) is the number of times the term w occurs in positive tweets, freq(w ) is the total frequency of term w in the corpus, freq(pos) is the total number of tokens in positive tweets, and N is the total number of tokens in the corpus." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-54", "text": "PMI (w , neg) is calculated in a similar way." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-55", "text": "Thus, Equation 1 is equal to:" }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-56", "text": "----------------------------------" }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-57", "text": "**MODEL**" }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-58", "text": "We follow Esuli and Sebastiani (2006) , using positivity and negativity attributes to define lexicons." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-59", "text": "In particular, each word takes the form w = (n, p), where n denotes negativity and p denotes positivity (n, p \u2208 R)." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-60", "text": "As shown in Figure 1 , given a tweet tw = w 1 , w 2 , ..., w n , a simple neural network is used to predict its two-dimensional sentiment label y, where [1,0] for negative and [0,1] for positive tweets." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-61", "text": "The predicted sentiment probability y of a tweet is computed as: Table 1 : Emoticon-based training data." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-62", "text": "where W is fixed to the diagonal matrix (W \u2208 R 2x2 )." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-63", "text": "We follow Go et al. (2009) in defining the sentiment labels of tweets via emoticons." 
}, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-64", "text": "Each token is first initialized by random negative and positive attribute scores in [-0.25,0.25] , and then trained by supervised learning." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-65", "text": "The cross-entropy error is employed as the objective function:" }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-66", "text": "Backpropagation is applied to learn (n, p) for each token." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-67", "text": "Optimization is done using stochastic gradient descent over shuffled mini-batches, with the AdaDelta update rule (Zeiler, 2012)." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-68", "text": "All models are trained over 5 epochs with a batch size of 50." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-69", "text": "Due to its simplicity, the method is very fast, training a sentiment lexicon over 9 million tweets within 35 minutes per epoch on an Intel core\u2122 i7-3770 CPU @ 3.40 GHz." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-70", "text": "----------------------------------" }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-71", "text": "**SENTIMENT CLASSIFICATION**" }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-72", "text": "The resulting lexicon can be used in both unsupervised and supervised sentiment classifiers." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-73", "text": "The former is implemented by summing the sentiment scores of all tokens contained in a given document (Taboada et al., 2011; ." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-74", "text": "If the total sentiment score is larger than 0, the document is classified as positive." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-75", "text": "Here only one positivity attribute is required to represent a lexicon, and we use the contrast between the positivity and negativity attributes (p \u2212 n) as the score." 
}, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-76", "text": "The supervised method makes use of sentiment lexicons as features for machine learning classification." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-77", "text": "Given a document D, we follow and extract the following features:" }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-78", "text": "\u2022 The number of sentiment tokens in D, where sentiment tokens are word tokens whose sentiment scores are not zero in a lexicon;" }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-79", "text": "\u2022 The total sentiment score of a document:" }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-80", "text": "\u2022 The maximal score: max w i \u2208D SS(w i ); \u2022 The total scores of positive and negative words in D;" }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-81", "text": "\u2022 The sentiment score of the last token in D." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-82", "text": "Again we use SS(w i ) = p w i \u2212 n w i as the sentiment score of each word w i , because the methods are based on a single sentiment score value for each word." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-83", "text": "----------------------------------" }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-84", "text": "**EXPERIMENTS**" }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-85", "text": "----------------------------------" }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-86", "text": "**EXPERIMENTAL SETTINGS**" }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-87", "text": "Training data: To automatically obtain training data, we use the Twitter Developers API 1 to crawl emoticon tweets 2 of English and Arabic from February 2014 to September 2014." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-88", "text": "We follow Go et al. (2009) , removing all emoticons used to collect training data from the tweets, and Tang et al. 
(2014b) , ignoring tweets which are less than 7 tokens." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-89", "text": "A Twitter tokenizer (Gimpel et al., 2011) is applied to preprocess all tweets." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-90", "text": "Rare words that occur less than 5 times in the vocabulary are removed." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-91", "text": "HTTP links and username are replaced by http and user , respectively." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-92", "text": "The statistics of training data is shown in Table 1 ." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-93", "text": "Sentiment classifier: We use LibLinear 3 (Fan et al., 2008) as the supervised classifier on benchmark datasets." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-94", "text": "The parameter c is tuned by making a grid search (Hsu et al., 2003) on the accuracy of development set on the English dataset and fivefold cross validation on the Arabic dataset." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-95", "text": "Evaluation: We follow in employing precision (P), recall (R) and F1 score (F) to evaluate unsupervised classification." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-96", "text": "We follow Hsu et al. (2003) and use accuracy (acc), the tuning criterion, to evaluate supervised classification." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-97", "text": "Code and lexicons: We make the Python implementation of our models and the resulting sentiment lexicons available at https://github.com/duytinvo/acl2016 Table 4 : Standard splits of ASTD." 
}, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-98", "text": "----------------------------------" }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-99", "text": "**ENGLISH LEXICONS**" }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-100", "text": "The Twitter benchmark of SemEval13 (Nakov et al., 2013 ) is used as the English test set." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-101", "text": "In order to evaluate both unsupervised and supervised methods, we follow Tang et al. (2014b) and , removing neutral tweets." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-102", "text": "The statistics is shown in Table 2 ." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-103", "text": "We compare our lexicon with the lexicons of NRC 4 (Mohammad et al., 2013) , HIT 5 (Tang et al., 2014a) and WEKA 6 (Bravo-Marquez et al., 2015) ." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-104", "text": "As shown in Table 3 , using the unsupervised sentiment classification method (unsup) in Section 5, our lexicon gives significantly better result in comparison with countbased lexicons of NRC." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-105", "text": "Under both settings, our lexicon yields the best results compared to other methods." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-106", "text": "----------------------------------" }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-107", "text": "**ARABIC LEXICONS**" }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-108", "text": "We employ the standard Arabic Twitter dataset ASTD (Nabil et al., 2015) , which consists of about 10,000 tweets with 4 labels: objective (obj), negative (neg), positive (pos) and mixed subjective (mix)." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-109", "text": "The standard splits of ASTD are shown in Table 4 ." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-110", "text": "We follow Nabil et al. 
(2015) by merging training and validating data for learning model." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-111", "text": "We compare our lexicon with only the lexicons of NRC 7 , because the methods of Tang et al. (2014a) Table 5 , our lexicon consistently gives the best performance on both the balanced and unbalanced datasets, showing the advantage of \"predicting\" over \"counting\"." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-112", "text": "Table 6 shows examples of our predicting-based lexicon and the counting-based lexicon of Mohammad et al. (2013) ." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-113", "text": "First, both lexicons can correctly reflect the strength of emotional words (e.g. bad, worse, worst), which demonstrates that our method can learn statistical relevance as effectively as PMI." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-114", "text": "Second, we find many cases where our lexicon gives the correct polarity (e.g. suitable, lazy) but the lexicon of Mohammad et al. (2013) does not." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-115", "text": "To quantitatively compare the lexicons, we calculated the accuracies of their polarities (i.e. sign) by using the manually-annotated lexicon of Hu and Liu (2004) as the gold standard." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-116", "text": "We take the intersection between the automatic lexicons and the lexicon of Hu and Liu (2004) as the test set, which contains 3270 words." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-117", "text": "The polarity accuracy of our lexicon is 78.2%, in contrast to 76.9% by the lexicon of Mohammad et al. (2013) , demonstrating the relative strength of our method." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-118", "text": "Third, by having two attributes (n, p) instead of one, our lexicon is better in compositionality (e.g. SS(strong memory) > 0, SS(strong snowstorm) < 0)." 
}, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-119", "text": "----------------------------------" }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-120", "text": "**ANALYSIS**" }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-121", "text": "----------------------------------" }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-122", "text": "**CONCLUSION**" }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-123", "text": "We constructed a sentiment lexicon for short text automatically using an efficient neural network, showing that prediction-based training is better than counting-based training for learning from large tweets with emoticons." }, { "sent_id": "710ec6f6d6d4c7c8c148833c0adfef-C001-124", "text": "In standard evaluations, the method gave better accuracies across multiple languages compared to the state-of-theart counting-based method." } ], "y": { "@BACK@": { "gold_contexts": [ [ "710ec6f6d6d4c7c8c148833c0adfef-C001-10" ], [ "710ec6f6d6d4c7c8c148833c0adfef-C001-16" ], [ "710ec6f6d6d4c7c8c148833c0adfef-C001-18" ], [ "710ec6f6d6d4c7c8c148833c0adfef-C001-21" ] ], "cite_sentences": [ "710ec6f6d6d4c7c8c148833c0adfef-C001-10", "710ec6f6d6d4c7c8c148833c0adfef-C001-16", "710ec6f6d6d4c7c8c148833c0adfef-C001-18" ] }, "@DIF@": { "gold_contexts": [ [ "710ec6f6d6d4c7c8c148833c0adfef-C001-24" ], [ "710ec6f6d6d4c7c8c148833c0adfef-C001-114" ], [ "710ec6f6d6d4c7c8c148833c0adfef-C001-117" ] ], "cite_sentences": [ "710ec6f6d6d4c7c8c148833c0adfef-C001-24", "710ec6f6d6d4c7c8c148833c0adfef-C001-114", "710ec6f6d6d4c7c8c148833c0adfef-C001-117" ] }, "@USE@": { "gold_contexts": [ [ "710ec6f6d6d4c7c8c148833c0adfef-C001-26" ], [ "710ec6f6d6d4c7c8c148833c0adfef-C001-44" ], [ "710ec6f6d6d4c7c8c148833c0adfef-C001-103" ], [ "710ec6f6d6d4c7c8c148833c0adfef-C001-112" ] ], "cite_sentences": [ "710ec6f6d6d4c7c8c148833c0adfef-C001-26", "710ec6f6d6d4c7c8c148833c0adfef-C001-44", "710ec6f6d6d4c7c8c148833c0adfef-C001-103", "710ec6f6d6d4c7c8c148833c0adfef-C001-112" ] } } }, 
"ABC_fc5de471ba4cc82a2156ed25d2c78b_12": { "x": [ { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-43", "text": "----------------------------------" }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-44", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-45", "text": "----------------------------------" }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-46", "text": "**PARALLEL DATA**" }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-2", "text": "Previous work has shown that for low-resource source languages, automatic speech-to-text translation (AST) can be improved by pretraining an end-to-end model on automatic speech recognition (ASR) data from a high-resource language." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-3", "text": "However, it is not clear what factors-e.g., language relatedness or size of the pretraining datayield the biggest improvements, or whether pretraining can be effectively combined with other methods such as data augmentation." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-4", "text": "Here, we experiment with pretraining on datasets of varying sizes, including languages related and unrelated to the AST source language." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-5", "text": "We find that the best predictor of final AST performance is the word error rate of the pretrained ASR model, and that differences in ASR/AST performance correlate with how phonetic information is encoded in the later RNN layers of our model." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-6", "text": "We also show that pretraining and data augmentation yield complementary benefits for AST." 
}, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-7", "text": "----------------------------------" }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-9", "text": "Low-resource automatic speech-to-text translation (AST) has recently gained traction as a way to bring NLP tools to under-represented languages." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-10", "text": "An end-to-end approach [1] [2] [3] [4] [5] [6] [7] is particularly appealing for source languages with no written form, or for endangered languages where translations into a high-resource language may be easier to collect than transcriptions [8] ." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-11", "text": "However, building high-quality endto-end AST with little parallel data is challenging, and has led researchers to explore how other sources of data could be used to help." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-12", "text": "A number of methods have been investigated." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-13", "text": "Several of these use transcribed source language audio and/or translated source language text in a multitask learning scenario [4, 6, 9] or to pre-train parts of the model before fine-tuning on the end-to-end AST task [4] ." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-14", "text": "Others assume, as we do here, that no additional source language resources are available, in which case transfer learning using data from language(s) other than the source language is a good option." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-15", "text": "In particular, several researchers have shown that low-resource AST can be improved by pretraining on an ASR task in some other language, then transferring the encoder parameters to initialize the AST model." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-16", "text": "For example, Bansal et al. 
[5] showed that pre-training on either English or French ASR improved their Spanish-English AST system (trained on 20 hours of parallel data) and Tian [10] got improvements on an 8-hour Swahili-English AST dataset using English ASR pretraining." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-17", "text": "Overall these results show that pretraining helps, but leave open the question of what factors affect the degree of improvement." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-18", "text": "For example, does language relatedness play a role, or simply the amount of pretraining data?" }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-19", "text": "Bansal et al. showed bigger AST gains as the amount of English pretraining data increased from 20 to 300 hours, and also found a slightly larger improvement when pretraining on 20 hours of English versus 20 hours of French, but they pointed out that the Spanish data contains many English code-switched words, which could explain the latter result." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-20", "text": "In related work on multilingual pretraining for low-resource ASR, Adams et al. [11] showed that pre-training on more languages helps, but it is not clear whether the improvement is due to including more languages, or just more data." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-21", "text": "To begin to tease apart these issues, we focus here on monolingual pretraining for low-resource AST, and investigate two questions." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-22", "text": "First, can we predict what sort of pretraining data is best for a particular AST task? Does it matter if the pretraining language is related to the AST source language (defined here as part of the same language family, since phonetic similarity is difficult to measure), or is the amount of pretraining data (or some other factor) more important?" 
}, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-23", "text": "Second, can pretraining be effectively combined with other methods, such as data augmentation, in order to further improve AST results?" }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-24", "text": "To answer these questions, we use the same AST architecture and Spanish-English parallel data as Bansal et al. [5] , but pretrain the encoder using a number of different ASR datasets: the 150hour AISHELL corpus of Chinese as well as seven GlobalPhone languages, each with about 20 hours of data." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-25", "text": "We find that pretraining on a larger amount of data from an unrelated language is much better than pretraining on a smaller amount of data from a related language." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-26", "text": "Moreover, even when controlling for the amount of data, the WER of the ASR model from pretraining seems to be a better predictor of final AST performance than does language relatedness." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-27", "text": "Indeed, we show that there is a very strong correlation between the WER of the pretraining model and BLEU score of the final AST model-i.e., the best pretraining strategy may simply be to use datasets and methods that will yield the lowest ASR WER during pretraining." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-28", "text": "However, we also found that AST results can be improved further by augmenting the AST data using standard speed perturbation techniques [12] ." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-29", "text": "Our best results using non-English pretraining data improve the test set BLEU scores of an AST system trained on 20 hours of parallel data from 10.2 to 14.3, increasing to 15.8 with data augmentation." 
}, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-30", "text": "Finally, we analyze the representations learned by the models and show that better performance seems to correlate with the extent to which phonetic information is encoded in a linearly separable way in the later RNN layers." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-31", "text": "----------------------------------" }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-32", "text": "**METHODOLOGY**" }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-33", "text": "For both ASR and AST tasks we use the same end-to-end system architecture shown in Figure 1 : the encoder-decoder model from [5] , which itself is adapted from [2] , [4] and [3] ." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-34", "text": "Details of the architecture and training parameters are described in Section 3.4." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-35", "text": "After pretraining an ASR model, we transfer only its encoder parameters to the AST task." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-36", "text": "Previous experiments [5] showed that the encoder accounts for most of the benefits of transferring the parameters." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-37", "text": "Transferring also the decoder and attention mechanism does bring some improvements, but is only feasible when the ASR pretraining language is the same as the AST target language, which is not true in most of our experiments." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-38", "text": "In addition to pretraining, we experimented with data augmentation." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-39", "text": "Specifically, we augmented the AST data using Kaldi's [13] 3-way speed perturbation, adding versions of the AST data where the audio is sped down and up by a factor of 0.9 and 1.1, respectively." 
}, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-40", "text": "1 To evaluate ASR performance we compute the word error rate (WER)." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-41", "text": "2 To evaluate AST performance we calculate the 4-gram BLEU score [14] on four reference translations." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-42", "text": "3" }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-47", "text": "For the AST models, we use Spanish-English parallel data from Fisher corpus [15] , containing 160 hours of Spanish telephone speech translated into English text." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-48", "text": "To simulate low-resource settings, we randomly downsample the original corpus to 20 hours of training data." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-49", "text": "Each of the dev and test sets comprise 4.5 hours of speech." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-50", "text": "----------------------------------" }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-51", "text": "**PRETRAINING DATA**" }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-52", "text": "Since we focus on investigating factors that might affect the AST improvements over the baseline when pretraining, we have chosen ASR datasets for pretraining that contrast in the number of hours and/or in the language similarity with Spanish." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-53", "text": "Statistics for each dataset are in the left half of Table 1 , with further details below." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-54", "text": "To look at a range of languages with similar amounts of data, we used GlobalPhone corpora from seven languages [16] , each with around 20 hours of speech: Mandarin Chinese (zh), Croatian (hr), Czech (cs), French (fr), Polish (pl), Portuguese (pt), and Swedish (sv)." 
}, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-55", "text": "French and Portuguese, like the source language (Spanish), belong to the Romance family of languages, while the other languages are less related-especially Chinese, which is not an Indo-European language." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-56", "text": "GlobalPhone consists of read speech recorded using similar conditions across languages, and the transcriptions for Chinese are Romanized, with annotated word boundaries." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-57", "text": "Table 1 : Dataset statistics (left); dev set results from ASR pretraining and from the final AST system (right)." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-58", "text": "AST results in all rows except the first are from pretraining using the dataset listed in that row, followed by fine-tuning using ast-20h." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-59", "text": "Numbers in brackets are the improvement over the baseline." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-60", "text": "To explore the effects of using a large amount of pretraining data from an unrelated language, we used the AISHELL-1 corpus of Mandarin Chinese [17] , which contains 150 hours of read speech." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-61", "text": "Transcriptions with annotated word boundaries are available in both Hanzi (Chinese characters) and Romanized versions, and we built models with each." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-62", "text": "To compare to the GlobalPhone data, we also created a 20-hour subset of the Romanized AISHELL (zh-ai-small) by randomly selecting utterances from a subset of the speakers (81, roughly the number present in most of the GlobalPhone datasets)." 
}, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-63", "text": "Finally, to reproduce one of the experiments from [5] , we pretrained one model using 300 hours of Switchboard English [18] ." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-64", "text": "This data is the most similar to the AST speech data in terms of style and channel (both are conversational telephone speech)." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-65", "text": "However, as noted by [5] , the Fisher Spanish speech contains many words that are actually in English (code-switching), so pretraining on English may provide an unfair advantage relative to other languages." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-66", "text": "----------------------------------" }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-67", "text": "**PREPROCESSING**" }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-68", "text": "We compute 13-dim MFCCs and cepstral mean and variance normalization along speakers using Kaldi [13] on our ASR and AST audio." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-69", "text": "To shorten the training time, we trimmed utterances from the AST data to 16 seconds (or 12 seconds for the 160h augmented dataset)." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-70", "text": "To account for unseen words in the test data, we model the ASR and AST text outputs via sub-word units using byte-pair encoding (BPE) [19] ." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-71", "text": "We do this separately for each dataset as BPE works best as a language-specific tool (i.e. it depends on the frequency of different subword units, which varies with the language)." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-72", "text": "We use 1k merge operations in all cases except Hanzi, where there are around 3000 symbols initially (vs around 60 in the other datasets)." 
}, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-73", "text": "For Hanzi we ran experiments with both 1k and 15k merge operations." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-74", "text": "For Chinese Romanized transcriptions we removed tone diacritics." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-75", "text": "----------------------------------" }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-76", "text": "**MODEL ARCHITECTURE AND TRAINING**" }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-77", "text": "Following the architecture and training procedure described in [5] , input speech features are fed into a stack of two CNN layers." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-78", "text": "In each CNN layer we stride the input with a factor of 2 along time, apply The points in the circled group come from different runs on the same dataset but with different BPE or learning rate schedules." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-79", "text": "The Spearman rank correlation of these points is -0.97; the correlation is -0.92 when using test sets to compute both ASR and BLEU." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-80", "text": "ReLU activation [20] followed by batch normalization [21] ." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-81", "text": "The CNN output is fed into a three-layer bi-directional long short-term memory network (LSTM) [22] , with 512 hidden layer dimensions." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-82", "text": "For decoding, we use the predicted token 20% of the time and the training token 80% of the time [23] as input to a 128-dimensional embedding layer followed by a three-layer LSTM, with 256 hidden layer dimensions, and combine this with the output from the attention mechanism [24] to predict the word at the current time step." 
}, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-83", "text": "We use code and hyperparameter settings from [5] 4 : the Adam optimizer [25] with an initial learning rate of 0.001 and decay it by a factor of 0.5 based on the dev set BLEU score." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-84", "text": "When training AST models, we regularize using dropout [26] with a ratio of 0.3 over the embedding and LSTM layers [27] ; weight decay with a rate of 0.0001; and, after the first 20 epochs, 30% of the time we replace the predicted output word by a random word from the target vocabulary." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-85", "text": "At test time we use beam decoding with a beam size of 5 and length normalization [28] with a weight of 0.6." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-86", "text": "----------------------------------" }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-87", "text": "**RESULTS AND DISCUSSION**" }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-88", "text": "----------------------------------" }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-89", "text": "**BASELINE AND ASR RESULTS**" }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-90", "text": "Our baseline 20-hour AST system obtains a BLEU score of 10.3 ( Table 1 , first row), 0.5 BLEU point lower than that reported by [5] ." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-91", "text": "This discrepancy might be due to differences in subsampling from the 160-hour AST dataset to create the 20-hour subset, or from Kaldi parameters when computing the MFCCs." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-92", "text": "WERs for our pre-trained models ( Table 1 ) vary from 22.5 for the large AISHELL dataset with Romanized transcript to 80.5 for Portuguese GlobalPhone." 
}, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-93", "text": "These are considerably worse than stateof-the-art ASR systems (e.g., Kaldi recipes can achieve WER of 7.5 on AISHELL and 26.5 on Portuguese GlobalPhone), but we did not optimize our architecture or hyperparameters for the ASR task since our main goal is to analyze the relationship between pretraining and AST performance (and in order to use pretraining, we must use a seq2seq model with the architecture as for AST)." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-94", "text": "----------------------------------" }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-95", "text": "**PRETRAINING THE AST TASK ON ASR MODELS**" }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-96", "text": "AST results for our pre-trained models are given in Table 1 ." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-97", "text": "Pretraining improves AST performance in every case, with improvements 4 https://github.com/0xSameer/ast." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-98", "text": "ranging from 0.2 (pt-gp) to 4.3 (zh-ai-large)." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-99", "text": "These results make it clear that language relatedness does not play a strong role in predicting AST improvements, since on the similar-sized GlobalPhone datasets, the two languages most related to Spanish (French and Portuguese) yield the highest and lowest improvements, respectively." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-100", "text": "Moreover, pretraining on the large Chinese dataset yields a bigger improvement than either of these-4.3 BLEU points." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-101", "text": "This is nearly as much as the 6 point improvement reported by [5] when pretraining on 100 hours of English data, which is especially surprising given not only that Chinese is very different from Spanish, but also that the Spanish data contains some English words." 
}, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-102", "text": "This finding seems to suggest that data size is more important than language relatedness for predicting the effects of pretraining." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-103", "text": "However, there are big differences even amongst the languages with similar amounts of pretraining data." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-104", "text": "Analyzing our results further, we found a striking correlation between the WER of the initial ASR model and the BLEU score of the AST system pretrained using that model, as shown in Figure 2 ." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-105", "text": "Therefore, although pretraining data size clearly influences AST performance, this appears to be mainly due to its effect on WER of the ASR model." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-106", "text": "We therefore hypothesize that WER is a better direct predictor of AST performance than either data size or language relatedness." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-107", "text": "----------------------------------" }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-108", "text": "**MULTILINGUAL PRETRAINING**" }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-109", "text": "Although our main focus is monolingual pretraining, we also looked briefly at multilingual pretraining, inspired by recent work on multilingual ASR [29, 30] and evidence that multilingual pretraining followed by fine-tuning on a distinct target language can improve ASR on the target language [11, 31, 32] ." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-110", "text": "These experiments did not directly compare pretraining using a similar amount of monolingual data, but such a comparison was done by [33, 34] in their work on learning feature representations for a target language with no transcribed data." 
}, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-111", "text": "They found a benefit for multilingual vs monolingual pretraining given the same amount of data." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-112", "text": "Following up on this work, we tried pretraining using 124 hours of multilingual data (all GlobalPhone languages except Chinese), roughly the amount of data in our large Chinese models." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-113", "text": "We combined all the data together and trained an ASR model using a common target BPE with 6k merge operations, then transferred only the encoder to the AST model." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-114", "text": "However, we did not see a benefit to the multilingual training (Table 1, final row); in fact the resulting AST model was slightly worse than the zh-ai-large model (BLEU of 13.3 vs 14.6)." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-115", "text": "Other configurations of multilingual training might still outperform their monolingual counterparts, but we leave this investigation as future work." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-116", "text": "----------------------------------" }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-117", "text": "**AUGMENTING THE PARALLEL DATA**" }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-118", "text": "Table 2 (top) shows how data augmentation affects the results of the baseline 20h AST system, as well as three of the best-performing pretrained models from Table 1 ." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-119", "text": "For these experiments only, we changed the learning rates of the augmented-data systems so that all models took about the same amount of time to train (see Figure 3 )." 
}, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-120", "text": "Despite a more aggressive learning schedule, the performance of the augmented-data systems surpasses that of the baseline and pretrained models, even those trained on the largest ASR sets (150-hr Chinese and 300-hr English)." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-121", "text": "For comparison to other work, to time constraints (after 15 days, compared to 8 days for complete training of the non-augmented 160h models)." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-122", "text": "We find that both pretraining and augmentation still help, providing a combined gain of 3.8 (3.2) BLEU points over the baseline on the dev (test) set." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-123", "text": "----------------------------------" }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-124", "text": "**ANALYZING THE MODELS' REPRESENTATIONS**" }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-125", "text": "Finally, we hope to gain some understanding into why pretraining on ASR helps with AST, and specifically how the neural network representations change during pretraining and fine-tuning." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-126", "text": "We follow [35] and [10] , who built diagnostic classifiers [36] to examine the representation of phonetic information in end-to-end ASR and AST systems, respectively." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-127", "text": "Unlike [10, 35] , who used non-linear classifiers, we use a linear classifier to predict phone labels from the internal representations of the trained ASR or AST model." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-128", "text": "Using a linear classifier allows us to make more precise claims: if the classifier performs better using the representation from a particular layer, we can say that layer represents the phonetic information in a more linearly separable way." 
}, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-129", "text": "Using a nonlinear classifier raises questions about how to choose the complexity of the classifier itself, and therefore makes any results difficult to interpret." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-130", "text": "We hypothesized that pretraining allows the models to abstract away from nonlinguistic acoustic differences, and to better represent phonetic information: crucially, both in the trained language and in other languages." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-131", "text": "To test this hypothesis, we used two phone-labelled datasets distinct from all our ASR and AST datasets: the English TIMIT corpus (a language different to all of our trained models, with hand-labeled phones) and the Spanish GlobalPhone corpus (the same language as our AST source language, with phonetic forcedalignments produced using Kaldi)." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-132", "text": "We randomly sampled utterances from these and passed them through the trained encoders, giving us a total of about 600k encoded frames." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-133", "text": "We used 400k of these to train logistic regression models to predict the phone labels, and tested on the remaining 200k frames." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-134", "text": "Separate logistic regression models were trained on the representations from each layer of the encoder." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-135", "text": "Since convolutional layers have a stride of 2, the number of frames decreases at each convolutional layer." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-136", "text": "To label the frames after a convolutional layer we eliminated every other label (and corresponding frame) from the original label sequence." 
}, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-137", "text": "For example, given label sequence S1 = aaaaaaann at input layer, we get sequence S2 = aaaan at the first convolutional layer and sequence S3 = aan at the second convolutional layer and at the following recurrent layers." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-138", "text": "Results for the two classification data sets ( Figure 4) show very similar patterns." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-139", "text": "In both the ASR and the AST models, the pretraining data seems to make little difference to phonetic encoding at the early layers, and classification accuracy peaks at the second CNN layer." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-140", "text": "However, the RNN layers show a clear trend where phone classification accuracy drops off more slowly for models with better ASR/AST performance (i.e., zh > fr > pt)." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-141", "text": "That is, the later RNN layers more transparently encode language-universal phonetic information." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-142", "text": "Phone classification accuracy in the RNN layers drops for both English and Spanish after fine-tuning on the AST data." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-143", "text": "This is slightly surprising for Spanish, since the fine-tuning data (unlike the pretraining data) is actually Spanish speech." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-144", "text": "However, we hypothesize that for AST, higher layers of the encoder may be recruited more to encode semantic information needed for the translation task, and therefore lose some of the linear separability in the phonetic information." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-145", "text": "Nevertheless, we still see the same pattern where better end-to-end models have higher classification accuracy in the later layers." 
}, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-146", "text": "----------------------------------" }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-147", "text": "**CONCLUSIONS**" }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-148", "text": "This paper explored what factors help pretraining for low-resource AST." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-149", "text": "We performed careful comparisons to tease apart the effects of language relatedness and data size, ultimately finding that rather than either of these, the WER of the pre-trained ASR model is likely the best direct predictor of AST performance." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-150", "text": "Given equivalent amounts of data, we did not find multilingual pretraining to help more than monolingual pretraining, but we did find an added benefit from using speed perturbation to augment the AST data." }, { "sent_id": "fc5de471ba4cc82a2156ed25d2c78b-C001-151", "text": "Finally, analysis of the pretrained models suggests that those models with better WER are transparently encoding more language-universal phonetic information in the later RNN layers, and this appears to help with AST." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "fc5de471ba4cc82a2156ed25d2c78b-C001-10" ], [ "fc5de471ba4cc82a2156ed25d2c78b-C001-16" ], [ "fc5de471ba4cc82a2156ed25d2c78b-C001-36" ], [ "fc5de471ba4cc82a2156ed25d2c78b-C001-65" ] ], "cite_sentences": [ "fc5de471ba4cc82a2156ed25d2c78b-C001-10", "fc5de471ba4cc82a2156ed25d2c78b-C001-16", "fc5de471ba4cc82a2156ed25d2c78b-C001-36", "fc5de471ba4cc82a2156ed25d2c78b-C001-65" ] }, "@MOT@": { "gold_contexts": [ [ "fc5de471ba4cc82a2156ed25d2c78b-C001-10", "fc5de471ba4cc82a2156ed25d2c78b-C001-11" ], [ "fc5de471ba4cc82a2156ed25d2c78b-C001-65" ] ], "cite_sentences": [ "fc5de471ba4cc82a2156ed25d2c78b-C001-10", "fc5de471ba4cc82a2156ed25d2c78b-C001-65" ] }, "@EXT@": { "gold_contexts": [ [ "fc5de471ba4cc82a2156ed25d2c78b-C001-24" ] ], "cite_sentences": [ "fc5de471ba4cc82a2156ed25d2c78b-C001-24" ] }, "@USE@": { "gold_contexts": [ [ "fc5de471ba4cc82a2156ed25d2c78b-C001-24" ], [ "fc5de471ba4cc82a2156ed25d2c78b-C001-33" ], [ "fc5de471ba4cc82a2156ed25d2c78b-C001-63" ], [ "fc5de471ba4cc82a2156ed25d2c78b-C001-77" ], [ "fc5de471ba4cc82a2156ed25d2c78b-C001-83" ] ], "cite_sentences": [ "fc5de471ba4cc82a2156ed25d2c78b-C001-24", "fc5de471ba4cc82a2156ed25d2c78b-C001-33", "fc5de471ba4cc82a2156ed25d2c78b-C001-63", "fc5de471ba4cc82a2156ed25d2c78b-C001-77", "fc5de471ba4cc82a2156ed25d2c78b-C001-83" ] }, "@DIF@": { "gold_contexts": [ [ "fc5de471ba4cc82a2156ed25d2c78b-C001-90" ] ], "cite_sentences": [ "fc5de471ba4cc82a2156ed25d2c78b-C001-90" ] }, "@SIM@": { "gold_contexts": [ [ "fc5de471ba4cc82a2156ed25d2c78b-C001-101" ] ], "cite_sentences": [ "fc5de471ba4cc82a2156ed25d2c78b-C001-101" ] } } }, "ABC_24ee9b2bd8c97cbe923bc747b09806_12": { "x": [ { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-2", "text": "Humans learn language by interaction with their environment and listening to other humans." 
}, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-48", "text": "**MATERIALS**" }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-49", "text": "Our model is trained on the Flickr8k database [22] ." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-3", "text": "It should also be possible for computational models to learn language directly from speech but so far most approaches require text." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-4", "text": "We improve on existing neural network approaches to create visually grounded embeddings for spoken utterances." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-5", "text": "Using a combination of a multi-layer GRU, importance sampling, cyclic learning rates, ensembling and vectorial self-attention our results show a remarkable increase in image-caption retrieval performance over previous work." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-6", "text": "Furthermore, we investigate which layers in the model learn to recognise words in the input." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-7", "text": "We find that deeper network layers are better at encoding word presence, although the final layer has slightly lower performance." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-8", "text": "This shows that our visually grounded sentence encoder learns to recognise words from the input even though it is not explicitly trained for word recognition." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-9", "text": "----------------------------------" }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-11", "text": "Most computational models of natural language processing (NLP) are based on written language; machine translation, sentence meaning representation and language modelling to name a few (e.g. [1, 2] )." 
}, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-12", "text": "Even if the task inherently involves speech, such as in automatic speech recognition, models require large amounts of transcribed speech [3] ." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-13", "text": "Yet, humans are capable of learning language from raw sensory input, and furthermore children learn to communicate long before they are able to read." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-14", "text": "In fact, many languages have no orthography at all and there are also languages of which the writing system is not widely used by its speakers." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-15", "text": "Text-based models cannot be used for these languages and applications like search engines and automated translators cannot serve these populations." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-16", "text": "There has been increasing interest in learning language from more natural input, such as directly from the speech signal, or multi-modal input (e.g. speech and vision)." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-17", "text": "This has several advantages such as removing the need for expensive annotation of speech, being applicable to low resource languages and being more plausible as a model of human language learning." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-18", "text": "An important challenge in learning language from spoken input is the fact that the input is not presented in neatly segmented tokens." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-19", "text": "An auditory signal does not contain neat breaks in between words like the spaces in text." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-20", "text": "Furthermore, no two realisations of the same spoken word are ever exactly the same." 
}, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-21", "text": "As such, spoken input cannot be represented by conventional word embeddings (e.g. word2vec [4] , GloVe [5] )." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-22", "text": "These textbased embeddings are trained to encode word-level semantic knowledge and have become a mainstay in work on sentence representations (e.g. [6, 7] )." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-23", "text": "When we want to learn language directly from speech, we will have to do so in a more end-to-end fashion, without prior lexical level knowledge in terms of both form and semantics." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-24", "text": "In previous work [8] we used image-caption retrieval, where given a written caption the model must return the matching image and vice versa." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-25", "text": "We trained deep neural networks (DNNs) to create sentence embeddings without the use of prior knowledge of lexical semantics (see [7, 9, 10] for other studies on this task)." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-76", "text": "First we apply a 1-dimensional convolutional layer to the acoustic input features." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-26", "text": "The visually grounded sentence embeddings that arose capture semantic information about the sentence as measured by the Semantic Textual Similarity task (see [11] ), performing comparably to text-only methods that require word embeddings." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-27", "text": "In the current study we present an image-caption retrieval model that extends our previous work to spoken input." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-28", "text": "In [12, 13] , the authors adapted text based caption-image retrieval (e.g. 
[9] ) and showed that it is possible to perform speech-image retrieval using convolutional neural networks on spectral features." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-29", "text": "Our work is most closely related to the models presented in [12, 13, 14, 15] ." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-30", "text": "In the current study we improve upon these previous approaches to visual grounding of speech and present state-of-the-art image-caption retrieval results." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-31", "text": "The work by [12, 13, 14, 15] and the results presented here are a step towards more cognitively plausible models of language learning as it is more natural to learn language without prior assumptions about the lexical level." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-32", "text": "For instance, research indicates that the adult lexicon contains many relatively fixed multi-word expressions (e.g., 'how-are-you-doing') [16] ." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-33", "text": "Furthermore, early during language acquisition the lexicon consists of entire utterances before a child's language use becomes more adult-like [16, 17, 18, 19] ." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-34", "text": "Image to spoken-caption retrieval models do not know a priori which constituents of the input are important and have no prior knowledge of lexical level semantics." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-35", "text": "We probe the resulting model to investigate whether it learns to recognise lexical units in the input without being explicitly trained to do so." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-36", "text": "We test two types of acoustic features; Mel Frequency Cepstral Coefficients (MFCCs) and Multilingual Bottleneck (MBN) features." 
}, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-37", "text": "MFCCs are features that can be computed for any speech signal without needing any other data, while the MBN features are 'learned' features that result from training a network on top of MFCCs in order to recognise phoneme states." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-38", "text": "While MBN features have been shown to be useful in several speech recognition tasks (e.g. [20, 21] ), learned audio features face the same issue as word embeddings, as humans learn to extract useful features from the audio signal as a result of learning to understand language and not as a separate process." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-39", "text": "However, the MBN features can still be useful where system performance is more important than cognitive plausibility, for instance in a low resource setting." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-40", "text": "Furthermore, these features could provide a clue as to what performance would be possible if we had more sophisticated models or more data to improve the feature extraction from the MFCCs in an end-to-end fashion." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-41", "text": "In summary, we improve on previous spoken-caption to image retrieval models and investigate whether it learns to recognise words in the speech signal." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-42", "text": "We show that our model achieves state-of-the-art results on the Flickr8k database, outperforming previous models by a large margin using both MFCCs and MBN features." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-43", "text": "We find that our model learns to recognise words in the input signal and show that the deeper layers are better at encoding this information." 
}, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-44", "text": "Recognition performance drops a little in the last two layers as the network abstracts away from the detection of specific words in the input and learns to map the utterances to the joint embedding space." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-45", "text": "We released the code for this project on GitHub: https://github.com/DannyMerkx/speech2image/tree/Interspeech19." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-46", "text": "2. Image to spoken-caption retrieval" }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-47", "text": "----------------------------------" }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-50", "text": "Flickr8k contains 8,000 images taken from the online photo-sharing application Flickr.com, for which five English captions per image are available." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-51", "text": "Annotators were asked to write sentences that describe the depicted scenes, situations, events and entities (people, animals, other objects)." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-52", "text": "Spoken captions for Flickr8k were collected by [12] by having Amazon Mechanical Turk workers pronounce the original written captions." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-53", "text": "We used the data split provided by [9] , with 6,000 images for training and a development and test set both of 1,000 images." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-54", "text": "----------------------------------" }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-55", "text": "**IMAGE AND ACOUSTIC FEATURES**" }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-56", "text": "To extract image features, all images are resized such that the smallest side is 256 pixels while keeping the aspect ratio intact." 
}, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-57", "text": "We take ten 224 by 224 crops of the image: one from each corner, one from the middle and the same five crops for the mirrored image." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-58", "text": "We use ResNet-152 [23] pretrained on ImageNet to extract visual features from these ten crops and then average the features of the ten crops into a single vector with 2,048 features." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-59", "text": "We test two types of acoustic features: Mel Frequency Cepstral Coefficients (MFCCs) and Multilingual Bottleneck (MBN) features." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-60", "text": "The MFCCs were created using 40 Mel-spaced filterbanks." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-61", "text": "We use 12 MFCCs and the log energy feature and add the first and second derivatives resulting in 39-dimensional feature vectors." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-62", "text": "We compute the MFCCs using 25 ms analysis windows with a 5 ms shift." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-63", "text": "The MBN features are created using a pre-trained DNN made available by [21] ." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-64", "text": "In short, the network is trained on multilingual speech data (11 languages, no English) to classify phoneme states." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-65", "text": "The MBN features consist of the outputs of intermediate network layers where the network is compressed from 1500 features to 30 features (see [21] for the full details of the network and training)." 
}, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-66", "text": "----------------------------------" }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-67", "text": "**MODEL ARCHITECTURE**" }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-68", "text": "Our multimodal encoder maps images and their corresponding captions to a common embedding space." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-69", "text": "The idea is to make matching images and captions lie close together and mismatched images and captions lie far apart in the embedding space." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-70", "text": "Our model consists of two parts: an image encoder and a sentence encoder as depicted in Figure 1 ." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-71", "text": "The approach is based on our own text-based model described in [8] and on the speech-based models described in [13, 15] and we refer to those studies for more details." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-72", "text": "Here, we focus on the differences with previous work." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-73", "text": "For the image encoder we use a single-layer linear projection on top of the pretrained image recognition model, and normalise the result to have unit L2 norm." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-74", "text": "The image encoder has 2048 input units and 2048 output units." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-75", "text": "Our caption encoder consists of three main components." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-77", "text": "The convolution has a stride of size 2, kernel size 6 and 64 output channels." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-78", "text": "This is the only layer where the model differs from the text-based model, which features a character embedding layer instead of a convolutional layer." 
}, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-79", "text": "The resulting features are then fed into a bi-directional Gated Recurrent Unit (GRU) followed by a self-attention layer, and the result is lastly normalised to have unit L2 norm." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-80", "text": "We use a 3-layer bi-directional GRU which allows the network to capture long-range dependencies in the acoustic signal (see [24] for a more detailed description of the GRU)." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-81", "text": "Furthermore, by making the layer bi-directional we let the network process the output of the convolutional layer from left to right and vice versa, allowing the model to capture dependencies in both directions." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-82", "text": "We use a GRU with 1024 units, and concatenate the bidirectional representations resulting in hidden states of size 2048." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-83", "text": "Finally, the self-attention layer computes a weighted sum over all the hidden GRU states:" }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-84", "text": "where a_t is the attention vector for hidden state h_t and W, V, b_w, and b_v indicate the weights and biases." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-85", "text": "The applied attention is then the sum over the Hadamard product between all hidden states (h_1, ..., h_t) and their attention vector." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-86", "text": "We use 128 units for W and 2048 units for V." 
}, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-87", "text": "----------------------------------" }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-88", "text": "**TRAINING**" }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-89", "text": "Following [8] , the model is trained to embed the images and captions such that the cosine similarity between image and caption pairs is larger (by a certain margin) than the similarity between mismatching pairs." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-90", "text": "This so-called hinge loss L as a function of the network parameters \u03b8 is given by:" }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-91", "text": "B is a minibatch of correct caption-image pairs (c, i), where the other caption-image pairs in the batch serve to create mismatched pairs (c, i\u2032) and (c\u2032, i)." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-92", "text": "We take the cosine similarity cos(x, y) and subtract the similarity of the mismatched pairs from the matching pairs such that the loss is only zero when the matching pair is more similar than the mismatched pairs by a margin \u03b1." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-93", "text": "We use importance sampling to select the mismatched pairs; rather than using all the other samples in the mini-batch as mismatched pairs (as done in [8, 15] ), we calculate the loss using only the hardest examples (i.e. mismatched pairs with high cosine similarity)." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-94", "text": "While [10] used only the single hardest example in the batch for text-captions, we found that this did not work for the spoken captions." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-95", "text": "Instead we found that using the hardest 25 percent worked well." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-96", "text": "The networks are trained using Adam [25] with a cyclic learning rate schedule based on [26] ." 
}, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-97", "text": "The learning rate schedule varies the learning rate smoothly between a minimum and maximum bound which were set to 10^\u22126 and 2 \u00d7 10^\u22124, respectively." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-98", "text": "The learning rate schedule causes the network to visit several local minima during training, allowing us to use snapshot ensembling [27] ." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-99", "text": "By saving the network parameters at each local minimum, we can ensemble the embeddings of multiple networks at no extra cost." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-100", "text": "We use a margin \u03b1 = 0.2 for the loss function." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-101", "text": "We train the networks for 32 epochs and take a snapshot for ensembling at every fourth epoch." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-102", "text": "For ensembling we use the two snapshots with the highest performance on the development data and simply sum their embeddings." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-103", "text": "The main differences with the approaches described in [13, 15] are the use of multi-layered GRUs, importance sampling, the cyclic learning rate, snapshot ensembling and the use of vectorial rather than scalar attention." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-104", "text": "----------------------------------" }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-105", "text": "**WORD PRESENCE DETECTION**" }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-106", "text": "While our model is not explicitly trained to recognise words or segment the speech signal, previous work has shown that such information can be extracted by visual grounding models [15, 28] ." 
}, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-107", "text": "[15] use a binary decision task: given a word and a sentence embedding, decide if the word occurs in the sentence." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-108", "text": "Our approach is similar to the spoken-bag-of-words prediction task described in [28] ." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-109", "text": "Given a sentence embedding created by our model, a classifier has to decide which of the words in its vocabulary occur in the sentence." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-110", "text": "Based on the original written captions, our database contains 7,374 unique words with a combined occurrence frequency of 324,480." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-111", "text": "From these we select words that occur between 50 and 1,000 times and are over 3 characters long so that there are enough examples in the data that the model might actually learn to recognise them, and to filter out punctuation, spelling mistakes, numerals and most function words." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-112", "text": "This leaves 460 unique words, mostly verbs and nouns, with a combined occurrence frequency of 87,020 in our data." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-113", "text": "We construct a vector for each sentence in Flickr8k indicating which of these words is present." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-114", "text": "We do not encode multiple occurrences of the same word in one sentence." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-115", "text": "The words described above are used as targets for a neural network classifier consisting of a single feed forward layer with 460 units." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-116", "text": "This layer simply takes an embedding vector as input and maps it to the 460 target words." 
}, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-117", "text": "We then apply the standard logistic function and calculate the Binary Cross Entropy loss to train the network." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-118", "text": "We train five word detection networks for both the MFCC- and the MBN-based caption encoders, in order to see how word presence is encoded in the different neural network layers." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-119", "text": "We train networks for the final output layer, the three intermediate layers of the GRU and the acoustic features." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-120", "text": "For the final layer we simply use the output embedding as input to the word detection network." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-121", "text": "We apply some post-processing to the acoustic features and the intermediate layer outputs to ensure that our word detection inputs are all of the same size." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-122", "text": "As the intermediate GRU layers produce 2048 features for each time step in the signal, we use average-pooling along the temporal dimension to create a single input vector and normalise the result to have unit L2 norm." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-123", "text": "The acoustic features consist of 30 (MBN) or 39 (MFCC) features for each time step, so we apply the convolutional layer followed by an untrained GRU layer to the input features, use average-pooling and normalise the result to have unit L2 norm." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-124", "text": "The word detection networks are trained for 32 epochs using Adam [25] with a constant learning rate of 0.001." 
}, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-125", "text": "We use the same data split that was used for training the multi-modal encoder, so that we test word presence detection on data that was not seen by either the encoder or the decoder." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-126", "text": "Table 1 shows the performance of our models on the image-caption retrieval task." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-127", "text": "The caption embeddings are ranked by cosine distance to the image and vice versa where R@N is the percentage of test items for which the correct image or caption was in the top N results." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-128", "text": "We compare our models to [12] and [15] , and include our own character-based model for comparison." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-129", "text": "[12] is a convolutional approach, whereas [15] is an approach using recurrent highway networks with scalar attention." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-130", "text": "The character-based model is similar to the model we use here and was trained on the original Flickr8k text captions. The results of the word presence detection task are shown in Figure 2 and Table 2 ." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-131", "text": "Figure 2 shows the F1 score for all the classifiers at 20 equally spaced detection thresholds (i.e. a word is classified as 'present' if the word detection output is above this threshold)." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-132", "text": "Table 2 displays the area under the curve for the receiver operating characteristic." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-133", "text": "Even though the MBN model outperforms the MFCC model for all layers, we see the same pattern emerging from both the F1 score and the AUC." 
}, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-134", "text": "The performance on the feature level is not much better than random." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-135", "text": "Predicting 'not present' for every word would be the best random guess as this is a heavy majority class in this task." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-136", "text": "Inspection of the predictions shows that the classifier is indeed heavily biased towards the majority class for the input features." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-137", "text": "Then we see the performance increasing for the first layer and peaking at the second layer." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-138", "text": "The performance then drops slightly for the third layer and the attention layer." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-139", "text": "----------------------------------" }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-140", "text": "**RESULTS**" }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-141", "text": "----------------------------------" }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-142", "text": "**DISCUSSION AND CONCLUSION**" }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-143", "text": "We trained an image-caption retrieval model on spoken input and investigated whether it learns to recognise linguistic units in the input." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-144", "text": "As improvements over previous work we used a 3-layer GRU and employed importance sampling, cyclic learning rates, ensembling and vectorial self-attention." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-145", "text": "Our results on both MBN and MFCC features are significantly higher than the previous state-of-the-art." 
}, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-146", "text": "The largest improvement comes from using the learned MBN features but our approach also improves results for MFCCs, which are the same features as were used in [15] ." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-147", "text": "The learned MBN features provide better performance whereas the MFCCs are more cognitively plausible input features." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-148", "text": "The probing task shows that the model learns to recognise these words in the input." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-149", "text": "The system is not explicitly optimised to do so, but our results show that the lower layers learn to recognise this form-related information from the input." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-150", "text": "After layer 2, the performance starts to decrease slightly which might indicate that these layers learn a more task-specific representation, and it is to be expected that the final attention layer specialises in mapping from audio features to the multi-modal embedding space." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-151", "text": "In conclusion, we presented what are, to the best of our knowledge, the best results on spoken-caption to image retrieval." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-152", "text": "Our results improve significantly over previous approaches for both untrained and trained audio features." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-153", "text": "In a probing task, we show that the model learns to recognise words in the input speech signal." 
}, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-154", "text": "We are currently collecting the Semantic Textual Similarity (STS) database in spoken format and the next step will be to investigate whether the model presented here also learns to capture sentence level semantic information and understand language in a deeper sense than recognising word presence." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-155", "text": "The work presented in [15] has made the first efforts in this regard and we aim to extend this to a larger database with sentences from multiple domains." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-156", "text": "Furthermore, we want to investigate the linguistic units that our model learns to recognise." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-157", "text": "In the current study, we only investigated whether the model learns to recognise words, but the potential benefit of our model is that it might learn multi-word statements or might even learn to look at sub-lexical level information." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-158", "text": "[14, 29] have recently shown that the speech-to-image retrieval approach can be used to detect word boundaries and even discover sub-word units." }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-159", "text": "Our interest is in investigating how these word and sub-word units develop over training and through the network layers." 
}, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-160", "text": "----------------------------------" }, { "sent_id": "24ee9b2bd8c97cbe923bc747b09806-C001-161", "text": "**ACKNOWLEDGEMENTS**" } ], "y": { "@SIM@": { "gold_contexts": [ [ "24ee9b2bd8c97cbe923bc747b09806-C001-29" ] ], "cite_sentences": [ "24ee9b2bd8c97cbe923bc747b09806-C001-29" ] }, "@EXT@": { "gold_contexts": [ [ "24ee9b2bd8c97cbe923bc747b09806-C001-29", "24ee9b2bd8c97cbe923bc747b09806-C001-30" ], [ "24ee9b2bd8c97cbe923bc747b09806-C001-154", "24ee9b2bd8c97cbe923bc747b09806-C001-155" ] ], "cite_sentences": [ "24ee9b2bd8c97cbe923bc747b09806-C001-29", "24ee9b2bd8c97cbe923bc747b09806-C001-155" ] }, "@BACK@": { "gold_contexts": [ [ "24ee9b2bd8c97cbe923bc747b09806-C001-31" ], [ "24ee9b2bd8c97cbe923bc747b09806-C001-71" ], [ "24ee9b2bd8c97cbe923bc747b09806-C001-106" ], [ "24ee9b2bd8c97cbe923bc747b09806-C001-107" ], [ "24ee9b2bd8c97cbe923bc747b09806-C001-129" ], [ "24ee9b2bd8c97cbe923bc747b09806-C001-155" ] ], "cite_sentences": [ "24ee9b2bd8c97cbe923bc747b09806-C001-31", "24ee9b2bd8c97cbe923bc747b09806-C001-71", "24ee9b2bd8c97cbe923bc747b09806-C001-106", "24ee9b2bd8c97cbe923bc747b09806-C001-107", "24ee9b2bd8c97cbe923bc747b09806-C001-129", "24ee9b2bd8c97cbe923bc747b09806-C001-155" ] }, "@USE@": { "gold_contexts": [ [ "24ee9b2bd8c97cbe923bc747b09806-C001-71" ], [ "24ee9b2bd8c97cbe923bc747b09806-C001-128" ], [ "24ee9b2bd8c97cbe923bc747b09806-C001-146" ] ], "cite_sentences": [ "24ee9b2bd8c97cbe923bc747b09806-C001-71", "24ee9b2bd8c97cbe923bc747b09806-C001-128", "24ee9b2bd8c97cbe923bc747b09806-C001-146" ] }, "@DIF@": { "gold_contexts": [ [ "24ee9b2bd8c97cbe923bc747b09806-C001-93" ], [ "24ee9b2bd8c97cbe923bc747b09806-C001-103" ] ], "cite_sentences": [ "24ee9b2bd8c97cbe923bc747b09806-C001-93", "24ee9b2bd8c97cbe923bc747b09806-C001-103" ] } } }, "ABC_26fbf9f4ae740513d8889160ad9f63_12": { "x": [ { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-108", "text": "**EVALUATION OF DISCOURSE RELATIONS**" }, { "sent_id": 
"26fbf9f4ae740513d8889160ad9f63-C001-61", "text": "Attributive: Relation provides details about an entity or an event -e.g." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-2", "text": "The work presented in this paper attempts to evaluate and quantify the use of discourse relations in the context of blog summarization and compare their use to more traditional and factual texts." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-3", "text": "Specifically, we measured the usefulness of 6 discourse relations -namely comparison, contingency, illustration, attribution, topic-opinion, and attributive for the task of text summarization from blogs." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-4", "text": "We have evaluated the effect of each relation using the TAC 2008 opinion summarization dataset and compared them with the results on the DUC 2007 dataset." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-5", "text": "The results show that in both textual genres, contingency, comparison, and illustration relations provide a significant improvement on summarization content; while attribution, topic-opinion, and attributive relations do not provide a consistent and significant improvement." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-6", "text": "These results indicate that, at least for summarization, discourse relations are just as useful for informal and affective texts as for more traditional news articles." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-7", "text": "----------------------------------" }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-9", "text": "It is widely accepted that in a coherent text, units should not be understood in isolation but in relation with each other through discourse relations that may or may not be explicitly marked." 
}, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-10", "text": "A text is not a linear combination of textual units but a hierarchically organized group of units placed together based on informational and intentional relations to one another." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-11", "text": "According to (Taboada, 2006) , \"Discourse relations -relations that hold together different parts (i.e. proposition, sentence, or paragraph) of the discourse -are partly responsible for the perceived coherence of a text\"." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-12", "text": "For example, in the sentence \"If you want the full Vista experience, you'll want a heavy system and graphics hardware, and lots of memory\", the first and second clauses do not bear much meaning independently; but become more meaningful when we realize that they are related through the discourse relation condition." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-13", "text": "Discourse relations have been found useful in many NLP applications such as natural language generation (e.g. (McKeown, 1985) ) and news summarization (e.g. (Blair-Goldensohn and McKeown, 2006; Bosma, 2004) ) to improve coherence and better simulate human writing." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-14", "text": "However, most of these works have been developed for formal, well-written and factual documents." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-15", "text": "Text available in the social media are typically written in a more casual style, are opinionated and speculative (Andreevskaia et al., 2007) ." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-16", "text": "Because of this, techniques developed for formal texts, such as news articles, often do not behave as well when dealing with informal documents." 
}, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-17", "text": "In particular, news articles are more uniform in style and structure; whereas blogs often do not exhibit a stereotypical discourse structure." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-18", "text": "As a result, for blogs, it is usually more difficult to identify and rank relevant units for summarization compared to news articles." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-62", "text": "\"Mary has a pink coat.\"." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-19", "text": "Several works have shown that discourse relations can improve the results of summarization in the case of factual texts or news articles (e.g. (Otterbacher et al., 2002) )." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-20", "text": "However, to our knowledge no work has evaluated the usefulness of discourse relations for the summarization of informal and opinionated texts, as those found in the social media." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-21", "text": "In this paper, we consider the most frequent discourse relations found in blogs: namely comparison, contingency, illustration, attribution, topic-opinion, and attributive and evaluate the effect of each relation on informal text summarization using the Text Analysis Conference (TAC) 2008 opinion summarization dataset 1 ." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-22", "text": "We then compare these results to those found with the news articles of the Document Understanding Conference (DUC) 2007 Main task dataset 2 ." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-23", "text": "The results show that in both types of texts, discourse relations seem to be as useful: contingency, comparison, and illustration relations provide a statistically significant improvement on the summary content; while the attribution, topic-opinion, and attributive relations do not provide a consistent and significant improvement." 
}, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-24", "text": "----------------------------------" }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-25", "text": "**RELATED WORK ON DISCOURSE RELATIONS FOR SUMMARIZATION**" }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-26", "text": "The use of discourse relations for text summarization is not new." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-27", "text": "Most notably, (Marcu, 1997) used discourse relations for single document summarization and proposed a discourse relation identification parsing algorithm." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-28", "text": "In some work (e.g. (Bosma, 2004; Blair-Goldensohn and McKeown, 2006) ), discourse relations have been exploited successfully for multi-document summarization." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-29", "text": "In particular, (Otterbacher et al., 2002) experimentally showed that discourse relations can improve the coherence of multi-document summaries." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-30", "text": "(Bosma, 2004) showed how discourse relations can be used effectively to incorporate additional contextual information for a given question in a query-based summarization." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-31", "text": "(Blair-Goldensohn and McKeown, 2006) used discourse relations for content selection and organization of automatic summaries and achieved an improvement in both cases." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-32", "text": "Discourse relations were also used successfully by (Zahri and Fukumoto, 2011) for news summarization." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-33", "text": "However, the work described above has been developed for formal, well-written and factual documents." 
}, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-34", "text": "Most of these works show how discourse relations can be used in text summarization and show their overall usefulness." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-35", "text": "To the best of our knowledge, our work is the first to measure the effect of specific relations on the summarization of informal and opinionated text." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-36", "text": "----------------------------------" }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-37", "text": "**TAGGING DISCOURSE RELATIONS**" }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-38", "text": "To evaluate the effect of discourse relations on a large scale, sentences need to be tagged automatically with discourse relations. (Footnote 1: http://www.nist.gov/tac/; footnote 2: http://www-nlpir.nist.gov/projects/duc/guidelines/2007.html)" }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-39", "text": "For example, the sentence \"Yesterday, I stayed at home because it was raining.\" needs to be tagged as containing a cause relation." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-40", "text": "One sentence can convey zero or several discourse relations." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-41", "text": "For example, the sentence \"Starbucks has contributed to the popularity of good tasting coffee\" does not contain any discourse relations of interest to us." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-42", "text": "On the other hand, the sentence \"While I like the Zillow interface and agree it's an easy way to find data, I'd prefer my readers used their own brain to perform a basic valuation of a property instead of relying on zestimates.\" contains 5 relations of interest: one comparison, three illustrations, and one attribution." 
}, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-43", "text": "----------------------------------" }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-44", "text": "**MOST FREQUENT DISCOURSE RELATIONS**" }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-45", "text": "Since our work is performed within the framework of blog summarization; we have only considered the discourse relations that are most useful to this application." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-46", "text": "To find the set of the relations needed for this task, we have first manually analyzed 50 summaries randomly selected from participating systems at the TAC 2008 opinion summarization track and 50 randomly selected blogs from BLOG06 corpus 3 ." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-47", "text": "In building our relation taxonomy, we considered all main discourse relations listed in the taxonomy of Mann and Thompson's Rhetorical Structure Theory (RST) (Mann and Thompson, 1988) ." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-48", "text": "These discourse relations are also considered in Grimes' (Grimes, 1975) and Williams' predicate lists." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-49", "text": "From our corpus analysis, we have identified the six most prevalent discourse relations in this blog dataset, namely comparison, contingency, illustration, attribution, topic-opinion, and attributive." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-50", "text": "The comparison, contingency, and illustration relations are also considered by most of the work in the field of discourse analysis such as the PDTB: Penn Discourse TreeBank research group (Prasad et al., 2008) and the RST Discourse Treebank research group (Carlson and Marcu, 2001 )." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-51", "text": "We considered three additional classes of relations: attributive, attribution, and topic-opinion." 
}, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-52", "text": "These discourse relations are summarized in Figure 1 while a description of these relations is given below." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-53", "text": "Illustration: Is used to provide additional information or detail about a situation." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-54", "text": "For example: \"Allied Capital is a closed-end management investment company that will operate as a business development concern.\" As shown in Figure 1 , illustration relations can be sub-divided into sub-categories: joint, list, disjoint, and elaboration relations according to the RST Discourse Treebank (Carlson and Marcu, 2001 ) and the Penn Discourse TreeBank (Prasad et al., 2008) ." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-55", "text": "Contingency: Provides cause, condition, reason or evidence for a situation, result or claim." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-56", "text": "For example: \"The meat is good because they slice it right in front of you.\"" }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-57", "text": "As shown in Figure 1 , the contingency relation subsumes several more specific relations: explanation, evidence, reason, cause, result, consequence, background, condition, hypothetical, enablement, and purpose relations according to the Penn Discourse TreeBank (Prasad et al., 2008) ." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-58", "text": "Comparison: Gives a comparison and contrast among different situations." 
}, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-59", "text": "For example, \"Its fastforward and rewind work much more smoothly and consistently than those of other models I've had.\"" }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-60", "text": "The comparison relation subsumes the contrast relation according to the Penn Discourse TreeBank (Prasad et al., 2008) and the analogy and preference relations according to the RST Discourse Treebank (Carlson and Marcu, 2001) ." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-63", "text": "It can be used to illustrate a particular feature about a concept or an entity -e.g." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-64", "text": "\"Picasa makes sure your pictures are always organized.\"." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-65", "text": "The attributive relation, also included in Grimes' predicates (Grimes, 1975) , is considered because it describes attributes or features of an object or event and is often used in query-based summarization and question answering." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-66", "text": "----------------------------------" }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-67", "text": "**TOPIC-OPINION:**" }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-68", "text": "We introduced topic-opinion relations to represent opinions which are not expressed by reported speech." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-69", "text": "This relation can be used to express an opinion: an internal feeling or belief towards an object or an event." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-70", "text": "For example: \"Cage is a wonderfully versatile actor.\"" }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-71", "text": "Attribution: These relations are instances of reported speech both direct and indirect which may express feelings, thoughts, or hopes." 
}, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-72", "text": "For example: \"The legendary GM chairman declared that his company would make \"a car for every purse and purpose.\"\"" }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-73", "text": "----------------------------------" }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-74", "text": "**AUTOMATIC DISCOURSE TAGGING**" }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-75", "text": "Once the manual analysis identified the most prevalent set of relations, we tried to measure their frequency by tagging them automatically within a larger corpus." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-76", "text": "Only recently, the HILDA (Hernault et al., 2010) and (Feng and Hirst, 2012 )'s discourse parser were made publicly available." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-77", "text": "Both of these parsers work at the text-level, as opposed to the sentence-level, and hence currently achieve the highest tagging performance when compared to the state of the art. (Feng and Hirst, 2012) 's work showed a significant improvement on the performance of HILDA by enhancing its original feature set." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-78", "text": "However, at the time this research was done, the only publicly available discourse parser was SPADE (Soricut and Marcu, 2003) which operates on individual sentences." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-79", "text": "To identify illustration, contingency, comparison, and attribution relations, we have used SPADE discourse parser." 
}, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-80", "text": "However, we have complemented this parser with three other approaches: (Jindal and Liu, 2006 )'s approach is used to identify intra-sentence comparison relations; we have designed a tagger based on (Fei et al., 2008) 's approach to identify topic-opinion relations; and we have proposed a new approach to tag attributive relations (Mithun, 2012) ." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-81", "text": "A description and evaluation of these approaches can be found in (Mithun, 2012) ." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-82", "text": "By combining these approaches, a sentence is tagged with all possible discourse relations that it contains." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-83", "text": "----------------------------------" }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-84", "text": "**DISTRIBUTION OF DISCOURSE RELATIONS**" }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-85", "text": "To find the most prevalent discourse relations for opinion summarization, we have used the TAC 2008 opinion summarization track input document set (collection) which is a subset of BLOG06 and the answer nuggets provided by TAC 2008 as the reference summary (or model summaries), which had been created to evaluate participants' summaries at the TAC 2008 opinion summarization track." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-86", "text": "The collection consists of 600 blogs on 28 different topics." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-87", "text": "The dataset of the model summaries consists of 693 sentences." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-88", "text": "Using the discourse parsers presented in Section 3.2, we computed the distribution of discourse relations within the TAC 2008 opinion summarization collection and the model summaries." 
}, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-89", "text": "Illustration, contingency, comparison, attributive, topicopinion, and attribution are the most frequently occuring relations in our data sets." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-90", "text": "The distribution is shown in Table 1 4 ." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-91", "text": "Table 1 shows that in the TAC 2008 input document set, the illustration relation occurs in 52% of the sentences; while attribution is the least frequently occurring relation." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-92", "text": "In this dataset, other relations, such as antithesis and temporal relations, occur in about 13% of the sentences and about 14% of the sentences did not receive any relation tag." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-93", "text": "As indicated in Table 1 , the TAC model summaries have a similar distribution as the collection as a whole." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-94", "text": "The attributive relation seems, however, to be more frequent in the summaries (28%) than in the original texts (12%)." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-95", "text": "We suspect that the reason for this is due to the question types of this track." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-96", "text": "To successfully generate queryrelevant summaries that answer the questions of this track, candidate sentences need to contain attributive relations." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-97", "text": "For example, to answer the questions from this track \"Why do people like Picasa?\" or \"What features do people like about Windows Vista?\", the summary needs to provide details about these entities or illustrate a particular feature about them." 
}, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-98", "text": "As a result, the summary will be composed of many attributive relations since attributive relations help to model the required information." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-99", "text": "To compare the distribution of discourse relations within more formal types of texts such as news articles, we used the Document Understanding Conference ( Table 1 . Table 1 shows that the most frequently occurring relation in the DUC 2007 document collection and in the model summaries is illustration; while the attribution relation is the least frequently occurring relation." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-100", "text": "Here again, it is interesting to note that the distribution of the discourse relations in the document collection and in the model summaries is generally comparable." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-101", "text": "The distribution of the illustration, contingency, and comparison relations in the DUC 2007 dataset is comparable to those in the TAC 2008 opinion summarization dataset." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-102", "text": "Indeed, Table 1 shows that illustration, contingency, and comparison relations occur quite frequently irrespective of the textual genre." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-103", "text": "However, in contrast to the TAC dataset, attributive, topic-opinion, and attribution relations occur very rarely in DUC 2007." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-104", "text": "We suspect that this is mostly due to the opinionated nature of blogs." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-105", "text": "Another observation is that temporal relations (included in \"other\") occurred very frequently (30%) in the DUC 2007 dataset whereas this relation occurs rarely in the blog dataset." 
}, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-106", "text": "This is in line with our intuition that news articles present events that inherently contain temporal information." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-107", "text": "----------------------------------" }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-109", "text": "To measure the usefulness of discourse relations for the summarization of informal texts, we have tested the effect of each relation with four different summarizers: BlogSum (Mithun, 2012) , MEAD (Radev et al., 2004) , the best scoring system at TAC 2008 5 and the best scoring system at DUC 2007 6 ." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-110", "text": "We have evaluated the effect of each discourse relation on the summaries generated and compared the results." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-111", "text": "Let us first describe the BlogSum summarizer." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-112", "text": "----------------------------------" }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-113", "text": "**BLOGSUM**" }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-114", "text": "BlogSum is a domain-independent query-based extractive summarization system that uses intrasentential discourse relations within the framework based on text schemata." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-115", "text": "The heart of BlogSum is based on discourse relations and text schemata." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-116", "text": "BlogSum works in the following way: First candidate sentences are extracted and ranked using the topic and question similarity to give priority to topic and question relevant sentences." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-117", "text": "Since BlogSum has been designed for blogs, which are opinionated in nature, to rank a sentence, the sentence polarity (e.g. 
positive, negative or neutral) is calculated and used for sentence ranking." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-118", "text": "To extract and rank sentences, BlogSum thus calculates a score for each sentence using the features shown below:" }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-119", "text": "Sentence Score = w1 \u00d7 Question Similarity + w2 \u00d7 Topic Similarity + w3 \u00d7 Subjectivity Score, where question similarity and topic similarity are calculated using the cosine similarity based on word tf.idf weights, and the subjectivity score is calculated using a dictionary-based approach based on the MPQA lexicon." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-120", "text": "Once sentences are ranked, they are categorized based on the discourse relations that they convey." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-121", "text": "This step is critical because the automatic identification of discourse relations renders BlogSum independent of the domain." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-122", "text": "This step also plays a key role in content selection and summary coherence as schemata are designed using these relations." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-123", "text": "In order not to answer all questions the same way, BlogSum uses different schemata to generate a summary that answers specific types of questions." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-124", "text": "Each schema is designed to give priority to its associated question type and to subjective sentences, since summaries are generated for opinionated texts." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-125", "text": "Each schema specifies the types of discourse relations and the order in which they should appear in the output summary for a particular question type."
}, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-126", "text": "Figure 2 shows a sample schema that is used to answer reason questions (e.g. \"Why do people like Picasa?\")." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-127", "text": "According to this schema 8 , one or more sentences containing a topic-opinion or attribution relation followed by zero or many sentences containing a contingency or comparison relation followed by zero or many sentences containing a attributive relation should be used." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-128", "text": "Finally the most appropriate schema is selected based on a given question type; and candidate sentences fill particular slots in the selected schema based on which discourse relations they contain in order to create the final summary (details of BlogSum can be found in (Mithun, 2012) )." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-129", "text": "----------------------------------" }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-130", "text": "**EVALUATION OF DISCOURSE RELATIONS ON BLOGS**" }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-131", "text": "To evaluate the effect of each discourse relation for blog summarization, we performed several experiments." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-132", "text": "We used as a baseline the original ranked list of candidate sentences produced by BlogSum before applying the discourse schemata, and compared this to the BlogSum-generated summaries with and without each discourse relation." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-133", "text": "We used the TAC 2008 opinion summarization dataset which consists of 50 questions on 28 topics; on each topic one or two questions were asked and 9 to 39 relevant documents were given." 
}, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-134", "text": "For each question, one summary was generated with no regards to discourse relations and two summaries were produced by BlogSum: one using the discourse tagger and the other without using the specific discourse tagger." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-135", "text": "The maximum summary length was restricted to 250 words." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-136", "text": "To measure the effect of each relation, we have automatically evaluated how BlogSum performs using the standard ROUGE-2 and ROUGE-SU4 measures." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-137", "text": "For comparative purposes, Table 2 shows the official ROUGE-2 (R-2) and ROUGE-SU4 (R-SU4) for all 36 submissions of the TAC 2008 opinion summarization track." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-138", "text": "In the table, \"TAC Average\" refers to the mean performance of all participant systems and \"TAC-Best\" refers to the best-scoring system at TAC 2008." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-139", "text": "The results of our evaluation are shown in Tables 3 (ROUGE-2) and 4 (ROUGE-SU4)." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-140", "text": "As the tables show, BlogSum's baseline is situated below the best scoring system at TAC-2008, but much higher than the average system (see Table 2 ); hence, it represents a fair baseline." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-141", "text": "The tables further show that using both the ROUGE-2 (R-2) and ROUGE-SU4 (R-SU4) metrics, with the TAC 2008 dataset, BlogSum performs better when taking discourse relations into account." 
}, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-142", "text": "Indeed, when ignoring discourse relations, BlogSum has a R2=0.102 and R-SU4=0.107 and misses many question relevant sentences; whereas the inclusion of these relations helps to incorporate those relevant sentences into the final summary and brings the R-2 score to 0.125 and R-SU4 to 0.128." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-143", "text": "In order to verify if these improvements were statistically significant, we performed a 2-tailed ttest." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-144", "text": "The results of this test are indicated with the \u21d3 symbol in Tables 3 and 4 ." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-145", "text": "For example, the baseline setup of BlogSum performed significantly lower for both R-2 and R-SU4 compared to BlogSum with all relations." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-146", "text": "This result indicates that the use of discourse relations as a whole helps to include more question relevant sentences and improve the summary content." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-147", "text": "To ensure that the results were not specific to our summarizer, we performed the same experiments with two other systems: the MEAD summarizer (Radev et al., 2004) , a publicly available and a widely used summarizer, and with the output of the TAC best-scoring system." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-148", "text": "For MEAD, we first generated candidate sentences using MEAD, then these candidate sentences were tagged using discourse relation taggers used under BlogSum." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-149", "text": "Then these tagged sentences were filtered using BlogSum so that no sentence with a specific relation is used in summary generation for a particular experiment." 
}, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-150", "text": "We have calculated ROUGE scores using the original candidate sentences generated by MEAD and also using the filtered candidate sentences." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-151", "text": "As a baseline, we used the original candidate sentences generated by MEAD." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-152", "text": "As a best case scenario, we have passed these candidate sentences through the discourse schemata used by BlogSum (see Section 4.1)." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-153", "text": "In Tables 3 and 4 , this is referred to as \"MEAD with all relations\"." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-154", "text": "We have applied the same approach with the output of the TAC best-scoring system." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-155", "text": "In the tables, \"TACBest Baseline\" refers to the original summaries generated by the TAC-Best system and \"TAC-Best with all relations\" refers to the summaries generated by applying discourse schemata using the summary sentences generated by the TAC-Best system." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-156", "text": "When looking at individual relations, Tables 3 and 4 show that considering illustrations, contingencies and comparisons make a statistically significant improvement in all scenarios, and with all summarisers." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-157", "text": "For example, if TAC-Best does not consider illustration relations, then the R-2 score decreases from 0.138 to 0.112, 0.102 and 0.113, respectively." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-158", "text": "On the other hand, the relations of topic-opinion, attribution, and attributive do not consistently lead to a statistically significant improvement on ROUGE scores." 
}, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-159", "text": "It is interesting to note that although informal texts may not exhibit a clear discourse structure, the use of individual discourse relations such as illustration, contingency and comparison is nonetheless useful in the analysis of informal documents such as those found in the social media." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-160", "text": "----------------------------------" }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-161", "text": "**EFFECT OF DISCOURSE RELATIONS ON NEWS**" }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-162", "text": "To compare the results found with blogs with more formal types of texts, we have performed the same experiments but, this time with the DUC 2007 Main Task dataset." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-163", "text": "In this task, given a topic (title) and a set of 25 relevant documents, participants had to create an automatic summary of length 250 words from the input documents." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-164", "text": "In the dataset, there were 45 topics and thirty teams participated to this shared task." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-165", "text": "Table 5 shows the official ROUGE-2 (R-2) and ROUGE-SU4 (R-SU4) scores of the DUC 2007 main task summarization track." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-166", "text": "In Table 5 , \"DUC Average\" refers to the mean performance of all participant systems and \"DUC-Best\" refers to the best scoring system at DUC 2007." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-167", "text": "Tables 6 and 7 show the results with this dataset with respect to ROUGE-2 and ROUGE-SU4, respectively." 
}, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-168", "text": "As the tables show, Blog- Sum's performance with all discourse relations (R2=0.093 and R-SU4=0.132) is similar to the DUC average performance shown in Table 5 (R2=00.095 and R-SU4=0.157) which is much lower than the DUC-Best performance (R2=0.124, R-SU4=0.177) shown in Table 5 )." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-169", "text": "However, these results show that even though BlogSum was designed for informal texts, it still performs relatively well with formal documents." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-170", "text": "Tables 6 and 7 further show that with the news dataset, the same relations have the most effect as with blogs." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-171", "text": "Indeed BlogSum generated summaries also benefit most from the contingency, illustration, and comparison relations; and all three relations bring a statistically significant contribution to the summary content." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-172", "text": "Here again, as shown in Tables 6 and 7 , we performed the same experiments with two other systems: the MEAD summarizer and the output of the DUC-Best system." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-173", "text": "Again, for the DUC 2007 dataset, each discourse relation has the same effect on summarization with all systems as with the blog dataset: contingency, illustration, and comparison provide a statistically significant improvement in content; while attributive, topicopinion and attribution do not reduce the content, but do not see to bring a systematic and significant improvement." 
}, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-174", "text": "----------------------------------" }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-175", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-176", "text": "In this paper, we have evaluated the effect of discourse relations on summarization." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-177", "text": "We have considered the six most frequent relations in blogsnamely comparison, contingency, illustration, attribution, topic-opinion, and attributive." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-178", "text": "First, we have measured the distribution of discourse relations on blogs and on news articles and show that the prevalence of these six relations is not genre dependent." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-179", "text": "For example, the relations of illustration, contingency, and comparison occur frequently in both textual genres." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-180", "text": "We have then evaluated the effect of these six relations on summarization with the TAC 2008 opinion summarization dataset and the DUC 2007 dataset." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-181", "text": "We have conducted these evaluations with our summarization system called BlogSum, the TAC best-scoring system, the DUC best-scoring system, and the MEAD summarizer." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-182", "text": "The results show that for both textual genres, some relations have more effect on summarization compared to others." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-183", "text": "In both types of texts, the contingency, illustration, and comparison relations provide a significant improvement on summary content; while the attribution, topicopinion, and attributive relations do not provide a systematic and statistically significant improvement." 
}, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-184", "text": "These results seem to indicate that, at least for summarization, discourse relations are just as useful for informal and affective texts as for more traditional news articles." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-185", "text": "This is interesting, because although informal texts may not exhibit a clear discourse structure, the use of individual discourse relations is nonetheless useful in the analysis of informal documents." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-186", "text": "In the future, it would be interesting to evaluate the effect of other relations such as the temporal relation." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-187", "text": "Indeed, temporal relations occur infrequently in blogs but are very frequent in news articles." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-188", "text": "Such an analysis would allow us to tailor the type of discourse relations to include in the final summary as a function of the textual genre being considered." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-189", "text": "In the future, it would also be interesting to use other types of texts such as reviews and evaluate the effect of discourse relations using other measures than ROUGE-2 and ROUGE-SU4." }, { "sent_id": "26fbf9f4ae740513d8889160ad9f63-C001-190", "text": "Finally, we would like to validate this work again with the newly available discourse parsers of (Hernault et al., 2010) and (Feng and Hirst, 2012) ." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "26fbf9f4ae740513d8889160ad9f63-C001-19" ], [ "26fbf9f4ae740513d8889160ad9f63-C001-29" ], [ "26fbf9f4ae740513d8889160ad9f63-C001-50" ], [ "26fbf9f4ae740513d8889160ad9f63-C001-54" ], [ "26fbf9f4ae740513d8889160ad9f63-C001-57" ], [ "26fbf9f4ae740513d8889160ad9f63-C001-60" ], [ "26fbf9f4ae740513d8889160ad9f63-C001-81" ], [ "26fbf9f4ae740513d8889160ad9f63-C001-128" ] ], "cite_sentences": [ "26fbf9f4ae740513d8889160ad9f63-C001-19", "26fbf9f4ae740513d8889160ad9f63-C001-29", "26fbf9f4ae740513d8889160ad9f63-C001-50", "26fbf9f4ae740513d8889160ad9f63-C001-54", "26fbf9f4ae740513d8889160ad9f63-C001-57", "26fbf9f4ae740513d8889160ad9f63-C001-60", "26fbf9f4ae740513d8889160ad9f63-C001-81", "26fbf9f4ae740513d8889160ad9f63-C001-128" ] }, "@SIM@": { "gold_contexts": [ [ "26fbf9f4ae740513d8889160ad9f63-C001-49", "26fbf9f4ae740513d8889160ad9f63-C001-50" ] ], "cite_sentences": [ "26fbf9f4ae740513d8889160ad9f63-C001-50" ] }, "@EXT@": { "gold_contexts": [ [ "26fbf9f4ae740513d8889160ad9f63-C001-50", "26fbf9f4ae740513d8889160ad9f63-C001-51" ], [ "26fbf9f4ae740513d8889160ad9f63-C001-80" ] ], "cite_sentences": [ "26fbf9f4ae740513d8889160ad9f63-C001-50", "26fbf9f4ae740513d8889160ad9f63-C001-80" ] }, "@DIF@": { "gold_contexts": [ [ "26fbf9f4ae740513d8889160ad9f63-C001-50", "26fbf9f4ae740513d8889160ad9f63-C001-51" ] ], "cite_sentences": [ "26fbf9f4ae740513d8889160ad9f63-C001-50" ] }, "@USE@": { "gold_contexts": [ [ "26fbf9f4ae740513d8889160ad9f63-C001-80" ], [ "26fbf9f4ae740513d8889160ad9f63-C001-109" ], [ "26fbf9f4ae740513d8889160ad9f63-C001-147" ] ], "cite_sentences": [ "26fbf9f4ae740513d8889160ad9f63-C001-80", "26fbf9f4ae740513d8889160ad9f63-C001-109", "26fbf9f4ae740513d8889160ad9f63-C001-147" ] } } }, "ABC_7d5c01ec5d744747413e42dcbc1a3c_12": { "x": [ { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-98", "text": "----------------------------------" }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-99", "text": "**EXPANDRANK**" }, { "sent_id": 
"7d5c01ec5d744747413e42dcbc1a3c-C001-122", "text": "----------------------------------" }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-2", "text": "State-of-the-art approaches for unsupervised keyphrase extraction are typically evaluated on a single dataset with a single parameter setting." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-3", "text": "Consequently, it is unclear how effective these approaches are on a new dataset from a different domain, and how sensitive they are to changes in parameter settings." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-4", "text": "To gain a better understanding of state-of-the-art unsupervised keyphrase extraction algorithms, we conduct a systematic evaluation and analysis of these algorithms on a variety of standard evaluation datasets." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-5", "text": "----------------------------------" }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-7", "text": "The keyphrases for a given document refer to a group of phrases that represent the document." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-8", "text": "Although we often come across texts from different domains such as scientific papers, news articles and blogs, which are labeled with keyphrases by the authors, a large portion of the Web content remains untagged." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-9", "text": "While keyphrases are excellent means for providing a concise summary of a document, recent research results have suggested that the task of automatically identifying keyphrases from a document is by no means trivial." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-10", "text": "Researchers have explored both supervised and unsupervised techniques to address the problem of automatic keyphrase extraction." 
}, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-11", "text": "Supervised methods typically recast this problem as a binary classification task, where a model is trained on annotated data to determine whether a given phrase is a keyphrase or not (e.g., Frank et al. (1999) , Turney (2000; , Hulth (2003) , Medelyan et al. (2009)) ." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-12", "text": "A disadvantage of supervised approaches is that they require a lot of training data and yet show bias towards the domain on which they are trained, undermining their ability to generalize well to new domains." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-13", "text": "Unsupervised approaches could be a viable alternative in this regard." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-14", "text": "The unsupervised approaches for keyphrase extraction proposed so far have involved a number of techniques, including language modeling (e.g., Tomokiyo and Hurst (2003) ), graph-based ranking (e.g., Zha (2002) , Mihalcea and Tarau (2004) , Wan et al. (2007) , Wan and Xiao (2008) , Liu et al. (2009a) ), and clustering (e.g., Matsuo and Ishizuka (2004) , Liu et al. (2009b) )." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-15", "text": "While these methods have been shown to work well on a particular domain of text such as short paper abstracts and news articles, their effectiveness and portability across different domains have remained an unexplored issue." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-16", "text": "Worse still, each of them is based on a set of assumptions, which may only hold for the dataset on which they are evaluated." 
}, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-17", "text": "----------------------------------" }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-18", "text": "**CONSEQUENTLY, WE HAVE LITTLE UNDERSTANDING OF HOW EFFECTIVE THE STATE-OF THE-ART SYSTEMS WOULD BE ON A COMPLETELY NEW DATASET FROM A DIFFERENT DOMAIN.**" }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-19", "text": "A few questions arise naturally." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-20", "text": "How would these systems perform on a different dataset with their original configuration? What could be the underlying reasons in case they perform poorly?" }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-21", "text": "Is there any system that can generalize fairly well across various domains?" }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-22", "text": "We seek to gain a better understanding of the state of the art in unsupervised keyphrase extraction by examining the aforementioned questions." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-23", "text": "More specifically, we compare five unsupervised keyphrase extraction algorithms on four corpora with varying domains and statistical characteristics." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-24", "text": "These algorithms represent the ma-jor directions in this research area, including TfIdf and four recently proposed systems, namely, TextRank (Mihalcea and Tarau, 2004) , SingleRank (Wan and Xiao, 2008) , ExpandRank (Wan and Xiao, 2008) , and a clustering-based approach (Liu et al., 2009b) ." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-25", "text": "Since none of these systems (except TextRank) are publicly available, we reimplement all of them and make them freely available for research purposes." 
}, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-26", "text": "1 To our knowledge, this is the first attempt to compare the performance of state-of-the-art unsupervised keyphrase extraction systems on multiple datasets." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-27", "text": "----------------------------------" }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-28", "text": "**CORPORA**" }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-29", "text": "Our four evaluation corpora belong to different domains with varying document properties." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-30", "text": "Table 1 provides an overview of each corpus." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-31", "text": "The DUC-2001 dataset (Over, 2001) , which is a collection of 308 news articles, is annotated by Wan and Xiao (2008) ." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-32", "text": "We report results on all 308 articles in our evaluation." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-123", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-33", "text": "The Inspec dataset is a collection of 2,000 abstracts from journal papers including the paper title." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-34", "text": "Each document has two sets of keyphrases assigned by the indexers: the controlled keyphrases, which are keyphrases that appear in the Inspec thesaurus; and the uncontrolled keyphrases, which do not necessarily appear in the thesaurus." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-35", "text": "This is a relatively popular dataset for automatic keyphrase extraction, as it was first used by Hulth (2003) and later by Mihalcea and Tarau (2004) and Liu et al. (2009b) ." 
}, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-36", "text": "In our evaluation, we use the set of 500 abstracts designated by these previous approaches as the test set and its set of uncontrolled keyphrases." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-37", "text": "Note that the average document length for this dataset is the smallest among all our datasets." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-38", "text": "The NUS Keyphrase Corpus (Nguyen and Kan, 2007) includes 211 scientific conference papers with lengths between 4 to 12 pages." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-39", "text": "Each paper has one or more sets of keyphrases assigned by its authors and other annotators." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-40", "text": "We use all the 211 papers in our evaluation." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-41", "text": "Since the number of annotators can be different for different documents and the annotators are not specified along with the annotations, we decide to take the union 1 See http://www.hlt.utdallas.edu/ saidul/code.html for details." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-42", "text": "of all the gold standard keyphrases from all the sets to construct one single set of annotation for each paper." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-43", "text": "As Table 1 shows, each NUS paper, both in terms of the average number of tokens (8291) and candidate phrases (2027) per paper, is more than five times larger than any document from any other corpus." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-44", "text": "Hence, the number of candidate keyphrases that can be extracted is potentially large, making this corpus the most challenging of the four." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-45", "text": "Finally, the ICSI meeting corpus (Janin et al., 2003) , which is annotated by Liu et al. 
(2009a) , includes 161 meeting transcriptions." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-46", "text": "Following Liu et al., we remove topic segments marked as 'chitchat' and 'digit' from the dataset and use all the remaining segments for evaluation." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-47", "text": "Each transcript contains three sets of keyphrases produced by the same three human annotators." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-48", "text": "Since it is possible to associate each set of keyphrases with its annotator, we evaluate each system on this dataset three times, once for each annotator, and report the average score." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-49", "text": "Unlike the other three datasets, the gold standard keys for the ICSI corpus are mostly unigrams." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-50", "text": "----------------------------------" }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-51", "text": "**UNSUPERVISED KEYPHRASE EXTRACTORS**" }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-52", "text": "A generic unsupervised keyphrase extraction system typically operates in three steps (Section 3.1), which will help understand the unsupervised systems explained in Section 3.2." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-53", "text": "----------------------------------" }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-54", "text": "**GENERIC KEYPHRASE EXTRACTOR**" }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-55", "text": "Step 1: Candidate lexical unit selection The first step is to filter out unnecessary word tokens from the input document and generate a list of potential keywords using heuristics." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-56", "text": "Commonly used heuristics include (1) using a stop word list to remove non-keywords (e.g., Liu et al. 
(2009b)) and (2) allowing words with certain part-of-speech tags (e.g., nouns, adjectives, verbs) to be considered candidate keywords (Mihalcea and Tarau (2004), Liu et al. (2009a), Wan and Xiao (2008))." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-57", "text": "In all of our experiments, we follow Wan and Xiao (2008) and select as candidates words with the following Penn Treebank tags: NN, NNS, NNP, NNPS, and JJ, which are obtained using the Stanford POS tagger (Toutanova and Manning, 2000)." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-58", "text": "Table 1: Corpus statistics for the four datasets used in this paper." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-59", "text": "A candidate word/phrase, typically a sequence of one or more adjectives and nouns, is extracted from the document initially and considered a potential keyphrase." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-60", "text": "The U/B/T/O distribution indicates how the gold standard keys are distributed among unigrams, bigrams, trigrams, and other higher order n-grams." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-61", "text": "Step 2: Lexical unit ranking Once the candidate list is generated, the next task is to rank these lexical units." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-62", "text": "To accomplish this, it is necessary to build a representation of the input text for the ranking algorithm." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-63", "text": "Depending on the underlying approach, each candidate word is represented by its syntactic and/or semantic relationship with other candidate words." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-64", "text": "The relationship can be defined using co-occurrence statistics, external resources (e.g., neighborhood documents, Wikipedia), or other syntactic clues." 
}, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-65", "text": "Step 3: Keyphrase formation In the final step, the ranked list of candidate words is used to form keyphrases." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-66", "text": "A candidate phrase, typically a sequence of nouns and adjectives, is selected as a keyphrase if (1) it includes one or more of the top-ranked candidate words (Mihalcea and Tarau (2004) , Liu et al. (2009b) ), or (2) the sum of the ranking scores of its constituent words makes it a top scoring phrase (Wan and Xiao, 2008) ." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-67", "text": "----------------------------------" }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-68", "text": "**THE FIVE KEYPHRASE EXTRACTORS**" }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-69", "text": "As mentioned above, we re-implement five unsupervised approaches for keyphrase extraction." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-70", "text": "Below we provide a brief overview of each system." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-71", "text": "----------------------------------" }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-72", "text": "**TF-IDF**" }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-73", "text": "Tf-Idf assigns a score to each term t in a document d based on t's frequency in d (term frequency) and how many other documents include t (inverse document frequency) and is defined as:" }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-74", "text": "where D is the total number of documents and D t is the number of documents containing t." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-75", "text": "Given a document, we first compute the TfIdf score of each candidate word (see Step 1 of the generic algorithm)." 
}, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-76", "text": "Then, we extract all the longest n-grams consisting of candidate words and score each n-gram by summing the Tf-Idf scores of its constituent unigrams." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-77", "text": "Finally, we output the top N n-grams as keyphrases." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-78", "text": "----------------------------------" }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-79", "text": "**TEXTRANK**" }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-80", "text": "In the TextRank algorithm (Mihalcea and Tarau, 2004) , a text is represented by a graph." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-81", "text": "Each vertex corresponds to a word type." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-82", "text": "A weight, w ij , is assigned to the edge connecting the two vertices, v i and v j , and its value is the number of times the corresponding word types co-occur within a window of W words in the associated text." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-83", "text": "The goal is to (1) compute the score of each vertex, which reflects its importance, and then (2) use the word types that correspond to the highestscored vertices to form keyphrases for the text." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-84", "text": "The score for v i , S(v i ), is initialized with a default value and is computed in an iterative manner until convergence using this recursive formula:" }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-85", "text": "where Adj(v i ) denotes v i 's neighbors and d is the damping factor set to 0.85 (Brin and Page, 1998) ." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-86", "text": "Intuitively, a vertex will receive a high score if it has many high-scored neighbors." 
}, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-87", "text": "As noted before, after convergence, the T % top-scored vertices are selected as keywords." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-88", "text": "Adjacent keywords are then collapsed and output as a keyphrase." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-89", "text": "According to Mihalcea and Tarau (2004) , TextRank's best score on the Inspec dataset is achieved when only nouns and adjectives are used to create a uniformly weighted graph for the text under consideration, where an edge connects two word types only if they co-occur within a window of two words." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-90", "text": "Hence, our implementation of TextRank follows this configuration." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-91", "text": "----------------------------------" }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-92", "text": "**SINGLERANK**" }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-93", "text": "SingleRank (Wan and Xiao, 2008 ) is essentially a TextRank approach with three major differences." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-94", "text": "First, while each edge in a TextRank graph (in Mihalcea and Tarau's implementation) has the same weight, each edge in a SingleRank graph has a weight equal to the number of times the two corresponding word types co-occur." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-95", "text": "Second, while in TextRank only the word types that correspond to the top-ranked vertices can be used to form keyphrases, in SingleRank, we do not filter out any low-scored vertices." 
}, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-96", "text": "Rather, we (1) score each candidate keyphrase, which can be any longest-matching sequence of nouns and adjectives in the text under consideration, by summing the scores of its constituent word types obtained from the SingleRank graph, and (2) output the N highest-scored candidates as the keyphrases for the text." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-97", "text": "Finally, SingleRank employs a window size of 10 rather than 2." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-100", "text": "ExpandRank (Wan and Xiao, 2008 ) is a TextRank extension that exploits neighborhood knowledge for keyphrase extraction." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-101", "text": "For a given document d, the approach first finds its k nearest neighboring documents from the accompanying document collection using a similarity measure (e.g., cosine similarity)." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-102", "text": "Then, the graph for d is built using the co-occurrence statistics of the candidate words collected from the document itself and its k nearest neighbors." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-103", "text": "Specifically, each document is represented by a term vector where each vector dimension corresponds to a word type present in the document and its value is the Tf-Idf score of that word type for that document." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-104", "text": "For a given document d 0 , its k nearest neighbors are identified, and together they form a larger document set of k+1 documents," }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-105", "text": "Given this document set, a graph is constructed, where each vertex corresponds to a candidate word type in D, and each edge connects two vertices v i and v j if the corresponding word types co-occur within a window of W words in the document set." 
}, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-106", "text": "The weight of an edge, w(v i , v j ), is computed as follows:" }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-107", "text": "where sim(d 0 , d k ) is the cosine similarity be-" }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-108", "text": "Once the graph is constructed, the rest of the procedure is identical to SingleRank." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-109", "text": "Liu et al. (2009b) propose to cluster candidate words based on their semantic relationship to ensure that the extracted keyphrases cover the entire document." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-110", "text": "The objective is to have each cluster represent a unique aspect of the document and take a representative word from each cluster so that the document is covered from all aspects." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-111", "text": "----------------------------------" }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-112", "text": "**CLUSTERING-BASED APPROACH**" }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-113", "text": "More specifically, their algorithm (henceforth referred to as KeyCluster) first filters out the stop words from a given document and treats the remaining unigrams as candidate words." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-114", "text": "Second, for each candidate, its relatedness with another candidate is computed by (1) counting how many times they co-occur within a window of size W in the document and (2) using Wikipedia-based statistics." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-115", "text": "Third, candidate words are clustered based on their relatedness with other candidates." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-116", "text": "Three clustering algorithms are used of which spectral clustering yields the best score." 
}, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-117", "text": "Once the clusters are formed, one representative word, called an exemplar term, is picked from each cluster." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-118", "text": "Finally, KeyCluster extracts from the document all the longest n-grams starting with zero or more adjectives and ending with one or more nouns, and if such an n-gram includes one or more exemplar words, it is selected as a keyphrase." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-119", "text": "As a post-processing step, a frequent word list generated from Wikipedia is used to filter out the frequent unigrams that are selected as keyphrases." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-120", "text": "----------------------------------" }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-121", "text": "**EVALUATION**" }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-124", "text": "TextRank and SingleRank setup Following Mihalcea and Tarau (2004) and Wan and Xiao (2008) , we set the co-occurrence window size for TextRank and SingleRank to 2 and 10, respectively, as these parameter values have yielded the best results for their evaluation datasets." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-125", "text": "ExpandRank setup Following Wan and Xiao (2008), we find the 5 nearest neighbors for each document from the remaining documents in the same corpus." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-126", "text": "The other parameters are set in the same way as in SingleRank." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-127", "text": "KeyCluster setup As argued by Liu et al. (2009b) , Wikipedia-based relatedness is computationally expensive to compute." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-128", "text": "As a result, we follow them by computing the co-occurrence-based relatedness instead, using a window of size 10." 
}, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-129", "text": "Then, we cluster the candidate words using spectral clustering, and use the frequent word list that they generously provided us to post-process the resulting keyphrases by filtering out those that are frequent unigrams." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-130", "text": "----------------------------------" }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-131", "text": "**RESULTS AND DISCUSSION**" }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-132", "text": "In an attempt to gain a better insight into the five unsupervised systems, we report their performance in terms of precision-recall curves for each of the four datasets (see Figure 1 )." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-133", "text": "This contrasts with essentially all previous work, where the performance of a keyphrase extraction system is reported in terms of an F-score obtained via a particular parameter setting on a particular dataset." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-134", "text": "We generate the curves for each system as follows." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-135", "text": "For Tf-Idf, SingleRank, and ExpandRank, we vary the number of keyphrases, N , predicted by each system." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-136", "text": "For TextRank, instead of varying the number of predicted keyphrases, we vary T , the percentage of top-scored vertices (i.e., unigrams) that are selected as keywords at the end of the ranking step." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-137", "text": "The reason is that TextRank only imposes a ranking on the unigrams but not on the keyphrases generated from the high-ranked unigrams." 
}, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-138", "text": "For KeyCluster, we vary the number of clusters produced by spectral clustering rather than the number of predicted keyphrases, again because KeyCluster does not impose a ranking on the resulting keyphrases." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-139", "text": "In addition, to give an estimate of how each system performs in terms of F-score, we also plot curves corresponding to different F-scores in these graphs." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-140", "text": "Tf-Idf Consistent with our intuition, the precision of Tf-Idf drops as recall increases." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-141", "text": "Although it is the simplest of the five approaches, Tf-Idf is the best performing system on all but the Inspec dataset, where TextRank and KeyCluster beat TfIdf on just a few cases." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-142", "text": "It clearly outperforms all other systems for NUS and ICSI." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-143", "text": "TextRank The TextRank curves show a different progression than Tf-Idf: precision does not drop as much when recall increases." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-144", "text": "For instance, in case of DUC and ICSI, precision is not sensitive to changes in recall." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-145", "text": "Perhaps somewhat surprisingly, its precision increases with recall for Inspec, allowing it to even reach a point (towards the end of the curve) where it beats Tf-Idf." 
}, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-146", "text": "While additional experiments are needed to determine the reason for this somewhat counter-intuitive result, we speculate that this may be attributed to the fact that the TextRank curves are generated by progressively increasing T (i.e., the percentage of top-ranked vertices/unigrams that are used to generate keyphrases) rather than the number of predicted keyphrases, as mentioned before." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-147", "text": "Increasing T does not necessarily imply an increase in the number of predicted keyphrases, however." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-148", "text": "To see the reason, consider an example in which we want TextRank to extract the keyphrase \"advanced machine learning\" for a given document." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-149", "text": "Assume that TextRank ranks the unigrams \"advanced\", \"learning\", and \"machine\" first, second, and third, respectively in its ranking step." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-150", "text": "When T = 2 n , where n denotes the total number of candidate unigrams, only the two highest-ranked unigrams (i.e., \"advanced\" and \"learning\") can be used to form keyphrases." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-151", "text": "This implies that \"advanced\" and \"learning\" will each be predicted as a keyphrase, but \"advanced machine learning\" will not." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-152", "text": "However, when T = 3 n , all three unigrams can be used to form a keyphrase, and since TextRank collapses unigrams adjacent to each other in the text to form a keyphrase, it will correctly predict \"advanced machine learning\" as a keyphrase." 
}, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-153", "text": "Note that as we increase T from \"advanced\" and \"learning\" are now combined to form one keyphrase (and hence the number of predicted keyphrases decreases)." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-154", "text": "In other words, it is possible to see a simultaneous rise in precision and recall in a TextRank curve." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-155", "text": "A natural question is: why does is happen only for Inspec but not the other datasets?" }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-156", "text": "The reason could be attributed to the fact that Inspec is composed of abstracts: since the number of keyphrases that can be generated from these short documents is relatively small, precision may not drop as severely as with the other datasets even when all of the unigrams are used to form keyphrases." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-157", "text": "On average, TextRank performs much worse compared to Tf-Idf." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-158", "text": "The curves also prove TextRank's sensitivity to T on Inspec, but not on the other datasets." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-159", "text": "This certainly gives more insight into TextRank since it was evaluated on Inspec only for T=33% by Mihalcea and Tarau (2004) ." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-160", "text": "SingleRank SingleRank, which is supposed to be a simple variant of TextRank, surprisingly exhibits very different performance." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-161", "text": "First, it shows a more intuitive nature: precision drops as recall increases." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-162", "text": "Second, SingleRank outperforms TextRank by big margins on all the datasets." 
}, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-163", "text": "Later, we will examine which of the differences between them is responsible for the differing performance." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-164", "text": "Table 2 : Best parameter settings." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-207", "text": "First, to fully understand the strengths and weaknesses of a keyphrase extractor, it is essential to evaluate it on multiple datasets." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-165", "text": "N is the number of predicted keyphrases, T is the percentage of vertices selected as keywords in TextRank, m is the number of clusters in KeyCluster, expressed in terms of n, the fraction of candidate words." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-166", "text": "ExpandRank Consistent with Wan and Xiao (2008) , ExpandRank beats SingleRank on DUC when a small number of phrases are predicted, but their difference diminishes as more phrases are predicted." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-167", "text": "On the other hand, their performance is indistinguishable from each other on the other three datasets." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-168", "text": "A natural question is: why does ExpandRank improve over SingleRank only for DUC but not for the other datasets?" }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-169", "text": "To answer this question, we look at the DUC articles and find that in many cases, the 5-nearest neighbors of a document are on the same topic involving the same entities as the document itself, presumably because many of these news articles are simply updated versions of an evolving event." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-170", "text": "Consequently, the graph built from the neighboring documents is helpful for predicting the keyphrases of the given document." 
}, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-171", "text": "Such topic-wise similarity among the nearest documents does not exist in the other datasets, however." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-172", "text": "KeyCluster As in TextRank, KeyCluster does not always yield a drop in precision as recall improves." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-173", "text": "This, again, may be attributed to the fact that the KeyCluster curves are generated by varying the number of clusters rather than the number of predicted keyphrases, as well as the way keyphrases are formed from the exemplars." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-174", "text": "Another reason is that the frequent Wikipedia unigrams are excluded during post-processing, making KeyCluster more resistant to precision drops." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-175", "text": "Overall, KeyCluster performs slightly better than TextRank on DUC and ICSI, yields the worst performance on NUS, and scores the best on Inspec when the number of clusters is high." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-176", "text": "These results seem to suggest that KeyCluster works better if more clusters are used." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-177", "text": "Best parameter settings Table 2 shows for each system the parameter values yielding the best Fscore on each dataset." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-178", "text": "Two points deserve mention." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-179", "text": "First, in comparison to SingleRank and ExpandRank, Tf-Idf outputs fewer keyphrases to achieve its best F-score on most datasets." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-180", "text": "Second, the systems output more keyphrases on NUS than on other datasets to achieve their best F-scores (e.g., 60 for Tf-Idf, 190 for SingleRank, and 177 for ExpandRank)." 
}, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-181", "text": "This can be attributed in part to the fact that the F-scores on NUS are low for all the systems and exhibit only slight changes as we output more phrases." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-182", "text": "Our re-implementations Do our duplicated systems yield scores that match the original scores?" }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-183", "text": "Table 3 sheds light on this question." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-184", "text": "First, consider KeyCluster, where our score lags behind the original score by approximately 5%." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-185", "text": "An examination of Liu et al.'s (2009b) results reveals a subtle caveat in keyphrase extraction evaluations." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-186", "text": "In Inspec, not all gold-standard keyphrases appear in their associated document, and as a result, none of the five systems we consider in this paper can achieve a recall of 100." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-187", "text": "While Mihalcea and Tarau (2004) and our reimplementations use all of these gold-standard keyphrases in our evaluation, Hulth (2003) and Liu et al. address Table 3 : Original vs. re-implementation scores of TextRank 3 , and are confident that our implementation is correct." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-188", "text": "It is also worth mentioning that using our re-implementation of SingleRank, we are able to match the best scores reported by Mihalcea and Tarau (2004) on Inspec." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-189", "text": "We score 2 and 5 points less than Wan and Xiao's (2008) implementations of SingleRank and ExpandRank, respectively." 
}, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-190", "text": "We speculate that document pre-processing (e.g., stemming) has contributed to the discrepancy, but additional experiments are needed to determine the reason." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-191", "text": "SingleRank vs. TextRank Figure 1 shows that SingleRank behaves very differently from TextRank." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-192", "text": "As mentioned in Section 3.2.3, the two algorithms differ in three major aspects." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-193", "text": "To determine which aspect is chiefly responsible for the large difference in their performance, we conduct three \"ablation\" experiments." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-194", "text": "Each experiment modifies exactly one of these aspects in SingleRank so that it behaves like TextRank, effectively ensuring that the two algorithms differ only in the remaining two aspects." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-195", "text": "More specifically, in the three experiments, we (1) change SingleRank's window size to 2, (2) build an unweighted graph for SingleRank, and (3) incorporate TextRank's way of forming keyphrases into SingleRank, respectively." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-196", "text": "Figure 2 shows the resultant curves along with the SingleRank and TextRank curves on Inspec taken from Figure 1b ." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-197", "text": "As we can see, the way of forming phrases, rather than the window size or the weight assignment method, has the largest impact on the scores." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-198", "text": "In fact, after incorporating TextRank's way of forming phrases, SingleRank exhibits a remarkable drop in performance, yielding a curve that resembles the TextRank curve." 
}, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-199", "text": "Also note that SingleRank achieves better recall values than TextRank." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-200", "text": "To see the reason, recall that TextRank requires that every word of a gold keyphrase must appear among the top- ranked unigrams." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-201", "text": "This is a fairly strict criterion, especially in comparison to SingleRank, which does not require all unigrams of a gold keyphrase to be present in the top-ranked list." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-202", "text": "We observe similar trends for the other datasets." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-203", "text": "----------------------------------" }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-204", "text": "**CONCLUSIONS**" }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-205", "text": "We have conducted a systematic evaluation of five state-of-the-art unsupervised keyphrase extraction algorithms on datasets from four different domains." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-206", "text": "Several conclusions can be drawn from our experimental results." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-208", "text": "In particular, evaluating it on a single dataset has proven inadequate, as good performance can sometimes be achieved due to certain statistical characteristics of a dataset." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-209", "text": "Second, as demonstrated by our experiments with TextRank and SingleRank, post-processing steps such as the way of forming keyphrases can have a large impact on the performance of a keyphrase extractor." }, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-210", "text": "Hence, it may be worthwhile to investigate alternative methods for extracting candidate keyphrases (e.g., Kumar and Srinathan (2008) , You et al. (2009) )." 
}, { "sent_id": "7d5c01ec5d744747413e42dcbc1a3c-C001-211", "text": "Finally, despite the large amount of recent work on unsupervised keyphrase extractor, our results indicated that Tf-Idf remains a strong baseline, offering very robust performance across different datasets." } ], "y": { "@BACK@": { "gold_contexts": [ [ "7d5c01ec5d744747413e42dcbc1a3c-C001-14" ], [ "7d5c01ec5d744747413e42dcbc1a3c-C001-24" ], [ "7d5c01ec5d744747413e42dcbc1a3c-C001-35" ], [ "7d5c01ec5d744747413e42dcbc1a3c-C001-56" ], [ "7d5c01ec5d744747413e42dcbc1a3c-C001-66" ], [ "7d5c01ec5d744747413e42dcbc1a3c-C001-80" ], [ "7d5c01ec5d744747413e42dcbc1a3c-C001-89" ] ], "cite_sentences": [ "7d5c01ec5d744747413e42dcbc1a3c-C001-14", "7d5c01ec5d744747413e42dcbc1a3c-C001-24", "7d5c01ec5d744747413e42dcbc1a3c-C001-35", "7d5c01ec5d744747413e42dcbc1a3c-C001-56", "7d5c01ec5d744747413e42dcbc1a3c-C001-66", "7d5c01ec5d744747413e42dcbc1a3c-C001-80", "7d5c01ec5d744747413e42dcbc1a3c-C001-89" ] }, "@MOT@": { "gold_contexts": [ [ "7d5c01ec5d744747413e42dcbc1a3c-C001-14", "7d5c01ec5d744747413e42dcbc1a3c-C001-15" ] ], "cite_sentences": [ "7d5c01ec5d744747413e42dcbc1a3c-C001-14" ] }, "@USE@": { "gold_contexts": [ [ "7d5c01ec5d744747413e42dcbc1a3c-C001-23", "7d5c01ec5d744747413e42dcbc1a3c-C001-24", "7d5c01ec5d744747413e42dcbc1a3c-C001-25" ], [ "7d5c01ec5d744747413e42dcbc1a3c-C001-124" ] ], "cite_sentences": [ "7d5c01ec5d744747413e42dcbc1a3c-C001-24", "7d5c01ec5d744747413e42dcbc1a3c-C001-124" ] }, "@DIF@": { "gold_contexts": [ [ "7d5c01ec5d744747413e42dcbc1a3c-C001-134", "7d5c01ec5d744747413e42dcbc1a3c-C001-135", "7d5c01ec5d744747413e42dcbc1a3c-C001-157", "7d5c01ec5d744747413e42dcbc1a3c-C001-159" ] ], "cite_sentences": [ "7d5c01ec5d744747413e42dcbc1a3c-C001-159" ] }, "@SIM@": { "gold_contexts": [ [ "7d5c01ec5d744747413e42dcbc1a3c-C001-187" ], [ "7d5c01ec5d744747413e42dcbc1a3c-C001-188" ] ], "cite_sentences": [ "7d5c01ec5d744747413e42dcbc1a3c-C001-187", "7d5c01ec5d744747413e42dcbc1a3c-C001-188" ] } } }, 
"ABC_40742bac72bbbaed4755ff0b74d599_12": { "x": [ { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-2", "text": "In this paper, we propose LexVec, a new method for generating distributed word representations that uses low-rank, weighted factorization of the Positive Point-wise Mutual Information matrix via stochastic gradient descent, employing a weighting scheme that assigns heavier penalties for errors on frequent co-occurrences while still accounting for negative co-occurrence." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-3", "text": "Evaluation on word similarity and analogy tasks shows that LexVec matches and often outperforms state-of-the-art methods on many of these tasks." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-4", "text": "----------------------------------" }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-5", "text": "**INTRODUCTION**" }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-6", "text": "Distributed word representations, or word embeddings, have been successfully used in many NLP applications (Turian et al., 2010; Collobert et al., 2011; ." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-7", "text": "Traditionally, word representations have been obtained using countbased methods (Baroni et al., 2014) , where the cooccurrence matrix is derived directly from corpus counts (Lin, 1998) or using association measures like Point-wise Mutual Information (PMI) (Church and Hanks, 1990) and Positive PMI (PPMI) (Bullinaria and Levy, 2007; ." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-8", "text": "Techniques for generating lower-rank representations have also been employed, such as PPMI-SVD (Levy et al., 2015) and GloVe (Pennington et al., 2014) , both achieving state-of-the-art performance on a variety of tasks." 
}, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-9", "text": "* This is a preprint of the paper that will be presented at the 54th Annual Meeting of the Association for Computational Linguistics." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-10", "text": "Alternatively, vector-space models can be generated with predictive methods, which generally outperform the count-based methods (Baroni et al., 2014) , the most notable of which is Skip-gram with Negative Sampling (SGNS, Mikolov et al. (2013b) ), which uses a neural network to generate embeddings." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-11", "text": "It implicitly factorizes a shifted PMI matrix, and its performance has been linked to the weighting of positive and negative co-occurrences ." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-12", "text": "In this paper, we present Lexical Vectors (LexVec), a method for factorizing PPMI matrices that combines characteristics of all these methods." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-96", "text": "Finally, its PPMI factorization seems to better capture semantics when compared to the shifted PMI factorization of SGNS." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-97", "text": "As a result, it outperforms PPMI-SVD and SGNS in a variety of word similarity and semantic analogy tasks, and generally outperforms GloVe on similarity tasks." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-98", "text": "Future work will examine the use of positional contexts for improving performance on syntactic analogy tasks." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-99", "text": "Moreover, we will explore further the hyper-parameter space to find globally optimal values for LexVec, and will experiment with the factorization of other matrices for developing alternative word representations." 
}, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-13", "text": "On the one hand, it uses SGNS window sampling, negative sampling, and stochastic gradient descent (SGD) to minimize a loss function that weights frequent co-occurrences heavily but also takes into account negative co-occurrence." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-14", "text": "However, since PPMI generally outperforms PMI on semantic similarity tasks (Bullinaria and Levy, 2007) , rather than implicitly factorize a shifted PMI matrix (like SGNS), LexVec explicitly factorizes the PPMI matrix." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-15", "text": "This paper is organized as follows: First, we describe PPMI-SVD, GloVe, and SGNS ( \u00a72) before introducing the proposed method, LexVec ( \u00a73), and evaluating it on word similarity and analogy tasks ( \u00a74)." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-16", "text": "We conclude with an analysis of results and discussion of future work." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-17", "text": "----------------------------------" }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-18", "text": "**RELATED WORK**" }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-19", "text": "----------------------------------" }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-20", "text": "**PPMI-SVD**" }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-21", "text": "Given a word w and a symmetric window of win context words to the left and win to the right, the co-occurrence matrix of elements M wc is defined as the number of times a target word w and the context word c co-occurred in the corpus within the window." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-22", "text": "The PMI matrix is defined as" }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-23", "text": "where '*' represents the summation of the corresponding index." 
}, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-24", "text": "As this matrix is unbounded in the inferior limit, in most applications it is replaced by its positive definite version, PPMI, where negative values are set to zero." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-25", "text": "The performance of the PPMI matrix on word similarity tasks can be further improved by using context-distribution smoothing (Levy et al., 2015) and subsampling the corpus (Mikolov et al., 2013b) ." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-26", "text": "As word embeddings with lower dimensionality may improve efficiency and generalization (Levy et al., 2015) , the improved PPMI * matrix can be factorized as a product of two lower rank matrices." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-27", "text": "where W w andW c are d-dimensional row vectors corresponding to vector embeddings for the target and context words." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-28", "text": "Using the truncated SVD of size d yields the factorization U \u03a3T \u22a4 with the lowest possible L 2 error (Eckert and Young, 1936) ." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-29", "text": "Levy et al. (2015) recommend using W = U \u03a3 p as the word representations, as suggested by Bullinaria and Levy (2012) , who borrowed the idea of weighting singular values from the work of Caron (2001) on Latent Semantic Analysis." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-30", "text": "Although the optimal value of p is highly task-dependent (\u00d6sterlund et al., 2015) , we set p = 0.5 as it has been shown to perform well on the word similarity and analogy tasks we use in our experiments (Levy et al., 2015) ." 
}, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-31", "text": "----------------------------------" }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-32", "text": "**GLOVE**" }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-33", "text": "GloVe (Pennington et al., 2014) factors the logarithm of the co-occurrence matrixM that considers the position of the context words in the window." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-34", "text": "The loss function for factorization is" }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-35", "text": "(3) where b w andb c are bias terms, and f is a weighting function defined as" }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-36", "text": "W andW are obtained by iterating over all non-zero (w, c) cells in the co-occurrence matrix and minimizing eq. (3) through SGD." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-37", "text": "The weighting function (in eq. (3)) penalizes more heavily reconstruction error of frequent cooccurrences, improving on PPMI-SVD's L 2 loss, which weights all reconstruction errors equally." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-38", "text": "However, as it does not penalize reconstruction errors for pairs with zero counts in the co-occurrence matrix, no effort is made to scatter the vectors for these pairs." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-39", "text": "----------------------------------" }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-40", "text": "**SKIP-GRAM WITH NEGATIVE SAMPLING (SGNS)**" }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-41", "text": "SGNS (Mikolov et al., 2013b ) trains a neural network to predict the probability of observing a context word c given a target word w, sliding a symmetric window over a subsampled training corpus with the window size being sampled uniformly from the range [1, win] ." 
}, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-42", "text": "Each observed (w, c) pair is combined with k randomly sampled noise pairs (w, w i ) and used to calculate the loss function" }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-43", "text": "where P n (w) is the distribution from which noise words w i are sampled." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-44", "text": "1 We refer to this routine which SGNS uses for selecting (w, c) pairs by sliding a context window over the corpus for loss calculation and SGD as window sampling." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-45", "text": "SGNS is implicitly performing the weighted factorization of a shifted PMI matrix ." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-46", "text": "Window sampling ensures the factorization weights frequent co-occurrences heavily, but also takes into account negative co-occurrences, thanks to negative sampling." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-47", "text": "LexVec is based on the idea of factorizing the PPMI matrix using a reconstruction loss function that does not weight all errors equally, unlike SVD, but instead penalizes errors of frequent cooccurrences more heavily, while still treating negative co-occurrences, unlike GloVe." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-48", "text": "Moreover, given that using PPMI results in better performance than PMI on semantic tasks, we propose keeping the SGNS weighting scheme by using window sampling and negative sampling, but explicitly factorizing the PPMI matrix rather than implicitly factorizing the shifted PMI matrix." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-49", "text": "The LexVec loss function has two terms" }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-50", "text": "We minimize eqs. 
(6) and (7)" }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-51", "text": "where #(w, c) is the number of times (w, c) is observed in the subsampled corpus." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-52", "text": "Stochastic (St): Every context window is extended with k negative samples w 1...k ." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-53", "text": "Iterative gradient descent of eq. (6) is then run on pairs (w, c j ), for j = 1, .., 2 * win and (w, c i ), j = 1, .., k for each window." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-54", "text": "The global loss for this approach is" }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-55", "text": "where #(w) is the number of times w is observed in the subsampled corpus." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-56", "text": "If a pair (w, c) co-occurs frequently, #(w, c) will weigh heavily in both eqs. (8) and (9), giving the desired weighting for frequent co-occurrences." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-57", "text": "The noise term, on the other hand, has corrections proportional to #(w) and #(w i ), for each pair (w, w i )." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-58", "text": "It produces corrections in pairs that due to frequency should be in the corpus but are not observed, therefore accounting automatically for negative cooccurrences." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-59", "text": "----------------------------------" }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-60", "text": "**MATERIALS**" }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-61", "text": "All models were trained on a dump of Wikipedia from June 2015, split into sentences, with punctuation removed, numbers converted to words, and lower-cased." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-62", "text": "Words with less than 100 counts were removed, resulting in a vocabulary of 302,203 words." 
}, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-63", "text": "All models generate embeddings of 300 dimensions." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-64", "text": "The PPMI* matrix used by both PPMI-SVD and LexVec was constructed using smoothing of \u03b1 = 3/4 suggested in (Levy et al., 2015) and an unweighted window of size 2." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-65", "text": "A dirty subsampling of the corpus is adopted for PPMI* and SGNS with threshold of t = 10 \u22125 (Mikolov et al., 2013b) ." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-66", "text": "2 Additionally, SGNS uses 5 negative samples (Mikolov et al., 2013b) , a window of size 10 ( Levy et al., 2015) , for 5 iterations with initial learning rate set to the default 0.025." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-67", "text": "GloVe is run with a window of size 10, x max = 100, \u03b2 = 3/4, for 50 iterations and initial learning rate of 0.05 (Pennington et al., 2014) ." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-68", "text": "In LexVec two window sampling alternatives are compared: W S P P M I , which keeps the same fixed size win = 2 as used to create the P P M I * matrix; or W S SGN S , which adopts identical SGNS settings (win = 10 with size randomization)." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-69", "text": "We run LexVec for 5 iterations over the training corpus." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-70", "text": "All methods generate both word and context matrices (W andW ): W is used for SGNS, PPMI-SVD and W +W for GloVe (following Levy et al. (2015) , and W and W +W for LexVec." 
}, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-71", "text": "For evaluation, we use standard word similarity and analogy tasks (Mikolov et al., 2013b; Pennington et al., 2014; Levy et al., 2015 factorization of logM ) and Skip-gram (implicit factorization of the shifted PMI matrix), and compare the stochastic and mini-batch approaches." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-72", "text": "Word similarity tasks are: 3 WS-353 Similarity (WSim) and Relatedness (WRel) (Finkelstein et al., 2001) , MEN (Bruni et al., 2012 ), MTurk (Radinsky et al., 2011 , RW (Luong et al., 2013) , SimLex-999 (Hill et al., 2015) , MC (Miller and Charles, 1991) , RG (Rubenstein and Goodenough, 1965) , and SCWS (Huang et al., 2012) , calculated using cosine." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-73", "text": "Word analogy tasks are: Google semantic (GSem) and syntactic (GSyn) (Mikolov et al., 2013a) and MSR syntactic analogy dataset (Mikolov et al., 2013c) , using 3CosAdd and 3CosM ul ." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-74", "text": "----------------------------------" }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-75", "text": "**RESULTS**" }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-76", "text": "Results for word similarity and for the analogy tasks are in tables 1 and 2, respectively." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-77", "text": "Compared with PPMI-SVD, LexVec performs better in all tasks." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-78", "text": "As they factorize the same P P M I * matrix, it is the 3 http://www.cs.cmu.edu/ mfaruqui/suite.html loss weighting from window sampling that is an improvement over L 2 loss." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-79", "text": "As expected, due to PPMI, LexVec performs better than SGNS in several word similarity tasks, but in addition it also does so on the semantic analogy task, nearly approaching GloVe." 
}, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-80", "text": "LexVec generally outperforms GloVe on word similarity tasks, possibly due to the factorization of the PPMI matrix and to window sampling's weighting of negative co-occurrences." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-81", "text": "We believe LexVec fares well on semantic analogies because its vector-space does a good job of preserving semantics, as evidenced by its performance on word similarity tasks." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-82", "text": "We believe the poor syntactic performance is a result of the PPMI measure." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-83", "text": "PPMI-SVD also struggled with syntactic analogies more than any other task." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-84", "text": "Levy et al. (2015) obtained similar results, and suggest that using positional contexts as done by might help in recovering syntactic analogies." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-85", "text": "In terms of configurations, WS SGN S performed marginally better than WS P P M I ." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-86", "text": "We hypothesize it is simply because of the additional computation." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-87", "text": "While W and (W +W ) are roughly equivalent on word similarity tasks, W is better for analogies." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-88", "text": "This is inline with results for PPMI-SVD and SGNS models (Levy et al., 2015) ." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-89", "text": "Both mini-batch and stochastic approaches result in similar scores for all tasks." 
}, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-90", "text": "For the same parameter k of negative samples, the mini-batch approach uses 2 * win W S P P M I times more negative samples than stochastic when using W S P P M I , and win W S SGNS times more samples when using W S SGN S ." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-91", "text": "Therefore, the stochastic approach is more computationally efficient while delivering similar performance." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-92", "text": "----------------------------------" }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-93", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-94", "text": "In this paper, we introduced LexVec, a method for low-rank, weighted factorization of the PPMI matrix that generates distributed word representations, favoring low reconstruction error on frequent co-occurrences, whilst accounting for negative cooccurrences as well." }, { "sent_id": "40742bac72bbbaed4755ff0b74d599-C001-95", "text": "This is in contrast with PPMI-SVD, which does no weighting, and GloVe, which only considers positive co-occurrences." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "40742bac72bbbaed4755ff0b74d599-C001-8" ], [ "40742bac72bbbaed4755ff0b74d599-C001-25" ], [ "40742bac72bbbaed4755ff0b74d599-C001-26" ] ], "cite_sentences": [ "40742bac72bbbaed4755ff0b74d599-C001-8", "40742bac72bbbaed4755ff0b74d599-C001-25", "40742bac72bbbaed4755ff0b74d599-C001-26" ] }, "@USE@": { "gold_contexts": [ [ "40742bac72bbbaed4755ff0b74d599-C001-30" ], [ "40742bac72bbbaed4755ff0b74d599-C001-64" ], [ "40742bac72bbbaed4755ff0b74d599-C001-66" ], [ "40742bac72bbbaed4755ff0b74d599-C001-70" ], [ "40742bac72bbbaed4755ff0b74d599-C001-71" ] ], "cite_sentences": [ "40742bac72bbbaed4755ff0b74d599-C001-30", "40742bac72bbbaed4755ff0b74d599-C001-64", "40742bac72bbbaed4755ff0b74d599-C001-66", "40742bac72bbbaed4755ff0b74d599-C001-70", "40742bac72bbbaed4755ff0b74d599-C001-71" ] }, "@SIM@": { "gold_contexts": [ [ "40742bac72bbbaed4755ff0b74d599-C001-88" ], [ "40742bac72bbbaed4755ff0b74d599-C001-84" ] ], "cite_sentences": [ "40742bac72bbbaed4755ff0b74d599-C001-88" ] } } }, "ABC_0732eaa37366d7ae092f4de0ed72cb_12": { "x": [ { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-90", "text": "We confirmed that all our results contained no malformed parse trees." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-2", "text": "This paper investigates the construction of a strong baseline based on general purpose sequence-to-sequence models for constituency parsing." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-3", "text": "We incorporate several techniques that were mainly developed in natural language generation tasks, e.g., machine translation and summarization, and demonstrate that the sequenceto-sequence model achieves the current top-notch parsers' performance without requiring explicit task-specific knowledge or architecture of constituent parsing." 
}, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-4", "text": "----------------------------------" }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-5", "text": "**INTRODUCTION**" }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-6", "text": "Sequence-to-sequence (Seq2seq) models have successfully improved many well-studied NLP tasks, especially for natural language generation (NLG) tasks, such as machine translation (MT) (Sutskever et al., 2014; Cho et al., 2014) and abstractive summarization (Rush et al., 2015) ." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-7", "text": "Seq2seq models have also been applied to constituency parsing (Vinyals et al., 2015) and provided a fairly good result." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-8", "text": "However one obvious, intuitive drawback of Seq2seq models when they are applied to constituency parsing is that they have no explicit architecture to model latent nested relationships among the words and phrases in constituency parse trees, Thus, models that directly model them, such as RNNG (Dyer et al., 2016) , are an intuitively more promising approach." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-9", "text": "In fact, RNNG and its extensions (Kuncoro et al., 2017; Fried et al., 2017) provide the current stateof-the-art performance." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-10", "text": "Sec2seq models are currently considered a simple baseline of neuralbased constituency parsing." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-11", "text": "After the first proposal of an Seq2seq constituency parser, many task-independent techniques have been developed, mainly in the NLG research area." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-12", "text": "Our aim is to update the Seq2seq approach proposed in Vinyals et al. (2015) as a stronger baseline of constituency parsing." 
}, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-13", "text": "Our motivation is basically identical to that described in Denkowski and Neubig (2017) ." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-14", "text": "A strong baseline is crucial for reporting reliable experimental results." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-15", "text": "It offers a fair evaluation of promising new techniques if they solve new issues or simply resolve issues that have already been addressed by current generic technology." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-16", "text": "More specifically, it might become possible to analyze what types of implicit linguistic structures are easier or harder to capture for neural models by comparing the outputs of strong Seq2seq models and task-specific models, e.g., RNNG." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-17", "text": "The contributions of this paper are summarized as follows: (1) a strong baseline for constituency parsing based on general purpose Seq2seq models 1 , (2) an empirical investigation of several generic techniques that can (or cannot) contribute to improve the parser performance, (3) empirical evidence that Seq2seq models implicitly learn parse tree structures well without knowing taskspecific and explicit tree structure information." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-18", "text": "----------------------------------" }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-19", "text": "**CONSTITUENCY PARSING BY SEQ2SEQ**" }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-20", "text": "Our starting point is an RNN-based Seq2seq model with an attention mechanism that was applied to constituency parsing (Vinyals et al., 2015) ." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-21", "text": "We omit detailed descriptions due to space limitations, but note that our model architecture is identical to the one introduced in Luong et al. (2015a) 2 ." 
}, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-22", "text": "A key trick for applying Seq2seq models to constituency parsing is the linearization of parse 1 Our code and experimental configurations for reproducing our experiments are publicly available: https://github.com/nttcslab-nlp/strong s2s baseline parser 2 More specifically, our Seq2seq model follows the one implemented in seq2seq-attn (https://github.com/harvardnlp/seq2seq-attn), which is the alpha-version of the OpenNMT tool (http://opennmt.net)." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-23", "text": "----------------------------------" }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-24", "text": "**ORIGINAL INPUT**" }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-25", "text": "John has a dog ." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-26", "text": "Output: S-exp." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-27", "text": "(S (NP NNP ) (VP VBZ (NP DT NN ) ) . ) Linearized form (S (NP NNP )NP (VP VBZ (NP DT NN )NP )VP . )S w/ POS normalized (S (NP XX )NP (VP XX (NP XX XX )NP )VP . )S Table 1 : Examples of linearization and POS-tag normalization (Vinyals et al., 2015) trees (Vinyals et al., 2015) ." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-28", "text": "Roughly speaking, a linearized parse tree consists of open, close bracketing and POS-tags that correspond to a given input raw sentence." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-29", "text": "Since a one-to-one mapping exists between a parse tree and its linearized form (if the linearized form is a valid tree), we can recover parse trees from the predicted linearized parse tree." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-30", "text": "Vinyals et al. (2015) also introduced the part-of-speech (POS) tag normalization technique." 
}, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-31", "text": "They substituted each POS tag in a linearized parse tree with a single XX-tag 3 , which allows Seq2seq models to achieve a more competitive performance range than the current state-of-the-art parsers 4 ." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-32", "text": "Table 1 shows an example of a parse tree to which linearization and POS-tag normalization were applied." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-33", "text": "----------------------------------" }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-34", "text": "**TASK-INDEPENDENT EXTENSIONS**" }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-35", "text": "This section describes several generic techniques that improve Seq2seq performance 5 ." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-36", "text": "Table 2 lists the notations used in this paper for convenient reference." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-37", "text": "----------------------------------" }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-38", "text": "**SUBWORD AS INPUT FEATURES**" }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-39", "text": "Applying subword decomposition has recently become a leading technique in the NMT literature (Sennrich et al., 2016; Wu et al., 2016) ." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-40", "text": "Its primary advantage is a significant reduction of the serious out-of-vocabulary (OOV) problem." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-41", "text": "We incorporated subword information as an additional feature of the original input words." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-42", "text": "A similar usage of subword features was previously proposed in Bojanowski et al. (2017) ." 
}, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-43", "text": "Formally, the encoder embedding vector at encoder position i, namely, e i , is calculated as follows:" }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-44", "text": "3 We did not substitute POS-tags for punctuation symbols such as \".\" and \",\"." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-89", "text": "The test data was evaluated only on the baseline and our best settings ((a), (e), (f) and (j)) to prevent over-tuning to the test data." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-45", "text": "4 Several recently developed neural-based constituency parsers ignore POS tags since they are not evaluated in the standard evaluation metric of constituency parsing (Bracketing F-measure)." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-46", "text": "5 A figure in the supplementary material shows a brief sketch of the methods explained in the following sections." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-47", "text": "zj : final hidden vector calculated at decoder position j; oj : final decoder output scores at decoder position j; qj : output scores of the auxiliary task at decoder position j; b : additional bias term in the decoder output layer for the mask; pj : vector format of the output probability at decoder position j; A : number of models for ensembling; C : number of candidates generated for LM-reranking. Table 2 : List of notations used in this paper." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-48", "text": "where k = \u03c6(w i )." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-49", "text": "Note that the second term of the RHS indicates our additional subword features, and the first represents the standard word embedding extraction procedure." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-50", "text": "Among several choices, we used the byte-pair encoding (BPE) approach proposed in Sennrich et al. (2016) applying 1,000 merge operations 6 ." 
}, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-51", "text": "----------------------------------" }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-52", "text": "**UNKNOWN TOKEN EMBEDDING AS A BIAS**" }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-53", "text": "We generally replace rare words, e.g., those appearing less than five times in the training data, with unknown tokens in the Seq2seq approach." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-54", "text": "However, we suspect that embedding vectors, which correspond to unknown tokens, cannot be trained well for the following reasons: (1) the occurrence of unknown tokens remains relatively small in the training data since they are obvious replacements for rare words, and (2) Seq2seq is relatively ineffective for training infrequent words (Luong et al., 2015b) ." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-55", "text": "Based on these observations, we utilize the unknown-token embedding as a bias term b of the linear layer (W x + b) when obtaining every encoder embedding, to overcome the infrequent word problem." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-56", "text": "Then, we modify Eq. 2 as follows:" }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-57", "text": "Note that if w i is an unknown token, then Eq. 2 becomes e i = 2u + \u03a3 k \u2208\u03c8(w i ) (F s k + u)." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-58", "text": "----------------------------------" }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-59", "text": "**MULTI-TASK LEARNING**" }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-60", "text": "Several papers on the Seq2seq approach (Luong et al., 2016) have reported that the multi-task learning extension often improves the task performance if we can find effective auxiliary tasks related to the target task." 
}, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-61", "text": "From this general knowledge, we re-consider jointly estimating POS-tags by incorporating the linearized forms without the POS-tag normalization as an auxiliary task." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-62", "text": "In detail, the linearized forms with and without the POS-tag normalization are independently and simultaneously estimated as o j and q j , respectively, in the decoder output layer by the following equation:" }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-63", "text": "**OUTPUT LENGTH CONTROLLING**" }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-64", "text": "As described in Vinyals et al. (2015) , not all the outputs (predicted linearized parse trees) obtained from the Seq2seq parser are valid (well-formed) as a parse tree." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-65", "text": "To guarantee that every output is a valid tree, we introduce a simple extension of the method for controlling the Seq2seq output length (Kikuchi et al., 2016) ." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-66", "text": "First, we introduce an additional bias term b in the decoder output layer to prevent the selection of certain output words:" }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-67", "text": "If we set a large negative value at the m-th element in b, namely b m \u2248 \u2212\u221e, then the m-th element in p j becomes approximately 0, namely p j,m \u2248 0, regardless of the value of the m-th element in o j ." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-68", "text": "We refer to this operation of setting the value \u2212\u221e in b as a mask." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-69", "text": "Since this naive masking approach is harmless to GPU-friendly processing, we can still exploit GPU parallelization." 
}, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-70", "text": "We set b to always mask the EOS-tag and change b when at least one of the following conditions is satisfied: (1) if the number of open and closed brackets generated so far is the same, then we mask the XX-tags (or the POS-tags) and all the closed brackets." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-71", "text": "(2) if the number of predicted XX-tags (or POS-tags) is equivalent to that of the words in a given input sentence, then we mask the XX-tags (or all the POS-tags) and all the open brackets." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-72", "text": "If both conditions (1) and (2) are satisfied, then the decoding process is finished." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-73", "text": "The additional cost for controlling the mask is to count the number of XX-tags and the open and closed brackets generated so far in the decoding process." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-74", "text": "Pre-trained word embeddings obtained from large external corpora often boost the final task performance even if they only initialize the input embedding layer." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-75", "text": "In constituency parsing, several systems also incorporate pre-trained word embeddings, such as Vinyals et al. (2015) ; Durrett and Klein (2015) ." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-76", "text": "To maintain as much reproducibility of our experiments as possible, we simply applied publicly available pre-trained word embeddings, i.e., glove.840B.300d 7 , as initial values of the encoder embedding layer." 
}, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-77", "text": "----------------------------------" }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-78", "text": "**MODEL ENSEMBLE**" }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-79", "text": "Ensembling several independently trained models together significantly improves many NLP tasks." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-80", "text": "In the ensembling process, we predict the output tokens using the arithmetic mean of predicted probabilities computed by each model:" }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-81", "text": "where p (a) j represents the probability distribution at position j predicted by the a-th model." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-82", "text": "----------------------------------" }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-83", "text": "**LANGUAGE MODEL (LM) RERANKING**" }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-84", "text": "Choe and Charniak (2016) demonstrated that reranking the predicted parser output candidates with an RNN language model (LM) significantly improves performance." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-85", "text": "We refer to this reranking process as LM-rerank." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-86", "text": "Following their success, we also trained RNN-LMs on the PTB dataset with their published preprocessing code 8 to reproduce the experiments in Choe and Charniak (2016) for our LM-rerank." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-87", "text": "We selected the current stateof-the-art LM (Yang et al., 2018) 9 as our LMreranker, which is a much stronger LM than was used in Choe and Charniak (2016)." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-88", "text": "Table 4 : Results on English PTB data: Results were average (ave), worst (min), and best (max) performance of ten models independently trained with distinct random initial values." 
}, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-91", "text": "Table 6 : Impact of hyper-parameter selections." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-92", "text": "We only evaluated the development data (PTB Sec. 22) to prevent over-tuning to the test data." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-93", "text": "----------------------------------" }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-94", "text": "**EXPERIMENTS**" }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-95", "text": "Our experiments used the English Penn Treebank data (Marcus et al., 1994) , which are the most widely used benchmark data in the literature." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-96", "text": "We used the standard split of training (Sec.02-21), development (Sec.22), and test data (Sec.23) and strictly followed the instructions for the evaluation settings explained in Vinyals et al. (2015) ." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-97", "text": "For data pre-processing, all the parse trees were transformed into linearized forms, which include standard UNK replacement for OOV words and POS-tag normalization by XX-tags." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-98", "text": "As explained in Vinyals et al. (2015) , we did not apply any parse tree binarization or special unary treatment, which were used as common techniques in the literature." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-99", "text": "Table 7 : List of bracketing F-measures on test data (PTB Sec.23) reported in recent top-notch systems: scores with bold font represent our scores." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-100", "text": "ments unless otherwise specified." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-101", "text": "The baseline Seq2seq models, (a) and (f), produced the malformed parse trees." 
}, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-102", "text": "We postprocessed such malformed parse trees by simple rules introduced in (Vinyals et al., 2015) ." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-103", "text": "On the other hand, we confirmed that all the results applying the technique explained in Sec. 3.4 produced no malformed parse trees." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-104", "text": "Ensembling and Reranking: Table 5 shows the results of our models with model ensembling and LM-reranking." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-105", "text": "For ensemble, we randomly selected eight of the ten Seq2seq models reported in Table 4 ." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-106", "text": "For LM-reranking, we first generated 80 candidates by the above eight ensemble models and selected the best parse tree for each input in terms of the LM-reranker." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-107", "text": "The results in Table 5 were taken from a single-shot evaluation, unlike the averages of ten independent runs in Table 4 ." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-108", "text": "Hyper-parameter selection: We empirically investigated the impact of the hyper-parameter selections." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-109", "text": "Table 6 shows the results." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-110", "text": "The following observations appear informative for building strong baseline systems: (1) Smaller mini-batch size M and gradient clipping G provided the better performance." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-111", "text": "Such settings lead to slower and longer training, but higher performance." 
}, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-112", "text": "(2) Larger layer size, hidden state dimension, and beam size have little impact on the performance; our setting, L = 2, H = 200, and B = 5, looks adequate in terms of the speed/performance trade-off." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-113", "text": "Input unit selection: As often demonstrated in the NMT literature, using subword splits as the input token unit instead of the standard tokenized word unit has the potential to improve the performance." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-114", "text": "Table 6 (e) shows the results of utilizing subword splits." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-115", "text": "Clearly, 8K and 16K subword splits as input token units significantly degraded the performance." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-116", "text": "It seems that the number of XX-tags in the output and the number of tokens in the input should remain consistent for better performance, since Seq2seq models appear to somehow learn such a relationship and use it during decoding." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-117", "text": "Thus, using subword information as features is one promising approach for leveraging subword information in constituency parsing." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-118", "text": "Table 7 lists the reported constituency parsing scores on PTB that were recently published in the literature." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-119", "text": "We split the results into three categories." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-120", "text": "The first category (top row) contains the results of the methods that were trained only from the pre-defined training data (PTB Sec.02-21), without any additional resources." 
}, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-121", "text": "The second category (middle row) consists of the results of methods that were trained from the pre-defined PTB training data as well as those listed in the top row, but incorporating word embeddings obtained from a large-scale external corpus to initialize the encoder embedding layer." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-122", "text": "The third category (bottom row) shows the performance of the methods that were trained using high-confidence, auto-parsed trees in addition to the pre-defined PTB training data." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-123", "text": "----------------------------------" }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-124", "text": "**RESULTS**" }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-125", "text": "----------------------------------" }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-126", "text": "**COMPARISON TO CURRENT TOP SYSTEMS**" }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-127", "text": "Our Seq2seq approach successfully achieved a level competitive with the current top-notch methods: RNNG and its variants." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-128", "text": "Note here that, as described in Dyer et al. (2016) , RNNG uses the Berkeley parser's mapping rules for effectively handling singleton words in the training corpus." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-129", "text": "In contrast, we demonstrated that Seq2seq models have enough power to achieve competitive state-of-the-art performance without leveraging such task-dependent knowledge." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-130", "text": "Moreover, they need no explicit information about parse tree structures, transition states, stacks, (Stanford or Berkeley) mapping rules, or external silver training data during the model training except general purpose word embeddings as initial values." 
}, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-131", "text": "These observations from our experiments imply that recently developed Seq2seq models have enough ability to implicitly learn parsing structures from linearized parse trees." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-132", "text": "Our results argue that Seq2seq models can be a strong baseline for constituency parsing." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-133", "text": "----------------------------------" }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-134", "text": "**CONCLUSION**" }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-135", "text": "This paper investigated how well general purpose Seq2seq models can perform on constituency parsing as a strong baseline method." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-136", "text": "We incorporated several generic techniques to enhance Seq2seq models, such as subword features and output length controlling." }, { "sent_id": "0732eaa37366d7ae092f4de0ed72cb-C001-137", "text": "We experimentally demonstrated that by applying ensemble and LM-reranking techniques, a general purpose Seq2seq model achieved almost the same performance level as the state-of-the-art constituency parser without any task-specific or explicit tree structure information." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "0732eaa37366d7ae092f4de0ed72cb-C001-7" ], [ "0732eaa37366d7ae092f4de0ed72cb-C001-64" ], [ "0732eaa37366d7ae092f4de0ed72cb-C001-75" ] ], "cite_sentences": [ "0732eaa37366d7ae092f4de0ed72cb-C001-7", "0732eaa37366d7ae092f4de0ed72cb-C001-64", "0732eaa37366d7ae092f4de0ed72cb-C001-75" ] }, "@USE@": { "gold_contexts": [ [ "0732eaa37366d7ae092f4de0ed72cb-C001-12" ], [ "0732eaa37366d7ae092f4de0ed72cb-C001-20" ], [ "0732eaa37366d7ae092f4de0ed72cb-C001-96" ], [ "0732eaa37366d7ae092f4de0ed72cb-C001-98" ], [ "0732eaa37366d7ae092f4de0ed72cb-C001-102" ] ], "cite_sentences": [ "0732eaa37366d7ae092f4de0ed72cb-C001-12", "0732eaa37366d7ae092f4de0ed72cb-C001-20", "0732eaa37366d7ae092f4de0ed72cb-C001-96", "0732eaa37366d7ae092f4de0ed72cb-C001-98", "0732eaa37366d7ae092f4de0ed72cb-C001-102" ] }, "@MOT@": { "gold_contexts": [ [ "0732eaa37366d7ae092f4de0ed72cb-C001-64" ], [ "0732eaa37366d7ae092f4de0ed72cb-C001-74", "0732eaa37366d7ae092f4de0ed72cb-C001-75", "0732eaa37366d7ae092f4de0ed72cb-C001-76" ] ], "cite_sentences": [ "0732eaa37366d7ae092f4de0ed72cb-C001-64", "0732eaa37366d7ae092f4de0ed72cb-C001-75" ] } } }, "ABC_4d2488844c1f6f39f1f4b8f3487288_13": { "x": [ { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-2", "text": "In this paper, we aim to close the gap from extensive, human-built semantic resources and corpus-driven unsupervised models." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-3", "text": "The particular resource explored here is VerbNet, whose organizing principle is that semantics and syntax are linked." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-4", "text": "To capture patterns of usage that can augment knowledge resources like VerbNet, we expand a Dirichlet process mixture model to predict a VerbNet class for each sense of each verb, allowing us to incorporate annotated VerbNet data to guide the clustering process." 
}, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-5", "text": "The resulting clusters align more closely with hand-curated syntactic/semantic groupings than any previous models, and can be adapted to new domains since they require only corpus counts." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-6", "text": "----------------------------------" }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-8", "text": "In this paper, we aim to close the gap between extensive, human-built semantic resources and corpus-driven unsupervised models." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-9", "text": "The work done by linguists over years of effort has been validated by the scientific community, and promises real traction on the fuzzy problem of deriving meaning from words." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-10", "text": "However, lack of coverage and adaptability currently limits the usefulness of this work." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-11", "text": "The particular resource explored here is VerbNet (Kipper-Schuler, 2005) , a semantic resource built upon the foundation of verb classes by Levin (1993) ." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-12", "text": "Levin's verb classes are built on the hypothesis that syntax and semantics are fundamentally linked." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-13", "text": "The semantics of a verb affect the allowable syntactic constructions involving that verb, creating regularities in language to which speakers are extremely sensitive." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-14", "text": "It follows that grouping verbs by allowable syntactic realizations leads from syntax to meaningful semantic groupings." 
}, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-15", "text": "This seed grew into VerbNet, a process which involved dozens of linguists and a decade of work, making careful decisions about the allowable syntactic frames for various verb senses, informed by text examples." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-16", "text": "VerbNet is useful for semantic role labeling and related tasks (Giuglea and Moschitti, 2006; Yi, 2007; Merlo and van der Plas, 2009; Kshirsagar et al., 2014) , but its widespread use is limited by coverage." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-17", "text": "Not all verbs have a VerbNet class, and some polysemous verbs have important senses unaccounted for." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-18", "text": "In addition, VerbNet is not easily adaptable to domainspecific corpora, so these omissions may be more prominent outside of the general-purpose corpora and linguistic intuition used in its construction." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-19", "text": "Its great strength is also its downfall: adding new verbs, new senses, and new classes requires trained linguists -at least, to preserve the integrity of the resource." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-20", "text": "According to Levin's hypothesis, knowing the set of allowable syntactic patterns for a verb sense is sufficient to make meaningful semantic classifications." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-21", "text": "Large-scale corpora provide an extremely comprehensive picture of the possible syntactic realizations for any particular verb." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-22", "text": "With enough data in the training set, even infrequent verbs have sufficient data to support learning." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-23", "text": "Kawahara et al. 
(2014) showed that, using a Dirichlet Process Mixture Model (DPMM), a VerbNet-like clustering of verb senses can be built from counts of syntactic features." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-24", "text": "We develop a model to extend VerbNet, using a large corpus with machine-annotated dependencies." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-25", "text": "We build on prior work by adding partial supervision from VerbNet, treating VerbNet classes as additional latent variables." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-26", "text": "The resulting clusters are more similar to the evaluation set, and each cluster in the DPMM predicts its VerbNet class distribution naturally." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-27", "text": "Because the technique is data-driven, it is easily adaptable to domainspecific corpora." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-28", "text": "The DPMM used in Kawahara et al. (2014) for clustering verb senses." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-29", "text": "M is the number of verb senses, and N is the sum total of slot counts for that verb sense." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-30", "text": "----------------------------------" }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-31", "text": "**PRIOR WORK**" }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-32", "text": "Parisien and Stevenson (2011) and Kawahara et al. (2014) showed distinct ways of applying the Hierarchical Dirichlet Process (Teh et al., 2006) to uncover the latent clusters from cluster examples." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-33", "text": "The latter used significantly larger corpora, and explicitly separated verb sense induction from the syntactic/semantic clustering, which allowed more fine-grained control of each step." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-34", "text": "In Kawahara et al. 
(2014) , two identical DPMMs were used." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-35", "text": "The first clustered verb instances into senses, and one such model was trained for each verb." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-36", "text": "These verb-sense clusters are available publicly, and are used unmodified in this paper." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-37", "text": "The second DPMM clusters verb senses into VerbNet-like clusters of verbs." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-38", "text": "The result is a resource that, like VerbNet, captures the inherent polysemy of verbs." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-39", "text": "We focus our improvements on this second step, and try to derive verb clusters that more closely align with VerbNet." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-40", "text": "----------------------------------" }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-41", "text": "**DIRICHLET PROCESS MIXTURE MODELS**" }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-42", "text": "The DPMM used in Kawahara et al. (2014) is shown in Figure 1 ." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-43", "text": "The clusters are drawn from a Dirichlet Process with hyperparameter \u03b1 and base distribution G. The Dirichlet process prior creates a clustering effect described by the Chinese Restaurant Process." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-44", "text": "Each cluster is chosen proportionally to the number of elements it already contains, i.e." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-45", "text": "where C k ( * ) is the count of clustered items already in cluster k. Each cluster k has an associated multinomial distribution over vocabulary items (e.g. slot:token pairs), \u03c6 k , which is drawn from G, a Dirichlet distribution of the same size as the vocabulary, parameterized by a constant \u03b2." 
}, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-46", "text": "Because the Dirichlet is the multinomial's conjugate prior, we can integrate out \u03c6_k analytically, given counts of vocabulary items drawn from \u03c6_k." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-47", "text": "For a particular vocabulary item w, we compute" }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-48", "text": "where C_k(w) is the number of times w has been drawn from \u03c6_k" }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-49", "text": ", and |V| is the size of the vocabulary." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-50", "text": "When assigning a verb instance to a sense, a single instance may have multiple syntactic arguments w. Using Bayes's law, we update each assignment iteratively via Gibbs sampling, using equations (1) and (2), according to" }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-51", "text": "Setting \u03b2 < 1 encourages the clusters to have a sparse representation in the vocabulary space." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-52", "text": "\u03b1 = 1 is a typical choice, and encourages a small number of clusters to be used." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-53", "text": "----------------------------------" }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-54", "text": "**STEP-WISE VERB CLUSTER CREATION**" }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-55", "text": "By separating the verb sense induction and the clustering of verb senses, the features can be optimized for the distinct tasks." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-56", "text": "According to (Kawahara et al., 2014) , the best features for inducing verb senses are joint slot:token pairs." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-57", "text": "For the verb clustering task, slot features which ignore the lexical items were the most effective."
}, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-58", "text": "This aligns with Levin's hypothesis of diathesis alternations: the syntactic contexts are sufficient for the clustering." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-59", "text": "In this paper, we re-create the second-stage clustering with the same features, but add supervision." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-60", "text": "Supervised Topic Modeling (Mimno and McCallum, 2008; Ramage et al., 2009) builds on the Bayesian framework by adding, for each item, a prediction about a variable of interest, which is observed at least some of the time." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-61", "text": "This encourages the topics to be useful at predicting a supervised signal, as well as coherent as topics." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-62", "text": "We do not have explicit knowledge of the VerbNet class for any of the first-level DPMM's verb senses, so our supervision is informed only at the level of the verb." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-63", "text": "----------------------------------" }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-64", "text": "**SUPERVISED DPMM**" }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-65", "text": "Adding supervision to the DPMM is fairly straightforward: at each step, we sample both a mixture component k and a VerbNet class y. For this, we assign each cluster (mixture component) a unique distribution \u03c1 over VerbNet classes, drawn from a fixed-size Dirichlet prior with parameter \u03b3." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-66", "text": "As before, this allows us to estimate the likelihood of a VerbNet class y knowing only the counts of assigned senses, C_k(y), for each y, as" }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-67", "text": "where |S| is the number of classes in the supervision."
}, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-68", "text": "The likelihood of choosing a class for a particular verb requires us to form an estimate of that verb's probability of joining a particular VerbNet class." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-69", "text": "We initialize \u03b7 from SemLink, as \u03b7(y) = \u03c9 * C_v^SL(y) + \u03b4, for fixed constants \u03c9 and \u03b4, and with C_v^SL(y) as the count, in SemLink, of times verb v was assigned to VerbNet class y. We then draw a verb-specific distribution \u03b8 over VerbNet classes, from a Dirichlet with parameters \u03b7, so that \u03b7 acts as pseudo-counts, steering \u03b8 to give high weight to VerbNet classes aligned with SemLink for each verb." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-70", "text": "We compute" }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-71", "text": "where C_v(y) is the number of times verb v is assigned to VerbNet class y by our model." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-72", "text": "We sample the VerbNet class for a verb sense as a product of experts (Hinton, 2002): \u03b8_v for the verb v, and \u03c1_k for the assigned cluster k. This encourages alignment between the VerbNet classes observed in SemLink and the VerbNet classes predicted by the clusters, and is computationally straightforward." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-73", "text": "We simply compute" }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-74", "text": "Sampling a cluster for a verb sense now depends on the VerbNet class y," }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-75", "text": "We then update y based on Equation 6 and resample for the next batch." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-76", "text": "The supervised process is depicted in Figure 2."
}, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-77", "text": "In brief, for each verb we have an \u03b7 given by counts from SemLink, which we use as a prior for \u03b8." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-78", "text": "We sample, in addition to the cluster label k, a VerbNet class y, which depends on \u03b8 and \u03c1, where \u03c1 is the distribution over VerbNet classes in cluster k. \u03c1 is drawn from a Dirichlet distribution parameterized by \u03b3 < 1, encouraging each cluster to have a sparse distribution over VerbNet classes." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-79", "text": "Because y depends on both \u03b8 and \u03c1, the clusters are encouraged to align with VerbNet classes." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-80", "text": "Figure 2: The Supervised DPMM used in this work for clustering verb senses." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-81", "text": "M is the number of verb senses, and N is the sum total of slot counts for that verb sense." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-82", "text": "\u03b8 is initialized to reflect the VerbNet class preferences for each verb, when they are known." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-83", "text": "----------------------------------" }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-84", "text": "**MODELING CHOICES**" }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-85", "text": "When incorporating supervision, the more direct method of downstream sampling of the VerbNet class may be preferred to using a prior." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-86", "text": "However, the verb senses are generated through a DPMM, and we do not have a gold-label assignment of VerbNet classes to each sense."
}, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-87", "text": "Instead, we estimate, for each verb in VerbNet, a distribution \u03b8 describing the likelihood that the verb will participate in a particular class, using counts from SemLink." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-88", "text": "When sampling a cluster for a verb sense whose verb is in VerbNet, we sample y from a product of experts." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-89", "text": "We cannot incorporate \u03b8 as a prior when sampling y, because we have multiple verbs, with distinct distributions \u03b8_v1, \u03b8_v2, ..." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-90", "text": "Because the product of experts is a discrete probability distribution, it is easy to marginalize out this variable when sampling k, using" }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-91", "text": "Either way, once a cluster is selected, we should update \u03c1 and \u03b8." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-92", "text": "So, once a cluster is selected, we still sample a discrete y. We compare performance for sampling k with assigned y and with marginalized y." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-93", "text": "When incorporating supervision, we flatten VerbNet, using only the top-level categories, simplifying the selection process for y. In Kawahara et al. (2014) , slot features were the most effective features at producing a VerbNet-like structure; we follow suit." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-94", "text": "----------------------------------" }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-95", "text": "**RESULTS**" }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-96", "text": "For evaluation, we use the same dataset and metrics as Kawahara et al. (2014) ." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-97", "text": "There, the authors use the polysemous verb classes of Korhonen et al.
(2003) , a subset of frequent polysemous verbs." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-98", "text": "This makes the test set a sort of mini-VerbNet, suitable for evaluation." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-99", "text": "They also define a normalized modified purity and normalized inverse purity for evaluation, explained below." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-100", "text": "The standard purity of a hard clustering averages, for each cluster's majority gold standard class, the percentage of clustered items of that class." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-101", "text": "Because the clustering is polysemous, a typical automatically-induced cluster K will contain only some senses of the verbs." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-102", "text": "We take this partial membership into account when deciding the cluster's majority class." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-103", "text": "We define c_iv \u2208 [0, 1] as the proportion of instances of verb v grouped into cluster K_i." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-104", "text": "We also treat induced clusters containing only one verb sense as errors, rather than treating them as clusters of perfect purity." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-105", "text": "Therefore, the normalized modified purity (nmPU), with respect to the gold standard clusters G, is," }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-106", "text": "where" }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-107", "text": "This nmPU is analogous to clustering precision: it measures, on average, how well the clustering avoids matching items that should not be clustered."
}, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-108", "text": "We also define a recall analogue, the normalized inverse purity (niPU), as," }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-109", "text": "This measures how well each gold standard cluster is recovered." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-110", "text": "We report each metric, and the F1 score combining them, to compare the clustering accuracy with respect to the gold standard G. We use the clustering from Kawahara et al. (2014) as a baseline for comparison." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-111", "text": "However, for evaluation, the authors only clustered senses of verbs in the evaluation set." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-112", "text": "Since we would like to test the effectiveness of adding supervision, we treat all verbs in the evaluation set as unsupervised, with no initialization of \u03b8." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-113", "text": "Therefore, to compare apples-to-apples, we calculate the nmPU, niPU, and F1 of the Kawahara et al. (2014) full clustering against the evaluation set." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-114", "text": "Our model also computes the full clustering, but with supervision for known verbs (other than the evaluation set)." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-115", "text": "Parameters were selected using a grid search and cross-validation." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-116", "text": "The results are summarized in Table 1 , comparing the unsupervised DPMM baseline (DPMM) to the supervised DPMM (SDPMM), and the supervised DPMM sampling k with y marginalized out (mSDPMM)."
}, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-117", "text": "----------------------------------" }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-118", "text": "**COMPARISON OF PRODUCED CLUSTERS**" }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-119", "text": "The supervised sampling scheme produces fewer clusters than the unsupervised baseline." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-120", "text": "Table 1: Results on the Korhonen et al. (2003) dataset." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-121", "text": "N is the number of clusters spanned by the evaluation set." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-122", "text": "This is in part because it produces fewer \"singleton\" clusters, containing only one verb sense from the evaluation set." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-123", "text": "The SDPMM produces only 16% singleton clusters, compared with 34% for the unsupervised DPMM." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-124", "text": "The supervised model also tends to group more of the senses of each verb into the same cluster." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-125", "text": "The predominant SDPMM cluster for a verb, which has the highest percentage of the verb's total instances, tends to have 224% as many instances as the predominant unsupervised DPMM cluster." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-126", "text": "This tendency does not prevent verbs from being assigned multiple clusters, however." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-127", "text": "On average, the supervised clustering uses 30% fewer clusters for each verb, a smaller reduction than the 70% overall drop in the number of clusters." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-128", "text": "A few example clusters are presented in Table 2 ."
}, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-129", "text": "----------------------------------" }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-130", "text": "**CONCLUSIONS AND FUTURE DIRECTIONS**" }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-131", "text": "The supervision tends to encourage a smaller number of clusters, so the precision-like metric, nmPU, is lower, but the recall-like metric, niPU, is much higher." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-132", "text": "Marginalizing out the variable y when sampling k does not make an appreciable difference to the F1 score." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-133", "text": "Swapping out the Dirichlet process for a Pitman-Yor process may bring finer control over the number of clusters." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-134", "text": "We have expanded the work in Kawahara et al. (2014) by explicitly modeling a VerbNet class for each verb sense, drawn from a product of experts based on the cluster and verb." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-135", "text": "This allowed us to leverage data from SemLink with VerbNet annotation, to produce a higher-quality clustering." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-136", "text": "It also allows us to describe each cluster in terms of alignment to VerbNet classes." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-137", "text": "Both of these improvements bring us closer to extending VerbNet's usefulness, using only automated dependency parses of corpora." }, { "sent_id": "4d2488844c1f6f39f1f4b8f3487288-C001-138", "text": "We may speculate, and should test, whether the improved verb clusters will prove useful in end-to-end semantic tasks." 
} ], "y": { "@USE@": { "gold_contexts": [ [ "4d2488844c1f6f39f1f4b8f3487288-C001-28" ], [ "4d2488844c1f6f39f1f4b8f3487288-C001-42" ], [ "4d2488844c1f6f39f1f4b8f3487288-C001-93" ], [ "4d2488844c1f6f39f1f4b8f3487288-C001-96" ], [ "4d2488844c1f6f39f1f4b8f3487288-C001-110" ], [ "4d2488844c1f6f39f1f4b8f3487288-C001-113" ] ], "cite_sentences": [ "4d2488844c1f6f39f1f4b8f3487288-C001-28", "4d2488844c1f6f39f1f4b8f3487288-C001-42", "4d2488844c1f6f39f1f4b8f3487288-C001-93", "4d2488844c1f6f39f1f4b8f3487288-C001-96", "4d2488844c1f6f39f1f4b8f3487288-C001-110", "4d2488844c1f6f39f1f4b8f3487288-C001-113" ] }, "@BACK@": { "gold_contexts": [ [ "4d2488844c1f6f39f1f4b8f3487288-C001-32" ], [ "4d2488844c1f6f39f1f4b8f3487288-C001-34" ], [ "4d2488844c1f6f39f1f4b8f3487288-C001-56" ] ], "cite_sentences": [ "4d2488844c1f6f39f1f4b8f3487288-C001-32", "4d2488844c1f6f39f1f4b8f3487288-C001-34", "4d2488844c1f6f39f1f4b8f3487288-C001-56" ] }, "@EXT@": { "gold_contexts": [ [ "4d2488844c1f6f39f1f4b8f3487288-C001-134" ] ], "cite_sentences": [ "4d2488844c1f6f39f1f4b8f3487288-C001-134" ] } } }, "ABC_9655fb9abfb1c30b39f3261680fafc_13": { "x": [ { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-2", "text": "Recent neural models for data-to-document generation have achieved remarkable progress in producing fluent and informative texts." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-3", "text": "However, large proportions of generated texts do not actually conform to the input data." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-4", "text": "To address this issue, we propose a new training framework which attempts to verify the consistency between the generated texts and the input data to guide the training process." 
}, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-5", "text": "To measure the consistency, a relation extraction model is applied to check information overlaps between the input data and the generated texts." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-6", "text": "The non-differentiable consistency signal is optimized via reinforcement learning." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-7", "text": "Experimental results on a recently released challenging dataset, ROTOWIRE, show improvements from our framework in various metrics." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-8", "text": "----------------------------------" }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-10", "text": "Data-to-text generation, a classic task of natural language generation, converts structured input data (i.e., a table) into descriptions that adequately and fluently describe the data (Kukich, 1983; Reiter and Dale, 1997; Barzilay and Lapata, 2005; Angeli et al., 2010; Kim and Mooney, 2010; Perez-Beltrachini and Gardent, 2017) ." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-11", "text": "Data-to-document generation is a slightly more challenging setting in which a system generates multi-sentence summaries based on input data (Wiseman et al., 2017) ." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-12", "text": "It is traditionally divided into two subtasks: content selection and surface realization." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-13", "text": "Recent neural generation systems blur this distinction by building on the encoder-decoder architecture (Sutskever et al., 2014) with an attention mechanism (Bahdanau et al., 2015) over the input content (Mei et al., 2016; Du\u0161ek and Jurcicek, 2016; Kiddon et al., 2016; Chisholm et al., 2017) ."
}, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-14", "text": "* Contribution during internship at Microsoft Research Asia." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-83", "text": "For Gaussian smoothing on the context reward, we set its variance to 1 and its truncation size to 5." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-15", "text": "Feng Nie and Hailin Chen contributed equally to this work and should be considered as joint first authors." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-16", "text": "Although neural network models are capable of generating fluent texts, they tend to make factual mistakes, producing sentences that do not conform to the input structured data." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-17", "text": "As shown in Figure 1 , the neural model produces the wrong rebound number." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-18", "text": "Previous work (Wiseman et al., 2017) notes this problem and proposes an extractive metric to evaluate the consistency between the generation and its input structured data." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-19", "text": "However, such consistency is not involved in the training process to guide model learning." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-20", "text": "In this paper, we propose a new training framework that directly incorporates consistency verification to guide the generation during the training process." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-21", "text": "In this way, the inconsistency between the output text and the input structured data can be measured and treated as a negative signal to guide a traditional encoder-decoder network." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-22", "text": "To measure the consistency, we apply a relation extraction model to collect factual information from the generated texts and compare it with the paired reference and input data."
}, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-23", "text": "Since only a subset of the generated text is related to the reference and the input data, we design two novel word-level reward signals based on the consistency with the reference text and with the input data, respectively." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-24", "text": "The non-differentiable consistency reward signals are incorporated into the training procedure via a reinforcement learning approach." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-25", "text": "We evaluate our proposed method on the ROTOWIRE dataset (Wiseman et al., 2017) , which targets the generation of multi-sentence game summaries." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-26", "text": "Empirical experiments show that our proposed method outperforms the encoder-decoder neural generation baseline on BLEU and extraction-based evaluation metrics." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-27", "text": "----------------------------------" }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-28", "text": "**MODEL**" }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-29", "text": "As shown in Figure 1 , our method incorporates a consistency verification process into the encoder-decoder architecture (Sutskever et al., 2014) via a reinforcement learning (RL) framework." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-30", "text": "We introduce the encoder-decoder architecture in Section 2.1, then describe the components of verification in Section 2.2." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-31", "text": "The consistency information brought by the verification is incorporated using the RL approach in Section 2.3."
}, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-32", "text": "----------------------------------" }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-33", "text": "**BASE MODEL**" }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-34", "text": "Our base model is a simple sequence-to-sequence (Seq2Seq) model with a two-layer bidirectional encoder and a unidirectional LSTM decoder, with attention (Bahdanau et al., 2015) and a conditional copy mechanism as in (G\u00fcl\u00e7ehre et al., 2016) ." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-35", "text": "----------------------------------" }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-36", "text": "**CONSISTENCY VERIFICATION**" }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-37", "text": "To examine information consistency, we use a relation extraction (RE) model to extract factual information from the unstructured generated texts." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-38", "text": "Since only a subset of words in the generated texts contain factual statements, the verification weights vary across words." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-39", "text": "We design two word-level reward signals to verify the consistency with the reference text and the input data respectively, and Gaussian smoothing is also applied to smooth the verification weights on context words." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-40", "text": "Given an input and description pair (x, y), each target description y = y_{1:T} consists of T words, and each input contains a set of records x = {x_j}_{j=1}^{J}." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-41", "text": "The description generated by the model is denoted as \u0177_{1:T}." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-42", "text": "Following (Liang et al., 2009) , each x_j = (x_j^e, x_j^m, x_j^t) is a triple, where x_j^t is the record type, and x_j^e and x_j^m are the record's entity and value, respectively."
}, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-43", "text": "As shown in Figure 1 , x_j^t, x_j^e, and x_j^m refer to 'PTS', 'Harden', and '40', respectively." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-44", "text": "----------------------------------" }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-45", "text": "**INFORMATION EXTRACTION**" }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-46", "text": "To extract information describing the input data from the generated texts, we apply a simple information extraction system similar to that of (Wiseman et al., 2017) ." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-47", "text": "Given a generated text \u0177_{1:T}, we first extract candidate entity r^e (e.g., a name) and value r^m (e.g., a number) pairs from the generated text with lexical rules, and then predict the type r^t (e.g., points) of each candidate pair using an RE model (dos Santos et al., 2015) ." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-48", "text": "We train the RE model in a distant supervision manner (Mintz et al., 2009) , with training data constructed from the reference text and input data." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-49", "text": "----------------------------------" }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-50", "text": "**CONSISTENCY REWARDS**" }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-51", "text": "Applying the RE model to the generated text yields a set of candidate relation tuples S." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-52", "text": "The reference text and the input structured data both describe uncontradicted facts and can be treated as gold data." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-53", "text": "To make use of both sources of information, we design two consistency scoring functions (rewards), based on the reference text and the input data respectively."
}, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-54", "text": "We first define a fidelity reward to check whether the relation tuples are consistent with the input data." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-55", "text": "For a particular \u0177_t in \u0177_{1:T}, its fidelity reward is:" }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-56", "text": "where g(\u00b7) returns the index range of a given entity or number, and 1(\u00b7) is the indicator function defined as" }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-57", "text": "\u03b2 and b_f are hyperparameters." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-58", "text": "\u03b2 controls the reward scale and b_f controls the balance between positive and negative rewards." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-59", "text": "Similar to the fidelity reward, we define a content reward for each \u0177_t to measure whether the relation tuples S extracted from the generated text also appear in the reference text:" }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-60", "text": "To combine these rewards, we use a weighted sum of the two components, R(\u0177_t) = \u03bb_1 R_C(\u0177_t) + \u03bb_2 R_F(\u0177_t), where the weights \u03bb_1 and \u03bb_2 are hyperparameters." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-61", "text": "Moreover, both the fidelity and content rewards affect only a subset of words; to balance their effect on context words, we apply a Gaussian smoothing technique." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-62", "text": "----------------------------------" }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-63", "text": "**POLICY GRADIENT REINFORCE**" }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-64", "text": "Our verification signal is non-differentiable, and thus end-to-end back-propagation is not possible."
}, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-65", "text": "A REINFORCE-based policy gradient approach addresses this issue by sampling from the model's own distribution during training and by optimizing the non-differentiable verification signals as rewards." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-66", "text": "We use the REINFORCE algorithm (Williams, 1992; Zaremba and Sutskever, 2015) to learn a policy p_\u03b8, i.e., the distribution produced by the encoder-decoder model." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-67", "text": "The training objective for one sequence is its expected cumulative reward:" }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-68", "text": "where R(\u00b7) is the reward function of the sequence of words \u0176 = (\u0177_1, ..., \u0177_T) sampled from the policy." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-69", "text": "The derivative of the loss function, approximated using a single sample, with variance reduction via a baseline estimator b_t, is:" }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-70", "text": "As illustrated in Section 2.2, our proposed reward can only affect the subset of words related to the input data." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-71", "text": "Therefore, our word-level reward function can be formulated as R(\u0177_{1:T}, x) = \u03a3_{t=1}^{T} R_t(\u0177_t | \u0177_{1:t-1}, x)." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-72", "text": "Thus, we can compute word-level feedback as in (Sutton et al., 2000) :" }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-73", "text": "We conduct pre-training of our policy with the maximum likelihood (MLE) objective prior to REINFORCE training."
}, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-74", "text": "----------------------------------" }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-75", "text": "**EXPERIMENTS**" }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-76", "text": "----------------------------------" }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-77", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-78", "text": "Data: We use the ROTOWIRE dataset (Wiseman et al., 2017) , which is a collection of articles summarizing NBA basketball games, paired with their corresponding box- and line-score tables." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-79", "text": "It consists of 3,398, 727, and 728 summaries for training, validation, and testing respectively." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-80", "text": "Training: For MLE and RL training, we use the SGD optimizer, with the starting learning rate set to 1 and to the last MLE epoch's learning rate respectively; the dimension of the trainable word embeddings and of the hidden units in the LSTMs is set to 512, and the mini-batch size is set to 16." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-81", "text": "We apply truncated backpropagation with a window size of 100 (Mikolov et al., 2010) ." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-82", "text": "For RL training, we set the sample size to 1, \u03b3 to 0, \u03bb_1 and \u03bb_2 to 1, b_f and b_c to 2/3, and \u03b1, \u03b2 to 1.5, according to the validation set." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-104", "text": "For the CS metric, which measures the consistency between the generated texts and the reference text, incorporating the content reward together with the fidelity reward improves the overall F1 score of our method."
}, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-105", "text": "Qualitative analysis: Table 3 shows one validation-set game's statistics with generation results from the original Seq2Seq model and our proposed method 4 ." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-106", "text": "The original Seq2Seq model is more likely to produce incorrect facts (e.g., wrong score points and" }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-107", "text": "3 Hand-crafted templates from (Wiseman et al., 2017) , e.g. scored points (- FG, - 3PT, - FT). 4 The whole game input data and the summaries are relatively lengthy; we present the first three sentences of the summary and its corresponding game statistics for brevity." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-108", "text": "shooting numbers for the player \"Victor Oladipo\" etc.)." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-109", "text": "When incorporating the consistency into the training framework, our method can produce facts that are consistent with the paired input data." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-110", "text": "Moreover, we observe that both models can produce some irrelevant facts such as \"Orlando has won two straight...\"." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-111", "text": "We can incorporate constraints to avoid generating such irrelevant facts in the same framework as future work." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-112", "text": "----------------------------------" }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-113", "text": "**IMPACT OF RELATION EXTRACTION MODEL**" }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-114", "text": "The consistency signal is based on the relation extraction (RE) model; we therefore investigate the influence of different RE models."
}, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-115", "text": "We provide two RE models in Table 2 , where the Linear extractor refers to replacing the non-linear layer in the CNN/LSTM extractor with a linear layer." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-116", "text": "The results of using different RE models for consistency verification over the test set are listed in Table 2." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-117", "text": "Results show that the Linear RE model extracts noisy relation tuples from the generated texts and can hurt the accuracy of consistency verification." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-118", "text": "The result also suggests that our framework may gain further improvements if the RE model performs better." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-119", "text": "----------------------------------" }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-120", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-121", "text": "We present a framework that incorporates verification constraints during training for neural data-to-document generation to enhance the consistency between the generated text and the input structured data." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-122", "text": "Experimental results show that our method outperforms current state-of-the-art neural Seq2Seq models on various evaluation metrics." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-123", "text": "In the future, we would like to improve both the neural generation model and the relation extraction model simultaneously within a single framework." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-84", "text": "We subtract the mean reward as a baseline for the content reward and bound both types of rewards to [-2, 1] ."
}, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-85", "text": "For the relation extraction model, we use an ensemble of CNN and LSTM relation classification models (Wiseman et al., 2017) , which achieves a precision of 94.7% and a recall of 75.3% given the reference." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-86", "text": "Evaluation: We use the automatic evaluation metric BLEU-4 (Papineni et al., 2002) and the extractive evaluation metrics proposed by (Wiseman et al., 2017) , which contain three criteria: content selection (CS), relation generation (RG), and content ordering (CO)." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-87", "text": "CS measures the precision and recall of unique relations r extracted from the generated texts \u0177_{1:T} that are also extracted from the reference y_{1:T} ." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-88", "text": "RG measures the precision and recall of unique relations r extracted from the generated texts \u0177_{1:T} that appear in the paired input data x. CO measures the content ordering between \u0177_{1:T} and y_{1:T} by calculating the normalized Damerau-Levenshtein Distance (DLD) on the extracted relations S and S_ref ." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-89", "text": "----------------------------------" }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-90", "text": "**MAIN RESULTS**" }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-91", "text": "Automatic evaluation: Results of our experiments and a comparison to previous work on this dataset are shown in Table 1 . Baseline: The Orlando Magic ( 7 - 8 ) defeated the New York Knicks ( 8 - 8 ) 100 - 91 on Friday ." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-92", "text": "Orlando has won three straight games ."
}, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-93", "text": "They were led by Victor Oladipo , who scored a game-high 28 points on 9-of-17 shooting from the field and 13-of-15 from the free throw line \u00b7 \u00b7 \u00b7 Proposed Model" }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-94", "text": "The Orlando Magic ( 7 - 8 ) defeated the New York Knicks ( 8 - 8 ) 100 - 91 on Friday ." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-95", "text": "Orlando has won two straight and six of their last eight games ." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-96", "text": "They were led by Carmelo Anthony , who scored a game-high 28 points on 9-of-17 shooting from the field and 3-of-4 shooting from beyond the arc \u00b7 \u00b7 \u00b7 Table 3 : Top: partial input data of one game." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-97", "text": "Below: First three sentences produced by the baseline and our proposed model." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-98", "text": "Blue words denote facts consistent with the input, red words denote facts contradicting the input, and italic words denote facts not mentioned in the input. We conduct MLE training on our baseline model and achieve comparable results on the ROTOWIRE dataset w.r.t." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-99", "text": "the previous work (Wiseman et al., 2017) ." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-100", "text": "The difference between our method and (Wiseman et al., 2017) is that we adopt an LSTM for the encoder, while (Wiseman et al., 2017) uses a table encoder similar to (Yang et al., 2017) ."
}, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-101", "text": "The template-based 3 method performs worse than all neural-based methods in terms of BLEU score, but it performs quite well on the extractive metrics, as input data is directly fed into the placeholders of the template by rules, which provides an upper bound for how domain knowledge could help content selection and generation." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-102", "text": "For neural-based methods, our two proposed verification signals can improve both the BLEU and RG metrics, which indicates the effectiveness of incorporating the verification constraints during training." }, { "sent_id": "9655fb9abfb1c30b39f3261680fafc-C001-103", "text": "In addition, with these two signals, our method achieves further improvements on the BLEU and RG metrics." } ], "y": { "@BACK@": { "gold_contexts": [ [ "9655fb9abfb1c30b39f3261680fafc-C001-11" ], [ "9655fb9abfb1c30b39f3261680fafc-C001-18" ] ], "cite_sentences": [ "9655fb9abfb1c30b39f3261680fafc-C001-11", "9655fb9abfb1c30b39f3261680fafc-C001-18" ] }, "@DIF@": { "gold_contexts": [ [ "9655fb9abfb1c30b39f3261680fafc-C001-18", "9655fb9abfb1c30b39f3261680fafc-C001-19", "9655fb9abfb1c30b39f3261680fafc-C001-20" ], [ "9655fb9abfb1c30b39f3261680fafc-C001-100" ] ], "cite_sentences": [ "9655fb9abfb1c30b39f3261680fafc-C001-18", "9655fb9abfb1c30b39f3261680fafc-C001-100" ] }, "@USE@": { "gold_contexts": [ [ "9655fb9abfb1c30b39f3261680fafc-C001-25" ], [ "9655fb9abfb1c30b39f3261680fafc-C001-78" ], [ "9655fb9abfb1c30b39f3261680fafc-C001-85" ], [ "9655fb9abfb1c30b39f3261680fafc-C001-86" ], [ "9655fb9abfb1c30b39f3261680fafc-C001-97", "9655fb9abfb1c30b39f3261680fafc-C001-98", "9655fb9abfb1c30b39f3261680fafc-C001-99" ], [ "9655fb9abfb1c30b39f3261680fafc-C001-107" ] ], "cite_sentences": [ "9655fb9abfb1c30b39f3261680fafc-C001-25", "9655fb9abfb1c30b39f3261680fafc-C001-78", "9655fb9abfb1c30b39f3261680fafc-C001-85", "9655fb9abfb1c30b39f3261680fafc-C001-86",
"9655fb9abfb1c30b39f3261680fafc-C001-99", "9655fb9abfb1c30b39f3261680fafc-C001-107" ] }, "@SIM@": { "gold_contexts": [ [ "9655fb9abfb1c30b39f3261680fafc-C001-46" ] ], "cite_sentences": [ "9655fb9abfb1c30b39f3261680fafc-C001-46" ] } } }, "ABC_e5bff4a27468139762496abdff3436_13": { "x": [ { "sent_id": "e5bff4a27468139762496abdff3436-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-2", "text": "Linguistic relations in oral conversations present how opinions are constructed and developed in a restricted time." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-3", "text": "The relations bond ideas, arguments, thoughts, and feelings, reshape them during a speech, and finally build knowledge out of all information provided in the conversation." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-4", "text": "Speakers share a common interest to discuss." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-5", "text": "It is expected that each speaker's reply includes duplicated forms of words from previous speakers." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-6", "text": "However, linguistic adaptation is observed and evolves along a more complex path than just transferring slightly modified versions of common concepts." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-7", "text": "A conversation aiming at a benefit at the end shows an emergent cooperation inducing the adaptation." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-8", "text": "Not only cooperation but also competition drives the adaptation, or an opposite scenario, and one can capture the dynamic process by tracking how the concepts are linguistically linked." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-9", "text": "To uncover salient complex dynamic events in verbal communications, we attempt to discover self-organized linguistic relations hidden in a conversation with explicitly stated winners and losers."
}, { "sent_id": "e5bff4a27468139762496abdff3436-C001-10", "text": "We examine open access data of the United States Supreme Court." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-11", "text": "Our understanding is crucial in big data research to guide how transition states in opinion mining and decision-making should be modeled and how the knowledge required to guide the model should be pinpointed by filtering large amounts of data." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-12", "text": "----------------------------------" }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-13", "text": "**INTRODUCTION**" }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-14", "text": "Traditionally, in computational linguistics, it is essential to integrate models and algorithms with fundamental laws of language." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-15", "text": "Widely applied hierarchical dependency trees and parsing in natural language processing (NLP) follow existing grammatical relations." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-16", "text": "Nowadays, while algorithms and models reach higher levels and available data becomes bigger, not enough linguistic laws have been uncovered to have a chance to meet the developed techniques." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-17", "text": "Language processing in data science mainly considers the evaluated data as a single source in terms of language." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-18", "text": "There are approaches such as cross-media topic analysis, retrieving information by referring to various data platforms including websites, blogs, and mobile phones, and multimodal analysis (Poria et al. 2017a; Poria et al. 2017b) , combining text data with images, videos, and audio; however, they only gather all available channels and do not address the richness of language."
}, { "sent_id": "e5bff4a27468139762496abdff3436-C001-19", "text": "On the other hand, language itself has many dimensions: the language of a text written by a single author is different from the language used in a dialogue or that of a group speech, e.g., trialogue discussions." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-20", "text": "Therefore, it is imperative that current conventional NLP meet the revolutionary philosophy of linguistics (Chomsky 1975) and establish new hidden laws applicable in data science: laws the human mind easily knows and applies from birth, but hardly formulates to understand the underlying structure." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-21", "text": "One of the remarkable perspectives for digging into natural linguistic laws is provided by the social and behavioral sciences: adaptation in language during communication as a result of changes in opinions and decisions." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-22", "text": "Opinions and decisions are personal at the individual level; however, they are flexible when facing public opinions and decisions." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-23", "text": "Linguistic adaptation is twofold." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-24", "text": "In one part, a collective voice unifies opinions and decisions in a complex process, ideas are biased, and consequently people start acting similarly, talking similarly, and so writing similarly." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-25", "text": "Twitter conversations (Danescu-Niculescu-Mizil, Gamon, and Dumais 2011; Purohit et al. 2013) and popular memes (Myers and Leskovec 2012; Coscia 2013) prove this similarity in social media." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-26", "text": "In the other part, when people have a well-defined goal at the end, they tend to reshape their arguments."
}, { "sent_id": "e5bff4a27468139762496abdff3436-C001-27", "text": "In the presence of distinct winning and losing sides and social hierarchy, people at lower status show both cooperation with those at higher status and competition among each other." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-28", "text": "Therefore, a verbal discussion in such explicitly opposing groups hosts linguistic adaptation, as investigated in social exchange theory (Willer 1999; Thye, Willer, and Markovsky 2006) ." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-29", "text": "While information and emotions are the fundamental elements of human knowledge, commonsense knowledge is the fundamental element for gluing society (Cambria et al. 2009) ." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-30", "text": "Commonsense is implicit semantic and affective information humans continuously tap on for decision-making, communication, and reasoning in general (Rajagopal et al. 2013; Tran, Cambria, and Hussain 2016) ." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-31", "text": "Effective speeches and public talks use commonsense efficiently to drive opinions and change decisions on large scales (Drath and Palus 1994) ." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-32", "text": "The resultant unified collective motion is extremely interesting in social groups (Borge-Holthoefer et al. 2011; Gonzalez-Bailon et al. 2011) ." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-33", "text": "Opinions and decisions are personal at the individual level." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-34", "text": "However, as observed, they are quite flexible when facing a collective decision." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-35", "text": "The complex knowledge extraction process in the micro state suddenly becomes less valuable and the group decision gains (Conover et al. 2011) ."
}, { "sent_id": "e5bff4a27468139762496abdff3436-C001-36", "text": "We can argue that our opinions are biased when our decisions mostly rely on our previous knowledge, e.g., commonsense, and so the richness of opinions kept in each individual is relatively unimportant." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-37", "text": "We can further argue that commonsense drives an adaptation in extracting knowledge." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-38", "text": "To measure commonsense for a particular situation is hard; however, adaptations can be easily captured in Twitter conversations (Danescu-Niculescu-Mizil, Gamon, and Dumais 2011; Purohit et al. 2013) , in memes (Myers and Leskovec 2012; Coscia 2013) , and in face-to-face discussions (Danescu-Niculescu-Mizil et al. 2012) ." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-39", "text": "In this paper, our main concerns are firstly to construct discussion groups including agents having different social powers and serving opposite aims." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-40", "text": "Secondly, we investigate how we can track the progress of opinions together with their influences on decisions in oral conversations." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-41", "text": "We claim that linguistic relations (Poria et al. 2015) preserve all the rich phenomena shortly discussed above, including collective voice, reshaping arguments, and so adaptation." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-42", "text": "To analyze adaptation induced by both cooperation and competition, we consider court conversations: they are held in clearly stated winner and loser groups with a distinct hierarchy in decision-making due to the presence of Justices and lawyers."
}, { "sent_id": "e5bff4a27468139762496abdff3436-C001-43", "text": "To this end, we evaluate the open access data of the United States Supreme Court (Hawes, Lin, and Resnik 2009; Hawes 2009; Danescu-Niculescu-Mizil et al. 2012) , prepare conversation groups with different adaptation levels, implement a suitable algorithm to extract linguistic relations in these group conversations, and finally provide a comparison between the groups and the discovered linguistic relations." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-44", "text": "The rest of the paper is organized as follows: the first section presents the dataset we consider and the conversation groups designed out of the data; the second section describes our algorithm in detail; the following section explains how we implement pointwise mutual information for the conversation groups and then link it with linguistic relations; finally, we provide experimental results and conclude the paper." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-45", "text": "----------------------------------" }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-46", "text": "**SUPREME COURT DATA**" }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-47", "text": "We borrow the textual data of the conversations in the United States Supreme Court pre-processed by (Hawes, Lin, and Resnik 2009; Hawes 2009) and enriched by (Danescu-Niculescu-Mizil et al. 2012) including the final votes of Justices." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-48", "text": "Both the original data and the most updated version used here are publicly available (Danescu-Niculescu-Mizil et al. 2012) ." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-49", "text": "The data gathers oral speeches before the Supreme Court and hosts 50,389 conversational exchanges among Justices and lawyers."
}, { "sent_id": "e5bff4a27468139762496abdff3436-C001-50", "text": "Distinct hierarchy between Justices (high power) and lawyers (low power) impose lawyers to tune their arguments under the perspective and understandings of Justices, and as a result, speech adaptation and linguistic coordination leaves their traces in a sudden occurrence of sharing the same adverbs, conjunctions, and pronouns." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-51", "text": "Tracking initial utterances, the sides present a unique and personal speaking, but after a while in the communication, word selections, their forms, and frequencies mirror each other's language preference." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-52", "text": "The linguistic coordination is systematically quantified by (Danescu-Niculescu-Mizil et al. 2012 ) and the arguments follow the principles of exchange theory examining behavior dynamics in low and high power groups (Willer 1999; Thye, Willer, and Markovsky 2006) : Lawyers tend to cooperate more to Justices than conversely and demonstrate strong linguistic coordination in their speech." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-53", "text": "Moreover, lawyers show even more cooperation to unfavorable Justices than favorable ones." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-54", "text": "Here, we enrich the comparison including the identity of winners and losers in lawsuits." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-55", "text": "The data provides whether the petitioner or the respondent is the winner at the end of each lawsuit." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-56", "text": "In addition, the speaker of each utterance is labeled as their position, e.g., Justice or lawyer." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-57", "text": "Furthermore, Justice's votes and the side of lawyers are tagged with the utterances." 
}, { "sent_id": "e5bff4a27468139762496abdff3436-C001-58", "text": "Table 1 identifies all roles carried by Justices and lawyers." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-59", "text": "For Justices, both the vote (middle) and whom to speak (last) are given." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-60", "text": "Lawyers are allowed to speak only when Justices address their side." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-61", "text": "----------------------------------" }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-62", "text": "**ID ROLES OF JUSTICES (J) AND LAWYERS**" }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-63", "text": "l -Respondent Side Table 1 : The segregation schema of the roles in conversations: Support sides of Justices and sides of lawyers." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-64", "text": "1-6 summarize all potential roles present in the data." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-65", "text": "In 1-4, who supported by the Justice is given in the middle." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-66", "text": "Furthermore, the last indicates the side of lawyer the Justice speaks to." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-67", "text": "Referring exchange theory (Willer 1999; Thye, Willer, and Markovsky 2006) and the measured coordination (Danescu-Niculescu-Mizil et al. 2012) , one can order the relative power of each Justice and lawyer pair" }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-68", "text": "where J and l represent Justices and lawyers, respectively (note that for comparing individually following the social exchange theory, P (J) > P (l) for both supportive and unsupported Justices)." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-69", "text": "The subscript u indicates that Justice doesn't support the side of lawyer and the supportive version is described by s. 
For instance, in Table 1 , in the communications of 1 and 5; 4 and 6, Justices show supports and play as J s , whereas that of 3 and 5; 2 and 6, lawyers are unsupported by J u ." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-70", "text": "The scenarios and pairs guide to construct groups with different cooperation level induced by P as illustrated in Table 2 ." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-71", "text": "We further add another dimension in the relative power: Winners and Losers, haven't been investigated in the previous study (Danescu-Niculescu-Mizil et al. 2012) ." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-72", "text": "To this end, Eq. 1 is reformulated" }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-73", "text": "Group ID Cooperation Pool of J and l I.i supportive, P (J s , l) 1 and 5 I.ii unsupported, P (J u , l) 3 and 5 II.i unsupported, P (J u , l) 2 and 6 II.ii supportive, P (J s , l) 4 and 6 Table 2 : Grouping communications with respect to level of cooperation, based on the relative power of the partners in the conversations." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-74", "text": "1-6 and the power pairs P (J s , l) and P (J u , l) as defined previously." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-75", "text": "Here, win and lose subscripts highlight that the concerned Justice and lawyer pairs are the partners in a won or lost lawsuit." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-76", "text": "As an illustration, P (J s , l) win occurs in the group I.i when petitioners are the winner and also in II.ii while respondents are the winners of the lawsuits." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-77", "text": "On the other hand, P (J s , l) lose is the Justices-lawyers of I.i in respondent won lawsuits as well as of II.ii in petitioner won lawsuits." 
}, { "sent_id": "e5bff4a27468139762496abdff3436-C001-78", "text": "The situations are generated for the unsupported Justice-lawyer groups and all are listed in Table 3 ." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-79", "text": "Calculating utterances in \u03ba, we have 21,105 for A, 15,116 for B, 15,489 for C, and 24,461 for D, gathered by different combinations of 195 lawsuits." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-80", "text": "The large number of each pool convinces that we have enough examples to perform statistics and our measurement won't be biased by the size effect." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-81", "text": "On the other hand, noting the total number of 50,389 utterances, almost the half of the data presents P (J u , l) lose type social relations, e.g., case D. Eqs. (2) and (3) do not include the comparison of {P (J u , l) win ; P (J s , l) lose } and {P (J u , l) lose ; P (J s , l) win } on purpose since it is unknown whether P (J u , l) > P (J s , l) is still valid in the presence of win and lose, bringing interesting perspective while coupling the power hypothesis with the cooperation and not considered in social exchange theory." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-82", "text": "We aim to understand this full picture by correlating determined linguistic relations with the separated relative power groups." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-83", "text": "----------------------------------" }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-84", "text": "**LINGUISTIC RELATION EXTRACTION**" }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-85", "text": "The Supreme Court hosts lawsuits of rich subjects." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-86", "text": "To design specific linguistic relations in each distinct lawsuit is challenging and not required." 
}, { "sent_id": "e5bff4a27468139762496abdff3436-C001-87", "text": "Our aim is to suggest relations suitable for any discussion concept." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-88", "text": "To generalize the task, we first determine noun phrases in the data following the definition in (Pennacchiotti and Pantel 2006) ." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-89", "text": "The phrases are combinations of adjectives and nouns." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-90", "text": "The technical steps include standard part-of-speech tagging including grammar based chunk parser." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-91", "text": "We then restrict our attention to address the relations linking only determined noun phrases within one sentence." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-92", "text": "The data shows utterances of grammatically correct and well-organized sentences." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-93", "text": "To this end, we apply rule-based relation extraction." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-94", "text": "While Fig. 1 shows each step of the developed algorithm, steps (A-C) indicate the discussed concept recognition of noun phrases." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-95", "text": "The rule-based schema starts with first restricting linguistic relations and then constructing static surface patterns (regular expressions) for them." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-96", "text": "The assigned patterns run as an iterative process searching the exact match of the real patterns between any concept pair, which is any noun phrase pair here." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-97", "text": "Within a sentence, multiple relations can be addressed based on the comparison in the iteration, to capture both different relations or the same relation but with the different patterns." 
}, { "sent_id": "e5bff4a27468139762496abdff3436-C001-98", "text": "To balance the relations without overweighting extreme cases, we first apply classical IsA (Hearst 1992) and PartOf (Girju, Badulescu, and Moldovan 2003) relations." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-99", "text": "The patterns of the relations follow both lexicosyntactic formalisms (Klaussner and Zhekova 2011) and manual investigations of the data." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-100", "text": "We then recommend further relations as UsedBy, UsedFor, UsedIn, UsedOver, and UsedWith to cover the rest of the data." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-101", "text": "The Used relations do not accumulate for certain lawsuits and nicely distribute over entire data, which provides us reliable analysis." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-102", "text": "Fig. 1(D and E) highlight the iteration process to detect all potential relations." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-103", "text": "To illustrate the outcome of our algorithm, we provide examples for each relation." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-104", "text": "They are given with the detected noun phrases in Table 4 ." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-105", "text": "The identity of the sentences, a-g, are to guide the following concerned examples, where the linked noun phrases are highlighted in bold: (a) That was so because her claim is that J. Howard intended to give her a catchall trust." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-106", "text": "The validation of the discovered linguistic relations and their suggested patterns are systematically satisfied by the following protocol." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-107", "text": "From each conversation group \u03ba in Table 3, 1000 utterances are randomly selected." 
}, { "sent_id": "e5bff4a27468139762496abdff3436-C001-108", "text": "Utterances present averages sentences of 2-4, the minimum is for the group C, P (J u , l) win , and the maximum for group D, P (J u , l) lose ." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-109", "text": "Table 4 : Extracted relations with our algorithms and the corresponded (linked) noun phrases." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-110", "text": "Sent. ID refers the labeled example sentences above in the main text." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-111", "text": "Then, manual annotations are provided for each pool, which works as the ground truth, and the patterns are readjusted if necessary based on the performance, as shown in Fig. 1(D-F) ." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-112", "text": "The overall average scores, comparing the relations generated by our algorithm with the ground truth, are obtained as 59.92% for Recall, 67.2% for Precision, and 63.35% for the resultant F1." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-113", "text": "The scores are relatively higher than that of the rule-based relation extraction algorithms for more general purposes applied in large data sets (Pantel, Ravichandran, and Hovy 2004) ." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-114", "text": "Our manual efforts, the grammatically correct sentences, and relatively small and well-organized data are the reasons behind the good performance." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-115", "text": "However, we observe that the foremost reason is the linguistic coordination extracting many relations from the same static patterns." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-116", "text": "In the rest of the paper, we will demonstrate how we interpret these linguistic relations in the Supreme Court conversation groups of different relative powers." 
}, { "sent_id": "e5bff4a27468139762496abdff3436-C001-117", "text": "----------------------------------" }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-118", "text": "**POINTWISE MUTUAL INFORMATION**" }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-119", "text": "Pointwise mutual information (PMI) is a metric to measure coincidence of two discrete random events." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-120", "text": "It combines individual probabilities of events and their joined probability to determine how often the two events occur at the same occasion." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-121", "text": "We quantify to what extend linguistic relations R are addressed by conversation groups \u03ba and whether we observe any variation in the selections." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-122", "text": "To this end, PMI between R and \u03ba is introduced (Pantel, Ravichandran, and Hovy 2004)" }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-123", "text": "Here, f (R, \u03ba) represents the frequency of occurrence for certain R in particular \u03ba and N is the total number of all R in all \u03ba." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-124", "text": "So, while the numerator describes the probabilistic occurrence of R in \u03ba, the denominator provides individual probability of R and that of \u03ba in the pool." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-125", "text": "We expect high M I(R, \u03ba) while R appears in a specific \u03ba and that is an indicator of its rare presence in the other conversation groups." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-126", "text": "Unlike the previous study (Danescu-Niculescu-Mizil et al. 
2012), which tracks back-and-forth utterances in full and demonstrates adaptation (i.e., linguistic coordination) by identifying the frequency of selected keywords, we directly build on their overall conclusion and claim that the linguistic relations already preserve the adaptation and any other complex collective linguistic process induced by both cooperation and competition in the different power groups." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-127", "text": "We expect that the variation in MI(R, \u03ba) over the gathered utterances of each relative power group, independent of the utterance order, suggests which relations can distinguish the groups, and that the magnitude of the difference in MI(R, \u03ba) highlights which relative power groups drastically influence the applied language." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-128", "text": "We analyze MI(R, \u03ba) following this understanding in the coming section." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-129", "text": "----------------------------------" }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-130", "text": "**RESULTS**" }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-131", "text": "We compute MI(R, \u03ba) for each group \u03ba, separated by the different coordination levels and linguistic dynamics expected from the distinct relative powers introduced in Table 3, and for each relation R described in Section Linguistic Relation Extraction." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-132", "text": "The results are presented in Fig. 2 and suggest rich behavior." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-133", "text": "First, MI for the relations IsA, PartOf, and UsedBy is almost indistinguishable across all \u03ba." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-134", "text": "We conclude that these relations cannot uncover the linguistic variations in the different power groups."
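Under the definition above, MI(R, κ) can be computed directly from a frequency table f(R, κ); a minimal sketch with made-up counts (the log base is a free choice and does not affect comparisons):

```python
from math import log2

def pmi(freq):
    """Pointwise mutual information MI(R, kappa) from a {(R, kappa): count} table."""
    N = sum(freq.values())          # total number of all R in all kappa
    f_R, f_k = {}, {}               # marginal frequencies f(R) and f(kappa)
    for (R, k), c in freq.items():
        f_R[R] = f_R.get(R, 0) + c
        f_k[k] = f_k.get(k, 0) + c
    return {
        (R, k): log2((c / N) / ((f_R[R] / N) * (f_k[k] / N)))
        for (R, k), c in freq.items() if c > 0
    }

# Made-up counts: UsedFor concentrates in group C, IsA spreads evenly.
counts = {("IsA", "A"): 10, ("IsA", "C"): 10,
          ("UsedFor", "A"): 2, ("UsedFor", "C"): 18}
scores = pmi(counts)
# A relation concentrated in one group scores higher there than an evenly spread one:
print(scores[("UsedFor", "C")] > scores[("IsA", "C")])  # -> True
```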
}, { "sent_id": "e5bff4a27468139762496abdff3436-C001-135", "text": "This is an obvious outcome of NLP and examining sentences by lexico-syntactic patterns: Any sentence can consider them with no complex linguistic process such as coordination and competition." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-136", "text": "On the other hand, we observe quite remarkable separation starting with UsedFor." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-137", "text": "Successfully, the results of UsedIn, UsedOver, and UsedWith show that their appearances in \u03ba are not arbitrary." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-138", "text": "----------------------------------" }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-139", "text": "**ISA**" }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-140", "text": "PartOf UsedBy UsedFor UsedIn UsedOverUsedWith" }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-141", "text": "Relations ( Evaluating the results in more detail, let us remind Table 3." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-142", "text": "A is expected to have the least relative power, P (J s , l) win , and consequently, no significant variation is observed." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-143", "text": "However, the situations are much more challenging for B, C, and D: They face with many conceptual challenges while defending their sides and competing with the opposite arguments, C and D, and to experiment different communications in a losing state, B and D. Each difficulty is a potential origin of the competition, some can build sufficient cooperation and make the lawyer winner, C, some cannot help to overcome the situation, keep the coordination limited, and so we experience lost lawsuits, B and D. To remind, B for P (J s , l) lose , C for P (J u , l) win , and D for P (J u , l) lose ." 
}, { "sent_id": "e5bff4a27468139762496abdff3436-C001-144", "text": "If we just call social exchange theory, for any measurable linguistic quantity, we would need to have A \u2261 B and C \u2261 D. However, we show that the win and lose states impose observable deviations and none group resembles each other, oppositely, each presents very unique behavior." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-145", "text": "In a simplified picture, M I(R, \u03ba) for C always indicates significantly positive values." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-146", "text": "This proves that the utterances in C consider all type of relations, can be the reason behind the success of the \"win\" state in spite of the presence of unsupported Justices." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-147", "text": "----------------------------------" }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-148", "text": "**CONCLUSION**" }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-149", "text": "We investigated the linguistic dynamics in terms of a restricted set of linguistic relations in oral conversations while the actors have different powers such as Justices (high power) and lawyers (low power) in the United States Supreme Court." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-150", "text": "Initially, defined cooperation of lawyers to Justices and the resultant linguistic coordination are only based on the relative power." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-151", "text": "This is a microscopic picture underestimating the dynamics of emergent competition arises in a losing state (lost lawsuits), which can change the nature of the linguistic coordination and make the linguistic relations richer." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-152", "text": "Our argument is proven by measuring M I(R, \u03ba) always positive for the group C, P (J u , l) win ." 
}, { "sent_id": "e5bff4a27468139762496abdff3436-C001-153", "text": "Novelty of our approach is that it evaluates supportive and unsupported situations in more realistically." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-154", "text": "The principle of exchange theory suggests P (J u , l) > P (J s , l) and one should expect high coordination in the former." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-155", "text": "However, this can be always true if there is no explicitly stated decision at the end of the communication: Winner or loser lawyer." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-156", "text": "We can observe P (J s , l) lose P (J u , l) lose and so the linguistic coordination (dynamics) for both can be comparable, as we trace in our result, e.g., very similar trend of M I(R, \u03ba) for groups B and D. Therefore, both social exchange theory and their impacts on the linguistic behavior need to be reinterpreted under exogenous factors such as win-lose situations." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-157", "text": "Furthermore, we experience that the rule-based relation extraction is well-applicable for speech data, in this grammatically correct form with minor noise, because of the presence of the linguistic adaptation, providing a better performance than its usage for other type of textual data such as internet data." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-158", "text": "Furthermore, M I(R, \u03ba) brings another perspective to uncover complex linguistic dynamics, including cooperation and competition, and discover the correlations between the linguistic relations and the relative powers." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-159", "text": "We establish the preliminary set-up to examine the linguistic dynamics of trialogue discussions hosting in social groups with distinct hierarchy." 
}, { "sent_id": "e5bff4a27468139762496abdff3436-C001-160", "text": "Our main conclusion is that win and lose states impose further complexity and change the conventional application of social exchange theory in language and communication." }, { "sent_id": "e5bff4a27468139762496abdff3436-C001-161", "text": "In our future study, we attempt to analyze back and forth utterances in detail regarding semantics bonding by the linguistic relations by applying advanced tools." } ], "y": { "@BACK@": { "gold_contexts": [ [ "e5bff4a27468139762496abdff3436-C001-25" ], [ "e5bff4a27468139762496abdff3436-C001-38" ], [ "e5bff4a27468139762496abdff3436-C001-52" ] ], "cite_sentences": [ "e5bff4a27468139762496abdff3436-C001-25", "e5bff4a27468139762496abdff3436-C001-38", "e5bff4a27468139762496abdff3436-C001-52" ] }, "@USE@": { "gold_contexts": [ [ "e5bff4a27468139762496abdff3436-C001-37", "e5bff4a27468139762496abdff3436-C001-38" ], [ "e5bff4a27468139762496abdff3436-C001-43" ], [ "e5bff4a27468139762496abdff3436-C001-48" ], [ "e5bff4a27468139762496abdff3436-C001-67" ], [ "e5bff4a27468139762496abdff3436-C001-47" ] ], "cite_sentences": [ "e5bff4a27468139762496abdff3436-C001-38", "e5bff4a27468139762496abdff3436-C001-43", "e5bff4a27468139762496abdff3436-C001-48", "e5bff4a27468139762496abdff3436-C001-67" ] }, "@EXT@": { "gold_contexts": [ [ "e5bff4a27468139762496abdff3436-C001-52", "e5bff4a27468139762496abdff3436-C001-53", "e5bff4a27468139762496abdff3436-C001-54" ] ], "cite_sentences": [ "e5bff4a27468139762496abdff3436-C001-52" ] }, "@DIF@": { "gold_contexts": [ [ "e5bff4a27468139762496abdff3436-C001-71" ], [ "e5bff4a27468139762496abdff3436-C001-126" ] ], "cite_sentences": [ "e5bff4a27468139762496abdff3436-C001-71", "e5bff4a27468139762496abdff3436-C001-126" ] } } }, "ABC_8c202e3610599c9eee23724ef213de_13": { "x": [ { "sent_id": "8c202e3610599c9eee23724ef213de-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-2", "text": "When processing arguments in online 
user interactive discourse, it is often necessary to determine their bases of support." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-3", "text": "In this paper, we describe a supervised approach, based on deep neural networks, for classifying the claims made in online arguments." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-4", "text": "We conduct experiments using convolutional neural networks (CNNs) and long short-term memory networks (LSTMs) on two claim data sets compiled from online user comments." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-5", "text": "Using different types of distributional word embeddings, but without incorporating any rich, expensive set of features, we achieve a significant improvement over the state of the art for one data set (which categorizes arguments as factual vs. emotional), and performance comparable to the state of the art on the other data set (which categorizes propositions according to their verifiability)." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-6", "text": "Our approach has the advantages of using a generalized, simple, and effective methodology that works for claim categorization on different data sets and tasks." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-7", "text": "----------------------------------" }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-9", "text": "Argumentation mining is a relatively new subfield in natural language processing that aims to automatically identify and extract arguments, and their underlying structures, from textual documents (Moens et al., 2007; Palau and Moens, 2009; Wyner et al., 2010; Feng and Hirst, 2011; Ashley and Walker, 2013; Stab and Gurevych, 2014) ." 
}, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-10", "text": "Some such documents are written by professionals and contain well-formed, explicit arguments-i.e., propositions supported by evidence and connected through reasoning." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-11", "text": "However, informal arguments in online argumentative discourses can exhibit different styles." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-12", "text": "Recent work has begun to model different aspects of these naturally occurring lay arguments, with tasks including stance classification (Somasundaran and Wiebe, 2009; Walker et al., 2012) , argument summarization (Misra et al., 2015) , sarcasm detection (Justo et al., 2014) and classification of propositions and arguments (Park and Cardie, 2014; Park et al., 2015; Oraby et al., 2015) ." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-13", "text": "Of particular interest is the fact that arguments in online user comments, unlike those written by professionals, often have inappropriate or missing justifications." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-14", "text": "Recognizing such propositions and determining the appropriate types of support can be useful for assessing the strength of the supporting information and, in turn, the strength of the whole argument." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-15", "text": "To this end, two previous studies have produced data sets and methods for classifying propositions in online argumentative discourse." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-58", "text": "Dependency context based embeddings capture functional similarities across the words using different contexts (Levy and Goldberg, 2014) ." 
}, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-16", "text": "The first of these studies (Park and Cardie, 2014) compiled online user comments from a discussion website and developed a framework for automatically classifying each proposition as either \"unverifiable\", \"verifiable non-experiential\", or \"verifiable experiential\", where the appropriate types of support are reason, evidence, and optional evidence, respectively." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-17", "text": "The second study, Oraby et al. (2015) , uses a different online corpus (Walker et al., 2012 ) of short argumentative responses to quotes, and classifies each response as either \"factual\" or \"feeling\" according to whether the support invoked appeals to facts or to emotions." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-18", "text": "In this paper, we use the term \"claim\" loosely to refer to an individual proposition (a sentence or independent clause) in an argument, or to a short argumentative text containing one or more propositions." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-19", "text": "In classifying propositions, Park and Cardie (2014) followed previous work such as Reed et al. (2008) and Palau and Moens (2009) , employing supervised learning methods." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-20", "text": "Despite using a rich set of linguistic features, these approaches suffer from low accuracy." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-21", "text": "Moreover, generating these features can be a tedious and complex process." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-22", "text": "In this paper, we show that state-of-the-art performance in claim classification for online user comments can be achieved without the need for expensive features." 
}, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-23", "text": "Our work, which employs CNN-and LSTM-based deep neural networks, is inspired by advances in sentence classification (Kim, 2014) and sequence classification (Hochreiter and Schmidhuber, 1997 ) using distributional word representations and deep learning." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-24", "text": "In particular, our approach leverages word2vec 1 distributional embeddings, dependency context-based embeddings (Levy and Goldberg, 2014) , and factuality/certainty-indicating embeddings for improving claim classification." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-25", "text": "(We refer to these embeddings as linguistic embeddings, as these are compiled from linguistic annotations such as dependency relations, verb modalities, and actuality information.) In this paper, we separately evaluate the usefulness of word and linguistic embeddings in the claim classification task on both the aforementioned data sets." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-26", "text": "We also concatenate (stack) these embeddings and show how these stacked embeddings, as well as tuning of the hyper-parameters, further improves claim classification performance." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-27", "text": "----------------------------------" }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-28", "text": "**BACKGROUND**" }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-29", "text": "In this section, we introduce two main approaches for claim classification: feature-rich supervised learning and distributional word embeddings." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-30", "text": "We then discuss the recent use of convolutional neural networks and long short-term memory networks in the related task of sentence classification." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-31", "text": "Oraby et al. 
(2015) classify arguments as emotional or factual using a set of linguistic patterns extracted from unlabelled arguments, provided the argument matches at least three patterns in the category." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-32", "text": "Although this approach has good precision, its recall is significantly lower than that of a supervised unigram baseline using a Bayesian classifier." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-33", "text": "Park and Cardie (2014) classify propositions as verifiable non-experiential, verifiable experiential, or unverifiable using a support vector machine (SVM)." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-34", "text": "The classifier employs a rich set of features including n-grams, part-of-speech tags, imperative expressions, speech events, emotions, sentiment, person, and tense." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-35", "text": "Though this approach classifies unverifiable statements reasonably well, its performance on the two classes of verifiable propositions is low (44-70% F1)." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-36", "text": "In light of the observation that certain types of propositions tend to occur together, Park et al. (2015) propose an intuitive extension to this approach, framing the proposition classification task as a sequence-labelling problem." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-37", "text": "This extended approach employs conditional random fields (CRF) using dictionary-based features along with all the features from the original technique." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-38", "text": "However, it resulted in lower accuracy than the SVM." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-39", "text": "Ferreira and Vlachos (2016) addressed the task of determining the stance of news article headlines with respect to claims from a data set of rumours."
}, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-40", "text": "The authors used a logistic regression classifier using various features, such as bag of words, paraphrase entailment alignment scores, and word2vec embedding features, that examine the headline and its agreement with the claim." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-41", "text": "The work in this paper is focused on stance classification but the claims in the data set are related to the data sets used in our work." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-42", "text": "----------------------------------" }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-43", "text": "**METHODS BASED ON A RICH SET OF FEATURES**" }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-44", "text": "----------------------------------" }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-45", "text": "**DISTRIBUTIONAL WORD EMBEDDINGS**" }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-46", "text": "Traditional supervised learning approaches to NLP tasks depend heavily on manual annotation, and often suffer from data sparseness." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-47", "text": "Distributional representations of words, also known as word embeddings, can be learned from large, unlabelled corpora using neural networks, and encode both syntactic and semantic properties of words." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-48", "text": "Studies have found the learned word vectors to capture linguistic regularities and to collapse similar words into groups (Mikolov et al., 2013b) ." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-49", "text": "Their utility in tasks such as sentiment classification (Kim, 2014 ) is well attested." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-50", "text": "Dependency-based Embeddings." 
}, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-51", "text": "Claims containing multiple clauses or propositions might be better distinguished with the help of dependency embeddings inferred from the respective proposition contexts." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-52", "text": "Consider the following claim from one of our data sets: \"The Governor said that he enjoyed it.\" In this claim, the main clause, \"The Governor said\", is the core proposition, which excludes consideration of the remainder." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-53", "text": "The reason is that \"said\" is a reporting predicate, so it is unnecessary to verify whether or not the governor really has enjoyed the object mentioned in the subordinate clause." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-54", "text": "In some other claims, it is the subordinate rather than the main clause predicate that decides the claim type." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-55", "text": "Park and Cardie (2014) extracted clause-specific features using the Stanford syntactic parser and the Penn Treebank." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-56", "text": "(Merely using clause tags without capturing dependencies for important clauses may not help much in distinguishing objective verifiable claims from unverifiable subjective ones.) Park and Cardie (2014) also used tense and person counts for distinguishing verifiable claims from unverifiable claims." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-57", "text": "We hypothesize that word2vec and dependency context-based embeddings can inherently capture these linguistic characteristics and can replace these features." 
}, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-59", "text": "Komninos and Manandhar (2016) have shown that dependency-based models produce word embeddings that better capture functional properties of words for question type classification and relation detection." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-60", "text": "Task-specific Embeddings." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-61", "text": "Compiling embeddings for the specific vocabulary present in the task data can also be helpful in a classification task." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-62", "text": "Tang et al. (2014) use enriched task-specific word embeddings and show improvement in a Twitter sentiment classification task." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-63", "text": "Park and Cardie (2014) compiled a speech-event lexicon containing the most frequent speech anchors (predicates such as \"said\" and \"wrote\") from MPQA 2.0, a corpus manually annotated for opinions and other private states." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-64", "text": "These anchors can help in correctly distinguishing verifiable claims from unverifiable ones when the propositions contain both objective and subjective expressions." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-65", "text": "In our work, we use factual embeddings learned from the labelled FactBank corpus (Saur\u00ed and Pustejovsky, 2009 ) containing various speech event predicates (see \u00a73.3)." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-66", "text": "Such factual embeddings could help in resolving various predicate ambiguities present in the argumentative propositions." 
}, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-67", "text": "----------------------------------" }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-68", "text": "**DEEP NEURAL NETWORKS FOR TEXT CLASSIFICATION**" }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-69", "text": "Deep neural networks, with or without word embeddings, have recently shown significant improvements over traditional machine learning-based approaches when applied to various sentence-and document-level classification tasks." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-70", "text": "Kim (2014) have shown that CNNs outperform traditional machine learning-based approaches on several tasks, such as sentiment classification, question type classification, and subjectivity classification, using simple static word embeddings and tuning of hyper-parameters." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-71", "text": "proposed character-level CNNs for text classification." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-72", "text": "Lai et al. (2015) and Visin et al. (2015) proposed recurrent CNNs, while Johnson and Zhang (2015) proposed semi-supervised CNNs for solving a text classification task." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-73", "text": "Tang et al. (2015) used a document classification approach based on recurrent neural networks (RNNs) and showed an improvement on a sentiment classification task." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-74", "text": "Palangi et al. (2016) proposed sentence embedding using an LSTM network for an information retrieval task." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-75", "text": "Zhou et al. (2016) proposed attention-based, bidirectional LSTM networks for a relation classification task." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-76", "text": "Augenstein et al. 
(2016) employed a weakly supervised conditional LSTM encoding approach to stance detection for unseen targets on Twitter stance detection data, and presented improved results." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-77", "text": "RNNs model text sequences effectively by capturing long-range dependencies among the words." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-78", "text": "Compared to CNN- and SVM-based approaches, LSTM-based approaches capture the word sequences in sentences more effectively." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-79", "text": "----------------------------------" }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-80", "text": "**CLAIM CLASSIFICATION**" }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-81", "text": "Here we present two deep learning-based methods for claim classification, the first of which uses CNNs and the second of which uses LSTMs." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-82", "text": "In \u00a73.3, we also show how different pre-trained distributional linguistic embeddings are incorporated into CNNs and LSTMs to improve the classification results." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-83", "text": "Collobert et al. (2011) adapted the original CNN proposed by LeCun and Bengio (1995) for modelling natural language sentences." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-84", "text": "Following Kim (2014), we present a variant of the CNN architecture with four layer types: an input layer, a convolution layer, a max pooling layer, and a fully connected softmax layer." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-85", "text": "Each claim in the input layer is represented as a sentence comprised of distributional word embeddings." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-86", "text": "Let v_i \u2208 R^k be the k-dimensional word vector corresponding to the i-th word in the sentence."
}, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-87", "text": "Then a sentence S of length is represented as the concatenation of its word vectors:" }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-88", "text": "----------------------------------" }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-89", "text": "**CNN-BASED CLAIM CLASSIFICATION**" }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-90", "text": "Word2vec embeddings which are learned using the bag-of-words representation of the contexts yield broad topical similarities, while using dependency-based contexts yields more functional similarities (Levy et al., 2015) ." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-91", "text": "In addition, with word2vec (E) embeddings, we use linguistically motivated pre-trained dependency embeddings (D) and task-specific factual embeddings (F) for capturing syntactic and functional regularities encoded in the propositions, in order to better distinguish different types of claims." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-92", "text": "To incorporate these linguistic embeddings at word level into the learning process, we extend the network as illustrated in Figure 1a ." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-93", "text": "Inspired by Baroni et al. (2012) 's supervised distributional concatenation method and a linguistically informed CNN (Ebert et al., 2015) , we concatenate word2vec (E), dependency (D), and factual (F) word embeddings corresponding to the ith input word into a merged vector #\u00bb c i \u2208 R k+m+n :" }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-94", "text": "where #\u00bb e i , #\u00bb d i , and #\u00bb f i represent, respectively, the concatenated word2vec, dependency, and factual embeddings corresponding to i th word in the sentence." 
}, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-95", "text": "In the final representation, every input claim from the data set is represented using combined word2vec and linguistic embeddings in the network as in Equation 1, where each #\u00bb v i = #\u00bb c i ." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-96", "text": "In the convolution layer, for a given word sequence within a claim, a convolutional word filter P is defined." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-97", "text": "Then, the filter P is applied to each word in the sentence to produce a new set of features." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-98", "text": "We use a non-linear activation function such as rectified linear unit (ReLU) for the convolution process and max-over-time pooling (Collobert et al., 2011; Kim, 2014) at pooling layer to deal with the variable claim size." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-99", "text": "After a series of convolutions with different filters with different heights, the most important features are generated." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-100", "text": "Then, this feature representation, Z, is passed to a fully connected penultimate layer and outputs a distribution over different labels:" }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-101", "text": "where y denotes a distribution over different claims labels, W is the weight vector learned from the stacked representation of all embeddings from the training corpus, and b is the bias term." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-102", "text": "----------------------------------" }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-103", "text": "**LSTM-BASED CLAIM CLASSIFICATION**" }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-104", "text": "In case of CNN, concatenating words with various window sizes, works as n-gram models but do not capture long-distance word dependencies with shorter window sizes." 
}, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-105", "text": "A larger window size can be used, but this may lead to data sparsity problem." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-106", "text": "In order to encode long-distance word dependencies, we use long short-term memory networks, which are a special kind of RNN capable of learning long-distance dependencies." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-107", "text": "LSTMs were introduced by Hochreiter and Schmidhuber (1997) in order to mitigate the vanishing gradient problem (Gers et al., 2000; Gers, 2001; Graves, 2013; Pascanu et al., 2013) ." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-108", "text": "The model illustrated in Figure 1b is composed of a single LSTM layer followed by an average pooling and a softmax regression layer." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-109", "text": "Each claim is represented as a sentence (S) in the input layer." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-110", "text": "Thus, from an input sequence, S i, j , the memory cells in the LSTM layer produce a representation sequence h i , h i+1 , . . . , h j ." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-111", "text": "This representation sequence is then averaged over all time steps, resulting in a final feature representation h. Finally, this representation is fed to a logistic regression layer to predict the claim labels for unseen input claims." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-112", "text": "As with the CNN architecture shown in the previous section, for each claim, we encode word2vec, dependency, and factual embeddings in the input layer into a variation of the standard LSTM network." 
}, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-113", "text": "As our results demonstrate, the LSTM encoder can effectively capture informative features from the concatenated embedding representation and classify different types of argumentative claims." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-114", "text": "----------------------------------" }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-115", "text": "**WORD EMBEDDINGS**" }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-116", "text": "In order to better capture the syntactic contexts of words and the factuality indicators of propositions, we employ two linguistically motivated word embeddings in addition to the usual word2vec ones: dependency-based embeddings, and factuality-and certainty-signalling emeddings." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-117", "text": "Word2vec Embeddings." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-118", "text": "We use word embeddings from word2vec which are learned using the skipgram model of Mikolov et." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-119", "text": "al (2013a,b) by predicting linear context words surrounding the target words." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-120", "text": "These word vectors are trained on about 100 billion words from a Google News corpus." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-121", "text": "As word embeddings alone have shown good performance in various classification tasks, we also use them in isolation, with varying dimensions, in our CNN and LSTM experiments." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-122", "text": "In the case of CNN, a word embedding size of 300, together with other network parameters, resulted in high accuracy on the claim verifiability data set." 
}, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-123", "text": "In the case of LSTM, word embeddings of size 300 also produced good accuracy on the claim verifiability data set." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-124", "text": "Dependency-based Word Embeddings." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-125", "text": "We use Levy and Goldberg's (2014) dependency-based word embeddings in our claim classification task." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-126", "text": "These embeddings are learned using dependency-based contexts from an English Wikipedia corpus containing about 175 000 words and over 900 000 distinct syntactic contexts." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-127", "text": "Dependency-based embeddings are encoded in the input layers of both our CNN and LSTM, as shown in Figure 1 ." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-128", "text": "Dependency embeddings of size 100 are concatenated with equally sized word2vec and factual embeddings, resulting in a 300-dimension concatenated embedding vector." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-129", "text": "Factuality-and Certainty-signalling Embeddings." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-130", "text": "We investigate the use of certainty-and factualityrelated distributed signals for distinguishing claims." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-131", "text": "In online argumentative discourse, claims often serve as implicit arguments with inappropriate or missing justification (Park and Cardie, 2014) ." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-132", "text": "The certainty and factuality signals present in such claims may be appropriate for determining its factuality or verifiability." 
}, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-133", "text": "As the claims in our data set are objective, subjective and factual types, predicates, adverbs and other modals (related to certainty and factuality) present in FactBank 1.0 may help in better distinguishing various types of claims." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-134", "text": "As an example, consider the sentence in Figure 2 , a complex claim of type \"verifiable non-experiential\"." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-135", "text": "The predicate \"seems\" and the modal verb \"must\" can be viewed as certainty and factuality information related to the speaker's commitment to their utterance." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-136", "text": "Factual embeddings of these co-occurrence indicators can help in better identifying the type of the claim." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-137", "text": "We compile these extra linguistic factual and certainty signals from FactBank (Saur\u00ed and Pustejovsky, 2009 ), a corpus annotated with factuality and certainty indicators very much similar to the word2vec embeddings." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-138", "text": "These annotations are basically related to certainty, possibility, and probability, with positive and negative polarities." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-139", "text": "We used the gensim (\u0158eh\u016f\u0159ek and Sojka, 2010) word2vec program to compile embeddings from FactBank." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-140", "text": "We compiled 300-dimensional factual embedding vectors for the words that appear at least five times in FactBank, and for rest of the vocabulary, embedding vectors are assigned uniform distribution in the range of [\u22120.25, 0.25] ." 
}, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-141", "text": "In our CNN and LSTM experiments, we integrate factual embeddings (denoted by F above)." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-142", "text": "We also concatenate factual embeddings with other dependency and word embeddings, as shown in Figure 1 ." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-143", "text": "----------------------------------" }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-144", "text": "**DATA SETS AND EXPERIMENTAL SETUP**" }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-145", "text": "----------------------------------" }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-146", "text": "**DATA SETS**" }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-147", "text": "Our experiments use the two claim data sets introduced in \u00a71, further details of which are given below." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-148", "text": "Factual and Feeling Debate Forum Posts (Walker et al., 2012) ." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-149", "text": "This corpus is compiled from the Internet Argument Corpus." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-150", "text": "It consists of quote-response pairs that are manually annotated according to whether the response is primarily a \"factual\"-or \"feeling\"-based argument." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-151", "text": "In our experiments, we use the training and test splits from Oraby et al. (2015) ; these consist of claims that can span multiple sentences." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-152", "text": "The annotation distribution for these splits is shown in Table 1 ." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-153", "text": "We also use a development set to tune the hyper-parameters of the model." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-154", "text": "(Park and Cardie, 2014) ." 
}, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-155", "text": "This corpus consists of 9476 manually annotated sentences and independent clauses from 1047 user comments extracted from the Regulation Room website." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-156", "text": "2 Park and Cardie (2014) and Park et al. (2015) used this corpus for examining each proposition with respect to its verifiability to determine the desirable types of support for the analysis of arguments." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-157", "text": "The propositions are manually annotated with three classes-\"verifiable experiential\", \"verifiable non-experiential\", and \"unverifiable\"-where the support types are evidence, optional evidence, and reason, respectively." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-158", "text": "The annotation distribution and our train/test splits are shown in Table 2 ." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-159", "text": "----------------------------------" }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-160", "text": "**VERIFIABLE AND UNVERIFIABLE USER COMMENTS**" }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-161", "text": "----------------------------------" }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-162", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-163", "text": "We model claim classification as a sentence classification task." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-164", "text": "We perform binary classification on the factual/feeling data set, and multi-class classification on the verifiability data set." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-165", "text": "We used Kim's (2014) Theano implementation of CNN for training the CNN model and a variant of the standard Theano implementation 3 for training the LSTM network." 
}, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-166", "text": "We initialized the word2vec, dependency, and factual embeddings in both the CNN and LSTM models." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-167", "text": "Unknown words from the pre-compiled embeddings were initialized randomly in the range [\u22120.25, 0.25] ." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-168", "text": "We updated all three embedding vectors during the training." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-169", "text": "We also produced a stacked embedding where all three types of embeddings, with dimensionality 100, were concatenated." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-170", "text": "In the CNN approach, we used a stochastic gradient descent-based optimization method for minimizing the cross entropy loss during the training with the Rectified Linear Unit (ReLU) non-linear activation function." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-171", "text": "Window filter sizes were set at [3, 4, 5] ." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-172", "text": "In the case of LSTM, model was trained using an adaptive learning rate optimizer, ADADELTA (Zeiler, 2012), over shuffled mini-batches with the sigmoid activation function at input, output and forget gates, and the tanh non-linear activation function at cell state." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-173", "text": "Tuning Hyper-parameters." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-174", "text": "We manually explored hyper-parameters such as drop-out (for avoiding over-fitting), and batch size and learning rates (for improving performance) on development sets of both data sets." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-175", "text": "We performed tuning on the verifiability development data set obtained by splitting the corpus into an 85% training set and a 15% development set." 
}, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-176", "text": "We tuned the hyper-parameters on a 20% development set obtained from Oraby et al. (2015) on the factual vs. feeling data set." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-177", "text": "We varied batch sizes (12-64), drop-out (0.1-0.6), embedding sizes (50-300), and learning rate (0.0001-0.001) on both data sets and across all embeddings." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-178", "text": "We obtained the best CNN performance with learning rate decay 0.95, batch size 50, drop-out 0.5, and embedding size 300." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-179", "text": "For LSTM, we got the best results with learning rate 0.001, drop-out 0.5, and embedding size 300 for both data sets; the optimal batch size was 24 for the verifiability data set but 32 for the factual vs. feeling data set." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-180", "text": "SVM Classification on the Factual vs. Feeling Data Set." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-181", "text": "SVM classifiers find the hyperplane that best discriminates between positive and negative instances (Cristianini and Shawe-Taylor, 2000) ." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-182", "text": "We used the SVM classifier SMO (Hall et al., 2009 ) from the DKPro TC framework (Daxenberger et al., 2014) for factual vs. feeling claims classification." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-183", "text": "Surface-level top k n-grams are used as features for building the model." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-184", "text": "We used uni-, bi-, and trigrams, and varied k from 500 to 5000." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-185", "text": "We obtained the best results with the top 500 n-gram features." 
}, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-186", "text": "----------------------------------" }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-187", "text": "**RESULTS AND ANALYSIS**" }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-188", "text": "We compare our methods with several state-of-the-art methods for claim classification, as described in \u00a72." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-189", "text": "In these tables, the highest accuracy values for precision, recall and F 1 measure are specified in bold font." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-190", "text": "Verifiability Data Set." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-191", "text": "Park and Cardie (2014) and Park et al. (2015) performed claim classification on this data set using SVM and CRF classifiers." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-192", "text": "The former classifier was found to yield better results." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-193", "text": "Both approaches employed various lexical and shallow semantic features." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-194", "text": "The authors also report baseline results using simple unigram features." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-195", "text": "We considered the SVM-based results 4 a baseline for comparison with ours." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-196", "text": "The results of our own experiments on the same data set, using CNN and LSTM methods together with the various embeddings mentioned in \u00a73.3, are shown in Table 3 ." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-197", "text": "We macro-averaged F 1 across all the classes." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-198", "text": "Using word embeddings alone in the CNN method, our results (70.47%) were comparable to those of the SVM (68.99%) and exceeded those of the CRF method (63.63%)." 
}, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-199", "text": "In a concatenated embeddings setting, CNN achieves 70.34% F 1 ." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-200", "text": "The LSTM performance is low when compared to the CNN approach, but comparable to the SVM-based approach." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-201", "text": "LSTM also performed better than the sequential CRF baseline." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-202", "text": "We computed train, validation, and test error rates with respect to the number of epochs during training for the CNN and LSTM approaches." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-203", "text": "In the case of LSTM, the best classification accuracy is obtained between 5 and 12 epochs, and in case of CNN, at between 5 and 20 epochs." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-204", "text": "Confusion matrices showing the assignments of our best-performing LSTM and CNN classifiers are shown in Tables 4 and 5 , respectively." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-205", "text": "Both classifiers show a similar pattern of errors." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-206", "text": "Verifiable experiential and non-experiential claims were not confused as much with each other as they were with unverifiable claims; this may be an artifact of the latter being the majority class." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-207", "text": "When unverifiable claims were misclassified, they were more likely to be labelled as verifiable non-experiential, suggesting that the vocabulary employed in the two classes of claims is similar." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-208", "text": "Factual vs. Feeling Claims Data Set." 
}, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-209", "text": "In this data set, claims can span more than one sentence, but we treat these as single sentences for the purposes of our experiments." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-210", "text": "Oraby et al. (2015) performed unsupervised claim classification on this data set using bootstrapped patterns from both unlabelled and labelled data and report accuracy (F 1 ) of 41.41%." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-211", "text": "They also report an F 1 of 64.98% for a na\u00efve Bayes supervised classifier using simple unigram and binary features." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-212", "text": "The focus of their experiment was to discover more factual-and feeling-related patterns from the unlabelled corpus using a small amount of labelled data." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-213", "text": "In our experiments, both the CNN (79.56% F 1 ) and the LSTM-based (75.10% F 1 ) methods using distributional embeddings show significant improvements over the na\u00efve Bayes and SVM-based approaches as shown in Table 6 ." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-214", "text": "CNN achieved good accuracy in all embeddings setting." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-215", "text": "Sequential LSTM's performance is not better than the CNN approach, but LSTM together with word2vec and factual embeddings performed better on this data set." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-216", "text": "Confusion matrices for our best LSTM and CNN classifiers are shown in Tables 7 and 8 , respectively." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-217", "text": "We manually examined those factual claims misclassified as feeling and found that they contained a relatively high proportion of personal pronouns, wh-questions, and negations." 
}, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-218", "text": "While these vocabulary terms are typically associated with feeling claims, they are missing from the factuality embeddings learned from FactBank." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-219", "text": "By contrast, when feeling claims were misclassified as factual, we found that they tend to contain several distinct propositions or clauses, only one of which was emotional in nature." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-220", "text": "Properly handling these type of claims would require modelling them with intrapropositional relations." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-221", "text": "----------------------------------" }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-222", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-223", "text": "In this paper, we presented LSTM-and CNN-based deep neural network methods leverging word2vec and linguistic embeddings, and applied these to argumentative claim classification on two data sets." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-224", "text": "On the data set of verifiable and unverifiable claims, our CNN approach using word2vec and concatenated embeddings has shown results comparable to those of a state-of-the-art, feature-rich, SVM-based method." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-225", "text": "When using an LSTM-based method, the accuracy was somewhat lower, but still better than a CRF." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-226", "text": "In this case, however, the concatenated embeddings were not any better than the individual ones." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-227", "text": "On the factual vs. 
feeling data set, our CNN-based method using word2vec and linguistic embeddings showed good improvements (over 14 percentage points in F 1 ) over the state-of-the-art Bayes classifier and a 9-point improvement over the SVM baseline, while the LSTM-based method using word2vec and factual embeddings yielded a 10-point improvement over the Bayes classifier and a 5-point improvement over SVM." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-228", "text": "The LSTM-based method using word2vec and factual embeddings performed better than using other embeddings." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-229", "text": "We also observed that the performance of sequential LSTM is lower than the CNN but better than the SVM baseline and the sequential CRF method described in prior work." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-230", "text": "Our methods are simpler than those described in prior work, and we have demonstrated that they generalize well across claim data sets." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-231", "text": "Our framework can also be easily adapted to other stacked embeddings to perform various sentence-and document-level classification tasks." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-232", "text": "In future work, we plan to investigate usage of richer linguistic embeddings, such as factual and word sense embeddings compiled from a larger corpus." }, { "sent_id": "8c202e3610599c9eee23724ef213de-C001-233", "text": "We may also consider incorporating inter-proposition predicate relations." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "8c202e3610599c9eee23724ef213de-C001-12" ], [ "8c202e3610599c9eee23724ef213de-C001-16" ], [ "8c202e3610599c9eee23724ef213de-C001-19" ], [ "8c202e3610599c9eee23724ef213de-C001-55", "8c202e3610599c9eee23724ef213de-C001-56" ], [ "8c202e3610599c9eee23724ef213de-C001-131" ], [ "8c202e3610599c9eee23724ef213de-C001-156" ] ], "cite_sentences": [ "8c202e3610599c9eee23724ef213de-C001-12", "8c202e3610599c9eee23724ef213de-C001-16", "8c202e3610599c9eee23724ef213de-C001-19", "8c202e3610599c9eee23724ef213de-C001-56", "8c202e3610599c9eee23724ef213de-C001-131", "8c202e3610599c9eee23724ef213de-C001-156" ] }, "@MOT@": { "gold_contexts": [ [ "8c202e3610599c9eee23724ef213de-C001-19", "8c202e3610599c9eee23724ef213de-C001-20", "8c202e3610599c9eee23724ef213de-C001-22" ] ], "cite_sentences": [ "8c202e3610599c9eee23724ef213de-C001-19" ] }, "@DIF@": { "gold_contexts": [ [ "8c202e3610599c9eee23724ef213de-C001-19", "8c202e3610599c9eee23724ef213de-C001-20", "8c202e3610599c9eee23724ef213de-C001-22" ] ], "cite_sentences": [ "8c202e3610599c9eee23724ef213de-C001-19" ] }, "@EXT@": { "gold_contexts": [ [ "8c202e3610599c9eee23724ef213de-C001-55", "8c202e3610599c9eee23724ef213de-C001-56", "8c202e3610599c9eee23724ef213de-C001-57" ] ], "cite_sentences": [ "8c202e3610599c9eee23724ef213de-C001-56" ] }, "@USE@": { "gold_contexts": [ [ "8c202e3610599c9eee23724ef213de-C001-153", "8c202e3610599c9eee23724ef213de-C001-154", "8c202e3610599c9eee23724ef213de-C001-155", "8c202e3610599c9eee23724ef213de-C001-156" ] ], "cite_sentences": [ "8c202e3610599c9eee23724ef213de-C001-156" ] } } }, "ABC_8ef47e16cd41aa3a606cf21c41adb7_13": { "x": [ { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-94", "text": "Unlike the AMI Corpus, the official reported results for the CCCS Shared Task were recall; thus, for LUNA Corpus the reported values are ROUGE-2 recall." 
}, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-95", "text": "For statistical significance testing, we use a paired bootstrap resampling method proposed in (Koehn, 2004) ." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-96", "text": "We create new virtual test sets of 15 conversations with random re-sampling 100 times." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-2", "text": "Summarization of spoken conversations is a challenging task, since it requires deep understanding of dialogs." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-97", "text": "For each set, we compute the ROUGE-2 score and compare the system performances using paired t-test with p = 0.05." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-98", "text": "----------------------------------" }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-99", "text": "**RESULTS**" }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-3", "text": "Abstractive summarization techniques rely on linking the summary sentences to sets of original conversation sentences, i.e. communities." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-4", "text": "Unfortunately, such linking information is rarely available or requires trained annotators." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-5", "text": "We propose and experiment automatic community creation using cosine similarity on different levels of representation: raw text, WordNet SynSet IDs, and word embeddings." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-6", "text": "We show that the abstractive summarization systems with automatic communities significantly outperform previously published results on both English and Italian corpora." 
}, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-7", "text": "----------------------------------" }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-9", "text": "Spoken conversation summarization is an important task, since speech is the primary medium of human-human communication." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-10", "text": "Vast amounts of spoken conversation data are produced daily in call-centers." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-11", "text": "Due to this overwhelming number of conversations, call-centers can only evaluate a small percentage of the incoming calls ." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-12", "text": "Automatic methods of conversation summarization have a potential to increase the capacity of the call-centers to analyze and assess their work." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-13", "text": "Earlier works on conversation summarization have mainly focused on extractive techniques." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-14", "text": "However, as pointed out in (Murray et al., 2010) and (Oya et al., 2014) , abstractive summaries are preferred to extractive ones by human judges." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-15", "text": "The possible reason for this is that extractive techniques are not well suited for the conversation summarization, since there are style differences between spoken conversations and humanauthored summaries." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-16", "text": "Abstractive conversation summarization systems, on the other hand, are mainly based on the extraction of lexical information (Mehdad et al., 2013; Oya et al., 2014) ." 
}, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-17", "text": "The authors cluster conversation sentences/utterances into communities to identify most relevant ones and aggregate them using word-graph models." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-18", "text": "The graph paths are ranked to yield abstract sentences -a template." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-19", "text": "And these templates are selected for population with entities extracted from a conversation." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-20", "text": "Thus the abstractive summarization systems are limited to these templates generated by supervised data sources." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-21", "text": "The template selection strategy in these systems leverages on the manual links between summary and conversation sentences." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-22", "text": "Unfortunately, such manual links are rarely available." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-23", "text": "In this paper we evaluate a set of heuristics for automatic linking of summary and conversations sentences, i.e. 'community' creation." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-24", "text": "The heuristics rely on the similarity between the two, and we experiment with the cosine similarity computation on different levels of representation -raw text, text after replacing the verbs with their WordNet SynSet IDs, and the similarity computed using distributed word embeddings." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-25", "text": "The heuristics are evaluated within the template-based abstractive summarization system of Oya et al. (2014) ." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-26", "text": "We extend this system to Italian using required NLP tools." 
}, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-27", "text": "However, the approach transparently extends to other languages with available WordNet, minimal supervised summarization corpus and running text." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-28", "text": "Heuristics are evaluated and compared on AMI meeting corpus and Italian LUNA Human-Human conversation corpus." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-29", "text": "The overall description of the system with the more detailed description of the heuristics is provided in Section 2." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-30", "text": "In Section 3 we describe the corpora, evaluation methodology and the commu- nity creation experiments." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-31", "text": "Section 4 provides concluding remarks and future directions." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-32", "text": "----------------------------------" }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-33", "text": "**METHODOLOGY**" }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-34", "text": "In this section we describe the conversation summarization pipeline that is partitioned into community creation, template generation, ranker training, and summary generation components." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-35", "text": "The whole pipeline is depicted in Figure 1 ." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-36", "text": "----------------------------------" }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-37", "text": "**TEMPLATE GENERATION**" }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-38", "text": "Template Generation follows the approach of (Oya et al., 2014) and, starting from human-authored summaries, produces abstract templates applying slot labeling, summary clustering and template fusion steps." 
}, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-39", "text": "The information required for the template generation are part-of-speech (POS) tags, noun and verb phrase chunks, and root verbs from dependency parsing." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-40", "text": "For English, we use Illinois Chunker (Punyakanok and Roth, 2001) to identify noun phrases and extract part-of-speech tags; and the the tool of (De Marneffe et al., 2006) for generating dependency parses." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-41", "text": "For Italian, on the other hand, we use TextPro 2.0 (Pianta et al., 2008) to perform all the Natural Language Processing tasks." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-42", "text": "In the slot labeling step, noun phrases from human-authored summaries are replaced by WordNet (Fellbaum, 1998 ) SynSet IDs of the head nouns (right most for English)." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-43", "text": "For a word, SynSet ID of the most frequent sense is selected with respect to the POS-tag." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-44", "text": "To get hypernyms for Italian we use MultiWordNet (Pianta et al., 2002) ." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-45", "text": "The clustering of the abstract templates generated in the previous step is performed using the WordNet hierarchy of the root verb of a sentence." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-46", "text": "The similarity between verbs is computed with respect to the shortest path that connects the senses in the hypernym taxonomy of WordNet." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-47", "text": "The template graphs, created using this similarity, are then clustered using the Normalized Cuts method (Shi and Malik, 2000) ." 
}, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-48", "text": "The clustered templates are further generalized using a word graph algorithm extended to templates in (Oya et al., 2014) ." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-49", "text": "The paths in the word graph are ranked using language models trained on the abstract templates and the top 10 are selected as a template for the cluster." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-50", "text": "----------------------------------" }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-51", "text": "**COMMUNITY CREATION**" }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-52", "text": "In the AMI Corpus, sentences in human-authored summaries are manually linked to a set of the sentences/utterances in the meeting transcripts, referred to as communities." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-53", "text": "It is hypothesized that a community sentence covers a single topic and conveys vital information about the conversation segment." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-54", "text": "For the automatic community creation we explore four heuristics." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-55", "text": "\u2022 H1 (baseline): take the whole conversation as a community for each sentence; \u2022 H2: The 4 closest turns with respect to cosine similarity between a summary and a conversation sentence." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-56", "text": "\u2022 H3: The 4 closest turns with respect to cosine similarity after replacing the verbs with WordNet SynSet ID." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-57", "text": "\u2022 H4: The 4 closest turns with respect to cosine similarity of averaged word embedding vectors obtained using word2vec for a turn." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-58", "text": "(Mikolov et al., 2013) ." 
}, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-59", "text": "The number of sentences selected for a community is set to 4, since it is the average size of the manual community in the AMI corpus." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-60", "text": "We use word2vec tool (Mikolov et al., 2013 ) for learning distributed word embeddings." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-61", "text": "For English, we obtained pre-trained word embeddings trained on a part of Google News data set (about 3 billion words) 1 ." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-62", "text": "The model contains 300-dimensional vectors for 3 million words and phrases." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-63", "text": "For Italian, we use the word2vec to train word embeddings on the Europarl Italian corpus (Koehn, 2005) 2 ." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-64", "text": "We empirically choose 300, 5, and 5 for the embedding size, window length, and word count threshold, respectively." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-65", "text": "----------------------------------" }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-66", "text": "**SUMMARY GENERATION**" }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-67", "text": "The first step in summary generation is the segmentation of conversations into topics using a lexical cohesion-based domain-independent discourse segmenter -LCSeg (Galley et al., 2003) ." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-68", "text": "The purpose of this step is to cover all the conversation topics." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-69", "text": "Next, all possible slot 'fillers' are extracted from the topic segments and are ranked with respect to their frequency in the conversation." 
}, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-70", "text": "An abstract template for a segment is selected with respect to the average cosine similarity of the segment and the community linked to that template." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-71", "text": "The selected template slots are filled with the 'fillers' extracted earlier." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-72", "text": "----------------------------------" }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-73", "text": "**SENTENCE RANKING**" }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-74", "text": "Since the system produces many sentences that might repeat the same information, the final set of automatic sentences is selected from these filled templates with respect to the ranking using the token and part-of-speech tag 3-gram language models." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-75", "text": "In this paper, different from (Oya et al., 2014) , the sentence ranking is based solely on the n-gram language models trained on the tokens and part-ofspeech tags from the human-authored summaries." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-76", "text": "----------------------------------" }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-77", "text": "**EXPERIMENTS AND RESULTS**" }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-78", "text": "We evaluate the automatic community creation heuristics on the AMI meeting corpus (Carletta et al., 2006) and Italian and English LUNA Human-Human corpora (Dinarelli et al., 2009 )." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-79", "text": "----------------------------------" }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-80", "text": "**DATA SETS**" }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-81", "text": "The two corpora used for the evaluation of the heuristics are AMI and LUNA." 
}, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-82", "text": "The AMI meeting corpus (Carletta et al., 2006 ) is a collection of 139 meeting records where groups of people are engaged in a 'roleplay' as a team and each speaker assumes a certain role in a team (e.g. project manager (PM))." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-83", "text": "Following (Oya et al., 2014) , we removed 20 dialogs used by the authors for development, and use the remaining dialogs for the threefold cross-validation." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-84", "text": "The LUNA Human-Human corpus (Dinarelli et al., 2009 ) consists of 572 call-center dialogs where a client and an agent are engaged in a problem solving task over the phone." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-85", "text": "The 200 Italian LUNA dialogs have been annotated with summaries by 5 native speakers (5 summaries per dialog)." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-86", "text": "For the Call Centre Conversation Summarization (CCCS) shared task a set of 100 dialogs was manually translated to English." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-87", "text": "The conversations are equally split into training and testing sets as 100/100 for Italian, and 50/50 for English." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-88", "text": "----------------------------------" }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-89", "text": "**EVALUATION**" }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-90", "text": "ROUGE-2 metric (Lin, 2004 ) is used for the evaluation." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-91", "text": "The metric considers bigram-level precision, recall and F-measure between a set of reference and hypothesis summaries." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-92", "text": "For AMI corpus, following (Oya et al., 2014) , we report ROUGE-2 F-measures on 3-fold cross-validation." 
}, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-93", "text": "For LUNA Corpus, on the other hand, we have used the modified version of ROUGE 1.5.5 toolkit from the CCCS Shared Task , which was adapted to deal with a conversation-dependent length limit of 7%." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-100", "text": "In this section we report on the results of the abstractive summarization system using the community creation heuristics described in Section 2." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-101", "text": "Following the Call-Center Conversation Summarization Shared Task at MultiLing 2015 , for LUNA Corpus (Dinarelli et al., 2009) we compare performances to three extractive baselines: (1) the longest turn in the conversation up to the length limit (7% of a conversation) (Baseline-L), (2) the longest turn in the first 25% of the conversation up to the length limit (Baseline-LB) (Trione, 2014) , and (3) Maximal Marginal Relevance (MMR) (Carbonell and Goldstein, 1998 ) with \u03bb = 0.7." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-102", "text": "For AMI corpus, on the other hand, we compare performances to the abstractive systems reported in (Oya et al., 2014) ." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-103", "text": "The performances of the heuristics on AMI corpus are given in Table 1 ." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-104", "text": "In the table we also report the performances of the previously published summarization systems that make use of the manual communities - (Oya et al., 2014) and (Mehdad et al., 2013) ; and our run of the system of (Oya et al., 2014) ." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-105", "text": "With manual communities we have Table 2 : ROUGE-2 recall with 7% summary length limit for the extractive baselines and abstractive summarization systems with the community creation heuristics on LUNA corpus." 
}, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-106", "text": "obtained average F-measure of 0.072." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-107", "text": "From the table, we can observe that all the systems with automatic community creation heuristics and the simplified sentence ranking described in Section 2 outperform the systems with manual communities." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-108", "text": "Among the heuristics, average word embeddingbased cosine similarity metric performs the best with average F-measure of 0.079." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-109", "text": "All the systems with automatic community creation heuristics (H2, H3, H4) perform significantly better than the system with manual communities." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-110", "text": "For Italian, the extractive baseline that selects the longest utterance from the first quarter of a conversation, is the strong baseline with ROUGE-2 recall of 0.027." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-111", "text": "It is not surprising, since the longest turn from the beginning of the conversation is usually a problem description, which appears in human-authored summaries." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-112", "text": "In the CCCS Shared Task, none of the submitted systems was able to outperform it." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-113", "text": "The system with a word embedding-based automatic community creation heuristic, however, achieves recall of 0.029, significantly outperforming it." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-114", "text": "Using word embeddings allow us to exploit monolingual data, which helps to avoid the problem of data sparsity encountered using WordNet, which allows for better communities on out-ofdomain data set and better coverage." 
}, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-115", "text": "This fact can account for the wider gap in performance between using H2 -H4 heuristics." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-116", "text": "For the 100 English LUNA dialogs, we observe the same pattern as for Italian dialogs and AMI corpus: the best performance is observed for the similarity using word embeddings (0.051)." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-117", "text": "However, for English LUNA, the best extractive baseline is weaker, as H2 and H3 heuristics are able to outperform it." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-118", "text": "The additional observation is that the performance for English is generally higher." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-119", "text": "Moreover, word embeddings provide larger boost on English LUNA." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-120", "text": "Whether this is due to the properties of Italian or the differences in the amount and domain of data used for training word embeddings is a question we plan to address in the future." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-121", "text": "We also observe that English WordNet gives a better lexical coverage than the Multilingual WordNet used for Italian." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-122", "text": "Thus, it becomes important to explore methods which does not rely on WordNet, as now the Italian system may be suffering from the data sparsity problem due to it." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-123", "text": "Overall, the heuristics with word embedding vectors perform the best on both corpora and across-languages." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-124", "text": "Consequently, we conclude that automatic community creation with word embedding for similarity computation is a good technique for the abstractive summarization of spoken conversations." 
}, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-125", "text": "----------------------------------" }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-126", "text": "**CONCLUSION**" }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-127", "text": "In this paper we have presented automatic community creation heuristics for abstractive spoken conversation summarization." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-128", "text": "The heuristics are based on the cosine similarity between conversation and summary sentences." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-129", "text": "The similarity is computed as different levels: raw text, text after verbs are replaces with WordNet SynSet IDs and average word embedding similarity." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-130", "text": "The heuristics are evaluated on AMI meeting corpus and LUNA human-human conversation corpus." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-131", "text": "The community creation heuristic based on cosine similarity using word embedding vectors outperforms all the other heuristics on both corpora, as well as it outperforms the previously published results." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-132", "text": "We have observed that the systems generally perform better on English; and the performance differences among heuristics is less for Italian." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-133", "text": "The Italian word embedding were trained on Europarl, that is much smaller in size than the data that was used to train English embeddings." }, { "sent_id": "8ef47e16cd41aa3a606cf21c41adb7-C001-134", "text": "In the future we plan to address these issues and train embeddings on a larger more diverse corpus." 
} ], "y": { "@MOT@": { "gold_contexts": [ [ "8ef47e16cd41aa3a606cf21c41adb7-C001-14" ] ], "cite_sentences": [ "8ef47e16cd41aa3a606cf21c41adb7-C001-14" ] }, "@BACK@": { "gold_contexts": [ [ "8ef47e16cd41aa3a606cf21c41adb7-C001-16" ] ], "cite_sentences": [ "8ef47e16cd41aa3a606cf21c41adb7-C001-16" ] }, "@USE@": { "gold_contexts": [ [ "8ef47e16cd41aa3a606cf21c41adb7-C001-25" ], [ "8ef47e16cd41aa3a606cf21c41adb7-C001-38" ], [ "8ef47e16cd41aa3a606cf21c41adb7-C001-83" ], [ "8ef47e16cd41aa3a606cf21c41adb7-C001-92" ], [ "8ef47e16cd41aa3a606cf21c41adb7-C001-102" ], [ "8ef47e16cd41aa3a606cf21c41adb7-C001-104" ] ], "cite_sentences": [ "8ef47e16cd41aa3a606cf21c41adb7-C001-25", "8ef47e16cd41aa3a606cf21c41adb7-C001-38", "8ef47e16cd41aa3a606cf21c41adb7-C001-83", "8ef47e16cd41aa3a606cf21c41adb7-C001-92", "8ef47e16cd41aa3a606cf21c41adb7-C001-102", "8ef47e16cd41aa3a606cf21c41adb7-C001-104" ] }, "@EXT@": { "gold_contexts": [ [ "8ef47e16cd41aa3a606cf21c41adb7-C001-48" ] ], "cite_sentences": [ "8ef47e16cd41aa3a606cf21c41adb7-C001-48" ] }, "@DIF@": { "gold_contexts": [ [ "8ef47e16cd41aa3a606cf21c41adb7-C001-75" ] ], "cite_sentences": [ "8ef47e16cd41aa3a606cf21c41adb7-C001-75" ] } } }, "ABC_c1eefe276c0ed46d7cd50f3f3bc3f3_13": { "x": [ { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-182", "text": "**CONCLUSIONS**" }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-2", "text": "The main goal of modelling human conversation is to create agents which can interact with people in both open-ended and goal-oriented scenarios." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-3", "text": "End-to-end trained neural dialog systems are an important line of research for such generalized dialog models as they do not resort to any situation-specific handcrafting of rules." 
}, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-4", "text": "Modelling personalization of conversation in such agents is important for them to be truly 'smart' and to integrate seamlessly into the lives of human beings." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-5", "text": "However, the topic has been largely unexplored by researchers as there are no existing corpora for training dialog systems on conversations that are influenced by the profiles of the speakers involved." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-6", "text": "In this paper, we present a new dataset of goal-oriented dialogs with profiles attached to them." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-7", "text": "We also introduce a framework for analyzing how systems model personalization in addition to performing the task associated with each dialog." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-8", "text": "Although no existing model was able to sufficiently solve our tasks, we provide baselines using a variety of learning methods and investigate in detail the shortcomings of an end-to-end dialog system based on Memory Networks." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-9", "text": "Our dataset and experimental code are publicly available at https://github.com/chaitjo/personalized-dialog ." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-10", "text": "----------------------------------" }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-11", "text": "**INTRODUCTION**" }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-33", "text": "Section 3 presents our modifications to the original dataset and Section 4 describes the various models that are trained on our tasks." 
}, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-12", "text": "The recent advances in memory and attention mechanisms for neural networks architectures have led to remarkable progress in machine translation (Bahdanau et al., 2014; Johnson et al., 2016) , question answering (Sukhbaatar et al. 2015 , Graves et al., 2016 and other language understanding tasks which require an element of logical reasoning." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-13", "text": "The main motivation for building neural network based systems over traditional systems for such tasks is that they do not require any feature engineering or domain-specific handcrafting of rules (Vinyals and Le, 2015) ." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-14", "text": "Conversation modelling is one such domain where end-to-end trained systems have matched or surpassed traditional dialog systems in both open-ended (Dodge et al., 2016) and goal-oriented applications (Bordes et al., 2017 )." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-15", "text": "An important yet unexplored aspect of dialog systems is the ability to personalize the bot's responses based on the profile or attributes of who it is interacting with (Serban et al., 2017) ." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-16", "text": "For example, a restaurant reservation system should ideally conduct dialog with the user to find values for variables such Figure 1 : Original bAbI dialog tasks." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-17", "text": "The user (in green) conducts a dialog with the bot (in blue) to reserve a table at a restaurant." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-18", "text": "At each turn, a model has access to the conversation history and the outputs from the API call (in light red) and must predict the next bot utterance or API call (in dark red)." 
}, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-19", "text": "(Illustration taken from Bordes et al., 2017) as location, type of cuisine and price range." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-20", "text": "It should then make recommendations based on these variables as well as certain fixed attributes about the user (dietary preference, favorite food items, etc.)." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-21", "text": "The register (or style) of the language used by the bot may also be influenced by certain characteristics of the user (age, gender, etc.) (Halliday et al., 1964) ." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-22", "text": "However, there are no open datasets which allow researchers to train end-to-end dialog systems where each conversation is influenced by a speaker's profile (Serban et al., 2017) ." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-23", "text": "With the ultimate aim of creating such a dataset, this paper aims to be an extension of the bAbI dialog dataset introduced by Bordes et al. (2017) ." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-24", "text": "Set in the domain of restaurant reservation, their synthetically generated dataset breaks down a conversation into several tasks to test some crucial capabilities that dialog systems should have." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-25", "text": "Taken together, the tasks can be used as a framework for the analysis of end-to-end dialog systems in a goal-oriented setting." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-26", "text": "Given a knowledge base (KB) of restaurants and their properties (location, type of cuisine, etc.), the aim of the dialog is to book a restaurant for the user." 
}, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-27", "text": "Full dialogs are divided into various stages, each of which tests if models can learn abilities such as implicit dialog state tracking, using KB facts in dialog, and dealing with new entities not appearing in dialogs from the training set." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-28", "text": "In this paper, we propose extensions to the first five tasks of the existing dataset." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-29", "text": "In addition to the goal of the original task, the dialog system must leverage a user's profile information to alter speech style and personalize reasoning over the KB." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-30", "text": "The end-goal is to make a restaurant reservation that is personalized to the user's attributes (dietary preference, favorite food items, etc.)." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-31", "text": "The synthetic nature of the bAbI dialog dataset and by extension, our work, makes it easy to construct a perfect handcrafted dialog system based on the same rules that were used to generate the dialogs." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-32", "text": "Hence, the goal here is not to improve the state of the art in this domain, but to analyze existing end-to-end goal-oriented dialog systems and to model personalization in such frameworks without handcrafting." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-34", "text": "Experimental results (Section 5) show that Memory Networks are an efficient model for goal-oriented dialog but are not able to reason over a KB or personalize conversation perfectly." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-35", "text": "Further work needs to be done on these aspects of end-to-end models in order to develop reliable systems for personalization in goal-oriented dialog." 
}, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-36", "text": "----------------------------------" }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-37", "text": "**RELATED WORK**" }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-38", "text": "This work builds upon the bAbI dialog dataset described in Bordes et al. (2017) , which is aimed at testing end-to-end dialog systems in the goal-oriented domain of restaurant reservations." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-39", "text": "Their tasks are meant to complement the bAbI tasks for text understanding and reasoning described in Weston et al. (2015b) ." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-40", "text": "In addition to the baselines provided, several new techniques proposed to solve these tasks are of interest to us (Perez and Liu, 2017; Seo et al., 2017; Eric and Manning, 2017) ." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-41", "text": "The closest work to ours is by Li et al. (2016b) who encoded speaker personas into SEQ2SEQ dialog models (Vinyals and Le, 2015; Li et al., 2016a) ." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-42", "text": "The model builds an embedding of a speaker's persona in a vector space based on conversation history of the speaker (for example, all of their tweets)." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-43", "text": "Our work differs from their investigation in the sense that 1) they are concerned with modelling speaker personas for solving the problem of consistent response generation in an open-ended dialog (chit-chat), whereas our work focuses on the personalization of a goal-oriented conversation and the ranking / discrimination of the correct candidate responses from a set of utterances, and 2) their dialog system needs to be provided with a speaker's conversation history to build the persona, while the user's attributes are explicitly provided to the bot for our task." 
}, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-44", "text": "Thus, models must compose user profiles through the possible values of each attribute." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-45", "text": "This is arguably a better representation of real-world learning scenarios where goal-oriented dialog agents can leverage information stored in databases to personalize conversations in domains such as customer care or restaurant reservation." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-46", "text": "----------------------------------" }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-47", "text": "**PERSONALIZED GOAL-ORIENTED DIALOG TASKS**" }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-48", "text": "We build upon the first five synthetically generated bAbI dialog tasks (T1-T5), where the goal is to book a table at a restaurant." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-49", "text": "The conversations are generated by a simulator (in the format shown in Figure 1 ) based on an underlying KB containing all the restaurants and their properties." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-50", "text": "Each restaurant is defined by a type of cuisine (10 choices, e.g., Italian, Indian), a location (10 choices, e.g. London, Tokyo), a price range (cheap, moderate or expensive), a party size (2, 4, 6 or 8 people) and a rating (from 1 to 8)." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-51", "text": "Each restaurant also has an address and a phone number." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-52", "text": "Making an API call to the KB returns a list of facts related to all the restaurants that satisfy the four parameters: location, cuisine, price range and party size." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-53", "text": "In addition to the user and bot utterances, dialogs in each task are comprised of API calls and the resulting facts." 
}, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-54", "text": "Conversations are generated using natural language patterns after randomly selecting each of the four required fields: location, cuisine, price range and party size." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-55", "text": "There are 43 patterns for the user and 15 for the bot (the user can say something in upto 4 different ways, while the bot only has one)." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-56", "text": "We make further additions to the KB and augment the bot utterance patterns for the creation of our tasks." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-57", "text": "In addition to fulfilling the original goal, the modified tasks also require the dialog system to personalize the conversation based on the user's profile, which is composed of various fixed attributes." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-58", "text": "To fit in with the synthetic nature of the bAbI dialog tasks, the personalization of the bot's speech style and KB reasoning are handcrafted to be extremely simplistic in comparison to real life situations." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-59", "text": "The details of our enhancements are described in the following sections." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-60", "text": "----------------------------------" }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-61", "text": "**ORIGINAL TASKS**" }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-62", "text": "Tasks 1 and 2 test the model's capability to implicitly track dialog state, Tasks 3 and 4 check if they can use KB facts in conversation." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-63", "text": "Task 3 also involves sorting through candidates and providing suggestions." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-64", "text": "Task 5 combines all tasks into a full dialog." 
}, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-65", "text": "----------------------------------" }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-66", "text": "**TASK 1: ISSUING API CALLS**" }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-67", "text": "Users define a query containing from 0 to 4 of the required fields (sampled uniformly)." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-68", "text": "The bot must ask questions to fill the missing fields and then generate the proper API call." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-69", "text": "Task 2: Updating API calls Starting by issuing an API call as in Task 1, users then ask to update their requests between 1 and 4 times (sampled uniformly)." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-70", "text": "The fields to update are selected randomly and the bot must then issue the updated API call." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-71", "text": "Task 3: Displaying Options Given a user request, the KB is queried by the corresponding API call and the resulting facts are added to the dialog history." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-72", "text": "The bot must sort the restaurants in the facts based on their ratings (from higher to lower) and propose a restaurant to the users until they accept." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-73", "text": "Users accept a suggestion 25% of the time or always if it is the last remaining one." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-74", "text": "Task 4: Providing extra information Given a user request for a randomly sampled restaurant, all KB facts related to the restaurant are added to the history and the dialog is conducted as if the user has agreed to book a table there." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-75", "text": "The user then asks for the address of the restaurant, its phone number or both (with probabilities 25%, 25% and 50% respectively)." 
}, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-76", "text": "The bot must learn to retrieve the correct KB fact from history." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-77", "text": "Task 5: Conducting full dialogs Conversations generated for Task 5 combine all the aspects of Tasks 1-4 into full dialogs." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-78", "text": "----------------------------------" }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-79", "text": "**USER PROFILES AND SPEECH STYLE CHANGES**" }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-80", "text": "The first aspect of personalization incorporated into all 5 tasks was the change in the style of the language used by the bot based on the user's gender (male or female) and age (young, middle-aged or elderly)." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-81", "text": "For each of the 15 bot utterance patterns in the original tasks, we created 6 new patterns for each (age, gender) profile permutation." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-82", "text": "Each of these patterns, while conveying the same information to the user, differed in tone, formality and word usage." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-83", "text": "Appendix A displays in a tabulated form all the original patterns and the 6 modified patterns associated with each of them." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-84", "text": "While creating the utterance patterns, importance was given to maintaining a consistent vocabulary for each of the 6 profiles." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-85", "text": "The levels of formality and precision of the words and language used by the bot increased with the age of the user." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-86", "text": "At the same time, word choices overlapped between the same gender and age group." 
}, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-87", "text": "For example, for a given bot utterance, the pattern for a (female, young) user is similar in formality and tone to the pattern for a (male, young) user and shares certain key words with both (male, young) and (female, middle-aged) user patterns." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-88", "text": "It is comparatively unrelated to the patterns of a (male, middle-aged) or (male, elderly) user." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-89", "text": "By creating such relationships between the profiles instead of having 6 completely distinct patterns, we wanted to test whether dialog models could learn to form associations between concepts such as formality, precision and word usage, and attributes such as gender and age." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-90", "text": "Applying our speech style changes to the bAbI dialog tasks, we obtained 6 versions of the same dialog associated with each user profile." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-91", "text": "Hence, we increased the size of each task by 6 folds." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-92", "text": "----------------------------------" }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-93", "text": "**KB UPDATES AND PERSONALIZED REASONING**" }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-94", "text": "The second aspect of personalization was restricted to Tasks 3 and 4, which involve reasoning over KB facts." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-95", "text": "To personalize the order in which restaurants are recommended by the bot in Task 3, we added 2 new attributes to the user's profile-dietary preference (vegetarian or non-vegetarian) and favorite food item (randomly sampled from a list of dishes associated with the cuisine in the dialog)." 
}, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-96", "text": "We created a duplicate for each restaurant in the KB, with an additional attribute for type of restaurant (vegetarian or non-vegetarian) to differentiate the otherwise same copies." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-97", "text": "For every restaurant in the modified KB, we also added the speciality attribute (randomly sampled from a list of dishes associated with the restaurant's cuisine)." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-98", "text": "When modifying each dialog in the original task, instead of sorting and proposing the restaurants solely on their rating, we used a score calculated as-rating (out of 8) + 8 (if restaurant type matches user's dietary preference) + 2.5 (if restaurant speciality matches user's favorite food item)." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-99", "text": "With such a metric, a Table 2 : Personalized reasoning over the KB." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-100", "text": "Columns 2 and 3 show excerpts from variants of the same dialog for two different user profiles." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-101", "text": "Row 2 contains the KB facts resulting from the same API call for both profiles." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-102", "text": "The bot's first suggestion is a vegetarian restaurant for the vegetarian user and a non-vegetarian one for the non-vegetarian user." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-103", "text": "In column 3, the bot gives priority to a lower rated restaurant specializing in pizza as the user's favorite food item is also pizza." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-104", "text": "Finally, the bot provides a social media link to the young user and the phone number to the elderly user when they ask for contact information." 
}, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-105", "text": "here is the information you asked for resto_paris_moderate_italian_5stars_1_phone vegetarian user will always be proposed all the vegetarian restaurants in the KB facts (in descending order of rating) before the bot suggests any non-vegetarian ones." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-106", "text": "Also, if a user's favorite item is pizza, a 6 star rated restaurant specializing in pizza will always be proposed before an 8 star rated restaurant specializing in pasta or risotto." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-181", "text": "----------------------------------" }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-107", "text": "This tests a model's ability to perform true/false reasoning based on the user's profile and implicitly rank restaurants depending on more than one condition." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-108", "text": "----------------------------------" }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-109", "text": "**LOCUTOR**" }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-110", "text": "Our modification to Task 4 requires the bot to retrieve a combination of KB facts related to a restaurant based on certain attributes of the user and the restaurant itself." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-111", "text": "In addition to the phone number and address, we added 3 new attributes (social media links, parking information and public transport information) to the KB entries for every restaurant." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-112", "text": "In each modified dialog, when a user asks for the contact information of the restaurant, the bot must return the restaurant's social media link if the user is young, or the phone number if the user is middle-aged or elderly." 
}, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-113", "text": "Similarly, when a user asks for the directions to the restaurant, the bot must return the address and the public transport information if the restaurant is cheap, or the address and the parking information if it is in the moderate or expensive price range." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-114", "text": "This tests a model's ability to personalize KB fact retrieval based on an attribute in the user's profile (age) or a choice made by the user during the dialog (the restaurant's price range)." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-115", "text": "----------------------------------" }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-116", "text": "**UPDATED DATASET**" }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-117", "text": "We applied our proposed modifications to bAbI dialog tasks T1-T5 to create the personalized dialog tasks PT1-PT5." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-118", "text": "In addition to the original goals of each task, all the modified tasks require the bot to personalize its speech style based on the user's profile." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-119", "text": "Additionally, PT3 and PT4 also test the bot's ability to personalize reasoning over a KB." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-120", "text": "PT5 combines all aspects of PT1-PT4 into full dialogs." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-121", "text": "For each dialog in all tasks, the values of the attributes of the user's profile (gender, age, dietary preference and favorite food item) are provided before the first turn of the dialog." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-122", "text": "The statistics of the datasets for each task are given in Table 3 , along with a comparison to the original bAbI dialog tasks." 
}, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-123", "text": "The size of the vocabulary has increased by almost four folds due to the various speech styles associated with user profiles." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-124", "text": "The number of possible candidate responses has increased ten fold due to the duplication of each restaurant in the KB and the speech style changes." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-125", "text": "We provide two variations for each task-a full set with all generated dialogs and a small set with only 1000 dialogs each for training, validation and testing to create realistic learning conditions." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-126", "text": "----------------------------------" }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-127", "text": "**MODELS**" }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-128", "text": "Following Bordes et al. (2017) , we provide baselines on the modified dataset by evaluating several learning methods: rule-based systems, supervised embeddings, and end-to-end Memory networks." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-129", "text": "----------------------------------" }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-130", "text": "**RULE-BASED SYSTEMS**" }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-131", "text": "Our tasks are generated by modifying and appending to the bAbI dialog tasks T1-T5." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-132", "text": "All dialogs are built with a rule based simulator and the bot utterances are completely deterministic." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-133", "text": "Thus, it is possible to create a perfect handcrafted system based on the same rules as the simulator, similar to the bAbI QA tasks of Weston et al. (2015b) ." 
}, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-134", "text": "As mentioned previously, the point of the tasks is not to improve the state of the art in restaurant reservation through handcrafted systems, but to analyze the strengths and weaknesses of machine learning algorithms." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-135", "text": "----------------------------------" }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-136", "text": "**SUPERVISED EMBEDDING MODELS**" }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-137", "text": "Although widely known for learning unsupervised embeddings from raw text like in Word2Vec (Mikolov et al., 2013) , embeddings can also be learned in a supervised manner specifically for a given task." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-138", "text": "Supervised word embedding models which score (conversation history, response) pairs have been shown to be a strong baseline for both open-ended and goal-oriented dialog (Dodge et al., 2016; Bordes et al., 2017) ." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-139", "text": "We do not handcraft any special embeddings for the user profiles." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-140", "text": "The embedding vectors are trained specifically for the task of predicting the next response given the previous conversation: a candidate response y is scored against the input (Dodge et al., 2016; Bordes et al., 2017) ." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-141", "text": "For dialogs, the entire conversation history is stored in the memory component of the model." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-142", "text": "It can be iteratively read from to perform reasoning and select the best possible responses based on the context." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-143", "text": "Implementing the modifications to the Memory Network architecture described by Bordes et al. 
(2017) , we use the model as an end-to-end baseline and analyze its performance." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-144", "text": "The user profile information is stored in the memory of the model as if it were the first turn of the conversation history spoken by the user, i.e. the model builds an embedding of the profile by combining the values of the embeddings of each attribute in the profile." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-145", "text": "Unlike Bordes et al. (2017), we do not make use of any match type features." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-146", "text": "Our goal is to analyse the capabilities of the existing Memory Network model to leverage profile information when conducting dialog." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-147", "text": "Appendix B shows illustrative examples of Memory Network predictions based on the experiments described in the following section." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-148", "text": "----------------------------------" }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-149", "text": "**EXPERIMENTS**" }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-150", "text": "We report per-response accuracy (i.e. the percentage of responses in which the correct candidate is chosen out of all possible ones) across all the models and tasks in Table 4 ." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-151", "text": "The rows show tasks PT1-PT5 and columns 2-4 give the accuracy for each of the models." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-152", "text": "The hyperparameters for the models were optimized on the validation sets (values are provided in Appendix C)." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-153", "text": "Table 4 : Test results across all models and tasks." 
}, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-154", "text": "For Memory Networks, the first number is the accuracy on the full set of dialogs for each task and the number in parenthesis is the accuracy on the small set (with 1000 dialogs)." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-155", "text": "All other models were evaluated on the full set only." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-156", "text": "As expected, handcrafted rule-based systems outperformed all machine learning models and solved all 5 tasks perfectly." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-157", "text": "However, it is important to note that building a rule-based system for real conversations is not easy-our tasks use a restricted vocabulary and fixed speech patterns." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-158", "text": "Compared to results reported on the bAbI dialog tasks in Bordes et al. (2017) , supervised embeddings performed significantly worse on the modified tasks." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-159", "text": "The model was unable to complete any of the tasks successfully and had extremely low per-response accuracy for PT2-5." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-160", "text": "In contrast, the same model reported 100 percent accuracy on bAbI dialog task T1 and above 55 percent on the other tasks." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-161", "text": "We attribute this drop in performance to the increased complexity of our tasks due to the four fold increase in vocabulary and the ten fold increase in candidate set size." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-162", "text": "Memory Networks substantially outperformed supervised embeddings for all tasks." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-163", "text": "They completed PT1 and PT2 (issuing and updating API calls) with a very high degree of accuracy." 
}, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-164", "text": "This indicates that the model is able to implicitly track dialog state and personalize the bot's utterance based on the user's profile." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-165", "text": "However, visualizations of the parts of the memory that the model read from at each turn of the dialog indicate that the Memory Network architecture is not ideal for personalization of speech style." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-166", "text": "We provide examples of such visualizations in Appendix B. Results on PT3-PT5 suggest that Memory Networks were unable to use KB facts in conversation reliably." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-167", "text": "Analysis of the model's memory show that it fails to interpret knowledge about entities and link it to the attributes of a user's profile." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-168", "text": "Restricting Memory Networks to only 1000 dialogs for training did not lead to any significant decreases in accuracy except for PT5." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-169", "text": "This indicates an issue with the learning mechanism as providing the model with more data did not result in any advantages in terms of solving the individual tasks." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-170", "text": "----------------------------------" }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-171", "text": "**MULTI-TASK LEARNING**" }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-172", "text": "We also analyzed the Memory Network architecture in a multi-task learning scenario for conducting full dialog-we trained individual profile-specific models for each of the 6 profile permutations for speech style changes (described in Section 3.2), and compared their performance to a single multi-profile model." 
}, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-173", "text": "Each of the profile-specific models were trained on 1000 full dialogs between the bot and a user with the corresponding age and gender combination." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-174", "text": "The multi-profile model was trained on 1000 full dialogs from the PT5 small set, which contains dialogs with all 6 user profiles." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-175", "text": "For each profile, we report per-response accuracies for both the profile-specific and multi-profile model on 1000 test dialogs (with users having the same profile) in Table 5 ." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-176", "text": "The profile-specific models always outperformed the multi-profile model." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-177", "text": "However, it is worth noting that the multi-profile model had six times fewer training examples for any given profile and still performed only marginally worse (2%-3%)." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-178", "text": "This indicates that training a single model on dialogs with multiple profiles which share logic and vocabulary is an effective learning strategy." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-179", "text": "An ensemble model may be superior in realistic learning conditions where obtaining sufficient data to train individual models is expense." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-180", "text": "Table 12 in Appendix B illustrates the differences between the two types of models." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-183", "text": "This paper aims to bridge a gap in research on neural conversational agents by introducing a new open dataset of goal-oriented dialogs with user profiles associated with each dialog." 
}, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-184", "text": "The dataset acts as a testbed for the training and analysis of end-to-end goal-oriented conversational agents which must personalize their conversation with the user based on attributes in the user's profile." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-185", "text": "As this work builds on top of the bAbI dialog dataset proposed by Bordes et al. (2017) , crucial aspects of goal-oriented conversation have been split into various synthetically generated tasks to evaluate the strengths and weaknesses of models in a systematic way before applying them on real data." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-186", "text": "We demonstrated how to use our tasks to break down one such system, end-to-end Memory Networks." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-187", "text": "The model was unable to sufficiently perform reasoning or personalization to solve the tasks, indicating that further work needs to be done on learning methods for these aspects." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-188", "text": "Despite the scenarios and language of the tasks being artificial, we believe that building mechanisms that can solve them is a reasonable starting point towards the development of sophisticated dialog systems in domains such as restaurant reservation, customer care or personal assistants." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-189", "text": "We hope that future research in this field will focus on developing better models to solve our tasks and on releasing datasets with real human -bot dialogs influenced by speaker profiles." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-190", "text": "At any turn of the dialog, the Memory Network stores the conversation history in its memory and, based on the user's input for that turn, pays attention to specific utterances from the memory." 
}, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-191", "text": "It can iteratively reason over the memory and uses a weighted combination of these utterances to predict the bot's response to the user." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-192", "text": "In our visualization, we take the model state at a specific turn in the conversation and highlight the values of the attention weights over the memory for each iteration (called a hop)." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-193", "text": "----------------------------------" }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-194", "text": "**C HYPERPARAMETERS**" }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-195", "text": "Tables 13 and 14 display the hyperparameters used to train the best performing models for each task." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-196", "text": "Table 11 : Modified task 4 (Providing information)." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-197", "text": "The model directs its attention to all the KB facts that it may need to provide but does not focus on the user profile sufficiently." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-198", "text": "Instead, it also attends to its own final utterance before the turn, which may have helped it judge the user's age instead of using the profile." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-199", "text": "It correctly predicts that it has to display the social media information instead of the phone number for the young user, but provides the information for the wrong restaurant." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-200", "text": "Bordes et al. (2017) claim that 'embeddings mix up the information and make it hard to distinguish between different KB entities, making answering correctly very hard.' They overcome this problem by using match type features to emphasize entities that appear in the conversation history." 
}, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-201", "text": "We are yet to implement this technique and consider PT4 to not be sufficiently solved by Memory Networks." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-202", "text": "Predicted answer here you go resto_rome_cheap_indian_7stars_2_social_media Table 12 : Predictions of multi-profile model versus profile-specific model." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-203", "text": "For the chosen profile (female, middle-aged), the multi-profile model attends to the user's profile, greeting and incomplete inquiry to modify its speech style and ask for the missing field." }, { "sent_id": "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-204", "text": "The profile-specific model does not need to perform such personalization, and thus has a narrow focus." } ], "y": { "@BACK@": { "gold_contexts": [ [ "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-14" ], [ "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-138" ] ], "cite_sentences": [ "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-14", "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-138" ] }, "@USE@": { "gold_contexts": [ [ "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-19" ], [ "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-128" ], [ "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-140" ], [ "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-143" ] ], "cite_sentences": [ "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-19", "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-128", "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-140", "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-143" ] }, "@EXT@": { "gold_contexts": [ [ "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-23" ], [ "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-38" ], [ "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-185" ] ], "cite_sentences": [ "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-23", "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-38", "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-185" ] }, "@DIF@": { "gold_contexts": [ [ "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-145" ], [ "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-158" ] ], "cite_sentences": [ "c1eefe276c0ed46d7cd50f3f3bc3f3-C001-145", 
"c1eefe276c0ed46d7cd50f3f3bc3f3-C001-158" ] } } }, "ABC_1f77b780c98093cd85966243471a1d_13": { "x": [ { "sent_id": "1f77b780c98093cd85966243471a1d-C001-71", "text": "**METHOD**" }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-2", "text": "We describe a strategy for the acquisition of training data necessary to build a social-media-driven early detection system for individuals at risk for (preventable) type 2 diabetes mellitus (T2DM)." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-3", "text": "The strategy uses a game-like quiz with data and questions acquired semi-automatically from Twitter." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-4", "text": "The questions are designed to inspire participant engagement and collect relevant data to train a public-health model applied to individuals." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-5", "text": "Prior systems designed to use social media such as Twitter to predict obesity (a risk factor for T2DM) operate on entire communities such as states, counties, or cities, based on statistics gathered by government agencies." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-6", "text": "Because there is considerable variation among individuals within these groups, training data on the individual level would be more effective, but this data is difficult to acquire." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-7", "text": "The approach proposed here aims to address this issue." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-8", "text": "Our strategy has two steps." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-9", "text": "First, we trained a random forest classifier on data gathered from (public) Twitter statuses and state-level statistics with state-of-the-art accuracy." 
}, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-10", "text": "We then converted this classifier into a 20-questions-style quiz and made it available online." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-11", "text": "In doing so, we achieved high engagement with individuals that took the quiz, while also building a training set of voluntarily supplied individual-level data for future classification." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-12", "text": "----------------------------------" }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-13", "text": "**INTRODUCTION**" }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-14", "text": "Data collection in the public health domain is difficult due to privacy concerns and low engagement." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-15", "text": "For example, people seldom engage with surveys that require them to report their height and weight." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-16", "text": "However, such data is crucial for training automated public health tools, such as algorithms that detect risk for (preventable) type 2 diabetes mellitus (T2DM, henceforth diabetes)." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-17", "text": "We propose a semiautomated data collection algorithm for obesity detection that mitigates these issues with a game-like quiz that is automatically bootstrapped from a machine-learning model trained over social media data." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-18", "text": "The resulting quiz is nonintrusive, focusing on \"fun\" questions about food and food language while avoiding personal questions, which leads to high engagement." 
}, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-19", "text": "We believe this idea contributes to addressing one of the most challenging unsolved public health problems: the high rate of chronic illness resulting from modifiable risk factors such as poor diet quality and physical inactivity." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-20", "text": "It is estimated that more than 86 million Americans over the age of 20 exhibit signs of pre-diabetes, and 70% of these pre-diabetic individuals will eventually develop T2DM, a chronic and debilitating disease associated with heart disease, stroke, blindness, kidney failure, and amputations (National Center for Health Statistics, 2014; American Diabetes Association and others, 2008) ." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-21", "text": "In the United States, the estimated cost of T2DM rose to $245 billion in 2012 (Association, 2013 )." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-22", "text": "Yet, 90% of these individuals at high risk are not aware of it (Li et al., 2013) ." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-23", "text": "Our long-term goal is to develop tools that automatically classify overweight individuals (hence at risk for T2DM 1 ) 1 In the CDC diabetes questionnaire available at http://www.cdc.gov/diabetes/prevention/ pdf/prediabetestest.pdf, overweight BMI contributes more than half of the points associated with diabetes risk." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-24", "text": "using solely public social media information." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-25", "text": "The advantage of such an effort is that the resulting tool provides non-intrusive and cost-effective means to detect and warn at-risk individuals early, before they visit a doctor's office, and possibly influence their decision to visit a doctor." 
}, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-26", "text": "Previous work has demonstrated that intervention by social media has modest but significant success in decreasing obesity (Ashrafian et al., 2014) ." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-27", "text": "Furthermore, there is good evidence that detecting communities at risk using computational models trained on social media data is possible (Fried et al., 2014; Culotta, 2014) ." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-28", "text": "However, in all cases, classification is made on aggregated data from cities, counties, or states, so these models are not immediately applicable to the task of classifying individuals." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-29", "text": "Our work takes the first steps towards transferring a classification model that identifies communities that are more overweight than average to classifying overweight (and thus at-risk for T2DM) individuals." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-30", "text": "The contributions of our work are:" }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-31", "text": "1. We introduce a random-forest (RF) model that classifies US states as more or less overweight than average using only 7 decision trees with a maximum depth of 3." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-32", "text": "Despite the model's simplicity, it outperforms Fried et al. (2014) 's best model by 2% accuracy." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-33", "text": "2." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-34", "text": "Using this model, we introduce a novel semi-automated process that converts the decision nodes in the RF model into natural language questions." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-35", "text": "We then use these questions to implement a quiz that mimics a 20-questions-like game." 
}, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-36", "text": "The quiz aims to detect if the person taking it is overweight or not based on indirect questions related to food or use of food-related words." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-37", "text": "To our knowledge, we are the first to use a semiautomatically generated quiz for data acquisition." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-38", "text": "3. We demonstrate that this quiz serves as a non-intrusive and engaging data collection process for individuals 2 ." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-39", "text": "The survey was posted online and evaluated with 945 participants, of whom 926 voluntarily provided supplemental data, such as information necessary to compute the Body Mass Index (BMI), demographics, and Twitter handle, demonstrating excellent engagement." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-40", "text": "The randomforest model backing the survey agreed with self-reported BMI in 78.7% of cases." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-41", "text": "More importantly, the differences prompted a spirited Reddit discussion, again supporting our hypothesis that this quiz leads to higher participant engagement 3 ." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-42", "text": "This initial experiment suggests that it is possible to use easy-to-access community data to acquire training data on individuals, which is much more expensive to obtain, yet is fundamental to building individualized public health tools." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-43", "text": "The anonymized data collected from the quiz is publicly available." 
}, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-44", "text": "----------------------------------" }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-45", "text": "**PRIOR WORK**" }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-46", "text": "Previous work has used social media to detect events, including monitoring disasters (Sakaki et al., 2010) , clustering newsworthy tweets in real-time (McCreadie et al., 2013; Petrovi\u0107 et al., 2010) , and forecasting popularity of news (Bandari et al., 2012) ." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-47", "text": "Social media has also been used for exploring people's opinions towards objects, individuals, organizations and activities." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-48", "text": "For example, Tumasjan et al. (2010) and O'Connor et al. (2010) have applied sentiment analysis on tweets and predicted election results." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-49", "text": "Hu and Liu (2004) analyzed restaurant ratings based on online reviews, which contain both subjective and objective sentences." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-50", "text": "Golder and Macy (2011) and Dodds et al. (2011) are interested in the temporal changes of mood on social media." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-51", "text": "Mysl\u00edn et al. (2013) focus on understanding the perception of emerging tobacco products by analyzing tweets." 
}, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-52", "text": "Social media, especially Twitter, has been recently utilized as a popular source of data for public health monitoring, such as tracking diseases (Ginsberg et al., 2009; YomTov et al., 2014; Nascimento et al., 2014; Greene et al., 2011; Chew and Eysenbach, 2010) , mining drug-related adverse events (Bian et al., 2012) , predicting postpartum psychological changes in new mothers (De Choudhury et al., 2013) , and detecting life satisfaction (Schwartz et al., 2013) and obesity (Chunara et al., 2013; Cohen-Cole and Fletcher, 2008; Fernandez-Luque et al., 2011) ." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-53", "text": "We focus our attention on the language of food on social media to identify overweight communities and individuals." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-54", "text": "In the last couple of years, several variants of this problem have been considered (Fried et al., 2014; Abbar et al., 2015; Culotta, 2014; Ardehaly and Culotta, 2015) ." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-55", "text": "food-related tweets and use it to predict several population characteristics, namely diabetes rate, overweight rate and political tendency." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-56", "text": "Generally, they use state-level populations, e.g., one of their classification tasks is to label whether a state is more overweight than the national median." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-57", "text": "Overweight rate is the percentage of adults whose Body Mass Index (BMI) is larger than a normal range defined by NIH." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-58", "text": "The classification task is to label whether a state is more overweight than the national median." 
}, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-59", "text": "Individuals' tweets are localized at state level as a single instance to train several classifier models, and the performance of models is evaluated using leave-one-out cross-validation." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-60", "text": "Importantly, Fried et al. (2014) train and test their models on communities rather than individuals, which limits the applicability of their approach to individualized public health." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-61", "text": "Abbar et al. (2015) also used aggregated information for predicting obesity and diabetes statistics." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-62", "text": "They considered energy intake based on caloric values in food mentioned on social media, demographic variables, and social networks." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-63", "text": "This paper begins to address individual predictions, based on the simplifying assumption that all individuals can be labeled based on the known label of their home county, e.g., all individuals in an overweight county are overweight, which is less than ideal." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-64", "text": "In contrast, our work collects actual individual information through the survey derived from community information." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-65", "text": "Even though performing classification at state or county granularity tends to be robust and accurate (Fried et al., 2014) , characteristics that are specific to individuals are more meaningful and practical." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-66", "text": "A wave of computational work on the automatic identification of latent attributes of individuals has recently emerged." 
}, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-67", "text": "Ardehaly and Culotta (2015) utilize label regularization, a lightly supervised learning method, to infer latent attributes of individuals, such as age and ethnicity." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-68", "text": "Other efforts have focused on inferring the gender of people on Twitter (Bamman et al., 2014; Burger et al., 2011) or their location on the basis of the text in their tweets (Cheng et al., 2010; Eisenstein et al., 2010) ." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-69", "text": "These are exciting approaches, but it is unlikely they will perform as well as a fully supervised model, which is the ultimate goal of our work." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-70", "text": "----------------------------------" }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-72", "text": "Fried et al. (2014) showed that states and large cities generate a considerable number of food-related tweets, which can be used to infer important information about the respective community, such as overweight status or diabetes risk." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-73", "text": "In an initial experiment, we tested this classifier on the identification of overweight individuals." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-74", "text": "This classifier did not perform better than chance, likely due to the fact that individuals have a much sparser social media presence than entire communities (most tweeters post hundreds of tweets, not millions, and rarely directly about food)." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-75", "text": "This convinced us that a realistic public health tool that identifies individuals at risk must be trained on individual data directly, in order to learn to take advantage of the specific signal available." 
}, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-76", "text": "We describe next the process through which we acquire such data." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-77", "text": "Quiz Generation" }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-78", "text": "----------------------------------" }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-79", "text": "**DATA COLLECTION**" }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-80", "text": "Figure 1: Architecture of the semi-automatic approach for quiz generation from social media data." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-81", "text": "----------------------------------" }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-82", "text": "**AN INTERPRETABLE MODEL FOR COMMUNITY CLASSIFICATION**" }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-83", "text": "Our main data-collection idea is to use a playful 20-questions-like survey, automatically generated from a community-based model, which can be widely deployed to acquire training data on individuals." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-84", "text": "Our approach is summarized in Figure 1 ." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-85", "text": "The first step is to develop an interpretable predictive model that identifies communities that are more overweight than average, in a way that can be converted into fun, engaging natural language questions." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-86", "text": "To this end, we started with the same settings as Fried et al. (2014) : we used the 887,310 tweets they collected which were localizable to a specific state and contained at least one relevant hashtag, such as #breakfast or #dinner." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-87", "text": "Each state was assigned a binary label (more or less overweight than the median) by comparing the percentage of overweight adults against the median state." 
}, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-88", "text": "For each state, we extracted features based on unigram (i.e., single) words and hashtags from all the above tweets localized to the corresponding state." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-89", "text": "To mitigate sparsity, we also included topics generated using Latent Dirichlet Allocation (LDA) (Blei et al., 2003) and all tweets collected by Fried et al. For example, one of the generated topics contains words that approximate the standard American diet (e.g., chicken, potatoes, cheese, baked, beans, fried, mac), which has already been shown to correlate with higher overweight and T2DM rates (Fried et al., 2014 Figure 2 : A decision tree from the random forest classifier trained using state-level Twitter data." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-90", "text": "motivation for this decision was interpretability: as shown below, decision trees can be easily converted into a series of if . . . then . . . else . . ." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-91", "text": "statements, which form the building blocks of the quiz." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-92", "text": "To minimize the number of questions, we trained a random forest with 7 trees with maximum depth of 3, and we ignored tokens that appear fewer than 3 times in the training data." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-93", "text": "These parameter values were selected to make the quiz of reasonable length." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-94", "text": "We aimed at 20 questions, as in the popular \"20 questions\" game, in which one player must guess what object the other is thinking of by asking 20 or fewer yes-or-no questions." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-95", "text": "Further tuning confirmed that a small number of shallow trees are most effective in accurately partitioning the state-level data." 
}, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-96", "text": "To further increase the interpretability of the model, word and hashtag counts were automatically discretized into three bins (e.g., infrequent, somewhat frequent, and very frequent) based on the quantiles of the training data." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-97", "text": "Figure 2 illustrates one of the decision trees in the trained random forest, with 0 standing for the infrequent, 1 for the somewhat frequent, and 2 for the very frequent bin of the corresponding word or hashtag." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-98", "text": "The figure highlights that the tree is immediately interpretable." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-99", "text": "For example, the leftmost branch indicates that a state is classified as overweight if its tweets mention the word \"fruit\" infrequently or somewhat frequently (f ruit > 1), and the hashtag \"#cook\" appears infrequently (#cook > 0)." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-100", "text": "5 A state with infrequent mention of the word fruit would take the left branch, then test for the frequency of #cook. If this is not an infrequent token, then the classifier tests for curry; very frequent use of curry would lead to an \"overweight\" classification (relative to the median state)." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-101", "text": "----------------------------------" }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-102", "text": "**QUIZ**" }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-103", "text": "We next manually converted all decision statements in the random forest classifier into natural language questions." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-104", "text": "The main assumption behind this process is that language use parallels actual behavior, e.g., a person who talks about fruit on social media will also eat fruit in real life." 
}, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-105", "text": "This allowed us to produce more intuitive questions, such as How often do you eat fruit? for the top node in Figure 2, Figure 2 ." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-106", "text": "----------------------------------" }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-107", "text": "**MODEL ACCURACY**" }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-108", "text": "Majority baseline 50.89 SVM (Fried et al., 2014) 80.39 RF (food + hashtags) 82.35 Discretized RF (food + hashtags) 78.43 Table 2 : Random forest (RF) classifier performance on state-level data relative to majority baseline and Fried et al. (2014) 's best classifier." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-109", "text": "We include two versions of our classifier: the first keeps numeric features (e.g., word counts) as is, whereas the second discretizes numeric features to three bins." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-110", "text": "of How often do you mention \"fruit\" in your tweets?" }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-111", "text": "Table 1 shows the questions and corresponding answers we used for the three left-most decision nodes in Figure 2 ." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-112", "text": "Conversion to natural language questions was as consistent as possible." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-113", "text": "For example, whenever the relevant feature's word was a food name x, the question would be formulated as \"How often do you eat x?\" with an accompanying picture of the food named." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-114", "text": "When the relevant word was not a food (such as hot or supper) or a topic (such as the cluster containing diner, bacon, omelette, etc.), the question was formulated in terms of proportion of meals rather than frequency." 
}, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-115", "text": "In all, we generated 33 questions that cover all decision nodes in the random forest classifier." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-116", "text": "However, when taking the quiz, each individual participant answered between 12 and 24 questions, depending on their answers and the corresponding traversal of the decision trees." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-117", "text": "This quiz serves to gather training data, which will be used in future work to train a supervised model for the identification of individuals at risk." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-118", "text": "To our knowledge, this approach is a novel strategy for quiz generation, and it serves as an important stepping-stone toward our goal of building individualized public health tools driven by social media." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-119", "text": "With respect to data retention, we collect (with the permission of the participants) the following additional data to be used for future research: height, weight, sex, location, age, and social media handles for Twitter, Instagram, and Facebook." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-120", "text": "We only downloaded public posts using these handles." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-121", "text": "This data (specifically height and weight) is also immediately used to compute the participant's BMI, to verify whether the classifier was correct." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-122", "text": "identical experimental settings as (Fried et al., 2014) , i.e., leave-one-out-cross-validation on the 50 states plus the District of Columbia." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-123", "text": "The table shows that our best model performs 2% better than the best model of (Fried et al., 2014) ." 
}, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-124", "text": "Our second classifier, which used discretized numeric features and was the source of the quiz, performed 2% worse, but it still had acceptable accuracy, nearing 80%." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-125", "text": "As discussed earlier, this discretization step was necessary to create intelligible Likert-scaled questions (Likert, 1932) ." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-126", "text": "----------------------------------" }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-127", "text": "**EMPIRICAL RESULTS**" }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-128", "text": "----------------------------------" }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-129", "text": "**EVALUATION OF RANDOM FOREST CLASSIFIER**" }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-130", "text": "----------------------------------" }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-131", "text": "**QUIZ RESPONSE**" }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-132", "text": "Many of the 945 participants were highly engaged with the quiz; 97.9% volunteered demographic information at the end of the quiz." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-133", "text": "Many of the participants also left feedback, some on the Reddit page linking to the quiz, as shown in Figure 4 , some on the quiz page itself, as shown in Table 3 ." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-134", "text": "The feedback comprised mostly comments on the accuracy (or inaccuracy) of the quiz, comments expressing interest in particular questions, and speculation about how the quiz was constructed." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-135", "text": "It seems that quiz accuracy was not a prerequisite for commenting on the quiz." 
}, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-136", "text": "On the contrary, participants were more likely to comment when their results were inaccurate." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-137", "text": "It is unknown whether the up-and down-voting was motivated by the accuracy of the quiz, but researchers making interactive prediction sites may discover that inaccuracy is in fact more engaging in some regards." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-138", "text": "The perceived stigma of obesity was also evident in the reactions to the quiz, with some negative reactions to a prediction of overweight regardless of its accuracy." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-139", "text": "For a better understanding of the feedback received, we performed a post-hoc analysis." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-140", "text": "Our analysis indicated that while there were 3 comments made about accuracy out of 744 people with correct predictions (0.40% commented), and 13 noting incorrect answers out of 201 with incorrect predictions (6.5% commented)." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-141", "text": "Thus, the participants were 16 times as likely to comment on the quiz's accuracy if its prediction was incorrect than they were if it was correct." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-142", "text": "We further classified the Reddit comments received into six classes: affective comments (7%), hypothesizing comments (17%), cultural comments (17%), result-based comments (53%), constructive criticism (7%), and comments seeking a greater understanding of the quiz (7%) 6 , examples of which can be seen in Figure 4 ." 
}, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-143", "text": "The comments made" }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-144", "text": "----------------------------------" }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-145", "text": "**QUIZ EVALUATION**" }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-146", "text": "We evaluated the quiz on 945 volunteers recruited at the University of Arizona and on social media, namely Facebook, Twitter, and Reddit's SampleSize subreddit 7 ." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-147", "text": "The results are summarized in Table 4 ." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-148", "text": "We evaluated the accuracy of the random forest classifier by comparing each individual's actual BMI, based on the self-reported height and weight, to the classifier's prediction." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-149", "text": "The cutoff boundary BMI for both training and testing was 28.7 -the average US adult BMI according to National Center for Health Statistics (2013) ." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-150", "text": "This figure is above the NIH's definition of overweight (BMI \u2265 25) because the average US resident is overweight by that standard." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-151", "text": "These results are promising: the quiz had a 78.7% accuracy for the classification of individuals into the two classes: higher or lower BMI than the average US resident." 
}, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-152", "text": "----------------------------------" }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-153", "text": "**DISCUSSION**" }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-154", "text": "It is important to note that the limitations of the sample are considerable: as shown in Table 4 and Figure 3 , our initial sample is taller, lighter, and younger than the average US adult, leading to a biased test sample." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-175", "text": "There is great potential for further improvements of the model by adding calorie count estimates for food pictures associated with individual tweets, by incorporating individual-level demographic features such as gender and age, and by using words and hashtags about physical activities." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-155", "text": "The strong bias of the sample means that the trivial baseline of predicting no participants to be over BMI 28.7 would have accuracy 82.3%." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-156", "text": "Moreover, while the overall accuracy of the random forest backing the quiz was 78.7%, the accuracy on participants who reported a BMI over 28.7 was only 16.0%." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-157", "text": "Our conjecture is that, in general, participants who are overweight are more reluctant to mention food-and health-related topics on social media, which led to lower-quality training data for this group, and the distribution of BMI for participants classified as under 28.7 was not significantly different from that of those classified as over 28.7." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-158", "text": "Far from being a problem for this data collection technique, however, the failure of transfer from state-level training to individual-level testing underscores the need for the data collection itself." 
}, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-159", "text": "No existing system has been able to automatically predict individuals' weight after training on statelevel data." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-160", "text": "----------------------------------" }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-161", "text": "**CONCLUSIONS AND FUTURE WORK**" }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-162", "text": "We described a strategy for the acquisition of training data necessary to build a social-media-driven early detection system for individuals at risk for T2DM, using a game-like quiz with data and questions acquired from Twitter." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-163", "text": "Our approach has proven to inspire considerable participant engagement and, in so doing, provide relevant data to train a public-health model applied to individuals." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-164", "text": "First, we built a random forest classifier that improves on the state of the art for the classification of overweight communities (in particular US states)." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-165", "text": "We then use this as the basis of a 20-questions-style quiz to classify individuals." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-166", "text": "Early results are promising: 78.7% accuracy, but the sample does not represent the general population well, and the quiz performs poorly on classifying overweight individuals." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-167", "text": "The most immediate goal is to obtain a large respondent sample that is more representative of US adults, and to extend the information gathered to longitudinal data." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-168", "text": "Based on the high engagement observed in this initial experiment, we hope that a large dataset can be constructed at minimal cost." 
}, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-169", "text": "This dataset will be used to develop a public-health tool capable of non-intrusively identifying at-risk individuals by monitoring public social-media streams." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-170", "text": "Our long term goal is to use this data to train a supervised classifier for the identification of individuals at risk for type 2 diabetes." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-171", "text": "The dataset collected through the quiz described here is sufficient for this goal: it includes necessary information for the calculation of BMI (weight, height), demographic information, and social media handles." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-172", "text": "We plan to explore (public) multi-modal social media information: natural language, posted pictures, etc." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-173", "text": "From this, we will extract and use preventable risk factors, such as poor diet or lack and perceived lack of physical activity." }, { "sent_id": "1f77b780c98093cd85966243471a1d-C001-174", "text": "The data will be made available to interested researchers." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "1f77b780c98093cd85966243471a1d-C001-27" ], [ "1f77b780c98093cd85966243471a1d-C001-54" ] ], "cite_sentences": [ "1f77b780c98093cd85966243471a1d-C001-27", "1f77b780c98093cd85966243471a1d-C001-54" ] }, "@EXT@": { "gold_contexts": [ [ "1f77b780c98093cd85966243471a1d-C001-27", "1f77b780c98093cd85966243471a1d-C001-28", "1f77b780c98093cd85966243471a1d-C001-29" ] ], "cite_sentences": [ "1f77b780c98093cd85966243471a1d-C001-27" ] }, "@DIF@": { "gold_contexts": [ [ "1f77b780c98093cd85966243471a1d-C001-32" ], [ "1f77b780c98093cd85966243471a1d-C001-65" ], [ "1f77b780c98093cd85966243471a1d-C001-123" ] ], "cite_sentences": [ "1f77b780c98093cd85966243471a1d-C001-32", "1f77b780c98093cd85966243471a1d-C001-65", "1f77b780c98093cd85966243471a1d-C001-123" ] }, "@MOT@": { "gold_contexts": [ [ "1f77b780c98093cd85966243471a1d-C001-60" ] ], "cite_sentences": [ "1f77b780c98093cd85966243471a1d-C001-60" ] }, "@USE@": { "gold_contexts": [ [ "1f77b780c98093cd85966243471a1d-C001-86" ], [ "1f77b780c98093cd85966243471a1d-C001-89" ], [ "1f77b780c98093cd85966243471a1d-C001-108" ], [ "1f77b780c98093cd85966243471a1d-C001-122" ] ], "cite_sentences": [ "1f77b780c98093cd85966243471a1d-C001-86", "1f77b780c98093cd85966243471a1d-C001-89", "1f77b780c98093cd85966243471a1d-C001-108", "1f77b780c98093cd85966243471a1d-C001-122" ] } } }, "ABC_45238fe9b493ccdf5921c8f5284097_13": { "x": [ { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-193", "text": "Example in Table 6 shows disagreements between fluency and clarity scores for different summaries of the same article." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-153", "text": "The article with the highest agreement (0.32) has more focused highlights, whereas the article with the lowest agreement (0.04) has highlights spread all over (both articles can be seen in the supplementary materials)." 
}, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-2", "text": "There has been substantial progress in summarization research enabled by the availability of novel, often large-scale, datasets and recent advances on neural network-based approaches." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-3", "text": "However, manual evaluation of the system generated summaries is inconsistent due to the difficulty the task poses to human non-expert readers." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-4", "text": "To address this issue, we propose a novel approach for manual evaluation, HIGHlight-based Reference-less Evaluation of Summarization (HIGHRES), in which summaries are assessed by multiple annotators against the source document via manually highlighted salient content in the latter." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-5", "text": "Thus summary assessment on the source document by human judges is facilitated, while the highlights can be used for evaluating multiple systems." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-6", "text": "To validate our approach we employ crowd-workers to augment with highlights a recently proposed dataset and compare two state-of-the-art systems." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-7", "text": "We demonstrate that HIGHRES improves inter-annotator agreement in comparison to using the source document directly, while they help emphasize differences among systems that would be ignored under other evaluation approaches." 
}, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-8", "text": "----------------------------------" }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-9", "text": "**1**" }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-10", "text": "----------------------------------" }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-11", "text": "**INTRODUCTION**" }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-12", "text": "Research in automatic summarization has made headway over the years with single document summarization as the front-runner due to the availability of large datasets (Sandhaus, 2008; Hermann et al., 2015; Narayan et al., 2018b) which has enabled the development of novel methods, many of them employing recent advances in neural networks (See et al., 2017; Narayan et al., 2018c; , inter alia)." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-13", "text": "* The work was primarily done while Shashi was still at School of Informatics, University of Edinburgh." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-14", "text": "1 Our dataset and code are available at https:// github.com/sheffieldnlp/highres Figure 1 : Highlight-based evaluation of a summary." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-15", "text": "Annotators to evaluate a summary (bottom) against the highlighted source document (top) presented with a heat map marking the salient content in the document; the darker the colour, the more annotators deemed the highlighted text salient." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-16", "text": "Measuring progress in summarization is difficult, as the task has as input a source document consisting of multiple sentences and methods need to generate a shorter text that expresses the salient information of the source fluently and succinctly." 
}, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-17", "text": "Thus there can be multiple equally good summaries for the same source document as not all salient information can fit in a given summary length, while even extractive methods that select complete sentences are not guaranteed to produce a coherent summary overall." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-18", "text": "The most consistently used evaluation approach is comparison of the summaries produces against reference summaries via automatic measures such as ROUGE (Lin, 2004) and its variants." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-19", "text": "However, automatic measures are unlikely to be sufficient to measure performance in summarization (Schluter, 2017) , also known for other tasks in which the goal is to generate natural language (Novikova et al., 2017) ." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-20", "text": "Furthermore, the datasets typically considered have a single reference summary, as obtaining multiple ones increases dataset creation cost, thus evaluation against them is likely to exhibit reference bias (Louis and Nenkova, 2013; Fomicheva and Specia, 2016) , penalizing summaries containing salient content different from the reference." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-21", "text": "For the above reasons manual evaluation is considered necessary for measuring progress in summarization." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-22", "text": "However, the intrinsic difficulty of the task has led to research without manual evaluation or only fluency being assessed manually." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-23", "text": "Those that conduct manual assessment of the content, typically use a single reference summary, either directly (Celikyilmaz et al., 2018; Tan et al., 2017) or through questions (Narayan et al., 2018b,c) and thus are also likely to exhibit reference bias." 
}, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-24", "text": "In this paper we propose a novel approach for manual evaluation, HIGHlight-based Referenceless Evaluation of document Summarization (HIGHRES), in which a summary is assessed against the source document via manually highlighted salient content in the latter (see Figure 1 for an example)." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-25", "text": "Our approach avoids reference bias, as the multiple highlights obtained help consider more content than what is contained in a single reference." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-26", "text": "The highlights are not dependent on the summaries being evaluated but only on the source documents, thus they are reusable across studies, and they can be crowd-sourced more effectively than actual summaries." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-27", "text": "Furthermore, we propose to evaluate the clarity of a summary separately from its fluency, as they are different dimensions." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-28", "text": "Finally, HIGHRES provides absolute instead of ranked evaluation, thus the assessment of a system can be conducted and interpreted without reference to other systems." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-29", "text": "To validate our proposed approach we use the recently introduced eXtreme SUMmarization dataset (XSUM, Narayan et al., 2018b) to evaluate two state-of-the-art abstractive summarization methods, Pointer Generator Networks (See et al., 2017) and Topic-aware Convolutional Networks (Narayan et al., 2018b) , using crowd-sourcing for both highlight annotation and quality judgments." 
}, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-30", "text": "We demonstrate that HIGHRES improves interannotator agreement in comparison to using the source document directly, while they help emphasize differences among systems that would be ignored under other evaluation approaches, including reference-based evaluation." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-31", "text": "Furthermore, we show that the clarity metric from the DUC (Dang, 2005 ) must be measured separately from \"fluency\", as judgments for them had low correlation." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-32", "text": "Finally, we make the highlighted XSUM dataset, codebase to replicate the crowd-sourcing experiments and all other materials produced in our study publicly available." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-33", "text": "----------------------------------" }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-34", "text": "**LITERATURE REVIEW**" }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-35", "text": "In recent years, summarization literature has investigated different means of conducting manual evaluation." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-36", "text": "We study a sample of 26 recent papers from major ACL conferences and outline the trends of manual evaluation in summarization in Table 1 ." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-37", "text": "From 26 papers, 11 papers (e.g., See et al., 2017; Kedzie et al., 2018; Cao et al., 2018) did not conduct any manual evaluation." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-38", "text": "Following the Document Understanding Conference (DUC, Dang, 2005) , a majority of work has focused on evaluating the content and the linguistic quality of summaries (Nenkova, 2005) ." 
}, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-39", "text": "However, there seems to be a lack of consensus on how a summary should be evaluated: (i) Should it be evaluated relative to other summaries or standalone in absolute terms? and (ii) What would be a good source of comparison: the input document or the reference summary?" }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-40", "text": "The disagreements on these issues result in authors evaluating their summaries often (11 out of 26 papers) using automatic measures such as ROUGE (Lin, 2004) despite of its limitations (Schluter, 2017) ." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-41", "text": "In what follows, we discuss previously proposed approaches along three axes: evaluation metrics, relative vs. absolute, and the choice of reference." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-42", "text": "Evaluation Metrics Despite differences in the exact definitions, the majority (e.g., Hsu et al., 2018; Celikyilmaz et al., 2018; Narayan et al., 2018b; Chen and Bansal, 2018; Peyrard and Gurevych, 2018) agree on both or either one of two broad quality definitions: coverage determines how much of the salient content of the source document is captured in the summary, and informativeness, how much of the content captured in the summary is salient with regards to the original document." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-43", "text": "These measures correspond to \"recall\" and \"precision\" metrics respectively in Table 1 : Overview of manual evaluations conducted in recent summarization systems." 
}, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-44", "text": "We categorize them in four dimensions: the first columns presents papers that do not report on human evaluation; the second column identifies matrices used for evaluating content (\"Pyramid\", \"QA\", \"Correctness\", \"Recall\" and \"Precision\") and quality (\"Clarity\", \"Fluency\") of summaries; the third column focuses if the system ranking reported by humans on content evaluation were \"Absolute\" or \"Relative\"; and finally, the fourth column evaluates if summaries were evaluated against the input document (\"With Document\"), the reference summary (\"With Reference\") or both (\"With Ref. & Doc.\")." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-45", "text": "in information retrieval and information extraction literature." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-46", "text": "Clarke and Lapata (2010) proposed a question-answering based approach to improve the agreement among human evaluations for the quality of summary content, which was recently employed by Narayan et al. (2018b) and Narayan et al. (2018c) (QA in Table 1 )." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-47", "text": "In this approach, questions were created first from the reference summary and then the system summaries were judged with regards to whether they enabled humans to answer those questions correctly." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-48", "text": "ShafieiBavani et al. (2018) , on the other hand, used the \"Pyramid\" method (Nenkova and Passonneau, 2004 ) which requires summaries to be annotated by experts for salient information." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-49", "text": "A similar evaluation approach is the factoids analysis by Teufel and Van Halteren (2004) which evaluates the system summary against factoids, a representation based on atomic units of information, that are extracted from multiple gold summaries." 
}, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-50", "text": "However, as in the case of the \"Pyramid\" method, extracting factoids requires experts annotators." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-51", "text": "Finally, a small number of work evaluates the \"Correctness\" (Chen and Bansal, 2018; Li et al., 2018b; Chen and Bansal, 2018) of the summary, similar to fact checking (Vlachos and Riedel, 2014) , which can be a challenging task in its own right." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-52", "text": "The linguistic quality of a summary encompasses many different qualities such as fluency, grammatically, readability, formatting, naturalness and coherence." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-53", "text": "Most recent work uses a single human judgment to capture all linguistic qualities of the summary (Hsu et al., 2018; Kry\u015bci\u0144ski et al., 2018; Narayan et al., 2018b; Song et al., 2018; Guo et al., 2018) ; we group them under \"Fluency\" in Table 1 with an exception of \"Clarity\" which was evaluated in the DUC evaluation campaigns (Dang, 2005) ." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-54", "text": "The \"Clarity\" metric puts emphasis in easy identification of noun and pronoun phrases in the summary which is a different dimension than \"Fluency\", as a summary may be fluent but difficult to be understood due to poor clarity." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-55", "text": "Absolute vs Relative Summary Ranking." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-56", "text": "In relative assessment of summarization, annotators are shown two or more summaries and are asked to rank them according to the dimension at question (Yang et al., 2017; Chen and Bansal, 2018; Narayan et al., 2018a; Guo et al., 2018; Krishna and Srinivasan, 2018) ." 
}, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-57", "text": "The relative assessment is often done using the paired comparison (Thurstone, 1994) or the best-worst scaling (Woodworth and G, 1991; Louviere et al., 2015) , to improve inter-annotator agreement." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-58", "text": "On the other hand, absolute assessment of summarization (Li et al., 2018b; Song et al., 2018; Kry\u015bci\u0144ski et al., 2018; Hsu et al., 2018; Hardy and Vlachos, 2018 ) is often done using the Likert rating scale (Likert, 1932) where a summary is assessed on a numerical scale." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-59", "text": "Absolute assessment was also employed in combination with the question answering approach for content evaluation (Narayan et al., 2018b; Mendes et al., 2019) ." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-60", "text": "Both approaches, relative ranking and absolute assessment, have been investigated extensively in Machine Translation (Bojar et al., 2016 (Bojar et al., , 2017 ." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-61", "text": "Absolute assessment correlates highly with the relative assessment without the bias introduced by having a simultaneous assessment of several models (Bojar et al., 2011) ." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-62", "text": "Choice of Reference." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-63", "text": "The most convenient way to evaluate a system summary is to assess it against the reference summary (Celikyilmaz et al., 2018; Yang et al., 2017; Peyrard and Gurevych, 2018) , as this typically requires less effort than reading the source document." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-64", "text": "The question answering approach of Narayan et al. (2018b,c) also falls in this category, as the questions were written using the reference summary." 
}, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-65", "text": "However, summarization datasets are limited to a single reference summary per document (Sandhaus, 2008; Hermann et al., 2015; Grusky et al., 2018; Narayan et al., 2018b) thus evaluations using them is prone to reference bias (Louis and Nenkova, 2013) , also a known issue in machine translation evaluation (Fomicheva and Specia, 2016) ." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-66", "text": "A circumvention for this issue is to evaluate it against the source document (Song et al., 2018; Narayan et al., 2018a; Hsu et al., 2018; Kry\u015bci\u0144ski et al., 2018) , asking judges to assess the summary after reading the source document." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-67", "text": "However this requires more effort and is known to lead to low inter-annotator agreement (Nenkova and Passonneau, 2004) ." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-68", "text": "----------------------------------" }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-69", "text": "**HIGHRES**" }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-70", "text": "Our novel highlight-based reference-less evaluation does not suffer from reference bias as a summary is assessed against the source document with manually highlighted salient content." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-71", "text": "These highlights are crowd-sourced effectively without the need of expert annotators as required by the Pyramid method (Nenkova and Passonneau, 2004) or to generate reference summaries." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-72", "text": "Our approach improves over the \"Correctness\" or \"Fluency\" only measure for summarization by taking salience into account." 
}, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-73", "text": "Finally, the assessment of summaries against the document with highlighted pertinent content facilitates an absolute evaluation of summaries with high inter-annotator agreement." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-74", "text": "Our evaluation framework comprises three main components: document highlight annotation, highlight-based content evaluation, and clarity and fluency evaluation." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-75", "text": "The second component, which evaluates the notions of \"Precision\" and \"Recall\" requires the highlights from the first one to be conducted." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-76", "text": "However, the highlight annotation needs to happen only once per document, and it can be reused to evaluate many system summaries, unlike the Pyramid approach (Nenkova and Passonneau, 2004 ) that requires additional expert annotation for every system summary being evaluated." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-77", "text": "The third component is independent of the others and can be run in isolation." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-78", "text": "In all components we employ crowd-workers as human judges, and implement appropriate sanity checking mechanisms to ensure good quality judgements." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-79", "text": "Finally, we present an extended version of ROUGE (Lin, 2004 ) that utilizes the highlights to evaluate system summaries against the document; this demonstrates another use of the highlights for summarization evaluation." 
}, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-80", "text": "----------------------------------" }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-81", "text": "**HIGHLIGHT ANNOTATION**" }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-82", "text": "In this part, we ask human judges to read the source document and then highlight words or phrases that are considered salient." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-83", "text": "Each judge is allowed to highlight parts of the text at any granu-larity, from single words to complete sentences or even paragraphs." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-84", "text": "However we enforce a limit in the number of words to K that can be highlighted in total by a judge in a document, corresponding to the length of the summary expected." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-85", "text": "By employing multiple judges per document who are restricted in the amount of text that can be highlighted we expect to have a more diverse and focused highlight from multiple judges which cover different viewpoints of the article." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-154", "text": "Interestingly, the reference summary on the highest agreement article appears to be more informative of its content when the annotator agreement is high; the reference summary on the lowest agreement article is more indicative, i.e., it does not contain any informative content from the article but only to inform the reader about the article's topic and scope." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-155", "text": "These results confirm that the annotation behaviour originates from the nature of the document and the summary it requires, and validates our highlight annotation setup." 
}, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-156", "text": "----------------------------------" }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-157", "text": "**CONTENT EVALUATION OF SUMMARIES**" }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-158", "text": "We assessed the summaries against (i) documents with highlights (Highlight-based), (ii) original documents without highlights (Non Highlightbased) and (iii) reference summaries (Referencebased)." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-159", "text": "For each setup, we collected judgments from 3 different participants for each model summary." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-160", "text": "Table 2 and 3 presents our results." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-161", "text": "Both the highlight-based and non-highlight based assessment of summaries agree on the ranking among TCONVS2S, PTGEN and Reference." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-162", "text": "Perhaps unsurprisingly human-authored summaries were considered best, whereas, TCONVS2S was ranked 2nd, followed by PT-GEN." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-163", "text": "However, the performance difference in TCONVS2S and PTGEN is greatly amplified when they are evaluated against document with highlights (6.48 and 5.54 Precision and Recall points) compared to when evaluated against the original documents (3.98 and 1.83 Precision and Recall points)." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-164", "text": "The performance difference is lowest when they are evaluated against the reference summary (2.51 and -1.79 Precision and Recall points)." 
}, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-165", "text": "The superiority of TCONVS2S is expected; TCONVS2S is better than PTGEN for recognizing pertinent content and generating informative summaries due to its ability to represent high-level document knowledge in terms of topics and long-range dependencies (Narayan et al., 2018b) ." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-166", "text": "We further measured the agreement among the judges using the coefficient of variation (Everitt, 2006) from the aggregated results." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-167", "text": "It is defined as the ratio between the sample standard deviation and sample mean." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-168", "text": "It is a scale-free metric, i.e. its results are comparable across measurements of different magnitude." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-169", "text": "Since, our sample size is small (3 judgements per summary), we use the unbiased version (Sokal and Rohlf, 1995) as cv = (1+ 1 4n ) \u03c3 x , where \u03c3 is the standard deviation, n is the number of sample, andx is the mean." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-170", "text": "We found that the highlight-based assessment in general has lower variation among judges than the non-highlight based or reference-based assessment." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-86", "text": "To ensure that each highlight is reliable, we performed a sanity check at the end of the task where we ask the judges to answer a True/False question based on the article." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-87", "text": "We rejected all annotations that failed to correctly answer the sanity check question." 
}, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-88", "text": "----------------------------------" }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-89", "text": "**HIGHLIGHT-BASED CONTENT EVALUATION**" }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-90", "text": "In this component, we present human judges a document that has been highlighted using heatmap coloring and a summary to assess." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-91", "text": "We ask our judges to assess the summary for (i) 'All important information is present in the summary' and (ii) 'Only important information is in the summary." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-92", "text": "' The first one is the recall (content coverage) measure and the second, the precision (informativeness) measure." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-93", "text": "All the ratings were collected on a 1-100 Likert scale (Likert, 1932) ." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-94", "text": "Figure 2 shows the content evaluation user interface where salient parts of the document are highlighted." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-95", "text": "As with the highlight annotation, we performed the same form of sanity check to the one in the highlight annotation task." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-96", "text": "----------------------------------" }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-97", "text": "**CLARITY AND FLUENCY EVALUATION**" }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-98", "text": "In this part, we give the judges only the summary and ask them to rate it on clarity and fluency." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-99", "text": "For clarity, each judge is asked whether the summary is easy to be understood, i.e. 
there should be no difficulties in identifying the referents of the noun phrases (every noun/place/event should be well-specified) or understanding the meaning of the sentence." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-100", "text": "For fluency, each judge is asked whether the summary sounds natural and has no grammatical problems." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-101", "text": "While fluency is often evaluated in recent work, clarity, though first introduced in the DUC evaluations, has recently been ignored in manual evaluation, even though it captures a different dimension of summarization quality." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-102", "text": "To ensure that the judgments for clarity and fluency are not affected by each other (poor fluency can affect clarity, but a summary can have perfect fluency and low clarity), we evaluate each metric separately." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-103", "text": "We ask the judges to evaluate multiple summaries per task, with each dimension in its own screen." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-104", "text": "For sanity checking, we insert three artificial summaries of different quality (good, mediocre and bad summaries)." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-105", "text": "The good summary is the unedited one, while the others are generated from sentences randomly sampled from the source document." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-106", "text": "For the mediocre summary, some words are edited to introduce grammatical or syntactic errors, while for the bad summary, the words are further scrambled." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-107", "text": "We reject judgements that fail to satisfy this criterion: bad < mediocre < good."
}, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-108", "text": "----------------------------------" }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-109", "text": "**HIGHLIGHT-BASED ROUGE EVALUATION**" }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-110", "text": "Our Highlight-based ROUGE (we refer to it as HROUGE) formulation is similar to the original ROUGE with the difference that the n-grams are weighted by the number of times they were highlighted." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-111", "text": "One benefit of HROUGE is that it introduces saliency into the calculation without being reference-based as in ROUGE." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-112", "text": "Implicitly HROUGE considers multiple summaries as the highlights are obtained from multiple workers." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-113", "text": "Given a document D as a sequence of m tokens {w 1 , . . . , w m }, annotated with N highlights, we define the weight \u03b2 n g \u2208 [0, 1] for an n-gram g as:" }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-114", "text": "where, [x] y is an indicator function which returns x if y is true and 0, otherwise." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-115", "text": "[1] w j \u2208H k is a function which returns the number of times word w j is highlighted by the annotators out of N times weighted by the lengths of their highlights; H k is the highlighted text by the k-th annotator and K is the maximum allowed length of the highlighted text (see Section 3.1)." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-116", "text": "NumH(w j ) gives less importance to annotators with highlights with few words." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-117", "text": "In principle, if an n-gram is highlighted by every crowd-worker and the length of the highlight of each crowd-worker is K, the n-gram g will have a maximum weight of \u03b2 n g = 1." 
}, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-118", "text": "The HROUGE scores for a summary S can then be defined as below. [Figure: The UI for content evaluation with highlights.]" }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-119", "text": "Judges are given an article with important words highlighted using a heat map." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-120", "text": "Judges can also remove less important highlight colour by sliding the scroller at the left of the page." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-121", "text": "At the right of the page judges give the recall and precision assessments by sliding the scroller from 1 to 100 based on the given summary quality." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-122", "text": "HR^n_rec and HR^n_pre are the HROUGE recall and precision scores; count(g, X) is the number of occurrences of n-gram g in the text X." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-123", "text": "The weight in the denominator of HR^n_pre is uniform (\u03b2^n_g = 1) for all g, because if we weighted according to the highlights, words in the summary that are not highlighted in the original document would be ignored." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-124", "text": "This would result in HR^n_pre not penalizing summaries for containing words that are likely to be irrelevant, as they do not appear in the highlights of the document." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-125", "text": "It is important to note that HROUGE has a significant limitation in that it penalizes abstractive summaries that do not reuse words from the original document." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-126", "text": "This is similar to ROUGE penalizing summaries for not reusing words from the reference summaries; however, the highlights allow us to implicitly consider multiple references without having to actually obtain them."
}, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-127", "text": "----------------------------------" }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-128", "text": "**SUMMARIZATION DATASET AND MODELS**" }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-129", "text": "We use the extreme summarization dataset (XSUM, Narayan et al., 2018b) 2 which comprises BBC articles paired with their single-sentence summaries, provided by the journalists writing the articles." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-130", "text": "The summaries in the XSUM dataset contain a larger number of novel n-grams than those in other popular datasets such as CNN/DailyMail (Hermann et al., 2015) or NY Times (Sandhaus, 2008); as such, the dataset is suitable for our experiment, since the more abstractive nature of the summaries renders automatic methods such as ROUGE, which rely on string matching, less accurate, and thus calls for human evaluation for more accurate system comparisons." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-131", "text": "Following Narayan et al. (2018b), we did not use the whole test set but sampled 50 articles from it for our highlight-based evaluation." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-132", "text": "We assessed summaries from two state-of-the-art abstractive summarization systems using our highlight-based evaluation: (i) the Pointer-Generator model (PTGEN) introduced by See et al. (2017), an RNN-based abstractive system which can copy words from the source text, and (ii) the Topic-aware Convolutional Sequence to Sequence model (TCONVS2S) introduced by Narayan et al. (2018b), an abstractive model which is conditioned on the article's topics and based entirely on Convolutional Neural Networks." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-133", "text": "We used the pre-trained models 3 provided by the authors to obtain summaries from both systems for the documents in our test set."
}, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-134", "text": "----------------------------------" }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-135", "text": "**EXPERIMENTS AND RESULTS**" }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-136", "text": "All of our experiments are done using the Amazon Mechanical Turk platform." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-137", "text": "We develop three types of Human Intelligence Tasks (HITs): highlight annotation, highlight-based content evaluation, and fluency and clarity evaluation." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-138", "text": "In addition, we elicited human judgments for content evaluation in two more ways: we assessed system summaries against the original document (without highlights) and against the reference summary." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-139", "text": "The latter two experiments are intended as comparisons for our proposed highlight-based content evaluation." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-140", "text": "----------------------------------" }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-141", "text": "**HIGHLIGHT ANNOTATION**" }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-142", "text": "We collected highlight annotations from 10 different participants for each of 50 articles." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-143", "text": "For each annotation, we set K, the maximum number of words to highlight, to 30." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-144", "text": "Our choice reflects the average length (24 words) of reference summaries in the XSUM dataset." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-145", "text": "To facilitate the annotation of BBC news articles with highlights, we asked our participants to adopt the 5W1H (Who, What, When, Where, Why and How) principle (Robertson, 1946), which is common practice in journalism."
}, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-146", "text": "The participants, however, were not obliged to follow this principle and were free to highlight content as they saw fit." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-147", "text": "The resulting annotations exhibit a substantial amount of variance, confirming the intuition that different participants are not expected to agree entirely on what is salient in a document." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-148", "text": "On average, the union of the highlights from 10 annotators covered 38.21% of each article, and 33.77% of the highlights occurred in the second half of the article." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-149", "text": "This shows that the judges did not focus only on the beginning of the documents but annotated across the whole document." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-150", "text": "Using Fleiss' Kappa (Fleiss, 1971) on the binary labels provided by each judge for each word (highlighted or not), we obtained an average agreement [Table 3: Coefficient of variation (lower is better) for evaluating summaries against documents with and without highlights.]" }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-151", "text": "of 0.19 for the 50 articles considered." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-152", "text": "The low agreement score does not necessarily indicate a poor annotation process; we argue that it is primarily due to the annotators having different opinions on which parts of an article are salient." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-171", "text": "The assessment of TCONVS2S summaries achieves coefficient-of-variation scores of 0.67 for precision and 0.80 for recall, which are 0.08 and 0.03 points lower, respectively, than when the summaries are assessed against documents with no highlights."
}, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-172", "text": "We see a similar pattern in recall for the assessment of PTGEN summaries." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-173", "text": "Our results demonstrate that the highlight-based assessment of abstractive systems improves agreement among judges compared to when they are assessed against the documents without highlights or against the reference summaries." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-174", "text": "The assessment of human-authored summaries does not seem to follow this trend; we observe mixed results (0.49 vs. 0.48 for precision and 0.63 vs. 0.67 for recall) when they are evaluated with and without the highlights." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-175", "text": "Table 5: HROUGE-1 (unigram) and HROUGE-2 (bigram) precision and recall scores for TCONVS2S, PTGEN and Reference summaries." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-176", "text": "Table 4 shows the results of our fluency and clarity evaluations." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-177", "text": "Similar to our highlight-based content evaluation, human-authored summaries were considered best, whereas TCONVS2S was ranked 2nd, followed by PTGEN, on both measures." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-178", "text": "The Pearson correlation between the fluency and clarity evaluations is 0.68, which shows that the two are far from perfectly correlated; this confirms our hypothesis that clarity captures different aspects from fluency and that the two should not be combined, as is commonly done." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-179", "text": "In the latter case, HROUGE becomes the standard ROUGE metric with \u03b2^n_g = 1 for all n-grams g. Both ROUGE and HROUGE favour methods that copy content from the original document and penalize abstractive methods; thus it is not surprising that PTGEN is superior to TCONVS2S, as the former has an explicit copy mechanism."
}, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-180", "text": "The fact that PTGEN is better in terms of HROUGE is also evidence that the copying done by PTGEN selects salient content, thus confirming that the copying mechanism works as intended." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-181", "text": "When comparing the reference summaries against the original documents, both ROUGE and HROUGE confirm that the reference summaries are rather abstractive, as reported by Narayan et al. (2018b), and they in fact score below the system summaries." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-182", "text": "Recall scores are very low in all cases, which is expected, since the 10 highlights obtained per document, or the documents themselves, taken together are much longer than any of the summaries." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-183", "text": "----------------------------------" }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-184", "text": "**CLARITY AND FLUENCY EVALUATION**" }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-185", "text": "----------------------------------" }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-186", "text": "**HIGHLIGHT-BASED ROUGE EVALUATION**" }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-187", "text": "Qualitative Analysis: HIGHRES eliminates reference bias." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-188", "text": "The example presented in Figure 3 demonstrates how our highlight-based evaluation eliminates reference bias in summarization evaluation." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-189", "text": "The summaries generated by TCONVS2S and PTGEN are able to capture the essence of the document; however, there are phrases in these summaries that do not occur in the reference summary." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-190", "text": "A reference-based evaluation would fail to give a reasonable score to these system summaries."
}, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-191", "text": "HIGHRES, however, enables the judges to better evaluate the summaries without any reference bias." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-192", "text": "Fluency vs Clarity." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-194", "text": "From the example, we can see that the TCONVS2S summary is fluent but not easily understood in the context of 'the duration of resignation', while the PTGEN summary has word duplication, which lowers fluency, and also lacks clarity due to several unclear words." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-195", "text": "----------------------------------" }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-196", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-197", "text": "In this paper we introduced the HIGHlight-based Reference-less Evaluation of Summarization (HIGHRES) framework for manual evaluation." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-198", "text": "The proposed framework avoids reference bias and provides absolute instead of ranked evaluation of the systems." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-199", "text": "Our experiments show that HIGHRES lowers the variability of the judges' content assessment, while helping expose the differences between systems." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-200", "text": "We also showed that by evaluating clarity we are able to capture a dimension of summarization quality that is not captured by the commonly used fluency." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-201", "text": "We believe that our highlight-based evaluation is an ideal setup for evaluating abstractive summarization for three reasons: (i) highlights can be crowd-sourced effectively without expert annotators, (ii) it avoids reference bias, and (iii) it is not limited by n-gram overlap."
}, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-202", "text": "In future work, we would like to extend our framework to other variants of summarization, e.g. multi-document summarization." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-203", "text": "Also, we will explore ways of automating parts of the process, e.g. the highlight annotation." }, { "sent_id": "45238fe9b493ccdf5921c8f5284097-C001-204", "text": "Finally, the highlights could also be used as a training signal, as they offer content-saliency information at a finer level than the single reference typically used." } ], "y": { "@BACK@": { "gold_contexts": [ [ "45238fe9b493ccdf5921c8f5284097-C001-12" ], [ "45238fe9b493ccdf5921c8f5284097-C001-23" ], [ "45238fe9b493ccdf5921c8f5284097-C001-42" ], [ "45238fe9b493ccdf5921c8f5284097-C001-46" ] ], "cite_sentences": [ "45238fe9b493ccdf5921c8f5284097-C001-12", "45238fe9b493ccdf5921c8f5284097-C001-23", "45238fe9b493ccdf5921c8f5284097-C001-42", "45238fe9b493ccdf5921c8f5284097-C001-46" ] }, "@DIF@": { "gold_contexts": [ [ "45238fe9b493ccdf5921c8f5284097-C001-23", "45238fe9b493ccdf5921c8f5284097-C001-24" ] ], "cite_sentences": [ "45238fe9b493ccdf5921c8f5284097-C001-23" ] }, "@USE@": { "gold_contexts": [ [ "45238fe9b493ccdf5921c8f5284097-C001-29" ], [ "45238fe9b493ccdf5921c8f5284097-C001-53" ], [ "45238fe9b493ccdf5921c8f5284097-C001-59" ], [ "45238fe9b493ccdf5921c8f5284097-C001-129" ], [ "45238fe9b493ccdf5921c8f5284097-C001-131" ], [ "45238fe9b493ccdf5921c8f5284097-C001-132" ] ], "cite_sentences": [ "45238fe9b493ccdf5921c8f5284097-C001-29", "45238fe9b493ccdf5921c8f5284097-C001-53", "45238fe9b493ccdf5921c8f5284097-C001-59", "45238fe9b493ccdf5921c8f5284097-C001-129", "45238fe9b493ccdf5921c8f5284097-C001-131", "45238fe9b493ccdf5921c8f5284097-C001-132" ] }, "@MOT@": { "gold_contexts": [ [ "45238fe9b493ccdf5921c8f5284097-C001-65" ] ], "cite_sentences": [ "45238fe9b493ccdf5921c8f5284097-C001-65" ] }, "@SIM@": { "gold_contexts": [ [ "45238fe9b493ccdf5921c8f5284097-C001-165" ], [ 
"45238fe9b493ccdf5921c8f5284097-C001-181" ] ], "cite_sentences": [ "45238fe9b493ccdf5921c8f5284097-C001-165", "45238fe9b493ccdf5921c8f5284097-C001-181" ] } } }, "ABC_71a72cfca17b0b15938ed590f9c868_13": { "x": [ { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-2", "text": "Previous work on classifying information status (Nissim, 2006; Rahman and Ng, 2011) is restricted to coarse-grained classification and focuses on conversational dialogue." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-3", "text": "We here introduce the task of classifying fine-grained information status and work on written text." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-4", "text": "We add a fine-grained information status layer to the Wall Street Journal portion of the OntoNotes corpus." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-5", "text": "We claim that the information status of a mention depends not only on the mention itself but also on other mentions in the vicinity and solve the task by collectively classifying the information status of all mentions." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-6", "text": "Our approach strongly outperforms reimplementations of previous work." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-7", "text": "----------------------------------" }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-9", "text": "Speakers present already known and yet to be established information according to principles referred to as information structure (Prince, 1981; Lambrecht, 1994; Kruijff-Korbayov\u00e1 and Steedman, 2003, inter alia)."
}, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-10", "text": "While information structure affects all kinds of constituents in a sentence, we here adopt the more restricted notion of information status which concerns only discourse entities realized as noun phrases, i.e. mentions 1 ." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-11", "text": "Information status (IS henceforth) describes the degree to which a discourse entity is available to the hearer with regard to the speaker's assumptions about the hearer's knowledge and beliefs (Nissim et al., 2004) ." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-12", "text": "Old mentions are known to the hearer and have been referred to previously." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-13", "text": "Mediated mentions have not been mentioned before but are also not autonomous, i.e., they can only be correctly interpreted by reference to another mention or to prior world knowledge." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-14", "text": "All other mentions are new." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-15", "text": "IS can be beneficial for a number of NLP tasks, though the results have been mixed." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-16", "text": "Nenkova et al. (2007) used IS as a feature for generating pitch accent in conversational speech." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-17", "text": "As IS is restricted to noun phrases, while pitch accent can be assigned to any word in an utterance, the experiments were not conclusive." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-18", "text": "For determining constituent order of German sentences, Cahill and Riester (2009) incorporate features modeling IS to good effect." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-19", "text": "Rahman and Ng (2011) showed that IS is a useful feature for coreference resolution." 
}, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-20", "text": "Previous work on learning IS (Nissim, 2006; Rahman and Ng, 2011) is restricted in several ways." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-21", "text": "It deals with conversational dialogue, in particular with the corpus annotated by Nissim et al. (2004)." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-22", "text": "However, many applications that can profit from IS concentrate on written texts, such as summarization." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-23", "text": "For example, Siddharthan et al. (2011) show that solving the IS subproblem of whether a person proper name is already known to the reader improves automatic summarization of news." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-24", "text": "Therefore, we here model IS in written text, creating a new dataset which adds an IS layer to the already existing comprehensive annotation in the OntoNotes corpus (Weischedel et al., 2011)." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-25", "text": "We also report the first results on fine-grained IS classification by modelling further distinctions within the category of mediated mentions, such as comparative and bridging anaphora (see Examples 1 and 2, respectively)." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-26", "text": "Fine-grained IS is a prerequisite to full bridging/comparative anaphora resolution, and is therefore necessary to fill gaps in entity grids (Barzilay and Lapata, 2008) based on coreference only." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-27", "text": "Thus, Examples 1 and 2 do not exhibit any coreferential entity coherence but coherence can be established when the comparative anaphor others is resolved to others than freeway survivor Buck Helm, and the bridging anaphor the streets is resolved to the streets of Oranjemund, respectively."
}, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-28", "text": "(1) the condition of freeway survivor Buck Helm . . . , improved, hospital officials said." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-29", "text": "Rescue crews, however, gave up hope that others would be found." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-30", "text": "(2) Oranjemund, the mine headquarters, is a lonely corporate oasis of 9,000 residents." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-31", "text": "Jackals roam the streets at night . . ." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-32", "text": "We approach the challenge of modeling IS via collective classification, using several novel linguistically motivated features." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-33", "text": "We reimplement Nissim's (2006) and Rahman and Ng's (2011) approaches as baselines and show that our approach outperforms them by a large margin for both coarse- and fine-grained IS classification." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-34", "text": "----------------------------------" }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-35", "text": "**RELATED WORK**" }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-36", "text": "IS annotation schemes and corpora." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-37", "text": "We enhance the approach in Nissim et al. (2004) in two major ways (see also Section 3.1)." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-38", "text": "First, comparative anaphora are not specifically handled in Nissim et al. (2004) (and follow-on work such as Ritz et al. (2008) and Riester et al. (2010)), although some of them might be included in their respective bridging subcategories." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-39", "text": "Second, we apply the annotation scheme reliably to a new genre, namely news."
}, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-40", "text": "This is a non-trivial extension: Ritz et al. (2008) applied a variation of the Nissim et al. (2004) scheme to a small set of 220 NPs in a German news/commentary corpus but found that reliability then dropped significantly, to the range of \u03ba = 0.55 to 0.60." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-41", "text": "They attributed this to the higher syntactic complexity and semantic vagueness in the commentary corpus." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-42", "text": "Riester et al. (2010) annotated a German news corpus with marginal reliability (\u03ba = 0.66) for their overall scheme, but their confusion matrix shows even lower reliability for several subcategories, most importantly deixis and bridging." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-43", "text": "While standard coreference corpora do not contain IS annotation, some corpora annotated for bridging are emerging (Poesio, 2004; Korzen and Buch-Kromann, 2011), but they are (i) not annotated for comparative anaphora or other IS categories, (ii) often not tested for reliability or reach only low reliability, and (iii) often very small (Poesio, 2004)." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-44", "text": "To the best of our knowledge, we therefore present the first English corpus reliably annotated for a wide range of IS categories as well as full anaphoric information for three main anaphora types (coreference, bridging, comparative)." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-45", "text": "Automatic recognition of IS." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-46", "text": "Vieira and Poesio (2000) describe heuristics for processing definite descriptions in news text." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-47", "text": "As their approach is restricted to definites, they only analyse a subset of the mentions we consider carrying IS."
}, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-48", "text": "Siddharthan et al. (2011) also concentrate on a subproblem of IS only, namely the hearer-old/hearer-new distinction for person proper names." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-49", "text": "Nissim (2006) and Rahman and Ng (2011) both present algorithms for IS detection on Nissim et al.'s (2004) Switchboard corpus." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-50", "text": "Both papers treat IS classification as a local classification problem, whereas we look at dependencies between the IS of different mentions, leading to collective classification." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-51", "text": "In addition, they only distinguish the three main categories old, mediated and new." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-52", "text": "Finally, we work on news corpora, which poses different problems from dialogue." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-53", "text": "Anaphoricity determination (Ng, 2009; Zhou and Kong, 2009) identifies many or most old mentions." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-54", "text": "However, no distinction between mediated and new mentions is made." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-55", "text": "Most approaches to bridging resolution (Meyer and Dale, 2002) or comparative anaphora (Modjeska et al., 2003; Markert and Nissim, 2005) address only the selection of the antecedent for the bridging/comparative anaphor, not its recognition." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-56", "text": "Sasano and Kurohashi (2009) do also tackle bridging recognition, but they depend on language-specific, non-transferable features for Japanese."
}, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-57", "text": "----------------------------------" }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-58", "text": "**CORPUS CREATION**" }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-59", "text": "----------------------------------" }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-60", "text": "**ANNOTATION SCHEME**" }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-61", "text": "Our scheme follows Nissim et al. (2004) in distinguishing three major IS categories old, new and mediated." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-62", "text": "A mention is old if it is either coreferential with an already introduced entity or a generic or deictic pronoun." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-63", "text": "We follow the OntoNotes (Weischedel et al., 2011 ) definition of coreference to be able to integrate our annotations with it." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-64", "text": "This definition includes coreference with noun phrase as well as verb phrase antecedents 3 ." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-65", "text": "Mediated refers to entities which have not yet been introduced in the text but are inferrable via other mentions or are known via world knowledge." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-66", "text": "We distinguish the following six subcategories: The category mediated/comparative comprises mentions compared via either a contrast or similarity to another one (see Example 1)." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-67", "text": "This category is novel in our scheme." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-68", "text": "We also include a category mediated/bridging (see Examples 2, 3 and 4)." 
}, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-69", "text": "Bridging anaphora can be any noun phrase and are not limited to definite NPs as in Gardent and Manu\u00e9lian (2005) and Riester et al. (2010)." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-70", "text": "In contrast to Nissim et al. (2004), antecedents for both comparative and bridging categories are annotated and can be noun phrases, verb phrases or even clauses." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-71", "text": "The category mediated/knowledge is inspired by the hearer-old distinction introduced by Prince (1992) and covers entities generally known to the hearer." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-72", "text": "It includes many proper names, such as Poland." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-73", "text": "Mentions that are syntactically linked via a possessive relation or a PP modification to other, old or mediated mentions fall into the type mediated/synt (see Examples 5 and 6)." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-74", "text": "With no change to Nissim et al.'s scheme, coordinated mentions where at least one element in the conjunction is old or mediated are covered by the category mediated/aggregate, and mentions referring to a value of a previously mentioned function by the type mediated/func." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-75", "text": "All other mentions are annotated as new, includ-" }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-76", "text": "----------------------------------" }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-77", "text": "**AGREEMENT STUDY**" }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-78", "text": "We carried out an agreement study with 3 annotators, of which Annotator A was the scheme developer and first author of this paper."
}, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-79", "text": "All texts used were from the Wall Street Journal (WSJ) portion of OntoNotes." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-80", "text": "There were no restrictions on which texts to include apart from (i) exclusion of letters to the editor as they contain cross-document links and (ii) a preference for longer texts with potentially richer discourse structure." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-81", "text": "Mentions were automatically preselected for the annotators using the gold-standard syntactic annotation." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-82", "text": "The existing coreference annotation was automatically carried over to the IS task by marking all mentions in a coreference chain (apart from the first mention in the chain) as old." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-83", "text": "The annotation task consisted of marking all mentions for their IS (old, mediated or new) as well as marking mediated subcategories (see Section 3.1) and the antecedents for comparative and bridging anaphora." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-84", "text": "The scheme was developed on 9 texts, which were also used for training the annotators." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-85", "text": "Inter-annotator agreement was measured on 26 new texts, which included 5905 pre-marked potential mentions." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-86", "text": "The annotations of 1499 of these were carried over from OntoNotes, leaving 4406 potential mentions for annotation and agreement measurement." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-87", "text": "In addition, we computed \u03ba (Artstein and Poesio, 2008) between all 3 possible annotator pairings."
}, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-88", "text": "We also report single-category agreement for each category, where all categories but one are merged and then \u03ba is computed as usual." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-89", "text": "Table 1 shows agreement results for the overall scheme at the coarse-grained (4 categories: non-mention, old, new, mediated) and the fine-grained level (9 categories: non-mention, old, new and the 6 mediated subtypes)." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-90", "text": "The results show that the scheme is overall reliable, with only small differences between the annotator pairings." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-91", "text": "Table 2 shows the individual category agreement for all 9 categories." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-92", "text": "We achieve high reliability for most categories." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-93", "text": "Particularly interesting is the fact that hearer-old entities (mediated/knowledge) can be identified reliably although all annotators had substantially different backgrounds." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-94", "text": "The reliability of the category bridging is more annotator-dependent, although still higher, sometimes considerably, than in previous attempts at bridging annotation (Gardent and Manu\u00e9lian, 2005; Riester et al., 2010)." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-95", "text": "----------------------------------" }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-96", "text": "**GOLD STANDARD**" }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-97", "text": "Our final gold standard corpus consists of 50 texts from the WSJ portion of the OntoNotes corpus. The corpus will be made publicly available as an OntoNotes annotation layer via http://www.h-its.org/nlp/download."
}, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-98", "text": "Disagreements in the 35 texts used for annotator training (9 texts) and testing (26 texts) were resolved via discussion between the annotators." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-99", "text": "An additional 15 texts were annotated by Annotator A. Finally, Annotator A carried out consistency checks over all texts." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-100", "text": "The gold standard includes 10,980 true mentions (see Table 3)." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-101", "text": "----------------------------------" }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-102", "text": "**FEATURES**" }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-103", "text": "In this section, we describe both the local and the relational features we use." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-104", "text": "----------------------------------" }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-105", "text": "**FEATURES FOR LOCAL CLASSIFICATION**" }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-106", "text": "We use the following local features, including the features in Nissim (2006) and Rahman and Ng (2011), to be able to gauge how their systems fare on our corpus and as a comparison point for our novel collective classification approach." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-107", "text": "The features developed by Nissim (2006) are shown in Table 4." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-108", "text": "Nissim shows clearly that these features are useful for IS classification." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-109", "text": "Thus, subjects are more likely to be old, as assumed by, e.g., centering theory (Grosz et al., 1995)."
}, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-110", "text": "Also, previously unmentioned proper names are more likely to be hearer-old and therefore mediated/knowledge, although their exact status will depend on how well known a particular proper name is." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-111", "text": "Rahman and Ng (2011) add all unigrams appearing in any mention in the training set as features." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-112", "text": "They also integrated (via a convolution tree-kernel SVM (Collins and Duffy, 2001 )) partial parse trees that capture the generalised syntactic context of a mention e and include the mention's parent and sibling nodes without lexical leaves." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-113", "text": "However, they use no structure underneath the mention node e itself, assuming that \"any NP-internal information has presumably been captured by the flat features\"." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-114", "text": "To these feature sets, we add a small set of other local features otherlocal." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-115", "text": "These track partial previous mentions by also counting partial previous mention time as well as the previous mention of content words only." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-116", "text": "We also add a mention's number as one of singular, plural or unknown, and whether the mention is modified by an adjective." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-117", "text": "Another feature encapsulates whether the mention is modified by a comparative marker, using a small set of 10 markers such as another, such, similar . . . and the presence of adjectives or adverbs in the comparative." 
}, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-118", "text": "Finally, we include the mention's semantic class as one of 12 coarse-grained classes, including location, organisation, person and several classes for numbers (such as date, money or percent)." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-119", "text": "----------------------------------" }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-120", "text": "**RELATIONS FOR COLLECTIVE CLASSIFICATION**" }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-121", "text": "Both Nissim (2006) and Rahman and Ng (2011) classify each mention individually in a standard supervised ML setting, not considering potential dependencies between the IS categories of different mentions." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-122", "text": "However, collective or joint classification has made substantial impact in other NLP tasks, such as opinion mining (Pang and Lee, 2004; Somasundaran et al., 2009 ), text categorization (Yang et al., 2002; Taskar et al., 2002) and the related task of coreference resolution (Denis and Baldridge, 2007) ." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-123", "text": "We investigate two types of relations between mentions that might impact on IS classification." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-124", "text": "Syntactic parent-child relations." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-125", "text": "Two mediated subcategories account for accessibility via syntactic links to another old or mediated mention: mediated/synt is used when at least one child of a mention is mediated or old, with child relations restricted to pre-or postnominal possessives as well as PP children in our scheme (see Section 3.1)." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-126", "text": "mediated/aggregate is for coordinations in which at least one of the children is old or mediated." 
}, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-127", "text": "In these two cases, a mention's IS depends directly on the IS of its children." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-128", "text": "We therefore link a mention m 1 to a mention m 2 via a hasChild relation if (i) m 2 is a possessive or prepositional modification of m 1 , or (ii) m 1 is a coordination and m 2 is one of its children." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-129", "text": "Using such a relational feature kills two birds with one stone: firstly, it integrates the internal structure of a mention into the algorithm, which Rahman and Ng (2011) ignore; secondly, it captures dependencies between parent and child classification, which would not be possible if we integrated the internal structure via flat features or additional tree kernels." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-130", "text": "We hypothesise that the higher syntactic complexity of our news genre (14.5% of all mentions are mediated/synt) will make this feature highly effective in distinguishing between new and mediated categories." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-131", "text": "----------------------------------" }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-132", "text": "**SYNTACTIC PRECEDENCE RELATIONS.**" }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-133", "text": "IS is said to influence word order (Birner and Ward, 1998; Cahill and Riester, 2009), and this fact has been exploited in work on generation (Prevost, 1996; Filippova and Strube, 2007; Cahill and Riester, 2009)." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-134", "text": "Therefore, we integrate dependencies between the IS classification of mentions in precedence relations."
}, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-135", "text": "m 1 precedes m 2 if (i) m 1 and m 2 are in the same clause, allowing for trace subjects in gerund and infinitive constructions, (ii) m 1 and m 2 are dependent on the same verb or noun, allowing for intervening nodes via modal, auxiliary, gerund and infinitive constructions, (iii) m 1 is neither a child nor a parent of m 2 , and (iv) m 1 occurs before m 2 ." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-136", "text": "For Example 8 (slightly simplified) we extract the precedence relations shown in Table 5 . (8) She was sent by her mother to a white woman's house to do chores in exchange for meals and a place to sleep." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-137", "text": "Proper names behave differently from common nouns." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-138", "text": "For example, they can occur at many different places in the clause when functioning as spatial or temporal scene-setting elements, such as In New York." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-139", "text": "We therefore exclude all precedence relations where one element of the pair is a proper name." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-140", "text": "We extract 2855 precedence relations." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-141", "text": "----------------------------------" }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-142", "text": "**EXPERIMENTS**" }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-143", "text": "----------------------------------" }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-144", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-145", "text": "We use our gold standard corpus (see Section 3.3) via 10-fold cross-validation on documents for all experiments." 
}, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-146", "text": "Following Nissim (2006) and Rahman and Ng (2011) , we perform all experiments on gold standard mentions and use the human WSJ syntactic annotation for feature extraction, when necessary." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-147", "text": "For the extraction of semantic class, we use OntoNotes entity type annotation for proper names and an automatic assignment of semantic class via WordNet hypernyms for common nouns." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-148", "text": "Coarse-grained versions of all algorithms distinguish only between the three old, mediated, new categories." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-149", "text": "Fine-grained versions distinguish between the categories old, the six mediated subtypes, and new." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-150", "text": "We report overall accuracy as well as precision, recall and F-measure per category." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-151", "text": "Significance tests are conducted using McNemar's test on overall algorithm accuracy, at the level of 1%." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-152", "text": "----------------------------------" }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-153", "text": "**LOCAL CLASSIFIERS**" }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-154", "text": "We reimplemented the algorithms in Nissim (2006) and Rahman and Ng (2011) as comparison baselines, using their feature and algorithm choices." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-155", "text": "Algorithm Nissim is therefore a decision tree J48 with standard settings in WEKA with the features in Table 4." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-156", "text": "Algorithm RahmanNg is an SVM with a composite kernel and one-vs-all training/testing (toolkit SVMLight)." 
}, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-157", "text": "They use the features in Table 4 plus unigram and tree kernel features, described in Section 4.1." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-158", "text": "We add our additional set of otherlocal features to both baseline algorithms (yielding Nissim+ol and RahmanNg+ol) as they aim specifically at improving fine-grained classification." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-159", "text": "----------------------------------" }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-160", "text": "**COLLECTIVE CLASSIFICATION**" }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-161", "text": "To incorporate our inter-mention links, we use a variant of the iterative classification algorithm (ICA), which has shown good performance over a variety of tasks (Lu and Getoor, 2003) and has been used in NLP, for example, for opinion mining (Somasundaran et al., 2009)." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-162", "text": "ICA is normally faster than Gibbs sampling and, in initial experiments, did not yield significantly different results from it." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-163", "text": "ICA initializes each mention with its most likely IS, according to the local classifier and features." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-164", "text": "It then iterates a relational classifier, which uses both local and relational features (our hasChild and precedes features), taking IS assignments to neighbouring mentions into account." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-165", "text": "We use the exist aggregator to define the dependence between mentions." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-166", "text": "We use NetKit (Macskassy and Provost, 2007) with its standard ICA settings for collective inference, as it allows direct comparison between local and collective classification."
}, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-167", "text": "The relational classifiers are always exactly the same classifiers as the local ones with the relational features added: thus, if the local classifier is a tree kernel SVM, so is the relational one." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-168", "text": "One problem with using the SVM tree kernel as the relational classifier is that it allows only binary classification, so we need to train several binary networks in a one-vs-all paradigm (see also Rahman and Ng, 2011), which cannot use the multiclass dependencies of the relational features to optimum effect." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-169", "text": "Table 7 shows the comparison of collective classification to local classification, using Nissim's framework and features, and Table 8 the equivalent table for Rahman and Ng's approach." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-170", "text": "The improvements using the additional local features over the original local classifiers are statistically significant in all cases." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-171", "text": "----------------------------------" }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-172", "text": "**RESULTS**" }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-173", "text": "In particular, the inclusion of semantic classes improves mediated/knowledge and mediated/func, and comparative anaphora are recognised highly reliably via a small set of comparative markers." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-174", "text": "The hasChild relation leads to significant improvement in accuracy over local classification in all cases, showing the value of collective classification."
}, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-175", "text": "The improvement here is centered on the categories of mediated/synt (for both cases) and mediated/aggregate (for Nissim+ol+hasChild) as well as their distinction from new." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-176", "text": "It is also interesting that collective classification with a concise feature set and a simple decision tree, as used in Nissim+ol+hasChild, performs as well as RahmanNg+ol+hasChild, which uses thousands of unigram and tree features and a more sophisticated local classifier." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-177", "text": "It also shows more consistent improvements over all fine-grained classes." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-178", "text": "The precedes relation does not lead to any further improvement." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-179", "text": "We investigated several variations of the precedence link, such as restricting it to certain grammatical relations, taking into account definiteness or NP type, but none of them led to any improvement." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-180", "text": "We think there are two reasons for this lack of success." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-181", "text": "First, the precedence of mediated vs. new mentions does not follow a clear order and is therefore not a very predictive feature (see Table 6)." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-182", "text": "At first, this seems to contradict studies such as Cahill and Riester (2009). Second, old > new preferences are partially already captured by the local features, especially the grammatical role, as, for example, subjects are often both old and early in a sentence."
}, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-183", "text": "With regard to fine-grained classification, many categories, including comparative anaphora, are identified quite reliably, especially in the multiclass classification setting (Nissim+ol+hasChild)." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-184", "text": "Bridging seems to be by far the most difficult category to identify, with final best F-measures still very low." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-185", "text": "Most bridging mentions do not have any clear internal structure or external syntactic contexts that signal their presence." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-186", "text": "Instead, they rely more on lexical and world knowledge for recognition." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-187", "text": "Unigrams could potentially encapsulate some of this lexical knowledge but, without generalization, are too sparse for a relatively rare category such as bridging (6% of all mentions) to perform well." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-188", "text": "The difficulty of bridging recognition is an important insight of this paper as it casts doubt on the strategy in previous research to concentrate almost exclusively on antecedent selection (see Section 2)." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-189", "text": "----------------------------------" }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-190", "text": "**CONCLUSIONS**" }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-191", "text": "We presented a new approach to information status classification in written text, for which we also provide the first reliably annotated English language corpus." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-192", "text": "Based on linguistic intuition, we define features for classifying mentions collectively."
}, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-193", "text": "We show that our collective classification approach outperforms the state of the art in coarse-grained IS classification by about 10% (Nissim, 2006) and 5% (Rahman and Ng, 2011) in accuracy." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-194", "text": "The gain is almost entirely due to improvements in distinguishing between new and mediated mentions." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-195", "text": "For the latter, we also report what are, to our knowledge, the first fine-grained IS classification results." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-196", "text": "Since the work reported in this paper relied, following Nissim (2006) and Rahman and Ng (2011), on gold standard mentions and syntactic annotations, we plan to perform experiments with predicted mentions as well." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-197", "text": "We also have to improve the recognition of bridging, ideally combining recognition and antecedent selection for a complete resolution component." }, { "sent_id": "71a72cfca17b0b15938ed590f9c868-C001-198", "text": "In addition, we plan to integrate IS resolution with our coreference resolution system (Cai et al., 2011) to provide us with a more comprehensive discourse processing system."
} ], "y": { "@BACK@": { "gold_contexts": [ [ "71a72cfca17b0b15938ed590f9c868-C001-2" ], [ "71a72cfca17b0b15938ed590f9c868-C001-20" ], [ "71a72cfca17b0b15938ed590f9c868-C001-49" ], [ "71a72cfca17b0b15938ed590f9c868-C001-121" ], [ "71a72cfca17b0b15938ed590f9c868-C001-168" ] ], "cite_sentences": [ "71a72cfca17b0b15938ed590f9c868-C001-2", "71a72cfca17b0b15938ed590f9c868-C001-20", "71a72cfca17b0b15938ed590f9c868-C001-49", "71a72cfca17b0b15938ed590f9c868-C001-121", "71a72cfca17b0b15938ed590f9c868-C001-168" ] }, "@MOT@": { "gold_contexts": [ [ "71a72cfca17b0b15938ed590f9c868-C001-2", "71a72cfca17b0b15938ed590f9c868-C001-3" ], [ "71a72cfca17b0b15938ed590f9c868-C001-20", "71a72cfca17b0b15938ed590f9c868-C001-21", "71a72cfca17b0b15938ed590f9c868-C001-22", "71a72cfca17b0b15938ed590f9c868-C001-23", "71a72cfca17b0b15938ed590f9c868-C001-24" ] ], "cite_sentences": [ "71a72cfca17b0b15938ed590f9c868-C001-2", "71a72cfca17b0b15938ed590f9c868-C001-20" ] }, "@DIF@": { "gold_contexts": [ [ "71a72cfca17b0b15938ed590f9c868-C001-20", "71a72cfca17b0b15938ed590f9c868-C001-21", "71a72cfca17b0b15938ed590f9c868-C001-22", "71a72cfca17b0b15938ed590f9c868-C001-23", "71a72cfca17b0b15938ed590f9c868-C001-24" ], [ "71a72cfca17b0b15938ed590f9c868-C001-33" ], [ "71a72cfca17b0b15938ed590f9c868-C001-49", "71a72cfca17b0b15938ed590f9c868-C001-50" ], [ "71a72cfca17b0b15938ed590f9c868-C001-129" ], [ "71a72cfca17b0b15938ed590f9c868-C001-193" ] ], "cite_sentences": [ "71a72cfca17b0b15938ed590f9c868-C001-20", "71a72cfca17b0b15938ed590f9c868-C001-33", "71a72cfca17b0b15938ed590f9c868-C001-49", "71a72cfca17b0b15938ed590f9c868-C001-129", "71a72cfca17b0b15938ed590f9c868-C001-193" ] }, "@USE@": { "gold_contexts": [ [ "71a72cfca17b0b15938ed590f9c868-C001-33" ], [ "71a72cfca17b0b15938ed590f9c868-C001-106" ], [ "71a72cfca17b0b15938ed590f9c868-C001-146" ], [ "71a72cfca17b0b15938ed590f9c868-C001-154" ], [ "71a72cfca17b0b15938ed590f9c868-C001-196" ] ], "cite_sentences": [ "71a72cfca17b0b15938ed590f9c868-C001-33", 
"71a72cfca17b0b15938ed590f9c868-C001-106", "71a72cfca17b0b15938ed590f9c868-C001-146", "71a72cfca17b0b15938ed590f9c868-C001-154", "71a72cfca17b0b15938ed590f9c868-C001-196" ] } } }, "ABC_03c57679549ff600a024d436d5a107_13": { "x": [ { "sent_id": "03c57679549ff600a024d436d5a107-C001-28", "text": "----------------------------------" }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-29", "text": "**BASELINE PAIRWISE SYSTEM**" }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-2", "text": "Coreference resolution systems can benefit greatly from inclusion of global context, and a number of recent approaches have demonstrated improvements when precomputing an alignment to external knowledge sources." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-3", "text": "However, since alignment itself is a challenging task and is often noisy, existing systems either align conservatively, resulting in very few links, or combine the attributes of multiple candidates, leading to a conflation of entities." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-4", "text": "Our approach instead performs joint inference between within-document coreference and entity linking, maintaining ranked lists of candidate entities that are dynamically merged and reranked during inference." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-5", "text": "Further, we incorporate a large set of surface string variations for each entity by using anchor texts from the web that link to the entity." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-6", "text": "These forms of global context enable our system to improve classifier-based coreference by 1.09 B 3 F1 points and improve over the previous state of the art by 0.41 points, thus setting a new state-of-the-art result on the ACE 2004 data."
}, { "sent_id": "03c57679549ff600a024d436d5a107-C001-7", "text": "----------------------------------" }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-9", "text": "Coreference resolution is the task of identifying sets of noun phrase mentions from a document that refer to the same real-world entities." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-79", "text": "**LINKING TO WIKIPEDIA**" }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-10", "text": "For example, in the following excerpt: \"The Chicago suburb of Arlington Heights is the first stop for George W. Bush 1 today." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-11", "text": "The Texas governor 2 stops in Gore's home state 3 of Tennessee 4 this afternoon. . . \", (m 1 , m 2 ) and (m 3 , m 4 ) define the coreferent pairs." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-12", "text": "Coreference resolution forms an important component for natural language processing and information extraction pipelines due to its utility in relation extraction, cross-document coreference, text summarization, and question answering." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-13", "text": "The task of coreference is challenging for automated systems as the local information contained in the document is often not enough to accurately disambiguate mentions, for example, coreferencing (m 1 , m 2 ) requires identifying that George W. Bush (m 1 ) is the governor of Texas (m 2 ), and similarly for (m 3 , m 4 )." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-14", "text": "External knowledge-bases such as FrameNet (Baker et al., 1998) , Wikipedia, Yago (Suchanek et al., 2007) , and Freebase (Bollacker et al., 2008) , can be used to provide global context, and there is a strong need for coreference resolution systems to accurately use such sources for disambiguation." 
}, { "sent_id": "03c57679549ff600a024d436d5a107-C001-15", "text": "Incorporating external knowledge bases into coreference has been the subject of active recent research." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-16", "text": "Ponzetto and Strube (2006) and Ratinov and Roth (2012) precompute a fixed alignment of the mentions to the knowledge base entities." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-17", "text": "The attributes of these entities are used during coreference by incorporating them in the mention features." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-18", "text": "Since alignment of mentions to the external entities is itself a difficult task, these systems favor high-precision linking." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-19", "text": "Unfortunately, this results in fewer alignments, and improvements are only shown on mentions that are easier to align and corefer (such as the non-transcript documents in Ratinov and Roth (2012) )." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-20", "text": "Alternatively, Rahman and Ng (2011) link each mention to multiple entities in the knowledge base, improving recall at the cost of lower precision; the attributes of all the linked entities are aggregated as features." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-21", "text": "Although this approach is more robust to noise in the documents, the features of a mention merge the different aspects of the entities, for example a \"Michael Jordan\" mention will contain features for both the scientist and basketball personas." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-22", "text": "Instead of fixing the alignment of the mentions to the knowledge base, our proposed approach maintains a ranked list of candidate entities for each mention." 
}, { "sent_id": "03c57679549ff600a024d436d5a107-C001-23", "text": "To expand the set of surface strings that may be used to refer to each entity, the attributes of each candidate contain anchor texts (the visible text) of the links on the web that refer to that entity candidate." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-24", "text": "When mentions are compared during inference, we use the features computed from the top ranked entity candidate of the antecedent mention." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-25", "text": "As mentions are merged, the ranked lists of candidate entities are also merged and reranked, often changing the top-ranked entity candidate used in subsequent comparisons." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-26", "text": "The large set of surface string variations and constant reranking of the entity candidates during inference allows our approach to correct mistakes in alignment and makes external information applicable to a wider variety of mentions." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-27", "text": "Our paper provides the following contributions: (1) an approach that jointly reasons about both within-doc entities and their alignment to KBentities by dynamically adjusting a ranked list of candidate alignments, during coreference, (2) Utilization of a larger set of surface string variations for each entity candidate by using links that appear all over the web (Spitkovsky and Chang, 2012) , (3) A combination of these approaches that improves upon a competitive baseline without a knowledge base by 1.09 B 3 F1 points on the ACE 2004 data, and outperforms the state-of-the-art coreference system (Stoyanov and Eisner, 2012) by 0.41 B 3 F1 points, and (4) Accurate predictions on documents that are difficult for coreference, such as the transcript documents that were omitted from the evaluation in Ratinov and Roth (2012) , and documents that contain a large number of mentions." 
}, { "sent_id": "03c57679549ff600a024d436d5a107-C001-30", "text": "In this section, we describe a variant of a commonly used coreference resolution system that does not utilize external knowledge sources." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-31", "text": "This widely adopted model casts the problem as a series of binary classifications (Soon et al., 2001; Ng and Cardie, 2002; Ponzetto and Strube, 2006; Bengston and Roth, 2008; Stoyanov et al., 2010)." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-32", "text": "Given a document with its mentions, the system iteratively checks each mention m j for coreference with preceding mentions using a classifier." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-33", "text": "A coreference link may be created between m j and one of these preceding mentions using one of the following strategies." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-34", "text": "The CLOSESTLINK (Soon et al., 2001) method picks the closest mention to m j that is positively classified, while the BESTLINK (Ng and Cardie, 2002) method links m j to the preceding" }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-35", "text": "mention that was scored the highest." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-36", "text": "If none of the preceding mentions are classified as positive (for CLOSESTLINK), or are above a threshold (for BESTLINK), then m j is left unlinked." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-37", "text": "After all the mentions have been processed, the links are used to generate a transitive closure that corresponds to the recognized entities in the document."
}, { "sent_id": "03c57679549ff600a024d436d5a107-C001-38", "text": "----------------------------------" }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-39", "text": "**PAIRWISE MENTION FEATURES**" }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-40", "text": "The features used to train our classifier are similar to those in Bengston and Roth (2008) , including lexical, syntactical, semantic, predicted NER types, etc., with the exclusion of their \"learned features\" that require additional classifiers." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-41", "text": "Further, we include features that compare the mention strings, the distance between the two mentions in terms of the number of sentences and tokens, and the POS tags of the head words." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-42", "text": "We also use the conjunctions of these features as in Bengston and Roth (2008) , as well as the BESTLINK approach." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-43", "text": "The complete set of features are listed in Table 1 ." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-44", "text": "The training for our system is similar to Bengston and Roth (2008) ." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-45", "text": "The positive training examples are generated from mentions and their immediate preceding antecedent." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-46", "text": "The negative examples are generated from mentions and all their preceding non-coreferent mentions." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-47", "text": "If the mention is not a pronoun, preceding pronouns are not used to create training examples, and they are also excluded during inference." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-48", "text": "In contrast to averaged perceptron used in Bengston and Roth (2008) , our baseline system is trained using hinge-loss, 2 -regularized SVM." 
}, { "sent_id": "03c57679549ff600a024d436d5a107-C001-49", "text": "----------------------------------" }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-50", "text": "**MERGING PAIRWISE FEATURES**" }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-51", "text": "When a mention m j is compared against a preceding mention m i , information from other mentions that are already coreferent with m i may be helpful in disambiguating m j as they may contain information that is not available from m i . Let M be the mentions between m i and m j that are coreferent with m i . Let m q \u2208 M be the mention that is closest to m j ." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-52", "text": "All the features from the pair (m q , m j ), except those that characterize one mention (for example, mention type of m j ), are added to the features between (m i , m j )." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-53", "text": "This extends a similar approach by Lee et al. (2011) that merges only the attributes of mentions (such as gender, but not all pairwise features)." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-54", "text": "----------------------------------" }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-55", "text": "**PRUNING COMPARISONS DURING TRAINING**" }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-56", "text": "A potential drawback of including all the negative examples as in Bengston and Roth (2008) is that the negative instances far outnumber the positive ones, which is challenging for training a classifier." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-57", "text": "In their system, the positive training examples only constitute 1.6% of the total training instances." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-58", "text": "By contrast, Soon et al. (2001) reduce the number of negative instances by using only mentions between the mention and its closest coreferent pair as negative examples." 
}, { "sent_id": "03c57679549ff600a024d436d5a107-C001-59", "text": "Instead of just using the closest coreferent mention, we extend this approach to use the k closest of coreferent preceding mentions, where k is tuned using the development data." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-60", "text": "----------------------------------" }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-61", "text": "**DYNAMIC LINKING TO KNOWLEDGE-BASE**" }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-62", "text": "In this section, we describe our approach to coreference resolution that incorporates external knowledge sources." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-63", "text": "The approach is an extension of the pairwise model described earlier, with the inclusion of a ranked list of entities, and using a larger set of surface string variations." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-64", "text": "----------------------------------" }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-65", "text": "**ALGORITHM**" }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-66", "text": "We describe our overall approach in Algorithm 1." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-67", "text": "The system assumes that the data is annotated with true mention boundaries and mention types." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-68", "text": "We additionally tokenize the document text and tag the tokens with their parts of speech for use as features." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-69", "text": "First, an empty entity candidate list is created for each mention in the document." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-70", "text": "For each proper noun mention, we query a knowledge base for an ordered list of Wikipedia articles that may refer to it, and add these to the mention's candidate list." 
}, { "sent_id": "03c57679549ff600a024d436d5a107-C001-71", "text": "Other mentions' candidates lists are left empty." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-72", "text": "After this pre-processing, each mention m i is compared against its preceding mentions m 1 . . ." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-73", "text": "m i\u22121 and their top-ranked entity candi- date using a classifier." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-74", "text": "Amongst antecedents m 1 . . ." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-75", "text": "m i\u22121 that score above a threshold, the highest-scoring one m j is marked as coreferent with m i and the two candidate lists that correspond to m i and m j are merged." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-76", "text": "Merging two mentions results in the merging and reranking of their respective entity candidate lists, described below." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-77", "text": "If no antecedents score above a threshold, we leave the mention in its singleton cluster." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-78", "text": "----------------------------------" }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-80", "text": "To create the initial entity candidate lists for proper noun mentions, we query a knowledge base searcher (Dalton and Dietz, 2013) with the text of these mentions." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-81", "text": "These queries return scored, ranked lists of entity candidates (Wikipedia articles), which we associate with each proper noun mention, leaving the rest of the candidate lists empty." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-82", "text": "Linking is often noisy, so only selecting the high-precision links as in Ratinov and Roth (2012) results in too few matches, while picking an aggregation of all links results in more noise due to lower precision (Rahman and Ng, 2011) ." 
}, { "sent_id": "03c57679549ff600a024d436d5a107-C001-83", "text": "Additionally, since linking is often performed in pre-processing, two mentions that are determined coreferent during inference could still be linked to different KB entities." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-84", "text": "To avoid these problems, we keep a list of candidate links for each mention, merging the lists when two mentions are determined coreferent, and rerank this list during inference." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-85", "text": "----------------------------------" }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-86", "text": "**POPULATING ENTITY ATTRIBUTES**" }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-87", "text": "After linking to Wikipedia, we have a list of candidate KB entities for each mention." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-88", "text": "Each entity has access to external information keyed on the Wikipedia article, but this information could more generally come from any knowledge base." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-89", "text": "Given these entities, there are many possible features that may be used for disambiguation of the mentions, such as gender and fine-grained Wikipedia categories as used by Ratinov and Roth (2012) , however most of these features may not be relevant to the task of within-document coreference." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-90", "text": "Instead, an important resource for linking non-proper mentions of an entity is to identify the possible name variations of the entity." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-91", "text": "For example, it would be useful to know that Massachusetts is also referred to as \"The 6th State\", however this information is not readily available from Wikipedia." 
}, { "sent_id": "03c57679549ff600a024d436d5a107-C001-92", "text": "1 We instead use the corpus described in Spitkovsky and Chang (2012) that consists of anchor texts of links to Wikipedia that appear on web pages." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-93", "text": "This collection of anchor texts is sufficiently extensive to cover many common misspellings of entity names, as well as many name variations missing from Wikipedia." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-94", "text": "For example, for the entity \"Massachusetts\", our anchor texts include misspellings like \"Massachussetts\" and \"Messuchusetts\", and the (debatably) affectionate nickname of \"Taxachusetts\"-none of which are found in Wikipedia." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-95", "text": "Using these anchor texts, each entity candidate provides a rich set of name variations that we use for disambiguation, as described in the next section." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-96", "text": "----------------------------------" }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-97", "text": "**INFERENCE WITH DYNAMIC LINKING**" }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-98", "text": "The input to our inference algorithm consists of a number of mentions, a list of ranked entity candidates for the proper noun mentions that are present in the KB, and a list of attributes (in this case, name variations) for each entity candidate." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-99", "text": "Scoring: Our underlying model is a pairwise classification approach as described in Section 2." 
}, { "sent_id": "03c57679549ff600a024d436d5a107-C001-100", "text": "Similar to existing coreference systems such as Bengston and Roth (2008) and Rahman and Ng (2011) , we perform coreference resolution using greedy left-to-right pairwise mention classification, clustering each mention with its highest-scoring antecedent (or leaving it as a singleton temporarily if no score is above a threshold)." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-101", "text": "We add the same additional features and perform feature merging operation (Section 2.2) as in our baseline system." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-102", "text": "The top-ranked entity candidate of the antecedent mention is used during coreference to provide additional features for the pairwise classifier." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-103", "text": "Only using the top-ranked entity candidate allows the system to maintain a consistent one entity per cluster hypothesis, reducing the noise resulting from conflated entities." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-104", "text": "The attributes for this topranked entity consist of name variations." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-105", "text": "We add a binary feature, and conjunctions of this with other features, if the text of the right mention matches one of these name variations." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-106", "text": "Entity List Merging: Once a mention pair is scored as coreferent, their corresponding entity candidates are merged." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-107", "text": "Merging is performed by simply combining the two lists of candidates." 
}, { "sent_id": "03c57679549ff600a024d436d5a107-C001-108", "text": "Note that there is only one candidate list for a given group of coreferent mentions at any point in inference: if m 1 and m 2 have been previously marked as coreferent, and m 3 is marked as coreferent with m 2 , m 1 's entity candidates will then contain those from m 3 for future classification decisions." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-109", "text": "Re-Ranking: After the two entity candidate lists are merged, we rerank the candidates to identify the top-ranked one." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-110", "text": "We sort the new list of candidate entities by the number of times each candidate occurs in the list, breaking ties by their original relevance from the KB." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-111", "text": "For example, if two mentions disagree on the top-ranked KB search result, but agree on the second one, after being clustered they will both use the second search result when creating feature vectors for future coreference decisions." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-112", "text": "Even though other candidates besides the top-ranked one are ignored for a single classification decision, they may become top-ranked after merging with later candidate sets." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-113", "text": "This approach allows our system to use the intermediate results of coreference resolution to re-link mentions to KB entities, reducing the noise and contradictory features from incorrect links." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-114", "text": "Additionally, features from the KB are added to nonproper noun mentions once those mentions are linked with a populated entity, allowing the results of coreference to enrich non-proper noun mentions with KB-based features." 
}, { "sent_id": "03c57679549ff600a024d436d5a107-C001-115", "text": "The initial proper noun queries effectively seed the linking process, and KB data is then dynamically spread to the other mentions through coreference." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-116", "text": "----------------------------------" }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-117", "text": "**EXAMPLE**" }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-118", "text": "We describe a run of our approach on an example in Figure 1 Figure 1 : Example of Dynamic Alignment paired with a top-ranked KB candidate: \"Washington\", \"Wash\", and \"Washington State\"." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-119", "text": "For the first two mentions, clearly the top entity candidate is incorrect; hence approaches that rely on a fixed alignment will perform poorly." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-120", "text": "In particular, since \"Washington State\" mention is not compatible with the top-ranked entities of the first two mentions (Washington, D.C. and Car Wash respectively), approaches that do not modify the ranking during inference may not resolve them." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-121", "text": "However, the correct candidate Washington State does appear in the candidate entities of the first two mentions, albeit with a lower rank." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-122", "text": "In our approach, clustering the first two mentions causes the shared candidate Washington State to move to the top of the list." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-123", "text": "The coreference system is now able to easily identify that the \"Washington State\" mention is compatible with the Washington State entity formed by the previous two mentions, providing evidence that the final mention should be clustered with either of them in subsequent comparisons." 
}, { "sent_id": "03c57679549ff600a024d436d5a107-C001-124", "text": "----------------------------------" }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-125", "text": "**EXPERIMENTS**" }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-126", "text": "----------------------------------" }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-127", "text": "**SETUP**" }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-128", "text": "We evaluate our system on the ACE 2004 annotated dataset (Doddington et al., 2004) ." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-129", "text": "Following the setup in Bengston and Roth (2008), we split the corpus into training, development, and test sets, resulting in 268 documents in the train set, 107 documents in the test set, and 68 documents in the development set." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-130", "text": "The data is processed using standard open source tools to segment the sentences and tokenize the corpus, and using the OpenNLP 2 tagger to obtain the POS tags." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-131", "text": "The hyperparameters of our system, such as regularization, initial number of candidates, and the number of compar-2 http://opennlp.apache.org/ isons during training (k in Section 2.3) are tuned on the development data when trained on the train set." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-132", "text": "The models we use to evaluate on the test data set are trained on the training and development sets, following the standard evaluation for coreference first used by Culotta et al. (2007) ." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-133", "text": "To provide the initial ranked list of entity candidates from Wikipedia, we query the KB Bridge system (Dalton and Dietz, 2013) with the proper name mentions." 
}, { "sent_id": "03c57679549ff600a024d436d5a107-C001-134", "text": "KB Bridge is an information-retrievalbased entity linking system that connects the query mentions to Wikipedia entities using a sequential dependence model." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-135", "text": "This system has been shown to match or outperform the top performing systems in the 2012 TAC KBP entity linking task." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-136", "text": "----------------------------------" }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-137", "text": "**METHODS**" }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-138", "text": "Our experiments investigate a number of baselines that are similar or identical to existing approaches." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-139", "text": "Wikipedia Linking: As a simple baseline, we directly evaluate the quality of the alignment for coreference by merging all pairs of proper noun mentions that share at least one common candidate, as per KB bridge." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-140", "text": "Further, the non-pronoun mentions are linked to these proper nouns if the mention string matches any of the entity titles or anchor texts." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-141", "text": "Bengston and Roth (2008) : A pairwise coreference model containing a rich set of features, as described and evaluated in Bengston and Roth (2008) ." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-142", "text": "Baseline: Our implementation of a pairwise model that is similar to the approach in Bengston and Roth (2008) with the differences described in Section 2." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-143", "text": "This is our baseline system that performs coreference without the use of external knowledge." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-144", "text": "Incidentally, it outperforms Bengston and Roth (2008) ." 
}, { "sent_id": "03c57679549ff600a024d436d5a107-C001-145", "text": "Dynamic linking: This is our complete system as described in Section 3, in which the list of candidates associated with each mention is reranked and modified during inference." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-146", "text": "Static linking: Identical to dynamic linking except that entity candidate lists are not merged during inference (i.e., Algorithm 1 without line 17)." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-147", "text": "This approach is comparable to the fixed alignment model, as in the approaches of Ponzetto and Strube (2006) and Ratinov and Roth (2012) ." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-148", "text": "----------------------------------" }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-149", "text": "**RESULTS**" }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-150", "text": "As in Bengston and Roth (2008) , we evaluate our system primarily using the B 3 metric (Bagga and Baldwin, 1998) , but also include pairwise, MUC and CEAF(m) metrics." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-151", "text": "The performance of our systems on the test data set is shown in Table 2 ." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-152", "text": "These results use true mentions provided in the dataset, since, as suggested by Ng (2010) , coreference resolvers that use different mention detectors (extraction from parse tree, detector trained from gold boundaries, etc.) should not be compared." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-153", "text": "Our baseline system outperforms Bengston and Roth (2008) by 0.32 B 3 F1 points on this data set." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-154", "text": "Incorporating Wikipedia and anchor text information from the web with a fixed alignment (static linking) further improves our performance by 0.54 B 3 F1 points." 
}, { "sent_id": "03c57679549ff600a024d436d5a107-C001-155", "text": "Using dynamic linking, which improves the alignment during inference, achieves another 0.55 F1 point improvement, which is 1.09 F1 above our baseline, 1.41 F1 above the current best pairwise classification system (corresponding to an error reduction of 7.4%), and 0.4 F1 above the current state-of-art on this dataset (Stoyanov and Eisner, 2012) ." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-156", "text": "The improvement of the dynamic linking approach over our baselines is consistent across the various evaluation metrics." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-157", "text": "----------------------------------" }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-158", "text": "**DISCUSSION**" }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-159", "text": "We also explore our system's performance on subsets of the ACE dataset, and on the OntoNotes dataset." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-160", "text": "----------------------------------" }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-161", "text": "**DOCUMENT LENGTH**" }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-162", "text": "Coreference becomes more difficult as the number of mentions is increased since the number of pairwise comparisons increases quadratically with the number of mentions." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-163", "text": "We observe this phenomenon in our dataset: the performance on the smallest third of the documents (when sorted according to number of mentions) is 8.5-10% higher than on the largest third of the documents, as per the B 3 metric." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-164", "text": "However, we expect dynamic linking of entities to be more beneficial on these larger documents as our system can use the information from a larger number of mentions to improve the alignment during inference." 
}, { "sent_id": "03c57679549ff600a024d436d5a107-C001-165", "text": "Static linking, on the other hand, is unlikely to obtain higher improvements with the larger number of mentions in the document as the alignment is fixed." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-166", "text": "We perform the following experiment to analyze the performance with varying numbers of mentions." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-167", "text": "We sort all the documents in the test set according to their number of mentions, and evaluate on the top X% of this list (where X is 10, 33, 40, 50) ." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-168", "text": "As the results demonstrate in Figure 2 , the improvement of the static linking approach stays fairly even as X is varied." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-192", "text": "**RELATED WORK**" }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-169", "text": "Even though the experiments suggest that the larger documents are tougher to coreference, 3 dynamic linking provides higher improvements when the documents contain a larger number of mentions." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-170", "text": "----------------------------------" }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-171", "text": "**PERFORMANCE ON TRANSCRIPTS**" }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-172", "text": "The quality of alignment and the coreference predictions for a document is influenced by the quality of the mentions in the document." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-173", "text": "In particular, Table 2 : Evaluation on the ACE test data, with the system trained on the train and development sets." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-174", "text": "ACE contains a large number of broadcast news documents, many of which consist of transcribed data containing noise in the form of incomplete sentences and disfluencies." 
}, { "sent_id": "03c57679549ff600a024d436d5a107-C001-175", "text": "Since these transcripts provide an additional challenge for alignment and coreference, Ratinov and Roth (2012) only use the set of non-transcripts for their evaluation." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-176", "text": "Using dynamic linking and a large set of surface string variations, our approach may be able to provide an improvement even on the transcripts." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-177", "text": "To identify the transcripts in the test set, we use the approximation from Ratinov and Roth (2012) that considers a document to be non-transcribed if it contains proper noun mentions and at least a third of those start with a capital letter." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-178", "text": "The performance is shown in Table 3 , while the improvement over our baseline is shown in Figure 3 ." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-179", "text": "Our static linking matches the performance of Ratinov and Roth (2012) on the non-transcripts." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-180", "text": "Further, the improvement of static linking on the transcripts over the baseline is lower than that on the non-transcript data, suggesting that noisy mentions and text result in poor quality alignment." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-181", "text": "Dynamic linking, on the other hand, not only outperforms all other systems, but also shows a higher improvement over the baseline on the transcripts than on non-transcripts." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-182", "text": "This indicates that dynamic linking approach is robust to noise, and its wider variety of surface strings and flexible alignments are especially useful for transcripts." 
}, { "sent_id": "03c57679549ff600a024d436d5a107-C001-183", "text": "----------------------------------" }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-184", "text": "**ONTONOTES**" }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-185", "text": "We also run our systems on the OntoNotes dataset, which was used for evaluation in CoNLL 2011 Shared Task (Pradhan et al., 2011) ." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-186", "text": "The dataset consists of 2083 documents from a much larger variety of genres, such as conversations, magazines, web text, etc." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-187", "text": "Further, the dataset also consists of mentions that refer to events, most of which do not appear as Wikipedia pages." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-188", "text": "Since only the nonsingleton mentions are annotated in the training set, we also include additional noun phrase mentions during training." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-189", "text": "We obtain B 3 F1 of 65.3, 67.6, and 67.7 for our baseline, static linking, and dynamic linking respectively." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-190", "text": "4 When compared to the participants of the closed task, the dynamic linking system outperforms all but two on this metric, suggesting that dynamic alignment is beneficial even when the features have not been engineered for events or for different genres." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-191", "text": "----------------------------------" }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-193", "text": "Within-document coreference has been wellstudied for a number of years." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-194", "text": "A variety of approaches incorporate linguistic knowledge as rules iteratively applied to identify the chains, such as Klein (2009), Raghunathan et al. (2010) , Stoyanov et al. (2010) ." 
}, { "sent_id": "03c57679549ff600a024d436d5a107-C001-195", "text": "Alternatively (and similar to our approach), others represent this knowledge as features in a machine learning model." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-196", "text": "Early applications of such models include Soon et al. (2001) , Ng and Cardie (2002) and (Bengston and Roth, 2008) ." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-197", "text": "There are also a number of techniques that represent entities explicitly (Culotta et al., 2007; Wick et al., 2009; Haghighi and Klein, 2010; Stoyanov and Eisner, 2012) ." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-198", "text": "This work is an extension of recent approaches that incorporate external knowledge sources to improve within-document coreference." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-199", "text": "Ponzetto and Strube (2006) identify Wikipedia candidates for each mention as a preprocessing step, and incorporate them as features in a pairwise model." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-200", "text": "Our method differs in that we draw such features from entity candidates during inference, and also maintain and update a set of candidate entity links instead of selecting only one." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-201", "text": "Rahman and Ng (2011) introduce similar features from a more extensive set of knowledge sources (such as YAGO and FrameNet) into a cluster-based model whose features change as inference proceeds." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-202", "text": "However, the features for each cluster come from a combination of all entities aligned to the cluster mentions." 
}, { "sent_id": "03c57679549ff600a024d436d5a107-C001-203", "text": "We improve upon this approach by maintaining a list of the candidate entities for each mention cluster, modifying this list during the course of inference, and using features from only the top-ranked candidate at any time." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-204", "text": "Further, they do not provide a comparison on a standard dataset." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-205", "text": "Ratinov and Roth (2012) extend the multi-sieve coreference model (Raghunathan et al., 2010) by identifying at most a single candidate for each mention, and incorporating high-precision attributes extracted from Wikipedia." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-206", "text": "The high-precision mention-candidate pairings are precomputed and fixed; additionally, the features for an entity are based on the predictions of the previous sieves, thus fixed while a sieve is applied." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-207", "text": "With these restrictions, they show improvements over the state-ofthe-art on a subset of ACE mentions that are more easily aligned to Wikipedia, while our approach demonstrates improvements on the complete set of mentions including the tougher to link mentions from the transcripts." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-208", "text": "There are a number of approaches that provide an alignment from mentions in a document to Wikipedia." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-209", "text": "Wikifier (Ratinov et al., 2011) analyzes the context around the mentions and the entities jointly, and was used to align mentions for coreference in Ratinov and Roth (2012) ." 
}, { "sent_id": "03c57679549ff600a024d436d5a107-C001-210", "text": "Dalton and Dietz (2013) introduce an approximation to the above approach, but incorporate retrieval-based supervised reranking that provides multiple candidates and scores; this approach performed competitively on previous TAC-KBP entity linking benchmarks (Dietz and Dalton, 2012) ." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-211", "text": "Alignment to an external knowledge-base has improved performance for a number of NLP and information extraction tasks, such as named-entity recognition (Cucerzan, 2007; Han and Zhao, 2009) , cross-document coreference (Finin et al., 2009; Singh et al., 2010) , and relation-extraction (Riedel et al., 2010; Hoffmann et al., 2011) ." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-212", "text": "----------------------------------" }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-213", "text": "**CONCLUSIONS**" }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-214", "text": "In this paper, we incorporate external knowledge to improve within-document coreference." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-215", "text": "Instead of fixing the alignment a priori, our approach maintains a ranked list of candidate entities for each mention, and merges and reranks the list during inference." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-216", "text": "Further, we consider a large set of surface string variations for each entity by using anchor texts from the web." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-217", "text": "These external sources allow our system to achieve a new state-of-the-art on the ACE data." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-218", "text": "We also demonstrate improvements on documents that are difficult for alignment and coreference, such as transcripts and documents containing a large number of mentions." 
}, { "sent_id": "03c57679549ff600a024d436d5a107-C001-219", "text": "A number of possible avenues for future study are apparent." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-220", "text": "First, our alignment to a knowledgebase can benefit from more document-aware linking to entities, such as the Wikifier (Ratinov et al., 2011) ." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-221", "text": "Second, we would like to augment mention features with additional information available from the knowledge base, such as Wikipedia categorization and gender attributes." }, { "sent_id": "03c57679549ff600a024d436d5a107-C001-222", "text": "We also want to investigate a cluster ranking model, as used in (Rahman and Ng, 2011; Stoyanov and Eisner, 2012) , to aggregate the features of all the coreferent mentions as inference progresses." } ], "y": { "@BACK@": { "gold_contexts": [ [ "03c57679549ff600a024d436d5a107-C001-16" ], [ "03c57679549ff600a024d436d5a107-C001-19" ], [ "03c57679549ff600a024d436d5a107-C001-82" ], [ "03c57679549ff600a024d436d5a107-C001-89" ], [ "03c57679549ff600a024d436d5a107-C001-205" ], [ "03c57679549ff600a024d436d5a107-C001-209" ] ], "cite_sentences": [ "03c57679549ff600a024d436d5a107-C001-16", "03c57679549ff600a024d436d5a107-C001-19", "03c57679549ff600a024d436d5a107-C001-82", "03c57679549ff600a024d436d5a107-C001-89", "03c57679549ff600a024d436d5a107-C001-205", "03c57679549ff600a024d436d5a107-C001-209" ] }, "@USE@": { "gold_contexts": [ [ "03c57679549ff600a024d436d5a107-C001-27" ], [ "03c57679549ff600a024d436d5a107-C001-177" ] ], "cite_sentences": [ "03c57679549ff600a024d436d5a107-C001-27", "03c57679549ff600a024d436d5a107-C001-177" ] }, "@DIF@": { "gold_contexts": [ [ "03c57679549ff600a024d436d5a107-C001-82", "03c57679549ff600a024d436d5a107-C001-84" ], [ "03c57679549ff600a024d436d5a107-C001-89", "03c57679549ff600a024d436d5a107-C001-92" ], [ "03c57679549ff600a024d436d5a107-C001-175" ], [ "03c57679549ff600a024d436d5a107-C001-205", 
"03c57679549ff600a024d436d5a107-C001-207" ] ], "cite_sentences": [ "03c57679549ff600a024d436d5a107-C001-82", "03c57679549ff600a024d436d5a107-C001-89", "03c57679549ff600a024d436d5a107-C001-175", "03c57679549ff600a024d436d5a107-C001-205" ] }, "@SIM@": { "gold_contexts": [ [ "03c57679549ff600a024d436d5a107-C001-147" ], [ "03c57679549ff600a024d436d5a107-C001-179" ] ], "cite_sentences": [ "03c57679549ff600a024d436d5a107-C001-147", "03c57679549ff600a024d436d5a107-C001-179" ] } } }, "ABC_46050691971ea46ce7e18fef5f6d2d_13": { "x": [ { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-2", "text": "Learning representations that accurately model semantics is an important goal of natural language processing research." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-3", "text": "Many semantic phenomena depend on syntactic structure." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-4", "text": "Recent work examines the extent to which state-of-the-art models for pre-training representations, such as BERT, capture such structure-dependent phenomena, but is largely restricted to one phenomenon in English: number agreement between subjects and verbs." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-5", "text": "We evaluate BERT's sensitivity to four types of structure-dependent agreement relations in a new semi-automatically curated dataset across 26 languages." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-6", "text": "We show that both the single-language and multilingual BERT models capture syntax-sensitive agreement patterns well in general, but we also highlight the specific linguistic contexts in which their performance degrades." 
}, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-7", "text": "----------------------------------" }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-9", "text": "Learning general-purpose sentence representations which accurately model sentential semantic content is a current goal of natural language processing research (Subramanian et al., 2018; Conneau et al., 2017; Wieting et al., 2016; Kiros et al., 2015) ." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-10", "text": "A prominent and successful approach is to pre-train neural networks to encode sentences into fixed length vectors (Conneau et al., 2018; Nie et al., 2017) , with common architecture choices based on recurrent neural networks (Elman, 1990; Hochreiter and Schmidhuber, 1997) , convolutional neural networks, or transformers (Vaswani et al., 2017) ." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-11", "text": "Many core linguistic phenomena that one would like to model in general-purpose sentence representations depend on syntactic structure (Chomsky, 1965; Everaert et al., 2015) ." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-12", "text": "Despite the fact that none of the aforementioned architectures have explicit syntactic structural representations, there is some evidence that these models can approximate such structure-dependent phenomena under certain conditions (Gulordava et al., 2018; McCoy et al., 2018; Linzen et al., 2016; Bowman et al., 2015) , in addition to their widespread success in practical tasks." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-13", "text": "The recently introduced BERT model (Devlin et al., 2018) , which is based on transformers, achieves state-of-the-art results on eleven natural language processing tasks." 
}, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-14", "text": "In this work, we assess BERT's ability to learn structure-dependent linguistic phenomena of agreement relations." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-15", "text": "To test whether BERT is sensitive to agreement relations, we use the cloze test (Taylor, 1953 , also called the \"masked language model\" objective), in which we mask out one of two words in an agreement relation and ask BERT to predict the masked word, one of the two tasks on which BERT is initially trained." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-16", "text": "Goldberg (2019) adapted the experimental setup of Linzen et al. (2016) , Gulordava et al. (2018) and Marvin and Linzen (2018) to use the cloze test to assess BERT's sensitivity to number agreement in English subject-verb agreement relations." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-17", "text": "The results showed that the single-language BERT model performed surprisingly well at this task (above 80% accuracy in all experiments), even when there were multiple \"distractors\" in the sentence (other nouns that differed from the subject in number)." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-18", "text": "This suggests that BERT is actually learning to approximate structure-dependent computation, and not simply relying on flawed heuristics." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-19", "text": "However, English subject-verb agreement is a rather restricted phenomenon, with the majority of verbs having only two inflected forms and only one morphosyntactic feature (number) involved." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-20", "text": "To what extent does Goldberg's (2019) result hold for subject-verb agreement in other languages, including more morphologically rich ones, as well as for other types of agreement relations?" 
}, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-21", "text": "Building on Goldberg's (2019) work, we expand the experiment to 26 languages and four types of agreement relations, which include more challenging examples." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-22", "text": "In Section 2, we define what is meant by agreement relations and outline the particular agreement relations under study." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-23", "text": "Section 3 introduces our newly curated cross-linguistic dataset of agreement relations, while section 4 discusses our experimental setup." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-24", "text": "We report the results of our experiments in section 5." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-25", "text": "All data and code are available at https://github.com/ geoffbacon/does-bert-agree." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-26", "text": "----------------------------------" }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-27", "text": "**STRUCTURE-DEPENDENT AGREEMENT RELATIONS**" }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-28", "text": "Agreement phenomena are an important and cross-linguistically common property of natural languages, and as such have been extensively studied in syntax and morphology (Corbett, 2006) ." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-29", "text": "1 Languages often express grammatical features, such as number and gender, through inflectional morphology." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-30", "text": "An agreement relation is a morphophonologically overt co-variance in feature values between two words in a syntactic relationship (Preminger, 2014) ." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-31", "text": "In other words, agreement refers to when the morphosyntactic features of one word are reflected in its syntactic dependents." 
}, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-32", "text": "In this way, agreement relations are overt markers of covert syntactic structure." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-33", "text": "Thus, evaluating a model's ability to capture agreement relations is also an evaluation of its ability to capture syntactic structure." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-34", "text": "Following Corbett (2003), we call the syntactically dependent word the \"target\" of the agreement relation, and the word with which it agrees we call the \"controller\"." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-35", "text": "An example of an agreement relation in English is given in (1), in which the inflected form of the verb BE (are) reflects the plural number of its syntactic head keys." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-36", "text": "In all examples in this section, the controller and target are given in bold." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-37", "text": "In this example, keys is the controller and are is the target of the agreement relation." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-38", "text": "(1) The keys to the door are on the table." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-39", "text": "The agreement relation in (1) is between a subject and its verb, but there are other types of agree-ment relations." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-40", "text": "In addition to subject-verb agreement, three other types of agreement relations are cross-linguistically common: agreement of noun with i) determiner, ii) attributive adjective and iii) predicate adjective (Baker, 2008) ." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-41", "text": "The latter two types are distinguished by whether the adjective modifies the noun within a noun phrase or whether it is predicated of the subject of a clause." 
}, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-42", "text": "The first two types are sometimes categorized as nominal concord rather than agreement, but for our purposes this is merely a difference in terminology." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-43", "text": "The morphosyntactic feature in the agreement relation in (1) is number, a feature that is crosslinguistically common in agreement systems." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-44", "text": "In addition to number, the most commonly involved in agreement relations are gender, case and person (Baker, 2008) ." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-45", "text": "With its comparatively limited inflectional morphology, English only exhibits subject-verb and determiner agreement (in demonstratives, \"this\" vs. \"these\") and even then only agrees for number." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-46", "text": "Languages with richer inflectional morphology tend to display more agreement types and involve more features." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-47", "text": "French, for example, employs all four types of agreement relations." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-48", "text": "Examples are given in (2)- (5)." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-49", "text": "The subject and verb in (2) agree for number, while the noun and determiner in (3), the noun and attributive adjective in (4) and the subject and predicated adjective in (5) agree for both number and gender." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-50", "text": "(2) Les cl\u00e9s de la porte se trouvent sur la table." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-51", "text": "'The keys to the door are on the table.' 'The keys to the door are broken." 
}, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-52", "text": "' Previous work using agreement relations to assess knowledge of syntactic structure in modern neural networks has focussed on subject-verb agreement in number (Goldberg, 2019; Gulordava et al., 2018; Linzen et al., 2016) ." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-53", "text": "In our work, we study all four types of agreement relations and all four features discussed above." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-54", "text": "Moreover, previous work using any method to assess BERT's knowledge of syntactic structure has focussed exclusively on the single-language English model (Hewitt and Manning, 2019; Goldberg, 2019; Tenney et al., 2019; Lin et al., 2019; Jawahar et al., 2019; Clark et al., 2019) ." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-55", "text": "We expand this line of work to 26 languages." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-56", "text": "Not all languages in our sample exhibit all four types of agreement nor use all four features examined, but they all exhibit at least one of the agreement types involving at least one of the features." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-57", "text": "----------------------------------" }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-58", "text": "**DATA**" }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-59", "text": "Our study requires two types of data." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-60", "text": "First, we need sentences containing agreement relations." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-61", "text": "We mask out one of the words in the agreement relation and ask BERT to predict the masked word." 
}, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-62", "text": "We are interested in BERT's ability to predict words that respect the agreement relation, that is, words which share the morphosyntactic features of the word with which it agrees." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-63", "text": "To measure this, we need to know the feature values for each word in BERT's vocabulary." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-64", "text": "This is our second type of data." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-65", "text": "Throughout this paper, we refer to the first type of data as the cloze data, and the second as the feature data." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-66", "text": "In the design of our datasets, we followed two principles." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-67", "text": "First, we chose data sources that are available across multiple languages, because we are interested in cross-linguistic generality." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-68", "text": "The languages in this study are those with sufficiently large data sources that also appear in the multilingual BERT model." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-69", "text": "Second, we use naturallyoccurring data (cf." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-70", "text": "Marvin and Linzen (2018) )." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-71", "text": "----------------------------------" }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-72", "text": "**CLOZE DATA**" }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-73", "text": "We sourced our cloze data from version 2.4 of the Universal Dependencies treebanks (Nivre et al., 2016, UD) ." 
}, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-74", "text": "The UD treebanks use a consistent schema across all languages to annotate naturally occurring sentences at the word level with rich grammatical information." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-75", "text": "We used the partof-speech and dependency information to identify potential agreement relations." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-76", "text": "Specifically, we identified all instances of subject-verb, noundeterminer, noun-attributive adjective and subjectpredicate adjective word pairs." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-77", "text": "We then used the morphosyntactic annotations for number, gender, case and person to filter out word pairs that disagree due to errors in the underlying data source (e.g. one is annotated as plural while the other is singular) or that are not annotated for any of the four features." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-78", "text": "This method is language-agnostic, but due to errors in the underlying UD corpora, yielded some false positives (e.g. predicate adjective agreement in English)." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-79", "text": "To correct for this, we consulted reference grammars of each language to note which of the four types of agreement exist in the language." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-80", "text": "We removed all examples that are of the wrong type for the language (8% of harvested examples In all four types of agreement studied, the controller of the agreement is a noun or pronoun, while the target can be a determiner, adjective or verb." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-81", "text": "Because of this part-of-speech restriction, we chose to mask out the controller in every cloze example so that BERT is evaluated against the same vocabulary across all four types." 
}, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-82", "text": "This also means that we only need to collect feature data on nouns and pronouns." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-83", "text": "----------------------------------" }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-84", "text": "**FEATURE DATA**" }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-85", "text": "Our feature data comes from both the UD and the UniMorph projects (Sylak-Glassman, 2016, downloaded June 2019)." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-86", "text": "The UniMorph project also uses a consistent schema across all languages to annotate word types with morphological features." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-87", "text": "Although this schema is not the same as that used in UD, there is a deterministic mapping between the two (McCarthy et al., 2018) ." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-88", "text": "In data, we say a word can take on a particular bundle if we ever see it with that bundle of feature values in a Universal Dependencies corpus for that language." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-89", "text": "Both sources individually allow for a word to have multiple feature bundles (e.g. sheep in English can be singular or plural)." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-90", "text": "In these cases, we keep all possible feature bundles." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-91", "text": "Finally, we filter out words that do not appear in BERT's vocabulary." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-92", "text": "----------------------------------" }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-93", "text": "**EXPERIMENT**" }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-94", "text": "Our experiment is designed to measure BERT's ability to model syntactic structure." 
}, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-95", "text": "Our experimental set up is an adaptation of that of Goldberg (2019) ." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-96", "text": "As in previous work, we mask one word involved in an agreement relation and ask BERT to predict it." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-97", "text": "Goldberg (2019) , following Linzen et al. (2016) , considered a correct prediction to be one in which the masked word receives a higher probability than other inflected forms of the lemma." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-98", "text": "For example, when dogs is masked, a correct response gives more probability to dogs than dog." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-99", "text": "This evaluation leaves open the possibility that selectional restrictions or frequency are responsible for the results rather than sensitivity to syntactic structure (Gulordava et al., 2018) ." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-100", "text": "To remove this possibility, we take into account all words of the same part-of-speech as the masked word." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-101", "text": "Concretely, we consider a correct prediction to be one in which the average probability of all possible correct words is higher than that of all incorrect words." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-102", "text": "By \"correct words\", we mean words with the exact same feature values and the same part of speech as the masked word." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-103", "text": "By \"incorrect words\", we mean words of the same part of speech as the masked word but that differ from the masked word with respect to at least one feature value." 
}, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-104", "text": "We ignore cloze examples in which there are fewer than 10 possible correct and 10 incorrect answers in our feature data." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-105", "text": "The average example in our cloze data is evaluated using 1,468 words, compared with 2 in Goldberg (2019) ." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-106", "text": "Following Goldberg (2019), we use the pretrained BERT models from the original authors 2 , but through the PyTorch implementation." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-107", "text": "3 Goldberg (2019) showed that in his experiments the base BERT model performed better than the larger model, so we restrict our attention to the base model." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-108", "text": "For English, we use the model trained only on English data, whereas for all other languages we use the multilingual model." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-109", "text": "----------------------------------" }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-110", "text": "**RESULTS**" }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-111", "text": "Overall, BERT performs well on our experimental task, suggesting that it is able to model syntactic structure." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-112", "text": "BERT was correct in 94.3% of all cloze examples." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-113", "text": "This high performance is found across all four types of agreement relations." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-114", "text": "Figure 1 shows that BERT performed above 90% accuracy in each type." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-115", "text": "Performance is best on determiner and attributive agreement relations, while worst on subject-verb and predicate adjective." 
}, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-116", "text": "In figure 2 , we see BERT's performance for each language." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-117", "text": "BERT performs well for the major- ity of languages, although some fare much worse than others." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-118", "text": "It is important to note that it is an unfair comparison because even though the datasets were curated using the same methodology, each language's dataset is different." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-119", "text": "It is possible, for example, that the examples we have for Basque are simply harder than they are for Portuguese." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-120", "text": "Finally, we ask how BERT's performance is affected by distance between the controller and the target, as well as the number of distractors." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-121", "text": "Figure 3 shows BERT's performance, aggregated over all languages and types, as a function of the distance involved in the agreement, while figure 4 shows the same for number of distractors." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-122", "text": "There is a slight but consistent decrease in performance as the distance and the number of distractors increase." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-123", "text": "The decline in performance begins later in figure 4 but drops more rapidly once it does." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-124", "text": "----------------------------------" }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-125", "text": "**RELATED WORK**" }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-126", "text": "Given the success of large pre-trained language representation models on downstream tasks, it is not surprising that that the field wants to understand the extent of their linguistic knowledge." 
}, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-127", "text": "4 In our work, we looked exclusively at the predictions BERT makes at the word level." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-128", "text": "Tenney et al. (2019) and Jawahar et al. (2019) examined the internal representations of BERT to find that syntactic concepts are learned at lower levels than semantic concepts." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-129", "text": "Hewitt and Manning (2019) are also interested in syntactic knowledge and propose a method to evaluate whether entire syntax trees are embedded in a linear transformation of a model's word representation space, finding that BERT does capture such information." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-130", "text": "As a complementary approach, Clark et al. (2019) studied the attention mechanism of BERT, finding clear correlates with interpretable linguistic structures such as direct objects, and suggest that BERT's success is due in part to its syntactic awareness." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-131", "text": "However, by subjecting it to existing psycholinguistic tasks, Ettinger (2019) found that BERT fails in its ability to understand negation." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-132", "text": "In concurrent work, van Schijndel et al. (forthcoming) show that BERT does not consistently outperform LSTM-based models on English subjectverb agreement tasks." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-133", "text": "----------------------------------" }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-134", "text": "**CONCLUSIONS & FUTURE WORK**" }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-135", "text": "Core linguistic phenomena depend on syntactic structure." 
}, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-136", "text": "Yet current state-of-the-art models in language representations, such as BERT, do not have explicit syntactic structural representations." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-137", "text": "Previous work by Goldberg (2019) showed that BERT captures English subject-verb number agreement well despite this lack of explicit structural representation." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-138", "text": "We replicated this result using a different evaluation methodology that addresses shortcomings in the original methodology and expanded the study to 26 languages." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-139", "text": "Our study further broadened existing work by considering the most cross-linguistically common agreement types as well as the most common morphosyntactic features." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-140", "text": "The main result of this expansion into more languages, types and features is that BERT, without explicit syntactic structure, is still able to capture syntax-sensitive agreement patterns well." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-141", "text": "However, our analysis highlights an important qualification of this result." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-142", "text": "We showed that BERT's ability to model syntaxsensitive agreement relations decreases slightly as the dependency becomes longer range, and as the number of distractors increases." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-143", "text": "We release our new curated cross-linguistic datasets and code in the hope that it is useful to future research that may probe why this pattern appears." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-144", "text": "The experimental setup we used has some known limitations." 
}, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-145", "text": "First, in certain languages some of the cloze examples we studied contain redundant information." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-146", "text": "Even when one word from an agreement relation is masked out, other cues remain in the sentence (e.g. when masking out the noun for a French attributive adjective agreement relation, number information is still available from the determiner)." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-147", "text": "To counter this in future work, we plan to run our experiment twice, masking out the controller and then the target." }, { "sent_id": "46050691971ea46ce7e18fef5f6d2d-C001-148", "text": "Second, we used a different evaluation scheme than previous work (Goldberg, 2019) by averaging BERT's predictions over many word types and plan to compare both schemes in future work." } ], "y": { "@BACK@": { "gold_contexts": [ [ "46050691971ea46ce7e18fef5f6d2d-C001-16" ], [ "46050691971ea46ce7e18fef5f6d2d-C001-52" ], [ "46050691971ea46ce7e18fef5f6d2d-C001-54" ], [ "46050691971ea46ce7e18fef5f6d2d-C001-97" ], [ "46050691971ea46ce7e18fef5f6d2d-C001-137" ] ], "cite_sentences": [ "46050691971ea46ce7e18fef5f6d2d-C001-16", "46050691971ea46ce7e18fef5f6d2d-C001-52", "46050691971ea46ce7e18fef5f6d2d-C001-54", "46050691971ea46ce7e18fef5f6d2d-C001-97", "46050691971ea46ce7e18fef5f6d2d-C001-137" ] }, "@MOT@": { "gold_contexts": [ [ "46050691971ea46ce7e18fef5f6d2d-C001-20" ] ], "cite_sentences": [ "46050691971ea46ce7e18fef5f6d2d-C001-20" ] }, "@EXT@": { "gold_contexts": [ [ "46050691971ea46ce7e18fef5f6d2d-C001-21" ], [ "46050691971ea46ce7e18fef5f6d2d-C001-54", "46050691971ea46ce7e18fef5f6d2d-C001-55" ], [ "46050691971ea46ce7e18fef5f6d2d-C001-95" ], [ "46050691971ea46ce7e18fef5f6d2d-C001-137", "46050691971ea46ce7e18fef5f6d2d-C001-138" ] ], "cite_sentences": [ "46050691971ea46ce7e18fef5f6d2d-C001-21", "46050691971ea46ce7e18fef5f6d2d-C001-54", 
"46050691971ea46ce7e18fef5f6d2d-C001-95", "46050691971ea46ce7e18fef5f6d2d-C001-137" ] }, "@USE@": { "gold_contexts": [ [ "46050691971ea46ce7e18fef5f6d2d-C001-52", "46050691971ea46ce7e18fef5f6d2d-C001-53" ], [ "46050691971ea46ce7e18fef5f6d2d-C001-106" ], [ "46050691971ea46ce7e18fef5f6d2d-C001-107" ] ], "cite_sentences": [ "46050691971ea46ce7e18fef5f6d2d-C001-52", "46050691971ea46ce7e18fef5f6d2d-C001-106", "46050691971ea46ce7e18fef5f6d2d-C001-107" ] }, "@DIF@": { "gold_contexts": [ [ "46050691971ea46ce7e18fef5f6d2d-C001-100", "46050691971ea46ce7e18fef5f6d2d-C001-97", "46050691971ea46ce7e18fef5f6d2d-C001-99" ], [ "46050691971ea46ce7e18fef5f6d2d-C001-105" ], [ "46050691971ea46ce7e18fef5f6d2d-C001-148" ] ], "cite_sentences": [ "46050691971ea46ce7e18fef5f6d2d-C001-97", "46050691971ea46ce7e18fef5f6d2d-C001-105", "46050691971ea46ce7e18fef5f6d2d-C001-148" ] } } }, "ABC_e264c45391853fb008c838aa7ccca8_13": { "x": [ { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-24", "text": "**THE STRUCTURE OF WIKIPEDIA**" }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-2", "text": "This paper proposes an extension of Sumida and Torisawa's method of acquiring hyponymy relations from hierachical layouts in Wikipedia (Sumida and Torisawa, 2008) ." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-3", "text": "We extract hyponymy relation candidates (HRCs) from the hierachical layouts in Wikipedia by regarding all subordinate items of an item x in the hierachical layouts as x's hyponym candidates, while Sumida and Torisawa (2008) extracted only direct subordinate items of an item x as x's hyponym candidates." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-4", "text": "We then select plausible hyponymy relations from the acquired HRCs by running a filter based on machine learning with novel features, which even improve the precision of the resulting hyponymy relations." 
}, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-5", "text": "Experimental results show that we acquired more than 1.34 million hyponymy relations with a precision of 90.1%." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-6", "text": "----------------------------------" }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-8", "text": "The goal of this study is to automatically extract a large set of hyponymy relations, which play a critical role in many NLP applications such as Q&A systems (Fleischman et al., 2003) and specification retrieval (Yoshinaga and Torisawa, 2006) ." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-9", "text": "In this paper, a hyponymy relation is defined as a relation between a hypernym and a hyponym when \"the hyponym is a (kind of) hypernym.\"" }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-10", "text": "1 We acquired more than 1.34 million hyponymy relations in Japanese with a precision of 90.1%." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-11", "text": "Many NLP researchers have attempted to automatically acquire hyponymy relations from texts (Hearst, 1992; Caraballo, 1999; Mann, 2002; Fleischman et al., 2003; Morin and Jacquemin, 2004; Shinzato and Torisawa, 2004; Etzioni et al., 2005; Pantel and Pennacchiotti, 2006; Sumida et al., 2006; Sumida and Torisawa, 2008) ." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-12", "text": "Most of these methods, however, require tera-scale documents (e.g., a web repository) and powerful computational resources to acquire a wide range of hyponymy relations that include concept-instance relations." 
}, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-13", "text": "On the other hand, Sumida and Torisawa (2008) have shown that you could easily obtain numerous hyponymy relations from Wikipedia; in particular, they have acquired more than 0.63 million hyponymy relations only from hierarchical layouts in the 2.2GB Japanese version of Wikipedia (e.g., Figure 1 shows a hierarchical structure of a Wikipedia article shown in Figure 2) ." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-14", "text": "Although the reported precision (76.4%) is insufficient for practical applications, the hierarchical structures in Wikipedia are definitely a promising resource to mine hyponymy relations." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-15", "text": "This work was conducted while the second author was a research fellow of the Japan Society for the Promotion of Science." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-16", "text": "1 This is a slightly modified definition of the one given in Miller (Fellbaum, 1998, Chapter. 1) ." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-17", "text": "The linguistic literature (e.g., Cruse (1998)) distinguishes concept-instance relations such as \"university\"-\"Tokyo University\" from hyponymy relations such as \"university\"-\"national university\"." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-18", "text": "However, we regard concept-instance relations as a part of hyponymy relations since such distinction is not crucial for many NLP applications." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-19", "text": "Black tea is a variety of tea that is more oxidized than the green, oolong and white varieties." 
}, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-20", "text": "----------------------------------" }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-21", "text": "**RESEARCH BACKGROUND**" }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-22", "text": "In this section, we first explain the structure of Wikipedia, and then describe previous studies that attempted to acquire hyponymy relations from Wikipedia." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-23", "text": "----------------------------------" }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-25", "text": "Wikipedia is a free, multilingual, open-content encyclopedia, and consists of numerous articles that convey comprehensive information in the headings (basically concepts or instances)." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-26", "text": "The Wikipedia is built on the MediaWiki software package." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-27", "text": "2 , which interprets source codes written in the MediaWiki syntax to produce human-readable web pages." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-28", "text": "Figure 2 (a) shows an article on 'Black tea', which is the result of interpreting the source code in Figure 2 The basic hierarchical structures of Wikipedia articles are organized according to a pre-determined ordering among the above items." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-29", "text": "In general, items occupy a higher position in the hierarchy according to the order of headings, definition lists, bulleted lists, and ordered lists." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-30", "text": "In addition, note that headings, bullet lists and ordered lists allow the repetitions of the symbols \"=\", \"*\" and \"#\"." 
}, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-31", "text": "The number of repetitions of the symbols indicates the position in the hierarchy, and the more repetitions of the symbol an item 2 http://www.mediawiki.org/wiki/MediaWiki contains, the lower the position the item belongs to." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-32", "text": "For instance, \"= Common tea brands =\" occupies a higher position than \"== England ==\" as illustrated in Figure 2 (b)." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-33", "text": "Then, it is easy to extract a hierarchical structure from a Wikipedia article by parsing the source code of the article according to the above order among the mark-up items." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-34", "text": "Figure 1 illustrates the hierarchical structure obtained from the source code in Figure 2 (b)." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-35", "text": "----------------------------------" }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-36", "text": "**HYPONYMY ACQUISITION FROM WIKIPEDIA**" }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-37", "text": "Previous studies attempted to extract hyponymy relations from definition sentences (Kazama and Torisawa, 2007; Herbelot and Copestake, 2006; Ruiz-Casado et al., 2005) and category labels (Suchanek et al., 2007) Fellbaum, 1998) to learn patterns for acquiring hyponymy relations." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-38", "text": "They acquired 1,204 hyponymy relations with a precision of 69%." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-39", "text": "Suchanek et al. (2007) regarded the heading of a Wikipedia article as a hyponym and obtained category labels attached to the article as its hypernym candidates." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-40", "text": "A languagedependent heuristics then selected correct hypernyms from the hypernym candidates." 
}, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-41", "text": "They acquired more than 2.04 millions of hyponymy relations (relations SUBCLASSOF and TYPE in their paper) from 1.6 millions of Wikipedia articles with a precision of about 95%." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-42", "text": "Although the above studies extracted hyponymy relations from the English version of Wikipedia, Sumida and Torisawa (2008) extracted hyponymy relations from definition sentences, category labels, and hierarchical structures in Wikipedia articles." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-43", "text": "They reported that the number of hyponymy relations acquired from the hierarchical structures was larger than the number of hyponymy relations acquired from the other resources." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-44", "text": "We thus focus on the hierarchical structures to acquire more hyponymy relations." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-45", "text": "----------------------------------" }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-46", "text": "**PROPOSED METHOD**" }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-47", "text": "Our method of acquiring hyponymy relations is an extension of the supervised method proposed by Sumida and Torisawa (2008) , but differs in the way of enumerating hyponymy relation candidates (hereafter, HRCs) from the hierarchical layouts, and in the features of machine learning." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-48", "text": "Our method consists of the following two steps:" }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-49", "text": "Step 1: We first extract HRCs from hierarchical layouts in Wikipedia articles." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-50", "text": "Step 2: We then select proper hyponymy relations from the HRCs extracted in Step 1 by using Support Vector Machines (SVMs) as a classifier (Vapnik, 1998)." 
}, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-51", "text": "X (partial list of X), *X (details of X), * X (typical X), * X (basic X), * X (notable X), * X (partial list of X)" }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-52", "text": "Figure 3: Patterns for finding plausible hypernym X; patterns with * are newly introduced in this study (Japanese terms used in our experiments are followed by English translations)." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-53", "text": "In what follows, we describe each step in detail." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-54", "text": "----------------------------------" }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-55", "text": "**STEP 1: EXTRACTING HRCS FROM THE HIERARCHICAL STRUCTURES IN WIKIPEDIA ARTICLES**" }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-56", "text": "We obtain HRCs by considering the title of each marked-up item as a hypernym candidate, and titles of its all subordinate marked-up items as its hyponym candidates; for example, we extract 'England', 'France', 'Wedgwood', 'Lipton', and 'Fauchon' as hyponym candidates of 'Common tea brands' from the hierarchical structure in Figure 1 . Note that Sumida and Torisawa (2008) extracted HRCs by regarding the title of each marked-up item as a hypernym candidate and titles of its direct subordinate marked-up items as its hyponyms; for example, they extracted only 'England' and 'France' as hyponym candidates of 'Common tea brands' from the hierarchical structure in Figure 1 ." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-57", "text": "They also employed patterns shown in Figure 3 (e.g., \"X \" (list of X)) to find plausible hypernyms denoted by X in the pattern." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-58", "text": "They regarded the HRCs whose hypernyms matched the patterns as correct hyponymy relations, and did not apply a filter based on machine learning to these HRCs." 
}, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-59", "text": "In this study, we use these patterns only to justify the hypernym part of HRCs; namely, we just replace hypernyms that match the patterns shown in Figure 3 with the variable part, by discarding the non-variable part of the patterns." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-60", "text": "We then apply a filter based on machine learning to all the HRCs acquired in the manner described in the previous paragraph." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-61", "text": "This is because the hyponymy relations whose hypernyms matched these patterns were still too noisy to use in practical applications, and we would like to control the total quality of the acquired hyponymy relations by changing the threshold of the SVM value for each HRC." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-62", "text": "----------------------------------" }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-63", "text": "**STEP 2: SELECTING PROPER HYPONYMY RELATIONS FROM THE ACQUIRED HRCS**" }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-64", "text": "We select proper hyponymy relations from the HRCs obtained in Step 1 by using SVMs (Vapnik, 1998) as a classifier." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-65", "text": "In what follows, we briefly review the features proposed by Sumida and Torisawa (2008) , and then explain the novel features introduced in this study." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-66", "text": "We expect that the readers will refer to the literature (Sumida and Torisawa, 2008) to see the effect of the features proposed by Sumida and Torisawa." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-67", "text": "In the following explanation, we refer to the hypernym candidate or the hyponym candidate of each HRC as hypernym or hyponym." 
}, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-68", "text": "----------------------------------" }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-69", "text": "**POS**" }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-70", "text": "We assigned a unique dimension in the feature space to each part-of-speech (POS) tag." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-71", "text": "When the hypernym/hyponym consists of a morpheme with a particular POS tag, 3 then the corresponding element of the feature vector is set to 1." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-72", "text": "When the hypernym/hyponym consists of multiple morphemes, the feature vectors for all the morphemes are simply summed (The resulting feature vector works as disjunction of each feature vector)." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-73", "text": "The POS tag of the last morpheme is mapped to the dimension that is different from that of the POS tags of the other morphemes." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-74", "text": "----------------------------------" }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-75", "text": "**MORPH**" }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-76", "text": "The morphemes are mapped to the dimensions of the feature vectors." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-77", "text": "The last morpheme is mapped to the dimension that is different from that of the other morphemes." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-78", "text": "EXP The expression of a hypernym/hyponym itself is mapped to an element in a feature vector, and the corresponding element is set to 1." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-79", "text": "ATTR Using the attribute set created by Sumida and Torisawa (2008) , when a hypernym/hyponym is included as an element of the attribute set, we set a feature corresponding to the element to 1." 
}, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-80", "text": "LAYER Each type of the marking items from which the hypernym/hyponym is extracted (namely, headings, bulleted lists, ordered lists, or definition lists) is mapped to an element of a feature vector, and the feature corresponding to the marking type for the hypernym/hyponym is set to 1." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-81", "text": "In this study, we introduce the following three new features to improve the performance of the classifier." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-82", "text": "DIST The distance d between items from which the hypernym and the hyponym are acquired is mapped to two elements of the feature vector." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-83", "text": "When the distance d = 1, one element is set to 1, and otherwise (i.e., d \u2265 2) the other element is set to 1." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-84", "text": "This feature reflects the tendency that HRCs acquired from items whose distance is d = 1 are more plausible than the other HRCs." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-85", "text": "PAT This feature is set to 1 when the hypernym of the given HRC is obtained from a hypernym that matches the patterns in Figure 3 ." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-86", "text": "This reflects Sumida and Torisawa's observation that HRCs whose hypernym matches the patterns are likely to be correct (Sumida and Torisawa, 2008) ." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-87", "text": "LCHAR This feature is set to 1 when the hypernym and the hyponym share the last character." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-88", "text": "Such HRCs (e.g., \" (high school)\"-\" (public high school)\" are likely to be correct, because the last characters are likely to convey major semantic contents of Japanese compound nouns." 
}, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-89", "text": "Using the above features, we train an SVM classifier." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-90", "text": "3 In Japanese, a morpheme takes a POS tag." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-91", "text": "----------------------------------" }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-92", "text": "**EXPERIMENTS**" }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-93", "text": "To evaluate our method, we used the Japanese version of the Wikipedia version of March 2007, which includes 276,323 articles (pages)." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-94", "text": "4 In" }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-95", "text": "Step 2, we used TinySVM 5 with a polynomial kernel of degree 2 as a classifier, and MeCab 6 as a morphological analyzer." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-96", "text": "We acquired 6,564,317 HRCs from the above articles in" }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-97", "text": "Step 1." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-98", "text": "The test set of 1,000 HRCs were randomly extracted from these HRCs, and the remaining HRCs were used to form the development set." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-99", "text": "We increased the size of the development set by adding the following four sets, while investigating the performance of the classifier on the development set." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-100", "text": "The first set was randomly chosen from the remaining HRCs, and consisted of 9,000 HRCs." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-101", "text": "The second set was chosen from the HRCs whose hypernyms did not match the patterns in Figure 3 , and consisted of 10,000 HRCs." 
}, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-102", "text": "The third set was randomly chosen from the HRCs whose hypernym and hyponym are acquired from items with distance d = 1 in the hierarchy, and consisted of 9,000 HRCs." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-103", "text": "The fourth set was chosen from the HRCs whose hypernyms matched the patterns in Figure 3 , and consisted of 2,000 HRCs." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-104", "text": "The total number of HRCs in the development set was 29,900, when we eliminated duplicated entries." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-105", "text": "There is no overlap between the test set and the development set." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-106", "text": "A human subject then manually judged whether HRCs in the test and development sets are correct or not using the same criteria as one in Hearst (1992); the subject checked whether the expression \"a hyponym candidate is (a kind of) a hypernym candidate\" is acceptable or not." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-107", "text": "To investigate the quality of the input HRCs, we assessed the precision of the 9,000 development HRCs that were randomly extracted from all the HRCs excluding the test set." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-108", "text": "Table 1 shows the precision of the 9,000 HRCs according to the distance between items from which the hypernym and hyponym of each HRC are extracted." 
}, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-109", "text": "We can see that when the distance between the items from which the hypernym and the hyponym of HRCs are extracted increases, the pre- 4 We excluded \"user pages\",\"special pages\", \"template pages\", \"redirection pages\", and \"category pages\", since they are meant for internal purpose, and excluded \"disambiguation pages\", since they only enumerate possible articles for the ambiguous headings." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-110", "text": "5 http://chasen.org/ \u223c taku/software/ TinySVM/ 6 http://mecab.sourceforge.net/ Table 2 shows the performance of our method when we use the whole development set to train the SVM classifier." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-111", "text": "The columns titled 'PRE', '# RELS.', and '# EST." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-112", "text": "CORR." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-113", "text": "RELS.' show the precision of the hyponymy relations in the test set, the number of the acquired hyponymy relations, and the expected number of correct hyponymy relations estimated from the precision and the number of the acquired hyponymy relations, respectively." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-114", "text": "The row titled 'S & T (2008) ' shows the performance of the method proposed by Sumida and Torisawa (2008) ." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-115", "text": "The following two rows show the precision of the HRCs acquired by the patterns in Figure 3 (PAT) 7 and that of the results of machine learning (ML)." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-116", "text": "We successfully obtained more than 1.73 million hyponymy relations with 85.2% precision, which greatly outperformed the results of Sumida and Torisawa (2008) in terms of both the precision and the number of acquired hyponymy relations." 
}, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-117", "text": "The acquired hyponymy relations covered 80,466 distinct hypernyms and 886,781 distinct hyponyms." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-118", "text": "Table 3 shows the performance of the classifier when we eliminated each type of features." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-119", "text": "The columns titled 'ACC', 'PRE', 'REC', and 'F 1 ' show the accuracy, precision, recall, and F 1 -measure calculated on the test set." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-120", "text": "All the newly introduced features contributed to the accuracy, and improved the total accuracy by 1.1%." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-121", "text": "The features DIST and PAT improved the precision of the classifier, while the feature LCHAR improved the recall of the classifier." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-122", "text": "To investigate the trade-off between precision and recall, we changed the threshold of the SVM values for the HRCs." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-123", "text": "Figures 4 and 5 show the P-R curve of the hyponymy relation acquisition using the feature set in Table 3 ." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-124", "text": "We can observe that the newly introduced features improve the precision in the range of the recall greater than 60%." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-125", "text": "We can improve the precision of the acquired hyponymy relations by making the threshold of the SVM values larger." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-126", "text": "By setting the threshold to 0.36, we obtain 1,349,622 hyponymy relations with a precision of 90.1%, which cover 46,653 distinct hypernyms and 739,972 distinct hyponyms." 
}, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-127", "text": "We obtained on average 4.88 hyponymy relations from one Wikipedia article with a precision of 90.1%." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-128", "text": "To investigate the contribution of the newly extracted HRCs to the acquired hyponymy relations, we classified the HRCs in the test set into subsets according to the distance between items from which the hypernym and hyponym are extracted." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-129", "text": "Table 4 shows the performance of the SVM classifier for the resulting subsets of the test set." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-130", "text": "The column DIST shows the distance between items from which the hypernym and hyponym are extracted, while NUM shows the number of the HRCs." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-131", "text": "The columns ACC, PRE, REC, and F 1 show the accuracy, precision, recall, and F 1 -measure calculated on each subset of the test set." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-132", "text": "Although there is a larger number of noisy HRCs in the subsets of the test set which were acquired from distant items (DIST \u2265 2), we successfully maintained the precision of the acquired hyponymy relations above 75%." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-133", "text": "Boosting the recall of hyponymy relations acquired from the distant items will be the key to improve the performance of our method." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-134", "text": "Figure 6 shows the performance of our method when varying the number of the training HRCs from 1,000 to 8,000." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-135", "text": "Table 5 : Hyponymy relations acquired from hierarchical structures in Wikipedia: incorrect hyponyms are marked as '*', while fictional objects are marked as '#'." 
}, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-136", "text": "The hypernyms and hyponyms are followed by their English translations." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-137", "text": "Table 5 shows examples of the acquired hyponymy relations." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-138", "text": "The hypernyms are manually selected and 10 hyponyms are randomly selected for each hypernym." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-139", "text": "For some classes such as 'planet' and 'technique', many fictional objects (marked as '#') are extracted as hyponyms (e.g., a fictional planet in a scientific fiction)." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-140", "text": "We may have to distinguish these fictional objects from real objects in certain application contexts." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-141", "text": "We finally investigated details of the errors in the SVM classifier." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-142", "text": "We applied the SVM classifier to 1,000 HRCs that were randomly selected from all the HRCs excluding the test and development sets, and manually investigated the classification results." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-143", "text": "The classification accuracy of these HRCs was 89.1% (233 true positives, 658 true negatives, 22 false positives and 87 false negatives)." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-144", "text": "Table 6 summarizes the types of false positives." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-145", "text": "Meronymy (part-of relation; e.g., 'car'-'engine') is the most frequent error, and the current classifier yields a high score for this type of error." 
}, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-146", "text": "To filter out meronymy correctly, we will need additional criteria to judge hyponymy relations, for example whether they have the same attributes in common (Dowty et al., 1980; Almuhareb and Poesio, 2004) ." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-147", "text": "The hierarchical structures also represented instance-attribute/value relations, and some instance-value pairs were wrongly regarded as hyponymy relations." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-148", "text": "We found that an attribute that specifies the relation between the instance and the value usually appeared between the nodes from which the instance and the value were extracted." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-149", "text": "For example, in the hierarchical structure that included 'Studio Easter' (a design studio) and 'Uta Kata' (TV animation series) as titles of nodes, there was a node titled ' (Major work)' between them." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-150", "text": "We will be able to filter out these instance-value pairs by using information on the other nodes in the original hierarchical structures as features for machine learning." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-151", "text": "The other two cases, 'concept-facet' and 'facet-instance', are both related to a facet label, which is usually a value of a specific attribute to classify instances according to the attribute's value (e.g., 'England' and 'France' in Figure 1 are values of the attribute 'country' of tea brands)." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-152", "text": "For example, 'Urawa Red Diamonds' (a football club) is used to classify 'supporter's groups' in terms of the target they support, while 'Ratoroa' (location) is used to classify characters ('Gerald Mason') in a novel in terms of their origination." 
}, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-153", "text": "The hierarchical structures often included such facet labels to show a certain classification of instances." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-154", "text": "The three hyponymy relations whose hypernym and hyponym shared the last character were wrongly regarded as correct hyponymy relations." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-155", "text": "This is likely over-fitting to the feature 'LCHAR'." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-156", "text": "We next investigated the difference between the 87 false negatives and the 233 true positives in terms of the number of available training samples." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-157", "text": "We extracted HRCs in the development (training) set whose hypernym candidates were one of the hypernyms extracted from the false negatives and true positives." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-158", "text": "Although there were on average 66.6 labeled HRCs for the hypernyms extracted from the true positives, there were on average only 17.7 labeled HRCs for the hypernyms in the false negatives, which means the hypernyms in the false negatives were relatively infrequent in the training set." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-159", "text": "We will exploit training samples for hypernym candidates that are synonymous with, or a superclass of, the infrequent hypernyms to solve the data sparseness problem." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-160", "text": "Table 7 shows the classification of the 658 true negatives." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-161", "text": "We found that hierarchical structures in Wikipedia were mainly used to express instance-attribute-value relations, meronymy relations and concept-(facet-)instance relations (hyponymy relations)."
}, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-162", "text": "In Table 7 , most of the HRCs classified as 'other' were extracted from items in distant positions in the hierarchical structures, and the hypernym and hyponym candidates were irrelevant." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-163", "text": "We will obtain instance-attribute-value triples from the hierarchical structures." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-164", "text": "----------------------------------" }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-165", "text": "**CONCLUSION**" }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-166", "text": "This paper presented an extended version of Sumida and Torisawa's method (2008) of acquiring hyponymy relations from the hierarchical structures in Wikipedia." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-167", "text": "We extract more hyponymy relation candidates from the hierarchical structures than the original method to increase the number of hyponymy relations acquired by the method." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-168", "text": "We successfully acquired more than 1.34 million hyponymy relations, which doubled the number of hyponymy relations acquired by the original method, and we also increased the precision by 13.7% (from 76.4% to 90.1%)." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-169", "text": "Since the number of Wikipedia articles increases day by day (cf." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-170", "text": "276,323 articles in March 2007 to 449,233 articles in March 2008), we can obtain a larger number of hyponymy relations by simply applying our method to the latest version of Wikipedia." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-171", "text": "In future research, we plan to apply the SVM classifier to HRCs acquired from the definition sentences and category labels in Wikipedia articles."
}, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-172", "text": "We will apply our method to the Wikipedia in other languages, such as English." }, { "sent_id": "e264c45391853fb008c838aa7ccca8-C001-173", "text": "We will also evaluate the acquired hyponymy relations in practical application contexts." } ], "y": { "@EXT@": { "gold_contexts": [ [ "e264c45391853fb008c838aa7ccca8-C001-2" ], [ "e264c45391853fb008c838aa7ccca8-C001-47" ] ], "cite_sentences": [ "e264c45391853fb008c838aa7ccca8-C001-2", "e264c45391853fb008c838aa7ccca8-C001-47" ] }, "@DIF@": { "gold_contexts": [ [ "e264c45391853fb008c838aa7ccca8-C001-3" ], [ "e264c45391853fb008c838aa7ccca8-C001-47" ], [ "e264c45391853fb008c838aa7ccca8-C001-56" ], [ "e264c45391853fb008c838aa7ccca8-C001-116" ] ], "cite_sentences": [ "e264c45391853fb008c838aa7ccca8-C001-3", "e264c45391853fb008c838aa7ccca8-C001-47", "e264c45391853fb008c838aa7ccca8-C001-56", "e264c45391853fb008c838aa7ccca8-C001-116" ] }, "@BACK@": { "gold_contexts": [ [ "e264c45391853fb008c838aa7ccca8-C001-11" ], [ "e264c45391853fb008c838aa7ccca8-C001-13" ], [ "e264c45391853fb008c838aa7ccca8-C001-42" ], [ "e264c45391853fb008c838aa7ccca8-C001-66" ] ], "cite_sentences": [ "e264c45391853fb008c838aa7ccca8-C001-11", "e264c45391853fb008c838aa7ccca8-C001-13", "e264c45391853fb008c838aa7ccca8-C001-42", "e264c45391853fb008c838aa7ccca8-C001-66" ] }, "@USE@": { "gold_contexts": [ [ "e264c45391853fb008c838aa7ccca8-C001-42", "e264c45391853fb008c838aa7ccca8-C001-44" ], [ "e264c45391853fb008c838aa7ccca8-C001-65" ], [ "e264c45391853fb008c838aa7ccca8-C001-79" ], [ "e264c45391853fb008c838aa7ccca8-C001-114" ] ], "cite_sentences": [ "e264c45391853fb008c838aa7ccca8-C001-42", "e264c45391853fb008c838aa7ccca8-C001-65", "e264c45391853fb008c838aa7ccca8-C001-79", "e264c45391853fb008c838aa7ccca8-C001-114" ] }, "@SIM@": { "gold_contexts": [ [ "e264c45391853fb008c838aa7ccca8-C001-86" ] ], "cite_sentences": [ "e264c45391853fb008c838aa7ccca8-C001-86" ] } } }, 
"ABC_c3c09df34cf9f81c1cc4fc63a18bf0_13": { "x": [ { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-142", "text": "\u2022 Let S i = {s i } \u222a S i ." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-163", "text": "Evaluation Metrics: Following [Lin et al." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-143", "text": "We now create a new instance set B i = {(S i , e i1 , e i2 , r i )}." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-2", "text": "Relation extraction is the problem of classifying the relationship between two entities in a given sentence." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-3", "text": "Distant Supervision (DS) is a popular technique for developing relation extractors starting with limited supervision." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-4", "text": "We note that most of the sentences in the distant supervision relation extraction setting are very long and may benefit from word attention for better sentence representation." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-5", "text": "Our contributions in this paper are threefold." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-6", "text": "Firstly, we propose two novel word attention models for distantly-supervised relation extraction, (1) a Bi-directional Gated Recurrent Unit (Bi-GRU) based word attention model (BGWA) and (2) an entity-centric attention model (EA), as well as (3) a combination model which combines multiple complementary models using a weighted voting method for improved relation extraction." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-7", "text": "Secondly, we introduce GDS, a new distant supervision dataset for relation extraction." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-8", "text": "GDS removes test data noise present in all previous distant-supervision benchmark datasets, making credible automatic evaluation possible."
}, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-9", "text": "Thirdly, through extensive experiments on multiple real-world datasets, we demonstrate the effectiveness of the proposed methods." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-10", "text": "----------------------------------" }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-11", "text": "**INTRODUCTION**" }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-12", "text": "Classifying the semantic relationship between two entities in a sentence is termed as Relation Extraction (RE)." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-13", "text": "RE from unstructured text is an important step in various Natural Language Understanding tasks, such as knowledge base construction, question-answering etc." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-14", "text": "Supervised methods have been successful on the relation extraction task [Bunescu and Mooney, 2005; Zeng et al., 2014] ." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-15", "text": "However, the extensive training data necessary for supervised learning is expensive to obtain and therefore restrictive in a Web-scale relation extraction task." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-16", "text": "(* Equal contribution to this work. \u2020 This research was conducted during the author's research assistantship at the Indian Institute of Science.) To overcome this challenge, [Mintz et al., 2009] proposed a Distant Supervision (DS) method for relation extraction that automatically generates new training data by taking the intersection between a text corpus and a knowledge base." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-17", "text": "The distant supervision assumption states that for a pair of entities participating in a relation, any sentence mentioning that entity pair in the text corpora is a positive example for the relation fact."
}, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-18", "text": "Under this assumption, an entity pair may be supported by evidence from multiple sentences and may be associated with multiple relation labels." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-19", "text": "Therefore the problem of relation extraction in distantly supervised datasets is posed as a Multi-instance Multi-label (MIML) problem [Surdeanu et al., 2012] , as shown in Figure 1 ." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-20", "text": "However, the DS assumption is too strong and may introduce noise such as false negative samples due to missing facts in the knowledge base." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-21", "text": "In this paper, we propose relation extraction models and a new dataset to improve RE." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-22", "text": "We define 'instance' as a sentence containing an entity-pair, and 'instance set' as a set of sentences containing the same entity-pair." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-23", "text": "It was observed by [Zeng et al., 2015] that 50% of the sentences in the Riedel2010 Distant Supervision dataset [Riedel et al., 2010] , a popular DS benchmark dataset, had 40 or more words in them." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-24", "text": "We note that not all the words in these long sentences contribute towards expressing the given relation." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-25", "text": "In this work, we formulate various word attention mechanisms to help the relation extraction model focus on the right context in a given sentence." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-26", "text": "The MIML assumption states that in an instance set corresponding to an entity pair, at least one sentence in that set should express the true relation assigned to the set."
}, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-27", "text": "However, we observe that this is not always true in currently available benchmark datasets for RE in the distantly supervised setting." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-28", "text": "In particular, current datasets have noise in the test set; for example, a fact may be labelled false if it is missing in the knowledge base, leading to a false negative label in the train and test sets." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-29", "text": "Noise in the test set impedes the right comparison of models and may favor overfitted models." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-30", "text": "To address this challenge, we build the Google Distant Supervision (GDS) dataset, a new dataset for distantly-supervised relation extraction." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-31", "text": "GDS is seeded from the Google relation extraction corpus [Shaohua Sun and Orr, 2013] ." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-32", "text": "This new dataset addresses an important shortcoming in distant supervision evaluation and makes an automatic evaluation in this setting more reliable." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-33", "text": "(Figure 1 caption: example instance set, adapted from [Zeng et al., 2015] ; see Section 2.1.)" }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-34", "text": "The two entities are underlined." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-35", "text": "In summary, our contributions are: (a) we introduce the Google Distant Supervision (GDS) dataset, a new dataset for distantly-supervised relation extraction; (b) we propose two novel word attention based models for distant supervision, viz., BGWA, a BiGRU-based word attention model, and EA, an entity-centric attention model; and (c) we show efficacy of combining new and existing relation extraction models using a weighted ensemble model."
}, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-36", "text": "----------------------------------" }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-37", "text": "**PROPOSED METHODS**" }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-38", "text": "In this section, we present our attention-based models for distantly supervised relation extraction." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-39", "text": "We first describe the problem background and the Piecewise Convolution Neural Network (PCNN), a previous state-of-the-art model." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-40", "text": "We then introduce our Entity attention (EA) and Bi-GRU based word attention (BGWA) models." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-41", "text": "The last subsection describes a simple ensemble approach to combine the predictions of various models for robust relation extraction." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-42", "text": "----------------------------------" }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-43", "text": "**BACKGROUND**" }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-44", "text": "Relation Extraction: A relation is defined as a semantic property between a set of entities {e k }." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-45", "text": "In our task, we consider binary relations where k \u2208 [1, 2], such as Born In(Barack Obama, Hawaii)." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-46", "text": "Given a set of sentences S = {s i }; i \u2208 [1 . . . N ], where each sentence s i contains both the entities, the task of relation extraction with a distantly supervised dataset is to learn a function F r such that F r (S, (e 1 , e 2 )) = 1 if relation r is true for the pair (e 1 , e 2 ), and 0 otherwise. PCNN: [Zeng et al., 2015] proposed the Piecewise Convolution Neural Network (PCNN), a successful model for distantly supervised relation extraction."
}, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-47", "text": "The success of the relation extraction task depends on extracting the right structural features from the sentence containing the entity-pair." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-48", "text": "Neural networks, such as Convolutional Neural Networks (CNNs), have been proposed to alleviate the need to manually design features for a given task [Zeng et al., 2014] ." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-49", "text": "As the output of CNNs is dependent on the number of tokens in the sentence, a max-pooling operation is often applied to remove this dependence." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-50", "text": "However, the use of a single max-pool misses out on some of the structural features useful for the relation extraction task." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-51", "text": "The PCNN model divides a sentence s i 's convolution filter output c i containing two entities into three parts c i1 , c i2 , c i3 (the sentence context to the left of the first entity, between the two entities, and to the right of the second entity, respectively) and performs max-pooling on each of the three parts, as shown in Figure 2 ." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-52", "text": "This leverages entity location information to retain the structural features of a sentence after the max-pooling operation." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-53", "text": "The output of this operation is the concatenation of {pc i1 , pc i2 , pc i3 } yielding a fixed-size output." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-54", "text": "The fixed-size output is processed through a tanh non-linearity followed by a linear layer to produce relation probabilities."
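The piecewise max-pooling step described above can be sketched in a few lines of numpy. This is a minimal illustration under our own naming and segment-boundary convention (the exact handling of the boundary tokens around the entity positions may differ in the original PCNN implementation), with a toy convolution output in place of real CNN features:

```python
import numpy as np

def piecewise_max_pool(conv_out, e1_pos, e2_pos):
    """Split a convolution output (n_filters x seq_len) into three segments
    (up to the first entity, between the entities, after the second entity)
    and max-pool each, yielding a fixed-size vector of length 3*n_filters.
    Assumes e1_pos < e2_pos and that the last segment is non-empty."""
    segments = [conv_out[:, :e1_pos + 1],
                conv_out[:, e1_pos + 1:e2_pos + 1],
                conv_out[:, e2_pos + 1:]]
    pooled = [seg.max(axis=1) for seg in segments]  # pc_i1, pc_i2, pc_i3
    return np.concatenate(pooled)

c = np.arange(24, dtype=float).reshape(4, 6)  # 4 filters, 6 tokens
v = piecewise_max_pool(c, e1_pos=1, e2_pos=3)
print(v.shape)  # (12,): fixed size regardless of sentence length
```

A single global max-pool would yield only 4 values here and lose which side of the entities each feature fired on; the three-way split keeps that coarse positional structure.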
}, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-55", "text": "----------------------------------" }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-56", "text": "**BI-GRU BASED WORD ATTENTION MODEL (BGWA)**" }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-57", "text": "Consider the following sentence expressing the bornIn(Person, City) relation between the entity pair (Obama, Honolulu)." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-58", "text": "Former President Barack Obama was born in the city of Honolulu, capital of the U.S. state of Hawaii" }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-59", "text": "In the sentence, the phrase \"was born in\" helps in identifying the correct relation in the sentence." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-60", "text": "It is conceivable that identifying such key phrases or words will be helpful in improving relation extraction performance." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-61", "text": "Motivated by this, our first proposed model, the Bidirectional Gated Recurrent Unit (Bi-GRU) based Word Attention Model (BGWA), uses an attention mechanism over words to identify such key phrases." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-62", "text": "To the best of our knowledge, there has been no prior work on using word attention in the distant supervision setting." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-63", "text": "The BGWA model is shown in Figure 3 ." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-64", "text": "It uses a Bi-GRU to encode sentence context." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-65", "text": "The GRU is a variant of the Recurrent Neural Network (RNN) which was designed to capture long-range dependencies in words." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-66", "text": "A Bi-GRU runs in both the forward and backward directions in a sentence to capture both sides of a word's context."
}, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-67", "text": "Only a few words in a sentence are relevant for determining the relation expressed." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-68", "text": "This degree of relevance of a word is calculated as an attention value in our model." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-69", "text": "An instance set S q consists of a set of sentences {s i }." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-70", "text": "Each word in sentence s i is represented using a pre-trained embedding;" }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-71", "text": "a vector w ij \u2208 R g\u00d71 is used as the representation for a word." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-72", "text": "We define u ij = w ij A r as the degree of relevance of the j th word in the i th sentence of the instance set, where A \u2208 R g\u00d7g is a square matrix and r \u2208 R g\u00d71 is a relation query vector." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-73", "text": "Bilinear operator A determines the relevance of a word for a relation vector." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-74", "text": "Both A and r are learned parameters." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-75", "text": "The attention value a ij is calculated by taking a softmax over the {u ij } values." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-76", "text": "Although a weighted sum is widely used to obtain sentence context embeddings in attention-based settings, we instead, similar to the PCNN model in Section 2.1, apply piecewise max pooling on the attention-weighted word representations \u0175 ij before, between, and after the entity pair." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-77", "text": "The final piecewise max-pooled sentence embedding s wa \u2208 R 1\u00d73g of the sentence is processed through a tanh non-linearity and a linear layer to yield probabilities for each relation."
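The BGWA scoring just described (a bilinear relevance score per word, softmax-normalised into attention values, then applied to the word representations) can be sketched as follows. All tensors here are random toy data standing in for Bi-GRU outputs and learned parameters, and the variable names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)
g, seq_len = 8, 5                  # representation size g, sentence length
W = rng.normal(size=(seq_len, g))  # one g-dim representation per word (toy)
A = rng.normal(size=(g, g))        # bilinear operator A (learned in practice)
r = rng.normal(size=(g,))          # relation query vector r (learned)

u = W @ A @ r                      # relevance score u_ij for each word
a = np.exp(u - u.max())
a /= a.sum()                       # softmax over {u_ij} -> attention a_ij
W_hat = a[:, None] * W             # attention-weighted word representations
print(a.sum(), W_hat.shape)
```

`W_hat` is what the model then piecewise max-pools around the entity positions, rather than summing into a single context vector.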
}, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-78", "text": "----------------------------------" }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-79", "text": "**ENTITY ATTENTION (EA) MODEL**" }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-80", "text": "Let us once again consider the example sentence from Section 2.2 involving the entity pair (Obama, Honolulu)." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-81", "text": "In the sentence, for the entity Obama, the word President helps in identifying that the entity is a person." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-82", "text": "This extra information helps in narrowing down the relation possibilities by looking only at the relations that occur between a person and a city." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-83", "text": "[Shen and Huang, 2016] proposed an entity attention model for supervised relation extraction with a single sentence as input to the model." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-84", "text": "We modify and adapt their model for the distant supervision setting and propose Entity Attention (EA), which works with a bag of sentences." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-85", "text": "For a given bag of sentences, learning is done using the setting proposed by [Zeng et al., 2015] , wherein the sentence with the highest probability of expressing a relation in a bag is selected to train the model in each iteration." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-86", "text": "The EA model has two components: 1) a PCNN layer, and 2) an Entity Attention layer, as shown in Figure 4 . Consider an instance set S q with a set of sentences {s i }, where each word" }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-87", "text": "w ij \u2208 R 1\u00d7d is a word embedding and {e emb q1 , e emb q2 } are the embeddings for the two entities."
}, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-88", "text": "The PCNN layer is applied on the words in the sentence [Zeng et al., 2015] ." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-89", "text": "The entity-specific attention u i,j,qk for the j th word with respect to the k th entity is calculated as follows:" }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-90", "text": "u i,j,qk = [w ij ; e emb qk ] A k r k , where [w ij ; e emb qk ] is the concatenation of a word and the entity embedding." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-91", "text": "A k , r k are learned parameters." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-92", "text": "Bilinear operator A k determines the relevance of the concatenated word & entity embedding for a relation vector r k ." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-93", "text": "Intuitively, attention should choose words which are related to the entity for a given relation." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-94", "text": "The u i,j,qk are normalized using a softmax function to generate a i,j,qk , the attention scores for a given word." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-95", "text": "Similar to the PCNN model in Section 2.1, the attention-weighted word embeddings are pooled using the piecewise pooling method to generate s ea \u2208 R 1\u00d73g dimensional sentence embeddings." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-96", "text": "The output from the PCNN layer and the entity attention layers are concatenated and then passed through a linear layer to obtain probabilities for each relation." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-162", "text": "Details of the new GDS dataset are described in Section 3." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-97", "text": "The entity attention model (EA) we propose is adapted to the distantly supervised setting by using two important variations from the original [Shen and Huang, 2016] model: (a) EA processes a set of sentences."
}, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-98", "text": "It uses the PCNN [Zeng et al., 2015] assumption to select the sentence with the highest probability of any relation." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-99", "text": "The selected sentence is used to estimate the relation probabilities for an entity-pair and for back-propagation of the error for the bag-of-sentences." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-100", "text": "(b) EA uses PCNN instead of CNN to preserve structural features in a sentence." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-101", "text": "We found the two variations to be crucial for the model to work in the distant supervision setting." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-102", "text": "----------------------------------" }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-103", "text": "**BRING IT ALL TOGETHER: ENSEMBLE MODEL**" }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-104", "text": "We note that the models discussed in previous sections, BGWA, EA and PCNN, have complementary strengths." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-105", "text": "PCNN extracts high-level semantic features from sentences using CNN." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-106", "text": "The most effective features are then selected using a piecewise max-pooling layer." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-107", "text": "Entity-based attention (Section 2.3) helps in highlighting important relation words with respect to each of the entities present in the sentence, thus complementing the PCNN-based features." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-108", "text": "Going beyond the entity-centric words, we observe that not all words in a sentence are equally important for relation extraction."
}, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-109", "text": "The BGWA model (Section 2.2) addresses this aspect by selecting words relevant to a relation in a sentence." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-110", "text": "In Figure 5 , we plot the confidence scores of various models on the true labels of 10 randomly selected instance sets from the Google Distant Supervision dataset (described in Section 3)." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-111", "text": "From this figure, we observe that the proposed methods are able to leverage signals from the entity and word attention models, even when the PCNN model is incorrect (light colored cell in the last column)." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-112", "text": "This validates our assumption and motivates an ensemble approach to efficiently combine these complementary models." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-113", "text": "We combine the predictions of all the three models using a weighted voting ensemble." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-114", "text": "The weights of this model are learned using linear regression on the development dataset." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-115", "text": "Assume P i is a vector containing probability scores for all relations with respect to the i th example in the development data, as given by a model." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-116", "text": "P i \u2208 R 1\u00d7rl , where rl is the number of relations." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-117", "text": "The ensemble score is \u03b1 P i PCNN + \u03b2 P i BGWA + \u03b3 P i EA , where \u03b1, \u03b2, \u03b3 are parameters learned using linear regression [Pedregosa et al., 2011] ." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-118", "text": "More complicated regression methods (e.g., ridge regression) did not improve the results greatly."
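The weighted voting ensemble above reduces to learning three scalars by regressing development-set targets on the three models' scores. A toy sketch using ordinary least squares via numpy (the paper cites scikit-learn's linear regression; here the targets are synthesized from known weights purely so that the recovery is checkable):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200                                    # dev examples (flattened over relations)
p_pcnn, p_bgwa, p_ea = rng.random((3, n))  # per-model relation scores (toy)
# Synthetic targets built from known weights, for illustration only.
y = 0.5 * p_pcnn + 0.3 * p_bgwa + 0.2 * p_ea

X = np.stack([p_pcnn, p_bgwa, p_ea], axis=1)
(alpha, beta, gamma), *_ = np.linalg.lstsq(X, y, rcond=None)
p_ens = alpha * p_pcnn + beta * p_bgwa + gamma * p_ea
print(round(alpha, 3), round(beta, 3), round(gamma, 3))
```

Because the targets lie exactly in the span of the three score vectors, least squares recovers (0.5, 0.3, 0.2); with real dev-set labels the learned weights simply reflect how much each model's confidence should count in the vote.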
}, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-144", "text": "Here, B i is an instance set for distant supervision which consists of the set of instances (sentences or snippets) S i , where the entities e i1 and e i2 are mentioned in each instance." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-145", "text": "The label r i is applied over the entire set B i ." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-146", "text": "D GDS = {B i } is the new GDS dataset." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-147", "text": "Here, each set B i is guaranteed to contain at least one sentence (s i ) which expresses the relation r i assigned to that set." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-148", "text": "An example of the sentence set expansion is shown in Figure 6 ." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-149", "text": "We note that such a guarantee was not available in previous DS benchmarks." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-150", "text": "We divided this dataset into train (60%), development (10%) and test (30%) sets, such that there is no overlap among the entity-pairs of these sets." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-151", "text": "Unlike currently available datasets, the availability of a development set helps in performing model selection for relation extraction in a principled manner." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-152", "text": "In [Riedel et al., 2010] and subsequent work, a manual evaluation was done by validating the top 1000 most confident predictions." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-153", "text": "This manual evaluation was necessary due to the noise in the test data."
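The entity-pair-disjoint split described above can be sketched as follows. The grouping logic and toy records are our own, and the 60/10/30 proportions are approximate in general because assignment happens at the granularity of entity pairs, not individual instances:

```python
import random

def split_by_entity_pair(instances, seed=42):
    """Split (sentence, e1, e2, relation) records into train/dev/test so that
    no entity pair appears in more than one split (60/10/30 over pairs)."""
    pairs = sorted({(e1, e2) for _, e1, e2, _ in instances})
    random.Random(seed).shuffle(pairs)
    n = len(pairs)
    train_p = set(pairs[:int(0.6 * n)])
    dev_p = set(pairs[int(0.6 * n):int(0.7 * n)])
    splits = {'train': [], 'dev': [], 'test': []}
    for rec in instances:
        key = (rec[1], rec[2])
        name = 'train' if key in train_p else 'dev' if key in dev_p else 'test'
        splits[name].append(rec)
    return splits

# Toy corpus: 10 entity pairs, 5 sentences each.
toy = [(f"s{i}", f"e{i % 10}", f"f{i % 10}", "r0") for i in range(50)]
splits = split_by_entity_pair(toy)
print({k: len(v) for k, v in splits.items()})
```

Splitting by pair rather than by sentence is what prevents a model from scoring well on the test set by memorizing entity pairs seen during training.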
}, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-154", "text": "Although GDS is a small dataset compared to the Riedel2010 dataset, it gets past such cumbersome manual evaluation and makes automated evaluation in distantly-supervised relation extraction a reality." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-155", "text": "----------------------------------" }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-156", "text": "**EXPERIMENTS AND RESULTS**" }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-157", "text": "Datasets: We validate the effectiveness of the proposed models on two datasets summarized in Table 3 ." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-158", "text": "Riedel2010 was created by aligning Freebase relations with the New York Times corpus [Riedel et al., 2010; Hoffmann et al., 2011; Surdeanu et al., 2012; Lin et al., 2016] ." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-159", "text": "We partitioned the Riedel2010 train set into a new train (80%) and development (20%) set." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-160", "text": "The development set is created to facilitate the learning of the ensemble model and for model selection." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-161", "text": "The resulting dataset is called Riedel2010-b." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-119", "text": "We also experimented with a jointly learned neural ensemble by concatenating the features of all models after the pooling layer, followed by a linear layer." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-120", "text": "In our experiments, the weighted voting ensemble method gave better results than the jointly learned model."
}, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-121", "text": "----------------------------------" }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-122", "text": "**GDS: A NEW DATASET FOR RELATION EXTRACTION USING DISTANT SUPERVISION**" }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-123", "text": "Several benchmarks datasets for Relation Extraction (RE) using distant supervision (DS) exist [Riedel et al., 2010; Mintz et al., 2009] ." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-124", "text": "DS is used to create both train and test sets in all of these datasets, resulting in the introduction of noise." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-125", "text": "While training noise in distant supervision is expected, noise in the test data is troublesome as it may lead to incorrect evaluations." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-126", "text": "There are two kinds of noise added due to distant supervision assumption: (a) samples with incorrect labels due to missing Knowledge Base (KB) fact, and (b) samples with no instance supporting the KB fact." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-127", "text": "A few examples of such noise are listed in Table 1 ." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-128", "text": "Previous benchmark datasets in this area suffer from these drawbacks." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-129", "text": "In order to overcome these challenges, we develop Google Distant Supervision (GDS), a new dataset for relation extraction using distant supervision." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-130", "text": "Statistics of the new dataset are summarized in Table 2 ." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-131", "text": "To alleviate noise in DS setting, we make sure that labelled relation is correct and for each instance set in GDS, there is at least one sentence in that set which expresses the relation assigned to that set." 
}, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-132", "text": "We start with the human-judged Google Relation Extraction corpus 1 ." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-133", "text": "This corpus consists of 5 binary relations." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-134", "text": "We construct the GDS dataset out of the relation extraction corpus using the following process." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-135", "text": "Let D GRE be the Google RE corpus, D GRE = {(s i , e i1 , e i2 , r i )} , where the i th sentence s i is annotated as expressing relation r i between the two entities e i1 and e i2 in the sentence." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-136", "text": "r i is one of the five relations mentioned in Table 2 ." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-137", "text": "Now, for each (s i , e i1 , e i2 , r i ) \u2208 D GRE , we perform the following:" }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-138", "text": "\u2022 Perform web search to retrieve documents containing the two entities e i1 and e i2 ." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-139", "text": "\u2022 From such retrieved documents, select multiple text snippets containing the two entities." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-140", "text": "Each snippet is restricted to contain at most 500 words." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-141", "text": "Let S i = {s q }; q \u2208 (1 . . . M ) be the set of such snippets." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-164", "text": ", 2016], we use held-out evaluation scheme." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-165", "text": "The performance of each model is evaluated on a test set using Precision-Recall (PR) curve." 
}, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-166", "text": "Baselines: We compare proposed models with (a) Piecewise Convolution Neural Network (PCNN) [Zeng et al., 2015] and (b) Neural Relation Extraction with Selective Attention over Instances (NRE) [Lin et al., 2016] ." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-167", "text": "Both NRE and PCNN baseline outperform traditional baselines like MIML-RE and hence we use them as a representative state-of-the-art baseline to compare with proposed models." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-168", "text": "Model Parameters: The parameters used for the various models are summarized in Table 4 ." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-169", "text": "Word embeddings are initialized using the Word2Vec vectors from NYT dataset, similar to [Lin et al., 2016] ." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-170", "text": "Word Position feature embeddings (with respect to each entity) are randomly initialized and learned during training." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-171", "text": "Concatenation of the word embedding and position embedding results in a 60-dimensional (d w + (2 * d p )) embedding x ij for each word." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-172", "text": "We implemented PCNN model baseline following [Zeng et al., 2015] and used author provided results and implementation for NRE baseline." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-173", "text": "The EA and BGWA models were developed in PyTorch 2 ." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-174", "text": "We use SGD algorithm with dropout [Srivastava et al., 2014] for model learning." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-175", "text": "The experiments were run on GeForce GTX 1080 Ti using NVIDIA-CUDA." 
}, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-176", "text": "Model selection for all algorithms was done based on the AUC (Area Under the Curve) metric for the precision-recall curve for development dataset." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-177", "text": "----------------------------------" }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-178", "text": "**RESULTS**" }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-179", "text": "Performance Comparison: Figure 7 and Figure 8 show the precision-recall curve for baseline and proposed algorithms on two datasets, Riedel2010-b (with development set) and the GDS dataset." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-180", "text": "Please note that the NRE model's PR-curve in Figure 7 is taken from author published results which used combined train+dev set for training." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-181", "text": "This gives the NRE model an advantage over all other models in Figure 7 as all of them are trained using only the train part." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-182", "text": "For the Riedel2010-b dataset, we plot the PR-curve with a maximum recall of 0.35, as the precision is too low beyond 0.35 recall." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-183", "text": "From Figure 7 and Figure 8 , we observe that the proposed models -BGWA and EA -achieve higher or competitive precision over the entire recall range compared to the state-ofthe-art NRE and PCNN models." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-184", "text": "PCNN model outperforms NRE model in both datasets." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-185", "text": "ENSEMBLE, a combination of proposed models BGWA, EA and PCNN in a weighted ensemble, helps in improving precision further." 
}, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-186", "text": "It achieves a significant precision gain of over 2-3% over various recall ranges for the Riedel2010-b dataset with 53 relations." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-187", "text": "This indicates that clues from combined model help results." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-188", "text": "We observe that the BGWA model performs well on the Riedel2010-b dataset, but the trend is reversed in performance for GDS dataset where EA performs better." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-189", "text": "These two datasets have varied properties, (a) Riedel2010-b has 53 relations as opposed to 5 in the GDS dataset, (b) GDS has no label noise in the test as compared to the Riedel2010-b dataset." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-190", "text": "The performance difference between the BGWA and EA model shows that the errors by both the models are not correlated and are complimentary as shown in Figure 5 ." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-191", "text": "This empirical validation encourages ensemble of these methods." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-192", "text": "We observe that the ENSEMBLE model performs consistently well across all recall ranges in both the datasets, validating our assumption." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-193", "text": "Visualizing Attention: We visualize the attention values of our models in Figure 9 ." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-194", "text": "It can be observed that the Entity 2 Attention for the 'location in' relation rightly focuses on words indicating place information like 'in', ',' and 'annopolis'." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-195", "text": "We note that entity attention is relation specific." 
}, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-196", "text": "In this case, Entity 1 Attention rightly focuses on the second entity, 'maryland' (location name), for selecting relation 'location in'." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-197", "text": "The word attention value is calculated using Bi-GRU hidden representation embeddings." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-198", "text": "Bi-GRU representation at a given time point t in a sequence is a summary of all the timepoints correlated to t, in the sequence." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-199", "text": "A high attention value for the hidden layers after processing the word 'annapolis' indicates that the sentence has rich context around the first entity to indicate location in relation." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-200", "text": "In conclusion, the attention models rightly choose the relevant words in context and help in improving relation extraction performance." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-201", "text": "----------------------------------" }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-202", "text": "**RELATED WORK**" }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-203", "text": "Relation extraction in distantly supervised datasets is posed in a Multi-instance Multi-label (MIML) setting [Surdeanu et al., 2012] ." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-204", "text": "A large proportion of the subsequent work in this field has aimed to relax the strong assumptions that the original DS model made." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-205", "text": "[Riedel et al., 2010] introduced the expressed-at-least-once assumption in a factor graph model as an aggregating mechanism over mention level predictions." 
}, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-206", "text": "Work by [Hoffmann et al., 2011; Surdeanu et al., 2012; Ritter et al., 2013] are crucial increments to [Riedel et al., 2010] ." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-207", "text": "In past few years, Deep learning models [Bengio, 2009] have reduced the dependence of algorithms on manually de- signed features." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-208", "text": "[Zeng et al., 2014] introduced the use of a CNN based model for relation extraction." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-209", "text": "[Zeng et al., 2015] proposed a Piecewise Convolutional Neural Network (PCNN) model to preserve the structural features of a sentence using piecewise max-pooling approach, improving the precisionrecall curve significantly." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-210", "text": "However, PCNN method used only one sentence in the instance-set to predict the relation label and for backpropagation." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-211", "text": "[Lin et al., 2016] improves upon PCNN results by introducing an attention mechanism to select a set of sentences from instance set for relation label prediction." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-212", "text": "[ Zheng et al., 2016] aimed to leverage inter-sentence information for relation extraction in a ranking model." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-213", "text": "The hypothesis explored is that for a particular entity-pair, each mention alone may not be expressive enough of the relation in question, but information from several mentions may be required to decisively make a prediction." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-214", "text": "Recently, work by [Ye et al., 2016] exploit the connections between relation (class ties) to improve relation extraction performance." 
}, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-215", "text": "A few papers propose the addition of background knowledge to reduce noise in training data." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-216", "text": "[Weston et al., 2013] proposes a joint-embedding model for text and KB entities where the known part of the KB is utilized as part of the supervision signal." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-217", "text": "[Han and Sun, 2016] use indirect supervision like consistency between relation labels, consistency between relations and arguments, and consistency between neighbour instances using Markov logic networks." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-218", "text": "[Nagarajan et al., 2017] uses inter-instance-set couplings for relation extraction in multi-task setup to improve performance." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-219", "text": "Attention models learn the importance of a feature in the supervised task through back-propogation." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-220", "text": "Attention mechanisms in neural networks have been successfully applied to a variety of problems, like machine translation [Bahdanau et al., 2014] , image captioning [Xu et al., 2015] , supervised relation extraction [Shen and Huang, 2016] , distantly-supervised relation extraction [Zheng et al., 2016] etc." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-221", "text": "In our work, we focus on selecting the right words in a sentence using the word and entity-based attention mechanism." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-222", "text": "----------------------------------" }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-223", "text": "**CONCLUSION**" }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-224", "text": "Distant Supervision (DS) has emerged as a promising approach to bootstrap relation extractors with limited supervision." 
}, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-225", "text": "In this paper, we present three novel models for distantlysupervised relation extraction: (1) a Bi-GRU based word attention model (BGWA), (2) an entity-centric attention model (EA), and (3) and a weighted voting ensemble model, which combines multiple complementary models for improved relation extraction." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-226", "text": "We introduce GDS, a new distant supervision dataset for relation extraction." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-227", "text": "GDS removes test data noise present in all previous distant supervision benchmark datasets, making credible automatic evaluation possible." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-228", "text": "Combining proposed methods with attention-based sentence selection methods is left as future work." }, { "sent_id": "c3c09df34cf9f81c1cc4fc63a18bf0-C001-229", "text": "We plan to make our code and datasets publicly available to foster reproducible research." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "c3c09df34cf9f81c1cc4fc63a18bf0-C001-23" ], [ "c3c09df34cf9f81c1cc4fc63a18bf0-C001-46" ], [ "c3c09df34cf9f81c1cc4fc63a18bf0-C001-212" ], [ "c3c09df34cf9f81c1cc4fc63a18bf0-C001-220" ] ], "cite_sentences": [ "c3c09df34cf9f81c1cc4fc63a18bf0-C001-23", "c3c09df34cf9f81c1cc4fc63a18bf0-C001-46", "c3c09df34cf9f81c1cc4fc63a18bf0-C001-212", "c3c09df34cf9f81c1cc4fc63a18bf0-C001-220" ] }, "@USE@": { "gold_contexts": [ [ "c3c09df34cf9f81c1cc4fc63a18bf0-C001-85" ], [ "c3c09df34cf9f81c1cc4fc63a18bf0-C001-88" ], [ "c3c09df34cf9f81c1cc4fc63a18bf0-C001-98" ], [ "c3c09df34cf9f81c1cc4fc63a18bf0-C001-166" ], [ "c3c09df34cf9f81c1cc4fc63a18bf0-C001-172" ] ], "cite_sentences": [ "c3c09df34cf9f81c1cc4fc63a18bf0-C001-85", "c3c09df34cf9f81c1cc4fc63a18bf0-C001-88", "c3c09df34cf9f81c1cc4fc63a18bf0-C001-98", "c3c09df34cf9f81c1cc4fc63a18bf0-C001-166", "c3c09df34cf9f81c1cc4fc63a18bf0-C001-172" ] } } }, "ABC_8ae44e74146d3f40845741fac4dff9_13": { "x": [ { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-23", "text": "However, (to the best of our knowledge) they have not yet been deployed in tasks involving figurative meaning transfers, such as interpretation of metonymy or metaphor." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-24", "text": "In this paper, we address this problem and apply a vector space model of word meaning in context to metaphor paraphrasing, appropriately adapting it to the task." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-25", "text": "In comparison to lexical substitution, metaphor paraphrasing presents an additional challenge, namely that of discriminating between literal and metaphorical substitutes." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-26", "text": "Shutova (2010) used a selectional preference-based model for this purpose, obtaining encouraging results in a supervised setting." 
}, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-27", "text": "We evaluate the capacity of our vector space model to discriminate between literal and figurative paraphrases on its own, as well as integrating it with a selectional preference-based model similar to that of Shutova (2010) and thus evaluating the latter in an unsupervised setting." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-28", "text": "Our system thus operates in two steps." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-29", "text": "It first computes candidate paraphrases according to a latent model of semantic similarity based on the context of the metaphorically used word, and then measures the literalness of the candidates using a selectional preference model." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-30", "text": "We focus on paraphrasing metaphorical verbs and evaluate our system using the dataset of Shutova (2010) especially designed for this task." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-31", "text": "The comparison against a paraphrasing gold standard provided by Shutova (2010) is complemented by an evaluation against direct human judgements of system output." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-32", "text": "----------------------------------" }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-33", "text": "**METHOD**" }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-34", "text": "----------------------------------" }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-35", "text": "**GENERATION OF CANDIDATE PARAPHRASES USING A VECTOR SPACE MODEL**" }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-36", "text": "Paraphrase candidates are generated by first computing the specific meaning of the metaphorical term in its context." 
}, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-37", "text": "The meaning of a word instance in context is computed by adapting its original (global) meaning vector according to the dependency relations in which the word instance participates." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-38", "text": "For this purpose, we build a factorization model in which words, together with their window-based context words and their dependency relations, are linked to latent dimensions." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-39", "text": "Both types of contexts are combined to be able to induce broad, topical semantics as well as tight, synonym-like semantics." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-40", "text": "The factorization model allows us to determine which dimensions are important for a particular context, and adapt the dependency-based feature vector of the word accordingly." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-41", "text": "The model uses non-negative matrix factorization (NMF) (Lee and Seung, 2000) in order to find latent dimensions, using the minimization of the Kullback-Leibler divergence as an objective function." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-42", "text": "A more detailed description of the factorization model can be found in Van de Cruys et al. (2011) ." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-43", "text": "Our paraphrase generation model has been trained on part of the UKWaC corpus (Baroni et al., 2009 ), covering about 500M words." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-44", "text": "The corpus has been part of speech tagged and lemmatized with Stanford Part-Of-Speech Tagger (Toutanova and Manning, 2000; Toutanova et al., 2003) , and parsed with MaltParser (Nivre et al., 2006) , so that dependency triples could be extracted." 
}, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-45", "text": "Using the latent distributions yielded by our factorization model, it is now possible to compute the meaning vector for a particular word in context, and subsequently the most similar words to this meaning vector, which will be our candidate paraphrases." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-46", "text": "Intuitively, the contextual features of the word (i.e. the dependency-based context features) will highlight the important semantic dimensions of the particular instance, creating a probability distribution over latent factors p(z|d j )." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-47", "text": "Using this probability distribution, a new probability distribution is determined over dependency features given the context, following equation 1." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-48", "text": "The last step is to weight the original probability vector of the word according to the probability vector of the dependency features given the word's context, by taking the pointwise multiplication of probability vectors p(d|w i ) and p(d|d j )." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-49", "text": "This final step is a crucial one in the model." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-50", "text": "The model is not just based on latent factors; rather, the latent factors are used to determine which of the features in the original word vector are the salient ones given a particular context." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-51", "text": "This allows us to compute an accurate adaptation of the original word vector in context." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-52", "text": "As an example, take the metaphorical expression reflect concern." 
}, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-53", "text": "We want to compute the meaning vector for the verb reflect (w i ) in the context of its direct object," }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-54", "text": "Using the probability distribution over latent factors given the dependency context p(z|d j ) (a result that comes out of the factorization), we can compute the probability of dependency features given the context -p(d|d j )." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-55", "text": "The former step yields a general probability distribution over dependency features that tells us how likely a particular dependency feature is given the context concer n d o b j that the verb appears in." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-56", "text": "Our last step is now to weight the original probability vector of the target word (the aggregate of dependency-based context features over all contexts of the target word) according to the new distribution given the context in which the verb appears." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-57", "text": "Features associated with concern (or more specifically, the dependency features associated with latent factors that are related to the feature concer n d o b j ) will be emphasized, while features associated with unrelated latent factors are leveled out." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-58", "text": "We can now return to our original matrix A and compute the top similar words for the adapted vector of reflect given the dependency feature concer n d o b j , which yields the results presented in 1. If we instead compute the meaning vector for reflect given the dependency feature l i ght d o b j (as in the non-metaphorical expression reflect light), we get the results in 2." 
}, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-59", "text": "----------------------------------" }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-60", "text": "**RERANKING OF CANDIDATE PARAPHRASES USING A SELECTIONAL PREFERENCE MODEL**" }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-61", "text": "The candidate lists which are generated by the vector space model contain a number of substitutes that retain the meaning of a metaphorical expression as closely as possible." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-62", "text": "However, due to the fact that the model favours the substitutes that are similar to the metaphorical verb, the highly-ranked substitutes are sometimes also metaphorically used." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-63", "text": "For example, \"speed up change\" is the top-ranked paraphrase for \"accelerate change\" and the literal paraphrase \"facilitate change\" appears only in rank 10." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-64", "text": "As the task is to identify the literal interpretation, this ranking still needs to be refined." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-65", "text": "Following Shutova (2010) , we use a selectional preference model to discriminate between literally and metaphorically used substitutes." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-66", "text": "Verbs used metaphorically are likely to demonstrate semantic preference for the source domain, e.g. speed up would select for MACHINES, or VEHICLES, rather than CHANGE (the target domain), whereas the ones used literally for the target domain, e.g. facilitate would select for PROCESSES (including CHANGE)." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-67", "text": "We therefore expect that selecting the verbs whose preferences the noun in the metaphorical expression matches best should allow us to filter out non-literalness." 
}, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-68", "text": "We automatically acquired selectional preference (SP) distributions of the candidate substitutes (for subject-verb and verb-object relations) from the British National Corpus (BNC) (Burnard, 2007) parsed by the RASP parser (Briscoe et al., 2006) ." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-69", "text": "We obtained SP classes by clustering the 2000 most frequent nouns in the BNC into 200 clusters using the algorithm of Sun and Korhonen (2009) ." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-70", "text": "We quantified selectional preferences using the association measure proposed by Resnik (1993) ." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-71", "text": "It represents SPs as the difference between the posterior distribution of noun classes in a particular relation with the verb and their prior distribution in that syntactic position irrespective of the identity of the verb." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-72", "text": "This difference then defines the selectional preference strength (SPS) of the verb, quantified in terms of Kullback-Leibler divergence as follows." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-73", "text": "where P(c) is the prior probability of the noun class, P(c|v) is the posterior probability of the noun class given the verb and R is the grammatical relation." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-74", "text": "SPS measures how strongly the predicate constrains its arguments." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-75", "text": "Resnik then quantifies how well a particular argument class fits the verb using another measure called selectional association:" }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-76", "text": "We use selectional association as a measure of semantic fitness, i.e. literalness, of the paraphrases." 
}, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-77", "text": "The selectional preference model was applied to the top 20 substitutes suggested by the vector space model." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-78", "text": "The threshold of 20 substitutes was set experimentally on a small development set." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-79", "text": "The paraphrases were re-ranked based on their selectional association with the noun in the context." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-80", "text": "Those paraphrases that are not well suited or used metaphorically are dispreferred within this ranking." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-81", "text": "The new ranking (top 6 paraphrases) is shown in Table 2 ." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-82", "text": "The expectation is that the paraphrase in the first rank (i.e. the verb with which the noun in the context has the highest association) represents a literal interpretation." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-83", "text": "----------------------------------" }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-84", "text": "**EVALUATION AND DISCUSSION**" }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-85", "text": "We compared the rankings of the initial candidate generation by the vector space model (VS) and the selectional preference-based reranking (SP) to that of an unsupervised paraphrasing baseline." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-86", "text": "We thus evaluated the ability of VS on its own to detect literal paraphrases, as well as the effectiveness of the SP model of Shutova (2010) in an unsupervised setting and in combination with VS." 
}, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-87", "text": "----------------------------------" }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-88", "text": "**DATASET**" }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-89", "text": "To our knowledge, the only metaphor paraphrasing dataset and gold standard available to date is that of Shutova (2010) ." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-90", "text": "We used this dataset to develop and test our system." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-91", "text": "Shutova (2010) annotated metaphorical expressions in a subset of the BNC sampling various genres: literature, newspaper/journal articles, essays on politics, international relations and sociology, radio broadcast (transcribed speech)." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-92", "text": "The dataset consists of 62 phrases that include a metaphorical verb and either a subject or a direct object." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-93", "text": "subject-verb constructions) ." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-94", "text": "10 phrases in the dataset were used during development to observe the behavior of the system, and the remaining 52 constituted the test set." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-95", "text": "11 of them were subject-verb constructions and 41 were verb-direct object constructions." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-96", "text": "----------------------------------" }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-97", "text": "**BASELINE SYSTEM**" }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-98", "text": "The baseline system is also unsupervised and incorporates two methods: that of generating most similar substitutes for the metaphorical verb regardless of its context and a method for their re-ranking based on the likelihood of their co-occurrence with the noun in the metaphorical expression." 
}, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-99", "text": "Thus a list of most similar substitutes is first generated using a standard dependencybased vector space model (Pad\u00f3 and Lapata, 2007) ." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-100", "text": "The likelihood of a paraphrase is then calculated as a joint probability of the candidate substitutes and the noun in the context as follows:" }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-101", "text": "where f (v, n) is the frequency of the co-occurrence of the substitute with the context and" }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-102", "text": "is the total number of verbs in the corpus." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-103", "text": "----------------------------------" }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-104", "text": "**EVALUATION METHOD AND RESULTS**" }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-105", "text": "We evaluated the paraphrases with the aid of human judges and against a human-created gold standard in two different experimental settings." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-106", "text": "Setting 1 Human judges were presented with a set of sentences containing metaphorical expressions and their rank 1 paraphrases produced by VS, by SP and by the baseline, randomised." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-107", "text": "They were asked to mark the ones that have the same meaning as the metaphorically used term -and are used literally in the context of the paraphrase expression -as correct." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-108", "text": "We had 4 volunteer annotators who were all native speakers of English and had no or sparse linguistic expertise." 
}, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-109", "text": "Their agreement on the task was \u03ba = 0.54 (n = 2, N = 115, k = 4), whereby the main source of disagreement was the presence of highly conventionalised metaphorical paraphrases." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-110", "text": "We then evaluated the system performance against their judgements in terms of precision at rank 1, P(1)." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-111", "text": "Precision at rank 1 measures the proportion of correct literal interpretations among the paraphrases in rank 1." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-112", "text": "A paraphrase was considered correct if at least 3 judges out of 4 marked it as such." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-113", "text": "The results are demonstrated in Table 3 ." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-114", "text": "The VS model identifies literal paraphrases with P(1) = 0.48 and the SP model with P(1) = 0.52." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-115", "text": "Both models outperform the baseline that only achieves P(1) = 0.40." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-116", "text": "----------------------------------" }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-117", "text": "**SETTING 2**" }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-118", "text": "We then also evaluated VS, SP and baseline rankings against a human-constructed paraphrasing gold standard." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-119", "text": "The gold standard was created by Shutova (2010) as follows." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-120", "text": "Five independent annotators were presented with a set of sentences containing metaphorical However, given that the metaphor paraphrasing task is open-ended, it is hard to construct a comprehensive gold standard." 
}, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-121", "text": "For example, for the phrase stir excitement the gold standard includes the paraphrase create excitement, but not provoke excitement or stimulate excitement, which are more precise paraphrases." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-122", "text": "Thus the gold standard evaluation may unfairly penalise the system, which motivates our two-phase evaluation against both the gold standard and direct judgements of system output." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-123", "text": "The system output was compared against the gold standard using mean average precision (MAP) as a measure." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-124", "text": "MAP is defined as follows:" }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-125", "text": "where M is the number of metaphorical expressions, N j is the number of correct paraphrases for the metaphorical expression, P ji is the precision at each correct paraphrase (the number of correct paraphrases among the top i ranks)." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-126", "text": "First, average precision is estimated for individual metaphorical expressions, and then the mean is computed across the dataset." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-127", "text": "This measure allows us to assess ranking quality beyond rank 1, as well as the recall of the system." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-128", "text": "As compared to the gold standard, MAP of VS is 0.40, MAP of SP is 0.41 and that of the baseline is 0.37." 
}, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-129", "text": "----------------------------------" }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-130", "text": "**DISCUSSION**" }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-131", "text": "Our system consistently produces better results than the baseline, with an improvement of 12% in precision on our human evaluation (SP) and an improvement of 4% MAP on the gold standard (SP)." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-132", "text": "At first sight, these improvements of our unsupervised system may not seem very high, in particular when compared to the results of the supervised system of Shutova (2010) ." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-133", "text": "Note, however, that our results are in line with the performance of unsupervised approaches on the lexical substitution task." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-134", "text": "Unsupervised approaches to lexical substitution perform well below their supervised counterparts (which are usually based on WordNet), and often have difficulties getting significant improvements over a baseline of a simple dependency-based vector space model of semantic similarity (Erk and Pad\u00f3, 2008; Van de Cruys et al., 2011) ." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-135", "text": "We therefore think that the method presented here takes a promising step in the direction of unsupervised metaphor paraphrasing." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-136", "text": "The SP re-ranking of the candidates yields an improvement over the VS model used on its own, as expected." 
}, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-137", "text": "Our data analysis has shown that SP produces higher quality top paraphrases with respect to their literalness, however the two models perform similarly on the meaning retention task (according to our own judgements 55% of the top ranked paraphrases had a similar meaning to that of a metaphorical verb for both models)." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-138", "text": "The difference in MAP scores of the two models is, however, not as high as that of the respective P(1) scores." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-139", "text": "This can be explained by the fact that the VS model produces a number of antonymous candidates." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-140", "text": "The candidates are then re-ranked by the SP model which does not consider meaning retention, but rather the semantic fit of a candidate interpretation in the context." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-141", "text": "As a result, a number of antonymous paraphrases that are highly associated with the noun in the context get ranked above some of the correct literal paraphrases, lowering the method's MAP score." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-142", "text": "For example, the antonymous paraphrase tension eased for the metaphorical expression tension mounted is ranked higher than the correct paraphrase tension intensified." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-143", "text": "In general, antonymous paraphrasing was the most common type of error." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-144", "text": "Antonyms are known to attract high similarity scores within a distributional similarity framework." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-145", "text": "This is an issue that needs to be addressed in the future, in lexical substitution in general and metaphor paraphrasing in particular." 
}, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-146", "text": "Although the SP model generally improves the initial VS ranking, there were some instances where this was not the case." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-147", "text": "One such example is the metaphorical expression break agreement." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-148", "text": "The top ranked paraphrases suggested in the first step, breach and violate, were overrun by the well matching paraphrases ratify and sign, that have a different -almost opposite -meaning." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-149", "text": "The baseline tends to produce metaphorical paraphrases rather than literal ones." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-150", "text": "However, in a few cases the baseline suggests better rank 1 paraphrases than the system." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-151", "text": "For example, it interprets the expression leak a report as circulate a report, as opposed to print a report incorrectly suggested by the system." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-152", "text": "This is due to the fact that the paraphrase generation relies entirely on one single context word (in this case report); taking a broader context into account might alleviate this problem." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-153", "text": "----------------------------------" }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-154", "text": "**CONCLUSION**" }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-155", "text": "In this paper we presented the first fully unsupervised approach to metaphor interpretation." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-156", "text": "Our system produces literal paraphrases for metaphorical expressions in unrestricted text." 
}, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-157", "text": "Producing metaphorical interpretations in textual format makes our system directly usable by other NLP applications that can benefit from a metaphor processing component." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-158", "text": "The fact that, unlike all previous approaches to this problem, our system does not use any supervision makes it easily scalable to new domains and applications, as well as portable to a wider range of languages." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-159", "text": "Our method identifies literal paraphrases for metaphorical expressions with a precision of 0.52 measured at top-ranked paraphrases." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-160", "text": "Given the unsupervised nature of our system and considering the state-of-the-art in unsupervised lexical substitution, we consider this a promising result." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-161", "text": "Following Shutova (2010) , the current experimental design and test set focuses on subject-verb and verb-object metaphors only, but we expect the method to be equally applicable to other parts of speech and a wider range of syntactic constructions." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-162", "text": "Our context-based vector space model is suited to all part-of-speech classes and types of relations." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-163", "text": "Selectional preferences have been previously successfully acquired not only for verbs, but also for nouns, adjectives and even prepositions (Brockmann and Lapata, 2003; Zapirain et al., 2009; \u00d3 S\u00e9aghdha, 2010) ." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-164", "text": "Extending the system to deal with further syntactic constructions is thus part of our future work." 
}, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-2", "text": "We present the first fully unsupervised approach to metaphor interpretation, and a system that produces literal paraphrases for metaphorical expressions." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-3", "text": "Such a form of interpretation is directly transferable to other NLP applications that can benefit from a metaphor processing component." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-4", "text": "Our method is different from previous work in that it does not rely on any manually annotated data or lexical resources." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-5", "text": "First, our method computes candidate paraphrases according to the context in which the metaphor appears, using a vector space model." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-6", "text": "It then uses a selectional preference model to measure the degree of literalness of the paraphrases." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-7", "text": "The system identifies correct paraphrases with a precision of 0.52 at top rank, which is a promising result for a fully unsupervised approach." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-8", "text": "----------------------------------" }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-10", "text": "Metaphor has traditionally been viewed as an artistic device that lends vividness and distinction to its author's style." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-11", "text": "This view was first challenged by Lakoff and Johnson (1980) , who claimed that it is a productive phenomenon that operates at the level of mental processes." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-12", "text": "Humans often use metaphor to describe abstract concepts through reference to more concrete experiences." 
}, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-13", "text": "Being a characteristic property of human thought and communication, metaphor becomes an important problem for natural language processing." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-14", "text": "Shutova and Teufel (2010) have shown in an empirical study that the use of metaphor is ubiquitous in natural language text (according to their data, on average every third sentence in general domain text contains a metaphorical expression)." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-15", "text": "Due to this high frequency usage, a system capable of recognizing and interpreting metaphorical expressions in unrestricted text would become an invaluable component of many semantics-oriented NLP applications." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-16", "text": "The majority of previous computational approaches to metaphor rely on manually created knowledge and thus operate on a limited domain and are expensive to build and extend." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-17", "text": "Handcoded knowledge has proved useful for both metaphor identification, i.e. distinguishing between literal and metaphorical language in text (Fass, 1991; Martin, 1990; Krishnakumaran and Zhu, 2007; Gedigian et al., 2006) and metaphor interpretation, i.e. identifying the intended literal meaning of a metaphorical expression (Fass, 1991; Martin, 1990; Narayanan, 1997; Barnden and Lee, 2002) ." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-18", "text": "However, to be applicable in a real-world setting a metaphor processing system needs to be able to identify and interpret metaphorical expressions in unrestricted text." 
}, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-19", "text": "The recent metaphor paraphrasing approach of Shutova (2010) was designed with this requirement in mind and used statistical methods, but still relied on the WordNet (Fellbaum, 1998) database to generate the initial set of paraphrases." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-20", "text": "In this paper, we take the metaphor paraphrasing task a step further and present a fully unsupervised approach to this problem." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-21", "text": "In our method, candidate substitutes for the metaphorical term are generated using a vector space model." }, { "sent_id": "8ae44e74146d3f40845741fac4dff9-C001-22", "text": "Vector space models have been previously used in the general lexical substitution task (Mitchell and Lapata, 2008; Pad\u00f3, 2008, 2009; Thater et al., 2009 Thater et al., , 2010 Erk and Pad\u00f3, 2010; Van de Cruys et al., 2011) ." } ], "y": { "@BACK@": { "gold_contexts": [ [ "8ae44e74146d3f40845741fac4dff9-C001-19" ], [ "8ae44e74146d3f40845741fac4dff9-C001-26" ], [ "8ae44e74146d3f40845741fac4dff9-C001-89" ], [ "8ae44e74146d3f40845741fac4dff9-C001-91" ] ], "cite_sentences": [ "8ae44e74146d3f40845741fac4dff9-C001-19", "8ae44e74146d3f40845741fac4dff9-C001-26", "8ae44e74146d3f40845741fac4dff9-C001-89", "8ae44e74146d3f40845741fac4dff9-C001-91" ] }, "@EXT@": { "gold_contexts": [ [ "8ae44e74146d3f40845741fac4dff9-C001-19", "8ae44e74146d3f40845741fac4dff9-C001-20" ] ], "cite_sentences": [ "8ae44e74146d3f40845741fac4dff9-C001-19" ] }, "@SIM@": { "gold_contexts": [ [ "8ae44e74146d3f40845741fac4dff9-C001-27" ] ], "cite_sentences": [ "8ae44e74146d3f40845741fac4dff9-C001-27" ] }, "@USE@": { "gold_contexts": [ [ "8ae44e74146d3f40845741fac4dff9-C001-30" ], [ "8ae44e74146d3f40845741fac4dff9-C001-31" ], [ "8ae44e74146d3f40845741fac4dff9-C001-65" ], [ "8ae44e74146d3f40845741fac4dff9-C001-86" ], [ "8ae44e74146d3f40845741fac4dff9-C001-89", 
"8ae44e74146d3f40845741fac4dff9-C001-90" ], [ "8ae44e74146d3f40845741fac4dff9-C001-119" ], [ "8ae44e74146d3f40845741fac4dff9-C001-132" ], [ "8ae44e74146d3f40845741fac4dff9-C001-161" ] ], "cite_sentences": [ "8ae44e74146d3f40845741fac4dff9-C001-30", "8ae44e74146d3f40845741fac4dff9-C001-31", "8ae44e74146d3f40845741fac4dff9-C001-65", "8ae44e74146d3f40845741fac4dff9-C001-86", "8ae44e74146d3f40845741fac4dff9-C001-89", "8ae44e74146d3f40845741fac4dff9-C001-119", "8ae44e74146d3f40845741fac4dff9-C001-132", "8ae44e74146d3f40845741fac4dff9-C001-161" ] } } }, "ABC_2504d707a8123774791d98b755551a_13": { "x": [ { "sent_id": "2504d707a8123774791d98b755551a-C001-69", "text": "**OUR APPROACH TO HANDLING NON-LOCAL DEPENDENCIES**" }, { "sent_id": "2504d707a8123774791d98b755551a-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "2504d707a8123774791d98b755551a-C001-2", "text": "This paper shows that a simple two-stage approach to handle non-local dependencies in Named Entity Recognition (NER) can outperform existing approaches that handle non-local dependencies, while being much more computationally efficient." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-3", "text": "NER systems typically use sequence models for tractable inference, but this makes them unable to capture the long distance structure present in text." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-4", "text": "We use a Conditional Random Field (CRF) based NER system using local features to make predictions and then train another CRF which uses both local information and features extracted from the output of the first CRF." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-5", "text": "Using features capturing non-local dependencies from the same document, our approach yields a 12.6% relative error reduction on the F1 score, over state-of-theart NER systems using local-information alone, when compared to the 9.3% relative error reduction offered by the best systems that exploit non-local information." 
}, { "sent_id": "2504d707a8123774791d98b755551a-C001-6", "text": "Our approach also makes it easy to incorporate non-local information from other documents in the test corpus, and this gives us a 13.3% error reduction over NER systems using local-information alone." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-7", "text": "Additionally, our running time for inference is just the inference time of two sequential CRFs, which is much less than that of other more complicated approaches that directly model the dependencies and do approximate inference." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-8", "text": "----------------------------------" }, { "sent_id": "2504d707a8123774791d98b755551a-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "2504d707a8123774791d98b755551a-C001-10", "text": "Named entity recognition (NER) seeks to locate and classify atomic elements in unstructured text into predefined entities such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-11", "text": "A particular problem for Named Entity Recognition(NER) systems is to exploit the presence of useful information regarding labels assigned at a long distance from a given entity." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-12", "text": "An example is the label-consistency constraint that if our text has two occurrences of New York separated by other tokens, we would want our learner to encourage both these entities to get the same label." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-13", "text": "Most statistical models currently used for Named Entity Recognition, use sequence models and thereby capture local structure." 
}, { "sent_id": "2504d707a8123774791d98b755551a-C001-14", "text": "Hidden Markov Models (HMMs) (Leek, 1997; Freitag and McCallum, 1999) , Conditional Markov Models (CMMs) (Borthwick, 1999; McCallum et al., 2000) , and Conditional Random Fields (CRFs) (Lafferty et al., 2001 ) have been successfully employed in NER and other information extraction tasks." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-15", "text": "All these models encode the Markov property i.e. labels directly depend only on the labels assigned to a small window around them." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-16", "text": "These models exploit this property for tractable computation as this allows the Forward-Backward, Viterbi and Clique Calibration algorithms to become tractable." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-17", "text": "Although this constraint is essential to make exact inference tractable, it makes us unable to exploit the non-local structure present in natural language." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-18", "text": "Label consistency is an example of a non-local dependency important in NER." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-19", "text": "Apart from label consistency between the same token sequences, we would also like to exploit richer sources of dependencies between similar token sequences." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-20", "text": "For example, as shown in Figure 1 , we would want it to encourage Einstein to be labeled \"Person\" if there is strong evidence that Albert Einstein should be labeled \"Person\"." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-21", "text": "Sequence models unfortu- nately cannot model this due to their Markovian assumption." 
}, { "sent_id": "2504d707a8123774791d98b755551a-C001-22", "text": "Recent approaches attempting to capture nonlocal dependencies model the non-local dependencies directly, and use approximate inference algorithms, since exact inference is in general, not tractable for graphs with non-local structure." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-23", "text": "Bunescu and Mooney (2004) define a Relational Markov Network (RMN) which explicitly models long-distance dependencies, and use it to represent relations between entities." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-24", "text": "Sutton and McCallum (2004) augment a sequential CRF with skip-edges i.e. edges between different occurrences of a token, in a document." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-25", "text": "Both these approaches use loopy belief propagation (Pearl, 1988; Yedidia et al., 2000) for approximate inference." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-26", "text": "Finkel et al. (2005) hand-set penalties for inconsistency in entity labeling at different occurrences in the text, based on some statistics from training data." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-27", "text": "They then employ Gibbs sampling (Geman and Geman, 1984) for dealing with their local feature weights and their non-local penalties to do approximate inference." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-28", "text": "We present a simple two-stage approach where our second CRF uses features derived from the output of the first CRF." 
}, { "sent_id": "2504d707a8123774791d98b755551a-C001-68", "text": "----------------------------------" }, { "sent_id": "2504d707a8123774791d98b755551a-C001-29", "text": "This gives us the advantage of defining a rich set of features to model non-local dependencies, and also eliminates the need to do approximate inference, since we do not explicitly capture the non-local dependencies in a single model, like the more complex existing approaches." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-30", "text": "This also enables us to do inference efficiently since our inference time is merely the inference time of two sequential CRF's; in contrast Finkel et al. (2005) reported an increase in running time by a factor of 30 over the sequential CRF, with their Gibbs sampling approximate inference." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-31", "text": "In all, our approach is simpler, yields higher F1 scores, and is also much more computationally efficient than existing approaches modeling nonlocal dependencies." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-32", "text": "----------------------------------" }, { "sent_id": "2504d707a8123774791d98b755551a-C001-33", "text": "**CONDITIONAL RANDOM FIELDS**" }, { "sent_id": "2504d707a8123774791d98b755551a-C001-34", "text": "We use a Conditional Random Field (Lafferty et al., 2001; Sha and Pereira, 2003) since it represents the state of the art in sequence modeling and has also been very effective at Named Entity Recognition." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-35", "text": "It allows us both discriminative training that CMMs offer as well and the bi-directional flow of probabilistic information across the sequence that HMMs allow, thereby giving us the best of both worlds." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-36", "text": "Due to the bi-directional flow of information, CRFs guard against the myopic locally attractive decisions that CMMs make." 
}, { "sent_id": "2504d707a8123774791d98b755551a-C001-37", "text": "It is customary to use the Viterbi algorithm, to find the most probably state sequence during inference." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-38", "text": "A large number of possibly redundant and correlated features can be supplied without fear of further reducing the accuracy of a high-dimensional distribution." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-39", "text": "These are welldocumented benefits (Lafferty et al., 2001 )." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-40", "text": "----------------------------------" }, { "sent_id": "2504d707a8123774791d98b755551a-C001-41", "text": "**OUR BASELINE CRF FOR NAMED ENTITY RECOGNITION**" }, { "sent_id": "2504d707a8123774791d98b755551a-C001-42", "text": "Our baseline CRF is a sequence model in which labels for tokens directly depend only on the labels corresponding to the previous and next tokens." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-43", "text": "We use features that have been shown to be effective in NER, namely the current, previous and next words, character n-grams of the current word, Part of Speech tag of the current word and surrounding words, the shallow parse chunk of the current word, shape of the current word, the surrounding word shape sequence, the presence of a word in a left window of size 5 around the current word and the presence of a word in a left window of size 5 around the current word." 
}, { "sent_id": "2504d707a8123774791d98b755551a-C001-44", "text": "This gives us a competitive baseline CRF using local information alone, whose performance is close to the best published local CRF models, for Named Entity Recognition" }, { "sent_id": "2504d707a8123774791d98b755551a-C001-45", "text": "----------------------------------" }, { "sent_id": "2504d707a8123774791d98b755551a-C001-46", "text": "**LABEL CONSISTENCY**" }, { "sent_id": "2504d707a8123774791d98b755551a-C001-47", "text": "The intuition for modeling label consistency is that within a particular document, different occur- rences of a particular token sequence (or similar token sequences) are unlikely to have different entity labels." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-48", "text": "While this constraint holds strongly at the level of a document, there exists additional value to be derived by enforcing this constraint less strongly across different documents." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-49", "text": "We want to model label consistency as a soft and not a hard constraint; while we want to encourage different occurrences of similar token sequences to get labeled as the same entity, we do not want to force this to always hold, since there do exist exceptions, as can be seen from the off-diagonal entries of tables 1 and 2." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-50", "text": "A named entity recognition system modeling this structure would encourage all the occurrences of the token sequence to the same entity type, thereby sharing evidence among them." 
}, { "sent_id": "2504d707a8123774791d98b755551a-C001-51", "text": "Thus, if the system has strong evidence about the label of a given token sequence, but is relatively unsure about the label to be assigned to another occurrence of a similar token sequence, the system can gain significantly by using the information about the label assigned to the former occurrence, to label the relatively ambiguous token sequence, leading to accuracy improvements." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-52", "text": "The strength of the label consistency constraint, can be seen from statistics extracted from the CoNLL 2003 English training data." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-53", "text": "Table 1 shows the counts of entity labels pairs assigned for each pair of identical token sequences both within a document and across the whole corpus." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-54", "text": "As we would expect, inconsistent labelings are relatively rare and most pairs of the same entity sequence are labeled the same(i.e." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-55", "text": "the diagonal has most of the density) at both the document and corpus levels." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-56", "text": "A notable exception to this is the labeling of the same text as both organization and location within the same document and across documents." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-57", "text": "This is a due to the large amount of sports news in the CoNLL dataset due to which city and country names are often also team names." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-58", "text": "We will see that our approach is capable of exploiting this as well, i.e. we can learn a model which would not penalize an Organization-Location inconsistency as strongly as it penalizes other inconsistencies." 
}, { "sent_id": "2504d707a8123774791d98b755551a-C001-59", "text": "In addition, we also want to model subsequence constraints: having seen Albert Einstein earlier in a document as a person is a good indicator that a subsequent occurrence of Einstein should also be labeled as a person." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-60", "text": "Here, we would expect that a subsequence would gain much more by knowing the label of a supersequence than the other way around." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-61", "text": "However, as can be seen from Table 2, we find that the consistency constraint does not hold nearly so strictly in this case." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-62", "text": "A very common case of this in the CoNLL dataset is that of documents containing references to both The China Daily, a newspaper, and China, the country (Finkel et al., 2005)." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-63", "text": "The first should be labeled as an organization, and the second as a location." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-64", "text": "The counts of subsequence labelings within a document and across documents listed in Table 2 show that there are many off-diagonal entries: the China Daily case is among the most common, occurring 328 times in the dataset." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-65", "text": "Just as we can model off-diagonal patterns with exact token sequence matches, we can also model off-diagonal patterns for the token subsequence case." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-66", "text": "In addition, we could also derive some value by enforcing some label consistency at the level of an individual token." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-67", "text": "Obviously, our model would learn much lower weights for these constraints, when compared to label consistency at the level of token sequences."
}, { "sent_id": "2504d707a8123774791d98b755551a-C001-70", "text": "To handle the non-local dependencies between same and similar token sequences, we define three sets of feature pairs where one member of the feature pair corresponds to a function of aggregate statistics of the output of the first CRF at the document level, and the other member corresponds to a function of aggregate statistics of the output of the first CRF over the whole test corpus." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-71", "text": "Thus, this gives us six additional feature types for the second round CRF, namely Document-level Token-majority features, Document-level Entity-majority features, Document-level Superentity-majority features, Corpus-level Token-majority features, Corpus-level Entity-majority features and Corpus-level Superentity-majority features." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-72", "text": "These feature types are described in detail below." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-73", "text": "All these features are a function of the output labels of the first CRF, where predictions on the test set are obtained by training on all the data, and predictions on the train data are obtained by 10-fold cross-validation (details in the next section)." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-74", "text": "Our features, fired based on document- and corpus-level statistics, are:" }, { "sent_id": "2504d707a8123774791d98b755551a-C001-75", "text": "\u2022 Token-majority features: These refer to the majority label assigned to the particular token in the document/corpus." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-76", "text": "E.g., suppose we have three occurrences of the token Australia, of which two are labeled Location and one is labeled Organization; our token-majority feature would take the value Location for all three occurrences of the token."
}, { "sent_id": "2504d707a8123774791d98b755551a-C001-77", "text": "This feature can enable us to capture some dependence between token sequences corresponding to a single entity and having common tokens." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-78", "text": "\u2022 Entity-majority features: These refer to the majority label assigned to the particular entity in the document/corpus." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-79", "text": "E.g., suppose we have three occurrences of the entity sequence (we define it as a token sequence labeled as a single entity by the first stage CRF) Bank of Australia, of which two are labeled Organization and one is labeled Location; our entity-majority feature would take the value Organization for all tokens in all three occurrences of the entity sequence." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-80", "text": "This feature enables us to capture the dependence between identical entity sequences." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-81", "text": "For a token labeled as not a named entity by the first CRF, this feature returns the majority label assigned to that token when it occurs as a single-token named entity." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-82", "text": "\u2022 Superentity-majority features: These refer to the majority label assigned to supersequences of the particular entity in the document/corpus." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-83", "text": "By entity supersequences, we refer to entity sequences that strictly contain within their span another entity sequence." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-84", "text": "For example, if we have two occurrences of Bank of Australia labeled Organization and one occurrence of Australia Cup labeled Miscellaneous, then for all occurrences of the entity Australia, the superentity-majority feature would take value Organization."
}, { "sent_id": "2504d707a8123774791d98b755551a-C001-85", "text": "This feature enables us to take into account labels assigned to supersequences of a particular entity, while labeling it." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-86", "text": "For a token labeled as not a named entity by the first CRF, this feature returns the majority label assigned to all entities containing the token within their span." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-87", "text": "The last feature enables entity sequences to benefit from labels assigned to entities which are entity supersequences of it." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-88", "text": "We attempted to add subentity-majority features, analogous to the superentity-majority features to model dependence on entity subsequences, but got no benefit from it." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-89", "text": "This is intuitive, since the basic sequence model would usually be much more certain about labels assigned to the entity supersequences, since they are longer and have more contextual information." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-90", "text": "As a result of this, while there would be several cases in which the basic sequence model would be uncertain about labels of entity subsequences but relatively certain about labels of token supersequences, the converse is very unlikely." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-91", "text": "Thus, it is difficult to profit from labels of entity subsequences while labeling entity sequences." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-92", "text": "We also attempted using more fine-grained features corresponding to the majority label of supersequences that take into account the position of the entity sequence in the entity supersequence (whether the entity sequence occurs at the start, middle or end of the supersequence), but could obtain no additional gains from this."
}, { "sent_id": "2504d707a8123774791d98b755551a-C001-93", "text": "It is to be noted that while deciding if token sequences are equal or hold a subsequence-supersequence relation, we ignore case, which clearly performs better than being sensitive to case." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-94", "text": "This is because our dataset contains several entities in all caps, such as AUSTRALIA, especially in news headlines." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-95", "text": "Ignoring case enables us to model dependencies with other occurrences with a different case, such as Australia." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-96", "text": "It may appear at first glance that our framework can only learn to encourage entities to switch to the most popular label assigned to other occurrences of the entity sequence and similar entity sequences." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-97", "text": "However, this framework is capable of learning interesting off-diagonal patterns as well." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-98", "text": "To understand this, let us consider the example of different occurrences of token sequences being labeled Location and Organization." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-99", "text": "Suppose, the majority label of the token sequence is Location." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-100", "text": "While this majority label would encourage the second CRF to switch the labels of all occurrences of the token sequence to Location, it would not strongly discourage the CRF from labeling these as Organization, since there would be several occurrences of token sequences in the training data labeled Organization, with the majority label of the token sequence being Location." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-101", "text": "However it would discourage the other labels strongly."
}, { "sent_id": "2504d707a8123774791d98b755551a-C001-102", "text": "The reasoning is analogous when the majority label is Organization." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-103", "text": "In the case of a tie (when computing the majority label), if the label assigned to a particular token sequence is one of the majority labels, we fire the feature corresponding to that particular label being the majority label, instead of breaking ties arbitrarily." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-104", "text": "This is done to encourage the second stage CRF to make its decision based on local information, in the absence of compelling non-local information to choose a different label." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-105", "text": "----------------------------------" }, { "sent_id": "2504d707a8123774791d98b755551a-C001-106", "text": "**ADVANTAGES OF OUR APPROACH**" }, { "sent_id": "2504d707a8123774791d98b755551a-C001-107", "text": "With our two-stage approach, we manage to get improvements on the F1 measure over existing approaches that model non-local dependencies." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-108", "text": "At the same time, the simplicity of our two-stage approach keeps inference time down to just the inference time of two sequential CRFs, when compared to approaches such as those of Finkel et al. (2005), who report that their inference time with Gibbs sampling goes up by a factor of about 30 compared to the Viterbi algorithm for the sequential CRF." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-109", "text": "Below, we give some intuition about areas for improvement in existing work and explain how our approach incorporates the improvements."
}, { "sent_id": "2504d707a8123774791d98b755551a-C001-110", "text": "\u2022 Most existing work to capture label consistency has attempted to create all n^2 pairwise dependencies between the different occurrences of an entity (Finkel et al., 2005; Sutton and McCallum, 2004), where n is the number of occurrences of the given entity." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-111", "text": "This complicates the dependency graph making inference harder." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-112", "text": "It also causes the penalty for deviation in labeling to grow linearly with n, since each entity would be connected to \u0398(n) entities." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-113", "text": "When an entity occurs several times, these models would force all occurrences to take the same value." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-114", "text": "This is not what we want, since there exist several instances in real-life data where different entities like persons and organizations share the same name." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-115", "text": "Thus, our approach makes a certain entity's label depend on certain aggregate information of other labels assigned to the same entity, and does not enforce pairwise dependencies." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-116", "text": "\u2022 We also exploit the fact that the predictions of a learner that takes non-local dependencies into account would have a good amount of overlap with a sequential CRF, since the sequence model is already quite competitive." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-117", "text": "We use this intuition to approximate the aggregate information about labels assigned to other occurrences of the entity by the non-local model, with the aggregate information about labels assigned to other occurrences of the entity by the sequence model."
}, { "sent_id": "2504d707a8123774791d98b755551a-C001-118", "text": "This intuition enables us to learn weights for non-local dependencies in two stages; we first get predictions from a regular sequential CRF and in turn use aggregate information about predictions made by the CRF as extra features to train a second CRF." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-119", "text": "\u2022 Most work has looked to model non-local dependencies only within a document (Finkel et al., 2005; Chieu and Ng, 2002; Sutton and McCallum, 2004; Bunescu and Mooney, 2004)." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-120", "text": "Our model can capture the weaker but still important consistency constraints across the whole document collection, whereas previous work has not, for reasons of tractability." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-121", "text": "Capturing label-consistency at the level of the whole test corpus is particularly helpful for token sequences that appear only once in their documents, but occur a few times over the corpus, since they do not have strong non-local information from within the document." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-122", "text": "\u2022 For training our second-stage CRF, we need to get predictions on our train data as well as test data." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-123", "text": "If we were to use the same train data to train the first CRF, we would get unrealistically good predictions on our train data, which would not be reflective of its performance on the test data." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-124", "text": "One option is to partition the train data." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-125", "text": "This however, can lead to a drop in performance, since the second CRF would be trained on less data."
}, { "sent_id": "2504d707a8123774791d98b755551a-C001-126", "text": "To overcome this problem, we make predictions on our train data by doing a 10-fold cross-validation on the train data." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-127", "text": "For predictions on the test data, we use all the training data to train the CRF." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-128", "text": "Intuitively, we would expect that the quality of predictions with 90% of the train data would be similar to the quality of predictions with all the training data." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-129", "text": "It turns out that this is indeed the case, as can be seen from our improved performance." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-130", "text": "----------------------------------" }, { "sent_id": "2504d707a8123774791d98b755551a-C001-131", "text": "**EXPERIMENTS**" }, { "sent_id": "2504d707a8123774791d98b755551a-C001-132", "text": "----------------------------------" }, { "sent_id": "2504d707a8123774791d98b755551a-C001-133", "text": "**DATASET AND EVALUATION**" }, { "sent_id": "2504d707a8123774791d98b755551a-C001-134", "text": "We test the effectiveness of our technique on the CoNLL 2003 English named entity recognition dataset downloadable from http://cnts.uia.ac.be/conll2003/ner/. The data comprises Reuters newswire articles annotated with four entity types: person (PER), location (LOC), organization (ORG), and miscellaneous (MISC)." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-135", "text": "The data is separated into a training set, a development set (testa), and a test set (testb)." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-136", "text": "The training set contains 945 documents and approximately 203,000 tokens, and the test set has 231 documents and approximately 46,000 tokens."
}, { "sent_id": "2504d707a8123774791d98b755551a-C001-137", "text": "Performance on this task is evaluated by measuring the precision and recall of annotated entities (and not tokens), combined into an F1 score." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-138", "text": "There is no partial credit for labeling part of an entity sequence correctly; an incorrect entity boundary is penalized as both a false positive and as a false negative." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-139", "text": "----------------------------------" }, { "sent_id": "2504d707a8123774791d98b755551a-C001-140", "text": "**RESULTS AND DISCUSSION**" }, { "sent_id": "2504d707a8123774791d98b755551a-C001-141", "text": "It can be seen from Table 3 that we achieve a 12.6% relative error reduction by restricting ourselves to features approximating non-local dependency within a document, which is higher than other approaches modeling non-local dependencies within a document." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-142", "text": "Additionally, by incorporating non-local dependencies across documents in the test corpus, we manage a 13.3% relative error reduction, over an already competitive baseline." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-143", "text": "We can see that all three features approximating non-local dependencies within a document yield reasonable gains." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-144", "text": "As we would expect, the additional gains from features approximating non-local dependencies across the whole test corpus are relatively small." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-145", "text": "We use the approximate randomization test (Yeh, 2000) to test the statistical significance of the difference between the basic sequential CRF and our second round CRF, which has additional features derived from the output of the first CRF."
}, { "sent_id": "2504d707a8123774791d98b755551a-C001-146", "text": "With 1,000 iterations, our improvements were statistically significant with a p-value of 0.001." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-147", "text": "Since this value is less than the cutoff threshold of 0.05, we reject the null hypothesis." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-148", "text": "The simplicity of our approach makes it easy to incorporate dependencies across the whole corpus, which would be much harder to incorporate in approaches like those of Bunescu and Mooney (2004) and Finkel et al. (2005)." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-149", "text": "Additionally, our approach makes it possible to do inference in just about twice the inference time of a single sequential CRF; in contrast, approaches like Gibbs sampling that model the dependencies directly can increase inference time by a factor of 30 (Finkel et al., 2005)." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-150", "text": "An analysis of errors by the first stage CRF revealed that most errors are those of single-token entities being mislabeled or missed altogether, followed by a much smaller percentage of multiple-token entities mislabeled completely." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-151", "text": "All our features directly encode information that is useful for reducing these errors." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-152", "text": "The widely prevalent boundary detection error is that of missing a single-token entity (i.e., labeling it as Other (O)). Table 3: Improvements obtained with our additional features over the baseline CRF." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-153", "text": "We also compare our performance against Bunescu and Mooney (2004) and Finkel et al. (2005) and find that we manage a higher relative improvement than existing work, despite starting from a very competitive baseline CRF."
}, { "sent_id": "2504d707a8123774791d98b755551a-C001-154", "text": "named entities." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-155", "text": "Other kinds of boundary detection errors involving multiple tokens are very rare." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-156", "text": "Our approach can also handle these errors by encouraging certain tokens to take different labels." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-157", "text": "This, together with the clique features encoding the Markovian dependency among neighbours, can correct some multiple-token boundary detection errors." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-158", "text": "----------------------------------" }, { "sent_id": "2504d707a8123774791d98b755551a-C001-159", "text": "**RELATED WORK**" }, { "sent_id": "2504d707a8123774791d98b755551a-C001-160", "text": "Recent work that directly models non-local dependencies and does approximate inference includes that of Bunescu and Mooney (2004), who use a Relational Markov Network (RMN) (Taskar et al., 2002) to explicitly model long-distance dependencies; Sutton and McCallum (2004), who introduce skip-chain CRFs, which add additional non-local edges to the underlying CRF sequence model (which Bunescu and Mooney (2004) lack); and Finkel et al. (2005), who hand-set penalties for inconsistency in labels based on the training data and then use Gibbs sampling for doing approximate inference, where the goal is to obtain the label sequence that maximizes the product of the CRF objective function and their penalty." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-161", "text": "Unfortunately, in the RMN model, the dependencies must be defined in the model structure before doing any inference, and so the authors use heuristic part-of-speech patterns, and then add dependencies between these text spans using clique templates."
}, { "sent_id": "2504d707a8123774791d98b755551a-C001-162", "text": "This generates an extremely large number of overlapping candidate entities, which makes additional templates necessary to enforce the constraint that text subsequences cannot both be different entities, something that is more naturally modeled by a CRF." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-163", "text": "Another disadvantage of this approach is that it uses loopy belief propagation and a voted perceptron for approximate learning and inference, which are inherently unstable algorithms leading to convergence problems, as noted by the authors." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-164", "text": "In the skip-chain CRFs model, the decision of which nodes to connect is also made heuristically, and because the authors focus on named entity recognition, they chose to connect all pairs of identical capitalized words." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-165", "text": "They also utilize loopy belief propagation for approximate learning and inference." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-166", "text": "It is hard to directly extend their approach to model dependencies richer than those at the token level." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-167", "text": "The approach of Finkel et al. (2005) makes it possible to model a broader class of long-distance dependencies than Sutton and McCallum (2004), because they do not need to make any initial assumptions about which nodes should be connected, and they too model dependencies between whole token sequences representing entities and between entity token sequences and their token supersequences that are entities." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-168", "text": "The disadvantage of their approach is the relatively ad-hoc selection of penalties and the high computational cost of running Gibbs sampling."
}, { "sent_id": "2504d707a8123774791d98b755551a-C001-169", "text": "Early work in discriminative NER employed two stage approaches that are broadly similar to ours, but the effectiveness of this approach appears to have been overlooked in more recent work." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-170", "text": "Mikheev et al. (1999) exploit label consistency information within a document using relatively ad hoc multi-stage labeling procedures." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-171", "text": "Borthwick (1999) used a two-stage approach similar to ours with CMMs, where reference resolution features, which encoded the frequency of occurrences of other entities similar to the current token sequence, were derived from the output of the first stage." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-172", "text": "Malouf (2002) and Curran and Clark (2003) condition the label of a token at a particular position on the label of the most recent previous instance of that same token in a previous sentence of the same document." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-173", "text": "This violates the Markov property and therefore instead of finding the maximum likelihood sequence over the entire document (exact inference), they label one sentence at a time, which allows them to condition on the maximum likelihood sequence of previous sentences." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-174", "text": "While this approach is quite effective for enforcing label consistency in many NLP tasks, it permits a forward flow of information only, which can result in loss of valuable information." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-175", "text": "Chieu and Ng (2002) propose a solution to this problem: for each token, they define additional features based on known information, taken from other occurrences of the same token in the document."
}, { "sent_id": "2504d707a8123774791d98b755551a-C001-176", "text": "This approach has the advantage of allowing the training procedure to automatically learn good weights for these \"global\" features relative to the local ones." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-177", "text": "However, it is hard to extend this to incorporate other types of non-local structure." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-178", "text": "----------------------------------" }, { "sent_id": "2504d707a8123774791d98b755551a-C001-179", "text": "**CONCLUSION**" }, { "sent_id": "2504d707a8123774791d98b755551a-C001-180", "text": "We presented a two-stage approach to model non-local dependencies and saw that it outperformed existing approaches to modeling non-local dependencies." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-181", "text": "Our approach also made it easy to exploit various dependencies across documents in the test corpus, whereas incorporating this information in most existing approaches would make them intractable due to the complexity of the resultant graphical model." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-182", "text": "Our simple approach is also very computationally efficient, since the inference time is just twice that of the basic sequential CRF, while for approaches doing approximate inference, the inference time is often well over an order of magnitude greater than that of the basic sequential CRF." }, { "sent_id": "2504d707a8123774791d98b755551a-C001-183", "text": "The simplicity of our approach makes it easy to understand, implement, and adapt to new applications."
} ], "y": { "@DIF@": { "gold_contexts": [ [ "2504d707a8123774791d98b755551a-C001-30" ], [ "2504d707a8123774791d98b755551a-C001-108" ], [ "2504d707a8123774791d98b755551a-C001-148" ], [ "2504d707a8123774791d98b755551a-C001-149" ], [ "2504d707a8123774791d98b755551a-C001-153" ] ], "cite_sentences": [ "2504d707a8123774791d98b755551a-C001-30", "2504d707a8123774791d98b755551a-C001-108", "2504d707a8123774791d98b755551a-C001-148", "2504d707a8123774791d98b755551a-C001-149", "2504d707a8123774791d98b755551a-C001-153" ] }, "@BACK@": { "gold_contexts": [ [ "2504d707a8123774791d98b755551a-C001-62" ], [ "2504d707a8123774791d98b755551a-C001-110" ], [ "2504d707a8123774791d98b755551a-C001-119" ], [ "2504d707a8123774791d98b755551a-C001-160" ], [ "2504d707a8123774791d98b755551a-C001-167" ] ], "cite_sentences": [ "2504d707a8123774791d98b755551a-C001-62", "2504d707a8123774791d98b755551a-C001-110", "2504d707a8123774791d98b755551a-C001-119", "2504d707a8123774791d98b755551a-C001-160", "2504d707a8123774791d98b755551a-C001-167" ] }, "@SIM@": { "gold_contexts": [ [ "2504d707a8123774791d98b755551a-C001-61", "2504d707a8123774791d98b755551a-C001-62" ] ], "cite_sentences": [ "2504d707a8123774791d98b755551a-C001-62" ] }, "@MOT@": { "gold_contexts": [ [ "2504d707a8123774791d98b755551a-C001-109", "2504d707a8123774791d98b755551a-C001-110", "2504d707a8123774791d98b755551a-C001-111" ] ], "cite_sentences": [ "2504d707a8123774791d98b755551a-C001-110" ] } } }, "ABC_a3ad95d75b7750b8a879fa183e30f6_13": { "x": [ { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-2", "text": "The current trend in automatic speech recognition is to leverage large amounts of labeled data to train supervised neural network models." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-3", "text": "Unfortunately, obtaining data for a wide range of domains to train robust models can be costly." 
}, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-4", "text": "However, it is relatively inexpensive to collect large amounts of unlabeled data from domains that we want the models to generalize to." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-5", "text": "In this paper, we propose a novel unsupervised adaptation method that learns to synthesize labeled data for the target domain from unlabeled in-domain data and labeled out-of-domain data." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-6", "text": "We first learn without supervision an interpretable latent representation of speech that encodes linguistic and nuisance factors (e.g., speaker and channel) using different latent variables." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-7", "text": "To transform a labeled out-of-domain utterance without altering its transcript, we transform the latent nuisance variables while maintaining the linguistic variables." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-8", "text": "To demonstrate our approach, we focus on a channel mismatch setting, where the domain of interest is distant conversational speech, and labels are only available for close-talking speech." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-9", "text": "Our proposed method is evaluated on the AMI dataset, outperforming all baselines and bridging the gap between unadapted and in-domain models by over 77% without using any parallel data." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-10", "text": "----------------------------------" }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-11", "text": "**INTRODUCTION**" }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-12", "text": "Distant speech recognition has greatly improved due to the recent advances in neural network-based acoustic models, which facilitate integration of automatic speech recognition (ASR) systems into hands-free human-machine interaction scenarios [1] ."
}, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-13", "text": "To build a robust acoustic model, previous work primarily focused on collecting labeled in-domain data for fully supervised training [2, 3, 4] ." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-14", "text": "However, in practice, it is expensive and laborious to collect labeled data for all possible testing conditions." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-15", "text": "In contrast, collecting large amount of unlabeled indomain data and labeled out-of-domain data can be fast and economical." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-16", "text": "Hence, an important question arises for this scenario: how can we do unsupervised adaptation for acoustic models by utilizing labeled out-of-domain data and unlabeled in-domain data, in order to achieve good performance on in-domain data?" }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-17", "text": "Research on unsupervised adaptation for acoustic models can be roughly divided into three categories: (1) constrained model adaptation [5, 6, 7] , (2) domain-invariant feature extraction [8, 9, 10] , and (3) labeled in-domain data augmentation by synthesis [11, 12, 13] ." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-18", "text": "Among these approaches, data augmentation-based adaptation is favorable, because it does not require extra hyperparameter tuning for acoustic model training, and can utilize full model capacity by training a model with as much and as diverse a dataset as possible." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-19", "text": "Another benefit of this approach is that data in their original domain are more intuitive to humans." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-20", "text": "In other words, it is easier for us to inspect and manipulate the data." 
}, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-21", "text": "Furthermore, with the recent progress on domain translation [13, 14, 15] , conditional synthesis of indomain data without parallel data has become achievable, which makes data augmentation-based adaptation a more promising direction to investigate." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-22", "text": "Variational autoencoder-based data augmentation (VAE-DA) is a domain adaptation method proposed in [13] , which pools in-domain and out-domain to train a VAE that learns factorized latent representations of speech segments." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-23", "text": "To disentangle linguistic factors from nuisance ones in the latent space, statistics of the latent representations for each utterance are computed." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-24", "text": "By altering the latent representations of the segments from a labeled out-of-domain utterance properly according to the computed statistics, one can synthesize an in-domain utterance without changing the linguistic content using the trained VAE decoder." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-25", "text": "This approach shows promising results on synthesizing noisy read speech from clean speech." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-26", "text": "However, it is non-trivial to apply this approach to conversational speech, because utterances tend to be shorter, which makes estimating the statistics of a disentangled representation difficult." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-27", "text": "In this paper, we extend VAE-DA and address the issue by learning interpretable and disentangled representations using a variant of VAEs that is designed for sequential data, named factorized hierarchical variational autoencoders (FHVAEs) [15] ." 
}, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-28", "text": "Instead of estimating the latent representation statistics on short utterances, we use a loss that considers the statistics across utterances in the entire corpus." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-29", "text": "Therefore, we can safely alter the latent part that models non-linguistic factors in order to synthesize in-domain data from out-of-domain data." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-30", "text": "Our proposed methods are evaluated on the AMI [16] dataset, which contains close-talking and distant-talking recordings in a conference room meeting scenario." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-31", "text": "We treat close-talking data as out-of-domain data and distant-talking data as in-domain data." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-32", "text": "In addition to outperforming all baseline methods, our proposed methods successfully close the gap between an unadapted model and a fully-supervised model by more than 77% in terms of word error rate without the presence of any parallel data." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-33", "text": "----------------------------------" }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-34", "text": "**LIMITATIONS OF PREVIOUS WORK**" }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-35", "text": "In this section, we briefly review VAE-based data augmentation and its limitations." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-36", "text": "----------------------------------" }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-37", "text": "**VAE-BASED DATA AUGMENTATION**" }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-38", "text": "Generation of speech data often involves many independent factors, such as linguistic content, speaker identity, and room acoustics, that are often unobserved, or only partially observed." 
}, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-39", "text": "One can describe such a generative process using a latent variable model, where a vector z \u2208 Z describing generating factors is first drawn from a prior distribution, and a speech segment x \u2208 X is then drawn from a distribution conditioned on z. VAEs [17, 18] are among the most successful latent variable models, which parameterize a conditional distribution, p(x|z), with a decoder neural network, and introduce an encoder neural network, q(z|x), to approximate the true posterior, p(z|x)." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-40", "text": "In [19] , a VAE is proposed to model a generative process of speech segments." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-41", "text": "A latent vector in the latent space is assumed to be a linear combination of orthogonal vectors corresponding to the independent factors, such as phonetic content and speaker identity." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-42", "text": "In other words, we assume that z = z + zn where z encodes the linguistic/phonetic content and zn encodes the nuisance factors, and z \u22a5 zn." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-43", "text": "To augment the data set while reusing the labels, for any pair of utterance and its corresponding label sequence (X, y) in the data set, we generate (X, y) by altering the nuisance part of X in the latent space." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-44", "text": "----------------------------------" }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-45", "text": "**ESTIMATING LATENT NUISANCE VECTORS**" }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-46", "text": "A key observation made in [13] is that nuisance factors, such as speaker identity and room acoustics, are generally constant over segments within an utterance, while linguistic content changes from segment to segment." 
}, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-47", "text": "In other words, latent nuisance vectors zn are relatively consistent within an utterance, while the distribution of z conditioned on an utterance can be assumed to have the same distribution as the prior." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-48", "text": "Therefore, suppose the prior is a diagonal Gaussian with zero mean." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-49", "text": "Given an utterance" }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-50", "text": "of N segments, we have:" }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-51", "text": "That is to say, the latent nuisance vector would stand out, and the rest would cancel out, when we take the average of latent vectors over segments within an utterance." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-52", "text": "This approach shows great success in transforming clean read speech into noisy read speech." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-53", "text": "However, in a conversational scenario, the portion of short utterances are much larger than that in a reading scenario." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-54", "text": "For instance, in the Wall Street Journal corpus [20] , a read speech corpus, the average duration on the training set is 7.6s (\u00b12.9s), with no utterance shorter than 1s." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-55", "text": "On the other hand, in the AMI corpus [16] , the distant conversational speech meeting corpus, the average duration on the training set is 2.6s (\u00b12.7s), with over 35% of the utterances being shorter than 1s." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-56", "text": "The small number of segments in a conversational scenario can lead to unreliable estimation of latent nuisance vectors, because the sampled mean of latent linguistic vectors would exhibit large variance from the population mean." 
}, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-57", "text": "The estimation under such a condition can contain information about not only nuisance factors, but also linguistic factors." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-58", "text": "Indeed, we illustrate in Figure 1 that modifying the estimated latent nuisance vector of a short utterance can result in undesirable changes to its linguistic content." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-59", "text": "----------------------------------" }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-60", "text": "**METHODS**" }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-61", "text": "In this section, we describe the formulation of FHVAEs and explain how it can overcome the limitations of vanilla VAEs." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-62", "text": "Comparison between VAE-DA and proposed FHVAE-DA applied to a short utterance using nuisance factor replacement." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-63", "text": "The same source and two target utterances are used for both methods." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-64", "text": "Our proposed FHVAE-DA can successfully transform only nuisance factors, while VAE-DA cannot." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-65", "text": "----------------------------------" }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-66", "text": "**LEARNING INTERPRETABLE DISENTANGLED REPRESENTATIONS**" }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-67", "text": "To avoid estimating nuisance vectors on short segments, one can leverage the statistics at the corpus level, instead of at the utterance level, to disentangle generating factors." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-68", "text": "An FHVAE [15] is a variant of VAEs that models a generative process of sequential data with a hierarchical graphical model." 
}, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-69", "text": "Specifically, an FHVAE imposes sequence-independent priors and sequencedependent priors to two sets of latent variables, z1 and z2, respectively." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-70", "text": "We now formulate the process of generating a se-" }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-71", "text": "are drawn from a sequence-dependent prior" }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-72", "text": "where f\u00b5 x (\u00b7, \u00b7) and f \u03c3 2 x (\u00b7, \u00b7) are parameterized by a decoder neural network." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-73", "text": "The joint probability for a sequence is formulated as follows:" }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-74", "text": "With such a formulation, z2 is encouraged to capture generating factors that are relatively consistent within a sequence, and z1 will then capture the residual generating factors." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-75", "text": "Therefore, when we apply an FHVAE to model speech sequence generation, it is clear that z2 will capture the nuisance generating factors that are in general consistent within an utterance." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-76", "text": "Since the exact posterior inference is intractable, FHVAEs introduce an inference model q(Z1, Z2, \u00b52|X) to approximate the true posterior, which is factorized as follows:" }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-77", "text": "where q(\u00b52), q(z1|x, z2), and q(z2|x) are all diagonal Gaussian distributions." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-78", "text": "Two encoder networks are introduced in FHVAEs to parameterize mean and variance values of q(z1|x, z2) and q(z2|x) respectively." 
}, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-79", "text": "As for q(\u00b52), for testing utterances we parameterize its mean with an approximated maximum a posterior (MAP) estimation" }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-80", "text": "is the inferred posterior mean of q(z (n) 2 |x (n) ); during training, we initialize a lookup table of posterior mean of \u00b52 for each training utterance with the approximated MAP estimation, and treat the lookup table as trainable parameters." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-81", "text": "This can avoid computing the MAP estimation of each segment for each mini-batch, and utilize the discriminative loss proposed in [15] to encourage disentanglement." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-82", "text": "----------------------------------" }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-83", "text": "**FHVAE-BASED DATA AUGMENTATION**" }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-84", "text": "With a trained FHVAE, we are able to infer disentangled latent representations that capture linguistic factors z1 and nuisance factors z2." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-85", "text": "To transform nuisance factors of an utterance X without changing the corresponding transcript, one only needs to perturb Z2." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-86", "text": "Furthermore, since each z2 within an utterance is generated conditioned on a Gaussian whose mean is \u00b52, we can regard \u00b52 as the representation of nuisance factors of an utterance." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-87", "text": "We now derive two data augmentation methods similar to those proposed in [13] , named nuisance factor replacement and nuisance factor perturbation." 
}, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-88", "text": "----------------------------------" }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-89", "text": "**NUISANCE FACTOR REPLACEMENT**" }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-90", "text": "Given a labeled out-of-domain utterance (Xout, yout) and an unlabeled in-domain utterance Xin, we want to transform Xout toXout such that it exhibits the same nuisance factors as Xin, while maintaining the original linguistic content." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-91", "text": "We can then add the synthesized labeled in-domain data (Xout, yout) to the ASR training set." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-92", "text": "From the generative modeling perspective, this implies that z2 of Xin andXout are drawn from the same distribution." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-93", "text": "We carry out the same modification for the latent sequence variable of each segment of Xout as follows: z2,out = z2,out \u2212 \u00b52,out + \u00b52,in, where \u00b52,out and \u00b52,in are the approximate MAP estimations of \u00b52." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-94", "text": "----------------------------------" }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-95", "text": "**NUISANCE FACTOR PERTURBATION**" }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-96", "text": "Alternatively, we are also interested in synthesizing an utterance conditioned on unseen nuisance factors, for example, the interpolation of nuisance factors between two utterances." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-97", "text": "We propose to draw a random perturbation vector p and comput\u00ea z2,out = z2,out + p for each segment in an utterance, in order to synthesize an utterance with perturbed nuisance factors." 
}, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-98", "text": "Naively, we may want to sample p from a centered isotropic Gaussian." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-99", "text": "However, in practice, VAE-type of models suffer from an over-pruning issue [21] in that some latent variables become inactive, which we do not want to perturb." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-100", "text": "Instead, we only want to perturb the linear subspace which models the variation of nuisance factors between utterances." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-101", "text": "Therefore, we adopt a similar soft perturbation scheme as in [13] ." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-102", "text": "First, {\u00b52} M i=1 for all M utterances are estimated with the approximated MAP." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-103", "text": "Principle component analysis is performed to obtain D pairs of eigenvalue \u03c3 d and eigenvectors e d , where D is the dimension of \u00b52." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-104", "text": "Lastly, one random perturbation vector p is drawn for each utterance to perturb as follows:" }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-105", "text": "where \u03b3 is used to control the perturbation scale." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-106", "text": "----------------------------------" }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-107", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-108", "text": "We evaluate our proposed method on the AMI meeting corpus [16] ." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-109", "text": "The AMI corpus consists of 100 hours of meeting recordings in English, recorded in three different meeting rooms with different acoustic properties, and with three to five participants for each meeting that are mostly non-native speakers." 
}, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-110", "text": "Multiple microphones are used for recording, including individual headset microphones (IHM), and far-field microphone arrays." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-111", "text": "In this paper, we regard IHM recordings as out-of-domain data, whose transcripts are available, and single distant microphone (SDM) recordings as in-domain data, whose transcripts are not available, but on which we will evaluate our model." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-112", "text": "The recommended partition of the corpus is used, which contains an 80 hours training set, and 9 hours for a development and a test set respectively." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-113", "text": "FHVAE and VAE models are trained using both IHM and SDM training sets, which do not require transcripts." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-114", "text": "ASR acoustic models are trained using augmented data and transcripts based on only the IHM training set." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-115", "text": "The performance of all ASR systems are evaluated on the SDM development set." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-116", "text": "The NIST asclite tool [22] is used for scoring." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-117", "text": "----------------------------------" }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-118", "text": "**VAE AND FHVAE CONFIGURATIONS**" }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-119", "text": "Speech segments of 20 frames, represented with 80 dimensional log Mel filterbank coefficients (FBank) are used as inputs." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-120", "text": "We configure VAE and FHVAE models such that they have comparable modeling capacity." 
}, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-121", "text": "The VAE latent variable dimension is 64, whereas the dimensions of z1 and z2 in FHVAEs are both 32." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-122", "text": "Both models have a two-layer LSTM decoder with 256 memory cells that predicts one frame of x at a time." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-123", "text": "Since a FHVAE model has two encoders, while a VAE model only has one, we use a two-layer LSTM encoder with 256 memory cells for the former, and with 512 memory cells for the latter." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-124", "text": "All the LSTM encoders take one frame as input at each step, and the output from the last step is passed to an affine transformation layer that predicts the mean and the log variance of latent variables." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-125", "text": "The VAE model is trained to maximize the variational lower bound, and the FHVAE model is trained to maximize the discriminative segment variational lower bound proposed in [15] with a discriminative weight \u03b1 = 10." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-126", "text": "In addition, the original FHVAE training [15] is not scalable to hundreds of thousands of utterances; we therefore use the hierarchical sampling-based training algorithm proposed in [23] with batches of 5,000 utterances." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-127", "text": "Adam [24] with \u03b21 = 0.95 and \u03b22 = 0.999 is used to optimize all models." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-128", "text": "Tensorflow [25] is used for implementation." 
}, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-129", "text": "----------------------------------" }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-130", "text": "**ASR CONFIGURATION**" }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-131", "text": "Kaldi [26] is used for feature extraction, forced alignment, decoding, and training of initial HMM-GMM models on the IHM training set." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-132", "text": "The Microsoft Cognitive Toolkit [27] is used for neural network acoustic model training." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-133", "text": "For all experiments, the same 3-layer LSTM acoustic model [28] with the architecture proposed in [2] is adopted, which has 1024 memory cells and a 512-node linear projection layer for each LSTM layer." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-134", "text": "Following the setup in [29] , LSTM acoustic models are trained with cross entropy loss, truncated back-propagation through time [30] , and mini-batches of 40 parallel utterances and 20 frames." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-135", "text": "A momentum of 0.9 is used starting from the second epoch [2] ." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-136", "text": "Ten percent of training data is held out for validation, and the learning rate is halved if no improvement is observed on the validation set after an epoch." 
}, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-137", "text": "[10] 64.8 (-6.0) 29.0 (+2.0) IHM, VAE-DA, (repl) [13] 62.2 (-8.0) 31.8 (+4.8) IHM, VAE-DA, (p, \u03b3 = 1.0) [13] 61.1 (-9.7) 30.0 (+3.0) IHM, VAE-DA, (p, \u03b3 = 1.5) [13] 61.9 (-8.9) 31.4 (+4.4)" }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-138", "text": "----------------------------------" }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-139", "text": "**RESULTS AND DISCUSSION**" }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-140", "text": "We first establish baseline results and report the SDM (indomain) and IHM (out-of-domain) development set word error rates (WERs) in Table 1 ." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-141", "text": "To avoid constantly querying the test set results, we only report WERs on the development set." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-142", "text": "If not otherwise mentioned, the data augmentation-based systems are evaluated on reconstructed features, and trained on a transformed IHM set, where each utterance is only transformed once, without the original copy of data." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-143", "text": "The first two rows of results show that the WER gap between the unadapted model and the model trained on in-domain data is 24%." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-144", "text": "The third row reports the results of training with domain invariant feature, z1, extracted with a FHVAE as is done in [10] ." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-145", "text": "It improves over the baseline by 6% absolute." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-146", "text": "VAE-DA [13] results with nuisance factor replacement (repl) and latent nuisance perturbation (p) are shown in the last three rows." 
}, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-147", "text": "We then examine the effectiveness of our proposed method and show the results in the second, third, and fourth rows in Table 2 ." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-148", "text": "We observe about 12% WER reduction on the in-domain development set for both nuisance factor perturbation (p) and nuisance factor replacement (repl), with little degradation on the out-of-domain development set." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-149", "text": "Both augmentation methods outperform their VAE counterparts and the domain invariant feature baseline using the same FHVAE model." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-150", "text": "We attribute the improvement to the better quality of the transformed IHM data, which covers the nuisance factors of the SDM data, without altering the original linguistic content." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-151", "text": "To verify the superiority of the proposed method of drawing random perturbation vectors, we compare two alternative sampling methods: rev-p and uni-p, similar to [13] , with the same expected squared Euclidean norm as the proposed method." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-152", "text": "Table 2 confirm that the proposed sampling method is more effective under the same perturbation scale \u03b3 = 1.0 compared to the alternative methods as expected." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-153", "text": "Due to imperfect reconstruction using FHVAE models, some linguistic information may be lost in this process." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-154", "text": "Furthermore, since VAE models tend to have overly-smoothed outputs, one can easily tell an original utterance from a reconstructed one." 
}, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-155", "text": "In other words, there is another layer of domain mismatch between original data and reconstructed data." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-156", "text": "In Table 3 , we investigate the performance of models trained with different data on both original data and reconstructed data." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-157", "text": "The first row, a model trained on the reconstructed IHM data serves as the baseline, from which we observe a 3.0%/3.1% WER increase on SDM/IHM when tested on the reconstructed data, and a further 5.7%/2.0% WER increase when tested on the original data." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-158", "text": "Compared to the reconstruction baseline, the proposed perturbation and replacement method both show about 15% improvement on the reconstructed SDM data, and 8% on the original SDM data." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-159", "text": "Results on the reconstructed or original IHM data are comparable to the baseline." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-160", "text": "The performance difference between the original and reconstructed SDM shows that FHVAEs are able to transform the IHM acoustic features closer to the reconstructed SDM data." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-161", "text": "We then explore adding the original IHM training data to the two transformed sets (+ori." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-162", "text": "IHM)." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-163", "text": "This significantly improves the performance on the original data for both SDM and IHM data sets." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-164", "text": "We even see an improvement from 27.0% to 25.9% on the IHM development set compared to the model trained on original IHM data." 
}, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-165", "text": "Finally, to demonstrate that FHVAEs are not exploiting the parallel connection between the IHM and SDM data sets, we create two disjoint sets of recordings of roughly the same size, such that IHM-a and SDM-b only contain one set of recordings each." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-166", "text": "Results are shown in 4, where the FHVAE models is trained without any parallel utterances." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-167", "text": "In this setting, we observe an even more significant 24.1% absolute WER improvement from the baseline IHM-a model, which bridges the gap by over 77% to the fully supervised model." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-168", "text": "----------------------------------" }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-169", "text": "**CONCLUSIONS AND FUTURE WORK**" }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-170", "text": "In this paper, we marry the VAE-based data augmentation method with interpretable disentangled representations learned from FHVAE models for transforming data from one domain to another." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-171", "text": "The proposed method outperforms both baselines, and demonstrates the ability to reduce the gap between an unadapted model and a fully supervised model by over 77% without the presence of any parallel data." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-172", "text": "For future work, we plan to investigate the unsupervised data augmentation techniques for a wider range of tasks." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-173", "text": "In addition, data augmentation is inherently inefficient because the training time grows linearly in the amount of data we have." }, { "sent_id": "a3ad95d75b7750b8a879fa183e30f6-C001-174", "text": "We plan to explore model-space unsupervised adaptation to combat this limitation." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "a3ad95d75b7750b8a879fa183e30f6-C001-17" ], [ "a3ad95d75b7750b8a879fa183e30f6-C001-21" ], [ "a3ad95d75b7750b8a879fa183e30f6-C001-22" ], [ "a3ad95d75b7750b8a879fa183e30f6-C001-46" ] ], "cite_sentences": [ "a3ad95d75b7750b8a879fa183e30f6-C001-17", "a3ad95d75b7750b8a879fa183e30f6-C001-21", "a3ad95d75b7750b8a879fa183e30f6-C001-22", "a3ad95d75b7750b8a879fa183e30f6-C001-46" ] }, "@EXT@": { "gold_contexts": [ [ "a3ad95d75b7750b8a879fa183e30f6-C001-22", "a3ad95d75b7750b8a879fa183e30f6-C001-27" ] ], "cite_sentences": [ "a3ad95d75b7750b8a879fa183e30f6-C001-22" ] }, "@MOT@": { "gold_contexts": [ [ "a3ad95d75b7750b8a879fa183e30f6-C001-22", "a3ad95d75b7750b8a879fa183e30f6-C001-27" ] ], "cite_sentences": [ "a3ad95d75b7750b8a879fa183e30f6-C001-22" ] }, "@SIM@": { "gold_contexts": [ [ "a3ad95d75b7750b8a879fa183e30f6-C001-87" ], [ "a3ad95d75b7750b8a879fa183e30f6-C001-101" ], [ "a3ad95d75b7750b8a879fa183e30f6-C001-151" ] ], "cite_sentences": [ "a3ad95d75b7750b8a879fa183e30f6-C001-87", "a3ad95d75b7750b8a879fa183e30f6-C001-101", "a3ad95d75b7750b8a879fa183e30f6-C001-151" ] }, "@USE@": { "gold_contexts": [ [ "a3ad95d75b7750b8a879fa183e30f6-C001-146" ] ], "cite_sentences": [ "a3ad95d75b7750b8a879fa183e30f6-C001-146" ] } } }, "ABC_4bc12aca138835b5ed80b0cf69febf_13": { "x": [ { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-2", "text": "We introduce a simple yet effective method of integrating contextual embeddings with commonsense graph embeddings, dubbed BERT Infused Graphs: Matching Over Other em-beDdings." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-3", "text": "First, we introduce a preprocessing method to improve the speed of querying knowledge bases." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-4", "text": "Then, we develop a method of creating knowledge embeddings from each knowledge base." 
}, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-5", "text": "We introduce a method of aligning tokens between two misaligned tokenization methods." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-6", "text": "Finally, we contribute a method of contextualizing BERT after combining with knowledge base embeddings." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-7", "text": "We also show BERTs tendency to correct lower accuracy question types." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-8", "text": "Our model achieves a higher accuracy than BERT, and we score fifth on the official leaderboard of the shared task and score the highest without any additional language model pretraining." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-9", "text": "----------------------------------" }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-11", "text": "Recently, wide-scale pre-training and deep contextual representations have taken the world by storm." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-12", "text": "Peters et al. (2018) underscored the importance of bidirectional contextual representations by using traditional neural networks trained on a large corpus of text." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-13", "text": "Devlin et al. (2018) used transformers (Vaswani et al., 2017) and word masking to pre-train on another large corpus of data, reporting human-level performance on one commonsense dataset (Zellers et al., 2018) ." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-14", "text": "achieves state-of-the-art on RACE (Lai et al., 2017) with a Transformer-XL based model ." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-15", "text": "The key to success in the performance of many of these models is their ability to train on extremely large datasets." 
}, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-16", "text": "BERT (Devlin et al., 2018) , for example, trains on the BooksCorpus (Zhu et al., 2015) and English Wikipedia, for a combined 3,200M words." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-17", "text": "Other iterations increased the amount of knowledge used during pre-training, such as RoBERTa (Liu et al., 2019) ." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-18", "text": "Training large-scale models on these massive datasets has drawbacks, such as significantly increased carbon pollution and harm to the environment (Schwartz et al., 2019; Strubell et al., 2019) ." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-19", "text": "We present a methodology of combining queries from commonsense knowledge bases with contextual embeddings, BIG MOOD -BERT Infused Graphs: Matching Over Other embeDdings, and abbreviated for its relationship to human knowledge awareness." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-20", "text": "Our methodology achieves a increase without significant additional fine-tuning or pre-training." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-21", "text": "Instead, it learns a separate representation from commonsense graphical knowledge bases, and augments the BERT representation with this learned explicit representation." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-22", "text": "We introduce several methods of combining and querying knowledge base embeddings to introduce them to the BERT embedding layers." 
}, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-23", "text": "----------------------------------" }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-24", "text": "**RELATED WORK**" }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-25", "text": "----------------------------------" }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-26", "text": "**KNOWLEDGE GRAPHS**" }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-27", "text": "Significant research has been put into representing human knowledge in various ways (Lenat and Guha, 1989; Auer et al., 2007; Chambers and Jurafsky, 2008) ." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-28", "text": "ConceptNet (Speer and Havasi, 2013) contains various aspects of commonsense knowledge through a knowledge graph." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-29", "text": "The knowledge is collected from crowedsourced resources (Meyer and Gurevych, 2012; Havasi et al., 2010; von Ahn et al., 2006) and expert-created resources (Miller, 1992; Breen, 2004) ." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-30", "text": "WebChild (Tandon et al., 2017) is a collection of commonsense knowledge automatically extracted from web contents." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-31", "text": "The database is constructed similarly to ConceptNet, and intended to cover concepts that ConceptNet does not cover." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-32", "text": "ATOMIC (Sap et al., 2018) focuses on inferential Passage: I had decided that I wanted to visit my friend Paul whom lives quite a distance away." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-33", "text": "With this and my fear of air travel in mind I decided to take a train." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-34", "text": "After researching and finding one online I was well on my way to going to see my friend Paul." 
}, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-35", "text": "I drive to the station and decide that I am going to purchase a round trip ticket as this would be cheaper than just buying both tickets separately." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-36", "text": "Whenever my train arrives I have to get in line as they process our tickets." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-37", "text": "After all this is done I decide to take a seat by the window." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-38", "text": "I sit and fall asleep a bit as I ride on the train for hours." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-39", "text": "After a couple hours we finally reach the destination and I get off the train, excited to see my friend." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-40", "text": "When did they wait for their train? a) before buying the ticket b) after buying a ticket Table 1 : Example of a prompt from the shared task dataset, an everyday commonsense reasoning dataset." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-41", "text": "Questions often require script knowledge that extends beyond referencing the text." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-42", "text": "If \u2212 T hen relations, built for everyday commonsense reasoning." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-43", "text": "----------------------------------" }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-44", "text": "**KNOWLEDGE INTEGRATION**" }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-45", "text": "Knowledge graphs have been applied in various natural language processing applications, such as reading comprehension (Lin et al., 2017; Yang and Mitchell, 2017) and machine translation (Zhang et al., 2017) ." 
}, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-46", "text": "ERNIE: Enhanced Representation through Knowledge Integration (Sun et al., 2019) appends knowledge to the input of the model and learns via knowledge masking, as well as entitylevel masking and phrase-level masking." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-47", "text": "TriAN (Wang, 2018) , the top public model on the MC-Script (Ostermann et al., 2018) shared task, uses ConceptNet embeddings to highlight relationships between the question, text, and answer." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-48", "text": "----------------------------------" }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-49", "text": "**MODEL**" }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-50", "text": "We present our model for this shared task." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-51", "text": "Our model has three major components: language model adaptation, knowledge graph embeddings, and attention for classification." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-52", "text": "----------------------------------" }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-53", "text": "**DATA PREPROCESSING**" }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-54", "text": "Before model usage, we preprocess the data in two ways to make it easier for the model to un-derstand." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-80", "text": "For each (start, end, edge), we (Vaswani et al., 2017) ." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-55", "text": "For language modeling, we create training data similar to those in BERT (Devlin et al., 2018) ." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-56", "text": "For knowledge graph use, we preprocess language to create commonsense object and relationship vocabulary and to match as many related commonsense objects as possible." 
}, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-57", "text": "----------------------------------" }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-58", "text": "**LANGUAGE MODEL PREPROCESSING**" }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-59", "text": "We prepossess each passage for training." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-60", "text": "We use this process for each training epoch, since it allows for the most dense pretraining framework." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-61", "text": "Commonly known as a cloze task, Devlin et al. (2018) introduced a framework that pretrained transformers (Vaswani et al., 2017) based on masked token prediction." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-62", "text": "First, we preprocess the tokens with WordPiece embeddings (Wu et al., 2016) ." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-63", "text": "Then, we append special [CLS] and [SEP ] to each datum." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-64", "text": "We append [CLS] to the beginning of each datum, and [SEP ] to separate the question with the answer, as such:" }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-65", "text": "Then, we randomly mask 15% of all WordPiece embeddings." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-66", "text": "Unlike Devlin et al. (2018) , we run the randomization script once per each training epoch." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-67", "text": "Otherwise, we follow the procedure in Devlin et al. (2018) ." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-68", "text": "80% of the time, we replace the word with the [M ASK] prediction, to be replaced through cloze task prediction." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-69", "text": "10% of the time, we replace the word with a random word." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-70", "text": "10% of the time, we keep the word unchanged." 
}, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-71", "text": "Combined with the above cloze task, we process the data for next sentence prediction." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-72", "text": "We do this process after the cloze task masking, similar to Devlin et al. (2018) ." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-73", "text": "For each datum, we randomly pick either a sentence labeled correctly as the next sentence 50% of the time, or a random sentence 50% of the time." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-74", "text": "We ensure that the random sentence is not the next sentence." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-75", "text": "----------------------------------" }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-76", "text": "**KNOWLEDGE GRAPH PROCESSING**" }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-77", "text": "We preprocess the data in the shared task along with knowledge graph preprocessing." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-78", "text": "The purpose of this procedure is to reduce the number of items in the knowledge graph, to speed up fine-tuning since the knowledge graphs are extremely large, and also to ensure matching between as many different types of knowledge graph edges that are relevant as possible." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-79", "text": "First, we create an index of (start, end, edge) relationships that match vocabulary within the shared task prompt." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-81", "text": "Since the queries work on whole words only, one knowledge base embeddings may be integrated with one or more language embedding." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-82", "text": "Several self-attention encoding layers are used." 
}, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-83", "text": "Algorithm 1: Knowledge Graph Vocab Creation for prompt in dataset do for KG in knowledge graphs do for (start, end, edge) in KG do if start in prompt & end in prompt then add((start, end, edge)) index as relationship(edge) end if end for end for end for check to see if there are any matching prompts in which start is present in the text and end is present in the text." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-84", "text": "If so, we store the (start, end, edge), and note the edge as a relationship." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-85", "text": "We also index the relationship (edge), giving an index for each unique relationship." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-86", "text": "For longer sequences, we allow matches between any trigram, and store an index for each trigram matched." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-87", "text": "In addition, we stem words beforehand, to ensure that the different word endings do not effect the result of the matches." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-88", "text": "We use the Porter Stemmer (Porter, 1980) to stem each word in both the text and the knowledge graph." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-89", "text": "Note that we only use the stemming to match different words, and do not keep the stemmed words for later use in the process, as to keep comparability between embedding types." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-90", "text": "We also stem words in knowledge bases, to allow for comprasion." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-91", "text": "Algorithm 1 shows our process for matching sequences." 
}, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-92", "text": "----------------------------------" }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-93", "text": "**KNOWLEDGE GRAPH USAGE**" }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-94", "text": "We query each of three knowledge bases to create an embedding layer, for each word, for each knowledge graph." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-95", "text": "Here, we describe our procedure for querying each knowledge graph." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-96", "text": "We stem words beforehand, to allow for matches agnostic of linguistic postfixes (Merkhofer et al., 2018) ." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-97", "text": "----------------------------------" }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-98", "text": "**CONCEPTNET**" }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-99", "text": "ConceptNet (Speer and Havasi, 2013) represents everyday words and phrases, with edges between the commonsense relationships between them." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-100", "text": "We first preprocess ConceptNet, keeping only the vocabulary present in the shared task." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-101", "text": "Then, for each edge, we store a tuple (agent, dependent, relationship) that describes the commonsense relationship mentioned in the knowledge graph." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-102", "text": "During fine-tuning, we check the text for any present agent, dependent pairs." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-103", "text": "If any word in the text is an agent, and the dependent is present in the text, we add that relationship index as input into the embedding layer." 
}, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-104", "text": "(For agents that span more than one word, such as the phrase \"apple pie\", we apply the index to the first word, as long as the entire phrase is found in the text)." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-105", "text": "We randomly generate a length 10 embedding for each relationship, and if more than one relationship is matched, we randomly pick one." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-106", "text": "----------------------------------" }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-107", "text": "**WEBCHILD**" }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-108", "text": "WebChild (Tandon et al., 2017) is a large collection of commonsense knowledge collected from various sources on the web." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-109", "text": "The format is similar to ConceptNet, which allows us to follow a similar process." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-110", "text": "WordNet instances are split into categories part \u2212 whole, comparative, property, activity, and spatial." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-111", "text": "For each category, we capture the (agent, dependent, relationship) tuple, which is usually defined as properties such as x disambi , y disambi , and sub \u2212 relation, but is slightly different for each category." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-112", "text": "We ignore the WordNet (Miller, 1992) relation (some categories will contain subjects such as bike#n#1, and take only the stemmed word." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-113", "text": "For fine-tuning, we follow the same procedure as ConceptNet, creating an additional 10-length embedding for each word." 
}, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-114", "text": "----------------------------------" }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-115", "text": "**ATOMIC**" }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-116", "text": "ATOMIC (Sap et al., 2018 ) is a resource that focuses on inferential knowledge via If \u2212 T hen relations." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-117", "text": "ATOMIC separates its relationships into nine different types (xN eed, xIntent, xAttr, xEf f ect, xReact, xW ant, oEf f ect, oW ant)." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-118", "text": "For each of the nine categories, for each datum in the given category, we search our text for relationships that match the defined If \u2212 T hen relationship." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-119", "text": "Since each relationship is nearly a full sentence, we allow a match to be any trigram matched between the given datum and the text." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-120", "text": "Then, we append an index [0, 8] to the embedding layer of the first word in the selected trigram based on the type of relationship matched." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-121", "text": "For fine-tuning, we follow the same procedure as ConceptNet and We-bChild, creating an additional 10-length embedding for each word." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-122", "text": "----------------------------------" }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-123", "text": "**ARCHITECTURE**" }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-124", "text": "Out modeling procedure consists of three parts." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-125", "text": "First, we query each knowledge graph, allowing us to create embeddings for each specific graph." 
}, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-126", "text": "Then, we describe our word-level knowledge fusion procedure, creating augmented embeddings for each word." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-127", "text": "Finally, we describe our fine-tuning procedure for the shared task dataset." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-128", "text": "We modify pytorch-transformers 1 ." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-129", "text": "1 https://github.com/huggingface/pytorch-transformers" }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-130", "text": "----------------------------------" }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-131", "text": "**LANGUAGE MODEL FINE-TUNING**" }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-132", "text": "Contrary to Devlin et al. (2018) , we do language model fine-tuning in addition to classification finetuning." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-133", "text": "We find that this generally provides better results, and allows for more stable accuracy since the shared task involves a small dataset." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-134", "text": "For each prompt, we use the previous preprocessed data to create tasks for our model to predict." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-135", "text": "We do this before token realignment, so this happens before any extra knowledge graph embeddings are added to the model architecture." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-136", "text": "For masked tokens, we predict that token through bidirectional context, the same as Devlin et al. (2018) ." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-137", "text": "For next sentence prediction, we use the unbiased method previously introduced as well as in Devlin et al. (2018) ." 
}, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-138", "text": "----------------------------------" }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-139", "text": "**TOKEN REALIGNMENT**" }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-140", "text": "We do a word-level fusion to incorporate knowledge embeddings into the BERT model." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-141", "text": "First, we collect word embeddings from BERT." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-142", "text": "We sum the last four layer of BERT together, as suggested by \"The Illustrated BERT, ELMo, and co.\" 2 ." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-143", "text": "We fuse these embeddings with the embeddings gathered from querying each of the three databases." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-144", "text": "For each word, we take the dyadic product, or linear fusion, of the contextual BERT embeddings with the concatenation of the three graph embeddings." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-145", "text": "When there is no related embedding (if the word did not match any edges during querying, or if the word is a BERT-specific token such as [CLS], we do not do any dyadic fusion." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-146", "text": "Finally, to get a single linear layer, we concatenate each dimension of the result of the dyadic fusion with the original BERT embedding." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-147", "text": "Algorithm 2 shows a detailed explanation of our token realignment process." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-148", "text": "----------------------------------" }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-149", "text": "**RE-ATTENTION**" }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-150", "text": "To get a final result, we do a few more necessary steps." 
}, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-151", "text": "First, we do a single layer of selfattention over the text, allowing each of the wordlevel embeddings to interact with one another." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-152", "text": "For this self-attention, we follow the same process as in (Vaswani et al., 2017) ." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-153", "text": "We compare each token with each other and do token-level fusion with each other to learn an attention embedding layer." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-154", "text": "Then, we use the sequence embedding for classification." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-155", "text": "We add a simple linear layer over the sequence embedding for classification, and softmax over the given choices." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-156", "text": "Note that we do not freeze any weights along the process, allowing the transformer and perceptron to Algorithm 2: Psuedocode for the token realignment algorithm, a method of finding token alignments between two different sequences." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-157", "text": "token realignment(seq 1, seq 2): alignment dict = dict seq 1 i = 0 seq 2 i = 0 while seq 1 i Devlin et al. (2018) , and use the [CLS] token, without attention, for classification." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-162", "text": "----------------------------------" }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-163", "text": "**ANALYSIS**" }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-164", "text": "----------------------------------" }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-165", "text": "**HYPERPARAMETER TUNING**" }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-166", "text": "For hyperparameter tuning with BERT, we find that grid search is the best method." 
}, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-167", "text": "We tune various hyperparameters, including batch size, learning rate, warmup, and epoch count (for hyperparameter details, see appendix)." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-168", "text": "Graph 2 shows the results of several hyperparameters on BERT with our additional knowledge bases." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-169", "text": "We find that B. MOOD seems to correct its deficiencies as it gets closer to the maxima." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-170", "text": "Interestingly, B. MOOD seems to be naturally good \"What\" questions, which commonly require commonsense inference." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-171", "text": "This could be explained by the effect of the commonsense knowledge graphs, showing that is picking up on commonsense attributes." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-172", "text": "How- ever, for \"Where\" questions, which it requires more information from the text, B. MOOD needs to learn and thus experiences a greater gain as the accuracy gets closer to its maxima." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-173", "text": "We also compare to TriAN (Wang, 2018) , the previous state-of-the-art." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-174", "text": "Table 2 : Results with B. MOOD on task dev and test set. \"with all KB\" describes results using all Concept-Net, WebChild, and ATOMIC embeddings." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-175", "text": "\"Human\" and \"Regression Baseline\" accuracy is from the shared task paper (Ostermann et al., 2018) ." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-176", "text": "TriAN (Wang, 2018) embeddings, we use a size of 10 for each knowledge graph, combining for a size 30 knowledge graph embedding." 
}, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-177", "text": "We randomly init each embedding, and if there is more than one embedding for token, we pick one at random (Wang, 2018) ." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-178", "text": "For BERT fine-tuning, we use a maximum sequence length of 450, a train batch size of 32, four epochs, 1e \u2212 5 learning rate, and a 20% warmup." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-179", "text": "----------------------------------" }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-180", "text": "**RESULTS**" }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-181", "text": "We show our results and give analysis for BIG MOOD." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-182", "text": "We show that each of the knowledge bases help the accuracy of our model, and our strongest model involves the union of all three knowledge bases." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-183", "text": "ConceptNet gives the largest increase, likely because there are the most matches between the prompts and ConceptNet, since ConceptNet covers everyday concepts that are relatively more common." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-184", "text": "WebChild gives a boost also, but not as large as ConceptNet." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-185", "text": "ATOMIC gives the smallest boost, likely because 1) ATOMIC queries are the longest, and thus, least likely to match, and 2) there is not as much inferential commonsense present." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-186", "text": "We also note that the base B. MOOD accuracy is higher than the base TriAN (Wang, 2018) accuracy, the previous state of the art." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-187", "text": "By appending similar knowledge embeddings, we find that we can bring the TriAN accuracy up to 77.8%, which is more comparable with B. MOOD." 
}, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-188", "text": "This shows that the additional knowledge bases (ATOMIC, WebChild) contribute to the overall accuracy even without the contextual embeddings." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-189", "text": "However, we find that the knowledge bases combined with TriAN still do not provide an improvement above that of B. MOOD, and thus, the knowledge bases alone are not enough to capture the necessary information." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-190", "text": "Instead, the knowledge graphs must be used through combination with contextual embeddings for the most effective model." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-191", "text": "This shows that BERT may lack the complete amount of information needed to understand this dataset." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-192", "text": "We also show that the attention is needed to understand the knowledge graphs alongside BERT, showing the importance of learning the different knowledge base embeddings within the text." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-193", "text": "This highlights the fact that using the knowledge base embeddings is helpful, and also comparisons between different sections of text is helpful for reading comprehension tasks." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-194", "text": "----------------------------------" }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-195", "text": "**CONCLUSION**" }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-196", "text": "We introduce a method of fine-tuning with graphical embeddings alongside contextual embeddings, B. MOOD." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-197", "text": "Our method uses three different knowledge bases, and introduces ways of improving both learning speed and knowledge embedding effectiveness." 
}, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-198", "text": "First, we preprocess the dataset, showing that both language model preprocessing and knowledge graph preprocessing is important to the final result." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-199", "text": "Then, we tune our language model on the shared task, stabilizing the hyperparameter search." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-200", "text": "We create knowledge graph embeddings and concatenate the embeddings via token realignment." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-201", "text": "Then, we introduce a final layer of attention that learns both contextual and explicit graph embeddings through contextualization." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-202", "text": "We show the effect of various knowledge bases, and show our accuracy across various question types." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-203", "text": "Our model gets fifth on the task leaderboard and outperforms BERT across all question types." }, { "sent_id": "4bc12aca138835b5ed80b0cf69febf-C001-204", "text": "We hope that this investigation motivates and furthers additional research in combining commonsense knowledge awareness with transformers." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "4bc12aca138835b5ed80b0cf69febf-C001-16" ], [ "4bc12aca138835b5ed80b0cf69febf-C001-61" ] ], "cite_sentences": [ "4bc12aca138835b5ed80b0cf69febf-C001-16", "4bc12aca138835b5ed80b0cf69febf-C001-61" ] }, "@SIM@": { "gold_contexts": [ [ "4bc12aca138835b5ed80b0cf69febf-C001-55" ], [ "4bc12aca138835b5ed80b0cf69febf-C001-72" ] ], "cite_sentences": [ "4bc12aca138835b5ed80b0cf69febf-C001-55", "4bc12aca138835b5ed80b0cf69febf-C001-72" ] }, "@DIF@": { "gold_contexts": [ [ "4bc12aca138835b5ed80b0cf69febf-C001-66" ], [ "4bc12aca138835b5ed80b0cf69febf-C001-132" ] ], "cite_sentences": [ "4bc12aca138835b5ed80b0cf69febf-C001-66", "4bc12aca138835b5ed80b0cf69febf-C001-132" ] }, "@USE@": { "gold_contexts": [ [ "4bc12aca138835b5ed80b0cf69febf-C001-67" ], [ "4bc12aca138835b5ed80b0cf69febf-C001-136" ], [ "4bc12aca138835b5ed80b0cf69febf-C001-137" ], [ "4bc12aca138835b5ed80b0cf69febf-C001-161" ] ], "cite_sentences": [ "4bc12aca138835b5ed80b0cf69febf-C001-67", "4bc12aca138835b5ed80b0cf69febf-C001-136", "4bc12aca138835b5ed80b0cf69febf-C001-137", "4bc12aca138835b5ed80b0cf69febf-C001-161" ] }, "@EXT@": { "gold_contexts": [ [ "4bc12aca138835b5ed80b0cf69febf-C001-61", "4bc12aca138835b5ed80b0cf69febf-C001-62", "4bc12aca138835b5ed80b0cf69febf-C001-63", "4bc12aca138835b5ed80b0cf69febf-C001-66", "4bc12aca138835b5ed80b0cf69febf-C001-67" ] ], "cite_sentences": [ "4bc12aca138835b5ed80b0cf69febf-C001-61", "4bc12aca138835b5ed80b0cf69febf-C001-66", "4bc12aca138835b5ed80b0cf69febf-C001-67" ] } } }, "ABC_518b281a4bc04b3504d3b385a5dc62_13": { "x": [ { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-2", "text": "A major motivation for unsupervised morphological analysis is to reduce the sparse data problem in under-resourced languages." 
}, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-3", "text": "Most previous work focuses on segmenting surface forms into their constituent morphs (e.g., taking: tak +ing), but surface form segmentation does not solve the sparse data problem as the analyses of take and taking are not connected to each other." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-4", "text": "We extend the MorphoChains system (Narasimhan et al., 2015) to provide morphological analyses that can abstract over spelling differences in functionally similar morphs." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-5", "text": "These analyses are not required to use all the orthographic material of a word (stopping: stop +ing), nor are they limited to only that material (acidified: acid +ify +ed)." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-6", "text": "On average across six typologically varied languages, our system has a similar or better F-score on EMMA (a measure of underlying morpheme accuracy) than three strong baselines; moreover, the total number of distinct morphemes identified by our system is on average 12.8% lower than for Morfessor (Virpioja et al., 2013) , a state-of-the-art surface segmentation system." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-7", "text": "----------------------------------" }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-9", "text": "Most previous work on unsupervised morphological analysis has focused on the problem of segmentation: segmenting surface forms into their constituent morphs (Goldsmith, 2001; Creutz and Lagus, 2007; Poon et al., 2009; Lee et al., 2011; Virpioja et al., 2013; Sirts and Goldwater, 2013) ." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-10", "text": "However, the focus on surface segmentation is largely due to ease of model definition and implementation rather than linguistic correctness."
}, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-11", "text": "Even in languages with primarily concatenative morphology, spelling (or phonological) changes often occur at morpheme boundaries, so that a single morpheme may have multiple surface forms." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-12", "text": "For example, the past tense in English may surface as -ed (walked), -d (baked), -ted (emitted), -ped (skipped), etc." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-13", "text": "A major motivation for unsupervised morphological analysis is to reduce the sparse data problem in under-resourced languages." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-14", "text": "While surface segmentation can help, the example above illustrates its limitations: for more effective parameter sharing, a system should recognize that -ed, -d, -ted, and -ped share the same linguistic function." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-15", "text": "The importance of identifying underlying morphemes rather than surface morphs is widely recognized, for example by the MorphoChallenge organizers, who in later years provided datasets and evaluation measures to encourage this deeper level of analysis (Kurimo et al., 2010) ." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-16", "text": "Nevertheless, only a few systems have attempted this task (Goldwater and Johnson, 2004; Naradowsky and Goldwater, 2009) , and as far as we know, only one, the rule-based MORSEL (Lignos et al., 2009; Lignos, 2010) , has come close to the level of performance achieved by segmentation systems such as Morfessor (Virpioja et al., 2013) ." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-17", "text": "We present a system that adapts the unsupervised MorphoChains segmentation system (Narasimhan et al., 2015) to provide morphological analyses that aim to abstract over spelling differences in functionally similar morphemes." 
}, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-18", "text": "Like MorphoChains, our system uses an unsupervised log-linear model whose parameters are learned using contrastive estimation (Smith and Eisner, 2005) ." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-19", "text": "The original MorphoChains system learns to identify child-parent pairs of morphologically related words, where the child (e.g., stopping) is formed from the parent (stop) by adding an affix and possibly a spelling transformation (both represented as features in the model)." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-20", "text": "However, these spelling transformations are never used to output underlying morphemes; instead, the system just returns a segmentation by post-processing the inferred child-parent pairs." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-21", "text": "We extend the MorphoChains system in several ways: first, we use the spelling transformation features to output underlying morphemes for each word rather than a segmentation; second, we broaden the types of morphological changes that can be identified to include compounds; and third, we modify the set of features used in the log-linear model to improve the overall performance." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-22", "text": "We evaluate using EMMA (Spiegler and Monson, 2010) , a measure that focuses on the identity rather than the spelling of morphemes." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-44", "text": "The candidate set is generated by taking all possible splits of the word and applying all possible transformation types." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-23", "text": "On average across six typologically varied languages (English, German, Turkish, Finnish, Estonian, Arabic), our system outperforms both the original MorphoChains system and the MORSEL system, and performs similarly to the surface segmentation system Morfessor."
}, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-24", "text": "These results are (to our knowledge) the best to date from a system for identifying underlying morphemes; moreover, the total number of distinct morphemes identified by our system is on average 12.8% lower than for Morfessor, suggesting that it does a better job of abstracting over surface spellings and inducing a compact representation of the data." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-25", "text": "----------------------------------" }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-26", "text": "**MORPHOLOGICAL CHAINS AND ANALYSES**" }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-27", "text": "We base our work on the MorphoChains segmentation system (Narasimhan et al., 2015) , 1 which defines a morphological chain as a sequence of child-parent pairs." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-28", "text": "Each pair consists of two morphologically related words where the child must be longer than the parent." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-29", "text": "To analyse a word we want to find the best parent for that word; we do so recursively until we conclude that the stop condition is met (i.e. a word doesn't have a morphological parent)." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-30", "text": "The word standardizes, for example, produces the following chain:" }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-31", "text": "standardizes \u2192 standardize \u2192 standard which consists of the child-parent pairs (standardizes, standardize), (standardize, standard) and (standard, NONE)." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-32", "text": "Each child-parent pair is annotated with a type indicating the kind of transformation that relates the child-parent pair." 
}, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-33", "text": "The set of transformations defined by MorphoChains is: suffixation as in (dogs, dog), prefixation as in (undone, done), deletion as in (baked, bake) 2 , repetition as in (stopped, stop), and modification as in (worried, worry)." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-34", "text": "2 The system could in principle learn that bake is the parent of baked with type suffix, which would imply the analysis bake +d. However, we hope it learns instead the type delete, which implies the (correct) analysis bake +ed." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-35", "text": "Similar alternative analyses are possible for the other example types shown." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-36", "text": "----------------------------------" }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-37", "text": "Table 1: Example outputs of two models." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-38", "text": "Word: stopping, MorphoChains: stopp +ing, Our Model: stop +ing; Word: doubled, MorphoChains: double +d, Our Model: double +ed; Word: acidified, MorphoChains: acid +ifi +ed, Our Model: acid +ify +ed." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-39", "text": "1 We modified the implementation available at https://github.com/karthikncode/MorphoChain." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-40", "text": "We add a sixth type, compounding as in (darkroom, room)." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-41", "text": "The delete, repeat, and modify types all assume the change occurs to the final character of the stem, while compounding can simply concatenate the two stems, or can introduce an extra character (as in higher-rate or German schilddruese +n+ krebs ('thyroid cancer'))." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-42", "text": "Any word (undone) has many possible parents, some linguistically plausible (undo, done) and others not (und, ndone)."
}, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-43", "text": "Our system, like MorphoChains, learns a log-linear model to discriminate plausible from implausible parents amongst the complete parent candidate set for each word." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-45", "text": "For example, some parent candidates for the word dogs include (dog, suffix), (do, suffix), (gs, prefix), (doga, delete), (dogb, delete), (doe, delete), and (doe, modify)." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-46", "text": "The last two imply analyses of doe +gs and doe +s, respectively." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-47", "text": "The examples above indicate how analyses can be induced recursively by tracking the transformation type and orthographic change associated with each parent-child pair." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-48", "text": "However, the original MorphoChains algorithm did not do so; instead, it only used the transformation types to predict morph boundaries." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-49", "text": "Table 1 contrasts the word segmentation into morphs produced by the original MorphoChains model and the morpheme analysis produced by our model." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-50", "text": "Table 2 provides additional examples of our recursive analysis process." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-51", "text": "----------------------------------" }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-52", "text": "**MODEL**" }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-53", "text": "We predict child-parent pairs using a log-linear model, following Narasimhan et al. (2015) ." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-54", "text": "The model consists of a set of features represented by a feature vector \u03c6 : W \u00d7 Z \u2192 R^d , where W is a set of words and Z is the set of (parent, type) pairs for words in W. 
The model defines the conditional probability of a particular (parent, type) pair z \u2208 Z given word w \u2208 W as: p(z | w) = exp(\u03b8 \u00b7 \u03c6(w, z)) / \u2211_{z' \u2208 C(w)} exp(\u03b8 \u00b7 \u03c6(w, z'))," }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-55", "text": "where C(w) \u2282 Z denotes the set of parent candidates for w and \u03b8 is a weight vector." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-56", "text": "Our goal is to learn the feature weights in an unsupervised fashion." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-57", "text": "Following Narasimhan et al. (2015), we do so using Contrastive Estimation (CE) (Smith and Eisner, 2005) ." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-58", "text": "In CE every training example w \u2208 W serves as both a positive example and a set of implied negative examples: strings that are similar to w but don't occur in the corpus." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-59", "text": "The negative examples are the source of the probability mass allocated to the positive examples." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-60", "text": "The word w and its negative examples constitute the neighbourhood N(w) of w." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-61", "text": "Given the list of words W and their neighbourhoods, the CE likelihood is defined as: L_CE(\u03b8) = \u220f_{w \u2208 W} [ \u2211_{z \u2208 C(w)} exp(\u03b8 \u00b7 \u03c6(w, z)) / \u2211_{w' \u2208 N(w)} \u2211_{z' \u2208 C(w')} exp(\u03b8 \u00b7 \u03c6(w', z')) ]." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-62", "text": "We use the same neighbourhood functions as Narasimhan et al. (2015) ." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-63", "text": "Specifically, for each word w in the corpus W, we create neighbours in two ways: by swapping two adjacent characters of w (walking\u2192walkign) and by swapping two pairs of adjacent characters, where one pair is at the beginning of the word, and the other at the end of the word (walking\u2192awlkign)." 
}, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-64", "text": "We use LBFGS-B (Zhu et al., 1997) to optimize the regularized log-likelihood of the model: the log of the CE likelihood minus an L2 penalty \u03bb ||\u03b8||\u00b2." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-65", "text": "----------------------------------" }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-66", "text": "**FEATURES**" }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-67", "text": "MorphoChains used a rich set of features, from which we have kept some, discarded others and added new ones to improve overall performance." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-68", "text": "This section describes our set of features, with examples shown in Table 3 ." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-69", "text": "----------------------------------" }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-70", "text": "**PRESENCE IN TRAINING DATA**" }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-71", "text": "We want features that signal which parents are valid words." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-72", "text": "Narasimhan et al. (2015) used each word's log frequency." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-73", "text": "However, the majority of words in the training data (word frequency lists) occur only once, which makes their frequency information unreliable." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-74", "text": "3 Instead, we use an out-of-vocabulary feature (OOV) for parents that don't occur in the training data." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-75", "text": "Semantic Similarity Morphologically related words exhibit semantic similarity among their word embeddings (Schone and Jurafsky, 2000; Baroni et al., 2002) ." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-76", "text": "Semantic similarity was an important feature in MorphoChains: Narasimhan et al. 
(2015) concluded that up to 25 percent of their model's precision was due to the semantic similarity feature." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-77", "text": "We use the same feature here (COS)." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-78", "text": "For a child-parent pair (w_A, w_B) with word embeddings v_{w_A} and v_{w_B} respectively, we compute semantic similarity as the cosine similarity cos(v_{w_A}, v_{w_B}) = (v_{w_A} \u00b7 v_{w_B}) / (||v_{w_A}|| ||v_{w_B}||)." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-79", "text": "Affixes Candidate pairs where the child contains a frequently occurring affix are more likely to be correct." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-80", "text": "To identify possible affixes to use as features, Narasimhan et al. (2015) counted the number of words that end (or start) with each substring and selected the most frequent ones. Table 3: Examples illustrating which of the binary features in the model are active for various potential child-parent pairs." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-81", "text": "Not shown here is the real-valued semantic similarity feature COS, used in all examples except 9 and 10, where it is replaced by the binary feature STOPCOS=y, for y in increments of 0.1." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-82", "text": "Table 4: Likely English affixes found using LSE. Prefixes: al, ar, ba, be, bo, ca, car, co, de, dis, en, ha, ho, in, inter, la, le, li, lo, ma, mar, mc, mi, mis, mo, out, over, pa, po, pre, pro, ra, re, ro, se, ta, to, un, under, up. Suffixes: a, age, al, an, ar, ary, as, ation, b, ble, ch, e, ed, el, en, er, ers, es, est, et, ful, i, ia, ic, ie, ies, in, ing, ings, is, ism, ist, ists, land, le, led, les, less, ley, ling, ly, m, man, ment, ments, ner, ness, o, or, p, s, se, son, t, ted, ter, ters, th, ting, ton, ts, y." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-83", "text": "However, all words that end with ing also end with ng and g, which means that they also become affix candidates." 
}, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-84", "text": "Furthermore, there are more words that end with ng or g than with ing, therefore valid affixes might be excluded from the list because of their more frequent substrings." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-85", "text": "----------------------------------" }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-86", "text": "**PREFIXES**" }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-87", "text": "We therefore modify the affix features in two ways." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-88", "text": "First, we identify a more precise set of likely affixes using Letter Successor Entropy (LSE) values (Hafer and Weiss, 1974) , which are typically high at morph boundaries." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-89", "text": "LSE is computed at each point in the word as the entropy of the distribution over the next character given the word prefix so far." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-90", "text": "When selecting likely affixes, we use an LSE threshold value of 3.0 as suggested by Hafer and Weiss (1974) , and we require that the affix has appeared in at least 50 types with a corpus frequency of at least 100." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-91", "text": "We then define two features (PREFLIST, SUFLIST), which are active if the proposed prefix or suffix for a parent-child pair is in the set of likely prefixes or suffixes." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-92", "text": "Table 4 shows the list of likely English affixes found by using LSE (62 suffixes and 42 prefixes)." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-93", "text": "For German and Turkish, our other two development languages (see \u00a73), the lists contain 498 suffixes/183 prefixes and 181 suffixes/35 prefixes, respectively." 
}, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-94", "text": "4 In addition, we use a much larger set of affix features, PREF=x and SUF=x, where x is instantiated with all possible word prefixes (suffixes) for which both w and xw (wx) are words in the training data." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-95", "text": "Transformations To help distinguish between probable and improbable transformations, we introduce transformation-specific features." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-96", "text": "For deletion we use the deleted letter (DELETED)." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-97", "text": "For repetition we use the repeated letter and its preceding 2-and 1-character contexts (REPEATED, RENV2 RENV1)." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-98", "text": "For modification we use the combination of the involved letters (MODIFIED)." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-99", "text": "Finally, for compounding we use the headword (i.e. the parent of the compound), the modifier and the connector, if such exists (HEAD, MODIFIER, CONNECTOR)." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-100", "text": "Since these compound features can be very sparse, we also add a single COMPOUND feature, which is active when both parts of the compound are present in the training data." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-101", "text": "Stop Condition To identify words with no parents we use two types of binary features suggested by Narasimhan et al. (2015) ." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-102", "text": "STOPCOS=y is the maximum cosine similarity between the word and any of its parent candidates (using bins of size 0.1), and STOPLEN=x is instantiated for all possible word lengths x in the training data." 
}, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-103", "text": "For illustration, if we are considering whether decided is a word with no parents (Table 3 Example 10), the binary features STOPLEN=7 and STOPCOS=0.5 become active." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-104", "text": "We discard the starting and ending character unigram and bigram features used by MorphoChains, because of the large number 5 and the sparsity of these features." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-105", "text": "----------------------------------" }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-106", "text": "**DATA SELECTION**" }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-107", "text": "Most unsupervised morphology learners are sensitive to the coverage and the quality of training data." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-108", "text": "In a large corpus, however, many word types occur only once because of the Zipfian distribution of word types." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-109", "text": "Low-frequency types can be either rare but valid words or they can be foreign words, typos, non-words, etc." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-110", "text": "This makes learning from lowfrequency words unreliable, but discarding them dramatically reduces the size of the training data (including many valid words)." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-111", "text": "To seek balance between the quality and the coverage of the training data we try to identify which low-frequency words are likely to provide useful statistical support for our model, so we can include those in the training data and discard the other lowfrequency words." 
}, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-112", "text": "First, we set a frequency-based pruning threshold (PT) at the frequency for which at least 50% of the words above this frequency have a word embedding (see \u00a73); next we set a learning threshold (LT) at the median frequency of the words with frequencies above PT; finally we adopt the algorithm by Neuvel and Fulop (2002) to decide which words with frequencies below PT can be useful to analyse the words with frequencies above LT." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-113", "text": "We filter out any remaining words with frequencies below PT." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-114", "text": "The outline of the adapted version of the algorithm by Neuvel and Fulop (2002) is: 1) For every word pair in the top 20k most frequent words in training data:" }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-115", "text": "1.1) We find the pair's orthographic similarities as the longest common subsequence: receive\u21d4reception." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-116", "text": "1.2) We find the pair's orthographic differences with respect to their orthographic similarities: receive\u21d4reception." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-117", "text": "5 For English there can be 676 different letter bigrams of which 99% occur at least once at the beginning of some word in the word frequency list." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-118", "text": "2) For all word-pairs with the same orthographic differences we merge their similarities and differences into Word Formation Strategies (WFS): so receive\u21d4reception, conceive\u21d4conception, deceive\u21d4deception give *##ceive\u21d4*##ception, where * and # stand for the optional and mandatory character wild cards respectively." 
}, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-119", "text": "3) We discard those WFS that are suggested by fewer than 10 word pairs; 4) For each WFS and for each word with a frequency below PT:" }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-120", "text": "4.1) if a word w matches either of the sides of a WFS and the other side of a WFS predicts a word w' with a frequency above the LT, we keep w in the training data; otherwise, we discard it." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-121", "text": "For a more detailed description of the algorithm, see Neuvel and Fulop (2002) ." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-122", "text": "----------------------------------" }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-123", "text": "**EXPERIMENTS**" }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-124", "text": "Data We conduct experiments on six languages: Arabic, English, Estonian, Finnish, German and Turkish." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-125", "text": "For the word embeddings required by our system and the MorphoChains baseline, we used word2vec (Mikolov et al., 2013) to train a Continuous Bag of Words model on a sub-sample of the Common Crawl (CC) corpus 6 for each language (Table 5 lists corpus sizes)." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-126", "text": "We trained 100-dimensional embeddings for all words occurring at least 25 times, using 20 iterations and default parameters otherwise." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-127", "text": "For all languages except Estonian, we train and evaluate all systems on the data from the Morpho Challenge 2010 competition." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-128", "text": "7 The training data consists of word lists with word frequencies." 
}, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-129", "text": "The official test sets are not public, but a small labelled training and development set is provided for each language in addition to the large unannotated word list, since the challenge included semi-supervised systems." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-130", "text": "Thus, for experiments on Arabic, English, Finnish, German and Turkish we evaluated on the annotated training and development gold standard analyses from the Morpho Challenge 2009/2010 competition data sets." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-131", "text": "The gold standard labels include part of speech tags and functional labels for inflectional morphemes, with multiple analyses given for words with part of speech ambiguity or functionally different but orthographically equivalent inflectional morphemes." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-132", "text": "For example, rockiness is analysed as rock_N y_s ness_s, while rocks has two analyses: rock_N +PL and rock_V +3SG." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-133", "text": "For Estonian we train on word lists extracted from Common Crawl and test on data prepared by Sirts and Goldwater (2013) ." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-134", "text": "The Estonian test set contains only surface segmentation into morphs (e.g. kolmandal is analysed as kolmanda l)." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-135", "text": "Table 5 provides information about each dataset." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-136", "text": "Since we are developing an unsupervised system, we want to make sure that it generalizes to new languages." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-137", "text": "We therefore divide the languages into three development languages (English, German, Turkish) and three test languages (Finnish, Estonian, Arabic)." 
}, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-138", "text": "We used the development languages to choose features, design the data selection procedure and select the best values for hyperparameters." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-139", "text": "The system that performed best on those languages was then used unmodified on the test languages." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-140", "text": "Hyperparameters In addition to threshold values described above, we use the same \u03bb = 1 (Equation 3) as Narasimhan et al. (2015) ." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-141", "text": "To control for under-segmentation we downscale weights of the stop features by a factor of 0.8." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-142", "text": "We set the maximum affix length to 8 characters and the minimum word length to 1 character." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-143", "text": "Evaluation Metric We test our model on the task of unsupervised morpheme analysis induction." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-144", "text": "We follow the format of Morpho Challenge 2010 and use the Evaluation Metric for Morphological Analysis (EMMA) (Spiegler and Monson, 2010) to evaluate predicted outputs." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-145", "text": "EMMA works by finding the optimal one-to-one mapping between the model's output and the reference analysis (i.e., the spelling of the morphemes in the analysis doesn't matter)." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-146", "text": "These mappings are used to compute precision, recall, and F-score against the reference morphemes." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-147", "text": "Baselines We compare our model to three other systems: Morfessor 2.0 (Virpioja et al., 2013) , MORSEL (Lignos et al., 2009; Lignos, 2010) and MorphoChains (Narasimhan et al., 2015) ." 
}, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-148", "text": "We also use a trivial baseline Words which outputs the input word." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-149", "text": "Morfessor (Morf.2.0) is a family of probabilistic algorithms that perform unsupervised word segmentation into morphs." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-150", "text": "Since the release of the initial version of Morfessor, it has become popular as an automatic tool for processing morphologically complex languages." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-151", "text": "MORSEL is a rule-based unsupervised morphology learner designed for affixal morphology." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-152", "text": "Like our own system, it outputs morphological analyses of words rather than segmentations." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-153", "text": "MORSEL achieved excellent performance on the Morpho Challenge 2010 data sets." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-154", "text": "MorphoChains (MC) is the model upon which our own system is based, but as noted above it performs segmentation rather than analysis." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-155", "text": "In contrast to Morfessor and MORSEL, which analyse words based only on orthographic patterns, MorphoChains (like our extension) uses both orthographic and the semantic information." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-156", "text": "All three baselines have multiple hyperparameters." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-157", "text": "Since performance tends to be most sensitive to the treatment of word frequency (including possibly discarding low-frequency words), for each system we tuned the hyperparameters related to word frequency to optimize average performance on the development languages, and kept these hyperparameters fixed for the test languages." 
}, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-158", "text": "Table 6 gives the performance of all models on the three development languages." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-159", "text": "Our model outperforms all baselines on every language, and is a clear improvement over the original MorphoChains." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-160", "text": "8 Lang Method To see where the benefit is coming from, we performed ablation tests (Table 7) ." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-161", "text": "Results show the importance of the LSE-based affix features (Model-A)." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-162", "text": "Using these features gives gains of +1.0%, 3.8% and 0.6% F-1 absolute on English, German and Turkish respectively over using the raw affix occurrence frequencies as used by Narasimhan et al. (2015) ." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-163", "text": "We can see that our data selection scheme (Model-D) is important for English (+3.5%) and German (+1.1%)." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-164", "text": "Although we expected that the data selection scheme would have the biggest impact on Turkish because of its small training data, it has very little effect on this language." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-165", "text": "As expected, the compounding transformation (Model-C) has a prominent impact on German (+2.7%) and a modest effect on English and Turkish." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-166", "text": "The three features PREFLIST, SUFLIST, COMPOUND (Model-B) have the least impact on the model's performance (on average 0.5% F-1 absolute), however the effect is substantial considering that this gain is achieved their hyperparameters separately for each language." 
}, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-167", "text": "Finally, it is not clear how they tuned the frequency-related hyperparameters for Morfessor." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-168", "text": "We found that Morfessor performed better than MorphoChains when either low frequency words are pruned from its input, or its log-frequency option is used rather than raw frequency." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-169", "text": "by merely 3 features as opposed to a new feature type with many instantiations." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-170", "text": "Table 8 shows some example outputs for English, German and Turkish." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-171", "text": "These analyses include some correctly identified spelling changes (Example 1) compounds (Example 4), and purely concatenative morphology (Example 6)." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-172", "text": "In Example 2, +ble is counted as incorrect because our model predicts +ble both for deplorable and reproducible while the reference analysis uses able s and ible s, respectively." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-173", "text": "Since EMMA uses one-to-one alignment, it deems one of the alignments wrong." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-174", "text": "The opposite problem occurs in Example 4: our model analyses aus in two ways, either as a prefix aus+ or as a separate word aus (part of a compound), whereas the reference analysis always treats it as a separate word aus." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-175", "text": "Example 6 illustrates an oversegmentation error, caused by encountering two similar forms of the verb giy, giymeyin and giymeyi." 
}, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-176", "text": "----------------------------------" }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-177", "text": "**RESULTS AND DISCUSSION**" }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-178", "text": "Performance of all models on the three test languages is shown in Table 9 ." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-179", "text": "On average, our model does better than MorphoChains and MORSEL, but slightly worse than Morfessor." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-180", "text": "However, this difference is mainly due to Morfessor's very good performance on Estonian, which is the only test set using gold standard segmentations rather than analyses." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-181", "text": "All systems perform poorly on Arabic since they do not handle templatic morphology; nevertheless our model and Morfessor perform considerably Table 9 : Results on test languages." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-182", "text": "Scores calculated using EMMA." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-183", "text": "*=reference analysis contains word segmentation." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-184", "text": "better than the others." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-185", "text": "Overall, our model performs consistently near the top even if not the best for any of the three languages." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-186", "text": "Lexical Inventory Size One of the motivations for unsupervised morphological analysis is to reduce data sparsity in downstream applications, which implies that for a given level of accuracy, systems that produce a more compact representation (i.e., a smaller morpheme inventory) should be preferred." 
}, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-187", "text": "To see how compactly each model represents the test set, we count the number of unique morphemes (or morphs, or labels) in the predicted output of each model and compare it with the number of labels in the reference analysis and the num- ber of words in the test set." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-188", "text": "Table 10 summarizes this information 9 ." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-189", "text": "For all languages except Estonian our model finds the most compact set of items." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-190", "text": "The number of distinct morphemes identified by our model is only about 5%, 4.5% and 8.0% larger than in the reference analysis for English, German and Finnish respectively." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-191", "text": "On average our model identified 12.8% and 14.8% fewer morphemes than Morfessor and MorphoChains respectively, while on average performing no worse or better than the two word segmentation systems." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-192", "text": "MORSEL produces the second most compact output with only a 3.2% larger set of distinct morphemes than our model, leaving the two word segmentation systems, Morfessor and MorphoChains, in the third and the forth place respectively." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-193", "text": "These results suggest that systems that attempt to output morphological analysis succeed in reusing the same morphemes more frequently than the systems that perform surface segmentation." 
}, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-194", "text": "----------------------------------" }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-195", "text": "**CONCLUSION**" }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-196", "text": "We presented an unsupervised log-linear model that learns to identify morphologically related words and the affixes and spelling transformations that relate them." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-197", "text": "It uses these to induce morphemelevel analyses of each word and an overall compact representation of the corpus." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-198", "text": "In tests on six languages, our system's EMMA scores are considerably better than its inspiration, the segmentation system MorphoChains, and it also outperformed the rule-based analysis system MORSEL." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-199", "text": "Our system achieved similar EMMA performance to Morfessor but with a more compact representation-the first probabilistic system we are aware of to do so well." }, { "sent_id": "518b281a4bc04b3504d3b385a5dc62-C001-200", "text": "In future work, we hope to investigate further improvements to the system and perform extrinsic evaluation on downstream tasks." 
} ], "y": { "@EXT@": { "gold_contexts": [ [ "518b281a4bc04b3504d3b385a5dc62-C001-4" ], [ "518b281a4bc04b3504d3b385a5dc62-C001-17" ] ], "cite_sentences": [ "518b281a4bc04b3504d3b385a5dc62-C001-4", "518b281a4bc04b3504d3b385a5dc62-C001-17" ] }, "@USE@": { "gold_contexts": [ [ "518b281a4bc04b3504d3b385a5dc62-C001-27" ], [ "518b281a4bc04b3504d3b385a5dc62-C001-53" ], [ "518b281a4bc04b3504d3b385a5dc62-C001-57" ], [ "518b281a4bc04b3504d3b385a5dc62-C001-62" ], [ "518b281a4bc04b3504d3b385a5dc62-C001-101" ], [ "518b281a4bc04b3504d3b385a5dc62-C001-140" ], [ "518b281a4bc04b3504d3b385a5dc62-C001-147" ], [ "518b281a4bc04b3504d3b385a5dc62-C001-162" ] ], "cite_sentences": [ "518b281a4bc04b3504d3b385a5dc62-C001-27", "518b281a4bc04b3504d3b385a5dc62-C001-53", "518b281a4bc04b3504d3b385a5dc62-C001-57", "518b281a4bc04b3504d3b385a5dc62-C001-62", "518b281a4bc04b3504d3b385a5dc62-C001-101", "518b281a4bc04b3504d3b385a5dc62-C001-140", "518b281a4bc04b3504d3b385a5dc62-C001-147", "518b281a4bc04b3504d3b385a5dc62-C001-162" ] }, "@BACK@": { "gold_contexts": [ [ "518b281a4bc04b3504d3b385a5dc62-C001-76" ], [ "518b281a4bc04b3504d3b385a5dc62-C001-80" ] ], "cite_sentences": [ "518b281a4bc04b3504d3b385a5dc62-C001-76", "518b281a4bc04b3504d3b385a5dc62-C001-80" ] } } }, "ABC_c4e2f43e223f61d81d81ac2c9aaa3f_13": { "x": [ { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-94", "text": "----------------------------------" }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-95", "text": "**PRIOR WORK AS INSTANCES OF SLUICE NETWORKS**" }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-101", "text": "Our architecture learns a 1 / 2 group lasso over the two subspaces (with the same degrees of freedom), when all \u03b1 AB and \u03b1 BA -values are set to 0." 
}, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-96", "text": "Our meta-architecture is very flexible and can be seen as a generalization over several existing algorithms for transfer and multi-task learning, including (Caruana 1998; Daum\u00e9 III 2007; Misra et al. 2016) ." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-97", "text": "We show how to derive each of these below." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-98", "text": "\u2022 Hard parameter sharing between the two networks appears if all \u03b1 values are set to the same constant (Caruana 1998; Collobert and Weston 2008) ." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-99", "text": "This is equivalent to a mean-constrained 0 -regularizer \u2126(\u00b7) = | \u00b7 |w i 0 and i \u03bb i L i < 1. Since the penalty for not sharing a parameter is 1, it holds that if the sum of weighted losses is smaller than 1, the loss with penalty is always the highest when all parameters are shared." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-100", "text": "\u2022 The 1 / 2 group lasso regularizer is G g=1 ||G 1,i,g || 2 , a weighted sum over the 2 norms of the groups, often used to enforce subspace sharing (Zhou, Jin, and Hoi 2010; Swirszcz and Lozano 2012) ." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-102", "text": "When the outer layer \u03b1-values are not shared, we get block communication between the networks." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-129", "text": "See Figure 2 for the results of our experiment." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-2", "text": "Multi-task learning (MTL) allows deep neural networks to learn from related tasks by sharing parameters with other networks." 
}, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-3", "text": "In practice, however, MTL involves searching an enormous space of possible parameter sharing architectures to find (a) the layers or subspaces that benefit from sharing, (b) the appropriate amount of sharing, and (c) the appropriate relative weights of the different task losses." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-4", "text": "Recent work has addressed each of the above problems in isolation." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-5", "text": "In this work we present an approach that learns a latent multi-task architecture that jointly addresses (a)-(c)." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-6", "text": "We present experiments on synthetic data and data from OntoNotes 5.0, including four different tasks and seven different domains." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-7", "text": "Our extension consistently outperforms previous approaches to learning latent architectures for multi-task problems and achieves up to 15% average error reductions over common approaches to MTL." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-8", "text": "----------------------------------" }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-10", "text": "Multi-task learning (MTL) in deep neural networks is typically a result of parameter sharing between two networks (of usually the same dimensions) (Caruana 1993) ." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-11", "text": "If you have two three-layered, recurrent neural networks, both with an embedding inner layer and each recurrent layer feeding the task-specific classifier function through a feed-forward neural network, we have 19 pairs of layers that could share parameters." 
}, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-12", "text": "With the option of having private spaces, this gives us 5 19 =19,073,486,328,125 possible MTL architectures." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-13", "text": "If we additionally consider soft sharing of parameters, the number of possible architectures grows infinite." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-14", "text": "It is obviously not feasible to search this space." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-15", "text": "Neural architecture search (NAS) (Zoph and Le 2017) typically requires learning from a large pool of experiments with different architectures." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-16", "text": "Searching for multi-task architectures via reinforcement learning (Wong and Gesmundo 2018) or evolutionary approaches (Liang, Meyerson, and Miikkulainen 2018) can therefore be quite expensive." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-17", "text": "In this paper, we jointly learn a latent multi-task architecture and task-specific models, paying a minimal computational cost over single task learning and standard multi-task learning (5-7% training time)." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-18", "text": "We refer to this problem as multi-task architecture learning." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-19", "text": "In contrast to architecture search, the overall meta-architecture is fixed and the model learns the optimal latent connections and pathways for each task." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-20", "text": "Recently, a few authors have considered multi-task architecture learning (Misra et al. 2016; , but these papers only address a subspace of the possible architectures typically considered in neural multi-task learning, while other approaches at most consider a couple of architectures for sharing Peng and Dredze 2016; Mart\u00ednez Alonso and Plank 2017) ." 
}, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-21", "text": "In contrast, we introduce a framework that unifies previous approaches by introducing trainable parameters for all the components that differentiate multi-task learning approaches along the above dimensions." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-22", "text": "Contributions We present a novel meta-architecture (shown in Figure 1 ) that generalizes several previous multitask architectures, with an application to sequence tagging problems." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-23", "text": "Our meta-architecture enables multi-task architecture learning, i.e., learning (a) what layers to share between deep recurrent neural networks, but also (b) which parts of those layers to share, and with what strength, as well as (c) a mixture model of skip connections at the architecture's outer layer." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-24", "text": "We show that the architecture is a generalization of various multi-task (Caruana 1998; Misra et al. 2016 ) and transfer learning algorithms (Daum\u00e9 III 2007) ." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-25", "text": "We evaluate it on four tasks and across seven domains on OntoNotes 5.0 (Weischedel et al. 2013) , where it consistently outperforms previous work on multi-task architecture learning, as well as common MTL approaches." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-26", "text": "Moreover, we study the task properties that predict gains and those that correlate with learning certain types of sharing." 
}, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-27", "text": "----------------------------------" }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-28", "text": "**MULTI-TASK ARCHITECTURE LEARNING**" }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-29", "text": "We introduce a meta-architecture for multi-task architecture learning, which we refer to as a sluice network, sketched in Figure 1 for the case of two tasks." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-30", "text": "The network learns to share parameters between M neural networks-in our case, two deep recurrent neural networks (RNNs) (Hochreiter and Schmidhuber 1997) ." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-31", "text": "The network can be seen as an end-toend differentiable union of a set of sharing architectures with parameters controlling the sharing." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-32", "text": "By learning the weights of those sharing parameters (sluices) jointly with the rest of the model, we arrive at a task-specific MTL architecture over the course of training." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-33", "text": "Figure 1 : A sluice meta-network with one main task A and one auxiliary task B. It consists of a shared input layer (bottom), two task-specific output layers (top), and three hidden layers per task, each partitioned into two subspaces G that are enforced to be orthogonal." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-34", "text": "\u03b1 parameters control which subspaces are shared between main and auxiliary task, while \u03b2 parameters control which layer outputs are used for prediction." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-35", "text": "For simplicity, we do not index layers and subspaces." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-36", "text": "With two subspaces, each block \u03b1 AA , \u03b1 BA , . . ." 
}, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-37", "text": "\u2208 R 2\u00d72 ." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-38", "text": "With three layers, \u03b2 A , \u03b2 B \u2208 R 3 ." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-39", "text": "The two networks A and B share an embedding layer associating the elements of an input sequence, in our case English words, with vector representations via word and character embeddings." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-40", "text": "The two sequences of vectors are then passed on to their respective inner recurrent layers." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-41", "text": "Each layer is divided into subspaces (by splitting the matrices in half), e.g., for network A into G A,1 and G A,2 , which allow the sluice network to learn task-specific and shared representations, if beneficial." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-42", "text": "The subspaces have different weights." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-43", "text": "The output of the inner layer of network A is then passed to its second layer, as well as to the second layer of network B. This traffic of information is mediated by a set of \u03b1 and \u03b2 parameters similar to the way a sluice controls the flow of water." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-44", "text": "Specifically, the second layer of each network receives a combination of the output of the two inner layers weighted by the \u03b1 parameters." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-45", "text": "Importantly, these \u03b1 parameters are trainable and allow the model to learn whether to share or to focus on task-specific features in a subspace." 
}, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-46", "text": "Finally, a weighted combination of the outputs of the outer recurrent layers G \u00b7,3,\u00b7 as well as the weighted outputs of the inner layers are mediated through \u03b2 parameters, which reflect a mixture over the representations at various depths of the multi-task architecture." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-47", "text": "In sum, sluice networks have the capacity to learn what layers and subspaces should be shared, and how much, as well as at what layers the meta-network has learned the best representations of the input sequences." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-48", "text": "----------------------------------" }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-49", "text": "**MATRIX REGULARIZATION**" }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-50", "text": "We cast learning what to share as a matrix regularization problem, following (Jacob et al. 2009; Yang and Hospedales 2017) ." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-51", "text": "Assume M different tasks that are loosely related, with M potentially non-overlapping datasets D 1 , . . . , D M ." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-52", "text": "Each task is associated with a deep neural network with K layers L 1 , . . ." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-53", "text": "L K ." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-54", "text": "For simplicity, we assume that all the deep networks have the same hyperparameters at the outset." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-55", "text": "Let W \u2208 R M \u00d7D be a matrix in which each row i corresponds to a model \u03b8 i with D parameters." 
}, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-56", "text": "The loss that sluice networks minimize, with a penalty term \u2126, is then as follows:" }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-57", "text": "The loss functions L i are crossentropy functions of the form \u2212 y p(y) log q(y) where y i are the labels of task i. Note that sluice networks are not restricted to tasks with the same loss functions, but could also be applied to jointly learn regression and classification tasks." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-58", "text": "The weights \u03bb i determine the importance of the different tasks during training." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-59", "text": "We explicitly add inductive bias to the model via the regularizer \u2126 below, but our model also implicitly learns regularization through multi-task learning (Caruana 1993 ) mediated by the \u03b1 parameters, while the \u03b2 parameters are used to learn the mixture functions f (\u00b7), as detailed in the following." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-60", "text": "----------------------------------" }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-61", "text": "**LEARNING MATRIX REGULARIZERS**" }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-62", "text": "We now explain how updating \u03b1 parameters can lead to different matrix regularizers." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-63", "text": "Each matrix W consists of M rows where M is the number of tasks." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-64", "text": "Each row is of length D with D the number of model parameters." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-65", "text": "Subvectors L m,k correspond to the parameters of network m at layer k. Each layer consists of two subspaces with parameters G m,k,1 and G m,k,2 ." 
}, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-66", "text": "Our meta-architecture is partly motivated by the observation that for loosely related tasks, it is often beneficial if only certain features in specific layers are shared, while many of the layers and subspaces remain more task-specific ." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-67", "text": "We want to learn what to share while inducing models for the different tasks." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-68", "text": "For simplicity, we ignore subspaces at first and assume only two tasks A and B. The outputs h A,k,t and h B,k,t of the k-th layer for time step t for task A and B respectively interact through the \u03b1 parameters (see Figure 1 )." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-69", "text": "Omitting t for simplicity, the output of the \u03b1 layers is:" }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-70", "text": "where h A,k is a linear combination of the outputs that is fed to the k+1-th layer of task A, and a , b designates the stacking of two vectors a, b \u2208 R D to a matrix M \u2208 R 2\u00d7D ." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-71", "text": "Subspaces (Virtanen, Klami, and Kaski 2011; Bousmalis et al. 2016 ) should allow the model to focus on task-specific and shared features in different parts of its parameter space." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-72", "text": "Extending the \u03b1-layers to include subspaces, for 2 tasks and 2 subspaces, we obtain an \u03b1 matrix \u2208 R 4\u00d74 that not only controls the interaction between the layers of both tasks, but also between their subspaces:" }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-73", "text": "where h A1,k is the output of the first subspace of the k-th layer of task A and h A1,k is the linear combination for the first subspace of task A. 
The input to the k + 1-th layer of task A is then the concatenation of both subspace outputs: h A,k = h A1,k , h A2,k ." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-74", "text": "Different \u03b1 weights correspond to different matrix regularizers \u2126, including several ones that have been proposed previously for multi-task learning." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-75", "text": "We review those in Section 3." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-76", "text": "For now just observe that if all \u03b1-values are set to 0.25 (or any other constant), we obtain hard parameter sharing (Caruana 1993) , which is equivalent to a heavy L 0 matrix regularizer." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-77", "text": "Adding Inductive Bias Naturally, we can also add explicit inductive bias to sluice networks by partially constraining the regularizer or adding to the learned penalty." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-78", "text": "Inspired by work on shared-space component analysis (Salzmann et al. 2010 ), we add a penalty to enforce a division of labor and discourage redundancy between shared and task-specific subspaces." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-79", "text": "While the networks can theoretically learn such a separation, an explicit constraint empirically leads to better results and enables the sluice networks to take better advantage of subspace-specific \u03b1-values." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-80", "text": "We introduce an orthogonality constraint (Bousmalis et al. 2016 ) between the layer-wise subspaces of each model:" }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-81", "text": "F , where M is the number of tasks, K is the number of layers, \u00b7 2 F is the squared Frobenius norm, and G m,k,1 and G m,k,2 are the first and second subspace respectively in the k-th layer of the m-th task model." 
}, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-82", "text": "----------------------------------" }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-83", "text": "**LEARNING MIXTURES**" }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-84", "text": "Many tasks form an implicit hierarchy of low-level to more complex tasks, with intuitive synergies between the low-level tasks and parts of the complex tasks." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-85", "text": "Rather than hard-coding this structure Hashimoto et al. 2017) , we enable our model to learn hierarchical relations by associating different tasks with different layers if this is beneficial for learning." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-86", "text": "Inspired by advances in residual learning (He et al. 2016 ), we employ skip-connections from each layer, controlled using \u03b2 parameters." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-87", "text": "This layer acts as a mixture model, returning a mixture of expert predictions:" }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-88", "text": "where h A,k is the output of layer k of model A, while h A,t is the linear combination of all layer outputs of model A that is fed into the final softmax layer." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-89", "text": "Complexity Our model only adds a minimal number of additional parameters compared to single-task models of the same architecture." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-90", "text": "In our experiments, we add \u03b1 parameters between all task networks." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-91", "text": "As such, they scale linearly with the number of layers and quadratically with the number of tasks and subspaces, while \u03b2 parameters scale linearly with the number of tasks and the number of layers." 
}, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-92", "text": "For a sluice network with M tasks, K layers per task, and 2 subspaces per layer, we thus obtain 4KM^2 additional \u03b1 parameters and KM \u03b2 parameters." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-93", "text": "Training sluice networks is not much slower than training hard parameter sharing networks, with only a 5-7% increase in training time." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-103", "text": "\u2022 The approach to domain adaptation in Daum\u00e9 III (2007), commonly referred to as frustratingly easy domain adaptation, which relies on a shared and a private space for each task or domain, can be encoded in sluice networks by setting all \u03b1_AB- and \u03b1_BA-weights associated with G_{i,k,1} to 0, while setting all \u03b1_AB-weights associated with G_{i,k,2} to \u03b1_BB, and \u03b1_BA-weights associated with G_{i,k,2} to \u03b1_AA." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-104", "text": "Note that Daum\u00e9 III (2007) discusses three subspaces." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-105", "text": "We obtain this space if we only share one half of the second subspaces across the two networks." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-106", "text": "\u2022 propose a low supervision model where only the inner layers of two deep recurrent networks are shared." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-107", "text": "This is obtained using heavy mean-constrained L_0 regularization over the first layer L_{i,1}, e.g.," }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-108", "text": ", while for the auxiliary task, only the first layer \u03b2 parameter is set to 1." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-109", "text": "\u2022 Misra et al. (2016) introduce cross-stitch networks in which \u03b1 values control the flow between layers of two convolutional neural networks." 
}, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-110", "text": "Their model corresponds to setting the \u03b1-values associated with G i,j,1 to be identical to those for G i,j,2 , and by letting all but the \u03b2-value associated with the outer layer be 0." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-111", "text": "In our experiments, we include hard parameter sharing, low supervision, and cross-stitch networks as baselines." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-112", "text": "We do not report results for group lasso and frustratingly easy domain adaptation, which were consistently inferior, by some margin, on development data." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-113", "text": "----------------------------------" }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-114", "text": "**EXPERIMENTS**" }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-115", "text": "A synthetic experiment Our first experiment serves as a sanity check that our meta-architecture learns reasonable sharing architectures by learning \u03b1 weights." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-116", "text": "We also want the \u03b1 to adjust quickly in order not to slow down learning." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-117", "text": "We contrast two partially synthetic pairs of target and auxiliary data." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-118", "text": "In both cases, our target dataset is n instances (sentences) from our part-of-speech tagging dataset (see details below)." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-119", "text": "In the first scenario (Random), the auxiliary dataset is a random relabeling of the same n instances." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-120", "text": "In the second scenario (Copy), the auxiliary dataset is a copy of the n instances." 
}, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-121", "text": "For Random, we would like our \u03b1 parameters to quickly learn that the auxiliary task at best contributes noise injection." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-122", "text": "Initializing our \u03b1 parameters to equal weights (0.25), we therefore hope to see a quick drop in the relative importance of the auxiliary task, given by the ratio \u03b1_BA/\u03b1_AA." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-123", "text": "Seeing n training instances, we expect this number to quickly drop, then stabilize to a slow decrease due to the reduced need for regularization with larger sample sizes." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-124", "text": "----------------------------------" }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-125", "text": "**2**" }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-126", "text": "For Copy, in contrast, we expect no significant change in the relative importance of the auxiliary task over n training instances." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-127", "text": "We use the same hyperparameters as in our subsequent experiments (see below)." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-128", "text": "The parameter settings are thus realistic, and not toy settings." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-130", "text": "The two curves show the expected contrast between an auxiliary task with an all-noise signal (Random) and an auxiliary task with a perfect, albeit redundant, signal (Copy)." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-131", "text": "This experiment shows that our meta-architecture quickly learns a good sharing architecture in clear cases such as Random and Copy." 
}, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-132", "text": "We now proceed to test whether multi-task architecture learning also leads to empirical gains over existing approaches to multi-task learning." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-134", "text": "The quick drop reflects the meta-architecture learning that the auxiliary data is much less useful than the target data; the slight decrease after the first drop reflects the reduced need for regularization due to lower variance with more data." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-135", "text": "Table 1 ." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-136", "text": "Tasks In MTL, one task is usually considered the main task, while other tasks are used as auxiliary tasks to improve performance on the main task." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-137", "text": "As main tasks, we use chunking (CHUNK), named entity recognition (NER), and a simplified version of semantic role labeling (SRL) where we only identify headwords, and pair them with part-of-speech tagging (POS) as an auxiliary task, following ." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-138", "text": "Example annotations for each task can be found in Table 2 ." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-139", "text": "Model We use a state-of-the-art BiLSTM-based sequence labeling model (Plank, S\u00f8gaard, and Goldberg 2016) as the building block of our model." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-140", "text": "The BiLSTM consists of 3 layers with 100 dimensions and uses 64-dimensional word and 100-dimensional character embeddings, which are both randomly initialized." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-141", "text": "The output layer is an MLP with a dimensionality of 100." 
}, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-142", "text": "We initialize \u03b1 parameters with a bias towards one source subspace for each direction and initialize \u03b2 parameters with a bias towards the last layer." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-143", "text": "We have found it most effective to apply the orthogonality constraint only to the weights associated with the LSTM inputs." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-144", "text": "----------------------------------" }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-145", "text": "**TRAINING AND EVALUATION**" }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-146", "text": "We train our models with stochastic gradient descent (SGD), an initial learning rate of 0.1, and learning rate decay." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-147", "text": "During training, we uniformly sample from the data for each task." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-148", "text": "We perform early stopping with a patience of 2 based on the main task and hyperparameter optimization on the in-domain development data of the newswire domain." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-149", "text": "We use the same hyperparameters for all comparison models across all domains." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-150", "text": "We train our models on each domain and evaluate them both on the in-domain test set (Table 3, top) as well as on the test sets of all other domains (Table 3, bottom) to evaluate their out-of-domain generalization ability." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-151", "text": "Note that due to this set-up, our results are not directly comparable to the results reported in , who only train on the WSJ domain and use OntoNotes 4.0." 
}, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-152", "text": "Baseline Models As baselines, we compare against i) a single-task model only trained on chunking; ii) the low supervision model , which predicts the auxiliary task at the first layer; iii) an MTL model based on hard parameter sharing (Caruana 1993); and iv) cross-stitch networks (Misra et al. 2016)." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-153", "text": "We compare these against our complete sluice network with subspace constraints and learned \u03b1 and \u03b2 parameters." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-154", "text": "We implement all models in DyNet (Neubig et al. 2017) and make our code available at https://github.com/sebastianruder/sluice-networks." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-155", "text": "We first assess how well sluice networks perform on in-domain and out-of-domain data compared to the state-of-the-art." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-156", "text": "We evaluate all models on chunking with POS tagging as auxiliary task." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-157", "text": "Chunking results We show results on in-domain and out-of-domain test sets in Table 3 ." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-158", "text": "On average, sluice networks significantly outperform all other model architectures on both in-domain and out-of-domain data and perform best for all domains, except for the telephone conversation (tc) domain, where they are outperformed by cross-stitch networks." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-159", "text": "The performance boost is particularly significant for the out-of-domain setting, where sluice networks add more than 1 point in accuracy compared to hard parameter sharing and almost .5 compared to the strongest baseline on average, demonstrating that sluice networks are particularly useful to help a model generalize better." 
}, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-160", "text": "In contrast to previous studies on MTL (Mart\u00ednez Alonso and Plank 2017; Bingel and S\u00f8gaard 2017; Augenstein, Ruder, and S\u00f8gaard 2018) , our model also consistently outperforms single-task learning." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-161", "text": "Overall, this demonstrates that our meta-architecture for learning which parts of multi-task models to share, with a small set of additional parameters to learn, can achieve significant and consistent improvements over strong baseline methods." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-162", "text": "NER and SRL We now evaluate sluice nets on NER with POS tagging as auxiliary task and simplified semantic role labeling with POS tagging as auxiliary task." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-163", "text": "We show results in Table 4 ." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-164", "text": "Sluice networks outperform the comparison models for both tasks on in-domain test data and successfully generalize to out-of-domain test data on average." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-165", "text": "They yield the best performance on 5 out of 7 domains and 4 out of 7 domains for NER and semantic role labeling." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-166", "text": "Joint model Most work on MTL for NLP uses a single auxiliary task (Bingel and S\u00f8gaard 2017; Mart\u00ednez Alonso and Plank 2017) ." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-167", "text": "In this experiment, we use one sluice network to jointly learn our four tasks on the newswire domain and show results in Table 5 ." 
}, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-168", "text": "Here, the low-level POS tagging and simplified SRL tasks are the only ones that benefit from hard parameter sharing, highlighting that hard parameter sharing by itself is not sufficient for doing effective multi-task learning with semantic tasks." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-169", "text": "We rather require task-specific layers that can be used to transform the shared, low-level representation into a form that is able to capture more fine-grained task-specific knowledge." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-170", "text": "Sluice networks outperform single-task models for all tasks except chunking, and achieve the best performance on 2/4 tasks in this challenging setting." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-171", "text": "----------------------------------" }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-172", "text": "**ANALYSIS**" }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-173", "text": "To better understand the properties and behavior of our meta-architecture, we conduct a series of analyses and ablations." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-174", "text": "Task Properties and Performance Bingel and S\u00f8gaard (2017) correlate meta-characteristics of task pairs and gains compared to hard parameter sharing across a large set of NLP task pairs." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-175", "text": "Similarly, we correlate various meta-characteristics with error reductions and \u03b1, \u03b2 values." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-200", "text": "In soft parameter sharing (Duong et al. 2015) , each task has separate parameters and separate hidden layers, as in our architecture, but the loss at the outer layer is regularized by the current distance between the models." 
}, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-176", "text": "Most importantly, we find that a) multi-task learning gains, also in sluice networks, are higher when there is less training data; and b) sluice networks learn to share more when there is more variance in the training data (cross-task \u03b1s are higher, intra-task \u03b1s lower)." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-177", "text": "Generally, \u03b1 values at the inner layers correlate more strongly with meta-characteristics than \u03b1 values at the outer layers." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-178", "text": "Ablation Analysis Different types of sharing may be more important than others." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-179", "text": "In order to analyze this, we perform an ablation analysis in Table 6 ." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-180", "text": "We investigate the impact of i) the \u03b1 parameters; ii) the \u03b2 parameters; and iii) the division into subspaces with an orthogonality penalty." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-181", "text": "We also evaluate whether concatenation of the outputs of each layer is a reasonable alternative to our mixture model." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-182", "text": "Overall, we find that learnable \u03b1 parameters are preferable over constant \u03b1 parameters." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-183", "text": "Learned \u03b2 parameters marginally outperform skip-connections in the hard parameter sharing setting, while skip-connections are competitive with learned \u03b2 values in the learned \u03b1 setting." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-184", "text": "In addition, modeling subspaces explicitly helps for almost all domains." 
}, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-185", "text": "To our knowledge, this is the first time that subspaces within individual LSTM layers have been shown to be beneficial." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-186", "text": "Being able to effectively partition LSTM weights opens the way to research in inducing more structured neural network representations that encode task-specific priors." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-187", "text": "Finally, concatenation of layer outputs is a viable way to share information across layers, as has also been demonstrated by recent models such as DenseNet." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-188", "text": "----------------------------------" }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-189", "text": "**ANALYSIS OF \u03b1 VALUES**" }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-190", "text": "We analyze the final \u03b1 weights in the sluice networks for Chunking, NER, and SRL, trained with newswire as training data." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-191", "text": "We find that a) for the low-level simplified SRL, there is more sharing at inner layers, which is in line with , while Chunking and NER also rely on the outer layer, and b) more information is shared from the more complex target tasks than vice versa." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-192", "text": "Analysis of \u03b2 values Inspecting the \u03b2 values for the all-tasks sluice net in Table 5 , we find that all tasks place little emphasis on the first layer, but prefer to aggregate their representations in different later layers of the model: The more semantic NER and chunking tasks use the second and third layer to a similar extent, while for POS tagging and simplified SRL the representation of one of the two later layers dominates the prediction." 
}, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-193", "text": "----------------------------------" }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-194", "text": "**RELATED WORK**" }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-195", "text": "Hard parameter sharing (Caruana 1993 ) is easy to implement, reduces overfitting, but is only guaranteed to work for (certain types of) closely related tasks (Baxter 2000; Maurer 2007 )." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-196", "text": "Peng and Dredze (2016) apply a variation of hard parameter sharing to multi-domain multi-task sequence tagging with a shared CRF layer and domain-specific projection layers." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-197", "text": "Yang, Salakhutdinov, and Cohen (2016) use hard parameter sharing to jointly learn different sequence-tagging tasks across languages." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-198", "text": "Mart\u00ednez Alonso and Plank (2017) explore a similar set-up, but sharing is limited to the initial layer." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-199", "text": "In all three papers, the amount of sharing between the networks is fixed in advance." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-201", "text": "Kumar and Daum\u00e9 III (2012) and Maurer, Pontil, and Romera-paredes (2013) enable selective sharing by allowing task predictors to select from sparse parameter bases for homogeneous tasks." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-202", "text": "show that low-level tasks, i.e. syntactic tasks typically used for preprocessing such as POS tagging and NER, should be supervised at lower layers when used as auxiliary tasks." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-203", "text": "Another line of work looks into separating the learned space into a private (i.e. task-specific) and shared space (Salzmann et al. 
2010; Virtanen, Klami, and Kaski 2011) to more explicitly capture the difference between task-specific and cross-task features." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-204", "text": "Constraints are enforced to prevent the models from duplicating information." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-205", "text": "Bousmalis et al. (2016) use shared and private encoders regularized with orthogonality and similarity constraints for domain adaptation for computer vision." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-206", "text": "Liu, Qiu, and Huang (2017) use a similar technique for sentiment analysis." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-207", "text": "In contrast, we do not limit ourselves to a predefined way of sharing, but let the model learn which parts of the network to share using latent variables, the weights of which are learned in an end-to-end fashion." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-208", "text": "Misra et al. (2016) , focusing on applications in computer vision, consider a small subset of the sharing architectures that are learnable in sluice networks, i.e., split architectures, in which two n-layer networks share the innermost k layers with 0 \u2264 k \u2264 n, but learn k with a mechanism very similar to \u03b1-values." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-209", "text": "Our method is also related to the classic mixture-of-experts layer (Jacobs et al. 1991) ." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-210", "text": "In contrast to this approach, our method is designed for multi-task learning and thus encourages a) the sharing of parameters between different task \"experts\" if this is beneficial as well as b) differentiating between low-level and high-level representations." 
}, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-211", "text": "----------------------------------" }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-212", "text": "**CONCLUSION**" }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-213", "text": "We introduced sluice networks, a meta-architecture for multitask architecture search." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-214", "text": "In our experiments across four tasks and seven different domains, the meta-architecture consistently improved over strong single-task learning, architecture learning, and multi-task learning baselines." }, { "sent_id": "c4e2f43e223f61d81d81ac2c9aaa3f-C001-215", "text": "We also showed how our meta-architecture can learn previously proposed architectures for multi-task learning and domain adaptation." } ], "y": { "@BACK@": { "gold_contexts": [ [ "c4e2f43e223f61d81d81ac2c9aaa3f-C001-10" ], [ "c4e2f43e223f61d81d81ac2c9aaa3f-C001-71" ], [ "c4e2f43e223f61d81d81ac2c9aaa3f-C001-166" ], [ "c4e2f43e223f61d81d81ac2c9aaa3f-C001-174" ], [ "c4e2f43e223f61d81d81ac2c9aaa3f-C001-195" ] ], "cite_sentences": [ "c4e2f43e223f61d81d81ac2c9aaa3f-C001-10", "c4e2f43e223f61d81d81ac2c9aaa3f-C001-71", "c4e2f43e223f61d81d81ac2c9aaa3f-C001-166", "c4e2f43e223f61d81d81ac2c9aaa3f-C001-174", "c4e2f43e223f61d81d81ac2c9aaa3f-C001-195" ] }, "@USE@": { "gold_contexts": [ [ "c4e2f43e223f61d81d81ac2c9aaa3f-C001-59" ], [ "c4e2f43e223f61d81d81ac2c9aaa3f-C001-80" ], [ "c4e2f43e223f61d81d81ac2c9aaa3f-C001-152" ] ], "cite_sentences": [ "c4e2f43e223f61d81d81ac2c9aaa3f-C001-59", "c4e2f43e223f61d81d81ac2c9aaa3f-C001-80", "c4e2f43e223f61d81d81ac2c9aaa3f-C001-152" ] }, "@SIM@": { "gold_contexts": [ [ "c4e2f43e223f61d81d81ac2c9aaa3f-C001-76" ], [ "c4e2f43e223f61d81d81ac2c9aaa3f-C001-174", "c4e2f43e223f61d81d81ac2c9aaa3f-C001-175" ] ], "cite_sentences": [ "c4e2f43e223f61d81d81ac2c9aaa3f-C001-76", "c4e2f43e223f61d81d81ac2c9aaa3f-C001-174" ] }, "@DIF@": { "gold_contexts": [ [ 
"c4e2f43e223f61d81d81ac2c9aaa3f-C001-160" ], [ "c4e2f43e223f61d81d81ac2c9aaa3f-C001-166", "c4e2f43e223f61d81d81ac2c9aaa3f-C001-167" ] ], "cite_sentences": [ "c4e2f43e223f61d81d81ac2c9aaa3f-C001-160", "c4e2f43e223f61d81d81ac2c9aaa3f-C001-166" ] } } }, "ABC_412c2daf6d060f520850d187c6eb36_13": { "x": [ { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-19", "text": "This data is then used in a sense disambiguation system." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-2", "text": "Corpus-based sense disambiguation methods, like most other statistical NLP approaches, suffer from the problem of data sparseness." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-3", "text": "In this paper, we describe an approach which overcomes this problem using dictionary definitions." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-4", "text": "Using the definitionbased conceptual co-occurrence data collected from the relatively small Brown corpus, our sense disambiguation system achieves an average accuracy comparable to human performance given the same contextual information." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-5", "text": "----------------------------------" }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-7", "text": "Previous corpus-based sense disambiguation methods require substantial amounts of sense-tagged training data (Kelly and Stone, 1975; Black, 1988 and Hearst, 1991) or aligned bilingual corpora (Brown et al., 1991; Dagan, 1991 and Gale et al. 1992 )." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-8", "text": "Yarowsky (1992) introduces a thesaurus-based approach to statistical sense disambiguation which works on monolingual corpora without the need for sense-tagged training data." 
}, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-9", "text": "By collecting statistical data of word occurrences in the context of different thesaurus categories from a relatively large corpus (10 million words), the system can identify salient words for each category." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-10", "text": "Using these salient words, the system is able to disambiguate polysemous words with respect to thesaurus categories." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-11", "text": "Statistical approaches like these generally suffer from the problem of data sparseness." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-12", "text": "To estimate the salience of a word with reasonable accuracy, the system needs the word to have a significant number of occurrences in the corpus." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-13", "text": "Having large corpora will help, but some words are simply too infrequent to make a significant statistical contribution even in a rather large corpus." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-14", "text": "Moreover, huge corpora are not generally available in all domains, and storage and processing of very large corpora can be problematic in some cases." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-15", "text": "In this paper, we describe an approach which attacks the problem of" }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-16", "text": "data sparseness in automatic statistical sense disambiguation." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-17", "text": "Using definitions from LDOCE (Longman Dictionary of Contemporary English; Procter, 1978), co-occurrence data of concepts, rather than words, is collected from a relatively small corpus, the one-million-word Brown corpus." 
}, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-18", "text": "Since all the definitions in LDOCE are written using words from the 2000-word controlled vocabulary (or in our terminology, defining concepts), even our small corpus is found to be capable of providing statistically significant co-occurrence data at the level of the defining concepts." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-20", "text": "The system is tested on twelve words previously discussed in the sense disambiguation literature." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-21", "text": "The results are found to be comparable to human performance given the same contextual information." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-22", "text": "----------------------------------" }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-23", "text": "**STATISTICAL SENSE DISAMBIGUATION USING DICTIONARY DEFINITIONS**" }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-24", "text": "It is well known that some words tend to co-occur with some words more often than with others." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-25", "text": "Similarly, looking at the meaning of the words, one should find that some concepts co-occur more often with some concepts than with others." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-26", "text": "For example, the concept crime is found to co-occur frequently with the concept punishment." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-27", "text": "This kind of conceptual relationship is not always reflected at the lexical level." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-28", "text": "For instance, in legal reports, the [Footnote: Statistical data is domain dependent." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-29", "text": "Data extracted from a corpus of one particular domain is usually not very useful for processing text of another domain.]" 
}, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-30", "text": "concept crime will usually be expressed by words like offence or felony, etc., and punishment will be expressed by words such as sentence, fine or penalty, etc." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-31", "text": "The large number of different words of similar meaning is the major cause of the data sparseness problem." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-32", "text": "The meaning or underlying concepts of a word are very difficult to capture accurately but dictionary definitions provide a reasonable representation and are readily available." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-33", "text": "2 For instance, the LDOCE definitions of both offence and felony contain the word crime, and all of the definitions of sentence, fine and penalty contain the word punishment." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-34", "text": "To disambiguate a polysemous word, a system can select the sense with a dictionary definition containing defining concepts that co-occur most frequently with the defining concepts in the definitions of the other words in the context." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-35", "text": "In the current experiment, this conceptual co-occurrence data is collected from the Brown corpus." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-36", "text": "----------------------------------" }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-37", "text": "**COLLECTING CONCEPTUAL CO-OCCURRENCE DATA**" }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-38", "text": "Our system constructs a two-dimensional table which records the frequency of co-occurrence of each pair of defining concepts." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-39", "text": "The controlled vocabulary provided by Longman is a list of all the words used in the definitions but, in its crude form, it does not suit our purpose." 
}, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-40", "text": "From the controlled vocabulary, we manually constructed a list of 1792 defining concepts." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-41", "text": "To minimise the size of the table and the processing time, all the closed class words and words which are rarely used in definitions (e.g., the days of the week, the months) are excluded from the list." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-42", "text": "To strengthen the signals, words which have the same semantic root are combined as one element in the list (e.g., habit and habitual are combined as {habit, habitual})." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-43", "text": "The whole LDOCE is pre-processed first." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-44", "text": "For each entry in LDOCE, we construct its corresponding conceptual expansion." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-45", "text": "The conceptual expansion of an entry whose headword is not a defining concept is a set of conceptual sets." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-46", "text": "Each conceptual set corresponds to a sense in the entry and contains all the defining concepts which occur in the definition of the sense." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-47", "text": "The entry of the noun sentence and its corresponding conceptual expansion are shown in Figure 1." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-48", "text": "2 Manually constructed semantic frames could be more useful computationally but building semantic frames for a huge lexicon is an extremely expensive exercise." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-49", "text": "If the headword of an entry is a defining concept DC, the conceptual expansion is given as {{DC}}."
}, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-50", "text": "The corpus is pre-segmented into sentences but not pre-processed in any other way (sense-tagged or part-of-speech-tagged)." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-51", "text": "The context of a word is defined to be the current sentence. The system processes the corpus sentence by sentence and collects conceptual co-occurrence data for each defining concept which occurs in the sentence." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-52", "text": "This allows the whole table to be constructed in a single run through the corpus." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-53", "text": "Since the training data is not sense tagged, the data collected will contain noise due to spurious senses of polysemous words." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-54", "text": "Like the thesaurus-based approach of Yarowsky (1992), our approach relies on the dilution of this noise by its distribution over all the 1792 defining concepts." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-55", "text": "Different words in the corpus have different numbers of senses and different senses have definitions of varying lengths." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-56", "text": "The principle adopted in collecting co-occurrence data is that every pair of content words which co-occur in a sentence should have equal contribution to the conceptual co-occurrence data regardless of the number of definitions (senses) of the words and the lengths of the definitions." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-57", "text": "In addition, the contribution of a word should be evenly distributed between all the senses of a word and the contribution of a sense should be evenly distributed between all the concepts in a sense."
}, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-58", "text": "The algorithm for conceptual co-occurrence data collection is shown in Figure 2." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-59", "text": "----------------------------------" }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-60", "text": "**USING THE CONCEPTUAL CO-OCCURRENCE DATA FOR SENSE DISAMBIGUATION**" }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-61", "text": "To disambiguate a polysemous word W in a context C, which is taken to be the sentence containing W, the system scores each sense S of W, as defined in LDOCE, with respect to C using the following equations." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-62", "text": "----------------------------------" }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-63", "text": "**SCORE(S, C) = SCORE(CS, C') - SCORE(CS, GlobalCS) [1]**" }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-64", "text": "where CS is the corresponding conceptual set of S, C' is the set of conceptual expansions of all content words (which are defined in LDOCE) in C and GlobalCS is the conceptual set containing all the 1792 defining concepts." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-65", "text": "3 The average sentence length of the Brown corpus is 19.4 words." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-66", "text": "Entry in LDOCE: 1. (an order given by a judge which fixes) a punishment for a criminal found guilty in court 2. a group of words that forms a statement, command, exclamation, or question, usu. contains a subject and a verb, and (in writing) begins with a capital letter and ends with one of the marks . ! ?" }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-67", "text": "Conceptual expansion: { {order, judge, punish, crime, criminal, find, guilt, court}, {group, word, form, statement, command, question, contain, subject, verb, write, begin, capital, letter, end, mark} } Figure 1 ."
}, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-68", "text": "The entry of sentence (n.) in LDOCE and its corresponding conceptual expansion." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-69", "text": "Figure 2. The algorithm for conceptual co-occurrence data collection; its first step initialises the conceptual co-occurrence data table." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-70", "text": "----------------------------------" }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-71", "text": "**SCORE(CS, CS') = ( Σ_{DC' ∈ CS'} SCORE(CS, DC') ) / |CS'|**" }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-72", "text": "for any conceptual sets CS and CS' [4]" }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-73", "text": "----------------------------------" }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-74", "text": "**SCORE(CS, DC') = ( Σ_{DC ∈ CS} SCORE(DC, DC') ) / |CS|**" }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-75", "text": "for any conceptual set CS and defining" }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-76", "text": "concept DC' [5] SCORE(DC, DC') = max(0, I(DC, DC'))" }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-77", "text": "for any defining concepts DC and DC' [6] I(DC, DC') is the mutual information (Fano, 1961) between the two defining concepts DC and DC', given by:" }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-78", "text": "----------------------------------" }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-79", "text": "**I(x, y) = log2 ( P(x, y) / ( P(x) · P(y) ) ) = log2 ( f(x, y) · N / ( f(x) · f(y) ) )**" }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-80", "text": "(using the Maximum Likelihood Estimator)." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-81", "text": "f(x, y) is looked up directly from the conceptual co-occurrence data table. Our scoring method is based on a probabilistic model at the conceptual level."
}, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-82", "text": "In a standard model, the logarithm of the probability of occurrence of a conceptual set {x1, x2, ..., xm} in the context of the conceptual set {y1, y2, ..., yn} is given by log2 P(x1, x2, ..., xm | y1, y2, ..., yn) ≈ Σ_{i=1..m} ( Σ_{j=1..n} I(xi, yj) + log2 P(xi) )" }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-85", "text": "assuming that each P(xi) is independent of the others given y1, y2, ..., yn, and each P(yj) is independent of the others given xi, for all xi. Our scoring method deviates from the standard model in a number of aspects:" }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-86", "text": "1. log2 P(xi), the term for the occurrence probability of each of the defining concepts in the sense, is excluded in our scoring method." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-87", "text": "Since the training data is not sense-tagged, the occurrence probability is highly unreliable." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-88", "text": "Moreover, the magnitude of the mutual information is decreased by the noise of the spurious senses while the average magnitude of the occurrence probability is unaffected. Inclusion of the occurrence probability term will lead to the dominance of this term over the mutual information term, resulting in the system favouring the sense with the more frequently occurring defining concepts most of the time." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-89", "text": "2." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-90", "text": "The score of a sense with respect to the current context is normalised by subtracting from it the score of the sense calculated with respect to the GlobalCS (which contains all defining concepts) (see formula [1])."
}, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-91", "text": "5 The occurrence probabilities of some defining concepts will not be independent in some contexts. However, modelling the dependency between different concepts in different contexts will lead to an explosion of the complexity of the model." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-92", "text": "6 The noise only leads to incorrect distribution of the occurrence probability." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-94", "text": "In effect, we are comparing the score of the sense with respect to the current context against the score of the sense with respect to an artificially constructed \"average\" context." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-95", "text": "This is needed to rectify the bias towards the sense(s) with defining concepts of higher average mutual information (over the set of all defining concepts), which is intensified by the ambiguity of the context words." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-96", "text": "3. A negative mutual information score is taken to be 0 ([6])." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-97", "text": "Negative mutual information is unreliable due to the smaller number of data points." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-98", "text": "4." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-99", "text": "The evidence (mutual information score) from multiple defining concepts/words is averaged rather than summed ([2], [4] & [5])." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-100", "text": "This is to compensate for the different lengths of definitions of different senses and different lengths of the context." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-101", "text": "The evidence from a polysemous context word is taken to be the evidence from its sense with the highest mutual information score ([3])."
}, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-102", "text": "This is due to the fact that only one of the senses is used in the given sentence." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-103", "text": "----------------------------------" }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-104", "text": "**EVALUATION**" }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-105", "text": "Our system is tested on the twelve words discussed in Yarowsky (1992) and previous publications on sense disambiguation." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-106", "text": "Results are shown in Table 1 ." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-107", "text": "Our system achieves an average accuracy of 77% on a mean 3-way sense distinction over the twelve words." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-108", "text": "Numerically, the result is not as good as the 92% reported in Yarowsky (1992) ." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-109", "text": "However, direct comparison between the numerical results can be misleading since the experiments are carried out on two very different corpora both in size and genre." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-110", "text": "Firstly, Yarowsky's system is trained with the 10 million word Grolier's Encyclopedia, which is an order of magnitude larger than the Brown corpus used by our system." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-111", "text": "Secondly, and more importantly, the two corpora, which are also the test corpora, are very different in genre." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-112", "text": "Semantic coherence of text, on which both systems rely, is generally stronger in technical writing than in most other kinds of text."
}, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-113", "text": "Statistical disambiguation systems which rely on semantic coherence will generally perform better on technical writing, of which encyclopedia entries can be regarded as one kind, than on most other kinds of text." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-114", "text": "On the other hand, the Brown corpus is a collection of texts of all genres." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-115", "text": "People make use of syntactic, semantic and pragmatic knowledge in sense disambiguation." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-116", "text": "It is not very realistic to expect any system which only possesses semantic coherence knowledge (including ours as well as Yarowsky's) to achieve a very high level of accuracy for all words in general text." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-117", "text": "To provide a better evaluation of our approach, we have conducted an informal experiment aiming at establishing a more reasonable upper bound of the performance of such systems." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-118", "text": "In the experiment, a human subject is asked to perform the same disambiguation task as our system, given the same contextual information. Since our system only uses semantic coherence information and has no deeper understanding of the meaning of the text, the human subject is asked to disambiguate the target word, given a list of all the content words in the context (sentence) of the target word in random order." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-119", "text": "The words are put in random order because the system does not make use of syntactic information of the sentence either." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-120", "text": "The human subject is also allowed access to a copy of LDOCE which the system also uses."
}, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-121", "text": "The results are listed in Table 1 ." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-122", "text": "The actual upper bound of the performance of statistical methods using semantic coherence information only should be slightly better than the performance of the human, since the human is disadvantaged by a number of factors, including but not limited to: 1. it is unnatural for a human to disambiguate in the described manner; 2. the semantic coherence knowledge used by the human is not complete or specific to the current corpus; 3." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-123", "text": "human error." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-124", "text": "However, the results provide a rough approximation of the upper bound of performance of such systems. The human subject achieves an average accuracy of 71% over the twelve words, which is 6% lower than our system." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-125", "text": "More interestingly, the results of the human subject are found to exhibit a similar pattern to the results of our system: the human subject performs better on words and senses for which our system achieves higher accuracy and less well on words and senses for which our system has a lower accuracy." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-126", "text": "----------------------------------" }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-127", "text": "**THE USE OF SENTENCE AS LOCAL CONTEXT**" }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-128", "text": "Another significant point our experiments have shown is that the sentence can also provide enough contextual information for semantic coherence based approaches in a large proportion of cases."
}, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-129", "text": "7 The result is less than conclusive since only one human subject is tested. In order to acquire more reliable results, we are currently seeking a few more subjects to repeat the experiment." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-130", "text": "8 The subject has not read through the whole corpus." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-132", "text": "The average sentence length in the Brown corpus is 19.4 words, which is 5 times smaller than the 100-word window used in Gale et al. (1992) and Yarowsky (1992) ." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-133", "text": "Our approach works well even with a small \"window\" because it is based on the identification of salient concepts rather than salient words." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-134", "text": "In salient word based approaches, due to the problem of data sparseness, many less frequently occurring words which are intuitively salient to a particular word sense will not be identified in practice unless an extremely large corpus is used." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-135", "text": "Therefore the sentence usually does not contain enough identified salient words to provide enough contextual information." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-136", "text": "Using conceptual co-occurrence data, contextual information from the salient but less frequently used words in the sentence will also be utilised through the salient concepts in the conceptual expansions of these words." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-137", "text": "Obviously, there are still cases where the sentence does not provide enough contextual information even using conceptual co-occurrence data, such as when the sentence is too short, and contextual information from a larger context has to be used."
}, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-138", "text": "However, the ability to make use of information in a smaller context is very important because the smaller context always overrules the larger context if their sense preferences are different." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-139", "text": "For example, in a legal trial context, the correct sense of sentence in the clause she was asked to repeat the last word of her previous sentence will be its word sense rather than its legal sense, which would have been selected if a larger context had been used instead." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-140", "text": "9 Analysis of the test samples which our system fails to correctly disambiguate also shows that increasing the window size will benefit the disambiguation process only in a very small proportion of these samples." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-141", "text": "The main cause of errors is the polysemous words in dictionary definitions which we will discuss in Section 6." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-142", "text": "10 Based on 1,004,998 words and 51,763 sentences." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-143", "text": "1. N marks the column with the number of test samples for each sense." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-144", "text": "DBCC (Definition-Based Conceptual Co-occurrence) and Human mark the columns with the results of our system and the human subject in disambiguating the occurrences of the 12 words in the Brown corpus, respectively." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-145", "text": "Thes." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-146", "text": "(thesaurus) marks the column with the results of Yarowsky (1992) tested on the Grolier's Encyclopedia." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-147", "text": "2. 
The \"correct\" sense of each test sample is chosen by hand disambiguation carried out by the author using the sentence as the context." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-148", "text": "A small proportion of test samples cannot be disambiguated within the given context and are excluded from the experiment." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-149", "text": "3. The senses marked with * are used in Yarowsky (1992) but no corresponding sense is found in LDOCE." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-150", "text": "4. The sense marked with ** is defined in LDOCE but not used in Yarowsky (1992) ." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-151", "text": "6. In our experiment, the words are disambiguated between all the senses listed except the ones marked with 7." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-152", "text": "The rare senses listed in LDOCE are not listed here." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-153", "text": "For some of the words, more than one sense listed in LDOCE corresponds to a sense as used in Yarowsky (1992) ." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-154", "text": "In these cases, the senses used by Yarowsky are adopted for easier comparison." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-155", "text": "8. All results are based on 100% recall." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-156", "text": "----------------------------------" }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-157", "text": "**RELATED WORK**" }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-158", "text": "Previous attempts to tackle the data sparseness problem in general corpus-based work include the class-based approaches and similarity-based approaches." 
}, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-159", "text": "In these approaches, relationships between a given pair of words are modelled by analogy with other words that resemble the given pair in some way." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-160", "text": "The class-based approaches (Brown et al., 1992; Resnik, 1992; Pereira et al., 1993) calculate co-occurrence data of words belonging to different classes, rather than individual words, to enhance the co-occurrence data collected and to cover words which have low occurrence frequencies." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-161", "text": "Dagan et al. (1993) argue that using a relatively small number of classes to model the similarity between words may lead to substantial loss of information." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-162", "text": "In the similarity-based approaches (Dagan et al., 1993; Dagan et al., 1994; Grishman et al., 1993), rather than a class, each word is modelled by its own set of similar words derived from statistical data collected from corpora." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-163", "text": "However, deriving these sets of similar words requires a substantial amount of statistical data and thus these approaches require relatively large corpora to start with. Our definition-based approach to statistical sense disambiguation is similar in spirit to the similarity-based approaches, with respect to the \"specificity\" of modelling individual words." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-164", "text": "However, using definitions from existing dictionaries rather than derived sets of similar words allows our method to work on corpora of much smaller sizes." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-165", "text": "In our approach, each word is modelled by its own set of defining concepts."
}, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-166", "text": "Although only 1792 defining concepts are used, the set of all possible combinations (a power set of the defining concepts) is so huge that it is very unlikely that two word senses will have the same combination of defining concepts unless they are almost identical in meaning." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-167", "text": "On the other hand, the thesaurus-based method of Yarowsky (1992) may suffer from loss of information (since it is semi-class-based) as well as data sparseness (since it is based on salient words) and may not perform as well on general text as our approach. 11 Classes used in Resnik (1992) are based on the WordNet taxonomy while classes of Brown et al. (1992) and Pereira et al. (1993) are derived from statistical data collected from corpora." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-168", "text": "----------------------------------" }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-169", "text": "**12**" }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-170", "text": "The corpus used in Dagan et al. (1994) contains 40.5 million words." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-172", "text": "----------------------------------" }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-173", "text": "**LIMITATION AND FURTHER WORK**" }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-174", "text": "Being a dictionary-based method, our approach is naturally limited by the dictionary."
}, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-175", "text": "The most serious problem is that many of the words in the controlled vocabulary of LDOCE are polysemous themselves." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-176", "text": "The result is that many of the 1792 defining concepts in our list actually stand for a number of distinct concepts." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-177", "text": "For example, the defining concept point is used in its place sense, idea sense and sharp end sense in different definitions." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-178", "text": "This affects the accuracy of disambiguating senses which have definitions containing these polysemous words and is found to be the main cause of errors for most of the senses with below-average results." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-179", "text": "We are currently working on ways to disambiguate the words in the dictionary definitions." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-180", "text": "One possible way is to apply the current method of disambiguation to the defining text of the dictionary itself." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-181", "text": "The LDOCE defining text has roughly half a million words in its 41,000 entries, which is half the size of the Brown corpus used in the current experiment." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-182", "text": "Although the result on the dictionary cannot be expected to be as good as the result on the Brown corpus due to the smaller size of the dictionary, the reliability of further co-occurrence data collected and, thus, the performance of the disambiguation system can be improved significantly as long as the disambiguation of the dictionary is considerably more accurate than by chance."
}, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-183", "text": "Our success in using definitions of word senses to overcome the data sparseness problem may also lead to further improvement of sense disambiguation technologies." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-184", "text": "In many cases, semantic coherence information is not adequate to select the correct sense, and knowledge about local constraints is needed." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-185", "text": "For disambiguation of polysemous nouns, these constraints include the modifiers of these nouns and the verbs which take these nouns as objects, etc." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-186", "text": "This knowledge has been successfully acquired from corpora in manual or semi-automatic approaches such as that described in Hearst (1991)." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-188", "text": "However, fully automatic lexically based approaches such as that described in Yarowsky (1992) are very unlikely to be capable of acquiring this finer knowledge because the problem of data sparseness becomes even more serious with the introduction of syntactic constraints." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-187", "text": "13 Hatzivassiloglou (1994) shows that the introduction of linguistic cues improves the performance of a statistical semantic knowledge acquisition system in the context of word grouping." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-189", "text": "Our approach has overcome the data sparseness problem by using the defining concepts of words." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-190", "text": "It is found to be effective in acquiring semantic coherence knowledge from a relatively small corpus." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-191", "text": "It is possible that a similar approach based on dictionary definitions will be successful in acquiring knowledge of local constraints from a reasonably sized corpus."
}, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-192", "text": "----------------------------------" }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-193", "text": "**CONCLUSION**" }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-194", "text": "We have shown that using definition-based conceptual co-occurrence data collected from a relatively small corpus, our sense disambiguation system has achieved accuracy comparable to human performance given the same amount of contextual information." }, { "sent_id": "412c2daf6d060f520850d187c6eb36-C001-195", "text": "By overcoming the data sparseness problem, contextual information from a smaller local context becomes sufficient for disambiguation in a large proportion of cases." } ], "y": { "@BACK@": { "gold_contexts": [ [ "412c2daf6d060f520850d187c6eb36-C001-8" ] ], "cite_sentences": [ "412c2daf6d060f520850d187c6eb36-C001-8" ] }, "@SIM@": { "gold_contexts": [ [ "412c2daf6d060f520850d187c6eb36-C001-54" ], [ "412c2daf6d060f520850d187c6eb36-C001-153" ] ], "cite_sentences": [ "412c2daf6d060f520850d187c6eb36-C001-54", "412c2daf6d060f520850d187c6eb36-C001-153" ] }, "@USE@": { "gold_contexts": [ [ "412c2daf6d060f520850d187c6eb36-C001-105" ], [ "412c2daf6d060f520850d187c6eb36-C001-146" ], [ "412c2daf6d060f520850d187c6eb36-C001-149" ] ], "cite_sentences": [ "412c2daf6d060f520850d187c6eb36-C001-105", "412c2daf6d060f520850d187c6eb36-C001-146", "412c2daf6d060f520850d187c6eb36-C001-149" ] }, "@DIF@": { "gold_contexts": [ [ "412c2daf6d060f520850d187c6eb36-C001-108" ], [ "412c2daf6d060f520850d187c6eb36-C001-132" ], [ "412c2daf6d060f520850d187c6eb36-C001-150" ], [ "412c2daf6d060f520850d187c6eb36-C001-167" ], [ "412c2daf6d060f520850d187c6eb36-C001-188" ] ], "cite_sentences": [ "412c2daf6d060f520850d187c6eb36-C001-108", "412c2daf6d060f520850d187c6eb36-C001-132", "412c2daf6d060f520850d187c6eb36-C001-150", "412c2daf6d060f520850d187c6eb36-C001-167", "412c2daf6d060f520850d187c6eb36-C001-188" ] } } }, 
"ABC_1c89c8f4849d1c8214a3e5f6b9ff1a_14": { "x": [ { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-2", "text": "Text normalization (TN) is an important step in conversational systems." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-3", "text": "It converts written text to its spoken form to facilitate speech recognition, natural language understanding and text-to-speech synthesis." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-4", "text": "Finite state transducers (FSTs) are commonly used to build grammars that handle text normalization (Sproat, 1996; Roark et al., 2012) ." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-5", "text": "However, translating linguistic knowledge into grammars requires extensive effort." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-6", "text": "In this paper, we frame TN as a machine translation task and tackle it with sequence-to-sequence (seq2seq) models." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-7", "text": "Previous research focuses on normalizing a word (or phrase) with the help of limited word-level context, while our approach directly normalizes full sentences." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-8", "text": "We find subword models with additional linguistic features yield the best performance (with a word error rate of 0.17%)." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-9", "text": "----------------------------------" }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-11", "text": "Non-standard words (NSWs) include expressions such as time or date (e.g., 4:58AM, 08-02, 8/2/2018), abbreviations (e.g., ft.) and letter sequences (e.g., IBM, DL) (Sproat et al., 2001) ." 
}, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-12", "text": "They commonly appear in written texts such as websites, books and movie scripts." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-13", "text": "Written form of non-standard words can be normalized/verbalized to a spoken form, e.g., \"August second\"." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-14", "text": "Although there is no incentive for human users to transcribe NSWs into spoken form, it plays an integral role in spoken dialog systems." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-15", "text": "As shown in Figure 1 , automatic speech recognition (ASR), natural language understanding (NLU) and textto-speech synthesis (TTS) components all involve written-to-spoken form normalization or its inverse process, spoken-to-written text normalization (ITN)." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-16", "text": "ASR normalizes the training corpus before building its language model." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-17", "text": "Among many benefits, such a model can reduce the size of the required vocabulary and address data sparsity issues." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-18", "text": "NLU might adopt ITN to recover the written text from ASR in run-time (e.g., \"five p m \" \u2192 \"5:00PM\")." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-19", "text": "In text-to-speech synthesis, for example, in order to pronounce \"221B Baker St\", TTS needs to first convert the text to \"two twenty one b baker street\" and then generate the audio signal." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-20", "text": "Normalizing the written form text to its spoken form is difficult due to the following bottlenecks:" }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-21", "text": "1. Lack of supervision -there is no incentive for people to produce spoken form text." 
}, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-22", "text": "Thus, it is hard to obtain a supervised dataset for training machine learning models;" }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-23", "text": "2. Ambiguity -for written text, a change in context may require a different normalization." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-24", "text": "For example, \"2/3\" can be verbalized as a date or fraction depending on the meaning of the sentence." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-25", "text": "Traditionally, the task of NSW normalization has been approached by manually authoring grammars in the form of finite-state transducers (Sproat, 1996; Roark et al., 2012) such as integer grammars (e.g., \"26\" \u2192 \"twenty six\") or time grammars (e.g., \"5:26\" \u2192 \"five twenty six\")." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-26", "text": "Constructing such grammars is time consuming and error-prone and requires extensive linguistic knowledge and programming proficiency." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-27", "text": "Recently, with the rise of machine learning and especially deep learning techniques, researchers are starting to bring more data-driven approaches to this field (Sproat and Jaitly, 2016) ." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-28", "text": "In this paper, we present our approach to nonstandard text normalization via machine translation techniques, where the source and target are written and spoken form text, respectively." 
}, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-29", "text": "----------------------------------" }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-30", "text": "**RELATED WORK 2.1 FINITE STATE TRANSDUCER**" }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-31", "text": "Normalizing written-form text to its spoken form has been approached by authoring weighted finite state transducer (WFST) grammars to handle individual categories of NSW (e.g., time, date) and subsequently join them together (Sproat, 1996; Roark et al., 2012) ." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-32", "text": "One bottleneck to this approach is the heavy demand of translating linguistic knowledge into WFSTs." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-33", "text": "A second problem is a lack of context awareness." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-34", "text": "For example, \"dr.\" may refer to \"doctor\" or \"drive\" in different contexts." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-35", "text": "We have observed accuracy improvements by using an n-gram LM to re-rank hypotheses generated by WFSTs." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-36", "text": "However, an n-gram LM's context awareness is limited." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-37", "text": "----------------------------------" }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-38", "text": "**DATA-DRIVEN APPROACHES**" }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-39", "text": "Recently, methods based on neural networks have been applied to TN and ITN (Sproat and Jaitly, 2016; Pusateri et al., 2017; Yolchuyeva et al., 2018) ." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-40", "text": "To overcome one of the biggest problems -a lack of supervision, WFSTs have been used to transform large amounts of written-form text to its spoken form." 
}, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-41", "text": "Researchers hope a vast amount of such data can counteract the errors inherited in WFST-based models." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-42", "text": "Recent data-driven approaches examine window-based sequence-to-sequence (seq2seq) models and convolutional neural networks (CNN) to normalize a central piece of text with the help of context (Sproat and Jaitly, 2016; Yolchuyeva et al., 2018) ." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-43", "text": "Window-based methods have the advantage of limiting the output vocabulary size, as most tokens that do not need to be transformed are labeled with a special token." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-44", "text": "Hybrid neural/WFST models have also been proposed and applied to the text normalization problem (Pusateri et al., 2017; Yolchuyeva et al., 2018) ." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-45", "text": "Tokens in the input are first tagged with labels using machine learned models whereupon a handcrafted grammar corresponding to each label conducts conversion." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-46", "text": "In both methods, a tagger is needed to first segment/label the input tokens and conversion must be applied to each segment to normalize a full sentence." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-47", "text": "Our seq2seq model does not require the aforementioned tagger (although could benefit from the tagger as we will show later) and directly translates a written-form sentence to its spoken form without grammars." 
}, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-48", "text": "----------------------------------" }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-49", "text": "**MODEL**" }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-50", "text": "----------------------------------" }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-51", "text": "**BASELINE MODELS**" }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-52", "text": "Following Sproat and Jaitly (2016), we implement a seq2seq model trained on window-based data." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-53", "text": "Table 1 illustrates the window-based model's training examples corresponding to one sentence \"wake me up at 8 AM .\" which is broken down into 6 pairs." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-54", "text": " and indicate the center of the window." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-55", "text": "A window center might contain 1 or more words (e.g., \"8 AM\") and the grouping is provided by the dataset where each input sentence is segmented into chunks corresponding to labels such as TIME, DATE, ORDINAL (Sproat and Jaitly, 2016) ." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-56", "text": "The model outputs tokens which correspond to the center of the window." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-57", "text": "Input Output wake me wake me up me up at " }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-58", "text": "The model architecture is similar to Chan et al. (2016) and uses attention to align the output tokens with input characters as in Bahdanau et al. (2014) ." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-59", "text": "The encoder takes character sequences as input." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-60", "text": "Otherwise, sequences of numbers or dates (e.g., 2018-08-04) are hard to interpret." 
}, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-61", "text": "On the output side, we believe various granularities such as character, word or word fragments can be suitable." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-62", "text": "Following the literature, we used a word level decoder." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-63", "text": "A window-based seq2seq model, although able to attend well to a central piece of text, is not practical for applying over a whole sentence." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-64", "text": "To extend the model to full sentences, we break source sentences into segments." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-65", "text": "We then apply the model to one segment after another and concatenate their output tokens to produce full sentences." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-66", "text": "As our second baseline, a seq2seq model is trained with full sentence data." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-67", "text": "As a result, it does not require any pre-processing step to generate windows of text." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-68", "text": "It directly translates a sentence to its spoken form." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-69", "text": "Again, the encoder works at the character level while the decoder output sequences of words while attention is used to align the input and output sequences." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-70", "text": "----------------------------------" }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-71", "text": "**PROPOSED MODEL**" }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-72", "text": "There are several issues with the baseline seq2seq models." 
}, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-73", "text": "First of all, although there is no out-ofvocabulary (OOV) problem on the input side since it is modeled as a sequence of characters, the decoder has an OOV issue-we cannot model every possible token." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-74", "text": "The window-based seq2seq adopts a special output token that significantly reduces the output vocabulary size." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-75", "text": "This is not practical in the full sentence baseline as it requires the additional step of mapping each in the output to a word in the input." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-76", "text": "Subwords have been shown to work well in open-vocabulary speech recognition and machine translation tasks (Sennrich et al., 2015; Qin et al., 2011) ." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-77", "text": "Subwords (i.e., a grouping of one or more characters) capture frequently co-occurring character combinations." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-78", "text": "For example, the word \"subword\" might be decomposed into two parts: \" sub\" and \"word\", where \" \" indicates the start of a word." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-79", "text": "An extreme case of the subword model is a character model." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-80", "text": "Compared with only characters, we believe segmenting input/output into subwords eases a seq2seq model's burden of modeling longdistance dependencies." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-81", "text": "Sennrich and Haddow (2016) have shown that the addition of linguistic features can improve the quality of neural machine translation models." 
}, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-82", "text": "We observe that features such as casing and part-ofspeech tags can also provide helpful insights into how a NSW should be normalized." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-83", "text": "For example, \"US\" should be normalized to \"u s\" instead of \"us\"." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-84", "text": "Similarly, part-of-speech tags can help the model decide how to verbalize ambiguous forms such as \"resume\", which is kept as-is as a verb or read out as \"r\u00e9sum\u00e9\" as a noun." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-85", "text": "In regards to subwords, it is important to know where the fragment comes from -beginning, middle, end of a word or the full word." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-86", "text": "For example, \"id\" should be normalized as \"id\" if it comes from the beginning of a word like \"idea\"." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-87", "text": "However, it could also be verbalized as \"i d\" when taken as a standalone word." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-88", "text": "----------------------------------" }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-89", "text": "**LINGUISTIC FEATURES**" }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-90", "text": "In this paper, we explore linguistic features that are inexpensive to compute such as casing, POS, and positional features." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-91", "text": "We also use the edit labels from Google's dataset (e.g., TIME, DATE) although we acknowledge these labels are expensive and often times not accessible." 
}, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-92", "text": "----------------------------------" }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-93", "text": "**EXPERIMENTS AND RESULTS**" }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-94", "text": "----------------------------------" }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-95", "text": "**DATASET**" }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-96", "text": "The data for the window-based seq2seq model and full sentence seq2seq were generated from the publicly available release of parallel written/speech formatted text from Sproat and Jaitly (2016) ." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-97", "text": "The set consists of Wikipedia text which was processed through Google TTS's Kestrel text normalization system relying primarily on handcrafted rules to produce speech-formatted text." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-98", "text": "Although a large parallel dataset is available for English, we consider the feasibility of developing neural models for other languages which may not have text normalization systems in place." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-99", "text": "Therefore, we choose to scale the training data size to a limited set of text which could be generated by annotators in a reasonable time frame." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-100", "text": "As summarized in Table 2 , both window-based and sentencebased models are trained with 500K training instances." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-101", "text": "Our datasets were randomly sampled from a set of 4.9M sentences in the training data portion of the Sproat and Jaitly (2016) data release and split into training, validation, and test data." 
}, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-102", "text": "However, the training data for window-based and sentencebased models are not identical due to differences in input configurations." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-103", "text": "While the window-based model uses 500K randomly sampled windows, the sentence-based models use 500K sentences." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-104", "text": "For testing, 62.5K identical test sentences are used across all models." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-105", "text": "In order to decode sentences with the window-based model, sentences are first segmented into windows before inference." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-106", "text": "Among 16 edit labels available in the dataset release, we found the normalization target for Table 2 : Size of training, validation, and test datasets." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-107", "text": "For the window-baseline, the data are pairs of windows and the normalization of the central piece of the window." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-136", "text": "Learning rate decayed at a rate of 0.5 if perplexity on the validation set did not decrease or after 50K steps." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-108", "text": "For the sent-baseline and subword models, the data are pairs of sentences but in different formats -sent-baseline: (character sequence, word sequence); subword: (subword sequence, subword sequence)." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-109", "text": "All models are evaluated on the same set of 62.5K sentences." 
}, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-110", "text": "----------------------------------" }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-111", "text": "**MODEL**" }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-112", "text": "Train ELECTRONIC text is not suitable for our system as it primarily reads out URLs letter by letter, e.g., \"Forbes.com\" \u2192 \"f o r b e s dot c o m\" (as opposed to \"forbes dot com\")." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-113", "text": "Therefore, we exclude ELECTRONIC data in our experiments." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-114", "text": "There are large numbers of tokens present in the dataset." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-115", "text": "We follow Sproat and Jaitly (2016) in down-sampling window-based training data to constrain the proportion of \"\" tokens to 10% of the data." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-116", "text": "For training sentence-based models, the source sentence is segmented into characters while the target sentence is broken into tokens." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-117", "text": "For the subword model, both the source and target sentences are segmented into subword sequences." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-118", "text": "Subword units are concatenated to words for evaluation." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-119", "text": "----------------------------------" }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-120", "text": "**BASELINE MODEL SETUP**" }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-121", "text": "Our first approach replicates the window-based seq2seq model of Sproat and Jaitly (2016) ." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-122", "text": "The model encodes the central piece of text (1 or more tokens) including its context of N previous and following tokens at the character level." 
}, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-123", "text": "The output is a target token or a sequence of tokens." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-124", "text": "The input vocabulary consists of 250 common characters including letters, digits and symbols (e.g., $)." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-125", "text": "The decoder vocabulary consists of 1K tokens including and , the latter of which is used to normalize punctuation." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-126", "text": "Following Chan et al. (2016) , we use a stacked (2-layer) bi-directional long short term memory network (bi-LSTM) as encoder and a stacked (2-layer) LSTM as decoder." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-127", "text": "We use 512 hidden states for the (bi-)LSTM." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-128", "text": "A softmax output distribution is computed over output vocabulary at each decoding step." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-129", "text": "Decoding uses the attention mechanism from Bahdanau et al. (2014) and a beam size of 5." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-130", "text": "Word and character embeddings are trained from scratch." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-131", "text": "We use the OpenNMT toolkit (Klein et al., 2017) to train our models on a single P2.8xlarge Amazon EC2 instance." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-132", "text": "Models were trained with Stochastic Gradient Descent (SGD) on 200K timesteps (approximately 13 epochs)." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-133", "text": "Approaching 200K timesteps, a significant decay in accuracy and plateau in perplexity of the validation set occurred for all models." 
}, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-134", "text": "Validation occurred every 10K timesteps and the number of timesteps was chosen based on maximum accuracy on the validation data." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-135", "text": "The learning rate was tuned to 1.0 for the window-based model and 0.5 for sentence-based models to achieve optimal performance." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-137", "text": "A dropout of 0.3 was used across all models." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-138", "text": "Figure 2: Evaluation of the window-based model." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-139", "text": "Categories are sorted by frequency." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-140", "text": "* TELEPHONE is not reported in Sproat and Jaitly (2016) but included in the dataset; ** we removed ELECTRONIC category." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-141", "text": "As shown in Figure 2 , our replicated windowbased model achieves reasonable performance compared with Sproat and Jaitly (2016) , considering our training set is much smaller." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-142", "text": "There are 16 different edit labels shown." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-143", "text": "Data with TELEPHONE labels were not included in the initial analysis of Sproat and Jaitly (2016) , but were made available in the dataset release." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-144", "text": "For our second baseline model which operates on whole sentences, on the input side, we still use 250 common characters." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-145", "text": "However, due to the removal of the token, the output space is drastically extended from 1K tokens to 45K tokens." 
}, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-146", "text": "Thus, it becomes increasingly difficult for the model to learn and predict." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-147", "text": "----------------------------------" }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-148", "text": "**SUBWORD INVENTORY**" }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-149", "text": "A subword inventory can be populated by datadriven approaches such as Byte Pair Encoding (BPE) (Sennrich et al., 2015) ." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-150", "text": "Text is first split into character sequences and the most frequently co-occurring units are greedily merged into one subword unit." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-151", "text": "This procedure continues until the desired subword inventory size is reached." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-152", "text": "Here, we enforce that two units cannot be merged if they cross a word boundary." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-153", "text": "To avoid OOV words, we also populate the subword inventory with the 250 most common characters used in the baseline model and digits 0-9." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-154", "text": "In data preparation, we force the subword model to split digits into a single subword piece (e.g., \"1234\" \u2192 \" 1 2 3 4\"), regardless of whether a certain combination of numbers co-occur frequently (e.g., \"19\")." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-155", "text": "Tokenizing digits is beneficial when interpreting large sequences of numbers where every digit must be read out (e.g., 1,342 \u2192 \"one thousand three hundred and forty two\")." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-156", "text": "In this work, we use the SentencePiece toolkit 1 and vary the inventory size." 
}, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-157", "text": "One can imagine that a larger subword inventory may contain longer subword entries." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-158", "text": "For example, the word \"anthology\" is split into \" an th ology\" by a subword model of 2K size and \" anth ology\" by a model of 8K size." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-159", "text": "Our experiments find that an inventory size between 1K and 2K yields the best WER and SER (see Table 3 )." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-160", "text": "For the rest of the paper, we use 2K." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-161", "text": "Table 4 summarizes the performance of each model." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-162", "text": "We report sentence-error-rate (SER), word-error-rate (WER), BLEU score (Papineni et al., 2002) and latency (millisecond per input sentence), measured on the test set." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-163", "text": "We also report number of parameters and training time." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-164", "text": "----------------------------------" }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-165", "text": "**OVERALL PERFORMANCE**" }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-166", "text": "For the identity model, we replaced all nonalphanumerical characters in the source data with \"\", except for spaces." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-167", "text": "As expected, this model generates a large number of errors." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-168", "text": "When evaluated on full sentences, the window-based model yields a reasonable accuracy, although it leverages a limited context." 
}, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-169", "text": "On the other hand, although the sentence baseline is directly trained on full sentences, its WER and SER are both worse than the window-based approach." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-170", "text": "The expansion of the output space significantly increases the trainable parameters from 10M to 55M, leading to more difficulties in training and inference." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-171", "text": "As shown in Table 4 , the subword model significantly outperforms baseline models in both accuracy and inference speed." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-172", "text": "Due to the source of the dataset (i.e., Wikipedia), test set and training set have an overlap of about 27%." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-173", "text": "For instance, several source citations were commonly found in Wikipedia articles and appeared in training and test (e.g., \"IUCN Red List of Threatened Species.\")." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-174", "text": "We found that, for sentences that were not seen by the subword model in training, our model still produces reliable outputs with a SER of 4.59% and WER of 1.09%." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-175", "text": "Figure 3 demonstrates that the attention mechanism can effectively learn the non-monotonous nature of the text normalization problem as \"eleventh\", \"November\" and \"eleven\" correspond to the third, second and first \"11\" in the input." 
}, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-176", "text": "----------------------------------" }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-177", "text": "**LINGUISTIC FEATURES**" }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-178", "text": "We use the following linguistic features: 1) capitalization:" }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-179", "text": "upper, lower, mixed, nonalphanumerical, foreign characters; 2) position: (Bird et al., 2009) ." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-180", "text": "Edit labels are the most expensive to obtain in real life." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-181", "text": "Our labels are generated directly from the Google FST (Sproat and Jaitly, 2016) ." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-182", "text": "Each type of feature is represented by a one-hot encoding." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-183", "text": "To combine linguistic features with subword units, one can add or concatenate each subword's embedding with its corresponding linguistic feature embedding and feed a combined embedding to the bi-LSTM encoder." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-184", "text": "Or, a multi-layer perceptron (MLP) can be applied to combine information in a non-linear way." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-185", "text": "Our experiments find that concatenation outperforms the other two methods." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-186", "text": "In Table 4 we can see that the subword model with linguistic features produces the lowest SER (0.78%) and WER (0.17%)." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-187", "text": "In addition, results from the ablation study show that each feature makes a positive contribution to the model." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-188", "text": "However, edit labels seem to make the strongest contribution." 
}, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-189", "text": "We acknowledge that edit labels may not always be readily available." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-190", "text": "The model which utilizes all linguistic features except for edit labels still shows a 16% relative SER reduction and 14% WER reduction over the subword model without linguistic features." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-191", "text": "----------------------------------" }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-192", "text": "**DISCUSSION**" }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-193", "text": "Errors from the subword model are presented in Table 5 ." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-194", "text": "Severe errors are shown in the first two rows." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-195", "text": "While these types of errors are infrequent, they change or obscure the meaning of the utterance for a user." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-196", "text": "For example, the currency \"nok\" (e.g., \"norwegian kroner\") was verbalized as \"euros\", reflecting a bias in the training data." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-197", "text": "While \"euros\" appeared 88 times, \"norwegian kroner\" appeared just 10 times." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-198", "text": "Another type of error does not change the sentence meaning but can be unnatural." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-199", "text": "For example, \"alexander iii\" was predicted as \"alexander three\" rather than \"alexander the third\"." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-200", "text": "In this case, the referent of the sentence would likely be understandable given context." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-201", "text": "Examples such as \"5' 11\"\" reflect the variety of natural readings which a human might produce. 
\"Five foot eleven inches\", \"five foot eleven\", and \"five eleven\" may all refer to a person's height." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-202", "text": "Here the reference and model have produced different but acceptable variations." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-203", "text": "A fundamental problem is the lack of supervised data for training and evaluation, particularly data which reflects the variety of acceptable readings of non-standard text." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-204", "text": "The pairs in this study (and in other text normalization research) are generated by a system which does not have the full capability to verbalize sentences in different but natural ways." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-205", "text": "Our system's normalization WER and SER may not translate proportionally to ASR's WER and SER, simply because real users will read non-standard text in a variety of ways." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-206", "text": "It remains a challenge for the academic community to come up with better data solutions." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-207", "text": "----------------------------------" }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-208", "text": "**CONCLUSION**" }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-209", "text": "In this paper, we investigate neural approaches to text normalization which directly translate a written-form sentence to its spoken counterpart without the need of a tagger or grammar." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-210", "text": "We show that the use of subwords can effectively reduce the OOV problem of a baseline seq2seq model with character inputs and token outputs." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-211", "text": "The addition of linguistic features including casing, word position, POS tags, and edit labels leads to further gains." 
}, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-212", "text": "We empirically test the addition of each linguistic feature revealing that all features make a contribution to the model, and combining features results in the best performance." }, { "sent_id": "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-213", "text": "Our model is an improvement over both window-based and sentence-based seq2seq baselines, yielding a WER of 0.17%." } ], "y": { "@BACK@": { "gold_contexts": [ [ "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-27" ], [ "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-39" ], [ "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-42" ] ], "cite_sentences": [ "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-27", "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-39", "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-42" ] }, "@MOT@": { "gold_contexts": [ [ "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-26", "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-27" ] ], "cite_sentences": [ "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-27" ] }, "@USE@": { "gold_contexts": [ [ "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-52" ], [ "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-53", "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-54", "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-55" ], [ "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-96" ], [ "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-101" ], [ "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-115" ], [ "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-121" ], [ "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-181" ] ], "cite_sentences": [ "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-52", "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-55", "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-96", "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-101", "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-115", "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-121", "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-181" ] }, "@DIF@": { "gold_contexts": [ [ "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-140" ], [ "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-143" ] ], "cite_sentences": [ "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-140", "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-143" ] }, "@SIM@": { "gold_contexts": [ [ 
"1c89c8f4849d1c8214a3e5f6b9ff1a-C001-141" ] ], "cite_sentences": [ "1c89c8f4849d1c8214a3e5f6b9ff1a-C001-141" ] } } }, "ABC_73d7831596bfe6d6861f360042048f_14": { "x": [ { "sent_id": "73d7831596bfe6d6861f360042048f-C001-111", "text": "**MODEL**" }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-139", "text": "Categories are sorted by frequency." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-167", "text": "As expected, this model generates a large number of errors." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-194", "text": "Severe errors are shown in the first two rows." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-56", "text": "The model outputs tokens which correspond to the center of the window." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-57", "text": "Input Output wake me wake me up me up at " }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-5", "text": "However, translating linguistic knowledge into grammars requires extensive effort." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-2", "text": "Text normalization (TN) is an important step in conversational systems." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-3", "text": "It converts written text to its spoken form to facilitate speech recognition, natural language understanding and text-to-speech synthesis." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-4", "text": "Finite state transducers (FSTs) are commonly used to build grammars that handle text normalization (Sproat, 1996; Roark et al., 2012) ." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-109", "text": "All models are evaluated on the same set of 62.5K sentences." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-6", "text": "In this paper, we frame TN as a machine translation task and tackle it with sequence-to-sequence (seq2seq) models." 
}, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-7", "text": "Previous research focuses on normalizing a word (or phrase) with the help of limited word-level context, while our approach directly normalizes full sentences." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-8", "text": "We find subword models with additional linguistic features yield the best performance (with a word error rate of 0.17%)." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-9", "text": "----------------------------------" }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-11", "text": "Non-standard words (NSWs) include expressions such as time or date (e.g., 4:58AM, 08-02, 8/2/2018), abbreviations (e.g., ft.) and letter sequences (e.g., IBM, DL) (Sproat et al., 2001) ." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-12", "text": "They commonly appear in written texts such as websites, books and movie scripts." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-13", "text": "Written form of non-standard words can be normalized/verbalized to a spoken form, e.g., \"August second\"." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-14", "text": "Although there is no incentive for human users to transcribe NSWs into spoken form, it plays an integral role in spoken dialog systems." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-15", "text": "As shown in Figure 1 , automatic speech recognition (ASR), natural language understanding (NLU) and textto-speech synthesis (TTS) components all involve written-to-spoken form normalization or its inverse process, spoken-to-written text normalization (ITN)." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-16", "text": "ASR normalizes the training corpus before building its language model." 
}, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-17", "text": "Among many benefits, such a model can reduce the size of the required vocabulary and address data sparsity issues." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-18", "text": "NLU might adopt ITN to recover the written text from ASR in run-time (e.g., \"five p m \" \u2192 \"5:00PM\")." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-19", "text": "In text-to-speech synthesis, for example, in order to pronounce \"221B Baker St\", TTS needs to first convert the text to \"two twenty one b baker street\" and then generate the audio signal." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-20", "text": "Normalizing the written form text to its spoken form is difficult due to the following bottlenecks:" }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-21", "text": "1. Lack of supervision -there is no incentive for people to produce spoken form text." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-22", "text": "Thus, it is hard to obtain a supervised dataset for training machine learning models;" }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-23", "text": "2. Ambiguity -for written text, a change in context may require a different normalization." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-24", "text": "For example, \"2/3\" can be verbalized as a date or fraction depending on the meaning of the sentence." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-25", "text": "Traditionally, the task of NSW normalization has been approached by manually authoring grammars in the form of finite-state transducers (Sproat, 1996; Roark et al., 2012) such as integer grammars (e.g., \"26\" \u2192 \"twenty six\") or time grammars (e.g., \"5:26\" \u2192 \"five twenty six\")." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-26", "text": "Constructing such grammars is time consuming and error-prone and requires extensive linguistic knowledge and programming proficiency." 
}, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-27", "text": "Recently, with the rise of machine learning and especially deep learning techniques, researchers are starting to bring more data-driven approaches to this field (Sproat and Jaitly, 2016) ." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-28", "text": "In this paper, we present our approach to nonstandard text normalization via machine translation techniques, where the source and target are written and spoken form text, respectively." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-29", "text": "----------------------------------" }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-30", "text": "**RELATED WORK 2.1 FINITE STATE TRANSDUCER**" }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-31", "text": "Normalizing written-form text to its spoken form has been approached by authoring weighted finite state transducer (WFST) grammars to handle individual categories of NSW (e.g., time, date) and subsequently join them together (Sproat, 1996; Roark et al., 2012) ." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-32", "text": "One bottleneck to this approach is the heavy demand of translating linguistic knowledge into WFSTs." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-33", "text": "A second problem is a lack of context awareness." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-34", "text": "For example, \"dr.\" may refer to \"doctor\" or \"drive\" in different contexts." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-35", "text": "We have observed accuracy improvements by using an n-gram LM to re-rank hypotheses generated by WFSTs." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-36", "text": "However, an n-gram LM's context awareness is limited." 
}, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-37", "text": "----------------------------------" }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-38", "text": "**DATA-DRIVEN APPROACHES**" }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-39", "text": "Recently, methods based on neural networks have been applied to TN and ITN (Sproat and Jaitly, 2016; Pusateri et al., 2017; Yolchuyeva et al., 2018) ." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-40", "text": "To overcome one of the biggest problems -a lack of supervision, WFSTs have been used to transform large amounts of written-form text to its spoken form." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-41", "text": "Researchers hope a vast amount of such data can counteract the errors inherited in WFST-based models." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-42", "text": "Recent data-driven approaches examine window-based sequence-to-sequence (seq2seq) models and convolutional neural networks (CNN) to normalize a central piece of text with the help of context (Sproat and Jaitly, 2016; Yolchuyeva et al., 2018) ." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-43", "text": "Window-based methods have the advantage of limiting the output vocabulary size, as most tokens that do not need to be transformed are labeled with a special token." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-44", "text": "Hybrid neural/WFST models have also been proposed and applied to the text normalization problem (Pusateri et al., 2017; Yolchuyeva et al., 2018) ." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-45", "text": "Tokens in the input are first tagged with labels using machine learned models whereupon a handcrafted grammar corresponding to each label conducts conversion." 
}, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-46", "text": "In both methods, a tagger is needed to first segment/label the input tokens and conversion must be applied to each segment to normalize a full sentence." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-47", "text": "Our seq2seq model does not require the aforementioned tagger (although could benefit from the tagger as we will show later) and directly translates a written-form sentence to its spoken form without grammars." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-48", "text": "----------------------------------" }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-49", "text": "**MODEL**" }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-50", "text": "----------------------------------" }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-51", "text": "**BASELINE MODELS**" }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-52", "text": "Following Sproat and Jaitly (2016), we implement a seq2seq model trained on window-based data." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-53", "text": "Table 1 illustrates the window-based model's training examples corresponding to one sentence \"wake me up at 8 AM .\" which is broken down into 6 pairs." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-54", "text": " and indicate the center of the window." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-55", "text": "A window center might contain 1 or more words (e.g., \"8 AM\") and the grouping is provided by the dataset where each input sentence is segmented into chunks corresponding to labels such as TIME, DATE, ORDINAL (Sproat and Jaitly, 2016) ." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-58", "text": "The model architecture is similar to Chan et al. (2016) and uses attention to align the output tokens with input characters as in Bahdanau et al. (2014) ." 
}, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-59", "text": "The encoder takes character sequences as input." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-60", "text": "Otherwise, sequences of numbers or dates (e.g., 2018-08-04) are hard to interpret." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-61", "text": "On the output side, we believe various granularities such as character, word or word fragments can be suitable." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-62", "text": "Following the literature, we used a word level decoder." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-63", "text": "A window-based seq2seq model, although able to attend well to a central piece of text, is not practical for applying over a whole sentence." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-64", "text": "To extend the model to full sentences, we break source sentences into segments." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-65", "text": "We then apply the model to one segment after another and concatenate their output tokens to produce full sentences." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-66", "text": "As our second baseline, a seq2seq model is trained with full sentence data." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-67", "text": "As a result, it does not require any pre-processing step to generate windows of text." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-68", "text": "It directly translates a sentence to its spoken form." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-69", "text": "Again, the encoder works at the character level while the decoder output sequences of words while attention is used to align the input and output sequences." 
}, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-70", "text": "----------------------------------" }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-71", "text": "**PROPOSED MODEL**" }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-72", "text": "There are several issues with the baseline seq2seq models." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-73", "text": "First of all, although there is no out-ofvocabulary (OOV) problem on the input side since it is modeled as a sequence of characters, the decoder has an OOV issue-we cannot model every possible token." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-74", "text": "The window-based seq2seq adopts a special output token that significantly reduces the output vocabulary size." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-75", "text": "This is not practical in the full sentence baseline as it requires the additional step of mapping each in the output to a word in the input." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-76", "text": "Subwords have been shown to work well in open-vocabulary speech recognition and machine translation tasks (Sennrich et al., 2015; Qin et al., 2011) ." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-77", "text": "Subwords (i.e., a grouping of one or more characters) capture frequently co-occurring character combinations." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-78", "text": "For example, the word \"subword\" might be decomposed into two parts: \" sub\" and \"word\", where \" \" indicates the start of a word." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-79", "text": "An extreme case of the subword model is a character model." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-80", "text": "Compared with only characters, we believe segmenting input/output into subwords eases a seq2seq model's burden of modeling longdistance dependencies." 
}, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-81", "text": "Sennrich and Haddow (2016) have shown that the addition of linguistic features can improve the quality of neural machine translation models." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-82", "text": "We observe that features such as casing and part-ofspeech tags can also provide helpful insights into how a NSW should be normalized." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-83", "text": "For example, \"US\" should be normalized to \"u s\" instead of \"us\"." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-110", "text": "----------------------------------" }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-84", "text": "Similarly, part-of-speech tags can help the model decide how to verbalize ambiguous forms such as \"resume\", which is kept as-is as a verb or read out as \"r\u00e9sum\u00e9\" as a noun." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-85", "text": "In regards to subwords, it is important to know where the fragment comes from -beginning, middle, end of a word or the full word." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-86", "text": "For example, \"id\" should be normalized as \"id\" if it comes from the beginning of a word like \"idea\"." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-87", "text": "However, it could also be verbalized as \"i d\" when taken as a standalone word." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-88", "text": "----------------------------------" }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-89", "text": "**LINGUISTIC FEATURES**" }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-90", "text": "In this paper, we explore linguistic features that are inexpensive to compute such as casing, POS, and positional features." 
}, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-91", "text": "We also use the edit labels from Google's dataset (e.g., TIME, DATE) although we acknowledge these labels are expensive and often times not accessible." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-92", "text": "----------------------------------" }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-93", "text": "**EXPERIMENTS AND RESULTS**" }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-94", "text": "----------------------------------" }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-95", "text": "**DATASET**" }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-96", "text": "The data for the window-based seq2seq model and full sentence seq2seq were generated from the publicly available release of parallel written/speech formatted text from Sproat and Jaitly (2016) ." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-97", "text": "The set consists of Wikipedia text which was processed through Google TTS's Kestrel text normalization system relying primarily on handcrafted rules to produce speech-formatted text." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-98", "text": "Although a large parallel dataset is available for English, we consider the feasibility of developing neural models for other languages which may not have text normalization systems in place." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-99", "text": "Therefore, we choose to scale the training data size to a limited set of text which could be generated by annotators in a reasonable time frame." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-100", "text": "As summarized in Table 2 , both window-based and sentencebased models are trained with 500K training instances." 
}, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-101", "text": "Our datasets were randomly sampled from a set of 4.9M sentences in the training data portion of the Sproat and Jaitly (2016) data release and split into training, validation, and test data." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-102", "text": "However, the training data for window-based and sentencebased models are not identical due to differences in input configurations." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-103", "text": "While the window-based model uses 500K randomly sampled windows, the sentence-based models use 500K sentences." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-104", "text": "For testing, 62.5K identical test sentences are used across all models." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-105", "text": "In order to decode sentences with the window-based model, sentences are first segmented into windows before inference." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-106", "text": "Among 16 edit labels available in the dataset release, we found the normalization target for Table 2 : Size of training, validation, and test datasets." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-107", "text": "For the window-baseline, the data are pairs of windows and the normalization of the central piece of the window." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-108", "text": "For the sent-baseline and subword models, the data are pairs of sentences but in different formats -sent-baseline: (character sequence, word sequence); subword: (subword sequence, subword sequence)." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-166", "text": "For the identity model, we replaced all nonalphanumerical characters in the source data with \"\", except for spaces." 
}, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-112", "text": "Train ELECTRONIC text is not suitable for our system as it primarily reads out URLs letter by letter, e.g., \"Forbes.com\" \u2192 \"f o r b e s dot c o m\" (as opposed to \"forbes dot com\")." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-113", "text": "Therefore, we exclude ELECTRONIC data in our experiments." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-114", "text": "There are large numbers of tokens present in the dataset." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-115", "text": "We follow Sproat and Jaitly (2016) in down-sampling window-based training data to constrain the proportion of \"\" tokens to 10% of the data." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-116", "text": "For training sentence-based models, the source sentence is segmented into characters while the target sentence is broken into tokens." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-117", "text": "For the subword model, both the source and target sentences are segmented into subword sequences." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-118", "text": "Subword units are concatenated to words for evaluation." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-119", "text": "----------------------------------" }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-120", "text": "**BASELINE MODEL SETUP**" }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-121", "text": "Our first approach replicates the window-based seq2seq model of Sproat and Jaitly (2016) ." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-122", "text": "The model encodes the central piece of text (1 or more tokens) including its context of N previous and following tokens at the character level." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-123", "text": "The output is a target token or a sequence of tokens." 
}, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-124", "text": "The input vocabulary consists of 250 common characters including letters, digits and symbols (e.g., $)." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-125", "text": "The decoder vocabulary consists of 1K tokens including and , the latter of which is used to normalize punctuation." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-126", "text": "Following Chan et al. (2016) , we use a stacked (2-layer) bi-directional long short term memory network (bi-LSTM) as encoder and a stacked (2-layer) LSTM as decoder." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-127", "text": "We use 512 hidden states for the (bi-)LSTM." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-128", "text": "A softmax output distribution is computed over output vocabulary at each decoding step." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-129", "text": "Decoding uses the attention mechanism from Bahdanau et al. (2014) and a beam size of 5." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-130", "text": "Word and character embeddings are trained from scratch." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-131", "text": "We use the OpenNMT toolkit (Klein et al., 2017) to train our models on a single P2.8xlarge Amazon EC2 instance." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-132", "text": "Models were trained with Stochastic Gradient Descent (SGD) on 200K timesteps (approximately 13 epochs)." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-133", "text": "Approaching 200K timesteps, a significant decay in accuracy and plateau in perplexity of the validation set occurred for all models." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-134", "text": "Validation occurred every 10K timesteps and the number of timesteps was chosen based on maximum accuracy on the validation data." 
}, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-135", "text": "The learning rate was tuned to 1.0 for the window-based model and 0.5 for sentence-based models to achieve optimal performance." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-136", "text": "Learning rate decayed at a rate of 0.5 if perplexity on the validation set did not decrease or after 50K steps." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-137", "text": "A dropout of 0.3 was used across all models." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-138", "text": "Figure 2: Evaluation of the window-based model." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-140", "text": "* TELEPHONE is not reported in Sproat and Jaitly (2016) but included in the dataset; ** we removed ELECTRONIC category." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-141", "text": "As shown in Figure 2 , our replicated windowbased model achieves reasonable performance compared with Sproat and Jaitly (2016) , considering our training set is much smaller." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-142", "text": "There are 16 different edit labels shown." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-143", "text": "Data with TELEPHONE labels were not included in the initial analysis of Sproat and Jaitly (2016) , but were made available in the dataset release." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-144", "text": "For our second baseline model which operates on whole sentences, on the input side, we still use 250 common characters." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-145", "text": "However, due to the removal of the token, the output space is drastically extended from 1K tokens to 45K tokens." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-146", "text": "Thus, it becomes increasingly difficult for the model to learn and predict." 
}, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-147", "text": "----------------------------------" }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-148", "text": "**SUBWORD INVENTORY**" }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-149", "text": "A subword inventory can be populated by datadriven approaches such as Byte Pair Encoding (BPE) (Sennrich et al., 2015) ." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-150", "text": "Text is first split into character sequences and the most frequently co-occurring units are greedily merged into one subword unit." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-151", "text": "This procedure continues until the desired subword inventory size is reached." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-152", "text": "Here, we enforce that two units cannot be merged if they cross a word boundary." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-153", "text": "To avoid OOV words, we also populate the subword inventory with the 250 most common characters used in the baseline model and digits 0-9." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-154", "text": "In data preparation, we force the subword model to split digits into a single subword piece (e.g., \"1234\" \u2192 \" 1 2 3 4\"), regardless of whether a certain combination of numbers co-occur frequently (e.g., \"19\")." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-155", "text": "Tokenizing digits is beneficial when interpreting large sequences of numbers where every digit must be read out (e.g., 1,342 \u2192 \"one thousand three hundred and forty two\")." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-156", "text": "In this work, we use the SentencePiece toolkit 1 and vary the inventory size." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-157", "text": "One can imagine that a larger subword inventory may contain longer subword entries." 
}, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-158", "text": "For example, the word \"anthology\" is split into \" an th ology\" by a subword model of 2K size and \" anth ology\" by a model of 8K size." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-159", "text": "Our experiments find that an inventory size between 1K and 2K yields the best WER and SER (see Table 3 )." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-160", "text": "For the rest of the paper, we use 2K." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-161", "text": "Table 4 summarizes the performance of each model." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-162", "text": "We report sentence-error-rate (SER), word-error-rate (WER), BLEU score (Papineni et al., 2002) and latency (millisecond per input sentence), measured on the test set." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-163", "text": "We also report number of parameters and training time." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-164", "text": "----------------------------------" }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-165", "text": "**OVERALL PERFORMANCE**" }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-168", "text": "When evaluated on full sentences, the window-based model yields a reasonable accuracy, although it leverages a limited context." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-169", "text": "On the other hand, although the sentence baseline is directly trained on full sentences, its WER and SER are both worse than the window-based approach." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-170", "text": "The expansion of the output space significantly increases the trainable parameters from 10M to 55M, leading to more difficulties in training and inference." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-171", "text": "As shown in Table 4 , the subword model significantly outperforms baseline models in both accuracy and inference speed." 
}, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-172", "text": "Due to the source of the dataset (i.e., Wikipedia), test set and training set have an overlap of about 27%." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-173", "text": "For instance, several source citations were commonly found in Wikipedia articles and appeared in training and test (e.g., \"IUCN Red List of Threatened Species.\")." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-174", "text": "We found that, for sentences that were not seen by the subword model in training, our model still produces reliable outputs with a SER of 4.59% and WER of 1.09%." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-175", "text": "Figure 3 demonstrates that the attention mechanism can effectively learn the non-monotonous nature of the text normalization problem as \"eleventh\", \"November\" and \"eleven\" correspond to the third, second and first \"11\" in the input." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-176", "text": "----------------------------------" }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-177", "text": "**LINGUISTIC FEATURES**" }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-178", "text": "We use the following linguistic features: 1) capitalization:" }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-179", "text": "upper, lower, mixed, nonalphanumerical, foreign characters; 2) position: (Bird et al., 2009) ." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-180", "text": "Edit labels are the most expensive to obtain in real life." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-181", "text": "Our labels are generated directly from the Google FST (Sproat and Jaitly, 2016) ." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-182", "text": "Each type of feature is represented by a one-hot encoding." 
}, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-183", "text": "To combine linguistic features with subword units, one can add or concatenate each subword's embedding with its corresponding linguistic feature embedding and feed a combined embedding to the bi-LSTM encoder." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-184", "text": "Or, a multi-layer perceptron (MLP) can be applied to combine information in a non-linear way." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-185", "text": "Our experiments find that concatenation outperforms the other two methods." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-186", "text": "In Table 4 we can see that the subword model with linguistic features produces the lowest SER (0.78%) and WER (0.17%)." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-187", "text": "In addition, results from the ablation study show that each feature makes a positive contribution to the model." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-188", "text": "However, edit labels seem to make the strongest contribution." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-189", "text": "We acknowledge that edit labels may not always be readily available." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-190", "text": "The model which utilizes all linguistic features except for edit labels still shows a 16% relative SER reduction and 14% WER reduction over the subword model without linguistic features." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-191", "text": "----------------------------------" }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-192", "text": "**DISCUSSION**" }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-193", "text": "Errors from the subword model are presented in Table 5 ." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-195", "text": "While these types of errors are infrequent, they change or obscure the meaning of the utterance for a user." 
}, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-196", "text": "For example, the currency \"nok\" (e.g., \"norwegian kroner\") was verbalized as \"euros\", reflecting a bias in the training data." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-197", "text": "While \"euros\" appeared 88 times, \"norwegian kroner\" appeared just 10 times." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-198", "text": "Another type of error does not change the sentence meaning but can be unnatural." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-199", "text": "For example, \"alexander iii\" was predicted as \"alexander three\" rather than \"alexander the third\"." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-200", "text": "In this case, the referent of the sentence would likely be understandable given context." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-201", "text": "Examples such as \"5' 11\"\" reflect the variety of natural readings which a human might produce. \"Five foot eleven inches\", \"five foot eleven\", and \"five eleven\" may all refer to a person's height." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-202", "text": "Here the reference and model have produced different but acceptable variations." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-203", "text": "A fundamental problem is the lack of supervised data for training and evaluation, particularly data which reflects the variety of acceptable readings of non-standard text." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-204", "text": "The pairs in this study (and in other text normalization research) are generated by a system which does not have the full capability to verbalize sentences in different but natural ways." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-205", "text": "Our system's normalization WER and SER may not translate proportionally to ASR's WER and SER, simply because real users will read non-standard text in a variety of ways." 
}, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-206", "text": "It remains a challenge for the academic community to come up with better data solutions." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-207", "text": "----------------------------------" }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-208", "text": "**CONCLUSION**" }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-209", "text": "In this paper, we investigate neural approaches to text normalization which directly translate a written-form sentence to its spoken counterpart without the need of a tagger or grammar." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-210", "text": "We show that the use of subwords can effectively reduce the OOV problem of a baseline seq2seq model with character inputs and token outputs." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-211", "text": "The addition of linguistic features including casing, word position, POS tags, and edit labels leads to further gains." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-212", "text": "We empirically test the addition of each linguistic feature revealing that all features make a contribution to the model, and combining features results in the best performance." }, { "sent_id": "73d7831596bfe6d6861f360042048f-C001-213", "text": "Our model is an improvement over both window-based and sentence-based seq2seq baselines, yielding a WER of 0.17%." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "73d7831596bfe6d6861f360042048f-C001-27" ], [ "73d7831596bfe6d6861f360042048f-C001-39" ], [ "73d7831596bfe6d6861f360042048f-C001-42" ] ], "cite_sentences": [ "73d7831596bfe6d6861f360042048f-C001-27", "73d7831596bfe6d6861f360042048f-C001-39", "73d7831596bfe6d6861f360042048f-C001-42" ] }, "@MOT@": { "gold_contexts": [ [ "73d7831596bfe6d6861f360042048f-C001-26", "73d7831596bfe6d6861f360042048f-C001-27", "73d7831596bfe6d6861f360042048f-C001-28" ], [ "73d7831596bfe6d6861f360042048f-C001-39", "73d7831596bfe6d6861f360042048f-C001-40", "73d7831596bfe6d6861f360042048f-C001-41" ] ], "cite_sentences": [ "73d7831596bfe6d6861f360042048f-C001-27", "73d7831596bfe6d6861f360042048f-C001-39" ] }, "@USE@": { "gold_contexts": [ [ "73d7831596bfe6d6861f360042048f-C001-52" ], [ "73d7831596bfe6d6861f360042048f-C001-55" ], [ "73d7831596bfe6d6861f360042048f-C001-96" ], [ "73d7831596bfe6d6861f360042048f-C001-101" ], [ "73d7831596bfe6d6861f360042048f-C001-115" ], [ "73d7831596bfe6d6861f360042048f-C001-121" ], [ "73d7831596bfe6d6861f360042048f-C001-181" ] ], "cite_sentences": [ "73d7831596bfe6d6861f360042048f-C001-52", "73d7831596bfe6d6861f360042048f-C001-55", "73d7831596bfe6d6861f360042048f-C001-96", "73d7831596bfe6d6861f360042048f-C001-101", "73d7831596bfe6d6861f360042048f-C001-115", "73d7831596bfe6d6861f360042048f-C001-121", "73d7831596bfe6d6861f360042048f-C001-181" ] }, "@EXT@": { "gold_contexts": [ [ "73d7831596bfe6d6861f360042048f-C001-140" ] ], "cite_sentences": [ "73d7831596bfe6d6861f360042048f-C001-140" ] }, "@DIF@": { "gold_contexts": [ [ "73d7831596bfe6d6861f360042048f-C001-140" ], [ "73d7831596bfe6d6861f360042048f-C001-141" ], [ "73d7831596bfe6d6861f360042048f-C001-143" ] ], "cite_sentences": [ "73d7831596bfe6d6861f360042048f-C001-140", "73d7831596bfe6d6861f360042048f-C001-141", "73d7831596bfe6d6861f360042048f-C001-143" ] }, "@SIM@": { "gold_contexts": [ [ "73d7831596bfe6d6861f360042048f-C001-141" ] ], "cite_sentences": [ 
"73d7831596bfe6d6861f360042048f-C001-141" ] } } }, "ABC_22da24997f66a6dafa911f83f061e5_14": { "x": [ { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-119", "text": "We use the Ranknet loss function (Burges et al., 2005) :" }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-2", "text": "Motivated by the promising performance of pre-trained language models, we investigate BERT in an evidence retrieval and claim verification pipeline for the FEVER fact extraction and verification challenge." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-3", "text": "To this end, we propose to use two BERT models, one for retrieving potential evidence sentences supporting or rejecting claims, and another for verifying claims based on the predicted evidence sets." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-4", "text": "To train the BERT retrieval system, we use pointwise and pairwise loss functions, and examine the effect of hard negative mining." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-5", "text": "A second BERT model is trained to classify the samples as supported, refuted, and not enough information." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-6", "text": "Our system achieves a new state of the art recall of 87.1 for retrieving top five sentences out of the FEVER documents consisting of 50K Wikipedia pages, and scores second in the official leaderboard with the FEVER score of 69.7." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-7", "text": "----------------------------------" }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-9", "text": "The constantly growing online textual information and the rise in the popularity of social media have been accompanied by the spread of fake news and false claims." 
}, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-10", "text": "It is not feasible to manually determine the truthfulness of such information." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-11", "text": "Therefore, there is a need for automatic verification and fact-checking." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-12", "text": "Due to the unavailability of proper datasets for evidence-based fake news detection, we focus on claim verification." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-13", "text": "The Fact Extraction and VERification (FEVER) shared task (Thorne et al., 2018) introduces a benchmark for evidence-based claim verification." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-14", "text": "FEVER consists of 185K generated claims labelled as 'SUPPORTED', 'REFUTED' or 'NOT Claim: Roman Atwood is a content creator." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-15", "text": "Evidence: [wiki/Roman_Atwood] He is best known for his vlogs, where he posts updates about his life on a daily basis." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-16", "text": "Verdict: SUPPORTED Claim: Furia is adapted from a short story by Anna Politkovskaya." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-17", "text": "Evidence: [wiki/Furia_(film)] Furia is a 1999 French romantic drama film directed by Alexandre Aja, who co-wrote screenplay with Grgory Levasseur, adapted from the science fiction short story Graffiti by Julio Cortzar." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-18", "text": "Verdict: REFUTED Claim: Afghanistan is the source of the Kushan dynasty." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-19", "text": "Verdict: NOT ENOUGH INFO Figure 1 : Three examples from the FEVER dataset (Thorne et al., 2018) ." 
}, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-20", "text": "Given a claim, the task is to to extract evidence sentence(s) from the Wikipedia dump and classify the claim as 'SUP-PORTED', 'REFUTED', or 'NOT ENOUGH INFO' ENOUGH INFO' based on the introductory sections of a 50K popular Wikipedia pages dump." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-21", "text": "The benchmark task is to classify the veracity of textual claims and extract the correct evidence sentences required to support or refute the claims." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-22", "text": "The primary evaluation metric (FEVER score) is label accuracy conditioned on providing evidence sentence(s) unless the predicted label is 'NOT ENOUGH INFO', which does not need any specific evidence." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-23", "text": "Figure 1 shows three examples of the FEVER dataset." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-24", "text": "To verify a claim, an enormous amount of information needs to be processed against the claim to retrieve the evidence and then infer possible evidence-claim relations." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-25", "text": "This problem can be alleviated by a multi-step pipeline." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-26", "text": "Most work on FEVER has adopted a three-step pipeline (Figure 2) : document retrieval, sentence retrieval, and claim verification." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-27", "text": "First, a set of documents, which possibly contain relevant information to support or reject a claim, are shortlisted from the Wikipedia dump." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-28", "text": "Second, five sentences are extracted out of the retrieved documents to be used as evidence." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-29", "text": "Third, the claim is verified against the retrieved evidence sentences." 
}, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-30", "text": "The FEVER dataset covers a wide range of topics, and to overcome data limitation pre-trained models are useful." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-31", "text": "Recently the release of Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018) has significantly advanced the performance in a wide variety of Natural Language Processing (NLP) tasks and datasets including MS MARCO (Nguyen et al., 2016) in passage re-ranking and MultiNLI (Williams et al., 2018) in natural language inference that respectively resemble the second and third step of the FEVER baseline." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-32", "text": "However, to the best of our knowledge, there is no integrated work for both steps." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-33", "text": "In this paper, we propose a three-step pipeline system to address the FEVER task." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-34", "text": "We examine the BERT model for the FEVER task, and use that for evidence retrieval and claim verification." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-35", "text": "A first BERT model is trained to retrieve evidence sentences required for verifying the claims." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-36", "text": "We compare pointwise cross entropy loss and pairwise Hinge loss and Ranknet loss (Burges et al., 2005) for the BERT sentence retrieval." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-37", "text": "We further investigate the effect of Hard Negative Mining (HNM)." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-38", "text": "Next, we train another BERT model to verify claims against the retrieved evidence sentences." 
}, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-39", "text": "In summary, our contributions are as follows: (1) We employ the BERT model for evidence retrieval and claim verification; (2) We are the first to compare pointwise loss and pairwise loss functions for training the BERT model for sentence retrieval or fact extraction; (3) We investigate HNM to improve the retrieval performance; (4) We achieve second rank in the FEVER official leaderboard without ensembling." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-40", "text": "----------------------------------" }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-41", "text": "**RELATED WORK**" }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-42", "text": "In this section, we first briefly survey related background in natural language inference." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-43", "text": "Second, we review particularly previous work on the FEVER task in the three-step pipeline: document retrieval, sentence retrieval, and claim verification 2.1 Natural Language Inference Natural Language Inference (NLI) is concerned with determining if a premise entails a hypothesis." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-44", "text": "The Stanford Natural Language Inference (SNLI) (Bowman et al., 2015) and the Multi-Genre NLI (MultiNLI) corpora (Williams et al., 2018) are the two established benchmarks for NLI." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-45", "text": "The availability of these large datasets has driven the recent advances in deep learning methods for NLI." 
}, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-46", "text": "The deep models for NLI can be divided into two categories: (1) Models that classify the premise-hypothesis relation based on the concatenation of the premise and hypothesis fixedsize representations together with their elementwise products (Bowman et al., 2015; Bowman et al., 2016) ; (2) Uni-directional or bi-directional attention-based models that provide reasoning over distributional representation of the sentences (Rockt\u00e4schel et al., 2015; Chen et al., 2016) ." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-47", "text": "In addition to the early improvement using context-free word representations (Mikolov et al., 2013; Pennington et al., 2014) , pre-trained language models have significantly advanced several NLP tasks." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-48", "text": "In particular, BERT (Devlin et al., 2018) has achieved impressive results on several NLP tasks including NLI." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-49", "text": "----------------------------------" }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-50", "text": "**FEVER PIPELINE**" }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-51", "text": "----------------------------------" }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-52", "text": "**DOCUMENT RETRIEVAL**" }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-53", "text": "In the FEVER benchmark (Thorne et al., 2018) , the DrQA (Chen et al., 2017) retrieval component is considered as the baseline." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-54", "text": "They choose the k-nearest documents based on the cosine similarity of TF-IDF feature vectors." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-55", "text": "In addition to the DrQA retrieval component, the Sweeper team (Hidey and Diab, 2018) considers lexical and syntactic features for the claim and first two sentences in the pages." 
}, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-56", "text": "The authors in (Malon, 2018) use TF-IDF along with exact matching of the page titles with the claim's named entities." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-57", "text": "The UCL team (Yoneda et al., 2018) highlights the pages titles, and detect them in the claims." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-58", "text": "They rank pages by logistic regression and extra features like capitalization, sentence position and token matching." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-59", "text": "Keyword matching along with page-view statistics are used in (Nie et al., 2019) ." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-60", "text": "UKP-Athene (Hanselowski et al., 2018) , the highest document retrieval scoring team, uses MediaWiki API 1 to search the Wikipedia database for the claims noun phrases." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-61", "text": "----------------------------------" }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-62", "text": "**SENTENCE RETRIEVAL**" }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-63", "text": "In order to extract evidence sentences, (Thorne et al., 2018 ) use a TF-IDF approach similar to their document retrieval." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-64", "text": "The UCL team (Yoneda et al., 2018) trains a logistic regression model on a heuristically set of features." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-65", "text": "Enhanced Sequential Inference Model (ESIM) (Chen et al., 2016) with some small modifications has been used in (Nie et al., 2019; Hanselowski et al., 2018) ." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-66", "text": "ESIM encodes premises and hypotheses using one Bidirectional Long Short-Term Memory (BiLSTM) with shared weights." 
}, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-67", "text": "The encoded sentences are later aligned by a bidirectional attention mechanism." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-68", "text": "The encoded and aligned sentences are combined, and another shared BiL-STM matches the two representations." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-69", "text": "Finally, a softmax layer classifies the max and mean pooled representations of the second BiLSTM." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-70", "text": "The UKP-Athene team (Hanselowski et al., 2018) achieved the highest sentence retrieval recall using ESIM and pairwise training." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-71", "text": "Their model takes a claim and a pair of positive and negative sentences and predicts a similarity score for each sentence." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-72", "text": "To train the model, they use a modified Hinge loss function and a random neg-ative sampling strategy." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-73", "text": "In other words, positive samples are trained against five randomly selected negative sentences from the top retrieved pages for each claim." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-74", "text": "Note that recall is the most important factor in this step because the FEVER score counts a prediction as true if a complete set of evidence is retrieved." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-75", "text": "----------------------------------" }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-76", "text": "**CLAIM VERIFICATION**" }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-77", "text": "Decomposable Attention (DA) (Parikh et al., 2016) , which compares and aggregates softaligned words in sentence pairs, is used in the FEVER benchmark paper (Thorne et al., 2018) ." 
}, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-78", "text": "The Papelo team (Malon, 2018) employs transformer networks with pre-trained weights (Radford et al., 2018) ." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-79", "text": "ESIM has been widely used among the FEVER challenge participants (Nie et al., 2019; Yoneda et al., 2018; Hanselowski et al., 2018) ." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-80", "text": "UNC (Nie et al., 2019) , the winner of the competition, proposes a modified ESIM that takes the concatenation of the retrieved evidence sentences and claim along with ELMo embedding and three additional token-level features: Word-Net, number embedding, and semantic relatedness score from the document retrieval and sentence retrieval steps." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-81", "text": "Dream (Zhong et al., 2019) has the state of the art FEVER score." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-82", "text": "The authors use a graph reasoning method based on XLNet (Yang et al., 2019) and RoBERTa (Liu et al., 2019) , the two new BERT variants that are supposed to provide better pre-trained embeddings." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-83", "text": "----------------------------------" }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-84", "text": "**METHODS**" }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-85", "text": "BERT (Devlin et al., 2018 ) is a multi-layer transformer language representation model pre-trained on the task of next sentence prediction and masked word prediction using extremely large datasets." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-86", "text": "The input representation begins with a special classification embedding ([CLS]) followed by the tokens representations of the first and second sentences separated by another specific token ([SEP])." 
}, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-87", "text": "In order to use the BERT model for different tasks, only one additional task-specific output layer is needed that can be trained together with fine-tuning the base layers." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-88", "text": "In particular, for the classification task, a softmax layer is added on the last hidden state of the first token, which is corresponding to [CLS] ." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-89", "text": "Figure 3 demonstrates" }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-90", "text": "The goal is to classify the claim c l for l = 1, ..., N C (N C = 145K for the FEVER benchmark) as 'SUPPORTED', 'REFUTED', or 'NOT ENOUGH INFO'." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-91", "text": "In order to count a prediction true, a complete set of evidence E c l = {s i j } must be retrieved for the claim c l ." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-92", "text": "The claims with 'NOT ENOUGH INFO' label do not have an evidence set." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-93", "text": "In this section, we explain the proposed system that we developed for the three FEVER steps." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-94", "text": "Figure 4 briefly demonstrates our proposed BERT-based architectures for the three-step pipeline (Figure 2 )." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-95", "text": "----------------------------------" }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-96", "text": "**DOCUMENT RETRIEVAL**" }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-97", "text": "In the document retrieval step, the Wikipedia documents containing the evidence supporting or refuting the claim are retrieved." 
}, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-98", "text": "Following the UKP-Athene promising document retrieval component (Hanselowski et al., 2018) , which results in more than 93% development set document recall, we exactly use their method to collect a set of top documents D c l top for the claim c l ." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-99", "text": "----------------------------------" }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-100", "text": "**SENTENCE RETRIEVAL**" }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-101", "text": "The sentence retrieval step extracts the top five potential evidence sentences S c l top for the claim c l ." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-102", "text": "The training set consists of about 145K claims and all the sentences (S d i ) from the documents retrieved at the previous step (D c l top ) corresponding to the claim c l (S c l all = {S d i |d i \u2208 D c l top })." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-103", "text": "Note that S c l all may or may not contain the actual evidence sentences that we know from the groundtruth labels." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-104", "text": "We adopt the pre-trained BERT model and finetune using two different pointwise and pairwise approaches." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-105", "text": "We did not observe any improvement to use the large BERT for this step." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-106", "text": "In both approaches, the input consist of a potential evidence sentence from S c l all and a claim c l ." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-107", "text": "Similar to (Malon, 2018) , in order to compensate for the missed co-reference pronouns in the sentences, we add the Wikipedia pages titles at the beginning of each potential sentence." 
}, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-108", "text": "For all the retrieval experiments, we adopt a batch size of 32, a learning rate of 2e\u22125, and one epoch of training." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-109", "text": "----------------------------------" }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-110", "text": "**POINTWISE**" }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-111", "text": "In the pointwise approach, every single input is classified as evidence or non-evidence." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-112", "text": "We use the cross-entropy classification loss for the pointwise approach:" }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-113", "text": "where y i and p i are respectively the one-hot ground-truth label vector and the corresponding softmax output (Figure 4 (left) ), and N is the total number of training samples." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-114", "text": "At testing time, sentences are sorted by their p i values and the top five sentences are considered as evidence." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-115", "text": "A threshold can also be used on the output scores to filter out uncertain results and trade off recall against precision." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-116", "text": "----------------------------------" }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-117", "text": "**PAIRWISE**" }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-118", "text": "In the pairwise approach, a pair of positive and negative samples is compared against each other (Figure 4 (right) )." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-120", "text": "where the outputs of the positive sample, o pos , and the negative sample, o neg , are mapped to a probability using the softmax function p i = e^(o pos \u2212 o neg ) / (1 + e^(o pos \u2212 o neg ))."
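The pairwise probability above reduces to a sigmoid of the score difference, and taking the cross-entropy over it gives the RankNet-style loss. A minimal sketch in plain Python, with hypothetical score values:

```python
import math

def pairwise_prob(o_pos, o_neg):
    # p_i = e^(o_pos - o_neg) / (1 + e^(o_pos - o_neg)),
    # i.e. the sigmoid of the score difference.
    return 1.0 / (1.0 + math.exp(-(o_pos - o_neg)))

def ranknet_loss(o_pos, o_neg):
    # Cross-entropy with the ground truth that the positive
    # (evidence) sentence should outrank the negative one.
    return -math.log(pairwise_prob(o_pos, o_neg))
```

The loss shrinks as the positive score pulls further ahead of the negative one, which is exactly the ranking behavior the retrieval step needs.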
}, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-121", "text": "Note that we do not force the positive and negative samples to be selected from the same claim because the number of sentences per claim varies significantly, and this difference might bias the model toward claims with a higher number of sentences." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-122", "text": "In addition, we experiment with a modified hinge loss function, as in (Hanselowski et al., 2018) :" }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-123", "text": "At testing time, for both pairwise loss functions, we sort the sentences by their output value o and similarly choose S c l top for the claim c l ." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-124", "text": "----------------------------------" }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-125", "text": "**HARD NEGATIVE MINING**" }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-126", "text": "The ratio of negative (non-evidence) to positive (evidence) sentences is high; thus, it is not reasonable to train on all the negative samples." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-127", "text": "Random sampling limits the number of negative samples; however, it might lead to training on easy, trivial samples." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-128", "text": "Therefore, we opt to investigate the effect of HNM." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-129", "text": "Similar to (Schroff et al., 2015) , we focus on online HNM." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-130", "text": "We fix the positive-sample batch size at 16 but heuristically increase the negative-sample batch size from 16 to 64, and train on the positive samples and only the 16 negative samples with the highest loss values." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-131", "text": "This results in a balanced batch size of 32."
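The online hard negative mining step described above, scoring a larger pool of negatives and keeping only the hardest ones, can be sketched as follows; the loss values in the usage example are hypothetical:

```python
def hardest_negatives(neg_losses, k=16):
    """Online HNM: rank the candidate negatives by their loss values
    (computed in no-gradient mode, as at inference time) and keep only
    the indices of the k hardest ones."""
    order = sorted(range(len(neg_losses)),
                   key=lambda i: neg_losses[i], reverse=True)
    return sorted(order[:k])
```

With a pool of 64 scored negatives and k=16, the selected negatives join the 16 positives to form the balanced batch of 32.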
}, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-132", "text": "In the case of pairwise retrieval, HNM is applied to select the 32 hardest pairs out of 128 pairs; thus, many of the positive samples might never be trained on." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-133", "text": "Therefore, for this case, we heuristically increase the number of training epochs from one to three." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-134", "text": "While increasing the number of epochs improves performance with HNM, we observed the reverse for normal training." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-135", "text": "In the experiments without HNM, we use random negative sampling." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-136", "text": "Note that for both cases, loss values are computed in no-gradient mode, as at inference time, so no more GPUs are needed than for normal training with a batch size of 32." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-137", "text": "----------------------------------" }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-138", "text": "**CLAIM VERIFICATION**" }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-139", "text": "In the final step, the top five potential evidence sentences S c l top are independently compared against the claim c l and the final label is determined by aggregating the five individual decisions." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-140", "text": "Following (Malon, 2018) , the default label is 'NOT ENOUGH INFO' unless there is any supporting evidence to predict the claim label as 'SUPPORTED'." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-141", "text": "If there is at least one piece of evidence rejecting the claim while there is no supporting fact, the final decision is 'REFUTED'."
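The aggregation rule above can be written directly; the input is the list of per-evidence predictions for one claim:

```python
def aggregate_verdict(evidence_labels):
    """Aggregate per-evidence decisions into a final claim label.
    Default is 'NOT ENOUGH INFO'; any supporting evidence makes the
    claim 'SUPPORTED'; refuting evidence with no support yields
    'REFUTED'."""
    if "SUPPORTED" in evidence_labels:
        return "SUPPORTED"
    if "REFUTED" in evidence_labels:
        return "REFUTED"
    return "NOT ENOUGH INFO"
```

Note the asymmetry: a single supporting sentence outweighs any number of refuting ones, matching the priority order stated above.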
}, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-142", "text": "We fine-tune a separate pre-trained BERT model as a three-class classifier (Figure 4 (left) )." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-143", "text": "----------------------------------" }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-144", "text": "**MODEL**" }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-145", "text": "Table 1 (sentence retrieval, dev set; precision(%) / recall@5(%) / F1(%)): UNC (Nie et al., 2019): 36.39 / 86.79 / 51.38; UCL (Yoneda et al., 2018): 22.74** / 84.54 / 35.84; UKP-Athene (Hanselowski et al., 2018): 23.67* / 85.81* / 37.11*; DREAM-XLNet (Zhong et al., 2019): 26.60 / 87.33 / 40.79; DREAM-RoBERTa (Zhong et al., 2019): 26. Table 2 (claim verification, dev set; FEVER score(%) / label accuracy(%)): UNC (Nie et al., 2019): 66.14 / 69.60; UCL (Yoneda et al., 2018): 65.41 / 69.66; UKP-Athene (Hanselowski et al., 2018): 64." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-146", "text": "We train the model on 722K evidence-claim pairs provided by the first two steps." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-147", "text": "We adopt a batch size of 32, a learning rate of 2e\u22125, and two epochs of training." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-148", "text": "Table 1 compares the development set performance of different variants of the proposed sentence retrieval method with the state of the art results on the FEVER dataset." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-149", "text": "The results indicate that both pointwise and pairwise BERT sentence retrieval improve the recall." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-150", "text": "The UNC and DREAM precision scores are better than those of our methods without a decision threshold; however, a threshold can regulate the trade-off between recall and precision and achieve the best precision and F1 scores." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-151", "text": "As discussed in (Nie et al., 2019) , the recall is the most important factor."
}, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-152", "text": "This is because the sentence retrieval predictions are the samples on which we train the verification system." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-153", "text": "Moreover, the FEVER score requires evidence for 'SUPPORTED' and 'REFUTED' claims." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-154", "text": "Therefore, we opt to focus more on recall and train the claim verification model on the predictions with the maximum recall." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-155", "text": "Surprisingly, the DREAM paper (Zhong et al., 2019) reports lower recalls for RoBERTa and XLNet, which might be due to different training setups." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-156", "text": "----------------------------------" }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-157", "text": "**RESULTS**" }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-158", "text": "Although the pairwise Ranknet with HNM has the best recall score, we cannot conclude that pairwise methods are necessarily better for this task." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-159", "text": "This is clearer in Figure 5 , which plots the recall-precision trade-off obtained by applying a decision threshold on the output scores." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-160", "text": "Model / FEVER score(%) / label accuracy(%): DREAM (Zhong et al., 2019): 70; UNC (Nie et al., 2019): 64.21 / 68.21; UCL (Yoneda et al., 2018): 62.52 / 67.62; UKP-Athene (Hanselowski et al., 2018): 61.58 / 65.46. Figure 5 : Recall and precision results on the development set." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-161", "text": "x marks the UNC, UCL, UKP-Athene, DREAM-XLNet, and DREAM-RoBERTa scores (Nie et al., 2019; Yoneda et al., 2018; Hanselowski et al., 2018; Zhong et al., 2019) . The pointwise methods surpass the pairwise methods in terms of recall-precision performance."
}, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-162", "text": "Figure 5 also shows that HNM enhances both pairwise methods trained with the Ranknet and hinge loss functions and preserves the pointwise performance." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-163", "text": "In Table 2 , we compare the development set results of the state of the art methods with the BERT model trained on different retrieved evidence sets." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-164", "text": "Even when trained on the sentence retrieval component of UKP-Athene (Hanselowski et al., 2018) , the state-of-the-art method with the highest recall, the BERT claim verification system improves both label accuracy and FEVER score." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-165", "text": "Training on the BERT sentence retrieval predictions significantly enhances the verification results: it explicitly improves the FEVER score by providing more correct evidence sentences, and it also provides a better training set for the verification system." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-166", "text": "The large BERTs are only trained on the best retrieval systems and, as expected, significantly improve the performance." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-167", "text": "Finally, we report the blind test set results in Table 3 using the official FEVER framework on CodaLab 2 as of the date of writing." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-168", "text": "Our best model ranks second, which indicates the importance of using pre-trained language modelling methods for both the sentence retrieval and claim verification systems." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-169", "text": "Note that it is not completely fair to compare our method with DREAM's core idea because, in addition to a graph-based reasoning approach, they use XLNet, a superior pre-trained language model."
}, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-170", "text": "----------------------------------" }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-171", "text": "**CONCLUSION**" }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-172", "text": "We investigated the BERT model for evidence sentence retrieval and claim verification." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-173", "text": "In the retrieval step, we compared the pointwise and pairwise approaches and concluded that although the pairwise Ranknet approach achieved the highest recall, pairwise approaches are not necessarily superior to the pointwise approach, particularly when precision is taken into account." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-174", "text": "Our large system scored second with a FEVER score of 69.66 without ensembling." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-175", "text": "We additionally examined hard negative mining for training the retrieval systems and showed that it slightly improves the performance." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-176", "text": "We showed that, by constantly switching between training and inference modes, online hard negative mining requires no additional GPUs." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-177", "text": "We leave its possible effect on faster training to future work." }, { "sent_id": "22da24997f66a6dafa911f83f061e5-C001-178", "text": "Furthermore, using BERT as an end-to-end framework for the entire FEVER pipeline can be investigated in the future."
} ], "y": { "@BACK@": { "gold_contexts": [ [ "22da24997f66a6dafa911f83f061e5-C001-60" ], [ "22da24997f66a6dafa911f83f061e5-C001-65" ], [ "22da24997f66a6dafa911f83f061e5-C001-70" ], [ "22da24997f66a6dafa911f83f061e5-C001-79" ] ], "cite_sentences": [ "22da24997f66a6dafa911f83f061e5-C001-60", "22da24997f66a6dafa911f83f061e5-C001-65", "22da24997f66a6dafa911f83f061e5-C001-70", "22da24997f66a6dafa911f83f061e5-C001-79" ] }, "@USE@": { "gold_contexts": [ [ "22da24997f66a6dafa911f83f061e5-C001-98" ], [ "22da24997f66a6dafa911f83f061e5-C001-122" ] ], "cite_sentences": [ "22da24997f66a6dafa911f83f061e5-C001-98", "22da24997f66a6dafa911f83f061e5-C001-122" ] }, "@DIF@": { "gold_contexts": [ [ "22da24997f66a6dafa911f83f061e5-C001-161" ], [ "22da24997f66a6dafa911f83f061e5-C001-164" ] ], "cite_sentences": [ "22da24997f66a6dafa911f83f061e5-C001-161", "22da24997f66a6dafa911f83f061e5-C001-164" ] } } }, "ABC_260489da0fb3f7a201a6a1cce8f03b_14": { "x": [ { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-2", "text": "The attentional mechanism has proven to be effective in improving end-to-end neural machine translation." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-3", "text": "However, due to the structural divergence between natural languages, unidirectional attention-based models might only capture partial aspects of attentional regularities." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-4", "text": "We propose agreement-based joint training for bidirectional attention-based end-to-end neural machine translation." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-66", "text": "----------------------------------" }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-5", "text": "Instead of training source-to-target and target-to-source translation models independently, our approach encourages the two complementary models to agree on word alignment matrices on the same training data."
}, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-6", "text": "Experiments on Chinese-English and English-French translation tasks show that joint training significantly improves both alignment and translation quality over independent training." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-7", "text": "----------------------------------" }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-9", "text": "End-to-end neural machine translation (NMT) is a newly proposed paradigm for machine translation [Kalchbrenner and Blunsom, 2013; Cho et al., 2014; Sutskever et al., 2014; Bahdanau et al., 2015] ." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-10", "text": "Without explicitly modeling latent structures that are vital for conventional statistical machine translation [Brown et al., 1993; Koehn et al., 2003; Chiang, 2005] , NMT builds on an encoder-decoder framework: the encoder transforms a source-language sentence into a continuous-space representation, from which the decoder generates a target-language sentence." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-11", "text": "While early NMT models encode a source sentence as a fixed-length vector, Bahdanau et al. [2015] advocate the use of attention in NMT." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-12", "text": "They indicate that only parts of the source sentence have an effect on the target word being generated." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-13", "text": "In addition, the relevant parts often vary with different target words." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-14", "text": "Such an attentional mechanism has proven to be an effective technique in text generation tasks such as machine translation [Bahdanau et al., 2015; Luong et al., 2015] and image caption generation [Xu et al., 2015] ." 
}, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-15", "text": "However, due to the structural divergence between natural languages, modeling the correspondence between words in two languages still remains a major challenge for NMT, especially for distantly-related languages." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-16", "text": "For example, Luong et al. [2015] report that attention-based NMT lags behind the Berkeley aligner [Liang et al., 2006] in terms of alignment error rate (AER) on the English-German data. Figure 1 : The illustration of attention-based NMT." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-17", "text": "The decoder generates the first target word y 1 and target hidden state s 1 given a source sentence x. A bidirectional RNN is used to concatenate the forward and backward states as the hidden states of source words." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-18", "text": "The state s 0 is the initial hidden state and y 0 is a padding word." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-20", "text": "One possible reason is that unidirectional attention-based NMT can only capture partial aspects of attentional regularities due to the non-isomorphism of natural languages." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-21", "text": "In this work, we propose to introduce agreement-based learning [Liang et al., 2006, 2007] into attention-based neural machine translation." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-22", "text": "The basic idea is to encourage source-to-target and target-to-source translation models to agree on word alignment on the same training data."
}, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-23", "text": "This can be done by defining a new training objective that combines likelihoods in two directions as well as an agreement term that measures the consensus between word alignment matrices in two directions." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-67", "text": "**EXPERIMENTS**" }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-24", "text": "Experiments on Chinese-English and English-French datasets show that our approach achieves significant improvements in terms of alignment and translation quality as compared with independent training." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-25", "text": "----------------------------------" }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-26", "text": "**BACKGROUND**" }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-27", "text": "Given a source-language sentence x = x 1 , . . . , x m , . . . , x M that contains M words and a target-language sentence y = y 1 , . . . , y n , . . . , y N that contains N words, end-to-end neural machine translation directly models the translation probability as a single, large neural network:" }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-28", "text": "where \u03b8 is a set of model parameters and y <n denotes the partial translation generated so far. The encoder [Bahdanau et al., 2015] usually uses a recurrent neural network (RNN) to encode the source sentence into a sequence of hidden states h = h 1 , . . . , h m , . . . , h M :" }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-30", "text": "where h m is the hidden state of the m-th source word and f is a non-linear function." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-31", "text": "Note that there are many ways to obtain the hidden states." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-32", "text": "For example, Bahdanau et al. 
[2015] use a bidirectional RNN and concatenate the forward and backward states as the hidden state of a source word to capture both forward and backward contexts." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-33", "text": "Figure 1 illustrates how the decoder generates the first target word y 1 and the target hidden state s 1 given the concatenation of forward and backward source hidden states." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-34", "text": "Bahdanau et al. [2015] define the conditional probability in Eq." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-35", "text": "(1) as" }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-36", "text": "where g is a non-linear function and s n is the hidden state corresponding to the n-th target word computed by" }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-37", "text": "The context vector c n for generating the n-th target word is calculated as" }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-38", "text": "where A(\u03b8) is an alignment matrix, in which an element A(\u03b8) n,m reflects the contribution of the m-th source word x m to generating the n-th target word y n :" }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-39", "text": "where a(s n\u22121 , h m , \u03b8) measures how well x m and y n are aligned." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-40", "text": "Note that word alignment is treated as a function parametrized by \u03b8 instead of a latent variable in attention-based NMT." 
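The alignment row and context vector defined above can be sketched in plain Python: the softmax over the scores a(s_{n-1}, h_m) gives row n of A(\u03b8), and the weighted average of the source hidden states gives c_n. The scores and hidden states below are hypothetical toy values:

```python
import math

def attention_step(scores, hidden_states):
    """Compute one row of the alignment matrix A(theta) and the
    corresponding context vector c_n.

    scores[m] plays the role of a(s_{n-1}, h_m); hidden_states[m]
    is the (concatenated forward/backward) state h_m."""
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]   # stable softmax
    z = sum(exps)
    weights = [e / z for e in exps]             # A(theta)_{n,m}
    dim = len(hidden_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, hidden_states))
               for d in range(dim)]             # c_n
    return weights, context
```

Equal scores yield a uniform alignment row; a dominant score concentrates nearly all attention on one source position.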
}, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-41", "text": "----------------------------------" }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-42", "text": "Given a set of training examples, the training algorithm aims to find the model parameters that maximize the likelihood of the training data:" }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-44", "text": "Although the introduction of attention has advanced the state-of-the-art of NMT, it is still challenging for attention-based NMT to capture the intricate structural divergence between natural languages." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-68", "text": "----------------------------------" }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-45", "text": "Figure 2 (a) shows the Chinese-to-English (upper) and English-to-Chinese (bottom) alignment matrices for the same sentence pair." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-46", "text": "Both independently trained models fail to correctly capture the gold-standard correspondence: while the Chinese-to-English alignment assigns wrong probabilities to \"us\" and \"bush\", the English-to-Chinese alignment makes wrong predictions on \"condemns\" and \"bombing\"." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-47", "text": "Fortunately, although each model only captures partial aspects of the mapping between words in natural languages, the two models seem to be complementary: the Chinese-to-English alignment does well on \"condemns\" and the English-to-Chinese alignment assigns correct probabilities to \"us\" and \"bush\"." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-48", "text": "Therefore, combining the two models can hopefully improve both alignment and translation quality in both directions."
}, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-49", "text": "Figure 2: Example alignments of (a) independent training and (b) joint training on a Chinese-English sentence pair." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-50", "text": "The first row shows Chinese-to-English alignments and the second row shows English-to-Chinese alignments." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-51", "text": "We find that the two unidirectional models are complementary and encouraging the agreement between them leads to improved alignment accuracy." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-52", "text": "----------------------------------" }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-53", "text": "**AGREEMENT-BASED JOINT TRAINING**" }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-54", "text": "In this work, we propose to introduce agreement-based learning [Liang et al., 2006, 2007] into attention-based neural machine translation." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-55", "text": "The central idea is to encourage the source-to-target and target-to-source models to agree on alignment matrices on the same training data." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-56", "text": "More formally, we train both the source-to-target attention-based neural translation model P (y|x; \u2212 \u2192 \u03b8 ) and the target-to-source model P (x|y; \u2190 \u2212 \u03b8 )" }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-57", "text": "where \u2212 \u2192 \u03b8 and \u2190 \u2212 \u03b8 are the model parameters in the two directions, respectively."
}, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-58", "text": "The new training objective is given by" }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-59", "text": "where" }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-60", "text": "is the source-to-target alignment matrix for the s-th sentence pair," }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-61", "text": "is the target-to-source alignment matrix for the s-th sentence pair, \u2206(\u00b7) is a loss function that measures the discrepancy between two matrices, and \u03bb is a hyper-parameter that balances the preference between likelihoods and agreement." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-62", "text": "Inspired by the agreement term [Liang et al., 2006] and model invertibility regularization [Levinboim et al., 2015] , we use the following loss function in our experiments:" }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-63", "text": "Intuitively, this loss function encourages both models to agree on each cell in the alignment matrices." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-64", "text": "As shown in Figure 2 , joint learning leads to increased consensus between source-to-target and target-to-source models. Our goal is to find the model parameters that maximize the training objective:" }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-65", "text": "It is easy to use the stochastic gradient descent (SGD) algorithm to implement agreement-based joint learning since the translation models in two directions share the same training data." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-69", "text": "**SETUP**" }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-70", "text": "We evaluated our approach on Chinese-English and English-French datasets." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-71", "text": "For Chinese-English, the training corpus from LDC consists of 2.56M sentence pairs with 67.53M Chinese words and 74.81M English words."
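Returning to the agreement term in the joint objective: \u2206(\u00b7) penalizes cell-by-cell disagreement between the source-to-target alignment matrix (N x M) and the transposed target-to-source matrix (M x N). The paper specifies \u2206 only by reference to Liang et al. [2006] and Levinboim et al. [2015]; the squared-difference form below is an illustrative assumption, not necessarily the exact loss used:

```python
def agreement_loss(A_st, A_ts):
    """Disagreement between a source-to-target alignment matrix A_st
    (N target rows x M source columns) and a target-to-source matrix
    A_ts (M x N), compared cell by cell after transposition.

    NOTE: the squared-difference form is an assumption for
    illustration; the paper's Delta follows Liang et al. (2006) and
    Levinboim et al. (2015)."""
    N, M = len(A_st), len(A_st[0])
    return sum((A_st[n][m] - A_ts[m][n]) ** 2
               for n in range(N) for m in range(M))
```

A perfectly agreeing pair of matrices incurs zero loss, so subtracting \u03bb times this term from the two likelihoods pushes both models toward a shared alignment.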
}, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-72", "text": "We used the NIST 2006 dataset as the validation set for hyperparameter optimization and model selection, and the NIST 2002, 2003, 2004, 2005, and 2008 datasets as the test sets." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-73", "text": "In the NIST Chinese-English datasets, each Chinese sentence has four corresponding English translations." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-74", "text": "To build English-Chinese evaluation datasets, we select the first English sentence in the four references as the source sentence and the Chinese sentence as the single reference translation." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-75", "text": "For English-French, the training corpus from WMT 2014 consists of 12.07M sentence pairs with 303.88M English words and 348.24M French words." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-76", "text": "Following Bahdanau et al. [2015] , we restrict sentences to at most 50 words." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-77", "text": "The concatenation of news-test-2012 and news-test-2013 is used as the validation set and news-test-2014 as the test set." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-78", "text": "The French-English evaluation sets can be easily obtained by reversing the English-French datasets." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-79", "text": "We compared our approach with two state-of-the-art SMT and NMT systems. Moses is a phrase-based SMT system [Koehn and Hoang, 2007] ." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-80", "text": "GroundHog is an attention-based neural machine translation system [Bahdanau et al., 2015] ." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-81", "text": "We introduce agreement-based joint training for bidirectional attention-based NMT."
}, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-82", "text": "NIST06 is the validation set and NIST02-05, 08 are test sets." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-83", "text": "The BLEU scores are case-insensitive. \"*\": significantly better than Moses (p < 0.05); \"**\": significantly better than Moses (p < 0.01); \"+\": significantly better than GroundHog with independent training (p < 0.05); \"++\": significantly better than GroundHog with independent training (p < 0.01)." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-84", "text": "Table 2 : Results on the Chinese-English word alignment task." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-85", "text": "The evaluation metric is alignment error rate. \"**\": significantly better than GroundHog with independent training (p < 0.01)." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-86", "text": "1. Moses [Koehn and Hoang, 2007] : a phrase-based SMT system; 2. GroundHog [Bahdanau et al., 2015] : an attention-based NMT system." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-87", "text": "For Moses, we used the parallel corpus to train the phrase-based translation model and the target-side part of the parallel corpus to train a 4-gram language model using the SRILM toolkit [Stolcke, 2002] ." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-88", "text": "For GroundHog, we use the parallel corpus to train the neural machine translation models." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-89", "text": "The vocabulary size is set to 30K for all languages." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-90", "text": "Our approach simply extends GroundHog by replacing independent training with agreement-based joint training."
}, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-114", "text": "----------------------------------" }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-115", "text": "**RELATED WORK**" }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-116", "text": "Our work is inspired by two lines of research: (1) attention-based neural machine translation and (2) agreement-based learning." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-117", "text": "Bahdanau et al. [2015] first introduce the attentional mechanism into neural machine translation to enable the decoder to focus on relevant parts of the source sentence during decoding." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-118", "text": "The attention mechanism allows a neural model to cope better with long sentences because it does not need to encode all the information of a source sentence into a fixed-length vector regardless of its length." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-119", "text": "In addition, the attentional mechanism allows us to look into the \"black box\" to gain insights on how NMT works from a linguistic perspective." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-120", "text": "Luong et al. [2015] propose two simple and effective attentional mechanisms for neural machine translation and compare various alignment functions." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-121", "text": "They show that attention-based NMT is superior to non-attentional models in translating names and long sentences."
}, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-122", "text": "----------------------------------" }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-123", "text": "**ATTENTION-BASED NEURAL MACHINE TRANSLATION**" }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-124", "text": "After analyzing the alignment matrices generated by GroundHog [Bahdanau et al., 2015] , we find that modeling the structural divergence of natural languages is so challenging that unidirectional models can only capture part of alignment regularities." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-125", "text": "This finding inspires us to improve attention-based NMT by combining two unidirectional models." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-126", "text": "In this work, we only apply agreement-based joint learning to GroundHog." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-127", "text": "As our approach does not assume specific network architectures, it is possible to apply it to the models proposed by Luong et al. [2015] ." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-128", "text": "----------------------------------" }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-129", "text": "**AGREEMENT-BASED LEARNING**" }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-130", "text": "Liang et al. [2006] first introduce agreement-based learning into word alignment." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-131", "text": "The basic idea is to encourage asymmetric IBM models to agree on word alignment, which is a latent structure in word-based translation models [Brown et al., 1993] ." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-132", "text": "This strategy significantly improves alignment quality across many languages." 
}, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-133", "text": "They extend this idea to deal with more latent-variable models in grammar induction and predicting missing nucleotides in DNA sequences [Liang et al., 2007] ." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-134", "text": "propose generalized agreement for word alignment." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-135", "text": "The new general framework allows for arbitrary loss functions that measure the disagreement between asymmetric alignments." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-136", "text": "The loss functions can not only be defined between asymmetric alignments but also between alignments and other latent structures such as phrase segmentations." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-137", "text": "In attention-based NMT, word alignment is treated as a parametrized function instead of a latentvariable." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-138", "text": "This makes word alignment differentiable, which is important for training attention-based NMT models." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-139", "text": "Although alignment matrices in attention-based NMT are in principle \"symmetric\" as they allow for many-to-many soft alignments, we find that unidirectional modeling can only capture partial aspects of structure mapping." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-140", "text": "Our contribution is to adapt agreement-based learning into attentional NMT, which leads to significant improvements in terms of both alignment and translation performance." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-91", "text": "The encoder-decoder framework and the attentional mechanism remain unchanged." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-92", "text": "We follow Jean et al. [2015] to address unknown words based on alignment matrices." 
}, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-93", "text": "Given alignment matrices between source sentences and target sentences, it is possible to calculate the position of the source word that is most translationally equivalent for each target word." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-94", "text": "After a source sentence is translated, each unknown word is translated from its corresponding source word." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-95", "text": "While Jean et al. [2015] use bilingual dictionary generated by an off-the-shelf word aligner to translate unknown words, we simply use unigram phrases." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-96", "text": "Table 1 shows the results on the Chinese-English translation task." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-97", "text": "We find that our approach significantly outperforms both Moses and GroundHog with independent training in both Chinese-to-English and English-to-Chinese directions across all datasets." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-98", "text": "----------------------------------" }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-99", "text": "**RESULTS ON CHINESE-ENGLISH DATA**" }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-100", "text": "Figure 2(b) shows example alignment matrices resulted from agreement-based joint training." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-101", "text": "We find that agreement-based joint training improves the alignment accuracy significantly since the two models are complementary." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-102", "text": "Table 3 : Results on the English-French translation task." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-103", "text": "The BLEU scores are case-insensitive. \"**\": significantly better than Moses (p < 0.01); \"++\": significantly better than GroundHog with independent training (p < 0.01)." 
}, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-104", "text": "Table 2 shows the results on the Chinese-English word alignment task." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-105", "text": "We used the TsinghuaAligner evaluation dataset in which both the validation and test sets contain 450 manuallyaligned Chinese-English sentence pairs." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-106", "text": "We follow Luong et al. [2015] to \"force\" decode our jointly trained models to produce translations that match the references." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-107", "text": "Then, we extract only one-to-one alignments by selecting the source word with the highest alignment weight for each target word." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-108", "text": "We find that agreement-based joint training significantly reduces alignment errors for both directions as compared with independent training." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-109", "text": "Table 3 gives the results on the English-French translation task." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-110", "text": "While GroundHog with independent training achieves translation performance on par with Moses, agreement-based joint learning leads to significant improvements over both baselines." }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-111", "text": "This suggests that our approach is general and can be applied to more language pairs." 
}, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-112", "text": "----------------------------------" }, { "sent_id": "260489da0fb3f7a201a6a1cce8f03b-C001-113", "text": "**RESULTS ON ENGLISH-FRENCH DATA**" } ], "y": { "@BACK@": { "gold_contexts": [ [ "260489da0fb3f7a201a6a1cce8f03b-C001-9" ], [ "260489da0fb3f7a201a6a1cce8f03b-C001-11" ], [ "260489da0fb3f7a201a6a1cce8f03b-C001-14", "260489da0fb3f7a201a6a1cce8f03b-C001-29", "260489da0fb3f7a201a6a1cce8f03b-C001-30" ], [ "260489da0fb3f7a201a6a1cce8f03b-C001-32" ], [ "260489da0fb3f7a201a6a1cce8f03b-C001-34", "260489da0fb3f7a201a6a1cce8f03b-C001-35", "260489da0fb3f7a201a6a1cce8f03b-C001-36" ], [ "260489da0fb3f7a201a6a1cce8f03b-C001-80" ], [ "260489da0fb3f7a201a6a1cce8f03b-C001-117" ] ], "cite_sentences": [ "260489da0fb3f7a201a6a1cce8f03b-C001-9", "260489da0fb3f7a201a6a1cce8f03b-C001-11", "260489da0fb3f7a201a6a1cce8f03b-C001-14", "260489da0fb3f7a201a6a1cce8f03b-C001-29", "260489da0fb3f7a201a6a1cce8f03b-C001-32", "260489da0fb3f7a201a6a1cce8f03b-C001-34", "260489da0fb3f7a201a6a1cce8f03b-C001-80", "260489da0fb3f7a201a6a1cce8f03b-C001-117" ] }, "@USE@": { "gold_contexts": [ [ "260489da0fb3f7a201a6a1cce8f03b-C001-76" ], [ "260489da0fb3f7a201a6a1cce8f03b-C001-79", "260489da0fb3f7a201a6a1cce8f03b-C001-80", "260489da0fb3f7a201a6a1cce8f03b-C001-86" ] ], "cite_sentences": [ "260489da0fb3f7a201a6a1cce8f03b-C001-76", "260489da0fb3f7a201a6a1cce8f03b-C001-80", "260489da0fb3f7a201a6a1cce8f03b-C001-86" ] }, "@EXT@": { "gold_contexts": [ [ "260489da0fb3f7a201a6a1cce8f03b-C001-80", "260489da0fb3f7a201a6a1cce8f03b-C001-90" ] ], "cite_sentences": [ "260489da0fb3f7a201a6a1cce8f03b-C001-80" ] }, "@MOT@": { "gold_contexts": [ [ "260489da0fb3f7a201a6a1cce8f03b-C001-124" ] ], "cite_sentences": [ "260489da0fb3f7a201a6a1cce8f03b-C001-124" ] } } }, "ABC_8853d810b364ae47a2da71c2502b3e_14": { "x": [ { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-23", "text": "**LEXICAL RESOURCES**" }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-47", "text": 
"**EQUAL:**" }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-2", "text": "More and more evidence is appearing that integrating symbolic lexical knowledge into neural models aids learning." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-3", "text": "This contrasts the widely-held belief that neural networks largely learn their own feature representations." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-4", "text": "For example, recent work has shown benefits of integrating lexicons to aid cross-lingual part-of-speech (PoS)." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-5", "text": "However, little is known on how complementary such additional information is, and to what extent improvements depend on the coverage and quality of these external resources." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-6", "text": "This paper seeks to fill this gap by providing a thorough analysis on the contributions of lexical resources for cross-lingual PoS tagging in neural times." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-7", "text": "Our base tagger is a bidirectional long short-term memory network (bi-LSTM) (Graves and Schmidhuber, 2005; Hochreiter and Schmidhuber, 1997; Plank et al., 2016) with a rich word encoding model which consists of a character-based bi-LSTM representation cw paired with pre-trained word embeddings w. Sub-word and especially character-level modeling is currently pervasive in top-performing neural sequence taggers, owing to its capacity to effectively capture morphological features that are useful in labeling out-ofvocabulary (OOV) items." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-8", "text": "Sub-word information is often coupled with standard word embeddings to mitigate OOV issues." 
}, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-9", "text": "Specifically, i) word embeddings are typically built from massive unlabeled datasets and thus OOVs are less likely to be encountered at test time, while ii) character embeddings offer further linguistically plausible fallback for the remaining OOVs through modeling intraword relations." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-10", "text": "Through these approaches, multilingual PoS tagging has seen tangible gains from neural methods in the recent years." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-11", "text": "----------------------------------" }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-12", "text": "**INTRODUCTION**" }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-13", "text": "In natural language processing, the deep learning revolution has shifted the focus from conventional hand-crafted symbolic representations to dense inputs, which are adequate representations learned automatically from corpora." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-14", "text": "However, particularly when working with low-resource languages, small amounts of symbolic lexical resources such as user-generated lexicons are often available even when gold-standard corpora are not." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-15", "text": "Recent work has shown benefits of combining conventional lexical information into neural cross-lingual part-ofspeech (PoS) tagging (Plank and Agi\u0107, 2018) ." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-16", "text": "However, little is known on how complementary such additional information is, and to what extent improvements depend on the coverage and quality of these external resources." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-17", "text": "The contribution of this paper is in the analysis of the contributions of models' components (tagger transfer through annotation projection vs. 
the contribution of encoding lexical and morphosyntactic resources)." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-18", "text": "We seek to understand under which conditions a low-resource neural tagger benefits from external lexical knowledge." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-19", "text": "In particular: a) we evaluate the neural tagger across a total of 20+ languages, proposing a novel baseline which uses retrofitting; b) we investigate the reliance on dictionary size and properties; c) we analyze model-internal representations via a probing task to investigate to what extent model-internal representations capture morphosyntactic information." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-20", "text": "Our experiments confirm the synergetic effect between a neural tagger and symbolic linguistic knowledge." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-21", "text": "Moreover, our analysis shows that the composition of the dictionary plays a more important role than its coverage." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-22", "text": "----------------------------------" }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-24", "text": "We use linguistic resources that are user-generated and available for many languages." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-25", "text": "The first is WIKTIONARY, a word type dictionary that maps words to one of the 12 Universal PoS tags (Li et al., 2012; Petrov et al., 2012) ." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-26", "text": "The second resource is UNIMORPH, a morphological dictionary that provides inflectional paradigms for 350 languages (Kirov et al., 2016) ." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-27", "text": "For Wiktionary, we use the freely available dictionaries from Li et al. (2012) ." 
}, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-28", "text": "UniMorph covers between 8-38 morphological properties (for English and Finnish, respectively)." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-29", "text": "1 The sizes of the dictionaries vary considerably, from a few thousand entries (e.g., for Hindi and Bulgarian) to 2M entries (Finnish Uni-Morph)." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-30", "text": "We study the impact of smaller dictionary sizes in Section 4.1." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-31", "text": "The tagger we analyze in this paper is an extension of the base tagger, called distant supervision from disparate sources (DSDS) tagger (Plank and Agi\u0107, 2018) ." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-32", "text": "It is trained on projected data and further differs from the base tagger by the integration of lexicon information." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-33", "text": "In particular, given a lexicon src, DSDS uses e src to embed the lexicon into an l-dimensional space, where e src is the concatenation of all embedded m properties of length l (empirically set, see Section 2.2), and a zero vector for words not in the lexicon." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-34", "text": "A property here is a possible PoS tag (for Wiktionary) or a morphological feature (for Unimorph)." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-35", "text": "To integrate the type-level supervision, the lexicon embeddings vector is created and concatenated to the word and character-level representations for every token: w \u2022 cw \u2022 e." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-36", "text": "We compare DSDS to alternative ways of using lexical information." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-37", "text": "The first approach uses lexical information directly during decoding (T\u00e4ckstr\u00f6m et al., 2013) ." 
}, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-38", "text": "The second approach is more implicit and uses the lexicon to induce better word embeddings for tagger initialization." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-39", "text": "In particular, we use the dictionary for retrofitting off-the-shelf embeddings (Faruqui et al., 2015) to initialize the tagger with those." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-40", "text": "The latter is a novel approach which, to the best of our knowledge, has not yet been evaluated in the neural tagging literature." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-41", "text": "The idea is to bring the off-theshelf embeddings closer to the PoS tagging task by retrofitting the embeddings with syntactic clusters derived from the lexicon." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-42", "text": "We take a deeper look at the quality of the lex-1 More details: http://unimorph.org/ icons by comparing tag sets to the gold treebank data, inspired by Li et al. (2012) ." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-43", "text": "In particular, let T be the dictionary derived from the gold treebank (development data), and W be the user-generated dictionary, i.e., the respective Wiktionary (as we are looking at PoS tags)." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-44", "text": "For each word type, we compare the tag sets in T and W and distinguish six cases:" }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-45", "text": "1. NONE: The word type is in the training data but not in the lexicon (out-of-lexicon)." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-46", "text": "----------------------------------" }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-48", "text": "In an ideal setup, the dictionaries contain no disjoint tag sets, and larger amounts of equal tag sets or superset of the treebank data." 
}, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-49", "text": "This is particularly desirable for approaches that take lexical information as type-level supervision." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-50", "text": "----------------------------------" }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-51", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-52", "text": "In this section we describe the baselines, the data and the tagger hyperparameters." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-53", "text": "Data We use the 12 Universal PoS tags (Petrov et al., 2012) ." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-54", "text": "The set of languages is motivated by accessibility to embeddings and dictionaries." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-55", "text": "We here focus on 21 dev sets of the Universal Dependencies 2.1 (Nivre and et al., 2017) , test set results are reported by Plank and Agi\u0107 (2018) showing that DSDS provides a viable alternative." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-56", "text": "Annotation projection To build the taggers for new languages, we resort to annotation projection following Plank and Agi\u0107 (2018) ." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-57", "text": "In particular, they employ the approach by Agi\u0107 et al. (2016) , where labels are projected from multiple sources to multiple targets and then decoded through weighted majority voting with word alignment probabilities and source PoS tagger confidences." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-58", "text": "The wide-coverage Watchtower corpus (WTC) by Agi\u0107 et al. (2016) is used, where 5k instances are selected via data selection by alignment coverage following Plank and Agi\u0107 (2018) ." 
}, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-59", "text": "Baselines We compare to the following alternatives: type-constraint Wiktionary supervision (Li et al., 2012) and retrofitting initialization." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-60", "text": "Hyperparameters We use the same setup as Plank and Agi\u0107 (2018) , i.e., 10 epochs, word dropout rate (p=.25) and l=40-dimensional lexicon embeddings for DSDS, except for downscaling the hidden dimensionality of the character representations from 100 to 32 dimensions." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-61", "text": "This ensures that our probing tasks always get the same input dimensionality: 64 (2x32) dimensions for cw, which is the same dimension as the off-theshelf word embeddings." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-62", "text": "Language-specific hyperparameters could lead to optimized models for each language." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-63", "text": "However, we use identical settings for each language which worked well and is less expensive, following Bohnet et al. (2018) ." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-64", "text": "For all experiments, we average over 3 randomly seeded runs, and provide mean accuracy." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-65", "text": "We use the off-the-shelf Polyglot word embeddings (Al-Rfou et al., 2013) ." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-66", "text": "Word embedding initialization provides a consistent and considerable boost in this cross-lingual setup, up to 10% absolute improvements across 21 languages when only 500 projected training instances are available (Plank and Agi\u0107, 2018) ." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-67", "text": "Note that we em- pirically find it to be best to not update the word embeddings in this noisy training setup, as that results in better performance, see Section 4.4." 
}, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-68", "text": "----------------------------------" }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-69", "text": "**RESULTS**" }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-70", "text": "Table 1 presents our replication results, i.e., tagging accuracy for the 21 individual languages, with means over all languages and language families (for which at least two languages are available)." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-71", "text": "There are several take-aways." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-72", "text": "----------------------------------" }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-73", "text": "**INCLUSION OF LEXICAL INFORMATION**" }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-74", "text": "Combining the best of two worlds results in the overall best tagging accuracy, confirming Plank and Agi\u0107 (2018) : Embedding lexical information into a neural tagger improves tagging accuracy from 83.4 to 84.1 (means over 21 languages)." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-75", "text": "On 15 out of 21 languages, DSDS is the best performing model." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-76", "text": "On two languages, type constraints work the best (English and Greek)." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-77", "text": "Retrofitting performs best only on one language (Persian); this is the language with the overall lowest performance." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-78", "text": "On three languages, Czech, French and Hungarian, the baseline remains the best model, none of the lexicon-enriching approaches works." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-79", "text": "We proceed to inspect these results in more detail." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-80", "text": "Analysis Overall, type-constraints improve the baseline but only slightly (83.4 vs 83.6)." 
}, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-81", "text": "Intuitively, this more direct use of lexical information requires the resource to be high coverage and a close fit to the evaluation data, to not introduce too many pruning errors during decoding due to contradictory tag sets." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-82", "text": "To analyze this, we look at the tag set agreement in Figure 1 ." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-83", "text": "For languages for which the level of disjoint tag set information is low, such as Greek, English, Croatian, Finnish and Dutch, type constraints are expected to help." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-84", "text": "This is in fact the case, but there are exceptions, such as Finnish." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-85", "text": "Coverage of the lexicon is also important, and for this morphologically rich language, the coverage is amongst the lowest (c.f. large amount of the 'none' category in Figure 1 )." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-86", "text": "The more implicit use of lexical information in DSDS helps on languages with relatively high dictionary coverage and low tag set disagreement, such as Danish, Dutch and Italian." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-87", "text": "Compared to type constraints, embedding the lexicon also helps on languages with low dictionary coverage, such as Bulgarian, Hindi, Croatian and Finnish, which is very encouraging and in sharp contrast to type constraints." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-88", "text": "The only outlier remains Greek." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-89", "text": "Figure 2 (a) plots the absolute improvement in tagging accuracy over the baseline versus the number of properties in the dictionaries." 
}, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-90", "text": "Slavic and Germanic languages cluster nicely, with some outliers (Croatian)." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-91", "text": "However, there is only a weak positive correlation (\u03c1=0.08)." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-92", "text": "More properties do not necessarily improve performance, and lead to sparsity." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-93", "text": "The inclusion of the lexicons results in higher coverage, which might be part of the explanation for the improvement of DsDs." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-94", "text": "The question remains whether the tagger learns to rely only on this additional signal, or it generalizes beyond it." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-95", "text": "Therefore, we first turn to inspecting out-ofvocabulary (OOV) items." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-96", "text": "OOV items are the key challenge in part-of-speech tagging, i.e., to correctly tag tokens unseen in the training data." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-97", "text": "In Figure 2 (b) and (c), we analyze accuracy improvements on different groups of tokens: The in lex+train tokens that were seen both in the lexicon and the training data, the in train only tokens seen in the training data but not present in the lexicon, the in lex only tokens that were present in the lexicon but not seen in the training data and the true OOV tokens that were neither seen in training nor present in the lexicon." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-98", "text": "Figure 2 (b) shows means over the 21 languages, Figure 2 (c) provides details per language." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-99", "text": "The first take-away is that in many cases the tagger does learn to use information beyond the coverage of the lexicon." 
}, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-100", "text": "The embedded knowledge helps the tagger to improve on tokens which are in train only (and are thus not in the lexicon, green bars)." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-101", "text": "For true OOVs (orange bars), this is the case for some languages as well Figure 2 (c) , i.e., improvements on true OOVs can be observed for Bulgarian, German, Greek, English, Finish, Croatian, Italian and Portuguese." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-102", "text": "Over all 21 languages there is a slight drop on true OOVs: -0.08, but this is a mean over all languages, for which results vary, making it important to look beyond the aggregate level." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-103", "text": "Over all languages except for Hungarian, the tagger, unsurprisingly, improves over tokens which are both in the lexicon and in the training data (see further discussion in Section 4)." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-104", "text": "----------------------------------" }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-105", "text": "**DISCUSSION**" }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-106", "text": "Here we dig deeper into the effect of including lexical information by a) examining learning curves with increasing dictionary sizes, b) relating tag set properties to performance, and finally c) having a closer look at model internal representations, by comparing them to the representations of the base model that does not include lexical information." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-107", "text": "We hypothesize that when learning from dictionary-level supervision, information is propagated through the representation layers so as to generalize beyond simply relying on the respective external resources." 
}, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-108", "text": "----------------------------------" }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-109", "text": "**LEARNING CURVES**" }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-110", "text": "The lexicons we use so far are of different sizes (shown in Table 1 of Plank and Agi\u0107 (2018) ), spanning from 1,000 entries to considerable dictionaries of several hundred thousands entries." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-111", "text": "In a low-resource setup, large dictionaries might not be available." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-112", "text": "It is thus interesting to examine how tagging accuracy is affected by dictionary size." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-113", "text": "We examine two cases: randomly sampling dictionary entries and sampling by word frequency, over increasing dictionary sizes: 50, 100, 200, 400, 800, 1600 word types." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-114", "text": "The latter is motivated by the fact that an informed dictionary creation (under limited resources) might be more beneficial." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-115", "text": "We estimate word frequency by using the UD training data sets (which are otherwise not used)." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-116", "text": "guages (with confidence intervals of \u00b11 standard deviation based on three runs)." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-117", "text": "We note that sampling by frequency is overall more beneficial than random sampling." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-118", "text": "The biggest effect of sampling by frequency is observed for the Romance language family, see Figure 3 (b)." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-119", "text": "It is noteworthy that more dictionary data is not always necessarily beneficial." 
}, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-120", "text": "Sometimes a small but high-frequency dictionary approximates the entire dictionary well." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-121", "text": "This is for instance the case for Danish, where sampling by frequency approximates the entire dictionary well ('all' achieves 90.1, while using 100 most frequent entries is close: 89.93)." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-122", "text": "Frequency sampling also helps clearly for Italian, but here having the entire dictionary results in the overall highest performance." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-123", "text": "For some languages, the inclusion of lexical information does not help, not even at smaller dictionary sizes." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-124", "text": "This is the case for Hungarian, French and Czech." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-125", "text": "For Hungarian using the entire dictionary drops performance below the baseline." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-126", "text": "For Czech, this is less pronounced, as the performance stays around baseline." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-127", "text": "Relating these negative ef- fects to the results from the tag set agreement analysis (Figure 1 ), we note that Hungarian is the language with the largest disjoint tag set." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-128", "text": "Albeit the coverage for Hungarian is good (around .5), including too much contradictory tag information has a clear deteriorating effect." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-129", "text": "Consequently, neither sampling strategy works." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-130", "text": "Czech, which has less coverage, sees a negative effect as well: half of the dictionary entries have disjoint tag sets." 
}, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-131", "text": "Italian is the language with the highest dictionary coverage and the highest proportion of equal tag sets, thereby providing a large positive benefit." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-132", "text": "We conclude that when dictionaries are not available, creating them by targeting highfrequency items is a pragmatic and valuable strategy." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-133", "text": "A small dictionary, which does not contain too contradictory tag sets, can be beneficial." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-134", "text": "----------------------------------" }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-135", "text": "**ANALYSIS OF CORRECT/INCORRECT PREDICTIONS**" }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-136", "text": "In the following we analyze correctly and incorrectly labeled tokens." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-137", "text": "Because we are analyzing differences between languages as well as between errors and successes we abstract away from the underlying sample size variation by comparing proportions." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-138", "text": "The analysis inspects the differences in proportions on four subsections of the development set, as introduced above: the in lex+train tokens, the in train only tokens, the in lex only tokens and the true OOVs." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-139", "text": "The proportion of these four data subsets in the correctly and the incorrectly labeled tokens are shown side by side in Figure 4 in lighter and darker shades, respectively." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-140", "text": "If the OOVstatus of a word was unrelated to performance, the lighter and darker bars would be of identical size." 
}, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-141", "text": "This is not the case and we can observe that the true OOVs make up a significantly larger share of the errors than of successes (two-tailed paired Student's t-test: p = 0.007)." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-142", "text": "Similarly, seen across all languages the shift in the size of the proportion of true OOVs is made up by more correct labeling of a larger proportion of in train only (two-tailed paired Student's t-test: p = 0.014) and in lex only (two-tailed paired Student's t-test: p = 0.020), whereas the proportion of in lex+train does not significantly differ between the correctly and incorrectly labeled parts (two-tailed paired Student's t-test: p = 0.200)." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-143", "text": "2" }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-144", "text": "----------------------------------" }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-145", "text": "**PROBING WORD ENCODINGS**" }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-146", "text": "Probing tasks, or diagnostic classifiers, are separate classifiers which use representations extracted from any facet of a trained neural model as input for solving a separate task." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-147", "text": "Following the intuition of Adi et al. (2017) , if the target can be predicted, then the information must be encoded in the representation." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-148", "text": "However, the contrary does not necessarily hold: if the model fails it does not necessarily follow that the information is not encoded, as opposed to not being encoded in a useful way for a probing task classifier." 
}, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-149", "text": "As the internal representations stored in neural models are not immediately interpretable, probing tasks serve as a way of querying neural representations for interpretable information." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-150", "text": "The probing task objective and training data is designed to model the query of interest." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-151", "text": "The representation layer we query in this work is the word-level output from the character embedding sub-model." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-152", "text": "This part of the word-level representation starts out uninformative and thus without prior prediction power on the classifier objectives." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-153", "text": "The pre-trained word embeddings stay fixed in our model (see Section 4.4)." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-154", "text": "However, the character-based word encodings get updated: This holds true both for the BASE system and the DSDS tagger." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-155", "text": "As a target for assessing the flow of information in the neural tagger, we thus focus on the character-based word encodings." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-156", "text": "The word-level is relevant as it is the granularity at which the tagger is evaluated." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-157", "text": "The word embeddings may already have encoded PoS-relevant information and the lexicon embeddings explicitly encodes PoS-type-level information." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-158", "text": "By contrast, the character-based word encodings are initialized to be uninformative and any encoding of PoS-related information is necessarily a result of the neural training feedback signal." 
}, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-159", "text": "For these reasons we query the character-based word representations of the tagger in order to com-pare variation between the base tagger and the DSDS lexicon-enriched architecture." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-160", "text": "Figure 5 : Macro F1 scores for stand-alone classifiers on the probing tasks of predicting which words are long and which are in the lexicon, respectively." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-161", "text": "The baseline (bl) is a simple majority baseline." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-162", "text": "The base-and DsDs-informed classifiers were trained on character-based word representations from the neural taggers with and without access to lexical information, respectively." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-163", "text": "We employ two binary probing tasks: predicting which words are long, i.e., contain more than 7 characters 3 , and predicting which words are in the lexicon." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-164", "text": "The word length task is included as a task which can be learned independently of whether lexicon information is available to the neural model." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-165", "text": "Storing length-related information might help the model distinguish suffix patterns of relevance to PoS-tagging." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-166", "text": "Following Shi et al. (2016) and Gulordava et al (2018) , we use a logistic regression classifier setup and a constant input dimensionality of 64 across tasks (Conneau et al., 2018) ." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-167", "text": "The classifiers are trained using 10-fold cross-validation for each of three trained runs of each neural model and averaged." 
}, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-168", "text": "We include a majority baseline and report macro F1-scores, as we are dealing with imbalanced classes." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-169", "text": "The training vocabulary of both probing tasks is restricted to the neural tagger training vocabulary, that is, all word types in the projected training data, as these are the representations which have been subject to updates during training of the neural model." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-170", "text": "Using the projected data has the advantage that the vocabulary is similar across languages as the data comes from the same domain (Watchtower)." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-171", "text": "3 Considering words of 7 characters or more to be long is based on the threshold that was experimentally tuned in the design of the readability metric LIX (Bj\u00f6rnsson, 1983) ." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-172", "text": "This threshold aligns well with the visual perceptual span within which proficient readers from grade four and up can be expected to automatically decode a word in a single fixation (Sperlich et al., 2015) The results on the word length probing task shown on the top half of Figure 5 confirm that information relevant to distinguishing word length is being encoded in the neural representation, as expected." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-173", "text": "It is intriguing that the lexiconinformed DSDS representation encodes this information even at higher degree." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-174", "text": "On the task of classifying which words are in the lexicon, all neural representations beat the majority baseline, but we also see that this task is harder, given the higher variance across languages." 
}, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-175", "text": "With Spanish (es) and Croatian (hr) as the only exceptions, the DsDs-based representations are generally encoding more of the information relevant to distinguishing which words are in the lexicon, confirming our intuitions that the internal representations were altered." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-176", "text": "Note, however, that even the base-tagger is able to solve this task above chance level." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-177", "text": "This is potentially an artifact of how lexicons grow where it would be likely for several inflections of the same word to be added collectively to the lexicon at once, and since the character representations can be expected to produce more similar representations of words derived from the same lemma the classifier will be able to generalize and perform above chance level without the base-model representations having ever been exposed to the lexical resource." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-178", "text": "----------------------------------" }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-179", "text": "**UPDATING IN LIGHT OF NOISY DATA?**" }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-180", "text": "When training a tagger with noisy training data and pre-trained embeddings, the question arises whether it is more beneficial to freeze the word embeddings or update them." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-181", "text": "We hypothesize that freezing embeddings is more beneficial in noisy training cases, as it helps to stabilize the signal from the pre-trained word embeddings while avoiding updates from the noisy training data." 
}, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-182", "text": "To test this hypothesis, we train the base tagger on high-quality gold training data (effectively, the UD training data sets), with and without freezing the word embeddings layer." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-183", "text": "We find that updating the word embedding layer is in fact beneficial in the high-quality training data regime: on average +0.4% absolute improvement is obtained (mean over 21 languages)." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-184", "text": "This is in sharp contrast to the noisy training data regime, in which the baseline accuracy drops by as much as 1.2% accuracy." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-185", "text": "Therefore, we train the tagger with pre-trained embeddings on projected WTC data and freeze the word embeddings lookup layer during training." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-186", "text": "In recent years, natural language processing has witnessed a move towards deep learning approaches, in which automatic representation learning has become the de facto standard methodology (Collobert et al., 2011; Manning, 2015) ." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-187", "text": "One of the first works that combines neural representations with semantic symbolic lexicons is the work on retrofitting (Faruqui et al., 2015) ." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-188", "text": "The main idea is to use the relations defined in semantic lexicons to refine word embedding representations, such that words linked in the lexical resource are encouraged to be closer to each other in the distributional space." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-189", "text": "The majority of recent work on neural sequence prediction follows the commonly perceived wisdom that hand-crafted features are obsolete for deep learning methods." 
}, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-190", "text": "They rely on end-to-end training without resorting to additional linguistic resources." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-191", "text": "Our study contributes to the increasing literature to show the utility of linguistic resources for deep learning models by providing a deep analysis of a recently proposed model (Plank and Agi\u0107, 2018) ." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-192", "text": "Most prior work in this direction can be found on machine translation (Sennrich and Haddow, 2016; Chen et al., 2017; Li et al., 2017; Passban et al., 2018) , work on named entity recognition (Wu et al., 2018) and PoS tagging (Sagot and Mart\u00ednez Alonso, 2017) who use lexicons, but as n-hot features and without examining the crosslingual aspect." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-193", "text": "Somewhat complementary to evaluating the utility of linguistic resources empirically is the increasing body of work that uses linguistic insights to try to understand what properties neural-based representations capture (K\u00e1d\u00e1r et al., 2017; Adi et al., 2017; Conneau et al., 2018; Hupkes et al., 2018) ." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-194", "text": "Shi et al. (2016) and Adi et al. (2017) introduced the idea of probing tasks (or 'diagnostic classifiers'), see Belinkov and Glass for a recent survey (Belinkov and Glass, 2019) ." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-195", "text": "Adi et al. (2017) evaluate several kinds of sentence encoders and propose a range of probing tasks around isolated aspects of sentence structure at the surface level (sentence length, word content and word order)." 
}, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-196", "text": "This work has been greatly expanded by including both syntactic and semantic probing tasks, careful sampling of probing task training data, and extending the framework to make it encoder agnostic (Conneau et al., 2018) ." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-197", "text": "A general observation here is that task-specific knowledge is needed in order to design relevant diagnostic tasks, which is not always straightforward." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-198", "text": "For example, Gulordava (2018) investigate whether RNNs trained using a language model objective capture hierarchical syntactic information." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-199", "text": "They create nonsensical construction so that the RNN cannot rely on lexical or semantic clues, showing that RNNs still capture syntactic properties in sentence embeddings across the four tested languages while obfuscating lexical information." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-200", "text": "There is also more theoretical work on investigating the capabilities of recurrent neural networks, e.g., Weiss et al. (2018) show that specific types of RNNs (LSTMs) are able to use counting mechanisms to recognize specific formal languages." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-201", "text": "Finally, linguistic resources can also serve as proxy for evaluation." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-202", "text": "As recently shown (Agi\u0107 et al., 2017) , type-level information from dictionaries approximates PoS tagging accuracy in the absence of gold data for cross-lingual tagger evaluation." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-203", "text": "Their use of high-frequency word types inspired parts of our analysis." 
}, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-204", "text": "----------------------------------" }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-205", "text": "**CONCLUSIONS**" }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-206", "text": "We analyze DSDS, a recently-proposed lowresource tagger that symbiotically leverages neural representations and symbolic linguistic knowledge by integrating them in a soft manner." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-207", "text": "We replicated the results of Plank and Agi\u0107 (2018) , showing that the more implicit use of embedding user-generated dictionaries turns out to be more beneficial than approaches that rely more explicitly on symbolic knowledge, such a type constraints or retrofitting." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-208", "text": "By analyzing the reliance of DSDS on the linguistic knowledge, we found that the composition of the lexicon is more important than its size." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-209", "text": "Moreover, the tagger benefits from small dictionaries, as long as they do not contain tag set information contradictory to the evaluation data." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-210", "text": "Our quantitative analysis also sheds light on the internal representations, showing that they get more sensitive to the task." }, { "sent_id": "8853d810b364ae47a2da71c2502b3e-C001-211", "text": "Finally, we found that freezing pre-trained word embeddings complement the learning signal well in this noisy data regime." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "8853d810b364ae47a2da71c2502b3e-C001-15" ], [ "8853d810b364ae47a2da71c2502b3e-C001-66" ] ], "cite_sentences": [ "8853d810b364ae47a2da71c2502b3e-C001-15", "8853d810b364ae47a2da71c2502b3e-C001-66" ] }, "@MOT@": { "gold_contexts": [ [ "8853d810b364ae47a2da71c2502b3e-C001-15", "8853d810b364ae47a2da71c2502b3e-C001-16" ] ], "cite_sentences": [ "8853d810b364ae47a2da71c2502b3e-C001-15" ] }, "@USE@": { "gold_contexts": [ [ "8853d810b364ae47a2da71c2502b3e-C001-31" ], [ "8853d810b364ae47a2da71c2502b3e-C001-55" ], [ "8853d810b364ae47a2da71c2502b3e-C001-56" ], [ "8853d810b364ae47a2da71c2502b3e-C001-58" ], [ "8853d810b364ae47a2da71c2502b3e-C001-60" ], [ "8853d810b364ae47a2da71c2502b3e-C001-110" ], [ "8853d810b364ae47a2da71c2502b3e-C001-191" ], [ "8853d810b364ae47a2da71c2502b3e-C001-207" ] ], "cite_sentences": [ "8853d810b364ae47a2da71c2502b3e-C001-31", "8853d810b364ae47a2da71c2502b3e-C001-55", "8853d810b364ae47a2da71c2502b3e-C001-56", "8853d810b364ae47a2da71c2502b3e-C001-58", "8853d810b364ae47a2da71c2502b3e-C001-60", "8853d810b364ae47a2da71c2502b3e-C001-110", "8853d810b364ae47a2da71c2502b3e-C001-191", "8853d810b364ae47a2da71c2502b3e-C001-207" ] }, "@EXT@": { "gold_contexts": [ [ "8853d810b364ae47a2da71c2502b3e-C001-31", "8853d810b364ae47a2da71c2502b3e-C001-32" ], [ "8853d810b364ae47a2da71c2502b3e-C001-207" ] ], "cite_sentences": [ "8853d810b364ae47a2da71c2502b3e-C001-31", "8853d810b364ae47a2da71c2502b3e-C001-207" ] }, "@DIF@": { "gold_contexts": [ [ "8853d810b364ae47a2da71c2502b3e-C001-65", "8853d810b364ae47a2da71c2502b3e-C001-66" ] ], "cite_sentences": [ "8853d810b364ae47a2da71c2502b3e-C001-66" ] }, "@SIM@": { "gold_contexts": [ [ "8853d810b364ae47a2da71c2502b3e-C001-74" ] ], "cite_sentences": [ "8853d810b364ae47a2da71c2502b3e-C001-74" ] } } }, "ABC_bdd0ebe147e277f8f7f04fc351464a_14": { "x": [ { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-2", "text": 
"Despite its importance, the task of summarizing evolving events has received small attention by researchers in the field of Multi-document Summarization." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-3", "text": "In a previous paper [5] we have presented a methodology for the automatic summarization of documents, emitted by multiple sources, which describe the evolution of an event." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-4", "text": "At the heart of this methodology lies the identification of similarities and differences between the various documents, in two axes: the synchronic and the diachronic." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-5", "text": "This is achieved by the introduction of the notion of Synchronic and Diachronic Relations." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-6", "text": "Those relations connect the messages that are found in the documents, resulting thus in a graph which we call grid." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-7", "text": "Although the creation of the grid completes the Document Planning phase of a typical NLG architecture, it can be the case that the number of messages contained in a grid is very large, exceeding thus the required compression rate." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-8", "text": "In this paper we provide some initial thoughts on a probabilistic model which can be applied at the Content Determination stage, and which tries to alleviate this problem." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-9", "text": "----------------------------------" }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-11", "text": "It wouldn't be an exaggeration to claim that human beings live engulfed in an environment full of information." 
}, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-12", "text": "Information which, metaphorically speaking, vie with each other in order to gain our attention, to gain an almost exclusive control of the precious resources which are our brains." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-13", "text": "This is most evident in the medium of Internet in which so many people are spending nowadays a considerable amount of their time." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-14", "text": "Information in this medium is constantly flowing in front of our screens, making the assimilation of such a plethora no longer feasible." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-15", "text": "In such an environment, information which is presented in brief and concise manner-i.e." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-16", "text": "summarized information-stand more chances of retaining our attention, in relation to information presented in long and fragmented pieces of text." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-17", "text": "We can claim then, with a certain degree of certainty, that the task of automatic text summarization can prove to be very useful." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-18", "text": "To provide a concrete example, we can imagine the case of a person who would like to keep track of the information related to an event as the event is evolving through time." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-19", "text": "What will usually happen in such cases is that, firstly, there will be more than one sources which will provide an account of the event, and secondly, most of the sources will provide more than one descriptions, in the sense that they will most probably follow the evolution of the event and provide updates as the event evolves through time." 
}, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-20", "text": "This can easily result in hundreds or even thousands of related articles which will describe the evolution of the same event, rendering it thus almost impossible for the interested person to read through its evolution comparing along the way the points in which the sources agree, disagree or present the information from a different point of view." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-21", "text": "A simple visit to a news aggregator, such as for example Google News, 1 can make this point very clear." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-22", "text": "As we have hinted before, a solution to this problem might be the automatic creation of summaries." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-23", "text": "In this paper we will present a methodology which aims at exactly that, i.e. the automatic creation of text summaries from documents emitted by multiple sources which describe the evolution of a particular event." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-24", "text": "In Section 2 we will briefly present this methodology, at the heart of which lies the notion of Synchronic and Diachronic Relations (SDRs) whose aim is the identification of the similarities and differences that exist between the documents in the synchronic and diachronic axes." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-25", "text": "The end result of this methodology is a graph whose vertices are the SDRs and whose nodes are some structures which we call messages." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-26", "text": "The creation of this graph can be considered as completing-as we have previously argued [5] -the Document Planning phase of a typical architecture of a Natural Language Generation (NLG) system [20] ." 
}, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-27", "text": "Nevertheless, this graph can prove to be very large and thus the resulting summary can easily exceed the desired compression rate." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-28", "text": "In Section 4 we will present a brief sketch of a probabilistic model for the selection of the appropriate information-i.e." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-29", "text": "messages-to be included in the final summary, so that the desired compression rate will not be violated." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-30", "text": "In other words, we will propose a model for the Content Determination stage of the Document Planning phase." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-31", "text": "This model will be based on certain remarks concerning the way with which information overlap between multiple documents which we present in Section 3." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-32", "text": "The conclusions of this paper are presented in Section 5." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-33", "text": "----------------------------------" }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-34", "text": "**A METHODOLOGY FOR SUMMARIZING EVOLVING EVENTS**" }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-35", "text": "----------------------------------" }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-36", "text": "**2**" }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-37", "text": "At the heart of Multi-document Summarization (MDS) lies the process of identifying the similarities and differences that exist between the input documents." 
}, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-38", "text": "Although this holds true for the general case of Multi-document Summarization, for the case of summarizing evolving events the identification of the similarities and differences should be distinguished, as we have previously argued [1, 2, 4, 5, 6] between two axes: the synchronic and the diachronic axes." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-39", "text": "In the synchronic axis we are mostly concerned with the degree of agreement or disagreement that the various sources exhibit, for the same time frame, whilst in the diachronic axis we are concerned with the actual evolution of an event, as this evolution is being described by one source." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-40", "text": "The initial inspiration for the SDRs was provided by the Rhetorical Structure Theory (RST) of Mann & Thompson [15, 16] ." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-41", "text": "Rhetorical Structure Theory-which was initially developed in the context of \"computational text generation\" 3 [15, 16, 22] -is trying to connect several units of analysis with relations that are semantic in nature and are supposed to capture the intentions of the author." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-42", "text": "As \"units of analysis\" today are used, almost ubiquitously, the clauses of the text." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-43", "text": "In our case, as units of analysis for the SDRs we are using some structures which we call messages, inspired from the research in the NLG field." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-44", "text": "Each message is composed of two parts: its type and a list of arguments which take their values from an ontology for the specific domain." 
}, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-45", "text": "In other words, a message can be defined as follows:" }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-46", "text": "where arg i \u2208 Domain Ontology" }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-47", "text": "The message type represents the type of the action that is involved in an event, whilst the arguments represent the main entities that are involved in this action." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-48", "text": "Additionally, each message is accompanied by information on the source which emitted this message, as well as its publication and referring time." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-49", "text": "Concerning the SDRs, in order to formally define a relation the following four fields ought to be defined (see also [5] ):" }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-50", "text": "1. The relation's type (i.e. Synchronic or Diachronic)." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-51", "text": "2." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-52", "text": "The relation's name." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-53", "text": "3. The set of pairs of message types that are involved in the relation." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-54", "text": "4. The constraints that the corresponding arguments of each of the pairs of message types should have." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-55", "text": "Those constraints are expressed using the notation of first order logic." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-56", "text": "The name of the relation carries semantic information which, along with the messages that are connected with the relation, are later being exploited by the NLG component (see [5] ) in order to produce the final summary." 
}, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-57", "text": "2 Due to space limitations this section contains a very brief introduction to a methodology for the creation of summaries from evolving events that we have earlier presented [5] ." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-58", "text": "The interested reader is encouraged to consult [1, 2, 4, 5, 6 ] for more information." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-59", "text": "3 Also referred to as Natural Language Generation (NLG)." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-60", "text": "The methodology we propose consists of two main phases, the topic analysis phase and the implementation phase." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-61", "text": "The topic analysis phase is composed of four steps, which include the creation of the ontology for the topic and the providing of the specifications for the messages and the SDRs." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-62", "text": "The final step of this phase, which in fact serves as a bridge step with the implementation phase, includes the annotation of the corpora belonging to the topic under examination that have to be collected as a preliminary step during this phase." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-63", "text": "The annotated corpora will serve a dual role: the first is the training of the various Machine Learning algorithms used during the next phase and the second is for evaluation purposes." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-64", "text": "The implementation phase involves the computational extraction of the messages and the SDRs that connect them in order to create a directed acyclic graph (DAG) which we call grid." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-65", "text": "The architecture of the summarization system is shown in Figure 1 ." 
}, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-66", "text": "We applied our methodology in two different case studies." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-67", "text": "The first case study concerned the description of football matches, a topic which evolved linearly and exhibited synchronous emission of reports, while the second case study concerned the description of terroristic incidents with hostages, a topic which evolved non-linearly and exhibited asynchronous emission of reports." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-68", "text": "4 The preprocessing stage involved tokenization and sentence splitting in the first case study and tokenization, sentence splitting and part-of-speech tagging in the second case study." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-69", "text": "For the task of the entities recognition and classification in the first case the use of simple gazetteer lists proved to be sufficient." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-70", "text": "In the second case study this was not the case and thus we opted for using what we called a cascade of classifiers which contained three levels." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-71", "text": "At the first level we used a binary classifier which determines whether a textual element in the input text is an instance of an ontology concept or not." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-72", "text": "At the second level, the classifier takes the instances of the ontology concepts of the previous level and classifies them under the top-level ontology concepts (e.g. Person)." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-89", "text": "This is graphically depicted in Figure 2 , in which each circle represents the information that is contained in a different document." 
}, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-73", "text": "Finally at the third level we had a specific classifier for each top-level ontology concept, which classifies the instances in their appropriate sub-concepts; for example, in the Person ontology concept the specialized classifier classifies the instances into Offender, Hostage, etc." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-74", "text": "For the third stage of the messages' extraction we use in both case studies lexical and semantic features." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-75", "text": "As lexical features in the first case we used the words of the sentences (excluding low frequency words and stop-words) while in the second case study we used only the verbs and nouns of the sentences as lexical features." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-76", "text": "As semantic features in the first case study we used the number of the top-level ontology concepts that appear in the sentence, while in the second case study we enriched that with the appearance of certain trigger words in the sentence." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-77", "text": "Finally, the extraction of the SDRs is the most straightforward task, since the only thing that is needed is the translation of the relations' specifications into an appropriate algorithm which, once applied to the extracted messages, will provide the relations that connect the messages, effectively thus creating the grid." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-78", "text": "In Table 1 we present the statistics of the final messages and SDRs extraction stages for both case studies." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-79", "text": "The creation of the grid can be considered as completing-as we have previously argued [5] -the Document Planning phase of a typical architecture of an NLG system [20] ." 
}, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-80", "text": "Nevertheless, this graph can prove to be very large and thus the resulting summary can easily exceed the desired compression rate." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-81", "text": "In the following two sections we will present a brief sketch of a probabilistic model which can operate on the Content Determination stage of the Document Planning phase in order to select the appropriate content so that the compression rate of the summary will be respected." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-82", "text": "3 The White, Grey, and Black Areas of MDS Not too distant in time from the dawn of Artificial Intelligence in the early 1950's, the first seeds of automatic text summarization appeared with the seminal works of Luhn [12] and Edmundson [7] ." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-83", "text": "Those early works, as well as the works on summarization that would follow in the next decades, were mostly concerned with the creation of summaries from single documents." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-84", "text": "Most of them were focusing on the verbatim extraction of important textual elements, usually sentences or paragraphs, from the input document in order to create the final summary." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-85", "text": "The methods used for the identification of the most salient sentences or paragraphs vary from a mixture of locational criteria with statistics [7, 12, 19] to statistical based graph creation methods [21] to RST based methods [17] ." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-86", "text": "Multi-document Summarization would not be actively pursued by researchers up until the mid 1990's, since when it is a quite active area of research." 
}, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-87", "text": "6 The main difference that seems to exist between the summarization of a single document and the summarization of multiple (related) documents, seems to be the fact that the ensemble of the related documents, in most of the cases, creates informational redundancy, as well as what-for a lack of better termwe will call informational isolation." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-88", "text": "In the case of informational redundancy more than one document contain the same information, while in the case of informational isolation only one document contains a specific piece of information." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-90", "text": "The black and grey areas of the figure represent the information redundancy that exists between the documents." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-91", "text": "More specifically, the black area represents information which is common to all of the documents, while the grey areas represent information which are common between some articles but not all of them." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-92", "text": "The white areas, on the other hand, represent what we have called the informational isolation of certain portions of texts, in the sense that the information contained therein is not found anywhere else in the collection of documents." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-93", "text": "Of course, one could imagine many more ways in which the circles could be arranged." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-94", "text": "For example, a circle could be contained inside two other circles, which would imply that the corresponding document is informationally subsumed by the other two." 
}, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-95", "text": "More extreme cases can involve circles arranged in a way that only gray areas exist, which would imply that the documents of the collection are only very loosely related, or cases in which one or more circles are completely white, meaning that the documents which are represented by those circles are completely unrelated with the rest of the documents." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-96", "text": "Such cases though, one could argue, violate the premises of MDS which require a set of related documents that will be informationally condensed by the end of the process." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-97", "text": "Despite those extreme cases, it is fair to assume that the configuration depicted in Figure 2 represents a fairly common situation in most of the MDS scenarios." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-98", "text": "Of course we have to bare in mind that in most of the cases we will not have just three documents to be summarized, but most possibly many more." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-99", "text": "This will have the consequence that the grey areas will not have a single shade of greyness but in- 6 For a general overview of summarization the interested reader is encouraged to consult [13] ." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-100", "text": "Mani & Maybury [14] provide a wonderful collection of papers on summarization spanning most of the research sub-fields of this area." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-101", "text": "Afantenos et al. [3] provide an overview as well, focusing mostly on the summarization from medical documents." 
}, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-102", "text": "Finally, [8] contains an excellent account of the cognitive processes that are involved during the task of single document summarization by professionals, as well a brief overview of the field of summarization." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-103", "text": "stead they will range from light grey to dark grey depending on the degree of information overlap that will exist between the various sources." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-104", "text": "----------------------------------" }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-105", "text": "**WHAT SHOULD BE INCLUDED IN A MULTI-DOCUMENT SUMMARY OF EVOLVING EVENTS?**" }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-106", "text": "Having made the above distinction between the different levels of information overlap, the question that arises at this point is which pieces of information should finally be included in the text that will summarize the multiple documents." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-107", "text": "The obvious answer to this question would be that such a summary should include the information that are contained in the input documents in decreasing order of their importance, until the length of the summary reaches the required compression rate of the total length of the input documents." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-108", "text": "In other words, a summary should contain the black areas of Figure 2 , then the darker to the lighter grey areas, until the length of the summary reaches the required compression rate." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-109", "text": "In mathematical terms this can be expressed as follows." 
}, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-110", "text": "If P (i) is the probability that a piece of information will be included in the final summary, then we can claim that:" }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-111", "text": "where n represents the total number of documents, d k the k-th document, and:" }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-112", "text": "Additionally, if c is the desirable compression rate, then the final summary S should confront to the following constraint:" }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-113", "text": "length(d k )" }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-114", "text": "----------------------------------" }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-115", "text": "**OBJECTIONS TO THE PROPOSED MODEL FOR THE GENERAL CASE OF MDS**" }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-116", "text": "Now, the above model is really a simplistic one and a host of objections could be raised concerning its usefulness in the general case of MDS, something that we do acknowledge." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-117", "text": "One could for example claim that the information that will be contained in the black areas will tend to be trivial information, in the sense that they can be characterized as representing \"common knowledge\"." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-118", "text": "This objection can be balanced by two arguments." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-119", "text": "The first is that the authors of the original documents will most possibly not contain in their articles such common knowledge, unless it is necessary, in which case it might be a good idea to be included in a summary." 
}, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-120", "text": "The second argument is that if the summarization system uses knowledge representation methods-an ontology for example-then such trivial information will tend not to be included in this knowledge representation." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-121", "text": "Of course, if the system uses purely statistical methods, then the last argument does not hold." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-122", "text": "The second objection concerns the white or light grey areas." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-123", "text": "In the proposed model such areas will have a small probability of being included in the final summary." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-124", "text": "Nevertheless, it can be argued that under certain circumstances it can be the case that a piece of information which is mentioned only by one or very few sources might turn out to be very important." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-125", "text": "For example, a prominent source might have an exclusive piece of information that other sources do not have which might prove to be important for inclusion in the final summary." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-126", "text": "In such case the proposed model, indeed, will fail to include this piece of information in the final summary." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-127", "text": "----------------------------------" }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-128", "text": "**WHY THE PROPOSED MODEL CAN BE CONSIDERED AS A GOOD STARTING POINT FOR THE CASE OF MDS FOR EVOLVING EVENTS**" }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-129", "text": "The above discussion outlines some of the objections that might arise when the proposed model is applied under the prism of the general case of Multi-document Summarization." 
}, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-130", "text": "Despite those objections, we make the claim in this paper that the proposed model can nevertheless be considered as a good starting point for the case of Multi-document Summarization of Evolving Events, at least in the framework we have described in Section 2." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-131", "text": "Concerning the first objection-i.e." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-132", "text": "the claim that the same trivial information might be contained in all the documents and thus such trivial information will have a high probability of being included in the final summary-this claim is rebuffed by the nature of the methodology that we have briefly presented in Section 2 and more fully exposed in [1] and [5] ." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-133", "text": "The use of an ontology and especially the use of the messages guarantee that the system will try to extract information whose nature, we know beforehand, will be non-trivial." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-134", "text": "Of course, this beneficial situation has its drawbacks as well." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-135", "text": "As we have argued in [5] the creation of the ontology and the specifications of the messages require a considerable amount of human labor." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-136", "text": "Nevertheless, in Section 9 of [5] we present specific propositions of how this problem can be alleviated." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-137", "text": "Let us now come to the second objection." 
}, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-138", "text": "According to this objection, it can be the case that a piece of information while mentioned by only one or very few sources (which implies that this piece of information stands very few chances of being included in the summary, according to the proposed model of Section 4) it might nevertheless be mentioned by a prominent source and thus ought finally to be included in the summary." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-139", "text": "Although this could be the case, we have to note as well that such prominent sources are usually highly influential ones as well." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-140", "text": "This has the implication that if a piece of information-which was initially exclusively mentioned by one source only-is indeed an important one for the description of the event's evolution, then, almost surely, the rest of the sources will sooner or later follow the initial source in mentioning this information." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-141", "text": "Thus what was initially a light grey area, according to the discussion of Section 3, will tend to become darker grey, or even black, as time goes by, if indeed the mentioned piece of information is important and thus worthy of inclusion in the final summary of the event's evolution." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-142", "text": "This leaves us with the conclusion that the afore presented model can indeed serve as a nice starting point for the Content Determination stage, in the case that the grid contains more messages than the required compression rate requires." 
}, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-143", "text": "7 7 It would be fair to mention that the above conclusion is valid in the case" }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-144", "text": "----------------------------------" }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-145", "text": "**CONCLUSIONS**" }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-146", "text": "In [1] and [5] we thoroughly presented a methodology (and applied it in two different case studies) which aims towards the creation of summaries from descriptions of evolving events which are emitted from multiple sources." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-147", "text": "The end result of this methodology is the computational extraction of a structure, which we called a grid." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-148", "text": "This structure is a directed acyclic graph (DAG) whose nodes are the messages extracted from the input documents and whose vertices are the Synchronic and Diachronic Relations that connect those messages." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-149", "text": "The creation of the grid, as we have argued, completes the Document Planning stage of a typical NLG architecture." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-150", "text": "Nevertheless, it can be the case that the created grid can prove to be large enough in order for the final summary to exceed the required compression rate." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-151", "text": "In this paper we have presented a probabilistic model which can be applied to the Content Determination stage of the Document Planning phase." 
}, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-152", "text": "The application of that model 8 to the extracted grid will have the effect of creating a subset of the original grid (a sub-grid in other words) which will contain just the messages that confront to this model as well as the SDRs that connect only the selected messages." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-153", "text": "From the discussion in this paper, as well as from the general literature in the area of Multi-document Summarization, we can conclude that the identification of similarities and differences is an essential component for any MDS system." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-154", "text": "Digressing a little bit at this point, we would like to note that spotting similarities between even disparate situations or objects, is something that human beings effortlessly and continuously perform all the time, and thus the study of this phenomenon is of paramount importance for the understanding of the human cognitive functioning." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-155", "text": "The mechanism of identifying \"sameness\"-despite its subtlety [9] -is an essential component for the task of analogymaking which lies at the core of cognition as [11] has claimed." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-156", "text": "Closing this digression on the fascinating topic of analogy-making 9 we would like to note that with respect to MDS, to the best of our knowledge, there are no empirical studies as to how human beings proceed in order to create a summary from multiple documents-be they documents that describe evolving events, or not." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-157", "text": "We do not even have sufficient corpora of summaries from multiple documents which will provide us with an insight as to what can be considered a \"good\" multi-document summary." 
}, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-158", "text": "This comes in contrast with the area of Single Document Summarization (SDS) in which, of course, we do have such corpora." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-159", "text": "Moreover, in SDS we do have at least one substantial research from the perspective of Cognitive Science [8] which studies the cognitive mechanisms-or \"strategies\" as they are called in that book-of professional summarizers during the process of creating a summary from a single document." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-160", "text": "It is our personal belief that the performance of more such studies from the cognitive science perspective, for SDS and that we do have the final set of documents which describe the evolution of the event." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-161", "text": "In case that the evolution is still on-going and this set is not yet finalized, then it might be the case that the second objection still holds." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-162", "text": "8 Although the probabilistic model presented in Section 4 talks about \"pieces of information\" the substitution of this abstract notion with the more concrete concept of messages makes that model ready for use in our methodology." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-163", "text": "9 The interested reader is encouraged to consult [9, 10] and [18] for more information on this topic." }, { "sent_id": "bdd0ebe147e277f8f7f04fc351464a-C001-164", "text": "MDS alike, will be beneficial for the advancement of our understanding not only of how we do create summaries, but for the understanding of how we spot similarities and differences; a task which lies at the heart of analogy-making as well." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "bdd0ebe147e277f8f7f04fc351464a-C001-3" ], [ "bdd0ebe147e277f8f7f04fc351464a-C001-26" ], [ "bdd0ebe147e277f8f7f04fc351464a-C001-38" ], [ "bdd0ebe147e277f8f7f04fc351464a-C001-49", "bdd0ebe147e277f8f7f04fc351464a-C001-50", "bdd0ebe147e277f8f7f04fc351464a-C001-51", "bdd0ebe147e277f8f7f04fc351464a-C001-52", "bdd0ebe147e277f8f7f04fc351464a-C001-53", "bdd0ebe147e277f8f7f04fc351464a-C001-54", "bdd0ebe147e277f8f7f04fc351464a-C001-56" ], [ "bdd0ebe147e277f8f7f04fc351464a-C001-57" ], [ "bdd0ebe147e277f8f7f04fc351464a-C001-58" ], [ "bdd0ebe147e277f8f7f04fc351464a-C001-79" ], [ "bdd0ebe147e277f8f7f04fc351464a-C001-131", "bdd0ebe147e277f8f7f04fc351464a-C001-132" ], [ "bdd0ebe147e277f8f7f04fc351464a-C001-146" ] ], "cite_sentences": [ "bdd0ebe147e277f8f7f04fc351464a-C001-3", "bdd0ebe147e277f8f7f04fc351464a-C001-26", "bdd0ebe147e277f8f7f04fc351464a-C001-38", "bdd0ebe147e277f8f7f04fc351464a-C001-49", "bdd0ebe147e277f8f7f04fc351464a-C001-56", "bdd0ebe147e277f8f7f04fc351464a-C001-57", "bdd0ebe147e277f8f7f04fc351464a-C001-58", "bdd0ebe147e277f8f7f04fc351464a-C001-79", "bdd0ebe147e277f8f7f04fc351464a-C001-132", "bdd0ebe147e277f8f7f04fc351464a-C001-146" ] }, "@USE@": { "gold_contexts": [ [ "bdd0ebe147e277f8f7f04fc351464a-C001-25", "bdd0ebe147e277f8f7f04fc351464a-C001-26" ], [ "bdd0ebe147e277f8f7f04fc351464a-C001-131", "bdd0ebe147e277f8f7f04fc351464a-C001-132" ], [ "bdd0ebe147e277f8f7f04fc351464a-C001-136" ] ], "cite_sentences": [ "bdd0ebe147e277f8f7f04fc351464a-C001-26", "bdd0ebe147e277f8f7f04fc351464a-C001-132", "bdd0ebe147e277f8f7f04fc351464a-C001-136" ] }, "@MOT@": { "gold_contexts": [ [ "bdd0ebe147e277f8f7f04fc351464a-C001-79", "bdd0ebe147e277f8f7f04fc351464a-C001-80" ], [ "bdd0ebe147e277f8f7f04fc351464a-C001-135" ], [ "bdd0ebe147e277f8f7f04fc351464a-C001-146", "bdd0ebe147e277f8f7f04fc351464a-C001-147", "bdd0ebe147e277f8f7f04fc351464a-C001-150" ] ], "cite_sentences": [ "bdd0ebe147e277f8f7f04fc351464a-C001-79", 
"bdd0ebe147e277f8f7f04fc351464a-C001-135", "bdd0ebe147e277f8f7f04fc351464a-C001-146" ] } } }, "ABC_cc66b46b34a0d716414e8b845707f9_14": { "x": [ { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-59", "text": "----------------------------------" }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-2", "text": "We propose an end-to-end, domainindependent neural encoder-aligner-decoder model for selective generation, i.e., the joint task of content selection and surface realization." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-3", "text": "Our model first encodes a full set of over-determined database event records via an LSTM-based recurrent neural network, then utilizes a novel coarse-to-fine aligner to identify the small subset of salient records to talk about, and finally employs a decoder to generate free-form descriptions of the aligned, selected records." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-4", "text": "Our model achieves the best selection and generation results reported to-date (with 59% relative improvement in generation) on the benchmark WEATHER-GOV dataset, despite using no specialized features or linguistic resources." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-5", "text": "Using an improved k-nearest neighbor beam filter helps further." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-6", "text": "We also perform a series of ablations and visualizations to elucidate the contributions of our key model components." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-7", "text": "Lastly, we evaluate the generalizability of our model on the ROBOCUP dataset, and get results that are competitive with or better than the state-of-the-art, despite being severely data-starved." 
}, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-8", "text": "----------------------------------" }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-10", "text": "We consider the important task of producing a natural language description of a rich world state represented as an over-determined database of event records." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-11", "text": "This task, which we refer to as selective generation, is often formulated as two subproblems: content selection, which involves choosing a subset of relevant records to talk about from the exhaustive database, and surface realization, which is concerned with generating natural language descriptions for this subset." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-12", "text": "Learning to perform these tasks jointly is challenging due to the ambiguity in deciding which records are relevant, the complex dependencies between selected records, and the multiple ways in which these records can be described." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-13", "text": "Previous work has made significant progress on this task (Chen and Mooney, 2008; Angeli et al., 2010; Konstas and Lapata, 2012) ." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-14", "text": "However, most approaches solve the two content selection and surface realization subtasks separately, use manual domain-dependent resources (e.g., semantic parsers) and features, or employ template-based generation." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-15", "text": "This limits domain adaptability and reduces coherence." 
}, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-16", "text": "We take an alternative, neural encoder-aligner-decoder approach to free-form selective generation that jointly performs content selection and surface realization, without using any specialized features, resources, or generation templates." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-17", "text": "This enables our approach to generalize to new domains." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-18", "text": "Further, our memorybased model captures the long-range contextual dependencies among records and descriptions, which are integral to this task (Angeli et al., 2010) ." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-19", "text": "We formulate our model as an encoder-alignerdecoder framework that uses recurrent neural networks with long short-term memory units (LSTMRNNs) (Hochreiter and Schmidhuber, 1997) together with a coarse-to-fine aligner to select and \"translate\" the rich world state into a natural language description." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-20", "text": "Our model first encodes the full set of over-determined event records using a bidirectional LSTM-RNN." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-21", "text": "A novel coarse-to-fine aligner then reasons over multiple abstractions of the input to decide which of the records to discuss." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-22", "text": "The model next employs an LSTM decoder to generate natural language descriptions of the selected records." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-23", "text": "The use of LSTMs, which have proven effective for similar long-range generation tasks Vinyals et al., 2015b; Karpathy and FeiFei, 2015) , allows our model to capture the longrange contextual dependencies that exist in selective generation." 
}, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-24", "text": "Further, the introduction of our proposed variation on alignment-based LSTMs (Bahdanau et al., 2014; Xu et al., 2015) enables our model to learn to perform content selection and surface realization jointly, by aligning each generated word to an event record during decoding." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-25", "text": "Our novel coarse-to-fine aligner avoids searching over the full set of over-determined records by employing two stages of increasing complexity: a pre-selector and a refiner acting on multiple abstractions (low-and high-level) of the record input." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-26", "text": "The end-to-end nature of our framework has the advantage that it can be trained directly on corpora of record sets paired with natural language descriptions, without the need for ground-truth content selection." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-27", "text": "We evaluate our model on a benchmark weather forecasting dataset (WEATHERGOV) and achieve the best results reported to-date on content selection (12% relative improvement in F-1) and language generation (59% relative improvement in BLEU), despite using no domain-specific resources." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-28", "text": "We also perform a series of ablations and visualizations to elucidate the contributions of the primary model components, and also show improvements with a simple, k-nearest neighbor beam filter approach." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-29", "text": "Finally, we demonstrate the generalizability of our model by directly applying it to a benchmark sportscasting dataset (ROBOCUP), where we get results competitive with or better than state-of-the-art, despite being extremely data-starved." 
}, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-30", "text": "----------------------------------" }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-31", "text": "**RELATED WORK**" }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-32", "text": "Selective generation is a relatively new research area and more attention has been paid to the individual content selection and selective realization subproblems." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-33", "text": "With regards to the former, Barzilay and Lee (2004) model the content structure from unannotated documents and apply it to the application of text summarization." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-34", "text": "Barzilay and Lapata (2005) treat content selection as a collective classification problem and simultaneously optimize the local label assignment and their pairwise relations." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-35", "text": "Liang et al. (2009) address the related task of aligning a set of records to given textual description clauses." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-36", "text": "They propose a generative semi-Markov alignment model that jointly segments text sequences into utterances and associates each to the corresponding record." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-37", "text": "Surface realization is often treated as a problem of producing text according to a given grammar." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-38", "text": "Soricut and Marcu (2006) propose a language generation system that uses the WIDL-representation, a formalism used to compactly represent probability distributions over finite sets of strings." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-39", "text": "Wong and Mooney (2007) and Lu and Ng (2011) use synchronous context-free grammars to generate natural language sentences from formal meaning representations." 
}, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-40", "text": "Similarly, Belz (2008) employs probabilistic context-free grammars to perform surface realization." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-41", "text": "Other effective approaches include the use of tree conditional random fields (Lu et al., 2009) and template extraction within a log-linear framework (Angeli et al., 2010) ." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-42", "text": "Recent work seeks to solve the full selective generation problem through a single framework." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-43", "text": "Chen and Mooney (2008) and Chen et al. (2010) learn alignments between comments and their corresponding event records using a translation model for parsing and generation." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-44", "text": "implement a two-stage framework that decides what to discuss using a combination of the methods of Lu et al. (2008) and Liang et al. (2009) , and then produces the text based on the generation system of Wong and Mooney (2007) ." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-45", "text": "Angeli et al. (2010) propose a unified conceptto-text model that treats joint content selection and surface realization as a sequence of local decisions represented by a log-linear model." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-46", "text": "Similar to other work, they train their model using external alignments from Liang et al. (2009) ." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-47", "text": "Generation then follows as inference over this model, where they first choose an event record, then the record's fields (i.e., attributes), and finally a set of templates that they then fill in with words for the selected fields." 
}, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-48", "text": "Their ability to model long-range dependencies relies on their choice of features for the log-linear model, while the template-based generation further employs some domain-specific features for fluent output." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-49", "text": "Konstas and Lapata (2012) propose an alternative method that simultaneously optimizes the content selection and surface realization problems." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-50", "text": "They employ a probabilistic context-free grammar that specifies the structure of the event records, and then treat generation as finding the best derivation tree according to this grammar." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-51", "text": "However, their method still selects and orders records in a local fashion via a Markovized chaining of records." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-52", "text": "Konstas and Lapata (2013) improve upon this approach with global document representations." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-53", "text": "However, this approach also requires alignment during training, which they estimate using the method of Liang et al. (2009) ." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-54", "text": "We treat the problem of selective generation as end-to-end learning via a recurrent neural network encoder-aligner-decoder model, which enables us to jointly learn content selection and surface realization directly from database-text pairs, without the need for an external aligner or ground-truth selection labels." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-55", "text": "The use of LSTM-RNNs enables our model to capture the long-range dependencies that exist among the records and natural language output." 
}, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-56", "text": "Additionally, the model does not rely on any manually-selected or domain-dependent features, templates, or parsers, and is thereby generalizable." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-57", "text": "The alignment-RNN approach has recently proven successful for generation-style tasks, e.g., machine translation (Bahdanau et al., 2014) and image captioning (Xu et al., 2015) ." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-58", "text": "Since selective generation requires identifying the small number of salient records among an over-determined database, we avoid performing exhaustive search over the full record set, and instead propose a novel coarse-tofine aligner that divides the search complexity into pre-selection and refinement stages." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-60", "text": "**TASK DEFINITION**" }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-61", "text": "We consider the problem of generating a natural language description for a rich world state specified in terms of an over-determined set of records (database)." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-62", "text": "This problem requires deciding which of the records to discuss (content selection) and how to discuss them (surface realization)." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-63", "text": "Training data consists of scenario pairs (r (i) , x (i) ) for i = 1, 2, . . . , n, where r (i) is the complete set of records and x (i) is the natural language description ( Fig. 1) ." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-64", "text": "At test time, only the records are given." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-65", "text": "We evaluate our model in the context of two publiclyavailable benchmark selective generation datasets." 
}, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-66", "text": "WEATHERGOV The weather forecasting dataset (see Fig. 1 (a)) of Liang et al. (2009) consists of 29528 scenarios, each with 36 weather records (e.g., temperature, sky cover, etc.) paired with a natural language forecast ( 28.7 avg." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-67", "text": "word length)." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-68", "text": "ROBOCUP We evaluate our model's generalizability on the sportscasting dataset of Chen and Mooney (2008) , which consists of only 1539 pairs of temporally ordered robot soccer events (e.g., pass, score) and commentary drawn from the four-game 2001-2004 RoboCup finals (see Fig. 1(b) )." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-69", "text": "Each scenario contains an average of 2.4 event records and a 5.7 word natural language commentary." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-70", "text": "----------------------------------" }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-71", "text": "**THE MODEL**" }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-72", "text": "We formulate selective generation as inference over a probabilistic model P (x 1:T |r 1:N ), where r 1:N = (r 1 , r 2 , . . . , r N ) is the input set of over-" }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-73", "text": "Figure 2: Our model architecture with a bidirectional LSTM encoder, coarse-to-fine aligner, and decoder." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-74", "text": "determined event records, 1" }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-75", "text": "is the generated description with x t being the word at time t and x 0 being a special start token:" }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-76", "text": "= arg max" }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-77", "text": "The goal of inference is to generate a natural language description for a given set of records." 
}, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-78", "text": "An effective means of learning to perform this generation is to use an encoder-aligner-decoder architecture with a recurrent neural network, which has proven effective for related problems in machine translation (Bahdanau et al., 2014) and image captioning (Xu et al., 2015) ." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-79", "text": "We propose a variation on this general model with novel components that are well-suited to the selective generation problem." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-80", "text": "Our model (Fig. 2 ) first encodes each input record r j into a hidden state h j with j \u2208 {1, . . . , N } using a bidirectional recurrent neural network (RNN)." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-81", "text": "Our novel coarse-to-fine aligner then acts on a concatenation m j of each record and its hidden state 1 These records may take the form of an unordered set or have a natural ordering (e.g., temporal in the case of ROBOCUP)." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-82", "text": "In order to make our model generalizable, we treat the set as a sequence and use the order specified by the dataset." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-83", "text": "We note that it is possible that a different ordering will yield improved performance, since ordering has been shown to be important when operating on sets (Vinyals et al., 2015a) ." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-84", "text": "as multi-level representation of the input to compute the selection decision z t at each decoding step t. The model then employs an RNN decoder to arrive at the word likelihood P (x t |x 0:t\u22121 , r 1:N ) as a function of the multi-level input and the hidden state of the decoder s t\u22121 at time step t \u2212 1." 
}, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-85", "text": "In order to model the long-range dependencies among the records and descriptions (which is integral to effectively performing selective generation (Angeli et al., 2010; Konstas and Lapata, 2012; Konstas and Lapata, 2013 )), our model employs LSTM units as the nonlinear encoder and decoder functions." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-86", "text": "Encoder Our LSTM-RNN encoder (Fig. 2) takes as input the set of event records represented as a sequence r 1:N = (r 1 , r 2 , . . . , r N ) and returns a sequence of hidden annotations h 1:N = (h 1 , h 2 , . . . , h N ), where the annotation h j summarizes the record r j ." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-87", "text": "This results in a representation that models the dependencies that exist among the records in the database." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-88", "text": "We adopt an encoder architecture similar to that of Graves et al. (2013)" }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-89", "text": "where T e is an affine transformation, \u03c3 is the logistic sigmoid that restricts its input to [0, 1], i e j , f e j , and o e j are the input, forget, and output gates of the LSTM, respectively, and c e j is the memory cell activation vector." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-90", "text": "The memory cell c e j summarizes the LSTM's previous memory c e j\u22121 and the current input, which are modulated by the forget and input gates, respectively." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-91", "text": "Our encoder operates bidirectionally, encoding the records in both the forward and backward directions, which provides a better summary of the input records." 
}, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-92", "text": "In this way, the hidden annotations h j = ( \u2212 \u2192 h j ; \u2190 \u2212 h j ) concatenate forward \u2212 \u2192 h j and backward \u2190 \u2212 h j annotations, each determined using Equation (2c)." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-93", "text": "Coarse-to-Fine Aligner Having encoded the input records r 1:N to arrive at the hidden annotations h 1:N , the model then seeks to select the content at each time step t that will be used for generation." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-94", "text": "Our model performs content selection using an extension of the alignment mechanism proposed by Bahdanau et al. (2014) , which allows for selection and generation that is independent of the ordering of the input." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-95", "text": "In selective generation, the given set of event records is over-determined with only a small subset of salient records being relevant to the output natural language description." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-96", "text": "Standard alignment mechanisms limit the accuracy of selection and generation by scanning the entire range of overdetermined records." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-97", "text": "In order to better address the selective generation task, we propose a coarse-tofine aligner that prevents the model from being distracted by non-salient records." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-98", "text": "Our model aligns based on multiple abstractions of the input: both the original input record as well as the hidden annotations m j = (r j ; h j ) , an approach that has previously been shown to yield better results than aligning based only on the hidden state (Mei et al., 2015) ." 
}, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-99", "text": "Our coarse-to-fine aligner avoids searching over the full set of over-determined records by using two stages of increasing complexity: a pre-selector and refiner (Fig. 2) ." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-100", "text": "The pre-selector first assigns to each record a probability p j of being selected, while the standard aligner computes the alignment likelihood w tj over all the records at each time step t during decoding." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-101", "text": "Next, the refiner produces the final selection decision by re-weighting the aligner weights w tj with the pre-selector probabilities p j :" }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-102", "text": "where P , q, U , W , v are learned parameters." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-103", "text": "Ideally, the selection decision would be based on the highestvalue alignment z t = m k where k = arg max j \u03b1 tj ." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-104", "text": "However, we use the weighted average (Eqn. 3e) as its soft approximation to maintain differentiability of the entire architecture." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-105", "text": "The pre-selector assigns large values (p j > 0.5) to a small subset of salient records and small values (p j < 0.5) to the rest." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-106", "text": "This modulates the standard aligner, which then has to assign a large weight w tj in order to select the j-th record at time t. In this way, the learned prior p j makes it difficult for the alignment (attention) to be distracted by nonsalient records." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-107", "text": "Further, we can relate the output of the pre-selector to the number of records that are selected." 
}, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-108", "text": "Specifically, the output p j expresses the extent to which the j-th record should be selected." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-109", "text": "The summation N j=1 p j can then be regarded as a real-valued approximation to the total number of pre-selected records (denoted as \u03b3), which we regularize towards, based on validation (see Eqn. 5)." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-110", "text": "Decoder Our architecture uses an LSTM decoder that takes as input the current context vector z t , the last word x t\u22121 , and the LSTM's previous hidden state s t\u22121 ." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-111", "text": "The decoder outputs the conditional probability distribution P x,t = P (x t |x 0:t\u22121 , r 1:N ) over the next word, represented as a deep output layer (Pascanu et al., 2014) ," }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-112", "text": "where E (an embedding matrix), L 0 , L s , and L z are parameters to be learned." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-113", "text": "----------------------------------" }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-114", "text": "**TRAINING AND INFERENCE**" }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-115", "text": "We train the model using the database-record pairs (r 1:N , x 1:T ) from the training corpora so as to maximize the likelihood of the ground-truth language description x * 1:T (Eqn. 1)." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-116", "text": "Additionally, we introduce a regularization term ( N j=1 p j \u2212 \u03b3) 2 that enables the model to influence the pre-selector weights based on the aforementioned relationship between the output of the preselector and the number of selected records." 
}, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-117", "text": "Moreover, we also introduce the term (1.0 \u2212 max(p j )), which accounts for the fact that at least one record should be pre-selected." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-118", "text": "Note that when \u03b3 is equal to N , the pre-selector is forced to select all the records (p j = 1.0 for all j), and the coarse-to-fine alignment reverts to the standard alignment introduced by Bahdanau et al. (2014) ." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-119", "text": "Together with the negative loglikelihood of the ground-truth description x * 1:T , our loss function becomes" }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-162", "text": "**QUALITATIVE ANALYSIS (WEATHERGOV)**" }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-120", "text": "Having trained the model, we generate the natural language description by finding the maximum a posteriori words under the learned model (Eqn. 1)." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-121", "text": "For inference, we perform greedy search starting with the first word x 1 ." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-122", "text": "Beam search offers a way to perform approximate joint inference -however, we empirically found that beam search does not perform any better than greedy search on the datasets that we consider, an observation that is shared with previous work (Angeli et al., 2010) ." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-123", "text": "We later discuss an alternative k-nearest neighbor-based beam filter (see Sec 6.2)." 
}, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-124", "text": "----------------------------------" }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-125", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-126", "text": "Datasets We analyze our model on the benchmark WEATHERGOV dataset, and use the data-starved ROBOCUP dataset to demonstrate the model's generalizability." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-127", "text": "Following Angeli et al. (2010) , we use WEATHERGOV training, development, and test splits of size 25000, 1000, and 3528, respectively." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-128", "text": "For ROBOCUP, we follow the evaluation methodology of previous work (Chen and Mooney, 2008) , performing three-fold cross-validation whereby we train on three games (approximately 1000 scenarios) and test on the fourth." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-129", "text": "Within each split, we hold out 10% of the training data as the development set to tune the early-stopping criterion and \u03b3." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-130", "text": "We then report the standard average performance (weighted by the number of scenarios) over these four splits." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-131", "text": "Training Details On WEATHERGOV, we lightly tune the number of hidden units and \u03b3 on the development set according to the generation metric (BLEU), and choose 500 units from {250, 500, 750} and \u03b3 = 8.5 from {6.5, 7.5, 8.5, 10.5, 12.5}. For ROBOCUP, we only tune \u03b3 on the development set and choose \u03b3 = 5.0 from the set {1.0, 2.0, . . . , 6.0}. However, we do not retune the number of hidden units on ROBOCUP." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-132", "text": "For each iteration, we randomly sample a mini-batch of 100 scenarios during back-propagation and use Adam (Kingma and Ba, 2015) for optimization." 
}, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-133", "text": "Training typically converges within 30 epochs." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-134", "text": "We select the model according to the BLEU score on the development set." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-135", "text": "2 Evaluation Metrics We consider two metrics as a means of evaluating the effectiveness of our model on the two selective generation subproblems." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-136", "text": "For content selection, we use the F-1 score of the set of selected records as defined by the harmonic mean of precision and recall with respect to the ground-truth selection record set." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-137", "text": "We define the set of selected records as consisting of the record with the largest selection weight \u03b1 ti computed by our aligner at each decoding step t." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-138", "text": "We evaluate the quality of surface realization using the BLEU score 3 (a 4-gram matching-based precision) (Papineni et al., 2001 ) of the generated description with respect to the human-created reference." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-139", "text": "To be comparable to previous results on WEATHERGOV, we also consider a modified BLEU score (cBLEU) that does not penalize numerical deviations of at most five (Angeli et al., 2010 ) (i.e., to not penalize \"low around 58\" compared to a reference \"low around 60\")." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-140", "text": "On ROBOCUP, we also evaluate the BLEU score in the case that groundtruth content selection is known (sBLEU G ), to be comparable to previous work." 
}, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-141", "text": "----------------------------------" }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-142", "text": "**RESULTS AND ANALYSIS**" }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-143", "text": "We analyze the effectiveness of our model on the benchmark WEATHERGOV (as primary) and ROBOCUP (as generalization) datasets." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-144", "text": "We also present several ablations to illustrate the contributions of the primary model components." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-145", "text": "----------------------------------" }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-146", "text": "**PRIMARY RESULTS (WEATHERGOV)**" }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-147", "text": "We report the performance of content selection and surface realization using F-1 and two BLEU scores (standard sBLEU and the customized cBLEU of Angeli et al. (2010)), respectively (Sec. 5)." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-148", "text": "Table 1 compares our test results against previous methods that include KL12 (Konstas and Lapata, 2012) , KL13 (Konstas and Lapata, 2013) , and ALK10 (Angeli et al., 2010) ." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-149", "text": "Our method achieves the best results reported to-date on all three metrics, with relative improvements of 11.94% (F-1), 58.88% (sBLEU), and 36.68% (cBLEU) over the previous state-of-the-art." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-150", "text": "----------------------------------" }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-151", "text": "**BEAM FILTER WITH K-NEAREST NEIGHBORS**" }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-152", "text": "We considered beam search as an alternative to greedy search in our primary setup (Eqn. 
1), but this performs worse, similar to what previous work found on this dataset (Angeli et al., 2010) ." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-153", "text": "As an alternative, we consider a beam filter based on a knearest neighborhood." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-154", "text": "See Supplementary Material for details." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-155", "text": "Table 9 shows that this k-NN beam filter improves results over the primary greedy results." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-156", "text": "Aligner Ablation First, we evaluate the contribution of our proposed coarse-to-fine aligner by comparing our model with the basic encoder-alignerdecoder model introduced by Bahdanau et al. (2014) ." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-157", "text": "Table 3 reports the results demonstrating that our aligner yields superior F-1 and BLEU scores relative to a standard aligner." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-158", "text": "Encoder Ablation Next, we consider the effectiveness of the encoder." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-159", "text": "Table 4 compares the results with and without the encoder on the development set, and demonstrates that there is a significant gain from encoding the event records using the LSTM-RNN." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-160", "text": "We attribute this improvement to the LSTM-RNN's ability to capture the relationships that exist among the records, which is known to be essential to selective generation (Barzilay and Lapata, 2005; Angeli et al., 2010) ." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-161", "text": "----------------------------------" }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-163", "text": "Output Examples Fig. 3 shows an example record set with its output description and recordword alignment heat map." 
}, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-164", "text": "As shown, our model learns to align records with their corresponding words (e.g., windDir and \"southeast,\" temperature and \"71,\" windSpeed and \"wind 10,\" and gust and \"winds could gust as high as 30 mph\")." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-165", "text": "It also learns the subset of salient records to talk about (matching the ground-truth description perfectly for this example, i.e., a standard BLEU of 100.00)." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-166", "text": "We also see some word-level mismatch, e.g., \"cloudy\" misaligns to id-0 temp and id-10 precipChance, which we attribute to the high correlation between these types of records (\"garbage collection\" in Liang et al. (2009) )." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-167", "text": "Word Embeddings Training our decoder has the effect of learning embeddings for the words in the training set (via the embedding matrix E in Eqn." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-168", "text": "4)." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-169", "text": "Here, we explore the extent to which these learned embeddings capture semantic relationships among the training words." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-170", "text": "Table 10 presents nearest neighbor words for some of the common words from the WEATHERGOV dataset (according to cosine similarity in the embedding space)." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-171", "text": "More details of other embedding approaches that we tried are discussed in the Supplementary Material section." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-172", "text": "We use the ROBOCUP dataset to evaluate the domain-independence of our model." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-173", "text": "The dataset is severely data-starved with only 1000 (approx.) 
training pairs, which is much smaller than is typically necessary to train RNNs." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-174", "text": "This results in higher variance in the trained model distributions, and we thus adopt the standard denoising method of ensembles Vinyals et al., 2015b; Zaremba et al., 2014) ." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-175", "text": "5 5 We use an ensemble of five randomly initialized models." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-176", "text": "Following previous work, we perform two experiments on the ROBOCUP dataset (Table 6 ), the first considering full selective generation and the second assuming ground-truth content selection at test time." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-177", "text": "On the former, we obtain a standard BLEU score (sBLEU) of 25.28, which exceeds the best score of 24.88 (Konstas and Lapata, 2012) ." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-178", "text": "Additionally, we achieve an selection F-1 score of 81.58, which is also the best result reported to-date." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-179", "text": "In the case of assumed (known) ground-truth content selection, our model attains an sBLEU G score of 29.40, which is competitive with the state-of-the-art." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-180", "text": "6" }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-181", "text": "----------------------------------" }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-182", "text": "**CONCLUSION**" }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-183", "text": "We presented an encoder-aligner-decoder model for selective generation that does not use any specialized features, linguistic resources, or genera-tion templates." 
}, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-184", "text": "Our model employs a bidirectional LSTM-RNN model with a novel coarse-to-fine aligner that jointly learns content selection and surface realization." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-185", "text": "We evaluate our model on the benchmark WEATHERGOV dataset and achieve state-of-the-art selection and generation results." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-186", "text": "We achieve further improvements via a k-nearest neighbor beam filter." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-187", "text": "We also present several model ablations and visualizations to elucidate the effects of the primary components of our model." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-188", "text": "Moreover, our model generalizes to a different, data-starved domain (ROBOCUP), where it achieves results competitive with or better than the state-of-the-art." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-189", "text": "As an alternative, we consider a beam filter based on a k-nearest neighborhood." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-190", "text": "First, we generate the M-best description candidates (i.e., a beam width of M ) for a given input record set (database) using standard beam search." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-191", "text": "Next, we find the K nearest neighbor database-description pairs from the training data, based on the cosine similarity of each neighbor database with the given input record." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-192", "text": "We then compute the BLEU score for each of the M description candidates relative to the K nearest neighbor descriptions (as references) and select the candidate with the highest BLEU score." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-193", "text": "We tune K and M on the development set and report the results in Table 8 ." 
}, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-194", "text": "Table 9 presents the test results with this tuned setting (M = 2, K = 1), where we achieve BLEU scores better than our primary greedy results." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-195", "text": "Training our decoder has the effect of learning embeddings for the words in the training set (via the embedding matrix E in Eqn." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-196", "text": "4)." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-197", "text": "Here, we explore the extent to which these learned embeddings capture semantic relationships among the training words." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-198", "text": "Table 10 presents nearest neighbor words for some of the common words from the WEATHERGOV dataset (according to cosine similarity in the embedding space)." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-199", "text": "We also consider different ways of using pretrained word embeddings (Mikolov et al., 2013) to bootstrap the quality of our learned embeddings." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-200", "text": "One approach initializes our embedding matrix with the pre-trained vectors and then refines the embedding based on our training corpus." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-201", "text": "The second concatenates our learned embedding matrix with the pre-trained vectors in an effort to simultaneously exploit general similarities as well as those learned for the domain." }, { "sent_id": "cc66b46b34a0d716414e8b845707f9-C001-202", "text": "As shown previously for other tasks (Vinyals et al., 2015b) , we find that the use of pre-trained embeddings results in negligible improvements (on the development set)." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "cc66b46b34a0d716414e8b845707f9-C001-13" ], [ "cc66b46b34a0d716414e8b845707f9-C001-41" ], [ "cc66b46b34a0d716414e8b845707f9-C001-45" ], [ "cc66b46b34a0d716414e8b845707f9-C001-160" ] ], "cite_sentences": [ "cc66b46b34a0d716414e8b845707f9-C001-13", "cc66b46b34a0d716414e8b845707f9-C001-41", "cc66b46b34a0d716414e8b845707f9-C001-45", "cc66b46b34a0d716414e8b845707f9-C001-160" ] }, "@MOT@": { "gold_contexts": [ [ "cc66b46b34a0d716414e8b845707f9-C001-12", "cc66b46b34a0d716414e8b845707f9-C001-13", "cc66b46b34a0d716414e8b845707f9-C001-14" ], [ "cc66b46b34a0d716414e8b845707f9-C001-18" ] ], "cite_sentences": [ "cc66b46b34a0d716414e8b845707f9-C001-13", "cc66b46b34a0d716414e8b845707f9-C001-18" ] }, "@SIM@": { "gold_contexts": [ [ "cc66b46b34a0d716414e8b845707f9-C001-122" ] ], "cite_sentences": [ "cc66b46b34a0d716414e8b845707f9-C001-122" ] }, "@USE@": { "gold_contexts": [ [ "cc66b46b34a0d716414e8b845707f9-C001-127" ], [ "cc66b46b34a0d716414e8b845707f9-C001-147" ] ], "cite_sentences": [ "cc66b46b34a0d716414e8b845707f9-C001-127", "cc66b46b34a0d716414e8b845707f9-C001-147" ] }, "@DIF@": { "gold_contexts": [ [ "cc66b46b34a0d716414e8b845707f9-C001-148", "cc66b46b34a0d716414e8b845707f9-C001-149" ], [ "cc66b46b34a0d716414e8b845707f9-C001-152" ] ], "cite_sentences": [ "cc66b46b34a0d716414e8b845707f9-C001-148", "cc66b46b34a0d716414e8b845707f9-C001-152" ] } } }, "ABC_7f234ecfb4cf880502faa8b89cd07b_14": { "x": [ { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-127", "text": "We show results on in-domain and out-of-domain tests sets in Table 3 ." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-152", "text": "OOD scores for each target domain are averaged across the 6 remaining source domains." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-153", "text": "(cross-task \u03b1s are higher, intra-task \u03b1s lower)." 
}, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-101", "text": "We present experiments with the English portions of the datasets, for which we show statistics in Table 1." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-126", "text": "**RESULTS**" }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-2", "text": "Multi-task learning is partly motivated by the observation that humans bring to bear what they know about related problems when solving new ones." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-3", "text": "Similarly, deep neural networks can profit from related tasks by sharing parameters with other networks." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-4", "text": "However, humans do not consciously decide to transfer knowledge between tasks (and are typically not aware of the transfer)." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-5", "text": "In machine learning, it is hard to estimate whether sharing will lead to improvements, especially if tasks are only loosely related." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-6", "text": "To overcome this, we introduce SLUICE NETWORKS, a general framework for multi-task learning where trainable parameters control the amount of sharing, including which parts of the models to share." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-7", "text": "Our framework goes beyond and generalizes over previous proposals in enabling hard or soft sharing of all combinations of subspaces, layers, and skip connections." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-8", "text": "We perform experiments on three task pairs from natural language processing, and across seven different domains, using data from OntoNotes 5.0, and achieve up to 15% average error reductions over common approaches to multi-task learning." 
}, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-9", "text": "We analyze when the architecture is particularly helpful, as well as its ability to fit noise." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-10", "text": "We show that a) label entropy is predictive of gains in sluice networks, confirming findings for hard parameter sharing, and b) while sluice networks easily fit noise, they are robust across domains in practice." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-11", "text": "----------------------------------" }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-12", "text": "****" }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-13", "text": "introducing trainable parameters for the components that differentiate multi-task learning approaches." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-14", "text": "We build on recent work trying to learn where to split merged networks [21] , as well as work trying to learn how best to combine private and shared subspaces [5, 18] ." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-15", "text": "Our model is empirically justified and deals with the dirtiness [16] of loosely related tasks." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-16", "text": "We show that it is a generalization of various multi-task learning algorithms such as hard parameter sharing [7] , low supervision [25] , and cross-stitch networks [21] , as well as transfer learning algorithms such as frustratingly easy domain adaptation [9] ." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-17", "text": "Moreover, we study what task properties predict gains, and what properties correlate with learning certain types of sharing, as well as the inductive bias of the resulting architecture." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-18", "text": "Figure 1 : A SLUICE NETWORK with one main task A and one auxiliary task B. 
It consists of a shared input layer (shown left), two task-specific output layers (right), and three hidden layers per task, each partitioned into two subspaces." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-19", "text": "\u03b1 parameters control which subspaces are shared between main and auxiliary task, while \u03b2 parameters control which layer outputs are used for prediction." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-20", "text": "----------------------------------" }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-21", "text": "**AN ARCHITECTURE FOR LEARNING TO SHARE**" }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-22", "text": "We introduce a novel architecture for multi-task learning, which we refer to as a SLUICE NETWORK, sketched in Figure 1 for the case of two tasks." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-23", "text": "The network learns to share parameters between augmented, deep recurrent neural networks [13] ." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-24", "text": "The recurrent networks could easily be replaced with multi-layered perceptrons or convolutional neural networks for other applications." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-25", "text": "The two networks A and B share an embedding layer associating the elements of an input sequence, in our case English words, with vector representations via word and character embeddings." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-26", "text": "The two sequences of vectors are then passed on to their respective inner recurrent layers." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-27", "text": "Each of these layers is divided into subspaces, e.g., for A into G A,1,1 and G A,1,2 , which allow the network to learn task-specific and shared representations, if beneficial." 
}, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-28", "text": "The output of the inner layer of network A is then passed to its second layer, as well as to the second layer of network B. This traffic of information is mediated by a set of parameters \u03b1 in a way such that the second layer of each network receives a weighted combination of the output of the two inner layers." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-29", "text": "The subspaces have different weights." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-30", "text": "Importantly, these weights are trainable and allow the model to learn whether to share, whether to restrict sharing to a shared subspace, etc." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-31", "text": "Finally, a weighted combination of the outputs of the outer recurrent layers G \u00b7,3,\u00b7 as well as the weighted outputs of the inner layers are mediated through \u03b2 parameters, which reflect a mixture over the representations at various depths of the network." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-32", "text": "In sum, sluice networks have the capacity to learn what layers and subspaces should be shared, as well as at what layers the network has learned the best representations of the input sequences." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-33", "text": "----------------------------------" }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-34", "text": "**MATRIX REGULARIZATION**" }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-35", "text": "We cast learning what to share as a matrix regularization problem, following [15, 29] ." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-36", "text": "Assume M different tasks that are loosely related, with M potentially non-overlapping datasets D 1 , . . . , D M ." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-37", "text": "Each task is associated with a deep neural network with K layers L 1 , . . ." 
}, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-38", "text": "L K ." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-39", "text": "We assume that all the deep networks have the same hyper-parameters at the outset." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-40", "text": "With loosely related tasks, one task may be better modeled with one hidden layer; another one with two [25] ." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-41", "text": "Our architecture, however, is flexible enough to learn this, if we initially associate each task with the union of the a priori task networks." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-42", "text": "Let W \u2208 R M \u00d7D be a matrix in which each row i corresponds to a model \u03b8 i with D parameters." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-43", "text": "The loss functions L i are cross-entropy functions of the form \u2212\u2211 y p(y) log q(y) below, but note that sluice networks are not restricted to tasks with the same loss functions." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-44", "text": "Let \u03bb i be weights that determine the importance of the different tasks during training." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-45", "text": "The loss that sluice networks minimize, with a penalty term \u2126(W ), is then as follows:" }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-46", "text": "Sluice networks not only learn the parameters in W , but also some of the parameters of the regularizer \u2126, through the \u03b1 weights, while the \u03b2 weights are used to learn the parameters of the mixture functions f (\u00b7)." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-47", "text": "Recall that our architecture is partly motivated by the observation that for loosely related tasks, only certain features in specific layers should be shared, while many of the layers and subspaces may remain more task-specific [25] ." 
}, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-48", "text": "We want to learn what to share while inducing models for the different tasks." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-49", "text": "For simplicity, we ignore subspaces at first and assume only two tasks A and B. The outputs h A,k,t and h B,k,t of the k-th layer for time step t for task A and B respectively interact through what [21] refer to as cross-stitch units \u03b1 (see Figure 1 )." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-50", "text": "Omitting t for simplicity, the output of the \u03b1 layers is:" }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-51", "text": "----------------------------------" }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-52", "text": "**LEARNING MATRIX REGULARIZERS**" }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-53", "text": "where h A,k is a linear combination of the outputs that is fed to the k + 1-th layer of task A, and [a, b] designates the stacking of two vectors a, b \u2208 R D into a matrix M \u2208 R 2\u00d7D ." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-54", "text": "Extending the \u03b1-layers to include subspaces, for 2 tasks and 2 subspaces, we obtain an \u03b1 matrix \u2208 R 4\u00d74 that not only controls the interaction between the layers of both tasks, but also between their subspaces:" }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-55", "text": "where h A 1 ,k is the output of the first subspace of the k-th layer of task A and h\u0303 A 1 ,k is the linear combination for the first subspace of task A. The input to the k + 1-th layer of task A is then the concatenation of both subspace outputs:" }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-56", "text": "Different \u03b1 weights correspond to different matrix regularizers \u2126, including several ones that have been proposed previously for multi-task learning." 
}, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-57", "text": "We review those in \u00a73." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-58", "text": "For now just observe that if all \u03b1-values are set to 0.25 (or any other constant), we obtain hard parameter sharing [6] , which is equivalent to a heavy L0-regularizer." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-59", "text": "Adding Inductive Bias Naturally, we can also add inductive bias to sluice networks by partially constraining the regularizer or adding to the learned penalty." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-60", "text": "Inspired by recent work on shared-space component analysis [31, 24] , we add a penalty to enforce a division of labor and discourage redundancy between shared and task-specific subspaces." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-61", "text": "While the networks can theoretically learn such a separation, an explicit constraint empirically leads to better results and enables the sluice networks to take better advantage of subspace-specific \u03b1-values." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-62", "text": "This is modeled by an orthogonality constraint [5] between the layer-wise subspaces of each model:" }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-63", "text": "where M is the number of tasks, K is the number of layers, ||\u00b7|| 2 F is the squared Frobenius norm, and G m,k,1 and G m,k,2 are the first and second subspace respectively in the k-th layer of the m-th task model." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-64", "text": "----------------------------------" }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-65", "text": "**LEARNING MIXTURES** Many tasks have an implicit hierarchy that informs their interaction. Rather than" }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-66", "text": "predefining it [25, 11] , we enable our model to learn hierarchical relations by associating different tasks with different layers if this is beneficial for learning." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-67", "text": "Inspired by advances in residual learning [12] , we employ skip-connections from each layer, controlled using \u03b2 parameters." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-68", "text": "This layer acts as a mixture model, returning a mixture of expert predictions:" }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-69", "text": "where h A,k is the output of layer k of model A, while h\u0303 A,t is the linear combination of all layer outputs of model A that is fed into the final softmax layer." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-70", "text": "Complexity Our model only adds a minimal number of additional parameters compared to single-task models of the same architecture." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-71", "text": "\u03b1 parameters scale linearly with the number of layers and quadratically with the number of tasks and subspaces, while \u03b2 parameters scale linearly with the number of tasks and the number of layers." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-72", "text": "For a sluice network with M tasks, K layers per task, and 2 subspaces per layer, we thus obtain 4KM\u00b2 additional \u03b1 parameters and KM \u03b2 parameters." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-73", "text": "Training sluice networks is not much slower than training hard parameter sharing networks, with only a 5-7% increase in training time." 
}, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-74", "text": "----------------------------------" }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-75", "text": "**PREVIOUS PROPOSALS AS INSTANTIATIONS OF SLUICE NETWORKS**" }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-76", "text": "The architecture is very flexible and can be seen as a generalization over several existing algorithms for transfer and multi-task learning, including [7, 9, 25, 21] ." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-77", "text": "We show how to derive each of these below." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-78", "text": "Hard Parameter Sharing in the two networks appears if all \u03b1 values are set to the same constant [7, 8] ." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-79", "text": "This is equivalent to a mean-constrained L0-regularizer \u2126(\u00b7) = ||w i ||0 with \u2211 i \u03bbiLi < 1. If the sum of weighted losses is smaller than 1, the loss with penalty is always the highest when all parameters are shared." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-80", "text": "Group Lasso The \u21131/\u21132 group lasso regularizer is \u2211 G g=1 ||G1,i,g||2, a weighted sum over the \u21132 norms of the groups, often used to enforce subspace sharing [33, 26] ." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-81", "text": "Our architecture learns a \u21131/\u21132 group lasso over the two subspaces (with the same degrees of freedom), when all \u03b1A,B and \u03b1B,A-values are set to 0." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-82", "text": "When the outer layer \u03b1-values are not shared, we get block communication between the networks." 
}, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-83", "text": "----------------------------------" }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-84", "text": "**FRUSTRATINGLY EASY DOMAIN ADAPTATION**" }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-85", "text": "The approach to domain adaptation in [9] , which relies on a shared and a private space for each task or domain, can be encoded in sluice networks by setting all \u03b1A,B- and \u03b1B,A-weights associated with G i,k,1 to 0, while setting all \u03b1A,B-weights associated with G i,k,2 to \u03b1B,B, and \u03b1B,A-weights associated with G i,k,2 to \u03b1A,A. Note that [9] discusses three subspaces." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-86", "text": "We split the space in two, leading to three subspaces, if we only share one half across the two networks." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-87", "text": "Low Supervision [25] propose a model where only the inner layers of two deep recurrent networks are shared." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-88", "text": "This is obtained using heavy mean-constrained L0 regularization over the first layer Li,1, e.g., \u2126(W ) = \u2211 K i ||Li,1||0 with \u2211 i \u03bbiLi < 1, while for the auxiliary task, only the first layer \u03b2 parameter is set to 1." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-89", "text": "Cross-Stitch Networks [21] introduce cross-stitch networks, in which \u03b1 values control the flow between layers of two convolutional neural networks." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-90", "text": "Their model corresponds to setting the \u03b1-values associated with Gi,j,1 to be identical to those for Gi,j,2, and letting all but the \u03b2-value associated with the outer layer be 0." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-91", "text": "In our experiments, we include hard parameter sharing, low supervision, and cross-stitch networks as baselines." 
}, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-92", "text": "We do not report results for group lasso and frustratingly easy domain adaptation, which were consistently inferior on development data by some margin." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-93", "text": "----------------------------------" }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-94", "text": "**EXPERIMENTS**" }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-95", "text": "----------------------------------" }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-96", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-97", "text": "Data We want to experiment with multiple loosely related NLP tasks, but also study performance across domains to make sure our architecture is not prone to overfitting." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-98", "text": "As a testbed for our experiments, we therefore choose the OntoNotes 5.0 dataset [28] , not only due to its high inter-annotator agreement [14] , but also because it enables us to analyze the generalization ability of our models across different tasks and domains." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-99", "text": "The OntoNotes (Table 1 reports the number of sentences and words for the Train/Dev/Test splits of each domain in the OntoNotes 5.0 dataset)" }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-100", "text": "dataset provides data annotated for an array of tasks across different languages and domains." 
}, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-102", "text": "Tasks In multi-task learning, one task is usually considered the main task, while other tasks are used as auxiliary tasks to improve performance on the main task." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-103", "text": "As main tasks, we use chunking (CHUNK), named entity recognition (NER), and a simplified version of semantic role labeling (SRL) where we only identify headwords, and pair them with part-of-speech tagging (POS) as an auxiliary task, following [25] ." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-104", "text": "Example annotations for each task can be found in Table 2 ." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-105", "text": "Model We use a state-of-the-art BiLSTM-based sequence labeling model [23] as the building block of our model." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-106", "text": "The BiLSTM consists of 3 layers with a hidden dimension of 100." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-107", "text": "At every time step, the model receives as input the concatenation of the 64-dimensional embedding of a word and its character-level embedding produced by a BiLSTM over 100-dimensional character embeddings." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-108", "text": "Both word and character embeddings are randomly initialized." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-109", "text": "The output layer is an MLP with a dimensionality of 100." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-110", "text": "We initialize \u03b1 parameters with a bias towards one source subspace for each direction and initialize \u03b2 parameters with a bias towards the last layer." 
}, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-111", "text": "----------------------------------" }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-112", "text": "**TRAINING AND EVALUATION**" }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-113", "text": "We train our models with SGD, an initial learning rate of 0.1, and learning rate decay." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-114", "text": "During training, we randomly sample from the data for each task." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-115", "text": "We perform early stopping with patience of 2 and hyperparameter optimization on the in-domain development data of the newswire domain." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-116", "text": "We use the same hyperparameters for all comparison models across all domains." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-117", "text": "We train our models on each domain and evaluate them both on the in-domain test set as well as on the test sets of all other domains to evaluate their out-of-domain generalization ability." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-118", "text": "Baseline Models As baselines, we compare against i) a single-task model only trained on chunking; ii) the low supervision model by [25] , which predicts the auxiliary task at the first layer; iii) an MTL model based on hard parameter sharing [6] ; and iv) cross-stitch networks [21] ." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-119", "text": "We compare these against our complete sluice network with subspace constraints and learned \u03b1 and \u03b2 parameters." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-120", "text": "We provide a detailed ablation analysis of our model in Section 5." 
}, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-121", "text": "----------------------------------" }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-122", "text": "**MODEL COMPARISON**" }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-123", "text": "We first investigate how well sluice networks perform on in-domain and out-of-domain test data compared to state-of-the-art multi-task learning models." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-124", "text": "To this end, we first evaluate all models on chunking with POS tagging as auxiliary task." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-125", "text": "----------------------------------" }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-128", "text": "On average, sluice networks significantly outperform all other model architectures on both in-domain and out-of-domain data." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-129", "text": "Single task models and hard parameter sharing achieve the lowest results, almost on par, and are outperformed by low supervision models and cross-stitch networks." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-130", "text": "Sluice networks perform best for all domains, except for the telephone conversation (tc) domain, where they are outperformed by cross-stitch networks." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-131", "text": "In total, this shows that our proposed model for learning which parts of multi-task models to share, with a small set of additional parameters to learn, can achieve significant and consistent improvements in results." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-132", "text": "Table 4 : Test accuracy scores for different target domains with nw as source domain for named entity recognition (main task) and simplified semantic role labeling with POS tagging as auxiliary task for baselines and Sluice networks." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-133", "text": "ID: in-domain." 
}, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-134", "text": "OOD: out-of-domain." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-135", "text": "----------------------------------" }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-136", "text": "**IN-DOMAIN RESULTS**" }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-137", "text": "----------------------------------" }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-138", "text": "**SYSTEM**" }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-139", "text": "----------------------------------" }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-140", "text": "**PERFORMANCE ACROSS TASKS**" }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-141", "text": "We now compare sluice nets across different combinations of main and auxiliary tasks." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-142", "text": "In particular, we evaluate them on NER with POS tagging as auxiliary task and simplified semantic role labeling with POS tagging as auxiliary task." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-143", "text": "We show results in Table 4 ." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-144", "text": "Sluice networks outperform the comparison models for both tasks on in-domain test data as well as on out-of-domain test data on average." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-145", "text": "They yield the best performance on 5 out of 7 domains and 4 out of 7 domains for NER and semantic role labeling respectively." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-146", "text": "----------------------------------" }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-147", "text": "**ANALYSIS**" }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-148", "text": "Task Properties and Performance [4] correlate meta-characteristics of task pairs and gains from hard parameter sharing across a large set of NLP task pairs." 
}, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-149", "text": "Inspired by this study, we correlate various metacharacteristics with error reductions and \u03b1, \u03b2 values in sluice networks, as well as in hard parameter sharing." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-150", "text": "Most importantly, we find that a) multi-task learning gains, also in sluice networks, are higher when there is less training data, and b) sluice networks learn to share more when there is more variance in the training data Table 5 : Ablation analysis." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-151", "text": "Accuracy scores on out-of-domain (OOD) test sets for Chunking (main task) with POS tagging as auxiliary task for different target domains for different configurations of sluice networks." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-154", "text": "Generally, \u03b1 values at the inner layers correlate more highly with meta-characteristics than \u03b1 values at the outer layers." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-155", "text": "Ablation Analysis Different types of sharing may empirically be more important than others." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-156", "text": "In order to investigate this, we perform an ablation analysis (Table 5) ." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-157", "text": "We investigate the impact of i) the \u03b1 parameters; ii) the \u03b2 parameters; and iii) the division into subspaces with an orthogonality penalty." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-158", "text": "We also evaluate whether concatenation of the outputs of each layer is a reasonable alternative to our mixture model." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-159", "text": "Overall, we find that learnable \u03b1 parameters are preferable over constant \u03b1 parameters." 
}, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-160", "text": "Learned \u03b2 parameters marginally outperform skip-connections in the hard parameter sharing setting, while skip-connections are competitive with learned \u03b2 values in the learned \u03b1 setting." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-161", "text": "In addition, modeling subspaces explicitly helps for almost all domains." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-162", "text": "Finally, concatenation of layer outputs is a viable form to share information across layers." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-163", "text": "Chunking, NER, and SRL." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-164", "text": "We present inner, middle, and outer layer left to right." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-165", "text": "Figure 2 presents the final \u03b1 weights in the sluice networks for Chunking, NER, and SRL, trained with newswire as training data." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-166", "text": "We see that a) for the low-level simplified SRL, there is more sharing at inner layers, which is in line with [25] , while Chunking and NER also rely on the outer layer, and b) more information is shared from the more complex target tasks than vice versa." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-167", "text": "----------------------------------" }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-168", "text": "**ANALYSIS OF \u0391 VALUES**" }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-169", "text": "Ability to Fit Noise Sluice networks can learn to disregard sharing completely, so we expect them to be as good as single-task networks to fit random noise, potentially even better." 
}, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-170", "text": "We verify this by computing a learning curve for random relabelings of 200 sentences annotated with syntactic chunking brackets, as well as 100 gold standard POS-annotated sentences." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-171", "text": "The figure in 3 shows that hard parameter sharing, while learning faster because of the smoother loss surface in multi-task learning, is a good regularizer, confirming the findings in [25] , whereas the sluice network is even better at fitting noise than the single-task models." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-172", "text": "While ability to fit noise is not necessarily a problem [32] , this means that it can be beneficial to add inductive bias to the regularizer, especially when working with small amounts of data." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-173", "text": "----------------------------------" }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-174", "text": "**RELATED WORK**" }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-175", "text": "In the context of deep neural networks, multi-task learning is often done with hard or soft parameter sharing of hidden layers." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-176", "text": "Hard parameter sharing was introduced by [6] ." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-177", "text": "There, all hidden layers are shared between tasks which are projected into output layers specific to different tasks." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-178", "text": "This approach to multi-task learning is easy to implement, reduces overfitting and is guaranteed to work for (certain types of) closely related tasks [2, 20] ." 
}, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-179", "text": "[22] apply a variation of hard parameter sharing to multi-domain multi-task sequence tagging with a shared CRF layer and domain-specific projection layers." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-180", "text": "[30] also use hard parameter sharing to jointly learn different sequence-tagging tasks (NER, POS tagging, Chunking) across languages." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-181", "text": "They also use word and character embeddings and share character embeddings in their model." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-182", "text": "[19] explore a similar set-up, but sharing is limited to the initial layer." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-183", "text": "In all three papers, the amount of sharing between the networks is fixed in advance." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-184", "text": "Note that we only plot accuracies for hard parameter sharing and sluice networks until they plateau." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-185", "text": "In soft parameter sharing, on the other hand, each task has separate parameters and separate hidden layers, as in our architecture, but the loss at the outer layer is regularized by the current distance between the models." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-186", "text": "In [10] , for example, the loss is regularized by the L2 distance between (selective parts of) the main and auxiliary models." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-187", "text": "Other regularization schemes used in multi-task learning include the 1/ 2 group lasso [1] and the trace norm [17] ." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-188", "text": "Selective Sharing Several authors have discussed which parts of the model to share." 
}, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-189", "text": "[25] perform experiments on which hidden layers to share in the context of hard parameter sharing with deep recurrent neural networks for sequence tagging." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-190", "text": "They show that low-level tasks, i.e. easy natural language processing tasks typically used for preprocessing such as part of speech tagging and named entity recognition, should be supervised at lower layers when used as auxiliary tasks." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-191", "text": "Another line of work looks into separating the learned space into a private (i.e. task-specific) and shared space [31, 24, 27 ] to more explicitly capture the difference between task-specific and cross-task features." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-192", "text": "To enforce such behavior, constraints are enforced to prevent the models from duplicating information between subspaces." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-193", "text": "[5] use shared and private encoders regularized with orthogonality and similarity constraints for domain adaptation for computer vision." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-194", "text": "[18] use a similar technique for sentiment analysis." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-195", "text": "In contrast to all the work mentioned above, we do not limit ourselves to a predefined way of sharing, but let the model learn which parts of the network to share using latent variables, the weights of which are learned in an end-to-end fashion." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-196", "text": "The work most related to ours is [21] , who also look into learning what to share in multi-task learning." 
}, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-197", "text": "However, they only consider a very small class of the architectures that are learnable in sluice networks." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-198", "text": "Specifically, they restrict themselves to learning split architectures." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-199", "text": "In such architectures, two n-layer networks share the innermost k layers with 0 \u2264 k \u2264 n, and they learn k with a mechanism that is very similar to our \u03b1-values." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-200", "text": "Our work can be seen as a generalization of [21] , including a more in-depth analysis of augmented works." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-201", "text": "----------------------------------" }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-202", "text": "**CONCLUSION**" }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-203", "text": "We introduced SLUICE NETWORKS, a framework for learning what to share in multi-task learning using trainable parameters." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-204", "text": "Our approach is a generalization of recent work, but goes well beyond this in enabling the network to learn selective sharing of layers, subspaces, and skip connections." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-205", "text": "In experiments with NLP task pairs in Ontonotes 5.0, we show up to 15% average error reduction over hard parameter sharing at only a 5-7% increase in training time." }, { "sent_id": "7f234ecfb4cf880502faa8b89cd07b-C001-206", "text": "We provide an analysis of the ability of sluice networks to fit noise, as well as what properties are predictive of gains with sluice networks, seeing that the effect size correlates highly with label entropy, confirming previous findings for hard parameter sharing [19, 4] ." 
} ], "y": { "@USE@": { "gold_contexts": [ [ "7f234ecfb4cf880502faa8b89cd07b-C001-15", "7f234ecfb4cf880502faa8b89cd07b-C001-16" ], [ "7f234ecfb4cf880502faa8b89cd07b-C001-76" ], [ "7f234ecfb4cf880502faa8b89cd07b-C001-103" ], [ "7f234ecfb4cf880502faa8b89cd07b-C001-118" ] ], "cite_sentences": [ "7f234ecfb4cf880502faa8b89cd07b-C001-16", "7f234ecfb4cf880502faa8b89cd07b-C001-76", "7f234ecfb4cf880502faa8b89cd07b-C001-103", "7f234ecfb4cf880502faa8b89cd07b-C001-118" ] }, "@BACK@": { "gold_contexts": [ [ "7f234ecfb4cf880502faa8b89cd07b-C001-40" ], [ "7f234ecfb4cf880502faa8b89cd07b-C001-87" ] ], "cite_sentences": [ "7f234ecfb4cf880502faa8b89cd07b-C001-40", "7f234ecfb4cf880502faa8b89cd07b-C001-87" ] }, "@MOT@": { "gold_contexts": [ [ "7f234ecfb4cf880502faa8b89cd07b-C001-47" ] ], "cite_sentences": [ "7f234ecfb4cf880502faa8b89cd07b-C001-47" ] }, "@DIF@": { "gold_contexts": [ [ "7f234ecfb4cf880502faa8b89cd07b-C001-65", "7f234ecfb4cf880502faa8b89cd07b-C001-66" ] ], "cite_sentences": [ "7f234ecfb4cf880502faa8b89cd07b-C001-66" ] }, "@SIM@": { "gold_contexts": [ [ "7f234ecfb4cf880502faa8b89cd07b-C001-76" ], [ "7f234ecfb4cf880502faa8b89cd07b-C001-166" ], [ "7f234ecfb4cf880502faa8b89cd07b-C001-171" ] ], "cite_sentences": [ "7f234ecfb4cf880502faa8b89cd07b-C001-76", "7f234ecfb4cf880502faa8b89cd07b-C001-166", "7f234ecfb4cf880502faa8b89cd07b-C001-171" ] } } }, "ABC_a0af9cf22996a245af9d66cf1d358f_14": { "x": [ { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-68", "text": "**COLLECTION**" }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-2", "text": "Collective attention is central to the spread of real world news and the key to understanding how public discussions report emerging topics and breaking news." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-3", "text": "Most research measures collective attention via activity metrics such as post volume." 
}, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-4", "text": "While useful, this kind of metric obscures the nuanced content side of collective attention, which may reflect how breaking events are perceived by the public." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-5", "text": "In this work, we conduct a large-scale language analysis of public online discussions of breaking crisis events on Facebook and Twitter." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-6", "text": "Specifically, we examine how people refer to locations of hurricanes in their discussion with or without contextual information (e.g. writing \"San Juan\" vs. \"San Juan, Puerto Rico\") and how such descriptor expressions are added or omitted in correlation with social factors including relative time, audience and additional information requirements." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-7", "text": "We find that authors' references to locations are influenced by both macro-level factors such as the location's global importance and micro-level social factors like audience characteristics, and there is a decrease in descriptor context use over time at a collective level as well as at an individual-author level." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-8", "text": "Our results provide insight that can help crisis event analysts to better predict the public's understanding of news events and to determine how to share information during such events." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-9", "text": "----------------------------------" }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-11", "text": "Today, millions of people experience and discuss news and events happening around the world through online media." 
}, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-12", "text": "Breaking news events, especially crisis events, often attract significant collective attention from the general public (Lin et al. 2014) , resulting in bursts of discussion on social media (Leavitt and Clark 2014; Lehmann et al. 2012 )." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-13", "text": "During such events, public observers often focus on important locations, people or organizations (hereafter \"named entities\") depending on their relevance to the unfolding crisis (Wakamiya et al. 2015) ." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-14", "text": "A spike in attention directed toward a particular location may signal an important update, such as the need for aid for the location (Varga et al. 2013) ." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-15", "text": "While collective attention is often measured with activity metrics such as post volume (Mitra, Wright, Copyright c 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org)." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-16", "text": "All rights reserved." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-17", "text": "and Gilbert 2016), such metrics often focus on an aggregate quantity summary of attention without considering the nuanced content side of attention dynamics." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-18", "text": "One way to model the content of collective attention is to examine how people talk about breaking news events, especially their descriptions of locations, which are a major component of crisis events." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-19", "text": "For instance, after Hurricane Maria struck Puerto Rico in 2017, more Americans became familiar with the locations mentioned in news coverage about the island (DiJulio, Mu\u00f1ana, and Brodie 2017) ." 
}, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-20", "text": "In the immediate aftermath of Hurricane Maria, many news headlines referred to \"San Juan\" without extra context such as \"the capital of Puerto Rico\", largely because they expected their audience had already become familiar with the city due to the recent crisis." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-21", "text": "To better understand the nuanced dynamics of collective attention, we take a closer look at how people refer to locations of hurricanes during breaking crisis events via their usage of descriptor context phrases with respect to location mentions." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-22", "text": "Such descriptor phrases provide additional contextual information for named entities (people, organizations and locations) (Stali\u016bnait\u0117 et al. 2018) , helping to locate unfamiliar entities and disambiguate names that could have multiple referents." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-23", "text": "This is especially important when the writers assume their audiences have limited knowledge about the entities (Prince 1992) ." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-24", "text": "In crisis events, we are particularly interested in the factors that influence descriptor phrase usage, which can be seen as a content-based reflection of collective attention over the course of the event." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-25", "text": "These factors include how a writer anticipates their audience's understanding of the location being discussed, and whether a writer includes extra information outside of a descriptor phrase to help disambiguate the location." 
}, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-26", "text": "Studying how and when online discussions use or omit descriptor context when referring to locations can help crisis event participants more effectively track public awareness of an uncertain situation, better infer the public's understanding of news events, and more strategically determine how to share information during such events." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-27", "text": "Figure 1 shows an example of a shift in descriptor use during a crisis event." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-28", "text": "In public Twitter discussion of Figure 1 : Example of collective attention expressed toward the location \"San Juan\" in discussion of Hurricane Maria on Twitter." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-29", "text": "Left y-axis (black solid line) indicates the location's daily log frequency, right y-axis (red dotted line) indicates the location's weekly probability of receiving a descriptor phrase like \"Puerto Rico\"." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-30", "text": "Hurricane Maria in 2017, the location \"San Juan\" was less likely to receive a descriptor (e.g., \"San Juan, PR\") following the peak in collective attention volume." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-31", "text": "While this shift appears to be due to time (Stali\u016bnait\u0117 et al. 2018) , the shift in descriptor use may also stem from non-temporal factors as well, such as an author's expectations of their audience (audience design) and additional information such as links to external news articles, included in the same sentence with the location (a micro-level aspect of the discussion)." 
}, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-32", "text": "Jointly modeling such macro-level factors, like post volume, and micro-level factors, from authors and information expectations, and their influence on a writer's use of descriptor context can help reveal a more comprehensive picture of the dynamics in collective attention." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-33", "text": "Concretely, this work examines the public discussion of five recent devastating natural disasters on Facebook and Twitter." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-34", "text": "We investigate how people refer to locations of hurricanes with or without descriptor phrases in their discussion and how such descriptor context use changes in response to factors related to audience, writer attributes and temporal trends." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-35", "text": "Our research questions are:" }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-36", "text": "\u2022 RQ1: What factors influence people's use of descriptor context when referring to locations of hurricane events?" }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-37", "text": "\u2022 RQ2a: How does the use of descriptor context for locations change over time at a collective level?" }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-38", "text": "\u2022 RQ2b: How does the use of descriptor context for locations change at an individual author level?" }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-39", "text": "To address these research questions, we first analyzed posts written on Facebook in public groups concerning Hurricane Maria relief, and found that location mentions receive descriptors more often when the locations are not local to the group of discussion, suggesting that descriptors may be used to help explain new information to audiences." 
}, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-40", "text": "By looking at public posts written on Twitter concerning natural disasters, we found that the aggregate rate of descriptor phrases decreased following the peaks in these locations' collective attention, supporting prior findings in the change in named entity use (Stali\u016bnait\u0117 et al. 2018) ." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-41", "text": "To assess potential individual-level causes of such content dynamics, we examined a set of characteristics related to audiences and authors, and we found that authors tend to use fewer descriptors if they had mentioned a location before, and to use more descriptors if the author received more audience engagement (e.g., more retweets and likes)." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-42", "text": "To sum up, our work demonstrates intuitive patterns in the use of descriptor phrases as a means of expressing shared knowledge expectations, which is an under-explored aspect of the content side of collective attention." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-43", "text": "Studying the use of descriptor phrases as well as other writing conventions in public discussions can provide insight into a writer's expectations of their audience, and therefore a more fine-grained view into information sharing dynamics." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-44", "text": "----------------------------------" }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-45", "text": "**RELATED WORK**" }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-46", "text": "The term collective attention refers to the attention that a public group of people pays to a particular event or topic (Sasahara et al. 2013) , often as a result of a shared interest among the people." 
}, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-47", "text": "Collective attention is an important component in the spread of information (Wu and Huberman 2007) , and it can shift either vary rapidly or gradually in response to particular events such as sports games (Lehmann et al. 2012) , natural disasters (Varga et al. 2013) , and political controversy (Garimella et al. 2017) ." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-48", "text": "With the wealth of digital data available to researchers today, studies have often quantified collective attention using the volume of posting and sharing activity in social media sites such as Reddit and Twitter (Leavitt and Clark 2014; Mitra, Wright, and Gilbert 2016) ." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-49", "text": "While these kinds of activity metrics provide an aggregate summary of attention dynamics, they largely obscure the nuanced content of collective attention such as how people refer to such particular events via language and how such referring language evolves over time." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-50", "text": "As an initial effort to understand this under-explored content aspect of collective attention, our research focuses on how people refer to named entities (e.g., locations, organizations) of breaking crisis events in their discussion, which are essential information for these events, and how such referring changes among large groups of people over the course of those crisis events." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-51", "text": "When describing a named entity, a writer may add descriptive information in the form of a dependent clause (Kang et al. 2019 ) (e.g. \"San Juan, in Puerto Rico\"), to provide additional, contextual information for the audience to be familiar with the entity." 
}, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-69", "text": "Twitter Dataset The Twitter posts were collected using hashtags from five major disasters that recently struck the United States: Hurricane Florence (2018), Hurricane Harvey (2017), Hurricane Irma (2017), Hurricane Maria (2017), and Hurricane Michael (2018)." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-52", "text": "The dependent clause may describe attributes of the entity that are relevant to a specific topic, such as \"San Juan, epicenter of Hurricane Maria relief effort,\" or attributes that are generally relevant, such as \"San Juan, Puerto Rico.\" From a collective perspective, prior work that examined the use of descriptor phrases in news media found that writers tend to drop such phrases as the entities gradually become more and more familiar (i.e., shared knowledge) among discussion participants over time (Stali\u016bnait\u0117 et al. 2018) ." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-53", "text": "In addition to relative time, Siddharthan, Nenkova, and McKeown (2011) found that salience or the importance of the named entity, i.e., whether an entity plays a major role in the story or narrative being told, determines the need for a descriptor phrase, since a perceived salient or important entity is likely to be understood as shared knowledge among discussion participants and therefore unlikely to need a descriptor phrase (Prince 1992)." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-54", "text": "From an individual perspective, Galati and Brennan (2010) suggested that audience matters for writers' choice of using a additional descriptive information, since audiences who are familiar with those entities are less likely to require context to read or participate in such discussions." 
}, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-55", "text": "In most cases, when referring to locations in online discussions of crisis events, authors may find it difficult to determine their potential audience." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-56", "text": "As a result, they may lack a common ground (Bell 1984) , and authors may need to use a descriptive phrase to write for this large and potentially diverse audience." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-57", "text": "Similarly, depending on to what extent authors are familiar with the locations of crisis events, authors may have certain tendency to use or omit a descriptor phrase; for instance, authors who are a local (Kogan, Palen, and Anderson 2015) or a \"core\" community member (Lave and Wenger 1991) during a crisis event may be less likely to use descriptor phrases in location mentions because of their prior familiarity." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-58", "text": "Building on this theoretical and empirical work on collective attention, we study the content reflection of collective attention by first operationalizing a set of collective-and individual-level factors such as the importance of locations and the characteristics of audiences and authors, which are summarized in Table 4 ." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-59", "text": "We then analyze how they relate to descriptor context use when people refer to locations during crisis events in the following sections." 
}, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-60", "text": "----------------------------------" }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-61", "text": "**DATA**" }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-62", "text": "Crisis events such as hurricanes present a useful case study for the development of collective attention, due to the large volume of online participation and large uncertainty among event observers towards the situation during the crisis events (Varga et al. 2013) ." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-63", "text": "We chose to study the collective attention changes in public discourse related to hurricanes, due to hurricanes' lasting economic impact, their broad coverage in the news, and their relevance to specific geographic regions." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-64", "text": "We collected social media data related to five recent devastating hurricanes, and we describe the data collection ( \u00a7 3.1), location detection ( \u00a7 3.2), and descriptor detection ( \u00a7 3.3) for the following datasets:" }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-65", "text": "1. Twitter data: 2 million public tweets related to 5 major hurricanes, collected in 2017 and 2018." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-66", "text": "2. Facebook data: around 30,000 posts from 60 public groups related to disaster relief in Hurricane Maria, collected in 2017." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-67", "text": "----------------------------------" }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-70", "text": "We used hashtags that contained the name of the event in full and shortened form, e.g. #Harvey and #HurricaneHarvey for Hurricane Harvey." 
}, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-71", "text": "During 2017 and 2018, we streamed tweets that contained hashtags related to the natural disasters at the start of each disaster for up to one week after the dissolution of the hurricane." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-72", "text": "1 We augmented this data with additional tweets available in a 1% Twitter sample that contains the related hashtags, restricting our time frame to one day before the formation of the hurricane and one week after the dissipation of the hurricane." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-73", "text": "Manual inspection revealed minimal noise generated by the inclusion of the name-only hashtags." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-74", "text": "Summary statistics about the Twitter data are presented in Table 1 ." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-75", "text": "In addition to these tweets, we also collected additional event-related tweets from the most frequently-posting authors in each dataset (\"active authors\"), which were needed to evaluate per-author descriptor use change (see \u00a7 4.3)." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-76", "text": "Facebook Dataset The Facebook data was collected in the aftermath of Hurricane Maria by searching for public discussion groups that included at least one of Puerto Rico's municipalities in the title (e.g. \"Guayama: Hurac\u00e1n Maria\" refers to Guayama municipality)." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-77", "text": "Relatives and friends of Puerto Ricans often posted in these groups to seek additional information about those still on Puerto Rico, who could not be reached by telephone due to infrastructure damage." 
}, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-78", "text": "We restricted our analysis to Facebook groups related to Hurricane Maria because of the limited information causing more discussion of specific locations (as compared to the other hurricane events that had more up-to-date information available online)." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-79", "text": "In total, we collected 31,414 public posts from 61 groups, from the time of their creation to one month afterward (Sept 20 to Oct 20 2017)." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-80", "text": "Only posts in Spanish were retained (determined using langid.py (Lui and Baldwin 2012) 2 ) because it was the majority language in the posts." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-81", "text": "Note that, due to Facebook data restrictions and API changes, we were not able to collect posts in Facebook groups for the other four hurricanes events, which we acknowledge as a limitation." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-82", "text": "----------------------------------" }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-83", "text": "**EXTRACTING AND FILTERING LOCATIONS**" }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-84", "text": "We extracted locations mentioned using, for English tweets, a distantly supervised named entity recognizer adapted to Twitter data (Ritter et al. 2011) 3 and for Spanish tweets, a general purpose named entity recognizer (Manning et al. 2014) ." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-85", "text": "4 ." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-86", "text": "These NER systems are highly accessible, widely-used, and well-performing across multiple domains." 
}, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-87", "text": "We further evaluated the performance of these NER systems on a sample of tweets (100 tagged LOCATIONs per dataset, 500 total) and found reasonable precision for the LOCATION tag (81-96% across all datasets)." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-88", "text": "For this work, we are interested in named entities that may require descriptors, which include cities and counties." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-89", "text": "We therefore restrict our analysis only to named entities (NEs) that (1) are tagged as LOCATION, (2) can be found in the GeoNames ontology 5 , (3) map to cities or counties in the ontology, (4) map to affected locations in the ontology, based on their location occurring in the region affected by the event, and (5) are unambiguous within the region affected by the event." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-90", "text": "For instance, the string \"San Juan\" is a valid location for the Hurricane Maria tweets because the affected region contains an unambiguous match for the string, but it is not a valid location for the Hurricane Harvey tweets because the affected region does not contain an unambiguous match." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-91", "text": "----------------------------------" }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-92", "text": "**EXTRACTING DESCRIPTOR PHRASES**" }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-93", "text": "One way in which a writer mentions or helps introduce a new entity (e.g., \"San Juan\") in their discussion is by linking it to a more well-known entity (e.g., \"Puerto Rico\") in an descriptor phrase." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-94", "text": "We operationalized this as the occurrence of a well-known entity in a dependent clause relative to the location, which is straightforward to detect." 
}, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-95", "text": "Here, we used the population of a location as a proxy to determine how \"well-known\" that location is." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-96", "text": "The underlying assumption is that a more well-populated location may be more likely to be known or heard by more people and can therefore help describe the preceding location." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-97", "text": "In this work, we used the frequency of such descriptor phrases as the dependent variable: higher frequency of descriptor phrases uses indicates that the location may be new knowledge, while a lower frequency indicates that a location is more likely to be shared knowledge." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-98", "text": "To extract sentence structure from text, we used dependency parsing, which decomposes a sentence into a directed acyclic graph connecting words and phrases." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-99", "text": "Following Stali\u016bnait\u0117 et al. (2018) , we used a small set of dependencies to capture the \"MODIFIER\" phrase type in a subclause (adjectival clause, appositional modifier, prepositional modifier, numeric modifier) and another set of dependencies to capture the \"COMPOUND\" type in a super-clause (nominal modifier, compound, appositional modifier)." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-100", "text": "A summary of our phrase patterns to capture descriptor phrases is provided in Table 3 ." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-101", "text": "Taking into account the characteristics of text from two different domains, for the Twitter data we used the spacy shift-reduce parser (Honnibal and Johnson 2015) 6 to extract the dependencies; for the Facebook data, the dependencies were extracted using the SyntaxNet transition-based parser (Andor et al. 
2016) 7 , following initial tests that showed higher accuracy on SyntaxNet versus other comparable alternatives." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-102", "text": "Whether post was written during peak of collective attention toward location Post peak" }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-103", "text": "Whether post was written at least 1 day after the peak of collective attention toward location Table 4 : Summary of explanatory variables and corresponding metrics, used for descriptor phrase prediction." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-104", "text": "----------------------------------" }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-105", "text": "**VALIDATION OF EXTRACTION PERFORMANCE**" }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-144", "text": "Consistently, local locations or authors being a local are associated with lower rates of descriptor use, suggesting that the lack of descriptor context indicates shared knowledge among a large group of discussion participants." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-106", "text": "To assess the accuracy of our phrase patterns in capturing descriptor phrases, we asked two annotators (computer science graduate students) who had not seen the data to annotate a random sample of 50 tweets containing at least one location from each data set (250 tweets total)." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-107", "text": "The annotators received instructions on how to determine if a location was marked by a descriptor phrase, including examples that were not drawn from the data, and the annotators marked each location mention as either (1) a \"LOCATION + LOCATION STATE\" pattern, (2) one of the other descriptor patterns in Table 3 or (3) no descriptor phrase." 
}, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-108", "text": "The annotators achieved high agreement on each separate descriptor type (Cohen's \u03ba = 0.96 for the state pattern, \u03ba = 0.91 for the other patterns)." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-109", "text": "We then extracted posts with perfect agreement and detected descriptor phrases using the phrase patterns proposed." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-110", "text": "We found that our phrase patterns achieve reasonable precision and recall (96.6% and 87.5% respectively) in identifying descriptor phrases compared to raters' annotations." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-111", "text": "This validation check demonstrated that our proposed syntactic patterns can capture descriptor phrases reasonably well." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-112", "text": "----------------------------------" }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-113", "text": "**RESULTS**" }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-114", "text": "We address our research questions in three analyses as follows." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-115", "text": "----------------------------------" }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-116", "text": "**WHAT AFFECTS THE USE OF DESCRIPTOR CONTEXT?**" }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-117", "text": "This section investigates RQ1 about what factors influence people's use of descriptor context when referring to locations of hurricane events." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-118", "text": "We are particularly interested in the correlations between descriptor uses and a set of indicators of whether locations may be considered as old information." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-119", "text": "Here, a descriptor phrase may be omitted for locations that are geographically local to a group of people, i.e. 
knowledge that already shared among the group and are therefore assumed to be old information (e.g., if someone mentions the location \"San Juan\" in a group based in a region containing San Juan)." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-120", "text": "To examine this research question, we compared the rate of descriptor uses for location mentions using both Facebook and Twitter data." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-121", "text": "For the Facebook data, we determined whether the group's region contains the location mentioned based on whether the most likely match for the location in the gazetteer 8 is contained in that region." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-122", "text": "We then operationalized a set of explanatory variables mentioned above as follows: location mention frequency (importance), author in-group posting frequency (author status), and group size (audience), as summarized in Table 4 ." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-123", "text": "For the Twitter data, we operationalized a similar set of explanatory variables using the following: location mention frequency (importance), whether the author is an organization (author status), 9 whether the author is a local (commitment), 9 post length (information), URL presence (information), and image/video presence (information)." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-143", "text": "we found that our operationalized factors of importance, author, audience and information correlate differently with writers' use of descriptor phrases." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-124", "text": "We built two logistic regression models to predict descriptor phrase use from the location containment variable with fixed effects on the categorical variables (location strings, authors, and groups) on the Facebook data and the Twitter data separately (N=18432 and N=49020, respectively)." 
}, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-125", "text": "In detail, we used an elastic net regression (Zou and Hastie 2005) 10 in order to reduce the risk of overfitting to fixed effects variables." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-126", "text": "For this analysis, rare categorical values (N < 20) for the fixed effects are replaced with RARE values to avoid overfitting to uncommon categories." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-127", "text": "The columns \"RQ1 (Facebook)\" and \"RQ1 (Twitter)\" in Table 5 report the results of our logistic regression models." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-128", "text": "On Facebook, we observed that local locations are 8 We assume that a location string matches a given location candidate if the candidate has the highest population in the gazetteer." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-129", "text": "9 See Appendix A for details on determining whether an author is an organization or local." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-130", "text": "10 L2 normalization weight of 0.01 chosen through grid search to maximize log-likelihood on held-out data (90-10 train/test split)." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-131", "text": "associated with a lower rate of descriptor phrase use (\u03b2=-0.623, p < 0.001)." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-132", "text": "This was further validated via a qualitative inspection of comments." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-133", "text": "For example, we found that in the group \"Hurricane Maria en Lajas\" the mention of the municipality \"Lajas\" does not receive an descriptor (\"Do you know if Banco Popular is open in Lajas?\"), while in the group \"Guayama: Hurac\u00e1n Maria\" the mention of \"Lajas\" does receive an descriptor (\"People who can bring water to Lajas Puerto Rico: they need water urgently\")." 
}, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-134", "text": "11 We did not find significant correlations for other explanatory variables and the descriptor phrase use." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-135", "text": "----------------------------------" }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-136", "text": "**RQ1 (FACEBOOK)**" }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-137", "text": "On Twitter, we found that (1) the more salient or important a location is, the less likely the descriptor context use (\u03b2=-0.172)." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-138", "text": "(2) Authors who are local (\u03b2=-0.511) are less likely to include descriptor phrases, possibly because they know the location much better and assume their audiences to be familiar with it as well." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-139", "text": "(3) Organizational accounts on Twitter are more likely to use descriptor phrases (\u03b2=0.093), largely for preserving the information accuracy and validity." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-140", "text": "(4) Posts with URLs were less likely to have descriptors (\u03b2=-0.081), which may indicate that authors include additional new information in place of descriptors or authors assume that the information in the URL will provide necessary context." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-141", "text": "(5) In contrast, posts with an image or video were more likely to include descriptors (\u03b2=0.137), implying that visual content may require additional information to be understood." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-142", "text": "Taking the analyses on two different platforms together, 11 Comments are translated from Spanish and are paraphrased for ethical reasons." 
}, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-145", "text": "----------------------------------" }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-146", "text": "**COLLECTIVE CHANGE IN DESCRIPTOR CONTEXT USE**" }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-147", "text": "This section investigates RQ2a on how the use of descriptor context for location mentions changes over time." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-148", "text": "Specifically, we used longitudinal data to examine the collective tendency to use more or fewer descriptor phrases over time." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-149", "text": "The intuition is that over the course of crisis events more collective attention to a particular location may result in more awareness of the location among discussion participants, therefore reducing the need for context." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-150", "text": "In addition to the aforementioned explanatory factors used, we incorporated an additional set of variables to capture this temporal dynamics: relative peak time, i.e. whether the location is mentioned during or after the peak in post volume; and time since start, i.e. days since the beginning of the hurricane." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-151", "text": "Here, the definition of peak in collective attention is critical, because it determines the point at which an entity is expected to become shared knowledge (Stali\u016bnait\u0117 et al. 2018) ." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-152", "text": "Following Mitra, Wright, and Gilbert (2016), we defined the time of peak collective attentiont i for each location i as the (24-hour) period during which it is mentioned the most frequently:" }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-153", "text": "is the raw frequency of location i at time t (see Figure 1 for peak in \"San Juan\" posts)." 
}, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-154", "text": "We defined pre-peak as the period that ends t buffer days before the frequency peak, during-peak as the period at most t buffer days before and at most t buffer days after, and post-peak as the period that begins t buffer days after the frequency peak (we set t buffer = 1)." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-155", "text": "To improve the stability of the fixed effects estimates, we removed all locations that are mentioned on fewer than N = 5 separate dates, and combined all authors with 1 post into a RARE bin." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-156", "text": "As shown in the \"RQ2a (Twitter)\" column of Table 5 , the main variable of interest, i.e. the post-peak time period, had less descriptor use than the earlier time periods (\u03b2=-0.127)." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-157", "text": "This suggests that a location may become shared knowledge after receiving a burst of collective attention and further validated our previous example study (see Figure 1) ." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-158", "text": "Furthermore, we found that descriptor phrase use decreased with the amount of time since the start of the event (\u03b2=-0.120)." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-159", "text": "This answers our RQ2a that the use of descriptor context for location mentions decreases over time, indicating that authors' expectations of those locations being shared knowledge change gradually over the course of the event as well as in a burst following the attention peak." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-160", "text": "One potential cause for the decrease in descriptor context may be the change in population after the peak in collective attention (e.g. influx of locals)." 
}, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-161", "text": "To this end, we re-ran the regression above and replaced the author variables (\"local\" and \"organization\") with a fixed effect for all authors." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-162", "text": "We found that the post-peak effect was still significant and negative (\u03b2=-0.253, p < 0.05), which suggests that a change in author population is unlikely correlated with the decrease in descriptor use 12 ." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-163", "text": "To summarize, we found consistently less descriptor use over the course of crisis events even after controlling for other explanatory factors, supporting prior work in long-term descriptor phrase change (Stali\u016bnait\u0117 et al. 2018 )." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-164", "text": "----------------------------------" }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-165", "text": "**INDIVIDUAL CHANGE IN DESCRIPTOR CONTEXT USE**" }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-166", "text": "The previous section showed that the collective descriptor use decreases after the peak in post volume even after controlling for other explanatory factors." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-167", "text": "This section further examines such changes at an individual author level (RQ2b), to determine whether authors modulate their use of descriptors over the course of the event in response to perceived changes in shared knowledge of those locations of hurricanes events." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-168", "text": "For example, does an author's prior participation in discussion of the event lead to less descriptor use for the same event?" }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-169", "text": "Are authors who have a larger audience more likely to use descriptor phrases?" 
}, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-170", "text": "To better model the author-level changes in descriptor use, we introduced the following new factors into our regression models: number of prior posts by author during event (author-level), number of prior posts by author about the location during event (author-level), engagement received by author at t \u2212 1 (audience), 13 and change in engagement 12 If there were a population change, the post-peak effect would disappear and the negative correlation would be distributed among the fixed effects of the authors responsible for the change." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-171", "text": "13 We define engagement as the mean of Z-normalized retweets and likes." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-172", "text": "received by author between t\u22122 and t\u22121 (audience)." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-173", "text": "These factors required a longitudinal sample of frequently-posting authors, i.e. active authors." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-174", "text": "Thus a set of active authors was identified in each data set, based on their relative post volume." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-175", "text": "14 We scraped all publicly available tweets posted by these active authors that mention one of the event's hashtags during the event time period (e.g., all posts for a Harvey-related active author from [17-08-17,10-09-17] that use #Harvey or #HurricaneHarvey)." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-176", "text": "The locations and descriptor phrases were processed in the same way as before (see \u00a7 3.2), and we report the relevant statistics for these active authors in Table 2 ." 
}, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-177", "text": "We built similar regularized regression models only using data from the active authors who posted at least once during each of the time periods, in order to isolate authors who may have changed over time." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-178", "text": "The results are described in the \"RQ2b (Twitter)\" column of Table 5 ." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-179", "text": "We found several significant trends among those author and audience factors." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-180", "text": "(1) Authors' prior mentions of a location are associated with less descriptor use (\u03b2 = \u22120.237)." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-181", "text": "This negative correlation indicates that a location entity may be gradually understood as shared knowledge as the author repeats the location to the same audience (Galati and Brennan 2010) ." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-182", "text": "(2) More prior engagement from audiences correlates with more descriptor uses (\u03b2 = 0.292), which suggests that the authors need to plan their messages in response to cues from a larger and potentially more diverse audience." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-183", "text": "(3) Surprisingly, we found that this active author population showed no significant temporal tendency toward more or less descriptor use over the course of an event, during the peak or after the peak in collective attention." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-184", "text": "This null result held even when we performed the regression without the additional author and audience variables (i.e. same setup as RQ2a but including only the active author population)." 
}, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-185", "text": "We hypothesize that the active authors may be different from the overall population, i.e. active authors may have their different ways of responding to trends in collective attention and thus are less likely be influenced by such temporal trends, whereas less active authors are more likely to be influenced by them." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-186", "text": "We tested this by identifying less active authors (\"regulars\") as those with lower post volumes 15 and conducting the regression analysis with only those regulars." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-187", "text": "We found that these less active authors do show a significant decrease in descriptor use following the peak in collective attention (\u03b2 = \u22120.127, p < 0.05) and a decrease in descriptor use over time (i.e., with more time since start, \u03b2 = \u22120.098, p < 0.05)." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-188", "text": "This suggests that less active authors may be more likely to accommodate to such collective temporal trends in descriptor context use, while the active authors are less responsive." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-189", "text": "Addressing RQ2b, highly active authors do not change their descriptor context use over time, while relatively less active users show a decrease in such context use over the course of crisis events." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-190", "text": "However, the active authors do show significant modulation of their context use in response to more audience engagement and more mentions of the location in prior posts." 
}, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-191", "text": "----------------------------------" }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-192", "text": "**DISCUSSION**" }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-193", "text": "By examining how people refer to crisis-impacted locations over the course of those crisis events, we found several trends in the use of contextualization related to audience expectations: (1) When authors are local to a place, or are writing for an audience who is expected to be local, they are less likely to use descriptor phrases to contextualize references to locations, reflecting shared knowledge among the author and audience." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-194", "text": "(2) At a collective level, there is a decreased descriptor use over the course of crisis events even after controlling for a set of explanatory variables." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-195", "text": "(3) At an individual level, highly active authors change their descriptor use in response to prior audience engagement but not after the peak in collective attention, whereas relatively less active users show a significant decrease in such context use over time." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-196", "text": "----------------------------------" }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-197", "text": "**IMPLICATIONS**" }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-198", "text": "This study demonstrates the benefits of studying the content of collective attention rather than merely quantity: studying how location entities are mentioned can provide more insight into writer intentions and expectations." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-217", "text": "A speaker may use a preceding descriptor phrase, instead of a subordinate descriptor phrase, to indicate that the entity is not shared knowledge (e.g. \"a city called San Juan\")." 
}, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-199", "text": "For example, the initial example of \"San Juan\" losing descriptor use over time reveals a different narrative than its frequency alone would reveal, i.e. that the entity became shared knowledge in discussion during the hurricane." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-200", "text": "Furthermore, unlike other linguistic analysis techniques such as topic models, the method proposed for capturing descriptor phrases does not require extra interpretation and can be directly applied to large-scale social media data without the need for post-hoc interpretation, which could be beneficial for event monitors." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-201", "text": "In addition to methods, this study provides insight into the role of audience among writers, which is relevant even during the extreme case of crisis events." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-202", "text": "The finding about local locations receiving fewer descriptors, along with the finding about active authors' audience response, suggests that authors may accommodate to group norms in order to improve their odds of receiving a response." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-203", "text": "Furthermore, active authors' descriptor use may correlate with audience engagement but not with overall collective attention peaks because these authors focus more on their own social behavior rather than responding to global trends." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-204", "text": "Overall, the study highlights a set of practical and theoretical implications by looking at the content side of collective attention." 
}, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-205", "text": "First, we provide alternative ways to examine collective attention by looking at how people refer to crisis events, in contrast to prior work that mostly focused on the aggregate quantity side of collective attention (Candia et al. 2019; Lehmann et al. 2012) ." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-206", "text": "For example, Mitra, Wright, and Gilbert (2016) found that more sustained attention (lower variance in post volume) toward an event on social media correlates with lower perceived credibility of the event." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-207", "text": "Such analyses can be further enriched via our content-based operationalization of collective attention: for instance, future work might analyze how certain shared knowledge towards crisis events inferred from location mentions and descriptor context uses correlates with the perceived credibility of those events." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-208", "text": "Our work could also help distinguish nuances in hashtag uses of collective attention (Lehmann et al. 2012) to better understand different forms of the manifestation of collective attention." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-209", "text": "Furthermore, our work sheds light on how and when online discussions use or omit descriptor context when referring to locations during crisis events." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-210", "text": "This can can help crisis event participants more effectively track public awareness of an uncertain situation, better infer the public's understanding of news events, and more strategically determine how to share information during such events." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-211", "text": "It also has theoretical implications for understanding linguistic structures (e.g. descriptor phrases) and social change." 
}, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-212", "text": "----------------------------------" }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-213", "text": "**LIMITATIONS**" }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-214", "text": "Our work is also subject to several limitations." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-215", "text": "First, the analysis of what factors influence the use of descriptive context are mainly correlational without casual validations." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-216", "text": "Our formulation of descriptor phrases is not exhaustive and may have missed other syntactic constructions that indicate that an entity is considered new information (i.e. false negatives)." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-218", "text": "In addition, we focused on only a set of specific crisis events due to their representative usages of location mentions and large volume of online discussions." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-219", "text": "Future work can build upon our work and generalize it to other different types of crisis events." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-220", "text": "In addition, we are unable to rule out the possibility that another event attracted attention to the locations under discussion before the crises began (e.g. a political news story relevant to the event's region) Lastly, the study focuses exclusively on location names because of their geographic relevance to events, but other types of named entities (people, organizations) are also likely to undergo changes in descriptor use in response to increased attention (Stali\u016bnait\u0117 et al. 2018) ." }, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-221", "text": "To conclude, this study adds a new content-based perspective to the measurement of collective attention, by analyzing how people discuss breaking news events." 
}, { "sent_id": "a0af9cf22996a245af9d66cf1d358f-C001-222", "text": "By examining five recent hurricane events, our research demonstrated how referring expressions are shaped by author and audience expectations of collective attention over time and across communities." } ], "y": { "@MOT@": { "gold_contexts": [ [ "a0af9cf22996a245af9d66cf1d358f-C001-14", "a0af9cf22996a245af9d66cf1d358f-C001-15", "a0af9cf22996a245af9d66cf1d358f-C001-17" ], [ "a0af9cf22996a245af9d66cf1d358f-C001-220" ] ], "cite_sentences": [ "a0af9cf22996a245af9d66cf1d358f-C001-14", "a0af9cf22996a245af9d66cf1d358f-C001-220" ] }, "@BACK@": { "gold_contexts": [ [ "a0af9cf22996a245af9d66cf1d358f-C001-14" ], [ "a0af9cf22996a245af9d66cf1d358f-C001-22" ], [ "a0af9cf22996a245af9d66cf1d358f-C001-29", "a0af9cf22996a245af9d66cf1d358f-C001-30", "a0af9cf22996a245af9d66cf1d358f-C001-31" ], [ "a0af9cf22996a245af9d66cf1d358f-C001-40" ], [ "a0af9cf22996a245af9d66cf1d358f-C001-47" ], [ "a0af9cf22996a245af9d66cf1d358f-C001-52" ], [ "a0af9cf22996a245af9d66cf1d358f-C001-62" ], [ "a0af9cf22996a245af9d66cf1d358f-C001-151" ], [ "a0af9cf22996a245af9d66cf1d358f-C001-220" ] ], "cite_sentences": [ "a0af9cf22996a245af9d66cf1d358f-C001-14", "a0af9cf22996a245af9d66cf1d358f-C001-22", "a0af9cf22996a245af9d66cf1d358f-C001-31", "a0af9cf22996a245af9d66cf1d358f-C001-40", "a0af9cf22996a245af9d66cf1d358f-C001-47", "a0af9cf22996a245af9d66cf1d358f-C001-52", "a0af9cf22996a245af9d66cf1d358f-C001-62", "a0af9cf22996a245af9d66cf1d358f-C001-151", "a0af9cf22996a245af9d66cf1d358f-C001-220" ] }, "@USE@": { "gold_contexts": [ [ "a0af9cf22996a245af9d66cf1d358f-C001-99" ], [ "a0af9cf22996a245af9d66cf1d358f-C001-150", "a0af9cf22996a245af9d66cf1d358f-C001-151" ] ], "cite_sentences": [ "a0af9cf22996a245af9d66cf1d358f-C001-99", "a0af9cf22996a245af9d66cf1d358f-C001-151" ] }, "@SIM@": { "gold_contexts": [ [ "a0af9cf22996a245af9d66cf1d358f-C001-163" ] ], "cite_sentences": [ "a0af9cf22996a245af9d66cf1d358f-C001-163" ] }, "@FUT@": { "gold_contexts": [ [ 
"a0af9cf22996a245af9d66cf1d358f-C001-220" ] ], "cite_sentences": [ "a0af9cf22996a245af9d66cf1d358f-C001-220" ] } } }, "ABC_8ff1560ac0241a763b4b0d93718b40_14": { "x": [ { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-145", "text": "----------------------------------" }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-146", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-2", "text": "Relation extraction from simple questions aims to capture the relation of a factoid question with one underlying relation from a set of predefined ones in a knowledge base." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-3", "text": "Most recent methods take advantage of neural networks for matching a question with all relations in order to find the best relation that is expressed by that question." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-4", "text": "In this paper, we propose an instance-based method to find similar questions of a new question, in the sense of their relations, to predict its mentioned relation." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-5", "text": "The motivation roots in the fact that a relation can be expressed with different forms of question and these forms mostly share similar terms or concepts." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-6", "text": "Our experiments on the SimpleQuestions dataset show that the proposed model achieved better accuracy compared to the stateof-the-art relation extraction models." 
}, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-7", "text": "----------------------------------" }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-9", "text": "With the growth of the Internet and rapid production of a vast amount of information, question answering systems, designed to find a relevant proper answer by searching throughout a data source, are of great importance." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-10", "text": "The production of knowledge bases and the need to answer questions over such resources received researchers attentions to propose different models to find the answer of questions from the knowledge bases, known as KBQA 1 . Answering factoid questions with one relation, also known as simple question answering, has been widely studied in recent years (Dai et al., 2016; Yin et al., 2016; He and Golub, 2016; Yu et al., 2017) ." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-11", "text": "A common approach that has been used in most of the researches is utilizing a two-component system, including an entity linker and a relation extractor." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-12", "text": "In this paper, we focus on the relation extraction component, which is also treated as a classification problem." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-13", "text": "This topic demands certain tools to capture relation that is mentioned in the questions, as a part of the QA systems." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-14", "text": "In this paper, we aim to view this kind of relation prediction." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-15", "text": "The term \"relation extraction\" is originally referred to capturing relation between two entities, if there is any." 
}, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-16", "text": "In the case of simple questions, one of entities is already mentioned and the relation which represents the topic behind the words must be predicted in order to find the other entity." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-17", "text": "For instance, in the question \"Which artist recorded georgia?\", \"artist\" conveys the topic and \"georgia\" stands for the first entity." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-18", "text": "In this context, extracting relation from single-relation questions obtains higher accuracy compared to multi-relational ones." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-19", "text": "Having a large number of relations in a knowledge base, however, this simple question relation extraction is not a solved problem yet." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-20", "text": "Classifying questions to predefined set of relations is one of the main approaches for this task (Mohammed et al., 2018) ." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-21", "text": "Moreover, matching question content with relations has also been proposed and shown promising results (Yin et al., 2016; Yu et al., 2017) ." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-22", "text": "In this paper, the relation extraction is viewed from a new perspective such that relation extraction is done within a question-question matching model, instead of only matching questions and relations." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-23", "text": "Indeed, while many of the previous works use a matching between question mention and relations, we use an instance-based method for classifying relations." 
}, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-24", "text": "The proposed model benefits from a text matching model, namely 1 Knowledge Base Question Answering 2 MatchPyramid , and enhances it with a two-channel model for considering lexical match and semantic match between questions." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-25", "text": "The structure of the paper is as follows: in Section 2, we give a concise overview of the existing approaches for relation classification and its application in KBQA." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-26", "text": "We also review the available neural text matching models which is the base of our instance-based model." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-27", "text": "Section 3 presents our approach and elaborately explains detail of our proposed model." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-28", "text": "In Section 4, we show the conducted evaluation experiments and discuss the results." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-29", "text": "Finally, we summarize our method in Section 5." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-30", "text": "----------------------------------" }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-31", "text": "**RELATED WORKS**" }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-32", "text": "----------------------------------" }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-33", "text": "**QUESTION ANSWERING OVER KNOWLEDGE BASE**" }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-34", "text": "One paradigm in proposed approaches for relation extraction in KBQA is based on semantic parsing in which questions were parsed and turned into logical forms in order to query the knowledge base (Berant et al., 2013; Berant and Liang, 2014) ." 
}, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-35", "text": "However, most of the recent approaches (Mohammed et al., 2018; Bordes et al., 2015; Dai et al., 2016; He and Golub, 2016; Yu et al., 2017) are based on automatically extracted features of terms; thanks to the prominent performance of neural network on representation learning (Mikolov et al., 2013a,b) ." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-36", "text": "From another point of view, two mainstreams for extracting relations in KBQA are studied: (a) using a classifier which chooses the most probable relation among all (Mohammed et al., 2018) ; (b) matching questions and relations through learning of an embedding space for representing all relations and question words (Bordes et al., 2015; Dai et al., 2016; Yin et al., 2016; He and Golub, 2016; Yu et al., 2017) , in which each relation is considered either as a meaningful sequence of words or as a unique entity." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-37", "text": "Dai et al. (2016) considered the relation prediction, as well as the whole KBQA problem, as a conditional probability task in which the goal is finding the most probable relation given the question mention." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-38", "text": "To this aim, they used" }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-39", "text": "----------------------------------" }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-40", "text": "**NEURAL TEXT MATCHING**" }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-41", "text": "The growing area of text matching develops models to investigate the relationship and the degree of matching between sequences of words." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-42", "text": "This comparison mechanism is a substantial core for various tasks, including ad-hoc retrieval, paraphrase identification, question answering, and semantic web search (Hu et al., 2014) ." 
}, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-43", "text": "In this regard, three main categories are considered in the context of deep matching models, namely representation-focused, interactionfocused, and hybrid ." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-44", "text": "In representation-focused models, an abstract contextual representation of texts are extracted through neural networks and then these representations are used to estimate the matching score between them." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-45", "text": "BiMPM (Wang et al., 2017) and ARC I (Hu et al., 2014) are examples of these models." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-46", "text": "On the other hand, interaction-focused models compute the similarity between two sequences of words in a procedure, such that different patterns and structures of interactions are learned with the help of neural networks based on local interactions of two sequences ." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-47", "text": "MatchPyramid and aNMM (Yang et al., 2016) are examples of these models." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-48", "text": "Hybrid models aim to benefit from the advantages of both techniques." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-49", "text": "ARC II (Hu et al., 2014) is an example of this category." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-50", "text": "In this paper, owing to its superior performance, which is reported in Section 4 , we take advantage of an interaction-focused model in the hierarchy of our model, based on MatchPyramid." 
}, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-51", "text": "----------------------------------" }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-52", "text": "**PROPOSED MODEL**" }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-53", "text": "In our research, we look at the problem of relation extraction of KBQA from a new point of view to propose our instance-based solution for the task." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-54", "text": "Before describing the model in detail, we provide an overview of the problem." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-55", "text": "Given pairs of question and relation in our training data, denoted as (Q,R), and pairs of question and relation in our test data, denoted as (Q',R'), for each Q', we aim to predict the most probable relation (R\"), which interprets the question precisely." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-56", "text": "Having different lexical representation for each question about a relation, there are similar words from different range of similarity that occur in the questions of the same relation." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-57", "text": "Based on these similarities, we argue that the resemblance of questions can be used to detect the relation that lies behind question words." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-58", "text": "In this regard, following Yin et al. (2016) , we first extract the entity mentions out of question words and put a symbol (e.g. < e >) in its place, so that we will have a question pool in which each question is labeled with its relation that can be considered as a paraphrase of that question." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-59", "text": "For each new question (Q'), we find the most resemble question (Q) in our question pool, and assign its corresponding relation as the relation for Q'." 
}, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-60", "text": "To do so, we need a model to compute the resemblance of each pair of questions and find the most similar one to Q'." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-61", "text": "Nevertheless, due to existence of multiple form of questions for a relation to be paraphrased, we take the relation of the majority among k top ranked similar questions to Q'." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-62", "text": "In this sense, we are using an instance-based method by computing the relatedness of each new Q' to all train questions." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-63", "text": "The architecture of our model is depicted in Figure 1 ." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-64", "text": "As can be seen, represents the similarity of two questions (Q' with Q)." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-65", "text": "Additionally, following Yu et al. (2017) , we add another neural network (Q'-R), the right part of the architecture, to compute the matching score of Q' with the relation of Q (R)." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-66", "text": "By doing so, we are enhancing the matching signals between Q' and Q to estimate the overall score." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-67", "text": "In the first step, our proposed model projects the input question as well as the available questions and relations of training data into an embedding space." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-68", "text": "To do so, each sequence of words (Q', Q and R) are fed to an embedding layer and all of their corresponding vectors are fetched." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-69", "text": "In this work, we use pre-trained vectors due to the fact that current neural embeddings which are learned by using large scale text corpora provide rich enough representation." 
}, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-70", "text": "In the next step, input vectors are fed to the two neural networks." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-71", "text": "----------------------------------" }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-72", "text": "**Q'-Q NETWORK**" }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-73", "text": "Q' and Q are fed to the Q'-Q network, which is inspired by Matchpyramid ." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-74", "text": "Initially, an interaction matrix between the words of Q and Q' is computed." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-75", "text": "This interaction matrix has been used in several interactionfocused neural learning to match models Mitra et al., 2017; Wan et al., 2016; Xiong et al., 2017; Hu et al., 2014) due to the fact that it provides good representation to compute matching degree between two piece of texts." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-76", "text": "Indeed, an interaction matrix is a matrix computed based on two sequence of words, in which, each element at the i th row and j th column stands for the similarity between the i th word of the first sequence and j th word of the second sequence." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-77", "text": "Accordingly, the (i, j) element not only records the exact matches between the words of two sequences, but also estimates the degree of semantic similarity between them ." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-78", "text": "There are several similarity functions that can be used for creating this matrix; e.g. Cosine similarity, indicator function, dot product, tensor network, etc. ." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-79", "text": "In this work, our matrix is computed based on the Cosine similarity." 
}, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-80", "text": "In addition, we add another interaction matrix in which indicator function is used as the similarity function." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-81", "text": "The idea behind using indicator function in interaction matrix comes from the fact 7 that the range of difference between words that can be used to express a question about a unique topic/relation (e.g. location of birth) is not too wide and they have relatively low diversity." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-82", "text": "By doing so, we add an additional bias, which we believe that exists inherently in the problem, to our model." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-83", "text": "This makes our convolutional neural network a 2-channel network, which reflects both lexical and semantic match between text sequences." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-84", "text": "The impact of adding this extra input channel is provided in Section 4." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-85", "text": "Then, this matching matrix is fed to a convolutional neural network to compute a vector (matching vector) which is followed by a Multilayer Perceptron (MLP) to compute the matching score between Q' and and Q from the question pool." 
}, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-86", "text": "----------------------------------" }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-87", "text": "**Q'-R NETWORK**" }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-88", "text": "----------------------------------" }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-89", "text": "**EVALUATION**" }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-90", "text": "----------------------------------" }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-91", "text": "**DATASET**" }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-92", "text": "Following the previous works by Yin et al. (2016) and Yu et al. (2017) , we use the common benchmark dataset of the simple question answering, namely SimpleQuestions, which was originally introduced by Bordes et al. (2015) ." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-93", "text": "This dataset contains 108442 questions gathered with the help of English-speaking annotators." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-94", "text": "Yin et al. (2016) proposed a new benchmark for evaluating relation extraction task on SimpleQuestion." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-95", "text": "In this benchmark, every question, whose entity is replaced by a unique token, is labeled with its ground truth relation as its positive label, and all other relation of the gold entity that is mentioned in the question are considered as negative labels." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-96", "text": "We use the same dataset which contains 72239, 10310 and 20610 question samples as train, validation, and test sets respectively." 
}, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-97", "text": "----------------------------------" }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-98", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-99", "text": "The hyper-parametes are tuned over validation set and they are finally configured with: (1) 64 units for BiLSTMs, (2) single layer for convolutional network, and (3) 256 neurons for MLP on matching vector." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-100", "text": "The embeddings are initialized with 300-dimension glove word vectors (Pennington et al., 2014) ." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-101", "text": "It is worth mentioning that the interaction layer consists of two square matrices whose length is equal to the longest question." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-102", "text": "All the experiments are implemented with Keras library and performed on a Linux machine which has an Intel TMCore i7-6700 3.40 GHz CPU with 16 Gigabyte memory alongside Nvidia GeForce GTX 1080Ti GPU." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-103", "text": "----------------------------------" }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-104", "text": "**RESULTS**" }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-105", "text": "As mentioned, our instance-based idea for question answering over knowledge graph requires a text matching model to find similar question to the input question." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-106", "text": "Considering the available text matching models, in the first step of our experiments, we trained different text matching models on the training data." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-107", "text": "The obtained results are reported in Table 1 ." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-108", "text": "As can be seen, the MatchPyramid model performs the best on the proposed model." 
}, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-109", "text": "Considering these results, MatchPyramid has been used as the base of our model in further steps." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-110", "text": "In the next step, we evaluate the performance of our model while considering exact lexical match as a separate channel in the our Q'-Q network." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-111", "text": "To this end, three experiments were done and compared: (1) single-channel semantic text matching (Cosine function), (2) single-channel lexical match (indicator function), and (3) two-channel lexical and semantic text matching." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-112", "text": "The results of these experiments are reported in Table 2 ." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-113", "text": "As can be seen in the results, the Model Accuracy (%) ARC I (Hu et al., 2014) 89.35 ARC II (Hu et al., 2014) 88.44" }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-114", "text": "MatchPyramid 91.75" }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-115", "text": "BiMPM (Wang et al., 2017) 90.73 impact of adding an extra input channel is obvious as it is compared with single one." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-116", "text": "The better performance of the semantic channel shows the importance of semantic text matching in the QA system." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-117", "text": "The further improvement using an additional channel for lexical match indicates that although lexical match is considered implicitly in the normal MatchPyramid model, it is not enough for considering this issue in the QA task and the proposed two-channel model can better cover both semantic and lexical similarities." 
}, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-118", "text": "In the next step of our experiments, we added the Q'-R network to the Q'-Q network and evaluated the new combined architecture, presented in Figure 1 , on the same dataset." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-119", "text": "Table 3 reports the performance of our model on classifying the relations in comparison with the state-of-the-art models." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-120", "text": "In this table, AMPCNN (Yin et al., 2016) is an attentive max-pooling CNN for matching a question with all relations." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-121", "text": "APCNN (dos Santos et al., 2016) and ABCNN (Yin et al., 2016) both employ an attentive pooling mechanism." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-122", "text": "These two models Model Accuracy (%) AMPCNN (Yin et al., 2016) 91.3" }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-123", "text": "OWA-APCNN (dos Santos et al., 2016) 90.5" }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-124", "text": "OWA-ABCNN (Yin et al., 2016) 90.2" }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-125", "text": "BiCNN (Yih et al., 2015) 90.0" }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-126", "text": "Hier-Res-BiLSTM (HR-BiLSTM) (Yu et al., 2017) 93.3" }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-127", "text": "Proposed Q'-Q + Q'-R model 93.41 are not originally evaluated on relation prediction of simple questions." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-128", "text": "In fact, the authors of AMPCNN (Yin et al., 2016) , conducted the corresponding experiments on a one-way-attention adaptation of these two models to compare them with the available methods in this task." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-129", "text": "Hier-Res-BiLSTM (Yu et al., 2017) uses hierarchical residual connections to ease the training procedure of BiL-STM." 
}, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-130", "text": "BiCNN (Yih et al., 2015) uses convolutional neural networks for matching a question with relations." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-131", "text": "The model is reimplemented for SimpleQuestions by Yu et al. (2017) ." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-132", "text": "As it is shown, our proposed model outperforms the state-of-the-art models in relation extraction for SimpleQuestions dataset by a margin of 0.11 percentage." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-133", "text": "We believe that this improvement is an effect of the two contributions that we had in this paper, namely proposing a combined Q'-Q + Q'-R network and the two-channel text matching model in the Q'-Q network." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-134", "text": "----------------------------------" }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-135", "text": "**ERROR ANALYSIS**" }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-136", "text": "In the last step of our experiments, we aim to find the main reasons of errors in the system." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-137", "text": "To this end, the test questions whose relations were not obtained correctly in our proposed model are analyzed." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-138", "text": "Table 4 presents few examples from those questions." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-139", "text": "Among these questions, there are some predictions in which even the human supervision would assign incorrect relation; e.g., \"what is the genre of the movie \u00a1e\u00bf?\" or \"is \u00a1e\u00bf from the united states or canada?\", due to very close concepts in the relations or different levels of granularity in the available relations in the knowledge bases." 
}, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-140", "text": "In addition, some of questions are practically equivocal; e.g., \"what are \u00a1e\u00bf?\" or \"what's an example of a \u00a1e\u00bf book?\"." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-141", "text": "Indeed, this ambiguity exists in the training data." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-142", "text": "Hence, during the training process, an extra variance is imposed to the model." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-143", "text": "For instance, for the question \"what are \u00a1e\u00bf?\", there are four relations, namely /cvg/gameplay mode/games with this mode, /film/film genre/films in this genre, /common/topic/notable types, and /music/album content type/albums, that are assigned to the aforementioned question." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-144", "text": "It seems that there is an upper bound for relation prediction on Sim-pleQuestions due to these kinds of indistinctness." }, { "sent_id": "8ff1560ac0241a763b4b0d93718b40-C001-147", "text": "In this paper, we proposed a new relation prediction model for simple ques- Thus, for future work, we would like to utilize these question words to predict relations from more complex questions." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "8ff1560ac0241a763b4b0d93718b40-C001-10" ], [ "8ff1560ac0241a763b4b0d93718b40-C001-21" ], [ "8ff1560ac0241a763b4b0d93718b40-C001-34", "8ff1560ac0241a763b4b0d93718b40-C001-36" ], [ "8ff1560ac0241a763b4b0d93718b40-C001-94" ], [ "8ff1560ac0241a763b4b0d93718b40-C001-127", "8ff1560ac0241a763b4b0d93718b40-C001-128" ] ], "cite_sentences": [ "8ff1560ac0241a763b4b0d93718b40-C001-10", "8ff1560ac0241a763b4b0d93718b40-C001-21", "8ff1560ac0241a763b4b0d93718b40-C001-36", "8ff1560ac0241a763b4b0d93718b40-C001-94", "8ff1560ac0241a763b4b0d93718b40-C001-128" ] }, "@USE@": { "gold_contexts": [ [ "8ff1560ac0241a763b4b0d93718b40-C001-58" ], [ "8ff1560ac0241a763b4b0d93718b40-C001-92" ], [ "8ff1560ac0241a763b4b0d93718b40-C001-94", "8ff1560ac0241a763b4b0d93718b40-C001-96" ] ], "cite_sentences": [ "8ff1560ac0241a763b4b0d93718b40-C001-58", "8ff1560ac0241a763b4b0d93718b40-C001-92", "8ff1560ac0241a763b4b0d93718b40-C001-94" ] }, "@DIF@": { "gold_contexts": [ [ "8ff1560ac0241a763b4b0d93718b40-C001-120", "8ff1560ac0241a763b4b0d93718b40-C001-121", "8ff1560ac0241a763b4b0d93718b40-C001-122", "8ff1560ac0241a763b4b0d93718b40-C001-124", "8ff1560ac0241a763b4b0d93718b40-C001-127" ] ], "cite_sentences": [ "8ff1560ac0241a763b4b0d93718b40-C001-120", "8ff1560ac0241a763b4b0d93718b40-C001-121", "8ff1560ac0241a763b4b0d93718b40-C001-122", "8ff1560ac0241a763b4b0d93718b40-C001-124" ] } } }, "ABC_de9eb9b7dff69743252b3ff0ef8894_14": { "x": [ { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-2", "text": "Word embeddings have recently been shown to reflect many of the pronounced societal biases (e.g., gender bias or racial bias)." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-3", "text": "Existing studies are, however, limited in scope and do not investigate the consistency of biases across relevant dimensions like embedding models, types of texts, and different languages." 
}, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-4", "text": "In this work, we present a systematic study of biases encoded in distributional word vector spaces: we analyze how consistent the bias effects are across languages, corpora, and embedding models." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-5", "text": "Furthermore, we analyze the crosslingual biases encoded in bilingual embedding spaces, indicative of the effects of bias transfer encompassed in cross-lingual transfer of NLP models." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-6", "text": "Our study yields some unexpected findings, e.g., that biases can be emphasized or downplayed by different embedding models or that user-generated content may be less biased than encyclopedic text." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-7", "text": "We hope our work catalyzes bias research in NLP and informs the development of bias reduction techniques." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-8", "text": "----------------------------------" }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-10", "text": "Recent work demonstrated that word embeddings induced from large text collections encode many human biases (e.g., Bolukbasi et al., 2016; Caliskan et al., 2017) ." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-11", "text": "This finding is not particularly surprising given that (1) we are likely project our biases in the text that we produce and (2) these biases in text are bound to be encoded in word vectors due to the distributional nature (Harris, 1954) of the word embedding models (Mikolov et al., 2013a; Pennington et al., 2014; Bojanowski et al., 2017) ." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-12", "text": "For illustration, consider the famous analogy-based gender bias example from Bolukbasi et al. 
(2016) : \"Man is to computer programmer as woman is to homemaker\"." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-13", "text": "This bias will be reflected in the text (i.e., the word man will cooccur more often with words like programmer or engineer, whereas woman will more often appear next to homemaker or nurse), and will, in turn, be captured by word embeddings built from such biased texts." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-14", "text": "While biases encoded in word embeddings can be a useful data source for diachronic analyses of societal biases (e.g., Garg et al., 2018) , they may cause ethical problems for many downstream applications and NLP models." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-15", "text": "In order to measure the extent to which various societal biases are captured by word embeddings, Caliskan et al. (2017) proposed the Word Embedding Association Test (WEAT)." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-16", "text": "WEAT measures semantic similarity, computed through word embeddings, between two sets of target words (e.g., insects vs. flowers) and two sets of attribute words (e.g., pleasant vs. unpleasant words)." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-17", "text": "While they test a number of biases, the analysis is limited in scope to English as the only language, GloVe (Pennington et al., 2014) as the embedding model, and Common Crawl as the type of text." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-18", "text": "Following the same methodology, McCurdy and Serbetci (2017) extend the analysis to three more languages (German, Dutch, Spanish) , but test only for gender bias." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-19", "text": "In this work, we present the most comprehensive study of biases captured by distributional word vector to date." 
}, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-20", "text": "We create XWEAT, a collection of multilingual and cross-lingual versions of the WEAT dataset, by translating WEAT to six other languages and offer a comparative analysis of biases over seven diverse languages." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-21", "text": "Furthermore, we measure the consistency of WEAT biases across different embedding models and types of corpora." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-22", "text": "What is more, given the recent surge of models for inducing cross-lingual embedding spaces (Mikolov et al., 2013a; Hermann and Blunsom, 2014; Smith et al., 2017; Conneau et al., 2018; Artetxe et al., 2018; Hoshen and Wolf, 2018, inter alia) and their ubiquitous application in cross-lingual transfer of NLP models for downstream tasks, we investigate cross-lingual biases encoded in cross-lingual embedding spaces and compare them to bias effects present of corresponding monolingual embeddings." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-23", "text": "Our analysis yields some interesting findings: biases do depend on the embedding model and, quite surprisingly, they seem to be less pronounced in embeddings trained on social media texts." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-24", "text": "Furthermore, we find that the effects (i.e., amount) of bias in cross-lingual embedding spaces can roughly be predicted from the bias effects of the corresponding monolingual embedding spaces." 
}, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-25", "text": "----------------------------------" }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-26", "text": "**DATA FOR MEASURING BIASES**" }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-27", "text": "We first introduce the WEAT dataset (Caliskan et al., 2017) and then describe XWEAT, our multilingual and cross-lingual extension of WEAT designed for comparative bias analyses across languages and in cross-lingual embedding spaces." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-28", "text": "----------------------------------" }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-29", "text": "**WEAT**" }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-30", "text": "The Word Embedding Association Test (WEAT) (Caliskan et al., 2017) is an adaptation of the Implicit Association Test (IAT) (Nosek et al., 2002) ." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-31", "text": "Whereas IAT measures biases based on response times of human subjects to provided stimuli, WEAT quantifies the biases using semantic similarities between word embeddings of the same stimuli." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-32", "text": "For each bias test, WEAT specifies four stimuli sets: two sets of target words and two sets of attribute words." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-33", "text": "The sets of target words represent stimuli between which we want to measure the bias (e.g., for gender biases, one target set could contain male names and the other females names)." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-34", "text": "The attribute words, on the other hand, represent stimuli towards which the bias should be measured (e.g., one list could contain pleasant stimuli like health and love and the other negative war and death)." 
}, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-35", "text": "The WEAT dataset defines ten bias tests, each containing two target and two attribute sets." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-36", "text": "1 Table 1 enumerates the WEAT tests and provides examples of the respective target and attribute words." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-37", "text": "----------------------------------" }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-38", "text": "**MULTILINGUAL AND CROSS-LINGUAL WEAT**" }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-39", "text": "We port the WEAT tests to the multilingual and cross-lingual settings by translating the test vocabularies consisting of attribute and target terms from English to six other languages: German (DE), Spanish (ES), Italian (IT), Russian (RU), Croatian (HR), and Turkish (TR)." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-40", "text": "We first automatically translate the vocabularies and then let native speakers of the respective languages (also fluent in English) fix the incorrect automatic translations (or introduce better fitting ones)." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-41", "text": "Our aim was to translate WEAT vocabularies to languages from diverse language families 2 for which we also had access to native speakers." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-42", "text": "Whenever the translation of an English term indicated the gender in a target language (e.g., Freund vs. Freundin in DE), we asked the translator to provide both male and female forms and included both forms in the respective test vocabularies." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-43", "text": "This helps avoiding artificially amplifying the gender bias stemming from the grammatically masculine or feminine word forms." 
}, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-44", "text": "The monolingual tests in other languages are created by simply using the corresponding translations of target and attribute sets in those languages." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-45", "text": "For every two languages, L1 and L2 (e.g., DE" }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-46", "text": "----------------------------------" }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-47", "text": "**. }).**" }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-48", "text": "We did not translate or modify proper names from WEAT sets 3-6." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-49", "text": "In our multilingual and cross-lingual experiments we, however, discard the (translations of) WEAT tests for which we cannot find more than 20% of words from some target or attribute set in the embedding vocabulary of the respective language." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-50", "text": "This strategy eliminates tests 3-5 and 10 which include proper American names, majority of which can not be found in distributional vocabularies of other languages." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-51", "text": "The exception to this is test 6, containing frequent English first names (e.g., Paul, Lisa), which we do find in distributional vocabularies of other languages as well." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-52", "text": "In summary, for languages other than EN and for cross-lingual settings, we execute six bias tests (T1, T2, T6-T9)." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-53", "text": "----------------------------------" }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-54", "text": "**METHODOLOGY**" }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-55", "text": "We adopt the general bias-testing framework from Caliskan et al. 
(2017) , but we span our study over multiple dimensions: (1) corpora -we analyze the consistency of biases across distributional vectors induced from different types of text; (2) embedding models -we compare biases across distributional vectors induced by different embedding models (on the same corpora); and (3) languageswe measure biases for word embeddings of different languages, trained from comparable corpora." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-56", "text": "Furthermore, unlike Caliskan et al. (2017) , we test whether biases depend on the selection of the similarity metric." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-57", "text": "Finally, given the ubiquitous adoption of cross-lingual embeddings (Ruder et al., 2017; Glava\u0161 et al., 2019) , we investigate biases in a variety of bilingual embedding spaces." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-58", "text": "Bias-Testing Framework." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-59", "text": "We first describe the WEAT framework (Caliskan et al., 2017) ." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-60", "text": "Let X and Y be two sets of targets, and A and B two sets of attributes (see \u00a72.1)." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-61", "text": "The tested statistic is the difference between X and Y in average similarity of their terms with terms from A and B:" }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-62", "text": "with association difference for term t computed as:" }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-63", "text": "where t is the distributional vector of term t and f is a similarity or distance metric, fixed to cosine similarity in the original work (Caliskan et al., 2017) ." 
}, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-64", "text": "The effect size, that is, the \"amount of bias\", is computed as the normalized measure of separation between association distributions:" }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-65", "text": "where \u00b5 denotes the mean and \u03c3 standard deviation." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-66", "text": "----------------------------------" }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-67", "text": "**DIMENSIONS OF BIAS ANALYSIS.**" }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-68", "text": "We analyze the bias effects across multiple dimensions." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-69", "text": "First, we analyze the effect that different embedding models have: we compare biases of distributional spaces induced from English Wikipedia, using CBOW (Mikolov et al., 2013b) , GLOVE (Pennington et al., 2014) , FASTTEXT (Bojanowski et al., 2017) , and DICT2VEC algorithms (Tissier et al., 2017) ." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-70", "text": "Secondly, we investigate the effects of biases in different corpora: we compare biases between embeddings trained on the Common Crawl, Wikipedia, and a corpus of tweets." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-71", "text": "Finally, and (arguably) most interestingly, we test the consistency of biases across seven languages (see \u00a72.2)." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-72", "text": "To this end, we test for biases in seven monolingual FASTTEXT spaces trained on Wikipedia dumps of the respective languages." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-73", "text": "Biases in Cross-Lingual Embeddings." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-74", "text": "Crosslingual embeddings (CLEs) are widely used in multilingual NLP and cross-lingual transfer of NLP models." 
}, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-75", "text": "Despite the ubiquitous usage of CLEs, the biases they potentially encode have not been analyzed so far." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-76", "text": "We analyze projection-based CLEs (Glava\u0161 et al., 2019) , induced through posthoc linear projections between monolingual embedding spaces (Mikolov et al., 2013a; Artetxe et al., 2016; Smith et al., 2017) ." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-77", "text": "The projection is commonly learned through supervision with few thousand word translation pairs." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-78", "text": "Most recently, however, a number of models have been proposed that learn the projection without any bilingual signal (Artetxe et al., 2018; Conneau et al., 2018; Hoshen and Wolf, 2018; AlvarezMelis and Jaakkola, 2018, inter alia) ." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-79", "text": "Let X and Y be, respectively, the distributional spaces of the source (S) and target (T) language and let D = {w i S , w i T } i be the word translation dictionary." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-80", "text": "Let (X S , X T ) be the aligned subsets of monolingual embeddings, corresponding to word-aligned pairs from D. We then compute the orthogonal matrix W that minimizes the Euclidean distance between X S W and X T (Smith et al., 2017):" }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-81", "text": ". We create comparable bilingual dictionaries D by translating 5K most frequent EN words to other six languages and induce a bilingual space for all 21 language pairs." 
}, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-82", "text": "----------------------------------" }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-83", "text": "**FINDINGS**" }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-84", "text": "Here, we report and discuss the results of our multi-dimensional analysis." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-85", "text": "Table 2 shows the effect sizes for WEAT T1-T10 based on Euclidean or cosine similarity between word vector representations trained on the EN Wikipedia using FAST-TEXT." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-86", "text": "We observe the highest bias effects for T6 (Male/Female -Career/Family), T9 (Physical/Mental deseases -Long-term/Short-term), and T1 (Insects/Flowera -Positive/Negative)." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-87", "text": "Importantly, the results show that biases do not depend on the similarity metric." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-88", "text": "We observe nearly identical effects for cosine similarity and Euclidean distance for all WEAT tests." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-89", "text": "In the following experiments we thus analyze biases only for cosine similarity." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-90", "text": "Word Embedding Models." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-91", "text": "Table 3 compares biases in embedding spaces induced with different models: CBOW, GLOVE, FASTTEXT, and DICT2VEC." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-92", "text": "While the first three embedding methods are trained on Wikipedia only, DICT2VEC employs definitions from dictionaries (e.g., Oxford dictionary) as additional resources for identifying strongly related terms." 
}, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-93", "text": "4 results T1, T2, and T7-T9 for DICT2VEC, as the DICT2VEC's vocabulary does not cover most of the proper names from the remaining tests." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-94", "text": "Somewhat surprisingly, the bias effects seem to vary greatly across embedding models." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-95", "text": "While GLOVE embeddings are biased according to all tests, 5 FASTTEXT and especially CBOW exhibit significant biases only for a subset of tests." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-96", "text": "We hypothesize that the bias effects reflected in the distributional space depend on the preprocessing steps of the embedding model." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-97", "text": "FASTTEXT, for instance, relies on embedding subword information, in order to avoid issues with representations of out-of-vocabulary and underrepresented terms: additional reliance on morpho-syntactic signal may make FASTTEXT more resilient to biases stemming from distributional signal (i.e., word cooccurrences)." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-98", "text": "The fact that the embedding space induced with DICT2VEC exhibits larger bias effects may seem counterintuitive at first, since the dictionaries used for vector training should be more objective and therefore less biased than encyclopedic text." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-99", "text": "We believe, however, that the additional dictionary-based training objective only propagates the distributional biases across definitionally related words." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-100", "text": "Generally, we find these results to be important as they indicate that embedding models may accentuate or diminish biases expressed in text." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-101", "text": "Corpora." 
}, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-102", "text": "In Table 4 we compare the biases of embeddings trained with the same model (GLOVE) but on different corpora: Common Crawl (i.e., noisy web content), Wikipedia (i.e., encyclopedic the definition of A and vice versa (Tissier et al., 2017) ." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-103", "text": "5 This is consistent with the original results obtained by Caliskan et al. (2017) ." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-104", "text": "Corpus T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 WIKI 1.4 1.5 1.2 1.4 1.4 1.8 1.2 1.3 1.3 1.2 CC 1.5 1.6 1.5 1.6 1.4 1.9 1.1 1.3 1.4 1.3 TWEETS 1.2 1.0 1.1 1.2 1.2 1.2 \u22120.2 * 0.6 * 0.7 * 0.8 * Table 6 : XWEAT bias effects (aggregated over all six tests) for cross-lingual word embedding spaces." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-105", "text": "Rows: targets language; columns: attributes language." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-106", "text": "Asterisks indicate the inclusion of bias effects sizes in the aggregation that were insignificant at \u03b1 < 0.05." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-107", "text": "more pronounced for embeddings trained on the Common Crawl than for those obtained on encyclopedic texts (Wikipedia)." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-108", "text": "Countering our intuition, the corpus of tweets seems to be consistently less biased (across all tests) than Wikipedia." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-109", "text": "In fact, the biases covered by tests T7-T10 are not even significantly present in the vectors trained on tweets." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-110", "text": "This finding is indeed surprising and the phenomenon warrants further investigation." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-111", "text": "Biases in Cross-Lingual Embeddings." 
}, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-112", "text": "We report bias effects for all 21 bilingual embedding spaces in Table 6 ." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-113", "text": "For brevity, here we report the bias effects averaged over all six XWEAT tests (we provide results detailing bias effects for each of the tests separately in the supplementary materials)." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-114", "text": "Generally, the bias effects of bilingual spaces are in between the bias effects of the two corresponding monolingual spaces (cf." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-115", "text": "Table 5 ): this means that we can roughly predict the amount of bias in a cross-lingual embedding space from the same bias effects of corresponding monolingual spaces." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-116", "text": "For example, effects in cross-lingual spaces increase over monolingual effects for low-bias languages (HR and TR), and decrease for high-bias languages (EN and ES)." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-117", "text": "----------------------------------" }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-118", "text": "**MULTILINGUAL COMPARISON.**" }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-119", "text": "----------------------------------" }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-120", "text": "**CONCLUSION**" }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-121", "text": "In this paper, we have presented the largest study to date on biases encoded in distributional word vector spaces." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-122", "text": "To this end, we have extended previous analyses based on the WEAT test (Caliskan et al., 2017; McCurdy and Serbetci, 2017) in multiple dimensions: across seven languages, four embedding models, and three different types of text." 
}, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-123", "text": "We find that different models may produce embeddings with very different biases, which stresses the importance of embedding model selection when fair text representations are to be created." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-124", "text": "Surprisingly, we find that the user-generated texts, such as tweets, may be less biased than redacted content." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-125", "text": "Furthermore, we have investigated the bias effects in cross-lingual embedding spaces and have shown that they may be predicted from the biases of corresponding monolingual embeddings." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-126", "text": "We make the XWEAT dataset and the testing code publicly available, 7 hoping to fuel further research on biases encoded in word representations." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-127", "text": "Table 12 : XWEAT T9 effect sizes for cross-lingual embedding spaces." }, { "sent_id": "de9eb9b7dff69743252b3ff0ef8894-C001-128", "text": "Rows denote the target set language, column the attribute set language." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "de9eb9b7dff69743252b3ff0ef8894-C001-10" ], [ "de9eb9b7dff69743252b3ff0ef8894-C001-15" ], [ "de9eb9b7dff69743252b3ff0ef8894-C001-30" ], [ "de9eb9b7dff69743252b3ff0ef8894-C001-61", "de9eb9b7dff69743252b3ff0ef8894-C001-62", "de9eb9b7dff69743252b3ff0ef8894-C001-63" ] ], "cite_sentences": [ "de9eb9b7dff69743252b3ff0ef8894-C001-10", "de9eb9b7dff69743252b3ff0ef8894-C001-15", "de9eb9b7dff69743252b3ff0ef8894-C001-30", "de9eb9b7dff69743252b3ff0ef8894-C001-63" ] }, "@MOT@": { "gold_contexts": [ [ "de9eb9b7dff69743252b3ff0ef8894-C001-10" ] ], "cite_sentences": [ "de9eb9b7dff69743252b3ff0ef8894-C001-10" ] }, "@USE@": { "gold_contexts": [ [ "de9eb9b7dff69743252b3ff0ef8894-C001-27" ], [ "de9eb9b7dff69743252b3ff0ef8894-C001-55" ], [ "de9eb9b7dff69743252b3ff0ef8894-C001-59" ], [ "de9eb9b7dff69743252b3ff0ef8894-C001-122" ] ], "cite_sentences": [ "de9eb9b7dff69743252b3ff0ef8894-C001-27", "de9eb9b7dff69743252b3ff0ef8894-C001-55", "de9eb9b7dff69743252b3ff0ef8894-C001-59", "de9eb9b7dff69743252b3ff0ef8894-C001-122" ] }, "@EXT@": { "gold_contexts": [ [ "de9eb9b7dff69743252b3ff0ef8894-C001-55" ], [ "de9eb9b7dff69743252b3ff0ef8894-C001-122" ] ], "cite_sentences": [ "de9eb9b7dff69743252b3ff0ef8894-C001-55", "de9eb9b7dff69743252b3ff0ef8894-C001-122" ] }, "@SIM@": { "gold_contexts": [ [ "de9eb9b7dff69743252b3ff0ef8894-C001-103" ] ], "cite_sentences": [ "de9eb9b7dff69743252b3ff0ef8894-C001-103" ] } } }, "ABC_e68d09937d522dc5acac9637eb2a8b_14": { "x": [ { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-2", "text": "The performance of machine learning algorithms can be improved by combining the output of different systems." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-3", "text": "In this paper we apply this idea to the recognition of noun phrases." 
}, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-4", "text": "We generate different classifiers by using different representations of the data." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-5", "text": "By combining the results with voting techniques described in (Van Halteren et al., 1998) we manage to improve the best reported performances on standard data sets for base noun phrases and arbitrary noun phrases." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-6", "text": "----------------------------------" }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-7", "text": "****" }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-8", "text": "1 Introduction (Van Halteren et al., 1998) and (Brill and Wu, 1998) describe a series of successful experiments for improving the performance of part-of-speech taggers." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-9", "text": "Their results have been obtained by combining the output of different taggers with system combination techniques such as majority voting." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-10", "text": "This approach cancels errors that are made by the minority of the taggers." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-11", "text": "With the best voting technique, the combined results decrease the lowest error rate of the component taggers by as much as 19% (Van Halteren et al., 1998) ." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-12", "text": "The fact that combination of classifiers leads to improved performance has been reported in a large body of machine learning work." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-13", "text": "We would like to know what improvement combination techniques would cause in noun phrase recognition." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-14", "text": "For this purpose, we apply a single memorybased learning technique to data that has been represented in different ways." 
}, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-15", "text": "We compare various combination techniques on a part of the Penn Treebank and use the best method on standard data sets for base noun phrase recognition and arbitrary noun phrase recognition." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-16", "text": "----------------------------------" }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-17", "text": "**METHODS AND EXPERIMENTS**" }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-18", "text": "In this section we start with a description of our task: recognizing noun phrases." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-19", "text": "After this we introduce the different data representations we use and our machine learning algorithms." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-20", "text": "We conclude with an outline of techniques for combining classifier results." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-21", "text": "----------------------------------" }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-22", "text": "**TASK DESCRIPTION**" }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-23", "text": "Noun phrase recognition can be divided in two tasks: recognizing base noun phrases and recognizing arbitrary noun phrases." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-24", "text": "Base noun phrases (baseNPs) are noun phrases which do not contain another noun phrase." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-25", "text": "For example, the sentence In [ early trading ] in [ Hong Kong ] [ Monday ] , [ gold ] was quoted at [ $ 366.50 ] [ an ounce ] ." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-26", "text": "contains six baseNPs (marked as phrases between square brackets)." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-27", "text": "The phrase $ 366.50 an ounce is a noun phrase as well." 
}, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-28", "text": "However, it is not a baseNP since it contains two other noun phrases." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-29", "text": "Two baseNP data sets have been put forward by (Ramshaw and Marcus, 1995) ." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-30", "text": "The main data set consist of four sections (15-18) of the Wall Street Journal (WSJ) part of the Penn Treebank (Marcus et al., 1993) as training material and one section (20) as test material 1." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-31", "text": "The baseNPs in this data are slightly different from the ones that can be derived from the Treebank, most notably in the attachment of genitive markers." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-32", "text": "The recognition task involving arbitrary noun phrases attempts to find both baseNPs and noun phrases that contain other noun phrases." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-33", "text": "A standard data set for this task was put forward at the CoNLL-99 workshop." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-34", "text": "It consist on the same parts of the Penn Treebank as the main baseNP data set: WSJ sections 15-18 as training data and section 20 as test data 2." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-35", "text": "The noun phrases in this data set are the same as in the Treebank and therefore the baseNPs in this data set are slightly different from the ones in the (Ramshaw and Marcus, 1995) data sets." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-36", "text": "In both tasks, performance is measured with three scores." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-37", "text": "First, with the percentage of detected noun phrases that are correct (precision)." 
}, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-38", "text": "Second, with the percentage of noun phrases in the data that were found by the classifier (recall)." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-39", "text": "And third, 1This (Ramshaw and Marcus, 1995) baseNP data set is available via ftp://ftp.cis.upenn.edu/pub/chunker/ 2Software for generating the data is available from http://lcg-www.uia.ac.be/conl199/npb/ with the FZ=I rate which is equal to (2*precision*recall)/(precision+recall)." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-40", "text": "The latter rate has been used as the target for optimization." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-41", "text": "----------------------------------" }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-42", "text": "**DATA REPRESENTATION**" }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-43", "text": "In our example sentence in section 2.1, noun phrases are represented by bracket structures." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-44", "text": "Both (Mufioz et al., 1999) and (Tjong Kim Sang and Veenstra, 1999) have shown how classifiers can process bracket structures." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-45", "text": "One classifier can be trained to recognize open brackets (O) while another will process close brackets (C)." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-46", "text": "Their results can be converted to baseNPs by making pairs of open and close brackets with large probability scores (Mufioz et al., 1999) or by regarding only the shortest phrases between open and close brackets as baseNPs (Tjong Kim Sang and Veenstra, 1999) ." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-47", "text": "We have used the bracket representation (O+C) in combination with the second baseNP construction method." 
}, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-48", "text": "An alternative representation for baseNPs has been put forward by (Ramshaw and Marcus, 1995) ." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-49", "text": "They have defined baseNP recognition as a tagging task: words can be inside a baseNP (1) or outside of baseNPs (O)." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-50", "text": "In the case that one baseNP immediately follows another baseNP, the first word in the second baseNP receives tag B. Example:" }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-51", "text": "Ino earlyi tradingr ino Hongl Kongz MondayB ,o goldz waso quotedo ato $r 366.50z anB ounce/ -o This set of three tags is sufficient for encoding baseNP structures since these structures are nonrecursive and nonoverlapping." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-52", "text": "(Tjong Kim Sang and Veenstra, 1999) have presented three variants of this tagging representation." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-53", "text": "First, the B tag can be used for the first word of every noun phrase (IOB2 representation)." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-54", "text": "Second, instead of the B tag an E tag can be used to mark the last word of a baseNP immediately before another baseNP (IOE1)." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-55", "text": "And third, the E tag can be used for every noun phrase final word (IOE2)." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-56", "text": "They have used the (Ramshaw and Marcus, 1995) representation as well (IOB1)." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-57", "text": "We will use these four tagging representations as well as the O+C representation." 
}, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-58", "text": "----------------------------------" }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-59", "text": "**MACHINE LEARNING ALGORITHMS**" }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-60", "text": "We have used the memory-based learning algorithm IBI-IG which is part of TiMBL package (Daelemans et al., 1999b) ." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-61", "text": "In memory-based learning the training data is stored and a new item is classified by the most frequent classification among training items which are closest to this new item." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-62", "text": "Data items are represented as sets of feature-value pairs." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-63", "text": "In IBI-IG each feature receives a weight which is based on the amount of information which it provides for computing the classification of the items in the training data." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-64", "text": "These feature weights are used for computing the distance between a pair of data items (Daelemans et al., 1999b) ." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-65", "text": "ml-IG has been used successfully on a large variety of natural language processing tasks." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-66", "text": "Beside IBI-IG, we have used IGTREE in the combination experiments." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-67", "text": "IGTREE is a decision tree variant of II31-IG (Daelemans et al., 1999b) ." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-68", "text": "It uses the same feature weight method as IBI-IG." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-69", "text": "Data items are stored in a tree with the most important features close to the root node." 
}, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-70", "text": "A new item is classified by traveling down from the root node until a leaf node is reached or no branch is available for the current feature value." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-71", "text": "The most frequent classification of the current node will be chosen." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-72", "text": "----------------------------------" }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-73", "text": "**COMBINATION TECHNIQUES**" }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-74", "text": "Our experiments will result in different classifications of the data and we need to find out how to combine these." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-75", "text": "For this purpose we have evaluated different voting mechanisms, effectively the voting methods as described in (Van Halteren et al., 1998) ." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-76", "text": "All combination methods assign some weight to the results of the individual classifier." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-77", "text": "For each input token, they pick the classification score with the highest total score." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-78", "text": "For example, if five classifiers have weights 0.9, 0.4, 0.8, 0.6 and 0.6 respectively and they classify some token as npstart, null, npstart, null and null, then the combination method will pick npstart since it has a higher total score (1.7) than null (1.6)." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-79", "text": "The values of the weights are usually estimated by processing a part of the training data, the tuning data, which has been kept separate as training data for the combination process." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-80", "text": "In the first voting method, each of the five classitiers receives the same weight (majority)." 
}, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-81", "text": "The second method regards as the weight of each individual classification algorithm its accuracy on the tuning data (TotPrecision)." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-82", "text": "The third voting method computes the precision of each assigned tag per classifier and uses this value as a weight for the classifier in those cases that it chooses the tag (TagPrecision)." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-83", "text": "The fourth method uses the tag precision weights as well but it subtracts from them the recall values of the competing classifier results." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-84", "text": "Finally, the fifth method uses not only a weight for the current classification but it also computes weights for other possible classifications." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-85", "text": "The other classifications are determined by examining the tuning data and registering the correct values for every pair of classifier results (pair-wise voting)." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-86", "text": "Apart from these five voting methods we have also processed the output streams with two classifiers: IBI-IG (memory-based) and IGTREE (decision tree)." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-87", "text": "This approach is called classifier stacking." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-88", "text": "Like (Van Halteren et al., 1998) , we have used different input versions: one containing only the classifier output and another containing both classifier output and a compressed representation of the classifier input." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-89", "text": "For the latter purpose we have used the part-ofspeech tag of the current word." 
}, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-90", "text": "----------------------------------" }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-91", "text": "**RESULTS**" }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-92", "text": "Our first goal was to find out whether system combination could improve performance of baseNP recognition and, if this was the fact, to select the best combination technique." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-93", "text": "For this purpose we performed a 10-fold cross validation experiment on the baseNP training data, sections 15-18 of the WSJ part of the Penn Treebank (211727 tokens)." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-94", "text": "Like the data used by (Ramshaw and Marcus, 1995) , this data was retagged by the Brill tagger in order to obtain realistic part-of-speech (POS) tags 3." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-95", "text": "The data was segmented into baseNP parts and non-baseNP parts in a similar fashion as the data used by (Ramshaw and Marcus, 1995) ." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-96", "text": "The data was converted to the five data representations (IOB1, IOB2, IOE1, IOE2 and O+C) and" }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-97", "text": "IBI-IG was used to classify it by using 10-fold cross validation." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-98", "text": "This means that the data was divided in ten consecutive parts of about the same size after which each part was used as test data with the other nine parts as training data." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-99", "text": "The standard parameters of IBI-IG have been used except for k, the number of examined nearest neighbors, which was set to three." 
}, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-100", "text": "Each word in the data was represented by itself and its POS tag and additionally a left and right context of four word-POS tag pairs." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-101", "text": "For the first four representations, we have used a second processing stage as well." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-102", "text": "In this stage, a word was represented by itself, its POS tag, a left and right context of three word-POS tag pairs and a left and right context of two classification results of the first processing stage (see figure 1 )." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-103", "text": "The second processing stage improved the FZ=I scores with almost 0.7 on average." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-104", "text": "After this conversion step we had five O results and five C results." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-105", "text": "In the bracket representations, tokens can be classified as either being the first token of an NP (or the last in the C representation) or not." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-106", "text": "The results obtained with these representations have been measured with accuracy rates: the percentage of tokens that were classified correctly." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-107", "text": "Only about one in four tokens are at a baseNP boundary so guessing that a text does not contains baseNPs will already give us an accuracy of 75%." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-108", "text": "Therefore the accuracy rates obtained with these representations are high and the room for improvement is small (see table 1 )." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-109", "text": "However, because of the different treatment of neighboring chunks, the five classifiers disagree in about 2.5% of the classifications." 
}, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-110", "text": "It seems useful to use combination methods for finding the best classification for those ambiguous cases." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-111", "text": "section 20 Majority voting (Mufioz et al., 1999 Halteren et al., 1998) , the best voting technique did not outperform the best stacked classifier." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-112", "text": "Furthermore the performance differences between the combination methods are not significant (p>0.05)." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-113", "text": "To our surprise the five voting techniques performed the same." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-114", "text": "We assume that this has happened because the accuracies of the individual classifiers do not differ much and because the classification involves a binary choice." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-115", "text": "Since there is no significant difference between the combination methods, we can use any of them in the remaining experiments." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-116", "text": "We have chosen to use majority voting because it does not require tuning data." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-117", "text": "We have applied it to the two data sets mentioned in (Ramshaw and Marcus, 1995) ." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-118", "text": "The first data set uses WSJ sections 15-18 as training data (211727 tokens) and section 20 as test data (47377 tokens)." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-119", "text": "The second one uses sections 02-21 of the same corpus as training data (950028 tokens) and section 00 as test data (46451 tokens)." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-120", "text": "All data sets were processed in the same way as described earlier." 
}, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-121", "text": "The results of these experiments can be found in table 3." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-122", "text": "With section 20 as test set, we managed to reduce the error of the best result known to us with 6% with the error rate dropping from 7.2% to 6.74%, and for section 00 this difference was almost 18% with the FB= 1 rates." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-123", "text": "error rate dropping from 6.19% to 5.10% (see table 3 )." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-124", "text": "We have also applied majority voting to the NP data set put forward on the CoNLL-99 workshop." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-125", "text": "In this task the goal is to recognize all NPs." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-126", "text": "We have approached this as repeated baseNP recognition." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-127", "text": "A first stage detects the baseNPs." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-128", "text": "The recognized NPs are replaced by their presumed head word with a special POS tag and the result is send to a second stage which recognizes NPs with one level of embedding." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-129", "text": "The output of this stage is sent to a third stage and this stage finds NPs with two levels of embedding and so on." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-130", "text": "In the first processing stage we have used the five data representations with majority voting." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-131", "text": "This approach did not work as well for other stages." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-132", "text": "The O+C representation outperformed the other four representations by a large margin for the validation data 5." 
}, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-133", "text": "This caused the combined output of all five representations being worse than the O+C result." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-134", "text": "Therefore we have only used the O+C representation for recognizing nombaseNPs." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-135", "text": "The overall system reached an F~=I score of 83.79 and this is slightly better than the best rate reported at the 5The validation data is the test set we have used for estimating the best parameters for the CoNLL experiment: WSJ section 21." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-136", "text": "CoNLL-99 workshop (82.98 (CoNLL-99, 1999) , an error reduction of 5%)." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-137", "text": "(Abney, 1991) has proposed to approach parsing by starting with finding correlated chunks of words." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-138", "text": "The chunks can be combined to trees by a second processing stage, the attacher." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-139", "text": "----------------------------------" }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-140", "text": "**RELATED WORK**" }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-141", "text": "(Ramshaw and Marcus, 1995) have build a chunker by applying transformation-based learning to sections of the Penn Treebank." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-142", "text": "Rather than working with bracket structures, they have represented the chunking task as a tagging problem." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-143", "text": "POS-like tags were used to account for the fact that words were inside or outside chunks." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-144", "text": "They have applied their method to two segments of the Penn Treebank and these are still being used as benchmark data sets." 
}, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-145", "text": "Several groups have continued working with the Ramshaw and Marcus data sets for base noun phrases." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-146", "text": "(Argamon et al., 1998) use Memory-Based Sequence Learning for recognizing both NP chunks and VP chunks." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-147", "text": "This method records POS tag sequences which contain chunk boundaries and uses these sequences to classify the test data." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-148", "text": "Its performance is somewhat worse than that of Ramshaw and Marcus (F~=1=91.6 vs. 92.0) but it is the best result obtained without using lexical information 6." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-149", "text": "(Cardie and Pierce, 1998) store POS tag sequences that make up complete chunks and use these sequences as rules for classifying unseen data." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-150", "text": "This approach performs worse than the method of Argamon et al. (F~=1=90.9)." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-151", "text": "Three papers mention having used the memorybased learning method IBI-IG." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-152", "text": "(Veenstra, 1998) introduced cascaded chunking, a two-stage process in which the first stage classifications are used to improve the performance in a second processing stage." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-153", "text": "This approach reaches the same performance level as Argamon et al. but it requires lexical information." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-154", "text": "(Daelemans et al., 1999a ) report a good performance for baseNP recognition but they use a different data set and do not mention precision and recall rates." 
}, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-155", "text": "(Tjong Kim Sang and Veenstra, 1999) compare different data representations for this task." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-156", "text": "Their baseNP results are slightly better than those of Ramshaw and Marcus (F~=1=92.37)." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-157", "text": "(XTAG, 1998) describes a baseNP chunker built from training data by a technique called supertagging." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-158", "text": "The performance of the chunker was an improvement of the Ramshaw and Marcus results (Fz=I =92.4)." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-159", "text": "(Mufioz et al., 1999) use SNOW, a network of linear units, for recognizing baseNP phrases 6We have applied majority voting of five data representations to the Ramshaw and Marcus data set without using lexical information and the results were: accuracy O: 97.60%, accuracy C: 98.10%, precision: 92.19%, recall: 91.53% and F~=I: 91.86. and SV phrases." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-160", "text": "They compare two data representations and report that a representation with bracket structures outperforms the IOB tagging representation introduced by (Ramshaw and Marcus, 1995) ." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-161", "text": "SNoW reaches the best performance on this task (Fz=I =92.8)." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-162", "text": "There has been less work on identifying general noun phrases than on recognizing baseNPs." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-163", "text": "(Osborne, 1999) extended a definite clause grammar with rules induced by a learner that was based upon the maximum description length principle." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-164", "text": "He processed other parts of the Penn Treebank than we with an F~=I rate of about 60." 
}, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-165", "text": "Our earlier effort to process the CoNLL data set was performed in the same way as described in this paper but without using the combination method for baseNPs." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-166", "text": "We obtained an F~=I rate of 82.98 (CoNLL-99, 1999) ." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-167", "text": "----------------------------------" }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-168", "text": "**5**" }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-169", "text": "Concluding remarks" }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-170", "text": "We have put forward a method for recognizing noun phrases by combining the results of a memory-based classifier applied to different representations of the data." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-171", "text": "We have examined different combination techniques and each of them performed significantly better than the best individual classifier." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-172", "text": "We have chosen to work with majority voting because it does not require tuning data and thus enables the individual classifiers to use all the training data." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-173", "text": "This approach was applied to three standard data sets for base noun phrase recognition and arbitrary noun phrase recognition." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-174", "text": "For all data sets majority voting improved the best result for that data set known to US." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-175", "text": "Varying data representations is not the only way for generating different classifiers for combination purposes." 
}, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-176", "text": "We have also tried dividing the training data in partitions (bagging) and working with artificial training data generated by a crossover-like operator borrowed from genetic algorithm theory." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-177", "text": "With our memory-based classifier applied to this data, we have been unable to generate a combination which improved the performance of its best member." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-178", "text": "Another approach would be to use different classification algorithms and combine the results." }, { "sent_id": "e68d09937d522dc5acac9637eb2a8b-C001-179", "text": "We are working on this but we are still to overcome the practical problems which prevent us from obtaining acceptable results with the other learning algorithms." } ], "y": { "@BACK@": { "gold_contexts": [ [ "e68d09937d522dc5acac9637eb2a8b-C001-29" ], [ "e68d09937d522dc5acac9637eb2a8b-C001-33", "e68d09937d522dc5acac9637eb2a8b-C001-35" ], [ "e68d09937d522dc5acac9637eb2a8b-C001-39" ], [ "e68d09937d522dc5acac9637eb2a8b-C001-48" ], [ "e68d09937d522dc5acac9637eb2a8b-C001-52", "e68d09937d522dc5acac9637eb2a8b-C001-56" ], [ "e68d09937d522dc5acac9637eb2a8b-C001-141" ] ], "cite_sentences": [ "e68d09937d522dc5acac9637eb2a8b-C001-29", "e68d09937d522dc5acac9637eb2a8b-C001-35", "e68d09937d522dc5acac9637eb2a8b-C001-39", "e68d09937d522dc5acac9637eb2a8b-C001-48", "e68d09937d522dc5acac9637eb2a8b-C001-56", "e68d09937d522dc5acac9637eb2a8b-C001-141" ] }, "@SIM@": { "gold_contexts": [ [ "e68d09937d522dc5acac9637eb2a8b-C001-93", "e68d09937d522dc5acac9637eb2a8b-C001-94", "e68d09937d522dc5acac9637eb2a8b-C001-95" ] ], "cite_sentences": [ "e68d09937d522dc5acac9637eb2a8b-C001-94", "e68d09937d522dc5acac9637eb2a8b-C001-95" ] }, "@USE@": { "gold_contexts": [ [ "e68d09937d522dc5acac9637eb2a8b-C001-93", "e68d09937d522dc5acac9637eb2a8b-C001-94", "e68d09937d522dc5acac9637eb2a8b-C001-95" ], [ 
"e68d09937d522dc5acac9637eb2a8b-C001-117" ] ], "cite_sentences": [ "e68d09937d522dc5acac9637eb2a8b-C001-94", "e68d09937d522dc5acac9637eb2a8b-C001-95", "e68d09937d522dc5acac9637eb2a8b-C001-117" ] }, "@DIF@": { "gold_contexts": [ [ "e68d09937d522dc5acac9637eb2a8b-C001-160" ] ], "cite_sentences": [ "e68d09937d522dc5acac9637eb2a8b-C001-160" ] } } }, "ABC_26b00c6e5b499eea30e9cef0bbaf9f_14": { "x": [ { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-2", "text": "In this paper we take a state-of-the-art model for distributed word representation that explicitly factorizes the positive pointwise mutual information (PPMI) matrix using window sampling and negative sampling and address two of its shortcomings." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-3", "text": "We improve syntactic performance by using positional contexts, and solve the need to store the PPMI matrix in memory by working on aggregate data in external memory." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-4", "text": "The effectiveness of both modifications is shown using word similarity and analogy tasks." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-5", "text": "----------------------------------" }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-7", "text": "Distributed word representations have become a mainstay in natural language processing, enjoying a slew of applications (Sebastiani, 2002; Turian et al., 2010; ." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-28", "text": "The key idea is that window sampling is likely to sample related words, approximating their vectors using eq. (2), while negative sampling is likely to select unrelated words, scattering their vectors using eq. (3)." 
}, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-29", "text": "The resulting global loss, where" }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-30", "text": "Given a word w and context word c, eq. (5) is proportional to #(w) and #(c)." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-31", "text": "This is the desired behaviour for the global loss function, since the more frequent w or c are in the corpus, the more confident we can be about the corpus estimated P P M I wc ." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-32", "text": "Suppose both #(w) and #(c) are high, but P P M I wc is low." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-33", "text": "This is unequivocal evidence of negative correlation between them, and so we should put more effort into approximating their P P M I. The argument is analogous for high P P M I. If on the other hand #(w) and #(c) are low, we cannot be too confident about the corpus estimated P P M I wc , and so less effort should be spent on its approximation." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-34", "text": "----------------------------------" }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-35", "text": "**ENHANCING LEXVEC**" }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-36", "text": "----------------------------------" }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-37", "text": "**POSITIONAL CONTEXTS**" }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-38", "text": "As suggested by Levy et al. (2015) and Salle et al. (2016) , positional contexts (introduced in Levy et al. (2014) ) are a potential solution to poor performance on syntactic analogy tasks." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-39", "text": "Rather than only accounting for which context words appear around a target word, positional contexts also account for their position relative to the target word." 
}, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-40", "text": "For example, in the sentence \"the big dog barked loudly\", target word dog has contexts (the_\u22122, big_\u22121, barked_1, loudly_2)." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-41", "text": "The co-occurrence matrix, which before had dimensions |V| \u00d7 |V|, takes on dimensions |V| \u00d7 (2 * win * |V|) when using positional contexts." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-42", "text": "This can be incorporated into LexVec with two minor modifications: 1) The context embedding W\u0303 takes on dimensions (2 * win * |V|) \u00d7 d, 2) Negative sampling must now sample positional contexts rather than simple contexts." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-43", "text": "This latter point requires that the distribution from which negative samples are drawn become" }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-44", "text": "Without positional contexts, either W or W + W\u0303 can be used as embeddings." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-45", "text": "Since positional contexts make the dimensions of both matrices incompatible, W\u0303 cannot be used directly." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-46", "text": "We propose using the sum of all positional context vectors as the context vector for a word (W\u0303_pos)." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-47", "text": "----------------------------------" }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-48", "text": "**EXTERNAL MEMORY**" }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-49", "text": "As window sampling scans over the training corpus and negative sampling selects random contexts, (w, c_i) pairs are generated and the corresponding PPMI_{wc_i} cell must be accessed so that eqs. (2) and (3) can be minimized."
}, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-50", "text": "Unfortunately, this results in random access to the PPMI matrix, which requires it to be kept in main memory." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-51", "text": "Pennington et al. (2014) show that, under certain assumptions, this sparse matrix grows as O(|C|^{0.8}), which bounds the maximum corpus size that can be processed by LexVec." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-52", "text": "We propose an approximation to WSNS that works as follows: All the word-context pairs (w, c_i) generated by window sampling the corpus and by negative sampling each target word are first written to a file F." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-53", "text": "The file F is then sorted with duplicated lines collapsed, and the lines written in the format (w, c_i, +/\u2212, tot, M_{wc_i}), where +/\u2212 indicates if the pair occurred or not in the corpus, tot is the number of times the pair occurs including negative sampling, and M_{wc_i} the number of times it occurred in the corpus." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-54", "text": "F's construction requires O(|C|) external memory, and only O(1) main memory." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-55", "text": "Additionally, all M_{w*} and M_{*c_i} are kept in main memory, using O(|V|) space." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-56", "text": "This is nearly identical to the way in which GloVe builds its sparse co-occurrence matrix on disk, with the additional logic for adding and merging negatively sampled pairs." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-57", "text": "We now present two ways to proceed with training: multiple iteration or single iteration." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-58", "text": "Multiple Iteration (MI): In this variant, F is shuffled."
}, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-59", "text": "For each tuple (w, c_i, +/\u2212, tot, M_{wc_i}) in F, eq. (2) is minimized tot times, using M_{w*}, M_{*c_i}, and M_{wc_i} to calculate PPMI_{wc_i} if the marker is +, else PPMI_{wc_i} is equal to zero." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-60", "text": "Single Iteration (SI): For every tuple (w, c_i, +/\u2212, tot, M_{wc_i}) in F, write the tuple (w, c_i, +/\u2212, 1, M_{wc_i}) tot times to a new file F\u2032." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-61", "text": "Then shuffle F\u2032 and execute the MI algorithm described above on F\u2032." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-62", "text": "Both MI and SI are minimizing the exact same global loss function given by eq. (5) as LexVec without external memory, the only difference between the three being the order in which word-context pairs are processed." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-63", "text": "----------------------------------" }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-64", "text": "**MATERIALS**" }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-65", "text": "We report results from Salle et al. (2016) and use the same training corpus and parameters to train LexVec with positional contexts and external memory." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-66", "text": "The corpus is a Wikipedia dump from June 2015, tokenized, lowercased, and split into sentences, removing punctuation and converting numbers to words, for a final vocabulary of 302,203 words." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-67", "text": "All generated embeddings have dimensionality equal to 300." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-68", "text": "As recommended in Levy et al. (2015) and used in Salle et al.
(2016), the PPMI matrix used in all LexVec models and in PPMI-SVD is transformed using context distribution smoothing, exponentiating context frequencies to the power 0.75." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-69", "text": "PPMI-SVD is the singular value decomposition of the PPMI matrix." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-70", "text": "LexVec and PPMI-SVD use symmetric windows of size 2." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-71", "text": "Both GloVe (Pennington et al., 2014) and Skip-gram with negative sampling (SGNS) (Mikolov et al., 2013b) were trained using a symmetric window of size 10." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-72", "text": "GloVe was run for 50 iterations, using parameters x_max = 100, \u03b2 = 3/4, and learning rate of 0.05." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-73", "text": "LexVec and SGNS were run for 5 iterations, using 5 negative samples, and initial learning rate of 0.025." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-74", "text": "LexVec, PPMI-SVD, and SGNS use dirty subsampling (Mikolov et al., 2013b; Levy et al., 2015) with threshold t = 10^{\u22125}." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-75", "text": "Words in the training corpus with unigram probability f greater than t are discarded with probability 1 \u2212 t/f." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-76", "text": "For LexVec, we report results for W and (W + W\u0303_pos) embeddings when using positional contexts, otherwise W and (W + W\u0303)." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-77", "text": "For PPMI-SVD and GloVe we report (W + W\u0303), and for SGNS, W, which correspond to their best results." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-78", "text": "The goal of our evaluation is to determine whether: 1) Positional contexts improve syntactic performance 2) The use of external memory is a good approximation of WSNS."
}, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-79", "text": "Therefore, we perform the exact same evaluation as Salle et al. (2016), namely the WS-353 Similarity (WSim) and Relatedness (WRel) (Finkelstein et al., 2001), MEN (Bruni et al., 2012), MTurk (Radinsky et al., 2011), RW (Luong et al., 2013), SimLex-999 (Hill et al., 2015), MC (Miller and Charles, 1991), RG (Rubenstein and Goodenough, 1965), and SCWS (Huang et al., 2012) word similarity tasks, and the Google semantic (GSem) and syntactic (GSyn) analogy (Mikolov et al., 2013a) and MSR syntactic analogy dataset (Mikolov et al., 2013c) tasks." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-80", "text": "Word similarity tasks use cosine similarity." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-81", "text": "Word analogy tasks are solved using both 3CosAdd and 3CosMul (Levy et al., 2014)." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-82", "text": "----------------------------------" }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-83", "text": "**RESULTS**" }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-84", "text": "Positional contexts improved performance in both similarity (table 1) and analogy tasks (table 2)." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-85", "text": "As hypothesized, their use significantly improved LexVec's performance on syntactic analogies, leading to the highest score on GSyn, surpassing GloVe and SGNS." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-86", "text": "This confirms the relevance of using positional contexts to capture syntactic dependencies." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-87", "text": "Salle et al. (2016) reported that combining word and context embeddings scored marginally higher in word similarity tasks, and that holds true in our experiments, even for W\u0303_pos."
}, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-88", "text": "In the analogy tasks, using only the word embedding leads to far better syntactic performance, indicating that the embedding W strikes a better balance between syntax and semantics than does W + W\u0303." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-89", "text": "The SI external memory implementation very closely approximates the standard variant (without the use of external memory), which was expected given that they minimize the exact same loss function." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-90", "text": "The gap between MI and standard was much wider." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-91", "text": "It seems that there is value in the way WSNS uses corpus ordering of word-context pairs to train the model." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-92", "text": "The SI variant more closely mimics this order, distributing occurrences of the same pair over the entire training." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-93", "text": "MI, on the other hand, has a completely artificial ordering, distant from both the corpus ordering and SI's ordering." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-94", "text": "----------------------------------" }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-95", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-96", "text": "This paper presented two improvements to the word embedding model LexVec." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-97", "text": "The first yields state-of-the-art performance on syntactic analogies through the use of positional contexts." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-98", "text": "The second solves the need to store the PPMI matrix in main memory by using external memory."
}, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-99", "text": "The SI variant of the external memory implementation was a good approximation of standard LexVec's WSNS, enabling future training using web-scale corpora." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-100", "text": "In future work, we plan to explore the model's hyperparameter space, which could potentially boost model performance, having so far restricted ourselves to parameters recommended in Levy et al. (2015) ." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-101", "text": "Finally, following Tsvetkov et al. (2015) , we will pursue evaluation of the model on downstream tasks in addition to the intrinsic evaluations used in this paper." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-8", "text": "Though Baroni et al. (2014) suggested that predictive models which use neural networks to generate the distributed word representations (also known as embeddings in this context) outperform counting models which work on co-occurrence matrices, recent work shows evidence to the contrary (Levy et al., 2014; Salle et al., 2016) ." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-9", "text": "In this paper, we focus on improving a state-ofthe-art counting model, LexVec (Salle et al., 2016) , which performs factorization of the positive pointwise mutual information (PPMI) matrix using window sampling and negative sampling (WSNS)." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-10", "text": "Salle et al. (2016) suggest that LexVec matches and often outperforms competing models in word similarity and semantic analogy tasks." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-11", "text": "Here we show that using positional contexts to approximate syntactic dependencies yields state-of-the-art performance on syntactic analogy tasks as well." 
}, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-12", "text": "We also show how it is possible to approximate WSNS using aggregate data, eliminating random access to the PPMI matrix, enabling the use of external memory." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-13", "text": "Though not undertaken in this paper, this modification effectively allows LexVec to be trained on web-scale corpora." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-14", "text": "This paper is organized as follows: we review the LexVec model (\u00a72) and detail how positional contexts and external memory can be incorporated into the model (\u00a73)." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-15", "text": "We describe evaluation methods (\u00a74) and discuss results in terms of related work (\u00a75) and finish with conclusions and future work." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-16", "text": "Source code for the enhanced model is available at https://github.com/alexandres/lexvec." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-17", "text": "----------------------------------" }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-18", "text": "**LEXVEC**" }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-19", "text": "LexVec uses WSNS to factorize the PPMI matrix into two lower rank matrices." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-20", "text": "The co-occurrence matrix M is calculated from word-context pairs (w, c) obtained by sliding a symmetric window of size win over the training corpus C. The PPMI matrix is then calculated as follows" }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-21", "text": "where * represents index summation."
}, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-22", "text": "The word and context embeddings W and W\u0303, with dimensions |V| \u00d7 d (where V is the vocabulary and d the embedding dimension), are obtained by the minimization via stochastic gradient descent (SGD) of a combination of the loss functions" }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-23", "text": "using WSNS." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-24", "text": "P_n is the distribution used for drawing negative samples, chosen to be" }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-25", "text": "with \u03b1 = 3/4 (Mikolov et al., 2013b; Salle et al., 2016), and #(w) the unigram frequency of w. Two methods were defined for the minimization of eqs. (2) and (3): Mini-batch and Stochastic (Salle et al., 2016)." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-26", "text": "Since the latter is more computationally efficient and yields equivalent results, we adopt it in this paper." }, { "sent_id": "26b00c6e5b499eea30e9cef0bbaf9f-C001-27", "text": "The Stochastic method extends the context window with noise words drawn using negative sampling according to eq. (4)."
} ], "y": { "@MOT@": { "gold_contexts": [ [ "26b00c6e5b499eea30e9cef0bbaf9f-C001-8" ], [ "26b00c6e5b499eea30e9cef0bbaf9f-C001-38" ] ], "cite_sentences": [ "26b00c6e5b499eea30e9cef0bbaf9f-C001-8", "26b00c6e5b499eea30e9cef0bbaf9f-C001-38" ] }, "@USE@": { "gold_contexts": [ [ "26b00c6e5b499eea30e9cef0bbaf9f-C001-9" ], [ "26b00c6e5b499eea30e9cef0bbaf9f-C001-24", "26b00c6e5b499eea30e9cef0bbaf9f-C001-25", "26b00c6e5b499eea30e9cef0bbaf9f-C001-26" ], [ "26b00c6e5b499eea30e9cef0bbaf9f-C001-65" ], [ "26b00c6e5b499eea30e9cef0bbaf9f-C001-68" ], [ "26b00c6e5b499eea30e9cef0bbaf9f-C001-79" ] ], "cite_sentences": [ "26b00c6e5b499eea30e9cef0bbaf9f-C001-9", "26b00c6e5b499eea30e9cef0bbaf9f-C001-25", "26b00c6e5b499eea30e9cef0bbaf9f-C001-65", "26b00c6e5b499eea30e9cef0bbaf9f-C001-68", "26b00c6e5b499eea30e9cef0bbaf9f-C001-79" ] }, "@BACK@": { "gold_contexts": [ [ "26b00c6e5b499eea30e9cef0bbaf9f-C001-38" ] ], "cite_sentences": [ "26b00c6e5b499eea30e9cef0bbaf9f-C001-38" ] } } }, "ABC_9e8af6ca401cd74adc9a4137ae05ec_14": { "x": [ { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-17", "text": "A word list is a list of words." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-20", "text": "----------------------------------" }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-18", "text": "A misspelling, marked with * , is a sequence of characters not found in the word list." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-19", "text": "A candidate correction is a word from the word list proposed as a potential correction." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-2", "text": "We propose a method for modeling pronunciation variation in the context of spell checking for non-native writers of English." 
}, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-3", "text": "Spell checkers, typically developed for native speakers, fail to address many of the types of spelling errors peculiar to non-native speakers, especially those errors influenced by differences in phonology." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-4", "text": "Our model of pronunciation variation is used to extend a pronouncing dictionary for use in the spelling correction algorithm developed by Toutanova and Moore (2002), which includes models for both orthography and pronunciation." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-5", "text": "The pronunciation variation modeling is shown to improve performance for misspellings produced by Japanese writers of English." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-6", "text": "----------------------------------" }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-8", "text": "Spell checkers identify misspellings, select appropriate words as suggested corrections, and rank the suggested corrections so that the likely intended word is high in the list." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-9", "text": "Since traditional spell checkers have been developed with competent native speakers as the target users, they do not appropriately address many types of errors made by non-native writers and they often fail to suggest the appropriate corrections." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-10", "text": "Non-native writers of English struggle with many of the same idiosyncrasies of English spelling that cause difficulty for native speakers, but differences between English phonology and the phonology of their native language lead to types of spelling errors not anticipated by traditional spell checkers (Okada, 2004; Mitton and Okada, 2007)."
}, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-11", "text": "Okada (2004) and Mitton and Okada (2007) investigate spelling errors made by Japanese writers of English as a foreign language (JWEFL)." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-12", "text": "Okada (2004) identifies two main sources of errors for JWEFL: differences between English and Japanese phonology and differences between the English alphabet and the Japanese romazi writing system, which uses a subset of English letters." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-13", "text": "Phonological differences result in a number of distinctions in English that are not present in Japanese, and romazi causes difficulties for JWEFL because the Latin letters correspond to very different sounds in Japanese." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-14", "text": "We propose a method for creating a model of pronunciation variation from a phonetically untranscribed corpus of read speech recorded by non-native speakers." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-15", "text": "The pronunciation variation model is used to generate multiple pronunciations for each canonical pronunciation in a pronouncing dictionary and these variations are used in the spelling correction approach developed by Toutanova and Moore (2002), which uses statistical models of spelling errors that consider both orthography and pronunciation." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-16", "text": "Several conventions are used throughout this paper: a word is a sequence of characters from the given alphabet found in the word list."
}, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-21", "text": "**BACKGROUND**" }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-22", "text": "Research in spell checking (see Kukich, 1992, for a survey of spell checking research) has focused on three main problems: non-word error detection, isolated-word error correction, and context-dependent word correction." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-23", "text": "We focus on the first two tasks." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-24", "text": "A non-word is a sequence of letters that is not a possible word in the language in any context, e.g., English *thier." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-25", "text": "Once a sequence of letters has been determined to be a non-word, isolated-word error correction is the process of determining the appropriate word to substitute for the non-word." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-26", "text": "Given a sequence of letters, there are thus two main subtasks: 1) determine whether this is a non-word, 2) if so, select and rank candidate words as potential corrections to present to the writer." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-27", "text": "The first subtask can be accomplished by searching for the sequence of letters in a word list." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-28", "text": "The second subtask can be stated as follows (Brill and Moore, 2000): Given an alphabet \u03a3, a word list D of strings \u2208 \u03a3*, and a string r \u2208 \u03a3* with r \u2209 D, find w \u2208 D such that w is the most likely correction." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-29", "text": "Minimum edit distance is used to select the most likely candidate corrections."
}, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-30", "text": "The general idea is that a minimum number of edit operations such as insertion and substitution are needed to convert the misspelling into a word." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-31", "text": "Words requiring the smallest numbers of edit operations are selected as the candidate corrections." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-32", "text": "----------------------------------" }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-33", "text": "**EDIT OPERATIONS AND EDIT WEIGHTS**" }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-34", "text": "In recent spelling correction approaches, edit operations have been extended beyond single character edits and the methods for calculating edit operation weights have become more sophisticated." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-35", "text": "The spelling error model proposed by Brill and Moore (2000) allows generic string edit operations up to a certain length." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-36", "text": "Each edit operation also has an associated probability that improves the ranking of candidate corrections by modeling how likely particular edits are." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-37", "text": "Brill and Moore (2000) estimate the probability of each edit from a corpus of spelling errors." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-38", "text": "Toutanova and Moore (2002) extend Brill and Moore (2000) to consider edits over both letter sequences and sequences of phones in the pronunciations of the word and misspelling." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-39", "text": "They show that including pronunciation information improves performance as compared to Brill and Moore (2000) ." 
}, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-40", "text": "----------------------------------" }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-41", "text": "**NOISY CHANNEL SPELLING CORRECTION**" }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-42", "text": "The spelling correction models from Brill and Moore (2000) and Toutanova and Moore (2002) use the noisy channel model approach to determine the types and weights of edit operations." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-84", "text": "**RESOURCES AND DATA PREPARATION**" }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-43", "text": "The idea behind this approach is that a writer starts out with the intended word w in mind, but as it is being written the word passes through a noisy channel resulting in the observed non-word r. In order to determine how likely a candidate correction is, the spelling correction model determines the probability that the word w was the intended word given the misspelling r: P(w|r)." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-44", "text": "To find the best correction, the word w is found for which P(w|r) is maximized: argmax_w P(w|r)." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-45", "text": "Applying Bayes' Rule and discarding the normalizing constant P(r) gives the correction model:" }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-46", "text": "argmax_w P(w|r) = argmax_w P(w)P(r|w). P(w), how probable the word w is overall, and P(r|w), how probable it is for a writer intending to write w to output r, can be estimated from corpora containing misspellings." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-47", "text": "In the following experiments, P(w) is assumed to be equal for all words, to focus this work on estimating the error model P(r|w) for JWEFL."
}, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-48", "text": "Brill and Moore (2000) allow all edit operations \u03b1 \u2192 \u03b2 where \u03a3 is the alphabet and \u03b1, \u03b2 \u2208 \u03a3*, with a constraint on the length of \u03b1 and \u03b2." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-49", "text": "In order to consider all ways that a word w may generate r with the possibility that any, possibly empty, substring \u03b1 of w becomes any, possibly empty, substring \u03b2 of r, it is necessary to consider all ways that w and r may be partitioned into substrings." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-50", "text": "This error model over letters, called P_L, is approximated by Brill and Moore (2000) as shown in Figure 1 by considering only the pair of partitions of w and r with the maximum product of the probabilities of individual substitutions." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-51", "text": "Part(w) is the set of all possible partitions of w, |R| is the number of segments in a particular partition, and R_i is the i-th segment of the partition." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-52", "text": "The parameters for P_L(r|w) are estimated from a corpus of pairs of misspellings and target words." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-53", "text": "The method, which is described in detail in Brill and Moore (2000), involves aligning the letters in pairs of words and misspellings, expanding each alignment with up to N neighboring alignments, and calculating the probability of each \u03b1 \u2192 \u03b2 alignment."
}, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-54", "text": "Since we will be using a training corpus that consists solely of pairs of misspellings and words (see section 3), we would have lower probabilities for \u03b1 \u2192 \u03b1 than would be found in a corpus with misspellings observed in context with correct words." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-55", "text": "Figure 1: Approximations of P_L from Brill and Moore (2000) and P_PHL from Toutanova and Moore (2002)." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-56", "text": "To compensate, we approximate P(\u03b1 \u2192 \u03b1) by assigning it a minimum probability m:" }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-57", "text": "----------------------------------" }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-58", "text": "**EXTENDING TO PRONUNCIATION**" }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-59", "text": "Toutanova and Moore (2002) describe an extension to Brill and Moore (2000) where the same noisy channel error model is used to model phone sequences instead of letter sequences." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-60", "text": "Instead of the word w and the non-word r, the error model considers the pronunciation of the non-word r, pron_r, and the pronunciation of the word w, pron_w." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-61", "text": "The error model over phone sequences, called P_PH, is just like P_L shown in Figure 1 except that r and w are replaced with their pronunciations." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-62", "text": "The model is trained like P_L using alignments between phones."
}, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-63", "text": "Since a spelling correction model needs to rank candidate words rather than candidate pronunciations, Toutanova and Moore (2002) derive an error model that determines the probability that a word w was spelled as the non-word r based on their pronunciations." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-64", "text": "Their approximation of this model, called P_PHL, is also shown in Figure 1." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-65", "text": "P_PH(pron_w|pron_r) is the phone error model described above and P(pron_r|r) is provided by the letter-to-phone model described below." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-66", "text": "----------------------------------" }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-67", "text": "**LETTER-TO-PHONE MODEL**" }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-68", "text": "A letter-to-phone (LTP) model is needed to predict the pronunciation of misspellings for P_PHL, since they are not found in a pronouncing dictionary." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-69", "text": "Like Toutanova and Moore (2002), we use the n-gram LTP model from Fisher (1999) to predict these pronunciations." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-70", "text": "The n-gram LTP model predicts the pronunciation of each letter in a word considering up to four letters of context to the left and right." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-71", "text": "The most specific context found for each letter and its context in the training data is used to predict the pronunciation of a word." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-72", "text": "We extended the prediction step to consider the most probable phone for the top M most specific contexts."
}, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-73", "text": "We implemented the LTP algorithm and trained and evaluated it using pronunciations from CMU-DICT." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-74", "text": "A training corpus was created by pairing the words from the size 70 CMUDICT-filtered SCOWL word list (see section 3) with their pronunciations." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-75", "text": "This list of approximately 62,000 words was split into a training set with 80% of entries and a test set with the remaining 20%." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-76", "text": "We found that the best performance is seen when M = 3, giving 95.5% phone accuracy and 74.9% word accuracy." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-77", "text": "----------------------------------" }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-78", "text": "**CALCULATING FINAL SCORES**" }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-79", "text": "For a misspelling r and a candidate correction w, the letter model P L gives the probability that w was written as r due to the noisy channel taking into account only the orthography." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-80", "text": "P P H does the same for the pronunciations of r and w, giving the probability that pron w was output was pron r ." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-81", "text": "The pronunciation model P P HL relates the pronunciations modeled by P P H to the orthography in order to give the probability that r was written as w based on pronunciation." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-82", "text": "P L and P P HL are then combined as follows to calculate a score for each candidate correction." 
}, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-83", "text": "----------------------------------" }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-85", "text": "Our spelling correction approach, which includes error models for both orthography and pronunciation (see section 2.2) and which considers pronunciation variation for JWEFL requires a number of resources: 1) spoken corpora of American English (TIMIT, TIMIT 1991) and Japanese English (ERJ, see below) are used to model pronunciation variation, 2) a pronunciation dictionary (CMUDICT, CMUDICT 1998) provides American English pronunciations for the target words, 3) a corpus of spelling errors made by JWEFL (Atsuo-Henry Corpus, see below) is used to train spelling error models and test the spell checker's performance, and 4) Spell Checker Oriented Word Lists (SCOWL, see below) are adapted for our use." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-86", "text": "The English Read by Japanese Corpus (Minematsu et al., 2002) consists of 70,000 prompts containing phonemic and prosodic cues recorded by 200 native Japanese speakers with varying English competence." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-87", "text": "See Minematsu et al. (2002) for details on the construction of the corpus." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-88", "text": "The Atsuo-Henry Corpus (Okada, 2004) includes a corpus of spelling errors made by JWEFL that consists of a collection of spelling errors from multiple corpora." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-89", "text": "2 For use with our spell checker, the corpus has been cleaned up and modified to fit our task, resulting in 4,769 unique misspellings of 1,046 target words." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-90", "text": "The data is divided into training (80%), development (10%), and test (10%) sets." 
}, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-91", "text": "For our word lists, we use adapted versions of the Spell Checker Oriented Word Lists." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-92", "text": "3 The size 50 word lists are used in order to create a general purpose word list that covers all the target words from the Atsuo-Henry Corpus." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-93", "text": "Since the target pronunciation of each item is needed for the pronunciation model, the word list was filtered to remove words whose pronunciation is not in CMUDICT." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-94", "text": "After filtering, the word list contains 54,001 words." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-95", "text": "----------------------------------" }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-96", "text": "**METHOD**" }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-97", "text": "This section presents our method for modeling pronunciation variation from a phonetically untranscribed corpus of read speech." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-98", "text": "The pronunciationbased spelling correction approach developed in Toutanova and Moore (2002) requires a list of possible pronunciations in order to compare the pronunciation of the misspelling to the pronunciation of correct words." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-99", "text": "To account for target pronunciations specific to Japanese speakers, we observe the pronunciation variation in the ERJ and generate additional pronunciations for each word in the word list." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-100", "text": "Since the ERJ is not transcribed, we begin by adapting a recognizer trained on native English speech." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-101", "text": "First, the ERJ is recognized using a monophone recognizer trained on TIMIT." 
}, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-102", "text": "Next, the most frequent variations between the canonical and recognized pronunciations are used to adapt the recognizer." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-103", "text": "The adapted recognizer is then used to recognize the ERJ in forced alignment with the canonical pronunciations." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-104", "text": "Finally, the variations from the previous step are used to create models of pronunciation variation for each phone, which are used to generate multiple pronunciations for each word." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-105", "text": "----------------------------------" }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-106", "text": "**INITIAL RECOGNIZER**" }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-107", "text": "A monophone speech recognizer was trained on all TIMIT data using the Hidden Markov Model Toolkit (HTK)." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-108", "text": "4 This recognizer is used to generate a phone string for each utterance in the ERJ." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-109", "text": "Each recognized phone string is then aligned with the canonical pronunciation provided to the speakers." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-110", "text": "Correct alignments and substitutions are considered with no context and insertions are conditioned on the previous phone." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-111", "text": "Due to restrictions in HTK, deletions are currently ignored." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-112", "text": "The frequency of phone alignments for all utterances in the ERJ are calculated." 
}, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-113", "text": "Because of the low phone accuracy of monophone recognizers, especially on non-native speech, alignments are observed between nearly all pairs of phones." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-114", "text": "In order to focus on the most frequent alignments common to multiple speakers and utterances, any alignment observed less than 20% as often as the most frequent alignment for that canonical phone is discarded, which results in an average of three variants of each phone." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-115", "text": "5" }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-116", "text": "----------------------------------" }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-117", "text": "**ADAPTING THE RECOGNIZER**" }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-118", "text": "Now that we have probability distributions over observed phones, the HMMs trained on TIMIT are modified as follows to allow the observed variation." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-119", "text": "To allow, for instance, variation between p and th, the states for th from the original recognizer are inserted into the model for p as a separate path." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-120", "text": "The resulting phone model is shown in Figure 2 ." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-121", "text": "The transition probabilities into the first states" }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-122", "text": "Figure 2: Adapted phone model for p accounting for variation between p and th of the phones come from the probability distribution observed in the initial recognition step." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-123", "text": "The transition probabilities between the three states for each variant phone remain unchanged." 
}, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-124", "text": "All HMMs are adapted in this manner using the probability distributions from the initial recognition step." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-125", "text": "The adapted HMMs are used to recognize the ERJ Corpus for a second time, this time in forced alignment with the canonical pronunciations." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-126", "text": "The state transitions indicate which variant of each phone was recognized and the correspondences between the canonical phones and recognized phones are used to generate a new probability distribution over observed phones for each canonical phone." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-127", "text": "These are used to find the most probable pronunciation variations for a native-speaker pronouncing dictionary." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-128", "text": "----------------------------------" }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-129", "text": "**GENERATING PRONUNCIATIONS**" }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-130", "text": "The observed phone variation is used to generate multiple pronunciations for each pronunciation in the word list." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-131", "text": "The OpenFst Library 6 is used to find the most probable pronunciations in each case." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-132", "text": "First, FSTs are created for each phone using the probability distributions from the previous section." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-133", "text": "Next, an FST is created for the entire word by concatenating the FSTs for the pronunciation from CMU-DICT." 
}, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-134", "text": "The pronunciations corresponding to the best n paths through the FST and the original canonical pronunciation become possible pronunciations in the extended pronouncing dictionary." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-135", "text": "The size 50 word list contains 54,001 words and when expanded to contain the top five variations of each pronunciation, there are 255,827 unique pronunciations." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-136", "text": "----------------------------------" }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-137", "text": "**RESULTS**" }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-138", "text": "In order to evaluate the effect of pronunciation variation in Toutanova and Moore (2002) 's spelling correction approach, we compare the performance of the pronunciation model and the combined model with and without pronunciation variation." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-139", "text": "We implemented the letter and pronunciation spelling correction models as described in section 2.2." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-140", "text": "The letter error model P L and the phone error model P P H are trained on the training set." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-141", "text": "The development set is used to tune the parameters introduced in previous sections." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-142", "text": "7 In order to rank the words as candidate corrections for a misspelling r, P L (r|w) and P P HL (r|w) are calculated for each word in the word list using the algorithm described in Brill and Moore (2000) ." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-143", "text": "Finally, P L and P P HL are combined using S CM B to rank each word." 
}, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-144", "text": "----------------------------------" }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-145", "text": "**BASELINE**" }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-146", "text": "The open source spell checker GNU Aspell 8 is used to determine the baseline performance of a traditional spell checker using the same word list." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-147", "text": "An Aspell dictionary was created with the word list described in section 3." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-148", "text": "Aspell's performance is shown in Table 1 ." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-149", "text": "The 1-Best performance is the percentage of test items for which the target word was the first candidate correction, 2-Best is the percentage for which the target was in the top two, etc." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-150", "text": "----------------------------------" }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-151", "text": "**EVALUATION OF PRONUNCIATION VARIATION**" }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-152", "text": "The effect of introducing pronunciation variation using the method described in section 4 can be evaluated by examining the performance on the test set for P P HL with and without the additional variations." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-153", "text": "The results in Table 1 show that the addition of pronunciation variations does indeed improve the performance of P P HL across the board." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-154", "text": "The 1-Best, 3-Best, and 4-Best cases for P P HL with variation show significant improvement (p<0.05) over P P HL without variation." 
}, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-155", "text": "----------------------------------" }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-156", "text": "**EVALUATION OF THE COMBINED MODEL**" }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-157", "text": "We evaluated the effect of including pronunciation variation in the combined model by comparing the performance of the combined model with and without pronunciation variation, see results in Table 1 ." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-158", "text": "Despite the improvements seen in P P HL with pronunciation variation, there are no significant differences between the results for the combined model with and without variation." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-159", "text": "The combined model with variation is also not significantly different from the letter model P L except for the drop in the 4-Best case." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-160", "text": "To illustrate the performance of each model, the ranked lists in Table 2 give an example of the candidate corrections for the misspelling of any as * eney." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-161", "text": "Aspell preserves the initial letter of the misspelling and vowels in many of its candidates." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-162", "text": "P L 's top candidates also overlap a great deal in orthography, but there is more initial letter and vowel variation." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-163", "text": "As we would predict, P P HL ranks any as the top correction, but some of the lower-ranked candidates for P P HL differ greatly in length." 
}, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-164", "text": "----------------------------------" }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-165", "text": "**SUMMARY OF RESULTS**" }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-166", "text": "The noisy channel spelling correction approach developed by Brill and Moore (2000) and Toutanova and Moore (2002) appears well-suited for writers of English as a foreign language." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-167", "text": "The letter and combined models outperform the traditional spell checker Aspell by a wide margin." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-168", "text": "Although including pronunciation variation does not improve the combined model, it leads to significant improvements in the pronunciation-based model P P HL ." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-169", "text": "----------------------------------" }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-170", "text": "**CONCLUSION**" }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-171", "text": "We have presented a method for modeling pronunciation variation from a phonetically untranscribed corpus of read non-native speech by adapting a monophone recognizer initially trained on native speech." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-172", "text": "This model allows a native pronouncing dictionary to be extended to include non-native pronunciation variations." }, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-173", "text": "We incorporated a pronouncing dictionary extended for Japanese writers of English into the spelling correction model developed by Toutanova and Moore (2002) , which combines orthography-based and pronunciation-based models." 
}, { "sent_id": "9e8af6ca401cd74adc9a4137ae05ec-C001-174", "text": "Although the extended pronunciation dictionary does not lead to improvement in the combined model, it does leads to significant improvement in the pronunciation-based model." } ], "y": { "@EXT@": { "gold_contexts": [ [ "9e8af6ca401cd74adc9a4137ae05ec-C001-4" ] ], "cite_sentences": [ "9e8af6ca401cd74adc9a4137ae05ec-C001-4" ] }, "@MOT@": { "gold_contexts": [ [ "9e8af6ca401cd74adc9a4137ae05ec-C001-3", "9e8af6ca401cd74adc9a4137ae05ec-C001-4" ], [ "9e8af6ca401cd74adc9a4137ae05ec-C001-14", "9e8af6ca401cd74adc9a4137ae05ec-C001-15" ] ], "cite_sentences": [ "9e8af6ca401cd74adc9a4137ae05ec-C001-4", "9e8af6ca401cd74adc9a4137ae05ec-C001-15" ] }, "@BACK@": { "gold_contexts": [ [ "9e8af6ca401cd74adc9a4137ae05ec-C001-15" ], [ "9e8af6ca401cd74adc9a4137ae05ec-C001-38" ], [ "9e8af6ca401cd74adc9a4137ae05ec-C001-42" ], [ "9e8af6ca401cd74adc9a4137ae05ec-C001-59" ], [ "9e8af6ca401cd74adc9a4137ae05ec-C001-63" ], [ "9e8af6ca401cd74adc9a4137ae05ec-C001-98" ], [ "9e8af6ca401cd74adc9a4137ae05ec-C001-166" ] ], "cite_sentences": [ "9e8af6ca401cd74adc9a4137ae05ec-C001-15", "9e8af6ca401cd74adc9a4137ae05ec-C001-38", "9e8af6ca401cd74adc9a4137ae05ec-C001-42", "9e8af6ca401cd74adc9a4137ae05ec-C001-59", "9e8af6ca401cd74adc9a4137ae05ec-C001-63", "9e8af6ca401cd74adc9a4137ae05ec-C001-98", "9e8af6ca401cd74adc9a4137ae05ec-C001-166" ] }, "@USE@": { "gold_contexts": [ [ "9e8af6ca401cd74adc9a4137ae05ec-C001-69" ], [ "9e8af6ca401cd74adc9a4137ae05ec-C001-98", "9e8af6ca401cd74adc9a4137ae05ec-C001-99" ], [ "9e8af6ca401cd74adc9a4137ae05ec-C001-138" ], [ "9e8af6ca401cd74adc9a4137ae05ec-C001-173" ] ], "cite_sentences": [ "9e8af6ca401cd74adc9a4137ae05ec-C001-69", "9e8af6ca401cd74adc9a4137ae05ec-C001-98", "9e8af6ca401cd74adc9a4137ae05ec-C001-138", "9e8af6ca401cd74adc9a4137ae05ec-C001-173" ] } } }, "ABC_967c78ccf905c69d732a4c1ef00289_14": { "x": [ { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-28", "text": "The shared encoder accesses more data, 
learning a less overfitted representation." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-2", "text": "Typical relation extraction models are trained on a single corpus annotated with a pre-defined relation schema." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-3", "text": "An individual corpus is often small, and the models may often be biased or overfitted to the corpus." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-4", "text": "We hypothesize that we can learn a better representation by combining multiple relation datasets." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-5", "text": "We attempt to use a shared encoder to learn the unified feature representation and to augment it with regularization by adversarial training." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-6", "text": "The additional corpora feeding the encoder can help to learn a better feature representation layer even though the relation schemas are different." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-7", "text": "We use ACE05 and ERE datasets as our case study for experiments." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-8", "text": "The multi-task model obtains significant improvement on both datasets." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-9", "text": "----------------------------------" }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-11", "text": "Relations represent specific semantic relationships between two entities." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-12", "text": "For example, there is a Physical." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-13", "text": "Located relationship between Smith and Brazil in the sentence: Smith went to a conference in Brazil."
}, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-14", "text": "Relation extraction is a crucial task for many applications such as knowledge base population." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-15", "text": "Several relation schemas and annotated corpora have been developed such as the Automatic Content Extraction (ACE), and the Entities, Relations and Events (ERE) annotation (Song et al., 2015) ." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-16", "text": "These schemas share some similarity, but differ in details." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-17", "text": "A relation type may exist in one schema but not in another." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-18", "text": "An example might be annotated as different types in different datasets." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-19", "text": "For example, Part-whole." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-20", "text": "Geographical relations in ACE05 are annotated as Physcial." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-21", "text": "Located relations in ERE." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-22", "text": "Most of these corpora are relatively small." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-23", "text": "Models trained on a single corpus may be biased or overfitted towards the corpus." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-24", "text": "Despite the difference in relation schemas, we hypothesize that we can learn a more general representation with a unified encoder." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-25", "text": "Such a representation could have better out-of-domain or lowresource performance." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-26", "text": "We develop a multi-task model to learn a representation of relations in a shared relation encoder." 
}, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-27", "text": "We use separate decoders to allow different relation schemas." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-29", "text": "We then regularize the representation with adversarial training in order to further enforce the sharing between different datasets." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-30", "text": "In our experiments, we take ACE05 1 and ERE 2 datasets as a case study." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-31", "text": "Experimental results show that the model achieves higher performance on both datasets." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-32", "text": "----------------------------------" }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-33", "text": "**RELATED WORK**" }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-34", "text": "Relation extraction is typically reduced to a classification problem." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-35", "text": "A supervised machine learning model is designed and trained on a single dataset to predict the relation type of pairs of entities." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-36", "text": "Traditional methods rely on linguistic or semantic features (Zhou et al., 2005; Jing and Zhai, 2007) , or kernels based on syntax or sequences (Bunescu and Mooney, 2005a,b; Plank and Moschitti, 2013) to represent sentences of relations." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-37", "text": "More recently, deep neural nets start to show promising results." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-38", "text": "Most rely on convolutional neural nets (Zeng et al., 2014 (Zeng et al., , 2015 Grishman, 2015, 2016; Fu et al., 2017) or recurrent neural nets (Zhang et al., 2015; Zhou et al., 2016; Miwa and Bansal, 2016) to learn the representation of relations." 
}, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-39", "text": "Our supervised base model will be similar to (Zhou et al., 2016) ." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-40", "text": "Our initial experiments did not use syntactic features (Nguyen and Grishman, 2016; Fu et al., 2017 ) that require additional parsers." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-41", "text": "In order to further improve the representation learning for relation extraction, tried to transfer knowledge through bilingual representation." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-42", "text": "They used their multi-task model to train on the bilingual ACE05 datasets and obtained improvement when there is less training available (10%-50%)." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-43", "text": "Our experiments will show our multitask model can make significant improvement on the full training set." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-44", "text": "In terms of the regularization to the representation, Duong et al. (2015) used l2 regularization between the parameters of the same part of two models in multi-task learning." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-45", "text": "Their method is a kind of soft-parameter sharing, which does not involve sharing any part of the model directly." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-46", "text": "Fu et al. (2017) applied domain adversarial networks (Ganin and Lempitsky, 2015) to relation extraction and obtained improvement on out-of-domain evaluation." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-47", "text": "Inspired by the adversarial training, we attempt to use it as a regularization tool in our multi-task model and find some improvement." 
}, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-48", "text": "----------------------------------" }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-49", "text": "**SUPERVISED NEURAL RELATION EXTRACTION MODEL**" }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-50", "text": "The supervised neural model on a single dataset was introduced by Zeng et al. (2014) and followed by many others (Nguyen and Grishman, 2015; Zhou et al., 2016; Miwa and Bansal, 2016; Nguyen and Grishman, 2016; Fu et al., 2017) ." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-51", "text": "We use a similar model as our base model." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-52", "text": "It takes word tokens, position of arguments and their entity types as input." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-53", "text": "Some work (Nguyen and Grishman, 2016; Fu et al., 2017) used extra syntax features as input." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-54", "text": "However, the parsers that produce syntax features could have errors and vary depending on the domain of text." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-55", "text": "The syntax features learned could also be too specific for a single dataset." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-56", "text": "Thus, we focus on learning representation from scratch, but also compare the models with extra features later in the experiments." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-57", "text": "The encoder is a bidirectional RNN with attention and the decoder is one hidden fully connected layer followed by a softmax output layer." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-58", "text": "In the input layer, we convert word tokens into word embeddings with pretrained word2vec (Mikolov et al., 2013) ." 
}, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-59", "text": "For each token, we convert the distance to the two arguments of the example to two position embeddings." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-60", "text": "We also convert the entity types of the arguments to entity embeddings." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-61", "text": "The setup of word embedding and position embedding was introduced by Zeng et al. (2014) ." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-62", "text": "The entity embedding (Nguyen and Grishman, 2016; Fu et al., 2017 ) is included for arguments that are entities rather than common nouns." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-63", "text": "At the end, each token is converted to an embedding w i as the concatenation of these three types of embeddings, where i \u2208 [0, T ), T is the length of the sentence." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-64", "text": "A wide range of encoders have been proposed for relation extraction." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-65", "text": "Most of them fall into categories of CNN (Zeng et al., 2014) , RNN (Zhou et al., 2016) and TreeRNN (Miwa and Bansal, 2016) ." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-66", "text": "In this work, we follow Zhou et al. (2016) to use Bidirectional RNN with attention (BiRNN), which works well on both of the datasets we are going to evaluate on." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-67", "text": "BiRNN reads embeddings of the words from both directions in the sentence." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-68", "text": "It summarizes the contextual information at each state." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-69", "text": "The attention mechanism aggregates all the states of the sentence by paying more attention to informative words." 
}, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-70", "text": "Given input w i from the input layer, the encoder is defined as the following:" }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-71", "text": "We use GRU (Cho et al., 2014) as the RNN cell." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-72", "text": "W v and b v are the weights for the projection v i ." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-73", "text": "v w is the word context vector, which works as a query of selecting important words." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-74", "text": "The importance of the word is computed as the similarity between v i and v w ." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-75", "text": "The importance weight is then normalized through a softmax function." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-76", "text": "Then we obtain the high level summarization \u03c6(x) for the relation example." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-77", "text": "The decoder uses this high level representation as features for relation classification." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-78", "text": "It usually contains one hidden layer (Zeng et al., 2014; Nguyen and Grishman, 2016; Fu et al., 2017 ) and a softmax output layer." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-79", "text": "We use the same structure which can be formalized as the following:" }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-80", "text": "where W h and b h are the weights for the hidden" }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-81", "text": "----------------------------------" }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-82", "text": "**LEARNING UNIFIED REPRESENTATION**" }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-83", "text": "While the data for one relation task may be small, noisy and biased, we can learn a better representation combining multiple relation tasks." 
}, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-84", "text": "We attempt to use multi-task learning to learn a unified representation across different relation tasks." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-85", "text": "The method is simple and straightforward." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-86", "text": "We use the same encoder to learn the unified feature representation for both relation tasks, and then we train classifiers for each task on top of this representation." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-87", "text": "We then apply regularization on this representation by adversarial training." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-88", "text": "----------------------------------" }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-89", "text": "**MULTI-TASK LEARNING**" }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-90", "text": "Given example x 1 from relation schema 1 and x 2 from relation schema 2, we use the same encoder to obtain representation \u03c6(x 1 ) and \u03c6(x 2 ) respectively." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-91", "text": "Then we build separate decoders for them using the same structure (7) (8). To train them at the same time, we put examples from both tasks in the same batch." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-92", "text": "The ratio of the examples are controlled so that the the model reads both datasets once every epoch." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-93", "text": "We use linear interpolation to combine the loss from them." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-94", "text": "where \u03bb is used to control the attention to each task." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-95", "text": "The model may learn the two tasks at different speed." 
}, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-96", "text": "During optimization, one task can be seen as the main task, while the other can be seen as the auxiliary task." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-97", "text": "The benefit of joint learning to the main task may vary depending on how much attention the model pays to the auxiliary task." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-98", "text": "----------------------------------" }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-99", "text": "**REGULARIZATION BY ADVERSARIAL TRAINING**" }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-100", "text": "Being optimized simultaneously by different decoders, the model could still learn very different representation for similar examples coming from different tasks." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-101", "text": "We want to prevent this and to further push the model to learn similar representation for similar examples even if they come from different tasks." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-102", "text": "We attempt to regularize the representation using adversarial training between the two tasks." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-103", "text": "Given the representation \u03c6(x 1 ) and \u03c6(x 2 ) learned from the two tasks, we build a classifier to predict which task the examples come from (11)." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-104", "text": "We add a gradient reversal layer (Ganin and Lempitsky, 2015) at the input of this classifier (10) to implement the adversarial training." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-105", "text": "While the classifier learns to distinguish the sources of the input representation, the input representation is learned in the opposite direction to confuse the classifier thanks to GRL." 
}, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-106", "text": "Thus, the input representation (\u03c6(x 1 ) and \u03c6(x 2 )) will be pushed to be close to each other." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-107", "text": "The gradient reversal layer (GRL) is defined as the identity function for forward propagation (12) and reversed gradient for back propagation (13)." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-108", "text": "We also use the cross-entropy loss for this adversarial training, and combine the loss L adv with the two relation tasks." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-109", "text": "where we can use \u03b2 to control how close the representations are between the two relation tasks." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-110", "text": "----------------------------------" }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-111", "text": "**EXPERIMENTS**" }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-112", "text": "----------------------------------" }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-113", "text": "**DATASETS**" }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-114", "text": "To apply the multi-task learning, we need at least two datasets." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-115", "text": "We pick ACE05 and ERE for our case study." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-116", "text": "The ACE05 dataset provides a cross-domain evaluation setting ." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-117", "text": "It contains 6 domains: broadcast conversation (bc), broadcast news (bn), telephone conversation (cts), newswire (nw), usenet (un) and weblogs (wl)." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-118", "text": "Previous work (Gormley et al., 2015; Nguyen and Grishman, 2016; Fu et al., 2017) set, and the other half of bc, cts and wl as the test sets." 
}, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-119", "text": "We followed their split of documents and their split of the relation types for asymmetric relations." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-120", "text": "The ERE dataset has a similar relation schema to ACE05, but is different in some annotation guidelines (Aguilar et al., 2014) ." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-121", "text": "It also has more data than ACE05, which we expect to be helpful in the multi-task learning." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-122", "text": "It contains documents from newswire and discussion forums." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-123", "text": "We did not find an existing split of this dataset, so we randomly split the documents into train (80%), dev (10%) and test (10%)." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-124", "text": "----------------------------------" }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-125", "text": "**MODEL CONFIGURATIONS**" }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-126", "text": "We use word embedding pre-trained on newswire with 300 dimensions from word2vec (Mikolov et al., 2013) ." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-127", "text": "We fix the word embeddings during the training." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-128", "text": "We follow Nguyen and Grishman (2016) to set the position and entity type embedding size to be 50." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-129", "text": "We use 150 dimensions for the GRU state, 100 dimensions for the word context vector and use 300 dimensions for the hidden layer in the decoders." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-130", "text": "We train the model using Adam (Kingma and Ba, 2014) optimizer with learning rate 0.001." 
}, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-131", "text": "We tune \u03bb linearly from 0 to 1, and \u03b2 logarithmically from 5 \u00b7 10 \u22121 to 10 \u22124 For all scores, we run experiments 10 times and take the average." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-132", "text": "----------------------------------" }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-133", "text": "**AUGMENTATION BETWEEN ACE05 AND ERE**" }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-134", "text": "Training separately on the two corpora (row \"Supervised\" in Table 1 ), we obtain results on ACE05 comparable to previous work (Gormley et al., 2015) with substantially fewer features." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-135", "text": "With syntactic features as (Nguyen and Grishman, 2016; Fu et al., 2017) did, it could be further improved." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-136", "text": "In this paper, however, we want to focus on representation learning from scratch first." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-137", "text": "Our experiments focus on whether we can improve the representation with more sources of data." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-138", "text": "A common way to do so is pre-training." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-139", "text": "As a baseline, we pre-train the encoder of the supervised model on ERE and then fine-tune on ACE05, and vice versa (row \"Pretraining\" in Table 1 )." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-140", "text": "We observe improvement on both fine-tuned datasets." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-141", "text": "This shows the similarity between the encoders of the two datasets." 
}, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-142", "text": "However, if we fix the encoder trained from one dataset, and only train the decoder on the other dataset, we will actually obtain a much worse model." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-143", "text": "This indicates that neither dataset contains enough data to learn a universal feature representation layer for classification." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-144", "text": "This leaves the possibility to further improve the representation by learning a better encoder." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-145", "text": "We then attempt to learn a multi-task model using a shared encoder." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-146", "text": "We use 14K words as the vocabulary from ACE05 and 20K from ERE." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-147", "text": "There are about 8K words shared by the two datasets (same for both pretrained and multi-task models)." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-148", "text": "By multi-task learning, we expect the model to conceive the embeddings for words better and construct more general representation." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-149", "text": "Experiments determined that the multi-task learning works best at \u03bb = 0.8 for both ACE05 and ERE datasets (Table 1)." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-150", "text": "It obtains improvement on both the out-ofdomain evaluation on ACE and in-domain evaluation on ERE." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-151", "text": "It works especially well on weblogs (wl) and telephone conversation (cts) domains on ACE, which possibly benefits from the discussion forum data from ERE." 
}, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-152", "text": "On the other hand, we use the adversarial training between the two datasets to further enforce the representation to be close to each other." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-153", "text": "There is strong dependency between the schemas of these two datasets." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-154", "text": "Two examples from different datasets could have the same semantics in terms of relation type." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-155", "text": "We try to force the representation of these examples to be similar." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-156", "text": "With appropriate amount of this regularization (\u03b2 = 0.001), the model can be further improved ( multi-task model can already balance representation with enough labels on both sides." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-157", "text": "We also artificially remove half of the training data of each dataset to see the performance in a relatively lowresource setting (row \"Training Data\" Table 1) ." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-158", "text": "We observe larger improvement with both multitask learning and regularization." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-159", "text": "Because of the decrease of the training data, the best \u03bb is 0.9 for ACE05 and 0.7 for ERE." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-160", "text": "We also use slightly stronger regularization (\u03b2 = 0.01)." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-161", "text": "----------------------------------" }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-162", "text": "**MORE FEATURES ON ACE05**" }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-163", "text": "Since ACE05 has been studied for a long time, numerous features have been found to be effective on this dataset." 
}, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-164", "text": "(Nguyen and Grishman, 2016) incorporated some of those features into the neural net and beat the state-of-art on the dataset." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-165", "text": "Although representation learning from scratch could be more general across multiple datasets, we compare the effect of multi-task learning with extra features on this specific dataset." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-166", "text": "We add chunk embedding and on dep path embedding (Nguyen and Grishman, 2016) ." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-167", "text": "Similar to entity type embedding, chunk embedding is created according to each token's chunk type, we set the embedding size to 50." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-168", "text": "On dep path embedding is a vector indicating whether the token is on the dependency path between the two entities." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-169", "text": "In the multi-task model, the shared encoder is a bidirectional RNN (BiRNN) without attention (Equation (1-3) )." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-170", "text": "These two embeddings will be concatenated to the output of the BiRNN to obtain the new h i and then passed to Equation (4)." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-171", "text": "As the results (Table 2) , our supervised baseline is slightly worse than the previous state-of-the-art neural model with extra features, but the multitask learning can consistently help." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-172", "text": "The improvement is more obvious with 50% training data." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-173", "text": "It is also worth to note that with 50% training data, the extra features improve the supervised base model, but not the multi-task learning model." 
}, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-174", "text": "It shows the effectiveness of the multi-task model when there is less training data." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-175", "text": "----------------------------------" }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-176", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-177", "text": "We attempt to learn unified representation for relations by multi-task learning between ACE05 and ERE datasets." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-178", "text": "We use a shared encoder to learn the unified feature representation and then apply regularization by adversarial training." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-179", "text": "The improvement on both datasets shows the promising future of learning representation for relations in this unified way." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-180", "text": "This will require less training data for new relation schemas." }, { "sent_id": "967c78ccf905c69d732a4c1ef00289-C001-181", "text": "It will be interesting future work to further explore the multi-task learning between different datasets, especially to capture the dependency between different schemas in the decoder." 
} ], "y": { "@DIF@": { "gold_contexts": [ [ "967c78ccf905c69d732a4c1ef00289-C001-40" ] ], "cite_sentences": [ "967c78ccf905c69d732a4c1ef00289-C001-40" ] }, "@BACK@": { "gold_contexts": [ [ "967c78ccf905c69d732a4c1ef00289-C001-50" ], [ "967c78ccf905c69d732a4c1ef00289-C001-53" ], [ "967c78ccf905c69d732a4c1ef00289-C001-78" ], [ "967c78ccf905c69d732a4c1ef00289-C001-118" ] ], "cite_sentences": [ "967c78ccf905c69d732a4c1ef00289-C001-50", "967c78ccf905c69d732a4c1ef00289-C001-53", "967c78ccf905c69d732a4c1ef00289-C001-78", "967c78ccf905c69d732a4c1ef00289-C001-118" ] }, "@MOT@": { "gold_contexts": [ [ "967c78ccf905c69d732a4c1ef00289-C001-53", "967c78ccf905c69d732a4c1ef00289-C001-54" ], [ "967c78ccf905c69d732a4c1ef00289-C001-134", "967c78ccf905c69d732a4c1ef00289-C001-135" ] ], "cite_sentences": [ "967c78ccf905c69d732a4c1ef00289-C001-53", "967c78ccf905c69d732a4c1ef00289-C001-135" ] }, "@USE@": { "gold_contexts": [ [ "967c78ccf905c69d732a4c1ef00289-C001-60", "967c78ccf905c69d732a4c1ef00289-C001-62" ], [ "967c78ccf905c69d732a4c1ef00289-C001-78", "967c78ccf905c69d732a4c1ef00289-C001-79" ], [ "967c78ccf905c69d732a4c1ef00289-C001-118", "967c78ccf905c69d732a4c1ef00289-C001-119" ], [ "967c78ccf905c69d732a4c1ef00289-C001-128" ], [ "967c78ccf905c69d732a4c1ef00289-C001-166" ] ], "cite_sentences": [ "967c78ccf905c69d732a4c1ef00289-C001-62", "967c78ccf905c69d732a4c1ef00289-C001-78", "967c78ccf905c69d732a4c1ef00289-C001-118", "967c78ccf905c69d732a4c1ef00289-C001-128", "967c78ccf905c69d732a4c1ef00289-C001-166" ] }, "@SIM@": { "gold_contexts": [ [ "967c78ccf905c69d732a4c1ef00289-C001-78", "967c78ccf905c69d732a4c1ef00289-C001-79" ] ], "cite_sentences": [ "967c78ccf905c69d732a4c1ef00289-C001-78" ] }, "@FUT@": { "gold_contexts": [ [ "967c78ccf905c69d732a4c1ef00289-C001-134", "967c78ccf905c69d732a4c1ef00289-C001-135", "967c78ccf905c69d732a4c1ef00289-C001-136" ] ], "cite_sentences": [ "967c78ccf905c69d732a4c1ef00289-C001-135" ] } } }, "ABC_17252628fa9c03c2fe0b44763fc7a2_14": { "x": [ { "sent_id": 
"17252628fa9c03c2fe0b44763fc7a2-C001-73", "text": "Here \"mw\" means \"measure word\"." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-23", "text": "The most similar work to this paper is that of Wang et al. (2007) ." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-24", "text": "They created a set of preordering rules for constituent parsers for ChineseEnglish PBSMT." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-25", "text": "In contrast, we propose a set of pre-ordering rules for dependency parsers." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-70", "text": "In this case, it also comes directly after the preposition." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-71", "text": "Similarly, we created a rule plmod : lccomp (clausal complement of a localizer)." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-2", "text": "In statistical machine translation (SMT), syntax-based pre-ordering of the source language is an effective method for dealing with language pairs where there are great differences in their respective word orders." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-3", "text": "This paper introduces a novel pre-ordering approach based on dependency parsing for Chinese-English SMT." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-4", "text": "We present a set of dependency-based preordering rules which improved the BLEU score by 1.61 on the NIST 2006 evaluation data." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-5", "text": "We also investigate the accuracy of the rule set by conducting human evaluations." 
}, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-6", "text": "----------------------------------" }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-8", "text": "SMT systems have difficulties translating between distant language pairs such as Chinese and English." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-9", "text": "The reason for this is that there are great differences in their word orders." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-10", "text": "Reordering therefore becomes a key issue in SMT systems between distant language pairs." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-11", "text": "Previous work has shown that the approaches tackling the problem by introducing a pre-ordering procedure into phrase-based SMT (PBSMT) were effective." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-12", "text": "These pre-ordering approaches first parse the source language sentences to create parse trees." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-13", "text": "Then, syntactic reordering rules are applied to these parse trees with the goal of reordering the source language sentences into the word order of the target language." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-14", "text": "Syntax-based pre-ordering by employing constituent parsing have demonstrated effectiveness in many language pairs, such as English-French (Xia and McCord, 2004) , German-English (Collins et al., 2005) , Chinese-English (Wang et al., 2007; Zhang et al., 2008) , and English-Japanese (Lee et al., 2010) ." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-15", "text": "As a kind of constituent structure, HPSG (Pollard and Sag, 1994) parsing-based pre-ordering showed improvements in SVO-SOV translations, such as English-Japanese (Isozaki et al., 2010; Wu et al., 2011) and Chinese-Japanese (Han et al., 2012) ." 
}, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-16", "text": "Since dependency parsing is more concise than constituent parsing in describing sentences, some research has used dependency parsing in pre-ordering approaches for language pairs such as Arabic-English (Habash, 2007) , and English-SOV languages (Xu et al., 2009; Katz-Brown et al., 2011) ." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-17", "text": "The pre-ordering rules can be made manually (Collins et al., 2005; Wang et al., 2007; Han et al., 2012) or extracted automatically from a parallel corpus (Xia and McCord, 2004; Habash, 2007; Zhang et al., 2007; Wu et al., 2011) ." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-18", "text": "The purpose of this paper is to introduce a novel dependency-based pre-ordering approach through creating a pre-ordering rule set and applying it to the Chinese-English PBSMT system." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-19", "text": "Experiment results showed that our pre-ordering rule set improved the BLEU score on the NIST 2006 evaluation data by 1.61." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-20", "text": "Moreover, this rule set substantially decreased the total times of rule application about 60%, compared with a constituent-based approach (Wang et al., 2007) ." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-21", "text": "We also conducted human evaluations in order to assess its accuracy." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-22", "text": "To our knowledge, our manually created pre-ordering rule set is the first Chinese-English dependencybased pre-ordering rule set." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-26", "text": "We argue that even though the rules by Wang et al. (2007) exist, it is almost impossible to automatically convert their rules into rules that are applicable to dependency parsers." 
}, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-27", "text": "In fact, we abandoned our initial attempts to automatically convert their rules into rules for dependency parsers, and spent more than two months discovering the rules introduced in this paper." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-28", "text": "By applying our rules and Wang et al.'s rules, one can use both dependency and constituency parsers for pre-ordering in Chinese-English PBSMT." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-29", "text": "This is especially important on the point of the system combination of PBSMT systems, because the diversity of outputs from machine translation systems is important for system combination (Cer et al., 2013) ." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-30", "text": "By using both our rules and Wang et al.'s rules, one can obtain diverse machine translation results because the pre-ordering results of these two rule sets are generally different." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-31", "text": "Another similar work is that of (Xu et al., 2009 )." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-32", "text": "They created a pre-ordering rule set for dependency parsers from English to several SOV languages." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-33", "text": "In contrast, our rule set is for ChineseEnglish PBSMT." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-34", "text": "That is, the direction of translation is opposite." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-35", "text": "Because there are a lot of language specific decisions that reflect specific aspects of the source language and the language pair combination, our rule set provides a valuable resource for pre-ordering in Chinese-English PB-SMT." 
}, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-36", "text": "2 Dependency-based Pre-ordering Rule Set Figure 1 shows a constituent parse tree and its Stanford typed dependency parse tree for the same Figure 2 : An example of a preposition phrase with a plmod structure." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-37", "text": "The phrase translates into \"in front of the US embassy\"." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-38", "text": "Chinese sentence." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-39", "text": "As shown in the figure, the number of nodes in the dependency parse tree (i.e. 9) is much fewer than that in its corresponding constituent parse tree (i.e. 17)." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-40", "text": "Because dependency parse trees are generally more concise than the constituent ones, they can conduct longdistance reorderings in a finer way." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-41", "text": "Thus, we attempted to conduct pre-ordering based on dependency parsing." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-42", "text": "There are two widely-used dependency systems -Stanford typed dependencies and CoNLL typed dependencies." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-43", "text": "For Chinese, there are 45 types of grammatical relations for Stanford typed dependencies (Chang et al., 2009) and 25 for CoNLL typed dependencies." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-44", "text": "As we thought that Stanford typed dependencies could describe language phenomena more meticulously owing to more types of grammatical relations, we preferred to use it for searching candidate preordering rules." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-45", "text": "We designed two types of formats in our dependency-based pre-ordering rules." 
}, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-46", "text": "They are:" }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-47", "text": "Type-1: x : y Type-2: x -y Here, both x and y are dependency relations (e.g., plmod or lobj in Figure 2 )." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-72", "text": "rcmod Figure 3 shows an example of an rcmod structure under an nsubj (nominal subject) structure." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-48", "text": "We define the dependency structure of a dependency relation as the structure containing the dependent word (e.g., the word directly indicated by plmod, or \"\u524d\" in Figure 2 ) and the whole subtree under the dependency relation (all of the words that directly or indirectly depend on the dependent word, or the words under \"\u524d\" in Figure 2 )." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-49", "text": "Further, we define X and Y as the corresponding dependency structures of the dependency relations x and y, respectively." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-50", "text": "We define X\\Y as structure X except Y. For example, in Figure 2 , let x and y denote plmod and lobj dependency relations, then X represents \"\u524d\" and all words under \"\u524d\", Y represents \"\u5927\u4f7f\u9986\" and all words under \"\u5927\u4f7f\u9986\", and X\\Y represents We obtained rules as the following steps:" }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-51", "text": "1 Search the Chinese dependency parse trees in the corpus and rank all of the structures matching the two types of rules respectively according to their frequencies." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-52", "text": "Note that while calculating the frequencies of Type-1 structures, we dismissed the structures in which X occurred before Y originally." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-53", "text": "2 Filtration." 
}, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-54", "text": "1) Filter out the structures which occurred less than 5,000 times." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-55", "text": "2) Filter out the structures from which it was almost impossible to derive candidate pre-ordering rules because x or y was an \"irrespective\" dependency relation, for example, root, conj, cc and so on." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-56", "text": "3 Investigate the remaining structures." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-57", "text": "For each kind of structure, we selected some of the sample dependency parse trees that contained it, tried to restructure the parse trees according to the matched rule and judged the reordered Chinese phrases." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-58", "text": "If the reordering produced a Chinese phrase that had a closer word order to that of the English one, this structure would be a candidate pre-ordering rule." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-59", "text": "4 Conduct primary experiments which used the same training set and development set as the experiments described in Section 3." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-60", "text": "In the primary experiments, we tested the effectiveness of the candidate rules and filtered the ones that did not work based on the BLEU scores on the development set." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-61", "text": "Figure 4 : An example of rcmod structure with a preposition modifier." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-62", "text": "The phrase translates into \"a press conference held in Kabul\"." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-63", "text": "As a result, we obtained eight pre-ordering rules in total, which can be divided into three dependency relation categories." 
}, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-64", "text": "They are: plmod (localizer modifier of a preposition), rcmod (relative clause modifier) and prep (preposition modifer)." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-65", "text": "Each of these categories are discussed in detail below." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-66", "text": "plmod Figure 2 shows an example of a prepositional phrase with a plmod structure, which translates literally into \"in the US embassy front\"." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-67", "text": "In Chinese, the dependent word of a plmod relation (e.g., \"\u524d\" in Figure 2 ) occurs in the last position of the prepositional phrase." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-68", "text": "However, in English, this kind of word (e.g., \"front\" in the caption of Figure 2 ) always occur directly after prepositions, which is to say, in the second position in a prepositional phrase." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-69", "text": "Therefore, we applied a rule plmod : lobj (localizer object) to reposition the dependent word of the plmod relation (e.g., \"\u524d\" in Figure 2) to the position before the lobj structure (e.g., \"\u7f8e\u56fd \u5927\u4f7f\u9986\" in Figure 2 )." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-74", "text": "As shown in the figure, relative clause modifiers in Chinese (e.g., \"\u63a5\u8fd1 \u590f\u9686 \u7684\" in Figure 3 ) occurs before the noun being modified, which is in contrast to English (e.g., \"close to Sharon\" in the caption of Figure 3 ), where they come after." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-75", "text": "Thus, we introduced a series of rules NOUN : rcmod to restructure rcmod structures so that the noun is moved to the head." 
}, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-76", "text": "In this example, with the application of an nsubj : rcmod rule, the phrase can be translated into \"a senior official close to Sharon say\", which has a word order very close to English." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-77", "text": "Since a noun can be nsubj, dobj (direct object), pobj (prepositional object) and lobj The comparison of four systems, including the performance (BLEU) on the test set, the total count of each rule set and the number of sentences they were applied to on the training set." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-78", "text": "in Stanford typed dependencies, we created four rules from the NOUN pattern." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-79", "text": "Note that for some preposition modifiers, we needed a rule rcmod : prep to conduct the same work." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-80", "text": "For instance, the Chinese phrase in Figure 4 can be translated into \"hold in Kabul press conference\" with the application of this rule." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-81", "text": "prep Within verb phrases, the positions of prep structures are quite different between Chinese and English." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-82", "text": "Figure 5 shows an example of a verb phrase with a preposition modifier (prep), which literally translates into \"Musharraf at this place tell reporter\"." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-83", "text": "Recognizing that prep structures occur before the verb in Chinese (e.g., \"\u5728 \u6b64 \u5730\" in Figure 5 ) but after the verb in English (usually in the last position of a verb phrase, e.g., \"here\" in the caption of Figure 5 ), we applied a rule prep -dobj to reposition prep structures after their sibling dobj structures." 
}, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-84", "text": "In summary, the dependency-based preordering rule set has eight rules: plmod : lobj, plmod : lccomp, nsubj : rcmod, dobj : rcmod, pobj : rcmod, lobj : rcmod, rcmod : prep, and prep -dobj." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-85", "text": "----------------------------------" }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-86", "text": "**EXPERIMENTS**" }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-87", "text": "We used the MOSES PBSMT system in our experiments." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-88", "text": "The training data, which included those data used in Wang et al. (2007) , contained 1 million pairs of sentences extracted from the Linguistic Data Consortium's parallel news corpora." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-89", "text": "Our development set was the official NIST MT evaluation data from 2002 to 2005, consisting of 4476 Chinese-English sentences pairs." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-90", "text": "Our test set was the NIST 2006 MT evaluation data, consisting of 1664 sentence pairs." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-91", "text": "We employed the Stanford Segmenter 1 to segment all of the data sets." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-92", "text": "For evaluation, we used BLEU scores (Papineni et al., 2002) ." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-93", "text": "We implemented the constituent-based preordering rule set in Wang et al. (2007) for comparison, which is called WR07 below." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-94", "text": "The Berkeley Parser (Petrov et al., 2006) was employed for parsing the Chinese sentences." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-95", "text": "For training the Berkeley Parser, we used Chinese Treebank (CTB) 7.0." 
}, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-96", "text": "We conducted our dependency-based preordering experiments on the Berkeley Parser and the Mate Parser (Bohnet, 2010) , which were shown to be the two best parsers for Stanford typed dependencies (Che et al., 2012) ." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-97", "text": "First, we converted the constituent parse trees in the results of the Berkeley Parser into dependency parse trees by employing a tool in the Stanford Parser (Klein and Manning, 2003) ." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-98", "text": "For the Mate Parser, POS tagged inputs are required both in training and in inference." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-99", "text": "Thus, we then extracted the POS information from the results of the Berkeley Parser and used these as the pre-specified POS tags for the Mate Parser." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-100", "text": "Finally, we applied our dependency-based pre-ordering rule set to the dependency parse trees created from the converted Berkeley Parser and the Mate Parser, respectively." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-101", "text": "Table 1 presents a comparison of the system without pre-ordering, the constituent system using WR07 and two dependency systems employing the converted Berkeley Parser and the Mate Parser, respectively." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-102", "text": "It shows the BLEU scores on the test set and the statistics of pre-ordering on the training set, which includes the total count of each rule set and the number of sentences they were ap- Table 2 : Accuracy of the dependency-based pre-ordering rules on a set of 200 sentences randomly selected from the development set." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-103", "text": "plied to." 
}, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-104", "text": "Both of our dependency systems outperformed WR07 slightly but were not significant at p = 0.05." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-105", "text": "However, both of them substantially decreased the total times about 60% (or 1,600,000) for pre-ordering rule applications on the training set, compared with WR07." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-106", "text": "In our opinion, the reason for the great decrease was that the dependency parse trees were more concise than the constituent parse trees in describing sentences and they could also describe the reordering at the sentence level in a finer way." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-107", "text": "In contrast, the constituent parse trees were more redundant and they needed more nodes to conduct long-distance reordering." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-108", "text": "In this case, the affect of the performance of the constituent parsers on pre-ordering is larger than that of the dependency ones so that the constituent parsers are likely to bring about more incorrect pre-orderings." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-109", "text": "Similar to Wang et al. (2007) , we carried out human evaluations to assess the accuracy of our dependency-based pre-ordering rules by employing the system \"OUR DEP 2\" in Table 1 ." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-110", "text": "The evaluation set contained 200 sentences randomly selected from the development set." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-111", "text": "Among them, 107 sentences contained at least one rule and the rules were applied 185 times totally." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-112", "text": "Since the accuracy check for dependency parse trees took great deal of time, we did not try to select error free (100% accurately parsed) sentences." 
}, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-113", "text": "A bilingual speaker of Chinese and English looked at an original Chinese phrase and the pre-ordered one with their corresponding English phrase and judged whether the pre-ordering obtained a Chinese phrase that had a closer word order to the English one." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-114", "text": "Table 2 shows the accuracies of three categories of our dependency-based pre-ordering rules." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-115", "text": "The overall accuracy of this rule set is 60.0%, which is almost at the same level as the WR07 rule set (62.1%), according to the similar evaluation (200 sentences and one annotator) conducted in Wang et al. (2007) ." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-116", "text": "Notice that some of the incorrect pre-orderings may be caused by erroneous parsing as also suggested by Wang et al. (2007) ." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-117", "text": "Through human evaluations, we found that 19 out of the total 74 incorrect pre-orderings resulted from errors in parsing." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-118", "text": "Among them, 13 incorrect pre-orderings applied the rules of the rcmod category." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-119", "text": "The analysis suggests that we need to introduce constraints on the rule application of this category in the future." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-120", "text": "----------------------------------" }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-121", "text": "**CONCLUSION**" }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-122", "text": "In this paper, we introduced a novel pre-ordering approach based on dependency parsing for a Chinese-English PBSMT system." 
}, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-123", "text": "The results showed that our approach achieved a BLEU score gain of 1.61." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-124", "text": "Moreover, our dependency-based pre-ordering rule set substantially decreased the time for applying pre-ordering rules about 60% compared with WR07, on the training set of 1M sentences pairs." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-125", "text": "The overall accuracy of our rule set is 60.0%, which is almost at the same level as the WR07 rule set." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-126", "text": "These results indicated that dependency parsing is more effective for conducting pre-ordering for Chinese-English PBSMT." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-127", "text": "Although our work focused on Chinese, the ideas can also be applied to other languages." }, { "sent_id": "17252628fa9c03c2fe0b44763fc7a2-C001-128", "text": "In the future, we attempt to create more efficient pre-ordering rules by exploiting the rich information in dependency structures." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "17252628fa9c03c2fe0b44763fc7a2-C001-14" ], [ "17252628fa9c03c2fe0b44763fc7a2-C001-17" ] ], "cite_sentences": [ "17252628fa9c03c2fe0b44763fc7a2-C001-14", "17252628fa9c03c2fe0b44763fc7a2-C001-17" ] }, "@MOT@": { "gold_contexts": [ [ "17252628fa9c03c2fe0b44763fc7a2-C001-16", "17252628fa9c03c2fe0b44763fc7a2-C001-17", "17252628fa9c03c2fe0b44763fc7a2-C001-18" ], [ "17252628fa9c03c2fe0b44763fc7a2-C001-26" ] ], "cite_sentences": [ "17252628fa9c03c2fe0b44763fc7a2-C001-17", "17252628fa9c03c2fe0b44763fc7a2-C001-26" ] }, "@DIF@": { "gold_contexts": [ [ "17252628fa9c03c2fe0b44763fc7a2-C001-19", "17252628fa9c03c2fe0b44763fc7a2-C001-20" ], [ "17252628fa9c03c2fe0b44763fc7a2-C001-23", "17252628fa9c03c2fe0b44763fc7a2-C001-24", "17252628fa9c03c2fe0b44763fc7a2-C001-25" ], [ "17252628fa9c03c2fe0b44763fc7a2-C001-87", "17252628fa9c03c2fe0b44763fc7a2-C001-88", "17252628fa9c03c2fe0b44763fc7a2-C001-89" ] ], "cite_sentences": [ "17252628fa9c03c2fe0b44763fc7a2-C001-20", "17252628fa9c03c2fe0b44763fc7a2-C001-23", "17252628fa9c03c2fe0b44763fc7a2-C001-88" ] }, "@SIM@": { "gold_contexts": [ [ "17252628fa9c03c2fe0b44763fc7a2-C001-23" ], [ "17252628fa9c03c2fe0b44763fc7a2-C001-87", "17252628fa9c03c2fe0b44763fc7a2-C001-88" ], [ "17252628fa9c03c2fe0b44763fc7a2-C001-109" ], [ "17252628fa9c03c2fe0b44763fc7a2-C001-115" ], [ "17252628fa9c03c2fe0b44763fc7a2-C001-116" ] ], "cite_sentences": [ "17252628fa9c03c2fe0b44763fc7a2-C001-23", "17252628fa9c03c2fe0b44763fc7a2-C001-88", "17252628fa9c03c2fe0b44763fc7a2-C001-109", "17252628fa9c03c2fe0b44763fc7a2-C001-115", "17252628fa9c03c2fe0b44763fc7a2-C001-116" ] }, "@USE@": { "gold_contexts": [ [ "17252628fa9c03c2fe0b44763fc7a2-C001-93" ], [ "17252628fa9c03c2fe0b44763fc7a2-C001-109" ] ], "cite_sentences": [ "17252628fa9c03c2fe0b44763fc7a2-C001-93", "17252628fa9c03c2fe0b44763fc7a2-C001-109" ] } } }, "ABC_ab8c43bf5a37c436d166960af459a8_14": { "x": [ { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-79", "text": "Figure 2: 
The training data used for multi-task learning models." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-80", "text": "The bi-directional formality transfer data and the bilingual data (e.g. FR-EN) of equivalent size are always concatenated." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-2", "text": "Generating natural language requires conveying content in an appropriate style." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-3", "text": "We explore two related tasks on generating text of varying formality: monolingual formality transfer and formality-sensitive machine translation." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-4", "text": "We propose to solve these tasks jointly using multi-task learning, and show that our models achieve state-of-the-art performance for formality transfer and are able to perform formality-sensitive translation without being explicitly trained on style-annotated translation examples." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-5", "text": "----------------------------------" }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-7", "text": "Generating language in the appropriate style is a requirement for applications that generate natural language, as the style of a text conveys important information beyond its literal meaning (Hovy, 1987)." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-8", "text": "Heylighen and Dewaele (1999) and Biber (2014) have argued that the formal-informal dimension is a core dimension of stylistic variation." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-9", "text": "In this work, we focus on the problem of generating text for a desired formality level."
}, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-10", "text": "It has been recently studied in two distinct settings: (1) Rao and Tetreault (2018) addressed the task of Formality Transfer (FT) where given an informal sentence in English, systems are asked to output a formal equivalent, or vice-versa; (2) introduced the task of FormalitySensitive Machine Translation (FSMT), where given a sentence in French and a desired formality level (approximating the intended audience of the translation), systems are asked to produce an English translation of the desired formality level." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-11", "text": "While FT and FSMT can both be framed as Machine Translation (MT), appropriate training examples are much harder to obtain than for traditional machine translation tasks." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-12", "text": "FT requires sentence pairs that express the same meaning in two different styles, which rarely occur naturally and are therefore only available in small quantities." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-13", "text": "FSMT can draw from existing parallel corpora in diverse styles, but would ideally require not only sentence pairs, but e.g., sentence triplets that contain a French input, its formal English translation, and its informal English translation." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-14", "text": "We hypothesize that FT and FSMT can benefit from being addressed jointly, by sharing information from two distinct types of supervision: sentence pairs in the same language that capture style difference, and translation pairs drawn from corpora of various styles." 
}, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-15", "text": "Inspired by the benefits of multi-task learning (Caruana, 1997) for natural language processing tasks in general (Collobert and Weston, 2008; Liu et al., 2015; Luong et al., 2016) , and for multilingual MT in particular (Johnson et al., 2017) , we introduce a model based on Neural Machine Translation (NMT) that jointly learns to perform both monolingual FT and bilingual FSMT." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-16", "text": "As can be seen in Figure 1 , given an English sentence and a tag (formal or informal), our model paraphrases the input sentence into the desired formality." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-17", "text": "The same model can also take in a French sentence, and produce a formal or an informal English translation as desired." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-81", "text": "tags to corresponding source French sentences (i.e. the bottom two rows of data in Figure 2a ) and train an NMT model for our FSMT task." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-82", "text": "----------------------------------" }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-83", "text": "**MULTI-TASK LEARNING**" }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-84", "text": "We propose a multi-task learning model to jointly perform FT and FSMT using a many-to-one (i.e. multi-language to English) sequence to sequence model (Luong et al., 2016) ." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-85", "text": "Following Johnson et al. (2017) , we implement this approach using shared encoders and decoders." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-18", "text": "Designing this model requires addressing several questions: Can we build a single model that performs formality transfer in both directions? How to best combine monolingual examples of formality transfer and bilingual examples of translation? 
What kind of bilingual examples are most useful for the joint task?" }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-19", "text": "Can our joint model learn to perform FSMT without being explicitly trained on style-annotated translation examples?" }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-20", "text": "We explore these questions by conducting an empirical study on English FT and Figure 1 : System overview: our multi-task learning model can perform both bi-directional English formality transfer and translate French to English with desired formality." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-21", "text": "It is trained jointly on monolingual formality transfer data and bilingual translation data." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-22", "text": "French-English FSMT, using both automatic and human evaluation." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-23", "text": "Our results show the benefits of the multi-task learning approach, improving the state-of-the-art on the FT task, and yielding competitive performance on FSMT without style-annotated translation examples." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-24", "text": "Along the way, we also improve over prior results on FT using a single NMT model that can transfer between styles in both directions." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-25", "text": "----------------------------------" }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-26", "text": "**BACKGROUND**" }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-27", "text": "Style Transfer can naturally be framed as a sequence to sequence translation problem given sentence pairs that are paraphrases in two distinct styles." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-28", "text": "These parallel style corpora are constructed by creatively collecting existing texts of varying styles, and are therefore rare and much smaller than machine translation parallel corpora." 
}, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-29", "text": "For instance, Xu et al. (2012) scrape modern translations of Shakespeare's plays and use a phrase-based MT (PBMT) system to paraphrase Shakespearean English into/from modern English." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-30", "text": "Jhamtani et al. (2017) improve performance on this dataset using neural translation model with pointers to enable copy actions." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-31", "text": "The availability of parallel standard and simple Wikipedia (and sometimes additional human rewrites) makes text simplification a popular style transfer task, typically addressed using machine translation models ranging from syntax-based MT (Zhu et al., 2010; Xu et al., 2016) , phrase-based MT (Coster and Kauchak, 2011; Wubben et al., 2012) to neural MT (Wang et al., 2016) trained via reinforcement learning (Zhang and Lapata, 2017) ." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-32", "text": "Naturally occurring examples of parallel formal-informal sentences are harder to find." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-33", "text": "Prior work relied on synthetic examples generated based on lists of words of known formality (Sheikha and Inkpen, 2011) ." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-34", "text": "This state of affairs recently changed, with the introduction of the first large scale parallel corpus for formality transfer, GYAFC (Grammarly's Yahoo Answers Formality Corpus)." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-35", "text": "110K informal sentences were collected from Yahoo Answers and they were rewritten in a formal style via crowd-sourcing, which made it possible to benchmark style transfer systems based on both PBMT and NMT models (Rao and Tetreault, 2018) ." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-36", "text": "In this work, we leverage this corpus to enable multi-task FT and FSMT." 
}, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-37", "text": "Recent work also explores how to perform style transfer without parallel data." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-38", "text": "However, this line of work considers transformations that alter the original meaning (e.g., changes in sentiment or topic), while we view style transfer as meaning-preserving." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-39", "text": "An auto-encoder is used to encode a sequence to a latent representation which is then decoded to get the style transferred output sequence (Mueller et al., 2017; Hu et al., 2017; Shen et al., 2017; Fu et al., 2018; Prabhumoye et al., 2018) ." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-40", "text": "Style in Machine Translation has received little attention in recent MT architectures." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-41", "text": "Mima et al. (1997) improve rule-based MT by using extra-linguistic information such as speaker's role and gender." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-42", "text": "Lewis et al. (2015) and Niu and Carpuat (2016) equate style with domain, and train conversational MT systems by selecting in-domain (i.e. conversation-like) training data." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-43", "text": "Similarly, Wintner et al. (2017) and Michel and Neubig (2018) take an adaptation approach to personalize MT with gender-specific or speaker-specific data." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-44", "text": "Other work has focused on specific realizations of stylistic variations, such as T-V pronoun selection for translation into German (Sennrich et al., 2016a) or controlling voice (Yamagishi et al., 2016) ." 
}, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-45", "text": "In contrast, we adopt the broader range of style variations considered in our prior work, which introduced the FSMT task : in FSMT, the MT system takes a desired formality level as an additional input, to represent the target audience of a translation, which human translators implicitly take into account." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-46", "text": "This task was addressed via n-best re-ranking in phrase-based MT -translation hypotheses whose formality are closer to desired formality are promoted." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-47", "text": "By contrast, in this work we use neural MT which is based on the Attentional Recurrent EncoderDecoder model (Bahdanau et al., 2015; Luong et al., 2016) ." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-48", "text": "The input is encoded into a sequence of vector representations while the decoder adaptively computes a weighted sum of these vectors as the context vector for each decoding step." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-49", "text": "In the joint model, we employ Side Constraints as the formality input to restrict the generation of the output sentence (Figure 1) ." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-50", "text": "Prior work has successfully implemented side constraints as a special token added to each source sentence." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-51", "text": "These tokens are embedded into the source sentence representation and control target sequence generation via the attention mechanism." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-52", "text": "Sennrich et al. (2016a) append or (i.e. T-V pronoun distinction) to the source text to indicate which pronoun is preferred in the German output." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-53", "text": "Johnson et al. 
(2017) and concatenate parallel data of various language directions and mark the source with the desired output language to perform multilingual or bi-directional NMT." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-54", "text": "Kobus et al. (2017) and Chu et al. (2017) add domain tags for domain adaptation in NMT." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-55", "text": "----------------------------------" }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-56", "text": "**APPROACH**" }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-57", "text": "We describe our unified model for performing FT in both directions (Section 3.1), our FSMT model with side constraints (Section 3.2) and finally our multi-task learning model that jointly learns to perform FT and FSMT (Section 3.3)." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-86", "text": "This approach can use existing NMT architectures without modifications." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-58", "text": "All models rely on the same NMT architecture: attentional recurrent sequenceto-sequence models." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-59", "text": "----------------------------------" }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-60", "text": "**BI-DIRECTIONAL FORMALITY TRANSFER**" }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-61", "text": "Rao and Tetreault (2018) used independent neural machine translation models for each formality transfer direction (informal\u2192formal and formal\u2192informal)." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-62", "text": "Inspired by the bi-directional NMT for low-resource languages , we propose a unified model that can handle either direction -we concatenate the parallel data from the two directions of formality transfer and attach a tag to the beginning of each source sentence denoting the desired target formality level i.e. for transferring to formal and for transferring to informal." 
}, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-63", "text": "This enables our FT model to learn to transfer to the correct style via attending to the tag in the source embedding." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-64", "text": "We train an NMT model on this combined dataset." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-65", "text": "Since both the source and target sentences come from the same language, we encourage their representations to lie in the same distributional vector space by (1) building a shared Byte-Pair Encoding (BPE) model on source and target data (Sennrich et al., 2016b) and (2) tying source and target word embeddings (Press and Wolf, 2017) ." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-66", "text": "----------------------------------" }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-67", "text": "**FORMALITY-SENSITIVE MACHINE TRANSLATION WITH SIDE CONSTRAINTS**" }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-68", "text": "Inspired by Sennrich et al. (2016a) , we use side constraints on parallel translation examples to control output formality." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-69", "text": "At training time, this requires a tag that captures the formality of the target sentence for every sentence pair." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-70", "text": "Given the vast range of text variations that influence style, we cannot obtain tags using rules as for T-V pronoun distinctions (Sennrich et al., 2016a) ." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-71", "text": "Instead, we categorize FrenchEnglish parallel data into formal vs. informal categories by comparing them to the informal and formal English from the GYAFC corpus." 
}, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-72", "text": "We adopt a data selection technique, Cross-Entropy Difference (CED) (Moore and Lewis, 2010) , to rank English sentences in the bilingual corpus by their relative distance to each style." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-73", "text": "First, we consider formal English as the target style and define CED(s) = H f ormal (s) \u2212 H inf ormal (s), where H f ormal (s) is the cross-entropy between a sentence s and the formal language model." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-74", "text": "Smaller CED indicates an English sentence that is more similar to the formal English corpus and less similar to the informal English corpus." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-75", "text": "We rank English sentences by their CED scores and select the top N sentences (choice of N discussed in Section 6)." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-76", "text": "Pairing these N English sentences with their parallel French source, we get the formal sample of our bilingual data." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-77", "text": "Similarly, we construct the informal sample using informal English as the target style." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-78", "text": "Finally, we combine the formal and the informal samples, attach the and " }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-153", "text": "NMT Baseline uses OpenNMT-py (Klein et al., 2017) ." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-87", "text": "We design three models to investigate how to best incorporate side constraints at training time, and the benefits of sharing representations for style and language." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-88", "text": "MultiTask-tag-style is a straightforward combination of the transfer and translation models above." 
}, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-89", "text": "We hypothesize that using the bilingual parallel data where English is the target could enhance English FT in terms of target language modeling, especially when the bilingual data has similar topics and styles." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-90", "text": "We therefore combine equal sizes of formality tagged training data (selected as described in Section 3.2) from our FT and FSMT tasks in this configuration (Figure 2a )." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-91", "text": "MultiTask-style is designed to test whether formality tags for bilingual examples are necessary." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-92", "text": "We hypothesize that the knowledge of controlling target formality for the FSMT task can be learned from the FT data since the source embeddings of formality tags are shared between the FT and the FSMT tasks." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-93", "text": "We therefore combine the formality tagged FT data with the MT data without their tags (Figure 2b )." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-94", "text": "MultiTask-random investigates the impact of the similarity between formality transfer and bilingual examples." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-95", "text": "Selecting bilingual data which is similar to the GYAFC corpus is not necessarily beneficial for the FSMT task especially when French-English bilingual examples are drawn from a domain distant from the GYAFC corpus." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-96", "text": "In this configuration, we test how well our model performs FSMT if bilingual examples are randomly selected instead (Figure 2c )." 
}, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-97", "text": "----------------------------------" }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-98", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-99", "text": "FT data: We use the GYAFC corpus introduced by Rao and Tetreault (2018) as our FT data." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-100", "text": "This corpus consists of 110K informal sentences from two domains of Yahoo Answers (Entertainment and Music (E&M) and Family and Relationships (F&R)) paired with their formal rewrites by humans." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-101", "text": "The train split consists of 100K informal-formal sentence pairs whereas the dev/test sets consist of roughly 5K source-style sentences paired with four reference target-style human rewrites for both transfer directions." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-102", "text": "FSMT data: We evaluate the FSMT models on a large-scale French to English (FR-EN) translation task." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-103", "text": "Examples are drawn from OpenSubtitles2016 (Lison and Tiedemann, 2016) which consists of movie and television subtitles and is thus more similar to the GYAFC corpus compared to news or parliament proceedings." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-104", "text": "This is a noisy dataset where aligned French and English sentences often do not have the same meaning, so we use a bilingual semantic similarity detector to select 20,005,000 least divergent examples from \u223c27.5M deduplicated sentence pairs in the original set (Vyas et al., 2018) ." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-105", "text": "Selected examples are then randomly split into a 20M training pool, a 2.5K dev set and a 2.5K test set." 
}, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-106", "text": "Preprocessing: We apply four pre-processing steps to both FT and MT data: normalization, tokenization, true-casing, and joint source-target BPE with 32,000 operations for NMT (Sennrich et al., 2016b) ." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-107", "text": "NMT Configuration: We use the standard attentional encoder-decoder architecture implemented in the Sockeye toolkit (Hieber et al., 2017) ." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-108", "text": "Our translation model uses a bi-directional encoder with a single LSTM layer (Bahdanau et al., 2015) of size 512, multilayer perceptron attention with a layer size of 512, and word representations of size 512." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-109", "text": "We apply layer normalization and tie the source and target embeddings as well as the output layer's weight matrix." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-110", "text": "We add dropout to embeddings and RNNs of the encoder and decoder with probability 0.2." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-111", "text": "We train using the Adam optimizer with a batch size of 64 sentences and checkpoint the model every 1000 updates (Kingma and Ba, 2015) ." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-112", "text": "Training stops after 8 checkpoints without improvement of validation perplexity." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-113", "text": "We decode with a beam size of 5." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-114", "text": "We train four randomly seeded models for each experiment and combine them in a linear ensemble for decoding." 
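The linear ensemble used for decoding can be illustrated at the level of a single decoding step: the four models' next-token distributions are averaged before the beam is extended. This is a hedged sketch, not Sockeye's actual API; the dict-based distributions and function names are assumptions.

```python
def ensemble_step(distributions, weights=None):
    """Linearly combine next-token distributions from several models.

    `distributions` is a list of dicts mapping token -> probability; the
    ensemble probability of a token is its (weighted) average across models.
    """
    n = len(distributions)
    weights = weights or [1.0 / n] * n
    vocab = set().union(*distributions)
    combined = {tok: sum(w * d.get(tok, 0.0) for w, d in zip(weights, distributions))
                for tok in vocab}
    # Beam search would keep the top-k entries of `combined`; here we take the argmax.
    return max(combined, key=combined.get)

# Two toy model distributions over a three-token vocabulary (assumed values).
d1 = {"the": 0.5, "a": 0.3, "an": 0.2}
d2 = {"the": 0.2, "a": 0.6, "an": 0.2}
best = ensemble_step([d1, d2])
```

With these toy numbers the ensemble prefers "a" (average 0.45) even though the first model alone would pick "the".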
}, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-116", "text": "5 Evaluation Protocol" }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-117", "text": "----------------------------------" }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-118", "text": "**AUTOMATIC EVALUATION**" }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-119", "text": "We evaluate both FT and FSMT tasks using BLEU (Papineni et al., 2002), which compares the model output with four reference target-style rewrites for FT and a single reference translation for FSMT." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-120", "text": "We report case-sensitive BLEU with standard WMT tokenization." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-121", "text": "For FT, Rao and Tetreault (2018) show that BLEU correlates well with the overall system ranking assigned by humans." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-122", "text": "For FSMT, BLEU is an imperfect metric as it conflates mismatches due to translation errors and due to correct style variations." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-123", "text": "We therefore turn to human evaluation to isolate formality differences from translation quality." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-124", "text": "----------------------------------" }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-125", "text": "**HUMAN EVALUATION**" }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-126", "text": "Following Rao and Tetreault (2018), we assess model outputs on three criteria: formality, fluency and meaning preservation." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-127", "text": "Since the goal of our evaluation is to compare models, our evaluation scheme asks workers to compare sentence pairs on these three criteria instead of rating each sentence in isolation." 

}, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-128", "text": "We collect human judgments using CrowdFlower on 300 samples of each model's outputs." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-129", "text": "For FT, we compare the top performing NMT benchmark model in Rao and Tetreault (2018) with our best FT model." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-130", "text": "For FSMT, we compare outputs from three representative models: NMT-constraint, MultiTask-random and PBMT-random." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-131", "text": "Formality." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-132", "text": "For FT, we want to measure the amount of style variation introduced by a model." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-133", "text": "Hence, we ask workers to compare the source-style sentence with its target-style model output." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-134", "text": "For FSMT, we want to measure the amount of style variation between two different translations by the same model." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-135", "text": "Hence, we ask workers to compare the \"informal\" English translation and the \"formal\" English translation of the same source sentence in French." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-136", "text": "We design a five-point scale for comparing the formality of two sentences, ranging from one being much more formal than the other to the other being much more formal than the first, giving us a value between 0 and 2 for each sentence pair." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-137", "text": "Fluency." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-138", "text": "For both FT and FSMT tasks, we want to understand how fluent the different model outputs are." 
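One plausible reading of the five-point comparison scale is sketched below. The exact label-to-score mapping is an assumption: the text only states that each sentence pair receives a value between 0 and 2, so we collapse the five comparative options to a magnitude in {0, 1, 2}.

```python
# Assumed mapping from the five comparative options to a difference magnitude:
# 0 = no perceived formality difference, 2 = one side "much more formal".
SCALE = {
    "A much more formal": 2,
    "A somewhat more formal": 1,
    "no difference": 0,
    "B somewhat more formal": 1,
    "B much more formal": 2,
}

def formality_score(judgments):
    """Average magnitude of the perceived formality difference over sentence pairs."""
    return sum(SCALE[j] for j in judgments) / len(judgments)

score = formality_score(["A somewhat more formal", "no difference", "B much more formal"])
```

Under this reading, the low observed scores (e.g. 0.35) reflect that workers rarely pick the extreme "much more (in)formal" options, as noted in the Table 2 caption.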
}, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-139", "text": "Hence, we ask workers to compare the fluency of two model outputs of the same target style." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-140", "text": "Similar to formality evaluation, we design a five point scale for comparing the fluency of two sentences, giving us a value between 0 and 2 for each sentence pair." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-141", "text": "Meaning Preservation." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-142", "text": "For FT, we want to measure the amount of meaning preserved during formality transfer." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-143", "text": "Hence, we ask workers to compare the source-style sentence and the target-style model output." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-144", "text": "For FSMT, we want to measure the amount of meaning preserved between two different translations by the same model." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-145", "text": "Hence, we ask workers to compare the \"informal\" English translation and the \"formal\" English translation of the same source sentence in French." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-146", "text": "We design a four point scale to compare the meaning of two sentences ranging from the two being completely equivalent to the two being not equivalent, giving us a value between 0 and 3 for each sentence pair." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-147", "text": "Table 1 : Automatic evaluation of Formality Transfer with BLEU scores." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-148", "text": "The bi-directional model with three stacked improvements achieves the best overall performance." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-149", "text": "The improvement over the second best system is statistically significant at p < 0.05 using bootstrap resampling (Koehn, 2004) ." 
}, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-150", "text": "----------------------------------" }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-151", "text": "**INFORMAL\u2192FORMAL FORMAL\u2192INFORMAL**" }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-152", "text": "6 Formality Transfer Experiments. 6.1 Baseline Models from Rao and Tetreault (2018). PBMT is a phrase-based machine translation model trained on the GYAFC corpus using a training regime consisting of self-training, data sub-selection and a large language model." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-154", "text": "Rao and Tetreault (2018) use a pre-processing step to make source informal sentences more formal and source formal sentences more informal by rules such as re-casing." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-155", "text": "Word embeddings pre-trained on Yahoo Answers are also used." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-156", "text": "NMT Combined is Rao and Tetreault's best performing NMT model trained on the rule-processed GYAFC corpus, with additional forward and backward translations produced by the PBMT model." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-157", "text": "----------------------------------" }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-158", "text": "**OUR MODELS**" }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-159", "text": "NMT Baseline: Our NMT baseline uses Sockeye instead of OpenNMT-py and is trained on raw datasets of two domains and two transfer directions." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-160", "text": "Bi-directional FT: Our initial bi-directional model is trained on bi-directional data from both domains with formality tags." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-161", "text": "It is incrementally augmented with three modifications to get the final multi-task model (i.e. 
MultiTask-tag-style as described in Section 3.3): (1) We combine training sets of two domains (E&M+F&R) together and train a single model on it." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-162", "text": "(2) We use ensemble decoding by training four randomly seeded models on the combined data." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-163", "text": "(3) We add formality-tagged bilingual data and train the model using multi-task learning to jointly learn FT and FSMT." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-164", "text": "Suppose the amount of original bi-directional FT data is n; we always select kn bilingual examples, where k is an integer." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-165", "text": "We also duplicate the FT data to make it match the size of the selected bilingual data." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-166", "text": "----------------------------------" }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-167", "text": "**RESULTS**" }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-168", "text": "Automatic Evaluation." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-169", "text": "As shown in Table 1 , our NMT baselines yield surprisingly better BLEU scores than those of Rao and Tetreault (2018) , even without using rule-processed source training data and pretrained word embeddings." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-170", "text": "We attribute the difference to the more optimized NMT toolkit we use." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-171", "text": "The initial bi-directional model outperforms uni-directional models." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-172", "text": "This matches the behavior of bidirectional NMT in low-resource settings studied by -we work with a relatively small amount of training data (\u223c50K), and FT models benefit from doubling the size of training data without being confused by mixing two transfer directions." 
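The data balancing in modification (3) — selecting kn bilingual examples and duplicating the FT data to match — can be sketched as follows. Function and variable names are illustrative assumptions, not the paper's pipeline; the bilingual list is assumed to be pre-ranked (e.g. by CED) or shuffled for random selection.

```python
import itertools

def build_multitask_corpus(ft_pairs, bilingual_pairs, k):
    """Combine FT and bilingual training pairs in equal proportions.

    Selects k * len(ft_pairs) bilingual examples and duplicates the FT data
    k times so both tasks contribute the same number of training pairs.
    """
    n = len(ft_pairs)
    selected = bilingual_pairs[:k * n]  # take the top k*n of the (assumed ranked) list
    duplicated_ft = list(itertools.chain.from_iterable([ft_pairs] * k))
    return duplicated_ft + selected

# Tiny toy data: 3 tagged FT pairs and a pool of 100 bilingual pairs (assumed).
ft = [("<I2F> hi", "hello")] * 3
bi = [("fr%d" % i, "en%d" % i) for i in range(100)]
corpus = build_multitask_corpus(ft, bi, k=2)  # 6 FT pairs + 6 bilingual pairs
```

With k = 2 and n = 3 the combined corpus holds six FT pairs and six bilingual pairs, mirroring the equal-size mixing described above.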
}, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-173", "text": "For the same reason, increasing the training data by combining two domains together improves performance further." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-174", "text": "Ensemble decoding is a consistently effective technique in NMT, and it enhances our NMT-based FT models as expected." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-175", "text": "Incorporating the bilingual parallel data by multi-task learning yields further improvement." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-243", "text": "Table 2 shows that neural models control formality significantly better than PBMT-random (0.35/0.32 vs. 0.05)." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-176", "text": "The target side of bilingual data is selected based on the closeness to the GYAFC corpus, so we hypothesize that the higher quality comes from better target language modeling by training on more English text." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-177", "text": "Table 2 : Human evaluation of formality difference and meaning preservation." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-178", "text": "MultiTask-tag-style generates significantly more informal (F\u2192I) English than NMT Combined (p<0.05 using the t-test, see Section 6.3)." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-179", "text": "PBMT-random does not control formality effectively when comparing its informal (I) and formal (F) output (Section 7.2)." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-180", "text": "Formality scores are relatively low because workers rarely choose \"much more (in)formal\"." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-181", "text": "All models preserve meaning equally well." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-182", "text": "Human Evaluation." 
}, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-183", "text": "The superior performance of the best FT model (i.e. MultiTask-tag-style) is also reflected in our human evaluation (see Table 2 )." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-184", "text": "It generates slightly more formal English (0.59 vs 0.54) and significantly more informal English (0.64 vs 0.45) than NMT Combined." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-185", "text": "This is consistent with BLEU differences in Table 1 which show that MultiTask-tag-style yields bigger improvements when transferring formal language to informal." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-186", "text": "Both models have good quality with respect to meaning preservation (2.94 vs 2.92) and workers can hardly find any fluency difference between outputs of these two models, assigning 0.03 on average in the fluency test (0 means no difference)." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-187", "text": "Impact of Bilingual Data Size." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-188", "text": "We evaluate the impact of selected bilingual data size on the combination of development sets from two domains in GYAFC and show the results in Figure 3 ." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-189", "text": "The quality of formality transfer improves as soon as bilingual data is used and soon converges as more data is added." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-190", "text": "Meanwhile, the translation quality increases monotonically with the size of training data." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-191", "text": "The optimal point is a hyper-parameter that can be determined on the development set." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-192", "text": "We empirically choose k = 12 since it works best for formality transfer and yields reasonable translation quality." 
}, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-193", "text": "----------------------------------" }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-194", "text": "**QUALITATIVE ANALYSIS**" }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-195", "text": "We manually inspect 100 randomly selected samples from our evaluation set and compare the target-style output of our best model (MultiTask-tag-style) with that of the best baseline model (NMT-Combined) from Rao and Tetreault (2018) ." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-196", "text": "Table 3 shows some samples representative of the trends we find for informal\u2192formal (3a) and formal\u2192informal (3b) tasks." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-197", "text": "In the majority of cases, the two models produce similar outputs, as can be expected since they use similar NMT architectures." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-198", "text": "In cases where the two outputs differ, in the I\u2192F task, we find that our model produces a more formal output by introducing phrasal level changes (first sample in 3a) or by moving 3a: informal\u2192formal Original I chill out sweetie everything will be just fine eventually 1 NMT-Combined F Can you chill out sweetie everything will be just fine eventually." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-199", "text": "MultiTask-tag-style F Calm down, sweetie, everything will be fine eventually." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-200", "text": "Original I Dakota Fanning.....I know that she is only 12 but she is really famous." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-201", "text": "2 NMT-Combined F Dakota Fanning.i know that she is only twelve, but she is famous." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-202", "text": "MultiTask-tag-style F I know that Dakota Fanning is only twelve, but she is really famous." 
}, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-203", "text": "Original I depends....usully they are about ur personailty but not wat ur gonna do iwith ur life." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-204", "text": "3 NMT-Combined F Depends.usully they are about your personailty, but not what your going to do iwith your life." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-205", "text": "MultiTask-tag-style F It depends." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-206", "text": "They are about your personality, but not what you are going to do with your life." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-207", "text": "Original I THAT DEPENDS...ARE YOU A HOTTIE W/A BODY? 4 NMT-Combined F That depends, are you a hottie with a body?" }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-208", "text": "MultiTask-tag-style F That depends." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-209", "text": "Are you a HOTTIE W / A BODY? 3b: formal\u2192informal Original F Therefore I would say that they do succeed but not frequently." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-210", "text": "I hope this is helpful." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-211", "text": "1 NMT-Combined I So I would say that they do failing but not frequently, I hope this is helps." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-212", "text": "MultiTask-tag-style I so i would say they do it but not all the time, hope this helps." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-213", "text": "Original F I am simply inquiring because people behave as though they are no longer interested in them." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-214", "text": "2 NMT-Combined I I am just asking because people act as though they are no longer interested in them." 
}, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-215", "text": "MultiTask-tag-style I I'm just asking because people act like they don't like them anymore." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-216", "text": "Original F Hello, I am interested in visiting your country." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-217", "text": "3 NMT-Combined I Hi, I'm interested in visiting your country." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-218", "text": "MultiTask-tag-style I hi, I'm going to go to your country." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-219", "text": "phrases around (second sample in 3a), both of which happen frequently in machine translation, thus showcasing the benefit of our multi-task approach." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-220", "text": "Our model very often makes the output sentence more complete (and thereby more formal) by inserting pronouns like 'it', 'they' at the start of the sentence or by removing sentence-initial words like 'usually', 'and', 'but', 'however' (sample three in 3a)." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-221", "text": "Likewise, in the F\u2192I task, our model produces more informal sentences compared to the baseline by introducing more phrasal level changes (first and second sample in 3b)." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-222", "text": "Error analysis: In the I\u2192F task, our model performs worse than the baseline when the original informal sentence consists of all uppercased words (fourth sample in 3a)." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-223", "text": "This is primarily because the baseline model pre-lowercases them using rules." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-224", "text": "In contrast, we rely on the model to learn this transformation, and so it fails to do so for less frequent words." 
}, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-225", "text": "In the F\u2192I task, in trying to produce more informal outputs, our model sometimes fails to preserve the original meaning of the sentence (third sample in 3b)." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-226", "text": "In both tasks, very often our model fails to make transformations for some pairs like ('girls', 'women'), which the baseline model is very good at." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-227", "text": "We hypothesize that this could be because for these pairs, human rewriters do not always agree on one of the words in the pair being more informal/formal." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-228", "text": "This makes our model more conservative in making changes because our bi-directional model combines FT data from both directions and when the original data contains instances where these words are not changed, we double that and learn to copy the word more often than change it." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-229", "text": "Table 4 : BLEU scores of various FSMT models." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-230", "text": "\"+Tag\" indicates using formality tags for bilingual data while \"Random\" indicates using randomly selected bilingual data." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-231", "text": "translation hypotheses by their closeness to the desired formality level." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-232", "text": "We adapt this system to our evaluation scenario: we calculate median scores for the informal and formal data in GYAFC (i.e. \u22120.41 and \u22120.27 respectively) with a PCA-LSA-based formality model and use them as desired formality levels." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-233", "text": "The bilingual training data is randomly selected." 
}, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-234", "text": "----------------------------------" }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-235", "text": "**RESULTS**" }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-236", "text": "Automatic Evaluation." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-237", "text": "We compute BLEU scores on the held out test set for all models as a sanity check on translation quality." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-238", "text": "Because there is only one reference translation of unknown style for each input sentence, these BLEU scores conflate translation errors and stylistic mismatch, and are therefore not sufficient to evaluate FSMT performance." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-239", "text": "We include them for completeness here, as indicators of general translation quality, and will rely on human evaluation as the primary evaluation method." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-240", "text": "As can be seen in Table 4 , changing the formality level for a given system yields only small differences in BLEU." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-241", "text": "Based on BLEU scores, we select MultiTask-random as the representative of multi-task FSMT and compare it with NMT-constraint and PBMT-random during our human evaluation." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-242", "text": "Human Evaluation." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-244", "text": "They also introduce more changes in translation: with NMT models, \u223c80% of outputs change when only the input formality changes, while that is only the case for \u223c30% of outputs with PBMT-random." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-245", "text": "Among neural models, MultiTask-random and NMT-constraint have similar quality in controlling output formality (0.32 vs. 0.35) and preserving meaning (2.90 vs. 2.95)." 
}, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-246", "text": "They are also equally fluent as judged by humans." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-247", "text": "Interestingly, multi-task learning helps MultiTask-random perform as well as NMT-constraint while using simpler training examples that do not require the additional data-selection step to generate formality tags." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-248", "text": "----------------------------------" }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-249", "text": "**QUALITATIVE ANALYSIS**" }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-250", "text": "We randomly sample 100 examples from our test set and manually compare the formal and the informal translations of the French source by MultiTask-random, NMT-constraint and PBMT-random." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-251", "text": "Table 5 shows representative examples of the observed trends." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-252", "text": "We find that in most cases, the difference between the formal and informal style translations is very minor with the PBMT-random model, more pronounced with the NMT-constraint model, and most pronounced with our MultiTask-random model (first sample in the table)." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-253", "text": "In general, our MultiTask-random model does a good job of making very large changes while transferring the style, especially into informal (second sample in the table)." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-254", "text": "We hypothesize that this is because our joint model is trained on the GYAFC corpus which consists of parallel sentences that differ heavily in style." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-255", "text": "Error analysis: All FSMT models perform well in terms of meaning preservation, yet the human scores are not perfect (Table 2) ." 
}, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-256", "text": "They occasionally change not only the style but also the meaning of the input (e.g. the third sample of MultiTask-random in Table 5 )." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-257", "text": "This motivates future work that penalizes meaning changes more explicitly during training." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-258", "text": "In general, none of the models do a good job of changing the style when the source sentence is not skewed toward one style." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-259", "text": "For example, consider the French sentence \"Combien de fois vous l'ai-je dit?\" and its English reference translation \"How many times have I told you, right?\"." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-260", "text": "All models produce the same translation \"How many times did I tell you?\"." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-261", "text": "In such cases, changing style requires heavier editing or paraphrasing of the source sentence that our current models are unable to produce." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-262", "text": "----------------------------------" }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-263", "text": "**CONCLUSION**" }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-264", "text": "We explored the use of multi-task learning to jointly perform monolingual FT and bilingual FSMT." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-265", "text": "Using French-English translation and English style transfer data, we showed that the joint model is able to learn from both style transfer parallel examples and translation parallel examples." 
}, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-266", "text": "On the FT task, the joint model significantly improves the quality of transfer between formal and informal styles in both directions, compared to prior work (Rao and Tetreault, 2018)." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-267", "text": "The joint model interestingly also learns to perform FSMT without being explicitly trained on style-annotated translation examples." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-268", "text": "On the FSMT task, our model outperforms a previously proposed phrase-based MT model, and performs on par with a neural model with side-constraints which requires more involved data selection." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-269", "text": "These results show the promise of multi-task learning for controlling style in language generation applications." }, { "sent_id": "ab8c43bf5a37c436d166960af459a8-C001-270", "text": "In future work, we plan to investigate other multi-task architectures and objective functions that better capture the desired output properties, in order to help address current weaknesses such as meaning errors revealed by manual analysis." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "ab8c43bf5a37c436d166960af459a8-C001-10" ], [ "ab8c43bf5a37c436d166960af459a8-C001-35" ], [ "ab8c43bf5a37c436d166960af459a8-C001-61" ], [ "ab8c43bf5a37c436d166960af459a8-C001-121" ], [ "ab8c43bf5a37c436d166960af459a8-C001-152" ], [ "ab8c43bf5a37c436d166960af459a8-C001-154" ] ], "cite_sentences": [ "ab8c43bf5a37c436d166960af459a8-C001-10", "ab8c43bf5a37c436d166960af459a8-C001-35", "ab8c43bf5a37c436d166960af459a8-C001-61", "ab8c43bf5a37c436d166960af459a8-C001-121", "ab8c43bf5a37c436d166960af459a8-C001-152", "ab8c43bf5a37c436d166960af459a8-C001-154" ] }, "@USE@": { "gold_contexts": [ [ "ab8c43bf5a37c436d166960af459a8-C001-35", "ab8c43bf5a37c436d166960af459a8-C001-36" ], [ "ab8c43bf5a37c436d166960af459a8-C001-61", "ab8c43bf5a37c436d166960af459a8-C001-62" ], [ "ab8c43bf5a37c436d166960af459a8-C001-99" ], [ "ab8c43bf5a37c436d166960af459a8-C001-120", "ab8c43bf5a37c436d166960af459a8-C001-121" ], [ "ab8c43bf5a37c436d166960af459a8-C001-126" ], [ "ab8c43bf5a37c436d166960af459a8-C001-129" ], [ "ab8c43bf5a37c436d166960af459a8-C001-152" ], [ "ab8c43bf5a37c436d166960af459a8-C001-195" ] ], "cite_sentences": [ "ab8c43bf5a37c436d166960af459a8-C001-35", "ab8c43bf5a37c436d166960af459a8-C001-61", "ab8c43bf5a37c436d166960af459a8-C001-99", "ab8c43bf5a37c436d166960af459a8-C001-121", "ab8c43bf5a37c436d166960af459a8-C001-126", "ab8c43bf5a37c436d166960af459a8-C001-129", "ab8c43bf5a37c436d166960af459a8-C001-152", "ab8c43bf5a37c436d166960af459a8-C001-195" ] }, "@DIF@": { "gold_contexts": [ [ "ab8c43bf5a37c436d166960af459a8-C001-169" ], [ "ab8c43bf5a37c436d166960af459a8-C001-266" ] ], "cite_sentences": [ "ab8c43bf5a37c436d166960af459a8-C001-169", "ab8c43bf5a37c436d166960af459a8-C001-266" ] } } }, "ABC_4f111ff06afd5523d65fc1d1a9ff83_14": { "x": [ { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-57", "text": "**EXACT DECODING VIA ILP**" }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-81", "text": "**RUNTIME EXPERIMENTS**" }, { "sent_id": 
"4f111ff06afd5523d65fc1d1a9ff83-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-2", "text": "The task of aligning corresponding phrases across two related sentences is an important component of approaches for natural language problems such as textual inference, paraphrase detection and text-to-text generation." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-3", "text": "In this work, we examine a state-of-the-art structured prediction model for the alignment task which uses a phrase-based representation and is forced to decode alignments using an approximate search approach." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-4", "text": "We propose instead a straightforward exact decoding technique based on integer linear programming that yields order-of-magnitude improvements in decoding speed." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-5", "text": "This ILP-based decoding strategy permits us to consider syntacticallyinformed constraints on alignments which significantly increase the precision of the model." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-6", "text": "----------------------------------" }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-8", "text": "Natural language processing problems frequently involve scenarios in which a pair or group of related sentences need to be aligned to each other, establishing links between their common words or phrases." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-9", "text": "For instance, most approaches for natural language inference (NLI) rely on alignment techniques to establish the overlap between the given premise and a hypothesis before determining if the former entails the latter." 
}, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-10", "text": "Such monolingual alignment techniques are also frequently employed in systems for paraphrase generation, multi-document summarization, sentence fusion and question answering." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-11", "text": "Previous work (MacCartney et al., 2008) has presented a phrase-based monolingual aligner for NLI (MANLI) that has been shown to significantly outperform a token-based NLI aligner (Chambers et al., 2007) as well as popular alignment techniques borrowed from machine translation (Och and Ney, 2003; Liang et al., 2006) ." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-12", "text": "However, MANLI's use of a phrase-based alignment representation appears to pose a challenge to the decoding task, i.e. the task of recovering the highest-scoring alignment under some parameters." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-13", "text": "Consequently, MacCartney et al. (2008) employ a stochastic search algorithm to decode alignments approximately while remaining consistent with regard to phrase segmentation." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-79", "text": "It is important to note that the specific implementations being compared 2 may be responsible for the relative speed of decoding." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-14", "text": "In this paper, we propose an exact decoding technique for MANLI that retrieves the globally optimal alignment for a sentence pair given some parameters." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-15", "text": "Our approach is based on integer linear programming (ILP) and can leverage optimized general-purpose LP solvers to recover exact solutions." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-16", "text": "This strategy boosts decoding speed by an order of magnitude over stochastic search in our experiments." 
}, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-17", "text": "Additionally, we introduce hard syntactic constraints on alignments produced by the model, yielding better precision and a large increase in the number of perfect alignments produced over our evaluation corpus." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-18", "text": "----------------------------------" }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-19", "text": "**RELATED WORK**" }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-20", "text": "Alignment is an integral part of statistical MT (Vogel et al., 1996; Och and Ney, 2003; Liang et al., 2006) but the task is often substantively different from monolingual alignment, which poses unique challenges depending on the application (MacCartney et al., 2008) ." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-21", "text": "Outside of NLI, prior research has also explored the task of monolingual word align-ment using extensions of statistical MT (Quirk et al., 2004) and multi-sequence alignment (Barzilay and Lee, 2002) ." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-22", "text": "ILP has been used extensively for applications ranging from text-to-text generation (Clarke and Lapata, 2008; Filippova and Strube, 2008; Woodsend et al., 2010) to dependency parsing (Martins et al., 2009 )." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-23", "text": "It has also been recently employed for finding phrase-based MT alignments (DeNero and Klein, 2008) in a manner similar to this work; however, we further build upon this model through syntactic constraints on the words participating in alignments." 
}, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-24", "text": "----------------------------------" }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-25", "text": "**THE MANLI ALIGNER**" }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-26", "text": "Our alignment system is structured identically to MANLI (MacCartney et al., 2008) and uses the same phrase-based alignment representation." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-27", "text": "An alignment E between two fragments of text T 1 and T 2 is represented by a set of edits {e 1 , e 2 , . . .}, each belonging to one of the following types:" }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-28", "text": "\u2022 INS and DEL edits covering unaligned words in T 1 and T 2 respectively \u2022 SUB and EQ edits connecting a phrase in T 1 to a phrase in T 2 ." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-29", "text": "EQ edits are a specific case of SUB edits that denote a word/lemma match; we refer to both types as SUB edits in this paper." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-30", "text": "Every token in T 1 and T 2 participates in exactly one edit." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-31", "text": "While alignments are one-to-one at the phrase level, a phrase-based representation effectively permits many-to-many alignments at the token level." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-32", "text": "This enables the aligner to properly link paraphrases such as death penalty and capital punishment by exploiting lexical resources." 
}, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-33", "text": "----------------------------------" }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-34", "text": "**DATASET**" }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-56", "text": "----------------------------------" }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-35", "text": "MANLI was trained and evaluated on a corpus of human-generated alignment annotations produced by Microsoft Research (Brockett, 2007) for inference problems from the second Recognizing Textual Entailment (RTE2) challenge (Bar-Haim et al., 2006) ." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-36", "text": "The corpus consists of a development set and test set that both feature 800 inference problems, each of which consists of a premise, a hypothesis and three independently-annotated human alignments." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-37", "text": "In our experiments, we merge the annotations using majority rule in the same manner as MacCartney et al. (2008) ." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-38", "text": "----------------------------------" }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-39", "text": "**FEATURES**" }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-40", "text": "A MANLI alignment is scored as a sum of weighted feature values over the edits that it contains." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-41", "text": "Features encode the type of edit, the size of the phrases involved in SUB edits, whether the phrases are constituents and their similarity (determined by leveraging various lexical resources)." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-42", "text": "Additionally, contextual features note the similarity of neighboring words and the relative positions of phrases while a positional distortion feature accounts for the difference between the relative positions of SUB edit phrases in their respective sentences." 
}, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-43", "text": "Our implementation uses the same set of features as MacCartney et al. (2008) with some minor changes: we use a shallow parser (Daum\u00e9 and Marcu, 2005) for detecting constituents and employ only string similarity and WordNet for determining semantic relatedness, forgoing NomBank and the distributional similarity resources used in the original MANLI implementation." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-44", "text": "----------------------------------" }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-45", "text": "**PARAMETER INFERENCE**" }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-46", "text": "Feature weights are learned using the averaged structured perceptron algorithm (Collins, 2002) , an intuitive structured prediction technique." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-47", "text": "We deviate from MacCartney et al. (2008) and do not introduce L2 normalization of weights during learning as this could have an unpredictable effect on the averaged parameters." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-48", "text": "For efficiency reasons, we parallelize the training procedure using iterative parameter mixing (McDonald et al., 2010) in our experiments." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-49", "text": "----------------------------------" }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-50", "text": "**DECODING**" }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-51", "text": "The decoding problem is that of finding the highestscoring alignment under some parameter values for the model." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-52", "text": "MANLI's phrase-based representation makes decoding more complex because the segmentation of T 1 and T 2 into phrases is not known beforehand." 
}, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-53", "text": "Every pair of phrases considered for inclusion in an alignment must adhere to some consistent segmentation so that overlapping edits and uncovered words are avoided." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-54", "text": "Consequently, the decoding problem cannot be factored into a number of independent decisions and MANLI searches for a good alignment using a stochastic simulated annealing strategy." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-55", "text": "While seemingly quite effective at avoiding local maxima, this iterative search strategy is computationally expensive and moreover is not guaranteed to return the highest-scoring alignment under the parameters." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-58", "text": "Instead of resorting to approximate solutions, we can simply reformulate the decoding problem as the optimization of a linear objective function with linear constraints, which can be solved by well-studied algorithms using off-the-shelf solvers 1 ." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-59", "text": "We first define boolean indicator variables x e for every possible edit e between T 1 and T 2 that indicate whether e is present in the alignment or not." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-60", "text": "The linear objective that maximizes the score of edits for a given parameter vector w is expressed as follows:" }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-61", "text": "where \u03a6(e) is the feature vector over an edit." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-62", "text": "This expresses the score of an alignment as the sum of scores of edits that are present in it, i.e., edits e that have x e = 1." 
}, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-63", "text": "In order to address the phrase segmentation issue discussed in \u00a73.4, we merely need to add linear constraints ensuring that every token participates in exactly one edit." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-64", "text": "Introducing the notation e \u227a t to indicate that edit e covers token t in one of its phrases, this constraint can be encoded as: highest-scoring alignment under w. A similar approach is employed by DeNero and Klein (2008) for finding optimal phrase-based alignments for MT." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-65", "text": "----------------------------------" }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-66", "text": "**ALIGNMENT EXPERIMENTS**" }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-67", "text": "For evaluation purposes, we compare the performance of approximate search decoding against exact ILP-based decoding on a reimplementation of MANLI as described in \u00a73." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-68", "text": "All models are trained on the development section of the Microsoft Research RTE2 alignment corpus (cf." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-69", "text": "\u00a73.1) using the training parameters specified in MacCartney et al. (2008) ." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-70", "text": "Aligner performance is determined by counting aligned token pairs per problem and macro-averaging over all problems." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-71", "text": "The results are shown in Table 1 ." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-72", "text": "We first observe that our reimplemented version of MANLI improves over the results reported in MacCartney et al. (2008) , gaining 2% in precision, 1% in recall and 2-3% in the fraction of alignments that exactly matched human annotations." 
}, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-73", "text": "We attribute at least some part of this gain to our modified parameter inference (cf." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-74", "text": "\u00a73.3) which avoids normalizing the structured perceptron weights and instead adheres closely to the algorithm of Collins (2002) ." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-75", "text": "Although exact decoding improves alignment performance over the approximate search approach, the gain is marginal and not significant." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-76", "text": "This seems to indicate that the simulated annealing search strategy is fairly effective at avoiding local maxima and finding the highest-scoring alignments." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-77", "text": "Table 2 contains the results from timing alignment tasks over various corpora on the same machine using the models trained as per \u00a74.1." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-78", "text": "We observe a twenty-fold improvement in performance with ILPbased decoding." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-80", "text": "----------------------------------" }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-82", "text": "The short hypotheses featured in the RTE2 corpus (averaging 11 words) dampen the effect of the quadratic growth in number of edits with sentence length." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-83", "text": "For this reason, we also run the aligners on a corpus of 297 related sentence pairs which don't have a particular disparity in sentence lengths (McKeown et al., 2010) ." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-84", "text": "The large difference in decoding time illustrates the scaling limitations of the searchbased decoder." 
}, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-85", "text": "----------------------------------" }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-86", "text": "**SYNTACTICALLY-INFORMED CONSTRAINTS**" }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-87", "text": "The use of an integer program for decoding provides us with a convenient mechanism to prevent common alignment errors by introducting additional constraints on edits." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-88", "text": "For example, function words such as determiners and prepositions are often misaligned just because they occur frequently in many different contexts." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-89", "text": "Although MANLI makes use of contextual features which consider the similarity of neighboring words around phrase pairs, outof-context alignments of function words often appear in the output." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-90", "text": "We address this issue by adding constraints to the integer program from \u00a74 that look at the syntactic structure of T 1 and T 2 and prevent matching function words from appearing in an alignment unless they are syntactically linked with other words that are aligned." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-91", "text": "To enforce token-based constraints, we define boolean indicator variables y t for each token t in text snippets T 1 and T 2 that indicate whether t is involved in a SUB edit or not." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-92", "text": "The following constraint ensures that y t = 1 if and only if it is covered by a SUB edit that is present in the alignment." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-93", "text": "We refer to tokens t with y t = 1 as being active in the alignment." 
}, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-94", "text": "Constraints can now be applied over any token with specific part-of-speech (POS) tag in order to ensure that it can only be active if a different token related to it in a dependency parse of the sentence is also active." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-95", "text": "We consider the following classes of constraints:" }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-96", "text": "Modifier constraints: Tokens t that represent conjunctions, determiners, modals and cardinals can only be active if their parent tokens \u03c0(t) are active." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-97", "text": "Lineage constraints: Tokens t that represent prepositions and particles (which are often confused by parsers) can only be active if one of their ancestors \u03b1(t) or descendants \u03b4(t) is active." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-98", "text": "These constraints are less restrictive than the modifier constraints in order to account for attachment errors." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-99", "text": "----------------------------------" }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-100", "text": "**ALIGNMENT EXPERIMENTS**" }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-101", "text": "A TAG-based probabilistic dependency parser (Bangalore et al., 2009 ) is used to formulate the above constraints in our experiments." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-102", "text": "The results are shown in Table 3 and indicate a notable increase in alignment precision, which is to be expected as the constraints specifically seek to exclude poor edits." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-103", "text": "Despite the simple and overly general restrictions being applied, recall is almost unaffected." 
}, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-104", "text": "Most compellingly, the number of perfect alignments produced by the system increases significantly when compared to the unconstrained models from Table 1 (a relative increase of 35% on the test corpus)." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-105", "text": "----------------------------------" }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-106", "text": "**DISCUSSION**" }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-107", "text": "The results of our evaluation indicate that exact decoding via ILP is a robust and efficient technique for solving alignment problems." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-108", "text": "Furthermore, the incorporation of simple constraints over a dependency parse can help to shape more accurate alignments." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-109", "text": "An examination of the alignments produced by our system reveals that many remaining errors can be tackled by the use of named-entity recognition and better paraphrase corpora; this was also noted by MacCartney et al. (2008) with regard to the original MANLI system." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-110", "text": "In addition, stricter constraints that enforce the alignment of syntactically-related tokens (rather than just their inclusion in the solution) may also yield performance gains." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-111", "text": "Although MANLI's structured prediction approach to the alignment problem allows us to encode preferences as features and learn their weights via the structured perceptron, the decoding constraints used here can be used to establish dynamic links between alignment edits which cannot be determined a priori." 
}, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-112", "text": "The interaction between the selection of soft features for structured prediction and hard constraints for decoding is an interesting avenue for further research on this task." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-113", "text": "Initial experiments with a feature that considers the similarity of dependency heads of tokens in an edit (similar to MANLI's contextual features that look at preceding and following words) yielded some improvement over the baseline models; however, this did not perform as well as the simple constraints described above." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-114", "text": "Specific features that approximate soft variants of these constraints could also be devised but this was not explored here." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-115", "text": "In addition to the NLI applications considered in this work, we have also employed the MANLI alignment technique to tackle alignment problems that are not inherently asymmetric such as the sentence fusion problems from McKeown et al. (2010) ." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-116", "text": "Although the absence of asymmetric alignment features affects performance marginally over the RTE2 dataset, all the performance gains exhibited by exact decoding with constraints appear to be preserved in symmetric settings." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-117", "text": "----------------------------------" }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-118", "text": "**CONCLUSION**" }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-119", "text": "We present a simple exact decoding technique as an alternative to approximate search-based decoding in MANLI that exhibits a twenty-fold improvement in runtime performance in our experiments." 
}, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-120", "text": "In addition, we propose novel syntactically-informed constraints to increase precision." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-121", "text": "Our final system improves over the results reported in MacCartney et al. (2008) by about 4.5% in precision and 1% in recall, with a large gain in the number of perfect alignments over the test corpus." }, { "sent_id": "4f111ff06afd5523d65fc1d1a9ff83-C001-122", "text": "Finally, we analyze the alignments produced and suggest that further improvements are possible through careful feature/constraint design, as well as the use of named-entity recognition and additional resources." } ], "y": { "@BACK@": { "gold_contexts": [ [ "4f111ff06afd5523d65fc1d1a9ff83-C001-11", "4f111ff06afd5523d65fc1d1a9ff83-C001-13" ], [ "4f111ff06afd5523d65fc1d1a9ff83-C001-20" ] ], "cite_sentences": [ "4f111ff06afd5523d65fc1d1a9ff83-C001-11", "4f111ff06afd5523d65fc1d1a9ff83-C001-13", "4f111ff06afd5523d65fc1d1a9ff83-C001-20" ] }, "@MOT@": { "gold_contexts": [ [ "4f111ff06afd5523d65fc1d1a9ff83-C001-11", "4f111ff06afd5523d65fc1d1a9ff83-C001-12", "4f111ff06afd5523d65fc1d1a9ff83-C001-13" ], [ "4f111ff06afd5523d65fc1d1a9ff83-C001-20" ] ], "cite_sentences": [ "4f111ff06afd5523d65fc1d1a9ff83-C001-11", "4f111ff06afd5523d65fc1d1a9ff83-C001-13", "4f111ff06afd5523d65fc1d1a9ff83-C001-20" ] }, "@USE@": { "gold_contexts": [ [ "4f111ff06afd5523d65fc1d1a9ff83-C001-26" ], [ "4f111ff06afd5523d65fc1d1a9ff83-C001-37" ], [ "4f111ff06afd5523d65fc1d1a9ff83-C001-43" ], [ "4f111ff06afd5523d65fc1d1a9ff83-C001-68", "4f111ff06afd5523d65fc1d1a9ff83-C001-69" ] ], "cite_sentences": [ "4f111ff06afd5523d65fc1d1a9ff83-C001-26", "4f111ff06afd5523d65fc1d1a9ff83-C001-37", "4f111ff06afd5523d65fc1d1a9ff83-C001-43", "4f111ff06afd5523d65fc1d1a9ff83-C001-69" ] }, "@EXT@": { "gold_contexts": [ [ "4f111ff06afd5523d65fc1d1a9ff83-C001-43" ], [ "4f111ff06afd5523d65fc1d1a9ff83-C001-47" ] ], "cite_sentences": [ 
"4f111ff06afd5523d65fc1d1a9ff83-C001-43", "4f111ff06afd5523d65fc1d1a9ff83-C001-47" ] }, "@DIF@": { "gold_contexts": [ [ "4f111ff06afd5523d65fc1d1a9ff83-C001-47" ], [ "4f111ff06afd5523d65fc1d1a9ff83-C001-72" ], [ "4f111ff06afd5523d65fc1d1a9ff83-C001-121" ] ], "cite_sentences": [ "4f111ff06afd5523d65fc1d1a9ff83-C001-47", "4f111ff06afd5523d65fc1d1a9ff83-C001-72", "4f111ff06afd5523d65fc1d1a9ff83-C001-121" ] }, "@SIM@": { "gold_contexts": [ [ "4f111ff06afd5523d65fc1d1a9ff83-C001-109" ] ], "cite_sentences": [ "4f111ff06afd5523d65fc1d1a9ff83-C001-109" ] }, "@FUT@": { "gold_contexts": [ [ "4f111ff06afd5523d65fc1d1a9ff83-C001-109" ] ], "cite_sentences": [ "4f111ff06afd5523d65fc1d1a9ff83-C001-109" ] } } }, "ABC_d8168f4596878807d22ddc7474ffc8_14": { "x": [ { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-14", "text": "We experiment with four UtoAztecan languages: Mexicanero (MX), Nahuatl (NH), Wixarika (WX) and Yorem Nokki (YN) (Kann et al., 2018) ." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-2", "text": "Polysynthetic languages pose a challenge for morphological analysis due to the rootmorpheme complexity and to the word class \"squish\"." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-3", "text": "In addition, many of these polysynthetic languages are low-resource." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-4", "text": "We propose unsupervised approaches for morphological segmentation of low-resource polysynthetic languages based on Adaptor Grammars (AG) (Eskander et al., 2016) ." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-5", "text": "We experiment with four languages from the Uto-Aztecan family." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-6", "text": "Our AG-based approaches outperform other unsupervised approaches and show promise when compared to supervised methods, outperforming them on two of the four languages." 
}, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-7", "text": "----------------------------------" }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-9", "text": "Computational morphology of polysynthetic languages is an emerging field of research." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-10", "text": "Polysynthetic languages pose unique challenges for computational approaches, including machine translation and morphological analysis, due to the rootmorpheme complexity and to word class gradations (Homola, 2011; Mager et al., 2018d; Klavans, 2018a) ." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-11", "text": "Previous approaches include rule-based methods based on finite state transducers (Farley, 2009; Littell, 2018; Kazeminejad et al., 2017) , hybrid models (Mager et al., 2018b; Moeller et al., 2018) , and supervised machine learning, particularly deep learning approaches (Micher, 2017; Kann et al., 2018) ." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-12", "text": "While each rule-based method is developed for a specific language (Inuktitut (Farley, 2009 ), or Arapaho (Littell, 2018 Moeller et al., 2018) ), machine learning, including deep learning approaches, might be more rapidly scalable to many additional languages." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-13", "text": "We propose an unsupervised approach for morphological segmentation of polysynthetic languages based on Adaptor Grammars (Johnson et al., 2007) ." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-15", "text": "Adaptor Grammars (AGs) are nonparametric Bayesian models that generalize probabilistic context free grammars (PCFG), and have proven to be successful for unsupervised morphological segmentation, where a PCFG is a morphological grammar that specifies word structure (Johnson, 2008; Sirts and Goldwater, 2013; Eskander et al., 2016 Eskander et al., , 2018 ." 
}, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-16", "text": "Our main goal is to examine the success of Adaptor Grammars for unsupervised morphological segmentation when applied to polysynthetic languages, where the morphology is synthetically complex (not simply agglutinative), and where resources are minimal." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-17", "text": "We use the datasets introduced by Kann et al. (2018) in an unsupervised fashion (unsegmented words)." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-18", "text": "We design several AG learning setups: 1) use the best-on-average AG setup from Eskander et al. (2016) ; 2) optimize for language using just the small training vocabulary (unsegmented) and dev vocabulary (segmented) from Kann et al. (2018) ; 3) approximate the effect of having some linguistic knowledge; 4) learn from all languages at once and 5) add additional unsupervised data for NH and WX (Section 3)." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-19", "text": "We show that the AG-based approaches outperform other unsupervised methods -M orf essor (Creutz and Lagus, 2007) and M orphoChain (Narasimhan et al., 2015) ) -, and that for two of the languages (NH and YN), the best AG-based approaches outperform the best supervised methods (Section 4)." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-20", "text": "----------------------------------" }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-21", "text": "**LANGUAGES AND DATASETS**" }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-22", "text": "Typically, polysynthetic languages demonstrate holophrasis, i.e. the ability of an entire sentence to be expressed as what is considered by native speakers to be just one word." 
}, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-23", "text": "To illustrate, consider the following example from Inuktitut (Kla-vans, 2018b) , where the morpheme -tusaa-is the root and all the other morphemes are synthetically combined with it in one unit:" }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-24", "text": "tusaa-tsia-runna-nngit-tu-alu-u-jung hear-well-be.able-NEG-DOE-very-BE-PT.1S I can't hear very well." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-25", "text": "Another example from WX, one of the languages in the dataset for this paper (from (Mager et al., 2018c) ) shows this complexity:" }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-26", "text": "yu-huta-me ne-p+-we-iwa an-two-ns 1sg:s-asi-2pl:o-brother I have two brothers." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-27", "text": "In linguistic typology, the broader gradient is: isolating/analytic to synthetic to polysynthetic." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-28", "text": "Agglutinating refers to the clarity of boundaries between morphemes." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-29", "text": "This more specific gradation is: agglutinating to mildly fusional to fusional." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-30", "text": "Thus a language might be characterized overall as polysynthetic and agglutinating, i.e. generally a high number of morphemes per word, with clear boundaries between morphemes and thus easily segmentable." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-31", "text": "Another language might be characterized as polysynthetic and fusional, so again, many morphemes per word, but many phonological and other processes so it is difficult to segment morphemes." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-32", "text": "Thus, morphological analysis of polysynthetic languages is challenging due to the rootmorpheme complexity and to word class gradations." 
}, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-33", "text": "Linguists recognize a gradience in word classes, known as \"squishiness\", a term first discussed in Ross (1972) who argued that, instead of a fixed, distinct inventory of syntactic categories, a quasi-continuum from verb, adjective and noun best reflects most lexical distinctions." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-34", "text": "The rootmorpheme complexity and the word class \"squish\" makes developing segmented training data with reliability across annotators difficult to achieve." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-35", "text": "Kann et al. (2018) have made a first step by releasing a small set of morphologically segmented datasets although even in these carefully curated datasets, the distinction between affix and clitic is not always indicated." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-36", "text": "We use these datasets in an unsupervised fashion (i.e., we use the unsegmented words)." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-37", "text": "These datasets were taken from detailed descriptions in the Archive of Indigenous Languages collection for MX (Canger, 2001 ), NH (de Su\u00e1rez, 1980) , WX (G\u00f3mez and L\u00f3pez, 1999) , and YN (Freeze, 1989) ." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-38", "text": "They were constructed so they include both segmentable as well as non- Kann et al. (2018) , for training we do not use the segmented version of the data (our approach is unsupervised)." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-39", "text": "In addition to the datasets, for NH and WX we also have available the Bible (Christodouloupoulos and Steedman, 2015; Mager et al., 2018a ), which we consider for one of our experimental setups as additional training data." 
}, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-40", "text": "In the dataset from (Kann et al., 2018) , the maximum number of morphemes per word for MX is seven with an average of 2.13; for NH, six with an average of 2.2; for WX, maximum of ten with an average of 3.3; and for YN, the maximum is ten, with an average of 2.13." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-41", "text": "----------------------------------" }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-42", "text": "**USING ADAPTOR GRAMMARS FOR POLYSYNTHETIC LANGUAGES**" }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-43", "text": "An Adaptor Grammar is typically composed of a PCFG and an adaptor that adapts the probabilities of individual subtrees." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-44", "text": "For morphological segmentation, a PCFG is a morphological grammar that specifies word structure, where AGs learn latent tree structures given a list of words." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-45", "text": "In this paper, we experiment with the grammars and the learning setups proposed by Eskander et al. (2016) , which we outline briefly below." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-46", "text": "Grammars." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-47", "text": "We use the nine grammars from Eskander et al. (2016 Eskander et al. ( , 2018 that were designed based on three dimensions: 1) how the grammar models word structure (e.g., prefix-stem-suffix vs. morphemes), 2) the level of abstraction in nonterminals (e.g., compounds, morphemes and submorphemes) and 3) how the output boundaries are specified (see Table 2 for a sample grammars)." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-48", "text": "For example, the PrStSu+SM grammar models the Table 2 : Sample grammar setups used by Eskander et al. (2018 Eskander et al. ( , 2016 ." 
}, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-49", "text": "Compound = Upper level representation of the word as a sequence of compounds; Morph = affix/morpheme representation as a sequence of morphemes." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-50", "text": "SubMorph (SM) = Lower level representation of characters as a sequence of sub-morphemes. \"+\" denotes one or more." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-51", "text": "word as a complex prefix, a stem and a complex suffix, where the complex prefix and suffix are composed of zero or more morphemes, and a morpheme is a sequence of sub-morphemes." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-52", "text": "The boundaries in the output are based on the prefix, stem and suffix levels." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-53", "text": "Learning Settings." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-54", "text": "The input to the learner is a grammar and a vocabulary of unsegmented words." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-120", "text": "For NH, the morpheme tla has a high degree of ambiguity at 79.12%, which lead the model to fail in recognizing it as an affix (see an example in Table 6 )." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-55", "text": "We consider the three learning settings in (Eskander et al., 2016) : Standard, Scholarseeded Knowledge and Cascaded." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-56", "text": "The Standard setting is language-independent and fully unsupervised, while in the Scholar-seeded-Knowledge setting, some linguistic knowledge (in the form of affixes taken from grammar books) is seeded into the grammar trees before learning takes place." 
}, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-57", "text": "The Cascaded setting simulates the effect of seeding scholar knowledge in a language-independent manner by first running an AG of high precision to derive a set of affixes, and then seeding those affixes into the grammars." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-58", "text": "----------------------------------" }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-59", "text": "**AG SETUPS FOR POLYSYNTHETIC LANGUAGES**" }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-60", "text": "We experimented with several setups using AGs for unsupervised segmentation." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-61", "text": "Language-Independent Morphological Segmenter." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-62", "text": "LIMS is the best-on-average AG setup obtained by Eskander et al. (2016) when trained on six languages (English, German, Finnish, Estonian, Turkish and Zulu), which is the Cascaded PrStSu+SM configuration." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-63", "text": "We use this AG setup for each of the four languages." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-64", "text": "We refer to this system as AG LIM S ." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-65", "text": "Best AG Configuration per Language." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-66", "text": "In this experimental setup, we consider all nine grammars from Eskander et al. (2016) using both the Standard and the Cascaded approaches and choosing the one that is best for each polysynthetic language by training on the training set and evaluating on the development set." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-67", "text": "We denote this system as AG BestL ." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-68", "text": "Using Seeded Knowledge." 
}, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-69", "text": "To approximate the effect of Scholar-seeded-Knowledge in Eskander et al. (2016), we used the training set to derive affixes and use them as scholar-seeded knowledge added to the grammars (before the learning happens)." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-70", "text": "However, since affixes and stems are not distinguished in the training annotations from Kann et al. (2018) , we only consider the first and last morphemes that appear at least five times." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-71", "text": "We call this setup AG Scholar BestL ." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-72", "text": "Multilingual Training." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-73", "text": "Since the vocabulary in Kann et al. (2018) for each language is small, and the languages are from the same language family, one data augmentation approach is to train on all languages and test then on each language individually." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-74", "text": "We call this setup AG M ulti ." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-75", "text": "Data Augmentation." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-76", "text": "In this setup, we examine the performance of the best AG configuration per language (AG BestL ) when more data is available." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-77", "text": "We merge the training corpus with unique words in the New Testament of the Bible (train Bible )." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-78", "text": "We run this only on NH and WX since the Bible text is only available for these two languages." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-79", "text": "We denote this setup as AG Aug ." 
}, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-80", "text": "----------------------------------" }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-81", "text": "**EVALUATION AND DISCUSSION**" }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-82", "text": "We evaluate the different AG setups on the blind test set from Kann et al. (2018) and compare our AG approaches to state-of-the-art unsupervised systems as well as supervised models including the best supervised deep learning models from Kann et al. (2018) ." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-83", "text": "As the metric, we use the segmentation-boundary F1-score, which is standard for this task (Virpioja et al., 2011) ." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-84", "text": "Evaluating different AG setups." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-85", "text": "Table 3 shows the performance of our AG setups on the four languages." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-86", "text": "The best AG setup learned for each of the four polysynthetic languages (AG BestL ) is the PrStSu+SM grammar using the Cascaded learning setup." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-87", "text": "This is an interesting finding as the Cascaded PrSTSu+SM setup is in fact AG LIM S -the best-on-average AG setup obtained by Eskander et al. (2016) Table 4 : Best AG results compared to supervised approaches from Kann et al. (2018) ." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-88", "text": "Bold indicates best scores." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-89", "text": "WX and YN, respectively." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-90", "text": "Seeding affixes into the grammar trees (AG Scholar BestL ) improves the performance of the Cascaded P rStSu + SM setup only for MX and WX (additional absolute F1-scores of 0.023 and 0.019, respectively)." 
}, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-91", "text": "However, it does not help for NH, while it even decreases the performance on YN." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-92", "text": "This occurs because AGs are able to recognize the main affixes in the Cascaded setup, while the seeded affixes were either abundant or conflicting with the automatically discovered ones." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-93", "text": "The multilingual setup (AG M ulti ) does not improve the performance on any of the languages." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-94", "text": "This could be because the datasets are too small to generalize common patterns across languages." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-95", "text": "Finally, augmenting with Bible text in the cases of NH and WX leads to an absolute F1-score increase of 0.015 for both languages when compared to AG BestL ." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-96", "text": "There are two possible explanations for why we only see a slight increase when adding more data: 1) AGs are able to generalize from small data and 2) the added Bible data represents a domain that is different from those of the datasets we are experimenting with as only 4.8% and 9% of the words in the training sets from Kann et al. (2018) appear in the augmented data of NH and WX, respectively." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-97", "text": "Overall, AG BestL is the best setup for YN, AG Scholar BestL is the best setup for MX and WX, while AG Aug is the best for NH." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-98", "text": "Comparison with unsupervised baselines." 
}, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-99", "text": "We consider M orf essor (Creutz and Lagus, 2007) , a commonly-used toolkit for unsupervised morphological segmentation, and M orphoChain (Narasimhan et al., 2015) , another unsupervised morphological system based on constructing morphological chains." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-100", "text": "Our AG approaches significantly outperform both M orf essor and M orphoChain on all four languages, as shown in Table 3 ." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-101", "text": "Comparison with supervised baselines." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-102", "text": "To obtain an upper bound, we compare the best AG setup to the best supervised neural methods presented in Kann et al. (2018) for each language." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-103", "text": "We consider their best multi-task approach (BestMTT) and the best data-augmentation approach (BestDA), using F1 scores from their Table 4 for each language." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-104", "text": "In addition, we report the results on their other supervised baselines: a supervised seq-to-seq model (S2S) and a supervised CRF approach." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-105", "text": "As can be seen in Table 4 , our unsupervised AG-based approaches outperform the best supervised approaches for NH and YN with absolute F1-scores of 0.010 and 0.012, respectively." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-106", "text": "An interesting observation is that for YN we only used the words in the training set of Kann et al. (2018) (unsegmented) , without any data augmentation." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-107", "text": "For MX and WX, the neural models from Kann et al. (2018) (BestMTT and BestDA), outperform our unsupervised AG-based approaches." 
}, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-108", "text": "Error Analysis." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-109", "text": "For the purpose of error analysis, we train our unsupervised segmentation on the training sets and perform the analysis of results on the output of the development sets based on our best unsupervised models AG BestL ." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-110", "text": "Since there is no distinction between stems and affixes in the labeled data, we only consider the morphemes that appear at least three times in order to eliminate open-class morphemes in our statistics." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-111", "text": "We first define the degree of ambiguity of a morpheme to be the percentage of times its sequence of characters does not form a segmentable morpheme when they appear in the training set." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-112", "text": "We also define the degree of ambiguity of a language as the average degree of ambiguity of the morphemes in that language." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-113", "text": "Table 5 shows the number of morphemes, average length of a morpheme (in characters) and the degree of morpheme Table 6 : Examples of correct and incorrect segmentation ambiguity in each language." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-114", "text": "Looking at the two languages where our models perform worse than the supervised models, we notice that MX has the least number of morphemes, and our unsupervised methods tend to oversegment; WX has the highest degree of ambiguity with a large number of one-letter morphemes, which makes the task more challenging for unsupervised segmentation as opposed to the case of a supervised setup." 
}, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-115", "text": "Analyzing all the errors that our AG-based models made across all languages, we noticed one, or a combination, of the following factors: a high degree of morpheme ambiguity, short morpheme length and/or low frequency of a morpheme." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-116", "text": "Examples." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-117", "text": "Table 6 shows some examples of correctly and incorrectly segmented words by our models (blue indicates correct morphemes while red are wrong ones)." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-118", "text": "For MX, our models fail to recognize ka as a correct affix 100% of the time due to its high degree of ambiguity (71.79%), while we often wrongly detect ro as an affix, most likely since ro tends to appear at the end of a word; our approaches tend to oversegment in such cases." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-119", "text": "On the other hand, our method correctly identify ki as a correct affix 100% of the time since it appears frequently in the training data." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-121", "text": "On the other hand, NH has a higher percentage of correctly recognized morphemes, due to their less ambiguous nature and higher frequency (such as ke, tl or mo)." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-122", "text": "For WX, a large portion of errors stem from one-letter morphemes that are highly ambiguous (e.g., u, a, e, m, n, p and r), in addition to having morphemes in the training set which are not frequent enough to learn from, such as ki,nua and wawi (see Table 6 )." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-123", "text": "Examples of correct segmentation involve morphemes that are more frequent and less ambiguous (pe, p@ and ne)." 
}, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-124", "text": "For YN, ambiguity is the main source of segmentation errors (e.g., wa, wi and \u00dfa).slight" }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-125", "text": "----------------------------------" }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-126", "text": "**CONCLUSIONS**" }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-127", "text": "Unsupervised approaches based on Adaptor Grammars show promise for morphological segmentation of low-resource polysynthetic languages." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-128", "text": "We worked with the AG grammars developed by Eskander et al. (2016 Eskander et al. ( , 2018 for languages that are not polysynthetic." }, { "sent_id": "d8168f4596878807d22ddc7474ffc8-C001-129", "text": "We showed that even when using these approaches and very little data, we can obtain encouraging results, and that using additional unsupervised data is a promising path." } ], "y": { "@USE@": { "gold_contexts": [ [ "d8168f4596878807d22ddc7474ffc8-C001-4" ], [ "d8168f4596878807d22ddc7474ffc8-C001-18" ], [ "d8168f4596878807d22ddc7474ffc8-C001-45" ], [ "d8168f4596878807d22ddc7474ffc8-C001-47" ], [ "d8168f4596878807d22ddc7474ffc8-C001-48" ], [ "d8168f4596878807d22ddc7474ffc8-C001-55" ], [ "d8168f4596878807d22ddc7474ffc8-C001-62", "d8168f4596878807d22ddc7474ffc8-C001-63" ], [ "d8168f4596878807d22ddc7474ffc8-C001-66" ], [ "d8168f4596878807d22ddc7474ffc8-C001-128" ] ], "cite_sentences": [ "d8168f4596878807d22ddc7474ffc8-C001-4", "d8168f4596878807d22ddc7474ffc8-C001-18", "d8168f4596878807d22ddc7474ffc8-C001-45", "d8168f4596878807d22ddc7474ffc8-C001-47", "d8168f4596878807d22ddc7474ffc8-C001-48", "d8168f4596878807d22ddc7474ffc8-C001-55", "d8168f4596878807d22ddc7474ffc8-C001-62", "d8168f4596878807d22ddc7474ffc8-C001-66", "d8168f4596878807d22ddc7474ffc8-C001-128" ] }, "@EXT@": { "gold_contexts": [ [ "d8168f4596878807d22ddc7474ffc8-C001-4", 
"d8168f4596878807d22ddc7474ffc8-C001-5", "d8168f4596878807d22ddc7474ffc8-C001-6" ], [ "d8168f4596878807d22ddc7474ffc8-C001-69", "d8168f4596878807d22ddc7474ffc8-C001-70" ] ], "cite_sentences": [ "d8168f4596878807d22ddc7474ffc8-C001-4", "d8168f4596878807d22ddc7474ffc8-C001-69" ] }, "@BACK@": { "gold_contexts": [ [ "d8168f4596878807d22ddc7474ffc8-C001-15" ], [ "d8168f4596878807d22ddc7474ffc8-C001-62" ], [ "d8168f4596878807d22ddc7474ffc8-C001-87" ] ], "cite_sentences": [ "d8168f4596878807d22ddc7474ffc8-C001-15", "d8168f4596878807d22ddc7474ffc8-C001-62", "d8168f4596878807d22ddc7474ffc8-C001-87" ] }, "@SIM@": { "gold_contexts": [ [ "d8168f4596878807d22ddc7474ffc8-C001-69" ], [ "d8168f4596878807d22ddc7474ffc8-C001-86", "d8168f4596878807d22ddc7474ffc8-C001-87" ] ], "cite_sentences": [ "d8168f4596878807d22ddc7474ffc8-C001-69", "d8168f4596878807d22ddc7474ffc8-C001-87" ] } } }, "ABC_154fd8e6b625eb93da21c09906ee90_14": { "x": [ { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-65", "text": "For each head i of nhds, we learn linear maps W" }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-66", "text": ", and values V (i) of the i-th head, which combine to give" }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-67", "text": "where \u03c3 is row-wise softmax." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-114", "text": "We warmup for 8000 steps, use a dropout of 0.2, and switch schedules at epoch 40." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-137", "text": "**MODEL**" }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-2", "text": "The success of self-attention in NLP has led to recent applications in end-to-end encoder-decoder architectures for speech recognition." 
}, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-3", "text": "Separately, connectionist temporal classification (CTC) has matured as an alignment-free, non-autoregressive approach to sequence transduction, either by itself or in various multitask and decoding frameworks." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-4", "text": "We propose SAN-CTC, a deep, fully self-attentional network for CTC, and show it is tractable and competitive for end-toend speech recognition." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-5", "text": "SAN-CTC trains quickly and outperforms existing CTC models and most encoder-decoder models, with character error rates (CERs) of 4.7% in 1 day on WSJ eval92 and 2.8% in 1 week on LibriSpeech test-clean, with a fixed architecture and one GPU." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-6", "text": "Similar improvements hold for WERs after LM decoding." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-7", "text": "We motivate the architecture for speech, evaluate position and downsampling approaches, and explore how label alphabets (character, phoneme, subword) affect attention heads and performance." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-8", "text": "----------------------------------" }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-9", "text": "****" }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-10", "text": "later works proposed partially-or purely-convolutional CTC models [8] [9] [10] [11] and convolution-heavy encoder-decoder models [16] for ASR." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-11", "text": "However, convolutional models must be significantly deeper to retrieve the same temporal receptive field [23] ." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-12", "text": "Recently, the mechanism of self-attention [22, 24] was proposed, which uses the whole sequence at once to model feature interactions that are arbitrarily distant in time." 
}, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-13", "text": "Its use in both encoder-decoder and feedforward contexts has led to faster training and state-of-the-art results in translation (via the Transformer [22] ), sentiment analysis [25] , and other tasks." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-14", "text": "These successes have motivated preliminary work in self-attention for ASR." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-15", "text": "Time-restricted self-attention was used as a drop-in replacement for individual layers in the state-of-theart lattice-free MMI model [26] , an HMM-NN system." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-16", "text": "Hybrid self-attention/LSTM encoders were studied in the context of listenattend-spell (LAS) [27] , and the Transformer was directly adapted to speech in [19, 28, 29] ; both are encoder-decoder systems." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-17", "text": "In this work, we propose and evaluate fully self-attentional networks for CTC (SAN-CTC)." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-18", "text": "We are motivated by practicality: selfattention could be used as a drop-in replacement in existing CTClike systems, where only attention has been evaluated in the past [30, 31] ; unlike encoder-decoder systems, SAN-CTC is able to predict tokens in parallel at inference time; an analysis of SAN-CTC is useful for future state-of-the-art ASR systems, which may equip self-attentive encoders with auxiliary CTC losses [17, 20] ." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-19", "text": "Unlike past works, we do not require convolutional frontends [19] or interleaved recurrences [27] to train self-attention for ASR." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-20", "text": "In Section 2, we motivate the model and relevant design choices (position, downsampling) for ASR." 
}, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-21", "text": "In Section 3, we validate SAN-CTC on the Wall Street Journal and LibriSpeech datasets by outperforming existing CTC models and most encoder-decoder models in character error rates (CERs), with fewer parameters or less training time." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-22", "text": "Finally, we train our models with different label alphabets (character, phoneme, subword), use WFST decoding to give word error rates (WERs), and examine the learned attention heads for insights." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-23", "text": "----------------------------------" }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-24", "text": "**MODEL ARCHITECTURES FOR CTC AND ASR**" }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-25", "text": "Consider an input sequence of T feature vectors, viewed as a matrix X \u2208 R T \u00d7d fr . Let L denote the (finite) label alphabet, and denote the output sequence as y = (y1, . . . , yU ) \u2208 L U ." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-26", "text": "In ASR, X is the sequence of acoustic frames, L is the set of graphemes/phonemes/wordpieces, and y is the corresponding ground-truth transcription over L." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-27", "text": "For CTC, one assumes U \u2264 T and defines an intermediate al-" }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-28", "text": "In this way, paths are analogous to framewise alignments in the HMM-NN framework." 
}, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-29", "text": "CTC models the distribution of sequences by marginalizing over all paths corresponding to an output:" }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-30", "text": "Finally, CTC models each P (\u03c0 | X) non-autoregressively, as a sequence of conditionally-independent outputs:" }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-31", "text": "This model assumption means each P (\u03c0t, t | X) could be computed in parallel, after which one can do prediction via beam search, or training with gradient descent using the objective LCTC(X, y) = \u2212 log P (y | X); the order-monotonicity of B ensures LCTC can be efficiently evaluated with dynamic programming [1, 4] ." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-32", "text": "----------------------------------" }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-33", "text": "**RECURRENT AND CONVOLUTIONAL MODELS**" }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-34", "text": "In practice, one models P (\u03c0, t | X) with a neural network." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-35", "text": "As inspired by HMMs, the model simplification of conditional independence can be tempered by multiple layers of (recurrent) bidirectional long short-term memory units (BLSTMs) [1] [2] [3] [4] ." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-36", "text": "However, these are computationally expensive (Table 1) , leading to simplifications like gated recurrent units (GRUs) [8, 32] ; furthermore, the success of the ReLU(x) = max(0, x) nonlinearity in preventing vanishing gradients enabled the use of vanilla bidirectional recurrent deep neural networks (BRDNNs) [5, 6, 33] to further reduce operations per layer." 
}, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-37", "text": "Convolutions over time and/or frequency were first used as initial layers to recurrent neural models, beginning with HMM-NNs [34] and later with CTC, where they are viewed as promoting invariance to temporal and spectral translation in ASR [8] , or image translation in handwriting recognition [35] ; they also serve as a form of dimensionality reduction (Section 2.4)." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-38", "text": "However, these networks were still bottlenecked by the sequentiality of operations at the recurrent layers, leading [8] to propose row convolutions for unidirectional RNNs, which had finite lookaheads to enable online processing while having some future context." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-39", "text": "This led to convolution-only CTC models for long-range temporal dependencies [9] [10] [11] ." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-40", "text": "However, these models have to be very deep (e.g., 17-19 convolutional layers on LibriSpeech [23] ) to cover the same context (Table 1) ." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-41", "text": "While in theory, a relatively local context could suffices for ASR, this is complicated by alphabets L which violate the conditional independence assumption of CTC (e.g., English characters [36] )." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-42", "text": "Wide contexts also enable incorporation of noise/speaker contexts, as [27] suggest regarding the broad-context attention heads in the first layer of their self-attentional LAS model." 
}, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-43", "text": "----------------------------------" }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-44", "text": "**MOTIVATING THE SELF-ATTENTION LAYER**" }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-45", "text": "We now replace recurrent and convolutional layers for CTC with self-attention [24] ." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-46", "text": "Our proposed framework ( Figure 1a ) is built around self-attention layers, as used in the Transformer encoder [22] , previous explorations of self-attention in ASR [19, 27] , and defined in Section 2.3." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-47", "text": "The other stages are downsampling, which reduces input length T via methods like those in Section 2.4; embedding, which learns a dh-dim. embedding that also describes token position (Section 2.5); and projection, where each final representation is mapped framewise to logits over the intermediate alphabet L ." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-48", "text": "The first implements self-attention, where the success of attention in CTC and encoder-decoder models [14, 31] is parallelized by using each position's representation to attend to all others, giving a contextualized representation for that position." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-49", "text": "Hence, the full receptive field is immediately available at the cost of O(T 2 ) inner products (Table 1) , enabling richer representations in fewer layers." 
}, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-55", "text": "Table 1 : Operation complexity of each layer type (columns: model; operations per layer; sequential operations; maximum path length), based on [22] ." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-56", "text": "T is input length, d is no. of hidden units, and k is filter/context width." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-57", "text": "We also see inspiration from convolutional blocks: residual connections, layer normalization, and tied dense layers with ReLU for representation learning." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-58", "text": "In particular, multi-head attention is akin to having a number of infinitely-wide filters whose weights adapt to the content (allowing fewer \"filters\" to suffice)." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-59", "text": "One can also assign interpretations; for example, [27] argue their LAS self-attention heads are differentiated phoneme detectors." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-60", "text": "Further inductive biases like filter widths and causality could be expressed through time-restricted self-attention [26] and directed self-attention [25] , respectively." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-61", "text": "----------------------------------" }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-62", "text": "**FORMULATION**" }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-63", "text": "Let H \u2208 R T \u00d7d h denote a sublayer's input." 
}, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-64", "text": "The first sublayer performs multi-head, scaled dot-product self-attention [22] : for each head i, HdAtt (i) = softmax(Q (i) (K (i) ) T /\u221a(dh/n hds ))V (i) , where Q (i) , K (i) , V (i) are learned linear projections of H to dh/n hds dimensions." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-68", "text": "Heads are concatenated along the dh/nhds axis to give MltHdAtt = [HdAtt (1) , . . . , HdAtt (n hds ) ]." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-69", "text": "The second sublayer is a position-wise feed-forward network [22] FFN(H) = ReLU(HW1 + b1)W2 + b2 where parameters W1 \u2208 R d h \u00d7d ff , b1 \u2208 R d ff , W2 \u2208 R d ff \u00d7d h , b2 \u2208 R d h ," }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-70", "text": "with the biases b1, b2 broadcasted over all T positions." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-71", "text": "This sublayer aggregates the multiple heads at time t into the attention layer's final output at t. All together, the layer is given by: H \u2032 = LayerNorm(H + MltHdAtt(H)), followed by LayerNorm(H \u2032 + FFN(H \u2032 ))." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-72", "text": "----------------------------------" }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-73", "text": "**DOWNSAMPLING**" }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-74", "text": "In speech, the input length T of frames can be many times larger than the output length U , in contrast to the roughly word-to-word setting of machine translation." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-75", "text": "This is especially prohibitive for self-attention in terms of memory: recall that an attention matrix \u2208 R T \u00d7T is created, giving the T 2 factor in Table 1 ." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-77", "text": "A convolutional frontend is a typical downsampling strategy [8, 19] ; however, we leave integrating other layer types into SAN-CTC as future work." 
}, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-78", "text": "Instead, we consider three fixed approaches, from least- to most-preserving of the input data: subsampling, which only takes every k-th frame; pooling, which aggregates every k consecutive frames via a statistic (average, maximum); reshaping, where one concatenates k consecutive frames into one [27] ." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-79", "text": "Note that CTC will still require U \u2264 T /k, however." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-80", "text": "----------------------------------" }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-81", "text": "**POSITION**" }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-82", "text": "Self-attention is inherently content-based [22] , and so one often encodes position into the post-embedding vectors." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-83", "text": "We use standard trigonometric embeddings, where for 0 \u2264 i \u2264 demb/2, we define enc(t) 2i = sin(t/10000 2i/demb ) and enc(t) 2i+1 = cos(t/10000 2i/demb )" }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-84", "text": "for position t. We consider three approaches: content-only [21] , which forgoes position encodings; additive [19] , which takes demb = dh and adds the encoding to the embedding; and concatenative, where one takes demb = 40 and concatenates it to the embedding." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-85", "text": "The latter was found necessary for self-attentional LAS [27] , as additive encodings did not give convergence." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-86", "text": "However, the monotonicity of CTC is a further positional inductive bias, which may enable the success of content-only and additive encodings." 
}, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-87", "text": "----------------------------------" }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-88", "text": "**EXPERIMENTS**" }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-89", "text": "We take (nlayers, dh, nheads, dff) = (10, 512, 8, 2048), giving \u223c30M parameters." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-90", "text": "This is on par with models on WSJ (10-30M) [4, 5, 9] and an order of magnitude below models on LibriSpeech (100-250M) [8, 23] ." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-91", "text": "We use MXNet [37] for modeling and Kaldi/EESEN [7, 38] for data preparation and decoding." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-92", "text": "Our self-attention code is based on GluonNLP's implementation." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-93", "text": "At train time, utterances are sorted by length: we exclude those longer than 1800 frames (\u223c1% of each training set)." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-94", "text": "We take a window of 25ms, a hop of 10ms, and concatenate cepstral mean-variance normalized features with temporal first- and second-order differences. 1" }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-95", "text": "We downsample by a factor of k = 3 (this also gave an ideal T /k \u2248 dh for our data; see Table 1 )." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-96", "text": "We perform Nesterov-accelerated gradient descent on batches of 20 utterances." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-97", "text": "As self-attention architectures can be unstable in early training, we clip gradients to a global norm of 1 and use the standard linear warmup period before inverse square decay associated with these architectures [19, 22] ." 
}, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-98", "text": "Let n denote the global step number of the batch (across epochs); the learning rate is given by" }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-99", "text": "1 Rescaling so that these differences also have var. \u2248 1 helped WSJ training." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-101", "text": "[Table 2 rows (dev/eval CER and WER) for Enc-Dec [17, 39] , Enc-Dec+CTC [17] , and CTC/ASG (Gated CNN) [40] baselines omitted.] Table 3 : CTC phoneme models with WFST decoding on WSJ." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-102", "text": "where we take \u03bb = 400 and nwarmup as a hyperparameter." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-103", "text": "However, such a decay led to early stagnation in validation accuracy, so we later divide the learning rate by 10 and run at the decayed rate for 20 epochs." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-104", "text": "We do this twice, then take the epoch with the best validation score." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-105", "text": "Xavier initialization gave validation accuracies of zero for the first few epochs, suggesting room for improvement." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-106", "text": "Like previous works on self-attention, we apply label smoothing (see Tables 2, 3, 5; we also tried model averaging to no gain)." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-107", "text": "To compute word error rates (WERs), we use the dataset's provided language model (LM) as incorporated by WFST decoding [7] to bridge the gap between CTC and encoder-decoder frameworks, allowing comparison with known benchmarks and informing systems that incorporate expert knowledge in this way (e.g., via a pronunciation lexicon)." 
}, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-108", "text": "----------------------------------" }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-109", "text": "**WALL STREET JOURNAL (WSJ)**" }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-110", "text": "We train both character- and phoneme-label systems on the 80-hour WSJ training set to validate our architectural choices." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-111", "text": "Similar to [17, 19] , we use 40-dim. mel-scale filter banks and hence 120-dim. features." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-115", "text": "For the WSJ dataset, we compare with similar MLE-trained, end-to-end, open-vocabulary systems in Table 2 ." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-116", "text": "We get an eval92 CER of 4.7%, outdoing all previous CTC-like results except 4.6% with a trainable frontend [40] ." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-117", "text": "We use the provided extended 3-gram LM to retrieve WERs." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-118", "text": "For phoneme training, our labels come from the CMU pronunciation lexicon (Table 3 )." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-119", "text": "These models train in one day (Tesla V100), comparable to the Speech Transformer [19] ; however, SAN-CTC gives further benefits at inference time as token predictions are generated in parallel." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-120", "text": "We also evaluate design choices in Table 4 ." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-121", "text": "Here, we consider the effects of downsampling and position encoding on accuracy for our fixed training regime." 
}, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-122", "text": "We see that unlike self-attentional LAS [27] , SAN-CTC works respectably even with no position encoding; in fact, the contribution of position is relatively minor (compare with [21] , where location in an encoder-decoder system improved CER by 3% absolute)." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-123", "text": "Lossy downsampling appears to preserve performance in CER while degrading WER (as information about frame transitions is lost)." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-124", "text": "We believe these observations align with the monotonicity and independence assumptions of CTC." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-125", "text": "Inspired by [27] , we plot the standard deviation of attention weights for each head as training progresses; see Figure 2 for details." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-126", "text": "In the first layers, we similarly observe a differentiation of variances, along with wide-context heads; in later layers, unlike [27] we still see mild differentiation of variances." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-127", "text": "Inspired by [26] , we further plot the attention weights relative to the current time position (here, per head)." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-128", "text": "Character labels gave forward- and backward-attending heads (incidentally, averaging these would retrieve the bimodal distribution in [26] ) at all layers." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-129", "text": "This suggests a gradual expansion of context over depth, as is often engineered in convolutional CTC." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-130", "text": "This also suggests possibly using fewer heads, directed self-attention [25] , and restricted contexts for faster training (Table 1) ." 
}, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-131", "text": "Phoneme labels gave a sharp backward-attending head and more diffuse heads." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-132", "text": "We believe this to be a symptom of English characters being more context-dependent than phonemes (for example, emitting 'tt' requires looking ahead, as '-' must occur between two runs of 't' tokens)." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-133", "text": "----------------------------------" }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-134", "text": "**LIBRISPEECH**" }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-135", "text": "We give the first large-scale demonstration of a fully self-attentional ASR model using the LibriSpeech ASR corpus [42] , an English corpus produced from audio books giving 960 hours of training data." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-136", "text": "----------------------------------" }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-138", "text": "[Table 5 rows (CER and WER on test-clean and test-other) for character-label baselines CTC/ASG (Wav2Letter) [9] , CTC (DS1-like) [33, 43] , Enc-Dec [44, 45] , and CTC (DS2-like) [8, 32] omitted.]" }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-143", "text": "Our results appear in Table 5 ." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-144", "text": "At this scale, even minor label smoothing was detrimental." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-145", "text": "We run 70 epochs in slightly over a week (Tesla V100) then choose the epoch with the best validation score for testing." 
}, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-146", "text": "For comparison, the best CTC-like architecture [23] took 4-8 weeks on 4 GPUs for its results. 2" }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-147", "text": "The Enc-Dec+CTC model is comparable, taking almost a week on an older GPU (GTX 1080 Ti) to do its \u223c12.5 full passes over the data. 3" }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-148", "text": "Finally, we trained the same model with BPE subwords as CTC targets, to get more context-independent units [36] ." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-149", "text": "We did 300 merge operations (10k was unstable) and attained a CER of 7.4%." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-150", "text": "This gave a WER of 8.7% with no LM (compare with Table 5 's LM-based entries), and 5.2% with a subword WFST of the LM." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-151", "text": "We still observed attention heads in both directions in the first layer, suggesting our subwords were still more context-dependent than phonemes." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-152", "text": "----------------------------------" }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-153", "text": "**CONCLUSION**" }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-154", "text": "We introduced SAN-CTC, a novel framework which integrates a fully self-attentional network with a connectionist temporal classification loss." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-155", "text": "We addressed the challenges of adapting self-attention to CTC and to speech recognition, showing that SAN-CTC is competitive with or outperforms existing end-to-end models on WSJ and LibriSpeech." }, { "sent_id": "154fd8e6b625eb93da21c09906ee90-C001-156", "text": "Future avenues of work include multitasking SAN-CTC with other decoders or objectives, and streamlining network structure via directed or restricted attention." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "154fd8e6b625eb93da21c09906ee90-C001-16" ], [ "154fd8e6b625eb93da21c09906ee90-C001-59" ] ], "cite_sentences": [ "154fd8e6b625eb93da21c09906ee90-C001-16", "154fd8e6b625eb93da21c09906ee90-C001-59" ] }, "@DIF@": { "gold_contexts": [ [ "154fd8e6b625eb93da21c09906ee90-C001-19" ], [ "154fd8e6b625eb93da21c09906ee90-C001-122" ], [ "154fd8e6b625eb93da21c09906ee90-C001-126" ] ], "cite_sentences": [ "154fd8e6b625eb93da21c09906ee90-C001-19", "154fd8e6b625eb93da21c09906ee90-C001-122", "154fd8e6b625eb93da21c09906ee90-C001-126" ] }, "@MOT@": { "gold_contexts": [ [ "154fd8e6b625eb93da21c09906ee90-C001-17", "154fd8e6b625eb93da21c09906ee90-C001-19" ], [ "154fd8e6b625eb93da21c09906ee90-C001-41", "154fd8e6b625eb93da21c09906ee90-C001-42" ], [ "154fd8e6b625eb93da21c09906ee90-C001-85" ] ], "cite_sentences": [ "154fd8e6b625eb93da21c09906ee90-C001-19", "154fd8e6b625eb93da21c09906ee90-C001-42", "154fd8e6b625eb93da21c09906ee90-C001-85" ] }, "@SIM@": { "gold_contexts": [ [ "154fd8e6b625eb93da21c09906ee90-C001-46" ], [ "154fd8e6b625eb93da21c09906ee90-C001-125" ] ], "cite_sentences": [ "154fd8e6b625eb93da21c09906ee90-C001-46", "154fd8e6b625eb93da21c09906ee90-C001-125" ] }, "@USE@": { "gold_contexts": [ [ "154fd8e6b625eb93da21c09906ee90-C001-78" ] ], "cite_sentences": [ "154fd8e6b625eb93da21c09906ee90-C001-78" ] } } }, "ABC_23119eff3cfd71370e8ad408fc75e1_15": { "x": [ { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-2", "text": "Coreference resolution aims to identify in a text all mentions that refer to the same real-world entity." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-3", "text": "The state-of-the-art endto-end neural coreference model considers all text spans in a document as potential mentions and learns to link an antecedent for each possible mention." 
}, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-4", "text": "In this paper, we propose to improve the end-to-end coreference resolution system by (1) using a biaffine attention model to get antecedent scores for each possible mention, and (2) jointly optimizing the mention detection accuracy and the mention clustering log-likelihood given the mention cluster labels." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-5", "text": "Our model achieves the state-of-the-art performance on the CoNLL-2012 Shared Task English test set." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-6", "text": "----------------------------------" }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-8", "text": "End-to-end coreference resolution is the task of identifying and grouping mentions in a text such that all mentions in a cluster refer to the same entity." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-9", "text": "An example is given below (Bj\u00f6rkelund and Kuhn, 2014) . Many traditional coreference systems, either rule-based (Haghighi and Klein, 2009; Lee et al., 2011) or learning-based (Bengtson and Roth, 2008; Fernandes et al., 2012; Durrett and Klein, 2013; Bj\u00f6rkelund and Kuhn, 2014) , usually solve the problem in two separate stages: (1) a mention detector to propose entity mentions from the text, and (2) a coreference resolver to cluster proposed mentions. (* Work done during the internship at IBM Watson.)" }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-10", "text": "At both stages, they rely heavily on complicated, fine-grained, conjoined features via heuristics." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-11", "text": "This pipeline approach can cause cascading errors, and in addition, since both stages rely on a syntactic parser and complicated handcrafted features, it is difficult to generalize to new data sets and languages." 
}, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-12", "text": "Very recently, Lee et al. (2017) proposed the first state-of-the-art end-to-end neural coreference resolution system." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-13", "text": "They consider all text spans as potential mentions and therefore eliminate the need for carefully hand-engineered mention detection systems." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-14", "text": "In addition, thanks to the representation power of pre-trained word embeddings and deep neural networks, the model only uses a minimal set of hand-engineered features (speaker ID, document genre, span distance, span width)." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-15", "text": "The core of the end-to-end neural coreference resolver is the scoring function to compute the mention scores for all possible spans and the antecedent scores for a pair of spans." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-16", "text": "Furthermore, one major challenge of coreference resolution is that most mentions in the document are singleton or non-anaphoric, i.e., not coreferent with any previous mention (Wiseman et al., 2015) ." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-17", "text": "Since the data set only has annotations for mention clusters, the end-to-end coreference resolution system needs to detect mentions, detect anaphoricity, and perform coreference linking." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-18", "text": "Therefore, research questions still remain on good designs of the scoring architecture and the learning strategy for both mention detection and antecedent scoring given only the gold cluster labels." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-19", "text": "To this end, we propose to use a biaffine attention model instead of pure feed forward networks to compute antecedent scores." 
}, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-20", "text": "Figure 1 : Model architecture. We consider all text spans up to 10-word length as possible mentions." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-21", "text": "For brevity, we only show three candidate antecedent spans (\"Drug Emporium Inc.\", \"Gary Wilber\", \"was named CEO\") for the current span \"this drugstore chain\"." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-23", "text": "Furthermore, instead of training only to maximize the marginal likelihood of gold antecedent spans, we jointly optimize the mention detection accuracy and the mention clustering log-likelihood given the mention cluster labels." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-24", "text": "We optimize mention detection loss explicitly to extract mentions and also perform anaphoricity detection." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-25", "text": "We evaluate our model on the CoNLL-2012 English data set and achieve new state-of-the-art performances of 67.8% F1 score using a single model and 69.2% F1 score using a 5-model ensemble." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-26", "text": "----------------------------------" }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-27", "text": "**TASK FORMULATION**" }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-28", "text": "In end-to-end coreference resolution, the input is a document D with T words, and the output is a set of mention clusters each of which refers to the same entity." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-29", "text": "A possible span is an N-gram within a single sentence." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-30", "text": "We consider all possible spans up to a predefined maximum width." 
}, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-31", "text": "To impose an ordering, spans are sorted by the start position START(i) and then by the end position END(i)." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-32", "text": "For each span i the system needs to assign an antecedent a i from all preceding spans or a dummy antecedent \u03b5: a i \u2208 {\u03b5, 1, . . . , i\u22121}. If a span j is a true antecedent of the span i, then we have a i = j and 1 \u2264 j \u2264 i\u22121." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-33", "text": "The dummy antecedent \u03b5 represents two possibilities: (1) the span i is not an entity mention, or (2) the span i is an entity mention but not coreferent with any previous span." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-34", "text": "Finally, the system groups mentions according to coreference links to form the mention clusters." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-35", "text": "----------------------------------" }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-36", "text": "**MODEL**" }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-37", "text": "Figure 1 illustrates our model." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-38", "text": "We adopt the same span representation approach as in Lee et al. (2017) using bidirectional LSTMs and a head-finding attention." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-39", "text": "Thereafter, a feed forward network produces scores for spans being entity mentions." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-40", "text": "For antecedent scoring, we propose a biaffine attention model (Dozat and Manning, 2017) to produce distributions of possible antecedents." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-41", "text": "Our training data only provides gold mention cluster labels." 
}, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-42", "text": "To make best use of this information, we propose to jointly optimize the mention scoring and antecedent scoring in our loss function." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-43", "text": "Span Representation Suppose the current sentence of length L is [w 1 , w 2 , . . . , w L ]; we use w t to denote the concatenation of fixed pretrained word embeddings and CNN character embeddings (dos Santos and Zadrozny, 2014) for word w t ." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-44", "text": "Bidirectional LSTMs (Hochreiter and Schmidhuber, 1997) recurrently encode each w t :" }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-45", "text": "Then, the head-finding attention computes a score distribution over different words in a span s i :" }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-46", "text": "where FFNN is a feed forward network outputting a vector." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-47", "text": "Effective span representations encode both contextual information and internal structure of spans." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-48", "text": "Therefore, we concatenate different vectors, including a feature vector \u03c6(i) for the span size, to produce the span representation s i for s i :" }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-49", "text": "Mention Scoring The span representation is input to a feed forward network which measures if it is an entity mention using a score m(i):" }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-50", "text": "Since we consider all possible spans, the number of spans is O(T 2 ) and the number of span pairs is O(T 4 )." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-51", "text": "For computational efficiency, we prune candidate spans during both inference and training." 
}, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-52", "text": "We keep \u03bbT spans with highest mention scores." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-53", "text": "Biaffine Attention Antecedent Scoring Considering the current span s i and its previous spans s j (1 \u2264 j \u2264 i \u2212 1), we propose to use a biaffine attention model to produce scores c(i, j):" }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-54", "text": "FFNN anaphora and FFNN antecedent reduce span representation dimensions and only keep information relevant to coreference decisions." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-55", "text": "Compared with the traditional FFNN approach in Lee et al. (2017) , biaffine attention directly models both the compatibility of s i and s j by \u015d j U bi \u015d i and the prior likelihood of s i having an antecedent by v bi \u015d i ." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-56", "text": "Inference The final coreference score s(i, j) for span s i and span s j consists of three terms: (1) if s i is a mention, (2) if s j is a mention, (3) if s j is an antecedent for s i ." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-57", "text": "Furthermore, for the dummy antecedent \u03b5, we fix the final score to be 0:" }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-58", "text": "During inference, the model only creates a link if the highest antecedent score is positive." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-59", "text": "----------------------------------" }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-60", "text": "**JOINT MENTION DETECTION AND MENTION CLUSTER**" }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-61", "text": "During training, only mention cluster labels are available rather than antecedent links." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-62", "text": "Therefore, Lee et al. 
(2017) train the model end-to-end by maximizing the following marginal log-likelihood where GOLD(i) are gold antecedents for s i :" }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-63", "text": "However, the initial pruning is completely random and the mention scoring model only receives distant supervision if we only optimize the above mention cluster performance." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-64", "text": "This makes learning slow and ineffective especially for mention detection." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-65", "text": "Based on this observation, we propose to directly optimize mention detection:" }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-66", "text": "where \u0177 i = sigmoid(m(i)), y i = 1 if and only if s i is in one of the gold mention clusters." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-67", "text": "Our final loss combines mention detection and clustering:" }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-68", "text": "where N is the number of all possible spans, N \u2032 is the number of unpruned spans, and \u03bb detection controls the weights of the two terms." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-69", "text": "We use the metrics MUC, B 3 (Bagga and Baldwin, 1998) , and CEAF \u03c6 4 (Luo, 2005) ." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-70", "text": "We report Precision, Recall, F1 for each metric and the average F1 as the final CoNLL score." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-71", "text": "Implementation Details For fair comparisons, we follow the same hyperparameters as in Lee et al. (2017) ." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-72", "text": "We consider all spans up to 10 words and up to 250 antecedents." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-73", "text": "\u03bb = 0.4 is used for span pruning." 
}, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-74", "text": "We use fixed concatenations of 300-dimension GloVe (Pennington et al., 2014) embeddings and 50-dimension embeddings from Turian et al. (2010) ." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-75", "text": "Character CNNs use 8-dimension learned embeddings and 50 kernels for each window size in {3,4,5}. LSTMs have hidden size 200, and each FFNN has two hidden layers with 150 units and ReLU (Nair and Hinton, 2010) Avg." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-76", "text": "F1 Our model (single) 67.8 without mention detection loss 67.5 without biaffine attention 67.4 Lee et al. (2017) 67.3 Table 2 : Ablation study on the development set." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-77", "text": "genre, span distance, span width) features as 20-dimensional learned embeddings." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-78", "text": "Word and character embeddings use 0.5 dropout." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-79", "text": "All hidden layers and feature embeddings use 0.2 dropout." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-80", "text": "The batch size is 1 document." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-81", "text": "Based on the results on the development set, \u03bb detection = 0.1 works best from {0.05, 0.1, 0.5, 1.0}. Model is trained with ADAM optimizer (Kingma and Ba, 2015) and converges in around 200K updates, which is faster than that of Lee et al. (2017) ." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-82", "text": "Overall Performance In Table 1 , we compare our model with previous state-of-the-art systems." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-83", "text": "We obtain the best results in all F1 metrics." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-84", "text": "Our single model achieves 67.8% F1 and our 5-model ensemble achieves 69.2% F1." 
}, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-85", "text": "In particular, compared with Lee et al. (2017) , our improvement mainly results from the precision scores." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-86", "text": "This indicates that the mention detection loss does produce better mention scores and the biaffine attention more effectively determines if two spans are coreferent." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-87", "text": "Ablation Study To understand the effect of different proposed components, we perform ablation study on the development set." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-88", "text": "As shown in Table 2 , removing the mention detection loss term or the biaffine attention decreases 0.3/0.4 final F1 score, but still higher than the baseline." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-89", "text": "This shows Figure 2 : Mention detection subtask on development set." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-90", "text": "We plot accuracy and frequency breakdown by span widths." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-91", "text": "that both components have contributions and when they work together the total gain is even higher." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-92", "text": "Mention Detection Subtask To further understand our model, we perform a mention detection subtask where spans with mention scores higher than 0 are considered as mentions." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-93", "text": "We show the mention detection accuracy breakdown by span widths in Figure 2 ." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-94", "text": "Our model indeed performs better thanks to the mention detection loss." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-95", "text": "The advantage is even clearer for longer spans which consist of 5 or more words." 
}, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-96", "text": "In addition, it is important to note that our model can detect mentions that do not exist in the training data." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-97", "text": "While Moosavi and Strube (2017) observe that there is a large overlap between the gold mentions of the training and dev (test) sets, we find that our model can correctly detect 1048 mentions which are not detected by Lee et al. (2017) , consisting of 386 mentions existing in training data and 662 mentions not existing in training data." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-98", "text": "From those 662 mentions, some examples are (1) a suicide murder (2) Hong Kong Island (3) a US Airforce jet carrying robotic undersea vehicles (4) the investigation into who was behind the apparent suicide attack." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-99", "text": "This shows that our mention loss helps detection by generalizing to new mentions in test data rather than memorizing the existing mentions in training data." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-100", "text": "----------------------------------" }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-101", "text": "**RELATED WORK**" }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-102", "text": "As summarized by Ng (2010) , learning-based coreference models can be categorized into three types: (1) Mention-pair models train binary classifiers to determine if a pair of mentions are coreferent (Soon et al., 2001; Ng and Cardie, 2002; Bengtson and Roth, 2008) ." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-103", "text": "(2) Mention-ranking models explicitly rank all previous candidate mentions for the current mention and select a single highest scoring antecedent for each anaphoric mention (Denis and Baldridge, 2007b; Wiseman et al., 2015; Clark and Manning, 2016a; Lee et al., 2017) ." 
}, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-104", "text": "(3) Entity-mention models learn classifiers to determine whether the current mention is coreferent with a preceding, partially-formed mention cluster (Clark and Manning, 2015; Wiseman et al., 2016; Clark and Manning, 2016b) ." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-105", "text": "In addition, we also note latent-antecedent models (Fernandes et al., 2012; Bj\u00f6rkelund and Kuhn, 2014; Martschat and Strube, 2015) ." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-106", "text": "Fernandes et al. (2012) introduce coreference trees to represent mention clusters and learn to extract the maximum scoring tree in the graph of mentions." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-107", "text": "Recently, several neural coreference resolution systems have achieved impressive gains (Wiseman et al., 2015 (Wiseman et al., , 2016 Clark and Manning, 2016b,a) ." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-108", "text": "They utilize distributed representations of mention pairs or mention clusters to dramatically reduce the number of hand-crafted features." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-109", "text": "For example, Wiseman et al. (2015) propose the first neural coreference resolution system by training a deep feed-forward neural network for mention ranking." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-110", "text": "However, these models still employ the two-stage pipeline and require a syntactic parser or a separate designed hand-engineered mention detector." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-111", "text": "Finally, we also note the relevant work on joint mention detection and coreference resolution." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-112", "text": "Daum\u00e9 III and Marcu (2005) propose to model both mention detection and coreference of the Entity Detection and Tracking task simultaneously." 
}, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-113", "text": "Denis and Baldridge (2007a) propose to use integer linear programming framework to model anaphoricity and coreference as a joint task." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-114", "text": "----------------------------------" }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-115", "text": "**CONCLUSION**" }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-116", "text": "In this paper, we propose to use a biaffine attention model to jointly optimize mention detection and mention clustering in the end-to-end neural coreference resolver." }, { "sent_id": "23119eff3cfd71370e8ad408fc75e1-C001-117", "text": "Our model achieves the state-ofthe-art performance on the CoNLL-2012 Shared Task in English." } ], "y": { "@BACK@": { "gold_contexts": [ [ "23119eff3cfd71370e8ad408fc75e1-C001-12" ], [ "23119eff3cfd71370e8ad408fc75e1-C001-62" ], [ "23119eff3cfd71370e8ad408fc75e1-C001-76" ], [ "23119eff3cfd71370e8ad408fc75e1-C001-103" ] ], "cite_sentences": [ "23119eff3cfd71370e8ad408fc75e1-C001-12", "23119eff3cfd71370e8ad408fc75e1-C001-62", "23119eff3cfd71370e8ad408fc75e1-C001-76", "23119eff3cfd71370e8ad408fc75e1-C001-103" ] }, "@SIM@": { "gold_contexts": [ [ "23119eff3cfd71370e8ad408fc75e1-C001-38" ], [ "23119eff3cfd71370e8ad408fc75e1-C001-55" ], [ "23119eff3cfd71370e8ad408fc75e1-C001-71" ] ], "cite_sentences": [ "23119eff3cfd71370e8ad408fc75e1-C001-38", "23119eff3cfd71370e8ad408fc75e1-C001-55", "23119eff3cfd71370e8ad408fc75e1-C001-71" ] }, "@USE@": { "gold_contexts": [ [ "23119eff3cfd71370e8ad408fc75e1-C001-38" ], [ "23119eff3cfd71370e8ad408fc75e1-C001-55" ], [ "23119eff3cfd71370e8ad408fc75e1-C001-71" ] ], "cite_sentences": [ "23119eff3cfd71370e8ad408fc75e1-C001-38", "23119eff3cfd71370e8ad408fc75e1-C001-55", "23119eff3cfd71370e8ad408fc75e1-C001-71" ] }, "@DIF@": { "gold_contexts": [ [ "23119eff3cfd71370e8ad408fc75e1-C001-81", "23119eff3cfd71370e8ad408fc75e1-C001-85" ], [ 
"23119eff3cfd71370e8ad408fc75e1-C001-97" ] ], "cite_sentences": [ "23119eff3cfd71370e8ad408fc75e1-C001-81", "23119eff3cfd71370e8ad408fc75e1-C001-85", "23119eff3cfd71370e8ad408fc75e1-C001-97" ] } } }, "ABC_18a44fac8d2f450aee62fc15c00c6f_15": { "x": [ { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-2", "text": "The success of self-attention in NLP has led to recent applications in end-to-end encoder-decoder architectures for speech recognition." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-3", "text": "Separately, connectionist temporal classification (CTC) has matured as an alignment-free, non-autoregressive approach to sequence transduction, either by itself or in various multitask and decoding frameworks." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-4", "text": "We propose SAN-CTC, a deep, fully self-attentional network for CTC, and show it is tractable and competitive for end-toend speech recognition." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-5", "text": "SAN-CTC trains quickly and outperforms existing CTC models and most encoder-decoder models, with character error rates (CERs) of 4.7% in 1 day on WSJ eval92 and 2.8% in 1 week on LibriSpeech test-clean, with a fixed architecture and one GPU." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-6", "text": "Similar improvements hold for WERs after LM decoding." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-76", "text": "\u2208 R T \u00d7T is created, giving the T 2 factor in Table 1 ." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-7", "text": "We motivate the architecture for speech, evaluate position and downsampling approaches, and explore how label alphabets (character, phoneme, subword) affect attention heads and performance." 
}, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-8", "text": "----------------------------------" }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-9", "text": "****" }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-10", "text": "later works proposed partially-or purely-convolutional CTC models [8] [9] [10] [11] and convolution-heavy encoder-decoder models [16] for ASR." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-11", "text": "However, convolutional models must be significantly deeper to retrieve the same temporal receptive field [23] ." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-12", "text": "Recently, the mechanism of self-attention [22, 24] was proposed, which uses the whole sequence at once to model feature interactions that are arbitrarily distant in time." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-13", "text": "Its use in both encoder-decoder and feedforward contexts has led to faster training and state-of-the-art results in translation (via the Transformer [22] ), sentiment analysis [25] , and other tasks." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-14", "text": "These successes have motivated preliminary work in self-attention for ASR." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-15", "text": "Time-restricted self-attention was used as a drop-in replacement for individual layers in the state-of-theart lattice-free MMI model [26] , an HMM-NN system." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-16", "text": "Hybrid self-attention/LSTM encoders were studied in the context of listenattend-spell (LAS) [27] , and the Transformer was directly adapted to speech in [19, 28, 29] ; both are encoder-decoder systems." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-17", "text": "In this work, we propose and evaluate fully self-attentional networks for CTC (SAN-CTC)." 
}, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-18", "text": "We are motivated by practicality: selfattention could be used as a drop-in replacement in existing CTClike systems, where only attention has been evaluated in the past [30, 31] ; unlike encoder-decoder systems, SAN-CTC is able to predict tokens in parallel at inference time; an analysis of SAN-CTC is useful for future state-of-the-art ASR systems, which may equip self-attentive encoders with auxiliary CTC losses [17, 20] ." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-19", "text": "Unlike past works, we do not require convolutional frontends [19] or interleaved recurrences [27] to train self-attention for ASR." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-20", "text": "In Section 2, we motivate the model and relevant design choices (position, downsampling) for ASR." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-21", "text": "In Section 3, we validate SAN-CTC on the Wall Street Journal and LibriSpeech datasets by outperforming existing CTC models and most encoder-decoder models in character error rates (CERs), with fewer parameters or less training time." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-22", "text": "Finally, we train our models with different label alphabets (character, phoneme, subword), use WFST decoding to give word error rates (WERs), and examine the learned attention heads for insights." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-23", "text": "----------------------------------" }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-24", "text": "**MODEL ARCHITECTURES FOR CTC AND ASR**" }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-25", "text": "Consider an input sequence of T feature vectors, viewed as a matrix X \u2208 R T \u00d7d fr . Let L denote the (finite) label alphabet, and denote the output sequence as y = (y1, . . . , yU ) \u2208 L U ." 
}, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-26", "text": "In ASR, X is the sequence of acoustic frames, L is the set of graphemes/phonemes/wordpieces, and y is the corresponding ground-truth transcription over L." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-27", "text": "For CTC, one assumes U \u2264 T and defines an intermediate al-" }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-28", "text": "In this way, paths are analogous to framewise alignments in the HMM-NN framework." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-29", "text": "CTC models the distribution of sequences by marginalizing over all paths corresponding to an output:" }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-30", "text": "Finally, CTC models each P (\u03c0 | X) non-autoregressively, as a sequence of conditionally-independent outputs:" }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-31", "text": "This model assumption means each P (\u03c0t, t | X) could be computed in parallel, after which one can do prediction via beam search, or training with gradient descent using the objective LCTC(X, y) = \u2212 log P (y | X); the order-monotonicity of B ensures LCTC can be efficiently evaluated with dynamic programming [1, 4] ." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-32", "text": "----------------------------------" }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-33", "text": "**RECURRENT AND CONVOLUTIONAL MODELS**" }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-34", "text": "In practice, one models P (\u03c0, t | X) with a neural network." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-35", "text": "As inspired by HMMs, the model simplification of conditional independence can be tempered by multiple layers of (recurrent) bidirectional long short-term memory units (BLSTMs) [1] [2] [3] [4] ." 
}, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-36", "text": "However, these are computationally expensive (Table 1) , leading to simplifications like gated recurrent units (GRUs) [8, 32] ; furthermore, the success of the ReLU(x) = max(0, x) nonlinearity in preventing vanishing gradients enabled the use of vanilla bidirectional recurrent deep neural networks (BRDNNs) [5, 6, 33] to further reduce operations per layer." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-37", "text": "Convolutions over time and/or frequency were first used as initial layers to recurrent neural models, beginning with HMM-NNs [34] and later with CTC, where they are viewed as promoting invariance to temporal and spectral translation in ASR [8] , or image translation in handwriting recognition [35] ; they also serve as a form of dimensionality reduction (Section 2.4)." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-38", "text": "However, these networks were still bottlenecked by the sequentiality of operations at the recurrent layers, leading [8] to propose row convolutions for unidirectional RNNs, which had finite lookaheads to enable online processing while having some future context." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-39", "text": "This led to convolution-only CTC models for long-range temporal dependencies [9] [10] [11] ." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-40", "text": "However, these models have to be very deep (e.g., 17-19 convolutional layers on LibriSpeech [23] ) to cover the same context (Table 1) ." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-41", "text": "While in theory, a relatively local context could suffices for ASR, this is complicated by alphabets L which violate the conditional independence assumption of CTC (e.g., English characters [36] )." 
}, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-42", "text": "Wide contexts also enable incorporation of noise/speaker contexts, as [27] suggest regarding the broad-context attention heads in the first layer of their self-attentional LAS model." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-43", "text": "----------------------------------" }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-44", "text": "**MOTIVATING THE SELF-ATTENTION LAYER**" }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-45", "text": "We now replace recurrent and convolutional layers for CTC with self-attention [24] ." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-46", "text": "Our proposed framework ( Figure 1a ) is built around self-attention layers, as used in the Transformer encoder [22] , previous explorations of self-attention in ASR [19, 27] , and defined in Section 2.3." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-47", "text": "The other stages are downsampling, which reduces input length T via methods like those in Section 2.4; embedding, which learns a dh-dim. embedding that also describes token position (Section 2.5); and projection, where each final representation is mapped framewise to logits over the intermediate alphabet L ." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-48", "text": "The first implements self-attention, where the success of attention in CTC and encoder-decoder models [14, 31] is parallelized by using each position's representation to attend to all others, giving a contextualized representation for that position." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-49", "text": "Hence, the full receptive field is immediately available at the cost of O(T 2 ) inner products (Table 1) , enabling richer representations in fewer layers." 
}, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-50", "text": "----------------------------------" }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-51", "text": "**MODEL**" }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-52", "text": "Operations per layer" }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-53", "text": "----------------------------------" }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-54", "text": "**SEQUENTIAL OPERATIONS**" }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-55", "text": "Maximum path length Table 1 : Operation complexity of each layer type, based on [22] ." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-56", "text": "T is input length, d is no. of hidden units, and k is filter/context width." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-57", "text": "We also see inspiration from convolutional blocks: residual connections, layer normalization, and tied dense layers with ReLU for representation learning." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-58", "text": "In particular, multi-head attention is akin to having a number of infinitely-wide filters whose weights adapt to the content (allowing fewer \"filters\" to suffice)." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-59", "text": "One can also assign interpretations; for example, [27] argue their LAS self-attention heads are differentiated phoneme detectors." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-60", "text": "Further inductive biases like filter widths and causality could be expressed through time-restricted self-attention [26] and directed self-attention [25] , respectively." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-61", "text": "----------------------------------" }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-62", "text": "**FORMULATION**" }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-63", "text": "Let H \u2208 R T \u00d7d h denote a sublayer's input." 
}, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-64", "text": "The first sublayer performs multi-head, scaled dot-product, self-attention [22] ." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-65", "text": "For each head i of nhds, we learn linear maps W" }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-66", "text": ", and values V (i) of the i-th head, which combine to give" }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-67", "text": "where \u03c3 is row-wise softmax." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-68", "text": "Heads are concatenated along the dh/nhds axis to give MltHdAtt = [HdAtt (1) , . . . , HdAtt (n hds ) ]." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-69", "text": "The second sublayer is a position-wise feed-forward network [22] FFN(H) = ReLU(HW1 + b1)W2 + b2 where parameters" }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-70", "text": "with the biases b1, b2 broadcasted over all T positions." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-71", "text": "This sublayer aggregates the multiple heads at time t into the attention layer's final output at t. All together, the layer is given by:" }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-72", "text": "----------------------------------" }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-73", "text": "**DOWNSAMPLING**" }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-74", "text": "In speech, the input length T of frames can be many times larger than the output length U , in contrast to the roughly word-to-word setting of machine translation." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-75", "text": "This is especially prohibitive for self-attention in terms of memory: recall that an attention matrix of dimension" }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-77", "text": "A convolutional frontend is a typical downsampling strategy [8, 19] ; however, we leave integrating other layer types into SAN-CTC as future work." 
}, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-78", "text": "Instead, we consider three fixed approaches, from least-to most-preserving of the input data: subsampling, which only takes every k-th frame; pooling, which aggregates every k consecutive frames via a statistic (average, maximum); reshaping, where one concatenates k consecutive frames into one [27] ." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-79", "text": "Note that CTC will still require U \u2264 T /k, however." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-80", "text": "----------------------------------" }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-81", "text": "**POSITION**" }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-82", "text": "Self-attention is inherently content-based [22] , and so one often encodes position into the post-embedding vectors." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-83", "text": "We use standard trigonometric embeddings, where for 0 \u2264 i \u2264 demb/2, we define" }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-84", "text": "for position t. We consider three approaches: content-only [21] , which forgoes position encodings; additive [19] , which takes demb = dh and adds the encoding to the embedding; and concatenative, where one takes demb = 40 and concatenates it to the embedding." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-85", "text": "The latter was found necessary for self-attentional LAS [27] , as additive encodings did not give convergence." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-86", "text": "However, the monotonicity of CTC is a further positional inductive bias, which may enable the success of content-only and additive encodings." 
}, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-87", "text": "----------------------------------" }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-88", "text": "**EXPERIMENTS**" }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-89", "text": "We take (nlayers, dh, nheads, dff) = (10, 512, 8, 2048), giving \u223c30M parameters." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-90", "text": "This is on par with models on WSJ (10-30M) [4, 5, 9] and an order of magnitude below models on LibriSpeech (100-250M) [8, 23] ." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-91", "text": "We use MXNet [37] for modeling and Kaldi/EESEN [7, 38] for data preparation and decoding." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-92", "text": "Our self-attention code is based on GluonNLP's implementation." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-93", "text": "At train time, utterances are sorted by length: we exclude those longer than 1800 frames ( 1% of each training set)." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-94", "text": "We take a window of 25ms, a hop of 10ms, and concatenate cepstral mean-variance normalized features with temporal first-and second-order differences." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-95", "text": "1 We downsample by a factor of k = 3 (this also gave an ideal T /k \u2248 dh for our data; see Table 1 )." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-96", "text": "We perform Nesterov-accelerated gradient descent on batches of 20 utterances." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-97", "text": "As self-attention architectures can be unstable in early training, we clip gradients to a global norm of 1 and use the standard linear warmup period before inverse square decay associated with these architectures [19, 22] ." 
}, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-98", "text": "Let n denote the global step number of the batch (across epochs); the learning rate is given by" }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-99", "text": "1 Rescaling so that these differences also have var." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-100", "text": "\u2248 1 helped WSJ training." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-101", "text": "[17] 11.5 -9.0 -Enc-Dec (4-1) [17] 12.0 -8.2 -Enc-Dec+CTC (4-1) [17] 11.3 -7.4 -Enc-Dec (4-1) [39] --6.4 9.3 CTC/ASG (Gated CNN) [40] 6.9 9.5 4.9 6.6 Enc-Dec (2,1,3-1 Table 3 : CTC phoneme models with WFST decoding on WSJ." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-102", "text": "where we take \u03bb = 400 and nwarmup as a hyperparameter." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-103", "text": "However, such a decay led to early stagnation in validation accuracy, so we later divide the learning rate by 10 and run at the decayed rate for 20 epochs." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-104", "text": "We do this twice, then take the epoch with the best validation score." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-105", "text": "Xavier initialization gave validation accuracies of zero for the first few epochs, suggesting room for improvement." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-106", "text": "Like previous works on self-attention, we apply label smoothing (see Tables 2, 3, 5; we also tried model averaging to no gain)." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-107", "text": "To compute word error rates (WERs), we use the dataset's provided language model (LM) as incorporated by WFST decoding [7] to bridge the gap between CTC and encoder-decoder frameworks, allowing comparison with known benchmarks and informing systems that incorporate expert knowledge in this way (e.g., via a pronunciation lexicon)." 
}, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-108", "text": "----------------------------------" }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-109", "text": "**WALL STREET JOURNAL (WSJ)**" }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-110", "text": "We train both character-and phoneme-label systems on the 80-hour WSJ training set to validate our architectural choices." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-111", "text": "Similar to [17, 19] , we use 40-dim." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-112", "text": "mel-scale filter banks and hence 120-dim." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-113", "text": "features." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-114", "text": "We warmup for 8000 steps, use a dropout of 0.2, and switch schedules at epoch 40." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-115", "text": "For the WSJ dataset, we compare with similar MLE-trained, end-to-end, open-vocabulary systems in Table 2 ." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-116", "text": "We get an eval92 CER of 4.7%, outdoing all previous CTC-like results except 4.6% with a trainable frontend [40] ." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-117", "text": "We use the provided extended 3-gram LM to retrieve WERs." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-118", "text": "For phoneme training, our labels come from the CMU pronunciation lexicon (Table 3 )." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-119", "text": "These models train in one day (Tesla V100), comparable to the Speech Transformer [19] ; however, SAN-CTC gives further benefits at inference time as token predictions are generated in parallel." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-120", "text": "We also evaluate design choices in Table 4 ." 
}, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-121", "text": "Here, we consider the effects of downsampling and position encoding on accuracy for our fixed training regime." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-122", "text": "We see that unlike self-attentional LAS [27] , SAN-CTC works respectably even with no position en- coding; in fact, the contribution of position is relatively minor (compare with [21] , where location in an encoder-decoder system improved CER by 3% absolute)." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-123", "text": "Lossy downsampling appears to preserve performance in CER while degrading WER (as information about frame transitions is lost)." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-124", "text": "We believe these observations align with the monotonicity and independence assumptions of CTC." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-125", "text": "Inspired by [27] , we plot the standard deviation of attention weights for each head as training progresses; see Figure 2 for details." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-126", "text": "In the first layers, we similarly observe a differentiation of variances, along with wide-context heads; in later layers, unlike [27] we still see mild differentiation of variances." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-127", "text": "Inspired by [26] , we further plot the attention weights relative to the current time position (here, per head)." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-128", "text": "Character labels gave forward-and backward-attending heads (incidentally, averaging these would retrieve the bimodal distribution in [26] ) at all layers." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-129", "text": "This suggests a gradual expansion of context over depth, as is often engineered in convolutional CTC." 
}, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-130", "text": "This also suggests possibly using fewer heads, directed self-attention [25] , and restricted contexts for faster training (Table 1) ." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-131", "text": "Phoneme labels gave a sharp backward-attending head and more diffuse heads." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-132", "text": "We believe this to be a symptom of English characters being more contextdependent than phonemes (for example, emitting 'tt' requires looking ahead, as '-' must occur between two runs of 't' tokens)." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-133", "text": "----------------------------------" }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-134", "text": "**LIBRISPEECH**" }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-135", "text": "We give the first large-scale demonstration of a fully self-attentional ASR model using the LibriSpeech ASR corpus [42] , an English corpus produced from audio books giving 960 hours of training" }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-136", "text": "----------------------------------" }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-137", "text": "**MODEL**" }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-138", "text": "Tok. test-clean test-other CER WER CER WER CTC/ASG (Wav2Letter) [9] chr." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-139", "text": "6.9 7.2 --CTC (DS1-like) [33, 43] chr." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-140", "text": "-6.5 --Enc-Dec (4-4) [44] chr." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-141", "text": "6.5 -18.1 -Enc-Dec (6-1) [45] chr." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-142", "text": "4.5 -11.6 -CTC (DS2-like) [8, 32] chr." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-143", "text": "Table 5 ." 
}, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-144", "text": "At this scale, even minor label smoothing was detrimental." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-145", "text": "We run 70 epochs in slightly over a week (Tesla V100) then choose the epoch with the best validation score for testing." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-146", "text": "For comparison, the best CTC-like architecture [23] took 4-8 weeks on 4 GPUs for its results." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-147", "text": "2 The Enc-Dec+CTC model is comparable, taking almost a week on an older GPU (GTX 1080 Ti) to do its \u223c12.5 full passes over the data." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-148", "text": "3 Finally, we trained the same model with BPE subwords as CTC targets, to get more context-independent units [36] ." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-149", "text": "We did 300 merge operations (10k was unstable) and attained a CER of 7.4%." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-150", "text": "This gave a WER of 8.7% with no LM (compare with Table 5 's LMbased entries), and 5.2% with a subword WFST of the LM." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-151", "text": "We still observed attention heads in both directions in the first layer, suggesting our subwords were still more context-dependent than phonemes." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-152", "text": "----------------------------------" }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-153", "text": "**CONCLUSION**" }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-154", "text": "We introduced SAN-CTC, a novel framework which integrates a fully self-attentional network with a connectionist temporal classification loss." 
}, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-155", "text": "We addressed the challenges of adapting self-attention to CTC and to speech recognition, showing that SAN-CTC is competitive with or outperforms existing end-to-end models on WSJ and LibriSpeech." }, { "sent_id": "18a44fac8d2f450aee62fc15c00c6f-C001-156", "text": "Future avenues of work include multitasking SAN-CTC with other decoders or objectives, and streamlining network structure via directed or restricted attention." } ], "y": { "@BACK@": { "gold_contexts": [ [ "18a44fac8d2f450aee62fc15c00c6f-C001-16" ], [ "18a44fac8d2f450aee62fc15c00c6f-C001-42" ], [ "18a44fac8d2f450aee62fc15c00c6f-C001-59" ] ], "cite_sentences": [ "18a44fac8d2f450aee62fc15c00c6f-C001-16", "18a44fac8d2f450aee62fc15c00c6f-C001-42", "18a44fac8d2f450aee62fc15c00c6f-C001-59" ] }, "@DIF@": { "gold_contexts": [ [ "18a44fac8d2f450aee62fc15c00c6f-C001-19" ], [ "18a44fac8d2f450aee62fc15c00c6f-C001-77", "18a44fac8d2f450aee62fc15c00c6f-C001-78" ], [ "18a44fac8d2f450aee62fc15c00c6f-C001-85" ], [ "18a44fac8d2f450aee62fc15c00c6f-C001-122" ], [ "18a44fac8d2f450aee62fc15c00c6f-C001-126" ] ], "cite_sentences": [ "18a44fac8d2f450aee62fc15c00c6f-C001-19", "18a44fac8d2f450aee62fc15c00c6f-C001-78", "18a44fac8d2f450aee62fc15c00c6f-C001-85", "18a44fac8d2f450aee62fc15c00c6f-C001-122", "18a44fac8d2f450aee62fc15c00c6f-C001-126" ] }, "@SIM@": { "gold_contexts": [ [ "18a44fac8d2f450aee62fc15c00c6f-C001-46" ], [ "18a44fac8d2f450aee62fc15c00c6f-C001-125" ] ], "cite_sentences": [ "18a44fac8d2f450aee62fc15c00c6f-C001-46", "18a44fac8d2f450aee62fc15c00c6f-C001-125" ] }, "@USE@": { "gold_contexts": [ [ "18a44fac8d2f450aee62fc15c00c6f-C001-46" ], [ "18a44fac8d2f450aee62fc15c00c6f-C001-125" ] ], "cite_sentences": [ "18a44fac8d2f450aee62fc15c00c6f-C001-46", "18a44fac8d2f450aee62fc15c00c6f-C001-125" ] }, "@EXT@": { "gold_contexts": [ [ "18a44fac8d2f450aee62fc15c00c6f-C001-77", "18a44fac8d2f450aee62fc15c00c6f-C001-78" ] ], "cite_sentences": [ 
"18a44fac8d2f450aee62fc15c00c6f-C001-78" ] } } }, "ABC_c5ec401f42f79c4707770dac4f5013_15": { "x": [ { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-2", "text": "Despite the success of existing works on single-turn conversation generation, taking the coherence in consideration, human conversing is actually a context-sensitive process." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-3", "text": "Inspired by the existing studies, this paper proposed the static and dynamic attention based approaches for context-sensitive generation of open-domain conversational responses." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-4", "text": "Experimental results on two public datasets show that the proposed static attention based approach outperforms all the baselines on automatic and human evaluation." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-5", "text": "----------------------------------" }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-7", "text": "Until recently, training open-domain conversational systems that can imitate the way of human conversing is still not a well-solved problem and non-trivial task." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-8", "text": "Previous efforts focus on generating opendomain conversational responses as an unsupervised clustering process (Ritter et al., 2010) , a phrasebased statistical machine translation task (Ritter et al., 2011 ) and a search problem based on the vector space model (Banchs and Li, 2012) , etc." 
}, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-9", "text": "With the booming of deep learning, particularly the neural network based sequence-to-sequence models, generating open-domain conversational responses gradually turns into an end-to-end encoding and decoding process (Sutskever et al., 2014; Vinyals and Le, 2015; Shang et al., 2015; Serban et al., 2016b; Li et al., 2016a; Li et al., 2016b; Shao et al., 2017; Yao et al., 2017) ." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-10", "text": "Despite the success of the above research on single-turn conversational response generation, human conversations are usually coherent (Li et al., 2016c) and context-sensitive (Tian et al., 2017; Xing et al., 2017) ." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-11", "text": "Table 1 illustrates how contextual information in conversations impact on the response generation." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-12", "text": "For instance, given a message 1 \"How should I tell my mom?\", as input, to a single-turn" }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-13", "text": "----------------------------------" }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-14", "text": "**CONVERSATION 1**" }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-15", "text": "Conversation 2 A: I got a high score on my exam." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-16", "text": "A: I failed to pass the exam." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-17", "text": "B: Oh! Great! B: That's too bad." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-18", "text": "A: How should I tell my mom?" }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-19", "text": "A: How should I tell my mom? B: Go and give her a big surprise! B: Just tell her the truth and do well next time." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-20", "text": "Table 1 : An example of the impact of contextual information on human conversations." 
}, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-21", "text": "\"A\" and \"B\" denote two speakers in the conversations." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-22", "text": "conversational response generation model, it should output a fixed response regardless of the content in previous utterances." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-23", "text": "However, as shown in Table 1 , in the conversations 2 , the responses to be generated (the last utterance in Table 1) should not only dependent on the last one message (\"How should I tell my mom?\"), but also need to consider the longer historical utterances in the conversations." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-24", "text": "This work is licensed under a Creative Commons Attribution 4.0 International License." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-25", "text": "License details: http: //creativecommons.org/licenses/by/4.0/ 1 Here, a \"message\" indicates an input of a response in single-turn conversational response generation." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-26", "text": "2 In this paper, a \"conversation\" equals to an \"open-domain conversation\"and a \"conversational response\" or \"response\" equals to an \"open-domain conversational response\"." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-27", "text": "Recent studies on generating open-domain conversational responses begin to explore the context information to generate more informative and coherent responses." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-28", "text": "Serban et al. (2016a) presented a hierarchical recurrent encoder-decoder (HRED) to recurrently model the dialogue context." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-29", "text": "Serban et al. (2017b) further introduced a stochastic latent variable at each dialogue turn to improve the diversity of the HRED model." 
}, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-30", "text": "proposed a conditional variational autoencoder based approach to learning contextual diversity for neural response generation." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-31", "text": "Xing et al. (2017) proposed a hierarchical recurrent attention network (HRAN) to jointly model the importance of tokens and utterances." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-32", "text": "Tian et al. (2017) treated the hierarchical modeling of contextual information as a recurrent process in encoding." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-33", "text": "We could make two conclusions from these works." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-34", "text": "\u2022 First, existing studies of utterance modeling mainly focus on representing utterances by using bidirectional GRU (Xing et al., 2017) or unidirectional GRU (Tian et al., 2017 )." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-35", "text": "\u2022 Second, there are two types of approaches on context (inter-utterance) modeling." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-36", "text": "One is the attention-based approach (Xing et al., 2017) , the other is the sequential integration approach (Tian et al., 2017) ." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-37", "text": "Drawing the advantages of the existing approaches, in this paper, we proposed a novel contextsensitive generation approach, which obtains the context representation of a conversation by weighing the importance of each utterance using two attention mechanisms, namely dynamic and static attention, to generate open-domain conversational responses." 
}, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-38", "text": "2 The Proposed Context-Sensitive Generation Approach" }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-39", "text": "----------------------------------" }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-40", "text": "**PRELIMINARY**" }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-41", "text": "A typical neural network based sequence-to-sequence model for generating open-domain conversational responses usually includes an encoder and a decoder." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-42", "text": "The encoder expresses an input message as a dense vector which represents the semantics of the input message." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-43", "text": "The decoder then generates a conversational response according to the semantic representation of the input message." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-44", "text": "In context-sensitive generation of open-domain conversational responses, the input message to the encoder usually includes several historical utterances in a conversation." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-45", "text": "Therefore, one of the key problems in context-sensitive generation is how to encode historical utterances in a conversation." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-46", "text": "Figure 1 presents two state-of-the-art approaches to encoding contextual information for context-sensitive response generation." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-47", "text": "Here, u i , u i+1 and u j denote the i-th, i+1-th and j-th utterance, respectively, in a conversation." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-48", "text": "As the inputs of the two models, they are then represented to utterance-level vectors as shown in the second layer of the two models in Figure 1 ." 
}, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-49", "text": "The context vectors of the two models are obtained by hierarchically representing the utterances to a dense vector c for decoding." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-50", "text": "It is easy to recognize that the frameworks used to illustrate the encoders of two existing context-sensitive generation models look similar to each other." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-51", "text": "There are two different parts between the two frameworks:" }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-52", "text": "\u2022 Utterance Representations: Bidirectional GRU vs. Unidirectional GRU Xing et al. (2017) utilized a bidirectional GRU and a word-level attention mechanism to transfer word representations to utterance representations." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-53", "text": "Tian et al. (2017) represented the utterance in a simpler way, which is a unidirectional GRU." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-54", "text": "\u2022 Inter-utterance Representations: Attention vs. Sequential Integration Xing et al. (2017) proposed a hierarchical attention mechanism to feed the utterance representations to a backward RNN to obtain contextual representation." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-55", "text": "Tian et al. (2017) proposed a weighted sequential integration (WSI) approach to use an RNN model and a heuristic weighting mechanism to obtain interutterance representation." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-56", "text": "----------------------------------" }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-57", "text": "**THE PROPOSED MODEL**" }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-58", "text": "The proposed context-sensitive generation model is under the framework of encoder-decoder." 
}, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-59", "text": "To obtain the contextual representations, the proposed model consists of a hierarchical representation mechanism for encoding." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-60", "text": "For utterance representation, we consider the advantages of the two state-of-the-art approaches to encoding contextual information for context-sensitive response generation (Xing et al., 2017; Tian et al., 2017) ." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-61", "text": "We utilize a GRU model to obtain utterance representation." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-62", "text": "For inter-utterance representation, inspired by the above approaches of modeling inter-utterance representations, we proposed two attention mechanisms, namely dynamic and static attention, to weigh the importance of each utterance in a conversation and obtain the contextual representation." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-63", "text": "Figure 2 shows the framework of the proposed context-sensitive generation model." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-64", "text": "Drawing the advantages of attention mechanism on Here, u * denotes the * -th utterance in a conversation." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-65", "text": "weighing the importance of utterances for generating open-domain conversational responses (Xing et al., 2017) , we thus model the inter-utterance representation to obtain the context vector in two measures, namely static and dynamic attention, as shown in Figure 2 ." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-66", "text": "We then formally describe the static and dynamic attention for decoding process." 
}, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-67", "text": "\u2022 Static Attention based Decoding As shown in Figure 2 , the static attention mechanism calculates the importance of each utterance as e i or \u03b1 i (i \u2208 {1, ..., s})." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-68", "text": "Here, h i and h s denote the representations of hidden state of the i-th and the last utterance in a conversation, respectively." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-69", "text": "V , W and U are parameters." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-70", "text": "We can see that once the weights of each utterance \u03b1 i (i \u2208 {1, ..., s}) are produced, they will be unchanged in the decoding process." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-71", "text": "In decoding, the t-th hidden state s t can be calculated as follows:" }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-72", "text": "Here, y t\u22121 is the t\u22121-th output of the decoder and s t\u22121 is the hidden state of t-1-th time step in decoding." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-73", "text": "Notice that y 0 is set to be a special character and s 0 is initialized by h s ." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-74", "text": "The generated response is thus represented as a sequence of y 1 , y 2 , ..., y T , where T denotes the last time step." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-75", "text": "\u2022 Dynamic Attention based Decoding Rather than the static attention mechanism fixes the weights of each utterance before decoding process, the dynamic attention mechanism maintains a weighting matrix and updates the weights of each utterance during decoding process as shown in Figure 2 ." 
}, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-76", "text": "The formal illustration of the dynamic attention mechanism for decoding is as follows:" }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-77", "text": "Here, V , W and U are also parameters that are independent to those in the static attention." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-78", "text": "T denotes the transposition operation of V ." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-79", "text": "The e i,t and \u03b1 i,t are calculated in each time step t of decoding." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-80", "text": "The t-th hidden state s t in dynamic attention-based decoder can be calculated as follows:" }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-81", "text": "The main difference between our proposed conversational response generation model and the above two state-of-the-art models is the two attention mechanisms for obtaining the contextual representation of a conversation." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-82", "text": "Rather than use a hierarchical attention neural network (Xing et al., 2017) to obtain the contextual representation of a conversation, we propose two utterance-level attentions for weighting the importance of each utterance in the context, which is more simple in structure and has less number of parameters than the hierarchical attention approach." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-83", "text": "Meanwhile, rather than use a heuristic approach to weigh the importance of each utterance in the context (Tian et al., 2017) , in our proposed approach, the weights of utterance in the context are learned by two attention mechanisms from the data, which is more reasonable and flexible than the heuristic based approach." 
}, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-84", "text": "----------------------------------" }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-85", "text": "**EXPERIMENTAL RESULTS**" }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-86", "text": "----------------------------------" }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-87", "text": "**EXPERIMENT SETTINGS**" }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-88", "text": "Dataset: Two datasets are selected for the experiment of generation of open-domain conversational responses." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-89", "text": "First is the Ubuntu dataset which is developed by Lowe et al. (2015) ." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-90", "text": "The dataset is extracted from the Ubuntu Internet Relayed Chat (IRC) channel and recently used for the generation of conversational responses in (Serban et al., 2016a; Serban et al., 2017b; Serban et al., 2017a) ." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-91", "text": "We follow the train-test split proposed by Serban et al. (2017a) ." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-92", "text": "It is worthy to note that there is no development set in Serban et al. (2017a) ." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-93", "text": "In this paper, we randomly select the same number of sessions to that in the test set from the training set." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-94", "text": "Second is the OpenSubtitles dataset which is proposed by Tiedemann (2009) and also used by Li et al. (2016a; Li et al. (2016c) ." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-95", "text": "The detailed statistics of the two datasets are shown in Table 2 ." 
}, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-96", "text": "It is worthy to note that the original released data of OpenSubtitles consists of about 40,000,000 utterances without partitions of conversational session, which is called \"session\" for short in the following of this paper." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-97", "text": "Therefore, we split each of 10 continuous utterances as a session." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-98", "text": "We then randomly sample 800,000 sessions for training (including 8,000 sessions for developing) and remove them from the complete dataset." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-99", "text": "In the rest of the complete dataset, we again randomly sample 8,000 sessions for testset." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-100", "text": "The vocabulary size equals to the number of unique tokens in both two datasets, respectively." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-101", "text": "Table 2 : The statistics of two experimental datasets." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-102", "text": "Avg is short for average." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-103", "text": "# represents number." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-104", "text": "u and w denote utterance and word respectively." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-105", "text": "The unit of training and test is a conversational session." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-106", "text": "----------------------------------" }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-107", "text": "**UBUNTU OPENSUBTITLES**" }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-108", "text": "Hyper-Parameters: For the static attention model, the dimension of hidden layer in encoder and decoder is 512." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-109", "text": "The padding length is set to 15." 
}, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-110", "text": "The dimension of word embedding equals to 200." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-111", "text": "The word embedding is pre-trained using the skip-gram model in word2vec (Mikolov et al., 2013) and fine-tuned during the learning process." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-112", "text": "For the dynamic attention model, the dimension of hidden layer in encoder and decoder is 1024." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-113", "text": "The padding length and dimension of word embedding are same to the static attention model." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-114", "text": "Adam is used for optimization." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-115", "text": "The initial learning rate is 0.001 and the weight decay is set to 10 \u22125 ." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-116", "text": "The dropout parameter equals to 0.5." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-117", "text": "Mini-batch is used and the batch size equals to 80." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-118", "text": "The number of iterations in training is 10." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-119", "text": "Baselines: For the experimental comparisons, six baselines are chosen." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-120", "text": "Four out of them are state-ofthe-art approaches." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-121", "text": "They are VHRED, CVAE, WSI, and HRAN." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-122", "text": "\u2022 LSTM: Under the sequence-to-sequence framework for generation of conversational responses, the most simple but natural idea is to directly use the LSTM to encode all the utterances in a session word by word and then decode to generate a response." 
}, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-123", "text": "\u2022 HRED: The first hierarchical recurrent encoder-decoder model, which is proposed by Serban et al. (2016a) , for conversational response generation." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-124", "text": "\u2022 VHRED: The augmented HRED model, which incorporates a stochastic latent variable at utterance level for encoding and decoding, is proposed by Serban et al. (2017b) ." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-125", "text": "\u2022 CVAE: The conditional variational autoencoder based approach, which is proposed by , to learn context diversity for conversational responses generation." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-126", "text": "\u2022 WSI and HRAN are proposed by Tian et al. (2017) and Xing et al. (2017) respectively." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-127", "text": "We detailed describe and compare the two models in Section 2.1 and 2.2 and their frameworks are shown in Figure 1 ." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-128", "text": "----------------------------------" }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-129", "text": "**EVALUATION AND RESULTS**" }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-130", "text": "----------------------------------" }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-131", "text": "**AUTOMATIC EVALUATION**" }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-132", "text": "Until now, automatically evaluating the quality of open-domain conversational response is still an open problem." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-133", "text": "The BLEU score (Papineni et al., 2002) , which is a widely used evaluation metric for machine translation, is not a suitable metric for conversation generation, as the appropriate responses to the same message may share less common words." 
}, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-134", "text": "Moreover, it is also impossible to construct a reference set, which includes all appropriate responses, of each message." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-135", "text": "The perplexity that is used to evaluate language model, is also not suitable to evaluate the relevance between messages and responses (Shang Figure 2 ." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-136", "text": "The other models are baselines." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-137", "text": "\u2192 and denote the use of unidirectional and bidirectional GRU in the proposed model to obtain utterance representations, respectively." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-138", "text": "\u2020 denotes the results pass the statistical significance test with p < 0.05." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-139", "text": "Li et al., 2016c) ." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-140", "text": "In this paper, we employ an evaluation metric that is proposed by Serban et al. (2016a) and also used in (Serban et al., 2017b) ." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-141", "text": "Rather than calculating the token-level or n-gram similarity as the perplexity and BLEU (Papineni et al., 2002) , the metric measure the semantic similarity between a generated responsesr and the ground-truth responses r by matching their semantic representations." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-142", "text": "The metric also has three aspects, namely Average, Greedy and Extrema." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-143", "text": "For the Average, it first calculates the element-wise arithmetic average of embeddings of all words inr a and r a , respectively and produces two response representations vr a and v ra ." 
}, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-144", "text": "The value of Average is then equals to the cosine similarity of vr a and v ra ." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-145", "text": "For the Greedy, every word inr will find a most similar word in r by calculating the cosine similarity of their word embeddings." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-146", "text": "After that, the element-wise arithmetic average of embeddings of all words inr a and the corresponding words in r are calculated and two response representations vr g and v rg are produced." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-147", "text": "The value of Greedy is then equals to the cosine similarity of vr g and v rg ." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-148", "text": "For the Extrema, two embedding matrices mr and m r can be obtained by arranging the embeddings of all words inr a and r a , respectively." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-149", "text": "The i-th column of mr is the embedding of the i-th word inr as well as that in m r ." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-150", "text": "Getting the maximum value of each row in mr and m r , respectively, we then obtain two response representations vr e and v re ." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-151", "text": "The value of Extrema is then equals to the cosine similarity of vr e and v re ." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-152", "text": "Table 3 shows the experimental results on two datasets." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-153", "text": "We can see that our proposed context-sensitive generation model with static attention outperforms all the baselines in the two datasets." 
}, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-154", "text": "It verifies the effectiveness of the proposed utterance-level attention mechanism on modeling context representations for generating conversational responses." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-155", "text": "To compare the dynamic and static attentions, we find that for the generation of conversational response, dynamically estimate the importance of each utterance in context performs worse than the static attention approach." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-156", "text": "The reason may be that the context vector in dynamic attention model is changed in every time step of decoding." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-157", "text": "The change of context vector may lead to decoding incoherent responses." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-158", "text": "Meanwhile, the unidirectional GRU based models outperform the bidirectional GRU based models." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-159", "text": "It doesn't illustrate the unidirectional GRU is better than the bidirectional GRU in utterance representation." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-160", "text": "It only indicates that in the current experimental settings, the unidirectional GRU based model outperforms the bidirectional one." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-161", "text": "----------------------------------" }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-162", "text": "**HUMAN EVALUATION**" }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-163", "text": "For human evaluation, we proposed 2 metrics, namely Coherence and Naturalness." 
}, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-164", "text": "As the example shown in Table 1 , in context-sensitive generation of conversational responses, a generated response should not only dependent on the last one message but also need to consider the longer context in the conversation." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-165", "text": "Coherence is thus used to evaluate the semantic consistency between a generated response and its context." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-166", "text": "The Coherence score is in the range of 0 to 2, where 0, 1, 2 denote incoherent, neutral and coherent, respectively." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-167", "text": "In some cases, a coherent response may not be a natural one." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-168", "text": "Given an example message, \"Can you tell me the way to the nearest bazaar?\", the response \"Yes, I can tell you the way to the nearest bazaar.\" is definitely a coherent but not a natural response." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-169", "text": "A more extreme example of a message-response pair is \"I don't know what you are talking about!\" and \"I don't know what you are talking about!\"." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-170", "text": "Therefore, besides the Coherence, we proposed another metric, Naturalness, to evaluate the quality of generated responses." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-171", "text": "For human evaluation, given a context and a conversational response generated by a model, Naturalness denotes whether the response can be an alternative to a human response." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-172", "text": "The Naturalness value equals to 1 or 0, which represents the generated response can be an alternative to a human response or not, respectively." 
}, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-173", "text": "Besides the Coherence and Naturalness, we also want to compare the Diversity of the responses generated by all baselines and our proposed approach." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-174", "text": "Here, diversity score of a generated response equals to the number of distinct tokens in the response divided by the total number of distinct tokens in its context (including the number of distinct tokens in the response)." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-175", "text": "The final Diversity score is the average diversity of all the generated responses in test set." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-176", "text": "In the human evaluation, for each model, we randomly sample 500 test results from Ubuntu and OpenSubtitles datasets, respectively." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-177", "text": "Each of the three annotators, who are undergraduate students and not involved in other parts of the experiment, is asked to provide the evaluation scores for all the 8,000 test results." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-178", "text": "The final score of each model equals to the average score of the three annotators." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-179", "text": "Table 4 shows the human evaluation results on the two datasets." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-180", "text": "Generally speaking, we can see that the proposed static attention-based model outperforms the baselines in Coherence and Naturalness on Ubuntu dataset and obtains comparable performance with the HRAN model in Coherence on OpenSubtitles dataset." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-181", "text": "For the Diversity, we can see that the proposed dynamic attention-based model is better at generating diverse responses than other models on Ubuntu dataset." 
}, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-182", "text": "We also notice that the CVAE model obtains the best diversity performance on OpenSubtitles dataset and the best Naturalness performance on Ubuntu dataset." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-183", "text": "----------------------------------" }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-184", "text": "**ANALYSIS OF CONTEXT LENGTH**" }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-185", "text": "To verify the impact of context length on the performance of the proposed model for the generation of conversational responses, we use different length of context to re-train the proposed models, which are called context varied models, on two datasets." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-186", "text": "Here, context length indicates the number of historical utterances that are used for encoding in a context." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-187", "text": "Figure 3 shows that the performance of the proposed static and dynamic attention models are varying with the change of context length." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-188", "text": "The values denote the difference between the results of Static \u2192 and Dynamic \u2192 in Table 3 and the results of the context varied models." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-189", "text": "It also verifies that the generation of conversational responses is a context-sensitive process, which relates to the numbers of utterance in context for encoding." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-190", "text": "Table 5 shows the conversational responses, which are sampled from the test result generated by the proposed static attention model." 
}, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-191", "text": "We can see that the attention values predicted by the static attention model can appropriately reveal the importance of the utterance in a context for generating conversational responses." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-192", "text": "4 Related Work Ritter et al. (2010) proposed an unsupervised approach to model dialogue response by clustering the raw utterances." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-193", "text": "They then presented an end-to-end dialogue response generator by using a phrase-based statistical machine translation model (Ritter et al., 2011) ." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-194", "text": "Banchs and Li (2012) introduced a search-based system, named IRIS, to generate dialogues using vector space model and then released the experimental corpus for research and development (Banchs, 2012) ." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-195", "text": "Recently, benefit from the advantages of the sequence-to-sequence learning framework with neural networks, Sutskever et al. (2014) and Shang et al. (2015) had drawn inspiration from the neural machine translation (Bahdanau et al., 2014) and proposed an RNN encoder-decoder based approach to generate dialogue by considering the last one sentence and a larger range of context respectively." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-196", "text": "Serban et al. (2016b) proposed a parallel stochastic generation framework which first generates a coarse sequence and then generates an utterance conditioned on the coarse sequence." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-197", "text": "Shao et al. (2017) introduced the \"glimpse-model\" which adds self-attention to the decoder to maintain coherence for generating long, informative, coherent and diverse responses in single turn setting." 
}, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-198", "text": "Yao et al. (2017) first predicted cue words using point-wise mutual information (P-MI) for short text conversation generation and then added them into the encoder-decoder framework." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-199", "text": "To consider the context information for improving the diversity of generated conversations, Serban et al. (2016a) presented a hierarchical recurrent encoder-decoder (HRED) approach to encode each utterance and recurrently model the dialogue context to generate context-dependent responses." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-200", "text": "Serban et al. (2017b) further introduced a stochastic latent variable at each dialogue turn to improve the ambiguity and uncertainty of the HRED model for dialogue generation." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-201", "text": "Xing et al. (2017) proposed a hierarchical recurrent attention network (HRAN) to jointly model the importance of tokens in utterances and the utterances in context for context-aware response generation." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-202", "text": "Tian et al. (2017) presented a context-aware hierarchical model to generate conversations by jointly modeling the utterance and inter-utterance information for encoding process." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-203", "text": "As the advantages of generative adversarial net (GAN) and variational autoencoder (VAE), proposed a sequence generative adversarial net model to assess a partially generated sequence with policy gradient and obtain the intermediate rewards by using Monte Carlo search." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-204", "text": "modified the VAE model by conditioning the response into the VAE model in training step to optimize the similarity of prior network and recognition network for dialogue generation." 
}, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-205", "text": "Similarly, Shen et al. (2017) presented a conditional variational framework to generate specific responses based on the dialog context." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-206", "text": "Due to the recent advantages of reinforcement learning on modeling human-computer interactions, such as the AlphaGo (Silver et al., 2016) , researchers begin to focus on modeling the success of a conversation by not only considering the quality of single turn response generation, but also considering long-term goal of the conversation." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-207", "text": "To address the problems of generating generic and repetitive response of the RNN encoder-decoder framework, Li et al. (2016c) proposed a deep reinforcement learning approach to either generate meaningful and diverse response or increase the length of the generated dialogues." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-208", "text": "Dhingra et al. (2017) presented an end-to-end dialogue system for information accquisition, which is called KB-InfoBot from knowledge base (KB) by using reinforcement learning." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-209", "text": "Asghar et al. (2017) proposed an active learning approach to learn user explicit feedback online and combine the offline supervised learning for response generation of conversational agents." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-210", "text": "----------------------------------" }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-211", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-212", "text": "This paper proposed a novel context-sensitive generation approach for open-domain conversational responses." 
}, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-213", "text": "The proposed model gained from the proposed static and dynamic attention for context or inter-utterance representation." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-214", "text": "Experimental results show that the proposed model generally outperforms all the baselines in automatic and human evaluations." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-215", "text": "It is also verified the impact of context length on the performance of the proposed generation models for conversational responses." }, { "sent_id": "c5ec401f42f79c4707770dac4f5013-C001-216", "text": "In future work, the way to uniformly integrate the static and dynamic attention for decoding will be explored." } ], "y": { "@BACK@": { "gold_contexts": [ [ "c5ec401f42f79c4707770dac4f5013-C001-10" ], [ "c5ec401f42f79c4707770dac4f5013-C001-34", "c5ec401f42f79c4707770dac4f5013-C001-36", "c5ec401f42f79c4707770dac4f5013-C001-52", "c5ec401f42f79c4707770dac4f5013-C001-54" ], [ "c5ec401f42f79c4707770dac4f5013-C001-126" ] ], "cite_sentences": [ "c5ec401f42f79c4707770dac4f5013-C001-10", "c5ec401f42f79c4707770dac4f5013-C001-34", "c5ec401f42f79c4707770dac4f5013-C001-36", "c5ec401f42f79c4707770dac4f5013-C001-52", "c5ec401f42f79c4707770dac4f5013-C001-54", "c5ec401f42f79c4707770dac4f5013-C001-126" ] }, "@SIM@": { "gold_contexts": [ [ "c5ec401f42f79c4707770dac4f5013-C001-60", "c5ec401f42f79c4707770dac4f5013-C001-65" ] ], "cite_sentences": [ "c5ec401f42f79c4707770dac4f5013-C001-60", "c5ec401f42f79c4707770dac4f5013-C001-65" ] }, "@USE@": { "gold_contexts": [ [ "c5ec401f42f79c4707770dac4f5013-C001-60", "c5ec401f42f79c4707770dac4f5013-C001-65" ] ], "cite_sentences": [ "c5ec401f42f79c4707770dac4f5013-C001-60", "c5ec401f42f79c4707770dac4f5013-C001-65" ] }, "@DIF@": { "gold_contexts": [ [ "c5ec401f42f79c4707770dac4f5013-C001-82" ] ], "cite_sentences": [ "c5ec401f42f79c4707770dac4f5013-C001-82" ] }, "@EXT@": { "gold_contexts": [ [ 
"c5ec401f42f79c4707770dac4f5013-C001-82" ] ], "cite_sentences": [ "c5ec401f42f79c4707770dac4f5013-C001-82" ] } } }, "ABC_b7a718664f395f048abb3655fb1d8d_15": { "x": [ { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-2", "text": "Lack of repeatability and generalisability are two significant threats to continuing scientific development in Natural Language Processing." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-3", "text": "Language models and learning methods are so complex that scientific conference papers no longer contain enough space for the technical depth required for replication or reproduction." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-4", "text": "Taking Target Dependent Sentiment Analysis as a case study, we show how recent work in the field has not consistently released code, or described settings for learning methods in enough detail, and lacks comparability and generalisability in train, test or validation data." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-5", "text": "To investigate generalisability and to enable state of the art comparative evaluations, we carry out the first reproduction studies of three groups of complementary methods and perform the first large-scale mass evaluation on six different English datasets." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-6", "text": "Reflecting on our experiences, we recommend that future replication or reproduction experiments should always consider a variety of datasets alongside documenting and releasing their methods and published code in order to minimise the barriers to both repeatability and generalisability." 
}, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-7", "text": "We have released our code with a model zoo on GitHub with Jupyter Notebooks to aid understanding and full documentation, and we recommend that others do the same with their papers at submission time through an anonymised GitHub account." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-8", "text": "----------------------------------" }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-10", "text": "Repeatable (replicable and/or reproducible 1 ) experimentation is a core tenet of the scientific endeavour." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-11", "text": "In Natural Language Processing (NLP) research as in other areas, this requires three crucial components: (a) published methods described in sufficient detail (b) a working code base and (c) open dataset(s) to permit training, testing and validation to be reproduced and generalised." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-12", "text": "In the cognate sub-discipline of corpus linguistics, releasing textual datasets has been a defining feature of the community for many years, enabling multiple comparative experiments to be conducted on a stable basis since the core underlying corpora are community resources." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-13", "text": "In NLP, with methods becoming increasingly complex with the use of machine learning and deep learning approaches, it is often difficult to describe all settings and configurations in enough detail without releasing code." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-14", "text": "The work described in this paper emerged from recent efforts at our research centre to reimplement other's work across a number of topics (e.g. 
text reuse, identity resolution and sentiment analysis) where previously published methods were not easily repeatable because of missing or broken code or dependencies, and/or where methods were not sufficiently well described to enable reproduction." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-15", "text": "We focus on one sub-area of sentiment analysis to illustrate the extent of these problems, along with our initial recommendations and contributions to address the issues." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-16", "text": "The area of Target Dependent Sentiment Analysis (TDSA) and NLP in general has been growing rapidly in the last few years due to new neural network methods that require no feature engineering." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-17", "text": "However, it is difficult to keep track of the state of the art as new models are tested on different datasets, thus preventing true comparative evaluations." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-18", "text": "This is best shown by Table 1, where many approaches are evaluated on the SemEval dataset (Pontiki et al., 2014), but not all." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-19", "text": "This work is licenced under a Creative Commons Attribution 4.0 International Licence; licence details: http://creativecommons.org/licenses/by/4.0/" }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-20", "text": "1 We follow the definitions in Antske Fokkens' guest blog post, \"replication (obtaining the same results using the same experiment) as well as reproduction (reach the same conclusion through different means)\", from http://coling2018.org/slowly-growing-offspring-zigglebottom-anno-2017-guest-post/" }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-21", "text": "Datasets can vary by domain (e.g.
product), type (social media, review), or medium (written or spoken), and to date there has been no comparative evaluation of methods from these multiple classes." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-22", "text": "Our primary and secondary contributions, therefore, are to carry out the first study that reports results across all three different dataset classes, and to release an open-source code framework implementing three complementary groups of TDSA methods." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-23", "text": "In terms of reproducibility via code release, recent TDSA papers have generally been very good with regards to publishing code alongside their papers (Mitchell et al., 2013; Zhang et al., 2016; Liu and Zhang, 2017; Wang et al., 2017), but other papers have not released code (Wang et al., 2016; Tay et al., 2017)." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-24", "text": "In some cases, the code was initially made available, then removed, and is now back online (Tang et al., 2016a)." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-25", "text": "Unfortunately, in some cases even when code has been published, different results have been obtained relative to the original paper." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-26", "text": "This can be seen when Chen et al. (2017) used the code and embeddings of Tang et al. (2016b) and observed different results." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-43", "text": "Different methods of releasing datasets and code have been suggested." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-27", "text": "Similarly, when others (Tay et al., 2017; Chen et al., 2017) attempted to replicate the experiments of Tang et al. (2016a), they also produced results different from those of the original authors."
}, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-28", "text": "Our observations within this one sub-field motivates the need to investigate further and understand how such problems can be avoided in the future." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-29", "text": "In some cases, when code has been released, it is difficult to use which could explain why the results were not reproduced." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-30", "text": "Of course, we would not expect researchers to produce industrial strength code, or provide continuing free ongoing support for multiple years after publication, but the situation is clearly problematic for the development of the new field in general." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-31", "text": "In this paper, we therefore reproduce three papers chosen as they employ widely differing methods: Neural Pooling (NP) , NP with dependency parsing (Wang et al., 2017) , and RNN (Tang et al., 2016a) , as well as having been applied largely to different datasets." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-32", "text": "At the end of the paper, we reflect on bringing together elements of repeatability and generalisability which we find are crucial to NLP and data science based disciplines more widely to enable others to make use of the science created." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-33", "text": "----------------------------------" }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-34", "text": "**RELATED WORK**" }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-35", "text": "Reproducibility and replicability have long been key elements of the scientific method, but have been gaining renewed prominence recently across a number of disciplines with attention being given to a 'reproducibility crisis'." 
}, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-36", "text": "For example, in pharmaceutical research, as little as 20-25% of papers were found to be replicable (Prinz et al., 2011) ." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-37", "text": "The problem has also been recognised in computer science in general (Collberg and Proebsting, 2016) ." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-38", "text": "Reproducibility and replicability have been researched for sometime in Information Retrieval (IR) since the Grid@CLEF pilot track (Ferro and Harman, 2009 )." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-39", "text": "The aim was to create a 'grid of points' where a point defined the performance of a particular IR system using certain pre-processing techniques on a defined dataset." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-40", "text": "Louridas and Gousios (2012) looked at reproducibility in Software Engineering after trying to replicate another authors results and concluded with a list of requirements for papers to be reproducible: (a) All data related to the paper, (b) All code required to reproduce the paper and (c) Documentation for the code and data." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-41", "text": "Fokkens et al. (2013) looked at reproducibility in WordNet similarity and Named Entity Recognition finding five key aspects that cause experimental variation and therefore need to be clearly stated: (a) pre-processing, (b) experimental setup, (c) versioning, (d) system output, (e) system variation." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-42", "text": "In Twitter sentiment analysis, Sygkounas et al. (2016) stated the need for using the same library versions and datasets when replicating work." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-151", "text": "**REPRODUCTION OF WANG ET AL. 
(2017)**" }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-44", "text": "Ferro and Harman (2009) defined a framework (CIRCO) that enforces a pre-processing pipeline where data can be extracted at each stage therefore facilitating a validation step." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-45", "text": "They stated a mechanism for storing results, dataset and pre-processed data 2 ." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-46", "text": "Louridas and Gousios (2012) suggested the use of a virtual machine alongside papers to bundle the data and code together, while most state the advantages of releasing source code (Fokkens et al., 2013; Potthast et al., 2016; Sygkounas et al., 2016) ." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-47", "text": "The act of reproducing or replicating results is not just for validating research but to also show how it can be improved." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-48", "text": "Ferro and Silvello (2016) followed up their initial research and were able to analyse which pre-processing techniques were important on a French monolingual dataset and how the different techniques affected each other given an IR system." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-49", "text": "Fokkens et al. (2013) showed how changes in the five key aspects affected results." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-50", "text": "The closest related work to our reproducibility study is that of Marrese-Taylor and Matsuo (2017) which they replicate three different syntactic based aspect extraction methods." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-51", "text": "They found that parameter tuning was very important however using different pre-processing pipelines such as Stanford's CoreNLP did not have a consistent effect on the results." 
}, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-52", "text": "They found that the methods stated in the original papers are not detailed enough to replicate the study as evidenced by their large results differential." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-53", "text": "Dashtipour et al. (2016) undertook a replication study in sentiment prediction, however this was at the document level and on different datasets and languages to the originals." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-54", "text": "In other areas of (aspectbased) sentiment analysis, releasing code for published systems has not been a high priority, e.g. in SemEval 2016 task 5 (Pontiki et al., 2016) only 1 out of 21 papers released their source code." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-55", "text": "In IR, specific reproducible research tracks have been created 3 and we are pleased to see the same happening at COLING 2018 4 ." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-56", "text": "Turning now to the focus of our investigations, Target Dependent sentiment analysis (TDSA) research (Nasukawa and Yi, 2003) arose as an extension to the coarse grained analysis of document level sentiment analysis (Pang et al., 2002; Turney, 2002) ." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-57", "text": "Since its inception, papers have applied different methods such as feature based (Kiritchenko et al., 2014) , Recursive Neural Networks (RecNN) (Dong et al., 2014) , Recurrent Neural Networks (RNN) (Tang et al., 2016a) , attention applied to RNN (Wang et al., 2016; Chen et al., 2017; Tay et al., 2017) , Neural Pooling (NP) Wang et al., 2017) , RNN combined with NP (Zhang et al., 2016) , and attention based neural networks (Tang et al., 2016b) ." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-58", "text": "Others have tackled TDSA as a joint task with target extraction, thus treating it as a sequence labelling problem." 
}, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-59", "text": "Mitchell et al. (2013) carried out this task using Conditional Random Fields (CRF), and this work was then extended using a neural CRF ." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-60", "text": "Both approaches found that combining the two tasks did not improve results compared to treating the two tasks separately, apart from when considering POS and NEG when the joint task performs better." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-61", "text": "Finally, created an attention RNN for this task which was evaluated on two very different datasets containing written and spoken (video-based) reviews where the domain adaptation between the two shows some promise." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-62", "text": "Overall, within the field of sentiment analysis there are other granularities such as sentence level (Socher et al., 2013) , topic (Augenstein et al., 2018) , and aspect (Wang et al., 2016; Tay et al., 2017) ." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-63", "text": "Aspect-level sentiment analysis relates to identifying the sentiment of (potentially multiple) topics in the same text although this can be seen as a similar task to TDSA." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-64", "text": "However the clear distinction between aspect and TDSA is that TDSA requires the target to be mentioned in the text itself while aspect-level employs a conceptual category with potentially multiple related instantiations in the text." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-65", "text": "Tang et al. (2016a) created a Target Dependent LSTM (TDLSTM) which encompassed two LSTMs either side of the target word, then improved the model by concatenating the target vector to the input embeddings to create a Target Connected LSTM (TCLSTM)." 
}, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-66", "text": "Adding attention has become very popular recently." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-67", "text": "Tang et al. (2016b) showed the speed and accuracy improvements of using multiple attention layers only over LSTM based methods, however they found that it could not model complex sentences e.g. negations." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-68", "text": "Liu and Zhang (2017) showed that adding attention to a Bi-directional LSTM (BLSTM) improves the results as it takes the importance of each word into account with respect to the target." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-69", "text": "Chen et al. (2017) also combined a BLSTM and attention, however they used multiple attention layers and combined the results using a Gated Recurrent Unit (GRU) which they called Recurrent Attention on Memory (RAM), and they found this method to allow models to better understand more complex sentiment for each comparison." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-70", "text": "used neural pooling features e.g. max, min, etc of the word embeddings of the left and right context of the target word, the target itself, and the whole Tweet." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-71", "text": "They inputted the features into a linear SVM, and showed the importance of using the left and right context for the first time." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-72", "text": "They found in their study that using a combination of Word2Vec embeddings and sentiment embeddings performed best alongside using sentiment lexicons to filter the embedding space." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-73", "text": "Other studies have adopted more linguistic approaches." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-74", "text": "Wang et al. (2017) extended the work of by using the dependency linked words from the target." 
}, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-75", "text": "Dong et al. (2014) used the dependency tree to create a Recursive Neural Network (RecNN) inspired by Socher et al. (2013) but compared to Socher et al. (2013) they also utilised the dependency tags to create an Adaptive RecNN (ARecNN)." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-76", "text": "Critically, the methods reported above have not been applied to the same datasets, therefore a true comparative evaluation between the different methods is somewhat difficult." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-77", "text": "This has serious implications for generalisability of methods." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-78", "text": "We correct that limitation in our study." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-79", "text": "There are two papers taking a similar approach to our work in terms of generalisability although they do not combine them with the reproduction issues that we highlight." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-80", "text": "First, Chen et al. (2017) compared results across SemEval's laptop and restaurant reviews in English (Pontiki et al., 2014) , a Twitter dataset (Dong et al., 2014) and their own Chinese news comments dataset." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-81", "text": "They did perform a comparison across different languages, domains, corpora types, and different methods; SVM with features (Kiritchenko et al., 2014) , Rec-NN (Dong et al., 2014) , TDLSTM (Tang et al., 2016a) , Memory Neural Network (MNet) (Tang et al., 2016b) and their own attention method." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-82", "text": "However, the Chinese dataset was not released, and the methods were not compared across all datasets." 
}, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-83", "text": "By contrast, we compare all methods across all datasets, using techniques that are not just from the Recurrent Neural Network (RNN) family." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-84", "text": "A second paper, by Barnes et al. (2017) compares seven approaches to (document and sentence level) sentiment analysis on six benchmark datasets, but does not systematically explore reproduction issues as we do in our paper." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-85", "text": "----------------------------------" }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-86", "text": "**DATASETS USED IN OUR EXPERIMENTS**" }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-87", "text": "We are evaluating our models over six different English datasets deliberately chosen to represent a range of domains, types and mediums." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-88", "text": "As highlighted above, previous papers tend to only carry out evaluations on one or two datasets which limits the generalisability of their results." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-89", "text": "In this paper, we do not consider the quality or inter-annotator agreement levels of these datasets but it has been noted that some datasets may have issues here." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-90", "text": "For example, Pavlopoulos and Androutsopoulos (2014) point out that the Hu and Liu (2004) dataset does not state their inter-annotator agreement scores nor do they have aspect terms that express neutral opinion." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-91", "text": "We only use a subset of the English datasets available." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-92", "text": "For two reasons." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-93", "text": "First, the time it takes to write parsers and run the models." 
}, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-94", "text": "Second, we only used datasets that contain three distinct sentiments (Wilson (2008) only has two)." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-95", "text": "From the datasets we have used, we have only had issue with parsing Wang et al. (2017) where the annotations for the first set of the data contains the target span but the second set does not." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-96", "text": "Thus making it impossible to use the second set of annotation and forcing us to only use a subset of the dataset." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-97", "text": "An as example of this: \"Got rid of bureaucrats 'and we put that money, into 9000 more doctors and nurses'... to turn the doctors into bureaucrats#BattleForNumber10\" in that Tweet 'bureaucrats' was annotated as negative but it does not state if it was the first or second instance of 'bureaucrats' since it does not use target spans." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-98", "text": "As we can see from table 2, generally the social media datasets (Twitter and YouTube) contain more targets per sentence with the exception of Dong et al. (2014) and Mitchell et al. (2013) ." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-99", "text": "The only dataset that has a small difference between the number of unique sentiments per sentence is the Wang et al. (2017)" }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-100", "text": "----------------------------------" }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-101", "text": "**REPRODUCTION STUDIES**" }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-102", "text": "In the following subsections, we present the three different methods that we are reproducing and how their results differ from the original analysis." 
}, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-103", "text": "In all of the experiments below, we lower case all text and tokenise using Twokenizer (Gimpel et al., 2011) ." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-104", "text": "This was done as the datasets originate from Twitter and this pre-processing method was to some extent stated in and assumed to be used across the others as they do not explicitly state how they pre-process in the papers." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-105", "text": "----------------------------------" }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-106", "text": "**REPRODUCTION OF VO AND ZHANG (2015)**" }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-107", "text": "Vo and Zhang (2015) created the first NP method for TDSA." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-108", "text": "It takes the word vectors of the left, right, target word, and full tweet/sentence/text contexts and performs max, min, average, standard deviation, and product pooling over these contexts to create a feature vector as input to the Support Vector Machine (SVM) classifier." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-109", "text": "This feature vector is in affect an automatic feature extractor." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-110", "text": "They created four different methods: 1." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-111", "text": "Target-ind uses only the full tweet context, 2." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-112", "text": "Target-dep-uses left, right, and target contexts, 3." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-113", "text": "Target-dep Uses both features of Target-ind and Target-dep-, and 4." 
}, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-114", "text": "Target-dep+ Uses the features of Target-dep and adds two additional contexts left and right sentiment (LS & RS) contexts where only the words within a specified lexicon are kept and the rest of the words are zero vectors." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-115", "text": "All of their experiments are performed on Dong et al. (2014) Twitter data set." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-116", "text": "For each of the experiments below we used the following configurations unless otherwise stated: we performed 5 fold stratified cross validation, features are scaled using Max Min scaling before inputting into the SVM, and used the respective C-Values for the SVM stated in the paper for each of the models." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-117", "text": "One major difficulty with the description of the method in the paper and re-implementation is handling the same target multiple appearances issue as originally raised by Wang et al. (2017) ." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-118", "text": "As the method requires context with regards to the target word, if there is more than one appearance of the target word then the method does not specify which to use." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-119", "text": "We therefore took the approach of Wang et al. (2017) and found all of the features for each appearance and performed median pooling over features." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-120", "text": "This change could explain the subtle differences between the results we report and those of the original paper." 
}, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-121", "text": "----------------------------------" }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-122", "text": "**SENTIMENT LEXICONS**" }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-123", "text": "Vo and Zhang (2015) used three different sentiment lexicons: MPQA 5 (Wilson et al., 2005) , NRC 6 (Mohammad and Turney, 2010) , and HL 7 (Hu and Liu, 2004) ." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-124", "text": "We found a small difference in word counts between their reported statistics for the MPQA lexicons and those we performed ourselves, as can be seen in the bold numbers in table 3." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-125", "text": "Originally, we assumed that a word can only occur in one sentiment class within the same lexicon, and this resulted in differing counts for all lexicons." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-126", "text": "This distinction is not clearly documented in the paper or code." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-127", "text": "However, our assumption turned out to be incorrect, giving a further illustration of why detailed descriptions and documentation of all decisions is important." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-128", "text": "We ran the same experiment as to show the effectiveness of sentiment lexicons the results can be seen in table 4." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-129", "text": "We can clearly see there are some difference not just with the accuracy scores but the rank of the sentiment lexicons." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-130", "text": "We found just using HL was best and MPQA does help performance compared to the Target-dep baseline which differs to findings." 
}, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-131", "text": "Since we found that using just HL performed best, the rest of the results will apply the Target-dep+ method using HL and using HL & MPQA to show the affect of using the lexicon that both we and Vo and Zhang (2015)" }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-132", "text": "----------------------------------" }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-133", "text": "**USING DIFFERENT WORD VECTORS**" }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-134", "text": "The original authors tested their methods using three different word vectors: 1. Word2Vec trained by on 5 million Tweets containing emoticons (W2V), 2." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-135", "text": "Sentiment Specific Word Embedding (SSWE) from , and 3." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-136", "text": "W2V and SSWE combined." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-137", "text": "Neither of these word embeddings are available from the original authors as never released the embeddings and the link to embeddings no longer works 8 ." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-138", "text": "However, the embeddings were released through Wang et al. (2017) code base 9 following requesting of the code from ." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-139", "text": "Figure 1 shows the results of the different word embeddings across the different methods." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-140", "text": "The main finding we see is that SSWE by themselves are not as informative as W2V vectors which is different to the findings of ." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-141", "text": "However we agree that combining the two vectors is beneficial and that the rank of methods is the same in our observations." 
}, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-142", "text": "----------------------------------" }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-143", "text": "**SCALING AND FINAL MODEL COMPARISON**" }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-144", "text": "We test all of the methods on the test data set of Dong et al. (2014) and show the difference between the original and reproduced models in figure 2 ." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-145", "text": "Finally, we show the effect of scaling using Max Min and not scaling the data." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-146", "text": "As stated before, we have been using Max Min scaling on the NP features, however did not mention scaling in their paper." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-147", "text": "The library they were using, LibLinear (Fan et al., 2008) , suggests in its practical guide (Hsu et al., 2003) to scale each feature to [0, 1] but this was not re-iterated by ." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-148", "text": "We are using scikit-learn's (Pedregosa et al., 2011) LinearSVC which is a wrapper of LibLinear, hence making it appropriate to use here." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-149", "text": "As can be seen in figure 2, not scaling can affect the results by around one-third." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-150", "text": "----------------------------------" }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-152", "text": "Wang et al. (2017) extended the NP work of and instead of using the full tweet/sentence/text contexts they used the full dependency graph of the target word." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-153", "text": "Thus, they created three different methods: 1. TDParse-uses only the full dependency graph context, 2." 
}, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-154", "text": "TDParse the feature of TDParse-and the left and right contexts, and 3." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-155", "text": "TDParse+ the features of TDParse and LS and RS contexts." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-156", "text": "The experiments are performed on the Dong et al. (2014) and Wang et al. (2017) Twitter datasets where we train and test on the previously specified train and test splits." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-157", "text": "We also scale our features using Max Min scaling before inputting into the SVM." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-158", "text": "We used all three sentiment lexicons as in the original paper, and we found the C-Value by performing 5 fold stratified cross validation on the training datasets." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-159", "text": "The results of these experiments can be seen in figure 3 10 ." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-160", "text": "As found with the results of replication, scaling is very important but is typically overlooked when reporting." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-161", "text": "Tang et al. (2016a) was the first to use LSTMs specifically for TDSA." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-162", "text": "They created three different models: 1. LSTM a standard LSTM that runs over the length of the sentence and takes no target information into account, 2." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-163", "text": "TDLSTM runs two LSTMs, one over the left and the other over the right context of the target word and concatenates the output of the two, and 3." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-164", "text": "TCLSTM same as the TDLSTM method but each input word vector is concatenated with vector of the target word." 
}, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-165", "text": "All of the methods outputs are fed into a softmax activation function." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-166", "text": "The experiments are performed on the Dong et al. (2014) dataset where we train and test on the specified splits." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-167", "text": "For the LSTMs we initialised the weights using uniform distribution U(0.003, 0.003), used Stochastic Gradient Descent (SGD) a learning rate of 0.01, cross entropy loss, padded and truncated sequence to the length of the maximum sequence in the training dataset as stated in the original paper, and we did not \"set the clipping threshold of softmax layer as 200\" (Tang et al., 2016a) as we were unsure what this meant." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-168", "text": "With regards to the number of epochs trained, we used early stopping with a patience of 10 and allowed 300 epochs." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-169", "text": "Within their experiments they used SSWE and Glove Twitter vectors 11 (Pennington et al., 2014) ." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-170", "text": "As the paper being reproduced does not define the number of epochs they trained for, we use early stopping." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-171", "text": "Thus for early stopping we require to split the training data into train and validation sets to know when to stop." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-172", "text": "As it has been shown by Reimers and Gurevych (2017) that the random seed statistically significantly changes the results of experiments we ran each model over each word embedding thirty times, using a different seed value but keeping the same stratified train and validation split, and reported the results on the same test data as the original paper." 
}, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-173", "text": "As can be seen in Figure 4 , the initial seed value makes a large difference more so for the smaller embeddings." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-174", "text": "In table 5, we show the difference between our mean and maximum result and the original result for each model using the 200 dimension Glove Twitter vectors." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-175", "text": "Even though the mean result is quite different from the original the maximum is much closer." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-176", "text": "Our results generally agree with their results on the ranking of the word vectors and the embeddings." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-177", "text": "Overall, we were able to reproduce the results of all three papers." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-178", "text": "However for the neural network/deep learning approach of Tang et al. (2016a) we agree with Reimers and Gurevych (2017) that reporting multiple runs of the system over different seed values is required as the single performance scores can be misleading, which could explain why previous papers obtained different results to the original for the TDLSTM method (Chen et al., 2017; Tay et al., 2017) ." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-179", "text": "----------------------------------" }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-180", "text": "**MASS EVALUATION**" }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-181", "text": "For all of the methods we pre-processed the text by lower casing and tokenising using Twokenizer (Gimpel et al., 2011) , and we used all three sentiment lexicons where applicable." 
}, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-182", "text": "We found the best word vectors from SSWE and the common crawl 42B 300 dimension Glove vectors by five fold stratified cross validation for the NP methods and the highest accuracy on the validation set for the LSTM methods." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-183", "text": "We chose these word vectors as they have very different sizes (50 and 300), also they have been shown to perform well in different text types; SSWE for social media (Tang et al., 2016a) and Glove for reviews (Chen et al., 2017) ." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-184", "text": "To make the experiments quicker and computationally less expensive, we filtered out all words from the word vectors that did not appear in the train and test datasets, and this is equivalent with respect to word coverage as using all words." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-185", "text": "Finally we only reported results for the LSTM methods with one seed value and not multiple due to time constraints." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-186", "text": "The results of the methods using the best found word vectors on the test sets can be seen in table 6." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-187", "text": "We find that the TDParse methods generally perform best but only clearly outperforms the other nondependency parser methods on the YouTuBean dataset." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-188", "text": "We hypothesise that this is due to the dataset containing, on average, a deeper constituency tree depth which could be seen as on average more complex sentences." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-189", "text": "This could be due to it being from the spoken medium compared to the rest of the datasets which are written." 
}, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-190", "text": "Also that using a sentiment lexicon is almost always beneficial, but only by a small amount." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-191", "text": "Within the LSTM based methods the TDLSTM method generally performs the best indicating that the extra target information that the TCLSTM method contains is not needed, but we believe this needs further analysis." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-192", "text": "We can conclude that the simpler NP models perform well across domain, type and medium and that even without language specific tools and lexicons they are competitive to the more complex LSTM based methods." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-193", "text": "----------------------------------" }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-194", "text": "**DISCUSSION AND CONCLUSION**" }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-195", "text": "The fast developing subfield of TDSA has so far lacked a large-scale comparative mass evaluation of approaches using different models and datasets." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-196", "text": "In this paper, we address this generalisability limitation and perform the first direct comparison and reproduction of three different approaches for TDSA." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-197", "text": "While carrying out these reproductions, we have noted and described above, the many emerging issues in previous research related to incomplete descriptions of methods and settings, patchy release of code, and lack of comparative evaluations." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-198", "text": "This is natural in a developing field, but it is crucial for ongoing development within NLP in general that improved repeatability practices are adopted." 
}, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-199", "text": "The practices adopted in our case studies are to reproduce the methods in open source code, adopt only open data, provide format conversion tools to ingest the different data formats, and describe and document all settings via the code and Jupyter Notebooks (released initially in anonymous form at submission time) 12 ." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-200", "text": "We therefore argue that papers should not consider repeatability (replication or reproduction) or generalisability alone, but these two key tenets of scientific practice should be brought together." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-201", "text": "In future work, we aim to extend our reproduction framework further, and extend the comparative evaluation to languages other than English." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-202", "text": "This will necessitate changes in the framework since we expect that dependency parsers and sentiment lexicons will be unavailable for specific languages." }, { "sent_id": "b7a718664f395f048abb3655fb1d8d-C001-203", "text": "Also we will explore through error analysis in which situations different neural network architectures perform best." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "b7a718664f395f048abb3655fb1d8d-C001-24", "b7a718664f395f048abb3655fb1d8d-C001-27" ], [ "b7a718664f395f048abb3655fb1d8d-C001-57" ], [ "b7a718664f395f048abb3655fb1d8d-C001-81" ] ], "cite_sentences": [ "b7a718664f395f048abb3655fb1d8d-C001-24", "b7a718664f395f048abb3655fb1d8d-C001-27", "b7a718664f395f048abb3655fb1d8d-C001-57", "b7a718664f395f048abb3655fb1d8d-C001-81" ] }, "@MOT@": { "gold_contexts": [ [ "b7a718664f395f048abb3655fb1d8d-C001-31" ] ], "cite_sentences": [ "b7a718664f395f048abb3655fb1d8d-C001-31" ] }, "@EXT@": { "gold_contexts": [ [ "b7a718664f395f048abb3655fb1d8d-C001-167" ], [ "b7a718664f395f048abb3655fb1d8d-C001-178" ] ], "cite_sentences": [ "b7a718664f395f048abb3655fb1d8d-C001-167", "b7a718664f395f048abb3655fb1d8d-C001-178" ] }, "@DIF@": { "gold_contexts": [ [ "b7a718664f395f048abb3655fb1d8d-C001-167" ], [ "b7a718664f395f048abb3655fb1d8d-C001-178" ] ], "cite_sentences": [ "b7a718664f395f048abb3655fb1d8d-C001-167", "b7a718664f395f048abb3655fb1d8d-C001-178" ] }, "@SIM@": { "gold_contexts": [ [ "b7a718664f395f048abb3655fb1d8d-C001-183" ] ], "cite_sentences": [ "b7a718664f395f048abb3655fb1d8d-C001-183" ] }, "@USE@": { "gold_contexts": [ [ "b7a718664f395f048abb3655fb1d8d-C001-183" ] ], "cite_sentences": [ "b7a718664f395f048abb3655fb1d8d-C001-183" ] } } }, "ABC_5fc7df69445712a50228d0bf80f30a_15": { "x": [ { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-2", "text": "We propose a novel tensor embedding method that can effectively extract lexical features for humor recognition." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-3", "text": "Specifically, we use wordword co-occurrence to encode the contextual content of documents, and then decompose the tensor to get corresponding vector representations." 
}, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-4", "text": "We show that this simple method can capture features of lexical humor effectively for continuous humor recognition." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-5", "text": "In particular, we achieve a distance of 0.887 on a global humor ranking task, comparable to the top performing systems from SemEval 2017 Task 6B (Potash et al., 2017) but without the need for any external training corpus." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-6", "text": "In addition, we further show that this approach is also beneficial for small sample humor recognition tasks through a semi-supervised label propagation procedure, which achieves about 0.7 accuracy on the 16000 One-Liners (Mihalcea and Strapparava, 2005) and Pun of the Day (Yang et al., 2015) humour classification datasets using only 10% of known labels." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-7", "text": "----------------------------------" }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-9", "text": "Recognizing humor automatically is an important step for natural human-computer interaction (Shahaf et al., 2015) ." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-10", "text": "While early works tend to frame humor recognition as a binary classification task (Mihalcea and Strapparava, 2005; Yang et al., 2015) , the last few years have seen the emergence of humor recognition as a pairwise relative ranking task (Cattle and Ma, 2016; Shahaf et al., 2015) ." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-11", "text": "In addition to pairwise ranking, SemEval 2017 Task 6 also includes a global ranking subtask." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-12", "text": "However, the majority of submissions build * Zhenjie Zhao and Andrew Cattle contributed equally to this work." 
}, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-13", "text": "\u2020 E. Papalexakis was supported by a UCR-China collaboration grant by the Bourns College of Engineering at UCR, and by the National Science Foundation CDSE Grant no. OAC-1808591 global rankings using a series of pairwise comparisons (Potash et al., 2017) ." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-14", "text": "Only Yan and Pedersen (2017) attempt to predict global rankings directly, ranking documents inversely to their probability according to an n-gram language model." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-15", "text": "State-of-the-art humor recognition algorithms usually require a considerable amount of training data with labels to learn effective features (Yang et al., 2015) ." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-16", "text": "However, such data are difficult to obtain -especially fine-grained humor annotations." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-17", "text": "First, the humor judgments differ from individual to individual." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-18", "text": "Thus, collecting perceptually consistent human labels is expensive and time-consuming." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-19", "text": "Second, fine-grained degrees of humor add a further challenge." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-20", "text": "Therefore, methods on small sample learning or even unsupervised rule-based methods merit investigation." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-21", "text": "In this paper, considering the importance of lexical information for humor recognition (Radev et al., 2015) , we propose a tensor decomposition method to capture the contextual nuances of a corpus." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-22", "text": "This allows us to model the lexical similarity of sentences regardless of the size of the corpus." 
}, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-23", "text": "In this way, we can rank the degree of humor effectively via lexical centrality (Radev et al., 2015) , namely, regarding the distance to the lexical center as an indicator of the degree of humor." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-24", "text": "Experimental results on the SemEval 2017 Task 6 dataset (Potash et al., 2017) show that without external training data, the tensor embedding method can achieve performance equivalent to the second place on SemEval 2017 Task 6B without the need for any external training corpus." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-25", "text": "In addition, by applying a semi-supervised label propagation procedure (Zhou et al., 2003) , we can also use the tensor embedding method for small sample humor recognition, achieving about 0.7 accuracy with only 10% of known labels on the 16000 One-Liners (Mihalcea and Strapparava, 2005) and Pun of the Day (Yang et al., 2015) datasets." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-26", "text": "The contributions of this paper are: 1) we propose a tensor embedding method to model the lexical features of documents, which can capture lexical similarity effectively regardless of the size of the corpus, 2) we show that the lexical features can be used effectively for finegrained humor ranking and small sample humor recognition." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-27", "text": "Our implementation is open-sourced, and can be found at https://github.com/ zhaozj89/TensorEmbeddingNLP." 
}, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-28", "text": "----------------------------------" }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-29", "text": "**RELATED WORK**" }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-30", "text": "----------------------------------" }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-31", "text": "**HUMOR FEATURE EXTRACTION**" }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-32", "text": "Modeling and learning humor features are critical for automatic humor recognition." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-33", "text": "Previous works tend to use a combination of phonological, stylistic, semantic, and content-based features." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-34", "text": "Phonological features include acoustic features extracted from sitcom audio tracks (Bertero and Fung, 2016) and \"phonetic embeddings\" generated using a character-to-phoneme LSTM encoder-decoder (Donahue et al., 2017) ." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-35", "text": "Stylistic features include alliteration, rhyming, negative sentiment, and adult slang (Mihalcea and Strapparava, 2005) as well as emotional scenarios (Reyes et al., 2012) ." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-36", "text": "Semantic features range from attempts to measure incongruity (Cattle and Ma, 2018; Shahaf et al., 2015; Yang et al., 2015) to the use of word embeddings as inputs to neural models (Bertero and Fung, 2016; Donahue et al., 2017) ." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-37", "text": "Content-based approaches include word frequency (Mihalcea and Strapparava, 2005) , n-gram probability (Yan and Pedersen, 2017) , and lexical centrality (Radev et al., 2015) ." 
}, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-38", "text": "Centrality is based on the observation that humorous responses to common stimuli tend to cluster around a small number of core jokes (Radev et al., 2015; Shahaf et al., 2015) , with more central documents benefiting from \"wisdom of the crowd\"." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-39", "text": "While most humor features involve making population-level inferences based on document-level features, centrality is instead population-level feature directly." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-40", "text": "Radev et al. (2015) calculate their centrality feature using LexRank, a graph-based text summarization method (Erkan and Radev, 2004) ." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-41", "text": "Compared with more traditional lexical similarity measures like tfidf, this method is better suited to short humor texts due to their short lengths leading to sparse vector representations (Erkan and Radev, 2004) ." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-42", "text": "----------------------------------" }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-43", "text": "**SMALL SAMPLE HUMOR RECOGNITION**" }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-44", "text": "Once the humor features have been extracted, the next step is training a machine learning model to make predictions." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-45", "text": "Although learning-based methods have shown significant performance improvement recently (Yang et al., 2015) , one of their main bottlenecks is the lack of appropriate training corpora." 
}, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-46", "text": "While previous works have employed data crawled from websites (Mihalcea and Strapparava, 2005; Yang et al., 2015) , Twitter (Cattle and Ma, 2016; Reyes et al., 2012) , sitcom subtitles (Bertero and Fung, 2016; Purandare and Litman, 2006) , or the New Yorker Cartoon Caption Contest (Radev et al., 2015; Shahaf et al., 2015) , these datasets are generally not released publicly." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-47", "text": "Owing to the difficulty in obtaining fine-grained labeled humor data, it is critical to study how to recognize humor by a small training sample or even without labeled data." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-48", "text": "----------------------------------" }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-49", "text": "**METHOD**" }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-50", "text": "----------------------------------" }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-51", "text": "**TENSOR EMBEDDING**" }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-52", "text": "Contextual patterns of words can be used to measure lexical similarity for humor recognition." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-53", "text": "State-of-the-art learning-based approaches like doc2vec (Le and Mikolov, 2014) or sent2vec (Pagliardini et al., 2018) usually require a large amount of data." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-54", "text": "This is difficult to obtain for humor recognition." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-55", "text": "We propose to use a novel tensor decomposition method to obtain lexical features of short humor texts." 
}, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-56", "text": "To capture lexical similarity for humor recognition, we propose to represent the tensor through a novel word-word co-occurrence method, which has only been explored in the context of fake news detection (Hosseinimotlagh and Papalexakis, 2018)." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-57", "text": "Considering a corpus D = {s 1 , s 2 , . . . , s D } with D sentences, we first build a vocabulary for it, namely, w 1 , w 2 , . . . , w V , where V is the number of words." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-58", "text": "For each sentence s in D, we count the wordword co-occurrence in a small window H, and build a frequency matrix W s \u2208 Z V \u00d7V , where Z denotes the set of integers." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-59", "text": "In particular, W s (i, j) indicates the frequency that word w i and w j cooccur in s within the window H. In this way, we can capture the lexical patterns of s in W s ." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-60", "text": "We then stack all W s as a three-dimensional tensor W \u2208 Z V \u00d7V \u00d7D ." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-61", "text": "The objective of tensor decomposition is to find an approximation\u0174 of W so Yang et al. (2015) that:\u0174" }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-62", "text": "where v r \u2208 R V , d r \u2208 R D , R is the predefined rank parameter, and \u2297 is the outer product, namely, v r \u2297 v r \u2297 d r being a three-dimensional tensor, and" }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-63", "text": "With the tensor decomposition, we can find low-rank embeddings of sentences that capture the similarity of contextual patterns (Hosseinimotlagh and Papalexakis, 2018)." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-64", "text": "In particular, C = [d 1 , d 2 , . . . 
, d R ] \u2208 R D\u00d7R , where the s-row of C is the embedding vector of sentence s. The Euclidean distance of embeddings is used to measure the similarity of two sentences." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-65", "text": "----------------------------------" }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-66", "text": "**LEXICAL CENTRALITY**" }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-67", "text": "We use lexical centrality to rank the degree of humor (Radev et al., 2015) ." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-68", "text": "While Radev et al. (2015) utilize a graph-based definition of centrality, we instead take a vector-space approach." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-69", "text": "Given the decomposed C = [c T 1 , c T 2 , . . . , c T D ] T , we compute a centroid as the average m of all sentence vectors of a corpus:" }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-70", "text": "The Euclidean distance to the center is then taken as an indicator of the degree of humor." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-71", "text": "In other words, given two sentences s 1 and s 2 and their embeddings x 1 and x 2 , d(m, x1) < d(m, x2) implies s 1 is funnier than s 2 ." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-72", "text": "----------------------------------" }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-73", "text": "**LABEL PROPAGATION**" }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-74", "text": "With the lexical similarity captured by tensor embeddings, we can build a similarity graph, and use a label propagation algorithm (Zhou et al., 2003) for semi-supervised humor recognition." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-75", "text": "In this way, we can use only a small portion of labeled data to predict the remaining unlabeled data effectively (Zhou et al., 2003) ." 
}, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-76", "text": "In particular, with the tensor embeddings, we first find the K nearest neighbors of each data point, and build a similarity graph G. We then form an affinity matrix W , where W ij = 1 if i and j are connected, otherwise, W ij = 0." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-77", "text": "Afterwards, we iterate:" }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-78", "text": "and can get the results F * as the limit of this sequence." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-79", "text": "Equation (4) means we propagate the labels of each data point to its neighbors in a weighted average way, where \u03b1 is the ratio of propagating labels each iteration." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-80", "text": "For each point x i , its label is y i = arg max j\u2264c F * ij ." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-81", "text": "----------------------------------" }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-82", "text": "**EXPERIMENT**" }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-83", "text": "To evaluate the effectiveness of the tensor embedding method, we conduct two experiments on global humor ranking and binary humor classification separately." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-84", "text": "The alternating least squares method of CANDECOMP/PARAFAC tensor decomposition (Sidiropoulos et al., 2017) is used to calculate the low-rank sentence embeddings as implemented in the Matlab tensor toolbox 1 ." 
}, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-85", "text": "----------------------------------" }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-86", "text": "**GLOBAL HUMOR RANKING**" }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-87", "text": "To show the effectiveness of the lexical centrality of our tensor embedding method, we conduct an experiment on SemEval 2017 Task 6B (Potash et al., 2017) consisting of tweeted responses to specific thematic prompts generated as part of a TV show." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-88", "text": "For each prompt, the writing staff of the show pick a top 10 and an overall winner." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-89", "text": "These humor judgments are used as gold standard labels." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-90", "text": "Tensor embeddings and centroids are computed on a per-prompt basis and responses are ranked according to their distance from the centroid." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-91", "text": "We run a grid search procedure to determine the optimal rank value as 100, the window size as 5." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-92", "text": "For evaluation, we adopt the same edit distance-based metric used in Potash et al. (2017) ." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-93", "text": "The results of our lexical centrality system using tensor embeddings is shown in Table 1 , where the official results of other state-of-the-art systems are taken from Potash et al. (2017) ." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-94", "text": "Our system outperforms all but the Duluth (Yan and Pedersen, 2017) system in the official results for Se-mEval 2017 Task 6B (Potash et al., 2017) , making our performance equivalent to second place." 
}, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-95", "text": "It is notable that our system can perform well on the Broadway prompt, where other methods usually fail." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-96", "text": "Moreover, because our system does not have a learning procedure, the performance is more stable than others." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-97", "text": "----------------------------------" }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-98", "text": "**BINARY HUMOR CLASSIFICATION**" }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-99", "text": "To show the effectiveness of label propagation of our tensor embedding method for small sample humor recognition, we conduct an experiment on two humor classification datasets 16000 One-Liners (Mihalcea and Strapparava, 2005) and Pun of the Day (Yang et al., 2015) ." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-100", "text": "Similarly, we run a grid search procedure to find optimal parameters, and set the rank as 10, window size as 5, neighbor number as 50, \u03b1 as 0.2." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-101", "text": "F (0) is set as a zero matrix initially." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-102", "text": "For each dataset, we randomly select 5%, 10%, 30%, and 90% of the data for training." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-103", "text": "We run a 10-fold procedure, and report the average accuracy, precision, recall, and F1 score values." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-104", "text": "The results of humor classification are shown in 1 www.tensortoolbox.org Table 2 ." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-105", "text": "Our own implementation of Yang et al. (2015) is included as a baseline." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-106", "text": "While Yang et al. 
(2015) use a large portion of the data for training and combine different features, we find that with a similar portion of training data (90%), the results of our method are comparable." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-107", "text": "In addition, with only a small portion of training data, our method still achieves good results." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-108", "text": "----------------------------------" }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-109", "text": "**DISCUSSION**" }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-110", "text": "----------------------------------" }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-111", "text": "**LEXICAL CENTRALITY**" }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-112", "text": "The most notable aspect of our tensor embedding/lexical centrality approach is how little training data our system requires." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-113", "text": "Our system's unsupervised nature means that we do not need to use the 106 training prompts included with the SemEval 2017 Task 6 dataset." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-114", "text": "Our results are obtained exclusively using the six evaluation prompts." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-115", "text": "By comparison, almost all the systems reported in Potash et al. (2017) take a supervised approach and make full use of the training set." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-116", "text": "Furthermore, since we consider prompts one-at-a-time and since each prompt only contains approximately 100 responses, we are able to achieve state-of-the-art performance with only about 100 documents." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-117", "text": "The only system reported in Potash et al. (2017) to take an unsupervised approach is Duluth (Yan and Pedersen, 2017) ." 
}, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-118", "text": "Like ours, their results are obtained without using the training set." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-119", "text": "However, their system uses an n-gram language model trained on a 6.2GB subset of the News Commentary Corpus and the News Crawl Corpus." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-120", "text": "Similarly, most other systems use some form of external training corpora for training word embeddings, phoneme models, semantic models, and so on." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-121", "text": "Another advantage of our approach is the ease of interpretability, in contrast to neural-based state-of-the-art baselines." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-122", "text": "Because our lexical feature is in an Euclidean space, we can compare and rank humor level more easily." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-123", "text": "Tweets labeled as \"overall winners\" exhibited a smaller mean distance from their respective centroids (0.848) than those labeled as \"merely in the top 10\" (0.942)." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-124", "text": "These tweets then in turn exhibited smaller distances than those labeled as \"not in the top 10\" (1.00)." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-125", "text": "A one-way ANOVA test gives mild evidence that overall winners are drawn from a different distribution than tweets not in the top 10 (p = 0.106)." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-126", "text": "This slight result is likely due to the fuzzy nature of humor and the relatively small dataset." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-127", "text": "Finally, ad hoc analysis of tweets with distances > 2 revealed these to be mostly \"not in the top 10\"." 
}, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-128", "text": "----------------------------------" }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-129", "text": "**LABEL PROPAGATION**" }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-130", "text": "Although the semi-supervised framework provides a good alternative for small sample humor recognition, our method still cannot achieve a state-of-the-art performance with the same portion of training data." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-131", "text": "There is still space to improve the method; for example, by modeling not only the lexical similarity, but also other features, such as word association (Cattle and Ma, 2016) , and the like, that are important for humor recognition." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-132", "text": "In addition, label propagation cannot handle unbalanced data well." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-133", "text": "Adding prior knowledge of the ratio of labels, e.g., the unbalanced SemEval 2017 Task 6 dataset, also deserves further investigation." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-134", "text": "----------------------------------" }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-135", "text": "**CONCLUSION**" }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-136", "text": "In this paper, we show the importance of lexical features for small sample humor recognition." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-137", "text": "We propose a tensor embedding method to capture the lexical similarity effectively." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-138", "text": "Without training data, on SemEval 2017 Task 6B, we can achieve a relatively good result." }, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-139", "text": "Under a semi-supervised framework, the tensor embedding method can also achieve pretty good results for small sample humor classification." 
}, { "sent_id": "5fc7df69445712a50228d0bf80f30a-C001-140", "text": "It is interesting to further investigate a unified tensor embedding model to combine not only lexical, but also other features that are important for the sense of humor." } ], "y": { "@MOT@": { "gold_contexts": [ [ "5fc7df69445712a50228d0bf80f30a-C001-6" ], [ "5fc7df69445712a50228d0bf80f30a-C001-25" ] ], "cite_sentences": [ "5fc7df69445712a50228d0bf80f30a-C001-6", "5fc7df69445712a50228d0bf80f30a-C001-25" ] }, "@BACK@": { "gold_contexts": [ [ "5fc7df69445712a50228d0bf80f30a-C001-10" ], [ "5fc7df69445712a50228d0bf80f30a-C001-15" ], [ "5fc7df69445712a50228d0bf80f30a-C001-25" ], [ "5fc7df69445712a50228d0bf80f30a-C001-36" ], [ "5fc7df69445712a50228d0bf80f30a-C001-61" ] ], "cite_sentences": [ "5fc7df69445712a50228d0bf80f30a-C001-10", "5fc7df69445712a50228d0bf80f30a-C001-15", "5fc7df69445712a50228d0bf80f30a-C001-25", "5fc7df69445712a50228d0bf80f30a-C001-36", "5fc7df69445712a50228d0bf80f30a-C001-61" ] }, "@DIF@": { "gold_contexts": [ [ "5fc7df69445712a50228d0bf80f30a-C001-45", "5fc7df69445712a50228d0bf80f30a-C001-46" ] ], "cite_sentences": [ "5fc7df69445712a50228d0bf80f30a-C001-45", "5fc7df69445712a50228d0bf80f30a-C001-46" ] }, "@SIM@": { "gold_contexts": [ [ "5fc7df69445712a50228d0bf80f30a-C001-99" ], [ "5fc7df69445712a50228d0bf80f30a-C001-105", "5fc7df69445712a50228d0bf80f30a-C001-106" ] ], "cite_sentences": [ "5fc7df69445712a50228d0bf80f30a-C001-99", "5fc7df69445712a50228d0bf80f30a-C001-105", "5fc7df69445712a50228d0bf80f30a-C001-106" ] }, "@USE@": { "gold_contexts": [ [ "5fc7df69445712a50228d0bf80f30a-C001-99" ], [ "5fc7df69445712a50228d0bf80f30a-C001-105", "5fc7df69445712a50228d0bf80f30a-C001-106" ] ], "cite_sentences": [ "5fc7df69445712a50228d0bf80f30a-C001-99", "5fc7df69445712a50228d0bf80f30a-C001-105", "5fc7df69445712a50228d0bf80f30a-C001-106" ] } } }, "ABC_b13bc55709f5040cf100bd5f466ff2_15": { "x": [ { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-61", "text": "**PROPOSED METHOD**" }, { 
"sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-86", "text": "**TEXT-SIDE PROSODY CONTROL**" }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-158", "text": "----------------------------------" }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-159", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-14", "text": "These models have demonstrated ability to generate speech with expressive styles with Tacotron [1] using prosody embedding." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-15", "text": "They can also transfer the prosody of one speaker to another using a different speaker ID while leaving the prosody embedding unchanged." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-2", "text": "We propose prosody embeddings for emotional and expressive speech synthesis networks." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-3", "text": "The proposed methods introduce temporal structures in the embedding networks, thus enabling fine-grained control of the speaking style of the synthesized speech." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-4", "text": "The temporal structures can be designed either on the speech side or the text side, leading to different control resolutions in time." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-5", "text": "The prosody embedding networks are plugged into end-to-end speech synthesis networks and trained without any other supervision except for the target speech for synthesizing." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-6", "text": "It is demonstrated that the prosody embedding networks learned to extract prosodic features." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-7", "text": "By adjusting the learned prosody features, we could change the pitch and amplitude of the synthesized speech both at the frame level and the phoneme level." 
}, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-8", "text": "We also introduce the temporal normalization of prosody embeddings, which shows better robustness against speaker perturbations during prosody transfer tasks." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-9", "text": "----------------------------------" }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-11", "text": "Since Tacotron [1] paved the way for end-to-end Text-To-Speech (TTS) using neural networks, researchers have attempted to generate more naturally sounding speech by conditioning a TTS model via speaker and prosody embedding [2, 3, 4, 5, 6] ." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-12", "text": "( We use the term prosody as defined in earlier work [4] henceforth.) Because there is no available label for prosody, learning to control prosody in TTS is a difficult problem to tackle." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-13", "text": "Recent approaches learn to extract prosody embedding from reference speech in an unsupervised manner and use prosody embedding to control the speech style [4, 5] ." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-16", "text": "However, we observed two limitations with the above models." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-17", "text": "First, controlling the prosody at a specific moment of generated speech is not clear." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-18", "text": "Earlier works focused on prosody embedding with a fixed length (a length of 1 in their experiments) regardless of the length of the reference speech or that of the text input." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-19", "text": "A loss of temporal information when squeezing reference speech into a fixed length embedding is highly likely." 
}, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-20", "text": "Therefore, fine-grained control of prosody at a specific moment of speech is difficult for embedding with a fixed length." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-21", "text": "For example, we can set the global style as \"lively\" or \"sad,\" but we cannot control the prosody of a specific moment with fixed-length embedding." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-22", "text": "Because humans are sensitive to subtle changes of nuance, it is important to ensure finegrained control of prosody to represent one's intentions precisely." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-23", "text": "Secondly, inter-speaker prosody transfer is not robust if the difference between the pitch range of the source speaker and the target speaker is significant." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-24", "text": "For example, when the source speaker (female) has higher pitch than the target speaker (male), the prosodytransferred speech tends to show a higher pitch than the usual pitch of the target speaker." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-25", "text": "In this work, we focus on solving these two problems." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-26", "text": "We will introduce two types of variable-length prosody embedding which have the same length as the reference speech or input text to enable sequential control of prosody." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-27", "text": "In addition, we will show that normalizing prosody embedding helps to maintain the robustness of prosody transfers against speaker perturbations." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-28", "text": "With our methods, speaker-normalized variable-length prosody embedding was able to not only to control prosody at each specific frame, but also to transfer prosody between two speakers, even in a singing voice." 
}, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-29", "text": "----------------------------------" }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-30", "text": "**RELATED WORK**" }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-31", "text": "Prosody modeling had been done in a supervised manner by using annotated labels, such as those in ToBI [7] ." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-32", "text": "Problems were reported about hand annotations, and the cost was high [8] ." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-33", "text": "Skerry-Ryan et al. used convolutional neural networks and a Gated Recurrent Unit (GRU) [9] to compress the prosody of the reference speech [4] ." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-34", "text": "The output, denoted by p, is fixed-length prosody embedding." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-35", "text": "They enabled prosody transfers using the prosody embedding, but they could not gain control of prosody at a specific point of time." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-36", "text": "Another problem was also reported [5] ; fixed-length prosody embedding worked poorly if the length of the reference speech was shorter than the speech to generate." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-37", "text": "In addition, variablelength prosody embedding was also implemented using the output of the GRU at every time step [4] ." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-60", "text": "----------------------------------" }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-38", "text": "However, this method did not draw attention because it could not obtain satisfactory results given that it was not robust with regard to text and speaker perturbations." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-39", "text": "We noted the usefulness of variable-length prosody and elaborated on this concept for fine-grained prosody control." 
}, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-40", "text": "Wang et al. came up with the global style token (GST) Tacotron to encode different speaking styles [5] ." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-41", "text": "Although they used the same reference encoder architecture used in earlier work [4] , they did not use p itself for prosody embedding." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-42", "text": "Using a content-based attention, they computed the attention weights for style tokens from p. The attention weights represent the contribution of each style token, and the weighted sum of the style tokens is now used for style embedding." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-43", "text": "During the training step, each randomly initialized style token learns the speaking style in an unsupervised manner." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-44", "text": "In the inference mode, it was possible to control prosody by either predicting the style embedding from the reference speech or specifying the attention weights of the style tokens." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-45", "text": "This enables explicit control of the speaking style, but it nonetheless worked only in a global sense." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-46", "text": "If we are interested in controlling the prosody of a phoneme, it would be ideal to obtain the same prosody for different phonemes when the phonemes are conditioned on the same prosody embedding." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-47", "text": "However, GST Tacotron generates various types of prosody for input phonemes that are conditioned on the same style embedding, which is not desirable for prosody control." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-48", "text": "Wang et al. also proposed text-side style control using multiple style embeddings for different segments of input text." 
}, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-49", "text": "This method could roughly change the style of the text segments, but it is limited when used to control phoneme-wise prosody for the reasons mentioned above." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-50", "text": "----------------------------------" }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-51", "text": "**BASELINE MODEL**" }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-52", "text": "We used a simplified version [10] of Tacotron for the base encoder-decoder architecture, but we used the original Tacotron [1] style of the Post-processing net and the Griffin-Lim algorithm [11] for spectrogram-to-waveform conversion." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-53", "text": "For the encoder input x, we used the phoneme sequence of normalized text to ease the learning." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-54", "text": "The one-hot speaker identity is converted into the speaker embedding vector s by the embedding lookup layer." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-55", "text": "Equation 1 describes the base encoder-decoder, where e, p, and d denote the text encoder state, variable-length prosody embedding, and decoder state, respectively." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-56", "text": "Reference speech is encoded into prosody embedding using the reference encoder [4]." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-57", "text": "A mel-spectrogram of the reference speech proceeds through 2D-convolutional layers." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-58", "text": "The output of the last convolutional layer is fed to a uni-directional GRU." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-59", "text": "The last output of the GRU, r_N, is the fixed-length prosody embedding p. If we use every output of the GRU, r_{1:N}, for prosody embedding, it forms the variable-length prosody embedding p_{1:N}."
}, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-62", "text": "Fine-grained prosody control can be done by adjusting the values of the variable-length prosody embedding." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-63", "text": "We propose two types of prosody control methods: speech-side control and text-side control." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-64", "text": "Variable-length prosody embedding is used as a conditional input at the decoder module for speech-side control or at the encoder module for text-side control." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-65", "text": "In order to do this, we need to align and downsample the prosody embedding to match its length l_p with the speech side (the number of decoder time-steps, l_d) or the text side (the number of encoder time-steps, l_e)." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-66", "text": "----------------------------------" }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-67", "text": "**MODIFICATIONS IN THE REFERENCE ENCODER**" }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-68", "text": "We empirically found that the following modifications improved the generation quality." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-69", "text": "We used CoordConv [12] for the first convolutional layer." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-70", "text": "By construction, CoordConv can utilize positional information while losing translation invariance." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-71", "text": "We speculate that the positional information was helpful to encode prosody sequentially." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-72", "text": "We used ReLU as the activation function to force the values of the prosody embedding to lie in [0, 1]."
}, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-73", "text": "The proposed models are trained identically to the Tacotron model." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-74", "text": "The model is trained according to the L1 loss between the target spectrogram and the generated spectrogram, and no other supervision is given for the reference encoder." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-75", "text": "Unless otherwise stated, we used the same hyperparameter settings used in earlier work [4]." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-76", "text": "----------------------------------" }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-77", "text": "**SPEECH-SIDE PROSODY CONTROL**" }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-78", "text": "The length l_p of the variable-length prosody embedding created from a reference spectrogram of length l_ref is identical to l_ref." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-79", "text": "Note that the decoder should generate the same spectrogram as the reference spectrogram and that r frames are generated at each decoder time-step." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-80", "text": "This makes l_p r times longer than l_d." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-81", "text": "By choosing appropriate stride sizes for the convolutional layers, we could shorten the reference spectrogram to match l_p with l_d." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-82", "text": "At each decoder time-step i, p_i is first fed to the attention module together with e_{1:l_e} to compute the i-th attention weights, \u03b1_i. We did not feed the speaker embedding to the attention module, as we assumed the speaker identity to be conditionally independent of the attention weights when prosody is given." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-83", "text": "The weighted sum of e_{1:l_e} with \u03b1_i gives us the context vector e'_i."
}, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-84", "text": "The input of the decoder module at the i-th time-step is a concatenation of {e'_i, p_i, s}." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-85", "text": "----------------------------------" }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-87", "text": "The linear relationship between l_p and l_d made it easy to ensure that the speech-side prosody embedding has a length identical to the number of decoder time-steps." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-88", "text": "Unfortunately, such a relationship is not guaranteed between l_p and l_e." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-89", "text": "We introduced a reference attention module that uses scaled dot-product attention [13] to find the alignment between e_{1:l_e} and p_{1:l_ref}." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-90", "text": "In the reference attention module, the key k and value v are obtained from p, and the query is e. Conceptually, the attention mechanism computes the attention weights according to the similarity between the query and each key, and the weighted sum of the values is then obtained using the attention weights." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-91", "text": "To obtain k and v from the prosody embedding, we doubled the output dimension h of the reference encoder for text-side prosody control, with the output split into two matrices of size (l_ref \u00d7 h)." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-92", "text": "The weighted sum of v_{1:l_ref} with the attention weights gives us the text-side prosody embedding p_t." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-93", "text": "Then, p_t is concatenated to e upon every usage of e."
}, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-94", "text": "----------------------------------" }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-95", "text": "**PROSODY NORMALIZATION**" }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-96", "text": "Prosody embedding is normalized using each speaker's prosody mean." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-97", "text": "During training, we computed the sample mean along the temporal dimension of variable-length prosody embedding and stored the average of the sample mean for each speaker." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-98", "text": "For both the training step and the evaluation, normalization was done by subtracting the speaker-wise prosody mean from every time step of prosody embedding." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-99", "text": "----------------------------------" }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-100", "text": "**EXPERIMENTS AND RESULTS**" }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-101", "text": "----------------------------------" }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-102", "text": "**DATASET**" }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-103", "text": "Previous works [4, 5] used large amounts of data to train the prosodic TTS model (296 hours of data for the multi-speaker model)." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-104", "text": "To ensure a large amount of data, we used multiple datasets, in this case VCTK, CMU ARCTIC, and internal datasets." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-105", "text": "The final dataset consisted of 104 hours (58 hours of English and 46 hours of Korean) with 136 speakers (128 English speakers and 8 Korean speakers)."
}, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-106", "text": "Because variable-length prosody embedding has a large enough capacity to copy the reference audio, we had to use a very small dimension for the bottleneck size." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-107", "text": "This led us to use prosody sizes of 2 and 4 for the speech-side and text-side prosody embedding, respectively." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-108", "text": "----------------------------------" }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-109", "text": "**SPEECH-SIDE CONTROL OF PROSODY**" }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-110", "text": "By adjusting the values of the speech-side prosody embedding, we could change the prosody at a specific frame." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-111", "text": "Figure 1 shows the change in the learned prosody embeddings (line graph) and their corresponding spectrograms." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-112", "text": "The first dimension of prosody embedding, in the second row of Figure 1, tended to control the pitch of the generated speech." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-113", "text": "By comparing the highlighted parts of Figures 1-(a) and (b), one can observe the change in pitch from the spacing between the harmonics." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-114", "text": "The second dimension of prosody embedding, in the third row of Figure 1, tended to control the amplitude of the generated speech." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-115", "text": "By comparing the highlighted parts of Figures 1-(a) and (c), one can observe the change in amplitude from the intensity of the harmonics." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-116", "text": "We recommend that readers listen to the examples on our demo page."
}, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-117", "text": "First, we checked whether the reference attention module learned how to find the alignment between the phoneme sequence and the reference audio." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-118", "text": "Figure 3 shows the attention alignment plots of the original attention module (a) and the reference attention module (b)." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-119", "text": "From their analogous shapes, we find that the reference attention module could align the reference speech to the text." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-120", "text": "As was done in Section 5.2, we changed the prosody of the phonemes by adjusting the text-side prosody embedding in Figure 2." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-121", "text": "It appeared that the amplitude was affected by the first and third dimensions and that the pitch was affected by the second and third dimensions." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-122", "text": "In addition, the length was affected by the first and third dimensions." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-123", "text": "It would be ideal if each dimension represented one prosodic feature (i.e., the pitch, amplitude, or length)." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-124", "text": "We think the prosody embedding is entangled because we did not impose any disentanglement constraints on it." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-125", "text": "----------------------------------" }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-126", "text": "**COMPARISON WITH GST TACOTRON**" }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-127", "text": "We compared our methods to GST Tacotron both quantitatively and qualitatively."
}, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-128", "text": "For the quantitative comparison, we used the Mel Cepstral Distortion (MCD) with the first 13 MFCCs, as proposed in earlier work [4]." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-129", "text": "Table 1 shows that the proposed methods outperform GST Tacotron in terms of MCD_13, where a lower MCD is better." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-130", "text": "In particular, speech-side prosody control, which has the highest temporal resolution of prosody embedding, showed the lowest MCD." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-131", "text": "One shortcoming of GST Tacotron is that GST works only in a global sense." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-132", "text": "If we fix GST for multiple decoder time steps, the decoder generates speech while changing the prosody implicitly at each time step to create the GST's speech style." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-133", "text": "This is not problematic if the generated prosody perfectly matches the intention of the user, but in many cases we needed modifications to realize this." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-134", "text": "Because GST changes the prosody implicitly, it is difficult to control the prosody at a specific moment." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-135", "text": "On the other hand, the proposed prosody embeddings control the prosody explicitly." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-136", "text": "In Sections 5.2 and 5.3, we observed that prosody can be controlled by adjusting the values of the prosody embedding." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-137", "text": "We further demonstrated the explicitness and consistency of the proposed methods by fixing prosody embedding to have the same value over all frames."
}, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-138", "text": "This should give us a flat speech style in contrast to the GST approach, and we can see these outcomes in Figure 1-(d) and Figure 2-(e)." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-139", "text": "The results are obtained by fixing the dimensions that controlled the pitch." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-140", "text": "If we apply prosody control over a wider dynamic range of prosody embedding, the model can generate a singing voice." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-141", "text": "We demonstrate this in Figure 4." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-142", "text": "Using the three prosody control methods, we extracted the prosody from an unseen song of an unseen singer." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-143", "text": "We combined the extracted prosody embedding with the original lyrics and a speaker in the training set to perform the prosody transfer." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-144", "text": "While GST could not reconstruct the melody of the song, we could recognize the melody of the original song using the proposed methods." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-145", "text": "In particular, the generated song from speech-side prosody control was almost identical to the original song." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-146", "text": "For this task, the lyrics, speaker identity, and prosody embedding were the only requirements for the generation step." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-147", "text": "We witnessed the capability of the proposed methods to generate a song given an appropriate sequence of prosody embedding."
}, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-148", "text": "----------------------------------" }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-149", "text": "**INTER-SPEAKER PROSODY TRANSFER**" }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-150", "text": "We compared the MCD of the speech-side prosody embeddings with and without normalization, as shown in Table 2." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-151", "text": "For each case of prosody embedding, we computed the MCD between the reference and the generated speech for the prosody reconstruction and prosody transfer tasks." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-152", "text": "In both tasks, we used the speech of a female speaker as the reference speech, and we used the speaker ID of the same female speaker or another male speaker for prosody reconstruction and prosody transfer, respectively." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-153", "text": "Without normalization, the generated speech tended to show a higher pitch than the male speaker's, and the model sometimes failed to generate speech." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-154", "text": "We consider that this failure arises because the combination of the male speaker ID and female prosody embedding did not exist during the training step." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-155", "text": "When we used normalization for prosody embedding, the model was exposed to similarly distributed prosody embedding during the training phase." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-156", "text": "This made the prosody transfer easier compared to the case without normalization." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-157", "text": "Table 2 also presents this phenomenon, with a higher MCD during prosody transfer for the non-normalized model compared to the normalized model."
}, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-160", "text": "Here, we proposed temporally structured prosody embedding networks to control the expressive style of synthesized speech." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-161", "text": "The proposed methods changed the pitch and amplitude at both frame-level and phoneme-level resolution." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-162", "text": "Moreover, normalized prosody embedding made the prosody transfer step more robust to pitch discrepancies between the reference and generated speakers." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-163", "text": "The proposed methods demonstrated better quality in terms of the MCD score, and the prosody of a song could be successfully transferred to another speaker, resulting in voice conversion of a song." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-164", "text": "The bottleneck size was the only factor that regularized the prosody embedding network in this paper." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-165", "text": "Disentangling techniques will be beneficial to factorize the prosody embeddings into more explainable prosodic features and to separate them from other speech features." }, { "sent_id": "b13bc55709f5040cf100bd5f466ff2-C001-166", "text": "This will be a fruitful direction for future work."
} ], "y": { "@BACK@": { "gold_contexts": [ [ "b13bc55709f5040cf100bd5f466ff2-C001-11" ], [ "b13bc55709f5040cf100bd5f466ff2-C001-13" ], [ "b13bc55709f5040cf100bd5f466ff2-C001-33" ], [ "b13bc55709f5040cf100bd5f466ff2-C001-56" ] ], "cite_sentences": [ "b13bc55709f5040cf100bd5f466ff2-C001-11", "b13bc55709f5040cf100bd5f466ff2-C001-13", "b13bc55709f5040cf100bd5f466ff2-C001-33", "b13bc55709f5040cf100bd5f466ff2-C001-56" ] }, "@SIM@": { "gold_contexts": [ [ "b13bc55709f5040cf100bd5f466ff2-C001-12" ], [ "b13bc55709f5040cf100bd5f466ff2-C001-75" ], [ "b13bc55709f5040cf100bd5f466ff2-C001-103", "b13bc55709f5040cf100bd5f466ff2-C001-104" ], [ "b13bc55709f5040cf100bd5f466ff2-C001-128" ] ], "cite_sentences": [ "b13bc55709f5040cf100bd5f466ff2-C001-12", "b13bc55709f5040cf100bd5f466ff2-C001-75", "b13bc55709f5040cf100bd5f466ff2-C001-103", "b13bc55709f5040cf100bd5f466ff2-C001-128" ] }, "@USE@": { "gold_contexts": [ [ "b13bc55709f5040cf100bd5f466ff2-C001-12" ], [ "b13bc55709f5040cf100bd5f466ff2-C001-75" ], [ "b13bc55709f5040cf100bd5f466ff2-C001-103", "b13bc55709f5040cf100bd5f466ff2-C001-104" ], [ "b13bc55709f5040cf100bd5f466ff2-C001-128" ] ], "cite_sentences": [ "b13bc55709f5040cf100bd5f466ff2-C001-12", "b13bc55709f5040cf100bd5f466ff2-C001-75", "b13bc55709f5040cf100bd5f466ff2-C001-103", "b13bc55709f5040cf100bd5f466ff2-C001-128" ] }, "@DIF@": { "gold_contexts": [ [ "b13bc55709f5040cf100bd5f466ff2-C001-36", "b13bc55709f5040cf100bd5f466ff2-C001-37" ], [ "b13bc55709f5040cf100bd5f466ff2-C001-41" ] ], "cite_sentences": [ "b13bc55709f5040cf100bd5f466ff2-C001-37", "b13bc55709f5040cf100bd5f466ff2-C001-41" ] }, "@EXT@": { "gold_contexts": [ [ "b13bc55709f5040cf100bd5f466ff2-C001-41" ] ], "cite_sentences": [ "b13bc55709f5040cf100bd5f466ff2-C001-41" ] } } }, "ABC_d8a250a1a0495ee824837839b74f26_15": { "x": [ { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-24", "text": "**SYSTEM DESCRIPTION**" }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-1", "text": "**ABSTRACT**" }, { "sent_id": 
"d8a250a1a0495ee824837839b74f26-C001-2", "text": "We develop a novel transition-based parsing algorithm for the abstract meaning representation parsing task using exact imitation learning, in which the parser learns a statistical model by imitating the actions of an expert on the training data." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-3", "text": "We then use the imitation learning algorithm DAGGER to improve the performance, and apply an \u03b1-bound as a simple noise reduction technique." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-4", "text": "Our performance on the test set was 60% in F-score, and the performance gain on the development set due to DAGGER was up to 1.1 points of F-score." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-5", "text": "The \u03b1-bound improved performance by up to 1.8 points." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-6", "text": "----------------------------------" }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-8", "text": "In abstract meaning representation parsing (Banarescu et al., 2013), the goal is to parse natural language into a domain-independent graph-based meaning representation (AMR)." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-9", "text": "In the first AMR parsing work, Flanigan et al. (2014) split the task into two sub-tasks: concept identification and graph creation." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-10", "text": "The sub-tasks are learned independently, and exact inference is used to find the highest-scoring maximum spanning connected acyclic graph that contains all the concepts identified in the first stage." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-11", "text": "Later work by Wang et al. (2015b) adopted a different strategy based on the similarity between the dependency parse of a sentence and the semantic AMR graph."
}, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-12", "text": "They start from the dependency parse and learn a transition-based parser that converts it into an AMR graph." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-13", "text": "To learn the parser, Wang et al. (2015b) define an algorithm that, for each instance in the training data, infers the action sequence that converts the input dependency tree into the corresponding AMR graph, and they train a classifier to predict the actions to be taken during testing." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-14", "text": "This strategy is also referred to as exact imitation learning, while the algorithm that infers the action sequence in the training instances is commonly referred to as the expert policy." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-15", "text": "In our submission to SemEval Task 8 on AMR parsing, we follow the transition-based paradigm of Wang et al. (2015b) with modifications to the parsing algorithm, and also use the DAGGER imitation learning algorithm (Ross et al., 2011) to generalise better to unseen data." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-16", "text": "The central idea of DAGGER is that the distribution of states encountered by the expert policy during training may not be a good approximation to those seen in testing by the trained policy." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-17", "text": "Previous work by Rao et al. (2015) used SEARN, a similar imitation learning algorithm, on the AMR problem, with an algorithm that constructs the AMR graph directly from the sentence tokens." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-18", "text": "Imitation learning has also been used successfully in other semantic parsing tasks (Vlachos and Clark, 2014; Berant and Liang, 2015)."
}, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-19", "text": "In imitation learning approaches such as DAGGER, the previous actions become features for classification learning." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-20", "text": "However, the partial graphs in AMR parsing are rather complex to represent in this way, and, combined with the finite amount of training data, different actions can be chosen by the expert even though the feature representations for them can be very similar." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-21", "text": "These decisions appear as noisy outliers in classification learning." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-22", "text": "To control noise we experiment with the \u03b1-bound discussed by Khardon and Wachman (2007), which excludes a training example from future training once it has been misclassified \u03b1 times." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-23", "text": "----------------------------------" }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-25", "text": "In the following subsections we focus on the differences from previous work, in particular that of Wang et al. (2015b), who introduced the transition-based dependency-to-AMR paradigm we follow." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-26", "text": "We initialise the main algorithm with a stack of the nodes in the dependency tree, root node first." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-27", "text": "This stack is termed \u03c3." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-28", "text": "A second stack, \u03b2, is initialised with all children of the top node in \u03c3." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-29", "text": "The state at any time is described by \u03c3, \u03b2, and the current graph (which starts as the dependency tree)." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-30", "text": "Each action manipulates the top nodes in each stack, \u03c3_0 and \u03b2_0."
}, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-31", "text": "We reach a terminal state when \u03c3 is empty." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-33", "text": "----------------------------------" }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-34", "text": "**THE EXPERT POLICY**" }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-35", "text": "The expert policy used in training applies heuristic rules to determine the next action from a given state." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-36", "text": "It uses the training alignments to construct a mapping between nodes in the dependency tree and nodes in the target AMR." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-37", "text": "Any unmapped nodes in the dependency tree will be deleted by the expert, and any unmapped nodes in the AMR graph will be inserted." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-38", "text": "(Code available at https://github.com/hopshackle/dagger-AMR; the SemevalSubmission tag bookmarks the version used.)" }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-40", "text": "All our experiments use node alignments from the system of Pourdamghani et al. (2014)." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-41", "text": "----------------------------------" }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-42", "text": "**ACTION SPACE**" }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-43", "text": "Flanigan et al. (2014) and Wang et al. (2015b) both use AMR fragments as their smallest unit, which may consist of more than one AMR concept." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-44", "text": "Instead, we always work with the individual AMR nodes, and rely on Insert actions to learn how to build common fragments, such as country names."
}, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-45", "text": "The main adaptations to the actions, summarised in Table 1, stem from this." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-46", "text": "NextNode and NextEdge form the core action set, labelling nodes and edges respectively without changing the graph structure." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-47", "text": "Swap, Reattach and ReplaceHead change this structure, but always retain a tree structure." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-48", "text": "ReplaceHead covers two distinct actions in Wang et al. (2015b): ReplaceHead and Merge." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-49", "text": "Their Merge action merges \u03c3_0 and \u03b2_0 into a composite node; this is not required since we use no composite nodes and retain a 1:1 mapping between nodes and AMR concepts." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-50", "text": "Unlike Wang et al. (2015b), we do not parameterise Swap or Reattach actions with a label." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-51", "text": "We leave that decision to a later NextEdge action." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-52", "text": "We permit a Reattach action to use parameter \u03ba equal to any node within six edges from \u03c3_0, excluding any that would disconnect the graph or create a cycle." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-53", "text": "The Insert action inserts a new node as a parent of the current \u03c3_0." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-54", "text": "Wang et al. (2015a) later introduced an 'Infer' action similar to our Insert action." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-55", "text": "Infer inserts an AMR concept node above the current node as Insert does, but is restricted to nodes that occur outside of AMR 'fragments', which continue to be the base building block."
}, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-56", "text": "Reentrance is the one action that will turn a Tree into a non-Tree." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-57", "text": "We only consider Reentrance actions during a second pass (\"Phase Two\") through the AMR graph once the first pass has reached a terminal state." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-58", "text": "In this second pass we consider each node as \u03c3_0 in turn, and each nearby node as a possible \u03ba to insert a new edge (\u03c3_0, \u03ba)." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-59", "text": "We follow any Reentrance action with a NextEdge action to label the new arc." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-60", "text": "This approach simplifies the first pass, during which the graph is guaranteed to be a Tree." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-61", "text": "Reentrance makes only a small difference in the final F-score, and was turned off for our final submission." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-62", "text": "Also during Phase Two, we decide whether to Wikify \u03c3_0." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-63", "text": "This adds a new leaf node with a relation of wiki." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-64", "text": "There are three parameter values for \u03c9 that determine the wiki concept." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-65", "text": "In turn, these use: \"-\"; a concatenation of all child concepts in original word order; or a dictionary look-up keyed on a concatenation of the child nodes, if these were seen in the training data." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-66", "text": "An example of the third option is if a name node with a single child node of \"Michael\" is seen in the training data with a wiki relation of \"Michael Jackson\"."
}, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-67", "text": "This wikification is held in the dictionary, and will be used for any name node with a single child node of \"Michael\" in test." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-68", "text": "If instead in test the name node had two children, \"Michael\" and \"Jackson\", then \u03c9 would either add a wiki node of \"-\", or one of \"Michael Jackson\" by concatenating the two child concepts in the order they appear in the sentence." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-69", "text": "Figure 1 shows a parse of a sentence fragment." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-70", "text": "The current \u03c3 0 node is shown dashed and in red." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-71", "text": "From the top the actions are Insert(dateentity); NextNode(WORD); NextEdge(year); second diagram; NextNode(WORD); ReplaceHead to remove \"in\"; third diagram; NextNode(WORD); NextEdge(mod); Reattach to move \"date-entity\"; fourth diagram; NextNode(VERB); ReplaceHead to remove \"by\"; NextEdge(ARG0); NextEdge(time); NextNode(strike-01)." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-72", "text": "Wang et al. (2015b) use all AMR concepts and relations that appear in the training set as possible parameters (l c and l r ) if they appear in any sentence containing the same lemma as \u03c3 0 and \u03b2." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-73", "text": "We reduce this to just concepts that have been aligned to the current lemma." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-74", "text": "We initially run the expert policy over the training set, and track the AMR concept assigned for each lemma." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-75", "text": "These provide the possible l c that will be used for NextNode actions." 
}, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-76", "text": "Similarly we track the lemmas at head and tail of each expert-assigned AMR relation, and compile possible l r from these." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-77", "text": "There is no direct generalisation between different concepts and relations; so ARG0 and ARG0-of are independently learned relations for example, although they represent the same semantic relationship." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-78", "text": "AMR concepts/relations will never be considered during test if they were not aligned to that lemma in the training data." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-79", "text": "To relax this restriction we allow l c to take the values WORD, LEMMA, VERB, which respectively use the word, lemma, or the lemma concatenated with '-01' as the AMR concept." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-80", "text": "This is inspired by Werling et al. (2015) , who use a similar set of actions in a concept identification phase." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-81", "text": "These options improve performance (by 0.5 to 1.0 points on a validation set) by generalising to unseen tokens in test data, which otherwise would have no mapped AMR concepts." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-82", "text": "For the l c parameters on Insert (InsertBelow) actions, we use all AMR concepts that the expert inserted above (below) any node in the training set with the same lemma as \u03c3 0 ." 
}, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-83", "text": "----------------------------------" }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-84", "text": "**ADDITIONAL ACTION CONSTRAINTS**" }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-85", "text": "Transition-based parsing algorithms have classically relied on a fixed length of trajectory T for guarantees on performance, or at least a bounded T (Goldberg and Elhadad, 2010; Honnibal et al., 2013; Sartorio et al., 2013; McDonald and Nivre, 2007) ." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-86", "text": "In our approach T is theoretically unbounded and the algorithm could Insert or Reattach ad infinitum." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-87", "text": "We impose constraints to prevent these situations." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-88", "text": "A Swap action cannot be applied to a previously Swapped edge; once a node has been moved by Reattach, then it cannot be Reattached again; an Insert action is only permissible if no previous Insert action has been used with that node as \u03c3 0 ; an Insert action is not permissible if it would insert an AMR concept already in use as any of the parent, children, grand-parents or grand-children of \u03c3 0 ." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-89", "text": "Any action that would create a cycle is prohibited." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-90", "text": "We do not prevent duplication of argument relations, so a concept could have two outgoing ARG1 edges." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-91", "text": "We start with a connected graph (the dependency tree), and preserve connectivity as none of the actions will disconnect the graph." 
}, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-92", "text": "----------------------------------" }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-93", "text": "**IMITATION LEARNING**" }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-94", "text": "In the first iteration we only use the expert policy to generate a trajectory for each training sentence, and train a classifier \u03c0 1 from the collected data." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-95", "text": "At each step in the ith iteration we randomly choose the expert (with probability \u03b2 i ) or else \u03c0 i\u22121 ." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-96", "text": "The full set of collected data from all iterations is then used to train \u03c0 i\u22121 to obtain \u03c0 i ." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-97", "text": "\u03b2 1 = 1 and we set \u03b2 i = 0.7\u03b2 i\u22121 ." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-98", "text": "Hence each iteration uses the expert policy less and less, with the training trajectories increasingly approximating the states the classifier would encounter without expert knowledge of the target." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-99", "text": "Formally the DAGGER algorithm is online (Ross et al., 2011) , while we use it in a batch mode as described above." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-100", "text": "We use an averaged AROW classifier for all our experiments, with parameter r = 100 (Crammer et al., 2009) ." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-101", "text": "After each batch DAGGER iteration we use 3 iterations of (on-line) AROW training using all the collected data." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-102", "text": "As well as the \u03b1-bound for noise reduction, we tried two additional modifications to DAGGER and AROW." 
}, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-103", "text": "Firstly, we considered reducing the number of actions explored at each stage." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-104", "text": "The currently trained classifier evaluates all possible actions, and discards any that exceed the best scoring action by some threshold." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-105", "text": "Only this smaller set, plus the expert action, are included in the training example." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-106", "text": "This speeds up classifier training." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-107", "text": "In the first iteration we choose three to five actions randomly." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-108", "text": "Ross and Bagnell (2014) use a random set of exploratory actions in their AGGREVATE algorithm, but do not use the classifier to focus the exploration." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-109", "text": "Secondly, we used only the smallest training sentences in the first iteration, as measured by number of AMR nodes." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-110", "text": "At each further iteration the AMR size threshold for the training set was increased." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-111", "text": "The motivation was to train the classifier on 'easy' sentences, before introducing more complex ones." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-112", "text": "We start with up to 30 nodes, and increase this by 10 each iteration." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-113", "text": "Rao et al. (2015) similarly use the smallest sentences for training, but do not increase the size threshold as training proceeds." 
}, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-114", "text": "----------------------------------" }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-115", "text": "**FEATURES**" }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-116", "text": "All features used are detailed in Table 2 , largely based on Wang et al. (2015b) ." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-117", "text": "All are 0-1 indicator functions." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-118", "text": "inserted is 1 if the node was inserted by the parser; dl is the dependency label in the original dependency tree; ner the named entity tag; POS the part-of-speech tag; prefix is the string before the hyphen if word is hyphenated; suffix is the string after the hyphen; brown is the 100-class Brown cluster id with cuts at 4, 6, 10 and 20 2 ; deleted is the lemma of any child node previously deleted by the parser; merged is the lemma of any node merged into this node by a ReplaceHead action; distance is the distance between the tokens in the sentence; path concatenates lemmas and dls between the tokens in the dependency tree; POSpath concatenates POS tags between the tokens; NERpath concatenates NER tags between the tokens." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-119", "text": "The key differences to Wang et al. (2015b) are the inclusion of the brown, POSpath, NERpath, prefix and suffix feature types." 
}, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-120", "text": "----------------------------------" }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-121", "text": "**PRE-PROCESSING**" }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-122", "text": "Pre-processing steps on the training sentences were to: pass the full sentence through the Stanford Dependency Parser v3.3.1 to construct a dependency tree (Manning et al., 2014)." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-123", "text": "Table 2 : Features used by context. \u03c3 0 : dl, ner, POS, inserted, prefix, suffix, brown, deleted, inserted, lemma, brown. \u03c3 0C : label, ner, label-brown. \u03b2 0 : inserted, POS, lemma, brown, ner, dl, prefix, suffix, merged. \u03ba : ner, POS, lemma, brown, label. \u03c3 0 \u2192 \u03b2 0 : label, path, lemma-path-lemma, POSpath, inserted-inserted, lemma-POS, POS-lemma, dl-lemma, lemma-dl, lemma-label, label-lemma, ner-ner, distance. \u03b2 0 \u2192 \u03ba : path, lemma-path-lemma, NERpath, POSpath, distance, lemma-POS, dl-lemma, ner-ner. \u03c3 0 \u2192 \u03ba : distance, lemma-path-lemma, brown-brown, NERpath, POSpath, lemma-dl, lemma-label. \u03c3 0P \u2192 \u03c3 0 : label, POS-lemma, dl-lemma, ner-ner. POS-lemma, lemma-POS, dl-lemma, ner-ner." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-124", "text": "\u03c30P is the parent of \u03c30, \u03c30P P the parent of \u03c30P , and \u03c30C a child of \u03c30." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-125", "text": "Further pre-processing steps were to: remove punctuation tokens; treat \"/\" characters as token separators, but keep hyphenated words as single tokens; apply simple regex expressions to find common date formats and convert these to numeric sequences (e.g. 03-Jan-72 to 3 1 1972); and apply similar regex conversions to common numeric expressions, e.g. \"two thousand\" becomes \"2000\"." 
}, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-126", "text": "The parser was then able to learn to construct date-entity, temporal-quantity and similar AMR moieties." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-127", "text": "----------------------------------" }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-128", "text": "**RESULTS**" }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-129", "text": "In all experiments, the training data was the union of all training and dev sets in the task data, and the union of all test sets was used for validation." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-130", "text": "Table 3 shows the F-Score of the validation set with and without the Reentrance action (Reent, NoR), and with combinations of reduced search (Red), and incremental data (Inc)." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-131", "text": "All of these give the same result of 0.65 to 2 dp, and the bold entry was submitted." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-132", "text": "The Baseline entry uses only a single iteration, and the DAGGER experiments report the highest FScore achieved over 10 iterations (usually reached between the 3rd to 5th iterations)." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-133", "text": "The reduced information in the first iterations for incremental data and reduced search lead to lower initial performance here, but they still achieve the 0.65 result in time and each iteration is faster." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-134", "text": "DAG-GER provides a gain of 0.6 and 1.1 points of F-Score for experiments with and without Reentrance." 
}, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-135", "text": "Table 4 shows that the \u03b1-bound helps significantly." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-136", "text": "----------------------------------" }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-137", "text": "**CONCLUSION**" }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-138", "text": "Imitation Learning algorithms like DAGGER help in the AMR task, as in other structured prediction problems." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-139", "text": "Performance is improved using the \u03b1-bound to reduce the impact of noise in the training examples, and future work could investigate the impact in similar tasks." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-140", "text": "The reduction in search space and incremental growth of the training set do not have a significant impact on the results." }, { "sent_id": "d8a250a1a0495ee824837839b74f26-C001-141", "text": "The speed improvements they provide make little difference here, but could benefit more computationally demanding loss functions than the 0-1 expert loss used in DAGGER." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "d8a250a1a0495ee824837839b74f26-C001-11", "d8a250a1a0495ee824837839b74f26-C001-13" ], [ "d8a250a1a0495ee824837839b74f26-C001-48" ] ], "cite_sentences": [ "d8a250a1a0495ee824837839b74f26-C001-11", "d8a250a1a0495ee824837839b74f26-C001-13", "d8a250a1a0495ee824837839b74f26-C001-48" ] }, "@EXT@": { "gold_contexts": [ [ "d8a250a1a0495ee824837839b74f26-C001-15" ], [ "d8a250a1a0495ee824837839b74f26-C001-119" ] ], "cite_sentences": [ "d8a250a1a0495ee824837839b74f26-C001-15", "d8a250a1a0495ee824837839b74f26-C001-119" ] }, "@DIF@": { "gold_contexts": [ [ "d8a250a1a0495ee824837839b74f26-C001-15" ], [ "d8a250a1a0495ee824837839b74f26-C001-25" ], [ "d8a250a1a0495ee824837839b74f26-C001-43", "d8a250a1a0495ee824837839b74f26-C001-44" ], [ "d8a250a1a0495ee824837839b74f26-C001-50" ], [ "d8a250a1a0495ee824837839b74f26-C001-72", "d8a250a1a0495ee824837839b74f26-C001-73" ], [ "d8a250a1a0495ee824837839b74f26-C001-119" ] ], "cite_sentences": [ "d8a250a1a0495ee824837839b74f26-C001-15", "d8a250a1a0495ee824837839b74f26-C001-25", "d8a250a1a0495ee824837839b74f26-C001-43", "d8a250a1a0495ee824837839b74f26-C001-50", "d8a250a1a0495ee824837839b74f26-C001-72", "d8a250a1a0495ee824837839b74f26-C001-119" ] }, "@SIM@": { "gold_contexts": [ [ "d8a250a1a0495ee824837839b74f26-C001-116" ] ], "cite_sentences": [ "d8a250a1a0495ee824837839b74f26-C001-116" ] }, "@USE@": { "gold_contexts": [ [ "d8a250a1a0495ee824837839b74f26-C001-116" ] ], "cite_sentences": [ "d8a250a1a0495ee824837839b74f26-C001-116" ] } } }, "ABC_b0083488650bc98477fb10a9c5a808_15": { "x": [ { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-2", "text": "Attention-based sequence-to-sequence models have shown promising results in automatic speech recognition." 
}, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-3", "text": "Using these architectures, one-dimensional input and output sequences are related by an attention approach, thereby replacing more explicit alignment processes, as in classical HMM-based modeling." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-4", "text": "In contrast, here we apply a novel two-dimensional long short-term memory (2DLSTM) architecture to directly model the input/output relation between audio/feature vector sequences and word sequences." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-5", "text": "The proposed model is an alternative in which, instead of using any type of attention component, we apply a 2DLSTM layer to assimilate the context from both input observations and output transcriptions." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-6", "text": "The experimental evaluation on the Switchboard 300h automatic speech recognition task shows word error rates for the 2DLSTM model that are competitive with those of an end-to-end attention-based model." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-7", "text": "----------------------------------" }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-9", "text": "Conventional automatic speech recognition (ASR) systems using Gaussian mixture model (GMM) and/or hybrid deep neural network (DNN) hidden Markov models (HMM) consist of several components that are trained separately, depend on pretrained alignments and require a complex search [1, 2, 3, 4] ." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-10", "text": "Unlike the conventional approaches, attention-based sequence-to-sequence models propose a standalone and single neural network that trains end-to-end, does not need explicit alignments or context-dependent phonetic labels as in HMM, and simplifies the inference." 
}, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-11", "text": "In these models, an implicit probabilistic notion of alignment is used as part of a neural network." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-12", "text": "However, it does not work the same way as the alignment models of the conventional methods." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-13", "text": "The widely used attention-based sequence-to-sequence systems are based on an encoder-decoder architecture, where one or more long short-term memory (LSTM) layers read the observation sequence and another LSTM decodes it to a variable length output sequence of characters or words." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-14", "text": "In such architectures, both input and output sequences are separately handled as a one-dimensional sequence over time." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-15", "text": "An attention mechanism is then added into the architecture to combine the encoder and the decoder by allowing the decoder to selectively focus on individual parts of the encoder state sequences [5, 6, 7, 8, 9] ." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-16", "text": "The LSTM [10] is well suited for sequence modeling, where the sequence is strongly correlated along a one-dimensional time axis." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-17", "text": "Handling dynamic length, encoding positional information, the ability to make use of the previous context and tracking long-term dependencies by the gating strategy are some of the properties which make the LSTM appropriate for sequence-to-sequence modeling." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-18", "text": "Although an LSTM essentially processes data one-dimensionally, it can be extended to the processing of multi-dimensional data such as an image or a video [11] ." 
}, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-19", "text": "In this work, we investigate the use of two-dimensional LSTM (2DLSTM) [11, 12] in sequence-to-sequence modeling as an alternative model for the attention component." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-20", "text": "In this architecture, we apply a 2DLSTM on top of a deep bidirectional encoder to relate input and output representations in a 2D space." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-21", "text": "One dimension of the 2DLSTM processes the input sequence, and another dimension predicts the output (sub)words." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-22", "text": "In contrast to the attention-based sequence-tosequence model, where the encoder states are not updated and the model is not able to re-interpret the encoder states while decoding, this model enables the computation of the encoding of the observation sequence as a function of the previously generated transcribed words." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-23", "text": "Our model is similar to an architecture used in machine translation described in [13] ." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-24", "text": "We believe that the 2DLSTM is able to capture necessary monotonic alignments as well as retrieve coverage concepts internally by its cell states." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-25", "text": "Experimental results on the 300h-Switchboard task show competitive performance compared to an attentionbased sequence-to-sequence system." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-26", "text": "----------------------------------" }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-27", "text": "**RELATED WORKS**" }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-28", "text": "A way of building multidimensional context into recurrent networks is provided by a strategy that is based on networks with tree-structured update graphs." 
}, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-29", "text": "Fig. 1 : The internal architecture of the standard LSTM and the 2DLSTM." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-30", "text": "The additional connections are marked in blue [13] ." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-31", "text": "In handwriting recognition (HWR), the 2DLSTM has shown successful results in automatic extraction of features from raw 2D-images over convolutional neural networks (CNNs) [14] ." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-32", "text": "In order to investigate deeper and larger models using 2DLSTM, an algorithm to exploit GPU power has been implemented [15] ." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-33", "text": "Different neural networks have been proposed in automatic speech recognition (ASR) to model 2D correlations in the input signal." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-34", "text": "One of them is a 2DLSTM layer which scans the input over both time and frequency jointly for spatiotemporal modeling and aggregates more variations [16] ." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-35", "text": "Moreover, various architectures to model time-frequency patterns based on deep DNN, CNN, RNN and 2DLSTM layers are compared for large vocabulary ASR [17] ." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-36", "text": "As an alternative method to the concept of the 2DLSTM, a network of one-dimensional LSTM cells arranged in a multidimensional grid has been introduced [18] ." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-37", "text": "In this topology, the LSTM cells communicate not only along the time sequence but also between the layers." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-38", "text": "The grid LSTM network is also applied for the endpoint detection task in ASR to model both spectral and temporal variations [19] ." 
}, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-39", "text": "A 2D attention matrix is also applied in a neural pitch accent recognition model [20] , in which graphemes are encoded in one dimension and audio frames are encoded in the other." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-40", "text": "Recently, the 2DLSTM layer also has been used for sequence-to-sequence modeling in machine translation [13] where it implicitly updates the source representation conditioned on the generated target words." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-41", "text": "In a similar direction, a 2D CNN-based network has been proposed where the positions of the source and the target words define the 2D grid for translation modeling [21] ." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-42", "text": "Similar to [13] , we apply a 2DLSTM layer to combine the acoustic model (the LSTM encoder) and the language model (the decoder) without any attention components." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-43", "text": "The 2DLSTM reconciles the context from both the input and the output sequences and re-interprets the encoder states as each new word is predicted." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-44", "text": "Compared to [13] , our model is much deeper." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-45", "text": "We use max-pooling to select the most relevant encoder state whereas [13] uses the last horizontal state of the 2DLSTM." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-46", "text": "Furthermore, we utilize the same pretraining scheme explained in [9] during training, and a faster decoding." 
}, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-47", "text": "----------------------------------" }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-48", "text": "**2D LONG SHORT-TERM MEMORY**" }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-49", "text": "The 2DLSTM is characterized as a general form of the standard LSTM [11, 22] ." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-50", "text": "It has been proposed to process inherently 2D data of arbitrary lengths, T and N ." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-51", "text": "Therefore, it uses both horizontal and vertical recurrences." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-52", "text": "The building blocks of both the LSTM and the 2DLSTM are shown in Figure 1 ." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-53", "text": "At time step (t, n), it gets an input x t,n , and its computation relies on both the vertical s t,n\u22121 and the horizontal hidden states s t\u22121,n ." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-54", "text": "Besides the input i t,n , the forget f t,n and the output o t,n gates that are similar to those in the LSTM, the 2DLSTM employs an additional lambda gate." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-55", "text": "As written in Equation 5 , its activation is computed analogously to the other gates [13, 11] ." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-56", "text": "The internal cell state c t,n , is computed based on the sum of the two previous cell's states c t\u22121,n and c t,n\u22121 , weighted by the lambda gate \u03bb t,n and its complement (see Equation 6 )." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-57", "text": "Similar to the LSTM, the internal cell c t,n is combined with the output gate to yield the hidden state." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-58", "text": "g and \u03c3 are the hyperbolic tangent and the sigmoid functions." 
}, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-59", "text": "V i , W i and U i are the weight matrices." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-60", "text": "For notational simplicity, we omit the bias vectors." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-61", "text": "We process the 2D data in a forward pass from the time step (1, 1) to (T, N ) and thus the gradient is passed backwards in an opposite direction from the time step (T, N ) to (1, 1)." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-62", "text": "Training a 2DLSTM unit involves back-propagation through two dimensions." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-63", "text": "For more details, we refer the reader to [11, 22] ." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-64", "text": "----------------------------------" }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-65", "text": "**2D SEQUENCE-TO-SEQUENCE MODEL**" }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-66", "text": "Bayes decision rule requires maximization of the class posterior given an input observation." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-67", "text": "In ASR, classes are discrete label sequences of unknown length N (e.g. word, subword or character sequences), denoted as w N 1 ." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-68", "text": "" }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-69", "text": "This conditional distribution usually covers the alignment information between the input observation sequence and the output word sequence either implicitly or explicitly." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-70", "text": "In the attention-based sequence-to-sequence approach, the attention weights serve as the implicit probabilistic notion of alignments aligning output labels to encoder states." 
}, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-71", "text": "The freedom of the attention model to focus on the entire input sequence might contradict monotonicity in ASR." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-72", "text": "In this work, we remove the attention component and intend to investigate whether the 2D sequence-to-sequence modeling is able to properly capture the input-output monotonic relation." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-73", "text": "As shown in Figure 2 , we apply a deep bidirectional LSTM encoder (L = 6) to scan an observation sequence." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-74", "text": "On top of each bidirectional LSTM layer, we conduct maxpooling over the time dimension to reduce the observation length." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-75", "text": "Hence, the encoder states are formulated as follows:" }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-76", "text": "where T is the reduced length by a reduction factor." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-77", "text": "Similar to [13] , we then equip the network by a 2DLSTM layer to relate the encoder and the decoder states." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-78", "text": "At time step (t , n), the 2DLSTM receives both the encoder state h t , and the last target embedding vector w n\u22121 , as inputs." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-79", "text": "One dimension of the 2DLSTM (horizontal-axis in the figure) sequentially reads the encoder states and another (vertical axis) plays the role of the decoder." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-80", "text": "Therefore, there is no additional decoder LSTM." 
}, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-81", "text": "Unlike the attention-based sequence-to-sequence model, where the encoder states are obtained once at the beginning, our model repeatedly updates the encoder representations h T 1 , while generating a new output word w n ." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-82", "text": "We note that in this model, we do not use any attention component." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-83", "text": "The state of the 2DLSTM is derived as follows:" }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-84", "text": "It is important to note that the 2DLSTM state for a label/word step n depends only on the preceding word sequence w n\u22121 1 , while it takes into account the whole temporal context of the input observation sequence." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-85", "text": "At each decoder step, once the whole input sequence is processed from 1 to T , we do max-pooling over all horizontal states to obtain the context vector." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-86", "text": "We have also tried average-pooling or the last horizontal state instead of max-pooling, but neither is better in this case." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-87", "text": "In order to generate the next output word, w n , a transformation followed by a softmax operation is applied." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-88", "text": "Therefore:" }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-91", "text": "Figure 2 : The 2D seq2seq architecture using the 2DLSTM layer on top of an L-layer encoder." 
}, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-92", "text": "Neither attention components nor explicit LSTM decoders are used." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-93", "text": "Inspired by [13] ." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-94", "text": "----------------------------------" }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-95", "text": "**EXPERIMENTS**" }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-96", "text": "We have conducted experiments on the Switchboard 300h task." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-97", "text": "We apply 40-dimensional Gammatone features [23] using the RASR feature extractor [24] ." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-98", "text": "We use the full Hub5'00 including Switchboard (SWB) and Callhome (CH) as the development set and the Hub5'01 as a test set." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-99", "text": "In order to enable an open-vocabulary system, we use byte-pair-encoding (BPE) [25] with 1k merge operations." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-100", "text": "As our baseline, we utilize the attention-based sequenceto-sequence architecture similar to that described in [9] with the exact pretraining scheme and the same reduction factor." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-101", "text": "The baseline model includes a one-layer LSTM decoder with additive attention equipped with fertility feedback." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-102", "text": "The feature vectors are passed into a stack of 6 bidirectional LSTM layers of size 1000 in each direction followed by the max-pooling operation." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-103", "text": "We downsample the input sequence by factor of 8 in total as described in [9] ." 
}, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-104", "text": "The 2DLSTM layer is equipped with 1000 nodes and the output subwords are projected into a 620-dimensional embedding space." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-105", "text": "The models are trained end to end using the Adam optimizer [26] , dropout of 30% [27] , label smoothing of 0.1 [28] and warmup technique." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-106", "text": "We reduce the learning rate by a factor of 0.7 following a variant of the Newbob scheme based on the perplexity on the development set for a few checkpoints." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-107", "text": "In our training, we use layer-wise pretraining for the encoder, where we start with two encoder layers and a single max-pool in between with the same multiple-step reduction factor similar to [9] ." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-108", "text": "Decoding is performed using beam search with a beam size of 12 and the subwords are merged into words." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-109", "text": "We do not utilize any language model (LM) neither in the baseline system nor in the 2D sequence-tosequence model." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-110", "text": "The model is built using our in-house CUDA implementation of 2DLSTM [15] utilizing optimal speedups in RETURNN [29] ." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-111", "text": "The code is open source and the configuration of the setups are available online 1 ." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-112", "text": "Table 1 compares the total number of parameters, perplexity and frame error rate (FER) on the development set between our model and the attention baseline." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-113", "text": "Both models have the same vocabulary size of almost 1K." 
}, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-114", "text": "Our model has 3M more parameters." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-115", "text": "The perplexity and the FER are comparable." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-116", "text": "We also compare our model over prior works based on the WER listed in Table 2 ." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-117", "text": "As a simple significance test, the reported WERs are averaged over 3 runs." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-118", "text": "Although our 2D sequenceto-sequence model is still behind the hybrid methods, it leads to competitive results over the attention baseline." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-119", "text": "We observe that our model outperforms the baseline on the Hub5'01 subset by 0.4% absolute." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-120", "text": "Including a separate LM during the search, we expect to obtain improvements." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-121", "text": "We also compare our model and the attention-based sequence-to-sequence model in terms of decoding speed." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-122", "text": "1 https://github.com/rwth-i6/returnn Based on the fact that the whole output label sequence is known during the training, the entire 2DLSTM states can be computed once and at each time step, one row of it is taken." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-123", "text": "This computation cannot be done as a single operation in the search since the output sequence has to be predicted; therefore, during the decoding, we need to compute the states of the 2DLSTM row-wise which slows down the search procedure." 
}, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-124", "text": "This algorithm is faster than [13] , where at each output step, they recompute all previous states of 2DLSTM from scratch which are not required." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-125", "text": "Table 3 lists the decoding speed of the models to decode the entire development set using a single GPU." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-126", "text": "In general, the decoding speed of our model is about 6 times slower than that of a standard attention-based model." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-127", "text": "----------------------------------" }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-128", "text": "**CONCLUSION**" }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-129", "text": "We have applied a simple 2D sequence-to-sequence model as an alternative to the attention-based model." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-130", "text": "In our model, a 2DLSTM layer has been utilized to jointly combine the input and the output representations." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-131", "text": "It processes the observation sequence via the horizontal dimension and generates the output (sub)word sequence through the vertical axis." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-132", "text": "It does not have any additional LSTM decoder and does not benefit from any attention components." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-133", "text": "Contrary to the attentionbased sequence-to-sequence model, it repeatedly re-encodes the encoder representation when a new output (sub)word is generated." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-134", "text": "The experimental results are competitive with the baseline on the 300h-Switchboard Hub'00 and show 0.4% improvements on the Hub'01." 
}, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-135", "text": "Our future goal is to develop a bidirectional 2DLSTM to model completely independent of the standard LSTM layers as well as run more experiments on various speech tasks." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-136", "text": "----------------------------------" }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-137", "text": "**ACKNOWLEDGEMENTS**" }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-138", "text": "This work has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 694537, project \"SEQCLAS\") and from a Google Focused Award." }, { "sent_id": "b0083488650bc98477fb10a9c5a808-C001-139", "text": "The work reflects only the authors' views and none of the funding parties is responsible for any use that may be made of the information it contains." } ], "y": { "@SIM@": { "gold_contexts": [ [ "b0083488650bc98477fb10a9c5a808-C001-23" ], [ "b0083488650bc98477fb10a9c5a808-C001-42" ], [ "b0083488650bc98477fb10a9c5a808-C001-77" ] ], "cite_sentences": [ "b0083488650bc98477fb10a9c5a808-C001-23", "b0083488650bc98477fb10a9c5a808-C001-42", "b0083488650bc98477fb10a9c5a808-C001-77" ] }, "@BACK@": { "gold_contexts": [ [ "b0083488650bc98477fb10a9c5a808-C001-30" ], [ "b0083488650bc98477fb10a9c5a808-C001-40" ], [ "b0083488650bc98477fb10a9c5a808-C001-55" ] ], "cite_sentences": [ "b0083488650bc98477fb10a9c5a808-C001-30", "b0083488650bc98477fb10a9c5a808-C001-40", "b0083488650bc98477fb10a9c5a808-C001-55" ] }, "@DIF@": { "gold_contexts": [ [ "b0083488650bc98477fb10a9c5a808-C001-42", "b0083488650bc98477fb10a9c5a808-C001-44", "b0083488650bc98477fb10a9c5a808-C001-45" ], [ "b0083488650bc98477fb10a9c5a808-C001-124" ] ], "cite_sentences": [ "b0083488650bc98477fb10a9c5a808-C001-42", "b0083488650bc98477fb10a9c5a808-C001-44", "b0083488650bc98477fb10a9c5a808-C001-45", "b0083488650bc98477fb10a9c5a808-C001-124" 
] }, "@EXT@": { "gold_contexts": [ [ "b0083488650bc98477fb10a9c5a808-C001-42", "b0083488650bc98477fb10a9c5a808-C001-44", "b0083488650bc98477fb10a9c5a808-C001-45" ] ], "cite_sentences": [ "b0083488650bc98477fb10a9c5a808-C001-42", "b0083488650bc98477fb10a9c5a808-C001-44", "b0083488650bc98477fb10a9c5a808-C001-45" ] } } }, "ABC_da8f30113f1126a78cefed06a15076_15": { "x": [ { "sent_id": "da8f30113f1126a78cefed06a15076-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-2", "text": "In this paper, we investigate the challenges of using reinforcement learning agents for question-answering over knowledge graphs for real-world applications." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-3", "text": "We examine the performance metrics used by state-of-the-art systems and determine that they are inadequate for such settings." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-4", "text": "More specifically, they do not evaluate the systems correctly for situations when there is no answer available and thus agents optimized for these metrics are poor at modeling confidence." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-5", "text": "We introduce a simple new performance metric for evaluating question-answering agents that is more representative of practical usage conditions, and optimize for this metric by extending the binary reward structure used in prior work to a ternary reward structure which also rewards an agent for not answering a question rather than giving an incorrect answer." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-6", "text": "We show that this can drastically improve the precision of answered questions while only not answering a limited number of previously correctly answered questions." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-7", "text": "Employing a supervised learning strategy using depth-first-search paths to bootstrap the reinforcement learning algorithm further improves performance." 
}, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-8", "text": "----------------------------------" }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-10", "text": "A number of approaches for question answering have been proposed recently that use reinforcement learning to reason over a knowledge graph (Das et al., 2018; Lin et al., 2018; Chen et al., 2018; Zhang et al., 2018) ." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-11", "text": "In these methods the input question is first parsed into a constituent question entity and relation." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-12", "text": "The answer entity is then identified by sequentially taking a number of steps (or 'hops') over the knowledge graph (KG) starting from the question entity." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-13", "text": "The agent receives a positive reward if it arrives at the correct answer entity and a negative reward for an incorrect answer entity." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-14", "text": "For example, for the ques- tion \"What is the capital of France?\", the question entity is (F rance) and the goal is to find a path in the KG which connects it to (P aris)." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-15", "text": "The relation between the answer entity and question entity in this example is (Capital of ) which is missing from the KG and has to be inferred via alternative paths." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-16", "text": "This is illustrated in Figure 1 ." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-17", "text": "A possible two-hop path to find the answer is to use the fact that (M acron) is the president of (F rance) and that he lives in (P aris)." 
}, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-18", "text": "However, there are many paths that lead to the entity (P aris) but also to other entities which makes finding the correct answer a non-trivial task." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-19", "text": "The standard evaluation metrics used for these systems are metrics developed for web search such as Mean Reciprocal Rank (MRR) and hits@k, where k ranges from 1 to 20." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-20", "text": "We argue that this is not a correct evaluation mechanism for a practical question-answering system (such as Alexa, Cortana, Siri, etc.) where the goal is to return a single answer for each question." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-21", "text": "Moreover it is assumed that there is always an answer entity that could be reached from the question entity in limited number of steps." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-22", "text": "However this cannot be guaranteed in a large-scale commercial setting and for all KGs." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-23", "text": "For example, in our proprietary dataset used for the experimentation, for 15.60% of questions the answer entity cannot be reached within the limit of number of steps used by the agent." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-24", "text": "Hence, we propose a new evaluation criterion, allowing systems to return 'no answer' as a response when no answer is available." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-25", "text": "We demonstrate that existing state-of-the-art methods are not suited for a practical questionanswering setting and perform poorly in our evaluation setup." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-26", "text": "The root-cause of poor performance is the reward structure which does not provide any incentive to learn not to answer." 
}, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-27", "text": "The modified reward structure we present allows agents to learn not to answer in a principled way." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-28", "text": "Rather than having only two rewards, a positive and a negative reward, we introduce a ternary reward structure that also rewards agents for not answering a question." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-29", "text": "A higher reward is given to the agent for correctly answering a question compared to not answering a question." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-30", "text": "In this setup the agent learns to make a trade-off between these three possibilities to obtain the highest total reward over all questions." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-31", "text": "Additionally, because the search space of possible paths exponentially grows with the number of hops, we also investigate using Depth-First-Search (DFS) algorithm to collect paths that lead to the correct answer." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-32", "text": "We use these paths as a supervised signal for training the neural network before the reinforcement learning algorithm is applied." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-33", "text": "We show that this improves overall performance." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-34", "text": "----------------------------------" }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-35", "text": "**RELATED WORK**" }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-36", "text": "The closest works to ours are the works by Lin et al. (2018) , Zhang et al. (2018) and Das et al. (2018) , which consider the question answering task in a reinforcement learning setting in which the agent always chooses to answer." 
}, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-37", "text": "1 Other approaches consider this as a link prediction problem in which multi-hop reasoning can be used to learn relational paths that link two entities." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-38", "text": "One line of work focuses on composing embeddings (Neelakantan et al., 2015; Guu et al., 2015; Toutanova et al., 2016) initially introduced for link prediction, e.g., TransE (Bordes et al., 2013) , ComplexE (Trouillon et al., 2016) or ConvE (Dettmers et al., 2018 )." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-39", "text": "Another line of work focuses on logical rule learning such as neural logical programming and neural theorem proving (Rockt\u00e4schel and Riedel, 2017) ." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-40", "text": "Here, we focus on question answering rather than link prediction or rule mining and use reinforcement learning to circumvent that we do not have ground truth paths leading to the answer entity." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-41", "text": "Recently, popular textual QA datasets have been extended with not-answerable questions (Trischler et al., 2017; Rajpurkar et al., 2018) ." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-42", "text": "Questions that cannot be answered are labeled with 'no answer' option which allows for supervised training." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-43", "text": "This is different from our setup in which there are no ground truth 'no answer' labels." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-44", "text": "----------------------------------" }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-45", "text": "**BACKGROUND: REINFORCEMENT LEARNING**" }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-89", "text": "Next we will describe the ternary reward structure." 
}, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-46", "text": "We base our work on the recent reinforcement learning approaches introduced in Das et al. (2018) and Lin et al. (2018) ." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-47", "text": "We denote the knowledge graph as G, the set of entities as E, the set of relations as R and the set of directed edges L between entities of the form l = (e 1 , r, e 2 ) with e 1 , e 2 \u2208 E and r \u2208 R. The goal is to find an answer entity e a given a question entity e q and the question relation r q , when (e q , r q , e a ) is not part of graph G." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-48", "text": "We formulate this problem as a Markov Decision Problem (MDP) (Sutton and Barto, 1998) with the following states, actions, transition function and rewards:" }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-49", "text": "States." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-50", "text": "At every timestep t, the state s t is defined by the current entity e t , the question entity e q and relation r q , for which e t , e q \u2208 E and r q \u2208 R. More formally, s t = (e t , e q , r q )." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-51", "text": "Actions." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-52", "text": "For a given entity e t , the set of possible actions is defined by the outgoing edges from e t ." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-53", "text": "Thus" }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-54", "text": "Transition function." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-55", "text": "The transition function \u03b4 maps s t to a new state s t+1 based on the action taken by the agent." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-56", "text": "Consequently, s t+1 = \u03b4(s t , A t ) = \u03b4(e t , e q , r q , A t )." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-57", "text": "Rewards." 
}, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-58", "text": "The agent is rewarded based on the final state." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-59", "text": "For example, in Das et al. (2018) and Lin et al. (2018) the agent obtains a reward of 1 if the correct answer entity is reached as the final state and 0 otherwise (i.e., R(s T ) = I{e T = e a })." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-60", "text": "Figure 2a illustrates the LSTM which encodes history of the path taken." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-61", "text": "The output at timestep t is used as input to the policy network, illustrated in Figure 2b , to determine which action to take next." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-62", "text": "----------------------------------" }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-63", "text": "**TRAINING**" }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-64", "text": "We train a policy network \u03c0 using the REIN-FORCE algorithm of Williams (1992) which maximizes the expected reward:" }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-65", "text": "(1) in which a t is the action selected at timestep t following the policy \u03c0, and \u03b8 are the parameters of the network." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-66", "text": "The policy network consists of two parts: a Long Short-Term Memory (LSTM) network which encodes the history of the traversed path, and a feed-forward neural network to select an action (a t ) out of all possible actions." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-67", "text": "Each entity and relation have a corresponding vector e t , r t \u2208 R d ." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-68", "text": "The action a t \u2208 A t is represented by the vectors of the relation and entity as a t = [r t+1 ; e t+1 ] \u2208 R 2d ." 
}, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-69", "text": "The LSTM encodes the history of the traversed path and updates its hidden state each timestep, based on the selected action:" }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-70", "text": "This is illustrated in Figure 2a ." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-71", "text": "Finally, the feed-forward neural network (f ) combines the history h t , the current entity representation e t and the query relation r q ." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-72", "text": "Using softmax, we compute the probability for each action by calculating the dot product between the output of f and each action vector a t :" }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-73", "text": "in which A t \u2208 R |At|\u00d72d is a matrix consisting of rows of action vectors a t ." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-74", "text": "This is illustrated in Figure 2b." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-75", "text": "During training, we sample over this probability distribution to select the action a t , whereas during inference, we use beam search to select the most probable path." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-76", "text": "----------------------------------" }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-77", "text": "**EVALUATION**" }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-78", "text": "User-facing question answering systems inherently face a trade-off between presenting an answer to a user that could potentially be incorrect, and choosing not to answer." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-79", "text": "However, prior work in knowledge graph question-answering (QA) only considers cases in which the answering agent always produces an answer." 
}, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-80", "text": "This setup originates from the link prediction and knowledge base completion tasks in which the evaluation criteria are hits@k and Mean Reciprocal Rank (MRR), where k ranges from 1 to 20." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-81", "text": "However, these metrics are not an accurate representation of practical question-answering systems in which the goal is to return a single correct answer or not answer at all." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-82", "text": "Moreover, using these metrics result in the problem of the model learning 'spurious' paths since the metrics encourage the models to make wild guesses even if the path is unlikely to lead to the correct answer." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-83", "text": "We therefore propose to measure the fraction of questions the system answers (Answer Rate) and the number of correct answers out of all answers (Precision) to measure the system performance." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-84", "text": "We combine these two metrics by taking the harmonic mean and call this the QA Score." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-85", "text": "This can be viewed as a variant of the popular F-Score metric, with answer rate used as an analogue to recall in the original metric." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-86", "text": "----------------------------------" }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-87", "text": "**PROPOSED METHOD**" }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-88", "text": "In this section, we will first introduce the supervised learning technique we used to pretrain the neural network before applying the reinforcement learning algorithm." 
}, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-90", "text": "----------------------------------" }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-91", "text": "**SUPERVISED LEARNING**" }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-92", "text": "Typically in reinforcement learning, the search space of possible actions and paths grows exponentially with the path length." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-93", "text": "Our problem is no exception to this." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-94", "text": "Hence an imitation learning approach could be beneficial here where we provide a number of expert paths to the learning algorithm to bootstrap the learning process." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-95", "text": "This idea has been explored previously in the context of link and fact prediction in knowledge graphs where Xiong et al. (2017) proposed to use a BreadthFirst-Search (BFS) between the entity pairs to select a set of plausible paths." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-96", "text": "However BFS favours identification of shorter paths which could bias the learner." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-97", "text": "We therefore use Depth-First-Search (DFS) to identify paths between question and answer entities and sample up to 100 paths to be used for the supervised training." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-98", "text": "If no path can be found between the entity pair we return a 'no answer' label." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-99", "text": "Following this, we train the network using reinforcement learning algorithm which refines it further." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-100", "text": "Note that it is not guaranteed that the set of paths found using DFS are all most efficient." 
}, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-101", "text": "However as we show in our experiments, bootstrapping with these paths provide good initialization for the reinforcement learning algorithm." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-102", "text": "----------------------------------" }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-103", "text": "**TERNARY REWARD STRUCTURE**" }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-104", "text": "As mentioned previously, we encounter situations when the answer entity cannot be reached in the limited number of steps taken by an agent." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-105", "text": "In such cases, the system should return a special answer 'no answer' as the response." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-106", "text": "We can achieve this by adding a synthetic 'no answer' action that leads to a special entity e N OAN SW ER ." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-107", "text": "This is illustrated in Figure 3 ." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-108", "text": "In the framework of Das et al. (2018) a binary reward is used which rewards the learner for the answer being wrong or correct." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-109", "text": "Following a similar protocol, we could award a score of 1 to return 'no answer' when there is no answer available in the KG." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-110", "text": "However, we cannot achieve reasonable training with such reward structure." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-111", "text": "This is because there is no specific pattern for 'no answer' that could be directly learned." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-112", "text": "Hence, if we reward a system equally for correct or no answer, it learns to always predict 'no answer'." 
}, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-113", "text": "We therefore propose a ternary reward structure in which a positive reward is given to a correct answer, a neutral reward when e N OAN SW ER is selected as an answer, and a negative reward for an does not exist in the graph and thus an alternative path needs to be used that leads to the correct answer." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-114", "text": "To avoid that the agent returns an incorrect answer when not finding the correct answer, a 'no answer' relation is added between every entity node and a special 'no answer' node, to be able to return 'no answer'. incorrect answer." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-115", "text": "More formally:" }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-116", "text": "r pos if e T = e a , 0 if e T = e N OAN SW ER , r neg if e T \u2208 {e a , e N OAN SW ER } (4) with r pos >0 and r neg <0." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-117", "text": "The idea is that the agent receives a larger reward for a correct answer compared to not answering the question, and a negative reward for incorrectly answering a question compared to not answering the question." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-118", "text": "In the experimental section, we show that this mechanism provides better performance." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-119", "text": "----------------------------------" }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-120", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-121", "text": "We evaluate our proposed approach on a publicly available dataset, FB15k-237 (Toutanova and Chen, 2015) which is based on the Freebase knowledge graph and a proprietary dataset Alexa69k-378 which is a sample of Alexa's proprietary knowledge graph." 
}, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-122", "text": "Both the public dataset and the proprietary dataset are Das et al. (2018) , using the same train/val/test splits for FB15k-237." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-123", "text": "For Alexa69k-378 we use 10% of the full dataset for validation and test." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-124", "text": "For both datasets, we add the reverse relations of all relations in the training set in order to facilitate backward navigation following the approach of previous work." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-125", "text": "Similarly, a 'no op' relation is added for each entity between the entity and itself, which allows the agent to loop/reason multiple consecutive steps over the same entity." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-126", "text": "An overview of both datasets can be found in Table 3 ." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-127", "text": "We extend the publicly available implementation of Das et al. (2018) for our experimentation." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-128", "text": "We set the size of the entity and relation representations d at 100 and the hidden state at 200." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-129", "text": "We use a single layer LSTM and train models with path length 3 (tuned using hyper-parameter search)." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-130", "text": "We optimize the neural network using Adam (Kingma and Ba, 2015) with learning rate 0.001, mini-batches of size 256 with 20 rollouts per example." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-131", "text": "During the test time, we use beam search with the beam size of 100." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-132", "text": "Unlike Das et al. (2018) , we also train entity embeddings after initializing them with random values." 
}, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-133", "text": "Reward values are set as r pos = 10 and r neg = \u22120.1 after performing a coarse grid search for various reward values on the validation set." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-134", "text": "For all experiments, we selected the best model with the highest QA Score on the corresponding validation set." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-135", "text": "----------------------------------" }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-136", "text": "**RESULTS**" }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-137", "text": "The results of our experiments for FB15k-237 and Alexa69k-378 are given in Table 1 and Table 2 respectively." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-138", "text": "Supervised learning For FB15k-237, we see that the model trained using reinforcement learning (RL) scores as well as the model trained using supervised learning." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-139", "text": "This makes supervised learning using DFS a strong baseline system for question answering over knowledge graphs, and for FB15k-237 in particular." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-140", "text": "On Alexa69k-378, models trained using supervised learning score lower on all metrics compared to RL." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-141", "text": "When combining supervised learning with RL overall performance increases." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-142", "text": "No answer When we train RL system with our ternary reward structure (No Answer RL), the precision and QA score increase significantly on both datasets." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-143", "text": "For FB15k-237, our No Answer RL model decided not to answer over 40% of the questions, with an absolute hits@1 reduction of only 1.3% over standard RL." 
}, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-144", "text": "Moreover, of all the answered questions, 40.11% were answered correctly compared to 24.75% of the original question-answering system: an absolute improvement of over 15%." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-145", "text": "This resulted in the final QA Score of 47.58%, around 8% higher than standard RL and 12% higher than Das et al. (2018) ." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-146", "text": "Similarly, 60% of the questions did not get answered on Alexa69k-378." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-147", "text": "This resulted in hits@1 decrease of roughly 1% but compared to standard RL, the precision increased from 16.77% to 38.92%: an absolute increase of more than 20%." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-148", "text": "The final QA Score also increased from 28.72% to 39.55%, and also significantly improved over Das et al. (2018) and Lin et al. (2018) ." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-149", "text": "The results indicate that using our method allows us to improve the precision of the question-answering system by choosing the right questions to be answered by not answering many questions that were previously answered incorrectly." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-150", "text": "This comes at the expense of not answering some questions that previously could be answered correctly." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-151", "text": "All Finally, all methods were combined in a single method." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-152", "text": "First the model was pretrained in a supervised way." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-153", "text": "Then the model was retrained using RL algorithm with ternary reward structure." 
}, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-154", "text": "This jointly trained model obtained better QA scores than any individually trained model." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-155", "text": "On FB15k-237, a QA score of 52.16% is obtained which is an absolute improvement of 4.58% over the best individual model and 2.66% over Lin et al. (2018) ." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-156", "text": "Similarly, on Alexa69k-378, an absolute improve- ment of 2.57% over the best individual result is obtained, almost 10% absolute improvement over Lin et al. (2018) ." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-157", "text": "Sample results from our method are given in Table 4 and Table 5 ." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-158", "text": "Reward tuning An important part of increasing the QA score is to select the right combination of rewards." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-159", "text": "Therefore, we ran additional experiments where we varied either the positive or negative reward, keeping the other rewards fixed." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-160", "text": "In Figure 4 , the precision, answer rate and QA score are shown when varying the positive reward and keeping the neutral and negative rewards fixed." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-161", "text": "When, the positive reward is very small (r pos = 0.625), almost no question is answered." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-162", "text": "When the positive reward r pos is 1.25, roughly 20% of the questions are answered with a 50% precision." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-163", "text": "After that, the precision starts declining and the answer rate starts increasing, resulting in an overall increase in QA score." 
}, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-164", "text": "The QA score plateaus between 5 and 10 and then starts decreasing slowly." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-165", "text": "In Figure 5 , the precision, answer rate and QA score are shown when varying the negative reward and keeping the neutral and positive rewards fixed." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-166", "text": "In this case, the highest QA score is achieved when the negative reward is between -0.25 and -0.1." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-167", "text": "As long as the negative reward is lower than zero, a wrong answer gets penalized and the QA score stays high." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-168", "text": "----------------------------------" }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-169", "text": "**CONCLUSIONS**" }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-170", "text": "In this paper, we addressed the limitations of current approaches for question answering over a knowledge graph that use reinforcement learning." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-171", "text": "Rather than only returning a correct or incorrect answer, we allowed the model to not answer a question when it is not sure about it." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-172", "text": "Our ternary reward structure gives different rewards for correctly answered, incorrectly answered and not answered questions." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-173", "text": "We also introduced a new evaluation metric which takes these three options into account." }, { "sent_id": "da8f30113f1126a78cefed06a15076-C001-174", "text": "We showed that we can significantly improve the precision of answered questions compared to previous approaches, making this a promising direction for the practical usage in knowledge graph-based QA systems." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "da8f30113f1126a78cefed06a15076-C001-10" ], [ "da8f30113f1126a78cefed06a15076-C001-58", "da8f30113f1126a78cefed06a15076-C001-59" ], [ "da8f30113f1126a78cefed06a15076-C001-108" ] ], "cite_sentences": [ "da8f30113f1126a78cefed06a15076-C001-10", "da8f30113f1126a78cefed06a15076-C001-59", "da8f30113f1126a78cefed06a15076-C001-108" ] }, "@SIM@": { "gold_contexts": [ [ "da8f30113f1126a78cefed06a15076-C001-36" ], [ "da8f30113f1126a78cefed06a15076-C001-122", "da8f30113f1126a78cefed06a15076-C001-127" ] ], "cite_sentences": [ "da8f30113f1126a78cefed06a15076-C001-36", "da8f30113f1126a78cefed06a15076-C001-122", "da8f30113f1126a78cefed06a15076-C001-127" ] }, "@USE@": { "gold_contexts": [ [ "da8f30113f1126a78cefed06a15076-C001-46" ], [ "da8f30113f1126a78cefed06a15076-C001-122", "da8f30113f1126a78cefed06a15076-C001-127" ] ], "cite_sentences": [ "da8f30113f1126a78cefed06a15076-C001-46", "da8f30113f1126a78cefed06a15076-C001-122", "da8f30113f1126a78cefed06a15076-C001-127" ] }, "@DIF@": { "gold_contexts": [ [ "da8f30113f1126a78cefed06a15076-C001-132", "da8f30113f1126a78cefed06a15076-C001-145", "da8f30113f1126a78cefed06a15076-C001-148" ] ], "cite_sentences": [ "da8f30113f1126a78cefed06a15076-C001-132", "da8f30113f1126a78cefed06a15076-C001-145", "da8f30113f1126a78cefed06a15076-C001-148" ] }, "@EXT@": { "gold_contexts": [ [ "da8f30113f1126a78cefed06a15076-C001-132", "da8f30113f1126a78cefed06a15076-C001-145", "da8f30113f1126a78cefed06a15076-C001-148" ] ], "cite_sentences": [ "da8f30113f1126a78cefed06a15076-C001-132", "da8f30113f1126a78cefed06a15076-C001-145", "da8f30113f1126a78cefed06a15076-C001-148" ] } } }, "ABC_0d798fcdee6ee5722d6dc5638210c2_15": { "x": [ { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-38", "text": "The role of vision in vision-and-language tasks." 
}, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-2", "text": "Vision-and-Language Navigation (VLN) requires grounding instructions, such as turn right and stop at the door, to routes in a visual environment." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-3", "text": "The actual grounding can connect language to the environment through multiple modalities, e.g. stop at the door might ground into visual objects, while turn right might rely only on the geometric structure of a route." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-4", "text": "We investigate where the natural language empirically grounds under two recent state-of-the-art VLN models." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-5", "text": "Surprisingly, we discover that visual features may actually hurt these models: models which only use route structure, ablating visual features, outperform their visual counterparts in unseen new environments on the benchmark Room-to-Room dataset." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-6", "text": "To better use all the available modalities, we propose to decompose the grounding procedure into a set of expert models with access to different modalities (including object detections) and ensemble them at prediction time, improving the performance of state-ofthe-art models on the VLN task." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-7", "text": "----------------------------------" }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-9", "text": "The Vision-and-Language Navigation (VLN) task (Anderson et al., 2018) requires an agent to navigate to a particular location in a real-world environment, following complex, context-dependent instructions written by humans (e.g. go down the second hallway on the left, enter the bedroom and stop by the mirror)." 
}, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-10", "text": "The agent must navigate through the environment, conditioning on the instruction as well as the visual imagery that it observes along the route, to stop at the location specified by the instruction (e.g. the mirror)." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-11", "text": "Recent state-of-the-art models (Wang et al., 2018; Fried et al., 2018b; Ma et al., 2019) have demonstrated large gains in accuracy on the VLN task." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-12", "text": "However, it is unclear which modality these go past the couch \u2026 Figure 1 : We factor the grounding of language instructions into visual appearance, route structure, and object detections using a mixture-of-experts approach." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-13", "text": "substantial increases in task metrics can be attributed to, and, in particular, whether the gains in performance are due to stronger grounding into visual context or e.g. simply into the discrete, geometric structure of possible routes, such as turning left or moving forward (see Fig. 1" }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-14", "text": "----------------------------------" }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-15", "text": "**, TOP VS. MIDDLE).**" }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-16", "text": "First, we analyze to what extent VLN models ground language into visual appearance and route structure by training versions of two state-ofthe-art models without visual features, using the benchmark Room-to-Room (R2R) dataset (Anderson et al., 2018) ." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-17", "text": "We find that while grounding into route structure is useful, the models with visual features fail to learn generalizable visual grounding." 
}, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-39", "text": "In several vision-and-language tasks, high performance can be achieved without effective modeling of the visual modality." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-40", "text": "Devlin et al. (2015) find that image captioning models can exploit regularity in the captions, showing that a nearestneighbor matching approach can achieve competitive performance to sophisticated language generation models. and find that neural captioning models often ground object mentions into incorrect objects due to correlations in the training data, and can hallucinate non-existing objects." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-41", "text": "Recent work has also investigated singlemodality performance in vision-and-language embodiment tasks." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-42", "text": "Anand et al. (2018) find that stateof-the-art results can be achieved on the EmbodiedQA task (Das et al., 2018 ) using an agent without visual inputs." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-43", "text": "Work concurrent to ours evaluates the performance of single-modality models for several embodied tasks including VLN (Thomason et al., 2019) , finding that high performance can be achieved on the R2R dataset using a non-visual version of the baseline model (Anderson et al., 2018) ." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-44", "text": "In this paper, we show that the same trends hold for two recent state-of-the-art architectures (Ma et al., 2019; Fried et al., 2018b) for the VLN task; we also analyze to what extent object-based representations and mixture-ofexperts methods can address these issues." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-45", "text": "in a connectivity graph determined by line-of-sight in the physical environment." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-46", "text": "See the top row of Fig. 
1 for a top-down environment illustration." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-47", "text": "In the VLN task, a virtual agent is placed at a particular viewpoint in an environment, and is given a natural language instruction (written by a human annotator) to follow." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-48", "text": "At each timestep, the agent receives the panoramic image for the viewpoint it is currently located at, and either predicts to move to one of the adjacent connected viewpoints, or to stop." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-49", "text": "When the agent predicts the stop action, it is evaluated on whether it has correctly reached the end of the route that the human annotator was asked to describe." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-50", "text": "In this work, we analyze two recent VLN models, which typify the visual grounding approaches of VLN work: the panoramic \"follower\" model from the Speaker-Follower (SF) system of Fried et al. (2018b) and the Self-Monitoring (SM) model of Ma et al. (2019) ." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-51", "text": "These models obtained stateof-the-art results on the R2R dataset." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-52", "text": "Both models are based on the encoder-decoder approach (Cho et al., 2014 ) and map an instruction to a sequence of actions in context by encoding the instruction with an LSTM, and outputting actions using an LSTM decoder that conditions on the encoded instruction and visual features summarizing the agent's environmental context." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-53", "text": "Compared to the SF model, the SM model introduces an improved visual-textual co-attention mechanism and a progress monitor component." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-54", "text": "We refer to the original papers for details on the two models." 
}, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-55", "text": "To analyze the models' visual grounding ability, we focus on their core encoder-decoder components." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-56", "text": "In our experiments, we use models trained without data augmentation, and during inference predict actions with greedy search (i.e. without beam search, pragmatic, or progress monitorbased inference)." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-57", "text": "For SF, we use the publicly released code." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-58", "text": "For SM, we use a reimplementation without the progress monitor, which was shown to be most important for search in inference (Ma et al., 2019) ." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-59", "text": "We investigate how well these models ground instructions into visual features of the environment, by training and evaluating them without access to the visual context: setting their visual feature vectors to zeroes during training and testing." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-60", "text": "We compare performance on the validation sets of the R2R dataset: the val-seen split, consisting of the same environments as in training, and the val- Table 1 : Success rate (SR) of the vision-based full agent (\"RN\", using ResNet) and the non-visual agent (\"no vis.\", setting all visual features to zero) on the R2R dataset under different model architectures (SpeakerFollower (SF) (Fried et al., 2018b) and Self-Monitoring (SM) (Ma et al., 2019) ) and training schemes." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-61", "text": "unseen split of novel environments." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-62", "text": "Since we aim to evaluate how well the agents generalize to the unseen environments, we focus on the val-unseen split." 
}, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-63", "text": "For both the SF and SM models, we train two versions of the agents, using either the studentforcing or teacher-forcing approaches of Anderson et al. (2018) 1 , and select the best training snapshot on the val-seen split." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-64", "text": "2 The results are shown in Table 1 ." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-65", "text": "In each block, the two rows show the agent's performance (under the specific model architecture and training approach) with or without access to the visual features (\"RN\": ResNet-152 network (He et al., 2016) , \"no vis.\": non-visual)." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-66", "text": "While visual features improve performance on environments seen during training, we see that for the SF architecture the non-visual agent (lines 1 and 3) outperforms the visual agent (lines 2 and 4) on unseen environments under both studentforcing and teacher-forcing training." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-67", "text": "For SM, the non-visual agent (lines 5 and 7) has a success rate very close to the visual agent (lines 6 and 8)." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-68", "text": "This indicates that these models do not learn generalizable visual perception, so that the visual features may actually hurt them in unseen environments." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-69", "text": "----------------------------------" }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-70", "text": "**OBJECT REPRESENTATION FOR BETTER GROUNDING AND GENERALIZATION**" }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-71", "text": "In both the SF and SM architectures, the agents use visual features from a pretrained ResNet-152 CNN (He et al., 2016) ." 
}, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-72", "text": "As the training data for the R2R dataset contains only 61 distinct environments, the agents may overfit to the appearance of the training environments and thus struggle to gen-eralize." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-73", "text": "For example, for the instruction go down the staircase, a model may learn to ground staircase into a specific staircase in a given training environment, and fail to generalize to staircases with different appearances or in different contexts in unseen environments." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-74", "text": "We thus propose an objectbased representation, where object detection results from a pretrained large-scale object detector are used as the environment representation." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-75", "text": "The object-based representation is intended to prevent overfitting to training scenes and to transfer to new environments better than CNN features." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-76", "text": "Both the SF and SM models represent the visual appearance at each location with a set of visual features {x img,i }, where x img,i is a vector extracted from an image patch at a particular orientation i using a CNN." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-77", "text": "Both models also use a visual attention mechanism to extract an attended visual feature x img,att from {x img,i }." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-78", "text": "For our objectbased representation, we use a Faster R-CNN (Ren et al., 2015) object detector trained on the Visual Genome dataset (Krishna et al., 2017) ." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-79", "text": "We construct a set of vectors {x obj,j } representing detected objects and their attributes." 
}, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-80", "text": "Each vector x obj,j (j-th detected object in the scene) is a concatenation of summed GloVe vectors (Pennington et al., 2014) for the detected object label (e.g. door) and attribute labels (e.g. white) and a location vector from the object's bounding box coordinates." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-81", "text": "We then use the same visual attention mechanism as in Fried et al. (2018b) and Ma et al. (2019) to obtain an attended object representation x obj,att over these {x obj,j } vectors." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-82", "text": "We either substitute the ResNet CNN features x img,att (\"RN\") with our object representation x obj,att (\"Obj\"), or concatenate x img,att and x obj,att (\"RN+Obj\")." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-83", "text": "Then we train the SF model or the SM model using this object representation, with results shown in Table 2 . 3 For SF (lines 1-4), object representations substantially improve generalization ability: using either the object representation (\"Obj\") or the combined representation (\"RN+Obj\") obtains higher success rate on unseen environments than using only the ResNet features (\"RN\"), and the combined representation (\"RN+Obj\") obtains the highest overall performance." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-84", "text": "For SM (lines 5-8), Table 2 : Success rate (SR) of agents with different visual inputs on the R2R dataset (\"RN\": ResNet CNN, \"Obj\": objects, \"no vis.\": no visual representation)." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-85", "text": "Models: Speaker-Follower (SF) (Fried et al., 2018b) and Self-Monitoring (SM) (Ma et al., 2019) ." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-86", "text": "the model that uses only the object representation achieves the best performance (line 7)." 
}, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-87", "text": "Here the success rates across the four settings are closer, and the improvement from object representation is smaller than for SF." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-88", "text": "However, in Sec. 5 we find that object representation can be combined with other inputs to further improve the performance." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-89", "text": "----------------------------------" }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-90", "text": "**MIXTURE-OF-EXPERTS MAKES BETTER USE OF ALL AVAILABLE INFORMATION**" }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-91", "text": "While the agent with CNN visual features does not outperform its non-visual counterpart (Sec. 3) on average, it often succeeds on individual instructions where the non-visual model fails, indicating the visual and non-visual modalities are complementary." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-92", "text": "To encourage grounding into both modalities, we ensemble visual and non-visual models in a mixture-of-experts approach." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-93", "text": "----------------------------------" }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-94", "text": "**SEPARATE TRAINING**" }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-95", "text": "We first ensemble the models from Sec. 3 and Sec. 4 at test time (after training them separately) by combining their predictions at each timestep." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-96", "text": "4 Lines 9-22 of Table 3 show ensembles of two models." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-97", "text": "Compared to single-model performance (line 1-8 in Table 2 ), an ensemble of a visual and a non-visual agent outperforms the individual agents for both the SF and the SM models." 
}, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-98", "text": "The best performing setting is the combination of \"RN\" and \"no vis.\" (non-visual) in line 20 under the SM model." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-99", "text": "While it is unsurprising that the mixture-of-experts can boost performance, it is interesting to see that the best mixture in line 20 outperforms mixtures of two agents of the same type (two non-visual agents in line 16, two visual agents in line 17, trained from distinct random parameter initializations), confirming that two agents Table 3 : Success rate (SR) of different mixtureof-experts ensembles." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-100", "text": "Models: Speaker-Follower (SF) (Fried et al., 2018b) and Self-Monitoring (SM) (Ma et al., 2019) ; \"RN\": ResNet CNN, \"Obj\": objects, \"no vis.\": no visual representation." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-101", "text": "with access to different modalities can complement each other, especially in the SM model." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-102", "text": "We also experiment with a 3-way mixture in the SM model, combining a visual agent with ResNet CNN features, a visual agent with object features, and a non-visual agent (line 23)." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-103", "text": "This mixture outperforms all the 2-way mixtures by a noticeable margin, showing that the CNN and object-based visual representations are also complementary." 
}, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-104", "text": "----------------------------------" }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-105", "text": "**JOINT TRAINING**" }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-106", "text": "Finally, given the success of this simple test-time ensemble, we also explore jointly training these models by building a single agent which uses a single instruction encoder shared between multiple (visual and non-visual) jointly-trained decoders." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-107", "text": "During joint training, each decoder is supervised to predict the true actions, applying the same loss function as in separate training." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-108", "text": "During testing, actions are predicted by averaging logits from the separate decoders as in Sec. 5.1." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-109", "text": "We experiment with jointly training the agents in each of the two best-performing combinations (RN, no vis.) and (RN, Obj, no vis.) under the SM architecture (line 24 and 25 of Table 3 )." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-110", "text": "From line 24 vs. 20 and line 25 vs. 23, joint training gives higher performance than training each model separately and combining them only at test time." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-111", "text": "Overall, we obtain 51.9% final success rate on the val-unseen split (line 25), which is over 10% (absolute) higher than the SF or SM baselines using a single decoder with CNN features (line 2 and 6 in Table 2 )." 
}, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-112", "text": "5" }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-113", "text": "----------------------------------" }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-114", "text": "**DISCUSSION**" }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-115", "text": "The success of non-visual versions of two recent state-of-the-art VLN models, often outperforming their vision-based counterparts in unseen environments on the benchmark R2R dataset, shows that these models do not use the visual inputs in a generalizable way." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-116", "text": "Our intuition is that, while language has rich, high-level symbolic meaning, which can be easily matched to the modality of the route structures, pixel-based visual representations, even those extracted via CNNs, are a lowerlevel modality which require more data to learn, and so a model trained on both modalities may learn to mostly rely on the route structure." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-117", "text": "This is also supported by the results in Table 3 (line 23 vs. line 20), where adding higher-level object representations improves the success rate by 2.6%." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-118", "text": "Notably, an agent in the R2R environment is only able to move to a discrete set of locations in the environment, and at each point in time it only has a small number of actions available, determined by the environment's connectivity graph (i.e., moving to the adjacent locations)." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-119", "text": "These constraints on possible routes help explain our findings that language in the VLN instructions often grounds into geometric route structure in addition to visual context along the route." 
}, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-120", "text": "For example, if an instruction says turn left at the couch, and the route structure only allows the agent to turn left at a single location, it may not need to perceive the couch." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-121", "text": "Other instructions, such as go straight for 5 meters and stop may also be carried out without access to visual perception." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-122", "text": "The improvement of our mixture-of-experts approach over single models suggests that it is challenging to learn to ground language into multiple modalities in one model." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-123", "text": "The \"RN+Obj\" model ( Table 2 , line 8) has access to the same information as our best result in Table 3 , line 25, but obtains much lower success rate (39.5% vs. 51.9%)." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-124", "text": "Thus, splitting the prediction task across several models, where each has access to a different input modality, is an effective way to inject an inductive bias that encourages the model to ground into each of the modalities." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-125", "text": "Supplementary material to \"Are You Looking? Grounding to Multiple Modalities in Vision-and-Language Navigation\"" }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-126", "text": "----------------------------------" }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-127", "text": "**A DETAILS ON THE COMPARED VLN MODELS**" }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-128", "text": "The Speaker-Follower (SF) model (Fried et al., 2018b ) and the Self-Monitoring (SM) model (Ma et al., 2019) which we analyze both use sequenceto-sequence model (Cho et al., 2014) with attention (Bahdanau et al., 2015) as their base instruction-following agent." 
}, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-129", "text": "Both use an encoder LSTM (Hochreiter and Schmidhuber, 1997 ) to represent the instruction text, and a decoder LSTM to predict actions sequentially." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-130", "text": "At each timestep, the decoder LSTM conditions on the action previously taken, a representation of the visual context at the agent's current location, and an attended representation of the encoded instruction." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-131", "text": "While at a high level these models are similar (at least in terms of the base sequence-tosequence models -both papers additionally develop techniques to select routes from these base models during search-based inference techniques, either using a separate language generation model in SF, or a progress-monitor in SM), they differ in the mechanism by which they combine representations of the text instruction and visual input." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-132", "text": "The SM uses a co-grounded attention mechanism, where both the visual attention on image features and the textual attention on the instruction words are generated based on previous decoder LSTM hidden state h t\u22121 , and then the attended visual and textual features are used as LSTM inputs to produce h t ." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-133", "text": "The SF model only uses attended visual features as LSTM inputs and then produces textual attention based on updated LSTM state h t ." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-134", "text": "Also, the visual attention weights are calculated with an MLP and batch-normalization in SM, while only a linear dot-product visual attention is used in SF." 
}, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-135", "text": "Empirically these differences produce large performance improvements for the SM model, which may contribute to the smaller gap between the SM model and its non-visual counterparts." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-136", "text": "B Details on the training mechanisms Anderson et al. (2018) compare two methods for training agents, which subsequent work on VLN has also used." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-137", "text": "These methods differ in whether they allow the agent to visit viewpoints which are not part of the true routes at training time." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-138", "text": "In the first training setup, teacher-forcing, the agent visits each viewpoint in a given true route in sequence, and is supervised at each viewpoint with the action necessary to reach the next viewpoint in the true route." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-139", "text": "In the second training setup, student-forcing, the agent takes actions by sampling from its predicted distribution at each timestep, which results in exploring viewpoints that are not part of the true routes." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-140", "text": "At each viewpoint, supervision is provided by an oracle that returns the action which would take the agent along the shortest path to the goal." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-141", "text": "Empirically, studentforcing works better in nearly all settings in Table 1 (except for the non-visual version of the SF model), which is likely due to the fact that it reduces the discrepancy between training and testing, since it allows the agent to sample from its own prediction during training." 
}, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-142", "text": "Teacher-forcing works better for the non-visual version of the SF model, and we hypothesize that following the ground-truth routes during training allows the SF model to better preserve the geometric structures of the routes and match them to the instructions for the non-visual setting." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-143", "text": "----------------------------------" }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-144", "text": "**C DETAILS ON THE OBJECT REPRESENTATION**" }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-145", "text": "In our object representation, we use the top-150 detected objects (with the highest detection confidence) at each location in the environment." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-146", "text": "The detection results are obtained from a Faster R-CNN detector (Ren et al., 2015) pretrained on the Visual Genome dataset (Krishna et al., 2017) ." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-18", "text": "Surprisingly, when trained without visual features, their performance on unseen environments is comparable or even better." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-19", "text": "We hypothesize that the low-level, pixel-based CNN features in the visual models contribute to their failure to generalize." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-20", "text": "To address this, we introduce a high-level object-based visual representation to ground language into visual context in a more generalizable way, using the symbolic output of a pretrained object detection system." 
}, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-21", "text": "For example, while a concept table could ground into visual appearance of a specific table in a given environment, detecting tables and other objects in scenes, mapping them into symbols, and grounding the text mentions into these symbols should generalize better to unseen environments." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-22", "text": "Finally, inspired by the complementary errors of visual and non-visual agents, we decompose the grounding process through a mixture-of-experts approach." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-23", "text": "We train separate visual and non-visual agents, encouraging each one to focus on a separate modality, and combine their predictions as an ensemble (see Fig. 1 )." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-24", "text": "Our mixture-of-experts outperforms the individual agents, and is also better than the ensembles of multiple agents of the same modality (e.g. both visual or both non-visual)." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-25", "text": "Adding our object representation and mixtureof-experts approach to both state-of-the-art models improves their success rate by over 10% (absolute) in novel environments, obtaining a 51.9% success rate on the val-unseen split of the benchmark R2R dataset (Anderson et al., 2018) ." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-26", "text": "----------------------------------" }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-27", "text": "**RELATED WORK**" }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-28", "text": "Vision and Language Navigation." 
}, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-29", "text": "Vision-andLanguage Navigation (VLN) (Anderson et al., 2018; Chen et al., 2019) unites two lines of work: first, of following natural language navigational instructions in an environmental context (MacMahon et al., 2006; Vogel and Jurafsky, 2010; Tellex et al., 2011; Chen and Mooney, 2011; Artzi and Zettlemoyer, 2013; Andreas and Klein, 2015; Mei et al., 2016; Fried et al., 2018a; Misra et al., 2018) , and second, of vision-based navigation tasks (Mirowski et al., 2017; Yang et al., 2019; Mirowski et al., 2018; Cirik et al., 2018 ) that use visually-rich real-world imagery (Chang et al., 2017) ." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-30", "text": "A number of methods for the VLN task have been recently proposed." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-31", "text": "Wang et al. (2018) use model-based and model-free reinforcement learning to learn an environmental model and optimize directly for navigation success." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-32", "text": "Fried et al. (2018b) use a separate instruction generation model to synthesize new instructions as data augmentation during training, and perform pragmatic inference at test time." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-33", "text": "Most recently, Ma et al. (2019) introduce a visual and textual co-attention mechanism and a route progress predictor." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-34", "text": "These approaches have significantly improved performance on the VLN task, when evaluated by metrics such as success rate." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-35", "text": "However, it is unclear where the high performance comes from." 
}, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-36", "text": "In this paper, we find that agents without any visual input can achieve competitive performance, matching or even outperforming their vision-based counterparts under two state-of-theart model models (Fried et al., 2018b; Ma et al., 2019) ." }, { "sent_id": "0d798fcdee6ee5722d6dc5638210c2-C001-37", "text": "We also explore two approaches to make the agents better utilize their visual inputs." } ], "y": { "@BACK@": { "gold_contexts": [ [ "0d798fcdee6ee5722d6dc5638210c2-C001-11" ] ], "cite_sentences": [ "0d798fcdee6ee5722d6dc5638210c2-C001-11" ] }, "@MOT@": { "gold_contexts": [ [ "0d798fcdee6ee5722d6dc5638210c2-C001-36" ] ], "cite_sentences": [ "0d798fcdee6ee5722d6dc5638210c2-C001-36" ] }, "@SIM@": { "gold_contexts": [ [ "0d798fcdee6ee5722d6dc5638210c2-C001-44" ], [ "0d798fcdee6ee5722d6dc5638210c2-C001-50" ], [ "0d798fcdee6ee5722d6dc5638210c2-C001-81" ], [ "0d798fcdee6ee5722d6dc5638210c2-C001-128" ] ], "cite_sentences": [ "0d798fcdee6ee5722d6dc5638210c2-C001-44", "0d798fcdee6ee5722d6dc5638210c2-C001-50", "0d798fcdee6ee5722d6dc5638210c2-C001-81", "0d798fcdee6ee5722d6dc5638210c2-C001-128" ] }, "@USE@": { "gold_contexts": [ [ "0d798fcdee6ee5722d6dc5638210c2-C001-50" ], [ "0d798fcdee6ee5722d6dc5638210c2-C001-81" ] ], "cite_sentences": [ "0d798fcdee6ee5722d6dc5638210c2-C001-50", "0d798fcdee6ee5722d6dc5638210c2-C001-81" ] }, "@DIF@": { "gold_contexts": [ [ "0d798fcdee6ee5722d6dc5638210c2-C001-60" ] ], "cite_sentences": [ "0d798fcdee6ee5722d6dc5638210c2-C001-60" ] }, "@EXT@": { "gold_contexts": [ [ "0d798fcdee6ee5722d6dc5638210c2-C001-60" ] ], "cite_sentences": [ "0d798fcdee6ee5722d6dc5638210c2-C001-60" ] } } }, "ABC_4adcc28c6d1906d74874b8fca371dc_15": { "x": [ { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-84", "text": "Note that we have one single model for both languages." 
}, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-2", "text": "In this paper we propose a model to learn multimodal multilingual representations for matching images and sentences in different languages, with the aim of advancing multilingual versions of image search and image understanding." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-3", "text": "Our model learns a common representation for images and their descriptions in two different languages (which need not be parallel) by considering the image as a pivot between two languages." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-4", "text": "We introduce a new pairwise ranking loss function which can handle both symmetric and asymmetric similarity between the two modalities." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-5", "text": "We evaluate our models on image-description ranking for German and English, and on semantic textual similarity of image descriptions in English." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-6", "text": "In both cases we achieve state-of-the-art performance." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-7", "text": "----------------------------------" }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-9", "text": "In recent years there has been a significant amount of research in language and vision tasks which require the joint modeling of texts and images." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-10", "text": "Examples include text-based image retrieval, image description and visual question answering." 
}, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-11", "text": "An increasing number of large image description datasets has become available (Hodosh et al., 2013; Young et al., 2014; Lin et al., 2014) and various systems have been proposed to handle the image description task as a generation problem (Bernardi et al., 2016; Mao et al., 2015; Vinyals et al., 2015; Fang et al., 2015) ." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-12", "text": "There has also been a great deal of work on sentence-based image search or cross-modal retrieval where the objective is to learn a joint space for images and text (Hodosh et al., 2013; Frome et al., 2013; Kiros et al., 2015; Socher et al., 2014; Donahue et al., 2015) ." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-13", "text": "Previous work on image description generation or learning a joint space for images and text has mostly focused on English due to the availability of English datasets." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-14", "text": "Recently there have been attempts to create image descriptions and models for other languages (Funaki and Nakayama, 2015; Rajendran et al., 2016; Miyazaki and Shimizu, 2016; Li et al., 2016; Hitschler et al., 2016; Yoshikawa et al., 2017) ." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-15", "text": "Most work on learning a joint space for images and their descriptions is based on Canonical Correlation Analysis (CCA) or neural variants of CCA over representations of image and its descriptions (Hodosh et al., 2013; Andrew et al., 2013; Yan and Mikolajczyk, 2015; Gong et al., 2014; ." 
}, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-16", "text": "Besides CCA, a few others learn a visual-semantic or multimodal embedding space of image descriptions and representations by optimizing a ranking cost function (Kiros et al., 2015; Socher et al., 2014; Vendrov et al., 2016) or by aligning image regions (objects) and segments of the description Plummer et al., 2015) in a common space." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-17", "text": "Recently Lin and Parikh (2016) have leveraged visual question answering models to encode images and descriptions into the same space." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-18", "text": "However, all of this work is targeted at monolingual descriptions, i.e., mapping images and descriptions in a single language onto a joint embedding space." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-19", "text": "The idea of pivoting or bridging is not new and language pivoting is well explored for machine translation (Wu and Wang, 2007; Firat et al., 2016) and to learn multilingual multimodal representations (Rajendran et al., 2016; Calixto et al., 2017) ." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-20", "text": "Rajendran et al. (2016) Calixto et al. (2017) proposed a model for creating multilingual multimodal embeddings." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-21", "text": "Our work is different from theirs in that we choose the image as the pivot and use a different similarity function." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-22", "text": "We also propose a single model for learning representations of images and multiple languages, whereas their model is language-specific." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-23", "text": "In this paper, we learn multimodal representations in multiple languages, i.e., our model yields a joint space for images and text in multiple languages using the image as a pivot between languages." 
}, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-24", "text": "We propose a new objective function in a multitask learning setting and jointly optimize the mappings between images and text in two different languages." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-25", "text": "----------------------------------" }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-26", "text": "**DATASET**" }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-27", "text": "We experiment with the Multi30k dataset, a multilingual extension of Flickr30k corpus (Young et al., 2014) consisting of English and German image descriptions ." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-28", "text": "The Multi30K dataset has 29k, 1k and 1k images in the train, validation and test splits respectively, and contains two types of multilingual annotations: (i) a corpus of one English description per image and its translation into German; and (ii) a corpus of five independently collected English and German descriptions per image." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-29", "text": "We use the independently collected English and German descriptions to train our models." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-30", "text": "Note that these descriptions are not translations of each other, i.e., they are not parallel, although they describe the same image." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-31", "text": "----------------------------------" }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-32", "text": "**PROBLEM FORMULATION**" }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-33", "text": "Given an image i and its descriptions c 1 and c 2 in two different languages our aim is to learn a model which maps i, c 1 and c 2 onto same common space R N (where N is the dimensionality of the embedding space) such that the image and its gold-standard descriptions in both languages are mapped close to each other (as shown in Figure 1 )." 
}, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-34", "text": "Our model consists of the embedding functions f i and f c to encode images and descriptions and a scoring function S to compute the similarity between a description-image pair." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-35", "text": "In the following we describe two models: (i) the PIVOT model that uses the image as pivot between the description in both the languages; (ii) the PAR-ALLEL model that further forces the image descriptions in both languages to be closer to each other in the joint space." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-36", "text": "We build two variants of PIVOT and PARALLEL with different similarity functions S to learn the joint space." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-37", "text": "----------------------------------" }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-38", "text": "**MULTILINGUAL MULTIMODAL REPRESENTATION MODELS**" }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-39", "text": "In both PIVOT and PARALLEL we use a deep convolutional neural network architecture (CNN) to represent the image i denoted by f i (i) = W i \u00b7 CNN(i) where W i is a learned weight matrix and CNN(i) is the image vector representation." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-40", "text": "For each language we define a recurrent neural network encoder f c (c k ) = GRU(c k ) with gated recurrent units (GRU) activations to encode the description c k ." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-41", "text": "In PIVOT, we use monolingual corpora from multiple languages of sentences aligned with images to learn the joint space." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-42", "text": "The intuition of this model is that an image is a universal representation across all languages, and if we constrain a sentence representation to be closer to image, sentences in different languages may also come closer." 
}, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-43", "text": "Accordingly we design a loss function as follows:" }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-44", "text": "where k stands for each language." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-45", "text": "This loss function encourages the similarity S(c k , i) between gold-standard description c k and image i to be greater than any other irrelevant description c k by a margin \u03b1." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-46", "text": "A similar loss function is useful for learning multimodal embeddings in a single language (Kiros et al., 2015) ." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-47", "text": "For each minibatch, we obtain invalid descriptions by selecting descriptions of other images except the current image of interest and vice-versa." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-48", "text": "In PARALLEL, in addition to making an image similar to a description, we make multiple descriptions of the same image in different languages similar to each other, based on the assumption that these descriptions, although not parallel, share some commonalities." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-49", "text": "Accordingly we enhance the previous loss function with an additional term:" }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-50", "text": "Note that we are iterating over all pairs of descriptions (c 1 , c 2 ), and maximizing the similarity between descriptions of the same image and at the same time minimizing the similarity between descriptions of different images." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-51", "text": "We learn models using two similarity functions: symmetric and asymmetric." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-52", "text": "For the former we use cosine similarity and for the latter we use the metric of Vendrov et al. 
(2016) which is useful for learning embeddings that maintain an order, e.g., dog and cat are closer to pet than to animal while remaining distinct." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-53", "text": "Such an ordering is shown to be useful in building an effective multimodal space of images and texts." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-54", "text": "An analogy in our setting would be that two descriptions of an image are closer to the image while at the same time preserving the identity of each (which is useful when the sentences describe two different aspects of the image)." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-55", "text": "The similarity metric is defined as:" }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-56", "text": "where a and b are embeddings of image and description." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-57", "text": "We call the symmetric similarity variants of our models PIVOT-SYM and PARALLEL-SYM, and the asymmetric variants PIVOT-ASYM and PARALLEL-ASYM." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-58", "text": "----------------------------------" }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-59", "text": "**EXPERIMENTS AND RESULTS**" }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-60", "text": "We test our model on the tasks of image-description ranking and semantic textual similarity." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-61", "text": "We work with each language separately." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-62", "text": "Since we learn embeddings for images and languages in the same semantic space, our hope is that the training data for each modality or language acts as complementary data for the other modality or language, and thus helps us learn better embeddings."
}, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-63", "text": "Experiment Setup We sampled minibatches of size 64 images and their descriptions, and drew all negative samples from the minibatch." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-64", "text": "We trained using the Adam optimizer with learning rate 0.001, and early stopping on the validation set." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-65", "text": "Following Vendrov et al. (2016) we set the dimensionality of the embedding space and the GRU hidden layer N to 1024 for both English and German." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-66", "text": "We set the dimensionality of the learned word embeddings to 300 for both languages, and the margin \u03b1 to 0.05 and 0.2, respectively, to learn asymmetric and symmetric similarity-based embeddings." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-67", "text": "1 We keep all hyperparameters constant across all models." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-68", "text": "We used the L2 norm to mitigate over-fitting (Kiros et al., 2015) ." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-69", "text": "We tokenize and truecase both English and German descriptions using the Moses Decoder scripts." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-70", "text": "2 To extract image features, we used a convolutional neural network model trained on 1.2M images of 1000 class ILSVRC 2012 object classification dataset, a subset of ImageNet (Russakovsky et al., 2015) ." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-71", "text": "Specifically, we used VGG 19-layer CNN architecture and extracted the activations of the penultimate fully connected layer to obtain features for all images in the dataset (Simonyan and Zisserman, 2015) ." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-72", "text": "We use average features from 10 crops of the re-scaled images." 
}, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-73", "text": "3 Baselines As baselines we use monolingual models, i.e., models trained on each language separately." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-74", "text": "Specifically, we use Visual Semantic Embeddings (VSE) of Kiros et al. (2015) and Order Embeddings (OE) of Vendrov et al. (2016) ." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-75", "text": "We System Text to Image Image to Text R@1 R@5 R@10 Mr R@1 R@5 R@10 Mr VSE (Kiros et al., 2015) 23.3 53.6 65.8 5 31.6 60.4 72.7 3 OE (Vendrov et al., 2016) (Kiros et al., 2015) 20.3 47.2 60.1 6 29.3 58.1 71.8 4 OE (Vendrov et al., 2016) at an outdoor market , a small group of people stoop to buy potatoes from a street vendor , who has his goods laid out on the ground 24 2 2 Table 3 : The rank of the gold-standard image when using each German and English descriptions as a query on models trained using asymmetric similarity." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-76", "text": "use a publicly available implementation to train both VSE and OE." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-77", "text": "4" }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-78", "text": "----------------------------------" }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-79", "text": "**IMAGE-DESCRIPTION RANKING RESULTS**" }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-80", "text": "To evaluate the multimodal multilingual embeddings, we report results on an image-description ranking task." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-81", "text": "Given a query in the form of a description or an image, the task its to retrieve all images or descriptions sorted based on the relevance." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-82", "text": "We use the standard ranking evaluation metrics of recall at position k (R@K, where higher is better) and median rank (Mr, where lower is better) to evaluate our models." 
}, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-83", "text": "We report results for both English and German descriptions." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-85", "text": "In Tables 1 and 2 we present the ranking results of the baseline models of Kiros et al. (2015) and Vendrov et al. (2016) and our proposed PIVOT and PARALLEL models." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-86", "text": "We do not compare our image-description ranking results with Calixto et al. (2017) since they report results on half of validation set of Multi30k whereas our results are on the publicly available test set of Multi30k." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-87", "text": "For English, PIVOT with asymmetric similarity is either competitive or better than monolingual models 4 https://github.com/ivendrov/order-embedding and symmetric similarity, especially in the R@10 category it obtains state-of-the-art." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-88", "text": "For German, both PIVOT and PARALLEL with the asymmetric scoring function outperform monolingual models and symmetric similarity." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-89", "text": "We also observe that the German ranking experiments benefit the most from the multilingual signal." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-90", "text": "A reason for this could be that the German description corpus has many singleton words (more than 50% of the vocabulary) and English description mapping might have helped in learning better semantic embeddings." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-91", "text": "These results suggest that the multilingual signal could be used to learn better multimodal embeddings, irrespective of the language." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-92", "text": "Our results also show that the asymmetric scoring function can help learn better embeddings." 
}, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-93", "text": "In Table 3 we present a few examples where PIVOT-ASYM and PARALLEL-ASYM models performed better on both the languages compared to baseline order embedding model even using descriptions of very different lengths as queries." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-94", "text": "----------------------------------" }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-95", "text": "**SEMANTIC TEXTUAL SIMILARITY RESULTS**" }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-96", "text": "In the semantic textual similarity task (STS), we use the textual embeddings from our model to compute the similarity between a pair of sen- (Wieting et al., 2017) \u2212 83.7 84.5 85.0 MLMME (Calixto et al., 2017) VGG19 \u2212 72.7 79.7 VSE (Kiros et al., 2015) VGG19 80.6 82.7 89.6 OE (Vendrov et al., 2016) VGG19 82." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-97", "text": "tences (image descriptions in this case)." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-98", "text": "We evaluate on video task from STS-2012 and image tasks from STS-2014 , STS-2015 (Agirre et al. 2012 , Agirre et al. 2014 , Agirre et al. 2015 ." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-99", "text": "The video descriptions in the STS-2012 task are from the MSR video description corpus (Chen and Dolan, 2011) and the image descriptions in STS-2014 and 2015 are from UIUC PASCAL dataset (Rashtchian et al., 2010) ." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-100", "text": "In Table 4 , we present the Pearson correlation coefficients of our model predicted scores with the gold-standard similarity scores provided as part of the STS image/video description tasks." 
}, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-101", "text": "We compare with the best reported scores for the STS shared tasks, achieved by MLMME (Calixto et al., 2017) , paraphrastic sentence embeddings (Wieting et al., 2017) , visual semantic embeddings (Kiros et al., 2015) , and order embeddings (Vendrov et al., 2016) ." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-102", "text": "The shared task baseline is computed based on word overlap and is high for both the 2014 and the 2015 dataset, indicating that there is substantial lexical overlap between the STS image description datasets." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-103", "text": "Our models outperform both the baseline system and the best system submitted to the shared task." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-104", "text": "For the 2012 video paraphrase corpus, our multilingual methods performed better than the monolingual methods showing that similarity across paraphrases can be learned using multilingual signals." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-105", "text": "Similarly, Wieting et al. (2017) have reported to learn better paraphrastic sentence embeddings with multilingual signals." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-106", "text": "Overall, we observe that models learned using the asymmetric scoring function outperform the state-of-theart on these datasets, suggesting that multilingual S1 S2 GT Pred Black bird standing on concrete." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-107", "text": "Blue bird standing on green grass." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-108", "text": "Table 5 : Example sentences with gold-standard semantic textual similarity score and the predicted score using our best performing PARALLEL-ASYM model." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-109", "text": "sharing is beneficial." 
}, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-110", "text": "Although the task has nothing to do German, because our models can make use of datasets from different languages, we were able to train on significantly larger training dataset of approximately 145k descriptions." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-111", "text": "Calixto et al. (2017) also train on a larger dataset like ours, but could not exploit this to their advantage." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-112", "text": "In Table 5 we present the example sentences with the highest and lowest difference between gold-standard and predicted semantic textual similarity scores using our best performing PARALLEL-ASYM model." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-113", "text": "----------------------------------" }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-114", "text": "**CONCLUSIONS**" }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-115", "text": "We proposed a new model that jointly learns multilingual multimodal representations using the image as a pivot between languages." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-116", "text": "We introduced new objective functions that can exploit similarities between images and descriptions across languages." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-117", "text": "We obtained state-of-the-art results on two tasks: image-description ranking and semantic textual similarity." }, { "sent_id": "4adcc28c6d1906d74874b8fca371dc-C001-118", "text": "Our results suggest that exploiting multilingual and multimodal resources can help in learning better semantic representations." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "4adcc28c6d1906d74874b8fca371dc-C001-12" ], [ "4adcc28c6d1906d74874b8fca371dc-C001-15", "4adcc28c6d1906d74874b8fca371dc-C001-16" ], [ "4adcc28c6d1906d74874b8fca371dc-C001-85" ], [ "4adcc28c6d1906d74874b8fca371dc-C001-101" ] ], "cite_sentences": [ "4adcc28c6d1906d74874b8fca371dc-C001-12", "4adcc28c6d1906d74874b8fca371dc-C001-16", "4adcc28c6d1906d74874b8fca371dc-C001-85", "4adcc28c6d1906d74874b8fca371dc-C001-101" ] }, "@MOT@": { "gold_contexts": [ [ "4adcc28c6d1906d74874b8fca371dc-C001-46" ] ], "cite_sentences": [ "4adcc28c6d1906d74874b8fca371dc-C001-46" ] }, "@SIM@": { "gold_contexts": [ [ "4adcc28c6d1906d74874b8fca371dc-C001-68", "4adcc28c6d1906d74874b8fca371dc-C001-74" ], [ "4adcc28c6d1906d74874b8fca371dc-C001-96" ], [ "4adcc28c6d1906d74874b8fca371dc-C001-101" ] ], "cite_sentences": [ "4adcc28c6d1906d74874b8fca371dc-C001-68", "4adcc28c6d1906d74874b8fca371dc-C001-74", "4adcc28c6d1906d74874b8fca371dc-C001-96", "4adcc28c6d1906d74874b8fca371dc-C001-101" ] } } }, "ABC_6597d733f13b06f61cb653f86c4460_15": { "x": [ { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-88", "text": "**PERMUTATION SIGNIFICANCE TEST**" }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-2", "text": "Recent work by Nerbonne and Wiersma (2006) has provided a foundation for measuring syntactic differences between corpora." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-3", "text": "It uses part-of-speech trigrams as an approximation to syntactic structure, comparing the trigrams of two corpora for statistically significant differences." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-4", "text": "This paper extends the method and its application." 
}, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-5", "text": "It extends the method by using leafpath ancestors of Sampson (2000) instead of trigrams, which capture internal syntactic structure-every leaf in a parse tree records the path back to the root." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-6", "text": "The corpus used for testing is the International Corpus of English, Great Britain (Nelson et al., 2002) , which contains syntactically annotated speech of Great Britain." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-7", "text": "The speakers are grouped into geographical regions based on place of birth." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-8", "text": "This is different in both nature and number than previous experiments, which found differences between two groups of Norwegian L2 learners of English." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-9", "text": "We show that dialectal variation in eleven British regions from the ICE-GB is detectable by our algorithm, using both leaf-ancestor paths and trigrams." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-10", "text": "----------------------------------" }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-11", "text": "****" }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-12", "text": "This paper extends the method and its application." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-13", "text": "It extends the method by using leafpath ancestors of Sampson (2000) instead of trigrams, which capture internal syntactic structure-every leaf in a parse tree records the path back to the root." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-14", "text": "The corpus used for testing is the International Corpus of English, Great Britain (Nelson et al., 2002) , which contains syntactically annotated speech of Great Britain." 
}, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-15", "text": "The speakers are grouped into geographical regions based on place of birth." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-16", "text": "This is different in both nature and number than previous experiments, which found differences between two groups of Norwegian L2 learners of English." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-17", "text": "We show that dialectal variation in eleven British regions from the ICE-GB is detectable by our algorithm, using both leaf-ancestor paths and trigrams." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-18", "text": "----------------------------------" }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-19", "text": "**INTRODUCTION**" }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-20", "text": "In the measurement of linguistic distance, older work such as S\u00e9guy (1973) was able to measure distance in most areas of linguistics, such as phonology, morphology, and syntax." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-21", "text": "The features used for comparison were hand-picked based on linguistic knowledge of the area being surveyed." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-22", "text": "These features, while probably lacking in completeness of coverage, certainly allowed a rough comparison of distance in all linguistic domains." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-23", "text": "In contrast, computational methods have focused on a single area of language." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-24", "text": "For example, a method for determining phonetic distance is given by Heeringa (2004) ." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-25", "text": "Heeringa and others have also done related work on phonological distance in Nerbonne and Heeringa (1997) and Gooskens and Heeringa (2004) ." 
}, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-26", "text": "A measure of syntactic distance is the obvious next step: Nerbonne and Wiersma (2006) provide one such method." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-27", "text": "This method approximates internal syntactic structure using vectors of part-of-speech trigrams." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-28", "text": "The trigram types can then be compared for statistically significant differences using a permutation test." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-29", "text": "This study can be extended in a few ways." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-30", "text": "First, the trigram approximation works well, but it does not necessarily capture all the information of syntactic structure such as long-distance movement." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-31", "text": "Second, the experiments did not test data for geographical dialect variation, but compared two generations of Norwegian L2 learners of English, with differences between ages of initial acquisition." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-32", "text": "We address these areas by using the syntactically annotated speech section of the International Corpus of English, Great Britain (ICE-GB) (Nelson et al., 2002) , which provides a corpus with full syntactic annotations, one that can be divided into groups for comparison." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-33", "text": "The sentences of the corpus, being represented as parse trees rather than a vector of POS tags, are converted into a vector of leafancestor paths, which were developed by Sampson (2000) to aid in parser evaluation by providing a way to compare gold-standard trees with parser output trees." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-34", "text": "In this way, each sentence produces its own vec-tor of leaf-ancestor paths." 
}, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-35", "text": "Fortunately, the permutation test used by Nerbonne and Wiersma (2006) is already designed to normalize the effects of differing sentence length when combining POS trigrams into a single vector per region." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-36", "text": "The only change needed is the substitution of leaf-ancestor paths for trigrams." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-86", "text": "There is still no ambiguity between two single leaves and a single node with two leaves because only the second case will receive brackets." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-37", "text": "The speakers in the ICE-GB are divided by place of birth into geographical regions of England based on the nine Government Office Regions, plus Scotland and Wales." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-38", "text": "The average region contains a little over 4,000 sentences and 40,000 words." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-39", "text": "This is less than the size of the Norwegian corpora, and leaf-ancestor paths are more complex than trigrams, meaning that the amount of data required for obtaining significance should increase." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-40", "text": "Testing on smaller corpora should quickly show whether corpus size can be reduced without losing the ability to detect differences." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-41", "text": "Experimental results show that differences can be detected among the larger regions: as should be expected with a method that measures statistical significance, larger corpora allow easier detection of significance." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-42", "text": "The limit seems to be around 250,000 words for leaf-ancestor paths, and 100,000 words for POS trigrams, but more careful tests are needed to verify this." 
}, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-43", "text": "Comparisons to judgments of dialectologists have not yet been made." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-44", "text": "The comparison is difficult because of the difference in methodology and amount of detail in reporting." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-45", "text": "Dialectology tends to collect data from a few informants at each location and to provide a more complex account of relationship than the like/unlike judgments provided by permutation tests." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-46", "text": "----------------------------------" }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-47", "text": "**METHODS**" }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-48", "text": "The methods used to implement the syntactic difference test come from two sources." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-49", "text": "The primary source is the syntactic comparison of Nerbonne and Wiersma (2006) , which uses a permutation test, explained in Good (1995) and in particular for linguistic purposes in Kessler (2001) ." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-50", "text": "Their permutation test collects POS trigrams from a random subcorpus of sentences sampled from the combined corpora." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-51", "text": "The trigram frequencies are normalized to neutralize the effects of sentence length, then compared to the trigram frequencies of the complete corpora." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-52", "text": "The principal difference between the work of Nerbonne and Wiersma (2006) and ours is the use of leaf-ancestor paths." 
}, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-53", "text": "Leaf-ancestor paths were developed by Sampson (2000) for estimating parser performance by providing a measure of similarity of two trees, in particular a gold-standard tree and a machine-parsed tree." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-54", "text": "This distance is not used for our method, since for our purposes, it is enough that leaf-ancestor paths represent syntactic information, such as upper-level tree structure, more explicitly than trigrams." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-55", "text": "The permutation test used by Nerbonne and Wiersma (2006) is independent of the type of item whose frequency is measured, treating the items as atomic symbols." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-56", "text": "Therefore, leaf-ancestor paths should do just as well as trigrams as long as they do not introduce any additional constraints on how they are generated from the corpus." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-57", "text": "Fortunately, this is not the case; Nerbonne and Wiersma (2006) generate N \u2212 2 POS trigrams from each sentence of length N ; we generate N leaf-ancestor paths from each parsed sentence in the corpus." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-58", "text": "Normalization is needed to account for the frequency differences caused by sentence length variation; it is presented below." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-87", "text": "----------------------------------" }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-59", "text": "Since the same number (minus two) of trigrams and leaf-ancestor paths are generated for each sentence the same normalization can be used for both methods." 
}, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-60", "text": "----------------------------------" }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-61", "text": "**LEAF-ANCESTOR PATHS**" }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-62", "text": "Sampson's leaf-ancestor paths represent syntactic structure by aggregating nodes starting from each leaf and proceeding up to the root-for our experiment, the leaves are parts of speech." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-63", "text": "This maintains constant input from the lexical items of the sentence, while giving the parse tree some weight in the representation." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-64", "text": "For example, the parse tree \u2022 S-NP-Det-The" }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-65", "text": "There is one path for each word, and the root appears in all four." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-66", "text": "However, there can be ambiguities if some node happens to have identical siblings." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-67", "text": "Sampson gives the example of the two trees" }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-68", "text": "which would both produce" }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-69", "text": "There is no way to tell from the paths which leaves belong to which B node in the first tree, and there is no way to tell the paths of the two trees apart despite their different structure." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-70", "text": "To avoid this ambiguity, Sampson uses a bracketing system; brackets are inserted at appropriate points to produce" }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-71", "text": "Left and right brackets are inserted: at most one in every path." 
}, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-72", "text": "A left bracket is inserted in a path containing a leaf that is a leftmost sibling and a right bracket is inserted in a path containing a leaf that is a rightmost sibling." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-73", "text": "The bracket is inserted at the highest node for which the leaf is leftmost or rightmost." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-74", "text": "It is a good exercise to derive the bracketing of the previous two trees in detail." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-75", "text": "In the first tree, with two B siblings, the first path is A-B-p." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-76", "text": "Since p is a leftmost child, a left bracket must be inserted, at the root in this case." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-77", "text": "The resulting path is [A-B-p." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-78", "text": "The next leaf, q, is rightmost, so a right bracket must be inserted." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-79", "text": "The highest node for which it is rightmost is B, because the rightmost leaf of A is s. The resulting path is A-B]-q." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-80", "text": "Contrast this with the path for q in the second tree; here q is not rightmost, so no bracket is inserted and the resulting path is A-B-q." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-81", "text": "r is in almost the same position as q, but reversed: it is the leftmost, and the right B is the highest node for which it is the leftmost, producing A-[B-r." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-82", "text": "Finally, since s is the rightmost leaf of the entire sentence, the right bracket appears after A: A]-B-s." 
}, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-83", "text": "At this point, the alert reader will have noticed that both a left bracket and right bracket can be inserted for a leaf with no siblings since it is both leftmost and rightmost." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-84", "text": "That is, a path with two brackets on the same node could be produced: A-[B]-c." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-85", "text": "Because of this redundancy, single children are excluded by the bracket markup algorithm." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-89", "text": "With the paths of each sentence generated from the corpus, then sorted by type into vectors, we now try to determine whether the paths of one region occur in significantly different numbers from the paths of another region." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-90", "text": "To do this, we calculate some measure to characterize the difference between two vectors as a single number." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-91", "text": "Kessler (2001) creates a simple measure called the RECURRENCE metric (R hereafter), which is simply the sum of absolute differences of all path token counts c ai from the first corpus A and c bi from the second corpus B." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-92", "text": "However, to find out if the value of R is significant, we must use a permutation test with a Monte Carlo technique described by Good (1995) , following closely the same usage by Nerbonne and Wiersma (2006) ." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-93", "text": "The intuition behind the technique is to compare the R of the two corpora with the R of two random subsets of the combined corpora." 
}, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-94", "text": "If the random subsets' Rs are greater than the R of the two actual corpora more than p percent of the time, then we can reject the null hypothesis that the two were are actually drawn from the same corpus: that is, we can assume that the two corpora are different." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-95", "text": "However, before the R values can be compared, the path counts in the random subsets must be normalized since not all paths will occur in every subset, and average sentence length will differ, causing relative path frequency to vary." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-96", "text": "There are two normalizations that must occur: normalization with respect to sentence length, and normalization with respect to other paths within a subset." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-97", "text": "The first stage of normalization normalizes the counts for each path within the pair of vectors a and b. The purpose is to neutralize the difference in sentence length, in which longer sentences with more words cause paths to be relatively less frequent." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-98", "text": "Each count is converted to a frequency f f = c N where c is either c ai or c bi from above and N is the length of the containing vector a or b. This produces two frequencies, f ai and f bi .Then the frequency is scaled back up to a redistributed count by the equation \u2200j \u2208 a, b :" }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-99", "text": "This will redistribute the total of a pair from a and b based on their relative frequencies." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-100", "text": "In other words, the total of each path type c ai + c bi will remain the same, but the values of c ai and c bi will be balanced by their frequency within their respective vectors." 
}, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-101", "text": "For example, assume that the two corpora have 10 sentences each, with a corpus a with only 40 words and another, b, with 100 words." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-102", "text": "This results in N a = 40 and N b = 100." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-103", "text": "Assume also that there is a path i that occurs in both: c ai = 8 in a and c bi = 10 in b. This means that the relative frequencies are f ai = 8/40 = 0.2 and f bi = 10/100 = 0.1." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-104", "text": "The first normalization will redistribute the total count (18) Now that 8 has been scaled to 12 and 10 to 6, the effect of sentence length has been neutralized." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-105", "text": "This reflects the intuition that something that occurs 8 of 40 times is more important than something that occurs 10 of 100 times." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-106", "text": "The second normalization normalizes all values in both permutations with respect to each other." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-107", "text": "This is simple: find the average number of times each path appears, then divide each scaled count by it." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-108", "text": "This produces numbers whose average is 1.0 and whose values are multiples of the amount that they are greater than the average." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-109", "text": "The average path count is N/2n, where N is the number of path tokens in both the permutations and n is the number of path types." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-110", "text": "Division by two is necessary since we are multiplying counts from a single permutation by token counts from both permutations." 
}, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-111", "text": "Each type entry in the vector now becomes \u2200j \u2208 a, b : s ji = 2nc ji N Starting from the previous example, this second normalization first finds the average." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-112", "text": "Therefore, the average path type has 140/2(35) = 2 tokens in a and b respectively." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-113", "text": "Dividing c ai and c bi by this average gives s ai = 6 and s bi = 3." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-114", "text": "In other words, s ai has 6 times more tokens than the average path type." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-115", "text": "----------------------------------" }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-116", "text": "**EXPERIMENT AND RESULTS**" }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-117", "text": "The experiment was run on the syntactically annotated part of the International Corpus of English, Great Britain corpus (ICE-GB)." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-118", "text": "The syntactic annotation labels terminals with one of twenty parts of speech and internal nodes with a category and a function marker." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-119", "text": "Therefore, the leaf-ancestor paths each started at the root of the sentence and ended with a part of speech." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-120", "text": "For comparison to the experiment conducted by Nerbonne and Wiersma (2006) , the experiment was also run with POS trigrams." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-121", "text": "Finally, a control experiment was conducted by comparing two permutations from the same corpus and ensuring that they were not significantly different." 
}, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-122", "text": "ICE-GB reports the place of birth of each speaker, which is the best available approximation to which dialect a speaker uses." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-123", "text": "As a simple, objective partitioning, the speakers were divided into 11 geographical regions based on the 9 Government Office Regions of England with Wales and Scotland added as single regions." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-124", "text": "Some speakers had to be thrown out at this point because they lacked brithplace information or were born outside the UK." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-125", "text": "Each region varied in size; however, the average number of sentences per corpus was 4682, with an average of 44,726 words per corpus (see table 1)." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-126", "text": "Thus, the average sentence length was 9.55 words." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-127", "text": "The average corpus was smaller than the Norwegian L2 English corpora of Nerbonne and Wiersma (2006) , which had two groups, one with 221,000 words and the other with 84,000." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-128", "text": "Significant differences (at p < 0.05) were found when comparing the largest regions, but no significant differences were found when comparing small regions to other small regions." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-129", "text": "The significant differences found are given in table 2 and 3." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-130", "text": "It seems that summed corpus size must reach a certain threshold before differences can be observed reliably: about 250,000 words for leaf-ancestor paths and 100,000 for trigrams." 
}, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-131", "text": "There are exceptions in both directions; the total size of London compared to Wales is larger than the size of London compared to the East Midlands, but the former is not statistically different." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-132", "text": "On the other hand, the total size of Southeast England compared to Scotland is only half of the other significantly different comparisons; this difference may be a result of more extreme syntactic differences than the other areas." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-133", "text": "Finally, it is interesting to note that the summed Norwegian corpus size is around 305,000 words, which is about three times the size needed for significance as estimated from the ICE-GB data." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-134", "text": "----------------------------------" }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-135", "text": "**DISCUSSION**" }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-136", "text": "Our work extends that of Nerbonne and Wiersma (2006) in a number of ways." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-137", "text": "We have shown that an alternate method of representing syntax still allows the permutation test to find significant differences between corpora." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-138", "text": "In addition, we have shown differences between corpora divided by geographical area rather than language proficiency, with many more corpora than before." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-139", "text": "Finally, we have shown that the size of the corpus can be reduced somewhat and still obtain significant results." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-140", "text": "Furthermore, we also have shown that both leafancestor paths and POS trigrams give similar results, although the more complex paths require more data." 
}, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-141", "text": "However, there are a number of directions that this experiment should be extended." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-142", "text": "A comparison that divides the speakers into traditional British dialect areas is needed to see if the same differences can be detected." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-143", "text": "This is very likely, because corpus divisions that better reflect reality have a better chance of achieving a significant difference." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-144", "text": "In fact, even though leaf-ancestor paths should provide finer distinctions than trigrams and thus require more data for detectable significance, the regional corpora presented here were smaller than the Norwegian speakers' corpora in Nerbonne and Wiersma (2006) by up to a factor of 10." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-145", "text": "This raises the question of a lower limit on corpus size." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-146", "text": "Our experiment suggests that the two corpora must have at least 250,000 words, although we suspect that better divisions will allow smaller corpus sizes." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-147", "text": "While we are reducing corpus size, we might as well compare the increasing numbers of smaller and smaller corpora in an advantageous order." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-148", "text": "It should be possible to cluster corpora by the point at which they fail to achieve a significant difference when split from a larger corpus." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-149", "text": "In this way, regions could be grouped by their detectable boundaries, not a priori distinctions based on geography or existing knowledge of dialect boundaries." 
}, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-150", "text": "Of course this indirect method would not be needed if one had a direct method for clustering speakers, by distance or other measure." }, { "sent_id": "6597d733f13b06f61cb653f86c4460-C001-151", "text": "Development of such a method is worthwhile research for the future." } ], "y": { "@BACK@": { "gold_contexts": [ [ "6597d733f13b06f61cb653f86c4460-C001-2" ], [ "6597d733f13b06f61cb653f86c4460-C001-25", "6597d733f13b06f61cb653f86c4460-C001-26" ], [ "6597d733f13b06f61cb653f86c4460-C001-55" ] ], "cite_sentences": [ "6597d733f13b06f61cb653f86c4460-C001-2", "6597d733f13b06f61cb653f86c4460-C001-26", "6597d733f13b06f61cb653f86c4460-C001-55" ] }, "@MOT@": { "gold_contexts": [ [ "6597d733f13b06f61cb653f86c4460-C001-25", "6597d733f13b06f61cb653f86c4460-C001-26" ], [ "6597d733f13b06f61cb653f86c4460-C001-35" ] ], "cite_sentences": [ "6597d733f13b06f61cb653f86c4460-C001-26", "6597d733f13b06f61cb653f86c4460-C001-35" ] }, "@SIM@": { "gold_contexts": [ [ "6597d733f13b06f61cb653f86c4460-C001-49" ], [ "6597d733f13b06f61cb653f86c4460-C001-120", "6597d733f13b06f61cb653f86c4460-C001-92", "6597d733f13b06f61cb653f86c4460-C001-93" ] ], "cite_sentences": [ "6597d733f13b06f61cb653f86c4460-C001-49", "6597d733f13b06f61cb653f86c4460-C001-120", "6597d733f13b06f61cb653f86c4460-C001-92" ] }, "@USE@": { "gold_contexts": [ [ "6597d733f13b06f61cb653f86c4460-C001-49" ], [ "6597d733f13b06f61cb653f86c4460-C001-120", "6597d733f13b06f61cb653f86c4460-C001-92", "6597d733f13b06f61cb653f86c4460-C001-93" ] ], "cite_sentences": [ "6597d733f13b06f61cb653f86c4460-C001-49", "6597d733f13b06f61cb653f86c4460-C001-120", "6597d733f13b06f61cb653f86c4460-C001-92" ] }, "@DIF@": { "gold_contexts": [ [ "6597d733f13b06f61cb653f86c4460-C001-52" ], [ "6597d733f13b06f61cb653f86c4460-C001-57" ], [ "6597d733f13b06f61cb653f86c4460-C001-127" ], [ "6597d733f13b06f61cb653f86c4460-C001-144" ] ], "cite_sentences": [ "6597d733f13b06f61cb653f86c4460-C001-52", 
"6597d733f13b06f61cb653f86c4460-C001-57", "6597d733f13b06f61cb653f86c4460-C001-127", "6597d733f13b06f61cb653f86c4460-C001-144" ] }, "@EXT@": { "gold_contexts": [ [ "6597d733f13b06f61cb653f86c4460-C001-136" ] ], "cite_sentences": [ "6597d733f13b06f61cb653f86c4460-C001-136" ] } } }, "ABC_a7f4154081f4045390e662c6e6f3ac_15": { "x": [ { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-2", "text": "A key challenge in entity linking is making effective use of contextual information to disambiguate mentions that might refer to different entities in different contexts." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-3", "text": "We present a model that uses convolutional neural networks to capture semantic correspondence between a mention's context and a proposed target entity." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-4", "text": "These convolutional networks operate at multiple granularities to exploit various kinds of topic information, and their rich parameterization gives them the capacity to learn which n-grams characterize different topics." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-5", "text": "We combine these networks with a sparse linear model to achieve state-of-the-art performance on multiple entity linking datasets, outperforming the prior systems of Durrett and Klein (2014) and Nguyen et al. (2014) ." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-6", "text": "----------------------------------" }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-7", "text": "**1**" }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-8", "text": "----------------------------------" }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-10", "text": "One of the major challenges of entity linking is resolving contextually polysemous mentions." 
}, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-11", "text": "For example, Germany may refer to a nation, to that nation's government, or even to a soccer team." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-12", "text": "Past approaches to such cases have often focused on collective entity linking: nearby mentions in a document might be expected to link to topically-similar entities, which can give us clues about the identity of the mention currently being resolved (Ratinov et al., 2011; Hoffart et al., 2011; He et al., 2013; Cheng and Roth, 2013; Durrett and Klein, 2014) ." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-13", "text": "But an even simpler approach is to use context information from just the words in the source document itself to make sure the entity is being resolved sensibly in context." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-14", "text": "In past work, these approaches have typically relied on heuristics such as tf-idf (Ratinov et 1 Source available at github.com/matthewfl/nlp-entity-convnet al., 2011), but such heuristics are hard to calibrate and they capture structure in a coarser way than learning-based methods." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-15", "text": "In this work, we model semantic similarity between a mention's source document context and its potential entity targets using convolutional neural networks (CNNs)." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-16", "text": "CNNs have been shown to be effective for sentence classification tasks (Kalchbrenner et al., 2014; Kim, 2014; Iyyer et al., 2015) and for capturing similarity in models for entity linking (Sun et al., 2015) and other related tasks (Dong et al., 2015; Shen et al., 2014) , so we expect them to be effective at isolating the relevant topic semantics for entity linking." 
}, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-17", "text": "We show that convolutions over multiple granularities of the input document are useful for providing different notions of semantic context." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-18", "text": "Finally, we show how to integrate these networks with a preexisting entity linking system (Durrett and Klein, 2014) ." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-19", "text": "Through a combination of these two distinct methods into a single system that leverages their complementary strengths, we achieve state-ofthe-art performance across several datasets." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-20", "text": "----------------------------------" }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-21", "text": "**MODEL**" }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-22", "text": "Our model focuses on two core ideas: first, that topic semantics at different granularities in a document are helpful in determining the genres of entities for entity linking, and second, that CNNs can distill a block of text into a meaningful topic vector." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-23", "text": "Our entity linking model is a log-linear model that places distributions over target entities t given a mention x and its containing source document." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-24", "text": "For now, we take P (t|x) \u221d exp w f C (x, t; \u03b8), where f C produces a vector of features based on CNNs with parameters \u03b8 as discussed in Section 2.1." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-25", "text": "Section 2.2 describes how we combine this simple model with a full-fledged entity linking system." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-26", "text": "As shown in the middle of Figure 1 Figure 1 : Extraction of convolutional vector space features fC (x, te)." 
}, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-27", "text": "Three types of information from the input document and two types of information from the proposed title are fed through convolutional networks to produce vectors, which are systematically compared with cosine similarity to derive real-valued semantic similarity features." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-28", "text": "is a cosine similarity between a topic vector associated with the source document and a topic vector associated with the target entity." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-29", "text": "These vectors are computed by distinct CNNs operating over different subsets of relevant text." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-30", "text": "Figure 1 shows an example of why different kinds of context are important for entity linking." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-31", "text": "In this case, we are considering whether Pink Floyd might link to the article Gavin Floyd on Wikipedia (imagine that Pink Floyd might be a person's nickname)." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-32", "text": "If we look at the source document, we see that the immediate source document context around the mention Pink Floyd is referring to rock groups (Led Zeppelin, Van Halen) and the target entity's Wikipedia page is primarily about sports (baseball starting pitcher)." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-33", "text": "Distilling these texts into succinct topic descriptors and then comparing those helps tell us that this is an improbable entity link pair." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-34", "text": "In this case, the broader source document context actually does not help very much, since it contains other generic last names like Campbell and Savage that might not necessarily indicate the document to be in the music genre." 
}, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-35", "text": "However, in general, the whole document might provide a more robust topic estimate than a small context window does." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-36", "text": "----------------------------------" }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-37", "text": "**CONVOLUTIONAL SEMANTIC SIMILARITY**" }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-38", "text": "Figure 1 shows our method for computing topic vectors and using those to extract features for a potential Wikipedia link." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-39", "text": "For each of three text granularities in the source document (the mention, that mention's immediate context, and the entire document) and two text granularities on the target entity side (title and Wikipedia article text), we produce vector representations with CNNs as follows." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-40", "text": "We first embed each word into a d-dimensional vector space using standard embedding techniques (discussed in Section 3.2), yielding a sequence of vectors w 1 , . . . , w n ." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-41", "text": "We then map those words into a fixed-size vector using a convolutional network parameterized with a filter bank M \u2208 R k\u00d7d ." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-42", "text": "We put the result through a rectified linear unit (ReLU) and combine the results with sum pooling, giving the following formulation:" }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-43", "text": "where w j:j+ is a concatenation of the given word vectors and the max is element-wise." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-44", "text": "2 Each convolution granularity (mention, context, etc.) has a distinct set of filter parameters M g ." 
}, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-45", "text": "This process produces multiple representative topic vectors s ment , s context , and s doc for the source document and t title and t doc for the target entity, as shown in Figure 1 ." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-46", "text": "All pairs of these vectors between the source and the target are then compared using cosine similarity, as shown in the middle of Figure 1 ." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-47", "text": "This yields the vector of features f C (s, t e ) which indicate the different types of similarity; this vector can then be combined with other sparse features and fed into a final logistic regression layer (maintaining end-to-end inference and learning of the filters)." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-48", "text": "When trained with backpropagation, the convolutional networks should learn to map text into vector spaces that are informative about whether the document and entity are related or not." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-49", "text": "----------------------------------" }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-50", "text": "**INTEGRATING WITH A SPARSE MODEL**" }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-51", "text": "The dense model presented in Section 2.1 is effective at capturing semantic topic similarity, but it is most effective when combined with other signals for entity linking." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-52", "text": "An important cue for resolving a mention is the use of link counts from hyperlinks in Wikipedia (Cucerzan, 2007; Milne and Witten, 2008; Ji and Grishman, 2011) , which tell us how often a given mention was linked to each article on Wikipedia." 
}, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-53", "text": "This information can serve as a useful prior, but only if we can leverage it effectively by targeting the most salient part of a mention." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-54", "text": "For example, we may have never observed President Barack Obama as a linked string on Wikipedia, even though we have seen the substring Barack Obama and it unambiguously indicates the correct answer." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-55", "text": "Following Durrett and Klein (2014) , we introduce a latent variable q to capture which subset of a mention (known as a query) we resolve." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-56", "text": "Query generation includes potentially removing stop words, plural suffixes, punctuation, and leading or tailing words." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-57", "text": "This processes generates on average 9 queries for each mention." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-58", "text": "Conveniently, this set of queries also defines the set of candidate entities that we consider linking a mention to: each query generates a set of potential entities based on link counts, whose unions are then taken to give on the possible entity targets for each mention (including the null link)." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-59", "text": "In the example shown in Figure 1 , the query phrases are Pink Floyd and Floyd, which generate Pink Floyd and Gavin Floyd as potential link targets (among other options that might be derived from the Floyd query)." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-60", "text": "Our final model has the form P (t|x) = q P (t, q|x)." 
}, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-61", "text": "We parameterize P (t, q|x) in a loglinear way with three separate components:" }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-84", "text": "Table 2 shows results for two baselines and three variants of our system." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-62", "text": "f Q and f E are both sparse features vectors and are taken from previous work (Durrett and Klein, 2014) ." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-63", "text": "f C is as discussed in Section 2.1." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-64", "text": "Note that f C has its own internal parameters \u03b8 because it relies on CNNs with learned filters; however, we can compute gradients for these parameters with standard backpropagation." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-65", "text": "The whole model is trained to maximize the log likelihood of a labeled training corpus using Adadelta (Zeiler, 2012) ." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-66", "text": "The indicator features f Q and f E are described in more detail in Durrett and Klein (2014) ." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-67", "text": "f Q only impacts which query is selected and not the disambiguation to a title." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-68", "text": "It is designed to roughly capture the basic shape of a query to measure its desirability, indicating whether suffixes were removed and whether the query captures the capitalized subsequence of a mention, as well as standard lexical, POS, and named entity type features." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-69", "text": "f E mostly captures how likely the selected query is to correspond to a given entity based on factors like anchor text counts from Wikipedia, string match with proposed Wikipedia titles, and discretized cosine similarities of tf-idf vectors (Ratinov et al., 2011) ." 
}, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-70", "text": "Adding tf-idf indicators is the only modification we made to the features of Durrett and Klein (2014) ." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-71", "text": "----------------------------------" }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-72", "text": "**EXPERIMENTAL RESULTS**" }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-73", "text": "We performed experiments on 4 different entity linking datasets." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-74", "text": "\u2022 ACE (NIST, 2005; Bentivogli et al., 2010) :" }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-75", "text": "This corpus was used in Fahrni and Strube (2014) and Durrett and Klein (2014) ." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-76", "text": "\u2022 CoNLL-YAGO (Hoffart et al., 2011) : This corpus is based on the CoNLL 2003 dataset; the test set consists of 231 news articles and contains a number of rarer entities." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-77", "text": "\u2022 WP (Heath and Bizer, 2011): This dataset consists of short snippets from Wikipedia." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-78", "text": "\u2022 Table 2 : Performance of the system in this work (Full) compared to two baselines from prior work and two ablations." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-79", "text": "Our results outperform those of Durrett and Klein (2014) and Nguyen et al. (2014) ." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-80", "text": "In general, we also see that the convolutional networks by themselves can outperform the system using only sparse features, and in all cases these stack to give substantial benefit." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-81", "text": "We use standard train-test splits for all datasets except for WP, where no standard split is available." 
}, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-82", "text": "In this case, we randomly sample a test set." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-83", "text": "For all experiments, we use word vectors computed by running word2vec (Mikolov et al., 2013) on all Wikipedia, as described in Section 3.2." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-85", "text": "Our main contribution is the combination of indicator features and CNN features (Full)." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-86", "text": "We see that this system outperforms the results of Durrett and Klein (2014) and the AIDA-LIGHT system of Nguyen et al. (2014) ." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-87", "text": "We can also compare to two ablations: using just the sparse features (a system which is a direct extension of Durrett and Klein (2014) ) or using just the CNNderived features." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-88", "text": "5 Our CNN features generally outperform the sparse features and improve even further when stacked with them." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-89", "text": "This reflects that they capture orthogonal sources of information: for example, the sparse features can capture how frequently the target document was linked to, whereas the CNNs can capture document context in a more nuanced way." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-90", "text": "These CNN features also clearly supersede the sparse features based on tf-idf (taken from (Ratinov et al., 2011)) , showing that indeed that CNNs are better at learning semantic topic similarity than heuristics like tf-idf." 
}, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-91", "text": "Table 3 : Comparison of using only topic information derived from the document and target article, only information derived from the mention itself and the target entity title, and the full set of information (six features, as shown in Figure 1 )." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-92", "text": "Neither the finest nor coarsest convolutional context can give the performance of the complete set." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-93", "text": "Numbers are reported on a development set." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-94", "text": "----------------------------------" }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-95", "text": "**ACE**" }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-96", "text": "In the sparse feature system, the highest weighted features are typically those indicating the frequency that a page was linked to and those indicating specific lexical items in the choice of the latent query variable q. This suggests that the system of Durrett and Klein (2014) has the power to pick the right span of a mention to resolve, but then is left to generally pick the most common link target in Wikipedia, which is not always correct." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-97", "text": "By contrast, the full system has a greater ability to pick less common link targets if the topic indicators distilled from the CNNs indicate that it should do so." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-98", "text": "----------------------------------" }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-99", "text": "**MULTIPLE GRANULARITIES OF CONVOLUTION**" }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-100", "text": "One question we might ask is how much we gain by having multiple convolutions on the source and target side." 
}, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-101", "text": "Table 3 compares our full suite of CNN features, i.e. the six features specified in Figure 1 , with two specific convolutional features in isolation." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-102", "text": "Using convolutions over just the source document (s doc ) and target article text (t doc ) gives a system 6 that performs, in aggregate, comparably to using convolutions over just the mention (s ment ) and the entity title (t title )." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-103", "text": "These represent two extremes of the system: consuming the maximum amount of context, which might give the most robust representation of topic semantics, and consuming the minimum amount of context, which gives the most focused representation of topics semantics (and which more generally might allow the system to directly memorize train-test pairs observed in training)." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-104", "text": "However, neither performs as well as the combination of all CNN features, showing that the different granularities capture complementary destroying missiles ." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-105", "text": "spy planes has died of his wounds him which was more depressing and destroying missiles ." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-106", "text": "spy vittorio sacerdoti has told his a trip and you see by U.N. weapons inspectors ." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-107", "text": "his bail hearing , his \" i can see why inspectors are discovering and destroying bail hearing , his lawyer i think so many americans are discovering and destroying missiles died of his wounds after his life from the depression an attack using chemical weapons from scott peterson 's attorney trip and you see him discovering and destroying missiles ." 
}, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-108", "text": "'s murder trial ." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-109", "text": "she , but dumb liberal could attack munitions or j-dam weapons has told his remarkable tale i can see why he sanctions targeting iraq civilians , murder trial ." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-110", "text": "she asking one passage ." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-111", "text": "you cite its nuclear weapons and missile trial lawyers are driving doctors think so many americans are Table 1 : Five-grams representing the maximal activations from different filters in the convolution over the source document (M doc , producing s doc in Figure 1 )." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-112", "text": "Some filters tend towards singular topics as shown in the first and second columns, which focus on weapons and trials, respectively." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-113", "text": "Others may have a mix of seemingly unrelated topics, as shown in the third column, which does not have a coherent theme." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-114", "text": "However, such filters might represent a superposition of filters for various topics which never cooccur and thus never need to be disambiguated between." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-115", "text": "aspects of the entity linking task." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-116", "text": "----------------------------------" }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-117", "text": "**EMBEDDING VECTORS**" }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-118", "text": "We also explored two different sources of embedding vectors for the convolutions." 
}, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-119", "text": "Table 4 shows that word vectors trained on Wikipedia outperformed Google News word vectors trained on a larger corpus." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-120", "text": "Further investigation revealed that the Google News vectors had much higher out-of-vocabulary rates." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-121", "text": "For learning the vectors, we use the standard word2vec toolkit (Mikolov et al., 2013) with vector length set to 300, window set to 21 (larger windows produce more semantically-focused vectors (Levy and Goldberg, 2014) ), 10 negative samples and 10 iterations through Wikipedia." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-122", "text": "We do not fine-tune word vectors during training of our model, as that was not found to improve performance." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-123", "text": "----------------------------------" }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-124", "text": "**ANALYSIS OF LEARNED CONVOLUTIONS**" }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-125", "text": "One downside of our system compared to its purely indicator-based variant is that its operation is less interpretable." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-126", "text": "However, one way we can inspect the learned system is by examining what causes high activations of the various convolutional filters (rows of the matrices M g from Equation 1)." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-127", "text": "Table 1 shows the n-grams in the ACE dataset leading to maximal activations of three of the filters from M doc ." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-128", "text": "Some filters tend to learn to pick up on n-grams characteristic of a particular topic." 
}, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-129", "text": "In other cases, a single filter might be somewhat inscrutable, as with the third column of Table 1 ." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-130", "text": "There are a few possible explanations for this." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-131", "text": "First, the filter may generally have low activations and therefore have little impact in the final feature computation." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-132", "text": "Second, the extreme points of the filter may not be characteristic of its overall behavior, since the bulk of n-grams will lead to more moderate activations." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-133", "text": "Finally, such a filter may represent the superposition of a few topics that we are unlikely to ever need to disambiguate between; in a particular context, this filter will then play a clear role, but one which is hard to determine from the overall shape of the parameters." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-134", "text": "----------------------------------" }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-135", "text": "**CONCLUSION**" }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-136", "text": "In this work, we investigated using convolutional networks to capture semantic similarity between source documents and potential entity link targets." }, { "sent_id": "a7f4154081f4045390e662c6e6f3ac-C001-137", "text": "Using multiple granularities of convolutions to evaluate the compatibility of a mention in context and several potential link targets gives strong performance on its own; moreover, such features also improve a pre-existing entity linking system based on sparse indicator features, showing that these sources of information are complementary." 
} ], "y": { "@MOT@": { "gold_contexts": [ [ "a7f4154081f4045390e662c6e6f3ac-C001-5" ], [ "a7f4154081f4045390e662c6e6f3ac-C001-17", "a7f4154081f4045390e662c6e6f3ac-C001-18" ], [ "a7f4154081f4045390e662c6e6f3ac-C001-96", "a7f4154081f4045390e662c6e6f3ac-C001-97" ] ], "cite_sentences": [ "a7f4154081f4045390e662c6e6f3ac-C001-5", "a7f4154081f4045390e662c6e6f3ac-C001-18", "a7f4154081f4045390e662c6e6f3ac-C001-96" ] }, "@BACK@": { "gold_contexts": [ [ "a7f4154081f4045390e662c6e6f3ac-C001-12" ], [ "a7f4154081f4045390e662c6e6f3ac-C001-66" ], [ "a7f4154081f4045390e662c6e6f3ac-C001-96", "a7f4154081f4045390e662c6e6f3ac-C001-97" ] ], "cite_sentences": [ "a7f4154081f4045390e662c6e6f3ac-C001-12", "a7f4154081f4045390e662c6e6f3ac-C001-66", "a7f4154081f4045390e662c6e6f3ac-C001-96" ] }, "@EXT@": { "gold_contexts": [ [ "a7f4154081f4045390e662c6e6f3ac-C001-55" ], [ "a7f4154081f4045390e662c6e6f3ac-C001-62" ], [ "a7f4154081f4045390e662c6e6f3ac-C001-70" ], [ "a7f4154081f4045390e662c6e6f3ac-C001-87", "a7f4154081f4045390e662c6e6f3ac-C001-88" ] ], "cite_sentences": [ "a7f4154081f4045390e662c6e6f3ac-C001-55", "a7f4154081f4045390e662c6e6f3ac-C001-62", "a7f4154081f4045390e662c6e6f3ac-C001-70", "a7f4154081f4045390e662c6e6f3ac-C001-87" ] }, "@DIF@": { "gold_contexts": [ [ "a7f4154081f4045390e662c6e6f3ac-C001-55" ], [ "a7f4154081f4045390e662c6e6f3ac-C001-70" ], [ "a7f4154081f4045390e662c6e6f3ac-C001-79", "a7f4154081f4045390e662c6e6f3ac-C001-86" ], [ "a7f4154081f4045390e662c6e6f3ac-C001-87", "a7f4154081f4045390e662c6e6f3ac-C001-88" ] ], "cite_sentences": [ "a7f4154081f4045390e662c6e6f3ac-C001-55", "a7f4154081f4045390e662c6e6f3ac-C001-70", "a7f4154081f4045390e662c6e6f3ac-C001-79", "a7f4154081f4045390e662c6e6f3ac-C001-86", "a7f4154081f4045390e662c6e6f3ac-C001-87" ] }, "@SIM@": { "gold_contexts": [ [ "a7f4154081f4045390e662c6e6f3ac-C001-62" ], [ "a7f4154081f4045390e662c6e6f3ac-C001-75" ] ], "cite_sentences": [ "a7f4154081f4045390e662c6e6f3ac-C001-62", "a7f4154081f4045390e662c6e6f3ac-C001-75" ] }, 
"@USE@": { "gold_contexts": [ [ "a7f4154081f4045390e662c6e6f3ac-C001-75" ] ], "cite_sentences": [ "a7f4154081f4045390e662c6e6f3ac-C001-75" ] } } }, "ABC_57e65909baf823ff00a9a10a64fffd_15": { "x": [ { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-137", "text": "For Waseem (2016) we see that there is no significant difference in the estimated rates at which tweets are classified as racist across groups, although the rates remain low." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-147", "text": "The classifier trained on the data is significantly less likely to classify black-aligned tweets as hate speech, although it is more likely to classify them as offensive." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-138", "text": "Tweets in the black-aligned corpus are classified as containing sexism almost twice as frequently and 1.1 times as frequently classified as containing racism and sexism compared to those in the white-aligned corpus." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-211", "text": "Identifying such content is therefore a highly imbalanced classification problem." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-139", "text": "Moving onto , we find large disparities, with around 5% of tweets in the black-aligned corpus classified as hate speech compared to 2% of those in the white-aligned set." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-2", "text": "Technologies for abusive language detection are being developed and applied with little consideration of their potential biases." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-3", "text": "We examine racial bias in five different sets of Twitter data annotated for hate speech and abusive language." 
}, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-4", "text": "We train classifiers on these datasets and compare the predictions of these classifiers on tweets written in African-American English with those written in Standard American English." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-5", "text": "The results show evidence of systematic racial bias in all datasets, as classifiers trained on them tend to predict that tweets written in African-American English are abusive at substantially higher rates." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-6", "text": "If these abusive language detection systems are used in the field they will therefore have a disproportionate negative impact on African-American social media users." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-140", "text": "Similarly, 17% of black-aligned tweets are predicted to contain offensive language compared to 6.5% of whitealigned tweets." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-7", "text": "Consequently, these systems may discriminate against the groups who are often the targets of the abuse we are trying to detect." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-8", "text": "----------------------------------" }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-10", "text": "Recent work has shown evidence of substantial bias in machine learning systems, which is typically a result of bias in the training data." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-11", "text": "This includes both supervised (Blodgett and O'Connor, 2017; Tatman, 2017; Kiritchenko and Mohammad, 2018; De-Arteaga et al., 2019) and unsupervised natural language processing systems (Bolukbasi et al., 2016; Caliskan et al., 2017; Garg et al., 2018) ." 
}, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-12", "text": "Machine learning models are currently being deployed in the field to detect hate speech and abusive language on social media platforms including Facebook, Instagram, and Youtube." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-13", "text": "The aim of these models is to identify abusive language that directly targets certain individuals or groups, particularly people belonging to protected categories (Waseem et al., 2017) ." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-14", "text": "Bias may reduce the accuracy of these models, and at worst, will mean that the models actively discriminate against the same groups they are designed to protect." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-15", "text": "Our study focuses on racial bias in hate speech and abusive language detection datasets (Waseem, 2016; Waseem and Hovy, 2016; Golbeck et al., 2017; Founta et al., 2018) , all of which use data collected from Twitter." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-16", "text": "We train classifiers using each of the datasets and use a corpus of tweets with demographic information to compare how each classifier performs on tweets written in African-American English (AAE) versus Standard American English (SAE) (Blodgett et al., 2016) ." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-17", "text": "We use bootstrap sampling (Efron and Tibshirani, 1986) to estimate the proportion of tweets in each group that each classifier assigns to each class." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-18", "text": "We find evidence of systematic racial biases across all of the classifiers, with AAE tweets predicted as belonging to negative classes like hate speech or harassment significantly more frequently than SAE tweets." 
}, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-19", "text": "In most cases the bias decreases in magnitude when we condition on particular keywords which may indicate membership in negative classes, yet it still persists." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-20", "text": "We expect that these biases will result in racial discrimination if classifiers trained on any of these datasets are deployed in the field." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-21", "text": "----------------------------------" }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-22", "text": "**RELATED WORKS**" }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-23", "text": "Scholars and practitioners have recently been devoting more attention to bias in machine learning models, particularly as these models are becoming involved in more and more consequential decisions (Athey, 2017) ." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-24", "text": "Bias often derives from the data used to train these models." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-25", "text": "For example, Buolamwini and Gebru (2018) show how facial recognition technologies perform worse for darker-skinned people, particularly darker-skinned women, due to the disproportionate presence of white, male faces in the training data." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-26", "text": "Natural language processing systems also inherit biases from the data they were trained on." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-27", "text": "For example, in unsupervised learning, word embeddings often contain biases (Bolukbasi et al., 2016; Caliskan et al., 2017; Garg et al., 2018) which persist even after attempts to remove them (Gonen and Goldberg, 2019) ." 
}, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-28", "text": "There are many examples of bias in supervised learning contexts: YouTube's captioning models make more errors when transcribing women (Tatman, 2017) , AAE is more likely to be misclassified as non-English by widely used language classifiers (Blodgett and O'Connor, 2017) , numerous gender and racial biases exist in sentiment classification systems (Kiritchenko and Mohammad, 2018) , and errors in both co-reference resolution systems and occupational classification models reflect gendered occupational patterns (Zhao et al., 2018; De-Arteaga et al., 2019) ." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-29", "text": "While hate speech and abusive language detection has become an important area for natural language processing research (Schmidt and Wiegand, 2017; Waseem et al., 2017; Fortuna and Nunes, 2018) , there has been little work addressing the potential for these systems to be biased." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-30", "text": "The danger posed by bias in such systems is, however, particularly acute, since it could result in negative impacts on the same populations the systems are designed to protect." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-31", "text": "For example, if we mistakenly consider speech by a targeted minority group as abusive we might unfairly penalize the victim, but if we fail to identify abuse against them we will be unable to take action against the perpetrator." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-32", "text": "Although no model can perfectly avoid such problems, we should be particularly concerned about the potential for such models to be systematically biased against certain social groups, particularly protected classes." 
}, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-33", "text": "A number of studies have shown that false positive cases of hate speech are associated with the presence of terms related to race, gender, and sexuality (Kwok and Wang, 2013; Burnap and Williams, 2015; ." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-34", "text": "While not directly measuring bias, prior work has explored how annotation schemes and the identity of the annotators (Waseem, 2016 ) might be manipulated to help to avoid bias." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-35", "text": "Dixon et al. (2018) directly measured biases in the Google Perspective API classifier, 1 trained on data from Wikipedia talk comments, finding that it tended to give high toxicity scores to innocuous statements like \"I am a gay man\"." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-36", "text": "They called this \"false positive bias\", caused by the model overgeneralizing from the training data, in this case from examples where \"gay\" was used pejoratively." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-37", "text": "They find that a number of such \"identity terms\" are disproportionately represented in the examples labeled as toxic." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-38", "text": "Park et al. (2018) build upon this study, using templates to study gender differences in performance across two hate speech and abusive language detection datasets." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-39", "text": "They find that classifiers trained on these data tend to perform worse when female identity terms used, indicating gender bias in performance." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-40", "text": "We build upon this work by auditing a series of abusive language and hate speech detection datasets for racial biases." 
}, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-41", "text": "We evaluate how classification models trained on these datasets perform in the field, comparing their predictions for tweets written in language used by whites or African-Americans." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-42", "text": "----------------------------------" }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-43", "text": "**RESEARCH DESIGN**" }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-44", "text": "----------------------------------" }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-45", "text": "**HATE SPEECH AND ABUSIVE LANGUAGE DATASETS**" }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-46", "text": "We focus on Twitter, the most widely used data source in abusive language research." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-47", "text": "We use all available datasets where tweets are labeled as various types of abuse and are written in English." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-48", "text": "We now briefly describe each of these datasets in chronological order." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-49", "text": "Waseem and Hovy (2016) collected 130k tweets containing one of seventeen different terms or phrases they considered to be hateful." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-50", "text": "They then annotated a sample of these tweets themselves, using guidelines inspired by critical race theory." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-51", "text": "These annotators were then reviewed by \"a 25 year old woman studying gender studies and a nonactivist feminist\" to check for bias." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-52", "text": "This dataset consists of 16,849 tweets labeled as either racism, sexism, or neither." 
}, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-53", "text": "Most of the tweets categorized as sexist relate to debates over an Australian TV show and most of those considered as racist are anti-Muslim." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-54", "text": "To account for potential bias in the previous dataset, Waseem (2016) relabeled 2876 tweets in the dataset, along with a new sample from the tweets originally collected." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-55", "text": "The tweets were annotated by \"feminist and anti-racism activists\", based upon the assumption that they are domain-experts." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-56", "text": "A fourth category, racism and sexism was also added to account for the presence of tweets which exhibit both types of abuse." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-57", "text": "The dataset contains 6,909 tweets." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-58", "text": "collected tweets containing terms from the Hatebase, 2 a crowdsourced hate speech lexicon, then had a sample coded by crowdworkers located in the United States." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-59", "text": "To avoid false positives that occurred in prior work which considered all uses of particular terms as hate speech, crowdworkers were instructed not to make their decisions based upon any words or phrases in particular, no matter how offensive, but on the overall tweet and the inferred context." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-60", "text": "The dataset consists of 24,783 tweets annotated as hate speech, offensive language, or neither." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-61", "text": "Golbeck et al. (2017) selected tweets using ten keywords and phrases related to anti-black racism, Islamophobia, homophobia, anti-semitism, and sexism." 
}, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-62", "text": "The authors developed a coding scheme to distinguish between potentially offensive content and serious harassment, such as threats or hate speech." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-63", "text": "After an initial round of coding, where tweets were assigned to a number of different categories, they simplified their analysis to include a binary harassment or non-harassment label for each tweet." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-64", "text": "The dataset consists of 20,360 tweets, each hand-labeled by the authors." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-65", "text": "3 Founta et al. (2018) constructed a dataset intended to better approximate a real-world setting where abuse is relatively rare." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-66", "text": "They began with a random sample of tweets then augmented it by adding tweets containing one or more terms from the Hatebase lexicon and that had negative sentiment." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-67", "text": "They criticized prior work for defining labels in an ad hoc manner." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-68", "text": "To develop a more comprehensive annotation scheme they initially labeled a sample of tweets, allowing each tweet to belong to multiple classes." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-69", "text": "After analyzing the overlap between different classes they settled on a coding scheme with four distinct classes: abusive, hateful, spam, and normal." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-70", "text": "We use a dataset they published containing 91,951 tweets coded into these categories by crowdworkers." 
}, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-71", "text": "----------------------------------" }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-72", "text": "**TRAINING CLASSIFIERS**" }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-73", "text": "For each dataset we train a classifier to predict the class of unseen tweets." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-74", "text": "We use regularized logistic regression with bag-of-words features, a commonly used approach in the field." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-75", "text": "While we expect that we could improve predictive performance by using more sophisticated classifiers, we expect that any bias is likely a function of the training data itself rather than the classifier." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-76", "text": "Moreover, although features like word embeddings can work well for this task (Djuric et al., 2015) we wanted to avoid inducing any bias in our models by using pre-trained embeddings (Park et al., 2018) ." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-77", "text": "We pre-process each tweet by removing excess white-space and replacing URLs and mentions with placeholders." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-78", "text": "We then tokenize them, stem each token, and construct n-grams with a maximum length of three." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-79", "text": "Next we transform each dataset into a TF-IDF matrix, with a maximum of 10,000 features." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-80", "text": "We use 80% of each dataset to train models and hold out the remainder for validation." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-81", "text": "Each model is trained using stratified 5-fold cross-validation." 
}, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-82", "text": "We conduct a grid-search over different regularization strength parameters to identify the best performing model." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-83", "text": "Finally, for each dataset we identify the model with the best average F1 score and retrain it using all of the training data." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-84", "text": "The performance of these models on the 20% held-out validation data is reported in Table 1 ." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-85", "text": "Overall we see varying performance across the classifiers, with some performing much better out-of-sample than others." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-86", "text": "In particular, we see that hate speech and harassment are particularly difficult to detect." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-87", "text": "Since we are primarily interested in within classifier, between corpora performance, any variation between classifiers should not impact our results." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-88", "text": "----------------------------------" }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-89", "text": "**RACE DATASET**" }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-90", "text": "We use a dataset of tweets labeled by race from Blodgett et al. (2016) to measure racial biases in these classifiers." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-91", "text": "They collected geolocated tweets in the U.S. and matched them with demographic data from the Census on the population of non-Hispanic whites, non-Hispanic blacks, Hispanics, and Asians in the block group where the tweets originated." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-92", "text": "They then identified words associated with particular demographics and trained a probabilistic mixed-membership language model." 
}, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-93", "text": "This model learns demographicallyaligned language models for each of the four demographic categories and is used to calculate the posterior proportion of language from each category in each tweet." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-94", "text": "Their validation analyses indicate that tweets with a high posterior proportion of non-Hispanic black language exhibit lexical, phonological, and syntactic variation consistent with prior research on AAE." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-95", "text": "Their publiclyavailable dataset contains 59.2 million tweets." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-96", "text": "We define a user as likely non-Hispanic black if the average posterior proportion across all of their tweets for the non-Hispanic black language model is \u2265 0.80 (and \u2264 0.10 Hispanic and Asian combined) and as non-Hispanic white using the same formula but for the white language model." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-97", "text": "5 This allows us to restrict our analysis to tweets written by users who predominantly use one of the language models." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-98", "text": "Due to space constraints we discard users who predominantly use either the Hispanic or the Asian language model." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-99", "text": "This results in a set of 1.1m tweets written by people who generally use non-Hispanic black language and 14.5m tweets written by users who tend to use non-Hispanic white language." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-100", "text": "Following Blodgett and O'Connor (2017), we call these datasets black-aligned and white-aligned tweets, reflecting 5 We use this threshold following Blodgett and O'Connor (2017) and after consulting with the lead author." 
}, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-101", "text": "While these cut-offs should provide high confidence that the users tend to use AAE or SAE, and hence serve as a proxy for race, it is important to note that not all African-Americans use AAE and that not all AAE users are African-American, although use of the AAE dialect suggests a social proximity to or affinity for African-American communities (Blodgett et al., 2016) the fact that they contain language associated with either demographic category but which may not all be produced by members of these categories." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-102", "text": "We now describe how we use these data in our experiments." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-103", "text": "----------------------------------" }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-104", "text": "**EXPERIMENTS**" }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-105", "text": "We examine whether the probability that a tweet is predicted to belong to a particular class varies in relation to the racial alignment of the language it uses." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-106", "text": "The null hypothesis of no racial bias is that the probability a tweet will belong to a negative class is independent of the racial group the tweet's author is a member of." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-107", "text": "Formally, for class c i , where c i = 1 denotes membership in the class and c i = 0 the opposite, we aim to test H N : P (c i = 1|black) = P (c i = 1|white)." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-108", "text": "If P (c i = 1|black) > P (c i = 1|white) and the difference is statistically significant then we can reject the null hypothesis H N in favor of the alternative hypothesis H A that black-aligned tweets are classified into c i at a higher rate than white-aligned tweets." 
}, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-109", "text": "Conversely, if P (c i = 1|black) < P (c i = 1|white) we can conclude that the classifier is more likely to classify white-aligned tweets as c i ." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-110", "text": "We should expect that white-aligned tweets are more likely to use racist language or hate speech than blackaligned tweets, given that African-Americans are often targeted with racism and hate speech by whites." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-111", "text": "However for some classes like sexism we have no reason to expect there to be racial differences in either direction." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-112", "text": "To test this hypothesis we use bootstrap sampling (Efron and Tibshirani, 1986) to estimate the proportion of tweets in each dataset that each classifier predicts to belong to each class." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-113", "text": "We draw n random samples with replacement of k tweets from each of the two race corpora, where n = k = 1000." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-114", "text": "For each sample we use each classifier to predict the class membership of each tweet, then store the proportion of tweets that were assigned to each class, p i ." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-115", "text": "For each classifier-class pair, we thus obtain a pair of vectors, one for each corpus, each containing n sampled proportions." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-141", "text": "The classifier trained on the Golbeck et al. (2017) dataset predicts black-aligned tweets to be harassment 1.4 times as frequently as white-aligned tweets." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-142", "text": "The Founta et al. 
(2018) classifier labels around 11% of tweets in the blackaligned corpus as hate speech and almost 18% as abusive, compared to 6% and 8% of white-aligned tweets respectively." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-143", "text": "It also classifies black-aligned tweets as spam 1.8 times as frequently." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-144", "text": "The results of Experiment 2 are consistent with the previous results, although there are some notable differences." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-145", "text": "In most cases the racial disparities persist, although they are generally smaller in magnitude and in some cases the direction even changes." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-146", "text": "Table 3 shows that for tweets containing the word \"n*gga\", classifiers trained on Waseem and Hovy (2016) and Waseem (2016) are both predict black-aligned tweets to be instances of sexism approximately 1.5 times as often as white-aligned tweets." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-116", "text": "The bootstrap estimates for the proportion of tweets belonging to class i for each group, p i black and p i white , are calculated by taking the mean of the elements in each vector: dicate that black-aligned tweets are classified as belonging to class i at a higher rate than whitealigned tweets." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-117", "text": "We also conduct a second experiment, where we assess whether there is racial bias conditional upon a tweet containing a keyword likely to be associated with a negative class." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-118", "text": "While differences in language will undoubtedly remain, this should help to account for the possibility that results in Experiment 1 are driven by differences in the true distribution of the different classes of interest, or of words associated with these classes, in the two corpora." 
}, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-119", "text": "For classifier c and category i, we evaluate H N : P (c i = 1|black, t) = P (c i = 1|white, t) for a given term t. We conduct this experiment for two different terms, each of which occurs frequently enough in the data to enable our bootstrapping approach." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-120", "text": "We select the term \"n*gga\", since it is a particularly prevalent source of false positives for hate speech detection (Kwok and Wang, 2013; Waseem et al., 2018) ." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-121", "text": "6 In this case, we expect that tweets containing the word should be classified as more negative when used by whites, thus H A 1 : P (c i = 1|black, t) < P (c i = 1|white, t)." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-122", "text": "The other alternative, H A 2 : P (c i = 1|black, t) > P (c i = 1|white, t) would indicate that blackaligned tweets containing the term are penalized at a higher rate than comparable white-aligned tweets." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-123", "text": "We also assess the results for the word \"b*tch\" since it is a widely used sexist term, which 6 We also planned to conduct the same analysis using the \"-er\" suffix, however the sample was too small, with the word being used in 555 tweets in the white-aligned corpus (0.004%) and 61 in the black-aligned corpus (0.005%)." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-124", "text": "is often also used casually, but we have no theoretical reason to expect there to be racial differences in its usage." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-125", "text": "The term \"n*gga\" was used in around 2.25% of black-aligned and 0.15% of white-aligned tweets." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-126", "text": "The term \"b*tch\" was used in 1.7% of black-aligned and 0.5% of whitealigned tweets." 
}, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-127", "text": "The substantial differences in the distributions for these two terms alone are consistent with our intuition that some of the results in Experiment 1 may be driven by differences in the frequencies of words associated with negative classes in the training datasets." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-128", "text": "Since we are using a subsample of the available data, we use smaller bootstrap samples, drawing k = 100 tweets each time." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-129", "text": "----------------------------------" }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-130", "text": "**RESULTS**" }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-131", "text": "The results of Experiment 1 are shown in Table 2." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-132", "text": "We observe substantial racial disparities in the performance of all classifiers." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-133", "text": "In all but one of the comparisons, there are statistically significant (p < 0.001) differences and in all but one of these we see that tweets in the black-aligned corpus are assigned negative labels more frequently than those by whites." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-134", "text": "The only case where blackaligned tweets are classified into a negative class less frequently than white-aligned tweets is the racism class in the Waseem and Hovy (2016) classifier." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-135", "text": "Note, however, the extremely low rate at which tweets are predicted to belong to this class for both groups." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-136", "text": "On the other hand, this classifier is 1.7 times more likely to classify tweets in the black-aligned corpus as sexist." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-148", "text": "Golbeck et al. 
(2017) classifies black-aligned tweets as harassment at a higher rate for both groups than in the previous experiment, although the disparity is narrower." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-149", "text": "For the Founta et al. (2018) classifier we see that black-aligned tweets are slightly less frequently considered to be hate speech but are much more frequently classified as abusive." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-150", "text": "The results for the second variation of Experiment 2, where we conditioned on the word \"b*tch\", are shown in Table 4." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-151", "text": "We see similar results for Waseem and Hovy (2016) and Waseem (2016)." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-152", "text": "In both cases the classifiers trained upon their data are still more likely to flag black-aligned tweets as sexism." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-153", "text": "The Waseem and Hovy (2016) classifier is particularly sensitive to the word \"b*tch\", with 96% of black-aligned and 94% of white-aligned tweets predicted to belong to this class." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-154", "text": "Almost all of these tweets are classified as offensive; however, those in the black-aligned corpus are 1.15 times as frequently classified as hate speech." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-155", "text": "We see a very similar result for Golbeck et al. (2017) compared to the previous experiment, with black-aligned tweets flagged as harassment at 1.1 times the rate of those in the white-aligned corpus." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-156", "text": "Finally, for the Founta et al. (2018) classifier we see a substantial racial disparity, with black-aligned tweets classified as hate speech at 2.7 times the rate of white-aligned ones, a higher rate than in Experiment 1." 
}, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-157", "text": "----------------------------------" }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-158", "text": "**DISCUSSION**" }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-159", "text": "Our results demonstrate consistent, systematic and substantial racial biases in classifiers trained on all five datasets." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-160", "text": "In almost every case, black-aligned tweets are classified as sexism, hate speech, harassment, and abuse at higher rates than whitealigned tweets." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-161", "text": "To some extent, the results in the first experiment may be driven by underlying differences in the rates at which speakers of different dialects use particular words and phrases associated with these negative classes in the training data." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-162", "text": "For example, the word \"n*gga\" appears fifteen times as frequently in the black-aligned corpus compared to the white-aligned corpus." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-163", "text": "7 However, the second experiment shows that these disparities tend to persist even when comparing tweets containing keywords likely to be associated with negative classes." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-164", "text": "While some of the remaining disparities are likely due to differences in the distributions of other keywords we did not condition on, we expect that other more innocuous aspects of black-aligned language may be associated with negative labels in the training data, leading classifiers to disproportionately predict that tweets by African-Americans belong to negative classes." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-165", "text": "We now discuss the results as they pertain to each of the datasets used." 
}, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-166", "text": "Classifiers trained on data from Waseem and Hovy (2016) and Waseem (2016) only predicted a small fraction of the tweets to be racism." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-167", "text": "We suspect that this is due to the composition of their dataset, since the majority of the racist training examples consist of anti-Muslim rather than anti- Table 4 : Experiment 2, t = \"b*tch\" black language." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-168", "text": "Across both datasets the words \"n*gger\" and \"n*gga\" appear in 4 and 10 tweets respectively." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-169", "text": "Looking at the sexism class on the other hand, we see that both models were consistently classifying tweets in the black-aligned corpus as sexism at a substantially higher rate than those in the white-aligned corpus." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-170", "text": "Given this result, and the gender biases identified in these data by Park et al. (2018), it not apparent that the purportedly expert annotators were any less biased than amateur annotators (Waseem, 2016) ." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-171", "text": "The classifier trained on shows the largest disparities in Experiment 1, with tweets in the black-aligned corpus classified as hate speech and offensive language at substantially higher rates than white-aligned tweets." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-172", "text": "We expect that this result occurred for two reasons." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-173", "text": "First, the dataset contains a large number of cases where AAE is used (Waseem et al., 2018) ." 
}, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-174", "text": "Second, many of the AAE tweets also use words like \"n*gga\" and \"b*tch\", and are thus frequently associated with the hate speech and offensive classes, resulting in \"false positive bias\" (Dixon et al., 2018) ." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-175", "text": "On the other hand, the distinction between hate speech and offensive language appears to hold up to scrutiny: while a large proportion of tweets in Experiment 2 containing the word \"n*gga\" are classified as hate speech, the rate is substantially higher for white-aligned tweets." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-176", "text": "Without this category we expect that many of the tweets classified as offensive would instead be mistakenly classified as hate speech." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-177", "text": "Turning to the Golbeck et al. (2017) classifer we found that tweets in the black-aligned dataset were significantly more likely to be classified as harassment in all experiments, although the disparity decreased substantially after conditioning on certain keywords." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-178", "text": "It seems likely that their simple binary labelling scheme may not be sufficient to capture the variation in language used, resulting in high rates of false positives." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-179", "text": "Finally, Founta et al. (2018) is the largest and perhaps the most comprehensive of the available datasets." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-180", "text": "In Experiment 1 we see that this classifier has the second highest rates of racial disparities, classifying black-aligned tweets as hate speech, abusive, and spam at substantially higher rates than white-aligned tweets." 
}, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-181", "text": "In Experiment 2 the classifier is slightly less likely to classify black-aligned tweets containing the word \"n*gga\" as hate speech but is 2.7 times more likely to predict that black-aligned tweets using \"b*tch\" belong to this category." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-182", "text": "----------------------------------" }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-183", "text": "**CONCLUSION**" }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-184", "text": "Our study is the first to measure racial bias in hate speech and abusive language detection datasets." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-185", "text": "We find evidence of substantial racial bias in all of the datasets tested." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-186", "text": "This bias tends to persist even when comparing tweets containing certain relevant keywords." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-187", "text": "While these datasets are still valuable for academic research, we caution against using them in the field to detect and particularly to take enforcement action against different types of abusive language." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-188", "text": "If they are used in this way we expect that they will systematically penalize African-Americans more than whites, resulting in racial discrimination." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-189", "text": "We have not evaluated these datasets for bias related to other ethnic and racial groups, nor other protected categories like gender and sexuality, but expect that such bias is also likely to exist." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-190", "text": "We recommend that efforts to measure and mitigate bias should start by focusing on how bias enters into datasets as they are collected and labeled." 
}, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-191", "text": "In particular, future work should focus on the following three areas." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-192", "text": "First, we expect that some biases emerge at the point of data collection." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-193", "text": "Some studies sampled tweets using small, ad hoc sets of keywords created by the authors (Waseem and Hovy, 2016; Waseem, 2016; Golbeck et al., 2017) , an approach demonstrated to produce poor results (King et al., 2017) ." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-194", "text": "Others start with large crowdsourced dictionaries of keywords, which tend to include many irrelevant terms, resulting in high rates of false positives Founta et al., 2018) ." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-195", "text": "In both cases, by using keywords to identify relevant tweets we are likely to get non-representative samples of training data that may over-or under-represent certain communities." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-196", "text": "In particular, we need to consider whether the linguistic markers we use to identify potentially abusive language may be associated with language used by members of protected categories." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-197", "text": "For example, although started with thousands of terms from the Hatebase lexicon, AAE is over-represented in the dataset (Waseem et al., 2018) because some keywords associated with this speech community were used more frequently on Twitter than other keywords in the lexicon and were consequentially over-sampled." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-198", "text": "Second, we expect that the people who annotate data have their own biases." 
}, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-199", "text": "Since individual biases in reflect societal prejudices, they aggregate into systematic biases in training data." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-200", "text": "The datasets considered here relied upon a range of different annotators, from the authors (Golbeck et al., 2017; Waseem and Hovy, 2016) and crowdworkers Founta et al., 2018) to activists (Waseem, 2016) ." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-201", "text": "Even the classifier trained on expert-labeled data (Waseem, 2016) flags black-aligned tweets as sexist at almost twice the rate of white-aligned tweets." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-202", "text": "While we agree that there is value in working with domain-experts to annotate data, these results suggest that activists may be prone to similar biases as academics and crowdworkers." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-203", "text": "Further work is therefore necessary to better understand how to integrate expertise into the process and how training can be used to help to mitigate bias." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-204", "text": "We also need to consider how sociocultural context influences annotators' decisions." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-205", "text": "For example, 48% of the workers employed by Founta et al. (2018) were located in Venezuela but the authors did not consider whether this affected their results (or if the annotators understood English sufficiently for the task)." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-206", "text": "Third, we observed substantial variation in the rates of class membership across classifiers and datasets." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-207", "text": "In Experiment 1 the rate at which tweets were assigned to negative classes varied from 1% to 18%." 
}, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-208", "text": "Some of the low proportions may indicate a preponderance of false negatives due to a lack of training data, suggesting that these models may not be able to sufficiently generalize beyond the data they were trained on." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-209", "text": "The high proportions may signal too many false positives, which may a result of the over-sampling of abusive language in labeled datasets." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-210", "text": "Founta et al. (2018) claim that, on average, between 0.1% and 3% of tweets are abusive, depending upon the category of abuse." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-212", "text": "When labeling datasets and evaluating our models we must pay more attention to the baseline rates of usage of different types of abusive language and how they may vary across populations (Silva et al., 2016) ." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-213", "text": "Finally, we need to more carefully consider how contextual factors interact with linguistic subtleties and our definitions of abuse." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-214", "text": "The \"n-word\" is a particularly useful illustration of this issue." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-215", "text": "It exhibits polysemy, as it can be extremely racist or quotidian, depending on the speaker, the context, and the spelling." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-216", "text": "While the history of the word and its usages is too complex to be summarized here (Neal, 2013) , when used with the \"-er\" suffix it is generally considered to be a racist ephiphet, associated with white supremacy." 
}, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-217", "text": "Prior work has confirmed that the use of this variant online is generally considered to be hateful (Kwok and Wang, 2013) , although not always the case, for example when a victim of abuse shares an insult they have received." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-218", "text": "However the variant with the \"-a\" suffix is typically used innocuously by African-Americans (Kwok and Wang, 2013) , indeed our results indicate that it is used far more frequently in black-aligned tweets (although it is still used by many white people)." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-219", "text": "8 Despite this distinction, some studies have considered this variant to be hateful (Silva et al., 2016; Alorainy et al., 2018) ." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-220", "text": "This approach results in high rates of false positive cases of hate speech, thus included a class for offensive language which does not appear to be hateful and let annotators decide which class tweets belonged to based upon their interpretation of the context, many of whom labeled tweets containing the term as offensive." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-221", "text": "Waseem et al. (2018) criticized this decision, claiming that it is problematic to ever consider the word to be offensive due to its widespread use among AAE speakers." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-222", "text": "This critique appears to be reasonable in the sense that we should not penalize African-Americans for using the word, but it avoids grappling with how to act when the word is used by other speakers and in other contexts. What should be done if it is used by a white social media user in reference to a black user? How should the context of their interaction and the nature of their relationship affect our decision?" 
}, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-223", "text": "A \"one-size-fits-all\", context-independent approach to defining and detecting abusive language is clearly inappropriate." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-224", "text": "Different communities have different speech norms, such that a model suitable for one community may discriminate against another." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-225", "text": "However there is no con-sensus in the field on how and if we can develop detection systems sensitive to different social and cultural contexts." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-226", "text": "In addition to our recommendations for improving training data, we emphasize the necessity of considering how context matters and how detection systems will have uneven effects across different communities." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-227", "text": "----------------------------------" }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-228", "text": "**LIMITATIONS**" }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-229", "text": "First, while the Blodgett et al. (2016) dataset is the best available source of tweets labeled as AAE, we do not have ground truth labels for the racial identities of the authors." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-230", "text": "By filtering on users who predominantly used one type of language we may also miss users who may frequently codeswitch between AAE and SAE." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-231", "text": "Second, although we roughly approximate this in Experiment 2, we cannot rule out the possibility that the results, rather than evidence of bias, are a function of different distributions of negative classes in the corpora studied." 
}, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-232", "text": "It is possible that words associated with negative categories in our abusive language datasets are also used to predict race by Blodgett et al. (2016) , potentially contributing to the observed disparities." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-233", "text": "To more thoroughly investigate this issue we therefore require ground truth labels for abuse and race." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-234", "text": "Third, the results may vary for different classifiers or feature sets." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-235", "text": "It is possible that more sophisticated modeling approaches could enable us to alleviate bias, although they could also exacerbate it." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-236", "text": "Fourth, we did not interpret the results of the classifiers to determine why they made particular predictions." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-237", "text": "Further work is needed to identify what features of AAE the classifiers are learning to associate with negative classes." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-238", "text": "Finally, this study has only focused on one dimension of racial bias." }, { "sent_id": "57e65909baf823ff00a9a10a64fffd-C001-239", "text": "Further work is necessary to assess the degree to investigate the extent to which data and models are biased against people belonging to other protected categories." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "57e65909baf823ff00a9a10a64fffd-C001-15" ], [ "57e65909baf823ff00a9a10a64fffd-C001-34" ], [ "57e65909baf823ff00a9a10a64fffd-C001-54" ], [ "57e65909baf823ff00a9a10a64fffd-C001-146" ], [ "57e65909baf823ff00a9a10a64fffd-C001-166", "57e65909baf823ff00a9a10a64fffd-C001-169", "57e65909baf823ff00a9a10a64fffd-C001-170" ], [ "57e65909baf823ff00a9a10a64fffd-C001-193" ] ], "cite_sentences": [ "57e65909baf823ff00a9a10a64fffd-C001-15", "57e65909baf823ff00a9a10a64fffd-C001-34", "57e65909baf823ff00a9a10a64fffd-C001-54", "57e65909baf823ff00a9a10a64fffd-C001-146", "57e65909baf823ff00a9a10a64fffd-C001-166", "57e65909baf823ff00a9a10a64fffd-C001-170", "57e65909baf823ff00a9a10a64fffd-C001-193" ] }, "@SIM@": { "gold_contexts": [ [ "57e65909baf823ff00a9a10a64fffd-C001-137" ], [ "57e65909baf823ff00a9a10a64fffd-C001-151", "57e65909baf823ff00a9a10a64fffd-C001-152" ], [ "57e65909baf823ff00a9a10a64fffd-C001-166", "57e65909baf823ff00a9a10a64fffd-C001-169", "57e65909baf823ff00a9a10a64fffd-C001-170" ] ], "cite_sentences": [ "57e65909baf823ff00a9a10a64fffd-C001-137", "57e65909baf823ff00a9a10a64fffd-C001-151", "57e65909baf823ff00a9a10a64fffd-C001-166", "57e65909baf823ff00a9a10a64fffd-C001-170" ] }, "@USE@": { "gold_contexts": [ [ "57e65909baf823ff00a9a10a64fffd-C001-193" ] ], "cite_sentences": [ "57e65909baf823ff00a9a10a64fffd-C001-193" ] }, "@DIF@": { "gold_contexts": [ [ "57e65909baf823ff00a9a10a64fffd-C001-200", "57e65909baf823ff00a9a10a64fffd-C001-201" ] ], "cite_sentences": [ "57e65909baf823ff00a9a10a64fffd-C001-200", "57e65909baf823ff00a9a10a64fffd-C001-201" ] }, "@EXT@": { "gold_contexts": [ [ "57e65909baf823ff00a9a10a64fffd-C001-200", "57e65909baf823ff00a9a10a64fffd-C001-201" ] ], "cite_sentences": [ "57e65909baf823ff00a9a10a64fffd-C001-200", "57e65909baf823ff00a9a10a64fffd-C001-201" ] } } }, "ABC_c3f6140bd69d1eef0124665e651c0c_15": { "x": [ { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-1", "text": "**ABSTRACT**" }, { "sent_id": 
"c3f6140bd69d1eef0124665e651c0c-C001-2", "text": "In this paper, we experiment with a resource consisting of metaphorically annotated proverbs on the task of word-level metaphor recognition." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-3", "text": "We observe that existing feature sets do not perform well on this data." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-4", "text": "We design a novel set of features to better capture the peculiar nature of proverbs and we demonstrate that these new features are significantly more effective on the metaphorically dense proverb data." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-5", "text": "----------------------------------" }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-7", "text": "Recent years have seen a growing attention towards attempts to understand figurative language in text (Steen et al., 2010 , Shutova and Teufel, 2010 , Turney et al., 2011 , Neuman et al., 2013 , Klebanov et al., 2015 ." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-8", "text": "Recently,\u00d6zbal et al. (2016) published a resource consisting of 1,054 proverbs annotated with metaphors at the word and sentence level, making it possible for the first time to test existing models for metaphor detection on such data." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-9", "text": "More than in other genres, such as news, fiction and essays, in proverbs metaphors can resolve a significant amount of the figurative meaning (Faycel, 2012) ." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-10", "text": "The richness of proverbs in terms of metaphors is very fascinating from a linguistic and cultural point of view." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-11", "text": "Due to this richness, proverbs constitute a challenging benchmark for existing computational models of metaphoricity." 
}, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-12", "text": "In this paper, we devise novel feature sets especially tailored to cope with the peculiarities of proverbs, which are generally short and figuratively rich." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-13", "text": "To the best of our knowledge, this is the first attempt to design a word-level metaphor recognizer specifically tailored to such metaphorically rich data." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-14", "text": "Even though some of the resources that we use (e.g., imageability and concreteness) have been used for this task before, we propose new ways of encoding this information, especially with respect to the density of the feature space and the way that the context of each word is modeled." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-15", "text": "On the proverb data, the novel features result in compact models that significantly outperform existing features designed for word-level metaphor detection in other genres (Klebanov et al., 2014) , such as news and essays." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-16", "text": "By also testing the new features on these other genres, we show that their generalization power is not limited to proverbs." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-17", "text": "----------------------------------" }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-18", "text": "**BACKGROUND**" }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-19", "text": "In this section we provide a brief overview of the efforts of the NLP community to build metaphor datasets and utilize them to develop computational techniques for metaphor processing." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-20", "text": "Steen et al. (2010) construct the Amsterdam Metaphor Corpus (VUAMC) by annotating a subset of BNC Baby 1 ." 
}, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-21", "text": "Linguistic metaphors in VUAMC are annotated by utilizing the Metaphor Annotation Procedure (MIP) proposed by Group (2007) ." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-22", "text": "VUAMC contains 200,000 words in sentences sampled from various genres (news, fiction, academic, and conversations) and 13.6% of the words are annotated as metaphoric (Shutova, 2010) ." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-23", "text": "Another metaphor annotation study following the MIP procedure is conducted by Shutova and Teufel (2010) ." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-24", "text": "A subset of the British National Corpus (BNC) (Burnard, 2000) is annotated to reveal word-level verb metaphors and to determine the conceptual mappings of the metaphorical verbs." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-25", "text": "Turney et al. (2011) introduce an algorithm to classify word-level metaphors expressed by an adjective or a verb based on their concreteness levels in association with the nouns they collocate." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-26", "text": "Similarly, Neuman et al. (2013) extend the concreteness model with a selectional preference approach to detect metaphors formed of concrete concepts." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-27", "text": "They focus on three types of metaphors: i) IS-A, ii) verb-noun, iii) adjective-noun." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-28", "text": "Rather than restricting the identification task to a particular POS or metaphoric structure, Hovy et al. (2013) aim to recognize any word-level metaphors given an unrestricted text, and they create a corpus containing sentences where one target token for each sentence is annotated as metaphorical or literal." 
}, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-29", "text": "They use SVM and CRF models with dependency tree-kernels to capture the anomalies in semantic patterns." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-30", "text": "Klebanov et al. (2014) propose a supervised approach to predict the metaphoricity of all content words in a running text." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-31", "text": "Their model combines unigram, topic model, POS and concreteness features and it is evaluated on VUAMC and a set of essays written for a large-scale assessment of college graduates." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-32", "text": "Following this study, Klebanov et al. (2015) improve their model by re-weighting the training examples and redesigning the concreteness features." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-33", "text": "The experiments in this paper are carried out on PROMETHEUS (\u00d6zbal et al., 2016) , a dataset consisting of 1,054 English proverbs and their equivalents in Italian." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-34", "text": "Proverbs are annotated with wordlevel metaphors, overall metaphoricity, meaning and century of first appearance." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-35", "text": "For our experiments, we only use the word-level annotations on the English data." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-36", "text": "----------------------------------" }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-37", "text": "**WORD-LEVEL METAPHOR DETECTION**" }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-38", "text": "Similarly to Klebanov et al. (2014) , we classify each content word (i.e., adjective, noun, verb or adverb) appearing in a proverb as being used metaphorically or not." 
}, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-39", "text": "Out of 1,054 proverbs in PROMETHEUS, we randomly sample 800 for training, 127 for development and 127 for testing." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-40", "text": "We carry out the development of new features on the development set; then we compare the performance of different feature sets using 10-fold cross validation on the combination of the development and training data." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-41", "text": "Finally, we test the most meaningful configurations on the held-out test data." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-42", "text": "As a baseline, we use a set of features very similar to the one proposed by Klebanov et al. (2014) ." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-43", "text": "To obtain results more easily comparable with Klebanov et al. (2014), we use the same classifier, i.e., logistic regression, in the implementation bundled with the scikit-learn package (Pedregosa et al., 2011) ." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-44", "text": "For all the experiments, we adjust the weight of the examples proportionally to the inverse of the class frequency." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-45", "text": "----------------------------------" }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-46", "text": "**BASELINE FEATURES (B)**" }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-47", "text": "Unigrams (u B ): Klebanov et al. (2014) use all content word forms as features without stemming or lemmatization." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-48", "text": "To reduce sparsity, we consider lemmas along with their POS tag." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-49", "text": "Part-of-speech (p B ): The coarse-grained part-ofspeech (i.e., noun, adjective, verb or adverb) of content words 2 ." 
}, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-50", "text": "Concreteness (c B ): We extract the concreteness features from the resource compiled by Brysbaert et al. (2014) ." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-51", "text": "Similarly to Klebanov et al. (2014) , the mean concreteness ratings, ranging from 1 to 5, are binned in 0.25 increments." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-52", "text": "We also add a binary feature which encodes the information about whether the lemma is found in the resource." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-53", "text": "Topic models (t B ): We use Latent Dirichlet Allocation (LDA) (Blei et al., 2003) using Gibbs sampling for parameter estimation and inference (Griffiths, 2002) ." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-54", "text": "We run LDA on the full British National Corpus (Consortium and others, 2001) to estimate 100 topics, using 2000 Gibbs sampling iterations, and keeping the first 1000 words for each topic." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-55", "text": "As topic model features for a lemma, we use the conditional probability of the topic given the lemma for each of the 100 topics generated by LDA." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-56", "text": "Besides, we use a binary feature that encodes whether the lemma exists in the LDA model." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-57", "text": "----------------------------------" }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-58", "text": "**NOVEL FEATURES (N )**" }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-59", "text": "We introduce five feature sets that capture other aspects of the data which we consider to be meaningful for the peculiar characteristics of proverbs." 
}, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-60", "text": "Imageability (i) and Concreteness (c): Imageability and concreteness of the metaphor constituents were found to be highly effective in metaphor identification by several studies in the literature (Turney et al., 2011 , Broadwell et al., 2013 , Neuman et al., 2013 , Tsvetkov et al., 2014 ." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-61", "text": "We obtain the imageability and concreteness scores of each lemma from the resource constructed by Tsvetkov et al. (2014) , as it accounts for both dimensions." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-62", "text": "The imageability (concreteness) feature set contains the following four features:" }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-63", "text": "\u2022 Has score: A binary feature that indicates whether the lemma exists in the relevant resource." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-64", "text": "\u2022 Score value: The imageability (concreteness) score of the lemma." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-65", "text": "\u2022 Average sentence score: The average imageability (concreteness) score of the other lemmas in the sentence." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-66", "text": "\u2022 Score difference: The difference between Average sentence score and Score value." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-67", "text": "The last two features take the context of the target lemma into account and encode the intuition that metaphorical lemmas often have higher imageability (concreteness) than the rest of the sentence (Broadwell et al., 2013) ." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-68", "text": "Metaphor counts (m): This feature set consists of three features." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-69", "text": "The first two features encode the number of times a lemma-POS pair is used as a metaphor and a non-metaphor in the data." 
}, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-70", "text": "The third feature evaluates to the difference between these counts 3 ." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-71", "text": "Standard domains (d s ) and normalized domains (d n ): These features reflect our intuition that there is a strong prior for some domains to be used as a source for metaphors." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-72", "text": "This notion is backed by the analysis of PROMETHEUS carried out by\u00d6zbal et al. (2016) ." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-73", "text": "We also expect that words which are clearly out of context with respect to the rest of the sentence are more likely to be used as metaphors." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-74", "text": "The correlation between word and sentence domains described below aims to model such phenomenon." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-75", "text": "For each lemma-POS pair, we collect the domain information from WordNet Domains 4 (Magnini et al., 2002 , Bentivogli et al., 2004 for the standard 3 Counts are estimated on training folds." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-76", "text": "To reduce overfitting, lemmas are randomly sampled with a probability of 2/3." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-77", "text": "4 We always select the first sense of the lemma-POS." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-78", "text": "0.6 0.751 0.705 0.724 The same process is repeated for the normalized domains." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-79", "text": "For normalization, we use a reduced set of domains (43 distinct domains) by considering the middle level of the WordNet Domains hierarchy." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-80", "text": "For instance, VOLLEY or BASKETBALL domains are mapped to the SPORT domain." 
}, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-81", "text": "Normalization already proved to be beneficial in tasks such as word sense disambiguation (Gliozzo et al., 2004) ." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-82", "text": "It allows for a good level of abstraction without losing relevant information and it helps to overcome data sparsity." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-83", "text": "The set of normalized domain features (d n ) consists of 46 features (45 binary, 1 real valued)." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-84", "text": "----------------------------------" }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-85", "text": "**DENSE SIGNALS (S):**" }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-86", "text": "This set includes three binary features which summarize the concreteness, imageability and metaphor count feature sets." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-87", "text": "The first (second) feature is set to 1 if the imageability (concreteness) of the lemma is higher than the average imageability (concreteness) of the rest of the sentence." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-88", "text": "The third feature is set to 1 if the lemma was observed more frequently as a metaphor than not, as estimated on training data." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-89", "text": "Table 1 shows the results of the 10-fold cross validation on the English proverb data." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-90", "text": "The value reported in the column labeled C is the optimal inverse of regularization strength, determined via grid-search in the interval [0.1, 1.0] with a step of 0.1." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-91", "text": "Using only baseline features (B) we measure an average F1 score of 0.738." 
}, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-92", "text": "The performance goes up to 0.833 when the novel features are used in isolation (N ) (statistically significant with p < 0.001)." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-93", "text": "We believe that the difference in performance is at least in part due to the sparser B features requiring more data to be able to generalize." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-94", "text": "But most importantly, unlike B, N accounts for the context and the peculiarity of the target word with respect to the rest of the sentence." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-95", "text": "The combination of the two feature sets (B \u222a N ) very slightly improves over N (0.834), but the difference is not significant." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-96", "text": "The second block of rows in Table 1 presents a summary of the ablation tests that we conducted to assess the contribution of the different feature groups." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-97", "text": "Each lowercase letter indicates one of the feature sets introduced in the previous section." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-98", "text": "All configurations reported, except" }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-99", "text": "----------------------------------" }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-100", "text": "**RESULTS**" }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-101", "text": "there is a significant loss of performance with respect to N ." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-102", "text": "The worst performance is observed when all the domain features are removed (i.e., N \\ (d s \u222a d n ))." 
}, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-103", "text": "These results suggest that the prior knowledge about the domain of a word and the frequency of its metaphorical use are indeed strong predictors of a word metaphoricity in context." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-104", "text": "The fact that N \\ d n and N \\ d s do not result in the same loss of performance as" }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-105", "text": "indicates that both d n and d s are adequately expressive to model the figuratively rich proverb data." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-106", "text": "In one case (i.e., N \\ s), the F1 measure is slightly higher than N , even though the difference does not appear to be statistically significant." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-107", "text": "Our intuition is that each of the three binary indicators is a very good predictor of metaphoricity per se, and due to the relatively small size of the data the classifier may tend to over-fit on these features." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-108", "text": "As another configuration, the last row shows the results obtained by replacing our domain features d s and d n with the topic features t from B. With this experiment, we aim to understand the extent to which the two features are interchangeable." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-109", "text": "The results are significantly worse than N , which is a further confirmation of the suitability of the domain features to model the proverbs dataset." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-110", "text": "We then evaluated the best configuration from the cross-fold validation (N \\ s) and the three feature sets B, N and B \u222a N on the held-out test data." 
}, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-111", "text": "The results of this experiment reported in Table 2 are similar to the cross-fold evaluation, and in this case the contribution of N features is even more accentuated." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-112", "text": "Indeed, the absolute F 1 of N and B \u222a N is slightly higher on test data, while the f-measure of B decreases slightly." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-113", "text": "This might be explained by the low-dimensionality of N , which makes it less prone to overfitting the training data." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-114", "text": "On test data, N \\ s is not found to outperform N ." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-115", "text": "Interestingly, N \\ s is the only configuration having higher recall than precision." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-116", "text": "As shown by the feature ablation experi-ments, one of the main reasons for the performance difference between N and B is the ability of the former to model domain information." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-117", "text": "This finding can be further confirmed by inspecting the cases where B misclassifies metaphors that are correctly detected by N ." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-118", "text": "Among these, we can find several examples including words that belong to domains often used as a metaphor source, such as \"grist\" (domain: \"gastronomy\") in \"All is grist that comes to the mill\", or \"horse\" (domain: \"animals\") in \"You can take a horse to the water , but you can't make him drink\"." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-119", "text": "Finally, Table 3 shows the effect of the different feature sets on VUAMC used by Klebanov et al. (2014) ." 
}, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-120", "text": "We use the same 12-fold data split as Klebanov et al. (2014) , and also in this case we perform a grid-search to optimize the meta-parameter C of the logistic regression classifier." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-121", "text": "The best value of C identified for each genre and feature set is shown in the column labeled C. On this data, N features alone are significantly outperformed by B (p < 0.01)." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-122", "text": "On the other hand, for the genres \"academic\" and \"fiction\", combining N and B features improves classification performance over B, and the difference is always statistically significant." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-123", "text": "Besides, the addition of N always leads to more balanced models, by compensating for the relatively lower precision of B. Due to the lack of a separate test set, as in the original setup by Klebanov et al. (2014) , and to the high dimensionality of B's lexicalized features, we cannot rule out over-fitting as an explanation for the relatively good performance of B on this benchmark." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-124", "text": "It should also be noted that the results reported in (Klebanov et al., 2014) are not the same, due to the mentioned differences in the implementation of the features and possibly other differences in the experimental setup (e.g., data filtering, pre-processing and meta-parameter optimization)." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-125", "text": "In particular, our implementation of the B features performs better than reported by Klebanov et al. (2014) on all four genres, namely: 0.52 vs. 0.51 for \"news\", 0.51 vs. 0.28 for \"academic\", 0.39 vs. 0.28 for \"conversation\" and 0.42 vs. 0.33 for \"fiction\"." 
}, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-126", "text": "Even though the evidence is not conclusive, these results suggest that the insights derived from the analysis of PROMETHEUS and captured by the feature set N can also be applied to model word-level metaphor detection across very different genres." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-127", "text": "In particular, we believe that our initial attempt to encode context and domain information for metaphor detection deserves further investigation." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-128", "text": "----------------------------------" }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-129", "text": "**CONCLUSION**" }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-130", "text": "We designed a novel set of features inspired by the analysis of PROMETHEUS, and used it to train and test models for word-level metaphor detection." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-131", "text": "The comparison against a strong set of baseline features demonstrates the effectiveness of the novel features at capturing the metaphoricity of words for proverbs." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-132", "text": "In addition, the novel features show a positive contribution for metaphor detection on \"fiction\" and \"academic\" genres." }, { "sent_id": "c3f6140bd69d1eef0124665e651c0c-C001-133", "text": "The experimental results also highlight the peculiarities of PROMETHEUS, which stands out as an especially dense, metaphorically rich resource for the investigation of the linguistic and computational aspects of figurative language." 
} ], "y": { "@DIF@": { "gold_contexts": [ [ "c3f6140bd69d1eef0124665e651c0c-C001-15" ], [ "c3f6140bd69d1eef0124665e651c0c-C001-47", "c3f6140bd69d1eef0124665e651c0c-C001-48" ], [ "c3f6140bd69d1eef0124665e651c0c-C001-123", "c3f6140bd69d1eef0124665e651c0c-C001-125" ] ], "cite_sentences": [ "c3f6140bd69d1eef0124665e651c0c-C001-15", "c3f6140bd69d1eef0124665e651c0c-C001-47", "c3f6140bd69d1eef0124665e651c0c-C001-123", "c3f6140bd69d1eef0124665e651c0c-C001-125" ] }, "@SIM@": { "gold_contexts": [ [ "c3f6140bd69d1eef0124665e651c0c-C001-38", "c3f6140bd69d1eef0124665e651c0c-C001-42", "c3f6140bd69d1eef0124665e651c0c-C001-43" ], [ "c3f6140bd69d1eef0124665e651c0c-C001-51" ], [ "c3f6140bd69d1eef0124665e651c0c-C001-120" ] ], "cite_sentences": [ "c3f6140bd69d1eef0124665e651c0c-C001-38", "c3f6140bd69d1eef0124665e651c0c-C001-42", "c3f6140bd69d1eef0124665e651c0c-C001-43", "c3f6140bd69d1eef0124665e651c0c-C001-51", "c3f6140bd69d1eef0124665e651c0c-C001-120" ] }, "@USE@": { "gold_contexts": [ [ "c3f6140bd69d1eef0124665e651c0c-C001-38", "c3f6140bd69d1eef0124665e651c0c-C001-42", "c3f6140bd69d1eef0124665e651c0c-C001-43" ], [ "c3f6140bd69d1eef0124665e651c0c-C001-120" ] ], "cite_sentences": [ "c3f6140bd69d1eef0124665e651c0c-C001-38", "c3f6140bd69d1eef0124665e651c0c-C001-42", "c3f6140bd69d1eef0124665e651c0c-C001-43", "c3f6140bd69d1eef0124665e651c0c-C001-120" ] }, "@BACK@": { "gold_contexts": [ [ "c3f6140bd69d1eef0124665e651c0c-C001-119" ] ], "cite_sentences": [ "c3f6140bd69d1eef0124665e651c0c-C001-119" ] }, "@EXT@": { "gold_contexts": [ [ "c3f6140bd69d1eef0124665e651c0c-C001-123", "c3f6140bd69d1eef0124665e651c0c-C001-125" ] ], "cite_sentences": [ "c3f6140bd69d1eef0124665e651c0c-C001-123", "c3f6140bd69d1eef0124665e651c0c-C001-125" ] } } }, "ABC_34b73a56bd9b80dc415ca2c5608596_15": { "x": [ { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-91", "text": "Next, we analyze how bias towards refugees varies by city for talk radio produced between September 1 and 30, 2018." 
}, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-92", "text": "We first train our model to learn a city-specific embedding for each word and then use these embeddings to compute corresponding bias scores, which are depicted in figure 4." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-93", "text": "Qualitatively, cities in the Southeastern US, those closer to the US-Mexico border, and some that have suffered from economic decline in recent years (e.g. Detroit, MI; Youngstown, OH) tend to have talk radio coverage that is more biased towards refugees, though the trends are quite varied." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-94", "text": "Interestingly, there is a weak negative, though marginally insignificant, correlation between the level of bias per city and the number of refugees the city admitted in 2017 7 (r = \u22120.21, p = 0.1)." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-95", "text": "This relationship persists even after controlling for state fixed effects." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-96", "text": "A more thorough analysis with additional cities and other city-level covariates may reveal meaningful patterns and perhaps even help illuminate which geographies are particularly welcoming towards refugees." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-97", "text": "----------------------------------" }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-98", "text": "**CONCLUSION**" }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-99", "text": "In this paper, we present a unified dynamic word embedding model mirroring the earlier work of (Bamman et al., 2014) to learn attribute-specific embeddings." 
}, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-100", "text": "We validated our model by replicating gender and ethnic stereotypes produced in (Garg et al., 2018) by training multiple word embedding models and applied it to a novel corpus of talk radio data to analyze how perceptions of refugees as \"outsiders\" vary by geography and over time." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-101", "text": "Our results illustrate that dynamic word embeddings capture salient shifts in public discourse around specific topics, suggesting their potential usefulness as a tool for obtaining a granular understanding of how the media and members of the public perceive different issues, especially when data is sparse." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-2", "text": "Word embeddings trained on large-scale historical corpora can illuminate human biases and stereotypes that perpetuate social inequalities." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-3", "text": "These embeddings are often trained in separate vector space models defined according to different attributes of interest." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-4", "text": "In this paper, we develop a single, unified dynamic embedding model that learns attribute-specific word embeddings and apply it to a novel dataset-talk radio shows from around the US-to analyze perceptions about refugees." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-5", "text": "We validate our model on a benchmark dataset and apply it to two corpora of talk radio shows averaging 117 million words produced over one month across 83 stations and 64 cities." 
}, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-6", "text": "Our findings suggest that dynamic word embeddings are capable of identifying nuanced differences in public discourse about contentious topics, suggesting their usefulness as a tool for better understanding how the public perceives and engages with different issues across time, geography, and other dimensions." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-7", "text": "----------------------------------" }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-9", "text": "Language has long been described as both a cause and reflection of our psycho-social contexts (Lewis and Lupyan, 2018) ." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-10", "text": "Recent work using word embeddings-low-dimensional vector representations of words trained on large datasets to capture key semantic informationhas demonstrated that language encodes several gender, racial, and other common contemporary biases that correlate with both implicit biases (Caliskan et al., 2017) and macro-scale historical trends (Garg et al., 2018) ." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-11", "text": "These studies have validated the use of word embeddings to measure a range of psychological and social contexts, yet in most cases, they have failed to leverage the full power of available datasets." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-12", "text": "For example, the historical biases presented in (Garg et al., 2018) are computed using decade-specific word embeddings produced by training different Word2Vec (Mikolov et al., 2013 ) models on a large corpus of historical text from that decade." 
}, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-13", "text": "The authors then use a Procrustes alignment to project embeddings from different models into the same vector space so they can be compared across decades (Hamilton et al., 2016) ." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-14", "text": "While this approach is reasonable when there are large-scale datasets available for a given attribute of interest (e.g. decade), it requires an additional optimization step and also disregards valuable training data that could be pooled and leveraged across attribute values to help with both training and regularization." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-15", "text": "This latter property is particularly appealing-and necessary-in the context of limited data." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-16", "text": "In this paper, we use a simple, unified dynamic word embedding model that jointly trains linguistic information alongside any categorical variable of interest-e.g." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-17", "text": "year, geography, income bracket, etc.-that describes the context in which a particular word was used." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-18", "text": "We apply this model to a novel data corpus-talk radio transcripts from stations located in over 64 US cities-to explore the evolution of perceptions about refugees during a onemonth period in late 2018." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-19", "text": "The results from our model suggest the potential to use dynamic word embeddings to obtain a granular, near real-time pulse on how people feel about different issues in the public sphere." 
}, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-20", "text": "----------------------------------" }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-21", "text": "**MODEL**" }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-22", "text": "----------------------------------" }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-23", "text": "**OVERVIEW**" }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-24", "text": "Our dynamic embedding for word w is defined as" }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-25", "text": "where \u03b3 w is an attribute-invariant embedding of w computed across the entire corpus, \u03b2 a w is the off-set for w with respect to attribute a across the set of attributes A we are interested in computing the word embedding with respect to." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-26", "text": "For example, if we wish to compute the embedding for the word \"refugee\" as it was used on the 25th day of a particular 30-day corpus of talk radio transcripts, we would set w = ref ugee and A = {25}. This approach, as formalized in Equation 1 above, is identical to one introduced by (Bamman et al., 2014) , though finer details of our model and training differ slightly, as described below." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-27", "text": "To learn \u03b3 w and \u03b2 a w , we train a neural network." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-28", "text": "Our model is a simple extension to the distributed memory (DM) model for learning paragraph vectors originally introduced in (Le and Mikolov, 2014) ." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-29", "text": "The DM model uses a continuous bag-ofwords architecture to jointly train a paragraph ID with a sequence of words sampled from that paragraph to predict a particular word given the words that surround it." 
}, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-30", "text": "The output of this model includes a semantic vector representation of a) each paragraph, and b) each word in the vocabulary." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-31", "text": "Our model extends the DM model by adding an additional dimension to the paragraph vector to learn specific paragraph-by-word-or, in our context, attribute-by-word-embeddings (i.e., \u03b2 a w )." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-32", "text": "The penultimate layer (before word prediction) is computed as an average of the dynamic embeddings for each context word, i.e., X = 1" }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-33", "text": "where N is the size of our context window." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-34", "text": "This average embedding is then multiplied by the output layer parameters and fed through the final layer for word prediction." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-35", "text": "Figure 1 depicts our model architecture." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-36", "text": "----------------------------------" }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-37", "text": "**IMPLEMENTATION**" }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-38", "text": "We build on an existing PyTorch implementation of paragraph vectors 1 to implement our model, setting the dimensionality of \u03b3 w and \u03b2 a w to be 100." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-39", "text": "We use the Adam optimization algorithm with a batch size of 128, word context window size of 8 (sampling four words to the left and right of a target prediction word), learning rate of 0.001, and L2 penalty to regularize all model parameters." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-40", "text": "We only train embeddings for words that occur at least 10 times in the corpus." 
}, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-41", "text": "For training, we use the negative sampling loss function, described in (Mikolov et al., 2013) to be much more efficient than the hierarchical softmax and yield competi- tive results 2 ." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-42", "text": "We train for 1 to 3 epochs and select the model with the lowest loss." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-43", "text": "----------------------------------" }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-44", "text": "**VALIDATION**" }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-45", "text": "To validate our model, we compare our results to those produced via the decade-by-decade models trained in (Garg et al., 2018) using the Corpus of Historical American English (Davies, 2010) ." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-46", "text": "We use the same metric and word lists as the authors to compute bias scores." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-47", "text": "In particular, we compute linguistic bias scores for two analyses presented in (Garg et al., 2018) : the extent to which female versus male words are semantically similar to occupation-related words, and the extent to which Asian vs. White last names are semantically similar to the same, from 1910 through 1990." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-48", "text": "We then compute correlations between changes in these scores and the actual changes in female and Asian workforce participation rates (relative to men and Whites, respectively) over the same time period." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-49", "text": "Figure 2 depicts these results." 
}, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-50", "text": "The correlation between our scores and changes in workforce participation rates are similar to the correlation between the scores from (Garg et al., 2018) and the same (r = 0.8, p = 0.01 and r = 0.81, p < 0.01, respectively, for gender occupation bias; r = 0.84, p < 0.01 and r = 0.79, p = 0.01, respectively, for Asian/White occupation bias)." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-51", "text": "Qualitative inspection of Figure 2 suggests that our model also produces smoother decade-by-decade scores, suggesting that it not only identifies attribute- (Garg et al., 2018) and our model (blue dotted and green dashed lines, respectively) compared to actual workforce participation rates (solid lines) for gender (top) and Asian/White (bottom) linguistic biases." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-52", "text": "To compare all values on a single yaxis, we standardize both sets of bias scores and workforce participation rates by subtracting the mean and dividing by the standard deviation across decades." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-53", "text": "specific fluctuations in word semantics, but also, may provide a more general, regularized model for learning attribute-conditioned word embeddings." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-54", "text": "Future research should include a comparison of our model's outputs to the outputs of other dynamic word embedding models that treat time as a continuously-valued attribute, e.g. (Bamler and Mandt, 2017; Rudolph and Blei, 2018; Yao et al., 2018) ." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-55", "text": "3 Case study: refugee bias on talk radio" }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-56", "text": "We are interested in applying our dynamic embedding model to better-understand talk radio-show biases towards refugees." 
}, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-57", "text": "Talk radio is a significant source of news for a large fraction of Americans: In 2017, over 90% of Americans over the age of 12 listened to some type of broadcast radio during the course of a given week, with news/talk radio serving as one of the most popular types (Pew, 2018) ." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-58", "text": "With listener call-ins and live dialog, talk radio provides an interesting source of information, commentary, and discussion that distinguishes it from discourse found in both print and social media." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-59", "text": "Given the proliferation of refugees and displaced peoples in recent years (totalling nearly 66 million individuals in 2016 (UNHCR, 2017))-coupled with the rise of talk radio as a particularly popular media channel for conservative political discourse (Mort, 2012 )-analyzing bias towards refugees across talk radio stations may provide a unique window into a large portion of the American population's views on the issue." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-60", "text": "----------------------------------" }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-61", "text": "**DATASET AND ANALYSES**" }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-62", "text": "Our data is sourced from talk radio audio data collected by the media analytics nonprofit Cortico 3 ." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-63", "text": "Audio data is ingested from nearly 170 different radio stations and automatically transcribed to text." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-64", "text": "The data is further processed to identify different speaker turns into \"snippets\"; infer the gender of the speaker; and compute other useful metrics (more details on the radio data pipeline can be found in (Beeferman and Roy, 2018) )." 
}, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-65", "text": "We train our dynamic embedding model on two talk radio datasets sourced from 83 stations located in 64 cities across the US." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-66", "text": "Dataset 1 includes 4.4 million snippets comprised of 114 million words produced by 390 shows between September 1 and 30, 2018." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-67", "text": "Dataset 2 includes over 4.8 million snippets comprised of 119 million total words produced by 433 shows between August 15, and September 15, 2018 4 ." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-68", "text": "These datasets are used for analyses 1 and 2, respectively, described below." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-69", "text": "Finally, we define bias towards refugees similar to how the authors of (Garg et al., 2018) define bias against Asians during the 20th century, measuring to what extent radio shows associate \"outsider\" adjectives like \"aggressive\", \"frightening\", \"illegal\", etc." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-70", "text": "with refugee and immigrant-related terms in comparison to all other adjectives." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-71", "text": "To compute refugee bias scores with respect to the attribute set A, we use the relative norm distance metric from (Garg et al., 2018) :" }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-72", "text": "Where E(r, A) is the dynamic embedding for a given word refugee word r in the set of all refugeerelated words R (e.g. \"refugee\", \"immigrant\", \"asylum\", etc); all is the average dynamic embedding computed for each w in the set of all adjectives with respect to A; out is analogously defined for outsider adjectives; and || \u00b7 || 2 is the L2 norm." 
}, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-73", "text": "Figure 3 : Bias towards refugees as outsiders across talk radio shows from mid-August to mid-September 2018: (a) depicts bias scores computed using a \"non-dynamic model\", i.e., training multiple Word2Vec models (one per day of data) and then projecting these models into the same vector space using orthogonal Procrustes alignment, and (b) depicts bias scores computed using our dynamic model." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-74", "text": "From qualitative inspection, the dynamic model appears to regularize scores across days during which refugee-related news is likely less-salient in public discourse." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-75", "text": "----------------------------------" }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-76", "text": "**ANALYSIS 1: REFUGEE BIAS OVER TIME**" }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-77", "text": "We analyze how refugee biases on talk radio vary by day between August 15 and September 15, 2018." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-78", "text": "We choose this interval to center on the August 31, 2018 news story regarding the Trump administration's contentious decision to pull funding from a UN agency that supports Palestinian refugees 5 ." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-79", "text": "Our attribute of interest is the day in which a particular snippet occurred." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-80", "text": "Figure 3(b) illustrates the temporal variation in bias scores, highlighting a notable shift towards greater bias against refugees in response to the news story." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-81", "text": "Interestingly, bias towards refugees returns to preevent levels very quickly after the spike." 
}, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-82", "text": "Computing the correlation between daily bias scores and 5 For historical coverage of different refugeerelated news events, please see https: //www.nytimes.com/topic/subject/ refugees-and-displaced-people." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-83", "text": "the number of mentions of the keyword \"refugee\" across stations yields r = 0.56, p < 0.001, suggesting that additional discourse about refugees tends to be biased against them." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-84", "text": "As a comparison, we also compute bias scores by training one Word2Vec model per day and projecting all day-by-day models into the same vector space using orthogonal Procrustes alignment 6 similar to (Hamilton et al., 2016) ." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-85", "text": "The resulting scores from this non-dynamic model are depicted in 3(a)." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-86", "text": "From qualitative inspection, the day-byday scores produced by the non-dynamic model appear much less smooth, and hence, fail to show the relative shift in discourse that likely occurred in response to a major refugee-related news event." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-87", "text": "One possible reason for this is that the median number of words for each day in the talk radio corpus is 4 million-over 5x fewer than a median of 22 million words per decade used to train each decade-specific model in (Garg et al., 2018) ." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-88", "text": "These results suggest that using our dynamic embedding approach is particularly valuable when data is sparse for any given attribute." 
}, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-89", "text": "----------------------------------" }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-90", "text": "**ANALYSIS 2: REFUGEE BIAS BY CITY**" }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-102", "text": "Opportunities for future work include a) comparing the results of our model to other existing dynamic embedding models, particularly when the attribute of interest is temporal in nature, b) exploring embeddings defined with respect to other attributes of interest, perhaps in combination with other contextual embedding models like (Peters et al., 2018) , c) exploring alternative definitions of bias towards refugees and other groups, and d) learning a dynamic embedding model for continuous attributes in order to limit the need to impose (perhaps arbitrary) discretizations." }, { "sent_id": "34b73a56bd9b80dc415ca2c5608596-C001-103", "text": "We be-" } ], "y": { "@BACK@": { "gold_contexts": [ [ "34b73a56bd9b80dc415ca2c5608596-C001-10", "34b73a56bd9b80dc415ca2c5608596-C001-12" ] ], "cite_sentences": [ "34b73a56bd9b80dc415ca2c5608596-C001-10", "34b73a56bd9b80dc415ca2c5608596-C001-12" ] }, "@SIM@": { "gold_contexts": [ [ "34b73a56bd9b80dc415ca2c5608596-C001-45", "34b73a56bd9b80dc415ca2c5608596-C001-47" ], [ "34b73a56bd9b80dc415ca2c5608596-C001-50" ], [ "34b73a56bd9b80dc415ca2c5608596-C001-69", "34b73a56bd9b80dc415ca2c5608596-C001-71" ], [ "34b73a56bd9b80dc415ca2c5608596-C001-100" ] ], "cite_sentences": [ "34b73a56bd9b80dc415ca2c5608596-C001-45", "34b73a56bd9b80dc415ca2c5608596-C001-47", "34b73a56bd9b80dc415ca2c5608596-C001-50", "34b73a56bd9b80dc415ca2c5608596-C001-69", "34b73a56bd9b80dc415ca2c5608596-C001-71", "34b73a56bd9b80dc415ca2c5608596-C001-100" ] }, "@USE@": { "gold_contexts": [ [ "34b73a56bd9b80dc415ca2c5608596-C001-45", "34b73a56bd9b80dc415ca2c5608596-C001-47" ], [ "34b73a56bd9b80dc415ca2c5608596-C001-69", "34b73a56bd9b80dc415ca2c5608596-C001-71" ], [ "34b73a56bd9b80dc415ca2c5608596-C001-100" 
] ], "cite_sentences": [ "34b73a56bd9b80dc415ca2c5608596-C001-45", "34b73a56bd9b80dc415ca2c5608596-C001-47", "34b73a56bd9b80dc415ca2c5608596-C001-69", "34b73a56bd9b80dc415ca2c5608596-C001-71", "34b73a56bd9b80dc415ca2c5608596-C001-100" ] }, "@DIF@": { "gold_contexts": [ [ "34b73a56bd9b80dc415ca2c5608596-C001-51" ], [ "34b73a56bd9b80dc415ca2c5608596-C001-86", "34b73a56bd9b80dc415ca2c5608596-C001-87" ] ], "cite_sentences": [ "34b73a56bd9b80dc415ca2c5608596-C001-51", "34b73a56bd9b80dc415ca2c5608596-C001-87" ] } } }, "ABC_93cf4d4fd9cd875e8e148d1bbd8e2c_15": { "x": [ { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-29", "text": "----------------------------------" }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-30", "text": "**RELATED WORK**" }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-9", "text": "It helps to pay attention." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-2", "text": "We conduct large-scale studies on 'human attention' in Visual Question Answering (VQA) to understand where humans choose to look to answer questions about images." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-3", "text": "We design and test multiple game-inspired novel attention-annotation interfaces that require the subject to sharpen regions of a blurred image to answer a question." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-4", "text": "Thus, we introduce the VQA-HAT (Human ATtention) dataset." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-5", "text": "We evaluate attention maps generated by state-of-the-art VQA models against human attention both qualitatively (via visualizations) and quantitatively (via rank-order correlation)." 
}, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-6", "text": "Overall, our experiments show that current attention models in VQA do not seem to be looking at the same regions as humans." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-7", "text": "----------------------------------" }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-10", "text": "Humans have the ability to quickly perceive a scene by selectively attending to parts of the image instead of processing the whole scene in its entirety (Rensink, 2000) ." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-11", "text": "Inspired by human attention, a recent trend in computer vision and deep learning is to build computational models of attention." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-12", "text": "Given an input signal, these models learn to attend to parts of it for further processing and have been successfully applied in machine translation (Bahdanau et al., 2014; Firat et al., 2016) , object recognition Mnih et al., 2014; Sermanet et al., 2014) , image captioning Cho et al., 2015) and visual question answer- * Denotes equal contribution." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-13", "text": "ing (Yang et al., 2015; Lu et al., 2016; Xu and Saenko, 2015; Xiong et al., 2016) ." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-14", "text": "In this work, we study attention for the task of Visual Question Answering (VQA)." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-15", "text": "Unlike image captioning, where a coarse understanding of an image is often sufficient for producing generic descriptions (Devlin et al., 2015) , visual questions selectively target different areas of an image including background details and underlying context." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-16", "text": "This suggests that a VQA model may benefit from an explicit or implicit attention mechanism to answer a question correctly." 
}, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-17", "text": "In this work, we are interested in the following questions: 1) Which image regions do humans choose to look at in order to answer questions about images?" }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-18", "text": "2) Do deep VQA models with attention mechanisms attend to the same regions as humans?" }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-19", "text": "We design and conduct studies to collect \"human attention maps\"." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-20", "text": "maps on the same image for two different questions." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-21", "text": "When asked 'What type is the surface?', humans choose to look at the floor, while attention for 'Which game is being played?' is concentrated around the player and racket." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-22", "text": "These human attention maps can be used both for evaluating machine-generated attention maps and for explicitly training attention-based models." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-23", "text": "Contributions." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-24", "text": "First, we design and test multiple game-inspired novel interfaces for collecting human attention maps of where humans choose to look to answer questions from the large-scale VQA dataset (Antol et al., 2015) ; this VQA-HAT (Human ATtention) dataset will be released publicly." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-25", "text": "Second, we perform qualitative and quantitative comparison of the maps generated by state-of-the-art attention-based VQA models (Yang et al., 2015; Lu et al., 2016 ) and a task-independent saliency baseline (Judd et al., 2009 ) against our human attention maps through visualizations and rank-order correlation." 
}, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-26", "text": "We find that machine-generated attention maps from the most accurate VQA model have a mean rank-correlation of 0.26 with human attention maps, which is worse than task-independent saliency maps that have a mean rank-correlation of 0.49." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-27", "text": "It is well understood that task-independent saliency maps have a 'center bias' (Tatler, 2007; Judd et al., 2009 )." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-28", "text": "After we control for this center bias in our human attention maps, we find that the correlation of task-independent saliency is poor (as expected), while trends for machine-generated VQA-attention maps remain the same (which is promising)." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-31", "text": "Our work draws on recent work in attention-based VQA and human studies in saliency prediction." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-32", "text": "We work with the free-form and open-ended VQA dataset released by (Antol et al., 2015) ." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-33", "text": "VQA Models." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-34", "text": "Attention-based models for VQA typically use convolutional neural networks to highlight relevant regions of image given a question." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-35", "text": "Stacked Attention Networks (SAN) proposed in (Yang et al., 2015) use LSTM encodings of question words to produce a spatial attention distribution over the convolutional layer features of the image." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-36", "text": "Hierarchical Co-Attention Network (Lu et al., 2016) generates multiple levels of image attention based on words, phrases and complete questions, and is the top entry on the VQA Challenge 1 as of the time of this submission." 
}, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-37", "text": "Another interesting approach uses question parsing to compose the neural network from modules, attention being one of the sub-tasks addressed by these modules (Andreas et al., 2016) ." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-38", "text": "Note that all these works are unsupervised attention models, where \"attention\" is simply an intermediate variable (a spatial distribution) that is produced by the model to optimize downstream loss (VQA cross-entropy)." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-39", "text": "The fact that some (it's unclear how many) of these spatial distributions end up being interpretable is simply fortuitous." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-40", "text": "In contrast, we study where humans choose to look to answer visual questions." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-41", "text": "These human attention maps can be used to evaluate unsupervised maps." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-42", "text": "Human Studies." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-43", "text": "There's a rich history of work in collecting eye tracking data from human subjects to gain an understanding of image saliency and visual perception (Jiang et al., 2014; Judd et al., 2009; Fei-Fei et al., 2007; Yarbus, 1967) ." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-44", "text": "Eye tracking data to study natural visual exploration (Jiang et al., 2014; Judd et al., 2009 ) is useful but difficult and expensive to collect on a large scale." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-45", "text": "(Jiang et al., 2015) established mouse tracking as an accurate ap- Figure 3 : Deblurring procedure to collect attention maps." 
}, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-46", "text": "We present subjects with a blurred image and ask them to sharpen regions of the image that will help them answer the question correctly, in a smooth, click-and-drag, 'coloring' motion with the mouse." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-47", "text": "proach to collecting attention maps." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-48", "text": "They collected large-scale attention annotations for MS COCO (Lin et al., 2014) on Amazon Mechanical Turk (AMT)." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-49", "text": "While (Jiang et al., 2015) studies natural exploration and collects task-independent human annotations by asking subjects to freely move the mouse cursor to anywhere they wanted to look on a blurred image, our approach is task-driven." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-50", "text": "Specifically, as described in 3, we collect ground truth attention annotations by instructing subjects to sharpen parts of a blurred image that are important for answering the questions accurately." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-51", "text": "Section 4 covers evaluation of unsupervised attention maps generated by VQA models against our human attention maps." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-52", "text": "----------------------------------" }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-53", "text": "**VQA-HAT (HUMAN ATTENTION) DATASET**" }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-54", "text": "We design and test multiple game-inspired novel interfaces for conducting large-scale human studies on AMT." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-55", "text": "Our basic interface design consists of a \"deblurring\" exercise for answering visual questions." 
}, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-56", "text": "Specifically, we present subjects with a blurred image and a question about the image, and ask subjects to sharpen regions of the image that will help them answer the question correctly, in a smooth, click-and-drag, 'coloring' motion with the mouse." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-57", "text": "The sharpening is gradual: successively scrubbing the same region progressively sharpens it." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-58", "text": "Figure 3 shows intermediate steps in our attention annotation interface, from a completely blurry image to a deblurred attention map." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-59", "text": "----------------------------------" }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-60", "text": "**ATTENTION ANNOTATION INTERFACE**" }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-61", "text": "Our interface starts by showing a low-resolution blurry version of the image." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-62", "text": "This is to convey a partial 'holistic' understanding of the scene to the subjects so they may intelligently choose which regions to sharpen." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-63", "text": "Gradual sharpening with strokes was aimed to capture initial exploration as they tried to get a better sense of the scene, and eventually focussed sharpening to answer the question." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-64", "text": "Next we describe the three variants of our attention annotation interface that we experimented with." 
}, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-65", "text": "----------------------------------" }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-66", "text": "**BLURRED IMAGE WITHOUT ANSWER**" }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-67", "text": "In our first interface, subjects were shown a blurred image and a question without the answer, and were asked to deblur regions and enter the answer." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-68", "text": "We found that this interface sometimes resulted in 'exploratory attention', where the subject lightly sharpens large regions of an image to find salient regions that eventually lead them to the answer." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-69", "text": "However, subjects often ended up with 'incomplete' attention maps since they did not see the high-resolution image and the answer, so they did not know when to stop deblurring or exploring." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-70", "text": "For instance, for an image with 3 players playing a sport, if the question is \"How many players are visible in the image?\", the subject might sharpen a region that seems to have the players, count the 2 players in there and answer 2, and completely miss another region of the image that had 1 more." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-71", "text": "The resulting attention map in this case is incomplete since there are 3 players in the image." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-72", "text": "This effect of incomplete human attention maps was seen in counting (\"How many ...\") and binary (\"Is there ...\") types of questions, and as a result, the answers to these were often incorrect." 
}, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-73", "text": "----------------------------------" }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-74", "text": "**BLURRED IMAGE WITH ANSWER**" }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-75", "text": "In our second interface, subjects were shown the correct answer in addition to the question and blurred image." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-76", "text": "They were asked to sharpen as few regions as possible such that someone can answer the question just by looking at the blurred image with sharpened regions." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-77", "text": "This interface is shown in Figure 4b ." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-78", "text": "Providing the answer fixed the failure cases from the 1st interface, i.e. for counting and binary questions, since the subjects now knew the answer, they continued to explore till they found the answer region in the image." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-79", "text": "----------------------------------" }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-80", "text": "**BLURRED AND ORIGINAL IMAGE WITH ANSWER**" }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-81", "text": "To encourage exploitation instead of exploration, in our third interface, subjects were shown the question-answer pair and full-resolution original image." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-82", "text": "In principle, seeing the original (fullresolution) image, the question, and answer provides most information to subjects, thus enabling them to provide the most 'accurate' attention maps." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-83", "text": "However, this task turns out to be fairly counterintuitive -subjects are shown full-resolution images and the answer, and asked to imagine a scenario where someone else has to answer the question without looking at the original image." 
}, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-84", "text": "Figure 4 shows screen-captures of the 3 attention annotation interfaces." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-85", "text": "----------------------------------" }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-86", "text": "**DATASET EVALUATION**" }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-87", "text": "We ran pilot studies on AMT to experiment with the above described three interfaces." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-88", "text": "In order to quantitatively evaluate the interfaces, we conducted a second human study where (a second set of) subjects where shown the attention-sharpened images generated from each of the attention interfaces from the first experiment and asked to answer the question." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-89", "text": "The intuition behind this experiment is that if the attention map revealed too little information, this second set of subjects would answer the question incorrectly." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-90", "text": "Table 1 shows VQA accuracies of the answers given by human subjects under these 3 interfaces." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-91", "text": "We can see that the \"Blurred Image with Answer\" interface (section 3.1.2) gives the highest accuracy on evaluation by humans." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-92", "text": "Since the payments structure on AMT encourage completing tasks as quickly as possible, this implicitly incentivizes subjects to deblur as few regions as possible, and our human study shows that humans can still answer questions." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-93", "text": "Thus, overall we achieve a balance between highlighting too little or too much." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-94", "text": "Figure 2 shows examples of collected human attention maps." 
}, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-95", "text": "This VQA-HAT dataset will be released publicly." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-96", "text": "To visualize the collected dataset, we cluster the human attention maps and visualize the average attention map and example questions falling in each of them for 6 selected clusters in Figure 5 ." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-97", "text": "----------------------------------" }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-98", "text": "**HUMAN ATTENTION MAPS VS UNSUPERVISED ATTENTION MODELS**" }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-99", "text": "Now that we have collected these human attention maps, we can ask the following question -do unsupervised attention models learn to predict attention maps that are similar to human attention maps?" }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-100", "text": "To rephrase, do neural networks look at the same regions as humans to answer a visual question? VQA Attention Models." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-101", "text": "We evaluate maps generated by the following unsupervised models:" }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-102", "text": "Figure 6: Random samples of human attention (column 2) v/s machine-generated attention (columns 3-5)." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-103", "text": "\u2022 Stacked Attention Network (SAN) (Yang et al., 2015) with two attention layers (SAN-2) 2 ." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-104", "text": "\u2022 Hierarchical Co-Attention Network (HieCoAtt) (Lu et al., 2016) with word-level (HieCoAtt-W), phrase-level (HieCoAtt-P) and question-level (HieCoAtt-Q) attention maps; we evaluate all three maps 3 ." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-105", "text": "Comparison Metric: Rank Correlation." 
}, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-106", "text": "We first scale both the machine-generated and human attention maps to 14x14, rank the pixels according to their spatial attention and then compute correlation between these two ranked lists." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-107", "text": "We choose an orderbased metric so as to make the evaluation invariant to absolute spatial probability values which can be made peaky or diffuse by tweaking a 'temperature' parameter." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-108", "text": "----------------------------------" }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-109", "text": "**MODEL**" }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-110", "text": "Rank-correlation SAN-2 (Yang et al., 2015) 0.249 \u00b1 0.004" }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-111", "text": "HieCoAtt-W (Lu et al., 2016) 0.246 \u00b1 0.004 HieCoAtt-P (Lu et al., 2016) 0.256 \u00b1 0.004 HieCoAtt-Q (Lu et al., 2016) 0.264 \u00b1 0.004 Table 2 : Mean rank-correlation coefficients (higher is better); error bars show standard error of means." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-112", "text": "We can see that both SAN-2 and HieCoAtt attention maps are positively correlated with human attention maps, but not as strongly as task-independent Judd saliency maps." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-113", "text": "Table 2 shows rank-order correlation averaged over all image-question pairs on the validation set." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-135", "text": "Mean correlation goes down significantly for Judd saliency maps since they have a strong center bias." 
}, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-136", "text": "Relative trends among SAN-2 & HieCoAtt are similar to those over the whole Model Rank-correlation SAN-2 (Yang et al., 2015) 0.038 \u00b1 0.011" }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-137", "text": "HieCoAtt-W (Lu et al., 2016) 0.062 \u00b1 0.012 HieCoAtt-P (Lu et al., 2016) 0.048 \u00b1 0.010 HieCoAtt-Q (Lu et al., 2016) 0 Table 3 : Mean rank-correlation coefficients (higher is better) on the reduced set without center bias; error bars show standard error of means." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-138", "text": "We can see that correlation goes down significantly for Judd saliency maps since they have a strong center bias." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-139", "text": "Relative trends among SAN-2 & HieCoAtt are similar to those over the whole validation set (reported in Table 2 )." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-140", "text": "validation set (reported in Table 2 )." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-141", "text": "HieCoAtt-Q now has a higher correlation with human attention maps than Judd saliency." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-142", "text": "This demonstrates that discounting the center bias, VQA-specific machine attention maps correlate better with VQA-specific human attention maps than task independent machine saliency maps." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-143", "text": "----------------------------------" }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-144", "text": "**CONCLUSION & DISCUSSION**" }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-145", "text": "We introduce and release the VQA-HAT dataset." 
}, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-146", "text": "This dataset can be used to evaluate attention maps generated in an unsupervised manner by attentionbased VQA models, or to explicitly train models with attention supervision for VQA." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-147", "text": "We quantify whether current attention-based VQA models are 'looking' at the same regions of the image as humans do to produce an answer." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-148", "text": "Necessary vs Sufficient Maps." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-149", "text": "Are human attention maps 'necessary' and/or 'sufficient'? If regions highlighted by the human attention maps are sufficient to answer the question accurately, then so is any region that is a superset." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-150", "text": "For example, if attention mass is concentrated on a 'cat' for 'What animal is present in the picture?', then an attention map that assigns weights to any arbitrary-sized region that includes the 'cat' is sufficient as well." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-151", "text": "On the contrary, a necessary and sufficient attention map would be the smallest visual region sufficient for answering the question accurately." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-152", "text": "It is an ill-posed problem to define a necessary attention map in the space of pixels; random pixels can be blacked out and chances are that humans would still be able to answer the question given the resulting subset attention map." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-153", "text": "Our work thus poses an interesting question for future work -what is the right semantic space in which it is meaningful to talk about necessary and sufficient attention maps for humans?" 
}, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-114", "text": "We compare with random attention maps and taskindependent saliency maps generated by a model trained to predict human eye fixation locations where subjects are asked to freely view an image for 3 seconds (Judd et al., 2009 )." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-115", "text": "Both SAN-2 and HieCoAtt attention maps are positively correlated with human attention maps, but not as strongly as task-independent Judd saliency maps." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-116", "text": "Our find-imageqa-san." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-117", "text": "3 Code available at https://github.com/ jiasenlu/HieCoAttenVQA ings lead to two take-away messages with significant potential impact on future research in this active field." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-118", "text": "First, current VQA attention models do not seem to be 'looking' at the same regions as humans to produce an answer." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-119", "text": "Second, as attentionbased VQA models become more accurate (58.9% SAN \u2192 62.1% HieCoAtt), they seem to be (slightly) better correlated with humans in terms of where they look." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-120", "text": "Our dataset will allow for a more thorough validation of this observation as future attention-based VQA models are proposed." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-121", "text": "Figure 6 shows examples of human attention and machine-generated attention maps with corresponding rank-correlation coefficients." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-122", "text": "To put these numbers in perspective, we computed inter-human agreement on the validation set by collecting 3 human attention maps per image-question pair and computing mean rank-correlation, which is 0.623." 
}, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-123", "text": "Lastly, all reported correlation values are averaged over 3 trials by adding random noise (order of 10 \u221214 ) to the human attention maps to account for ranking variations in case of uniformly weighted regions." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-124", "text": "Center Bias." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-125", "text": "Judd saliency maps aim to predict human eye fixations during natural visual exploration." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-126", "text": "These tend to have a strong center bias (Tatler, 2007; Judd et al., 2009 )." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-127", "text": "Although our human attention maps dataset is not an eye tracking study, the center bias still exists albeit not as severe." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-128", "text": "One potential source of this center bias is the fact that the VQA dataset was human-generated by subjects looking at the images." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-129", "text": "Thus, salient objects in the center of the image are likely be potential subjects of the questions." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-130", "text": "We compute rank-correlation of a synthetically generated central attention map with Judd saliency and human attention maps." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-131", "text": "Judd saliency maps have a mean rank-correlation of 0.877 and human attention maps have a mean rank-correlation of 0.458 on the validation set." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-132", "text": "To eliminate the effect of center bias in this evaluation, we removed human attention maps that have a positive rank-correlation with the center attention map." 
}, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-133", "text": "We compute rank-correlation of machinegenerated attention with human attention on this reduced set." }, { "sent_id": "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-134", "text": "See Table 3 ." } ], "y": { "@BACK@": { "gold_contexts": [ [ "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-25" ], [ "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-36" ] ], "cite_sentences": [ "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-25", "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-36" ] }, "@MOT@": { "gold_contexts": [ [ "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-25" ], [ "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-104" ] ], "cite_sentences": [ "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-25", "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-104" ] }, "@USE@": { "gold_contexts": [ [ "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-111", "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-137" ] ], "cite_sentences": [ "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-111", "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-137" ] }, "@SIM@": { "gold_contexts": [ [ "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-111", "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-137" ] ], "cite_sentences": [ "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-111", "93cf4d4fd9cd875e8e148d1bbd8e2c-C001-137" ] } } }, "ABC_9ea14a9fe422451901ad221bee5714_15": { "x": [ { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-177", "text": "**RESULTS**" }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-80", "text": "We again utilize multiple possibilities of input feature representation." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-2", "text": "We propose an automated approach for semantic class identification of compounds in Sanskrit." 
}, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-3", "text": "It is essential to extract semantic information hidden in compounds for improving overall downstream Natural Language Processing (NLP) applications such as information extraction, question answering, machine translation, and many more." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-4", "text": "In this work, we systematically investigate the following research question: Can recent advances in neural network outperform traditional hand engineered feature based methods on the semantic level multi-class compound classification task for Sanskrit?" }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-5", "text": "Contrary to the previous methods, our method does not require feature engineering." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-25", "text": "In Sanskrit, Krishna et al. (2016) have proposed a framework for semantic type classification of compounds in Sanskrit." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-26", "text": "They proposed a multi-class classifier using Random Forests (Geurts et al., 2006; Pedregosa et al., 2011) , where they classified a given compound into one of the four coarse level compound classes, namely, Avyay\u012bbh\u0101va, Tatpurus ." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-27", "text": "a, Bahuvr\u012bhi and Dvandva." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-28", "text": "They have used an elaborate feature set, which summarily consists of rules from the grammar treatise As ." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-29", "text": "t .\u0101 dhy\u0101y\u012b pertaining to compounding, semantic relations between the compound components from a lexical database Amarakos ." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-30", "text": "a and distributional subword patterns from the data using Adaptor Grammar (Johnson et al., 2007) ." 
}, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-31", "text": "Inspired from the recent advances in using neural models for compound analysis in NLP, we revisit the task of compound class identification and validate the efficacy of such models under the low-resource setting like that of Sanskrit." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-32", "text": "In this work, we experiment with multiple deep learning models for compound type classification." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-33", "text": "Our extensive experiments include standard neural models comprising of Multi-Layer Perceptrons (MLP), Convolution Neural Networks (CNN) (Zhang et al., 2015) and Recurrent models such as Long Short Term Memory (LSTM) configurations." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-34", "text": "Unlike the feature-rich representation of Krishna et al. (2016) , we rely on various word embedding approaches, which include character level, sub-word level, and word-level embedding approaches." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-35", "text": "Using end-to-end training, the pretrained embeddings are fine tuned for making them task specific embeddings." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-36", "text": "So all the architectures are integrated with end-to-end training (Kim, 2014 )." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-6", "text": "For well-organized analysis, we categorize neural systems based on Multi-Layer Perceptron (MLP), Convolution Neural Network (CNN) and Long Short Term Memory (LSTM) architecture and feed input to the system from one of the possible levels, namely, word level, sub-word level, and character level." 
}, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-7", "text": "Our best system with LSTM architecture and FastText embedding with end-to-end training has shown promising results in terms of F-score (0.73) compared to the state of the art method based on feature engineering (0.74) and outperformed in terms of accuracy (77.68%)." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-8", "text": "----------------------------------" }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-10", "text": "The landscape of Natural Language Processing has significantly shifted towards the realm of Deep Learning and Artificial Neural Networks." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-11", "text": "With the benefit of hindsight, the title for the seminal work on a neural pipeline for NLP from Collobert et al. (2011) , \"Natural Language Processing (Almost) from Scratch\", seems prophetic." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-12", "text": "Neural networks have demonstrated promising results in a wide variety of problems like sentiment analysis (Tai et al., 2015) , information extraction (Nguyen et al., 2009 ), text classification (Kim, 2014) , machine translation (Bastings et al., 2017) among others." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-13", "text": "Many of such models in fact have become part and parcel of a standard NLP pipeline for data processing, especially for the resource-rich languages such as English (Tenney et al., 2019) ." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-14", "text": "There have been academic debates over the philosophical implications of the use of such statistical black box approaches in Computational Linguistics, especially towards the trade-off between performance and interpretability as also summarised in Krishna et al. (2018b) ." 
}, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-15", "text": "However, in this work, we focus more on the pragmatic side of using such approaches for low resource languages like Sanskrit." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-16", "text": "Deep Learning models demand a humongous amount of data to train a model effectively." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-17", "text": "Additionally, it is challenging and often tricky to incorporate available linguistic knowledge into these neural architectures (Strubell et al., 2018) ." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-18", "text": "Summarily, we can say that a standard off the shelf neural model relies mostly on its capacity to learn distributional information from the large datasets provided as input during training." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-19", "text": "In this pretext, we revisit the problem of compound type identification in Sanskrit (Krishna et al., 2016) and experiment with various neural architectures for solving the task." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-20", "text": "The process of compounding and the nature of compositionality of the compounds are well studied in the field of NLP." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-21", "text": "Given that compounding is a productive process of word-formation in languages, this is of much interest in the area of word-level semantics in NLP." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-22", "text": "There are various aspects involved in the compound analysis." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-23", "text": "These include productivity and recursiveness of the words involved in the process, presence of implicit relations between the components, and finally, the analysis of a compound relies on its pragmatic or contextual features (Kim and Baldwin, 2005) ." 
}, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-24", "text": "Recently, there has been a concerted effort in studying the nature of compositionality in compounds by leveraging on distributional word-representations or word embeddings and then learning function approximators to predict the nature of compositionality of such words (Mitchell and Lapata, 2010; Cordeiro et al., 2016; Salehi et al., 2015; Jana et al., 2019) ." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-37", "text": "The best system of ours, an end-to-end LSTM architecture initialised with fasttext embeddings has shown promising results in terms of F-score (0.73) compared to the state of the art classifier from Krishna et al. (2016) (0.74) and outperformed it in terms of accuracy (77.68%)." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-38", "text": "Summarily, we find that the models we experimented with, report competitive results with the current state of the art model for compound type identification." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-39", "text": "We achieve the same without making use of any feature engineering or domain expertise." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-40", "text": "We release the codebase for all our models experimented with at https://github.com/Jivnesh/ISCLS-19." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-41", "text": "----------------------------------" }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-42", "text": "**COMPOUND CLASSIFICATION TASK IN SANSKRIT**" }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-43", "text": "In this work, we address the challenge of semantic type identification of compounds in Sanskrit." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-44", "text": "This is generally treated as a word-level semantic task in NLP (Rink and Harabagiu, 2010; Hashimoto et al., 2014; Santos et al., 2015) ." 
}, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-45", "text": "We treat the task as a supervised multiclass classification problem." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-46", "text": "Here, similar to Krishna et al. (2016) , we expect the users to provide a compound in its component-wise segmented form as input to the model." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-47", "text": "But our model relies on distributed representations or embeddings of the input as features, instead of the linguistically involved feature set proposed in Krishna et al. (2016) ." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-48", "text": "Approaches for compound analysis have been of great interest in NLP for multiple languages including English, Italian, Dutch and German (S\u00e9aghdha and Copestake, 2013; Tratz and Hovy, 2010; Kim and Baldwin, 2005; Girju et al., 2005; Verhoeven et al., 2014a) ." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-49", "text": "These methods primarily rely on lexical networks, distributional information (S\u00e9aghdha and Copestake, 2013) or a combination of both lexical and distributional information (Nastase et al., 2006) ." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-50", "text": "In Sanskrit, Krishna et al. (2016) proposed a similar statistical approach which combined lexical and distributional information by using information from the lexical network Amarakos ." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-51", "text": "a (Nair and and variable length n-grams learned from data using Adaptor grammar (Johnson et al., 2007) ." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-52", "text": "Here, the authors also adopted rules from As ." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-53", "text": "t .\u0101 dhy\u0101y\u012b as potentially discriminative features for compound type identification (Kulkarni and Kumar, 2013) ." 
}, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-54", "text": "While this model has shown to be effective for the task, it nevertheless is a linguistically involved model." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-55", "text": "Recently, Dima and Hinrichs (2015) , Cordeiro et al. (2016) and Ponkiya et al. (2016) have shown that use of word embedding as the sole features can produce models with competitive results as compared to other feature-rich models." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-56", "text": "Inspired from these observations, we attempt to build similar models which use only embeddings as features for the compound type identification task." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-57", "text": "Compounds in Sanskrit can be categorized into 30 possible classes based on how granular categorizations one would like to have (Lowe, 2015) ." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-58", "text": "There are slightly altered set of categorizations considered by Gillon (2009), Olsen (2000) , Bisetto and Scalise (2005) and Tubb and Boose (2007) ." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-59", "text": "Semantically As ." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-60", "text": "t .\u0101 dhy\u0101y\u012b categorizes the Sanskrit compounds into four major semantic classes, namely, Avyay\u012bbh\u0101va, Tatpurus ." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-61", "text": "a, Bahuvr\u012bhi and Dvandva (Kumar et al., 2010) ." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-62", "text": "Similar to prior computational approaches in Sanskrit compounding (Krishna et al., 2016; Kumar et al., 2010) , we follow this four class coarse level categorization of the semantic classes in compounds." 
}, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-63", "text": "Compounding in Sanskrit is extremely productive, or rather recursive, resulting in compound words with multiple components (Lowe, 2015) ." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-64", "text": "Further, it is observed that compounding of a pair of components may result in compounds of different semantic classes." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-65", "text": "Avyay\u012bbh\u0101va and Tatpurus ." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-66", "text": "a may likely be confusing due to particular sub-category of Tatpurus ." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-67", "text": "a if the first component is an avyaya." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-68", "text": "For example, upa j\u012bvatah . has the first component as avyaya which is strong characteristic of Avyay\u012bbh\u0101va." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-69", "text": "However, this compound belongs to Tatpurus ." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-70", "text": "a class." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-71", "text": "Likewise, a negation avyaya in the first component can create confusion between Tatpurus ." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-72", "text": "a and Bahuvr\u012bhi classes." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-73", "text": "The instances mentioned above reveal the difficulties associated with distinguishing the semantic classes of a compound." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-74", "text": "----------------------------------" }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-75", "text": "**SYSTEM DESCRIPTION**" }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-76", "text": "While the compounds in Sanskrit can consist of multiple components, we restrict our problem to that of compounds with two components only." 
}, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-77", "text": "Thus, given the two components of the compound, we treat this as a classification problem." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-78", "text": "For the task, we use neural models, which can be categorized based on the architectural point of view, namely, Multi-Layer Perceptron (MLP), Convolutional Neural Network (CNN) and Long Short Term Memory (LSTM) based classifier, among others." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-79", "text": "These networks typically require a feature representation of the input (in our case, the two components of the compound word), and learn to classify into one of the possible compound categories." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-81", "text": "For instance, consider svam\u0101nasam, which is a Tatpurus ." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-82", "text": "a compound." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-83", "text": "We can break this compound in three possible ways: 1) word level: sva m\u0101nasam 2) subword level: sva m\u0101 nas am (subword level segmentation is based on segmentation learned by Byte Pair Encoding (BPE) (Sennrich et al., 2016)" }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-84", "text": "----------------------------------" }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-85", "text": "**FROM CORPUS DATA). 3) CHARACTER LEVEL: S V A M\u0100 N A S A M.**" }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-86", "text": "We learned word embeddings of these components of the compound from our Sanskrit corpus (Section 4.1)." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-87", "text": "Word embeddings map a word x from a vocabulary V to a real-valued vector \u20d7 x of dimensionality D in a feature space (Schnabel et al., 2015) ." 
}, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-88", "text": "The idea based on distributional hypothesis (Harris, 1954) , and the learning objective attempts to put similar words closer in the vector space." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-89", "text": "We used FastText for learning word-level embedding, BPE along with Word2Vec (w2v) and Glove (Pennington et al., 2014) for learning subword level embedding, and character level embedding learned using CharCNN (Zhang et al., 2015) ." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-90", "text": "Note that we learned embeddings for the individual components, and finally concatenated vectors corresponding to each component and fed as input to the classifier." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-91", "text": "We also integrated our system with task-specific end-to-end training for text classification (Kim, 2014) ." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-92", "text": "This approach facilitates pre-trained initialized vector to be updated during the task-specific training process." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-93", "text": "Performance of the classifier, with and without end-to-end training, is reported in Appendix I. In all the architectures, relu activation function for dense layer, softmax cross entropy loss function and adam optimizer are used." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-94", "text": "----------------------------------" }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-95", "text": "**MLP BASED CLASSIFIER**" }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-96", "text": "Multi-layer Perceptron in supervised learning problem consists of an input layer to receive input, output layer to make a decision and multiple hidden layers in between them." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-97", "text": "Training involves learning the parameters of the model using backpropagation." 
}, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-98", "text": "As discussed earlier, We experiment with feeding input in two levels, namely, word level (FastText and FastText*) and subword level (W2V and Glove along with BPE)." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-99", "text": "Architectures used for them are reported in Table 1 ." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-100", "text": "Next to the embedding layer, a drop-out layer with drop-out rate 0.2 is used to avoid over-fitting (Srivastava et al., 2014) ." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-101", "text": "----------------------------------" }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-102", "text": "**EMBEDDING**" }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-125", "text": "Due to the problem of Vanishing Gradient (Pascanu et al., 2013; Bengio et al., 1994) , RNNs can capture only short-term dependencies." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-103", "text": "Layer units FastText*: In this case, as shown in Figure 1 , FastText vectors of two components of the compound are concatenated along with element-wise absolute difference and element-wise product between the embedding vector of these two vectors (it is denoted by FastText*)." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-104", "text": "Moreover, the resultant vector is passed to MLP based classifier (Table 1) with no end-to-end training." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-105", "text": "The architecture we have used to combine information from the two components is similar to the one used for the Natural Language Inference (NLI) problem in Conneau et al. (2017) ." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-106", "text": "The key idea behind their approach was to obtain a unified representation of two sentences, each represented as a vector, similar to Figure 1." 
}, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-107", "text": "----------------------------------" }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-108", "text": "**CNN BASED CLASSIFIER**" }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-109", "text": "CNN has shown outstanding performance in the field of computer vision." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-110", "text": "The purpose behind adopting CNNs in NLP is to derive position-invariant features (such as phrases, n-grams) using the convolution operation." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-111", "text": "Max pooling over these features helps to find the essential n-grams and then fully connected hidden layers are employed, similar to MLP, for final predictions." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-112", "text": "Recently, Kim (2014) has shown the application of CNN for textual data." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-113", "text": "In our CNN architecture, end-to-end training is integrated into the embedding layer." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-114", "text": "Next to the embedding layer, dropout layer with a drop-out rate of 0.2 is used." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-115", "text": "For different input levels, architecture details are shown in Table 2 ." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-116", "text": "Now We will explain CNN used for the character level input." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-117", "text": "CharCNN: Zhang et al. (2015) used character level information of text as input for a convolutional neural network." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-118", "text": "The advantage of the model is that by using character level embedding with convolution layers, word-level embedding can be obtained." 
}, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-119", "text": "This model requires fixed-size input of encoded characters where embeddings of each character are initialized with Gaussian Figure 1 : FastText feature augmented with element-wise difference and multiplication of compound's component (Conneau et al., 2017) distribution with mean 0 and variance 0.05." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-120", "text": "CharCNN architecture employed for our experiment is mentioned in" }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-121", "text": "----------------------------------" }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-122", "text": "**LSTM BASED CLASSIFIER**" }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-123", "text": "The conventional feed-forward neural network treats all input-output pairs independently, which limits the ability to learn patterns in sequential data." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-124", "text": "RNNs are designed to capture this time dependency where network memorizes the previous input-output interactions in order to predict the current output." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-176", "text": "----------------------------------" }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-126", "text": "To overcome this limitation, LSTM (Hochreiter and Schmidhuber, 1997) is used which employs a gating mechanism to carry forward the long-term dependencies." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-127", "text": "LSTM has achieved great success in working with sequences of words." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-128", "text": "In our LSTM architecture, next to embedding layer, drop-out layer with rate 0.2 is used." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-129", "text": "Embedding layer is integrated with end-to-end training." 
}, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-130", "text": "Architectural details for different input levels are given in" }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-131", "text": "----------------------------------" }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-132", "text": "**EXPERIMENTS**" }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-133", "text": "----------------------------------" }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-134", "text": "**DATASET**" }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-135", "text": "Our text corpus contains data from the Digital Corpus of Sanskrit (DCS) 1 , as well as scraped data from Wikipedia and Vedabase corpus." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-136", "text": "The number of words in each corpus are 3.8 M, 0.7 M, and 0.2 M, respectively." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-137", "text": "DCS and Vedabase are segmented, but the Wikipedia data is unsegmented." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-138", "text": "We have used this corpus to learn word embedding features." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-139", "text": "Most of the data in our corpus is in the form of poetry." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-140", "text": "Figure 2 presents a few statistics regarding the corpus utilized." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-141", "text": "The labelled dataset for the compound classification task with a segmented pair of components is obtained from the department of Sanskrit studies, UoHyd 2 ." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-142", "text": "These compounds are part of ancient texts, namely, Bhagavadg\u012bt\u0101, Carakasam . h\u012bta, etc." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-143", "text": "We have used the same experimental setting as Krishna et al. (2016) for the classification task." 
}, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-144", "text": "The dataset for the compound classification task has more than 32,000 sandhi splitted compounds with labels." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-145", "text": "There are four broad classes, namely, Avyay\u012bbh\u0101va, Tatpurus ." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-146", "text": "a, Bahuvr\u012bhi and Dvandva." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-147", "text": "More than 75% data points were from Tatpurus ." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-148", "text": "a class, Krishna et al. (2016) down-sampled it to 4,000, which takes it close to the count of the second most highly populated class Bahuvr\u012bhi." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-149", "text": "Avyay\u012bbh\u0101va class is highly skewed, 5% of the Bahuvr\u012bhi class." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-150", "text": "After down-sampling, number of compounds are 239 in Avyay\u012bbh\u0101va, 4, 271 in Bahuvr\u012bhi, 1, 176 in Dvandva, and 4, 266 in Tatpurus ." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-151", "text": "a. Out of 9,952 datapoints, 7,957 were kept for training and remaining for testing." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-152", "text": "We have created development (dev) dataset for hyperparameter tuning, from 20 % stratified sampling of the training data." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-153", "text": "We have not used test dataset in any part of training or hyperparameter tuning." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-154", "text": "Figure 3 (e) and 3(f) show the effect of embedding size on the dev set performance." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-155", "text": "In FastText, accuracy on dev-set saturated at 350, which we used as the default embedding size." 
}, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-156", "text": "Since most of the data is in the form of poetry, the window size is kept larger." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-157", "text": "As we increase the epoch size, there was a gradual increase in performance (Figure 3(e) )." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-158", "text": "Parameters min-n and max-n were chosen by plotting the distribution of the number of characters in word (Figure 2(b) )." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-159", "text": "Figure 2 (a) shows that more than 50% data sample from the classification task has zero occurrences in the corpus." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-160", "text": "So this Out of Vocabulary (OOV) issue is handled by applying BPE with vocabulary size 100." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-161", "text": "Results did not improve by increased vocabulary size of BPE." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-162", "text": "BPE vocabulary size is chosen as 100, for both glove and w2v features." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-163", "text": "Embedding for w2v and Glove is calculated for segmented sub-words." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-164", "text": "Figure 3(b) and 3(c) indicates that by increasing embedding size, there is a gradual increase in F-score on dev dataset for both BPE+W2v and BPE+Glove." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-165", "text": "So we chose 450 as the embedding size for w2v." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-166", "text": "For Glove, feature size, epoch size and window size are 450, 70 and 20, respectively." 
}, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-167", "text": "----------------------------------" }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-168", "text": "**HYPERPARAMETER TUNING FOR INPUT REPRESENTATION**" }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-169", "text": "In CharCNN, the vocabulary size of characters is 60." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-170", "text": "Apart from the Sanskrit alphabets, there are other eight symbols present in the dataset, which include numbers." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-171", "text": "The maximum length of characters in the input is 25." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-172", "text": "Features corresponding to each character is of size 1014, which is initialized from Gaussian distribution with mean 0 and variance 0.05." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-173", "text": "Filter size, kernel size, and pull size for each layer are shown in Table 2 ." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-174", "text": "Last two layers are fully connected layers with a relu activation function." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-175", "text": "All the hyper-parameters are reported in Appendix II." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-178", "text": "Classifier's performance is evaluated based on micro accuracy and macro precision, recall and F-score." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-179", "text": "F-score is the combined metric of precision and recall, so accuracy and the F-score will be our main evaluation metric." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-180", "text": "Figure 3 : Investigating the sensitivity of the results (F1-score and Accuracy) with respect to the dimensionality of various embeddings on the development set: (a) As vocabulary size of BPE increases, macro F1-score decreases." 
}, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-181", "text": "So we have used the BPE vocabulary size as 100." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-182", "text": "(b) As embedding size of w2v increases, there is a gradual increase in F-score." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-183", "text": "So we have chosen 450 as the embedding size." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-184", "text": "(c) As embedding size of Glove increases, there is a gradual increase in F-score." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-185", "text": "So we have chosen 450 as the embedding size." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-186", "text": "Figure 4 : (a) Class-wise F-score for different embeddings with the same architecture (CNN) (b) Class-wise F-score for different architectures with the same embedding (FastText)." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-187", "text": "Note that class size increases as we move from left to right along x-axis." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-188", "text": "ERF and N-gram are baselines reported in Table 4." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-189", "text": "feature engineering is involved, but it was able to give comparable performance." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-190", "text": "There are three possible ways to feed input to the system, namely, word level, subword level, and character level." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-191", "text": "Based on these categorizations, step by step, we evaluated our MLP, CNN, and LSTM based classifiers." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-192", "text": "First, for word-level inputs, we randomly initialized all the embedding vectors and checked the performance of the classifier." 
}, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-193", "text": "We were able to reach up to 0.59 (macro) F-score with CNN+ classifier (Table 4 )." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-194", "text": "Next, for subword level input, we used W2V and Glove embedding on BPE segmented (the segmentation is not morphemic) sub-words of the compound." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-195", "text": "These embeddings helped to get significant improvement compared to word level randomly initialized embedding, achieving F-score of 0.67 and 0.68, respectively." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-196", "text": "As shown in Figure 2(a) , W2V and Glove could not give very good embeddings due to the rare occurrence of compound words in the corpus." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-197", "text": "Then we experimented with another embedding, FastText, which has shown excellent performance compared to all other systems." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-198", "text": "We were able to reach 0.73 (macro) F-score." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-199", "text": "We almost achieved state of the art result without feature engineering." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-200", "text": "Then we used the FastText* embedding combination technique to check whether we can improve further, but it declined the actual result to 0.68." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-201", "text": "Finally, character level input with CharCNN architecture with randomly initialized embedding reached 0.68." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-202", "text": "Our system outperformed in terms of accuracy (77.68) to state of the art baseline (77.39)." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-203", "text": "We also integrated end-to-end training to learn task-specific embedding in all systems mentioned above." 
}, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-204", "text": "Detailed results for all the systems are presented in Appendix I." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-205", "text": "----------------------------------" }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-206", "text": "**ERROR ANALYSIS**" }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-207", "text": "We have done a detailed analysis of particular instances of compound types which get misclassified." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-208", "text": "From confusion matrix heat map in Figure 5 , we can see that most of the mis-classification has gone to Tatpurus ." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-209", "text": "a class for our best performing system." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-210", "text": "There are no mis-classification between Dvandva and Avyay\u012bbh\u0101va." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-211", "text": "Specific sub-type of Tatpurus ." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-212", "text": "a has similar properties as that of Avyay\u012bbh\u0101va, where first component of compound is avyaya, which creates conflict between these two classes." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-213", "text": "In our observation, 11 data-points from Tatpurus ." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-214", "text": "a got mis-classified into Avyay\u012bbh\u0101va where all of them have the first component as avyaya." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-215", "text": "Also from Figure 5(a) , we can see that most of the compounds from Avyay\u012bbh\u0101va were misclassified into Tatpurus ." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-216", "text": "a. Our best model is able to perform better compared to the baseline model for Bahuvr\u012bhi and Dvandva which are the second and the third most highly populated classes (Figure 4 )." 
}, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-217", "text": "Figure 5(b) indicates that our best system mostly got confused between Tatpurus ." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-218", "text": "a and Bahuvr\u012bhi, because there is a special sub-type in both of these semantic classes which exhibits similar properties." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-219", "text": "There are more than 600 unique components of compound common in training set of Bahuvr\u012bhi and Tatpurus ." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-220", "text": "a. Out of these, 205 components have more number of occurrences in Bahuvr\u012bhi than that of Tatpurus ." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-221", "text": "a and 201 components have more occurrence in Tatpurus ." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-222", "text": "a than that of Bahuvr\u012bhi." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-223", "text": "So common component compounds present in a conflicting class which has less occurrences will be misclassified." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-224", "text": "Since we have not provided any other information, classifier is getting confused due to common component occurrences in both the classes." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-225", "text": "Similar cases have been found for Dvandva and Tatpurus ." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-226", "text": "a. For example, b\u0101la occurred 7 times in Dvandva and 12 times in Tatpurus ." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-227", "text": "a, so majority of compounds of Dvandva having b\u0101la as component will be misclassified into Tatpurus ." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-228", "text": "a. There are 11 such unique components in training set which have number of occurrences more than 4 in either class." 
}, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-229", "text": "We need to provide contextual information in order to overcome this problem." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-230", "text": "In summary, error cases observed in our best system are similar to that of baseline system." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-231", "text": "In this classification setup, apart from individual components of compounds, we have not provided contextual information or canonical paraphrasing." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-232", "text": "With this restriction, the classification problem is not entirely solvable; however, we explored up to what degree the ambiguities can be resolved." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-233", "text": "----------------------------------" }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-234", "text": "**RELATED WORK**" }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-235", "text": "Semantic analysis of compounds is an essential preprocessing step for improving on overall downstream NLP applications such as information extraction, question answering, machine translation, and many more (Fares et al., 2016) ." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-236", "text": "It has captivated much attention from the computational linguistics community, particularly on languages like English, Dutch, Italian, Afrikaans, and German (Verhoeven et al., 2014b) ." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-237", "text": "By rigorously studying Sanskrit compounding system and Sanskrit grammar, analysis of compounds in Hindi and Marathi has been done (Kulkarni et al., 2012) ." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-238", "text": "Another interesting approach uses simple statistics on how to automate segmentation and type identification of compounds (Kumar et al., 2010) ." 
}, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-239", "text": "Nastase et al. (2006) show that from two types of word meaning, namely, based on lexical resources and corpus-based, noun-modifier semantic relations can be learned." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-240", "text": "Another exciting work by S\u00e9aghdha and Copestake (2013) has done noun-noun compound classification using statistical learning framework of kernel methods, where the measure of similarity between compound components is determined using kernel function." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-241", "text": "Based on As ." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-242", "text": "t .\u0101 dhy\u0101y\u012b rules, Kulkarni and Kumar (2013) has developed rule-based compound type identifier." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-243", "text": "This study helped to get more insights on what kind of information should be incorporated into lexical databases to automate this analysis." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-244", "text": "Kulkarni and Kumar (2011) proposed a constituency parser for Sanskrit compounds to generate paraphrase of the compound which helps to understand the meaning of compounds better." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-245", "text": "Recently, neural models are widely used for different downstream NLP applications for Sanskrit." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-246", "text": "The error corrections in Sanskrit OCR documents is done based on a neural network based approach (Adiga et al., 2018) ." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-247", "text": "Another work used neural models for post-OCR text correction for digitising texts in Romanised Sanskrit (Krishna et al., 2018a) ." 
}, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-248", "text": "Hellwig and Nehrdich (2018) proposed an approach for automating feature engineering required for the word segmentation task." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-249", "text": "Another neural-based approach for word segmentation based on seq2seq model architecture was proposed by Reddy et al. (2018) , where they have shown significant improvement compared to the previous linguistically involved models." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-250", "text": "Feedforward networks are used for building Sanskrit character recognition system (Dineshkumar and Suganthi, 2015) ." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-251", "text": "Krishna et al. (2018c) proposed energy-based framework for jointly solving the word segmentation and morphological tagging tasks in Sanskrit." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-252", "text": "The pretrained word embeddings proposed by and Pennington (2014) had a great impact in the field of Natural Language Processing (NLP)." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-253", "text": "However, these token based embeddings were unable to generate embeddings for out-of-vocabulary (OOV) words." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-254", "text": "To overcome this shortcoming, subword level information was integrated into recent approaches, where character-n-gram features (Bojanowski et al., 2017) have shown good performance over the compositional function of individual characters (Wieting et al., 2015) ." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-255", "text": "Another interesting approach (Zhang et al., 2015) is the use of character level input for word-level predictions." 
}, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-256", "text": "----------------------------------" }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-257", "text": "**CONCLUSION**" }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-258", "text": "For resource-rich languages, deep learning based models have helped in improving the state of the art for most of the NLP tasks, and have now replaced the need for feature engineering with the choice of a good model architecture." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-259", "text": "In this work, we systematically investigated the following research question: Can the recent advances in neural network outperform traditional hand engineered feature based methods on the semantic level multi-class compound classification task for Sanskrit?" }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-260", "text": "We experimented with some of the basic architectures, namely, MLP, CNN, and LSTM, with input representation at the word, sub-word, and character level." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-261", "text": "The experiments suggest that the end-to-end trained LSTM architecture with FastText embedding gives an Fscore of 0.73 compared to the state of the art baseline (0.74) which utilized a lot of domain specific features including lexical lists, grammar rules, etc." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-262", "text": "This is clearly an important result." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-263", "text": "There are many limitations of this study." }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-264", "text": "For instance, what is the effect of the corpus size on the performance?" }, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-265", "text": "We work with a corpus with less than 5 million tokens, which is negligible compared to 840 billion tokens, on which Glove embeddings for English have been trained." 
}, { "sent_id": "9ea14a9fe422451901ad221bee5714-C001-266", "text": "Would a larger dataset have helped? Could methods based on cross-lingual embeddings help in this scenario for transfer learning from languages similar to Sanskrit?" } ], "y": { "@BACK@": { "gold_contexts": [ [ "9ea14a9fe422451901ad221bee5714-C001-19" ], [ "9ea14a9fe422451901ad221bee5714-C001-25" ] ], "cite_sentences": [ "9ea14a9fe422451901ad221bee5714-C001-19", "9ea14a9fe422451901ad221bee5714-C001-25" ] }, "@DIF@": { "gold_contexts": [ [ "9ea14a9fe422451901ad221bee5714-C001-34" ], [ "9ea14a9fe422451901ad221bee5714-C001-47" ] ], "cite_sentences": [ "9ea14a9fe422451901ad221bee5714-C001-34", "9ea14a9fe422451901ad221bee5714-C001-47" ] }, "@SIM@": { "gold_contexts": [ [ "9ea14a9fe422451901ad221bee5714-C001-37" ], [ "9ea14a9fe422451901ad221bee5714-C001-46" ], [ "9ea14a9fe422451901ad221bee5714-C001-50" ], [ "9ea14a9fe422451901ad221bee5714-C001-62" ], [ "9ea14a9fe422451901ad221bee5714-C001-143", "9ea14a9fe422451901ad221bee5714-C001-148" ] ], "cite_sentences": [ "9ea14a9fe422451901ad221bee5714-C001-37", "9ea14a9fe422451901ad221bee5714-C001-46", "9ea14a9fe422451901ad221bee5714-C001-50", "9ea14a9fe422451901ad221bee5714-C001-62", "9ea14a9fe422451901ad221bee5714-C001-143", "9ea14a9fe422451901ad221bee5714-C001-148" ] }, "@USE@": { "gold_contexts": [ [ "9ea14a9fe422451901ad221bee5714-C001-37" ], [ "9ea14a9fe422451901ad221bee5714-C001-46" ], [ "9ea14a9fe422451901ad221bee5714-C001-62" ], [ "9ea14a9fe422451901ad221bee5714-C001-143", "9ea14a9fe422451901ad221bee5714-C001-148" ] ], "cite_sentences": [ "9ea14a9fe422451901ad221bee5714-C001-37", "9ea14a9fe422451901ad221bee5714-C001-46", "9ea14a9fe422451901ad221bee5714-C001-62", "9ea14a9fe422451901ad221bee5714-C001-143", "9ea14a9fe422451901ad221bee5714-C001-148" ] }, "@EXT@": { "gold_contexts": [ [ "9ea14a9fe422451901ad221bee5714-C001-47" ] ], "cite_sentences": [ "9ea14a9fe422451901ad221bee5714-C001-47" ] } } }, "ABC_b86a5a8ec1f27354057bb45ff27588_15": { "x": [ { 
"sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-2", "text": "The paper presents work on improved sentence-level dialect classification of Egyptian Arabic (ARZ) vs. Modern Standard Arabic (MSA)." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-3", "text": "Our approach is based on binary feature functions that can be implemented with a minimal amount of task-specific knowledge." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-4", "text": "We train a featurerich linear classifier based on a linear support-vector machine (linear SVM) approach." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-5", "text": "Our best system achieves an accuracy of 89.1 % on the Arabic Online Commentary (AOC) dataset (Zaidan and Callison-Burch, 2011) using 10-fold stratified cross validation: a 1.3 % absolute accuracy improvement over the results published by (Zaidan and Callison-Burch, 2014) ." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-6", "text": "We also evaluate the classifier on dialect data from an additional data source." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-7", "text": "Here, we find that features which measure the informalness of a sentence actually decrease classification accuracy significantly." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-8", "text": "----------------------------------" }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-10", "text": "The standard form of written Arabic is Modern Standard Arabic (MSA) ." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-11", "text": "It differs significantly from various spoken varieties of Arabic (Zaidan and Callison-Burch, 2011; Zaidan and Callison-Burch, 2014; Elfardy and Diab, 2013) ." 
}, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-12", "text": "Even though these dialects do not originally exist in written form, they are present in social media texts." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-13", "text": "Recently a dataset of dialectal Arabic has been made available in the form of the Arabic Online Commentary (AOC) set (Zaidan and Callison-Burch, 2011; Zaidan and CallisonBurch, 2014) ." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-14", "text": "The data consists of reader commentary from the online versions of Arabic newspapers, which have a high degree of dialect content." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-15", "text": "Data for the following dialects has been collected: Levantine, Gulf, and Egyptian." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-16", "text": "The data had been obtained by a crowd-sourcing effort." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-167", "text": "Table 4 : Sub-corpora together with total number as well as percentage of sentences that are classified as ARZ." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-193", "text": "Phrase translation pairs demonstrating the use of the classified training data are shown in Table 6 ." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-17", "text": "In the current paper, we present results for a binary classification task only, where we predict the dialect of Egyptian Arabic ARZ vs. MSA sentences from the Al-Youm Al-Sabe' newspaper online commentaries 1 ." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-18", "text": "Our ultimate goal is to use the dialect classifier for building a dialect-aware Arabic-English statistical machine translation (SMT) system." 
}, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-19", "text": "Our Arabic-English training data contains a significant amount of Egyptian dialect data only, and we would like to adapt the components of our hierarchical phrase-based SMT system (Zhao and Al-Onaizan, 2008) to that data." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-20", "text": "Similar to (Elfardy and Diab, 2013) , we present a sentence-level classifier that is trained in a supervised manner." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-21", "text": "Our approach is based on an Arabic tokenizer, but we do not use a range of specialized tokenizers or orthography normalizers." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-22", "text": "In contrast to the language-model (LM) based classifier used by (Zaidan and Callison-Burch, 2014 ), we present a linear classifier approach that works best without the use of LMbased features." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-23", "text": "Some improvements in terms of classification accuracy and 10-fold cross validation under the same data conditions as (Zaidan and Callison-Burch, 2011; Elfardy and Diab, 2013) are presented." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-24", "text": "In general, we aim at a smaller amount of domain specific feature engineering than previous related approaches." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-25", "text": "The paper is structured as follows." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-26", "text": "In Section 2, we present related work on language and dialect identification." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-27", "text": "In Section 3, we discuss the linear classification model used in this paper." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-28", "text": "In Section 4, we evaluate the classifier performance in terms of classification accuracy on two data sets and present some error analysis." 
}, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-29", "text": "Finally, in Section 5, we discuss future work on improved dialect-level classification and its application to system adaptation for machine translation." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-30", "text": "----------------------------------" }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-31", "text": "**RELATED WORK**" }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-32", "text": "From a computational perpective, we can view dialect identification as a more fine-grained form of language identification (ID)." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-33", "text": "Previous work on language ID examined the use of character histograms (Cavnar and Trenkle, 1994; Dunning, 1994) , and high accuracy prediction results have been reported even for languages with a common character set." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-34", "text": "(Baldwin and Lui, 2010 ) present a range of document-level language identification techniques on three different data sets." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-35", "text": "They use n-gram counting techniques and different tokenization schemes that are adopted to those data sets." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-36", "text": "Their classification task deals with several languages, and it becomes more difficult as the number of languages increases." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-37", "text": "They present an SVM-based multiclass classification approach similar to the one presented in this paper which performs well on one of their data sets." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-38", "text": "(Trieschnigg et al., 2012) generates n-gram features based on character or word sequences to classify dialectal documents in a dutch-language fairy-tale collection." 
}, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-39", "text": "Their baseline model uses N -gram based text classification techniques as popularised in the TextCat tool (Cavnar and Trenkle, 1994) ." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-40", "text": "Following (Baldwin and Lui, 2010) , the authors extend the usage of n-gram features with nearest neighbour and nearest-prototype models together with appropriately chosen similarity metrics." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-41", "text": "(Zampieri and Gebre, 2012) classify two varieties of the same language: European and Brazilian Portuguese." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-42", "text": "They use word and character-based language model classification techniques similar to (Zaidan and Callison-Burch, 2014) ." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-43", "text": "(Huang and Lee, 2008) present simple bag-of-word techniques to classify varieties of Chinese from the Chinese Gigaword corpus." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-44", "text": "(Kruengkrai et al., 2005) extend the use of ngram features to using string kernels: they may take into account all possible sub-strings for comparison purposes." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-45", "text": "The resulting kernel-based classifier is compared against the method in (Cavnar and Trenkle, 1994) ." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-46", "text": "(Lui and Cook, 2013 ) present a dialect classification approach to identify Australian, British, and Canadian English." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-47", "text": "They present results where they draw training and test data from different sources." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-48", "text": "The successful transfer of models from one text source to another is evidence that their classifier indeed captures dialectal rather than stylistic or formal differences." 
}, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-49", "text": "Language identification of related languages is also addressed in the DSL (Discriminating Similar Languages) task of the present Vardial workshop at COLING 14 (Tan et al., 2014) ." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-50", "text": "While most of the above work focuses on document-level language classification, recent work on handling Arabic dialect data addresses the problem of sentence-level classification (Zaidan and CallisonBurch, 2011; Zaidan and Callison-Burch, 2014; Elfardy and Diab, 2013; Zaidan and Callison-Burch, 2014 )." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-51", "text": "The work is based on the data collection effort by (Zaidan and Callison-Burch, 2014 ) which crowdsources the annotation task to workers on Amazons Mechanical Turk." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-52", "text": "The classification results by (Zaidan and Callison-Burch, 2014 ) are based on n-gram language-models, where the n-grams are defined both on words and characters." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-53", "text": "The authors find that unigram word-based models perform best." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-54", "text": "The word-based models are obtained after a minimal amount of preprocessing such as proper handling of HTML entities and Arabic numbers." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-55", "text": "Classification accuracy is significantly reduced for shorter sentences." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-56", "text": "(Elfardy and Diab, 2013) presents classifcation result based on various tokinization and orthographic normalization techniques as well as so-called meta features that estimate the informalness of the data." 
}, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-57", "text": "Like our work, the authors focus on a binary dialect classification based on the ARZ-MSA portion of the dataset in (Zaidan and Callison-Burch, 2011) ." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-58", "text": "----------------------------------" }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-59", "text": "**CLASSIFICATION MODEL**" }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-60", "text": "We use a linear model and compute a score s(t n 1 ) for a tokenized input sentence consisting of n tokens t i :" }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-61", "text": "where \u03c6 s (c i , t i ) is a binary feature function which takes into account the context c i of token t i ." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-62", "text": "w \u2208 R d is a high-dimensional weight vector obtained during training." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-142", "text": "MSA prediction F-score is defined analogously." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-63", "text": "In our experiments, we classify a tokenized Table 1 : We used the following dialect data: 1) the ARZ-MSA portion of the AOC data from commentaries of the Egyptian newspaper Al-Youm Al-Sabe', and 2) the DEV12 tune set (1219 sentences) which is the LDC2012E30 corpus BOLT Phase 1 dev-tune set." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-64", "text": "The DEV12 tune set was annotated by a native speaker of Arabic." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-65", "text": "sentence as being Egyptian dialect (ARZ) if s(t n 1 ) > 0." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-66", "text": "To train the weights w in Eq. 1, we use a linear SVM approach Fan et al., 2008) ." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-67", "text": "The trainer can easily handle a huge number of instances and features." 
}, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-68", "text": "The training data is given as instance-label pairs (x i , y i ) where i \u2208 {1, \u00b7 \u00b7 \u00b7 , l} and l is the number of training sentences." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-69", "text": "The x i are d-dimensional vectors of integer-valued features that count how often a binary feature fired for a tokenized sentence t n 1 ." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-70", "text": "y i \u2208 {+1, \u22121} are the class labels where a label of '+1' represents Egyptian dialect." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-71", "text": "During training, we solve the following optimization problem:" }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-72", "text": "i.e. we use L1 regularized L2-loss support vector classification." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-73", "text": "We set the penalty term C = 0.5." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-74", "text": "For our experiments, we use the data set provided in (Zaidan and Callison-Burch, 2011 ) which also has been used in the experiments in (Elfardy and Diab, 2013; Zaidan and Callison-Burch, 2014) ." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-75", "text": "We focus on the binary classification between MSA and ARZ." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-76", "text": "Details on the data sources can be found in Table 1 ." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-77", "text": "We present accuracy results in terms of 10-fold stratified cross-validation which are comparable to previously published work." 
}, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-78", "text": "----------------------------------" }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-79", "text": "**TOKENIZATION AND DICTIONARIES**" }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-80", "text": "The Arabic tokenizer used in the current paper is based on (Lee et al., 2003) ." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-81", "text": "It is a general purpose tokenizer which has been optimized towards improving machine translation quality of SMT systems rather than dialect classification." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-82", "text": "Together with the tokenized text, a maximum-entropy based tagger provides the part-of-speech (PoS) tags for each token." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-83", "text": "In addition, we have explored a range of features that are based on the output of the AIDA software package (Elfardy and Diab, 2012; Mona Diab et al., 2009 ." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-84", "text": "The AIDA software has been made available to the participants of the DARPA-funded Broad Operational Language Translation (BOLT) project." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-85", "text": "AIDA is a system for dialect identification, classification and glossing on the token and sentence level for written Arabic." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-86", "text": "AIDA aggregates several components including dictionaries and language models in order to perform named entity recognition, dialect identification classification, and MSA English linearized glossing of the input text." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-87", "text": "We created a dictionary from AIDA resources that includes about 41 000 ARZ tokens." 
}, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-88", "text": "In addition, we obtained a second small dictionary of about 70 ARZ dialect tokens with the help of a native speaker of Arabic." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-89", "text": "The list was created by training two IBM Model 1 lexicons, one on Egyptian Arabic data and another on MSA data." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-90", "text": "We then inspected the ARZ lexicon entries with the highest cosine distance to their MSA counterparts and kept the ones that are strong ARZ words." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-91", "text": "The tokens in both dictionaries are not ARZ exclusive, but could occur in MSA as well." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-92", "text": "----------------------------------" }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-93", "text": "**FEATURE SET**" }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-94", "text": "In our work, we employ a simple set of binary feature functions based on the tokenized Arabic sentence." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-95", "text": "For example, we define a token bigram feature as follows:" }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-96", "text": "Token unigram and trigram features are defined accordingly." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-97", "text": "We also define unigram, bigram, and trigram features based on PoS tags." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-98", "text": "Currently, just PoS unigrams are used in the experiments." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-99", "text": "We define dictionary-based features as follows:" }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-100", "text": "where we use the two dictionaries Dict 1 and Dict 2 as described in Section 3.1." 
}, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-101", "text": "The dictionaries are handled as token sets and we generate separate features for each of them." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-102", "text": "We generate some features based on the AIDA tool output." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-103", "text": "AIDA provides a dialect label for each input token t k as well as a single dialect label at the sentence level." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-104", "text": "A sentence-level binary feature based on the AIDA sentence level classification is defined as follows:" }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-105", "text": "where AIDA(t n 1 ) is the sentence-level classification of the AIDA tool." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-106", "text": "A word-level feature \u03c6 AIDA (t k ) is defined accordingly." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-107", "text": "These features improve the classification accuracy of our best system significantly." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-108", "text": "We have also experimented with some real-valued feature." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-109", "text": "For example, we derived a feature from dialect-specific language model probabilities:" }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-110", "text": "where log(p ARZ (t n 1 )) is the language-model log probability for the dialect class ARZ ." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-111", "text": "We used a trigram language model." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-112", "text": "p MSA (\u00b7) is defined accordingly." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-113", "text": "In addition, we have implemented a range of so-called 'meta' features similar to the ones defined in (Elfardy and Diab, 2013) ." 
}, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-114", "text": "For example, we define a feature \u03c6 Excl (t n 1 ) which is equal to the length of the longest consecutive sequence of exclamation marks in the tokenized sentence t n 1 ." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-166", "text": "The classifier is evaluated on data from the DARPA-funded BOLT project." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-115", "text": "Similarly, we define features that count the longest sequence of punctuation marks, the number of tokens, the averaged character-length of a token in the sentence, and the percentage of words with word-lengthening effects." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-116", "text": "These features do not directly model dialectalness of the data but rather try to capture the degree of in-formalness." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-117", "text": "Contrary to (Elfardy and Diab, 2013) we find that those features do not improve accuracy of our best model in the cross-validation experiments." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-118", "text": "On the DEV12 set, the use of the meta features results in a significant drop in accuracy." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-119", "text": "----------------------------------" }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-120", "text": "**EXPERIMENTS**" }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-121", "text": "In this section, we present experimental results." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-122", "text": "Firstly, Section 4.1 demonstrates that our data is annotated consistently." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-123", "text": "In Section 4.2, we present dialect prediction results in terms of accuracy and F-score on our two data sets." 
}, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-124", "text": "In Section 4.3, we perform some qualitative error analysis for our classifier." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-125", "text": "In Section 4.4, we present some preliminary effects on training a SMT system." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-126", "text": "----------------------------------" }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-127", "text": "**ANNOTATOR AGREEMENT**" }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-128", "text": "To confirm the consistent annotation of our data, we have measured some inter-annotator and intraannotator agreement on it." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-129", "text": "A native speaker of Arabic was asked to classify the ARZ-MSA portion of the dialect data using the following three labels: ARZ, MSA, Other." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-130", "text": "We randomly sampled 250 sentences from the ARZ-MSA portion of the Zaidan data maintaining the original dialect distribution." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-131", "text": "The confusion matrix is shown in Table 2 ." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-132", "text": "It corresponds to a kappa value of 0.84 (using the definition of (Fleiss, 1971)), which indicates a very high agreement." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-133", "text": "In addition, we did re-annotate a sub-set of 200 sentences from the DEV12 set over a time period of three months using our own annotator." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-134", "text": "The kappa value of the corresponding confusion matrix is 0.93, indicating very high agreement as well." 
}, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-135", "text": "----------------------------------" }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-136", "text": "**CLASSIFICATION EXPERIMENTS**" }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-137", "text": "Following previous work, we present dialect prediction results in terms of accuracy: Table 1 ." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-138", "text": "An in-lab annotator's dialect prediction is compared against the AOC data gold-standard dialect labels." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-139", "text": "where '# sent' is the number of sentences." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-140", "text": "In addition, we present dialect prediction results in terms of precision, recall, and F-score." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-141", "text": "They are defined as follows: Prec = # sent correctly tagged as ARZ # sent tagged as ARZ (7) Recall = # sent correctly tagged as ARZ # ref sent tagged as ARZ" }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-143", "text": "Experimental results are presented in Table 3 , where we present results for different sets of feature types and the two test sets in Table 1 ." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-144", "text": "In the top half of the table, results are presented in terms of 10-fold cross validation on the ARZ-MSA portion of the AOC data." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-145", "text": "In the bottom half, we present results on DEV12 tune set, where we use the entire dialect data in Table 1 for training (about 26K sentences)." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-146", "text": "As our baseline we have re-implemented the language-model-perplexity based approach reported in (Zaidan and Callison-Burch, 2011) ." 
}, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-147", "text": "We train language models on the dialect-labeled commentary training data for each of the dialect classes c \u2208 {MSA, ARZ}. During testing, we compute the language model probability of a sentence s for each of the classes c. We assign a sentence to the class c with the highest probability (or the lowest perplexity) ." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-148", "text": "For the 10-fold cross validation experiments, 10 language models are built and perplexities are computed on 10 different test sets." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-149", "text": "The resulting (averaged) accuracy is 83.3 % for cross-validation and 82.2 % on the DEV12 tune set." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-150", "text": "In comparison, (Elfardy and Diab, 2013) reports an accuracy of 80.4 % as perplexity-based baseline." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-151", "text": "We have carried out additional experiments with a simple feature set that consists of only unigram token and bigram token features as defined in Eq. 3." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-152", "text": "Such a system performs surprisingly well under both testing conditions: we achieved an accuracy of 87.7 % on the AOC data and an accuracy of 83.4 % on the DEV12 test set." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-153", "text": "On the AOC set using 10-fold cross validation, we achieve only a small improvement from using the dictionary features defined in Eq. 4." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-154", "text": "The accuracy is improved from 87.7 % to 88.0 %." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-155", "text": "On the DEV12 set, we obtain a much larger improvement from using these features." 
}, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-156", "text": "Furthermore, we have investigated the usefulness of the AIDA-based features." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-157", "text": "The stand-alone sentence-level classification of the AIDA tool performs quite poorly." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-158", "text": "On the DEV12 set, it achieves an accuracy of just 77.9 %. But using the AIDA assigned sentence-level and token-level dialect labels based on the binary features defined in Eq. 5 improves accuracy significantly, e.g. from 85.3 % to 87.8 % on the DEV12 set." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-159", "text": "In the current experiments, the so-called meta features which are computed at the sentence level do not improve classification accuracy." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-160", "text": "The meta features are only useful in classifying dialect data based on the in-formalness of the data, i.e. the ARZ news commentaries tend to exhibit more in-formalness than the MSA commentaries." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-161", "text": "Finally, the sentence-level perplexity feature defined in Eq. 6 did not improve accuracy as well (no results for this feature are presented in Table 3 )." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-162", "text": "----------------------------------" }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-163", "text": "**CLASSIFIER ANALYSIS**" }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-164", "text": "In this section, we perform a simple error analysis of the classifier performance on some dialect data for which the degree of dialectalness is known." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-165", "text": "The data comes from news sources that differ from the data used to train the classifier." 
}, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-168", "text": "The BOLT data consists of several corpora collected from various resources." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-169", "text": "These resources include newswire, web-logs, ARZ web forum data and others." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-170", "text": "Classification statistics are presented in Table 4 , where we report the number of sentences along with the percentage of those sentences classified as ARZ." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-171", "text": "The distribution of the dialect labels in the classifier output appears to correspond to the expected origin of the data." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-172", "text": "For example, the ARZ web forum data contains a majority of ARZ sentences, but quite a few sentences are MSA such as greetings and quotations from Islamic resources (Quran, Hadith ...)." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-173", "text": "The broadcast conversation data is mainly MSA, but sometimes the speaker switches to dialectal usage for a short phrase and then switches back to MSA." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-174", "text": "Lastly, the newswire data has a vast majority of MSA sentences." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-175", "text": "Examining a small portion of newswire sentences classified as ARZ, the sentences labeled as ARZ are mostly classification errors." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-176", "text": "Example sentence classifications from the BOLT data are shown in Table 5 ." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-177", "text": "The first two text fragments are taken from the Egyptian Arabic (ARZ) web forum data." 
}, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-178", "text": "In the first document fragment, the user starts with MSA sentences, then switches to Egyptian (ARZ) dialect marked by the ARZ indicator and using the prefix before a verb which is not allowed in MSA." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-179", "text": "The user then switches back to MSA." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-180", "text": "The classifier is able to classify the Egyptian Arabic (ARZ) sentence correctly." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-181", "text": "In the second document fragment, the user uses several Egyptian Arabic (ARZ) words." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-182", "text": "In the forth sentence no ARZ words exist, and the classifier correctly classifies the sentence as MSA." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-183", "text": "some sentences from the newswire corpus that are mis-classified." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-184", "text": "The first sentence contains the word which corresponds to the letter 'd' in the abbreviation 'tdk'." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-185", "text": "The word is contained in one of our ARZ dictionaries such that the binary AIDA-based feature in Eq. 5 fires and triggers a mis-classification." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-186", "text": "In this context, the word is part of an abbreviation which is split in the Arabic text." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-187", "text": "In the other examples, only a few of the binary features defined in Section 3.2 apply and features that correspond to Arabic prefixes tend to support a classification as ARZ dialect." 
}, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-188", "text": "----------------------------------" }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-189", "text": "**PRELIMINARY APPLICATION FOR SMT**" }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-190", "text": "The dialect classification of Arabic data for SMT can be used in various ways." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-191", "text": "Examples include domainspecific tuning, mixture modeling, and the use of so-called provenance features (Chiang et al., 2011) among others ." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-192", "text": "As a motivation for the future use of the dialect classifier in SMT, we classify the BOLT bilingual training data into ARZ and MSA parts and examine the effect on the phrase table scores." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-194", "text": "The ARZ web forum data is split into an ARZ part and an MSA part and two separate phrase probability tables are trained on these two splits." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-195", "text": "The ARZ web forum data is highly ambiguous with respect to dialect and it is difficult to obtain good dialect-dependent splits of the data." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-196", "text": "In the first example in the table, the word could mean 'Arab' in MSA, but in ARZ it could also mean 'car'." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-197", "text": "The phrase table scores obtained from the classifier-split training data correctly reflect this ambiguity." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-198", "text": "The phrase pair with 'car' has the lowest translation score for the BOLT.ARZ phrase Table 6 : Phrase tables based on classified training data." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-199", "text": "BOLT.ARZ is trained on the ARZ portion of the ARZ web forums data, while BOLT.MSA is trained on the MSA part." 
}, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-200", "text": "The table includes Arabic words and the top three phrase translation candidates, sorted (first is best) by the phrase model cost (cost= \u2212log(p(f |e)) )." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-201", "text": "In the second example, the word could function as a proper noun with its English translation 'mursi' or 'marsa', but only in ARZ it could also be translated as 'thanks' ('merci')." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-202", "text": "In this case, the classifier is unable to distinguish between the ARZ dialect and the MSA usage." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-203", "text": "We found out that the word token 'merci' appears only 4 times in the training data, rendering its binary features unreliable reliable." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-204", "text": "In general we note that the phrase tables build on the classified data become more domainspecific, and it is left to future work to check whether improvements could carry over to the translation quality." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-205", "text": "----------------------------------" }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-206", "text": "**DISCUSSION AND FUTURE WORK**" }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-207", "text": "The ultimate goal is to use the ARZ vs. MSA dialect classifier for training an adapted SMT system." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-208", "text": "We split the training data at the sentence level using our classifier and train dialect-specific systems on each of these splits along with a general dialect-independent system." 
}, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-209", "text": "We will be using techniques similar to (Koehn and Schroeder, 2007; Chiang et al., 2011; Sennrich, 2012; Chen et al., 2013) to adapt the general SMT system to a target domain with a predominant dialect." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-210", "text": "Or, we will be adopting an SMT system to a development or test set where we use the classifier to predict the dialect for each sentence and use a dialect-specific SMT system on each of them individually." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-211", "text": "Our approach of using just binary feature functions in connection with a sentence-level global linear model can be related to work on PoS-tagging (Collins, 2002) ." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-212", "text": "(Collins, 2002 ) trains a linear model based on Viterbi decoding and the perceptron algorithm." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-213", "text": "The gold-standard PoS tags are given at the word-level, but the training uses a global representation at the sentence level." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-214", "text": "Similarly, we use linear SVMs to train a classification model at the sentence level without access to sentence length statistics, i.e. our best performing classifier does not compute features like the percentage of punctuation, numbers, or averaged word length as has been proposed previously (Elfardy and Diab, 2013) ." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-215", "text": "All of our features are actually computed at the token level (with the exception of a single sentence-level AIDA-based feature)." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-216", "text": "An interesting direction for future work could be to train the dialect classifier at the sentence level, but use it to compute token-level predictions for a more fine-grained analysis." 
}, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-217", "text": "Even though the token-level prediction task corresponds to a word-level tag set of just size 2, Viterbi decoding techniques could be used to introduce novel context-dependent features, e.g. dialect tag n-gram features." }, { "sent_id": "b86a5a8ec1f27354057bb45ff27588-C001-218", "text": "Such a token-level predictions might be used for weighting each phrase pair in an SMT system using methods like the instance-based adaptation approach in (Foster et al., 2010) ." } ], "y": { "@BACK@": { "gold_contexts": [ [ "b86a5a8ec1f27354057bb45ff27588-C001-11" ] ], "cite_sentences": [ "b86a5a8ec1f27354057bb45ff27588-C001-11" ] }, "@SIM@": { "gold_contexts": [ [ "b86a5a8ec1f27354057bb45ff27588-C001-20", "b86a5a8ec1f27354057bb45ff27588-C001-23" ], [ "b86a5a8ec1f27354057bb45ff27588-C001-74" ], [ "b86a5a8ec1f27354057bb45ff27588-C001-113" ] ], "cite_sentences": [ "b86a5a8ec1f27354057bb45ff27588-C001-20", "b86a5a8ec1f27354057bb45ff27588-C001-23", "b86a5a8ec1f27354057bb45ff27588-C001-74", "b86a5a8ec1f27354057bb45ff27588-C001-113" ] }, "@MOT@": { "gold_contexts": [ [ "b86a5a8ec1f27354057bb45ff27588-C001-50" ] ], "cite_sentences": [ "b86a5a8ec1f27354057bb45ff27588-C001-50" ] }, "@USE@": { "gold_contexts": [ [ "b86a5a8ec1f27354057bb45ff27588-C001-74" ] ], "cite_sentences": [ "b86a5a8ec1f27354057bb45ff27588-C001-74" ] }, "@DIF@": { "gold_contexts": [ [ "b86a5a8ec1f27354057bb45ff27588-C001-117" ], [ "b86a5a8ec1f27354057bb45ff27588-C001-149", "b86a5a8ec1f27354057bb45ff27588-C001-150" ], [ "b86a5a8ec1f27354057bb45ff27588-C001-214" ] ], "cite_sentences": [ "b86a5a8ec1f27354057bb45ff27588-C001-117", "b86a5a8ec1f27354057bb45ff27588-C001-150", "b86a5a8ec1f27354057bb45ff27588-C001-214" ] } } }, "ABC_4d528117dd7751d0cd6413430e1ec1_15": { "x": [ { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-2", "text": "Bilingual lexicon extraction has been studied 
for decades and most previous methods have relied on parallel corpora or bilingual dictionaries." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-3", "text": "Recent studies have shown that it is possible to build a bilingual dictionary by aligning monolingual word embedding spaces in an unsupervised way." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-4", "text": "With the recent advances in generative models, we propose a novel approach which builds cross-lingual dictionaries via latent variable models and adversarial training with no parallel corpora." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-5", "text": "To demonstrate the effectiveness of our approach, we evaluate our approach on several language pairs and the experimental results show that our model could achieve competitive and even superior performance compared with several state-of-the-art models." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-6", "text": "----------------------------------" }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-8", "text": "Learning the representations of languages is a fundamental problem in natural language processing and most existing methods exploit the hypothesis that words occurring in similar contexts tend to have similar meanings (Pennington et al., 2014; Bojanowski et al., 2017) , which could lead word vectors to capture semantic information." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-9", "text": "Mikolov et al. (2013) first point out that word embeddings learned on separate monolingual corpora exhibit similar structures." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-10", "text": "Based on this finding, they suggest it is possible to learn a linear mapping from a source to a target embedding space and then generate bilingual dictionaries." 
}, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-11", "text": "This simple yet effective approach has led researchers to investigate on improving cross-lingual word embeddings with the help of bilingual word lexicons (Faruqui and Dyer, 2014; Xing et al., 2015) ." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-12", "text": "For low-resource languages and domains, crosslingual signal would be hard and expensive to obtain, and thus it is necessary to reduce the need for bilingual supervision." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-13", "text": "Artetxe et al. (2017) successfully learn bilingual word embeddings with only a parallel vocabulary of aligned digits." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-14", "text": "Zhang et al. (2017) utilize adversarial training to obtain cross-lingual word embeddings without any parallel data." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-15", "text": "However, their performance is still significantly worse than supervised methods." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-16", "text": "By combining the merits of several previous works, Conneau et al. (2018) introduce a model that reaches and even outperforms supervised state-of-the-art methods with no parallel data." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-17", "text": "In recent years, generative models have become more and more powerful." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-18", "text": "Both Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) and Variational Autoencoders (VAEs) (Kingma and Welling, 2014) are prominent ones." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-19", "text": "In this work, we borrow the ideas from both GANs and VAEs to tackle the problem of bilingual lexicon induction." 
}, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-20", "text": "The basic idea is to learn latent variables that could capture semantic meaning of words, which would be helpful for bilingual lexicon induction." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-21", "text": "We also utilize adversarial training for our model and require no form of supervision." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-22", "text": "We evaluate our approach on several language pairs and experimental results demonstrate that our model could achieve promising performance." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-23", "text": "We further combine our model with several helpful techniques and show our model could perform competitively and even superiorly compared with several state-of-the-art methods." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-24", "text": "----------------------------------" }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-25", "text": "**RELATED WORK**" }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-26", "text": "----------------------------------" }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-27", "text": "**BILINGUAL LEXICON INDUCTION**" }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-28", "text": "Extracting bilingual lexica has been studied by researchers for a long time." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-29", "text": "Mikolov et al. (2013) first observe there is isomorphic structure among word embeddings trained separately on monolingual corpora and they learn the linear transformation between languages." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-30", "text": "Zhang et al. (2016b) improve the method by constraining the transformation matrix to be orthogonal." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-31", "text": "Xing et al. (2015) maximize the cosine similarity instead. 
They point out that adding an orthogonality constraint can improve performance and has a closed-form solution, which was referred to as Procrustes approach in Smith et al. (2017) ." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-32", "text": "Canonical correlation analysis has also been used to map both languages to a shared vector space (Faruqui and Dyer, 2014; Lu et al., 2015) ." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-33", "text": "To reduce the need for supervision signals, Artetxe et al. (2017) use identical digits and numbers to form an initial seed dictionary and then iteratively refine their results until convergence." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-34", "text": "Zhang et al. (2017) apply adversarial training to align monolingual word vector spaces with no supervision." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-35", "text": "Conneau et al. (2018) improve the model by combining adversarial training and Procrustes approach, and their unsupervised approach could reach and even outperform state-of-the-art supervised approaches." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-36", "text": "In this work, we make further improvements and enhance the model proposed in (Conneau et al., 2018 ) with latent variable model and iterative training procedure." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-37", "text": "----------------------------------" }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-38", "text": "**GENERATIVE MODELS**" }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-39", "text": "VAEs (Kingma and Welling, 2014) represent one of the most successful deep generative models." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-40", "text": "Standard VAEs assume observed variables are generated from latent variables and the latent variables are sampled from a simple Gaussian distribution." 
}, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-41", "text": "Typically, VAEs utilize an neural inference model to approximate the intractable posterior, and optimize model parameters jointly with a reparameterized variational lower bound." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-42", "text": "VAEs have been successfully applied in several natural language processing tasks before (Zhang et al., 2016a; Bowman et al., 2016) ." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-43", "text": "GANs (Goodfellow et al., 2014) are another framework for estimating generative models via an adversarial process and have attracted huge attention." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-44", "text": "The basic strategy is to train a generative model and a discriminative model simultaneously via an adversarial process." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-45", "text": "Adversarial training technique for matching distribution has proven to be powerful in a variety of tasks (Bowman et al., 2016) ." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-46", "text": "Adversarial Autoencoder (Makhzani et al., 2015) is a probabilistic autoencoder that uses the GANs to perform variational inference." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-47", "text": "By combining a VAE with a GAN, Larsen et al. (2016) use learned feature representations in the GAN discriminator as the basis for the VAE reconstruction objective." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-48", "text": "GANs have been applied in machine translation before (Yang et al., 2018; ." 
}, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-49", "text": "----------------------------------" }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-50", "text": "**PROPOSED APPROACH**" }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-51", "text": "In this section, we first briefly introduce VAEs, and then we illustrate the details and training techniques of our proposed model." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-52", "text": "----------------------------------" }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-53", "text": "**VARIATIONAL AUTOENCODER**" }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-54", "text": "Variational Autoencoders (VAEs) are deep generative model which are capable of learning complex density models for data via latent variables." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-55", "text": "Given a nonlinear generative model p \u03b8 (x|z) with input x \u2208 R D and associated latent variable z \u2208 R L drawn from a prior distribution p 0 (z), the goal of VAEs is to use a recognition model, q \u03c6 (z|x) to approximate the posterior distribution of the latent variables by maximizing the following variational lower bound" }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-56", "text": "where KL refers to Kullback-Leibler divergence." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-57", "text": "----------------------------------" }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-58", "text": "**OUR MODEL**" }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-59", "text": "Basically, our model assumes that the source word embedding {x n } and the target word embedding {y n } could be drawn from a same latent variable space {z n }, where {z n } is capable of capturing semantic meaning of words." 
}, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-60", "text": "In contrast to the standard VAE prior that assumes each latent embedding z n to be drawn from the same latent Gaussian, our model just requires the distribution of latent variables for source and target word embeddings to be equal." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-61", "text": "To achieve such a goal, we utilize adversarial training to guide the two latent distributions to match with each other." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-62", "text": "As in adversarial training, we have networks \u03c6 s and \u03c6 t for both source and target space, striving to map words into the same latent space, while the discriminator D is a binary classifier which tries to distinguish between the two languages." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-63", "text": "We also have reconstruction networks \u03b8 s and \u03b8 t as in VAEs." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-64", "text": "The objective function for the discriminator D could be formulated as" }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-65", "text": "For the source side, the objective is to minimize" }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-66", "text": "Here we define q \u03c6s (z|x) = N (\u00b5 s (x), \u03a3 s (x)), where \u00b5 s (x) = W \u00b5s x and \u03a3 s (x) = exp(W \u03c3s x); W \u00b5s and W \u03c3s are learned parameters." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-67", "text": "We also define the mean of p \u03b8s (x|z) to be W T \u00b5s z. The objective function and structure for \u03c6 t are similar." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-68", "text": "The basic framework of our model is shown in Figure 1 ." 
}, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-69", "text": "As we could see from the figure, our model tries to map the source and target word embedding into the same latent space which could capture the semantic meaning of words." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-70", "text": "Theoretical analysis has revealed that adversarial training tries to minimize the Jensen-Shannon (JS) divergence between the real and fake distribution." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-71", "text": "Therefore, one can view our model as replace KL divergence in Equation 1 with JS divergence and change the Gaussian prior to the target distribution." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-72", "text": "----------------------------------" }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-73", "text": "**TRAINING STRATEGY**" }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-74", "text": "Our model has two generators \u03c6 s and \u03c6 t , and we have found that training them jointly would be extremely unstable." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-75", "text": "In this paper, we propose an iterative method to train our models." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-76", "text": "Basically, we first initialize W \u00b5t to be identity matrix and train \u03c6 s and \u03b8 s on the source side." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-77", "text": "After convergence, we freeze W \u00b5s , and then train \u03c6 t and \u03b8 t in the target side." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-78", "text": "The pseudo-code for this process is shown in Algorithm 1." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-79", "text": "It should be noted that there is no variance once completing training." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-80", "text": "Our experiments could be divided into two parts." 
}, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-81", "text": "In the first part, we conduct experiments on smallscale datasets and our main baseline is Zhang et al. (2017) ." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-82", "text": "In the second part, we combine our model with several advanced techniques and we compare our model with Conneau et al. (2018)" }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-83", "text": "----------------------------------" }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-84", "text": "**SMALL-SCALE DATASETS**" }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-85", "text": "In this section, our experiments focus on smallscale datasets and our main baseline model is adversarial autoencoder (Zhang et al., 2017) ." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-86", "text": "For justice, we use the same model selection strategy with Zhang et al. (2017) , i.e. we choose the model whose sum of reconstruction loss and classification accuracy is the least." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-87", "text": "The source and target word embeddings would be first mapped into the latent space." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-88", "text": "For each source word embedding x, it would be first transformed into z x ." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-89", "text": "The the its k nearest target embeddings would be retrieved and be compared against the entry in a ground truth bilingual lexicon." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-90", "text": "Performance is measured by top-1 accuracy." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-91", "text": "----------------------------------" }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-92", "text": "**EXPERIMENTS ON CHINESE-ENGLISH DATASET**" }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-93", "text": "For this set of experiments, we use the same data as Zhang et al. (2017) ." 
}, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-94", "text": "The statistics of the final training data is given in Table 1 ." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-95", "text": "We use Chinese-English Translation Lexicon Version 3.0 (LDC2002L27) as our ground truth bilingual lexicon for evaluation." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-96", "text": "The baseline models are MonoGiza system (Dou et al., 2015) , translation matrix (TM) (Mikolov et al., 2013) , isometric alignment (IA) (Zhang et al., 2016b) and adversarial training approach (Zhang et al., 2017) ." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-97", "text": "Table 2 summarizes the performance of baseline models and our approach." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-98", "text": "The results of baseline models are cited from Zhang et al. (2017) ." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-99", "text": "As we can see from the table, our model could achieve superior performance compared with other baseline models." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-100", "text": "----------------------------------" }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-101", "text": "**EXPERIMENTS ON OTHER LANGUAGE PAIRS DATASETS**" }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-102", "text": "We also conduct experiments on Spanish-English and Italian-English language pairs." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-103", "text": "Again, we use the same dataset with Zhang et al. (2017) . and the statistics are shown in The experimental results are shown in Table 4 ." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-104", "text": "Because Spanish, Italian and English are closely related languages, the accuracy would be higher than the Chinese-English dataset." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-105", "text": "Our model is able to outperform baseline model in this setting." 
}, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-106", "text": "----------------------------------" }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-107", "text": "**MODEL**" }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-108", "text": "Accuracy (" }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-109", "text": "----------------------------------" }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-110", "text": "**LARGE-SCALE DATASETS**" }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-111", "text": "In this section, we integrate our method with Conneau et al. (2018) , whose method improves Zhang et al. (2017) by more sophiscated refinement procedure and validation criterion." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-112", "text": "We replace their first step, namely the adversarial training step, with our model." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-113", "text": "Basically, we first map the source and target embeddings into the latent space using our algorithm, and then fine-tune the identity mapping in the latent space with the closed-form Procrustes solution." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-114", "text": "We use their similarity measure, namely cross-domain similarity local scaling (CSLS), to produce reliable matching pairs and validation criterion for unsupervised model selection." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-115", "text": "We conduct experiments on English-Spanish, English-Russian and English-Chinese datasets, which are the same as Conneau et al. (2018) ." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-116", "text": "The results are shown in Table 5 ." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-117", "text": "As seen, our model could consistently achieve better performance compared with adversarial training." 
}, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-118", "text": "After refinement, our model could further achieve competitive and even superior results compared with state-of-the-art unsupervised methods." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-119", "text": "This further demonstrates the capacity of our model." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-120", "text": "----------------------------------" }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-121", "text": "**CONCLUSION**" }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-122", "text": "Based on the assumption that word vectors in different languages could be drawn from a same latent variable space, we propose a novel approach which builds cross-lingual dictionaries via latent variable models and adversarial training with no parallel corpora." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-123", "text": "Experimental results on several language pairs have demonstrated the effectiveness and universality of our model." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-124", "text": "We hope our method could be beneficial to other areas such as unsupervised machine translation ." }, { "sent_id": "4d528117dd7751d0cd6413430e1ec1-C001-125", "text": "Future directions include validate our model on more realistic scenarios (Dinu et al., 2015) as well as combine our algorithms with more sophiscated adversarial networks Gulrajani et al., 2017) ." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "4d528117dd7751d0cd6413430e1ec1-C001-14", "4d528117dd7751d0cd6413430e1ec1-C001-15", "4d528117dd7751d0cd6413430e1ec1-C001-34" ] ], "cite_sentences": [ "4d528117dd7751d0cd6413430e1ec1-C001-14", "4d528117dd7751d0cd6413430e1ec1-C001-34" ] }, "@SIM@": { "gold_contexts": [ [ "4d528117dd7751d0cd6413430e1ec1-C001-81" ], [ "4d528117dd7751d0cd6413430e1ec1-C001-85", "4d528117dd7751d0cd6413430e1ec1-C001-86" ], [ "4d528117dd7751d0cd6413430e1ec1-C001-96" ], [ "4d528117dd7751d0cd6413430e1ec1-C001-103" ] ], "cite_sentences": [ "4d528117dd7751d0cd6413430e1ec1-C001-81", "4d528117dd7751d0cd6413430e1ec1-C001-85", "4d528117dd7751d0cd6413430e1ec1-C001-86", "4d528117dd7751d0cd6413430e1ec1-C001-96", "4d528117dd7751d0cd6413430e1ec1-C001-103" ] }, "@USE@": { "gold_contexts": [ [ "4d528117dd7751d0cd6413430e1ec1-C001-93" ], [ "4d528117dd7751d0cd6413430e1ec1-C001-96" ], [ "4d528117dd7751d0cd6413430e1ec1-C001-98", "4d528117dd7751d0cd6413430e1ec1-C001-99" ], [ "4d528117dd7751d0cd6413430e1ec1-C001-103" ] ], "cite_sentences": [ "4d528117dd7751d0cd6413430e1ec1-C001-93", "4d528117dd7751d0cd6413430e1ec1-C001-96", "4d528117dd7751d0cd6413430e1ec1-C001-98", "4d528117dd7751d0cd6413430e1ec1-C001-103" ] }, "@DIF@": { "gold_contexts": [ [ "4d528117dd7751d0cd6413430e1ec1-C001-98", "4d528117dd7751d0cd6413430e1ec1-C001-99" ], [ "4d528117dd7751d0cd6413430e1ec1-C001-111" ] ], "cite_sentences": [ "4d528117dd7751d0cd6413430e1ec1-C001-98", "4d528117dd7751d0cd6413430e1ec1-C001-111" ] }, "@EXT@": { "gold_contexts": [ [ "4d528117dd7751d0cd6413430e1ec1-C001-111" ] ], "cite_sentences": [ "4d528117dd7751d0cd6413430e1ec1-C001-111" ] } } }, "ABC_70dc108166d6b5fb9da39c451c3229_16": { "x": [ { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-2", "text": "This paper presents the ColumbiaNLP submission for the FEVER Workshop Shared Task." 
}, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-3", "text": "Our system is an end-to-end pipeline that extracts factual evidence from Wikipedia and infers a decision about the truthfulness of the claim based on the extracted evidence." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-4", "text": "Our pipeline achieves significant improvement over the baseline for all the components (Document Retrieval, Sentence Selection and Textual Entailment) both on the development set and the test set." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-5", "text": "Our team finished 6th out of 24 teams on the leader-board based on the preliminary results with a FEVER score of 49.06 on the blind test set compared to 27.45 of the baseline system." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-6", "text": "----------------------------------" }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-7", "text": "**INTRODUCTION AND BACKGROUND**" }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-8", "text": "Fact checking is a type of investigative journalism where experts examine the claims published by others for their veracity." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-9", "text": "The claims can range from statements made by public figures to stories reported by other publishers." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-10", "text": "The end goal of a fact checking system is to provide a verdict on whether the claim is true, false, or mixed." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-11", "text": "Several organizations such as FactCheck.org and PolitiFact are devoted to such activities." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-12", "text": "The FEVER Shared task aims to evaluate the ability of a system to verify information using evidence from Wikipedia." 
}, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-13", "text": "Given a claim involving one or more entities (mapping to Wikipedia pages), the system must extract textual evidence (sets of sentences from Wikipedia pages) that supports or refutes the claim and then using this evidence, it must label the claim as Supported, Refuted or NotEnoughInfo." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-14", "text": "The dataset for the shared task was introduced by Thorne et al. (2018) and consists of 185,445 claims." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-15", "text": "Table 1 shows three instances from the data set with the claim, the evidence and the verdict." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-16", "text": "Table 1 : Examples of claims, the extracted evidence from Wikipedia and the verdicts from the shared task dataset (Thorne et al., 2018) The baseline system described by Thorne et al. (2018) uses 3 major components:" }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-17", "text": "\u2022 Document Retrieval: Given a claim, identify relevant documents from Wikipedia which contain the evidence to verify the claim." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-18", "text": "Thorne et al. (2018) used the document retrieval component from the DrQA system (Chen et al., 2017) , which returns the k nearest documents for a query using cosine similarity between binned unigram and bigram TF-IDF vectors." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-19", "text": "\u2022 Sentence Selection: Given the set of retrieved document, identify the candidate evidence sentences." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-20", "text": "Thorne et al. (2018) used a modified document retrieval component of DrQA (Chen et al., 2017) to select the top most similar sentences w.r.t the claim, using bigram TF-IDF with binning." 
}, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-21", "text": "\u2022 Textual Entailment: For the entailment task, training is done using labeled claims paired with evidence (labels are SUPPORTS, REFUTES, NOT ENOUGH INFO)." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-22", "text": "Thorne et al. (2018) used the decomposable attention model (Parikh et al., 2016) for this task." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-23", "text": "For the case where multiple sentences are required as evidence, the strings were concatenated." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-24", "text": "Our system implements changes in all three modules (Section 2), which leads to significant improvements both on the development and test sets." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-25", "text": "On the shared task development set our document retrieval approach covers 94.4% of the claims requiring evidence, compared to 55.30% in the baseline." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-26", "text": "Further, on the dev set our evidence recall is improved by 33 points over the baseline." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-27", "text": "For entailment, our model improves the baseline by 7.5 points on dev set." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-28", "text": "Overall, our end-to-end system shows an improvement of 19.56 in FEVER score compared to the baseline (50.83 vs. 31.27) on the dev set ." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-29", "text": "On the blind test set we achieve an evidence recall of 75.89 and an entailment accuracy of 57.45 (9 points above baseline) resulting in a FEVER score of 49.06 (Section 3)." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-30", "text": "Together with the results we discuss some lessons learned based on our error analysis and release our code 1 ." 
}, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-31", "text": "----------------------------------" }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-32", "text": "**METHODS**" }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-33", "text": "----------------------------------" }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-34", "text": "**DOCUMENT RETRIEVAL**" }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-35", "text": "Document Retrieval is a crucial step when building an end-to-end system for fact extraction and verification." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-36", "text": "Missing a relevant document could lead to missed evidence, while non-relevant documents would add noise for the subsequent tasks of sentence selection and textual entailment." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-37", "text": "We propose a multi-step approach for retrieving documents relevant to the claims." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-38", "text": "\u2022 Google Custom Search API: Wang et al. (2018) looked at retrieving relevant documents for fact-checking articles, looking at generating candidates via search." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-39", "text": "Inspired by this, we first use the Custom Search API of Google to retrieve documents having information about the claim." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-40", "text": "We add the token 1 https://github.com/tuhinjubcse/FEVER-EMNLP wikipedia to the claim and issue a query and collect the top 2 results." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-41", "text": "\u2022 Named Entity Recognition: Second, we use the AllenNLP (Gardner et al., 2017) pretrained bidirectional language model for named entity recognition 2 ." 
}, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-42", "text": "After finding the named entities in the claim, we use Wikipedia python API 3 to collect the top wikipedia document returned by the API for each named entity." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-43", "text": "\u2022 Dependency Parse: Third, to increase the chance of detecting relevant entities in the claim, we find the first lower case verb phrase (VP) in the dependency parse tree and query the Wikipedia API with all the tokens before the VP." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-44", "text": "The reason for emphasizing lower case verb phrase is to avoid missing entities in claims such as \"Finding Dory was directed by X\", where the relevant entity is \"Finding Dory\"." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-45", "text": "To deal with entity ambiguity, we also add the token film in our query where the claim contains keywords such as film, stars, premiered and directed by." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-46", "text": "For example in \"Marnie was directed by Whoopi Goldberg.\", Marnie can refer to both wikipedia pages Marnie (film) and Marnie." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-47", "text": "Our point of interest here is Marnie (film)." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-48", "text": "We only experimented with film to capture the performance gains." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-49", "text": "One of our future goals is to build better computational models to handle entity ambiguity or entity linking." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-50", "text": "\u2022 Combined: We use the union of the documents returned by the three approaches as the final set of relevant documents to be used by the Sentence Selection module." 
}, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-51", "text": "Table 2 shows the percentage of claims that can be fully supported or refuted by the retrieved docu-ments before sentence selection on the dev set." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-52", "text": "We see that our best approach (combined) achieved a high coverage 94.4% compared to the baseline (Thorne et al., 2018) of 55.3%." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-53", "text": "Because we do not have the gold evidences for the blind test set we cannot report the claim coverage using our pipeline ." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-54", "text": "----------------------------------" }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-55", "text": "**SENTENCE SELECTION**" }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-56", "text": "For sentence selection, we used the modified document retrieval component of DrQA (Chen et al., 2017) to select sentences using bigram TF-IDF with binning as proposed by (Thorne et al., 2018) ." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-57", "text": "We extract the top 5 most similar sentences from the k-most relevant documents using the TF-IDF vector similarity." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-58", "text": "Our evidence recall is 78.4 as compared to 45.05 in the development set of FEVER (Thorne et al., 2018) , which demonstrates the importance of document retrieval in fact extraction and verification." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-59", "text": "On the blind test set our sentence selection approach achieves an evidence recall of 75.89." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-60", "text": "However, even though TF-IDF proves to be a strong baseline for sentence selection we noticed on the dev set that using all 5 evidences together introduced additional noise to the entailment model." 
}, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-61", "text": "To solve this, we further filtered the top 3 evidences from the selected 5 evidences using distributed semantic representations." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-62", "text": "Peters et al. (2018) show how deep contextualized word representations model both complex characteristics of word use (e.g., syntax and semantics), and usage across various linguistic contexts." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-63", "text": "Thus, we used the ELMo embeddings to convert the claim and evidence to vectors." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-64", "text": "We then calculated cosine similarity between claim and evidence vectors and extracted the top 3 sentences based on the score." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-65", "text": "Because there was no penalty involved for poor evidence precision, we returned all five selected sentences as our predicted evidence but used only the top three sentences for the entailment model." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-66", "text": "----------------------------------" }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-67", "text": "**TEXTUAL ENTAILMENT**" }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-68", "text": "The final stage of our pipeline is recognizing textual entailment." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-69", "text": "Unlike Thorne et al. (2018) , we did not concatenate evidences, but trained our model for each claim-evidence pair." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-70", "text": "For recognizing textual entailment we used the model introduced by Conneau et al. (2017) in their work on supervised learning of universal sentence representations." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-71", "text": "The architecture is presented in Figure 1 ." 
}, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-72", "text": "We use bidirectional LSTMs (Hochreiter and Schmidhuber, 1997) with max-pooling to encode the claim and the evidence." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-73", "text": "The text encoder provides dense feature representation of an input claim or evidence." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-74", "text": "Formally, for a sequence of T words w t\"1,..,T , the BiLSTM layer generates a sequence of h t vectors, where h t is the concatenation of a forward and a backward LSTM output." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-75", "text": "The hidden vectors h t are then converted into a single vector using max-pooling, which chooses the maximum value over each dimension of the hidden units." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-76", "text": "Overall, the text encoder can be treated as an operator Text \u00d1 R d that provides d dimensional encoding for a given text." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-77", "text": "Out of vocabulary issues in pre-trained word embeddings are a major bottleneck for sentence representations." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-78", "text": "To solve this we use fastText embeddings (Bojanowski et al., 2017) which rely on subword information." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-79", "text": "Also, these embeddings were trained on Wikipedia corpus making them an ideal choice for this task." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-80", "text": "As shown in Figure 1 , the shared sentence encoder outputs a representation for the claim u and the evidence v. Once the sentence vectors are generated, the following three methods are applied to extract relations between the claim and the evidence: (i) concatenation of the two representations (u, v); (ii) element-wise product u*v and (iii) absolute element-wise difference |u\u00b4v|." 
}, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-81", "text": "The resulting vector, which captures information from both the claim and the evidence, is fed into a 3-class classifier consisting of fully connected layers culminating in a softmax layer." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-82", "text": "For the final class label, we experimented first by taking the majority prediction of the three (claim, evidence) pairs as our entailment label but this led to lower accuracy on the dev set." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-83", "text": "So our final predictions are based on the rule outlined in the Algorithm 1, where SUPPORTS = S, REFUTES = R, NOT ENOUGH INFO = N and C is a count function." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-84", "text": "Because the selected evidences were inherently noisy and our pipeline did not concatenate evidences together we chose this rule over majority prediction to mitigate the dominance of prediction of NOT ENOUGH INFO class." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-85", "text": "----------------------------------" }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-86", "text": "**ALGORITHM 1 PREDICTION RULE**" }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-87", "text": "We also experimented by training a classifier which takes confidence scores of all the three claim evidence pairs along with their position in the document and trained a boosted tree classifier but the accuracy did not improve." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-88", "text": "Empirically the rule gave us the best results on the dev set and thus used it to obtain the final label." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-89", "text": "Table 3 shows the 3 way classification accuracy using the textual entailment model described above." 
}, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-90", "text": "----------------------------------" }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-91", "text": "**DATASET**" }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-92", "text": "Accuracy Shared Task Dev 58.77 Blind Test Set 57.45 Table 3 : 3 way classification results" }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-93", "text": "Our entailment accuracy on the shared task dev and test set is 7 and 9 points better than the baseline respectively." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-94", "text": "Implementation Details." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-95", "text": "The batch size is kept as 64." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-96", "text": "The model is trained for 15 epochs using Adam optimizer with a learning rate of 0.001." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-97", "text": "The size of the LSTM hidden units is set to 512 and for the classifier, we use a MLP with 1 hidden-layer of 512 hidden units." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-98", "text": "The embedding dimension of the words is set to 300." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-99", "text": "Table 4 shows the overall FEVER score obtained by our pipeline on the dev and test sets." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-100", "text": "In the provisional ranking our system ranked 6th." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-101", "text": "----------------------------------" }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-102", "text": "**END TO END RESULTS AND ERROR ANALYSIS**" }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-103", "text": "On closer investigation we find that neither TF-IDF nor sentence embedding based approaches are Table 5 : Cosine similarity between claim and supporting evidence Table 5 goes on to prove that we cannot rely on models that entirely depend on semantics." 
}, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-104", "text": "In spite of the two sentences being similar, the cosine similarity between them is poor mostly because the evidence contains a lot of extra information which might not be relevant to the claim and difficult for the model to understand." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-105", "text": "We also found instances where the predicted evidence is correct but it does not match the gold evidence." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-106", "text": "For the claim \"Aristotle spent time in Athens\", both evidences given in Table 6 support it, but still our system gets penalized on not being able to match the gold evidence." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-107", "text": "We found quite a few annotations to be incorrect and hence the FEVER scores are lower than expected." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-108", "text": "Table 7 show two instances where the gold labels for the claims was NOT ENOUGH INFO, while in fact they should have been SUPPORTS and REFUTES, respectively." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-109", "text": "Table 8 ." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-110", "text": "Our models need better understanding of semantics to be able to identify these." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-111", "text": "Table 9 shows one such example where the gospel keyword becomes the discriminative factor." 
}, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-112", "text": "Claim: Happiness in Slavery is a gospel song by Nine Inch Nails Evidence: Happiness in Slavery,is a song by American industrial rock band Nine Inch Nails from their debut extended play (EP), Broken(1992)" }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-113", "text": "----------------------------------" }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-114", "text": "**CONCLUSION**" }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-115", "text": "The FEVER shared task is challenging primarily because the annotation requires substantial manual effort." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-116", "text": "We presented an end-to-end pipeline to automate the human effort and showed empirically that our model outperforms the baseline by a large margin." }, { "sent_id": "70dc108166d6b5fb9da39c451c3229-C001-117", "text": "We also provided a thorough error analysis which highlights some of the shortcomings of our models and potentially of the gold annotations." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "70dc108166d6b5fb9da39c451c3229-C001-14" ], [ "70dc108166d6b5fb9da39c451c3229-C001-16" ], [ "70dc108166d6b5fb9da39c451c3229-C001-18" ], [ "70dc108166d6b5fb9da39c451c3229-C001-20" ], [ "70dc108166d6b5fb9da39c451c3229-C001-22" ] ], "cite_sentences": [ "70dc108166d6b5fb9da39c451c3229-C001-14", "70dc108166d6b5fb9da39c451c3229-C001-16", "70dc108166d6b5fb9da39c451c3229-C001-18", "70dc108166d6b5fb9da39c451c3229-C001-20", "70dc108166d6b5fb9da39c451c3229-C001-22" ] }, "@DIF@": { "gold_contexts": [ [ "70dc108166d6b5fb9da39c451c3229-C001-52" ], [ "70dc108166d6b5fb9da39c451c3229-C001-58" ], [ "70dc108166d6b5fb9da39c451c3229-C001-69" ] ], "cite_sentences": [ "70dc108166d6b5fb9da39c451c3229-C001-52", "70dc108166d6b5fb9da39c451c3229-C001-58", "70dc108166d6b5fb9da39c451c3229-C001-69" ] }, "@SIM@": { "gold_contexts": [ [ "70dc108166d6b5fb9da39c451c3229-C001-56" ] ], "cite_sentences": [ "70dc108166d6b5fb9da39c451c3229-C001-56" ] }, "@USE@": { "gold_contexts": [ [ "70dc108166d6b5fb9da39c451c3229-C001-56" ] ], "cite_sentences": [ "70dc108166d6b5fb9da39c451c3229-C001-56" ] } } }, "ABC_195f41862b929318787aad9d8e5a1c_16": { "x": [ { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-52", "text": "**FASTFUSIONNET**" }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-107", "text": "We follow their data pre-processing procedure." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-108", "text": "Training procedure." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-2", "text": "In this technical report, we introduce FastFusionNet, an efficient variant of FusionNet [12] ." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-104", "text": "We use single precision floating-point in our implementation." 
}, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-3", "text": "FusionNet is a high performing reading comprehension architecture, which was designed primarily for maximum retrieval accuracy with less regard towards computational requirements." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-4", "text": "For FastFusionNets we remove the expensive CoVe layers [21] and substitute the BiLSTMs with far more efficient SRU layers [19] ." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-5", "text": "The resulting architecture obtains state-of-the-art results on DAWNBench [5] while achieving the lowest training and inference time on SQuAD [25] to-date." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-6", "text": "The code is available at https://github.com/felixgwu/FastFusionNet." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-7", "text": "----------------------------------" }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-8", "text": "****" }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-9", "text": "In this technical report, we analyze the inference bottlenecks of FusionNet [12] and introduce FastFusionNet that tackles them." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-10", "text": "In our experiments, we show that FastFusionNet achieves new state-of-the-art training and inference time on SQuAD based on the metrics of DAWNBench." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-11", "text": "----------------------------------" }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-12", "text": "**BACKGROUND**" }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-13", "text": "----------------------------------" }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-14", "text": "**EFFICIENT SEQUENCE ENCODING**" }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-15", "text": "The sequential nature of Recurrent Neural Networks (RNNs) makes them inherently slow, even on parallel computing devices." 
}, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-16", "text": "Consequently, a series of methods have been proposed to either reduce the sequential computation within RNNs or substitute them with alternative building blocks." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-17", "text": "Yu et al. [38] propose LSTM-Jump, in which LSTMs [10] are trained to predict the number of tokens to skip." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-18", "text": "Seo et al. [29] propose skim-RNN for sentiment analysis and question answering, which has a special RNN unit combining big and small RNN cells." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-19", "text": "Bradbury et al. [1] introduce Quesi-RNNs that combines convolution with sequential pooling to reduce the sequential." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-20", "text": "Lei et al. [19] invent Simple Recurrent Unit (SRU) a fast RNN variant, which will be explained further in subsection 2.2." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-21", "text": "Their second version [20] is more accurate but a little less efficient." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-22", "text": "To the best of our knowledge, SRU is the most efficient RNN variant, so we choose it as our preferred building block throughout this manuscript." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-23", "text": "Other lines of work replace RNNs with convolution layers [6, 9, 14, 35, 36, 39, 40] or self-attention [30, 31, 33, 39] ." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-24", "text": "Shen et al. [30] introduce bidirectional block self-attentions (Bi-BloSA) that split a sequence into blocks and compute intra-block and inter-block self-attention that significantly reduces computation and memory footprint compared to the popular multi-head self-atttention [33] ." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-25", "text": "Wu et al. 
[36] propose lightweight and dynamic convolutions as efficient alternatives to self-attentions with comparable performance." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-26", "text": "----------------------------------" }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-27", "text": "**SIMPLE RECURRENT UNIT**" }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-28", "text": "The key idea behind Simple Recurrent Unit (SRU) [19, 20] is to separate the matrix multiplications (the bottleneck) from the recurrence." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-29", "text": "To be specific, SRU replaces the matrix multiplication style recurrence to a vector summation style recurrence." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-30", "text": "As a consequence, the matrix multiplication can be done in parallel at once." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-31", "text": "The complete architecture is" }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-32", "text": "where x t is the input at the time step t, c t is the hidden state and h t is the output." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-33", "text": "All blue computations can be performed through simple parallel matrix\u00d7matrix multiplies followed by parallel element-wise function operators, and are therefore maximally efficient on modern CUDA hardware." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-34", "text": "The only sequential operation is the update to c t , which is a highly efficient vector operation." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-35", "text": "The final update to h t is again fully parallel." 
}, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-36", "text": "----------------------------------" }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-37", "text": "**DRQA**" }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-38", "text": "DrQA [2] is one of the simplest reading comprehension model, which employs a variety of features including pre-trained word vectors, term frequencies, part-of-speech tags, name entity relations, and the fact that whether a context word is in the question or not, encodes the features with RNNs, and predicts the start and end of an answer with a PointerNet-like module [34] ." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-39", "text": "----------------------------------" }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-40", "text": "**ANALYSIS OF FUSIONNET**" }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-41", "text": "FusionNet [12] is reading comprehension model built on top of DrQA by introducing Fully-aware attention layers (context-question attention and context self-attention), contextual embeddings [21] , and more RNN layers." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-42", "text": "Their proposed fully-aware attention mechanism uses the concatenation of layers of hidden representations as the query and the key to compute attention weights, which shares a similar intuition as DenseNet [11] ." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-43", "text": "FusionNet was the state-of-the-art reading comprehension model at the time of writing (Oct. 4th 2017)." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-44", "text": "Figure 2 provides an analysis of the individual components of FusionNet that the contextual embedding layer, i.e. CoVe [21] , with several layers of wide LSTMs, takes up to 35.5% of the inference time while only contributing a 1.1% improvement of F1 Score (from 82.5% to 83.6%) Huang et al. [12] ." 
}, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-45", "text": "Additionally, the LSTM layers contribute to 58.8% [19] , GRU [3] , LSTM [10] , QANet Encoding block (with 2 conv layers and a 8-head attention) [39] , 5 Convolution layers with gated linear unit (GLU) [6, 35] ." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-46", "text": "All input and hidden sizes are 128." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-47", "text": "of the inference time." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-48", "text": "Therefore, we propose to remove the contextual embedding layer and replace each bidirectional LSTM layer with two layers of bidirectional SRU [19] ." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-49", "text": "Figure 3 shows that SRU is faster than LSTM [10] , GRU [3] , QANet Encoder [39] , and 5-layer CNN w/ GLU [6, 35] ." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-50", "text": "We time a 5-layer CNN since it matches the performance of one layer SRU." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-51", "text": "----------------------------------" }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-53", "text": "Here we introduce FastFusionNet which addresses the inference bottlenecks of FusionNet [12] ." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-54", "text": "There are two differences compared to FusionNet: i) the CoVe [21] layers are removed and ii) each BiLSTM layer is replaced with two BiSRU layers." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-55", "text": "We closely follow the implementation of Huang et al. [12] described in their paper except for the changes above." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-56", "text": "Following Huang et al. [12] , the hidden size of each SRU is set to 125, resulting in a 250-d output feature of each BiSRU regardless of the input size." 
}, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-57", "text": "In the following explanation, we use [A; B] to represent concatenation in the feature dimension." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-58", "text": "Attn(Q, K, V) represents the attention mechanism taking the query Q, the key K, and the value V as inputs." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-59", "text": "Assuming O being the output, we have" }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-60", "text": "Input Features." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-61", "text": "Following Chen et al. [2] , we use 300-dim GloVe [24] vectors, term-frequency, part-of-speech (POS) tags, and named entity recognition (NER) tags as features for each word in the context or the question." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-62", "text": "We fine-tune the embedding vector of the padding token, the unknown word token, and the top 1000 most frequent words in the training set." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-63", "text": "Like others [12] we use a randomly initialized the trainable embedding layer with 12 dimensions for POS tags and 8 dimensions for NER." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-64", "text": "We use question matching features proposed by Chen et al. [2] as well, which contains a hard version and a soft version." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-65", "text": "The hard version contains 3 binary features indicating where a context word's original form, lower case form, or lemmatized form appears in the question, respectively." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-66", "text": "The soft version uses a trainable attention module that learns to represent each context word as a mixture of question words." 
}, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-67", "text": "Overall, the i-th context token is represented as C i which has 624 dimensions, and the j-th question token is represented as a 300-d Q j glove vector." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-68", "text": "We have the context features C \u2208 R n\u00d7624 and question features Q \u2208 R m\u00d7300 where m and n are the length of the question and context, respectively." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-69", "text": "Specifically," }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-70", "text": "Low-level encoding Layer." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-71", "text": "We apply 2-layer BiSRU on C In and Q In to obtain lower-level representations C and Q respectively." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-72", "text": "That is," }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-73", "text": "High-level Encoding Layer consists of another 2-layer BiSRU to obtain high-level representations C h and Q h ." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-74", "text": "In other words," }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-75", "text": "where" }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-76", "text": "The Question Understanding Layer is another 2-layer BiSRU combining Q and Q h into Q u , i.e." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-77", "text": "where" }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-105", "text": "Arguably, using half-precision floating-point may further improve our results." 
}, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-78", "text": "The Question-Context Attention Layer is a fully-aware attention module [12] which takes the history (concatenation of GloVe, low-level, and high-level features) of each context word and question words as query and key for three attention modules, and represents each context word as three different vectors:" }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-79", "text": "----------------------------------" }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-80", "text": "**S). ANOTHER 2-LAYER SRU PROCESSES THE CONCATENATION OF ALL PREVIOUS CONTEXT WORD VECTORS**" }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-81", "text": "To be specific," }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-82", "text": "The Context Self-Attention Layer is another fully-aware attention module that treats the history of words (GloVe vectors," }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-83", "text": "Answer Prediction Layer." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-84", "text": "This layer predicts the positions of the start and end of the answer span using the final representations of the context C u and the question Q u ." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-85", "text": "This layer first combines all question vectors into a weighted sum q = m j=1 \u03b1 j Q u j using a single trainable parameter v \u2208 R" }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-86", "text": "----------------------------------" }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-87", "text": "**250**" }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-88", "text": ", where" }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-89", "text": ". As a next step it predicts the probability that the i th word denotes the start of the answer span as" }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-90", "text": ", using a bilinear soft-max model." 
}, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-91", "text": "Subsequently, it summarizes the context with the start prediction and produces z = TensorFlow v1.8 1 TPUv2 0:50:21 DrQA (ParlAI) [22] PyTorch v1.0.0 1 T4 / GCP 0:56:43 DrQA (ParlAI) [22] PyTorch v1.0.0 1 P4 / GCP 1:00:35 DrQA (ParlAI) [22] PyTorch v0.4.1 1 V100 1:22:33 BERT-base [7] (1 epoch fine-tuning) TensorFlow v1.11.0 1 GTX-1080 Ti 7:38:10 BiDAF [5, 28] TensorFlow v1.2 1 K80 Table 1: DAWNBench Training Track It then produces a refined question vectorq with one step of GRU [3] , using the original question vector q as the hidden memory and z as the input, i.e.q = GRUCell(z, q)." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-92", "text": "Similarly, a bi-linear module is applied to get the end predicted" }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-93", "text": "." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-94", "text": "The product of the respective start and end probabilities becomes the score of an answer span." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-95", "text": "However, we only consider answers with no more than 15 words and do a exhaustive search to find the best span." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-96", "text": "----------------------------------" }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-97", "text": "**EXPERIMENTS**" }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-98", "text": "----------------------------------" }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-99", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-100", "text": "We conduct our experiments on the SQuAD dataset, which contains 87K, 10K, and 10K context-question pairs for training, development, and test." 
}, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-101", "text": "Like the other models submitted to the DAWNBench, we use the publicly available development set to evaluate the performance and the efficiency of our model." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-102", "text": "All of the experiments are conducted on a single Nvidia GTX-1080 Ti GPU." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-103", "text": "We use PyTorch [23] 0.3.1 to implement our model." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-106", "text": "Our implementation is based on two open source code base 1 ." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-109", "text": "We train the model for 100 epochs to ensure convergence; however, the model stops improving after 60 epochs." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-110", "text": "The other hyper-parameters are borrowed from Lei et al. [19] ." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-111", "text": "We do not tune the hyper-parameters." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-112", "text": "We use batch size 32 for training." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-113", "text": "We use Adam optimizer [15] with \u03b1 = 0.001 and clip the 2 -norm of the gradients to 20 before each update." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-114", "text": "The SQuAD dataset is tokenized and tagged by the SpaCy package 2 ." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-115", "text": "We apply variational dropout [16] to sequential features and normal dropout [32] to others." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-116", "text": "Following [2] , dropout rate for input embeddings is set to 0.4." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-117", "text": "We also dropout all inputs of LSTMs and attentions with probability 0.4." 
}, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-118", "text": "For SRUs, we follow [19] using dropout rate 0.2." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-119", "text": "We do not use learning rate decay or weight decay for simplicity." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-120", "text": "----------------------------------" }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-121", "text": "**DAWNBENCH RESULTS**" }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-122", "text": "We report the performance of our FastFusionNet on DAWNBench [5] ." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-123", "text": "We consider three baselines: i) FusionNet ii) FusionNet without CoVE, and iii) BERT-base." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-124", "text": "For BERT-base, we use the open source code 3 ." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-125", "text": "Our FastFusionNet reaches F1 75% in 4 epochs and achieves at F1 82.5% at the end which matches the reported F1 82.5% of FusionNet without CoVe on SQuAD development set [12] ." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-126", "text": "The training time track aims to minimize the time to train a model up to at least 75% F1 score on SQuAD development set." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-127", "text": "Table 1 shows that our FastFusionNet reaches F1 75.0% within 20 minutes (after 4 epochs), which gives a 45% speedup compared to the winner DrQA(ParlAI) on the leaderboard." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-128", "text": "Notably, we use an Nvidia GTX-1080 GPU which is about 22% slower than their Nvidia RTX-2080 GPU." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-129", "text": "When controlling the generation of GPUs and comparing our model with a DrQA (ParlAI) trained on an Nvidia V100, our model achieves a 3.1\u00d7 speedup." 
}, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-130", "text": "Compared to FusionNet, FastFusionNet is 23% faster to reach 75% F1 score; however, in terms of the training time per epoch, it is in fact 2.6\u00d7 as fast as FusionNet." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-131", "text": "[12] here since our reimplementation is about 0.5% F1 score worse." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-132", "text": "The inference time track evaluates the average 1-example inference latency of a model with an F1 score at least 75%." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-133", "text": "Our FastFusionNet reduces the 1-example latency down to 7.9 ms, which is 2.8\u00d7 as fast as a BERT-base and 12.7\u00d7 over BiDAF." }, { "sent_id": "195f41862b929318787aad9d8e5a1c-C001-134", "text": "FastFusionNet achieves a 5.8\u00d7 speedup over the original FusionNet." } ], "y": { "@DIF@": { "gold_contexts": [ [ "195f41862b929318787aad9d8e5a1c-C001-2" ], [ "195f41862b929318787aad9d8e5a1c-C001-9" ], [ "195f41862b929318787aad9d8e5a1c-C001-53" ], [ "195f41862b929318787aad9d8e5a1c-C001-55" ], [ "195f41862b929318787aad9d8e5a1c-C001-125" ], [ "195f41862b929318787aad9d8e5a1c-C001-131" ] ], "cite_sentences": [ "195f41862b929318787aad9d8e5a1c-C001-2", "195f41862b929318787aad9d8e5a1c-C001-9", "195f41862b929318787aad9d8e5a1c-C001-53", "195f41862b929318787aad9d8e5a1c-C001-55", "195f41862b929318787aad9d8e5a1c-C001-125", "195f41862b929318787aad9d8e5a1c-C001-131" ] }, "@EXT@": { "gold_contexts": [ [ "195f41862b929318787aad9d8e5a1c-C001-2" ], [ "195f41862b929318787aad9d8e5a1c-C001-9" ], [ "195f41862b929318787aad9d8e5a1c-C001-53" ], [ "195f41862b929318787aad9d8e5a1c-C001-55" ] ], "cite_sentences": [ "195f41862b929318787aad9d8e5a1c-C001-2", "195f41862b929318787aad9d8e5a1c-C001-9", "195f41862b929318787aad9d8e5a1c-C001-53", "195f41862b929318787aad9d8e5a1c-C001-55" ] }, "@BACK@": { "gold_contexts": [ [ "195f41862b929318787aad9d8e5a1c-C001-41" ], [ 
"195f41862b929318787aad9d8e5a1c-C001-44" ] ], "cite_sentences": [ "195f41862b929318787aad9d8e5a1c-C001-41", "195f41862b929318787aad9d8e5a1c-C001-44" ] }, "@SIM@": { "gold_contexts": [ [ "195f41862b929318787aad9d8e5a1c-C001-56" ], [ "195f41862b929318787aad9d8e5a1c-C001-63" ], [ "195f41862b929318787aad9d8e5a1c-C001-78" ] ], "cite_sentences": [ "195f41862b929318787aad9d8e5a1c-C001-56", "195f41862b929318787aad9d8e5a1c-C001-63", "195f41862b929318787aad9d8e5a1c-C001-78" ] }, "@USE@": { "gold_contexts": [ [ "195f41862b929318787aad9d8e5a1c-C001-56" ] ], "cite_sentences": [ "195f41862b929318787aad9d8e5a1c-C001-56" ] } } }, "ABC_f587fc2bbbf3c1327b03d556e4bc05_16": { "x": [ { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-139", "text": "We use same evaluation measures (Hill et al., 2016) ." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-63", "text": "Prior to this work, we proposed earlier trials for using covariance features in community question answering (Malhas et al., 2016b,a; Torki et al., 2017) ." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-64", "text": "In these trials we used the covariance features in combination with lexical and semantic features." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-65", "text": "Close to our work is (Nikolentzos et al., 2017) , they build an implicit representation of documents using multidimensional Gaussian distribution." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-66", "text": "Then they compute a similarity kernel to be used in document classification task." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-67", "text": "Our work is distinguished from (Nikolentzos et al., 2017) as we compute an explicit descriptor for any document." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-68", "text": "Moreover, we use linear models which scale much better than non-linear kernels as introduced in (Nikolentzos et al., 2017) ." 
}, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-69", "text": "----------------------------------" }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-70", "text": "**DOCUMENT COVARIANCE DESCRIPTOR**" }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-71", "text": "We present our DoCoV descriptor." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-2", "text": "In this paper, we address the problem of finding a novel document descriptor based on the covariance matrix of the word vectors of a document." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-3", "text": "Our descriptor has a fixed length, which makes it easy to use in many supervised and unsupervised applications." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-4", "text": "We tested our novel descriptor in different tasks including supervised and unsupervised settings." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-5", "text": "Our evaluation shows that our document covariance descriptor fits different tasks with competitive performance against state-of-the-art methods." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-6", "text": "----------------------------------" }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-8", "text": "Retrieving documents that are similar to a query using vectors has a long history." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-9", "text": "Earlier methods modeled documents and queries using vector space models via bag-of-words (BOW) representation (Salton and Buckley, 1988) ." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-10", "text": "Other representations include latent semantic indexing (LSI) (Deerwester et al., 1990) , which can be used to define dense vector representation for documents and/or queries." 
}, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-11", "text": "The past few years have witnessed a big interest in distributed representation for words, sentences, paragraphs and documents." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-12", "text": "This was achieved by leveraging deep learning methods that learn word vector representation." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-13", "text": "Introduction of neural language models (Bengio et al., 2003) using deep learning allowed to learn word vector representation (word embedding for simplicity)." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-14", "text": "The seminal work of Mikolov et al. introduced an efficient way to compute dense vectorized representation of words (Mikolov et al., 2013a,b) ." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-72", "text": "First, we define a document observation matrix." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-73", "text": "Second, we show how to extract our DoCoV descriptor." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-74", "text": "Document Observation Matrix Given a d-dimensional word embedding model and an n-terms document." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-75", "text": "We can define a document observation matrix O \u2208 R n\u00d7d ." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-76", "text": "In the matrix O, a row represents a term in the document and columns represent the d-dimensional word embedding representation for that term." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-77", "text": "Assume that we have observed n terms of a ddimensional random variable; we have a data ma-" }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-78", "text": "The rows" }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-79", "text": "the i-th observation of a d-dimensional random variable X \u2208 R d ." 
}, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-80", "text": "The \"sample mean vector\" of the n observations \u2208 R d is given by the vectorx of the meansx j of the d variables:" }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-81", "text": "From hereafter, when we mention the Mean vector we mean the sample Mean Vectorx." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-82", "text": "Document-Covariance Descriptor (DoCoV) Given an observation matrix O for a document, we compute the covariance matrix entriesfor every pair of dimensions (X, Y )." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-15", "text": "A more recent step was taken to move beyond distributed representation of words." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-16", "text": "This is to find a distributed representation for sentences, paragraphs and documents." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-17", "text": "Most of the presented works study the interrelationship between words in a text snip- pet (Hill et al., 2016; Kiros et al., 2015; Le and Mikolov, 2014) in an unsupervised fashion." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-18", "text": "Other methods build a task specific representation (Kim, 2014; Collobert et al., 2011) ." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-19", "text": "In this paper we propose to use the covariance matrix of the word vectors in some document to define a novel descriptor for a document." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-20", "text": "We call our representation DoCoV descriptor." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-21", "text": "Our descriptor obtains a fixed-length representation of the paragraph which captures the interrelationship between the dimensions of the word embedding via the covariance matrix elements." 
}, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-22", "text": "This makes our work distinguished from to the work of (Le and Mikolov, 2014; Hill et al., 2016; Kiros et al., 2015) where they study the interrelationship of words in the text snippet." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-23", "text": "----------------------------------" }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-24", "text": "**TOY EXAMPLE**" }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-25", "text": "We show a toy example to highlight the differences between DoCoV vector, the Mean vector and paragraph vector (Le and Mikolov, 2014) ." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-26", "text": "First, we used Gensim library 1 to generate word vectors and paragraph vectors using a dummy training corpus." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-27", "text": "Next, we formed two hypothetical documents; first document contains words about \"pets\" and second document contains words about \"travel\"." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-28", "text": "In figure 1 we show on the top part the first two dimensions of a word embedding for each document separately." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-29", "text": "On the bottom Left, we show embedding of the two documents' words in the same space." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-30", "text": "We also show the Mean vectors and the paragraph vectors." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-31", "text": "In the word embedding space the covariance matrices are represented via the confidence ellipses." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-32", "text": "On the bottom right we show the corresponding covariance matrices as points in a new space after vectorization step." 
}, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-33", "text": "----------------------------------" }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-34", "text": "**MOTIVATION AND CONTRIBUTIONS**" }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-35", "text": "Below we describe our motivation towards the proposal of our novel representation:" }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-36", "text": "(1) Some neural-based paragraph representations such as paragraph vectors (Le and Mikolov, 2014) , FastSent (Hill et al., 2016) use a shared space between the words and paragraphs." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-37", "text": "This is counter intuitive, as the paragraph is a different entity other than the words." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-38", "text": "Figure 1 illustrates that point, we do not see a clear interpretation of why the paragraph vectors (Le and Mikolov, 2014) are positioned in the space as in figure 1 ." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-39", "text": "(2) The covariance matrix represents the second order summary statistic of multivariate data." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-40", "text": "This distinguishes the covariance matrix from the mean vector." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-41", "text": "In figure 1 we visualize the covariance matrix using confidence ellipse representation." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-42", "text": "We see that the covariance encodes the shape of the density composed of the words of interest." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-43", "text": "In the earlier example the Mean vectors of two dissimilar documents are put close by the word embedding." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-44", "text": "On the other hand, the covariance matrices capture the distinctness of the two documents." 
}, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-45", "text": "(3) The use of the covariance as a spatial descriptor for multivariate data has a great success in different domains like computer vision (Tuzel et al., 2006; Hussein et al., 2013; Sharaf et al., 2015) and brain signal analysis (Barachant et al., 2013) ." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-46", "text": "With this global success of this representation, we believe this method can be useful for text-related tasks." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-47", "text": "(4) The computation of the covariance descriptor is known to be fast and highly parallelizable." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-48", "text": "Moreover, there is no inference steps involved while computing the covariance matrix given its observations." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-49", "text": "This is an advantage compared to existing methods for generating paragraph vectors, such as (Le and Mikolov, 2014; Hill et al., 2016) ." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-50", "text": "Our contribution in this work is two-fold: (1) We propose the Document-Covariance descriptor (DoCoV) to represent every document as the covariance of the word embedding of its words." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-51", "text": "To the best of our knowledge, we are the first to explicitly compute covariance descriptors on word embedding such as word2vec (Mikolov et al., 2013b) or similar word vectors." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-52", "text": "(2) We empirically show the effectiveness of our novel descriptor in comparison to the state-of-theart methods in various unsupervised and supervised classification tasks." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-53", "text": "Our results show that our descriptor can attain comparable accuracy to state-ofthe-art methods in a diverse set of tasks." 
}, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-54", "text": "----------------------------------" }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-55", "text": "**RELATED WORK**" }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-56", "text": "We can see the word embedding at the core of recent state-of-art methods for solving many tasks like semantic textual similarity, sentiment analysis and more." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-57", "text": "Among the approaches of finding word embedding are (Pennington et al., 2014; Levy and Goldberg, 2014; Mikolov et al., 2013b) ." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-58", "text": "These alternatives share the same objective of finding a fixed-length vectorized representation for words to capture the semantic and syntactic regularities between words." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-59", "text": "These efforts paved the way for many researchers to judge document similarity based on word embedding." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-60", "text": "Some efforts aimed at finding a global representation of a text snippet using a paragraph-level representation such as paragraph vectors (Le and Mikolov, 2014) ." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-61", "text": "Recently other neural-based sentence and paragraph level representations appeared to provide a fixed length representation like Skip-Thought Vectors (Kiros et al., 2015) and FastSent (Hill et al., 2016) ." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-62", "text": "Some efforts focused on defining a Word Mover Distance(WMD) based on word level representation (Kusner et al., 2015) ." 
}, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-83", "text": "The matrix C \u2208 R d\u00d7d is a symmetric matrix and is defined as" }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-84", "text": "We Compute a vectorized representation of the matrix C as the stacking of the upper triangular part of matrix C as in eq. 5." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-85", "text": "This process produces a vector v \u2208 R d(d+1)/2 ." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-86", "text": "The Euclidean distance between vectorized matrices is equivalent to the Frobenius norm of the original covariance matrices." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-87", "text": "----------------------------------" }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-88", "text": "**EXPERIMENTAL EVALUATION**" }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-89", "text": "We show an extensive comparative evaluation for unsupervised paragraph representation approaches." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-90", "text": "We test the unsupervised semantic textual similarity task." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-91", "text": "Next, we show a comparative evaluation for text classification benchmarks." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-92", "text": "----------------------------------" }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-93", "text": "**EFFECT OF WORD EMBEDDING SOURCE AND DIMENSIONALITY ON CLASSIFICATION RESULTS**" }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-94", "text": "We evaluate classification performance over the IMDB movie reviews dataset using error rate as the evaluation measure." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-95", "text": "We report our results using a linear SVM classifier." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-96", "text": "We chose the default parameters for Linear SVM classifier in scikit-learn library 2 ." 
}, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-97", "text": "The IMDB movie review dataset was first proposed by Maas et al. (Maas et al., 2011) as a benchmark for sentiment analysis." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-98", "text": "The dataset consists of 100K IMDB movie reviews and each review has several sentences." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-99", "text": "The 100K reviews are divided into three datasets: 25K labelled training instances, 25K labelled test instances and 50K unlabelled training instances." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-100", "text": "Each review has one label representing the sentiment of it: Positive or Negative." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-101", "text": "These labels are balanced in both the training and the test set." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-102", "text": "The objective is to show that theDoCoV descriptor can be used with different alternatives for word representations." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-103", "text": "Also, the experiment shows that pre-trained models are giving the best results, namely the word2vec model built on Google news." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-104", "text": "This alleviates the need of computing a problem specific word embedding." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-105", "text": "In some cases there is no available data to construct the word embedding." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-106", "text": "To illustrate that we tried different alternatives for word representation." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-107", "text": "(1) We computed our own skipgram models using Gensim Library." 
}, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-108", "text": "We used the Training and unlabelled subsets of IMDB dataset to obtain different embedding by setting number of dimensions to 100, 200 and 300." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-109", "text": "(2) We used pre-trained GloVe models trained on wikipedia2014 and Gigaword5." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-110", "text": "We tested the available different dimensionality 100, 200 and 300." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-111", "text": "We also used the 300 dimensions GloVe model that used commoncrawl with 42 Billion tokens We call the last one Lrg." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-112", "text": "This model provides word vectors of 300 dimensions for each word." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-113", "text": "(3) We used pre-trained word2vec model trained on Google news." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-114", "text": "We call it Gnews." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-115", "text": "This model provides word vectors of 300 dimensions for each word." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-116", "text": "Table 1 shows the results when using DoCoV computed at different dimensions of word embedding in classification." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-117", "text": "The table also compares classification performance when using DoCoV to the performance when using the Mean of word embedding as a baseline." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-118", "text": "Also, we show the effect of fusing DoCoV with other feature sets." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-119", "text": "We mainly experiment with the following sets: DoCoV, Mean, and bag-of-words (BOW)." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-120", "text": "We use the mean and DoCoV features." 
}, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-121", "text": "----------------------------------" }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-122", "text": "**OBSERVATIONS**" }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-123", "text": "From the results we can observe the following (1) We observe that the DoCoV is consistently outperforming the Mean vector for different dimensionality of the word embedding regardless of the embedding source." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-124", "text": "(3) The best performing feature concatenation is DoCoV+BOW." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-125", "text": "This ensures that the concatenation in fact is benefiting from both representations." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-126", "text": "(3) In general the best results are achieved using the available 300-dimensions Gnews word embedding." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-127", "text": "In the subsequent experiments we will use that embedding such that we do not need to build a different word embedding for every task on hand." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-128", "text": "----------------------------------" }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-129", "text": "**UNSUPERVISED SEMANTIC TEXTUAL SIMILARITY**" }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-130", "text": "We conduct a comparative evaluation against the state-of-the-art approaches in unsupervised paragraph representation." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-131", "text": "We follow the setup used in (Hill et al., 2016) ." 
}, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-132", "text": "----------------------------------" }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-133", "text": "**DATASETS AND BASELINES**" }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-134", "text": "We contrast our results against the methods reported in (Hill et al., 2016) ." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-135", "text": "The competing methods are the paragraph vectors (Le and Mikolov, 2014) , skip-thought vectors (Kiros et al., 2015) , Fastsent (Hill et al., 2016) , Sequential (Denoising) Autoencoders (SDAE) (Hill et al., 2016) ." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-136", "text": "The Mean vector baseline is also implemented." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-137", "text": "Also, we use the sum of the similarities generated by the DoCoV and the mean vectors." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-138", "text": "All of our results are reported using the freely available Gnews word2vec of dim = 300." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-140", "text": "We use the Pearson correlation and Spearman correlation with the manual relatedness judgements." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-141", "text": "The semantic sentence relatedness datasets used in the comparative evaluation the SICK dataset (Marelli et al., 2014) consists of 10,000 pairs of sentences and relatedness judgements and the STS 2014 dataset (Agirre et al., 2014) consists of 3,750 pairs and ratings from six linguistic domains." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-142", "text": "----------------------------------" }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-143", "text": "**RESULTS AND DISCUSSION**" }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-144", "text": "We show the correlation values between the similarities computed via DoCoV and the human judgements." 
}, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-145", "text": "We contrast the performance of other representations in table 2." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-146", "text": "We observe that DoCoV representation outperforms other representations in this task." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-147", "text": "Other models such as skipthought vectors (Kiros et al., 2015) and SDAE (Hill et al., 2016) requires building an encoder-decoder model which takes time 3 to learn." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-148", "text": "For other models like paragraph vectors (Le and Mikolov, 2014) and Fastsent vectors (Hill et al., 2016) , they require a gradient descent inference step to compute the paragraph/sentence vectors." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-149", "text": "Using the DoCoV, we just require a pre-trained word embedding model and we do not need any additional training like encoder-decoder models or inference steps via gradient descent." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-150", "text": "----------------------------------" }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-151", "text": "**TEXT CLASSIFICATION BENCHMARKS**" }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-152", "text": "The datasets used in this experiment form a textclassification benchmark for sentence and paragraph classification." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-153", "text": "Towards the end of this section we can clearly identify the value of the DoCoV descriptor as a generic descriptor for text classification tasks." 
}, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-154", "text": "----------------------------------" }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-155", "text": "**DATASETS AND BASELINES**" }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-156", "text": "We contrast our results against the same methods of unsupervised paragraph representations." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-157", "text": "In addition to the results of DoCoV we examined concatenation of BoW with tf-idf weighting and Mean vectors with our DoCoV descriptors." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-158", "text": "We use linear (Hill et al., 2016) ." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-159", "text": "(Kiros et al., 2015) 75.5 79.3 91.4 92.1 84.58 bi-skip (Kiros et al., 2015) 73.9 77.9 89.4 92.5 84.43 comb-skip (Kiros et al., 2015) 76.5 80.1 92.2 93.6 85.6 FastSent (Hill et al., 2016) 70.8 78.4 76.8 88.7 78.68 FastSentAE (Hill et al., 2016) 71.8 76.7 80.4 88.8 79.43 SAE (Hill et al., 2016) 62.6 68 80.2 86.1 74.23 SAE+embs (Hill et al., 2016) 73.2 75.3 80.4 89.8 79.68 SDAE (Hill et al., 2016) 67.6 74 77.6 89.3 77.13 SDAE+embs (Hill et al., 2016) 74 SVM for all the tasks." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-160", "text": "All of our results are reported using the freely available Gnews word2vec of dim = 300." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-161", "text": "We use classification accuracy as the evaluation measure for this experiment as (Hill et al., 2016) ." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-162", "text": "The subsets used in comparative benchmark evaluation are: Movie Reviews MR (Pang and Lee, 2005) , Subjectivity Subj (Pang and Lee, 2004) ,Customer Reviews CR (Hu and Liu, 2004) and TREC Question TREC (Li and Roth, 2002) ." 
}, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-163", "text": "Results and Discussion Table 3 shows the results of our variants against state-of-art algorithms that use unsupervised paragraph representation." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-164", "text": "We observe that DoCoV is consistently better than the Mean vector and BOW with tf-idf weights." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-165", "text": "Also, DoCoV is improving consistently when concatenated with baselines such as Mean vector and BOW vectors." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-166", "text": "This means each feature is capturing different discriminating information." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-167", "text": "This justifies the choice of concatenating DoCoV with other features." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-168", "text": "We further observe that DoCoV is consistently better than the paragraph vectors (Le and Mikolov, 2014) , Fastsent and SDAE (Hill et al., 2016) ." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-169", "text": "The overall accuracy of DoCoV is highlighted and it outperforms other methods on the text classification benchmark." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-170", "text": "----------------------------------" }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-171", "text": "**CONCLUSION**" }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-172", "text": "We presented a novel descriptor to represent text on any level such as sentences, paragraphs or documents." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-173", "text": "Our representation is generic which makes it useful for different supervised and unsupervised tasks." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-174", "text": "It has fixed-length property which makes it useful for different learning algorithms." 
}, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-175", "text": "Also, our descriptor requires minimal training." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-176", "text": "We do not require a encoder-decoder model or a gradient descent iterations to be computed." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-177", "text": "Empirically we showed the effectiveness of the descriptor in different tasks." }, { "sent_id": "f587fc2bbbf3c1327b03d556e4bc05-C001-178", "text": "We showed better performance against other state-of-the-art methods in supervised and unsupervised settings." } ], "y": { "@BACK@": { "gold_contexts": [ [ "f587fc2bbbf3c1327b03d556e4bc05-C001-17" ], [ "f587fc2bbbf3c1327b03d556e4bc05-C001-36" ], [ "f587fc2bbbf3c1327b03d556e4bc05-C001-148" ] ], "cite_sentences": [ "f587fc2bbbf3c1327b03d556e4bc05-C001-17", "f587fc2bbbf3c1327b03d556e4bc05-C001-36", "f587fc2bbbf3c1327b03d556e4bc05-C001-148" ] }, "@DIF@": { "gold_contexts": [ [ "f587fc2bbbf3c1327b03d556e4bc05-C001-22" ], [ "f587fc2bbbf3c1327b03d556e4bc05-C001-25" ], [ "f587fc2bbbf3c1327b03d556e4bc05-C001-49" ], [ "f587fc2bbbf3c1327b03d556e4bc05-C001-168" ] ], "cite_sentences": [ "f587fc2bbbf3c1327b03d556e4bc05-C001-22", "f587fc2bbbf3c1327b03d556e4bc05-C001-25", "f587fc2bbbf3c1327b03d556e4bc05-C001-49", "f587fc2bbbf3c1327b03d556e4bc05-C001-168" ] }, "@SIM@": { "gold_contexts": [ [ "f587fc2bbbf3c1327b03d556e4bc05-C001-38" ], [ "f587fc2bbbf3c1327b03d556e4bc05-C001-60" ], [ "f587fc2bbbf3c1327b03d556e4bc05-C001-135" ] ], "cite_sentences": [ "f587fc2bbbf3c1327b03d556e4bc05-C001-38", "f587fc2bbbf3c1327b03d556e4bc05-C001-60", "f587fc2bbbf3c1327b03d556e4bc05-C001-135" ] }, "@USE@": { "gold_contexts": [ [ "f587fc2bbbf3c1327b03d556e4bc05-C001-135" ] ], "cite_sentences": [ "f587fc2bbbf3c1327b03d556e4bc05-C001-135" ] } } }, "ABC_c293e9fb8d6382f185a3efeaf0dbf7_16": { "x": [ { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-113", "text": "----------------------------------" }, { "sent_id": 
"c293e9fb8d6382f185a3efeaf0dbf7-C001-114", "text": "**SETTINGS**" }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-115", "text": "Data." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-2", "text": "We investigate the design challenges of constructing effective and efficient neural sequence labeling systems, by reproducing twelve neural sequence labeling models, which include most of the state-of-the-art structures, and conduct a systematic model comparison on three benchmarks (i.e. NER, Chunking, and POS tagging)." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-3", "text": "Misconceptions and inconsistent conclusions in existing literature are examined and clarified under statistical experiments." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-4", "text": "In the comparison and analysis process, we reach several practical conclusions which can be useful to practitioners." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-5", "text": "----------------------------------" }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-7", "text": "Sequence labeling models have been used for fundamental NLP tasks such as POS tagging, chunking and named entity recognition (NER)." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-8", "text": "Traditional work uses statistical approaches such as Hidden Markov Models (HMM) and Conditional Random Fields (CRF) (Ratinov and Roth, 2009; Passos et al., 2014; Luo et al., 2015) with handcrafted features and task-specific resources." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-9", "text": "With advances in deep learning, neural models have given state-of-the-art results on many sequence labeling tasks (Ling et al., 2015; Lample et al., 2016; Ma and Hovy, 2016) ." 
}, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-10", "text": "Words and characters are encoded in distributed representations (Mikolov et al., 2013) and sentence-level features are learned automatically during end-to-end training." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-11", "text": "Many existing state-of-the-art neural sequence labeling models utilize word-level Long Short-Term Memory (LSTM) structures to represent global sequence information and a CRF layer to capture dependencies between neighboring labels Lample et al., 2016; Ma and Hovy, 2016; Peters et al., 2017) ." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-12", "text": "As an alternative, Convolution Neural Network (CNN) (LeCun et al., 1989) has also been used for its ability of parallel computing, leading to an efficient training and decoding process." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-13", "text": "Despite them being dominant in the research literature, reproducing published results for neural models can be challenging, even if the codes are available open source." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-14", "text": "For example, Reimers and Gurevych (2017b) conduct a large number of experiments using the code of Ma and Hovy (2016) , but cannot obtain comparable results as reported in the paper." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-15", "text": "Liu et al. (2018) report lower average F-scores on NER when reproducing the structure of Lample et al. (2016) , and on POS tagging when reproducing Ma and Hovy (2016) ." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-16", "text": "Most literature compares results with others by citing the scores directly Lample et al., 2016) without re-implementing them under the same setting, resulting in less persuasiveness on the advantage of their models." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-17", "text": "In addition, conclusions from different reports can be contradictory." 
}, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-18", "text": "For example, most work observes that stochastic gradient descent (SGD) gives best performance on NER task (Chiu and Nichols, 2016; Lample et al., 2016; Ma and Hovy, 2016) , while Reimers and Gurevych (2017b) report that SGD is the worst optimizer on the same datasets." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-43", "text": "\u2022 Hardware environment can also affect system accuracy." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-19", "text": "The comparison between different deep neural models is challenging due to sensitivity on experimental settings." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-20", "text": "We list six inconsistent configurations in literature, which lead to difficulties for fair comparison." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-21", "text": "\u2022 Dataset." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-22", "text": "Most work reports sequence labeling results on both CoNLL 2003 English NER (Tjong Kim Sang and De Meulder, 2003) and PTB POS (Marcus et al., 1993) datasets (Collobert et al., 2011; Ma and Hovy, 2016) ." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-23", "text": "Ling et al. (2015) give results only on POS dataset, while some papers (Chiu and Nichols, 2016; Lample et al., 2016; Strubell et al., 2017) report results on the NER dataset only." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-24", "text": "dos Santos et al. (2015) conducts experiments on NER for Portuguese and Spanish." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-25", "text": "Most work uses the development set to select hyperparameters (Lample et al., 2016; Ma and Hovy, 2016) , while others add development set into training set (Chiu and Nichols, 2016; Peters et al., 2017) ." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-26", "text": "Reimers and Gurevych (2017b) use a smaller dataset (13862 vs 14987 sentences)." 
}, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-27", "text": "Different from Ma and Hovy (2016) and Liu et al. (2018) , choose a different data split on the POS dataset." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-28", "text": "Liu et al. (2018) and Hashimoto et al. (2017) use different development sets for chunking." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-29", "text": "\u2022 Preprocessing." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-30", "text": "A typical data preprocessing step is to normize digit characters (Chiu and Nichols, 2016; Lample et al., 2016; Yang et al., 2016; Strubell et al., 2017) ." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-31", "text": "Reimers and Gurevych (2017b) use fine-grained representations for less frequent words." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-32", "text": "Ma and Hovy (2016) do not use preprocessing." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-33", "text": "\u2022 Features." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-34", "text": "Strubell et al. (2017) and Chiu and Nichols (2016) apply word spelling features and further integrate context features." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-35", "text": "Collobert et al. (2011) and use neural features to represent external gazetteer information." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-36", "text": "Besides, Lample et al. (2016) and Ma and Hovy (2016) use end-to-end structure without handcrafted features." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-37", "text": "\u2022 Hyperparameters including learning rate, dropout rate (Srivastava et al., 2014) , number of layers, hidden size etc. can strongly affect the model performance." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-38", "text": "Chiu and Nichols (2016) search for the hyperparameters for each task and show that the system performance is sensitive to the choice of hyperparameters." 
}, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-39", "text": "However, existing models use different parameter settings, which affects the fair comparison." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-40", "text": "\u2022 Evaluation." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-41", "text": "Some literature reports results using mean and standard deviation under different random seeds (Chiu and Nichols, 2016; Peters et al., 2017; Liu et al., 2018) ." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-42", "text": "Others report the best result among different trials (Ma and Hovy, 2016) , which cannot be compared directly." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-44", "text": "Liu et al. (2018) observes that the system gives better accuracy on NER task when trained using GPU as compared to using CPU." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-45", "text": "Besides, the running speeds are highly affected by the hardware environment." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-46", "text": "To address the above concerns, we systematically analyze neural sequence labeling models on three benchmarks: CoNLL 2003 NER (Tjong Kim Sang and De Meulder, 2003) , CoNLL 2000 chunking (Tjong Kim Sang and Buchholz, 2000) and PTB POS tagging (Marcus et al., 1993) ." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-47", "text": "Table 1 shows a summary of the models we investigate, which can be categorized under three settings: (i) character sequence representations ; (ii) word sequence representations; (iii) inference layer." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-48", "text": "Although various combinations of these three settings have been proposed in the literature, others have not been examined." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-49", "text": "We compare all models in Table 1 , which includes most state-of-the-art methods." 
}, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-50", "text": "To make fair comparisons, we build a unified framework 1 to reproduce the twelve neural sequence labeling architectures in Table 1 ." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-51", "text": "Experiments show that our framework gives comparable or better results on reproducing existing works, showing the practicability and reliability of our analysis for practitioners." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-52", "text": "The detailed comparison and analysis show that (i) Character information provides a significant improvement on accuracy; (ii) Word-based LSTM models outperforms CNN models in most cases; (iii) CRF can improve model accuracy on NER and chunking but does not on POS tagging." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-53", "text": "Our framework is based on PyTorch with batched implementation, which is highly efficient, facilitating quick configurations for new tasks." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-54", "text": "----------------------------------" }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-55", "text": "**RELATED WORK**" }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-56", "text": "Collobert et al. (2011) proposed a seminal neural architecture for sequence labeling." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-57", "text": "It captures word sequence information with a one-layer CNN based on pretrained word embeddings and handcrafted neural features, followed with a CRF output layer." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-58", "text": "dos Santos et al. (2015) extended this model by integrating character-level CNN features." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-59", "text": "Strubell et al. (2017) built a deeper dilated CNN architecture to capture larger local features." 
}, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-60", "text": "Hammerton (2003) was the first to exploit LSTM for sequence labeling." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-61", "text": "built a BiLSTM-CRF structure, which has been extended by adding character-level LSTM (Lample et al., 2016; Liu et al., 2018) , GRU (Yang et al., 2016) , and CNN (Chiu and Nichols, 2016; Ma and Hovy, 2016) features." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-62", "text": "Yang et al. (2017a) proposed a neural reranking model to improve NER models." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-63", "text": "These models achieve state-of-the-art results in the literature." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-64", "text": "Reimers and Gurevych (2017b) compared several word-based LSTM models for several sequence labeling tasks, reporting the score distributions over multiple runs rather than single value." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-65", "text": "They investigated the influence of various hyperparameters and configurations." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-88", "text": "We examined both structures and found that they give comparable accuracies on sequence labeling tasks." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-66", "text": "Our work is similar in comparing different neural architectures under unified settings, but differs in four main aspects: 1) Their experiments are based on a BiLSTM with handcrafted word features, while our experiments are based on end-to-end neural models without human knowledge." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-67", "text": "2) Their system gives relatively low performances on standard benchmarks 2 , while ours can give comparable or better results with state-of-the-art models, rendering our observations more informative for practitioners." 
}, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-68", "text": "3) Our findings are more consistent with most previous work on configurations such as usefulness of character information (Lample et al., 2016; Ma and Hovy, 2016) , optimizer (Chiu and Nichols, 2016; Lample et al., 2016; Ma and Hovy, 2016) and tag scheme (Ratinov and Roth, 2009; Dai et al., 2015) ." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-69", "text": "In contrast, many results of Reimers and Gurevych (2017b) contradict existing reports." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-70", "text": "4) We conduct a wider range of comparison for word sequence representations, including all combinations of character CNN/LSTM and word CNN/LSTM structures, while Reimers and Gurevych (2017b) studied the word LSTM models only." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-71", "text": "----------------------------------" }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-72", "text": "**NEURAL SEQUENCE LABELING MODELS**" }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-73", "text": "Our neural sequence labeling framework contains three layers, i.e., a character sequence representation layer, a word sequence representation layer and an inference layer, as shown in Figure 1 ." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-74", "text": "Character information has been proven to be critical for sequence labeling tasks (Chiu and Nichols, 2016; Lample et al., 2016; Ma and Hovy, 2016) , with LSTM and CNN being used to model character sequence information (\"Char Rep.\")." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-75", "text": "Similarly, on the word level, LSTM or CNN structures can be leveraged to capture long-term information or local features (\"Word Rep.\"), respectively." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-76", "text": "Subsequently, the inference layer assigns labels to each word using the hidden states of word sequence representations." 
}, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-77", "text": "----------------------------------" }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-78", "text": "**CHARACTER SEQUENCE REPRESENTATIONS**" }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-79", "text": "Character features such as prefix, suffix and capitalization can be represented with embeddings through a feature-based lookup table (Collobert et al., 2011; Strubell et al., 2017) , or neural networks without human-defined features (Lample et al., 2016; Ma and Hovy, 2016) ." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-80", "text": "In this work, we focus on neural character sequence representations without hand-engineered features." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-81", "text": "Character CNN." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-82", "text": "Using a CNN structure to encode character sequences was firstly proposed by Santos and Zadrozny (2014), and followed by many subsequent investigations (dos Santos et al., 2015; Chiu and Nichols, 2016; Ma and Hovy, 2016) ." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-83", "text": "In our experiments, we take the same structure as Ma and Hovy (2016) , using one layer CNN structure with max-pooling to capture character-level representations." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-84", "text": "Figure 2 (a) shows the CNN structure on representing word \"Mexico\"." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-85", "text": "Character LSTM." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-86", "text": "Shown as Figure 2 (b), in order to model the global character sequence information of a word \"Mexico\", we utilize a bidirectional LSTM on the character sequence of each word and concatenate the left-to-right final state F LST M and the right-to-left final state B LST M as character sequence representations." 
}, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-87", "text": "Liu et al. (2018) applied one bidirectional LSTM for the character sequence over a sentence rather than each word individually." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-89", "text": "We choose Lample et al. (2016) 's structure as its character LSTMs can be calculated in parallel, making the system more efficient." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-90", "text": "----------------------------------" }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-91", "text": "**WORD SEQUENCE REPRESENTATIONS**" }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-92", "text": "Similar to character sequences in words, we can model word sequence information through LSTM or CNN structures." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-93", "text": "LSTM has been widely used in sequence labeling (Lample et al., 2016; Ma and Hovy, 2016; Chiu and Nichols, 2016; Liu et al., 2018) ." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-94", "text": "CNN can be much faster than LSTM due to the fact that convolution calculation can be parallel on the input sequence (Collobert et al., 2011; dos Santos et al., 2015; Strubell et al., 2017) ." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-95", "text": "Word CNN." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-96", "text": "Figure 3(a) shows the multi-layer CNN on word sequence, where words are represented by embeddings." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-97", "text": "If a character sequence representation layer is used, then word embeddings and character sequence representations are concatenated for word representations." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-98", "text": "For each CNN layer, a window of size 3 slides along the sequence, extracting local features on the word inputs and a ReLU function (Glorot et al., 2011 ) is followed." 
}, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-99", "text": "We follow Strubell et al. (2017) by using 4 CNN layers." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-100", "text": "Batch normalization (Ioffe and Szegedy, 2015) and dropout (Srivastava et al., 2014) are applied following each CNN layer." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-101", "text": "Word LSTM." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-102", "text": "Shown in Figure 3 (b), word representations are fed into a forward LSTM and backward LSTM, respectively." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-103", "text": "The forward LSTM captures the word sequence information from left to right, while the backward LSTM extracts information in a reversed direction." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-104", "text": "The hidden states of the forward and backward LSTMs are concatenated at each word to give global information of the whole sequence." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-105", "text": "----------------------------------" }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-106", "text": "**INFERENCE LAYER**" }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-107", "text": "The inference layer takes the extracted word sequence representations as features and assigns labels to the word sequence." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-108", "text": "Independent local decoding with a linear layer mapping word sequence representations to label vocabulary and performing softmax can be quite effective (Ling et al., 2015) , while for tasks that with strong output label dependency, such as NER, CRF is a more appropriate choice." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-109", "text": "In this work, we examine both softmax and CRF as inference layer on three sequence labeling tasks." 
}, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-110", "text": "----------------------------------" }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-111", "text": "**EXPERIMENTS**" }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-112", "text": "We investigate the main influencing factors to system accuracy, including the character sequence representations, word sequence representations, inference algorithm, pretrained embeddings, tag scheme, running environment and optimizer; analyzing system performances from the perspective of decoding speed and accuracies on in-vocabulary (IV) and out-of-vocabulary (OOV) entities/chunks/words." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-116", "text": "The NER dataset has been standardly split in Tjong Kim Sang and De Meulder (2003 (Toutanova et al., 2003; Santos and Zadrozny, 2014; Ma and Hovy, 2016; Liu et al., 2018) , we adopt the standard splits by using sections 0-18 as training set, sections 19-21 as development set and sections 22-24 as test set." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-117", "text": "No preprocessing is performed on either dataset except for normalizing digits." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-118", "text": "The dataset statistics are listed in Table 2 ." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-119", "text": "Hyperparameters." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-120", "text": "Table 3 shows the hyperparameters used in our experiments, which mostly follow Ma and Hovy (2016) , including the learning rate \u03b7 = 0.015 for word LSTM models." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-121", "text": "For word CNN based models, a large \u03b7 leads to convergence problem." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-122", "text": "We take \u03b7 = 0.005 with more epochs (200) instead." 
}, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-123", "text": "GloVe 100-dimension (Pennington et al., 2014 ) is used to initialize word embeddings and character embeddings are randomly initialized." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-124", "text": "We use mini-batch stochastic gradient descent (SGD) with a decayed learning rate to update parameters." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-125", "text": "For NER and chunking, we the BIOES tag scheme." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-126", "text": "Evaluation." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-127", "text": "Standard precision (P), recall (R) and F1-score (F) are used as the evaluation metrics for NER and chunking; token accuracy is used to evaluate the performance of POS tagger." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-128", "text": "Development datasets are used to select the optimal model among all epochs, and we report scores of the selected model on the test dataset." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-129", "text": "To reduce the volatility of the system, we conduct each experiment 5 times under different random seeds, and report the max, mean, and standard deviation for each model." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-130", "text": "----------------------------------" }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-131", "text": "**RESULTS**" }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-132", "text": "Tables 4, 5 and 6 show the results of the twelve models on NER, chunking and POS datasets, respectively." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-133", "text": "Existing work has also been listed in the tables for comparison." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-134", "text": "To simplify the description, we use \"CLSTM\" and \"CCNN\" to represent character LSTM and character CNN encoder, respectively." 
}, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-135", "text": "Similarly, \"WLSTM\" and \"WCNN\" represent word LSTM and word CNN structure, respectively." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-136", "text": "As shown in Table 4 , most NER work focuses on WLSTM+CRF structures with different character sequence representations." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-137", "text": "We re-implement the structure of several reports (Chiu and Nichols, 2016; Ma and Hovy, 2016; Peters et al., 2017) , which take the CCNN+WLSTM+CRF architecture." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-138", "text": "Our reproduced models give slightly better performances." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-139", "text": "The results of Lample et al. (2016) can be reproduced by our CLSTM+WLSTM+CRF." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-140", "text": "In most cases, our \"Nochar\" based models underperform their corresponding prototypes Strubell et al., 2017) , which utilize the hand-crafted features." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-141", "text": "Table 5 shows the results of the chunking task." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-142", "text": "Peters et al. (2017) give the best reported single model results (95.00\u00b10.08), and our CLSTM+WLSTM+CRF gives a comparable performance (94.93\u00b10.05)." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-143", "text": "We re-implement Zhai et al. (2017) 's model in our Nochar+WLSTM but cannot reproduce their results, this may because that they use grid search for hyperparameter selection." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-144", "text": "Our Nochar+WCNN+CRF can give comparable results with Collobert et al. (2011) , even ours does not include character information." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-145", "text": "The results of the POS tagging task is shown in Table 6 ." 
}, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-146", "text": "The results of Lample et al. (2016) , Ma and Hovy (2016) and Yang et al. (2017b) can be reproduced by our CLSTM+WLSTM+CRF and CCNN+WLSTM+CRF models." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-147", "text": "Our WLSTM based models give better results than WLSTM+CRF based models, this is consistent with the fact that Ling et al. (2015) take CLSTM+WLSTM without CRF layer but achieve the best POS accuracy." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-148", "text": "Santos and Zadrozny (2014) build a pure CNN structure on both character and word level, which can be reproduced by our CCNN+WCNN models." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-149", "text": "Based on above observations, most results in the literature are reproducible." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-150", "text": "Our implementations can achieve the comparable or better results with state-of-the-art models." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-151", "text": "We do not fine-tune any hyperparameter to fit the specific task." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-152", "text": "Results on Table 4 , 5 and 6 are all under the same hyperparameters, which demonstrates the generalization ability of our framework." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-153", "text": "----------------------------------" }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-154", "text": "**NETWORK SETTINGS**" }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-155", "text": "Character LSTM vs Character CNN." 
}, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-156", "text": "Unlike the observations of Reimers and Gurevych (2017b) , in our experiments, character information can significantly (p < 0.01) 3 improve sequence labeling models (by comparing the row of Nochar with CLSTM or CCNN on Table 4 , 5 and 6), while the difference between CLSTM and CCNN is not significant." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-157", "text": "In most cases, CLSTM and CCNN can give comparable results under different frameworks and different tasks." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-158", "text": "CCNN gives the best NER result under the WL-STM+CRF framework, while CLSTM gets better NER results in all other configurations." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-159", "text": "For chunking and POS tagging, CLSTM consistently outperforms CCNN under all settings, while the difference is statistically insignificant (p > 0.2)." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-160", "text": "We conclude that the difference between CLSTM and CCNN is small, which is consistent with the observation of Reimers and Gurevych (2017b) ." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-161", "text": "Word LSTM vs Word CNN." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-162", "text": "By comparing the performances of WLSTM+CRF, WLSTM with WCNN+CRF, WCNN on the three benchmarks, we conclude that word-based LSTM models are significantly (p < 0.01) better than word-based CNN models in most cases." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-163", "text": "It demonstrates that global word context information is necessary for sequence labeling." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-164", "text": "Softmax vs CRF." 
}, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-165", "text": "Models with CRF inference layer can consistently outperform the models with softmax layer under all configurations on NER and chunking tasks, proving the effectiveness of label dependency information." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-166", "text": "While for POS tagging, the local softmax based models give slightly better accuracies while the difference is insignificant (p > 0.2)." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-167", "text": "----------------------------------" }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-168", "text": "**EXTERNAL FACTORS**" }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-169", "text": "In addition to model structures, external factors such as pretrained embeddings, tag scheme, and optimizer can significantly influence system performance." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-170", "text": "We investigate a set of external factors on the NER dataset with the two best models: CLSTM+WLSTM+CRF and CCNN+WLSTM+CRF." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-171", "text": "Pretrained embedding." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-172", "text": "Figure 4 (a) shows the F1-scores of the two best models on the NER test set with two different pretrained embeddings, as well as the random initialization." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-173", "text": "Compared with the random initialization, models using pretrained embeddings give significant improvements (p < 0.01)." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-174", "text": "The GloVe 100-dimension embeddings give higher F1-scores than SENNA (Collobert et al., 2011) on both models, which is consistent with the observation of Ma and Hovy (2016) ." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-175", "text": "Tag scheme." 
}, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-176", "text": "We examine two different tag schemes: BIO and BIOES (Ratinov and Roth, 2009) ." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-177", "text": "The results are shown in Figure 4 (b)." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-178", "text": "In our experiments, models using BIOES are significantly (p < 0.05) better than BIO." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-179", "text": "Our observation is consistent with most literature (Ratinov and Roth, 2009; Dai et al., 2015) ." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-180", "text": "Reimers and Gurevych (2017b) report that the difference between the schemes is insignificant." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-181", "text": "Running environment." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-182", "text": "Liu et al. (2018) observe that neural sequence labeling models can give better results on GPU than on CPU." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-183", "text": "We conduct repeated experiments on both GPU and CPU environments." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-184", "text": "The results are shown in Figure 4 (b)." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-185", "text": "Models run on CPU give a lower mean F1-score than models run on GPU, while the difference is insignificant (p > 0.2)." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-186", "text": "Optimizer." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-187", "text": "We compare different optimizers including SGD, Adagrad (Duchi et al., 2011), Adadelta (Zeiler, 2012), RMSProp (Tieleman and Hinton, 2012) and Adam (Kingma and Ba, 2014)." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-188", "text": "The results are shown in Figure 5 ."
}, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-189", "text": "In contrast to Reimers and Gurevych (2017b) , who reported that SGD is the worst optimizer, our results show that SGD outperforms all other optimizers significantly (p < 0.01), albeit with a slower convergence process during training." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-190", "text": "Our observation is consistent with most literature (Chiu and Nichols, 2016; Lample et al., 2016; Ma and Hovy, 2016) ." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-191", "text": "----------------------------------" }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-192", "text": "**ANALYSIS**" }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-193", "text": "Decoding speed." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-194", "text": "We test the decoding speeds of the twelve models on the NER dataset using an Nvidia GTX 1080 GPU." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-195", "text": "Figure 6 shows the decoding times on 10000 NER sentences." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-196", "text": "The CRF inference layer severely limits the decoding speed due to the left-to-right inference process, which prevents parallel decoding." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-197", "text": "Character LSTM significantly slows down the system." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-198", "text": "Compared with models without character information, adding character CNN representations does not affect the decoding speed much but gives significant accuracy improvements (shown in Section 4.3)." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-199", "text": "Due to the support of parallel computing, word-based CNN models are consistently faster than word-based LSTM models, with close accuracies, leading to large utilization potential in practice."
}, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-200", "text": "Table 7 : Results for OOV analysis." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-201", "text": "OOV error." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-202", "text": "We conduct error analysis on in-vocabulary and out-of-vocabulary words with the CRF based models 6 ." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-203", "text": "Following Ma and Hovy (2016) , words in the test set are divided into four subsets: in-vocabulary words, out-of-training-vocabulary words (OOTV), out-of-embedding-vocabulary words (OOEV) and out-of-both-vocabulary words (OOBV)." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-204", "text": "For NER and chunking, we consider entities or chunks rather than words." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-205", "text": "The OOV entities and chunks are categorized following Ma and Hovy (2016) ." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-206", "text": "Table 7 shows the performances of different OOV splits on three benchmarks." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-207", "text": "The top three rows list the performances of word-based LSTM CRF models, followed by the word-based CNN CRF models." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-208", "text": "The OOEV results on NER remain at 100% because there exist only 8 OOEV entities and all are recognized correctly." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-209", "text": "It is obvious that character LSTM or CNN representations improve OOTV and OOBV the most on both WLSTM+CRF and WCNN+CRF models across all three datasets, proving that the main contribution of neural character sequence representations is to disambiguate the OOV words."
}, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-210", "text": "Models with character LSTM representations give the best IV scores across all configurations, which may be because character LSTM can be well trained on IV data, bringing useful global character sequence information." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-211", "text": "On the OOVs, character LSTM and CNN give comparable results." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-212", "text": "----------------------------------" }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-213", "text": "**CONCLUSION**" }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-214", "text": "We built a unified neural sequence labeling framework to reproduce and compare recent state-of-the-art models with different configurations." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-215", "text": "We explored three neural model design decisions: character sequence representations, word sequence representations, and inference layer." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-216", "text": "Experiments show that character information helps to improve model performances, especially on disambiguating OOV words." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-217", "text": "Character-level LSTM and CNN structures give comparable improvements, with the latter being more efficient." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-218", "text": "In most cases, models with word-level LSTM encoders outperform those with CNN, at the expense of longer decoding time." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-219", "text": "We observed that the CRF inference algorithm is effective on NER and chunking tasks, but shows no advantage on POS tagging." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-220", "text": "With controlled experiments on the NER dataset, we showed that BIOES tags are better than BIO."
}, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-221", "text": "Besides, the pretrained GloVe 100d embeddings and the SGD optimizer give significantly better performance than their competitors." }, { "sent_id": "c293e9fb8d6382f185a3efeaf0dbf7-C001-222", "text": "6 We choose the models that give the median performance on the test set for conducting result analysis." } ], "y": { "@DIF@": { "gold_contexts": [ [ "c293e9fb8d6382f185a3efeaf0dbf7-C001-27" ], [ "c293e9fb8d6382f185a3efeaf0dbf7-C001-116" ] ], "cite_sentences": [ "c293e9fb8d6382f185a3efeaf0dbf7-C001-27", "c293e9fb8d6382f185a3efeaf0dbf7-C001-116" ] }, "@EXT@": { "gold_contexts": [ [ "c293e9fb8d6382f185a3efeaf0dbf7-C001-27" ] ], "cite_sentences": [ "c293e9fb8d6382f185a3efeaf0dbf7-C001-27" ] }, "@BACK@": { "gold_contexts": [ [ "c293e9fb8d6382f185a3efeaf0dbf7-C001-41" ], [ "c293e9fb8d6382f185a3efeaf0dbf7-C001-60", "c293e9fb8d6382f185a3efeaf0dbf7-C001-61" ], [ "c293e9fb8d6382f185a3efeaf0dbf7-C001-93" ] ], "cite_sentences": [ "c293e9fb8d6382f185a3efeaf0dbf7-C001-41", "c293e9fb8d6382f185a3efeaf0dbf7-C001-61", "c293e9fb8d6382f185a3efeaf0dbf7-C001-93" ] } } }, "ABC_6683d7b77f536b93416d985414afeb_16": { "x": [ { "sent_id": "6683d7b77f536b93416d985414afeb-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-2", "text": "Emojis are small images that are commonly included in social media text messages." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-3", "text": "The combination of visual and textual content in the same message builds up a modern way of communication that automatic systems are not accustomed to dealing with." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-4", "text": "In this paper we extend recent advances in emoji prediction by putting forward a multimodal approach that is able to predict emojis in Instagram posts."
}, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-5", "text": "Instagram posts are composed of pictures together with texts which sometimes include emojis." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-6", "text": "We show that these emojis can be predicted by using the text, but also using the picture." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-7", "text": "Our main finding is that incorporating the two synergistic modalities, in a combined model, improves accuracy in an emoji prediction task." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-8", "text": "This result demonstrates that these two modalities (text and images) encode different information on the use of emojis and therefore can complement each other." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-9", "text": "----------------------------------" }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-11", "text": "In the past few years the use of emojis in social media has increased exponentially, changing the way we communicate." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-12", "text": "The combination of visual and textual content poses new challenges for information systems which need not only to deal with the semantics of text but also that of images." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-13", "text": "Recent work (Barbieri et al., 2017) has shown that textual information can be used to predict emojis associated to text." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-14", "text": "In this paper we show that in the current context of multimodal communication, where texts and images are combined in social networks, visual information should be combined with texts in order to obtain more accurate emoji prediction models." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-15", "text": "We explore the use of emojis in the social media platform Instagram."
}, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-16", "text": "We put forward a multimodal approach to predict the emojis associated to an Instagram post, given its picture and text 1 ." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-17", "text": "Our task and experimental framework are similar to those of Barbieri et al. (2017) ; however, we use different data (Instagram instead of Twitter) and, in addition, we rely on images to improve the selection of the most likely emojis to associate to a post." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-18", "text": "We show that a multimodal approach (textual and visual content of the posts) increases the emoji prediction accuracy compared to the one that only uses textual information." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-19", "text": "This suggests that textual and visual content embed different but complementary features of the use of emojis." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-20", "text": "In general, an effective approach to predict the emoji to be associated to a piece of content may help to improve natural language processing tasks (Novak et al., 2015) , such as information retrieval, generation of emoji-enriched social media content, suggestion of emojis when writing text messages or sharing pictures online." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-21", "text": "Given that emojis may also mislead humans (Miller et al., 2017) , the automated prediction of emojis may help to achieve better language understanding." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-22", "text": "As a consequence, by modeling the semantics of emojis, we can improve highly-subjective tasks like sentiment analysis, emotion recognition and irony detection (Felbo et al., 2017) ."
}, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-23", "text": "----------------------------------" }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-24", "text": "**DATASET AND TASK**" }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-25", "text": "Dataset: We gathered Instagram posts published between July 2016 and October 2016, and geolocalized in the United States of America." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-26", "text": "We considered only posts that contained a photo together with the related user description of at least 4 words and exactly one emoji." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-27", "text": "Moreover, as done by Barbieri et al. (2017) , we considered only the posts which include one and only one of the 20 most frequent emojis (the most frequent emojis are shown in Table 3 )." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-28", "text": "Our dataset is composed of 299,809 posts, each containing a picture, the text associated to it and only one emoji." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-29", "text": "In the experiments we also considered the subsets of the 10 (238,646 posts) and 5 most frequent emojis (184,044 posts) (similarly to the approach followed by Barbieri et al. (2017) )." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-30", "text": "----------------------------------" }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-31", "text": "**TASK:**" }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-32", "text": "We extend the experimental scheme of Barbieri et al. (2017) , by considering also visual information when modeling posts." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-33", "text": "We cast the emoji prediction problem as a classification task: given an image or a text (or both inputs in the multimodal scenario) we select the most likely emoji that could be added to (thus used to label) such contents." 
}, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-34", "text": "The task for our machine learning models is, given the visual and textual content of a post, to predict the single emoji that appears in the input comment." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-35", "text": "----------------------------------" }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-36", "text": "**MODELS**" }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-37", "text": "We present and motivate the models that we use to predict an emoji given an Instagram post composed of a picture and the associated comment." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-38", "text": "----------------------------------" }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-39", "text": "**RESNETS**" }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-40", "text": "Deep Residual Networks (ResNets) (He et al., 2016) are Convolutional Neural Networks which were competitive in several image classification tasks (Russakovsky et al., 2015; Lin et al., 2014) and have proven to be one of the best CNN architectures for image recognition." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-41", "text": "ResNet is a feedforward CNN that exploits \"residual learning\", by bypassing two or more convolution layers (like similar previous approaches (Sermanet and LeCun, 2011) )." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-42", "text": "We use an implementation of the original ResNet where the scale and aspect ratio augmentation are from (Szegedy et al., 2015) , the photometric distortions from (Howard, 2013) and weight decay is applied to all weights and biases (instead of only weights of the convolution layers)." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-43", "text": "The network we used is composed of 101 layers (ResNet-101), initialized with pretrained parameters learned on ImageNet (Deng et al., 2009) ."
}, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-44", "text": "We use this model as a starting point to later finetune it on our emoji classification task." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-45", "text": "The learning rate was set to 0.0001 and we stopped training early when there was no improvement on the validation set." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-46", "text": "----------------------------------" }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-47", "text": "**FASTTEXT**" }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-48", "text": "FastText (Joulin et al., 2017) is a linear model for text classification." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-49", "text": "We decided to employ FastText as it has been shown that on specific classification tasks, it can achieve competitive results, comparable to complex neural classifiers (RNNs and CNNs), while being much faster." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-50", "text": "FastText represents a valid approach when dealing with social media content classification, where huge amounts of data need to be processed and new and relevant information is continuously generated." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-51", "text": "The FastText algorithm is similar to the CBOW algorithm (Mikolov et al., 2013) , where the middle word is replaced by the label, in our case the emoji." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-52", "text": "Given a set of N documents, the loss that the model attempts to minimize is the negative log-likelihood over the labels (in our case, the emojis):" }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-53", "text": "where e n is the emoji included in the n-th Instagram post, represented as a one-hot vector, and used as the label."
}, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-54", "text": "A and B are affine transformations (weight matrices), and x n is the unit vector of the bag of features of the n-th document (comment)." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-55", "text": "The bag of features is the average of the input words, represented as vectors with a look-up table." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-56", "text": "Barbieri et al. (2017) propose a recurrent neural network approach for the emoji prediction task." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-57", "text": "We use this model as baseline, to verify whether FastText achieves comparable performance." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-58", "text": "They used a Bidirectional LSTM with character representation of the words (Ling et al., 2015; Ballesteros et al., 2015) to handle orthographic variants (or even spelling errors) of the same word that occur in social media (e.g. cooooool vs cool)." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-59", "text": "----------------------------------" }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-60", "text": "**B-LSTM BASELINE**" }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-61", "text": "----------------------------------" }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-62", "text": "**EXPERIMENTS AND EVALUATION**" }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-63", "text": "In order to study the relation between Instagram posts and emojis, we performed two different experiments." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-64", "text": "In the first experiment (Section 4.2) we compare the FastText model with the state of the art on emoji classification (B-LSTM) by Barbieri et al. (2017) ." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-65", "text": "Our second experiment (Section 4.3) evaluates the visual (ResNet) and textual (FastText) models on the emoji prediction task." 
}, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-66", "text": "Moreover, we evaluate a multimodal combination of both models, respectively based on visual and textual inputs." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-67", "text": "Table 1 : Comparison with Barbieri et al. (2017) , using the same Twitter dataset." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-68", "text": "Finally we discuss the contribution of each modality to the prediction task." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-69", "text": "We use 80% of our dataset (introduced in Section 2) for training, 10% to tune our models, and 10% for testing (selecting the sets randomly)." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-70", "text": "----------------------------------" }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-71", "text": "**FEATURE EXTRACTION AND CLASSIFIER**" }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-72", "text": "To model visual features we first finetune the ResNet (process described in Section 3.1) on the emoji prediction task, then extract the vectors from the input of the last fully connected layer (before the softmax)." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-73", "text": "The textual embeddings are the bag of features shown in Section 3.2 (the x n vectors), extracted after training the FastText model on the emoji prediction task." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-74", "text": "With respect to the combination of textual and visual modalities, we adopt a middle fusion approach (Kiela and Clark, 2015) : we associate to each Instagram post a multimodal embedding obtained by concatenating the unimodal representations of the same post (i.e. the visual and textual embeddings), previously learned." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-75", "text": "Then, we feed a classifier 2 with visual (ResNet), textual (FastText), or multimodal feature embeddings, and test the accuracy of the three systems."
}, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-76", "text": "----------------------------------" }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-77", "text": "**B-LSTM / FASTTEXT COMPARISON**" }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-78", "text": "To compare the FastText model with the word and character based B-LSTMs presented by Barbieri et al. (2017) , we consider the same three emoji prediction tasks they proposed: top-5, top-10 and top-20 emojis most frequently used in their Tweet datasets." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-79", "text": "In this comparison we used the same Twitter datasets." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-80", "text": "As we can see in Table 1 , the FastText model is competitive, and it is also able to outperform the character based B-LSTM in one of the emoji prediction tasks (top-20 emojis). Table 2 : Prediction results of top-5, top-10 and top-20 most frequent emojis in the Instagram dataset: Precision (P), Recall (R), F-measure (F1)." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-81", "text": "Experimental settings: majority baseline, weighted random, visual, textual and multimodal systems." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-82", "text": "In the last line we report the percentage improvement of the multimodal over the textual system." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-83", "text": "----------------------------------" }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-84", "text": "**MULTIMODAL EMOJI PREDICTION**" }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-85", "text": "We present the results of the three emoji classification tasks, using the visual, textual and multimodal features (see Table 2 )."
}, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-86", "text": "The emoji prediction task seems difficult using just the image of the Instagram post (Visual), even if it largely outperforms the majority baseline 3 and weighted random 4 ." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-87", "text": "We achieve better performances when we use feature embeddings extracted from the text." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-88", "text": "The most interesting finding is that when we use a multimodal combination of visual and textual features, we get a non-negligible improvement." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-89", "text": "This suggests that these two modalities embed different representations of the posts, and when used in combination they are synergistic." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-90", "text": "It is also interesting to note that the more emojis to predict, the higher improvement the multimodal system provides over the text only system (3.28% for top-5 emojis, 7.31% for top-10 emojis, and 13.42% for the top-20 emojis task)." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-91", "text": "----------------------------------" }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-92", "text": "**QUALITATIVE ANALYSIS**" }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-93", "text": "In Table 3 we show the results for each class in the top-20 emojis task." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-94", "text": "The emojis with the highest F1 using the textual features are the most frequent one (0.62) and the US flag (0.52)." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-95", "text": "The latter seems easy to predict since it appears in specific contexts: when the word USA/America is used (or when American cities are mentioned, like #NYC)."
}, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-96", "text": "The hardest emojis to predict by the text only system are the two gestures (0.12) and (0.13)." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-97", "text": "The first one is often selected when the gold standard emoji is the second one, or is often mispredicted by wrongly selecting or ." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-98", "text": "3 Always predict since it is the most frequent emoji. 4 Random keeping the label distribution of the training set." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-99", "text": "Table 3 : F-measure in the test set of the 20 most frequent emojis using the three different models. \"%\" indicates the percentage of the class in the test set." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-100", "text": "Another relevant confusion scenario related to emoji prediction has been spotted by Barbieri et al. (2017) : relying on Twitter textual data they showed that the emoji was hard to predict as it was used similarly to ." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-101", "text": "Instead when we consider Instagram data, the emoji is easier to predict (0.23), even if it is often confused with ." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-102", "text": "When we rely on visual contents (Instagram picture), the emojis which are easily predicted are the ones in which the associated photos are similar." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-103", "text": "For instance, most of the pictures associated to are dog/pet pictures." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-104", "text": "Similarly, is predicted along with very bright pictures taken outside." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-105", "text": "is correctly predicted along with pictures related to gym and fitness."
}, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-106", "text": "The accuracy of is also high since most posts including this emoji are related to fitness (and the pictures are simply either selfies at the gym, weight lifting images, or protein food)." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-107", "text": "Employing a multimodal approach improves performance." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-108", "text": "This means that the two modalities are somehow complementary, and adding visual information helps to solve potential ambiguities that arise when relying only on textual content." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-109", "text": "In Figure 1 we report the confusion matrix of the multimodal model." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-110", "text": "The emojis are plotted from the most frequent to the least, and we can see that the model tends to mispredict emojis selecting more frequent emojis (the left part of the matrix is brighter)." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-111", "text": "----------------------------------" }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-112", "text": "**SALIENCY MAPS**" }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-113", "text": "In order to show the parts of the image most relevant for each class we analyze the global average pooling (Lin et al., 2013) on the convolutional feature maps (Zhou et al., 2016) ." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-114", "text": "By visually observing the image heatmaps of the set of Instagram post pictures we note that in most cases it is quite difficult to determine a clear association between the emoji used by the user and some particular portion of the image." 
}, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-115", "text": "Detecting the correct emoji given an image is harder than a simple object recognition task, as the emoji choice depends on subjective emotions of the user who posted the image." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-116", "text": "In Figure 2 we show the first four predictions of the CNN for three pictures, and where the network focuses (in red)." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-117", "text": "We can see that in the first example the network selects the smile with sunglasses because of the legs in the bottom of the image, the dog emoji is selected while focusing on the dog in the image, and the smiling emoji while focusing on the person in the back, who is lying on a hammock." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-118", "text": "In the second example the network selects again the due to the water and part of the kayak, the heart emoji focusing on the city landscape, and the praying emoji focusing on the sky." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-119", "text": "The same \"praying\" emoji is also selected when focusing on the luxury car in the third example, probably because the same emoji is used to express desire, i.e. \"please, I want this awesome car\"." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-120", "text": "It is interesting to note that images can give context to textual messages like in the following Instagram posts: (1)\"Love my new home \" (associated to a picture of a bright garden, outside) and (2) \"I can't believe it's the first day of school!!! I love being these boys' mommy!!!! #myboys #mommy \" (associated to picture of two boys wearing two blue shirts)." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-121", "text": "In both examples the textual system predicts ." 
}, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-122", "text": "The multimodal system, instead, correctly predicts both of them: the blue color in the picture associated to (2) helps to change the color of the heart, and the sunny/bright picture of the garden in (1) helps to correctly predict ." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-123", "text": "----------------------------------" }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-124", "text": "**RELATED WORK**" }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-125", "text": "Modeling the semantics of emojis, and their applications, is a relatively novel research problem with direct applications in any social media task." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-126", "text": "Since emojis do not have a clear grammar, their role in text messages is not clear." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-127", "text": "Emojis are considered function words or even affective markers (Na'aman et al., 2017) , which can potentially affect the overall semantics of a message (Donato and Paggio, 2017) ." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-128", "text": "Emojis can encode different meanings, and they can be interpreted differently." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-129", "text": "Emoji interpretation has been explored user-wise (Miller et al., 2017) , location-wise, specifically in countries (Barbieri et al., 2016b) and cities (Barbieri et al., 2016a) , and gender-wise (Chen et al., 2017) and time-wise ." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-130", "text": "Emoji semantics and usage have been studied with distributional semantics, with models trained on Twitter data (Barbieri et al., 2016c) , Twitter data together with the official unicode description (Eisner et al., 2016) , or using text from a popular keyboard app ." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-131", "text": "In the same context, Wijeratne et al. (2017a) propose a platform for exploring emoji semantics." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-132", "text": "In order to further study emoji semantics, two datasets with pairwise emoji similarity, with human annotations, have been proposed: EmoTwi50 (Barbieri et al., 2016c) and EmoSim508 (Wijeratne et al., 2017b) ." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-133", "text": "Emoji similarity has been also used for proposing efficient keyboard emoji organization (Pohl et al., 2017) ." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-134", "text": "Recently, Barbieri and Camacho-Collados (2018) show that emoji modifiers (skin tones and gender) can affect the semantic vector representation of emojis." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-135", "text": "Emojis play an important role in the emotional content of a message." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-136", "text": "Several sentiment lexicons for emojis have been proposed (Novak et al., 2015; Kimura and Katsurai, 2017; Rodrigues et al., 2018) and also studies in the context of emotion and emojis have been published recently (Wood and Ruder, 2016; Hu et al., 2017) ." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-137", "text": "During the last decade several studies have shown how sentiment analysis improves when we jointly leverage information coming from different modalities (e.g. text, images, audio, video) (Morency et al., 2011; Poria et al., 2015; Tran and Cambria, 2018) ." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-138", "text": "In particular, when we deal with Social Media posts, the presence of both textual and visual content has promoted a number of investigations on sentiment or emotions (Baecchi et al., 2016; You et al., 2016b,a; Yu et al., 2016; Chen et al., 2015) or emojis (Cappallo et al., 2015, 2018) ."
}, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-139", "text": "----------------------------------" }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-140", "text": "**CONCLUSIONS**" }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-141", "text": "In this work we explored the use of emojis in a multimodal context (Instagram posts)." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-142", "text": "We have shown that using a synergistic approach, thus relying on both textual and visual contents of social media posts, we can outperform state of the art unimodal approaches (based only on textual contents)." }, { "sent_id": "6683d7b77f536b93416d985414afeb-C001-143", "text": "As future work, we plan to extend our models by considering the prediction of more than one emoji per Social Media post and also considering a bigger number of labels." } ], "y": { "@BACK@": { "gold_contexts": [ [ "6683d7b77f536b93416d985414afeb-C001-13" ] ], "cite_sentences": [ "6683d7b77f536b93416d985414afeb-C001-13" ] }, "@MOT@": { "gold_contexts": [ [ "6683d7b77f536b93416d985414afeb-C001-13" ] ], "cite_sentences": [ "6683d7b77f536b93416d985414afeb-C001-13" ] }, "@DIF@": { "gold_contexts": [ [ "6683d7b77f536b93416d985414afeb-C001-17" ], [ "6683d7b77f536b93416d985414afeb-C001-27" ], [ "6683d7b77f536b93416d985414afeb-C001-32" ] ], "cite_sentences": [ "6683d7b77f536b93416d985414afeb-C001-17", "6683d7b77f536b93416d985414afeb-C001-27", "6683d7b77f536b93416d985414afeb-C001-32" ] }, "@EXT@": { "gold_contexts": [ [ "6683d7b77f536b93416d985414afeb-C001-17" ], [ "6683d7b77f536b93416d985414afeb-C001-27" ], [ "6683d7b77f536b93416d985414afeb-C001-32" ] ], "cite_sentences": [ "6683d7b77f536b93416d985414afeb-C001-17", "6683d7b77f536b93416d985414afeb-C001-27", "6683d7b77f536b93416d985414afeb-C001-32" ] }, "@SIM@": { "gold_contexts": [ [ "6683d7b77f536b93416d985414afeb-C001-29" ], [ "6683d7b77f536b93416d985414afeb-C001-64" ], [ "6683d7b77f536b93416d985414afeb-C001-66" ], [ 
"6683d7b77f536b93416d985414afeb-C001-78" ], [ "6683d7b77f536b93416d985414afeb-C001-100" ] ], "cite_sentences": [ "6683d7b77f536b93416d985414afeb-C001-29", "6683d7b77f536b93416d985414afeb-C001-64", "6683d7b77f536b93416d985414afeb-C001-66", "6683d7b77f536b93416d985414afeb-C001-78", "6683d7b77f536b93416d985414afeb-C001-100" ] }, "@USE@": { "gold_contexts": [ [ "6683d7b77f536b93416d985414afeb-C001-29" ], [ "6683d7b77f536b93416d985414afeb-C001-66" ], [ "6683d7b77f536b93416d985414afeb-C001-78" ] ], "cite_sentences": [ "6683d7b77f536b93416d985414afeb-C001-29", "6683d7b77f536b93416d985414afeb-C001-66", "6683d7b77f536b93416d985414afeb-C001-78" ] } } }, "ABC_2adb3a645a57b8f441a80bb5a46045_16": { "x": [ { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-160", "text": "----------------------------------" }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-161", "text": "**MACHINE LEARNING DECISIONS**" }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-182", "text": "Weakly supervised data for classifying into stance against or for vaccination could be gathered by using the assumption that debaters do not change their stance within a thread." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-183", "text": "A few posts from each of the debaters would then be manually annotated and the rest of the posts from this debater would automatically be assigned the same stance category." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-141", "text": "The set of debate posts written by one debater would then be treated as one unit of text, and the assumption that debaters do not change their stance within a discussion thread would have to be made." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-17", "text": "Mohammad et al. (2017) define stance detection as \" [.. .] the task of automatically determining from text whether the author of the text is in favor of, against, or neutral toward a proposition or target\"." 
}, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-18", "text": "They distinguish stance detection from the better known task of sentiment analysis by that \"in stance detection, systems are to determine favorability toward a given (pre-chosen) target of interest\", whereas sentiment analysis is the task of \"determining whether a piece of text is positive, negative, or neutral, or determining from text the speakers opinion and the target of the opinion\"." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-19", "text": "For instance, the utterance \"The diseases that vaccination can protect you from are horrible\" expresses a stance for the pre-chosen target \"vaccination\", while expressing a negative sentiment towards the sentiment-target \"diseases\"." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-20", "text": "----------------------------------" }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-21", "text": "**BACKGROUND**" }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-22", "text": "This definition of stance is used in several stance detection studies." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-23", "text": "For instance, in studies performed on the text genres web debate forums (Somasundaran and Wiebe, 2010; Anand et al., 2011; Walker et al., 2012; Hasan and Ng, 2013) , news paper text (Ferreira and Vlachos, 2016; Fake News Challenge, 2017) and tweets (Augenstein et al., 2016; Mohammad et al., 2017) ." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-24", "text": "Stance detection is generally considered more difficult than sentiment analysis and thereby a task for which currently available methods achieve lower results." 
}, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-25", "text": "This was, for instance, shown by a recent shared task on three-category stance classification of tweets, where an F-score of 0.59 was achieved by a classifier that outperformed submissions from 19 shared task teams (Mohammad et al., 2017) ." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-26", "text": "For the task of stance classification of posts of two-sided discussion threads, an F-score of 0.70 is the best result we have been able to find in previous research (Hasan and Ng, 2013)" }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-27", "text": "Previous studies on attitudes towards vaccination do not make use of the term stance, but discuss negative/positive sentiment towards vaccination." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-28", "text": "There are a number of such sentiment detection studies conducted on tweets, while studies on online forums, to the best of our knowledge, are limited to the task of topic modelling (Tangherlini et al., 2016) ." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-29", "text": "Most vaccination sentiment studies have been conducted on tweets that contain keywords related to HPV (human papillomavirus) vaccination." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-30", "text": "In one of these, 1,470 HPV-related tweets were manually categorised according to sentiment towards the HPV vaccine (positive, negative, neutral, or no mention)." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-31", "text": "A decision tree was thereafter trained to perform the sentiment classification, and a leave-one-out evaluation resulted in an AUC score of 0.92 (Massey et al., 2016) ." 
}, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-32", "text": "Another HPV study applied a three-level hierarchical classification scheme and 6,000 tweets were manually categorised as (i) related or unrelated to HPV vaccination, (ii) positive, negative or neutral towards vaccination and (iii) whether they concerned safety, efficacy, emotional resistance, cost and others (Du et al., 2017) ." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-33", "text": "A hierarchical classification scheme corresponding to the hierarchy of the categories was applied in the form of support vector machine classifiers, which resulted in a micro-average Fscore of 0.74 and a macro-average F-score of 0.59." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-34", "text": "In a third HPV study, tweets were annotated into the binary category of whether they expressed an anti-vaccine opinion or not (1,050 tweets as training data and 1,100 as test data)." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-35", "text": "A support vector machine trained on bigrams achieved an F-score of 0.82 for detecting negative tweets." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-36", "text": "Tweets on the A(H1N1) influenza vaccine have also been automatically classified (Salath\u00e9 and Khandelwal, 2011) ." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-37", "text": "47,143 tweets that contained keywords related to vaccination were manually classified into four categories; positive/negative/neutral sentiment towards vaccination, or not concerning the A(H1N1) vaccine." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-38", "text": "The 630 tweets that had been classified by at least 44 annotators, and for which more than 50% of these annotators had selected the same category were used as evaluation data." 
}, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-39", "text": "The rest of the annotated data was used for training a machine learning model in the form of an ensemble classifier built on a Naive Bayes and a Maximum Entropy classifier." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-40", "text": "This resulted in a classifier accuracy of 0.84." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-41", "text": "There are also a number of studies in which purely manual analyses of opinions on vaccination have been carried out." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-42", "text": "For instance, analyses of blog posts (Brien et al., 2013) , online forums (Skea et al., 2008) , and of reports on vaccination in many different types of online materials (Larson et al., 2013) ." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-43", "text": "Data from the latter analysis has been incorporated into a disease surveillance tool and used for comparing sentiment towards vaccination in different parts of North America (Powell et al., 2016b) ." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-44", "text": "However, as pointed out by Powell et al. (2016a) , to be able to use this kind of surveillance tool for vaccine-preventable diseases on a larger scale, manual analyses are not enough." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-45", "text": "Instead, the functionality that we investigate here is required, that is to be able to automatically detect stance towards vaccination." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-46", "text": "While previous studies on detection of stance/sentiment towards specific types of vaccines in tweets have have been carried out, we here aim to investigate the possibility of automatic vaccine stance detection in the important genre of online discussion forums." 
}, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-47", "text": "----------------------------------" }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-48", "text": "**METHOD**" }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-49", "text": "A corpus was first compiled and pre-processed, and thereafter annotated for stance towards vaccination." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-50", "text": "The annotated corpus was then used to train models to detect the stance categories." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-51", "text": "2" }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-52", "text": "----------------------------------" }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-53", "text": "**CORPUS SELECTION**" }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-54", "text": "The experiment was carried out on discussion threads from the British parental website Mumsnet, which is a site that hosts online forums where users discuss and share information on parenting and other topics." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-55", "text": "3 We based the choice of forum on the reasoning provided by Skea et al. (2008) ." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-56", "text": "They chose Mumsnet for manual analysis of vaccine stance, as the site contains discussion threads on many different topics and therefore is likely to attract a more diverse set of debaters than e.g., an anti-vaccination site." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-57", "text": "In addition, the discussion threads are publicly available without a login and the debaters are asked to anonymise their postings, e.g., by using a chat nickname." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-58", "text": "This makes it less likely that the posts include content that debaters would like to keep private." 
}, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-59", "text": "Mumsnet lists a large number of main discussion topics, of which vaccination is one." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-60", "text": "The debaters can either choose a main topic and start a new discussion thread on a more specific topic, e.g., \"Refusing to vaccinate your child\", or submit a post to an existing thread." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-61", "text": "We extracted posts from the six discussion threads, which, given their name, we assessed as most likely to spur a debate against/for vaccination." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-62", "text": "The topics of all threads with more than 80 posts and for which the latest post was written between the years 2011 and 2017 were considered (68 threads)." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-63", "text": "Only thread names that encouraged discussions on child vaccination in general were included, while debates on more specific aspects on vaccination or vaccination for specific diseases were excluded." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-64", "text": "Examples of topics excluded for these reasons were \"Vaccinations and nursery schools\", \"Staggering Vaccinations?\" or \"HPV gardasil 4 \"." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-65", "text": "Other types of threads excluded were those with a yes/no question as thread name, as answers to these might be more difficult to understand without context, and threads asking for explanations for an opposite viewe.g., \"Please explain, succinctly, the anti vac argument.\" -as such threads might prompt debaters to list opinions of opponents rather than to express their own arguments." 
}, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-66", "text": "----------------------------------" }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-67", "text": "**CORPUS PRE-PROCESSING AND FILTERING**" }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-68", "text": "The text and meta-data of the discussion posts could be extracted from the html-pages based on is available at: github.com/mariask2/vaccination_stance 3 www.mumsnet.com/Talk 4 The name of a vaccine." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-69", "text": "their div class." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-70", "text": "Html tags in the text were removed and the text was segmented into paragraphs using jusText (Pomik\u00e1lek, 2011) ." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-71", "text": "Texts previously written by other debaters are sometimes copied into new posts in order to indicate that a comment to this previous post is made (the posts are all posted on the same level, and there is no functionality for posting an answer to a specific previous post)." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-72", "text": "Although the debaters do not use a uniform approach to indicate that text has been copied from another debater, we devised a simple method for removing as many instances as possible of copied text." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-73", "text": "Paragraphs that were exclusively constructed of sentences that had occurred in previous posts were removed, using the standard sentence segmentation included in NLTK (Natural Language Toolkit) for sentence matching (Bird, 2002) ." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-74", "text": "In addition, text chunks longer than three words that were marked in bold or by double quotation were removed, in order to exclude citations in general, from other debaters as well as from external sources." 
}, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-75", "text": "Also names of opponents are sometimes mentioned in the posts, as a means to indicate that the content of the post is addressed to a specific opponent debater." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-76", "text": "These names were also automatically removed." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-77", "text": "Similar to previous vaccination studies on tweets, we considered the content of each debate post without the context of surrounding posts." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-78", "text": "To increase the likelihood that the debater's stance towards vaccination would be interpretable without context, only posts containing at least one of the following character combinations were included in the experiment: vacc / vax / jab / immunis / immuniz." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-79", "text": "The removal of posts that did not contain these character combinations resulted in that the original set of 2,225 debate posts included in the six extracted threads was reduced to a set of 1,190 posts." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-80", "text": "These 1,190 posts (written by 136 different authors) were manually annotated and used in the machine learning experiment." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-81", "text": "----------------------------------" }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-82", "text": "**ANNOTATION**" }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-83", "text": "The 1,190 posts to include in the experiment were presented for manual classification in a random order, without revealing who the debater was or which thread the post belonged to." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-84", "text": "The annotation was performed by one of the authors of the paper." 
}, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-85", "text": "Following the principle of the guidelines by Mohammad et al. (2017) , we classified the posts as taking a stance against or for vaccination, or to be undecided." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-86", "text": "The third category was applied to posts in which the debater explicitly declared to be undecided, as well as to posts for which the debater's stance towards vaccination could not be determined." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-87", "text": "The post did not have to explicitly support or oppose vaccination to be classified according to the categories against or for, but it was enough if an opposition or support could be inferred from the post." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-88", "text": "The stance taken could, for instance, be conveyed without actually mentioning vaccination, as in \"You are very lucky to live in the west, and you are free to make that decision because the majority are giving you herd immunity." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-89", "text": "\" It could also be conveyed through an agreement or disagreement with a known proponent/opponent of vaccination, as the statement \"Andrew Wakefield is a proven liar and a profiteer -therefore his \"research\" is irrelevant to any sane, rational discussion [...]\" Web links to external resources were, however, not included in the classification decision, even when a stance towards vaccination could be inferred from the name of the URL." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-90", "text": "Mohammad et al. (2017) did not specify in detail how to distinguish between stance against and for the targets included in the study." 
}, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-91", "text": "However, several of the posts in our data did not express a clear positive or clear negative stance towards vaccination in general, and we therefore needed more detailed guidelines for how to draw the line between against and for." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-92", "text": "We adopted the basic rule of classifying the post as against vaccination when the debater expressed a stance that opposed an official vaccination policy, e.g., as recommended in the health care system." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-93", "text": "This included, for instance, posts that expressed criticism against some of the recommended vaccines but an acceptance of others, or an acceptance of vaccination in general but not of the officially recommended vaccination scheme." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-94", "text": "The post \"I challenge my DD vaccine schedule all the time." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-95", "text": "Last time I refused to allow her have MMR with yet another vaccine just because a government quango says so [...]\" was thereby classified as against vaccination, although the debater is not negative towards all forms of vaccination." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-96", "text": "Posts that contained a concern over waning immunity from vaccination were classified as undecided, except when this concern was used as an argument against vaccination." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-97", "text": "Posts that expressed a stance against compulsory vaccination, without revealing a stance on vaccination in general, were also classified as undecided." 
}, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-98", "text": "----------------------------------" }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-99", "text": "**MACHINE LEARNING EXPERIMENTS**" }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-100", "text": "A standard text classification approach, in the form of a linear support vector machine model, was applied to the task of automatically classifying the debate posts." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-101", "text": "This follows the approach of Mohammad et al. (2017) , as well as of many of the previously performed vaccine sentiment studies." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-102", "text": "The model was trained on all tokens in the training data, as well as on 2-, 3-and 4-grams that occurred at least twice in the data." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-103", "text": "The standard NLTK stop word list for English was used for removing non-content words when constructing one set of n-grams." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-104", "text": "An additional set of n-grams was generated with a reduced version of this stop word list, which mainly consisted of articles, forms of copula, and forms of \"it\", \"have\" and \"do\"." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-105", "text": "The reason for using a reduced list was that negations, pronouns etc." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-106", "text": "that were included in the standard NLTK stop word list can be important cues for classifying argumentative text." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-107", "text": "Two types of classifiers were trained: one to perform the task of classifying posts into all three categories annotated, and the other one to perform the task of distinguishing posts annotated as against vaccination from those annotated as for vaccination." 
}, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-108", "text": "The classifiers were implemented using scikit-learn's LinearSVC class with the default settings." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-109", "text": "For training/evaluation, we applied crossvalidation on the 1,190 annotated posts." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-110", "text": "Due to the relatively small data size, we used 30 folds, instead of the more standard approach of 10 folds." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-111", "text": "----------------------------------" }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-112", "text": "**RESULTS**" }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-113", "text": "Average F-scores for the classifiers and the confusion matrix was calculated using the standard functionality in scikit-learn 5 ." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-114", "text": "Macro average Fscores of 0.44 and 0.62 were achieved for the three-class classifier and for the binary classifier, respectively (Table 1 )." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-115", "text": "The confusion matrix and the precision/recall scores for the three-class classifier show that there were frequent misclassifications between all three categories ( Table 2) ." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-116", "text": "It can also be derived from this table that there was an even distribution between posts annotated as taking a stance against vaccination (41%) and those taking a stance for vaccination (38%)." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-117", "text": "Table 1 : Macro and micro F-score for the two experiments, i.e., (i) a classifier that classifies posts as taking a stance against and for vaccination or being undecided and (ii) a binary classifier that classifies posts as against or for vaccination." 
}, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-118", "text": "----------------------------------" }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-119", "text": "**DISCUSSION**" }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-142", "text": "Previous research has shown that stance classification of posts in online forums can be improved by also taking classifier output from other posts by the same author into account (Hasan and Ng, 2013) ." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-143", "text": "On the same assumption that debaters do not change their stance, it might also be possible to give some kind of measure of the validity of the stance annotations." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-144", "text": "If the annotations are to be considered valid, posts from the same debater should ideally always take the same stance, or at least only exceptionally take the opposite stance." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-145", "text": "Such a measure could be used as a complement to annotator agreement that measures reliability rather than validity (Artstein and Poesio, 2008) ." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-146", "text": "Annotator agreement should, however, also be measured." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-147", "text": "Previous studies on stance do not reason to a large extent on where to draw the line between the categories against and for the target." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-148", "text": "In the case of previous vaccination stance studies, this might be explained by that these studies have focused on attitudes towards one specific type of vaccine; whereas stance towards vaccination in general is a larger issue with a wider range of possible nuances among debaters' opinions." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-149", "text": "Mohammad et al. 
(2017) , on the other hand, used broader targets, e.g., \"Feminist movement\" and \"Atheism\", but left the task of drawing the line between against and for to the annotators." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-150", "text": "Eight annotators classified each tweet, and only the subset of tweets for which at least 60% of the annotators agreed on the classification were retained in the data set." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-151", "text": "Other studies have circumvented the problem of giving an exact definition of stance towards a target by using data sets from debate portals, where meta-data on the debaters' stance is provided." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-152", "text": "Our decision, to classify debate posts as against vaccination when they opposed an official vaccination policy, was based on that debaters often implicitly argue against such a policy." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-153", "text": "In addition, a system for surveilling increases in vaccine hesitancy is likely to take an official policy as its point of departure." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-154", "text": "The eight annotators that classified each tweet in the study by Mohammad et al. (2017) were employed through a crowdsourcing platform, which was made possible by that the stance targets were chosen with the criterion that they should be commonly known in the United States." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-155", "text": "For annotating stance on vaccination, however, annotators with some amount of prior knowledge of vaccine debate topics and vaccine controversies might be preferred." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-156", "text": "Crowdsourcing might therefore not be a viable option for this annotation task." 
}, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-157", "text": "40 randomly selected posts that did not fulfil the criterion of containing any of the selected vaccine-related filter terms, and which therefore had been excluded from the study, were also annotated." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-158", "text": "Although this set is too small to make any definite conclusions, the relatively large proportion of posts in the set that expressed a stance against or for vaccination (25%) indicates that the filtering criterion used was too crude and led to the exclusion of relevant posts." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-159", "text": "Future studies should therefore either apply a better filtering criterion, or include all discussion thread posts in an annotation and machine learning study." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-120", "text": "Similar to what as been shown in previous stance detection studies, the detection of stance towards vaccination was proven to be a difficult task, at least for the type of model investigated and for the relatively small training corpus used." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-121", "text": "Results cannot be directly compared to previous studies, as there are a number factors that vary between the studies." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-122", "text": "For instance, the number of training samples used, evaluation measures applied, as well as criteria in some previous studies for excluding samples from the data set that were difficult to classify." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-123", "text": "There is, however, a large difference between the results achieved here, and some of the previous studies on detection of sentiment towards vaccination, which probably cannot be accounted for by these variations." 
}, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-124", "text": "Instead, it is likely that these differences arise because the previous studies i) were conducted on tweets, and ii) used a more precise stance target." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-125", "text": "As tweets are short, they are likely to be more to the point than the longer and more elaborate discussions of the debate forums that we used, and therefore easier for stance detection." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-126", "text": "A more precise stance target is likely to result in a more limited set of topics being discussed, which is easier to learn than the wider stance target of vaccination in general that we applied." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-127", "text": "In future work, the training corpus will be expanded in order to explore if this can improve results." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-128", "text": "In addition, an evaluation of a wider range of machine learning methods and features will be conducted." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-129", "text": "For constructing the corpus used here, a number of decisions had to be made." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-130", "text": "In the following sections we reflect on these decisions and make a number of suggestions for future studies." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-131", "text": "----------------------------------" }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-132", "text": "**ANNOTATION DECISIONS**" }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-133", "text": "The decision to treat each debate post as one independent unit without taking its context into account is not self-evident, as a post is more meaningful to interpret in the context of its discussion thread. Table 2: Confusion matrix and precision/recall scores for the three-class classifier." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-134", "text": "The table also shows the total number of posts annotated as against, for or undecided." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-136", "text": "Taking the context into account for determining the stance would, however, also entail a more complex classification task." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-137", "text": "At least in a first step in future work, we will therefore concentrate on the types of debate posts that can be interpreted without context." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-138", "text": "Research on online forums by Anand et al. (2011) has shown that the incorporation of features from the previous post can improve the performance of a stance classifier for posts that express rebuttals." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-139", "text": "This work, however, used meta-data of the debaters' stance for training the classifiers and did not provide the option to classify a post as undecided when its stance could not be determined without its debate context." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-140", "text": "Another possible approach would be to classify debaters according to the stance they take, instead of classifying each individual post into a stance." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-2", "text": "A classifier for automatic detection of stance towards vaccination in online forums was trained and evaluated." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-3", "text": "Debate posts from six discussion threads on the British parental website Mumsnet were manually annotated for stance against or for vaccination, or as undecided."
}, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-4", "text": "A support vector machine, trained to detect the three classes, achieved a macro F-score of 0.44, while a macro F-score of 0.62 was obtained by the same type of classifier on the binary classification task of distinguishing stance against vaccination from stance for vaccination." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-5", "text": "These results show that vaccine stance detection in online forums is a difficult task, at least for the type of model investigated and for the relatively small training corpus that was used." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-6", "text": "Future work will therefore include an expansion of the training data and an evaluation of other types of classifiers and features." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-7", "text": "----------------------------------" }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-9", "text": "There have been outbreaks of vaccine-preventable diseases that were caused by decreased vaccination rates, which in turn were due to negative attitudes towards vaccination." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-10", "text": "Two examples are an outbreak of polio in 2003-2004, which started in northern Nigeria and spread to 15 other countries (Larson and Ghinai, 2011), and an outbreak of measles in Minnesota in 2017 (Modarressy-Tehrani, 2017)." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-11", "text": "Information on vaccination can be gathered from many different types of sources." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-12", "text": "A survey among British parents showed that 34% consulted web-based resources for vaccination information (Campbell et al., 2017) ."
}, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-13", "text": "The survey also showed that 31% of the parents that had consulted chat rooms or discussion forums had seen information that \"would make them doubt having their child(ren) immunised or persuade them not to immunise\", compared to 23% for parents consulting Twitter and 8% among all parents included in the survey." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-14", "text": "Discussion forums thus form an important outlet for vaccine hesitancy, and this genre might therefore be relevant to automatically monitor for an increase in posts that express a negative stance towards vaccination." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-15", "text": "Most previous work on training and evaluation of classifiers for automatic detection of vaccination stance has, however, been carried out on tweets." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-16", "text": "In this study, we therefore take on the task of automatic vaccine stance detection of debate posts in online discussion forums." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-162", "text": "The choice of machine learning model was primarily based on the fact that a linear support vector machine was successful on data from the previously mentioned shared task of stance detection of tweets (Mohammad et al., 2017) ." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-163", "text": "This model outperformed submissions from teams that used methods which might intuitively be better adapted to the task, e.g., an LSTM classifier (Augenstein et al., 2016) ." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-164", "text": "Support vector machines have also been used in many of the previous vaccine sentiment studies."
}, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-165", "text": "In addition, a linear support vector machine classifier is also a standard method that is often used for different types of text classification tasks (Manning et al., 2008, pp. 335-337)." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-167", "text": "The model is for these reasons suitable to use as a baseline against which to compare future experiments on the vaccine stance classification task." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-168", "text": "Despite being the most successful method for stance detection of tweets, it is likely that a support vector machine, trained on word and character n-grams from the entire text, is not the optimal method for stance detection of discussion posts." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-169", "text": "First of all, discussion thread posts are typically longer than a tweet and consist of several sentences, each of which often forms its own argument with relatively independent content." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-170", "text": "A classifier that operates on the level of a sentence, or other shorter parts of the text, and combines the stance classifications of these segments into a post-level classification might be better suited to the task." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-171", "text": "For instance, Hasan and Ng (2013) improved their stance classification results by expanding their feature set to also include an unsupervised estimation of the stance of each sentence in the debate post." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-172", "text": "In addition, it is likely that even if applied on a sentence-level, the use of n-grams would not capture the full complexity of the argumentative text genre."
}, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-173", "text": "Instead, the structure of the words in the sentences might need to be incorporated in the feature set." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-174", "text": "A classifier trained at the token level, where neighbouring tokens in a large context window are incorporated as features, could be one such approach." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-175", "text": "Another possibility would be the previously mentioned approach of stance detection using an LSTM classifier (Augenstein et al., 2016) ." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-176", "text": "Previous studies have also been able to improve results, at least for some targets, by incorporating other features than n-grams." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-177", "text": "Examples include features constructed using an arguing lexicon (Somasundaran and Wiebe, 2010) and word embeddings constructed in an unsupervised fashion from a large corpus of the same text genre as the text to classify (Mohammad et al., 2017)." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-178", "text": "Apart from making a decision on what type of classifier and features to use, a decision must also be made on how to gather more training data." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-179", "text": "Strategies for reducing the manual labelling effort should be investigated, in particular since the annotation task, as discussed above, might not be suitable for crowdsourcing." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-180", "text": "One possible approach would be to use weakly supervised data." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-181", "text": "Posts from vaccine-related discussion threads contrasted with posts from other discussion threads might be used as weakly supervised data for classifying posts as either taking a stance on vaccination or being undecided."
}, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-184", "text": "Hasan and Ng (2013) improved results by incorporating weakly labelled data that was gathered through harvesting text adjacent to phrases that, with high confidence, are indicators of stance against or for the target in question." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-185", "text": "It can be noted that the number of contributors to each discussion thread seems to be rather small, as the posts annotated for the experiment had been written by only 136 debaters." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-186", "text": "Future experiments on this data set and its extensions should therefore evaluate to what extent a classifier trained on the Mumsnet data is able to detect vaccination stance in discussion threads in general." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-187", "text": "That is, to make sure the models have not overfit on the language typical of a small set of authors and on arguments typical of Mumsnet." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-188", "text": "This could be carried out by evaluating the classifier on vaccine-related debate posts from other forums." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-189", "text": "The experimental design would then consist of using the Mumsnet data for the parameter setting and for training the models, and posts from other debate forums for evaluation." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-190", "text": "----------------------------------" }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-191", "text": "**CONCLUSION**" }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-192", "text": "A macro F-score of 0.62 was achieved for a binary classifier that distinguished debate posts taking a stance against vaccination from those taking a stance for vaccination."
}, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-193", "text": "When also including the category undecided, an F-score of 0.44 was achieved for the three-class classifier." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-194", "text": "The detection of stance towards vaccination in online forums was proven to be a difficult task, at least for the support vector machine model trained on n-grams that was used as classifier and for the relatively small training corpus used." }, { "sent_id": "2adb3a645a57b8f441a80bb5a46045-C001-195", "text": "Future work will therefore include the expansion of the training data set, as well as an evaluation of other types of machine learning models and feature sets." } ], "y": { "@BACK@": { "gold_contexts": [ [ "2adb3a645a57b8f441a80bb5a46045-C001-22", "2adb3a645a57b8f441a80bb5a46045-C001-23" ], [ "2adb3a645a57b8f441a80bb5a46045-C001-25" ], [ "2adb3a645a57b8f441a80bb5a46045-C001-154" ] ], "cite_sentences": [ "2adb3a645a57b8f441a80bb5a46045-C001-23", "2adb3a645a57b8f441a80bb5a46045-C001-25", "2adb3a645a57b8f441a80bb5a46045-C001-154" ] }, "@SIM@": { "gold_contexts": [ [ "2adb3a645a57b8f441a80bb5a46045-C001-85" ], [ "2adb3a645a57b8f441a80bb5a46045-C001-101" ], [ "2adb3a645a57b8f441a80bb5a46045-C001-162" ], [ "2adb3a645a57b8f441a80bb5a46045-C001-177" ] ], "cite_sentences": [ "2adb3a645a57b8f441a80bb5a46045-C001-85", "2adb3a645a57b8f441a80bb5a46045-C001-101", "2adb3a645a57b8f441a80bb5a46045-C001-162", "2adb3a645a57b8f441a80bb5a46045-C001-177" ] }, "@USE@": { "gold_contexts": [ [ "2adb3a645a57b8f441a80bb5a46045-C001-85" ], [ "2adb3a645a57b8f441a80bb5a46045-C001-101" ], [ "2adb3a645a57b8f441a80bb5a46045-C001-162" ] ], "cite_sentences": [ "2adb3a645a57b8f441a80bb5a46045-C001-85", "2adb3a645a57b8f441a80bb5a46045-C001-101", "2adb3a645a57b8f441a80bb5a46045-C001-162" ] } } }, "ABC_6a90ecc147618b3909609fc6c2e2b3_16": { "x": [ { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-23", "text": "We would like to thank Sharon Goldwater and Ray 
Mooney for helpful feedback and suggestions." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-2", "text": "Abstract." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-3", "text": "Factorial Hidden Markov Models (FHMM) support joint inference for multiple sequence prediction tasks." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-4", "text": "Here, we use them to jointly predict part-of-speech tag and supertag sequences with varying levels of supervision." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-5", "text": "We show that supervised training of FHMM models improves performance compared to standard HMMs, especially when labeled training data is scarce." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-6", "text": "Secondly, we show that an FHMM and a maximum entropy Markov model in a single step co-training setup improves the performance of both models when there is limited labeled training data." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-7", "text": "Finally, we find that FHMMs trained from tag dictionaries rather than labeled examples also perform better than a standard HMM." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-8", "text": "----------------------------------" }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-10", "text": "For many sequence prediction tasks in Natural Language Processing, modeling dependencies between individual predictions can be used to improve prediction accuracy of the sequence as a whole." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-11", "text": "For example, chunking involves identifying sequences of words in a sentence that are part of syntactically related non-overlapping, non-recursive phrases." 
}, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-12", "text": "An effective representation for this task involves assigning an individual part-of-speech (POS) tag and chunk tag to each word and deriving the actual chunks from these word specific labels." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-13", "text": "In these sequences, many of the POS and chunk tags are correlated, so joint inference can be quite useful." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-14", "text": "Supertagging (Bangalore and Joshi, 1999) involves assigning lexical entries to words based on a lexicalized grammatical theory such as Combinatory Categorial Grammar (CCG) (Steedman, 2000; Steedman and Baldridge, 2009) ." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-15", "text": "For example, the English verb join has the POS VB and the CCG category ((S b \\NP)/PP)/NP in CCGbank (Hockenmaier and Steedman, 2007) ." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-16", "text": "This category indicates that join requires a noun phrase to its left, another to its right, and a prepositional phrase to the right of that." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-17", "text": "Every lexical item has as many supertags as the number of different syntactic contexts in which the item can appear, so supertags are far more detailed and numerous than POS tags." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-18", "text": "Recently there has been increased interest in supertags beyond their standard use as a pre-parsing step (Clark and Curran, 2007); for example, they are being used as features in machine translation (Birch et al., 2007; Hassan et al., 2007) ." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-19", "text": "Chunking and supertagging can be modeled using a two-stage cascade of Hidden Markov Models (HMMs) (Rabiner, 1989) ."
}, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-20", "text": "POS tags are first predicted from the observed words in the first stage; then the chunk tags or supertags are predicted from those POS tags in the next stage." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-21", "text": "Alternatively, both sequences can be jointly predicted with Factorial Hidden Markov Models (FHMMs) (Ghahramani and Jordan, 1998) , thereby preventing propagation of errors." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-22", "text": "Here, we apply" }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-24", "text": "This work was supported by the Morris Memorial Grant from the New York Community Trust." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-25", "text": "----------------------------------" }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-26", "text": "**COPYRIGHT 2009 BY SRIVATSAN RAMANUJAM AND JASON BALDRIDGE**" }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-27", "text": "FHMMs to supertagging for the categories defined in CCGbank for English." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-28", "text": "Fully supervised maximum entropy Markov models have been used for cascaded prediction of POS tags followed by supertags (Clark and Curran, 2007) ." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-29", "text": "Here, we learn supertaggers given only a POS tag dictionary and supertag dictionary or a small amount of material labeled with both types of information." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-30", "text": "Previous work has used Bayesian HMMs to learn taggers for both POS tagging and supertagging (Baldridge, 2008) separately." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-31", "text": "Modeling them jointly has the potential to produce more robust and accurate supertaggers trained with less supervision and thereby potentially help in the creation of useful models for new languages and domains." 
}, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-32", "text": "Our results show that joint inference improves supervised supertag prediction (compared to HMMs), especially when labeled training data is scarce." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-33", "text": "Secondly, when training data is limited, the generative FHMMs and a maximum entropy Markov model (a discriminative model like C&C) can bootstrap each other in a single-round co-training setup." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-34", "text": "Finally, FHMMs trained on tag dictionaries also outperform standard HMMs, thereby providing a stronger basis for learning accurate supertaggers with less supervision." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-35", "text": "----------------------------------" }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-36", "text": "**DATA**" }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-37", "text": "CCG is a lexicalized grammar formalism in which the grammatical constituents have types that are detailed categories like NP, (S\\NP)/NP, and (N\\N)/(S/NP) that specify, among other things, the sub-categorization requirements of the constituent." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-38", "text": "Every word is associated with a lexical category; strings of adjacent words and word sequences may then be joined via universal rules of category combination." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-39", "text": "An analysis of a sentence is complete when a single derived category spanning all words in the sentence is reached, as shown in Figure 1 ." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-40", "text": "Accurately assigning lexical categories to words is the key to fast parsing for CCG."
}, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-41", "text": "Clark and Curran (2007) use a maximum entropy Markov Model to predict lexical categories before fully parsing a sentence based on those categories." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-42", "text": "In another light, supertagging for CCG can also be seen as a way of generalizing a lexicon by identifying categories for unseen words or unobserved word/category pairs." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-43", "text": "The performance of the C&C supertagger relies on the existence of CCGbank, which itself is a semi-automated conversion of phrase structure analyses of the Penn Treebank (Marcus et al., 1994) into CCG analyses." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-44", "text": "CCGbank, not to mention the original Penn Treebank, required considerable manual effort to create." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-45", "text": "It is thus of interest to build accurate supertaggers, and also to do so with as little supervision as possible in order to support the creation of grammars and annotated resources for other domains and languages." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-46", "text": "Table 1 summarizes some attributes and the ambiguities of the POS and CCG types and tokens in CCGbank." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-47", "text": "The larger size of the CCG tag set translates to a greater per token ambiguity in the prediction of CCG tags." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-48", "text": "Supertagging is thus generally a harder problem than POS tagging." 
}, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-49", "text": "----------------------------------" }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-50", "text": "**THE MODELS**" }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-51", "text": "We consider three different models in this paper: an HMM model and two FHMM models which we call FHMMA and FHMMB." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-52", "text": "In the standard HMM shown in Figure 2 (a), a tag (POS tag or CCG tag) is conditioned on the previous tag and emits the observed word. We use Bayesian inference for HMMs following Goldwater and Griffiths (2007) and Johnson et al. (2007) , with symmetric Dirichlet priors for the transition and emission distributions of each of the three models." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-53", "text": "Such bitag HMMs can be formulated as: t_i | t_{i-1}, \u03c4^{(t_{i-1})} ~ Mult(\u03c4^{(t_{i-1})}); w_i | t_i, \u03c9^{(t_i)} ~ Mult(\u03c9^{(t_i)}); \u03c4^{(t)} | \u03b1 ~ Dirichlet(\u03b1); \u03c9^{(t)} | \u03b2 ~ Dirichlet(\u03b2)." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-54", "text": "Here, t_i and w_i refer to the i'th tag and word, \u03c4 refers to the state transition distribution, and \u03c9 refers to the word emission distribution." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-55", "text": "The Forward-Backward algorithm standardly used for HMMs is intractable for FHMMs due to their larger state space." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-56", "text": "We thus use Gibbs sampling, a Markov Chain Monte Carlo method that is commonly used for inference in Bayesian graphical models (Besag, 2004; Gao and Johnson, 2007) ." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-57", "text": "The Gibbs sampling equations for a POS and CCG pair in each of the models are summarized in Figure 3 ." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-58", "text": "For the HMM model, the POS and CCG tags are sampled independently of each other."
}, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-59", "text": "For the FHMMs, the interlinks between the POS and CCG nodes in the graphical model determine the interdependency during the joint inference of the POS and CCG tag sequences." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-60", "text": "----------------------------------" }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-61", "text": "**SUPERVISED SUPERTAGGING EXPERIMENTS**" }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-62", "text": "We consider two supervised training scenarios here." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-63", "text": "(1)" }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-64", "text": "----------------------------------" }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-65", "text": "**SUPERTAGGING WITH VARYING AMOUNTS OF TRAINING DATA**" }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-66", "text": "In this experiment, we use the training and test sets used by Baldridge (2008) from CCGbank." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-67", "text": "We vary the amount of training material by using 100, 1000, 10,000 and all 38015 training set sentences." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-68", "text": "We also vary the transition prior \u03b1, choosing \u03b1 = 1.0 and \u03b1 = 0.05 on the CCG tags." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-69", "text": "The emission prior \u03b2 was held constant at 1.0." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-70", "text": "The results of these experiments for \u03b1 = 0.05 are tabulated in Table 3 (a)." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-71", "text": "For comparison, we also show the results of the C&C supertagger of Clark and Curran (2007) in Table 3 (b)."
}, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-72", "text": "The parameter \u03b1, which determines the sparsity of the transition matrix, was reported by Goldwater and Griffiths (2007) to have a large influence on tagger performance in weakly supervised POS tagging." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-73", "text": "We also observed this in supervised supertagging, in the models HMM and FHMMB." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-74", "text": "The HMM model and FHMMB showed a slight dip in their performance for \u03b1 = 1.0 while FHMMA did slightly better." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-75", "text": "What stands out in these results is the performance of the FHMM models with minimal amount of training data (for 100 sentences, FHMMB is quite close to the discriminatively trained C&C supertagger)." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-76", "text": "The FHMMA model achieves a 22% absolute accuracy improvement for CCG tags (ambiguous types alone) when compared to the HMM model and the FHMMB model achieves a 41% improvement compared to the HMM model." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-77", "text": "State-of-the-art POS taggers report accuracies in the range of 96\u221297%; our model FHMMB was comparable (95.35% for \u03b1 = 0.05 and 94.41% for \u03b1 = 1.0)." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-78", "text": "The FHMMA model and the HMM model achieved 91% and 92.5% accuracy on POS tags, respectively." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-79", "text": "The accuracy of our HMM is lower than the performance of Baldridge (2008) for supertags." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-80", "text": "We attribute this to better tag-specific smoothing in his model for emissions, compared to our use of a symmetric parameter for all tags."
}, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-81", "text": "We stress that our interest here is in evaluating the advantage of joint inference over POS tags and supertags rather than direct supertag prediction while holding all other modeling considerations equal." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-82", "text": "----------------------------------" }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-83", "text": "**SINGLE ROUND CO-TRAINING EXPERIMENTS**" }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-84", "text": "In this section, we use the FHMMB model and the C&C model in a single round of co-training." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-85", "text": "The idea behind co-training (Blum and Mitchell, 1998) is that when two models learn different views of the task, and are not correlated in the errors they make, they may complement each other and boost each other's performance." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-86", "text": "Thus, provided with a large amount of unannotated data, we could use co-training iteratively to enhance the prediction performance of the two models." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-87", "text": "For example, Clark et al. (2003) co-train the C&C tagger and the TNT tagger (Brants, 2000) and obtain significant performance improvements for POS tagging." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-88", "text": "Here, we do not perform co-training, strictly speaking." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-89", "text": "Instead, we complete a single round in which one model is trained on a small number of sentences and then is used to label all remaining unannotated examples; the entire set is then used by the other model for training."
}, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-90", "text": "Table 3(a) shows the results of bootstrapping the C&C supertagger with the FHMMB model trained on 25, 50 and 100 annotated sentences respectively." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-91", "text": "For comparison, the standalone performance of C&C is also shown." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-92", "text": "The FHMMB model was used to annotate the remaining sentences in the training set, and the C&C supertagger was trained on this larger annotated dataset and tested on the test set." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-93", "text": "Table 3 (b) shows the results of using the C&C supertagger trained on 25, 50 and 100 annotated sentences to bootstrap the FHMMB model." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-94", "text": "Again, for comparison, the standalone performance of the FHMMB model is also shown." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-95", "text": "The C&C supertagger is used to annotate the remaining sentences in the training dataset, and the FHMMB model is trained on them and tested on the test dataset." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-96", "text": "C&C bootstrapped by the FHMMB model outperforms C&C alone when training on the same sentences." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-97", "text": "This makes the FHMMB model ideal for bootstrapping more powerful discriminative models like C&C. Clearly, from Table 3(a), we see that co-training helps the C&C supertagger improve its supertagging performance with minimal supervision (25, 50 sentences)." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-98", "text": "In Table 3 (b), we see that the C&C supertagger helps in boosting the performance of the FHMMB model."
}, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-99", "text": "These results suggest that further experiments applying standard multi-round co-training could improve both models considerably." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-100", "text": "Finally, note that FHMMB's lone performance of 58.25% with 25 seed sentences is considerably better than C&C's lone performance of 53.86% with the same seed set." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-101", "text": "----------------------------------" }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-102", "text": "**WEAKLY SUPERVISED SUPERTAGGING**" }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-103", "text": "Since annotation is costly, we are interested in automatic annotation of unlabeled sentences with minimal supervision." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-104", "text": "In the weakly supervised learning setting, we are provided with a lexicon that lists possible POS tags and supertags for many, though not all, words." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-105", "text": "We draw the initial sample of CCG tag sequences corresponding to the observation sequence, using probabilities based on grammar informed initialization (Baldridge, 2008) ." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-106", "text": "We consider the prior probability of occurrence of categories based on their complexity: given a lexicon L, the probability of a category c i is inversely proportional to its complexity:" }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-107", "text": "where complexity(c i ) is defined as the number of sub-categories contained in category c i ." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-108", "text": "The POS tag corresponding to an observed word w i is drawn uniformly at random from the set of all tags corresponding to w i in the dictionary." 
}, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-109", "text": "For the FHHMs, we first draw a POS tag t i corresponding to a word w i uniformly at random from the tag dictionary of w i and then from the set of all CCG tags that have occurred with t i and w i in the dictionary, we randomly sample a CCG tag c i based on its complexity, as defined above." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-110", "text": "----------------------------------" }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-111", "text": "**EFFECT OF FREQUENCY CUT-OFF ON SUPERTAGS**" }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-112", "text": "Any category c, that occurs less than k% of the times with a word type w, is removed from the tag dictionary of that word, when the lexicon is constructed." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-113", "text": "This is in fact a form of supervision, which we use here as an oracle to explore the effect of reducing lexical ambiguity." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-114", "text": "Results of this experiment for \u03b1 = 1.0, on ambiguous CCG categories, are tabulated in Table 5 (a)." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-115", "text": "The results for \u03b1 = 0.05 is shown in Table 6 (a)." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-116", "text": "We also report the CCG accuracy values inclusive of unambiguous types in Table 5 (b) for \u03b1 = 1.0 and Table 6 (b) respectively." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-117", "text": "The performance of the HMM model (31%) in Table 5 (a) without any frequency cut-off on the CCG categories, is comparable to the bitag HMM of Baldridge (2008) that uses variational Bayes EM (33%)." 
}, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-118", "text": "Our complexity based initialization is not directly comparable to the results in Baldridge (2008) because the values there are based on a weighted combination of complexity based initialization and modified transition priors based on the CCG formalism." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-119", "text": "However, it is encouraging to see that when there is no cut-off based filtering of the categories, FHMMB (47.98%) greatly outperforms the HMM-EM model of Baldridge (2008) ." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-120", "text": "It is however, quite short of the 56.1% accuracy achieved by the model of Baldridge (2008) that uses grammar informed initialization (combination of category based initialization along with category transition rules)." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-121", "text": "Without any frequency cut-off on CCG categories, FHMMB achieves over 17% improvement in the prediction accuracy of ambiguous CCG categories, in comparison with the HMM." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-122", "text": "The HMM performs much better when there is a high level of frequency based filtering of the categories." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-123", "text": "However, recall that frequency based filtering of categories is a strong form of supervision that we use here only as an oracle and which one could not expect to have in real world tag dictionaries." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-124", "text": "The POS accuracies in these experiments were 83.5-85%, 84.5-86.2% and 78.3-78.4% for models FHMMB, FHMMA and HMM respectively (without any frequency cut-off)." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-125", "text": "In the weakly supervised setting, the choice of the transition prior \u03b1 of 0.05 lead to severe degradation in the prediction accuracy of CCG tags." 
}, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-126", "text": "Unlike POS tagging, where a symmetric transition prior of \u03b1 = 0.05 captured the sparsity of the tag transition distribution , in supertagging the transition priors are asymmetric." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-127", "text": "We expect that CCG transition rules (Baldridge, 2008) when encoded as category specific transition priors, will lead to better performance with the FHMMs." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-128", "text": "----------------------------------" }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-129", "text": "**RELATED WORK**" }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-130", "text": "This paper follows the work of Duh (2005) , Baldridge (2008) and Goldwater and Griffiths (2007) ." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-131", "text": "Duh (2005) uses FHMMs for jointly labeling the POS and NP chunk tags for the CoNLL2000 dataset (Sang et al., 2000) ." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-132", "text": "His is a fully supervised model for a simpler task." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-133", "text": "We address the harder problem of supertagging in this paper and especially in the weakly supervised setting, with FHMMs." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-134", "text": "Goldwater and Griffiths (2007) uses a Bayesian tritag HMM (BHMM) for POS tagging and considers three different scenarios: (1) a weakly supervised setting with fixed hyperparameters \u03b1 and \u03b2, (2) hyper parameter inference (learning the optimal values for \u03b1 and \u03b2) and (3) hyper parameter inference with varying corpus size and dictionary knowledge." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-135", "text": "Our bitag HMM achieved results close to what was reported by her BHMM on a random 24000 word subset of the WSJ." 
}, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-136", "text": "In all our experiments, we have kept the test set separate from the training set from which the dictionary was built; this distinction is not made by Goldwater." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-137", "text": "Again, our work focuses on the harder problem of supertagging." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-138", "text": "McCallum et al. (2003) have also used a factorial model for performing joint labeling of the POS and chunk tags but by using Dynamic Conditional Random Fields (DCRF)." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-139", "text": "The advantage of using an FHMM over DCRF is the the ability to use less supervision in training the model." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-140", "text": "Even in the supervised training scenario, FHMM has the advantage of lower training time when compared to discriminative training models like DCRF." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-141", "text": "----------------------------------" }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-142", "text": "**CONCLUSION**" }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-143", "text": "We demonstrated that joint inference in supertagging, boosts the prediction accuracy of both POS and CCG tags by a considerable margin." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-144", "text": "The improvement is more significant when training data is scarce." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-145", "text": "The results from the single round co-training experiments were encouraging." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-146", "text": "The generative FHMM model is able to rival a discriminative model like the C&C supertagger, when more labeled sentences are made available by a bootstrapped supertagger." 
}, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-147", "text": "To the best of our knowledge, this is the first work on joint inference in the Bayesian framework for supertagging." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-148", "text": "There is plenty of scope for further improvements." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-149", "text": "Overall, the discriminative C&C supertagger outperforms the FHMMs in all supervised settings." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-150", "text": "Despite this, the FHMMs are suited for estimating models with less supervision, such as from tag dictionaries alone and incorporating more informative prior distributions such as those in Baldridge (2008) ." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-151", "text": "This may make them more appropriate for developing CCGbanks for other languages and domains." }, { "sent_id": "6a90ecc147618b3909609fc6c2e2b3-C001-152", "text": "Furthermore, Bayesian inference is modular and extensible, so our models could be supplemented by finding optimal values of the hyperparameters \u03b1 (for POS tags) and \u03b2." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "6a90ecc147618b3909609fc6c2e2b3-C001-30" ], [ "6a90ecc147618b3909609fc6c2e2b3-C001-120" ] ], "cite_sentences": [ "6a90ecc147618b3909609fc6c2e2b3-C001-30", "6a90ecc147618b3909609fc6c2e2b3-C001-120" ] }, "@USE@": { "gold_contexts": [ [ "6a90ecc147618b3909609fc6c2e2b3-C001-66" ], [ "6a90ecc147618b3909609fc6c2e2b3-C001-105" ] ], "cite_sentences": [ "6a90ecc147618b3909609fc6c2e2b3-C001-66", "6a90ecc147618b3909609fc6c2e2b3-C001-105" ] }, "@SIM@": { "gold_contexts": [ [ "6a90ecc147618b3909609fc6c2e2b3-C001-66" ], [ "6a90ecc147618b3909609fc6c2e2b3-C001-117" ], [ "6a90ecc147618b3909609fc6c2e2b3-C001-150" ] ], "cite_sentences": [ "6a90ecc147618b3909609fc6c2e2b3-C001-66", "6a90ecc147618b3909609fc6c2e2b3-C001-117", "6a90ecc147618b3909609fc6c2e2b3-C001-150" ] }, "@DIF@": { "gold_contexts": [ [ "6a90ecc147618b3909609fc6c2e2b3-C001-79" ], [ "6a90ecc147618b3909609fc6c2e2b3-C001-118" ], [ "6a90ecc147618b3909609fc6c2e2b3-C001-119" ] ], "cite_sentences": [ "6a90ecc147618b3909609fc6c2e2b3-C001-79", "6a90ecc147618b3909609fc6c2e2b3-C001-118", "6a90ecc147618b3909609fc6c2e2b3-C001-119" ] }, "@FUT@": { "gold_contexts": [ [ "6a90ecc147618b3909609fc6c2e2b3-C001-127" ] ], "cite_sentences": [ "6a90ecc147618b3909609fc6c2e2b3-C001-127" ] }, "@EXT@": { "gold_contexts": [ [ "6a90ecc147618b3909609fc6c2e2b3-C001-130" ] ], "cite_sentences": [ "6a90ecc147618b3909609fc6c2e2b3-C001-130" ] } } }, "ABC_e7c947a02bb0e81d6b6b4b9da74024_16": { "x": [ { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-13", "text": "Word embeddings have become an important component in many NLP models and are widely used for a vast range of downstream tasks." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-48", "text": "There, the authors rely on the same definition of gender bias that considers the projection on the gender direction." 
}, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-49", "text": "We expect similar results for this method as well, however, we did not verify that." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-2", "text": "Word embeddings are widely used in NLP for a vast range of tasks." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-3", "text": "It was shown that word embeddings derived from text corpora reflect gender biases in society." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-4", "text": "This phenomenon is pervasive and consistent across different word embedding models, causing serious concern." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-5", "text": "Several recent works tackle this problem, and propose methods for significantly reducing this gender bias in word embeddings, demonstrating convincing results." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-6", "text": "However, we argue that this removal is superficial." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-7", "text": "While the bias is indeed substantially reduced according to the provided bias definition, the actual effect is mostly hiding the bias, not removing it." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-8", "text": "The gender bias information is still reflected in the distances between \"gender-neutralized\" words in the debiased embeddings, and can be recovered from them." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-9", "text": "We present a series of experiments to support this claim, for two debiasing methods." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-10", "text": "We conclude that existing bias removal techniques are insufficient, and should not be trusted for providing gender-neutral modeling." 
}, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-11", "text": "----------------------------------" }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-12", "text": "**INTRODUCTION**" }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-14", "text": "However, these word representations have been proven to reflect social biases (e.g. race and gender) that naturally occur in the data used to train them ." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-15", "text": "In this paper we focus on gender bias." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-16", "text": "Gender bias was demonstrated to be consistent and pervasive across different word embeddings." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-17", "text": "Bolukbasi et al. (2016b) show that using word embeddings for simple analogies surfaces many gender stereotypes." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-18", "text": "For example, the word embedding they use (word2vec embedding trained on the Google News dataset 1 (Mikolov et al., 2013) ) an-1 https://code.google.com/archive/p/word2vec/ swer the analogy \"man is to computer programmer as woman is to x\" with \"x = homemaker\"." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-19", "text": "further demonstrate association between female/male names and groups of words stereotypically assigned to females/males (e.g. arts vs. science)." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-20", "text": "In addition, they demonstrate that word embeddings reflect actual gender gaps in reality by showing the correlation between the gender association of occupation words and labor-force participation data." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-21", "text": "Recently, some work has been done to reduce the gender bias in word embeddings, both as a post-processing step (Bolukbasi et al., 2016b) and as part of the training procedure (Zhao et al., 2018) ." 
}, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-22", "text": "Both works substantially reduce the bias with respect to the same definition: the projection on the gender direction (i.e. \u2212 \u2192 he \u2212 \u2212\u2192 she), introduced in the former." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-23", "text": "They also show that performance on word similarity tasks is not hurt." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-24", "text": "We argue that current debiasing methods, which lean on the above definition for gender bias and directly target it, are mostly hiding the bias rather than removing it." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-25", "text": "We show that even when drastically reducing the gender bias according to this definition, it is still reflected in the geometry of the representation of \"gender-neutral\" words, and a lot of the bias information can be recovered." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-26", "text": "----------------------------------" }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-27", "text": "**GENDER BIAS IN WORD EMBEDDINGS**" }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-28", "text": "In what follows we refer to words and their vectors interchangeably." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-50", "text": "3 The gender direction is chosen to be the top principal component (PC) of ten gender pair difference vectors." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-29", "text": "Bolukbasi et al. (2016b) define the gender bias of a word w by its projection on the \"gender direction\": \u2212 \u2192 w \u00b7 ( \u2212 \u2192 he \u2212 \u2212\u2192 she), assuming all vectors are normalized." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-30", "text": "The larger a word's projection is on \u2212 \u2192 he \u2212 \u2212\u2192 she, the more biased it is." 
}, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-31", "text": "They also quantify the bias in word embeddings using this definition and show it aligns well with social stereotypes." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-32", "text": "----------------------------------" }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-33", "text": "**DEFINITION AND EXISTING DEBIASING METHODS**" }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-34", "text": "Both Bolukbasi et al. (2016b) and Zhao et al. (2018) propose methods for debiasing word embeddings, substantially reducing the bias according to the suggested definition." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-35", "text": "2 In a seminal work, Bolukbasi et al. (2016b) use a post-processing debiasing method." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-36", "text": "Given a word embedding matrix, they make changes to the word vectors in order to reduce the gender bias as much as possible for all words that are not inherently gendered (e.g. mother, brother, queen)." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-37", "text": "They do that by zeroing the gender projection of each word on a predefined gender direction." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-38", "text": "3 In addition, they also take dozens of inherently gendered word pairs and explicitly make sure that all neutral words (those that are not predefined as inherently gendered) are equally close to each of the two words." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-39", "text": "This extensive, thoughtful, rigorous and well executed work surfaced the problem of bias in embeddings to the ML and NLP communities, defined the concept of debiasing word embeddings, and established the defacto metric of measuring this bias (the gender direction)." 
}, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-40", "text": "It also provides a perfect solution to the problem of removing the gender direction from non-gendered words." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-41", "text": "However, as we show in this work, while the gender-direction is a great indicator of bias, it is only an indicator and not the complete manifestation of this bias." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-42", "text": "Zhao et al. (2018) take a different approach and suggest to train debiased word embeddings from scratch." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-43", "text": "Instead of debiasing existing word vectors, they alter the loss of the GloVe model (Pennington et al., 2014) , aiming to concentrate most of the gender information in the last coordinate of each vector." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-44", "text": "This way, one can later use the word representations excluding the gender coordinate." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-45", "text": "They do that by using two groups of male/female seed words, and encouraging words that belong to different groups to differ in their last coordinate." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-46", "text": "In addition, they encourage the representation of neutral-gender words (excluding the last coordinate) to be orthogonal to the gender direction." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-47", "text": "4 This work did a step forward by trying to 2 Another work in this spirit is that of Zhang et al. (2018) , which uses an adversarial network to debias word embeddings." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-51", "text": "4 The gender direction is estimated during training by averaging the differences between female words and their male remove the bias during training rather than in postprocessing, which we believe to be the right approach." 
}, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-52", "text": "Unfortunately, it relies on the same definition that we show is insufficient." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-53", "text": "These works implicitly define what is good gender debiasing: according to Bolukbasi et al. (2016b) , there is no gender bias if each nonexplicitly gendered word in the vocabulary is in equal distance to both elements of all explicitly gendered pairs." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-54", "text": "In other words, if one cannot determine the gender association of a word by looking at its projection on any gendered pair." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-55", "text": "In Zhao et al. (2018) the definition is similar, but restricted to projections on the gender-direction." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-56", "text": "Remaining bias after using debiasing methods Both works provide very compelling results as evidence of reducing the bias without hurting the performance of the embeddings for standard tasks." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-57", "text": "However, both methods and their results rely on the specific bias definition." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-58", "text": "We claim that the bias is much more profound and systematic, and that simply reducing the projection of words on a gender direction is insufficient: it merely hides the bias, which is still reflected in similarities between \"gender-neutral\" words (i.e., words such as \"math\" or \"delicate\" are in principle genderneutral, but in practice have strong stereotypical gender associations, which reflect on, and are reflected by, neighbouring words)." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-59", "text": "Our key observation is that, almost by definition, most word pairs maintain their previous similarity, despite their change in relation to the gender direction." 
}, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-60", "text": "The implication of this is that most words that had a specific bias before are still grouped together, and apart from changes with respect to specific gendered words, the word embeddings' spatial geometry stays largely the same." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-61", "text": "5 In what follows, we provide a series of experiments that demonstrate the remaining bias in the debiased embeddings." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-62", "text": "----------------------------------" }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-63", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-64", "text": "We refer to the word embeddings of the previous works as HARD-DEBIASED (Bolukbasi et al., 2016b) and GN-GLOVE (gender-neutral GloVe) counterparts in a predefined set." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-65", "text": "(Zhao et al., 2018) ." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-66", "text": "For each debiased word embedding we quantify the hidden bias with respect to the biased version." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-67", "text": "For HARD-DEBIASED we compare to the embeddings before applying the debiasing procedure." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-68", "text": "For GN-GLOVE we compare to embedding trained with standard GloVe on the same corpus." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-69", "text": "6 Unless otherwise specified, we follow Bolukbasi et al. (2016b) and use a reduced version of the vocabulary for both word embeddings: we take the most frequent 50,000 words and phrases and remove words with upper-case letters, digits, or punctuation, and words longer than 20 characters." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-70", "text": "In addition, to avoid quantifying the bias of words that are inherently gendered (e.g. 
mother, father, queen), we remove from each vocabulary the respective set of gendered words as pre-defined in each work." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-71", "text": "7 This yeilds a vocabulary of 26,189 words for HARD-DEBIASED and of 47,698 words for GN-GLOVE." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-72", "text": "As explained in Section 2 and according to the definition in previous works, we compute the bias of a word by taking its projection on the gender direction:" }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-73", "text": "\u2212 \u2192 he \u2212 \u2212\u2192 she." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-74", "text": "In order to quantify the association between sets of words, we follow and use their Word Embedding Association Test (WEAT): consider two sets of target words (e.g., male and female professions) and two sets of attribute words (e.g., male and female names)." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-75", "text": "A permutation test estimates the probability that a random permutation of the target words would produce equal or greater similarities to the attribute sets." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-76", "text": "----------------------------------" }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-77", "text": "**EXPERIMENTS AND RESULTS**" }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-78", "text": "Male-and female-biased words cluster together We take the most biased words in the vocabulary according to the original bias (500 malebiased and 500 female-biased 8 ), and cluster them 6 We use the embeddings provided by Bolukbasi et al. (2016b) in https://github.com/tolga-b/ debiaswe and by Zhao et al. (2018) in https:// github.com/uclanlp/gn_glove." 
}, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-79", "text": "7 For HARD-DEBIASED we use first three lists from: https://github.com/tolga-b/debiaswe/ tree/master/data and for GN-GLOVE we use the two lists from: https://github.com/uclanlp/gn_ glove/tree/master/wordlist 8 highest on the two lists for HARD-DEBIASED are 'petite', 'mums', 'bra', 'breastfeeding' and 'sassy' for female and 'rookie', 'burly', 'hero', 'training camp' and 'journeyman' for male." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-80", "text": "Lowest on the two lists are 'watchdogs', 'watercolors', 'sew', 'burqa', 'diets' for female and 'teammates', 'playable', 'grinning', 'knee surgery', 'impersonation' for male." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-81", "text": "(a) Clusteing for HARD-DEBIASED embedding, before (left hand-side) and after (right hand-side) debiasing." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-82", "text": "(b) Clusteing for GN-GLOVE embedding, before (left handside) and after (right hand-side) debiasing." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-83", "text": "Figure 1: Clustering the 1,000 most biased words, before and after debiasing, for both models." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-84", "text": "into two clusters using k-means." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-85", "text": "For the HARD-DEBIASED embedding, the clusters align with gender with an accuracy of 92.5% (according to the original bias of each word), compared to an accuracy of 99.9% with the original biased version." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-86", "text": "For the GN-GLOVE embedding, we get an accuracy of 85.6%, compared to an accuracy of 100% with the biased version." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-87", "text": "These results suggest that indeed much of the bias information is still embedded in the representation after debiasing." 
}, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-88", "text": "Figure 1 shows the tSNE (Maaten and Hinton, 2008) projection of the vectors before and after debiasing, for both models." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-89", "text": "Bias-by-projection correlates to bias-byneighbours This clustering of gendered words indicates that while we cannot directly \"observe\" the bias (i.e. the word \"nurse\" will no longer be closer to explicitly marked feminine words) the bias is still manifested by the word being close to socially-marked feminine words, for example \"nurse\" being close to \"receptionist\", \"caregiver\" and \"teacher\"." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-90", "text": "This suggests a new mechanism for measuring bias: the percentage of male/female socially-biased words among the k nearest neighbors of the target word." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-91", "text": "9 We measure the correlation of this new bias measure with the original bias measure." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-92", "text": "For the HARD-DEBIASED embedding we get a Pearson correlation of 0.686 (compared to a correlation of 0.741 when checking neighbors according to the biased version)." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-93", "text": "For the GN-GLOVE embedding we get a Pearson correlation of 0.736 (compared to 0.773)." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-94", "text": "All these correlations are statistically significant with p-values of 0." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-95", "text": "Professions We consider the list of professions used in Bolukbasi et al. (2016b) and Zhao et al. (2018) 10 in light of the neighbours-based bias definition." 
}, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-96", "text": "Figure 2 plots the professions, with axis X being the original bias and axis Y being the number of male neighbors, before and after debiasing." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-97", "text": "For both methods, there is a clear correlation between the two variables." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-98", "text": "We observe a Pearson correlation of 0.606 (compared to a correlation of 0.747 when checking neighbors according to the biased version) for HARD-DEBIASED and 0.792 (compared to 0.820) for GN-GLOVE." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-99", "text": "All these correlations are significant with p-values < 1 \u00d7 10 \u221230 ." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-100", "text": "Association between female/male and female/male-stereotyped words We replicate the three gender-related association experiments from ." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-101", "text": "For these experiments we use the full vocabulary since some of the words are not included in the reduced one." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-102", "text": "The first experiment evaluates the association between female/male names and family and career words." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-103", "text": "The second one evaluates the association between female/male concepts and arts and mathematics words." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-104", "text": "Since the inherently gendered words (e.g. girl, her, brother) in the second ex-periment are handled well by the debiasing models we opt to use female and male names instead." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-105", "text": "The third one evaluates the association between female/male concepts and arts and science words." 
}, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-106", "text": "Again, we use female and male names instead. 11 For the HARD-DEBIASED embedding, we get a p-value of 0 for the first experiment, 0.00016 for the second one, and 0.0467 for the third." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-107", "text": "For the GN-GLOVE embedding, we get p-values of 7.7 \u00d7 10 \u22125 , 0.00031 and 0.0064 for the first, second and third experiments, respectively." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-108", "text": "Classifying previously female-and male-biased words Can a classifier learn to generalize from some gendered words to others based only on their representations?" }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-109", "text": "We consider the 5,000 most biased words according to the original bias (2,500 from each gender), train an RBF-kernel SVM classifier on a random sample of 1,000 of them (500 from each gender) to predict the gender, and evaluate its generalization on the remaining 4,000." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-110", "text": "For the HARD-DEBIASED embedding, we get an accuracy of 88.88%, compared to an accuracy of 98.25% with the non-debiased version." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-111", "text": "For the GN-GLOVE embedding, we get an accuracy of 96.53%, compared to an accuracy of 98.65% with the non-debiased version." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-112", "text": "----------------------------------" }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-113", "text": "**DISCUSSION AND CONCLUSION**" }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-114", "text": "The experiments described in the previous section reveal a systematic bias found in the embeddings, which is independent of the gender direction." 
}, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-115", "text": "We observe that semantically related words still maintain gender bias both in their similarities, and in their representation." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-116", "text": "Concretely, we find that:" }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-117", "text": "1. Words with strong previous gender bias (with the same direction) are easy to cluster together." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-118", "text": "2. Words that receive implicit gender from social stereotypes (e.g. receptionist, hairdresser, captain) still tend to group with other implicit-gender words of the same gender, similar as for non-debiased word embeddings." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-119", "text": "3. The implicit gender of words with prevalent previous bias is easy to predict based on their vectors alone." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-120", "text": "The implications are alarming: while suggested debiasing methods work well at removing the gender direction, the debiasing is mostly superficial." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-121", "text": "The bias stemming from world stereotypes and learned from the corpus is ingrained much more deeply in the embeddings space." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-122", "text": "We note that the real concern from biased representations is not the association of a concept with words such as \"he\", \"she\", \"boy\", \"girl\" nor being able to perform gender-stereotypical word analogies." 
}, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-123", "text": "While these are nice \"party tricks\", algorithmic discrimination is more likely to happen by associating one implicitly gendered term with other implicitly gendered terms, or picking up on gender-specific regularities in the corpus by learning to condition on gender-biased words, and generalizing to other gender-biased words (i.e., a resume classifier that will learn to favor male over female candidates based on stereotypical cues in an existing-and biased-resume dataset, despite of being \"oblivious\" to gender)." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-124", "text": "Our experiments show that such classifiers would have ample opportunities to pick up on such cues also after debiasing w.r.t the gender-direction." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-125", "text": "The crux of the issue is that the gender-direction provides a way to measure the gender-association of a word, but does not determine it." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-126", "text": "Debiasing methods which directly target the gender-direction are for the most part merely hiding the gender bias and not removing it." }, { "sent_id": "e7c947a02bb0e81d6b6b4b9da74024-C001-127", "text": "The popular definitions used for quantifying and removing bias are insufficient, and other aspects of the bias should be taken into consideration as well." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "e7c947a02bb0e81d6b6b4b9da74024-C001-17" ], [ "e7c947a02bb0e81d6b6b4b9da74024-C001-21" ], [ "e7c947a02bb0e81d6b6b4b9da74024-C001-29" ], [ "e7c947a02bb0e81d6b6b4b9da74024-C001-34" ], [ "e7c947a02bb0e81d6b6b4b9da74024-C001-35" ], [ "e7c947a02bb0e81d6b6b4b9da74024-C001-53" ] ], "cite_sentences": [ "e7c947a02bb0e81d6b6b4b9da74024-C001-17", "e7c947a02bb0e81d6b6b4b9da74024-C001-21", "e7c947a02bb0e81d6b6b4b9da74024-C001-29", "e7c947a02bb0e81d6b6b4b9da74024-C001-34", "e7c947a02bb0e81d6b6b4b9da74024-C001-35", "e7c947a02bb0e81d6b6b4b9da74024-C001-53" ] }, "@SIM@": { "gold_contexts": [ [ "e7c947a02bb0e81d6b6b4b9da74024-C001-64" ], [ "e7c947a02bb0e81d6b6b4b9da74024-C001-78" ], [ "e7c947a02bb0e81d6b6b4b9da74024-C001-95" ] ], "cite_sentences": [ "e7c947a02bb0e81d6b6b4b9da74024-C001-64", "e7c947a02bb0e81d6b6b4b9da74024-C001-78", "e7c947a02bb0e81d6b6b4b9da74024-C001-95" ] }, "@DIF@": { "gold_contexts": [ [ "e7c947a02bb0e81d6b6b4b9da74024-C001-69" ] ], "cite_sentences": [ "e7c947a02bb0e81d6b6b4b9da74024-C001-69" ] }, "@EXT@": { "gold_contexts": [ [ "e7c947a02bb0e81d6b6b4b9da74024-C001-69" ] ], "cite_sentences": [ "e7c947a02bb0e81d6b6b4b9da74024-C001-69" ] }, "@USE@": { "gold_contexts": [ [ "e7c947a02bb0e81d6b6b4b9da74024-C001-78" ], [ "e7c947a02bb0e81d6b6b4b9da74024-C001-95" ] ], "cite_sentences": [ "e7c947a02bb0e81d6b6b4b9da74024-C001-78", "e7c947a02bb0e81d6b6b4b9da74024-C001-95" ] } } }, "ABC_09dad2fd96cd1d48936cd5b99a38e7_16": { "x": [ { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-2", "text": "This paper presents the evaluation setting for the SemEval-2010 Word Sense Induction (WSI) task." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-3", "text": "The setting of the SemEval-2007 WSI task consists of two evaluation schemes, i.e. unsupervised evaluation and supervised evaluation." 
}, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-4", "text": "The first one evaluates WSI methods in a similar fashion to Information Retrieval exercises using F-Score." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-5", "text": "However, F-Score suffers from the matching problem which does not allow: (1) the assessment of the entire membership of clusters, and (2) the evaluation of all clusters in a given solution." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-6", "text": "In this paper, we present the use of V-measure as a measure of objectively assessing WSI methods in an unsupervised setting, and we also suggest a small modification on the supervised evaluation." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-7", "text": "----------------------------------" }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-9", "text": "WSI is the task of identifying the different senses (uses) of a target word in a given text." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-10", "text": "WSI is a field of significant value, because it aims to overcome the limitations originated by representing word senses as a fixed-list of dictionary definitions." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-11", "text": "These limitations of hand-crafted lexicons include the use of general sense definitions, the lack of explicit semantic and topical relations between concepts (Agirre et al., 2001) , and the inability to reflect the exact content of the context in which a target word appears (V\u00e9ronis, 2004) ." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-12", "text": "Given the significance of WSI, the objective assessment and comparison of WSI methods is crucial." 
}, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-13", "text": "The first effort to evaluate WSI methods under a common framework (evaluation schemes & dataset) was undertaken in the SemEval-2007 WSI task (SWSI) (Agirre and Soroa, 2007) , where two separate evaluation schemes were employed." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-14", "text": "The first one, unsupervised evaluation, treats the WSI results as clusters of target word contexts and Gold Standard (GS) senses as classes." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-15", "text": "The traditional clustering measure of F-Score (Zhao et al., 2005 ) is used to assess the performance of WSI systems." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-16", "text": "The second evaluation scheme, supervised evaluation, uses the training part of the dataset in order to map the automatically induced clusters to GS senses." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-17", "text": "In the next step, the testing corpus is used to measure the performance of systems in a Word Sense Disambiguation (WSD) setting." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-18", "text": "A significant limitation of F-Score is that it does not evaluate the make up of clusters beyond the majority class (Rosenberg and Hirschberg, 2007) ." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-19", "text": "Moreover, F-Score might also fail to evaluate clusters which are not matched to any GS class due to their small size." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-20", "text": "These two limitations define the matching problem of F-Score (Rosenberg and Hirschberg, 2007) which can lead to: (1) identical scores between different clustering solutions, and (2) inaccurate assessment of the clustering quality." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-21", "text": "The supervised evaluation scheme employs a method in order to map the automatically induced clusters to GS senses." 
}, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-22", "text": "As a result, this process might change the distribution of clusters by mapping more than one clusters to the same GS sense." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-23", "text": "The outcome of this process might be more helpful for systems that produce a large number of clusters." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-24", "text": "In this paper, we focus on analysing the SemEval-2007 WSI evaluation schemes showing their deficiencies." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-25", "text": "Subsequently, we present the use of V-measure (Rosenberg and Hirschberg, 2007) as an evaluation measure that can overcome the current limitations of F-Score." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-26", "text": "Finally, we also suggest a small modification on the supervised evaluation scheme, which will possibly allow for a more reliable estimation of WSD performance." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-27", "text": "The proposed evaluation setting will be applied in the SemEval-2010 WSI task." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-28", "text": "----------------------------------" }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-29", "text": "**SEMEVAL-2007 WSI EVALUATION SETTING**" }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-30", "text": "The SemEval-2007 WSI task (Agirre and Soroa, 2007) evaluates WSI systems on 35 nouns and 65 verbs." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-31", "text": "The corpus consists of texts of the Wall Street Journal corpus, and is hand-tagged with OntoNotes senses (Hovy et al., 2006) ." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-32", "text": "For each target word tw, the task consists of firstly identifying the senses of tw (e.g. 
as clusters of target word instances, cooccurring words, etc.), and secondly tagging the instances of the target word using the automatically induced clusters." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-33", "text": "In the next sections, we describe and review the two evaluation schemes." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-34", "text": "----------------------------------" }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-35", "text": "**SWSI UNSUPERVISED EVALUATION**" }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-36", "text": "Let us assume that given a target word tw, a WSI method has produced 3 clusters which have tagged 2100 instances of tw. Table 1 shows the number of tagged instances for each cluster, as well as the common instances between each cluster and each gold standard sense." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-37", "text": "F-Score is used in a similar fashion to Information Retrieval exercises." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-38", "text": "Given a particular gold standard sense gs i of size a i and a cluster c j of size a j , suppose a ij instances in the class gs i belong to c j ." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-39", "text": "Precision of class gs i with respect to cluster c j is defined as the number of their common instances divided by the total cluster size, i.e. P(gs i , c j ) = a ij a j ." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-40", "text": "The recall of class gs i with respect to cluster c j is defined as the number of their common instances divided by the total sense size, i.e. R(gs i , c j ) = a ij a i ." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-41", "text": "The F-Score of gs i with respect to c j , F (gs i , c j ), is then defined as the harmonic mean of P (gs i , c j ) and R(gs i , c j )." 
}, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-42", "text": "The F-Score of class gs i , F (gs i ), is the maximum F (gs i , c j ) value attained at any cluster." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-43", "text": "Finally, the F-Score of the entire clustering solution is defined as the weighted average of the F-Scores of each GS sense (Formula 1), where q is the number of GS senses and N is the total number of target word ings 1 gs 2 gs 3 cl 1 500 100 100 cl 2 100 500 100 cl 3 100 100 500 stances." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-44", "text": "If the clustering is identical to the original classes in the datasets, F-Score will be equal to one." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-45", "text": "In the example of Table 1 , F-Score is equal to 0.714." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-46", "text": "As it can be observed, F-Score assesses the quality of a clustering solution by considering two different angles, i.e. homogeneity and completeness (Rosenberg and Hirschberg, 2007) ." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-47", "text": "Homogeneity refers to the degree that each cluster consists of data points, which primarily belong to a single GS class." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-48", "text": "On the other hand, completeness refers to the degree that each GS class consists of data points, which have primarily been assigned to a single cluster." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-49", "text": "A perfect homogeneity would result in a precision equal to 1, while a perfect completeness would result in a recall equal to 1." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-50", "text": "Purity and entropy (Zhao et al., 2005) are also used in SWSI as complementary measures." 
}, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-51", "text": "However, both of them evaluate only the homogeneity of a clustering solution disregarding completeness." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-52", "text": "----------------------------------" }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-53", "text": "**SWSI SUPERVISED EVALUATION**" }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-54", "text": "In supervised evaluation, the target word corpus is split into a testing and a training part." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-55", "text": "The training part is used to map the automatically induced clusters to GS senses." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-56", "text": "In the next step, the testing corpus is used to evaluate WSI methods in a WSD setting." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-57", "text": "Let us consider the example shown in Table 1 and assume that this matrix has been created by using the training part of our corpus." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-58", "text": "Table 1 shows that cl 1 is more likely to be associated with gs 1 , cl 2 is more likely to be associated with gs 2 , and cl 3 is more likely to be associated with gs 3 ." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-59", "text": "This information from the training part is utilised to map the clusters to GS senses." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-60", "text": "Particularly, the matrix shown in Table 1 is normalised to produce a matrix M , in which each entry depicts the conditional probability P (gs i |cl j )." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-61", "text": "Given an instance I of tw from the testing corpus, a row cluster vector IC is created, in which each entry k corresponds to the score assigned to cl k to be the winning cluster for instance I. 
The product of IC and M provides a row sense vector, IG, in which the highest scoring entry a denotes that gs a is the winning sense for instance I." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-62", "text": "For example, if we produce the row cluster vector" }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-63", "text": ", and multiply it with the normalised matrix of Table 1 , then we would get a row sense vector in which gs 1 would be the winning sense with a score equal to 0.6." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-64", "text": "Table 2 shows the unsupervised and supervised performance of systems participating in SWSI." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-65", "text": "As far as the baselines is concerned, the 1c1w baseline groups all instances of a target word into a single cluster, while the 1c1inst creates a new cluster for each instance of a target word." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-66", "text": "Note that the 1c1w baseline is equivalent to the MFS in the supervised evaluation." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-67", "text": "As it can be observed, a system with low entropy (high purity) does not necessarily achieve high F-Score." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-68", "text": "This is due to the fact that entropy and purity only measure the homogeneity of a clustering solution." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-69", "text": "For that reason, the 1c1inst baseline achieves a perfect entropy and purity, although its clustering solution is far from ideal." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-70", "text": "On the contrary, F-Score has a significant advantage over purity and entropy, since it measures both homogeneity (precision) and completeness (recall) of a clustering solution." 
}, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-71", "text": "However, F-Score suffers from the matching problem, which manifests itself either by not evaluating the entire membership of a cluster, or by not evaluating every cluster (Rosenberg and Hirschberg, 2007) ." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-72", "text": "The former situation is present, due to the fact that F-Score does not consider the make-up of the clusters beyond the majority class (Rosenberg and Hirschberg, 2007) ." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-73", "text": "For example, in Table 3 the F-Score of the clustering so- lution is 0.714 and equal to the F-Score of the clustering solution shown in Table 1 , although these are two significantly different clustering solutions." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-74", "text": "In fact, the clustering shown in Table 3 should have a better homogeneity than the clustering shown in Table 1 , since intuitively speaking each cluster contains fewer classes." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-75", "text": "Moreover, the second clustering should also have a better completeness, since each GS class contains fewer clusters." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-76", "text": "An additional instance of the matching problem manifests itself, when F-Score fails to evaluate the quality of smaller clusters." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-77", "text": "For example, if we add in Table 3 one more cluster (cl 4 ), which only tags 50 additional instances of gs 1 , then we will be able to observe that this cluster will not be matched to any of the GS senses, since cl 1 is matched to gs 1 ." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-78", "text": "Although F-Score will decrease since the recall of gs 1 will decrease, the evaluation setting ignores the perfect homogeneity of this small cluster." 
}, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-79", "text": "----------------------------------" }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-80", "text": "**SWSI RESULTS & DISCUSSION**" }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-81", "text": "In Table 2 , we observe that no system managed to outperform the 1c1w baseline in terms of F-Score." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-82", "text": "At the same time, some systems participating in SWSI were able to outperform the equivalent of the 1c1w baseline (MFS) in the supervised evaluation." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-83", "text": "For example, UBC-AS achieved the best F-Score close to the 1c1w baseline." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-84", "text": "However, by looking at its supervised recall, we observe that it is below the MFS baseline." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-85", "text": "A clustering solution, which achieves high supervised recall, does not necessarily achieve high FScore." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-86", "text": "One reason for that stems from the fact that F-Score penalises systems for getting the number of GS classes wrongly, as in 1c1inst baseline." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-87", "text": "According to Agirre & Soroa (2007) , supervised evaluation seems to be more neutral regarding the number of induced clusters, because clusters are mapped into a weighted vector of senses, and therefore inducing a number of clusters similar to the number of senses is not a requirement for good results." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-88", "text": "However, a large number of clusters might also lead to an unreliable mapping of clusters to GS senses." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-89", "text": "For example, high supervised recall also means high purity and low entropy as in I2R, but not vice versa as in UOY." 
}, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-90", "text": "UOY produces a large number of clean clusters, in effect suffering from an unreliable mapping of clusters to senses due to the lack of adequate training data." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-91", "text": "Moreover, an additional supervised evaluation of WSI methods using a different dataset split resulted in a different ranking, in which all of the systems outperformed the MFS baseline (Agirre and Soroa, 2007) ." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-92", "text": "This result indicates that the supervised evaluation might not provide a reliable estimation of WSD performance, particularly in the case where the mapping relies on a single dataset split." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-93", "text": "3 SemEval-2010 WSI evaluation setting 3.1 Unsupervised evaluation using V-measure Let us assume that the dataset of a target word tw comprises of N instances (data points)." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-94", "text": "These data points are divided into two partitions, i.e. a set of automatically generated clusters C = {c j |j = 1 . . ." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-95", "text": "n} and a set of gold standard classes GS = {gs i |gs = 1 . . ." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-96", "text": "m}. Moreover, let a ij be the number of data points, which are members of class gs i and elements of cluster c j ." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-97", "text": "V-measure assesses the quality of a clustering solution by explicitly measuring its homogeneity and its completeness (Rosenberg and Hirschberg, 2007) ." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-98", "text": "Recall that homogeneity refers to the degree that each cluster consists of data points which primarily belong to a single GS class." 
}, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-99", "text": "V-measure assesses homogeneity by examining the conditional entropy of the class distribution given the proposed clustering, i.e. H(GS|C)." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-100", "text": "H(GS|C) quantifies the remaining entropy (uncertainty) of the class distribution given that the proposed clustering is known." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-101", "text": "As a result, when H(GS|C) is 0, we have the perfectly homogeneous solution, since each cluster contains only those data points that are members of a single class." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-102", "text": "However in an imperfect situation, H(GS|C) depends on the size of the dataset and the distribution of class sizes." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-103", "text": "As a result, instead of taking the raw conditional entropy, V-measure normalises it by the maximum reduction in entropy the clustering information could provide, i.e. H(GS)." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-104", "text": "----------------------------------" }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-105", "text": "**FORMULAS 2 AND 3 DEFINE H(GS) AND H(GS|C).**" }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-106", "text": "When there is only a single class (H(GS) = 0), any clustering would produce a perfectly homogeneous solution." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-107", "text": "In the worst case, the class distribution within each cluster is equal to the overall class distribution (H(GS|C) = H(GS)), i.e. clustering provides no new information." 
}, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-108", "text": "Overall, in accordance with the convention of 1 being desirable and 0 undesirable, the homogeneity (h) of a clustering solution is 1 if there is only a single class, and 1\u2212 H(GS|C) H(GS) in any other case (Rosenberg and Hirschberg, 2007) ." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-109", "text": "Symmetrically to homogeneity, completeness refers to the degree that each GS class consists of data points, which have primarily been assigned to a single cluster." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-110", "text": "To evaluate completeness, V-measure examines the distribution of cluster assignments within each class." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-111", "text": "The conditional entropy of the cluster given the class distribution, H(C|GS), quantifies the remaining entropy (uncertainty) of the cluster given that the class distribution is known." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-112", "text": "Consequently, when H(C|GS) is 0, we have the perfectly complete solution, since all the data points of a class belong to the same cluster." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-113", "text": "Therefore, symmetrically to homogeneity, the completeness c of a clustering solution is 1 if there is only a single cluster (H(C) = 0), and 1 \u2212" }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-114", "text": "----------------------------------" }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-115", "text": "**H(C|GS) H(C)**" }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-116", "text": "in any other case." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-117", "text": "In the worst case, completeness will be equal to 0, particularly when H(C|GS) is maximal and equal to H(C)." 
}, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-118", "text": "This happens when each GS class is included in all clusters with a distribution equal to the distribution of sizes (Rosenberg and Hirschberg, 2007) ." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-119", "text": "Formulas 4 and 5 define H(C) and H(C|GS)." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-120", "text": "Finally h and c can be combined and produce V-measure, which is the harmonic mean of homogeneity and completeness." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-121", "text": "Returning to our clustering example in Table 1 , its V-measure is equal to 0.275." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-122", "text": "In section 2.3, we also presented an additional clustering (Table 3) , which had the same F-Score as the clustering in Table 1, despite the fact that it intuitively had a better completeness and homogeneity." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-123", "text": "The V-measure of the second clustering solution is equal to 0.45, and higher than the V-measure of the first clustering." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-124", "text": "This result shows that V-measure is able to discriminate between these two clusterings by considering the make-up of the clusters beyond the majority class." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-125", "text": "Furthermore, it is straightforward from the description in this section, that V-measure evaluates each cluster in terms of homogeneity and completeness, unlike F-Score which relies on a post-hoc matching." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-126", "text": "Table 4 shows the performance of SWSI participating systems according to V-measure." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-127", "text": "The last four columns of Table 4 show the weighted average homogeneity and completeness for nouns and verbs." 
}, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-128", "text": "Note that the homogeneity and completeness columns are weighted averages over all nouns or verbs, and are not used for the calculation of the weighted average V-measure (second column)." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-129", "text": "The latter is calculated by measuring for each target word's clustering solution the harmonic mean of homogeneity and completeness separately, and then producing the weighted average." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-130", "text": "As it can be observed in Table 4 , all WSI systems have outperformed the random baseline which means that they have learned useful information." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-131", "text": "Moreover, Table 4 shows that on average all systems have outperformed the 1c1w baseline, which groups the instances of a target word to a single cluster." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-132", "text": "The completeness of the 1c1w baseline is equal to 1 by definition, since all instances of GS classes are grouped to a single cluster." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-133", "text": "However, this solution is as inhomogeneous as possible and causes a homogeneity equal to 0 in the case of nouns." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-134", "text": "In the verb dataset however, some verbs appear with only one sense, in effect causing the 1c1w homogeneity to be equal to 1 in some cases, and the average Vmeasure greater than 0." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-135", "text": "----------------------------------" }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-136", "text": "**V-MEASURE RESULTS & DISCUSSION**" }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-137", "text": "In Table 4 , we also observe that the 1c1inst baseline achieves a high performance." 
}, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-138", "text": "In nouns only I2R is able to outperform this baseline, while in verbs the 1c1inst baseline achieves the highest result." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-139", "text": "By the definition of homogeneity (section 3.1), this baseline is perfectly homogeneous, since each cluster contains one instance of a single sense." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-140", "text": "However, its completeness is not 0, as one might intuitively expect." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-141", "text": "This is due to the fact that V-measure considers as the worst solution in terms of completeness the one, in which each class is represented by every cluster, and specifically with a distribution equal to the distribution of cluster sizes (Rosenberg and Hirschberg, 2007) ." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-142", "text": "This worst solution is not equivalent to the 1c1inst, hence completeness of 1c1inst is greater than 0." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-143", "text": "Additionally, completeness of this baseline benefits from the fact that around 18% of GS senses have only one instance in the test set." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-144", "text": "Note however, that on average this baseline achieves a lower completeness than most of the systems." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-145", "text": "Another observation from Table 4 is that upv si and UOY have a better ranking than in Table 2 ." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-146", "text": "Note that these systems have generated a higher number of clusters than the GS number of senses." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-147", "text": "In verbs UOY has been extensively penalised by the F-Score." 
}, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-148", "text": "The inspection of their answers shows that both systems generate highly skewed distributions, in which a small number of clusters tag the majority of instances, while a larger number tag only a few." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-149", "text": "As mentioned in sections 2.1 and 2.3, these small clusters might not be matched to any GS sense, hence they will decrease the unsupervised recall of a GS class, and consequently the F-Score." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-150", "text": "However, their high homogeneity is not considered in the calculation of F-Score." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-151", "text": "On the contrary, V-measure is able to evaluate the quality of these small clusters, and provide a more objective assessment." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-152", "text": "Finally, in our evaluation we observe that I2R has on average the highest performance among the SWSI methods." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-153", "text": "This is due to its high V-measure in nouns, but not in verbs." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-154", "text": "Particularly in nouns, I2R achieves a consistent performance in terms of homogeneity and completeness without being biased towards one of them, as is the case for the rest of the systems." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-155", "text": "For example, UOY and upv si achieve on average the highest homogeneity (42.5 & 32.8 resp.) and the worst completeness (11.5 & 13.2 resp.)." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-156", "text": "The opposite picture is present for UBC-AS and UMND2." 
}, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-157", "text": "Despite that, UBC-AS and UMND2 perform better than I2R in verbs, due to the small number of generated clusters (high completeness), and a reasonable homogeneity mainly due to the existence of verbs with one GS sense." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-158", "text": "----------------------------------" }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-159", "text": "**MODIFIED SUPERVISED WSI EVALUATION**" }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-160", "text": "In section 2.3, we mentioned that supervised evaluation might favor methods which produce many clusters, since the mapping step can artificially increase completeness." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-161", "text": "Furthermore, we have shown that generating a large number of clusters might lead to an unreliable mapping of clusters to GS senses due to the lack of adequate training data." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-162", "text": "Despite that, the supervised evaluation can be considered as an application-oriented evaluation, since it allows the transformation of unsupervised WSI systems to semi-supervised WSD ones." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-163", "text": "Given the great difficulty of unsupervised WSD systems to outperform the MFS baseline as well as the SWSI results, which show that some systems outperform the MFS by a significant amount in nouns, we believe that this evaluation scheme should be used to compare against supervised WSD methods." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-164", "text": "In section 2.3, we also mentioned that the supervised evaluation on two different test/train splits provided a different ranking of methods, and more importantly a different ranking with regard to the MFS." 
}, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-165", "text": "To deal with that problem, we believe that it would be reasonable to perform k-fold cross validation in order to collect statistically significant information." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-166", "text": "----------------------------------" }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-167", "text": "**CONCLUSION**" }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-168", "text": "We presented and discussed the limitations of the SemEval-2007 evaluation setting for WSI methods." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-169", "text": "Based on our discussion, we described the use of V-measure as the measure of assessing WSI performance on an unsupervised setting, and presented the results of SWSI WSI methods." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-170", "text": "We have also suggested a small modification on the supervised evaluation scheme, which will allow for a more reliable estimation of WSD performance." }, { "sent_id": "09dad2fd96cd1d48936cd5b99a38e7-C001-171", "text": "The new evaluation setting will be applied in the SemEval-2010 WSI task." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "09dad2fd96cd1d48936cd5b99a38e7-C001-18" ], [ "09dad2fd96cd1d48936cd5b99a38e7-C001-20" ], [ "09dad2fd96cd1d48936cd5b99a38e7-C001-71" ], [ "09dad2fd96cd1d48936cd5b99a38e7-C001-118" ] ], "cite_sentences": [ "09dad2fd96cd1d48936cd5b99a38e7-C001-18", "09dad2fd96cd1d48936cd5b99a38e7-C001-20", "09dad2fd96cd1d48936cd5b99a38e7-C001-71", "09dad2fd96cd1d48936cd5b99a38e7-C001-118" ] }, "@DIF@": { "gold_contexts": [ [ "09dad2fd96cd1d48936cd5b99a38e7-C001-25" ], [ "09dad2fd96cd1d48936cd5b99a38e7-C001-72" ] ], "cite_sentences": [ "09dad2fd96cd1d48936cd5b99a38e7-C001-25", "09dad2fd96cd1d48936cd5b99a38e7-C001-72" ] }, "@EXT@": { "gold_contexts": [ [ "09dad2fd96cd1d48936cd5b99a38e7-C001-25" ] ], "cite_sentences": [ "09dad2fd96cd1d48936cd5b99a38e7-C001-25" ] }, "@SIM@": { "gold_contexts": [ [ "09dad2fd96cd1d48936cd5b99a38e7-C001-46" ], [ "09dad2fd96cd1d48936cd5b99a38e7-C001-97" ], [ "09dad2fd96cd1d48936cd5b99a38e7-C001-108" ], [ "09dad2fd96cd1d48936cd5b99a38e7-C001-141" ] ], "cite_sentences": [ "09dad2fd96cd1d48936cd5b99a38e7-C001-46", "09dad2fd96cd1d48936cd5b99a38e7-C001-97", "09dad2fd96cd1d48936cd5b99a38e7-C001-108", "09dad2fd96cd1d48936cd5b99a38e7-C001-141" ] } } }, "ABC_7616f6f8c1c188b32cd3a8374b61dd_16": { "x": [ { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-2", "text": "In this paper we present the development process of NLP-QT, a question treebank that will be used for data-driven parsing in the context of a domain-specific QA system for querying NLP resource metadata." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-3", "text": "We motivate the need to build NLP-QT as a resource in its own right, by comparing the Penn Treebank-style annotation scheme used for QuestionBank (Judge et al., 2006) with the modified NP annotation for the Penn Treebank introduced by Vadas and Curran (2007) ." 
}, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-4", "text": "We argue that this modified annotation scheme provides a better interface representation for semantic interpretation and show how it can be incorporated into the NLP-QT resource, without significant loss in parser performance." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-5", "text": "The parsing experiments reported in the paper confirm the feasibility of an iterative, semi-automatic construction of the NLP-QT resource similar to the approach taken for QuestionBank." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-6", "text": "At the same time, we propose to improve the iterative refinement technique used for QuestionBank by adopting Hwa (2001)'s heuristics for selecting additional material to be handcorrected and added to the data set at each iteration." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-7", "text": "----------------------------------" }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-9", "text": "Question-Answering (QA) systems have a long history in the field of natural language processing." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-10", "text": "In the 1970s and 1980s QA systems focused on natural language interfaces to domainspecific data bases or expert systems." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-11", "text": "Such systems typically used hand-crafted, rule-based front ends for parsing and semantic interpretation." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-12", "text": "With the increased availability of large-scale textual resources, QA systems more recently have focused on domain-independent broad-coverage information retrieval applications that typically employ more shallow processing techniques for question analysis and answer matching." 
}, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-13", "text": "The intended application for the research reported in the present paper is more in the tradition of the earlier, domain-specific QA systems in that it aims to provide a natural language front-end to large repositories of metadata about language tools and resources that are made available by the CLARIN 1 project." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-14", "text": "However, instead of relying on a parser with hand-crafted grammar rules, it employs a robust data-driven parser that requires annotated training data in the form of a treebank." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-15", "text": "Since the natural language front end for the intended QA system is English, the simplest solution would be to use a statistical parser such as the Berkeley (Petrov and Klein, 2007) or Stanford (Klein and Manning, 2003) parser with an existing language model obtained from the Penn Treebank (Marcus et al., 1993) ." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-16", "text": "However, it is well known that parser performance drops when analyzing text from domains other than that represented in the training data (Sekine, 1997; Gildea, 2001) ." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-17", "text": "In particular, Judge et al. (2006) have shown that language models obtained from the Penn Treebank perform far worse on questions than on their original test data." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-18", "text": "The Bikel (2004) parser they employ has an F-Score of 82.97 when tested on Section 23 of the Penn-II Treebank and an F-Score of 78.77 when tested on the 4000 questions in QuestionBank." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-19", "text": "Judge et al. 
(2006) attribute this loss of performance to two factors: (i) in the genre of newspaper texts, which the Penn Treebank is based on, questions are not a high-frequency syntactic construction, and (ii) if wh-type constructions occur at all in the Penn Treebank, they predominantly involve relative clause constructions or indirect questions, but not unembedded questions." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-20", "text": "Therefore, a parser trained on Penn Treebank data routinely misanalyses unembedded questions as these other two construction types." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-21", "text": "In fact, it was this poor parser performance that led Judge et al. to create QuestionBank, a special-purpose treebank based on SemEval data sets for Question Answering (QA). The data include the SemEval QA data from 1999-2001, part of the 2003 set (2000 questions), and another 2000 questions provided by the Cognitive Computation Group at the University of Illinois, which were also test data for developing QA systems." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-22", "text": "Training a statistical parser on QuestionBank data, possibly in combination with Penn Treebank data, therefore seems to be an attractive alternative." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-23", "text": "In fact, this is precisely how Judge et al. train their parser." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-24", "text": "However, for reasons explained in more detail in sections 2 and 3, we will adopt annotation guidelines for questions that differ from the Penn Treebank-style annotation used in QuestionBank." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-25", "text": "Rather, we will follow a more hierarchical annotation style for NPs that has been proposed by Vadas and Curran (2007) and that provides an easier interface for semantic interpretation." 
}, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-26", "text": "Section 3 will introduce the Vadas and Curran (2007) annotation style and will motivate why it is appropriate for the QA system envisaged here." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-27", "text": "Section 4 will present a set of parsing experiments for the Berkeley parser trained on different combinations of treebank data discussed in sections 2 and 3." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-28", "text": "The final section summarizes the main results of this paper and discusses directions for future research." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-29", "text": "----------------------------------" }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-30", "text": "**DATA COLLECTION FOR QUERYING NLP RESOURCE METADATA**" }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-31", "text": "One of the main reasons to create a new data set of questions and not use some already existing set has to do with the specific subject domain of the QA system to be developed." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-32", "text": "All the questions should concern particular pieces of information associated with language resources or with different application domains of natural language processing." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-33", "text": "In order to obtain a realistic data set of this sort, we harvested the questions from mailing lists like LinguistList 2 and Corpora List 3 , as well as from the Stack Overflow 4 questions tagged with \"nlp\"." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-34", "text": "The mailing lists have a history of 20 years and have a lot of extra content other than user queries." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-35", "text": "Therefore, all the posts had to be browsed through in order to manually extract only the relevant questions from the whole post." 
}, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-36", "text": "For example, information about the person asking the question was deleted from the original posts, since such information is not relevant for a QA system." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-37", "text": "Spelling and grammar errors were then removed from the extracted questions." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-38", "text": "A number of 2500 questions were harvested until the moment of writing, but the goal is to gather a 10.000 questions corpus that should provide enough training and testing data when converted into a treebank." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-39", "text": "The data below provide some typical examples that have been collected from the three sources:" }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-40", "text": "(1) Where can I find a corpus of German newspapers from the 17th century until the 1950s?" }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-41", "text": "(2) What good introductory books on the subject of natural language processing, parsing and tagging are there? Apart from the more restricted subject domain, the NLP Resource Metadata Questions significantly differ from the SemEval data used in QuestionBank in at least two other respects:" }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-42", "text": "\u2022 The average length of the SemEval questions in QuestionBank is 47.58 characters and 9.45 words, whereas the NLP Questions average 81.17 characters and 12.88 words." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-43", "text": "\u2022 Moreover, the distribution of questions types is quite different in the two cases." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-44", "text": "The SemEval data set used for QuestionBank is intended to query encyclopedic knowledge from sources such as Wikipedia." 
}, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-45", "text": "querying NLP resource metadata, the emphasis is to a large extent on where and is there questions; the percentage for each type of question in the two datasources is showed in Table 1 ." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-46", "text": "----------------------------------" }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-47", "text": "**COMPARING THE ANNOTATION OF BASE NPS**" }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-48", "text": "There is yet another property of both QuestionBank and the Penn Treebank that limits its usefulness for the QA application considered here." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-49", "text": "This concerns the flat-structure annotation style for noun phrases adopted in both resources." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-50", "text": "For example, in the question Where can I find a German corpus containing second language acquisition materials?" }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-51", "text": "the compound noun second language acquisition materials would be annotated in these resources as a single flat NP, as shown in the left column of Figure 1 ." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-52", "text": "Such a flat annotation does not provide sufficient information about the scope of each member of the compound." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-53", "text": "It is precisely this type of shortcoming that led Vadas and Curran (2007) to revise the Penn Treebank annotation style for NPs along the following lines:" }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-54", "text": "\u2022 If the intended scope of a base NP leads to a strictly right-branching structure, then the Penn Treebank annotation remains unchanged." 
}, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-55", "text": "\u2022 If the intended scope is partially or completely left-branching, then an extra node is introduced into the tree for each leftbranching structure." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-56", "text": "The label of this node is either NML or JJP, depending on the lexical head of the local tree (noun or adjective, respectively)." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-57", "text": "The resulting annotation for the compound noun second language acquisition materials is shown in the right column of Figure 1 ." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-58", "text": "From the point of view of semantic interpretation, the more contoured Vadas and Curran (2007) annotation style is to be preferred since it reflects the type of answer that is required, namely materials for second language acquisition, but not for example acquisition materials for second language, or the second (batch) of language acquisition materials." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-59", "text": "It is precisely for this reason that we adopt the annotation style of Vadas and Curran (2007) for the NLP Resource Metadata Questions Treebank (henceforth abbreviated as NLP-QT)." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-60", "text": "----------------------------------" }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-61", "text": "**EXPERIMENTAL RESULTS**" }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-62", "text": "This section summarizes the set of experiments that we have conducted with the Vadas and Curran (2007) annotation style for NPs and in particular with the NLP-QT data set." 
}, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-63", "text": "We discuss two types of experiments:" }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-64", "text": "\u2022 comparing the performance of the parser using different annotation styles for base NPs," }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-65", "text": "\u2022 experiments for optimizing the language model of a statistical parser in order to assist with the semi-automatic creation of the treebank." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-66", "text": "All the experiments were performed with the Berkeley parser." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-67", "text": "The results are summarized in Table 2 and Table 3 ." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-68", "text": "----------------------------------" }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-69", "text": "**PARSING RESULTS FOR DIFFERENT ANNOTATION STYLES**" }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-70", "text": "Using Bikel (2004)'s parser, Vadas and Curran (2007) report that the parsing results slightly decrease when the parser is trained on the Penn Treebank with the modified annotation style for NPs." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-71", "text": "As Table 2 shows, we obtain a similar result when testing on section 23 of the Penn Treebank, using the Berkeley parser trained on sections 02-21 of the same treebank: there is minor drop in F-score from 90.43 to 89.96." 
}, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-72", "text": "We also confirm Gildea's finding that testing a parser on test sets from a different domain than the training sets results in a significant loss of performance: when using the same models that we used for the Penn Treebank experiments, the average F-score for test data from the Question Bank in a 10-fold cross-validation experiment is 79.944 for the model trained on the original Penn Treebank and 77.607 for the model trained on the modified Penn Treebank." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-73", "text": "The above experiments were designed as a baseline for comparing the performance of the parser trained only on Penn Treebank data." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-74", "text": "But since our primary interest is in parsing questions as accurately as possible, we conducted a second set of experiments, summarized in the lower half of Table 2." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-75", "text": "Here additional training data from the Question Bank was added to both the original and the modified Penn Treebank training data." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-117", "text": "The tagger treats the lexical token chunking as a noun (NN), rather than a gerund (VBG), and the lexical token French as a plural noun (NNS) rather than as an adjective (JJ)." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-76", "text": "The decrease in performance caused by adding the QuestionBank training data together with the modified NP annotation on section 23 is comparable to the one caused by adding the modified NP annotation alone (a decrease from 90.263 to 90.04, whereas for the original Penn Treebank data the F-score decreased from 90.43 to 89.96), but this slight decrease is more than offset by the increase in semantic information obtained from the Vadas and Curran (2007) annotation for complex base NPs." 
}, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-77", "text": "Even more noteworthy is the big jump in F-score from 77.607 to 92.658 when adding the QuestionBank data to the training data." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-78", "text": "----------------------------------" }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-79", "text": "**SEMI-AUTOMATIC CREATION OF NLP-QT**" }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-80", "text": "The creation of a treebank is a time-consuming and expensive task if all the annotation has to be performed manually." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-81", "text": "It is therefore useful to investigate whether at least parts of the annotation can be performed automatically or by a combination of automatic analysis and manual post editing." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-82", "text": "To this end, we performed a set of parsing experiments, again using the Berkeley parser, where the test data are taken both from the QuestionBank and a seed set of 500 manually annotated questions from the NLP-QT." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-83", "text": "The results are shown in Table 3 ." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-84", "text": "As in the experiments shown in the previous subsection, the performance with a model trained purely on Penn Treebank data (with NPs annotated in the Vadas and Curran (2007) style) serves as a baseline (the model is called np-wsj in the table)." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-85", "text": "This model is then enriched by first adding annotated data from Question Bank and then by adding the manually annotated questions from the NLP-QT." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-86", "text": "We refer to these models as np-wsjqb and npwsjqblq 500, respectively." 
}, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-87", "text": "The results are very encouraging on several dimensions:" }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-88", "text": "1. overall parsing performance on the test data for both the np-wsjqb and the np-wsjqblq 500 models is very good 2. adding questions from the NLP-QT yields a desired increase in performance 3." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-89", "text": "almost two-thirds of all questions from the test data yield a completely correct parse." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-90", "text": "These three findings together make a semiautomatic construction of the NLP-QT entirely feasible." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-91", "text": "In fact, we are currently constructing the NLP-QT treebank in this semi-automatic fashion, using the same iterative approach to treebank construction adopted for the QuestionBank data by Judge et al. This approach involves iterations of manual post correction of automatically generated Table 4 : Average length and constituent count for the correctly/incorrectly parsed questions parses, adding this post-corrected data set to the previously used training material and then retraining the parser with the enlarged data set." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-92", "text": "One question that was not addressed in the approach by Judge et al. concerns the selection of the additional trees that will be manually corrected and then added to the training and test material in the next iteration." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-93", "text": "As Hwa (2001) has pointed out, this selection process can be critical in minimizing the amount of data that needs to be hand-corrected during grammar induction." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-94", "text": "She suggests several simple heuristics for ranking the candidate trees, two of which will be considered here." 
}, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-95", "text": "One heuristic is based on the often observed fact that, on average, longer sentences are harder to parse correctly than shorter ones." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-96", "text": "A second, related and somewhat more fine-grained variant of the first heuristic is based on the number of constituents obtained by the automatic parse of a sentence." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-97", "text": "Since the automatic parse is often at least partially incorrect, the constituent count of the parser will typically be just an estimate of the actual constituent count and related complexity of the sentence." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-98", "text": "Hwa suggests that when trees are added, the selected trees should match the average constituent count and length profile of the trees that were incorrectly parsed in the previous iteration." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-99", "text": "We adopt Hwa's approach in the construction of the NLP-QT treebank." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-100", "text": "In order to use it effectively, it is necessary to inspect the results of the parser and in particular create an automatic profile of the completely correct versus partially incorrect parses." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-101", "text": "This type of error analysis is the subject of the next section." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-102", "text": "Table 4 summarizes the profiling of the 500 questions from the NLP-QT used in the 10-fold validation experiment." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-103", "text": "On average, 48.59 % of all sentences received an entirely correct parse." 
}, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-104", "text": "The average length in characters and in words as well as the average number of constituents of the correctly parsed sentences differ significantly from the questions where the parse is only partially correct." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-105", "text": "These results provide a sound basis for applying Hwa's selection method: in the next iteration of optimizing the statistical model for the parser, sampling should focus on questions that match as closely as possible the character, word, and constituent count of the partially incorrect parse trees." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-106", "text": "----------------------------------" }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-107", "text": "**ERROR ANALYSIS**" }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-108", "text": "In order to get an impression of the kinds of mistakes that are made by the Berkeley parser, we are presenting two partially incorrect parse trees for the sentences in 6 and 7." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-109", "text": "(6) Is there any freely available text corpus for Croatian, no smaller than 20k words?" }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-110", "text": "(7) Where can I find information on chunking French and German texts?" }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-111", "text": "The trees obtained by the Berkeley parser for these two sentences are shown in Figures 2 and 3 , respectively." 
}, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-112", "text": "They exhibit the following typical attachment mistakes and misgroupings of conjuncts in a coordination structure:" }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-113", "text": "The parse tree generated by the Berkeley parser for sentence 6 (Figure 2 ) contains several errors: two attachment errors (the PP for Croatian is not attached as a post-head modifier to the nominal head text corpus, but rather attached high as a sister of the preceding NP." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-114", "text": "Likewise, the modifier starting with no smaller ... is treated as an ADJP rather than an NP and is attached as well as a sister of the preceding NP and PP rather than to the complex NP any ... for Croatian in the gold parse." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-115", "text": "Moreover, the JJP freely available is incorrectly labelled as an ADJP." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-116", "text": "The parse tree for sentence 7 (Figure 3 ) fails on the correct grouping and labelling of the coordinate structure French and German texts." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-118", "text": "The parser then combines these two items into an NP, which is then coordinated with the NP German texts." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-119", "text": "By hand correcting parse trees similar to the ones just discussed and by including them in the data set for retraining the parsing model in the next iteration, the performance of the parser on the types of constructions in question will improve and thereby minimize the amount of manual post editing as much as possible." 
}, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-120", "text": "----------------------------------" }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-121", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-122", "text": "In this paper we have presented the development process of the NLP-QT resource that will be used for data-driven parsing in the context of a domainspecific QA system for querying NLP resource metadata." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-123", "text": "We have motivated the need to build NLP-QT as a resource in its own right by comparing the Penn Treebank-style annotation scheme used for QuestionBank with the modified NP annotation for the Penn Treebank introduced by Vadas and Curran (2007) ." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-124", "text": "We have argued that this modified annotation scheme provides a better interface representation for semantic interpretation and have shown how it can be incorporated into the NLP-QT resource, without significant loss in parser performance." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-125", "text": "The parsing experiments reported in the paper confirm the feasibility of an iterative, semiautomatic construction of the NLP-QT resource similar to the approach taken for QuestionBank." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-126", "text": "At the same time, we propose to improve the iterative refinement technique used for QuestionBank by adopting Hwa's heuristics for selecting additional material to be hand-corrected and added to the data set at each iteration." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-127", "text": "Another important aspect in the creation of a treebank how to ensure a consistent and correct annotation of the linguistic material." 
}, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-128", "text": "Automatic error detection techniques that can be used to test the accuracy of the annotation have already been described in works like Kv\u011bto\u0148 and Oliva (2002) , for the part of speech annotation level, and Dickinson and Meurers (2005) , for the syntactic annotation level." }, { "sent_id": "7616f6f8c1c188b32cd3a8374b61dd-C001-129", "text": "In future work on the NLP-QT, we plan to employ such methods in order to identify and to correct inconsistencies in the annotation." } ], "y": { "@MOT@": { "gold_contexts": [ [ "7616f6f8c1c188b32cd3a8374b61dd-C001-3" ], [ "7616f6f8c1c188b32cd3a8374b61dd-C001-123" ] ], "cite_sentences": [ "7616f6f8c1c188b32cd3a8374b61dd-C001-3", "7616f6f8c1c188b32cd3a8374b61dd-C001-123" ] }, "@USE@": { "gold_contexts": [ [ "7616f6f8c1c188b32cd3a8374b61dd-C001-25" ], [ "7616f6f8c1c188b32cd3a8374b61dd-C001-59" ] ], "cite_sentences": [ "7616f6f8c1c188b32cd3a8374b61dd-C001-25", "7616f6f8c1c188b32cd3a8374b61dd-C001-59" ] }, "@SIM@": { "gold_contexts": [ [ "7616f6f8c1c188b32cd3a8374b61dd-C001-25" ], [ "7616f6f8c1c188b32cd3a8374b61dd-C001-26" ], [ "7616f6f8c1c188b32cd3a8374b61dd-C001-59" ], [ "7616f6f8c1c188b32cd3a8374b61dd-C001-62" ], [ "7616f6f8c1c188b32cd3a8374b61dd-C001-84" ] ], "cite_sentences": [ "7616f6f8c1c188b32cd3a8374b61dd-C001-25", "7616f6f8c1c188b32cd3a8374b61dd-C001-26", "7616f6f8c1c188b32cd3a8374b61dd-C001-59", "7616f6f8c1c188b32cd3a8374b61dd-C001-62", "7616f6f8c1c188b32cd3a8374b61dd-C001-84" ] }, "@BACK@": { "gold_contexts": [ [ "7616f6f8c1c188b32cd3a8374b61dd-C001-53" ], [ "7616f6f8c1c188b32cd3a8374b61dd-C001-58" ], [ "7616f6f8c1c188b32cd3a8374b61dd-C001-70" ] ], "cite_sentences": [ "7616f6f8c1c188b32cd3a8374b61dd-C001-53", "7616f6f8c1c188b32cd3a8374b61dd-C001-58", "7616f6f8c1c188b32cd3a8374b61dd-C001-70" ] }, "@DIF@": { "gold_contexts": [ [ "7616f6f8c1c188b32cd3a8374b61dd-C001-76" ] ], "cite_sentences": [ "7616f6f8c1c188b32cd3a8374b61dd-C001-76" ] } } }, 
"ABC_2bb41cea97a0375f67eab3a77c3a97_16": { "x": [ { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-174", "text": "On the other hand, the quality improvement rate is roughly log-linear against the corpus size." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-2", "text": "Classically, training relation extractors relies on high-quality, manually annotated training data, which can be expensive to obtain." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-3", "text": "To mitigate this cost, NLU researchers have considered two newly available sources of less expensive (but potentially lower quality) labeled data from distant supervision and crowd sourcing." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-4", "text": "There is, however, no study comparing the relative impact of these two sources on the precision and recall of post-learning answers." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-5", "text": "To fill this gap, we empirically study how state-of-the-art techniques are affected by scaling these two sources." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-6", "text": "We use corpus sizes of up to 100 million documents and tens of thousands of crowd-source labeled examples." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-7", "text": "Our experiments show that increasing the corpus size for distant supervision has a statistically significant, positive impact on quality (F1 score)." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-8", "text": "In contrast, human feedback has a positive and statistically significant, but lower, impact on precision and recall." 
}, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-9", "text": "----------------------------------" }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-11", "text": "Relation extraction is the problem of populating a target relation (representing an entity-level relationship or attribute) with facts extracted from naturallanguage text." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-12", "text": "Sample relations include people's titles, birth places, and marriage relationships." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-13", "text": "Traditional relation-extraction systems rely on manual annotations or domain-specific rules provided by experts, both of which are scarce resources that are not portable across domains." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-175", "text": "Recall that each data point in Figure 2 is the average from 30 trials." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-120", "text": "We separately tune this parameter for each training set (with cross validation), but found that the result quality was robust with respect to a broad range of parameter values." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-121", "text": "12" }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-122", "text": "----------------------------------" }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-123", "text": "**EXPERIMENTS**" }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-124", "text": "We describe our experiments to test the hypotheses that the following two factors improve distantsupervision quality: increasing the (1) corpus size, and (2) the amount of crowd-sourced feedback." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-125", "text": "We confirm hypothesis (1), but, surprisingly, are unable to confirm (2)." 
}, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-149", "text": "We retain only credible answers using the gold-standard method (see Section 3.3), and use them as the pool of human feedback that we run experiments with." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-14", "text": "To remedy these problems, recent years have seen interest in the distant supervision approach for relation extraction (Wu and Weld, 2007; Mintz et al., 2009) ." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-15", "text": "The input to distant supervision is a set of seed facts for the target relation together with an (unlabeled) text corpus, and the output is a set of (noisy) annotations that can be used by any machine learning technique to train a statistical model for the target relation." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-16", "text": "For example, given the target relation birthPlace(person, place) and a seed fact birthPlace(John, Springfield), the sentence \"John and his wife were born in Springfield in 1946\" (S1) would qualify as a positive training example." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-17", "text": "Distant supervision replaces the expensive process of manually acquiring annotations that is required by direct supervision with resources that already exist in many scenarios (seed facts and a text corpus)." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-18", "text": "On the other hand, distantly labeled data may not be as accurate as manual annotations." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-19", "text": "For example, \"John left Springfield when he was 16\" (S2) would also be considered a positive example about place of birth by distant supervision as it contains both John and Springfield." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-20", "text": "The hypothesis is that the broad coverage and high redundancy in a large corpus would compensate for this noise." 
}, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-21", "text": "For example, with a large enough corpus, a distant supervision system may find that patterns in the sentence S1 strongly correlate with seed facts of birthPlace whereas patterns in S2 do not qualify as a strong indicator." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-22", "text": "Thus, intuitively the quality of distant supervision should improve as we use larger corpora." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-23", "text": "However, there has been no study on the impact of corpus size on distant supervision for relation extraction." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-24", "text": "Our goal is to fill this gap." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-25", "text": "Besides \"big data,\" another resource that may be valuable to distant supervision is crowdsourc-ing." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-26", "text": "For example, one could employ crowd workers to provide feedback on whether distant supervision examples are correct or not (Gormley et al., 2010) ." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-27", "text": "Intuitively the crowd workforce is a perfect fit for such tasks since many erroneous distant labels could be easily identified and corrected by humans." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-28", "text": "For example, distant supervision may mistakenly consider \"Obama took a vacation in Hawaii\" a positive example for birthPlace simply because a database says that Obama was born in Hawaii; a crowd worker would correctly point out that this sentence is not actually indicative of this relation." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-29", "text": "It is unclear however which strategy one should use: scaling the text corpus or the amount of human feedback." 
}, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-30", "text": "Our primary contribution is to empirically assess how scaling these inputs to distant supervision impacts its result quality." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-31", "text": "We study this question with input data sets that are orders of magnitude larger than those in prior work." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-32", "text": "While the largest corpus (Wikipedia and New York Times) employed by recent work on distant supervision (Mintz et al., 2009; Hoffmann et al., 2011) contain about 2M documents, we run experiments on a 100M-document (50X more) corpus drawn from ClueWeb." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-33", "text": "1 While prior work (Gormley et al., 2010) on crowdsourcing for distant supervision used thousands of human feedback units, we acquire tens of thousands of human-provided labels." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-34", "text": "Despite the large scale, we follow state-of-the-art distant supervision approaches and use deep linguistic features, e.g., part-of-speech tags and dependency parsing." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-35", "text": "2 Our experiments shed insight on the following two questions:" }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-36", "text": "1. How does increasing the corpus size impact the quality of distant supervision? 2." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-37", "text": "For a given corpus size, how does increasing the amount of human feedback impact the quality of distant supervision?" }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-38", "text": "We found that increasing corpus size consistently and significantly improves recall and F1, despite reducing precision on small corpora; in contrast, human feedback has relatively small impact on precision and recall." 
}, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-39", "text": "For example, on a TAC corpus with 1.8M documents, we found that increasing the corpus size ten-fold consistently results in statistically 1 http://lemurproject.org/clueweb09.php/ 2 We used 100K CPU hours to run such tools on ClueWeb." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-40", "text": "significant improvement in F1 on two standardized relation extraction metrics (t-test with p=0.05)." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-41", "text": "On the other hand, increasing human feedback amount ten-fold results in statistically significant improvement on F1 only when the corpus contains at least 1M documents; and the magnitude of such improvement was only one fifth compared to the impact of corpus-size increment." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-42", "text": "We find that the quality of distant supervision tends to be recall gated, that is, for any given relation, distant supervision fails to find all possible linguistic signals that indicate a relation." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-43", "text": "By expanding the corpus one can expand the number of patterns that occur with a known set of entities." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-44", "text": "Thus, as a rule of thumb for developing distant supervision systems, one should first attempt to expand the training corpus and then worry about precision of labels only after having obtained a broad-coverage corpus." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-45", "text": "Throughout this paper, it is important to understand the difference between mentions and entities." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-46", "text": "Entities are conceptual objects that exist in the world (e.g., Barack Obama), whereas authors use a variety of wordings to refer to (which we call \"mention\") entities in text (Ji et al., 2010) ." 
}, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-47", "text": "----------------------------------" }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-48", "text": "**RELATED WORK**" }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-49", "text": "The idea of using entity-level structured data (e.g., facts in a database) to generate mention-level training data (e.g., in English text) is a classic one: researchers have used variants of this idea to extract entities of a certain type from webpages (Hearst, 1992; Brin, 1999) ." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-50", "text": "More closely related to relation extraction is the work of Lin and Patel (2001) that uses dependency paths to find answers that express the same relation as in a question." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-51", "text": "Since Mintz et al. (2009) coined the name \"distant supervision,\" there has been growing interest in this technique." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-52", "text": "For example, distant supervision has been used for the TAC-KBP slot-filling tasks and other relation-extraction tasks (Hoffmann et al., 2010; Carlson et al., 2010; Nguyen and Moschitti, 2011a; Nguyen and Moschitti, 2011b) ." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-53", "text": "In contrast, we study how increasing input size (and incorporating human feedback) improves the result quality of distant supervision." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-54", "text": "We focus on logistic regression, but it is interesting future work to study more sophisticated prob- Step 1 is preprocessing; step 4 is final evaluation." 
}, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-55", "text": "The key steps are distant supervision (step 2), where we train a logistic regression (LR) classifier for each relation using (noisy) examples obtained from sentences that match Freebase facts, and human feedback (step 3) where a crowd workforce refines the LR classifiers by providing feedback to the training data." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-56", "text": "abilistic models; such models have recently been used to relax various assumptions of distant supervision Hoffmann et al., 2011) ." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-57", "text": "Specifically, they address the noisy assumption that, if two entities participate in a relation in a knowledge base, then all co-occurrences of these entities express this relation." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-58", "text": "In contrast, we explore the effectiveness of increasing the training data sizes to improve distant-supervision quality." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-59", "text": "Sheng et al. (2008) and Gormley et al. (2010) study the quality-control issue for collecting training labels via crowdsourcing." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-60", "text": "Their focus is the collection process; in contrast, our goal is to quantify the impact of this additional data source on distantsupervision quality." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-61", "text": "Moreover, we experiment with one order of magnitude more human labels." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-62", "text": "Hoffmann et al. (2009) study how to acquire end-user feedback on relation-extraction results posted on an augmented Wikipedia site; it is interesting future work to integrate this source in our experiments." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-63", "text": "One technique for obtaining human input is active learning." 
}, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-64", "text": "We tried several active-learning techniques as described by Settles (2010) , but did not observe any notable advantage over uniform sampling-based example selection." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-65", "text": "3 taining mentions of named entities, our goal is to learn a classifier for R(x, y) using linguistic features of x and y, e.g., dependency-path information." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-66", "text": "The problem is that we lack the large amount of labeled examples that are typically required to apply supervised learning techniques." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-67", "text": "We describe an overview of these techniques and the methodological choices we made to implement our study." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-68", "text": "Figure 1 illustrates the overall workflow of a distant supervision system." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-69", "text": "At each step of the distant supervision process, we closely follow the recent literature (Mintz et al., 2009; ." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-70", "text": "----------------------------------" }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-71", "text": "**DISTANT SUPERVISION**" }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-72", "text": "Distant supervision compensates for a lack of training examples by generating what are known as silver-standard examples (Wu and Weld, 2007) ." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-73", "text": "The observation is that we are often able to obtain a structured, but incomplete, database D that instantiates relations of interest and a text corpus C that contains mentions of the entities in our database." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-74", "text": "Formally, a database is a tuple D = (E,R) where E is a set of entities andR = (R 1 . . . 
, R N ) is a tuple of instantiated predicates." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-75", "text": "For example, R i may contain pairs of married people." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-76", "text": "We use the facts in R i combined with C to generate examples." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-77", "text": "Following recent work (Mintz et al., 2009; Hoffmann et al., 2011), we use Freebase as the knowledge base for seed facts." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-78", "text": "We use two text corpora. (Footnote 4: We only consider binary predicates in this work. Footnote 5: http://freebase.com. Footnote 6: KBP stands for \"Knowledge-Base Population.\")" }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-79", "text": "(1) The TAC-KBP 2010 corpus consists of 1.8M newswire and blog articles, and (2) the ClueWeb09 corpus is a 2009 snapshot of 500M webpages." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-80", "text": "We use the TAC-KBP slot filling task and select those TAC-KBP relations that are present in the Freebase schema as targets (20 relations on people and organizations)." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-81", "text": "One problem is that relations in D are defined at the entity level." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-82", "text": "Thus, the pairs in such relations are not embedded in text, and so these pairs lack the linguistic context that we need to extract features, i.e., the features used to describe examples." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-83", "text": "In turn, this implies that these pairs cannot be used directly as training examples for our classifier." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-84", "text": "To generate training examples, we need to map the entities back to mentions in the corpus." 
}, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-85", "text": "We denote the relation that describes this mapping as the relation EL(e, m) where e \u2208 E is an entity in the database D and m is a mention in the corpus C. For each relation R i , we generate a set of (noisy) positive examples denoted R" }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-86", "text": "As in previous work, we impose the constraint that both mentions (m 1 , m 2 ) \u2208 R + i are contained in the same sentence (Mintz et al., 2009; Hoffmann et al., 2011) ." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-87", "text": "To generate negative examples for each relation, we follow the assumption in Mintz et al. (2009) that relations are disjoint and sample from other relations, i.e., R" }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-88", "text": "----------------------------------" }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-89", "text": "**FEATURE EXTRACTION**" }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-90", "text": "Once we have constructed the set of possible mention pairs, the state-of-the-art technique to generate feature vectors uses linguistic tools such as partof-speech taggers, named-entity recognizers, dependency parsers, and string features." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-91", "text": "Following recent work on distant supervision (Mintz et al., 2009; Hoffmann et al., 2011) , we use both lexical and syntactic features." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-92", "text": "After this stage, we have a well-defined machine learning problem that is solvable using standard supervised techniques." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-93", "text": "We use sparse logistic regression ( 1 regularized) (Tibshirani, 1996) , which is used in previous studies." 
}, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-94", "text": "Our feature extraction process consists of three steps:" }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-95", "text": "7 http://nlp.cs.qc.cuny.edu/kbp/2010/ 1." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-96", "text": "Run Stanford CoreNLP with POS tagging and named entity recognition (Finkel et al., 2005) ; 2. Run dependency parsing on TAC with the Ensemble parser and on ClueWeb with MaltParser (Nivre et al., 2007) 8 ; and 3. Run a simple entity-linking system that utilizes NER results and string matching to identify mentions of Freebase entities (with types)." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-97", "text": "9 The output of this processing is a repository of structured objects (with POS tags, dependency parse, and entity types and mentions) for sentences from the training corpus." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-98", "text": "Specifically, for each pair of entity mentions (m 1 , m 2 ) in a sentence, we extract the following features F (m 1 , m 2 ): (1) the word sequence (including POS tags) between these mentions after normalizing entity mentions (e.g., replacing \"John Nolen\" with a place holder PER); if the sequence is longer than 6, we take the 3-word prefix and the 3-word suffix; (2) the dependency path between the mention pair." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-99", "text": "To normalize, in both features we use lemmas instead of surface forms." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-100", "text": "We discard features that occur in fewer than three mention pairs." 
}, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-101", "text": "----------------------------------" }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-102", "text": "**CROWD-SOURCED DATA**" }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-103", "text": "Crowd sourcing provides a cheap source of human labeling to improve the quality of our classifier." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-104", "text": "In this work, we specifically examine feedback on the result of distant supervision." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-105", "text": "Precisely, we construct the union of R + 1 \u222a . . . R + N from Section 3.1." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-106", "text": "We then solicit human labeling from Mechanical Turk (MTurk) while applying state-of-the-art quality control protocols following Gormley et al. (2010) and those in the MTurk manual." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-107", "text": "10 These quality-control protocols are critical to ensure high quality: spamming is common on MTurk and some turkers may not be as proficient or careful as expected." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-108", "text": "To combat this, we replicate each question three times and, following Gormley 8 We did not run Ensemble on ClueWeb because we had very few machines satisfying Ensemble's memory requirement." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-109", "text": "In contrast, MaltParser requires less memory and we could leverage Condor (Thain et al., 2005) to parse ClueWeb with MaltParser within several days (using about 50K CPU hours)." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-110", "text": "9 We experiment with a slightly more sophisticated entitylinking system as well, which resulted in higher overall quality." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-111", "text": "The results below are from the simple entity-linking system." 
}, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-112", "text": "10 http://mturkpublic.s3.amazonaws.com/docs/ MTURK_BP.pdf et al. (2010), plant gold-standard questions: each task consists of five yes/no questions, one of which comes from our gold-standard pool." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-113", "text": "11 By retaining only those answers that are consistent with this protocol, we are able to filter responses that were not answered with care or competency." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-114", "text": "We only use answers from workers who display overall high consistency with the gold standard (i.e., correctly answering at least 80% of the gold-standard questions)." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-115", "text": "(2011) and use an independent binary classifier for each individual relation; the intuition is that a pair of mentions (or entities) might participate in multiple target relations." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-116", "text": "We experimented with both protocols; since relation overlapping is rare for TAC-KBP and there was little difference in result quality, we focus on the binary-classification approach using training examples constructed as described in Section 3.1." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-117", "text": "----------------------------------" }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-118", "text": "**STATISTICAL MODELING ISSUES**" }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-119", "text": "We compensate for the different sizes of distant and human labeled examples by training an objective function that allows to tune the weight of human versus distant labeling." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-150", "text": "About 46% of our human labels are negative." 
}, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-126", "text": "Specifically, when using logistic regression to train relation extractors, increasing corpus size improves, consistently and significantly, the precision and recall produced by distant supervision, regardless of human feedback levels." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-127", "text": "Using the 11 We obtain the gold standard from a separate MTurk submission by taking examples that at least 10 out of 11 turkers answered yes, and then negate half of these examples by altering the relation names (e.g., spouse to sibling)." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-128", "text": "12 More details in our technical report (Zhang et al., 2012) ." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-129", "text": "methodology described in Section 3, human feedback has limited impact on the precision and recall produced from distant supervision by itself." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-130", "text": "----------------------------------" }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-131", "text": "**EVALUATION METRICS**" }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-132", "text": "Just as direct training data are scarce, ground truth for relation extraction is scarce as well." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-133", "text": "As a result, prior work mainly considers two types of evaluation methods: (1) randomly sample a small portion of predictions (e.g., top-k) and manually evaluate precision/recall; and (2) use a held-out portion of seed facts (usually Freebase) as a kind of \"distant\" ground truth." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-134", "text": "We replace manual evaluation with a standardized relation-extraction benchmark: TAC-KBP 2010." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-135", "text": "TAC-KBP asks for extractions of 46 relations on a given set of 100 entities." 
}, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-136", "text": "Interestingly, the Freebase held-out metric (Mintz et al., 2009; Hoffmann et al., 2011 ) turns out to be heavily biased toward distantly labeled data (e.g., increasing human feedback hurts precision; see Section 4.6)." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-137", "text": "----------------------------------" }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-138", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-139", "text": "Our first group of experiments use the 1.8M-doc TAC-KBP corpus for training." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-140", "text": "We exclude from it the 33K documents that contain query entities in the TAC-KBP metrics." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-141", "text": "There are two key parameters: the corpus size (#docs) M and human feedback budget (#examples) N ." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-142", "text": "We perform different levels of down-sampling on the training corpus." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-143", "text": "On TAC, we use subsets with M = 10 3 , 10 4 , 10 5 , and 10 6 documents respectively." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-144", "text": "For each value of M , we perform 30 independent trials of uniform sampling, with each trial resulting in a training corpus" }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-145", "text": "For each training corpus D M i , we perform distant supervision to train a set of logistic regression classifiers." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-146", "text": "From the full corpus, distant supervision creates around 72K training examples." 
}, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-147", "text": "To evaluate the impact of human feedback, we randomly sample 20K examples from the input corpus (we remove any portion of the corpus that is used in an evaluation)." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-148", "text": "Then, we ask three different crowd workers to label each example as either positive or negative using the procedure described in Section 3.3." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-151", "text": "Denote by N the number of examples that Figure 2 : Impact of input sizes under the TAC-KBP metric, which uses documents mentioning 100 predefined entities as testing corpus with entity-level ground truth." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-152", "text": "We vary the sizes of the training corpus and human feedback while measuring the scores (F1, recall, and precision) on the TAC-KBP benchmark." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-153", "text": "we want to incorporate human feedback for; we vary N in the range of 0, 10, 10 2 , 10 3 , 10 4 , and 2 \u00d7 10 4 ." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-154", "text": "For each selected corpus and value of N , we perform without-replacement sampling from examples of this corpus to select feedback for up to N examples." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-155", "text": "In our experiments, we found that on average an M -doc corpus contains about 0.04M distant labels, out of which 0.01M have human feedback." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-156", "text": "After incorporating human feedback, we evaluate the relation extractors on the TAC-KBP benchmark." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-157", "text": "We then compute the average F1, recall, and precision scores among all trials for each metric and each (M,N) pair." 
}, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-158", "text": "Besides the KBP metrics, we also evaluate each (M,N) pair using Freebase held-out data." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-159", "text": "Furthermore, we experiment with a much larger corpus: ClueWeb09." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-160", "text": "On ClueWeb09, we vary M over 10 3 , . . . , 10 8 ." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-161", "text": "Using the same metrics, we show at a larger scale that increasing corpus size can significantly improve both precision and recall." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-162", "text": "----------------------------------" }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-163", "text": "**OVERALL IMPACT OF INPUT SIZES**" }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-164", "text": "We first present our experiment results on the TAC corpus." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-165", "text": "As shown in Figure 2 , the F1 graph closely tracks the recall graph, which supports our earlier claim that quality is recall gated (Section 1)." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-166", "text": "While increasing the corpus size improves F1 at a roughly log-linear rate, human feedback has little impact until both corpus size and human feedback size approch maximum M, N values." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-167", "text": "Table 1 shows the quality comparisons with minimum/maximum values of M and N ." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-168", "text": "13 We observe that increasing the corpus size significant improves per-relation recall 13 When the corpus size is small, the total number of examples with feedback can be smaller than the budget size N -for example, when M = 10 3 there are on average 10 examples with feedback even if N = 10 4 ." 
}, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-169", "text": "----------------------------------" }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-170", "text": "**IMPACT OF CORPUS SIZE**" }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-171", "text": "In Figure 3 (a) we plot a projection of the graphs in Figure 2 to show the impact of corpus size on distant-supervision quality." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-172", "text": "The two curves correspond to when there is no human feedback and when we use all applicable human feedback." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-173", "text": "The fact that the two curves almost overlap indicates that human feedback had little impact on precision or recall." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-176", "text": "To measure the statistical significance of changes in F1, we calculate t-test results to compare adjacent corpus size levels given each fixed human feedback level." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-177", "text": "As shown in Table 2 (a), increasing the corpus size by a factor of 10 consistently and significantly improves F1." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-178", "text": "Although precision decreases as we use larger corpora, the decreasing trend is sub-log-linear and stops at around 100K docs." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-179", "text": "On the other hand, recall and F1 keep increasing at a log-linear rate." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-180", "text": "----------------------------------" }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-181", "text": "**IMPACT OF HUMAN FEEDBACK**" }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-182", "text": "Figure 3(b) provides another perspective on the results under the TAC metric: We fix a corpus size and plot the F1, recall, and precision as functions of human-feedback amount." 
}, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-183", "text": "Confirming the trend in Figure 2 , we see that human feedback has little impact on precision or recall with both corpus sizes." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-184", "text": "We calculate t-tests to compare adjacent human feedback levels given each fixed corpus size level." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-185", "text": "Table 2 (b)'s last row reports the comparison, for various corpus sizes (and, hence, number of distant labels), of (i) using no human feedback and (ii) using all of the human feedback we collected." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-186", "text": "When the corpus size is small (fewer than 10 5 docs), human feedback has no statistically significant impact on F1." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-187", "text": "The locations of +'s suggest that the influence of human feedback becomes notable only when the corpus is very large (say with 10 6 docs)." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-188", "text": "However, comparing the slopes of the curves in Figure 3 tably improve precision on either the full corpus or on a small 1K-doc corpus." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-189", "text": "To assess the quality of human labels, we train extraction models with human labels only (on examples obtained from distant supervision)." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-190", "text": "We vary the amount of human labels and plot the F1 changes in Figure 4 ." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-191", "text": "Although the F1 improves as we use more human labels, the best model has roughly the same performance as those trained from distant labels (with or without human labels)." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-192", "text": "This suggests that the accuracy of human labels is not substantially better than distant labels." 
}, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-193", "text": "----------------------------------" }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-194", "text": "**FREEBASE HELD-OUT METRIC**" }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-195", "text": "In addition to the TAC-KBP benchmark, we also follow prior work (Mintz et al., 2009; Hoffmann et al., 2011) and measure the quality using held-out data from Freebase." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-196", "text": "We randomly partition both Freebase and the corpus into two halves." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-197", "text": "One database-corpus pair is used for training and the other pair for testing." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-198", "text": "We evaluate the precision over the 10 3 highest-probability predictions on the test set." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-199", "text": "In Figure 5 , we vary the size of the corpus in the train pair and the number of human labels; the precision reaches a dramatic peak when we the corpus size is above 10 5 and uses little human feedback." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-200", "text": "This suggests that this Freebase held-out metric is biased toward solely relying on distant labels alone." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-201", "text": "----------------------------------" }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-202", "text": "**WEB-SCALE CORPORA**" }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-203", "text": "To study how a Web corpus impacts distantsupervision quality, we select the first 100M English webpages from the ClueWeb09 dataset and measure how distant-supervision quality changes as we vary the number of webpages used." 
}, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-204", "text": "As shown in Figure 6 , increasing the corpus size improves F1 up to 10 7 docs (p = 0.05), while at 10 8 the two-tailed significance test reports no significant impact on F1 (p = 0.05)." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-205", "text": "The dip in precision in Figure 6 from 10 6 to either 10 7 or 10 8 is significant (p = 0.05), and it is interesting future work to perform a detailed error analysis." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-206", "text": "Recall from Section 3 that to preprocess ClueWeb we use MaltParser instead of Ensemble." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-207", "text": "Thus, the F1 scores in Figure 6 are not comparable to those from the TAC training corpus." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-208", "text": "----------------------------------" }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-209", "text": "**DISCUSSION AND CONCLUSION**" }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-210", "text": "We study how the size of two types of cheaply available resources impact the precision and recall of distant supervision: (1) an unlabeled text corpus from which distantly labeled training examples can be extracted, and (2) crowd-sourced labels on training examples." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-211", "text": "We found that text corpus size has a stronger impact on precision and recall than human feedback." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-212", "text": "We observed that distant-supervision systems are often recall gated; thus, to improve distantsupervision quality, one should first try to enlarge the input training corpus and then increase precision." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-213", "text": "It was initially counter-intuitive to us that human labels did not have a large impact on precision." 
}, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-214", "text": "One reason is that human labels acquired from crowdsourcing have comparable noise level as distant labels -as shown by Figure 4 ." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-215", "text": "Thus, techniques that improve the accuracy of crowd-sourced answers are an interesting direction for future work." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-216", "text": "We used a particular form of human input (yes/no votes on distant labels) and a particular statistical model to incorporate this information (logistic regression)." }, { "sent_id": "2bb41cea97a0375f67eab3a77c3a97-C001-217", "text": "It is interesting future work to study other types of human input (e.g., new examples or features) and more sophisticated techniques for incorporating human input, as well as machine learning methods that explicitly model feature interactions." } ], "y": { "@MOT@": { "gold_contexts": [ [ "2bb41cea97a0375f67eab3a77c3a97-C001-13", "2bb41cea97a0375f67eab3a77c3a97-C001-14" ] ], "cite_sentences": [ "2bb41cea97a0375f67eab3a77c3a97-C001-14" ] }, "@BACK@": { "gold_contexts": [ [ "2bb41cea97a0375f67eab3a77c3a97-C001-32" ], [ "2bb41cea97a0375f67eab3a77c3a97-C001-51" ] ], "cite_sentences": [ "2bb41cea97a0375f67eab3a77c3a97-C001-32", "2bb41cea97a0375f67eab3a77c3a97-C001-51" ] }, "@SIM@": { "gold_contexts": [ [ "2bb41cea97a0375f67eab3a77c3a97-C001-69" ], [ "2bb41cea97a0375f67eab3a77c3a97-C001-77" ], [ "2bb41cea97a0375f67eab3a77c3a97-C001-86" ], [ "2bb41cea97a0375f67eab3a77c3a97-C001-87" ], [ "2bb41cea97a0375f67eab3a77c3a97-C001-91" ] ], "cite_sentences": [ "2bb41cea97a0375f67eab3a77c3a97-C001-69", "2bb41cea97a0375f67eab3a77c3a97-C001-77", "2bb41cea97a0375f67eab3a77c3a97-C001-86", "2bb41cea97a0375f67eab3a77c3a97-C001-87", "2bb41cea97a0375f67eab3a77c3a97-C001-91" ] }, "@USE@": { "gold_contexts": [ [ "2bb41cea97a0375f67eab3a77c3a97-C001-77" ], [ "2bb41cea97a0375f67eab3a77c3a97-C001-86" ], [ 
"2bb41cea97a0375f67eab3a77c3a97-C001-87" ], [ "2bb41cea97a0375f67eab3a77c3a97-C001-91" ] ], "cite_sentences": [ "2bb41cea97a0375f67eab3a77c3a97-C001-77", "2bb41cea97a0375f67eab3a77c3a97-C001-86", "2bb41cea97a0375f67eab3a77c3a97-C001-87", "2bb41cea97a0375f67eab3a77c3a97-C001-91" ] }, "@DIF@": { "gold_contexts": [ [ "2bb41cea97a0375f67eab3a77c3a97-C001-136" ], [ "2bb41cea97a0375f67eab3a77c3a97-C001-195" ] ], "cite_sentences": [ "2bb41cea97a0375f67eab3a77c3a97-C001-136", "2bb41cea97a0375f67eab3a77c3a97-C001-195" ] } } }, "ABC_4f75f73b4eac8aecdde9312a846a1d_16": { "x": [ { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-2", "text": "This paper presents our English-German Automatic Post-Editing (APE) system submitted to the APE Task organized at WMT 2018." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-3", "text": "The proposed model is an extension of the transformer architecture: two separate self-attention-based encoders encode the machine translation output (mt) and the source (src), followed by a joint encoder that attends over a combination of these two encoded sequences (enc src and enc mt ) for generating the post-edited sentence." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-4", "text": "We compare this multi-source architecture (i.e, {src, mt} \u2192 pe) to a monolingual transformer (i.e., mt \u2192 pe) model and an ensemble combining the multi-source {src, mt} \u2192 pe and singlesource mt \u2192 pe models." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-5", "text": "For both the PBSMT and the NMT task, the ensemble yields the best results, followed by the multi-source model and last the singlesource approach." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-6", "text": "Our best model, the ensemble, achieves a BLEU score of 66.16 and 74.22 for the PBSMT and NMT task, respectively." 
}, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-7", "text": "----------------------------------" }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-8", "text": "**INTRODUCTION & RELATED WORK**" }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-9", "text": "The ultimate goal of machine translation (MT) is to provide fully automatic publishable quality translations." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-10", "text": "However, state-of-the-art MT systems often fail to deliver this; translations produced by MT systems contain different errors and require human interventions to post-edit the translations." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-11", "text": "Nevertheless, MT has become a standard in the translation industry as post-editing on MT output is often faster and cheaper than performing human translation from scratch." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-12", "text": "APE is a method that aims to automatically correct errors made by MT systems before performing actual human-post-editing (PE) (Knight and Chander, 1994) , thereby reducing the translators' workload and increasing productivity (Parra Escart\u00edn and Arcedillo, 2015b,a; Pal et al., 2016a) ." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-13", "text": "Various automatic and semi-automatic techniques have been developed to auto-correct repetitive errors (Roturier, 2009; TAUS/CNGL Report, 2010) ." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-14", "text": "The advantage of APE lies in its capability to adapt to any black-box (first-stage) MT engine; i.e., upon availability of human-corrected postedited data, no incremental training or full retraining of the first-stage MT system is required to improve the overall translation quality." 
}, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-15", "text": "APE can therefore be viewed as a 2 nd -stage MT system, translating predictable error patterns in MT output to their corresponding corrections." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-16", "text": "APE training data minimally involves MT output (mt) and the human-post-edited (pe) version of mt, but may additionally make use of the source (src)." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-17", "text": "A more detailed motivation on APE can be found in Bojar et al. (2015 ." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-18", "text": "Based on the training process, APE systems can be categorized as either single-source (mt \u2192 pe) or multi-source ({src, mt} \u2192 pe) approaches." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-19", "text": "In general, the field of APE covers a wide methodological range, including SMTbased approaches (Simard et al., 2007a,b; Lagarda et al., 2009; Rosa et al., 2012; Pal et al., 2016c; Chatterjee et al., 2017b) , and neural APE (Pal et al., 2016b; Junczys-Dowmunt and Grundkiewicz, 2016; Pal et al., 2017) based on neural machine translation (NMT)." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-20", "text": "Some of the stateof-the-art multi-source approaches, both statistical (B\u00e9chara et al., 2011; and recently neural (Libovick\u00fd et al., 2016; Chatterjee et al., 2017a; Junczys-Dowmunt and Grundkiewicz, 2016; Varis and Bojar, 2017) , explore learning from {src, mt} \u2192 pe (multi-source, MS) to take advantage of the dependencies of translation errors in mt originating from src." 
}, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-21", "text": "Exploiting source information in multi-source neural APE can be configured either by using a single encoder that encodes the concatenation of src and mt (Niehues et al., 2016) or by using two separate encoders for src and mt and passing the concatenation of both encoders' final states to the decoder (Libovick\u00fd et al., 2016) ." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-22", "text": "A few approaches to multi-source neural APE have been proposed in the WMT-2017 APE shared task." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-23", "text": "Junczys-Dowmunt and Grundkiewicz (2017) explore different combinations of attention mechanisms including soft attention and hard monotonic attention on an end-to-end neural APE model that combines both mt and src in a single neural architecture." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-24", "text": "Chatterjee et al. (2017a) extend the two-encoder architecture of multi-source models presented in Libovick\u00fd et al. (2016) ." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-25", "text": "In their extension each encoder concatenates both contexts having their own attention layer that is used to compute the weighted context of src and mt." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-26", "text": "Finally, a linear transformation is applied on the concatenation of both weighted contexts." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-27", "text": "Varis and Bojar (2017) implement and compare two multisource architectures: In the first setup, they use a single encoder with concatenation of src and mt sentences, and in the second setup, they use two character-level encoders for mt and src, separately, along with a character-level decoder." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-28", "text": "The initial state of this decoder is a weighted combination of the final states of the two encoders." 
}, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-29", "text": "Intuitively, such an integration of sourcelanguage information in APE should be useful in conveying the context information to improve the APE performance." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-30", "text": "To provide the awareness of errors in mt originating from src, the transformer architecture (Vaswani et al., 2017) , which is built solely upon attention mechanisms (Bahdanau et al., 2015) , makes it possible to model dependencies without regard to their distance in the input or output sequences and also captures global dependencies between input and output (for our case src, mt, and pe)." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-31", "text": "The transformer architecture replaces recurrence and convolutions by using positional encodings on both the input and output sequences." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-32", "text": "The encoder and decoder both use multi-head (facilitating parallel computations) self-attention to compute representations of their corresponding inputs, and also compute multi-head vanilla-attentions between encoder and decoder representations." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-33", "text": "Our APE system extends this transformer-based NMT architecture (Vaswani et al., 2017) by using two encoders, a joint encoder, and a single decoder." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-34", "text": "Our model concatenates two separate selfattention-based encoders (enc src and enc mt ) and passes this sequence through another self-attended joint encoder (enc src,mt ) to ensure capturing dependencies between src and mt." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-35", "text": "Finally, this joint encoder is fed to the decoder which follows a similar architecture as described in Vaswani et al. (2017) ." 
}, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-36", "text": "The entire model is optimized as a single end-to-end transformer network." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-37", "text": "2 Transformer-Based Multi-Source APE MT errors originating from the input source sentences suggest that APE systems should leverage information from both the src and mt, instead of considering mt in isolation." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-38", "text": "This can help the model to disambiguate corrections applied at every time step." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-39", "text": "Generating the pe output from mt is greatly facilitated by the availability of src." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-40", "text": "To achieve benefits from both single-source (mt \u2192 pe) and multi-source ({src, mt} \u2192 pe) APEs, our primary submission in the WMT 2018 shared task is an ensemble of these two models." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-41", "text": "Transformer-based models are currently providing state-of-the-art performance in MT; hence, we want to explore a similar architecture for this year's APE task." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-42", "text": "We extend the transformer architecture to investigate how efficient this approach is in a multi-source scenario." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-43", "text": "In a MT task, it was already shown that a transformer can learn long-range dependencies." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-44", "text": "Therefore, we explore if we can leverage information from src and mt via a joint encoder through self-attention (see Section 2.2) to provide dependencies between src-mt that are then projected to the pe." 
}, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-45", "text": "To investigate this, we implement and evaluate three different models: a single-source approach, a multi-source approach, and an ensemble of both, described in more detail below." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-46", "text": "----------------------------------" }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-47", "text": "**SINGLE-SOURCE TRANSFORMER FOR APE**" }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-48", "text": "(mt \u2192 pe)" }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-49", "text": "Our single-source model (SS) is based on an encoder-decoder-based transformer architecture (Vaswani et al., 2017) ." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-50", "text": "Transformer models can replace sequence-aligned recurrence entirely and follow three types of multi-head attention: encoder-decoder attention (also known as vanilla Figure 1 : Multi-source transformer-based APE attention), encoder self-attention, and masked decoder self-attention." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-51", "text": "Since for multi-head attention each head uses different linear transformations, it can learn these separate relationships in parallel, thereby improving learning time." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-52", "text": "----------------------------------" }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-53", "text": "**MULTI-SOURCE TRANSFORMER FOR APE ({SRC, MT} \u2192 PE)**" }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-54", "text": "For our multi-source model (MS), we propose a novel joint transformer model (cf. Figure 1) , which combines the encodings of src and mt and attends over a combination of both sequences while generating the post-edited sentence." 
}, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-55", "text": "Apart from enc src and enc mt , each of which is equivalent to the original transformer's encoder (Vaswani et al., 2017) , we use a joint encoder with an equivalent architecture, to maintain the homogeneity of the transformer model." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-56", "text": "For this, we extend Vaswani et al. (2017) by introducing an additional identical encoding block by which both the enc src and the enc mt encoders communicate with the decoder." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-57", "text": "Our multi-source neural APE computes intermediate states enc src and enc mt for the two encoders, enc src,mt for their combination, and dec pe for the decoder in sequence-to-sequence modeling." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-58", "text": "One self-attended encoder for src maps s = (s 1 , s 2 , ..., s k ) into a sequence of continuous representations, enc src = (e 1 , e 2 , ..., e k ), and a second encoder for mt, m = (m 1 , m 2 , ..., m l ), returns another sequence of continuous representations, enc mt = (e \u2032 1 , e \u2032 2 , ..., e \u2032 l )." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-59", "text": "The self-attended joint encoder (cf. Figure 1 ) then receives the concatenation of enc src and enc mt , enc concat = [enc src , enc mt ] as an input, and passes it through the stack of 6 layers, with residual connections, a self-attention and a position-wise fully connected feed-forward neural network." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-60", "text": "As a result, the joint encoder produces a final representation (enc src,mt ) for both src and mt." 
}, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-61", "text": "Self-attention at this point provides the advantage of aggregating information from all of the words, including src and mt, and successively generates a new representation per word informed by the entire src and mt context." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-62", "text": "The decoder generates the pe output in sequence, dec pe = (p 1 , p 2 , . . . , p n ), one word at a time from left to right by attending previously generated words as well as the final representations (enc src,mt ) generated by the encoder." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-63", "text": "----------------------------------" }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-64", "text": "**ENSEMBLE**" }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-65", "text": "In order to leverage the network architecture for both single-source and multi-source APE as discussed above, we decided to ensemble several expert neural models." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-66", "text": "Each model is averaged using the 5 best saved checkpoints, which generate different translation outputs." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-67", "text": "Taking into account all these generated translation outputs, we implement an ensemble technique based on the frequency of occurrence of the output words." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-68", "text": "Corresponding to each input word, we calculate the most frequent occurrence of the output word from all the generated translation outputs." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-69", "text": "For the two different APE tasks, we ensemble the following models:" }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-70", "text": "\u2022 PBSMT task: We ensemble a SS (mt \u2192 pe) and a MS ({src, mt} \u2192 pe) average model." 
}, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-71", "text": "\u2022 NMT task: We ensemble two average SS (mt \u2192 pe) and MS ({src, mt} \u2192 pe) models, together with a SS and a MS model that are fine-tuned on a subset of the training set (cf." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-72", "text": "Section 3.3.2)." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-73", "text": "----------------------------------" }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-74", "text": "**EXPERIMENTS**" }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-75", "text": "In our experiment we investigate (1) how well the transformer-based APE architecture performs in general, (2) if our multi-source architecture using the additional joint encoder improves the performance over a single-source architecture, and (3) if ensembling of single-source and multi-source architectures facilitates APE even further." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-76", "text": "----------------------------------" }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-77", "text": "**DATA**" }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-78", "text": "Since this year's WMT 2018 APE task is divided into two sub-tasks, differ-ent datasets are provided for each task: for the PB-SMT task, there is a total of 23K English-German APE data samples (11K from WMT 2016 and 12K from WMT 2017) ." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-79", "text": "For the NMT task, 13,442 samples of English-German APE data are provided." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-80", "text": "All released APE data consists of EnglishGerman triplets containing source English text (src) from the IT domain, the corresponding German translations (mt) from a first stage MT system, and the corresponding human-post-edited version (pe), all of them already tokenized." 
}, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-81", "text": "As this released APE dataset is small in size (see Table 1 ), additional resources are also available: first, the 'artificial training data' (Junczys-Dowmunt and Grundkiewicz, 2016) containing 4.5M sentences, 4M of which are weakly similar to the WMT 2016 training data, while 500K show very similar TER statistics; and second, the synthetic 'eS-CAPE' APE corpus , consisting of more than 7M triples for both NMT and PBSMT." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-82", "text": "Table 1 presents the statistics of the released data for the English-German APE Task organized in WMT 2018." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-83", "text": "These datasets, except for the eSCAPE corpus, do not require any preprocessing in terms of encoding or alignment." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-84", "text": "For cleaning the noisy eSCAPE dataset containing many unrelated language words (e.g. Chinese), we perform the following two steps: (i) we use the cleaning process described in Pal et al. (2015) , and (ii) we execute the Moses (Koehn et al., 2007) corpus cleaning scripts with minimum and maximum number of tokens set to 1 and 80, respectively." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-85", "text": "After cleaning, we use the Moses tokenizer to tokenize the eSCAPE corpus." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-86", "text": "To handle outof-vocabulary words, words are preprocessed into subword units (Sennrich et al., 2016 ) using bytepair encoding (BPE)." 
}, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-87", "text": "----------------------------------" }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-88", "text": "**HYPER-PARAMETER SETTINGS**" }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-89", "text": "For {src, mt} \u2192 pe, both the self-attended encoders, the joint encoder, and the decoder are composed of a stack of N = 6 identical layers followed by layer normalization." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-90", "text": "Each layer again consists of two sub-layers and a residual connection (He et al., 2016) around each of the two sublayers." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-91", "text": "During training, we employ label smoothing of value \u03f5 ls = 0.1." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-92", "text": "The output dimension produced by all sub-layers and embedding layers is defined as d model = 256." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-93", "text": "All dropout values in the network are set to 0.2." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-94", "text": "Each encoder and decoder contains a fully connected feed-forward network having dimensionality d model = 256 for the input and output and dimensionality d f f = 1024 for the inner layer." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-95", "text": "This is a similar setting to Vaswani et al. (2017) 's C \u2212 model 1 ." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-96", "text": "For the scaled dotproduct attention, the input consists of queries and keys of dimension d k , and values of dimension d v ." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-97", "text": "As multi-head attention parameters, we employ h = 8 for parallel attention layers, or heads." 
}, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-98", "text": "For each of these we use a dimensional-" }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-99", "text": "For optimization, we use the Adam optimizer (Kingma and Ba, 2015) with \u03b2 1 = 0.9, \u03b2 2 = 0.98 and \u03f5 = 10 \u22129 ." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-100", "text": "The learning rate is varied throughout the training process, first increasing linearly for the first training steps warmup steps = 4000 and then adjusted as described in (Vaswani et al., 2017) ." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-101", "text": "At training time, the batch size is set to 32 samples, with a maximum sentence length of 80 subwords, and a vocabulary of the 50K most frequent subwords." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-102", "text": "After each epoch, the training data is shuffled." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-103", "text": "For encoding the word order, our model uses learned positional embeddings (Gehring et al., 2017) , since Vaswani et al. (2017) reported nearly identical results to sinusoidal encodings." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-104", "text": "After finishing training, we save the 5 best checkpoints saved at each epoch." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-105", "text": "Finally, we use a single model obtained by averaging the last 5 checkpoints." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-106", "text": "During decoding, we perform greedy-search-based decoding." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-107", "text": "We follow a similar hyper-parameter setup for mt \u2192 pe." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-108", "text": "The total number of parameters for our {src, mt} \u2192 pe and mt \u2192 pe model is 46 \u00d7 10 6 and 28 \u00d7 10 6 , respectively." 
}, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-109", "text": "----------------------------------" }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-110", "text": "**EXPERIMENT SETUP**" }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-111", "text": "In this section, we present the training process, using the above datasets, to train mt \u2192 pe, {src, mt} \u2192 pe, and ensemble models for both PBSMT and NMT." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-112", "text": "----------------------------------" }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-113", "text": "**PBSMT TASK**" }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-114", "text": "For PBSMT, we first train both our SS and MS systems with the cleaned eSCAPE corpus for 3 epochs." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-115", "text": "We then perform transfer learning with 4M artificial data for 7 epochs." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-116", "text": "Afterwards, finetuning is performed using the 500K artificial and 23K real PE training data for another 20 epochs." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-117", "text": "Furthermore, we generate an ensemble model, by averaging the 5 best checkpoints of SS with the 5 best checkpoints of MS." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-118", "text": "We use the WMT 2016 development data (dev2016) containing 1,000 triplets to validate the model during training." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-119", "text": "To test our system performance, we use the WMT 2016 and 2017 test data (test2016, test2017), each containing 2,000 triplets." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-120", "text": "Furthermore, we report the results of the submitted ensemble model on test2018." 
}, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-121", "text": "----------------------------------" }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-122", "text": "**NMT TASK**" }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-123", "text": "Initial tests for pre-training our NMT model on the NMT eSCAPE data showed no performance improvements." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-124", "text": "Therefore, we use the PBSMT SS and MS models as a basis for the NMT task." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-125", "text": "We use the PBSMT models after training them on the eSCAPE corpus, the 4M artificial data and the 500K + 23K train sets of WMT 16 and 17." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-126", "text": "These SMT-based models are then fine-tuned using the WMT 2018 NMT APE data (train18) for 60 epochs." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-127", "text": "Afterwards, we perform an additional finetuning step towards the dev18/test18 dataset: For this, we extract sentences of train18 that are similar to the sentences contained in dev18/test18 and fine-train for another 15 epochs on this subset of train18, which we call fine-tune18." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-128", "text": "As a similarity measure we use the cosine similarity between the train src and mt segments and the test src and mt segments, respectively." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-129", "text": "These cosine similarities for src and mt are then simply multiplied to achieve an overall similarity measure." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-130", "text": "Our fine-tuning dataset contains only sentences with an overall similarity of at least 0.9." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-131", "text": "Last, two separate ensemble models are created." 
}, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-132", "text": "One consists of only the non-fine-tuned SS and MS models, and one ensembles the SS and MS models in both fine-tuned and non-fine-tuned variants." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-133", "text": "Both ensembles are created by averaging over the 5 best checkpoints of each sub-model." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-134", "text": "We report the results of all created models for the dev18 NMT dataset, and additionally those of the submitted overall ensemble model on test18." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-135", "text": "Table 2 presents the results for the PBSMT APE task (cf. 3.3.1), where two different transformerbased models, one ensemble of these models and the baseline BLEU scores are shown." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-136", "text": "The baseline here refers to the original MT output evaluated with respect to the corresponding PE translation." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-137", "text": "All models yield statistically significant results (p < 0.001) over this baseline." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-138", "text": "M S avg also provides statistically significant improvement over SS avg ." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-139", "text": "For this and all following significance tests we employ the method by Clark et al. (2011) 2 ." 
}, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-140", "text": "----------------------------------" }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-141", "text": "**RESULTS AND DISCUSSION**" }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-142", "text": "Generally, reasons for the good performance of our transformer-based MS architecture in comparison to the SS approach for PBSMT-based APE could be the positional encoding that injects information about the relative or absolute position of the tokens in the sequence." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-143", "text": "This might help to handle word order errors in mt for a given src context." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-144", "text": "Another possible explanation lies in the self-attention mechanism, which handles local word dependencies for src, mt, and pe." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-145", "text": "After the individual dependencies are learned by the two encoders' self-attention mechanisms, another level of self-attention is performed that can jointly learn from both src and mt using our joint encoder, thereby informing the decoder about the long-range dependencies between the words within both src and mt." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-146", "text": "Compared to RNNs, we believe that this technique can better convey source information via mt to the decoder." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-147", "text": "The ensemble model then leverages the advantages of both our SS and MS approaches to further improve the results." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-148", "text": "The results for our transformer-based architec- Table 3 : Evaluation result of WMT 2018 NMT task for all trained models." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-149", "text": "ture for the NMT task are shown in Table 3 ." 
}, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-150", "text": "As can be seen, the baseline NMT system performs best, followed by the ensemble models, then the multisource architectures and lastly the single-source approach." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-151", "text": "These differences between the three approaches, ensemble, MS, and SS, are all statistically significant." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-152", "text": "Fine-tuning provides some small, albeit insignificant, improvements over the non-fine-tuned versions." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-153", "text": "While none of our architectures perform better than the baseline MT system for the NMT task, we clearly see that the multi-source approach helps, and that ensembling of different SS and MS models further increases the performance." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-154", "text": "These results are in line with our expectations, because intuitively, inspecting both src and mt should help detect and correct common errors." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-155", "text": "However, we are unsure why all models did not improve over the baseline, which could have been achieved by simply copying the mt." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-156", "text": "One reason might be the small amount of PE data, which comprises only 13K samples; this could also explain why the simple fine-tuning approach already leads to slightly higher BLEU scores." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-157", "text": "However, further human evaluation is necessary to better understand what our model is doing for the neural APE task and why it remains approximately 0.5 BLEU points below the baseline." 
}, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-158", "text": "----------------------------------" }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-159", "text": "**CONCLUSIONS AND FUTURE WORK**" }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-160", "text": "In this paper, we investigated a novel transformerbased multi-source APE approach that jointly attends over a combination of src and mt to capture dependencies between the two." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-161", "text": "This architecture yields statistically significant improvements over single-source transformer-based models." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-162", "text": "An ensemble of both variants increases the performance further." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-163", "text": "For the PBSMT task, the baseline MT system was outperformed by 3.2 BLEU points, while the NMT baseline remains 0.51 BLEU points better than our APE approach on the 2018 test set." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-164", "text": "In the future, we will investigate if the performance of each system can be improved by using a different hyper-parameter setup." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-165", "text": "Unfortunately, we could not test either the 'big' or the 'base' hyper-parameter configuration in Vaswani et al. (2017) due to unavailable computing resources at the time of submission." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-166", "text": "As additional future work, we would like to explore whether using re-ranking and ensembling of different neural APEs helps to improve the performance further." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-167", "text": "Moreover, we will incorporate word-level quality estimation features of mt into the encoding layer." 
}, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-168", "text": "Lastly, we will evaluate if our model indeed is able to better handle word order errors and to capture longrange dependencies, as we expect." }, { "sent_id": "4f75f73b4eac8aecdde9312a846a1d-C001-169", "text": "Furthermore, we will analyze if adapting the learning rate to the size of the datasets used during training increases the performance compared to the currently used fixed learning rate initialization of 0.001." } ], "y": { "@BACK@": { "gold_contexts": [ [ "4f75f73b4eac8aecdde9312a846a1d-C001-30" ] ], "cite_sentences": [ "4f75f73b4eac8aecdde9312a846a1d-C001-30" ] }, "@DIF@": { "gold_contexts": [ [ "4f75f73b4eac8aecdde9312a846a1d-C001-33" ], [ "4f75f73b4eac8aecdde9312a846a1d-C001-55" ], [ "4f75f73b4eac8aecdde9312a846a1d-C001-56" ], [ "4f75f73b4eac8aecdde9312a846a1d-C001-165" ] ], "cite_sentences": [ "4f75f73b4eac8aecdde9312a846a1d-C001-33", "4f75f73b4eac8aecdde9312a846a1d-C001-55", "4f75f73b4eac8aecdde9312a846a1d-C001-56", "4f75f73b4eac8aecdde9312a846a1d-C001-165" ] }, "@EXT@": { "gold_contexts": [ [ "4f75f73b4eac8aecdde9312a846a1d-C001-33" ], [ "4f75f73b4eac8aecdde9312a846a1d-C001-49" ], [ "4f75f73b4eac8aecdde9312a846a1d-C001-55" ], [ "4f75f73b4eac8aecdde9312a846a1d-C001-56" ] ], "cite_sentences": [ "4f75f73b4eac8aecdde9312a846a1d-C001-33", "4f75f73b4eac8aecdde9312a846a1d-C001-49", "4f75f73b4eac8aecdde9312a846a1d-C001-55", "4f75f73b4eac8aecdde9312a846a1d-C001-56" ] }, "@SIM@": { "gold_contexts": [ [ "4f75f73b4eac8aecdde9312a846a1d-C001-35" ], [ "4f75f73b4eac8aecdde9312a846a1d-C001-95" ], [ "4f75f73b4eac8aecdde9312a846a1d-C001-100" ], [ "4f75f73b4eac8aecdde9312a846a1d-C001-103" ] ], "cite_sentences": [ "4f75f73b4eac8aecdde9312a846a1d-C001-35", "4f75f73b4eac8aecdde9312a846a1d-C001-95", "4f75f73b4eac8aecdde9312a846a1d-C001-100", "4f75f73b4eac8aecdde9312a846a1d-C001-103" ] }, "@USE@": { "gold_contexts": [ [ "4f75f73b4eac8aecdde9312a846a1d-C001-35" ] ], "cite_sentences": [ 
"4f75f73b4eac8aecdde9312a846a1d-C001-35" ] } } }, "ABC_c5a10f46c253f0da005622661b12a1_16": { "x": [ { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-2", "text": "The task of event trigger labeling is typically addressed in the standard supervised setting: triggers for each target event type are annotated as training data, based on annotation guidelines." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-3", "text": "We propose an alternative approach, which takes the example trigger terms mentioned in the guidelines as seeds, and then applies an eventindependent similarity-based classifier for trigger labeling." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-4", "text": "This way we can skip manual annotation for new event types, while requiring only minimal annotated training data for few example events at system setup." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-5", "text": "Our method is evaluated on the ACE-2005 dataset, achieving 5.7% F 1 improvement over a state-of-the-art supervised system which uses the full training data." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-6", "text": "----------------------------------" }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-8", "text": "Event trigger labeling is the task of identifying the main word tokens that express mentions of prespecified event types in running text." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-9", "text": "For example, in \"20 people were wounded in Tuesday's airport blast\", \"wounded\" is a trigger of an Injure event and \"blast\" is a trigger of an Attack." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-10", "text": "The task both detects trigger tokens and classifies them to appropriate event types." 
}, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-11", "text": "While this task is often a component within the broader event extraction task, labeling both triggers and arguments, this paper focuses on trigger labeling." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-12", "text": "Most state-of-the-art event trigger labeling approaches (Ji and Grishman, 2008; Liao and Grishman, 2010b; Hong et al., 2011; Li et al., 2013) follow the standard supervised learning paradigm." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-13", "text": "For each event type, experts first write annotation guidelines." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-14", "text": "Then, annotators follow them to label event triggers in a large dataset." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-15", "text": "Finally, a classifier is trained over the annotated triggers to label the target events." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-16", "text": "The supervised paradigm requires major human efforts both in producing high-quality guidelines and in dataset annotation for each new event type." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-17", "text": "Given the rich information embedded in the guidelines, we raise in this paper the following research question: how well can we perform by leveraging only the lexical knowledge already available in quality guidelines for new event types, without requiring annotated training data for them?" }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-18", "text": "To address this question, we propose a seedbased approach for the trigger labeling task (Section 2)." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-85", "text": "4" }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-19", "text": "Given the description for a new event type, which contains some examples of triggers, we first collect these triggers into a list of seeds." 
}, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-20", "text": "Then, at the labeling phase, we consider each text token as a candidate for a trigger and assess its similarity to the seed list." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-21", "text": "In the above example, given seeds such as \"explosion\" and \"fire\" for the Attack event type, we identify that the candidate token \"blast\" is a hyponym of \"explosion\" and synonym of \"fire\" and infer that \"blast\" is a likely Attack trigger." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-22", "text": "In our method, such similarity indicators are encoded as a small set of event-independent classification features, based on lexical matches and external resources like WordNet." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-23", "text": "Using eventindependent features allows us to train the system only once, at system setup phase, requiring annotated triggers in a training set for just a few preselected event types." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-24", "text": "Then, whenever a new event type is introduced for labeling, we only need to collect a seed list for it from its description, and provide it as input to the system." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-25", "text": "We developed a seed-based system (Section 3), based on a state-of-the-art fully-supervised event extraction system (Li et al., 2013) ." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-26", "text": "When evaluated on the ACE-2005 dataset, 1 our system outperforms the fully-supervised one (Section 4), even though we don't utilize any annotated triggers of the test events during the labeling phase, and only Figure 1 : Flow of the seed-based approach use the seed triggers appearing in the ACE annotation guidelines." 
}, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-27", "text": "This result contributes to the broader line of research on avoiding or reducing annotation cost in information extraction (Section 5)." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-28", "text": "In particular, it suggests the potential utility of the seed-based approach in scenarios where manual annotation per each new event is too costly." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-29", "text": "----------------------------------" }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-30", "text": "**SEED-BASED PROBLEM SETUP**" }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-31", "text": "This section describes our setup, as graphically illustrated in Figure 1 ." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-32", "text": "Similarly to the supervised setting, our approach assumes that whenever a new event type is defined as target, an informative event description should be written for it." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-33", "text": "As a prominent example, we consider Section 5 of the ACE-2005 event annotation guidelines, 2 which provides a description for each event type." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-34", "text": "The description includes a short verbal specification plus several illustrating example sentences with marked triggers, spanning on average less than a page per event type." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-35", "text": "As event descriptions specify the intended event scope, they inherently include representative examples for event triggers." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-36", "text": "For instance, the ACE-2005 guidelines include: \"MEET Events include talks, summits, conferences, meetings, visits,. . . \", followed by an example: \"Bush and Putin met this week. . . \"." 
}, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-37", "text": "We thus collect triggers mentioned in each event description into a seed list for the event type, which is provided as input to our trigger labeling method." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-38", "text": "Triggers from the above quoted sentences are hence included in the Meet seed list, shown in Figure 1 ." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-39", "text": "As mentioned in the Introduction, our method (Section 3) is based on event-independent features that identify similarities between a candidate trigger and a given seed list." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-40", "text": "To train such generic features, our training requires several arbitrary training event types, with a small amount of annotated triggers, from which it learns weights for the features." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-41", "text": "In our evaluation (Section 4) we use 5 training event types, with a total of 30 annotated trigger mentions (compared to roughly 5000 used by the baseline fully-supervised system)." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-42", "text": "In this setting, the training phase is required only once during system setup, while no further training is required for each new target event type." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-43", "text": "In summary, our setup requires: (1) a seed list per target event type; (2) a small number of annotated triggers for few training event types, along with their seed lists (at system setup)." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-44", "text": "----------------------------------" }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-45", "text": "**METHOD**" }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-46", "text": "This section describes the method we designed to implement the seed-based approach." 
}, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-47", "text": "To assess our approach, we compare it (Section 4) with the common fully-supervised approach, which requires annotated triggers for each target event type." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-48", "text": "Therefore, we implemented our system by adapting the state-of-the-art fully-supervised event extraction system of Li et al. (2013) , modifying mechanisms relevant for features and for trigger labels, as described below." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-49", "text": "Hence the systems are comparable with respect to using the same preprocessing and machine learning infrastructure." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-50", "text": "----------------------------------" }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-51", "text": "**THE FULLY-SUPERVISED SYSTEM**" }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-52", "text": "The event extraction system of Li et al. (2013) labels triggers and their arguments for a set of target event types L, for which annotated training documents are provided." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-53", "text": "The system utilizes a structured perceptron with beam search (Collins and Roark, 2004; ." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-54", "text": "To label triggers, the system scans each sentence x, and creates candidate assignments y, that for each token x i assign each possible label y i \u2208 L \u222a {\u22a5} (\u22a5 meaning x i is not a trigger at all)." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-55", "text": "The score of an assignment (x i , y i ) is calculated as w \u00b7 f , where f is the binary feature vector calculated for (x i , y i ), and w is the learned feature weight vector." 
}, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-56", "text": "The classifier's features capture various properties of x i and its context, such as its unigram and its containing bigrams." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-57", "text": "These features are highly lexicalized, resulting in a very large feature space." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-58", "text": "Additionally, each feature is replicated and paired with each label y i , allowing the system to learn For the last two features we allow up to 2 levels of transitivity (e.g. hypernym of hypernym), and consider also derivations of candidate tokens." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-59", "text": "different weights for different labels, e.g., feature (Unigram:\"visited\", Meet) will have a different weight than (Unigram:\"visited\", Attack)." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-60", "text": "----------------------------------" }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-61", "text": "**THE SEED-BASED SYSTEM**" }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-62", "text": "To implement the seed-based approach for trigger labeling, we adapt only the trigger classification part in the Li et al. (2013) fully-supervised system, ignoring arguments." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-63", "text": "Given a set of new target event types T we classify every test sentence once for each event type t \u2208 T ." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-64", "text": "Hence, when classifying a sentence for t, the labeling of each token x i is binary, where y i \u2208 { , \u22a5} marks whether x i is a trigger of type t ( ) or not (\u22a5)." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-65", "text": "For instance x i =\"visited\" labeled as when classifying for t=Meet, means x i is labeled as a Meet trigger." 
}, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-66", "text": "To score the binary label assignment (x i , y i ), we use a small set of features that assess the similarity between x i and t's given seed list." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-67", "text": "We implement our approach with a basic set of binary features (Table 1) , which are turned on if similarity is found for at least one seed in the list." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-68", "text": "We use a single knowledge resource (WordNet (Fellbaum, 1998)) for expansion." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-69", "text": "3 As in the fully-supervised system, each feature is replicated for each label in { , \u22a5}, learning separately how well a feature can predict a trigger ( ) and a non-trigger (\u22a5)." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-70", "text": "As labels are event-independent, features are event-independent as well, and their weights can be learned generically (Figure 1 )." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-71", "text": "Since we label each token independently for each event type t, multiple labels may be assigned to the same token." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-72", "text": "If a single-label setting is required, standard techniques can be applied, such as choosing a single random label, or the highest scoring one." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-73", "text": "----------------------------------" }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-74", "text": "**EVALUATION**" }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-75", "text": "----------------------------------" }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-76", "text": "**SETTING**" }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-77", "text": "We evaluate our seed-based approach (Section 2) in comparison to the fully-supervised approach implemented by Li et al. 
(2013) (Section 3) ." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-78", "text": "To maintain comparability, we use the ACE-2005 documents with the same split as in (Ji and Grishman, 2008; Liao and Grishman, 2010b; Li et al., 2013) to 40 test documents and 559 training documents." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-79", "text": "However, some evaluation settings differ: Li et al. (2013) train a multi-class model for all 33 ACE-2005 event types, and classify all tokens in the test documents into these event types." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-80", "text": "Our approach, on the other hand, trains an eventindependent binary classifier, while testing on new event types that are different from those utilized for training." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-81", "text": "We next describe how this setup is addressed in our evaluation." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-82", "text": "----------------------------------" }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-83", "text": "**PER-EVENT CLASSIFICATION**" }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-84", "text": "To label the test documents to all 33 event types, we classify each token in the test documents once for each test event type." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-86", "text": "Training Event Types When we label for a test event type t, we use a model that was trained on different pre-selected training event types." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-87", "text": "Since we need to label for all event types, we cannot use the same model for testing them all, since then the event types used to train this model could not be tested." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-88", "text": "Thus, for each t we use a model trained on 5 randomly chosen training event types, different than t. 
Additionally, to avoid a bias caused by a particular random choice, we build 10 different models, each time choosing a different set of 5 training event types." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-89", "text": "Then, we label the test documents for t 10 times, once by each model." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-90", "text": "When measuring performance, we compute the average of these 10 runs for each t, as well as the variance within these runs." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-91", "text": "Annotated Triggers Training event types require annotated triggers from the training documents." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-92", "text": "To maintain consistency between different sets of training event types, we use a fixed total of 30 annotated trigger tokens for each set of training event types." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-93", "text": "Table 2 shows our system's precision, recall and F1 (80.6, 67.1 and 73.2, respectively) and the average variance of F1 within the 10 runs of each test event type (0.04), alongside the corresponding results of Li et al. (2013) (73.7, 62.3, 67.5) and Ji and Grishman (2008) (67.6, 53.5, 59.7)." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-94", "text": "The very low variance indicates that the system's performance does not depend much on the choice of training event types." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-95", "text": "We compare our system's performance to the published trigger classification results of the baseline system of Li et al. (2013) (its globally optimized run, when labeling both triggers and arguments)." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-96", "text": "We also compare to the sentence-level system in (Ji and Grishman, 2008), which uses the same dataset split."
}, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-97", "text": "Our system outperforms the fully-supervised baseline by 5.7% F 1 , which is statistically significant (two-tailed Wilcoxon test, p < 0.05)." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-98", "text": "This shows that there is no performance hit for the seed-based method on this dataset, even though it does not require any annotated data for new tested events, thus saving costly annotation efforts." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-99", "text": "----------------------------------" }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-100", "text": "**RESULTS**" }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-101", "text": "----------------------------------" }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-102", "text": "**RELATED WORK**" }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-103", "text": "Our work contributes to the broader research direction of reducing annotation for information extraction." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-104", "text": "One such IE paradigm, including Preemptive IE (Shinyama and Sekine, 2006) , Ondemand IE (Sekine, 2006; Sekine and Oda, 2007) and Open IE (Etzioni et al., 2005; Banko et al., 2007; Banko et al., 2008) , focuses on unsupervised relation and event discovery." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-105", "text": "We, on the other hand, follow the same goal as fullysupervised systems in targeting pre-specified event types, but aim at minimal annotation cost." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-106", "text": "Bootstrapping methods (such as (Yangarber et al., 2000; Agichtein and Gravano, 2000; Riloff, 1996; Greenwood and Stevenson, 2006; Liao and Grishman, 2010a; Stevenson and Greenwood, 2005; Huang and Riloff, 2012) ) have been widely applied in IE." 
}, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-107", "text": "Most work started from a small set of seed patterns, and repeatedly expanded them from unlabeled corpora." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-108", "text": "Relying on unlabeled data, bootstrapping methods are scalable but tend to produce worse results (Liao and Grishman, 2010a ) than supervised models due to semantic drift (Curran et al., 2007) ." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-109", "text": "Our method can be seen complementary to bootstrapping frameworks, since we exploit manually crafted linguistic resources which are more accurate but may not cover all domains and languages." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-110", "text": "Our approach is perhaps closest to (Roth et al., 2009) ." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-111", "text": "They addressed a different IE task -relation extraction, by recognizing entailment between candidate relation mentions and seed patterns." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-112", "text": "While they exploited a supervised recognizing textual entailment (RTE) system, we show that using only simple WordNet-based similarity features and minimal training yields relatively high performance in event trigger labeling." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-113", "text": "----------------------------------" }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-114", "text": "**CONCLUSIONS AND FUTURE WORK**" }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-115", "text": "In this paper we show that by utilizing the information embedded in annotation guidelines and lexical resources, we can skip manual annotation for new event types." 
}, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-116", "text": "As we match performance of a state-of-the-art fully-supervised system over the ACE-2005 benchmark (and even surpass it), we offer our approach as an appealing way of reducing annotation effort while preserving result quality." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-117", "text": "Future research may explore additional features and knowledge resources, investigate alternative approaches for creating effective seed lists, and extend our approach to argument labeling." }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-118", "text": "----------------------------------" }, { "sent_id": "c5a10f46c253f0da005622661b12a1-C001-119", "text": "**375**" } ], "y": { "@BACK@": { "gold_contexts": [ [ "c5a10f46c253f0da005622661b12a1-C001-12" ] ], "cite_sentences": [ "c5a10f46c253f0da005622661b12a1-C001-12" ] }, "@DIF@": { "gold_contexts": [ [ "c5a10f46c253f0da005622661b12a1-C001-25" ], [ "c5a10f46c253f0da005622661b12a1-C001-48" ], [ "c5a10f46c253f0da005622661b12a1-C001-62" ], [ "c5a10f46c253f0da005622661b12a1-C001-79" ] ], "cite_sentences": [ "c5a10f46c253f0da005622661b12a1-C001-25", "c5a10f46c253f0da005622661b12a1-C001-48", "c5a10f46c253f0da005622661b12a1-C001-62", "c5a10f46c253f0da005622661b12a1-C001-79" ] }, "@EXT@": { "gold_contexts": [ [ "c5a10f46c253f0da005622661b12a1-C001-25" ], [ "c5a10f46c253f0da005622661b12a1-C001-48" ], [ "c5a10f46c253f0da005622661b12a1-C001-62" ], [ "c5a10f46c253f0da005622661b12a1-C001-79" ] ], "cite_sentences": [ "c5a10f46c253f0da005622661b12a1-C001-25", "c5a10f46c253f0da005622661b12a1-C001-48", "c5a10f46c253f0da005622661b12a1-C001-62", "c5a10f46c253f0da005622661b12a1-C001-79" ] }, "@SIM@": { "gold_contexts": [ [ "c5a10f46c253f0da005622661b12a1-C001-52" ], [ "c5a10f46c253f0da005622661b12a1-C001-77" ], [ "c5a10f46c253f0da005622661b12a1-C001-78" ], [ "c5a10f46c253f0da005622661b12a1-C001-95" ] ], "cite_sentences": [ "c5a10f46c253f0da005622661b12a1-C001-52", 
"c5a10f46c253f0da005622661b12a1-C001-77", "c5a10f46c253f0da005622661b12a1-C001-78", "c5a10f46c253f0da005622661b12a1-C001-95" ] }, "@USE@": { "gold_contexts": [ [ "c5a10f46c253f0da005622661b12a1-C001-78" ] ], "cite_sentences": [ "c5a10f46c253f0da005622661b12a1-C001-78" ] } } }, "ABC_4da1c39dbbeaa2c9dac22118d0c698_16": { "x": [ { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-2", "text": "We describe a machine learning based method to identify incorrect entries in translation memories." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-3", "text": "It extends previous work by Barbu (2015) through incorporating recall-based machine translation and part-of-speech-tagging features." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-4", "text": "Our system ranked first in the Binary Classification (II) task for two out of three language pairs: English-Italian and English-Spanish." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-5", "text": "----------------------------------" }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-7", "text": "Autodesk has accumulated more than 40 million professionally translated segments over the past 17 years." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-8", "text": "These translation units (TUs) mainly stem from user interfaces and documentation of software products localized into 32 languages." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-9", "text": "As we are now unifying and centralizing all translations in a single repository, it is high time to sort out duplicate, outdated, and erroneous TUs." 
}, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-10", "text": "Exploring methods to handle the latter -clearly more challenging than removing duplicate and outdated material -motivated us to participate in the First Shared Task on Translation Memory Cleaning (Barbu et al., 2016) ." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-11", "text": "Going forward, we strive to make human translation more efficient (by showing translators less erroneous fuzzy matches) and machine translation more accurate (by reducing noise in training data)." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-12", "text": "In this paper, we describe our submitted system for distinguishing correct from incorrect TUs." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-13", "text": "Rather than tailoring it to individual languages, we aimed at a languageindependent solution to cover all of the language pairs in this shared task or, looking to the future, Autodesk's production environments." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-14", "text": "The system is based on previous work by Barbu (2015) and uses language-independent features with language-specific plug-ins, such as machine translation, part-of-speech tagging, and language classification." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-15", "text": "Specifics about previous work are given in the next section." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-16", "text": "In Section 3, we describe our method and, in Section 4, show how it compares to Barbu's (2015) approach as well as other submissions to this shared task." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-17", "text": "Lastly, we offer preliminary conclusions and outline future work in Section 5." 
}, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-18", "text": "----------------------------------" }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-19", "text": "**BACKGROUND**" }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-20", "text": "TM cleaning functionality in commercial tools is mostly rule-based, centering around the removal of duplicate entries, ensuring markup validity (e.g., no unclosed tags), or controlling for client or project specific terminology ." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-21", "text": "Although helpful, these methods fall short of identifying spurious entries that contain language errors or partial translations." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-22", "text": "With crowd-sourced and automatically constructed TMs in particular, it is also necessary to identify translation units with source and target segments that do not correspond at all (e.g., Trombetti, 2009; Tiedemann, 2012) ." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-23", "text": "Barbu (2015) has proposed to cast the identification of such incorrect translations as a supervised classification problem." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-24", "text": "In his work, 1,243 labelled TUs were used to train binary classifiers based on 17 features." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-25", "text": "The \"most important\" of them, according to the author, were bisegment_similarity and lang_diff : the former is defined as the cosine similarity between a target segment and its machine translated source segment, while the latter denotes whether the language codes declared in a translation unit correspond with the codes detected by a language detector." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-26", "text": "The best classifier, a support vector machine with linear kernel, achieved 82% precision and 81% recall on a held-out test set of 309 TUs." 
}, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-27", "text": "To the best of our knowledge, Barbu provided the first and so far only research contribution on automatic TM cleaning, which the author himself described as \"a neglected research area\" (Barbu, 2015) ." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-28", "text": "With our participation to this shared task, we seek to extend his work by examining new features based on statistical MT and POS tagging." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-29", "text": "As outlined above, comparing machine translated source segments to their actual target segments has proven effective in Barbu's (2015) experiments." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-30", "text": "We propose to complement or replace the similarity function used for this comparison (cosine similarity) by two automatic MT evaluation metrics, Bleu (Papineni et al., 2002) and characterbased Levenshtein distance, in order to reward higher-order n-gram (n > 1) and partial word overlaps, respectively." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-31", "text": "Furthermore, we introduce a recall-based MT feature that takes multiple MT hypotheses (n-best translations) of a given source segment into account, based on the assumption that alternative translations of words (such as \"buy\" and \"purchase\") or phrases (such as \"despite\" and \"in spite of\") should not be punished." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-32", "text": "We also experiment with part-of-speech information to identify spurious translation units." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-33", "text": "With closely related languages in particular, the rationale would be that adjectives (to name an example) in a source segment are likely to be reflected in the corresponding target segment in case of a valid translation." 
}, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-34", "text": "The comparison of POS tags from language-specific tagsets will be based on a mapping to Table 1 : Trainig and evaluation data." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-35", "text": "Classes 1, 2, and 3 denote correct, almost correct, and incorrect translation units, respectively." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-36", "text": "eleven coarse-grained, language-independent grammatical groups (Petrov et al., 2011) ." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-37", "text": "We acknowledge that the use of MT is discouraged by the organizers of this shared task to foster contributions that require less compute power." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-38", "text": "However, as MT was found to be valuable in previous work (see above) and computational resources are hardly a limiting factor in corporate environments (see Section 3.2), we decided not to refrain from including MT-based features in our submissions." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-39", "text": "----------------------------------" }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-40", "text": "**METHOD**" }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-41", "text": "Our system uses labelled TUs to train classifiers based on language-independent features (see Section 3.1) with language-specific plug-ins (see Section 3.2)." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-42", "text": "The feature extraction pipeline is implemented in Scala (see Section 3.3), and our final submission -geared to distinguish correct or almost correct from incorrect TUs -is based on a selection of nine features (see Section 3.4)." 
}, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-43", "text": "----------------------------------" }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-44", "text": "**FEATURES**" }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-45", "text": "We re-implemented the 17 features proposed by Barbu (2015, see also Section 2)." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-46", "text": "In addition, we explore" }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-47", "text": "\u2022 mt_coverage the percentage of target words contained in the n-best machine translations of the source segment." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-48", "text": "We use n = 20 in our experiments." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-49", "text": "\u2022 mt_cfs the character-based Levenshtein distance between target segment and machine translated source segment." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-50", "text": "We normalise this score such that identical and completely dissimilar segments result in scores of 1.0 and 0.0 respectively, i.e., cfs = 1 \u2212 Levensthein distance in characters number of characters in longer segment." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-51", "text": "This score is computed individually for each of the 20-best translation options; the best of these scores instantiates the feature value." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-52", "text": "\u2022 mt_bleu the BLEU score (Papineni et al., 2002) between target segment and machine translated source segment." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-53", "text": "We employ the sentence-level version of the metric as implemented in Phrasal (Green et al., 2014) ." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-54", "text": "As with mt_cfs, individual scores are computed for each of the 20-best translation options; the best score instantiates the feature value." 
}, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-55", "text": "\u2022 pos_sim_all the cosine similarity between the partof-speech (POS) tags found in the source and target segment." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-56", "text": "\u2022 pos_sim_some the cosine similarity between source and target segment in terms of nouns (NOUN), verbs (VERB), adjectives (ADJ), and pronouns (PRON)." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-57", "text": "\u2022 pos_exact whether or not the POS tag sequence in source and target segment is identical." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-58", "text": "\u2022 language_detection whether or not a state-of-the-art language classifier confirms the target segment's language declared in the translation unit." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-59", "text": "\u2022 ratio_words the ratio between number of words in source and target segment." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-60", "text": "\u2022 ratio_chars the ratio between number of characters in source and target segment." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-61", "text": "----------------------------------" }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-62", "text": "**RESOURCES**" }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-63", "text": "Some of the features described in the previous section require natural language processing (NLP) facilities." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-64", "text": "For machine translation, we use our in-house systems (Plitt and Masselot, 2010; Zhechev, 2014) based on the Moses SMT framework (Koehn et al., 2007) ." 
}, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-65", "text": "They are trained on translated software and user manuals from Autodesk products only and chosen for the sake of convenience; we would expect better performance of our MT-based features in conjunction with MT engines geared to the text domains used in this shared task (listed in Table 1 )." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-66", "text": "Our engines are integrated into a scalable infrastructure deployed on an elastic compute cloud, allowing high throughput even with large translation memories to be cleaned." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-67", "text": "For POS tagging, we rely on Schmid's (1995) TreeTagger and its readily available models 1 for English, German, Italian, and Spanish." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-68", "text": "To make POS tags comparable across these languages, they are mapped 2 to the Universal Tagset proposed by Petrov et al. (2011) ." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-69", "text": "Lastly, we use the publicly available Xerox Language Identifier API 3 for language detection." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-70", "text": "----------------------------------" }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-71", "text": "**CLASSIFICATION**" }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-72", "text": "Our feature extraction pipeline, including Barbu's (2015) as well as our own features (see Section 3.1), is implemented in Scala." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-73", "text": "This pipeline is used to transform translation units into feature vectors and train classifiers using the scikitlearn framework (Pedregosa et al., 2011) ." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-74", "text": "From the various classification algorithms we tested, Random Forests performed best with our selection of features (see below)." 
}, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-75", "text": "----------------------------------" }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-76", "text": "**FEATURE SELECTION**" }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-77", "text": "For the reasons mentioned in Section 1, we aimed at finding a combination of features that would perform well with all language pairs rather than tailoring solutions to individual languages." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-78", "text": "We focused on gearing our classifiers to distinguish correct or almost correct (classes 1, 2) from incorrect TUs (class 3) -i.e., the Binary Classification (II) task -by optimising the weighted F 1 -score (F 1 ) on training data (see Tables 2a and 2b) ." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-79", "text": "From the various feature combinations we tested, we found the following to be most successful: ratio_words, pos_sim_all, language_detection, mt_cfs, mt_bleu, ratio_chars (as described in Section 3.1), alongside cg_score, only_capletters_dif, and punctuation_similarity (from Barbu, 2015) ." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-80", "text": "Evaluation results are given in the next section." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-81", "text": "----------------------------------" }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-82", "text": "**RESULTS**" }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-83", "text": "We tested our final submission -a Random Forests classifier based on the nine features described in Section 3.4 -on three language pairs (en-de, en-es, en-it) and two tasks: Binary II and Fine-Grained Classification (see Sections 4.1 and 4.2, respectively)." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-84", "text": "The classifier was trained solely on data provided by the organizers of this shared task for each of the language \u00d7 task conditions." 
}, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-85", "text": "Each TU in this data was annotated with one of three labels: correct, almost correct, and incorrect (see Table 1 )." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-86", "text": "----------------------------------" }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-87", "text": "**BINARY CLASSIFICATION (II)**" }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-88", "text": "Our rationale for focusing on telling apart correct or almost correct from incorrect TUs was that a first application of our method, if successful, would most likely be the filtering of TM data for MT training." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-89", "text": "While eliminating almost correct TUs might decrease rather than increase MT quality, filtering out incorrect segments can have a positive impact (Vogel, 2003) ." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-90", "text": "Prior to submission, we benchmarked our system against the two baselines provided by the organizers: a dummy classifier assigning random classes according to the overall class distribution in the training data (Baseline 1), and a classifier based on the Church-Gale algorithm as adapted by Barbu (2015) (Baseline 2)." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-112", "text": "More importantly, however, we would like to test our implementation as-is in Autodesk's production environments for software localization." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-91", "text": "More importantly, however, we compared our system to Barbu's (2015) approach, using the classification algorithms which reportedly worked best with the 17 features in his work." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-92", "text": "Our system performed well in this comparison, surpassing Barbu's approach in all language pairs except en-de, where both systems were en par." 
}, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-93", "text": "Details are shown in Table 2a , where we report weighted precision (P), recall (R), and F 1 -scores averaged over 5-fold cross-validation with 2 /3-1 /3 splits of the training data." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-94", "text": "The final evaluation and ranking produced by the organizers, shown in Table 3a , confirms our findings from experimenting with training data: our system performs well on the en-es and en-it test sets (best in class), while performance is substantially lower on the en-de test set." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-95", "text": "The reasons for this are yet to be ascertained (see also Section 5)." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-96", "text": "----------------------------------" }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-97", "text": "**FINE-GRAINED CLASSIFICATION**" }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-98", "text": "Although geared to the Binary Classification (II) task (see above), we also assessed our system on the Fine-Grained Classification task." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-99", "text": "Here, the goal was to distinguish between all of the three classes, i.e., determine whether a TU is correct, almost correct, or incorrect." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-100", "text": "Again, we compared our system's performance to Barbu's (2015) method, using 2 /3-1 /3 splits of the training data (5-fold cross-validation)." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-101", "text": "The results, shown in Table 2b , implied that the nine features we selected would not suffice for a more fine-grained classification of TUs." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-102", "text": "This was confirmed in the official evaluation and ranking: our system scored low on en-de and mediocre on en-es and en-it." 
}, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-103", "text": "Further work will be needed to analyse these results in more detail." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-104", "text": "----------------------------------" }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-105", "text": "**CONCLUSIONS**" }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-106", "text": "We have proposed a machine learning based method to identify incorrect entries in translation memories." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-107", "text": "It is applicable to any language pair for which an MT system, a POS tagger, and a language identifier are available." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-108", "text": "Implemented using off-the-shelf tools, our system achieved the best classification results for two out of three language pairs (English-Italian and English-Spanish) in the Binary Classification (II) task." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-109", "text": "In future work, we would like to assess the impact of gearing NLP components to target domains on classification accuracy." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-110", "text": "The training data in this shared task stems from news (German) and medical texts (Italian, Spanish) which our MT systems, for example, were not optimized for." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-111", "text": "This domain mismatch might partially explain why our system did not perform well on the English-German test set." }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-113", "text": "Removing incorrect segments from TMs could ultimately help make professional translation more efficient by providing better MT (through filtered training data) and more accurate fuzzy matches." 
}, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-114", "text": "----------------------------------" }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-115", "text": "**ACKNOWLEDGEMENTS**" }, { "sent_id": "4da1c39dbbeaa2c9dac22118d0c698-C001-116", "text": "We would like to thank Val\u00e9ry Jacot for his vital support and guidance." } ], "y": { "@DIF@": { "gold_contexts": [ [ "4da1c39dbbeaa2c9dac22118d0c698-C001-3" ], [ "4da1c39dbbeaa2c9dac22118d0c698-C001-14" ] ], "cite_sentences": [ "4da1c39dbbeaa2c9dac22118d0c698-C001-3", "4da1c39dbbeaa2c9dac22118d0c698-C001-14" ] }, "@EXT@": { "gold_contexts": [ [ "4da1c39dbbeaa2c9dac22118d0c698-C001-3" ], [ "4da1c39dbbeaa2c9dac22118d0c698-C001-14" ] ], "cite_sentences": [ "4da1c39dbbeaa2c9dac22118d0c698-C001-3", "4da1c39dbbeaa2c9dac22118d0c698-C001-14" ] }, "@SIM@": { "gold_contexts": [ [ "4da1c39dbbeaa2c9dac22118d0c698-C001-16" ], [ "4da1c39dbbeaa2c9dac22118d0c698-C001-27" ], [ "4da1c39dbbeaa2c9dac22118d0c698-C001-72" ], [ "4da1c39dbbeaa2c9dac22118d0c698-C001-79" ], [ "4da1c39dbbeaa2c9dac22118d0c698-C001-90" ], [ "4da1c39dbbeaa2c9dac22118d0c698-C001-91" ], [ "4da1c39dbbeaa2c9dac22118d0c698-C001-100" ] ], "cite_sentences": [ "4da1c39dbbeaa2c9dac22118d0c698-C001-16", "4da1c39dbbeaa2c9dac22118d0c698-C001-27", "4da1c39dbbeaa2c9dac22118d0c698-C001-72", "4da1c39dbbeaa2c9dac22118d0c698-C001-79", "4da1c39dbbeaa2c9dac22118d0c698-C001-90", "4da1c39dbbeaa2c9dac22118d0c698-C001-91", "4da1c39dbbeaa2c9dac22118d0c698-C001-100" ] }, "@BACK@": { "gold_contexts": [ [ "4da1c39dbbeaa2c9dac22118d0c698-C001-29" ] ], "cite_sentences": [ "4da1c39dbbeaa2c9dac22118d0c698-C001-29" ] }, "@USE@": { "gold_contexts": [ [ "4da1c39dbbeaa2c9dac22118d0c698-C001-91" ] ], "cite_sentences": [ "4da1c39dbbeaa2c9dac22118d0c698-C001-91" ] } } }, "ABC_fa5413db2c8e0a32bc3805d25cd0e7_16": { "x": [ { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-39", "text": "----------------------------------" }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-40", "text": 
"**BERT**" }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-2", "text": "Pre-training has proven to be effective in unsupervised machine translation due to its ability to model deep context information in cross-lingual scenarios." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-3", "text": "However, the crosslingual information obtained from shared BPE spaces is inexplicit and limited." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-4", "text": "In this paper, we propose a novel cross-lingual pre-training method for unsupervised machine translation by incorporating explicit cross-lingual training signals." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-5", "text": "Specifically, we first calculate cross-lingual n-gram embeddings and infer an n-gram translation table from them." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-6", "text": "With those n-gram translation pairs, we propose a new pre-training model called Cross-lingual Masked Language Model (CMLM), which randomly chooses source n-grams in the input text stream and predicts their translation candidates at each time step." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-7", "text": "Experiments show that our method can incorporate beneficial cross-lingual information into pre-trained models." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-8", "text": "Taking pre-trained CMLM models as the encoder and decoder, we significantly improve the performance of unsupervised machine translation." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-9", "text": "Our code is available at https://github.com/Imagist-Shuo/CMLM." 
}, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-10", "text": "----------------------------------" }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-11", "text": "**INTRODUCTION**" }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-12", "text": "Unsupervised machine translation has become an emerging research interest in recent years (Artetxe et al., 2017; Artetxe et al., 2018b; Marie and Fujita, 2018; Ren et al., 2019; Lample and Conneau, 2019) ." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-13", "text": "The common framework of unsupervised machine translation builds two initial translation models at first (i.e., source to target and target to source), and then does iterative back-translation (Sennrich et al., 2016a; Zhang et al., 2018) with the two models using pseudo data generated by each other." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-14", "text": "The initialization process is crucial to the final translation * Contribution during internship at MSRA." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-15", "text": "performance as pointed in , Artetxe et al. (2018b) and Ren et al. (2019) ." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-16", "text": "Previous approaches benefit mostly from crosslingual n-gram embeddings, but recent work proves that cross-lingual language model pretraining could be a more effective way to build initial unsupervised machine translation models (Lample and Conneau, 2019) ." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-86", "text": "We call our proposed pre-training objective \"Cross-lingual Masked Language Model\" (CMLM) as shown in Figure 2 ." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-17", "text": "However, in their method, the cross-lingual information is mostly obtained from shared Byte Piece Encoding (BPE) (Sennrich et al., 2016b) spaces during pre-training, which is inexplicit and limited." 
}, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-18", "text": "Firstly, although the same BPE pieces from different languages may share the same semantic space, the semantic information of n-grams or sentences in different languages may not be shared properly." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-19", "text": "However, cross-lingual information based on n-gram level is crucial to model the initialization of unsupervised machine translation Artetxe et al., 2018b) , which is absent in the current pretraining method." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-20", "text": "Secondly, BPE sharing is limited to languages that share much of their alphabet." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-21", "text": "For language pairs that are not the case, the above pre-training method may not provide much useful cross-lingual information." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-22", "text": "In this paper, by incorporating explicit crosslingual training signals, we propose a novel crosslingual pre-training method based on BERT (Devlin et al., 2018) for unsupervised machine translation." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-23", "text": "Our method starts from unsupervised crosslingual n-gram embeddings, from which we infer n-gram translation pairs." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-24", "text": "Then, we propose a new pre-training objective called Cross-lingual Masked Language Model (CMLM), which masks the input n-grams randomly and predicts their corresponding n-gram translation candidates inferred above." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-25", "text": "To solve the mismatch between different lengths of the masked source and predicted target n-grams, IBM models are introduced (Brown et al., 1993) to derive the training loss at each Figure 1 : Masked Language Model (MLM) for BERT training." 
}, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-26", "text": "For a given sentence, this task is to predict randomly masked tokens, i.e., \"contrast\", \"held\" and \"opinion\"." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-27", "text": "In practice, it is implemented based on BPE." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-28", "text": "time step." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-29", "text": "In this way, we can guide the model with more explicit and strong cross-lingual training signals, and meanwhile, leverage the potential of BERT to model context information." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-30", "text": "We then use two pre-trained cross-lingual language models as the encoder and decoder respectively to build desired machine translation models." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-31", "text": "Our method can be iteratively performed with the n-gram translation table updated by downstream tasks." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-32", "text": "Experiments show that our method can produce better cross-lingual representations and significantly improve the performance of unsupervised machine translation." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-33", "text": "Our contributions are listed as follows." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-34", "text": "\u2022 We propose a novel cross-lingual pre-training method to incorporate explicit cross-lingual information into pre-trained models, which significantly improves the performance of unsupervised machine translation." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-35", "text": "\u2022 We introduce IBM models to calculate the step-wise training loss for CMLM, which breaks the limitation that masked n-grams and predicted ones have to be the same length during BERT training." 
}, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-36", "text": "\u2022 We produce strong context-aware crosslingual representations with our pre-training method, which helps in word alignment and cross-lingual classification tasks." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-37", "text": "----------------------------------" }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-38", "text": "**BACKGROUND**" }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-41", "text": "BERT (Devlin et al., 2018) , short for Bidirectional Encoder Representations from Transformers, is a powerful pre-training method for natural language processing and breaks records of many NLP tasks after corresponding fine-tuning." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-42", "text": "The core idea of BERT is pre-training a deep bidirectional Transfomer encoder (Vaswani et al., 2017) with two training tasks." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-43", "text": "The first one is Masked Language Model (MLM) referring to the Cloze task (Taylor, 1953) , which takes a straightforward approach of masking some percentage of the input tokens at random, and then predicting them with the corresponding Transformer hidden states, as shown in Figure 1 ." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-44", "text": "The second one is Next Sentence Prediction, which means to predict whether two sentences are adjacent or not." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-45", "text": "This task is designed for some tasks that need modeling the relationship between two sentences such as Question Answering (QA) and Natural Language Inference (NLI)." 
}, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-46", "text": "----------------------------------" }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-47", "text": "**XLM**" }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-48", "text": "Based on BERT, Lample and Conneau (2019) propose a cross-lingual version called XLM and reach the state-of-the-art performance on some crosslingual NLP tasks including cross-lingual classification , machine translation, and unsupervised cross-lingual word embedding." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-49", "text": "The basic points of XLM are mainly two folds." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-50", "text": "The first one is to use a shared vocabulary of BPE (Sennrich et al., 2016b ) to provide potential crosslingual information between two languages just as mentioned in , in an inexplicit way though." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-51", "text": "The second point is a newly proposed training task called Translation Language Modeling (TLM), which extends MLM by concatenating parallel sentences into a single training text stream." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-52", "text": "In this way, the model can leverage the cross-lingual information provided by parallel sentences to predict the masked tokens." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-53", "text": "However, for unsupervised machine translation, TLM cannot be used due to the lack of parallel sentences." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-54", "text": "Different from them, we are motivated to give the model more explicit and strong cross-lingual information and propose a new pre-training method by (1) masking source n-grams and (2) predicting their corresponding translation candidates." 
}, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-55", "text": "The second one is pre-training with our proposed objective Cross-lingual Masked Language Model (CMLM) which is to predict the translation candidates of randomly masked n-grams." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-56", "text": "The last step is to leverage the pre-trained cross-lingual language models as the encoder and decoder of a neural machine translation model and fine-tune the translation model iteratively." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-57", "text": "----------------------------------" }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-58", "text": "**METHOD**" }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-59", "text": "----------------------------------" }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-60", "text": "**OVERVIEW**" }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-61", "text": "Our method can be divided into three steps as shown in Figure 2 ." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-62", "text": "Given two languages X and Y , we first get unsupervised cross-lingual n-gram embeddings of them, from which we infer n-gram translation tables (source-to-target and target-tosource)." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-63", "text": "The n-gram translation pairs inferred in this way have proven to be instructive for initial unsupervised machine translation models (Artetxe et al., 2018b; Marie and Fujita, 2018; Ren et al., 2019) ." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-64", "text": "Then, we pre-train cross-lingual BERT language models with our proposed Cross-lingual Masked Language Model (CMLM) objective." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-65", "text": "Specifically, we randomly choose n-grams in the monolingual sentences and predict corresponding translation candidates in the n-gram translation table inferred in the first step." 
}, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-66", "text": "In this way, we can guide the model with explicit and strong cross-lingual training signals." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-67", "text": "Finally, two pre-trained cross-lingual language models are used to initialize the encoder and decoder respectively, based on which, denoising auto-encoder and iterative back-translation are leveraged to finetune the unsupervised machine translation models." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-68", "text": "The above process is repeated by updating the n-gram table with the n-gram translation pairs extracted from the pseudo data generated by the translation models." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-69", "text": "In the following subsections, we will give details of each step." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-70", "text": "----------------------------------" }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-71", "text": "**N-GRAM TRANSLATION TABLE INFERRING**" }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-72", "text": "Following previous work of unsupervised machine translation (Artetxe et al., 2018b; Ren et al., 2019) , given two languages X and Y , we build our n-gram translation tables as follows." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-73", "text": "First, we obtain monolingual n-gram embeddings using fastText (Bojanowski et al., 2017) and then get cross-lingual n-gram embeddings using vecmap (Artetxe et al., 2018a ) in a fully unsupervised way." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-74", "text": "Based on that, we calculate the similarity score of n-grams x and y in two languages respectively with the marginal-based scoring method (Conneau et al., 2017; Artetxe and Schwenk, 2018) ." 
}, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-75", "text": "Specifically, given the crosslingual embeddings of x and y, denoted as e x and e y , the similarity score is calculated as:" }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-76", "text": "sim(x, y) = margin(cos(e x , e y )," }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-77", "text": "cos(e y , e z ) 2n )" }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-78", "text": "(1) where margin(a, b) is a marginal scoring function and NN n (x) denotes x's k-nearest neighbors in the other language." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-79", "text": "In our experiments, n is 5 and margin(a, b) = a b ." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-80", "text": "We take the above similarity scores as the translation probabilities between x and y in the ngram table." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-81", "text": "For each top frequent n-gram in the source language, we retrieve top-k n-gram translation candidates in the target language." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-82", "text": "----------------------------------" }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-83", "text": "**CROSS-LINGUAL MASKED LANGUAGE MODEL**" }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-84", "text": "In this section, we introduce our proposed method for pre-training cross-lingual language models based on BERT." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-85", "text": "Unlike the masked language model (MLM) described in Section 2.2 which masks several tokens in the input stream and predict those tokens themselves, we randomly select some percentage of n-grams in the input source sentence, and predict their translation candidates retrieved from Section 3.2." 
}, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-87", "text": "The difficulty for BERT to predict target phrases during training is that the lengths of the translation candidates are sometimes different from the source n-grams." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-88", "text": "To deal with this problem, we turn to IBM Model 2 (Brown et al., 1993) to calculate the training loss at each time step." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-89", "text": "Our proposed method breaks the limitation that masked n-grams and predicted ones have to be the same length during BERT training." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-90", "text": "Specifically, according to IBM Model 2, given a source n-gram x l 1 and a target one y m 1 , where l and m are the numbers of tokens in the source and target n-grams respectively, the translation probability from x l 1 to y m 1 is calculated as:" }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-91", "text": "where = p(m|x l 1 ), i.e. the probability that the translation of x l 1 consists of m tokens; a(j|i, l, m) is the probability that the i th source token is aligned with the j th target token conditioned on the lengths l and m, while p(y j |x i ) is the translation probability from the source token x i to the target token y i ." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-92", "text": "Based on the IBM Model 2, the loss function of our CMLM is defined as" }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-93", "text": "The derived gradient w.r.t model parameters \u03b8 at each time step can be calculated as follows:" }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-94", "text": "Since the target n-gram y m 1 is predicted with our modified BERT, in practice, the source word x i in Eq.(4) is replaced with its context-sensitive embedding C(x l 1 ), which is the corresponding hidden state of the top Transformer layer." 
}, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-95", "text": "The alignment probability a(i|j, l, m) cannot be learned during training because of the absence of bilingual corpus." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-96", "text": "Therefore, cross-lingual BPE embeddings are leveraged to calculate the normalized sim(x i , y j ) to approximate a(i|j, m, l)." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-97", "text": "p(y j |x i ) is the model prediction in Softmax outputs." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-98", "text": "For each source ngram, all of the retrieved k translation candidates are used to calculate the cross entropy loss, which are weighted with their translation probabilities in the n-gram table." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-99", "text": "Given a language pair X \u2212 Y , we process both languages with the same shared BPE vocabulary using their monolingual sentences together during pre-training." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-100", "text": "Following Devlin et al. (2018) ; Lample and Conneau (2019) , in our CMLM objective, we randomly sample 15% of the BPE ngrams from the text streams, and replace them by [MASK] tokens 70% of the time." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-101", "text": "During pretraining, in each iteration, a batch is composed of sentences sampled from the same language, and we alternate between MLM and CMLM objectives." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-102", "text": "Different from the original MLM in BERT, in the half of the MLM time, we randomly choose some source n-grams in the input text stream, and replace them with their translation candidates to construct code-switching sentences." 
}, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-103", "text": "Our final pretraining loss is defined as" }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-104", "text": "----------------------------------" }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-105", "text": "**UNSUPERVISED MACHINE TRANSLATION**" }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-106", "text": "Taking two cross-lingual language models pretrained with the above method as the encoder and decoder, we build an initial unsupervised neural machine translation model." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-107", "text": "Then, we train the model with monolingual data until convergence via denoising auto-encoder and iterative backtranslation, as described in Artetxe et al. (2017) ; ." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-108", "text": "Different from them, we step further and make another iteration with updated n-gram translation tables." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-109", "text": "Specifically, we translate the monolingual sentences with our latest translation model and run GIZA++ (Och and Ney, 2003) on the generated pseudo parallel data to extract updated n-gram translation pairs, which are used to tune the encoder as Section 3.3, together with the back-translation within a multi-task learning framework." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-110", "text": "Experimental results show that running another iteration can further improve the translation performance." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-111", "text": "It is also interesting to explore the usage of pre-trained decoders in the translation model." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-112", "text": "It seems that pre-training decoders has a smaller effect on the final performance than pre-training encoders (Lample and Conneau, 2019) , one reason for which could be that the encoder-to-decoder attention is not pre-trained." 
}, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-113", "text": "Therefore, the parameters of the decoder need to be re-adjusted substantially in the following tuning process for MT task." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-114", "text": "In our experiments, we explore some other usage of pre-trained decoders, i.e., we use the pretrained decoder as the feature extractor and feed the outputs into a new decoder consisting of several Transformer layers with the attention to the encoder." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-115", "text": "We find this method improves the performance of some language translation directions." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-116", "text": "----------------------------------" }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-117", "text": "**EXPERIMENTS**" }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-118", "text": "In this section, we conduct experiments to evaluate our proposed pre-training method." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-119", "text": "In Section 4.1, we will introduce the setup of our experiments, followed by the overall results of the final unsupervised MT models." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-120", "text": "Then, in Section 4.3, we will discuss another usage of pre-trained decoders for translation models." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-121", "text": "To evaluate the cross-lingual modeling capacity of our pre-trained encoders, in Section 4.4, we conduct experiments on word alignment and cross-lingual classification tasks." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-122", "text": "Finally, we do the ablation study to check the performance contribution of each component in our proposed method." 
}, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-123", "text": "----------------------------------" }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-124", "text": "**SETUP DATA AND PREPROCESS**" }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-125", "text": "In our experiments, we consider three language pairs, English-French (en-fr), English-German (en-de) and English-Romanian (en-ro)." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-126", "text": "For each language, we use all the available sentences in NewsCrawl till 2018, monolingual datasets from WMT." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-127", "text": "The NewsCrawl data are used in both pretraining and the following unsupervised NMT iteration process." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-128", "text": "Our CMLM is optimized based on the pre-trained models released by Lample and Conneau (2019) 1 , which are trained with Wikipedia dumps." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-129", "text": "For fair comparison, we use newstest 2014 as the test set for en-fr, and newstest 2016 for en-de and en-ro." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-130", "text": "1 https://github.com/facebookresearch/XLM We use Moses scripts for tokenization, and use fastBPE 2 to split words into subword units with their released BPE codes 1 ." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-131", "text": "The number of shared BPE codes for each language pair is 60,000." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-132", "text": "----------------------------------" }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-133", "text": "**IMPLEMENTATION DETAILS**" }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-134", "text": "Our implementation is based on the released code of XLM 1 (Paszke et al., 2017) ." 
}, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-135", "text": "Specifically, we use a Transformer architecture with 1024 hidden units, 8 heads, GELU activations (Hendrycks and Gimpel, 2016) , with a dropout rate of 0.1." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-136", "text": "The models are trained with the Adam optimizer (Kingma and Ba, 2014), a linear warmup (Vaswani et al., 2017 ) and the learning rates varying from 10 4 to 5 \u00d7 10 4 ." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-137", "text": "For both of the MLM and CMLM objectives, we use streams of 256 tokens and mini-batches of size 64." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-138", "text": "We use the averaged perplexity over languages as a stopping criterion for training." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-139", "text": "For machine translation, we use 6 Transformer layers, and we create mini-batches of 2000 tokens." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-140", "text": "----------------------------------" }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-141", "text": "**BASELINES**" }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-142", "text": "Our method is compared with six baselines of unsupervised MT systems listed in the upper part of Table 1 ." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-143", "text": "The first two baselines (Artetxe et al., 2017; ) use a shared encoder and different decoders for two languages with the training methods of denoising auto-encoder and iterative back-translation." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-144", "text": "The third baseline (Artetxe et al., 2018b ) is an unsupervised PB-SMT model, which uses the initial PBSMT models built with language models and n-gram translation tables inferred from cross-lingual embeddings, followed with the iterative back-translation." 
}, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-145", "text": "The fourth baseline ) is a hybrid method of unsupervised NMT and PBSMT by combining the pseudo data generated by PBSMT models into the final iteration of NMT." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-146", "text": "The fifth baseline (Ren et al., 2019) is also a hybrid method of NMT and PBSMT but different from , they leverage PBSMT as posterior regularization in each NMT iteration." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-147", "text": "The last baseline is XLM described in Section 2.2." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-148", "text": "----------------------------------" }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-149", "text": "**OVERALL RESULTS**" }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-150", "text": "The overall comparison results of unsupervised machine translation are shown in Table 1 ." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-151", "text": "From the table, we find that our proposed method significantly outperforms previous methods on all language pairs by the average BLEU score of 1.7, Method fr2en en2fr de2en en2de ro2en en2ro Baselines (Artetxe et al., 2017) 15.6 15.1 ---- 14.3 15.1 13.3 9.6 -- (Artetxe et al., 2018b) 25 and both the improvements of en2fr and ro2en are over 2 BLEU points." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-152", "text": "The results indicate that the explicit cross-lingual information incorporated by our proposed CMLM is beneficial to the unsupervised machine translation task." 
}, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-153", "text": "Notice that by doing another iteration (\"Iter 2\") with updated ngram tables as described in Section 3.4, we further improve the performance a bit for most translation directions with the improvements of en2fr and ro2en bigger than 0.5 BLEU point, which confirms the potential that fine-tuned machine translation models contain more beneficial cross-lingual information than the initial n-gram translation tables, which can be used to enhance the pre-trained model iteratively." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-154", "text": "The improvement made by Lample and Conneau (2019) compared with the first five baselines shows that cross-lingual pre-training can be necessary for unsupervised MT." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-155", "text": "However, the crosslingual information learned with this method during pre-training is mostly from the shared subword space, which is inexplicit and not strong enough." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-156", "text": "Our proposed method can give the model more explicit and strong cross-lingual training signals so that the pre-trained model contains much beneficial cross-lingual information for unsupervised machine translation." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-157", "text": "As a result, we can further improve the translation performance significantly, compared with Lample and Conneau (2019) (with the significance level of p<0.01)." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-158", "text": "----------------------------------" }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-159", "text": "**USAGE OF PRE-TRAINED DECODER**" }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-160", "text": "As mentioned in Section 3.4, it is interesting to explore the different usage of pre-trained decoders in the MT task." 
}, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-161", "text": "According to our intuition, directly using the pre-trained model as the decoder may not work well because parameters of the decoder need substantial adjustment due to the attention part between the encoder and the decoder." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-162", "text": "Therefore, we treat the pre-trained decoder as the feature extractor and add several Transformer layers with the encoder-to-decoder attention on top of it." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-163", "text": "We also try to fix the pre-trained decoder and just fine-tune the encoder and the added decoder part." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-164", "text": "The experiments are conducted based on \"Iter 1\" with the results reported in Table 2 ." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-165", "text": "From this table, we can see that directly using the pre-trained model as the decoder may be the best choice for most of the time, with the exceptions of en2fr and ro2en." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-166", "text": "By adding 6 Transformer layers on top of the original pre-trained decoder can achieve higher performance for en2fr and ro2en, but not significant." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-167", "text": "The reason could be that it is difficult to train the additional Transformer layers from scratch in the unsupervised scenario." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-168", "text": "There is also an interesting phenomenon that if we fix the pre-trained part of the decoder and only tune the added Transformer layers, the final performance will drop drastically, which indicates that the representation space of the decoder requires substantial adaptation, even though the pre-trained models already contain cross-lingual information." 
}, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-169", "text": "We think that further deep research on the decoder initialization could be a necessary and interesting topic in the future." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-170", "text": "----------------------------------" }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-171", "text": "**EVALUATION OF CROSS-LINGUAL PRE-TRAINED ENCODER WORD ALIGNMENT**" }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-172", "text": "To evaluate the cross-lingual modeling capacity of our pre-trained models, we first conduct experiments on the English-French (en-fr) dataset of the HLT/NAACL 2003 alignment shared task (Mihalcea and Pedersen, 2003) ." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-173", "text": "Given two parallel sentences in English and French respectively, we feed each sentence into the pre-trained cross-lingual encoder and get its respective outputs." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-174", "text": "Then, we calculate the similarities between the outputs of the two sentences and choose target words with max similarity scores as the alignments of corre- sponding source words." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-175", "text": "We compare the context-unaware method (i.e., directly calculating the similarity scores between unsupervised cross-lingual embeddings (Artetxe et al., 2018a ) of source and target words), XLM (Lample and Conneau, 2019) and our proposed CMLM pre-training method in the Table 3 ." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-176", "text": "In this experiment, we leave out all the OOV words and those torn apart by the BPE operations." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-177", "text": "Table 3 : Results of word alignment tasks using different cross-lingual word embeddings." 
}, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-178", "text": "In this table, \"P\" means \"precision\", \"R\" means recall\", \"F\" means \"Fmeasure\" and \"AER\" means the \"alignment error rate\"." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-179", "text": "From this table, we find that, based on BERT, both XLM and our method can model crosslingual context information, indicating that context information can greatly enhance the crosslingual mapping between the source and target words." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-180", "text": "By leveraging the explicit cross-lingual information in the model training, our CMLM can outperform XLM significantly." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-181", "text": "This confirms that our CMLM does better to connect the source and target representation space, with which as pretrained models, the performance of unsupervised NMT can be improved." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-182", "text": "----------------------------------" }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-183", "text": "**CROSS-LINGUAL CLASSIFICATION**" }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-184", "text": "We also conduct experiments on the cross-lingual classification task using the cross-lingual language inference (XNLI) dataset ." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-185", "text": "Specifically, we add a linear classification layer on top of the first hidden state of our pre-trained model and fine-tune its parameters on the English NLI dataset." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-186", "text": "Without using any labeled data for French (fr) and Germany (de) languages, we only report the zero-shot classification results for them as shown in Table 4 ." 
}, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-187", "text": "We can find that our method reaches a new record of the zero-shot cross-lingual classification task on languages of French (fr) and Germany (de), which confirms again that our CMLM works better on modeling cross-lingual information than previous methods in the unsupervised scenario." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-188", "text": "Method en fr de 73.7 67.7 67.7 (Devlin et al., 2018) 81.4 -70.5 (Lample and Conneau, 2019)" }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-189", "text": "----------------------------------" }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-190", "text": "**ABLATION STUDY**" }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-191", "text": "In this section, we will discuss the different settings of our method." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-192", "text": "Firstly, the training loss of our pre-trained method contains two parts, i.e., CMLM and MLM, just as Eq. (5) shows." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-193", "text": "To study the respective influences of these two parts, we remove the MLM loss from it and compare the performance on en-fr and en-de translation tasks." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-194", "text": "Since our CMLM task differs from XLM in two aspects during pre-training." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-195", "text": "The first one is that we randomly choose n-grams to mask in the input text stream rather than BPE tokens, and the second one is that we predict the translation candidates of a source n-gram rather than predicting the source n-gram itself." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-196", "text": "Although the first one has proven to be beneficial during pre-training to some NLP tasks, we want to check how much its influence is to our final translation performance." 
}, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-197", "text": "Therefore, we disable those two modifications in CMLM one by one and report the translation results." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-198", "text": "Our experiments are conducted based on \"Iter 1\" with the results in Table 5 ." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-199", "text": "From Table 5 , we can find that the combination of CMLM and MLM can improve the translation performance by about 0.6 to 0.7 BLEU compared with any one only." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-200", "text": "This confirms the monolingual context modeling capacity of the MLM, which is quite useful for unsupervised machine translation." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-201", "text": "Table 5 : Ablation study." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-202", "text": "\"CMLM + MLM\" means we use L pre as the pre-training loss; \"CMLM\" means we only use L cmlm as the pre-training loss; \"-translation prediction\" means we predict the masked n-grams themselves rather than their translation candidates during pre-training; \"--n-gram mask\" means we randomly mask BPE tokens rather than n-grams based on \"-translation prediction\" during pre-training, which degrades our method to XLM." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-203", "text": "our model to learn both monolingual and crosslingual information during pre-training." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-204", "text": "Besides, we find the two modifications(translation prediction and n-gram mask) made by CMLM have nearly equal contributions to the translation performance, except for en2de, where the \"n-gram mask\" has little influence." 
}, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-205", "text": "----------------------------------" }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-206", "text": "**RELATED WORK**" }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-207", "text": "Unsupervised machine translation dates back to Klementiev et al. (2012) ; Nuhn et al. (2012) , but becomes a hot research topic in recent years." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-208", "text": "As the pioneering methods, Artetxe et al. (2017) ; ; Yang et al. (2018) are mainly the modifications of the encoder-decoder structure." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-209", "text": "The core idea is to constrain outputs of encoders of two languages into a same latent space with a weight sharing mechanism such as using a shared encoder." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-210", "text": "Denoising auto-encoder (Vincent et al., 2010) and adversarial training methods are also leveraged." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-211", "text": "Besides, they apply iterative backtranslation to generated pseudo data for crosslingual training." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-212", "text": "In addition to NMT methods for unsupervised machine translation, some following work shows that SMT methods and the hybrid of NMT and SMT can be more effective (Artetxe et al., 2018b; Marie and Fujita, 2018; Ren et al., 2019) ." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-213", "text": "They all build unsupervised PBSMT systems, and all of their models are initialized with language models and phrase tables inferred from cross-lingual word or n-gram embeddings and then use the initial PBSMT models to do iterative back-translation." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-214", "text": "also build a hybrid system by combining the best pseudo data that SMT models generate into the training of the NMT model while Ren et al. 
(2019) alternately train SMT and NMT models with the framework of posterior regularization." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-215", "text": "More recently, Lample and Conneau (2019) reach new state-of-the-art performance on the unsupervised en-fr and en-de translation tasks." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-216", "text": "They propose a cross-lingual language model pre-training method based on BERT (Devlin et al., 2018) , and then treat two cross-lingual language models as the encoder and decoder to perform the translation." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-217", "text": "Leveraging much more monolingual data from Wikipedia, their work shows the great potential of pre-training for unsupervised machine translation." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-218", "text": "However, the cross-lingual information is obtained mostly from the shared BPE space during their pre-training method, which is implicit and limited." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-219", "text": "Therefore, we develop a new pre-training method that gives the model more explicit and stronger cross-lingual information." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-220", "text": "In the recent work of Song et al. (2019) , they also mask several consecutive tokens in the source sentence, but jointly pre-train the encoder and decoder by making the decoder predict those masked tokens on both the source and target sides." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-221", "text": "Their method is a good case of pre-training for seq-to-seq tasks, but the cross-lingual information incorporated by their method comes from BPE sharing, which is also implicit." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-222", "text": "Our proposed method can be combined with theirs within a multitask framework, which could be done in the future." 
}, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-223", "text": "----------------------------------" }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-224", "text": "**CONCLUSION**" }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-225", "text": "In this paper, we propose a novel cross-lingual pretraining method for unsupervised machine translation." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-226", "text": "In our method, we leverage Cross-lingual Masked Language Model (CMLM) to incorporate explicit and strong cross-lingual information into pre-trained models." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-227", "text": "Experimental results on en-fr, en-de, and en-ro language pairs demonstrate the effectiveness of our proposed method." }, { "sent_id": "fa5413db2c8e0a32bc3805d25cd0e7-C001-228", "text": "In the future, we may apply our pre-training method to other language pairs and delve into the performance of the pre-trained encoders on other NLP tasks, such as Name Entity Recognition." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "fa5413db2c8e0a32bc3805d25cd0e7-C001-12" ], [ "fa5413db2c8e0a32bc3805d25cd0e7-C001-16" ], [ "fa5413db2c8e0a32bc3805d25cd0e7-C001-48" ], [ "fa5413db2c8e0a32bc3805d25cd0e7-C001-215" ] ], "cite_sentences": [ "fa5413db2c8e0a32bc3805d25cd0e7-C001-12", "fa5413db2c8e0a32bc3805d25cd0e7-C001-16", "fa5413db2c8e0a32bc3805d25cd0e7-C001-48", "fa5413db2c8e0a32bc3805d25cd0e7-C001-215" ] }, "@MOT@": { "gold_contexts": [ [ "fa5413db2c8e0a32bc3805d25cd0e7-C001-16" ] ], "cite_sentences": [ "fa5413db2c8e0a32bc3805d25cd0e7-C001-16" ] }, "@DIF@": { "gold_contexts": [ [ "fa5413db2c8e0a32bc3805d25cd0e7-C001-100" ], [ "fa5413db2c8e0a32bc3805d25cd0e7-C001-112" ] ], "cite_sentences": [ "fa5413db2c8e0a32bc3805d25cd0e7-C001-100", "fa5413db2c8e0a32bc3805d25cd0e7-C001-112" ] }, "@EXT@": { "gold_contexts": [ [ "fa5413db2c8e0a32bc3805d25cd0e7-C001-100" ] ], "cite_sentences": [ "fa5413db2c8e0a32bc3805d25cd0e7-C001-100" ] }, "@SIM@": { "gold_contexts": [ [ "fa5413db2c8e0a32bc3805d25cd0e7-C001-128" ], [ "fa5413db2c8e0a32bc3805d25cd0e7-C001-175" ] ], "cite_sentences": [ "fa5413db2c8e0a32bc3805d25cd0e7-C001-128", "fa5413db2c8e0a32bc3805d25cd0e7-C001-175" ] }, "@USE@": { "gold_contexts": [ [ "fa5413db2c8e0a32bc3805d25cd0e7-C001-128" ] ], "cite_sentences": [ "fa5413db2c8e0a32bc3805d25cd0e7-C001-128" ] }, "@FUT@": { "gold_contexts": [ [ "fa5413db2c8e0a32bc3805d25cd0e7-C001-154" ], [ "fa5413db2c8e0a32bc3805d25cd0e7-C001-157" ] ], "cite_sentences": [ "fa5413db2c8e0a32bc3805d25cd0e7-C001-154", "fa5413db2c8e0a32bc3805d25cd0e7-C001-157" ] } } }, "ABC_43a52325987ea035136a6a718389d9_16": { "x": [ { "sent_id": "43a52325987ea035136a6a718389d9-C001-32", "text": "----------------------------------" }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-33", "text": "**FRENCH GOLD STANDARD**" }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-81", "text": "**CLUSTERING METHODS**" }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-104", "text": 
"----------------------------------" }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-105", "text": "**DATA AND PRE-PROCESSING**" }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-2", "text": "Verb classes which integrate a wide range of linguistic properties (Levin, 1993) have proved useful for natural language processing (NLP) applications." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-3", "text": "However, the real-world use of these classes has been limited because for most languages, no resources similar to VerbNet (KipperSchuler, 2005) are available." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-4", "text": "We apply a verb clustering approach developed for English to French -a language for which no such experiment has been conducted yet." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-5", "text": "Our investigation shows that not only the general methodology but also the best performing features are transferable between the languages, making it possible to learn useful VerbNet style classes for French automatically without languagespecific tuning." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-6", "text": "----------------------------------" }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-8", "text": "A number of verb classifications have been built to support natural language processing (NLP) tasks (Grishman et al., 1994; Miller, 1995; Baker et al., 1998; Palmer et al., 2005; Kipper-Schuler, 2005; Hovy et al., 2006) ." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-9", "text": "These include both syntactic and semantic classifications, as well as ones which integrate aspects of both." 
}, { "sent_id": "43a52325987ea035136a6a718389d9-C001-10", "text": "Classifications which integrate a wide range of linguistic properties can be particularly useful for NLP applications suffering from data sparseness." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-11", "text": "One such classification is VerbNet (Kipper-Schuler, 2005) ." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-12", "text": "Building on the taxonomy of Levin (1993) , VerbNet groups verbs (e.g. deliver, post, dispatch) into classes (e.g. SEND) on the basis of their shared meaning components and syntactic behaviour, identified in terms of meaning preserving diathesis alternations." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-13", "text": "Such classes can be identified across the entire lexicon, and they may also apply across languages, since their meaning components are said to be cross-linguistically applicable (Jackendoff, 1990) ." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-14", "text": "Offering a powerful tool for generalization, abstraction and prediction, VerbNet classes have been used to support many important NLP tasks, including e.g. computational lexicography, parsing, word sense disambiguation, semantic role labeling, information extraction, questionanswering, and machine translation (Swier and Stevenson, 2004; Dang, 2004; Shi and Mihalcea, 2005; Abend et al., 2008) ." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-15", "text": "However, to date their exploitation has been limited because for most languages, no Levin style classification is available." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-16", "text": "Since manual classification is costly (Kipper et al., 2008) automatic approaches have been proposed recently which could be used to learn novel classifications in a cost-effective manner (Joanis et al., 2008; Li and Brew, 2008; \u00d3 S\u00e9aghdha and Copestake, 2008; Vlachos et al., 2009; Sun and Korhonen, 2009 )." 
}, { "sent_id": "43a52325987ea035136a6a718389d9-C001-17", "text": "However, most work on Levin type classification has focussed on English." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-18", "text": "Large-scale research on other languages such as German (Schulte im Walde, 2006) and Japanese (Suzuki and Fukumoto, 2009 ) has focussed on semantic classification." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-19", "text": "Although the two classification systems have shared properties, studies comparing the overlap between VerbNet and WordNet (Miller, 1995) have reported that the mapping is only partial and many to many due to fine-grained nature of classes based on synonymy (Shi and Mihalcea, 2005; Abend et al., 2008) ." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-20", "text": "Only few studies have been conducted on Levin style classification for languages other than English." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-21", "text": "In their experiment involving 59 verbs and three classes, Merlo et al. (2002) applied a supervised approach developed for English to Italian, obtaining high accuracy (86.3%)." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-22", "text": "In another experiment with 60 verbs and three classes, they showed that features extracted from Chinese translations of English verbs can improve English classification." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-23", "text": "These results are promising, but those from a later experiment by Ferrer (2004) are not." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-24", "text": "Ferrer applied a clustering approach developed for English to Spanish, and evaluated it against the manual classification of V\u00e1zquez et al. (2000) , constructed using criteria similar (but not identical) to Levin's." 
}, { "sent_id": "43a52325987ea035136a6a718389d9-C001-25", "text": "This experiment involving 514 verbs and 31 classes produced results only slightly better than the random baseline." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-26", "text": "In this paper, we investigate the cross-linguistic potential of Levin style classification further." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-27", "text": "In past years, verb classification techniques -in particular unsupervised ones -have improved considerably, making investigations for a new language more feasible." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-28", "text": "We take a recent verb clustering approach developed for English (Sun and Korhonen, 2009 ) and apply it to French -a major language for which no such experiment has been conducted yet." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-29", "text": "Basic NLP resources (corpora, taggers, parsers and subcategorization acquisition systems) are now sufficiently developed for this language for the application of a state-ofthe-art verb clustering approach to be realistic." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-30", "text": "Our investigation reveals similarities between the English and French classifications, supporting the linguistic hypothesis (Jackendoff, 1990) and the earlier result of Merlo et al. (2002) that Levin classes have a strong cross-linguistic basis." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-31", "text": "Not only the general methodology but also best performing features are transferable between the languages, making it possible to learn useful classes for French automatically without language-specific tuning." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-34", "text": "The development of an automatic verb classification approach requires at least an initial gold standard." 
}, { "sent_id": "43a52325987ea035136a6a718389d9-C001-35", "text": "Some syntactic (Gross, 1975) and semantic (Vossen, 1998) verb classifications exist for French, along with ones which integrate aspects of both (Saint-Dizier, 1998) ." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-36", "text": "Since none of these resources offer classes similar to Levins', we followed the idea of Merlo et al. (2002) and translated a number of Levin classes from English to French." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-37", "text": "As our aim was to to investigate the cross-linguistic applicability of classes, we took an English gold standard which has been used to evaluate several recent clustering works -that of Sun et al. (2008) ." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-38", "text": "This resource includes 17 finegrained Levin classes." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-39", "text": "Each class has 12 member verbs whose predominant sense in English (according to WordNet) belongs to that class." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-40", "text": "Member verbs were first translated to French." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-41", "text": "Where several relevant translations were identified, each of them was considered." 
}, { "sent_id": "43a52325987ea035136a6a718389d9-C001-42", "text": "For each candidate verb, subcategorization frames (SCFs) were identified and diathesis alternations were considered using the criteria of Levin (1993) \u2022 clustered the features using a method which has proved promising in both English and German experiments: spectral clustering," }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-43", "text": "\u2022 evaluated the clusters both quantitatively (using the gold standard) and qualitatively," }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-44", "text": "\u2022 and compared the performance to that recently obtained for English in order to gain a better understanding of the cross-linguistic and language-specific properties of verb classification This work is described in the subsequent sections." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-45", "text": "----------------------------------" }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-46", "text": "**DATA: THE LEXSCHEM LEXICON**" }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-47", "text": "We extracted the features for clustering from LexSchem ." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-48", "text": "This large subcategorization lexicon provides SCF frequency information for 3,297 French verbs." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-49", "text": "It was acquired fully automatically from Le Monde newspaper corpus (200M words from years 1991-2000) using ASSCI -a recent subcategorization acquisition system for French (Messiant, 2008) ." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-50", "text": "Systems similar to ASSCI have been used in recent verb classification works e.g. (Schulte im Walde, 2006; Li and Brew, 2008; Sun and Korhonen, 2009 )." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-51", "text": "Like these other systems, ASSCI takes raw corpus data as input." 
}, { "sent_id": "43a52325987ea035136a6a718389d9-C001-52", "text": "The data is first tagged and lemmatized using the Tree-Tagger and then parsed using Syntex (Bourigault et al., 2005) ." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-53", "text": "Syntex is a shallow parser which employs a combination of statistics and heuristics to identify grammatical relations (GRs) in sentences." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-54", "text": "ASSCI considers GRs where the target verbs occur and constructs SCFs from nominal, prepositional and adjectival phrases, and infinitival and subordinate clauses." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-55", "text": "When a verb has no dependency, its SCF is considered as intransitive." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-106", "text": "The SCF-based features (F1-F3 and F14-F17) were extracted directly from LexSchem." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-56", "text": "ASSCI assumes no predefined list of SCFs but almost any combination of permitted constructions can appear as a candidate SCF." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-57", "text": "The number of automatically generated SCF types in LexSchem is 336." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-58", "text": "Many candidate SCFs are noisy due to processing errors and the difficulty of argument-adjunct distinction." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-59", "text": "Most SCF systems assume that true arguments occur in argument positions more frequently than adjuncts." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-60", "text": "Many systems also integrate filters for removing noise from system output." 
}, { "sent_id": "43a52325987ea035136a6a718389d9-C001-61", "text": "When LexSchem was evaluated after filter-ing its F-measure was 69 -which is similar to that of other current SCF systems We used the unfiltered version of the lexicon because English experiments have shown that information about adjuncts can help verb clustering (Sun et al., 2008) ." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-62", "text": "----------------------------------" }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-63", "text": "**FEATURES**" }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-64", "text": "Lexical entries in LexSchem provide a variety of material for verb clustering." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-65", "text": "Using this material, we constructed a range of features for experimentation." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-66", "text": "The first three include basic information about SCFs:" }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-67", "text": "F1: SCFs and their relative frequencies with individual verbs." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-68", "text": "SCFs abstract over particles and prepositions." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-69", "text": "F2: F1, with SCFs parameterized for the tense (the POS tag) of the verb." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-70", "text": "----------------------------------" }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-71", "text": "**F3: F2, WITH SCFS PARAMETERIZED FOR PREPOSITIONS (PP).**" }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-72", "text": "The following six features include information about the lexical context (co-occurrences) of verbs." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-73", "text": "We adopt the best method of Li and Brew (2008) where collocations (COs) are extracted from the window of words immediately preceding and following a lemmatized verb." 
}, { "sent_id": "43a52325987ea035136a6a718389d9-C001-74", "text": "Stop words are removed prior to extraction." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-75", "text": "We adopt a fully unsupervised approach to SP acquisition using the method of Sun and Korhonen (2009) , with the difference that we determine the optimal number of SP clusters automatically following Zelnik-Manor and Perona (2004) ." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-76", "text": "The method is introduced in the following section." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-77", "text": "The approach involves (i) taking the GRs (SUBJ, OBJ, IOBJ) associated with verbs, (ii) extracting all the argument heads in these GRs, and (iii) clustering the resulting N most frequent argument heads into M classes." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-78", "text": "The empirically determined N 200 was used." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-79", "text": "The method produced 40 SP clusters." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-80", "text": "----------------------------------" }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-131", "text": "As expected, SPEC (the 2nd column) outperforms K-Means (the 3rd column)." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-82", "text": "Spectral clustering (SPEC) has proved promising in previous verb clustering experiments (Brew and Schulte im Walde, 2002; Sun and Korhonen, 2009 ) and other similar NLP tasks involving high dimensional feature space (Chen et al., 2006) ." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-83", "text": "Following Sun and Korhonen (2009) we used the MNCut spectral clustering (Meila and Shi, 2001 ) which has a wide applicability and a clear probabilistic interpretation (von Luxburg, 2007; Verma and Meila, 2005) ." 
}, { "sent_id": "43a52325987ea035136a6a718389d9-C001-84", "text": "However, we extended the method to determine the optimal number of clusters automatically using the technique proposed by (Zelnik-Manor and Perona, 2004) ." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-85", "text": "Clustering groups a given set of verbs V = {v n } N n=1 into a disjoint partition of K classes." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-86", "text": "SPEC takes a similarity matrix as input." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-87", "text": "All our features can be viewed as probabilistic distributions because the combination of different features is performed via parameterization." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-88", "text": "Thus we use the Jensen-Shannon divergence (JSD) to construct the similarity matrix." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-89", "text": "The similarity matrix W is constructed where W ij = exp (\u2212d jsd (v, v ) )." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-90", "text": "In SPEC, the similarities W ij are viewed as the connection weight ij of a graph G over V ." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-91", "text": "The similarity matrix W is thus the adjacency matrix for G. The degree of a vertex i is d i = N j=1 w ij ." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-92", "text": "A cut between two partitions A and A is defined to be Cut(A, A ) = m\u2208A,n\u2208A W mn ." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-93", "text": "The similarity matrix W is normalized into a stochastic matrix P ." 
}, { "sent_id": "43a52325987ea035136a6a718389d9-C001-94", "text": "The degree matrix D is a diagonal matrix where" }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-95", "text": "It was shown by Meila and Shi (2001) that if P has the K leading eigenvectors that are piecewise constant 1 with respect to a partition I * and their eigenvalues are not zero, then I * minimizes the multiway normalized cut(MNCut):" }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-96", "text": "P mn can be interpreted as the transition probability between vertices m, n. The criterion can thus be expressed as MNCut(I) = , which is the sum of transition probabilities across different clusters." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-97", "text": "This criterion finds the partition where the random walks are most likely to happen within the same cluster." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-98", "text": "In practice, the leading eigenvectors of P are not piecewise constant. But we can extract the partition by finding the approximately equal elements in the eigenvectors using a clustering algorithm like K-Means." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-99", "text": "As the value of K is not known beforehand, we use Zelnik-Manor and Perona (2004)'s method to estimate it." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-100", "text": "This method finds the optimal value by minimizing a cost function based on the eigenvector structure of W ." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-101", "text": "Like Brew and Schulte im Walde (2002), we compare SPEC against a K-Means baseline." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-102", "text": "We used the Matlab implementation with euclidean distance as the distance measure." 
}, { "sent_id": "43a52325987ea035136a6a718389d9-C001-103", "text": "1 The eigenvector v is piecewise constant with respect to I if v(i) = v(j)\u2200i, j \u2208 I k and k \u2208 1, 2...K 6 Experimental Evaluation" }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-107", "text": "The CO (F4-F9) and LP features (F10-F13) were extracted from the raw and parsed corpus sentences, respectively, which were used for creating the lexicon." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-108", "text": "Features that only appeared once were removed." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-109", "text": "Feature vectors were normalized by the sum of the feature values before clustering." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-110", "text": "Since our clustering algorithms have an element of randomness, we repeated clustering multiple times." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-111", "text": "We report the results that minimize the distortion (the distance to cluster centroid)." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-112", "text": "----------------------------------" }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-113", "text": "**EVALUATION MEASURES**" }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-114", "text": "We employ the same measures for evaluation as previously employed e.g. by\u00d3 S\u00e9aghdha and Copestake (2008) and Sun and Korhonen (2009) ." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-115", "text": "The first measure is modified purity (mPUR) -a global measure which evaluates the mean precision of clusters." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-116", "text": "Each cluster is associated with its prevalent class." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-117", "text": "The number of verbs in a cluster K that take this class is denoted by n prevalent (K)." 
}, { "sent_id": "43a52325987ea035136a6a718389d9-C001-118", "text": "Verbs that do not take it are considered as errors." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-119", "text": "Clusters where n prevalent (K) = 1 are disregarded as not to introduce a bias towards singletons:" }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-120", "text": "----------------------------------" }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-121", "text": "**NUMBER OF VERBS**" }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-122", "text": "The second measure is weighted class accuracy (ACC): the proportion of members of dominant clusters DOM-CLUST i within all classes c i ." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-123", "text": "number of verbs mPUR and ACC can be seen as a measure of precision(P) and recall(R) respectively." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-124", "text": "We calculate F measure as the harmonic mean of P and R: F = 2 \u00b7 mPUR \u00b7 ACC mPUR + ACC The random baseline (BL) is calculated as follows: BL = 1/number of classes 7 Evaluation" }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-125", "text": "----------------------------------" }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-126", "text": "**QUANTITATIVE EVALUATION**" }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-127", "text": "In our first experiment, we evaluated 116 verbsthose which appeared in LexSchem the minimum of 150 times." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-128", "text": "We did this because English experiments had shown that due to the Zipfian nature of SCF distributions, 150 corpus occurrences are typically needed to obtain a sufficient number of frames for clustering (Sun et al., 2008) ." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-129", "text": "Table 2 shows F-measure results for all the features." 
}, { "sent_id": "43a52325987ea035136a6a718389d9-C001-130", "text": "The 4th column of the table shows, for comparison, the results of Sun and Korhonen (2009) obtained for English when they used the same features as us, clustered them using SPEC, and evaluated them against the English version of our gold standard, also using F-measure 2 ." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-132", "text": "Looking at the basic SCF features F1-F3, we can see that they perform significantly better than the BL method." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-133", "text": "F3 performs the best among the three features both in French (50.6 F) and in English (63.3 F)." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-134", "text": "We therefore use F3 as the SCF feature in F14-F17 (the same was done for English)." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-135", "text": "In French, most CO features (F4-F9) outperform SCF features." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-136", "text": "The best result is obtained with F7: 55.1 F. This is clearly better than the best SCF result 50.6 (F3)." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-137", "text": "This result is interesting since SCFs correspond better than COs with features used in manual Levin classification." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-138", "text": "Also, SCFs perform considerably better than COs in the English experiment (we only have the result for F4 available, but it is considerably lower than the result for F3)." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-139", "text": "However, earlier English studies have reported contradictory results (e.g. Li and Brew (2008) showed that CO performs better than SCF in supervised verb classification), indicating that the role of CO features in verb classification requires further investigation." 
}, { "sent_id": "43a52325987ea035136a6a718389d9-C001-140", "text": "Looking at the LP features, F13 produces the best F (52.7) for French which is slightly better than the best SCF result for the language." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-141", "text": "Also in English, F13 performs the best in this feature group and yields a higher result than the best SCFbased feature F3." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-142", "text": "Parameterizing the best SCF feature F3 with LPs (F14-16) and SPs (F17) yields better performance in French." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-143", "text": "F15 and F17 have the F of 54.5 and 54.6, respectively." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-144", "text": "These results are so close to the result of the best CO feature F7 (55.1 -which is the highest result in this experiment) that the differences are not statistically significant." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-145", "text": "In English, the results of F14-F17 are similarly good; however, only F17 beats the already high performance of F13." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-146", "text": "On the basis of this experiment, it is difficult to tell whether shallow CO features or more sophisticated SCF-based features are better for French." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-147", "text": "In the English experiment sophisticated features performed better (the SCF-SP feature was the best)." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-148", "text": "However, the English experiment employed a much larger dataset." 
}, { "sent_id": "43a52325987ea035136a6a718389d9-C001-149", "text": "These more sophisticated features may suffer from data sparseness in our French experiment since although we required the minimum of 150 occurrences per verb, verb clustering performance tends to improve when more data is available, and given the fine-grained nature of LexShem SCFs it is likely that more data is required for optimal performance." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-150", "text": "We therefore performed another experiment with French on the full set of 147 verbs, using SPEC, where we investigated the effect of instance filtering on the performance of the best features from each feature group: F3, F7, F13 and F17." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-151", "text": "The results shown in Table 3 reveal that the performance of the features remains fairly similar until the instance threshold of 1000." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-152", "text": "When 2000 occurrences per verb are used, the differences become clearer, until at the threshold of 4000, it is obvious that the most sophisticated SCF-SP feature F17 is by far the best feature for French (65.4 F) and the SCF feature F3 the second best (60.5 F)." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-153", "text": "The COfeature F7 and the LP feature F13 are not nearly as good (53.4 and 51.0 F)." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-154", "text": "Although the results at different thresholds are not comparable due to the different number of verbs and classes (see columns 2-3), the results for features at the same threshold are." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-155", "text": "Those results suggest that when 2000 or more occurrences per verb are used, most features perform like they performed for English in the experiment of Sun and Korhonen (2009) which is not typical to many other classes." 
}, { "sent_id": "43a52325987ea035136a6a718389d9-C001-156", "text": "Interestingly, Levin classes 29.2, 36.1, 37.3, and 37.7 were among the best performing classes also in the supervised verb classification experiment of Sun et al. (2008) because these classes have distinctive characteristics also in English." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-157", "text": "The benefit of sophisticated features which integrate also semantic (SP) information (F17) is particularly evident for classes with nondistinctive syntactic characteristics." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-158", "text": "For example, the intransitive verbs in 43.1 LIGHT EMISSION class (e.g. briller,\u00e9tinceler, flamboyer) are difficult to cluster based on syntax only, but semantic features work because the verbs pose strong SPs on their subjects (entities capable of light emission)." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-159", "text": "In the experiment of Sun et al. (2008) , 43.1 was the worst performing class, possibly because no semantic features were used in the experiment." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-160", "text": "The most frequent source of error is syntactic idiosyncracy." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-161", "text": "This is particularly evident for classes 10.1 REMOVE and 45.4 CHANGE OF STATE." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-162", "text": "Although verbs in these classes can take similar SCFs and alternations, only some of them are frequent in data." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-163", "text": "For example, the SCF\u00f4ter X a Y is frequent for verbs in 10.1, but not\u00f4ter X de Y. Although class 10.1 did not suffer from this problem in the English experiment of Sun et al. (2008) , class 45.4 did." 
}, { "sent_id": "43a52325987ea035136a6a718389d9-C001-164", "text": "Class 45.4 performs particularly bad in French also because its member verbs are low in frequency." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-165", "text": "Some errors are due to polysemy, caused partly by the fact that the French version of the gold standard was not controlled for this factor." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-166", "text": "Some verbs have their predominant senses in classes which are missing in the gold standard, e.g. the most frequent sense of retenir is memorize, not keep as in the gold standard class 13.5.1." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-167", "text": "GET." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-168", "text": "Finally, some errors are not true errors but demonstrate the capability of clustering to learn novel information." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-169", "text": "For example, the CHANGE OF STATE class 45.4 includes many antonyms (e.g. weaken vs. strenghten)." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-170", "text": "Clustering (using F17) separates these antonyms, so that verbs adoucir, att\u00e9nuer and temp\u00e9rer appear in one cluster and consolider and renforcer in another." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-171", "text": "Although these verbs share the same alternations, their SPs are different." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-172", "text": "The opposite effect can be observed when clustering maps together classes which are semantically and syntactically related (e.g. 36.1 CORRESPOND and 37.7 SPEAK)." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-173", "text": "Such classes are distinct in Levin and VerbNet, although should ideally be related." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-174", "text": "Cases such as these show the potential of clustering in discovering novel valuable information in data." 
}, { "sent_id": "43a52325987ea035136a6a718389d9-C001-175", "text": "----------------------------------" }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-176", "text": "**DISCUSSION AND CONCLUSION**" }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-177", "text": "When sufficient corpus data is available, there is a strong correlation between the types of features which perform the best in English and French." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-178", "text": "When the best features are used, many individual Levin classes have similar performance in the two languages." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-179", "text": "Due to differences in data sets direct comparison of performance figures for English and French is not possible." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-180", "text": "When considering the general level of performance, our best performance for French (65.4 F) is lower than the best performance for English in the experiment of Sun and Korhonen (2009) ." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-181", "text": "However, it does compare favourably to the performance of other stateof-the-art (even supervised) English systems (Joanis et al., 2008; Li and Brew, 2008; \u00d3 S\u00e9aghdha and Copestake, 2008; Vlachos et al., 2009) ." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-182", "text": "This is impressive considering that we experimented with a fully unsupervised approach originally developed for another language." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-183", "text": "When aiming to improve performance further, employing larger data is critical." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-184", "text": "Most recent experiments on English have employed bigger data sets, and unlike us, some of them have only considered the predominant senses of medium-high frequency verbs." 
}, { "sent_id": "43a52325987ea035136a6a718389d9-C001-185", "text": "As seen in section 7.1, such differences in data can have significant impact on performance." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-186", "text": "However, parser and feature extraction performance can also play a big role in overall accuracy, and should therefore be investigated further (Sun and Korhonen, 2009 )." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-187", "text": "The relatively low performance of basic LP features in French suggests that at least some of the current errors are due to parsing." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-188", "text": "Future research should investigate the source of error at different stages of processing." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-189", "text": "In addition, it would be interesting to investigate whether language-specific tuning (e.g. using language specific features such as auxiliary classes) can further improve performance on French." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-190", "text": "Earlier works most closely related to ours are those of Merlo et al. (2002) and Ferrer (2004) ." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-191", "text": "Our results contrast with those of Ferrer who showed that a clustering approach does not transfer well from English to Spanish." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-192", "text": "However, she used basic SCF and named entity features only, and a clustering algorithm less suitable for high dimensional data." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-193", "text": "Like us, Merlo et al. (2002) created a gold standard by translating Levin classes to another language (Italian)." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-194", "text": "They also applied a method developed for English to Italian, and reported good overall performance using features developed for English." 
}, { "sent_id": "43a52325987ea035136a6a718389d9-C001-195", "text": "Although the experiment was small (focussing on three classes and a few features only) and involved supervised classification, the results agree with ours." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-196", "text": "These experiments support the linguistic hypothesis that Levin style classification can be cross-linguistically applicable." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-197", "text": "A clustering technique such as the one presented here could be used as a tool for investigating whether classifications are similar across a wider range of more diverse languages." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-198", "text": "From the NLP perspective, the fact that an unsupervised technique developed for one language can be applied to another language without the need for substantial tuning means that automatic techniques could be used to hypothesise useful Levin style classes for further languages." }, { "sent_id": "43a52325987ea035136a6a718389d9-C001-199", "text": "This, in turn, could facilitate the creation of multilingual VerbNets in the future." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "43a52325987ea035136a6a718389d9-C001-16" ], [ "43a52325987ea035136a6a718389d9-C001-82" ] ], "cite_sentences": [ "43a52325987ea035136a6a718389d9-C001-16", "43a52325987ea035136a6a718389d9-C001-82" ] }, "@USE@": { "gold_contexts": [ [ "43a52325987ea035136a6a718389d9-C001-28" ], [ "43a52325987ea035136a6a718389d9-C001-83" ], [ "43a52325987ea035136a6a718389d9-C001-114" ] ], "cite_sentences": [ "43a52325987ea035136a6a718389d9-C001-28", "43a52325987ea035136a6a718389d9-C001-83", "43a52325987ea035136a6a718389d9-C001-114" ] }, "@SIM@": { "gold_contexts": [ [ "43a52325987ea035136a6a718389d9-C001-50" ], [ "43a52325987ea035136a6a718389d9-C001-83" ], [ "43a52325987ea035136a6a718389d9-C001-114" ], [ "43a52325987ea035136a6a718389d9-C001-130" ], [ "43a52325987ea035136a6a718389d9-C001-155" ] ], "cite_sentences": [ "43a52325987ea035136a6a718389d9-C001-50", "43a52325987ea035136a6a718389d9-C001-83", "43a52325987ea035136a6a718389d9-C001-114", "43a52325987ea035136a6a718389d9-C001-130", "43a52325987ea035136a6a718389d9-C001-155" ] }, "@DIF@": { "gold_contexts": [ [ "43a52325987ea035136a6a718389d9-C001-75" ], [ "43a52325987ea035136a6a718389d9-C001-180" ] ], "cite_sentences": [ "43a52325987ea035136a6a718389d9-C001-75", "43a52325987ea035136a6a718389d9-C001-180" ] }, "@EXT@": { "gold_contexts": [ [ "43a52325987ea035136a6a718389d9-C001-75" ] ], "cite_sentences": [ "43a52325987ea035136a6a718389d9-C001-75" ] }, "@FUT@": { "gold_contexts": [ [ "43a52325987ea035136a6a718389d9-C001-186" ] ], "cite_sentences": [ "43a52325987ea035136a6a718389d9-C001-186" ] } } }, "ABC_be67496882917c2a44afb42e6f9f15_16": { "x": [ { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-2", "text": "We introduce the Scratchpad Mechanism, a novel addition to the sequence-to-sequence (seq2seq) neural network architecture and demonstrate its effectiveness in improving the overall fluency of seq2seq models 
for natural language generation tasks." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-3", "text": "By enabling the decoder at each time step to write to all of the encoder output layers, Scratchpad can employ the encoder as a \"scratchpad\" memory to keep track of what has been generated so far and thereby guide future generation." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-21", "text": "We find, for each task, that Scratchpad attains improvements over several strong baselines: Sequence-to-Sequence with attention Bahdanau et al., 2014) , copy-enhanced approaches (Gu et al., 2016; Vinyals et al., 2015) , and coverageenhanced approaches (Tu et al., 2016; See et al., 2017) ." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-22", "text": "Scratchpad furthermore obtains state-ofthe-art performance for each task." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-23", "text": "Qualitative assessments in the form of human judgements (question generation), attention visualization (MT) and sample output (summarization) provide further evidence of the ability of Scratchpad to generate fluent and expressive output." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-24", "text": "----------------------------------" }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-25", "text": "**BACKGROUND**" }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-26", "text": "Scratchpad builds upon a standard attention-based seq2seq neural architecture (Bahdanau et al., 2014 ) comprised of (a) an encoder that operates token-by-token over the input, (b) a decoder that produces the output, and (c) an attention mechanism that allows the decoder to focus on different parts of the input." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-27", "text": "In the subsections below, we first briefly review this architecture (we assume the reader is familiar with the framework)." 
}, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-28", "text": "In Section 3, we introduce the Scratchpad mechanism." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-29", "text": "Encoder Let X = [x 1 , ..., x n ] denote an input sequence of length N where x i is the i-th token." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-30", "text": "The encoder is a recurrent neural network (RNN) that produces in its final layer a sequence of hidden states [h 1 , ..., h n ] = RNN({x 1 , ..., x n })." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-31", "text": "These can be viewed as a sequence of token-level feature vectors learned from the input." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-32", "text": "Decoder Let the decoding sequence be indexed by the superscript i. The decoder is an RNN whose initial hidden state s 0 is set to the final state(s) of the of the encoder." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-33", "text": "Attention At every decoding step i, an attention mechanism, i.e., an attentive read (often termed attentional context) (c i ), is derived from the encoder output states ([h 1 , ..., h n ])." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-34", "text": "Concretely, attention is applied by first computing a score for each encoder output, h t :" }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-35", "text": "where weight matrices W 1 and W 2 are learned parameters." 
}, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-36", "text": "These scores, score i 1..T , are then normalized into a probability distribution:" }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-37", "text": "The attentive read operation is then the weighted average of encoder outputs according to this distribution, which allows the decoder to focus on different parts of the input at different timesteps i:" }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-38", "text": "3 Scratchpad Mechanism" }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-39", "text": "The above attention mechanism has been widely successful in many generation tasks but the quality of generated text still suffers from caveats and requires significant tuning." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-40", "text": "We augment attention with a Scratchpad mechanism to introduce higher quality generated text with less effort." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-41", "text": "Intuitively, Scratchpad adds one simple step to the decoder: treating the encoder output states, [h 1 , ..., h n ], as a scratchpad, thus it writes to them as if the set of states were an external memory." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-42", "text": "Exactly how this is done is described next." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-43", "text": "Without Scratchpad, the decoder's workflow at every output timestep step i is as follows: 1. Read attentively (c i ) from the encoder outputs ([h 1 , ..., h n ]) using the current state, s i . 2." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-44", "text": "Update s i using the most recently generated output token, y i\u22121 , and the results of the attentive read (c i )." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-45", "text": "3. Output a distribution over the output vocabulary\u0177 i ." 
}, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-46", "text": "Scratchpad simply adds a fourth step: 4. Write an update (u i ) to the encoder outputs ([h 1 , ..., h n ]) in an attentive fashion (\u03b1 i 1..T ), treating the encoder outputs as if they were cells in an external memory." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-47", "text": "More specifically, to calculate both the writeattention and the update, Scratchpad uses the concatenation of the decoder state after steps 1-3 (s i+1 ) and the attentive read (c i ):" }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-48", "text": "In essence, the Scratchpad consists of two components." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-49", "text": "The first determines what 'notes' to keep (u i )." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-50", "text": "The second is similar to the 'forget' gate in an LSTM, where the network decides how much Figure 1 : The Scratchpad Mechanism first computes the update probability (\u03b1 i t ) for each encoder state according to Eq. 5, then computes a global update u i according to Eq. 6, and finally updates the encoder states according to Eq. 4. to overwrite a cell (1 \u2212 \u03b1 i t ) versus how much to keep past information (\u03b1 i t ) for that cell." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-51", "text": "These two components are used in concert (see Fig. 1 Figure 1 shows the outline of the scratchpad mechanism update at multiple timesteps." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-52", "text": "----------------------------------" }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-53", "text": "**EXPERIMENTS**" }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-54", "text": "In this section we describe experimental setup and results for Machine Translation, Question Generation, and Summarization tasks which exhibit a variety of input modalities and strategies required to perform well." 
}, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-55", "text": "We work with structured and unstructured language and several sequence to sequences architectures i.e. attention, pointing, and copying." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-56", "text": "Machine translation is a canonical sequence-to-sequence task where pairwise word-level or phrase-level generation is expected." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-57", "text": "Question Generation from logical forms requires reasoning about the syntax, parse tree, and vocabulary of the input sequence to infer the meaning of the logical form program and utilize copy-mechanism to copy entities." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-58", "text": "Lastly, summarization requires understanding both the global and the local context of a sentence within a document, identifying spans that are informative and diverse, and generating coherent representative summaries." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-59", "text": "Demonstrating a single mechanism that reaches state of the art on such a diverse set of natural language tasks underscores the generalizability of our technique, particularly given the large range in number of training examples (3k, 56k, 153k, 287k) across datasets." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-60", "text": "----------------------------------" }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-61", "text": "**TRANSLATION**" }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-62", "text": "We evaluate on the IWLST 14 English to German and Spanish to English translation datasets (Cettolo et al., 2015) as well as the IWSLT 15 (Cettolo et al., 2015) English to Vietnamese translation dataset." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-63", "text": "For IWSLT14 (Cettolo et al., 2015) , we compare to the models evaluated by He et al. 
(2018) , which includes a transformer (Vaswani et al., 2017) and RNN-based models (Bahdanau et al., 2014) ." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-64", "text": "For IWSLT15, we primarily compare to GNMT (Wu et al., 2016) , which incorporates Coverage (Tu et al., 2016) ." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-65", "text": "Table 1 shows BLEU scores of our approach on 3 IWSLT translation tasks along with reported results from previous work." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-66", "text": "Our approach achieves state-of-the-art or comparable results on all datasets." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-67", "text": "Experimental Details For IWSLT14, our encoder is a 3-layer Bi-LSTM (Hochreiter and Schmidhuber, 1997) , where outputs are combined by concatenation, and the decoder is a 3-layer LSTM as well." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-68", "text": "For IWSLT15 the encoder and decoder are 2-layers." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-69", "text": "We follow , using the 'general' score function, input feeding, and combining the attentional context and hidden state." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-70", "text": "Since we use input feeding, Steps (1) and (2) in Section 3 are switched." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-71", "text": "All our models have a hidden size of 512 (for the LSTM and any MLP's)." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-72", "text": "The internal layers of the decoder are residual, adding their output to their input and putting it through Layer Normalization (Ba et al., 2016) ." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-73", "text": "Sentences were encoded using byte-pair encoding (Sennrich et al., 2016) , with a shared source-target vocabulary of 10, 000 for De\u2192En and Es\u2192En (En \u2192Vi uses words as tokens to be comparable to Wu et al. (2016) )." 
}, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-74", "text": "Source and target word embeddings are dimension 128." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-75", "text": "We use dropout (Srivastava et al., 2014) in the encoder and decoder with a probability of 0.1." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-76", "text": "We use the Adam optimizer (Kingma and Ba, 2014), with an initial learning rate of 0.002. We train for 30/20 epochs for IWSLT14/15, decaying the learning rate by a factor of 0.7 whenever the validation loss does not improve from the last epoch." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-77", "text": "Each training batch contained at most 2000 source or target tokens." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-78", "text": "We use label smoothing with ls = 0.1 (Szegedy et al., 2016)." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-79", "text": "We average the last 5 epochs to obtain the final model and run with a beam of size 4." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-80", "text": "----------------------------------" }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-81", "text": "**QUESTION GENERATION**" }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-82", "text": "We use the task of question generation: Given a structured representation of a query against a knowledge base or a database (e.g. a logical form), produce the corresponding natural language question." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-83", "text": "We use two datasets consisting of (question, logical form) pairs: WebQuestionsSP (Yih et al., 2016) (a standard dataset for semantic parsing, where the logical form is in SPARQL), and WikiSQL (Zhong et al., 2017) (where the logical form is SQL)." 
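Label smoothing with ls = 0.1 (Szegedy et al., 2016) replaces the one-hot target with a mixture of the gold label and a uniform distribution over the vocabulary. A minimal sketch of the smoothed-target construction (names are illustrative):

```python
def label_smoothed_targets(gold_index, vocab_size, ls=0.1):
    # Mix (1 - ls) of the one-hot gold target with ls of a uniform
    # distribution over the whole vocabulary; the result sums to 1.
    smooth = ls / vocab_size
    targets = [smooth] * vocab_size
    targets[gold_index] += 1.0 - ls
    return targets
```

With ls = 0.1 and a 10,000-token vocabulary, the gold token receives probability 0.90001 and every other token 0.00001, discouraging the model from becoming over-confident.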
}, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-84", "text": "Both datasets are small, with the former having 3098 training and 1639 testing examples, and the latter being an order of magnitude larger with 56346 training and 15873 testing examples." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-85", "text": "We evaluate metrics at both a corpus level (to indicate how natural output questions are) and at a per-sentence level (to demonstrate how well output questions exactly match the gold question)." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-86", "text": "BLEU (Papineni et al., 2002) , ROUGE (Lin, 2004) are chosen for precision and recall-based metrics." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-87", "text": "METEOR (Banerjee and Lavie, 2005 ) is chosen to deal with stemming and synonyms." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-88", "text": "We noticed that many tokens that appear in the logical form are also present in the natural language form for each example." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-89", "text": "In fact, nearly half of the tokens in the question appear in the corresponding SPARQL of the WebQuestionSP dataset (Yih et al., 2016) , implying that a network with the ability to copy from the input could see significant gains on the task." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-90", "text": "Accordingly, we compare our Scratchpad Mechanism against three baselines: (1) Seq2Seq, (2) Copynet and (3) Coverage, a method introduced by Tu et al. (2016) that aims to solve attention-related problems." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-91", "text": "Seq2Seq is the standard approach introduced in ." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-92", "text": "The Copynet (He et al., 2017) baseline additionally gives the Seq2Seq model the ability to copy vocabulary from the source to the target." 
}, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-93", "text": "From Table 2 it is clear that our approach, Scratchpad, outperforms all baselines on all the metrics." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-94", "text": "Experimental Details Our encoder is a 2-layer bi-directional GRU where outputs are combined by concatenation, and our decoder is a 2-layer GRU." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-95", "text": "We use the attention mechanism from Section 4.1." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-96", "text": "We train all models for 75 epochs with a batch size of 32, a hidden size of 512 (for the GRU and any MLPs), and a word vector size of 300." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-97", "text": "Dropout is used on every layer except the output layer, with a drop probability of 0.5." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-98", "text": "Where GloVe vectors (Pennington et al., 2014) are used to initialize word vectors, we use 300-dimensional vectors trained on Wikipedia and Gigaword (6B.300D)." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-99", "text": "We use the Adam optimizer with a learning rate of 1e\u22124, and we do teacher forcing (Williams and Zipser, 1989) with probability 0.5." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-100", "text": "These hyperparameters were tuned for our Seq2Seq baselines and held constant for the rest of the models." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-101", "text": "The vocabulary consists of all tokens appearing at least once in the training set." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-102", "text": "----------------------------------" }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-103", "text": "**SUMMARIZATION**" }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-104", "text": "We use the CNN/Daily Mail dataset (Hermann et al., 2015; Nallapati et al., 2016b) as in See et al. (2017)." 
}, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-105", "text": "The dataset consists of 287,226 training, 13,368 validation, and 11,490 test examples." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-106", "text": "Each example is an online news article (781 tokens on average) along with a multi-sentence summary (56 tokens, 3.75 sentences on average)." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-107", "text": "As in See et al. (2017), we operate on the original non-anonymized version of the data." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-108", "text": "We follow See et al. (2017) in evaluating with ROUGE (Lin, 2004) and METEOR (Banerjee and Lavie, 2005)." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-109", "text": "We report F1 scores for ROUGE-1, ROUGE-2, and ROUGE-LCS (measuring word, bigram, and longest-common-subsequence overlap, respectively), and we report METEOR in exact and full mode." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-110", "text": "We compare to the pointer-generator baseline and the coverage variant introduced by See et al. (2017)." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-111", "text": "See et al. (2017) use a multi-step training procedure to improve the performance of the coverage component: a pointer-generator model is first trained without the coverage component for a large number of iterations, then trained with the component and a tuned auxiliary coverage loss; they find that both the auxiliary loss and pre-training the network without coverage are required to improve performance." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-113", "text": "As demonstrated in Table 3, with Scratchpad we are able to improve performance over See et al. (2017) in all the ROUGE metrics, statistically significantly for ROUGE-2 and ROUGE-L, while remaining comparable in METEOR." 
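The ROUGE-1 F1 score reported above can be sketched as a unigram-overlap F-measure; this toy version (our own simplification) ignores the stemming and bootstrap resampling of the official ROUGE toolkit:

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    # Unigram overlap, clipped by multiset intersection:
    # precision is overlap / |candidate|, recall is overlap / |reference|.
    c, r = Counter(candidate), Counter(reference)
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)
```

ROUGE-2 replaces unigrams with bigrams, and ROUGE-LCS replaces the clipped overlap with the length of the longest common subsequence.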
}, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-114", "text": "We reach this performance with half of the training iterations, no pretraining, and without the additional memory outlay and model complexity of including an auxiliary loss." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-115", "text": "----------------------------------" }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-116", "text": "**EXPERIMENTAL DETAILS**" }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-117", "text": "We use the same setup as in See et al. (2017) : The encoder is a single-layer bi-directional LSTM where outputs are combined by concatenation, and the decoder consists of a single-layer LSTM." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-118", "text": "The encoder states modified by the scratchpad mechanism are the outputs of the LSTM at every timestep, i.e. the 'hidden' state." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-119", "text": "We use the same attention mechanism as in See et al. (2017) to calculate the attentive read and the attentive write probabilities \u03b1 i t for the scratchpad mechanism." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-120", "text": "See et al. (2017) introduce a multistep training procedure where a pointer-generator model is first trained with the vanilla cross-entropy objective for 230k iterations." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-121", "text": "Then the coverage component is added and the full model is further trained for 3k iterations with the combined crossentropy coverage loss." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-122", "text": "See et al. (2017) use Adagrad (Duchi et al., 2010) with learning rate 0.15 and an initial accumulator value of 0.1." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-123", "text": "Early stopping on validation is used to select the final model." 
}, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-124", "text": "We adopt a simpler procedure, training our full model with the scratchpad mechanism for 100k iterations with Adam and a learning rate of 1e\u22124, as compared to the two-step procedure in See et al. (2017) taking 230k iterations." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-125", "text": "We follow See et al. (2017) in using a batch size of 16 and clipping gradient norms to 2." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-126", "text": "----------------------------------" }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-127", "text": "**ANALYSIS**" }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-128", "text": "To gain insight into the behaviour and performance of our Scratchpad Mechanism, we analyze the output for Question Generation and Translation." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-129", "text": "We start by conducting a human evaluation study on the Question Generation task, since this task is relatively new and it is well known that quantitative metrics like BLEU do not always correlate with the human-assessed quality of generated text." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-130", "text": "Later, we use attention heatmaps to visualize how the scratchpad mechanism drives the attention weights to be more focused on the relevant source token(s)." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-131", "text": "Additionally, we analyze the entropies of the attention weights to understand how the scratchpad mechanism better allows models to attend to the input." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-132", "text": "We hypothesize that this is one of the reasons for the good performance of the scratchpad mechanism, as the decoder ends up being more focused than with standard seq2seq models." 
}, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-133", "text": "----------------------------------" }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-134", "text": "**HUMAN EVALUATIONS**" }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-135", "text": "For our human evaluation we use two standard metrics from the machine translation community: Adequacy and Fluency (Bojar et al., 2017)." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-136", "text": "To compute Adequacy, human judges are presented with a reference output and the system-proposed output, and are asked to rate the adequacy of the proposed output in conveying the meaning of the reference output on a scale from 0-10." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-137", "text": "To compute Fluency, the judges are asked to rate, on a scale from 0-10, whether the proposed output is a fluent English sentence." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-138", "text": "We used crowd-sourced judges." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-139", "text": "Each output is rated by 3 judges." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-140", "text": "Table 4 summarizes the human evaluation results for our Scratchpad Mechanism and two baselines." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-141", "text": "As the table shows, the judges assigned higher fluency and adequacy scores to our approach than to both the coverage-based and copynet baselines." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-142", "text": "(The relation between BLEU scores and human judgments on canonical tasks such as machine translation and summarization has already been studied in the literature (Bojar et al., 2017; Graham, 2015).)" }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-143", "text": "In the table we also report the fluency score of the gold questions as a way to measure the gap between the generated questions and the expected ones." 
}, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-144", "text": "Our approach is only 2 points behind the gold when it comes to generation fluency." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-145", "text": "----------------------------------" }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-146", "text": "**MODEL**" }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-147", "text": "Table 5: The percentage of times judges preferred one result over the other." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-148", "text": "In a head-to-head evaluation the output of Scratchpad Encoder is 9 and 2 times as likely to be chosen vs. Copynet and Coverage, respectively." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-149", "text": "Win rate is the percentage of times Scratchpad was picked when the judges chose a single winner (not a tie)." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-150", "text": "Additionally, we design a side-by-side experiment where judges are presented with 2 generated questions from 2 different systems along with the reference, and asked to judge which output presents a better paraphrase of the reference question." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-151", "text": "Judges take into consideration the grammatical correctness of the question as well as its ability to capture the meaning of the reference question fluently." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-152", "text": "In Table 5 we show that in head-to-head evaluations, human judges are nine times as likely to prefer scratchpad-generated questions over copynet and nearly two times over coverage, accentuating the improved fluency and adequacy of scratchpad-generated questions." 
}, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-153", "text": "----------------------------------" }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-154", "text": "**ATTENTION VISUALIZATION AND ANALYSIS**" }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-155", "text": "In the standard attention setup, the weights assigned to each encoder output are determined by the decoder internal state and the encoder output (s_i and h_t) in Equation 1." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-156", "text": "Throughout the decoding steps, only s_i varies from one timestep to the next." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-157", "text": "Our scratchpad mechanism allows the encoder outputs to change in order to keep track of generated output, so that both s_i and h_t vary from one timestep to the next; hence more focused attention can be generated." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-158", "text": "We demonstrate this behavior in Fig 5, which shows two sentences from a German to English machine translation system." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-159", "text": "In the top figure, attention weights are shown when the scratchpad mechanism is utilized, while in the bottom figure standard attention is used." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-160", "text": "As can be seen from the figures, attention weights are more focused, especially in the first few steps of decoding, which better aligns with word-level translations (e.g. 'hand' is properly attended to with scratchpad, but not with non-scratchpad)." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-161", "text": "Additionally, some words that are never properly translated (e.g. wahrscheinlich - 'probably') by the non-scratchpad model are not heavily attended to, whereas with the scratchpad mechanism, they are." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-162", "text": "We also demonstrate this effect quantitatively." 
}, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-163", "text": "Recall the attention distribution a^i_t over the input [x_1, ..., x_n] generated at each decoding timestep i. By calculating the entropy ent_i = \u2212\u2211_t a^i_t log(a^i_t) and taking the mean of this value across a set of output sentences, we can measure how well the model \"focuses\" on input sequences as it decodes." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-164", "text": "The lower the entropy, the sharper the attention distribution." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-165", "text": "We evaluate this metric on the IWSLT14 De\u2192En test set for the scratchpad and non-scratchpad models." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-166", "text": "By adding the scratchpad mechanism, the mean entropy decreases substantially, from 1.33 to 0.887, indicating that it makes the model more selective (focusing on fewer input tokens with higher weights during generation)." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-167", "text": "Additionally, we plot in Fig. 2 the cumulative frequency of the word-level entropies ent_i for all output timesteps i. Note from the graph that for every value x, the scratchpad model produces more attention distributions with an entropy \u2264 x. Finally, the shape of the curve changes to be less sigmoidal, with the proportion of particularly peaky or focused distributions (very low entropy, e.g. \u2264 0.5) increasing significantly, to over 4\u00d7 that of the non-scratchpad model." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-168", "text": "Previous coverage-based approaches (Tu et al., 2016; See et al., 2017) either imposed an extra term on the loss function or used an extra vector to keep track of which parts of the input sequence had been attended to, thereby focusing the attention weights in subsequent steps on tokens that received little attention before." 
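The entropy diagnostic above is straightforward to compute: a uniform attention over n tokens gives the maximum entropy log(n), while a one-hot attention gives 0. A minimal sketch:

```python
import math

def attention_entropy(alpha):
    # ent = -sum_t a_t * log(a_t); lower entropy means sharper attention.
    # Terms with a_t == 0 contribute nothing (lim a*log(a) -> 0).
    return -sum(a * math.log(a) for a in alpha if a > 0)

def mean_entropy(distributions):
    # Average the per-timestep entropies over a set of attention vectors.
    ents = [attention_entropy(d) for d in distributions]
    return sum(ents) / len(ents)
```

Running this over every decoding timestep of a test set yields the mean entropy reported above (1.33 without scratchpad vs. 0.887 with it).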
}, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-169", "text": "In other words, they focus the attention on the relevant parts of the input." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-170", "text": "Our proposed approach naturally learns to focus the attention on the important tokens, without the need to modify the loss function or introduce coverage vectors." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-171", "text": "----------------------------------" }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-172", "text": "**RELATED WORK**" }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-173", "text": "Machine Translation Since the introduction of the sequence-to-sequence paradigm, the approach has become the de facto standard for performing machine translation." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-174", "text": "Improvements over the approach followed, first with the introduction of attention (Bahdanau et al., 2014), which helped seq2seq translation focus on certain tokens of the encoder outputs." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-175", "text": "Later on, many improvements were described in the Google neural machine translation system (Wu et al., 2016), including utilizing a coverage penalty (Tu et al., 2016) while decoding." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-176", "text": "The Transformer model was introduced to alleviate the dependence on RNNs in both the encoder and the decoder steps (Vaswani et al., 2017)." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-177", "text": "Our proposed model sits on top of the seq2seq framework, and could be used with any choice of encoder/decoder as long as attention is used." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-178", "text": "\"@@\" at the end of a BPE token denotes that it should be concatenated with the following token(s) to make a word." 
}, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-181", "text": "Summarization Since Rush et al. (2015) first applied neural networks to abstractive text summarization, work has focused on augmenting models (Nallapati et al., 2016b; Gu et al., 2016), incorporating syntactic and semantic information (Takase et al., 2016), or direct optimization of the metric at hand (Ranzato et al., 2016)." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-182", "text": "Nallapati et al. (2016b) adapted the DeepMind question-answering dataset (Hermann et al., 2015) for summarization and provided the first abstractive and extractive (Nallapati et al., 2016a) models." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-183", "text": "See et al. (2017) demonstrated that pointer-generator networks can significantly improve the quality of generated summaries." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-184", "text": "Additionally, work has explored using Reinforcement Learning, often with additional losses or objective functions to improve performance (Hsu et al., 2018; Paulus et al., 2018; Li et al., 2018; Celikyilmaz et al., 2018; Pasunuru and Bansal, 2018)." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-185", "text": "Finally, Gehrmann et al.
(2018) demonstrated that a two-stage procedure, in which a model first identifies spans in the article that could be copied into the summary and then uses them to restrict a second pointer-generator model, can reap significant gains." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-186", "text": "Question Generation Early work on translating SPARQL queries into natural language relied heavily on hand-crafted rules (Ngonga Ngomo et al., 2013a,b) or manually crafted templates to map selected categories of SPARQL queries to questions (Trivedi et al., 2017; Seyler et al., 2017)." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-187", "text": "In (Serban et al., 2016), knowledge base triplets are used to generate questions with an encoder-decoder framework that operates on entity and predicate embeddings trained using TransE (Bordes et al., 2011)." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-188", "text": "Later, Elsahar et al. (2018) extended this approach to support unseen predicates." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-189", "text": "Both approaches operate on triplets, meaning they have limited capability beyond generating simple questions and cannot generate the far more complex compositional questions that our approach can, by operating on the more expressive SPARQL query (logical form)." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-190", "text": "In the question generation domain, there has been a recent surge in research on generating questions for a given paragraph of text (Song et al., 2017; Du et al., 2017; Yao et al., 2018), with most of the work being a variant of the seq2seq approach." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-191", "text": "In Song et al. (2017), a seq2seq model with copynet and a coverage mechanism (Tu et al., 2016) is used to achieve state-of-the-art results." 
}, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-192", "text": "We have demonstrated that our Scratchpad outperforms this approach in both quantitative and qualitative evaluations." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-193", "text": "Attention Closest to our work, in the general paradigm of seq2seq learning, is the coverage mechanism introduced in Tu et al. (2016) and later adapted for summarization in See et al. (2017) ." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-194", "text": "Both works try to minimize erroneous repetitions generated by a copy mechanism by introducing a new vector to keep track of what has been used from the encoder thus far." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-195", "text": "Tu et al. (2016) , for example, use an extra GRU to keep track of this information, whereas See et al. (2017) keep track of the sum of attention weights and add a penalty to the loss function based on it to discourage repetition." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-196", "text": "Our approach is much simpler than either solution since it does not require any extra vectors or an additional loss term; rather, the encoder vector itself is being used as scratch memory." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-197", "text": "Our experiments also show that for the question generation task, the Scratchpad performs better than coverage based approaches." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-198", "text": "Our idea was influenced by the dialogue generation work of Eric and Manning (2017) in which the entire sequence of interactions is re-encoded every time a response is generated by the decoder." 
}, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-199", "text": "----------------------------------" }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-200", "text": "**CONCLUSION**" }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-201", "text": "In this paper, we introduce the Scratchpad Mechanism, a novel write operator for the sequence-to-sequence framework, aimed at addressing many of the common issues encountered by sequence-to-sequence models, and evaluate it on a variety of standard conditional natural language generation tasks." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-202", "text": "By letting the decoder 'keep notes' on the encoder, or, said another way, re-encode the input at every decoding step, the Scratchpad Mechanism effectively guides future generation." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-203", "text": "The Scratchpad Mechanism attains state of the art in Machine Translation, Question Generation, and Summarization on standard metrics and human evaluation across multiple datasets." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-204", "text": "In addition, our approach decreases training time and model complexity compared to other leading approaches." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-205", "text": "Our success on such a diverse set of tasks, input data, and volumes of training data underscores the generalizability of our approach, and its conceptual simplicity makes it easy to add to any sequence-to-sequence model with attention." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-4", "text": "We evaluate Scratchpad in the context of three well-studied natural language generation tasks - Machine Translation, Question Generation, and Text Summarization - and obtain state-of-the-art or comparable performance on standard datasets for each task." 
}, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-5", "text": "Qualitative assessments in the form of human judgements (question generation), attention visualization (MT), and sample output (summarization) provide further evidence of the ability of Scratchpad to generate fluent and expressive output." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-6", "text": "----------------------------------" }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-8", "text": "The sequence-to-sequence neural network framework (seq2seq) has been successful in a wide range of tasks in natural language processing, from machine translation (Bahdanau et al., 2014) and semantic parsing (Dong and Lapata, 2016) to summarization (Nallapati et al., 2016b; See et al., 2017)." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-9", "text": "Despite this success, seq2seq models are known to often exhibit an overall lack of fluency in the natural language output produced: problems include lexical repetition, under-generation in the form of partial phrases, and lack of specificity (often caused by the gap between the input and output vocabularies) (Xie, 2017)." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-10", "text": "Recently, a number of task-specific attention variants have been proposed to deal with these issues: See et al. (2017) introduced a coverage mechanism (Tu et al., 2016)" }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-11", "text": "to deal with repetition and over-copying in summarization, Hua and Wang (2018) introduced a method of attending over keyphrases to improve argument generation, and Kiddon et al. (2016) introduced a method that attends to an agenda of items to improve recipe generation." 
}, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-12", "text": "Perhaps not surprisingly, general-purpose attention mechanisms targeting individual problems from the list above have also begun to be developed." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-13", "text": "Copynet (Gu et al., 2016) and pointer-generator networks (Vinyals et al., 2015), for example, aim to reduce input-output vocabulary mismatch and, thereby, improve specificity, while the coverage-based techniques of Tu et al. (2016) tackle repetition and under-generation." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-14", "text": "These techniques, however, often require significant hyperparameter tuning and are purposely limited to fixing a specific problem in the generated text." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-15", "text": "We present here a general-purpose addition to the standard seq2seq framework that aims to simultaneously tackle all of the above issues." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-16", "text": "In particular, we propose Scratchpad, a novel write mechanism that allows the decoder to keep notes on its past actions (i.e., generation, attention, copying) by directly modifying encoder states." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-17", "text": "The Scratchpad mechanism essentially lets the decoder more easily keep track of what the model has focused on and copied from the input in the recent past, as well as what it has produced thus far as output." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-18", "text": "Thus, Scratchpad can alternatively be viewed as an external memory initialized by the input, or as an input re-encoding step that takes into account past attention and generation." 
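The write mechanism described above modifies the encoder states at each decoding step. The exact update equations are not given in this excerpt; one plausible form, assumed here purely for illustration, is a convex-combination write in which each encoder state s_i is blended toward a global update vector u_t with an attentive write probability alpha_i (the write probabilities mentioned in the experimental details):

```python
def scratchpad_write(enc_states, write_probs, update_vec):
    # Hypothetical convex-combination write (an assumption, not the
    # paper's exact formulation): each encoder state moves toward the
    # update vector in proportion to its write probability alpha_i.
    #   s_i  <-  alpha_i * u  +  (1 - alpha_i) * s_i
    return [[a * u + (1 - a) * s for u, s in zip(update_vec, state)]
            for state, a in zip(enc_states, write_probs)]
```

With alpha_i = 0 a state is left untouched, and with alpha_i = 1 it is fully overwritten; since the write happens every decoding step, the attention keys and values evolve as generation proceeds.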
}, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-19", "text": "To demonstrate general(izable) improvements on conditional natural language generation problems broadly construed, we instantiate Scratchpad for three well-studied generation tasks - Machine Translation, Question Generation, and Summarization - and evaluate it on a diverse set of datasets." }, { "sent_id": "be67496882917c2a44afb42e6f9f15-C001-20", "text": "These tasks exhibit a variety of input modalities (structured and unstructured language) and typically have required a variety of computational strategies to perform well (attention, pointing, copying)." } ], "y": { "@BACK@": { "gold_contexts": [ [ "be67496882917c2a44afb42e6f9f15-C001-10" ], [ "be67496882917c2a44afb42e6f9f15-C001-13" ], [ "be67496882917c2a44afb42e6f9f15-C001-21" ], [ "be67496882917c2a44afb42e6f9f15-C001-175" ] ], "cite_sentences": [ "be67496882917c2a44afb42e6f9f15-C001-10", "be67496882917c2a44afb42e6f9f15-C001-13", "be67496882917c2a44afb42e6f9f15-C001-21", "be67496882917c2a44afb42e6f9f15-C001-175" ] }, "@SIM@": { "gold_contexts": [ [ "be67496882917c2a44afb42e6f9f15-C001-64" ], [ "be67496882917c2a44afb42e6f9f15-C001-90" ], [ "be67496882917c2a44afb42e6f9f15-C001-191" ], [ "be67496882917c2a44afb42e6f9f15-C001-193" ] ], "cite_sentences": [ "be67496882917c2a44afb42e6f9f15-C001-64", "be67496882917c2a44afb42e6f9f15-C001-90", "be67496882917c2a44afb42e6f9f15-C001-191", "be67496882917c2a44afb42e6f9f15-C001-193" ] }, "@USE@": { "gold_contexts": [ [ "be67496882917c2a44afb42e6f9f15-C001-90" ], [ "be67496882917c2a44afb42e6f9f15-C001-191" ], [ "be67496882917c2a44afb42e6f9f15-C001-193" ] ], "cite_sentences": [ "be67496882917c2a44afb42e6f9f15-C001-90", "be67496882917c2a44afb42e6f9f15-C001-191", "be67496882917c2a44afb42e6f9f15-C001-193" ] }, "@DIF@": { "gold_contexts": [ [ "be67496882917c2a44afb42e6f9f15-C001-168" ] ], "cite_sentences": [ "be67496882917c2a44afb42e6f9f15-C001-168" ] } } }, "ABC_ca7db62af4457ca887fe220c43b10e_16": { "x": [ { "sent_id": 
"ca7db62af4457ca887fe220c43b10e-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-2", "text": "If we compare the widely used Conditional Random Fields (CRF) with newly proposed \"deep architecture\" sequence models (Collobert et al., 2011) , there are two things changing: from linear architecture to non-linear, and from discrete feature representation to distributional." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-3", "text": "It is unclear, however, what utility nonlinearity offers in conventional featurebased models." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-4", "text": "In this study, we show the close connection between CRF and \"sequence model\" neural nets, and present an empirical investigation to compare their performance on two sequence labeling tasks -Named Entity Recognition and Syntactic Chunking." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-5", "text": "Our results suggest that non-linear models are highly effective in low-dimensional distributional spaces." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-6", "text": "Somewhat surprisingly, we find that a nonlinear architecture offers no benefits in a high-dimensional discrete feature space." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-7", "text": "----------------------------------" }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-9", "text": "Sequence labeling encompasses an important class of NLP problems that aim at annotating natural language texts with various syntactic and semantic information, such as part-of-speech tags and named-entity labels." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-10", "text": "Output from such systems can facilitate downstream applications such as Question Answering and Relation Extraction." 
}, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-11", "text": "Most methods developed so far for sequence labeling employ generalized linear statistical models, meaning methods that describe the data as a combination of linear basis functions, either directly in the input variables space (e.g., SVM) or through some transformation of the probability distributions (e.g., \"log-linear\" models)." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-12", "text": "Recently, Collobert et al. (2011) proposed \"deep architecture\" models for sequence labeling (named Sentence-level Likelihood Neural Nets, abbreviated as SLNN henceforth), and showed promising results on a range of tasks (POS tagging, NER, Chunking, and SRL)." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-13", "text": "Two new changes were suggested: extending the model from a linear to non-linear architecture; and replacing discrete feature representations with distributional feature representations in a continuous space." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-14", "text": "It has generally been argued that nonlinearity between layers is vital to the power of neural models (Bengio, 2009 )." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-15", "text": "The relative contribution of these changes, however, is unclear, as is the question of whether gains can be made by introducing non-linearity to conventional featurebased models." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-16", "text": "In this paper, we illustrate the close relationship between CRF and SLNN models, and conduct an empirical investigation of the effect of nonlinearity with different feature representations." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-17", "text": "Experiments on Named Entity Recognition (NER) and Syntactic Chunking tasks suggest that non-linear models are highly effective in low-dimensional distributed feature space, but offer no benefits in high-dimensional discrete space." 
}, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-18", "text": "Furthermore, both linear and non-linear models improve when we combine the discrete and continuous feature spaces, but a linear model still outperforms the non-linear one." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-19", "text": "----------------------------------" }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-20", "text": "**FROM CRFS TO SLNNS**" }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-21", "text": "A CRF models the conditional probability of structured output variables y given observations x. In sequence modeling, the observations are typically words in a sentence, and the output variables are some syntactic or semantic tags we are trying to predict for each word (e.g., POS, named-entity tags, etc.)." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-22", "text": "The most commonly used CRF model has a linear chain structure, where prediction y i at position i is independent of other predictions, given its neighbors y i\u22121 and y i+1 ." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-23", "text": "It is customary to describe the model as an undirected graphical model, with the following probability definition:" }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-24", "text": "denotes node clique potentials in this graph, and \u03a6(x, y i , y i\u22121 ) denotes edge clique potentials." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-25", "text": "f k (x) is the set of node-level feature functions, m is the number of node features, and \u03b8 (k,y i ) is a weight parameter of feature k associated with a particular output y i ; similarly for edges we have g k (x), m , and \u03bb (k,y i ,y i\u22121 ) ." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-26", "text": "Z(x) is the partition function that sums over all possible assignments of output variables in the entire sequence." 
}, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-27", "text": "Let us focus our discussion on the node clique potentials \u03a8 for now." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-28", "text": "We call the operand of the exponentiation operator in \u03a8 a potential function \u03c8." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-29", "text": "In a CRF, this can be expressed in matrix notation as:" }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-30", "text": "We use the notation y i to denote the ordinal index of the value assigned to y i ." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-31", "text": "This linear potential function \u03c8 can be visualized using a neural network diagram, shown in the left plot in Figure 1 ." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-32", "text": "Each edge in the graph represents a parameter weight \u03b8 (k, y i ) , for feature f k (x) and a variable assignment of y i ." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-33", "text": "In neural network terminology, this architecture is called a single-layer Input-Output Neural Network (IONN)." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-34", "text": "1 Normalizing locally in a logistic regression is equivalent to adding a softmax layer to the output layer of the IONN, which was commonly done in neural networks, such as in Collobert et al. (2011) ." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-35", "text": "We can add a hidden linear layer to this architecture to formulate a two-layer Linear Neural Network (LNN), as shown in the middle diagram of Figure 1 ." 
}, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-36", "text": "The value of the node z j in the hidden layer is computed as" }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-37", "text": "The value y i for nodes in the output layer is computed as:" }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-38", "text": "where \u03c9 (k,j) and \u03b4 (j,i) are new parameters introduced in the model." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-39", "text": "In matrix form, it can be written as y = \u2206 z = \u2206 \u2126 f (x)." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-40", "text": "The node potential function now becomes:" }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-41", "text": "This two-layer network is actually no more powerful than the previous model, since we can always compile it down to a single-layer IONN by making \u0398 = \u2126\u2206. In the next step, we take the output of the hidden layer in the LNN, and send it through a non-linear activation function, such as a sigmoid or tanh, then we arrive at a two-layer Deep Neural Network (DNN) model." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-42", "text": "Unlike the previous two models, the DNN is non-linear, and thus capable of representing a more complex decision surface." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-43", "text": "So far we have extended the potential function used in node cliques of a CRF to a non-linear DNN. And if we keep the potential function for edge cliques the same as before, then in fact we have arrived at an identical model to the SLNN in Collobert et al. (Collobert et al., 2011) ." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-44", "text": "The difference between a SLNN and an ordinary DNN model is that we need to take into consideration the influence of edge cliques, and therefore we can no longer normalize the clique factors at each position to calculate the local marginals, as we would do in a logistic regression." 
}, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-45", "text": "The cardinality of the output variable vector y grows exponentially with respect to input sequence length." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-46", "text": "Fortunately, we can use forward-backward style dynamic programming to compute the marginal probabilities efficiently." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-47", "text": "It is also worth pointing out that this model has in fact been introduced a few times in prior literature." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-48", "text": "It was termed Conditional Neural Fields by Peng et al. (2009) , and later Neural Conditional Random Fields by Do and Artieres (2010) ." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-49", "text": "Unfortunately, the connection to Collobert and Weston (2008) was not recognized in either of these two studies; vice versa, neither of the above were referenced in Collobert et al. (2011) ." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-50", "text": "This model also appeared previously in the speech recognition literature in Prabhavalkar and Fosler-Lussier (2010) ." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-51", "text": "----------------------------------" }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-52", "text": "**PARAMETER LEARNING**" }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-53", "text": "Supervised conditional training of the SLNN model amounts to maximizing the objective function L, which is given by the sum of logprobabilities over training examples:" }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-54", "text": "The change in node potential function from \u03c8 to \u03c8 does not affect the inference procedure, and thus we can employ the same dynamic programming algorithm as in a CRF to calculate the log sum over Z(x) and the expectation of feature parameters." 
}, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-55", "text": "We adopted the simple L-BFGS algorithm for training weights in this model (Liu and Nocedal, 1989) ." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-56", "text": "Although L-BFGS is in general slower than mini-batch SGD -another common optimization algorithm used to train neural networks (Bengio et al., 2006, inter alia) , it has been found to be quite stable and suitable for learning neural networks (Socher et al., 2011) ." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-57", "text": "The gradient of a parameter \u03c9 (k,j) is calculated as the following:" }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-58", "text": "The partial derivative of the potential function \u2202\u03c8 (x l , y l i ) \u2202\u03c9 (k,j) can be calculated using the backpropagation procedure, identical to how gradients of a standard Multilayer Perceptron are calculated." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-59", "text": "The gradient calculation for output layer parameters \u2206 and edge parameters \u039b follow the same form." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-60", "text": "We apply 2 -regularization to prevent overfitting." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-61", "text": "----------------------------------" }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-62", "text": "**EMPIRICAL EVALUATION**" }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-63", "text": "We evaluate the CRF and SLNN models on two standard sequence labeling tasks: Syntactic Chunking and Named Entity Recognition (NER)." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-64", "text": "In both experiments, we use the publicly available Stanford CRF Toolkit (Finkel et al., 2005) ." 
}, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-65", "text": "----------------------------------" }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-66", "text": "**NAMED ENTITY RECOGNITION**" }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-67", "text": "We train all models on the standard CoNLL-2003 shared task benchmark dataset (Sang and Meulder, 2003) , which is a collection of documents from Reuters newswire articles, annotated with four entity types: Person, Location, Organization, and Miscellaneous." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-68", "text": "We adopt the BIO2 annotation standard." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-69", "text": "Beginning and intermediate positions of an entity are marked with B-and I-tags, and nonentities with O tag." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-70", "text": "The training set contains 204K words (14K sentences), the development set contains 51K words (3.3K sentences), and the test set contains 46K words (3.5K sentences)." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-71", "text": "To evaluate out-of-domain performance, we run the models trained on CoNLL-03 training data on two additional test datasets." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-72", "text": "The first dataset (ACE) is taken from the ACE Phase 2 (2001-02) and ACE-2003 data." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-73", "text": "Although the ACE dataset also consists of newswire text and thus is not strictly out-of-domain, there is a genre or dialect difference in that it is drawn from mostly American news sources, whereas CoNLL is mostly English." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-74", "text": "The test portion of this dataset contains 63K words, and is annotated with 5 original entity types: Person, Location, Organization, Fact, and GPE." 
}, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-75", "text": "We remove all entities of type Fact and GPE by relabeling them as O during preprocessing, and discard entities tags of type Miscellaneous in the output of the models." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-76", "text": "The second dataset is the MUC7 Formal Run test set, which contains 59K words." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-77", "text": "It is also missing the Miscellaneous entity type, but includes 4 additional entity types that do not occur in CoNLL-2003: Date, Time, Money, and Percent." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-78", "text": "We converted the data to CoNLL-2003 type format using the same method applied to the ACE data." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-79", "text": "We used a comprehensive set of features that comes with the standard distribution of Stanford NER model (Finkel et al., 2005) ." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-80", "text": "A total number of 437,905 features were generated for the CoNLL-2003 training dataset." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-81", "text": "----------------------------------" }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-82", "text": "**SYNTACTIC CHUNKING**" }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-83", "text": "In Syntactic Chunking, we tag each word with its phrase type." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-84", "text": "For example, tag B-NP indicates a word starts a noun phrase, and I-PP marks an intermediate word of a prepositional phrase." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-85", "text": "We test the models on the standard CoNLL-2000 shared task evaluation set (Sang and Buchholz, 2000) ." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-86", "text": "This dataset comes from the Penn Treebank." 
}, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-87", "text": "The training set contains 211K words (8.9K sentences), and the test set contains 47K words (2K sentences)." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-88", "text": "The set of features used for this task is:" }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-89", "text": "\u2022 Current word and tag \u2022 Word pairs:" }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-90", "text": "\u2022 The Disjunctive word set of the previous and next 4 positions A total number of 317794 features were generated on this dataset." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-91", "text": "----------------------------------" }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-92", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-93", "text": "In all experiments, we used the development portion of the CoNLL-2003 data to tune the 2 -regularization parameter \u03c3 (variance in Gaussian prior), and found 20 to be a stable value." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-94", "text": "Overall tuning \u03c3 does not affect the qualitative results in our experiments." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-95", "text": "We terminate L-BFGS training when the average improvement is less than 1e-3." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-96", "text": "All model parameters are initialized to a random value in [\u22120.1, 0.1] in order to break symmetry." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-97", "text": "We did not explicitly tune the features used in CRF to optimize for performance, since feature engineering is not the focus of this study." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-98", "text": "However, overall we found that the feature set we used is competitive with CRF results from earlier literature (Turian et al., 2010; Collobert et al., 2011) ." 
}, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-99", "text": "For models that embed hidden layers, we set the number of hidden nodes to 300." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-100", "text": "2 Results are reported on the standard evaluation metrics of entity/chunk precision, recall and F1 measure." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-101", "text": "For experiments with continuous space feature representations (a.k.a., word embeddings), we took the word embeddings (130K words, 50 dimensions) used in Collobert et al. (2011) , which were trained for 2 months over Wikipedia text." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-102", "text": "----------------------------------" }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-103", "text": "**3**" }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-104", "text": "All sequences of numbers are replaced with num (e.g., \"PS1\" would become \"PSnum\"), sentence boundaries are padded with token PAD, and unknown words are grouped into UNKNOWN." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-105", "text": "We attempt to replicate the model described in Collobert et al. (2011) without task-specific fine-tuning, with a few exceptions: 1) we used the soft tanh activation function instead of hard tanh; 2) we use the BIO2 tagging scheme instead of BIOES; 3) we use L-BFGS optimization algorithm instead of stochastic gradient descent; 4) we did not use Gazetteer features; 5) Collobert et al. (2011) mentioned 5 binary features that look at the capitalization pattern of words to append to the embedding as additional dimensions, but only 4 were described in the paper, which we implemented accordingly." 
}, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-106", "text": "----------------------------------" }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-107", "text": "**RESULTS OF DISCRETE REPRESENTATION**" }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-108", "text": "The first question we address is the following: in the high-dimensional discrete feature space, would the non-linear architecture in SLNN model help it to outperform CRF?" }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-109", "text": "Results from Table 1 suggest that SLNN does not seem to benefit from the non-linear architecture on either the NER or Syntactic Chunking tasks." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-110", "text": "In particular, on the CoNLL and MUC dataset, SLNN resulted in a 1% performance drop, which is significant for NER." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-111", "text": "The specific statistical properties of this dataset that lead to the performance drop are hard to determine, but we believe it is partially because the SLNN has a much harder non-convex optimization problem to solve -on this small dataset, the SLNN with 300 hidden units generates a shocking number of 100 million parameters (437905 features times 300 hidden dimensions), due to the high dimensionality of the input feature space." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-112", "text": "To further illustrate this point, we also compared the CRF model with its Linear Neural Network (LNN) extension, which has exactly the same number of parameters as the SLNN but does not include the non-linear activation layer." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-113", "text": "Although this model is identical in representational power to the CRF as we discussed in Section 2, the optimization problem here is no longer convex (Ando and Zhang, 2005) ." 
}, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-114", "text": "To see why, consider applying a linear scaling transformation to the input layer parameter matrix \u2126, and apply the inverse scaling to output layer \u2206 matrix." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-115", "text": "The resulting model has exactly the same function values." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-116", "text": "We can see from Table 2 that there is indeed a performance drop with the LNN model as well, likely due to difficulty with optimization." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-117", "text": "By comparing the results of LNN and SLNN, we see that the addition of a non-linear activation layer in SLNN does not seem to help, but in fact further decreases performance in all cases except Syntactic Chunking." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-118", "text": "A distinct characteristic of NLP data is its high dimensionality." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-119", "text": "The vocabulary size of a decent sized text corpus is already in the tens of thousands, and bigram statistics are usually an order of magnitude larger." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-120", "text": "These basic information units are typically very informative, and there is not much structure in them to be explored." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-121", "text": "Although some studies argue that non-linear neural nets suffer less from the curse of dimensionality (Attali and Pag\u00e9s, 1997; Bengio and Bengio, 2000; Pitkow, 2012) , counter arguments have been offered (Camastra, 2003; Verleysen et al., 2003) ." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-122", "text": "The empirical results from our experiment seems to support the latter." 
}, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-123", "text": "Similar results have also been found in other NLP applications such as Text Classification." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-124", "text": "Joachims concluded in his seminal work: \"non-linear SVMs do not provide any advantage for text classification using the standard kernels\" (Joachims, 2004, p. 115 )." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-125", "text": "If we compare the learning curve of CRF and SLNN (Figure 2 ), where we vary the amount of binary features available in the model by random sub-sampling, we can further observe that SLNNs enjoy a small performance advantage in lower dimensional space (when less than 30% of features are used), but are quickly outpaced by CRFs in higher dimensional space as more features become available." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-126", "text": "Another point of consideration is whether there is actually much non-linearity to be captured in sequence labeling." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-127", "text": "While in some NLP applications like grammar induction and semantic parsing, the data is complex and rich in statistical structures, the structure of data in sequence labeling is considerably simpler." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-128", "text": "This contrast is more salient if we compare with data in Computer Vision tasks such as object recognition and image segmentation." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-129", "text": "The interactions among local variables there are much stronger and more likely to be non-linear." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-130", "text": "Lastly, models like CRF actually already capture some of the non-linearity in the Table 3 : Results of CRF versus SLNN, over continuous space feature representations." 
}, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-131", "text": "input space through the interactions of latent variables (Liang et al., 2008) , and it is unclear how much additional gain we would get by explicitly modeling the non-linearity in local inputs." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-132", "text": "----------------------------------" }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-133", "text": "**RESULTS OF DISTRIBUTIONAL REPRESENTATION**" }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-134", "text": "For the next experiment, we replace the discrete input features with a continuous space representation by looking up the embedding of each word, and concatenate the embeddings of a five word window centered around the current position." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-135", "text": "Four binary features are also appended to each word embedding to capture capitalization patterns, as described in Collobert et al. (2011) ." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-136", "text": "Results of the CRF and SLNN under this setting for the NER task is show in Table 3 ." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-137", "text": "With a continuous space representation, the SLNN model works significantly better than a CRF, by as much as 7% on the CoNLL development set, and 3.7% on ACE dataset." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-138", "text": "This suggests that there exist statistical dependencies within this low-dimensional (300) data that cannot be effectively captured by linear transformations, but can be modeled in the non-linear neural nets." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-139", "text": "This perhaps coincides with the large performance im- provements observed from neural nets in handwritten digit recognition datasets as well (Peng et al., 2009; Do and Artieres, 2010) , where dimensionality is also relatively low." 
}, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-140", "text": "----------------------------------" }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-141", "text": "**COMBINE DISCRETE AND DISTRIBUTIONAL FEATURES**" }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-142", "text": "When we join word embeddings with discrete features, we see further performance improvements, especially in the out-of-domain datasets." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-143", "text": "The results are shown in Table 4 ." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-144", "text": "A similar effect was also observed in Turian et al. (2010) ." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-145", "text": "The performance of both the CRF and SLNN increases by similar relative amounts, but the CRF model maintains a lead in overall absolute performance." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-146", "text": "----------------------------------" }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-147", "text": "**CONCLUSION**" }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-148", "text": "We carefully compared and analyzed the nonlinear neural networks used in Collobert et al. (2011) and the widely adopted CRF, and revealed their close relationship." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-149", "text": "Through extensive experiments on NER and Syntactic Chunking, we have shown that non-linear architectures are effective in low dimensional continuous input spaces, but that they are not better suited for conventional highdimensional discrete input spaces." }, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-150", "text": "Furthermore, both linear and non-linear models benefit greatly from the combination of continuous and discrete features, especially for out-of-domain datasets." 
}, { "sent_id": "ca7db62af4457ca887fe220c43b10e-C001-151", "text": "This finding confirms earlier results that distributional representations can be used to achieve better generalization." } ], "y": { "@DIF@": { "gold_contexts": [ [ "ca7db62af4457ca887fe220c43b10e-C001-2" ], [ "ca7db62af4457ca887fe220c43b10e-C001-49" ], [ "ca7db62af4457ca887fe220c43b10e-C001-105" ] ], "cite_sentences": [ "ca7db62af4457ca887fe220c43b10e-C001-2", "ca7db62af4457ca887fe220c43b10e-C001-49", "ca7db62af4457ca887fe220c43b10e-C001-105" ] }, "@MOT@": { "gold_contexts": [ [ "ca7db62af4457ca887fe220c43b10e-C001-12" ] ], "cite_sentences": [ "ca7db62af4457ca887fe220c43b10e-C001-12" ] }, "@SIM@": { "gold_contexts": [ [ "ca7db62af4457ca887fe220c43b10e-C001-34" ], [ "ca7db62af4457ca887fe220c43b10e-C001-43" ], [ "ca7db62af4457ca887fe220c43b10e-C001-98" ], [ "ca7db62af4457ca887fe220c43b10e-C001-101" ], [ "ca7db62af4457ca887fe220c43b10e-C001-135" ], [ "ca7db62af4457ca887fe220c43b10e-C001-148" ] ], "cite_sentences": [ "ca7db62af4457ca887fe220c43b10e-C001-34", "ca7db62af4457ca887fe220c43b10e-C001-43", "ca7db62af4457ca887fe220c43b10e-C001-98", "ca7db62af4457ca887fe220c43b10e-C001-101", "ca7db62af4457ca887fe220c43b10e-C001-135", "ca7db62af4457ca887fe220c43b10e-C001-148" ] }, "@USE@": { "gold_contexts": [ [ "ca7db62af4457ca887fe220c43b10e-C001-101" ], [ "ca7db62af4457ca887fe220c43b10e-C001-135" ], [ "ca7db62af4457ca887fe220c43b10e-C001-148" ] ], "cite_sentences": [ "ca7db62af4457ca887fe220c43b10e-C001-101", "ca7db62af4457ca887fe220c43b10e-C001-135", "ca7db62af4457ca887fe220c43b10e-C001-148" ] }, "@EXT@": { "gold_contexts": [ [ "ca7db62af4457ca887fe220c43b10e-C001-105" ] ], "cite_sentences": [ "ca7db62af4457ca887fe220c43b10e-C001-105" ] } } }, "ABC_f3012301e42a4075ed6d4d2b39b528_16": { "x": [ { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-27", "text": "We introduce context incongruity in Section 3." 
}, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-25", "text": "Rest of the paper is organized as follows." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-26", "text": "We first discuss related work in Section 2." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-2", "text": "The relationship between context incongruity and sarcasm has been studied in linguistics." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-3", "text": "We present a computational system that harnesses context incongruity as a basis for sarcasm detection." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-4", "text": "Our statistical sarcasm classifiers incorporate two kinds of incongruity features: explicit and implicit." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-5", "text": "We show the benefit of our incongruity features for two text forms -tweets and discussion forum posts." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-6", "text": "Our system also outperforms two past works (with Fscore improvement of 10-20%)." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-7", "text": "We also show how our features can capture intersentential incongruity." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-8", "text": "----------------------------------" }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-10", "text": "Sarcasm is defined as 'a cutting, often ironic remark intended to express contempt or ridicule' 1 ." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-11", "text": "Sarcasm detection is the task of predicting a text as sarcastic or non-sarcastic." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-12", "text": "The past work in sarcasm detection involves rule-based and statistical approaches using: (a) unigrams and pragmatic features (such as emoticons, etc.) 
(Gonzalez-Ibanez et al., 2011; Carvalho et al., 2009; Barbieri et al., 2014) , (b) extraction of common patterns, such as hashtag-based sentiment (Maynard and Greenwood, 2014; Liebrecht et al., 2013) , a positive verb being followed by a negative situation (Riloff et al., 2013) , or discriminative n-grams (Tsur et al., 2010a; Davidov et al., 2010) ." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-13", "text": "Thus, the past work detects sarcasm with specific indicators." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-14", "text": "However, we believe that it is time that sarcasm detection be based on well-studied linguistic theories." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-15", "text": "In this paper, we use one such linguistic theory: context incongruity." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-16", "text": "Although the past work exploits incongruity, it does so piecemeal; we take a more well-rounded view of incongruity and place it center-stage for our work." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-17", "text": "1 Source: The Free Dictionary" }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-18", "text": "The features of our sarcasm detection system are based on two kinds of incongruity: 'explicit' and 'implicit'." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-19", "text": "The contributions of this paper are:" }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-20", "text": "\u2022 We present a sarcasm detection system that is grounded on a linguistic theory, the theory of context incongruity in our case." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-21", "text": "Sarcasm detection research can push the frontiers by drawing on well-studied linguistic theories." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-22", "text": "\u2022 Our sarcasm detection system outperforms two state-of-the-art sarcasm detection systems (Riloff et al., 2013; Maynard and Greenwood, 2014) ." 
}, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-23", "text": "Our system shows an improvement for short 'tweets' as well as long 'discussion forum posts'." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-24", "text": "\u2022 We introduce inter-sentential incongruity for sarcasm detection, that expands context of a discussion forum post by including the previous post (also known as the 'elicitor' post) in the discussion thread." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-28", "text": "Feature design for explicit incongruity is presented in Section 3.1, and that for implicit incongruity is in Section 3.2." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-29", "text": "We then describe the architecture of our sarcasm detection system in Section 4 and our experimental setup in Section 5." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-30", "text": "Quantitative evaluation is in Section 6." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-31", "text": "Inter-sentential sarcasm detection is in Section 7." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-32", "text": "Section 8 presents the error analysis." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-33", "text": "Section 9 concludes the paper and points to future directions." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-34", "text": "----------------------------------" }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-35", "text": "**RELATED WORK**" }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-36", "text": "Sarcasm/irony as a linguistic phenomenon has been extensively studied." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-37", "text": "According to Wilson (2006) , sarcasm arises from situational disparity." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-38", "text": "The relationship between context incongruity and sarcasm processing (by humans) has been studied in Ivanko and Pexman (2003) ." 
}, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-39", "text": "Liebrecht et al. (2013) use a dataset of Dutch tweets that contain sarcasmrelated hashtags and implement a classifier to predict sarcasm." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-40", "text": "A recent work by ?) takes the output of sarcasm detection as an input to sentiment classification." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-41", "text": "They present a rule-based system that uses the pattern: if the sentiment of a tokenized hashtag does not agree with sentiment in rest of the tweet, the tweet is sarcastic, in addition to other rules." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-42", "text": "Our approach is architecturally similar to Tsur et al. (2010b) who use a semi-supervised pattern acquisition followed by classification." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-43", "text": "Our feature engineering is based on Riloff et al. (2013) and Ramteke et al. (2013) ." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-44", "text": "Riloff et al. (2013) state that sarcasm is a contrast between positive sentiment word and a negative situation." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-45", "text": "They implement a rule-based system that uses phrases of positive verb phrases and negative situations extracted from a corpus of sarcastic tweets." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-46", "text": "Ramteke et al. (2013) present a novel approach to detect thwarting: the phenomenon where sentiment in major portions of text is reversed by sentiment in smaller, conclusive portions." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-47", "text": "----------------------------------" }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-48", "text": "**CONTEXT INCONGRUITY**" }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-49", "text": "Incongruity is defined as 'the state of being not in agreement, as with principles' 1 ." 
}, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-50", "text": "Context incongruity is a necessary condition for sarcasm (Campbell and Katz, 2012) ." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-51", "text": "Ivanko and Pexman (2003) state that the sarcasm processing time (time taken by humans to understand sarcasm) depends on the degree of context incongruity between the statement and the context." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-52", "text": "Deriving from this idea, we consider two cases of incongruity in sarcasm that are analogous to two degrees of incongruity." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-53", "text": "We call them explicit incongruity and implicit incongruity, where implicit incongruity demands a higher processing time." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-54", "text": "It must be noted that our system only handles incongruity between the text and common world knowledge (i.e., the knowledge that 'being stranded' is an undesirable situation, and hence, 'Being stranded in traffic is the best way to start my week' is a sarcastic statement)." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-55", "text": "This leaves out an example like 'Wow! You are so punctual' which may be sarcastic depending on situational context." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-56", "text": "----------------------------------" }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-57", "text": "**EXPLICIT INCONGRUITY**" }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-58", "text": "Explicit incongruity is overtly expressed through sentiment words of both polarities (as in the case of 'I love being ignored' where there is a positive word 'love' and a negative word 'ignored')." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-59", "text": "The converse is not true as in the case of 'The movie starts slow but the climax is great'." 
}, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-60", "text": "----------------------------------" }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-61", "text": "**IMPLICIT INCONGRUITY**" }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-62", "text": "An implicit incongruity is covertly expressed through phrases of implied sentiment, as opposed to opposing polar words." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-63", "text": "Consider the example \"I love this paper so much that I made a doggy bag out of it\"." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-64", "text": "There is no explicit incongruity here: the only polar word is 'love'." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-65", "text": "However, the clause 'I made a doggy bag out of it' has an implied sentiment that is incongruous with the polar word 'love'." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-66", "text": "----------------------------------" }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-67", "text": "**ESTIMATING PREVALENCE**" }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-68", "text": "We conduct a na\u00efve, automatic evaluation on a dataset of 18,141 sarcastic tweets." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-69", "text": "As a crude estimate, we consider an explicit incongruity as presence of positive and negative words." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-70", "text": "Around 11% sarcastic tweets have at least one explicit incongruity." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-71", "text": "We also manually evaluate 50 sarcastic tweets and observe that 10 have explicit incongruity, while others have implicit incongruity." 
}, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-72", "text": "----------------------------------" }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-73", "text": "**ARCHITECTURE**" }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-74", "text": "Our system for sarcasm detection augments the feature vector of a tweet with features based on the two types of incongruity." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-75", "text": "Specifically, we use four kinds of features: (a) Lexical, (b) Pragmatic, (c) Implicit congruity, and (d) Explicit incongruity features." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-76", "text": "Lexical features are unigrams obtained using feature selection techniques such as \u03c7 2 Test and Categorical Proportional Difference." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-77", "text": "Pragmatic features include emoticons, laughter expressions, punctuation marks and capital words as given by Carvalho et al. (2009) ." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-78", "text": "In addition to the two, our system incorporates two kinds of incongruity features, as discussed next." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-79", "text": "The explicit incongruity features are numeric, qualitative features, while implicit incongruity features are related to implicit phrases." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-80", "text": "----------------------------------" }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-81", "text": "**FEATURE DESIGN: EXPLICIT INCONGRUITY**" }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-82", "text": "An explicit incongruity giving rise to sarcasm bears resemblance to thwarted expectations (another commonly known challenge to sentiment analysis)." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-83", "text": "Consider this example: 'I love the color." 
}, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-84", "text": "The features are interesting. But a bad battery life ruins it'." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-85", "text": "The positive expectation in the first two sentences is thwarted by the last sentence." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-86", "text": "A similar incongruity is observed in the sarcastic 'My tooth hurts! Yay!'." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-87", "text": "The negative word 'hurts' is incongruous with the positive 'Yay!'." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-88", "text": "Hence, our explicit incongruity features are a relevant subset of features from a past system to detect thwarting by Ramteke et al. (2013) ." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-89", "text": "These features are:" }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-90", "text": "\u2022 Number of sentiment incongruities: The number of times a positive word is followed by a negative word, and vice versa \u2022 Largest positive/negative subsequence: The length of the longest series of contiguous positive/negative words \u2022 Number of positive and negative words \u2022 Lexical Polarity: The polarity based purely on the basis of lexical features, as determined by Lingpipe SA system (Alias-i, 2008) ." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-91", "text": "Note that the 'native polarity' need not be correct." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-92", "text": "However, a tweet that is strongly positive on the surface is more likely to be sarcastic than a tweet that seems to be negative." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-93", "text": "This is because sarcasm, by definition, tends to be caustic/hurtful." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-94", "text": "This also helps against humble bragging." 
}, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-95", "text": "(as in case of the tweet 'so i have to be up at 5am to autograph 7,000 pics of myself? Sounds like just about the worst Wednesday morning I could ever imagine')." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-96", "text": "----------------------------------" }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-97", "text": "**FEATURE DESIGN: IMPLICIT INCONGRUITY**" }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-98", "text": "We use phrases with implicit sentiment as the implicit incongruity features." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-99", "text": "These phrases are sentiment-bearing verb and noun phrases, the latter being situations with implied sentiment (e.g. 'getting late for work')." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-100", "text": "For this, we modify the algorithm given in Riloff et al. (2013) in two ways: (a) they extract only positive verbs and negative noun situation phrases." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-101", "text": "We generalize it to both polarities, (b) they remove subsumed phrases (i.e. 'being ignored' subsumes 'being ignored by a friend') while we retain both phrases." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-102", "text": "The benefit of (a) and (b) above was experimentally validated, but is not included in this paper due to limited space." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-103", "text": "While they use rule-based algorithms that employ these extracted phrases to detect sarcasm, we include them as implicit incongruity features, in addition to other features." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-104", "text": "It is possible that the set of extracted situation phrases may contain some phrases without implicit sentiment." 
}, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-105", "text": "We hope that the limited size of the tweet guards against such false positives being too many in number." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-106", "text": "We add phrases in the two sets as count-based implicit incongruity features." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-107", "text": "----------------------------------" }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-108", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-109", "text": "We use three datasets to evaluate our system: 1. Tweet-A (5208 tweets, 4170 sarcastic):" }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-110", "text": "We download tweets with hashtags #sar-casm and #sarcastic as sarcastic tweets and #notsarcasm and #notsarcastic as nonsarcastic, using the Twitter API (https:// dev.twitter.com/)." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-111", "text": "A similar hashtagbased approach to create a sarcasm-annotated dataset was employed in Gonzalez-Ibanez et al. (2011) ." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-112", "text": "As an additional quality check, a rough glance through the tweets is done, and the ones found to be wrong are removed." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-113", "text": "The hashtags mentioned above are removed from the text so that they act as labels but not as features." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-114", "text": "2. Tweet-B (2278 tweets, 506 sarcastic): This dataset was manually labeled for Riloff et al. (2013 To extract the implicit incongruity features, we run the iterative algorithm described in Section 4.2, on a dataset of 4000 tweets (50% sarcastic) (also created using hashtag-based supervision)." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-115", "text": "The algorithm results in a total of 79 verb phrases and 202 noun phrases." 
}, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-116", "text": "We train our classifiers for different feature combinations, using LibSVM with RBF kernel (Chang and Lin, 2011) , and report average 5-fold cross-validation values." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-117", "text": "Table 2 : Comparative results for Tweet-A using rule-based algorithm and statistical classifiers using our feature combinations 6 Evaluation Table 2 shows the performance of our classifiers in terms of Precision (P), Recall (R) and F-score Riloff et al. (2013) 's two rule-based algorithms: the ordered version predicts a tweet as sarcastic if it has a positive verb phrase followed by a negative situation/noun phrase, while the unordered does so if the two are present in any order." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-118", "text": "We see that all statistical classifiers surpass the rule-based algorithms." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-119", "text": "The best F-score obtained is 0.8876 when all four kinds of features are used." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-120", "text": "This is an improvement of about 5% over the baseline, and 40% over the algorithm by Riloff et al. (2013) ." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-121", "text": "Table 3 shows that even in the case of the Discussion-A dataset, our features result in an improved performance." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-122", "text": "The F-score increases from 0.568 to 0.640, an improvement of about 8% in case of discussion forum posts, when all features are used." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-123", "text": "To confirm that we indeed do better, we compare our system, with their reported values." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-124", "text": "This is necessary for several reasons." 
}, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-125", "text": "For example, we reimplement their algorithm but do not have Table 4 : Comparison of our system with two past works, for Tweet-B access to their exact extracted phrases." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-126", "text": "Table 4 shows that we achieve a 10% higher F-score than the best reported F-score of Riloff et al. (2013) ." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-127", "text": "This value is also 20% higher than our re-implementation of Maynard and Greenwood (2014) that uses their hashtag retokenizer and rulebased algorithm." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-128", "text": "----------------------------------" }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-129", "text": "**INCORPORATING INTER-SENTENTIAL INCONGRUITY**" }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-130", "text": "Our system performs worse for Discussion-A than Tweet-A/B possibly because of incongruity outside the text." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-131", "text": "Because of the thread structure of discussion forums, sarcasm in a 'target post' can be identified using the post preceding it (called 'elicitor post'), similar to human conversation (Eisterhold et al., 2006) ." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-132", "text": "For example, 'Wow, you are smart!' may or may not be sarcastic." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-133", "text": "If a sarcasm classifier incorporates information from the elicitor post 'I could not finish my assignment', a correct prediction is possible." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-134", "text": "Hence, we now explore how our incongruity-based features can help to capture 'inter-sentential incongruity'." 
}, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-135", "text": "We compute the five explicit incongruity features for a concatenated version of target post and elicitor post (elicitor posts are available for IAC corpus, the source of Discussion-A)." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-136", "text": "The precision rises to 0.705 but the recall falls to 0.274." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-137", "text": "A possible reason is that only 15% posts have elicitor posts, making the inter-sentential features sparse." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-138", "text": "That notwithstanding, our observation shows that using the inter-sentential context is an interesting direction for sarcasm detection." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-139", "text": "----------------------------------" }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-140", "text": "**ERROR ANALYSIS**" }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-141", "text": "Some common errors made by our system are:" }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-142", "text": "1. Subjective polarity: The tweet 'Yay for 3 hour Chem labs' is tagged by the author as sarcastic, which may not be common perception." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-143", "text": "2. No incongruity within text: As stated in Section 2, our system does not detect sarcasm where incongruity is expressed outside the text." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-144", "text": "About 10% misclassified examples that we analyzed, contained such an incongruity." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-145", "text": "3. Incongruity due to numbers: Our system could not detect incongruity arising due to numbers as in 'Going in to work for 2 hours was totally worth the 35 minute drive.'. 4. 
Dataset granularity: Some discussion forum posts are marked as sarcastic, but contain non-sarcastic portions, leading to irrelevant features." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-146", "text": "For example, 'How special, now all you have to do is prove that a glob of cells has rights." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-147", "text": "I happen to believe that a person's life and the right to life begins at conception'. 5. Politeness: In some cases, implicit incongruity was less evident because of politeness, as in, 'Post all your inside jokes on facebook, I really want to hear about them'." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-148", "text": "----------------------------------" }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-149", "text": "**CONCLUSION & FUTURE WORK**" }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-150", "text": "Our paper uses the linguistic relationship between context incongruity and sarcasm as a basis for sarcasm detection." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-151", "text": "Our sarcasm classifier uses four kinds of features: lexical, pragmatic, explicit incongruity, and implicit incongruity features." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-152", "text": "We evaluate our system on two text forms: tweets and discussion forum posts." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-153", "text": "We observe an improvement of 40% over a reported rule-based algorithm, and 5% over the statistical classifier baseline that uses unigrams, in case of tweets." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-154", "text": "The corresponding improvement in case of discussion forum posts is 8%." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-155", "text": "Our system also outperforms two past works (Riloff et al., 2013; Maynard and Greenwood, 2014) with 10-20% improvement in F-score." 
}, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-156", "text": "Finally, to improve the performance for discussion forum posts, we introduce a novel approach to use elicitor posts for sarcasm detection." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-157", "text": "We observe an improvement of 21.6% in precision, when our incongruity features are used to capture inter-sentential incongruity." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-158", "text": "Our error analysis points to potential future work such as: (a) role of numbers for sarcasm, and (b) situations with subjective sentiment." }, { "sent_id": "f3012301e42a4075ed6d4d2b39b528-C001-159", "text": "We are currently exploring a more robust incorporation of inter-sentential incongruity for sarcasm detection." } ], "y": { "@BACK@": { "gold_contexts": [ [ "f3012301e42a4075ed6d4d2b39b528-C001-12" ] ], "cite_sentences": [ "f3012301e42a4075ed6d4d2b39b528-C001-12" ] }, "@DIF@": { "gold_contexts": [ [ "f3012301e42a4075ed6d4d2b39b528-C001-22" ], [ "f3012301e42a4075ed6d4d2b39b528-C001-100" ], [ "f3012301e42a4075ed6d4d2b39b528-C001-117", "f3012301e42a4075ed6d4d2b39b528-C001-118" ], [ "f3012301e42a4075ed6d4d2b39b528-C001-120" ], [ "f3012301e42a4075ed6d4d2b39b528-C001-126" ], [ "f3012301e42a4075ed6d4d2b39b528-C001-155" ] ], "cite_sentences": [ "f3012301e42a4075ed6d4d2b39b528-C001-22", "f3012301e42a4075ed6d4d2b39b528-C001-100", "f3012301e42a4075ed6d4d2b39b528-C001-117", "f3012301e42a4075ed6d4d2b39b528-C001-120", "f3012301e42a4075ed6d4d2b39b528-C001-126", "f3012301e42a4075ed6d4d2b39b528-C001-155" ] }, "@SIM@": { "gold_contexts": [ [ "f3012301e42a4075ed6d4d2b39b528-C001-43" ], [ "f3012301e42a4075ed6d4d2b39b528-C001-114" ] ], "cite_sentences": [ "f3012301e42a4075ed6d4d2b39b528-C001-43", "f3012301e42a4075ed6d4d2b39b528-C001-114" ] }, "@USE@": { "gold_contexts": [ [ "f3012301e42a4075ed6d4d2b39b528-C001-43" ], [ "f3012301e42a4075ed6d4d2b39b528-C001-114" ] ], "cite_sentences": [ 
"f3012301e42a4075ed6d4d2b39b528-C001-43", "f3012301e42a4075ed6d4d2b39b528-C001-114" ] }, "@EXT@": { "gold_contexts": [ [ "f3012301e42a4075ed6d4d2b39b528-C001-100" ], [ "f3012301e42a4075ed6d4d2b39b528-C001-120" ] ], "cite_sentences": [ "f3012301e42a4075ed6d4d2b39b528-C001-100", "f3012301e42a4075ed6d4d2b39b528-C001-120" ] } } }, "ABC_c2952b2da147d5f128cdbd5d8074a5_16": { "x": [ { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-109", "text": "**EXPERIMENT 2: NAMES AND OCCUPATIONS (REVISITED)**" }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-15", "text": "Table 1 exemplifies what the results might look like." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-36", "text": "Our contribution consists of an investigation of the presence of gender bias in pretrained embeddings for Swedish." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-2", "text": "This paper investigates the presence of gender bias in pretrained Swedish embeddings." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-3", "text": "We focus on a scenario where names are matched with occupations, and we demonstrate how a number of standard pretrained embeddings handle this task." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-4", "text": "Our experiments show some significant differences between the pretrained embeddings, with word-based methods showing the most bias and contextualized language models showing the least." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-5", "text": "We also demonstrate that a previously proposed debiasing method does not affect the performance of the various embeddings in this scenario." 
}, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-6", "text": "----------------------------------" }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-8", "text": "The motivation for this study is the currently widespread practice of using pretrained embeddings as building blocks for NLP-related tasks." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-9", "text": "More specifically, we are concerned about such usage by actors in the public sector, for instance government agencies and public organizations." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-10", "text": "It is obvious how the presence of (gender or racial) bias would be potentially serious in applications where embeddings are used as input to decision support systems in the public sector." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-11", "text": "As an example, in Sweden limited companies must be approved and registered by the Swedish Companies Registration Office." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-12", "text": "One important (and internationally unique) step in this registration procedure is the approval of the company name, which is decided by case handlers at the Registration Office." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-13", "text": "Their decision is based on several factors, one of which is the appropriateness of the company name in relation to the company description." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-14", "text": "Now, imagine the hypothetical use case in which the case handlers use a decision support system that employs pretrained embeddings to quantify the similarity between a suggested company name and its company description." 
}, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-16", "text": "In this fictive example, the company description states that the company will do business with cars, and the name suggestions are composed of a person name in genitive and the word \"cars\" (i.e. \"Fredrik's cars\")." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-17", "text": "We use pretrained Swedish ELMo embeddings (Che et al., 2018) to compute the distance between the name suggestion and the company description." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-18", "text": "The results demonstrate that male person names (\"Magnus\" and \"Fredrik\") are closer to \"cars\" in the ELMo similarity space than female person names (\"Maria\" and \"Anna\")." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-19", "text": "If such results are used as input to a decision support system for deciding on the appropriateness of a company name suggestion in relation to a company description, we might introduce gender bias into the decision process." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-20", "text": "We subscribe to the view that such bias would be unfair and problematic." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-21", "text": "The point of this paper is therefore to investigate gender bias when using existing and readily available pretrained embeddings for tasks relating to names and occupations." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-22", "text": "We include both word-based embeddings produced using word2vec and fastText, as well as characterbased (and WordPiece-based) contextualized embeddings produced using ELMo and the multilingual BERT." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-23", "text": "The next section covers related work." 
}, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-24", "text": "We then discuss the various embeddings in Section 3, before we then turn to some experimental evidence of bias in the embeddings, and we also show that the previously proposed debiasing method is unable to handle gender bias in our scenario." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-25", "text": "----------------------------------" }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-26", "text": "**RELATED WORK**" }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-27", "text": "Research regarding bias and stereotypes expressed in text and subsequently incorporated in learned language models is currently a vivid field." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-28", "text": "Caliskan et al. (2017) show that learned embeddings exhibit every linguistic bias documented in the field of psychology (such as that flowers are more pleasant than insects, musical instruments are preferred to weapons, and personal names are used to infer race)." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-29", "text": "Garg et al. (2018) show that temporal changes of the embeddings can be used to quantify gender and ethnic stereotypes over time, and Zhao et al. (2017) suggest that biases might in fact be amplified by embedding models." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-30", "text": "Several researchers have also investigated ways to counter stereotypes and biases in learned language models." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-31", "text": "While the seminal work by Bolukbasi et al. (2016a Bolukbasi et al. ( , 2016b concerns the identification and mitigation of gender bias in pretrained word embeddings, Zhao et al. (2018) provide insights into the possibilities of learning embeddings that are gender neutral." 
}, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-32", "text": "Bordia and Bowman (2019) outline a way of training a recurrent neural network for word-based language modelling such that the model is gender neutral." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-33", "text": "Park et al. (2018) discuss different ways of mitigating gender bias, in the context of abusive language detection, ranging from debiasing a model by using the hard debiased word embeddings produced by Bolukbasi et al. (2016b) , to manipulating the data prior to training a model by swapping masculine and feminine mentions, and employing transfer learning from a model learned from less biased text." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-34", "text": "Gonen and Goldberg (2019) contest the approaches to debiasing word embeddings presented by Bolukbasi et al. (2016b) and Zhao et al. (2018) , arguing that while the bias is reduced when measured according to its definition, i.e., dampening the impact of the general gender direction in the vector space, \"the actual effect is mostly hiding the bias, not removing it\"." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-35", "text": "Further, Gonen and Gold-berg (2019) claim that a lot of the supposedly removed bias can be recovered due to the geometry of the vector representation of the gender neutralized words." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-86", "text": "word2vec shows a clear tendency to group both male and female names with male occupations." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-37", "text": "We are less interested in bias as a theoretical construct, and more interested in the effects of gender bias in actual applications where pretrained embeddings are employed." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-38", "text": "Our experiments are therefore tightly tied to a real-world use case where gender bias would have potentially serious ramifications." 
}, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-39", "text": "We also provide further evidence of the inability of the debiasing method proposed by Bolukbasi et al. (2016b) to handle the type of bias we are concerned with." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-40", "text": "----------------------------------" }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-41", "text": "**EMBEDDINGS**" }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-42", "text": "We include four different standard embeddings in these experiments: word2vec, fastText, ELMo and BERT." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-43", "text": "There are several pre-trained models available in various web repositories." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-44", "text": "We select one representative instance per model, summarized in Table 2 (next page)." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-45", "text": "These models represent different types of embeddings." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-46", "text": "word2vec (Mikolov et al., 2013) builds embeddings by training a shallow neural network to predict a set of context words based on a target word (this is the so-called skipgram architecture; if we instead predict the target word based on the context words the model is called continuous bag of words)." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-47", "text": "The network learns two sets of vectors, one for the target terms (the embedding vectors), and one for context terms." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-48", "text": "The objective of the network is to learn vectors such that their dot product correspond to the log likelihood of observing word pairs in the training data." 
}, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-49", "text": "fastText (Bojanowski et al., 2017) uses the same neural network architecture, but incorporates character information by using character n-grams instead of whole words in the prediction step." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-50", "text": "It should be noted that most applications of the above-mentioned vectors use only the embeddings for the target terms." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-51", "text": "In fact, many repositories with pretrained vectors do not even contain the context embeddings." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-52", "text": "When the downstream task focuses on associative relations (which is the case in the present scenario with names and occupations), it would be beneficial to be able to use both target and context vectors, since using only one of these will result in more paradigmatic similarities." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-53", "text": "ELMo (Peters et al., 2018 ) is a deep characterbased neural network that learns embeddings by predicting the next token given an input sequence." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-54", "text": "The network architecture includes both convolutional and (bidirectional) LSTM layers, and produces an embedding that that is sensitive to the particular context of the input sequence." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-55", "text": "ELMo is thus different from word2vec and fastText in the sense that it produces contextualized embeddings, which has proven to be highly beneficial when using the embeddings as representation in downstream NLP tasks such as classification, entity recognition, and question answering." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-56", "text": "BERT (Devlin et al., 2018) is similar to ELMo in the sense that it uses a deep neural network architecture and produces contextualized embeddings." 
}, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-57", "text": "However, it differs in the type of network used." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-58", "text": "BERT uses a (bidirectional) Transformer network that relies exclusively on attention, and the model is trained using a masked language model task, similar to a cloze test." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-59", "text": "Contrary to ELMo, BERT is not character-based, but relies on WordPiece tokenization of the input data." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-60", "text": "This has some potentially problematic effects when tokenizing proper names." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-61", "text": "As an example, the Swedish male name \"Henrik\" gets tokenized as [\"hen\", \"##rik\"], with \"rik\" probably deriving from the Swedish word \"rik\" (eng. \"rich\")." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-62", "text": "It would have been desirable to not use WordPiece tokenization for proper names." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-63", "text": "In the following experiments, pre-trained ELMo and BERT are used to produce contextualized embeddings both for individual words (such as names or places) and for texts (such as company descriptions)." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-64", "text": "Pre-trained word2vec and fastText are used to look up individual words, and for texts we follow standard practice and average the vectors of the component words." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-65", "text": "Since proper names in Swedish use uppercase for the initial letter, we retain the casing information for all models that can handle such vocabulary, which in our case are all models except word2vec." 
}, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-66", "text": "----------------------------------" }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-67", "text": "**DATA**" }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-68", "text": "In order to investigate whether our concerns about gender bias in pretrained Swedish embeddings are valid, we collect lists of the 100 most common Swedish female and male first names from Statistics Sweden (www.scb.se)." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-69", "text": "We also collect lists of the most typical female and male occupations from the same source, as shown in Tables 3 and 4 (next page)." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-70", "text": "These are the most common occupations for women and men as compiled by Statistics Sweden, together with the percentage of women and men in each occupation." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-71", "text": "Since our interest in this paper is bias, we do not include occupations that have less than (or close to) 50% occurrence of women or men (such cases are marked by * in the tables)." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-72", "text": "This leaves us with 18 typically female occupations, and 15 typically male occupations." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-73", "text": "Some of the remaining occupations are very similar to each other, and we therefore collapse them to one occupation (marked by numbers in the tables), resulting in 14 distinct female occupations and 14 distinct male occupations." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-74", "text": "For each of these gendered occupations, we also list a number of synonyms, collected from wikipedia.se and framtid.se." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-75", "text": "Morphological variants of each term are included." 
}, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-76", "text": "----------------------------------" }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-77", "text": "**EXPERIMENT 1: NAMES AND OCCUPATIONS**" }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-78", "text": "As a first experiment, we compute the similarity between the names and the occupations using the different embeddings." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-79", "text": "We do this by computing the similarity between each name and each occupation." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-80", "text": "Table 5 shows the percentage of female and male names that are on average more similar to a female vs. male occupation." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-81", "text": "Numbers in parentheses are based on only the most similar oc- There are several ways in which an embedding could show bias in this setting." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-82", "text": "The arguably most detrimental effect would be if the embedding grouped male names with male occupations and female names with female occupations." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-83", "text": "Somewhat less severe, but still problematic, would be if the embedding grouped all names with female or male occupations." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-84", "text": "A completely unbiased model would not show any difference between the female and male names with respect to female and male occupations." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-85", "text": "The numbers in Table 5 demonstrate some interesting differences between the different embeddings." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-87", "text": "fastText, on the other hand, shows a bias for female occupations for male names, and for male occupations for female names." 
}, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-88", "text": "This is a very interesting difference, given that the only algorithmic difference between these models is the inclusion of character n-grams in the latter model." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-89", "text": "The results for ELMo and BERT show some interesting differences too." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-90", "text": "ELMo groups the male names with the male occupations, but is less biased for the female names." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-91", "text": "When counting only the single most similar occupation, ELMo shows a similar tendency as word2vec and groups both male and female names with male occupations." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-92", "text": "BERT, on the other hand, seems slightly more balanced, with a tendency similar to fastText when counting the average similarities." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-93", "text": "When only counting the single most similar occupation, BERT is almost perfectly balanced between female and male occupations." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-94", "text": "----------------------------------" }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-95", "text": "**DEBIASING EMBEDDINGS**" }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-96", "text": "We apply the debiasing methodology in (Bolukbasi et al., 2016b) to the pretrained embedddings." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-97", "text": "Debiasing a given vector space involves finding the general direction in it that signifies gender using a set of predefined definitional pairs, and then removing the direction from all vectors except those corresponding to words that are naturally gender specific." 
}, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-98", "text": "The definitional pairs are word pairs expressing among themselves a natural distinction between the genders, e.g., he -she, and motherfather." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-99", "text": "In our setting, there are 10 such pairs." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-100", "text": "The gender specific words are words that also carry a natural gender dimension that should not be corrected during the debiasing phase of the vector space." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-101", "text": "We use the same methodology for growing a seed set of gender specific words into a larger set as described in (Bolukbasi et al., 2016b) , and end up with 486 manually curated gender specific words, including e.g., farfar (paternal grandfather), tvillingsystrar (twin sisters), and matriark (matriarch)." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-102", "text": "The definitional pairs are used to find a gender direction in the embedding space, which is done by taking the difference vector of each of the definitional pairs (i.e. w 1 \u2212 w 2 ), and then factorizing the mean-centered difference vectors using PCA, retaining only the first principal component, which will act as the gender direction." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-103", "text": "The vector space is then hard debiased 1 in the sense that the gen- der direction b is removed from the embeddings of all non-gender specific words w using orthogonal projection: w = w \u2212 b \u00d7 w\u00b7b b\u00b7b ." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-104", "text": "The approach described by (Bolukbasi et al., 2016b) includes an equalize step to make all gender neutral words equidistant to each of the members of a given equality set of word pairs." 
}, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-105", "text": "The equality set is application specific, and since the current investigation of Swedish language embeddings does not naturally lend itself to include an equality set, the debiasing of the embeddings does not involve equalization in our case." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-106", "text": "We apply the method described above to all pretrained embeddings in Table 3 , as well as to the token vectors generated by ELMo and BERT." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-107", "text": "Although it is not clear whether the proposed debiasing method is applicable to embeddings produced by contextualized language models, we argue that it is reasonable to treat the contextualized models as black boxes, and rely only on their output, given the proposed use case." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-108", "text": "----------------------------------" }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-110", "text": "We repeat the experiment described in Section 5, but using the debiased embeddings." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-111", "text": "Table 6 summarizes the results." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-112", "text": "It is clear that the debiasing method does not have any impact on the results in these experiments." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-113", "text": "The tendencies for the wordbased embeddings word2vec and fastText are more or less identical before and after debiasing." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-114", "text": "The most striking differences between Table 5 and Table 6 are the results for ELMo and BERT, which become less balanced after applying the debiasing method." 
}, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-115", "text": "ELMo actually shows a clearer gender distinction after debiasing, with male names being more similar to male occupations, and female names being more similar to feamong all vectors and decreasing the influence of the gender specific direction." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-116", "text": "male occupations." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-117", "text": "BERT also becomes less balanced after debiasing, grouping male names with female occupations, and female names with male occupations, when considering the average similarities." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-118", "text": "When counting only the most similar occupation per name, BERT is still well balanced after debiasing." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-119", "text": "----------------------------------" }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-120", "text": "**EXPERIMENT 3: COMPANY NAMES AND COMPANY DESCRIPTIONS**" }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-121", "text": "The experiments in the previous sections are admittedly somewhat simplistic considering the scenario discussed in the Introduction: quantifying the similarity between a company name and a company description." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-122", "text": "In particular the contextualized language models are not primarily designed for generating token embeddings, and it is neither clear what kind of quality we can expect from such un-contextualized token embeddings, nor whether they are susceptible to the debiasing operation discussed in Section 6." 
}, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-123", "text": "In order to provide a more realistic scenario, we also include experiments where we compute the similarity between a set of actual company descriptions and a set of fictive company names generated from the lists of male and female names by adding the term \"Aktiebolag\" (in English limited company) after each name." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-124", "text": "2 The company descriptions are provided by the Swedish Companies Registration Office, and contain approximately 10 company descriptions for each of the sectors construction work, vehicles and transportation, information technologies, health and health care, education, and economy." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-125", "text": "Based on Tables 3 and 4, we consider the descriptions from the first three sectors to be representative of typically male occupations, and the descriptions from the latter three sectors to be representative of typically female occupations." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-126", "text": "We generate vectors for each of the descriptions and for each fictive company name (i.e. a male or female name, followed by \"Aktiebolag\")." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-127", "text": "For the word-based models (word2vec and fastText), we take the average of the embeddings of the words in the descriptions and the name." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-128", "text": "For the contextualized language models (ELMo and BERT), we generate vectors for each description and each fictive name." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-129", "text": "In the case of ELMo we take the average over the three LSTM layers, and for BERT we use the output embedding for the [CLS] token for each of the input sequences." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-130", "text": "The results are summarized in Table 7 ." 
}, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-131", "text": "It is clear that these results are significantly more balanced than the results using tokens only." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-132", "text": "Even so, there are still some interesting differences between the embeddings." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-133", "text": "Contrary to the results in Tables 5 and 6 , word2vec now shows a bias for female occupations, and fastText now shows a bias for male occupations." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-134", "text": "ELMo and BERT seem more balanced, with ELMo showing almost perfectly balanced results, and BERT showing a slight bias for female occupations." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-135", "text": "Even though the biases apparently are different when considering tokens in comparison with considering texts, there are still biases in all models in both cases." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-136", "text": "The only exception in our experiments is ELMo, when used for texts instead of tokens." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-137", "text": "We hypothesize that the results for BERT are negatively affected by artefacts of the WordPiece tokenization, as discussed in Section 3." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-138", "text": "----------------------------------" }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-139", "text": "**THE EFFECT OF DEBIASING ON EMBEDDINGS**" }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-140", "text": "So far, we have shown that all Swedish pretrained embeddings included in this study exhibit some degree of gender bias when applied to a real-world scenario." 
}, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-141", "text": "We now turn to investigate the effect the hard debiasing operation has on the embedding spaces, using the intrinsic evaluation methodology of Bolukbasi et al. (2016b) ." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-142", "text": "In this setting, a number of analogy pairs are extracted for the original and debiased embeddings, and human evaluators are used to asses the number of appropriate and stereotypical pairs in the respective representations." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-143", "text": "Bolukbasi et al. (2016b) used 10 crowdworkers to classify the analogy pairs as being appropriate or stereotypical." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-144", "text": "Their results indicated that 19% of the top 150 analogies generated using the original embedding model were deemed gender stereotypical, while the corresponding figure for the hard debiased model was 6%." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-145", "text": "We carry out a similar, but smaller, evaluation exercise using the analogy pairs generated by the original Swedish word2vec and fastText models, as well as their debiased counterparts." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-146", "text": "3 We use honhan (she -he) as seed pair, and score all word pairs in the embeddings with respect to the similarity of the word pair's difference vector to that of the the seed pair." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-147", "text": "The top 150 pairs are manually categorized as either appropriate, gender stereotypical, or uncertain by the authors." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-148", "text": "The results of the annotation are shown in Table 8 (next page)." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-149", "text": "Due to the limited extent of the evaluation, we can only use these results for painting the big picture." 
}, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-150", "text": "First of all, there is a relatively small overlap between the analogy pairs in the top 150 of the original models, and the top lists of the debiased models: for word2vec, only 42 of the analogy pairs in the original list are also in the list produced by the debiased model." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-151", "text": "The corresponding number for fastText is 31." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-152", "text": "This means that the debiasing operation changes the embedding space to a large extent." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-153", "text": "Second, there is a considerable amount of annotator uncertainty involved, either regarding the plausibility of a given analogy pair, or regarding its appropriateness." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-154", "text": "This is manifested by an increase of the number of uncertain analogy pairs that the annotators agree on between the original and debiased models (both for word2vec and fastText)." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-155", "text": "However, the most interesting findings have to do with the number of stereotypical analogy pairs." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-156", "text": "The number of stereotypical analogy pairs output by the Swedish models is small compared to the numbers reported by Bolukbasi et al. (2016b) ." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-157", "text": "Further, the number of stereotypical pairs is larger in the debiased word2vec model than in the original model (we anticipated that it should be lower)." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-158", "text": "It thus seems as if the debiasing operation makes the word2vec embedding space more biased." 
}, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-159", "text": "For fastText, the number of such pairs are slightly fewer in the debiased model compared to its original counterpart." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-160", "text": "----------------------------------" }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-161", "text": "**DISCUSSION**" }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-162", "text": "This paper has shown that pretrained Swedish embeddings do exhibit gender bias to varying extent, and that the debiasing operation suggested by Bolukbasi et al. (2016a) does not have the desired effect, neither in the task of matching person names with occupations, nor in the case of the gender stereotypes being present among the top ranked analogy pairs generated by the models." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-163", "text": "Our experiments also indicate that word-based embeddings are more susceptible to bias than contextualized language models, and that there is an unexpectedly large difference in the biases shown by word2vec and fastText, something we believe requires further study." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-164", "text": "Although contextualized language models appear to be more balanced with respect to gender bias in our experiments, there is still bias in these models; in particular if they are used to generate token embeddings, but also when they are used to generate representations for texts -ELMo, which produces almost perfect scores in Table 7 , may still show bias in individual examples, such as those in Table 1 ." 
}, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-165", "text": "We acknowledge the possibility that it may not be appropriate to use contextualized language models to generate embeddings for individual tokens, but we also believe such usages to occur in real-world applications, and we therefore consider it relevant to include such examples in these experiments." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-166", "text": "The debiasing operation proposed by Bolukbasi et al. (2016a) does nothing to rectify the situation in our setting." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-167", "text": "On the contrary, the debiased models still show significant gender bias, and in the case of ELMo and BERT, the bias actually becomes more prevalent after debiasing." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-168", "text": "(However, we are aware that the debiasing operation may neither be intended nor suitable for such representations.) Furthermore, our (admittedly small) analogy evaluation shows that debiasing actually introduces more stereotypical word pairs in the word2vec model." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-169", "text": "Why then does not debiasing the Swedish wordbased embeddings produce results similar to those of Bolukbasi et al. (2016a) ?" }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-170", "text": "One of the big differences between the Swedish pretrained word2vec model and the one used by Bolukbasi et al. is the size of the vocabulary." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-171", "text": "The Swedish model contains 3M+ word types, while Bolukbasi et al. constrained their experiments to include only lowercased words shorter than 20 characters, omitting digits and words containing punctuation, from the top 50,000 most frequent words in the model." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-172", "text": "By doing so, Bolukbasi et al. 
effectively removed many person names from the model." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-173", "text": "A large portion of the word pairs in our analogy lists produced by the original model consist of person names (e.g., Anna -Jakob), which we consider to be appropriate, and their presence on the top 150 list contribute to the comparatively low number of stereotypical pairs." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-174", "text": "The debiasing operation of the word-based models remove many of the persons name pairs from the top list, giving way for potentially stereotypical pairs." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-175", "text": "Thus, the increase of stereotypical pairs on the top list of analogy pairs generated by a debiased model is more likely to be due to the debiasing operation effectively removing many of the names from the top list, than the model being more biased in the first place." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-176", "text": "Since our experiments have focused on pretrained embeddings readily available on the Internet, which have been trained on different types and different sizes of data, we cannot speculate about the extent to which a particular learning algorithm amplifies or distorts bias." }, { "sent_id": "c2952b2da147d5f128cdbd5d8074a5-C001-177", "text": "We believe this is an interesting direction for further research, and we aim to replicate this study using a variety of embeddings trained on the same data." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "c2952b2da147d5f128cdbd5d8074a5-C001-31" ], [ "c2952b2da147d5f128cdbd5d8074a5-C001-33" ], [ "c2952b2da147d5f128cdbd5d8074a5-C001-34" ], [ "c2952b2da147d5f128cdbd5d8074a5-C001-104" ] ], "cite_sentences": [ "c2952b2da147d5f128cdbd5d8074a5-C001-31", "c2952b2da147d5f128cdbd5d8074a5-C001-33", "c2952b2da147d5f128cdbd5d8074a5-C001-34", "c2952b2da147d5f128cdbd5d8074a5-C001-104" ] }, "@SIM@": { "gold_contexts": [ [ "c2952b2da147d5f128cdbd5d8074a5-C001-39" ], [ "c2952b2da147d5f128cdbd5d8074a5-C001-96" ], [ "c2952b2da147d5f128cdbd5d8074a5-C001-101" ] ], "cite_sentences": [ "c2952b2da147d5f128cdbd5d8074a5-C001-39", "c2952b2da147d5f128cdbd5d8074a5-C001-96", "c2952b2da147d5f128cdbd5d8074a5-C001-101" ] }, "@USE@": { "gold_contexts": [ [ "c2952b2da147d5f128cdbd5d8074a5-C001-39" ], [ "c2952b2da147d5f128cdbd5d8074a5-C001-96" ], [ "c2952b2da147d5f128cdbd5d8074a5-C001-101" ], [ "c2952b2da147d5f128cdbd5d8074a5-C001-141" ] ], "cite_sentences": [ "c2952b2da147d5f128cdbd5d8074a5-C001-39", "c2952b2da147d5f128cdbd5d8074a5-C001-96", "c2952b2da147d5f128cdbd5d8074a5-C001-101", "c2952b2da147d5f128cdbd5d8074a5-C001-141" ] }, "@DIF@": { "gold_contexts": [ [ "c2952b2da147d5f128cdbd5d8074a5-C001-156" ] ], "cite_sentences": [ "c2952b2da147d5f128cdbd5d8074a5-C001-156" ] } } }, "ABC_f29baa099b13f38badeb4cbd8789f6_17": { "x": [ { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-271", "text": "In contrast, some articles that have higher average SL, but labeled as very easy or easy." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-272", "text": "We randomly choose such articles and observe average SL." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-273", "text": "The average SL of articles belong to medium is 7.40 and the average SL of articles belongs to easy or very easy is 12.08." 
}, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-274", "text": "However, articles that are labeled as medium have higher average word entropy than articles that are labeled as easy or very easy." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-275", "text": "This shows that different type of features should be considered together to build a readability classifier." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-276", "text": "----------------------------------" }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-277", "text": "**CONCLUSION**" }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-2", "text": "Many news papers publish articles for children." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-3", "text": "Journalists use their experience and intuition to write these." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-4", "text": "They might not aware of readability of articles they write." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-5", "text": "There is no evaluation tool or method available to determine how appropriate these articles are for the target readers." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-6", "text": "In this paper, we evaluate difficulty of Bangla news articles that are written for children." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-7", "text": "----------------------------------" }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-9", "text": "News is the communication of selected information on current events (Shirky, 2009 )." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-10", "text": "This communication is shared by various mediums such as print, online and broadcasting." 
}, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-11", "text": "A newspaper is a printed publication that contains news and other informative articles." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-12", "text": "There are many newspapers that are also published online." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-13", "text": "Due to the rapid growth of internet use, it is very common that more people read news online nowadays than before." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-14", "text": "Newspapers try to target certain audience through different topics and stories." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-15", "text": "Children are also in their target audience." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-16", "text": "This target group is their future reader." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-17", "text": "Nowadays children also read news online." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-18", "text": "One third of children in developed countries such as Netherlands, United Kingdoms and Belgium browse internet for news (De Cock, 2012; De Cock and Hautekiet, 2012) ." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-19", "text": "Another study by Livingstone et al. (2010) showed that one fourth of the British children between age of nine and nineteen look for news on the internet." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-20", "text": "The ratio could be similar in other developed countries where most of the citizen have access over the internet." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-21", "text": "The number of internet users also increasing in developing countries such as Bangladesh and India." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-22", "text": "According to the English Wikipedia 1 , more than thirty three million people in Bangladesh use internet and many of them read news online." 
}, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-23", "text": "Also the Alexa index 2 shows that three Bangla news sites are in the list of ten most visited websites from Bangladesh." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-24", "text": "All newspapers contain a variety of sections." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-25", "text": "These sections are based on different news topics." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-26", "text": "Some of the them are specific to children." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-27", "text": "The news for children will vary linguistically and cognitively than news for adults." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-28", "text": "This characteristic is similar to the websites dedicated for children." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-29", "text": "De Cock and Heutekiet (2012) observed difficulties for children to navigate these websites." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-30", "text": "Readability of the texts is one of the reasons." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-31", "text": "There is no specific guideline for writing texts for this target group." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-32", "text": "Journalists use their experience and intuition while writing." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-33", "text": "However, a text that is very easy to understand for an adult reader could be very difficult for a child." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-34", "text": "This difficulty motivate children readers to skip the newspaper in future." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-35", "text": "The readability of a text relates to how easily human readers can process and understand a text." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-36", "text": "There are many text related factors that influence the readability of a text." 
}, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-37", "text": "These factors include very simple features such as type face, font size, text vocabulary as well as complex features like grammatical conciseness, clarity, underlying semantics and lack of ambiguity." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-38", "text": "Nielsen (2010) recommended font size of 14 for young children and 12 for adults." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-39", "text": "----------------------------------" }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-40", "text": "**PACLIC 28**" }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-41", "text": "----------------------------------" }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-42", "text": "**! 310**" }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-43", "text": "Readability classification, is a task of mapping text onto a scale of readability levels." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-44", "text": "We explore the task of automatically classifying documents based on their different readability levels." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-45", "text": "As an input, this function operates on various statistics relating to different text features." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-46", "text": "In this paper, we train a readability classification model using a corpus compiled from textbooks and features inherited from our previous works Islam et al. (2012; and features from Sinha et al. (2012) ." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-47", "text": "Later we use the model to classify Bangla news articles for children from different well-known news sources from Bangladesh and West Bengal." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-48", "text": "The paper is organized as follows: Section 2 discusses related work." 
}, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-49", "text": "Section 3 describes cognitive model of children in terms of readability followed by an introduction of the training corpus and news articles in Section 4." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-50", "text": "The features used for classification are described in Section 5, and our experiments and results in Section 6 are followed by a discussion in Section 7." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-51", "text": "Finally, we present our conclusions in Section 8." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-52", "text": "----------------------------------" }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-53", "text": "**RELATED WORK**" }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-54", "text": "Most of the text readability research works use texts for adult readers." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-55", "text": "Only few numbers of related work available that only focus on texts for children." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-56", "text": "De Belder and Moens (2010) perform a study that transfers a complex text into a simpler text so that the target text become easier to understand for children." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-57", "text": "They have focused on two types of simplification: lexical and syntactic." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-58", "text": "Two traditional readability formulas: Flesch-Kincaid (Kincaid et al., 1975) and Dale-Chall (Dale and Chall, 1948; Dale and Chall, 1995) are used to measure reading difficulty." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-59", "text": "De Cock and Heutekiet (2012) performed a usability study to analyze websites for children." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-60", "text": "The study uses texts from different websites published in English and Dutch." 
}, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-61", "text": "The usability experiment shows that previous knowledge of children play an important role to read and understand texts." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-62", "text": "They have used Flesch-Kincaid (Kincaid et al., 1975) to determine the difficulty level of English texts and a variation of the same formula for Dutch texts." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-63", "text": "Both of the related work mentioned above use traditional readability formulas to measure text difficulty." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-64", "text": "However these traditional formulas have significant drawbacks." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-65", "text": "These formulas assume that texts do not contain noise and the sentences are always well-formed." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-66", "text": "However this is not the case always." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-67", "text": "Traditional formulas require significant sample sizes of text, they become unreliable for a text that contains less than 300 words (Kidwell et al., 2011) ." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-68", "text": "Si and Callan (2001) , Peterson and Ostendorf (2009) and Feng et al. (2009) show that these traditional formulas are not reliable." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-69", "text": "These formulas are easy to implement, but have a basic inability to model the semantic of vocabulary usage in a context." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-70", "text": "The most important limitation is that these measures are based only on surface characteristics of texts and ignore deeper properties." 
}, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-71", "text": "They ignore important factors such as comprehensibility, syntactical complexity, discourse coherence, syntactic ambiguity, rhetorical organizations and propositional density of texts." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-72", "text": "Longer sentences are not always syntactically complex and counting the number of syllables of a single word does not show word difficulty." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-73", "text": "That is why, the validity of these traditional formulas for text comprehensibility is often suspect." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-74", "text": "Two recent works on Bangla texts use two of these traditional formulas." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-75", "text": "Das and Roychudhury (2004; show that readability measures proposed by Kincaid et al. (1975) and Gunning (1952) work well for Bangla." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-76", "text": "However, the measures were tested only for seven documents, mostly novels." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-77", "text": "Since there are not many linguistic tools available for Bangla, researchers are exploring language independent and surface features to measure difficulty of Bangla texts." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-78", "text": "Recently, in our previous works, we proposed a readability classifier for Bangla using information-theoretic features (Islam et al., 2012; Islam et al., 2014) ." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-79", "text": "We have achieved an F-Score of 86.46% by combining these features with some lexical features." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-80", "text": "Sinha et al. (2012) proposed two readability models that are similar to classical readability measures for English." 
}, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-81", "text": "They conducted a user experiment to identify important structural parameters of Bangla texts." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-82", "text": "These measures are based on the average word length (WL), the number of poly-syllabic words and the number of consonantconjuncts." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-83", "text": "According to their experimental results, consonant-conjuncts plays an important role in texts in terms of readability." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-84", "text": "From the beginning of research on text readability, researchers proposed different measures for English (Dale and Chall, 1948; Dale and Chall, 1995; Gunning, 1952; Kincaid et al., 1975; Senter and Smith, 1967; McLaughlin, 1969) ." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-85", "text": "Many commercial readability tools use traditional measures." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-86", "text": "Fitzsimmons et al. (2010) stated that the SMOG (McLaughlin, 1969) readability measure should be preferred to assess the readability of texts on health care." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-87", "text": "Due to recent achievements in linguistic data processing, different linguistic features are now in the focus of readability studies." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-88", "text": "Islam et al." 
}, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-89", "text": "(2012) summarizes related work regarding language model-based features (Collins- Thompson and Callan, 2004; Schwarm and Ostendorf, 2005; Aluisio et al., 2010; Kate et al., 2010; Eickhoff et al., 2011) , POS-related features (Pitler and Nenkova, 2008; Feng et al., 2009; Aluisio et al., 2010; Feng et al., 2010) , syntactic features (Pitler and Nenkova, 2008; Barzilay and Lapata, 2008; Heilman et al., 2007; Heilman et al., 2008; Islam and Mehler, 2013) , and semantic features (Feng et al., 2009; Islam and Mehler, 2013) ." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-90", "text": "Recently, Hancke et al. (2012) found that morphological features influence the readability of German texts." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-91", "text": "Due to unavailability of linguistic resources for Bangla, we did not explore any of the linguistically motivated features." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-92", "text": "We have inherited features from Islam et al. (2012; and Sinha et al. (2012) , these features achieve reasonable classification accuracy." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-93", "text": "Children's reading skills is influenced by their cognitive ability." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-94", "text": "The following section describes children's cognitive model and text readability." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-95", "text": "----------------------------------" }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-96", "text": "**TEXT READABILITY AND CHILDREN**" }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-97", "text": "Children start building their cognitive skills from an early age." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-98", "text": "They use their cognitive skills to perform different tasks in different environments." 
}, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-99", "text": "Kali (2009) stated that children refine their motor skills and start to be involved in different social games when they are 5 to 6 years of age." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-100", "text": "From age of 6 to 8, children start to expand their vision beyond their immediate surroundings." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-101", "text": "Children from 8 to 12 years of age acquire the ability to present different entities of the world using concepts and abstract representations." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-102", "text": "Children become more interested in social interactions in their teenage years." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-103", "text": "Children learn to recognize alphabets prior they developed motor skills." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-104", "text": "This lead to develop their reading skills." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-105", "text": "Reading skills require two processes: word decoding and comprehension." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-106", "text": "Word decoding is a process of identifying a pattern of alphabets." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-107", "text": "Children must have the knowledge about these and their patterns." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-108", "text": "For example: it is impossible to recognise any word from any language without knowledge of alphabets of that language." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-109", "text": "A pattern of alphabets carry a semantic in their cognitive knowledge." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-110", "text": "Comprehension is a process of extracting meaning from a sequence of words." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-111", "text": "The sequence of words follow an order." 
}, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-112", "text": "It could be impossible for children to understand a sentence where the order of the words is random." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-113", "text": "Therefore, word order plays an important role in text comprehension." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-114", "text": "Reading is different than understanding a picture, it extracts meaning from words that are separated by white spaces." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-115", "text": "The comprehension process is also influenced by the memory system." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-116", "text": "The cognitive system of humans contains three different memories: sensory memory, working memory and long-term memory (Rayner et al., 2012) ." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-117", "text": "The sensory store contains raw, un-analyzed information very briefly." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-118", "text": "The ongoing cognitive process takes place in working memory and the longterm memory is the permanent storehouse of knowledge about the world (Kali, 2009) ." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-119", "text": "Older children sometimes are better where they simply retrieve a word from their memory while reading." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-120", "text": "A younger children might have to sound out of a novel word spelling." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-121", "text": "However they are also able to retrieve some of the familiar words." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-122", "text": "Children derive meaning of a sentence by combining words to form propositions then combine them get the final meaning." 
}, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-123", "text": "Some children might struggle to recognize words which make them unable to establish links between words." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-124", "text": "Children without this problem able to recognize words and derive meaning from a whole sentence." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-125", "text": "Generally, older children are better reader due to their working memory capacity where they can store more of a sentence in their memory as they are able to identify propositions in the sentence (De Beni and Palladino, 2000) ." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-126", "text": "Older children are able to comprehend more than younger children because of recognizing ability and more working memory (Kali, 2009) ." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-127", "text": "They also know more about the world and skilled to use appropriate reading strategies." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-128", "text": "In summary, children become skilled reader as their working memories develop over time, extract propositions and combine them to understand the meaning of a sentence." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-129", "text": "----------------------------------" }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-130", "text": "**DATA**" }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-131", "text": "The goal of this study is to asses difficulty of news articles that are aimed for children." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-132", "text": "The reading ability of children is very different than adult readers." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-278", "text": "In this paper, our goals was to examine the difficulty levels of news articles targeting children." 
}, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-133", "text": "The preceding section describes cognitive developments of children in terms of readability." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-134", "text": "A children who is 10 years old will have different reading capability than a children who is 15 years of old." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-135", "text": "That is why, a corpus that is categorized by the ages of children would be an ideal resource as training corpus." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-136", "text": "Duarte and Weber (2011) proposed different categories of children based on their ages." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-137", "text": "The categorized list is relevant with our study." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-138", "text": "However, our categorized list is still different than their one." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-139", "text": "The corpus is categorized as following age ranges:" }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-140", "text": "\u2022 early elementary: 7 9 years old In this paper, we train a model using support vector machine (SVM)." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-141", "text": "This technique requires a training corpus." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-142", "text": "We compile the training corpus from textbooks that have been using for teaching in different school levels in Bangladesh." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-143", "text": "The following subsections describe the training corpus and children news articles." 
}, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-144", "text": "----------------------------------" }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-145", "text": "**TRAINING CORPUS**" }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-146", "text": "The training corpus targets top four age groups described above." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-147", "text": "Textbooks from grade two to grade ten are considered as sources for corpus compilation." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-148", "text": "Generally, in Bangladesh children start going to schools when they are 6 years of old and finish the grade ten when they are fifteen (Arends-Kuenning and Amin, 2004 al. (2012; 2014) , we compile the corpus from the same source." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-149", "text": "However, the latest version is more cleaned and contains more documents." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-150", "text": "It contains texts from 54 textbooks." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-151", "text": "Table 1 shows the statistics of average document length (DL), average sentence length (SL) and average word length (WL)." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-152", "text": "Textbooks were written using ASCII encoding which required to be converted into Unicode." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-153", "text": "The classification distinguishes four readability classes: very easy, easy, medium and difficult." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-154", "text": "Documents of (school) grade two, three and four are included into the class very easy." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-155", "text": "Class easy covers texts of grade five and six." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-156", "text": "Texts of grade seven and eight were subsumed under the class medium." 
}, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-157", "text": "Finally, all texts of grade nine and ten are belong to the class difficult." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-158", "text": "----------------------------------" }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-159", "text": "**NEWS ARTICLES**" }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-160", "text": "The goal of this paper is observing children news articles in Bangla on the basis of difficulty levels." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-161", "text": "As an Indo-Aryan language Banga is spoken in Southeast Asia, specifically in present day Bangladesh and the Indian states of West Bengal, Assam, Tripura and Andaman and on the Nicobar Islands." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-162", "text": "With nearly 250 million speakers (Karim et al., 2013) , Bangla is spoken by a large speech community." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-163", "text": "However, due to lack of linguistic resources Bangla is considered as a low-resourced language." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-164", "text": "We collected children news articles from four popular news sites from Bangladesh and one from West Bengal." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-165", "text": "The sites are: Banglanews24 3 , Bdnews24 4 , Kaler kantho 5 , Prothom alo 6 and Ichchhamoti 7 ." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-166", "text": "Banglanews24, Bdnews24 and Ichchhamoti publish online only." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-167", "text": "In contrast, Kalerkantho and Prothomalo publish as printed newspapers and online." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-168", "text": "These newspapers publish weekly featured articles for children." 
}, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-169", "text": "We have collected 50 fea-3 www.banglanews24.com 4 www.bangla.bdnews24.com 5 www.kalerkantho.com 6 www.prothomalo.com 7 http://www.ichchhamoti.in/ PACLIC 28 ! 313 tured articles from each of the sites and pre-process in similar way as the training corpus." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-170", "text": "However, the news articles are already written in Unicode and cover different topics ranges from family, society, science and history to sports." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-171", "text": "Table 2 : Statistics of news articles." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-172", "text": "----------------------------------" }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-173", "text": "**FEATURE SELECTION**" }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-174", "text": "A limited number of related works available that deal texts from Bangla." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-175", "text": "All of them are limited into traditional readability formulas, lexical and information-theoretic features." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-176", "text": "Any of features do not require any linguistic pre-processing." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-177", "text": "The following subsections describe feature selection in detail." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-178", "text": "----------------------------------" }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-179", "text": "**LEXICAL FEATURES**" }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-180", "text": "We inherited a list of lexical features from our previous study Islam et al. (2014) ." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-181", "text": "Lexical features are very cheap to compute and shown useful for different text categorizing tasks." 
}, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-182", "text": "Average SL and average WL are two of the most widely used features for readability classification." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-183", "text": "Recently, Learning (2001) showed that these are the two most reliable measures that affect the readability of texts." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-184", "text": "The average SL is a quantitative measure of syntactic complexity." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-185", "text": "In most cases, the syntax of a longer sentence is more difficult than the syntax of a shorter sentence." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-186", "text": "However, children of a lower grade level are not aware of syntax." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-187", "text": "A long word that contains many syllables is morphologically complex and leads to comprehension problems (Harly, 2008)." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-188", "text": "Generally, most of the frequent words are shorter in length." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-189", "text": "These frequent words are more likely to be processed with a fair degree of automaticity." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-190", "text": "This automaticity increases reading speed and frees memory for higher-level meaning building (Crossley et al., 2008)." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-191", "text": "Our previous study (Islam et al., 2014) also listed different type-token ratio (TTR) formulas." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-192", "text": "The TTR indicates the lexical density of a text; a higher value reflects a more diversified vocabulary." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-193", "text": "This diversification causes difficulties for children."
}, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-194", "text": "In a diversified text, synonyms may be used to represent similar concepts." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-195", "text": "Children find it difficult to detect the relationship between synonyms (Temnikova, 2012)." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-196", "text": "----------------------------------" }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-197", "text": "**INFORMATION-THEORETIC FEATURES**" }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-198", "text": "Nowadays, researchers are exploring uncertainty-based features from the field of information theory to measure complexity in natural languages (Febres et al., 2014)." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-199", "text": "Information theory studies statistical laws of how information can be optimally coded (Cover and Thomas, 2006)." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-200", "text": "The entropy rate plays an important role in human communication in general (Genzel and Charniak, 2002; Levy and Jaeger, 2007)." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-201", "text": "The rate of information transmission per second in a human speech conversation is roughly constant; that is, speakers transmit a constant number of bits per second, maintaining a constant entropy rate." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-202", "text": "The entropy of a random variable is related to the difficulty of correctly guessing the value of the corresponding random variable." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-203", "text": "In our previous studies, Islam et al. (2012) and Islam and Mehler (2013) used different information-theoretic features for text readability classification."
}, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-204", "text": "Our hypothesis was that the higher the entropy, the less readable the text along the feature represented by the corresponding random variable." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-205", "text": "We have inherited seven information-theoretic features from our previous studies." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-206", "text": "----------------------------------" }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-207", "text": "**READABILITY MODELS FOR BANGLA**" }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-208", "text": "Recently, Sinha et al. (2012) proposed a few computational models that are similar to the traditional English readability formulas." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-209", "text": "A user study was performed to evaluate their performance." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-210", "text": "We also inherited two of their best performing models:" }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-211", "text": "In their models, they use structural parameters such as the average WL, the number of jukta-akshars (JUK) or consonant conjuncts, and the number of polysyllabic words (PSW)." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-212", "text": "PSW30 denotes the value of PSW normalized over 30 sentences." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-213", "text": "Table 3: Performance of Bangla readability models proposed by Sinha et al. (2012)." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-214", "text": "In this paper, we use 20 features to generate feature vectors for the classifier." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-279", "text": "Therefore, we build a readability classifier that is able to classify the corresponding news articles into different difficulty levels."
}, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-280", "text": "Children's news articles are cognitively and linguistically different from articles for adult readers." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-281", "text": "A readability classifier trained on a textbook corpus is able to classify these articles." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-282", "text": "Linguistically motivated features could capture the linguistic properties of news articles." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-283", "text": "Lexical features and features related to information density also have good predictive power for identifying text difficulty." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-284", "text": "The classification results show that the candidate articles are appropriate for children." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-285", "text": "This study also validates that the features from our previous study (Islam et al., 2014) and the features proposed by Sinha et al. (2012) are useful for Bangla text readability analysis." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-286", "text": "There are many languages in the world which lack a readability measurement tool." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-287", "text": "A readability classifier for these languages could be built by using the features proposed in our previous study Islam et al." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-215", "text": "The following section describes our experiments and results on the training corpus and news articles." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-216", "text": "----------------------------------" }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-217", "text": "**EXPERIMENTS AND RESULTS**" }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-218", "text": "In order to find the best performing training model, we use 20 features from Islam et al. (2012) and Sinha et al. 
(2012)." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-219", "text": "Note that one hundred data sets were randomly generated, where 80% of the corpus was used for training and the remaining 20% for evaluation." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-220", "text": "The weighted average of accuracy and F-score is computed by considering the results of all data sets." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-221", "text": "We use the SMO (Platt, 1998; Keerthi et al., 2001) classifier model implemented in WEKA (Hall et al., 2009) together with the Pearson VII function-based universal kernel PUK (\u00dcst\u00fcn et al., 2006)." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-222", "text": "----------------------------------" }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-223", "text": "**TRAINING MODEL**" }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-224", "text": "The traditional readability formulas that were proposed for English texts do not work for Bangla texts (Islam et al., 2012; Islam et al., 2014; Sinha et al., 2012)." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-225", "text": "That is why we did not explore any of the traditional formulas." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-226", "text": "At first we build a classifier using two readability models from Sinha et al. (2012)." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-227", "text": "The outputs of these models are used as input for the readability classifier." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-228", "text": "Table 3 shows the evaluation results." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-229", "text": "The classification accuracy is a little over 66%." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-230", "text": "Our previous study (Islam et al., 2014) found better classification accuracy using these features."
}, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-231", "text": "However, the corpus is slightly different." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-232", "text": "The latest version of the corpus contains more documents for the easy readability class." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-233", "text": "The classifier mostly misclassifies documents from this class." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-234", "text": "The classifier labeled many of the documents from this readability class as very easy." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-235", "text": "Misclassification of documents from other readability classes is also observed." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-236", "text": "Table 4 shows the performance of the features proposed in our previous study (Islam et al., 2014). Table 4: Performance of features proposed by Islam et al. (2014)." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-237", "text": "The classification accuracy also drops." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-238", "text": "The classifier also struggles to classify documents from the easy readability class correctly." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-239", "text": "However, information-transmission based features (i.e., SL and WL prob. and SL and DW prob.) are the best performing features." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-240", "text": "Therefore, a text with a higher average SL becomes more difficult when it contains more difficult words or longer words." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-241", "text": "The classification F-score rises to 87.87 when we combine the features from Islam et al. (2014) and Sinha et al. (2012)."
}, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-242", "text": "----------------------------------" }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-243", "text": "**NEWS ARTICLES CLASSIFICATION**" }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-244", "text": "A total of 250 children's news articles were collected as candidate news articles for classification." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-245", "text": "We consider the whole training corpus in order to build a training model." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-246", "text": "The training model is used to classify the candidate news articles." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-247", "text": "Among all articles, 160 articles are labeled as very easy and 18 articles as easy." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-248", "text": "Only 2 articles are labeled as difficult and the remaining 60 articles are labeled as medium." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-249", "text": "Figure 1 shows the classification results." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-250", "text": "More than 60% of news articles from newspapers are classified as very easy." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-251", "text": "However, the amount drops below 20% for the articles from the Ichchhamoti children's magazine." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-252", "text": "The articles labeled as difficult also belong to this magazine." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-253", "text": "The evaluation shows that, among all of the newspapers, articles from Banglanews24 are the most suitable for children." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-254", "text": "Most of the articles from that site belong to the very easy and easy readability classes."
}, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-255", "text": "Apart from the classification of children's news articles, we are also interested in the behavior of different features in the classified articles." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-256", "text": "The following section describes some interesting observations we noticed." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-257", "text": "----------------------------------" }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-258", "text": "**OBSERVATION**" }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-259", "text": "Articles from Ichchhamoti have the lowest average WL, but higher values for average DW and average SL." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-260", "text": "Two of the articles from this site are labeled as difficult." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-261", "text": "This labeling could be influenced by average DW and average SL." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-262", "text": "Documents from the training corpus have a higher average WL." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-263", "text": "Among the lexical features, different TTRs have been considered to measure text difficulty (Islam et al., 2014)." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-264", "text": "An article with a higher TTR value is supposed to be more difficult than an article with a lower TTR value (see Section 5.1)." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-265", "text": "However, we observe a different behavior of the TTR formulas." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-266", "text": "Figure 2 shows the behaviour of different TTR formulas in the classified articles." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-267", "text": "The average TTR value of articles from the very easy readability class is higher than the average TTR value of articles from higher difficulty classes."
}, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-268", "text": "Article length could be the reason for this irregularity." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-269", "text": "Articles from higher difficulty classes are longer and contain more words." }, { "sent_id": "f29baa099b13f38badeb4cbd8789f6-C001-270", "text": "We also observed some articles which have a lower average SL but are labeled as medium." } ], "y": { "@USE@": { "gold_contexts": [ [ "f29baa099b13f38badeb4cbd8789f6-C001-46" ], [ "f29baa099b13f38badeb4cbd8789f6-C001-92" ], [ "f29baa099b13f38badeb4cbd8789f6-C001-208", "f29baa099b13f38badeb4cbd8789f6-C001-210" ], [ "f29baa099b13f38badeb4cbd8789f6-C001-213" ], [ "f29baa099b13f38badeb4cbd8789f6-C001-218" ], [ "f29baa099b13f38badeb4cbd8789f6-C001-226" ], [ "f29baa099b13f38badeb4cbd8789f6-C001-241" ] ], "cite_sentences": [ "f29baa099b13f38badeb4cbd8789f6-C001-46", "f29baa099b13f38badeb4cbd8789f6-C001-92", "f29baa099b13f38badeb4cbd8789f6-C001-208", "f29baa099b13f38badeb4cbd8789f6-C001-213", "f29baa099b13f38badeb4cbd8789f6-C001-218", "f29baa099b13f38badeb4cbd8789f6-C001-226", "f29baa099b13f38badeb4cbd8789f6-C001-241" ] }, "@BACK@": { "gold_contexts": [ [ "f29baa099b13f38badeb4cbd8789f6-C001-208" ], [ "f29baa099b13f38badeb4cbd8789f6-C001-224" ] ], "cite_sentences": [ "f29baa099b13f38badeb4cbd8789f6-C001-208", "f29baa099b13f38badeb4cbd8789f6-C001-224" ] }, "@SIM@": { "gold_contexts": [ [ "f29baa099b13f38badeb4cbd8789f6-C001-285" ] ], "cite_sentences": [ "f29baa099b13f38badeb4cbd8789f6-C001-285" ] } } }, "ABC_b0a50145121eb797cf8e6ebc2f49e0_17": { "x": [ { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-103", "text": "It generates source and target translations simultaneously 4 ." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-2", "text": "N-gram-based models co-exist with their phrase-based counterparts as an alternative SMT framework."
}, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-3", "text": "Both techniques have pros and cons." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-4", "text": "While the N-gram-based framework provides a better model that captures both source and target contexts and avoids spurious phrasal segmentation, the ability to memorize and produce larger translation units gives an edge to the phrase-based systems during decoding, in terms of better search performance and superior selection of translation units." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-5", "text": "In this paper we combine N-gram-based modeling with phrase-based decoding, and obtain the benefits of both approaches." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-6", "text": "Our experiments show that using this combination not only improves the search accuracy of the N-gram model but that it also improves the BLEU scores." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-7", "text": "Our system outperforms state-of-the-art phrase-based systems (Moses and Phrasal) and N-gram-based systems by a significant margin on German, French and Spanish to English translation tasks." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-8", "text": "----------------------------------" }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-10", "text": "Statistical Machine Translation advanced from word-based models (Brown et al., 1993) towards more sophisticated models that take contextual information into account." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-11", "text": "Phrase-based (Och and Ney, 2004; Koehn et al., 2003) and N-gram-based models are two instances of such frameworks." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-12", "text": "While the two models have some common properties, they are substantially different."
}, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-13", "text": "* Much of the work presented here was carried out while the first author was at the University of Stuttgart." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-14", "text": "Phrase-based systems employ a simple and effective machinery by learning larger chunks of translation called phrases 1 ." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-15", "text": "Memorizing larger units enables the phrase-based model to learn local dependencies such as short reorderings, idioms, insertions and deletions, etc." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-16", "text": "The model however, has the following drawbacks: i) it makes independence assumptions over phrases ignoring the contextual information outside of phrases ii) it has issues handling long-distance reordering iii) it has the spurious phrasal segmentation problem which allows multiple derivations of a bilingual sentence pair having different model scores for each segmentation." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-17", "text": "Modeling with minimal translation units helps address some of these issues." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-18", "text": "The N-gram-based SMT framework is based on tuples." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-19", "text": "Tuples are minimal translation units composed of source and target cepts 2 ." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-20", "text": "N-gram-based models are Markov models over sequences of tuples or operations encapsulating tuples (Durrani et al., 2011) ." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-21", "text": "This mechanism has several useful properties." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-22", "text": "Firstly, no phrasal independence assumption is made." 
}, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-23", "text": "The model has access to both source and target context outside of phrases." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-24", "text": "Secondly the model learns a unique derivation of a bilingual sentence given its alignment, thus avoiding the spurious segmentation problem." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-25", "text": "Using minimal translation units, however, results in a higher number of search errors. A phrase-pair in PBSMT is a sequence of source and target words that are translations of each other, and is not necessarily a linguistic constituent." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-26", "text": "Phrases are built by combining minimal translation units and ordering information." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-27", "text": "A cept is a group of words in one language that is translated as a minimal unit in one specific context (Brown et al., 1993)." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-28", "text": "These search errors stem from i) poor translation selection, ii) inaccurate future-cost estimates and iii) incorrect early pruning of hypotheses that would produce better model scores if allowed to continue." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-29", "text": "In order to deal with these problems, search is carried out only on a graph of pre-calculated orderings, and ad-hoc reordering limits are imposed to constrain the search space (Crego et al., 2005), or a higher beam size is used in decoding (Durrani et al., 2011)." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-30", "text": "The ability to memorize and produce larger translation chunks during decoding, on the other hand, gives a distinct advantage to the phrase-based system during search."
}, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-31", "text": "Phrase-based systems i) have access to uncommon translations, ii) do not require higher beam sizes, iii) have more accurate future-cost estimates because of the availability of phrase-internal language model context before search is started." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-32", "text": "To illustrate this, consider the German-English phrase-pair \"scho\u00df ein Tor -scored a goal\", composed from the tuples (cept-pairs) \"scho\u00df -scored\", \"ein -a\" and \"Tor -goal\"." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-33", "text": "It is likely that the N-gram system does not have the tuple \"scho\u00df -scored\" in its n-best translation options because \"scored\" is an uncommon translation for \"scho\u00df\" outside the sports domain." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-34", "text": "Even if \"scho\u00df -scored\" is hypothesized, it will be ranked quite low in the stack until \"ein\" and \"Tor\" are generated in the next steps." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-35", "text": "A higher beam is required to prevent it from getting pruned." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-36", "text": "Phrase-based systems, on the other hand, are likely to have access to the phrasal unit \"scho\u00df ein Tor -scored a goal\" and can generate it in a single step." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-37", "text": "Moreover, a more accurate future-cost estimate can be computed because of the available context internal to the phrase." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-38", "text": "In this work, we extend the N-gram model, based on operation sequences (Durrani et al., 2011), to use phrases during decoding."
}, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-39", "text": "The main idea is to study whether a combination of modeling with minimal translation units and using phrasal information during decoding helps to solve the above-mentioned problems." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-40", "text": "The remainder of this paper is organized as follows." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-41", "text": "The next two sections review phrase-based and N-gram-based SMT." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-42", "text": "Section 2 provides a comparison of phrase-based and N-gram-based SMT." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-43", "text": "Section 3 summarizes the operation sequence model (OSM), the main baseline for this work." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-44", "text": "Section 4 analyzes the search problem when decoding with minimal units." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-45", "text": "Section 5 discusses how information available in phrases can be used to improve search performance." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-46", "text": "Section 6 presents the results of this work." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-47", "text": "We conducted experiments on the German-to-English and French-to-English translation tasks and found that using phrases in decoding improves both search accuracy and BLEU scores." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-48", "text": "Finally we compare our system with two state-of-the-art phrase-based systems (Moses and Phrasal) and two state-of-the-art N-gram-based systems (Ncode and OSM) on standard translation tasks."
}, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-49", "text": "----------------------------------" }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-50", "text": "**PREVIOUS WORK**" }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-51", "text": "Phrase-based and N-gram-based SMT are alternative frameworks for string-to-string translation." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-52", "text": "Phrase-based SMT segments a bilingual sentence pair into phrases that are continuous sequences of words (Och and Ney, 2004; Koehn et al., 2003) or discontinuous sequences of words ." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-53", "text": "These phrases are then reordered through a lexicalized reordering model that takes into account the orientation of a phrase with respect to its previous phrase (Tillmann and Zhang, 2005) or block of phrases (Galley and Manning, 2008) ." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-54", "text": "There are several drawbacks of the phrase-based model." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-55", "text": "Firstly it makes an independence assumption over phrases, according to which phrases are translated independently of each other, thus ignoring the contextual information outside of the phrasal boundary." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-56", "text": "This problem is corrected by the monolingual language model that takes context into account." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-57", "text": "But often the language model cannot compensate for the dispreference of the translation model for nonlocal dependencies." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-58", "text": "The second problem is that the model is unaware of the actual phrasal segmentation of a sentence during training." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-59", "text": "It therefore learns all possible ways of segmenting a bilingual sentence." 
}, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-60", "text": "Different segmentations of a bilingual sentence re-sult in different probability scores for the translation and reordering models, causing spurious ambiguity in the model." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-61", "text": "See Figure 1 ." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-62", "text": "In the first segmentation, the model learns the lexical and reordering probabilities of the phrases \"sie w\u00fcrden -they would\" and \"gegen ihre kampagne abstimmen -vote against your campaign\"." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-63", "text": "In the second segmentation, the model learns the lexical and reordering probabilities of the phrases \"sie -they\" \"w\u00fcrden -would\", \"abstimmen -vote\", \"gegen ihre kampagne -against your campaign\"." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-64", "text": "Both segmentations result in different translation and reordering scores." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-65", "text": "This kind of ambiguity in the model subsequently results in the presence of many different equivalent segmentations in the search space." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-66", "text": "Also note that the two segmentations contain different information." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-67", "text": "From the first segmentation the model learns the dependency between the verb \"abstimmen -vote\" and the phrase \"gegen ihre kampagne -against your campaign\"." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-68", "text": "The second segmentation allows the model to capture the reordering of the complex verb predicate \"w\u00fcrden -would\" and \"abstimmen -vote\" by learning that the verb \"abstimmen -vote\" is discontinuous with respect to the auxiliary." 
}, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-69", "text": "This information cannot be captured in the first segmentation because of the phrasal independence assumption and stiff phrasal boundaries." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-70", "text": "The model loses one of the dependencies depending upon which segmentation it chooses during decoding." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-71", "text": "N-gram-based SMT is an instance of a joint model that generates source and target strings together in bilingual translation units called tuples." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-72", "text": "Tuples are essentially phrases but they are atomic units that cannot be decomposed any further." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-73", "text": "This condition of atomicity results in a unique segmentation of the bilingual sentence pair given its alignments." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-74", "text": "The model does not make any phrasal independence assumption and generates a tuple by looking at a context of n \u2212 1 previous tuples (or operations)." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-75", "text": "This allows the N-gram model to model all the dependencies through a single derivation." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-76", "text": "The main drawback of N-gram-based SMT is its poor search mechanism which is inherent from using minimal translation units during search." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-77", "text": "Decoding with tuples has problems with a high number of search errors caused by lower translation coverage, inaccurate future-cost estimation and pruning of correct hypotheses (see Section 4.2 for details)." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-78", "text": "proposed a way to couple reordering and search through POS-based rewrite rules." 
}, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-79", "text": "These rules are learned during training when units with crossing alignments are unfolded through source linearization to form minimal tuples." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-80", "text": "For example, in Figure 1 , the N-gram-based MT will linearize the word sequence \"gegen ihre kampagne abstimmen\" to \"abstimmen gegen ihre kampagne\", so that it is in the same order as the English words." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-81", "text": "It also learns a POS-rule \"IN PRP NN VB \u2192 VB IN PRP NN\"." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-82", "text": "The POS-based rewrite rules serve to precompute the orderings that are hypothesized during decoding." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-83", "text": "Coupling reordering and search allows the N-gram model to arrange hypotheses in 2 m stacks (for an m word source sentence), each containing hypotheses that cover exactly the same foreign words." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-84", "text": "This removes the need for futurecost estimation 3 ." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-85", "text": "Secondly, memorizing POS-based rules enables phrase-based like reordering, however without lexical selection." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-86", "text": "There are three drawbacks of this approach." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-87", "text": "Firstly, lexical generation and reordering are decoupled." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-88", "text": "Search is only performed on a small number of reorderings, pre-calculated using the source side and completely ignoring the targetside." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-89", "text": "And lastly, the POS-based rules face data sparsity problems especially in the case of long distance reorderings." 
}, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-90", "text": "Durrani et al. (2011) recently addressed these problems by proposing an operation sequence Ngram model which strongly couples translation and reordering, hypothesizes all possible reorderings and does not require POS-based rules." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-91", "text": "Representing bilingual sentences as a sequence of operations enables them to memorize phrases and lexical reordering triggers like PBSMT." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-92", "text": "However, using minimal units during decoding and searching over all possible reorderings means that hypotheses can no longer be arranged in 2 m stacks." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-93", "text": "The problem of inaccurate future-cost estimates resurfaces resulting in more search errors." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-94", "text": "A higher beam size of 500 is therefore used to produce translation units in comparison to phrase-based systems." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-95", "text": "This, however, still does not eliminate all search errors." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-96", "text": "This paper shows that using phrases instead of cepts in de-coding improves the search accuracy and translation quality." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-97", "text": "It also shows that using some phrasal information in cept-based decoding captures some of these improvements." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-98", "text": "----------------------------------" }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-99", "text": "**OPERATION SEQUENCE MODEL**" }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-100", "text": "The N-gram model with integrated reordering models a sequence of operations obtained through the transformation of a bilingual sentence pair." 
}, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-101", "text": "An operation can either be to i) generate a sequence of source and target words, ii) to insert a gap as a placeholder for skipped words, iii) or to jump forward and backward in a sentence to translate words discontinuously." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-102", "text": "The translate operation Generate(X,Y) encapsulates the translation tuple (X,Y)." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-104", "text": "This is similar to N-gram-based SMT except that the tuples in the N-gram-based model are generated monotonically, whereas in this case lexical generation and reordering information is strongly coupled in an operation sequence." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-105", "text": "Consider the phrase pair:" }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-106", "text": "The model memorizes it through the sequence:" }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-107", "text": ". . , o j\u22121 be a sequence of operations as hypothesized by the translator to generate the bilingual sentence pair F, E with an alignment function A. The translation model is defined as:" }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-108", "text": "where n indicates the amount of context used." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-109", "text": "The translation model is implemented as an N-gram model of operations using SRILM-Toolkit (Stolcke, 2002) with Kneser-Ney smoothing." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-110", "text": "A 9-gram model is used." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-111", "text": "Several count-based features such as gap and open gap penalties and distance-based features such as gap-width and reordering distance are added to the model, along with the lexical weighting and length penalty features in a standard log-linear framework (Durrani et al., 2011) ." 
}, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-112", "text": "----------------------------------" }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-113", "text": "**SEARCH 4.1 OVERVIEW OF DECODING FRAMEWORK**" }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-114", "text": "The decoding framework used in the operation sequence model is based on Pharaoh (Koehn, 2004a) ." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-115", "text": "The decoder uses beam search to build up the translation from left to right." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-116", "text": "The hypotheses are arranged in m stacks such that stack i maintains hypotheses that have already translated i many foreign words." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-117", "text": "The ultimate goal is to find the best scoring hypothesis, that has translated all the words in the foreign sentence." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-118", "text": "The overall process can be roughly divided into the following steps: i) extraction of translation units ii) future-cost estimation, iii) hypothesis extension iv) recombination and pruning." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-119", "text": "During the hypothesis extension each extracted phrase is translated into a sequence of operations." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-120", "text": "The reordering operations (gaps and jumps) are generated by looking at the position of the translator, the last foreign word generated etc." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-121", "text": "(Refer to Algorithm 1 in Durrani et al. (2011) )." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-122", "text": "The probability of an operation depends on the n \u2212 1 previous operations." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-123", "text": "The model backs-off to the smaller n-grams of operations if the full history is unknown." 
}, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-124", "text": "We use Kneser-Ney smoothing to handle back-off 5 ." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-125", "text": "----------------------------------" }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-126", "text": "**DRAWBACKS OF CEPT-BASED DECODING**" }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-127", "text": "One of the main drawbacks of the operation sequence model is that it has a more difficult search problem than the phrase-based model." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-128", "text": "The operation model, although based on minimal translation units, can learn larger translation chunks by memorizing a sequence of operations." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-129", "text": "However, using cepts during decoding has the following drawbacks: i) the cept-based decoder does not have access to all the translation units that a phrase-based decoder uses as part of a larger phrase." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-130", "text": "ii) it requires a higher beam size to prevent early pruning of better hypotheses that lead toward higher model scores when allowed to continue and iii) it uses worse future-cost estimates than the phrase-based decoder." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-131", "text": "Recall the example from the last section." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-132", "text": "For the cept-based decoder to generate the same phrasal translation, it requires three separate tuple translations \"Wie -what is\", \"Sie -your\" and \"hei\u00dfen -name\"." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-133", "text": "Here we are faced with three challenges." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-134", "text": "Translation Coverage: The first problem is that the N-gram model does not have the same coverage of translation options." 
}, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-135", "text": "The English cepts \"what is\", \"your\" and \"name\" are not good candidate translations for the German cepts \"Wie\", \"Sie\" and \"hei\u00dfen\", respectively." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-136", "text": "When extracting tuple translations for these cepts from the Europarl data for our system, the tuple \"Wie -what is\" is ranked 124 th , \"hei\u00dfen -name\" is ranked 56 th , and \"Sieyour\" is ranked 9 th in the list of n-best translation candidates." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-137", "text": "Typically only the 20 best translation options are used, to reduce the decoding time, and such phrasal units with less frequent cept translations are never hypothesized in the N-gram-based systems." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-138", "text": "The phrase-based system on the other hand can extract the phrase \"Wie hei\u00dfen Sie -what is your name\" even if it is observed only once during training." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-139", "text": "A similar problem is also reported in Costa-juss\u00e0 et al. (2007) ." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-140", "text": "When trying to reproduce the sentences in the n-best translation output of the phrase-based system, the N-gram-based system was only able to produce 37.5% of the sentences in the Spanish-to-English and 37.2% in the English-to-Spanish translation tasks." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-141", "text": "In comparison the phrase-based system was able to reproduce 57.5% and 48.6% of the sentences in the nbest translation output of the Spanish-to-English and English-to-Spanish N-gram-based systems." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-142", "text": "Larger Beam Size: A related problem is that a higher beam size is required in cept-based decoding to prevent uncommon translations from getting pruned." 
}, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-143", "text": "The phrase-based system can generate the phrase-pair \"Wie hei\u00dfen Sie -what is your name\" in a single step placing it directly into the stack three words to the right." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-144", "text": "The cept-based decoder generates this phrase in three stacks with the tuple translations \"Wie -What is\", \"Sie -your\" and \"hei\u00dfen -name\"." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-145", "text": "A very large stack size is required during decoding to prevent the pruning of \"Wie -What is\" which is ranked quite low in the stack until the tuple \"Sieyour\" is hypothesized in the next stack." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-146", "text": "Costa-juss\u00e0 et al. Future-Cost Estimation: A third problem is caused by inaccurate future-cost estimation." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-147", "text": "Using phrases helps phrase-based SMT to better estimate the future language model cost because of the larger context available, and allows the decoder to capture local (phrase-internal) reorderings in the future cost." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-148", "text": "In comparison the future cost for tuples is mostly unigram probabilities." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-149", "text": "The future-cost estimate for the phrase pair \"Wie hei\u00dfen Sie -What is your name\" is estimated by calculating the cost of each feature." 
}, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-150", "text": "The language model cost, for example, is estimated in the phrase-based system as follows:" }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-151", "text": "The cost of the direct phrase translation probability, one of the features used in the phrase translation model, is estimated as: p tm = p(What is your name|Wie hei\u00dfen Sie)" }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-152", "text": "Phrase-based SMT is aware during the preprocessing step that the words \"Wie hei\u00dfen Sie\" may be translated as a phrase." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-153", "text": "This is helpful for estimating a more accurate future cost because the phraseinternal context is already available." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-154", "text": "The same is not true for the operation sequence model, to which only minimal units are available." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-155", "text": "The operation model does not have the information that \"Wie hei\u00dfen Sie\" may be translated as a phrase during decoding." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-156", "text": "The future-cost estimate available to the operation model for the span covering \"Wie hei\u00dfen Sie\" will have unigram probabilities for both the translation and language model:" }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-157", "text": "Thus the future-cost estimate in the operation model is much worse than that of the phrase-based model." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-158", "text": "The poor future-cost estimation leads to search errors, causing a drop in the translation quality." 
}, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-159", "text": "A more accurate future-cost estimate for the translation model cost would be:" }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-160", "text": "where C is the context, i.e., the n\u22121 previously generated operations." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-161", "text": "The future-cost estimates computed in this manner are much more accurate because they not only consider context, but also take the reordering operations into account." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-162", "text": "----------------------------------" }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-163", "text": "**N-GRAM MODEL WITH PHRASE-BASED DECODING**" }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-164", "text": "In the last section we discussed the disadvantages of using cepts during search in a left-to-right decoding framework." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-165", "text": "We now define a method to empirically study the mentioned drawbacks and whether using information available in phrase-pairs during decoding can help improve search accuracy and translation quality." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-166", "text": "----------------------------------" }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-167", "text": "**TRAINING**" }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-168", "text": "We extended the training steps in Durrani et al. (2011) to extract a phrase lexicon from the parallel data." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-169", "text": "We extract all phrase pairs of length 6 and below, that are consistent (Och et al., 1999) with the word alignments." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-170", "text": "Only continuous phrases as used in a traditional phrase-based system are extracted thus allowing only inside-out (Wu, 1997) type of alignments." 
}, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-171", "text": "The future cost of each feature component used in the log-linear model is calculated." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-172", "text": "The operation sequence required to hypothesize each phrase is generated and its future cost is calculated." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-173", "text": "The future costs of other features such as language models, lexicalized probability features, etc. are also estimated." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-174", "text": "The estimates of the countbased reordering penalties (gap penalty and open gap penalty) and the distance-based features (gapwidth and reordering distance) could not be estimated previously with cepts but are available when using phrases." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-175", "text": "----------------------------------" }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-176", "text": "**DECODING**" }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-177", "text": "We extended the decoder developed by Durrani et al. (2011) and tried three ideas." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-178", "text": "In our primary experiments we enabled the decoder to use phrases instead of cepts." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-179", "text": "This allows the decoder to i) use phraseinternal context when computing the future-cost estimates, ii) hypothesize translation options not available to the cept-based decoder iii) cover multiple source words in a single step subsequently improving translation coverage and search." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-180", "text": "Note that using phrases instead of cepts during decoding, does not reintroduce the spurious phrasal segmentation problem as is present in the phrase-based system, because the model is built on minimal units which avoids segmentation ambiguity." 
}, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-181", "text": "Different compositions of the same phrasal unit lead to exactly the same model score." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-182", "text": "We therefore do not create any alternative compositions of the same phrasal unit during decoding." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-183", "text": "This option is not available in phrase-based decoding, because an alternative composition may lead towards a better model score." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-184", "text": "In our secondary set of experiments, we used cept-based decoding but modified the decoder to use information available from the phrases extracted for the test sentences." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-185", "text": "Firstly, we used future-cost estimates from the extracted phrases (see system cept.500.fc in Table1)." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-186", "text": "This however, leads to inconsistency in the cases where the future cost is estimated from some phrasal unit that cannot be generated through the available cept translations." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-187", "text": "For example, say the best cost to cover the sequence \"Wie hei\u00dfen Sie\" is given by the phrase \"What is your name\"." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-188", "text": "The 20-best translation options in ceptbased system, however, do not have tuples \"WieWhat\" and \"hei\u00dfen -name\"." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-189", "text": "To remove this discrepancy, we add all such tuples that are used in the extracted phrases, to the list of extracted cepts (system cept.500.fc.t)." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-190", "text": "We also studied how much gain we obtain by only adding tuples from phrases and using cept-based future-cost estimates (system cept.500.t)." 
}, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-191", "text": "----------------------------------" }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-192", "text": "**EVALUATION METHOD**" }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-193", "text": "To evaluate our modifications we apply a simple strategy." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-194", "text": "We hold the model constant and change the search to use the baseline decoder, which uses minimal translation units, or the modified decoders that use phrasal information during decoding." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-195", "text": "The model parameters are optimized by running MERT (minimum error rate training) for the baseline decoder on the dev set." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-196", "text": "After we have the optimized weights, we run the baseline decoder and our modifications on the test." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-221", "text": "For each extracted cept or phrase 10-best translation options are extracted." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-197", "text": "Note that because all the decoding runs use the same feature vector, the model stays constant, only search changes." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-198", "text": "This allows us to compare different decoding runs, obtained using the same parameters, but different search strategies, in terms of model scores." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-199", "text": "We compute a search accuracy and translation quality for each run." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-200", "text": "Search accuracy is computed by comparing translation hypotheses from the different decoding runs." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-201", "text": "We form a collection of the best scoring hypotheses by traversing through all the runs and selecting the sentences with highest model score." 
}, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-202", "text": "For each input sentence we select a single best scoring hypothesis." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-203", "text": "The best scoring hypothesis can be contributed from several runs." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-204", "text": "In this case all these runs will be given a credit for that particular sentence when computing the search accuracy." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-205", "text": "The search accuracy of a decoding run is defined as the percentage of hypotheses that were contributed from this run, when forming a list of best scoring hypotheses." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-206", "text": "For example, for a test set of 1000 sentences, the accuracy of a decoding run would be 30% if it was able to produce the best scoring hypothesis for 300 sentences in the test set." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-207", "text": "Translation quality is measured through BLEU (Papineni et al., 2002) ." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-208", "text": "----------------------------------" }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-209", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-210", "text": "We initially experimented with two language pairs: German-to-English (G-E) and French-to-English (F-E)." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-211", "text": "We trained our system and the baseline systems on most of the data 6 made available for the translation task of the Fourth Workshop on Statistical Machine Translation." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-212", "text": "7 We used 1M bilingual sentences, for the estimation of the translation model and 2M sentences from the monolingual corpus (news commentary) which also contains the English part of the bilingual corpus." 
}, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-213", "text": "Word alignments are obtained by running GIZA++ (Och and Ney, 2003) with the grow-diag-final-and (Koehn et al., 2005) symmetrization heuristic." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-214", "text": "We follow the training steps described in Durrani et al. (2011) , consisting of i) post-processing the alignments to remove discontinuous and unaligned target cepts, ii) conversion of bilingual alignments into operation sequences, iii) estimation of the n-gram language models." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-215", "text": "----------------------------------" }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-216", "text": "**SEARCH ACCURACY RESULTS**" }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-217", "text": "We divided our evaluation into two halves." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-218", "text": "In the first half we carried out experiments to measure search accuracy and translation quality of our decoders against the baseline cept-based OSM (cept.500) that uses minimal translation units with a stack size of 500." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-219", "text": "We used the version of the ceptbased OSM system that does not allow discontinuous 8 source cepts." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-220", "text": "To increase the speed of the system we used a hard reordering limit of 15 9 , in the baseline decoder and our modifications, disallowing jumps that are beyond 15 words from the first open gap." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-222", "text": "Using phrases in search reduces the decoding speed." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-223", "text": "In order to make a fair comparison, both the phrase-based and the baseline cept-based decoders should be allowed to run for the same amount of time." 
}, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-224", "text": "We therefore reduced the stack size in the phrase-based decoder so that it runs in the same amount of time as the cept-based decoder." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-225", "text": "We found that using a stack size of 200 10 for the phrase-based decoder was comparable in speed to using a stacksize of 500 in the cept-based decoding." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-226", "text": "We first tuned the baseline on dev 11 to obtain an optimized weight vector." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-227", "text": "We then ran the baseline and our decoders as discussed in Section 5.2 on the dev-test." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-228", "text": "Then we repeated this experiment by tuning the weights with our phrase-based decoder (using a stack size of 100) and ran all the decoders again using the new weights." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-229", "text": "Table 1 shows the average search accuracies and BLEU scores of the two experiments." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-230", "text": "Using phrases during decoding in the G-E experiments resulted in a statistically significant 12 0.69 BLEU points gain comparing our best system phrase.200 with the baseline system cept.500." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-231", "text": "We mark a result as sig-8 Discontinuous source-side units did not lead to any improvements in (Durrani et al., 2011) and increased the decoding times by multiple folds." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-232", "text": "We also found these to be less useful." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-233", "text": "9 Imposing a hard reordering limit significantly reduced the decoding time and also slightly increased the BLEU scores." 
}, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-234", "text": "10 Higher stack sizes leads to improvement in model scores for both German-English and French-English and slight improvement of BLEU in the case of the former." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-235", "text": "11 We used news-dev2009a as dev and news-dev2009b as devtest and tuned the weights with Z-MERT (Zaidan, 2009) ." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-236", "text": "12 We use bootstrap resampling (Koehn, 2004b) nificant if the improvement shown by our decoder over the baseline decoder (cept.500) is significant at the p \u2264 0.05 level, in both the runs." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-237", "text": "All the outputs that show statistically significant improvements over the baseline decoder (cept.500) in Table 1 are marked with an asterisk." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-238", "text": "The search accuracy of our best system (phrase.200), in G-E experiments is roughly 80%, which means that 80% of the times the phrase-based decoder (using stack size 200) was able to produce the same model score or a better model score than the cept-based decoders (using a stack size of 500)." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-239", "text": "Our F-E experiments also showed improvements in BLEU and model scores." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-240", "text": "The search accuracy of our best system phrase.200 is roughly 93% as compared with 55% in the baseline decoder (cept.500) giving a BLEU point gain of +0.30 over the baseline." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-241", "text": "Our modifications to the cept-based decoder also showed improvements." 
}, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-242", "text": "We found that extending the cept translation table (cept.500.t) using phrases helps both in G-E and F-E experiments by extending the list of n-best translation options by 18% and 18.30% respectively." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-243", "text": "Using future costs estimated from phrases (cept.500.fc) improved both search accuracy and BLEU scores in G-E experiments, but does not lead to any improvements in the F-E experiments, as both BLEU and model scores drop slightly." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-244", "text": "We looked at a few examples where the model score dropped and found that in these cases, the best scoring hypotheses are ranked very low earlier in the decoding and make their way to the top gradually in subsequent steps." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-245", "text": "A slight difference in the future-cost estimate prunes these hypotheses in one or the other decoder." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-246", "text": "We found future cost to be more critical in G-E than F-E experiments." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-247", "text": "This can be explained by the fact that more reordering is required in G-E and it is necessary to account for the reordering operations and jump-based features (gapbased penalties, reordering distance and gap-width) in the future-cost estimation." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-248", "text": "F-E on the other hand is largely monotonic except for a few short distance reorderings such as flipping noun and adjective." 
}, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-249", "text": "----------------------------------" }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-250", "text": "**COMPARISON WITH OTHER BASELINE SYSTEMS**" }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-251", "text": "In the second half of our evaluation we compared our best system (phrase.200) with the baseline system (cept.500) and with other state-of-the-art phrase-based and N-gram-based systems on German-to-English, French-to-English, and Spanish-to-English tasks (see footnote 13)." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-252", "text": "We used the official evaluation data (news-test sets) from the Statistical Machine Translation Workshops 2009-2011 for all three language pairs (German, Spanish and French)." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-253", "text": "The feature weights for all the systems are tuned using the dev set news-dev2009a." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-254", "text": "We separately tune the baseline system (cept.500) and the phrase-based system (phrase.200) and do not hold the lambda vector constant as before." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-255", "text": "Baseline Systems: We also compared our system with i) Moses (Koehn et al., 2007), ii) Phrasal (see footnote 14) (Cer et al., 2010), and iii) Ncode (Crego et al., 2011)." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-256", "text": "We used the default stack sizes of 100 for Moses (see footnote 15), 200 for Phrasal, and 25 for Ncode (with 2 m stacks)." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-257", "text": "A 5-gram English language model is used." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-258", "text": "Both phrase-based systems use 20-best phrases for translation; Ncode uses 25-best tuple translations." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-259", "text": "The training and test data for Ncode were tagged using TreeTagger (Schmid, 1994)."
}, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-260", "text": "All the baseline systems used a lexicalized reordering model." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-261", "text": "A hard reordering limit of 6 words is used as a default in both the baseline phrase-based systems. (Footnote 13: We did not include the results of Spanish in the previous section due to space limitations, but these are similar to those of the French-to-English translation task.)" }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-262", "text": "(Footnote 14: Phrasal provides two extensions to Moses: i) a hierarchical reordering model (Galley and Manning, 2008) and ii) discontinuous phrases.)" }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-263", "text": "(Footnote 15: Using stack sizes from 200\u22121000 did not improve results.)" }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-264", "text": "(Footnote 16: We tried to increase the distortion limit in the baseline systems to 15 (in G-E experiments), as used in our systems, but the results dropped significantly in the case of Moses and slightly for Phrasal, so we used the default limits for both decoders.)" }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-265", "text": "Amongst the other defaults, we retained the hard source gap penalty of 15 and the target gap penalty of 7 in Phrasal." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-266", "text": "We provide Moses and Ncode with the same post-edited alignments (see footnote 17), from which we removed target-side discontinuities." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-267", "text": "We feed the original alignments to Phrasal because of its ability to learn discontinuous source and target phrases." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-268", "text": "All the systems use MERT for the optimization of the weight vector." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-269", "text": "Table 2 compares the performance of our phrase-based decoder against the baselines." 
}, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-270", "text": "Our system shows an improvement over all the baseline systems for the G-E pair, in 11 out of 12 cases for the F-E pair, and in 8 out of 12 cases for the S-E language pair." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-271", "text": "We mark a baseline with \"*\" to indicate that our decoder shows an improvement over this baseline result which is significant at the p \u2264 0.05 level." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-272", "text": "----------------------------------" }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-273", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-274", "text": "We proposed a combination of using a model based on minimal units and decoding with phrases." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-275", "text": "Modeling with minimal units enables us to learn local and non-local dependencies in a unified manner and avoid spurious segmentation ambiguities." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-276", "text": "However, using minimal units also in the search presents a significant challenge because of the poor translation coverage, inaccurate future-cost estimates, and" }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-277", "text": "(Footnote 17: Using post-processed alignments gave slightly better results than the original alignments for these baseline systems.)" }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-278", "text": "Details are omitted due to space limitations." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-279", "text": "the pruning of the correct hypotheses." 
}, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-280", "text": "Phrase-based SMT, on the other hand, overcomes these drawbacks by using larger translation chunks during search." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-281", "text": "However, the drawbacks of the phrase-based model are the phrasal independence assumption, spurious segmentation ambiguity, and a weak mechanism for handling non-local reorderings." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-282", "text": "We showed that combining a model based on minimal units with phrase-based decoding can improve both search accuracy and translation quality." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-283", "text": "We also showed that the phrasal information can be indirectly used in cept-based decoding with improved results." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-284", "text": "We tested our system against state-of-the-art phrase-based and N-gram-based systems for German-to-English, French-to-English, and Spanish-to-English on three standard test sets." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-285", "text": "Our system showed statistically significant improvements over all the baseline systems in most of the cases." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-286", "text": "We have shown the benefits of using phrase-based search with a model based on minimal units." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-287", "text": "In future work, we would like to study whether a phrase-based system like Moses or Phrasal can profit from an OSM-style or N-gram-style feature." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-288", "text": "Feng et al. (2010) previously showed that adding a linearized source-side language model to a phrase-based system helped." 
}, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-289", "text": "It would also be interesting to study whether the insight of using minimal units for modeling and phrase-based search would hold for hierarchical SMT." }, { "sent_id": "b0a50145121eb797cf8e6ebc2f49e0-C001-290", "text": "Vaswani et al. (2011) recently showed that a Markov model over the derivation history of minimal rules can obtain the same translation quality as using grammars formed with composed rules." } ], "y": { "@BACK@": { "gold_contexts": [ [ "b0a50145121eb797cf8e6ebc2f49e0-C001-20" ], [ "b0a50145121eb797cf8e6ebc2f49e0-C001-90" ], [ "b0a50145121eb797cf8e6ebc2f49e0-C001-121" ] ], "cite_sentences": [ "b0a50145121eb797cf8e6ebc2f49e0-C001-20", "b0a50145121eb797cf8e6ebc2f49e0-C001-90", "b0a50145121eb797cf8e6ebc2f49e0-C001-121" ] }, "@USE@": { "gold_contexts": [ [ "b0a50145121eb797cf8e6ebc2f49e0-C001-29" ], [ "b0a50145121eb797cf8e6ebc2f49e0-C001-111" ], [ "b0a50145121eb797cf8e6ebc2f49e0-C001-120", "b0a50145121eb797cf8e6ebc2f49e0-C001-121" ], [ "b0a50145121eb797cf8e6ebc2f49e0-C001-214" ] ], "cite_sentences": [ "b0a50145121eb797cf8e6ebc2f49e0-C001-29", "b0a50145121eb797cf8e6ebc2f49e0-C001-111", "b0a50145121eb797cf8e6ebc2f49e0-C001-121", "b0a50145121eb797cf8e6ebc2f49e0-C001-214" ] }, "@EXT@": { "gold_contexts": [ [ "b0a50145121eb797cf8e6ebc2f49e0-C001-38" ], [ "b0a50145121eb797cf8e6ebc2f49e0-C001-168" ], [ "b0a50145121eb797cf8e6ebc2f49e0-C001-177" ] ], "cite_sentences": [ "b0a50145121eb797cf8e6ebc2f49e0-C001-38", "b0a50145121eb797cf8e6ebc2f49e0-C001-168", "b0a50145121eb797cf8e6ebc2f49e0-C001-177" ] }, "@DIF@": { "gold_contexts": [ [ "b0a50145121eb797cf8e6ebc2f49e0-C001-231" ] ], "cite_sentences": [ "b0a50145121eb797cf8e6ebc2f49e0-C001-231" ] } } }, "ABC_fed51218e78d35aae39d287c95a95a_17": { "x": [ { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-2", "text": "In order to automatically extract opinion holders, we propose 
to harness the contexts of prototypical opinion holders, i.e. common nouns, such as experts or analysts, that describe particular groups of people whose profession or occupation is to form and express opinions towards specific items." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-3", "text": "We assess their effectiveness in supervised learning, where these contexts are regarded as labeled training data, and in rule-based classification, which uses predicates that frequently co-occur with mentions of the prototypical opinion holders." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-4", "text": "Finally, we also examine to what extent knowledge gained from these contexts can compensate for the lack of large amounts of labeled training data in supervised learning by considering various amounts of actually labeled training sets." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-5", "text": "----------------------------------" }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-7", "text": "Building an opinion holder (OH) extraction system on the basis of supervised classifiers requires large amounts of labeled training data, which are expensive to obtain." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-8", "text": "Therefore, alternative methods requiring less human effort are needed." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-9", "text": "Such methods would be particularly valuable for languages other than English, as for most other languages sentiment resources are fairly sparse." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-10", "text": "In this paper, we propose to leverage contextual information from prototypical opinion holders (protoOHs), such as experts or analysts." 
}, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-11", "text": "We define prototypical opinion holders as common nouns denoting particular groups of people whose profession or occupation is to form and express opinions towards specific items." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-12", "text": "Mentions of these nouns are disproportionately often OHs:" }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-13", "text": "1. Experts agree it generally is a good idea to follow the manufacturers' age recommendations." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-14", "text": "2. Shares of Lotus Development Corp. dropped sharply after analysts expressed concern about their business." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-15", "text": "Since protoOHs are common nouns, they should occur sufficiently often in a large text corpus in order to gain knowledge for OH extraction." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-16", "text": "We examine different ways of harnessing mentions of protoOHs for OH extraction." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-17", "text": "We compare their usage as labeled training data for supervised learning with a rule-based classifier that relies on a lexicon of predictive predicates that have been extracted from the contexts of protoOHs." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-18", "text": "Moreover, we investigate to what extent the knowledge gained from these contexts can compensate for the lack of large amounts of actually labeled training data in supervised classification by considering various amounts of labeled training sets." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-19", "text": "----------------------------------" }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-20", "text": "**RELATED WORK**" }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-21", "text": "There has been much research on supervised learning for OH extraction." 
}, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-22", "text": "Choi et al. (2005) explore OH extraction using CRFs with several manually defined linguistic features and automatically learnt surface patterns." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-23", "text": "The linguistic features focus on named-entity information and syntactic relations to opinion words." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-24", "text": "Kim and Hovy (2006) and Bethard et al. (2004) examine the usefulness of semantic roles provided by FrameNet for both OH and opinion target extraction." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-25", "text": "More recently, Wiegand and Klakow (2010) explored convolution kernels for OH extraction and found that tree kernels outperform all other kernel types." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-26", "text": "In (Johansson and Moschitti, 2010), a re-ranking approach modeling complex relations between multiple opinions in a sentence is presented." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-27", "text": "Rule-based OH extraction heavily relies on lexical cues." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-28", "text": "Bloom et al. (2007), for example, use a list of manually compiled communication verbs." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-29", "text": "----------------------------------" }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-30", "text": "**DATA**" }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-31", "text": "As a large unlabeled (training) corpus, we chose the North American News Text Corpus." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-32", "text": "As a labeled (test) corpus, we use the MPQA corpus." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-33", "text": "We use the definition of OHs as described in (Wiegand and Klakow, 2010)." 
}, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-34", "text": "The instance space consists of all noun phrases (NPs) in that corpus." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-35", "text": "----------------------------------" }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-36", "text": "**METHOD**" }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-37", "text": "In this paper, we propose to leverage contextual information from prototypical opinion holders (protoOHs), by which we mean common nouns denoting particular groups of people whose profession or occupation it is to form and express opinions towards specific items." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-38", "text": "The set of protoOHs that we use is listed in Table 1." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-39", "text": "It was created ad hoc." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-40", "text": "We neither claim completeness nor have we made any attempt to tune it to our data." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-41", "text": "Though mentions of protoOHs are likely to represent OHs, not every mention is an OH:" }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-42", "text": "3. Canada offered to make some civilian experts available." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-43", "text": "We try to solve this problem by exclusively looking at contexts in which the protoOH is an agent of some predicate." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-44", "text": "Bethard et al. (2004) state that 90% of the OHs in their dataset are realized as agents." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-45", "text": "This heuristic would exclude Sentence 3, as some civilian experts should be considered the patient of make available rather than its agent." 
}, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-46", "text": "We use grammatical dependencies from a syntactic parser rather than the output of a semantic parser for the detection of agents, as in our initial experiments with semantic parsers the detection of agents of predicate adjectives and nouns was deemed less reliable." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-47", "text": "The grammatical dependency relations that we consider to imply an agent are illustrated in the left half of Table 2." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-48", "text": "We consider two different methods for extracting an OH from the contexts of protoOHs: supervised learning and rule-based classification." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-49", "text": "----------------------------------" }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-50", "text": "**SUPERVISED LEARNING**" }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-51", "text": "The simplest way of using the contexts of agentive protoOHs is by using supervised learning." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-52", "text": "This means that on our unlabeled training corpus we consider each NP with the head being an agentive protoOH as a positive data instance and all the remaining NPs occurring in those sentences as negative instances." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-53", "text": "With this definition we train a supervised classifier based on convolution kernels (Collins and Duffy, 2001), as this method has been shown to be quite effective for OH extraction (Wiegand and Klakow, 2010)." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-54", "text": "Convolution kernels derive features automatically from complex discrete structures, such as syntactic parse trees or part-of-speech sequences, that are directly provided to the learner." 
}, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-55", "text": "Thus a classifier can be built without taking on the burden of implementing explicit feature extraction." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-56", "text": "We chose the best performing set of tree kernels (Collins and Duffy, 2001; Moschitti, 2006) from that work." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-57", "text": "It comprises two tree kernels based on constituency parse trees and a tree kernel based on semantic role trees." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-58", "text": "Apart from a set of sequence kernels (Taylor and Christianini, 2004), this method also largely outperforms a traditional vector kernel using a set of features that were found predictive in previous work." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-59", "text": "We exclude sequence and vector kernels in this work not only for reasons of simplicity but also since their addition to tree kernels only results in a marginal improvement." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-60", "text": "Moreover, the features in the vector kernel heavily rely on task-specific resources, e.g. a sentiment lexicon, which are deliberately avoided in our low-resource classifier, as our method should be applicable to any language (and for many languages sentiment resources are either sparse or do not exist at all)." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-61", "text": "Unlike Wiegand and Klakow (2010), we have to discard the content of candidate NPs (e.g. the candidate opinion holder NP [NP_Cand [NNS advocates]] is reduced to [NP_Cand]), the reason for this being that in our automatically generated training set, OHs will always be protoOHs." 
}, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-62", "text": "Retaining them in the training data would cause the learner to develop a detrimental bias towards these nouns (our resulting classifier should detect any OH and not only protoOHs)." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-63", "text": "----------------------------------" }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-64", "text": "**RULE-BASED CLASSIFIER**" }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-65", "text": "Instead of training a supervised classifier, we can also construct a rule-based classifier on the basis of the agentive protoOHs." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-66", "text": "The classifier is built on the insight that the most predictive cues for OH extraction are predicates (Wiegand and Klakow, 2010)." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-67", "text": "We therefore mine the contexts of agentive protoOHs (left half of Table 2) for discriminant predicates (i.e. verbs, nouns, and adjectives)." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-68", "text": "That is, we rank every predicate according to its correlation, measured by Pointwise Mutual Information, with having agentive protoOHs as an argument." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-69", "text": "The highly ranked predicates are used as predictive cues." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-70", "text": "The resulting rule-based classifier always classifies an NP as an OH if its head is an agent of a highly ranked discriminative predicate (as illustrated in the right half of Table 2)." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-71", "text": "The supervised kernel-based classifier from \u00a74.1 learns from a rich set of features." 
}, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-72", "text": "In a previous study on reverse engineering that makes the implicit features within convolution kernels visible (Pighin and Moschitti, 2009), it has been shown that the learnt features are usually fairly small subtrees." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-73", "text": "There are plenty of structures which just contain one or two leaf nodes, i.e. sparse lexical information, coupled with some further structural nodes from the parse tree." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-74", "text": "These structures are fairly similar to low-level features, such as bag of words or bag of n-grams, in the sense that they are weak predictors and that there are plenty of them." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-75", "text": "For such types of features, it has been shown in both subjectivity detection (Lambov et al., 2009) and polarity classification (Andreevskaia and Bergler, 2008) that they generalize poorly across different domains." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-76", "text": "On the other hand, very few high-level features describing the presence of certain semantic classes or opinion words perform consistently well across different domains." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-77", "text": "These features can be incorporated either within a supervised learner (Lambov et al., 2009) or within a lexicon-based rule-based classifier (Andreevskaia and Bergler, 2008)." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-78", "text": "We assume that our rule-based classifier based on discriminant predicates (which can also be considered a kind of semantic class), used in combination with very common grammatical relations, will have a similar impact as those high-level features used in the related tasks mentioned above." 
}, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-79", "text": "Domain-independence is also an important issue in our setting, since our training and test data originate from two different corpora (which can be considered two different domains)." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-80", "text": "----------------------------------" }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-81", "text": "**SELF-TRAINING**" }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-82", "text": "A shortcoming of the rule-based classifier is that it incorporates no (or hardly any) domain knowledge." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-83", "text": "In other related sentiment classification tasks, i.e. subjectivity detection and polarity classification, it has been shown that applying self-training, i.e. learning a model with a supervised classifier trained on low-level features (usually bag of words) using the domain-specific instances labeled by a rule-based classifier, can capture more in-domain knowledge." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-84", "text": "Thus, one can outperform the rule-based classifier (Wiebe and Riloff, 2005; Tan et al., 2008)." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-85", "text": "Assuming that the same can be achieved in OH extraction, we train a classifier with convolution kernels (= low-level features) on the output of the rule-based classifier run on our target corpus." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-86", "text": "The set of labeled data instances is derived from the sentences of the MPQA corpus in which the rule-based classifier predicts at least one OH, i.e. the instances the classifier labels as OHs are used as positive instances while the remaining NPs are labeled as negative." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-87", "text": "Unlike \u00a74.1, we do not discard the content of the candidate NPs." 
}, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-88", "text": "In these labeled training data, OHs are not restricted to protoOHs." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-89", "text": "We therefore assume that, among the domain-specific features, the supervised classifier may learn useful prior weights indicating whether some of these domain-specific NPs are likely to be OHs." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-90", "text": "----------------------------------" }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-91", "text": "**GENERALIZATION WITH CLUSTERING AND KNOWLEDGE BASES**" }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-92", "text": "We also examine to what extent the coverage of the discriminant predicates can be increased with the usage of clustering." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-93", "text": "Turian et al. (2010) have shown that in semi-supervised learning for named-entity recognition, i.e. a task which bears some resemblance to the present task, features referring to clusters corresponding to groups of specific words with similar properties (induced in an unsupervised manner) help to improve performance." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-94", "text": "In the context of our rule-based classifier, we augment the set of discriminant predicates by all words which are also contained in the cluster associated with these discriminant predicates." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-95", "text": "Hopefully, due to the strong similarity among the words within the same cluster, the additional words will have a similar predictiveness as the discriminant predicates." 
}, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-96", "text": "Unlike our extraction phase for OH extraction, in which only the correlation between predicates and protoOHs is considered (Table 2), we may find additional predicates, as the clustering is induced from completely unrestricted text." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-97", "text": "The extension of discriminant predicates can also be done by taking into account manually built general-purpose lexical resources, such as WordNet (see footnote 3)." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-98", "text": "One simply adds the entire set of synonyms of each of the predicates." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-99", "text": "----------------------------------" }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-100", "text": "**INCORPORATION INTO SUPERVISED CLASSIFIERS WITH ACTUALLY LABELED DATA**" }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-101", "text": "We also want to investigate the effectiveness of the knowledge from our rule-based classifier that has been learned on the unlabeled corpus (\u00a74.2) in supervised learning using actually labeled training data from our target corpus, i.e. the MPQA corpus." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-102", "text": "In particular, we will examine to what extent this knowledge (when used as a feature in supervised learning) can compensate for the lack of a sufficiently large labeled training set." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-103", "text": "For that experiment the labeled corpus, i.e. the MPQA corpus, will be split into a training set and a test set." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-104", "text": "Again, we use the supervised learner based on tree kernels (\u00a74.1)." 
}, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-105", "text": "(Footnote 3: wordnet.princeton.edu.) We also augment the tree kernels themselves with additional information, following Wiegand and Klakow (2010), who add for each word that belongs to a predictive semantic class another node that directly dominates the pertaining leaf node and assign it a label denoting that class." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-106", "text": "While Wiegand and Klakow (2010) made use of manually built lexicons, we use our predictive predicates extracted from contexts of protoOHs." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-107", "text": "For instance, if doubt is such a predicate, we would replace the corresponding subtree." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-108", "text": "(The example subtrees are omitted.)" }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-109", "text": "Moreover, we devise a simple vector kernel incorporating the prediction of the rule-based classifier." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-110", "text": "All kernels are combined by plain summation." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-111", "text": "----------------------------------" }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-112", "text": "**EXPERIMENTS**" }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-113", "text": "The documents were parsed using the Stanford Parser." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-114", "text": "Semantic roles were obtained by using the parser by Zhang et al. (2008)." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-115", "text": "----------------------------------" }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-116", "text": "**SUPERVISED LEARNING**" }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-117", "text": "All experiments using convolution kernels were done with the SVM-Light-TK toolkit." 
}, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-118", "text": "We test two versions of the supervised classifier." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-119", "text": "The first considers any mention of a protoOH as an OH, while the second is restricted to only those mentions of a protoOH which are an agent of some predicate." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-120", "text": "We also experimented with different amounts of (pseudo-)labeled training data from our unlabeled corpus, varying from 12500 to 150000 instances." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-121", "text": "We found that from 25000 instances onwards the classifier does not notably improve when further training data are added." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-122", "text": "The results of the classifier (using 150000 data instances) are listed in Table 3." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-123", "text": "The restriction of protoOHs to agents increases performance, as expected (see \u00a74)." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-124", "text": "----------------------------------" }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-125", "text": "**THE DIFFERENT RULE-BASED CLASSIFIERS**" }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-126", "text": "In order to build a rule-based classifier, we first need to determine how many of the ranked predicates are to be used." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-127", "text": "This process is done separately for verbs, nouns, and adjectives." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-128", "text": "For verbs, F-Score reaches its maximum at approximately 250, which is the value we chose in our subsequent experiments." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-129", "text": "In a similar fashion, we determined 100 for both nouns and adjectives." 
}, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-130", "text": "Table 4 lists the most highly ranked verbs that are extracted." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-131", "text": "6 As an indication of the intrinsic quality of the extracted words, we mark the words which can also be found in task-specific resources, i.e. communication verbs from the Appraisal Lexicon (AL) (Bloom et al., 2007) and opinion words from the Subjectivity Lexicon (SL) (Wilson et al., 2005) ." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-132", "text": "Both resources have been found predictive for OH extraction (Bloom et al., 2007; Wiegand and Klakow, 2010) ." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-133", "text": "Table 3 (lower part) shows the performance of the rule-based classifiers based on protoOHs using different parts of speech." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-134", "text": "As hard baselines, the table also shows other rule-based classifiers using the same dependency relations as our rulebased classifier (see Table 2 ) but employing different predicates." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-135", "text": "As lexical resources for these predicates, we again use AL and SL." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-136", "text": "The table also compares two different versions of the rule-based classifier being the classifier as presented in \u00a74.2 (left half of Table 3 ) and a classifier additionally incorporating the two heuristics (right half):" }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-137", "text": "\u2022 If the candidate NP follows according to, then it is labeled as an OH." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-138", "text": "\u2022 The candidate NP can only be an OH if it represents a person or a group of persons." 
}, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-139", "text": "These are commonly accepted heuristics which have already been used in previous work as features (Choi et al., 2005; Wiegand and Klakow, 2010) ." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-140", "text": "The latter rule requires the output of a named-entity recognizer 7 for checking proper nouns and WordNet for common nouns." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-141", "text": "As far as the classifier built with the help of protoOHs is concerned, adding highly ranked adjectives and nouns consistently improves the performance (mostly recall) when added to the set of 7 We use the Stanford tagger: nlp.stanford.edu/software/CRF-NER.shtml Table 4 : List of verbs most highly correlating with protoOHs; \u2020 : included in AL; * : included in SL." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-142", "text": "highly ranked verbs." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-143", "text": "The heuristics further improve the rule-based classifier which is achieved by notably increasing precision." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-144", "text": "None of the baselines is as robust as the best rule-based classifier using protoOHs (i.e. V250+A100+N100)." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-145", "text": "Considering our discussion in \u00a74.2, it comes as no surprise that the best (pseudo-)supervised classifier does not perform as well as our best rule-based classifier (induced by protoOHs)." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-146", "text": "The fact that, in addition to that, our proposed method also largely outperforms the rule-based classifier relying on both AL and SL when no heuristics are used and is still slightly better when they are incorporated supports the effectiveness of our method." 
}, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-147", "text": "----------------------------------" }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-148", "text": "**PERFORMANCE OF SUBSETS OF PROTOOHS**" }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-149", "text": "In the previous section, we evaluated predicates often co-occurring with the entire set of protoOHs (Table 1 )." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-150", "text": "Therefore, we should also check how individual protoOHs or special subsets perform in order to find out whether the simple approach of considering the entire set is the optimal setting." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-151", "text": "For these experiments we use the configuration: V250+N100+A100 without heuristics." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-152", "text": "We found that the performance of individual protoOHs varies and that the performance cannot be fully ascribed to the frequency of a protoOH with agentive contexts." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-153", "text": "For example, though proponent and demonstrator occur similarly often with those contexts, we obtain an F-Score of 44.75 when we use the predicates from the context of the former while we only obtain an F-Score of 32.70 when we consider the predicates of the latter." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-154", "text": "We also checked whether it would be more effective to use only a subset of protoOHs and compared the performance produced by the five best protoOHs, the five most frequent protoOHs, and Table 5 : Performance of extended rule-based classifiers." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-155", "text": "the entire set of protoOHs." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-156", "text": "The performance of the different subsets is very similar (i.e. 
46.44, 46.28, and 46.17) , so we may conclude that the configuration that we proposed, namely to consider all protoOHs, is more or less the optimal configuration for this method." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-157", "text": "Table 5 shows the performance of our method when extended by either self-training (SelfTr) or generalization." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-158", "text": "For generalization by clustering (Clus), we chose Brown clustering (Brown et al., 1992) which is the best performing algorithm in (Turian et al., 2010) ." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-159", "text": "The clusters are induced on our unlabeled corpus (see \u00a73)." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-160", "text": "We induced 1000 clusters (optimal size)." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-161", "text": "For the knowledge-based generalization (WN), we used synonyms from WordNet 3." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-162", "text": "For both Clus and WN, we display the results extending only the most highly ranked V100+N50+A50 since it provided notably better results than extending all predicates, i.e. V250+N100+A100 (our baseline)." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-163", "text": "The table shows that only self-training consistently improves the results." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-164", "text": "The impact of generalization is less advantageous since by increasing recall precision drops more dramatically." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-165", "text": "Only Clus in conjunction with the heuristics manages to preserve sufficient precision." 
}, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-166", "text": "----------------------------------" }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-167", "text": "**SELF-TRAINING AND GENERALIZATION**" }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-168", "text": "----------------------------------" }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-169", "text": "**INCORPORATING KNOWLEDGE FROM PROTOOHS INTO SUPERVISED LEARNING**" }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-170", "text": "As a maximum amount of labeled training data we chose 60000 instances (i.e. NPs) which is even a bit more than used in (Wiegand and Klakow, 2010) ." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-171", "text": "In addition, we also test 1%, 5%, 10%, 25% and 50% of the training set." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-172", "text": "From the remaining data instances, we use 25000 instances as test data." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-173", "text": "In order to deliver generalizing results, we randomly sample the training and test partitions five times and report the averaged results." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-174", "text": "We compare four different classifiers, a plain classifier using only the convolution kernel configuration from previous experiments (TKPlain), the augmented convolution kernels (TKAug) where additional nodes are added indicating the presence of an OH predicate ( \u00a74.3), the augmented convolution kernels with the vector kernel encoding the prediction of the best rule-based classifier (induced by protoOHs) without heuristics (TKAug+VK) and the classifier incorporating those heuristics (TKAug+VK[heur] )." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-175", "text": "Instead of just using one feature encoding the overall prediction we use several binary features representing the occurrence of the individual groups of predicates (i.e. 
verbs, nouns, or adjectives) and prediction types (direct predicate or predicate from cluster extension)." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-176", "text": "We also include the prediction of the self-trained classifier." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-177", "text": "The performance of these different classifiers is listed in Table 6 ." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-178", "text": "Recall from \u00a74.1 that we want to examine cases in which no task-specific resources and no or few labeled training data are available." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-179", "text": "This is why the different classifiers presented should primarily be compared to our own baseline (TKPlain) and not the numbers presented in previous work as they always use the maximal size of labeled training data and additionally task-specific resources (e.g. sentiment lexicons)." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-180", "text": "The results show that using the information extracted from the unlabeled data can be usefully combined with the labeled training data." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-181", "text": "Tree augmentation causes both precision and recall to rise." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-182", "text": "This observation is consistent with (Wiegand and Klakow, 2010) where, however, AL and SL are considered for augmentation." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-183", "text": "When the vector kernel with the prediction of the rule-based classifier is also included, precision drops slightly but recall is notably boosted resulting in an even more increased F-Score." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-184", "text": "The results also show that for the setting that we have in focus, i.e. using only few labeled training data, our proposed method is particularly useful." 
}, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-185", "text": "For example, when TKPlain is as good as the best classifier exclusively built from unlabeled data (50.88% in Table 5 ), i.e. at 10%, there is a very notable increase in F-Score when the additional knowledge is added, i.e. the F-Score of TKAug+VK[heur] is increased by approx." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-186", "text": "4% points." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-187", "text": "The degree of improvement towards TKPlain decreases the more labeled training data are used." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-188", "text": "However, when 100% of the labeled data are used, all of the other classifiers using additional information still outperform TKPlain." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-189", "text": "8 Table 6 : Performance of supervised classifiers incorporating the prediction of the rule-based classifier." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-190", "text": "----------------------------------" }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-191", "text": "**TKPLAIN (BASELINE)**" }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-192", "text": "----------------------------------" }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-193", "text": "**CONCLUSION**" }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-194", "text": "We proposed to harness contextual information from prototypical opinion holders for opinion holder extraction." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-195", "text": "We showed that mentions of such nouns when they are agents of a predicate are a useful source for automatically building a rulebased classifier." }, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-196", "text": "The resulting classifier performs at least as well as classifiers depending on taskspecific lexical resources and can also be extended by self-training." 
}, { "sent_id": "fed51218e78d35aae39d287c95a95a-C001-197", "text": "We also demonstrated that this knowledge can be incorporated into supervised classifiers and thus improve performance, in particular, if only few labeled training data are used." } ], "y": { "@BACK@": { "gold_contexts": [ [ "fed51218e78d35aae39d287c95a95a-C001-25" ], [ "fed51218e78d35aae39d287c95a95a-C001-139" ] ], "cite_sentences": [ "fed51218e78d35aae39d287c95a95a-C001-25", "fed51218e78d35aae39d287c95a95a-C001-139" ] }, "@USE@": { "gold_contexts": [ [ "fed51218e78d35aae39d287c95a95a-C001-33" ], [ "fed51218e78d35aae39d287c95a95a-C001-53" ], [ "fed51218e78d35aae39d287c95a95a-C001-105" ], [ "fed51218e78d35aae39d287c95a95a-C001-132" ], [ "fed51218e78d35aae39d287c95a95a-C001-136", "fed51218e78d35aae39d287c95a95a-C001-137", "fed51218e78d35aae39d287c95a95a-C001-138", "fed51218e78d35aae39d287c95a95a-C001-139" ] ], "cite_sentences": [ "fed51218e78d35aae39d287c95a95a-C001-33", "fed51218e78d35aae39d287c95a95a-C001-53", "fed51218e78d35aae39d287c95a95a-C001-105", "fed51218e78d35aae39d287c95a95a-C001-132", "fed51218e78d35aae39d287c95a95a-C001-139" ] }, "@EXT@": { "gold_contexts": [ [ "fed51218e78d35aae39d287c95a95a-C001-61" ] ], "cite_sentences": [ "fed51218e78d35aae39d287c95a95a-C001-61" ] }, "@DIF@": { "gold_contexts": [ [ "fed51218e78d35aae39d287c95a95a-C001-106" ], [ "fed51218e78d35aae39d287c95a95a-C001-170" ] ], "cite_sentences": [ "fed51218e78d35aae39d287c95a95a-C001-106", "fed51218e78d35aae39d287c95a95a-C001-170" ] }, "@SIM@": { "gold_contexts": [ [ "fed51218e78d35aae39d287c95a95a-C001-182" ] ], "cite_sentences": [ "fed51218e78d35aae39d287c95a95a-C001-182" ] } } }, "ABC_c7778abb2f1890ba896ccef2c3e13b_17": { "x": [ { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-21", "text": "----------------------------------" }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-22", "text": "**RECENT ADVANCEMENTS**" }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-66", "text": "In another work, Mihalcea et al. 
[18] attempted to quantify the changes in word usage over time." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-2", "text": "In this era of Big Data, due to expeditious exchange of information on the web, words are being used to denote newer meanings, causing linguistic shift." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-3", "text": "With the recent availability of large amounts of digitized texts, an automated analysis of the evolution of language has become possible." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-4", "text": "Our study mainly focuses on improving the detection of new word senses." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-5", "text": "This paper presents a unique proposal based on network features to improve the precision of new word sense detection." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-6", "text": "For a candidate word where a new sense (birth) has been detected by comparing the sense clusters induced at two different time points, we further compare the network properties of the subgraphs induced from novel sense cluster across these two time points." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-7", "text": "Using the mean fractional change in edge density, structural similarity and average path length as features in an SVM classifier, manual evaluation gives precision values of 0.86 and 0.74 for the task of new sense detection, when tested on 2 distinct time-point pairs, in comparison to the precision values in the range of 0.23-0.32, when the proposed scheme is not used." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-8", "text": "The outlined method can therefore be used as a new post-hoc step to improve the precision of novel word sense detection in a robust and reliable way where the underlying framework uses a graph structure." 
}, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-9", "text": "Another important observation is that even though our proposal is a post-hoc step, it can be used in isolation and that itself results in a very decent performance achieving a precision of 0.54-0.62." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-10", "text": "Finally, we show that our method is able to detect the well-known historical shifts in 80% cases." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-11", "text": "----------------------------------" }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-12", "text": "**INTRODUCTION**" }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-13", "text": "How do words develop new senses? How does one characterize semantic change?" }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-14", "text": "Is it possible to develop algorithms to track semantic change by comparing historical data at scale?" }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-15", "text": "In order to extract meaningful insights from these data, a very important step is to understand the contextual sense of a word, e.g., does the word 'bass' in a particular context refer to fish or is it related to music?" }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-16", "text": "Most data-driven approaches so far have been limited to either word sense induction where the the goal is to automatically induce different senses of a given word in an unsupervised clustering setting, or word sense disambiguation where a fixed sense inventory is assumed to exist, and the senses of a given word are disambiguated relative to the sense inventory." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-17", "text": "However in both these tasks, the assumption is that the number of senses that a word has, is static, and also the senses exist in the sense inventory to compare with." 
}, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-18", "text": "They attempt to detect or induce one of these senses depending on the context." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-19", "text": "However, natural language is dynamic, constantly evolving as per the users' needs which leads to change of word meanings over time." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-20", "text": "For example, by late 20 th century, the word 'float' has come up with the 'data type' sense whereas the word 'hot' has started corresponding to the 'attractive personality' sense." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-23", "text": "Recently, with the arrival of large-scale collections of historic texts and online libraries such as Google books, a new paradigm has been added to this research area, whereby the prime interest is in identifying the temporal scope of a sense [10, 14, 16, 25] which, in turn, can give further insights to the phenomenon of language evolution." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-24", "text": "Some recent attempts [5, 8, 11, 12, 15] also have been made to model the dynamics of language in terms of word senses." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-25", "text": "One of the studies in this area has been presented by Mitra et al. [19] where the authors show that at earlier times, the sense of the word 'sick' was mostly associated to some form of illness; however, over the years, a new sense associating the same word to something that is 'cool' or 'crazy' has emerged." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-26", "text": "Their study is based on a unique network representation of the corpus called a distributional thesauri (DT) network built using Google books syntactic n-grams." 
}, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-27", "text": "They have used unsupervised clustering techniques to induce a sense of a word and then compared the induced senses of two time periods to get the new sense for a particular target word." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-28", "text": "----------------------------------" }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-29", "text": "**LIMITATIONS OF THE RECENT APPROACHES**" }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-30", "text": "While Mitra et al. [19] reported a precision close to 0.6 over a random sample of 49 words, we take another random sample of 100 words separately and repeat manual evaluation." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-31", "text": "When we extract the novel senses by comparing the DTs from 1909-1953 and 2002-2005 , the precision obtained for these 100 words is as low as 0.32." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-32", "text": "Similarly if we extract the novel senses comparing the DTs of 1909-1953 with 2006-2008, the precision stands at 0.23." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-33", "text": "We then explore another unsupervised approach presented in Lau et al. [16] over the same Google books corpus 1 , apply topic modeling for sense induction and directly adapt their similarity measure to get the new senses." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-34", "text": "Using a set intersecting with the 100 random samples for Mitra et al. [19] , we obtain the precision values of 0.21 and 0.28, respectively." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-35", "text": "Clearly, none of the precision values are good enough for reliable novel sense detection." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-36", "text": "This motivates us to devise a new approach to improve the precision of the existing approaches." 
}, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-37", "text": "Further, being inspired by the recent works of applying complex network theory in NLP applications like co-hyponymy detection [13] , evaluating machine generated summaries [20] , detection of ambiguity in a text [4] , etc. we opt for a solution using complex network measures." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-38", "text": "----------------------------------" }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-39", "text": "**OUR PROPOSAL AND THE ENCOURAGING RESULTS**" }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-40", "text": "We propose a method based on the network features to reduce the number of false positives and thereby, increase the overall precision of the method proposed by Mitra et al. [19] ." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-41", "text": "In particular, if a target word qualifies as a 'birth' as per their method, we construct two induced subgraphs of those words that form the cluster corresponding to this 'birth' sense from the corresponding distributional thesauri (DT) networks of the two time points." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-42", "text": "Next we compare the following three network properties: (i) the edge density, (ii) the structural similarity and (iii) the average path length [27, 29] of the two induced subgraphs from the two time points." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-43", "text": "A remarkable observation is that although this is a small set of only three features, for the actual 'birth' cases, each of them has a significantly different value for the later time point and are therefore very discriminative indicators." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-44", "text": "In fact, the features are so powerful that even a small set of training instances is sufficient for making highly accurate predictions." 
}, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-45", "text": "Results: Manual evaluation of the results by 3 evaluators shows that this classification achieves an overall precision of 0.86 and 0.74 for the two time point pairs over the same set of samples, in contrast with the precision values of 0.32 and 0.23 by the original method." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-46", "text": "Note that we would like to stress here that an improvement of more than double in the precision of novel sense detection that we achieve has the potential to be the new stepping stone in many NLP and IR applications that are sensitive to novel senses of a word." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-47", "text": "----------------------------------" }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-48", "text": "**DETECTING KNOWN SHIFTS**" }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-49", "text": "Further we also investigate the robustness of our approach by analyzing the ability to capture known historical shifts in meaning." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-50", "text": "Preparing a list of words that have been suggested by different prior works as having undergone sense change, we see that 80% of those words get detected by our approach." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-51", "text": "We believe that the ability to detect such diachronic shifts in data can significantly enhance various standard studies in natural language evolution and change." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-52", "text": "----------------------------------" }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-53", "text": "**IMPACT**" }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-54", "text": "We stress that our work could have strong repercussions in historical linguistics [1] ." 
}, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-55", "text": "Besides, lexicography is also expensive; compiling, editing and updating sense inventory entries frequently remains cumbersome and labor-intensive." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-56", "text": "Time specific knowledge would make the word meaning representations more accurate." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-57", "text": "A well constructed semantic representation of a word is useful for many natural language processing or information retrieval systems like machine translation, semantic search, disambiguation, Q&A, etc." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-58", "text": "For semantic search, taking into account the newer senses of a word can increase the relevance of the query result." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-59", "text": "Similarly, a disambiguation engine informed with the newer senses of a word can increase the efficiency of disambiguation, and recognize senses uncovered by the inventory that would otherwise have to be wrongly assigned to covered senses." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-60", "text": "Above all, a system having the ability to perceive the novel sense of a word can help in automatic sense inventory update by taking into account the temporal scope of senses." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-61", "text": "----------------------------------" }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-62", "text": "**RELATED WORK**" }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-63", "text": "Our work broadly classifies under data-driven models of language dynamics." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-64", "text": "One of the first attempts in this area was made by Erk [6] , where the author tried to model this problem as an instance of outlier detection, using a simple nearest neighbor-based approach." 
}, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-65", "text": "Gulordava and Baroni [10] study the change in the semantic orientation of words using Google book n-grams corpus from different time periods." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-67", "text": "Along similar lines, Jatowt and Duh [14] used the Google n-grams corpus from two different time points and proposed a method to identify semantic change based on the distributional similarity between the word vectors." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-68", "text": "Tahmasebi et al. [25] attempted to track sense changes from a newspaper corpus containing articles between 1785 and 1985." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-69", "text": "Efforts have been made by Cook et al. [3] to prepare the largest corpus-based dataset of diachronic sense differences." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-70", "text": "Attempts have been made by Lau et al. [17] where they first introduced their topic modeling based word sense induction method to automatically detect words with emergent novel senses and in a subsequent work, Lau et al. [16] extended this task by leveraging the concept of predominant sense." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-71", "text": "The first computational approach to track and detect statistically significant linguistic shifts of words has been proposed by Kulkarni et al. [15] ." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-72", "text": "Recently, Hamilton et al. [12] proposed a method to quantify semantic change by evaluating word embeddings against known historical changes." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-73", "text": "In another work, Hamilton et al. [11] categorized the semantic change into two types and proposed different distributional measures to detect those types." 
}, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-74", "text": "Attempts have also been made to analyze time-series model of embedding vectors as well as time-indexed self-similarity graphs in order to hypothesize the linearity of semantic change by Eger et al. [5] ." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-75", "text": "A dynamic Bayesian model of diachronic meaning change has been proposed by Frermann et al. [8] ." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-76", "text": "Recently, researchers have also tried to investigate the reasons behind word sense evolution and have come up with computational models based on chaining [21] ." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-77", "text": "Researchers also attempt to apply dynamic word embeddings as well to detect language evolution [23, 30] , analyze temporal word analogy [24] ." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-78", "text": "We now describe the two baselines that are relevant for our work." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-79", "text": "Baseline 1: Mitra et al. [19] The authors proposed an unsupervised method to identify word sense changes automatically for nouns." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-80", "text": "Datasets and graph construction: The authors used the Google books corpus, consisting of texts from over 3.4 million digitized English books published between 1520 and 2008." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-81", "text": "The authors constructed distributional thesauri (DT) networks from the Google books syntactic n-grams data [9] ." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-82", "text": "In the DT network, each word is a node and there is a weighted edge between a pair of words where the weight of the edge is defined as the number of features that these two words share in common." 
}, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-83", "text": "A snapshot of the DT is shown in the leftmost image of Figure 1 ." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-84", "text": "To study word sense changes over time, they divided the dataset across eight time periods; accordingly, DT networks for each of these time periods were constructed separately." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-85", "text": "Sense change detection: The Chinese Whispers algorithm [2] is used to produce a set of clusters for each target word by decomposing its neighbourhood in the DT network." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-86", "text": "The hypothesis is that different clusters signify different senses of a target word." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-87", "text": "The clusters for the target word 'float' are shown in the right image of Figure 1 ." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-88", "text": "The authors then compare the sense clusters extracted across two different time points to obtain the suitable signals of sense change." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-110", "text": "Let S = {w 1 , w 2 , . . . , w n } be the 'birth' cluster for w. Once we construct a graph induced by S from the DT, these network properties are measured as follows: Edge Density (ED): ED is given by ED = N a /N p (2) where N a denotes the number of actual edges between w 1 , w 2 , . . . , w n and N p denotes the maximum possible number of edges between these, i.e., n(n\u22121)/2." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-111", "text": "----------------------------------" }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-112", "text": "**PROPOSED NETWORK-CENTRIC APPROACH**" }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-113", "text": "----------------------------------" }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-114", "text": "**STRUCTURAL SIMILARITY (SS):**" }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-115", "text": "For each pair of words (w i , w j ) in the cluster S, the structural similarity SS(w i , w j ) is computed as:" }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-116", "text": "where N c denotes the number of common neighbors of w i and w j in the induced graph and deg(w k ) denotes the degree of w k in the induced graph, for k = i, j. The average structural similarity for the cluster S is computed by averaging the structural similarity of all the word pairs." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-117", "text": "Average Path Length (APL): To compute the average path length of S, we first find the shortest path length between w and the words w i in the induced graph of S. Let spl i denote the shortest path distance from w to w i ." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-118", "text": "The average path length is defined as:" }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-119", "text": "where n is the number of words in the cluster S. Table 1 notes the values obtained for these network properties for the induced subgraphs of the reported 'birth' clusters for 'registers' and 'quotes' across the two time periods." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-120", "text": "The fractional changes observed for the three network properties show a clear demarcation between the two cases." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-121", "text": "Fractional change (\u2206) of any network measure P is defined as," }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-122", "text": "where t 1 and t 2 are the old and new time periods respectively." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-123", "text": "The change observed for the 'birth' cluster of 'registers' is significantly higher than that in 'quotes' 2 ." 
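The three cohesion measures and the fractional change can be computed directly on an induced subgraph. Since the SS, APL and \u2206 equations are elided in the text above, the sketch below uses forms implied by the surrounding definitions: a common-neighbour count normalized by sqrt(deg(w_i)·deg(w_j)) for SS (an assumption), the mean BFS shortest-path length from the target word to the cluster words for APL, and the relative change |P(t2) \u2212 P(t1)|/P(t1) for \u2206. The graph encoding (a set of frozenset edges) is illustrative.

```python
import math
from collections import deque
from itertools import combinations

def edge_density(cluster, edges):
    """ED = N_a / N_p, with N_p = n(n-1)/2 possible edges among the cluster words."""
    n = len(cluster)
    if n < 2:
        return 0.0
    actual = sum(1 for u, v in combinations(cluster, 2) if frozenset((u, v)) in edges)
    return actual / (n * (n - 1) / 2)

def _neighbours(w, cluster, edges):
    return {v for v in cluster if v != w and frozenset((w, v)) in edges}

def avg_structural_similarity(cluster, edges):
    """Assumed form: SS(w_i, w_j) = N_c / sqrt(deg(w_i) * deg(w_j)), averaged over pairs."""
    sims = []
    for u, v in combinations(cluster, 2):
        nu, nv = _neighbours(u, cluster, edges), _neighbours(v, cluster, edges)
        sims.append(len(nu & nv) / math.sqrt(len(nu) * len(nv)) if nu and nv else 0.0)
    return sum(sims) / len(sims) if sims else 0.0

def avg_path_length(target, cluster, edges):
    """APL = mean over w_i of the shortest path length spl_i from the target word w."""
    nodes = set(cluster) | {target}
    dist = {target: 0}
    queue = deque([target])
    while queue:  # breadth-first search from the target word
        u = queue.popleft()
        for v in nodes:
            if v not in dist and frozenset((u, v)) in edges:
                dist[v] = dist[u] + 1
                queue.append(v)
    spls = [dist[w] for w in cluster if w in dist]
    return sum(spls) / len(spls) if spls else float('inf')

def fractional_change(p_t1, p_t2):
    """Relative change of a network measure P between the old and new time periods."""
    return abs(p_t2 - p_t1) / p_t1 if p_t1 else float('inf')
```

For a true 'birth', the induced subgraph densifies between the two time periods, so the fractional changes of all three measures come out large; for a false 'birth' they stay small.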
}, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-124", "text": "We now compute these parameter values for all the 49 candidate cases." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-125", "text": "The mean values obtained for the true positives (TP) and false positives (FP) are shown in Table 2 ." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-126", "text": "The findings are consistent with those obtained for 'registers' and 'quotes'." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-127", "text": "2 As we have taken the 'birth' clusters from the new time period (t 2 ), the words in the clusters are always direct neighbors of the target word, resulting in an average path length of 1 in t 2" }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-128", "text": "We, therefore, use the fractional changes in the three network properties over time as three features to classify the remaining candidate 'birth' words into true positives (actual 'birth') or false positives (false 'birth')." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-129", "text": "----------------------------------" }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-130", "text": "**EXPERIMENTAL RESULTS**" }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-131", "text": "For experimental evaluation, we start with the 'birth' cases reported by Mitra et al. [19] ." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-132", "text": "We run the method of Lau et al. [16] over these birth cases to detect 'novel' senses as per their algorithm." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-133", "text": "Separately, we also apply the proposed SVM classification model as a filtering step to obtain 'filtered birth' cases." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-134", "text": "This helps in designing a comparative evaluation of these algorithms as follows." 
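The SVM filtering step described above can be sketched as follows: the three fractional changes (\u2206ED, \u2206SS, \u2206APL) serve as features for a classifier that separates true from false 'birth' candidates. The feature values below are invented for illustration, and the kernel choice is an assumption; the paper's actual training data and SVM settings are not given here.

```python
from sklearn.svm import SVC

# Each row is [dED, dSS, dAPL] for one candidate 'birth' cluster.
# Values are illustrative only: true births tend to show large fractional
# changes in the induced subgraph, false births small ones.
X_train = [
    [0.90, 0.80, 0.70],  # true birth
    [0.70, 0.90, 0.80],  # true birth
    [0.10, 0.10, 0.20],  # false birth
    [0.20, 0.00, 0.10],  # false birth
]
y_train = [1, 1, 0, 0]

clf = SVC()  # default RBF kernel; the kernel choice is an assumption
clf.fit(X_train, y_train)

# Filter new candidates: keep only those classified as true 'birth'.
candidates = [[0.80, 0.85, 0.75], [0.05, 0.10, 0.15]]
filtered = [c for c, label in zip(candidates, clf.predict(candidates)) if label == 1]
```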
}, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-135", "text": "From both the time point pairs (T 1 and T 2 ), we take 100 random samples from the birth cases reported by Mitra et al. [19] and get these manually evaluated." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-136", "text": "For the same 100 random samples, we now use the outputs of Lau et al. [16] and the proposed approach, and estimate the precision as well as recall of these." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-137", "text": "To further evaluate the proposed algorithm, we perform two more evaluations." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-138", "text": "First, we take 60 random samples from each time point pair for computing precision of the 'filtered birth' cases." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-139", "text": "Secondly, we also take 100 random samples for each time point pair for computing the precision of our approach independently of Mitra et al. [19] , i.e., the proposed approach is not informed of the 'birth' clusters reported by Mitra et al. [19] ; instead, all the clusters in the old and new time points are shown." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-140", "text": "We perform all the evaluations manually and each candidate word is judged by 3 evaluators." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-141", "text": "These evaluators are graduate/postgraduate students with a good background in Natural Language Processing." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-142", "text": "They are unaware of each other's judgments, making the annotation process completely blind and independent." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-143", "text": "Evaluators are shown the detected 'birth' cluster from the newer time period and all the clusters from the older time period." 
}, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-144", "text": "They are asked to make a binary judgment as to whether the 'birth' cluster indicates a new sense of the candidate word, which is not present in any of the sense clusters of the previous time point 3 ." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-145", "text": "Majority decision is taken in case of disagreement." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-146", "text": "In total, we evaluate the system on a set of as many as 520 words 4 , which we believe is significant given the tedious manual judgment involved." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-147", "text": "In this process of manual annotation, we obtain an inter-annotator agreement (Fleiss' kappa [7] ) of 0.745, which is substantial [28] ." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-148", "text": "Table 3 shows three example words from T 1 , their 'birth' clusters as reported in Mitra et al. [19] and the manual evaluation result." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-149", "text": "The first two cases belong to the computer-related senses of 'searches' and 'logging', which were absent from the time period 1909-1953." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-150", "text": "On the other hand, the 'birth' cluster of 'pesticide' represents an old sense which was also present in 1909-1953." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-151", "text": "Similarly, Table 4 shows manual evaluation results for 3 example cases, along with their novel sense as captured by Lau et al. [16] ." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-152", "text": "The results of Lau et al. [16] are presented in Table 5 ." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-153", "text": "Since the reported novel sense cluster can in principle be different from the 'birth' sense reported by the method of Mitra et al. [19] for the same word, we get the novel sense cases manually evaluated by 3 annotators (42 and 28 cases for the two time periods, respectively)." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-154", "text": "Note that for these 100 random samples (that are all marked 'true' by Mitra et al. [19] ), it is possible to find an upper bound on the recall of Lau et al. [16] 's approach automatically." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-155", "text": "While the low recall might be justified because this is a different approach, even the precision is found to be in the same range as that of Mitra et al. [19] ." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-156", "text": "Table 6 presents the evaluation results for the same set of 100 random samples after using the proposed SVM filtering." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-157", "text": "We see that the filtering using SVM classification significantly improves the precision for both the time point pairs (T 1 and T 2 ), boosting it from the range of 0.23-0.32 to 0.74-0.86. Note that, as per our calculations, the recall of Mitra et al. [19] would indeed be 100% (as we are taking random samples for annotation from the set of reported 'birth' cases by Mitra et al. [19] only)." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-158", "text": "Even then, Mitra et al. [19] 's F-measure ranges from 0.37-0.48 while ours is 0.67-0.68." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-159", "text": "Table 7 presents some of the examples which were declared as 'birth' by Mitra et al. [19] but which SVM filtering correctly flagged as 'false birth'." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-160", "text": "The feature values in the third column clearly show that the network around the words in the detected 'birth' cluster did not change much and therefore, the SVM approach could correctly flag these." 
}, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-161", "text": "Considering the small training set, the results are highly encouraging." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-162", "text": "We also obtain decent recall values for the two time point pairs, giving an overall F-measure of 0.67-0.68." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-163", "text": "Further, we check if we can meaningfully combine the results reported by the methods of Mitra et al. [19] and Lau et al. [16] for more accurate sense detection, and how this compares with the proposed approach; an example is shown in Table 8 , where both the senses look quite similar." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-164", "text": "Table 9 shows the accuracy results obtained using this approach." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-165", "text": "Only 6 and 2 words out of those 100 samples were flagged as 'birth' for the two time points T 1 and T 2 respectively." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-166", "text": "Thus, the recall is very poor." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-167", "text": "While the precision improves slightly for T 1 (4 out of 6 are correct), it is worse for T 2 (only 1 out of 2 words is correct)." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-168", "text": "The results confirm that the proposed SVM classification approach works better than both the approaches, individually as well as combined." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-169", "text": "Feature analysis: We now move on to further feature analysis and error analysis of the proposed approach." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-170", "text": "To validate the usefulness of all the identified features, we perform feature leave-one-out experiments." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-171", "text": "The results for T 1 are presented in Tables 10 and 11." 
}, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-172", "text": "We see that F-measure drops as we leave out one of the features." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-173", "text": "While {ED, SS } turns out to be the best for precision, {SS, APL} gives the best recall." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-174", "text": "Table 12 provides three examples to illustrate the importance of using all the three features." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-175", "text": "Using {ED, APL} for the word 'newsweek' and {ED, SS } for the word 'caring' fails to detect these as 'birth'." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-176", "text": "Only when all the three features are used are these cases correctly detected as 'birth'." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-177", "text": "Edge density, on the other hand, is crucial for improving precision." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-178", "text": "For instance, when only {SS, APL} are used, words like 'moderators' are wrongly flagged as 'true'." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-179", "text": "Such cases are filtered out when all the three features are used." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-180", "text": "Extensive evaluation of the proposed approach: We first take 60 random samples each from the 'birth' cases reported by the SVM filtering for the two time point pairs, T 1 (from 318 cases) and T 2 (from 329 cases)." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-181", "text": "The precision values of this evaluation are found to be 0.87 (52/60) and 0.75 (45/60) respectively, quite consistent with those reported in Table 6 ." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-182", "text": "We did another experiment in order to estimate the performance of our model for detecting novel senses, independent of the method of Mitra et al. [19] ." 
}, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-183", "text": "We take 100 random words from the two time point pairs (T 1 and T 2 ), along with all the induced clusters from the newer time period, and run the proposed SVM filtering approach to flag the novel 'birth' senses." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-184", "text": "According to our model, 16 and 15 words are flagged as having a novel sense for T 1 and T 2 respectively, achieving precision values of 0.54 and 0.62 on manual evaluation, which is quite decent in itself." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-185", "text": "Note that, for some cases, multiple clusters of a single word have been flagged as novel senses and we observe that these clusters hold a similar sense." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-186", "text": "Error analysis: We further analyze the cases, which are labeled as 'true birth' by the SVM but are evaluated as 'false' by the human evaluators." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-187", "text": "We find that in most of such cases, the sense cluster reported as 'birth' contained many new terms (and therefore, the network properties have undergone change) but the implied sense was already present in one of the previous clusters with very few common words (and therefore, the new cluster contained > 80% new words, and was reported as 'birth' in Mitra et al. [19] )." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-188", "text": "Two such examples are given in Table 13 ." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-189", "text": "The split-join algorithm proposed in Mitra et al. [19] needs to be adapted for such cases." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-190", "text": "We also analyze the 'false positive' cases, which are labeled as 'false birth' by the SVM filtering but are evaluated as 'true' by the human evaluators." 
}, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-191", "text": "Two such examples are given in Table 14 ." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-192", "text": "By looking at the feature values of these cases, it is clear that the network structure of the induced subgraph is not changing much, yet they undergo sense change." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-193", "text": "A probable reason is that the target word was not in the network of the induced subgraph in the old time point but enters into it in the new time point." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-194", "text": "Our SVM model is so far unable to detect such a single-node injection into the network." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-195", "text": "Handling these cases would be an immediate future step to improve the recall of the system." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-196", "text": "----------------------------------" }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-197", "text": "**DETECTION OF KNOWN SHIFTS**" }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-198", "text": "So far, we have reported experiments on discovering novel senses from data and measured the accuracy of our method using manual evaluation." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-199", "text": "In this section, we evaluate the diachronic validity of our method on another task of detecting known shifts." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-200", "text": "We test whether our proposed method is able to capture the known historical shifts in meaning." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-201", "text": "For this purpose, we create a reference list L of 15 words that have been suggested by prior work [5, 11, 12] as having undergone a linguistic change and emerged with a novel sense." 
}, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-202", "text": "Note that, we focus only on nouns that emerge with a novel sense between 1900 and 1990." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-203", "text": "The goal of this task is to find out how many cases from the list L our method is able to detect as having a novel sense, which in turn would demonstrate the robustness of our method." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-204", "text": "Data: Consistent with the prior work, we use the Corpus of Historical American English (COHA) 5 ." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-205", "text": "The COHA corpus is carefully created to be genre-balanced and is a well-constructed sample of American English over 200 years, from 1810 to 2000." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-206", "text": "We extract the raw text data of two time slices, 1880-1900 and 1990-2000, for our experiment." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-207", "text": "Experiment details and results: We first construct distributional thesauri (DT) networks [22] for the COHA corpus at two different time points, 1880-1900 and 1990-2000." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-208", "text": "We apply the Chinese Whispers algorithm [2] to produce a set of clusters for each target word in the DT network." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-209", "text": "The Chinese Whispers clusters for the target word 'web' are shown in Figure 3 ." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-210", "text": "Note that we have reported only some of the representative words for each cluster." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-211", "text": "Each of the clusters represents a particular sense of the target." 
}, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-212", "text": "We now compare the sense clusters extracted across two different time points to obtain the suitable signals of sense change, following the approach proposed in Mitra et al. [19] ." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-213", "text": "After getting the novel sense clusters, we pick up 50 random samples, of which 25 cases are flagged as 'true birth' and the remaining 25 cases are flagged as 'false birth' by manual evaluation." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-214", "text": "We use these 50 samples as our training set for classification using SVM." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-215", "text": "Some of the examples of this training set are presented in Table 15 ." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-216", "text": "We ensure that none of the words in the list L is present in the training set." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-217", "text": "Using this training set for our proposed SVM classifier, we are successfully able to detect 80% of the cases (12 out of 15) from the list L as having a novel sense." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-218", "text": "Table 16 presents all of these detected words along with the novel senses and the discriminative network feature." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-219", "text": "Our method is unable to detect three cases: 'gay', 'guy' and 'bush'." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-220", "text": "For 'gay', since there is no sense cluster in the older time period with 'gay' being a noun, cluster comparison does not even detect the 'birth' cluster of 'gay'." 
}, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-221", "text": "The 'birth' sense clusters for 'guy' and 'bush' in the new time period, as detected by the split-join algorithm, contain general terms like \"someone, anyone, men, woman, mother, son\" and \"cloud, air, sky, sunlight\" respectively." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-222", "text": "As the network around these words did not change much over time, our method found them difficult to detect." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-223", "text": "Note that even though the COHA corpus is substantially smaller than the Google n-grams corpus, our approach produces promising results, showing the usability of the proposed method with a not-so-large corpus as well." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-224", "text": "----------------------------------" }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-225", "text": "**CONCLUSION**" }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-226", "text": "In this paper, we showed how complex network theory can help improve the performance on the otherwise challenging task of novel sense detection." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-227", "text": "This is the first attempt to apply concepts borrowed from complex network theory to deal with the problem of novel sense detection." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-228", "text": "We demonstrated how the change in the network properties of the induced subgraphs from a sense cluster can be used to improve the precision of novel sense detection significantly." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-229", "text": "Manual evaluation over two different time point pairs shows that the proposed SVM classification approach boosts the precision values from 0.23-0.32 to 0.74-0.86." 
}, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-230", "text": "Finally, from the experiments on the COHA corpus, we have also shown that our approach can reliably detect the words known to have sense shifts." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-231", "text": "We have made the human-annotated data used in our experiments publicly available, which could be used as a gold standard dataset to validate systems built for novel sense detection 6 ." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-232", "text": "This framework for precise novel sense detection of a word can be used by lexicographers as well as historical linguists to design new dictionaries and to update existing sense repositories like WordNet, where candidate new senses can be semi-automatically detected and included, thus greatly reducing the manual effort otherwise required." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-233", "text": "Computational methods based on large diachronic corpora are considered to have huge potential to shed light on interesting language evolution phenomena, which can be useful for etymologists as well." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-234", "text": "In future, we plan to apply our methodology to different genres of corpora, like social network data and product or movie review data, which are becoming increasingly popular sources for opinion tracking, to identify short-term changes in word senses or usages." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-235", "text": "These analyses would also provide insights on the evolution of language in a short span of time." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-236", "text": "Apart from that, we plan to extend our work to detect the dying senses of words: the senses which were used in older texts but are not being used anymore." 
}, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-237", "text": "Our ultimate goal is to prepare a generalized framework for accurate detection of sense change across languages and to investigate the triggering factors behind language evolution as well." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-89", "text": "Specifically, for a candidate word w, a sense cluster in the later time period is called a 'birth' cluster if at least 80% of the words in this cluster do not appear in any of the sense clusters from the previous time period." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-90", "text": "The authors then apply multi-stage filtering in order to obtain meaningful candidate words." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-91", "text": "Baseline 2: Lau et al. [16] : The authors proposed an unsupervised approach based on topic modeling for sense induction, and showed novel sense identification as one of its applications." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-92", "text": "For a candidate word, a Hierarchical Dirichlet Process [26] is run over a corpus to induce topics." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-93", "text": "The induced topics are represented as word multinomials, and are expressed by the top-N words in descending order of conditional probability." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-94", "text": "Each topic is treated as a sense of the target word." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-95", "text": "The words having the highest probability in each topic represent the sense clusters." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-96", "text": "The authors treated the novel sense detection task as identifying those sense clusters which did not align with any of the recorded senses in a sense repository." 
}, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-97", "text": "They used Jensen-Shannon (JS) divergence measure to compute alignment between a sense cluster and a synset." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-98", "text": "They computed JS divergence between the multinomial distribution (over words) of the topic cluster and that of the synset, and converted the divergence value into a similarity score." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-99", "text": "Similarity between topic cluster t j and synset s i is defined as," }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-100", "text": "where T and S are the multinomial distributions over words for topic t j and synset s i , respectively, and JS(X \u2225 Y ) is the Jensen-Shannon divergence for the distribution X and Y ." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-101", "text": "Since we define novel senses while comparing sense clusters across two time points, we use the same JS measure to detect novel sense of a target word." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-102", "text": "A sense cluster in the newer time period denotes a new sense ('birth') only if its maximum similarity with any of the clusters in older time period is below a threshold, which we have set to 0.35 based on empirical observation." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-103", "text": "The strength of this connection can be measured if we construct induced subgraphs of S from the two DTs and measure the network properties of these subgraphs; the difference would be more prominent for the actual 'birth' cases (true positives) than for the false 'birth' signals (false positives)." 
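The JS-divergence-based 'birth' test just described can be sketched as follows. Base-2 logarithms keep the divergence in [0, 1]; the conversion sim = 1 \u2212 JS is an assumption, since the text only says the divergence value is converted into a similarity score.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two word multinomials p and q
    (dicts mapping words to probabilities); base-2 logs, so JS lies in [0, 1]."""
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in set(p) | set(q)}
    def kl(a):  # KL(a || m); m[w] > 0 wherever a[w] > 0
        return sum(pr * math.log2(pr / m[w]) for w, pr in a.items() if pr > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

def similarity(p, q):
    """Assumed conversion of the divergence into a similarity score."""
    return 1.0 - js_divergence(p, q)

BIRTH_THRESHOLD = 0.35  # empirically chosen threshold from the text

def is_birth(new_cluster, old_clusters):
    """A cluster from the newer period signals a 'birth' if its maximum
    similarity with every cluster from the older period is below the threshold."""
    return max((similarity(new_cluster, c) for c in old_clusters), default=0.0) < BIRTH_THRESHOLD
```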
}, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-104", "text": "Note that by definition, the nodes in an induced subgraph from a DT will be the words in S and there will be an edge between two words if and only if the edge exists in the original DT; we ignore the weight of the edge henceforth." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-105", "text": "Thus, the difference between the two subgraphs (one each from the older and newer DTs) will only be in the edge connections." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-106", "text": "Figure 2 takes one true positive ('registers') and one false positive ('quotes') word from the set of 49 words and shows the induced subgraphs obtained from a subset of their 'birth' clusters across the two time points." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-107", "text": "We can clearly see that the connections among the words in S are much stronger in the newer DT than in the older one in the case of 'registers', indicating the emergence of a new sense." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-108", "text": "In the case of 'quotes', however, the connections are not very different across the two time periods." }, { "sent_id": "c7778abb2f1890ba896ccef2c3e13b-C001-109", "text": "We choose three cohesion-indicating network properties, (i) the edge density, (ii) the structural similarity and (iii) the average path length, to capture this change." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "c7778abb2f1890ba896ccef2c3e13b-C001-23" ], [ "c7778abb2f1890ba896ccef2c3e13b-C001-70" ] ], "cite_sentences": [ "c7778abb2f1890ba896ccef2c3e13b-C001-23", "c7778abb2f1890ba896ccef2c3e13b-C001-70" ] }, "@USE@": { "gold_contexts": [ [ "c7778abb2f1890ba896ccef2c3e13b-C001-33" ], [ "c7778abb2f1890ba896ccef2c3e13b-C001-91" ], [ "c7778abb2f1890ba896ccef2c3e13b-C001-132" ], [ "c7778abb2f1890ba896ccef2c3e13b-C001-136" ], [ "c7778abb2f1890ba896ccef2c3e13b-C001-151" ], [ "c7778abb2f1890ba896ccef2c3e13b-C001-154" ], [ "c7778abb2f1890ba896ccef2c3e13b-C001-163" ] ], "cite_sentences": [ "c7778abb2f1890ba896ccef2c3e13b-C001-33", "c7778abb2f1890ba896ccef2c3e13b-C001-91", "c7778abb2f1890ba896ccef2c3e13b-C001-132", "c7778abb2f1890ba896ccef2c3e13b-C001-136", "c7778abb2f1890ba896ccef2c3e13b-C001-151", "c7778abb2f1890ba896ccef2c3e13b-C001-154", "c7778abb2f1890ba896ccef2c3e13b-C001-163" ] } } }, "ABC_730738d63cabcd4e63ec4300a8091b_17": { "x": [ { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-100", "text": "**ERROR DISTRIBUTIONS**" }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-2", "text": "Beam-search and global models have been applied to transition-based dependency parsing, leading to state-of-the-art accuracies that are comparable to the best graph-based parsers." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-3", "text": "In this paper, we analyze the effects of global learning and beam-search on the overall accuracy and error distribution of a transition-based dependency parser." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-4", "text": "First, we show that global learning and beam-search must be jointly applied to give improvements over greedy, locally trained parsing." 
}, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-5", "text": "We then show that in addition to the reduction of error propagation, an important advantage of the combination of global learning and beam-search is that it accommodates more powerful parsing models without overfitting." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-6", "text": "Finally, we characterize the errors of a global, beam-search, transition-based parser, relating it to the classic contrast between \"local, greedy, transition-based parsing\" and \"global, exhaustive, graph-based parsing\"." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-7", "text": "TITLE AND ABSTRACT IN CHINESE" }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-8", "text": "----------------------------------" }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-10", "text": "Beam-search has been applied to transition-based dependency parsing in recent studies (Zhang and Clark, 2008; Huang and Sagae, 2010; Hatori et al., 2011) ." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-11", "text": "In addition to reducing search errors compared to greedy search, it also enables the use of global models that accommodate richer non-local features without overfitting, leading to recent state-of-the-art accuracies of transition-based dependency parsing (Zhang and Nivre, 2011; Bohnet and Kuhn, 2012; Bohnet and Nivre, 2012) that are competitive with the best graph-based dependency parsers." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-12", "text": "It has been known that a transition-based parser using global learning, beam-search and rich features gives significantly higher accuracies than one with local learning and greedy search." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-13", "text": "However, the effects of global learning, beam-search and rich features have not been separately studied." 
}, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-14", "text": "Apart from the natural conclusion that beam-search reduces error propagation compared to greedy search, exactly how these techniques help to improve parsing has not been discussed, and many interesting questions remain unanswered." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-15", "text": "For example, the contribution of global learning in improving the accuracies has not been separately studied." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-16", "text": "It has not been shown how global learning affects the accuracies, or whether it is important at all." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-17", "text": "For another example, it would be interesting to know whether a local, greedy, transition-based parser can be equipped with the rich features of Zhang and Nivre (2011) to improve its accuracy, and in particular whether MaltParser (Nivre et al., 2006) can achieve the same level of accuracies as ZPar (Zhang and Nivre, 2011) by using the same range of rich feature definitions." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-18", "text": "In this paper, we answer the above questions empirically." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-19", "text": "First, we separate out global learning and beam-search, and study the effect of each technique by comparison with a local greedy baseline." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-20", "text": "Our results show that significant improvements are achieved only when the two are jointly applied." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-21", "text": "Second, we show that the accuracies of a local, greedy transition-based parser cannot be improved by adding the rich features of Zhang and Nivre (2011) ." 
}, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-22", "text": "Our result suggests that global learning with beam-search accommodates more complex models with richer features than a local model with greedy search and therefore enables higher accuracies." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-23", "text": "One interesting aspect of using a global model with beam-search is that it narrows down the contrast between \"local, greedy, transition-based parsing\" and \"global, exhaustive, graph-based parsing\" as exemplified by McDonald and Nivre (2007) ." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-24", "text": "On the one hand, global beam-search parsing is more similar to global, exhaustive parsing than local, greedy parsing in the use of global models and non-greedy search." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-25", "text": "On the other hand, beam-search does not affect the fundamental transition-based parsing process, which allows the use of rich non-local features, and is very different from graph-based parsing." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-26", "text": "An interesting question is how such differences in models and algorithms affect empirical errors." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-27", "text": "McDonald and Nivre (2007) make a comparative analysis of local greedy transition-based MaltParser and global near-exhaustive graph-based MSTParser (McDonald and Pereira, 2006) using the CoNLL-X Shared Task data (Buchholz and Marsi, 2006) , showing that the parsers give near identical overall accuracies, but have very different error distributions according to various metrics." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-28", "text": "While MaltParser is more accurate on frequently occurring short sentences and dependencies, it performs worse on long sentences and dependencies due to search errors." 
}, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-29", "text": "We present empirical studies of the error distribution of global, beam-search transition-based dependency parsing, using ZPar (Zhang and Nivre, 2011) as a representative system." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-30", "text": "We follow McDonald and Nivre (2007) and perform a comparative error analysis of ZPar, MSTParser and MaltParser using the CoNLL-X shared task data." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-31", "text": "Our results show that beam-search im-proves the precision on long sentences and dependencies compared to greedy search, while the advantage of transition-based parsing on short dependencies is preserved." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-32", "text": "Under particular measures, such as precision for arcs at different levels of the trees, ZPar shows characteristics surprisingly similar to MSTParser." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-33", "text": "----------------------------------" }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-34", "text": "**ANALYZING THE EFFECT OF GLOBAL LEARNING AND BEAM-SEARCH**" }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-35", "text": "In this section we study the effects of global learning and beam-search on the accuracies of transition-based dependency parsing." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-36", "text": "Our experiments are performed using the Penn Treebank (PTB)." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-37", "text": "We follow the standard approach to split PTB3 into training (sections 2-21), development (section 22) and final testing (section 23) sections." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-38", "text": "Bracketed sentences from the treebank are transformed into dependency structures using the Penn2Malt tool." 
}, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-39", "text": "1 POS-tags are assigned using a perceptron tagger (Collins, 2002) , with an accuracy of 97.3% on a standard Penn Treebank test." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-40", "text": "We assign automatic POS-tags to the training data using ten-way jacknifing." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-41", "text": "Accuracies are measured using the unlabeled attached score (UAS) metric, which is defined as the percentage of words (excluding punctuation) that are assigned the correct heads." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-42", "text": "----------------------------------" }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-43", "text": "**THE EFFECTS OF GLOBAL LEARNING AND BEAM-SEARCH**" }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-44", "text": "In this subsection, we study the effects of global learning and beam-search separately." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-45", "text": "Our experiments are performed using ZPar, which uses global learning and beam-search." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-46", "text": "To make comparisons with local learning under different settings, we make configurations and modifications to ZPar where necessary." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-47", "text": "Global learning is implemented in the same way as Zhang and Nivre (2011) , using the averaged perceptron algorithm (Collins, 2002) and early update (Collins and Roark, 2004) ." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-48", "text": "This is a global learning method in the sense that it tries to maximize accuracy over the entire sentence and not on isolated local transitions." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-49", "text": "Unless explicitly specified, the same beam size is applied for training and testing when beam-search is applied." 
}, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-50", "text": "Local learning is implemented as a multi-class classifier that predicts the next transition action given a parser configuration (i.e. a stack and an incoming queue), trained using the averaged perceptron algorithm." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-51", "text": "In local learning, each transition is considered in isolation and there is no global view of the transition sequence needed to parse an entire sentence." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-52", "text": "Figure 1 shows the UAS of ZPar under different settings, where 'global' refers to a global model trained using the same method as Zhang and Nivre (2011) , 'local' refers to a local classifier trained using the averaged perceptron, 'base features' refers to the set of base feature templates in Zhang and Nivre (2011) , and 'all features' refers to the set of base and all extended feature templates in Zhang and Nivre (2011) ." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-53", "text": "When the size of the beam is 1, the decoding algorithm is greedy local search." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-54", "text": "Using base features, a locally trained model gives a UAS of 89.15%, higher than that of a globally trained model (89.04%)." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-55", "text": "Here a global model does not give better accuracies compared to a local model under greedy search." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-56", "text": "As the size of the beam increses, the UAS of the global model increases, but the UAS of the local model decreases." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-57", "text": "Global learning gives significantly better accuracies than local learning under beam-search." 
}, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-58", "text": "There are two ways to explain the reason that beam-search hurts the UAS of a locally trained model." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-59", "text": "First, the perceptron can be viewed as a large-margin training algorithm that finds a separation margin between the scores of positive examples (gold-standard structures) and negative examples (non-gold structures from the decoder)." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-60", "text": "The online learning process runs the decoding algorithm to generate a space of negative examples, which is used together with its corresponding positive example space for parameter udpates." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-61", "text": "If the negative example space during training is different from that during testing, the trained model will not separate the test examples as effectively as when the negative example spaces for training and testing are similar, since there are more unseen negative examples in the model." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-62", "text": "To further illustrate this, we conduct an additional set of development experiments by training two global models with different beam sizes." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-63", "text": "Each of the models is tested using its own training beam size and the training beam size of the other model." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-64", "text": "The results are shown in Table 1 ." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-65", "text": "As can be seen from the table, a global model trained with a size-1 beam gives a higher UAS when tested with a size-1 beam than with a size-64 beam." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-66", "text": "Similarly, a global model trained with a size-64 beam gives a higher UAS when tested using a size-64 beam than using a size-1 beam." 
}, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-67", "text": "Our observations are consistent with those of Daum\u00e9 III and Marcu (2005) , which show that the accuracies of another online large-margin model are lower when the training and testing beam sizes are different than when they are the same." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-68", "text": "These results show the negative effect of a mismatch between training and testing negative example spaces, which also happens when a locally trained model is tested using beam-search." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-69", "text": "To take a second perspective, a local model is trained to disambiguate different transition actions under the same parser configuration, but not different transitions under different parser configurations." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-70", "text": "This means that the scores of two sequences of transition actions may not be comparable with each other when they consist of very different parser configuration sequences." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-71", "text": "This is reminiscent of the label bias problem (Lafferty et al., 2001) , and partly explains the performance degradation of the local model when tested with beam-search." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-72", "text": "To summarize the above discussion, a global model does not improve over a local model for greedy parsing, and beam-search does not improve the performance of a parser trained locally using the perceptron algorithm." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-73", "text": "However, the combination of global learning and beam-search can significantly improve the performance compared to a local, greedy transition-based parser." 
}, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-74", "text": "----------------------------------" }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-75", "text": "**BENEFITS FROM GLOBAL LEARNING AND BEAM-SEARCH**" }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-76", "text": "An additional benefit of global learning and beam-search is the accommodation of rich nonlocal features." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-77", "text": "Again in Figure 1 , the use of rich non-local features improves the UAS of the global models with all beam sizes, while the improvement brought by rich non-local features also increases with increased size of the beam." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-78", "text": "With greedy local search, the accuracy improves from 89.04% with base features to 89.35% with all features; with the size of the beam being 64, the accuracy improves form 92.27% with base features to 93.18% with all features." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-79", "text": "The absolute improvement increased from 0.3% to 0.89%." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-80", "text": "The above fact shows that rich non-local features are more effective on a global model with a large beam-size." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-81", "text": "This is a consequence of the interaction between learning and search: a large beam not only reduces search errors, but also enables a more complex model to be trained without overfitting." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-82", "text": "In contrast to a globally trained model, a local model cannot benefit as much from the power of rich features." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-83", "text": "With greedy local search, the UAS of a local model improves from 89.15% with base features to 89.28% with all features." 
}, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-84", "text": "Beam-search does not bring additional improvements." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-85", "text": "For further evidence, we add rich non-local features in the same increments as Zhang and Nivre (2011) to both ZPar and MaltParser, and evaluate UAS on the same development data set." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-86", "text": "Original settings are applied to both parsers, with ZPar using global learning and beam-search, and MaltParser using local learning and greedy search." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-87", "text": "Table 2 shows that while ZPar's accuracy consistently improves with the addition of each new set of features, there is very little impact on MaltParser's accuracy and in some cases the effect is in fact negative, indicating that the locally trained greedy parser cannot benefit from the rich non-local features." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-88", "text": "Yet another evidence for the support of more complex models by global learning and beamsearch is the work of Bohnet and Nivre (2012) , where non-projective parsing using online reordering (Nivre, 2009 ) and rich features led to significant improvements over greedy search (Nivre, 2009) , achieving state-of-the-art on a range of typologically diverse languages." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-89", "text": "3 Characterizing the errors" }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-90", "text": "----------------------------------" }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-91", "text": "**THE PARSERS AND EVALUATION DATA**" }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-92", "text": "In this section we study the effect of global learning and beam-search on the error distributions of transition-based dependency parsing." 
}, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-93", "text": "We characterize the errors of ZPar and add it to the error comparison between MaltParser and MSTParser (McDonald and Nivre, 2007) ." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-94", "text": "Following McDonald and Nivre (2007) we evaluate the parsers on the CoNLL-X Shared Task data (Buchholz and Marsi, 2006) , which include training and test sentences for 13 different languages." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-95", "text": "For each parser, we conjoin the outputs for all 13 languages in the same way as McDonald and Nivre (2007) , and calculate error distributions over the aggregated output." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-96", "text": "Accuracies are measured using the labeled attached score (LAS) evaluation metric, which is defined as the percentage of words (excluding punctuation) that are assigned both the correct head word and the correct arc label." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-97", "text": "To handle non-projectivity, pseudo-projective parsing (Nivre and Nilsson, 2005 ) is applied to ZPar and MaltParser, transforming non-projective trees into pseudo-projective trees in the training data, and post-processing pseudo-projective outputs by the parser to transform them into non-projective trees." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-98", "text": "MSTParser produces non-projective trees from projective trees by scorebased rearrangements of arcs." 
}, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-99", "text": "----------------------------------" }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-101", "text": "We take a range of different perspectives to characterize the errors of ZPar, comparing them with those of MaltParser and MSTParser by measuring the accuracies against various types of metrics, including the size of the sentences and dependency arcs, the distance to the root of the dependency tree, and the number of siblings." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-102", "text": "The parsers show different empirical performances over these measures, demonstrating the comparative advantages and disadvantages of their design discussed in Section 3.1." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-103", "text": "Figure 2 shows the accuracy of the parsers relative to sentence length (the number of words in a sentence, in bins of size 10)." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-104", "text": "All three parsers perform comparatively better on short sentences." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-105", "text": "The performance of MaltParser and MSTParser is very similar, with MaltParser performing better on very short sentences (\u2264 20) due to richer feature representations, and worse on longer sentences (20 to 50) due to the propagation of search errors." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-106", "text": "Because short sentences are much more frequent in the test data, MaltParser and MSTParser give almost identical overall accuracies." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-107", "text": "The three parsers show larger variance in performance when evaluated against specific properties of the dependency tree." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-108", "text": "Figure 3 shows the precision and recall for each parser relative to the arc lengths in the predicted and gold-standard dependency trees." 
}, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-109", "text": "Here the length of an arc is defined as the absolute difference between the indices of the head and modifier." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-110", "text": "Precision represents the percentage of predicted arcs with a particular length that are correct, and recall represents the percentage of gold arcs of a particular length that are correctly predicted." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-111", "text": "MaltParser gives higher precision than MSTParser for short dependency arcs (\u2264 4), but its precision drops rapidly for arcs with increased lengths." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-112", "text": "These arcs take more shift-reduce actions to build, and are hence more prone to error propagation." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-113", "text": "The precision of ZPar drops much slower compared to MaltParser, demonstrating the effect of beam-search for the reduction of error propagation." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-114", "text": "Another important factor is the use of rich non-local features by ZPar, which is a likely reason for its precision to drop slower even than that of MSTParser when the arc size increases from 1 to 8." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-115", "text": "Interestingly, the precision of ZPar is almost indistinguishable from that of MaltParser for size 1 arcs (arcs between neighbouring words), showing that the wider range of features in ZPar is the most helpful in arcs that take more than one, but not too many shiftreduce actions to build." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-116", "text": "The recall curves of the three parsers are similar, with ZPar having higher recall than MSTParser and MaltParser, particularly when the dependency size is greater than 2." 
}, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-117", "text": "This shows that particular gold-standard dependencies are hard for all parsers to build, but ZPar is better in recovering hard gold dependencies probably due to its rich features." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-118", "text": "To take another perspective, we compare the performance of the three parsers at different levels of a dependency tree by measuring accuracies for arcs relative to their distance to the root." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-119", "text": "Here the distance of an arc to the root is defined as the number of arcs in the path from the root to the modifier in the arc." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-120", "text": "Figure 4 shows the precision and recall of each system for arcs of varying distances to the root." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-121", "text": "Here the precision of MaltParser and MSTParser is very different, with MaltParser being more precise for arcs nearer to the leaves, but less precise for those nearer to the root." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-122", "text": "One possible reason is that arcs near the bottom of the tree require comparatively fewer shift-reduce actions to build, and are therefore less prone to the propagation of search errors." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-123", "text": "Another important reason, as pointed out by McDonald and Nivre (2007) , is the default single-root mechanism by MaltParser: all words that have not been attached as a modifier when the shift-reduce process finishes are attached as modifiers to the pseudo-root." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-124", "text": "Although the vast majority of sentences have only one root-modifier, there is no global control for the number of root-modifiers in the greedy shift-reduce process, and each action is made locally and independently." 
}, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-125", "text": "As a result, MaltParser tends to over-predict root modifiers, leading to the comparatively low precision." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-126", "text": "Surprisingly, the precision curve of ZPar is much more similar to that of MSTParser than that of MaltParser, although ZPar is based on the same shift-reduce parsing process, and even has a similar default single-root mechanism as MaltParser." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-127", "text": "This result is perhaps the most powerful demonstration of the effect of global learning and beam-search compared to local learning and greedy search." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-128", "text": "The model which scores whole sequences of shift-reduce actions, plus the reduction of search error propagation, lead to significantly reduced over-prediction of rootmodifiers." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-129", "text": "In addition, rich features used by ZPar, such as the valency (number of modifiers for a head) and set of modifier labels for a head, can also be useful in reducing over-prediction of modifiers." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-130", "text": "Because of these, ZPar effectively pushes the predictions of difficult arcs down the tree, which is exactly the behavior of MSTParser." }, { "sent_id": "730738d63cabcd4e63ec4300a8091b-C001-131", "text": "Interestingly, the recall curve of ZPar is more similar to that of MaltParser than that of MSTParser, showing that arcs at particular levels are harder to recover using the shift-reduce process than a global tree search." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "730738d63cabcd4e63ec4300a8091b-C001-11" ] ], "cite_sentences": [ "730738d63cabcd4e63ec4300a8091b-C001-11" ] }, "@MOT@": { "gold_contexts": [ [ "730738d63cabcd4e63ec4300a8091b-C001-17" ] ], "cite_sentences": [ "730738d63cabcd4e63ec4300a8091b-C001-17" ] }, "@USE@": { "gold_contexts": [ [ "730738d63cabcd4e63ec4300a8091b-C001-21" ], [ "730738d63cabcd4e63ec4300a8091b-C001-29" ], [ "730738d63cabcd4e63ec4300a8091b-C001-47" ], [ "730738d63cabcd4e63ec4300a8091b-C001-52" ], [ "730738d63cabcd4e63ec4300a8091b-C001-85" ] ], "cite_sentences": [ "730738d63cabcd4e63ec4300a8091b-C001-21", "730738d63cabcd4e63ec4300a8091b-C001-29", "730738d63cabcd4e63ec4300a8091b-C001-47", "730738d63cabcd4e63ec4300a8091b-C001-52", "730738d63cabcd4e63ec4300a8091b-C001-85" ] } } }, "ABC_f4792ef9808a1a3c415f6f57351335_17": { "x": [ { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-2", "text": "We investigate different feature sets for performing automatic sentence-level discourse segmentation within a general machine learning approach, including features derived from either finite-state or contextfree annotations." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-3", "text": "We achieve the best reported performance on this task, and demonstrate that our SPADE-inspired context-free features are critical to achieving this level of accuracy." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-4", "text": "This counters recent results suggesting that purely finite-state approaches can perform competitively." 
}, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-5", "text": "----------------------------------" }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-7", "text": "Discourse structure annotations have been demonstrated to be of high utility for a number of NLP applications, including automatic text summarization (Marcu, 1998; Marcu, 1999; Cristea et al., 2005) , sentence compression (Sporleder and Lapata, 2005) , natural language generation (Prasad et al., 2005) and question answering (Verberne et al., 2006) ." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-8", "text": "These annotations include sentence segmentation into discourse units along with the linking of discourse units, both within and across sentence boundaries, into a labeled hierarchical structure." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-9", "text": "For example, the tree in Figure 1 shows a sentence-level discourse tree for the string \"Prices have dropped but remain quite high, according to CEO Smith,\" which has three discourse segments, each labeled with either \"Nucleus\" or \"Satellite\" depending on how central the segment is to the coherence of the text." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-10", "text": "There are a number of corpora annotated with discourse structure, including the well-known RST Treebank (Carlson et al., 2002) ; the Discourse GraphBank (Wolf and Gibson, 2005) ; and the Penn Discourse Treebank (Miltsakaki et al., 2004) ." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-11", "text": "While the annotation approaches differ across these corpora, the requirement of sentence segmentation into sub-sentential discourse units is shared across all approaches." 
}, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-12", "text": "These resources have facilitated research into stochastic models and algorithms for automatic discourse structure annotation in recent years." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-13", "text": "Using the RST Treebank as training and evaluation data, Soricut and Marcu (2003) demonstrated that their automatic sentence-level discourse parsing system could achieve near-human levels of accuracy, if it was provided with manual segmentations and manual parse trees." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-14", "text": "Manual segmentation was primarily responsible for this performance boost over their fully automatic system, thus making the case that automatic discourse segmentation is the primary impediment to high accuracy automatic sentence-level discourse structure annotation." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-15", "text": "Their models and algorithm -subsequently packaged together into the publicly available SPADE discourse parser 1 -make use of the output of the Charniak (2000) parser to derive syntactic indicator features for segmentation and discourse parsing." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-16", "text": "Sporleder and Lapata (2005) also used the RST Treebank as training data for data-driven discourse parsing algorithms, though their focus, in contrast to Soricut and Marcu (2003) , was to avoid contextfree parsing and rely exclusively on features in their model that could be derived via finite-state chunkers and taggers." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-17", "text": "The annotations that they derive are dis-course \"chunks\", i.e., sentence-level segmentation and non-hierarchical nucleus/span labeling of segments." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-18", "text": "They demonstrate that their models achieve comparable results to SPADE without the use of any context-free features." 
}, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-19", "text": "Once again, segmentation is the part of the process where the automatic algorithms most seriously underperform." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-20", "text": "In this paper we take up the question posed by the results of Sporleder and Lapata (2005) : how much, if any, accuracy reduction should we expect if we choose to use only finite-state derived features, rather than those derived from full contextfree parses? If little accuracy is lost, as their results suggest, then it would make sense to avoid relatively expensive context-free parsing, particularly if the amount of text to be processed is large or if there are real-time processing constraints on the system." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-21", "text": "If, however, the accuracy loss is substantial, one might choose to avoid context-free parsing only in the most time-constrained scenarios." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-22", "text": "While Sporleder and Lapata (2005) demonstrated that their finite-state system could perform as well as the SPADE system, which uses context-free parse trees, this does not directly answer the question of the utility of context-free derived features for this task." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-23", "text": "SPADE makes use of a particular kind of feature from the parse trees, and does not train a general classifier making use of other features beyond the parse-derived indicator features." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-24", "text": "As we shall show, its performance is not the highest that can be achieved via context-free parser derived features." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-25", "text": "In this paper, we train a classifier using a general machine learning approach and a range of finitestate and context-free derived features." 
}, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-26", "text": "We investigate the impact on discourse segmentation performance when one feature set is used versus another, in such a way establishing the utility of features derived from context-free parses." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-27", "text": "In the course of so doing, we achieve the best reported performance on this task, an absolute F-score improvement of 5.0% over SPADE, which represents a more than 34% relative error rate reduction." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-28", "text": "By focusing on segmentation, we provide an approach that is generally applicable to all of the various annotation approaches, given the similarities between the various sentence-level segmentation guidelines." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-29", "text": "Given that segmentation has been shown to be a primary impediment to high accuracy sentence-level discourse structure annotation, this represents a large step forward in our ability to automatically parse the discourse structure of text, whatever annotation approach we choose." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-30", "text": "----------------------------------" }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-31", "text": "**METHODS**" }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-32", "text": "----------------------------------" }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-33", "text": "**DATA**" }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-34", "text": "For our experiments we use the Rhetorical Structure Theory Discourse Treebank (Carlson et al., 2002) , which we will denote RST-DT, a corpus annotated with discourse segmentation and relations according to Rhetorical Structure Theory (Mann and Thompson, 1988) ." 
}, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-35", "text": "The RST-DT consists of 385 documents from the Wall Street Journal, about 176,000 words, which overlaps with the Penn Wall St. Journal (WSJ) Treebank (Marcus et al., 1993) ." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-36", "text": "The segmentation of sentences in the RST-DT is into clause-like units, known as elementary discourse units, or edus." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-37", "text": "We will use the two terms 'edu' and 'segment' interchangeably throughout the rest of the paper." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-38", "text": "Human agreement for this segmentation task is quite high, with agreement between two annotators at an F-score of 98.3 for unlabeled segmentation (Soricut and Marcu, 2003) ." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-39", "text": "The RST-DT corpus annotates edu breaks, which typically include sentence boundaries, but sentence boundaries are not explicitly annotated in the corpus." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-40", "text": "To perform sentence-level processing and evaluation, we aligned the RST-DT documents to the same documents in the Penn WSJ Treebank, and used the sentence boundaries from that corpus." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-41", "text": "2 An additional benefit of this alignment is that the Penn WSJ Treebank tokenization is then available for parsing purposes." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-42", "text": "Simple minimum edit distance alignment effectively allowed for differences in punctuation representation (e.g., double quotes) and tokenization when deriving the optimal alignment." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-43", "text": "The RST-DT corpus is partitioned into a training set of 347 documents and a test set of 38 documents." 
}, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-44", "text": "This test set consists of 991 sentences with 2,346 segments." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-45", "text": "For training purposes, we created a held-out development set by selecting every tenth sentence of the training set." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-46", "text": "This development set was used for feature development and for selecting the number of iterations used when training models." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-47", "text": "----------------------------------" }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-48", "text": "**EVALUATION**" }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-49", "text": "Previous research into RST-DT segmentation and parsing has focused on subsets of the 991 sentence test set during evaluation." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-50", "text": "Soricut and Marcu (2003) omitted sentences that were not exactly spanned by a subtree of the treebank, so that they could focus on sentence-level discourse parsing." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-51", "text": "By our count, this eliminates 40 of the 991 sentences in the test set from consideration." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-52", "text": "Sporleder and Lapata (2005) went further and established a smaller subset of 608 sentences, which omitted sentences with only one segment, i.e., sentences which themselves are atomic edus." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-53", "text": "Since the primary focus of this paper is on segmentation, there is no strong reason to omit any sentences from the test set, hence our results will evaluate on all 991 test sentences, with two exceptions." 
}, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-54", "text": "First, in Section 2.3, we compare SPADE results under our configuration with results from Sporleder and Lapata (2005) in order to establish comparability, and this is done on their 608 sentence subset." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-55", "text": "Second, in Section 3.2, we investigate feeding our segmentation into the SPADE system, in order to evaluate the impact of segmentation improvements on their sentence-level discourse parsing performance." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-56", "text": "For those trials, the 951 sentence subset from Soricut and Marcu (2003) is used." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-57", "text": "All other trials use the full 991 sentence test set." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-58", "text": "Segmentation evaluation is done with precision, recall and F1-score of segmentation boundaries." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-59", "text": "Given a word string w 1 . . ." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-60", "text": "w k , we can index word boundaries from 0 to k, so that each word w i falls between boundaries i\u22121 and i. For sentence-based segmentation, indices 0 and k, representing the beginning and end of the string, are known to be segment boundaries." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-61", "text": "Hence Soricut and Marcu (2003) evaluate with respect to sentence internal segmentation boundaries, i.e., with indices j such that 0Sporleder and Lapata (2005) , they were primarily interested in labeled segmentation, where the segment initial boundary was labeled with the segment type." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-64", "text": "In such a scenario, the boundary at index 0 is no longer known, hence their evaluation included those boundaries, even when reporting unlabeled results." 
}, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-65", "text": "Thus, in section 2.3, for comparison with reported results in Sporleder and Lapata (2005) , our F1-score is defined accordingly, i.e., seg- mentation boundaries j such that 0 \u2264 j < k. In addition, we will report unlabeled bracketing precision, recall and F1-score, as defined in the PARSEVAL metrics (Black et al., 1991) and evaluated via the widely used evalb package." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-66", "text": "We also use evalb when reporting labeled and unlabeled discourse parsing results in Section 3.2." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-67", "text": "----------------------------------" }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-68", "text": "**BASELINE SPADE SETUP**" }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-69", "text": "The publicly available SPADE package, which encodes the approach in Soricut and Marcu (2003) , is taken as the baseline for this paper." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-70", "text": "We made several modifications to the script from the default, which account for better baseline performance than is achieved with the default configuration." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-71", "text": "First, we modified the script to take given parse trees as input, rather than running the Charniak parser itself." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-72", "text": "This allowed us to make two modifications that improved performance: turning off tokenization in the Charniak parser, and reranking." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-73", "text": "The default script that comes with SPADE does not turn off tokenization inside of the parser, which leads to degraded performance when the input has already been tokenized in the Penn Treebank style." 
}, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-74", "text": "Secondly, Charniak and Johnson (2005) showed how reranking of the 50-best output of the Charniak (2000) parser gives substantial improvements in parsing accuracy." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-75", "text": "These two modifications to the Charniak parsing output used by the SPADE system lead to improvements in its performance compared to previously reported results." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-76", "text": "Table 1 compares segmentation results of three systems on the Sporleder and Lapata (2005) 608 sentence subset of the evaluation data: (1) their best reported system; (2) the SPADE system results reported in that paper; and (3) the SPADE system results with our current configuration." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-77", "text": "The evaluation uses the unlabeled F1 measure as defined in that paper, which counts sentence initial boundaries in the scoring, as discussed in the previous section." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-78", "text": "As can be seen from these results, our improved configuration of SPADE gives us large improvements over the previously reported SPADE performance on this subset." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-79", "text": "As a result, we feel that we can use SPADE 490 as a very strong baseline for evaluation on the entire test set." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-80", "text": "Additionally, we modified the SPADE script to allow us to provide our segmentations to the full discourse parsing that it performs, in order to evaluate the improvements to discourse parsing yielded by any improvements to segmentation." 
}, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-81", "text": "----------------------------------" }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-82", "text": "**SEGMENTATION CLASSIFIER**" }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-83", "text": "For this paper, we trained a binary classifier, which was applied independently at each word w i in the string w 1 . . ." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-84", "text": "w k , to decide whether that word is the last in a segment." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-85", "text": "Note that w k is the last word in the string, and is hence ignored." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-86", "text": "We used a loglinear model with no Markov dependency between adjacent tags, 3 and trained the parameters of the model with the perceptron algorithm, with averaging to control for over-training (Collins, 2002) ." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-87", "text": "Let C={E, I} be the set of classes: segmentation boundary (E) or non-boundary (I)." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-88", "text": "Let f (c, i, w 1 . . ." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-89", "text": "w k ) be a function that takes as input a class value c, a word index i and the word string w 1 . . ." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-90", "text": "w k and returns a d-dimensional vector of feature values for that word index in that string with that class." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-91", "text": "For example, one feature might be (c = E, w i = the), which returns the value 1 when c = E and the current word is 'the', and returns 0 otherwise." 
}, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-92", "text": "Given a d-dimensional parameter vector \u03c6, the output of the classifier is that class which maximizes the dot product between the feature and parameter vectors:" }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-93", "text": "In training, the weights in \u03c6 are initialized to 0." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-94", "text": "For m epochs (passes over the training data), for each word in the training data (except sentence final words), the model is updated." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-95", "text": "Let i be the current word position in string w 1 . . ." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-96", "text": "w k and suppose that there have been j\u22121 previous updates to the model parameters." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-97", "text": "Letc i be the true class label, and let\u0109 i be shorthand for\u0109(i, w 1 . . ." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-98", "text": "w k ) in equation 1. Then the parameter vector \u03c6 j at step j is updated as follows:" }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-99", "text": "As stated in Section 2.1, we reserved every tenth sentence as held-out data." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-100", "text": "After each pass over the training data, we evaluated the system performance on this held-out data, and chose the model that optimized accuracy on that set." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-101", "text": "The averaged perceptron was used on held-out and evaluation sets." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-102", "text": "See Collins (2002) for more details on this approach." 
}, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-103", "text": "----------------------------------" }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-104", "text": "**FEATURES**" }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-105", "text": "To tease apart the utility of finite-state derived features and context-free derived features, we consider three feature sets: (1) basic finite-state features; (2) context-free features; and (3) finite-state approximation to context-free features." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-106", "text": "Note that every feature must include exactly one class label c in order to discriminate between classes in equation 1." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-107", "text": "Hence when presenting features, it can be assumed that the class label is part of the feature, even if it is not explicitly mentioned." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-108", "text": "The three feature sets are not completely disjoint." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-109", "text": "We include simple position-based features in every system, defined as follows." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-110", "text": "Because edus are typically multi-word strings, it is less likely for a word near the beginning or end of a sentence to be at an edu boundary." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-111", "text": "Thus it is reasonable to expect the position within a sentence of a token to be a helpful feature." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-112", "text": "We created 101 indicator features, representing percentages from 0 to 100." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-113", "text": "For a string of length k, at position i, we round i/k to two decimal places and provide a value of 1 for the corresponding quantized position feature and 0 for the other position features." 
}, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-114", "text": "2.5.1 Basic finite-state features Our baseline finite-state feature set includes simple tagger derived features, as well as features based on position in the string and n-grams 4 ." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-115", "text": "We annotate tag sequences onto the word sequence via a competitive discriminatively trained tagger (Hollingshead et al., 2005) , trained for each of two kinds of tag sequences: part-of-speech (POS) tags and shallow parse tags." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-116", "text": "The shallow parse tags define nonhierarchical base constituents (\"chunks\"), as defined for the CoNLL-2000 shared task (Tjong Kim Sang and Buchholz, 2000) ." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-117", "text": "These can either be used as tag or chunk sequences." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-118", "text": "For example, the tree in Figure 2 represents a shallow (non-hierarchical) parse tree, with four base constituents." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-119", "text": "Each base constituent X begins with a word labeled with B X , which signifies that this word begins the constituent." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-120", "text": "All other words within a constituent X are labeled and I(nside) tags I X , and words outside of any base constituent are labeled O. In such a way, each word is labeled with both a POS-tag and a B/I/O tag." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-121", "text": "For our three sequences (lexical, POS-tag and shallow tag), we define n-gram features surrounding the potential discourse boundary." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-122", "text": "If the current word is w i , the hypothesized boundary will occur between w i and w i+1 ." 
}, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-123", "text": "For this boundary position, the 6-gram including the three words before and the three words after the boundary is included as a feature; additionally, all n-grams for n < 6 such that either w i or w i+1 (or both) is in the n-gram are included as features." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-124", "text": "In other words, all n-grams in a six word window of boundary position i are included as features, except those that include neither w i nor w i+1 in the n-gram." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-125", "text": "The identical feature templates are used with POS-tag and shallow tag sequences as well, to define tag n-gram features." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-126", "text": "This feature set is very close to that used in Sporleder and Lapata (2005) , but not identical." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-127", "text": "Their n-gram feature definitions were different (though similar), and they made use of cue phrases from Knott (1996) ." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-128", "text": "In addition, they used a rulebased clauser that we did not." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-129", "text": "Despite such differences, this feature set is quite close to what is described in that paper." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-130", "text": "----------------------------------" }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-131", "text": "**CONTEXT-FREE FEATURES**" }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-132", "text": "To describe our context-free features, we first present how SPADE made use of context-free parse trees within their segmentation algorithm, since this forms the basis of our features." 
}, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-133", "text": "The SPADE features are based on productions extracted from full syntactic parses of the given sentence." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-134", "text": "The primary feature for a discourse boundary after word w i is based on the lowest constituent in the tree that spans words w m . . ." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-135", "text": "w n such that m \u2264 i < n. For example, in the parse tree schematic in Figure 3 , the constituent labeled with A is the lowest constituent in the tree whose span crosses the potential discourse boundary after w i ." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-136", "text": "The primary feature is the production that expands this constituent in the tree, with the proposed segmentation boundary marked, which in this case is: A \u2192 B 1 . . ." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-137", "text": "B j\u22121 ||B j . . ." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-138", "text": "B m , where || denotes the segmentation boundary." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-139", "text": "In SPADE, the production is lexicalized by the head words of each constituent, which are determined using standard head-percolation techniques." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-140", "text": "This feature is used to predict a boundary as follows: if the relative frequency estimate of a boundary given the production feature in the corpus is greater than 0.5, then a boundary is predicted; otherwise not." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-141", "text": "If the production has not been observed frequently enough, the lexicalization is removed and the relative frequency of a boundary given the unlexicalized production is used for prediction." 
}, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-142", "text": "If the observations of the unlexicalized production are also too sparse, then only the children adjacent to the boundary are maintained in the feature, e.g., A \u2192 * B j\u22121 ||B j * where * represents zero or more categories." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-143", "text": "Further smoothing is used when even this is unobserved." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-144", "text": "We use these features as the starting point for our context-free feature set: the lexicalized production A \u2192 B 1 . . ." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-145", "text": "B j\u22121 ||B j . . ." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-146", "text": "B m , as defined above for SPADE, is a feature in our model, as is the unlexicalized version of the production." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-147", "text": "As with the other features that we have described, this feature is used as an indicator feature in the classifier applied at the word w i preceding the hypothesized boundary." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-148", "text": "In addition to these full production features, we use the production with only children adjacent to the boundary, denoted by A \u2192 * B j\u22121 ||B j * ." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-149", "text": "This production is used in four ways: fully lexicalized; unlexicalized; only category B j\u22121 lexicalized; and only category B j lexicalized." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-150", "text": "We also use A \u2192 * B j\u22122 B j\u22121 || * and A \u2192 * ||B j B j+1 * features, both unlexicalized and with the boundary-adjacent category lexicalized." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-151", "text": "If there is no category B j\u22122 or B j+1 , they are replaced with \"N/A\"." 
}, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-152", "text": "In addition to these features, we fire the same features for all productions on the path from A down 492 Table 2 : Segmentation results on all 991 sentences in the RST-DT test set." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-153", "text": "Segment boundary accuracy is for sentence internal boundaries only, following Soricut and Marcu (2003) ." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-154", "text": "Bracketing accuracy is for unlabeled flat bracketing of the same segments." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-155", "text": "While boundary accuracy correctly depicts segmentation results, the harsher flat bracketing metric better predicts discourse parsing performance." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-156", "text": "to the word w i ." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-157", "text": "For these productions, the segmentation boundary || will occur after all children in the production, e.g., B j\u22121 \u2192 C 1 . . ." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-158", "text": "C n ||, which is then used in both lexicalized and unlexicalized forms." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-159", "text": "For the feature with only categories adjacent to the boundary, we again use \"N/A\" to denote the fact that no category occurs to the right of the boundary: B j\u22121 \u2192 * C n ||N/A. Once again, these are lexicalized as described above." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-160", "text": "----------------------------------" }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-161", "text": "**FINITE-STATE APPROXIMATION FEATURES**" }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-162", "text": "An approximation to our context-free features can be made by using the shallow parse tree, as shown in Figure 2 , in lieu of the full hierarchical parse tree." 
}, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-163", "text": "For example, if the current word was \"sell\" in the tree in Figure 2 , the primary feature would be ROOT \u2192 NP VP||NP NP, and it would have an unlexicalized version and three lexicalized versions: the category immediately prior to the boundary lexicalized; the category immediately after the boundary lexicalized; and both lexicalized." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-164", "text": "For lexicalization, we choose the final word in the constituent as the lexical head for the constituent." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-165", "text": "This is a reasonable first approximation, because such typically left-headed categories as PP and VP lose their arguments in the shallow parse." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-166", "text": "As a practical matter, we limit the number of categories in the flat production to 8 to the left and 8 to the right of the boundary." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-167", "text": "In a manner similar to the n-gram features that we defined in Section 2.5.1, we allow all combinations with less than 8 contiguous categories on each side, provided that at least one of the adjacent categories is included in the feature." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-168", "text": "Each feature has an unlexicalized and three lexicalized versions, as described above." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-169", "text": "----------------------------------" }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-170", "text": "**EXPERIMENTS**" }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-171", "text": "We performed a number of experiments to determine the relative utility of features derived from full context-free syntactic parses and those derived solely from shallow finite-state tagging." 
}, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-172", "text": "Our primary concern is with intra-sentential discourse segmentation, but we are also interested in how much the improved segmentation helps discourse parsing." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-173", "text": "The syntactic parser we use for all context-free syntactic parses used in either SPADE or our classifier is the Charniak parser with reranking, as described in Charniak and Johnson (2005) ." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-174", "text": "The Charniak parser and reranker were trained on the sections of the Penn Treebank not included in the RST-DT test set." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-175", "text": "All statistical significance testing is done via the stratified shuffling test (Yeh, 2000) ." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-176", "text": "Table 2 presents segmentation results for SPADE and four versions of our classifier." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-177", "text": "The \"Basic finitestate\" system uses only finite-state sequence features as defined in Section 2.5.1, while the \"Full finite-state\" also includes the finite-state approximation features from Section 2.5.3." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-178", "text": "The \"Context-free\" system uses the SPADE-inspired features detailed in Section 2.5.2, but none of the features from Sections 2.5.1 or 2.5.3." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-179", "text": "Finally, the \"All features\" section includes features from all three sections." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-180", "text": "5 Note that the full finite-state system is considerably better than the basic finite-state system, demonstrating the utility of these approximations of the SPADE-like context-free features." 
}, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-181", "text": "The performance of the resulting \"Full\" finite-state system is not statistically significantly different from that of SPADE (p=0.193) , despite no reliance on features derived from context-free parses." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-182", "text": "----------------------------------" }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-183", "text": "**SEGMENTATION**" }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-184", "text": "The context-free features, however, even without any of the finite-state sequence features (even lexical n-grams) outperforms the best finite-state system by almost two percent absolute, and the system with all features improves on the best finite-state system by over four percent absolute." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-185", "text": "The system with all features is statistically significantly better than both SPADE and the \"Full finite-state\" classifier system, at p < 0.001." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-186", "text": "This large improvement demonstrates that the context-free features can provide a very large system improvement." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-187", "text": "----------------------------------" }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-188", "text": "**DISCOURSE PARSING**" }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-189", "text": "It has been shown that accurate discourse segmentation within a sentence greatly improves the overall parsing accuracy to near human levels (Soricut and Marcu, 2003) ." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-190", "text": "Given our improved segmentation results presented in the previous section, improvements would be expected in full sentencelevel discourse parsing." 
}, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-191", "text": "To achieve this, we modified the SPADE script to accept our segmentations when building the fully hierarchical discourse tree." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-192", "text": "The results for three systems are presented in Table 3 : SPADE, our \"Full finite-state\" system, and our system with all features." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-193", "text": "Results for unlabeled bracketing are presented, along with results for labeled bracketing, where the label is either Nucleus or Satellite, depending upon whether or not the node is more central (Nucleus) to the coherence of the text than its sibling(s) (Satellite)." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-194", "text": "This label set has been shown to be of particular utility for indicating which segments are more important to include in an automatically created summary or compressed sentence (Sporleder and Lapata, 2005; Marcu, 1998; Marcu, 1999; Cristea et al., 2005) ." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-195", "text": "Once again, the finite-state system does not perform statistically significantly different from SPADE on either labeled or unlabeled discourse parsing." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-196", "text": "Using all features, however, results in greater than 5% absolute accuracy improvement over both of these systems, which is significant, in all cases, at p < 0.001." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-197", "text": "----------------------------------" }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-198", "text": "**DISCUSSION AND FUTURE DIRECTIONS**" }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-199", "text": "Our results show that context-free parse derived features are critical for achieving the highest level of accuracy in sentence-level discourse segmentation." 
}, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-200", "text": "Given that edus are by definition clause-like units, it is not surprising that accurate full syntactic parse trees provide highly relevant information unavailable from finite-state approaches." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-201", "text": "Adding contextfree features to our best finite-state feature model reduces error in segmentation by 32.1%, an increase in absolute F-score of 4.5%." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-202", "text": "These increases are against a finite-state segmentation model that is powerful enough to be statistically indistinguishable from SPADE." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-203", "text": "Our experiments also confirm that increased segmentation accuracy yields significantly better discourse parsing accuracy, as previously shown to be the case when providing reference segmentations to a parser (Soricut and Marcu, 2003) ." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-204", "text": "The segmentation reduction in error of 34.5% propagates to a 28.6% reduction in error for unlabeled discourse parse trees, and a 19.8% reduction in error for trees labeled with Nucleus and Satellite." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-205", "text": "We have several key directions in which to continue this work." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-206", "text": "First, given that a general machine learning approach allowed us to improve upon SPADE's segmentation performance, we also believe that it will prove useful for improving full discourse parsing, both at the sentence level and beyond." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-207", "text": "For efficient inter-sentential discourse parsing, we see the need for an additional level of segmentation at the paragraph level." 
}, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-208", "text": "Whereas most sentences correspond to a well-formed subtree, Sporleder and Lascarides (2004) report that over 20% of the paragraph boundaries in the RST-DT do not correspond to a well-formed subtree in the human annotated discourse parse for that document." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-209", "text": "Therefore, to perform accurate and efficient parsing of the RST-DT at the paragraph level, the text should be segmented into paragraph-like segments that conform to the human-annotated subtree boundaries, just as sentences are segmented into edus." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-210", "text": "We also intend to begin work on the other discourse annotated corpora." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-211", "text": "While most work on textual discourse parsing has made use of the RST-DT corpus, the Discourse GraphBank corpus provides a competing annotation that is not constrained to tree structures (Wolf and Gibson, 2005) ." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-212", "text": "Once accurate levels of segmentation and parsing for both corpora are attained, it will be possible to perform extrinsic evaluations to determine their relative utility for different NLP tasks." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-213", "text": "Recent work has shown promising preliminary results for recognizing and labeling relations of GraphBank structures (Wellner et al., 2006) , in the case that the algorithm is provided with 494 manually segmented sentences." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-214", "text": "Sentence-level segmentation in the GraphBank is very similar to that in the RST-DT, so our segmentation approach should work well for Discourse GraphBank style parsing." 
}, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-215", "text": "The Penn Discourse Treebank (Miltsakaki et al., 2004) , or PDTB, uses a relatively flat annotation of discourse structure, in contrast to the hierarchical structures found in the other two corpora." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-216", "text": "It contains annotations for discourse connectives and their arguments, where an argument can be as small as a nominalization or as large as several sentences." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-217", "text": "This approach obviates the need to create a set of discourse relations, but sentence internal segmentation is still a necessary step." }, { "sent_id": "f4792ef9808a1a3c415f6f57351335-C001-218", "text": "Though segmentation in the PDTB tends to larger units than edus, our approach to segmentation should be straightforwardly applicable to their segmentation style." } ], "y": { "@BACK@": { "gold_contexts": [ [ "f4792ef9808a1a3c415f6f57351335-C001-7" ], [ "f4792ef9808a1a3c415f6f57351335-C001-16" ], [ "f4792ef9808a1a3c415f6f57351335-C001-22" ], [ "f4792ef9808a1a3c415f6f57351335-C001-52" ], [ "f4792ef9808a1a3c415f6f57351335-C001-63" ], [ "f4792ef9808a1a3c415f6f57351335-C001-194" ] ], "cite_sentences": [ "f4792ef9808a1a3c415f6f57351335-C001-7", "f4792ef9808a1a3c415f6f57351335-C001-16", "f4792ef9808a1a3c415f6f57351335-C001-22", "f4792ef9808a1a3c415f6f57351335-C001-52", "f4792ef9808a1a3c415f6f57351335-C001-63", "f4792ef9808a1a3c415f6f57351335-C001-194" ] }, "@MOT@": { "gold_contexts": [ [ "f4792ef9808a1a3c415f6f57351335-C001-20" ] ], "cite_sentences": [ "f4792ef9808a1a3c415f6f57351335-C001-20" ] }, "@DIF@": { "gold_contexts": [ [ "f4792ef9808a1a3c415f6f57351335-C001-52", "f4792ef9808a1a3c415f6f57351335-C001-53" ] ], "cite_sentences": [ "f4792ef9808a1a3c415f6f57351335-C001-52" ] }, "@USE@": { "gold_contexts": [ [ "f4792ef9808a1a3c415f6f57351335-C001-54" ], [ "f4792ef9808a1a3c415f6f57351335-C001-65" ], [ 
"f4792ef9808a1a3c415f6f57351335-C001-76" ] ], "cite_sentences": [ "f4792ef9808a1a3c415f6f57351335-C001-54", "f4792ef9808a1a3c415f6f57351335-C001-65", "f4792ef9808a1a3c415f6f57351335-C001-76" ] }, "@SIM@": { "gold_contexts": [ [ "f4792ef9808a1a3c415f6f57351335-C001-126" ] ], "cite_sentences": [ "f4792ef9808a1a3c415f6f57351335-C001-126" ] } } }, "ABC_e3c735811b2ea08d92659272ddcbdd_17": { "x": [ { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-62", "text": "The Skip-gram model differs by predicting context words given a target word and by capturing the ordering of word occurrences." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-86", "text": "**TWITTER PROFILE AND COVER IMAGES:**" }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-135", "text": "Non-gang Members Total (count) ) ." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-39", "text": "However, their approach required the gang members' Twitter profile names to be known beforehand, and data collection was localized to a single city in the country." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-136", "text": "This is the mean of the word embedding vectors weighted by term frequency:" }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-137", "text": "----------------------------------" }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-138", "text": "**EVALUATION**" }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-139", "text": "We evaluate the performance of using word embeddings to discover gang member profiles on Twitter." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-164", "text": "We conducted 10-fold cross validation experiments to evaluate the performance of our models." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-40", "text": "These studies investigated a small set of manually curated gang member profiles, often from a small geographic area that may bias their findings." 
}, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-2", "text": "Gang affiliates have joined the masses who use social media to share thoughts and actions publicly." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-3", "text": "Interestingly, they use this public medium to express recent illegal actions, to intimidate others, and to share outrageous images and statements." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-4", "text": "Agencies able to unearth these profiles may thus be able to anticipate, stop, or hasten the investigation of gang-related crimes." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-5", "text": "This paper investigates the use of word embeddings to help identify gang members on Twitter." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-6", "text": "Building on our previous work, we generate word embeddings that translate what Twitter users post in their profile descriptions, tweets, profile images, and linked YouTube content to a real vector format amenable for machine learning classification." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-7", "text": "Our experimental results show that pre-trained word embeddings can boost the accuracy of supervised learning algorithms trained over gang members' social media posts." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-8", "text": "----------------------------------" }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-10", "text": "Street gangs are defined as \"a coalition of peers, united by mutual interests, with identifiable leadership and internal organization, who act collectively to conduct illegal activity and to control a territory, facility, or enterprise\" [Mil92] ." 
}, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-11", "text": "They promote criminal activities such as drug trafficking, assault, robbery, and threatening or intimidating a neighborhood [20113] ." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-12", "text": "Today, over 1.4 million people, belonging to more than 33,000 gangs, are active in the United States [20111] , of which 88% identify themselves as being members of a street gang 1 ." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-13", "text": "They are also active users of social media [20111] ; according to 2007 National Assessment Center's survey of gang members, 25% of individuals in gangs use the Internet for at least 4 hours a week [20007] ." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-14", "text": "More recent studies report approximately 45% of gang members participate in online offending activities such as threatening, harassing individuals, posting violent videos or attacking someone on the street for something they said online [DP11, PDJ15] ." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-37", "text": "Their framework could only extract social media posts from self identified gang members by searching for pre-identified gang names in a user's Twitter profile description." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-38", "text": "Patton et al. developed a method to collect tweets from a group of gang members operating in Detroit, MI [Pat15] ." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-15", "text": "They confirm that gang members use social media to express themselves in ways similar to their offline behavior on the streets [PEB13] ." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-16", "text": "Despite its public nature, gang members post on social media without fear of consequences because there are only few tools law enforcement can presently use to surveil social media [WDSD15] ." 
}, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-17", "text": "For example, the New York City police department employs over 300 detectives to combat teen violence triggered by insults, dares, and threats exchanged on social media, and the Toronto police department teaches officers about the use of social media in investigations [pol13] ." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-18", "text": "From offline clues, the officers monitor just a selected set of social media accounts which are manually discovered and related to a specific investigation." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-19", "text": "Thus, developing tools to identify gang member profiles on social media is an important step in the direction of using machine intelligence to fight crime." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-20", "text": "To help agencies monitor gang activity on social media, our past work investigated how features from Twitter profiles, including profile text, profile images, tweet text, emjoi use, and their links to YouTube, may be used to reliably find gang member profiles [BWDS16] ." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-21", "text": "The diverse set of features, chosen to combat the fact that gang members often use local terms and hashtags in their posts, offered encouraging results." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-22", "text": "In this paper, we report our experience in integrating deep learning into our gang member profile classifier." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-23", "text": "Specifically, we investigate the effect of translating the features into a vector space using word embeddings [MSC + 13] ." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-24", "text": "This idea is motivated by the recent success of word embeddings-based methods to learn syntactic and semantic structures automatically when provided with large datasets." 
}, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-25", "text": "A dataset of over 3,000 gang and non-gang member profiles that we previously curated is used to train the word embeddings." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-26", "text": "We show that pre-trained word embeddings improve the machine learning models and help us obtain an F 1-score of 0.7835 on gang member profiles (a 6.39% improvement in F 1-score compared to the baseline models which were not trained using word embeddings)." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-27", "text": "This paper is organized as follows." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-28", "text": "Section 2 discusses the related literature and frames how this work differs from other related works." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-29", "text": "Section 3 discusses our approach based on word embeddings to identify gang member profiles." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-30", "text": "Section 4 reports on the evaluation of the proposed approach and the evaluation results in detail." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-31", "text": "Section 5 concludes the work reported while discussing the future work planned." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-32", "text": "----------------------------------" }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-33", "text": "**RELATED WORK**" }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-34", "text": "Researchers have begun investigating the gang members' use of social media and have noticed the importance of identifying gang members' Twitter profiles a priori [PEB13, WDSD15] ." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-35", "text": "Before analyzing any textual context retrieved from their social media posts, knowing that a post has originated from a gang member could help systems to better understand the message conveyed by that post." 
}, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-36", "text": "Wijeratne et al. developed a framework to analyze what gang members post on social media [WDSD15] ." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-41", "text": "In our previous work [BWDS16] , we curated what may be the largest set of gang member profiles to study how gang member Twitter profiles can be automatically identified based on the content they share online." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-42", "text": "A data collection process involving location neutral keywords used by gang members, with an expanded search of their retweet, friends and follower networks, led to identifying 400 authentic gang member profiles on Twitter." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-43", "text": "Our study discovered that the text in their tweets and profile descriptions, their emoji use, their profile images, and music interests embodied by links to YouTube music videos, can help a classifier distinguish between gang and non-gang member profiles." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-44", "text": "While a very promising F 1 measure with low false positive rate was achieved, we hypothesize that the diverse kinds and the multitude of features employed (e.g. unigrams of tweet text) could be amenable to an improved representation for classification." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-45", "text": "We thus explore the possibility of mapping these features into a considerably smaller feature space through the use of word embeddings." 
}, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-46", "text": "Previous" }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-47", "text": "----------------------------------" }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-48", "text": "**WORD EMBEDDINGS**" }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-49", "text": "A word embedding model is a neural network that learns rich representations of words in a text corpus." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-50", "text": "It takes data from a large, n-dimensional 'word space' (where n is the number of unique words in a corpus) and learns a transformation of the data into a lower k-dimensional space of real numbers." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-51", "text": "This transformation is developed in a way that similarities between the k-dimensional vector representation of two words reflects semantic relationships among the words themselves." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-52", "text": "These semantics are not captured by typical bag-of-words or n-gram models for classification tasks on text data [MYZ13, MSC" }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-53", "text": "+ 13]." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-54", "text": "Word embeddings have led to the state-of-the-art results in many sequential learning tasks [LBH15] ." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-55", "text": "In fact, word embedding learning is an important step for many statistical language modeling tasks in text processing systems." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-56", "text": "Bengio et al. were the first ones to introduce the idea of learning a distributed representation for words over a text corpus [BDVJ03] ." 
}, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-57", "text": "They learned representations for each word in the text corpus using a neural network model that modeled the joint probability function of word sequences in terms of the feature vectors of the words in the sequence." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-58", "text": "Mikolov et al. showed that simple algebraic operations can be performed on word embeddings learned over a text corpus, which leads to findings such as the word embedding vector of the word \"King\" \u2212 the word embedding vectors of \"Man\" + \"Woman\" would results in a word embedding vector that is closest to the word embedding vector of the word \"Queen\" [MYZ13] ." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-59", "text": "Recent successes in using word embeddings to improve text classification for short text [WXX + 16, LZZ15], encouraged us to explore how they can be used to improve gang and non-gang member Twitter profile classification." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-60", "text": "Word embeddings can be performed under different neural network architectures; two popular ones are the Continuous Bag-of-Words (CBOW) and Continuous Skip-gram (Skip-gram) models [MCCD13]." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-61", "text": "The CBOW model learns a neural network such that given a set of context words surrounding a target word, it predict a target word." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-63", "text": "Recent improvements to Skip-gram model make it better able to handle less frequent words, especially when negative sampling is used [MSC + 13]." 
}, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-64", "text": "----------------------------------" }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-65", "text": "**FEATURES CONSIDERED**" }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-66", "text": "Gang member tweets and profile descriptions tend to have few textual indicators that demonstrate their gang affiliations or their tweets/profile text may carry acronyms which can only be deciphered by others involved in gang culture [BWDS16] ." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-67", "text": "These gang-related terms are often local to gangs operating in neighborhoods and change rapidly when they form new gangs." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-68", "text": "Consequently, building a database of keywords, phrases, and other identifiers to find gang members nationally is not feasible." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-69", "text": "Instead, we use heterogeneous sets of features derived not only from profile and tweet text but also from the emoji usage, profile images, and links to YouTube videos reflecting their music preferences and affinity." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-70", "text": "In this section, we briefly discuss the feature types and their broad differences in gang and non-gang member profiles." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-71", "text": "An in-depth explanation of these feature selection can be found in [BWDS16] ." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-72", "text": "----------------------------------" }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-73", "text": "**TWEET TEXT:**" }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-74", "text": "In our previous work, we observed that gang members use curse words nearly five times more than the average curse words use on Twitter [BWDS16] ." 
}, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-75", "text": "Further, we noticed that gang members mainly use Twitter to discuss drugs and money using terms such as smoke, high, hit, money, got, and need while non-gang members mainly discuss their feelings using terms such as new, like, love, know, want, and look." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-76", "text": "----------------------------------" }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-77", "text": "**TWITTER PROFILE DESCRIPTION:**" }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-78", "text": "We found gang member profile descriptions to be rife with curse words (nigga, fuck, and shit) while non-gang members use words related to their feelings or interests (love, life, music, and book)." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-79", "text": "We noticed that gang members use their profile descriptions as a space to grieve for their fallen or incarcerated gang members as about 12% of gang member Twitter profiles used terms such as rip and free." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-80", "text": "----------------------------------" }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-81", "text": "**EMOJI FEATURES:**" }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-82", "text": "We found that the fuel pump emoji was the most frequently used emoji by gang members, which is often used in the context of selling or consuming marijuana." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-83", "text": "The pistol emoji was the second most frequently used emoji, which is often used with the police cop emoji in an 'emoji chain' to express their hatred towards law enforcement officers." 
}, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-84", "text": "The money bag emoji, money with wings emoji, unlock emoji, and a variety of the angry face emoji such as the devil face emoji and imp emoji were also common in gang members' but not in non-gang members' tweets." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-85", "text": "----------------------------------" }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-87", "text": "We noticed that gang members often pose holding or pointing weapons, seen in a group fashion which displays a gangster culture, show off graffiti, hand signs, tattoos, and bulk cash in their profile and cover images." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-88", "text": "We used Clarifai web service 2 to tag the profile and cover images of the Twitter users in our dataset and used the image tags returned by Clarifai API to train word embeddings." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-89", "text": "Tags such as trigger, bullet, and worship were unique for gang member profiles while non-gang member images had unique tags such as beach, seashore, dawn, wildlife, sand, and pet." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-90", "text": "----------------------------------" }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-91", "text": "**YOUTUBE VIDEOS:**" }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-92", "text": "We found that 51.25% of the gang members in our dataset have a tweet that links to a YouTube video." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-93", "text": "Further, we found that 76.58% of the shared links are related to hip-hop music, gangster rap, and the culture that surrounds this music genre [BWDS16] ." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-94", "text": "Moreover, we found that eight YouTube links are shared on average by a gang member." 
}, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-95", "text": "The top 5 terms used in YouTube videos shared by gang members were shit, like, nigga, fuck, and lil while like, love, peopl, song, and get were the top 5 terms in nongang members' video data." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-96", "text": "Figure 1 gives an overview of the steps to learn word embeddings and to integrate them into a classifier." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-97", "text": "We first convert any non-textual features such as emoji and profile images into textual features." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-98", "text": "We use Emoji for Python 3 and Clarifai services, respectively, to convert emoji and images into text." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-99", "text": "Prior to training the word embeddings, we remove all the seed words used to find gang member profiles and stopwords, and perform stemming across all tweets and profile descriptions." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-100", "text": "We then feed all the training data (word w t in Figure 1) we collected from our Twitter dataset to Word2Vec tool and train it using a Skip-gram model with negative sampling." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-101", "text": "When training the Skip-gram model, we set the negative sampling rate to 10 sample words, which seems to work well with medium-sized datasets [MSC + 13]." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-102", "text": "We set the context word window to be 5, so that it will consider 5 words to left and right of the target word (words w t\u22125 to w t+5 in Figure 1 )." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-103", "text": "This setting is suitable for sentences where average sentence length is less than 11 words, as is the case in tweets [HTK13] ." 
}, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-104", "text": "We ignore the words that occur less than 5 times in our training corpus." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-105", "text": "----------------------------------" }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-106", "text": "**CLASSIFICATION APPROACH**" }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-107", "text": "We investigated how well the local language has been captured by the word embedding models we trained." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-108", "text": "We used the 'most similar' functionality offered by Word2Vec tool to understand what the model has learned about few gang-related slang terms which are specific to Chicago area." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-109", "text": "For example, we analyzed the ten most similar words learned by the word embedding model for the term BDK (Black Desciples Killers)." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-110", "text": "We noticed that out of the 10 most similar words, five were names of local Chicago gangs, which are rivals of the Black Disciples Gang, two were different syntactic variations of BDK (bdkk, bdkkk) and the other three were different syntactic variations of GDK (gdk, gdkk, gdkkk)." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-111", "text": "GDK is a local gang slang for 'Gangster Disciples Killer' which is used by rivals of Gangster Disciples gang to show their hatred towards them." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-112", "text": "We found similar results for the term GDK." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-113", "text": "Out of the ten most similar words, six were showing hatred towards six different Gangster Disciples gangs that operate in Chicago area." 
}, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-114", "text": "We believe that those who used the term GDK to show their hatred towards Gangster Disciples gangs might be also having rivalry with the six gangs we found." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-115", "text": "We obtain word vectors of size 300 from the learned word embeddings." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-116", "text": "To represent a Twitter profile, we retrieve word vectors for all the words that appear in a particular profile including the words appear in tweets, profile description, words extracted from emoji, cover and profile images converted to textual formats, and words extracted from YouTube video comments and descriptions for all YouTube videos shared in the user's timeline." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-117", "text": "Those word vectors are combined to compute the final feature vector for the Twitter profile." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-118", "text": "To combine the word vectors, we consider five different methods." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-119", "text": "Letting the size of a word vector be k = 300, for a Twitter profile p with n unique words and the vector of the i th word in p denoted by w ip , we compute the feature vector for the Twitter profile V p by:" }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-120", "text": "1. Sum of word embeddings V psum ." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-121", "text": "This is the sum the word embedding vectors obtained for all words in a Twitter profile:" }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-122", "text": "w ip 2." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-123", "text": "Mean of word embeddings V pavg ." 
}, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-124", "text": "This is the mean of the word embedding vectors of all words found in a Twitter profile:" }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-125", "text": "3. Sum of word embeddings weighted by term frequency V p sum(count) ." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-126", "text": "This is each word embedding vector multiplied by the word's frequency for the Twitter profile:" }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-127", "text": "where c ip is the term frequency for the i th word in profile p." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-128", "text": "----------------------------------" }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-129", "text": "**SUM OF WORD EMBEDDINGS WEIGHTED BY TF -IDF**" }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-130", "text": "V p sum(tf \u2212idf ) ." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-131", "text": "This is each word vector multiplied by the word's tf -idf for the Twitter profile:" }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-132", "text": "where t ip is the tf -idf value for the i th word in profile p." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-133", "text": "----------------------------------" }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-134", "text": "**# OF WORDS IN GANG MEMBERS**" }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-140", "text": "We first discuss the dataset, learning algorithms and baseline comparison models used in the experiments." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-141", "text": "Then we discuss the 10-fold cross validation experiments and the evaluation matrices used." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-142", "text": "Finally we present the results of the experiments." 
}, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-143", "text": "----------------------------------" }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-144", "text": "**EVALUATION SETUP**" }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-145", "text": "We consider a dataset of curated gang and non-gang members' Twitter profiles collected from our previous work [BWDS16] ." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-146", "text": "It was developed by querying the Followerwonk Web service API 4 with location-neutral seed words known to be used by gang members across the U.S. in their Twitter profiles." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-147", "text": "The dataset was further expanded by examining the friends, follower, and retweet networks of the gang member profiles found by searching for seed words." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-148", "text": "Specific details about our data curation procedure are discussed in [BWDS16] ." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-149", "text": "Ultimately, this dataset consists of 400 gang member profiles and 2,865 non-gang member profiles." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-150", "text": "For each user profile, we collected up to most recent 3,200 tweets from their Twitter timelines, profile description text, profile and cover images, and the comments and video descriptions for every YouTube video shared by them." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-151", "text": "Table 1 provides statistics about the number of words found in each type of feature in the dataset." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-152", "text": "It includes a total of 821,412 tweets from gang members and 7,238,758 tweets from non-gang members." 
}, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-153", "text": "To build the classifiers we used three different learning algorithms, namely Logistic Regression (LR), Random Forest (RF), and Support Vector Machines 4 https://moz.com/followerwonk/bio (SVM)" }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-154", "text": "fiers." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-155", "text": "An open source tool of Python, Gensim [\u0158S10] was used to generate the word embeddings." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-156", "text": "We compare our results with the two best performing systems reported in [BWDS16] which are the two state-of-theart models for identifying gang members in Twitter." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-157", "text": "Both baseline models are built from a random forest classifier trained over term frequencies for unigrams in tweet text, emoji, profile data, YouTube video data and image tags." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-158", "text": "Baseline Model(1) considers all 3,285 gang and non-gang member profiles in our dataset." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-159", "text": "Baseline Model(2) considers all Twitter profiles that contain every feature type discussed in Section 3.1." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-160", "text": "Because a Twitter profile may not have every feature type, baseline Model(1) represents a practical scenario where not every Twitter profile contains every type of feature." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-161", "text": "However, we compare our results to both baseline models and report the improvements." 
}, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-162", "text": "----------------------------------" }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-163", "text": "**10-FOLD CROSS VALIDATION**" }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-165", "text": "We used all Twitter profiles in the dataset to conduct experiments on the five methods we used to combine word embedding vectors." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-166", "text": "For each of the five vector combination methods (as mentioned in Section 3.2), we trained classifiers using each learning algorithm we considered." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-167", "text": "In each fold, the training set was used to generate the word vectors, which were then used to compute features for both the training set and test set." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-168", "text": "For each 10-fold cross validation experiment, we report three evaluation metrics for the 'gang' (positive) and 'non-gang' (negative) classes, namely, the Precision = tp/(tp + f p), Recall = tp/(tp + f n), and F 1-score = 2 * (P recision * Recall)/(P recision + Recall), where tp is the number of true positives, f p is the number of false positives, tn is the number of true negatives, and f n is the number of false negatives." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-169", "text": "We report these metrics for the 'gang' and 'non-gang' classes separately because of the class imbalance in the dataset." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-170", "text": "Table 2 presents 10-fold cross validation results for the baseline models (first and second rows) and our word embeddings-based models (from third row to seventh row)." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-171", "text": "As mentioned earlier both baseline models use a random forest classifier trained on term frequencies of unigram features extracted from all feature types." 
}, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-172", "text": "The two baseline models only differs on the training data filtering method used, which is based on the availability of features in the training dataset as described Table 2 : Classification Results Based on 10-Fold Cross Validation." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-173", "text": "in [BWDS16] ." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-174", "text": "The baseline Model(1) uses all profiles in the dataset and has a F 1-score of 0.7364 for 'gang' class and 0.9690 for 'non-gang' class." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-175", "text": "The baseline Model(2) which only uses profiles that contain each and every feature type has a F 1-score of 0.7755 for 'gang' class and F 1-score of 0.9720 for 'non-gang' class." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-176", "text": "----------------------------------" }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-177", "text": "**EXPERIMENTAL RESULTS**" }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-178", "text": "----------------------------------" }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-179", "text": "**MODEL**" }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-180", "text": "Vector sum is one of the basic operations we can perform on word embedding vectors." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-181", "text": "The random forest classifier performs the best among vector sumbased classifiers where logistic regression and SVM classifiers also perform comparatively well (V psum )." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-182", "text": "Using vector mean (V pavg ) improves all classifier results and SVM classifier trained on the mean of word embeddings achieves very close results to the baseline Model(2)." 
}, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-183", "text": "Multiplying vector sum with corresponding word counts for each word in word embeddings degrades the classifier accuracy for correctly identifying the positive class (V p sum(count) )." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-184", "text": "When we multiply words by their corresponding tf -idf values before taking the vector sum, we again observe an increase in the classifiers' accuracy (V p sum(tf \u2212idf ) )." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-185", "text": "We achieve the best performance by averaging the vector sum weighted by term frequency (V p avg(sum(count)) )." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-186", "text": "Here we multiply the mean of the word embeddings by count of each word, which beats all other word embeddingsbased models and the two baselines." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-187", "text": "In this setting, logistic regression classifier trained on word embeddings performs the best with a F 1-score of 0.7835." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-188", "text": "This is a 6.39% improvement in performance when compared to the baseline Model(1) and a 1.03% improvement in performance when compared to baseline Model(2)." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-189", "text": "Overall, out of the five vector operations that we used to train machine learning classifiers, four gave us classifier models that beat baseline Model(1) and two vector based operations gave us classifier models that either achieved very similar results to baseline Model(2) or beat it." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-190", "text": "This evaluation demonstrates the promise of using pre-trained word embeddings to boost the accuracy of supervised learning algorithms for Twitter gang member profile classification." 
}, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-191", "text": "----------------------------------" }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-192", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-193", "text": "This paper presented a word embeddings-based approach to address the problem of automatically identifying gang member profiles on Twitter." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-194", "text": "Using a Twitter user dataset that consist of 400 gang member and 2,865 non gang member profiles, we trained word embedding models based on users' tweets, profile descriptions, emoji, images, and videos shared on Twitter (textual features extracted from images, and videos)." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-195", "text": "We then use the pre-trained word embedding models to train supervised machine learning classifiers, which showed superior performance when compared to the state-of-the-art baseline models reported in the literature." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-196", "text": "We plan to further extend our work by building our own image classification system specifically designed to identify images commonly shared by gang members such as guns, gang hand signs, stacks of cash and drugs." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-197", "text": "We would also like to experiment with automatically building dictionaries that contain gang names and gang-related slang using crowd-sourced gang-related knowledge-bases such as HipWiki 6 ." }, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-198", "text": "We also want to experiment with using such knowledge-bases to train word embeddings to understand whether having access to gang-related knowledge could boost the performance of our models." 
}, { "sent_id": "e3c735811b2ea08d92659272ddcbdd-C001-199", "text": "Finally, we would like to study how we can further use social networks of known gang members to identify new gang member profiles on Twitter." } ], "y": { "@BACK@": { "gold_contexts": [ [ "e3c735811b2ea08d92659272ddcbdd-C001-20" ], [ "e3c735811b2ea08d92659272ddcbdd-C001-41" ], [ "e3c735811b2ea08d92659272ddcbdd-C001-66" ], [ "e3c735811b2ea08d92659272ddcbdd-C001-71" ], [ "e3c735811b2ea08d92659272ddcbdd-C001-74" ], [ "e3c735811b2ea08d92659272ddcbdd-C001-93" ], [ "e3c735811b2ea08d92659272ddcbdd-C001-148" ] ], "cite_sentences": [ "e3c735811b2ea08d92659272ddcbdd-C001-20", "e3c735811b2ea08d92659272ddcbdd-C001-41", "e3c735811b2ea08d92659272ddcbdd-C001-66", "e3c735811b2ea08d92659272ddcbdd-C001-71", "e3c735811b2ea08d92659272ddcbdd-C001-74", "e3c735811b2ea08d92659272ddcbdd-C001-93", "e3c735811b2ea08d92659272ddcbdd-C001-148" ] }, "@EXT@": { "gold_contexts": [ [ "e3c735811b2ea08d92659272ddcbdd-C001-20", "e3c735811b2ea08d92659272ddcbdd-C001-22" ], [ "e3c735811b2ea08d92659272ddcbdd-C001-41", "e3c735811b2ea08d92659272ddcbdd-C001-43", "e3c735811b2ea08d92659272ddcbdd-C001-44", "e3c735811b2ea08d92659272ddcbdd-C001-45" ] ], "cite_sentences": [ "e3c735811b2ea08d92659272ddcbdd-C001-20", "e3c735811b2ea08d92659272ddcbdd-C001-41" ] }, "@USE@": { "gold_contexts": [ [ "e3c735811b2ea08d92659272ddcbdd-C001-145" ], [ "e3c735811b2ea08d92659272ddcbdd-C001-156" ] ], "cite_sentences": [ "e3c735811b2ea08d92659272ddcbdd-C001-145", "e3c735811b2ea08d92659272ddcbdd-C001-156" ] } } }, "ABC_ae67018df3a74e0fd4ae90522499a3_17": { "x": [ { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-2", "text": "Recently, a significant number of studies have focused on neural information retrieval (IR) models." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-3", "text": "One category of works use unlabeled data to train general word embeddings based on term proximity." 
}, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-4", "text": "The general embeddings can be integrated into traditional IR models." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-5", "text": "The other category employs labeled data (e.g. click-through data) to train end-to-end neural IR models consisting of layers for target-specific representation learning." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-6", "text": "The latter idea accounts better for the IR task and is favored by recent research works, which is the one we will follow in this paper." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-7", "text": "We hypothesize that general semantics learned from unlabeled data can complement task-specific representation learned from labeled data of limited quality, and that a combination of the two is favorable." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-8", "text": "To this end, we propose a learning framework which can benefit from both labeled and more abundant unlabeled data for representation learning in the context of IR." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-9", "text": "Through a joint learning fashion in a single neural framework, the learned representation is optimized to minimize both the supervised loss on query-document matching and the unsupervised loss on text reconstruction." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-10", "text": "Standard retrieval experiments on TREC collections indicate that the joint learning methodology leads to significant better performance of retrieval over several strong baselines for IR." 
}, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-11", "text": "----------------------------------" }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-12", "text": "**INTRODUCTION**" }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-13", "text": "In recent years, the research community has noticed the great success of neural networks in computer vision (Krizhevsky et al., 2012) , speech recognition and natural language processing (Mikolov et al., 2013) tasks." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-14", "text": "However, the potential of neural networks has not been fully investigated in the IR field." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-15", "text": "Although a significant number of studies (e.g. (Huang et al., 2013; Ganguly et al., 2015; Zheng and Callan, 2015; Guo et al., 2016; Zamani and Croft, 2016; Dehghani et al., 2017; ) try to apply neural networks in IR, there have been few studies reporting the performance that is comparable to state-of-the-art IR models." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-16", "text": "These approaches rely on the general idea that neural network can provide a low-dimensional and semantics-rich representation for both queries and documents." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-17", "text": "Such a representation can bridge lexical and semantic gaps in traditional IR models." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-18", "text": "Depending on if the embeddings are trained with discriminative information for IR tasks, existing works can be broadly divided into two categories (Zhang et al., 2016; ." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-19", "text": "The first category of approaches extend traditional IR models to incorporate word embeddings that are trained on huge and unlabeled corpora with existing models such as Word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) in an unsupervised manner." 
}, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-20", "text": "These approaches (e.g. (Zheng and Callan, 2015; Nalisnick et al., 2016) ) leverage semantic information captured by word embeddings in order to enhance traditional IR models." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-21", "text": "We note that such models trained without references to the retrieval task model term proximity and do not contain discriminative information adapted for IR ." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-22", "text": "The second category (e.g. (Huang et al., 2013; Guo et al., 2016) ) tries to incorporate word embedding learning within neural models for IR, which reflects a more significant shift toward an endto-end framework." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-23", "text": "These approaches treat word embeddings as layers in neural IR models, to be learned along with all model parameters in a supervised manner." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-24", "text": "Most studies in the second category rely on click-through data for relevance judgment between queries and documents." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-25", "text": "Text representation learned with relevance information captures relevance rather than term proximity, which clearly accounts better for IR requirements ." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-26", "text": "However, supervised signals such as click-through data are often limited outside of large industrial research labs, probably due to user privacy concerns." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-27", "text": "It is thus not surprising to see that many authors following this methodology have industrial background (e.g. (Huang et al., 2013; Shen et al., 2014b; Nalisnick et al., 2016; )." 
}, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-28", "text": "In addition, point out that previous studies using click-through data make implicit but strong assumptions about clicked query-document pairs which are not necessarily met in practice." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-29", "text": "Neural networks are hungry for data, a fact which also holds for neural IR tasks." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-30", "text": "One can find from above discussions that the second category of approaches suffer from the data spareness problem, although there have been recent attempts (Gupta et al., 2017; Dehghani et al., 2017) trying to pseudo label query-document pairs automatically with unsupervised retrieval models such as BM25." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-31", "text": "Using pseudo labels as relevance signals relieves data spareness in terms of quantity but not quality." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-32", "text": "The idea of using unsupervised learning to complement supervision has been practiced successfully in computer vision (Yang et al., 2013) and natural language processing (Rasmus et al., 2015) tasks." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-33", "text": "In such a background, we hypothesize that semantics learned from unlabeled data can complement task-specific representation learned from pseudo-labeled data of limited quality, and a combination of the two is favorable in IR." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-34", "text": "To the best of our knowledge, such a combination has never been investigated in neural IR models." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-35", "text": "In this paper, we propose a learning framework which can benefit from both labeled and more abundant unlabeled data for representation learning in IR." 
}, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-36", "text": "Through joint learning in a single neural network, the learned representation can account for task-specific characteristics via supervised loss optimization on query-document matching, as well as preserving general semantics via unsupervised loss optimization on text reconstruction." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-37", "text": "We demonstrate by experiments that the joint learning model leads to significantly better performance over state-of-the-art IR models." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-38", "text": "----------------------------------" }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-39", "text": "**RELATED WORK**" }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-40", "text": "Representation learning approaches based on neural networks have gained in prominence in recent years due to their extreme efficiency." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-41", "text": "They motivate the emerging research field of Neural IR." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-42", "text": "Neural approaches have attracted increasing interests of the IR community in very recent years." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-43", "text": "Apart from learning to rank approaches that train their models over a set of hand-crafted features (Liu, 2009) , neural IR models typically accept the raw text of queries and documents as input." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-44", "text": "The dense representations of words or texts can then be learned with or without reference to retrieval tasks, respectively corresponding to the two categories of methods summarized in section 1." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-45", "text": "Unsupervised approaches learn general text representation without query and document interaction information." 
}, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-46", "text": "Embeddings pre-trained on unlabeled text with tools such as Word2vec (Mikolov et al., 2013) and Glove (Pennington et al., 2014) have been used to extend traditional IR models." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-47", "text": "Ganguly et al. (2015) develop a generalized language model with query-likelihood language modeling for integrating word embeddings as additional smoothing." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-48", "text": "Zheng and Callan (2015) represent term and query as vectors in the same latent space based on word embeddings so as to learn a model to reweight terms." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-49", "text": "Nalisnick et al. (2016) retain both input and output embeddings of Word2vec and map query words into the input space and document words into the output space." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-50", "text": "Zamani and Croft (2016) propose to use word embeddings to incorporate and weight terms not present in the query, acting as smoothing and query expansion." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-51", "text": "There are also studies developing their own embedding learning algorithms instead of using standard tools for embedding learning." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-52", "text": "For instance, Salakhutdinov and Hinton (2009) propose a deep auto-encoder model to generate a condensed binary vector representation of documents." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-53", "text": "Clinchant and Perronnin (2013) use latent semantic indexing to induce word embeddings for IR." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-54", "text": "Vuli\u0107 and Moens (2015) propose to learn from document-aligned comparable corpora the embeddings that can be used for both monolingual IR and cross-lingual IR." 
}, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-55", "text": "Supervised approaches use query-document relevance information to learn the representation that is optimized end-to-end for the task at hand." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-56", "text": "With click-through data, Huang et al. (2013) develop DSSM, a feed forward neural network with a word hashing phrase as the first layer to predict the click probability given a query string and a document title." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-57", "text": "DSSM is extended in (Shen et al., 2014a; Shen et al., 2014b) by incorporating convolutional neural network and max-pooling layers to extract the most salient local features." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-58", "text": "Since the DSSM related methods make implicit but strong assumptions about clicked data, try to relax the assumptions in their model." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-59", "text": "Guo et al. (2016) develop the DRMM model that takes the histogram-based features representing interactions between queries and documents as input into neural networks." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-60", "text": "DRMM is one of the first neural IR models to show improvement over traditional IR models." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-61", "text": "aim to simultaneously learn local and distributional representation to capture both lexical matching and semantic matching in IR." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-62", "text": "Following the discussion in section 1, we note that click-through data are not always available in massive amount outside of industrial labs." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-63", "text": "More recent works propose to use unsupervised IR models to pseudo label query-document pairs that provide weak supervision for representation learning." 
}, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-64", "text": "Dehghani et al. (2017) use BM25 to obtain relevant documents for a large set of AOL queries (Pass et al., 2006) which are then used as weakly supervised signals for joint embedding and ranking model training." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-65", "text": "employ similar supervision signals as (Dehghani et al., 2017) to train an embedding network similar to Word2vec and use the obtained embeddings for query expansion and query classification." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-66", "text": "Gupta et al. (2017) develop a cross-lingual IR model based on weak supervision." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-67", "text": "Luo et al. (2017) propose to train deep ranking models with weak relevance labels generated by click model based on click behavior of real users." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-68", "text": "We can conclude from above discussions that supervised approaches account better for task-specific features and are superior in IR." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-69", "text": "They rely on relevance information between query-document pairs of which the quality is relatively low in practice." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-70", "text": "In this paper, we follow successful practice in CV and NLP tasks and hypothesize that general and rich semantics learned from unlabeled data can complement task-specific representation learned from labeled data of limited quality." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-71", "text": "We will propose in section 3 a learning framework which can simultaneously learn from labeled and more abundant unlabeled data in the context of IR." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-72", "text": "By the way, we note that the joint learning framework resembles those studies (e.g. 
(Liu et al., 2015)) which couple IR with another supervised learning task." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-73", "text": "Our framework differs from those studies in that we do not require additional data that are labeled for another supervised learning task." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-74", "text": "----------------------------------" }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-75", "text": "**JOINT LEARNING FRAMEWORK FOR IR**" }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-76", "text": "In this section, we will develop a joint framework to learn low-dimensional representations of queries and documents from both labeled and unlabeled data." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-77", "text": "----------------------------------" }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-78", "text": "**LEARNING FRAMEWORK**" }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-79", "text": "The joint learning framework is illustrated in figure 1." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-80", "text": "It consists of three crucial components:" }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-81", "text": "\u2022 An encoding network." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-82", "text": "It embeds the raw input into low-dimensional representations that are designed to capture target-specific characteristics of IR." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-83", "text": "\u2022 A decoding network." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-84", "text": "It tries to reconstruct the input so as to benefit from unlabeled data." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-85", "text": "\u2022 A pairwise ranking model." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-86", "text": "It makes use of supervision signals from labeled query-document pairs to perform document ranking."
}, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-87", "text": "On top of the network structure, we perform joint optimization of both supervised loss and unsupervised loss." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-88", "text": "The unsupervised learning process uses all the text collection (e.g. queries and documents) for learning rich and general semantics." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-89", "text": "The supervised learning process learns, from labeled querydocument pairs, discriminative representations adapted for IR." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-90", "text": "The joint training fashion makes two learning processes complement each other via co-tuning the shared hidden layers in the encoding networks to help the representation generalize better in the IR task." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-91", "text": "The joint learning framework with labeled and unlabeled data." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-92", "text": "It consists of an encoding network, a decoding network and a pairwise ranking model." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-93", "text": "We impose an unsupervised loss and a supervised loss respectively on the reconstruction output and the pairwise ranking output, which is learned in a joint fashion in this paper." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-94", "text": "The low-dimensional representation is the one our model aims to learn." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-95", "text": "----------------------------------" }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-96", "text": "**UNSUPERVISED LEARNING**" }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-97", "text": "The unsupervised part learns the low-dimensional representation of text via an autoencoder style, which uses all the available text data." 
}, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-98", "text": "Following previous studies on text autoencoders (Chen and Zaki, 2017) , we opt for the simple feed-forward neural network architecture for both the encoding and decoding parts in figure 1." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-99", "text": "For each layer of the encoding/decoding networks, we use the Rectified Linear Unit (ReLU) as the activation function, a function recommended by many works in deep learning (LeCun et al., 2015) ." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-100", "text": "In the feed-forward step, each layer l (l \u2265 1) is a fully-connected layer and its activation potential z_l is given by:" }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-101", "text": "where W_l is the weight matrix at layer l and b_l is the corresponding bias." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-102", "text": "The input layer (corresponding to l = 0) maps the input text into a fixed-length vector." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-103", "text": "There are two methodologies one can employ to represent the input text: one is the one-hot representation (Gupta et al., 2017) and its variants (Zhai and Zhang, 2016) ; the other is the dense and semantically rich representations (He et al., 2017) ." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-104", "text": "Empirical results do not indicate that one is always better than the other, and we make use of the former in this paper." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-105", "text": "Given the set of text T , we follow previous studies such as (Zhai and Zhang, 2016; Chen and Zaki, 2017) and represent each input text t in T as a log-normalized word count vector x \u2208 R^{|V|} where |V| is the size of the vocabulary V ." 
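The log-normalized word-count input just described can be sketched as follows; the exact normalization used in (Zhai and Zhang, 2016; Chen and Zaki, 2017) may differ, so x_i = log(1 + tf(i)) is an assumed common variant:

```python
import math
from collections import Counter

def log_normalized_count_vector(tokens, vocab):
    """Map a token list to a log-normalized word-count vector over `vocab`.

    Assumes x_i = log(1 + tf(i)), one common log-normalization variant;
    the exact formula in the cited works may differ.
    """
    tf = Counter(t for t in tokens if t in vocab)
    return [math.log(1.0 + tf[w]) for w in vocab]

# Toy example: "the" occurs twice, "dog" not at all.
x = log_normalized_count_vector("the cat sat on the mat".split(),
                                vocab=["the", "cat", "mat", "dog"])
```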
}, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-106", "text": "Each dimension of the input vector x is represented by:" }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-107", "text": ", for i \u2208 V where tf(i) is the term frequency of the i-th word in the vocabulary." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-108", "text": "Since the unsupervised learning part of the framework is modeled as an autoencoder, we want the reconstructed output to resemble the input x, leading to the binary cross-entropy loss function l_u on t that can be defined as:" }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-109", "text": "----------------------------------" }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-110", "text": "**SUPERVISED LEARNING**" }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-111", "text": "The document ranking problem cannot be modeled with the standard classification or regression framework." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-112", "text": "Following the methodology in learning to rank (Liu, 2009) , we model document ranking in the pairwise style where the relevance information is in the form of preferences between pairs of documents with respect to individual queries." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-113", "text": "In addition, we follow previous studies (Gupta et al., 2017) and make use of well-performing unsupervised retrieval models (e.g. BM25) to pseudo-label query and document pairs so as to obtain the relevance information." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-114", "text": "More details will be given in section 4.1." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-115", "text": "From figure 1 one can note that the hidden layers in the encoding networks are shared by unsupervised and supervised learning, and one can refer to the unsupervised learning part for details of the layers in the encoding networks." 
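The binary cross-entropy reconstruction loss l_u from the unsupervised part above can be sketched as follows, assuming both the input and the reconstruction are scaled to [0, 1] (e.g. via a sigmoid output layer); this is a sketch, not the authors' exact implementation:

```python
import math

def bce_reconstruction_loss(x, x_hat, eps=1e-12):
    """Binary cross-entropy between input x and reconstruction x_hat.

    Assumes both vectors lie in [0, 1]; eps guards against log(0).
    Sketch of the unsupervised loss l_u for one text t.
    """
    return -sum(xi * math.log(xh + eps) + (1.0 - xi) * math.log(1.0 - xh + eps)
                for xi, xh in zip(x, x_hat))
```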
}, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-116", "text": "The supervised model, on top of the top-level representation layer (i.e. the low-dimensional representation), tries to learn a model that, given the query q, assigns a larger score to document d_1 than document d_2 if the ground truth is that d_1 matches q better." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-117", "text": "The supervised model is implemented as a pairwise ranking model in figure 1 , which is again a feed-forward neural network." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-118", "text": "Inspired by such studies as (Yih et al., 2011), we can derive the probability P(d_1 \u227b_q d_2) that d_1 is ranked higher than d_2 with respect to the query q via a logistic function:" }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-119", "text": "where the score function is computed with the pairwise ranking model, and the parameter \u03c3 is used to determine the shape of the sigmoid." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-120", "text": "The supervised training objective l_s on a query-document triplet (q, d_1, d_2) can then be defined as the cross-entropy loss, which is:" }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-121", "text": "where P(d_1 \u227b_q d_2) is the actual probability that d_1 is ranked higher than d_2 according to annotations (i.e. pseudo-labels of query-document pairs)." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-122", "text": "The actual probability in this paper is estimated in a similar way as in (Dehghani et al., 2017) , which is:" }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-123", "text": "where s denotes the relevance scores obtained from training instances." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-124", "text": "In the training process, the positive sample d_1 for the query q can be chosen from the most relevant documents according to annotated relevance scores." 
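The pairwise ranking probability and the cross-entropy objective l_s described above can be sketched as follows; the score function (a feed-forward network in the paper) is abstracted to plain numbers here:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def pairwise_prob(score1, score2, sigma=1.0):
    """P(d1 ranked above d2 | q) as a logistic function of the score gap.

    `sigma` shapes the sigmoid, as in the text; the neural score function
    itself is abstracted to plain scalar scores here.
    """
    return sigmoid(sigma * (score1 - score2))

def pairwise_ce_loss(p_true, score1, score2, sigma=1.0, eps=1e-12):
    """Cross-entropy between the target probability and the model's P(d1 > d2)."""
    p = pairwise_prob(score1, score2, sigma)
    return -(p_true * math.log(p + eps) + (1.0 - p_true) * math.log(1.0 - p + eps))
```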
}, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-125", "text": "The negative sample d_2 is selected randomly from the document collection." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-126", "text": "----------------------------------" }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-127", "text": "**JOINT LEARNING WITH REGULARIZATION**" }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-128", "text": "Combining the unsupervised loss l_u in equation 1 on all text data, the supervised loss l_s in equation 2 on all labeled query-document pairs, and the L2 norm regularization for weight matrices, one finally arrives at the objective function for the joint learning model, which is:" }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-129", "text": "where T and |T| denote the set of text data and its size, QD and |QD| denote the set of labeled query-document pairs and its size, LY stands for all the hidden and output layers of the framework in figure 1 , and W_l is the weight matrix of layer l in the network." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-130", "text": "The hyper-parameters \u03b1, \u03b2 control the importance of the unsupervised loss and the supervised loss." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-131", "text": "The joint loss function L(T, QD) can be optimized in a gradient-based way, and we use the Adam algorithm (Kingma and Ba, 2015) to compute the gradients." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-132", "text": "In this section, we conduct IR experiments to demonstrate the effectiveness of our proposed model." 
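A minimal sketch of the joint objective of equation 3, combining the averaged unsupervised loss, the averaged supervised loss and an L2 penalty over the weight matrices; the hyper-parameter names alpha and beta follow the text, while `lam` is an assumed L2 coefficient:

```python
def joint_objective(unsup_losses, sup_losses, weight_matrices,
                    alpha=1.0, beta=1.0, lam=1e-4):
    """Joint loss: alpha * mean unsupervised loss over the text set T
    + beta * mean supervised loss over the labeled pairs QD
    + an L2 penalty over all weight matrices (lists of rows of floats).
    """
    l_u = sum(unsup_losses) / len(unsup_losses)   # average over T
    l_s = sum(sup_losses) / len(sup_losses)       # average over QD
    l2 = sum(w ** 2 for W in weight_matrices for row in W for w in row)
    return alpha * l_u + beta * l_s + lam * l2
```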
}, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-133", "text": "----------------------------------" }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-134", "text": "**DATA SETS**" }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-135", "text": "The IR experiments are carried out against standard TREC collections consisting of one Robust track and one Web track, which represent different sizes and genres of heterogeneous text collections." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-136", "text": "These collections have been broadly used in recent studies (Zheng and Callan, 2015; Guo et al., 2016; Dehghani et al., 2017) ." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-137", "text": "The details of these collections and corresponding queries are given in table 1." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-138", "text": "The Robust dataset is used in the standard form without change." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-139", "text": "The ClueWeb-09-Cat-B collection (or ClueWeb for short) is filtered to the set of documents in the 60th percentile of Waterloo Fusion spam scores 1 ." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-140", "text": "For all TREC queries, we only make use of the title fields for retrieval." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-141", "text": "In order to build the labeled query-document pairs for supervised learning, we choose to use the more general methodology in (Gupta et al., 2017) instead of the one in (Dehghani et al., 2017) to avoid relying on data (i.e. AOL queries) that is only available to industrial labs." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-142", "text": "We fetch a set of news titles from the China Daily website 2 and use these titles as training queries to produce annotated query-document pairs." 
}, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-143", "text": "We use these training queries to retrieve the document collection with BM25." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-144", "text": "We make sure that no training queries appear in the evaluation query set in table 1." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-145", "text": "For each training query, we take the top 500 retrieved documents as positive samples." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-146", "text": "The negative samples are picked randomly from the document collection." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-147", "text": "There are other strategies for choosing negative samples (Wieting et al., 2015) , which are beyond the scope of this paper." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-148", "text": "For unsupervised learning, we make use of training queries and evaluation document sets listed in table 1, as well as the Wikipedia articles 3 as the external resource." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-149", "text": "----------------------------------" }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-150", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-151", "text": "We set the hyper-parameters of our model by following similar tasks such as (Dehghani et al., 2017) ." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-152", "text": "The size and number of hidden layers are respectively selected from {64, 128, 256, 512, 1024} and {1, 2, 3, 4}. The values of \u03b1, \u03b2 in equation 3 are chosen from {0.001, 0.01, 0.1, 1, 10, 100, 1000}. We select the initial learning rate from {10^-3, 10^-4, 5 * 10^-4, 10^-5, 5 * 10^-5}." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-153", "text": "The batch size for learning is selected from {64, 128, 256, 512}. 
These model hyper-parameters are tuned on the validation set (20% of the training queries used for validation)." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-154", "text": "For IR evaluation, we make use of mean average precision (MAP) of the top-ranked 1000 documents, precision at rank 20 (P20), and normalized discounted cumulative gain at rank 20 (nDCG20)." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-155", "text": "Statistically significant differences between various models are determined using the two-tailed paired t-test with p < 0.05." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-156", "text": "We compare the retrieval performance of our joint learning retrieval model with two categories of IR models: classic IR models showing state-of-the-art performance, and the recent neural ranking models for IR." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-157", "text": "Since our model is representation-focused rather than interaction-focused, we do not plan to compare our model with those based on relevance matching (Guo et al., 2016) in this paper." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-158", "text": "More importantly, since our model learns from weakly supervised signals by BM25, we are more interested in the comparisons to BM25 and similar models using weakly supervised signals, an experimental strategy also employed in (Dehghani et al., 2017) ." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-159", "text": "Under such considerations, we perform experiments with the following baselines: \u2022 Classic models: The probabilistic BM25 model and query likelihood (QL) model based on Dirichlet smoothing are highly efficient IR models." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-160", "text": "\u2022 DSSM: It is a representative deep matching model proposed in (Huang et al., 2013) , which is a representation-focused model." 
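The weak-supervision setup described in the data-set section (top-500 BM25 hits per training query as positives, random documents as negatives) can be sketched as follows; `build_triplets` is a hypothetical helper and the BM25 ranking is assumed to have been computed beforehand:

```python
import random

def build_triplets(ranked, collection, n_pos=500, n_neg_per_pos=1, seed=0):
    """Build (query, d_pos, d_neg) training triplets.

    `ranked` maps each training query to its BM25-ranked doc ids (best first);
    positives are the top `n_pos` hits, negatives are sampled at random from
    the whole collection. A sketch of the pseudo-labeling step, not the
    authors' exact pipeline.
    """
    rng = random.Random(seed)
    triplets = []
    for q, docs in ranked.items():
        for d_pos in docs[:n_pos]:
            for _ in range(n_neg_per_pos):
                d_neg = rng.choice(collection)
                while d_neg == d_pos:          # avoid degenerate pairs
                    d_neg = rng.choice(collection)
                triplets.append((q, d_pos, d_neg))
    return triplets

trips = build_triplets({"q1": ["d1", "d2", "d3"]},
                       ["d1", "d2", "d3", "d4", "d5"], n_pos=2)
```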
}, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-161", "text": "The model is framed as a feed-forward neural network with a word hashing layer." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-162", "text": "\u2022 NRMS: It is a weakly-supervised neural IR model learned with automatically annotated query-document pairs (Dehghani et al., 2017) ." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-163", "text": "NRMS shows significant improvement over traditional IR models." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-164", "text": "----------------------------------" }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-165", "text": "**RESULTS AND ANALYSIS**" }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-166", "text": "Comparisons to classic models." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-167", "text": "We use the recommended settings of the baseline models according to their original papers." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-168", "text": "Table 2 reports the experimental results on TREC datasets for our model and all the baseline models." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-169", "text": "One can find from the results that classic IR models BM25 and QL perform similarly on the two collections, a conclusion that is consistent with previous findings." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-170", "text": "Since BM25 is the model we employ to produce pseudo labels for supervised learning, we will not compare neural models with QL in the following discussions." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-171", "text": "The neural IR model DSSM performs significantly worse than the traditional BM25 model, due to its unsuitability for relevance matching and for handling the diverse matching requirements in long documents (Guo et al., 2016) ." 
}, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-172", "text": "NRMS is a neural ranking model learned from automatically labeled data, which resembles our model." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-173", "text": "NRMS shows significant improvements over BM25 in all cases." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-174", "text": "Our model proposed in this paper, by jointly learning from the labeled and unlabeled data, achieves the best overall performance." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-175", "text": "Our model always significantly outperforms BM25 by a large margin." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-176", "text": "Comparisons to neural models." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-177", "text": "We further compare our model with the neural IR models DSSM and NRMS." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-178", "text": "We find that our model performs better than DSSM and NRMS on all collections." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-179", "text": "Our model significantly outperforms DSSM in all the cases considered above." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-180", "text": "Our model significantly outperforms NRMS with the single exception of nDCG20 on Robust04, where the difference is not significant." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-181", "text": "We also find that NRMS is always significantly better than DSSM on all collections." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-182", "text": "The experimental conclusion is that our model is always significantly better than traditional IR models and mostly outperforms neural IR models considered above." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-183", "text": "Furthermore, we find that using unlabeled data for training in neural IR models is useful, since it leads to significant improvement over the neural models only using labeled data." 
}, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-184", "text": "Impact of unsupervised learning." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-185", "text": "It has been confirmed above that our model shows the best performance overall." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-186", "text": "However, it is not clear how much unsupervised learning contributes to the retrieval performance." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-187", "text": "We thus compare representations learned in a different setting without the help of unsupervised loss, which amounts to removing the unsupervised loss l_u from equation 3." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-188", "text": "We perform IR experiments with the new model over data sets in table 1 and list results in table 3 ." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-189", "text": "From the results one can find that the performance of the model without the unsupervised loss is significantly worse than that of the joint model in all the cases considered." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-190", "text": "This indicates that it is beneficial to combine unsupervised learning with supervised learning in neural IR." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-191", "text": "Empirical results in this part support our claim in this paper that learning from unlabeled data complements knowledge learned from labeled data in neural IR." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-192", "text": "----------------------------------" }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-193", "text": "**CONCLUSIONS**" }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-194", "text": "In this paper, we propose a neural IR model which jointly learns from labeled and unlabeled data to benefit from both the rich and general semantics in unlabeled data and target-specific features in labeled data." 
}, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-195", "text": "As far as we can tell, it is the first time such a combination is investigated in neural IR." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-196", "text": "Experiments on TREC collections show that our model, without any human annotation, is significantly better than traditional IR models and recently proposed models based on neural networks." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-197", "text": "Experiments also show that using unsupervised learning to complement supervised learning with weak supervision is important in IR." }, { "sent_id": "ae67018df3a74e0fd4ae90522499a3-C001-198", "text": "A future direction to follow would be to use more expressive architectures such as LSTM to replace feed-forward networks used in this paper." } ], "y": { "@MOT@": { "gold_contexts": [ [ "ae67018df3a74e0fd4ae90522499a3-C001-15" ] ], "cite_sentences": [ "ae67018df3a74e0fd4ae90522499a3-C001-15" ] }, "@BACK@": { "gold_contexts": [ [ "ae67018df3a74e0fd4ae90522499a3-C001-30" ], [ "ae67018df3a74e0fd4ae90522499a3-C001-64" ], [ "ae67018df3a74e0fd4ae90522499a3-C001-162" ] ], "cite_sentences": [ "ae67018df3a74e0fd4ae90522499a3-C001-30", "ae67018df3a74e0fd4ae90522499a3-C001-64", "ae67018df3a74e0fd4ae90522499a3-C001-162" ] }, "@SIM@": { "gold_contexts": [ [ "ae67018df3a74e0fd4ae90522499a3-C001-65" ], [ "ae67018df3a74e0fd4ae90522499a3-C001-122" ], [ "ae67018df3a74e0fd4ae90522499a3-C001-136" ], [ "ae67018df3a74e0fd4ae90522499a3-C001-151" ], [ "ae67018df3a74e0fd4ae90522499a3-C001-158" ] ], "cite_sentences": [ "ae67018df3a74e0fd4ae90522499a3-C001-65", "ae67018df3a74e0fd4ae90522499a3-C001-122", "ae67018df3a74e0fd4ae90522499a3-C001-136", "ae67018df3a74e0fd4ae90522499a3-C001-151", "ae67018df3a74e0fd4ae90522499a3-C001-158" ] }, "@DIF@": { "gold_contexts": [ [ "ae67018df3a74e0fd4ae90522499a3-C001-141" ] ], "cite_sentences": [ "ae67018df3a74e0fd4ae90522499a3-C001-141" ] }, "@USE@": { "gold_contexts": [ [ 
"ae67018df3a74e0fd4ae90522499a3-C001-159", "ae67018df3a74e0fd4ae90522499a3-C001-162" ] ], "cite_sentences": [ "ae67018df3a74e0fd4ae90522499a3-C001-162" ] } } }, "ABC_9dd9ac975c6f55797615f0e52aa296_17": { "x": [ { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-2", "text": "Convolutional neural networks (CNN) have achieved the top performance for event detection due to their capacity to induce the underlying structures of the k-grams in the sentences." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-3", "text": "However, the current CNN-based event detectors only model the consecutive k-grams and ignore the non-consecutive k-grams that might involve important structures for event detection." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-4", "text": "In this work, we propose to improve the current CNN models for ED by introducing the non-consecutive convolution." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-5", "text": "Our systematic evaluation on both the general setting and the domain adaptation setting demonstrates the effectiveness of the non-consecutive CNN model, leading to significant performance improvements over the current state-of-the-art systems." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-6", "text": "----------------------------------" }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-8", "text": "The goal of event detection (ED) is to locate event triggers of some specified types in text." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-9", "text": "Triggers are generally single verbs or nominalizations that evoke the events of interest." 
}, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-10", "text": "This is an important and challenging task of information extraction in natural language processing (NLP), as the same event might appear in various expressions, and an expression might express different events depending on contexts." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-11", "text": "The current state-of-the-art systems for ED have involved the application of convolutional neural networks (CNN) (Nguyen and Grishman, 2015b; Chen et al., 2015) that automatically learn effective feature representations for ED from sentences." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-12", "text": "This has overcome the two fundamental limitations of the traditional feature-based methods for ED: (i) the complicated feature engineering for rich feature sets and (ii) the error propagation from the NLP toolkits and resources (i.e., parsers, part-of-speech taggers, etc.) that generate such features." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-13", "text": "The prior CNN models for ED are characterized by the temporal convolution operators that linearly map the vectors for the k-grams in the sentences into the feature space." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-14", "text": "Such k-gram vectors are obtained by concatenating the vectors of the k consecutive words in the sentences (Nguyen and Grishman, 2015b; Chen et al., 2015) ." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-15", "text": "In other words, the previous CNN models for ED only focus on modeling the consecutive k-grams." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-16", "text": "Unfortunately, such a consecutive mechanism is unable to capture the long-range and non-consecutive dependencies that are necessary to the prediction of trigger words." 
}, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-17", "text": "For instance, consider the following sentence with the trigger word \"leave\" from the ACE 2005 corpus:" }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-18", "text": "The mystery is that she took the job in the first place or didn't leave earlier." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-19", "text": "The correct event type for the trigger word \"leave\" in this case is \"End-Org\"." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-20", "text": "However, the previous CNN models might not be able to detect \"leave\" as an event trigger or incorrectly predict its type as \"Movement\"." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-21", "text": "This is caused by their reliance on the consecutive local k-grams such as \"leave earlier\"." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-22", "text": "Consequently, we need to resort to the non-consecutive pattern \"job leave\" to correctly determine the event type of \"leave\" in this case." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-23", "text": "Guided by this intuition, we propose to improve the previous CNN models for ED by operating the convolution on all possible non-consecutive k-grams in the sentences." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-24", "text": "We aggregate the resulting convolution scores via the max-pooling function to unveil the most important non-consecutive k-grams for ED." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-25", "text": "The aggregation over all the possible non-consecutive k-grams is made efficient with dynamic programming." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-26", "text": "Note that our work is related to (Lei et al., 2015) who employ the non-consecutive convolution for the sentence and news classification problems." 
}, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-27", "text": "Our work is different from (Lei et al., 2015) in that we model the relative distances of words to the trigger candidates in the sentences via position embeddings, while (Lei et al., 2015) use the absolute distances between words in the k-grams to compute the decay weights for aggregation." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-28", "text": "To the best of our knowledge, this is the first work on non-consecutive CNN for ED." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-29", "text": "We systematically evaluate the proposed model in the general setting as well as the domain adaptation setting." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-30", "text": "The experiment results demonstrate that our model significantly outperforms the current state-of-the-art models in such settings." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-31", "text": "----------------------------------" }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-32", "text": "**MODEL**" }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-33", "text": "We formalize ED as a multi-class classification problem." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-34", "text": "Given a sentence, for every token in that sentence, we want to predict whether the current token is an event trigger of some event in the pre-defined event set. The current token along with its context in the sentence constitutes an event trigger candidate." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-35", "text": "In order to make it compatible with the previous work, we follow the procedure in (Nguyen and Grishman, 2015b) to process the trigger candidates for CNN." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-36", "text": "In particular, we limit the context of the trigger candidates to a fixed window size by trimming longer sentences and padding shorter sentences with a special token when necessary." 
}, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-37", "text": "Let 2n + 1 be the fixed window size, and W = [w_0, w_1, . . . , w_n, . . . , w_{2n-1}, w_{2n}] be some trigger candidate where the current token is positioned in the middle of the window (token w_n)." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-38", "text": "Before entering CNN, each token w_i is first transformed into a real-valued vector x_i using the concatenation of the following vectors:" }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-39", "text": "1. The word embedding vector of w_i : This is obtained by looking up a pre-trained word embedding table D (Turian et al., 2010; Mikolov et al., 2013a) ." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-40", "text": "2. The position embedding vector of w_i : We obtain this vector by looking up the position embedding table for the relative distance i - n from the token w_i to the current token w_n ." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-41", "text": "The position embedding table is initialized randomly." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-42", "text": "3. The real-valued embedding vector for the entity type of w_i : This vector is generated by looking up the entity type embedding table (initialized randomly) for the entity type of w_i . Note that we employ the BIO annotation schema to assign entity type labels to each token in the sentences using the entity mention heads as in (Nguyen and Grishman, 2015b) ." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-43", "text": "The transformation from the token w_i to the vector x_i (x_i \u2208 R^d) essentially converts the input candidate W into a sequence of real-valued vectors X = (x_0, x_1, . . . , x_{2n})." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-44", "text": "This sequence is used as input in the following CNN models." 
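The token-to-vector transformation above (concatenating word, position and entity-type embeddings) can be sketched as follows, with hypothetical toy embedding tables standing in for the trained ones:

```python
def token_vector(word, rel_pos, entity_tag, word_emb, pos_emb, ent_emb):
    """Concatenate the word, position and entity-type embeddings of one token.

    The lookup tables are plain dicts of float lists here; in the paper the
    position and entity-type tables are randomly initialized and tuned
    during training.
    """
    return word_emb[word] + pos_emb[rel_pos] + ent_emb[entity_tag]

# Hypothetical toy tables: 2-dim word, 1-dim position, 1-dim entity embeddings.
word_emb = {"leave": [0.1, 0.2]}
pos_emb = {0: [1.0]}     # relative distance i - n = 0 (the candidate token itself)
ent_emb = {"O": [0.0]}   # BIO tag for "no entity"
x = token_vector("leave", 0, "O", word_emb, pos_emb, ent_emb)
```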
}, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-45", "text": "----------------------------------" }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-46", "text": "**THE TRADITIONAL CNN**" }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-47", "text": "Given the window size k, the traditional CNN models for ED consider the following set of 2n + 1 consecutive k-gram vectors:" }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-48", "text": "Vector u_i is the concatenation of the k consecutive vectors preceding position i in the sequence X:" }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-49", "text": "where the out-of-index vectors are simply set to all zeros." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-50", "text": "The core of the CNN models is the convolution operation, specified by the filter vector f \u2208 R^{dk} ." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-51", "text": "In CNN, f can be seen as a feature extractor for the k-grams that operates via the dot product with each element in C. This produces the following convolution score set:" }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-52", "text": "In the next step, we aggregate the features in S with the max function, resulting in the aggregation score:" }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-53", "text": "Afterward, p_k^f is often transformed by a non-linear function G 1 to generate the transformed score G(p_k^f), functioning as the extracted feature for the initial trigger candidate W ." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-54", "text": "We can then repeat this process for different window sizes k and filters f , generating multiple features G(p_k^f) to capture various aspects of the trigger candidate W ." 
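A minimal sketch of the consecutive convolution and max-pooling just described: each position contributes the dot product between the filter f and the concatenation of the k consecutive vectors preceding it (zero-padded out of range), and the maximum score is kept:

```python
def consecutive_max_conv(X, f, k):
    """Max-pooled consecutive convolution over a sequence X of d-dim vectors.

    For every position i, concatenate the k consecutive vectors ending at i
    (out-of-index positions are all-zero), take the dot product with the
    filter f (length d*k), and keep the maximum score. A sketch of the
    traditional CNN's convolution and max aggregation.
    """
    d = len(X[0])
    best = float("-inf")
    for i in range(len(X)):
        u = []
        for j in range(i - k + 1, i + 1):   # k consecutive positions ending at i
            u.extend(X[j] if 0 <= j < len(X) else [0.0] * d)
        best = max(best, sum(a * b for a, b in zip(f, u)))
    return best
```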
}, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-55", "text": "Finally, such features are concatenated into a single representation vector for W , to be fed into a feed-forward neural network with a softmax layer at the end to perform classification." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-56", "text": "----------------------------------" }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-57", "text": "**THE NON-CONSECUTIVE CNN**" }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-58", "text": "As mentioned in the introduction, the limitation of the previous CNN models for ED is the inability to encode the non-consecutive k-grams that might be crucial to the trigger prediction." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-59", "text": "This limitation originates from Equation 1 in which only the consecutive k-gram vectors are considered." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-60", "text": "In order to overcome such limitation, we propose to model all possible non-consecutive k-grams in the trigger candidate, leading to the following set of non-consecutive k-gram vectors:" }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-61", "text": "N = {[x_{i_1}, x_{i_2}, . . . , x_{i_k}] : 0 \u2264 i_1 < i_2 < . . . < i_k \u2264 2n}." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-62", "text": "The non-consecutive CNN model then follows the procedure of the traditional CNN model in Section 2.1 to compute the representation vector for classification." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-63", "text": "The only difference is that the computation is done on the input set N instead of C. 
In particular, the convolution score set in this case would be S(N) = {f^T v : v \u2208 N}, while the aggregating score would be:" }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-64", "text": "----------------------------------" }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-65", "text": "**IMPLEMENTATION**" }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-66", "text": "Note that the maximum operation in Equation 2 only requires O(n) operations, while the naive implementation of Equation 3 would need O(|N|) = O(n^k) operations." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-67", "text": "In this work, we employ the dynamic programming (DP) procedure below to reduce the computation time for Equation 3." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-68", "text": "Assuming the filter vector f is the concatenation of the k vectors f_1, . . . , f_k," }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-69", "text": "Equation 3 can be re-written as:" }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-70", "text": "Let D^j_t be the dynamic programming table representing the maximum convolution score for the subfilter [f_1, . . . , f_j] over all possible non-consecutive j-gram vectors in the subsequence (x_0, x_1, . . . , x_t) of X:" }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-71", "text": "We can solve this DP problem with the following recursive formulas:" }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-72", "text": "The computation time for this procedure is O(kn) and remains linear in the sequence length." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-73", "text": "----------------------------------" }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-74", "text": "**TRAINING**" }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-75", "text": "We train the networks using stochastic gradient descent with shuffled mini-batches, the AdaDelta update rule, back-propagation and dropout." 
}, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-76", "text": "During the training, we also optimize the embedding tables (i.e, word, position and entity type embeddings) to achieve the optimal states." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-77", "text": "Finally, we rescale the weights whose l 2 -norms exceed a predefined threshold (Nguyen and Grishman (2015a))." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-78", "text": "----------------------------------" }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-79", "text": "**EXPERIMENTS**" }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-80", "text": "----------------------------------" }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-81", "text": "**DATASET, PARAMETERS AND RESOURCES**" }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-82", "text": "We apply the same parameters and resources as (Nguyen and Grishman, 2015b) to ensure the compatible comparison." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-83", "text": "Specifically, we employ the window sizes in the set {2, 3, 4, 5} for the convolution operation with 150 filters for each window size." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-84", "text": "The window size of the trigger candidate is 31 while the dimensionality of the position embeddings and entity type embeddings is 50." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-85", "text": "We use word2vec from (Mikolov et al., 2013b) as the pretrained word embeddings." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-86", "text": "The other parameters include the dropout rate \u21e2 = 0.5, the mini-batch size = 50, the predefined threshold for the l 2 norms = 3." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-87", "text": "Following the previous studies (Li et al., 2013; Chen et al., 2015; Nguyen and Grishman, 2015b) , we evaluate the models on the ACE 2005 corpus with 33 event subtypes." 
}, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-88", "text": "In order to make it compatible, we use the same test set with 40 newswire articles, the same development set with 30 other documents and the same training set with the remaining 529 documents." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-89", "text": "All the data preprocessing and evaluation criteria follow those in (Nguyen and Grishman, 2015b)." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-90", "text": "----------------------------------" }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-91", "text": "**THE GENERAL SETTING**" }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-92", "text": "We compares the non-consecutive CNN model (NC-CNN) with the state-of-the-art systems on the ACE 2005 dataset in Table 1 ." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-93", "text": "These systems include:" }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-94", "text": "1) The feature-based systems with rich handdesigned feature sets, including: the MaxEnt model with local features in (Li et al., 2013) (MaxEnt) ; the structured perceptron model for joint beam search with local features (Joint+Local), and with both local and global features (Joint+Local+Global) in (Li et al., 2013 ); and the sentence-level and cross-entity models in (Hong et al., 2011) ." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-95", "text": "2) The neural network models, i.e, the CNN model in (Nguyen and Grishman, 2015b) (CNN), the dynamic multi-pooling CNN model (DM-CNN) in (Chen et al., 2015) and the bidirectional recurrent neural networks (B-RNN) in (Nguyen et al., 2016a) ." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-96", "text": "3) The probabilistic soft logic based model to capture the event-event correlation in (Liu et al., 2016 )." 
}, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-97", "text": "Methods F Sentence-level in Hong et al (2011) 59.7 MaxEnt (Li et al., 2013) 65.9 Joint+Local (Li et al., 2013) 65.7 Joint+Local+Global (Li et al., 2013) 67.5 Cross-entity in Hong et al. (2011) \u2020 68.3 Probabilistic soft logic (Liu et al., 2016) \u2020 69.4 CNN (Nguyen and 69.0 DM-CNN (Chen et al., 2015) 69.1 B- RNN (Nguyen et al., 2016a) 69.3 NC-CNN 71.3 The most important observation from the table is that the non-consecutive CNN model significantly outperforms all the compared models with large margins." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-98", "text": "In particular, NC-CNN is 2% better than B-RNN (Nguyen et al., 2016a) , the state-of-theart system that only relies on the context information within the sentences of the trigger candidates." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-99", "text": "In addition, although NC-CNN only employs the sentence-level information, it is still better than the other models that further exploit the document-level information for prediction (an improvement of 1.9% over the probabilistic soft logic based model in (Liu et al., 2016) )." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-100", "text": "Finally, comparing NC-CNN and the CNN model in (Nguyen and Grishman, 2015b), we see that the non-consecutive mechanism significantly improves the performance of the traditional CNN model for ED (up to 2.3% in absolute Fmeasures with p < 0.05)." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-101", "text": "----------------------------------" }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-102", "text": "**THE DOMAIN ADAPTATION EXPERIMENTS**" }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-103", "text": "Previous studies have shown that the NLP models would suffer from a significant performance loss when domains shift (Blitzer et al., 2006; Daume III, 2007; Plank and Moschitti, 2013; Nguyen et al., 2015c) ." 
}, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-104", "text": "In particular, if a model is trained on some source domain and applied to a different domain (the target domain), its performance would degrade significantly." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-105", "text": "The domain adaptation (DA) studies aim to overcome this issue by developing robust techniques across domains." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-106", "text": "The best reported system in the DA setting for ED is (Nguyen and Grishman, 2015b) , which demonstrated that the CNN model outperformed the feature-based models in the cross-domain setting." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-107", "text": "In this section, we compare NC-CNN with the CNN model in (Nguyen and Grishman, 2015b) (as well as the other models above) in the DA setting to further investigate their effectiveness." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-108", "text": "----------------------------------" }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-109", "text": "**DATASET**" }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-110", "text": "This section also uses the ACE 2005 dataset but focuses more on the difference between domains." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-111", "text": "The ACE 2005 corpus includes 6 different domains: broadcast conversation (bc), broadcast news (bn), telephone conversation (cts), newswire (nw), usenet (un) and webblogs (wl)." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-112", "text": "Following (Nguyen and Grishman, 2015b), we use news (the union of bn and nw) as the source domain and bc, cts, wl and un as four different target domains 3 ." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-113", "text": "We take half of bc as the development set and use the remaining data for testing." 
}, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-114", "text": "Our data split is the same as that in (Nguyen and Grishman, 2015b Table 2 reports the performance of the systems with 5-fold cross validation." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-115", "text": "Note that we focus on the systems exploiting only the sentence level information in this section." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-116", "text": "For each system, we train a model on the training data of the source domain and evaluate this model on the test set of the source domain (in-domain performance) as well as on the four target domains bc, cts, wl and un." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-117", "text": "----------------------------------" }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-118", "text": "**PERFORMANCE**" }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-119", "text": "We emphasize that the performance of the systems MaxEnt, Joint+Local, Joint+Local+Global, B-RNN, and CNN is obtained from the actual systems in the original work (Li et al., 2013; Nguyen and Grishman, 2015b; Nguyen et al., 2016a) ." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-120", "text": "The performance of DM-CNN, on the other hand, is from our re-implementation of the system in (Chen et al., 2015) using the same hyper-parameters and resources as CNN and NC-CNN for a fair comparison." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-121", "text": "From the table, we see that NC-CNN is significantly better than the other models on the source domain." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-122", "text": "This is consistent with the conclusions in Section 3.2 and further confirms the effectiveness of NC-CNN." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-123", "text": "More importantly, NC-CNN outperforms CNN and the other models on the target domains bc, cts and un, and performs comparably with CNN on wl." 
}, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-124", "text": "The performance improvement is significant on bc and un (p < 0.05), thereby verifying the robustness of NC-CNN for ED across domains." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-125", "text": "----------------------------------" }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-126", "text": "**RELATED WORK**" }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-127", "text": "There have been three major approaches to event detection in the literature." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-128", "text": "First, the pattern-based approach explores the application of patterns to identify the instances of events, in which the patterns are formed by predicates, event triggers and constraints on the syntactic context (Grishman et al., 2005; Cao et al., 2015a; Cao et al., 2015b) ." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-129", "text": "Second, the feature-based approach relies on linguistic intuition to design effective feature sets for statistical models for ED, ranging from the local sentence-level representations (Ahn, 2006; Li et al., 2013) , to the higher level structures such as the cross-sentence or cross-event information (Ji and Grishman, 2008; Gupta and Ji, 2009; Patwardhan and Riloff, 2009; Liao and Grishman, 2011; Hong et al., 2011; McClosky et al., 2011; Li et al., 2015) ." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-130", "text": "Some recent work on the feature-based approach has also investigated event trigger detection in the joint inference with event argument prediction (Riedel et al., 2009; Poon and Vanderwende, 2010; Li et al., 2013; Venugopal et al., 2014) to benefit from their inter-dependencies." 
}, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-131", "text": "Finally, neural networks have been introduced into ED very recently with the early work on convolutional neural networks (Nguyen and Grishman, 2015b; Chen et al., 2015) ." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-132", "text": "The other work includes: (Nguyen et al., 2016a) who employ bidirectional recurrent neural networks to perform event trigger and argument labeling jointly, (Jagannatha and Yu, 2016) who extract event instances from health records with recurrent neural networks and (Nguyen et al., 2016b) who propose a two-stage training algorithm for event extension with neural networks." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-133", "text": "----------------------------------" }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-134", "text": "**CONCLUSION**" }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-135", "text": "We present a new CNN architecture for ED that exploits the non-consecutive convolution for sentences." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-136", "text": "Our evaluation of the proposed model on the general setting and the DA setting demonstrates the effectiveness of the non-consecutive mechanism." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-137", "text": "We achieve the state-of-the-art performance for ED in both settings." }, { "sent_id": "9dd9ac975c6f55797615f0e52aa296-C001-138", "text": "In the future, we plan to investigate the non-consecutive architecture on other problems such as relation extraction or slot filling." 
} ], "y": { "@USE@": { "gold_contexts": [ [ "9dd9ac975c6f55797615f0e52aa296-C001-35" ], [ "9dd9ac975c6f55797615f0e52aa296-C001-42" ], [ "9dd9ac975c6f55797615f0e52aa296-C001-82" ], [ "9dd9ac975c6f55797615f0e52aa296-C001-87" ], [ "9dd9ac975c6f55797615f0e52aa296-C001-89" ], [ "9dd9ac975c6f55797615f0e52aa296-C001-92", "9dd9ac975c6f55797615f0e52aa296-C001-93", "9dd9ac975c6f55797615f0e52aa296-C001-95" ], [ "9dd9ac975c6f55797615f0e52aa296-C001-107" ], [ "9dd9ac975c6f55797615f0e52aa296-C001-112" ], [ "9dd9ac975c6f55797615f0e52aa296-C001-119" ] ], "cite_sentences": [ "9dd9ac975c6f55797615f0e52aa296-C001-35", "9dd9ac975c6f55797615f0e52aa296-C001-42", "9dd9ac975c6f55797615f0e52aa296-C001-82", "9dd9ac975c6f55797615f0e52aa296-C001-87", "9dd9ac975c6f55797615f0e52aa296-C001-89", "9dd9ac975c6f55797615f0e52aa296-C001-95", "9dd9ac975c6f55797615f0e52aa296-C001-107", "9dd9ac975c6f55797615f0e52aa296-C001-112", "9dd9ac975c6f55797615f0e52aa296-C001-119" ] }, "@DIF@": { "gold_contexts": [ [ "9dd9ac975c6f55797615f0e52aa296-C001-100" ] ], "cite_sentences": [ "9dd9ac975c6f55797615f0e52aa296-C001-100" ] }, "@BACK@": { "gold_contexts": [ [ "9dd9ac975c6f55797615f0e52aa296-C001-106" ] ], "cite_sentences": [ "9dd9ac975c6f55797615f0e52aa296-C001-106" ] } } }, "ABC_8084b5077b2a8db755b1bbd0f6fe60_17": { "x": [ { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-2", "text": "DSLRAE is a hierarchical classifier for similar written languages and varieties based on maximum-entropy (maxent) classifiers." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-3", "text": "In the first level, the text is classified into a language group using a simple token-based maxent classifier." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-4", "text": "At the second level, a group-specific maxent classifier is applied to classify the text as one of the languages or varieties within the previously identified group." 
}, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-5", "text": "For each group of languages, the classifier uses a different kind and combination of knowledge-poor features: token or character n-grams and 'white lists' of tokens." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-6", "text": "Features were selected according to the results of applying ten-fold cross-validation over the training dataset." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-7", "text": "The system presented in this article 1 has been ranked second in the Discriminating Similar Language (DSL) shared task co-located within the VarDial Workshop at COLING 2014 ." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-8", "text": "----------------------------------" }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-10", "text": "Language identification (LI) can be defined as the task of determining the language of a written text." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-11", "text": "LI is also a cross-cutting technology supporting many other text analysis tasks: sentiment analysis, political tendency or topic classification." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-12", "text": "There are some interesting problems around written language identification that have attracted some attention recently, as native language identification (NLI, Tetreault et al., 2013) , the identification of the country of origin or the discrimination between similar or closely related languages (DSL, Tiedemann and Ljube\u0161i\u0107, 2012) ." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-13", "text": "LI has reached a great success in discriminating between languages with unique character sets and languages belonging to different language groups or typologically distant." 
}, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-14", "text": "However, according to Zampieri (2013) , multilingualism, noisy or non-standard features in text and discrimination between similar languages, varieties or dialects remain as the major known bottlenecks in language identification." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-15", "text": "For this reason, DSL can be considered as a sub-task in language identification." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-36", "text": "Beesley (1988) pioneered the use of character n-grams models, which were also used by Dunning (1994) and Cavnar and Trenkle (1994) ." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-55", "text": "Very brief descriptions of the systems are also offered." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-16", "text": "Interestingly enough, LI seems to work well with what Kloss (1967) called abstandsprache or language by distance (because Basque is an isolate, it is generally regarded as a distant language) but fails in dealing with ausbausprache or language by development (a standard variety together with all varieties heteronomous with respect to it, e. g. Basque Batua koin\u00e9 and the various vernacular dialects)." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-17", "text": "Mass media, educational centres, administrations and communications favour standard languages instead of other varieties." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-18", "text": "Standard varieties of languages are then seen by sociolinguists and dialectologists as political and cultural constructs (Trudgill, 2004) ." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-19", "text": "However, languages and varieties are not just systems for communication between individuals, they are also used by groups and they are a crucial part of their identity and culture." 
}, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-20", "text": "Language variation is systematic, both inter-and intra-personal." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-21", "text": "It can be related to political, social, geographical, situational, communicative or instrumental factors." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-22", "text": "Variation within a language can be found at different levels: alphabet, orthography (diacritics), word structure (syllable composition, morphology), lexical choice or even syntax." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-23", "text": "Similar or closely related languages often reflect a common origin and are members of a dialect continuum (Bloomfield, 1935) ." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-24", "text": "Solutions to language identification are often based either on generative or discriminative character n-gram language models." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-25", "text": "While character-based methods provide a means to distinguish between different languages on the basis of coarse-grained statistics on n-grams, it seems that discriminating between similar languages needs more fine-grained distinctions not always reflected by n-gram character distributions." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-26", "text": "According to Tiedemann and Ljube\u0161i\u0107 (2012) , character-based n-gram methods fail for languages with a high lexical overlap, since the more shared words between two languages, the more similar will their n-gram character frequency profiles be." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-27", "text": "Table 1 : Macro-averaged Precision, Recall and F 1 -score on the DSL training dataset resulting from 10-fold cross-validation using the best model for each group of languages o varieties." 
}, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-28", "text": "Model has a letter code indicating the kind of elements considered: C (characters), T (tokens), L (tokens from the list of the 10,000 most frequent tokens), and a number indicating how many consecutive elements have been taken in a feature: 1 (unigrams), 1-2 (unigrams and bigrams), 1-5 (sequences of length one to five)." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-29", "text": "----------------------------------" }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-30", "text": "**GROUP**" }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-31", "text": "----------------------------------" }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-32", "text": "**PREVIOUS APPROACHES**" }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-33", "text": "Although focused on formal languages, Gold (1967) is usually credited as the first to attempt computational language identification." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-34", "text": "In particular, two common LI approaches, namely n-gram language models and white (or black) lists, echo Gold's information presentation methods." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-35", "text": "In the 1990s, language identification was formulated as a sub-task of text categorization and varied approaches were explored." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-37", "text": "Grefenstette (1995) compared this approach to Ingle (1978) , based on the frequency of short words." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-38", "text": "The interested reader is referred to Zampieri (2013) for a review of some statistical and machine learning proposals and to both Baldwin and Lui (2010) and Lui and Baldwin (2011) for an overview of some linguistically motivated models." 
}, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-39", "text": "As Baldwin and Lui (2010) or Tiedemann and Ljube\u0161i\u0107 (2012) point out, language identification is erroneously considered an easy and solved problem 2 , in part because of some general purpose systems being available, notably TextCat 3 , Xerox Language Identifier 4 and, more recently, langid.py (Lui and Baldwin, 2012) ." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-40", "text": "While it is true that it is possible to obtain brilliant results for a small number of languages (Baldwin and Lui, 2010) or typologically distant languages , accurately discriminating among closely related languages or varieties of the same language has been repeatedly reported as a bottleneck for language identification systems, in particular for those based on n-grams." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-41", "text": "Back in 2004, Padr\u00f3 and Padr\u00f3 concluded that \"since the tested systems tend to fail when distinguishing similar languages (e.g. Spanish and Catalan), further research could be done to solve these cases.\" Martins and Silva (2005) report similar difficulties in discriminating among European and Brazilian Portuguese." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-42", "text": "Ranaivo-Malan\u00e7on (2006) motivates her work on the unsatisfactory performance of (then) available language identifiers when dealing with close languages such as Malay and Indonesian." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-43", "text": "Ljube\u0161i\u0107 et al. (2007) do not even attempt to distinguish Bosnian from Croatian when developing a Croatian identifier because of their closeness." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-44", "text": "Trieschnigg et al. (2012) come as an exception as they report satisfactory results in identifying sixteen varieties of Dutch with TextCat." 
}, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-45", "text": "Ranaivo-Malan\u00e7on (2006) presents a cascaded language identifier for Malay and Indonesian." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-46", "text": "It first distinguishes Malay or Indonesian from other four European languages using trigrams extracted from the most frequent words from each language." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-47", "text": "Texts classified as Malay or Indonesian are subsequently scanned for some linguistic features (format of numbers and exclusive words), yielding a more precise performance than TextCat." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-48", "text": "Ljube\u0161i\u0107 et al. (2007) also propose a cascaded identifier that relies on 'black lists' to discard nonBalkan languages and a second order Markov model on n-grams to discriminate among them, augmented with a 'black list' component that raises accuracy up to 0.99 when dealing with the most difficult pair (Croatian and Serbian)." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-49", "text": "This work is followed up in Tiedemann and Ljube\u0161i\u0107 (2012) where 9% of improvement over standard approaches is reported and where support for Bosnian discrimination is included." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-50", "text": "Huang and Lee (2008) use a bag of the most frequent words to build a voting identifier for three Chinese varieties with a top accuracy of 0.929." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-51", "text": "More recently, Zampieri (2013) compares the performance of n-gram based models to machine learning methods using bag of words when discriminating similar languages and varieties obtaining comparable performance with both approaches." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-52", "text": "Grouin et al. (2010) present the shared task DEFT 2010." 
}, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-53", "text": "Participants were challenged to identify the decade, country (France and Canada) and newspaper for a set of journalistic texts." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-54", "text": "As far as the country labeling is concerned, they report an upper 0.964 F 1 -measure and an average of 0.767." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-56", "text": "Zampieri and Gebre (2012) present a log-likelihood estimation method for language models built on orthographical (character n-grams), lexical (word unigrams) and lexico-syntactic (word bigrams) features." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-57", "text": "They report a 0.998 accuracy distinguishing European and Brazilian Portuguese with a language model based on character 4-grams." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-58", "text": "This approach is adapted in to deal with Spanish varieties, where the role of knowledge-rich features (POS tags) is also explored." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-59", "text": "They report a 0.99 accuracy when binarily distinguishing Argentinean and Mexican Spanish with single words or bigrams." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-60", "text": "Trieschnigg et al. (2012) compare the performance of TextCat to the nearest neighbour and nearest prototype in combination with a cosine distance when distinguishing among sixteen varieties of Dutch." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-61", "text": "They report a micro-average F 1 -score of 0.799 (and a macro-average F 1 -score of 0.527) with a top F 1 -score of 0.987 when dealing with Frisian." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-62", "text": "Lui and Cook (2013) report experiments with different classifiers to map English documents to their country of origin." 
}, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-63", "text": "An SVM classifier with bag of words is top ranked with a macro-average 0.911 F 1 -score in a cross-domain setting and 0.975 in an in-domain setting." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-64", "text": "All these previous works (with the sole exception of Trieschnigg et al. (2012) , where a general purpose LI system yields a satisfactory performance) agree in the specificity of DSL regarding LI." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-65", "text": "Maybe because of that, two level approaches are not uncommon." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-66", "text": "Features used to discriminate seem to be languagegroup specific, altough word rather than character features seem to perform better (Zampieri and Gebre (2012) report best results for character 4-grams, however, given that European and Brazilian Portuguese do not completely share ortography)." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-67", "text": "----------------------------------" }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-68", "text": "**MAXIMUM ENTROPY MODELS AND FEATURE ENGINEERING**" }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-69", "text": "Maximum Entropy modelling is a general purpose machine learning framework that has proven to be highly expressive and powerful in many areas." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-70", "text": "Maximum Entropy (maxent) was first introduced into natural language processing by Berger et al. (1996) and Della Pietra et al. (1997) ." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-71", "text": "Since its introduction, Maximum Entropy techniques and the more general framework of Random Fields have been applied extensively to natural language processing problems, where maxent classifiers are commonly used as an alternative to Na\u00efve Bayes classifiers." 
}, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-72", "text": "In maxent modelling, the probability that an example x is in a class c is estimated from its bag of words (or n-grams) as:" }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-73", "text": "where f i (c, y) are indicator functions, w ci is the weight assigned to feature i in class c, and Z is a normalization factor." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-74", "text": "Features are modelled by indicator functions f i (c, y), which are evaluated to one when the feature i for a particular class c is true for a word y and zero otherwise." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-75", "text": "The following is an example of an indicator function modelling the presence of a particular word in a class:" }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-76", "text": "The class assigned to an example x is the most probable one:" }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-77", "text": "The maxent classifiers are implemented with the toolkit of Zhang Le (2004) , and the parameters of the model are estimated using Generalized Iterative Scaling (Darroch and Ratcli, 1972) ." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-78", "text": "Having chosen a closed approach to the DSL shared task, no other resources than the text samples given as training and development datasets have been used in features design." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-79", "text": "In this knowledge-poor approach to the problem, the maxent classifier has been trained with token and character n-gram features." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-80", "text": "Character-based features are obtained with a simple character tokenizer." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-81", "text": "However, for token-based features, texts are tokenized using an orthographic tokenizer which splits punctuation from words." 
}, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-82", "text": "Several bags of features have been considered during the experiments: single tokens (T1), single words from the list of the 10,000 most frequent tokens (L1), token bigrams (T2), and n-grams of character sequences of length from one to five (C1-5)." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-83", "text": "We will also refer to the lists of the 10,000 most frequent words as 'white list', which have a complementary role to the 'black lists' of Tiedemann and Ljube\u0161i\u0107 (2012) ." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-84", "text": "To determine which features are best suited to each group, we measured their performance using tenfold cross-validation on the training dataset and using the development dataset for testing." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-85", "text": "For group A, best results were obtained using bag of features consisting of variable length character n-grams ranging from one to five (C1-5)." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-86", "text": "On group B, token bigrams (T2) performed slightly better in the development set than in the training set than the 'white list' of tokens (L1), which seems to indicate a better generalisation of the former on unseen examples." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-87", "text": "Results for group C were similar for all features considered." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-88", "text": "Regarding groups D and E, token-based features got similar results, with slightly better results for token bigrams." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-89", "text": "Finally, for English (group F) results were generally bad, reaching the 'white list' the better results." 
}, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-90", "text": "Group F is known to contain more than a few misclassifications due to news cross citing between American and British press." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-91", "text": "Results for each group's best model using ten-fold cross-validation on the training dataset are shown in Table 1 ." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-92", "text": "All figures have been macro averaged, i.e., they have been computed averaging the ten folds." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-93", "text": "Because best results for each group are obtained with different feature sets, a new classifier is introduced." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-94", "text": "This classifier determines the language/variety group of each example before applying its particular group classifier." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-95", "text": "As can be seen in Table 2 , the degree of token overlap between languages and varieties of different groups is rather low compared with the degree of overlap within the same group." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-96", "text": "Using only tokens, total accuracy is reached on the training dataset using cross validation." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-97", "text": "A classifier applying several classifiers in the way we propose is known as a hierarchical two-level classifier." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-98", "text": "Table 3 : Macro-averaged Precision, Recall and F 1 -score on the DSL test dataset." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-99", "text": "Models are described in Table 1 ." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-100", "text": "Table 4 shows the confusion matrix for the classifier on the test dataset and Table 1 the results in terms of precision, recall and F 1 -score for each language and variety." 
}, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-101", "text": "As can be seen in Table 4 , no example has been classified outside in a wrong group." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-102", "text": "provide a baseline using a Na\u00efve Bayes classifier on character 5-grams." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-103", "text": "As can be seen if Table 3 is compared with Table 4 of , figures for group A are slightly below the baseline, groups B and C achieve the same results, D and E groups get slightly better results with the maxent classifier, and the biggest difference is found in group F, having better results Na\u00efve Bayes." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-104", "text": "The overall result without group F is similar: an F 1 -score of 0.947 for maxent and 0.942 for Na\u00efve Bayes." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-105", "text": "The DSL Corpus is composed of journalistic comparable texts to make the corpus suitable for discriminating similar languages and languages varieties but not text types or genres." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-106", "text": "Tiedemann and Ljube\u0161i\u0107 (2012) avoid biases towards topic and domain by experimenting with parallel texts reaching an overall accuracy of 90.3% for group A (br, hr, sr) using a 'black list' classifier and comparing its results with a Na\u00efve Bayes approach." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-107", "text": "They found that the 'black list' classifier generalise better than the Na\u00efve Bayes approach when moving from parallel to comparable corpora, since the former classifier is based on more informative features than the later." 
}, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-108", "text": "Results of ten-fold cross-validation on the training dataset for different feature settings for group E (Spanish) were consistent with those of , where word bigrams are reported to outperform character n-grams." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-109", "text": "Given that datasets are not identical, it is difficult to draw any conclusion from the 1.2% difference in accuracy between DSLRAE and ." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-110", "text": "Manual inspection of misclassified news suggests some textual properties that are specially challenging: a) high density of foreign proper names (Russian, Baby, Pony, Jack, . . . ) may dilute the evidence provided by vernacular words; b) conversely, low density of features specific to any variant (such as place or family names 5 , demonyms, lexical choices) may be insufficient to drive the text to the right class; this is also the case of some perfectly neutral sentences where a trained linguist could not spot any clue about their origin; c) certain syntactical idiosyncrasies (for example Argentinian idioms la pasas bien, tal como muchas veces, en exceso de) are not captured by bigrams; d) there are instances of cross-information, e. g., Argentinian news about Spain and vice versa where maybe more of a topic rather than a variety is being detected (e. g., news about Urdangar\u00edn or Fern\u00e1ndez de Kirchner); e) there are some typos and misspellings (carabanas, dosco) whose role remains unclear; e) finally, there is at least one text misclassified in the gold standard: it is labeled as Argentinian but it was written by the Spanish EFE news agency." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-111", "text": "Some of these difficulties cross-cut all language groups and are not specific to Spanish but rather to DSL as a task." 
}, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-112", "text": "In contrast to what Zampieri and Gebre (2012) found, ten-fold cross-validation on the training dataset for different feature settings on the DSL dataset did not find character n-grams to outperform word ngrams for group D (Portuguese)." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-113", "text": "It could be hypothesized that they used a unique source (newspaper) for each variety and therefore rigid editorial conventions could be at play; moreover, the collections were three years distant, so topic consistency could also be compromised 6 ." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-114", "text": "Manual inspection of mislabeled sentences shows some already known categories: evidence diluted by foreign words (Red Brick Warehouse, M\u00e9sz\u00e1ros, Fat Duck), poor evidence (Valongo, Sao Paulo) or cross-information (TAP, Bras\u00edlia)." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-115", "text": "There is, however, a Portuguese-specific issue: some texts obey the 1990 Orthographic Agreement 7 which blurs the orthographic distinctions regarding diacritics or consonant clusters; in fact, one sentence contains words following both standards (perspectiva and reprodu\u00e7\u00e3o)." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-116", "text": "It remains unexplained why word bigrams did not capture the Brazilian preference for passive voice (foram rebaixados), auxiliary + gerund chunks (estamos utilizando) or clitic dropping (lembro)." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-117", "text": "Despite findings by Tiedemann and Ljube\u0161i\u0107 (2012) , character n-grams performed better during tenfold cross-validation on the training dataset for different feature settings on the DSL dataset for group A (Bosnian, Croatian and Serbian)." 
}, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-118", "text": "Misclassified sentences involve failing to capture adapted place names (Belgiji,\u0160vedskoj) or derivational choices (organiziranog)." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-119", "text": "Results of ten-fold cross-validation on the training dataset for different feature settings for group B (Indonesian and Malay) top ranked word unigrams." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-120", "text": "Ranaivo-Malan\u00e7on (2006) uses number formatting and exclusive word lists." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-121", "text": "It can be hypothesized that lexical overlap is low (see Table 2 ) and/or frequency distributions are dissimilar thus allowing word unigrams to perform as well as 'white lists'." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-122", "text": "Languages of group C (Czech and Slovak) are dissimilar both orthographically and lexically." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-123", "text": "These dissimilarities are surprisingly well captured by the top 10,000 most frequent words." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-124", "text": "----------------------------------" }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-125", "text": "**CONCLUSIONS AND FUTURE WORK**" }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-126", "text": "In this paper, we have shown that a hierarchical classifier is well suited to discriminate among different language groups and languages or varieties therein." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-127", "text": "Different features are shown to better suit typological traits of supported languages." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-128", "text": "A comparison to previous approaches is provided, when available." 
}, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-129", "text": "In a multilingual setting, the effect of adding Galician to group D could be investigated." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-130", "text": "Focusing on Spanish language, we plan to geographically expand the classifier to deal with all national varieties, a much harder task as both Baldwin and Lui (2010) and remark." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-131", "text": "Moreover, the classifier could be used, as Tiedemann and Ljube\u0161i\u0107 (2012) suggest, to learn varieties discriminators to label texts beyond national classes (e.g. both Caribbean and Andean Spanish cross-cut national borders and, conversely, nations involved are known not to be dialectally uniform)." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-132", "text": "Given that error analysis showed that word bigrams fail to capture certain syntactical idiosyncrasies, a model with longer n-grams and/or knowledge-richer features such as POS sequences could also be explored, although report lower performance than knowledge-poor features." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-133", "text": "Finally, classification techniques such as those described in Gyawali et al. (2013) may be used to discard translations when building monolingual, vernacular corpora." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-134", "text": "A diachronic expansion, such as Trieschnigg et al. (2012) , is also in mind." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-135", "text": "Medieval Castilian coexisted with other Romance varieties such as Leonese or Aragonese whose features permeated Castilian texts." }, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-136", "text": "Researchers are in need of a tool to properly classify diachronic texts to accurately describe older stages of Spanish." 
}, { "sent_id": "8084b5077b2a8db755b1bbd0f6fe60-C001-137", "text": "Following the suggestion of Tiedemann and Ljube\u0161i\u0107 (2012) , we envisage the use of parallel texts such as versions of the Bible from different areas to learn the differences among varieties." } ], "y": { "@BACK@": { "gold_contexts": [ [ "8084b5077b2a8db755b1bbd0f6fe60-C001-12" ], [ "8084b5077b2a8db755b1bbd0f6fe60-C001-26" ], [ "8084b5077b2a8db755b1bbd0f6fe60-C001-39" ], [ "8084b5077b2a8db755b1bbd0f6fe60-C001-49" ] ], "cite_sentences": [ "8084b5077b2a8db755b1bbd0f6fe60-C001-12", "8084b5077b2a8db755b1bbd0f6fe60-C001-26", "8084b5077b2a8db755b1bbd0f6fe60-C001-39", "8084b5077b2a8db755b1bbd0f6fe60-C001-49" ] }, "@SIM@": { "gold_contexts": [ [ "8084b5077b2a8db755b1bbd0f6fe60-C001-83" ], [ "8084b5077b2a8db755b1bbd0f6fe60-C001-131" ] ], "cite_sentences": [ "8084b5077b2a8db755b1bbd0f6fe60-C001-83", "8084b5077b2a8db755b1bbd0f6fe60-C001-131" ] }, "@DIF@": { "gold_contexts": [ [ "8084b5077b2a8db755b1bbd0f6fe60-C001-117" ] ], "cite_sentences": [ "8084b5077b2a8db755b1bbd0f6fe60-C001-117" ] }, "@USE@": { "gold_contexts": [ [ "8084b5077b2a8db755b1bbd0f6fe60-C001-137" ] ], "cite_sentences": [ "8084b5077b2a8db755b1bbd0f6fe60-C001-137" ] } } }, "ABC_2a84615479af66bbf875517a3a753b_17": { "x": [ { "sent_id": "2a84615479af66bbf875517a3a753b-C001-105", "text": "Total 10 unique speakers participated in this work." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-156", "text": "We consider this observation as a future research direction." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-157", "text": "----------------------------------" }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-158", "text": "**CONCLUSIONS**" }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-2", "text": "In this paper, we are interested in exploiting textual and acoustic data of an utterance for the speech emotion classification task." 
}, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-3", "text": "The baseline approach models the information from audio and text independently using two deep neural networks (DNNs)." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-4", "text": "The outputs from both the DNNs are then fused for classification." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-5", "text": "As opposed to using knowledge from both the modalities separately, we propose a framework to exploit acoustic information in tandem with lexical data." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-6", "text": "The proposed framework uses two bi-directional long short-term memory (BLSTM) for obtaining hidden representations of the utterance." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-7", "text": "Furthermore, we propose an attention mechanism, referred to as the multi-hop, which is trained to automatically infer the correlation between the modalities." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-8", "text": "The multi-hop attention first computes the relevant segments of the textual data corresponding to the audio signal." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-9", "text": "The relevant textual data is then applied to attend parts of the audio signal." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-10", "text": "To evaluate the performance of the proposed system, experiments are performed in the IEMOCAP dataset." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-11", "text": "Experimental results show that the proposed technique outperforms the state-of-the-art system by 6.5% relative improvement in terms of weighted accuracy." 
}, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-12", "text": "----------------------------------" }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-13", "text": "**INTRODUCTION**" }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-14", "text": "In this era of high-performance computing, human-computer interaction (HCI) has become pervasive." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-15", "text": "To enrich the user experience, the system is often required to detect human emotion and produce a response with proper emotional context [1, 2] ." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-16", "text": "The first step in such an HCI involves building a system that recognizes emotion from the speech utterance." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-17", "text": "A speech emotional system aims to identify audio recording as belonging to one of the categories, like happy, sad, angry or neutral." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-18", "text": "Beside HCI, the output of emotion recognition engine is beneficial in the paralinguistic area as well [3] ." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-19", "text": "In this paper, we build a speech emotion recognition system that uses acoustic and textual information in tandem." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-20", "text": "Various approaches to address emotion recognition have been investigated in the literature." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-21", "text": "Most of the techniques involve extracting low-level or high-level acoustic features for this task [4] ." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-22", "text": "In emotion recognition, the lexical content of the audio recording is an important source of information that is usually ignored." 
}, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-23", "text": "For example, the presence of words such as \"gorgeous\" and \"stunning\" in the utterance would indicate that the person is happy." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-24", "text": "Recently researchers * Most work was done while the author was an intern at Adobe Research. have also explored the application of textual content of the speech signal for this task." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-25", "text": "In [5] , frame and supra-segmental level features (such as pitch and spectral contours) are derived from the speech signal." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-26", "text": "Textual information is used by spotting keywords that emphases the emotional states of the speaker." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-27", "text": "The work in [6] also presents an approach to exploit the acoustic and lexical content." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-28", "text": "In particular, they explored conventional acoustic features from the speech signal while the textual information is derived from the bag of word representation." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-29", "text": "Recently, deep neural network (DNN) has shown to provide good results for modeling acoustic and textual information for emotion identification." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-30", "text": "In [5] , textual and acoustic information of the utterance are used by a DNN to obtain hidden feature representations for both the modality." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-31", "text": "These features are then concatenated to represent the utterance and subsequently used to classify the emotion of the speaker." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-32", "text": "Experimental evidence shows the potential of the approach." 
}, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-33", "text": "In our previous work [7] , we applied a dual RNN in order to obtain a richer representation by blending the content and acoustic knowledge." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-34", "text": "In this paper, we improve upon our earlier work by incorporating an attention mechanism in the emotion recognition framework." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-35", "text": "The proposed attention mechanism is trained to exploit both textual and acoustic information in tandem." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-36", "text": "We refer to this attention method as the multi-hop." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-37", "text": "The multi-hop attention is designed to select relevant parts of the textual data, which is subsequently applied to attend to the segments of the audio signal for classification." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-38", "text": "We hypothesize that this approach would automatically detect the segments that contain information relevant for the task." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-39", "text": "The emotion recognition experiments are performed on the standard IEMOCAP dataset [8] ." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-40", "text": "Experimental results indicate that the proposed approach outperforms the state-of-the-art system published in the literature on this database by 6.5% relative improvement in terms of weighted accuracy." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-41", "text": "This paper is organized as follows." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-42", "text": "Section 2 provides a brief literature review on speech emotion recognition." 
}, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-43", "text": "In Section 3, we start by describing the baseline bidirectional recurrent encoder model considered in this paper, then introducing the proposed technique in detail." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-44", "text": "Experimental setup for evaluating the system and discussion of the achieved results by various systems are presented in Sections 4." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-45", "text": "Finally, the paper is concluded in Section 5." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-46", "text": "----------------------------------" }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-47", "text": "**RELATED WORK**" }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-48", "text": "Along with classical algorithms based models such as support vector machine (SVM), hidden markov model (HMM) and decision tree [9, 10, 11] , various neural network architectures have been recently introduced for the speech emotion recognition task." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-49", "text": "For exam-ple, convolutional neural network (CNN)-based models were trained on spectrograms or audio features such as mel-frequency cepstral coefficients (MFCCs) and low-level descriptors (LLDs) [12, 13, 14] ." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-76", "text": "Next, we attempt to use the processed textual information as another modality in predicting the emotion class of a given signal." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-50", "text": "More complex models such as [15] were designed to better learn nonlinear decision boundaries of emotional speech and achieved the best-recorded performance in audio modality models on IEMOCAP dataset [8] ." 
}, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-51", "text": "Several neural network models with attention mechanism have been proposed to efficiently focus on a prominent part of speech and learn temporal dependency within whole utterance [16, 17] ." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-52", "text": "Multi-modal approaches using acoustic features and textual information have been investigated." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-53", "text": "[5] identified emotional key phrases and salience of verbal cues from both phoneme sequences and words." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-54", "text": "Recently, [7, 18] combined acoustic information and conversation transcripts using a neural network-based model to improve emotion classification accuracy." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-55", "text": "However, none of these studies utilized attention method over audio and text modality in tandem for contextual understanding of the emotion in audio recording." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-56", "text": "----------------------------------" }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-57", "text": "**MODEL**" }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-58", "text": "This section describes the methodologies that are applied to the speech emotion recognition task." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-59", "text": "We start by introducing a baseline model, the bidirectional recurrent encoder, for encoding the audio and text modalities individually." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-60", "text": "We then propose an approach to exploit both audio and text data in tandem." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-61", "text": "In this technique, multihop attention is proposed to obtain relevant parts of audio and text data automatically." 
}, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-62", "text": "----------------------------------" }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-63", "text": "**BIDIRECTIONAL RECURRENT ENCODER**" }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-64", "text": "Motivated by the architecture used in [7, 17, 19] , we train a recurrent encoder to predict the categorical class of a given audio signal." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-65", "text": "To model the sequential nature of the speech signal, we use a bidirectional recurrent encoder (BRE) as shown in the Figure 1 (a)." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-66", "text": "We also added a residual connection to the model for promoting convergence during training [20] ." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-67", "text": "A sequence of feature vectors is fed as input to the BRE, which leads to the formation of hidden states of the model as given by the following equation:" }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-68", "text": "where f \u03b8 , f \u03b8 are the forward and backward long short-term memory (LSTM) with weight parameter \u03b8, ht represents the hidden state at t-th time step, and xt represents the t-th MFCC features in audio signal." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-69", "text": "The hidden representations ( \u2212 \u2192 h t, \u2190 \u2212 h t) from forward/backward LSTMs are concatenated for produce the feature, ot." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-70", "text": "To follow previous research [7] , we also add another prosodic feature vector, p, with each ot to generate a more informative vector representation of the signal, o A t ." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-71", "text": "Finally, an emotion class is predicted from the acoustic signal by applying a softmax function to the final hidden representation at the last time step, o A last ." 
}, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-72", "text": "We refer this model as audio-BRE with the objective function as follows:" }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-73", "text": "where yi,c is the true label vector, and\u0177i,c is the predicted probability distribution from the softmax layer." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-74", "text": "The W and the bias b are learned model parameters." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-75", "text": "C is the total number of classes, and N is the total number of samples used in training." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-77", "text": "To obtain textual hidden representation, o T t , we tokenize the transcript and feed it into the BRE in such a way that the acoustic signals are encoded by equation (1) ." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-78", "text": "We refer this model as text-BRE." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-79", "text": "The training objective for the text-BRE is same as the audio-BRE in equation (2)." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-80", "text": "----------------------------------" }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-81", "text": "**PROPOSED MULTI-HOP ATTENTION**" }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-82", "text": "We propose a novel multi-hop attention method to predict the importance of audio and text, referred to multi-hop attention (MHA)." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-83", "text": "Figure 1 shows the architecture of the proposed MHA model." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-84", "text": "Previous research used multi-modal information independently using neural network model by concatenating features from each modality [7, 21] ." 
}, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-85", "text": "As opposed to this approach, we propose a neural network architecture that exploits information in each modality by extracting relevant segments of the speech data using information from the lexical content (and vice-versa)." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-86", "text": "First, the acoustic and textual data are encoded with the audio-BRE and text-BRE, respectively, using equation (1)." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-87", "text": "We then consider the final hidden representation of audio-BRE, o" }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-88", "text": ", (i = 1, ..., t)" }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-89", "text": "The H 1 (equation 3) is a new hidden representation for textual information with consideration of audio modality." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-90", "text": "With this information, we apply 2nd-hop attention, referred to MHA-2, to the audio sequence." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-91", "text": "The final hidden representation of the MHA-2 model, H, is calculated as follows:" }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-92", "text": "where H 2 is a new hidden representation for audio information with the consideration of textual modality obtained from the MHA-1." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-93", "text": "Similarly to MHA-1, we further apply 3rd-hop attention to textual sequence, referred to MHA-3, with the new audio hidden representation H 2 (equation 4)." 
}, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-94", "text": "The final hidden representation of the MHA-3 model, H, is calculated as follows:" }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-95", "text": ", (i = 1, ..., t)" }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-96", "text": "where H 3 is updated representative vector of the textual information with the consideration of audio modality one more time." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-97", "text": "In each case, the final hidden representation, H, is passed through the softmax function to predict the four-categories emotion class." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-98", "text": "We use the same training objective as the BRE model with equation (2), and the predicted probability distribution for the target class,\u0177c is as follows:" }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-99", "text": "where projection matrix W and bias b are leaned model parameters." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-100", "text": "----------------------------------" }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-101", "text": "**EXPERIMENTS**" }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-102", "text": "----------------------------------" }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-103", "text": "**DATASET AND EXPERIMENTAL SETUP**" }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-104", "text": "To train and evaluate our model, we use the Interactive Emotional Dyadic Motion Capture (IEMOCAP) [8] dataset, which includes five sessions of utterances between two speakers (one male and one female)." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-106", "text": "For consistent comparison with previous works [7, 18] , all utterances labeled \"excitement\" are merged with those labeled \"happiness\"." 
}, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-107", "text": "We assign single categorical emotion to the utterance with majority of annotators agreed on the emotion labels." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-108", "text": "The final dataset contains 5,531 utterances in total (1,636 happy, 1,084 sad, 1,103 angry and 1,708 neutral)." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-109", "text": "In the training process, we perform 10-fold cross-validation where each 8, 1, 1 folds are used for the train set, development set, and test set, respectively." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-110", "text": "----------------------------------" }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-111", "text": "**FEATURE EXTRACTION AND IMPLEMENTATION DETAILS**" }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-112", "text": "As this research is extended work from previous research [7] , we use the same feature extraction method as done in our previous work." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-113", "text": "After extracting 40-dimensional Mel-frequency cepstral coefficients (MFCC) feature (frame size is set to 25 ms at a rate of 10 ms with the Hamming window) using Kaldi [22] , we concatenate it with its first, second order derivates, making the feature dimension to 120." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-114", "text": "We also extract prosodic features by using OpenSMILE toolkit [23] and appending it to the audio feature vector." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-115", "text": "In preparing the textual dataset, we first use the ground-truth transcripts of the IEMOCAP dataset." 
}, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-116", "text": "In a practical scenario where we may not access to transcripts of the audio, we obtain all of the transcripts from the speech signal using a commercial ASR system [24] (The performance of the ASR system is word error rate (WER) of 5.53%)." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-117", "text": "We apply word-tokenizer to the transcripts and obtain sequential data for textual input." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-118", "text": "The maximum length of an audio segment is set to 750 based on the implementation choices presented in [25] and 128 for the textual input which covers the maximum length of the tokenized transcripts." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-119", "text": "We minimize the cross-entropy loss function using (equation (2)) the Adam optimizer [26] with a learning rate of 1e-3 and gradients clipped with a norm value of 1." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-120", "text": "For the purposes of regularization, we apply the dropout method, 30%." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-121", "text": "The number of hidden units and the number of layers in the RNN for each model (BRE and MHA) are optimized on the development set." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-122", "text": "----------------------------------" }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-123", "text": "**PERFORMANCE EVALUATION**" }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-124", "text": "To measure the performance of systems, we report the weighted accuracy (WA) and unweighted accuracy (UA) averaging over the 10-fold cross-validation experiments." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-125", "text": "We use the same dataset and features as other researchers [7, 18] ." 
}, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-126", "text": "Table 1 presents performances of proposed approaches for recognizing speech emotion in comparison with various models." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-127", "text": "To compare our results from previous approaches, we first use groundtruth transcripts included in the dataset in training textual modality." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-128", "text": "From the previous model, E vec-MCNN-LSTM encodes acoustic signal and textual information using a neural network (RNN and CNN, respectively) and then fuse each result by concatenating and feeding them into following (SVM) to predict emotion labels." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-129", "text": "On the other hand, MDRE model use dual-RNNs to encode both the modalities and merge the results using another fully-connect neural network layer." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-130", "text": "This MDRE approach applies end-to-end learning and outperforms E vec-MCNN-LSTM by 10.6% relative (0.649 to 0.718 absolute) in terms of WA." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-131", "text": "Among our proposed system, the audio-BRE model that uses an acoustic signal with bidirectional-RNN architecture achieves WA 0.646." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-132", "text": "Interestingly, the text-BRE model that use textual information shows higher performance than that of audio-BRE by 8% relative (0.646 to 0.698) in WA." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-133", "text": "The multi-hop attention model, MHA-N, (N = 1, 2, 3) , shows a substantial performance gain." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-134", "text": "In particular, the MHA-2 model (best performing system among MHA-N) outperformed the best baseline model, MDRE, by 6.5% relative (0.718 to 0.765) in WA." 
}, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-135", "text": "Although we observe performance degradation in the MHA-3 model, we believe that this could be due to the limited data for training." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-136", "text": "Table 1 ." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-137", "text": "Model performance comparisons." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-138", "text": "The top 2 bestperforming models (according to the WA) are marked in bold." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-139", "text": "The \"-ASR\" models use ASR-processed transcripts from the Google Cloud Speech API." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-140", "text": "The \"A\" and \"T\" in modality indicate \"Audio\" and \"Text\", receptively." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-141", "text": "In a practical scenario, we may not access the audio transcripts." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-142", "text": "We describe the effect of using ASR-processed transcripts on the proposed system." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-143", "text": "From table 1, we observe performance degradation in text-BRE-ASR and MHA-2-ASR (our best system), compared to that of text-BRE and MHA-2 by 6.6% (0.698 to 0.652) and 4.6% (0.765 to 0.730) relative in WA, receptively." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-144", "text": "Even with the erroneous transcripts (WER = 5.53%), however, the proposed approach (MHA-2-ASR) outperforms the best baseline system (MDRE) by 1.6% relative (0.718 to 0.730) in terms of WA." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-145", "text": "Figure 2 shows the confusion matrices of the proposed systems." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-146", "text": "In audio-BRE (Fig. 2(a) ), most of the emotion labels are frequently misclassified as neutral class, supporting the claims of [7, 25] ." 
}, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-147", "text": "The text-BRE shows improvement in classifying most of the labels in Fig. 2(b) ." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-148", "text": "In particular, angry and happy classes are correctly classified by 32% (57.14 to 75.41) and 63% (40.21 to 65.56) relative in accuracy with respect to audio-BRE, receptively." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-149", "text": "However, it incorrectly predicted instances of the happy class as sad class in 10% of the time, even though these emotional states are opposites of one another." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-150", "text": "----------------------------------" }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-151", "text": "**ERROR ANALYSIS**" }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-152", "text": "The MHA-2 (our best system, Fig. 2(c) ) compensates for the weaknesses of the single modality models and benefits from their strengths." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-153", "text": "It shows significant performance gain for angry, happy, sad and neutral classes by 6%, 20%, 15% and 13% relative in accuracy with respect to text-BRE." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-154", "text": "It also correctly classify neutral class similar to that of audio-BRE (81.63 and 78.00 for audio-BRE and MHA-2, receptively)." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-155", "text": "Interestingly, although MHA-2 shows superior discriminating ability among emotion classes, it still shows the tendency such that most of the incorrect cases are misclassified into neutral class." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-159", "text": "In this paper, we propose a multi-hop attention model to combine acoustic and textual data for speech emotion recognition task." 
}, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-160", "text": "The proposed attention method is designed to select relevant parts of the textual data, which is subsequently applied to attend to the segments of the audio signal for classification." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-161", "text": "Extensive experiments show that the proposed MHA-2 outperforms the best baseline system in classifying the four emotion categories by 6.5% (0.718 to 0.765 absolute) in terms of WA when the model is applied to the IEMOCAP dataset." }, { "sent_id": "2a84615479af66bbf875517a3a753b-C001-162", "text": "We further test our model with ASR-processed transcripts and achieve WA 0.73 that shows the reliability of the proposed system (MHA-2-ASR) in the practical scenario where the ground-truth transcripts are not available." } ], "y": { "@BACK@": { "gold_contexts": [ [ "2a84615479af66bbf875517a3a753b-C001-33" ], [ "2a84615479af66bbf875517a3a753b-C001-54" ], [ "2a84615479af66bbf875517a3a753b-C001-84" ] ], "cite_sentences": [ "2a84615479af66bbf875517a3a753b-C001-33", "2a84615479af66bbf875517a3a753b-C001-54", "2a84615479af66bbf875517a3a753b-C001-84" ] }, "@EXT@": { "gold_contexts": [ [ "2a84615479af66bbf875517a3a753b-C001-33", "2a84615479af66bbf875517a3a753b-C001-34" ], [ "2a84615479af66bbf875517a3a753b-C001-112" ] ], "cite_sentences": [ "2a84615479af66bbf875517a3a753b-C001-33", "2a84615479af66bbf875517a3a753b-C001-112" ] }, "@MOT@": { "gold_contexts": [ [ "2a84615479af66bbf875517a3a753b-C001-54", "2a84615479af66bbf875517a3a753b-C001-55" ], [ "2a84615479af66bbf875517a3a753b-C001-64" ] ], "cite_sentences": [ "2a84615479af66bbf875517a3a753b-C001-54", "2a84615479af66bbf875517a3a753b-C001-64" ] }, "@USE@": { "gold_contexts": [ [ "2a84615479af66bbf875517a3a753b-C001-70" ], [ "2a84615479af66bbf875517a3a753b-C001-106" ], [ "2a84615479af66bbf875517a3a753b-C001-125" ] ], "cite_sentences": [ "2a84615479af66bbf875517a3a753b-C001-70", "2a84615479af66bbf875517a3a753b-C001-106", 
"2a84615479af66bbf875517a3a753b-C001-125" ] }, "@DIF@": { "gold_contexts": [ [ "2a84615479af66bbf875517a3a753b-C001-84", "2a84615479af66bbf875517a3a753b-C001-85" ] ], "cite_sentences": [ "2a84615479af66bbf875517a3a753b-C001-84" ] }, "@SIM@": { "gold_contexts": [ [ "2a84615479af66bbf875517a3a753b-C001-146" ] ], "cite_sentences": [ "2a84615479af66bbf875517a3a753b-C001-146" ] } } }, "ABC_d2b9c678a3d4920919f59c3b5903d3_17": { "x": [ { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-75", "text": "----------------------------------" }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-76", "text": "**SUPERVISION**" }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-15", "text": "Formally, a language model (LM) is a probability distribution over strings of a language:" }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-16", "text": "where x is a sentence and t indicates a word position." }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-72", "text": "3" }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-13", "text": "----------------------------------" }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-14", "text": "**LANGUAGE MODELING**" }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-73", "text": "----------------------------------" }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-2", "text": "We recast syntactic parsing as a language modeling problem and use recent advances in neural network language modeling to achieve a new state of the art for constituency Penn Treebank parsing -93.8 F 1 on section 23, using 2-21 as training, 24 as development, plus tri-training." }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-3", "text": "When trees are converted to Stanford dependencies, UAS and LAS are 95.9% and 94.1%." 
}, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-4", "text": "----------------------------------" }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-5", "text": "**INTRODUCTION**" }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-6", "text": "Recent work on deep learning syntactic parsing models has achieved notably good results, e.g., Dyer et al. (2016) with 92.4 F 1 on Penn Treebank constituency parsing and Vinyals et al. (2015) with 92.8 F 1 ." }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-7", "text": "In this paper we borrow from the approaches of both of these works and present a neural-net parse reranker that achieves very good results, 93.8 F 1 , with a comparatively simple architecture." }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-8", "text": "In the remainder of this section we outline the major difference between this and previous workviewing parsing as a language modeling problem." }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-9", "text": "Section 2 looks more closely at three of the most relevant previous papers." }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-10", "text": "We then describe our exact model (Section 3), followed by the experimental setup and results (Sections 4 and 5)." }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-11", "text": "There is a one-to-one mapping between a tree and its sequential form." }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-12", "text": "(Part-of-speech tags are not used.)" }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-17", "text": "The efforts in language modeling go into computing P (x t |x 1 , \u00b7 \u00b7 \u00b7 , x t\u22121 ), which as described next is useful for parsing as well." 
}, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-18", "text": "----------------------------------" }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-19", "text": "**PARSING AS LANGUAGE MODELING**" }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-20", "text": "A generative parsing model parses a sentence (x) into its phrasal structure (y) according to" }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-21", "text": "where Y(x) lists all possible structures of x. If we think of a tree (x, y) as a sequence (z) (Vinyals et al., 2015) as illustrated in Figure 1 , we can define a probability distribution over (x, y) as follows:" }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-22", "text": "which is equivalent to Equation (1)." }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-23", "text": "We have reduced parsing to language modeling and can use language modeling techniques of estimating P (z t |z 1 , \u00b7 \u00b7 \u00b7 , z t\u22121 ) for parsing." }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-24", "text": "----------------------------------" }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-25", "text": "**PREVIOUS WORK**" }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-26", "text": "We look here at three neural net (NN) models closest to our research along various dimensions." }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-27", "text": "The first (Zaremba et al., 2014) gives the basic language modeling architecture that we have adopted, while the other two (Vinyals et al., 2015; Dyer et al., 2016) are parsing models that have the current best results in NN parsing." }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-28", "text": "----------------------------------" }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-29", "text": "**LSTM-LM**" }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-30", "text": "The LSTM-LM of Zaremba et al. 
(2014) turns (x_1, ..., x_{t-1}) into h_t, a hidden state of an LSTM (Hochreiter and Schmidhuber, 1997; Gers et al., 2003; Graves, 2013), and uses h_t to guess x_t:" }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-31", "text": "where W is a parameter matrix and [i] indexes the ith element of a vector." }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-32", "text": "The simplicity of the model makes it easily extendable and scalable, which has inspired a character-based LSTM-LM that works well for many languages (Kim et al., 2016) and an ensemble of large LSTM-LMs for English with an astonishing perplexity of 23.7 (Jozefowicz et al., 2016)." }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-33", "text": "In this paper, we build a parsing model based on the LSTM-LM of Zaremba et al. (2014)." }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-34", "text": "Vinyals et al. (2015) observe that a phrasal structure (y) can be expressed as a sequence and build a machine translation parser (MTP), a sequence-to-sequence model, which translates x into y using a conditional probability:" }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-35", "text": "----------------------------------" }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-36", "text": "**MTP**" }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-37", "text": "where the conditioning event (x, y_1, ..., y_{t-1}) is modeled by an LSTM encoder and an LSTM decoder." }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-38", "text": "The encoder maps x into h_e, a set of vectors that represents x, and the decoder obtains a summary vector (h_t), which is the concatenation of the decoder's hidden state (h_t^d) and a weighted sum of word representations (" }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-39", "text": "with an alignment vector (\u03b1)."
}, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-40", "text": "Finally the decoder predicts y t given h t ." }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-41", "text": "Inspired by MTP, our model processes sequential trees." }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-42", "text": "----------------------------------" }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-43", "text": "**RNNG**" }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-44", "text": "Recurrent Neural Network Grammars (RNNG), a generative parsing model, defines a joint distribution over a tree in terms of actions the model takes to generate the tree (Dyer et al., 2016) :" }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-45", "text": "where a is a sequence of actions whose output precisely matches the sequence of symbols in z, which implies Equation (3) is the same as Equation (2)." }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-46", "text": "RNNG and our model differ in how they compute the conditioning event (z 1 , \u00b7 \u00b7 \u00b7 , z t\u22121 ): RNNG combines hidden states of three LSTMs that keep track of actions the model has taken, an incomplete tree the model has generated and words the model has generated whereas our model uses one LSTM's hidden state as shown in the next section." }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-47", "text": "----------------------------------" }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-48", "text": "**MODEL**" }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-49", "text": "Our model, the model of Zaremba et al. (2014) applied to sequential trees and we call LSTM-LM from now on, is a joint distribution over trees:" }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-50", "text": "where h t is a hidden state of an LSTM." 
}, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-51", "text": "Due to lack of an algorithm that searches through an exponentially large phrase-structure space, we use an n-best parser to reduce Y(x) to Y (x), whose size is polynomial, and use LSTM-LM to find y that satisfies" }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-52", "text": "----------------------------------" }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-53", "text": "**HYPER-PARAMETERS**" }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-54", "text": "The model has three LSTM layers with 1,500 units and gets trained with truncated backpropagation through time with mini-batch size 20 and step size 50." }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-55", "text": "We initialize starting states with previous minibatch's last hidden states (Sutskever, 2013) ." }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-56", "text": "The forget gate bias is initialized to be one (Jozefowicz et al., 2015) and the rest of model parameters are sampled from U(\u22120.05, 0.05)." }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-57", "text": "Dropout is applied to non-recurrent connections (Pham et al., 2014) and gradients are clipped when their norm is bigger than 20 (Pascanu et al., 2013) ." }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-58", "text": "The learning rate is 0.25 \u00b7 0.85 max ( \u221215, 0) where is an epoch number." }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-59", "text": "For simplicity, we use vanilla softmax over an entire vocabulary as opposed to hierarchical softmax (Morin and Bengio, 2005) or noise contrastive estimation (Gutmann and Hyv\u00e4rinen, 2012)." 
}, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-60", "text": "----------------------------------" }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-61", "text": "**EXPERIMENTS**" }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-62", "text": "We describe datasets we use for evaluation, detail training and development processes." }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-63", "text": "1" }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-64", "text": "----------------------------------" }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-65", "text": "**DATA**" }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-66", "text": "We use the Wall Street Journal (WSJ) of the Penn Treebank (Marcus et al., 1993) for training (2-21), development (24) and testing (23) and millions of auto-parsed \"silver\" trees (McClosky et al., 2006; Huang et al., 2010; Vinyals et al., 2015) for tritraining." }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-67", "text": "To obtain silver trees, we parse the entire section of the New York Times (NYT) of the fifth Gigaword (Parker et al., 2011 ) with a product of eight Berkeley parsers (Petrov, 2010) 2 and ZPar (Zhu et al., 2013) and select 24 million trees on which both parsers agree (Li et al., 2014) ." }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-68", "text": "We do not resample trees to match the sentence length distribution of the NYT to that of the WSJ (Vinyals et 1 The code and trained models used for experiments are available at github.com/cdg720/emnlp2016." }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-69", "text": "2 We use the reimplementation by Huang et al. (2010) ." }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-70", "text": "(Charniak, 2000) performed better when trained on all of 24 million trees than when trained on resampled two million trees." 
}, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-71", "text": "Given x, we produce Y (x), 50-best trees, with Charniak parser and find y with LSTM-LM as Dyer et al. (2016) do with their discriminative and generative models." }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-74", "text": "**TRAINING AND DEVELOPMENT**" }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-77", "text": "We unk words that appear fewer than 10 times in the WSJ training (6,922 types) and drop activations with probability 0.7." }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-78", "text": "At the beginning of each epoch, we shuffle the order of trees in the training data." }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-79", "text": "Both perplexity and F 1 of LSTM-LM (G) improve and then plateau (Figure 2) ." }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-80", "text": "Perplexity, the Base Final Vinyals et al. (2015) 88.3 90.5 Dyer et al. (2016) 89.8 92.4 LSTM-LM (G) 89.7 92.6 We also evaluate our model with varying n-best trees including optimal 51-best trees that contain gold trees (51 o )." }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-81", "text": "As shown in Table 1, the LSTM-LM (G) is robust given sufficiently large n, i.e. 50, but does not exhibit its full capacity because of search errors in Charniak parser." }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-82", "text": "We address this problem in Section 5.3." }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-83", "text": "----------------------------------" }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-84", "text": "**SEMI-SUPERVISION**" }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-85", "text": "We unk words that appear at most once in the training (21,755 types)." }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-86", "text": "We drop activations with probability 0.45, smaller than 0.7, thanks to many silver trees, which help regularization." 
}, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-87", "text": "We train LSTM-LM (GS) on the WSJ and a different set of 400,000 NYT trees for each epoch except for the last one during which we use the WSJ only." }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-88", "text": "Training takes 26 epochs and 68 hours on a Titan X. LSTM-LM (GS) achieves 92.5 F 1 on the development." }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-89", "text": "----------------------------------" }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-90", "text": "**RESULTS**" }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-91", "text": "----------------------------------" }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-92", "text": "**SUPERVISION**" }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-93", "text": "As shown in Table 2 , with 92.6 F 1 LSTM-LM (G) outperforms an ensemble of five MTPs (Vinyals et al., 2015) and RNNG (Dyer et al., 2016) , both of which are trained on the WSJ only." }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-94", "text": "----------------------------------" }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-95", "text": "**SEMI-SUPERVISION**" }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-96", "text": "We compare LSTM-LM (GS) to two very strong semi-supervised NN parsers: an ensemble of five MTPs trained on 11 million trees of the highconfidence corpus 4 (HC) (Vinyals et al., 2015) ; and an ensemble of six one-to-many sequence models trained on the HC and 4.5 millions of EnglishGerman translation sentence pairs (Luong et al., 2016) ." }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-97", "text": "We also compare LSTM-LM (GS) to best performing non-NN parsers in the literature." }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-98", "text": "Parsers' parsing performance along with their training data is reported in Table 3 ." 
}, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-99", "text": "LSTM-LM (GS) outperforms all the other parsers with 93.1 F 1 ." }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-100", "text": "----------------------------------" }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-101", "text": "**IMPROVED SEMI-SUPERVISION**" }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-102", "text": "Due to search errors -good trees are missing in 50-best trees -in Charniak (G), our supervised and semi-supervised models do not exhibit their full potentials when Charniak (G) provides Y (x)." }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-103", "text": "To mitigate the search problem, we tri-train Charniak (GS) on all of 24 million NYT trees in addition to the WSJ, to yield Y (x)." }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-104", "text": "As shown in Table 3 , both LSTM-LM (G) and LSTM-LM (GS) are affected by the quality of Y (x)." }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-105", "text": "A single LSTM-LM (GS) together with Charniak (GS) reaches 93.6 and an ensemble of eight LSTM-LMs (GS) with Charniak (GS) achieves a new state of the art, 93.8 F 1 ." }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-106", "text": "When trees are converted to Stanford dependencies, 5 UAS and LAS are 95.9% and 94.1%, 6 more than 1% higher than those of the state of the art dependency parser (Andor et al., 2016) ." }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-107", "text": "Why an indirect method (converting trees to dependencies) is more accurate than a direct one (dependency parsing) remains unanswered (Kong and Smith, 2014) ." 
}, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-108", "text": "----------------------------------" }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-109", "text": "**CONCLUSION**" }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-110", "text": "The generative parsing model we presented in this paper is very powerful." }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-111", "text": "In fact, we see that a generative parsing model, LSTM-LM, is more effective than discriminative parsing models (Dyer et al., 2016) ." }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-112", "text": "We suspect building large models with character embeddings would lead to further improvement as in language modeling (Kim et al., 2016; Jozefowicz et al., 2016) ." }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-113", "text": "We also wish to develop a complete parsing model using the LSTM-LM framework." }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-114", "text": "Table 3 : Evaluation of models trained on the WSJ and additional resources." }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-115", "text": "Note that the numbers of Vinyals et al. (2015) and Luong et al. (2016) are not directly comparable as their models are evaluated on OntoNotesstyle trees instead of PTB-style trees." }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-116", "text": "E(LSTM-LMs (GS)) is an ensemble of eight LSTM-LMs (GS)." }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-117", "text": "X/Y in Silver column indicates the number of silver trees used to train Charniak parser and LSTM-LM." }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-118", "text": "For the ensemble model, we report the maximum number of trees used to train one of LSTM-LMs (GS)." }, { "sent_id": "d2b9c678a3d4920919f59c3b5903d3-C001-119", "text": "at Brown University for setting up GPU machines and David McClosky for helping us train Charniak parser on millions trees." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "d2b9c678a3d4920919f59c3b5903d3-C001-6" ], [ "d2b9c678a3d4920919f59c3b5903d3-C001-21" ] ], "cite_sentences": [ "d2b9c678a3d4920919f59c3b5903d3-C001-6", "d2b9c678a3d4920919f59c3b5903d3-C001-21" ] }, "@USE@": { "gold_contexts": [ [ "d2b9c678a3d4920919f59c3b5903d3-C001-6", "d2b9c678a3d4920919f59c3b5903d3-C001-7" ], [ "d2b9c678a3d4920919f59c3b5903d3-C001-27" ], [ "d2b9c678a3d4920919f59c3b5903d3-C001-66" ], [ "d2b9c678a3d4920919f59c3b5903d3-C001-96" ] ], "cite_sentences": [ "d2b9c678a3d4920919f59c3b5903d3-C001-6", "d2b9c678a3d4920919f59c3b5903d3-C001-27", "d2b9c678a3d4920919f59c3b5903d3-C001-66", "d2b9c678a3d4920919f59c3b5903d3-C001-96" ] }, "@DIF@": { "gold_contexts": [ [ "d2b9c678a3d4920919f59c3b5903d3-C001-93" ], [ "d2b9c678a3d4920919f59c3b5903d3-C001-115" ] ], "cite_sentences": [ "d2b9c678a3d4920919f59c3b5903d3-C001-93", "d2b9c678a3d4920919f59c3b5903d3-C001-115" ] } } }, "ABC_bd2a718f75d206ef3f2cb5648585d5_17": { "x": [ { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-53", "text": "7-2 B is non-sense / confusing." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-54", "text": "7-3 B does not address the topic." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-55", "text": "7-4 B is generally weak / vague." 
}, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-56", "text": "----------------------------------" }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-57", "text": "**POSITIVE PROPERTIES OF ARGUMENT A**" }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-84", "text": "**MATCHING THEORY AND PRACTICE**" }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-135", "text": "**AGREEMENT OF THE CROWD WITH EXPERTS**" }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-161", "text": "**CONCLUSION**" }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-2", "text": "Argumentation quality is viewed differently in argumentation theory and in practical assessment approaches." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-3", "text": "This paper studies to what extent the views match empirically." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-4", "text": "We find that most observations on quality phrased spontaneously are in fact adequately represented by theory." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-5", "text": "Even more, relative comparisons of arguments in practice correlate with absolute quality ratings based on theory." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-6", "text": "Our results clarify how the two views can learn from each other." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-7", "text": "----------------------------------" }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-9", "text": "The assessment of argumentation quality is critical for any application built upon argument mining, such as debating technologies (Rinott et al., 2015) ." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-10", "text": "However, research still disagrees on whether quality should be assessed from a theoretical or from a practical viewpoint (Allwood, 2016) ." 
}, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-11", "text": "Theory states, among other things, that a cogent argument has acceptable premises that are relevant to its conclusion and sufficient to draw the conclusion (Johnson and Blair, 2006) ." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-12", "text": "Practitioners object that such quality dimensions are hard to assess for real-life arguments (Habernal and Gurevych, 2016b) ." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-13", "text": "Moreover, the normative nature of theory suggests absolute quality ratings, but in practice it seems much easier to state which argument is more convincing-a relative assessment." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-14", "text": "Consider two debate-portal arguments for \"advancing the common good is better than personal pursuit\", taken from the corpora analyzed later in this paper: Argument A \"While striving to make advancements for the common good you can change the world forever." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-15", "text": "Allot of people have succeded in doing so." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-16", "text": "Our founding fathers, Thomas Edison, George Washington, Martin Luther King jr, and many more." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-17", "text": "These people made huge advances for the common good and they are honored for it.\"" }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-18", "text": "Argument B \"I think the common good is a better endeavor, because it's better to give then to receive." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-19", "text": "It's better to give other people you're hand out in help then you holding your own hand.\"" }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-20", "text": "In the study of Habernal and Gurevych (2016b) , annotators assessed Argument A as more convincing than B. 
When giving reasons for their assessment, though, they saw A as more credible and well thought through; that does not seem to be too far from the theoretical notion of cogency." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-21", "text": "This paper gives empirical answers to the question of how different the theoretical and practical views of argumentation quality actually are. Section 2 briefly reviews existing theories and practical approaches." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-22", "text": "Section 3 then empirically analyzes correlations in two recent argument corpora, one annotated for 15 well-defined quality dimensions taken from theory (Wachsmuth et al., 2017a) and one with 17 reasons for quality differences phrased spontaneously in practice (Habernal and Gurevych, 2016a) ." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-23", "text": "In a crowdsourcing study, we test whether lay annotators achieve agreement on the theoretical quality dimensions (Section 4)." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-24", "text": "We find that assessments of overall argumentation quality largely match in theory and practice." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-25", "text": "Nearly all phrased reasons are adequately represented in theory." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-26", "text": "However, some theoretical quality dimensions seem hard to separate in practice." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-27", "text": "Most importantly, we provide evidence that the observed relative quality differences are reflected in absolute quality ratings." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-28", "text": "Still, our study underpins the fact that the theory-based argumentation quality assessment remains complex." 
}, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-29", "text": "Our results do not generally answer the question of what view of argumentation quality is preferable, but they clarify where theory can learn from practice and vice versa." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-30", "text": "In particular, practical approaches indicate what to focus on to simplify theory, whereas theory seems beneficial to guide quality assessment in practice." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-31", "text": "Wachsmuth et al. (2017a) ." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-32", "text": "----------------------------------" }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-33", "text": "**THEORY VERSUS PRACTICE**" }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-34", "text": "This section outlines major theories and practical approaches to argumentation quality assessment, including those we compare in the present paper." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-35", "text": "----------------------------------" }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-36", "text": "**THEORETICAL VIEWS OF QUALITY ASSESSMENT**" }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-37", "text": "Argumentation theory discusses logical, rhetorical, and dialectical quality." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-38", "text": "As few real-life arguments are logically sound, requiring true premises that deductively entail a conclusion, cogency (as defined in Section 1) is largely seen as the main logical quality (Johnson and Blair, 2006; Damer, 2009; Govier, 2010) ." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-39", "text": "Toulmin (1958) models the general structure of logical arguments, and Walton et al. (2008) analyze schemes of fallacies and strong arguments." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-40", "text": "A fallacy is a kind of error that undermines reasoning (Tindale, 2007) ." 
}, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-41", "text": "Strength may mean cogency but also rhetorical effectiveness (Perelman and Olbrechts-Tyteca, 1969) ." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-42", "text": "Rhetoric has been studied since Aristotle (2007) who developed the notion of the means of persuasion (logos, ethos, pathos) and their linguistic delivery in terms of arrangement and style." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-43", "text": "Dialectical quality dimensions resemble those of cogency, but arguments are judged specifically by their reasonableness for achieving agreement (van Eemeren and Grootendorst, 2004) ." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-44", "text": "Wachsmuth et al. (2017a) point out that dialectical builds on rhetorical, and rhetorical builds on logical quality." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-45", "text": "They derive a unifying taxonomy from the major theories, decomposing quality hierarchically into cogency, effectiveness, reasonableness, and subdimensions." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-46", "text": "5-1 B is attacking / abusive." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-47", "text": "5-2 B has language/grammar issues, or uses humour or sarcasm." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-48", "text": "5-3 B is unclear / hard to follow." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-49", "text": "6-1 B has no credible evidence / no facts." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-50", "text": "6-2 B has less or insufficient reasoning." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-51", "text": "6-3 B uses irrelevant reasons." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-52", "text": "7-1 B is only an opinion / a rant." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-58", "text": "8-1 A has more details/facts/examples, has better reasoning / is deeper." 
}, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-59", "text": "8-4 A is objective / discusses other views." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-60", "text": "8-5 A is more credible / confident." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-61", "text": "9-1 A is clear / crisp / well-written." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-62", "text": "9-2 A sticks to the topic." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-63", "text": "9-3 A makes you think. 9-4 A is well thought through / smart." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-64", "text": "----------------------------------" }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-65", "text": "**OVERALL**" }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-66", "text": "Conv A is more convincing than B. Table 2 : The 17+1 practical reason labels given in the corpus of Habernal and Gurevych (2016a) ." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-67", "text": "covered." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-68", "text": "In Section 3, we use their absolute quality ratings from 1 (low) to 3 (high) annotated by three experts for each dimension of 304 arguments taken from the UKPConvArg1 corpus detailed below." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-69", "text": "----------------------------------" }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-70", "text": "**PRACTICAL VIEWS OF QUALITY ASSESSMENT**" }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-71", "text": "There is an application area where absolute quality ratings of argumentative text are common practice: essay scoring (Beigman Klebanov et al., 2016) ." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-72", "text": "Persing and Ng (2015) annotated the argumentative strength of essays composing multiple arguments with notable agreement." 
}, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-73", "text": "For single arguments, however, all existing approaches that we are aware of assess quality in relative terms, e.g., Cabrio and Villata (2012) find accepted arguments based on attack relations, Wei et al. (2016) rank arguments by their persuasiveness, and Wachsmuth et al. (2017b) rank them by their relevance." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-74", "text": "Boudry et al. (2015) argue that normative concepts such as fallacies rarely apply to real-life arguments and that they are too sophisticated for operationalization." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-75", "text": "Based on the idea that relative assessment is easier, Habernal and Gurevych (2016b) crowdsourced the UKPConvArg1 corpus." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-76", "text": "Argument pairs (A, B) from a debate portal were classified as to which argument is more convincing." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-77", "text": "Without giving any guidelines, the authors also asked for reasons as to why A is more convincing than B. In a follow-up study (Habernal and Gurevych, 2016a) , these reasons were used to derive a hierarchical annotation scheme." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-78", "text": "9111 argument pairs were then labeled with one or more of the 17 reason labels in Table 2 Negative Properties of Argument B Positive Properties of Argument A Quality Dimension 5-1 5-2 5-3 6-1 6-2 6-3 7-1 7-2 7-3 7-4 8-1 8-4 8-5 9-1 9-2 9-3 9- Wachsmuth et al. (2017a) given for each of the 17+1 reason labels of Habernal and Gurevych (2016a) ." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-79", "text": "Bold/gray: Highest/lowest value in each column." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-80", "text": "Bottom row: The number of labels for each dimension." 
}, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-81", "text": "by crowd workers (UKPConvArg2)." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-82", "text": "These pairs represent the practical view in our experiments." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-83", "text": "----------------------------------" }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-85", "text": "We now report on experiments that we performed to examine to what extent the theory and practice of argumentation quality assessment match." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-86", "text": "1" }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-87", "text": "----------------------------------" }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-88", "text": "**CORPUS-BASED COMPARISON OF THE VIEWS**" }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-89", "text": "Several dimensions and reasons in Tables 1 and 2 seem to refer to the same or opposite property, e.g., clarity and 5-3 (unclear)." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-90", "text": "This raises the question of how absolute ratings of arguments based on theory relate to relative comparisons of argument pairs in practice." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-91", "text": "We informally state three hypotheses:" }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-92", "text": "The reasons for quality differences in practice are adequately represented in theory." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-93", "text": "Hypothesis 2 The perception of overall argumentation quality is the same in theory and practice." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-94", "text": "Hypothesis 3 Relative quality differences are reflected by differences in absolute quality ratings." 
}, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-95", "text": "As both corpora described in Section 2 are based on the UKPConvArg1 corpus and thus share many arguments, we can test the hypotheses empirically." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-96", "text": "----------------------------------" }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-97", "text": "**CORRELATIONS OF DIMENSIONS AND REASONS**" }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-98", "text": "For Hypotheses 1 and 2, we consider all 736 pairs of arguments from Habernal and Gurevych (2016a) where both have been annotated by Wachsmuth et al. (2017a) ." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-99", "text": "For each pair (A, B) with A being 1 Source code and annotated data: http://www.arguana.com more convincing than B, we check whether the ratings of A and B for each dimension (averaged over all annotators) show a concordant difference (i.e., a higher rating for A), a disconcordant difference (lower), or a tie." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-100", "text": "This way, we can correlate each dimension with all reason labels in Table 2 including Conv." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-101", "text": "In particular, we compute Kendall's \u03c4 based on all argument pairs given for each label." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-102", "text": "2 Table 3 presents all \u03c4 -values." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-103", "text": "The phrasing of a reason can be assumed to indicate a clear quality difference-this is underlined by the generally high correlations." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-104", "text": "Analyzing the single values, we find much evidence for Hypothesis 1: Most notably, label 5-1 perfectly correlates with global acceptability, fitting the intuition that abuse is not acceptable." 
}, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-105", "text": "The high \u03c4 's of 8-5 (more credible) for local acceptability (.73) and of 9-4 (well thought through) for cogency (.75) confirm the match assumed in Section 1. Also, the values of 5-3 (unclear) for clarity (.91) and of 7-2 (non-sense) for reasonableness (.94) as well as the weaker correlation of 8-4 (objective) for emotional appeal (.35) makes sense." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-106", "text": "Only the comparably low \u03c4 of 6-1 (no credible evidence) for local acceptability (.49) and credibility (.52) seem really unexpected." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-107", "text": "Besides, the descriptions of 6-2 and 6-3 sound like local but cor- Table 4 : The mean rating for each quality dimension of those arguments from Wachsmuth et al. (2017a) given for each reason label (Habernal and Gurevych, 2016a) ." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-108", "text": "The bottom rows show that the minimum maximum mean ratings are consistently higher for the positive properties than for the negative properties." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-109", "text": "relate more with global relevance and sufficiency respectively." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-110", "text": "Similarly, 7-3 (off-topic) correlates strongly with local and global relevance (both .95)." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-111", "text": "So, these dimensions seem hard to separate." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-112", "text": "In line with Hypothesis 2, the highest correlation of Conv is indeed given for overall quality (.64)." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-113", "text": "Thus, argumentation quality assessment seems to match in theory and practice to a broad extent." 
}, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-114", "text": "----------------------------------" }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-115", "text": "**ABSOLUTE RATINGS FOR RELATIVE DIFFERENCES**" }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-116", "text": "The correlations found imply that the relative quality differences captured are reflected in absolute differences." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-117", "text": "For explicitness, we computed the mean rating for each quality dimension of all arguments from Wachsmuth et al. (2017a) with a particular reason label from Habernal and Gurevych (2016a) ." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-118", "text": "As each reason refers to one argument of a pair, this reveals whether the labels, although meant to signal relative differences, indicate absolute ratings." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-119", "text": "Table 4 compares the mean ratings of \"negative labels\" (5-1 to 7-4) and \"positive\" ones (8-1 to 9-4)." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-120", "text": "For all dimensions, the maximum and minimum value are higher for the positive than for the negative labels-a clear support of Hypothesis 3." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-121", "text": "3 Also, Table 4 reveals which reasons predict absolute differences most: The mean ratings of 7-3 (off-topic) are very low, indicating a strong negative impact, while 6-3 (irrelevant reasons) still shows rather 3 While the differences seem not very large, this is expected, as in many argument pairs from Habernal and Gurevych (2016a) both arguments are strong or weak respectively." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-122", "text": "high values." 
}, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-123", "text": "Vice versa, especially 8-5 (more credible) and 9-4 (well thought through) are reflected in high ratings, whereas 9-2 (sticks to topic) does not have much positive impact." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-124", "text": "----------------------------------" }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-125", "text": "**ANNOTATING THEORY IN PRACTICE**" }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-126", "text": "The results of Section 3 suggest that theory may guide the assessment of argumentation quality in practice." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-127", "text": "In this section, we evaluate the reliability of a crowd-based annotation process." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-128", "text": "----------------------------------" }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-129", "text": "**ABSOLUTE QUALITY RATINGS BY THE CROWD**" }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-130", "text": "We emulated the expert annotation process carried out by Wachsmuth et al. (2017a) on CrowdFlower in order to evaluate whether lay annotators suffice for a theory-based quality assessment." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-131", "text": "In particular, we asked the crowd to rate the same 304 arguments as the experts for all 15 given quality dimensions with scores from 1 to 3 (or choose \"cannot judge\")." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-132", "text": "Each argument was rated 10 times at an offered price of $0.10 for each rating (102 annotators in total)." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-133", "text": "Given the crowd ratings, we then performed two comparisons as detailed in the following." 
}, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-134", "text": "----------------------------------" }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-136", "text": "First, we checked to what extent lay annotators and experts agree in terms of Krippendorff's \u03b1." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-137", "text": "On one hand, we compared the mean of all 10 crowd ratings to the mean of the three ratings of Wachsmuth et al. (2017a) ." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-138", "text": "On the other hand, we estimated a reliable rating from the crowd ratings using MACE (Hovy et al., 2013) and compared it to the experts." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-139", "text": "Table 5 : Mean and MACE Krippendorff's \u03b1 agreement between (a) the crowd and the experts, (b) two independent crowd groups and the experts, (c) group 1 and the experts, and (d) group 2 and the experts." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-140", "text": "Table 5 (a) presents the results." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-141", "text": "For the mean ratings, most \u03b1-values are above .40." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-142", "text": "This is similar to the study of Wachsmuth et al. (2017b) , where a range of .27 to .51 is reported, meaning that lay annotators achieve similar agreement to experts." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-143", "text": "Considering the minimum of mean and MACE, we observe the highest agreement for overall quality (.43)-analog to Wachsmuth et al. (2017b) ." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-144", "text": "Also, global sufficiency has the lowest agreement in both cases." 
}, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-145", "text": "In contrast, the experts hardly said \"cannot judge\" at all, whereas the crowd chose it for about 4% of all ratings (most often for global sufficiency), possibly due to a lack of training." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-146", "text": "Still, we conclude that the crowd generally handles the theory-based quality assessment almost as well as the experts." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-147", "text": "However, the complexity of the assessment is underlined by the generally limited agreement, suggesting that either simplification or stricter guidelines are needed." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-148", "text": "Regarding simplification, the most common practical reasons of Habernal and Gurevych (2016a) imply what to focus on." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-149", "text": "----------------------------------" }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-150", "text": "**RELIABILITY OF THE CROWD ANNOTATIONS**" }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-151", "text": "In the second comparison, we checked how many crowd annotators are needed to compete with the experts." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-152", "text": "For this purpose, we split the crowd ratings into two independent groups of 5 and treated the mean and MACE of each group as a single rating." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-153", "text": "We then computed the agreement of both groups and each group individually against the experts." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-154", "text": "The \u03b1-values for both groups are listed in Table 5(b)." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-155", "text": "On average, they are a bit lower than those of all 10 crowd annotators in Table 5 (a)." 
}, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-156", "text": "Hence, five crowd ratings per argument seem not enough for sufficient reliability." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-157", "text": "Tables 5(c) and 5(d) reveal the reason behind, namely, the results of crowd group 1 and group 2 differ clearly." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-158", "text": "At the same time, the values in Table 5 (c) are close to those in Table 5 (a), so 10 ratings might suffice." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-159", "text": "Moreover, we see that the most stable \u03b1-values in Table 5 are given for overall quality, indicating that the theory indeed helps assessing quality reliably." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-160", "text": "----------------------------------" }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-162", "text": "This paper demonstrates that the theory and practice of assessing argumentation quality can learn from each other." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-163", "text": "Most reasons for quality differences phrased in practice seem well-represented in the normative view of theory and correlate with absolute quality ratings." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-164", "text": "In our study, lay annotators had similar agreement on the ratings as experts." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-165", "text": "Considering that some common reasons are quite vague, the diverse and comprehensive theoretical view of argumentation quality may guide a more insightful assessment." }, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-166", "text": "On the other hand, some quality dimensions remain hard to assess and/or to separate in practice, resulting in limited agreement." 
}, { "sent_id": "bd2a718f75d206ef3f2cb5648585d5-C001-167", "text": "Simplifying theory along the most important reasons will thus improve its practical applicability." } ], "y": { "@USE@": { "gold_contexts": [ [ "bd2a718f75d206ef3f2cb5648585d5-C001-22" ], [ "bd2a718f75d206ef3f2cb5648585d5-C001-78", "bd2a718f75d206ef3f2cb5648585d5-C001-82" ], [ "bd2a718f75d206ef3f2cb5648585d5-C001-98" ], [ "bd2a718f75d206ef3f2cb5648585d5-C001-107" ], [ "bd2a718f75d206ef3f2cb5648585d5-C001-117" ], [ "bd2a718f75d206ef3f2cb5648585d5-C001-130" ], [ "bd2a718f75d206ef3f2cb5648585d5-C001-137" ] ], "cite_sentences": [ "bd2a718f75d206ef3f2cb5648585d5-C001-22", "bd2a718f75d206ef3f2cb5648585d5-C001-78", "bd2a718f75d206ef3f2cb5648585d5-C001-98", "bd2a718f75d206ef3f2cb5648585d5-C001-107", "bd2a718f75d206ef3f2cb5648585d5-C001-117", "bd2a718f75d206ef3f2cb5648585d5-C001-130", "bd2a718f75d206ef3f2cb5648585d5-C001-137" ] }, "@BACK@": { "gold_contexts": [ [ "bd2a718f75d206ef3f2cb5648585d5-C001-44" ], [ "bd2a718f75d206ef3f2cb5648585d5-C001-78" ] ], "cite_sentences": [ "bd2a718f75d206ef3f2cb5648585d5-C001-44", "bd2a718f75d206ef3f2cb5648585d5-C001-78" ] } } }, "ABC_489d0077e05269327e7fe4e7f7e4a3_17": { "x": [ { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-20", "text": "To this end, a number of computer vision approaches have been developed, in order to extract nutrient information from meal images by using machine learning." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-2", "text": "Direct computer vision based-nutrient content estimation is a demanding task, due to deformation and occlusions of ingredients, as well as high intra-class and low inter-class variability between meal classes." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-3", "text": "In order to tackle these issues, we propose a system for recipe retrieval from images." 
}, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-4", "text": "The recipe information can subsequently be used to estimate the nutrient content of the meal." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-5", "text": "In this study, we utilize the multi-modal Recipe1M dataset, which contains over 1 million recipes accompanied by over 13 million images." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-6", "text": "The proposed model can operate as a first step in an automatic pipeline for the estimation of nutrition content by supporting hints related to ingredient and instruction." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-7", "text": "Through self-attention, our model can directly process raw recipe text, making the upstream instruction sentence embedding process redundant and thus reducing training time, while providing desirable retrieval results." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-8", "text": "Furthermore, we propose the use of an ingredient attention mechanism, in order to gain insight into which instructions, parts of instructions or single instruction words are of importance for processing a single ingredient within a certain recipe." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-9", "text": "Attention-based recipe text encoding contributes to solving the issue of high intra-class/low inter-class variability by focusing on preparation steps specific to the meal." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-10", "text": "The experimental results demonstrate the potential of such a system for recipe retrieval from images." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-11", "text": "A comparison with respect to two baseline methods is also presented." 
}, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-12", "text": "----------------------------------" }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-13", "text": "**INTRODUCTION**" }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-14", "text": "Social media and designated online cooking platforms have made it possible for large populations to share food culture (diet, recipes) by providing a vast amount of food-related data." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-15", "text": "Despite the interest in food culture, global eating behavior still contributes heavily to dietrelated diseases and deaths, according to the Lancet [8] ." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-16", "text": "Nutrition assessment is a demanding, time-consuming and expensive task." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-17", "text": "Moreover, the conventional approaches for nutrition assessment are cumbersome and prone to errors." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-18", "text": "A tool that enables users to easily and accurately estimate the nutrition content of a meal, while at the same time minimize the need for tedious work is of great importance for a number of different population groups." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-19", "text": "Such a tool can be utilized for promoting a healthy lifestyle, as well as to support patients suffering food-related diseases such as diabetes." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-96", "text": "**LOSS FUNCTION**" }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-21", "text": "Typically, such systems detect the different food items in a picture [1] , [27] , [21] , estimate their volumes [16] , [6] , [10] and calculate the nutrient content using a food composition database [20] ." 
}, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-22", "text": "In some cases however, inferring the nutrient content of a meal from an image can be really challenging -due to unseen ingredients (e.g. sugar, oil) or the structure of the meal (mixed food, soups, etc.)." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-23", "text": "Humans often use information from diverse sensory modalities (visual, auditory, haptic) to infer logical conclusions." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-24", "text": "This kind of multi-sensory integration helps us process complex tasks [12] ." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-25", "text": "In this study, we investigate the use of recipe information, in order to better estimate nutrient content of complex meal compositions." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-26", "text": "With the aim to develop a pipeline for holistic dietary assessment, we present and evaluate a method based on machine learning to retrieve recipe information from images, as a first step towards more accurate nutrient estimation." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-27", "text": "Such recipe information can then be utilized together with the volume of the food item to enhance an automatic system to estimate the nutrient content of complex meals, such as lasagna, crock pot or stew." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-28", "text": "The performance of approaches based on machine learning relies heavily on the quantity and quality of the available data." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-29", "text": "To this end, a number of efforts have been made to compile informative datasets to be used for machine learning approaches." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-30", "text": "Most of the early released food databases were assembled only by image data for a special kind of meal." 
}, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-31", "text": "In particular, the first publicly available database was the Pittsburgh Fast-Food Image Dataset (PFID) [5] , which contains only fast food images taken under laboratory conditions." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-32", "text": "After the recent breakthrough in deep learning models, a number of larger databases were introduced." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-33", "text": "Bossard et al. [2] introduced the Food-101 dataset, which is composed of 101 food categories represented by 101'000 food images." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-34", "text": "This was followed by several image-based databases, such as the UEC-100 [18] and its augmented version, the UEC-256 [13] dataset, with 9060 food images referring to 100 Japanese food types and 31651 food images referring to 256 Japanese food types, respectively." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-35", "text": "Xu et al. [26] developed a specialized dataset by including geolocation and external information about restaurants to simplify the food recognition task." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-36", "text": "Wang et al. [25] introduced the UPMC Food-101 multi-modal dataset, that shares the same 101 food categories with the popular Food-101 dataset, but contains textual information in addition." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-37", "text": "A number of studies have been carried out utilizing the aforementioned databases, mainly for the task of food recognition." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-38", "text": "Salvador et al. [22] published Recipe1M, the largest publicly available multimodal dataset, that consists of 1 million recipes together with the accompanying images." 
}, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-39", "text": "The emergence of multi-modal databases has led to novel approaches for meal image analysis." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-40", "text": "The fusion of visual features learned from images by deep Convolution Neural Networks (CNN) and textual features lead to outstanding results in food recognition applications." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-41", "text": "An early approach for recipe retrieval was based on jointly learning to predict food category and its ingredients using deep CNN [4] ." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-42", "text": "In a following step, the predicted ingredients are matched against a large corpus of recipes." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-43", "text": "More recent approach is proposed by [22] and is based on jointly learning recipe-text and image representations in a shared latent space." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-44", "text": "Recurrent Neural Networks (RNN) and CNN are mainly used to map text and image into the shared space." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-45", "text": "To align the text and image embedding vectors between matching recipe-image pairs, cosine similarity loss with margin was applied." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-46", "text": "Carvalho et al. [3] proposed a similar multi-modal embedding method for aligning text and image representations in a shared latent space." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-47", "text": "In contrast to Salvador et al. [22] , they formulated a joint objective function which incorporates the loss for the cross-modal retrieval task and a classification loss, instead of using the latent space for a multitask learning setup." 
}, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-48", "text": "To address the challenge of encoding long sequences (like recipe instructions), [22] chose to represent single instructions as sentence embedding using the skip-thought technique [15] ." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-49", "text": "These encoded instruction sentences are referred to as skip-instructions and their embedding is not fine tuned when learning the image-text joint embedding." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-50", "text": "In this study, we present a method for the joint learning of meal image and recipe embedding, using a multi-path structure that incorporates natural language processing paths, as well as image analysis paths." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-51", "text": "The main contribution of the proposed method is threefold: i) the direct encoding of the instructions, ingredients and images during training, making the need of skip instruction embedding redundant; ii) the utilization of multiple attention mechanisms (i.e. self-attention and ingredient-attention), and iii) a lightweight architecture." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-52", "text": "----------------------------------" }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-53", "text": "**MATERIALS AND METHODS 2.1 DATABASE**" }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-54", "text": "The proposed method is trained and evaluated on Recipe1M [22] , the largest publicly available multi-modal food database." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-55", "text": "Recipe1M Figure 2 : Text-image embedding model with optional semantic classifier for semantic regularization according to [17] and with Ingredient Attention based instruction encoding provides over 1 million recipes (ingredients and instructions), accompanied by one or more images per recipe, leading to 13 million images." 
}, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-56", "text": "The large corpus is supplemented with semantic information (1048 meal classes) for injecting an additional source of information in potential models." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-57", "text": "In the table in Figure 1 , the structure of recipes belonging to different semantic classes is displayed." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-58", "text": "Using a slightly adjusted pre-processing than that in [22] (elimination of noisy instruction sentences), the training set, validation set and test set contain 254,238 and 54,565 and 54,885 matching pairs, respectively." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-59", "text": "In [22] , the authors chose the overall amount of instructions per recipe as one criterion for a valid matching pair. But we simply removed instruction sentences that contain only punctuation and gained some extra data for training and validation." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-60", "text": "----------------------------------" }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-61", "text": "**MODEL ARCHITECTURE**" }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-62", "text": "The proposed model architecture is based on a multi-path approach for each of the involved input data types namely, instructions, ingredients and images, similarly to [17] ." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-63", "text": "In Figure 2 , the overall structure is presented." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-64", "text": "For the instruction encoder, we utilized a self-attention mechanism [24] , which learns which words of the instructions are relevant with a certain ingredient." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-65", "text": "In order to encode the ingredients, a bidirectional RNN is used, since ingredients are an unordered list of words." 
}, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-66", "text": "All RNNs in the ingredients path were implemented with Long Short-Term Memory (LSTM) cells [11] ." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-67", "text": "We fixed the ingredient representation to have a length of 600, independent of the amount of ingredients." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-68", "text": "Lastly, the outputs of the self-attention-instruction encoder with ingredient attention and the output of the bidirectional LSTM ingredient-encoder are concatenated and mapped to the joint embedding space." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-69", "text": "The image analysis path is composed of a ResNet-50 model [9] , pretrained on the ImageNet Dataset [7] , with a custom top layer for mapping the image features to the joint embedding space." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-70", "text": "All word embeddings are pretrained with the word2vec algorithm [19] and fine tuned during the joint embedding learning phase." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-71", "text": "We chose 512-dimensional word embedding for our model with self-attention, whereas [17] and [3] chose a vector length of 300." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-72", "text": "In the following sections, more details about the aforementioned paths are presented." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-73", "text": "----------------------------------" }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-74", "text": "**ATTENTION MECHANISMS**" }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-75", "text": "The instruction encoder follows a transformer based encoder, as suggested by [24] ." 
}, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-76", "text": "Since we do not focus on syntactic rules, but mostly on weak sentence semantics or single words, we built a more shallow encoder containing only 2 stacked layers, where each of this layers contains two sub-layers." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-77", "text": "The first is the multi-head attention layer, and the second is a position-wise densely connected feed-forward network (FFN)." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-78", "text": "Due to recipes composed of over 600 words as instructions, we decided to trim words per instruction sentence to restrict the overall words per recipe to 300." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-79", "text": "In order to avoid removing complete instructions at the end of the instruction table, we removed a fraction of words from each instruction, based on this instruction's length and the overall recipe-instruction length." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-80", "text": "This strategy reinforces the neglect of syntactic structures in the instruction encoding process." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-81", "text": "With such a model, we can directly perform the instruction encoding during the learning process for the joint embedding, thus saving training time and reducing disk space consumption." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-82", "text": "The transformer-like encoder does not make use of any recurrent units, thus providing the opportunity for a more lightweight architecture." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-83", "text": "By using self-attention [24] , the model learns to focus on instructions relevant to recipe-retrieval-relevant, parts of instructions or single instruction-words." 
}, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-84", "text": "Furthermore we gain insight into which instructions are important to distinguish recipes with similar ingredients but different preparation styles." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-85", "text": "The instruction encoder transforms the sequence of plain word representations with added positional information to a sequence of similarity-based weighted sum of all word representations." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-86", "text": "The outputted sequence of the encoder exhibits the same amount of positions as the input to the instruction encoder (in our experiments 300)." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-87", "text": "Each of this positions is represented by a 512-dimensional vector." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-88", "text": "To obtain a meaningful representation without a vast number of parameters, we reduced the number of word representations before the concatenation with the ingredient representation." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-89", "text": "For this reduction step, we implemented a recipe-embedding specific attention layer where the ingredient representation is used to construct n queries, where n is the amount of new instruction representation vectors." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-90", "text": "Each of these new representations is a composition of all previous word representations weighted by the ingredient attention score." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-91", "text": "Following, the ingredient attention process is formulated mathematically and is visually portrayed in Figure 2 ." 
}, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-92", "text": "where K (inst ) and V (inst ) are linear mappings of the encoded instruction words, and Q (in\u0434) is a linear mapping of the ingredient representation and d k is the dimensionality of linearly projected position vectors." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-93", "text": "where b is the batch-size, p is the amount of word embeddings, w is the dimensionality of the wort embedding, h is the dimensionality of the space to where we project the word embeddings and queries, q is the dimensionality of the ingredient representation and n is the amount of Ingredient Attention-based instruction representations." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-94", "text": "Ingredient Attention can be performed step-wise, similarly" }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-95", "text": "----------------------------------" }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-97", "text": "To align text and image embeddings of matching recipe-image pairs alongside each other, we maximize the cosine distance between positive pairs and minimize it between negative pairs." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-98", "text": "We have trained our model using cosine similarity loss with margin as in [17] and with the triplet loss proposed by [3] ." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-99", "text": "Both objective functions and the semantic regularization by [17] aim at maximizing intra-class correlation and minimizing inter-class correlation." 
}, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-100", "text": "Let us define the text query embedding as \u03d5 q and the embedding of the image query as \u03d5 d , then the cosine embedding loss can be defined as follows:" }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-101", "text": "where cos (x,y) is the normalized cosine similarity and \u03b1 is a margin (\u22121 \u2a7d \u03b1 \u2a7d 1), that determines how similar negative pairs are allowed to be." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-102", "text": "Positive margins allow negative pairs to share at maximum \u03b1 similarity, where a maximum margin of zero or negative margins allow no correlation between non matching embedding vectors or force the model to learn antiparallel representations, respectively." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-103", "text": "\u03d5 d is the corresponding image counterpart to \u03d5 q if y = 1 or a randomly chosen sample \u03d5 d \u2208 S \u2227 \u03d5 d \u03d5 d (q) if y = \u22121, where \u03d5 d (q) is the true match for \u03d5 q and S is the dataset we sample from it." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-104", "text": "Furthermore, we complement the cosine similarity with crossentropy classification loss (L r e\u0434 ), leading to the applied objective function." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-105", "text": "with c r and c v as semantic recipe-class and semantic image-class, respectively, while c r = c v if the food image and recipe text are a positive pair." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-106", "text": "For the triplet loss, we define \u03d5 q as query embedding, \u03d5 d+ as matching image counterpart and \u03d5 d\u2212 as another random sample taken from S. Further \u03d5 d s em + \u2208 S \u2227 \u03d5 d s em + \u03d5 d (q) is a sample from S sharing the same semantic class as \u03d5 q and \u03d5 d s em \u2212 is a sample from any other class." 
}, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-107", "text": "The triplet loss is formulated as follows:" }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-108", "text": "where \u03b2 \u2208 [0, 1] weights between quadratic and linear loss, \u03b1 \u2208 [0, 2] is the margin and \u03b3 \u2208 [0, 1] weights between semanticand sample-loss." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-109", "text": "The triplet loss encourages the embedding vectors of a matching pair to be larger by a margin above its non-matching counterpart." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-110", "text": "Further, the semantic loss encourages the model to form clusters of dishes, sharing the same class." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-111", "text": "We chose \u03b2 to be 0.1, \u03b1 to be 0.3 and \u03b3 to be 0.3." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-112", "text": "----------------------------------" }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-113", "text": "**TRAINING CONFIGURATION**" }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-114", "text": "We used Adam [14] optimizer with an initial learning rate of 10 \u22124 ." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-115", "text": "At the beginning of the training session, we freeze the pretrained ResNet-50 weights and optimize only the text-processing branch until we do no longer make progress." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-116", "text": "Then, we alternate train image and text branch until we switched modality for 10 times." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-117", "text": "Lastly, we fine-tune the overall model by releasing all trainable parameters in the model." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-118", "text": "Our optimization strategy differs from [17] in that we use an aggressive learning rate decay, namely exponential decay, so that the learning rate is halved all 20 epochs." 
}, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-119", "text": "Since the timing of freezing layers proved not to be of importance unless the recipe path is trained first, we used the same strategy under the cosine distance objective [17] and for the triplet loss [3] ." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-120", "text": "----------------------------------" }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-121", "text": "**EXPERIMENTAL SETUP AND RESULTS**" }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-122", "text": "Recipe1M is already distributed in three parts, the training, validation and testing sets." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-123", "text": "We did not make any changes to these partitions." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-124", "text": "Except with our more sensitive preprocessing algorithm, we accept more recipes from the raw corpus." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-125", "text": "[17] used 238,399 samples for their effective training set and for the validation and testing set 51,119 and 51,303 samples, respectively." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-126", "text": "By filtering out noisy instructions sentences (e.g. instructions containing only punctuation) we increased the effective dataset size to 254,238 samples for the training set and 54,565 and 54,885 for the validation and testing sets, respectively." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-127", "text": "Similarly to [17] and [3] , we evaluated our model on 10 subsets of 1000 samples each." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-128", "text": "One sample of these subsets is composed of text embedding and image embedding in the shared latent space." 
}, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-129", "text": "Since our interest lies in the recipe retrieval task, we optimized and evaluated our model by using each image embedding in the subsets as query against all text embeddings." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-130", "text": "By ranking the query and the candidate embeddings according to their cosine distance, we estimate the median rank." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-131", "text": "The model's performance is best, if the matching text embedding is found at the first rank." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-132", "text": "Further, we estimate the recall percentage at the top K percent over all queries." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-133", "text": "The recall percentage describes the quantity of queries ranked amid the top K closest results." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-134", "text": "In Table 1 the results are presented, in comparison to baseline methods." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-135", "text": "----------------------------------" }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-136", "text": "**ORAL PAPER SESSION 2**" }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-137", "text": "MADiMa '19, October 21, 2019, Nice, France [17] and AdaMine [3] re-implementation." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-138", "text": "For all models we were using selected matching pairs generated by reducing noisy instruction sentences as described above." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-139", "text": "Recall rates are averaged over the evaluation batches." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-140", "text": "Image to Recipe MedR R@1 R@5 R@10 1k samples" }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-141", "text": "Random [17] 500.0 0.001 0.005 0.01 JNE [17] 5.0 \u00b1 0." 
}, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-142", "text": "Both [17] and [3] use time-consuming instruction text preprocessing over the skip-thought technique [15] ." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-143", "text": "This process doubles the overall training time from three days to six days using two Nvidia Titan X GPU's." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-144", "text": "By using online-instruction encoding with the self-attention encoder, we were able train the model for its main task in under 30 hours." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-145", "text": "Furthermore, the proposed approach offers more flexibility for dataset alterations." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-146", "text": "Figure 4 : Ingredient-Attention based focus on instruction sentences." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-147", "text": "We use two different mapping matrices for the two ingredient based queries." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-148", "text": "Qualitative results such as recipe retrieval, quality of the cluster formation in the joint embedding space and heat maps of instruction words are more important than the previously mentioned benchmarking scores." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-149", "text": "Depending on meal type, all baseline implementations as well as our Ingredient Attention based model exhibit a broad range of retrieval accuracy." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-150", "text": "In Figure 5 we present a few typical results on the intended recipe retrieval task." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-151", "text": "AdaMine [3] creates more distinct class clusters than in [17] ." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-152", "text": "In Figure 3 , we demonstrate the difference in cluster formation using the aforementioned Methods for our Ingredient Attention." 
}, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-153", "text": "We visualize the top ten most common recipe classes in Recipe1M using t-SNE [23] ." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-154", "text": "Since chocolate chip, peanut butter, cream cheese and/or ice cream are used as ingredients in desserts, due to semantic regularization inside the triplet loss, clusters of sweet meals are close together (Figure 3b top right corner)." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-155", "text": "MADiMa '19, October 21, 2019, Nice, France Sample 1 Figure 5 : The retrieval performance of our model depends heavily on the meal type." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-156", "text": "We marked matching retrieved ingredients or those of the same family in green." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-157", "text": "The Ingredient Attention model performed well on Sample 1, and acceptably on Sample 2." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-158", "text": "On Sample 3, the model missed the main ingredient in all top three retrievals." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-159", "text": "We use heat maps on instruction words as tool to visualize words relevant to ingredient-lists in plain instruction text." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-160", "text": "In Figure 4 , we demonstrate how easily we can achieve insight into the models decision making." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-161", "text": "----------------------------------" }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-162", "text": "**CONCLUSIONS**" }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-163", "text": "In this paper, we have introduced self-attention for instruction encoding in the context of the recipe retrieval task and ingredient attention for disclosing ingredient dependent meal preparation steps." 
}, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-164", "text": "Our main contribution is the aforementioned ingredient attention, empowering our model to solve the recipe retrieval without any upstream skip instruction embedding, as well as the light-weight architecture provided by the transformer-like instruction encoder." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-165", "text": "On the recipe retrieval task, our method performs similarly to our baseline implementation of [3] ." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-166", "text": "Regarding training time on the other hand, we increased the efficiency significantly for crossmodal based retrieval methods." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-167", "text": "There is no need for a maximum number of instructions for a recipe to be considered as valid for training or testing; only for total words, making more samples of the large Recipe1M corpus usable for training." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-168", "text": "Through ingredient attention, we are able to unveil internal focus in the text processing path by observing attention weights." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-169", "text": "Incorporation of new samples in the train set can be done by retraining just one model." }, { "sent_id": "489d0077e05269327e7fe4e7f7e4a3-C001-170", "text": "Overall, an accurate and flexible method for recipe retrieval from meal images could provide downstream models (e.g. automatic nutrient content estimation) with decisive information and significantly improve their results." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "489d0077e05269327e7fe4e7f7e4a3-C001-46" ], [ "489d0077e05269327e7fe4e7f7e4a3-C001-142" ], [ "489d0077e05269327e7fe4e7f7e4a3-C001-151" ] ], "cite_sentences": [ "489d0077e05269327e7fe4e7f7e4a3-C001-46", "489d0077e05269327e7fe4e7f7e4a3-C001-142", "489d0077e05269327e7fe4e7f7e4a3-C001-151" ] }, "@DIF@": { "gold_contexts": [ [ "489d0077e05269327e7fe4e7f7e4a3-C001-71" ], [ "489d0077e05269327e7fe4e7f7e4a3-C001-142", "489d0077e05269327e7fe4e7f7e4a3-C001-143", "489d0077e05269327e7fe4e7f7e4a3-C001-144" ] ], "cite_sentences": [ "489d0077e05269327e7fe4e7f7e4a3-C001-71", "489d0077e05269327e7fe4e7f7e4a3-C001-142" ] }, "@USE@": { "gold_contexts": [ [ "489d0077e05269327e7fe4e7f7e4a3-C001-98" ], [ "489d0077e05269327e7fe4e7f7e4a3-C001-119" ] ], "cite_sentences": [ "489d0077e05269327e7fe4e7f7e4a3-C001-98", "489d0077e05269327e7fe4e7f7e4a3-C001-119" ] }, "@SIM@": { "gold_contexts": [ [ "489d0077e05269327e7fe4e7f7e4a3-C001-127" ], [ "489d0077e05269327e7fe4e7f7e4a3-C001-165" ] ], "cite_sentences": [ "489d0077e05269327e7fe4e7f7e4a3-C001-127", "489d0077e05269327e7fe4e7f7e4a3-C001-165" ] }, "@UNSURE@": { "gold_contexts": [ [ "489d0077e05269327e7fe4e7f7e4a3-C001-137" ] ], "cite_sentences": [ "489d0077e05269327e7fe4e7f7e4a3-C001-137" ] } } }, "ABC_4e1b01c1faebc447891bc0b847316d_17": { "x": [ { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-96", "text": "----------------------------------" }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-97", "text": "**CONCLUSION**" }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-2", "text": "When evaluating the quality of topics generated by a topic model, the convention is to score topic coherence -either manually or automatically -using the top-N topic words." 
}, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-3", "text": "This hyper-parameter N , or the cardinality of the topic, is often overlooked and selected arbitrarily." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-4", "text": "In this paper, we investigate the impact of this cardinality hyper-parameter on topic coherence evaluation." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-5", "text": "For two automatic topic coherence methodologies, we observe that the correlation with human ratings decreases systematically as the cardinality increases." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-6", "text": "More interestingly, we find that performance can be improved if the system scores and human ratings are aggregated over several topic cardinalities before computing the correlation." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-7", "text": "In contrast to the standard practice of using a fixed value of N (e.g. N = 5 or N = 10), our results suggest that calculating topic coherence over several different cardinalities and averaging results in a substantially more stable and robust evaluation." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-8", "text": "We release the code and the datasets used in this research, for reproducibility." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-9", "text": "----------------------------------" }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-10", "text": "**1**" }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-11", "text": "----------------------------------" }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-12", "text": "**INTRODUCTION**" }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-13", "text": "Latent Dirichlet Allocation (\"LDA\": Blei et al. (2003) ) is an approach to document clustering, in which \"topics\" (multinomial distributions over terms) and topic allocations (multinomial distributions over topics per document) are jointly learned." 
}, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-14", "text": "When the topic model output is to be presented to humans, optimisation of the number of topics is a non-trivial problem." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-15", "text": "In the seminal paper of Chang et al. (2009) , e.g., the authors showed thatcontrary to expectations -extrinsically measured topic coherence correlates negatively with model perplexity." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-16", "text": "They introduced the word intrusion task, whereby a randomly selected \"intruder\" word is injected into the top-N words of a given topic and users are asked to identify the intruder word." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-17", "text": "Low reliability in identifying the intruder word indicates low coherence (and vice versa), based on the intuition that the more coherent the topic, the more clearly the intruder word should be an outlier." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-18", "text": "Since then, several methodologies have been introduced to automate the evaluation of topic coherence." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-19", "text": "Newman et al. (2010) found that aggregate pairwise PMI scores over the top-N topic words correlated well with human ratings." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-20", "text": "Mimno et al. (2011) proposed replacing PMI with conditional probability based on co-document frequency." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-21", "text": "Aletras and Stevenson (2013) showed that coherence can be measured by a classical distributional similarity approach." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-22", "text": "More recently, Lau et al. (2014) proposed a methodology to automate the word intrusion task directly." 
}, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-23", "text": "Their results also reveal the differences between these methodologies in their assessment of topic coherence." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-24", "text": "A hyper-parameter in all these methodologies is the number of topic words, or its cardinality." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-25", "text": "These methodologies evaluate coherence over the top-N topic words, where N is selected arbitrarily: for Chang et al. (2009) , N = 5, whereas for Newman et al. (2010) , Aletras and Stevenson (2013) and Lau et al. (2014) , N = 10." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-26", "text": "The germ of this paper came when using the automatic word intrusion methodology (Lau et al., 2014) , and noticing that introducing one extra word to a given topic can dramatically change the accuracy of intruder word prediction." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-27", "text": "This forms the kernel of this paper: to better understand the impact of the topic cardinality hyper-parameter on the evaluation of topic coherence." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-28", "text": "To investigate this, we develop a new dataset with human-annotated coherence judgements for a range of cardinality settings (N = {5, 10, 15, 20})." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-29", "text": "We experiment with the automatic word intrusion (Lau et al., 2014) and discover that correlation with human ratings decreases systematically as cardinality increases." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-30", "text": "We also test the PMI methodology (Newman et al., 2010) and make the same observation." 
}, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-31", "text": "To remedy this, we show that performance can be substantially improved if system scores and human ratings are aggregated over different cardinality settings before computing the correlation." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-32", "text": "This has broad implications for topic model evaluation." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-33", "text": "----------------------------------" }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-34", "text": "**DATASET AND GOLD STANDARD**" }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-35", "text": "To examine the relationship between topic cardinality and topic coherence, we require a dataset that has topics for a range of cardinality settings." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-36", "text": "Although there are existing datasets with human-annotated coherence scores (Newman et al., 2010; Aletras and Stevenson, 2013; Lau et al., 2014; Chang et al., 2009) , these topics were annotated using a fixed cardinality setting (e.g. 5 or 10)." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-37", "text": "We thus develop a new dataset for this experiment." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-38", "text": "Following Lau et al. (2014) , we use two domains: (1) WIKI, a collection of 3.3 million English Wikipedia articles (retrieved November 28th 2009); and (2) NEWS, a collection of 1.2 million New York Times articles from 1994 to 2004 (English Gigaword)." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-39", "text": "We sub-sample approximately 50M tokens (100K and 50K articles for WIKI and NEWS respectively) from both domains to create two smaller document collections." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-40", "text": "We then generate 300 LDA topics for each of the sub-sampled collection." 
}, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-41", "text": "2 There are two primary approaches to assessing topic coherence: (1) via word intrusion (Chang et (2) by directly measuring observed coherence (Newman et al., 2010; Lau et al., 2014) ." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-42", "text": "With the first method, Chang et al. (2009) injects an intruder word into the top-5 topic words, shuffles the topic words, and sets the task of selecting the single intruder word out of the 6 words." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-43", "text": "In preliminary experiments, we found that the word intrusion task becomes unreasonably difficult for human annotators when the topic cardinality is high, e.g. when N = 20." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-44", "text": "As such, we use the second approach as the means for generating our gold standard, asking users to judge topic coherence directly over different topic cardinalities." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-45", "text": "3 To collect the coherence judgements, we used Amazon Mechanical Turk and asked Turkers to rate topics in terms of coherence using a 3-point ordinal scale, where 1 indicates incoherent and 3 very coherent (Newman et al., 2010) ." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-46", "text": "For each topic (600 topics in total) we experiment with 4 cardinality settings: N = {5, 10, 15, 20}. For example, for N = 5, we display the top-5 topic words for coherence judgement." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-47", "text": "For annotation quality control, we embed a bad topic generated using random words into each HIT." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-48", "text": "Workers who fail to consistently rate these bad topics low are filtered out." 
}, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-49", "text": "4 On average, we collected approximately 9 ratings per topic in each cardinality setting (post-filtered), from which we generate the gold standard via the arithmetic mean." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-50", "text": "To understand the impact of cardinality (N ) on topic coherence, we analyse: (a) the mean topic rating for each N (Table 1) , and (b) the pairwise Pearson correlation coefficient between the same topics for different values of N (Table 2) ." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-51", "text": "Coherence decreases slightly but systematically as N increases, suggesting that users find topics less coherent (but marginally more consistently interpretable, as indicated by the slight drop in standard deviation) when more words are presented in a topic." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-52", "text": "The strong pairwise correlations, however, indicate that the ratings are relatively stable across different cardinality settings." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-53", "text": "To better understand the data, in Figure 1 we present scatter plots of the ratings for all pairwise cardinality settings (where a point represents a topic)." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-54", "text": "Note the vertical lines for x = 3.0 (cf." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-55", "text": "the weaker effect of horizontal lines for y = 3.0), in particular for the top 3 plots where we are comparing N = 5 against higher cardinality settings." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-56", "text": "This implies that topics that are rated as perfectly coherent (3.0) for N = 5 exhibit some variance in coherence ratings when N increases." 
}, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-57", "text": "Intuitively, it means that a number of perfectly coherent 5-word topics become less coherent as more words are presented." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-58", "text": "----------------------------------" }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-59", "text": "**AUTOMATED METHOD -WORD INTRUSION**" }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-60", "text": "Lau et al. (2014) proposed an automated approach to the word intrusion task." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-61", "text": "The methodology computes pairwise word association features for the top-N words, and trains a support vector regression model to rank the words." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-62", "text": "The top-ranked word is then selected as the predicted intruder word." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-63", "text": "Note that even though it is supervised, no manual annotation is required as the identity of the true intruder word is known." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-64", "text": "Following the original paper, we use as features normalised PMI (NPMI) and two conditional probabilities (CP1 and CP2), computed over the full collection of WIKI (3.3 million articles) and NEWS (1.2 million articles), respectively." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-65", "text": "We use 10-fold cross validation to predict the intruder words for all topics." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-66", "text": "To generate an intruder for a topic, we select a random word that has a low probability (P < 0.0005) in the topic but high probability (P > 0.01) in another topic." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-67", "text": "We repeat this ten times to generate 10 different intruder words for a topic." 
}, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-68", "text": "The 4 cardinalities of a given topic share the same set of intruder words." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-69", "text": "To measure the coherence of a topic, we compute model precision, or the accuracy of intruder word prediction." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-70", "text": "For evaluation we compute the Pearson correlation coefficient r of model precisions and human ratings for each cardinality setting." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-71", "text": "Results are summarised in Table 3 Each domain has 2 sets of correlation figures, based on in-domain and out-of-domain features." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-72", "text": "Indomain (out-of-domain) features are word association features computed using the same (different) domain as the topics, e.g. when we compute coherence of WIKI topics using word association features derived from WIKI (NEWS)." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-73", "text": "The correlations using in-domain features are in general lower than for out-of-domain features." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-74", "text": "This is due to idiosyncratic words that are closely related in the collection, e.g. remnant Wikipedia markup tags." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-75", "text": "The topic model discovers them as topics and the word statistics derived from the same collection supports the association, but these topics are generally not coherent, as revealed by out-of-domain statistics." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-76", "text": "This result is consistent with previous studies (Lau et al., 2014) ." 
}, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-77", "text": "We see that correlation decreases systematically as N increases, implying that N has high impact on topic coherence evaluation and that if a single value of N is to be used, a lower value is preferable." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-78", "text": "To test whether we can leverage the additional information from the different values of N , we aggregate the model precision values and human ratings per-topic before computing the correlation (Table 3 : Cardinality = \"Avg\")." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-79", "text": "We also test the significance of difference for each N with the aggregate correlation using the Steiger Test (Steiger, 1980) The correlation improves substantially." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-80", "text": "In fact, for NEWS using in-domain features, the correlation is higher than that of any individual cardinality setting." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-81", "text": "This observation suggests that a better approach to automatically computing topic coherence is to aggregate coherence scores over different cardinality settings, and that it is sub-optimal to evaluate a topic by only assessing a single setting of N ." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-82", "text": "Instead, we should repeat it several times, varying N ." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-83", "text": "----------------------------------" }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-84", "text": "**AUTOMATED METHOD -NPMI**" }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-85", "text": "The other mainstream approach to evaluating topic coherence is to directly measure the average pairwise association between the top-N words." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-86", "text": "Newman et al. 
(2010) found PMI to be the best association measure, and later studies (Aletras and Stevenson, 2013; Lau et al., 2014) found that normalised PMI (NPMI: Bouma (2009)) improves PMI further." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-87", "text": "To see if the benefit of aggregating coherence measures over several cardinalities transfers across to other methodologies, we test the NPMI methodology." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-88", "text": "We compute the topic coherence using the full collection of WIKI and NEWS, respectively, for varying N ." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-89", "text": "Results are presented in Table 4 ." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-90", "text": "The in-domain features perform much worse, especially for the WIKI topics." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-91", "text": "NPMI assigns very high scores to several incoherent topics, thereby reducing the correlation to almost zero." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-92", "text": "These topics consist predominantly of Wikipedia markup tags, and the high association is due to word statistics idiosyncratic to the collection." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-93", "text": "Once again, aggregating the topic coherence over multiple N values boosts results further." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-94", "text": "The correlations using aggregation and out-of-domain features again produce the best results for both WIKI and NEWS." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-95", "text": "It is important to note that, while these findings were established based on manual annotation of topic coherence, for practical applications, topic coherence would be calculated in a fully-unsupervised manner (averaged over different topic cardinalities), without the use of manual annotations." 
}, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-98", "text": "We investigate the impact of the cardinality of topic words on topic coherence evaluation." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-99", "text": "We found that human ratings decrease systematically when cardinality increases, although pairwise correlations are relatively high." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-100", "text": "We discovered that the performance of two automated methods -word intrusion and pairwise NPMI -can be substantially improved if the system scores and human ratings are aggregated over several cardinality settings before computing the correlation." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-101", "text": "Contrary to the standard practice of using a fixed cardinality setting, our findings suggest that we should assess topic coherence using several cardinality settings and then aggregate over them." }, { "sent_id": "4e1b01c1faebc447891bc0b847316d-C001-102", "text": "The human-judged coherence ratings, along with code to compute topic coherence, are available online." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "4e1b01c1faebc447891bc0b847316d-C001-22" ], [ "4e1b01c1faebc447891bc0b847316d-C001-25" ], [ "4e1b01c1faebc447891bc0b847316d-C001-26" ], [ "4e1b01c1faebc447891bc0b847316d-C001-36" ], [ "4e1b01c1faebc447891bc0b847316d-C001-41" ], [ "4e1b01c1faebc447891bc0b847316d-C001-60" ], [ "4e1b01c1faebc447891bc0b847316d-C001-86" ] ], "cite_sentences": [ "4e1b01c1faebc447891bc0b847316d-C001-22", "4e1b01c1faebc447891bc0b847316d-C001-25", "4e1b01c1faebc447891bc0b847316d-C001-26", "4e1b01c1faebc447891bc0b847316d-C001-36", "4e1b01c1faebc447891bc0b847316d-C001-41", "4e1b01c1faebc447891bc0b847316d-C001-60", "4e1b01c1faebc447891bc0b847316d-C001-86" ] }, "@MOT@": { "gold_contexts": [ [ "4e1b01c1faebc447891bc0b847316d-C001-26", "4e1b01c1faebc447891bc0b847316d-C001-27" ] ], "cite_sentences": [ "4e1b01c1faebc447891bc0b847316d-C001-26" ] }, "@USE@": { "gold_contexts": [ [ "4e1b01c1faebc447891bc0b847316d-C001-29" ], [ "4e1b01c1faebc447891bc0b847316d-C001-38" ], [ "4e1b01c1faebc447891bc0b847316d-C001-41", "4e1b01c1faebc447891bc0b847316d-C001-44" ], [ "4e1b01c1faebc447891bc0b847316d-C001-86", "4e1b01c1faebc447891bc0b847316d-C001-87" ] ], "cite_sentences": [ "4e1b01c1faebc447891bc0b847316d-C001-29", "4e1b01c1faebc447891bc0b847316d-C001-38", "4e1b01c1faebc447891bc0b847316d-C001-41", "4e1b01c1faebc447891bc0b847316d-C001-86" ] }, "@SIM@": { "gold_contexts": [ [ "4e1b01c1faebc447891bc0b847316d-C001-76" ] ], "cite_sentences": [ "4e1b01c1faebc447891bc0b847316d-C001-76" ] } } }, "ABC_197b557d7b5c7c2d195be84990719b_17": { "x": [ { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-64", "text": "----------------------------------" }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-65", "text": "**DOCUMENT MIXTURES**" }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-2", "text": "Distributed dense word vectors have been shown to be effective at capturing tokenlevel semantic and 
syntactic regularities in language, while topic models can form interpretable representations over documents." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-3", "text": "In this work, we describe lda2vec, a model that learns dense word vectors jointly with Dirichlet-distributed latent document-level mixtures of topic vectors." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-4", "text": "In contrast to continuous dense document representations, this formulation produces sparse, interpretable document mixtures through a non-negative simplex constraint." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-5", "text": "Our method is simple to incorporate into existing automatic differentiation frameworks and allows for unsupervised document representations geared for use by scientists while simultaneously learning word vectors and the linear relationships between them." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-6", "text": "----------------------------------" }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-8", "text": "Topic models are popular for their ability to organize document collections into a smaller set of prominent themes." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-9", "text": "In contrast to dense distributed representations, these document and topic representations are generally accessible to humans and more easily lend themselves to being interpreted." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-10", "text": "This interpretability provides additional options to highlight the patterns and structures within our systems of documents." 
}, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-11", "text": "For example, using Latent Dirichlet Allocation (LDA) topic models can reveal cluster of words within documents (Blei et al., 2003) , highlight temporal trends (Charlin et al., 2015) , and infer networks of complementary products (McAuley et al., 2015) ." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-12", "text": "See Blei et al. (2010) for an overview of topic modelling in domains as diverse as computer vision, genetic markers, survey data, and social network data." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-13", "text": "Dense vector approaches to building document representations also exist: Le and Mikolov (2014) propose paragraph vectors that are predictive of bags of words within paragraphs, Kiros et al. (2015) build vectors that reconstruct the sentence sequences before and after a given sentence, and Ghosh et al. (2016) construct contextual LSTMs that predict proceeding sentence features." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-14", "text": "Probabilistic topic models tend to form documents as a sparse mixed-membership of topics while neural network models tend to model documents as dense vectors." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-15", "text": "By virtue of both their sparsity and low-dimensionality, representations from the former are simpler to inspect and more immediately yield high level intuitions about the underlying system (although not without hazards, see Chang et al. (2009) )." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-16", "text": "This paper explores hybrid approaches mixing sparse document representations with dense word and topic vectors." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-17", "text": "Unfortunately, crafting a new probabilistic topic model requires deriving a new approximation, a procedure which takes substantial expertise and must be customized to every model." 
}, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-18", "text": "As a result, prototypes are time-consuming to develop and changes to model architectures must be carefully considered." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-19", "text": "However, with modern automatic differentiation frameworks the practitioner can focus development time on the model design rather than the model approximations." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-20", "text": "This expedites the process of evaluating which model features are relevant." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-21", "text": "This work takes advantage of the Chainer (Tokui et al., 2015) framework to quickly develop models while also enabling us to utilize GPUs to dramatically improve computational speed." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-22", "text": "Finally, traditional topic models over text do not take advantage of recent advances in distributed word representations which can capture semantically meaningful regularities between tokens." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-23", "text": "The examination of word co-occurrences has proven to be a fruitful research paradigm." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-24", "text": "For example, Mikolov et al. (2013) utilize Skipgram NegativeSampling (SGNS) to train word embeddings using word-context pairs formed from windows moving across a text corpus." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-25", "text": "These vector representations ultimately encode remarkable linearities such as king \u2212 man + woman = queen." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-26", "text": "In fact, Levy and Goldberg (2014c) demonstrate that this is implicitly factorizing a variant of the Pointwise Mutual Information (PMI) matrix that emphasizes predicting frequent co-occurrences over rare ones." 
}, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-27", "text": "Closely related to the PMI matrix, Pennington et al. (2014) factorize a large global word count co-occurrence matrix to yield more efficient and slightly more performant computed embeddings than SGNS." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-28", "text": "Once created, these representations are then useful for information retrieval (Manning et al., 2009 ) and parsing tasks (Levy and Goldberg, 2014a) ." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-29", "text": "In this work, we will take advantage of word-level representations to build document-level abstractions." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-30", "text": "This paper extends distributed word representations by including interpretable document representations and demonstrate that model inference can be performed and extended within the framework of automatic differentiation." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-31", "text": "----------------------------------" }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-32", "text": "**MODEL**" }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-33", "text": "This section describes the model for lda2vec." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-34", "text": "We are interested in modifying the Skipgram Negative-Sampling (SGNS) objective in (Mikolov et al., 2013) to utilize document-wide feature vectors while simultaneously learning continuous document weights loading onto topic vectors." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-35", "text": "The network architecture is shown in Figure 1 ." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-36", "text": "The total loss term L in (1) is the sum of the Skipgram Negative Sampling Loss (SGNS) L neg ij with the addition of a Dirichlet-likelihood term over document weights, L d that will be discussed later." 
}, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-37", "text": "The loss is conducted using a context vector, c j , pivot word vector w j , target word vector w i , and negatively-sampled word vector w l ." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-38", "text": "----------------------------------" }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-39", "text": "**WORD REPRESENTATION**" }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-40", "text": "As in Mikolov et al. (2013) , pairs of pivot and target words (j, i) are extracted when they cooccur in a moving window scanning across the corpus." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-41", "text": "In our experiments, the window contains five tokens before and after the pivot token." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-42", "text": "For every pivot-target pair of words the pivot word is used to predict the nearby target word." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-43", "text": "Each word is represented with a fixedlength dense distributed-representation vector, but unlike Mikolov et al. (2013) the same word vectors are used in both the pivot and target representations." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-44", "text": "The SGNS loss shown in (2) attempts to discriminate context-word pairs that appear in the corpus from those randomly sampled from a 'negative' pool of words." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-45", "text": "This loss is minimized when the observed words are completely separated from the marginal distribution." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-46", "text": "The distribution from which tokens are drawn is u \u03b2 , where u denotes the overall word frequency normalized by the total corpus size." 
}, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-47", "text": "Unless stated otherwise, the negative sampling power beta is set to 3/4 and the number of negative samples is fixed to n = 15 as in Mikolov et al. (2013) ." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-48", "text": "Note that a distribution of u 0.0 would draw negative tokens from the vocabulary with no notion of popularity while a distribution proportional with u 1.0 draws from the empirical unigram distribution." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-49", "text": "Compared to the unigram distribution, the choice of u 3/4 slightly emphasizes choosing infrequent words for negative samples." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-50", "text": "In contrast to optimizing the softmax cross entropy, which requires modelling the overall popularity of each token, negative sampling focuses on learning word vectors conditional on a context by drawing negative samples from each token's marginal popularity in the corpus." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-51", "text": "----------------------------------" }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-52", "text": "**DOCUMENT REPRESENTATIONS**" }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-53", "text": "lda2vec embeds both words and document vectors into the same space and trains both representations simultaneously." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-54", "text": "By adding the pivot and document vectors together, both spaces are effectively joined." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-55", "text": "Mikolov et al. (2013) provide the intuition that word vectors can be summed together to form a semantically meaningful combination of both words." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-56", "text": "For example, the vector representation for Germany + airline is similar to the vector for Luf thansa." 
}, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-57", "text": "We would like to exploit the additive property of word vectors to construct a meaningful sum of word and document vectors." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-58", "text": "For example, if as lda2vec is scanning a document the jth word is Germany, then neighboring words are predicted to be similar such as F rance, Spain, and Austria. But if the document is specifically about airlines, then we would like to construct a document vector similar to the word vector for airline." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-59", "text": "Then instead of predicting tokens similar to Germany alone, predictions similar to both the document and the pivot word can be made such as: Luf thansa, Condor F lugdienst, and Aero Lloyd." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-60", "text": "Motivated by the meaningful sums of words vectors, in lda2vec the context vector is explicitly designed to be the sum of a document vector and a word vector as in (3):" }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-61", "text": "This models document-wide relationships by preserving d j for all word-context pairs in a document, while still leveraging local inter-word relationships stemming from the interaction between the pivot word vector w j and target word w i ." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-62", "text": "The document and word vectors are summed together to form a context vector that intuitively captures long-and short-term themes, respectively." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-63", "text": "In order to prevent co-adaptation, we also perform dropout on both the unnormalized document vector d j and the pivot word vector w j (Hinton et al., 2012)." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-66", "text": "If we only included structure up to this point, the model would produce a dense vector for every document." 
}, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-67", "text": "However, lda2vec strives to form interpretable representations and to do so an additional constraint is imposed such that the document representations are similar to those in traditional LDA models." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-68", "text": "We aim to generate a document vector from a mixture of topic vectors and to do so, we begin by constraining the document vector d j to project onto a set of latent topic vectors t 0 , t 1 , ..., t k :" }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-69", "text": "Each weight 0 \u2264 p jk \u2264 1 is a fraction that denotes the membership of document j in the topic k. For example, the Twenty Newsgroups model described later has 11313 documents and n = 20 topics so j = 0...11312, k = 0...19." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-70", "text": "When the word vector dimension is set to 300, it is assumed that the document vectors d j , word vectors w i and topic vectors t k all have dimensionality 300." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-71", "text": "Note that the topics t k are shared and are a common component to all documents but whose strengths are modulated by document weights p jk that are unique to each document." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-72", "text": "To aid interpretability, the document memberships are designed to be non-negative, and to sum to unity." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-73", "text": "To achieve this constraint, a softmax transform maps latent vectors initialized in R 300 onto the simplex defined by p jk ." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-74", "text": "The softmax transform naturally enforces the constraint that \u03a3 k p jk = 1 and allows us interpret memberships as percentages rather than unbounded weights." 
}, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-75", "text": "Formulating the mixture in (4) as a sum ensures that topic vectors t k , document vectors d j and word vectors w i , operate in the same space." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-76", "text": "As a result, what words w i are most similar to any given topic vector t k can be directly calculated." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-77", "text": "While each topic is not literally a token present in the corpus, it's similarity to other tokens is meaningful and can be measured." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-78", "text": "Furthermore, by examining the list of most similar words one can attempt to interpret what the topic represents." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-79", "text": "For example, by calculating the most similar token to any topic vector (e.g. argmax i ( t 0 \u00b7 w i )) one may discover that the first topic vector t 0 is similar to the tokens pitching, catcher, and Braves while the second topic vector t 1 may be similar to Jesus, God, and faith." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-80", "text": "This provides us the option to interpret the first topic as baseball topic, and as a result the first component in every document proportion p j0 indicates how much document j is in the baseball topic." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-81", "text": "Similarly, the second topic may be interpreted as Christianity and the second component of any document proportion p j1 indicates the membership of that document in the Christianity topic." 
}, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-82", "text": "----------------------------------" }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-83", "text": "**SPARSE MEMBERSHIPS**" }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-84", "text": "Finally, the document weights p ij are sparsified by optimizing the document weights with respect to a Dirichlet likelihood with a low concentration parameter \u03b1:" }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-85", "text": "The overall objective in (5) measures the likelihood of document j in topic k summed over all available documents." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-86", "text": "The strength of this term is modulated by the tuning parameter \u03bb." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-87", "text": "This simple likelihood encourages the document proportions coupling in each topic to be sparse when \u03b1 < 1 and homogeneous when \u03b1 > 1." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-88", "text": "To drive interpretability, we are interested in finding sparse memberships and so set \u03b1 = n \u22121 where n is the number of topics." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-89", "text": "We also find that setting the overall strength of the Dirichlet optimization to \u03bb = 200 works well." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-90", "text": "Document proportions are initialized to be relatively homogeneous, but as time progresses, the L d encourages document proportions vectors to become more concentrated (e.g. sparser) over time." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-91", "text": "In experiments without this sparsity-inducing term (or equivalently when \u03b1 = 1) the document weights p ij tend to have probability mass spread out among all elements." 
}, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-92", "text": "Without any sparsity inducing terms the existence of so many non-zero weights makes interpreting the document vectors difficult." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-93", "text": "Furthermore, we (R\u00f6der et al., 2015) ." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-94", "text": "The number of topics chosen is given, as well as the negative sampling exponent parameter \u03b2." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-95", "text": "Compared to \u03b2 = 1.00, \u03b2 = 0.75 draws more rare words as negative samples." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-96", "text": "The best topic coherences are found in models n = 20 topics and a \u03b2 = 0.75." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-97", "text": "find that the topic basis are also strongly affected, and the topics become incoherent." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-98", "text": "----------------------------------" }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-99", "text": "**PREPROCESSING AND TRAINING**" }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-100", "text": "The objective in (1) is trained in individual minibatches at a time while using the Adam optimizer (Kingma and Ba, 2014) for two hundred epochs across the dataset." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-101", "text": "The Dirichlet likelihood term L d is typically computed over all documents, so in modifying the objective to minibatches we adjust the loss of the term to be proportional to the minibatch size divided by the size of the total corpus." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-102", "text": "Our software is open source, available online, documented and unit tested 1 ." 
}, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-103", "text": "Finally, the top ten most likely words in a given topic are submitted to the online Palmetto 2 topic quality measuring tool and the coherence measure C v is recorded." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-104", "text": "After evaluating multiple alternatives, C v is the recommended coherence metric in R\u00f6der et al. (2015) ." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-105", "text": "This measure averages the Normalized Pointwise Mutual Information (NPMI) for every pair of words within a sliding window of size 110 on an external corpus and returns mean of the NPMI for the submitted set of words." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-106", "text": "Token-toword similarity is evaluated using the 3COSMUL measure (Levy and Goldberg, 2014b" }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-107", "text": "----------------------------------" }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-108", "text": "**EXPERIMENTS**" }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-109", "text": "----------------------------------" }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-110", "text": "**TWENTY NEWSGROUPS**" }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-111", "text": "This section details experiments in discovering the salient topics in the Twenty Newsgroups dataset, a popular corpus for machine learning on text." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-112", "text": "Each document in the corpus was posted to one of twenty possible newsgroups." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-113", "text": "While the text of each post is available to lda2vec, each of the newsgroup partitions is not revealed to the algorithm but is nevertheless useful for post-hoc qualitative evaluations of the discovered topics." 
}, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-114", "text": "The corpus is preprocessed using the data loader available in Scikit-learn (Pedregosa et al., 2012) and tokens are identified using the SpaCy parser (Honnibal and Johnson, 2015) ." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-115", "text": "Words are lemmatized to group multiple inflections into single tokens." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-116", "text": "Tokens that occur fewer than ten times in the corpus are removed, as are tokens that appear to be URLs, numbers or contain special symbols within their orthographic forms." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-117", "text": "After preprocessing, the dataset contains 1.8 million observations of 8,946 unique tokens in 11,313 documents." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-118", "text": "Word vectors are initialized to the pretrained values found in Mikolov et al. (2013) but otherwise updates are allowed to these vectors at training time." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-119", "text": "A range of lda2vec parameters are evaluated by varying the number of topics n \u2208 20, 30, 40, 50 and the negative sampling exponent \u03b2 \u2208 0.75, 1.0." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-120", "text": "The best topic coherences were achieved with n = 20 topics and with negative sampling power \u03b2 = 0.75 as summarized in Figure 2 ." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-121", "text": "We briefly experimented with variations on dropout ratios but we did not observe any substantial differences." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-122", "text": "Figure 3 lists four example topics discovered in the Twenty Newsgroups dataset." 
}, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-123", "text": "Each topic is associated with a topic vector that lives in the same space as the trained word vectors and listed are the most similar words to each topic vector." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-124", "text": "The first topic shown has high similarity to the tokens astronomical, Astronomy, satellite, planetary, and telescope and is thus likely a 'Space'-related topic similar to the 'sci.space' newsgroup." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-125", "text": "The second example topic is similar to words semantically related to 'Encryption', such as Clipper and encrypt, and is likely related to the 'sci.crypt' newsgroup." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-126", "text": "The third and four example topics are 'X Windows' and 'Middle East' which likely belong to the 'comp.windows.x' and 'talk.politics.mideast' newsgroups." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-127", "text": "----------------------------------" }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-128", "text": "**HACKER NEWS COMMENTS CORPUS**" }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-129", "text": "This section evaluates lda2vec on a very large corpus of Hacker News 3 comments." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-130", "text": "Hacker News is social content-voting website and community whose focus is largely on technology and entrepreneurship." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-131", "text": "In this corpus, a single document is composed of all of the words in all comments posted to a single article." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-132", "text": "Only stories with more than 10 comments are included, and only comments from users with more than 10 comments are included." 
}, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-133", "text": "We ignore other metadata such as votes, timestamps, and author identities." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-134", "text": "The raw dataset 4 is available for download online." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-135", "text": "The corpus is nearly fifty times the size of the Twenty Newsgroups corpus which is sufficient for learning a specialized vocabulary." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-136", "text": "To take advantage of this rich corpus, we use the SpaCy to tokenize whole noun phrases and entities at once (Honnibal and Johnson, 2015) ." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-137", "text": "The specific tokenization procedure 5 is also available online, as are the pre-\"Housing Issues\" \"Internet Portals\" \"Bitcoin\" \"Compensation\" \"Gadget We train an lda2vec model using 40 topics and 256 hidden units and report the learned topics that demonstrate the themes present in the corpus." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-138", "text": "Furthermore, we demonstrate that word vectors and semantic relationships specific to this corpus are learned." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-139", "text": "In Figure 4 five example topics discovered by lda2vec in the Hacker News corpus are listed." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-140", "text": "These topics demonstrate that the major themes of the corpus are reproduced and represented in learned topic vectors in a similar fashion as in LDA (Blei et al., 2003) ." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-141", "text": "The first, which we Figure 5 demonstrates that token similarities are learned in a similar fashion as in SGNS (Mikolov et al., 2013 ) but specialized to the Hacker News corpus." 
}, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-142", "text": "Tokens similar to the token Artificial sweeteners include other sugar-related tokens like fructose and food-related tokens such as paleo diet." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-143", "text": "Tokens similar to Black holes include physics-related concepts such as galaxies and dark matter." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-144", "text": "The Hacker News corpus devotes a substantial quantity of text to fonts and design, and the words most similar to Comic Sans are other popular fonts (e.g. Times New Roman and Helvetica) as well as font-related concepts such as typeface and serif font." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-145", "text": "Tokens similar to Functional Programming demonstrate similarity to other computer science-related tokens while tokens similar to San Francisco include other large American cities as well smaller cities located in the San Francisco Bay Area." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-146", "text": "Figure 6 demonstrates that in addition to learning topics over documents and similarities to word tokens, linear regularities between tokens are also learned." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-147", "text": "The 'Query' column lists a selection of tokens that when combined yield a token vector closest to the token shown in the 'Result' column." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-148", "text": "The subtractions and additions of vectors are evaluated literally, but instead take advantage of the 3COSMUL objective (Levy and Goldberg, 2014b) ." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-149", "text": "The results show that relationships between tokens important to the Hacker News community exists between the token vectors." 
}, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-150", "text": "For example, the vector for Silicon Valley is similar to both California and technology, Bitcoin is indeed a digital currency, Node.js is a technology that enables running Javascript on servers instead of on clientside browsers, Jeff Bezos and Mark Zuckerberg are CEOs of Amazon and Facebook respectively, NLP and computer vision are fields of machine learning research primarily dealing with text and images respectively, Edward Snowden and Julian Assange are both whistleblowers who were primarily located in the United States and Sweden and finally the Kindle and the Surface Pro are both tablets manufactured by Amazon and Microsoft respectively." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-151", "text": "In the above examples semantic relationships between tokens encode for attributes and features including: location, currencies, server v.s. client, leadership figures, machine learning fields, political figures, nationalities, companies and hardware." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-152", "text": "----------------------------------" }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-153", "text": "**CONCLUSION**" }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-154", "text": "This work demonstrates a simple model, lda2vec, that extends SGNS (Mikolov et al., 2013) to build unsupervised document representations that yield coherent topics." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-155", "text": "Word, topic, and document vectors are jointly trained and embedded in a common representation space that preserves semantic regularities between the learned word vectors while still yielding sparse and interpretable documentto-topic proportions in the style of LDA (Blei et al., 2003) ." 
}, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-156", "text": "Topics formed in the Twenty Newsgroups corpus yield high mean topic coherences which have been shown to correlate with human evaluations of topics (R\u00f6der et al., 2015) ." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-157", "text": "When applied to a Hacker News comments corpus, lda2vec discovers the salient topics within this community and learns linear relationships between words that allow it solve word analogies in the specialized vocabulary of this corpus." }, { "sent_id": "197b557d7b5c7c2d195be84990719b-C001-158", "text": "Finally, we note that our method is simple to implement in automatic differentiation frameworks and can lead to more readily interpretable unsupervised representations." } ], "y": { "@BACK@": { "gold_contexts": [ [ "197b557d7b5c7c2d195be84990719b-C001-24" ], [ "197b557d7b5c7c2d195be84990719b-C001-55" ] ], "cite_sentences": [ "197b557d7b5c7c2d195be84990719b-C001-24", "197b557d7b5c7c2d195be84990719b-C001-55" ] }, "@EXT@": { "gold_contexts": [ [ "197b557d7b5c7c2d195be84990719b-C001-34" ], [ "197b557d7b5c7c2d195be84990719b-C001-154" ] ], "cite_sentences": [ "197b557d7b5c7c2d195be84990719b-C001-34", "197b557d7b5c7c2d195be84990719b-C001-154" ] }, "@USE@": { "gold_contexts": [ [ "197b557d7b5c7c2d195be84990719b-C001-40" ], [ "197b557d7b5c7c2d195be84990719b-C001-47" ], [ "197b557d7b5c7c2d195be84990719b-C001-118" ] ], "cite_sentences": [ "197b557d7b5c7c2d195be84990719b-C001-40", "197b557d7b5c7c2d195be84990719b-C001-47", "197b557d7b5c7c2d195be84990719b-C001-118" ] }, "@DIF@": { "gold_contexts": [ [ "197b557d7b5c7c2d195be84990719b-C001-43" ] ], "cite_sentences": [ "197b557d7b5c7c2d195be84990719b-C001-43" ] }, "@SIM@": { "gold_contexts": [ [ "197b557d7b5c7c2d195be84990719b-C001-141" ] ], "cite_sentences": [ "197b557d7b5c7c2d195be84990719b-C001-141" ] } } }, "ABC_9c5baf669470fe4dd18277591591f1_17": { "x": [ { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-1", "text": 
"**ABSTRACT**" }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-2", "text": "Recent studies have shown that pre-trained contextual word embeddings, which assign the same word different vectors in different contexts, improve performance in many tasks. But while contextual embeddings can also be trained at the character level, the effectiveness of such embeddings has not been studied." }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-3", "text": "We derive character-level contextual embeddings from Flair (Akbik et al., 2018) , and apply them to a time normalization task, yielding major performance improvements over the previous state-of-the-art: 51% error reduction in news and 33% in clinical notes." }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-4", "text": "We analyze the sources of these improvements, and find that pre-trained contextual character embeddings are more robust to term variations, infrequent terms, and cross-domain changes." }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-5", "text": "We also quantify the size of context that pretrained contextual character embeddings take advantage of, and show that such embeddings capture features like part-of-speech and capitalization." }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-6", "text": "----------------------------------" }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-8", "text": "Pre-trained language models (LMs) such as ELMo (Peters et al., 2018) , ULMFiT (Howard and Ruder, 2018) , OpenAI GPT (Radford et al., 2018) , Flair (Akbik et al., 2018) and Bert (Devlin et al., 2018) have shown great improvements in NLP tasks ranging from sentiment analysis to named entity recognition to question answering." 
}, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-9", "text": "These models are trained on huge collections of unlabeled data and produce contextualized word embeddings, i.e., each word receives a different vector representation in each context, rather than a single common vector representation regardless of context as in word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) ." }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-10", "text": "Research is ongoing to study these models and determine where their benefits are coming from (Peters et al., 2018; Radford et al., 2018; Khandelwal et al., 2018; Zhang and Bowman, 2018) ." }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-11", "text": "The analyses have focused on wordlevel models, yet character-level models have been shown to outperform word-level models in some NLP tasks, such as text classification (Zhang et al., 2015) , named entity recognition (Kuru et al., 2016) , and time normalization (Laparra et al., 2018a) ." }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-12", "text": "Thus, there is a need to study pre-trained contextualized character embeddings, to see if they also yield improvements, and if so, to analyze where those benefits are coming from." }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-13", "text": "All of the pre-trained word-level contextual embedding models include some character or subword components in their architecture." }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-14", "text": "For example, Flair is a forward-backward LM trained over characters using recurrent neural networks (RNNs), that generates pre-trained contextual word embeddings by concatenating the forward LM's hidden state for the word's last character and the backward LM's hidden state for the word's first character." 
}, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-15", "text": "Flair achieves state-of-the-art or competitive results on part-of-speech tagging and named entity tagging (Akbik et al., 2018) ." }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-16", "text": "Though they do not pre-train a LM, Bohnet et al. (2018) similarly apply a bidirectional long short term memory network (LSTM) layer on all characters of a sentence and generate contextual word embeddings by concatenating the forward and backward LSTM hidden states of the first and last character in each word." }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-17", "text": "Together with other techniques, they achieve state-of-the-art performance on part-of-speech and morphological tagging." }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-18", "text": "However, both Akbik et al. (2018) and Bohnet et al. (2018) discard all other contextual character embeddings, and no analyses of the models are performed at the character-level." }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-19", "text": "In the current paper, we derive pre-trained contextual character embeddings from Flair's forwardbackward LM trained on a 1-billion word corpus of English (Chelba et al., 2014) , and observe if these embeddings yield the same large improvements for character-level tasks as yielded by pre-trained contextual word embeddings for word-level tasks." }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-20", "text": "We aim to analyze where improvements come from (e.g., term variations, low frequency words) and what they depend on (e.g., embedding size, context size)." }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-21", "text": "We focus on the task of parsing time normalizations (Laparra et al., 2018b) , where large gains of character-level models over word-level models have been observed (Laparra et al., 2018a) ." 
}, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-22", "text": "This task involves finding and composing pieces of a time expression to infer time intervals, so for example, the expression 3 days ago could be normalized to the interval [2019-03-01, 2019-03-02) ." }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-23", "text": "We first take a state-of-the-art neural network for parsing time normalizations (Laparra et al., 2018a) and replace its randomly initialized character embeddings with pre-trained contextual character embeddings." }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-24", "text": "After showing that this yields major performance improvements, we analyze the improvements to understand why pre-trained contextual character embeddings are so useful." }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-25", "text": "Our contributions are:" }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-26", "text": "\u2022 We derive pre-trained contextual character embeddings from Flair (Akbik et al., 2018) , apply them to a state-of-the art time normalizer (Laparra et al., 2018a) , and obtain major performance improvements over the previous state-of-the-art: 51% error reduction in news and 33% error reduction in clinical notes." }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-27", "text": "\u2022 We demonstrate that pre-trained contextual character embeddings are more robust to term variations, infrequent terms, and crossdomain changes." }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-28", "text": "\u2022 We quantify the amount of context leveraged by pre-trained contextual character embeddings." }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-29", "text": "\u2022 We show that pre-trained contextual character embeddings remove the need for features like part-of-speech and capitalization." 
}, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-30", "text": "----------------------------------" }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-31", "text": "**FRAMEWORK**" }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-32", "text": "The parsing time normalizations task is based on the Semantically Compositional Annotation of Time Expressions (SCATE) schema (Bethard and Parker, 2016) , in which times are annotated as compositional time entities." }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-33", "text": "Laparra et al. (2018a) decomposes the Parsing Time Normalizations task into two subtasks: a) time entity identification using a character-level sequence tagger which detects" }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-34", "text": "Bi-GRUs (3) Softmax (3) Outputs ( the spans of characters that belong to each time expression and labels them with their corresponding time entity; and b) time entity composition using a simple set of rules that links relevant entities together while respecting the entity type constraints imposed by the SCATE schema." }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-35", "text": "These two tasks are run sequentially using the predicted output of the sequence tagger as input to the rule-based time entity composition system." }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-36", "text": "In this paper, We focus on the character-level time entity identifier that is the foundation of Laparra et al. (2018a) 's model." }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-37", "text": "The sequence tagger is a multi-output RNN with three different input features, shown in Figure 1 ." }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-38", "text": "Features are mapped through an embedding layer, then fed into stacked bidirectional Gated Recurrent Units (bi-GRUs), and followed by a softmax layer." 
}, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-39", "text": "There are three types of outputs per Laparra et al. (2018a) 's encoding of the SCATE schema, so there is a separate stack of bi-GRUs and a softmax for each output type." }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-40", "text": "We keep the original neural architecture and parameter settings in Laparra et al. (2018a) , and experiment with the following embedding layers: Rand(128): the original setting of Laparra et al. (2018a) , where 128-dimensional character embeddings are randomly initialized." }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-41", "text": "Rand (4096) ning Flair forward-backward character-level LM Flair's forward and backward character-level language models over the text, and concatenating the hidden states from forward and backward character-level LMs for each character ." }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-42", "text": "We evaluate in the clinical and news domains, the former being more than 9 times larger and the latter having a more diverse set of labels." }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-43", "text": "Three different evaluation metrics are used for parsing time normalization tasks: identification of time entities, which evaluates the predicted span (offsets) and the SCATE type for each entity; parsing of time entities, which evaluates the span, the SCATE type, and properties for each time entity; interval extraction, which interprets parsed annotations as intervals along the timeline and measures the fraction of the correctly parsed intervals." }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-44", "text": "The SemeEval task description paper (Laparra et al., 2018b ) has more details on dataset statistics and evaluation metrics." 
}, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-45", "text": "Table 3 shows how pre-trained contextual character embeddings improve performance on both term variations and low frequency words." }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-46", "text": "We define term variations as time entities that appear in the training data in the following patterns: both upper-case and lower-case, e.g., DAY, Day, and day; abbreviation with and without punctuation, e.g., AM and A.M.; or same stem, e.g., Month and Months, previously and previous." }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-47", "text": "In the dev and test sets, 30." }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-48", "text": "over Rand(4096) on time entities with (+var) and without (-var) term variations." }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-49", "text": "Cont (4096) is always better than Rand(4096) so all differences are positive, but the improvements in +var are much larger than those of -var in the news domain (+8.4 vs. +1.6 and +15.0 vs. +8.7)." }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-50", "text": "In the clinical domain, where 9 times more training data is available, both +var and -var yield similar improvements." }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-51", "text": "We conclude that pre-trained contextual character embeddings are mostly helpful with term variations in low data scenarios." }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-52", "text": "We define infrequent terms as time entities that occur in the training set 10 or fewer times." }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-53", "text": "In the dev and test sets, 73.9-86.9% of terms are infrequent, with about one third of infrequent terms being numerical 3 ." 
}, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-54", "text": "The bottom two rows of table 3 show the improvements in F 1 of the Cont(4096) over Rand(4096) on frequent (>10) and infrequent (\u226410) terms." }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-55", "text": "Cont(4096) is always better than Rand(4096), and in both domains the improvements on low frequency terms are always greater than those on high frequency terms (+8.1 vs. +2.4 in news dev, +17.6 vs. +5.0 in news test, etc.)." }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-56", "text": "We conclude that pre-trained contextual character embeddings improve the representations of low frequency words in both low and high data settings." }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-57", "text": "----------------------------------" }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-58", "text": "**RESULTS**" }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-59", "text": "----------------------------------" }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-60", "text": "**ROBUSTNESS TO VARIANTS AND FREQUENCY**" }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-61", "text": "----------------------------------" }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-62", "text": "**ROBUSTNESS TO DOMAIN DIFFERENCES**" }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-63", "text": "To illustrate the ability of pre-trained contextual character embeddings to handle unseen data, we train the models in one domain and evaluate in the other, as shown in Table 4 ." }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-64", "text": "We find that Rand(128) and Rand(4096) achieve similar cross-domain performance, e.g., Rand(128) achieves 63.4% of F 1 on news dev and Rand (4096) improvements are significant (p < 0.001)." }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-65", "text": "We conclude that pre-trained contextual character embeddings generalize better across domains." 
}, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-66", "text": "----------------------------------" }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-67", "text": "**GREATER RELIANCE ON NEARBY CONTEXT**" }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-68", "text": "Inspired by Khandelwal et al. (2018) 's analysis of the effective context size of a word-based language model, we present an ablation study to measure performance when contextual information is removed." }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-69", "text": "Specifically, when evaluating models, we retain only the characters in a small window around each time entity in the dev and test sets, and replace all other characters with padding characters." }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-70", "text": "Figures 2a and 2b evaluate the Cont(4096), Rand(4096) and Rand(128) models across different context window sizes on the news dev and test set, respectively." }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-71", "text": "Rand(128) performs similarly across all context sizes, suggesting that it makes little use of context information." }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-72", "text": "Both Rand (4096) and Cont(4096) depend heavily of context: without any context information (context size 0), they perform worse than Rand(128)." }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-73", "text": "Cont (4096) is sensitive to the nearby context, with a \u223c10 point gain on news dev and \u223c15 point gain on news test from just the first 10 characters of context, putting it easily above Rand(128)." }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-74", "text": "Rand(4096) doesn't exceed the performance of Rand(128) until at least 50 characters of context." 
}, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-75", "text": "Figures 2c and 2d shows similar trends in the clinical domain, except that the Rand(128) model now shows a small dependence on context, with a \u223c5 point drop on clinical dev and a \u223c3 drop on clinical test in the no-context setting." }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-76", "text": "Cont(4096) again makes large improvements in just the first 10 characters, and Rand(4096) now takes close to 100 characters of context to reach the performance of Rand(128)." }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-77", "text": "We conclude that pre-trained contextual character embeddings make better use of local context, especially within the first 10 characters." }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-78", "text": "----------------------------------" }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-79", "text": "**ENCODING WORD CATEGORIES**" }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-80", "text": "We perform a feature ablation to see if pre-trained contextual character embeddings capture basic syntax (e.g., part-of-speech) like pre-trained contextual word embeddings do (Peters et al., 2018; Akbik et al., 2018) ." }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-81", "text": "Table 5 shows that removing both part-of-speech and unicode category features from Cont (4096) Table 5 : Effect of features on performance: Performance (F 1 ) with different feature sets, including characters (C), part-of-speech tags (P), and unicode character categories (U)." }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-82", "text": "formance for both Rand(128) and Rand(4096) in all cases." }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-83", "text": "For example, Rand(4096) with all features achieves 82.7 F 1 on news dev, significantly better than the 80.5 F 1 of using only characters (p = 0.0467)." 
}, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-84", "text": "We conclude that pre-trained contextual character embeddings encode a variety of word category information such as part-of-speech, capitalization, and punctuation." }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-85", "text": "----------------------------------" }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-86", "text": "**CONCLUSION**" }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-87", "text": "We derive pre-trained character-level contextual embeddings from Flair (Akbik et al., 2018) , a wordlevel embedding model, inject these into a state-ofthe-art time normalization system, and achieve major performance improvements: 51% error reduction in news and 33% in clinical notes." }, { "sent_id": "9c5baf669470fe4dd18277591591f1-C001-88", "text": "Our detailed analysis concludes that pre-trained contextual character embeddings are more robust to term variations, infrequent terms, and cross-domain changes; that they benefit most from the first 10 characters of context; and that they encode part-of-speech, capitalization, and punctuation information." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "9c5baf669470fe4dd18277591591f1-C001-11" ], [ "9c5baf669470fe4dd18277591591f1-C001-21" ], [ "9c5baf669470fe4dd18277591591f1-C001-33" ], [ "9c5baf669470fe4dd18277591591f1-C001-39" ] ], "cite_sentences": [ "9c5baf669470fe4dd18277591591f1-C001-11", "9c5baf669470fe4dd18277591591f1-C001-21", "9c5baf669470fe4dd18277591591f1-C001-33", "9c5baf669470fe4dd18277591591f1-C001-39" ] }, "@MOT@": { "gold_contexts": [ [ "9c5baf669470fe4dd18277591591f1-C001-11", "9c5baf669470fe4dd18277591591f1-C001-12" ] ], "cite_sentences": [ "9c5baf669470fe4dd18277591591f1-C001-11" ] }, "@USE@": { "gold_contexts": [ [ "9c5baf669470fe4dd18277591591f1-C001-21" ], [ "9c5baf669470fe4dd18277591591f1-C001-26" ], [ "9c5baf669470fe4dd18277591591f1-C001-36" ] ], "cite_sentences": [ "9c5baf669470fe4dd18277591591f1-C001-21", "9c5baf669470fe4dd18277591591f1-C001-26", "9c5baf669470fe4dd18277591591f1-C001-36" ] }, "@EXT@": { "gold_contexts": [ [ "9c5baf669470fe4dd18277591591f1-C001-23" ], [ "9c5baf669470fe4dd18277591591f1-C001-40" ] ], "cite_sentences": [ "9c5baf669470fe4dd18277591591f1-C001-23", "9c5baf669470fe4dd18277591591f1-C001-40" ] } } }, "ABC_0fd26c6dffab3fba2d120d2c58dff6_17": { "x": [ { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-47", "text": "**RELATED WORK**" }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-2", "text": "In this paper, we are interested in exploiting textual and acoustic data of an utterance for the speech emotion classification task." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-3", "text": "The baseline approach models the information from audio and text independently using two deep neural networks (DNNs)." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-4", "text": "The outputs from both the DNNs are then fused for classification." 
}, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-5", "text": "As opposed to using knowledge from both the modalities separately, we propose a framework to exploit acoustic information in tandem with lexical data." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-6", "text": "The proposed framework uses two bi-directional long short-term memory (BLSTM) for obtaining hidden representations of the utterance." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-7", "text": "Furthermore, we propose an attention mechanism, referred to as the multi-hop, which is trained to automatically infer the correlation between the modalities." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-8", "text": "The multi-hop attention first computes the relevant segments of the textual data corresponding to the audio signal." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-9", "text": "The relevant textual data is then applied to attend parts of the audio signal." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-10", "text": "To evaluate the performance of the proposed system, experiments are performed in the IEMOCAP dataset." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-11", "text": "Experimental results show that the proposed technique outperforms the state-of-the-art system by 6.5% relative improvement in terms of weighted accuracy." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-12", "text": "----------------------------------" }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-13", "text": "**INTRODUCTION**" }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-14", "text": "In this era of high-performance computing, human-computer interaction (HCI) has become pervasive." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-15", "text": "To enrich the user experience, the system is often required to detect human emotion and produce a response with proper emotional context [1, 2] ." 
}, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-16", "text": "The first step in such an HCI involves building a system that recognizes emotion from the speech utterance." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-17", "text": "A speech emotional system aims to identify audio recording as belonging to one of the categories, like happy, sad, angry or neutral." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-18", "text": "Beside HCI, the output of emotion recognition engine is beneficial in the paralinguistic area as well [3] ." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-19", "text": "In this paper, we build a speech emotion recognition system that uses acoustic and textual information in tandem." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-20", "text": "Various approaches to address emotion recognition have been investigated in the literature." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-21", "text": "Most of the techniques involve extracting low-level or high-level acoustic features for this task [4] ." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-22", "text": "In emotion recognition, the lexical content of the audio recording is an important source of information that is usually ignored." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-23", "text": "For example, the presence of words such as \"gorgeous\" and \"stunning\" in the utterance would indicate that the person is happy." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-24", "text": "Recently researchers have also explored the application of textual content of the speech signal for this task." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-25", "text": "In [5] , frame and supra-segmental level features (such as pitch and spectral contours) are derived from the speech signal." 
}, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-26", "text": "Textual information is used by spotting keywords that emphases the emotional states of the speaker." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-27", "text": "The work in [6] also presents an approach to exploit the acoustic and lexical content." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-28", "text": "In particular, they explored conventional acoustic features from the speech signal while the textual information is derived from the bag of word representation." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-29", "text": "Recently, deep neural network (DNN) has shown to provide good results for modeling acoustic and textual information for emotion identification." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-30", "text": "In [5] , textual and acoustic information of the utterance are used by a DNN to obtain hidden feature representations for both the modality." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-31", "text": "These features are then concatenated to represent the utterance and subsequently used to classify the emotion of the speaker." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-32", "text": "Experimental evidence shows the potential of the approach." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-33", "text": "In our previous work [7] , we applied a dual RNN in order to obtain a richer representation by blending the content and acoustic knowledge." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-34", "text": "In this paper, we improve upon our earlier work by incorporating an attention mechanism in the emotion recognition framework." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-35", "text": "The proposed attention mechanism is trained to exploit both textual and acoustic information in tandem." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-36", "text": "We refer to this attention method as the multi-hop." 
}, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-37", "text": "The multi-hop attention is designed to select relevant parts of the textual data, which is subsequently applied to attend to the segments of the audio signal for classification." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-38", "text": "We hypothesize that this approach would automatically detect the segments that contain information relevant for the task." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-39", "text": "The emotion recognition experiments are performed on the standard IEMOCAP dataset [8] ." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-40", "text": "Experimental results indicate that the proposed approach outperforms the state-of-the-art system published in the literature on this database by 6.5% relative improvement in terms of weighted accuracy." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-41", "text": "This paper is organized as follows." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-42", "text": "Section 2 provides a brief literature review on speech emotion recognition." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-43", "text": "In Section 3, we start by describing the baseline bidirectional recurrent encoder model considered in this paper, then introducing the proposed technique in detail." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-44", "text": "Experimental setup for evaluating the system and discussion of the achieved results by various systems are presented in Sections 4." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-45", "text": "Finally, the paper is concluded in Section 5." 
}, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-46", "text": "----------------------------------" }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-48", "text": "Along with classical algorithms based models such as support vector machine (SVM), hidden markov model (HMM) and decision tree [9, 10, 11] , various neural network architectures have been recently introduced for the speech emotion recognition task." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-49", "text": "For example, convolutional neural network (CNN)-based models were trained on spectrograms or audio features such as mel-frequency cepstral coefficients (MFCCs) and low-level descriptors (LLDs) [12, 13, 14] ." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-50", "text": "More complex models such as [15] were designed to better learn nonlinear decision boundaries of emotional speech and achieved the best-recorded performance in audio modality models on IEMOCAP dataset [8] ." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-51", "text": "Several neural network models with attention mechanism have been proposed to efficiently focus on a prominent part of speech and learn temporal dependency within whole utterance [16, 17] ." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-52", "text": "Multi-modal approaches using acoustic features and textual information have been investigated." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-53", "text": "[5] identified emotional key phrases and salience of verbal cues from both phoneme sequences and words." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-54", "text": "Recently, [7, 18] combined acoustic information and conversation transcripts using a neural network-based model to improve emotion classification accuracy." 
}, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-55", "text": "However, none of these studies utilized attention method over audio and text modality in tandem for contextual understanding of the emotion in audio recording." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-56", "text": "----------------------------------" }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-57", "text": "**MODEL**" }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-58", "text": "This section describes the methodologies that are applied to the speech emotion recognition task." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-59", "text": "We start by introducing a baseline model, the bidirectional recurrent encoder, for encoding the audio and text modalities individually." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-60", "text": "We then propose an approach to exploit both audio and text data in tandem." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-61", "text": "In this technique, multihop attention is proposed to obtain relevant parts of audio and text data automatically." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-62", "text": "----------------------------------" }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-63", "text": "**BIDIRECTIONAL RECURRENT ENCODER**" }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-64", "text": "Motivated by the architecture used in [7, 17, 19] , we train a recurrent encoder to predict the categorical class of a given audio signal." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-65", "text": "To model the sequential nature of the speech signal, we use a bidirectional recurrent encoder (BRE) as shown in the Figure 1 (a)." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-66", "text": "We also added a residual connection to the model for promoting convergence during training [20] ." 
}, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-67", "text": "A sequence of feature vectors is fed as input to the BRE, which leads to the formation of hidden states of the model as given by the following equation:" }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-68", "text": "where f \u03b8 , f \u03b8 are the forward and backward long short-term memory (LSTM) with weight parameter \u03b8, ht represents the hidden state at t-th time step, and xt represents the t-th MFCC features in audio signal." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-69", "text": "The hidden representations ( \u2212 \u2192 h t, \u2190 \u2212 h t) from forward/backward LSTMs are concatenated for produce the feature, ot." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-94", "text": "Similarly to MHA-1, we further apply 3rd-hop attention to textual sequence, referred to MHA-3, with the new audio hidden representation H 2 (equation 4)." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-70", "text": "To follow previous research [7] , we also add another prosodic feature vector, p, with each ot to generate a more informative vector representation of the signal, o A t ." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-71", "text": "Finally, an emotion class is predicted from the acoustic signal by applying a softmax function to the final hidden representation at the last time step, o A last ." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-72", "text": "We refer this model as audio-BRE with the objective function as follows:" }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-73", "text": "where yi,c is the true label vector, and\u0177i,c is the predicted probability distribution from the softmax layer." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-74", "text": "The W and the bias b are learned model parameters." 
}, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-75", "text": "C is the total number of classes, and N is the total number of samples used in training." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-76", "text": "Next, we attempt to use the processed textual information as another modality in predicting the emotion class of a given signal." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-77", "text": "To obtain textual hidden representation, o T t , we tokenize the transcript and feed it into the BRE in such a way that the acoustic signals are encoded by equation (1) ." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-78", "text": "We refer this model as text-BRE." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-79", "text": "The training objective for the text-BRE is same as the audio-BRE in equation (2)." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-80", "text": "----------------------------------" }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-81", "text": "**PROPOSED MULTI-HOP ATTENTION**" }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-82", "text": "We propose a novel multi-hop attention method to predict the importance of audio and text, referred to multi-hop attention (MHA)." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-83", "text": "Figure 1 shows the architecture of the proposed MHA model." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-84", "text": "Previous research used multi-modal information independently using neural network model by concatenating features from each modality [7, 21] ." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-85", "text": "As opposed to this approach, we propose a neural network architecture that exploits information in each modality by extracting relevant segments of the speech data using information from the lexical content (and vice-versa)." 
}, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-86", "text": "First, the acoustic and textual data are encoded with the audio-BRE and text-BRE, respectively, using equation (1)." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-87", "text": "We then consider the final hidden representation of audio-BRE, o A last , as a context vector and apply attention method to the textual sequence, o T t ." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-88", "text": "As this model is developed with a single attention method, we refer to the model as MHA-1." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-89", "text": "The final hidden representation of the MHA-1 model, H, is calculated as follows:" }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-90", "text": "The H 1 (equation 3) is a new hidden representation for textual information with consideration of audio modality." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-91", "text": "With this information, we apply 2nd-hop attention, referred to MHA-2, to the audio sequence." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-92", "text": "The final hidden representation of the MHA-2 model, H, is calculated as follows:" }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-93", "text": "where H 2 is a new hidden representation for audio information with the consideration of textual modality obtained from the MHA-1." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-95", "text": "The final hidden representation of the MHA-3 model, H, is calculated as follows:" }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-96", "text": ", (i = 1, ..., t)" }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-97", "text": "where H 3 is updated representative vector of the textual information with the consideration of audio modality one more time." 
}, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-98", "text": "In each case, the final hidden representation, H, is passed through the softmax function to predict the four-categories emotion class." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-99", "text": "We use the same training objective as the BRE model with equation (2) , and the predicted probability distribution for the target class,\u0177c is as follows: where projection matrix W and bias b are leaned model parameters." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-100", "text": "----------------------------------" }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-101", "text": "**EXPERIMENTS**" }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-102", "text": "----------------------------------" }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-103", "text": "**DATASET AND EXPERIMENTAL SETUP**" }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-104", "text": "To train and evaluate our model, we use the Interactive Emotional Dyadic Motion Capture (IEMOCAP) [8] dataset, which includes five sessions of utterances between two speakers (one male and one female)." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-105", "text": "Total 10 unique speakers participated in this work." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-106", "text": "For consistent comparison with previous works [7, 18] , all utterances labeled \"excitement\" are merged with those labeled \"happiness\"." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-107", "text": "We assign single categorical emotion to the utterance with majority of annotators agreed on the emotion labels." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-108", "text": "The final dataset contains 5,531 utterances in total (1,636 happy, 1,084 sad, 1,103 angry and 1,708 neutral)." 
}, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-109", "text": "In the training process, we perform 10-fold cross-validation where each 8, 1, 1 folds are used for the train set, development set, and test set, respectively." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-110", "text": "----------------------------------" }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-111", "text": "**FEATURE EXTRACTION AND IMPLEMENTATION DETAILS**" }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-112", "text": "As this research is extended work from previous research [7] , we use the same feature extraction method as done in our previous work." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-113", "text": "After extracting 40-dimensional Mel-frequency cepstral coefficients (MFCC) feature (frame size is set to 25 ms at a rate of 10 ms with the Hamming window) using Kaldi [22] , we concatenate it with its first, second order derivates, making the feature dimension to 120." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-114", "text": "We also extract prosodic features by using OpenSMILE toolkit [23] and appending it to the audio feature vector." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-115", "text": "In preparing the textual dataset, we first use the ground-truth transcripts of the IEMOCAP dataset." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-116", "text": "In a practical scenario where we may not access to transcripts of the audio, we obtain all of the transcripts from the speech signal using a commercial ASR system [24] (The performance of the ASR system is word error rate (WER) of 5.53%)." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-117", "text": "We apply word-tokenizer to the transcripts and obtain sequential data for textual input." 
}, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-118", "text": "The maximum length of an audio segment is set to 750 based on the implementation choices presented in [25] and 128 for the textual input which covers the maximum length of the tokenized transcripts." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-143", "text": "The \"A\" and \"T\" in modality indicate \"Audio\" and \"Text\", receptively." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-119", "text": "We minimize the cross-entropy loss function using (equation (2)) the Adam optimizer [26] with a learning rate of 1e-3 and gradients clipped with a norm value of 1." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-120", "text": "For the purposes of regularization, we apply the dropout method, 30%." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-121", "text": "The number of hidden units and the number of layers in the RNN for each model (BRE and MHA) are optimized on the development set." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-122", "text": "----------------------------------" }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-123", "text": "**PERFORMANCE EVALUATION**" }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-124", "text": "To measure the performance of systems, we report the weighted accuracy (WA) and unweighted accuracy (UA) averaging over the 10-fold cross-validation experiments." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-125", "text": "We use the same dataset and features as other researchers [7, 18] ." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-126", "text": "Table 1 presents performances of proposed approaches for recognizing speech emotion in comparison with various models." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-127", "text": "To compare our results from previous approaches, we first use groundtruth transcripts included in the dataset in training textual modality." 
}, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-128", "text": "From the previous model, E vec-MCNN-LSTM encodes acoustic signal and textual information using a neural network (RNN and CNN, respectively) and then fuse each result by concatenating and feeding them into following (SVM) to predict emotion labels." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-129", "text": "On the other hand, MDRE model use dual-RNNs to encode both the modalities and merge the results using another fully-connect neural network layer." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-130", "text": "This MDRE approach applies end-to-end learning and outperforms E vec-MCNN-LSTM by 10.6% relative (0.649 to 0.718 absolute) in terms of WA." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-131", "text": "Among our proposed system, the audio-BRE model that uses an acoustic signal with bidirectional-RNN architecture achieves WA 0.646." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-132", "text": "Interestingly, the text-BRE model that use textual information shows higher performance than that of audio-BRE by 8% relative (0.646 to 0.698) in WA." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-133", "text": "The multi-hop attention model, MHA-N, (N = 1, 2, 3) , shows a substantial performance gain." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-134", "text": "In particular, the MHA-2 model (best performing system among MHA-N) outperformed the best baseline model, MDRE, by 6.5% relative (0.718 to 0.765) in WA." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-135", "text": "Although we observe performance degradation in the MHA-3 model, we believe that this could be due to the limited data for training." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-136", "text": "In a practical scenario, we may not access the audio transcripts." 
}, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-137", "text": "We describe the effect of using ASR-processed transcripts on the proposed system." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-138", "text": "From table 1, we observe performance degradation in text-BRE-ASR and MHA-2-ASR (our best system), compared to that of text-BRE and MHA-2 by 6.6% (0.698 to 0.652) and 4.6% (0.765 to 0.730) relative in WA, receptively." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-139", "text": "Even with the erroneous transcripts (WER = 5.53%), however, the proposed approach Table 1 ." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-140", "text": "Model performance comparisons." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-141", "text": "The top 2 bestperforming models (according to the WA) are marked in bold." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-142", "text": "The \"-ASR\" models use ASR-processed transcripts from the Google Cloud Speech API." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-144", "text": "(MHA-2-ASR) outperforms the best baseline system (MDRE) by 1.6% relative (0.718 to 0.730) in terms of WA." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-145", "text": "Figure 2 shows the confusion matrices of the proposed systems." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-146", "text": "In audio-BRE ( Fig. 2(a) ), most of the emotion labels are frequently misclassified as neutral class, supporting the claims of [7, 25] ." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-147", "text": "The text-BRE shows improvement in classifying most of the labels in Fig. 2(b) ." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-148", "text": "In particular, angry and happy classes are correctly classified by 32% (57.14 to 75.41) and 63% (40.21 to 65.56) relative in accuracy with respect to audio-BRE, receptively." 
}, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-149", "text": "However, it incorrectly predicted instances of the happy class as sad class in 10% of the time, even though these emotional states are opposites of one another." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-150", "text": "The MHA-2 (our best system, Fig. 2(c) ) compensates for the weaknesses of the single modality models and benefits from their strengths." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-151", "text": "It shows significant performance gain for angry, happy, sad and neutral classes by 6%, 20%, 15% and 13% relative in accuracy with respect to text-BRE." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-152", "text": "It also correctly classify neutral class similar to that of audio-BRE (81.63 and 78.00 for audio-BRE and MHA-2, receptively)." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-153", "text": "Interestingly, although MHA-2 shows superior discriminating ability among emotion classes, it still shows the tendency such that most of the incorrect cases are misclassified into neutral class." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-154", "text": "We consider this observation as a future research direction." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-155", "text": "----------------------------------" }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-156", "text": "**ERROR ANALYSIS**" }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-157", "text": "----------------------------------" }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-158", "text": "**CONCLUSIONS**" }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-159", "text": "In this paper, we propose a multi-hop attention model to combine acoustic and textual data for speech emotion recognition task." 
}, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-160", "text": "The proposed attention method is designed to select relevant parts of the textual data, which is subsequently applied to attend to the segments of the audio signal for classification." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-161", "text": "Extensive experiments show that the proposed MHA-2 outperforms the best baseline system in classifying the four emotion categories by 6.5% (0.718 to 0.765 absolute) in terms of WA when the model is applied to the IEMOCAP dataset." }, { "sent_id": "0fd26c6dffab3fba2d120d2c58dff6-C001-162", "text": "We further test our model with ASR-processed transcripts and achieve WA 0.73 that shows the reliability of the proposed system (MHA-2-ASR) in the practical scenario where the ground-truth transcripts are not available." } ], "y": { "@BACK@": { "gold_contexts": [ [ "0fd26c6dffab3fba2d120d2c58dff6-C001-33" ], [ "0fd26c6dffab3fba2d120d2c58dff6-C001-54" ], [ "0fd26c6dffab3fba2d120d2c58dff6-C001-84" ] ], "cite_sentences": [ "0fd26c6dffab3fba2d120d2c58dff6-C001-33", "0fd26c6dffab3fba2d120d2c58dff6-C001-54", "0fd26c6dffab3fba2d120d2c58dff6-C001-84" ] }, "@EXT@": { "gold_contexts": [ [ "0fd26c6dffab3fba2d120d2c58dff6-C001-33", "0fd26c6dffab3fba2d120d2c58dff6-C001-34" ], [ "0fd26c6dffab3fba2d120d2c58dff6-C001-112" ] ], "cite_sentences": [ "0fd26c6dffab3fba2d120d2c58dff6-C001-33", "0fd26c6dffab3fba2d120d2c58dff6-C001-112" ] }, "@MOT@": { "gold_contexts": [ [ "0fd26c6dffab3fba2d120d2c58dff6-C001-64" ] ], "cite_sentences": [ "0fd26c6dffab3fba2d120d2c58dff6-C001-64" ] }, "@USE@": { "gold_contexts": [ [ "0fd26c6dffab3fba2d120d2c58dff6-C001-70" ], [ "0fd26c6dffab3fba2d120d2c58dff6-C001-106" ], [ "0fd26c6dffab3fba2d120d2c58dff6-C001-125" ] ], "cite_sentences": [ "0fd26c6dffab3fba2d120d2c58dff6-C001-70", "0fd26c6dffab3fba2d120d2c58dff6-C001-106", "0fd26c6dffab3fba2d120d2c58dff6-C001-125" ] }, "@DIF@": { "gold_contexts": [ [ "0fd26c6dffab3fba2d120d2c58dff6-C001-84", 
"0fd26c6dffab3fba2d120d2c58dff6-C001-85" ] ], "cite_sentences": [ "0fd26c6dffab3fba2d120d2c58dff6-C001-84" ] }, "@SIM@": { "gold_contexts": [ [ "0fd26c6dffab3fba2d120d2c58dff6-C001-146" ] ], "cite_sentences": [ "0fd26c6dffab3fba2d120d2c58dff6-C001-146" ] } } }, "ABC_397e593f8f282d4951402d83036c12_17": { "x": [ { "sent_id": "397e593f8f282d4951402d83036c12-C001-36", "text": "----------------------------------" }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-37", "text": "**VISUALIZATION**" }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-2", "text": "This paper introduces THUMT, an opensource toolkit for neural machine translation (NMT) developed by the Natural Language Processing Group at Tsinghua University." }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-3", "text": "THUMT implements the standard attention-based encoder-decoder framework on top of Theano and supports three training criteria: maximum likelihood estimation, minimum risk training, and semi-supervised training." }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-4", "text": "It features a visualization tool for displaying the relevance between hidden states in neural networks and contextual words, which helps to analyze the internal workings of NMT." }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-5", "text": "Experiments on ChineseEnglish datasets show that THUMT using minimum risk training significantly outperforms GroundHog, a state-of-the-art toolkit for NMT." }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-6", "text": "----------------------------------" }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-8", "text": "End-to-end neural machine translation (NMT) (Sutskever et al., 2014; Bahdanau et al., 2015) has gained increasing popularity in the machine translation community." 
}, { "sent_id": "397e593f8f282d4951402d83036c12-C001-9", "text": "Capable of capturing longdistance dependencies with gating (Hochreiter and Schmidhuber, 1997; Cho et al., 2014) and attention (Bahdanau et al., 2015) mechanisms, NMT has proven to outperform conventional statistical machine translation systematically across a variety of language pairs (Junczys-Dowmunt et al., 2016) ." }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-10", "text": "This paper introduces THUMT, an open-source toolkit developed by the Tsinghua Natural Language Processing Group." }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-11", "text": "On top of Theano (Bergstra et al., 2010) , THUMT implements the standard attention-based encoder-decoder framework for NMT (Bahdanau et al., 2015) ." }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-12", "text": "It sup- * Corresponding author: Yang Liu." }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-13", "text": "ports three training criteria: maximum likelihood estimation (Bahdanau et al., 2015) , minimum risk training , and semisupervised training ." }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-14", "text": "To facilitate the analysis of the translation process in NMT, THUMT also provides a visualization tool that calculates the relevance between hidden layers of neural networks and contextual words." }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-15", "text": "We compare THUMT with the state-of-the-art opensource toolkit GroundHog (Bahdanau et al., 2015) and achieve significant improvements on ChineseEnglish translation tasks by introducing new training criteria and optimizers." }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-16", "text": "2 The Toolkit 2.1 Model THUMT implements the standard attention-based NMT model (Bahdanau et al., 2015) on top of Theano (Bergstra et al., 2010) ." }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-17", "text": "Please refer to (Bahdanau et al., 2015) for more details." 
}, { "sent_id": "397e593f8f282d4951402d83036c12-C001-18", "text": "----------------------------------" }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-19", "text": "**TRAINING CRITERIA**" }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-20", "text": "THUMT supports three training criteria:" }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-21", "text": "1. Maximum likelihood estimation (MLE) (Bahdanau et al., 2015) : the default training criterion in THUMT, which aims to find a set of model parameters that maximizes the likelihood of training data." }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-22", "text": "2. Minimum Risk Training (MRT) : the recommended training criterion in THUMT, which aims to find a set of model parameters that minimizes the risk (i.e., expected loss measured by evaluation metrics) on training data." }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-23", "text": "In THUMT, MLE is often used to initialize MRT." }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-24", "text": "In other words, the model trained with respect to MLE serves as the initial model of MRT." }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-25", "text": "3. Semi-supervised Training (SST) : the recommended training criterion for low-resource language translation." }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-26", "text": "SST is capable of exploiting abundant monolingual corpora to train source-to-target and target-to-source translation models jointly." }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-27", "text": "MLE is also used to initialize SST." }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-28", "text": "----------------------------------" }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-29", "text": "**OPTIMIZATION**" }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-30", "text": "Optimization plays a crucial role in NMT and directly influences the training time and translation quality." 
}, { "sent_id": "397e593f8f282d4951402d83036c12-C001-31", "text": "THUMT supports three optimizers:" }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-32", "text": "1. SGD: standard stochastic gradient descent with fixed learning rate." }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-33", "text": "2. Adadelta (Zeiler, 2012): dynamically adapting learning rate over time according to history." }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-34", "text": "3. Adam (Kingma and Ba, 2015) : computing individual learning rate for different parameters." }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-35", "text": "THUMT uses a modified version of Adam to address the NaN problem." }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-38", "text": "Although NMT achieves state-of-the-art translation performance, it is hard to understand how it works because all internal information is represented as real-valued vectors or matrices." }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-39", "text": "To address this problem, THUMT features a visualization tool to use layer-wise relevance propagation (LRP) (Bach et al., 2015) to visualize and interpret neural machine translation (Ding et al., 2017) ." }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-40", "text": "Figure 1 shows an example." }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-41", "text": "Given a source sentence, a target sentence, and a trained model, THUMT displays the entire attention-based neural network." }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-42", "text": "Clicking on a node of the network (e.g., the output node of \"processing\"), the relevance values of relevant source and target contextual words are shown in the bottom area." }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-43", "text": "This is helpful for analyzing the internal workings of NMT. Please refer to (Ding et al., 2017) for more details." 
}, { "sent_id": "397e593f8f282d4951402d83036c12-C001-44", "text": "----------------------------------" }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-45", "text": "**REPLACING UNKNOWN WORDS**" }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-46", "text": "We follow Luong et al. (2015) to address unknown words." }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-47", "text": "In our implementation, we use FastAlign (Dyer et al., 2013) (Papineni et al., 2002) score." }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-48", "text": "Our baseline system is GroundHog (Bahdanau et al., 2015) , a state-of-the-art open-source NMT toolkit." }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-49", "text": "We use the same setting of hyper-parameters for both GroundHog and THUMT." }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-50", "text": "The vocabulary size is set to 30K." }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-51", "text": "We set word embedding dimension to 620 for both languages." }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-52", "text": "The dimension of hidden layers is set to1000." }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-53", "text": "In training, we set the mini-batch size to 80." }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-54", "text": "In decoding, we set the beam size to 10." }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-55", "text": "During training, we set the Table 1 shows the BLEU scores obtained by GroundHog and THUMT using different training criteria and optimizers." }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-56", "text": "Experimental results show that the translation performance of THUMT is comparable to GROUNDHOG using MLE." }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-57", "text": "Due to the capability to include evaluation metrics in during, MRT obtain significant improvements over MLE." 
}, { "sent_id": "397e593f8f282d4951402d83036c12-C001-58", "text": "1 Another finding is that Adam leads to consistent and significant improvements over AdaDelta." }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-59", "text": "Table 2 shows the effect of semi-supervised training." }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-60", "text": "It is clear that exploiting monolingual corpora helps to improve translation quality for both directions." }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-61", "text": "Table 3 shows that replacing unknown words leads to consistent improvements for all training criteria and optimizers." }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-62", "text": "Table 4 compares the training time between MLE, MRT, and SST." }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-63", "text": "While \"MLE + AdaDelta\" requires 200K iterations and 55.9 hours to converge, \"MLE + Adam\" only needs 36K iterations and 10.1 hours." }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-64", "text": "The training time of MRT is much longer than MLE." }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-65", "text": "----------------------------------" }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-66", "text": "**RESULTS**" }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-67", "text": "----------------------------------" }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-68", "text": "**CONCLUSION**" }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-69", "text": "We have introduced a new open source toolkit for NMT that supports two new training criteria: minimum risk training and semisupervised training ." }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-70", "text": "While minimum risk training proves to improve over standard maximum likelihood estimation substantially, semi-supervised training is capable of exploiting monolingual corpora to improve lowresource translation." 
}, { "sent_id": "397e593f8f282d4951402d83036c12-C001-71", "text": "The toolkit also features a visualization tool for analyzing the translation process of THUMT." }, { "sent_id": "397e593f8f282d4951402d83036c12-C001-72", "text": "The toolkit is freely available at http://thumt.thunlp.org." } ], "y": { "@BACK@": { "gold_contexts": [ [ "397e593f8f282d4951402d83036c12-C001-8" ], [ "397e593f8f282d4951402d83036c12-C001-9" ], [ "397e593f8f282d4951402d83036c12-C001-17" ] ], "cite_sentences": [ "397e593f8f282d4951402d83036c12-C001-8", "397e593f8f282d4951402d83036c12-C001-9", "397e593f8f282d4951402d83036c12-C001-17" ] }, "@USE@": { "gold_contexts": [ [ "397e593f8f282d4951402d83036c12-C001-11" ], [ "397e593f8f282d4951402d83036c12-C001-16" ], [ "397e593f8f282d4951402d83036c12-C001-21" ], [ "397e593f8f282d4951402d83036c12-C001-48" ] ], "cite_sentences": [ "397e593f8f282d4951402d83036c12-C001-11", "397e593f8f282d4951402d83036c12-C001-16", "397e593f8f282d4951402d83036c12-C001-21", "397e593f8f282d4951402d83036c12-C001-48" ] }, "@DIF@": { "gold_contexts": [ [ "397e593f8f282d4951402d83036c12-C001-15" ] ], "cite_sentences": [ "397e593f8f282d4951402d83036c12-C001-15" ] } } }, "ABC_f633ceffdf53849159574a2891eda1_17": { "x": [ { "sent_id": "f633ceffdf53849159574a2891eda1-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-2", "text": "Recently, a simple combination of passage retrieval using off-the-shelf IR techniques and a BERT reader was found to be very effective for question answering directly on Wikipedia, yielding a large improvement over the previous state of the art on a standard benchmark dataset." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-3", "text": "In this paper, we present a data augmentation technique using distant supervision that exploits positive as well as negative examples." 
}, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-4", "text": "We apply a stage-wise approach to fine tuning BERT on multiple datasets, starting with data that is \"furthest\" from the test data and ending with the \"closest\"." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-5", "text": "Experimental results show large gains in effectiveness over previous approaches on English QA datasets, and we establish new baselines on two recent Chinese QA datasets." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-6", "text": "----------------------------------" }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-8", "text": "BERT (Devlin et al., 2018) represents the latest refinement in a series of neural models that take advantage of pretraining on a language modeling task (Peters et al., 2018; Radford et al., 2018) ." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-9", "text": "Researchers have demonstrated impressive gains in a broad range of NLP tasks, from sentence classification to sequence labeling." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-10", "text": "Recently, Yang et al. (2019) showed that combining a BERT-based reader with passage retrieval using the Anserini IR toolkit yields a large improvement in question answering directly from a Wikipedia corpus, measured in terms of exact match on a standard benchmark (Chen et al., 2017) ." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-11", "text": "Interestingly, the approach of Yang et al. (2019) represents a simple method to combining BERT with off-the-shelf IR." 
}, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-12", "text": "In this paper, we build on these initial successes to explore how much further we can push this simple architecture by data augmentation, taking advantage of distant supervision techniques to gather more and higher-quality * equal contribution training data to fine tune BERT." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-13", "text": "Experiments show that, using the same reader model as Yang et al. (2019) , our simple data-augmentation techniques yield additional large improvements." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-14", "text": "To illustrate the robustness of our methods, we also demonstrate consistent gains on another English QA dataset and present baselines for two additional Chinese QA datasets (which have not to date been evaluated in an \"end-to-end\" manner)." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-15", "text": "In addition to achieving state-of-the-art results, we contribute important lessons on how to leverage BERT effectively for question answering." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-16", "text": "First, most previous work on distant supervision focuses on generating positive examples, but we show that using existing datasets to identify negative training examples is beneficial as well." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-17", "text": "Second, we propose an approach to fine-tuning BERT with disparate datasets that works well in practice: our heuristic is to proceed in a stage-wise manner, beginning with the dataset that is \"furthest\" from the test data and ending with the \"closest\"." 
}, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-18", "text": "----------------------------------" }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-19", "text": "**BACKGROUND AND RELATED WORK**" }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-20", "text": "In this paper, we tackle the \"end-to-end\" variant of the question answering problem, where the system is only provided a large corpus of articles." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-21", "text": "This stands in contrast to reading comprehension datasets such as SQuAD (Rajpurkar et al., 2016) , where the system works with a single pre-determined document, or most QA benchmarks today such as TrecQA (Yao et al., 2013) , WikiQA (Yang et al., 2015) , and MS-MARCO (Bajaj et al., 2016) , where the system is provided a list of candidate passages to choose from." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-22", "text": "This task definition, which combines a strong element of information retrieval, traces back to the Text Retrieval Conferences (TRECs) in the late 1990s (Voorhees and Tice, 1999) , but there is a recent resurgence of interest in this formulation (Chen et al., 2017) ." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-23", "text": "The roots of the distant supervision techniques we use trace back to at least the 1990s (Yarowsky, 1995; Riloff, 1996) , although the term had not yet been coined." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-24", "text": "Such techniques have recently become commonplace, especially as a way to gather large amounts of labeled examples for data-hungry neural networks and other machine learning algorithms." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-25", "text": "Specific recent applications in question answering include Bordes et al. (2015) , Chen et al. (2017) , Lin et al. (2018) , as well as Joshi et al. (2017) for building benchmark test collections." 
}, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-26", "text": "----------------------------------" }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-27", "text": "**APPROACH**" }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-28", "text": "In this work, we fix the underlying model and focus on data augmentation techniques to explore how to best fine-tune BERT." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-29", "text": "We use the same exact setup as the \"paragraph\" variant of BERTserini (Yang et al., 2019) , where the input corpus is pre-segmented into paragraphs at index time, each of which is treated as a \"document\" for retrieval purposes." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-30", "text": "The question is used as a \"bag of words\" query to retrieve the top k candidate paragraphs using BM25 ranking." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-31", "text": "Each paragraph is then fed into the BERT reader along with the original natural language question for inference." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-32", "text": "Our reader is built using Google's reference implementation, but with a small tweak: to allow comparison and aggregation of results from different segments, we remove the final softmax layer over different answer spans; cf. ." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-33", "text": "For each candidate paragraph, we apply inference over the entire paragraph, and the reader selects the best text span and provides a score." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-34", "text": "We then combine the reader score with the retriever score via linear interpolation: S = (1 \u2212 \u00b5) \u00b7 S Anserini + \u00b5 \u00b7 S BERT , where \u00b5 \u2208 [0, 1] is a hyperparameter (tuned on a training sample)." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-35", "text": "One major shortcoming with BERTserini is that Yang et al. 
(2019) only fine tune on SQuAD, which means that the BERT reader is exposed to an impoverished set of examples; all SQuAD data come from a total of only 442 documents." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-36", "text": "This contrasts with the diversity of paragraphs that the model will likely encounter at inference time, since they are selected from potentially millions of articles." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-37", "text": "The solution to this problem, of course, is to fine tune BERT with the types of paragraphs it is likely to see at inference time." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-38", "text": "Unfortunately, such data does not exist for modern QA test collections." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-39", "text": "Distant supervision can provide a bridge." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-40", "text": "Starting from a source dataset comprising question-answer pairs (for example, SQuAD), we can create training data for a specific corpus by using passage retrieval to fetch paragraphs from that corpus (with the question as the query) and then searching (i.e., matching) for answer instances in those paragraphs." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-41", "text": "A hyperparameter here is n, the number of candidates we examine from passage retrieval." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-42", "text": "Larger values of n will lead to more training examples, but as n increases, so does the chance that a paragraph will spuriously match the answer without actually answering the question." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-43", "text": "The above technique allows us to extract positive training examples, but previous work has shown the value of negative examples, specifically for QA ." 
}, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-44", "text": "To extract negative examples, we sample the top n candidates from passage retrieval for paragraphs that do not contain the answer, with a ratio of d:1." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-45", "text": "That is, for every positive example we find, we sample d negative examples, where d is also a hyperparameter." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-46", "text": "Note that these negative examples are also noisy, since they may in fact contain an alternate correct (or acceptable) answer to the question, one that differs from the answer given in the source dataset." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-47", "text": "Thus, given a corpus, we can create using distant supervision a new dataset that is specifically adapted to a particular passage retrieval method." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-48", "text": "For convenience, we refer to training data gathered using this technique that only contain positive examples as DS(+) and use DS(\u00b1) to refer to the additional inclusion of negative examples." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-49", "text": "Next, we have a design decision regarding how to fine tune BERT using the source QA pairs (SRC) and the augmented dataset using distant supervision (DS)." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-50", "text": "There are three possibilities: SRC + DS: Fine tune BERT with all data, grouped together." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-51", "text": "In practice, this means that the source and augmented data are shuffled together." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-52", "text": "DS \u2192 SRC: Fine tune the reader on the augmented data and then the source dataset." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-53", "text": "SRC \u2192 DS: Fine tune the reader on the source dataset and then the augmented data." 
}, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-54", "text": "Experiment results show that of the three choices above, the third option is the most effective." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-55", "text": "More generally, when faced with multiple, qualitatively different datasets, we advocate a stage-wise fine-tuning strategy that starts with the dataset \"furthest\" from the task at hand and ends with the dataset \"closest\" to it." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-56", "text": "Another way to think about using different datasets is in terms of a very simple form of transfer learning." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-57", "text": "The stage-wise fine-tuning strategy is in essence trying to transfer knowledge from labeled data that is not drawn from the same distribution as the test instances." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-58", "text": "We wish to take advantage of transfer effects, but limit the scope of erroneous parameterizations." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-59", "text": "Thus it makes sense not to intermingle qualitatively different datasets, but to fine tune the model in distinct stages." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-60", "text": "----------------------------------" }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-61", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-62", "text": "To show the generalizability of our data augmentation technique, we conduct experiments on two English datasets: SQuAD (v1.1) and TriviaQA (Joshi et al., 2017)." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-63", "text": "For both, we use the 2016-12-21 dump of English Wikipedia, following Chen et al. (2017)." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-64", "text": "We also examine two Chinese datasets: CMRC (Cui et al., 2018) and DRCD (Shao et al., 2018)."
}, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-65", "text": "For these, we use the 2018-12-01 dump of Chinese Wikipedia, tokenized with Lucene's CJKAnalyzer into overlapping bigrams." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-66", "text": "We apply hanziconv 1 to transform the corpus into simplified characters for CMRC and traditional characters for DRCD." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-67", "text": "Following Yang et al. (2019), to evaluate answers in an end-to-end setup, we disregard the paragraph context from the original datasets and use only the answer spans." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-68", "text": "As in previous work, exact match (EM) score and F1 score (at the token level) serve as the two primary evaluation metrics." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-69", "text": "In addition, we compute recall (R), the fraction of questions for which the correct answer appears in any retrieved paragraph; to make our results comparable to Yang et al. (2019), Anserini returns the top k = 100 paragraphs to feed into the BERT reader." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-70", "text": "Note that this recall is not the same as the token-level recall component in the F1 score." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-71", "text": "Statistics for the datasets are shown in Table 4. 2 For data augmentation, based on preliminary experiments, we find that examining n = 10 candidates from passage retrieval works well, and we further discover that effectiveness is insensitive to the number of negative samples." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-72", "text": "Thus, we eliminate the need to tune d by simply using all passages that do not contain the answer as negative examples."
}, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-73", "text": "The second block of Table 4 shows the sizes of the augmented datasets constructed using our distant supervision techniques: DS(+) contains positive examples only, while DS(\u00b1) includes both positive and negative examples." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-74", "text": "There are two additional characteristics to note about our data augmentation techniques: The most salient characteristic is that SQuAD, CMRC, and DRCD all have source answers drawn from Wikipedia (English or Chinese), while TriviaQA includes web pages as well as Wikipedia." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-75", "text": "Therefore, for the first three collections, the source and augmented datasets share the same document genre-the primary difference is that data augmentation increases the amount and diversity of answer passages seen by the model during training." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-76", "text": "For TriviaQA, however, we consider the source and augmented datasets as coming from different genres (noisy web text vs. higher quality Wikipedia articles)." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-77", "text": "Furthermore, the TriviaQA augmented dataset is also much larger, suggesting that those questions are qualitatively different (e.g., in the manner they were gathered)." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-78", "text": "These differences appear to have a substantial impact, as experiment results show that TriviaQA behaves differently than the other three collections." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-79", "text": "For model training, we begin with the BERT-Base model (uncased, 12-layer, 768-hidden, 12-heads, 110M parameters), which is then fine-tuned using the various conditions described in the previous section."
}, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-80", "text": "All inputs to the model are padded to 384 tokens; the learning rate is set to 3 \u00d7 10^-5 and all other default settings are used." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-81", "text": "----------------------------------" }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-82", "text": "**RESULTS**" }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-83", "text": "Our main results on SQuAD are shown in Table 2." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-84", "text": "The row marked \"SRC\" indicates fine tuning with SQuAD data only and matches the BERTserini condition of Yang et al. (2019); we report higher scores due to engineering improvements (primarily a Lucene version upgrade)." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-85", "text": "As expected, fine tuning with augmented data improves effectiveness, and experiments show that while training with positive examples using DS(+) definitely helps, an even larger boost comes from leveraging negative examples using DS(\u00b1)." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-86", "text": "[Table 2: Results on SQuAD (EM / F1 / R). Wang et al. (2017): 29.1 / 37.5 / -; Kratzwald and Feuerriegel (2018): 29.8 / - / -;" }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-87", "text": "Par. R.: 28.5 / - / 83.1; Par. R. + Answer Agg.: 28.9 / - / -; Par. R. + Full Agg.: 30.2 / - / -;" }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-88", "text": "MINIMAL (Min et al., 2018): 34.7 / 42.5 / 64.0; BERTserini (Yang et al., 2019): 38.6 / 46.1 / 85.9;" }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-89", "text": "SRC: 41.8 / 49.5 / 85.9; DS(+): 44.0 / 51.4 / 85.9; DS(\u00b1): 48.7 / 56.5 / 85.9;" }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-90", "text": "SRC + DS(\u00b1): 45.7 / 53.5 / 85.9; DS(\u00b1) \u2192 SRC: 47.4 / 55.0 / 85.9;" }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-91", "text": "SRC \u2192 DS(\u00b1): 50.2 / 58.2 / 85.9.]"
}, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-92", "text": "In both these cases, we only fine tune BERT with the augmented data, ignoring the source data." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-93", "text": "What if we use the source data as well? Results show that \"lumping\" all training data together (both the source and augmented data) to fine tune BERT is not the right approach: in fact, the SRC + DS(\u00b1) condition performs worse than just using the augmented data alone." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-94", "text": "Instead, disparate datasets should be leveraged using the stage-wise fine-tuning approach we propose, according to our heuristic of starting with the dataset that is \"furthest\" away from the test data." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-95", "text": "That is, we wish to take advantage of all available data, but the last dataset we use to fine tune BERT should be \"most like\" the test data the model will see at inference time." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-96", "text": "Indeed, this heuristic is borne out empirically, as SRC \u2192 DS(\u00b1) yields another boost over using DS(\u00b1) only." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-97", "text": "Further confirmation for this heuristic comes from an alternative where we switch the order of the stages, DS(\u00b1) \u2192 SRC, which yields results worse than DS(\u00b1) alone." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-98", "text": "We note that our best configuration beats BERTserini, the previous state of the art, by over ten points." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-99", "text": "Note that recall in all our conditions is the same since we are not varying the passage retrieval algorithm, and in each case Anserini provides exactly the same candidate passages."
}, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-100", "text": "Improvements come solely from a better BERT reader." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-101", "text": "Results on TriviaQA are shown in Table 3." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-102", "text": "With just fine tuning on the source dataset, we obtain a ... [Table 3 (partial rows, EM / F1): Wang et al. (2017): 47.3 / 53.7; DS-QA (Lin et al., 2018): 48.7 / 56.3; Evidence Agg.: ...]" }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-103", "text": "Interestingly, using only positive examples leads to worse effectiveness than just using the source dataset." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-104", "text": "However, fine tuning on both positive and negative examples leads to a three point boost in exact match score, establishing a new high score on this dataset." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-105", "text": "Experiments on fine tuning with both source and augmented data show the same pattern as with SQuAD: stage-wise tuning is more effective than just combining datasets, and tuning should proceed in the \"furthest to closest\" sequence we propose." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-106", "text": "While data augmentation no doubt helps (beats the source-only baseline), for this dataset the highest effectiveness is achieved by disregarding the source dataset completely; that is, DS(\u00b1) beats SRC \u2192 DS(\u00b1)." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-107", "text": "We attribute this behavior to the difference between TriviaQA and the other datasets discussed in Section 4: it appears that gains from transfer effects are outweighed by genre mismatch." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-108", "text": "Results on the Chinese datasets are shown in Table 4."
}, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-109", "text": "To our knowledge, they have only been evaluated as reading comprehension tests, not in the \"end-to-end\" setup that we tackle here (requiring retrieval from a sizeable corpus)." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-110", "text": "Although there is no previous work to compare against, our results provide a strong baseline for future work." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-111", "text": "Experiment results on the two Chinese datasets support the same conclusions as SQuAD: First, we see that data augmentation using distant supervision is effective." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-112", "text": "Second, including both positive and negative training examples is better than having positive examples only." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-113", "text": "Third, when leveraging multiple datasets, our \"furthest to closest\" heuristic for stage-wise tuning yields the best results." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-114", "text": "Since the source datasets also draw from (Chinese) Wikipedia, we benefit from fine tuning with both source and augmented data." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-115", "text": "Table 4 : Results on the two Chinese datasets: CMRC (top) and DRCD (bottom)." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-116", "text": "----------------------------------" }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-117", "text": "**CONCLUSIONS**" }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-118", "text": "In this paper, we have further advanced the state of the art in end-to-end open-domain question answering using simple BERT models." 
}, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-119", "text": "We focus on data augmentation using distant supervision techniques to construct datasets that are closer to the types of paragraphs that the reader will see at inference time." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-120", "text": "Explained this way, it should not come as a surprise that effectiveness improves as a result." }, { "sent_id": "f633ceffdf53849159574a2891eda1-C001-121", "text": "This work perhaps confirms something that machine learning practitioners already know all too well: quite often, the best way to better results is not better modeling, but better data preparation." } ], "y": { "@BACK@": { "gold_contexts": [ [ "f633ceffdf53849159574a2891eda1-C001-10", "f633ceffdf53849159574a2891eda1-C001-11" ] ], "cite_sentences": [ "f633ceffdf53849159574a2891eda1-C001-10", "f633ceffdf53849159574a2891eda1-C001-11" ] }, "@EXT@": { "gold_contexts": [ [ "f633ceffdf53849159574a2891eda1-C001-10", "f633ceffdf53849159574a2891eda1-C001-11", "f633ceffdf53849159574a2891eda1-C001-12" ] ], "cite_sentences": [ "f633ceffdf53849159574a2891eda1-C001-10", "f633ceffdf53849159574a2891eda1-C001-11" ] }, "@USE@": { "gold_contexts": [ [ "f633ceffdf53849159574a2891eda1-C001-13" ], [ "f633ceffdf53849159574a2891eda1-C001-29" ], [ "f633ceffdf53849159574a2891eda1-C001-67" ], [ "f633ceffdf53849159574a2891eda1-C001-91", "f633ceffdf53849159574a2891eda1-C001-92" ] ], "cite_sentences": [ "f633ceffdf53849159574a2891eda1-C001-13", "f633ceffdf53849159574a2891eda1-C001-29", "f633ceffdf53849159574a2891eda1-C001-67", "f633ceffdf53849159574a2891eda1-C001-91" ] }, "@DIF@": { "gold_contexts": [ [ "f633ceffdf53849159574a2891eda1-C001-13" ], [ "f633ceffdf53849159574a2891eda1-C001-35" ], [ "f633ceffdf53849159574a2891eda1-C001-84" ] ], "cite_sentences": [ "f633ceffdf53849159574a2891eda1-C001-13", "f633ceffdf53849159574a2891eda1-C001-35", "f633ceffdf53849159574a2891eda1-C001-84" ] }, "@SIM@": { "gold_contexts": [ [
"f633ceffdf53849159574a2891eda1-C001-69" ] ], "cite_sentences": [ "f633ceffdf53849159574a2891eda1-C001-69" ] }, "@UNSURE@": { "gold_contexts": [ [ "f633ceffdf53849159574a2891eda1-C001-91" ] ], "cite_sentences": [ "f633ceffdf53849159574a2891eda1-C001-91" ] } } }, "ABC_04461d946dadc759e4be1207655159_18": { "x": [ { "sent_id": "04461d946dadc759e4be1207655159-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "04461d946dadc759e4be1207655159-C001-2", "text": "This paper presents an extension of Chiang's hierarchical phrase-based (HPB) model, called Head-Driven HPB (HD-HPB), which incorporates head information in translation rules to better capture syntax-driven information, as well as improved reordering between any two neighboring non-terminals at any stage of a derivation to explore a larger reordering search space." }, { "sent_id": "04461d946dadc759e4be1207655159-C001-3", "text": "Experiments on Chinese-English translation on four NIST MT test sets show that the HD-HPB model significantly outperforms Chiang's model with average gains of 1.91 points absolute in BLEU." }, { "sent_id": "04461d946dadc759e4be1207655159-C001-4", "text": "----------------------------------" }, { "sent_id": "04461d946dadc759e4be1207655159-C001-5", "text": "**INTRODUCTION**" }, { "sent_id": "04461d946dadc759e4be1207655159-C001-6", "text": "Chiang's hierarchical phrase-based (HPB) translation model utilizes synchronous context free grammar (SCFG) for translation derivation (Chiang, 2005; Chiang, 2007) and has been widely adopted in statistical machine translation (SMT)." }, { "sent_id": "04461d946dadc759e4be1207655159-C001-7", "text": "Typically, such models define two types of translation rules: hierarchical (translation) rules which consist of both terminals and non-terminals, and glue (grammar) rules which combine translated phrases in a monotone fashion." 
}, { "sent_id": "04461d946dadc759e4be1207655159-C001-8", "text": "Due to lack of linguistic knowledge, Chiang's HPB model contains only one type of non-terminal symbol X, often making it difficult to select the most appropriate translation rules." }, { "sent_id": "04461d946dadc759e4be1207655159-C001-9", "text": "What is more, Chiang's HPB model suffers from limited phrase reordering, combining translated phrases in a monotonic way with glue rules." }, { "sent_id": "04461d946dadc759e4be1207655159-C001-10", "text": "In addition, once a glue rule is adopted, it requires all rules above it to be glue rules." }, { "sent_id": "04461d946dadc759e4be1207655159-C001-11", "text": "(Footnote 1: Another non-terminal symbol S is used in glue rules.)" }, { "sent_id": "04461d946dadc759e4be1207655159-C001-12", "text": "One important research question is therefore how to refine the non-terminal category X using linguistically motivated information: Zollmann and Venugopal (2006) (SAMT), e.g., use (partial) syntactic categories derived from CFG trees, while Zollmann and Vogel (2011) use word tags, generated by either POS analysis or unsupervised word class induction." }, { "sent_id": "04461d946dadc759e4be1207655159-C001-13", "text": "Almaghout et al. (2011) employ CCG-based supertags." }, { "sent_id": "04461d946dadc759e4be1207655159-C001-14", "text": "Mylonakis and Sima'an (2011) use linguistic information of various granularities such as Phrase-Pair, Constituent, Concatenation of Constituents, and Partial Constituents, where applicable." }, { "sent_id": "04461d946dadc759e4be1207655159-C001-15", "text": "Inspired by previous work in parsing (Charniak, 2000; Collins, 2003), our Head-Driven HPB (HD-HPB) model is based on the intuition that linguistic heads provide important information about a constituent or distributionally defined fragment, as in HPB."
}, { "sent_id": "04461d946dadc759e4be1207655159-C001-16", "text": "We identify heads using linguistically motivated dependency parsing, and use their POS to refine X. In addition, HD-HPB provides flexible reordering rules, freely mixing translation and reordering (including swap) at any stage in a derivation." }, { "sent_id": "04461d946dadc759e4be1207655159-C001-17", "text": "Different from the soft constraint modeling adopted in (Chan et al., 2007; Marton and Resnik, 2008; Shen et al., 2009; He et al., 2010; Huang et al., 2010; Gao et al., 2011), our approach encodes syntactic information in translation rules." }, { "sent_id": "04461d946dadc759e4be1207655159-C001-18", "text": "However, the two approaches are not mutually exclusive, as we could also include a set of syntax-driven features into our translation model." }, { "sent_id": "04461d946dadc759e4be1207655159-C001-19", "text": "Our approach maintains the advantages of Chiang's HPB model while at the same time incorporating head information and flexible reordering in a derivation in a natural way." }, { "sent_id": "04461d946dadc759e4be1207655159-C001-20", "text": "Experiments on Chinese-English translation using four NIST MT test sets show that our HD-HPB model significantly outperforms Chiang's HPB as well as a SAMT-style refined version of HPB." }, { "sent_id": "04461d946dadc759e4be1207655159-C001-21", "text": "----------------------------------" }, { "sent_id": "04461d946dadc759e4be1207655159-C001-22", "text": "**HEAD-DRIVEN HPB TRANSLATION MODEL**" }, { "sent_id": "04461d946dadc759e4be1207655159-C001-23", "text": "Like Chiang (2005) and Chiang (2007), our HD-HPB translation model adopts a synchronous context free grammar, a rewriting system which generates source and target side string pairs simultaneously using a context-free grammar."
}, { "sent_id": "04461d946dadc759e4be1207655159-C001-24", "text": "Instead of collapsing all non-terminals in the source language into a single symbol X as in Chiang (2007), given a word sequence f i j from position i to position j, we first find heads and then concatenate the POS tags of these heads as f i j 's non-terminal symbol." }, { "sent_id": "04461d946dadc759e4be1207655159-C001-25", "text": "Specifically, we adopt unlabeled dependency structure to derive heads, which are defined as follows: a word in f i j" }, { "sent_id": "04461d946dadc759e4be1207655159-C001-26", "text": "is regarded as a head if it is dominated by a word outside of this sequence." }, { "sent_id": "04461d946dadc759e4be1207655159-C001-27", "text": "Note that this definition (i) allows for a word sequence to have one or more heads (largely due to the fact that a word sequence is not necessarily linguistically constrained) and (ii) ensures that heads are always the highest heads in the sequence from a dependency structure perspective." }, { "sent_id": "04461d946dadc759e4be1207655159-C001-28", "text": "For example, the word sequence ouzhou baguo lianming in Figure 1 has two heads (i.e., baguo and lianming, ouzhou is not a head of this sequence since its headword baguo falls within this sequence) and the non-terminal corresponding to the sequence is thus labeled as NN-AD." }, { "sent_id": "04461d946dadc759e4be1207655159-C001-29", "text": "It is worth noting that in this paper we only refine non-terminal X on the source side to headinformed ones, while still using X on the target side." }, { "sent_id": "04461d946dadc759e4be1207655159-C001-30", "text": "According to the occurrence of terminals in translation rules, we group rules in the HD-HPB model into two categories: head-driven hierarchical rules (HD-HRs) and non-terminal reordering rules (NRRs), where the former have at least one terminal on both source and target sides and the latter have no terminals."
}, { "sent_id": "04461d946dadc759e4be1207655159-C001-31", "text": "For rule extraction, we first identify initial phrase pairs on word-aligned sentence pairs by using the same criterion as most phrase-based translation models (Och and Ney, 2004) and Chiang's HPB model (Chiang, 2005; Chiang, 2007)." }, { "sent_id": "04461d946dadc759e4be1207655159-C001-32", "text": "We extract HD-HRs and NRRs based on initial phrase pairs, respectively." }, { "sent_id": "04461d946dadc759e4be1207655159-C001-33", "text": "----------------------------------" }, { "sent_id": "04461d946dadc759e4be1207655159-C001-34", "text": "**HD-HRS: HEAD-DRIVEN HIERARCHICAL RULES**" }, { "sent_id": "04461d946dadc759e4be1207655159-C001-35", "text": "As mentioned, a HD-HR has at least one terminal on both source and target sides." }, { "sent_id": "04461d946dadc759e4be1207655159-C001-36", "text": "This is the same as the hierarchical rules defined in Chiang's HPB model (Chiang, 2007), except that we use head POS-informed non-terminal symbols in the source language." }, { "sent_id": "04461d946dadc759e4be1207655159-C001-37", "text": "We look for initial phrase pairs that contain other phrases and then replace sub-phrases with POS tags corresponding to their heads." }, { "sent_id": "04461d946dadc759e4be1207655159-C001-38", "text": "Given the word alignment in Figure 1, Table 1 demonstrates the difference between hierarchical rules in Chiang (2007) and HD-HRs defined here." }, { "sent_id": "04461d946dadc759e4be1207655159-C001-39", "text": "Similar to Chiang's HPB model, our HD-HPB model will result in a large number of rules causing problems in decoding." }, { "sent_id": "04461d946dadc759e4be1207655159-C001-40", "text": "To alleviate these problems, we filter our HD-HRs according to the same constraints as described in Chiang (2007)." }, { "sent_id": "04461d946dadc759e4be1207655159-C001-41", "text": "Moreover, we discard rules that have non-terminals with more than four heads."
}, { "sent_id": "04461d946dadc759e4be1207655159-C001-42", "text": "----------------------------------" }, { "sent_id": "04461d946dadc759e4be1207655159-C001-43", "text": "**NRRS: NON-TERMINAL REORDERING RULES**" }, { "sent_id": "04461d946dadc759e4be1207655159-C001-44", "text": "NRRs are translation rules without terminals." }, { "sent_id": "04461d946dadc759e4be1207655159-C001-45", "text": "Given an initial phrase pair on the source side, there are four possible positional relationships for their target side translations (we use Y as a variable for non-terminals on the source side while all non-terminals on the target side are labeled as X). [Table 1: Hierarchical rules in Chiang (2007) and HD-HRs.]" }, { "sent_id": "04461d946dadc759e4be1207655159-C001-46", "text": "Indexed underlines indicate sub-phrases and corresponding non-terminal symbols." }, { "sent_id": "04461d946dadc759e4be1207655159-C001-47", "text": "The non-terminals in HD-HRs (e.g., NN, VV, VV-NR) capture the head(s) POS tags of the corresponding word sequence in the source language." }, { "sent_id": "04461d946dadc759e4be1207655159-C001-48", "text": "Merging two neighboring non-terminals into a single non-terminal, NRRs enable the translation model to explore a wider search space." }, { "sent_id": "04461d946dadc759e4be1207655159-C001-49", "text": "During training, we extract four types of NRRs and calculate probabilities for each type." }, { "sent_id": "04461d946dadc759e4be1207655159-C001-50", "text": "To speed up decoding, we currently (i) only use monotone and swap NRRs and (ii) limit the number of non-terminals in a NRR to 2."
}, { "sent_id": "04461d946dadc759e4be1207655159-C001-51", "text": "----------------------------------" }, { "sent_id": "04461d946dadc759e4be1207655159-C001-52", "text": "**FEATURES AND DECODING**" }, { "sent_id": "04461d946dadc759e4be1207655159-C001-53", "text": "Given e for the translation output in the target language, s and t for strings of terminals and non-terminals on the source and target side, respectively, we use a feature set analogous to the default feature set of Chiang (2007), including:" }, { "sent_id": "04461d946dadc759e4be1207655159-C001-54", "text": "\u2022 P_hd-hr(t|s) and P_hd-hr(s|t), translation probabilities for HD-HRs;" }, { "sent_id": "04461d946dadc759e4be1207655159-C001-55", "text": "\u2022 P_lex(t|s) and P_lex(s|t), lexical translation probabilities for HD-HRs;" }, { "sent_id": "04461d946dadc759e4be1207655159-C001-56", "text": "\u2022 P_ty-hd-hr = exp(-1), rule penalty for HD-HRs;" }, { "sent_id": "04461d946dadc759e4be1207655159-C001-57", "text": "\u2022 P_nrr(t|s), translation probability for NRRs;" }, { "sent_id": "04461d946dadc759e4be1207655159-C001-58", "text": "\u2022 P_ty-nrr = exp(-1), rule penalty for NRRs;" }, { "sent_id": "04461d946dadc759e4be1207655159-C001-59", "text": "\u2022 P_lm(e), language model;" }, { "sent_id": "04461d946dadc759e4be1207655159-C001-60", "text": "\u2022 P_ty-word(e) = exp(-|e|), word penalty." }, { "sent_id": "04461d946dadc759e4be1207655159-C001-61", "text": "Our decoder is based on CKY-style chart parsing with beam search and searches for the best derivation bottom-up." }, { "sent_id": "04461d946dadc759e4be1207655159-C001-62", "text": "For a source span [i, j], it applies both types of HD-HRs and NRRs." }, { "sent_id": "04461d946dadc759e4be1207655159-C001-63", "text": "However, HD-HRs are only applied to generate derivations spanning no more than K words (the initial phrase length limit used in training to extract HD-HRs), while NRRs are applied to derivations spanning any length."
}, { "sent_id": "04461d946dadc759e4be1207655159-C001-64", "text": "Unlike in Chiang's HPB model, it is possible for a non-terminal generated by a NRR to be included afterwards by a HD-HR or another NRR." }, { "sent_id": "04461d946dadc759e4be1207655159-C001-65", "text": "----------------------------------" }, { "sent_id": "04461d946dadc759e4be1207655159-C001-66", "text": "**EXPERIMENTS**" }, { "sent_id": "04461d946dadc759e4be1207655159-C001-67", "text": "We evaluate the performance of our HD-HPB model and compare it with our implementation of Chiang's HPB model (Chiang, 2007), a source-side SAMT-style refined version of HPB (SAMT-HPB), and the Moses implementation of HPB." }, { "sent_id": "04461d946dadc759e4be1207655159-C001-68", "text": "For fair comparison, we adopt the same parameter settings for our HD-HPB and HPB systems, including initial phrase length (as 10) in training, the maximum number of non-terminals (as 2) in translation rules, maximum number of non-terminals plus terminals (as 5) on the source, beam threshold \u03b2 (as 10^-5) (to discard derivations with a score worse than \u03b2 times the best score in the same chart cell), beam size b (as 200) (i.e. each chart cell contains at most b derivations)." }, { "sent_id": "04461d946dadc759e4be1207655159-C001-69", "text": "For Moses HPB, we use \"grow-diag-final-and\" to obtain symmetric word alignments, 10 for the maximum phrase length, and the recommended default values for all other parameters." }, { "sent_id": "04461d946dadc759e4be1207655159-C001-70", "text": "We train our model on a dataset with ~1.5M sentence pairs from the LDC dataset." }, { "sent_id": "04461d946dadc759e4be1207655159-C001-71", "text": "We use the 2002 NIST MT evaluation test data (878 sentence pairs) as the development data, and the 2003, 2004, 2005, and 2006-news NIST MT evaluation test data (919, 1788, 1082, and 616 sentence pairs, respectively) as the test data."
}, { "sent_id": "04461d946dadc759e4be1207655159-C001-72", "text": "To find heads, we parse the source sentences with the Berkeley Parser 3 (Petrov and Klein, 2007) trained on Chinese TreeBank 6.0 and use the Penn2Malt toolkit 4 to obtain (unlabeled) dependency structures." }, { "sent_id": "04461d946dadc759e4be1207655159-C001-73", "text": "We obtain the word alignments by running GIZA++ (Och and Ney, 2000) on the corpus in both directions and applying \"grow-diag-final-and\" refinement (Koehn et al., 2003)." }, { "sent_id": "04461d946dadc759e4be1207655159-C001-74", "text": "We use the SRI language modeling toolkit to train a 5-gram language model on the Xinhua portion of the Gigaword corpus and standard MERT (Och, 2003) to tune the feature weights on the development data." }, { "sent_id": "04461d946dadc759e4be1207655159-C001-75", "text": "For evaluation, the NIST BLEU script (version 12) with the default settings is used to calculate the BLEU scores." }, { "sent_id": "04461d946dadc759e4be1207655159-C001-76", "text": "To test whether a performance difference is statistically significant, we conduct significance tests following the paired bootstrap approach (Koehn, 2004)." }, { "sent_id": "04461d946dadc759e4be1207655159-C001-77", "text": "In this paper, '**' and '*' denote p-values less than 0.01 and in-between [0.01, 0.05), respectively." }, { "sent_id": "04461d946dadc759e4be1207655159-C001-78", "text": "Table 2 lists the rule table sizes." }, { "sent_id": "04461d946dadc759e4be1207655159-C001-79", "text": "The full rule table size (including HD-HRs and NRRs) of our HD-HPB model is ~1.5 times that of Chiang's, largely due to refining the non-terminal symbol X in Chiang's model into head-informed ones in our model."
}, { "sent_id": "04461d946dadc759e4be1207655159-C001-80", "text": "It is also unsurprising, that the test set-filtered rule table size of our model is only\u02dc0.7 times that of Chiang's: this is due to the fact that some of the refined translation rule patterns required by the test set are unattested in the training data." }, { "sent_id": "04461d946dadc759e4be1207655159-C001-81", "text": "Furthermore, the rule table size of NRRs is much smaller than that of HDHRs since a NRR contains only two non-terminals." }, { "sent_id": "04461d946dadc759e4be1207655159-C001-82", "text": "Table 3 lists the translation performance with BLEU scores." }, { "sent_id": "04461d946dadc759e4be1207655159-C001-83", "text": "Note that our re-implementation of Chiang's original HPB model performs on a par with Moses HPB." }, { "sent_id": "04461d946dadc759e4be1207655159-C001-84", "text": "Table 3 shows that our HD-HPB model significantly outperforms Chiang's HPB model with an average improvement of 1.91 in BLEU (and similar improvements over Moses HPB)." }, { "sent_id": "04461d946dadc759e4be1207655159-C001-85", "text": "Table 3 shows that the head-driven scheme outperforms a SAMT-style approach (for each test set p < 0.01), indicating that head information is more effective than (partial) CFG categories." }, { "sent_id": "04461d946dadc759e4be1207655159-C001-86", "text": "Taking lianming zhichi in Figure 1 as an example, HD-HPB labels the span VV, as lianming is dominated by zhichi, effecively ignoring lianming in the translation rule, while the SAMT label is ADVP:AD+VV 5 which is more susceptible to data sparsity." }, { "sent_id": "04461d946dadc759e4be1207655159-C001-87", "text": "In addition, SAMT resorts to X if a text span fails to satisify pre-defined categories." }, { "sent_id": "04461d946dadc759e4be1207655159-C001-88", "text": "Examining initial phrases extracted from the SAMT training data shows that 28% of them are labeled as X. 
In order to separate out the individual contributions of the novel HD-HRs and NRRs, we carry out an additional experiment (HD-HR+Glue) using HD-HRs with monotonic glue rules only (adjusted to refined rule labels, but effectively switching off the extra reordering power of full NRRs)." }, { "sent_id": "04461d946dadc759e4be1207655159-C001-89", "text": "Table 3 shows that on average more than half of the improvement over HPB (Chiang and Moses) comes from the refined HD-HRs, the rest from NRRs." }, { "sent_id": "04461d946dadc759e4be1207655159-C001-90", "text": "Examining translation rules extracted from the training data shows that there are 72,366 types of non-terminals with respect to 33 types of POS tags." }, { "sent_id": "04461d946dadc759e4be1207655159-C001-91", "text": "On average each sentence employs 16.6/5.2 HD-HRs/NRRs in our HD-HPB model, compared to 15.9/3.6 hierarchical rules/glue rules in Chiang's model, providing further indication of the importance of NRRs in translation." }, { "sent_id": "04461d946dadc759e4be1207655159-C001-92", "text": "----------------------------------" }, { "sent_id": "04461d946dadc759e4be1207655159-C001-93", "text": "**CONCLUSION**" }, { "sent_id": "04461d946dadc759e4be1207655159-C001-94", "text": "We present a head-driven hierarchical phrase-based (HD-HPB) translation model, which adopts head information (derived through unlabeled dependency analysis) in the definition of non-terminals to better differentiate among translation rules." }, { "sent_id": "04461d946dadc759e4be1207655159-C001-95", "text": "In addition, improved and better integrated reordering rules allow better reordering between consecutive non-terminals through exploration of a larger search space in the derivation." 
}, { "sent_id": "04461d946dadc759e4be1207655159-C001-96", "text": "Experimental results on Chinese-English translation across four test sets demonstrate significant improvements of the HD-HPB model over both Chiang's HPB and a sourceside SAMT-style refined version of HPB." } ], "y": { "@BACK@": { "gold_contexts": [ [ "04461d946dadc759e4be1207655159-C001-6" ] ], "cite_sentences": [ "04461d946dadc759e4be1207655159-C001-6" ] }, "@SIM@": { "gold_contexts": [ [ "04461d946dadc759e4be1207655159-C001-23" ], [ "04461d946dadc759e4be1207655159-C001-31" ], [ "04461d946dadc759e4be1207655159-C001-36" ], [ "04461d946dadc759e4be1207655159-C001-40" ], [ "04461d946dadc759e4be1207655159-C001-53" ] ], "cite_sentences": [ "04461d946dadc759e4be1207655159-C001-23", "04461d946dadc759e4be1207655159-C001-31", "04461d946dadc759e4be1207655159-C001-36", "04461d946dadc759e4be1207655159-C001-40", "04461d946dadc759e4be1207655159-C001-53" ] }, "@DIF@": { "gold_contexts": [ [ "04461d946dadc759e4be1207655159-C001-24" ], [ "04461d946dadc759e4be1207655159-C001-36" ], [ "04461d946dadc759e4be1207655159-C001-38" ] ], "cite_sentences": [ "04461d946dadc759e4be1207655159-C001-24", "04461d946dadc759e4be1207655159-C001-36", "04461d946dadc759e4be1207655159-C001-38" ] }, "@USE@": { "gold_contexts": [ [ "04461d946dadc759e4be1207655159-C001-40" ] ], "cite_sentences": [ "04461d946dadc759e4be1207655159-C001-40" ] }, "@UNSURE@": { "gold_contexts": [ [ "04461d946dadc759e4be1207655159-C001-45" ], [ "04461d946dadc759e4be1207655159-C001-67" ] ], "cite_sentences": [ "04461d946dadc759e4be1207655159-C001-45", "04461d946dadc759e4be1207655159-C001-67" ] } } }, "ABC_af9b884710f8198f008a9687153db6_18": { "x": [ { "sent_id": "af9b884710f8198f008a9687153db6-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-2", "text": "We present Segment-level Neural CRF, which combines neural networks with a linear chain CRF for segment-level sequence modeling tasks such as named entity recognition (NER) and 
syntactic chunking." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-3", "text": "Our segment-level CRF can consider higher-order label dependencies compared with conventional word-level CRF." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-4", "text": "Since it is difficult to consider all possible variable length segments, our method uses segment lattice constructed from the word-level tagging model to reduce the search space." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-5", "text": "Performing experiments on NER and chunking, we demonstrate that our method outperforms conventional word-level CRF with neural networks." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-6", "text": "----------------------------------" }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-8", "text": "Named entity recognition (NER) and syntactic chunking are segment-level sequence modeling tasks, which require to recognize a segment from a sequence of words." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-9", "text": "A segment means a sequence of words that may compose an expression as shown in Figure 1 ." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-10", "text": "Current high performance NER systems use the word-level linear chain Conditional Random Fields (CRF) (Lafferty et al., 2001 ) with neural networks." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-11", "text": "Especially, it has been shown that the combination of LSTMs (Hochreiter and Schmidhuber, 1997; Gers et al., 2000) , convolutional neural networks (CNNs) (LeCun et al., 1989) , and word-level CRF achieves the state-of-the-art performance (Ma and Hovy, 2016) ." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-12", "text": "Figure 1 shows an overview of the word-level CRF for NER." 
}, { "sent_id": "af9b884710f8198f008a9687153db6-C001-13", "text": "However, the word-level neural CRF has two main limitations: (1) it captures only first-order word label dependencies thus it cannot capture segment-level information; (2) it is not easy to incorporate dictionary features directly into a wordlevel model since named entities and syntactic chunks consist of multiple words rather than a single word." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-14", "text": "To overcome the limitation of first-order label dependencies, previous work propose the higher-order CRF, which outperforms first-order CRF on NER task (Sarawagi and Cohen, 2005) and morphological tagging task (Mueller et al., 2013) ." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-15", "text": "In this paper, we extend a neural CRF from word-level to segment-level and propose Segmentlevel Neural CRF." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-16", "text": "Our method has two main advantages: (1) segment-level linear chain CRF can consider higher-order word label dependencies (e.g., the relations between named entities and the other words); (2) it is easy to incorporate dictionary features into the model directly since a dictionary entry and a segment (e.g., a named entity) are in one-to-one correspondence." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-17", "text": "Our experiments on chunking and NER demonstrate that our method outperforms conventional word-level neural CRF." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-18", "text": "2 Word-level Neural CRF As a baseline method, we use word-level neural CRF proposed by (Ma and Hovy, 2016) since their method achieves state-of-the-art performance on NER." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-19", "text": "Specifically, they propose Bi-directional LSTM-CNN CRF (BLSTM-CNN-CRF) for sequential tagging." 
}, { "sent_id": "af9b884710f8198f008a9687153db6-C001-20", "text": "Here, we briefly review their BLSTM-CNN-CRF model." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-21", "text": "Let w t be the t-th word in an input sentence and C t = c (1) t , . . . , c (k) t be the character sequence of w t ." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-22", "text": "BLSTM-CNN-CRF uses both word-level embedding w t \u2208 R d word and character-level embedding c t \u2208 R d char ." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-23", "text": "Given a word sequence X = w 1 , . . . , w n , the model outputs a score vector o t as follows." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-24", "text": "where CNN char is the character-level CNN function, \u2295 is the concatenation of two vectors, LSTM f is the forward LSTM function, LSTM b is the backward LSTM function, Bi-LSTM is the Bi-LSTM function, respectively." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-25", "text": "Then, W T G \u2208 R |T |\u00d7d hidden is the weight matrix to learn, b T G \u2208 R |T | is the bias vector to learn, |T | is the size of tag set T , d hidden is the size of hidden layer of Bi-LSTM, and o t \u2208 R |T | is the score vector in which each element is the probability of a possible tag." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-26", "text": "In BLSTM-CNN-CRF, CRF is applied to the output layer." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-27", "text": "The conditional probability of CRF is defined as follows:" }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-28", "text": "i is the j-th element of the vector o i ." 
}, { "sent_id": "af9b884710f8198f008a9687153db6-C001-29", "text": "Then, A \u2208 R |T |\u00d7|T | is a transition score matrix, A y i\u22121 ,y i is a transition 1 While (Ma and Hovy, 2016) define \u03d5(yi\u22121, yi, oi) = exp(Wy i\u22121 ,y i oi + Ay i\u22121 ,y i ) as the potential function where W is the weight vector corresponding to label pair (yi\u22121, yi), we use the simple potential function here." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-30", "text": "score for jumping from tag y i\u22121 to y i , and Y indicates all possible paths." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-31", "text": "At test time, the predicted sequence is obtained by finding the highest score in a all possible paths using Viterbi algorithm as follows:" }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-32", "text": "3 Segment-level Neural CRF" }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-33", "text": "In this section, we describe our proposed method." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-34", "text": "Our segment-level neural CRF consists of the following two steps:" }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-35", "text": "(i) A segment lattice is constructed from a sequence of words by pruning unlikely BIO tags to reduce a search space." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-36", "text": "This is because it is difficult to consider all possible variable length segments in practice." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-37", "text": "(ii) We use a linear chain CRF to find the highest score path on the segment lattice." 
}, { "sent_id": "af9b884710f8198f008a9687153db6-C001-38", "text": "----------------------------------" }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-39", "text": "**CONSTRUCTING SEGMENT LATTICE**" }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-40", "text": "A segment lattice is a graph structure where each path corresponds to a candidate segmentation path as shown in the lower part of Figure 1 ." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-41", "text": "The segment lattice is a kind of semi-Markov model (Sarawagi and Cohen, 2005) ." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-42", "text": "To construct the segment lattice, we firstly give an input sentence to the word-level tagging model, then obtain the score vector o t for each word that gives the probabilities of possible BIO tags." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-43", "text": "Then, we generate the candidate BIO tags whose scores are greater than the threshold T ." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-44", "text": "After that, we construct the segment lattice by generating admissible segments from the candidate BIO tags." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-45", "text": "For example, we generate the PERSON segment from the candidate BIO tags {B-PER, I-PER, E-PER}. The threshold T is a hyper-parameter for our model." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-46", "text": "We describe how to choose the threshold T in Section 4.3." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-47", "text": "While it has been shown that the CRF layer is required to achieve the state-ofthe-art performance in Ma and Hovy (2016) , we observe that the CRF has no significant effect on the final performance for the lattice construction." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-48", "text": "Therefore, we use BLSTM-CNN (without CRF) as the word-level tagging model in this paper." 
}, { "sent_id": "af9b884710f8198f008a9687153db6-C001-49", "text": "----------------------------------" }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-50", "text": "**SEGMENT-LEVEL VECTOR REPRESENTATION**" }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-51", "text": "To find the highest score path in the segment lattice, we use a standard linear chain CRF at segment-level." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-52", "text": "Since each segment has variable length, we need to obtain fixed-dimensional Figure 2 shows the details of the segment-level neural CRF model." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-53", "text": "Let u i = w b , w b+1 , . . . , w e be the i-th segment in a segment lattice, b is the starting word index, and e is the ending word index." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-54", "text": "To obtain the fixed-dimensional vector u i \u2208 R d node for the segment u i , we apply a CNN to the hidden vector sequence h b:e = h b , h b+1 , . . . , h e by Eq." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-55", "text": "(1), and compute the score vector z i as follows:" }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-56", "text": "where CNN node is the CNN function for the segment vector, W LS \u2208 R |N |\u00d7d node is the weight matrix to learn, b LS \u2208 R d node is the bias vector to learn, |N | is the size of named entity type set N , d node is the size of the segment vector, and z i \u2208 R |N | is the score vector in which each element is the probability of a possible NE type." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-57", "text": "Finally, we apply a linear chain CRF to find the highest score path in the segment lattice as we describe in Section 2." 
}, { "sent_id": "af9b884710f8198f008a9687153db6-C001-58", "text": "----------------------------------" }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-59", "text": "**DICTIONARY FEATURES FOR NER**" }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-60", "text": "In this subsection, we describe the use of two additional dictionary features for NER." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-61", "text": "Since an entry of named entity dictionary and the segment in our model are in one-to-one correspondence, it is easy to directly incorporate the dictionary features into our model." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-62", "text": "We use following two dictionary features on NER task." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-63", "text": "Binary feature The binary feature e i \u2208 R d dict indicates whether the i-th segment (e.g., a named entity) exists in the dictionary or not." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-64", "text": "We use the embedding matrix W dict \u2208 R 2\u00d7d dict , where d dict is the size of the feature embedding." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-65", "text": "e \u2208 {0, 1} is the binary index which indicates whether the segment exists in the dictionary or not." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-66", "text": "Using the index e, we extract the column vector e i \u2208 R d dict from W dict and concatenate the segment vector r i and e i ." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-67", "text": "The concatenated segment vector r \u2032 i is defined as r \u2032 i = r i \u2295 e i ." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-68", "text": "W dict is a randomly initialized matrix and updated in the training time." 
}, { "sent_id": "af9b884710f8198f008a9687153db6-C001-69", "text": "To incorporate the popularity of the Wikipedia entity into our method, we also concatenate one-dimensional vector constructed from the page view count for one month period into e i ." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-70", "text": "The page view count is normalized by the number of candidate segments in the segment lattice." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-71", "text": "The Wikipedia dictionary is constructed by extracting the titles of all Wikipedia pages and the titles of all redirect pages from the Wikipedia Dump Data 2 ." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-72", "text": "Wikipedia embedding feature Another additional feature is the Wikipedia embeddings proposed by Yamada et al. (2016) ." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-73", "text": "Their method maps words and entities (i.e., Wikipedia entities) into the same continuous vector space using the skip-gram model (Mikolov et al., 2013) ." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-74", "text": "We use only the 300 dimensional entity embeddings in this paper. Please refer to Yamada et al. (2016) for more detail." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-75", "text": "----------------------------------" }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-76", "text": "**EXPERIMENTS**" }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-77", "text": "----------------------------------" }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-78", "text": "**DATASETS**" }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-79", "text": "We evaluate our method on two segment-level sequence tagging tasks: NER and text chunking 3 ." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-80", "text": "For NER, we use CoNLL 2003 English NER shared task (Tjong Kim Sang and De Meulder, 2003) ." 
}, { "sent_id": "af9b884710f8198f008a9687153db6-C001-81", "text": "Following previous work (Ma and Hovy, 2016) , we use BIOES tagging scheme in the wordlevel tagging model." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-82", "text": "For text chunking, we use the CoNLL 2000 English text chunking shared task (Tjong Kim Sang and Buchholz, 2000) ." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-83", "text": "Following previous work (S\u00f8gaard and Goldberg, 2016) , the section 19 of WSJ corpus is used as the development set." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-84", "text": "We use BIOES tagging scheme in the word-level tagging model and measure performance using F1 score in all experiments." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-85", "text": "----------------------------------" }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-86", "text": "**MODEL SETTINGS**" }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-87", "text": "To generate a segment lattice, we train word-level BLSTM-CNN with the same hyper-parameters used in Ma and Hovy (2016) level CNN, and 100 dimentional pre-trained word embedding of GloVe (Pennington et al., 2014) ." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-88", "text": "At input layer and output layer, we apply dropout (Srivastava et al., 2014) with rate at 0.5." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-89", "text": "In our model, we set 400 filters with window size 3 in CNN for segment vector." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-90", "text": "To optimize our model, we use AdaDelta (Zeiler, 2012) with batch size 10 and gradient clipping 5." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-91", "text": "We use early stopping (Caruana et al., 2001 ) based on performance on development sets." 
}, { "sent_id": "af9b884710f8198f008a9687153db6-C001-92", "text": "----------------------------------" }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-93", "text": "**HOW TO CHOOSE THRESHOLD**" }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-94", "text": "The threshold T is a hyper-parameter for our model." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-95", "text": "We choose the threshold T based on how a segment lattice maintains the gold segments in the training and development sets." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-96", "text": "The threshold T and the oracle score are shown in Table 1 ." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-97", "text": "In our experiment, the T = 0.00005 is used in NER task and T = 0.0005 is used in chunking task." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-98", "text": "----------------------------------" }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-99", "text": "**RESULTS AND DISCUSSIONS**" }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-100", "text": "The results of CoNLL 2003 NER is shown in Table 2." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-101", "text": "By adding a CRF layer to BLSTM-CNN, it improves the F1 score from 89.72 to 90.96." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-102", "text": "This result is consistent with the result of (Ma and Hovy, 2016 In both experiments, it improves the F1 score by using segment-level CRF." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-103", "text": "On the NER experiment, the additional dictionary features help to obtain further improvement." 
}, { "sent_id": "af9b884710f8198f008a9687153db6-C001-104", "text": "----------------------------------" }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-105", "text": "**RELATED WORK**" }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-106", "text": "Several different neural network methods have been proven to be effective for NER (Collobert et al., 2011; Chiu and Nichols, 2016; Lample et al., 2016; Ma and Hovy, 2016) ." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-107", "text": "Ma and Hovy (2016) demonstrate that combining LSTM, CNN and CRF achieves the state-of-the-art performance on NER and chunking tasks." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-108", "text": "Mueller et al. (2013) show that higher-order CRF outperforms first-order CRF." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-109", "text": "Our work differs from their work in that it can handle segments of variable lengths and thus it is easy to incorporate dictionary features directly." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-110", "text": "Zhuo et al. (2016) propose Gated Recursive Semi-CRF, which models a sequence of segments and automatically learns features." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-111", "text": "They combine Semi-CRF (Sarawagi and Cohen, 2005) and neural networks." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-112", "text": "However they report the F1 score 89.44% on NER and 94.73 4 on Chunking which are lower than the scores of our method." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-113", "text": "Kong et al. (2016) propose segmental recurrent neural networks (SRNNs)." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-114", "text": "SRNNs are based on Bi-LSTM feature extractor and uses dynamic programming algorithm to reduce search space." 
}, { "sent_id": "af9b884710f8198f008a9687153db6-C001-115", "text": "----------------------------------" }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-116", "text": "**CONCLUSION**" }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-117", "text": "In this paper, we propose the segment-level sequential modeling method based on a segment lat-tice structure." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-118", "text": "Our experimental results show that our method outperforms conventional word-level neural CRF." }, { "sent_id": "af9b884710f8198f008a9687153db6-C001-119", "text": "Furthermore, two additional dictionary features help to obtain further improvement on NER task." } ], "y": { "@BACK@": { "gold_contexts": [ [ "af9b884710f8198f008a9687153db6-C001-11" ], [ "af9b884710f8198f008a9687153db6-C001-106" ], [ "af9b884710f8198f008a9687153db6-C001-107" ] ], "cite_sentences": [ "af9b884710f8198f008a9687153db6-C001-11", "af9b884710f8198f008a9687153db6-C001-106" ] }, "@USE@": { "gold_contexts": [ [ "af9b884710f8198f008a9687153db6-C001-18" ], [ "af9b884710f8198f008a9687153db6-C001-81" ], [ "af9b884710f8198f008a9687153db6-C001-87" ] ], "cite_sentences": [ "af9b884710f8198f008a9687153db6-C001-18", "af9b884710f8198f008a9687153db6-C001-81", "af9b884710f8198f008a9687153db6-C001-87" ] }, "@MOT@": { "gold_contexts": [ [ "af9b884710f8198f008a9687153db6-C001-18" ] ], "cite_sentences": [ "af9b884710f8198f008a9687153db6-C001-18" ] }, "@DIF@": { "gold_contexts": [ [ "af9b884710f8198f008a9687153db6-C001-29" ], [ "af9b884710f8198f008a9687153db6-C001-47" ] ], "cite_sentences": [ "af9b884710f8198f008a9687153db6-C001-29", "af9b884710f8198f008a9687153db6-C001-47" ] }, "@SIM@": { "gold_contexts": [ [ "af9b884710f8198f008a9687153db6-C001-87" ], [ "af9b884710f8198f008a9687153db6-C001-102" ] ], "cite_sentences": [ "af9b884710f8198f008a9687153db6-C001-87", "af9b884710f8198f008a9687153db6-C001-102" ] } } }, "ABC_975413dd6b3d3df9c5d111d94e8eb7_18": { "x": [ { "sent_id": 
"975413dd6b3d3df9c5d111d94e8eb7-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-2", "text": "Pre-trained embeddings such as word embeddings and sentence embeddings are fundamental tools facilitating a wide range of downstream NLP tasks." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-3", "text": "In this work, we investigate how to learn a general-purpose embedding of textual relations, defined as the shortest dependency path between entities." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-24", "text": "The Ford Motor Company is both founded by and named after Henry Ford." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-4", "text": "Textual relation embedding provides a level of knowledge between word/phrase level and sentence level, and we show that it can facilitate downstream tasks requiring relational understanding of the text." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-5", "text": "To learn such an embedding, we create the largest distant supervision dataset by linking the entire English ClueWeb09 corpus to Freebase." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-6", "text": "We use global co-occurrence statistics between textual and knowledge base relations as the supervision signal to train the embedding." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-7", "text": "Evaluation on two relational understanding tasks demonstrates the usefulness of the learned textual relation embedding." 
}, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-8", "text": "The data and code can be found at https://github.com/czyssrs/GloREPlus" }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-9", "text": "----------------------------------" }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-11", "text": "Pre-trained embeddings such as word embeddings (Mikolov et al., 2013; Pennington et al., 2014; Peters et al., 2018; Devlin et al., 2018) and sentence embeddings (Le and Mikolov, 2014; Kiros et al., 2015) have become fundamental NLP tools." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-12", "text": "Learned with large-scale (e.g., up to 800 billion tokens (Pennington et al., 2014) ) open-domain corpora, such embeddings serve as a good prior for a wide range of downstream tasks by endowing task-specific models with general lexical, syntactic, and semantic knowledge." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-13", "text": "Inspecting the spectrum of granularity, a representation between lexical (and phrasal) level and sentence level is missing." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-14", "text": "Many tasks require relational understanding of the entities mentioned in the text, e.g., relation extraction and knowledge base completion." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-15", "text": "Textual relation (Bunescu and Mooney, 2005) , defined as the shortest path between two entities in the dependency parse tree of a sentence, has been widely shown to be the main bearer of relational information in text and proved effective in relation extraction tasks (Xu et al., 2015; Su et al., 2018) ." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-16", "text": "If we can learn a general-purpose embedding for textual relations, it may facilitate many downstream relational understanding tasks by providing general relational knowledge." 
}, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-17", "text": "Similar to language modeling for learning general-purpose word embeddings, distant supervision (Mintz et al., 2009 ) is a promising way to acquire supervision, at no cost, for training general-purpose embedding of textual relations." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-18", "text": "Recently Su et al. (2018) propose to leverage global co-occurrence statistics of textual and KB relations to learn embeddings of textual relations, and show that it can effectively combat the wrong labeling problem of distant supervision (see Figure 1 for example)." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-19", "text": "While their method, named GloRE, achieves the state-of-the-art performance on the popular New York Times (NYT) dataset (Riedel et al., 2010) , the scope of their study is limited to relation extraction with smallscale in-domain training data." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-20", "text": "In this work, we take the GloRE approach further and apply it to large-scale, domainindependent data labeled with distant supervision, with the goal of learning general-purpose textual relation embeddings." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-21", "text": "Specifically, we create the largest ever distant supervision dataset by linking the entire English ClueWeb09 corpus (half a billion of web documents) to the latest version of Freebase (Bollacker et al., 2008) , which contains 45 million entities and 3 billion relational facts." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-22", "text": "After filtering, we get a dataset with over 5 million unique textual relations and around 9 million cooccurring textual and KB relation pairs." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-23", "text": "We then train textual relation embedding on the collected The wrong labeling problem of distant supervision." 
}, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-25", "text": "The KB relation founder and named after are thus both mapped to all of the sentences containing the entity pair, resulting in many wrong labels (red dashed arrows)." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-26", "text": "Right: Global co-occurrence statistics from our distant supervision dataset, which clearly distinguishes the two textual relations." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-27", "text": "dataset in a way similar to (Su et al., 2018) , but using Transformer (Vaswani et al., 2017) instead of vanilla RNN as the encoder for better training efficiency." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-28", "text": "To demonstrate the usefulness of the learned textual relation embedding, we experiment on two relational understanding tasks, relation extraction and knowledge base completion." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-29", "text": "For relation extraction, we use the embedding to augment PCNN+ATT (Lin et al., 2016) and improve the precision for top 1000 predictions from 83.9% to 89.8%." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-30", "text": "For knowledge base completion, we replace the neural network in (Toutanova et al., 2015) with our pre-trained embedding followed by a simple projection layer, and gain improvements on both MRR and HITS@10 measures." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-31", "text": "Our major contributions are summarized as following:" }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-32", "text": "\u2022 We propose the novel task of learning general-purpose embedding of textual relations, which has the potential to facilitate a wide range of relational understanding tasks." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-33", "text": "\u2022 To learn such an embedding, we create the largest distant supervision dataset by linking the entire English ClueWeb09 corpus to Freebase." 
}, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-34", "text": "The dataset is publicly available 1 ." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-35", "text": "\u2022 Based on the global co-occurrence statistics of textual and KB relations, we learn a textual relation embedding on the collected dataset and demonstrate its usefulness on relational understanding tasks." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-36", "text": "----------------------------------" }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-37", "text": "**RELATED WORK**" }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-38", "text": "Distant supervision methods (Mintz et al., 2009) for relation extraction have been studied by a number of works (Riedel et al., 2010; Hoffmann et al., 2011; Surdeanu et al., 2012; Zeng et al., 2015; Lin et al., 2016; Ji et al., 2017; Wu et al., 2017) ." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-39", "text": "(Su et al., 2018) use global co-occurrence statistics of 1 https://github.com/czyssrs/GloREPlus textual and KB relations to effectively combat the wrong labeling problem. But the global statistics in their work is limited to NYT dataset, capturing domain-specific distributions." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-40", "text": "Another line of research that relates to ours is the universal schema (Riedel et al., 2013) for relation extraction, KB completion, as well as its extensions (Toutanova et al., 2015; Verga et al., 2016) ." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-41", "text": "Wrong labeling problem still exists since their embedding is learned based on individual relation facts." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-42", "text": "In contrast, we use the global cooccurrence statistics as explicit supervision signal." 
}, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-43", "text": "----------------------------------" }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-44", "text": "**TEXTUAL RELATION EMBEDDING**" }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-45", "text": "In this section, we describe how to collect largescale data via distant supervision ( \u00a73.1) and train the textual relation embedding ( \u00a73.2)." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-46", "text": "----------------------------------" }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-47", "text": "**GLOBAL CO-OCCURRENCE STATISTICS FROM DISTANT SUPERVISION**" }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-48", "text": "To construct a large-scale distant supervision dataset, we first get the English ClueWeb09 corpus (Callan et al., 2009) , which contains 500 million web documents." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-49", "text": "We employ the FACC1 dataset (Gabrilovich et al., 2013) to map ClueWeb09 to Freebase." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-50", "text": "We identify over 5 billion entity mentions in ClueWeb09 and link them to Freebase entities." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-51", "text": "From the linked documents, we extract 155 million sentences containing at least two entity mentions." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-52", "text": "We then use the Stanford Parser (Chen and Manning, 2014) with universal dependencies to extract textual relations (shortest dependency paths) between each pair of entity mentions 2 , leading to 788 million relational triples (subject, textual relation, object), of which 451 million are unique." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-53", "text": "Following (Su et al., 2018) , we then collect the global co-occurrence statistics of textual and KB relations." 
}, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-54", "text": "More specifically, for a relational triple (e 1 , t, e 2 ) with textual relation t, if (e 1 , r, e 2 ) with KB relation r exists in the KB, then we count it as a co-occurrence of t and r. We count the total number of co-occurrences of each pair of textual and KB relation across the entire corpus." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-55", "text": "We then normalize the global co-occurrence statistics such that each textual relation has a valid probability distribution over all the KB relations, which presumably captures the semantics of the textual relation." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-56", "text": "In the end, a bipartite relation graph is constructed, with one node set being the textual relations, the other node set being the KB relations, and the weighted edges representing the normalized global co-occurrence statistics." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-57", "text": "Filtering. When aligning the text corpus with the KB, we apply a number of filters to ensure data quality and training efficiency: (1) We only use the KB relations in Freebase Commons, 70 domains that are manually verified to be of release quality." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-58", "text": "(2) Only textual relations with the number of tokens (including both lexical tokens and dependency relations) less than or equal to 10 are kept." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-59", "text": "(3) Only non-symmetric textual relations are kept, because symmetric ones are typically from conjunctions like \"and\" or \"or\", which are less of interest." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-60", "text": "(4) Only textual relations with at least two occurrences are kept." 
}, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-61", "text": "After filtering, we end up with a relation graph with 5,559,176 unique textual relations, 1,925 knowledge base (KB) relations, and 8,825,731 edges with non-zero weight." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-62", "text": "It is worth noting that these filters are very conservative, and we can easily increase the scale of data by relaxing some of the filters." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-63", "text": "----------------------------------" }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-64", "text": "**EMBEDDING TRAINING**" }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-65", "text": "Considering both effectiveness and efficiency, we employ the Transformer encoder (Vaswani et al., 2017) to learn the textual relation embedding." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-66", "text": "It has been shown to excel at learning generalpurpose representations (Devlin et al., 2018) ." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-67", "text": "The embedded textual relation token sequence is fed as input." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-68", "text": "For example, for the textual relation dobj \u2190\u2212\u2212 f ounded nsubj \u2212\u2212\u2212\u2192, the input is the embedded sequence of {< \u2212dobj >, f ounded, < nsubj >}." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-69", "text": "We project the output of the encoder to a vector z as the result embedding." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-70", "text": "Given a textual relation t i and its embedding z i , denote {r 1 , r 2 , ..., r n } as all KB relations, andp(r j |t i ) as the global co-occurrence distribution, the weight of the edge between textual relation t i and KB relation r j in the relation graph." 
}, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-71", "text": "The training objective is to minimize the cross-entropy loss:" }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-72", "text": "Where" }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-73", "text": "W and b are trainable parameters." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-74", "text": "We use the filtered relation graph in \u00a73.1 as our training data." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-75", "text": "To guarantee that the model generalizes to unseen textual relations, we take 5% of the training data as validation set." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-76", "text": "Word embeddings are initialized with the GloVe (Pennington et al., 2014) vectors 3 ." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-77", "text": "Dependency relation embeddings are initialized randomly." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-78", "text": "For the Transformer model, we use 6 layers and 6 attention heads for each layer." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-79", "text": "We use the Adam optimizer (Kingma and Ba, 2015) with parameter settings suggested by the original Transformer paper (Vaswani et al., 2017) ." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-80", "text": "We train a maximum number of 200 epochs and take the checkpoint with minimum validation loss for the result." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-81", "text": "We also compare with using vanilla RNN in GloRE (Su et al., 2018) ." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-82", "text": "Denote the embedding trained with Tranformer as GloRE++, standing for both new data and different model, and with RNN as GloRE+, standing for new data." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-83", "text": "We observe that, in the early stage of training, the validation loss of RNN decreases faster than Transformer." 
}, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-84", "text": "However, it starts to overfit soon." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-85", "text": "----------------------------------" }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-86", "text": "**EXPERIMENTS**" }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-87", "text": "In this section, we evaluate the usefulness of the learned textual relation embedding on two popular relational understanding tasks, relation extraction and knowledge base completion." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-88", "text": "We do not fine-tune the embedding, and only use in-domain data to train a single feedforward layer to project the embedding to the target relations of the domain." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-89", "text": "We compare this with models that are specifically designed for those tasks and trained using in-domain data." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-90", "text": "If we can achieve comparable or better results, it demonstrates that the general-purpose embedding captures useful information for downstream tasks." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-91", "text": "----------------------------------" }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-92", "text": "**RELATION EXTRACTION**" }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-93", "text": "We experiment on the popular New York Times (NYT) relation extraction dataset (Riedel et al., 2010) ." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-94", "text": "Following GloRE (Su et al., 2018) , we aim at augmenting existing relation extractors with the textual relation embeddings." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-95", "text": "We first average the textual relation embeddings of all contextual sentences of an entity pair, and project the average embedding to the target KB relations." 
}, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-96", "text": "We then construct an ensemble model by a weighted combination of predictions from the base model and the textual relation embedding." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-97", "text": "Same as (Su et al., 2018) , we use PCNN+ATT (Lin et al., 2016 ) as our base model." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-98", "text": "GloRE++ improves its best F 1 -score from 42.7% to 45.2%, slightly outperforming the previous state-of-theart (GloRE, 44.7%)." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-99", "text": "As shown in previous work (Su et al., 2018) , on NYT dataset, due to a significant amount of false negatives, the PR curve on the held-out set may not be an accurate measure of performance." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-100", "text": "Therefore, we mainly employ manual evaluation." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-101", "text": "We invite graduate students to check top 1000 predictions of each method." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-102", "text": "They are present with the entity pair, the prediction, and all the contextual sentences of the entity pair." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-103", "text": "Each prediction is examined by two students until reaching an agreement after discussion." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-104", "text": "Besides, the students are not aware of the source of the predictions." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-105", "text": "Table 1 shows the manual evaluation results." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-106", "text": "Both GloRE+ and GloRE++ get improvements over GloRE." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-107", "text": "GloRE++ obtains the best results for top 700, 900 and 1000 predictions." 
}, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-108", "text": "----------------------------------" }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-109", "text": "**KNOWLEDGE BASE COMPLETION**" }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-110", "text": "We experiment on another relational understanding task, knowledge base (KB) completion, on the popular FB15k-237 dataset (Toutanova et al., 2015) ." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-111", "text": "The goal is to predict missing relation facts based on a set of known entities, KB relations, and textual mentions." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-112", "text": "(Toutanova et al., 2015) use a convolutional neural network (CNN) to model textual relations." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-113", "text": "We replace their CNN with our pretrained embedding followed by one simple feedforward projection layer." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-114", "text": "As in (Toutanova et al., 2015) , we use the best performing DISTMULT and E+DISTMULT as the base models." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-115", "text": "DISTMULT (Yang et al., 2015) learns latent vectors for the entities and each relation type, while model E (Riedel et al., 2013) learns two latent vectors for each relation type, associated with its subject and object entities respectively." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-116", "text": "E+DISTMULT is a combination model that ensembles the predictions from individual models, and is trained jointly." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-117", "text": "We conduct experiments using only KB relations (KB only), using their CNN to model textual relations (Conv), and using our embedding to model textual relations (Emb)." 
}, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-118", "text": "The models are tested on predicting the object entities of a set of KB triples disjoint from the training set, given the subject entity and the relation type." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-119", "text": "Table 2 shows the performances of all models measured by mean reciprocal rank (MRR) of the correct entity, and HITS@10 (the percentage of test instances for which the correct entity is ranked within the top 10 predictions)." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-120", "text": "We also show the performances on the two subsets of the test set, with and without textual mentions." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-121", "text": "The pre-trained embedding achieves comparable or better results to the CNN model trained with indomain data." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-122", "text": "we apply t-SNE visualization (Maaten and Hinton, 2008) on the learned embedding of ClueWeb validation set." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-123", "text": "We filter out infrequent textual relations and assign labels to the textual relations when they cooccur more than half of the times with a KB relation." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-124", "text": "The visualization result of GloRE++ embedding associating with the top-10 frequent KB relations is shown in Figure 2 ." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-125", "text": "As we can see, similar textual relations are grouped together while dissimilar ones are separated." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-126", "text": "This implies that the embedding model can well generate textual relation representation for unseen textual relations, and can potentially serve as relational features to help tasks in unsupervised setting." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-127", "text": "Case Study." 
}, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-128", "text": "To show that the embedding model generalizes to unseen textual relations via capturing crucial textual sub-patterns, we randomly pick some textual relations in NYT train set but not in ClueWeb train set, and compare with its top-5 nearest neighbors in ClueWeb train set, based on the similarity of the learned embedding." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-129", "text": "A case study is shown in Table 3 ." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-130", "text": "We can see that the KB relation place of birth often collocates with a preposition in indicating the object fits into a location type, and some key words like born." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-131", "text": "Together, the sub-structure born in serves as a strong indicator for place of birth relation." }, { "sent_id": "975413dd6b3d3df9c5d111d94e8eb7-C001-132", "text": "There is almost always some redundant information in the textual relations, for example in the textual rela-" } ], "y": { "@BACK@": { "gold_contexts": [ [ "975413dd6b3d3df9c5d111d94e8eb7-C001-15" ], [ "975413dd6b3d3df9c5d111d94e8eb7-C001-18" ], [ "975413dd6b3d3df9c5d111d94e8eb7-C001-39" ] ], "cite_sentences": [ "975413dd6b3d3df9c5d111d94e8eb7-C001-15", "975413dd6b3d3df9c5d111d94e8eb7-C001-18", "975413dd6b3d3df9c5d111d94e8eb7-C001-39" ] }, "@USE@": { "gold_contexts": [ [ "975413dd6b3d3df9c5d111d94e8eb7-C001-53" ], [ "975413dd6b3d3df9c5d111d94e8eb7-C001-81" ], [ "975413dd6b3d3df9c5d111d94e8eb7-C001-94" ], [ "975413dd6b3d3df9c5d111d94e8eb7-C001-97" ] ], "cite_sentences": [ "975413dd6b3d3df9c5d111d94e8eb7-C001-53", "975413dd6b3d3df9c5d111d94e8eb7-C001-81", "975413dd6b3d3df9c5d111d94e8eb7-C001-94", "975413dd6b3d3df9c5d111d94e8eb7-C001-97" ] }, "@SIM@": { "gold_contexts": [ [ "975413dd6b3d3df9c5d111d94e8eb7-C001-97" ] ], "cite_sentences": [ "975413dd6b3d3df9c5d111d94e8eb7-C001-97" ] }, "@MOT@": { "gold_contexts": [ [ 
"975413dd6b3d3df9c5d111d94e8eb7-C001-100", "975413dd6b3d3df9c5d111d94e8eb7-C001-99" ] ], "cite_sentences": [ "975413dd6b3d3df9c5d111d94e8eb7-C001-99" ] } } }, "ABC_45d4d6f0ac4a4f3bf7b2ac70fbcf7f_18": { "x": [ { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-2", "text": "In dialogues, an utterance is a chain of consecutive sentences produced by one speaker which ranges from a short sentence to a thousand-word post." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-3", "text": "When studying dialogues at the utterance level, it is not uncommon that an utterance would serve multiple functions." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-4", "text": "For instance, \"Thank you." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-5", "text": "It works great. \" expresses both gratitude and positive feedback in the same utterance." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-6", "text": "Multiple dialogue acts (DA) for one utterance breeds complex dependencies across dialogue turns." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-7", "text": "Therefore, DA recognition challenges a model's predictive power over long utterances and complex DA context." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-8", "text": "We term this problem Concurrent Dialogue Acts (CDA) recognition." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-9", "text": "Previous work on DA recognition either assumes one DA per utterance or fails to realize the sequential nature of dialogues." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-10", "text": "In this paper, we present an adapted Convolutional Recurrent Neural Network (CRNN) which models the interactions between utterances of long-range context." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-11", "text": "Our model significantly outperforms existing work on CDA recognition on a tech forum dataset." 
}, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-12", "text": "----------------------------------" }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-13", "text": "**INTRODUCTION**" }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-14", "text": "An utterance is a sequence of sentences that are produced by the same speaker in his or her turn in a dialogue." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-15", "text": "Dialogue Acts (DA) aim to capture the functionality of these utterances." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-16", "text": "Recognizing DAs can benefit a dialogue system in many ways." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-17", "text": "1) A pre-trained DA model can reduce the number of utterance annotations required to train a dialogue system [11] ." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-18", "text": "2) For retrieval-based dialogue systems, DA modeling can provide meaningful patterns to reduce the search space and thus expedite response selection." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-19", "text": "Even though DA recognition is a well-studied problem, most contemporary methods concentrate on conversations of short utterances with usually a single DA per utterance, the case of the Switchboard Dialogue Act * Both authors contributed equally to this research." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-20", "text": "Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-21", "text": "Copyrights for components of this work owned by others than ACM must be honored." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-22", "text": "Abstracting with credit is permitted." 
}, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-23", "text": "To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-24", "text": "Request permissions from permissions@acm.org." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-25", "text": "CIKM '19, November 3-7, 2019 (SwDA) corpus [5] ." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-26", "text": "However, when utterances are long and complex, one DA label would not be sufficient to describe the pragmatic function of an utterance." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-27", "text": "As a result, multiple DA labels would be required and the approaches on SwDA would not apply." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-28", "text": "We term our problem Concurrent Dialogue Acts (CDA) Recognition." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-29", "text": "Forum discussion is a type of dialogue that breeds CDAs." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-30", "text": "The sample in Figure 1 from MSDialog-Intent corpus [14] discusses a technical issue with Microsoft Bingo among two users (U 1 & U 2) and two agents (A1 & A2)." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-31", "text": "For instance, utterance 6 is simultaneously labeled as GG for \"Good Luck\" and PA for \"using a Win+Shift+Cursor move\" (DA labels are shown in Table 1 )." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-32", "text": "59.95% of utterances in MSDialog-Intent are indeed multi-labeled." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-33", "text": "Another important characteristic of these dialogues is that a speaker can refer to any of the preceding utterances." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-34", "text": "Thus, it is unfeasible to determine a fixed window size that would incorporate all relevant context." 
}, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-35", "text": "In Figure 1 , utterance 6 provides a potential answer (PA) to utterance 1; utterance 7 provides further details (FD) to the information request (IR) in utterance 2." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-36", "text": "Without the capability of capturing a wider range of utterances, the chance of recognizing DAs in these forum conversations would be reduced." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-37", "text": "However, current approaches for DA recognition only attempt to capture either of the two characteristics but not both." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-38", "text": "1) Sequence models, that can capture far-away context information, are often applied to DA sequence labeling." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-39", "text": "Kumar et al. [10] build a hierarchical RNN using Bi-LSTM as a base unit and CRF as the last layer; Chen et al. [2] use the CRF-Attentive Structured Network for the same hierarchical nature of DAs." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-40", "text": "Nonetheless, these models only concern single-label DA datasets (e.g., SwDA) and heavily rely on CRF, which is incapable of predicting multiple labels for each utterance." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-41", "text": "2) Another line of research casts DA recognition as a multi-label classification problem to accommodate the CDA scenario." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-42", "text": "Qu et al. [14] apply a CNN-based text classifier proposed by Kim [8] using a fixed window to represent the context." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-43", "text": "Although capable of classifying utterances with CDAs, Qu et al. [14] 's model only concerns a strictly-local context range and thus cannot include distant information." 
}, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-44", "text": "In this paper, we present a novel neural model that is adapted from Convolutional Recurrent Neural Network (CRNN) to both incorporate the interaction between distant utterances and generalize the DA recognition task to accommodate CDA." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-45", "text": "Our contributions can be summarized as follows: 1) In our adapted Convolutional Recurrent Neural Network (CRNN) we use the recurrent layers to gather long-range contextual information that are extracted from utterances by the convolutional layers." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-46", "text": "2) We further optimize the model by incorporating highway connections and dynamic k-max pooling." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-47", "text": "3) Our model significantly outperforms the state-of-the-art model by 4.68% in accuracy and 3.08% in F-score." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-48", "text": "----------------------------------" }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-49", "text": "**RELATED WORK**" }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-50", "text": "CRNN has been widely applied to classification tasks." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-51", "text": "In Natural Language Processing (NLP) research including sentence segmentation [16] , sentiment analysis [17] and discourse modeling [6] , CRNN achieves the state-of-the-art in single-label classification where for each input, a single output would be predicted." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-52", "text": "Deep learning research outside of NLP proves CRNN suitable for classifications where multiple labels may be assigned to each target." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-53", "text": "Cakir et al. [1] apply CRNN to Polyphonic Sound Event Detection where multiple sound events can be detected in a timestamp." 
}, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-54", "text": "Similar to single-label CRNNs, features are extracted by the convolutional layers and further integrated into the recurrent layers to provide context information." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-55", "text": "The sigmoid activation function is used to make multi-binary decisions for each event activity." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-56", "text": "Choi et al. [4] treat music tagging as a multi-label classification task where a subset of genres, moods, etc." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-57", "text": "are assigned to each music." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-58", "text": "In Choi et al. [4] 's model, The CNN is responsible for extracting local features and the RNN presents a temporal summarization." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-59", "text": "These studies substantiate CRNN's applicability to multi-label classification tasks, especially its ability to extract features from the current step and to integrate features from arbitrarily long distance." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-60", "text": "In previous dialogue research, Kalchbrenner and Blunsom [6] and Kim et al. [7] incorporate CRNN into a single-label DA recognition and dialogue topic tracking system." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-61", "text": "Based on this line of research, we argue that CRNN can be applied for CDA recognition with two adaptations: binary cross-entropy loss (BCE) and sigmoid activation." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-62", "text": "Further experiments also show that adding highway connections and dynamic k-max pooling further boosts the performance." 
}, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-63", "text": "----------------------------------" }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-64", "text": "**TASK DEFINITION**" }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-65", "text": "The task is defined as a CDA recognition problem where for each utterance u t (the t-th utterance) in a dialogue, we predict a subset of DA labels y t that describes the functionality of the utterance from a candidate set of DA labels L = {l 1 , l 2 , ..., l c }." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-66", "text": "For a dialog with s utterances, the inputs to the algorithm is U = {u 1 , u 2 , ..., u s }, and the output is Y = {y 1 , y 2 , ..., y s }, where y t is the annotated DA label set for u t , in which y t = {y 1 t , y 2 t , ..., y c t }." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-67", "text": "Here, y j t = {1, 0} denotes whether the t-th utterance of the dialog is labeled with DA label l j or not." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-68", "text": "When c j=1 y j t > 1, we say CDAs are recognized." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-69", "text": "Given a dialogue U, the goal is to predict the DA sequence Y from the text." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-70", "text": "----------------------------------" }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-71", "text": "**THE PROPOSED APPROACH**" }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-72", "text": "The challenge of this task lies in the complexity of dialogue structures in human conversations where an utterance can express multiple DAs." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-73", "text": "In this work, we improve CDA recognition with an adapted CRNN which models the interactions between long-range context." 
}, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-74", "text": "----------------------------------" }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-75", "text": "**CONVOLUTIONAL LAYER**" }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-76", "text": "The base of our architecture is a CNN module similar to Kim [8] ." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-77", "text": "The module works by 'sliding' through the embedding matrix of an utterance with various filter sizes to capture semantic features in differently ordered n-grams." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-78", "text": "A convolution operation is denoted as" }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-79", "text": "where k i is the feature generated by the i-th filter with weights w and bias b k ." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-80", "text": "This filter of size d is applied to an embedding matrix, which is the concatenation from the i-th to the (i + d \u2212 1)th embedding vectors." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-81", "text": "This operation is applied to every possible window of words in an utterance of length n and generates a feature map k." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-82", "text": "----------------------------------" }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-83", "text": "**DYNAMIC K-MAX POOLING**" }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-84", "text": "A max-over-time pooling operation [8] is usually applied over the feature map and takes the maximum value as the feature corresponding to this particular filter." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-85", "text": "The idea is to capture the most important features of an utterance." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-86", "text": "However, this mechanism could be problematic when the utterance is long." 
}, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-87", "text": "We experimented with Dynamic k-Max Pooling [12] to pool the most powerful features from p sub-sequences of an utterance with m words." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-88", "text": "This pooling scheme naturally deals with variable utterance length." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-89", "text": "----------------------------------" }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-90", "text": "**RECURRENT LAYER**" }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-91", "text": "Based on the local textual features extracted from each utterance, a bidirectional RNN is applied to gather features from a wider context for recognizing the DAs in the target utterance, u t ." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-92", "text": "We experimented with two variations of RNN: LSTM [18] and GRU [3] , both of which utilize gating information to prevent the vanishing gradient problem." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-93", "text": "GRU is constructed similarly to LSTM but without using a memory cell." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-94", "text": "It exposes the full hidden state without any control, which may be computationally more efficient." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-95", "text": "We experimented with both since it is difficult to predict which one performs better on our task." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-96", "text": "----------------------------------" }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-97", "text": "**HIGHWAY CONNECTION**" }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-98", "text": "Although LSTM and GRU help capture wider context, the network training becomes more difficult with the additional recurrent layers." 
}, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-99", "text": "Inspired by the highway networks [15] , we propose to add a highway connection between the convolutional layer and the last fully connected layer." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-100", "text": "With this mechanism, the information about the target utterance, u t , can flow across the recurrent layer without attenuation." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-101", "text": "The last fully connected layer learns from the outputs of both recurrent and convolutional layers." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-102", "text": "----------------------------------" }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-103", "text": "**DATASET**" }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-104", "text": "We use the MSDialog-Intent dataset [14] to conduct experiments." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-105", "text": "In the dataset, each of the 10,020 utterances is annotated with a subset of 12 DAs." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-106", "text": "The abundance of information in a single utterance (avg. 72 tokens/utterance) breeds CDA (avg. 1.83 DAs/utterance)." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-107", "text": "We observe a strong correlation between the number of DAs and utterance length, which necessitates a CDA model for forum conversations." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-108", "text": "The dataset includes plenty of metadata for each utterance, e.g., answer vote and user affiliation." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-109", "text": "For generalizability, our model only incorporates textual content of the dialogues." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-110", "text": "Besides, unlike Qu et al. [14] , we keep all the DA annotations in the dataset to preserve the meaningful DA structures within and across utterances." 
}, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-111", "text": "1" }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-112", "text": "----------------------------------" }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-113", "text": "**EXPERIMENTS**" }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-114", "text": "In this section, three versions of our proposed model with incremental improvements are evaluated against a CNN baseline [8] and the state-of-the-art approach for CDA recognition [14] ." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-115", "text": "\u2022 CNN-Kim [8] : One of the first attempts to apply CNN to text classification." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-116", "text": "The CNN model consists of three convolutional layers with the same filter size." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-117", "text": "\u2022 CNN-CR [14] : The state-of-the-art approach for CDA recognition on the MSDialog-Intent dataset [14] ." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-118", "text": "The CNN model incorporates context information with a window size of 3." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-119", "text": "\u2022 CRNN (v 1 ): Our base model that adapts CRNN for CDA recognition using BCE loss and sigmoid activation function." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-120", "text": "\u2022 CRNN (v 2 ): CRNN (v 1 ) with highway connections added between the convolutional layer and the fully connected layer." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-121", "text": "\u2022 CRNN (v 3 ): CRNN (v 1 ) with highway connections and dynamic k-max pooling implemented." 
}, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-122", "text": "----------------------------------" }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-123", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-124", "text": "The dataset is partitioned into training, validation, and test sets in the ratio of 8:1:1." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-125", "text": "Hyper-parameters are tuned with the validation set." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-126", "text": "For word embedding, we use the publicly available GloVe vectors that were trained on Wikipedia + Gigaword 5 [13] ." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-127", "text": "We find that setting (CNN filters, CNN dropout, RNN units, RNN layers, RNN dropout and k) to (100, 0.4, 900, 2, 0.15 and 2) is the best for our model." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-128", "text": "The BCE is used as loss function and Adam [9] is used for optimization with an initial learning rate, 0.001." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-129", "text": "The models were trained on four GeForce RTX 2080 Ti GPUs." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-130", "text": "----------------------------------" }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-131", "text": "**METRICS**" }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-132", "text": "Following previous work [14] on multi-label classification, we adopt label-based accuracy (i.e., Hamming score) and micro-F 1 score as our main evaluation metrics." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-133", "text": "Micro-precision and micro-recall are also reported to assist the analysis." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-134", "text": "Among all, accuracy is the only metrics that is on a per utterance basis." 
}, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-135", "text": "Therefore, Student's paired t-test is performed only on accuracy." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-136", "text": "Other metrics (P, R, F 1 ) provide an overall performance evaluation for all utterances." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-137", "text": "----------------------------------" }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-138", "text": "**RESULTS**" }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-139", "text": "All proposed adaptations, i.e., highway connections and dynamic k-max pooling, contribute to the model, as in Table 2 ." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-140", "text": "The CRNN models significantly 2 outperform both baselines." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-141", "text": "The model with full adaptation (v 3 ), performs the best across all experiments: achieving the highest accuracy, recall and F 1 with LSTM and the highest precision with GRU." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-142", "text": "The divergence between LSTM and GRU may indicate that the forget and output gates in LSTM are helpful for the retrieval of related memory but increase the variance of the model." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-143", "text": "Table 3 further exemplifies the capability of our best model in recognizing CDAs." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-144", "text": "Our model systematically outperforms the stateof-the-art approach, CNN-CR." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-145", "text": "More specifically, 1) CRNN achieves better mean accuracy for all reference sizes and significantly for 2) The average number of DAs predicted by CRNN is the closest to the reference." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-146", "text": "Another advantage of CRNN is the ability to handle longer dialogues." 
}, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-147", "text": "Figure 3 visualizes the mean accuracy of utterances grouped by dialogue length." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-148", "text": "Along with the increase of dialogue length, we observe an increasing advantage of CRNN (in blue) over CNN-CR (in orange), especially for dialogues longer than 6 utterances." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-149", "text": "Table 4 shows an example where CRNN and CNN-CR predict differently for utterance 5." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-150", "text": "Utterance 5 should respond to utterance 2, establishing a long-range dependency that cannot be captured by CNN-CR's fixed context window." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-151", "text": "Our CRNN model succeeds in recognizing the distant context for further details (FD)." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-152", "text": "Furthermore, CNN-CR wrongly recognizes utterance 5 as original question (OQ) since the OQ in utterance 1 is out of range." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-153", "text": "----------------------------------" }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-154", "text": "**A CASE STUDY**" }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-155", "text": "----------------------------------" }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-156", "text": "**CONCLUSION**" }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-157", "text": "Our proposed CRNN models for CDA recognition impose fewer restrictions on the structure of DAs and capture textual features from a wider context." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-158", "text": "The experiment results show that all of the proposed adaptations, i.e., highway connections and dynamic kmax pooling, contribute to the model optimization." 
}, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-159", "text": "Our final model significantly outperforms the state-of-the-art approach on a tech forum dataset where the dialogues are packed with complex DA structures and information-rich utterances." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-160", "text": "Future work will consider a more fine-grained set of DAs for a deeper analysis of CDA modeling." }, { "sent_id": "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-161", "text": "It would also be interesting to annotate antecedent relations among utterances to structure threading dialogue flows." } ], "y": { "@BACK@": { "gold_contexts": [ [ "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-30" ], [ "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-42" ] ], "cite_sentences": [ "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-30", "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-42" ] }, "@MOT@": { "gold_contexts": [ [ "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-43", "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-44" ] ], "cite_sentences": [ "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-43" ] }, "@USE@": { "gold_contexts": [ [ "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-104" ], [ "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-132" ] ], "cite_sentences": [ "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-104", "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-132" ] }, "@DIF@": { "gold_contexts": [ [ "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-110" ] ], "cite_sentences": [ "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-110" ] }, "@UNSURE@": { "gold_contexts": [ [ "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-114" ] ], "cite_sentences": [ "45d4d6f0ac4a4f3bf7b2ac70fbcf7f-C001-114" ] } } }, "ABC_af0c9e20d34a080bac3304ded1f8d6_18": { "x": [ { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-48", "text": "----------------------------------" }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-49", "text": "**RELATED WORK**" }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-2", "text": "In dialogues, an utterance is a chain of 
consecutive sentences produced by one speaker which ranges from a short sentence to a thousand-word post." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-3", "text": "When studying dialogues at the utterance level, it is not uncommon for an utterance to serve multiple functions." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-4", "text": "For instance, \"Thank you." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-5", "text": "It works great. \" expresses both gratitude and positive feedback in the same utterance." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-6", "text": "Multiple dialogue acts (DA) for one utterance breed complex dependencies across dialogue turns." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-7", "text": "Therefore, DA recognition challenges a model's predictive power over long utterances and complex DA context." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-8", "text": "We term this problem Concurrent Dialogue Acts (CDA) recognition." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-9", "text": "Previous work on DA recognition either assumes one DA per utterance or fails to realize the sequential nature of dialogues." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-10", "text": "In this paper, we present an adapted Convolutional Recurrent Neural Network (CRNN) which models the interactions between utterances of long-range context." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-11", "text": "Our model significantly outperforms existing work on CDA recognition on a tech forum dataset." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-12", "text": "----------------------------------" }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-13", "text": "**INTRODUCTION**" }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-14", "text": "An utterance is a sequence of sentences that are produced by the same speaker in his or her turn in a dialogue." 
}, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-15", "text": "Dialogue Acts (DA) aim to capture the functionality of these utterances." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-16", "text": "Recognizing DAs can benefit a dialogue system in many ways." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-17", "text": "1) A pre-trained DA model can reduce the number of utterance annotations required to train a dialogue system [11] ." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-18", "text": "2) For retrieval-based dialogue systems, DA modeling can provide meaningful patterns to reduce the search space and thus expedite response selection." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-19", "text": "Even though DA recognition is a well-studied problem, most contemporary methods concentrate on conversations of short utterances with usually a single DA per utterance, the case of the Switchboard Dialogue Act * Both authors contributed equally to this research." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-20", "text": "Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-21", "text": "Copyrights for components of this work owned by others than ACM must be honored." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-22", "text": "Abstracting with credit is permitted." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-23", "text": "To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-24", "text": "Request permissions from permissions@acm.org." 
}, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-25", "text": "CIKM '19, November 3-7, 2019 (SwDA) corpus [5] ." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-50", "text": "CRNN has been widely applied to classification tasks." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-26", "text": "However, when utterances are long and complex, one DA label would not be sufficient to describe the pragmatic function of an utterance." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-27", "text": "As a result, multiple DA labels would be required and the approaches on SwDA would not apply." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-28", "text": "We term our problem Concurrent Dialogue Acts (CDA) Recognition." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-29", "text": "Forum discussion is a type of dialogue that breeds CDAs." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-30", "text": "The sample in Figure 1 from MSDialog-Intent corpus [14] discusses a technical issue with Microsoft Bingo among two users (U 1 & U 2) and two agents (A1 & A2)." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-31", "text": "For instance, utterance 6 is simultaneously labeled as GG for \"Good Luck\" and PA for \"using a Win+Shift+Cursor move\" (DA labels are shown in Table 1 )." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-32", "text": "59.95% of utterances in MSDialog-Intent are indeed multi-labeled." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-33", "text": "Another important characteristic of these dialogues is that a speaker can refer to any of the preceding utterances." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-34", "text": "Thus, it is unfeasible to determine a fixed window size that would incorporate all relevant context." 
}, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-35", "text": "In Figure 1 , utterance 6 provides a potential answer (PA) to utterance 1; utterance 7 provides further details (FD) to the information request (IR) in utterance 2." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-36", "text": "Without the capability of capturing a wider range of utterances, the chance of recognizing DAs in these forum conversations would be reduced." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-37", "text": "However, current approaches for DA recognition only attempt to capture either of the two characteristics but not both." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-38", "text": "1) Sequence models, that can capture far-away context information, are often applied to DA sequence labeling." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-39", "text": "Kumar et al. [10] build a hierarchical RNN using Bi-LSTM as a base unit and CRF as the last layer; Chen et al. [2] use the CRF-Attentive Structured Network for the same hierarchical nature of DAs." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-40", "text": "Nonetheless, these models only concern single-label DA datasets (e.g., SwDA) and heavily rely on CRF, which is incapable of predicting multiple labels for each utterance." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-41", "text": "2) Another line of research casts DA recognition as a multi-label classification problem to accommodate the CDA scenario." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-42", "text": "Qu et al. [14] apply a CNN-based text classifier proposed by Kim [8] using a fixed window to represent the context." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-43", "text": "Although capable of classifying utterances with CDAs, Qu et al. [14] 's model only concerns a strictly-local context range and thus cannot include distant information." 
}, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-44", "text": "In this paper, we present a novel neural model that is adapted from Convolutional Recurrent Neural Network (CRNN) to both incorporate the interaction between distant utterances and generalize the DA recognition task to accommodate CDA." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-45", "text": "Our contributions can be summarized as follows: 1) In our adapted Convolutional Recurrent Neural Network (CRNN) we use the recurrent layers to gather long-range contextual information that are extracted from utterances by the convolutional layers." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-46", "text": "2) We further optimize the model by incorporating highway connections and dynamic k-max pooling." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-47", "text": "3) Our model significantly outperforms the state-of-the-art model by 4.68% in accuracy and 3.08% in F-score." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-76", "text": "The base of our architecture is a CNN module similar to Kim [8] ." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-51", "text": "In Natural Language Processing (NLP) research including sentence segmentation [16] , sentiment analysis [17] and discourse modeling [6] , CRNN achieves the state-of-the-art in single-label classification where for each input, a single output would be predicted." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-77", "text": "The module works by 'sliding' through the embedding matrix of an utterance with various filter sizes to capture semantic features in differently ordered n-grams." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-78", "text": "A convolution operation is denoted as" }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-79", "text": "where k i is the feature generated by the i-th filter with weights w and bias b k ." 
}, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-80", "text": "This filter of size d is applied to an embedding matrix, which is the concatenation from the i-th to the (i + d \u2212 1)th embedding vectors." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-81", "text": "This operation is applied to every possible window of words in an utterance of length n and generates a feature map k." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-82", "text": "----------------------------------" }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-83", "text": "**DYNAMIC K-MAX POOLING**" }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-84", "text": "A max-over-time pooling operation [8] is usually applied over the feature map and takes the maximum value as the feature corresponding to this particular filter." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-85", "text": "The idea is to capture the most important features of an utterance." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-86", "text": "However, this mechanism could be problematic when the utterance is long." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-87", "text": "We experimented with Dynamic k-Max Pooling [12] to pool the most powerful features from p sub-sequences of an utterance with m words." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-88", "text": "This pooling scheme naturally deals with variable utterance length." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-89", "text": "----------------------------------" }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-90", "text": "**RECURRENT LAYER**" }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-91", "text": "Based on the local textual features extracted from each utterance, a bidirectional RNN is applied to gather features from a wider context for recognizing the DAs in the target utterance, u t ." 
}, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-92", "text": "We experimented with two variations of RNN: LSTM [18] and GRU [3] , both of which utilize gating information to prevent the vanishing gradient problem." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-93", "text": "GRU is constructed similarly to LSTM but without using a memory cell." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-94", "text": "It exposes the full hidden state without any control, which may be computationally more efficient." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-95", "text": "We experimented with both since it is difficult to predict which one performs better on our task." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-96", "text": "----------------------------------" }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-97", "text": "**HIGHWAY CONNECTION**" }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-98", "text": "Although LSTM and GRU help capture wider context, the network training becomes more difficult with the additional recurrent layers." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-99", "text": "Inspired by the highway networks [15] , we propose to add a highway connection between the convolutional layer and the last fully connected layer." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-100", "text": "With this mechanism, the information about the target utterance, u t , can flow across the recurrent layer without attenuation." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-101", "text": "The last fully connected layer learns from the outputs of both recurrent and convolutional layers." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-102", "text": "----------------------------------" }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-103", "text": "**DATASET**" }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-104", "text": "We use the MSDialog-Intent dataset [14] to conduct experiments." 
}, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-105", "text": "In the dataset, each of the 10,020 utterances is annotated with a subset of 12 DAs." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-106", "text": "The abundance of information in a single utterance (avg. 72 tokens/utterance) breeds CDA (avg. 1.83 DAs/utterance)." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-107", "text": "We observe a strong correlation between the number of DAs and utterance length, which necessitates a CDA model for forum conversations." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-108", "text": "The dataset includes plenty of metadata for each utterance, e.g., answer vote and user affiliation." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-109", "text": "For generalizability, our model only incorporates textual content of the dialogues." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-110", "text": "Besides, unlike Qu et al. [14] , we keep all the DA annotations in the dataset to preserve the meaningful DA structures within and across utterances." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-111", "text": "1" }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-112", "text": "----------------------------------" }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-113", "text": "**EXPERIMENTS**" }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-114", "text": "In this section, three versions of our proposed model with incremental improvements are evaluated against a CNN baseline [8] and the state-of-the-art approach for CDA recognition [14] ." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-115", "text": "\u2022 CNN-Kim [8] : One of the first attempts to apply CNN to text classification." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-116", "text": "The CNN model consists of three convolutional layers with the same filter size." 
}, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-117", "text": "\u2022 CNN-CR [14] : The state-of-the-art approach for CDA recognition on the MSDialog-Intent dataset [14] ." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-118", "text": "The CNN model incorporates context information with a window size of 3." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-119", "text": "\u2022 CRNN (v 1 ): Our base model that adapts CRNN for CDA recognition using BCE loss and sigmoid activation function." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-120", "text": "\u2022 CRNN (v 2 ): CRNN (v 1 ) with highway connections added between the convolutional layer and the fully connected layer." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-121", "text": "\u2022 CRNN (v 3 ): CRNN (v 1 ) with highway connections and dynamic k-max pooling implemented." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-122", "text": "----------------------------------" }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-123", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-124", "text": "The dataset is partitioned into training, validation, and test sets in the ratio of 8:1:1." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-125", "text": "Hyper-parameters are tuned with the validation set." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-126", "text": "For word embedding, we use the publicly available GloVe vectors that were trained on Wikipedia + Gigaword 5 [13] ." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-127", "text": "We find that setting (CNN filters, CNN dropout, RNN units, RNN layers, RNN dropout and k) to (100, 0.4, 900, 2, 0.15 and 2) is the best for our model." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-128", "text": "The BCE is used as loss function and Adam [9] is used for optimization with an initial learning rate, 0.001." 
}, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-129", "text": "The models were trained on four GeForce RTX 2080 Ti GPUs." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-130", "text": "----------------------------------" }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-131", "text": "**METRICS**" }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-132", "text": "Following previous work [14] on multi-label classification, we adopt label-based accuracy (i.e., Hamming score) and micro-F 1 score as our main evaluation metrics." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-133", "text": "Micro-precision and micro-recall are also reported to assist the analysis." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-134", "text": "Among all, accuracy is the only metrics that is on a per utterance basis." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-135", "text": "Therefore, Student's paired t-test is performed only on accuracy." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-136", "text": "Other metrics (P, R, F 1 ) provide an overall performance evaluation for all utterances." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-137", "text": "----------------------------------" }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-138", "text": "**RESULTS**" }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-139", "text": "All proposed adaptations, i.e., highway connections and dynamic k-max pooling, contribute to the model, as in Table 2 ." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-140", "text": "The CRNN models significantly 2 outperform both baselines." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-141", "text": "The model with full adaptation (v 3 ), performs the best across all experiments: achieving the highest accuracy, recall and F 1 with LSTM and the highest precision with GRU." 
}, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-142", "text": "The divergence between LSTM and GRU may indicate that the forget and output gates in LSTM are helpful for the retrieval of related memory but increase the variance of the model." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-143", "text": "Table 3 further exemplifies the capability of our best model in recognizing CDAs." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-144", "text": "Our model systematically outperforms the stateof-the-art approach, CNN-CR." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-145", "text": "More specifically, 1) CRNN achieves better mean accuracy for all reference sizes and significantly for 2) The average number of DAs predicted by CRNN is the closest to the reference." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-146", "text": "Another advantage of CRNN is the ability to handle longer dialogues." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-147", "text": "Figure 3 visualizes the mean accuracy of utterances grouped by dialogue length." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-148", "text": "Along with the increase of dialogue length, we observe an increasing advantage of CRNN (in blue) over CNN-CR (in orange), especially for dialogues longer than 6 utterances." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-149", "text": "Table 4 shows an example where CRNN and CNN-CR predict differently for utterance 5." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-150", "text": "Utterance 5 should respond to utterance 2, establishing a long-range dependency that cannot be captured by CNN-CR's fixed context window." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-151", "text": "Our CRNN model succeeds in recognizing the distant context for further details (FD)." 
}, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-152", "text": "Furthermore, CNN-CR wrongly recognizes utterance 5 as original question (OQ) since the OQ in utterance 1 is out of range." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-153", "text": "----------------------------------" }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-154", "text": "**A CASE STUDY**" }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-155", "text": "----------------------------------" }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-156", "text": "**CONCLUSION**" }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-157", "text": "Our proposed CRNN models for CDA recognition impose fewer restrictions on the structure of DAs and capture textual features from a wider context." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-158", "text": "The experiment results show that all of the proposed adaptations, i.e., highway connections and dynamic kmax pooling, contribute to the model optimization." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-159", "text": "Our final model significantly outperforms the state-of-the-art approach on a tech forum dataset where the dialogues are packed with complex DA structures and information-rich utterances." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-160", "text": "Future work will consider a more fine-grained set of DAs for a deeper analysis of CDA modeling." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-161", "text": "It would also be interesting to annotate antecedent relations among utterances to structure threading dialogue flows." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-52", "text": "Deep learning research outside of NLP proves CRNN suitable for classifications where multiple labels may be assigned to each target." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-53", "text": "Cakir et al. 
[1] apply CRNN to Polyphonic Sound Event Detection, where multiple sound events can be detected in a timestamp." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-54", "text": "Similar to single-label CRNNs, features are extracted by the convolutional layers and further integrated into the recurrent layers to provide context information." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-55", "text": "The sigmoid activation function is used to make multi-binary decisions for each event activity." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-56", "text": "Choi et al. [4] treat music tagging as a multi-label classification task where a subset of genres, moods, etc." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-57", "text": "are assigned to each piece of music." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-58", "text": "In Choi et al. [4] 's model, the CNN is responsible for extracting local features and the RNN provides a temporal summarization." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-59", "text": "These studies substantiate CRNN's applicability to multi-label classification tasks, especially its ability to extract features from the current step and to integrate features from arbitrarily long distances." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-60", "text": "In previous dialogue research, Kalchbrenner and Blunsom [6] and Kim et al. [7] incorporate CRNN into single-label DA recognition and dialogue topic tracking systems." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-61", "text": "Based on this line of research, we argue that CRNN can be applied to CDA recognition with two adaptations: binary cross-entropy (BCE) loss and sigmoid activation." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-62", "text": "Further experiments also show that adding highway connections and dynamic k-max pooling further boosts the performance." 
}, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-63", "text": "----------------------------------" }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-64", "text": "**TASK DEFINITION**" }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-65", "text": "The task is defined as a CDA recognition problem where for each utterance u t (the t-th utterance) in a dialogue, we predict a subset of DA labels y t that describes the functionality of the utterance from a candidate set of DA labels L = {l 1 , l 2 , ..., l c }." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-66", "text": "For a dialog with s utterances, the inputs to the algorithm is U = {u 1 , u 2 , ..., u s }, and the output is Y = {y 1 , y 2 , ..., y s }, where y t is the annotated DA label set for u t , in which y t = {y 1 t , y 2 t , ..., y c t }." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-67", "text": "Here, y j t = {1, 0} denotes whether the t-th utterance of the dialog is labeled with DA label l j or not." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-68", "text": "When c j=1 y j t > 1, we say CDAs are recognized." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-69", "text": "Given a dialogue U, the goal is to predict the DA sequence Y from the text." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-70", "text": "----------------------------------" }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-71", "text": "**THE PROPOSED APPROACH**" }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-72", "text": "The challenge of this task lies in the complexity of dialogue structures in human conversations where an utterance can express multiple DAs." }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-73", "text": "In this work, we improve CDA recognition with an adapted CRNN which models the interactions between long-range context." 
}, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-74", "text": "----------------------------------" }, { "sent_id": "af0c9e20d34a080bac3304ded1f8d6-C001-75", "text": "**CONVOLUTIONAL LAYER**" } ], "y": { "@UNSURE@": { "gold_contexts": [ [ "af0c9e20d34a080bac3304ded1f8d6-C001-30" ], [ "af0c9e20d34a080bac3304ded1f8d6-C001-114" ] ], "cite_sentences": [ "af0c9e20d34a080bac3304ded1f8d6-C001-30", "af0c9e20d34a080bac3304ded1f8d6-C001-114" ] }, "@BACK@": { "gold_contexts": [ [ "af0c9e20d34a080bac3304ded1f8d6-C001-42" ] ], "cite_sentences": [ "af0c9e20d34a080bac3304ded1f8d6-C001-42" ] }, "@MOT@": { "gold_contexts": [ [ "af0c9e20d34a080bac3304ded1f8d6-C001-43", "af0c9e20d34a080bac3304ded1f8d6-C001-44" ] ], "cite_sentences": [ "af0c9e20d34a080bac3304ded1f8d6-C001-43" ] }, "@USE@": { "gold_contexts": [ [ "af0c9e20d34a080bac3304ded1f8d6-C001-104" ], [ "af0c9e20d34a080bac3304ded1f8d6-C001-114", "af0c9e20d34a080bac3304ded1f8d6-C001-115", "af0c9e20d34a080bac3304ded1f8d6-C001-116", "af0c9e20d34a080bac3304ded1f8d6-C001-117" ], [ "af0c9e20d34a080bac3304ded1f8d6-C001-132" ] ], "cite_sentences": [ "af0c9e20d34a080bac3304ded1f8d6-C001-104", "af0c9e20d34a080bac3304ded1f8d6-C001-114", "af0c9e20d34a080bac3304ded1f8d6-C001-117", "af0c9e20d34a080bac3304ded1f8d6-C001-132" ] }, "@DIF@": { "gold_contexts": [ [ "af0c9e20d34a080bac3304ded1f8d6-C001-110" ] ], "cite_sentences": [ "af0c9e20d34a080bac3304ded1f8d6-C001-110" ] } } }, "ABC_d9c0e641f8ceb61e5d6e416bfc6492_18": { "x": [ { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-207", "text": "----------------------------------" }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-208", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-31", "text": "The motivations behind those two types of studies are different." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-55", "text": "These comparisons allow them to provide insights into the strengths and weaknesses of each model." 
}, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-80", "text": "----------------------------------" }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-81", "text": "**TRANSFORMATION ALGORITHM**" }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-82", "text": "The transformation algorithm is illustrated by Fig (1) I could easily have done this" }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-83", "text": "The transformation first looks for verb groups in a dependency graph." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-2", "text": "Treebanks have recently been released for a number of languages with the harmonized annotation created by the Universal Dependencies project." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-3", "text": "The representation of certain constructions in UD are known to be suboptimal for parsing and may be worth transforming for the purpose of parsing." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-4", "text": "In this paper, we focus on the representation of verb groups." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-5", "text": "Several studies have shown that parsing works better when auxiliaries are the head of auxiliary dependency relations which is not the case in UD." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-6", "text": "We therefore transformed verb groups in UD treebanks, parsed the test set and transformed it back, and contrary to expectations, observed significant decreases in accuracy." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-84", "text": "Those verb groups are collected in the set V ." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-7", "text": "We provide suggestive evidence that improvements in previous studies were obtained because the transformation helps disambiguating POS tags of main verbs and auxiliaries." 
}, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-8", "text": "The question of why parsing accuracy decreases with this approach in the case of UD is left open." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-9", "text": "----------------------------------" }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-11", "text": "Universal Dependencies 1 (henceforth UD) (Nivre, 2015) is a recent project that is attempting to harmonize syntactic annotation in dependency treebanks across languages." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-12", "text": "This is done through the development of annotation guidelines." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-13", "text": "Some guidelines have been hypothesized to be suboptimal for parsing." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-14", "text": "In the literature, certain representations of certain constructions have been shown to be better than their alternatives for parsing, for example in Schwartz et al. (2012) ." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-15", "text": "The UD guidelines however have been written with the intent to maximize crosslinguistic parallelism and this constraint has forced the guidelines developers to sometimes choose representations that are known to be worse for parsing (de Marneffe et al., 2014) ." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-16", "text": "For that reason, de Marneffe et al. (2014) suggest that those representations could be modified for the purpose of parsing, thus creating a parsing representation." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-17", "text": "Transforming tree representations for the purpose of parsing is not a new idea." 
}, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-18", "text": "It has been done for constituency parsing for example by Collins (1999) but also for dependency parsing for example by Nilsson et al. (2007) ." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-19", "text": "Nilsson et al. (2007) modified the representation of several constructions in several languages and obtained a consistent improvement in parsing accuracy." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-20", "text": "In this paper, we will investigate the case of the verb group construction and attempt to reproduce the study by Nilsson et al. (2007) on UD treebanks to find out whether or not the alternative representation is useful for parsing with UD." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-21", "text": "have shown that modifying coordination constructions and verb groups from their representation in the Prague Dependency Treebank (henceforth PDT) to a representation described in Mel\u010duk (1988) (Mel'\u010duk style, henceforth MS) improves dependency parsing for Czech." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-22", "text": "The procedure they follow is as follows:" }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-23", "text": "----------------------------------" }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-24", "text": "**BACKGROUND 2.1 TREE TRANSFORMATIONS FOR PARSING**" }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-25", "text": "2. Train a model on that transformed data." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-26", "text": "3. Parse the test data." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-27", "text": "4. Transform the parsed data back to the original representation (for comparison with the original gold standard)." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-28", "text": "Nilsson et al. 
(2007) have shown that these same modifications, as well as the modification of non-projective structures, help parsing in four languages." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-29", "text": "Schwartz et al. (2012) conducted a study over the alternative representations of 6 constructions across 5 parsing models for English and found that some of them are easier to parse than others." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-30", "text": "Their results were consistent across parsing models." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-32", "text": "The former originally have a representation that is more semantically oriented and potentially useful for NLP applications, the PDT style, which they therefore wish their output to have, and change it to a representation that is more syntactically oriented, the MS style, because it is easier to parse." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-33", "text": "By contrast, Schwartz et al. (2012) have no a priori preference for any of the different alternatives of the constructions they study and instead study the effect of the different representations on parsing for the purpose of choosing one representation over the other." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-34", "text": "Their methodology is therefore different: they evaluate the different representations on their respective gold standards." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-35", "text": "They argue that accuracy within a representation is a good indicator of the learnability of that representation and that learnability is a good criterion for selecting a syntactic representation among alternatives." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-36", "text": "In any case, these studies seem to show that such transformations can affect parsing for various languages and for various parsing models." 
}, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-37", "text": "Silveira and Manning (2015) were the first to obtain negative results from such transformations." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-38", "text": "They attempted to modify certain constructions in a UD treebank to improve parsing for English but failed to show any improvement." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-39", "text": "Some transformations even decreased parsing accuracy." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-40", "text": "They observe that when they transform their parsed data back to the original representation, they can amplify parser errors." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-41", "text": "As a matter of fact, a transformation can be prompted by the presence of only one dependency relation but involve transformations of many surrounding dependency relations." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-42", "text": "The verb group transformation is such an example and will be described in section 3. If, then, a wrong dependency relation prompts a transformation in the parsed data, its surrounding items which might have been correct become wrong." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-43", "text": "A wrong parse can then become worse." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-44", "text": "They take this as partial explanation for the results that are inconsistent with the literature." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-45", "text": "However, the same problem can have arisen in and may have downplayed the effects that those studies have observed." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-46", "text": "It therefore seems that this explanation is not enough to account for those results." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-47", "text": "This raises the question of whether this phenomenon actually happened in the study by Nilsson et al. 
(2007) ." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-48", "text": "It would be interesting to know if the effects they observed were affected by this kind of error amplification." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-49", "text": "It seems that there is still a lot to do to study the impact of different representations on parsing with UD as well as on dependency parsing more generally." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-50", "text": "We propose to take one step in that direction in this paper." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-51", "text": "----------------------------------" }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-52", "text": "**ERROR ANALYSIS FOR DEPENDENCY PARSING**" }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-53", "text": "McDonald and conducted an extensive error analysis on two parsers in order to compare them." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-54", "text": "They compare the effect of sentence length on the two models, the effect of the structure of the graph (i.e. how close to the root individual arcs are) on the two models as well as the accuracy of the models on different POS tags and on different dependency relations." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-56", "text": "Conducting such an error analysis that compares baseline models with their transformed version could provide some further insights into the effects obtained with tree transformations." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-57", "text": "Attempting such a detailed error analysis is beyond the scope of this project but some steps will be taken in that direction and are described in Section 4." 
}, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-58", "text": "----------------------------------" }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-59", "text": "**VERB GROUPS**" }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-60", "text": "In the PDT, main verbs are the head of auxiliary dependencies, as in Figure 1 ." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-61", "text": "Nilsson et al. (2007) show that making the auxiliary the head of the dependency as in Figure 2 is useful for parsing Czech and Slovenian." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-62", "text": "Schwartz et al. (2012) verb groups are easier to parse when the auxiliary is the head (as in PDT) than when the verb is the head (as in MS)." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-63", "text": "Since UD adopts the PDT style representation of verb groups, it would be interesting to find out whether or not transforming them to MS could also improve parsing." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-64", "text": "This is what will be attempted in this study." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-65", "text": "describe algorithms for such a transformation as well as its back transformation." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-66", "text": "However, their back transformation algorithm assumes that the auxiliary appears to the left of the verb which is not always the case." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-67", "text": "In addition, it is unclear what they do with the cases in which there are two auxiliaries in a verb group." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-68", "text": "For these reasons, we will use a slightly modified version of this algorithm that we describe in Section 3." 
}, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-69", "text": "----------------------------------" }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-70", "text": "**METHODOLOGY**" }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-71", "text": "----------------------------------" }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-72", "text": "**GENERAL APPROACH**" }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-73", "text": "We will follow the methodology from Nilsson et al. (2007) , that is, to transform, parse and then detransform the data so as to compare the original and the transformed model on the original gold standard." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-74", "text": "The method from Schwartz et al. (2012) which consists in comparing the baseline and the transformed data on their respective gold standard is less relevant here because UD is believed to be a useful representation and that the aim will be to improve parsing within that representation." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-75", "text": "However, as was argued in that study, their method can give an indication of the learnability of a construction and can potentially be used to understand the results obtained by the parse-transform-detransform method." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-76", "text": "For this reason, this method will also be attempted." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-77", "text": "In addition, the original parsed data will also be transformed into the MS gold standard for comparison with the MS parsed data on the MS gold standard." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-78", "text": "Comparing the two can potentially help find out if the error amplifications described in the background section are strongly influencing the results." 
}, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-79", "text": "As a matter of fact, if the transformed model is penalized by error amplifications on the original gold standard, it is expected that the original model will be penalized in the same way on the transformed gold standard." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-181", "text": "This idea is further explored in Section 4.4." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-85", "text": "A verb group V i has a main verb V imv (done in the example) and a set of auxiliaries V iaux with at least one element (could and have in the example)." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-86", "text": "Verb groups are collected by traversing the sentence from left to right, looking at auxiliary dependency relations." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-87", "text": "An auxiliary dependency relation w aux aux \u2190 \u2212 \u2212w mv is a relation where the main verb is the head and the auxiliary is the dependent." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-88", "text": "Only auxiliary dependency relations between two verbal forms are considered." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-89", "text": "When such a dependency relation is found, if there is a V i in V that has the head of the dependency relation (w mv ) as main verb V imv , w aux is added to that V i 's set of auxiliaries V iaux ." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-90", "text": "Otherwise, a new V i is created and added to V ." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-91", "text": "After that, for each V i in V , if there is only one auxiliary in V iaux , the direction of the dependency relation between that auxiliary and the main verb V imv is inverted and the head of V imv becomes the head of the auxiliary." 
}, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-92", "text": "When there are several auxiliaries (like in example (1)), the algorithm attaches the closest one to V imv and the head of V imv becomes the head of the outermost one." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-93", "text": "Any auxiliary inbetween is attached in a chain from the outermost to the one that is closest to the verb." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-94", "text": "In the example, the main verb done gets attached to the closest auxiliary have and the head of the main verb done which was the root becomes the head of the outermost auxiliary, could." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-95", "text": "Next, dependents of the main verb are dealt with to make sure projectivity is maintained." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-96", "text": "As a matter of fact, as can be seen from Figure 4 , the previous changes can introduce non-projectivity in an otherwise projective tree, which is undesirable." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-97", "text": "Dependents to the left of the leftmost verb of the whole verb group (i.e. including the auxiliaries and the main verb) get attached to the leftmost verb." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-98", "text": "In the example, I gets attached to could." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-99", "text": "Dependents to the right of the rightmost verb of the verb group get attached to the leftmost verb." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-100", "text": "In the example, this remains attached to the main verb done." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-101", "text": "Any remaining dependent gets attached to the auxiliary that is closest to the verb." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-102", "text": "In the example, easily gets attached to have." 
}, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-103", "text": "----------------------------------" }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-104", "text": "**BACK TRANSFORMATION ALGORITHM**" }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-105", "text": "The back transformation algorithm works similarly to the transformation algorithm." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-106", "text": "A set of verb groups V is first collected by traversing the sentence from left to right, looking at auxiliary dependency relations." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-107", "text": "An auxiliary dependency relation w d aux \u2190 \u2212 \u2212w h between a dependent w d and a head w h in MS can be between an auxiliary and the main verb or between two auxiliaries." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-108", "text": "When one such relation is found, if its head w h is not already in a V iaux in V , a new verb group V i is created and w h is added to V iaux ." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-109", "text": "What the algorithm does next depends on the direction of that dependency relation." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-110", "text": "If it is right-headed, the dependent w d of that dependency relation is the main verb and the algorithm recurses the chain of auxiliary dependency relations through heads: it looks at the head w h of dependency relations and adds them to V iaux until it finds a head that is not itself the dependent of an auxiliary dependency relation." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-111", "text": "If it is left-headed, the algorithm recurses the chain of auxiliary dependency relations through the dependents." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-112", "text": "It looks at dependents of dependency relations until it finds the main verb V imv , i.e. 
a w_d^i that is not the head of an auxiliary dependency relation, each time adding the head of the relation w_h^i to V_i^aux." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-113", "text": "After that, for each V_i in V, the head of the auxiliary that is furthest from the main verb becomes the head of the main verb." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-114", "text": "The main verb becomes the head of all auxiliaries and their dependents." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-115", "text": "In the previous example, Figure 5 can be transformed back to Figure 3 in this way: done is identified as the main verb of the verb group and could as its furthest auxiliary." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-116", "text": "The head of could therefore becomes the head of done, and the two auxiliaries of the sentence as well as their dependents get attached to the main verb." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-117", "text": "----------------------------------" }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-118", "text": "**DATA**" }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-119", "text": "We ran all experiments on UD 1.2 (Nivre et al., 2015) ." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-120", "text": "Treebanks that had 0.1% or less of auxiliary dependency relations were discarded." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-121", "text": "Japanese was also discarded because the Japanese treebank is not open source." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-122", "text": "Dutch was discarded because the back transformation accuracy was low (90%)." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-123", "text": "This is due to inconsistencies in the annotation: verb groups are annotated as a chain of dependency relations." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-124", "text": "This leaves us with a total of 25 out of the 37 treebanks." 
}, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-125", "text": "For comparability with the study in Nilsson et al. (2007) , and because we used a slightly modified version of their algorithm, we also tested the approach on the versions of the Czech and Slovenian treebanks that they worked on, respectively version 1.0 of the PDT (Haji\u010d et al., 2001) and the 2006 version of SDT (D\u017eeroski et al., 2006) ." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-126", "text": "Table 1 gives an overview of the data used for the experiments." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-127", "text": "----------------------------------" }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-128", "text": "**SOFTWARE**" }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-129", "text": "For comparability with previous studies, we used MaltParser with default settings, training on the training set and parsing on the development set for all the languages that we investigated." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-130", "text": "For enhanced comparability of the results, we used the UD POS tags instead of the language-specific POS tags." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-131", "text": "MaltEval (Nilsson and Nivre, 2008) was used for evaluation." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-132", "text": "The transformation code has been released as part of the Python package oDETTE version 1.0 (DEpendency Treebank Transformation and Evaluation)." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-133", "text": "The package can be used to run the whole pipeline, from transformation to evaluation." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-134", "text": "It can work on several treebanks in parallel which enables quick experiments." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-135", "text": "(We trained and parsed the data for the 25 treebanks in 9 minutes on an 8-core machine)."
}, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-136", "text": "----------------------------------" }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-137", "text": "**RESULTS**" }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-138", "text": "----------------------------------" }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-139", "text": "**EFFECT OF VG TRANSFORMATION ON PARSING**" }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-140", "text": "As mentioned before, we converted training data in all treebanks involved, trained a parser with that transformed training set, parsed the test data and transformed the parsed data back to the original representation." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-141", "text": "Parsing accuracy of that transformed parsed data can then be compared with the parsed data obtained from the baseline, the unmodified model." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-142", "text": "Results are given in Table 2 ." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-143", "text": "All results report Labeled Attachment Scores (henceforth LAS) using MaltParser." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-144", "text": "As Unlabeled Attachment Scores (UAS) showed similar tendencies to LAS, they are not included for clarity." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-145", "text": "All experiments report significance levels for the McNemar test as obtained by MaltEval." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-146", "text": "A 100% accuracy was obtained for the back transformation of all data sets except for UD Spanish, Portuguese, Romanian and Hindi (99.9, 99.7, 99.4 and 99% respectively)."
}, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-147", "text": "As can be seen from the table and contrary to expectations, the results by and large decrease significantly with the transformed version of the treebank; there are a few exceptions, but no result increases significantly." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-148", "text": "As mentioned in Section 2.1, results on the original representation are the ones that we care about because it is the UD representation that we are interested in and because those results are directly comparable with each other." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-149", "text": "However, as was also said, results on the transformed gold standard can give an indication of the learnability of a construction." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-150", "text": "For this reason, they are reported in Table 3." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-151", "text": "Table 3 also reports results of the parsing model trained on UD representations where the parsed data have been transformed to the MS representation." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-152", "text": "As was said in Section 3.1, this is to find out if error amplifications have a strong influence on the results: if error amplifications were the main source of added errors from the baseline on UD to the back transformed UD, it would be expected that the original parsed test set transformed into MS would perform worse on the MS gold standard than the test set parsed by the model trained on MS." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-153", "text": "As can be seen from Table 3 however, this is not the case: the original model generally beats the transformed model even on the transformed gold standard."
}, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-154", "text": "As can also be seen from the table, the scores are overall higher for the UD parsing model on the UD gold standard than the transformed parsing model on the transformed gold standard." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-155", "text": "This potentially indicates that the verb group transformation makes the UD representation harder to learn, which might partially explain why it decreases parsing accuracy on the original gold standard." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-156", "text": "This is not entirely surprising: as can be seen from the figures illustrating the transformation above, the original representation is flatter than the transformed representation." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-157", "text": "Further work is needed to explore that more in-depth." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-206", "text": "This could be caused by the domain of the treebank." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-158", "text": "In any case, the original model beats the transformed model on several metrics and it seems safe to conclude that the verb group transformation hurts UD parsing at least with MaltParser." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-159", "text": "----------------------------------" }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-160", "text": "**COMPARING DEPENDENCY RELATIONS**" }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-161", "text": "Turning to the error analysis, one thing that is striking when looking at the performance of different dependency relations is that punctuation performs consistently worse in the transformed version of the parsed data compared to the baseline as can be seen in Figure 6 ."
}, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-162", "text": "Because punctuation is most often attached to the main verb, it can be hypothesized that identifying the main verb of the sentence is crucial for avoiding this kind of error and that the transformation hurts the identification of the main verb in the case of UD." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-163", "text": "A close examination of about a third of errors containing an auxiliary dependency relation in English further reinforced that hypothesis." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-164", "text": "----------------------------------" }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-165", "text": "**COMPARISON WITH SDT AND PDT**" }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-166", "text": "What is noticeable in the results we have seen so far is that the accuracy decreased for languages for which accuracy has been shown to increase in the past: Czech, Slovenian and English." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-167", "text": "This indicates that the UD style is making a difference." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-168", "text": "For that reason, we are now attempting a comparison between the effect of the approach on SDT and on UD Slovenian as well as between its effect on PDT and UD Czech." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-169", "text": "As shown in Table 4 , similar improvements to the original study were obtained on SDT and PDT." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-170", "text": "As was just mentioned, it can be hypothesized that identifying the main verb is crucial for avoiding the kind of errors that were observed in the UD transformed version." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-171", "text": "It can then be hypothesized that the transformation helps to identify the main verb in PDT and SDT whereas it makes it harder in UD."
}, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-172", "text": "When observing some examples in SDT, the transformation seems to help disambiguate POS tags." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-173", "text": "As a matter of fact, more than 90% of auxiliaries in SDT have the tag Verb-copula but also more than 20% of the main verbs involved in auxiliary dependency relations have that same POS tag." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-174", "text": "POS tags therefore do not give enough information to distinguish between the main verb and an auxiliary." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-175", "text": "The experiment we are now turning to suggested that this is a reasonable hypothesis." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-176", "text": "We tested the approach on three different versions of PDT and SDT. As can be seen from Table 5 , the results on SDT support the hypothesis: when verbs are made fully ambiguous, the transformation improves the results more than when they are partially ambiguous." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-177", "text": "When they are disambiguated, the approach does not work; the accuracy even decreases." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-178", "text": "The picture is slightly less clear with PDT where disambiguating the POS tags makes the approach ineffective but making them ambiguous does not make the approach more useful." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-179", "text": "Ambiguating the tags seems to affect PDT less than it affects SDT, however, which might indicate that PDT suffers from ambiguity even more than SDT in the original treebank." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-180", "text": "This might be due to the fact that the POS tags used in the PDT experiments are automatically predicted whereas the tags used for SDT are gold tags."
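The tag manipulations compared above can be sketched as follows. This is our own minimal rendering, not the experiments' code; the function name, the mode names and the collapsed tag `V` are all hypothetical, and the verbal tag set is illustrative.

```python
def retag(tokens, mode):
    """Produce one of the three POS-tag conditions for verbal tokens.

    Hypothetical sketch: 'ambiguous' collapses all verbal tags into a
    single tag, 'disambiguated' separates auxiliaries from main verbs
    using the dependency relation, and 'orig' keeps the treebank's tags.
    """
    verbal = {'VERB', 'AUX', 'Verb-copula'}  # illustrative tag set
    out = []
    for tok in tokens:
        tag = tok['upos']
        if tag in verbal:
            if mode == 'ambiguous':
                tag = 'V'  # one fully ambiguous tag for every verb
            elif mode == 'disambiguated':
                tag = 'AUX' if tok['deprel'] == 'aux' else 'VERB'
        out.append(dict(tok, upos=tag))
    return out

sent = [
    {'upos': 'AUX',  'deprel': 'aux'},
    {'upos': 'VERB', 'deprel': 'root'},
    {'upos': 'NOUN', 'deprel': 'obj'},
]
ambiguous = retag(sent, 'ambiguous')
disambiguated = retag(sent, 'disambiguated')
```

Under the 'ambiguous' condition the parser can no longer tell auxiliaries from main verbs by tag alone, which is the situation in which the verb group transformation helped SDT the most.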
}, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-182", "text": "We tested the same approach on the UD treebanks for Czech and Slovenian to see if they can also be affected by ambiguity in some way." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-183", "text": "In the case of UD, \u03c4_d is the same as \u03c4_o since the tags are already disambiguated." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-184", "text": "As can be seen from the top part of Table 6 , the opposite effect is found: the transformation hurts accuracy more when the tags are ambiguous than when they are not." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-185", "text": "However, because of the similarity between copulas and auxiliaries in UD, representing them differently might make it confusing for the parser." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-186", "text": "Table 6 : LAS on the original and transformed treebanks with different levels of POS tag ambiguity (\u2206 = Transf - Orig)." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-187", "text": "It would be interesting to try the approach and change the representation of copulas as well as auxiliaries." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-188", "text": "We tested something simpler: we ran the same experiment on the treebanks without copulas, i.e. we removed all sentences that have a copula dependency relation both in the training and the test sets." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-189", "text": "As can be seen from the bottom of Table 6 , doing so gives the expected results: the transformation affects accuracy less when the tags are ambiguous than when they are not." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-190", "text": "The transformation still does not help parsing accuracy however."
}, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-191", "text": "----------------------------------" }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-192", "text": "**PREDICTED VS GOLD POS TAGS**" }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-193", "text": "An issue that has been ignored so far is that in the PDT, the parser used predicted POS tags for parsing the test sets whereas in UD (and in SDT), we have been using gold POS tags." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-194", "text": "It was said in the previous section that the experiment about ambiguity on the PDT seems to indicate that tags are of poorer quality in the original experiment." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-195", "text": "It is possible that this is due to the fact that they are predicted rather than gold tags." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-196", "text": "It would be interesting to find out if the transformation approach works on UD parsing using predicted tags." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-197", "text": "This is slightly difficult to test, as taggers do not yet exist for all UD treebanks." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-198", "text": "One does exist for Swedish, however, which is why we tested this hypothesis on UD Swedish." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-199", "text": "As can be seen from Table 7 , using predicted POS tags does have an impact on the effect of the transformation as the transformation hurts parsing accuracy less than it does on data with gold POS tags." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-200", "text": "The transformation still does not help parsing accuracy however." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-201", "text": "Overall then, the results suggest that there is something about the UD representation that makes this transformation infelicitous."
}, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-202", "text": "Table 7 : LAS on the original and transformed UD Swedish treebank with predicted and gold POS tags (\u2206 = Transf - Orig): gold: Orig 76.8, Transf 75.7**, \u2206 -1.1; predicted: Orig 76.4, Transf 75.6**, \u2206 -0.8." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-203", "text": "It seems then that in the case of UD, it is better to keep the main verbs as heads of auxiliary dependency relations." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-204", "text": "There are other factors that may play a role in the results." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-205", "text": "For example, as appears from Table 1 , the original SDT has a much higher percentage of auxiliary dependencies." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-209", "text": "In this paper, we have attempted to reproduce a study by Nilsson et al. (2007) which showed that making auxiliaries heads in verb groups improves parsing, but we failed to show that those results carry over to parsing with Universal Dependencies." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-210", "text": "Contrary to expectations, the study has given evidence that main verbs should stay heads of auxiliary dependency relations for parsing with UD." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-211", "text": "The benefits of error analyses for such a study have been highlighted because they allow us to shed more light on the different ways in which the transformations affect the parsing output." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-212", "text": "Experiments suggest that gains obtained from verb group transformations in previous studies have been obtained mainly because those transformations help to disambiguate between main verbs and auxiliaries." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-213", "text": "It is however still an open question why the VG transformation hurts parsing accuracy in the case of UD."
}, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-214", "text": "It seems that the transformation makes the construction harder to learn which might be because it makes it less flat." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-215", "text": "Future work could carry out an error analysis that is more detailed than was the case in this study." }, { "sent_id": "d9c0e641f8ceb61e5d6e416bfc6492-C001-216", "text": "Repeating those experiments with other tree transformations that have been shown to be successful in the past, such as making prepositions the head of prepositional phrases, as well as looking at other parsing models would provide more insight into the relationship between tree transformations and parsing." } ], "y": { "@BACK@": { "gold_contexts": [ [ "d9c0e641f8ceb61e5d6e416bfc6492-C001-18" ], [ "d9c0e641f8ceb61e5d6e416bfc6492-C001-19" ], [ "d9c0e641f8ceb61e5d6e416bfc6492-C001-28" ], [ "d9c0e641f8ceb61e5d6e416bfc6492-C001-61" ], [ "d9c0e641f8ceb61e5d6e416bfc6492-C001-209" ] ], "cite_sentences": [ "d9c0e641f8ceb61e5d6e416bfc6492-C001-18", "d9c0e641f8ceb61e5d6e416bfc6492-C001-19", "d9c0e641f8ceb61e5d6e416bfc6492-C001-28", "d9c0e641f8ceb61e5d6e416bfc6492-C001-61", "d9c0e641f8ceb61e5d6e416bfc6492-C001-209" ] }, "@USE@": { "gold_contexts": [ [ "d9c0e641f8ceb61e5d6e416bfc6492-C001-20" ], [ "d9c0e641f8ceb61e5d6e416bfc6492-C001-73" ], [ "d9c0e641f8ceb61e5d6e416bfc6492-C001-125" ], [ "d9c0e641f8ceb61e5d6e416bfc6492-C001-209" ] ], "cite_sentences": [ "d9c0e641f8ceb61e5d6e416bfc6492-C001-20", "d9c0e641f8ceb61e5d6e416bfc6492-C001-73", "d9c0e641f8ceb61e5d6e416bfc6492-C001-125", "d9c0e641f8ceb61e5d6e416bfc6492-C001-209" ] }, "@UNSURE@": { "gold_contexts": [ [ "d9c0e641f8ceb61e5d6e416bfc6492-C001-47" ] ], "cite_sentences": [ "d9c0e641f8ceb61e5d6e416bfc6492-C001-47" ] }, "@DIF@": { "gold_contexts": [ [ "d9c0e641f8ceb61e5d6e416bfc6492-C001-125" ] ], "cite_sentences": [ "d9c0e641f8ceb61e5d6e416bfc6492-C001-125" ] } } }, "ABC_f2db88c0d4e0ec4c34fc295a5d59ba_18": { 
"x": [ { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-83", "text": "**INFERENCE**" }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-2", "text": "We show that the automatically induced latent variable grammars of Petrov et al. (2006) vary widely in their underlying representations, depending on their EM initialization point." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-3", "text": "We use this to our advantage, combining multiple automatically learned grammars into an unweighted product model, which gives significantly improved performance over state-ofthe-art individual grammars." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-4", "text": "In our model, the probability of a constituent is estimated as a product of posteriors obtained from multiple grammars that differ only in the random seed used for initialization, without any learning or tuning of combination weights." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-5", "text": "Despite its simplicity, a product of eight automatically learned grammars improves parsing accuracy from 90.2% to 91.8% on English, and from 80.3% to 84.5% on German." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-6", "text": "----------------------------------" }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-8", "text": "Learning a context-free grammar for parsing requires the estimation of a more highly articulated model than the one embodied by the observed treebank." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-9", "text": "This is because the naive treebank grammar (Charniak, 1996) is too permissive, making unrealistic context-freedom assumptions." 
}, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-10", "text": "For example, it postulates that there is only one type of noun phrase (NP), which can appear in all positions (subject, object, etc.) , regardless of case, number or gender." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-11", "text": "As a result, the grammar can generate millions of (incorrect) parse trees for a given sentence, and has a flat posterior distribution." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-12", "text": "High accuracy grammars therefore add soft constraints on the way categories can be combined, and enrich the label set with additional information." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-13", "text": "These constraints can be lexicalized (Collins, 1999; Charniak, 2000) , unlexicalized (Johnson, 1998; Klein and Manning, 2003b) or automatically learned (Matsuzaki et al., 2005; Petrov et al., 2006) ." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-14", "text": "The constraints serve the purpose of weakening the independence assumptions, and reduce the number of possible (but incorrect) parses." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-15", "text": "Here, we focus on the latent variable approach of Petrov et al. (2006) , where an Expectation Maximization (EM) algorithm is used to induce a hierarchy of increasingly more refined grammars." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-16", "text": "Each round of refinement introduces new constraints on how constituents can be combined, which in turn leads to a higher parsing accuracy." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-17", "text": "However, EM is a local method, and there are no guarantees that it will find the same grammars when initialized from different starting points." 
}, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-18", "text": "In fact, it turns out that even though the final performance of these grammars is consistently high, there are significant variations in the learned refinements." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-19", "text": "We use these variations to our advantage, and treat grammars learned from different random seeds as independent and equipotent experts." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-20", "text": "We use a product distribution for joint prediction, which gives more peaked posteriors than a sum, and enforces all constraints of the individual grammars, without the need to tune mixing weights." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-21", "text": "It should be noted here that our focus is on improving parsing performance using a single underlying grammar class, which is somewhat orthogonal to the issue of parser combination, which has been studied elsewhere in the literature (Sagae and Lavie, 2006; Fossum and Knight, 2009; Zhang et al., 2009) ." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-22", "text": "In contrast to that line of work, we also do not restrict ourselves to working with k-best output, but work directly with a packed forest representation of the posteriors, much in the spirit of Huang (2008) , except that we work with several forests rather than rescoring a single one." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-23", "text": "In our experimental section we give empirical answers to some of the remaining theoretical questions." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-24", "text": "We address the question of averaging versus multiplying classifier predictions, we investigate different ways of introducing more diversity into the underlying grammars, and also compare combining partial (constituent-level) and complete (tree-level) predictions."
}, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-25", "text": "Quite serendipitously, the simplest approaches work best in our experiments." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-26", "text": "A product of eight latent variable grammars, learned on the same data, and only differing in the seed used in the random number generator that initialized EM, improves parsing accuracy from 90.2% to 91.8% on English, and from 80.3% to 84.5% on German." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-27", "text": "These parsing results are even better than those obtained by discriminative systems which have access to additional non-local features (Charniak and Johnson, 2005; Huang, 2008) ." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-28", "text": "----------------------------------" }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-29", "text": "**LATENT VARIABLE GRAMMARS**" }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-30", "text": "Before giving the details of our model, we briefly review the basic properties of latent variable grammars." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-31", "text": "Learning latent variable grammars consists of two tasks: (1) determining the data representation (the set of context-free productions to be used in the grammar), and (2) estimating the parameters of the model (the production probabilities)." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-32", "text": "We focus on the randomness introduced by the EM algorithm and refer the reader to Matsuzaki et al. (2005) and Petrov et al. (2006) for a more general introduction." 
}, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-33", "text": "----------------------------------" }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-34", "text": "**SPLIT & MERGE LEARNING**" }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-35", "text": "Latent variable grammars split the coarse (but observed) grammar categories of a treebank into more fine-grained (but hidden) subcategories, which are better suited for modeling the syntax of natural languages (e.g. NP becomes NP 1 through NP k )." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-36", "text": "Accordingly, each grammar production A\u2192BC over observed categories A,B,C is split into a set of productions A x \u2192B y C z over hidden categories A x ,B y ,C z ." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-37", "text": "Computing the joint likelihood of the observed parse trees T and sentences w requires summing over all derivations t over split subcategories: Matsuzaki et al. (2005) derive an EM algorithm for maximizing the joint likelihood, and Petrov et al. (2006) extend this algorithm to use a split&merge procedure to adaptively determine the optimal number of subcategories for each observed category." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-38", "text": "Starting from a completely markovized X-Bar grammar, each category is split in two, generating eight new productions for each original binary production." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-39", "text": "To break symmetries, the production probabilities are perturbed by 1% of random noise." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-40", "text": "EM is then initialized with this starting point and used to climb the highly non-convex objective function given in Eq. 1." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-41", "text": "Each splitting step is followed by a merging step, which uses a likelihood ratio test to reverse the least useful half of the splits." 
}, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-42", "text": "Learning proceeds by iterating between those two steps for six rounds." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-43", "text": "To prevent overfitting, the production probabilities are linearly smoothed by shrinking them towards their common base category." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-44", "text": "----------------------------------" }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-45", "text": "**EM INDUCED RANDOMNESS**" }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-46", "text": "While the split&merge procedure described above is shown in Petrov et al. (2006) to reduce the variance in final performance, we found after closer examination that there are substantial differences in the patterns learned by the grammars." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-47", "text": "Since the initialization is not systematically biased in any way, one can obtain different grammars by simply changing the seed of the random number generator." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-48", "text": "We trained 16 different grammars by initializing the random number generator with seed values 1 through 16, but without biasing the initialization in any other way." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-49", "text": "Figure 1 shows that the number of subcategories allocated to each observed category varies significantly between the different initialization points, especially for the phrasal categories." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-50", "text": "Figure 2 shows posteriors over the most frequent subcategories given their base category for the first four grammars." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-51", "text": "Clearly, EM is allocating the latent variables in very different ways in each case." 
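The seed-dependent, symmetry-breaking initialization described above can be sketched as follows. This is a minimal sketch with our own function name, not the Berkeley parser code; it shows only how one production probability is divided among its splits and perturbed by 1% noise, so that different seeds give EM different starting points.

```python
import random

def split_with_noise(prob, seed, n_splits=2, noise=0.01):
    """Split one production probability into n_splits subcategory
    productions, perturbed by 1% random noise to break symmetry.
    Minimal sketch of the initialization, not the trainer's code."""
    rng = random.Random(seed)
    weights = [(prob / n_splits) * (1.0 + noise * rng.uniform(-1.0, 1.0))
               for _ in range(n_splits)]
    # Renormalize so the splits still sum to the original probability.
    scale = prob / sum(weights)
    return [w * scale for w in weights]

# Two seeds yield two different starting points for the same production.
a = split_with_noise(0.4, seed=1)
b = split_with_noise(0.4, seed=2)
```

Without the noise, all splits of a category would be interchangeable and EM could never differentiate them; with it, each seed sends EM toward a different local maximum, which is exactly the variation the product model later exploits.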
}, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-52", "text": "As a more quantitative measure of difference, we evaluated all 16 grammars on sections 22 and 24 of the Penn Treebank." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-53", "text": "Figure 3 shows the performance on those two sets, and reveals that there is no single grammar that achieves the best score on both." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-54", "text": "While the parsing accuracies are consistently high, there is only a weak correlation between the accuracies on the two evaluation sets (Pearson coefficient 0.34)." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-55", "text": "This suggests that no single grammar should be preferred over the others." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-56", "text": "In previous work (Petrov et al., 2006; Petrov and Klein, 2007) the final grammar was chosen based on its performance on a held-out set (section 22), and corresponds to the second best grammar in Figure 3 (because only 8 different grammars were trained)." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-57", "text": "A more detailed error analysis is given in Figure 4 , where we show a breakdown of F_1 scores for selected phrasal categories in addition to the overall F_1 score and exact match (on the WSJ development set)." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-58", "text": "While grammar G_2 has the highest overall F_1 score, its exact match is not particularly high, and it turns out to be the weakest at predicting quantifier phrases (QP)." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-59", "text": "Similarly, the performance of the other grammars varies between the different error measures, indicating again that no single grammar dominates the others."
}, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-60", "text": "----------------------------------" }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-61", "text": "**A SIMPLE PRODUCT MODEL**" }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-62", "text": "It should be clear by now that simply varying the random seed used for initialization causes EM to discover very different latent variable grammars." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-63", "text": "While this behavior is worrisome in general, it turns out that we can use it to our advantage in this particular case." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-64", "text": "Recall that we are using EM to learn both the data representation and the parameters of the model." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-65", "text": "Our analysis showed that changing the initialization point results in learning grammars that vary quite significantly in the errors they make, but have comparable overall accuracies." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-66", "text": "This suggests that the different local maxima found by EM correspond to different data representations rather than to suboptimal parameter estimates." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-67", "text": "To leverage the strengths of the individual grammars, we combine them in a product model." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-68", "text": "Product models have the nice property that their Kullback-Leibler divergence from the true distribution will always be smaller than the average of the KL divergences of the individual distributions (Hinton, 2001) ."
}, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-69", "text": "Therefore, as long as no individual grammar G i is significantly worse than the others, we can only benefit from combining multiple latent variable grammars and searching for the tree that maximizes" }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-70", "text": "Here, we are making the assumption that the individual grammars are conditionally independent, which is of course not true in theory, but holds surprisingly well in practice." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-71", "text": "To avoid this assumption, we could use a sum model, but we will show in Section 4.1 that the product formulation performs significantly better." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-72", "text": "Intuitively speaking, products have the advantage that the final prediction has a high posterior under all models, giving each model veto power." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-73", "text": "This is exactly the behavior that we need in the case of parsing, where each grammar has learned different constraints for ruling out improbable parses." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-74", "text": "----------------------------------" }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-75", "text": "**LEARNING**" }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-76", "text": "Joint training of our product model would couple the parameters of the individual grammars, necessitating the computation of an intractable global partition function (Brown and Hinton, 2001 but from a different, randomly chosen starting point." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-77", "text": "To emphasize, we do not introduce any systematic bias (but see Section 4.3 for some experiments), or attempt to train the models to be maximally different (Hinton, 2002) -we simply train a random collection of grammars by varying the random seed used for initialization." 
}, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-78", "text": "We found in our experiments that the randomness provided by EM is sufficient to achieve diversity among the individual grammars, and gives results that are as good as more involved training procedures." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-79", "text": "Xu and Jelinek (2004) made a similar observation when learning random forests for language modeling." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-80", "text": "Our model is reminiscent of Logarithmic Opinion Pools (Bordley, 1982) and Products of Experts (Hinton, 2001) ." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-81", "text": "3 However, because we believe that none of the underlying grammars should be favored, we deliberately do not use any combination weights." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-82", "text": "----------------------------------" }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-84", "text": "Computing the most likely parse tree is intractable for latent variable grammars (Sima'an, 2002) , and therefore also for our product model." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-85", "text": "This is because there are exponentially many derivations over split subcategories that correspond to a single parse tree over unsplit categories, and there is no dynamic program to efficiently marginalize out the latent variables." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-86", "text": "Previous work on parse risk minimization has addressed this problem in two different ways: by changing the objective function, or by constraining 3 As a matter of fact, Hinton (2001) the search space (Goodman, 1996; Titov and Henderson, 2006; Petrov and Klein, 2007) ." 
}, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-87", "text": "The simplest approach is to stick to likelihood as the objective function, but to limit the search space to a set of high quality candidates T :" }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-88", "text": "Because the likelihood of a given parse tree can be computed exactly for our product model (Eq. 2), the quality of this approximation is only limited by the quality of the candidate list." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-89", "text": "To generate the candidate list, we produce k-best lists of Viterbi derivations with the efficient algorithm of Huang and Chiang (2005) , and erase the subcategory information to obtain parse trees over unsplit categories." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-90", "text": "We refer to this approximation as TREE-LEVEL inference, because it considers a list of complete trees from the underlying grammars, and selects the tree that has the highest likelihood under the product model." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-91", "text": "While the k-best lists are of very high quality, this is a fairly crude and unsatisfactory way of approximating the posterior distribution of the product model, as it does not allow the synthesis of new trees based on tree fragments from different grammars." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-92", "text": "An alternative is to use a tractable objective function that allows the efficient exploration of the entire Legend: log G1-score log G2-score Figure 5 : Grammar G 1 has a preference for flat structures, while grammar G 2 prefers deeper hierarchical structures." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-93", "text": "Both grammars therefore make one mistake each on their own." 
}, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-94", "text": "However, the correct parse tree (which uses a flat ADJP in the first slot and a hierarchical NP in the second) scores highest under the product model." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-95", "text": "search space." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-96", "text": "Petrov and Klein (2007) present such an objective function, which maximizes the product of expected correct productions r:" }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-97", "text": "These expectations can be easily computed from the inside/outside scores, similarly as in the maximum bracket recall algorithm of Goodman (1996) , or in the variational approximation of Matsuzaki et al. (2005) ." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-98", "text": "We extend the algorithm to work over posterior distributions from multiple grammars, by aggregating their expectations into a product." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-99", "text": "In practice, we use a packed forest representation to approximate the posterior distribution, as in Huang (2008) ." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-100", "text": "We refer to this approximation as CONSTITUENT-LEVEL, because it allows us to form new parse trees from individual constituents." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-101", "text": "Figure 5 illustrates a real case where the product model was able to construct a completely correct parse tree from two partially correct ones." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-102", "text": "In the example, one of the underlying grammars (G 1 ) had an imperfect recall score, because of its preference for flat structures (it missed an NP node in the second part of the sentence)." 
}, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-103", "text": "In contrast, the other grammar (G 2 ) favors deeper structures, and therefore introduced a superfluous ADVP node." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-104", "text": "The product model gives each underlying grammar veto power, and picks the least controversial tree (which is the correct one in this case)." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-105", "text": "Note that a sum model allows the most confident model to dominate the decision, and would chose the incorrect hierarchical ADJP construction here (as one can verify using the provided model scores)." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-106", "text": "To make inference efficient, we can use the same coarse-to-fine pruning techniques as Petrov and Klein (2007) ." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-107", "text": "We generate a hierarchy of projected grammars for each individual grammar and parse with each one in sequence." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-108", "text": "Because only the very last pass requires scores from the different underlying grammars, this computation can be trivially parallelized across multiple CPUs." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-109", "text": "Additionally, the first (X-Bar) pruning pass needs to be computed only once because it is shared among all grammars." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-110", "text": "Since the X-Bar pass is the bottleneck of the multipass scheme (using nearly 50% of the total processing time), the overhead of using a product model is quite manageable." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-111", "text": "It would have also been possible to use A*-search for factored models (Klein and Manning, 2003a; Sun and Tsujii, 2009 ), but we did not attempt this in the present work." 
}, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-112", "text": "----------------------------------" }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-113", "text": "**EXPERIMENTS**" }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-114", "text": "In our experiments, we follow the standard setups described in Table 1 , and use the EVALB tool for computing parsing figures." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-115", "text": "Unless noted otherwise, we use CONSTITUENT-LEVEL inference." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-116", "text": "All our experiments are based on the publicly available BerkeleyParser." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-117", "text": "4 Training Set Dev." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-118", "text": "Set Test Set ENGLISH-WSJ Sections Section 22 Section 23 (Marcus et al., 1993) 2-21 ENGLISH-BROWN see 10% of 10% of the (Francis et al. 1979) ENGLISH-WSJ the data 5 the data 5 GERMAN Sentences Sentences Sentences (Skut et al., 1997) 1 -18,602 18,603-19,602 19,603-20,602" }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-119", "text": "----------------------------------" }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-120", "text": "**(WEIGHTED) PRODUCT VS. (WEIGHTED) SUM**" }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-121", "text": "A great deal has been written on the topic of products versus sums of probability distributions for joint prediction (Genest and Zidek, 1986; Tax et al., 2000) ." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-122", "text": "However, those theoretical results do not apply directly here, because we are using multiple randomly permuted models from the same class, rather models from different classes." 
}, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-123", "text": "To shed some light on this issue, we addressed the question empirically, and combined two grammars into an unweighted product model, and also an unweighted sum model." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-124", "text": "The individual grammars had parsing accuracies (F 1 ) of 91.2 and 90.7 respectively, and their product (91.7) clearly outperformed their sum (91.3)." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-125", "text": "When more grammars are added, the gap widens even further, and the trends persist independently of whether the models use TREE-LEVEL or CONSTITUENT-LEVEL inference." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-126", "text": "At least for the case of unweighted combinations, the product distribution seems to be superior." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-127", "text": "In related work, Zhang et al. (2009) achieve excellent results with a weighted sum model." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-128", "text": "Using weights learned on a held-out set and rescoring 50-best lists from Charniak (2000) and Petrov et al. (2006) , they obtain an F 1 score of 91.0 (which they further improve to 91.4 using a voting scheme)." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-129", "text": "We replicated their experiment, but used an unweighted product of the two model scores." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-130", "text": "Using TREE-LEVEL inference, we obtained an F 1 score of 91.6, suggesting that weighting is not so important in the product case, as long as the classifiers are of comparable quality." 
}, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-131", "text": "6 This is in line with previous work on product models, where weighting has been important when combining heterogenous classifiers (Heskes, 1998), and less important when the classifiers are of similar accuracy (Smith et al., 2005) ." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-132", "text": "5 See Gildea (2001) for the exact setup." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-133", "text": "6 The unweighted sum model, however, underperforms the individual models with an F1 score of only 90.3." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-134", "text": "Figure 6 shows that accuracy increases when more grammars are added to the product model, but levels off after eight grammars." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-135", "text": "The plot also compares our two inference approximations, and shows that CONSTITUENT-LEVEL inference results in a small (0.2), but consistent improvement in F 1 score." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-136", "text": "A first thought might be that the improvement is due to the limited scope of the k-best lists." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-137", "text": "However, this is not the case, as the results hold even when the candidate set for CONSTITUENT-LEVEL inference is constrained to trees from the k-best lists." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-138", "text": "While the packed forrest representation can very efficiently encode an exponential set of parse trees, in our case the k-best lists appear to be already very diverse because they are generated by multiple grammars." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-139", "text": "Starting at 96.1 for a single latent variable grammar, merging two 50-best lists from different grammars gives an oracle score of 97.4, and adding more k-best lists further improves the oracle score to 98.6 for 16 grammars." 
}, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-140", "text": "This compares favorably to the results of Huang (2008) , where the oracle score over a pruned forest is shown to be 97.8 (compared to 96.7 for a 50-best list)." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-141", "text": "----------------------------------" }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-142", "text": "**TREE-LEVEL VS. CONSTITUENT-LEVEL INFERENCE**" }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-143", "text": "The accuracy improvement can instead be explained by the change in the objective function." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-144", "text": "Recall from section Section 3.2, that CONSTITUENT-LEVEL inference maximizes the expected number of correct productions, while TREE-LEVEL inference maximizes tree-likelihood." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-145", "text": "It is therefore not too surprising that the two objective functions select the same tree only 41% of the time, even when limited to the same candidate set." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-146", "text": "Maximizing the expected number of correct productions is superior for F 1 score (see the one grammar case in Figure 6 )." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-147", "text": "However, as to be expected, likelihood is better for exact match, giving a score of 47.6% vs. 46.8%." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-148", "text": "----------------------------------" }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-149", "text": "**SYSTEMATIC BIAS**" }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-150", "text": "Diversity among the underlying models is what gives combined models their strength." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-151", "text": "One way of increasing diversity is by modifying the feature sets of the individual models (Baldridge and Osborne, 2008; Smith and Osborne, 2007) ." 
}, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-152", "text": "This approach has the disadvantage that it reduces the performance of the individual models, and is not directly applicable for latent variable grammars because the features are automatically learned." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-153", "text": "Alternatively, one can introduce diversity by changing the training distribution." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-154", "text": "Bagging (Breiman, 1996) and Boosting (Freund and Shapire, 1996) fall into this category, but have had limited success for parsing (Henderson and Brill, 2000) ." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-155", "text": "Furthermore boosting is impractical here, because it requires training dozens of grammars in sequence." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-156", "text": "Since training a single grammar takes roughly one day, we opted for a different, parallelizable way of changing the training distribution." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-157", "text": "In a first experiment, we divided the training set into two disjoint sets, and trained separate grammars on each half." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-158", "text": "These truly disjoint grammars had low F 1 scores of 89.4 and 89.6 respectively (because they were trained on less data)." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-159", "text": "Their combination unfortunately also achieves only an accuracy of 90.9, which is lower than what we get when training a single grammar on the entire training set." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-160", "text": "In another experiment, we used a cross-validation setup where individual sections of the treebank were held out." 
}, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-161", "text": "The resulting grammars had parsing accuracies of about 90.5, and the product model was again not able to overcome the lower starting point, despite the potentially larger diversity among the underlying grammars." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-162", "text": "It appears that any systematic bias that lowers the accuracy of the individual grammars also hurts the final performance of the product model." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-163", "text": "Smith et al. (2005) interpret Logarithmic Opinion Pools (LOPs) as a smoothing technique." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-164", "text": "They compare regularizing Conditional Random Fields (CRFs) with Gaussian priors (Lafferty et al., 2001) , to training a set of unregularized CRFs over different feature sets and combining them in an LOP." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-165", "text": "In their experiments, both approaches work comparably well, but their combination, an LOP of regularized CRFs works best." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-166", "text": "----------------------------------" }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-167", "text": "**PRODUCT DISTRIBUTION AS SMOOTHING**" }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-168", "text": "Not too surprisingly, we find this to be the case here as well." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-169", "text": "The parameters of each latent variable grammar are typically smoothed in a linear fashion to prevent excessive overfitting (Petrov et al., 2006) ." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-170", "text": "While all the experiments so far used smoothed grammars, we reran the experiments also with a set of unsmoothed grammars." 
}, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-171", "text": "The individual unsmoothed grammars have on average an 1.2% lower accuracy." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-172", "text": "Even though our product model is able to increase accuracy by combining multiple grammars, the gap to the smoothed models remains consistent." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-173", "text": "This suggests that the product model is doing more than just smoothing." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-174", "text": "In fact, because the product distribution is more peaked, it seems to be doing the opposite of smoothing." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-175", "text": "----------------------------------" }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-176", "text": "**FINAL RESULTS**" }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-177", "text": "Our final model uses an unweighted product of eight grammars trained by initializing the random number generator with seeds 1 through 8." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-178", "text": "Table 2 shows our test set results (obtained with CONSTITUENT-LEVEL inference), and compares them to related work." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-179", "text": "There is a large body of work that has reported parsing accuracies for English, and we have grouped the different methods into categories for better overview." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-180", "text": "Our results on the English in-domain test set are higher than those obtained by any single component parser (SINGLE)." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-181", "text": "The other methods quoted in Table 2 operate over the output of one or more single component parsers and are therefore largely orthogonal to our line of work." 
}, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-182", "text": "It is nonetheless exciting to see that our product model is competitive with the discriminative rescoring methods (RE) of Charniak and Johnson (2005) and Huang (2008) , achieving higher F 1 scores but lower exact match." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-183", "text": "These two methods work on top of the Charniak (2000) parser, and it would be possible to exchange that parser with our product model." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-184", "text": "We did not attempt this experiment, but we expect that those methods would stack well with our model, because they use primarily non-local features that are not available in a context-free grammar." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-185", "text": "Techniques like self-training (SELF) and system combinations (COMBO) can further improve parsing accuracies, but are also orthogonal to our work." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-186", "text": "In particular the COMBO methods seem related to our work, but are very different in their nature." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-187", "text": "While we use multiple grammars in our work, all grammars are from the same model class for us." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-188", "text": "In contrast, those methods rely on a diverse set of individual parsers, each of which requires a significant effort to build." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-189", "text": "Furthermore, those techniques have largely relied on different voting schemes in the past (Henderson and Brill, 1999; Sagae and Lavie, 2006) , and only more recently have started using actual posteriors from the underlying models (Fossum and Knight, 2009; Zhang et al., 2009) ." 
}, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-190", "text": "Even then, those methods operate only over k-best lists, and we are the first to work directly with parse forests from multiple grammars." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-191", "text": "It is also interesting to note that the best results in Zhang et al. (2009) are achieved by combining kbest lists from a latent variable grammar of Petrov et al. (2006) with the self-trained reranking parser of McClosky et al. (2006) ." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-192", "text": "Clearly, replacing the single latent variable grammar with a product of latent variable grammars ought to improve performance." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-193", "text": "The results on the other two corpora are similar." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-194", "text": "A product of latent variable grammars very significantly outperforms a single latent variable grammar and sets new standards for the state-of-the-art." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-195", "text": "We also analyzed the errors of the product models." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-196", "text": "In addition to the illustrative example in Figure 5 , we computed detailed error metrics for different phrasal categories." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-197", "text": "Figure 4 shows that a product of four random grammars is always better than even the best underlying grammar." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-198", "text": "The individual grammars seem to learn different sets of constraints, and the product model is able to model them all at once, giving consistent accuracy improvements across all metrics." 
}, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-199", "text": "----------------------------------" }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-200", "text": "**CONCLUSIONS**" }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-201", "text": "We presented a simple product model that significantly improves parsing accuracies on different domains and languages." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-202", "text": "Our model leverages multiple automatically learned latent variable grammars, which differ only in the seed of the random number generator used to initialize the EM learning al- gorithm." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-203", "text": "As our analysis showed, the grammars vary widely, making very different errors." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-204", "text": "This is in part due to the fact that EM is used not only for estimating the parameters of the grammar, but also to determine the set of context-free productions that underlie it." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-205", "text": "Because the resulting data representations are largely independent, they can be easily combined in an unweighted product model." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-206", "text": "The product model does not require any additional training and is capable of significantly improving the state-of-the-art in parsing accuracy." }, { "sent_id": "f2db88c0d4e0ec4c34fc295a5d59ba-C001-207", "text": "It remains to be seen if a similar approach can be used in other cases where EM converges to widely varying local maxima." 
} ], "y": { "@UNSURE@": { "gold_contexts": [ [ "f2db88c0d4e0ec4c34fc295a5d59ba-C001-2" ], [ "f2db88c0d4e0ec4c34fc295a5d59ba-C001-15" ], [ "f2db88c0d4e0ec4c34fc295a5d59ba-C001-32" ] ], "cite_sentences": [ "f2db88c0d4e0ec4c34fc295a5d59ba-C001-2", "f2db88c0d4e0ec4c34fc295a5d59ba-C001-15", "f2db88c0d4e0ec4c34fc295a5d59ba-C001-32" ] }, "@BACK@": { "gold_contexts": [ [ "f2db88c0d4e0ec4c34fc295a5d59ba-C001-13" ], [ "f2db88c0d4e0ec4c34fc295a5d59ba-C001-37" ], [ "f2db88c0d4e0ec4c34fc295a5d59ba-C001-56" ], [ "f2db88c0d4e0ec4c34fc295a5d59ba-C001-128" ], [ "f2db88c0d4e0ec4c34fc295a5d59ba-C001-169" ], [ "f2db88c0d4e0ec4c34fc295a5d59ba-C001-191" ] ], "cite_sentences": [ "f2db88c0d4e0ec4c34fc295a5d59ba-C001-13", "f2db88c0d4e0ec4c34fc295a5d59ba-C001-37", "f2db88c0d4e0ec4c34fc295a5d59ba-C001-56", "f2db88c0d4e0ec4c34fc295a5d59ba-C001-128", "f2db88c0d4e0ec4c34fc295a5d59ba-C001-169", "f2db88c0d4e0ec4c34fc295a5d59ba-C001-191" ] }, "@DIF@": { "gold_contexts": [ [ "f2db88c0d4e0ec4c34fc295a5d59ba-C001-46" ] ], "cite_sentences": [ "f2db88c0d4e0ec4c34fc295a5d59ba-C001-46" ] } } }, "ABC_e7b1c00e747f5bfbb96499d7223496_18": { "x": [ { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-40", "text": "**SKIP-GRAM FOR WORD REPRESENTATIONS**" }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-2", "text": "This work studies the representational mapping across multimodal data such that given a piece of the raw data in one modality the corresponding semantic description in terms of the raw data in another modality is immediately obtained." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-3", "text": "Such a representational mapping can be found in a wide spectrum of real-world applications including image/video retrieval, object recognition, action/behavior recognition, and event understanding and prediction." 
}, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-4", "text": "To that end, we introduce a simplified training objective for learning multimodal embeddings using the skip-gram architecture by introducing convolutional \"pseudowords:\" embeddings composed of the additive combination of distributed word representations and image features from convolutional neural networks projected into the multimodal space." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-5", "text": "We present extensive results of the representational properties of these embeddings on various word similarity benchmarks to show the promise of this approach." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-6", "text": "----------------------------------" }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-8", "text": "Distributed representations of multimodal embeddings (Feng & Lapata, 2010) are receiving increasing attention recently in the machine learning literature, and techniques developed have found a wide spectrum of applications in the real world." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-9", "text": "These types of vector representations are particularly desirable for the way in which they better model the grounding of perceptual or semantic concepts in human vocabulary (Lazaridou et al., 2015; Glenberg & Robertson, 2000; )." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-10", "text": "As such, there has been development towards so-called multimodal distributional semantic models (Silberer & Lapata, 2014; Lazaridou et al., 2015; Kiros et al., 2014; Frome et al., 2013; Bruni et al., 2014) , which leverage textual co-occurance and visual features to form multimodal representations of words or concepts." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-11", "text": "The work introduced in Lazaridou et al. (2015) sought to address many of the drawbacks of these models." 
}, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-12", "text": "In particular, by incorporating visual information into the training objective, they address the biological inaccuracy of the existing models, in that word representations grounded in visual information have been shown to more closely approximate the way humans learn language." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-13", "text": "Furthermore, incorporating visual information alongside the text corpus allows the training set to consist of both visual and non-visual words." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-14", "text": "As a result, the induced multimodal representations and multimodal mapping no longer rely on the assumption of full visual coverage of the vocabulary, so the results are able to generalize beyond the initial training set and to be applied to various representation-related tasks, such as image annotation or retrieval." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-15", "text": "In this work, we introduce a further refinement on the multimodal skip-gram architecture, building upon the approaches of Mikolov et al. (2013a; b) , , and Lazaridou et al. (2015) ." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-16", "text": "Rather than adding a visual term to the linguistic training objective, we directly situate terms in a visual context by replacing relevant words with multimodal pseudowords, derived by composing the textual representations with convolutional features projected into the multimodal space." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-17", "text": "In this way, we further address the grounding problem of Glenberg & Robertson (2000) by incorpo-rating the word-level visual modality directly into the sentence context." 
}, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-18", "text": "This model represents an advancement of the existing literature surrounding multimodal skip-gram, as well as multimodal distributional semantic models in general, by greatly simplifying the method of situating the words in the visual context and reducing the number of hyperparameters to tune by directly incorporating multimodal words into the existing objective function and hiearchical softmax formulations of the skip-gram models." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-19", "text": "Finally, we would also like the learned embeddings to be applicable to the problem of zero-shot learning (Socher et al., 2013; Lazaridou et al., 2014; Frome et al., 2013) ." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-20", "text": "By incorporating perceptual information into the skip-gram learning objective, we can leverage vocabulary terms for which no manually-annotated images were originally available." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-21", "text": "In this way, these learned representations can be used to both grow the annotation set and retrieve new annotations for a given image set." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-22", "text": "----------------------------------" }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-23", "text": "**RELATED WORK**" }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-24", "text": "In the last few years, there has been a wealth of literature on multimodal representational models." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-25", "text": "As explained in Lazaridou et al. (2015) , the majority of this literature focuses on constructing textual and visual representations independently and then combining them under some metrics." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-26", "text": "Bruni et al. 
(2014) utilize a direct approach to \"mixing\" the vector representations by concatenating the text and image vectors and applying Singular Value Decomposition." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-27", "text": "The image vectors used here, though, are constructed using the bag-of-visual-words method." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-28", "text": "In Kiela & Bottou (2014) , the authors utilize a more sophisticated approach to the concatenation method by extracting visual features using state-of-the-art convolutional neural networks and the skip-gram architecture for the text." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-29", "text": "Similarly, Frome et al. (2013) also utilizes the skip-gram architecture and convolutional features; however the two modalities are then combined using a natural similarity metric." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-30", "text": "Other recent work has presented several methods for directly incorporating visual context in neural language models." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-31", "text": "In Xu et al. (2014) , word context is enhanced by global visual context; i.e., a single image is used as the context for the whole sentence (conversely, the sentence acts as a caption for the image)." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-32", "text": "The multimodal skip-gram architecture proposed by Lazaridou et al. (2015) takes a more fine-grained approach by incorporating word-level visual context and concurrently training words to predict other text words in the window as well as their visual representation." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-33", "text": "Our model makes this approach even more explicit, by training the word vectors to predict an additive composition of the textual and visual context and thus constructing an implicit mapping between the textual and visual modalities." 
}, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-34", "text": "Finally, the work introduced in Hill & Korhonen (2014) employs a similar \"pseudoword\" architecture to that proposed here." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-35", "text": "However, the visual features used are in the form of perceptual information derived from either user-generated attributes or other textual annotations of imagery." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-36", "text": "While this is shown to be useful for distinguishing classes of words (e.g., between abstract and concrete), it precludes any incorporation of visual, non-linguistic context and thus the derivation of any mapping between images and words or applications to representation-related tasks." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-37", "text": "----------------------------------" }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-38", "text": "**ARCHITECTURE**" }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-39", "text": "----------------------------------" }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-41", "text": "This model is primarily derived from the skip-gram model introduced by Mikolov et al. (2013c) ." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-42", "text": "Skip-gram learns representations of words that predict a target word's context." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-43", "text": "The model maximizes" }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-44", "text": "Figure 1: t-SNE embedding of a subset of the convolutional features extracted from Imagenet, demonstrating the inherent clustering of the dataset." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-45", "text": "where w 1 , w 2 , . . . , w T are words in the training set and c the window size around the target word." 
}, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-46", "text": "The probablity p(w t+j |w t ) is given by softmax, that is:" }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-47", "text": "where u w and u w are the context vector and target vector representations induced for word w, respectively, and W gives the vocabulary size." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-48", "text": "To speed up this computation, a Huffman tree is constructed from the vocabulary, and then the softmax formula is replaced with the hierarchical softmax (Morin & Bengio, 2005) ." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-49", "text": "----------------------------------" }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-50", "text": "**SKIP-GRAM WITH MULTIMODAL PSEUDOWORDS**" }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-51", "text": "To ground the word representations in a visual context, we introduce a means of replacing a word in the corpus with its corresponding multimodal pseudoword." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-52", "text": "The pseudoword vector for a given word w, denoted z w , is given by" }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-53", "text": "where u w is the multimodal word representation of w to be induced, v w is the visual data for the concept represented by w, and M is the mapping induced between the visual and multimodal space." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-54", "text": "(The sources of the textual and visual data are explained below.) Where no visual features are available for a given word, v w is set to 0." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-55", "text": "Thus, the objective function in (1) remains the same, while each word vector in the context window of the current word in (2) is replaced with its corresponding pseudoword." 
}, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-56", "text": "In this way, each target word in the corpus is trained to predict every given pseudoword in its context window." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-57", "text": "For the value of v w , a key issue in this approach is selecting a canonical visual representation for a given concept (e.g., a single image labeled \"dog\" does not necessarily accurately represent the visual information of all dogs)." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-58", "text": "Consequently, although the features extracted for a given image from a convolutional neural network (CNN) provided a higher-level visual representation than the raw pixel data, these features are not representative of the concept as a whole." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-59", "text": "Rather than handpicking an appropriate image for each class, we rely on the manner in which the CNN features form clusters based on their corresponding visual concepts, a well-explored phenomenon." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-60", "text": "1 Thus, for each visual word, we can sample some images corresponding to the concept and extract CNN features for the images." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-61", "text": "Figure 1 shows the nature of some of these clusters; based on this intuition, we test two approaches to this problem, which we refer to as the centroid method and the hypersphere method." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-62", "text": "For the centroid method 2 , the CNN features sampled for a given visual concept are averaged together." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-63", "text": "In this way, we form unified representation for each cluster by extracting its centroid and using this for v w ." 
}, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-64", "text": "We not, however, that this somewhat limits the representational quality of v w by condensing the varied class of images to a single data pointl." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-65", "text": "To this end, we introduce the hypersphere method as a means of capturing the complexity of the visual space." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-66", "text": "Rather than averaging the sampled CNN features, we instead fit a Gaussian mixture model to the cluster of CNN features, and, at each training step, sample a point from the model." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-67", "text": "This method, then, retains more of the variation between samples present in the dataset, while still only drawing on a small set of original images." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-68", "text": "The hypersphere technique can also be seen as a means of augmenting the training data without directly extracting more convolutional features." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-69", "text": "Since all points in a given cluster are assumed to be similar and to represent the same concept, a \"new\" image can be used to form v w at each training step without necessitating an equivalently-large image dataset." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-70", "text": "----------------------------------" }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-71", "text": "**EXPERIMENTS**" }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-72", "text": "----------------------------------" }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-73", "text": "**EXPERIMENTAL DATA**" }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-74", "text": "For our text corpus, keeping with the existing literature, we use a preprocessed dump of Wikipedia 3 containing approximately 800M tokens." 
}, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-75", "text": "For the visual data, we use the image data from ILSVRC 2012 (Russakovsky et al., 2015) and the corresponding Wordnet hierarchy (Miller, 1995) to represent a word visually if the word or any of its hyponyms has an entry in Imagenet and occurs more than 500 times in the text corpus." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-76", "text": "This yields approximately 5,100 \"visual\" words." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-77", "text": "To construct the vectors for the visual representations, we follow a similar experimental set-up as that used by Lazaridou et al. (2015) ." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-78", "text": "In each of the cases described above-centroid and hypersphere-, we randomly sample 100 images from the corresponding synsets of Imagenet for each visual word and use a pre-trained convolutional neural network as described in Krizhevsky et al. (2012) via the Caffe toolkit (Jia et al., 2014) to extract a 4096-dimensional vector representation of each image." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-79", "text": "We then treat the 100 vectors corresponding to each of the 5,100 visual words as clusters in the 4096-dimensional visual space." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-80", "text": "----------------------------------" }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-81", "text": "**APPROXIMATING HUMAN SIMILARITY JUDGMENTS**" }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-82", "text": "Word Similarity Benchmarks To compare our technique to the existing literature, we evaluate our embeddings on four common benchmarks which capture several diverse aspects of word meaning: MEN , Simlex-999 , SemSim (Silberer & Lapata, 2014) , VisSim (Silberer & Lapata, 2014) ." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-83", "text": "MEN was designed to capture general word \"relatedness." 
}, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-84", "text": "\" Simlex-999 and SemSim measure notions of semantic similarity, and VisSim ranks the same words as SemSim but in terms of visual similarity." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-85", "text": "In each case, the designers of the benchmarks provided pairs of words to human judges, who in turned provide ratings based on the metric of the benchmark." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-86", "text": "To judge our model, we calculate the cosine similarity of our embeddings for the word pairs and then calculate Spearman's \u03c1 between our list of ratings and those of the human judges." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-87", "text": "We evaluate three versions of our model on these benchmarks: pseudowords using the centroid method (PSUEDOWORDS-C), pseudowords using the hypersphere method (PSEUDOWORDS-H), and the centroid method with a randomly initialized mapping (PSEUDOWORDS-RAN), as explained below." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-88", "text": "Existing Multimodal Models We compare our results on these benchmarks against previously published results for other multimodal word embeddings ." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-89", "text": "Using the results published by Lazaridou et al. (2015) and a target word embedding of 300, we compare our results to their MMSKIP-GRAM-A and MMSKIP-GRAM-B, which maximize the similarity of the textual and visual representations Table 1 : Spearman correlation between the generated multimodal similarities and the benchmark human judgments." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-90", "text": "In all cases, results are reported on the full set of word similarity pairs." 
}, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-91", "text": "under a max-margin framework.; the former constrains the dimensionality of the visual features to be the same as the word embeddings, while the latter learns an explicit mapping between the textual and visual spaces." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-92", "text": "We also include baseline results for pure-text skip-gram embeddings (SKIP-GRAM))." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-93", "text": "----------------------------------" }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-94", "text": "**RESULTS**" }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-95", "text": "The results for the human judgment experiments are presented in Table 1 ." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-96", "text": "For these experiments, we tried two methods of initializing the mapping." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-97", "text": "First, Random Initialization: the visual-textual mapping matrix was randomly initialized in the same manner as the word embeddings, with the goal of allowing the mapping to be freely generated from the word context." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-98", "text": "Second, Neural Weight Initialization: to boost the performance of the multimodal embeddings, the mapping was initialized with the weights from a simple neural network trained to predict known word embeddings 4 from our convolutional image features." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-99", "text": "----------------------------------" }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-100", "text": "**RANDOM INITIALIZATION**" }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-101", "text": "Interestingly, there is a degradation in the correlation from the addition of the visual features across all benchmarks." 
}, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-102", "text": "This seems to indicate that the induced mapping, when beginning with a random initialization, is yet insufficient to properly situate the convolutional features into the multimodal space." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-103", "text": "It would seem initially that, during training, while the mapping is still being learned, adding the visual context to the text vectors perhaps worsens the representational quality of the word embedding." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-104", "text": "----------------------------------" }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-105", "text": "**NEURAL WEIGHT INITIALIZATION**" }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-106", "text": "On the other hand, when the mapping is quickly pretrained on existing distributed word representations, the results are greatly improved." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-107", "text": "In the cases of capturing general relatedness and pure visual similarity, the multimodal model of Lazaridou et al. (2015) performs better." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-108", "text": "However, in the case of capturing semantic word similarity, our model performs signficantly better than MMSKIP-GRAM-B (although it should be noted that these results are roughly on par with the benchmark authors (Silberer & Lapata, 2014) and a point below the non-mapping MMSKIP-GRAM-A)." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-109", "text": "Although further work is needed to examine this result, the performance of the model in this case can be visualized through an example." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-110", "text": "Table 2 provides some insights on the changes made to the word embeddings as a result of the inclusion of visual information in the learning process." 
}, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-111", "text": "In the two visual instances, our model captures many of the same nuances as MMSKIP-GRAM-B over the SKIP-GRAM model: donuts are more similar to other types of food than to places where you find donuts and owls are more similar to other birds of prey than just woodland creatures." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-112", "text": "However, our model seems to capture more of the semantic idea of donuts as \"junk food\" rather than just the visual similarity of roundness (the link established between donut and cupcake is particularly interesting)." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-113", "text": "As for owl, some of the visual similarity is lost, by ranking sparrow first, with regards to the class of birds of prey, but there seems to be a recognition of the semantic relationship between the top similar words (\"sparrow hawk\" is a synonym for \"kestrel\" in Wordnet, for example) as well as visual similarity via brown feathers and beaks." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-114", "text": "As for the representations learned without explicit visual information, our model still seems to demonstrate the propagation of this information but in a different manner than MMSKIP-GRAM- Table 2 : Top 3 neighbors of the target words, ordered by similarity." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-115", "text": "The training data contained visual information only for donut and owl B. 
The words ranked as similar to mural lose the artistic concepts of painting and portrait ranked highly by the other models; instead our model ranks \"fresco\" and \"bas-relief\" alongside sculpture, capturing instead a more complex representation of \"artwork executed directly on a wall.\" For tobacco, our model dismisses the recreational uses of tobacco captured via \"cigar\" and \"cigarette,\" while also ignoring the na\u00efve \"crop\" sense captured by \"corn.\" Instead, the highly-ranked words seem to display a more robust use of tobacco in the semanic sense as a cash crop, specifically referencing other notable trade crops." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-116", "text": "----------------------------------" }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-117", "text": "**5**" }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-118", "text": "The two abstract concepts, depth and chaos, reveal two very different results of the visual propagation." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-119", "text": "For depth, no evidence of the visual is apparent: unlike the relation to the sea drawn by MMSKIP-GRAM-B, our model seems to only capture depth's semantic similarity to other types of measurement (height and thickness)." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-120", "text": "For chaos, there is still the loss of more imageable concepts, as in shadow, from the rankings; however, our model instead seems to capture words more semantically similar to chaos (anarchy) or synonyms of events like chaos (turmoil and pandemonium)." 
}, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-121", "text": "----------------------------------" }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-122", "text": "**CONCLUSION**" }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-123", "text": "Our model performs multimodal skip-gram using pseudowords construction from convolutional image features, and then demonstrates the propagation of visual information to non-visual words but in a seemingly distinct manner from the existing models." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-124", "text": "Of particular note is that it is apparent that distributed word representations are being improved and informed by this information over pure-text skipgram, but these embeddings seem to perform best at a semantic rather than a visual level." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-125", "text": "Future work will focus on the nature of this embedding." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-126", "text": "In particular, we will investigate the applicability of the induced textual-visual mapping to the task of zero-shot image labeling (Socher et al., 2013) and image retrieval." }, { "sent_id": "e7b1c00e747f5bfbb96499d7223496-C001-127", "text": "As of yet, the nature of the mapping, beyond the qualitative improvements provided to the word embeddings, is still unclear, and future work will seek to address this." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "e7b1c00e747f5bfbb96499d7223496-C001-9" ], [ "e7b1c00e747f5bfbb96499d7223496-C001-10" ], [ "e7b1c00e747f5bfbb96499d7223496-C001-11" ], [ "e7b1c00e747f5bfbb96499d7223496-C001-25" ], [ "e7b1c00e747f5bfbb96499d7223496-C001-32" ] ], "cite_sentences": [ "e7b1c00e747f5bfbb96499d7223496-C001-9", "e7b1c00e747f5bfbb96499d7223496-C001-10", "e7b1c00e747f5bfbb96499d7223496-C001-11", "e7b1c00e747f5bfbb96499d7223496-C001-25", "e7b1c00e747f5bfbb96499d7223496-C001-32" ] }, "@EXT@": { "gold_contexts": [ [ "e7b1c00e747f5bfbb96499d7223496-C001-15" ] ], "cite_sentences": [ "e7b1c00e747f5bfbb96499d7223496-C001-15" ] }, "@SIM@": { "gold_contexts": [ [ "e7b1c00e747f5bfbb96499d7223496-C001-77" ] ], "cite_sentences": [ "e7b1c00e747f5bfbb96499d7223496-C001-77" ] }, "@USE@": { "gold_contexts": [ [ "e7b1c00e747f5bfbb96499d7223496-C001-77" ], [ "e7b1c00e747f5bfbb96499d7223496-C001-89" ] ], "cite_sentences": [ "e7b1c00e747f5bfbb96499d7223496-C001-77", "e7b1c00e747f5bfbb96499d7223496-C001-89" ] }, "@DIF@": { "gold_contexts": [ [ "e7b1c00e747f5bfbb96499d7223496-C001-107", "e7b1c00e747f5bfbb96499d7223496-C001-108" ] ], "cite_sentences": [ "e7b1c00e747f5bfbb96499d7223496-C001-107" ] } } }, "ABC_9d1699d4ca3b4026ed5aab125a737d_18": { "x": [ { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-2", "text": "Coverage maximization with bigram concepts is a state-of-the-art approach to unsupervised extractive summarization." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-3", "text": "It has been argued that such concepts are adequate and, in contrast to more linguistic concepts such as named entities or syntactic dependencies, more robust, since they do not rely on automatic processing." 
}, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-4", "text": "In this paper, we show that while this seems to be the case for a commonly used newswire dataset, use of syntactic and semantic concepts leads to significant improvements in performance in other domains." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-5", "text": "----------------------------------" }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-7", "text": "State-of-the-art approaches to extractive summarization are based on the notion of coverage maximization (Berg-Kirkpatrick et al., 2011) ." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-8", "text": "The assumption is that a good summary is a selection of sentences from the document that contains as many of the important concepts as possible." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-9", "text": "The importance of concepts is implemented by assigning weights w i to each concept i with binary variable c i , yielding the following coverage maximization objective, subject to the appropriate constraints:" }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-10", "text": "In proposing bigrams as concepts for their system, Gillick and Favre (2009) explain that:" }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-11", "text": "[c]oncepts could be words, named entities, syntactic subtrees or semantic relations, for example." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-12", "text": "While deeper semantics make more appealing concepts, their extraction and weighting are much more error-prone." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-13", "text": "Any error in concept extraction can result in a biased objective function, leading to poor sentence selection." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-14", "text": "(Gillick and Favre, 2009) Several authors, e.g., Woodsend and Lapata (2012) , and Li et al. 
(2013) , have followed Gillick and Favre (2009) in assuming that bigrams would lead to better practical performance than more syntactic or semantic concepts, even though bigrams serve as only an approximation of these." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-15", "text": "In this paper, we revisit this assumption and evaluate the maximum coverage objective for extractive text summarization with syntactic and semantic concepts." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-16", "text": "Specifically, we replace bigram concepts with new ones based on syntactic dependencies, semantic frames, as well as named entities." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-17", "text": "We show that using such concepts can lead to significant improvements in text summarization performance outside of the newswire domain." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-18", "text": "We evaluate coverage maximization incorporating syntactic and semantic concepts across three different domains: newswire, legal judgments, and Wikipedia articles." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-19", "text": "----------------------------------" }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-20", "text": "**CONCEPT COVERAGE MAXIMIZATION FOR EXTRACTIVE SUMMARIZATION**" }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-21", "text": "In extractive summarization, the unsupervised version of the task is sometimes set up as that of finding a subset of sentences in a document, within some relatively small budget, that covers as many of the important concepts in the document as possible." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-22", "text": "In the maximum coverage objective, concepts are considered as independent of each other." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-23", "text": "Concepts are weighted by the number of times they appear in a document." 
}, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-24", "text": "Moreover, due the NP-hardness of coverage maximization, for an exact solution to the concept coverage optimization problem, we resort to fast solvers for integer linear programming, under some appropriate constraints." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-25", "text": "Bigrams. Gillick and Favre (2009) proposed to use bigrams as concepts, and to weight their contribution to the objective function in Equation (1) by the frequency with which they occur in the document." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-26", "text": "Some pre-processing is first carried out to these bigrams: all bigrams consisting uniquely of stop-words are removed from consideration, and each word is stemmed." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-27", "text": "They also require bigrams to occur with a minimal frequency (cf. Section 3.2)." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-28", "text": "Named entities." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-29", "text": "We consider three new types of concepts, all suggested, but subsequently rejected by Gillick and Favre (2009 Semantic frames." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-30", "text": "The intuition behind our use of frame semantics is that a summary should represent the most central semantic frames (Fillmore, 1982; Fillmore et al., 2003) present in the corresponding document-indeed, we consider these frames to be actual types of concepts." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-31", "text": "We extract frame names from sentences for a further type of concepts under consideration." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-32", "text": "We use SE-MAFOR 3 to augment documents with semantic frames." 
}, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-33", "text": "----------------------------------" }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-34", "text": "**EXPERIMENTS**" }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-35", "text": "----------------------------------" }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-36", "text": "**DATA**" }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-37", "text": "In order to investigate the importance of concept types across different domains, we evaluate our systems across three distinct domains, which we refer to as ECHR, TAC08, and WIKIPEDIA." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-38", "text": "ECHR consists of judgment-summary pairs scraped from the European Court of Hu-1 http://www.nltk.org/ 2 http://nlp.stanford.edu/software/ lex-parser.shtml 3 http://www.ark.cs.cmu.edu/SEMAFOR/ man Rights case-law website, HUDOC 4 ." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-39", "text": "The document-summary pairs were split into training, development and test sets, consisting of 1018, 117, and 138 pairs, respectively." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-40", "text": "In the training set (pruning sentences of length less than 5), the average document length is 13,184 words or 455 sentences." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-41", "text": "The average summary length is 806 words or 28 sentences." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-42", "text": "For both documents and summaries, the average sentence length is 29 words." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-43", "text": "TAC08 consists of 48 queries and 2 newswire document sets for each query, each set containing 10 documents." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-44", "text": "Document sets contain 235 input sentences on average, and the mean sentence length is 25 words." 
}, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-45", "text": "Summaries consist of 4 sentences or 100 words on average." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-46", "text": "WIKIPEDIA consists of 992 Wikipedia articles (all labeled \"good article\" 5 ) from a comprehensive dump of English language Wikipedia articles 6 ." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-47", "text": "We use the Wikipedia abstracts (the leading paragraphs before the table of contents) as summaries." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-48", "text": "The (document,summary) pairs were split into training, development and test sets, consisting of 784, 97, and 111 pairs, respectively." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-49", "text": "In the training set (pruning sentences of length less than 5), the average document length is around 8918 words or 339 sentences." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-50", "text": "The average summary length is 335 words or 13 sentences." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-51", "text": "For both documents and summaries, the average sentence length is around 26 words." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-52", "text": "In our main experiments, we use unsupervised summarization techniques, and we only use the training summaries (and not the documents) to determine output summary lengths." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-53", "text": "----------------------------------" }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-54", "text": "**BASELINE AND SYSTEMS**" }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-55", "text": "Our baseline is the bigram-based extraction summarization system of Gillick and Favre (2009) , icsisumm 7 ." 
}, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-56", "text": "Their system was originally intended for multi-document update summarization, and summaries are extracted from document sentences that share more than k content words with some query." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-57", "text": "We follow this approach for the TAC08 data." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-58", "text": "For ECHR and WIKIPEDIA, the task is single document summarization, and the now irrelevant topic-document intersection preprocessing step is eliminated." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-59", "text": "The original system uses the GNU linear programming kit 8 with a time limit of 100 seconds." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-60", "text": "For all experiments presented in this paper, we double this time limit; we experimented with longer time limits on the development set for the ECHR data, without any performance improvements." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-61", "text": "Once the summarizer reaches the time limit, a summary is output based on the current feasible solution, whether the solution is optimal or not." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-62", "text": "Moreover, the current icsisumm (v1) distribution prunes sentences shorter than 10 words." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-63", "text": "We note that we also tried replacing glpk by gurobi 9 , for which no time limit was necessary, but found poorer results on the development set of the ECHR data." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-64", "text": "The original system takes several important input parameters." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-65", "text": "1. Summary length, for TAC08, is specified by the TAC 2008 conference guidelines as 100 words." 
}, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-66", "text": "For WIKIPEDIA and ECHR, we have access to training sets which gave an average summary length of around 335 and 805 words respectively, which we take as the standard output summary length." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-67", "text": "2." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-68", "text": "Concept count cut-off is the minimum frequency of concepts from the document (set) that qualifies them for consideration in coverage maximization." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-69", "text": "For bigrams of the original system on TAC08, there are two types of document sets: 'A' and 'B'." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-70", "text": "For 'A' type documents, Gillick and Favre (2009) set this threshold to 3 and for 'B' type documents, they set this to 4." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-71", "text": "For WIKIPEDIA and ECHR, we take the bigram threshold to be 4." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-72", "text": "In our extension of the system to other concepts, we do not use any threshold." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-73", "text": "3. First concept weighting: in multi-document summarization, there is the possibility for repeated sentences." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-74", "text": "Concepts from firstencountered sentences may be weighted higher: these concept counts from firstencountered sentences are doubled for 'B' documents and remain unchanged for 'A' documents in the original system on TAC08." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-75", "text": "For other concepts, we do not alter frequencies in this manner, which is justified by the task change to single-document summarization." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-76", "text": "4." 
}, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-77", "text": "Query-sentence intersection threshold, is set to 1 for 'A' documents and 0 to 'B' documents in the original system on TAC08." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-78", "text": "This threshold is only for the update summarization task and therefore does not concern ECHR and WIKIPEDIA." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-79", "text": "In addition to our baseline, we consider five single-concept systems using (a) named entities, (b) labeled dependencies, (c) unlabeled dependencies, (d) semantic frame names, and (e) semantic frame dependencies, as well as the five systems combining each of these new concept types with bigrams." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-80", "text": "For the combination of these new concepts with bigrams, we extend the objective function to maximise in, Equation (1), into two sums-one for bigram concepts and the other for the new concept type-with their relative importance controlled by a parameter \u03b1." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-81", "text": "N 1 and N 2 are the number of bigram and other concept types occurring with the permitted threshold frequency in the document, relatively." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-82", "text": "Given that we are carrying out unsupervised summarization, rather than tune \u03b1, we set \u03b1 = 0.5, so the concepts are considered in their totality (i.e., N 1 + N 2 concepts together) with no explicit favouring of one over the other that does not naturally fall out of concept frequency." 
}, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-83", "text": "(1\u2212\u03b1)" }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-84", "text": "----------------------------------" }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-85", "text": "**RESULTS**" }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-86", "text": "We evaluate output summaries using ROUGE-1, ROUGE-2, and ROUGE-SU4 (Lin, 2004) , with no stemming and retaining all stopwords." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-87", "text": "These measures have been shown to correlate best with human judgments in general, but among the automatic measures, ROUGE-1 and ROUGE-2 also correlate best with the Pyramid (Nenkova and Passonneau, 2004; Nenkova et al., 2007) and Responsiveness manual metrics (Louis and Nenkova, 2009) ." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-88", "text": "Moreover, ROUGE-1 has been shown to best reflect human-automatic summary comparisons (Owczarzak et al., 2012) ." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-89", "text": "For single concept systems, the results are shown in Table 1 , and concept combination system results are given in Table 2 ." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-90", "text": "We first note that our runs of the current distribution of icsisumm yield significantly worse ROUGE-2 results than reported in (Gillick and Favre, 2009 ) (see Table 1 , BIGRAMS): 0.081 compared to 0.110 respectively." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-91", "text": "On the TAC08 data, we observe no improvements over the baseline BIGRAM system for any ROUGE metric here." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-92", "text": "Hence, Gillick and Favre (2009) were right in their assumption that syntactic and semantic concepts would not lead to performance improvements, when restricting ourselves to this dataset." 
}, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-93", "text": "However, when we change domain to the legal judgments or Wikipedia articles, using syntactic and semantic concepts leads to significant gains across all the ROUGE metrics." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-94", "text": "For ECHR, replacing bigrams by frame names (FRAME) results in an increase of +0.1 in ROUGE-1, +0.031 in ROUGE-2 and +0.046 in ROUGE-SU4." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-95", "text": "We note that FrameNet 1.5 covers the legal domain quite well, which may explain why these concepts are particularly useful for the ECHR dataset." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-96", "text": "However, labeled (LDEP) and unlabeled (UDEP) dependencies also significantly outperform the baseline." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-97", "text": "For WIKIPEDIA, replacing bigrams by labeled or unlabeled syntactic dependencies results in significant improvements: an increase of +0.088 for ROUGE-1, +0.015 for ROUGE-2, and +0.03 for ROUGE-SU4." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-98", "text": "Interestingly, the NER system also yields significantly better performance over the baseline, which may reflect the nature of Wikipedia articles, often being about historical figures, famous places, organizations, etc." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-99", "text": "We observe in Table 2 , that for concept combination systems as well, ROUGE scores on TAC08 do not indicate any improvement in performance." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-100", "text": "However, best ROUGE-1 scores are produced for both ECHR and WIKIPEDIA data with systems that incorporate semantic frame names." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-101", "text": "For WIKIPEDIA, best ROUGE-2 and ROUGE-SU4 scores incorporate named-entity information." 
}, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-102", "text": "----------------------------------" }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-103", "text": "**RELATED WORK**" }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-104", "text": "Most researchers have used bigrams as concepts in coverage maximization-based approaches to unsupervised extractive summarization." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-105", "text": "Filatova and Hatzivassiloglou (2004) , however,use relations between named entities as concepts in extractive summarization." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-106", "text": "They use slightly different extraction algorithms, but their work is similar in spirit to ours." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-107", "text": "Nishikawa et al. (2010) , also, use opinions -tuples of targets, aspects, and polarityas concepts in opinion summarization." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-108", "text": "In early work on summarization, Silber and McCoy (2000) used WordNet synsets as concepts." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-109", "text": "Kitajima and Kobayashi (2011) Multidocument measure first proposed by Goldstein et al. (2000) for evaluating the importance of sentences in query-based extractive summarization, yielding improvements for their Japanese newswire dataset." }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-110", "text": "----------------------------------" }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-111", "text": "**CONCLUSIONS**" }, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-112", "text": "This paper challenges the assumption that bigrams make better concepts for unsupervised extractive summarization than syntactic and semantic concepts relying on automatic processing." 
}, { "sent_id": "9d1699d4ca3b4026ed5aab125a737d-C001-113", "text": "We show that using concepts relying on syntactic dependencies or semantic frames instead of bigrams leads to significant performance improvements of coverage maximization summarization across domains." } ], "y": { "@BACK@": { "gold_contexts": [ [ "9d1699d4ca3b4026ed5aab125a737d-C001-10" ], [ "9d1699d4ca3b4026ed5aab125a737d-C001-14" ], [ "9d1699d4ca3b4026ed5aab125a737d-C001-25" ], [ "9d1699d4ca3b4026ed5aab125a737d-C001-70" ] ], "cite_sentences": [ "9d1699d4ca3b4026ed5aab125a737d-C001-10", "9d1699d4ca3b4026ed5aab125a737d-C001-14", "9d1699d4ca3b4026ed5aab125a737d-C001-25", "9d1699d4ca3b4026ed5aab125a737d-C001-70" ] }, "@UNSURE@": { "gold_contexts": [ [ "9d1699d4ca3b4026ed5aab125a737d-C001-29" ], [ "9d1699d4ca3b4026ed5aab125a737d-C001-92" ] ], "cite_sentences": [ "9d1699d4ca3b4026ed5aab125a737d-C001-29", "9d1699d4ca3b4026ed5aab125a737d-C001-92" ] }, "@USE@": { "gold_contexts": [ [ "9d1699d4ca3b4026ed5aab125a737d-C001-55" ] ], "cite_sentences": [ "9d1699d4ca3b4026ed5aab125a737d-C001-55" ] }, "@DIF@": { "gold_contexts": [ [ "9d1699d4ca3b4026ed5aab125a737d-C001-90" ] ], "cite_sentences": [ "9d1699d4ca3b4026ed5aab125a737d-C001-90" ] } } }, "ABC_67b6d87aa2a943a854251fada6e183_18": { "x": [ { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-2", "text": "We implement a city-level geolocation prediction system for Twitter users." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-3", "text": "The system infers a user's location based on both tweet text and user-declared metadata using a stacking approach." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-4", "text": "We demonstrate that the stacking method substantially outperforms benchmark methods, achieving 49% accuracy on a benchmark dataset." 
}, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-5", "text": "We further evaluate our method on a recent crawl of Twitter data to investigate the impact of temporal factors on model generalisation." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-6", "text": "Our results suggest that user-declared location metadata is more sensitive to temporal change than the text of Twitter messages." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-7", "text": "We also describe two ways of accessing/demoing our system." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-8", "text": "----------------------------------" }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-10", "text": "In this paper, we present and evaluate a geolocation prediction method for Twitter users." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-11", "text": "1 Given a user's tweet data as input, the task of user level geolocation prediction is to infer a primary location (i.e., \"home location\": Mahmud et al. (2012) ) for the user from a discrete set of pre-defined locations (Cheng et al., 2010) ." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-12", "text": "For instance, President Obama's location might be predicted to be Washington D.C., USA, based on his public tweets and profile metadata." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-13", "text": "Geolocation information is essential to locationbased applications, like targeted advertising and local event detection (Sakaki et al., 2010; MacEachren et al., 2011) ." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-14", "text": "However, the means to obtain such information are limited." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-15", "text": "Although Twitter allows users to specify a plain text description of their location in their profile, these descriptions tend to be ad hoc and unreliable (Cheng et al., 2010) ." 
}, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-16", "text": "Recently, user geolocation prediction based on a user's tweets has become popular (Wing and Baldridge, 2011; Roller et al., 2012) , based on the assumption that tweets implicitly contain locating information, and with appropriate statistical modeling, the true location can be inferred." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-17", "text": "For instance, if a user frequently mentions NYC, JFK and yankees, it is likely that they are from New York City, USA." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-18", "text": "In this paper, we discuss an implementation of a global city-level geolocation prediction system for English Twitter users." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-19", "text": "The system utilises both tweet text and public profile metadata for modeling and inference." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-20", "text": "Specifically, we train multinomial Bayes classifiers based on location indicative words (LIWs) in tweets (Han et al., 2012) , and user-declared location and time zone metadata." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-21", "text": "These base classifiers are further stacked (Wolpert, 1992) using logistic regression as the meta-classifier." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-22", "text": "The proposed stacking model is compared with benchmarks on a public geolocation dataset." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-23", "text": "Experimental results demonstrate that our stacking model outperforms benchmark methods by a large margin, achieving 49% accuracy on the test data." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-24", "text": "We further evaluate the stacking model on a more recent crawl of public tweets." 
}, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-25", "text": "This experiment tests the effectiveness of a geolocation model trained on \"old\" data when applied to \"new\" data." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-26", "text": "The results reveal that user-declared locations are more variable over time than tweet text and time zone data." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-27", "text": "----------------------------------" }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-28", "text": "**BACKGROUND AND RELATED WORK**" }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-29", "text": "Identifying the geolocation of objects has been widely studied in the research literature over target objects including webpages (Zong et al., 2005) , search queries (Backstrom et al., 2008) , Flickr images (Crandall et al., 2009) and Wikipedia editors (Lieberman and Lin, 2009) ." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-30", "text": "Recently, a considerable amount of work has been devoted to extending geolocation prediction for Twitter users (Cheng et al., 2010; Eisenstein et al., 2010) ." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-31", "text": "The geolocations are usually represented by unambiguous city names or a partitioning of the earth's surface (e.g., grid cells specified by latitude/longitude)." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-32", "text": "User geolocation is generally related to a \"home\" location where a user regularly resides, and user mobility is ignored." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-33", "text": "Twitter allows users to declare their home locations in plain text in their profile, however, this data has been found to be unstructured and ad hoc in preliminary research (Cheng et al., 2010; Hecht et al., 2011) ." 
}, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-34", "text": "While popular for desktop machine geolocation, methods that map IP addresses to physical locations (Buyukokkten et al., 1999 ) cannot be applied to Twitter-based user geolocation, as IPs are only known to the service provider and are non-trivial to retrieve in a mobile Internet environment." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-35", "text": "Although social network information has been proven effective in inferring user locations (Backstrom et al., 2010; Sadilek et al., 2012; Rout et al., 2013) , we focus exclusively on message and metadata information in this paper, as they are more readily accessible." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-36", "text": "Text data tends to contain salient geospatial expressions that are particular to specific regions." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-37", "text": "Attempts to leverage this data directly have been based on analysis of gazetted expressions (Leidner and Lieberman, 2011) or the identification of geographical entities (Quercini et al., 2010; Qin et al., 2003) ." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-38", "text": "However these methods are limited in their ability to capture informal geospatial expressions (e.g. Brissie for Brisbane) and more nongeospatial terms which are associated with particular locations (e.g. ferry for Seattle or Sydney)." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-39", "text": "Beyond identifying geographical references using off-the-shelf tools, more sophisticated methods have been introduced in the social media realm." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-40", "text": "Cheng et al. (2010) built a simple generative model based on tweet words, and further added words which are local to particular regions and applied smoothing to under-represented locations." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-41", "text": "Kinsella et al. 
(2011) applied different similarity measures to the task, and investigated the relative difficulty of geolocation prediction at city, state, and country levels." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-42", "text": "Wing and Baldridge (2011) introduced a grid-based representation for geolocation modeling and inference based on fixed latitude and longitude values, and aggregated all tweets in a single cell." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-43", "text": "Their approach was then based on lexical similarity using KL-divergence." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-44", "text": "One drawback to the uniform-sized cell representation is that it introduces class imbalance: urban areas tend to contain far more tweets than rural areas." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-45", "text": "Based on this observation, Roller et al. (2012) introduced an adaptive grid representation in which cells contain approximately the same number of users, based on a KD-tree partition." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-90", "text": "Because of the different class representations, Acc numbers are not comparable between the benchmarks." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-46", "text": "Given that most tweets are from urban areas, Han et al. (2012) consider a city-based class division, and explore different feature selection methods to extract \"location indicative words\", which they show to improve prediction accuracy." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-47", "text": "Additionally, time zone information has been incorporated in a coarse-to-fine hierarchical model by first determining the time zone, and then disambiguating locations within it (Mahmud et al., 2012)."
}, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-48", "text": "Topic models have also been applied to the task, in capturing regional linguistic differences (Eisenstein et al., 2010; Yin et al., 2011; Hong et al., 2012) ." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-49", "text": "When designing a practical geolocation system, simple models such as naive Bayes and nearest prototype methods (e.g., based on KL divergence) have clear advantages in terms of training and classification throughput, given the size of the class set (often numbering in the thousands of classes) and sheer volume of training data (potentially in the terabytes of data)." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-50", "text": "This is particularly important for online systems and downstream applications that require timely predictions." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-51", "text": "As such, we build off the text-based naive Bayes-based geolocation system of Han et al. (2012) , which our experiments have shown to have a good balance of tractability and accuracy." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-52", "text": "By selecting a reduced set of \"location indicative words\", prediction can be further accelerated." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-53", "text": "----------------------------------" }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-54", "text": "**METHODOLOGY**" }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-55", "text": "In this study, we adopt the same city-based representation and multinomial naive Bayes learner as Han et al. (2012) ." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-56", "text": "The city-based representation consists of 3,709 cities throughout the world, and is obtained by aggregating smaller cities with the largest nearby city." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-57", "text": "Han et al. 
(2012) found that using feature selection to identify \"location indicative words\" led to improvements in geolocation performance." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-58", "text": "We use the same feature selection technique that they did." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-59", "text": "Specifically, feature selection is based on information gain ratio (IGR) (Quinlan, 1993) over the city-based label set for each word." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-60", "text": "In the original research of Han et al. (2012)," }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-61", "text": "only the text of Twitter messages was used, and training was based exclusively on geotagged tweets, despite these accounting for only around 1% of the total public data on Twitter." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-62", "text": "In this research, we include additional non-geotagged tweets (e.g., posted from a non-GPS-enabled device) for those users who have geotagged tweets (allowing us to determine a home location for the user)." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-63", "text": "In addition to including non-geotagged data in modeling and inference, we further take advantage of the text-based metadata embedded in a user's public profile (and included in the JSON object for each tweet)." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-64", "text": "This metadata is potentially complementary to the tweet message and of benefit for geolocation prediction, especially the user-declared location and time zone, which we consider here." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-65", "text": "Note that these are in free text rather than a structured data format, and that while there are certainly instances of formal place name descriptions (e.g., Edinburgh, UK), they are often informal (e.g., mel for Melbourne)."
}, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-114", "text": "The stacked model accuracy numbers drop 5-8% on LIVE test , and the median error distance increases moderately by 31km." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-66", "text": "As such, we adopt a statistical approach to model each selected metadata field, by capturing the text in the form of character 4-grams, and training a multinomial naive Bayes classifier for each field." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-67", "text": "To combine together the tweet text and metadata fields, we use stacking (Wolpert, 1992) ." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-68", "text": "The training of stacking consists of two steps." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-69", "text": "First, a multinomial naive Bayes base classifier (L0) is learned for each data type using 10-fold cross validation." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-70", "text": "This is carried out for the tweet text (TEXT), user-declared location (MB-LOC) and user-declared time zone (MB-TZ)." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-71", "text": "Next, a metaclassifier (L1 classifier) is trained over the base classifiers, using a logistic regression learner (Fan et al., 2008) ." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-72", "text": "----------------------------------" }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-73", "text": "**EVALUATION AND DISCUSSION**" }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-74", "text": "In this section, we compare our proposed stacking approach with existing benchmarks on a public dataset, and investigate the impact of time using a recently collected dataset." 
}, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-75", "text": "----------------------------------" }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-76", "text": "**EVALUATION MEASURES**" }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-77", "text": "In line with other work on user geolocation prediction, we use three evaluation measures:" }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-78", "text": "\u2022 Acc : The percentage of correct city-level predictions." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-79", "text": "\u2022 Acc@161 : The percentage of predicted locations which are within a 161km (100 mile) (Cheng et al., 2010) , to capture near-misses (e.g., Edinburgh UK being predicted as Glasgow, UK)." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-80", "text": "\u2022 Median : The median distance from the predicted city to the home location (Eisenstein et al., 2010) ." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-81", "text": "----------------------------------" }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-82", "text": "**COMPARISON WITH BENCHMARKS**" }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-83", "text": "We base our evaluation on the publicly-available WORLD dataset of Han et al. (2012) ." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-84", "text": "The dataset contains 1.4M users whose tweets are primarily identified as English based on the output of the langid.py language identification tool (Lui and Baldwin, 2012) , and who have posted at least 10 geotagged tweets." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-85", "text": "The city-level home location for a geotagged user is determined as follows." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-86", "text": "First, each of a user's geotagged tweets is mapped to its nearest city (based on the same set of 3,709 cities used for the city-based location representation)." 
}, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-87", "text": "Then, the most frequent city for a user is selected as the home location." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-88", "text": "To benchmark our method, we reimplement two recently-published state-of-the-art methods:" }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-89", "text": "(1) the KL-divergence nearest prototype method of Roller et al. (2012) based on KD-tree partitioned grid cells, which we denote as KL; and (2) the multinomial naive Bayes city-level geolocation model of Han et al. (2012) , which we denote as MB." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-91", "text": "To remedy this, we find the closest city to the centroid of each grid cell in the KD-tree representation, and map the classification onto this city." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-92", "text": "We present results including non-geotagged data for users with geotagged messages for the two methods, as KL-NG and MB-NG, respectively." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-93", "text": "We also present results based on the user-declared location (MB-LOC) and time zone (MB-TZ), and finally the stacking method (STACKING) which combines MB-NG, MB-LOC and MB-TZ." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-94", "text": "The results are shown in Table 1 ." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-95", "text": "The approximate doubling of Acc for KL-NG and MB-NG over KL and MB, respectively, demonstrates the high utility of non-geotagged data in tweet text-based geolocation prediction." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-96", "text": "Of the two original models, we can see that MB is comparable to KL, in line with the findings of Han et al. (2012) ." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-97", "text": "The MB-LOC results are by far the highest of all the base classifiers." 
}, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-98", "text": "Contrary to the suggestion of Cheng et al. (2010) that userdeclared locations are too unreliable to use for user geolocation, we find evidence indicating that they are indeed a valuable source of information for this task." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-99", "text": "The best overall results are achieved for the stacking approach (STACKING), assigning almost half of the test users to the correct city-level location, and improving more than four-fold on the previous-best accuracy (i.e., MB)." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-100", "text": "These results also suggest that there is strong complementarity between user metadata and tweet text." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-101", "text": "----------------------------------" }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-102", "text": "**EVALUATION ON TIME-HETEROGENEOUS DATA**" }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-103", "text": "In addition to the original held-out test data (WORLD test ) from WORLD, we also developed a new geotagged evaluation dataset using the Twitter Streaming API." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-104", "text": "2 This new LIVE test dataset is intended to evaluate the impact of time on predictive accuracy." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-105", "text": "The training and test data in WORLD are time-homogeneous as they are randomly sampled from data collected in a relatively narrow time window." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-106", "text": "In contrast, LIVE test is much newer, collected more than 1 year later than WORLD." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-107", "text": "Given that Twitter users and topics change over time, an essential question is whether the statistical model learned from the \"old\" training data is still effective over the \"new\" test data?" 
}, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-108", "text": "The LIVE test data was collected over 48 hours from 2013/03/03 to 2013/03/05." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-109", "text": "By selecting users with at least 10 geotagged tweets and a declared language of English, 55k users were obtained." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-110", "text": "For each user, their recent status updates were aggregated, and non-English users were filtered out based on the language predictions of langid.py." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-111", "text": "For some users with geotagged tweets from many cities, the most frequent city might not be an appropriate representation of their home location for evaluation." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-112", "text": "To improve the evaluation data quality, we therefore exclude users who have less than 50% of their geotagged tweets originating from a single city." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-113", "text": "After filtering, 32k Table 2 ." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-115", "text": "Overall, the numbers suggest inference on WORLD test , which is time-homogenous with the training data (taken from WORLD), is an easier classification than LIVE test , which is time-heterogeneous with the training data." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-116", "text": "Training on \"old\" data and testing on \"new\" data is certainly possible, however. Looking over the results of the base classifiers, we can see that the biggest hit is for MB-LOC classifier." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-117", "text": "In contrast, the accuracy for MB-NG and MB-TZ is relatively stable (other than the sharp increase in the median error distance for MB-TZ)." 
}, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-118", "text": "----------------------------------" }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-119", "text": "**ARCHITECTURE AND ACCESS**" }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-120", "text": "In this section, we describe the architecture of the proposed geolocation system, as well as two ways of accessing the live system." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-121", "text": "3 The core structure of the system consists of two parts: (1) the interface; (2) the back-end geolocation service." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-122", "text": "We offer two interfaces to access the system: a Twitter bot and a web interface." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-123", "text": "The Twitter bot account is: @MELBLTFSD." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-124", "text": "A daemon process detects any user mentions of the bot in tweets via keyword matching through the Twitter search API." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-125", "text": "The screen name of the tweet author is extracted and sent to the back-end geolocation service, and the predicted user geolocation is sent to the Twitter user in a direct message, as shown in Figure 1 ." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-126", "text": "Web access is via http://hum.csse.unimelb." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-127", "text": "edu.au:9000/geo.html." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-128", "text": "Users can input a Twitter user screen name through the web interface, whereby a call is made to the back-end geolocation service to geolocate that user." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-129", "text": "The geoloca- tion results are rendered on a map (along with any geotagged tweets for the user) as in Figure 2 ." 
}, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-130", "text": "4 The back-end geolocation service crawls recent tweets for a given user in real time, 5 and word and n-gram features are extracted from both the text and the user metadata." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-131", "text": "These features are sent to the L0 classifiers (TEXT, MB-LOC and MB-TZ), and the L0 results are further fed into the L1 classifier for the final prediction." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-132", "text": "----------------------------------" }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-133", "text": "**SUMMARY AND FUTURE WORK**" }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-134", "text": "In this paper, we presented a city-level geolocation prediction system for Twitter users." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-135", "text": "Over a public dataset, our stacking method -exploiting both tweet text and user metadata -substantially outperformed benchmark methods." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-136", "text": "We further evaluated model generalisation on a newer, timeheterogeneous dataset." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-137", "text": "The overall results decreased by 5-8% in accuracy, compared with numbers on time-homogeneous data, primarily due to the poor generalisation of the MB-LOC classifier." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-138", "text": "In future work, we plan to further investigate the cause of the MB-LOC classifier accuracy decrease on the new dataset." }, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-139", "text": "In addition, we'd like to study differences in prediction accuracy across cities." 
}, { "sent_id": "67b6d87aa2a943a854251fada6e183-C001-140", "text": "For cities with reliable predictions, the system can be adapted as a preprocessing module for downstream applications, e.g., local event detection based on users with reliable predictions." } ], "y": { "@USE@": { "gold_contexts": [ [ "67b6d87aa2a943a854251fada6e183-C001-20" ], [ "67b6d87aa2a943a854251fada6e183-C001-55" ], [ "67b6d87aa2a943a854251fada6e183-C001-57", "67b6d87aa2a943a854251fada6e183-C001-58" ], [ "67b6d87aa2a943a854251fada6e183-C001-83" ], [ "67b6d87aa2a943a854251fada6e183-C001-88", "67b6d87aa2a943a854251fada6e183-C001-89" ] ], "cite_sentences": [ "67b6d87aa2a943a854251fada6e183-C001-20", "67b6d87aa2a943a854251fada6e183-C001-55", "67b6d87aa2a943a854251fada6e183-C001-57", "67b6d87aa2a943a854251fada6e183-C001-83", "67b6d87aa2a943a854251fada6e183-C001-89" ] }, "@BACK@": { "gold_contexts": [ [ "67b6d87aa2a943a854251fada6e183-C001-46" ], [ "67b6d87aa2a943a854251fada6e183-C001-57", "67b6d87aa2a943a854251fada6e183-C001-58" ], [ "67b6d87aa2a943a854251fada6e183-C001-60", "67b6d87aa2a943a854251fada6e183-C001-61" ] ], "cite_sentences": [ "67b6d87aa2a943a854251fada6e183-C001-46", "67b6d87aa2a943a854251fada6e183-C001-57", "67b6d87aa2a943a854251fada6e183-C001-60" ] }, "@EXT@": { "gold_contexts": [ [ "67b6d87aa2a943a854251fada6e183-C001-51" ] ], "cite_sentences": [ "67b6d87aa2a943a854251fada6e183-C001-51" ] }, "@SIM@": { "gold_contexts": [ [ "67b6d87aa2a943a854251fada6e183-C001-55" ], [ "67b6d87aa2a943a854251fada6e183-C001-57", "67b6d87aa2a943a854251fada6e183-C001-58" ], [ "67b6d87aa2a943a854251fada6e183-C001-96" ] ], "cite_sentences": [ "67b6d87aa2a943a854251fada6e183-C001-55", "67b6d87aa2a943a854251fada6e183-C001-57", "67b6d87aa2a943a854251fada6e183-C001-96" ] }, "@MOT@": { "gold_contexts": [ [ "67b6d87aa2a943a854251fada6e183-C001-57", "67b6d87aa2a943a854251fada6e183-C001-58" ] ], "cite_sentences": [ "67b6d87aa2a943a854251fada6e183-C001-57" ] } } }, "ABC_43622e43d6ef5291b64320d2d68b95_18": { "x": 
[ { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-40", "text": "We thus propose to convolute the items in adjacent heads (2D-CSANs, Figure 1 (c))." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-2", "text": "Self-attention networks (SANs) have drawn increasing interest due to their high parallelization in computation and flexibility in modeling dependencies." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-3", "text": "SANs can be further enhanced with multi-head attention by allowing the model to attend to information from different representation subspaces." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-4", "text": "In this work, we propose novel convolutional self-attention networks, which offer SANs the abilities to 1) strengthen dependencies among neighboring elements, and 2) model the interaction between features extracted by multiple attention heads." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-5", "text": "Experimental results of machine translation on different language pairs and model settings show that our approach outperforms both the strong Transformer baseline and other existing models on enhancing the locality of SANs." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-6", "text": "Comparing with prior studies, the proposed model is parameter free in terms of introducing no more parameters." 
}, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-7", "text": "----------------------------------" }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-9", "text": "Self-attention networks (SANs) (Parikh et al., 2016; Lin et al., 2017) have shown promising empirical results in various natural language processing (NLP) tasks, such as machine translation (Vaswani et al., 2017) , natural language inference (Shen et al., 2018a) , and acoustic modeling (Sperber et al., 2018) ." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-10", "text": "One appealing strength of SANs lies in their ability to capture dependencies regardless of distance by explicitly attending to all the elements." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-11", "text": "In addition, the performance of SANs can be improved by multi-head attention (Vaswani et al., 2017) , which projects the input sequence into multiple subspaces and applies attention to the representation in each subspace." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-12", "text": "Despite their success, SANs have two major limitations." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-13", "text": "First, the model fully take into ac- * Zhaopeng Tu is the corresponding author of the paper." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-14", "text": "This work was conducted when Baosong Yang was interning at Tencent AI Lab." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-15", "text": "count all the elements, which disperses the attention distribution and thus overlooks the relation of neighboring elements and phrasal patterns Guo et al., 2019) ." 
}, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-16", "text": "Second, multi-head attention extracts distinct linguistic properties from each subspace in a parallel fashion (Raganato and Tiedemann, 2018) , which fails to exploit useful interactions across different heads." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-17", "text": "Recent work shows that better features can be learned if different sets of representations are present at feature learning time (Ngiam et al., 2011; Lin et al., 2014) ." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-18", "text": "To this end, we propose novel convolutional self-attention networks (CSANs), which model locality for self-attention model and interactions between features learned by different attention heads in an unified framework." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-19", "text": "Specifically, in order to pay more attention to a local part of the input sequence, we restrict the attention scope to a window of neighboring elements." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-20", "text": "The localness is therefore enhanced via a parameter-free 1-dimensional convolution." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-21", "text": "Moreover, we extend the convolution to a 2-dimensional area with the axis of attention head." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-22", "text": "Thus, the proposed model allows each head to interact local features with its adjacent subspaces at attention time." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-23", "text": "We expect that the interaction across different subspaces can further improve the performance of SANs." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-24", "text": "We evaluate the effectiveness of the proposed model on three widely-used translation tasks: WMT14 English-to-German, WMT17 Chineseto-English, and WAT17 Japanese-to-English." 
}, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-25", "text": "Experimental results demonstrate that our approach consistently improves performance over the strong TRANSFORMER model (Vaswani et al., 2017) across language pairs." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-26", "text": "Comparing with previous work on modeling locality for SANs (e.g. Shaw et al., 2018; Sperber et al., 2018) , our model boosts performance on both translation quality and training efficiency." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-27", "text": "2 Multi-Head Self-Attention Networks" }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-28", "text": "SANs produce representations by applying attention to each pair of tokens from the input sequence, regardless of their distance." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-29", "text": "Vaswani et al. (2017) found it is beneficial to capture different contextual features with multiple individual attention functions." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-30", "text": "Given an input sequence X = {x 1 , . . . , x I } \u2208 R I\u00d7d , the model first transforms it into queries Q, keys K, and values V:" }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-31", "text": "where" }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-32", "text": "where ATT(\u00b7) is an attention model (Bahdanau et al., 2015; Vaswani et al., 2017) that retrieves the keys K h with the query q h i ." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-33", "text": "The final output representation O is the concatenation of outputs generated by multiple attention models:" }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-34", "text": "3 Approach" }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-35", "text": "As shown in Figure 1 (a), the vanilla SANs use the query q h i to compute a categorical distribution over all elements from K h (Equation 2)." 
}, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-36", "text": "It may inherit the attention to neighboring information (Yu et al., 2018; Guo et al., 2019) ." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-37", "text": "In this work, we propose to model locality for SANs by restricting the model to attend to a local region via convolution operations (1D-CSANs, Figure 1(b) )." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-38", "text": "Accordingly, it provides distance-aware information (e.g. phrasal patterns), which is complementary to the distance-agnostic dependencies modeled by the standard SANs (Section 3.1)." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-39", "text": "Moreover, the calculation of output o h are restricted to the a single individual subspace, overlooking the richness of contexts and the dependencies among groups of features, which have proven beneficial to the feature learning (Ngiam et al., 2011; Wu and He, 2018) ." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-41", "text": "The proposed model is expected to improve performance through interacting linguistic properties across heads (Section 3.2)." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-42", "text": "----------------------------------" }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-43", "text": "**LOCALITY MODELING VIA 1D CONVOLUTION**" }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-44", "text": "For each query q h i , we restrict its attention region (e.g., K h = {k h 1 , . . . , k h i , . . . 
, k h I }) to a local scope with a fixed size M + 1 (M \u2264 I) centered at the position i:" }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-45", "text": "Accordingly, the calculation of corresponding output in Equation (2) is modified as:" }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-46", "text": "As seen, SANs are only allowed to attend to the neighboring tokens (i.e., the restricted K h , V h ), instead of all the tokens in the sequence (i.e., the full K h , V h )." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-47", "text": "The SAN-based models are generally implemented as multiple layers, in which higher layers tend to learn semantic information while lower layers capture surface and lexical information (Peters et al., 2018; Raganato and Tiedemann, 2018) ." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-48", "text": "Therefore, we merely apply locality modeling to the lower layers, the same configuration as in Yu et al. (2018) ." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-49", "text": "In this way, the representations are learned in a hierarchical fashion (Yang et al., 2017) ." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-50", "text": "That is, the distance-aware and local information extracted by the lower SAN layers, is expected to complement distance-agnostic and global information captured by the higher SAN layers." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-51", "text": "----------------------------------" }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-52", "text": "**ATTENTION INTERACTION VIA 2D CONVOLUTION**" }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-53", "text": "The multi-head mechanism allows different heads to capture distinct linguistic properties (Raganato and Tiedemann, 2018; Li et al., 2018) , especially in diverse local contexts ." 
}, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-54", "text": "We hypothesis that exploiting local properties across heads can further improve the performance of SANs." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-55", "text": "To this end, we expand the 1-dimensional window to a 2-dimensional area with the new dimension being the index of attention head." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-56", "text": "Suppose that the area size is (N + 1) \u00d7 (M + 1) (N \u2264 H), the keys and values in the area are:" }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-57", "text": "where K h , V h are elements in the h-th subspace, which are calculated by Equations 4 and 5 respectively." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-58", "text": "The union operation means combining the keys and values in different subspaces." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-59", "text": "The corresponding output is calculated as:" }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-60", "text": "The 2D convolution allows SANs to build relevance between elements across adjacent heads, thus flexibly extract local features from different subspaces rather than merely from an unique head." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-61", "text": "The vanilla SAN models linearly aggregate features from different heads, and this procedure limits the extent of abstraction (Fukui et al., 2016; ." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-62", "text": "Multiple sets of representations presented at feature learning time can further improve the expressivity of the learned features (Ngiam et al., 2011; Wu and He, 2018) ." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-63", "text": "Concerning modeling locality for SANs, Yu et al. (2018) injected several CNN layers (Kim, 2014) to fuse local information, the output of which is fed to the subsequent SAN layer." 
}, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-64", "text": "Several researches proposed to revise the attention distribution with a parametric localness bias, and succeed on machine translation and natural language inference (Guo et al., 2019) ." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-65", "text": "While both models introduce additional parameters, our approach is a more lightweight solution without introducing any new parameters." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-116", "text": "**UNIVERSALITY OF THE PROPOSED MODEL**" }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-66", "text": "Closely related to this work, Shen et al. (2018a) applied a positional mask to encode temporal order, which only allows SANs to attend to the previous or following tokens in the sequence." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-67", "text": "In contrast, we employ a positional mask (i.e. the tokens outside the local window is masked as 0) to encode the distance-aware local information." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-68", "text": "In the context of distance-aware SANs, Shaw et al. (2018) introduced relative position encoding to consider the relative distances between sequence elements." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-69", "text": "While they modeled locality from position embedding, we improve locality modeling from revising attention scope." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-70", "text": "To make a fair comparison, we re-implemented the above approaches under a same framework." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-71", "text": "Empirical results on machine translation tasks show the superiority of our approach in both translation quality and training efficiency." 
}, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-72", "text": "Multi-Head Attention Multi-head attention mechanism (Vaswani et al., 2017) employs different attention heads to capture distinct features (Raganato and Tiedemann, 2018) ." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-73", "text": "Along this direction, Shen et al. (2018a) explicitly used multiple attention heads to model different dependencies of the same word pair, and Strubell et al. (2018) employed different attention heads to capture different linguistic features." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-74", "text": "Li et al. (2018) introduced disagreement regularizations to encourage the diversity among attention heads." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-75", "text": "Inspired by recent successes on fusing information across layers (Dou et al., 2018 , proposed to aggregate information captured by different attention heads." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-76", "text": "Based on these findings, we model interactions among attention heads to exploit the richness of local properties distributed in different heads." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-77", "text": "----------------------------------" }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-78", "text": "**EXPERIMENTS**" }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-79", "text": "We conducted experiments with the Transformer model (Vaswani et al., 2017) on English\u21d2German (En\u21d2De), Chinese\u21d2English (Zh\u21d2En) and Japanese\u21d2English (Ja\u21d2En) translation tasks." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-80", "text": "For the En\u21d2De and Zh\u21d2En tasks, the models were trained on widely-used WMT14 and WMT17 corpora, consisting of around 4.5 and 20.62 million sentence pairs, respectively." 
}, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-81", "text": "Concerning Ja\u21d2En, we used the first two sections of WAT17 corpus as the training data, which consists of 2M sentence pairs." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-82", "text": "To reduce the vocabulary size, all the data were tokenized and segmented into subword symbols using byte-pair encoding (Sennrich et al., 2016) with 32K merge operations." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-83", "text": "Following Shaw et al. (2018) , we incorporated the proposed model into the encoder, which is a stack of 6 SAN layers." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-84", "text": "Prior studies revealed that modeling locality in lower layers can achieve better performance (Shen et al., 2018b; Yu et al., 2018; , we applied our approach to the lowest three layers of the encoder." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-85", "text": "About configurations of NMT models, we used the Base and Big settings same as Vaswani et al. (2017) , and all models were trained on 8 NVIDIA P40 GPUs with a batch of 4096 tokens." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-86", "text": "----------------------------------" }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-87", "text": "**EFFECTS OF WINDOW/AREA SIZE**" }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-88", "text": "We first investigated the effects of window size (1D-CSANs) and area size (2D-CSANs) on En\u21d2De validation set, as plotted in Figure 2 ." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-89", "text": "For 1D-CSANs, the local size with 11 is superior to other settings." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-90", "text": "This is consistent with Luong et al. (2015) who found that 10 is the best window size in their local attention experiments." 
}, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-91", "text": "Then, we fixed the number of neighboring tokens being 11 and varied the number of heads." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-92", "text": "As seen, by considering the features across heads (i.e. > 1), 2D-CSANs further improve the translation quality." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-93", "text": "However, when the number of heads in attention goes up, the translation quality inversely drops." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-94", "text": "One possible reason is that the model still has the flexibility of learning a different distribution for each head with few interactions, while a large amount of interactions assumes more heads make \"similar contributions\" (Wu and He, 2018) ." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-95", "text": "----------------------------------" }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-96", "text": "**ACCURACY OF PHRASE TRANSLATION**" }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-97", "text": "One intuition of our approach is to capture useful phrasal patterns via modeling locality." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-98", "text": "To evaluate the accuracy of phrase translations, we calculate the improvement of the proposed approaches over multiple granularities of n-grams, as shown in Figure 3 ." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-99", "text": "Both the two model variations consistently outperform the baseline on larger granularities, indicating that modeling locality can raise the ability of self-attention model on capturing the phrasal information." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-100", "text": "Furthermore, the dependencies among heads can be complementary to the localness modeling, which reveals the necessity of the interaction of features in different subspaces." 
}, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-101", "text": "----------------------------------" }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-102", "text": "**COMPARISON TO RELATED WORK**" }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-103", "text": "We re-implemented and compared several exiting works (Section 4) upon the same framework." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-104", "text": "Table 1 lists the results on the En\u21d2De translation task." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-105", "text": "As seen, all the models improve translation quality, reconfirming the necessity of modeling locality and distance information." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-106", "text": "Besides, our models outperform all the existing works, indicating the superiority of the proposed approaches." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-107", "text": "In particular, CSANs achieve better performance than" }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-108", "text": "----------------------------------" }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-109", "text": "**MODEL**" }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-110", "text": "Parameter Speed BLEU TRANSFORMER-BASE (Vaswani et al., 2017) 88.0M 1.28 27.31 -+ BI DIRECT (Shen et al., 2018a) +0.0M -0.00 27.58 +0.27 + REL POS (Shaw et al., 2018) +0.1M -0.11 27.63 +0.32 + NEIGHBOR (Sperber et al., 2018) +0.4M -0.06 27.60 +0.29 + LOCAL HARD (Luong et al., 2015) +0.4M -0.06 27.73 +0.42 + LOCAL SOFT +0.8M -0.09 27.81 +0.50 + BLOCK (Shen et al., 2018b) +6.0M -0.33 27.59 +0.28 + CNNs (Yu et al., 2018) +42 Table 2 : Experimental results on WMT14 En\u21d2De, WMT17 Zh\u21d2En and WAT17 Ja\u21d2En test sets." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-111", "text": "\"Speed\" denotes the training speed (steps/second)." 
}, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-112", "text": "\"\u2191 / \u21d1\" indicates statistically significant difference from the vanilla self-attention counterpart (p < 0.05/0.01), tested by bootstrap resampling (Koehn, 2004) ." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-113", "text": "CNNs, revealing that extracting local features with dynamic weights is superior to assigning fixed parameters." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-114", "text": "Moreover, while most of the existing approaches (except for Shen et al. (2018a) ) introduce new parameters, our methods are parameter-free and thus only marginally affect training efficiency." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-115", "text": "----------------------------------" }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-117", "text": "To validate the universality of our approach on MT tasks, we evaluated the proposed approach on different language pairs and model settings." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-118", "text": "Table 2 lists the results on En\u21d2De, Zh\u21d2En and Ja\u21d2En translation tasks." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-119", "text": "As seen, our model consistently improves translation quality across language pairs, which demonstrates the effectiveness and universality of the proposed approach." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-120", "text": "It is encouraging to see that CSANs with base setting yields comparable performance with TRANSFORMER-BIG." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-121", "text": "----------------------------------" }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-122", "text": "**CONCLUSION**" }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-123", "text": "In this paper, we propose a parameter-free convolutional self-attention model to enhance the feature extraction of neighboring elements across multiple heads." 
}, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-124", "text": "Empirical results of machine translation task on a variety of language pairs demonstrate the effectiveness and universality of the proposed methods." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-125", "text": "The extensive analyses suggest that: 1) modeling locality is beneficial to SANs; 2) interacting features across multiple heads at attention time can further improve the performance; and 3) to some extent, the dynamic weights are superior to their fixed counterpart (i.e. CSANs vs. CNNs) on local feature extraction." }, { "sent_id": "43622e43d6ef5291b64320d2d68b95-C001-126", "text": "Moreover, it is interesting to validate the proposed model in other sequence modeling tasks." } ], "y": { "@BACK@": { "gold_contexts": [ [ "43622e43d6ef5291b64320d2d68b95-C001-9" ], [ "43622e43d6ef5291b64320d2d68b95-C001-11" ], [ "43622e43d6ef5291b64320d2d68b95-C001-32" ], [ "43622e43d6ef5291b64320d2d68b95-C001-72" ] ], "cite_sentences": [ "43622e43d6ef5291b64320d2d68b95-C001-9", "43622e43d6ef5291b64320d2d68b95-C001-11", "43622e43d6ef5291b64320d2d68b95-C001-32", "43622e43d6ef5291b64320d2d68b95-C001-72" ] }, "@DIF@": { "gold_contexts": [ [ "43622e43d6ef5291b64320d2d68b95-C001-25" ] ], "cite_sentences": [ "43622e43d6ef5291b64320d2d68b95-C001-25" ] }, "@USE@": { "gold_contexts": [ [ "43622e43d6ef5291b64320d2d68b95-C001-79" ] ], "cite_sentences": [ "43622e43d6ef5291b64320d2d68b95-C001-79" ] }, "@SIM@": { "gold_contexts": [ [ "43622e43d6ef5291b64320d2d68b95-C001-85" ] ], "cite_sentences": [ "43622e43d6ef5291b64320d2d68b95-C001-85" ] } } }, "ABC_b49e6f8181d51a998c6c27a830b98e_18": { "x": [ { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-2", "text": "We present work in progress on the temporal progression of compositionality in noun-noun compounds." 
}, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-3", "text": "Previous work has proposed computational methods for determining the compositionality of compounds." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-4", "text": "These methods try to automatically determine how transparent the meaning of the compound as a whole is with respect to the meaning of its parts." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-5", "text": "We hypothesize that such a property might change over time." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-6", "text": "We use the time-stamped Google Books corpus for our diachronic investigations, and first examine whether the vectorbased semantic spaces extracted from this corpus are able to predict compositionality ratings, despite their inherent limitations." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-7", "text": "We find that using temporal information helps predicting the ratings, although correlation with the ratings is lower than reported for other corpora." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-8", "text": "Finally, we show changes in compositionality over time for a selection of compounds." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-9", "text": "----------------------------------" }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-11", "text": "Compositionality is a long debated issue in theoretical linguistics." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-12", "text": "The principle of compositionality (Partee, 1984) states that the meaning of an expression is a function of the meanings of its parts and of the way they are syntactically combined." 
}, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-13", "text": "It is often used to describe how the meaning of a sentence can be derived from the meaning of single words and phrases, but the principle might also be postulated for compounding, i.e. the process of combining two or more lexemes to form a new concept (Bauer, 2017, p. 1 and 4) ." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-14", "text": "Compounds can often be directly derived from the meanings of the involved compound constituents (e.g. graduate student, speed limit), however, we also find compounds whose meanings can only be derived partially from their components (night owl, hot dog)." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-15", "text": "Surprisingly, diachronic perspectives on compositionality 1 are virtually absent from previous work." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-16", "text": "To the best of our knowledge, we present the first study on the compositionality of compounds over time." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-17", "text": "We bring two strands of research together." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-18", "text": "On the one hand we are inspired by the synchronic work on predicting the degree of compositionality of compounds by comparing the vector-based representations of the parts to the vector-based representations of the compound as a whole." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-19", "text": "On the other hand, we rely on methods designed for detecting semantic change, such as presented in Hamilton et al. (2018) , to study compositionality in compounds from a diachronic viewpoint." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-20", "text": "----------------------------------" }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-21", "text": "**RELATED WORK**" }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-22", "text": "From a synchronic perspective, Reddy et al. 
(2011 ), Schulte im Walde et al. (2013 and Schulte im Walde et al. (2016a) are closest to our approach, since they predict the compositionality of compounds using vector space representations." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-23", "text": "However, Schulte im Walde et al. (2013) use German data and do not investigate diachronic changes." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-24", "text": "They report a Spearman's \u03c1 of 0.65 for predicting the compositionality of compounds based on the features of their semantic space and find that the modifiers mainly influence the compositionality of the whole compound, contrary to their expectation that the head should be the main source of influence." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-25", "text": "This is true for both the human annotation and their vector space model." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-26", "text": "Schulte im Walde et al. (2016a) further investigate the role of heads and modifiers on the prediction of compositionality and report \u03c1 values between 0.35 and 0.61 for their models on German and English data." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-27", "text": "Reddy et al. (2011) also report Spearman's \u03c1 between their surveyed compositionality values and word vectors." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-28", "text": "They achieve \u03c1 values of around 0.68, depending on the model." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-29", "text": "From a diachronic perspective, we follow the general methodological approach of Hamilton et al. (2018) , who use PPMI, SVD and word2vec based vector spaces to investigate a shift in meaning for chosen words with a known semantic change (gay, broadcast, etc.)." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-30", "text": "They use time series to detect a significant change-point for two words, using cosine similarity and Spearman's \u03c1." 
}, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-31", "text": "They also compute the displacement for a single word embedding by calculating the cosine similarity between a point in time t and a later point in time t + \u2206. We adapt this methodology and make use of the same corpus (Google Books Ngram)." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-32", "text": "----------------------------------" }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-33", "text": "**METHODS AND DATA**" }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-34", "text": "Several studies have been conducted in order to measure compositionality for compounds in different languages (von der Heide and Borgwaldt, 2009; Reddy et al., 2011; Schulte im Walde et al., 2016b) ." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-35", "text": "Some of these works have used large corpora to extract vector-based representations of compounds and their parts to automatically determine the compositionality of a given compound." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-36", "text": "The models were validated on the basis of their correlation with human compositionality ratings for a set of compounds." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-37", "text": "Because we are interested in the diachronic perspective on compounds, we use a time-stamped corpus: the Google Books Ngram corpus 2 (Michel et al., 2011) It contains books from the 1500s to the 2000s, from which we retrieve the contextual information of compounds and their constituents per year." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-38", "text": "We operate on 5-grams, which is the largest unit provided by Google Ngrams and use the words appearing in the 5-grams as both target words and context." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-39", "text": "We use the Part-of-Speech information already included in the Google Ngram corpus to extract noun-noun patterns." 
}, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-40", "text": "We then regard all other tokens in the 5-gram as context words and from this build up a semantic space rep-resentation of noun compounds for each year." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-41", "text": "We use a sliding window approach, wherein we capture the context of a compound based on its position in the 5-gram." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-42", "text": "That means that a bigram (say the compound gold mine) could occur in four different positions in the 5-grams (1-2, 2-3, 3-4 and finally 4-5)." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-43", "text": "We then capture the contexts for each of these positions, in order to enrich the representation of a compound and its constituents (which similarly have five such positions, as they are unigrams)." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-44", "text": "Ideally, we would validate our diachronic model on diachronic test data." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-45", "text": "However, as it is not possible to survey compositionality rating for diachronic data, we instead use the synchronic data provided by Reddy et al. (2011) (henceforth referred to as REDDY) for evaluating the quality of the Google Books Ngram data as a source for investigating the compositionality of compounds in general." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-46", "text": "Reddy et al. (2011) compiled a list of 90 English compounds and asked annotators to rate the compositionality of these compounds on a scale from 0 to 5. They provide three mean values of their ratings for the compounds (compound-mean), heads (head-mean) and modifiers (modifier-mean)." 
}, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-47", "text": "We make use of REDDY in order to verify that our methods are capable of capturing compositionality (synchronically) and use the diachronic data of Google Books Ngram to investigate the temporal change of compositionality." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-71", "text": "To find the best configuration of a time span and cut-off for the regression models, we use the R 2 metric." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-48", "text": "A common challenge in building semantic spaces on a diachronic scale is that when building the spaces for individual spans of time, the spaces need to be aligned later on in order to compare models (see e.g. Kutuzov et al., 2018, Section 3.3) ." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-49", "text": "We circumvent this problem by jointly learning the spaces for the target words." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-50", "text": "To do this, we take the sparse representations of the compounds and their constituents and jointly learn their dense representations using SVD." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-51", "text": "Similar to Hamilton et al. (2018) we also choose the dimensions of our embeddings to be 300." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-52", "text": "We carry out row normalization on the embeddings, in order to remove the bias of the frequency of the compounds and their constituents." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-53", "text": "We make use of six different semantic features that have been proposed in the literature to capture compositionality (Schulte im Walde et al., 2016a) and plausibility of noun-noun compounds (G\u00fcnther and Marelli, 2016; Dhar and van der Plas, 2019) ." 
}, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-54", "text": "Three features are based on the cosine similarity between the embeddings of different compound parts (see G\u00fcnther and Marelli, 2016)" }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-55", "text": "----------------------------------" }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-56", "text": "**EXPERIMENTS**" }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-57", "text": "We ran a total of two experiments 3 (Section 4.2 and 4.3) with different goals." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-58", "text": "----------------------------------" }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-59", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-60", "text": "Hyper-parameters We experiment with certain hyper-parameters, in particular we varied the time span length, e.g. single years, decades or a span of 20 years etc. and frequency cut-off of compounds and their constituents in a specific time span, i.e. compounds and constituents have to occur above a certain frequency threshold." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-61", "text": "Choosing a greater time span will increase the observable data per compound and might improve the vector representations." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-62", "text": "We only consider compounds which retain representations in all time spans starting from the year 1800, which reduces the number of total compounds depending on the specific setup." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-63", "text": "3 The code is available at https://github.com/ prajitdhar/Compounding Compound-centric setting Dhar and van der Plas (2019) found the compound-centric set up, where the distributional representations of words are based on their usage as constituents in a compound to outperform compound-agnostic setups, for predicting novel compounds in English." 
}, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-64", "text": "They were inspired by research on N-N compounds in Dutch that suggests that constituents such asmolen '-mill' in pepermolen 'peppermill' are separately stored as abstract combinatorial structures rather than understood on the basis of their independent constituents (De Jong et al., 2002) ." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-65", "text": "We hence adopt the compound-centric setting." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-66", "text": "----------------------------------" }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-67", "text": "**CORRELATION**" }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-68", "text": "We first carry out a quantitative experiment, to see if our features bolster the prediction of compositionality in noun-noun compounds." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-69", "text": "To do so, we calculate correlation scores between our proposed metrics and the annotated compositionality ratings of REDDY." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-70", "text": "Like Reddy et al. (2011) and Schulte im Walde et al. (2013), we opt for Spearman's \u03c1." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-72", "text": "Table 1 presents our findings; we will discuss them in the following Section 5." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-73", "text": "----------------------------------" }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-74", "text": "**PROGRESSION OF COMPOSITIONALITY OVER TIME**" }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-75", "text": "Based on the results of our correlation experiment, we proceed to analyze the temporal progression of compositionality." 
}, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-76", "text": "Our goals are two-fold: First, investigate if temporal information helps in predicting the contemporary REDDY data and second, use the best feature and setup in order to model the progression of compositionality over time." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-77", "text": "----------------------------------" }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-78", "text": "**RESULTS**" }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-79", "text": "We find the best predictors for the compositionality ratings of REDDY to be LMI and LLR (compare Table 2 )." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-80", "text": "The overall highest correlation occurs between compound-mean and LMI with \u03c1 of 0.54." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-81", "text": "We also see that sim-bw-constituents and sim-with-heads are generally good predictors as well." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-82", "text": "Contrary to Schulte im Walde et al. (2013) we do not find a strong correlation between modifiers and the REDDY ratings." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-83", "text": "Interestingly, PPMI is always weakly negatively correlated with the ratings." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-84", "text": "This could be due to PPMI's property of inflating scores for rare events." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-85", "text": "As can also be seen from Table 2 , our correlation values are considerably lower than that of Reddy et al. (2011) , but on par with a replication study by Schulte im Walde et al. (2016a) for compound-mean." 
}, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-86", "text": "We speculate that these differences are potentially due to the use of different data sets, the fact that we use a considerably smaller context window for constructing the word vectors (5 due to the restrictions of Google Ngram corpus vs. 100 in Reddy et al. (2011) and 40 in Schulte im Walde et al. (2016b) ) and the use of a compound-centric setting (as described in 4.1)." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-87", "text": "From Table 1 we see that our best reported R 2 value occurs when observing time in stretches of 20 years (scores) and compounds having a frequency cut-off of at least 100." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-88", "text": "A few other observations could be made: In general the cut-off seems to improve the R 2 metric and the time spans of 10 and 20 years appear to be the most informative and stable for the cut-off values." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-89", "text": "Also, using temporal information almost always outperforms the setup that ignores all temporal information." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-90", "text": "For our following experiment, we choose to use the configuration with the highest R 2 value: a time span of 20 years and a cut-off of 100." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-91", "text": "Since LMI achieved the highest \u03c1 values, we also choose LMI over the other features." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-92", "text": "We group the compounds of REDDY into three groups based on the human ratings they obtained: low (0-1), med (2-3) and high (4-5)." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-93", "text": "Each group contains around 30 compounds." 
}, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-94", "text": "We then plot the LMI values of these three groups with their confidence interval across the time step of 20 years, shown in Figure 1 ." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-95", "text": "We can observe that there is a separation between the groups towards the later years, and that the period between 1940s and 1960s caused a noticeable change in the compositionality of the REDDY compounds." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-96", "text": "We find the same trends for all three information theory based features." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-97", "text": "Although care should be taken given the small data sets (especially for the earlier decades) on which the models were build and tested, the slope of the lines for the three groups of compounds seems to suggest that less compositional compounds go through a more pronounced change in compositionality than compositional compounds, as expected." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-98", "text": "We also show the graphs for sim-with-head and sim-with-mod (Figures 2 and 3) for the different groups of compounds across time, as these underperformed in our previous experiment." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-99", "text": "Both figures based on cosine based features largely confound the three groups of compounds across time, reinforcing our previous findings." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-100", "text": "6 Future Work" }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-101", "text": "Our current work was limited to English compounds from Reddy et al. (2011) ." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-102", "text": "We plan to expand our models to other languages for which compositionality ratings are available, such as German." 
}, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-103", "text": "We would also like to further investigate the fact that the information theory based measures outperform the ones based on cosine similarity." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-104", "text": "We intend to do so by incorporating more compounds and their compositionality ratings, as well as by using larger corpora." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-105", "text": "Lastly, we will seek to find ways to harvest proxies for compositionality ratings of compounds over time." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-106", "text": "A possible avenue could be to use the information available in dictionaries." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-107", "text": "----------------------------------" }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-108", "text": "**CONCLUSION**" }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-109", "text": "We have shown work in progress on determining the compositionality of compounds over time." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-110", "text": "We showed that for our current setup, information theory based measures seem to capture compositionality better." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-111", "text": "Furthermore, we showed that adding temporal information increases the predictive power of these features to prognosticate synchronic compositionality." }, { "sent_id": "b49e6f8181d51a998c6c27a830b98e-C001-112", "text": "Finally, we showed how our best performing models trace the compositionality of compounds over time, delineating the behavior of compounds of varying levels of compositionality." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "b49e6f8181d51a998c6c27a830b98e-C001-22" ], [ "b49e6f8181d51a998c6c27a830b98e-C001-34" ] ], "cite_sentences": [ "b49e6f8181d51a998c6c27a830b98e-C001-22", "b49e6f8181d51a998c6c27a830b98e-C001-34" ] }, "@SIM@": { "gold_contexts": [ [ "b49e6f8181d51a998c6c27a830b98e-C001-22" ], [ "b49e6f8181d51a998c6c27a830b98e-C001-70" ] ], "cite_sentences": [ "b49e6f8181d51a998c6c27a830b98e-C001-22", "b49e6f8181d51a998c6c27a830b98e-C001-70" ] }, "@USE@": { "gold_contexts": [ [ "b49e6f8181d51a998c6c27a830b98e-C001-45" ], [ "b49e6f8181d51a998c6c27a830b98e-C001-101" ] ], "cite_sentences": [ "b49e6f8181d51a998c6c27a830b98e-C001-45", "b49e6f8181d51a998c6c27a830b98e-C001-101" ] }, "@DIF@": { "gold_contexts": [ [ "b49e6f8181d51a998c6c27a830b98e-C001-85" ], [ "b49e6f8181d51a998c6c27a830b98e-C001-86" ] ], "cite_sentences": [ "b49e6f8181d51a998c6c27a830b98e-C001-85", "b49e6f8181d51a998c6c27a830b98e-C001-86" ] } } }, "ABC_878c6cf1c47c86f36a7ff3f04e2998_18": { "x": [ { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-2", "text": "Fact-checking of textual sources needs to effectively extract relevant information from large knowledge bases." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-3", "text": "In this paper, we extend an existing pipeline approach to better tackle this problem." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-4", "text": "We propose a neural ranker using a decomposable attention model that dynamically selects sentences to achieve promising improvement in evidence retrieval F1 by 38.80%, with (\u00d765) speedup compared to a TF-IDF method." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-5", "text": "Moreover, we incorporate lexical tagging methods into our pipeline framework to simplify the tasks and render the model more generalizable." 
}, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-6", "text": "As a result, our framework achieves promising performance on a large-scale fact extraction and verification dataset with speedup." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-7", "text": "----------------------------------" }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-9", "text": "With the rapid growth of available textual information, automatic extraction and verification, also known as fact-checking, has become important in order to identify relevant and factual information from the ever-growing information pool." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-10", "text": "The FakeNews Challenge (Pomerleau and Rao) addresses fact-checking as a simple stance detection problem, where the article is verified by checking the stance agreement between an article's title and content." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-11", "text": "Similar to the FakeNews, (Rashkin et al., 2017; Vlachos and Riedel, 2014) focused on political statements from Politifact.com to verify the degree of truthfulness." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-12", "text": "However, they assume that the gold standard documents containing the evidence are already known, which overly simplifies the task." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-13", "text": "Question Answering (QA) is similar to factchecking in the sense that a question and its answers can be considered as a claim and evidence respectively, but the answers may come from a large-scale database." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-14", "text": "Several approaches (Chen * * These two authors contributed equally." 
}, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-15", "text": "----------------------------------" }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-16", "text": "**CLAIM**" }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-17", "text": "Finding Dory was written by anyone but an American." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-18", "text": "----------------------------------" }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-19", "text": "**EVIDENCE**" }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-20", "text": "Finding Dory: Directed by Andrew Stanton with co-direction by Angus MacLane, the screenplay was written by Stanton and Victoria Strouse Andrew Stanton: Andrew Stanton -LRB-born December 3, 1965 -RRB-is an American film director , screenwriter, producer and voice actor based at Pixar." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-21", "text": "Label REFUTE , 2017a; Ryu et al., 2014; Ahn et al., 2004) proposed QA system utilizing resources such as Wikipedia, which is more comprehensive and incorporates wider world knowledge." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-22", "text": "However, the main focus is to identify only the \"correct\" answers that support a given question." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-23", "text": "Since the ability to refute is as important as to support, it does not fully address the verification problem of factchecking." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-24", "text": "Recently, Thorne et al. (2018) proposed a public dataset to explore the complete process of the large-scale fact-checking." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-25", "text": "It is designed not only to verify claims but also to extract sets of related evidence." 
}, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-26", "text": "Nevertheless, the pipeline solution proposed in that paper suffers from following problems: 1) The overall performance (30.88% accuracy) still needs further improvement to be applicable to the evidence selection and classification, which also highlights the challenging nature of this task." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-27", "text": "2) The evidence retrieval using Term Frequency-Inverse Document Frequency (TF-IDF) is time-consuming since the TF-IDF between a claim and set of candidate evidence cannot be computed in advance." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-28", "text": "In this paper, we extend the original pipeline solution to achieve faster and better fact-checking results." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-29", "text": "Our main contributions are: 1) Propose a neural ranker using decomposable attention (DA) model for evidence selection to speed up (\u00d765) and outperform related works." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-30", "text": "2) Incorporate several lexical tag information to effectively simplify the problem and generalize the models." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-31", "text": "3) Improve the overall fact extraction F1 by 38.80% and verification accuracy by 2.10% to achieve the state-of-the-art performance on the dataset." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-32", "text": "----------------------------------" }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-33", "text": "**METHODOLOGY**" }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-34", "text": "Our pipeline framework 1 has three main modules: document retrieval (DR), evidence selection (ES), and textual entailment recognition (TER)." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-35", "text": "The goal is to verify a given claim with a set of evidence from Wikipedia ( Table 1 )." 
}, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-36", "text": "The verification labels are support, refute and not enough information (NEI) to verify." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-37", "text": "----------------------------------" }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-38", "text": "**LEXICAL TAGGING**" }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-39", "text": "In our framework, two lexical tags (i.e. partof-speech (POS) and named entity recognition (NER)) are used to enhance the performance." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-40", "text": "We compute the tags for claims in advance using the Stanford CoreNLP library." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-41", "text": "Using this information is helpful in the following ways: 1) it helps keyword extraction for each claim." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-42", "text": "2) it reduces the out-of-vocabulary (OOV) problems related to name or organization entities, for better generalization." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-43", "text": "For example, a claim like \"Michael Jackson and Justin Timberlake are friends,\" is replaced as \"PERSON-1 and PERSON-2 are friends\"." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-44", "text": "In this way, we encourage our model to learn verification without dealing with the real entity values but the delexicalized indexed tokens." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-45", "text": "----------------------------------" }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-46", "text": "**DOCUMENT RETRIEVAL (DR)**" }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-47", "text": "For document retrieval, we extend the method of DrQA (Chen et al., 2017a) , which calculates cosine similarity between query and document, using binned unigram and bigram TF-IDF features." 
}, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-48", "text": "We refer to this method as DR tf idf ." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-49", "text": "Instead of directly selecting top k document using TF-IDF as in DR tf idf , our document retriever DR rerank use TD-IDF to reduce the search space from 5.4M to 100 documents." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-50", "text": "Re-ranking is then applied to select the top k documents." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-51", "text": "For re-rank, we defined a score function f rank that ranks the relevance of the document by considering both the title and the content as follows:" }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-52", "text": "To capture the relevance from the title, all the POS tags with high discriminating power (NN, NNS, NNP, NNPS, JJ, CD) of a claim are chosen as keywords." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-53", "text": "POS claim and POS title are the counts of such POS tags inside the claim and title respectively." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-54", "text": "POS match is the count of common POS keywords in the claim and the title; r claim is a ratio between POS match and POS claim to reward the documents with higher keyword hits; r title is the ratio between POS match and POS title to penalize those documents with more candidate keywords as it is more likely to have keyword hits with more candidates." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-55", "text": "We incorporate the TF-IDF score (tf -idf ) to ensure that the content information is not neglected." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-56", "text": "Our experiments show that our re-rank strategy increases the document recall compared to the single-step approach (Table 2)." 
}, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-57", "text": "To decide on the optimal value for hyperparameter k, full-pipeline performance was compared to evaluate the effect of k on final verification accuracy." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-58", "text": "----------------------------------" }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-59", "text": "**EVIDENCE SELECTION (ES)**" }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-60", "text": "In this module, l sentences are extracted as possible evidence for the claim." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-61", "text": "Instead of selecting the sentences by recomputing sentence-level TF-IDF features between claim and document text as in Thorne et al. (2018) , we propose a neural ranker using decomposable attention (DA) model (Parikh et al., 2016) to perform evidence selection." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-62", "text": "DA model does not require the input text to be parsed syntactically, nor is an ensemble, and it is faster without any recurrent structure." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-63", "text": "In general, using neural methods is better for the following reasons: 1) The TF-IDF may have limited ability to capture semantics compared to word representation learning 2) Faster inference time compared to TF-IDF methods that need real-time reconstruction." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-64", "text": "The neural ranker DA rank is trained using a fake task, which is to classify whether a given sentence is an evidence of a given claim or not." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-65", "text": "The output of DA rank is considered as the evidence probability." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-66", "text": "The training samples are unbalanced since there are more unrelated sentences than evidence sentences." 
}, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-67", "text": "Note that the classifier's accuracy on the fake task is not crucial because the choice of evidence is based on the relative score of relevance compared to other candidates." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-68", "text": "Therefore, it is not necessary to balance positive and negative samples using up/down-sampling, to the contrary, making it unbalanced actually improved the performance (Table 3) ." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-69", "text": "Unlike the k value which is fixed, the l value is selected dynamically based on the evidence score of DA rank '." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-70", "text": "It is used as a confidence measure of the given sentence being an evidence." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-71", "text": "Evidence with the score below fixed threshold value th is eliminated." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-72", "text": "Hence, each claim will have a different number of l evidence." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-73", "text": "To decide on th, we also carry out the full-pipeline evaluation." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-74", "text": "We propose the dynamic selection of l because we hypothesize that any wrong evidence, or noise, from early module could harm the subsequent RTE module." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-75", "text": "----------------------------------" }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-76", "text": "**RECOGNIZING TEXTUAL ENTAILMENT (RTE)**" }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-77", "text": "Given a claim and l possible evidence, a DA rte classifier is trained to recognize the textual entailment to be support, refute or not enough information to verify (NEI)." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-78", "text": "Same as Thorne et al. 
(2018), we use the decomposable attention (DA) model between the claim and the evidence for RTE." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-79", "text": "The DA model decomposes the RTE problem into subproblems, which can be considered bi-directional word-level attention features." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-80", "text": "Note that the DA model is utilized over other models, such as Chen et al. (2017b) and Glockner et al. (2018), because it is a simple but effective model." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-81", "text": "Our DA_rte model must correctly decide that a claim is NEI when the evidence retrieved is irrelevant or insufficient." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-82", "text": "However, NEI claims have no annotated evidence and thus cannot be used to train RTE." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-83", "text": "To overcome this issue, as in Thorne et al. (2018), the most probable NEI evidence is simulated by sampling sentences from the page nearest to the claim using the document retrieval module." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-84", "text": "Table 4: Oracle RTE classification accuracy on the test set using gold evidence. MLP: 63.2%; DA_rte: 78.4%; DA_rte+NER: 79.9%." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-85", "text": "----------------------------------" }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-86", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-87", "text": "Dataset: The FEVER dataset (Thorne et al., 2018) is a relatively large-scale dataset compared to previous fact extraction and verification work, with around 5.4M Wikipedia documents and 185k samples." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-88", "text": "The claims are generated by altering sentences extracted from Wikipedia, with human-annotated evidence sentences and verification labels (e.g. Table 1)."
}, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-89", "text": "The training/validation/test sets of these three datasets are split in advance by the providers." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-90", "text": "Note that the test-set was equally split into 3 classes: Supported (3333), Refuted (3333), NEI (3333)." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-91", "text": "Training: We trained our models end-to-end using Adagrad optimizer (Duchi et al., 2011) ." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-92", "text": "The embedding size is set to 200 and initialized with GloVe (Pennington et al., 2014) ." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-93", "text": "The dropout rate is set to 0.2." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-94", "text": "In all the datasets, we tuned the hyper-parameters with grid-search over the validation set." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-95", "text": "Evaluation: For each module, we independently measure oracle performance, where we assume gold standard documents and set of evidence are provided (oracle evaluation)." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-96", "text": "For the final fullpipeline, we compare to and follow the metric defined in Thorne et al. (2018) ." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-97", "text": "NoScoreEv is a simple classification accuracy that only considers the correctness of the verification label." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-98", "text": "On the other hand, ScoreEv is a stricter measure that also considers the correctness of the retrieved evidence." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-99", "text": "Hence, it is a more meaningful measure because it considers the classification to be correct only if appropriate evidence is provided to justify the classification." 
}, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-100", "text": "----------------------------------" }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-101", "text": "**EXPERIMENTAL RESULTS**" }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-102", "text": "----------------------------------" }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-103", "text": "**ORACLE PERFORMANCE**" }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-104", "text": "Document Retrieval: As shown in Table 2 , our count-based re-rank strategy outperforms the TF-IDF method." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-105", "text": "Take k = 1 as an example, we achieve 60.99% recall using only one document, which is \u223c30% higher than TF-IDF approach." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-106", "text": "Given that the recall upper-bound of re-rank is 0.8886 (k=100), our method manages to retrieve near the limit by just retrieving a few documents." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-107", "text": "Note that there is Table 6 : Effect of de-noising ES modules on RTE score a trade-off between the document recall and the noise ratio (i.e. low precision)." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-108", "text": "As shown in Table 5, k = 2 with a low recall but high precision and F1 has the highest accuracy." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-109", "text": "This means DR that can effectively leverage this trade-off (therefore, high F1) performs the best." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-110", "text": "Therefore, we select k = 2 for our full-pipeline." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-111", "text": "Evidence Selection: In Table 3 , the trained neural ranker achieves a promising recall of 96.8%." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-112", "text": "In the case of l = 5, our neural ranker can perform 5% better than the TF-IDF method." 
}, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-113", "text": "Here, we use fixed l = 5 results for fair comparison with TF-IDF method." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-114", "text": "The ratios in Table 3 are the ratio of negative samples we tried to train the fake task." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-115", "text": "For example, 1:4 means that four negative sentences are sampled for each true evidence sentence." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-116", "text": "The results further give supports to our assumption that using unbalanced up-sampling actually help train our neural ranker." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-117", "text": "Along with performance gain, we also achieve a drastic gain in inference time by around 65 times from 3.57 seconds to 0.055 seconds for each sample." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-118", "text": "The full-pipeline results for different values of th is shown in Table 6 ." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-119", "text": "Similar to document retrieval, having a high F1 score is the most important factor in assuring high full-pipeline performance." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-120", "text": "This is because providing a succinct set of evidence makes the verification task easier for the RTE module." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-121", "text": "Therefore, we choose DA rank +NER model with th = 0.6 for the fullpipeline." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-122", "text": "Recognizing Textual Entailment: The oracle classification accuracy for RTE is shown in Table 4." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-123", "text": "The MLP is a simple multi-layer perceptron using TF and TF-IDF cosine similarity between the claim and evidence as features as shown in Thorne et al. (2018) ." 
}, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-124", "text": "The highest accuracy achieved is 79.9% using DA rte with NER information, thus, we further evaluate the full-pipeline accuracy on this setting." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-125", "text": "----------------------------------" }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-126", "text": "**FULL PIPELINE PERFORMANCE**" }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-127", "text": "Combining each of our improved pipeline modules using k = 2, th = 0.6, the full pipeline results are shown in Table 7 ." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-128", "text": "Our proposed framework can achieve 42.43% in ScoreEv and 52.54% in NoScoreEv, which outperforms DR tf idf +DA by 11.55% and 2.10%, respectively." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-129", "text": "The evidence retrieval F1 in our full framework is 56.3%, which is improved promisingly by 38.80%." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-130", "text": "----------------------------------" }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-131", "text": "**RELATED WORK**" }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-132", "text": "Prior work (Vlachos and Riedel, 2014; Ciampaglia et al., 2015) have proposed fact-checking through entailment from knowledge bases." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-133", "text": "Some works have investigated fact verification using PolitiFact data (Wang, 2017; Rashkin et al., 2017) or FakeNews challenge (Pomerleau and Rao) ." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-134", "text": "Most closely related to our work, Thorne et al. (2018) addresses large-scale fact extraction and verification task using a pipeline approach." 
}, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-135", "text": "In addition, question answering (Dang et al., 2007; Chen et al., 2017a; Ryu et al., 2014; Ahn et al., 2004) and task-oriented dialog systems (Dhingra et al., 2017; Table 7 : Full-pipeline evaluation on the test set using k = 2 and th = 0.6." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-136", "text": "The first and the second one (with *) are the baselines from Thorne et al. (2018) ." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-137", "text": "Madotto et al., 2018) also have similar aspects to these works, although aiming at a different goal." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-138", "text": "Other fields that are related to the particular individual modules of our system are the following: Document and evidence retrieval for identifying text segments and documents to support a given claim (Salton and Buckley, 1987; Le and Mikolov, 2014; Cartright et al., 2011; Bellot et al., 2013; Rinott et al., 2015) ." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-139", "text": "Recognizing textual entailment that aims to determine whether a hypothesis h can justifiably be inferred from a premise (Dang et al., 2007; Bowman et al., 2015; Parikh et al., 2016; Chen et al., 2017b; Glockner et al., 2018) ." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-140", "text": "In some of these work (Rinott et al., 2015; Rashkin et al., 2017) , the lexical and linguistic features are leveraged to further improve the performance." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-141", "text": "----------------------------------" }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-142", "text": "**CONCLUSION**" }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-143", "text": "In this paper, we extend the pipeline framework for fact-checking and propose a neural ranker for evidence selection." 
}, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-144", "text": "Our experiments show that the usage of lexical tagging is helpful in simplifying the task and improving the generalization ability." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-145", "text": "Moreover, reducing noise in the input of RTE module, by de-noising the DR and SR modules, appears to be crucial for improving the overall performance." }, { "sent_id": "878c6cf1c47c86f36a7ff3f04e2998-C001-146", "text": "As a result, our ranker outperforms the TF-IDF method by 38.8% in evidence retrieval F1, with 65 times faster inference speed, achieving a promising performance in a large-scale fact extraction and verification dataset." } ], "y": { "@BACK@": { "gold_contexts": [ [ "878c6cf1c47c86f36a7ff3f04e2998-C001-24" ] ], "cite_sentences": [ "878c6cf1c47c86f36a7ff3f04e2998-C001-24" ] }, "@DIF@": { "gold_contexts": [ [ "878c6cf1c47c86f36a7ff3f04e2998-C001-61" ] ], "cite_sentences": [ "878c6cf1c47c86f36a7ff3f04e2998-C001-61" ] }, "@SIM@": { "gold_contexts": [ [ "878c6cf1c47c86f36a7ff3f04e2998-C001-78" ], [ "878c6cf1c47c86f36a7ff3f04e2998-C001-83" ], [ "878c6cf1c47c86f36a7ff3f04e2998-C001-134" ] ], "cite_sentences": [ "878c6cf1c47c86f36a7ff3f04e2998-C001-78", "878c6cf1c47c86f36a7ff3f04e2998-C001-83", "878c6cf1c47c86f36a7ff3f04e2998-C001-134" ] }, "@USE@": { "gold_contexts": [ [ "878c6cf1c47c86f36a7ff3f04e2998-C001-87" ], [ "878c6cf1c47c86f36a7ff3f04e2998-C001-96" ], [ "878c6cf1c47c86f36a7ff3f04e2998-C001-123" ] ], "cite_sentences": [ "878c6cf1c47c86f36a7ff3f04e2998-C001-87", "878c6cf1c47c86f36a7ff3f04e2998-C001-96", "878c6cf1c47c86f36a7ff3f04e2998-C001-123" ] } } }, "ABC_3bc48bea420e4977027832240450ec_18": { "x": [ { "sent_id": "3bc48bea420e4977027832240450ec-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-2", "text": "Lack of repeatability and generalisability are two significant threats to continuing scientific development in Natural Language Processing." 
}, { "sent_id": "3bc48bea420e4977027832240450ec-C001-3", "text": "Language models and learning methods are so complex that scientific conference papers no longer contain enough space for the technical depth required for replication or reproduction." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-4", "text": "Taking Target Dependent Sentiment Analysis as a case study, we show how recent work in the field has not consistently released code, or described settings for learning methods in enough detail, and lacks comparability and generalisability in train, test or validation data." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-5", "text": "To investigate generalisability and to enable state of the art comparative evaluations, we carry out the first reproduction studies of three groups of complementary methods and perform the first large-scale mass evaluation on six different English datasets." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-6", "text": "Reflecting on our experiences, we recommend that future replication or reproduction experiments should always consider a variety of datasets alongside documenting and releasing their methods and published code in order to minimise the barriers to both repeatability and generalisability." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-7", "text": "We have released our code with a model zoo on GitHub with Jupyter Notebooks to aid understanding and full documentation, and we recommend that others do the same with their papers at submission time through an anonymised GitHub account." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-8", "text": "----------------------------------" }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-10", "text": "Repeatable (replicable and/or reproducible 1 ) experimentation is a core tenet of the scientific endeavour." 
}, { "sent_id": "3bc48bea420e4977027832240450ec-C001-11", "text": "In Natural Language Processing (NLP) research as in other areas, this requires three crucial components: (a) published methods described in sufficient detail (b) a working code base and (c) open dataset(s) to permit training, testing and validation to be reproduced and generalised." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-12", "text": "In the cognate sub-discipline of corpus linguistics, releasing textual datasets has been a defining feature of the community for many years, enabling multiple comparative experiments to be conducted on a stable basis since the core underlying corpora are community resources." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-13", "text": "In NLP, with methods becoming increasingly complex with the use of machine learning and deep learning approaches, it is often difficult to describe all settings and configurations in enough detail without releasing code." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-14", "text": "The work described in this paper emerged from recent efforts at our research centre to reimplement other's work across a number of topics (e.g. text reuse, identity resolution and sentiment analysis) where previously published methods were not easily repeatable because of missing or broken code or dependencies, and/or where methods were not sufficiently well described to enable reproduction." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-15", "text": "We focus on one sub-area of sentiment analysis to illustrate the extent of these problems, along with our initial recommendations and contributions to address the issues." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-16", "text": "The area of Target Dependent Sentiment Analysis (TDSA) and NLP in general has been growing rapidly in the last few years due to new neural network methods that require no feature engineering." 
}, { "sent_id": "3bc48bea420e4977027832240450ec-C001-17", "text": "However it is difficult to keep track of the state of the art as new models are tested on different datasets, thus preventing true comparative evaluations." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-18", "text": "This is best shown by table 1 where many approaches This work is licenced under a Creative Commons Attribution 4.0 International Licence." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-19", "text": "Licence details: http:// creativecommons.org/licenses/by/4.0/ 1 We follow the definitions in Antske Fokkens' guest blog post \"replication (obtaining the same results using the same experiment) as well as reproduction (reach the same conclusion through different means)\" from http://coling2018." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-20", "text": "org/slowly-growing-offspring-zigglebottom-anno-2017-guest-post/ are evaluated on the SemEval dataset (Pontiki et al., 2014) but not all." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-21", "text": "Datasets can vary by domain (e.g. product), type (social media, review), or medium (written or spoken), and to date there has been no comparative evaluation of methods from these multiple classes." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-22", "text": "Our primary and secondary contributions therefore, are to carry out the first study that reports results across all three different dataset classes, and to release a open source code framework implementing three complementary groups of TDSA methods." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-23", "text": "In terms of reproducibility via code release, recent TDSA papers have generally been very good with regards to publishing code alongside their papers (Mitchell et al., 2013; Zhang et al., 2016; Liu and Zhang, 2017; Wang et al., 2017) but other papers have not released code (Wang et al., 2016; Tay et al., 2017) ." 
}, { "sent_id": "3bc48bea420e4977027832240450ec-C001-24", "text": "In some cases, the code was initially made available, then removed, and is now back online (Tang et al., 2016a) ." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-25", "text": "Unfortunately, in some cases even when code has been published, different results have been obtained relative to the original paper." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-26", "text": "This can be seen when Chen et al. (2017) used the code and embeddings in Tang et al. (2016b) they observe different results." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-27", "text": "Similarly, when others (Tay et al., 2017; Chen et al., 2017) attempt to replicate the experiments of Tang et al. (2016a) they also produce different results to the original authors." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-28", "text": "Our observations within this one sub-field motivates the need to investigate further and understand how such problems can be avoided in the future." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-29", "text": "In some cases, when code has been released, it is difficult to use which could explain why the results were not reproduced." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-30", "text": "Of course, we would not expect researchers to produce industrial strength code, or provide continuing free ongoing support for multiple years after publication, but the situation is clearly problematic for the development of the new field in general." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-31", "text": "In this paper, we therefore reproduce three papers chosen as they employ widely differing methods: Neural Pooling (NP) , NP with dependency parsing (Wang et al., 2017) , and RNN (Tang et al., 2016a) , as well as having been applied largely to different datasets." 
}, { "sent_id": "3bc48bea420e4977027832240450ec-C001-32", "text": "At the end of the paper, we reflect on bringing together elements of repeatability and generalisability which we find are crucial to NLP and data science based disciplines more widely to enable others to make use of the science created." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-33", "text": "----------------------------------" }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-34", "text": "**RELATED WORK**" }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-35", "text": "Reproducibility and replicability have long been key elements of the scientific method, but have been gaining renewed prominence recently across a number of disciplines with attention being given to a 'reproducibility crisis'." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-36", "text": "For example, in pharmaceutical research, as little as 20-25% of papers were found to be replicable (Prinz et al., 2011) ." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-37", "text": "The problem has also been recognised in computer science in general (Collberg and Proebsting, 2016) ." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-38", "text": "Reproducibility and replicability have been researched for sometime in Information Retrieval (IR) since the Grid@CLEF pilot track (Ferro and Harman, 2009 )." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-39", "text": "The aim was to create a 'grid of points' where a point defined the performance of a particular IR system using certain pre-processing techniques on a defined dataset." 
}, { "sent_id": "3bc48bea420e4977027832240450ec-C001-40", "text": "Louridas and Gousios (2012) looked at reproducibility in Software Engineering after trying to replicate another authors results and concluded with a list of requirements for papers to be reproducible: (a) All data related to the paper, (b) All code required to reproduce the paper and (c) Documentation for the code and data." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-41", "text": "Fokkens et al. (2013) looked at reproducibility in WordNet similarity and Named Entity Recognition finding five key aspects that cause experimental variation and therefore need to be clearly stated: (a) pre-processing, (b) experimental setup, (c) versioning, (d) system output, (e) system variation." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-42", "text": "In Twitter sentiment analysis, Sygkounas et al. (2016) stated the need for using the same library versions and datasets when replicating work." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-43", "text": "Different methods of releasing datasets and code have been suggested." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-44", "text": "Ferro and Harman (2009) defined a framework (CIRCO) that enforces a pre-processing pipeline where data can be extracted at each stage therefore facilitating a validation step." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-45", "text": "They stated a mechanism for storing results, dataset and pre-processed data 2 ." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-46", "text": "Louridas and Gousios (2012) suggested the use of a virtual machine alongside papers to bundle the data and code together, while most state the advantages of releasing source code (Fokkens et al., 2013; Potthast et al., 2016; Sygkounas et al., 2016) ." 
}, { "sent_id": "3bc48bea420e4977027832240450ec-C001-47", "text": "The act of reproducing or replicating results is not just for validating research but to also show how it can be improved." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-48", "text": "Ferro and Silvello (2016) followed up their initial research and were able to analyse which pre-processing techniques were important on a French monolingual dataset and how the different techniques affected each other given an IR system." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-49", "text": "Fokkens et al. (2013) showed how changes in the five key aspects affected results." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-50", "text": "The closest related work to our reproducibility study is that of Marrese-Taylor and Matsuo (2017) which they replicate three different syntactic based aspect extraction methods." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-51", "text": "They found that parameter tuning was very important however using different pre-processing pipelines such as Stanford's CoreNLP did not have a consistent effect on the results." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-52", "text": "They found that the methods stated in the original papers are not detailed enough to replicate the study as evidenced by their large results differential." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-53", "text": "Dashtipour et al. (2016) undertook a replication study in sentiment prediction, however this was at the document level and on different datasets and languages to the originals." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-54", "text": "In other areas of (aspectbased) sentiment analysis, releasing code for published systems has not been a high priority, e.g. in SemEval 2016 task 5 (Pontiki et al., 2016) only 1 out of 21 papers released their source code." 
}, { "sent_id": "3bc48bea420e4977027832240450ec-C001-55", "text": "In IR, specific reproducible research tracks have been created 3 and we are pleased to see the same happening at COLING 2018 4 ." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-56", "text": "Turning now to the focus of our investigations, Target Dependent sentiment analysis (TDSA) research (Nasukawa and Yi, 2003) arose as an extension to the coarse grained analysis of document level sentiment analysis (Pang et al., 2002; Turney, 2002) ." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-57", "text": "Since its inception, papers have applied different methods such as feature based (Kiritchenko et al., 2014) , Recursive Neural Networks (RecNN) (Dong et al., 2014) , Recurrent Neural Networks (RNN) (Tang et al., 2016a) , attention applied to RNN (Wang et al., 2016; Chen et al., 2017; Tay et al., 2017) , Neural Pooling (NP) Wang et al., 2017) , RNN combined with NP (Zhang et al., 2016) , and attention based neural networks (Tang et al., 2016b) ." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-58", "text": "Others have tackled TDSA as a joint task with target extraction, thus treating it as a sequence labelling problem." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-59", "text": "Mitchell et al. (2013) carried out this task using Conditional Random Fields (CRF), and this work was then extended using a neural CRF ." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-60", "text": "Both approaches found that combining the two tasks did not improve results compared to treating the two tasks separately, apart from when considering POS and NEG when the joint task performs better." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-61", "text": "Finally, created an attention RNN for this task which was evaluated on two very different datasets containing written and spoken (video-based) reviews where the domain adaptation between the two shows some promise." 
}, { "sent_id": "3bc48bea420e4977027832240450ec-C001-62", "text": "Overall, within the field of sentiment analysis there are other granularities such as sentence level (Socher et al., 2013) , topic (Augenstein et al., 2018) , and aspect (Wang et al., 2016; Tay et al., 2017) ." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-63", "text": "Aspect-level sentiment analysis relates to identifying the sentiment of (potentially multiple) topics in the same text although this can be seen as a similar task to TDSA." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-64", "text": "However the clear distinction between aspect and TDSA is that TDSA requires the target to be mentioned in the text itself while aspect-level employs a conceptual category with potentially multiple related instantiations in the text." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-65", "text": "Tang et al. (2016a) created a Target Dependent LSTM (TDLSTM) which encompassed two LSTMs either side of the target word, then improved the model by concatenating the target vector to the input embeddings to create a Target Connected LSTM (TCLSTM)." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-66", "text": "Adding attention has become very popular recently." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-67", "text": "Tang et al. (2016b) showed the speed and accuracy improvements of using multiple attention layers only over LSTM based methods, however they found that it could not model complex sentences e.g. negations." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-68", "text": "Liu and Zhang (2017) showed that adding attention to a Bi-directional LSTM (BLSTM) improves the results as it takes the importance of each word into account with respect to the target." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-69", "text": "Chen et al. 
(2017) also combined a BLSTM and attention; however, they used multiple attention layers and combined the results using a Gated Recurrent Unit (GRU), which they called Recurrent Attention on Memory (RAM), and they found this method allows models to better understand more complex sentiment for each comparison." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-70", "text": "used neural pooling features, e.g. max, min, etc., of the word embeddings of the left and right context of the target word, the target itself, and the whole Tweet." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-71", "text": "They fed the features into a linear SVM, and showed for the first time the importance of using the left and right contexts." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-72", "text": "They found in their study that using a combination of Word2Vec embeddings and sentiment embeddings performed best, alongside using sentiment lexicons to filter the embedding space." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-73", "text": "Other studies have adopted more linguistic approaches." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-74", "text": "Wang et al. (2017) extended the work of by using the dependency-linked words from the target." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-75", "text": "Dong et al. (2014) used the dependency tree to create a Recursive Neural Network (RecNN) inspired by Socher et al. (2013), but, unlike Socher et al. (2013), they also utilised the dependency tags to create an Adaptive RecNN (ARecNN)." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-76", "text": "Critically, the methods reported above have not been applied to the same datasets; therefore a true comparative evaluation between the different methods is somewhat difficult." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-77", "text": "This has serious implications for the generalisability of methods."
}, { "sent_id": "3bc48bea420e4977027832240450ec-C001-78", "text": "We correct that limitation in our study." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-79", "text": "There are two papers taking a similar approach to our work in terms of generalisability although they do not combine them with the reproduction issues that we highlight." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-80", "text": "First, Chen et al. (2017) compared results across SemEval's laptop and restaurant reviews in English (Pontiki et al., 2014) , a Twitter dataset (Dong et al., 2014) and their own Chinese news comments dataset." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-81", "text": "They did perform a comparison across different languages, domains, corpora types, and different methods; SVM with features (Kiritchenko et al., 2014) , Rec-NN (Dong et al., 2014) , TDLSTM (Tang et al., 2016a) , Memory Neural Network (MNet) (Tang et al., 2016b) and their own attention method." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-82", "text": "However, the Chinese dataset was not released, and the methods were not compared across all datasets." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-83", "text": "By contrast, we compare all methods across all datasets, using techniques that are not just from the Recurrent Neural Network (RNN) family." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-84", "text": "A second paper, by Barnes et al. (2017) compares seven approaches to (document and sentence level) sentiment analysis on six benchmark datasets, but does not systematically explore reproduction issues as we do in our paper." 
}, { "sent_id": "3bc48bea420e4977027832240450ec-C001-85", "text": "----------------------------------" }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-86", "text": "**DATASETS USED IN OUR EXPERIMENTS**" }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-87", "text": "We are evaluating our models over six different English datasets deliberately chosen to represent a range of domains, types and mediums." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-88", "text": "As highlighted above, previous papers tend to only carry out evaluations on one or two datasets which limits the generalisability of their results." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-89", "text": "In this paper, we do not consider the quality or inter-annotator agreement levels of these datasets but it has been noted that some datasets may have issues here." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-90", "text": "For example, Pavlopoulos and Androutsopoulos (2014) point out that the Hu and Liu (2004) dataset does not state their inter-annotator agreement scores nor do they have aspect terms that express neutral opinion." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-91", "text": "We only use a subset of the English datasets available." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-92", "text": "For two reasons." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-93", "text": "First, the time it takes to write parsers and run the models." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-94", "text": "Second, we only used datasets that contain three distinct sentiments (Wilson (2008) only has two)." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-95", "text": "From the datasets we have used, we have only had issue with parsing Wang et al. (2017) where the annotations for the first set of the data contains the target span but the second set does not." 
}, { "sent_id": "3bc48bea420e4977027832240450ec-C001-96", "text": "Thus making it impossible to use the second set of annotation and forcing us to only use a subset of the dataset." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-97", "text": "An as example of this: \"Got rid of bureaucrats 'and we put that money, into 9000 more doctors and nurses'... to turn the doctors into bureaucrats#BattleForNumber10\" in that Tweet 'bureaucrats' was annotated as negative but it does not state if it was the first or second instance of 'bureaucrats' since it does not use target spans." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-98", "text": "As we can see from table 2, generally the social media datasets (Twitter and YouTube) contain more targets per sentence with the exception of Dong et al. (2014) and Mitchell et al. (2013) ." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-99", "text": "The only dataset that has a small difference between the number of unique sentiments per sentence is the Wang et al. (2017)" }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-100", "text": "----------------------------------" }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-101", "text": "**REPRODUCTION STUDIES**" }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-102", "text": "In the following subsections, we present the three different methods that we are reproducing and how their results differ from the original analysis." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-103", "text": "In all of the experiments below, we lower case all text and tokenise using Twokenizer (Gimpel et al., 2011) ." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-104", "text": "This was done as the datasets originate from Twitter and this pre-processing method was to some extent stated in and assumed to be used across the others as they do not explicitly state how they pre-process in the papers." 
}, { "sent_id": "3bc48bea420e4977027832240450ec-C001-105", "text": "----------------------------------" }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-106", "text": "**REPRODUCTION OF VO AND ZHANG (2015)**" }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-107", "text": "Vo and Zhang (2015) created the first NP method for TDSA." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-108", "text": "It takes the word vectors of the left, right, target word, and full tweet/sentence/text contexts and performs max, min, average, standard deviation, and product pooling over these contexts to create a feature vector as input to the Support Vector Machine (SVM) classifier." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-109", "text": "This feature vector is in affect an automatic feature extractor." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-110", "text": "They created four different methods: 1." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-111", "text": "Target-ind uses only the full tweet context, 2." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-112", "text": "Target-dep-uses left, right, and target contexts, 3." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-113", "text": "Target-dep Uses both features of Target-ind and Target-dep-, and 4." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-114", "text": "Target-dep+ Uses the features of Target-dep and adds two additional contexts left and right sentiment (LS & RS) contexts where only the words within a specified lexicon are kept and the rest of the words are zero vectors." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-115", "text": "All of their experiments are performed on Dong et al. (2014) Twitter data set." 
}, { "sent_id": "3bc48bea420e4977027832240450ec-C001-116", "text": "For each of the experiments below we used the following configurations unless otherwise stated: we performed 5 fold stratified cross validation, features are scaled using Max Min scaling before inputting into the SVM, and used the respective C-Values for the SVM stated in the paper for each of the models." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-117", "text": "One major difficulty with the description of the method in the paper and re-implementation is handling the same target multiple appearances issue as originally raised by Wang et al. (2017) ." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-118", "text": "As the method requires context with regards to the target word, if there is more than one appearance of the target word then the method does not specify which to use." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-119", "text": "We therefore took the approach of Wang et al. (2017) and found all of the features for each appearance and performed median pooling over features." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-120", "text": "This change could explain the subtle differences between the results we report and those of the original paper." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-121", "text": "----------------------------------" }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-122", "text": "**SENTIMENT LEXICONS**" }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-123", "text": "Vo and Zhang (2015) used three different sentiment lexicons: MPQA 5 (Wilson et al., 2005) , NRC 6 (Mohammad and Turney, 2010) , and HL 7 (Hu and Liu, 2004) ." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-124", "text": "We found a small difference in word counts between their reported statistics for the MPQA lexicons and those we performed ourselves, as can be seen in the bold numbers in table 3." 
}, { "sent_id": "3bc48bea420e4977027832240450ec-C001-125", "text": "Originally, we assumed that a word can only occur in one sentiment class within the same lexicon, and this resulted in differing counts for all lexicons." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-126", "text": "This distinction is not clearly documented in the paper or code." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-127", "text": "However, our assumption turned out to be incorrect, giving a further illustration of why detailed descriptions and documentation of all decisions is important." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-128", "text": "We ran the same experiment as to show the effectiveness of sentiment lexicons the results can be seen in table 4." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-129", "text": "We can clearly see there are some difference not just with the accuracy scores but the rank of the sentiment lexicons." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-130", "text": "We found just using HL was best and MPQA does help performance compared to the Target-dep baseline which differs to findings." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-131", "text": "Since we found that using just HL performed best, the rest of the results will apply the Target-dep+ method using HL and using HL & MPQA to show the affect of using the lexicon that both we and Vo and Zhang (2015)" }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-132", "text": "----------------------------------" }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-133", "text": "**USING DIFFERENT WORD VECTORS**" }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-134", "text": "The original authors tested their methods using three different word vectors: 1. Word2Vec trained by on 5 million Tweets containing emoticons (W2V), 2." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-135", "text": "Sentiment Specific Word Embedding (SSWE) from , and 3." 
}, { "sent_id": "3bc48bea420e4977027832240450ec-C001-136", "text": "W2V and SSWE combined." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-137", "text": "Neither of these word embeddings are available from the original authors as never released the embeddings and the link to embeddings no longer works 8 ." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-138", "text": "However, the embeddings were released through Wang et al. (2017) code base 9 following requesting of the code from ." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-139", "text": "Figure 1 shows the results of the different word embeddings across the different methods." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-140", "text": "The main finding we see is that SSWE by themselves are not as informative as W2V vectors which is different to the findings of ." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-141", "text": "However we agree that combining the two vectors is beneficial and that the rank of methods is the same in our observations." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-142", "text": "----------------------------------" }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-143", "text": "**SCALING AND FINAL MODEL COMPARISON**" }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-144", "text": "We test all of the methods on the test data set of Dong et al. (2014) and show the difference between the original and reproduced models in figure 2 ." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-145", "text": "Finally, we show the effect of scaling using Max Min and not scaling the data." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-146", "text": "As stated before, we have been using Max Min scaling on the NP features, however did not mention scaling in their paper." 
}, { "sent_id": "3bc48bea420e4977027832240450ec-C001-147", "text": "The library they were using, LibLinear (Fan et al., 2008) , suggests in its practical guide (Hsu et al., 2003) to scale each feature to [0, 1] but this was not re-iterated by ." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-148", "text": "We are using scikit-learn's (Pedregosa et al., 2011) LinearSVC which is a wrapper of LibLinear, hence making it appropriate to use here." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-149", "text": "As can be seen in figure 2, not scaling can affect the results by around one-third." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-150", "text": "----------------------------------" }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-151", "text": "**REPRODUCTION OF WANG ET AL. (2017)**" }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-152", "text": "Wang et al. (2017) extended the NP work of and instead of using the full tweet/sentence/text contexts they used the full dependency graph of the target word." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-153", "text": "Thus, they created three different methods: 1. TDParse-uses only the full dependency graph context, 2." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-154", "text": "TDParse the feature of TDParse-and the left and right contexts, and 3." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-155", "text": "TDParse+ the features of TDParse and LS and RS contexts." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-156", "text": "The experiments are performed on the Dong et al. (2014) and Wang et al. (2017) Twitter datasets where we train and test on the previously specified train and test splits." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-157", "text": "We also scale our features using Max Min scaling before inputting into the SVM." 
}, { "sent_id": "3bc48bea420e4977027832240450ec-C001-158", "text": "We used all three sentiment lexicons as in the original paper, and we found the C-Value by performing 5 fold stratified cross validation on the training datasets." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-159", "text": "The results of these experiments can be seen in figure 3 10 ." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-160", "text": "As found with the results of replication, scaling is very important but is typically overlooked when reporting." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-161", "text": "Tang et al. (2016a) was the first to use LSTMs specifically for TDSA." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-162", "text": "They created three different models: 1. LSTM a standard LSTM that runs over the length of the sentence and takes no target information into account, 2." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-163", "text": "TDLSTM runs two LSTMs, one over the left and the other over the right context of the target word and concatenates the output of the two, and 3." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-164", "text": "TCLSTM same as the TDLSTM method but each input word vector is concatenated with vector of the target word." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-165", "text": "All of the methods outputs are fed into a softmax activation function." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-166", "text": "The experiments are performed on the Dong et al. (2014) dataset where we train and test on the specified splits." 
}, { "sent_id": "3bc48bea420e4977027832240450ec-C001-167", "text": "For the LSTMs we initialised the weights using uniform distribution U(0.003, 0.003), used Stochastic Gradient Descent (SGD) a learning rate of 0.01, cross entropy loss, padded and truncated sequence to the length of the maximum sequence in the training dataset as stated in the original paper, and we did not \"set the clipping threshold of softmax layer as 200\" (Tang et al., 2016a) as we were unsure what this meant." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-168", "text": "With regards to the number of epochs trained, we used early stopping with a patience of 10 and allowed 300 epochs." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-169", "text": "Within their experiments they used SSWE and Glove Twitter vectors 11 (Pennington et al., 2014) ." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-170", "text": "As the paper being reproduced does not define the number of epochs they trained for, we use early stopping." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-171", "text": "Thus for early stopping we require to split the training data into train and validation sets to know when to stop." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-172", "text": "As it has been shown by Reimers and Gurevych (2017) that the random seed statistically significantly changes the results of experiments we ran each model over each word embedding thirty times, using a different seed value but keeping the same stratified train and validation split, and reported the results on the same test data as the original paper." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-173", "text": "As can be seen in Figure 4 , the initial seed value makes a large difference more so for the smaller embeddings." 
}, { "sent_id": "3bc48bea420e4977027832240450ec-C001-174", "text": "In table 5, we show the difference between our mean and maximum result and the original result for each model using the 200 dimension Glove Twitter vectors." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-175", "text": "Even though the mean result is quite different from the original the maximum is much closer." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-176", "text": "Our results generally agree with their results on the ranking of the word vectors and the embeddings." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-177", "text": "Overall, we were able to reproduce the results of all three papers." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-178", "text": "However for the neural network/deep learning approach of Tang et al. (2016a) we agree with Reimers and Gurevych (2017) that reporting multiple runs of the system over different seed values is required as the single performance scores can be misleading, which could explain why previous papers obtained different results to the original for the TDLSTM method (Chen et al., 2017; Tay et al., 2017) ." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-179", "text": "----------------------------------" }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-180", "text": "**MASS EVALUATION**" }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-181", "text": "For all of the methods we pre-processed the text by lower casing and tokenising using Twokenizer (Gimpel et al., 2011) , and we used all three sentiment lexicons where applicable." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-182", "text": "We found the best word vectors from SSWE and the common crawl 42B 300 dimension Glove vectors by five fold stratified cross validation for the NP methods and the highest accuracy on the validation set for the LSTM methods." 
}, { "sent_id": "3bc48bea420e4977027832240450ec-C001-183", "text": "We chose these word vectors as they have very different sizes (50 and 300), also they have been shown to perform well in different text types; SSWE for social media (Tang et al., 2016a) and Glove for reviews (Chen et al., 2017) ." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-184", "text": "To make the experiments quicker and computationally less expensive, we filtered out all words from the word vectors that did not appear in the train and test datasets, and this is equivalent with respect to word coverage as using all words." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-185", "text": "Finally we only reported results for the LSTM methods with one seed value and not multiple due to time constraints." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-186", "text": "The results of the methods using the best found word vectors on the test sets can be seen in table 6." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-187", "text": "We find that the TDParse methods generally perform best but only clearly outperforms the other nondependency parser methods on the YouTuBean dataset." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-188", "text": "We hypothesise that this is due to the dataset containing, on average, a deeper constituency tree depth which could be seen as on average more complex sentences." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-189", "text": "This could be due to it being from the spoken medium compared to the rest of the datasets which are written." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-190", "text": "Also that using a sentiment lexicon is almost always beneficial, but only by a small amount." 
}, { "sent_id": "3bc48bea420e4977027832240450ec-C001-191", "text": "Within the LSTM based methods the TDLSTM method generally performs the best indicating that the extra target information that the TCLSTM method contains is not needed, but we believe this needs further analysis." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-192", "text": "We can conclude that the simpler NP models perform well across domain, type and medium and that even without language specific tools and lexicons they are competitive to the more complex LSTM based methods." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-193", "text": "----------------------------------" }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-194", "text": "**DISCUSSION AND CONCLUSION**" }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-195", "text": "The fast developing subfield of TDSA has so far lacked a large-scale comparative mass evaluation of approaches using different models and datasets." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-196", "text": "In this paper, we address this generalisability limitation and perform the first direct comparison and reproduction of three different approaches for TDSA." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-197", "text": "While carrying out these reproductions, we have noted and described above, the many emerging issues in previous research related to incomplete descriptions of methods and settings, patchy release of code, and lack of comparative evaluations." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-198", "text": "This is natural in a developing field, but it is crucial for ongoing development within NLP in general that improved repeatability practices are adopted." 
}, { "sent_id": "3bc48bea420e4977027832240450ec-C001-199", "text": "The practices adopted in our case studies are to reproduce the methods in open source code, adopt only open data, provide format conversion tools to ingest the different data formats, and describe and document all settings via the code and Jupyter Notebooks (released initially in anonymous form at submission time) 12 ." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-200", "text": "We therefore argue that papers should not consider repeatability (replication or reproduction) or generalisability alone, but these two key tenets of scientific practice should be brought together." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-201", "text": "In future work, we aim to extend our reproduction framework further, and extend the comparative evaluation to languages other than English." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-202", "text": "This will necessitate changes in the framework since we expect that dependency parsers and sentiment lexicons will be unavailable for specific languages." }, { "sent_id": "3bc48bea420e4977027832240450ec-C001-203", "text": "Also we will explore through error analysis in which situations different neural network architectures perform best." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "3bc48bea420e4977027832240450ec-C001-23" ], [ "3bc48bea420e4977027832240450ec-C001-57" ], [ "3bc48bea420e4977027832240450ec-C001-74" ], [ "3bc48bea420e4977027832240450ec-C001-117" ], [ "3bc48bea420e4977027832240450ec-C001-152" ] ], "cite_sentences": [ "3bc48bea420e4977027832240450ec-C001-23", "3bc48bea420e4977027832240450ec-C001-57", "3bc48bea420e4977027832240450ec-C001-74", "3bc48bea420e4977027832240450ec-C001-117", "3bc48bea420e4977027832240450ec-C001-152" ] }, "@MOT@": { "gold_contexts": [ [ "3bc48bea420e4977027832240450ec-C001-30", "3bc48bea420e4977027832240450ec-C001-31" ] ], "cite_sentences": [ "3bc48bea420e4977027832240450ec-C001-31" ] }, "@UNSURE@": { "gold_contexts": [ [ "3bc48bea420e4977027832240450ec-C001-95" ], [ "3bc48bea420e4977027832240450ec-C001-99" ], [ "3bc48bea420e4977027832240450ec-C001-138" ] ], "cite_sentences": [ "3bc48bea420e4977027832240450ec-C001-95", "3bc48bea420e4977027832240450ec-C001-99", "3bc48bea420e4977027832240450ec-C001-138" ] }, "@USE@": { "gold_contexts": [ [ "3bc48bea420e4977027832240450ec-C001-119" ], [ "3bc48bea420e4977027832240450ec-C001-156" ] ], "cite_sentences": [ "3bc48bea420e4977027832240450ec-C001-119", "3bc48bea420e4977027832240450ec-C001-156" ] } } }, "ABC_af6c68ef5f80eac2274bf33a894d1f_18": { "x": [ { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-2", "text": "We replicate the syntactic experiments of Mikolov et al. (2013b) on English, and expand them to include morphologically complex languages." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-3", "text": "We learn vector representations for Dutch, French, German, and Spanish with the WORD2VEC tool, and investigate to what extent inflectional information is preserved across vectors." 
}, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-4", "text": "We observe that the accuracy of vectors on a set of syntactic analogies is inversely correlated with the morphological complexity of the language." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-5", "text": "----------------------------------" }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-6", "text": "****" }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-50", "text": "The removal of the possessives improves the accuracy from 25.2% reported in the original paper to 40.1%." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-7", "text": "1 Introduction Mikolov et al. (2013b) demonstrate that vector representations of words obtained from a neural network language model provide a way of capturing both semantic and syntactic regularities in language." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-8", "text": "They observe that by manipulating vector offsets between pairs of words, it is possible to derive an approximation of vectors representing other words, such as queen \u2248 king \u2212 man + woman." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-9", "text": "Similarly, an abstract relationship between the present and past tense may be computed by subtracting the base form eat from the past form ate; the result of composing such an offset with the base form cook may turn out to be similar to the vector for cooked (Figure 1 )." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-10", "text": "They report state-of-the-art results on a set of analogy questions of the form \"a is to b as c is to \", where the variables represent various English word forms." 
}, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-11", "text": "Our work is motivated by two observations regarding Mikolov et al.'s experiments: first, the syntactic analogies that they test correspond to morphological inflections, and second, the tests only evaluate English, a language with little morphological complexity." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-12", "text": "In this paper, we replicate their syntactic experiments on four languages that are more morphologically complex than English: Dutch, French, German, and Spanish." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-13", "text": "----------------------------------" }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-14", "text": "**REPLICATION EXPERIMENTS**" }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-15", "text": "In order to validate our methodology, we first replicate the results of Mikolov et al. (2013b) on English syntactic analogies." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-16", "text": "----------------------------------" }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-17", "text": "**TRAINING CORPUS FOR WORD VECTORS**" }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-18", "text": "The vectors of Mikolov et al. (2013b) were trained on 320M tokens of broadcast news data, as described by Mikolov et al. (2011) ." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-19", "text": "Since we have no access to this data, we instead train English vectors on a corpus from the Polyglot project (Al-Rfou et al., 2013) , which contains tokenized Wikipedia dumps intended for the training of vector-space models." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-20", "text": "For comparison with the results of Mikolov et al. (2013b) , we limit the data to the first 320M lowercased tokens of the corpus." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-21", "text": "Mikolov et al. (2013b) obtain their best results with vectors of size 1600 that combine several models, but do not elaborate on how this composite model was constructed." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-22", "text": "Instead, we take as a point of reference their second-best model, which employs 640-dimensional vectors produced by a single recurrent neural network (RNN) language model." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-23", "text": "1 Rather than use an RNN model to learn our own vectors, we employ the far simpler skip-gram model." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-24", "text": "Mikolov et al. (2013a) show that higher accuracy can be obtained using vectors derived from this model, which is also far less expensive to train." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-25", "text": "The skip-gram model eschews a language modeling objective in favor of a logistic regression classifier that predicts surrounding words." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-26", "text": "The WORD2VEC package includes code for learning skip-gram models from very large corpora." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-27", "text": "2 We train 640-dimensional vectors using the skip-gram model with a hierarchical softmax, a context window of 10, sub-sampling of 1e-3, and a minimum frequency threshold of 10." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-28", "text": "----------------------------------" }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-29", "text": "**TEST SET**" }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-30", "text": "The test set of Mikolov et al. (2013b) is publicly available 3 ." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-31", "text": "They extract their gold standard inflections, as well as frequency counts, from tagged newspaper text." 
}, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-32", "text": "Their test set was constructed as follows: after tagging 267M words, the 100 most frequent plural nouns, possessive nouns, comparative adjectives, and verbal infinitives were selected." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-33", "text": "Each was paired with 5 randomly-selected words of the same part-of-speech, and analogy questions were constructed for each word pair." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-34", "text": "For example, for the pair people and city, two questions are created: people:person :: cities:city, and its mirror: person:people :: city:cities." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-35", "text": "To solve the analogies in this test set, we apply the word-analogy tool that is included with WORD2VEC." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-36", "text": "For each analogy a : b :: c :?, the tool searches the entire vocabulary for the vector d that is most similar to the vector estimated by performing a linear analogy on the query triplet a, b, c: d = argmax cos(x_d, x_b - x_a + x_c)." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-37", "text": "We calculate accuracy as the percentage of analogies whose answers are correctly predicted, according to an exact match." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-38", "text": "The analogies involve nouns, adjectives, and verbs." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-39", "text": "Nominal analogies consist of comparisons between singular and plural forms, and possessive and nominative forms." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-40", "text": "Due to the tokenization method used in our training corpus, we are unable to build vectors for English possessives." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-41", "text": "We therefore modify the nominal test set to only include questions that contain the singular vs. plural distinction." 
}, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-42", "text": "We make no changes to the adjectival and verbal analogy sets." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-43", "text": "The adjectival set contains analogies between the comparative and the superlative, the comparative and the base, and the superlative and the base." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-44", "text": "The verbal set includes comparisons between the preterite, the infinitive, and the 3rd person singular present, but not the past and present participles." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-45", "text": "----------------------------------" }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-46", "text": "**RESULTS**" }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-47", "text": "In Table 1 , we report two numbers for each part of speech." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-48", "text": "The first, labeled as M13, is the result of applying the vectors of Mikolov et al. (2013b) to their test set." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-49", "text": "The results match those reported in their paper, except for the nominal results, which reflect our modifications described in Section 2.2." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-51", "text": "The second column, labeled as Ours, reports the results for our vectors, which were trained using WORD2VEC on the English data described in Section 2.1." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-52", "text": "Our verbal and adjectival vectors obtain slightly lower accuracies than the RNN-trained vectors of Mikolov et al. (2013b) , but they are not far off." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-53", "text": "For nouns, however, we obtain higher accuracy than Mikolov et al. The tokenization method that removes possessives from consideration may produce better vectors for singular and plural forms, as it increases the frequency of these types." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-54", "text": "----------------------------------" }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-55", "text": "**MULTILINGUAL EXPERIMENTS**" }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-56", "text": "Our second set of experiments examines to what extent the syntactic regularities are captured by word vectors in four other languages: Dutch, French, German, and Spanish." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-57", "text": "----------------------------------" }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-58", "text": "**TRAINING CORPORA FOR WORD VECTORS**" }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-59", "text": "As in the previous experiment, our training corpora are from the Polyglot project." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-60", "text": "We limit each corpus to the first 320M lowercased tokens, except for the Dutch corpus, which has only 180M tokens." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-61", "text": "Since the WORD2VEC tool cannot handle Unicode, we map all non-ASCII characters to unused ASCII characters." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-62", "text": "We run WORD2VEC with exactly the same hyper-parameters as in Section 2.1." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-63", "text": "The English experiments in this section use the same training data and vectors as in Section 2, but we construct a new test set to match our methodology for the other languages." 
}, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-64", "text": "----------------------------------" }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-65", "text": "**TEST SETS**" }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-66", "text": "In order to make results between multiple languages comparable, we made several changes to the construction of syntactic analogy questions." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-67", "text": "We follow the methodology of Mikolov et al. (2013b) in limiting analogy questions to the 100 most frequent verbs or nouns." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-68", "text": "The frequencies are obtained from corpora tagged by TREETAGGER (Schmid, 1994) ." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-69", "text": "We identify inflections using manually constructed inflection tables from several sources." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-70", "text": "Spanish and German verbal inflections, as well as German nominal inflections, are from a Wiktionary data set introduced by Durrett and DeNero (2013) ." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-71", "text": "4 Dutch verbal inflections and English verbal and nominal inflections are from the CELEX database (Baayen et al., 1995) ." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-72", "text": "French verbal inflections are from Verbiste, an online French conjugation dictionary." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-95", "text": "In the All Inflections column, we see that the overall accuracy decreases as the morphological complexity increases." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-73", "text": "5 Whereas Mikolov et al. create analogies from various inflectional forms, we require the analogies to always include the base dictionary form: the infinitive for verbs, and the nominative singular for nouns." 
}, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-74", "text": "In other words, all analogies are limited to comparisons between the base form and an inflected form. 4 We exclude Finnish because of its high morphological complexity and the small size of the corresponding Polyglot corpus." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-75", "text": "5 http://perso.b2b2c.ca/sarrazip/dev/verbiste.html" }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-76", "text": "This is to prevent a combinatorial explosion of the number of analogies in languages that contain dozens of different inflection forms." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-77", "text": "We also create new English test sets using this methodology, in order to ensure a fair cross-lingual comparison." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-78", "text": "Table 2 shows the number of analogy questions for each language set." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-79", "text": "Note that the languages are ordered according to increasing morphological complexity." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-80", "text": "Following Mikolov et al., we ensure that all analogies contain at least one pair of non-syncretic forms." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-81", "text": "It would make little sense to include analogies such as \"set is to set as put is to ?\" because both verbs in question have the same present and past tense form." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-82", "text": "However, we do allow analogies which involve syncretic forms for one half of the analogy." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-83", "text": "For example, either taken or took is a correct answer to \"play is to played as take is to ?\"." 
}, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-84", "text": "These types of questions account for an average of 2.8% of analogies, ranging from 0% for English nouns to 8.9% for German verbs." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-85", "text": "The number of questions for each language is a function of the number of inflectional forms, but it is not a simple linear relationship." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-86", "text": "If each English verb had five different inflections, each with sufficient frequency in the training corpus, we would expect 4000 questions for 100 verbs." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-87", "text": "This is because each verb should ideally be compared to five other verbs, with the base form paired with the other four inflectional forms, in both directions." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-88", "text": "The actual number of questions is smaller because some forms are identical, while other forms are observed less frequently than our minimum threshold of 10." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-89", "text": "----------------------------------" }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-90", "text": "**RESULTS**" }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-91", "text": "We conduct two experiments to quantify the extent that the syntactic regularities observed in English hold in the other languages." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-92", "text": "In the first experiment, which is referred to as All Inflections, we measure the accuracy of vectors on all inflected forms." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-93", "text": "In the second experiment, named Inflection Subset, we attempt to factor out the variation in the number of inflectional forms across languages by considering only the forms that are observed in English (five forms for verbs, and two forms for nouns)." 
}, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-94", "text": "The results of the experiments are in Table 3 ." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-96", "text": "However, the Inflection Subset column reveals an opposite trend: the accuracy is increasing towards the bottom of the table (although English stands out as a clear exception)." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-97", "text": "Looking across the rows, the accuracy on the inflection subset is higher than on all inflections, except on Dutch." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-98", "text": "Noun analogies are only tested on two languages, but they seem to follow the same trends as verbs." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-99", "text": "The results in Table 3 are not easy to interpret." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-100", "text": "It appears the lower frequencies of multiple inflected word forms make the task more difficult, which is reflected in the All Inflections results." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-101", "text": "The median frequencies of individual verb forms in French and Spanish are approximately one-tenth of the corresponding numbers in Dutch and German, which in turn are about one-fourth of the English median." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-102", "text": "However, these ratios are not neatly correlated with the accuracy results in Table 3 ." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-103", "text": "Regarding the contrasting results in the Inflection Subset column, we conjecture that a larger number of inflections may make individual forms easier to disambiguate." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-104", "text": "This in turn allows WORD2VEC to learn more precise vectors for each word type." 
}, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-105", "text": "The median frequencies of the forms in the inflection subset tend to be higher than the corresponding values computed for all inflections, but there is a substantial variation between different languages." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-106", "text": "Dutch, in particular, sees a similar increase in median frequency to German, but while German accuracy increases, Dutch decreases." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-107", "text": "We conclude that although frequency is an important factor when performing syntactic analogies with vectors, there must be other factors contributing to these results." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-108", "text": "It is perhaps unsurprising that English is the winner on its own inflection set." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-109", "text": "However, another reason that English does not follow the trend in the Inflection Subset column may be related to the frequencies of its small set of wordforms, which are uniformly higher than in other languages." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-110", "text": "The experiments that we describe in the next section provide additional insights into these results." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-111", "text": "----------------------------------" }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-112", "text": "**HYPER-PARAMETER EXPERIMENTS**" }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-113", "text": "In this section, we describe experiments that quantify how the quality of the vectors is affected by the window size and the amount of training data." 
}, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-114", "text": "----------------------------------" }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-115", "text": "**WINDOW SIZE**" }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-116", "text": "First, we investigate the role that the window size has on the accuracy of learned vectors." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-117", "text": "We expect that larger window sizes may create more topic-oriented vectors, while small windows result in vectors that capture syntactic information (Turney, 2012) ." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-118", "text": "While all experiments in Section 3 used a window size of 10, the languages have different syntactic and morphological patterns, and some of the results observed in Section 3 may simply be a side effect of better or worse window sizes for particular languages." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-119", "text": "We run an experiment that tests window sizes of 1, 3, 5 and 10, calculating the analogy accuracy for each language and each window size." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-120", "text": "Figure 2 shows the results for varying window sizes." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-121", "text": "While no single window size is best for all languages, we observe that the morphologically complex languages perform better with larger windows." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-122", "text": "One benefit that larger window sizes may provide is access to more information during vector training, which may be important when each type is observed less frequently." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-123", "text": "Our next experiment directly investigates the impact of the training data size." 
}, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-124", "text": "----------------------------------" }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-125", "text": "**LEARNING CURVES**" }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-126", "text": "In this section, we investigate how varying the size of the vector training data affects the vector accuracy." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-127", "text": "We progressively subsample the training data: starting with the complete training set, we construct a 50% subsample by selecting each sentence for inclusion with probability 0.5." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-128", "text": "We then iterate this process, each time sampling roughly 50% of the sentences from the previously created subsample, until we have a subsample that is only 1.6% of the original training data." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-129", "text": "This gives us training sets with approximately 1.6, 3.1, 6.3, 12.5, 25, 50, and 100% of the full corpora." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-130", "text": "We set the window size to 5 for this experiment; the other hyper-parameters are the same as those in Section 2.1." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-131", "text": "The learning curves for verbs and nouns are shown in Figure 3 ." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-132", "text": "We see that the trends observed in Section 3 hold regardless of the amount of data that is used for training: namely, the accuracy of the vectors is inversely correlated with the number of inflection slots in a given language set." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-133", "text": "Secondly, while the English curves are beginning to level off, the curves for the other languages continue to rise, even as we reach 100% of our data." 
}, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-134", "text": "This suggests that there would be little gain in adding more English data, but a potential gain to be seen by adding more data to the other languages." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-135", "text": "This seems to support our hypothesis that the sparsity of the data is at least partially responsible for the lower accuracies on the morphologically complex languages." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-136", "text": "----------------------------------" }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-137", "text": "**CONCLUSION**" }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-138", "text": "The results of our experiments show that it is possible to learn vectors that preserve morphological information even for languages with complex inflectional systems." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-139", "text": "The accuracy of vectors on a set of syntactic analogies in four tested languages is lower than in English, and it appears to be inversely proportional to morphological complexity, as measured by the number of inflections in the language." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-140", "text": "When we limit our test set to the small set of inflections common across languages, we see improvements in the accuracy, which positively correlate with the complexity of the language." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-141", "text": "This suggests that for frequently observed phenomena, morphological complexity may be an advantage, making each type distinct and easier to model." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-142", "text": "Additional experiments suggest that the accuracy on more complex languages may further improve if more training data is provided." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-143", "text": "These results suggest two possible avenues for future work." 
}, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-144", "text": "The first is to build morphologically-aware vectors, such as those of Botha and Blunsom (2014) , so that the more morphologically complex languages can make better use of limited training data." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-145", "text": "The second is to investigate methods that can distinguish syncretic forms in context." }, { "sent_id": "af6c68ef5f80eac2274bf33a894d1f-C001-146", "text": "For example, it could be possible to modify the joint word-sense and vector induction algorithm of Neelakantan et al. (2014) to focus on syntactic parts-of-speech instead of topical senses." } ], "y": { "@USE@": { "gold_contexts": [ [ "af6c68ef5f80eac2274bf33a894d1f-C001-2" ], [ "af6c68ef5f80eac2274bf33a894d1f-C001-15" ], [ "af6c68ef5f80eac2274bf33a894d1f-C001-48" ], [ "af6c68ef5f80eac2274bf33a894d1f-C001-67" ] ], "cite_sentences": [ "af6c68ef5f80eac2274bf33a894d1f-C001-2", "af6c68ef5f80eac2274bf33a894d1f-C001-15", "af6c68ef5f80eac2274bf33a894d1f-C001-48", "af6c68ef5f80eac2274bf33a894d1f-C001-67" ] }, "@EXT@": { "gold_contexts": [ [ "af6c68ef5f80eac2274bf33a894d1f-C001-2" ] ], "cite_sentences": [ "af6c68ef5f80eac2274bf33a894d1f-C001-2" ] }, "@BACK@": { "gold_contexts": [ [ "af6c68ef5f80eac2274bf33a894d1f-C001-7" ], [ "af6c68ef5f80eac2274bf33a894d1f-C001-18" ] ], "cite_sentences": [ "af6c68ef5f80eac2274bf33a894d1f-C001-7", "af6c68ef5f80eac2274bf33a894d1f-C001-18" ] }, "@MOT@": { "gold_contexts": [ [ "af6c68ef5f80eac2274bf33a894d1f-C001-20" ] ], "cite_sentences": [ "af6c68ef5f80eac2274bf33a894d1f-C001-20" ] }, "@DIF@": { "gold_contexts": [ [ "af6c68ef5f80eac2274bf33a894d1f-C001-21", "af6c68ef5f80eac2274bf33a894d1f-C001-22" ], [ "af6c68ef5f80eac2274bf33a894d1f-C001-52" ] ], "cite_sentences": [ "af6c68ef5f80eac2274bf33a894d1f-C001-21", "af6c68ef5f80eac2274bf33a894d1f-C001-52" ] }, "@UNSURE@": { "gold_contexts": [ [ "af6c68ef5f80eac2274bf33a894d1f-C001-30" ] ], "cite_sentences": [ 
"af6c68ef5f80eac2274bf33a894d1f-C001-30" ] }, "@SIM@": { "gold_contexts": [ [ "af6c68ef5f80eac2274bf33a894d1f-C001-52" ] ], "cite_sentences": [ "af6c68ef5f80eac2274bf33a894d1f-C001-52" ] } } }, "ABC_3356313ee5cdf186816cd6fecfce84_18": { "x": [ { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-2", "text": "Keyword Spotting (KWS) enables speech-based user interaction on smart devices." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-3", "text": "Always-on and battery-powered application scenarios for smart devices put constraints on hardware resources and power consumption, while also demanding high accuracy as well as real-time capability." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-4", "text": "Previous architectures first extracted acoustic features and then applied a neural network to classify keyword probabilities, optimizing towards memory footprint and execution time." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-5", "text": "Compared to previous publications, we took additional steps to reduce power and memory consumption without reducing classification accuracy." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-6", "text": "Power-consuming audio preprocessing and data transfer steps are eliminated by directly classifying from raw audio." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-7", "text": "For this, our end-to-end architecture extracts spectral features using parametrized Sinc-convolutions." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-8", "text": "Its memory footprint is further reduced by grouping depthwise separable convolutions." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-9", "text": "Our network achieves the competitive accuracy of 96.4% on Google's Speech Commands test set with only 62k parameters." 
}, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-10", "text": "----------------------------------" }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-11", "text": "**INTRODUCTION**" }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-12", "text": "Speech processing enables natural communication with smart phones or smart home assistants, e.g., Amazon Echo, Google Home." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-13", "text": "However, continuously performing speech recognition is not energy-efficient and would drain batteries of smart devices." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-14", "text": "Instead, most speech recognition systems passively listen for utterances of certain wake words such as \"Ok Google\", \"Hey Siri\", \"Alexa\", etc." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-15", "text": "to trigger the continuous speech recognition system on demand." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-16", "text": "This task is referred to as keyword spotting (KWS) ." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-17", "text": "There are also uses of KWS where a few simple speech commands (e.g. \"on\", \"off\") are enough to interact with a device such as a voice-controlled light bulb." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-18", "text": "Conventional hybrid approaches to KWS first divide their audio signal in time frames to extract features, e.g., Mel Frequency Cepstral Coefficients (MFCC)." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-19", "text": "A neural net then estimates phoneme or state posteriors of the keyword Hidden Markov Model in order to calculate the keyword probability using a Viterbi search." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-20", "text": "In recent years, end-to-end architectures gained traction that directly classify keyword posterior probabilities based on the previously extracted features, e.g., [1, 2, 3, 4, 5] ." 
}, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-21", "text": "Typical application scenarios imply that the device is powered by a battery, and possesses restricted hardware resources to reduce costs." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-22", "text": "Therefore, previous works optimized towards memory footprint and operations per second." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-23", "text": "In contrast to this, we tune our neural network towards energy conservation in microcontrollers motivated by observations on power consumption, as detailed in Sec. 3.1." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-24", "text": "To extract meaningful and representative features from raw audio, our architecture uses parametrized Sinc-convolutions (SincConv) from SincNet proposed by Ravanelli et al. [6] ." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-25", "text": "We use Depthwise Separable Convolutions (DSConv) [7, 8] that preserve time-context information while at the same time comparing features in different channels." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-26", "text": "To further reduce the number of network parameters, which is key for energy efficiency, we group DSConv-layers, a technique we refer to as Grouped DSConv (GDSConv)." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-27", "text": "Our key contributions are:" }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-28", "text": "\u2022 We propose a neural network architecture tuned towards energy efficiency in microcontrollers grounded on the observation that memory access is costly, while computation is cheap [9] ." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-29", "text": "\u2022 Our keyword-spotting network classifies on raw audio employing SincConvs while at the same time reducing the number of parameters using (G)DSConvs." 
}, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-30", "text": "[3] , while keeping the number of parameters comparable to [2] ." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-31", "text": "Choi et al. build on this work as they also use a ResNet-inspired architecture." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-32", "text": "Instead of using 2D convolution over a time-frequency representation of the data they convolve along the time dimension and treat the frequency dimension as channels [4] ." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-33", "text": "This bears similarities with our approach as we are using 1D convolution along the time dimension as well." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-34", "text": "However, all the approaches mentioned classify from MFCCs or similar preprocessed features." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-35", "text": "Our architecture works directly on raw audio signals." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-36", "text": "There is a recent trend towards using CNNs on raw audio data directly [6, 10, 11, 12] ." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-37", "text": "Ravanelli et al. present an effective method of processing raw audio with CNNs, called SincNet." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-38", "text": "Kernels of the first convolutional layer are restricted to only learn shapes of parametrized sinc functions." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-39", "text": "This method was first introduced for Speaker Recognition [6] and later also used for Phoneme Recognition [10] ." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-40", "text": "To the best of our knowledge, we are the first to apply this method to the task of KWS." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-41", "text": "The first convolutional layer of our model is inspired by SincNet and we combine it with DSCconv." 
}, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-42", "text": "DSCconvs have first been introduced in the domain of Image Processing [8, 13] and have been applied to other domains since: Zhang et al. applied DSCconv to KWS [2] ." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-43", "text": "Kaiser et al. used DSConv for neural machine translation [7] ." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-44", "text": "They also introduce the \"super-separable\" convolution, a DSConv that also uses grouping and thus reduces the already small number of parameters of DSConv even further." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-45", "text": "A similar method is used by ShuffleNet where they combine DSConv with grouping and channel shuffling [14] ." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-46", "text": "The idea of Grouped Convolutions was first used in AlexNet [15] to reduce parameters and operations and to enable distributed computing of the model over multiple GPUs." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-47", "text": "We denominate the combination of grouping and DSconv as GDSConv in our work and use it for our smallest model." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-48", "text": "----------------------------------" }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-49", "text": "**MODEL**" }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-50", "text": "----------------------------------" }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-51", "text": "**KEYWORD-SPOTTING ON BATTERY-POWERED DEVICES**" }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-52", "text": "Typical application scenarios for smart devices imply that the device is powered by a battery, and possesses restricted hardware resources." 
}, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-53", "text": "The requirements for a KWS system in these scenarios are (1) very low power consumption to maximize battery life, (2) real-time or near real-time capability, (3) low memory footprint and (4) high accuracy to avoid random activations and to ensure responsiveness." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-54", "text": "Regarding real-time capability, our model is designed to operate on a single-core microcontroller capable of 50 MOps per second [2] ." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-55", "text": "We assume that in microcontrollers the memory consumption of a KWS neural network is associated with its power consumption: Reading memory values contributes most to power consumption which makes re-use of weights favorable." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-56", "text": "While in general large memory modules leak more power than small memory modules, one read operation from RAM costs far more energy than the corresponding multiply-and-accumulate computation [16, 9] ." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-57", "text": "In addition to the parameter-reducing approach in this work, further steps may be employed to reduce power consumption such as quantization, model compression or optimization strategies regarding dataflows that depend on the utilized hardware platform [16, 9, 17, 18] ." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-58", "text": "----------------------------------" }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-59", "text": "**FEATURE EXTRACTION USING SINCCONVS**" }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-60", "text": "SincNet [6] classifies on raw audio by restricting the filters of the first convolutional layer of a CNN to only learn parametrized sinc functions, i.e., sinc (x) = sin(x)/x." 
}, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-61", "text": "One sinc function in the time domain represents a rectangular function in the spectral domain, therefore two sinc functions can be combined to an ideal band-pass filter:" }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-62", "text": "Performing convolution with such a filter extracts the parts of the input signal that lie within a certain frequency range." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-63", "text": "SincNet combines Sinc-convolutions with CNNs; as we only use the feature extraction layer of this architecture, we label this layer as SincConv to establish a distinction to SincNet." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-64", "text": "Compared to one filter of a regular CNN, the number of parameters is derived from its kernel width, e.g., k = 400 [11] ." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-65", "text": "Sinc-convolutions only require two parameters to derive each filter, the lower and upper cut-off frequencies (f 1 , f 2 ), resulting in a small memory footprint." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-66", "text": "SincConv filters are initialized with the cutoff frequencies of the melscale filter bank and then further adjusted during training." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-67", "text": "Fig. 1 visualizes this adjustment from initialization to after training." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-68", "text": "SincConv filter banks can be easily interpreted, as the two learned parameter correspond to a specific frequency band." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-69", "text": "Fig. 2 visualizes how a SincConv layer with 7 filters processes an audio sample containing the word \"yes\"." 
}, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-70", "text": "----------------------------------" }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-71", "text": "**LOW-PARAMETER GDSCONV LAYERS**" }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-72", "text": "DSConv have been successfully applied to the domain of computer vision [8, 13] , neural translation [7] and KWS [2] ." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-73", "text": "Fig. 3 provides an overview of the steps from a regular convolution to the GDSConv." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-74", "text": "The number of parameters of one DSConv layer amounts to N DSConv = k \u00b7 c in + c in \u00b7 c out with the kernel size k and the number of input and output channels c in and c out respectively; the first summand is determined by the depthwise convolution, the second summand by the pointwise convolution [7] ." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-75", "text": "In our model configuration, the depthwise convolution only accounts for roughly 5% of parameters in this layer, the pointwise for 95%." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-76", "text": "We therefore reduced the parameters of the pointwise convolution using grouping by a factor g to N GDSConv = k \u00b7 c in + cin\u00b7cout g , rather than the parameters in the depthwise convolution." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-77", "text": "To allow information exchange between groups we alternate the number of groups per layer, namely 2 and 3, as proposed in [7] ." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-78", "text": "----------------------------------" }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-79", "text": "**TWO LOW-PARAMETER ARCHITECTURES**" }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-80", "text": "The SincConv as the first layer extracts features from the raw input samples, as shown in Fig. 4 ." 
}, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-81", "text": "As non-linearity after the SincConv we opt to use log-compression, i.e., y = log(abs(x) + 1), instead of a common activation function (e.g., ReLU)." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-82", "text": "This has also shown to be effective in other CNN architectures for raw audio processing [11, 12] ." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-83", "text": "Five (G)DSConv layers are then used to process the features further: The first layer has a larger kernel size and scales [20] and spatial dropout [21] for regularization, as well as average pooling to reduce temporal resolution." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-84", "text": "After the (G)DSConv blocks, we use global average pooling to receive a 160element vector that can be transformed to class posteriors using a Softmax layer to classify 12 classes, i.e., 10 keywords as well as a class for unknown and for silence." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-85", "text": "The low-parameter model is obtained by grouping the DSConv layers with an alternating number of groups between 2 and 3." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-86", "text": "For the configuration shown in Fig. 4 , the base model has 122k parameters." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-87", "text": "After grouping, the number of parameters is reduced to a total of 62k." 
}, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-88", "text": "----------------------------------" }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-89", "text": "**EVALUATION**" }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-90", "text": "----------------------------------" }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-91", "text": "**TRAINING ON THE SPEECH COMMANDS DATASET**" }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-92", "text": "We train and evaluate our model using Google's Speech Commands data set [19] , an established dataset for benchmarking KWS systems." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-93", "text": "The first version of the data set consists of 65k one-second long utterances of 30 different keywords spoken by 1881 different speakers." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-94", "text": "The most common setup consists of a classification of 12 classes: \"yes\", \"no\", \"up\", \"down\", \"left\", \"right\", \"on\", \"off\", \"stop\", \"go\", unknown, or silence." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-95", "text": "The remaining 20 keywords are labeled as unknown, samples of provided background noise files as silence." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-96", "text": "To ensure the benchmark reproducibility, a separate test set was released with a predefined list of samples for the unknown and the silence class." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-97", "text": "The second version of the dataset contains 105k samples and five additional keywords [19] ." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-98", "text": "However, previous publications on KWS reported only results on the first version, therefore we focused on the first version and additionally report testing results on version 2 of the dataset." 
}, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-99", "text": "Every sample from the training set is used in training, this leads to a class imbalance as there are much more samples for unknown." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-100", "text": "Class weights in the training phase assign a lower weight to samples labeled as unknown such that the impact on the model is proportional to the other classes." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-101", "text": "This way, the model can see more unknown word samples during training without getting biased." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-102", "text": "Our model is trained for 60 epochs with the Adam optimizer [22] with an initial learning rate of 0.001 and learning rate decay of 0.5 after 10 epochs; the model with the highest validation accuracy is saved to evaluate accuracy on the test set." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-103", "text": "----------------------------------" }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-104", "text": "**RESULTS AND DISCUSSION**" }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-105", "text": "The base model composed of DSConv layers without grouping achieves the state-of-the-art accuracy of 96.6% on the Speech Commands test set." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-106", "text": "The low-parameter model with GDSConv achieves almost the same accuracy of 96.4% with only about half the parameters." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-107", "text": "This validates the effectiveness of GDSConv for model size reduction." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-108", "text": "Table 1 lists these results in comparison with related work." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-109", "text": "Compared to the DSConv network in [2] , our network is more efficient in terms of accuracy for a given parameter count." 
}, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-110", "text": "Their biggest model has a 1.2% lower accuracy than our base model while having about 4 times the parameters." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-111", "text": "Choi et al. [4] has the most competitive results while we are still able to improve upon their accuracy for a given number of parameters." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-112", "text": "They are using 1D convolution along the time dimension as well which may be evidence that this yields better performance for audio processing or at least KWS." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-113", "text": "----------------------------------" }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-114", "text": "**MODEL**" }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-115", "text": "Accuracy Parameters DS-CNN-S [2] 94.1% 39k DS-CNN-M [2] 94.9% 189k DS-CNN-L [2] 95.4% 498k ResNet15 [3] 95.8% 240k TC-ResNet8 [4] 96.1% 66k TC-ResNet14 [4] 96.2% 137k TC-ResNet14-1.5 [4] 96 Table 2 ." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-116", "text": "Results on Speech Commands version 2 [19] ." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-117", "text": "As opposed to previous works, our architecture does not use preprocessing to extract features, but is able to extract features from raw audio samples with the SincConv layer." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-118", "text": "That makes it possible to execute a full inference as floating point operations, without requiring additional hardware modules to process or transfer preprocessed features." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-119", "text": "Furthermore, we deliberately opted to not use residual connections in our network architecture, considering the memory overhead and added difficulty for hardware acceleration modules." 
}, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-120", "text": "For future comparability, we also trained and evaluated our model on the newer version 2 of the Speech Commands data set; see Table 2 for results." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-121", "text": "On a side note, we observed that models trained on version 2 of the Speech Commands dataset tend to perform better on both the test set for version 2 and the test set for version 1 [19] ." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-122", "text": "----------------------------------" }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-123", "text": "**CONCLUSION**" }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-124", "text": "Always-on, battery-powered devices running keyword spotting require energy efficient neural networks with high accuracy." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-125", "text": "For this, we identified the parameter count in a neural network as a main contributor to power consumption, as memory accesses contribute far more to power consumption than the computation." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-126", "text": "Based on this observation, we proposed an energy efficient KWS neural network architecture by combining feature extraction using SincConvs with GDSConv layers." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-127", "text": "Starting with the base model composed of DSConvs that have already less parameters than a regular convolution, we achieved state-of-the-art accuracy on Google's Speech Commands dataset." }, { "sent_id": "3356313ee5cdf186816cd6fecfce84-C001-128", "text": "We further reduce the number of parameters by grouping the convolutional channels to GDSConv, resulting in a low-parameter model with only 62k parameters." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "3356313ee5cdf186816cd6fecfce84-C001-20" ], [ "3356313ee5cdf186816cd6fecfce84-C001-42" ], [ "3356313ee5cdf186816cd6fecfce84-C001-72" ] ], "cite_sentences": [ "3356313ee5cdf186816cd6fecfce84-C001-20", "3356313ee5cdf186816cd6fecfce84-C001-42", "3356313ee5cdf186816cd6fecfce84-C001-72" ] }, "@SIM@": { "gold_contexts": [ [ "3356313ee5cdf186816cd6fecfce84-C001-29", "3356313ee5cdf186816cd6fecfce84-C001-30" ] ], "cite_sentences": [ "3356313ee5cdf186816cd6fecfce84-C001-30" ] }, "@UNSURE@": { "gold_contexts": [ [ "3356313ee5cdf186816cd6fecfce84-C001-54" ] ], "cite_sentences": [ "3356313ee5cdf186816cd6fecfce84-C001-54" ] }, "@DIF@": { "gold_contexts": [ [ "3356313ee5cdf186816cd6fecfce84-C001-109" ] ], "cite_sentences": [ "3356313ee5cdf186816cd6fecfce84-C001-109" ] } } }, "ABC_08d3f7a0938ab85d9a251b6a2364ed_18": { "x": [ { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-2", "text": "We generalize Cohen, parser to a family of non-projective transition-based dependency parsers allowing polynomial-time exact inference." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-3", "text": "This includes novel parsers with better coverage than Cohen et al. (2011), and even a variant that reduces time complexity to Opn 6 q, improving on prior bounds." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-4", "text": "We hope that this piece of theoretical work inspires design of novel transition systems with better coverage and better run-time guarantees." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-5", "text": "----------------------------------" }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-7", "text": "Non-projective dependency trees are those containing crossing edges." 
}, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-8", "text": "They account for 12.59% of all training sentences in the annotated Universal Dependencies (UD) 2.1 data (Nivre et al., 2017) , and more than 20% in each of 10 languages among the 54 in UD 2.1 with training treebanks." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-9", "text": "But modeling non-projectivity is computationally costly (McDonald and Satta, 2007) ." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-10", "text": "Some transition-based dependency parsers have deduction systems that use dynamic programming to enable exact inference in polynomial time and space (Huang and Sagae, 2010; Kuhlmann et al., 2011) ." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-11", "text": "For non-projective parsing, though, the only tabularization of a transition-based parser is, to our knowledge, that of Cohen et al. (2011) ." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-12", "text": "They define a deduction system for (an isomorphic variant of) Attardi's (2006) transition system, which covers a subset of non-projective trees." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-13", "text": "The exact inference algorithm runs in Opn 7 q time, where n denotes sentence length." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-14", "text": "In this paper, we show how Cohen et al.'s (2011) system can be modified to generate a new family of deduction systems with corresponding transition systems." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-15", "text": "In particular, we present three novel variants of the degree-2 Attardi parser, summarized in Fig. 1 (our technique can also be applied to generalized Attardi (2006) systems; see \u00a73.2)." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-16", "text": "The first two bring non-projective coverage for UD 2.1 to as high as 95.99% by adding extra transitions, and yet retain the same time complexity." 
}, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-17", "text": "The third reduces time complexity for exact inference to Opn 6 q and space complexity from Opn 5 q to Opn 4 q, while still improving empirical coverage from 87.24% to 93.16%." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-18", "text": "1 Code and full statistics for all treebanks can be found at https://github.com/tzshi/ nonproj-dp-variants-naacl2018." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-19", "text": "These theoretical improvements are a step towards making recent state-of-the-art results in transition-based parsing with exact inference (Shi et al., 2017) extensible to practical non-projective parsing, by exemplifying the design of transition systems with better coverage on highly nonprojective datasets and, for one variant, bringing the runtime complexity one level closer to feasibility." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-20", "text": "----------------------------------" }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-21", "text": "**TRANSITION-BASED PARSING**" }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-22", "text": "We first introduce necessary definitions and notation." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-23", "text": "----------------------------------" }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-24", "text": "**A GENERAL CLASS OF TRANSITION SYSTEMS**" }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-25", "text": "A transition system is given by a 4-tuple pC, T, c s , C \u03c4 q, where C is a set of configurations, T is a set of transition functions between configurations, c s is an initialization function mapping an input sentence to an initial configuration, and C \u03c4 \u0102 C defines a set of terminal configurations." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-26", "text": "1 Faster exact inference algorithms have been defined for some sets of mildly non-projective trees (e.g. Pitler et al. 
(2013) ; see G\u00f3mez-Rodr\u00edguez (2016) for more), but lack an underlying transition system." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-27", "text": "Having one has the practical advantage of allowing generative models, as in Cohen et al. (2011) , and transition-based scoring functions, which have yielded good projective-parsing results (Shi et al., 2017) Attardi's (2006) transition system of degree 2 and our variants." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-28", "text": "Solid arrows denote the inventory of reduce transitions; each arrow points from the head to the modifier of the edge created by that transition." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-29", "text": "The degree of a transition is the distance between the head and modifier." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-30", "text": "Green highlights the single degree-3 transition." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-31", "text": "Thick arrows and gray dotted arrows represent additional and deleted transitions with respect to the original Attardi (2006) system." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-32", "text": "Coverage refers to the percentage of nonprojective sentences (a total of 76,084 extracted from 604,273 training sentences in UD 2.1) that the systems are able to handle." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-33", "text": "We employ a tripartite representation for configurations: p\u03c3, \u03b2, Aq, where the three elements are as follows." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-34", "text": "\u03c3 and \u03b2 are disjoint lists called the stack and buffer, respectively." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-35", "text": "Each dependency arc ph, mq in the resolved arcs set A has head h and modifier m. 
For a length-n input sentence w, the initial configuration is c s pwq \" prs, r0, 1, ..., ns, Hq where the 0 in the initial buffer denotes a special node representing the root of the parse tree." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-36", "text": "All terminal configurations have an empty buffer and a stack containing only 0." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-37", "text": "Indexing from 0, we write s i and b j to denote item i on the stack (starting from the right) and item j on the buffer (from the left), respectively." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-38", "text": "We use vertical bars to separate different parts of the buffer or stack." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-39", "text": "For example, when concerned with the top three stack items and the first item on the buffer, we may write \u03c3|s 2 |s 1 |s 0 and b 0 |\u03b2." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-40", "text": "----------------------------------" }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-41", "text": "**ATTARDI'S (2006) SYSTEM**" }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-42", "text": "We now introduce the widely-used Attardi (2006) system, which includes transitions that create arcs between non-consecutive subtrees, thus allowing it to produce some non-projective trees." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-43", "text": "To simplify exposition, here we present Cohen et al.'s (2011) isomorphic version." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-44", "text": "The set of transitions consists of a shift transition (sh) and four reduce transitions (re)." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-45", "text": "A shift moves the first buffer item onto the stack:" }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-46", "text": "A reduce transition re h,m creates a dependency arc between h (head) and m (modifier) and reduces m. 
For example, re s 0 ,s 1 rp\u03c3|s 1 |s 0 , \u03b2, Aqs \" p\u03c3|s 0 , \u03b2, A Y tps 0 , s 1 quq ." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-47", "text": "Row 1 of Fig. 1 depicts the four Attardi reduces." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-48", "text": "The distance between h and m in a re h,m transition is called its degree." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-49", "text": "A system limited to degree-1 transitions can only parse projective sentences." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-50", "text": "As shown in Fig. 1 , Attardi's (2006) system has two degree-2 transitions (re s 0 ,s 2 and re s 2 ,s 0 ) that allow it to cover 87.24% of the nonprojective trees in UD 2.1." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-51", "text": "More generally, an Attardi system of degree D adds re s 0 ,s D and re s D ,s 0 to the system of degree D\u00b41." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-52", "text": "----------------------------------" }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-53", "text": "**IMPROVING COVERAGE**" }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-54", "text": "A key observation is that a degree-D Attardi system does not contain all possible transitions of degree within D. Since prior empirical work has ascertained that transition systems using more transitions with degree greater than 1 can handle more non-projective treebank trees (Attardi, 2006; G\u00f3mez-Rodr\u00edguez, 2016) , we hypothesize that adding some of these \"missing\" reduce transitions into the system's inventory should increase coverage." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-55", "text": "The challenge is to simultaneously maintain run-time guarantees, as there exists a known trade-off between coverage and complexity (G\u00f3mez-Rodr\u00edguez, 2016 Cohen et al. 
(2011) , rather than Opn 3\u00a83`1 q; and (ii) another has degree 2 but better runtime than Cohen et al.'s (2011) system." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-56", "text": "Here, we first sketch the existing exact inference algorithm, 3 and then present our variants." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-57", "text": "----------------------------------" }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-58", "text": "**COHEN ET AL.'S (2011) EXACT INFERENCE**" }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-59", "text": "The main idea of the algorithm is to group transition sequences into equivalence classes and construct longer sequences from shorter ones." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-60", "text": "\u00dd\u00d1 c m , where t i P T and t i pc i\u00b41 q \" c i for i P 1..m." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-61", "text": "As depicted in Fig. 2 , a lengthm I-computation rh 1 , i, h 2 , h 3 , js is any length-m computation where (1) c 0 \" p\u03c3|h 1 , i|\u03b2, Aq and c m \" p\u03c3|h 2 |h 3 , j|\u03b2 1 , A 1 q for some \u03c3, \u03b2, \u03b2 1 , A, and A 1 ; and (2) for all k P 1..m, c k 's stack has \u03c3 as base and length at least |\u03c3|`2." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-62", "text": "Only condition (1) is relevant to this paper: 4 it states that the net effect of an I-computation is to replace the rightmost item h 1 on the stack with items h 2 and h 3 , while advancing the buffer-start from i to j." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-63", "text": "The dynamic programming algorithm is specified as a deduction system, where each transition corresponds to a deduction rule." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-64", "text": "The shift rule is:" }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-65", "text": "rh 1 , i, h 2 , h 3 , js rh 3 , j, h 3 , j, j`1s ." 
}, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-66", "text": "Each reduce rule combines two I-computations into a larger I-computation, e.g. (see Fig. 3) :" }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-67", "text": "rh 1 , i, h 2 , h 3 , ks rh 3 , k, h 4 , h 5 , js rh 1 , i, h 2 , h 5 , js , 2 While Opn 7 q or Opn 10 q is not practical, the result is still impressive, since the search space is exponential." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-68", "text": "Cohen et al. (2011) were inspired by Huang and Sagae's (2010) and Kuhlmann et al.'s (2011) with the side condition that h 4 modifies h 5 ." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-69", "text": "5 Other reduce transitions have similar deduction rules, with the same two premises, but a different conclusion depending on the reduced stack item." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-70", "text": "As an illustration:" }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-71", "text": "rh 1 , i, h 2 , h 3 , ks rh 3 , k, h 4 , h 5 , js" }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-72", "text": "The goal of deduction is to produce the Icomputation r , 0, , 0, s, using the shift and reduce deduction rules starting from the axiom r , 0, , 0, 1s, corresponding to the first and mandatory shift transition moving the root node from buffer to stack." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-73", "text": "stands for an empty stack or buffer." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-74", "text": "As analyzed by Cohen et al. (2011) , direct tabularization for this deduction system takes Opn 5 q space and Opn 8 q time." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-75", "text": "With adaptation of the \"hook trick\" described in Eisner and Satta (1999) , we can reduce the running time to Opn 7 q." 
}, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-76", "text": "----------------------------------" }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-77", "text": "**OUR NEW VARIANTS**" }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-78", "text": "In this section, we modify Cohen et al.'s (2011) set of reduce deduction rules to improve coverage or time complexity." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-79", "text": "Since each such deduction rule corresponds to a reduce transition, each revision to the deduction system yields a variant of Attardi's (2006) parser." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-80", "text": "In other words, generalization of the deduction system gives rise to a family of non-projective transition-based dependency parsers." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-81", "text": "We first explain why there are exactly nine reduce transitions R = {re_{s0,s1}, re_{s1,s0}, re_{s0,s2}, re_{s2,s0}, re_{s1,s2}, re_{s2,s1}, re_{b0,s0}, re_{b0,s1}, re_{b0,s2}} that can be used in Cohen et al.'s (2011) exact inference algorithm, without allowing a reduction with head b_i for i \u2265 1. (Note that Cohen et al.'s (2011) reduce rules are precisely the first four elements of R.) From Fig. 3 we infer that the concatenation of I-computations [h_1, i, h_2, h_3, k] and [h_3, k, h_4, h_5, j] yields a configuration of the form (\u03c3|h_2|h_4|h_5, j|\u03b2, A)." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-82", "text": "For the application of a reduce rule to yield a valid I-computation, by condition (1) of the I-computation definition, first, the head and modifier must be selected from the \"exposed\" elements h_2, h_4, h_5, and j, corresponding to s_2, s_1, s_0, b_0, respectively; and second, the modifier can only come from the stack." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-83", "text": "R is precisely the set of rules satisfying these criteria."
}, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-84", "text": "Further, every reduce transition from R is compatible with Eisner and Satta's (1999) \"hook trick\"." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-85", "text": "This gives us the satisfactory result that the O(n^7) running time upper bound still holds for transitions in R, even though one of them has degree 3." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-86", "text": "Next, we consider three notable variants within the family of R-based non-projective transition-based dependency parsers." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-87", "text": "They are given in Fig. 1 , along with their time complexities and empirical coverage statistics." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-88", "text": "The latter is computed using static oracles (Cohen et al., 2012) on the UD 2.1 dataset (Nivre et al., 2017) ." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-89", "text": "We report the global coverage over the 76,084 non-projective sentences from all the training treebanks." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-90", "text": "One might assume that adding more degree-1 transitions, which on their own create only non-crossing edges, would not improve non-projective coverage." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-91", "text": "On the other hand, since their addition doesn't affect the asymptotic run-time, we define ALLDEG1 by adding all five degree-1 transitions from R to the Attardi (2006) system." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-92", "text": "Surprisingly, using ALLDEG1 improves non-projective coverage from 87.24% to 93.32%."
}, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-93", "text": "Furthermore, recall that we argued above that," }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-94", "text": "----------------------------------" }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-95", "text": "**CONCLUSION**" }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-96", "text": "We have introduced a family of variants of Cohen et al.'s (2011) Attardi-based transition system and its associated dynamic programming algorithm." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-97", "text": "Among these, we have highlighted novel algorithms that (1) increase non-projective coverage without affecting computational complexity for exact inference, and (2) improve the time and space complexity for exact inference, even while providing better coverage than the original parser." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-98", "text": "Specifically, our ALLs0s1 variant runs in O(n^6) time and O(n^4) space (improving from O(n^7) and O(n^5), respectively) while providing coverage of 93.16% of the non-projective sentences in UD 2.1." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-99", "text": "Exact inference for transition-based parsers has recently achieved state-of-the-art results in projective parsing (Shi et al., 2017) ." }, { "sent_id": "08d3f7a0938ab85d9a251b6a2364ed-C001-100", "text": "The complexity improvements achieved in this paper are a step towards making their exact-inference, projective approach extensible to practical, wide-coverage non-projective parsing."
} ], "y": { "@BACK@": { "gold_contexts": [ [ "08d3f7a0938ab85d9a251b6a2364ed-C001-12" ], [ "08d3f7a0938ab85d9a251b6a2364ed-C001-42" ], [ "08d3f7a0938ab85d9a251b6a2364ed-C001-50" ], [ "08d3f7a0938ab85d9a251b6a2364ed-C001-54" ], [ "08d3f7a0938ab85d9a251b6a2364ed-C001-79" ] ], "cite_sentences": [ "08d3f7a0938ab85d9a251b6a2364ed-C001-12", "08d3f7a0938ab85d9a251b6a2364ed-C001-42", "08d3f7a0938ab85d9a251b6a2364ed-C001-50", "08d3f7a0938ab85d9a251b6a2364ed-C001-54", "08d3f7a0938ab85d9a251b6a2364ed-C001-79" ] }, "@UNSURE@": { "gold_contexts": [ [ "08d3f7a0938ab85d9a251b6a2364ed-C001-15" ], [ "08d3f7a0938ab85d9a251b6a2364ed-C001-27" ], [ "08d3f7a0938ab85d9a251b6a2364ed-C001-31" ] ], "cite_sentences": [ "08d3f7a0938ab85d9a251b6a2364ed-C001-15", "08d3f7a0938ab85d9a251b6a2364ed-C001-27", "08d3f7a0938ab85d9a251b6a2364ed-C001-31" ] }, "@USE@": { "gold_contexts": [ [ "08d3f7a0938ab85d9a251b6a2364ed-C001-91" ] ], "cite_sentences": [ "08d3f7a0938ab85d9a251b6a2364ed-C001-91" ] } } }, "ABC_123d8e8ddef15fed120908c5c20656_18": { "x": [ { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-68", "text": "Thus, after the nonlinearity, both architectures have 40 filters." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-120", "text": "from the waveform respectively." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-2", "text": "State-of-the-art speech recognition systems rely on fixed, handcrafted features such as mel-filterbanks to preprocess the waveform before the training pipeline." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-3", "text": "In this paper, we study end-toend systems trained directly from the raw waveform, building on two alternatives for trainable replacements of mel-filterbanks that use a convolutional architecture." 
}, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-4", "text": "The first one is inspired by gammatone filterbanks (Hoshen et al., 2015; Sainath et al., 2015) , and the second one by the scattering transform (Zeghidour et al., 2017) ." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-5", "text": "We propose two modifications to these architectures and systematically compare them to mel-filterbanks, on the Wall Street Journal dataset." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-6", "text": "The first modification is the addition of an instance normalization layer, which greatly improves on the gammatone-based trainable filterbanks and speeds up the training of the scattering-based filterbanks." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-7", "text": "The second one relates to the low-pass filter used in these approaches." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-8", "text": "These modifications consistently improve performances for both approaches, and remove the need for a careful initialization in scattering-based trainable filterbanks." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-9", "text": "In particular, we show a consistent improvement in word error rate of the trainable filterbanks relative to comparable mel-filterbanks." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-10", "text": "It is the first time end-to-end models trained from the raw signal significantly outperform mel-filterbanks on a large vocabulary task under clean recording conditions." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-11", "text": "----------------------------------" }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-12", "text": "**INTRODUCTION**" }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-13", "text": "State-of-the-art speech recognition systems rapidly shift from the paradigm of composite subsystems trained or designed independently to the paradigm of end-to-end training."
}, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-14", "text": "While most of the work in this direction has been devoted to learning the acoustic model directly from sequences of phonemes or characters, without an intermediate alignment step or phone-state/senone induction, the other end of the pipeline, namely learning directly from the waveform rather than from speech features such as mel-filterbanks or MFCC, has recently received attention [1, 2, 3, 4, 5, 6, 7, 8] , but performance on the main task of speech recognition still seems to lag behind that of models trained on speech features [9, 10] ." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-15", "text": "Yet, promising results have already been obtained by learning the front-end of speech recognition systems." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-16", "text": "We focus the discussion on trainable components that can be plugged in as replacements of mel-filterbanks without modification of the acoustic model." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-17", "text": "The approach inspired by the gammatone filterbanks of Hoshen et al. and Sainath et al. [3, 4] achieved similar or better results than comparable mel-filterbanks on multichannel speech recognition and on far-field/noisy recording conditions." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-18", "text": "More recently, Zeghidour et al. [8] proposed an alternative learnable front-end based on a convolutional architecture that computes a scattering transform and can be initialized as an approximation of mel-filterbanks, and obtained promising results on end-to-end phone recognition on TIMIT."
}, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-19", "text": "However, these approaches have not been proven to improve on speech features in large-scale, end-to-end speech recognition under clean recording conditions on English, admittedly one of the tasks for which mel-filterbanks have been the most extensively tuned." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-20", "text": "We present a systematic comparison of the two previous architectures of learnable filterbanks, which we will (coarsely) refer to as gammatone-based and scattering-based, and evaluate them against mel-filterbanks within an end-to-end training pipeline on letter error rate and word error rate on the Wall Street Journal dataset." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-21", "text": "Our main contributions and results are the following:" }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-22", "text": "1." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-23", "text": "A mean-variance normalization layer on top of the log nonlinearity of learnable filterbanks appears to be critical for the efficient learning of the gammatone-based architecture, and makes the training of the scattering-based architecture faster;" }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-24", "text": "2. The low-pass filter previously used in the scattering-based learnable filterbanks stabilizes the training of gammatone filterbanks, compared to the max-pooling that was originally proposed [3, 4] ;" }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-25", "text": "3. For scattering-based trainable filterbanks, keeping the low-pass filter fixed during training makes it possible to efficiently learn the filters from a random initialization, whereas the results of [8] with random initialization of both the filters and the low-pass filter showed poor performance compared to a suitable initialization;" }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-26", "text": "4. Both trainable filterbanks improve against the mel-filterbanks baseline on word error rate on the Wall Street Journal dataset, in similar conditions (same number of filters, same end-to-end training convolutional architecture)." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-27", "text": "This is the first time learnable filterbanks improve against a strong mel-filterbanks baseline on a large vocabulary, speech recognition task under clean recording conditions." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-28", "text": "The next section describes the learnable filterbank architectures." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-29", "text": "Then, we present the end-to-end convolutional architecture used to perform the comparisons, and analyze the results of our comparative studies." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-30", "text": "----------------------------------" }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-31", "text": "**LEARNING FILTERBANKS FROM RAW SPEECH**" }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-32", "text": "The two approaches that we consider for learning filterbanks from the raw waveform can be used as direct replacements for mel-filterbanks in any end-to-end learning pipeline: they are convolutional architectures that take the raw waveform as input and output 40 channels every 10ms." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-33", "text": "As such, they can directly be compared with standard mel-filterbanks, simply by changing the features stage of a neural-network-based acoustic model." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-34", "text": "The filters are then nothing more than an additional layer of the neural network and are learnt by backpropagation with the rest of the acoustic model." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-35", "text": "The first architecture we consider is inspired by [3, 4] ; the second one is taken from [8] ."
}, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-36", "text": "They are described in Table 1 ." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-37", "text": "In both architectures, a convolutional layer with window length 25ms (to match the standard frame size used in mel-filterbanks) is applied with a stride of 1 sample, and is followed by a nonlinearity to give 40 output channels for each sample." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-38", "text": "Then, a pooling operator of width 25ms with a stride of 10ms performs low-pass filtering and decimation." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-39", "text": "Finally, a log nonlinearity reproduces the dynamic range compression of log mel-filterbanks." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-40", "text": "The parameters to be learnt are the convolution filters, and possibly the weights of the low-pass filters." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-41", "text": "The two architectures differ by the choices of each layer of computation." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-42", "text": "Hoshen et al. and Sainath et al. use 40 real-valued filters with ReLU non-linearity, and rely on gammatones as filter values to approximate mel-filterbanks [3, 4] ." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-43", "text": "In their work, they use a max-pooling operator for low-pass filtering." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-44", "text": "In contrast, Zeghidour et al. [8] use 40 complex-valued filters with a square modulus operator as non-linearity." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-45", "text": "Low-pass filtering is then performed by multiplying each output channel by a squared Hanning window so that, when using suitable Gabor wavelets as convolution filters, the architecture closely approximates mel-filterbanks computed on the power spectrum [11] ."
}, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-46", "text": "The number of filters (40), the convolution and pooling width of 25ms, as well as the decimation of 10ms are not necessarily the optimal parameters of either trainable architecture, but these are the standard settings of mel-filterbanks (and likely the best settings for these features on standard speech recognition datasets)." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-47", "text": "We keep these values fixed for the trainable architectures, so that the comparison to mel-filterbanks is carried out in the setting most favorable for the non-learnable baseline." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-48", "text": "In the next subsections, we describe the improvements we propose for these architectures: the low-pass filter and the addition of instance normalization." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-49", "text": "----------------------------------" }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-50", "text": "**LOW-PASS FILTERING**" }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-51", "text": "The original papers describing the gammatone-based trainable filterbanks used max-pooling as low-pass filter, whereas the scattering-based approach uses a squared Hanning window per channel." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-52", "text": "To make sure the low-pass filter is not responsible for notable differences between the two approaches, we experiment with the squared Hanning window on both architectures." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-53", "text": "For both architectures, we also propose to keep this low-pass filter fixed while learning the convolution filter weights, a setting that was not explored by Zeghidour et al. [8] , who learnt the low-pass filter weights when randomly initializing the convolutions."
}, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-54", "text": "----------------------------------" }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-55", "text": "**INSTANCE NORMALIZATION**" }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-56", "text": "More importantly, we noticed that a per-channel per-sentence mean-variance normalization after log-compression is important for the baseline mel-filterbanks." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-57", "text": "Consequently, we propose to add a mean-variance normalization layer on both trainable architectures, performed for each of the 40 channels independently on each sentence." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-58", "text": "Coincidentally, this corresponds to an instance normalization layer [12] , which has been shown to stabilize training in other deep learning contexts." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-59", "text": "----------------------------------" }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-60", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-61", "text": "The experiments compare different versions of the trainable architectures against log mel-filterbanks on a single deep convolutional network architecture for the acoustic model." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-62", "text": "The experiments are carried out on the open vocabulary task of the Wall Street Journal dataset [13] , using the subset si284 for training, nov93-dev for validation, and nov92-eval for testing." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-63", "text": "Training is performed end-to-end on letters." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-64", "text": "We evaluate in both letter and word error rates." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-65", "text": "All our experiments use the open source code of wav2letter [14] ."
}, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-66", "text": "In the next subsections, we describe the model, the different variants we tested and the hyperparameters." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-67", "text": "The convolution for the scattering-based architecture uses 80 real-valued output channels and squared L2-pooling on the feature dimension to emulate a complex-valued convolution with 40 filters followed by a squared modulus operator." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-69", "text": "[8] uses 1 to prevent log(0) and [3, 4] use 0.01." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-70", "text": "We kept the values initially used by the authors of the respective papers and did not try alternatives." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-71", "text": "We believe it has little impact on the final performance." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-72", "text": "----------------------------------" }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-73", "text": "**ACOUSTIC MODEL**" }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-74", "text": "Taking either log mel-filterbanks or trainable filterbanks as input, the acoustic model is a convolutional network with gated linear units (GLU) [15] trained to predict sequences of letters, following [16] ." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-75", "text": "The model is a smaller version of the convolutional network used in [16] , since they train on the larger LibriSpeech dataset." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-76", "text": "Using the syntax C-input channels-output channels-width, the architecture we use has the structure C-40-200-13/C-100-200-3/C-100-200-4/C-100-250-5/ C-125-250-6/C-125-300-7/C-150-350-8/C-175-400-9/ C-200-450-10/C-225-500-11/C-250-500-12/C-250-500-13/ C-250-600-14/C-300-600-15/C-300-750-21/C-375-1000-1."
}, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-77", "text": "All convolutions have stride 1." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-78", "text": "The number of input channels of the (n+1)-th convolution is half the size of the output of the n-th convolution because of the GLU." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-79", "text": "There are GLU layers with a dropout [17] of 0.25 after each convolution layer." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-80", "text": "There is an additional linear layer to predict the final letter probabilities." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-81", "text": "When predicting letters, the training and decoding are performed as in [16] ." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-82", "text": "When predicting words, we use a 4-gram language model trained on the standard LM data of WSJ [13] and perform beam search decoding, as in [16] ." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-83", "text": "----------------------------------" }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-84", "text": "**VARIANTS**" }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-85", "text": "We compare the two architectures of trainable filterbanks along different axes: how to initialize the convolutions of the trainable filterbanks, the low-pass filter, and instance normalization."
}, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-86", "text": "----------------------------------" }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-87", "text": "**GAMMATONE-BASED ARCHITECTURE**" }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-88", "text": "----------------------------------" }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-89", "text": "Initialization of the convolution weights: random (rand), or with gammatone filters (gamm) that match the impulse response of a reference open source implementation of gammatones [18];" }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-90", "text": "Low-pass filter: max-pooling as in [3] , or the squared Hanning window (Han-fixed)." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-91", "text": "----------------------------------" }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-92", "text": "**SCATTERING-BASED ARCHITECTURE**" }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-93", "text": "Initialization of the convolution weights: random (rand), or Gabor filters (scatt) as described in Section 2 of [8];" }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-94", "text": "" }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-95", "text": "" }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-96", "text": "Low-pass filter: the squared Hanning window (Han-fixed), or a low-pass filter of the same width and stride, initialized with the weights of the squared Hanning window but with the weights then learnt by backpropagation (Han-learnt)." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-97", "text": "----------------------------------" }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-98", "text": "**HYPERPARAMETERS AND TRAINING**" }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-99", "text": "For models trained on the raw waveform, the signal was first normalized with mean/variance normalization by sequence."
}, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-100", "text": "The network is trained with stochastic gradient descent and weight normalization [19] for all convolutional layers except the front-ends." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-101", "text": "First, 80 epochs are performed with a learning rate of 1.4, then training is resumed for 80 additional epochs with a learning rate of 0.1." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-102", "text": "These hyperparameters were chosen from preliminary experiments as they seemed to work well for all architectures." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-103", "text": "Additional hyperparameters are the momentum and the learning rate for the training criterion, respectively chosen in {0, 0.9} and {0.001, 0.0001} [14, 16] ." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-104", "text": "For Letter Error Rate (LER) evaluations, the hyperparameters are selected using the LER on the validation set, validating every epoch." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-105", "text": "For Word Error Rate (WER) evaluations, the hyperparameters are chosen on the validation set using the WER, validating every 10 epochs." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-106", "text": "The model selected on LER is also included for validation." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-107", "text": "The reported results are grouped into (i) SOTA - speech features: state-of-the-art and representative baselines using speech features (mel-filterbanks, spectrograms or MFCC); (ii) SOTA - waveform: state-of-the-art results from the raw waveform, including our own implementation of vanilla gammatone filterbanks without instance normalization; and (iii) our baseline and the different variants of the trainable filterbanks (with instance normalization) studied in this paper."
}, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-108", "text": "The additional hyperparameters are the weight of the language model and the weight of the word insertion penalty (see [16] for details)." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-109", "text": "We set them between 5 and 8 by steps of 0.5, and between \u22122 and 0.5 by steps of 0.1, respectively." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-110", "text": "For hyperparameter selection, the beam size of the decoder is set to 2,500; the final performances are computed with the selected hyperparameters but using a beam size of 25,000." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-111", "text": "Table 2 contains our results together with end-to-end baselines from the literature." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-112", "text": "[20] is the current state-of-the-art on the WSJ dataset; it is given as a topline but uses much more training data (\u223c12,000h of speech) so the results are not comparable." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-113", "text": "[21, 22, 23, 24] are representative results in terms of WER and LER from the literature of end-to-end models trained on speech features from 2014-2017, in chronological order." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-114", "text": "[25] and [26] are the current state-of-the-art in LER on speech features and from the waveform, respectively. Table 3 : Comparison of models trained with or without a learnable pre-emphasis layer." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-115", "text": "All models are initialized either with the scattering or gammatone initialization, and the pooling function is a fixed squared Hanning window."
}, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-116", "text": "----------------------------------" }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-117", "text": "**EXPERIMENTS**" }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-118", "text": "----------------------------------" }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-119", "text": "**BASELINE RESULTS**" }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-121", "text": "These comparisons validate our baseline model trained on mel-filterbanks as a strong baseline in light of recent results, as it outperforms the state-of-the-art in LER by a significant margin (4.9% vs 6.1% for [25] ), and achieves a test WER of 6.6%, better than all other end-to-end baselines ([27] and [7] report WERs below our 6.6%, but on easier closed-vocabulary tasks)." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-122", "text": "----------------------------------" }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-123", "text": "**INSTANCE NORMALIZATION**" }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-124", "text": "As described in Section 2.2, we evaluate the integration of instance normalization after the log-compression in the trainable filterbanks, which was not used in previous work [3, 4, 7, 8] but is used in our baseline." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-125", "text": "Figure 1 shows training LER as a function of the number of epochs for scattering-based and gammatone-based filterbanks models, with and without instance normalization." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-126", "text": "We can see that this normalization drastically improves the training stability of the gammatone-based model, while it moderately improves the scattering-based model."
}, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-127", "text": "We observed a positive impact of instance normalization in all settings, and so only report as a reference the results of our implementation of vanilla gammatone-based trainable filterbanks following [3, 4] ." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-128", "text": "Comparing gammatone (learnt)/gamm/max-pool without instance norm (under SOTA - waveform) to the results of gammatone (learnt)/gamm/max-pool in Table 2 , we see significant improvements of both LER and WER due to instance normalization, with an absolute reduction in LER and WER of 1.5% and 2.8% respectively." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-129", "text": "----------------------------------" }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-130", "text": "**IMPACT OF THE LOW-PASS FILTER**" }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-131", "text": "For low-pass filtering, we first compare the Han-fixed setting to max-pooling for gammatone-based filterbanks (as max-pooling was previously used in [3, 4] ), and to Han-learnt for scattering, all with instance normalization." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-132", "text": "The tendency is that the Han-fixed setting consistently improves the results in LER and WER of both trainable filterbanks." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-133", "text": "More importantly, using either a Han-fixed or Han-learnt filter when learning scattering-based filterbanks from a random initialization removes the gap in performance with the Gabor wavelet initialization that was observed in [8] , where the low-pass filter was also initialized randomly." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-134", "text": "This is an important result since carefully initializing the convolutional filters is both technically non-trivial, and also relies on the prior knowledge of mel-filterbanks."
}, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-135", "text": "We believe the ability to use random initialization is an important first step for more extensive tuning of trainable filterbanks (e.g., trying different numbers of filters, decimation or convolution width)." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-136", "text": "Compared to the literature, replacing the max-pooling by a low-pass filter and adding an instance normalization layer leads to a 23% relative improvement in LER and a 33% relative improvement in WER on nov92-eval on the gammatone-based trainable filterbanks, a significant improvement compared to the existing approach [3, 4] ." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-137", "text": "Our models trained on the waveform also exhibit a gain in performance in LER of 22 \u2212 31% relative compared to the state-of-the-art end-to-end model trained on the waveform with its first 6 layers being pre-trained for melfilterbanks reconstruction [26] , and outperform various end-toend models trained on speech features, both in LER [24, 25] and WER [21, 22, 23] ." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-138", "text": "----------------------------------" }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-139", "text": "**TRAINABLE FILTERBANKS VS MEL-FILTERBANKS**" }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-140", "text": "Comparing both trainable filterbanks with instance normalization to the log mel-filterbanks baseline, we observe that the performances of the Han-fixed settings and of the mel-filterbanks are comparable in terms of LER." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-141", "text": "However, we observe a consistent improvement in terms of WER of all trainable filterbanks." 
}, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-142", "text": "To the best of our knowledge, this is the first time a significant improvement in terms of WER relatively to comparable melfilterbanks has been shown on a large vocabulary task under clean recording conditions." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-143", "text": "Some improvements on the clean test of the Switchboard dataset have previously been observed by [7] , but their comparison point is MFCC rather than melfilterbanks and the number of filters of the trainable architecture differs from their MFCC baseline." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-144", "text": "----------------------------------" }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-145", "text": "**ADDING A LEARNABLE PRE-EMPHASIS LAYER**" }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-146", "text": "The first step in the computation of mel-filterbanks is typically the application of a pre-emphasis layer to the raw signal." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-147", "text": "Preemphasis is a convolution with a first-order high-pass filter of the form y[n] = x[n] \u2212 \u03b1x[n \u2212 1], with \u03b1 typically equal to 0.97." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-148", "text": "This operation can be performed by a convolutional layer of kernel size 2 and stride 1, that can be plugged below timedomain filterbanks, initialized with weights [\u22120.97 1], then learned with the network." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-149", "text": "In Table 3 , we compare the performance of identical models (all using a fixed Hanning window, and a gammatone or scattering initialization) with and without pre-emphasis." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-150", "text": "We observe a gain on both LER and WER (except on nov93-dev WER/scatt) when using pre-emphasis." 
}, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-151", "text": "----------------------------------" }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-152", "text": "**CONCLUSION**" }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-153", "text": "This paper presents a systematic study of two approaches for trainable filterbanks, which clarifies good practices and identifies better architectures to learn from raw speech." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-154", "text": "Our results show that adding an instance normalization layer on top of the trainable filterbanks is critical for learning gammatone-based architectures, and speeds up learning of scattering-based architectures." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-155", "text": "Second, the use of a fixed squared Hanning window as low-pass filter is critical to learn the scattering-based filterbanks from random initialization of the filters, and improves on max-pooling for gammatone-based filterbanks." }, { "sent_id": "123d8e8ddef15fed120908c5c20656-C001-156", "text": "With these two improvements, we observe a consistent reduction of WER against comparable mel-filterbanks on the open vocabulary task of the WSJ dataset, in the setting of speech recognition under clean recording condition -most likely the setting on which mel-filterbanks have been the most heavily tuned." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "123d8e8ddef15fed120908c5c20656-C001-14" ], [ "123d8e8ddef15fed120908c5c20656-C001-18" ], [ "123d8e8ddef15fed120908c5c20656-C001-43", "123d8e8ddef15fed120908c5c20656-C001-44" ] ], "cite_sentences": [ "123d8e8ddef15fed120908c5c20656-C001-14", "123d8e8ddef15fed120908c5c20656-C001-18", "123d8e8ddef15fed120908c5c20656-C001-44" ] }, "@DIF@": { "gold_contexts": [ [ "123d8e8ddef15fed120908c5c20656-C001-25" ], [ "123d8e8ddef15fed120908c5c20656-C001-53" ], [ "123d8e8ddef15fed120908c5c20656-C001-124" ], [ "123d8e8ddef15fed120908c5c20656-C001-133" ] ], "cite_sentences": [ "123d8e8ddef15fed120908c5c20656-C001-25", "123d8e8ddef15fed120908c5c20656-C001-53", "123d8e8ddef15fed120908c5c20656-C001-124", "123d8e8ddef15fed120908c5c20656-C001-133" ] }, "@USE@": { "gold_contexts": [ [ "123d8e8ddef15fed120908c5c20656-C001-35" ], [ "123d8e8ddef15fed120908c5c20656-C001-69", "123d8e8ddef15fed120908c5c20656-C001-70" ] ], "cite_sentences": [ "123d8e8ddef15fed120908c5c20656-C001-35", "123d8e8ddef15fed120908c5c20656-C001-69" ] } } }, "ABC_0f66e9a5c51cff004d97e4aaddf4d0_18": { "x": [ { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-2", "text": "Representing sentences as numerical vectors while capturing their semantic context is an important and useful intermediate step in natural language processing." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-3", "text": "Representations that are both general and discriminative can serve as a tool for tackling various NLP tasks." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-4", "text": "While common sentence representation methods are unsupervised in nature, recently, an approach for learning universal sentence representation in a supervised setting was presented in (Conneau et al., 2017) ." 
}, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-5", "text": "We argue that although promising results were obtained, an improvement can be reached by adding various unsupervised constraints that are motivated by auto-encoders and by language models." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-6", "text": "We show that by adding such constraints, superior sentence embeddings can be achieved." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-7", "text": "We compare our method with the original implementation and show improvements in several tasks." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-8", "text": "----------------------------------" }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-10", "text": "Word embeddings are considered one of the key building blocks in natural language processing and are widely used for various applications (Mikolov et al., 2013; Pennington et al., 2014) ." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-11", "text": "While word representations has been successfully used, representing the more complicated and nuanced nature of the next element in the hierarchy -a full sentence -is still considered a challenge." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-12", "text": "Once trained, universal sentence representations can be used as an out-of-the-box tool for solving various NLP and computer vision problems." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-13", "text": "Even though their importance is unquestionable, it seems that current results are still far from satisfactory." 
}, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-14", "text": "More concretely, given a set of sentences {s i } n i=1 , sentence embedding methods are designed to map them to some feature space F along with a distance metric M such that given two sentences s i and s j that have similar semantic meaning," }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-15", "text": "their distance M(s i , s j ) would be small." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-16", "text": "The challenge is learning a mapping T : {s i } n i=1 \u2192 F that manages to capture the semantics of each s i ." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-17", "text": "While sentence embedding are not always used in similarity probing, we find this formulation useful as the similarity assumption is implicitly made when training classifiers on top of the embeddings in downstream tasks." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-18", "text": "Sentences embedding methods were mostly trained in an unsupervised setting." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-19", "text": "In (Le and Mikolov, 2014 ) the ParagraphVector model was proposed which is trained to predict words in the document." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-20", "text": "SkipThought (Kiros et al., 2015) vectors rely on the continuity of text to train an encoder-decoder model that tries to reconstruct the surrounding sentences of a given passage." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-21", "text": "In Sequential Denoising Autoencoders (SDAE) (Hill et al., 2016) high-dimensional input data is corrupted according to some noise function, and the model is trained to recover the original data from the corrupted version." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-22", "text": "FastSent (Hill et al., 2016) learns to predicts a Bag-Of-Word (BOW) representation of adjacent sentences given a BOW representation of some sentence." 
}, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-23", "text": "In (Klein et al., 2015) a Hybrid Gaussian Laplacian density function is fitted to the sentence to derive Fisher Vectors." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-24", "text": "While previous methods train sentence embeddings in an unsupervised manner, a recent work (Conneau et al., 2017) argued that better representations can be achieved via supervised training on a general sentence inference dataset (Bowman et al., 2015) ." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-25", "text": "To this end, the authors use the Stanford Natural Language Inference (SNLI) dataset (Bowman et al., 2015) to train different Table 1 : Sentence embedding results." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-26", "text": "BiLSTM refers to the original BiLSTM followed by MaxPooling implementation of (Conneau et al., 2017) which is the baseline for our work." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-27", "text": "AE Reg and LM Reg refers to the Auto-Encoder and Language-Model regularization terms described in 2.1 and Combined refers to optimizing with both terms." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-28", "text": "Bi-AE Reg and Bi-LM Reg refers to the bi-directional Auto-Encoder and bi-directional Language-Model regularization terms described in 2.2." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-29", "text": "As evident from the results, adding simple unsupervised regularization terms improves the results of the model on almost all the evaluated tasks." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-30", "text": "sentence embedding methods and compare them on various benchmarks." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-31", "text": "The SNLI dataset is composed of 570K pairs of sentences with a label depicting the relationship between them, which can be either 'neutral', 'contradiction' or 'entailment'." 
}, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-32", "text": "The authors show that by leveraging the dataset, state-of-the-art representations can be obtained which are universal and general enough for solving various NLP tasks." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-33", "text": "A different, unsupervised, task in NLP is estimating the probability of word sequences." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-34", "text": "A family of algorithms for this task titled word language models seek to model the problem as estimating the probability of a word, given the previous words in the text." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-35", "text": "In (Bengio et al., 2003) neural networks were employed and (Mikolov et al., 2010) was among the first methods to use recurrent neural networks (RNN) for modeling the problem, where the probability of the a word is estimated based on the previous words fed to the RNN." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-36", "text": "A variant of RNN -Long Short Term Memory (LSTM) networks (Hochreiter and Schmidhuber, 1997 ) -were used in (Sundermeyer et al., 2012) ." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-37", "text": "Following that, (Zaremba et al., 2014) proposed a dropout augmented LSTM." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-38", "text": "We note that there exists a connection between those two problems and try to model it more explicitly." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-39", "text": "Recently, the incorporation of the hidden states of neural language models in downstream supervised-learning models have been shown to improve the results of the latter (e.g. ElMo -Peters et al. (2018), CoVe -McCann et al. (2017 ) Peters et al. (2017 , Salant and Berant (2017) ) -in this work we jointly train the unsupervised and supervised tasks." 
}, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-40", "text": "To this end, we incorporate unsupervised regularization terms motivated by language modeling and auto-encoders in the training framework proposed by (Conneau et al., 2017) ." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-41", "text": "We test our proposed model on a set of NLP tasks and show improved results over the baseline framework of (Conneau et al., 2017) ." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-42", "text": "----------------------------------" }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-43", "text": "**METHOD**" }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-44", "text": "Our approach builds upon the previous work of (Conneau et al., 2017) ." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-45", "text": "Specifically, we use their BiLSTM model with max pooling." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-46", "text": "More concretely, given a sequence of T words, {w t } t=1,...,T with given word embedding (Mikolov et al., 2013; Pennington et al., 2014) {v t } t=1,...,T ,a bidirectional LSTM computes a set of T vectors {h t } t=1,...,T where each h t is the concatenation of a forward LSTM and a backward LSTM that read the sentences in two opposite directions." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-47", "text": "We denote { \u2212 \u2192 h t } and { \u2190 \u2212 h t } as the hidden states of the left and right LSTM's respectively, where t = 1, . . . , T ." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-48", "text": "The final sentence representation is obtained by taking the maximal value of each dimension of the {h t } hidden units (i.e.: max pooling)." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-49", "text": "The original model of (Conneau et al., 2017) was trained on the SNLI dataset in a supervised fashion -given pairs of sentences s 1 and s 2 , denote their representation bys 1 and s 2 ." 
}, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-50", "text": "During training, the concatenation ofs 1 ,s 2 , |s 1 \u2212s 2 | ands 1 * s 2 is fed to a three layer fully connected network followed by a softmax classifier." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-51", "text": "----------------------------------" }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-52", "text": "**REGULARIZATION TERMS**" }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-53", "text": "We note that by training on SNLI, the model might overfit and would not be general enough to provide universal sentence embedding." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-54", "text": "We devise several regularization criteria that incentivize the hidden states to maintain more information about the input sequence." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-55", "text": "Specifically, denote the dimension of the word embedding by d and the dimension of the hidden state by l. We add a linear transformation layer L l\u00d7d : H \u2192 W on top of the BiLSTM to transform the hidden states back to the dimension of word embeddings and denote its output by {w t } t=1,...,T ." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-56", "text": "Recall that in the training process, we minimize the log-likelihood loss of the fully connected network predictions which we denote by y i where y gt is the prediction score given to the correct ground truth class." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-57", "text": "Now, the total loss criteria with our regularization term can be written as" }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-58", "text": "or as L = \u2212log e ygt j e y j + \u03bb" }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-59", "text": "where the first term in both (1) and (2) is the original classification loss." 
}, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-60", "text": "We call the second regularization term in (1) an auto-encoder regularization term and in (2) a language model regularization term." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-61", "text": "Intuitively, since each w t is obtained by a linear transformation of h t , it enforces the hidden state h t to maintain enough information on each w t such it can be reconstructed back from h t or such that the following word w t+1 can be predicted from h t ." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-62", "text": "This aids in obtaining a more general sentence representation and mitigates the risk of overfitting to the SNLI training set." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-63", "text": "The constant \u03bb in (1) and (2) is a hyper-parameter that controls the amount of regularization and was set to 1 in our experiments." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-64", "text": "We have also experimented with combining the two terms, giving equal weight to each of them in optimizing the model." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-65", "text": "----------------------------------" }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-66", "text": "**BI-DIRECTIONAL REGULARIZATION TERMS**" }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-67", "text": "Similarly to regularization terms described in 2.1, we devise variants of (1) and (2) which take into account the bi-directional architecture of the model." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-68", "text": "Here, we add two linear transformation layers:" }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-69", "text": "\u2190 \u2212 H \u2192 W on top of the forward LSTM and backward LSTM, respectively, and denote their output as { \u2212 \u2192 w t } and { \u2190 \u2212 w t }, respectively, where t = 1, . . . , T ." 
}, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-70", "text": "Now, equations (1) and (2) are re-written as:" }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-71", "text": "We call the second regularization term in (3) a bi-directional auto-encoder regularization and in (4) a bi-directional language model regularization term." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-72", "text": "Again, \u03bb 1 and \u03bb 2 are hyper-parameters controlling the amount of regularization and were set to 0.5 in our experiments." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-73", "text": "----------------------------------" }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-74", "text": "**EXPERIMENTS**" }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-75", "text": "Following (Conneau et al., 2017) we have tested our approach on a wide array of classification tasks, including sentiment analysis (MR -Pang and Lee (2005) , SST -Socher et al. (2013) ), question-type (TREC -Li and Roth (2002) ), product reviews (CR - Hu and Liu (2004) ), subjectivity/objectivity (SUBJ - Pang and Lee (2005) ) and opinion polarity (MPQA -Wiebe et al. (2005) )." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-76", "text": "We also tested our approach on semantic textual similarity (STS 14 - Agirre et al. (2014) ), paraphrase detection (MRPC - Dolan et al. (2004) ), entailment and semantic relatedness tasks (SICK-R and SICK-E - Marelli et al. (2014) ), though those tasks are more close in nature to the task of the SNLI dataset which the model was trained on." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-77", "text": "In our experiments we have set \u03bb from eq. (1) and eq. (2) to be 1 and \u03bb 1 , \u03bb 2 from eq. (3) and eq. (4) to be 0.5." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-78", "text": "All other hyper-parameters and implementation details were left unchanged to provide a fair comparison to the baseline method of (Conneau et al., 2017) ." 
}, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-79", "text": "Our results are summarized in table 1." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-80", "text": "We compared out method against the baseline BiL-STM implementation of (Conneau et al., 2017) and included FastSent (Hill et al., 2016) and SkipThought vectors (Kiros et al., 2015) as a reference." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-81", "text": "As evident from table 1 in almost all the tasks evaluated, adding the proposed regularization terms improves performance." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-82", "text": "This serve to show that in a supervised learning setting, additional information on the input sequence can be leveraged and injected to the model by adding simple unsupervised loss criteria." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-83", "text": "----------------------------------" }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-84", "text": "**CONCLUSIONS**" }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-85", "text": "In our work, we have sought to connect unsupervised and supervised learning in the context of sentence embeddings." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-86", "text": "Leveraging supervision given by some general task aided in obtaining state-of-the-art sentence representations (Conneau et al., 2017) ." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-87", "text": "However, every supervised learning tasks is prone to overfit." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-88", "text": "In this context, overfitting to the learning task will result in a model which generalizes less well to new tasks." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-89", "text": "We alleviate this problem by incorporating unsupervised regularization criteria in the model's loss function which are motivated by autoencoders and language models." 
}, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-90", "text": "We note that the added regularization terms do come at the price of increasing the model size by ld parameters (where d and l are the dimensions of the word embedding and the LSTM hidden state, respectively) due to the added linear transformation (see 2.1)." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-91", "text": "However, as evident from our results, this does not hinder the model performance, even though we did not increase the amount of training data." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-92", "text": "Moreover, since those term are unsupervised in nature, it is possible to pre-train the model on unlabeled data and then finetune it on the SNLI dataset." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-93", "text": "In conclusion, our experiments show that adding the proposed regularization terms results in a more general model and superior sentence embeddings." }, { "sent_id": "0f66e9a5c51cff004d97e4aaddf4d0-C001-94", "text": "This validates our assumption that while the a supervised signal is general enough for learning sentence embeddings, it can be further improved by incorporated a second unsupervised signal." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "0f66e9a5c51cff004d97e4aaddf4d0-C001-4" ], [ "0f66e9a5c51cff004d97e4aaddf4d0-C001-24" ] ], "cite_sentences": [ "0f66e9a5c51cff004d97e4aaddf4d0-C001-4", "0f66e9a5c51cff004d97e4aaddf4d0-C001-24" ] }, "@USE@": { "gold_contexts": [ [ "0f66e9a5c51cff004d97e4aaddf4d0-C001-26" ], [ "0f66e9a5c51cff004d97e4aaddf4d0-C001-49" ], [ "0f66e9a5c51cff004d97e4aaddf4d0-C001-75" ] ], "cite_sentences": [ "0f66e9a5c51cff004d97e4aaddf4d0-C001-26", "0f66e9a5c51cff004d97e4aaddf4d0-C001-49", "0f66e9a5c51cff004d97e4aaddf4d0-C001-75" ] }, "@EXT@": { "gold_contexts": [ [ "0f66e9a5c51cff004d97e4aaddf4d0-C001-40" ], [ "0f66e9a5c51cff004d97e4aaddf4d0-C001-44" ] ], "cite_sentences": [ "0f66e9a5c51cff004d97e4aaddf4d0-C001-40", "0f66e9a5c51cff004d97e4aaddf4d0-C001-44" ] }, "@DIF@": { "gold_contexts": [ [ "0f66e9a5c51cff004d97e4aaddf4d0-C001-41" ] ], "cite_sentences": [ "0f66e9a5c51cff004d97e4aaddf4d0-C001-41" ] }, "@UNSURE@": { "gold_contexts": [ [ "0f66e9a5c51cff004d97e4aaddf4d0-C001-78" ], [ "0f66e9a5c51cff004d97e4aaddf4d0-C001-80" ] ], "cite_sentences": [ "0f66e9a5c51cff004d97e4aaddf4d0-C001-78", "0f66e9a5c51cff004d97e4aaddf4d0-C001-80" ] }, "@MOT@": { "gold_contexts": [ [ "0f66e9a5c51cff004d97e4aaddf4d0-C001-86", "0f66e9a5c51cff004d97e4aaddf4d0-C001-87" ] ], "cite_sentences": [ "0f66e9a5c51cff004d97e4aaddf4d0-C001-86" ] } } }, "ABC_759c1c892361f62ad8f2c46e569e8a_18": { "x": [ { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-2", "text": "We explore state-of-the-art deep reinforcement learning methods such as prioritized experience replay, double deep Q-Networks, dueling network architectures, distributional learning methods for dialog policy." 
}, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-3", "text": "Our main findings show that each individual method improves the rewards and the task success rate but combining these methods in a Rainbow agent, which performs best across tasks and environments, is a non-trivial task." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-4", "text": "We, therefore, provide insights about the influence of each method on the combination and how to combine them to form the Rainbow agent." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-5", "text": "----------------------------------" }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-7", "text": "Dialog system can be designed for generic purposes, e.g. smalltalk (Weizenbaum, 1966) or a specific task such as finding restaurants or booking flights (Bobrow et al., 1977; ." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-8", "text": "This paper focuses on task-oriented dialog systems, which interact with a user to aid achieving their goals." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-9", "text": "The systems have several modules which solve different subtasks (Williams et al., 2016) starting with natural language understanding (NLU) module (De Mori et al., 2008) ." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-10", "text": "Its output is then passed to a belief tracking module that holds the state of the dialog, i.e. all relevant information provided by the user." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-11", "text": "This belief state is then passed to the dialog policy module (Williams and Young, 2007) which has to decide how the system should reply." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-12", "text": "Depending on the ontology of the task, e.g. the restaurant search, the size of the input space for the policy can quickly become very large." 
}, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-13", "text": "Furthermore, the belief state might be wrong due to noisy inputs, e.g. the user could be misunderstood because of NLU errors or in general, language ambiguity." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-14", "text": "Therefore, building such policies by hand is rather time consuming." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-15", "text": "Reinforcement learning (RL) can alleviate this task by allowing to learn such policies automatically (Williams and Young, 2007) with a user simulator such as proposed in Schatzmann et al. (2007) within a task (Dhingra et al., 2017; Peng et al., 2018) , between task and non-task (Yu et al., 2017) and also in multimodal dialog systems (Manuvinakurike et al., 2017; Zhang et al., 2018) ." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-16", "text": "Deep RL has been proven to be successful with Deep Q-Learning (DQN) (Mnih et al., 2013) introducing the idea of using neural networks as a Q-function approximator." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-17", "text": "It has been widely used in the context of dialog policy learning (Fatemi et al., 2016; Dhingra et al., 2017; Casanueva et al., 2017) ." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-18", "text": "However according to a recent comparison (Casanueva et al., 2017) in the context of dialog policy learning, it performed worse than other RL methods such as Gaussian Process in many testing conditions." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-19", "text": "Recently, several advances in deep RL such as distributional RL (Bellemare et al., 2017) , dueling network architectures (Wang et al., 2016) and their combination (Hessel et al., 2018 )a Rainbow agent -have been shown to be promising for further improvements of deep RL agents in benchmark environments, e.g. Atari 2600." 
}, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-20", "text": "However, it is still unclear whether these methods could advance dialog policies." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-21", "text": "This paper attempts to provide insights motivated from dialog policy modeling perspectives how to use state-of-the-art deep RL methods such as prioritized experience replay (Schaul et al., 2015) , double DQN (Van Hasselt et al., 2016) , dueling network architecture, distributional learning method and how to combine them to train the Rainbow agent for dialog policy learning 1 ." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-22", "text": "Moreover, we explore the influence of each method w.r.t the resulting rewards and the number of successful dialogs, highlighting methods with the biggest and the smallest impact." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-23", "text": "Env." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-24", "text": "1 Env." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-25", "text": "2 Env." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-26", "text": "3 Env." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-27", "text": "4 Env." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-28", "text": "5 Env ." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-29", "text": "6 Task T1.1 T1.2 T1.3 T2.1 T2.2 T2.3 T3.1 T3.2 T3.3 T4.1 T4.2 T4.3 T5.1 T5.2 T5.3 T6.1 T6.2 T6.3 Domain CR SFR LAP CR SFR LAP CR SFR LAP CR SFR LAP CR SFR LAP CR SFR LAP SER 0% 0% 15% 15% 15% Table 1 : Benchmarking environments with several domains, semantic error rates (SERs), action masks and user models (Casanueva et al., 2017) ." 
}, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-30", "text": "----------------------------------" }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-31", "text": "**PROPOSED METHOD**" }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-32", "text": "For value-based reinforcement learning methods like Q-learning, potentially large state spaces as in the dialog setting require the use of function approximators." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-33", "text": "The DQN-Algorithm (Mnih et al., 2013) is an example of such a method where the action-value function is approximated by a neural network which takes a state vector as input and outputs a value for each possible action." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-34", "text": "Loss is calculated with the squared temporal difference (TD) error." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-35", "text": "Efficient off-policy batch-training is enabled by a replay buffer which records the agent's turn-level experiences and allows the drawing of uncorrelated training samples." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-36", "text": "Prioritized experience replay Drawing samples from this buffer uniformly is straightforward but problematic: important state transitions might never be drawn from the buffer or at least too few times to have an impact on the network weights." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-37", "text": "Motivated by the insight that a high absolute TDerror of an experience means that the current action-value is not an accurate estimate yet, prioritized experience replay (Schaul et al., 2015) samples experiences having higher TD-errors with greater probability than those with lower TD-error." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-38", "text": "This method is relevant because it is expected to increase learning efficiency." 
}, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-39", "text": "In the context of dialog policy, there are some system actions which are crucial to the outcome of the dialog and should have a higher probability for being used as training data if they are not well approximated." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-40", "text": "For example, if systems end the dialog before the user's goal is completed by telling the user goodbye, this will immediately terminate a dialog with a negative reward and without any chance of recovery." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-41", "text": "Double DQN Another improvement mitigates the overestimation bias inherent to Q-learning by introducing a second action-value network which copies the parameters from the online action-value network periodically and is held fix otherwise." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-42", "text": "This additional network is then used to evaluate the action-value of the action selected greedily w.r.t." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-43", "text": "the online Q-function, thereby decoupling action choice and evaluation which could increase stability of the learning process." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-44", "text": "Dueling network architecture In comparison to the action-value function, the state-value function is a simpler estimate -it is the expectation over a state's action-values and therefore only a single value." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-45", "text": "But in states where the action choice does not matter, or to avoid visiting states with a low state value in general, an estimate of the value function should be sufficient." 
}, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-46", "text": "Dueling network architecture (Wang et al., 2016) therefore splits the calculation of the action-value function into separate layers of a neural network, one group computing the value function and another an advantage function chosen so that their combination results in the action-value function again." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-47", "text": "This approach also has the benefit that the state value estimation is updated every time when a state is observed by the network, regardless of the chosen action." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-48", "text": "As a result, it should encourage generalization across actions." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-90", "text": "Regarding learning speed w.r.t." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-49", "text": "In dialog settings, there are many states where generalization across actions could prove beneficial, e.g. exact action choice is not important, just the choice between action classes." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-50", "text": "For example at the beginning of a dialog, when users greet the system without providing any information, the only appropriate action for the system is to ask for more information." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-51", "text": "The exact type of information should not matter and all other actions except for the dialog ending action should be about equally unsuitable." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-52", "text": "Distributional learning method One of the latest additions to reinforcement learning is the quantile regression distributional reinforcement learning algorithm ." 
}, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-53", "text": "Instead of learning only the expected value for each stateaction pair, as in regular Q-learning, the distribution of rewards is approximated instead, thereby modeling the randomness of the reward over multiple turns induced by action selection and random state transitions." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-54", "text": "A noisy environment like dialog could benefit from better knowledge about the distribution of rewards." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-55", "text": "The Rainbow agent Following the methodology from (Hessel et al., 2018) , we extend the DQN algorithm (Mnih et al., 2013) with prioritized experience replay, double DQN, and dueling network architecture." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-56", "text": "Furthermore in contrast to (Hessel et al., 2018) , we apply the following changes to successfully train the Rainbow agent: 1) we drop the multi-step method (Sutton, 1988) because it seems to diminish the obtained rewards." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-57", "text": "As the step size gets larger, the rewards are decreased more." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-58", "text": "A possible explanation could be that the noise generated by the user simulator leads to accumulation of noise in rewards over multiple steps, which could lead to higher variance in value estimates." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-59", "text": "2) we discard the noisy linear layers (Fortunato et al., 2018) , relying on -greedy exploration instead." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-60", "text": "The first reason could be the additional parameters, which usually would require more training samples." 
}, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-61", "text": "Since the agent was already required to learn environmental noises from the user simulator, a complementary explanation could be that the inclusion of a second noise distribution might have been too difficult to learn, especially when considering the relatively small amount of training episodes." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-62", "text": "3) we swap the categorical DQN approach (Bellemare et al., 2017) with the quantile regression Q-learning algorithm , now consistent with the theoretic results from (Bellemare et al., 2017) , no longer restricting the values of the value distribution to a uniform resolution and also no longer requiring knowledge about their bounds." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-63", "text": "----------------------------------" }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-64", "text": "**RESOURCES**" }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-65", "text": "We used PyDial toolkit as a test-bed for experiments and evaluation." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-66", "text": "It includes a configurable user simulator and provides multiple dialog ontologies like Cambridge Restaurants (CR), Laptops (LAP) and San Francisco Restaurants (SFR)." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-67", "text": "The ontologies used for the benchmarks in this paper together with their properties are listed in table 2." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-68", "text": "CR 3 9 268 SFR 6 11 636 LAP 11 21 257 Table 2 : Benchmark domains with #slots the user can provide or #request from the system as well as #values of each requestable slot (Casanueva et al., 2017) ." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-69", "text": "Casanueva et al. 
(2017) propose six different environmental models, varying in user friendliness, simulated input channel noise and the presence or absence of action masks, which, when enabled, simplify learning by masking some of the possible actions." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-70", "text": "An overview of all these environmental configurations and their assignment to tasks is given in Table 1." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-71", "text": "Evaluation results reported in Casanueva et al. (2017) for several dialog policy types, e.g. a handcrafted policy and the best reported policies, serve as baselines in our experiments." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-74", "text": "----------------------------------" }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-75", "text": "**EXPERIMENTAL RESULTS**" }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-76", "text": "Training and evaluation with the PyDial user simulator follows the PyDial benchmarking tasks (Casanueva et al., 2017), where each task (see Table 1) is trained on 10000 dialogs split into ten training iterations of 1000 dialogs each." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-77", "text": "We evaluate policies after each training iteration on 1000 test dialogs." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-78", "text": "All of the following results were obtained by averaging over the outcomes of ten different random seeds, using the parameters described in Appendix A."
}, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-79", "text": "----------------------------------" }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-80", "text": "**THE RAINBOW AGENT**" }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-81", "text": "The first row of Table 3 and 4 show the results of the highest scoring policy from the PyDial benchmark (Casanueva et al., 2017) to serve as baselines." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-82", "text": "Evaluations of the handcrafted policies follow in the last line." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-83", "text": "The results show that Rainbow agent outperforms reward of the best PyDial agents in all 18 conditions and success rate in 16 out of 18 setting." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-84", "text": "Compared to the basic DQN agent, Rainbow agent is better in all 18 conditions w.r.t both reward and success rate." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-85", "text": "When averaged across all 18 tasks, Rainbow agent (mean reward 10.1) scores more than 29% higher rewards compared to the best PyDial agent (DQN, mean reward 7.8) and more than 9.7% compared to our DQN agent." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-86", "text": "An average success rate of 90.4% is superior to the best PyDial agent (GP-Sarsa, 80.2%)." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-87", "text": "Mean deviation across all tasks and random seeds is 0.4 in reward and 1.6% in successful dialogs." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-88", "text": "Figure 1 shows the averaged success rates for each of our Rainbow agents leaving out one particular method after training with 10000 dialogs." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-89", "text": "Each of the plotted values has been evaluated on 1000 dialogs per random seed and averaged over all tasks." 
}, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-91", "text": "success rate, the best PyDial 13.5 1 12.3 2 11.0 2 12.7 3 10.1 1 9.1 3 12.2 3 8.6 2 6.5 2 11.1 3 8.2 3 5.8 1 10.5 3 6.5 2 3.8 2 9.9 3 3.6 3 3.2 2 DQN 13.0 10.8 9.5 13.1 11.0 9.5 12.7 9.7 7.5 11.9 7.9 5." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-92", "text": "Table 5 show that there is almost no difference between the distributional and the nondistributional approach." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-93", "text": "Their final rewards are the same and their success rates differ by 0.1% when averaged across all tasks." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-94", "text": "A possible reason could be that the diversity of the training dialogs was too little and rewards too sparse to show a benefit by using the distributional reinforcement learning method." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-95", "text": "This coincides with the findings in Hessel et al. (2018) which found their combined agent without distributional learning performing similar to the combined agent with distributional learning for the first 40 million frames on the Atari benchmark." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-96", "text": "----------------------------------" }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-97", "text": "**MODEL ABLATION ANALYSIS**" }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-98", "text": "----------------------------------" }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-99", "text": "**RESULTS IN**" }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-100", "text": "The strongest benefits to final performance come with the dueling architecture." 
}, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-101", "text": "For some scenarios like the previously described dialog start without any user-provided information, we examined the action-state values by clustering them and observed fewer clusters and smaller within-cluster variance for the dueling agents, indicating better generalization and simpler action-value functions." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-102", "text": "Prioritized experience replay helped with learning efficiency but had no significant effect on final performance, as expected." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-103", "text": "Only a small improvement can be attributed to double DQN, but overall performance seems to be slightly more stable." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-104", "text": "Overall, Table 5 shows that the final best Rainbow agent performs considerably better than the best reported PyDial agent and the DQN agent across all the tasks and testing environments and is on par with handcrafted policy performance." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-105", "text": "----------------------------------" }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-106", "text": "**CONCLUSIONS**" }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-107", "text": "We explored state-of-the-art deep RL methods for dialog policy on different domains with various noise levels and user behaviours." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-108", "text": "Our findings are that not all extensions to DQN prove beneficial in dialog policy settings, especially when learning speed is concerned: distributional reinforcement learning method requires more training time to reach the success rates and final rewards of the non-distributional agent." 
}, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-109", "text": "The Rainbow agent that makes use of prioritized experience replay, double DQN and dueling network architecture is stable across domains and evaluation settings and learns fastest (when excluding distributional method)." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-110", "text": "----------------------------------" }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-111", "text": "**A HYPERPARAMETERS**" }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-135", "text": "User Thank you, bye 5" }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-112", "text": "All neural network layers are fully connected linear layers with ReLUs as activation functions." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-113", "text": "In case of the dueling network architecture, the shared layer consists of 256 neurons, followed by two value layers, each with 300 neurons, and two advantage layers with 400 neurons per layer." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-114", "text": "Distributional agents use an atom count of 50." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-115", "text": "Where the dueling architecture is replaced by a standard architecture in the evaluation process, three layers of sizes 256, 700 and 700 are used to guarantee a fair comparison to the dueling case by providing the same model capacity." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-116", "text": "For prioritized replay, the prioritization exponent \u03b1 is set to 0.525 and importance sampling exponent \u03b2 to 0.4 (Schaul et al., 2015) ." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-117", "text": "To train the networks we use the Adam optimizer with a learning rate of 10 \u22124 ." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-118", "text": "Exploration is performed -greedy with linear decay, starting at 0.3." 
}, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-119", "text": "Whenever an agent makes use of double Q-learning, it updates its target network after 6 dialogs." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-120", "text": "All agents use an experience replay buffer capacity of 16384 transitions, a discount factor \u03b3 = 0.99, minibatch size 256 and the Huber Loss \u03ba is set to 1." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-121", "text": "Following the PyDial benchmarking process, we leave all hyperparameters constant across all environments and dialog domains (Casanueva et al., 2017) , thus also evaluating the generalization capabilities of the agents." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-122", "text": "----------------------------------" }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-123", "text": "**B EXAMPLE DIALOGS**" }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-124", "text": "Tables 6 and 7 show example dialogs between the Rainbow agent as well as the handcrafted policy and a user recorded in the LAP domain as qualitative evaluation." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-125", "text": "The Rainbow agent was trained in a noisy environment with 15% semantic error rate." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-126", "text": "The user's task was to find a cheap, light notebook with medium-sized hard drive and arbitrary battery rating." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-127", "text": "User responses were kept identical for all tested policies whenever possible." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-128", "text": "Both the combined and the handcrafted agents succeeded in fulfilling the user's goal with the same amount of dialog turns." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-129", "text": "System The satellite-pro-nb10-a-10p is in the budget price range, is in the light weight range and is in the medium drive range." 
}, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-130", "text": "4" }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-131", "text": "User Thank you, bye 5" }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-132", "text": "System Thank you, goodbye." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-133", "text": "System The satellite-pro-nb10-a-10p is in the light weight range, has a standard battery rating , is in the budget price range and is in the medium drive range ." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-134", "text": "4" }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-136", "text": "System Thank you, goodbye." }, { "sent_id": "759c1c892361f62ad8f2c46e569e8a-C001-137", "text": "Table 7 : Recorded dialog between a human and PyDials handcrafted policy for task T4.3." } ], "y": { "@BACK@": { "gold_contexts": [ [ "759c1c892361f62ad8f2c46e569e8a-C001-17" ], [ "759c1c892361f62ad8f2c46e569e8a-C001-18" ], [ "759c1c892361f62ad8f2c46e569e8a-C001-69" ] ], "cite_sentences": [ "759c1c892361f62ad8f2c46e569e8a-C001-17", "759c1c892361f62ad8f2c46e569e8a-C001-18", "759c1c892361f62ad8f2c46e569e8a-C001-69" ] }, "@USE@": { "gold_contexts": [ [ "759c1c892361f62ad8f2c46e569e8a-C001-71" ], [ "759c1c892361f62ad8f2c46e569e8a-C001-76" ], [ "759c1c892361f62ad8f2c46e569e8a-C001-81" ], [ "759c1c892361f62ad8f2c46e569e8a-C001-121" ] ], "cite_sentences": [ "759c1c892361f62ad8f2c46e569e8a-C001-71", "759c1c892361f62ad8f2c46e569e8a-C001-76", "759c1c892361f62ad8f2c46e569e8a-C001-81", "759c1c892361f62ad8f2c46e569e8a-C001-121" ] } } }, "ABC_7293ab5db16d3fe1fee48d45154697_19": { "x": [ { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-2", "text": "This paper describes the Universitat d'Alacant submissions (labeled as UAlacant) to the machine translation quality estimation (MTQE) shared task at WMT 2016, where we have participated in the word-level and phrase-level MTQE subtasks." 
}, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-3", "text": "Our systems use external sources of bilingual information as a black box to spot sub-segment correspondences between the source segment and the translation hypothesis." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-22", "text": "Section 2 describes the approach used to produce our submissions." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-4", "text": "For our submissions, two sources of bilingual information have been used: machine translation (Lucy LT KWIK Translator and Google Translate) and the bilingual concordancer Reverso Context." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-5", "text": "Building upon the word-level approach implemented for WMT 2015, a method for phrase-based MTQE is proposed which builds on the probabilities obtained for word-level MTQE." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-6", "text": "For each sub-task we have submitted two systems: one using the features produced exclusively based on online sources of bilingual information, and one combining them with the baseline features provided by the organisers of the task." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-7", "text": "----------------------------------" }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-9", "text": "Machine translation quality estimation (MTQE) (Blatz et al., 2004; Specia et al., 2010; Specia and Soricut, 2013) has aroused the interest of both the scientific community and translation companies on account of its noticeable advantages: it can be used to help professional translators in post-editing, to estimate the translation productivity for different translation technologies, or even for budgeting translation projects." 
}, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-10", "text": "In this context, the WMT 2016 MTQE shared task becomes one of the best scenarios in which different approaches to MTQE can be evaluated and compared for different granularities: segment-level (sub-task 1), phrase-level and word-level (sub-task 2), and document-level (sub-task 3)." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-11", "text": "For the second consecutive year, the submissions of the UAlacant team tackle the word-level MTQE sub-task, but this year they also cover phrase-level MTQE." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-12", "text": "This year, the shared task featured a dataset obtained by translating segments in English into German using MT, for which it is needed to identify which words and phrases are inadequately translated." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-13", "text": "In the case of words, this means detecting which words need to be deleted or replaced, while in the case of phrases this means detecting which phrases contain words translated inadequately, but also if there are missing words, or the order of the words in the phrase is not correct." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-14", "text": "The systems participating in the task are required to apply the labels BAD and OK, either to words or phrases." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-15", "text": "In this paper we describe the approach behind the submissions of the Universitat d'Alacant team to these sub-tasks." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-16", "text": "For our word-level submissions we have applied the approach proposed by Espl\u00e0-Gomis et al. (2015) , where we used black-box bilingual on-line resources." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-17", "text": "The new task tackles MTQE for translating English into German." 
}, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-18", "text": "For this task we have combined two on-line-available MT systems, 1 Lucy LT KWIK Translator 2 and Google Translate, 3 and the bilingual concordancer Reverso Context 4 to spot sub-segment correspondences between a sentence S in the source language (SL) and a given translation hypothesis T in the target language (TL)." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-19", "text": "As described by Espl\u00e0-Gomis et al. (2015) , a collection of features is obtained from these correspondences and then used by a binary classifier to determine the final word-level MTQE labels." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-20", "text": "We have repeated the approach proposed in WMT 2015 for word-level sub-tasks, and have proposed a new one for phrase-level MTQE that builds upon the system trained for word-level MTQE." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-21", "text": "The rest of the paper is organised as follows." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-23", "text": "Section 3 describes the experimental setting and the results obtained." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-24", "text": "The paper ends with some concluding remarks." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-25", "text": "2 Sources of bilingual information for machine translation quality estimation at the word and phrase levels" }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-26", "text": "The method used to produce the word-level MTQE submissions is the same than that used by the UAlacant team in the last edition of the shared task of MTQE at WMT 2015 (Espl\u00e0-Gomis et al., 2015) , which uses binary classification based on a collection of information." 
}, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-27", "text": "As in the previous edition of the shared task, we have used online sources of bilingual information to identify sub-segment alignments between the original SL segment S and a given translation hypothesis T in the TL." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-28", "text": "These sub-segment alignments are identified by: (i) splitting segments S and T in all possible overlapping sub-segments up to a given length L; (ii) using the sources of bilingual information to translate each sub-segment into the other language, i.e. SL subsegments into TL, and vice versa; and (iii) attempting to match the translated sub-segments either in T or S. The rest of the section briefly describes the features used for building the submissions both for word-level and phrase-level sub-tasks in the MTQE shared task of WMT 2016." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-29", "text": "----------------------------------" }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-30", "text": "**WORD-LEVEL MACHINE TRANSLATION QUALITY ESTIMATION**" }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-31", "text": "A complete description of the features used for word-level MTQE can be found in Section 2 of the paper by Espl\u00e0-Gomis et al. (2015) ." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-32", "text": "We provide here a general description of the type of features used." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-33", "text": "Espl\u00e0-Gomis et al. (2015) describe two types of features: positive and negative ones, i.e. features that would indicate that the current translation is OK, and features that would indicate that it is BAD." 
}, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-34", "text": "Positive features use those sub-segment pairs (\u03c3, \u03c4 ) obtained by means of the external sources of bilingual information such that \u03c3 matches the source segment S and \u03c4 matches the translation hypothesis T ." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-35", "text": "These features provide positive evidence for words in T matching \u03c4 ." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-36", "text": "An additional positive feature is defined, which measures the confidence of the sub-segment pairs by using the translation frequency in those sources of bilingual information capable of providing several translation alternatives, such as bilingual concordancers or probabilistic lexicons." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-37", "text": "On the other hand, negative features are built from those sub-segment pairs (\u03c3, \u03c4 ) for which \u03c3 fully matches S, but \u03c4 matches T only partially." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-38", "text": "These sub-segment pairs provide negative evidence for those words in T that do not match \u03c4 ." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-39", "text": "----------------------------------" }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-40", "text": "**PHRASE-LEVEL MACHINE TRANSLATION QUALITY ESTIMATION**" }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-41", "text": "While the word-level MTQE task has been going on during the last three editions of WMT, this is the first time that this shared task tackles phrase-level MTQE." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-42", "text": "This problem, as proposed by the organisers of the task, may miss some kinds of errors that are plausible in a phrase, such as missing words (insertions)." 
}, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-43", "text": "According to the instructions provided, the organisers describe the problem as follows: \"if a phrase has at least one 'BAD' word, all its labels are replaced with 'BAD'\"; in other words, the problem of phrase-level MTQE just extends the errors found in a given word to the words happening in the same phrase, but does not add new problems related to the new granularity." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-44", "text": "The approach proposed for this task builds on the word-level MTQE method described in Section 2.1." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-45", "text": "In the case of phrase-level MTQE, a binary classifier is also used to classify a phrase either as OK or BAD." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-46", "text": "This classifier uses the probability of belonging to the class BAD of every word in a phrase as a feature, which is provided by the classifier trained for the task at the word-level." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-47", "text": "These features are combined with two more binary features, which are aimed at capturing the information provided by the external sources of bilingual information at the level of phrases." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-48", "text": "Basically, these features take value true when the phrase of the translation hypothesis being evaluated is confirmed by one or more sources of bilingual information, i.e. if the TL phrase exactly corresponds to a sub-segment in the SL segment." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-49", "text": "Having two different features allows to capture this information for each translation direction, i.e. if the TL phrase is the result of translating a phrase in the SL, or if the translation of the TL phrase appears as a sub-segment in the SL segment." 
}, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-50", "text": "Given that phrases have variable lengths (from 1 to 7 words in the data set provided by the organisation), we decided to train specific classifiers for each phrase length using as many features as words in the phrase (plus the two features at the phrase level described above)." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-51", "text": "Alternatively, it would have been possible to experiment with an approach able to deal with sparse features." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-52", "text": "----------------------------------" }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-53", "text": "**SUBMISSIONS TO THE WMT 2016 SHARED TASK ON MTQE**" }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-54", "text": "This section describes the details of the systems submitted to the MTQE shared task at WMT 2016." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-55", "text": "This year, the task consisted in estimating the quality of a collection of segments in German that had been obtained through machine translation from English." 
}, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-56", "text": "The organisers provided three datasets:" }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-57", "text": "\u2022 training set: a collection of 12,000 segments in English (S) and their corresponding machine translations in German (T ); for every word/phrase in T , a label was provided: BAD for the words/phrases to be post-edited, and OK for those to be kept unedited;" }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-58", "text": "\u2022 development set: 1,000 pairs of segments (S, T ) with the corresponding MTQE labels, which can be used to optimise the binary classifier trained by using the training set;" }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-59", "text": "\u2022 test set: 2,000 pairs of segments (S, T ) for which the MTQE labels have to be estimated with the binary classifier built on the training and the development sets." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-60", "text": "The same data set was used both for word-level and phrase-level MTQE sub-tasks, with the only difference that, for the latter, the limits of the phrases which make up the full translated segments T were provided." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-61", "text": "In addition, for every sub-task, a collection of baseline features was provided for each word or phrase in T , respectively, in the different datasets." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-62", "text": "For word-level quality estimation, this collection consists of 22 baseline features, such as the number of occurrences of the word, or partof-speech information." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-63", "text": "5 For phrase-level quality estimation, this collection consists of 72 baseline features, such as the phrase length or its perplexity." 
}, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-64", "text": "6 Using these data, four systems have been submitted to the shared task on MTQE at WMT 2016: two for word-level MTQE and two more for phraselevel MTQE." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-65", "text": "All the systems are based on the binary classifier described bellow in Section 3.1, but using different collections of features." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-66", "text": "Of the two systems submitted to each sub-task: one was built using only the features described in Section 2, and the other combined them with the baseline features provided by the organisation." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-67", "text": "Section 3.2 describes the results obtained with each of these approaches by using the following metrics:" }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-68", "text": "\u2022 The precision P c , i.e. the fraction of instances correctly labelled among all the instances labelled as c, where c is the class assigned (either OK or BAD);" }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-69", "text": "\u2022 The recall R c , i.e. the fraction of instances correctly labelled as c among all the instances that should have been labelled as c;" }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-70", "text": "\u2022 The F c 1 score, which is defined as scores, which is the main metric used by the organisers of the task for comparing all the submissions made." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-71", "text": "----------------------------------" }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-72", "text": "**BINARY CLASSIFIER**" }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-73", "text": "A multilayer perceptron (Duda et al., 2000 , Section 6) was used for classification, as implemented in Weka 3.7 (Hall et al., 2009) ." 
}, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-74", "text": "Following the approach by Espl\u00e0-Gomis et al. (2015) , the perceptron was built with a single hidden layer containing the same number of nodes as the number of features; this was the best performing architecture in the preliminary experiments." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-75", "text": "7 The training sets 5 The list of features can be found in the file features list in the package http://www.quest." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-76", "text": "dcs.shef.ac.uk/wmt16_files_qe/task2_ en-de_test.tar.gz 6 The list of features can be found in the file features list in the package http://www.quest." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-77", "text": "dcs.shef.ac.uk/wmt16_files_qe/task2p_ en-de_test.tar.gz 7 The rest of parameters of the classifiers were also kept as in the approach by Espl\u00e0-Gomis et al. (2015) ." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-78", "text": "provided by the organisation were used to train the binary classifiers, both for word and phrase levels, while the development sets were used as validation sets on which the training error was computed, in order to minimise the risk of overfitting." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-79", "text": "The binary classifiers for the sub-task on phrase-level MTQE was trained to optimise the main comparison metric: F BAD 1 \u00b7F OK 1 , while the classifier for word-level MTQE was trained to optimise the F BAD 1 metric, which was the main comparison metric in WMT 2015." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-80", "text": "8 Given that the binary classifier used for the phrase-level sub-task depends on the output of the binary classifier for word-level MTQE, the training process was incremental, training first the wordlevel MTQE binary classifiers and then the phraselevel ones." 
}, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-81", "text": "It is worth mentioning that the binary classifiers for phrase-level MTQE use the probabilities provided by the best performing system for word-level MTQE: the one that combines the features obtained from on-line sources of bilingual information with the baseline features." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-82", "text": "However, the phrase-level baseline features are only used in one of the systems submitted." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-83", "text": "Table 1 shows the results obtained by the systems submitted to the shared task on MTQE, both at the level of words and at the level of phrases." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-84", "text": "The table also includes the results obtained with a binary classifier trained only on the baseline features (baseline), in order to estimate the contribution of the features described in this work on the performance of the system." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-85", "text": "Incidentally, and in spite of the changes in languages and machine translation systems, the results obtained for word-level MTQE are very similar to those obtained by Espl\u00e0-Gomis et al. (2015) for the translation from English into Spanish." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-86", "text": "----------------------------------" }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-87", "text": "**RESULTS**" }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-88", "text": "As can also be seen in Table 1 , the classifiers using only the baseline features outperform those using only features based on sources of bilingual information, both at the word level and at the phrase level." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-89", "text": "The difference between both feature families is specially relevant in the case of the phraselevel MTQE." 
}, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-90", "text": "However, the most interesting results are those obtained when combining both feature families." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-91", "text": "As a result of this combination, an improvement of 5% in F BAD 1 and more than 8% in F OK 1 with respect to the baseline is obtained for word-level MTQE." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-92", "text": "In the case of phrase-based MTQE, this improvement is more unbalanced: 1% for F BAD 1 , and more than 10% in F OK 1 ." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-93", "text": "Therefore, it is possible to conclude that both the baseline features and those obtained from sources of bilingual information are reasonably independent and, therefore, combining them leads to much more successful systems for the two granularities evaluated." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-94", "text": "----------------------------------" }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-95", "text": "**CONCLUDING REMARKS**" }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-96", "text": "In this paper we have described the submissions of the Universitat d'Alacant (called UAlacant) team to the sub-task 2 in the MTQE shared task at WMT 2016, which covers the problems of word-level and phrase-level MTQE." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-97", "text": "Our submissions used online available sources of bilingual information in order to obtain features about the translation hypotheses at different granularities." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-98", "text": "The approach employed is aimed at being system-independent, since it only uses resources produced by external systems, which makes the addition of new sources of bilingual information straightforward." 
}, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-99", "text": "In fact, one of the sources of bilingual information used in the previous edition of the shared task, Apertium, has been replaced by a new one: Lucy LT." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-100", "text": "The results obtained confirm the conclusion by Espl\u00e0-Gomis et al. (2015) that combining the baseline features with those obtained from external sources of bilingual information provide a noticeable improvement, in this case, not only for word-level MTQE, but also for phrase-level MTQE." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-101", "text": "Some future work may be interesting, specially as regards the approach to phrase-level MTQE." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-102", "text": "As already mentioned, it would be interesting to use binary classifiers that support sparse features, in order to be able to directly train a single binary classifier capable to deal with phrases of any length." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-103", "text": "This would make it possible to put together all the data available, avoiding splitting it into smaller training sets for different classifiers, and therefore allowing to have larger training data set." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-104", "text": "On the other hand, it may also be interesting to try to use the features defined for word-level MTQE to train the phrase-level MTQE classifier, instead of defining two levels of classification." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-105", "text": "The main disadvantage of this approach would be the large amount of features, that would make training more expensive." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-106", "text": "Table 1 : Precision (P ), recall (R), and F1 score obtained for the four systems submitted to the shared task on MTQE at WMT 2016." 
}, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-107", "text": "Two of them are based exclusively on the use of sources of bilingual information (SBI, see Section 2), and two more combine these SBI with the baseline features provided by the organisers of the task (SBI+baseline)." }, { "sent_id": "7293ab5db16d3fe1fee48d45154697-C001-108", "text": "The table also includes the results obtained when training the same binary classifier exclusively on the baseline features (baseline)." } ], "y": { "@DIF@": { "gold_contexts": [ [ "7293ab5db16d3fe1fee48d45154697-C001-16" ], [ "7293ab5db16d3fe1fee48d45154697-C001-19" ], [ "7293ab5db16d3fe1fee48d45154697-C001-74" ], [ "7293ab5db16d3fe1fee48d45154697-C001-100" ] ], "cite_sentences": [ "7293ab5db16d3fe1fee48d45154697-C001-16", "7293ab5db16d3fe1fee48d45154697-C001-19", "7293ab5db16d3fe1fee48d45154697-C001-74", "7293ab5db16d3fe1fee48d45154697-C001-100" ] }, "@EXT@": { "gold_contexts": [ [ "7293ab5db16d3fe1fee48d45154697-C001-16" ], [ "7293ab5db16d3fe1fee48d45154697-C001-19" ], [ "7293ab5db16d3fe1fee48d45154697-C001-74" ], [ "7293ab5db16d3fe1fee48d45154697-C001-100" ] ], "cite_sentences": [ "7293ab5db16d3fe1fee48d45154697-C001-16", "7293ab5db16d3fe1fee48d45154697-C001-19", "7293ab5db16d3fe1fee48d45154697-C001-74", "7293ab5db16d3fe1fee48d45154697-C001-100" ] }, "@BACK@": { "gold_contexts": [ [ "7293ab5db16d3fe1fee48d45154697-C001-26" ], [ "7293ab5db16d3fe1fee48d45154697-C001-31", "7293ab5db16d3fe1fee48d45154697-C001-74" ] ], "cite_sentences": [ "7293ab5db16d3fe1fee48d45154697-C001-26", "7293ab5db16d3fe1fee48d45154697-C001-31", "7293ab5db16d3fe1fee48d45154697-C001-74" ] }, "@USE@": { "gold_contexts": [ [ "7293ab5db16d3fe1fee48d45154697-C001-77" ] ], "cite_sentences": [ "7293ab5db16d3fe1fee48d45154697-C001-77" ] }, "@SIM@": { "gold_contexts": [ [ "7293ab5db16d3fe1fee48d45154697-C001-77" ], [ "7293ab5db16d3fe1fee48d45154697-C001-85" ] ], "cite_sentences": [ "7293ab5db16d3fe1fee48d45154697-C001-77", 
"7293ab5db16d3fe1fee48d45154697-C001-85" ] } } }, "ABC_1c0d971cf771f351b51661950f4b14_19": { "x": [ { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-3", "text": "We demonstrate the utility of the induced BWEs in the BLI task by reporting on benchmarking BLI datasets for three language pairs: (1) We show that our BWE-based BLI models significantly outperform the MuPTM-based and context-counting models in this setting, and obtain the best reported BLI results for all three tested language pairs; (2) We also show that our BWE-based BLI models outperform other BLI models based on recently proposed BWEs that require parallel data for bilingual training." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-4", "text": "----------------------------------" }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-5", "text": "**INTRODUCTION**" }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-6", "text": "Dense real-valued vectors known as distributed representations of words or word embeddings (WEs) (Bengio et al., 2003; Collobert and Weston, 2008; Mikolov et al., 2013a; Pennington et al., 2014) have been introduced recently as part of neural network architectures for statistical language modeling." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-7", "text": "Recent studies (Levy and Goldberg, 2014; Levy et al., 2015) have showcased a direct link and comparable performance to \"more traditional\" distributional models (Turney and Pantel, 2010) , but the skip-gram model with negative sampling (SGNS) (Mikolov et al., 2013c) is still established as the state-of-the-art word representation model, due to its simplicity, fast training, as well as its solid and robust performance across a wide variety of semantic tasks (Baroni et al., 2014; Levy et al., 2015) ." 
}, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-8", "text": "A natural extension of interest from monolingual to multilingual word embeddings has occurred recently (Klementiev et al., 2012; Zou et al., 2013; Mikolov et al., 2013b; Hermann and Blunsom, 2014a; Hermann and Blunsom, 2014b; Gouws et al., 2014; Chandar et al., 2014; Soyer et al., 2015; Luong et al., 2015) ." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-9", "text": "When operating in multilingual settings, it is highly desirable to learn embeddings for words denoting similar concepts that are very close in the shared inter-lingual embedding space (e.g., the representations for the English word school and the Spanish word escuela should be very similar)." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-10", "text": "These shared interlingual embedding spaces may then be used in a myriad of multilingual natural language processing tasks, such as fundamental tasks of computing cross-lingual and multilingual semantic word similarity and bilingual lexicon induction (BLI), etc." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-11", "text": "However, all these models critically require at least sentence-aligned parallel data and/or readilyavailable translation dictionaries to induce bilingual word embeddings (BWEs) that are consistent and closely aligned over languages in the same semantic space." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-12", "text": "Contributions In this work, we alleviate the requirements: (1) We present the first model that is able to induce bilingual word embeddings from non-parallel data without any other readily available translation resources such as pre-given bilingual lexicons; (2) We demonstrate the utility of BWEs induced by this simple yet effective model in the BLI task from comparable Wikipedia data on benchmarking datasets for three language pairs (Vuli\u0107 and Moens, 2013b )." 
}, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-13", "text": "Our BLI model based on our novel BWEs significantly outperforms a series of strong baselines that reported previous best scores on these datasets in the same learning setting, as well as other BLI models based on recently proposed BWE induction models (Gouws et al., 2014; Chandar et al., 2014) ." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-14", "text": "The focus of the work is on learning lexicons from documentaligned comparable corpora (e.g., Wikipedia articles aligned through inter-wiki links)." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-15", "text": "The architecture of our BWE Skip-Gram model for learning bilingual word embeddings from document-aligned comparable data." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-16", "text": "Source language words and documents are drawn as gray boxes, while target language words and documents are drawn as blue boxes." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-17", "text": "The right side of the figure (separated by a vertical dashed line) illustrates how a pseudo-bilingual document is constructed from a pair of two aligned documents; two documents are first merged, and then words in the pseudo-bilingual document are randomly shuffled to ensure that both source and target language words occur as context words." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-18", "text": "----------------------------------" }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-19", "text": "**MODEL ARCHITECTURE**" }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-20", "text": "In the following architecture description, we assume that the reader is familiar with the main assumptions and training procedure of SGNS (Mikolov et al., 2013a; Mikolov et al., 2013c) ." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-21", "text": "We extend the SGNS model to work with bilingual document-aligned comparable data." 
}, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-22", "text": "An overview of our architecture for learning BWEs from such comparable data is given in fig. 1 ." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-23", "text": "Let us assume that we possess a documentaligned comparable corpus which is defined as" }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-24", "text": "denotes a pair of aligned documents in the source language L S and the target language L T , respectively, and N is the number of documents in the corpus." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-25", "text": "V S and V T are vocabularies associated with languages L S and L T ." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-26", "text": "The goal is to learn word embeddings for all words in both V S and V T that will be semantically coherent and closely aligned over languages in a shared cross-lingual word embedding space." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-27", "text": "In the first step, we merge two documents d S j and d T j from the aligned document pair d j into a single \"pseudo-bilingual\" document d j and remove sentence boundaries." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-28", "text": "Following that, we randomly shuffle the newly constructed pseudobilingual document." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-29", "text": "The intuition behind this pretraining completely random shuffling step 1 (see 1 In this paper, we investigate only the random shuffling procedure and show that the model is fairly robust to different fig. 1 ) is to assure that each word w, regardless of its actual language, obtains word collocates from both vocabularies." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-30", "text": "The idea of having bilingual contexts for each pivot word in each pseudobilingual document will steer the final model towards constructing a shared inter-lingual embedding space." 
}, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-31", "text": "Since the model depends on the alignment at the document level, in order to ensure the bilingual contexts instead of monolingual contexts, it is intuitive to assume that larger window sizes will lead to better bilingual embeddings." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-32", "text": "We test this hypothesis and the effect of window size in sect." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-33", "text": "4." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-34", "text": "The final model called BWE Skip-Gram (BWESG) then relies on the monolingual variant of skip-gram trained on the shuffled pseudobilingual documents." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-35", "text": "2 The model learns word embeddings for source and target language words that are aligned over the d embedding dimensions and may be represented in the same shared cross-lingual embedding space." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-36", "text": "The BWESGbased representation of word w, regardless of its actual language, is then a d-dimensional vector:" }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-37", "text": "where f k \u2208 R denotes the score for the k-th inter-lingual feature within the d-dimensional shared embedding space." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-38", "text": "Since all words share the embedding space, semantic similarity between words may be computed both outputs of the procedure if the window size is large enough." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-39", "text": "As one line of future work, we plan to investigate other, more systematic and deterministic shuffling algorithms." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-40", "text": "2 We were also experimenting with GloVe and CBOW, but they were falling behind SGNS on average." 
}, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-41", "text": "monolingually and across languages." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-42", "text": "Given w, the most similar word cross-lingually should be its one-to-one translation, and we may use this intuition to induce one-to-one bilingual lexicons from comparable data." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-43", "text": "In another interpretation, BWESG actually builds BWEs based on (pseudo-bilingual) document level co-occurrence." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-44", "text": "The window size parameter then just controls the amount of random data dropout." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-45", "text": "With larger windows, the model becomes prohibitively computationally expensive, but in sect." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-46", "text": "4 we show that the BLI performance flattens out for \"reasonably large\" windows." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-47", "text": "----------------------------------" }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-48", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-49", "text": "Training Data We use comparable Wikipedia data introduced in (Vuli\u0107 and Moens, 2013a; Vuli\u0107 and Moens, 2013b ) available in three language pairs to induce bilingual word embeddings: (i) a collection of 13, 696 Spanish-English Wikipedia article pairs (ES-EN), (ii) a collection of 18, 898 ItalianEnglish Wikipedia article pairs (IT-EN), and (iii) a collection of 7, 612 Dutch-English Wikipedia article pairs (NL-EN)." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-50", "text": "All corpora are theme-aligned comparable corpora, that is, the aligned document pairs discuss similar themes, but are in general not direct translations." 
}, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-51", "text": "Following prior work (Haghighi et al., 2008; Prochasson and Fung, 2011; Vuli\u0107 and Moens, 2013b) , we retain only nouns that occur at least 5 times in the corpus." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-52", "text": "Lemmatized word forms are recorded when available, and original forms otherwise." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-53", "text": "TreeTagger (Schmid, 1994 ) is used for POS tagging and lemmatization." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-54", "text": "After the preprocessing vocabularies comprise between 7,000 and 13,000 noun types for each language in each language pair." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-55", "text": "Exactly the same training data and vocabularies are used to induce bilingual lexicons with all other BLI models in comparison." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-56", "text": "BWESG Training Setup We have trained the BWESG model with random shuffling on 10 random corpora shuffles for all three training corpora with the following parameters from the word2vec package (Mikolov et al., 2013c) : stochastic gradient descent with a default learning rate of 0.025, negative sampling with 25 samples, and a subsampling rate of value 1e\u22124." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-57", "text": "All models are trained for 15 epochs." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-58", "text": "We have varied the number of embedding dimensions: d = 100, 200, 300, and have also trained the model with d = 40 to be directly comparable to pre-trained state-of-theart BWEs from (Gouws et al., 2014; Chandar et al., 2014) ." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-59", "text": "Moreover, in order to test the effect of window size on final results, we have varied the maximum window size cs from 4 to 60 in steps of 4." 
}, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-60", "text": "3 Since cosine is used for all similarity computations in the BLI task, we call our new BLI model BWESG+cos." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-61", "text": "Baseline BLI Models We compare BWESG+cos to a series of state-of-the-art BLI models from document-aligned comparable data:" }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-62", "text": "(1) BiLDA-BLI -A BLI model that relies on the induction of latent cross-lingual topics (Mimno et al., 2009) by the bilingual LDA model and represents words as probability distributions over these topics (Vuli\u0107 et al., 2011) ." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-63", "text": "(2) Assoc-BLI -A BLI model that represents words as vectors of association norms (Roller and Schulte im Walde, 2013) over both vocabularies, where these norms are computed using a multilingual topic model (Vuli\u0107 and Moens, 2013a) ." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-64", "text": "(3) PPMI+cos -A standard distributional model for BLI relying on positive pointwise mutual information and cosine similarity (Bullinaria and Levy, 2007) ." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-65", "text": "The seed lexicon is bootstrapped using the method from (Peirsman and Pad\u00f3, 2011; Vuli\u0107 and Moens, 2013b) ." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-66", "text": "All parameters of the baseline BLI models (i.e., topic models and their settings, the number of dimensions K, feature pruning values, window size) are set to their optimal values according to suggestions in prior work (Steyvers and Griffiths, 2007; Vuli\u0107 and Moens, 2013a; Vuli\u0107 and Moens, 2013b; Kiela and Clark, 2014) ." 
}, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-67", "text": "Due to space constraints, for (much) more details about the baselines we point to the relevant literature (Peirsman and Pad\u00f3, 2011; Tamura et al., 2012; Vuli\u0107 and Moens, 2013a; Vuli\u0107 and Moens, 2013b) ." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-68", "text": "Test Data For each language pair, we evaluate on standard 1,000 ground truth one-to-one translation pairs built for the three language pairs (ES/IT/NL-EN) (Vuli\u0107 and Moens, 2013a; Vuli\u0107 and Moens, 2013b) ." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-69", "text": "Translation direction is ES/IT/NL \u2192 EN." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-70", "text": "Evaluation Metrics Since we can build a oneto-one bilingual lexicon by harvesting one-to-one translation pairs, the lexicon qualiy is best reflected in the Acc 1 score, that is, the number of source language (ES/IT/NL) words w S i from ground truth translation pairs for which the top ranked word cross-lingually is the correct trans- Table 1 : Example lists of top 10 semantically similar words for all 3 language pairs obtained using BWESG+cos; d = 200, cs = 48; (col 1.) only source language words (ES/IT/NL) are listed while target language words are skipped (monolingual similarity); (2) only target language words (EN) are listed (cross-lingual similarity); (3) words from both languages are listed (multilingual similarity)." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-71", "text": "EN words are given in italic." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-72", "text": "The correct one-to-one translation for each source word is marked by (+)." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-73", "text": "lation in the other language (EN) according to the ground truth over the total number of ground truth translation pairs (=1000) (Gaussier et al., 2004; Tamura et al., 2012; Vuli\u0107 and Moens, 2013b) ." 
}, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-74", "text": "----------------------------------" }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-75", "text": "**RESULTS AND DISCUSSION**" }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-76", "text": "Exp 0: Qualitative Analysis Tab." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-77", "text": "1 displays top 10 semantically similar words monolingually, across-languages and combined/multilingually for one ES, IT and NL word." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-78", "text": "The BWESG+cos model is able to find semantically coherent lists of words for all three directions of similarity (i.e., monolingual, cross-lingual, multilingual)." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-79", "text": "In the combined (multilingual) ranked lists, words from both languages are represented as top similar words." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-80", "text": "This initial qualitative analysis already demonstrates the ability of BWESG to induce a shared cross-lingual embedding space using only document alignments as bilingual signals." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-81", "text": "Exp I: BWESG+cos vs. Baseline Models In the first experiment, we test whether our BWESG+cos BLI model produces better results than the baseline BLI models which obtain current state-of-theart results for BLI from comparable data on these test sets." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-82", "text": "Tab. 2 summarizes the BLI results." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-83", "text": "As the most striking finding, the results reveal superior performance of the BWESG-cos model for BLI which relies on our new framework for inducing bilingual word embeddings over other BLI models relying on previously used bilingual word representations." 
}, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-84", "text": "The relative increase in Acc 1 scores over the best scoring baseline BLI models from comparable data is 19.4% for the ES-EN pair, 6.1% for IT-EN (significant at p < 0.05 using McNemar's test) and 65.4% for NL-EN." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-85", "text": "For large enough values for cs (cs \u2265 20) (see also Table 2 : BLI performance for all tested BLI models for ES/IT/NL-EN, with all bilingual word representations except CHANDAR and GOUWS learned from comparable Wikipedia data." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-86", "text": "The scores for BWESG+cos are computed as post-hoc averages over 10 random shuffles." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-87", "text": "fig. 2 (a)-2(c)), almost all BWESG+cos models for all language pairs outperform the highest baseline results." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-88", "text": "We may also observe that the performance of BWESG+cos is fairly stable for all models with larger values for cs (cs \u2265 20)." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-89", "text": "This finding reveals that even a coarse tuning of these parameters might lead to optimal or near-optimal scores in the BLI task with BWESG+cos. Exp II: Shuffling and Window Size Since our BWESG model relies on the pre-training random shuffling procedure, we also test whether the shuffling has significant or rather minor impact on the induction of BWEs and final BLI scores." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-90", "text": "Therefore, in fig. 2 , we present maximum, minimum, and average Acc 1 scores for all three language pairs obtained using 10 different random corpora shuffles with d = 100, 200, 300 and varying val- (Chandar et al., 2014; Gouws et al., 2014) , proven superior to or on a par with the BLI model from (Mikolov et al., 2013b) ." 
}, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-91", "text": "We use their pre-trained BWEs (obtained from the authors) and report the BLI scores in tab. 2." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-92", "text": "To make the comparison fair, we search for translations over the same vocabulary as with all other models." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-93", "text": "The results clearly reveal that, although both other BWE models critically rely on parallel Europarl data for training, and Gouws et al. (2014) in addition train on entire monolingual Wikipedias in both languages, our simple BWE induction model trained on much smaller amounts of document-aligned non-parallel data produces significantly higher BLI scores for IT-EN and ES-EN with sufficiently large windows." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-94", "text": "However, the results for NL-EN with all BLI models from comparable data from tab. 2 are significantly lower than with the GOUWS BWEs." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-95", "text": "We attribute it to using less (and clearly insufficient) document-aligned training data for NL-EN (i.e., training corpora for ES-EN and IT-EN are almost double or triple the size of training corpora for NL-EN, see sect." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-96", "text": "3)." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-97", "text": "----------------------------------" }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-98", "text": "**CONCLUSIONS AND FUTURE WORK**" }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-99", "text": "We have proposed Bilingual Word Embeddings Skip-Gram (BWESG), a simple yet effective model that is able to learn bilingual word embeddings solely on the basis of document-aligned comparable data." 
}, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-100", "text": "We have demonstrated its utility in the task of bilingual lexicon induction from such comparable data, where our new BWESG-based BLI model outperforms state-of-the-art models for BLI from document-aligned comparable data and related BWE induction models." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-101", "text": "The low-cost BWEs may be used in other (semantic) tasks besides the ones discussed here, and it would be interesting to experiment with other types of context aggregation and selection beyond random shuffling, and other objective functions." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-102", "text": "Preliminary studies also demonstrate the utility of the BWEs in monolingual and cross-lingual information retrieval (Vuli\u0107 and Moens, 2015) ." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-103", "text": "Finally, we may use the knowledge of BWEs obtained by BWESG from document-aligned data to learn bilingual correspondences (e.g., word translation pairs or lists of semantically similar words across languages) which may in turn be used for representation learning from large unaligned multilingual datasets as proposed in (Haghighi et al., 2008; Mikolov et al., 2013b; Vuli\u0107 and Moens, 2013b) ." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-104", "text": "In the long run, this idea may lead to large-scale fully data-driven representation learning models from huge amounts of multilingual data without any \"pre-requirement\" for parallel data or manually built lexicons." }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "1c0d971cf771f351b51661950f4b14-C001-2", "text": "We propose a simple yet effective approach to learning bilingual word embeddings (BWEs) from non-parallel document-aligned data (based on the omnipresent skip-gram model), and its application to bilingual lexicon induction (BLI)." 
} ], "y": { "@SIM@": { "gold_contexts": [ [ "1c0d971cf771f351b51661950f4b14-C001-12" ], [ "1c0d971cf771f351b51661950f4b14-C001-49" ], [ "1c0d971cf771f351b51661950f4b14-C001-65" ], [ "1c0d971cf771f351b51661950f4b14-C001-66" ], [ "1c0d971cf771f351b51661950f4b14-C001-68" ] ], "cite_sentences": [ "1c0d971cf771f351b51661950f4b14-C001-12", "1c0d971cf771f351b51661950f4b14-C001-49", "1c0d971cf771f351b51661950f4b14-C001-65", "1c0d971cf771f351b51661950f4b14-C001-66", "1c0d971cf771f351b51661950f4b14-C001-68" ] }, "@DIF@": { "gold_contexts": [ [ "1c0d971cf771f351b51661950f4b14-C001-51" ] ], "cite_sentences": [ "1c0d971cf771f351b51661950f4b14-C001-51" ] }, "@EXT@": { "gold_contexts": [ [ "1c0d971cf771f351b51661950f4b14-C001-51" ] ], "cite_sentences": [ "1c0d971cf771f351b51661950f4b14-C001-51" ] }, "@USE@": { "gold_contexts": [ [ "1c0d971cf771f351b51661950f4b14-C001-65" ], [ "1c0d971cf771f351b51661950f4b14-C001-66" ] ], "cite_sentences": [ "1c0d971cf771f351b51661950f4b14-C001-65", "1c0d971cf771f351b51661950f4b14-C001-66" ] }, "@BACK@": { "gold_contexts": [ [ "1c0d971cf771f351b51661950f4b14-C001-67" ], [ "1c0d971cf771f351b51661950f4b14-C001-73" ] ], "cite_sentences": [ "1c0d971cf771f351b51661950f4b14-C001-67", "1c0d971cf771f351b51661950f4b14-C001-73" ] }, "@MOT@": { "gold_contexts": [ [ "1c0d971cf771f351b51661950f4b14-C001-103" ] ], "cite_sentences": [ "1c0d971cf771f351b51661950f4b14-C001-103" ] } } }, "ABC_f5d1c0d3ac45ea4949f7d01d1704f6_19": { "x": [ { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-150", "text": "We term such algorithms as Feature Sharing Algorithms." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-2", "text": "In this work, we propose a semisupervised extension to a well-known supervised domain adaptation approach (EA) (Daum\u00e9 III, 2007) ." 
}, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-3", "text": "Our proposed approach (EA++) builds on the notion of augmented space (introduced in EA) and harnesses unlabeled data in target domain to ameliorate the transfer of information from source to target." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-4", "text": "This semisupervised approach to domain adaptation is extremely simple to implement, and can be applied as a pre-processing step to any supervised learner." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-5", "text": "Experimental results on sequential labeling tasks demonstrate the efficacy of the proposed method." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-6", "text": "----------------------------------" }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-8", "text": "A domain adaptation approach for sequential labeling tasks in NLP was proposed in (Daum\u00e9 III, 2007) ." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-9", "text": "The proposed approach, termed EASYADAPT (EA), augments the source domain feature space using features from labeled data in target domain." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-10", "text": "EA is simple, easy to extend and implement as a preprocessing step and most importantly is agnostic of the underlying classifier." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-11", "text": "However, EA requires labeled data in the target and hence applies to fully supervised (labeled data in source and target) domain adaptation settings only." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-12", "text": "In this paper, we propose a semi-supervised 1 (labeled data in source, and both labeled and unlabeled data in target) approach to leverage unlabeled data for EASYADAPT (which we call EA++) and empirically demonstrate its superior performance over EA as well as few other existing approaches." 
}, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-13", "text": "There exists prior work on supervised domain adaptation (or multi-task learning) that can be related to EASYADAPT." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-14", "text": "An algorithm for multitask learning using shared parameters was proposed (Evgeniou and Pontil, 2004) for multi-task regularization where each task parameter was represented as sum of a mean parameter (that stays same for all tasks) and its deviation from this mean." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-15", "text": "SVM was used as the base classifier and the algorithm was formulated in the standard SVM dual optimization setting." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-16", "text": "Subsequently, this framework (Evgeniou and Pontil, 2004) was extended (Dredze et al., 2010) to online multidomain setting." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-17", "text": "Prior work on semi-supervised approaches to domain adaptation also exists in literature." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-18", "text": "Extraction of specific features from the available dataset was proposed (Arnold and Cohen, 2008; Blitzer et al., 2006) to facilitate the task of domain adaptation." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-19", "text": "Co-adaptation (Tur, 2009), a combination of co-training and domain adaptation, can also be considered as a semisupervised approach to domain adaptation." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-20", "text": "A semi-supervised EM algorithm for domain adaptation was proposed in ." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-21", "text": "Similar to graph based semi-supervised approaches, a label propagation method was proposed (Xing et al., 2007) to facilitate domain adaptation." 
}, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-22", "text": "The recently proposed Domain Adaptation Machine (DAM) (Duan et al., 2009 ) is a semi-supervised extension of SVMs for domain adaptation and presents extensive empirical results." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-23", "text": "However, in almost all of the above cases, the proposed methods either use specifics of the datasets or are customized for some particular base classifier and hence it is not clear how the proposed methods can be extended to other existing classifiers." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-24", "text": "EA, on the other hand, is remarkably general in the sense that it can be used as a pre-processing step in conjunction with any base classifier." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-25", "text": "However, one of the prime limitations of EA is its incapability to leverage unlabeled data." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-26", "text": "Given its simplicity and generality, it would be interesting to extend EA to semi-supervised settings." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-27", "text": "In this paper we propose EA++, a co-regularization based semi-supervised extension to EA." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-28", "text": "We present our approach and results for a single pair of source and target domain." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-29", "text": "However, we note that EA++ can also be extended to multiple source settings." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-30", "text": "If we have k sources and a single target domain then we can introduce a co-regularizer for each source-target pair." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-31", "text": "Due to space constraints, we defer details to a full version." 
}, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-32", "text": "----------------------------------" }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-33", "text": "**BACKGROUND**" }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-34", "text": "----------------------------------" }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-35", "text": "**PROBLEM SETUP AND NOTATIONS**" }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-36", "text": "Let X \u2282 R d denote the instance space and Y = {\u22121, +1} denote the label space." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-37", "text": "We have a set of source labeled examples L s (\u223c D s (x, y)) and a set of target labeled examples" }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-38", "text": "We also have target unlabeled data denoted by U t (\u223c D t (x)), where |U t | = u t ." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-39", "text": "Our goal is to learn a hypothesis h : X \u2192 Y having low expected error with respect to the target domain." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-40", "text": "In this paper, we consider linear hypotheses only." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-41", "text": "However, the proposed techniques extend to non-linear hypotheses, as mentioned in (Daum\u00e9 III, 2007) ." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-42", "text": "Source and target empirical errors for hypothesis h are denoted b\u0177 \u01eb s (h, f s ) and\u01eb t (h, f t ) respectively, where f s and f t are source and target labeling functions." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-43", "text": "Similarly, the corresponding expected errors are denoted by \u01eb s (h, f s ) and \u01eb t (h, f t )." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-44", "text": "Shorthand notions of\u01eb s ,\u01eb t , \u01eb s and \u01eb t have also been used." 
}, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-45", "text": "----------------------------------" }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-46", "text": "**EASYADAPT (EA)**" }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-47", "text": "In this section, we give a brief overview of EASYADAPT proposed in (Daum\u00e9 III, 2007) ." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-48", "text": "Let us denote R d as the original space." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-49", "text": "EA operates in an augmented space denoted byX \u2282 R 3d (for a single pair of source and target domain)." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-50", "text": "For k domains, the augmented space blows up to R (k+1)d ." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-51", "text": "The augmented feature maps \u03a6 s , \u03a6 t : X \u2192X for source and target domains are defined as, Source and target domain features are transformed using these feature maps and the augmented feature space so constructed is passed onto the underlying supervised classifier." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-52", "text": "One of the most appealing properties of EASYADAPT is that it is agnostic of the underlying supervised classifier being used to learn in the augmented space." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-53", "text": "Almost any standard supervised learning approach for linear classifiers (for e.g., SVMs, perceptrons) can be used to learn a linear hypothesish \u2208 R 3d in the augmented space." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-54", "text": "As mentioned earlier, this work considers linear hypotheses only and the the proposed techniques can be extended (Daum\u00e9 III, 2007) to non-linear hypotheses." 
}, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-55", "text": "Let us denot\u0207 h = h c , h s , h t , where each of h c , h s , h t is of dimension d and represent the common, sourcespecific and target-specific components ofh, respectively." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-56", "text": "During prediction on target data, the incoming target feature x is transformed to obtain \u03a6 t (x) andh is applied on this transformed feature." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-57", "text": "This is equivalent to applying (h c + h t ) on x." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-58", "text": "A good intuitive insight into why this simple algorithm works so well in practice and outperforms most state-of-the-art algorithms is given in (Daum\u00e9 III, 2007) ." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-59", "text": "Briefly, it can be thought to be simultaneously training two hypotheses: w s = (h c + h s ) for source domain and w t = (h c + g t ) for target domain." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-60", "text": "The commonality between the domains is represented by h c whereas the source and target domain specific information is captured by h s and h t , respectively." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-61", "text": "This technique can be easily extended to a multi-domain scenario by making more copies of the original feature space ((K + 1) copies in case of K domains)." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-62", "text": "A kernelized version of the algorithm has also been presented in (Daum\u00e9 III, 2007) ." 
}, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-63", "text": "----------------------------------" }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-64", "text": "**USING UNLABELED DATA**" }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-65", "text": "As discussed in the previous section, the EASYADAPT algorithm is attractive because it performs very well empirically and can be used in conjunction with any underlying supervised clas-sifier." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-66", "text": "One drawback of EASYADAPT is that it does not make use of unlabeled target data which is generally available in large quantity in most practical problems." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-67", "text": "In this section, we propose a semi-supervised extension of this algorithm while maintaining the desirable classifier-agnostic property." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-68", "text": "----------------------------------" }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-69", "text": "**MOTIVATION**" }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-70", "text": "In multi-view approach for semi-supervised learning algorithms (Sindhwani et al., 2005) , different hypotheses are learned in different views." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-71", "text": "Thereafter, unlabeled data is utilized to co-regularize these learned hypotheses by making them agree on unlabeled samples." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-72", "text": "In domain adaptation, the source and target data come from two different distributions." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-73", "text": "However, if the source and target domains are reasonably close to each other, we can employ a similar form of regularization using unlabeled data." 
}, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-74", "text": "A similar co-regularizer based approach for unlabeled data was previously shown (Duan et al., 2009 ) to give improved empirical results for domain adaptation task." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-75", "text": "However, their technique applies for the particular base classifier they consider and hence does not extend to EASYADAPT." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-76", "text": "----------------------------------" }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-77", "text": "**EA++: EASYADAPT WITH UNLABELED DATA**" }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-78", "text": "In our proposed semi-supervised extension to EASYADAPT, the source and target hypothesis are made to agree on unlabeled data." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-79", "text": "We refer to this algorithm as EA++." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-80", "text": "Recall that EASYADAPT learns a linear hypothesish \u2208 R 3d in the augmented space." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-81", "text": "The hypothesish contains common, source and target sub-hypotheses and is expressed ash = h c , h s , h t ." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-82", "text": "In original space (ref. section 2.2), this is equivalent to learning a source specific hypothesis w s = (h c + h s ) and a target specific hypothesis w t = (h c + h t )." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-83", "text": "In EA++, we want source hypothesis w s and target hypothesis w t to agree on unlabeled data." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-84", "text": "For some unlabeled target sample x i \u2208 U t \u2282 R d , EA++ would implicitly want to make the predictions of w t and w t on x i to agree." 
}, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-85", "text": "Formally, it aims to achieve the following condition:" }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-86", "text": "We define another feature map \u03a6 u : X \u2192X for unlabeled data as below:" }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-87", "text": "Every unlabeled sample is transformed using the map \u03a6 u (.)." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-88", "text": "The augmented feature space that results from the application of three feature maps, namely, \u03a6 s : X \u2192X, \u03a6 t : X \u2192X, \u03a6 u : X \u2192 X, on source labeled samples, target labeled sampled and target unlabeled samples is summarized in Figure 1 . As shown in Eq. 3.1, during the training phase, EA++ assigns a predicted value close to 0 for each unlabeled sample." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-89", "text": "However, it is worth noting that, during the test phase, EA++ predicts labels from two classes: +1 and \u22121." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-90", "text": "This warrants further exposition of the implementation specifics which is deferred until the next subsection." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-91", "text": "----------------------------------" }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-92", "text": "**IMPLEMENTATION**" }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-93", "text": "In this section, we present implementation specific details of EA++." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-94", "text": "We consider SVM as our base supervised learner (LEARN in Algorithm 1)." 
}, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-95", "text": "However, these details hold for other supervised Algorithm 1 EA++ Input: L s ; L t ; U t ; LEARN : supervised classifier Output:h : classifier learned in augmented space /* initialize augmented training set */ 1: P := {} /* construct augmented training set */ 2: \u2200(x, y) \u2208 L s , P := P \u222a {\u03a6 s (x), y} 3: \u2200(x, y) \u2208 L t , P := P \u222a {\u03a6 t (x), y} 4: \u2200x \u2208 U t , P := P \u222a {\u03a6 u (x), 0} /* output learned classifier */ 5:h = LEARN (P )" }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-96", "text": "classifiers too." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-97", "text": "In the dual form of SVM optimization function, the labels are multiplied with the inner product of features." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-98", "text": "This can make the unlabeled samples redundant since we want their labels to be 0 according to Eq. 3.1." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-99", "text": "To avoid this, we create as many copies of \u03a6 u (x) as there are labels and assign each label to one copy." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-100", "text": "For the case of binary classification, we create two copies of every augmented unlabeled sample, and assign +1 label to one copy and \u22121 to the other." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-101", "text": "The learner attempts to balance the loss of the two copies, and tries to make the prediction on unlabeled sample equal to 0." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-102", "text": "Figure 2 shows the curves of the hinge loss for class +1, class \u22121 and their sum." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-103", "text": "The effective loss for each unlabeled sample is similar to the sum of losses for +1 and \u22121 classes (shown in Figure 2c )." 
}, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-104", "text": "----------------------------------" }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-105", "text": "**EXPERIMENTS**" }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-106", "text": "In this section, we demonstrate the empirical performance of EA augmented with unlabeled data." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-107", "text": "----------------------------------" }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-108", "text": "**SETUP**" }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-109", "text": "We follow the same experimental setup used in (Daum\u00e9 III, 2007) and perform two sequence labelling tasks (a) named-entity-recognition (NER), and (b) part-of-speech-tagging (POS )on the following datasets:" }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-110", "text": "PubMed-POS: Introduced by (Blitzer et al., 2006) , this dataset consists of two domains." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-111", "text": "task is to perform part-of-speech tagging on unlabeled PubMed abstracts with a classifier trained on labeled WSJ and PubMed data." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-112", "text": "Treebank-Brown." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-113", "text": "Treebank-Chunk data consists of the following domains: the standard WSJ domain (the same data as for CoNLL 2000), the ATIS switchboard domain and the Brown corpus." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-114", "text": "The Brown corpus consists of data combined from six subdomains." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-115", "text": "TreebankChunk is a shallow parsing task based on the data from the Penn Treebank." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-116", "text": "TreebankBrown is identical to the Treebank-Chunk task, However, in Treebank-Brown we consider all of the Brown corpus to be a single domain." 
}, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-117", "text": "Table 1 presents a summary of the datasets used." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-118", "text": "All datasets use roughly the same feature set which are lexical information (words, stems, capitalization, prefixes and suffixes), membership on gazetteers, etc." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-119", "text": "We use an averaged perceptron classifier from the Megam framework (implementation due to (Daum\u00e9 III, 2004) ) for all the aforementioned tasks." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-120", "text": "The training sample size varies from 1k to 16k." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-121", "text": "In all cases, the amount of unlabeled target data was equal to the total amount of labeled source and target data." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-122", "text": "----------------------------------" }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-123", "text": "**RESULTS**" }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-148", "text": "In both EA and EA++, we use features from source and target space to construct an augmented feature space." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-149", "text": "In other words, we are sharing features across source and target labeled data." 
}, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-124", "text": "We compare the empirical performance of EA++ with a few other baselines, namely, (a) SOURCEONLY (classifier trained on source labeled samples), (b) TARGETONLY-FULL (classifier trained on the same number of target labeled samples as the number of source labeled samples in SOURCEONLY), (c) TARGETONLY (classifier trained on small amount of target labeled samples, roughly one-tenth of the amount of source labeled samples in SOURCEONLY), (d) ALL (classifier trained on combined labeled samples of SOURCEONLY and TARGETONLY), (e) EA (classifier trained in augmented feature space on the same input training set as ALL), (f) EA++ (classifier trained in augmented feature space on the same input training set as EA and an equal amount of unlabeled target data)." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-125", "text": "All these approaches were tested on the entire amount of available target test data." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-126", "text": "Figure 3 presents the learning curves for (a) SOURCEONLY, (b) TARGETONLY-FULL, (c) TARGETONLY, (d) ALL, (e) EA, and (f) EA++ (EA with unlabeled data)." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-127", "text": "The x-axis represents the number of training samples on which the predictor has been trained." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-128", "text": "At this point, we note that the number of training samples vary depending on the particular approach being used." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-129", "text": "For SOURCEONLY, TARGETONLY-FULL and TARGETONLY, it is just the corresponding number of labeled source or target samples, respectively." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-130", "text": "For ALL and EA, it is the summation of labeled source and target samples." 
}, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-131", "text": "For EA++, the x-value plotted denotes the amount of unlabeled target data used (in addition to an equal amount of source+target labeled data, as in ALL or EA)." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-132", "text": "We plot this number for EA++, just to compare its improvement over EA when using an additional (and equal) amount of unlabeled target data." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-133", "text": "This accounts for the different x values plotted for the different curves." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-134", "text": "In all cases, the y-axis denotes the error rate." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-135", "text": "As can be seen in Figure 3 (a), EA++ performs better than the normal EA (which uses labeled data only)." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-136", "text": "The labeled and unlabeled case start together but with increase in number of samples their gap increases with the unlabeled case resulting in much lower error as compared to the labeled case." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-137", "text": "Similar trends were observed in other data sets as can be seen in Figure 3(b) ." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-138", "text": "We also note that EA performs poorly for some cases, as was shown (Daum\u00e9 III, 2007) earlier." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-139", "text": "----------------------------------" }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-140", "text": "**SUMMARY**" }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-141", "text": "In this paper, we have proposed a semi-supervised extension to an existing domain adaptation technique (EA)." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-142", "text": "Our approach EA++, leverages the unlabeled data to improve the performance of EA." 
}, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-143", "text": "Empirical results demonstrate improved accuracy for sequential labeling tasks performed on standardized datasets." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-144", "text": "The previously proposed EA could be applied exclusively to fully supervised domain adaptation problems only." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-145", "text": "However, with the current extension, EA++ applies to both fully supervised and semi-supervised domain adaptation problems." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-146", "text": "----------------------------------" }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-147", "text": "**FUTURE WORK**" }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-151", "text": "Feature sharing algorithms are effective for domain adaptation because they are simple, easy to implement as a preprocessing step and outperform many existing state-of-the-art techniques (shown previously for domain adaptation (Daum\u00e9 III, 2007) )." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-152", "text": "However, despite their simplicity and empirical success, it is not theoretically apparent why these algorithms perform so well." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-153", "text": "Prior work provides some intuitions but is mostly empirical and a formal theoretical analysis to justify FSAs (for domain adaptation) is clearly missing." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-154", "text": "Prior work (Maurer, 2006) analyzes the multi-task regularization approach (Evgeniou and Pontil, 2004) (which is related to EA) but they consider a cumulative loss in multi-task (or multi-domain) setting." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-155", "text": "This does not apply to domain adaptation setting where we are mainly interested in loss in the target domain only." 
}, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-156", "text": "Theoretically analyzing the superior performance of EA and EA++ and providing generalization guarantees is an interesting line of future work." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-157", "text": "One approach would be to model the feature sharing approach in terms of coregularization; an idea that originated in the context of multiview learning and for which some theoretical analysis has already been done (Rosenberg and Bartlett, 2007; Sindhwani and Rosenberg, 2008) ." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-158", "text": "Additionally, the aforementioned techniques, namely, SOURCEONLY, TARGETONLY, ALL have been empirically compared to EA and EA++." }, { "sent_id": "f5d1c0d3ac45ea4949f7d01d1704f6-C001-159", "text": "It would be interesting to formally frame these approaches and see whether their empirical performance can be justified within a theoretical framework." } ], "y": { "@DIF@": { "gold_contexts": [ [ "f5d1c0d3ac45ea4949f7d01d1704f6-C001-2" ], [ "f5d1c0d3ac45ea4949f7d01d1704f6-C001-41" ], [ "f5d1c0d3ac45ea4949f7d01d1704f6-C001-54" ] ], "cite_sentences": [ "f5d1c0d3ac45ea4949f7d01d1704f6-C001-2", "f5d1c0d3ac45ea4949f7d01d1704f6-C001-41", "f5d1c0d3ac45ea4949f7d01d1704f6-C001-54" ] }, "@EXT@": { "gold_contexts": [ [ "f5d1c0d3ac45ea4949f7d01d1704f6-C001-2" ], [ "f5d1c0d3ac45ea4949f7d01d1704f6-C001-41" ], [ "f5d1c0d3ac45ea4949f7d01d1704f6-C001-54" ] ], "cite_sentences": [ "f5d1c0d3ac45ea4949f7d01d1704f6-C001-2", "f5d1c0d3ac45ea4949f7d01d1704f6-C001-41", "f5d1c0d3ac45ea4949f7d01d1704f6-C001-54" ] }, "@BACK@": { "gold_contexts": [ [ "f5d1c0d3ac45ea4949f7d01d1704f6-C001-8" ], [ "f5d1c0d3ac45ea4949f7d01d1704f6-C001-47" ], [ "f5d1c0d3ac45ea4949f7d01d1704f6-C001-58" ], [ "f5d1c0d3ac45ea4949f7d01d1704f6-C001-62" ], [ "f5d1c0d3ac45ea4949f7d01d1704f6-C001-151" ] ], "cite_sentences": [ "f5d1c0d3ac45ea4949f7d01d1704f6-C001-8", "f5d1c0d3ac45ea4949f7d01d1704f6-C001-47", 
"f5d1c0d3ac45ea4949f7d01d1704f6-C001-58", "f5d1c0d3ac45ea4949f7d01d1704f6-C001-62", "f5d1c0d3ac45ea4949f7d01d1704f6-C001-151" ] }, "@SIM@": { "gold_contexts": [ [ "f5d1c0d3ac45ea4949f7d01d1704f6-C001-109" ], [ "f5d1c0d3ac45ea4949f7d01d1704f6-C001-138" ] ], "cite_sentences": [ "f5d1c0d3ac45ea4949f7d01d1704f6-C001-109", "f5d1c0d3ac45ea4949f7d01d1704f6-C001-138" ] }, "@USE@": { "gold_contexts": [ [ "f5d1c0d3ac45ea4949f7d01d1704f6-C001-109" ] ], "cite_sentences": [ "f5d1c0d3ac45ea4949f7d01d1704f6-C001-109" ] } } }, "ABC_c2bfe3534597a8f192ec846619f6b1_19": { "x": [ { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-64", "text": "We detail these steps in Algorithm 1." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-65", "text": "----------------------------------" }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-66", "text": "**PERFORMANCE ANALYSIS**" }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-2", "text": "Many tasks in NLP and IR require efficient document similarity computations." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-3", "text": "Beyond their common application to exploratory data analysis, latent variable topic models have been used to represent text in a low-dimensional space, independent of vocabulary, where documents may be compared." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-4", "text": "This paper focuses on the task of searching a large multilingual collection for pairs of documents that are translations of each other." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-5", "text": "We present (1) efficient, online inference for representing documents in several languages in a common topic space and (2) fast approximations for finding near neighbors in the probability simplex." 
}, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-6", "text": "Empirical evaluations show that these methods are as accurate as-and significantly faster thanGibbs sampling and brute-force all-pairs search." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-7", "text": "----------------------------------" }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-9", "text": "Statistical topic models, such as latent Dirichlet allocation (LDA) , have proven to be highly effective at discovering hidden structure in document collections (Hall et al., 2008, e.g.) ." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-10", "text": "Often, these models facilitate exploratory data analysis, by revealing which collocations of terms are favored in different kinds of documents or which terms and topics rise and fall over time (Blei and Lafferty, 2006; Wang and McCallum, 2006) ." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-11", "text": "One of the greatest advantages in using topic models to analyze and process large document collections is their ability to represent documents as probability distributions over a small number of topics, thereby mapping documents into a low-dimensional latent space-the T -dimensional probability simplex, where T is the number of topics." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-12", "text": "A document, represented by some point in this simplex, is said to have a particular \"topic distribution\"." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-13", "text": "Representing documents as points in a lowdimensional shared latent space abstracts away from the specific words used in each document, thereby facilitating the analysis of relationships between documents written using different vocabularies." 
}, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-14", "text": "For instance, topic models have been used to identify scientific communities working on related problems in different disciplines, e.g., work on cancer funded by multiple Institutes within the NIH (Talley et al., 2011) ." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-15", "text": "While vocabulary mismatch occurs within the realm of one language, naturally this mismatch occurs across different languages." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-16", "text": "Therefore, mapping documents in different languages into a common latent topic space can be of great benefit when detecting document translation pairs (Mimno et al., 2009; Platt et al., 2010) ." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-17", "text": "Aside from the benefits that it offers in the task of detecting document translation pairs, topic models offer potential benefits to the task of creating translation lexica, aligning passages, etc." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-18", "text": "The process of discovering relationship between documents using topic models involves: (1) representing documents in the latent space by inferring their topic distributions and (2) comparing pairs of topic distributions to find close matches." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-19", "text": "Many widely used techniques do not scale efficiently, however, as the size of the document collection grows." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-20", "text": "Posterior inference by Gibbs sampling, for instance, may make thousands of passes through the data." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-21", "text": "For the task of comparing topic distributions, recent work has also resorted to comparing all pairs of documents (Talley et al., 2011) ." 
}, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-22", "text": "This paper presents efficient methods for both of these steps and performs empirical evaluations on the task of detected translated document pairs embedded in a large multilingual corpus." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-23", "text": "Unlike some more exploratory applications of topic models, translation detection is easy to evaluate." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-24", "text": "The need for bilingual training data in many language pairs and domains also makes it attractive to mitigate the quadratic runtime of brute force translation detection." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-25", "text": "We begin in \u00a72 by extending the online variational Bayes approach of Hoffman et al. (2010) to polylingual topic models (Mimno et al., 2009) ." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-26", "text": "Then, in \u00a73, we build on prior work on efficient approximations to the nearest neighbor problem by presenting theoretical and empirical evidence for applicability to topic distributions in the probability simplex and in \u00a74, we evaluate the combination of online variational Bayes and approximate nearest neighbor methods on the translation detection task." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-27", "text": "----------------------------------" }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-28", "text": "**ONLINE VARIATIONAL BAYES FOR POLYLINGUAL TOPIC MODELS**" }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-29", "text": "Hierarchical generative Bayesian models, such as topic models, have proven to be very effective for modeling document collections and discovering underlying latent semantic structures." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-30", "text": "Most current topic models are based on Latent Dirichlet Allocation (LDA) ." 
}, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-31", "text": "In some early work on the subject, showed the usefulness of LDA on the task of automatic annotation of images." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-32", "text": "Hall et al. (2008) used LDA to analyze historical trends in the scientific literature; Wei and Croft (2006) showed improvements on an information retrieval task." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-33", "text": "More recently Eisenstein et al. (2010) modeled geographic linguistic variation using Twitter data." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-34", "text": "Aside from their widespread use on monolingual text, topic models have also been used to model multilingual data (Boyd-Graber and Blei, 2009; Platt et al., 2010; Jagarlamudi and Daum\u00e9, 2010; Fukumasu et al., 2012) , to name a few." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-35", "text": "In this paper, we focus on the Polylingual Topic Model, introduced by Mimno et al. (2009) ." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-36", "text": "Given a multilingual set of aligned documents, the PLTM assumes that across an aligned multilingual document tuple, there exists a single, tuple-specific, distribution across topics." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-37", "text": "In addition, PLTM assumes that for each language-topic pair, there exists a distribution over words in that language \u03b2 l ." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-38", "text": "As such, PLTM assumes that the multilingual corpus is created through a generative process where first a document tuple is generated by drawing a tuple-specific distribution over topics \u03b8 1 which, as it is the case with LDA, is drawn from a Dirichlet prior \u03b8 \u223c Dir (\u03b1) ." 
}, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-39", "text": "For each of the languages l in the tuple and for each of the N words w l n in the document the generative process: first chooses a topic assignment z l n \u223c M ultinomial (\u03b8) which is then followed by choosing a word w l n from a multinomial distribution conditioned on the topic assignment and the language specific topics distribution over words \u03b2 l \u223cDir (\u03b7 l )." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-40", "text": "Both \u03b1 and \u03b7 1,...,L are symmetric priors, i.e. the priors are exchangeable Dirichlet distributions." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-41", "text": "Finally, each word is generated from a language-and topic-specific multinomial distribution \u03b2 l t as selected by the topic assignment variable z l n :" }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-42", "text": "(1) Figure 1 shows a graphical representation of the PLTM using plate notation." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-43", "text": "In their original work Mimno et al. (2009) used the Gibbs sampling approach as a posterior inference algorithm to assign topics distributions over their test collection." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-44", "text": "While more straightforward to implement, this sampling approach is inherently slow when applied to large collections which makes the original PLTM work practically infeasible to be used on real-world data sets." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-45", "text": "In general, performing posterior inference over the latent variables of a Bayesian model is usually done with two of the three approximate approaches, Gibbs sampling, variational Bayes (VB) and expectation-propagation." 
}, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-46", "text": "While Gibbs Sampling is a variation of Markov Chain Monte Carlo method (MCMC) which generates a sample from the true posterior after converging to a stationary distribution; in VB, a set of free variational parameters characterizes a simpler family of probability distributions." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-47", "text": "These variational parameters are then optimized by finding the minimum KullbackLeibler (KL) divergence between the variational distribution q (\u03b8, z, \u03b2|\u03b3, \u03c6, \u03bb) and the true posterior P (\u03b8, z, \u03b2|w, \u03b1, \u03b7)." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-48", "text": "From an algorithmic perspective, the variational Bayes approach follows the Expectation-Maximization (EM) procedure where for a given document, the E-step updates the per document variational parameters \u03b3 d and \u03c6 d while holding the per words-topic distribution parameter \u03bb fixed." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-49", "text": "It then updates the variational parameter \u03bb using the sufficient statistics computed in the E step." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-50", "text": "In order to converge to a stationary point, both approaches require going over the whole collection multiple times which makes their time complexity to grown linearly with the size of the data collection." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-51", "text": "The mere fact that they require continuous access to the whole collection makes both inference approaches impracticable to use on very large or streaming collections." 
}, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-52", "text": "To alleviate this problem, several algorithms have been proposed that draws from belief propagation (Zeng et al., 2012) , the Gibbs sampling approach such as (Canini et al., 2009 ), variational Bayes (Hoffman et al., 2010) as well as a combination of the latter two (Hoffman et al., 2012) to name a few." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-53", "text": "In this paper we use Hoffman et al. (2010) approach." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-54", "text": "Hoffman et al. (2010) proposed a new inference approach called Online LDA which relies on the stochastic gradient descent to optimize the variational parameters." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-55", "text": "This approach can produce good estimates of LDA posteriors in a single pass over the whole collection." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-56", "text": "----------------------------------" }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-57", "text": "**ALGORITHMIC IMPLEMENTATION**" }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-58", "text": "We now derive an online variational Bayes algorithm for PLTM to infer topic distributions over multilingual collections." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-59", "text": "Figure 2 shows the variational model and free parameters used in our approach." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-60", "text": "As in the case of Hoffman et al. (2010) , our algorithm updates the variational parameters \u03b3 l d and \u03c6 l d on each batch of documents while the variational parameter \u03bb is computed as a weighted average of the value on the previous batch and its approximate version\u03bb." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-61", "text": "Averaging is performed using a decay function whose parameters control the rate at which old values of \u03bb l are forgotten." 
}, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-62", "text": "Within the E step of the VB approach, we compute the updates over the variational parameter \u03c6 l T . . ." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-63", "text": "Figure 2: Graphical model representation of the free variational parameters for the online variational Bayes approximation of the PLTM posterior for each language L present in our document tuple while the update on the \u03b3 parameter accumulates the language specific sufficient statistics:" }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-67", "text": "To demonstrate the efficacy of online PLTM, we ran topic inference on a subset of the EnglishSpanish Europarl collection consisting of \u223c64k parallel speeches and compared the accuracy results vs. the training and inference speed against the original PLTM model using topic sets of T=50,100, 200 and 500." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-68", "text": "We explain in details the evaluation task and the performance metric used in \u00a74." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-69", "text": "Shown in Figure 3 are the results of these comparisons." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-70", "text": "Our speed measurements were performed on Xeon quad processors with a clock speed of 2.66GHz and a total of 16GB of memory." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-71", "text": "As we increase the number of topics we gain in accuracy over the evaluation task across both inference approaches." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-72", "text": "When we increase the number of topics from 50 to 500 the speed improvement obtained by Online VB PLTM drops by a factor of 2.9 within the training step and by a factor of 4.45 in the test step." 
}, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-73", "text": "Our total running time for the Online VB PLTM with T=500 approaches the running time of the Gibbs sampling approach with T=50." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-74", "text": "The gradual drop in speed improvement with the increase of the number topics is mostly attributed to the commutation of the Algorithm 1 Online variational Bayes for PLTM initialize \u03bb l randomly obtain the tth mini-batch of tuples M t for t = 1 to \u221e do Online VB PLTM and Gibbs Sampling PLTM at T=50,100, 200 and 500." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-75", "text": "We used a Python implementation of Online VB and Mallet's Java implementation of PLTM with in-memory Gibbs Sampling using 1000 iterations." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-76", "text": "While a multilingual collection of \u223c64k document pairs is considered relatively big, our goal of deriving the Online VB PLTM approach was to be able to utilize PLTM on very large multilingual collections." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-77", "text": "To analyze the potential of using Online VB PLTM on such collections we ran speed comparisons within the training step by creating multilingual collections of different lengths multiplying the original English-Spanish Europarl collection." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-78", "text": "Speed comparisons using collections of length 50K, 100K, 250K, 500K, 750K and 1M are shown in Figure 4 ." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-79", "text": "Training was performed with the number of topics T set to T=50 and T=500." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-80", "text": "As we increase the collection size we observe the real benefit of using Online VB compared to Gibbs sampling." 
}, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-81", "text": "This is mostly attributed to the fact that the Gibbs sampling approach requires multiple iterations over the whole collection in order to achieve a convergence point." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-82", "text": "For collection sizes of 50k and 100k the training time for the Online VB PLTM with T=500 approaches the training time of Gibbs sampling with T=50 and as we increase the collection size this proximity dissipates." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-83", "text": "In Figure 5 we show a sample set of the aligned topics extracted using Online VB PLTM with T=400 on the English-Spanish Europarl collection." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-84", "text": "For a given topic tuple words are ordered based on probability of occurrence within the given topic." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-85", "text": "3 Approximate NN Search in the Probability Simplex" }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-86", "text": "One of the most attractive applications for topic models has involved using the latent variables as a low-dimensional representation for document similarity computations (Hall et al., 2008; BoydGraber and Resnik, 2010; Talley et al., 2011) ." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-87", "text": "After computing topic distributions for documents, however, researchers in this line of work have almost always resorted to brute-force all-pairs similarity comparisons between topic distributions." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-88", "text": "In this section, we present efficient methods for approximate near neighbor search in the probability simplex in which topic distributions live." 
}, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-89", "text": "Measurements for similarity between two probability distributions are information-theoretic, and distance metrics, typical for the metric space, are not appropriate (measurements such as Euclidean, cosine, Jaccard, etc.)." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-90", "text": "Divergence metrics, such as Kullback-Leibler (KL), Jensen-Shannon (JS), and Hellinger distance are used instead." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-91", "text": "Shown in Figure 6 are the formulas of the divergence metrics along with the Euclidean distance." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-92", "text": "When dealing with a large data set of N documents, the O(N 2 ) time complexity of all-pairs comparison makes the task practically infeasible." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-93", "text": "With some distance measures, however, the time complexity on near neighbor tasks has been alleviated using approximate methods that reduce the time complexity of each query to a sub-linear number of comparisons." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-94", "text": "For example, Euclidean distance (3) has been efficiently used on all-pairs comparison tasks in large data sets thanks to its approximate based versions developed using locality sensitive hashing (LSH) (Andoni et al., 2005) and k-d search trees (Friedman et al., 1977) ." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-95", "text": "In order to alleviate the all-pairs computational complexity in the probability simplex, we will use a reduction of the Hellinger divergence measure (4) to Euclidean distance and therefore utilize preexisting approximation techniques for the Euclidean distance in the probability simplex." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-96", "text": "This reduction comes from the fact that both measurements have similar algebraic expressions." 
}, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-97", "text": "If we discard the square root used in the Euclidean distance, Hellinger distance (4) becomes equivalent to the Euclidean distance metric (3) between \u221a p i and \u221a q i ." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-98", "text": "The task of finding nearest neighbors for a given point (whether in the metric space or the probability simplex) involves ranking all nearest points discovered and as such not computing the square root function does not affect the overall ranking and the nearest neighbor discovery." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-99", "text": "Moreover, depending on its functional form, the Hellinger distance is often defined as square root over the whole summation." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-100", "text": "Aside from the Hellinger distance, we also approximate JensenShannon divergence which is a symmetric version of the Kullback-Liebler divergence." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-101", "text": "For the JS approximation, we will use a constant factor relationship between the Jensen-Shannon divergence an Hellinger distance previously explored by (Tops\u00f8e, 2000) ." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-102", "text": "More specifically, we will be using its more concise form (7) also presented by" }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-103", "text": "Figure 6: Distance measures and bounds (Guha et al., 2006) ." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-104", "text": "The constant factor relationship provides us with the theoretical guarantees necessary for this approximation." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-105", "text": "In practice, we can often do much better than this theoretical bound." 
}, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-106", "text": "Figure 7 shows the empirical relation of JS and Hellinger on a translationdetection task." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-107", "text": "As will be described in \u00a74, we computed the JS and Hellinger divergences between topic distributions of English and Spanish Europarl speeches for a total of 1 million document pairs." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-108", "text": "Each point in the figure represents one Spanish-English document pair that might or might not be translations of each other." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-109", "text": "In this figure we emphasize the lower left section of the plot where the nearest neighbors (i.e., likely translations) reside, and the relationship between JS and Hellinger is much tighter than the theoretical bounds and from pratical perspective as we will show in the next section." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-110", "text": "As a summary for the reader, using the above approaches, we will approximate JS divergence by using the Euclidean based representation of the Hellinger distance." 
}, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-111", "text": "As stated earlier, the Euclidean based representation is computed using well established approximation approaches and in our case we will use two such approaches: the Exact Euclidean LSH (E2LSH) (Andoni et al., 2005)" }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-112", "text": "----------------------------------" }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-113", "text": "**EFFICIENT APPROXIMATE TRANSLATION DETECTION**" }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-114", "text": "Mapping multilingual documents into a common, language-independent vector space for the purpose of improving machine translation (MT) and performing cross-language information retrieval (CLIR) tasks has been explored through various techniques." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-115", "text": "Mimno et al. (2009) introduced polylingual topic models (PLTM), an extension of latent Dirichlet allocation (LDA), and, more recently, Platt et al. (2010) proposed extensions of principal component analysis (PCA) and probabilistic latent semantic indexing (PLSI)." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-116", "text": "Both the PLTM and PLSI represent bilingual documents in the probability simplex, and thus the task of finding document translation pairs is formulated as finding similar probability distributions." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-117", "text": "While the nature of both works was exploratory, results shown on fairly large collections of bilingual documents (less than 20k documents) offer convincing argument of their potential." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-118", "text": "Expanding these approaches to much large collections of multilingual documents would require utilizing fast NN search for computing similarity in the probability simplex." 
}, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-119", "text": "While there are many other proposed approaches to the task of finding document translation pairs that represent documents in metric space, such as Krstovski and Smith (2011) which utilizes LSH for cosine distance, there is no evidence that they yield good results on documents of small lengths such as paragraphs and even sen-tences." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-120", "text": "In this section, we empirically show how to utilize approaches that deal with representing documents in the probability simplex without a significant loss in accuracy while significantly improving the processing time." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-121", "text": "We use PLTM representations of bilingual documents." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-122", "text": "In addition, we show how the results as reported by Platt et al. (2010) can be obtained using the PLTM representation with a significant speed improvement." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-123", "text": "As in (Platt et al., 2010) and (Mimno et al., 2009 ) the task is to find document translation pairs in a multilingual collection of documents by representing documents in the probability simplex and computing similarity between their probability distribution representation across all document pairs." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-124", "text": "For this experimental setup, accuracy is defined as the number of times (in percentage) that the target language document was discovered at rank 1 (i.e. % @Rank 1.) across the whole test collection." 
}, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-125", "text": "----------------------------------" }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-126", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-127", "text": "We use Mallet's (McCallum, 2002) implementation of the PLTM to train and infer topics on the same data set used in Platt et al. (2010) ." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-128", "text": "That paper used the Europarl (Koehn, 2005) (Mimno et al., 2009) , these performance comparisons are not done on the same training and test sets-a gap that we fill below." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-129", "text": "We train PLTM models with number of topics T set to 50, 100, 200, and 500." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-130", "text": "In order to compare exactly the same topic distributions when computing speed vs. accuracy of various approximate and exhaustive all-pairs comparisons we focus only on one inference approach -the Gibbs sampling and ignore the online VB approach as it yields similar performance." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-131", "text": "For all four topic models, we use the same settings for PLTM (hyperparameter values and number of Gibbs sampling iterations) as in (Mimno et al., 2009) 2 ." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-132", "text": "Topic distributions were then inferred on the test collection using the trained topics." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-133", "text": "We then performed all-pairs comparison using JS divergence, Hellinger distance, and approximate, LSH and kd-trees based, Hellinger distance." 
}, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-134", "text": "We measured the total time that it takes to perform exhaustive all-pairs comparison using JS divergence, the LSH and kdtrees version on a single machine consisting of a core 2 duo quad processors with a clock speed of 2.66GHz on each core and a total of 8GB of memory." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-135", "text": "Since the time performance of the E2LSH depends on the radius R of data set points considered for each query point (Indyk and Motwani, 1998) , we performed measurements with different values of R. For this task, the all-pairs JS code implementation first reads both source and target sets of documents and stores them in hash tables." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-136", "text": "We then go over each entry in the source table and compute divergence against all target table entries." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-137", "text": "We refer to this code implementation as hash map implementation." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-138", "text": "----------------------------------" }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-139", "text": "**EVALUATION TASK AND RESULTS**" }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-140", "text": "Performance of the four PLTM models and the performance across the four different similarity measurements was evaluated based on the percentage of document translation pairs (out of the whole test set) that were discovered at rank one." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-141", "text": "This same approach was used by (Platt et al., 2010) to show the absolute performance comparison." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-142", "text": "As in the case of the previous two tasks, in order to evaluate the approximate, LSH based, Hellinger distance we used values of R=0.4, R=0.6 and R=0.8." 
}, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-143", "text": "Since in (Platt et al., 2010) numbers were reported on the test speeches whose word length is greater or equal to 100, we used the same subset (total of 14150 speeches) of the original test collection." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-144", "text": "Shown in Table 1 are results across the four different measurements for all four PLTM models." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-145", "text": "When using regular JS divergence, our PLTM model with 200 topics performs the best with 99.42% of the top one ranked candidate translation documents being true translations." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-146", "text": "When using approximate, kd-trees based, Hellinger distance, we outperform regular JS and Hellinger divergence across all topics and for T=500 we achieve the best overall accuracy of 99.61%." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-147", "text": "We believe that this is due to the small amount of error in the search introduced by ANN, due to its approximate nature, which for this task yields positive results." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-148", "text": "On the same data set, (Platt et al., 2010) report accuracy of 98.9% using 50 topics, a slightly different prior distribution, and MAP instead of posterior inference." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-149", "text": "Shown in Table 2 are the relative differences in time between all pairs JS divergence, approximate kd-trees and LSH based Hellinger distance with different value of R. Rather than showing absolute speed numbers, which are often influenced by the processor configuration and available memory, we show relative speed improvements where we take the slowest running configuration as a referent value." 
}, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-150", "text": "In our case we assign the referent speed value of 1 to the configuration with T=500 and allpairs JS computation." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-151", "text": "Results shown are based on comparing running time of E2LSH and ANN against the all-pairs similarity comparison implementation that uses hash tables to store all documents in the bilingual collection which is significantly faster than the other code implementation." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-152", "text": "For the approximate, LSH based, Hellinger distance with T=100 we obtain a speed improvement of 24.2 times compared to regular all-pairs JS divergence while maintaining the same performance compared to Hellinger distance metric and insignificant loss over all-pairs JS divergence." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-153", "text": "From Table 2 it is evident that as we increase the radius R we reduce the relative speed of performance since the range of points that LSH considers for a given query point increases." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-154", "text": "Also, as the number of topics increases, the speed benefit is reduced for both the LSH and k-d tree techniques." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-155", "text": "----------------------------------" }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-156", "text": "**CONCLUSION**" }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-157", "text": "Hierarchical Bayesian models, such as Polylingual Topic Models, have been shown to offer great potential in analyzing multilingual collections, extracting aligned topics and finding document translation pairs when trained on sufficiently large aligned collections." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-158", "text": "Online stochastic optimization inference allows us to generate good parameter estimates." 
}, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-159", "text": "By combining these two approaches we are able to infer topic distributions across documents in large multilingual document collections in an efficient manner." }, { "sent_id": "c2bfe3534597a8f192ec846619f6b1-C001-160", "text": "Utilizing approximate NN search techniques in the probability simplex, we showed that fast document translation detection could be achieved with insignificant loss in accuracy." } ], "y": { "@MOT@": { "gold_contexts": [ [ "c2bfe3534597a8f192ec846619f6b1-C001-16" ] ], "cite_sentences": [ "c2bfe3534597a8f192ec846619f6b1-C001-16" ] }, "@BACK@": { "gold_contexts": [ [ "c2bfe3534597a8f192ec846619f6b1-C001-34" ], [ "c2bfe3534597a8f192ec846619f6b1-C001-115" ] ], "cite_sentences": [ "c2bfe3534597a8f192ec846619f6b1-C001-34", "c2bfe3534597a8f192ec846619f6b1-C001-115" ] }, "@DIF@": { "gold_contexts": [ [ "c2bfe3534597a8f192ec846619f6b1-C001-122" ], [ "c2bfe3534597a8f192ec846619f6b1-C001-148" ] ], "cite_sentences": [ "c2bfe3534597a8f192ec846619f6b1-C001-122", "c2bfe3534597a8f192ec846619f6b1-C001-148" ] }, "@SIM@": { "gold_contexts": [ [ "c2bfe3534597a8f192ec846619f6b1-C001-123" ], [ "c2bfe3534597a8f192ec846619f6b1-C001-127" ], [ "c2bfe3534597a8f192ec846619f6b1-C001-141" ], [ "c2bfe3534597a8f192ec846619f6b1-C001-143" ] ], "cite_sentences": [ "c2bfe3534597a8f192ec846619f6b1-C001-123", "c2bfe3534597a8f192ec846619f6b1-C001-127", "c2bfe3534597a8f192ec846619f6b1-C001-141", "c2bfe3534597a8f192ec846619f6b1-C001-143" ] }, "@USE@": { "gold_contexts": [ [ "c2bfe3534597a8f192ec846619f6b1-C001-127" ], [ "c2bfe3534597a8f192ec846619f6b1-C001-141" ], [ "c2bfe3534597a8f192ec846619f6b1-C001-143" ] ], "cite_sentences": [ "c2bfe3534597a8f192ec846619f6b1-C001-127", "c2bfe3534597a8f192ec846619f6b1-C001-141", "c2bfe3534597a8f192ec846619f6b1-C001-143" ] }, "@EXT@": { "gold_contexts": [ [ "c2bfe3534597a8f192ec846619f6b1-C001-148" ] ], "cite_sentences": [ "c2bfe3534597a8f192ec846619f6b1-C001-148" ] } } }, 
"ABC_8ca479895b028ea6dedb0e99cacae6_19": { "x": [ { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-112", "text": "----------------------------------" }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-113", "text": "**DATASETS**" }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-140", "text": "We observe improvement on both fine-tuned datasets." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-2", "text": "Typical relation extraction models are trained on a single corpus annotated with a pre-defined relation schema." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-3", "text": "An individual corpus is often small, and the models may often be biased or overfitted to the corpus." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-4", "text": "We hypothesize that we can learn a better representation by combining multiple relation datasets." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-5", "text": "We attempt to use a shared encoder to learn the unified feature representation and to augment it with regularization by adversarial training." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-6", "text": "The additional corpora feeding the encoder can help to learn a better feature representation layer even though the relation schemas are different." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-7", "text": "We use ACE05 and ERE datasets as our case study for experiments." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-8", "text": "The multi-task model obtains significant improvement on both datasets." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-9", "text": "----------------------------------" }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-11", "text": "Relations represent specific semantic relationships between two entities." 
}, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-12", "text": "For example, there is Physical." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-13", "text": "Located relationship between Smith and Brazil in the sentence: Smith went to a conference in Brazil." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-14", "text": "Relation extraction is a crucial task for many applications such as knowledge base population." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-15", "text": "Several relation schemas and annotated corpora have been developed such as the Automatic Content Extraction (ACE), and the Entities, Relations and Events (ERE) annotation (Song et al., 2015) ." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-16", "text": "These schemas share some similarity, but differ in details." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-17", "text": "A relation type may exist in one schema but not in another." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-18", "text": "An example might be annotated as different types in different datasets." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-19", "text": "For example, Part-whole." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-20", "text": "Geographical relations in ACE05 are annotated as Physcial." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-21", "text": "Located relations in ERE." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-22", "text": "Most of these corpora are relatively small." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-23", "text": "Models trained on a single corpus may be biased or overfitted towards the corpus." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-24", "text": "Despite the difference in relation schemas, we hypothesize that we can learn a more general representation with a unified encoder." 
}, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-25", "text": "Such a representation could have better out-of-domain or lowresource performance." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-26", "text": "We develop a multi-task model to learn a representation of relations in a shared relation encoder." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-27", "text": "We use separate decoders to allow different relation schemas." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-28", "text": "The shared encoder accesses more data, learning less overfitted representation." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-29", "text": "We then regularize the representation with adversarial training in order to further enforce the sharing between different datasets." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-30", "text": "In our experiments, we take ACE05 1 and ERE 2 datasets as a case study." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-31", "text": "Experimental results show that the model achieves higher performance on both datasets." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-32", "text": "----------------------------------" }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-33", "text": "**RELATED WORK**" }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-34", "text": "Relation extraction is typically reduced to a classification problem." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-35", "text": "A supervised machine learning model is designed and trained on a single dataset to predict the relation type of pairs of entities." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-36", "text": "Traditional methods rely on linguistic or semantic features (Zhou et al., 2005; Jing and Zhai, 2007) , or kernels based on syntax or sequences (Bunescu and Mooney, 2005a,b; Plank and Moschitti, 2013) to represent sentences of relations." 
}, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-37", "text": "More recently, deep neural nets start to show promising results." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-38", "text": "Most rely on convolutional neural nets (Zeng et al., 2014 (Zeng et al., , 2015 Grishman, 2015, 2016; Fu et al., 2017) or recurrent neural nets (Zhang et al., 2015; Zhou et al., 2016; Miwa and Bansal, 2016) to learn the representation of relations." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-39", "text": "Our supervised base model will be similar to (Zhou et al., 2016) ." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-40", "text": "Our initial experiments did not use syntactic features (Nguyen and Grishman, 2016; Fu et al., 2017 ) that require additional parsers." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-41", "text": "In order to further improve the representation learning for relation extraction, tried to transfer knowledge through bilingual representation." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-42", "text": "They used their multi-task model to train on the bilingual ACE05 datasets and obtained improvement when there is less training available (10%-50%)." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-43", "text": "Our experiments will show our multitask model can make significant improvement on the full training set." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-44", "text": "In terms of the regularization to the representation, Duong et al. (2015) used l2 regularization between the parameters of the same part of two models in multi-task learning." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-45", "text": "Their method is a kind of soft-parameter sharing, which does not involve sharing any part of the model directly." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-46", "text": "Fu et al. 
(2017) applied domain adversarial networks (Ganin and Lempitsky, 2015) to relation extraction and obtained improvement on out-of-domain evaluation." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-47", "text": "Inspired by the adversarial training, we attempt to use it as a regularization tool in our multi-task model and find some improvement." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-48", "text": "----------------------------------" }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-49", "text": "**SUPERVISED NEURAL RELATION EXTRACTION MODEL**" }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-50", "text": "The supervised neural model on a single dataset was introduced by Zeng et al. (2014) and followed by many others (Nguyen and Grishman, 2015; Zhou et al., 2016; Miwa and Bansal, 2016; Nguyen and Grishman, 2016; Fu et al., 2017) ." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-51", "text": "We use a similar model as our base model." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-52", "text": "It takes word tokens, position of arguments and their entity types as input." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-53", "text": "Some work (Nguyen and Grishman, 2016; Fu et al., 2017) used extra syntax features as input." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-54", "text": "However, the parsers that produce syntax features could have errors and vary depending on the domain of text." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-55", "text": "The syntax features learned could also be too specific for a single dataset." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-56", "text": "Thus, we focus on learning representation from scratch, but also compare the models with extra features later in the experiments." 
}, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-57", "text": "The encoder is a bidirectional RNN with attention and the decoder is one hidden fully connected layer followed by a softmax output layer." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-58", "text": "In the input layer, we convert word tokens into word embeddings with pretrained word2vec (Mikolov et al., 2013) ." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-59", "text": "For each token, we convert the distance to the two arguments of the example to two position embeddings." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-60", "text": "We also convert the entity types of the arguments to entity embeddings." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-61", "text": "The setup of word embedding and position embedding was introduced by Zeng et al. (2014) ." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-62", "text": "The entity embedding (Nguyen and Grishman, 2016; Fu et al., 2017 ) is included for arguments that are entities rather than common nouns." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-63", "text": "At the end, each token is converted to an embedding w i as the concatenation of these three types of embeddings, where i \u2208 [0, T ), T is the length of the sentence." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-64", "text": "A wide range of encoders have been proposed for relation extraction." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-65", "text": "Most of them fall into categories of CNN (Zeng et al., 2014) , RNN (Zhou et al., 2016) and TreeRNN (Miwa and Bansal, 2016) ." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-66", "text": "In this work, we follow Zhou et al. (2016) to use Bidirectional RNN with attention (BiRNN), which works well on both of the datasets we are going to evaluate on." 
}, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-67", "text": "BiRNN reads embeddings of the words from both directions in the sentence." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-68", "text": "It summarizes the contextual information at each state." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-69", "text": "The attention mechanism aggregates all the states of the sentence by paying more attention to informative words." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-70", "text": "Given input w i from the input layer, the encoder is defined as the following:" }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-71", "text": "We use GRU (Cho et al., 2014) as the RNN cell." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-72", "text": "W v and b v are the weights for the projection v i ." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-73", "text": "v w is the word context vector, which works as a query of selecting important words." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-74", "text": "The importance of the word is computed as the similarity between v i and v w ." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-75", "text": "The importance weight is then normalized through a softmax function." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-76", "text": "Then we obtain the high level summarization \u03c6(x) for the relation example." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-77", "text": "The decoder uses this high level representation as features for relation classification." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-78", "text": "It usually contains one hidden layer (Zeng et al., 2014; Nguyen and Grishman, 2016; Fu et al., 2017 ) and a softmax output layer." 
}, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-79", "text": "We use the same structure which can be formalized as the following:" }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-80", "text": "where W h and b h are the weights for the hidden" }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-81", "text": "----------------------------------" }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-82", "text": "**LEARNING UNIFIED REPRESENTATION**" }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-83", "text": "While the data for one relation task may be small, noisy and biased, we can learn a better representation combining multiple relation tasks." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-84", "text": "We attempt to use multi-task learning to learn a unified representation across different relation tasks." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-85", "text": "The method is simple and straightforward." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-86", "text": "We use the same encoder to learn the unified feature representation for both relation tasks, and then we train classifiers for each task on top of this representation." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-87", "text": "We then apply regularization on this representation by adversarial training." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-88", "text": "----------------------------------" }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-89", "text": "**MULTI-TASK LEARNING**" }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-90", "text": "Given example x 1 from relation schema 1 and x 2 from relation schema 2, we use the same encoder to obtain representation \u03c6(x 1 ) and \u03c6(x 2 ) respectively." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-91", "text": "Then we build separate decoders for them using the same structure (7) (8). To train them at the same time, we put examples from both tasks in the same batch." 
}, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-92", "text": "The ratio of the examples are controlled so that the the model reads both datasets once every epoch." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-93", "text": "We use linear interpolation to combine the loss from them." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-94", "text": "where \u03bb is used to control the attention to each task." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-95", "text": "The model may learn the two tasks at different speed." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-96", "text": "During optimization, one task can be seen as the main task, while the other can be seen as the auxiliary task." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-97", "text": "The benefit of joint learning to the main task may vary depending on how much attention the model pays to the auxiliary task." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-98", "text": "----------------------------------" }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-99", "text": "**REGULARIZATION BY ADVERSARIAL TRAINING**" }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-100", "text": "Being optimized simultaneously by different decoders, the model could still learn very different representation for similar examples coming from different tasks." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-101", "text": "We want to prevent this and to further push the model to learn similar representation for similar examples even if they come from different tasks." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-102", "text": "We attempt to regularize the representation using adversarial training between the two tasks." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-103", "text": "Given the representation \u03c6(x 1 ) and \u03c6(x 2 ) learned from the two tasks, we build a classifier to predict which task the examples come from (11)." 
}, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-104", "text": "We add a gradient reversal layer (Ganin and Lempitsky, 2015) at the input of this classifier (10) to implement the adversarial training." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-105", "text": "While the classifier learns to distinguish the sources of the input representation, the input representation is learned in the opposite direction to confuse the classifier thanks to GRL." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-106", "text": "Thus, the input representation (\u03c6(x 1 ) and \u03c6(x 2 )) will be pushed to be close to each other." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-107", "text": "The gradient reversal layer (GRL) is defined as the identity function for forward propagation (12) and reversed gradient for back propagation (13)." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-108", "text": "We also use the cross-entropy loss for this adversarial training, and combine the loss L adv with the two relation tasks." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-109", "text": "where we can use \u03b2 to control how close the representations are between the two relation tasks." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-110", "text": "----------------------------------" }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-111", "text": "**EXPERIMENTS**" }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-114", "text": "To apply the multi-task learning, we need at least two datasets." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-115", "text": "We pick ACE05 and ERE for our case study." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-116", "text": "The ACE05 dataset provides a cross-domain evaluation setting ." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-117", "text": "It contains 6 domains: broadcast conversation (bc), broadcast news (bn), telephone conversation (cts), newswire (nw), usenet (un) and weblogs (wl)." 
}, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-118", "text": "Previous work (Gormley et al., 2015; Nguyen and Grishman, 2016; Fu et al., 2017) set, and the other half of bc, cts and wl as the test sets." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-119", "text": "We followed their split of documents and their split of the relation types for asymmetric relations." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-120", "text": "The ERE dataset has a similar relation schema to ACE05, but is different in some annotation guidelines (Aguilar et al., 2014) ." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-121", "text": "It also has more data than ACE05, which we expect to be helpful in the multi-task learning." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-122", "text": "It contains documents from newswire and discussion forums." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-123", "text": "We did not find an existing split of this dataset, so we randomly split the documents into train (80%), dev (10%) and test (10%)." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-124", "text": "----------------------------------" }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-125", "text": "**MODEL CONFIGURATIONS**" }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-126", "text": "We use word embedding pre-trained on newswire with 300 dimensions from word2vec (Mikolov et al., 2013) ." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-127", "text": "We fix the word embeddings during the training." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-128", "text": "We follow Nguyen and Grishman (2016) to set the position and entity type embedding size to be 50." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-129", "text": "We use 150 dimensions for the GRU state, 100 dimensions for the word context vector and use 300 dimensions for the hidden layer in the decoders." 
}, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-130", "text": "We train the model using Adam (Kingma and Ba, 2014) optimizer with learning rate 0.001." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-131", "text": "We tune \u03bb linearly from 0 to 1, and \u03b2 logarithmically from 5 \u00b7 10 \u22121 to 10 \u22124 For all scores, we run experiments 10 times and take the average." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-132", "text": "----------------------------------" }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-133", "text": "**AUGMENTATION BETWEEN ACE05 AND ERE**" }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-134", "text": "Training separately on the two corpora (row \"Supervised\" in Table 1 ), we obtain results on ACE05 comparable to previous work (Gormley et al., 2015) with substantially fewer features." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-135", "text": "With syntactic features as (Nguyen and Grishman, 2016; Fu et al., 2017) did, it could be further improved." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-136", "text": "In this paper, however, we want to focus on representation learning from scratch first." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-137", "text": "Our experiments focus on whether we can improve the representation with more sources of data." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-138", "text": "A common way to do so is pre-training." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-139", "text": "As a baseline, we pre-train the encoder of the supervised model on ERE and then fine-tune on ACE05, and vice versa (row \"Pretraining\" in Table 1 )." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-141", "text": "This shows the similarity between the encoders of the two datasets." 
}, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-142", "text": "However, if we fix the encoder trained from one dataset, and only train the decoder on the other dataset, we will actually obtain a much worse model." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-143", "text": "This indicates that neither dataset contains enough data to learn a universal feature representation layer for classification." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-144", "text": "This leaves the possibility to further improve the representation by learning a better encoder." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-145", "text": "We then attempt to learn a multi-task model using a shared encoder." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-146", "text": "We use 14K words as the vocabulary from ACE05 and 20K from ERE." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-147", "text": "There are about 8K words shared by the two datasets (same for both pretrained and multi-task models)." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-148", "text": "Through multi-task learning, we expect the model to learn better word embeddings and construct a more general representation." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-149", "text": "Experiments determined that the multi-task learning works best at \u03bb = 0.8 for both ACE05 and ERE datasets (Table 1)." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-150", "text": "It obtains improvement on both the out-of-domain evaluation on ACE and the in-domain evaluation on ERE." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-151", "text": "It works especially well on the weblogs (wl) and telephone conversation (cts) domains of ACE, which possibly benefits from the discussion forum data from ERE." 
}, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-152", "text": "On the other hand, we use adversarial training between the two datasets to further encourage their representations to be close to each other." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-153", "text": "There is strong dependency between the schemas of these two datasets." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-154", "text": "Two examples from different datasets could have the same semantics in terms of relation type." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-155", "text": "We try to force the representation of these examples to be similar." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-156", "text": "With an appropriate amount of this regularization (\u03b2 = 0.001), the model can be further improved, although the multi-task model can already balance the representation with enough labels on both sides." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-157", "text": "We also artificially remove half of the training data of each dataset to see the performance in a relatively low-resource setting (row \"Training Data\" in Table 1) ." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-158", "text": "We observe larger improvement with both multi-task learning and regularization." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-159", "text": "Because of the decrease of the training data, the best \u03bb is 0.9 for ACE05 and 0.7 for ERE." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-160", "text": "We also use slightly stronger regularization (\u03b2 = 0.01)." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-161", "text": "----------------------------------" }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-162", "text": "**MORE FEATURES ON ACE05**" }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-163", "text": "Since ACE05 has been studied for a long time, numerous features have been found to be effective on this dataset." 
}, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-164", "text": "Nguyen and Grishman (2016) incorporated some of those features into the neural net and beat the state of the art on the dataset." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-165", "text": "Although representation learning from scratch could be more general across multiple datasets, we compare the effect of multi-task learning with extra features on this specific dataset." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-166", "text": "We add chunk embeddings and on-dependency-path embeddings (Nguyen and Grishman, 2016) ." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-167", "text": "Similar to the entity type embedding, the chunk embedding is created according to each token's chunk type; we set the embedding size to 50." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-168", "text": "The on-dependency-path embedding is a vector indicating whether the token is on the dependency path between the two entities." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-169", "text": "In the multi-task model, the shared encoder is a bidirectional RNN (BiRNN) without attention (Equation (1-3) )." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-170", "text": "These two embeddings will be concatenated to the output of the BiRNN to obtain the new h i and then passed to Equation (4)." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-171", "text": "As the results in Table 2 show, our supervised baseline is slightly worse than the previous state-of-the-art neural model with extra features, but multi-task learning consistently helps." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-172", "text": "The improvement is more obvious with 50% training data." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-173", "text": "It is also worth noting that with 50% training data, the extra features improve the supervised base model, but not the multi-task learning model." 
}, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-174", "text": "It shows the effectiveness of the multi-task model when there is less training data." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-175", "text": "----------------------------------" }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-176", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-177", "text": "We attempt to learn unified representation for relations by multi-task learning between ACE05 and ERE datasets." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-178", "text": "We use a shared encoder to learn the unified feature representation and then apply regularization by adversarial training." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-179", "text": "The improvement on both datasets shows the promising future of learning representation for relations in this unified way." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-180", "text": "This will require less training data for new relation schemas." }, { "sent_id": "8ca479895b028ea6dedb0e99cacae6-C001-181", "text": "It will be interesting future work to further explore the multi-task learning between different datasets, especially to capture the dependency between different schemas in the decoder." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "8ca479895b028ea6dedb0e99cacae6-C001-38" ], [ "8ca479895b028ea6dedb0e99cacae6-C001-50" ], [ "8ca479895b028ea6dedb0e99cacae6-C001-53" ], [ "8ca479895b028ea6dedb0e99cacae6-C001-78" ], [ "8ca479895b028ea6dedb0e99cacae6-C001-118" ] ], "cite_sentences": [ "8ca479895b028ea6dedb0e99cacae6-C001-38", "8ca479895b028ea6dedb0e99cacae6-C001-50", "8ca479895b028ea6dedb0e99cacae6-C001-53", "8ca479895b028ea6dedb0e99cacae6-C001-78", "8ca479895b028ea6dedb0e99cacae6-C001-118" ] }, "@DIF@": { "gold_contexts": [ [ "8ca479895b028ea6dedb0e99cacae6-C001-40" ] ], "cite_sentences": [ "8ca479895b028ea6dedb0e99cacae6-C001-40" ] }, "@SIM@": { "gold_contexts": [ [ "8ca479895b028ea6dedb0e99cacae6-C001-62" ] ], "cite_sentences": [ "8ca479895b028ea6dedb0e99cacae6-C001-62" ] }, "@FUT@": { "gold_contexts": [ [ "8ca479895b028ea6dedb0e99cacae6-C001-135" ] ], "cite_sentences": [ "8ca479895b028ea6dedb0e99cacae6-C001-135" ] } } }, "ABC_faeac0a0e3c0cad79d39dea04ec59a_19": { "x": [ { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-108", "text": "Second, we aim to highlight the important distinctions within subtasks that have hitherto been ignored." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-2", "text": "As the body of research on abusive language detection and analysis grows, there is a need for critical consideration of the relationships between different subtasks that have been grouped under this label." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-3", "text": "Based on work on hate speech, cyberbullying, and online abuse we propose a typology that captures central similarities and differences between subtasks and we discuss its implications for data annotation and feature construction." 
}, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-4", "text": "We emphasize the practical actions that can be taken by researchers to best approach their abusive language detection subtask of interest." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-5", "text": "----------------------------------" }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-7", "text": "There has been a surge in interest in the detection of abusive language, hate speech, cyberbullying, and trolling in the past several years (Schmidt and Wiegand, 2017) ." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-8", "text": "Social media sites have also come under increasing pressure to tackle these issues." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-9", "text": "Similarities between these subtasks have led scholars to group them together under the umbrella terms of \"abusive language\", \"harmful speech\", and \"hate speech\" (Nobata et al., 2016; Faris et al., 2016; Schmidt and Wiegand, 2017) but little work has been done to examine the relationship between them." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-10", "text": "As each of these subtasks seeks to address a specific yet partially overlapping phenomenon, we believe that there is much to gain by studying how they are related." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-11", "text": "The overlap between subtasks is illustrated by the variety of labels used in prior work." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-12", "text": "For example, in annotating for cyberbullying events, [footnote 1: This paper has been accepted at the 1st Workshop on Abusive Language Online. Please be sure to cite that version.]" }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-13", "text": "Van Hee et al. (2015b) identifies discriminative remarks (racist, sexist) as a subset of \"insults\", whereas Nobata et al. 
(2016) classifies similar remarks as \"hate speech\" or \"derogatory language\"." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-14", "text": "Waseem and Hovy (2016) only consider \"hate speech\" without regard to any potential overlap with bullying or otherwise offensive language, while Davidson et al. (2017) distinguish hate speech from generally offensive language." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-15", "text": "Wulczyn et al. (2017) annotate for personal attacks, which likely encompasses identifying cyberbullying, hate speech, and offensive language." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-16", "text": "The lack of consensus has resulted in contradictory annotation guidelines: some messages considered hate speech by Waseem and Hovy (2016) are only considered derogatory and offensive by Nobata et al. (2016) and Davidson et al. (2017) ." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-17", "text": "To help bring together these literatures and to avoid these contradictions, we propose a typology that synthesizes these different subtasks." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-18", "text": "We argue that the differences between subtasks within abusive language can be reduced to two primary factors:" }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-19", "text": "1. Is the language directed towards a specific individual or entity or is it directed towards a generalized group?" }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-20", "text": "----------------------------------" }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-21", "text": "**IS THE ABUSIVE CONTENT EXPLICIT OR IMPLICIT?**" }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-22", "text": "Each of the different subtasks related to abusive language occupies one or more segments of this typology." 
}, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-23", "text": "Our aim is to clarify the similarities and differences between subtasks in abusive language detection to help researchers select appropriate strategies for data annotation and modeling." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-24", "text": "----------------------------------" }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-25", "text": "**A TYPOLOGY OF ABUSIVE LANGUAGE**" }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-26", "text": "Much of the work on abusive language subtasks can be synthesized in a two-fold typology that considers whether (i) the abuse is directed at a specific target, and (ii) the degree to which it is explicit. Table 1, Directed/Explicit examples: \"Go kill yourself\", \"You're a sad little f*ck\" (Van Hee et al., 2015a) , \"@User shut yo beaner ass up sp*c and hop your f*ggot ass back across the border little n*gga\" (Davidson et al., 2017) , \"Youre one of the ugliest b*tches Ive ever fucking seen\" (Kontostathis et al., 2013) ." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-27", "text": "Table 1, Directed/Implicit examples: \"Hey Brendan, you look gorgeous today. What beauty salon did you visit?\" (Dinakar et al., 2012), \"(((@User) )) and what is your job? Writing cuck articles and slurping Google balls?" }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-109", "text": "For example, in much hate speech research, diverse types of abuse have been lumped together under a single label, forcing models to account for a large amount of within-class variation." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-28", "text": "#Dumbgoogles\" (Hine et al., 2017) , \"you're intelligence is so breathtaking!!!!!!\" (Dinakar et al., 2011) . Table 1, Generalized/Explicit examples: \"I am surprised they reported on this crap who cares about another dead n*gger?\", \"300 missiles are cool! Love to see um launched into Tel Aviv! Kill all the g*ys there!\" (Nobata et al., 2016) , \"So an 11 year old n*gger girl killed herself over my tweets?\u02c6\u02c6thats another n*gger off the streets!!\" (Kwok and Wang, 2013) ." 
}, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-29", "text": "Table 1, Generalized/Implicit examples: \"Totally fed up with the way this country has turned into a haven for terrorists. Send them all back home.\" (Burnap and Williams, 2015) , \"most of them come north and are good at just mowing lawns\" (Dinakar et al., 2011 ), \"Gas the skypes\" (Magu et al., 2017) ." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-30", "text": "Starting with the targets, abuse can either be directed towards a specific individual or entity, or it can be used towards a generalized Other, for example people with a certain ethnicity or sexual orientation." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-31", "text": "This is an important sociological distinction as the latter references a whole category of people rather than a specific individual, group, or organization (see Brubaker 2004 , Wimmer 2013 ) and, as we discuss below, entails a linguistic distinction that can be productively used by researchers." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-32", "text": "To better illustrate this, the first row of Table 1 shows examples from the literature of directed abuse, where someone is either mentioned by name, tagged by a username, or referenced by a pronoun." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-33", "text": "Cyberbullying and trolling are instances of directed abuse, aimed at individuals and online communities respectively." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-34", "text": "The second row shows cases with abusive expressions towards generalized groups such as racial categories and sexual orientations." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-35", "text": "Previous work has identified instances of hate speech that are both directed and generalized (Burnap and Williams, 2015; Waseem and Hovy, 2016; Davidson et al., 2017) , although Nobata et al. 
(2016) come closest to making a distinction between directed and generalized hate." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-36", "text": "The other dimension is the extent to which abusive language is explicit or implicit." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-37", "text": "This is roughly analogous to the distinction in linguistics and semiotics between denotation, the literal meaning of a term or symbol, and connotation, its sociocultural associations, famously articulated by Barthes (1957) ." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-38", "text": "Explicit abusive language is that which is unambiguous in its potential to be abusive, for example language that contains racial or homophobic slurs." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-39", "text": "Previous research has indicated a great deal of variation within such language (Warner and Hirschberg, 2012; Davidson et al., 2017) , with abusive terms being used in a colloquial manner or by people who are victims of abuse." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-40", "text": "Implicit abusive language is that which does not immediately imply or denote abuse." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-41", "text": "Here, the true nature is often obscured by the use of ambiguous terms, sarcasm, lack of profanity or hateful terms, and other means, generally making it more difficult to detect by both annotators and machine learning approaches (Dinakar et al., 2011; Dadvar et al., 2013; Justo et al., 2014) ." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-42", "text": "Social scientists and activists have recently been paying more attention to implicit, and even unconscious, instances of abuse that have been termed \"microaggressions\" (Sue et al., 2007) ." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-43", "text": "As the examples show, such language may nonetheless have extremely abusive connotations." 
}, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-44", "text": "The first column of Table 1 shows instances of explicit abuse, where it should be apparent to the reader that the content is abusive." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-45", "text": "The messages in the second column are implicit and it is harder to determine whether they are abusive without knowing the context." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-46", "text": "For example, the word \"them\" in the first two examples in the generalized and implicit cell refers to an ethnic group, and the words \"skypes\" and \"Google\" are used as euphemisms for slurs about Jews and African-Americans respectively." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-47", "text": "Abuse using sarcasm can be even more elusive for detection systems; for instance, the seemingly harmless comment praising someone's intelligence was a sarcastic response to a beauty pageant contestant's unsatisfactory answer to a question (Dinakar et al., 2011) . In the following section we outline the implications of this typology, highlighting where the existing literatures indicate how we can understand, measure, and model each subtype of abuse." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-48", "text": "----------------------------------" }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-49", "text": "**IMPLICATIONS FOR ANNOTATION**" }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-50", "text": "In the task of annotating documents that contain bullying, it appears that there is a common understanding of what cyberbullying entails: an intentionally harmful electronic attack by an individual or group against a victim, usually repetitive in nature (Dadvar et al., 2013) ." 
}, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-51", "text": "This consensus allows for a relatively consistent set of annotation guidelines across studies, most of which simply ask annotators to determine if a post contains bullying or harassment (Dadvar et al., 2014; Kontostathis et al., 2013; Bretschneider et al., 2014) ." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-52", "text": "High inter-annotator agreement on cyberbullying tasks (93%) (Dadvar et al., 2013) further indicates a general consensus around the features of cyberbullying (Van Hee et al., 2015b) ." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-53", "text": "After bullying has been identified, annotators are typically asked more detailed questions about the extremity of the bullying, the identification of phrases that indicate bullying, and the roles of users as bully/victim (Dadvar et al., 2014; Van Hee et al., 2015b; Kontostathis et al., 2013) ." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-54", "text": "We expect that this consensus may be due to the directed nature of the phenomenon." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-55", "text": "Cyberbullying involves a victim whom annotators can identify, making it relatively easy to discern whether statements directed towards the victim should be considered abusive." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-56", "text": "In contrast, in work on annotating harassment, offensive language, and hate speech, there appears to be little consensus on definitions, and lower inter-annotator agreement is obtained (\u03ba \u2248 0.60\u22120.80) (Ross et al., 2016; Waseem, 2016a; Tulkens et al., 2016; Bretschneider and Peters, 2017) ." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-57", "text": "Given that these tasks are often broadly defined and the target is often generalized, all else being equal, it is more difficult for annotators to determine whether statements should be considered abusive." 
}, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-58", "text": "Future work in these subtasks should aim to have annotators distinguish between targeted and generalized abuse so that each subtype can be modeled more effectively." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-59", "text": "Annotation (via crowd-sourcing and other methods) tends to be more straightforward when explicit instances of abusive language can be identified and agreed upon (Waseem, 2016b) , but is considerably more difficult when implicit abuse is considered (Dadvar et al., 2013; Justo et al., 2014; Dinakar et al., 2011) ." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-60", "text": "The connotations of language can be difficult to classify without domain-specific knowledge." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-61", "text": "Furthermore, while some argue that detailed guidelines can help annotators to make more subtle distinctions (Davidson et al., 2017) , others find that they do not improve the reliability of non-expert classifications (Ross et al., 2016) ." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-62", "text": "In such cases, expert annotators with domain-specific knowledge are preferred, as they tend to produce more accurate classifications (Waseem, 2016a) ." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-63", "text": "Ultimately, the nature of abusive language can be extremely subjective, and researchers must endeavor to take this into account when using human annotators." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-110", "text": "We suggest that fine-grained distinctions along the axes allow for more focused systems that may be more effective at identifying particular types of abuse." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-111", "text": "Third, we call for closer consideration of how annotation guidelines are related to the phenomenon of interest." 
}, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-112", "text": "The type of annotation and even the choice of annotators should be motivated by the nature of the abuse." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-113", "text": "Further, we welcome discussion of annotation guidelines and the annotation process in published work." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-114", "text": "Many existing studies only tangentially mention these, sometimes never explaining how the data were annotated." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-115", "text": "Fourth, we encourage researchers to consider which features are most appropriate for each subtask." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-116", "text": "Prior work has found a diverse array of features to be useful in understanding and identifying abuse, but we argue that different feature sets will be relevant to different subtasks." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-117", "text": "Future work should aim to build a more robust understanding of when to use which types of features." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-118", "text": "Fifth, it is important to emphasize that not all abuse is equal, both in terms of its effects and its detection." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-119", "text": "We expect that social media and website operators will be more interested in identifying and dealing with explicit abuse, while activists, campaigners, and journalists may have more incentive to also identify implicit abuse." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-120", "text": "Targeted abuse such as cyberbullying may be more likely to be reported by victims and thus acted upon than generalized abuse." 
}, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-121", "text": "We also expect that implicit abuse will be more difficult to detect and model, although methodological advances may make such tasks more feasible." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-122", "text": "----------------------------------" }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-123", "text": "**CONCLUSION**" }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-124", "text": "We have presented a typology that synthesizes the different subtasks in abusive language detection." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-125", "text": "Our aim is to bring together findings in these different areas and to clarify the key aspects of abusive language detection." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-126", "text": "There are important analytical distinctions that have been largely overlooked in prior work and through acknowledging these and their implications we hope to improve abuse detection systems and our understanding of abusive language." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-64", "text": "Davidson et al. (2017) , for instance, show that annotators tend to code racism as hate speech at a higher rate than sexism." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-65", "text": "As such, it is important that researchers consider the social biases that may lead people to disregard certain types of abuse." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-66", "text": "The type of abuse that researchers are seeking to identify should guide the annotation strategy." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-67", "text": "Where subtasks occupy multiple cells in our typology, annotators should be allowed to make nuanced distinctions that differentiate between different types of abuse." 
}, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-68", "text": "In highlighting the major differences between different abusive language detection subtasks, our typology indicates that different annotation strategies are appropriate depending on the type of abuse." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-69", "text": "----------------------------------" }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-70", "text": "**IMPLICATIONS FOR MODELING**" }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-71", "text": "Existing research on abusive language online has used a diverse set of features." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-72", "text": "Moving forward, it is important that researchers clarify which features are most useful for which subtasks and which subtasks present the greatest challenges." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-73", "text": "We do not attempt to review all the features used (see Schmidt and Wiegand 2017 for a detailed review) but make suggestions for which features could be most helpful for the different subtasks." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-74", "text": "For each aspect of the typology, we suggest features that have been shown to be successful predictors in prior work." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-75", "text": "Many features occur in more than one form of abuse." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-76", "text": "As such, we do not propose that particular features are necessarily unique to each phenomenon, rather that they provide different insights and should be employed depending on what the researcher is attempting to measure." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-77", "text": "Directed abuse." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-78", "text": "Features that help to identify the target of abuse are crucial to directed abuse detection." 
}, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-79", "text": "Mentions, proper nouns, named entities, and co-reference resolution can all be used in different contexts to identify targets." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-80", "text": "Bretschneider and Peters (2017) use a multi-tiered system, first identifying offensive statements, then their severity, and finally the target." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-81", "text": "Syntactical features have also proven to be successful in identifying abusive language." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-82", "text": "A number of studies on hate speech use part-of-speech sequences to model the expression of hatred (Warner and Hirschberg, 2012; Gitari et al., 2015; Davidson et al., 2017) ." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-83", "text": "Typed dependencies offer a more sophisticated way to capture the relationship between terms (Burnap and Williams, 2015) ." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-84", "text": "Overall, there are many tools that researchers can use to model the relationship between abusive language and targets, although many of these require high-quality annotations to use as training data." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-85", "text": "Generalized abuse." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-86", "text": "Generalized abuse online tends to target people belonging to a small set of categories, primarily racial, religious, and sexual minorities (Silva et al., 2016) ." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-87", "text": "Researchers should consider identifying forms of abuse unique to each target group addressed, as vocabularies may depend on the groups targeted." 
}, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-88", "text": "For example, the language used to abuse trans-people and that used against Latin American people are likely to differ, both in the nouns used to denote the target group and the other terms associated with them." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-89", "text": "In some cases a lexical method may therefore be an appropriate strategy." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-90", "text": "Further research is necessary to determine if there are underlying syntactic structures associated with generalized abusive language." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-91", "text": "Explicit abuse. Explicit abuse, whether directed or generalized, is often indicated by specific keywords." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-92", "text": "Hence, dictionary-based approaches may be well suited to identify this type of abuse (Warner and Hirschberg, 2012; Nobata et al., 2016) , although the presence of particular words should not be the only criterion, as even terms that denote abuse may be used in a variety of different ways (Kwok and Wang, 2013; Davidson et al., 2017) ." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-93", "text": "Negative polarity and sentiment of the text are also likely indicators of explicit abuse that can be leveraged by researchers (Gitari et al., 2015) ." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-94", "text": "Implicit abuse." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-95", "text": "Building a specific lexicon may prove impractical, as in the case of the appropriation of the term \"skype\" in some forums (Magu et al., 2017) ." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-96", "text": "Still, even partial lexicons may be used as seeds to inductively discover other keywords by use of a semi-supervised method proposed by King et al. (2017) ." 
}, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-97", "text": "Additionally, character n-grams have been shown to be apt for abusive language tasks due to their ability to capture variation of words associated with abuse (Nobata et al., 2016; Waseem, 2016a) ." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-98", "text": "Word embeddings are also promising ways to capture terms associated with abuse (Djuric et al., 2015; Badjatiya et al., 2017) , although they may still be insufficient for cases like 4Chan's connotation of \"skype\" where a word has a dominant meaning and a more subversive one." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-99", "text": "Furthermore, as some of the above examples show, implicit abuse often takes on complex linguistic forms like sarcasm, metonymy, and humor." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-100", "text": "Without high quality labeled data to learn these representations, it may be difficult for researchers to come up with models of syntactic structure that can help to identify implicit abuse." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-101", "text": "To overcome these limitations researchers may find it prudent to incorporate features beyond just textual analysis, including the characteristics of the individuals involved (Dadvar et al., 2013) and other extra-textual features." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-102", "text": "----------------------------------" }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-103", "text": "**DISCUSSION**" }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-104", "text": "This typology has a number of implications for future work in the area." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-105", "text": "First, we want to encourage researchers working on these subtasks to learn from advances in other areas." 
}, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-106", "text": "Researchers working on purportedly distinct subtasks are often working on the same problems in parallel." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-107", "text": "For example, the field of hate speech detection can be strengthened by interactions with work on cyberbullying, and vice versa, since a large part of both subtasks consists of identifying targeted abuse." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-127", "text": "Rather than attempting to resolve the \"definitional quagmire\" (Faris et al., 2016) involved in neatly bounding and defining each subtask we encourage researchers to think carefully about the phenomena they want to measure and the appropriate research design." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-128", "text": "We intend for our typology to be used both at the stage of data collection and annotation and the stage of feature creation and modeling." }, { "sent_id": "faeac0a0e3c0cad79d39dea04ec59a-C001-129", "text": "We hope that future work will be more transparent in discussing the annotation and modeling strategies used, and will closely examine the similarities and differences between these subtasks through empirical analyses." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "faeac0a0e3c0cad79d39dea04ec59a-C001-14" ], [ "faeac0a0e3c0cad79d39dea04ec59a-C001-16" ], [ "faeac0a0e3c0cad79d39dea04ec59a-C001-26" ], [ "faeac0a0e3c0cad79d39dea04ec59a-C001-35" ], [ "faeac0a0e3c0cad79d39dea04ec59a-C001-39" ], [ "faeac0a0e3c0cad79d39dea04ec59a-C001-61" ], [ "faeac0a0e3c0cad79d39dea04ec59a-C001-64" ], [ "faeac0a0e3c0cad79d39dea04ec59a-C001-82" ] ], "cite_sentences": [ "faeac0a0e3c0cad79d39dea04ec59a-C001-14", "faeac0a0e3c0cad79d39dea04ec59a-C001-16", "faeac0a0e3c0cad79d39dea04ec59a-C001-26", "faeac0a0e3c0cad79d39dea04ec59a-C001-35", "faeac0a0e3c0cad79d39dea04ec59a-C001-39", "faeac0a0e3c0cad79d39dea04ec59a-C001-61", "faeac0a0e3c0cad79d39dea04ec59a-C001-64", "faeac0a0e3c0cad79d39dea04ec59a-C001-82" ] }, "@FUT@": { "gold_contexts": [ [ "faeac0a0e3c0cad79d39dea04ec59a-C001-92" ] ], "cite_sentences": [ "faeac0a0e3c0cad79d39dea04ec59a-C001-92" ] } } }, "ABC_a6954db741df61f014cc622c5b8263_19": { "x": [ { "sent_id": "a6954db741df61f014cc622c5b8263-C001-144", "text": "TAG-PA brought F 1 up substantially, to 80.62%." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-2", "text": "We demonstrate that an unlexicalized PCFG can parse much more accurately than previously shown, by making use of simple, linguistically motivated state splits, which break down false independence assumptions latent in a vanilla treebank grammar." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-3", "text": "Indeed, its performance of 86.36% (LP/LR F 1 ) is better than that of early lexicalized PCFG models, and surprisingly close to the current state-of-theart." 
}, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-4", "text": "This result has potential uses beyond establishing a strong lower bound on the maximum possible accuracy of unlexicalized models: an unlexicalized PCFG is much more compact, easier to replicate, and easier to interpret than more complex lexical models, and the parsing algorithms are simpler, more widely understood, of lower asymptotic complexity, and easier to optimize." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-5", "text": "In the early 1990s, as probabilistic methods swept NLP, parsing work revived the investigation of probabilistic context-free grammars (PCFGs) (Booth and Thomson, 1973; Baker, 1979) ." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-6", "text": "However, early results on the utility of PCFGs for parse disambiguation and language modeling were somewhat disappointing." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-7", "text": "A conviction arose that lexicalized PCFGs (where head words annotate phrasal nodes) were the key tool for high performance PCFG parsing." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-8", "text": "This approach was congruent with the great success of word n-gram models in speech recognition, and drew strength from a broader interest in lexicalized grammars, as well as demonstrations that lexical dependencies were a key tool for resolving ambiguities such as PP attachments (Ford et al., 1982; Hindle and Rooth, 1993) ." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-9", "text": "In the following decade, great success in terms of parse disambiguation and even language modeling was achieved by various lexicalized PCFG models (Magerman, 1995; Charniak, 1997; Collins, 1999; Charniak, 2000; Charniak, 2001) ." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-10", "text": "However, several results have brought into question how large a role lexicalization plays in such parsers." 
}, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-11", "text": "Johnson (1998) showed that the performance of an unlexicalized PCFG over the Penn treebank could be improved enormously simply by annotating each node by its parent category." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-12", "text": "The Penn treebank covering PCFG is a poor tool for parsing because the context-freedom assumptions it embodies are far too strong, and weakening them in this way makes the model much better." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-13", "text": "More recently, Gildea (2001) discusses how taking the bilexical probabilities out of a good current lexicalized PCFG parser hurts performance hardly at all: by at most 0.5% for test text from the same domain as the training data, and not at all for test text from a different domain." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-14", "text": "But it is precisely these bilexical dependencies that backed the intuition that lexicalized PCFGs should be very successful, for example in Hindle and Rooth's demonstration from PP attachment." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-15", "text": "We take this as a reflection of the fundamental sparseness of the lexical dependency information available in the Penn Treebank." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-16", "text": "As a speech person would say, one million words of training data just isn't enough." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-17", "text": "Even for topics central to the treebank's Wall Street Journal text, such as stocks, many very plausible dependencies occur only once, for example stocks stabilized, while many others occur not at all, for example stocks skyrocketed." 
}, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-18", "text": "The best-performing lexicalized PCFGs have increasingly made use of subcategorization 3 of the categories appearing in the Penn treebank. 1 There are minor differences, but all the current best-known lexicalized PCFGs employ both monolexical statistics, which describe the phrasal categories of arguments and adjuncts that appear around a head lexical item, and bilexical statistics, or dependencies, which describe the likelihood of a head word taking as a dependent a phrase headed by a certain other word." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-90", "text": "Figure 2 presents a grid of horizontal and vertical markovizations of the grammar." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-19", "text": "2 This observation motivates various class- or similarity-based approaches to combating sparseness, and this remains a promising avenue of work, but success in this area has proven somewhat elusive, and, at any rate, current lexicalized PCFGs do simply use exact word matches if available, and interpolate with syntactic category-based estimates when they are not." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-20", "text": "3 In this paper we use the term subcategorization in the original general sense of Chomsky (1965) , for where a syntactic category is divided into several subcategories, for example dividing verb phrases into finite and non-finite verb phrases, rather than in the modern restricted usage where the term refers only to the syntactic argument frames of predicators." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-21", "text": "----------------------------------" }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-22", "text": "****" }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-23", "text": "In the early 1990s, as probabilistic methods swept NLP, parsing work revived the investigation of probabilistic context-free grammars (PCFGs) (Booth and Thomson, 1973; Baker, 1979) ." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-24", "text": "However, early results on the utility of PCFGs for parse disambiguation and language modeling were somewhat disappointing." 
}, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-25", "text": "A conviction arose that lexicalized PCFGs (where head words annotate phrasal nodes) were the key tool for high performance PCFG parsing." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-26", "text": "This approach was congruent with the great success of word n-gram models in speech recognition, and drew strength from a broader interest in lexicalized grammars, as well as demonstrations that lexical dependencies were a key tool for resolving ambiguities such as PP attachments (Ford et al., 1982; Hindle and Rooth, 1993) ." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-27", "text": "In the following decade, great success in terms of parse disambiguation and even language modeling was achieved by various lexicalized PCFG models (Magerman, 1995; Charniak, 1997; Collins, 1999; Charniak, 2000; Charniak, 2001) ." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-28", "text": "However, several results have brought into question how large a role lexicalization plays in such parsers." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-29", "text": "Johnson (1998) showed that the performance of an unlexicalized PCFG over the Penn treebank could be improved enormously simply by annotating each node by its parent category." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-30", "text": "The Penn treebank covering PCFG is a poor tool for parsing because the context-freedom assumptions it embodies are far too strong, and weakening them in this way makes the model much better." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-31", "text": "More recently, Gildea (2001) discusses how taking the bilexical probabilities out of a good current lexicalized PCFG parser hurts performance hardly at all: by at most 0.5% for test text from the same domain as the training data, and not at all for test text from a different domain." 
}, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-32", "text": "----------------------------------" }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-33", "text": "**1**" }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-34", "text": "But it is precisely these bilexical dependencies that backed the intuition that lexicalized PCFGs should be very successful, for example in Hindle and Rooth's demonstration from PP attachment." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-35", "text": "We take this as a reflection of the fundamental sparseness of the lexical dependency information available in the Penn Treebank." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-36", "text": "As a speech person would say, one million words of training data just isn't enough." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-37", "text": "Even for topics central to the treebank's Wall Street Journal text, such as stocks, many very plausible dependencies occur only once, for example stocks stabilized, while many others occur not at all, for example stocks skyrocketed." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-38", "text": "----------------------------------" }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-39", "text": "**2**" }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-40", "text": "The best-performing lexicalized PCFGs have increasingly made use of subcategorization 3 of the categories appearing in the Penn treebank." 
}, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-41", "text": "Charniak (2000) shows the value his parser gains from parent-annotation of nodes, suggesting that this information is at least partly complementary to information derivable from lexicalization, and Collins (1999) uses a range of linguistically motivated and carefully hand-engineered subcategorizations to break down wrong context-freedom assumptions of the naive Penn treebank covering PCFG, such as differentiating \"base NPs\" from noun phrases with phrasal modifiers, and distinguishing sentences with empty subjects from those where there is an overt subject NP." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-42", "text": "While he gives incomplete experimental results as to their efficacy, we can assume that these features were incorporated because of beneficial effects on parsing that were complementary to lexicalization." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-43", "text": "In this paper, we show that the parsing performance that can be achieved by an unlexicalized PCFG is far higher than has previously been demonstrated, and is, indeed, much higher than community wisdom has thought possible." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-44", "text": "We describe several simple, linguistically motivated annotations which do much to close the gap between a vanilla PCFG and state-of-the-art lexicalized models." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-45", "text": "Specifically, we construct an unlexicalized PCFG which outperforms the lexicalized PCFGs of Magerman (1995) and Collins (1996) (though not more recent models, such as Charniak (1997) or Collins (1999) )." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-46", "text": "One benefit of this result is a much-strengthened lower bound on the capacity of an unlexicalized PCFG." 
}, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-47", "text": "To the extent that no such strong baseline has been provided, the community has tended to greatly overestimate the beneficial effect of lexicalization in probabilistic parsing, rather than looking critically at where lexicalized probabilities are both needed to make the right decision and available in the training data." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-48", "text": "Secondly, this result affirms the value of linguistic analysis for feature discovery." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-49", "text": "The result has other uses and advantages: an unlexicalized PCFG is easier to interpret, reason about, and improve than the more complex lexicalized models." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-50", "text": "The grammar representation is much more compact, no longer requiring large structures that store lexicalized probabilities." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-51", "text": "The parsing algorithms have lower asymptotic complexity 4 and have much smaller grammar" }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-52", "text": "4 O(n^3) vs. O(n^5) for a naive implementation, or vs. O(n^4) if using the clever approach of Eisner and Satta (1999)." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-53", "text": "constants." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-54", "text": "An unlexicalized PCFG parser is much simpler to build and optimize, including both standard code optimization techniques and the investigation of methods for search space pruning (Caraballo and Charniak, 1998)." 
}, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-55", "text": "It is not our goal to argue against the use of lexicalized probabilities in high-performance probabilistic parsing." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-56", "text": "It has been comprehensively demonstrated that lexical dependencies are useful in resolving major classes of sentence ambiguities, and a parser should make use of such information where possible." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-57", "text": "We focus here on using unlexicalized, structural context because we feel that this information has been underexploited and underappreciated." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-58", "text": "We see this investigation as only one part of the foundation for state-of-the-art parsing which employs both lexical and structural conditioning." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-59", "text": "----------------------------------" }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-60", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-61", "text": "To facilitate comparison with previous work, we trained our models on sections 2-21 of the WSJ section of the Penn treebank." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-62", "text": "We used the first 20 files (393 sentences) of section 22 as a development set (devset)." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-63", "text": "This set is small enough that there is noticeable variance in individual results, but it allowed rapid search for good features via continually reparsing the devset in a partially manual hill-climb." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-64", "text": "All of section 23 was used as a test set for the final model." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-65", "text": "For each model, input trees were annotated or transformed in some way, as in Johnson (1998) ." 
}, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-66", "text": "Given a set of transformed trees, we viewed the local trees as grammar rewrite rules in the standard way, and used (unsmoothed) maximum-likelihood estimates for rule probabilities." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-67", "text": "5 To parse the grammar, we used a simple array-based Java implementation of a generalized CKY parser, which, for our final best model, was able to exhaustively parse all sentences in section 23 in 1GB of memory, taking approximately 3 sec for average length sentences." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-68", "text": "----------------------------------" }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-69", "text": "**VERTICAL AND HORIZONTAL MARKOVIZATION**" }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-70", "text": "The traditional starting point for unlexicalized parsing is the raw n-ary treebank grammar read from training trees (after removing functional tags and null elements)." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-71", "text": "This basic grammar is imperfect in two well-known ways." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-72", "text": "First, the category symbols are too coarse to adequately render the expansions independent of the contexts." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-73", "text": "For example, subject NP expansions are very different from object NP expansions: a subject NP is 8.7 times more likely than an object NP to expand as just a pronoun." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-74", "text": "Having separate symbols for subject and object NPs allows this variation to be captured and used to improve parse scoring." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-75", "text": "One way of capturing this kind of external context is to use parent annotation, as presented in Johnson (1998) ." 
}, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-76", "text": "For example, NPs with S parents (like subjects) will be marked NP\u02c6S, while NPs with VP parents (like objects) will be NP\u02c6VP." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-77", "text": "The second basic deficiency is that many rule types have been seen only once (and therefore have their probabilities overestimated), and many rules which occur in test sentences will never have been seen in training (and therefore have their probabilities underestimated -see Collins (1999) for analysis)." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-78", "text": "Note that in parsing with the unsplit grammar, not having seen a rule doesn't mean one gets a parse failure, but rather a possibly very weird parse (Charniak, 1996) ." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-79", "text": "One successful method of combating sparsity is to markovize the rules (Collins, 1999) ." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-80", "text": "In particular, we follow that work in markovizing out from the head child, despite the grammar being unlexicalized, because this seems the best way to capture the traditional linguistic insight that phrases are organized around a head (Radford, 1988) ." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-81", "text": "Both parent annotation (adding context) and RHS markovization (removing it) can be seen as two instances of the same idea." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-82", "text": "In parsing, every node has a vertical history, including the node itself, parent, grandparent, and so on." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-83", "text": "A reasonable assumption is that only the past v vertical ancestors matter to the current expansion." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-84", "text": "Similarly, only the previous h horizontal ancestors matter (we assume that the head child always matters)." 
}, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-85", "text": "It is a historical accident that the default notion of a treebank PCFG grammar takes v = 1 (only the current node matters vertically) and h = \u221e (rule right hand sides do not decompose at all)." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-86", "text": "On this view, it is unsurprising that increasing v and decreasing h have historically helped." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-87", "text": "As an example, consider the case of v = 1, h = 1. If we start with the rule VP \u2192 VBZ NP PP PP, it will be broken into several stages, each a binary or unary rule, which conceptually represent a head-outward generation of the right hand side, as shown in figure 1." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-88", "text": "The bottom layer will be a unary over the head declaring the goal: VP: [VBZ] \u2192 VBZ. 7 Finally, we have another unary to finish the VP." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-89", "text": "Note that while it is convenient to think of this as a head-outward process, these are just PCFG rewrites, and so the actual scores attached to each rule will correspond to a downward generation order." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-91", "text": "The raw treebank grammar corresponds to v = 1, h = \u221e (the upper right corner), while the parent annotation in (Johnson, 1998) corresponds to v = 2, h = \u221e, and the second-order model in Collins (1999) is broadly a smoothed version of v = 2, h = 2." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-92", "text": "In addition to exact nth-order models, we tried variable-history models. (Figure 3: Size and devset performance of the cumulatively annotated models, starting with the markovized baseline.)" 
}, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-93", "text": "The right two columns show the change in F 1 from the baseline for each annotation introduced, both cumulatively and for each single annotation applied to the baseline in isolation." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-94", "text": "These variable-history models are similar in intent to those described in Ron et al. (1994). Figure 2 shows parsing accuracies as well as the number of symbols in each markovization." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-95", "text": "These symbol counts include all the intermediate states which represent partially completed constituents." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-96", "text": "The general trend is that, in the absence of further annotation, more vertical annotation is better -even exhaustive grandparent annotation." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-97", "text": "This is not true for horizontal markovization, where the variable-order second-order model was superior." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-98", "text": "The best entry, v = 3, h \u2264 2, has an F 1 of 79.74, already a substantial improvement over the baseline." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-99", "text": "In the remaining sections, we discuss other annotations which increasingly split the symbol space." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-100", "text": "Since we expressly do not smooth the grammar, not all splits are guaranteed to be beneficial, and not all sets of useful splits are guaranteed to co-exist well." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-101", "text": "In particular, while v = 3, h \u2264 2 markovization is good on its own, it has a large number of states and does not tolerate further splitting well." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-102", "text": "Therefore, we base all further exploration on the v \u2264 2, h \u2264 2 grammar." 
}, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-103", "text": "Although it does not necessarily jump out of the grid at first glance, this point represents the best compromise between a compact grammar and useful markov histories." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-104", "text": "----------------------------------" }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-105", "text": "**EXTERNAL VS. INTERNAL ANNOTATION**" }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-106", "text": "The two major previous annotation strategies, parent annotation and head lexicalization, can be seen as instances of external and internal annotation, respectively." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-107", "text": "Parent annotation lets us indicate an important feature of the external environment of a node which influences the internal expansion of that node." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-108", "text": "On the other hand, lexicalization is a (radical) method of marking a distinctive aspect of the otherwise hidden internal contents of a node which influence the external distribution." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-109", "text": "Both kinds of annotation can be useful." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-110", "text": "To identify split states, we add suffixes of the form -X to mark internal content features, and \u02c6X to mark external features." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-111", "text": "To illustrate the difference, consider unary productions." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-112", "text": "In the raw grammar, there are many unaries, and once any major category is constructed over a span, most others become constructible as well using unary chains (see Klein and Manning (2001) for discussion)." 
}, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-113", "text": "Such chains are rare in real treebank trees: unary rewrites only appear in very specific contexts, for example S complements of verbs where the S has an empty, controlled subject." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-114", "text": "Figure 4 shows an erroneous output of the parser, using the baseline markovized grammar." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-115", "text": "Intuitively, there are several reasons this parse should be ruled out, but one is that the lower S slot, which is intended primarily for S complements of communication verbs, is not a unary rewrite position (such complements usually have subjects)." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-116", "text": "It would therefore be natural to annotate the trees so as to confine unary productions to the contexts in which they are actually appropriate." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-117", "text": "We tried two annotations." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-118", "text": "First, UNARY-INTERNAL marks (with a -U) any nonterminal node which has only one child." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-119", "text": "In isolation, this resulted in an absolute gain of 0.55% (see figure 3) ." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-120", "text": "The same sentence, parsed using only the baseline and UNARY-INTERNAL, is parsed correctly, because the VP rewrite in the incorrect parse ends with an S\u02c6VP-U with very low probability." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-121", "text": "8 Alternately, UNARY-EXTERNAL marked nodes which had no siblings with \u02c6U. It was similar to UNARY-INTERNAL in solo benefit (0.01% worse), but provided far less marginal benefit on top of other later features (none at all on top of UNARY-INTERNAL for our top models), and was discarded." 
}, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-122", "text": "One restricted place where external unary annotation was very useful, however, was at the preterminal level, where internal annotation was meaningless." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-123", "text": "One distributionally salient tag conflation in the Penn treebank is the identification of demonstratives (that, those) and regular determiners (the, a)." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-124", "text": "Splitting DT tags based on whether they were only children (UNARY-DT) captured this distinction." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-125", "text": "The same external unary annotation was even more effective when applied to adverbs (UNARY-RB), distinguishing, for example, as well from also." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-126", "text": "Beyond these cases, unary tag marking was detrimental." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-127", "text": "The F 1 after UNARY-INTERNAL, UNARY-DT, and UNARY-RB was 78.86%." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-128", "text": "----------------------------------" }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-129", "text": "**TAG SPLITTING**" }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-130", "text": "The idea that part-of-speech tags are not fine-grained enough to abstract away from specific-word behaviour is a cornerstone of lexicalization." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-131", "text": "The UNARY-DT annotation, for example, showed that the determiners which occur alone are usefully distinguished from those which occur with other nominal material." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-132", "text": "This marks the DT nodes with a single bit about their immediate external context: whether there are sisters." 
}, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-133", "text": "Given the success of parent annotation for nonterminals, it makes sense to parent annotate tags, as well (TAG-PA)." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-134", "text": "In fact, as figure 3 shows, exhaustively marking all preterminals with their parent category was the most effective single annotation we tried. Why should this be useful?" }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-135", "text": "Most tags have a canonical category." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-136", "text": "For example, NNS tags occur under NP nodes (only 234 of 70855 do not, mostly mistakes)." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-137", "text": "8 Note that when we show such trees, we generally only show one annotation on top of the baseline at a time." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-138", "text": "Moreover, we do not explicitly show the binarization implicit in the horizontal markovization." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-139", "text": "9 These two are not equivalent even given infinite data." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-140", "text": "However, when a tag somewhat regularly occurs in a non-canonical position, its distribution is usually distinct." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-141", "text": "For example, the most common adverbs directly under ADVP are also (1599) and now (544)." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-142", "text": "Under VP, they are n't (3779) and not (922)." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-143", "text": "Under NP, only (215) and just (132), and so on."
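A minimal sketch of TAG-PA, which splits every preterminal tag by its parent category so that, for example, the RB under ADVP and the RB under VP become distinct symbols; the (label, children) tuple tree encoding is an assumption of this illustration:

```python
def tag_parent_annotate(tree, parent=None):
    """TAG-PA: mark each preterminal tag with its parent category, e.g.
    the RB under an ADVP becomes RB^ADVP while the RB under a VP becomes
    RB^VP. Trees are (label, children) tuples; leaves are word strings.
    """
    if isinstance(tree, str):
        return tree
    label, children = tree
    is_preterminal = len(children) == 1 and isinstance(children[0], str)
    new_label = label + "^" + parent if is_preterminal and parent else label
    # Pass the *unannotated* label down as the parent category.
    return (new_label, [tag_parent_annotate(c, label) for c in children])

annotated = tag_parent_annotate(
    ("S", [("ADVP", [("RB", ["also"])]),
           ("VP", [("RB", ["not"]), ("VB", ["go"])])]))
```

The two RB occurrences now rewrite from different symbols, so the grammar can learn their distinct word distributions.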
}, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-145", "text": "In addition to the adverb case, the Penn tag set conflates various grammatical distinctions that are commonly made in traditional and generative grammar, and from which a parser could hope to get useful information." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-146", "text": "For example, subordinating conjunctions (while, as, if), complementizers (that, for), and prepositions (of, in, from) all get the tag IN." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-147", "text": "Many of these distinctions are captured by TAG-PA (subordinating conjunctions occur under S and prepositions under PP), but some are not (both subordinating conjunctions and complementizers appear under SBAR)." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-148", "text": "Also, there are exclusively noun-modifying prepositions (of), predominantly verb-modifying ones (as), and so on." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-149", "text": "The annotation SPLIT-IN does a linguistically motivated 6-way split of the IN tag, and brought the total to 81.19%." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-150", "text": "Figure 5 shows an example error in the baseline which is equally well fixed by either TAG-PA or SPLIT-IN." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-151", "text": "In this case, the more common nominal use of works is preferred unless the IN tag is annotated to allow if to prefer S complements." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-152", "text": "We also got value from three other annotations which subcategorized tags for specific lexemes." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-153", "text": "First we split off auxiliary verbs with the SPLIT-AUX annotation, which appends \u02c6BE to all forms of be and \u02c6HAVE to all forms of have."
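SPLIT-AUX can be sketched as a per-token tag rewrite. The enumerated form lists below are an assumption of this illustration, since the text says only "all forms" of be and have without listing them:

```python
# Illustrative form lists (an assumption of this sketch, not an
# enumeration taken from the original work):
BE_FORMS = {"be", "am", "is", "are", "was", "were", "been", "being"}
HAVE_FORMS = {"have", "has", "had", "having"}

def split_aux(tag, word):
    """SPLIT-AUX: append ^BE / ^HAVE to verb tags over forms of be / have;
    all other (tag, word) pairs keep their original tag."""
    w = word.lower()
    if tag.startswith("VB"):
        if w in BE_FORMS:
            return tag + "^BE"
        if w in HAVE_FORMS:
            return tag + "^HAVE"
    return tag
```

The same pattern extends to SPLIT-CC (testing for "But"/"but" and "&") and to giving % its own tag.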
}, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-154", "text": "More minorly, SPLIT-CC marked conjunction tags to indicate whether or not they were the strings [Bb]ut or &, each of which has a distinctly different distribution from other conjunctions." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-155", "text": "Finally, we gave the percent sign (%) its own tag, in line with the dollar sign ($) already having its own." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-156", "text": "Together these three annotations brought the F 1 to 81.81%." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-157", "text": "----------------------------------" }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-158", "text": "**WHAT IS AN UNLEXICALIZED GRAMMAR?**" }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-159", "text": "Around this point, we must address exactly what we mean by an unlexicalized PCFG." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-160", "text": "To the extent that we go about subcategorizing POS categories, many of them might come to represent a single word." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-161", "text": "One might thus feel that the approach of this paper is to walk down a slippery slope, and that we are merely arguing degrees." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-162", "text": "However, we believe that there is a fundamental qualitative distinction, grounded in linguistic practice, between what we see as permitted in an unlexicalized PCFG as against what one finds and hopes to exploit in lexicalized PCFGs." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-163", "text": "The division rests on the traditional distinction between function words (or closed-class words) and content words (or open class or lexical words)."
}, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-164", "text": "It is standard practice in linguistics, dating back decades, to annotate phrasal nodes with important function-word distinctions, for example to have a CP[for] or a PP[to], whereas content words are not part of grammatical structure, and one would not have special rules or constraints for an NP[stocks], for example." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-165", "text": "We follow this approach in our model: various closed classes are subcategorized to better represent important distinctions, and important features commonly expressed by function words are annotated onto phrasal nodes (such as whether a VP is finite, or a participle, or an infinitive clause)." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-166", "text": "However, no use is made of lexical class words, to provide either monolexical or bilexical probabilities." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-167", "text": "----------------------------------" }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-168", "text": "**11**" }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-169", "text": "At any rate, we have kept ourselves honest by estimating our models exclusively by maximum likelihood estimation over our subcategorized grammar, without any form of interpolation or shrinkage to unsubcategorized categories (although we do markovize rules, as explained above)." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-170", "text": "11 It should be noted that we started with four tags in the Penn treebank tagset that rewrite as a single word: EX (there), WP$ (whose), # (the pound sign), and TO, and some others such as WP, POS, and some of the punctuation tags, which rewrite as barely more." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-171", "text": "To the extent that we subcategorize tags, there will be more such cases, but many of them already exist in other tag sets."
}, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-172", "text": "For instance, many tag sets, such as the Brown and CLAWS (c5) tagsets, give a separate set of tags to each form of the verbal auxiliaries be, do, and have, most of which rewrite as only a single word (and any corresponding contractions)." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-173", "text": "This effectively means that the subcategories that we break off must themselves be very frequent in the language." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-174", "text": "In such a framework, if we try to annotate categories with any detailed lexical information, many sentences either entirely fail to parse, or have only extremely weird parses." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-175", "text": "The resulting battle against sparsity means that we can only afford to make a few distinctions which have major distributional impact." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-176", "text": "Even with the individual-lexeme annotations in this section, the grammar still has only 9255 states compared to the 7619 of the baseline model." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-177", "text": "----------------------------------" }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-178", "text": "**ANNOTATIONS ALREADY IN THE TREEBANK**" }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-179", "text": "At this point, one might wonder as to the wisdom of stripping off all treebank functional tags, only to heuristically add other such markings back in to the grammar." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-180", "text": "By and large, the treebank out-of-the-package tags, such as PP-LOC or ADVP-TMP, have negative utility." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-181", "text": "Recall that the raw treebank grammar, with no annotation or markovization, had an F 1 of 72.62% on our development set."
}, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-182", "text": "With the functional annotation left in, this drops to 71.49%." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-183", "text": "The h \u2264 2, v \u2264 1 markovization baseline of 77.77% dropped even further, all the way to 72.87%, when these annotations were included." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-184", "text": "Nonetheless, some distinctions present in the raw treebank trees were valuable." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-185", "text": "For example, an NP with an S parent could be either a temporal NP or a subject." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-186", "text": "For the annotation TMP-NP, we retained the original -TMP tags on NPs, and, furthermore, propagated the tag down to the tag of the head of the NP." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-187", "text": "This is illustrated in figure 6 , which also shows an example of its utility, clarifying that CNN last night is not a plausible compound and facilitating the otherwise unusual high attachment of the smaller NP." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-188", "text": "TMP-NP brought the cumulative F 1 to 82.25%." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-189", "text": "Note that this technique of pushing the functional tags down to preterminals might be useful more generally; for example, locative PPs expand roughly the same way as all other PPs (usually as IN NP), but they do tend to have different prepositions below IN." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-190", "text": "A second kind of information in the original trees is the presence of empty elements." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-191", "text": "Following Collins (1999) , the annotation GAPPED-S marks S nodes which have an empty subject (i.e., raising and control constructions)." 
}, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-192", "text": "This brought F 1 to 82.28%." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-193", "text": "----------------------------------" }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-194", "text": "**HEAD ANNOTATION**" }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-195", "text": "The notion that the head word of a constituent can affect its behavior is a useful one." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-196", "text": "However, often the head tag is as good (or better) an indicator of how a constituent will behave." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-197", "text": "12 We found several head annotations to be particularly effective." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-198", "text": "First, possessive NPs have a very different distribution than other NPs -in particular, NP \u2192 NP \u03b1 rules are only used in the treebank when the leftmost child is possessive (as opposed to other imaginable uses like for New York lawyers, which is left flat)." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-199", "text": "To address this, POSS-NP marked all possessive NPs." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-200", "text": "This brought the total F 1 to 83.06%." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-201", "text": "Second, the VP symbol is very overloaded in the Penn treebank, most severely in that there is no distinction between finite and infinitival VPs." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-202", "text": "An example of the damage this conflation can do is given in figure 7 , where one needs to capture the fact that present-tense verbs do not generally take bare infinitive VP complements." 
}, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-203", "text": "To allow the finite/non-finite distinction, and other verb type distinctions, SPLIT-VP annotated all VP nodes with their head tag, merging all finite forms to a single tag VBF." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-204", "text": "In particular, this also accomplished Charniak's gerund-VP marking." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-205", "text": "This was extremely useful, bringing the cumulative F 1 to 85.72%, 2.66% absolute improvement (more than its solo improvement over the baseline)." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-206", "text": "----------------------------------" }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-207", "text": "**DISTANCE**" }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-208", "text": "Error analysis at this point suggested that many remaining errors were attachment level and conjunction scope." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-209", "text": "While these kinds of errors are undoubtedly profitable targets for lexical preference, most attachment mistakes were overly high attachments, indicating that the overall right-branching tendency of English was not being captured." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-210", "text": "Indeed, this tendency is a difficult trend to capture in a PCFG because often the high and low attachments involve the very same rules." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-211", "text": "Even if not, attachment height is not modeled by a PCFG unless it is somehow explicitly encoded into category labels." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-212", "text": "More complex parsing models have indirectly overcome this by modeling distance (rather than height)." 
}, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-213", "text": "Linear distance is difficult to encode in a PCFG -marking nodes with the size of their yields massively multiplies the state space." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-214", "text": "13 Therefore, we wish to find indirect indicators that distinguish high attachments from low ones." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-215", "text": "In the case of two PPs following a NP, with the question of whether the second PP is a second modifier of the leftmost NP or should attach lower, inside the first PP, the important distinction is usually that the lower site is a non-recursive base NP." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-216", "text": "Collins (1999) captures this notion by introducing the notion of a base NP, in which any NP which dominates only preterminals is marked with a -B. Further, if an NP-B does not have a non-base NP parent, it is given one with a unary production." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-217", "text": "This was helpful, but substantially less effective than marking base NPs without introducing the unary, whose presence actually erased a useful internal indicator -base NPs are more frequent in subject position than object position, for example." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-218", "text": "In isolation, the Collins method actually hurt the baseline (absolute cost to F 1 of 0.37%), while skipping the unary insertion added an absolute 0.73% to the baseline, and brought the cumulative F 1 to 86.04%." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-219", "text": "In the case of attachment of a PP to an NP either above or inside a relative clause, the high NP is distinct from the low one in that the already modified one contains a verb (and the low one may be a base NP as well)." 
}, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-220", "text": "This is a partial explanation of the utility of verbal distance in Collins (1999)." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-221", "text": "13 The inability to encode distance naturally in a naive PCFG is somewhat ironic." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-222", "text": "In the heart of any PCFG parser, the fundamental table entry or chart item is a label over a span, for example an NP from position 0 to position 5." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-223", "text": "The concrete use of a grammar rule is to take two adjacent span-marked labels and combine them (for example NP[0, 5] and VP[5, 12] into S[0, 12])." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-224", "text": "Yet, only the labels are used to score the combination." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-225", "text": "To capture this, DOMINATES-V marks all nodes which dominate any verbal node (V*, MD) with a -V. This brought the cumulative F 1 to 86.91%." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-226", "text": "We also tried marking nodes which dominated prepositions and/or conjunctions, but these features did not help the cumulative hill-climb." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-227", "text": "The final distance/depth feature we used was an explicit attempt to model depth, rather than use distance and linear intervention as a proxy." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-228", "text": "With RIGHT-REC-NP, we marked all NPs which contained another NP on their right periphery (i.e., as a rightmost descendant)." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-229", "text": "This captured some further attachment trends, and brought us to a final development F 1 of 87.04%."
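The DOMINATES-V marking described above can be sketched as a bottom-up pass that records whether any verbal tag (V*, MD) is dominated; the (label, children) tuple tree encoding is an assumption of this illustration:

```python
def mark_dominates_v(tree):
    """DOMINATES-V: append -V to every internal node that dominates a
    verbal tag (V*, MD). Returns (annotated_tree, dominates_verb).

    Trees are (label, children) tuples; word leaves are plain strings.
    """
    if isinstance(tree, str):
        return tree, False
    label, children = tree
    if len(children) == 1 and isinstance(children[0], str):  # preterminal
        return (label, children), label.startswith("V") or label == "MD"
    new_children, any_verb = [], False
    for child in children:
        annotated, has_verb = mark_dominates_v(child)
        new_children.append(annotated)
        any_verb = any_verb or has_verb
    return (label + "-V" if any_verb else label, new_children), any_verb

annotated, _ = mark_dominates_v(
    ("S", [("NP", [("DT", ["the"]), ("NN", ["dog"])]),
           ("VP", [("VBZ", ["barks"])])]))
```

An NP containing a relative clause (and hence a verb) becomes NP-V and so is distinguished from an unmodified NP at the attachment site.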
}, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-230", "text": "----------------------------------" }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-231", "text": "**FINAL RESULTS**" }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-232", "text": "We took the final model and used it to parse section 23 of the treebank." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-233", "text": "Figure 8 shows the results." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-234", "text": "The test set F 1 is 86.32% for \u2264 40 words, already higher than early lexicalized models, though of course lower than the state-of-the-art parsers." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-235", "text": "----------------------------------" }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-236", "text": "**CONCLUSION**" }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-237", "text": "The advantages of unlexicalized grammars are clear enough - easy to estimate, easy to parse with, and time- and space-efficient." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-238", "text": "However, the dismal performance of basic unannotated unlexicalized grammars has generally rendered those advantages irrelevant." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-239", "text": "Here, we have shown that, surprisingly, the maximum-likelihood estimate of a compact unlexicalized PCFG can parse on par with early lexicalized parsers." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-240", "text": "We do not want to argue that lexical selection is not a worthwhile component of a state-of-the-art parser - certain attachments, at least, require it - though perhaps its necessity has been overstated." }, { "sent_id": "a6954db741df61f014cc622c5b8263-C001-241", "text": "Rather, we have shown ways to improve parsing, some easier than lexicalization, and others orthogonal to it, which could presumably be used to benefit lexicalized parsers as well."
} ], "y": { "@BACK@": { "gold_contexts": [ [ "a6954db741df61f014cc622c5b8263-C001-9" ], [ "a6954db741df61f014cc622c5b8263-C001-27" ], [ "a6954db741df61f014cc622c5b8263-C001-41" ], [ "a6954db741df61f014cc622c5b8263-C001-77" ], [ "a6954db741df61f014cc622c5b8263-C001-79" ], [ "a6954db741df61f014cc622c5b8263-C001-216" ], [ "a6954db741df61f014cc622c5b8263-C001-220" ] ], "cite_sentences": [ "a6954db741df61f014cc622c5b8263-C001-9", "a6954db741df61f014cc622c5b8263-C001-27", "a6954db741df61f014cc622c5b8263-C001-41", "a6954db741df61f014cc622c5b8263-C001-77", "a6954db741df61f014cc622c5b8263-C001-79", "a6954db741df61f014cc622c5b8263-C001-216", "a6954db741df61f014cc622c5b8263-C001-220" ] }, "@DIF@": { "gold_contexts": [ [ "a6954db741df61f014cc622c5b8263-C001-45" ] ], "cite_sentences": [ "a6954db741df61f014cc622c5b8263-C001-45" ] }, "@SIM@": { "gold_contexts": [ [ "a6954db741df61f014cc622c5b8263-C001-91" ] ], "cite_sentences": [ "a6954db741df61f014cc622c5b8263-C001-91" ] }, "@USE@": { "gold_contexts": [ [ "a6954db741df61f014cc622c5b8263-C001-191" ] ], "cite_sentences": [ "a6954db741df61f014cc622c5b8263-C001-191" ] } } }, "ABC_bb133ba3dfe483412672b44b777c4a_19": { "x": [ { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-2", "text": "The Transformer translation model employs residual connection and layer normalization to ease the optimization difficulties caused by its multi-layer encoder/decoder structure." }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-3", "text": "While several previous works show that even with residual connection and layer normalization, deep Transformers still have difficulty in training, and particularly a Transformer model with more than 12 encoder/decoder layers fails to converge." 
}, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-4", "text": "In this paper, we first empirically demonstrate that a simple modification made in the official implementation which changes the computation order of residual connection and layer normalization can effectively ease the optimization of deep Transformers." }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-5", "text": "In addition, we compare the subtle difference in computation order in depth, and propose a parameter initialization method which simply puts a Lipschitz restriction on the initialization of Transformers but can effectively ensure their convergence." }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-6", "text": "We empirically show that with proper parameter initialization, deep Transformers with the original computation order can converge, in contrast to all previous works, and obtain significant improvements with up to 24 layers." }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-7", "text": "Our proposed approach additionally makes it possible to benefit from deep decoders, in contrast to previous works which focus on deep encoders." }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-8", "text": "----------------------------------" }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-10", "text": "Neural machine translation has achieved great success in the last few years (Bahdanau et al., 2014; Gehring et al., 2017; Vaswani et al., 2017)." }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-11", "text": "The Transformer (Vaswani et al., 2017), which has outperformed previous RNN/CNN based translation models (Bahdanau et al., 2014; Gehring et al., 2017), is based on multi-layer self-attention networks and can be trained very efficiently." }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-12", "text": "The multi-layer structure allows the Transformer to model complicated functions."
}, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-13", "text": "Increasing the depth of models can increase their capacity but may also cause optimization difficulties (Mhaskar et al., 2017; Telgarsky, 2016; Eldan and Shamir, 2016; He et al., 2016)." }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-14", "text": "In order to ease optimization, the Transformer employs residual connection and layer normalization techniques which have been proven useful in reducing optimization difficulties of deep neural networks for various tasks (He et al., 2016; Ba et al., 2016)." }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-15", "text": "However, even with residual connections and layer normalization, deep Transformers are still hard to train: the original Transformer (Vaswani et al., 2017) only contains 6 encoder/decoder layers." }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-72", "text": "Results for the two computation orders with the new parameter initialization method are shown in Table 3." }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-16", "text": "Previous work shows that Transformer models with more than 12 encoder layers fail to converge, and proposes the Transparent Attention (TA) mechanism, which combines weighted outputs of all encoder layers as the encoded representation." }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-17", "text": "However, the TA mechanism has to weight outputs of shallow encoder layers heavily to feed back sufficient gradients during back-propagation to ensure their convergence, which implies that the weights of deep layers are likely to be hampered, working against the motivation when going very deep, and as a result it cannot obtain further improvements with more than 16 layers. Other work reveals that deep Transformers with proper use of layer normalization are able to converge and proposes to aggregate previous layers' outputs for each layer instead of only at the end of encoding." }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-18", "text": "Wu et al. (2019) study incrementally increasing the depth of the Transformer Big by freezing pre-trained shallow layers." }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-19", "text": "In concurrent work, Zhang et al. (2019) also point out the same issue as in this work, but there are differences between the two works." }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-20", "text": "In contrast to all previous works, we empirically show that with proper parameter initialization, deep Transformers with the original computation order can converge." }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-21", "text": "The contributions of our work are as follows: We empirically demonstrate that a simple modification made in the Transformer's official implementation which changes the computation order of residual connection and layer normalization can effectively ease its optimization;" }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-22", "text": "We deeply analyze how the subtle difference in computation order affects the convergence of deep Transformer models, and propose to initialize deep Transformer models under a Lipschitz restriction;" }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-23", "text": "Our simple approach effectively ensures the convergence of deep Transformers with up to 24 layers, and brings +1.50 and +0.92 BLEU improvements in the WMT 14 English to German task and the WMT 15 Czech to English task;" }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-24", "text": "We study the influence of the deep decoder in addition to the deep encoder studied by previous works, and show that deep decoders can also benefit the performance of the Transformer."
}, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-25", "text": "----------------------------------" }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-26", "text": "**CONVERGENCE OF DIFFERENT COMPUTATION ORDER**" }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-27", "text": "In our research we focus on training problems of deep Transformers which prevent them from convergence (as opposed to other important issues such as over-fitting on the training set)." }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-28", "text": "To alleviate the training problem for the standard Transformer model, Layer Normalization (Ba et al., 2016) and Residual Connection (He et al., 2016) are adopted." }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-29", "text": "The official implementation of the Transformer uses a different computation sequence (Figure 1 b) compared to the published version (Vaswani et al., 2017) (Figure 1 a), since it seems better for harder-to-learn models 1." }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-30", "text": "Though several papers (Domhan, 2018) mentioned this change, how this modification impacts the performance of the Transformer, especially for deep Transformers, had never been deeply studied with empirical results to the best of our knowledge, except for work that analyzed the difference between the two computation orders during back-propagation; Zhang et al. (2019) point out the same effects of normalization in concurrent work." }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-31", "text": "In order to compare with previous work, we used the datasets from the WMT 14 English to German task and the WMT 15 Czech to English task for experiments." }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-32", "text": "We applied joint Byte-Pair Encoding (BPE) (Sennrich et al., 2016) with 32k merge operations." }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-33", "text": "We used the same setting as the Transformer base (Vaswani et al., 2017) except that the number of warm-up steps was set to 8k."
}, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-34", "text": "We conducted our experiments based on the Neutron implementation (Xu and Liu, 2019) of the Transformer." }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-35", "text": "Parameters were initialized with Glorot Initialization 2 (Glorot and Bengio, 2010) as in many other Transformer implementations (Klein et al., 2017; Hieber et al., 2017)." }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-36", "text": "Our experiments ran on 2 GTX 1080 Ti GPUs, and a batch size of at least 25k target tokens was achieved through gradient accumulation of small batches." }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-37", "text": "We used a beam size of 4 for decoding, and evaluated tokenized case-sensitive BLEU with the averaged model of the last 5 checkpoints saved with an interval of 1,500 training steps (Vaswani et al., 2017)." }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-38", "text": "Results of the two different computation orders are shown in Table 1." }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-39", "text": "v1 and v2 stand for the computation order of the proposed Transformer (Vaswani et al., 2017) and that of the official implementation respectively. \"\u00ac\" means failure to converge, \"None\" means not reported in the original works, and \"*\" indicates our implementation of their approach." }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-40", "text": "\u2020 and \u2021 mean p < 0.01 and p < 0.05 respectively in the significance test comparing v1 and v2 with the same number of layers."
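The two computation orders can be sketched for a single sublayer: v1 (the published order) normalizes after the residual addition, while v2 (the official implementation) normalizes only the sublayer input and leaves the residual path untouched. The plain layer normalization with weight 1 and bias 0 matches the initialization described in the analysis section; everything else here is illustrative:

```python
import math

def layer_norm(x, eps=1e-6):
    """Layer normalization over a vector, with weight 1 and bias 0."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def sublayer_v1(x, f):
    """v1 (published order): the residual path itself is re-normalized."""
    return layer_norm([a + b for a, b in zip(x, f(x))])

def sublayer_v2(x, f):
    """v2 (official implementation): the residual path bypasses
    layer normalization entirely."""
    return [a + b for a, b in zip(x, f(layer_norm(x)))]

# With a zero sublayer, v2 passes the residual through unchanged,
# while v1 still rescales it:
x = [1.0, 2.0, 3.0]
zero = lambda v: [0.0] * len(v)
```

Stacking sublayer_v1 re-normalizes the accumulated residual at every layer (shrinking it whenever w/σ < 1), whereas sublayer_v2 lets residuals accumulate unchanged.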
}, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-41", "text": "----------------------------------" }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-42", "text": "**ANALYSIS AND LIPSCHITZ RESTRICTED PARAMETER INITIALIZATION**" }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-43", "text": "Since the subtle change of computation order results in huge differences in convergence, we analyze the differences between the computation orders to figure out how they affect convergence." }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-44", "text": "----------------------------------" }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-45", "text": "**COMPARISON BETWEEN COMPUTATION ORDERS**" }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-46", "text": "As a conjecture, we think that the convergence issue of deep Transformers is perhaps due to the fact that layer normalization is computed over residual connections in Figure 1 (a), so residual connections are likely to be hampered by layer normalization, which tends to shrink accumulated residual connections to avoid potential exploding of combined layer outputs." }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-47", "text": "We studied how the layer normalization and the residual connection are computed in the two computation orders, as shown in Table 2. \"mean\" and \"std\" stand for the computation of the mean value and the standard deviation." }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-48", "text": "in_model and in_res stand for the output of the current layer and the accumulated outputs of previous layers respectively." }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-49", "text": "w and b are the weight and bias of layer normalization, initialized as a vector of ones and a vector of zeros respectively." }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-50", "text": "out_LN is the computation result of the layer normalization."
}, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-51", "text": "out_res^v1 and out_res^v2 are the results of the residual connections of v1 and v2." }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-52", "text": "Table 2 shows that the computation of the residual connection in v1 is weighted by w/\u03c3 compared to v2, and the residual connection of previous layers will be shrunk when w/\u03c3 < 1.0." }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-53", "text": "Prior work introduced the TA mechanism to compensate normalized residual connections by combining the outputs of shallow layers into the final encoder output of the published Transformer, and obtained significant improvements with deep Transformer models." }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-54", "text": "Other work additionally aggregates the outputs of previous layers for each encoder layer instead of only at the end of encoding." }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-55", "text": "----------------------------------" }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-56", "text": "**LIPSCHITZ RESTRICTED PARAMETER INITIALIZATION**" }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-57", "text": "Since the convergence issue of deep v1 Transformers is likely because of the shrunken residual connections, is it possible to restrict w/\u03c3 \u2265 1.0?" }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-58", "text": "Given that w is initialized with 1, we suggest restricting the standard deviation of in_model + in_res:" }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-59", "text": "in which case w/\u03c3 will be greater than or equal to 1.0, and the residual connection of v1 will no longer be shrunk." }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-60", "text": "To achieve this goal, we can restrict the values of in_model + in_res to [\u2212c, +c] while ensuring that the variance of its distribution is smaller than 1.0." 
}, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-61", "text": "Define P(x) as any probability distribution of x on [\u2212c, +c] (Equation 2); then the standard deviation of x is (Equation 3):" }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-62", "text": "given that:" }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-63", "text": "for x \u2208 [\u2212c, +c] with P(x) constrained by Equation 2, we can turn Equation 3 into:" }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-64", "text": "cleaning up Equation 5, we get:" }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-65", "text": "after substituting Equation 2 into Equation 6, we find that:" }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-66", "text": "Thus, as long as:" }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-67", "text": "the requirement on the corresponding \u03c3 described in Equation 1 can be satisfied." }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-68", "text": "This goal can simply be achieved by initializing the sub-model before layer normalization to be a k-Lipschitz function, where k \u2264 0.5." }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-69", "text": "The k-Lipschitz restriction can be satisfied effectively through weight clipping, and we empirically find that applying the restriction only to parameter initialization is sufficient, which is more efficient and avoids the potential risk of weight clipping to performance." }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-70", "text": "In practice, we initialize embedding matrices and the weights of linear transformations with uniform distributions over [\u2212e, +e] and [\u2212l, +l] respectively." }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-71", "text": "We use 2/(esize + vsize) as e and 1/isize as l, where esize, vsize and isize stand for the embedding size, the vocabulary size and the input dimension of the linear transformation respectively." 
}, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-73", "text": "v1' indicates v1 with Lipschitz restricted parameter initialization, and likewise for v2'." }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-74", "text": "----------------------------------" }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-75", "text": "**EFFECTS OF DEEPER ENCODER AND DEEPER DECODER**" }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-76", "text": "Previous approaches only increase the depth of the encoder, while we suggest that deep decoders should also be helpful." }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-77", "text": "We analyzed the influence of deep encoders and decoders separately, and the results are shown in Table 4 ." }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-78", "text": "Table 4 shows that the deep decoder can benefit the performance in addition to the deep encoder, especially on the Czech to English task." }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-79", "text": "----------------------------------" }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-80", "text": "**CONCLUSION**" }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-81", "text": "Previous works (Wu et al., 2019) show that deep Transformers with the computation order as in Vaswani et al. (2017) have difficulty in convergence." }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-82", "text": "In contrast, we empirically show that deep Transformers with the original computation order can converge as long as parameters are properly initialized." }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-83", "text": "In this paper, we first investigate the convergence differences between the published Transformer (Vaswani et al., 2017) and the official implementation of the Transformer, and compare the differences in computation order between them." 
}, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-84", "text": "We then conjecture that the training problem of deep Transformers arises because layer normalization sometimes shrinks residual connections, and propose that this can be tackled simply with Lipschitz restricted parameter initialization." }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-85", "text": "Our experiments demonstrate the effectiveness of our simple approach for the convergence of deep Transformers, which brings significant improvements on the WMT 14 English to German and the WMT 15 Czech to English news translation tasks." }, { "sent_id": "bb133ba3dfe483412672b44b777c4a-C001-86", "text": "We also study the effects of deep decoders in addition to the deep encoders considered in previous works." } ], "y": { "@MOT@": { "gold_contexts": [ [ "bb133ba3dfe483412672b44b777c4a-C001-10" ], [ "bb133ba3dfe483412672b44b777c4a-C001-11" ] ], "cite_sentences": [ "bb133ba3dfe483412672b44b777c4a-C001-10", "bb133ba3dfe483412672b44b777c4a-C001-11" ] }, "@BACK@": { "gold_contexts": [ [ "bb133ba3dfe483412672b44b777c4a-C001-15" ], [ "bb133ba3dfe483412672b44b777c4a-C001-29" ], [ "bb133ba3dfe483412672b44b777c4a-C001-39" ], [ "bb133ba3dfe483412672b44b777c4a-C001-81" ] ], "cite_sentences": [ "bb133ba3dfe483412672b44b777c4a-C001-15", "bb133ba3dfe483412672b44b777c4a-C001-29", "bb133ba3dfe483412672b44b777c4a-C001-39", "bb133ba3dfe483412672b44b777c4a-C001-81" ] }, "@DIF@": { "gold_contexts": [ [ "bb133ba3dfe483412672b44b777c4a-C001-33" ], [ "bb133ba3dfe483412672b44b777c4a-C001-83" ] ], "cite_sentences": [ "bb133ba3dfe483412672b44b777c4a-C001-33", "bb133ba3dfe483412672b44b777c4a-C001-83" ] }, "@EXT@": { "gold_contexts": [ [ "bb133ba3dfe483412672b44b777c4a-C001-33" ] ], "cite_sentences": [ "bb133ba3dfe483412672b44b777c4a-C001-33" ] }, "@SIM@": { "gold_contexts": [ [ "bb133ba3dfe483412672b44b777c4a-C001-37" ] ], "cite_sentences": [ "bb133ba3dfe483412672b44b777c4a-C001-37" ] } } }, "ABC_6e92b1fa4f3b78a099cb222b3eb9a9_19": { "x": [ { 
"sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-114", "text": "**SYSTEM DESCRIPTION**" }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-23", "text": "In other cases, a phone n-gram language model is used to model the phone statistics instead of a VSM [6, 7, 8] ." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-2", "text": "In this work, we present a new Vector Space Model (VSM) of speech utterances for the task of spoken dialect identification." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-3", "text": "Generally, DID systems are built using two sets of features extracted from speech utterances: acoustic and phonetic." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-4", "text": "The acoustic and phonetic features are used to form vector representations of speech utterances in an attempt to encode information about the spoken dialects." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-5", "text": "The Phonotactic and Acoustic VSMs, thus formed, are used for the task of DID." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-6", "text": "The aim of this paper is to construct a single VSM that encodes information about spoken dialects from both the Phonotactic and Acoustic VSMs." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-7", "text": "Given the two views of the data, we make use of a well-known multi-view dimensionality reduction technique known as Canonical Correlation Analysis (CCA) to form a single vector representation for each speech utterance that encodes dialect specific discriminative information from both the phonetic and acoustic representations." 
}, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-8", "text": "We refer to this approach as the feature space combination approach and show that our CCA based feature vector representation performs better on the Arabic DID task than the phonetic and acoustic feature representations used alone." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-9", "text": "We also present the feature space combination approach as a viable alternative to the model based combination approach, where two DID systems are built using the two VSMs (Phonotactic and Acoustic) and the final prediction score is the output score combination from the two systems." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-10", "text": "----------------------------------" }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-11", "text": "**INTRODUCTION**" }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-12", "text": "The Dialect Identification (DID) problem is a special case of the more general problem of Language Identification (LID)." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-13", "text": "LID refers to the process of automatically identifying the language class for a given speech segment or text document, while DID classifies between dialects within the same language class, making it a more challenging task than LID." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-14", "text": "A good DID system used as a front-end to an automatic speech recognition system can help improve the recognition performance by providing dialectal data for acoustic and language model adaptation to the specific dialect being spoken [1] ." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-15", "text": "In this work, we focus on Arabic DID, which can be posed as a five-class classification problem, given that the Arabic language can be divided into five major dialects: Egyptian (EGY), Gulf (GLF), Levantine (LAV), Modern Standard Arabic (MSA) and North African (NOR) [2] ." 
}, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-16", "text": "Over the past decade, great advances have been made in the field of automatic language identification (LID)." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-17", "text": "Research effort has focused on coming up with mathematical representations of speech utterances that encode the information about the language being spoken." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-18", "text": "These approaches are also known as Vector Space Modeling approaches [3] , where speech utterances are represented by a continuous vector of high dimensions." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-19", "text": "Two predominant Vector Space Modeling approaches are Phonotactic and Acoustic." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-20", "text": "Phonotactic approaches attempt to model the n-gram phone statistics of speech." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-21", "text": "Phone sequences for each utterance are extracted using one or multiple phone recognisers." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-22", "text": "A Vector Space Model (VSM) is then constructed using a term-document matrix [4] , followed by an unsupervised dimensionality reduction technique, such as Principal Component Analysis (PCA) [5] , to map the high dimensional feature space to a low dimensional Vector Subspace (Section 2.1), giving a Phonotactic VSM." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-45", "text": "**PHONOTACTIC VSM; X P**" }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-24", "text": "On the other hand, Acoustic approaches attempt to extract dialect discriminative information from speech using low level acoustic features, such as pitch, prosody, shifted delta cepstral coefficients and bottleneck features [9, 10] ." 
}, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-25", "text": "One of the most successful acoustic approaches is the use of the i-Vector framework for LID, where i-Vectors are extracted for each speech utterance using an i-Vector extractor that consists of a GMM-UBM trained on top of BNF, followed by a Total Variability Subspace Model [2, 11] ." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-26", "text": "The extracted i-Vectors give an Acoustic VSM (Section 2.2)." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-27", "text": "These methods are also used for DID." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-28", "text": "Each of the two VSMs is used as an input to a back-end discriminative classifier, which is trained to find a suitable decision boundary in these vector spaces." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-29", "text": "This gives us two DID systems built using the Acoustic and Phonotactic VSMs." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-30", "text": "At prediction time, output scores from the two DID systems are combined to give a final score, on the basis of which the classification decision is made." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-31", "text": "This model combination approach has been shown to give performance improvements on the DID task [2] ." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-32", "text": "This also shows that the two systems are complementary to each other, which leads us to investigate a feature space combination approach, i.e. to construct a single VSM by combining the Phonotactic and Acoustic VSMs, in an attempt to encode useful discriminative information in that single VSM." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-33", "text": "In this work, we present a feature space combination approach." 
}, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-34", "text": "We form a combined VSM that incorporates useful information, necessary for DID, from both the Phonotactic and Acoustic VSMs." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-35", "text": "To achieve this goal, we make use of the well-known multi-view dimensionality reduction technique known as Canonical Correlation Analysis (CCA), developed by H. Hotelling [12] (Section 2.3)." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-36", "text": "We show the performance of the combined VSM on the Arabic DID task and compare it against the performance of the Phonotactic and Acoustic VSMs used alone (Section 5)." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-37", "text": "CCA VSM shows superior performance." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-38", "text": "The advantages of our feature space combination approach over model combination are twofold: only one back-end classifier needs to be trained; and unlabeled data from other domains can easily be used in the CCA framework to construct the single VSM." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-39", "text": "In this work, we do not experiment with unlabeled data and leave it as an extension to our current work." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-40", "text": "----------------------------------" }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-41", "text": "**VECTOR SPACE MODELS**" }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-42", "text": "This section gives details about the construction of the combined VSM, also referred to as the CCA VSM, Z C ." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-43", "text": "We start by presenting the Phonotactic VSM, X P , and Acoustic VSM, X A , used in this work, followed by the section on the CCA VSM, Z C ." 
}, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-44", "text": "----------------------------------" }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-46", "text": "Phonotactic VSM is constructed by modeling the n-gram phone statistics of the phone sequences that are extracted using an Arabic phone recognizer." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-47", "text": "Details about the phone recognizer can be found in [2] ." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-48", "text": "The VSM is constructed in two steps: 1) Construct a term-document matrix, X \u2208 R N \u00d7d (see Fig 1) , where each speech utterance is represented by a Phonotactic feature vector," }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-49", "text": ", where N is the number of speech utterances and f (p, s) is the number of times a phone n-gram (term) s appears in the utterance (document) p; and 2) perform Truncated Singular Value Decomposition (SVD) (Equation 2) on X to learn a lower dimensional linear manifold, \u03a0 \u2208 R d\u00d7k , where k << d. SVD attempts to discover the latent structure in the high dimensional feature space." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-50", "text": "Note that k is the number of largest singular values." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-51", "text": "X is projected down to \u03a0 to get the Phonotactic VSM, X P (Equation 2)." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-52", "text": "In our case, the n-gram dictionary consisted of phone 2-grams and 3-grams, with a total of 8K features, i.e. d = 8K." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-53", "text": "The dimensionality of the VSM, X P , was chosen to be 1200, i.e. k = 1200." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-54", "text": "1200 was the optimal value chosen experimentally." 
}, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-55", "text": "----------------------------------" }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-56", "text": "**ACOUSTIC VSM; X A**" }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-57", "text": "Acoustic VSM is constructed in two steps: 1) extracting the bottleneck features (BNF) from speech and 2) modeling the BNF using the i-Vector extraction framework." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-58", "text": "We use the same Deep Neural Network (DNN) based ASR system to extract the BNF as in our previous works [2, 13] ." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-59", "text": "Two DNNs are used with 5 hidden layers and 1 Bottleneck Layer, all having sigmoidal neurons." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-60", "text": "Tied-phone states are used as the targets of the DNNs." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-61", "text": "The target labels of dimension 3040 are provided by a GMM-HMM baseline system trained on 60 hours of Arabic Broadcast speech [14] ." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-62", "text": "Input to the DNN consists of 11 consecutive frames stacked together, where for each frame 23 fbank features along with pitch and voicing probability are extracted." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-63", "text": "The outputs of the BN layer of the first DNN are fed as inputs to the second DNN, which acts as a correction DNN for the first model." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-64", "text": "Time offsets at the input layer of the second DNN are -10, -5, 0, 5 and 10, giving an overall context of 31 frames at the input of the second DNN." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-65", "text": "The BNF from the first DNN are used in the i-Vector modeling framework." 
}, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-66", "text": "The i-Vector modeling framework consists of building a GMM-UBM on a large amount of data using acoustic features (BNF), to model the dialectal feature space." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-67", "text": "The sufficient statistics of the GMM-UBM give a general idea of the data spread in the high dimensional Vector Space." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-68", "text": "The GMM-UBM mean supervector is updated while adapting it to each utterance." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-69", "text": "This update information is encoded in a low dimensional latent vector known as an i-Vector." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-70", "text": "The latent variable model used to extract the i-Vector is called the Total Variability Subspace Model and is given by the equation:" }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-71", "text": "where u is the GMM-UBM mean supervector." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-72", "text": "v is the latent vector, known as the i-Vector, and T is the lower dimensional Vector Subspace." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-73", "text": "The parameters of the model are estimated using the Maximum Likelihood training criterion." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-74", "text": "For a detailed explanation of the i-Vector modeling framework, the reader is directed to the excellent work in [15, 11] ." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-75", "text": "In this work, the GMM-UBM model has 2048 Gaussian components, MFCC features are extracted using a 25 ms window and the i-Vectors are 400 dimensional [2] ." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-76", "text": "Finally, we construct the acoustic VSM, X A \u2208 R N \u00d7400 , where the i th row is the 400 dimensional i-Vector representation corresponding to the speech utterance, a i ." 
}, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-77", "text": "We also perform Linear Discriminant Analysis (LDA) and Within Class Co-variance Normalization (WCCN) on the Acoustic Vector Space, to increase the discriminative strength of the VSM." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-78", "text": "This method has been shown to improve DID (LID) performance [2, 11] ." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-79", "text": "Here, we give a brief overview of the mathematical foundations of CCA." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-80", "text": "Fig 2 gives a probabilistic graphical model of CCA." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-81", "text": "Nodes of the graph represent Random Variables (RVs) and the structure encodes conditional independence assumptions." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-82", "text": "X P and X A are the Random Variables corresponding to the Phonotactic and Acoustic views of the data." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-83", "text": "Each data view is associated with two latent variables: 1) Z C , which is shared, and is the variable of interest that will form the final combined VSM, Z C , and 2) \u03c6 p and \u03c6 a , which are the subspaces associated with the Phonotactic and Acoustic views, respectively." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-84", "text": "CCA attempts to estimate \u03c6 p and \u03c6 a such that the correlation between the projections of the phonotactic feature vectors, p, on \u03c6 p and the acoustic feature vectors, a, on \u03c6 a is maximized." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-85", "text": "Hence, CCA can be posed as the following optimization problem [16] ." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-86", "text": "The above optimization formulation can be massaged into the following eigenvalue problem." 
}, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-87", "text": "For details see [17] ." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-88", "text": "An equivalent SVD formulation of the above eigenvalue problem is given below, which allows us to find \u03c6 a and \u03c6 p by performing SVD of C" }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-89", "text": ". For the proof of this formulation, the reader is referred to [16] ." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-90", "text": "We use the above formulation in this paper to learn the latent Vector Subspaces, \u03c6 p and \u03c6 a ." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-91", "text": "----------------------------------" }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-92", "text": "**MODELING**" }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-93", "text": "Given the two views, X P \u2208 R N \u00d71200 and X A \u2208 R N \u00d7400 , for our speech data, we form a shared VSM, Z C \u2208 R N \u00d72c , using the CCA formulation discussed in Section 2.3.1." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-94", "text": "Note that X P , X A and Z C are instantiations of the RVs X P , X A and Z C respectively, as given in Fig 2." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-95", "text": "Z C 's construction can be concisely given by the following two equations." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-96", "text": "where CCA is performed using the SVD formulation of Equation 3." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-97", "text": "In our case, the shared VSM's dimensionality is 600, i.e. c = 300." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-98", "text": "This value is the optimal VSM dimensionality, decided experimentally." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-99", "text": "We also perform LDA and WCCN to increase the discriminability of the shared Vector space." 
}, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-100", "text": "----------------------------------" }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-101", "text": "**DATA USED**" }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-102", "text": "Training and test data used in this work are the same as used in [2] ." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-103", "text": "Table 1 gives the number of hours of data available for each dialect for training and testing." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-104", "text": "Table 1 (Number of hours of training and testing data for each dialect): Train 13, 10, 11, 9, 10; Test 2, 2, 2, 2, 2. Table 2 shows the number of speech utterances that are available for training and testing the DID system." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-105", "text": "Table 2 (Number of training and test utterances for DID system development): Train 1720, 1907, 1059, 1934, 1820; Test 315, 348, 238, 355, 265." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-106", "text": "----------------------------------" }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-107", "text": "**DATA EGY GLF LAV NOR MSA**" }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-108", "text": "Training data consist of recordings from the Arabic Broadcast domain and contain utterances spoken in all five dialects: EGY, GLF, LAV, MSA and NOR." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-109", "text": "The test set is from the same broadcast domain but is collected from Al-Jazeera and hence, unlike the training data set, the recordings are of high quality." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-110", "text": "The test set is labeled using CrowdFlower, a crowd-sourcing platform, by QCRI and is publicly available on their web portal." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-111", "text": "More details about the train and test data can be found in [2, 18] ." 
}, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-112", "text": "Fig 3 gives an overview of our DID system, which can be seen as a combination of two broad components: 1) a Vector Space Modeling component and 2) a back-end classifier." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-113", "text": "----------------------------------" }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-115", "text": "The most important pieces of the DID system are the four latent Vector Subspaces: 1) \u03a0, which is learned by performing SVD on the n-gram phonotactic term-document matrix (Section 2.1); 2) T, the total variability subspace that is learned in an unsupervised manner in the i-Vector framework (Section 2.2); 3) \u03c6 p , the latent vector subspace corresponding to the Phonotactic VSM learned in the CCA Vector Space modeling framework; and 4) \u03c6 a , the latent vector subspace corresponding to the Acoustic VSM learned in the CCA vector space modeling framework (Section 2.3)." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-116", "text": "The shared VSM is then fed to a back-end discriminative classifier." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-117", "text": "In our case, we use multi-class logistic regression, also known as softmax classification, for DID [19, 20] ." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-118", "text": "The parameter settings used for the softmax classifier can be found in Table 3 ." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-119", "text": "We use elastic net regularization, i.e. using both an L1 and an L2 regularizer." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-120", "text": "Stochastic Gradient Descent is used for learning the model parameters." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-121", "text": "The log-entropy loss function is used as the model training objective." 
}, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-122", "text": "----------------------------------" }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-123", "text": "**MODEL**" }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-124", "text": "Table 3 (Softmax classifier settings): L1-ratio 0.5, L2-ratio 0.5, loss log-entropy, training SGD. Table 4 gives the performance of different VSMs on the Arabic DID task." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-125", "text": "The standalone Phonotactic VSM, X P , performs rather poorly as compared to the standalone Acoustic VSM, X A ." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-126", "text": "CCA VSM, Z C , which is formed by combining X P and X A using CCA, performs better than both standalone VSMs, which is an expected outcome because Z C carries dialect discriminative information from both X P and X A ." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-127", "text": "----------------------------------" }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-128", "text": "**EXPERIMENTS AND RESULTS**" }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-129", "text": "----------------------------------" }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-130", "text": "**DIALECT IDENTIFICATION RESULTS**" }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-131", "text": "Further improvements in performance are due to performing LDA and WCCN on Z C ." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-132", "text": "LDA is a supervised dimensionality reduction technique that forms a lower dimensional Vector Subspace such that the class separability between the data points is maximized, and hence the performance gain that we see due to LDA is to be expected." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-133", "text": "An interesting observation to note is that the LDA based Acoustic VSM, X A+LDA+WCCN , performs at par with the LDA based CCA VSM, Z C+LDA+WCCN ." 
}, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-134", "text": "This shows that some of the useful discriminative information from the Acoustic VSM is lost while performing CCA and hence we finally concatenate the two LDA based VSMs to get the final accuracy of 60% on the DID task." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-135", "text": "Table 4 ." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-136", "text": "DID Results with Different VSMs." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-137", "text": "Accuracy, Precision and Recall Table 5 gives the confusion matrix for the Arabic DID task." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-138", "text": "We can infer the following: 1) EGY is confused most often with LAV, 2) GLF is confused most often with LAV and EGY, 3) LAV most often confused with EGY and GLF, 4) MSA is pretty well discriminated, which can also be confirmed by the VSM projection given by Fig 5.1, 5 ) NOR is most confused with EGY and LAV, which can also be seen in the VSM projection, where LAV is given by cyan region while EGY and LAV are given by blue and green dots." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-139", "text": "Table 5 ." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-140", "text": "DID Results with Different VSMs." 
}, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-141", "text": "Accuracy, Precision and Recall" }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-142", "text": "----------------------------------" }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-143", "text": "**CONFUSION**" }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-144", "text": "----------------------------------" }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-145", "text": "**CONCLUSIONS**" }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-146", "text": "In this work, we showed our innovative approach to construct a single VSM for DID that carries dialect discriminative information from both the Acoustic and Phonotactic VSMs." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-147", "text": "To that end, we use a well known multi-view dimensionality reduction known as Canonical Correlation Analysis (CCA)." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-148", "text": "The single VSM constructed performed better than any of the Phonotcatic or Acoustic VSMs alone, but the LDA based CCA and Acoustic VSMs performed at par on the DID taks, while their combination gave us our best VSM for Arabic DID." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-149", "text": "We conclude that some dialect specific discriminative information is lost while performing CCA between Acoustic and Phonotactic VSMs and hence the final combination performs better." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-150", "text": "CCA is an unsupervised method and can easily incorporate unlabeled data from a different domain and act as a domain adaptation or semi-supervised learning method such as co-training as shown in [21] ." }, { "sent_id": "6e92b1fa4f3b78a099cb222b3eb9a9-C001-151", "text": "We leave co-training and domain adaptation using CCA for our future work." 
} ], "y": { "@SIM@": { "gold_contexts": [ [ "6e92b1fa4f3b78a099cb222b3eb9a9-C001-15" ], [ "6e92b1fa4f3b78a099cb222b3eb9a9-C001-58" ], [ "6e92b1fa4f3b78a099cb222b3eb9a9-C001-75" ], [ "6e92b1fa4f3b78a099cb222b3eb9a9-C001-102" ] ], "cite_sentences": [ "6e92b1fa4f3b78a099cb222b3eb9a9-C001-15", "6e92b1fa4f3b78a099cb222b3eb9a9-C001-58", "6e92b1fa4f3b78a099cb222b3eb9a9-C001-75", "6e92b1fa4f3b78a099cb222b3eb9a9-C001-102" ] }, "@DIF@": { "gold_contexts": [ [ "6e92b1fa4f3b78a099cb222b3eb9a9-C001-25" ], [ "6e92b1fa4f3b78a099cb222b3eb9a9-C001-31" ] ], "cite_sentences": [ "6e92b1fa4f3b78a099cb222b3eb9a9-C001-25", "6e92b1fa4f3b78a099cb222b3eb9a9-C001-31" ] }, "@EXT@": { "gold_contexts": [ [ "6e92b1fa4f3b78a099cb222b3eb9a9-C001-25" ], [ "6e92b1fa4f3b78a099cb222b3eb9a9-C001-31" ] ], "cite_sentences": [ "6e92b1fa4f3b78a099cb222b3eb9a9-C001-25", "6e92b1fa4f3b78a099cb222b3eb9a9-C001-31" ] }, "@BACK@": { "gold_contexts": [ [ "6e92b1fa4f3b78a099cb222b3eb9a9-C001-47" ], [ "6e92b1fa4f3b78a099cb222b3eb9a9-C001-78" ], [ "6e92b1fa4f3b78a099cb222b3eb9a9-C001-111" ] ], "cite_sentences": [ "6e92b1fa4f3b78a099cb222b3eb9a9-C001-47", "6e92b1fa4f3b78a099cb222b3eb9a9-C001-78", "6e92b1fa4f3b78a099cb222b3eb9a9-C001-111" ] }, "@USE@": { "gold_contexts": [ [ "6e92b1fa4f3b78a099cb222b3eb9a9-C001-58" ], [ "6e92b1fa4f3b78a099cb222b3eb9a9-C001-75" ], [ "6e92b1fa4f3b78a099cb222b3eb9a9-C001-102" ] ], "cite_sentences": [ "6e92b1fa4f3b78a099cb222b3eb9a9-C001-58", "6e92b1fa4f3b78a099cb222b3eb9a9-C001-75", "6e92b1fa4f3b78a099cb222b3eb9a9-C001-102" ] } } }, "ABC_866ae880aa0de1e60d306eac2e66fc_19": { "x": [ { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-2", "text": "A key aspect of visual question answering (VQA) models that are interpretable is their ability to ground their answers to relevant regions in the image." 
}, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-3", "text": "Current approaches with this capability rely on supervised learning and human annotated groundings to train attention mechanisms inside the VQA architecture." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-4", "text": "Unfortunately, obtaining human annotations specific for visual grounding is difficult and expensive." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-5", "text": "In this work, we demonstrate that we can effectively train a VQA architecture with grounding supervision that can be automatically obtained from available region descriptions and object annotations." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-6", "text": "We also show that our model trained with this mined supervision generates visual groundings that achieve a higher correlation with respect to manually-annotated groundings, meanwhile achieving state-of-the-art VQA accuracy." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-7", "text": "----------------------------------" }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-9", "text": "We are interested in the problem of visual question answering (VQA), where an algorithm is presented with an image and a question that is formulated in natural language and relates to the contents of the image." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-10", "text": "The goal of this task is to get the algorithm to correctly answer the question." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-11", "text": "The VQA task has recently received significant attention from the computer vision community, in particular because obtaining high accuracies would presumably require precise understanding of both natural language as well as visual stimuli." 
}, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-12", "text": "In addition to serving as a milestone towards visual intelligence, there are practical applications such as development of tools for the visually impaired." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-13", "text": "The problem of VQA is challenging due to the complex interplay between the language and visual modalities." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-14", "text": "On one hand, VQA algorithms must be able to parse and interpret the input question, which is provided in natural language [8, 14, 9] ." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-15", "text": "This may potentially involve understanding of nouns, verbs and other linguistic elements, as well as their visual significance." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-16", "text": "On the other hand, the algorithms Figure 1 ." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-17", "text": "Interpretable VQA algorithms must ground their answer into image regions that are relevant to the question." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-18", "text": "In this paper, we aim at providing this ability by leveraging existing region descriptions and object annotations to construct grounding supervision automatically." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-19", "text": "must analyze the image to identify and recognize the visual elements relevant to the question." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-20", "text": "Furthermore, some questions may refer directly to the contents of the image, but may require external, common sense knowledge to be answered correctly." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-21", "text": "Finally, the algorithms should generate a textual output in natural language that correctly answers the input visual question." 
}, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-22", "text": "In spite of the recent research efforts to address these challenges, the problem remains largely unsolved [22] ." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-23", "text": "We are particularly interested in giving VQA algorithms the ability to identify the visual elements that are relevant to the question." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-24", "text": "In the VQA literature, such ability has been implemented by attention mechanisms." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-25", "text": "Such attention mechanisms generate a heatmap over the input image, which highlights the regions of the image that lead to the answer." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-26", "text": "These heatmaps are interpreted as groundings of the answer to the most relevant areas of the image." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-27", "text": "Generally, these mechanisms have either been considered as latent variables for which there is no supervision, or have been treated as output variables that receive direct supervision from human annotations." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-28", "text": "Unfortunately, both of these approaches have disadvantages." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-29", "text": "First, unsupervised training of attention tends to lead to models that cannot ground their decision in the image in a human interpretable manner." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-30", "text": "Second, supervised training of attention is difficult and expensive: human annotators may consider different regions to be relevant for the question at hand, which entails ambiguity and increased annotation cost." 
}, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-31", "text": "Our goal is to leverage the best of both worlds by providing VQA algorithms with interpretable grounding of their answers, without the need of direct and explicit manual annotation of attention." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-32", "text": "From a practical point of view, as autonomous machines are increasingly finding real world applications, there is an increasing need to provide them with suitable capabilities to explain their decisions." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-33", "text": "However, in most applications, including VQA, current state-of-the-art techniques operate as black-box models that are usually trained using a discriminative approach." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-34", "text": "Similarly to [5] , in this work we show that, in the context of VQA, such approaches lead to internal representations that do not capture the underlying semantic relations between textual questions and visual information." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-35", "text": "Consequently, as we show in this work, current state-ofthe-art approaches for VQA are not able to support their answers with a suitable interpretable representation." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-36", "text": "In this work, we introduce a methodology that provides VQA algorithms with the ability to generate human interpretable attention maps which effectively ground the answer to the relevant image regions." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-37", "text": "We accomplish this by leveraging region descriptions and object annotations available in the Visual Genome dataset, and using these to automatically construct attention maps that can be used for attention supervision, instead of requiring human annotators to manually provide grounding labels." 
}, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-38", "text": "Our framework achieves competitive state-of-the-art VQA performance, while generating visual groundings that outperform other algorithms that use human annotated attention during training." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-39", "text": "The contributions of this paper are: (1) we introduce a mechanism to automatically obtain meaningful attention supervision from both region descriptions and object annotations in the Visual Genome dataset; (2) we show that by using the prediction of region and object label attention maps as auxiliary tasks in a VQA application, it is possible to obtain more interpretable intermediate representations." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-40", "text": "(3) we experimentally demonstrate state-of-the-art performances in VQA benchmarks as well as visual grounding that closely matches human attention annotations." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-41", "text": "----------------------------------" }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-42", "text": "**RELATED WORK**" }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-43", "text": "Since its introduction [8, 14, 9] , the VQA problem has attracted an increasing interest [22] ." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-44", "text": "Its multimodal nature and more precise evaluation protocol than alternative multimodal scenarios, such as image captioning, help to explain this interest." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-45", "text": "Furthermore, the proliferation of suitable datasets and potential applications, are also key elements behind this increasing activity." 
}, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-46", "text": "Most state-of-the-art methods follow a joint embedding approach, where deep models are used to project the textual question and visual input to a joint feature space that is then used to build the answer." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-47", "text": "Furthermore, most modern approaches pose VQA as a classification problem, where classes correspond to a set of pre-defined candidate answers." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-48", "text": "As an example, most entries to the VQA challenge [9] select as output classes the most common 3000 answers in this dataset, which account for 92% of the instances in the validation set." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-49", "text": "The strategy to combine the textual and visual embeddings and the underlying structure of the deep model are key design aspects that differentiate previous works." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-50", "text": "Antol et al. [9] propose an element-wise multiplication between image and question embeddings to generate spatial attention map." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-51", "text": "Fukui et al. [6] propose multimodal compact bilinear pooling (MCB) to efficiently implement an outer product operator that combines visual and textual representations." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-52", "text": "Yu et al. [26] extend this pooling scheme by introducing a multi-modal factorized bilinear pooling approach (MFB) that improves the representational capacity of the bilinear operator." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-53", "text": "They achieve this by adding an initial step that efficiently expands the textual and visual embeddings to a high-dimensional space." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-54", "text": "In terms of structural innovations, Noh et al. 
[16] embed the textual question as an intermediate dynamic bilinear layer of a ConvNet that processes the visual information." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-55", "text": "Andreas et al. [2] propose a model that learns a set of task-specific neural modules that are jointly trained to answer visual questions." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-56", "text": "Following the successful introduction of soft attention in neural machine translation applications [3] , most modern VQA methods also incorporate a similar mechanism." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-57", "text": "The common approach is to use a one-way attention scheme, where the embedding of the question is used to generate a set of attention coefficients over a set of predefined image regions." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-58", "text": "These coefficients are then used to weight the embedding of the image regions to obtain a suitable descriptor [19, 21, 6, 25, 26] ." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-59", "text": "More elaborated forms of attention has also been proposed." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-60", "text": "Xu and Saenko [23] suggest use wordlevel embedding to generate attention." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-61", "text": "Yang et al. [24] iterates the application of a soft-attention mechanism over the visual input as a way to progressively refine the location of relevant cues to answer the question." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-62", "text": "Lu et al. [13] proposes a bidirectional co-attention mechanism that besides the question guided visual attention, also incorporates a visual guided attention over the input question." 
}, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-63", "text": "In all the previous cases, the attention mechanism is applied using an unsupervised scheme, where attention coefficients are considered as latent variables." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-64", "text": "Recently, there have been also interest on including a supervised attention scheme to the VQA problem [5, 7, 18] ." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-65", "text": "Das et al. [5] compare the image areas selected by humans and state-of-theart VQA techniques to answer the same visual question." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-66", "text": "To achieve this, they collect the VQA human attention dataset (VQA-HAT), a large dataset of human attention maps built by asking humans to select images areas relevant to answer questions from the VQA dataset [9] ." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-67", "text": "Interestingly, this study concludes that current machine-generated atten-tion maps exhibit a poor correlation with respect to the human counterpart, suggesting that humans use different visual cues to answer the questions." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-68", "text": "At a more fundamental level, this suggests that the discriminative nature of most current VQA systems does not effectively constraint the attention modules, leading to the encoding of discriminative cues instead of the underlying semantic that relates a given question-answer pair." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-69", "text": "Our findings in this work support this hypothesis." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-70", "text": "Related to the work in [5] , Gan et al. [7] apply a more structured approach to identify the image areas used by humans to answer visual questions." 
}, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-71", "text": "For VQA pairs associated to images in the COCO dataset, they ask humans to select the segmented areas in COCO images that are relevant to answer each question." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-72", "text": "Afterwards, they use these areas as labels to train a deep learning model that is able to identify attention features." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-73", "text": "By augmenting a standard VQA technique with these attention features, they are able to achieve a small boost in performance." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-74", "text": "Closely related to our approach, Qiao et al. [18] use the attention labels in the VQA-HAT dataset to train an attention proposal network that is able to predict image areas relevant to answer a visual question." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-75", "text": "This network generates a set of attention proposals for each image in the VQA dataset, which are used as labels to supervise attention in the VQA model." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-76", "text": "This strategy results in a small boost in performance compared with a non-attentional strategy." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-77", "text": "In contrast to our approach, these previous works are based on a supervised attention scheme that does not consider an automatic mechanism to obtain the attention labels." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-78", "text": "Instead, they rely on human annotated groundings as attention supervision." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-79", "text": "Furthermore, they differ from our work in the method to integrate attention labels to a VQA model." 
}, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-80", "text": "Specifically, we take the mined grounding labels as weakly-supervised signals and denote two types of attention supervision, namely region-level and object-level labels." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-81", "text": "Figure 2 shows the main pipeline of our VQA model." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-82", "text": "We mostly build upon the MCB model in [6] , which exemplifies current state-of-the-art techniques for this problem." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-83", "text": "Our main innovation to this model is the addition of an Attention Supervision Module that incorporates visual grounding as an auxiliary task." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-84", "text": "Next we describe the main modules behind this model." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-85", "text": "Question Attention Module: Questions are tokenized and passed through an embedding layer, followed by an LSTM layer that generates the question features Q f \u2208 R T \u00d7D , where T is the maximum number of words in the tokenized version of the question and D is the dimensionality of the hidden state of the LSTM." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-86", "text": "Additionally, following [25] , a question attention mechanism is added that generates question attention coefficients C q \u2208 R T \u00d7G q , where G q is the so-called number of \"glimpses\"." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-87", "text": "The purpose of G q is to allow the model to predict multiple attention maps so as to increase its expressiveness." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-88", "text": "Here, we use G q = 2." 
}, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-89", "text": "The weighted question features Q w \u2208 R G q D are then computed using a soft attention mechanism [3] , which is essentially a weighted sum of the T word features followed by a concatenation according to G q ." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-90", "text": "----------------------------------" }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-91", "text": "**VQA MODEL STRUCTURE**" }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-92", "text": "Image Attention Module: Images are passed through an embedding layer consisting of a pre-trained ConvNet model, such as Resnet pretrained with the ImageNet dataset [10] ." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-93", "text": "This generates image features I f \u2208 R C\u00d7H\u00d7W , where C, H and W are depth, height and width of the extracted feature maps." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-94", "text": "Fusion Module I is then used to generate a set of image attention coefficients." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-95", "text": "First, question features Q w are tiled as the same spatial shape of I f ." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-96", "text": "Afterwards, the fusion module models the joint relationship J attn \u2208 R O\u00d7H\u00d7W between questions and images, mapping them to a common space of dimension O. In the simplest case, one can implement the fusion module using either concatenation or Hadamard product [1] , but more effective pooling schemes can be applied [6, 11, 25, 26] ." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-97", "text": "The design choice of the fusion module remains an on-going research topic." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-98", "text": "In general, it should both effectively capture the latent relationship between multi-modal features meanwhile be easy to optimize." 
}, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-99", "text": "The fusion results are then passed through an attention module that computes the visual atten-" }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-100", "text": "Classification Module: Using the compact representation of questions Q w and visual information V w , the classification module applies first the Fusion Module II that provides the feature representation of answers J ans \u2208 R L , where L is the latent answer space." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-101", "text": "Afterwards, it computes the logits over a set of predefined candidate answers." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-102", "text": "Following previous work [6] , we use as candidate outputs the top 3000 most frequent answers in the VQA dataset." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-103", "text": "At the end of this process, we obtain the highest scoring answer\u00c2." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-104", "text": "----------------------------------" }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-105", "text": "**ATTENTION SUPERVISION MODULE:**" }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-106", "text": "As a main novelty of the VQA model, we add an Image Attention Supervision Module as an auxiliary classification task, where ground-truth visual grounding labels C gt \u2208 R H\u00d7W \u00d7G v are used to guide the model to focus on meaningful parts of the image to answer each question." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-107", "text": "To do that, we simply treat the generated attention coefficients C v as a probability distribution, and then compare it with the ground-truth using KL-divergence." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-108", "text": "Interestingly, we introduce two attention maps, corresponding to relevant region-level and objectlevel groundings, as shown in Figure 3 ." 
}, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-109", "text": "Sections 4 and 5 provide details about our proposed method to obtain the attention labels and to train the resulting model, respectively." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-110", "text": "J a n s Figure 2 ." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-111", "text": "Schematic diagram of the main parts of the VQA model." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-112", "text": "It is mostly based on the model presented in [6] ." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-113", "text": "Main innovation is the Attention Supervision Module that incorporates visual grounding as an auxiliary task." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-114", "text": "This module is trained through the use of a set of image attention labels that are automatically mined from the Visual Genome dataset." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-115", "text": "----------------------------------" }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-116", "text": "**MINING ATTENTION SUPERVISION FROM VISUAL GENOME**" }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-117", "text": "Visual Genome (VG) [12] includes the largest VQA dataset currently available, which consists of 1.7M QA pairs." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-118", "text": "Furthermore, for each of its more than 100K images, VG also provides region and object annotations by means of bounding boxes." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-119", "text": "In terms of visual grounding, these region and object annotations provide complementary information." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-120", "text": "As an example, as shown in Figure 3 , for questions related to interaction between objects, region annotations result highly relevant." 
}, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-121", "text": "In contrast, for questions related to properties of specific objects, object annotations result more valuable." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-122", "text": "Consequently, in this section we present a method to automatically select region and object annotations from VG that can be used as labels to implement visual grounding as an auxiliary task for VQA." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-123", "text": "For region annotations, we propose a simple heuristic to mine visual groundings: for each (I, Q, A) we enumerate all the region descriptions of I and pick the description D i that has the most (at least two) overlapped informative words with Q and A. Informative words are all nouns and verbs, where two informative words are matched if at least one of the following conditions is met: (1) Their raw text as they appear in Q or A are the same; (2) Their lemmatizations (using NLTK [4] ) are the same; (3) Their synsets in WordNet [15] are the same; (4) Their aliases (provided from VG) are the same." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-124", "text": "We refer to the resulting labels as region-level groundings." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-125", "text": "Figure 3(a) illustrates an example of a region-level grounding." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-126", "text": "In terms of object annotations, for each image in a (I, Q, A) triplet we select the bounding box of an object as a valid grounding label, if the object name matches one of the informative nouns in Q or A. To score each match, we use the same criteria as region-level groundings." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-127", "text": "Additionally, if a triplet (I, Q, A) has a valid region grounding, each corresponding object-level grounding must be inside this region to be accepted as valid." 
}, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-128", "text": "As a further refinement, selected objects grounding are passed through an intersection over union filter to account for the fact that VG usually includes multiple labels for the same object instance." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-129", "text": "As a final consideration, for questions related to counting, region-level groundings are discarded after the corresponding object-level groundings are extracted." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-130", "text": "We refer to the resulting labels as object-level groundings." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-131", "text": "Figure 3 (b) illustrates an example of an object-level grounding." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-132", "text": "As a result, combining both region-level and object-level groundings, about 700K out of 1M (I, Q, A) triplets in VG end up with valid grounding labels." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-133", "text": "We will make these labels publicly available." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-134", "text": "Here \"men\" in the region description is firstly lemmatized to be \"man\", whose aliases contain \"people\"; the word \"talking\" in the answer also contributes to the matching." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-135", "text": "So the selected regions have two matchings which is the most among all candidates." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-136", "text": "(b) Example object-level grounding from VG." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-137", "text": "Left: image with object instance labels; Right: our mined results." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-138", "text": "Note that in this case region-level grounding will give us the same result as in (a), but object-level grounding is clearly more localized." 
}, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-139", "text": "----------------------------------" }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-140", "text": "**IMPLEMENTATION DETAILS**" }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-141", "text": "We build the attention supervision on top of the opensourced implementation of MCB [6] and MFB [25] ." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-142", "text": "Similar to them, We extract the image feature from res5c layer of Resnet-152, resulting in 14 \u00d7 14 spatial grid (H = 14, W = 14, C = 2048)." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-143", "text": "We construct our ground-truth visual grounding labels to be G v = 2 glimpse maps per QA pair, where the first map is object-level grounding and the second map is region-level grounding, as discussed in Section 4. Let (x i min , y i min , x i max , y i max ) be the coordinate of i th selected object bounding box in the grounding labels, then the mined object-level attention maps C 0 gt are:" }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-144", "text": "where I[\u00b7] is the indicator function." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-145", "text": "Similarly, the regionlevel attention maps C 1 gt are:" }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-146", "text": "Afterwards, C 0 gt and C 1 gt are spatially L1-normalized to represent probabilities and concatenated to form" }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-147", "text": "The model is trained using a multi-task loss," }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-148", "text": "where CE denotes cross-entropy and KL denotes KLdivergence." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-149", "text": "\u0398 corresponds to the learned parameters." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-150", "text": "\u03b1(t) is a scalar that weights the loss terms." 
}, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-151", "text": "This scalar decays as a function of the iteration number t. In particular, we choose to use a cosine-decay function:" }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-152", "text": "This is motivated by the fact that the visual grounding labels have some level of subjectivity." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-153", "text": "As an example, Figure 4 (second row) shows a case where the learned attention seems more accurate than the VQA-HAT ground truth." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-154", "text": "Hence, as the model learns suitable parameter values, we gradually loose the penalty on the attention maps to provide more freedom to the model to selectively decide what attention to use." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-155", "text": "It is important to note that, for training samples in VQA-2.0 or VG that do not have region-level or objectlevel grounding labels, \u03b1 = 0 in Equation 3, so the loss is reduced to the classification term only." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-156", "text": "In our experiment, t max is calibrated for each tested model based on the number of training steps." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-157", "text": "In particular, we choose t max = 190K for all MCB models and t max = 160K for others." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-158", "text": "----------------------------------" }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-159", "text": "**EXPERIMENTS**" }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-160", "text": "----------------------------------" }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-161", "text": "**DATASETS**" }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-162", "text": "VQA-2.0: The VQA-2.0 dataset [9] consists of 204721 images, with a total of 1.1M questions and 10 crowdsourced answers per question." 
}, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-163", "text": "There are more than 20 question types, covering a variety of topics and free-form answers." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-164", "text": "The dataset is split into training (82K images and 443K questions), validation (40K images and 214K questions), and testing (81K images and 448K questions) sets." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-165", "text": "The task is to predict a correct answer A given a corresponding image-question pair (I, Q)." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-166", "text": "As a main advantage with respect to version 1.0 [9] , for every question VQA-2.0 includes complementary images that lead to different answers, reducing language bias by forcing the model to use the visual information." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-167", "text": "Visual Genome: The Visual Genome (VG) dataset [12] contains 108077 images, with an average of 17 QA pairs per image." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-168", "text": "We follow the processing scheme from [6] , where non-informative words in the questions and answers such as \"a\" and \"is\" are removed." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-169", "text": "Afterwards, (I, Q, A) triplets with answers to be single keyword and overlapped with VQA-2.0 dataset are included in our training set." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-170", "text": "This adds 97697 images and about 1 million questions to the training set." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-171", "text": "Besides the VQA data, VG also provides on average 50 region descriptions and 30 object instances per image." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-172", "text": "Each region/object is annotated by one sentence/phrase description and bounding box coordinates." 
}, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-173", "text": "VQA-HAT: VQA-HAT dataset [5] contains 58475 human visual attention heat (HAT) maps for (I, Q, A) triplets in VQA-1.0 training set." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-174", "text": "Annotators were shown a blurred image, a (Q, A) pair and were asked to \"scratch\" the image until they believe someone else can answer the question by looking at the blurred image and the sharpened area." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-175", "text": "The" }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-176", "text": "----------------------------------" }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-177", "text": "**RANK CORRELATION**" }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-178", "text": "Accuracy/% VQA-HAT VQA-X VQA-2.0 Human [5] 0.623 -80.62 PJ-X [17] 0.396 0.342 -MCB [6] 0 authors also collect 1374 \u00d7 3 = 4122 HAT maps for VQA-1.0 validation sets, where each of the 1374 (I, Q, A) were labeled by three different annotators, so one can compare the level of agreement among labels." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-179", "text": "We use VQA-HAT to evaluate visual grounding performance, by comparing the rank-correlation between human attention and model attention, as in [5, 17] ." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-180", "text": "VQA-X: VQA-X dataset [17] contains 2000 labeled attention maps in VQA-2.0 validation sets." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-181", "text": "In contrast to VQA-HAT, VQA-X attention maps are in the form of instance segmentations, where annotators were asked to segment objects and/or regions that most prominently justify the answer." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-182", "text": "Hence the attentions are more specific and localized." 
}, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-183", "text": "We use VQA-X to evaluate visual grounding performance by comparing the rank-correlation, as in [5, 17 ]." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-184", "text": "----------------------------------" }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-185", "text": "**RESULTS**" }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-186", "text": "We evaluate the performance of our proposed method using two criteria: i) rank-correlation [20] to evaluate visual grounding and ii) accuracy to evaluate question answering." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-187", "text": "Intuitively, rank-correlation measures the similarity between human and model attention maps under a rank-based metric." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-188", "text": "A high rank-correlation means that the model is 'looking at' image areas that agree to the visual information used by a human to answer the same question." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-189", "text": "In terms of accuracy of a predicted answer\u00c2 is evaluated by: Q: Is the computer on or off? Ans: on Q: What color is the inside of the cats ears?" }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-190", "text": "Ans: pink Q: How many of these animals are there? Ans: 2 Figure 4 ." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-191", "text": "Visual grounding comparison: the first column is the ground-truth human attention in VQA-HAT [5] ; the second column shows the results from pretrained MFH model [26] ; the last column are our Attn-MFH trained with attention supervision." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-192", "text": "We can see that the attention areas considered by our model mimic the attention areas used by humans, but they are more localized in space." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-193", "text": "model in VQA challenge 2017." 
}, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-194", "text": "In Table 1 , we can observe that our proposed model achieves a significantly boost on rank-correlation with respect to human attention." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-195", "text": "Furthermore, our model outperforms alternative state-of-art techniques in terms of accuracy in answer prediction." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-196", "text": "Specifically, the rank-correlation for MFH model increases by 36.4% when is evaluated in VQA-HAT dataset and 7.7% when is evaluated in VQA-X. This indicates that our proposed methods enable VQA models to provide more meaningful and interpretable results by generating more accurate visual grounding." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-197", "text": "Table 1 also reports the result of setting the decaying factor \u03b1(t) in Equation 4 to the fixed value of 1." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-198", "text": "In this case, the model is able to achieve higher rank-correlation, but ac-Q: What is the food item with the green colors?" }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-199", "text": "Ans: broccoli Ans: salad Q: What color is the traffic light? Ans: green Ans: red curacy drops by 2%." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-200", "text": "We observe that as training proceeds, attention loss becomes dominant in the final training steps, which affects the accuracy of the classification module." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-201", "text": "Figure 4 shows qualitative results of the resulting visual grounding, including also a comparison with respect to noattn model." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-202", "text": "We also provide qualitative results in Figure 5 for the VQA-2.0 complementary pairs, which are more difficult as the model needs to overcome the language bias." 
}, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-203", "text": "----------------------------------" }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-204", "text": "**CONCLUSIONS**" }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-205", "text": "In this work we have proposed a new method that is able to slightly outperform current state-of-the-art VQA systems, while also providing interpretable representations in the form of an explicitly trainable visual attention mechanism." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-206", "text": "Specifically, as a main result, our experiments provide evidence that the generated visual groundings achieve high correlation with respect to human-provided attention annotations, outperforming the correlation scores of previous works by a large margin." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-207", "text": "As further contributions, we highlight two relevant in-sides of the proposed approach." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-208", "text": "On one side, by using attention labels as an auxiliary task, the proposed approach demonstrates that is able to constraint the internal representation of the model in such a way that it fosters the encoding of interpretable representations of the underlying relations between the textual question and input image." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-209", "text": "On other side, the proposed approach demonstrates a method to leverage existing datasets with region descriptions and object labels to effectively supervise the attention mechanism in VQA applications, avoiding costly human labeling." }, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-210", "text": "As future work, we believe that the superior visual grounding provided by the proposed method can play a relevant role to generate natural language explanations to justify the answer to a given visual question." 
}, { "sent_id": "866ae880aa0de1e60d306eac2e66fc-C001-211", "text": "This scenario will help to demonstrate the relevance of our technique as a tool to increase the capabilities of AI based technologies to explain their decisions." } ], "y": { "@BACK@": { "gold_contexts": [ [ "866ae880aa0de1e60d306eac2e66fc-C001-51" ], [ "866ae880aa0de1e60d306eac2e66fc-C001-58" ], [ "866ae880aa0de1e60d306eac2e66fc-C001-96" ] ], "cite_sentences": [ "866ae880aa0de1e60d306eac2e66fc-C001-51", "866ae880aa0de1e60d306eac2e66fc-C001-58", "866ae880aa0de1e60d306eac2e66fc-C001-96" ] }, "@DIF@": { "gold_contexts": [ [ "866ae880aa0de1e60d306eac2e66fc-C001-82" ], [ "866ae880aa0de1e60d306eac2e66fc-C001-141" ] ], "cite_sentences": [ "866ae880aa0de1e60d306eac2e66fc-C001-82", "866ae880aa0de1e60d306eac2e66fc-C001-141" ] }, "@EXT@": { "gold_contexts": [ [ "866ae880aa0de1e60d306eac2e66fc-C001-82" ], [ "866ae880aa0de1e60d306eac2e66fc-C001-112" ], [ "866ae880aa0de1e60d306eac2e66fc-C001-141" ] ], "cite_sentences": [ "866ae880aa0de1e60d306eac2e66fc-C001-82", "866ae880aa0de1e60d306eac2e66fc-C001-112", "866ae880aa0de1e60d306eac2e66fc-C001-141" ] }, "@SIM@": { "gold_contexts": [ [ "866ae880aa0de1e60d306eac2e66fc-C001-102" ], [ "866ae880aa0de1e60d306eac2e66fc-C001-168" ], [ "866ae880aa0de1e60d306eac2e66fc-C001-178" ] ], "cite_sentences": [ "866ae880aa0de1e60d306eac2e66fc-C001-102", "866ae880aa0de1e60d306eac2e66fc-C001-168", "866ae880aa0de1e60d306eac2e66fc-C001-178" ] }, "@USE@": { "gold_contexts": [ [ "866ae880aa0de1e60d306eac2e66fc-C001-102" ], [ "866ae880aa0de1e60d306eac2e66fc-C001-168" ] ], "cite_sentences": [ "866ae880aa0de1e60d306eac2e66fc-C001-102", "866ae880aa0de1e60d306eac2e66fc-C001-168" ] } } }, "ABC_2407cfa8572ccbab7f9a081f45a4ad_19": { "x": [ { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-2", "text": "Sense annotation and lexicon building are costly affairs demanding prudent investment of resources." 
}, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-3", "text": "Recent work on multilingual WSD has shown that it is possible to leverage the annotation work done for WSD of one language (S L ) for another (T L ), by projecting Wordnet and sense marked corpus parameters of S L to T L ." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-4", "text": "However, this work does not take into account the cost of manually cross-linking the words within aligned synsets." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-5", "text": "Further, it does not answer the question of \"Can better accuracy be achieved if a user is willing to pay additional money?\" We propose a measure for cost-benefit analysis which measures the \"value for money\" earned in terms of accuracy by investing in annotation effort and lexicon building." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-6", "text": "Two key ideas explored in this paper are (i) the use of probabilistic crosslinking model to reduce manual crosslinking effort and (ii) the use of selective sampling to inject a few training examples for hard-to-disambiguate words from the target language to boost the accuracy." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-7", "text": "----------------------------------" }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-9", "text": "Word Sense Disambiguation (WSD) is one of the most widely investigated problems of Natural Language Processing (NLP)." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-10", "text": "Previous works have shown that supervised approaches to Word Sense Disambiguation which rely on sense annotated corpora (Ng and Lee, 1996; Lee et al., 2004) outperform unsupervised (Veronis, 2004 ) and knowledge based approaches (Mihalcea, 2005) ." 
}, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-11", "text": "However, creation of sense marked corpora has always remained a costly proposition, especially for some of the resource deprived languages." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-12", "text": "To circumvent this problem, Khapra et al. (2009) proposed a WSD method that can be applied to a language even when no sense tagged corpus for that language is available." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-13", "text": "This is achieved by projecting Wordnet and corpus parameters from another language to the language in question." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-14", "text": "The approach is centered on a novel synset based multilingual dictionary (Mohanty et al., 2008) where the synsets of different languages are aligned and thereafter the words within the synsets are manually cross-linked." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-15", "text": "For example, the word W L 1 belonging to synset S of language L 1 will be manually cross-linked to the word W L 2 of the corresponding synset in language L 2 to indicate that W L 2 is the best substitute for W L 1 according to an experienced bilingual speaker's intuition." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-16", "text": "We extend their work by addressing the following question on the economics of annotation, lexicon building and performance:" }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-17", "text": "\u2022 Is there an optimal point of balance between the annotation effort and the lexicon building (i.e. 
manual cross-linking)" }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-18", "text": "----------------------------------" }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-19", "text": "**EFFORT AT WHICH ONE CAN BE ASSURED OF BEST VALUE FOR MONEY IN TERMS OF ACCURACY?**" }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-20", "text": "To address the above question we first propose a probabilistic cross linking model to eliminate the effort of manually cross linking words within the source and target language synsets and calibrate the resultant trade-off in accuracy." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-21", "text": "Next, we show that by injecting examples for most frequent hard-to-disambiguate words from the target domain one can achieve higher accuracies at optimal cost of annotation." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-22", "text": "Finally, we propose a measure for cost-benefit analysis which identifies the optimal point of balance between these three related entities, viz., cross-linking, sense annotation and accuracy of disambiguation." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-23", "text": "The remainder of this paper is organized as follows." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-24", "text": "In section 2 we present related work." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-25", "text": "In section 3 we describe the Synset based multilingual dictionary which enables parameter projection." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-26", "text": "In section 4 we discuss the work of Khapra et al. (2009) on parameter projection for multilingual WSD." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-27", "text": "Section 5 is on the economics of multilingual WSD." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-28", "text": "In section 6 we propose a probabilistic model for representing the cross-linkage of words within synsets." 
}, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-29", "text": "In section 7 we present a strategy for injecting hard-to-disambiguate cases from the target language using selective sampling." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-30", "text": "In section 8 we introduce a measure for cost-benefit analysis for calculating the value for money in terms of accuracy, annotation effort and lexicon building effort." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-31", "text": "In section 9 we describe the experimental setup." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-32", "text": "In section 10 we present the results followed by discussion in section 11." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-33", "text": "Section 12 concludes the paper." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-34", "text": "----------------------------------" }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-35", "text": "**RELATED WORK**" }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-36", "text": "Knowledge based approaches to WSD such as Lesk's algorithm (Lesk, 1986 ), Walker's algorithm (Walker and Amsler, 1986) , Conceptual Density (Agirre and Rigau, 1996) and PageRank (Mihalcea, 2005) are less demanding in terms of resources but fail to deliver good results." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-37", "text": "Supervised approaches like SVM (Lee et al., 2004) and k-NN (Ng and Lee, 1996) , on the other hand, give better accuracies, but the requirement of large annotated corpora renders them unsuitable for resource scarce languages." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-38", "text": "Recent work by Khapra et al. (2009) has shown that it is possible to project the parameters learnt from the annotation work of one language to another language provided aligned Wordnets for two languages are available." 
}, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-39", "text": "However, their work does not address the question of further improving the accuracy of WSD by using a small amount of training data from the target language." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-40", "text": "Some similar work has been done in the area of domain adaptation where Chan et al. (2007) showed that adding just 30% of the target data to the source data achieved the same performance as that obtained by taking the entire source and target data." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-41", "text": "Similarly, Agirre and de Lacalle (2009) reported a 22% error reduction when source and target data were combined for training a classifier, compared to the case when only the target data was used for training the classifier." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-42", "text": "However, such combining of training statistics has not been tried in cases where the source data is in one language and the target data is in another language." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-43", "text": "To the best of our knowledge, no previous work has attempted to perform resource conscious allwords multilingual Word Sense Disambiguation by finding a trade-off between the cost (in terms of annotation effort and lexicon creation effort) and the quality in terms of F-score." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-44", "text": "----------------------------------" }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-45", "text": "**SYNSET BASED MULTILINGUAL DICTIONARY**" }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-46", "text": "A novel and effective method of storage and use of dictionary in a multilingual setting was proposed by Mohanty et al. (2008) ." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-47", "text": "For the purpose of current discussion, we will refer to this multilingual dictionary framework as MultiDict." 
}, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-48", "text": "One important departure in this framework from the traditional dictionary is that synsets are linked, and after that the words inside the synsets are linked." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-49", "text": "The basic mapping is thus between synsets and thereafter between the words." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-50", "text": "Khapra et al. (2009) proposed that the various parameters essential for domain-specific Word Sense Disambiguation can be broadly classified into two categories:" }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-51", "text": "----------------------------------" }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-52", "text": "**PARAMETER PROJECTION**" }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-53", "text": "----------------------------------" }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-54", "text": "**WORDNET-DEPENDENT PARAMETERS:**" }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-55", "text": "\u2022 belongingness-to-dominant-concept" }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-56", "text": "----------------------------------" }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-57", "text": "**CORPUS-DEPENDENT PARAMETERS:**" }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-58", "text": "\u2022 sense distributions \u2022 corpus co-occurrence They proposed a scoring function (Equation (1)) which combines these parameters to identify the correct sense of a word in a context:" }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-59", "text": "where, i \u2208 Candidate Synsets J = Set of disambiguated words" }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-60", "text": "The first component \u03b8 i V i of Equation (1) captures influence of the corpus specific sense of a word in a domain." 
}, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-61", "text": "The other component W ij * V i * V j captures the influence of interaction of the candidate sense with the senses of context words weighted by factors of co-occurrence, conceptual distance and semantic distance." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-62", "text": "Wordnet-dependent parameters depend on the structure of the Wordnet whereas the Corpusdependent parameters depend on various statistics learnt from a sense marked corpora." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-63", "text": "Both the tasks of (a) constructing a Wordnet from scratch and (b) collecting sense marked corpora for multiple languages are tedious and expensive." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-64", "text": "Khapra et al. (2009) observed that by projecting relations from the Wordnet of a language and by projecting corpus statistics from the sense marked corpora of the language to those of the target language, the effort required in constructing semantic graphs for multiple Wordnets and collecting sense marked corpora for multiple languages can be avoided or reduced." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-65", "text": "At the heart of their work lies the MultiDict described in previous section which facilitates parameter projection in the following manner: 1." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-66", "text": "By linking with the synsets of a pivot resource rich language (Hindi, in our case), the cost of building Wordnets of other languages is partly reduced (semantic relations are inherited)." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-67", "text": "The Wordnet parameters of Hindi Wordnet now become projectable to other languages." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-68", "text": "2. For calculating corpus specific sense distributions, P (Sense S i |W ord W ), we need the counts, #(S i , W )." 
}, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-69", "text": "By using cross linked words in the synsets, these counts become projectable to the target language (Marathi, in our case) as they can be approximated by the counts of the cross linked Hindi words calculated from the Hindi sense marked corpus as follows:" }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-70", "text": "The rationale behind the above approximation is the observation that within a domain sense distributions remain the same across languages." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-71", "text": "----------------------------------" }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-72", "text": "**THE ECONOMICS OF MULTILINGUAL WSD**" }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-73", "text": "The problem of multilingual WSD using parameter projection can be viewed as an economic system consisting of three factors." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-74", "text": "The first factor is the cost of manually cross-linking the words in a synsets of the target language to the words in the corresponding synset in the pivot language." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-75", "text": "The second factor is the cost of sense annotated data from the target language." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-76", "text": "The third factor is the accuracy of WSD The first two factors in some sense relate to the cost of purchasing a commodity and the third factor relates to the commodity itself." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-77", "text": "The work of Khapra et al. (2009) as described above does not attempt to reach an optimal costbenefit point in this economic system." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-78", "text": "They place their bets on manual cross-linking only and settle for the accuracy achieved thereof." 
}, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-79", "text": "Specifically, they do not explore the inclusion of small amount of annotated data from the target language to boost the accuracy (as mentioned earlier, supervised systems which use annotated data from the target language are known to perform better)." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-80", "text": "Further, it is conceivable that with respect to accuracy-cost trade-off, there obtains a case for balancing one cost against the other, viz., the cost of cross-linking and the cost of annotation." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-81", "text": "In some cases bilingual lexicographers (needed for manual cross-linking) may be more expensive compared to monolingual annotators." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-82", "text": "There it makes sense to place fewer bets on manual crosslinking and more on collecting annotated corpora." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-83", "text": "On the other hand if manual cross-linking is cheap then a very small amount of annotated corpora can be used in conjunction with full manual crosslinking to boost the accuracy." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-84", "text": "Based on the above discussion, if k a is the cost of sense annotating one word, k c is the cost of manually cross-linking a word and A is the accuracy desired then the problem of multilingual WSD can be cast as an optimization problem:" }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-85", "text": "Accuracy \u2265 A where, w c and w a are the number of words to be manually cross linked and annotated respectively." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-86", "text": "Ours is thus a 3-factor economic model (crosslinking, annotation and accuracy) as opposed to the 2-factor model (cross-linking, accuracy) proposed by Khapra et al. (2009) ." 
}, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-87", "text": "----------------------------------" }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-88", "text": "**OPTIMAL CROSS-LINKING**" }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-89", "text": "As mentioned earlier, in some cases where bilingual lexicographers are expensive we might be interested in reducing the effort of manual crosslinking." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-90", "text": "For such situations, we propose that only a small number of words, comprising of the most frequently appearing ones should be manually cross linked and the rest of the words should be cross-linked using a probabilistic model." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-91", "text": "The rationale here is simple: invest money in words which are bound to occur frequently in the test data and achieve maximum impact on the accuracy." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-92", "text": "In the following paragraphs, we explain our probabilistic cross linking model." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-93", "text": "The model proposed by Khapra et al. (2009 ) is a deterministic model where the expected count for (Sense S, Marathi Word W ), i.e., the number of times the word W appears in sense S is approximated by the count for the corresponding cross linked Hindi word." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-94", "text": "Such a model assumes that each Marathi word links to appropriate Hindi word(s) as identified manually by a lexicographer." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-95", "text": "Instead, we propose a probabilistic model where a Marathi word can link to every word in the corresponding Hindi synset with some probability." 
}, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-96", "text": "The expected count for (S, W ) can then be estimated as:" }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-97", "text": "where, P (h i |W, S) is the probability that the word h i from the corresponding Hindi synset is the correct cross-linked word for the given Marathi word." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-98", "text": "For example, one of the senses of the Marathi word maan is {neck} i.e. \"the body part which connects the head to the rest of the body\"." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-99", "text": "The corresponding Hindi synset has 10 words {gardan, gala, greeva, halak, kandhar and so on}. Thus, using Equation (2), the expected count, E[C({neck}, maan)], is calculated as:" }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-100", "text": "----------------------------------" }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-101", "text": "**SO ON F OR ALL WORDS IN THE HINDI SYNSET**" }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-102", "text": "Instead of using a uniform probability distribution over the Hindi words we go by the empirical observation that some words in a synset are more representative of that sense than other words, i.e. some words are more preferred while expressing that sense." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-103", "text": "For example, out of the 10 words in the Hindi synset only 2 words {gardan, gala} appeared in the corpus." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-104", "text": "We thus estimate the value of P (h i |W, S) empirically from the Hindi sense marked corpus by making the following independence assumption:" }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-105", "text": "The rationale behind the above independence assumption becomes clear if we represent words and synsets using the Bayesian network of Figure 1 ." 
}, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-106", "text": "Here, the Hindi word h i and the Marathi word W Figure 1 : Bayesian network formed by a synset S and the constituent Hindi and Marathi words are considered to be derived from the same parent concept S. In other words, they represent two different manifestations-one in Hindi and one in Marathi-of the same synset S. Given the above representation, it is easy to see that given the parent synset S, the Hindi word h i is independent of the Marathi word W ." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-107", "text": "----------------------------------" }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-108", "text": "**OPTIMAL ANNOTATION USING SELECTIVE SAMPLING**" }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-109", "text": "In the previous section we dealt with the question of optimal cross-linking." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-110", "text": "Now we take up the other dimension of this economic system, viz., optimal use of annotated corpora for better accuracy." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-111", "text": "In other words, if an application demands higher accuracy for WSD and is willing to pay for some annotation then there should be a way of ensuring best possible accuracy at lowest possible cost." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-112", "text": "This can be done by including small amount of sense annotated data from the target language." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-113", "text": "The simplest strategy is to randomly annotate text from the target language and use it as training data." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-114", "text": "However, this strategy of random sampling may not be the most optimum in terms of cost." 
}, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-115", "text": "Instead, we propose a selective sampling strategy where the aim is to identify hard-to-disambiguate words from the target language and use them for training." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-116", "text": "The algorithm proceeds as follows: 1. First, using the probabilistic cross linking model and aligned Wordnets we learn the parameters described in Section 4. 2." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-139", "text": "The data was collected from two domains, viz., Tourism and Health." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-117", "text": "We then apply this scoring function on untagged examples (development set) from the target language and identify hard-to-disambiguate words i.e., the words which were disambiguated with a very low confidence." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-118", "text": "3. Training instances of these words are then injected into the training data and the parameters learnt from them are used instead of the projected parameters learnt from the source language corpus." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-119", "text": "Thus, the selective sampling strategy ensures that we get maximum value for money by spending it on annotating only those words which would otherwise not have been disambiguated correctly." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-120", "text": "A random selection strategy, in contrast, might bring in words which were disambiguated correctly using only the projected parameters." 
}, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-121", "text": "----------------------------------" }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-122", "text": "**A MEASURE FOR COST-BENEFIT ANALYSIS**" }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-123", "text": "We need a measure for cost-benefit analysis based on the three dimensions of our economic system, viz., annotation effort, lexicon creation effort and performance in terms of F-score." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-124", "text": "The first two dimensions can be fused into a single dimension by expressing the annotation effort and lexicon creation effort in terms of cost incurred." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-125", "text": "For example, we assume that the cost of annotating one word is k a and the cost of cross-linking one word is k c rupees." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-126", "text": "Further, we define a baseline and an upper bound for the F-score." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-127", "text": "In this case, the baseline would be the accuracy that can be obtained without spending any money on cross-linking and annotation in the target language." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-128", "text": "An upper bound could be the best F-score obtained using a large amount of annotated corpus in the target domain." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-129", "text": "Based on the above description, an ideal measure for cost-benefit analysis would assign a 1. reward depending on the improvement over the baseline performance." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-130", "text": "2. penalty depending on the difference from the upper bound on performance." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-131", "text": "3. reward inversely proportional to the cost in-curred in terms of annotation effort and/or manual cross-linking." 
}, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-132", "text": "Based on the above wish-list we propose a measure for cost-benefit analysis." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-133", "text": "Let," }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-134", "text": "----------------------------------" }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-135", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-136", "text": "We used Hindi as the source language (S L ) and trained a WSD engine using Hindi sense tagged corpus." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-137", "text": "The parameters thus learnt were then projected using the MultiDict (refer section 3 and 4) to build a resource conscious Marathi (T L ) WSD engine." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-138", "text": "We used the same dataset as described in Khapra et al. (2009) for all our experiments." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-140", "text": "The data for Tourism domain was collected by manually translating English documents downloaded from Indian Tourism websites into Hindi and Marathi." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-141", "text": "Similarly, English documents for Health domain were obtained from two doctors and were manually translated into Hindi and Marathi." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-142", "text": "The Hindi and Marathi documents thus created were manually sense annotated by two lexicographers adept in Hindi and Marathi using the respective Wordnets as sense repositories." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-143", "text": "Table 2 summarizes some statistics about the corpora." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-144", "text": "As for cross-linking, Hindi is used as the pivot language and words in Marathi synset are linked to the words in the corresponding Hindi synset." 
}, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-145", "text": "The total number of cross-links that were manually setup were 3600 for Tourism and 1800 for Health." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-146", "text": "The cost of cross-linking as well as sense annotating one word was taken to be 10 rupees." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-147", "text": "These costs were estimated based on quotations from lexicographers." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-148", "text": "However, these costs need to be taken as representative values only and may vary greatly depending on the availability of skilled bilingual lexicographers and skilled monolingual annotators." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-149", "text": "Table 2 : Number of polysemous words and average degree of polysemy." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-150", "text": "----------------------------------" }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-151", "text": "**RESULTS**" }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-152", "text": "Tables 3 and 4 report the average 4-fold performance on Marathi Tourism and Health data using different proportions of available resources, i.e., annotated corpora and manual cross-links." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-153", "text": "In each of these tables, along the rows, we increase the amount of Marathi sense annotated corpora from 0K to 6K." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-154", "text": "Similarly, along the columns we show the increase in the number of manual cross links (MCL) used." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-155", "text": "For example, the second column of Tables 3 and 4 reports the F-scores when probabilistic cross-linking (PCL) was used for all words (i.e., no manual cross-links) and varying amounts of sense annotated corpora from Marathi were used." 
}, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-156", "text": "Similarly, the first row represents the case in which no sense annotated corpus from Marathi was used and varying amounts of manual crosslinks were used." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-157", "text": "We report three values in the tables, viz., Fscore (F), cost in terms of money (C) and the costbenefit (CB) obtained by using x amount of annotated corpus and y amount of manual cross-links." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-158", "text": "The cost was estimated using the values given in section 9 (i.e., 10 rupees for cross-linking or sense annotating one word)." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-159", "text": "For calculating, the costbenefit baseline was taken as the F-score obtained by using no cross-links and no annotated corpora i.e. 68.21% for Tourism and 67.28% for Health (see first F-score cell of Tables 3 and 4) ." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-160", "text": "Similarly the upper bound (F-scores obtained by training on entire Marathi sense marked corpus) for Tourism and Health were 83.16% and 80.67% respectively (see last row of Table 5 )." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-161", "text": "Due to unavailability of large amount of tagged Health corpus, the injection size was varied from 0-to-4K only." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-162", "text": "In the other dimension, we varied the cross-links from 0 to 1/3rd to 2/3rd to full only Tables 3 and 4) ." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-163", "text": "However, to give an idea about the soundness of probabilistic crosslinking we performed a separate set of experiments by varying the number of cross-links and using no sense annotated corpora." 
}, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-164", "text": "Table 5 summarizes these results and compares them with the baseline (Wordnet first sense) and skyline." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-165", "text": "In Table 6 we compare our selective sampling strategy with random sampling when fully probabilistic cross-linking (PCL) is used and when fully manual cross-linking (MCL) is used." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-166", "text": "Here again, due to lack of space we report results only on Tourism domain." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-167", "text": "However, we would like to mention that similar experiments on Health domain showed that the results were indeed consistent." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-168", "text": "Finally, in Table 7 we compare the accuracies obtained when certain amount of annotated corpus from Marathi is used alone, with the case when the same amount of annotated corpus is used in conjunction with probabilistic cross-linking." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-169", "text": "While calculating the results for the second row in Table 7 , we found that the recall was very low due to the small size of injections." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-170", "text": "Hence, to ensure a fair comparison with our strategy (first row) we used the Wordnet first sense (WFS) for these recall errors (a typical practice in WSD literature)." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-171", "text": "----------------------------------" }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-172", "text": "**DISCUSSIONS**" }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-173", "text": "We make the following observations: 1. 
PCL v/s MCL: Table 5 shows that the probabilistic cross-linking model performs much better than the WFS (a typically reported baseline) and it comes very close to the performance of manual cross-linking." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-174", "text": "This establishes the soundness of the probabilistic model and suggests that with a little compromise in accuracy, the model can be used as an approximation to save the cost of manual cross-linking." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-175", "text": "Further, in Table 7 we see that when PCL is used in conjunction with a certain amount of annotated corpus we get up to 9% improvement in F-score as compared to the case when the same amount of annotated corpus is used alone." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-176", "text": "Thus, in the absence of skilled bilingual lexicographers, PCL can still be used to boost the accuracy obtained using annotated corpora." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-177", "text": "2. Selective sampling v/s random annotation: Table 6 shows the benefit of selective sampling over random annotation." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-178", "text": "This benefit is felt more when the amount of training data injected from Marathi is small." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-179", "text": "For example, when an annotated corpus of size 2K is used, selective sampling gives an advantage of 3% to 4% over random selection." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-180", "text": "Thus the marginal gain (i.e., value for money) obtained by using selective sampling is more than that obtained by using random annotation." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-181", "text": "3. Optimal cost-benefit: Finally, we address the main message of our work, i.e., finding the best cost-benefit." 
}, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-182", "text": "By referring to Tables 3 and 4 , we see that the best value for money in Tourism domain is obtained by manually cross-linking 2/3rd of all corpus words and sense annotating 2K target words and in the Health domain it is obtained by manually cross-linking 2/3rd of all corpus words but sense annotating only 1K words." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-183", "text": "This suggests that striking a balance between crosslinking and annotation gives the best value for money." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-184", "text": "Further, we would like to highlight that our 3-factor economic model is able to capture these relations better than the 2-factor model of Khapra et al. (2010) ." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-185", "text": "As per their model the best F-score achieved using manual cross-linking for ALL words was 73.34% for both Tourism and Health domain at a cost of 36K and 18K respectively." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-186", "text": "On the other hand, using our model we obtain higher accuracies of 76.96% in the Tourism domain (using 1/3rd manual cross-links and 2K injection) at a lower total cost (32K rupees) and 75.57% in the Health domain (using only 1/3rd cross-linking and 1K injection) at a lower cost (16K rupees)." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-187", "text": "----------------------------------" }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-188", "text": "**SELECTIVE SAMPLING V/S RANDOM ANNOTATION:**" }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-189", "text": "----------------------------------" }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-190", "text": "**CONCLUSION**" }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-191", "text": "We reported experiments on multilingual WSD using different amounts of annotated corpora and manual cross-links." 
}, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-192", "text": "We showed that there exists some trade-off between the accuracy and balancing the cost of annotation and lexicon creation." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-193", "text": "In the absence of skilled bilingual lexicographers one can use a probabilistic cross-linking model and still obtain good accuracies." }, { "sent_id": "2407cfa8572ccbab7f9a081f45a4ad-C001-194", "text": "Also, while sense annotating a corpus, careful selection of words using selective sampling can give better marginal gain as compared to random sampling." } ], "y": { "@MOT@": { "gold_contexts": [ [ "2407cfa8572ccbab7f9a081f45a4ad-C001-12" ] ], "cite_sentences": [ "2407cfa8572ccbab7f9a081f45a4ad-C001-12" ] }, "@BACK@": { "gold_contexts": [ [ "2407cfa8572ccbab7f9a081f45a4ad-C001-26" ], [ "2407cfa8572ccbab7f9a081f45a4ad-C001-38" ], [ "2407cfa8572ccbab7f9a081f45a4ad-C001-77" ], [ "2407cfa8572ccbab7f9a081f45a4ad-C001-93" ] ], "cite_sentences": [ "2407cfa8572ccbab7f9a081f45a4ad-C001-26", "2407cfa8572ccbab7f9a081f45a4ad-C001-38", "2407cfa8572ccbab7f9a081f45a4ad-C001-77", "2407cfa8572ccbab7f9a081f45a4ad-C001-93" ] }, "@DIF@": { "gold_contexts": [ [ "2407cfa8572ccbab7f9a081f45a4ad-C001-86" ] ], "cite_sentences": [ "2407cfa8572ccbab7f9a081f45a4ad-C001-86" ] }, "@EXT@": { "gold_contexts": [ [ "2407cfa8572ccbab7f9a081f45a4ad-C001-86" ] ], "cite_sentences": [ "2407cfa8572ccbab7f9a081f45a4ad-C001-86" ] }, "@SIM@": { "gold_contexts": [ [ "2407cfa8572ccbab7f9a081f45a4ad-C001-138" ] ], "cite_sentences": [ "2407cfa8572ccbab7f9a081f45a4ad-C001-138" ] }, "@USE@": { "gold_contexts": [ [ "2407cfa8572ccbab7f9a081f45a4ad-C001-138" ] ], "cite_sentences": [ "2407cfa8572ccbab7f9a081f45a4ad-C001-138" ] } } }, "ABC_6678c19792be8d9ad66cf923d00c23_19": { "x": [ { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-2", "text": "This paper attempts to deal with a 
ranking problem with a collection of financial reports." }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-3", "text": "By using the text information in the reports, we apply learning-to-rank techniques to rank a set of companies to keep them in line with their relative risk levels." }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-4", "text": "The experimental results show that our ranking approach significantly outperforms the regression-based one." }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-5", "text": "Furthermore, our ranking models not only identify some financially meaningful words but suggest interesting relations between the text information in financial reports and the risk levels among companies." }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-6", "text": "Finally, we provide a visualization interface to demonstrate the relations between financial risk and text information in the reports." }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-7", "text": "This demonstration enables users to easily obtain useful information from a number of financial reports." }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-8", "text": "----------------------------------" }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-10", "text": "Financial risk is the chance that a chosen investment instrument (e.g., a stock) will lead to a loss." }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-11", "text": "In finance, volatility is an empirical measure of risk and will vary based on a number of factors." }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-12", "text": "This paper attempts to use text information in financial reports as factors to rank the risk of stock returns." 
}, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-13", "text": "Considering such a problem is a text ranking problem, we attempt to use learning-to-rank techniques to deal with the problem." }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-14", "text": "Unlike the previous study (Kogan et al., 2009) , in which a regression model is employed to predict stock return volatilities via text information, our work utilizes learning-to-rank methods to model the ranking of relative risk levels directly." }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-15", "text": "The reason of this practice is that, via text information only, predicting ranks among real-world quantities should be more reasonable than predicting their real values." }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-16", "text": "The difficulty of predicting the values is partially because of the huge amount of noise within texts (Kogan et al., 2009 ) and partially because of the weak connection between texts and the quantities." }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-17", "text": "Regarding these issues, we turn to rank the relative risk levels of the companies (their stock returns)." }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-18", "text": "By means of learning-to-ranking techniques, we attempt to identify some key factors behind the text ranking problem." }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-19", "text": "Our experimental results show that in terms of two different ranking correlation metrics, our ranking approach significantly outperforms the regression-based method with a confidence level over 95%." }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-20", "text": "In addition to the improvements, through the learned ranking models, we also discover meaningful words that are financially risk-related, some of which were not identified in (Kogan et al., 2009 )." 
}, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-21", "text": "These words enable us to get more insight and understanding into financial reports." }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-22", "text": "Finally, in this paper, a visualization interface is provided to demonstrate the learned relations between financial risk and text information in the reports." }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-23", "text": "This demonstration not only enables users to easily obtain useful information from a number of financial reports but offer a novel way to understand these reports." }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-24", "text": "The remainder of this paper is organized as follows." }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-25", "text": "In Section 2, we briefly review some previous work." }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-26", "text": "Section 3 presents the proposed ranking approach to the financial risk ranking problem." }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-27", "text": "Section 4 reports experimental results and provides some discussions and analyses on the results." }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-28", "text": "We finally conclude our paper and provide several directions for future work." }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-29", "text": "----------------------------------" }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-30", "text": "**RELATED WORK**" }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-31", "text": "In the literature, most text ranking studies are related to information retrieval (Manning et al., 2008) ." }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-32", "text": "Given a query, an information retrieval system ranks documents with respect to their relative relevances to the given query." 
}, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-33", "text": "Traditional models include Vector Space Model (Salton et al., 1975) , Probabilistic Relevance Model (Robertson and Sparck Jones, 1988) , and Language Model (Ponte and Croft, 1998) ." }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-34", "text": "In addition to the conventional models, in recent years there have also been some attempts of using learning-based methods to solve the text ranking problem, such as (Freund et al., 2003; Burges et al., 2005; Joachims, 2006) , which subsequently brings about a new area of learning to rank in the fields of information retrieval and machine learning." }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-35", "text": "Considering the prevalence of learning-to-rank techniques, this paper attempts to use such techniques to deal with the ranking problem of financial risk." }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-36", "text": "In recent year, there have been some studies conducted on mining financial reports, such as (Lin et al., 2008; Kogan et al., 2009; Leidner and Schilder, 2010) ." }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-37", "text": "(Lin et al., 2008 ) use a weighting scheme to combine both qualitative and quantitative features of financial reports together, and propose a method to predict short-term stock price movements." }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-38", "text": "In the work, a Hierarchical Agglomerative Clustering (HAC) method with K-means updating is employed to improve the purity of the prototypes of financial reports, and then the generated prototypes are used to predict stock price movements." }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-39", "text": "(Leidner and Schilder, 2010 ) use text mining techniques to detect whether there is a risk within a company, and classify the detected risk into several types." 
}, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-40", "text": "The above two studies both use a classification manner to mine financial reports." }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-41", "text": "(Kogan et al., 2009 ) apply a regression approach to predict stock return volatilities of companies via their financial reports; in specific, the Support Vector Regression (SVR) model is applied to conduct mining on text information." }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-42", "text": "----------------------------------" }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-43", "text": "**OUR RANKING APPROACH**" }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-44", "text": "In finance, volatility is a common risk metric, which is measured by the standard deviation of a stock's returns over a period of time." }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-45", "text": "Let S t be the price of a stock at time t. Holding the stock for one period from time t \u2212 1 to time t would result in a simple net return: R t = S t /S t\u22121 (Tsay, 2005) ." }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-46", "text": "The volatility of returns for a stock from time t \u2212 n to t can be defined as" }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-47", "text": ". We now proceed to classify the volatilities of n stocks into 2\u2113 + 1 risk levels, where n, \u2113 \u2208 {1, 2, 3, \u00b7 \u00b7 \u00b7 }." }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-48", "text": "Let m be the sample mean and s be the sample standard deviation of the logarithm of volatilities of n stocks (denoted as ln(v))." }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-49", "text": "The distribution over ln(v) across companies tends to have a bell shape (Kogan et al., 2009) ." 
}, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-50", "text": "Therefore, given a volatility v, we derive the risk level r via:" }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-51", "text": ", \u00b7 \u00b7 \u00b7 , \u2113 \u2212 1}, and b = \u221e when k = \u2113. Note that r stands for the concept of relative risk among n stocks; for instance, the stock with r = 4 is much more risky than that with r = 0." }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-52", "text": "After classifying the volatilities of stock returns (of companies) into different risk levels, we now proceed to formulate our text ranking problem." }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-53", "text": "Given a collection of financial reports" }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-54", "text": "and is associated with a company c i , we aim to rank the companies via a ranking model f :" }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-55", "text": "d \u2192 such that the rank order of the set of companies is specified by the real value that the model f takes." }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-56", "text": "In specific, f (d i ) > f (d j ) is taken to mean that the model asserts that c i \u227b c j , where c i \u227b c j means that c i is ranked higher than c j ; that is, the company c i is more risky than c j in this work." }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-57", "text": "This paper adopts Ranking SVM (Joachims, 2006) for our text ranking problem." }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-58", "text": "Within a year, if the ground truth (i.e., the relative risk level) asserts that the company c i is more risky than c j , the constraint of Ranking SVM is \u2329w, Numbers in brackets indicate the p-value from a paired t-test." 
}, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-59", "text": "Bold faced numbers denote improvements over the baseline, and * indicates that the entry is statistically significant from the baseline at 95% confidence level." }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-60", "text": "where w is a learned weight vector, C is the trade-off parameter, \u03be i, j,k is a slack variable, and Y k is a set of pairs of financial reports within a year." }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-61", "text": "----------------------------------" }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-62", "text": "**EXPERIMENTS AND ANALYSIS**" }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-63", "text": "In this paper, the 10-K Corpus (Kogan et al., 2009 ) is used to conduct the experiments; only Section 7 \"management's discussion and analysis of financial conditions and results of operations\" (MD&A) is included in the experiments since typically Section 7 contains the most important forward-looking statements." }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-64", "text": "In the experiments, all documents were stemmed by the Porter stemmer, and the documents in each year are indexed separately." }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-65", "text": "In addition to the reports, the twelve months after the report volatility for each company can be calculated by Equation (1), where the price return series can be obtained from the Center for Research in Security Prices (CRSP) US Stocks Database." }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-66", "text": "The company in each year is then classified into 5 risk levels (\u2113 = 2) via Equation (2)." }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-67", "text": "For regression, linear kernel is adopted with \u03b5 = 0.1 and the trade-off C is set to the default choice of SVM light , which are the similar settings of (Kogan et al., 2009 )." 
}, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-68", "text": "For ranking, linear kernel is adopted with C = 1, all other parameters are left for the default values of SVM Rank ." }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-69", "text": "Table 1 tabulates the experimental results, in which all reports from the five-year period preceding the test year are used as the training data (we denote the training data from the n-year period preceding the test year as T n hereafter)." }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-70", "text": "For example, the reports from year 1996 to 2000 constitute a training data T 5 , and the resulting model is tested on the reports of year 2001." }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-71", "text": "As shown in the table, with the feature of TF-IDF, our results are significantly better than those of the baseline in terms of both two measures." }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-72", "text": "In addition to using T 5 as the training data, we also conduct other 4 sets of experiments with T 1 , T 2 , T 3 , T 4 to test the reports from Words with Negative Weights 1996-2000 1997-2001 1998-2002 1999-2003 2000-2004 2001-2005 Figure 1 Figure 1 illustrates the top positive and negative weighted terms appearing more than twice in the six T 5 models trained on TF-IDF; these terms (8 positive and 8 negative) constitute the radar chart in Figure 1 ." }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-73", "text": "Almost all the terms found by our ranking approach are financially meaningful; in addition, some of highly risk-correlated terms are not even reported in (Kogan et al., 2009) ." }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-74", "text": "We now take the term defaut (only identified by our ranking approach) as an example." 
}, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-75", "text": "In finance, a company \"defaults\" when it cannot meet its legal obligations according to the debt contract; as a result, the term \"default\" is intuitively associated with a relative high risk level." }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-76", "text": "One piece of the paragraph quoted from the original report (from AFC Enterprises, Inc.) is listed as follows:" }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-77", "text": "As of December 25, 2005, approximately $3.0" }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-78", "text": "----------------------------------" }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-79", "text": "**CONCLUSION**" }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-80", "text": "This paper adopts learning-to-rank techniques to rank the companies to keep them in line with their relative risk levels via the text information in their financial reports." }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-81", "text": "The experimental results suggest interesting relations between the text information in financial reports and the risk levels among companies; these findings may be of great value for providing us more insight and understanding into financial reports." }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-82", "text": "Finally, we provide a visualization interface to demonstrate the relations between financial risk and text information in the reports." }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-83", "text": "This demonstration enables users to easily obtain useful information from a number of financial reports." }, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-84", "text": "Future directions include how to reduce the noise within texts, and how to incorporate Standard Industrial Classification (SIC) into our ranking approach." 
}, { "sent_id": "6678c19792be8d9ad66cf923d00c23-C001-85", "text": "In addition, a hybrid model consisting of both financial and text information may be also one of our future directions." } ], "y": { "@DIF@": { "gold_contexts": [ [ "6678c19792be8d9ad66cf923d00c23-C001-14" ], [ "6678c19792be8d9ad66cf923d00c23-C001-20" ], [ "6678c19792be8d9ad66cf923d00c23-C001-73" ] ], "cite_sentences": [ "6678c19792be8d9ad66cf923d00c23-C001-14", "6678c19792be8d9ad66cf923d00c23-C001-20", "6678c19792be8d9ad66cf923d00c23-C001-73" ] }, "@BACK@": { "gold_contexts": [ [ "6678c19792be8d9ad66cf923d00c23-C001-16" ], [ "6678c19792be8d9ad66cf923d00c23-C001-36" ], [ "6678c19792be8d9ad66cf923d00c23-C001-49" ] ], "cite_sentences": [ "6678c19792be8d9ad66cf923d00c23-C001-16", "6678c19792be8d9ad66cf923d00c23-C001-36", "6678c19792be8d9ad66cf923d00c23-C001-49" ] }, "@SIM@": { "gold_contexts": [ [ "6678c19792be8d9ad66cf923d00c23-C001-63" ], [ "6678c19792be8d9ad66cf923d00c23-C001-67" ] ], "cite_sentences": [ "6678c19792be8d9ad66cf923d00c23-C001-63", "6678c19792be8d9ad66cf923d00c23-C001-67" ] } } }, "ABC_be77eed8430b6492c81ae6535f1dd5_19": { "x": [ { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-2", "text": "Discourse parsing has long been treated as a stand-alone problem independent from constituency or dependency parsing." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-3", "text": "Most attempts at this problem are pipelined rather than end-to-end, sophisticated, and not self-contained: they assume goldstandard text segmentations (Elementary Discourse Units), and use external parsers for syntactic features." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-4", "text": "In this paper we propose the first end-to-end discourse parser that jointly parses in both syntax and discourse levels, as well as the first syntacto-discourse treebank by integrating the Penn Treebank with the RST Treebank." 
}, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-5", "text": "Built upon our recent span-based constituency parser, this joint syntactodiscourse parser requires no preprocessing whatsoever (such as segmentation or feature extraction), achieves the state-of-theart end-to-end discourse parsing accuracy." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-6", "text": "----------------------------------" }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-29", "text": "In an RST discourse tree, there are two types of branchings." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-30", "text": "Most of the internal tree nodes are binary branching, with one nucleus child containing the core semantic meaning of the current node, and one satellite child semantically decorating the nucleus." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-31", "text": "Like dependency labels, there is a relation annotated between each satellite-nucleus pair, such as \"Background\" or \"Purpose\"." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-32", "text": "Figure 1 (a) shows an example RST tree." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-33", "text": "There are also nonbinary-branching internal nodes whose children are conjunctions, e.g., a \"List\" of semantically similar EDUs (which are all nucleus nodes); see Figure 2 (a) for an example." 
}, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-34", "text": "----------------------------------" }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-35", "text": "**SYNTACTO-DISCOURSE REPRESENTATION**" }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-36", "text": "It is widely recognized that lower-level lexical and syntactic information can greatly help determining both the boundaries of the EDUs (i.e., discourse segmentation) (Bach et al., 2012) as well as the semantic relations between EDUs (Soricut and Marcu, 2003; Hernault et al., 2010; Joty and Moschitti, 2014; Feng and Hirst, 2014; Ji and Eisenstein, 2014; Li et al., 2014a; Heilman and Sagae, 2015) ." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-37", "text": "While these previous approaches rely on pre-trained tools to provide both EDU segmentation and intra-EDU syntactic parse trees, we instead propose to directly determine the low-level segmentations, the syntactic parses, and the highlevel discourse parses using a single joint parser." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-38", "text": "This parser is trained on the combined trees of constituency and discourse structures." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-39", "text": "We first convert an RST tree to a format similar Another example of RST vs. PTB-RST, demonstrating a discourse tree over two sentences and a non-binary relation (List)." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-40", "text": "The lower levels of the PTB-RST tree are collapsed due to space contraints." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-41", "text": "to those constituency trees in the Penn Treebank (Marcus et al., 1993) ." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-42", "text": "For each binary branching node with a nucleus child and a satellite child, we use the relation as the label of the converted parent node." 
}, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-43", "text": "The nucleus/satellite relation, along with the direction (either \u2190 or \u2192, pointing from satellite to nucleus) is then used as the label." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-44", "text": "For example, at the top level in Figure 2 , we convert" }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-45", "text": "For a conjunctive branch (e.g. \"List\"), we simply use the relation as the label of the converted node." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-46", "text": "After converting an RST tree into the constituency tree format, we then replace each leaf node (i.e., EDU) with the corresponding syntactic (sub)tree from PTB." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-47", "text": "Given that the sentences in the RST Treebank (Marcu, 2000b ) is a subset of that of PTB, we can always find the corresponding constituency subtrees for each EDU leaf node." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-48", "text": "In most cases, each EDU corresponds to one single (sub)tree in PTB, since the discourse boundaries generally do not conflict with constituencies." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-49", "text": "In other cases, one EDU node may correspond to multiple subtrees in PTB, and for these EDUs we use the lowest common ancestor of those subtrees in the PTB as the label of that EDU in the converted tree." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-50", "text": "E.g., if C-D is one EDU in the PTB tree" }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-51", "text": "----------------------------------" }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-52", "text": "**JOINT PTB-RST TREEBANK**" }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-53", "text": "Using the conversion strategy described above we build the first joint syntacto-discourse treebank based on the Penn Treebank and RST Treebank." 
}, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-54", "text": "This PTB-RST treebank is released as a set of tools to generate the joint trees given Penn Treebank and RST Treebank data." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-55", "text": "During the alignment between the RST trees and the PTB trees, we only keep the common parts of the two trees." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-56", "text": "We follow the standard training/testing split of the RST Treebank." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-57", "text": "In the training set, there are 347 joint trees with a total of 17,837 tokens, and the lengths of the discourses range from 30 to 2,199 tokens." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-58", "text": "In the test set, there are 38 joint trees with a total of 4,819 tokens, and the lengths vary from 45 to 2,607." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-59", "text": "Figure 3 shows the distribution of the discourse lengths over the whole dataset, which on average is about 2x of PTB sentence length, but longest ones are about 10x the longest lengths in the Treebank." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-60", "text": "----------------------------------" }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-61", "text": "**JOINT SYNTACTO-DISCOURSE PARSING**" }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-62", "text": "Given the combined syntacto-discourse treebank, we now propose a joint parser that can perform end-to-end discourse segmentation and parsing." 
}, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-63", "text": "----------------------------------" }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-64", "text": "**EXTENDING SPAN-BASED PARSING**" }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-65", "text": "As mentioned above, the input sequences are substantially longer than PTB parsing, so we choose linear-time parsing, by adapting a popular greedy constituency parser, the span-based constituency parser of Cross and Huang (2016) ." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-66", "text": "As in span-based parsing, at each step, we maintain a a stack of spans." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-67", "text": "Notice that in conventional incremental parsing, the stack stores the subtrees Similar to span-based constituency parsing, we alternate between structural (either shift or combine) and label (label X or nolabel) actions in an odd-even fashion." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-68", "text": "But different from Cross and Huang (2016) , after a structural action, we choose to keep the last branching point k, i.e., i Some text and the symbol or scaled k j (mostly for combine, but also trivially for shift)." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-69", "text": "This is because in our parsing mechanism, the discourse relation between two EDUs is actually determined after the previous combine action." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-70", "text": "We need to keep the splitting point to clearly find the spans of the two EDUs to determine their relations." 
}, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-71", "text": "This midpoint k disappears after a label action; therefore we can use the shape of the last span on the stack (whether it contains the split point, i.e., i xt and the symbol or scaled k j or i Some text and the symbol or scaled j ) to determine the parity of the step and thus no longer need to carry the step z in the state as in Cross and Huang (2016) ." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-72", "text": "The nolabel action makes the binarization of the discourse/constituency tree unnecessary, because nolabel actually combines the top two spans on the stack \u03c3 into one span, but without annotating the new span a label." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-73", "text": "This greatly simplifies the preprocessing and post-processing efforts needed." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-74", "text": "Table 1 : Accuracies on PTB-RST at constituency and discourse levels." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-75", "text": "----------------------------------" }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-76", "text": "**RECURRENT NEURAL MODELS AND TRAINING**" }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-77", "text": "The scoring functions in the deductive system (Figure 4 ) are calculated by an underlying neural model, which is similar to the bi-directional LSTM model in Cross and Huang (2016) that evaluates based on span boundary features." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-78", "text": "Again, it is important to note that no discourse or syntactic tree structures are represented in the features." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-79", "text": "During the decoding time, a document is firstl passed into a two-layer bi-directional LSTM model, then the outputs at each text position of the two layers of the bi-directional LSTMs are concatenated as the positional features." 
}, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-80", "text": "The spans at each parsing step can be represented as the feature vectors at the boundaries." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-81", "text": "The span features are then passed into fully connected networks with softmax to calculate the likelihood of performing the corresponding action or marking the corresponding label." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-82", "text": "We use the \"training with exploration\" strategy (Goldberg and Nivre, 2013) and the dynamic oracle mechanism described in Cross and Huang (2016) to make sure the model can handle unseen parsing configurations properly." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-83", "text": "----------------------------------" }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-84", "text": "**EMPIRICAL RESULTS**" }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-85", "text": "We use the treebank described in Section 2 for empirical evaluation." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-86", "text": "We randomly choose 30 documents from the training set as the development set." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-87", "text": "We tune the hyperparameters of the neural model on the development set." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-88", "text": "For most of the hyperparameters we settle with the same values suggested by Cross and Huang (2016) ." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-89", "text": "To alleviate the overfitting problem for training on the relative small RST Treebank, we use a dropout of 0.5." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-90", "text": "One particular hyperparameter is that we use a value \u03b2 to balance the chances between training following the exploration (i.e., the best action chosen by the neural model) and following the correct path provided by the dynamic oracle." 
}, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-91", "text": "We find that \u03b2 = 0.8, i.e., following the dynamic oracle with asyntactic feats." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-92", "text": "segmentation structure +nuclearity +relation Bach et al. (2012) segmentation only Table 2 : F1 scores of end-to-end systems. \"+nuclearity\" indicates scoring of tree structures with nuclearity included. \"+relation\" has both nuclearity and relation included (e.g., \u2190Elaboration)." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-93", "text": "syntactic feats structure +nuclearity +relation human annotation (Ji and Eisenstein, 2014 Table 3 : Experiments using gold segmentations." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-94", "text": "The column of \"syntactic feats\" shows how the syntactic features are calculated in the corresponding systems." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-95", "text": "Note that our parser predicts solely based on the span features from bi-directionaly LSTM, instead of any explicitly designed syntactic features." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-96", "text": "icantly larger than the constituency trees in Penn Treebank, lower \u03b2 makes the parsing easily divert into wrong trails that are difficult to learn from." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-97", "text": "Since our parser essentially performs both constituency parsing task and discourse parsing task." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-98", "text": "We also evaluate the performances on sentence constituency level and discourse level separately." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-99", "text": "The result is shown in Table 1 . 
Note that in constituency level, the accuracy is not directly comparable with the accuracy reported in Cross and Huang (2016) , since: a) our parser is trained on a much smaller dataset (RST Treebank is about 1/6 of Penn Treebank); b) the parser is trained to optimize the discourse-level accuracy." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-100", "text": "Table 2 shows that, in the perspective of endto-end discourse parsing, our parser first outperforms the state-of-the-art segmentator of Bach et al. (2012) , and furthermore, in end-to-end parsing, the superiority of our parser is more pronounced comparing to the previously best parser of Hernault et al. (2010) ." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-101", "text": "On the other hand, the majority of the conventional discourse parsers are not end-to-end: they rely on gold EDU segmentations and pre-trained tools like Stanford parsers to generate features." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-102", "text": "We perform an experiment to compare the performance of our parser with them given the gold EDU segments ( Table 3 )." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-103", "text": "Note that most of these parsers do not handle multi-branching discourse nodes and are trained and evaluated on binarized discourse trees (Feng and Hirst, 2014; Li et al., 2014a,b; Ji and Eisenstein, 2014; Heilman and Sagae, 2015) , so their performances are actually not directly comparable to the results we reported." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-104", "text": "----------------------------------" }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-105", "text": "**CONCLUSION**" }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-106", "text": "We have presented a neural-based incremental parser that can jointly parse at both constituency and discourse levels." 
}, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-107", "text": "To our best knowledge, this is the first end-to-end parser for discourse parsing task." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-108", "text": "Our parser achieves the state-of-the-art performance in end-to-end parsing, and unlike previous approaches, needs little pre-processing effort." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-8", "text": "Distinguishing the semantic relations between segments in a document can be greatly beneficial to many high-level NLP tasks, such as summarization (Louis et al., 2010; Yoshida et al., 2014) , sentiment analysis (Voll and Taboada, 2007; Somasundaran et al., 2009; Bhatia et al., 2015) , question answering (Ferrucci et al., 2010; Jansen et al., 2014) , and textual quality evaluation (Tetreault et al., 2013; Li and Jurafsky, 2016) ." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-9", "text": "There has been a variety of research on discourse parsing (Marcu, 2000a; Soricut and Marcu, 2003; Pardo and Nunes, 2008; Hernault et al., 1. pipelined rather than end-to-end: they assume pre-segmented discourse, and worse yet, use gold-standard segmentations, except Hernault et al. (2010) ;" }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-10", "text": "2. not self-contained: they rely on external syntactic parsers and pretrained word vectors;" }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-11", "text": "3. complicated: they design sophisticated features, including those from parse-trees." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-12", "text": "We argue for the first time that discourse parsing should be viewed as an extension of, and be performed in conjunction with, constituency parsing." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-13", "text": "We propose the first joint syntacto-discourse treebank, by unifying constituency and discourse tree representations." 
}, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-14", "text": "Based on this, we propose the first end-to-end incremental parser that jointly parses at both constituency and discourse levels." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-15", "text": "Our algorithm builds up on the span-based parser (Cross and Huang, 2016) ; it employs the strong generalization power of bi-directional LSTMs, and parses efficiently and robustly with an extremely simple span-based feature set that does not use any tree structure information." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-16", "text": "We make the following contributions:" }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-17", "text": "1. We develop a combined representation of constituency and discourse trees to facilitate parsing at both levels without explicit conversion mechanism." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-18", "text": "Using this representation, we build and release a joint treebank based on the Penn Treebank (Marcus et al., 1993) and RST Treebank (Marcu, 2000a,b) (Section 2)." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-19", "text": "2. We propose a novel joint parser that parses at both constituency and discourse levels." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-20", "text": "Our parser performs discourse parsing in an endto-end manner, which greatly reduces the efforts required in preprocessing the text for segmentation and feature extraction, and, to our best knowledge, is the first end-to-end discourse parser in literature (Section 3)." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-21", "text": "3. Even though it simultaneously performs constituency parsing, our parser does not use any explicit syntactic feature, nor does it need any binarization of discourse trees, thanks to the powerful span-based framework of Cross and Huang (2016) (Section 3)." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-22", "text": "4. 
Empirically, our end-to-end parser outperforms the existing pipelined discourse parsing efforts." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-23", "text": "When the gold EDUs are provided, our parser is also competitive to other existing approaches with sophisticated features (Section 4)." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-24", "text": "----------------------------------" }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-25", "text": "**COMBINED REPRESENTATION & TREEBANK**" }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-26", "text": "We first briefly review the discourse structures in Rhetorical Structure Theory (Mann and Thompson, 1988) , and then discuss how to unify discourse and constituency trees, which gives rise to our syntacto-discourse treebank PTB-RST." }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-27", "text": "----------------------------------" }, { "sent_id": "be77eed8430b6492c81ae6535f1dd5-C001-28", "text": "**REVIEW: RST DISCOURSE STRUCTURES**" } ], "y": { "@DIF@": { "gold_contexts": [ [ "be77eed8430b6492c81ae6535f1dd5-C001-15" ], [ "be77eed8430b6492c81ae6535f1dd5-C001-21" ], [ "be77eed8430b6492c81ae6535f1dd5-C001-68" ], [ "be77eed8430b6492c81ae6535f1dd5-C001-71" ], [ "be77eed8430b6492c81ae6535f1dd5-C001-99" ] ], "cite_sentences": [ "be77eed8430b6492c81ae6535f1dd5-C001-15", "be77eed8430b6492c81ae6535f1dd5-C001-21", "be77eed8430b6492c81ae6535f1dd5-C001-68", "be77eed8430b6492c81ae6535f1dd5-C001-71", "be77eed8430b6492c81ae6535f1dd5-C001-99" ] }, "@EXT@": { "gold_contexts": [ [ "be77eed8430b6492c81ae6535f1dd5-C001-15" ], [ "be77eed8430b6492c81ae6535f1dd5-C001-21" ], [ "be77eed8430b6492c81ae6535f1dd5-C001-68" ] ], "cite_sentences": [ "be77eed8430b6492c81ae6535f1dd5-C001-15", "be77eed8430b6492c81ae6535f1dd5-C001-21", "be77eed8430b6492c81ae6535f1dd5-C001-68" ] }, "@SIM@": { "gold_contexts": [ [ "be77eed8430b6492c81ae6535f1dd5-C001-65" ], [ "be77eed8430b6492c81ae6535f1dd5-C001-77" ], [ 
"be77eed8430b6492c81ae6535f1dd5-C001-82" ], [ "be77eed8430b6492c81ae6535f1dd5-C001-88" ] ], "cite_sentences": [ "be77eed8430b6492c81ae6535f1dd5-C001-65", "be77eed8430b6492c81ae6535f1dd5-C001-77", "be77eed8430b6492c81ae6535f1dd5-C001-82", "be77eed8430b6492c81ae6535f1dd5-C001-88" ] }, "@USE@": { "gold_contexts": [ [ "be77eed8430b6492c81ae6535f1dd5-C001-65" ], [ "be77eed8430b6492c81ae6535f1dd5-C001-82" ], [ "be77eed8430b6492c81ae6535f1dd5-C001-88" ] ], "cite_sentences": [ "be77eed8430b6492c81ae6535f1dd5-C001-65", "be77eed8430b6492c81ae6535f1dd5-C001-82", "be77eed8430b6492c81ae6535f1dd5-C001-88" ] } } }, "ABC_983ef31a44646d8e6276ee1933e41d_19": { "x": [ { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-33", "text": "To that end, we introduce the WikiSplit corpus and detail its construction next." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-8", "text": "A complex sentence can typically be rewritten into multiple simpler ones that together retain the same meaning." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-34", "text": "----------------------------------" }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-35", "text": "**MINING WIKIPEDIA EDITS**" }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-2", "text": "Split and rephrase is the task of breaking down a sentence into shorter ones that together convey the same meaning." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-3", "text": "We extract a rich new dataset for this task by mining Wikipedia's edit history: WikiSplit contains one million naturally occurring sentence rewrites, providing sixty times more distinct split examples and a ninety times larger vocabulary than the WebSplit corpus introduced by Narayan et al." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-4", "text": "(2017) as a benchmark for this task." 
}, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-5", "text": "Incorporating WikiSplit as training data produces a model with qualitatively better predictions that score 32 BLEU points above the prior best result on the WebSplit benchmark." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-6", "text": "----------------------------------" }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-36", "text": "Wikipedia maintains snapshots of entire documents at different timestamps, which makes it possible to reconstruct edit histories for documents." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-37", "text": "This has been exploited for many NLP tasks, including sentence compression (Yamangil and Nelken, 2008 ), text simplification (Yatskar et al., 2010; Woodsend and Lapata, 2011; Tonelli et al., 2016) and modeling semantic edit intentions (Yang et al., 2017) ." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-38", "text": "To construct the WikiSplit corpus, we identify edits that involve sentences being split." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-39", "text": "A list of sentences for each snapshot is obtained by stripping HTML tags and Wikipedia markup and running a sentence break detector (Gillick, 2009) ." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-40", "text": "Temporally adjacent snapshots of a Wikipedia page are then compared to check for sentences that have undergone a split like that shown in Figure 1 ." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-41", "text": "We search for splits in both temporal directions." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-42", "text": "Given all candidate examples extracted this way, we use a high-precision heuristic to retain only high quality splits." 
}, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-43", "text": "To extract a full sentence C and its candidate split into S = (S 1 , S 2 ), we require that C and S 1 have the same trigram prefix, C and S 2 have the same trigram suffix, and S 1 and S 2 have different trigram suffixes." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-44", "text": "To filter out misaligned pairs, we use BLEU scores (Papineni et al., 2002) to ensure similarity between the original and the split versions." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-45", "text": "Specifically, we discard pairs where BLEU(C, S 1 ) or BLEU(C, S 2 ) is less than \u03b4 (an empirically chosen threshold)." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-46", "text": "If multiple candidates remain for a given sentence C, we retain arg max S (BLEU(C, S 1 ) + BLEU(C, S 2 ))." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-47", "text": "1" }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-48", "text": "----------------------------------" }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-49", "text": "**CORPUS STATISTICS AND QUALITY**" }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-50", "text": "Our extraction heuristic is imperfect, so we manually assess corpus quality using the same categorization schema proposed by Aharoni and Goldberg (2018) ; see Table 1 for examples of correct, unsupported and missing sentences in splits extracted from Wikipedia." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-51", "text": "We do this for 100 randomly selected examples using three different thresholds of \u03b4." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-52", "text": "As shown in Table 2 , \u03b4=0.2 provides the best trade-off between quality and size." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-53", "text": "Out of the 100 complex sentences in the sample, only 4 contained information that was not completely covered by the simple sentences." 
}, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-54", "text": "In our corpus, every complex sentence is split into two simpler sentences, so the sample contains 200 simple sentences." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-55", "text": "Out of these we found 168 (84%) to be correct, while 35 (18%) contained unsupported facts." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-56", "text": "Thus, for the overall sample of 100 split-and-rephrase examples, 68% are perfect while 32% contain some noise (either unsupported facts or missing information)." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-57", "text": "We stress that our main goal is to use data extracted this way as training data and accept that its use for evaluation is an imperfect signal with some inherent noise and bias (by construction)." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-58", "text": "After extraction and filtering, we obtain over one million examples of sentence splits from around 18 million English documents." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-59", "text": "We randomly reserved 5000 examples each for tuning, validation and testing, producing 989,944 unique complex training sentences, compared to the 16,938 of WebSplit (cf." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-60", "text": "Table 3 )." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-61", "text": "derived the WebSplit corpus by matching up sentences in the WebNLG corpus according to partitions of their underlying meaning representations (RDF triples)." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-62", "text": "The WebNLG corpus itself was created by having crowd workers write sentential realizations of one or more RDF triples." 
}, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-63", "text": "The resulting language is often unnatural, for example, \"Akeem Dent once played for the Houston Texans team which is based in Houston in Texas.\" 2 Repetition arises because the same sentence fragment may appear in many different examples." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-64", "text": "This is to be expected given that WebSplit's small vocabulary of 7k words must account for the 344k tokens that make up the distinct complex sentences themselves." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-65", "text": "3 This is compounded in that each sentence contains a named entity by construction." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-66", "text": "In contrast, our large new WikiSplit dataset offers more natural and diverse text (see examples in Table 1 ), having a vocabulary of 633k items covering the 33m tokens in its distinct complex sentences." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-67", "text": "----------------------------------" }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-68", "text": "**COMPARISON TO WEBSPLIT**" }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-69", "text": "The task represented by our WikiSplit dataset is a priori both harder and easier than that of the WebSplit dataset -harder because of the greater diversity and sparsity, but potentially easier due to the uniform use of a single split." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-70", "text": "Of the two datasets, WebSplit is better suited for evaluation: its construction method guarantees cleaner data than is achieved by our extraction heuristic, and it provides multiple reference decompositions for each complex sentence, which tends to improve the correlation of automatic metrics with human judgment in related text generation tasks (Toutanova et al., 2016) ." 
}, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-71", "text": "----------------------------------" }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-72", "text": "**EXPERIMENTS**" }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-73", "text": "In order to understand how WikiSplit can inform the split-and-rephrase task, we vary the composition of the training set when training a fixed model architecture." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-74", "text": "We compare three training configurations: WEBSPLIT only, WIKISPLIT only, and BOTH, which is simply their concatenation." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-75", "text": "Text-to-text training instances are defined as all the unique pairs of (C, S), where C is a complex sentence and S is its simplification into multiple simple sentences Aharoni and Goldberg, 2018) ." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-76", "text": "For training, we delimit the simple sentences with a special symbol." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-77", "text": "We depart from the prior work by only using a subset of the WebSplit training set: we take a fixed sub-sample such that each distinct C is paired with a single S, randomly selected from the multiple possibilities in the dataset." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-78", "text": "This scheme produced superior performance in preliminary experiments." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-79", "text": "As a quality measure, we report multi-reference corpus-level BLEU 4 (Papineni et al., 2002 ), but include sentence-level BLEU (sBLEU) for direct comparison to past work." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-80", "text": "5 We also report lengthbased statistics to quantify splitting." 
}, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-81", "text": "We use the same sequence-to-sequence architecture that produced the top result for Aharoni and Goldberg (2018) , \"Copy512\", which is a one-layer, bi-directional LSTM (cell size 512) with attention (Bahdanau et al., 2014 ) and a copying mechanism (See et al., 2017 ) that dynamically interpolates the standard word distribution with a distribution over the words in the input sentence." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-82", "text": "Training details are as described in the Appendix of Aharoni and Goldberg (2018) using the OpenNMT-py framework (Klein et al., 2017) ." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-83", "text": "6" }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-84", "text": "----------------------------------" }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-85", "text": "**RESULTS**" }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-86", "text": "We compare to the SOURCE baseline, which is the previously reported method of taking the unmodified input sentence as prediction, and we add SPLITHALF, the natural baseline of deterministically splitting a complex sentence into two equallength token sequences and appending a period to the first one." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-87", "text": "Table 4 compares our three training configurations on the validation sets of both WebSplit and WikiSplit." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-88", "text": "The WEBSPLIT model scores 35.3 BLEU on the WebSplit validation set but fails to generalize beyond its narrow domain, as evidenced by reaching only 4.2 BLEU on the WikiSplit validation set." 
}, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-89", "text": "The example predictions in Table 7 illustrate how this model tends to drop content (\"Alfred Warden\", \"mouth\", \"Hamburg\"), hallucinate common elements from its training set (\"food\", \"ingre-4 Using NLTK v3.2.2, with case sensitive scoring." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-90", "text": "5 Past work on WebSplit Aharoni and Goldberg, 2018) reported macro-averaged sentence-level BLEU, calculated without smoothing precision values of zero." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-91", "text": "We found this ill-defined case occurred often for low-quality output." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-92", "text": "6 github.com/OpenNMT/OpenNMT-py, 0ecec8b Table 5 : Results on the WebSplit v1.0 test set when varying the training data while holding model architecture fixed: corpus-level BLEU, sentence-level BLEU (to match past work), simple sentences per complex sentence, and tokens per simple sentence (micro-average)." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-93", "text": "AG18 is the previous best model by Aharoni and Goldberg (2018) , which used the full WebSplit training set, whereas we downsampled it." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-94", "text": "dient\", \"publisher\") and generally fails to produce coherent sentences." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-95", "text": "In contrast, the WIKISPLIT model achieves 59.4 BLEU on the WebSplit validation set, without observing any in-domain data." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-96", "text": "It also outperforms the two deterministic baselines on both validation sets by a non-trivial BLEU margin." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-97", "text": "This indicates that the WikiSplit training data enable better generalization than when using WebSplit by itself." 
}, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-98", "text": "Reintroducing the downsampled, in-domain training data (BOTH) further improves performance on the WebSplit evaluation." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-99", "text": "These gains in BLEU from using WikiSplit carry over to the blind manual evaluation we performed on a random sample of model predictions on the WebSplit validation set." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-100", "text": "As shown in Table 6, the BOTH model produced the most accurate output (95% correct simple sentences), with the lowest incidence of missed or unsupported statements." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-101", "text": "Our manual evaluation includes the corresponding outputs from Aharoni and Goldberg (2018) (AG18), which were 22% accurate." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-102", "text": "The examples in Table 7 demonstrate that the WIKISPLIT and BOTH models produce much more coherent output which faithfully rephrases the input." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-103", "text": "In Example 1, the combined model (BOTH) produces three fluent sentences, overcoming the strong bias toward two-sentence output inherent in the majority of its training examples." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-104", "text": "We relate our approach to prior work on WebSplit v1.0 by reporting scores on its test set in Table 5." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-105", "text": "Our best performance in BLEU is again obtained by combining the proposed WikiSplit dataset with the downsampled WebSplit, yielding Aharoni and Goldberg (2018) , while the other outputs are from our models trained on the corresponding data." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-106", "text": "a 32 point improvement over the prior best result." 
}, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-107", "text": "----------------------------------" }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-108", "text": "**CONCLUSION AND OUTLOOK**" }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-109", "text": "Our results demonstrate a large, positive impact on the split-and-rephrase task when training on large, diverse data that contains some noise." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-110", "text": "This suggests that future improvements may come from finding other such sources of data as much as from modeling." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-111", "text": "The new WikiSplit dataset is intended as training data, but for further progress on the split-and-rephrase task, we ideally need evaluation data also derived from naturally occurring sentences, and an evaluation metric that is more sensitive to the particularities of the task." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-112", "text": "----------------------------------" }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-113", "text": "**ACKNOWLEDGMENTS**" }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-114", "text": "Thanks go to Kristina Toutanova and the anonymous reviewers for helpful feedback on an earlier draft, and to Roee Aharoni for supplying his system's outputs." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-9", "text": "Performing this split-and-rephrase task is one of the main operations in text simplification, alongside paraphrasing and dropping less salient content (Siddharthan, 2006; Zhu et al., 2010; Woodsend and Lapata, 2011, i.a.) ." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-10", "text": "The area of automatic text simplification has received a lot of attention (Siddharthan, 2014; Shardlow, 2014 ), yet still holds many open challenges (Xu et al., 2015) ." 
}, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-11", "text": "Splitting sentences in this way could also benefit systems where predictive quality degrades with sentence length, as observed in, e.g., relation extraction (Zhang et al., 2017) and translation (Koehn and Knowles, 2017) . And the schema-free nature of the task may allow for future supervision in the form of crowd-sourced rather than expensive expert annotation (He et al., 2015) ." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-12", "text": "introduce the WebSplit corpus for the split-and-rephrase task and report results for several models on it." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-13", "text": "Aharoni and Goldberg (2018) improve WebSplit by reducing overlap in the data splits, and * Both authors contributed equally." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-14", "text": "A classic leaf symptom is water-soaked lesions between the veins which appear as angular leaf-spots where the lesion edge and vein meet." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-15", "text": "A classic leaf symptom is the appearance of angular, water-soaked lesions between the veins." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-16", "text": "The angular appearance results where the lesion edge and vein meet." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-17", "text": "demonstrate that neural encoder-decoder models (Bahdanau et al., 2014) perform poorly, even when enhanced with a copy mechanism (Gu et al., 2016; See et al., 2017) ." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-18", "text": "One limitation of the WebSplit examples themselves is that they contain fairly unnatural linguistic expression using a small vocabulary." 
}, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-19", "text": "We introduce new training data mined from Wikipedia edit histories that have some noise, but which have a rich and varied vocabulary over naturally expressed sentences and their extracted splits." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-20", "text": "Figure 1 gives an example of how a Wikipedia editor rewrote a single sentence into two simpler ones." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-21", "text": "We create WikiSplit, a set of one million such examples mined from English Wikipedia, and show that models trained with this resource produce dramatically better output for split and rephrase." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-22", "text": "Our primary contributions are:" }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-23", "text": "\u2022 A scalable, language agnostic method for extracting split-and-rephrase rewrites from Wikipedia edits." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-24", "text": "\u2022 Public release of the English WikiSplit dataset, containing one million rewrites:" }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-25", "text": "http://goo.gl/language/wiki-split" }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-26", "text": "\u2022 By incorporating WikiSplit into training, we more than double (30.5 to 62.4) the BLEU score obtained on WebSplit by Aharoni and Goldberg (2018) ." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-27", "text": "Figure 1 ." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-28", "text": "----------------------------------" }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-29", "text": "**THE WIKISPLIT CORPUS**" }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-30", "text": "WebSplit provides a basis for measuring progress on splitting and rephrasing sentences." 
}, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-31", "text": "However, its small size, inherent repetitiveness, and synthetic nature limit its broader applicability." }, { "sent_id": "983ef31a44646d8e6276ee1933e41d-C001-32", "text": "In particular, we see it as a viable benchmark for evaluating models, but not for training them." } ], "y": { "@BACK@": { "gold_contexts": [ [ "983ef31a44646d8e6276ee1933e41d-C001-13" ] ], "cite_sentences": [ "983ef31a44646d8e6276ee1933e41d-C001-13" ] }, "@SIM@": { "gold_contexts": [ [ "983ef31a44646d8e6276ee1933e41d-C001-26" ], [ "983ef31a44646d8e6276ee1933e41d-C001-50" ], [ "983ef31a44646d8e6276ee1933e41d-C001-75" ], [ "983ef31a44646d8e6276ee1933e41d-C001-81" ], [ "983ef31a44646d8e6276ee1933e41d-C001-82" ] ], "cite_sentences": [ "983ef31a44646d8e6276ee1933e41d-C001-26", "983ef31a44646d8e6276ee1933e41d-C001-50", "983ef31a44646d8e6276ee1933e41d-C001-75", "983ef31a44646d8e6276ee1933e41d-C001-81", "983ef31a44646d8e6276ee1933e41d-C001-82" ] }, "@USE@": { "gold_contexts": [ [ "983ef31a44646d8e6276ee1933e41d-C001-26" ], [ "983ef31a44646d8e6276ee1933e41d-C001-50" ], [ "983ef31a44646d8e6276ee1933e41d-C001-81" ], [ "983ef31a44646d8e6276ee1933e41d-C001-82" ], [ "983ef31a44646d8e6276ee1933e41d-C001-101" ] ], "cite_sentences": [ "983ef31a44646d8e6276ee1933e41d-C001-26", "983ef31a44646d8e6276ee1933e41d-C001-50", "983ef31a44646d8e6276ee1933e41d-C001-81", "983ef31a44646d8e6276ee1933e41d-C001-82", "983ef31a44646d8e6276ee1933e41d-C001-101" ] }, "@DIF@": { "gold_contexts": [ [ "983ef31a44646d8e6276ee1933e41d-C001-90" ], [ "983ef31a44646d8e6276ee1933e41d-C001-93" ], [ "983ef31a44646d8e6276ee1933e41d-C001-105" ] ], "cite_sentences": [ "983ef31a44646d8e6276ee1933e41d-C001-90", "983ef31a44646d8e6276ee1933e41d-C001-93", "983ef31a44646d8e6276ee1933e41d-C001-105" ] }, "@EXT@": { "gold_contexts": [ [ "983ef31a44646d8e6276ee1933e41d-C001-93" ], [ "983ef31a44646d8e6276ee1933e41d-C001-105" ] ], "cite_sentences": [ "983ef31a44646d8e6276ee1933e41d-C001-93", 
"983ef31a44646d8e6276ee1933e41d-C001-105" ] } } }, "ABC_ebd4488438579946c23904cc0f5932_19": { "x": [ { "sent_id": "ebd4488438579946c23904cc0f5932-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-2", "text": "The development of FrameNet, a large database of semantically annotated sentences, has primed research into statistical methods for semantic tagging." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-3", "text": "We advance previous work by adopting a Maximum Entropy approach and by using Viterbi search to find the highest probability tag sequence for a given sentence." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-4", "text": "Further we examine the use of syntactic pattern based re-ranking to further increase performance." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-5", "text": "We analyze our strategy using both extracted and human generated syntactic features." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-6", "text": "Experiments indicate 85.7% accuracy using human annotations on a held out test set." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-7", "text": "----------------------------------" }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-9", "text": "The ability to develop automatic methods for semantic classification has been hampered by the lack of large semantically annotated corpora." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-10", "text": "Recent work in the development of FrameNet, a large database of semantically annotated sentences, has laid the foundation for the use of statistical approaches to automatic semantic classification." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-11", "text": "The FrameNet project seeks to annotate a large subset of the British National Corpus with semantic information." 
}, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-12", "text": "Annotations are based on Frame Semantics (Fillmore, 1976) , in which frames are defined as schematic representations of situations involving various Frame Elements such as participants, props, and other conceptual roles." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-13", "text": "In each FrameNet sentence, a single target predicate is identified and all of its relevant Frame Elements are tagged with their element-type (e.g., Agent, Judge), their syntactic Phrase Type (e.g., NP, PP), and their Grammatical Function (e.g., External Argument, Object Argument)." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-14", "text": "Figure 1 shows an example of an annotated sentence and its appropriate semantic frame." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-15", "text": "To our knowledge, Gildea and Jurafsky (2000) is the only work that uses FrameNet to build a statistical semantic classifier." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-16", "text": "They split the problem into two distinct sub-tasks: Frame Element identification and Frame Element classification." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-17", "text": "In the identification phase, they use syntactic information extracted from a parse tree to learn the boundaries of Frame Elements in sentences." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-18", "text": "The work presented here, focuses only on the second phase: classification." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-19", "text": "Gildea and Jurafsky (2000) describe a system that uses completely syntactic features to classify the Frame Elements in a sentence." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-20", "text": "They extract features from a parse tree and model the conditional probability of a semantic role given those features." 
}, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-21", "text": "They report an accuracy of 76.9% on a held out test set." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-22", "text": "She clapped her hands in inspiration." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-23", "text": "Frame: We extend Gildea and Jurafsky (2000) 's initial effort in three ways." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-24", "text": "First, we adopt a Maximum Entropy (ME) framework to better learn the feature weights associated with the classification model." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-25", "text": "Second, we recast the classification task as a tagging problem in which an n-gram model of Frame Elements is applied to find the most probable tag sequence (as opposed to the most probable individual tags)." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-26", "text": "Finally, we implement a re-ranking system that takes advantage of the sentence-level syntactic patterns of each sequence." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-27", "text": "We analyze our results using syntactic features extracted from a parse tree generated by Collins parser (Collins, 1997) and compare those to models built using features extracted from FrameNet's human annotations." 
}, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-28", "text": "----------------------------------" }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-29", "text": "**BODY-MOVEMENT**" }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-30", "text": "----------------------------------" }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-31", "text": "**METHOD**" }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-32", "text": "----------------------------------" }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-33", "text": "**2.2**" }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-34", "text": "Training (32,251 sentences), development (3,491 sentences), and held out test sets (3,398 sentences) were generated from the June 2002 FrameNet release following the divisions used in Gildea and Jurafsky (2000) 1 ." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-35", "text": "Because human-annotated syntactic information could only be obtained for a subset of their data, the training, development, and test sets used here are approximately 10% smaller than those used in Gildea and Jurafsky (2000) ." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-36", "text": "2 There are on average 2.2 Frame Elements per sentence, falling into one of 126 unique classes." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-37", "text": "----------------------------------" }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-38", "text": "**MAXIMUM ENTROPY**" }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-39", "text": "ME models implement the intuition that the best model will be the one that is consistent with all the evidence, but otherwise, is as uniform as possible." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-40", "text": "(Berger et al., 1996) ." 
}, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-41", "text": "Following recent successes using it for many NLP tasks (Och and Ney, 2002; Koeling, 2000) , we use ME to implement a Frame Element classifier." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-42", "text": "We use the YASMET ME package (Och, 2002) to train an approximation of the model below:" }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-43", "text": "----------------------------------" }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-44", "text": "**P(R| PT, VOICE, POSITION, TARGET, GF, H)**" }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-45", "text": "Here r indicates the element type, pt the phrase type, gf the grammatical function, h the head word, and target the target predicate." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-46", "text": "Due to data sparsity issues, we do not calculate this model directly, but rather, model various feature combinations as described in Gildea and Jurafsky (2000) ." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-47", "text": "The classifier was trained, using only features that had a frequency in training of one or more, and until performance on the development set ceased to improve." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-48", "text": "Feature weights were smoothed using a Bayesian method, such that weight limits are Gaussian distributed with mean 0 and standard deviation 1." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-49", "text": "----------------------------------" }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-50", "text": "**TAGGING**" }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-51", "text": "Frame Elements do not occur in isolation, but rather, depend very much on what other Elements occur in a sentence." 
}, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-52", "text": "For example, if a Frame Element is tagged as an Agent it is highly unlikely that the next Element will also be an Agent." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-53", "text": "We exploit this dependency by treating the Frame Element classification task as a tagging problem." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-54", "text": "The YASMET MEtagger was used to apply an ngram tag model to the classification task (Bender et al., 2003) ." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-55", "text": "The feature set for the training data was" }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-56", "text": "----------------------------------" }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-57", "text": "**3 RESULTS**" }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-58", "text": "1 Divisions given by Dan Gildea via personal communication." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-59", "text": "2 Gildea and Jurafsky (2000) use 36995 training, 4000 development, and 3865 test sentences." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-60", "text": "They do not report results using hand annotated syntactic information." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-61", "text": "augmented to include information about the tags of the previous one and two Frame Elements in the sentence:" }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-62", "text": "----------------------------------" }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-63", "text": "**P(R| PT, VOICE, POSITION, TARGET, GF, H, R -1 ,R -1 +R -2 )**" }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-64", "text": "Viterbi search was then used to find the most probable tag sequence through all possible sequences." 
}, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-65", "text": "----------------------------------" }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-66", "text": "**PATTERN FEATURES**" }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-67", "text": "A great deal of information useful for classification can be found in the syntactic patterns associated with each sequence of Frame Elements." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-68", "text": "A typical syntactic pattern is exhibited by the sentence \"Alexandra bent her head.\" Here \"Alexandra\" is an external argument Noun Phrase, \"bent\" is the target, and \"her head\" is an object argument Noun Phrase." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-69", "text": "In the training data, a syntactic pattern of NP-ext, target, NP-obj, given the predicate bend, was associated 100% of the time with the Frame Element pattern: \"Agent target BodyPart\", thus, providing powerful evidence as to the classification of those Frame Elements." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-70", "text": "We exploit these sentence-level patterns by implementing a re-ranking system that chooses among the n-best tagger outputs." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-71", "text": "The re-ranker was trained on a development corpus, which was first tagged using the MEtagger described above." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-72", "text": "For each sentence in the development corpus, the 10 best tag sequences are output by the classifier and described by three probabilities: 3 1) the sequence's probability given by the ME classifier (ME); 2) the conditional probability of that sequence given the syntactic pattern and the target predicate (pat+target); 3) a back off conditional probability of the tag sequence given just the syntactic pattern (pat)." 
}, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-73", "text": "A ME model is then used to combine the log of these probabilities to give a model of the form: P(tag-seq| ME, pat+target, pat) Figure 2 shows the performance of the base ME model, the base model within a tagging framework, and the base model within a tagging framework plus the reranker." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-74", "text": "Results are shown for data sets trained and tested using human annotated syntactic features and trained and tested using automatically extracted syntactic features." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-75", "text": "In both cases the training and test sets are identical." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-76", "text": "For both the extracted and human conditions, adopting a tagging framework improves results by over 1%." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-77", "text": "However, while the syntactic pattern based reranker increases performance using human annotations by nearly 2%, the effect when using automatically extracted information is only 0.5%." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-78", "text": "This is reasonable considering that the re-ranker's effectiveness is correlated with the level of noise in the syntactic patterns upon which it is based." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-79", "text": "The difference in performance between the models under both human and extracted conditions was relatively consistent: averaging 8.7% with a standard deviation of 0.7." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-80", "text": "As a further analysis, we have examined the performance of our base ME model on the same test set as that used in Gildea and Jurafsky (2000) ." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-81", "text": "Using only extracted information, we achieve an accuracy of 74.9%, two percent lower than their reported results." 
}, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-82", "text": "This result is not unreasonable, however, because, due to limited time, very little effort was spent tuning the parameters of the model." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-83", "text": "Figure 2 ." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-84", "text": "Performance of models on held out test data." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-85", "text": "ME refers to results of the base Maximum Entropy model, Tagger to a combined ME and Viterbi search model, Re-Rank to the Tagger augmented with a re-ranker." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-86", "text": "Extracted refers to models trained using features extracted from parse trees, Human to models using features from FrameNet's human annotations." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-87", "text": "----------------------------------" }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-88", "text": "**CONCLUSION**" }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-89", "text": "It is clear that using a tagging framework and syntactic patterns improves performance of the semantic classifier when features are extracted from either automatically generated parse trees or human annotations." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-90", "text": "The most striking result of these experiments, however, is the dramatic decrease in performance associated with using features extracted from a parse tree." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-91", "text": "This decrease in performance can be traced to at least two aspects of the automatic extraction process: noisy parser output and limited grammatical information." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-92", "text": "To compensate for noisy parser output, our current work is focusing on two strategies." 
}, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-93", "text": "First, we are looking at using shallower but more reliable methods for syntactic feature generation, such as part of speech tagging and text chunking, to either replace or augment the syntactic parser." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-94", "text": "Second, we are using ontological information, such as word classes and synonyms, in the hopes that semantic information may supplement the noisy syntactic information." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-95", "text": "The models trained on features extracted from parse trees do not have access to rich grammatical information." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-96", "text": "Following Gildea and Jurafsky (2000) , automatic extraction of grammatical information here is limited to the governing category of a Noun Phrase." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-97", "text": "The FrameNet annotations, however, are much richer and include information about complements, modifiers, etc." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-98", "text": "We are looking at ways to include such information either by using alternative parsers (Hermjakob, 1997) or as a post processing task (Blaheta and Charniak, 2000) ." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-99", "text": "In future work, we will extend the strategies outlined here to incorporate Frame Element identification into our model." }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-100", "text": "By treating semantic classification as a single tagging problem, we hope to create a unified, practical, and high performance system for Frame Element tagging." 
}, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-101", "text": "----------------------------------" }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-102", "text": "**% CORRECT**" }, { "sent_id": "ebd4488438579946c23904cc0f5932-C001-103", "text": "Extracted Human" } ], "y": { "@SIM@": { "gold_contexts": [ [ "ebd4488438579946c23904cc0f5932-C001-15" ], [ "ebd4488438579946c23904cc0f5932-C001-80" ] ], "cite_sentences": [ "ebd4488438579946c23904cc0f5932-C001-15", "ebd4488438579946c23904cc0f5932-C001-80" ] }, "@BACK@": { "gold_contexts": [ [ "ebd4488438579946c23904cc0f5932-C001-19" ] ], "cite_sentences": [ "ebd4488438579946c23904cc0f5932-C001-19" ] }, "@DIF@": { "gold_contexts": [ [ "ebd4488438579946c23904cc0f5932-C001-23" ], [ "ebd4488438579946c23904cc0f5932-C001-35" ], [ "ebd4488438579946c23904cc0f5932-C001-46" ], [ "ebd4488438579946c23904cc0f5932-C001-59", "ebd4488438579946c23904cc0f5932-C001-60" ], [ "ebd4488438579946c23904cc0f5932-C001-96" ] ], "cite_sentences": [ "ebd4488438579946c23904cc0f5932-C001-23", "ebd4488438579946c23904cc0f5932-C001-35", "ebd4488438579946c23904cc0f5932-C001-46", "ebd4488438579946c23904cc0f5932-C001-59", "ebd4488438579946c23904cc0f5932-C001-96" ] }, "@EXT@": { "gold_contexts": [ [ "ebd4488438579946c23904cc0f5932-C001-23" ], [ "ebd4488438579946c23904cc0f5932-C001-46" ] ], "cite_sentences": [ "ebd4488438579946c23904cc0f5932-C001-23", "ebd4488438579946c23904cc0f5932-C001-46" ] }, "@USE@": { "gold_contexts": [ [ "ebd4488438579946c23904cc0f5932-C001-34" ], [ "ebd4488438579946c23904cc0f5932-C001-80" ] ], "cite_sentences": [ "ebd4488438579946c23904cc0f5932-C001-34", "ebd4488438579946c23904cc0f5932-C001-80" ] } } }, "ABC_79ff6e23cc951aa18ae53763e9c982_19": { "x": [ { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-109", "text": "**COMPOSITIONAL SEMANTICS**" }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-130", "text": "**5**" }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-61", "text": "(Webber et al. (2003) , p. 
555)" }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-2", "text": "This paper presents a corpus-based study of the discourse connective in contrast." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-3", "text": "The corpus data are drawn from the British National Corpus (BNC) and are analyzed at the levels of syntax, discourse structure, and compositional semantics." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-4", "text": "Following Webber et al. (2003) , the paper argues that in contrast crucially involves discourse anaphora and, thus, resembles other discourse adverbials such as then, otherwise, and nevertheless." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-5", "text": "The compositional semantics proposed for other discourse connectives, however, does not straightforwardly generalize to in contrast, for which the notions of contrast pairs and contrast properties are essential." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-6", "text": "----------------------------------" }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-8", "text": "The semantics and pragmatics of discourse structure has been a central theme in linguistic research for quite some time." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-9", "text": "Recent research on large-scale annotation of discourse relations for the purposes of natural language processing applications has resulted in new insights in the properties of such relations and in concrete proposals on how to annotate them." 
}, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-10", "text": "A particularly ambitious and interesting effort of this kind is the Penn Discourse Treebank (PDTB), a corpus of 1 million words which is being annotated for discourse connectives and their arguments, more specifically for connectives such as but, for example, because, after, and when that are either realized lexically (explicit connectives) or that have no overt linguistic realization, but that can be inferred as a logical relation between pieces of discourse (implicit connectives)." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-11", "text": "On the basis of the detailed PDTB annotations, which by now comprise a substantial corpus of linguistic data, it has become possible to revisit an open research question that had been raised repeatedly in the literature, albeit without yielding concrete results." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-12", "text": "This open research question concerns the similarities and differences between syntactic and semantic relations at the sentence level and at the discourse level." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-13", "text": "Webber (2006) and Lee et al. (2006) have addressed this very issue in the context of the PDTB annotations and have arrived at the following empirical generalizations:" }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-14", "text": "1. While the arity of predicates at the sentential level can vary, e.g. one argument in the case of intransitive verbs, two in the case of transitives, three for ditransitives, etc., the arity of discourse connectives is fixed and consists of exactly two arguments." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-15", "text": "2. 
While syntactic dependencies can be quite complex and may involve highly nested or even crossing dependencies of various kinds, dependencies expressed by discourse connectives tend to be much more limited, typically involving tree-like structures and not introducing structural ambiguities of scope or attachment." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-16", "text": "3. More complex cases of discourse connectives that prima facie seem to involve crossing or partially overlapping arguments can be reduced to the independent discourse mechanisms of anaphora and attribution and thus do not introduce any added complexities." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-17", "text": "The third generalization is further elaborated by Webber et al. (2003) who distinguish between coordinating conjunctions such as and, or, so, and but and subordinating conjunctions such as although, whereas, and when on the one hand, and discourse adverbials such as then, otherwise, nevertheless, and instead on the other hand." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-18", "text": "It is the latter group, namely discourse adverbials, that, according to Webber et al. (2003) , should be considered as anaphors in very much the same way as other anaphoric expressions such as definite descriptions and pronouns." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-19", "text": "The purpose of this paper is to further examine and refine the above hypotheses by looking in some detail at a family of discourse connectives, all involving the notion of contrast." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-20", "text": "----------------------------------" }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-21", "text": "**THE DATA**" }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-22", "text": "The British National Corpus (BNC; Burnage and Baguley (1996) ) served as the data source for the present investigation." 
}, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-23", "text": "The BNC is a 100 million word collection of samples from a wide range of sources, designed to represent a wide cross-section of current British English, both spoken and written." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-24", "text": "The reasons for choosing the BNC rather than the Wall Street Journal (WSJ) corpus, which provides the data source for the PDTB, are two-fold: (i) The BNC is a hundred times larger than the 1-million word WSJ corpus and thus yields a much larger data source, and (ii) the BNC is much more balanced in the genres represented than the WSJ." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-25", "text": "The lemma contrast with part of speech tag noun appears 6816 times in the BNC." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-26", "text": "In the current experiment we extracted all occurrences of the noun sense of contrast in combination with the preposition in and possibly intervening adjectives such as marked, sharp or stark, yielding patterns such as in contrast or in sharp contrast." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-27", "text": "While additional data involving the preposition by or related connectives such as in comparison or by comparison still need to be examined, the current data set of 2693 examples of the phrase in (ADJ) contrast suffices to address the theoretical issues most relevant for this paper." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-28", "text": "----------------------------------" }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-29", "text": "**SYNTACTIC PROPERTIES**" }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-30", "text": "This section provides an overview of the various syntactic environments that the phrase in contrast can occur in, all of which are attested in the BNC." 
}, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-31", "text": "In contrast can appear either with or without an accompanying prepositional phrase, as shown in (1) and (2), respectively." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-32", "text": "1 Among these two options, occurrences without a prepositional phrase are much more frequent in the BNC." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-33", "text": "(1) But both are acceptably direct, although the Corrado's steering has pinpoint accuracy." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-34", "text": "It's a shame, then, that its gearchange is coarse and sloppy." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-35", "text": "In contrast, the Calibra's is light and quick, although the clutch action could be more progressive." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-36", "text": "A6W (0763) (2) Yet this is the first serious attempt to write about the revolution since the heyday of the early 1970s." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-37", "text": "In contrast to the books written then, The Road to Jaramillo is full of insight into the interactions, communications, thoughts and impressions that make scientific problem solving enlightening." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-38", "text": "B72 (1514) While in examples (1) and (2) the phrase in contrast appears sentence-initially, it can also appear in non-initial position, as in (3) or even as the last phrase in a sentence, as in (4)." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-39", "text": "(3) On the other side of the mill, in contrast, was a deep high banked muddy trough meandering the last half mile into the Severn estuary." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-40", "text": "B3J(1927) (4) The vegetation of urban commons varies region by region, and so unwittingly contributes to local character in contrast with most urban landscapes." 
}, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-41", "text": "B7L (1619) Sentence-initial occurrences far outrank non-initial occurrences in the BNC." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-42", "text": "Among non-initial occurrences, sentence-final placement is more frequent than non-final placement." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-43", "text": "Thus, the placement of in contrast seems to be concentrated at the left or right edge of the clause, with strong preference for the left edge." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-44", "text": "Finally, sentence-initially in contrast can cooccur with other discourse connectives with related meanings such as however, as shown in (5)." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-45", "text": "(5) Like Mozambique and Nicaragua, it is struggling to survive, with its education system in chaos." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-46", "text": "However, in contrast, successive Sudanese governments have made little attempt to initiate radical social transformation." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-47", "text": "B12 (0566) All of the cases considered so far represent cases where in contrast functions as a prepositional phrase adjunct of the clause that it appears in." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-48", "text": "However, in contrast can also appear in predicative position with the copula be or with light verbs such as stand in (6)." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-49", "text": "(6) South Africa stands in apparent contrast to the rest of the states considered here." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-50", "text": "B12(0020) (6) also provides an example of an adjectival premodifier that can modify contrast and that tends to function as an intensifier." 
}, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-51", "text": "Other such modifiers include profound, sharpest, strong, utter and clear." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-52", "text": "----------------------------------" }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-53", "text": "**DISCOURSE ANAPHORA**" }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-54", "text": "This section will focus on the discourse function of the adverbial phrase in contrast." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-55", "text": "Following Webber et al. (2003), we will argue that it resembles other discourse adverbials such as then, otherwise, and nevertheless in that it crucially involves the notion of discourse anaphora." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-56", "text": "Discourse anaphora involves a relation between an anaphor, such as a pronoun or a temporal adverbial, and an antecedent that is present in the previous discourse or that can be inferred from it." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-57", "text": "In the case of pronouns, antecedents are typically NPs, while the antecedents of temporal adverbials can be time-denoting expressions, such as dates, events or states of affairs." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-58", "text": "For pronouns, discourse anaphora can either involve coreference or more indirect referential relations which do not involve identity of reference with a previous discourse entity, but where the anaphor is merely associated with a previously mentioned discourse entity." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-59", "text": "Such cases of indirect referential relations include cases of bridging, as in (7), where the anaphor, in this case the receiver stands in a part-whole-relation to its antecedent -in this case a phone." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-60", "text": "(7) Myra darted to a phone and picked up the receiver." 
}, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-62", "text": "Other-anaphora (Bierner and Webber (2000) Bierner (2001), Modjeska (2002)), as in (8), provides another instance of such an indirect referential relation." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-63", "text": "(8) Sue grabbed one phone, as Tom darted to the other phone. (Webber et al. (2003), p. 555) Here the referent of the other phone can be inferred from the antecedent one phone." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-64", "text": "The referential relation between the anaphor and the antecedent is not one of identity of reference." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-65", "text": "Rather, the referents of the antecedent and the anaphor together constitute the set of phones owned by Sue and Tom." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-66", "text": "It is indicative of the anaphoric character of in contrast that it licenses other-anaphora in the same way, as shown in (9)." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-67", "text": "(9) He retired to Hampshire and died in 1832 at the age of 76." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-68", "text": "The MCC continue to care for his grave." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-69", "text": "In contrast, on the other side of the same cemetery is the grave of the Burgess family, where the ashes of the spy Guy Burgess, who died in Russia, were placed." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-70", "text": "BM4 (0772) Note that (9) is not an isolated case." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-71", "text": "(3) shown above provides another example of this kind." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-72", "text": "A second piece of evidence in support of the anaphoric properties of in contrast concerns ellipsis, as shown in (1), repeated below as (10)." 
}, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-73", "text": "plete, delightful, direct, distinct, dramatic, elegant, explicit, extreme, fascinating, frightening, further, great, greater, harmonic, harmonious, high, marked, methodological, moderate, profound, pure, real, sad, sharp, sharpest, significant, sorry, strange, stark, strong, subdued, sympathetic, total, unhappy, utter, and welcome." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-74", "text": "With the possible exception of subdued, all other adjectives function as intensifiers." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-75", "text": "Given the lexical meaning of contrast and its discourse function, this should come as no surprise." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-76", "text": "(10) It's a shame, then, that its gearchange is coarse and sloppy." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-77", "text": "In contrast, the Calibra's is light and quick, although the clutch action could be more progressive." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-78", "text": "A6W (0763) Here the elliptical the Calibra's is missing its nominal head, which is provided by the antecedent gearchange." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-79", "text": "Yet another anaphoric effect licensed by in contrast arises with respect to the notion of domain restriction, previously studied by, among many others, Lewis (1979) , Hinrichs (1988) and Hinrichs (1998), and von Fintel (1994) ." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-80", "text": "(11) Few countries have satisfactory legislation on pesticides or the trained manpower to enforce it." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-81", "text": "In contrast, extensive use of pesticides in Europe, North America and Japan is backed by government legislation or voluntary schemes that provide farmers with detailed advice." 
}, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-82", "text": "B7G (0726) Note that the domain of the set of countries in the quantified NP few countries is subsequently narrowed so as to not include countries in Europe, North America and Japan." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-83", "text": "It is precisely the explicitly mentioned contrast that leads to this effect." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-84", "text": "Webber et al. (2003) observe that identification of the correct antecedent of a definite description such as the tower or this tower in (12a) or a discourse adverbial such as otherwise in (12b) may require reference to abstract discourse objects such as the result of stacking blocks (to form a tower) or the state of not wanting an apple as the logical antecedent of a definite description or of a discourse adverbial." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-85", "text": "(12) a. Stack five blocks on top of one another." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-86", "text": "Now close your eyes and try knocking the tower, this tower\u00a1 over with your nose. (Webber et al. (2003) , p. 552) b. Do you want an apple? Otherwise you can have a pear." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-87", "text": "(Webber et al. (2003), p. 552) Notice that the same kind of inference is required for contrast in example (13), providing further evidence for the anaphoric nature of this discourse connective." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-88", "text": "(13) Jack's heart lurched as he saw the ambulances and the busy, functional building and he immediately forgot everything they had been saying. \"I'll ask where he is,\" said Jamie Shepherd as they walked towards the reception desk." 
}, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-89", "text": "In contrast to the outside, the area was softly carpeted, softly lit, as if illness and death had to be cushioned away, made to look as if they didn't exist." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-90", "text": "BPD(0200)" }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-91", "text": "The referent of outside in (13) is never explicitly mentioned." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-92", "text": "Rather, outside refers back to the entire scene described before." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-93", "text": "Another type of inference that is sometimes necessitated by the in contrast connective concerns the operation of complementarity of reference as in (14)." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-94", "text": "(14) Other speed-reducing devices may be added, such as regular shifts in the axis of the road, together with changes in the profile in the form of ramps and speed humps (Figure 4.3) ." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-95", "text": "Narrowings that allow a cycle to pass but not two cars are frequently added, often reinforced by the placement of trees, planters and street furniture." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-96", "text": "In contrast to the flowing design of fast roads, design elements are angular and of pedestrianscale, typified by low-level lamp posts which avoid the \"sea of light\" provided by high poles in traffic streets (Figure 4 .4)." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-97", "text": "C8F (0297) In this text, which is on the topic of child safety, roads are never explicitly mentioned." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-98", "text": "Rather the concept of slow neighborhood roads can only be inferred from the description." 
}, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-99", "text": "The first explicit mention of the term road then refers to the opposite term fast roads." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-100", "text": "Comparison of in contrast with personal pronouns yields yet another similarity with other anaphoric expressions." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-101", "text": "Like with personal pronouns, the antecedent of in contrast can either occur across sentences, as in all of the examples considered so far, or it can occur intrasententially, as in example (15)." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-102", "text": "(15) In contrast to his predecessors who worked at all hours of the day Macmillan tended to keep office hours." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-103", "text": "B0H (0476) Another property that distinguishes anaphoric discourse adverbials from structural connectives in the sense of Webber et al. (2003) , i.e. coordinating and subordinating conjunctions, concerns the type of dependencies that the arguments of the types of connectives can enter into." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-104", "text": "While structural connectives only allow non-crossing adjacent material as their arguments, discourse adverbials may involve crossing dependencies among non-adjacent material -just like other anaphoric expressions." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-105", "text": "e.g. pronouns and definite descriptions." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-106", "text": "Figure 1 and Figure 2 show this type of crossing dependency for in contrast both intrasententially and across sentence boundaries." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-107", "text": "While Figure 2 involves material in adjacent clauses, there are plenty of examples where such dependencies extend over an entire paragraph or over even larger amounts of text." 
}, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-108", "text": "----------------------------------" }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-110", "text": "This section will try to develop a logical representation for in contrast that does justice to its anaphoric properties and to the lexical semantics of the lexeme contrast." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-111", "text": "As will become clear in the course of this discussion, this task is far from trivial." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-112", "text": "The following remarks should, thus, not be taken as a fully worked-out proposal, but rather as an attempt to point out a set of crucial properties that a fully worked-out account needs to take into consideration." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-113", "text": "The discourse properties of in contrast are not just of interest from a purely theoretical perspective." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-114", "text": "Teufel and Moens (2002) and Siddharthan and Teufel (2007) In the previous section we established at some length that in contrast shares with other discourse adverbials its anaphoric behavior." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-115", "text": "This naturally raises the question whether the semantics that has been proposed for this class of expressions can be naturally generalized to the semantics of in contrast." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-116", "text": "Following earlier proposals by Hinrichs (1986) and Kamp and Reyle (1993) , Webber et al. (2003) assume that the semantics of discourse adverbials such as then involves an anaphoric relation between two events." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-117", "text": "For example, the two clauses in (16) refer to individual events, which are put in the sequence-relation by the adverbial then." 
}, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-118", "text": "There are at least two difficulties associated with modelling the semantics of in contrast as a two-place relation between events: The scope of the two arguments of the contrast relation often extends beyond descriptions of individual events, as illustrated by examples such as (13), where the contrast involves sets of events and states of affairs." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-119", "text": "Thus, at the very least, one would have to generalize the semantics of in contrast to relations between sets of events and states of affairs, with relations between single events or states of affairs as a special case." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-120", "text": "However, it is difficult to see how such a modified representation could be suitably generalized to adequately model examples as in (18)." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-121", "text": "(18) The Holsteins also tend to have much more white in the coat so that the white areas predominate and they could almost be described as white-and-blacks in contrast to the black-and-white Friesian type." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-122", "text": "B0K (0438) (18) explicitly contrasts two sets of individuals, cows in this case, rather than events or states of affairs." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-123", "text": "One could, of course, argue that the in contrast relation is simply polymorphic, referring either to relations between sets of events or states of affairs or to relations between (sets of) individuals or other entities such as locations, as in (13)." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-124", "text": "Support for such a position could be derived from the fact that there are two syntactic variants of the in contrast connective: one with and one without a postmodifying prepositional phrase." 
}, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-125", "text": "The latter could then be interpreted as involving a relation between sets of events and/or states of affairs, and the former with relations between entities of various sorts, e.g. individuals, locations, times, etc." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-126", "text": "However, there are two shortcomings of such an account: (i) Everything else being equal, one would prefer a unified account of in contrast and thereby of the two syntactic constructions, and, more importantly, (ii) an account of in contrast without a postmodifying prepositional phrase that merely posits a relation between sets of events and/or states of affairs misses the fact that this construction also focuses on specific participants of the events and/or states of affairs as opposites. (17), for example, focuses on the differences between the price of prime versus average properties." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-127", "text": "Once one recognizes such contrast pairs for the in contrast construction without a postmodifying prepositional phrase, then a unified analysis is starting to emerge." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-128", "text": "This analyis can be illustrated by the logical formulas in (19a) and (19b) for examples (17) and (18), respectively." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-129", "text": "----------------------------------" }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-131", "text": "(19) a. in-contrast((the-price-of-prime-properties-inYorkshire, x [According-to-the-buyers'-guideproduced-by-estate-agents-Savills,-who-specialize-in-such-properties, x has-risen-by-more-than-130-per-cent-over-the-last-three-years]), (the-price-of-average-properties,-those-on-which-the-Halifax-has-lent-mortgages, x [x has-risen-by-morethan-130-per-cent-over-the-last-three-years]))." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-132", "text": "b. 
in-contrast((The-Holsteins, x [x also-tend-tohave-much-more-white-in-the-coat-so-thatthe-white-areas-predominate-and-they-couldalmost-be-described-as-white-and-blacks]), (the-black-and-white-Friesian-type, x \u00a1 [x alsotend-to-have-much-more-white-in-the-coat-sothat-the-white-areas-predominate-and-they-couldalmost-be-described-as-white-and-blacks]))." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-133", "text": "In ( The account of in contrast which has been illustrated by the formulas in (19) and (20b) has two attractive properties: (i) from a theoretical perspective, it provides a unified analysis of the in contrast construction with and without a postmodifying prepositional phrase; (ii) by separating out the contrast pairs (as the first members of each argument pair) from their contrasting properties, it provides a transparent representation for applications such as information extraction and text summarization, which require tracking discourse entities and their relevant properties." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-134", "text": "Finally, it is worth reviewing the proposed analysis in light of the generalization put forth by Webber (2006) and by Lee et al. (2006) , namely that discourse connectives always denote two-place relations." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-135", "text": "The semantics of in contrast proposed in this section is consistent with this hypothesis since it assumes a two-place relation." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-136", "text": "However, notice that each of the two arguments is further structured into a contrast item and a contrast property." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-137", "text": "It is this highly structured character of the in-contrast relation that distinguishes this discourse connective from the much simpler two-place relations denoted by coordinating and subordinating conjunctions." 
}, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-138", "text": "The latter simply denote relations between events and/or states of affairs, namely those denoted by the two conjunct clauses." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-139", "text": "The semantics proposed for in-contrast, thus, provides further evidence for the distinction between coordinating and subordinating conjunctions and discourse adverbials that has been put forth by Webber et al. (2003) ." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-140", "text": "----------------------------------" }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-141", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-142", "text": "This paper has presented a corpus-based study of the discourse connective in contrast." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-143", "text": "The corpus data were drawn from the British National Corpus (BNC) and were analyzed at the levels of syntax, discourse structure, and compositional semantics." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-144", "text": "Following Webber et al. (2003) , the paper argues that in contrast crucially involves discourse anaphora and, thus, resembles other discourse adverbials such as then, otherwise, and nevertheless." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-145", "text": "The compositional semantics proposed for other discourse connectives, however, does not straightforwardly generalize to in contrast, for which the notions of contrast pairs and contrast properties are essential." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-146", "text": "In future work we plan to consider a wider range of contrast relations in discourse such as by comparison, contrary to and on the other hand in order to ascertain whether the properties of the discourse connective in contrast will generalize to these cases as well." 
}, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-147", "text": "A second line of research will investigate ways of automatically detecting comparison patterns and contrast pairs, which figure prominently in the compositional semantics of in contrast, by means of machine learning techniques." }, { "sent_id": "79ff6e23cc951aa18ae53763e9c982-C001-148", "text": "Here we expect that elliptical expressions, other-anaphora, and syntactic parallelism will provide important cues." } ], "y": { "@DIF@": { "gold_contexts": [ [ "79ff6e23cc951aa18ae53763e9c982-C001-4" ], [ "79ff6e23cc951aa18ae53763e9c982-C001-139" ], [ "79ff6e23cc951aa18ae53763e9c982-C001-144" ] ], "cite_sentences": [ "79ff6e23cc951aa18ae53763e9c982-C001-4", "79ff6e23cc951aa18ae53763e9c982-C001-139", "79ff6e23cc951aa18ae53763e9c982-C001-144" ] }, "@BACK@": { "gold_contexts": [ [ "79ff6e23cc951aa18ae53763e9c982-C001-17" ], [ "79ff6e23cc951aa18ae53763e9c982-C001-63" ], [ "79ff6e23cc951aa18ae53763e9c982-C001-84" ], [ "79ff6e23cc951aa18ae53763e9c982-C001-86" ], [ "79ff6e23cc951aa18ae53763e9c982-C001-103" ], [ "79ff6e23cc951aa18ae53763e9c982-C001-116" ] ], "cite_sentences": [ "79ff6e23cc951aa18ae53763e9c982-C001-17", "79ff6e23cc951aa18ae53763e9c982-C001-63", "79ff6e23cc951aa18ae53763e9c982-C001-84", "79ff6e23cc951aa18ae53763e9c982-C001-86", "79ff6e23cc951aa18ae53763e9c982-C001-103", "79ff6e23cc951aa18ae53763e9c982-C001-116" ] }, "@SIM@": { "gold_contexts": [ [ "79ff6e23cc951aa18ae53763e9c982-C001-18" ], [ "79ff6e23cc951aa18ae53763e9c982-C001-55" ] ], "cite_sentences": [ "79ff6e23cc951aa18ae53763e9c982-C001-18", "79ff6e23cc951aa18ae53763e9c982-C001-55" ] } } }, "ABC_90522b5ac99d1657bf9af9d165c36e_19": { "x": [ { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-56", "text": "**DOWN-SAMPLING**" }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-2", "text": "The stability of word embedding algorithms, i.e., the consistency of the word 
representations they reveal when trained repeatedly on the same data set, has recently raised concerns." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-3", "text": "We here compare word embedding algorithms on three corpora of different sizes, and evaluate both their stability and accuracy." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-4", "text": "We find strong evidence that down-sampling strategies (used as part of their training procedures) are particularly influential for the stability of SVD PPMI -type embeddings." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-5", "text": "This finding seems to explain diverging reports on their stability and lead us to a simple modification which provides superior stability as well as accuracy on par with skip-gram embeddings." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-6", "text": "----------------------------------" }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-8", "text": "Word embedding algorithms implement the latest form of distributional semantics originating from the seminal work of Harris (1954) or Rubenstein and Goodenough (1965) ." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-9", "text": "They generate dense vector space representations for words based on co-occurrences within a context window." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-10", "text": "They sample word-context pairs, i.e., typically two co-occurring tokens, from a corpus and use these to generate vector representations of words and their context."
}, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-11", "text": "Changes to the algorithm's sampling mechanism can lead to new capabilities, e.g., processing dependency information instead of linear co-occurrences (Levy and Goldberg, 2014a) , or increased performance, e.g., using word association values instead of raw co-occurrence counts (Bullinaria and Levy, 2007) ." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-12", "text": "Word embedding algorithms commonly downsample contexts to lessen the impact of highfrequency words (termed 'subsampling' in Levy et al. (2015) ) or increase the relative importance of words closer to the center of a context window (called 'dynamic context window' in Levy et al. (2015) )." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-13", "text": "The effect of using such down-sampling strategies on accuracy in word similarity and analogy tasks was explored in several papers (e.g., Levy et al. (2015) )." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-14", "text": "However, down-sampling and details of its implementation also have major effects on the stability of word embeddings (also known as 'reliability'), i.e., the degree to which models trained independently on the same data agree on the structure of the resulting embedding space." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-15", "text": "This problem has lately raised severe concerns in the word embedding community (e.g., Hellrich and Hahn (2016b) ; Antoniak and Mimno (2018) ; Wendlandt et al. (2018) ) and is also of interest to the wider machine learning community due to the influence of probabilistic-and thus unstablemethods on experimental results (Reimers and Gurevych, 2017; Henderson et al., 2018) , as well as replicability and reproducibility (Ivie and Thain, 2018, pp." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-16", "text": "63:3-4) ." 
}, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-17", "text": "Stability is critical for studies examining the underlying semantic space as a more advanced form of corpus linguistics, e.g., tracking lexical change (Kim et al., 2014; Kulkarni et al., 2015; Hellrich et al., 2018) ." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-18", "text": "Unstable word embeddings can lead to serious problems in such applications, as interpretations will depend on the luck of the draw." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-19", "text": "This might also affect high-stake fields like medical informatics where patients could be harmed as a consequence of misleading results (Coiera et al., 2018) ." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-20", "text": "In the light of these concerns, we here evaluate down-sampling strategies by modifying the SVD PPMI (Singular Value Decomposition of a Positive Pointwise Mutual Information matrix; Levy et al. (2015) ) algorithm and comparing its results with those of two other embedding algorithms, namely, GLOVE (Pennington et al., 2014) and SGNS (Mikolov et al., 2013a,c) ." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-21", "text": "Our analysis is based on three corpora of different sizes and investigates effects on both accuracy and stability." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-22", "text": "The inclusion of accuracy measurements and the larger size of our training corpora exceed prior work." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-23", "text": "We show how the choice of down-sampling strategies, a seemingly minor detail, leads to major differences in the characterization of SVD PPMI in recent studies (Hellrich and Hahn, 2017; Antoniak and Mimno, 2018) ." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-24", "text": "We also present SVD WPPMI , a simple modification of SVD PPMI that replaces probabilistic down-sampling with weighting." 
}, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-25", "text": "What, at first sight, appears to be a small change leads, nevertheless, to an unrivaled combination of stability and accuracy, making it particularly well-suited for the above-mentioned corpus linguistic applications." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-26", "text": "----------------------------------" }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-27", "text": "**COMPUTATIONAL METHODOLOGY**" }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-28", "text": "----------------------------------" }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-29", "text": "**MEASURING STABILITY**" }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-30", "text": "Measuring word embedding stability can be linked to older research comparing distributional thesauri (Salton and Lesk, 1971) by the most similar words they contain for particular anchor words (Weeds et al., 2004; Padr\u00f3 et al., 2014) ." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-31", "text": "Most stability experiments focused on repeatedly training the same algorithm on one corpus Hahn, 2016a,b, 2017; Antoniak and Mimno, 2018; Pierrejean and Tanguy, 2018; Chugh et al., 2018) , whereas Wendlandt et al. (2018) quantified stability by comparing word similarity for models trained with different algorithms." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-32", "text": "We follow the former approach, since we deem it more relevant for ensuring that study results can be replicated or reproduced." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-33", "text": "Stability can be quantified by calculating the overlap between sets of words considered most similar in relation to pre-selected anchor words." 
}, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-34", "text": "Reasonable metrical choices are, e.g., the Jaccard coefficient (Jaccard, 1912) between these sets (Antoniak and Mimno, 2018; Chugh et al., 2018) , or a percentage based coefficient (Hellrich and Hahn, 2016a,b; Wendlandt et al., 2018; Pierrejean and Tanguy, 2018) ." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-35", "text": "We here use j@n, i.e., the Jaccard coefficient for the n most similar words." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-36", "text": "It depends on a set M of word embedding models, m, for which the n most similar words (by cosine) from a set A of anchor words, a, as provided by the 'most similar words' function msw(a, n, m), are compared:" }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-37", "text": "----------------------------------" }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-38", "text": "**SVD PPMI WORD EMBEDDINGS**" }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-39", "text": "The SVD PPMI algorithm from Levy et al. (2015) generates word embeddings in a three-step process." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-40", "text": "First, a corpus is transformed to a wordcontext matrix listing co-occurrence frequencies." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-41", "text": "Next, the frequency-based word-context matrix is transformed into a word-context matrix that contains word association values." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-42", "text": "Finally, singular value decomposition (SVD; Berry (1992); Saad (2003) ) is applied to the latter matrix to reduce its dimensionality and generate word embeddings." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-43", "text": "Each token from the corpus is successively processed in the first step by recording co-occurrences with other tokens within a symmetric window of a certain size." 
}, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-44", "text": "For example, in a token sequence . . . , w i\u22122 , w i\u22121 , w i , w i+1 , w i+2 , . . . , with w i as the currently modeled token, a window of size 1 would be concerned with w i\u22121 and w i+1 only." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-45", "text": "Down-sampling as described by Levy et al. (2015) increases accuracy by ignoring certain co-occurrences while populating the word-context matrix (further details are described below)." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-46", "text": "A word-context matrix is also used in GLOVE, whereas SGNS directly operates on sampled cooccurrences in a streaming manner." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-47", "text": "Positive pointwise mutual information (PPMI) is a variant of pointwise mutual information (Fano, 1961; Church and Hanks, 1990) , independently developed by Niwa and Nitta (1994) and Bullinaria and Levy (2007) ." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-48", "text": "PPMI measures the ratio between observed co-occurrences (normalized and treated as a joint probability) and the expected co-occurrences (based on normalized frequencies treated as individual probabilities) for two words i and j while ignoring all cases in which the observed co-occurrences are fewer than the expected ones:" }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-49", "text": "(2) Truncated SVD reduces the dimensionality of the vector space described by the PPMI wordcontext matrix M ." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-50", "text": "SVD factorizes M in three special 1 matrices, so that M = U \u03a3V T ." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-51", "text": "Entries of \u03a3 are ordered by their size, allowing to infer the relative importance of vectors in U and V ." 
}, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-52", "text": "This can be used to discard all but the highest d values and corresponding vectors during truncated SVD, so that" }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-53", "text": "Both GLOVE and SGNS start with randomly initialized vectors of the desired dimensionality d and have thus no comparable step in their processing pipeline." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-54", "text": "However, Levy and Goldberg (2014c) showed SGNS to perform as an approximation of SVD applied to a PPMI matrix." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-55", "text": "----------------------------------" }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-57", "text": "Down-sampling by some factor requires both a formal expression to define the factor, as well as a strategy to perform down-sampling according to this factor-data can either be sampled probabilistically or weighted (see below)." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-58", "text": "The following set of formulae is shared by SGNS and SVD PPMI , whereas GLOVE uses a distinct one." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-59", "text": "Distance-based down-sampling depends on the distance between the currently modeled token w i and a second token w j in a token sequence (such as the above example)." 
}, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-60", "text": "The distance d between w i and w j is given as:" }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-61", "text": "To increase the effect of the nearest-and thus assumedly most salient-tokens both SVD PPMI and SGNS down-sample words based on this distance with a distance factor, df (s being the size of the window used for sampling):" }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-62", "text": "To limit the effect of high-frequency wordslikely to be function words-both algorithms also down-sample words according to a frequency factor (ff ), which compares each token's relative frequency r(w) with a threshold t:" }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-63", "text": "The frequency down-sampling factor for the cooccurrence of two tokens w i and w j is then given by the product of their down-sampling factors, i.e., the probabilities are treated as being independent:" }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-64", "text": "The strategy used to apply these down-sampling factors can affect accuracy and, especially, stability, as can the decision not to apply them at all." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-65", "text": "These down-sampling processes can either be probabilistic, i.e., each word-context pair is processed with a probability given by df (w i , w j ) \u00b7 ff (w i , w j ), or operate by weighting, i.e., for each observed co-occurrence only a fraction of a count according to the product of df and ff is added to the word-context matrix." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-66", "text": "SGNS uses probabilistic down-sampling, GLOVE uses weighting and SVD PPMI by Levy et al. (2015) allows for probabilistic down-sampling or no down-sampling at all." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-67", "text": "As SVD itself is non-probabilistic 2 (Saad, 2003, chs." 
}, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-68", "text": "6.3 & 7 .1) any instability observed for SVD PPMI must be caused by its probabilistic down-sampling." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-69", "text": "We thus suggest SVD WPPMI , i.e., SVD of a PPMI matrix with weighted entries, a simple modification which uses fractional counts according to df (w i , w j ) \u00b7 ff (w i , w j )." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-70", "text": "As shown in Section 5, this modification is beneficial for both accuracy and stability." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-71", "text": "----------------------------------" }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-72", "text": "**CORPORA**" }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-73", "text": "The corpora used in most stability studies are relatively small." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-74", "text": "For instance, the largest corpus in Antoniak and Mimno (2018) contains 15M tokens, whereas the corpus used by Hellrich and Hahn (2017) and the largest corpus from Wendlandt et al. (2018) each contain about 60M tokens." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-75", "text": "Pierrejean and Tanguy (2018) used three corpora of about 100M words each." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-76", "text": "Two exceptions are Hellrich and Hahn (2016a,b) using relatively large Google Books Ngram corpus subsets (Michel et al., 2011) with 135M to 4.7G n-grams, as well as Chugh et al. (2018) who investigated the influence of embedding dimensionality on stability based on three corpora with only 1.2-2.6M tokens." 
}, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-77", "text": "3 We used three different English corpora as training material: the 2000s decade of the Corpus of Historical American English (COHA; Davies (2012)), the English News Crawl Corpus (NEWS) collected for the 2018 WMT Shared Task 4 and a Wikipedia corpus (WIKI)." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-78", "text": "5 COHA contains 14k texts and 28M tokens, NEWS 27M texts and 550M tokens, and WIKI 4.5M texts and 1.7G tokens, respectively." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-79", "text": "COHA was selected as it is commonly used in corpus linguistic studies, whereas NEWS and WIKI serve to gauge the performance of all algorithms in general applica-tions." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-80", "text": "The latter two corpora are far larger than common in stability studies, making our study the largest-scale evaluation of embedding stability we are aware of." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-81", "text": "All three corpora were tokenized, transformed to lower case and cleaned from punctuation." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-82", "text": "We used both the corpora as-is, as well as independently drawn random subsamples (see also Hellrich and Hahn (2016a) ; Antoniak and Mimno (2018) ) to simulate the arbitrary content selection in most corpora-texts could be removed or replaced with similar ones without changing the overall nature of a corpus, e.g., Wikipedia articles are continuously edited." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-83", "text": "Subsampling allows us to quantify the effect of this arbitrariness on the stability of embeddings, i.e., how consistently word embeddings are trained on variations of a corpus." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-84", "text": "Subsampling was performed on the level of the constituent texts of each corpus, e.g., individual news articles." 
}, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-85", "text": "For a corpus with n texts we drew n samples with replacement." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-86", "text": "Texts could be drawn multiple times, but only one copy was kept, reducing corpora to 1 \u2212 1 /e \u2248 2 /3 of their original size." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-87", "text": "----------------------------------" }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-88", "text": "**EXPERIMENTAL SET-UP**" }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-89", "text": "We compared five algorithm variants: GLOVE, SGNS, SVD PPMI without down-sampling, SVD PPMI with probabilistic down-sampling, and SVD WPPMI ." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-90", "text": "While we could use SGNS 6 and GLOVE 7 implementations directly, we had to modify SVD PPMI 8 to support the weighted sampling used in SVD WPPMI ." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-91", "text": "As proposed by Antoniak and Mimno (2018) , we further modified our SVD PPMI implementation to use random numbers generated with a non-fixed seed for probabilistic down-sampling." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-92", "text": "A fixed seed would benefit reliability, but also act as a bias during all analyses-seed choice has been shown to cause significant differences in experimental results (Henderson et al., 2018) ." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-93", "text": "Down-sampling strategies for df and ff can be chosen independently of each other, e.g., using probabilistic down-sampling for df together with weighted down-sampling for ff ." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-94", "text": "However, we decided to use the same down-sampling strategies, e.g., weighting, for both factors, taking into ac-count computational limitations as well as results from pre-tests that revealed little benefit of mixed strategies." 
}, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-95", "text": "9 We trained ten models for each algorithm variant and corpus." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-96", "text": "10 In the case of subsampling, each model was trained on one of the independently drawn samples." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-97", "text": "Stability was evaluated by selecting the 1k most frequent words in each non-bootstrap subsampled corpus as anchor words and calculating j@10 (see Equation 1)." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-98", "text": "11 Following Hellrich and Hahn (2016a,b) , we did not only investigate stability, but also the accuracy of our models to gauge potential tradeoffs." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-99", "text": "We measured the Spearman rank correlation between cosine-based word similarity judgments and human ones with four psycholinguistic test sets, i.e., the two crowdsourced test sets MEN (Bruni et al., 2012) and MTurk (Radinsky et al., 2011) , the especially strict SimLex-999 (Hill et al., 2014) and the widely used WordSim-353 (WS-353; Finkelstein et al. (2002) )." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-100", "text": "We also measured the percentage of correctly solved analogies (using the multiplicative formula from Levy and Goldberg (2014b) ) with two test sets developed at Google (Mikolov et al., 2013a) and Microsoft Research (MSR; Mikolov et al. (2013b) )." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-101", "text": "Table 1 shows the accuracy and stability for all tested combinations of algorithm and corpus variants." 
}, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-102", "text": "Accuracy differences between test sets are in line with prior observations and general 9 The strongest counterexample is a combination of probabilistic down-sampling for df and weighting for ff which lead to small, yet significant improvements in the MEN (0.703 \u00b1 0.001) and MTurk (0.568 \u00b1 0.015) similarity tasks (cf . Table 1 )." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-103", "text": "However, other accuracy tasks showed no improvements and the stability of this approach (0.475 \u00b1 0.001) was far closer to SVDPPMI with fully probabilistic down-sampling than to the perfect stability of SVDWPPMI." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-104", "text": "----------------------------------" }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-105", "text": "**EXPERIMENTAL RESULTS**" }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-106", "text": "10 Hyperparameters roughly follow Levy et al. (2015) ." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-107", "text": "We used symmetric 5 word context windows for all models as well as frequent word down-sampling thresholds of 100 (GLOVE) and 10 \u22124 (others)." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-108", "text": "Default learning rates and numbers of iterations were used for all models." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-109", "text": "Eigenvalues as well as context vectors were ignored for SVDPPMI embeddings." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-110", "text": "5 negative samples were used for SGNS." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-111", "text": "The minimum frequency threshold was 50 for COHA, 100 for NEWS and 750 for WIKI-increased thresholds were necessary due to SVDPPMI's memory consumption scaling quadratically with vocabulary size." 
}, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-112", "text": "11 Stability calculation was not performed directly between all 10 models, as this would result in a single value and preclude significance tests." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-113", "text": "Instead, we generated ten j@10 values by calculating the stability of all subsets formed by leaving out each model once in a jackknife procedure." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-114", "text": "performance on WIKI is roughly in-line with the data reported in Levy et al. (2015) ." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-115", "text": "In general, corpus size does seem to have a positive effect on accuracy." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-116", "text": "However, for MEN, MTurk and MSR the highest values are achieved with NEWS and not with WIKI." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-117", "text": "SVD PPMI variants seem to be less hampered by small training corpora, matching observations by Sahlgren and Lenci (2016) ." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-118", "text": "Stability is clearly positively influenced by corpus size for all probabilistic algorithm variants except GLOVE, which, in contrast, benefits from small training corpora." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-119", "text": "Also, randomly subsampling corpora has a negative effect on both accuracy and stability-this can be explained by the smaller corpus size for accuracy and the differences in training material (as subsampling was performed independently for each model) for stability." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-120", "text": "Figure 1 illustrates the stability of all tested algorithm variants." 
}, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-121", "text": "SVD WPPMI and SVD PPMI without down-sampling are the only systems which achieve perfect stability when trained on non-subsampled corpora." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-122", "text": "GLOVE is the third most reliable algorithm in this scenario, except for the large WIKI corpus." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-123", "text": "Corpus subsampling decreases the stability of all algorithms, with SVD WPPMI still performing better than all other alternatives." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-124", "text": "The only exception is subsampled COHA where the otherwise suboptimal GLOVE narrowly (0.330 instead of 0.329; difference significant with p < .05 by two-sided t-test) outperforms SVD WPPMI ." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-125", "text": "SVD WPPMI can achieve stability values on subsampled corpora that are competitive with those for SGNS and GLOVE trained on non-subsampled corpora." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-126", "text": "We found standard deviations for stability to be very low, the highest being 0.01 for GLOVE trained on nonsubsampled WIKI, probably due to the overlap in our jackknife procedure." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-127", "text": "Finally, we tested 12 the overall performance of each algorithm variant by first performing a Quade test (Quade, 1979) as a safeguard against type I Figure 1 : Stability for each combination of algorithm variant and corpus." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-128", "text": "Measured with j@10 metric (higher is better)." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-129", "text": "Same data as in Table 1 , standard deviations too small to display." 
}, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-130", "text": "errors, thus confirming the existence of significant differences between algorithms (p = 1.3 \u00b7 10 \u22127 )." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-131", "text": "We then used a pairwise Wilcoxon rank-sum test with Holm-\u0160id\u00e1k correction (see Dem\u0161ar (2006) ) in order to compare other algorithms with SVD WPPMI ." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-132", "text": "13 We found it to be not significantly different in accuracy from SGNS (p = 0.101), but significantly better than SVD PPMI without downsampling (corrected p = 5.4\u00b710 \u22126 ) or probabilistic down-sampling (corrected p = 0.015), as well as GLOVE (corrected p = 0.027)." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-133", "text": "Our results show SVD WPPMI to be both highly reliable and accurate, especially on COHA, which has a size common in both stability studies and corpus linguistic applications." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-134", "text": "Diverging reports on SVD PPMI stability-described as perfectly reliable in Hellrich and Hahn (2017) , yet not in Antoniak and Mimno (2018) -can thus be explained by their difference in down-sampling options, i.e., no down-sampling or probabilistic down-sampling." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-135", "text": "GLOVE's high stability in other studies (Antoniak and Mimno, 2018; Wendlandt et al., 2018) seems to be counterbalanced by its low accuracy and also appears to be limited to training on small corpora." 
}, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-136", "text": "----------------------------------" }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-137", "text": "**DISCUSSION**" }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-138", "text": "We investigated the effect of down-sampling strategies on word embedding stability by comparing five algorithm variants on three corpora, two of which were larger than those typically used in stability studies." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-139", "text": "We proposed a simple modification to the down-sampling strategy used for the SVD PPMI algorithm, SVD WPPMI , which uses weighting, to achieve an otherwise unmatched combination of accuracy and stability." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-140", "text": "We also gathered evidence that GLOVE lacks accuracy and is only stable when trained on small corpora." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-141", "text": "We thus recommend using SVD WPPMI , especially for studies targeting (qualitative) interpretations of semantic spaces (e.g., Kim et al. (2014) )." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-142", "text": "Overall, SGNS provided no benefit in accuracy over SVD WPPMI and the latter seemed especially well-suited for small training corpora." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-143", "text": "The only downside of SVD WPPMI we are aware of is its relatively large memory consumption during training shared by all SVD PPMI variants." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-144", "text": "Further research could investigate the performance of SVD WPPMI with other sets of hyperparameters or scrutinize the effect of down-sampling strategies on the ill-understood geometry of embedding spaces (Mimno and Thompson, 2017) ." 
}, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-145", "text": "It would also be interesting to investigate the effect of down-sampling and stability on downstream tasks in a follow-up to Wendlandt et al. (2018) ." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-146", "text": "Finally, the increasingly popular contextualized embedding algorithms, e.g., BERT (Devlin et al., 2018) or ELMo (Peters et al., 2018) , are also probabilistic in nature and should thus be affected by stability problems." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-147", "text": "A direct transfer of our type specific evaluation strategy is impossible." }, { "sent_id": "90522b5ac99d1657bf9af9d165c36e-C001-148", "text": "However, an indirect one could be achieved by averaging token-specific contextualized embeddings to generate type representations." } ], "y": { "@MOT@": { "gold_contexts": [ [ "90522b5ac99d1657bf9af9d165c36e-C001-15" ], [ "90522b5ac99d1657bf9af9d165c36e-C001-23" ] ], "cite_sentences": [ "90522b5ac99d1657bf9af9d165c36e-C001-15", "90522b5ac99d1657bf9af9d165c36e-C001-23" ] }, "@BACK@": { "gold_contexts": [ [ "90522b5ac99d1657bf9af9d165c36e-C001-31" ], [ "90522b5ac99d1657bf9af9d165c36e-C001-34" ], [ "90522b5ac99d1657bf9af9d165c36e-C001-74" ], [ "90522b5ac99d1657bf9af9d165c36e-C001-134" ], [ "90522b5ac99d1657bf9af9d165c36e-C001-135" ] ], "cite_sentences": [ "90522b5ac99d1657bf9af9d165c36e-C001-31", "90522b5ac99d1657bf9af9d165c36e-C001-34", "90522b5ac99d1657bf9af9d165c36e-C001-74", "90522b5ac99d1657bf9af9d165c36e-C001-134", "90522b5ac99d1657bf9af9d165c36e-C001-135" ] }, "@SIM@": { "gold_contexts": [ [ "90522b5ac99d1657bf9af9d165c36e-C001-82" ] ], "cite_sentences": [ "90522b5ac99d1657bf9af9d165c36e-C001-82" ] }, "@USE@": { "gold_contexts": [ [ "90522b5ac99d1657bf9af9d165c36e-C001-82" ] ], "cite_sentences": [ "90522b5ac99d1657bf9af9d165c36e-C001-82" ] }, "@DIF@": { "gold_contexts": [ [ "90522b5ac99d1657bf9af9d165c36e-C001-91" ] ], "cite_sentences": [ 
"90522b5ac99d1657bf9af9d165c36e-C001-91" ] }, "@EXT@": { "gold_contexts": [ [ "90522b5ac99d1657bf9af9d165c36e-C001-91" ] ], "cite_sentences": [ "90522b5ac99d1657bf9af9d165c36e-C001-91" ] } } }, "ABC_2e7df0912d9aac8bf97f4061de613f_19": { "x": [ { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-114", "text": "12.2% relative." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-68", "text": "Adversarial training [12] is integrated as follows." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-67", "text": "Early stopping (with a patience of 30 epochs) is performed on the validation data of the source speakers." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-2", "text": "We present a Lipreading system, i.e. a speech recognition system using only visual features, which uses domain-adversarial training for speaker independence." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-3", "text": "Domain-adversarial training is integrated into the optimization of a lipreader based on a stack of feedforward and LSTM (Long Short-Term Memory) recurrent neural networks, yielding an end-to-end trainable system which only requires a very small number of frames of untranscribed target data to substantially improve the recognition accuracy on the target speaker." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-4", "text": "On pairs of different source and target speakers, we achieve a relative accuracy improvement of around 40% with only 15 to 20 seconds of untranscribed target speech data." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-5", "text": "On multi-speaker training setups, the accuracy improvements are smaller but still substantial." 
}, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-6", "text": "----------------------------------" }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-8", "text": "Lipreading is the process of understanding speech by using solely visual features, i.e. images of the lips of a speaker." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-9", "text": "In communication between humans, lipreading has a twofold relevance [1] : First, visual cues play a role in spoken conversation [2] ; second, hearing-impaired persons may use lipreading as a means to follow verbal speech." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-10", "text": "With the success of computer-based speech recognition over the past decades, automatic lipreading has become an active field of research as well, with pioneering work by Petajan [3] , who used lipreading to augment conventional acoustic speech recognition, and Chiou and Hwang [4] , who were the first to perform lipreading without resorting to any acoustic signal at all." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-11", "text": "Since 2014, lipreading systems have systematically begun to use neural networks at part of the processing pipeline [5, 6] or for end-to-end-training [7, 8, 9] ." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-12", "text": "In our previous work [7] , we proposed a fully neural network based system, using a stack of fully connected and recurrent (LSTM, Long ShortTerm Memory) [10, 11] neural network layers." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-13", "text": "The scope of this paper is the introduction of state-of-theart methods for speaker-independent lipreading with neural networks." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-14", "text": "We evaluate our established system [7] in a crossspeaker setting, observing a drastic performance drop on unknown speakers." 
}, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-15", "text": "In order to alleviate the discrepancy between training speakers and unknown test speaker, we use domainadversarial training as proposed by Ganin and Lempitsky [12] : Untranscribed data from the target speaker is used as additional training input to the neural network, with the aim of pushing the network to learn an intermediate data representation which is domain-agnostic, i.e. which does not depend on whether the input data comes from a source speaker or a target speaker." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-16", "text": "We evaluate our system on a subset of the GRID corpus [13] , which contains extensive data from 34 speakers and is therefore ideal for a systematic evaluation of the proposed method." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-17", "text": "----------------------------------" }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-18", "text": "**RELATED WORK**" }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-19", "text": "Lipreading can be used to complement or augment speech recognition, particularly in the presence of noise [3, 14] , and for purely visual speech recognition [4, 15, 5] ." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-20", "text": "In the latter case, ambiguities due to incomplete information (e.g. about voicing) can be mitigated by augmenting the video stream with ultrasound images of the vocal tract [16] ." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-21", "text": "Visual speech processing is an instance of a Silent Speech interface [17] ; further promising approaches include capturing the movement of the articulators by electric or permanent magnetic articulography [18, 19] , and capturing of muscle activity using electromyography [20, 21, 22, 23] ." 
}, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-22", "text": "Versatile lipreading features have been proposed, such as Active Appearance Models [24] , Local Binary Patterns [25] , and PCA-based Eigenlips [26] and Eigentongues [27] ." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-23", "text": "For tackling speaker dependency, diverse scaling and normalization techniques have been employed [28, 29] ." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-24", "text": "Classification is often done with Hidden Markov Models (HMMs), e.g. [30, 15, 31, 32] ." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-25", "text": "Mouth tracking is done as a preprocessing step [32, 15, 5] ." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-26", "text": "For a comprehensive review see [33] ." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-27", "text": "Neural networks have early been applied to the Lipreading task [34] , however, they have become widespread only in recent years, with the advent of state-of-the-art learning techniques (and the necessary hardware)." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-28", "text": "The first deep neural network for lipreading was a seven-layer convolutional net as a preprocessing stage for an HMM-based word recognizer [5] ." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-29", "text": "Since then, several end-to-end trainable systems were presented [7, 8, 9] ." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-30", "text": "The current state-of-the-art accuracy on the GRID corpus is 3.3% error [9] using a very large set of additional training data; so their result is not directly comparable to ours." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-31", "text": "In domain adaptation, it is assumed that a learning task exhibits a domain shift between the training (or source) and test (or target) data." 
}, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-32", "text": "This can be mitigated in several ways [35] ; we apply domain-adversarial training [12] , where an intermediate layer in a multi-layer network is driven to learn a representation of the input data which is optimized to be domain-agnostic, i.e. to make it difficult to detect whether an input sample is from the source or the target domain." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-33", "text": "A great advantage of this approach is the end-to-end trainability of the entire system." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-34", "text": "For a summary of further approaches to domain adaptation with neural networks, we refer to the excellent overview in [12] ." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-35", "text": "----------------------------------" }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-36", "text": "**DATA AND PREPROCESSING**" }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-37", "text": "We follow the data preprocessing protocol from [7] ." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-38", "text": "We use the GRID corpus [13] , which consists of video and audio recordings of 34 speakers (which we name s1 to s34) saying 1000 sentences each." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-39", "text": "All sentences have a fixed structure: command(4) + color(4) + preposition(4) + letter(25) + digit(10) + adverb(4), for example \"Place red at J 2, please\", where the number of alternative words is given in parentheses." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-40", "text": "There are 51 distinct words; alternatives are randomly distributed so that context cannot be used for classification." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-41", "text": "Each sentence has a length of 3 seconds at 25 frames per second, so the total data per speaker is 3000 seconds (50 minutes)." 
}, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-42", "text": "Using the annotations contained in the corpus, we segmented all videos at word level, yielding 6000 word samples per speaker." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-43", "text": "We experiment on speakers s1-s19: speakers 1-9 form the development speakers, used to determine optimal parameters; speakers 10-19 are the evaluation speakers, held back until the final evaluation of the systems." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-44", "text": "The data from each speaker was randomly subdivided into training, validation, and test sets, where the latter two contain five samples of each word, i.e. a total of 51 \u00b7 5 = 255 samples each." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-45", "text": "The training data is consequently highly unbalanced: For example, each letter from \"a\" to \"z\" appears 30 times, whereas each color appears 240 times." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-46", "text": "We converted the \"normal\" quality videos (360 \u00d7 288 pixels) to greyscale and extracted 40\u00d740 pixel windows containing the mouth area, as described in [7] ." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-47", "text": "The frames were contrastnormalized and z-normalized over the training set, independently for each speaker." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-48", "text": "Unreadable videos were discarded." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-49", "text": "All experiments have one dedicated target speaker on which this experiment is evaluated, and one, four, or eight source speakers on which supervised training is performed." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-50", "text": "Speakers are chosen consecutively, for example, the experiments on four training speakers on the development data are (s1 . . ." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-51", "text": "s4) \u2192 s5, (s2 . 
. . s5) \u2192 s6, \u00b7 \u00b7 \u00b7 , (s9, s1, s2, s3) \u2192 s4, where \u2192 separates source and target speakers." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-52", "text": "We also compute baseline results on single speakers." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-53", "text": "The data sets of each speaker are used as follows: Training data is used for supervised training (on the source speakers) and unsupervised adaptation (on the target speaker)." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-54", "text": "Validation data is used for early stopping, the network is evaluated on the test data." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-55", "text": "----------------------------------" }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-56", "text": "**METHODS AND SYSTEM SETUP**" }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-57", "text": "The system is based on the lipreading setup from [7] , reimplemented in Tensorflow [36] ." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-58", "text": "Raw 40 \u00d7 40 lip images are used as input data, without any further preprocessing except normalization." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-59", "text": "We stack several fully connected feedforward layers, optionally followed by Dropout [37] , and one LSTM recurrent layer to form a network which is capable of recognizing sequential video data." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-60", "text": "The final layer is a softmax with 51 word targets." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-61", "text": "All inner layers use a tanh nonlinearity." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-62", "text": "During testing, classification is performed on the last frame of an input word, the softmax output on all previous frames is discarded." 
}, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-63", "text": "Similarly, during training, an error signal is backpropagated (through time and through the stack of layers) only from the last frame of each training word sample." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-64", "text": "Optimization is performed by minimizing the multi-class cross-entropy using stochastic gradient descent applying Tensorflow's MomentumOptimizer with a momentum of 0.5, a learning rate of 0.001, and a batch size of 8 sequences." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-65", "text": "The network weights are initialized following a truncated normal distribution with a standard deviation of 0.1." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-66", "text": "In order to compensate for the unbalanced training set, each training sample is weighted with a factor inversely proportional to its frequency." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-69", "text": "At the second feedforward layer, we attach a further network which performs framewise speaker classification on source and target speakers." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-70", "text": "For this purpose, each training batch of 8 word sequences is augmented by eight additional word sequences from the target speaker, for which no word label is used, and no gradient is backpropagated from the word classifier." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-71", "text": "On the extended batch of 16 sequences, the \"adversarial\" network performs framewise speaker classification." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-72", "text": "This network follows a standard pattern (two feedforward layers with 100 neurons each plus a softmax layer with 2, 5, or 9 speaker outputs) and is trained jointly with the word classifier, with a configurable weight." 
}, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-73", "text": "If there are more word sequences from the source speaker(s) than from the target speaker, target sequences are repeated." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-74", "text": "So far, this describes a joint classifier for two different tasks (speaker and word classification), resembling Caruana's Multitask training [38] ." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-75", "text": "The power of the adversarial network comes from a simple twist: The backpropagated gradient from the adversarial network is inverted where it is fed into the main branch of the network, causing the lower branch to perform gradient ascent instead of descent." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-76", "text": "Since the speaker classification part of the system learns to classify speakers, the inverted gradient fed into the \"branching\" layer causes the joint part of the network to learn to confuse speakers instead of separating them." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-77", "text": "The speaker classifier and the joint network work for opposite objectives (hence, \"adversarial\"); an idea first presented in the context of factorial codes [39] ." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-78", "text": "Figure 2 shows a graphical overview of the system: The joint part is at the top, at the bottom are word classifier (left) and speaker classifier (right)." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-79", "text": "Table 1 : Baseline word accuracies on single speakers, averaged over the development set, with standard deviation." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-80", "text": "Layer types are FC (fully connected feedforward), DP (Dropout), and LSTM, followed by the number of neurons/cells." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-81", "text": "* marks the (reimplemented and recomputed) best system from [7] ." 
}, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-82", "text": "----------------------------------" }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-83", "text": "**EXPERIMENTS AND RESULTS**" }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-84", "text": "----------------------------------" }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-85", "text": "**BASELINE LIPREADER**" }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-86", "text": "The first experiment deals with establishing a baseline for our experiments, building on prior work [7] ." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-87", "text": "We run the lipreader as a single-speaker system with different topologies, optionally using Dropout (always with 50% dropout ratio) to avoid overfitting the training set." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-88", "text": "Adversarial training is not used (i.e. the weight in figure 2 is set to zero)." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-89", "text": "Table 1 shows the resulting test set accuracies averaged over the development speakers." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-90", "text": "Without using Dropout, the accuracy on the test set is \u223c79%." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-113", "text": "In the case of four or eight source speakers, the accuracy improves by 13.1% resp." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-91", "text": "Note in particular that the baseline cannot substantially be improved by increasing the layer size or adding more layers." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-92", "text": "We remark that not only the average accuracy across speakers, but also the accuracies for every single speaker hardly vary." 
}, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-93", "text": "The situation changes when Dropout is used: Now our best average accuracy is 83.3%, which is in line with results reported in literature (the most recent best result is 86.4% word accuracy [40] , but with a different training/test data split)." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-94", "text": "This best system, which is employed in the remainder of this paper, uses three feedforward layers each followed by Dropout, with 256 neurons each, followed by the LSTM layer with 256 LSTM cells, and the softmax layer." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-95", "text": "Thus the system is larger and has more layers than the baseline system, which is indeed made possible by the Dropout regularizer." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-96", "text": "On the evaluation speakers, the baseline system achieves an average accuracy of 78.3%, and the Dropout system is at 83.9% accuracy." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-97", "text": "This improvement is significant (one-tailed t-test with paired samples, p = 2.38 \u00d7 10 \u22126 )." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-98", "text": "The accuracies in a cross-speaker setting, again on the development speakers, are given in table 2." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-99", "text": "The accuracy decreases drastically, in particular when only one source speaker is used for training: On an unknown target speaker, the system achieves only an average 13.5% accuracy." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-100", "text": "The situation is clearly better when training data from multiple speakers is used, but even for eight training speakers, the average accuracy on an unknown speaker is only 37.8%." 
}, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-101", "text": "We also note that the test accuracy on the source speakers does not rise when data from multiple speakers is used, even though there is more training data." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-102", "text": "It appears that the additional data does not \"help\" the system to improve its performance." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-103", "text": "On an unknown speaker, however, training data from multiple speakers does improve performance, very probably the system learns to be more speaker-agnostic." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-104", "text": "A similar observation with a very different input signal was reported in [41] ." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-105", "text": "Clearly, lipreading across different speakers is a challenging problem." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-106", "text": "In the remainder of this paper, we show how domain-adversarial training helps to tackle this challenge." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-107", "text": "----------------------------------" }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-108", "text": "**TUNING OF THE ADVERSARIAL SYSTEM**" }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-109", "text": "We now augment the baseline word classification network with adversarial training as described in section 4, thus making full use of the system shown in figure 2 ." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-110", "text": "For now, we use all sequences from the training set of the target speaker." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-111", "text": "As suggested in [12] , we found it beneficial to gradually activate adversarial training: the weight of the adversarial part is set to zero at the beginning, every 10 epochs, it is raised by 0.2 until the maximum value of 1.0 is reached at epoch 50." 
}, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-112", "text": "The results of this experiment are shown in the upper two blocks of table 3, where it can be seen that adversarial training causes substantial accuracy improvement, particularly with only one source speaker: In this case, the accuracy rises by more than 40% relative, from 13.5% to 19.2%." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-115", "text": "We tuned this system using various topologies for the adversarial part, as well as different weight schedules for adversarial training, finding rather consistent behavior." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-116", "text": "The only setting which is emphatically discouraged is starting with an adversarial weight greater than zero." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-117", "text": "See section 6 for further analysis." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-118", "text": "----------------------------------" }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-119", "text": "**TRAINING WITH VERY LITTLE TARGET DATA**" }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-120", "text": "While the presented system does not require supervised training data from the target speaker, we still use the entire training set of the target speaker." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-121", "text": "In practical applications, even unsupervised training data may only be sparsely available, so this setup is somewhat undesired." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-122", "text": "Since the content of the target training sequences is irrelevant for the adversarial training, we may hypothesize that we could also do with a much smaller set of target training data." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-123", "text": "So as a final experiment, we reduce the number of training sequences for the target speaker." 
}, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-124", "text": "The training protocol remains as before; in particular, training is always performed on the full set of source sequences, target sequences are repeated as necessary." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-125", "text": "The original number of 5490 target training sequences can be reduced to 50 sequences without a substantial loss of accuracy-this amounts to only 15-20 seconds of untranscribed target data." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-126", "text": "Results are shown in the lower block of table 3: For example, in the case of a single source speaker, the target accuracy drops to 18.9% instead of 19.2%." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-127", "text": "The improvement is lower when more source speakers are used." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-128", "text": "We hypothesize that this stems from the growing ratio between the number of source sequences and the number of target sequences." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-129", "text": "Finally, figure 3 shows an accuracy breakdown for speaker pairs, i.e. for single-speaker supervised training." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-130", "text": "In eight out of nine cases, domain-adversarial training clearly outperforms the baseline system, often by a substantial margin." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-131", "text": "We also observe that the accuracy gain depends very much on the speaker pair." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-132", "text": "----------------------------------" }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-133", "text": "**EVALUATION**" }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-134", "text": "We evaluate our result on the evaluation speakers, i.e. speakers 10-19 from the GRID corpus." 
}, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-135", "text": "The hypothesis to be tested states that adversarial training improves the accuracy of the crossspeaker lipreader trained on one, four, or eight source speakers, using either all target sequences or 50 target sequences." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-136", "text": "We use the one-tailed t-test with paired samples for evaluation." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-137", "text": "Table 4 shows the resulting accuracies, relative improvements, and p-values." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-138", "text": "Improvements are significant in all cases in which the entire target speaker data is used." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-139", "text": "For 50 target sequences, significance can be ascertained only in the case of a single source speaker, but we always get some improvement." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-140", "text": "We finally note that when applying such a system in practice, untranscribed data is accrued continuously: so the quality of the system on the target speaker could be improved continuously as well, without requiring any extra data collection." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-141", "text": "----------------------------------" }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-142", "text": "**50%**" }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-143", "text": "Target speaker, validation data Figure 4 : Accuracy vs. epoch on different data sets with adversarial training, for speaker pair s5 \u2192 s6." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-144", "text": "Note that the target accuracy shows a substantial rise at epoch 10, where adversarial training sets in." 
}, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-145", "text": "----------------------------------" }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-146", "text": "**ANALYSIS**" }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-147", "text": "In this section we attempt to shed light on the effect of domainadversarial training." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-148", "text": "Figure 4 shows the progress of training for speakers s5 \u2192 s6 versus the training epoch, with adversarial training activated." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-149", "text": "The source speaker accuracies on validation and test set are \u223c78%, almost unaffected by adversarial training." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-150", "text": "The target speaker accuracies are 39.1% on the validation set and 39.5% on the test set, our greatest single increase with adversarial training: without adversarial training, the target accuracy is less than 22%." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-151", "text": "From the steady rise of the first curve, we see that the training progresses smoothly." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-152", "text": "This is the expected behavior for a well-tuned system." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-153", "text": "On the validation sets, the accuracy varies much less smoothly, with jumps of several percent points between epochs." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-154", "text": "We observed that this behavior is quite consistent for all systems, with or without adversarial training, and also for varying numbers of training speakers." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-155", "text": "Clearly the \"error landscape\" between training and validation data is very different, both within the same speaker and between different speakers." 
}, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-156", "text": "The effect of adversarial training is clearly observable: At epoch 10, where adversarial training becomes active (with 0.2 weight), the target accuracy jumps visibly, even though the criterion for which the adversarial network is optimized is very different from the word accuracy which is plotted in the graph." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-157", "text": "This is a remarkable success, even though it should be noted (compare figure 3) that on other speaker pairs, we obtain a much lower improvement by adversarial training." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-158", "text": "----------------------------------" }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-159", "text": "**CONCLUSION**" }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-160", "text": "In this study we have described how to apply domainadversarial training to a state-of-the-art lipreading system for improved speaker independency." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-161", "text": "When training and test are performed on pairs of different speakers, the average improvement is around 40%, which is highly significant; this improvement even persists when the amount of untranscribed target data is drastically reduced to about 15-20 seconds." }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-162", "text": "When supervised training data from several speakers is available, there is still some improvement, from a much higher baseline" }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-163", "text": "----------------------------------" }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-164", "text": "**ACKNOWLEDGEMENTS**" }, { "sent_id": "2e7df0912d9aac8bf97f4061de613f-C001-165", "text": "The first author was supported by the H2020 project INPUT (grant #687795)." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "2e7df0912d9aac8bf97f4061de613f-C001-11" ], [ "2e7df0912d9aac8bf97f4061de613f-C001-81" ] ], "cite_sentences": [ "2e7df0912d9aac8bf97f4061de613f-C001-11", "2e7df0912d9aac8bf97f4061de613f-C001-81" ] }, "@SIM@": { "gold_contexts": [ [ "2e7df0912d9aac8bf97f4061de613f-C001-12" ], [ "2e7df0912d9aac8bf97f4061de613f-C001-14" ], [ "2e7df0912d9aac8bf97f4061de613f-C001-29" ], [ "2e7df0912d9aac8bf97f4061de613f-C001-37" ], [ "2e7df0912d9aac8bf97f4061de613f-C001-46" ], [ "2e7df0912d9aac8bf97f4061de613f-C001-86" ] ], "cite_sentences": [ "2e7df0912d9aac8bf97f4061de613f-C001-12", "2e7df0912d9aac8bf97f4061de613f-C001-14", "2e7df0912d9aac8bf97f4061de613f-C001-29", "2e7df0912d9aac8bf97f4061de613f-C001-37", "2e7df0912d9aac8bf97f4061de613f-C001-46", "2e7df0912d9aac8bf97f4061de613f-C001-86" ] }, "@USE@": { "gold_contexts": [ [ "2e7df0912d9aac8bf97f4061de613f-C001-37" ], [ "2e7df0912d9aac8bf97f4061de613f-C001-46" ], [ "2e7df0912d9aac8bf97f4061de613f-C001-86" ] ], "cite_sentences": [ "2e7df0912d9aac8bf97f4061de613f-C001-37", "2e7df0912d9aac8bf97f4061de613f-C001-46", "2e7df0912d9aac8bf97f4061de613f-C001-86" ] }, "@DIF@": { "gold_contexts": [ [ "2e7df0912d9aac8bf97f4061de613f-C001-57" ] ], "cite_sentences": [ "2e7df0912d9aac8bf97f4061de613f-C001-57" ] }, "@EXT@": { "gold_contexts": [ [ "2e7df0912d9aac8bf97f4061de613f-C001-57" ] ], "cite_sentences": [ "2e7df0912d9aac8bf97f4061de613f-C001-57" ] } } }, "ABC_46b9079fb1dd6b4626f20819ccfa07_19": { "x": [ { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-118", "text": "----------------------------------" }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-46", "text": "----------------------------------" }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-47", "text": "**PREVIOUS WORK**" }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-119", "text": "**PARSE TREE FEATURES**" }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-143", "text": "Second, we considered features based on lexical heads 
within the tree." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-2", "text": "We describe a method for discriminative training of a language model that makes use of syntactic features." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-3", "text": "We follow a reranking approach, where a baseline recogniser is used to produce 1000-best output for each acoustic input, and a second \"reranking\" model is then used to choose an utterance from these 1000-best lists." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-48", "text": "Techniques for exploiting stochastic context-free grammars for language modeling have been explored for more than a decade." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-4", "text": "The reranking model makes use of syntactic features together with a parameter estimation method that is based on the perceptron algorithm." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-5", "text": "We describe experiments on the Switchboard speech recognition task." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-6", "text": "The syntactic features provide an additional 0.3% reduction in test-set error rate beyond the model of (Roark et al., 2004a; Roark et al., 2004b ) (significant at p < 0.001), which makes use of a discriminatively trained n-gram model, giving a total reduction of 1.2% over the baseline Switchboard system." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-7", "text": "----------------------------------" }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-9", "text": "The predominant approach within language modeling for speech recognition has been to use an ngram language model, within the \"source-channel\" or \"noisy-channel\" paradigm." 
}, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-10", "text": "The language model assigns a probability P l (w) to each string w in the language; the acoustic model assigns a conditional probability P a (a|w) to each pair (a, w) where a is a sequence of acoustic vectors, and w is a string." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-11", "text": "For a given acoustic input a, the highest scoring string under the model is w * = arg max w (\u03b2 log P l (w) + log P a (a|w)) (1)" }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-12", "text": "where \u03b2 > 0 is some value that reflects the relative importance of the language model; \u03b2 is typically chosen by optimization on held-out data." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-13", "text": "In an n-gram language model, a Markov assumption is made, namely that each word depends only on the previous (n \u2212 1) words." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-14", "text": "The parameters of the language model are usually estimated from a large quantity of text data." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-15", "text": "See (Chen and Goodman, 1998) for an overview of estimation techniques for n-gram models." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-16", "text": "This paper describes a method for incorporating syntactic features into the language model, using discriminative parameter estimation techniques." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-17", "text": "We build on the work in Roark et al. (2004a; 2004b) , which was summarized and extended in Roark et al. (2005) ." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-18", "text": "These papers used discriminative methods for n-gram language models." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-19", "text": "Our approach reranks the 1000-best output from the Switchboard recognizer of Ljolje et al. (2003) ." 
}, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-20", "text": "1 Each candidate string w is parsed using the statistical parser of Collins (1999) to give a parse tree T (w)." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-21", "text": "Information from the parse tree is incorporated in the model using a feature-vector approach: we define \u03a6(a, w) to be a d-dimensional feature vector which in principle could track arbitrary features of the string w together with the acoustic input a. In this paper we restrict \u03a6(a, w) to only consider the string w and/or the parse tree T (w) for w. For example, \u03a6(a, w) might track counts of context-free rule productions in T (w), or bigram lexical dependencies within T (w)." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-22", "text": "The optimal string under our new model is defined as w * = arg max w (\u03b2 log P l (w) + \u1fb1, \u03a6(a, w) + log Pa(a|w))" }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-23", "text": "where the arg max is taken over all strings in the 1000-best list, and where\u1fb1 \u2208 R d is a parameter vector specifying the \"weight\" for each feature in \u03a6 (note that we define x, y to be the inner, or dot product, between vectors x and y)." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-24", "text": "For this paper, we train the parameter vector\u1fb1 using the perceptron algorithm (Collins, 2004; Collins, 2002) ." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-25", "text": "The perceptron algorithm is a very fast training method, in practice requiring only a few passes over the training set, allowing for a detailed comparison of a wide variety of feature sets." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-26", "text": "A number of researchers have described work that incorporates syntactic language models into a speech recognizer." 
}, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-27", "text": "These methods have almost exclusively worked within the noisy channel paradigm, where the syntactic language model has the task of modeling a distribution over strings in the language, in a very similar way to traditional n-gram language models." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-28", "text": "The Structured Language Model (Chelba and Jelinek, 1998; Chelba and Jelinek, 2000; Chelba, 2000; Xu et al., 2002; Xu et al., 2003) makes use of an incremental shift-reduce parser to enable the probability of words to be conditioned on k previous c-commanding lexical heads, rather than simply on the previous k words." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-29", "text": "Incremental topdown and left-corner parsing (Roark, 2001a; Roark, 2001b) and head-driven parsing (Charniak, 2001) approaches have directly used generative PCFG models as language models." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-30", "text": "In the work of Wen Wang and Mary Harper (Wang and Harper, 2002; Wang, 2003; Wang et al., 2004) , a constraint dependency grammar and a finite-state tagging model derived from that grammar were used to exploit syntactic dependencies." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-31", "text": "Our approach differs from previous work in a couple of important respects." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-32", "text": "First, through the featurevector representations \u03a6(a, w) we can essentially incorporate arbitrary sources of information from the string or parse tree into the model." 
}, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-33", "text": "We would argue that our method allows considerably more flexibility in terms of the choice of features in the model; in previous work features were incorporated in the model through modification of the underlying generative parsing or tagging model, and modifying a generative model is a rather indirect way of changing the features used by a model." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-34", "text": "In this respect, our approach is similar to that advocated in Rosenfeld et al. (2001) , which used Maximum Entropy modeling to allow for the use of shallow syntactic features for language modeling." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-35", "text": "A second contrast between our work and previous work, including that of Rosenfeld et al. (2001) , is in the use of discriminative parameter estimation techniques." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-36", "text": "The criterion we use to optimize the parameter vector\u1fb1 is closely related to the end goal in speech recognition, i.e., word error rate." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-37", "text": "Previous work (Roark et al., 2004a; Roark et al., 2004b) has shown that discriminative methods within an ngram approach can lead to significant reductions in WER, in spite of the features being of the same type as the original language model." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-38", "text": "In this paper we extend this approach, by including syntactic features that were not in the baseline speech recognizer." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-39", "text": "This paper describe experiments using a variety of syntactic features within this approach." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-40", "text": "We tested the model on the Switchboard (SWB) domain, using the recognizer of Ljolje et al. (2003) ." 
}, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-41", "text": "The discriminative approach for n-gram modeling gave a 0.9% reduction in WER on this domain; the syntactic features we describe give a further 0.3% reduction." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-42", "text": "In the remainder of this paper, section 2 describes previous work, including the parameter estimation methods we use, and section 3 describes the featurevector representations of parse trees that we used in our experiments." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-43", "text": "Section 4 describes experiments using the approach." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-44", "text": "----------------------------------" }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-45", "text": "**BACKGROUND**" }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-49", "text": "Early approaches included algorithms for efficiently calculating string prefix probabilities (Jelinek and Lafferty, 1991; Stolcke, 1995) and approaches to exploit such algorithms to produce n-gram models (Stolcke and Segal, 1994; Jurafsky et al., 1995) ." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-50", "text": "The work of Chelba and Jelinek (Chelba and Jelinek, 1998; Chelba and Jelinek, 2000; Chelba, 2000) involved the use of a shift-reduce parser trained on Penn treebank style annotations, that maintains a weighted set of parses as it traverses the string from left-to-right." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-51", "text": "Each word is predicted by each candidate parse in this set at the point when the word is shifted, and the conditional probability of the word given the previous words is taken as the weighted sum of the conditional probabilities provided by each parse." 
}, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-52", "text": "In this approach, the probability of a word is conditioned by the top two lexical heads on the stack of the par-ticular parse." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-53", "text": "Enhancements in the feature set and improved parameter estimation techniques have extended this approach in recent years (Xu et al., 2002; Xu et al., 2003) ." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-54", "text": "Roark (2001a; 2001b ) pursued a different derivation strategy from Chelba and Jelinek, and used the parse probabilities directly to calculate the string probabilities." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-55", "text": "This work made use of a left-to-right, top-down, beam-search parser, which exploits rich lexico-syntactic features from the left context of each derivation to condition derivation move probabilities, leading to a very peaked distribution." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-56", "text": "Rather than normalizing a prediction of the next word over the beam of candidates, as in Chelba and Jelinek, in this approach the string probability is derived by simply summing the probabilities of all derivations for that string in the beam." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-57", "text": "Other work on syntactic language modeling includes that of Charniak (2001) , which made use of a non-incremental, head-driven statistical parser to produce string probabilities." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-58", "text": "In the work of Wen Wang and Mary Harper (Wang and Harper, 2002; Wang, 2003; Wang et al., 2004) , a constraint dependency grammar and a finite-state tagging model derived from that grammar, were used to exploit syntactic dependencies." 
}, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-59", "text": "The processing advantages of the finite-state encoding of the model has allowed for the use of probabilities calculated off-line from this model to be used in the first pass of decoding, which has provided additional benefits." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-60", "text": "Finally, Och et al. (2004) use a reranking approach with syntactic information within a machine translation system." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-61", "text": "Rosenfeld et al. (2001) investigated the use of syntactic features in a Maximum Entropy approach." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-62", "text": "In their paper, they used a shallow parser to annotate base constituents, and derived features from sequences of base constituents." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-63", "text": "The features were indicator features that were either (1) exact matches between a set or sequence of base constituents with those annotated on the hypothesis transcription; or (2) tri-tag features from the constituent sequence." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-64", "text": "The generative model that resulted from their feature set resulted in only a very small improvement in either perplexity or word-error-rate." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-65", "text": "----------------------------------" }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-66", "text": "**GLOBAL LINEAR MODELS**" }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-67", "text": "We follow the framework of Collins (2002; 2004) , recently applied to language modeling in Roark et al. (2004a; 2004b) ." 
}, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-68", "text": "The model we propose consists of the following components:" }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-144", "text": "Let us first distinguish between POS-tags and non-POS non-terminal categories by calling these latter constituents NTs." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-69", "text": "\u2022 GEN(a) is a set of candidate strings for an acoustic input a. In our case, GEN(a) is a set of 1000-best strings from a first-pass recognizer." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-70", "text": "\u2022 T (w) is the parse tree for string w." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-71", "text": "\u2022 \u03a6(a, w) \u2208 R d is a feature-vector representation of an acoustic input a together with a string w." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-72", "text": "\u2022\u1fb1 \u2208 R d is a parameter vector." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-73", "text": "\u2022 The output of the recognizer for an input a is defined as" }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-74", "text": "In principle, the feature vector \u03a6(a, w) could take into account any features of the acoustic input a together with the utterance w. In this paper we make a couple of restrictions." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-75", "text": "First, we define the first feature to be" }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-76", "text": "where P l (w) and P a (a|w) are language and acoustic model scores from the baseline speech recognizer." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-77", "text": "In our experiments we kept \u03b2 fixed at the value used in the baseline recogniser." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-78", "text": "It can then be seen that our model is equivalent to the model in Eq. 2." 
}, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-79", "text": "Second, we restrict the remaining features \u03a6 2 (a, w) . . ." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-80", "text": "\u03a6 d (a, w) to be sensitive to the string w alone." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-81", "text": "2 In this sense, the scope of this paper is limited to the language modeling problem." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-82", "text": "As one example, the language modeling features might take into account n-grams, for example through definitions such as \u03a6 2 (a, w) = Count of the the in w Previous work (Roark et al., 2004a; Roark et al., 2004b ) considered features of this type." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-83", "text": "In this paper, we introduce syntactic features, which may be sensitive to the parse tree for w, for example" }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-84", "text": "where S \u2192 NP VP is a context-free rule production." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-85", "text": "Section 3 describes the full set of features used in the empirical results presented in this paper." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-86", "text": "----------------------------------" }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-87", "text": "**PARAMETER ESTIMATION**" }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-88", "text": "We now describe how the parameter vector\u1fb1 is estimated from a set of training utterances." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-89", "text": "The training set consists of examples (a i , w i ) for i = 1 . . ." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-90", "text": "m, where a i is the i'th acoustic input, and w i is the transcription of this input." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-91", "text": "We briefly review the two training algorithms described in Roark et al. 
(2004b) , the perceptron algorithm and global conditional log-linear models (GCLMs)." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-92", "text": "Figure 1 shows the perceptron algorithm." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-93", "text": "It is an online algorithm, which makes several passes over the training set, updating the parameter vector after each training example." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-94", "text": "For a full description of the algorithm, see Collins (2004; 2002) ." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-95", "text": "A second parameter estimation method, which was used in (Roark et al., 2004b) , is to optimize the log-likelihood under a log-linear model." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-96", "text": "Similar approaches have been described in Johnson et al. (1999) and Lafferty et al. (2001) ." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-97", "text": "The objective function used in optimizing the parameters is" }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-98", "text": "L(\u1fb1) = \u03a3 i ( \u27e8\u03a6(a i , s i ), \u1fb1\u27e9 \u2212 log \u03a3 w\u2208GEN(a i ) e^\u27e8\u03a6(a i ,w),\u1fb1\u27e9 ) \u2212 C ||\u1fb1||^2 ." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-99", "text": "Here, each s i is the member of GEN(a i ) which has lowest WER with respect to the target transcription w i ." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-100", "text": "The first term in L(\u1fb1) is the log-likelihood of the training data under a conditional log-linear model." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-101", "text": "The second term is a regularization term which penalizes large parameter values." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-102", "text": "C is a constant that dictates the relative weighting given to the two terms." 
}, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-103", "text": "The optimal parameters are defined as" }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-104", "text": "We refer to these models as global conditional loglinear models (GCLMs)." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-105", "text": "Each of these algorithms has advantages." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-106", "text": "A number of results-e.g., in Sha and Pereira (2003) and Roark et al. (2004b) -suggest that the GCLM approach leads to slightly higher accuracy than the perceptron training method." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-107", "text": "However the perceptron converges very quickly, often in just a few passes over the training set-in comparison GCLM's can take tens or hundreds of gradient calculations before convergence." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-108", "text": "In addition, the perceptron can be used as an effective feature selection technique, in that \u2022Calculate y i = arg max w\u2208GEN(a i ) \u03a6(a i , w),\u1fb1" }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-109", "text": "Output: Either the final parameters\u1fb1, or the averaged parameters\u1fb1avg defined as\u1fb1avg = t,i\u1fb1 t,i /mT where\u1fb1 t,i is the parameter vector after training on the i'th training example on the t'th pass through the training data." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-110", "text": "Figure 1: The perceptron training algorithm." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-111", "text": "Following Roark et al. (2004a) , the parameter \u03b11 is set to be some constant \u03b1 that is typically chosen through optimization over the development set." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-112", "text": "Recall that \u03b11 dictates the weight given to the baseline recognizer score." 
}, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-113", "text": "at each training example it only increments features seen on s i or y i , effectively ignoring all other features seen on members of GEN(a i )." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-114", "text": "For example, in the experiments in Roark et al. (2004a) , the perceptron converged in around 3 passes over the training set, while picking non-zero values for around 1.4 million n-gram features out of a possible 41 million n-gram features seen in the training set." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-115", "text": "For the present paper, to get a sense of the relative effectiveness of various kinds of syntactic features that can be derived from the output of a parser, we are reporting results using just the perceptron algorithm." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-116", "text": "This has allowed us to explore more of the potential feature space than we would have been able to do using the more costly GCLM estimation techniques." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-117", "text": "In future we plan to apply GLCM parameter estimation methods to the task." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-120", "text": "We tagged each candidate transcription with (1) part-of-speech tags, using the tagger documented in Collins (2002) ; and (2) a full parse tree, using the parser documented in Collins (1999) ." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-121", "text": "The models for both of these were trained on the Switchboard treebank, and applied to candidate transcriptions in both the training and test sets." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-122", "text": "Each transcription received one POS-tag annotation and one parse tree annotation, from which features were extracted." 
}, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-123", "text": "Figure 2 shows a Penn Treebank style parse tree that is of the sort produced by the parser." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-124", "text": "Given such a structure, there is a tremendous amount of flexibility in selecting features." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-125", "text": "The first approach that we follow is to map each parse tree to sequences encoding part-of-speech (POS) decisions, and \"shallow\" parsing decisions." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-126", "text": "Similar representations have been used by (Rosenfeld et al., 2001; Wang and Harper, 2002) ." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-127", "text": "Figure 3 shows the sequential representations that we used." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-128", "text": "The first simply makes use of the POS tags for each word." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-129", "text": "The latter representations make use of sequences of non-terminals associated with lexical items." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-130", "text": "In 3(b), each word in the string is associated with the beginning or continuation of a shallow phrase or \"chunk\" in the tree." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-131", "text": "We include any non-terminals above the level of POS tags as potential chunks: a new \"chunk\" (VP, NP, PP etc.) begins whenever we see the initial word of the phrase dominated by the non-terminal." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-132", "text": "In 3(c), we show how POS tags can be added to these sequences." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-133", "text": "The final type of sequence mapping, shown in 3(d), makes a similar use of chunks, but preserves only the headword seen with each chunk." 
}, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-134", "text": "3 From these sequences of categories, various features can be extracted, to go along with the n-gram features used in the baseline." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-135", "text": "These include n-tag features, e.g. t i\u22122 t i\u22121 t i (where t i represents the tag in position i); and composite tag/word features, e.g. t i w i (where w i represents the word in position i) or, more complicated configurations, such as t i\u22122 t i\u22121 w i\u22121 t i w i ." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-136", "text": "These features can be extracted from whatever sort of tag/word sequence we provide for feature extraction, e.g. POS-tag sequences or shallow parse tag sequences." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-137", "text": "One variant that we performed in feature extraction had to do with how speech repairs (identified as EDITED constituents in the Switchboard style parse trees) and filled pauses or interjections (labeled with the INTJ label) were dealt with." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-138", "text": "In the simplest version, these are simply treated like other constituents in the parse tree." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-139", "text": "However, these can disrupt what may be termed the intended sequence of syntactic categories in the utterance, so we also tried skipping these constituents when mapping from the parse tree to shallow parse sequences." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-140", "text": "The second set of features we employed made use of the full parse tree when extracting features." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-141", "text": "For this paper, we examined several features templates of this type." 
}, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-142", "text": "First, we considered context-free rule instances, extracted from each local node in the tree." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-145", "text": "For each constituent NT in the tree, there is an associated lexical head (H NT ) and the POS-tag of that lexical head (HP NT )." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-146", "text": "Two simple features are NT/H NT and NT/HP NT for every NT constituent in the tree." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-147", "text": "Using the heads as identified in the parser, example features from the tree in figure 2 would be S/VBD, S/helped, NP/NN, and NP/house." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-148", "text": "Beyond these constituent/head features, we can look at the head-to-head dependencies of the sort used by the parser." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-149", "text": "Consider each local tree, consisting of a parent node (P), a head child (HC P ), and k non-head children (C 1 . . ." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-150", "text": "C k )." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-151", "text": "For each non-head child C i , it is either to the left or right of HC P , and is either adjacent or non-adjacent to HC P ." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-152", "text": "We denote these positional features as an integer, positive if to the right, negative if to the left, 1 if adjacent, and 2 if non-adjacent." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-153", "text": "Table 1 shows four head-to-head features that can be extracted for each non-head child C i ." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-154", "text": "These features include dependencies between pairs of lexical items, between a single lexical item and the part-of-speech of another item, and between pairs of part-of-speech tags in the parse." 
}, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-155", "text": "----------------------------------" }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-156", "text": "**EXPERIMENTS**" }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-157", "text": "The experimental set-up we use is very similar to that of Roark et al. (2004a; 2004b) , and the extensions to that work in Roark et al. (2005) ." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-158", "text": "We make use of the Rich Transcription 2002 evaluation test set (rt02) as our development set, and use the Rich Transcription 2003 Spring evaluation CTS test set (rt03) as test set." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-159", "text": "The rt02 set consists of 6081 sentences (63804 words) and has three subsets: Switchboard 1, Switchboard 2, Switchboard Cellular." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-160", "text": "The rt03 set consists of 9050 sentences (76083 words) and has two subsets: Switchboard and Fisher." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-161", "text": "The training set consists of 297580 transcribed utterances (3297579 words) 4 ." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-162", "text": "For each utterance, a weighted word-lattice was produced, representing alternative transcriptions, from the ASR system." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-163", "text": "The baseline ASR system that we are comparing against then performed a rescoring pass on these first pass lattices, allowing for better silence modeling, and replaces the trigram language model score with a 6-gram model." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-164", "text": "1000-best lists were then extracted from these lattices." 
}, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-165", "text": "For each candidate in the 1000-best lists, we identified the number of edits (insertions, deletions or substitutions) for that candidate, relative to the \"target\" transcribed utterance." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-166", "text": "The oracle score for the 1000-best lists was 16.7%." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-167", "text": "To produce the word-lattices, each training utterance was processed by the baseline ASR system." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-188", "text": "Table 4 shows the results of these trials on rt02." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-168", "text": "In a naive approach, we would simply train the baseline system (i.e., an acoustic model and language model) on the entire training set, and then decode the training utterances with this system to produce lattices." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-169", "text": "We would then use these lattices with the perceptron algorithm." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-170", "text": "Unfortunately, this approach is likely to produce a set of training lattices that are very different from test lattices, in that they will have very low word-error rates, given that the lattice for each utterance was produced by a model that was trained on that utterance." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-171", "text": "To somewhat control for this, the training set was partitioned into 28 sets, and baseline Katz backoff trigram models were built for each set by including only transcripts from the other 27 sets." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-172", "text": "Lattices for each utterance were produced with an acoustic model that had been trained on the entire training set, but with a language model that was trained on the 27 data portions that did not include the current utterance." 
}, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-173", "text": "Since language models are generally far more prone to overtraining than standard acoustic models, this goes a long way toward making the training conditions similar to testing conditions." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-174", "text": "Similar procedures were used to train the parsing and tagging models for the training set, since the Switchboard treebank overlaps extensively with the ASR training utterances." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-175", "text": "Table 2 presents the word-error rates on rt02 and rt03 of the baseline ASR system, 1000-best perceptron and GCLM results from Roark et al. (2005) under this condition, and our 1000-best perceptron results." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-176", "text": "Note that our n-best result, using just ngram features, improves upon the perceptron result of (Roark et al., 2005) Table 3 : Use of POS-tag sequence derived features condition." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-177", "text": "(Note that the perceptron-trained n-gram features were trigrams (i.e., n = 3).) This is due to a larger training set being used in our experiments; we have added data that was used as held-out data in (Roark et al., 2005) to the training set that we use." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-178", "text": "The first additional features that we experimented with were POS-tag sequence derived features." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-179", "text": "Let t i and w i be the POS tag and word at position i, respectively." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-180", "text": "We experimented with the following three feature definitions: Table 3 summarizes the results of these trials on the held out set." 
}, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-181", "text": "Using the simple features (number 1 above) yielded an improvement beyond just n-grams, but additional, more complicated features failed to yield additional improvements." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-182", "text": "Next, we considered features derived from shallow parsing sequences." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-183", "text": "Given the results from the POS-tag sequence derived features, for any given sequence, we simply use n-tag and tag/word features (number 1 above)." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-184", "text": "The first sequence type from which we extracted features was the shallow parse tag sequence (S1), as shown in figure 3(b) ." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-185", "text": "Next, we tried the composite shallow/POS tag sequence (S2), as in figure 3(c) ." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-186", "text": "Finally, we tried extracting features from the shallow constituent sequence (S3), as shown in figure 3(d) ." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-187", "text": "When EDITED and rt02 Trial WER ASR system output 37.1 n-gram perceptron 36.4 n-gram + POS perceptron 36.1 n-gram + POS + S1 perceptron 36.1 n-gram + POS + S2 perceptron 36.0 n-gram + POS + S3 perceptron 36.0 n-gram + POS + S3-E perceptron 36.0 n-gram + POS + CF perceptron 36.1 n-gram + POS + H2H perceptron 36.0 Table 4 : Use of shallow parse sequence and full parse derived features INTJ nodes are ignored, we refer to this condition as S3-E. For full-parse feature extraction, we tried context-free rule features (CF) and head-to-head features (H2H), of the kind shown in table 1." 
}, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-189", "text": "Although the single digit precision in the table does not show it, the H2H trial, using features extracted from the full parses along with n-grams and POS-tag sequence features, was the best performing model on the held out data, so we selected it for application to the rt03 test data." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-190", "text": "This yielded 35.2% WER, a reduction of 0.3% absolute over what was achieved with just n-grams, which is significant at p < 0.001, 5 reaching a total reduction of 1.2% over the baseline recognizer." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-191", "text": "----------------------------------" }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-192", "text": "**CONCLUSION**" }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-193", "text": "The results presented in this paper are a first step in examining the potential utility of syntactic features for discriminative language modeling for speech recognition." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-194", "text": "We tried two possible sets of features derived from the full annotation, as well as a variety of possible feature sets derived from shallow parse and POS tag sequences, the best of which gave a small but significant improvement beyond what was provided by the n-gram features." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-195", "text": "Future work will include a further investigation of parserderived features." }, { "sent_id": "46b9079fb1dd6b4626f20819ccfa07-C001-196", "text": "In addition, we plan to explore the alternative parameter estimation methods described in (Roark et al., 2004a; Roark et al., 2004b) , which were shown in this previous work to give further improvements over the perceptron." 
} ], "y": { "@DIF@": { "gold_contexts": [ [ "46b9079fb1dd6b4626f20819ccfa07-C001-6" ], [ "46b9079fb1dd6b4626f20819ccfa07-C001-95" ] ], "cite_sentences": [ "46b9079fb1dd6b4626f20819ccfa07-C001-6", "46b9079fb1dd6b4626f20819ccfa07-C001-95" ] }, "@EXT@": { "gold_contexts": [ [ "46b9079fb1dd6b4626f20819ccfa07-C001-17" ], [ "46b9079fb1dd6b4626f20819ccfa07-C001-95" ] ], "cite_sentences": [ "46b9079fb1dd6b4626f20819ccfa07-C001-17", "46b9079fb1dd6b4626f20819ccfa07-C001-95" ] }, "@SIM@": { "gold_contexts": [ [ "46b9079fb1dd6b4626f20819ccfa07-C001-37" ], [ "46b9079fb1dd6b4626f20819ccfa07-C001-67" ], [ "46b9079fb1dd6b4626f20819ccfa07-C001-91" ], [ "46b9079fb1dd6b4626f20819ccfa07-C001-157" ] ], "cite_sentences": [ "46b9079fb1dd6b4626f20819ccfa07-C001-37", "46b9079fb1dd6b4626f20819ccfa07-C001-67", "46b9079fb1dd6b4626f20819ccfa07-C001-91", "46b9079fb1dd6b4626f20819ccfa07-C001-157" ] }, "@USE@": { "gold_contexts": [ [ "46b9079fb1dd6b4626f20819ccfa07-C001-67" ], [ "46b9079fb1dd6b4626f20819ccfa07-C001-91" ], [ "46b9079fb1dd6b4626f20819ccfa07-C001-157" ] ], "cite_sentences": [ "46b9079fb1dd6b4626f20819ccfa07-C001-67", "46b9079fb1dd6b4626f20819ccfa07-C001-91", "46b9079fb1dd6b4626f20819ccfa07-C001-157" ] }, "@BACK@": { "gold_contexts": [ [ "46b9079fb1dd6b4626f20819ccfa07-C001-82" ], [ "46b9079fb1dd6b4626f20819ccfa07-C001-106" ] ], "cite_sentences": [ "46b9079fb1dd6b4626f20819ccfa07-C001-82", "46b9079fb1dd6b4626f20819ccfa07-C001-106" ] }, "@FUT@": { "gold_contexts": [ [ "46b9079fb1dd6b4626f20819ccfa07-C001-196" ] ], "cite_sentences": [ "46b9079fb1dd6b4626f20819ccfa07-C001-196" ] } } }, "ABC_b6efc2f5239a0c5d9210d7da8466ab_19": { "x": [ { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-2", "text": "Statistical machine translation (SMT) performance suffers when models are trained on only small amounts of parallel data." 
}, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-3", "text": "The learned models typically have both low accuracy (incorrect translations and feature scores) and low coverage (high out-of-vocabulary rates)." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-4", "text": "In this work, we use an additional data resource, comparable corpora, to improve both." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-5", "text": "Beginning with a small bitext and corresponding phrase-based SMT model, we improve coverage by using bilingual lexicon induction techniques to learn new translations from comparable corpora." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-6", "text": "Then, we supplement the model's feature space with translation scores estimated over comparable corpora in order to improve accuracy." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-7", "text": "We observe improvements between 0.5 and 1.7 BLEU translating Tamil, Telugu, Bengali, Malayalam, Hindi, and Urdu into English." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-8", "text": "----------------------------------" }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-10", "text": "Standard statistical machine translation (SMT) models (Koehn et al., 2003) are trained using large, sentence-aligned parallel corpora." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-11", "text": "Unfortunately, parallel corpora are not always available in large enough quantities to train robust models (Kolachina et al., 2012) ." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-12", "text": "In this work, we consider the situation in which we have access to only a small amount of bitext for a given low resource language pair, and we wish to supplement an SMT model with additional translations and features estimated using comparable corpora in the source and target languages." 
}, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-13", "text": "Assuming access to a small amount * Performed while faculty at Johns Hopkins University of parallel text is realistic, especially considering the recent success of crowdsourcing translations (Zaidan and Callison-Burch, 2011; Ambati, 2011; Post et al., 2012) ." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-14", "text": "We frame the shortcomings of SMT models trained on limited amounts of parallel text 1 in terms of accuracy and coverage." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-15", "text": "In this context, coverage refers to the number of words and phrases that a model has any knowledge of at all, and it is low when the training text is small, which results in a high out-of-vocabulary (OOV) rate." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-16", "text": "Accuracy refers to the correctness of the translation pairs and their corresponding probability features that make up the translation model." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-17", "text": "Because the quality of unsupervised automatic word alignments correlates with the amount of available parallel text and alignment errors result in errors in extracted translation pairs, accuracy tends to be low in low resource settings." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-83", "text": "For each language, we extract a phrase table with a phrase limit of seven." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-18", "text": "Additionally, estimating translation probabilities 2 over sparse training sets results in inaccurate feature scores." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-19", "text": "Given these deficiencies, we begin with a baseline SMT model learned from a small parallel corpus and supplement the model to improve its accuracy and coverage." 
}, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-20", "text": "We apply techniques presented in prior work that use comparable corpora to estimate similarities between word and phrases." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-21", "text": "In particular, we build on prior work in bilingual lexicon induction in order to predict translations for OOV words, improving coverage." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-22", "text": "We then use the same corpora to estimate additional translation feature scores, improving model accuracy." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-23", "text": "We see improvements in translation quality between 0.5 1 We consider low resource settings to be those with parallel datasets of fewer than 1 million words." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-24", "text": "Most standard MT datasets contain tens or hundreds of millions of words." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-25", "text": "2 Estimating reordering probabilities over sparse data also leads to model inaccuracies; we do not tackle that here. and 1.7 BLEU points translating the following low resource languages into English: Tamil, Telugu, Bengali, Malayalam, Hindi, and Urdu." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-26", "text": "----------------------------------" }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-27", "text": "**PREVIOUS WORK**" }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-28", "text": "Prior work shows that a variety of signals, including distributional, temporal, topic, and string similarity, may inform bilingual lexicon induction (Rapp, 1995; Fung and Yee, 1998; Rapp, 1999; Schafer and Yarowsky, 2002; Koehn and Knight, 2002; Monz and Dorr, 2005; Huang et al., 2005; Schafer, 2006; Klementiev and Roth, 2006; Haghighi et al., 2008; Mimno et al., 2009; Mausam et al., 2010) ." 
}, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-29", "text": "Other work has used decipherment techniques to learn translations from monolingual and comparable data (Ravi and Knight, 2011; Dou and Knight, 2012; Nuhn et al., 2012) ." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-30", "text": "Daum\u00e9 and Jagarlamudi (2011) use contextual and string similarity to mine translations for OOV words in a high resource language domain adaptation for a machine translation setting." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-31", "text": "Unlike most other prior work on bilingual lexicon induction, Daum\u00e9 and Jagarlamudi (2011) use the translations in end-to-end SMT." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-32", "text": "More recently, Irvine and Callison-Burch (2013) combine a variety of the techniques for estimating word pair similarity using source and target language comparable corpora." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-33", "text": "That work shows that only a small amount of supervision is needed to learn how to effectively combine similarity features into a single model for doing bilingual lexicon induction." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-34", "text": "In this work, because we assume access to a small amount of bilingual data, it is natural to take such a supervised approach to inducing new translations, and we directly apply that of Irvine and Callison-Burch (2013) ." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-35", "text": "Klementiev et al. (2012) use comparable corpora to score an existing Spanish-English phrase table extracted from the Europarl corpus." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-36", "text": "In this work, we directly apply their technique for scoring an existing phrase table." 
}, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-37", "text": "However, unlike that work, our initial phrase tables are estimated from small parallel corpora for genuine low resource languages." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-38", "text": "Additionally, we include new translations discovered in comparable corpora." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-39", "text": "Other prior work has mined supplemental parallel data from comparable corpora (Munteanu and Marcu, 2006; AbduI-Rauf and Schwenk, 2009; Smith et al., 2010; Uszkoreit et al., 2010; Smith et al., 2013) ." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-40", "text": "Such efforts are orthogonal and complementary to the approach that we take." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-41", "text": "----------------------------------" }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-42", "text": "**USING COMPARABLE CORPORA TO IMPROVE ACCURACY AND COVERAGE**" }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-43", "text": "After describing our bilingual and comparable corpora, we briefly describe the techniques proposed by Irvine and Callison-Burch (2013) and Klementiev et al. (2012) ." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-44", "text": "The contribution of this paper is the application and combination of these techniques in truly low resource translation conditions." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-45", "text": "Post et al. (2012) used Mechanical Turk to collect small parallel corpora for the following Indian languages and English: Tamil, Telugu, Bengali, Malayalam, Hindi, and Urdu." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-46", "text": "They collected both parallel sentence pairs and a dictionary of word translations." 
}, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-47", "text": "3 We use all six datasets, which provide real low resource data conditions for six truly low resource language pairs." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-48", "text": "Table 1 shows statistics about the datasets." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-49", "text": "Table 2 lists the amount of comparable data that we use for each language." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-50", "text": "Following both Klementiev et al. (2012) and Irvine and CallisonBurch (2013) , we use time-stamped web crawls as well as interlingually linked Wikipedia documents." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-51", "text": "We use the time-stamped data to estimate temporal similarity and the interlingual Wikipedia links, which indicate documents about the same topic written in different languages, to estimate topic similarity." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-52", "text": "We use both datasets in combination with a dictionary derived from the small parallel corpora to estimate contextual similarity." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-53", "text": "----------------------------------" }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-54", "text": "**DATASETS**" }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-55", "text": "----------------------------------" }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-56", "text": "**IMPROVING COVERAGE**" }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-57", "text": "In order to improve the coverage of our low resource translation models, we use bilingual lexicon induction techniques to learn translations for words which appear in our test sets but not in our training data (OOVs)." 
}, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-58", "text": "Bilingual lexicon induction is the task of inducing pairs of words that are translations of one another from monolingual or comparable corpora." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-59", "text": "Irvine and Callison-Burch (2013) use a diverse set of features estimated over comparable corpora and a small set of known translations as supervision for training a discriminative classifier, which makes predictions (translation or not a translation) on test set words paired with all possible translations." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-60", "text": "Possible translations are taken from the set of all target words appearing in the comparable corpora." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-61", "text": "Candidates are ranked according to their classification scores." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-152", "text": "So far we have only added the top-1 induced translation for OOV and low frequency source words to our phrase-based model." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-62", "text": "They achieve very good performance on the induction task itself compared with an unsupervised baseline that aggregates the same similarity features uniformly." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-63", "text": "In our setting, we have access to a small parallel corpus, which makes such a supervised approach to bilingual lexicon induction a natural choice." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-64", "text": "We use the framework described in Irvine and Callison-Burch (2013) directly, and further details may be found there." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-65", "text": "In particular, we use the same feature set, which includes the temporal, contextual, topic, orthographic, and frequency similarity between a candidate translation pair." 
}, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-66", "text": "We derive translations to serve as positive supervision from our automatically aligned parallel text 4 and, like the prior work, use random word pairs as negative supervision." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-67", "text": "Figure 1 shows some examples of Bengali words, their correct translations, and the top-3 translations that this framework induces." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-68", "text": "In our initial experiments, we add the highest ranked English candidate translation for each source language OOV to our phrase tables." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-69", "text": "Because all of the OOVs appear at least once in our comparable corpora, 5 we are able to mine translations for all of them." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-70", "text": "Adding these translations by definition improves the coverage of our MT models." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-71", "text": "Then, in additional sets of experiments, we 4 GIZA++ intersection alignments over all training data." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-72", "text": "5 The Post et al. (2012) also induce translations for source language words which are low frequency in the training data and supplement our SMT models with top-k translations, not just the highest ranked." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-73", "text": "----------------------------------" }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-74", "text": "**IMPROVING ACCURACY**" }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-75", "text": "In order to improve the accuracy of our models, we use comparable corpora to estimate additional features over the translation pairs in our phrase tables and include those features in tuning and decoding." 
}, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-76", "text": "This approach follows that of Klementiev et al. (2012) ." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-77", "text": "We compute both phrasal features and lexically smoothed features (using word alignments, like the Moses lexical translation probabilities) for all of the following except orthographic similarity, for which we only use lexically smoothed features, 6 resulting in nine additional features: temporal similarity based on timestamped web crawls, contextual similarity based on web crawls and Wikipedia (separately), orthographic similarity using normalized edit distance, and topic similarity based on inter-lingually linked Wikipedia pages." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-78", "text": "Our hope is that by adding a diverse set of similarity features to the phrase tables, our models will better distinguish between good and bad translation pairs, improving accuracy." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-79", "text": "----------------------------------" }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-80", "text": "**EXPERIMENTS 4.1 EXPERIMENTAL SETUP**" }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-81", "text": "We use the data splits given by Post et al. (2012) and, following that work, include the dictionaries in the training data and report results on the devtest set using case-insensitive BLEU and four references." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-82", "text": "We use the Moses phrase-based MT framework (Koehn et al., 2007) ." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-176", "text": "----------------------------------" }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-84", "text": "In order to make our results comparable to those of Post et al. 
(2012), we follow that work and use [Table 3: Percent of word types in a held out portion of the training data which are translated correctly by our bilingual lexicon induction technique.]" }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-85", "text": "[Table 3, continued: Evaluation is over the top-1 and top-10 outputs in the ranked lists for each source word.]" }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-86", "text": "the English side of the training data to train a language model." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-87", "text": "Using a language model trained on a larger corpus (e.g. the English side of our comparable corpora) may yield better results, but such an improvement is orthogonal to the focus of this work." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-88", "text": "Throughout our experiments, we use the batch version of MIRA (Cherry and Foster, 2012) for tuning the feature set. 7" }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-89", "text": "We rerun tuning for all experimental conditions and report results averaged over three tuning runs (Clark et al., 2011)." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-90", "text": "Our baseline uses the bilingually extracted phrase pairs and standard translation probability features." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-91", "text": "We supplement it with the top ranked translation for each OOV to improve coverage (+ OOV Trans) and with additional features to improve accuracy (+Features)." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-92", "text": "In Section 4.2, we make each modification separately and then together." 
}, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-93", "text": "Then we present additional experiments where we induce translations for low frequency words, in addition to OOVs (4.3), append top-k translations (4.4), vary the amount of training data used to induce the baseline model (4.5), and vary the amount of comparable corpora used to estimate features and induce translations (4.6)." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-94", "text": "----------------------------------" }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-95", "text": "**RESULTS**" }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-96", "text": "Before presenting end-to-end MT results, we examine the performance of the supervised bilingual lexicon induction technique that we use for translating OOVs." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-97", "text": "In Table 3 , top-1 accuracy is the percent of source language words in a held out portion of the training data 8 for which the highest ranked English candidate is a correct translation." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-98", "text": "9 Performance is lowest for Tamil and highest for Hindi." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-99", "text": "For all languages, top-10 accuracy is much higher than the top-1 accuracy." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-100", "text": "In Section 4.4, we explore 7 We experimented with MERT and PRO as well but saw consistently better baseline performance using batch MIRA." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-101", "text": "8 Described in Section 3.2." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-102", "text": "We retrain with all training data for MT experiments." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-103", "text": "9 Post et al. 
(2012) gathered up to six translations for each source word, so some words have multiple correct translations. We explore appending the top-k translations for OOV words to our model instead of just the top-1." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-104", "text": "Table 4 shows our results adding OOV translations, adding features, and then both." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-105", "text": "Additional translation features alone, which improve our models' accuracy, increase BLEU scores between 0.18 (Bengali) and 0.60 (Malayalam) points." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-106", "text": "Adding OOV translations makes a big difference for some languages, such as Bengali and Urdu, and almost no difference for others, like Malayalam and Tamil." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-107", "text": "The OOV rate (Table 1) is low in the Malayalam dataset and high in the Tamil dataset." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-108", "text": "However, as Table 3 shows, the translation induction accuracy is low for both." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-109", "text": "Since few of the supplemental translations are correct, we don't observe BLEU gains." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-110", "text": "In contrast, induction accuracies for the other languages are higher, OOV rates are substantial, and we do observe moderate BLEU improvements by supplementing phrase tables with OOV translations." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-111", "text": "In order to compute the potential BLEU gains that we could realize by correctly translating all OOV words (achieving 100% accuracy in Table 3 ), we perform an oracle experiment." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-112", "text": "We use automatic word alignments over the test sets to identify correct translations and append those to the phrase tables." 
}, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-113", "text": "10 The results, in Table 4 , show possible gains between 4.3 (Telugu and Bengali) and 0 (Malayalam) BLEU points above the baseline." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-114", "text": "Not surprisingly, the possible gain for Malayalam, which has a very low OOV rate, is very low." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-115", "text": "Our +OOV Trans." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-116", "text": "model gains between 0% (Tamil) and 38% (Urdu) of the potential improvement." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-117", "text": "Using comparable corpora to improve both accuracy (+Features) and coverage (+OOV Trans.) results in translations that are better than applying either technique alone for five of the six languages." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-118", "text": "BLEU gains range from 0.48 (Bengali) to 1.39 (Urdu)." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-119", "text": "We attribute the particularly good Urdu performance to the relatively large comparable corpora ( Table 2) ." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-120", "text": "As a result, we have already begun to expand our web crawls for all languages." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-121", "text": "In Section 4.6, we present results varying the amount of Urdu-English comparable corpora used to induce translations and estimate additional features." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-122", "text": "Table 4 : BLEU performance gains that target coverage (+OOV Trans.) and accuracy (+Features), and both (+Feats & OOV)." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-123", "text": "OOV oracle uses OOV translations from automatic word alignments." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-124", "text": "Hiero and SAMT results are reported in Post et al. (2012) ." 
}, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-125", "text": "datasets." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-126", "text": "Both syntax-based models outperform the phrase-based MT baseline for each language except Urdu, where the phrase-based model outperforms Hiero." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-127", "text": "Here, we extend a phrase-based rather than a syntax-based system because it is simpler." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-128", "text": "However, our improvements may also apply to syntactic models (future work)." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-177", "text": "**LEARNING CURVES OVER COMPARABLE CORPORA**" }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-129", "text": "Because our efforts have focused on the accuracy and coverage of translation pairs and have not addressed reordering or syntax, we expect that combining them with an SAMT grammar will result in state-of-the art performance." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-130", "text": "----------------------------------" }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-131", "text": "**TRANSLATIONS OF LOW FREQUENCY WORDS**" }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-132", "text": "Given the positive results in Section 4.2, we hypothesize that mining translations for low frequency words, in addition to OOV words, may improve accuracy." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-133", "text": "For source words which only appear a few times in the parallel training text, the bilingually extracted translations in the standard phrase table are likely to be inaccurate." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-134", "text": "Therefore, we perform additional experiments varying the minimum source word training data frequency for which we induce additional translations." 
}, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-135", "text": "That is, if f req(w src ) \u2264 M , we induce a new translation for it and include that translation in our phrase table." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-136", "text": "Note that in the results presented in Table 4 , M = 0." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-137", "text": "In these experiments, we include our additional phrase table features estimated over comparable corpora and hope that these scores will assist the model in choosing among multiple translation options for low frequency words, one or more of which is extracted bilingually and one of which is induced using comparable corpora." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-138", "text": "Table 5 shows the results when we vary M ." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-139", "text": "As before, we average BLEU scores over three tuning runs." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-140", "text": "In general, modest BLEU score gains are made as we supplement our phrase-based models with induced translations of low frequency words." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-141", "text": "The highest performance is achieved when M is between 5 and 50, depending on language." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-142", "text": "The Language Base." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-143", "text": "M : trans added for f req(wsrc) \u2264 M 0 1 5 10 25 50 Tamil 9.5 10.0 9.9 10.2 10.2 9.9 10." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-144", "text": "largest gains are 0.5 and 0.3 BLEU points for Bengali and Urdu, respectively, at M = 25." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-145", "text": "This is not surprising; we also saw the largest relative gains for those two languages when we added OOV translations to our baseline model." 
}, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-146", "text": "With the addition of low frequency translations, our highest performing Urdu model achieves a BLEU score that is 1.7 points higher than the baseline." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-147", "text": "In different data conditions, inducing translations for low frequency words may result in better or worse performance." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-148", "text": "For example, the size of the training set impacts the quality of automatic word alignments, which in turn impacts the reliability of translations of low frequency words." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-149", "text": "However, the experiments detailed here suggest that including induced translations of low frequency words will not hurt performance and may improve it." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-150", "text": "----------------------------------" }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-151", "text": "**APPENDING TOP-K TRANSLATIONS**" }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-153", "text": "However, the bilingual lexicon induction results in Table 3 show that accuracies in the top-10 ranked translations are, on average, nearly twice the top-1 accuracies." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-154", "text": "Here, we explore adding the top-k induced translations." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-155", "text": "We hope that our additional phrase decoder to correctly choose between the k translation options." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-156", "text": "We induce translations for OOV words only (M = 0) and include all comparable corpora features." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-157", "text": "Table 6 shows performance as we append the top-k ranked translations for each OOV word and vary k. 
With the exception of Bengali, using a k greater than 1 does not increase performance." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-158", "text": "In the case of Bengali, an additional 0.2 BLEU is observed when the top-25 translations are appended." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-159", "text": "In contrast, we see performance decrease substantially for other languages (0.7 BLEU for Telugu and 0.2 for Urdu) when the top-25 translations are used." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-160", "text": "Therefore, we conclude that, in general, the models do not sufficiently distinguish good from bad translations when we append more than just the top-1." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-161", "text": "Although using a k greater than 1 means that more correct translations are in the phrase table, it also increases the number of possible outputs over which the decoder must search." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-162", "text": "----------------------------------" }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-163", "text": "**LEARNING CURVES OVER PARALLEL DATA**" }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-164", "text": "In the experiments above, we only evaluated our methods for improving the accuracy and coverage of models trained on small amounts of bitext using the full parallel training corpora released by Post et al. (2012) ." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-165", "text": "Here, we apply the same techniques but vary the amount of parallel data in order to generate learning curves." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-166", "text": "Figure 2 shows learning curves for all six languages." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-167", "text": "In all cases, results are averaged over three tuning runs." 
}, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-168", "text": "We sample both parallel sentences and dictionary entries." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-169", "text": "All six learning curves show similar trends." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-170", "text": "In all experimental conditions, BLEU performance increases approximately linearly with the log of the amount of training data." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-171", "text": "Additionally, supplementing the baseline with OOV translations improves performance more than supplementing the baseline with additional phrase table scores based on comparable corpora." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-172", "text": "However, in most cases, supplementing the baseline with both translations and features improves performance more than either alone." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-173", "text": "Performance gains are greatest when very little training data is used." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-174", "text": "The Urdu learning curve shows the most gains as well as the cleanest trends across training data amounts." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-175", "text": "As before, we attribute this to the relatively large comparable corpora available for Urdu." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-178", "text": "In our final experiment, we consider the effect of the amount of comparable corpora that we use to estimate features and induce translations." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-179", "text": "We present learning curves for Urdu-English because we have the largest amount of comparable corpora for that pair." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-180", "text": "We use the full amount of parallel data to train a baseline model, and then we randomly sample varying amounts of our UrduEnglish comparable corpora." 
}, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-181", "text": "Sampling is done separately for the web crawl and Wikipedia comparable corpora." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-182", "text": "Figure 3 shows the results." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-183", "text": "As before, results are averaged over three tuning runs." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-184", "text": "The phrase table features estimated over comparable corpora improve end-to-end MT performance more with increasing amounts of comparable corpora." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-185", "text": "In contrast, the amount of comparable corpora used to induce OOV translations does not impact the performance of the resulting MT system as much." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-186", "text": "The difference may be due Figure 2: Comparison of learning curves over lines of parallel training data for four SMT systems: our baseline phrase-based model (baseline), model that supplements the baseline with translations of OOV words induced using our supervised bilingual lexicon induction framework (+Trans), model that supplements the baseline with additional phrase table features estimated over comparable corpora (+Feats), and a system that supplements the baseline with both OOV translations and additional features (+Trans & Feats) ." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-187", "text": "to the fact that data sparsity is always more of an issue when estimating features over phrase pairs than when estimating features over word pairs because phrases appear less frequently than words in monolingual corpora." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-188", "text": "Our comparable corpora features are estimated over phrase pairs while translations are only induced for OOV words, not phrases." 
}, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-189", "text": "So, it makes sense that the former would benefit more from larger comparable corpora." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-190", "text": "----------------------------------" }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-191", "text": "**CONCLUSION**" }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-192", "text": "As Post et al. (2012) showed, it is reasonable to assume a small parallel corpus for training an SMT model even in a low resource setting." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-193", "text": "We have used comparable corpora to improve the accuracy and coverage of phrase-based MT models built using small bilingual corpora for six low resource languages." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-194", "text": "We have shown that our methods improve BLEU score performance independently and that their combined impact is nearly additive." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-195", "text": "Additionally, our results show that adding induced translations of low frequency words improves performance beyond what is achieved by inducing translations for OOVs alone." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-196", "text": "Finally, our results show that our techniques improve relative performance most when very little parallel training data is available." }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-197", "text": "----------------------------------" }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-198", "text": "**ACKNOWLEDGEMENTS**" }, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-199", "text": "This material is based on research sponsored by DARPA under contract HR0011-09-1-0044 and by the Johns Hopkins University Human Language Technology Center of Excellence." 
}, { "sent_id": "b6efc2f5239a0c5d9210d7da8466ab-C001-200", "text": "The views and conclusions contained in this publication are those of the authors and should not be interpreted as representing official policies or endorsements of DARPA or the U.S. Government." } ], "y": { "@MOT@": { "gold_contexts": [ [ "b6efc2f5239a0c5d9210d7da8466ab-C001-13" ] ], "cite_sentences": [ "b6efc2f5239a0c5d9210d7da8466ab-C001-13" ] }, "@DIF@": { "gold_contexts": [ [ "b6efc2f5239a0c5d9210d7da8466ab-C001-72" ], [ "b6efc2f5239a0c5d9210d7da8466ab-C001-103" ] ], "cite_sentences": [ "b6efc2f5239a0c5d9210d7da8466ab-C001-72", "b6efc2f5239a0c5d9210d7da8466ab-C001-103" ] }, "@EXT@": { "gold_contexts": [ [ "b6efc2f5239a0c5d9210d7da8466ab-C001-72" ], [ "b6efc2f5239a0c5d9210d7da8466ab-C001-103" ] ], "cite_sentences": [ "b6efc2f5239a0c5d9210d7da8466ab-C001-72", "b6efc2f5239a0c5d9210d7da8466ab-C001-103" ] }, "@SIM@": { "gold_contexts": [ [ "b6efc2f5239a0c5d9210d7da8466ab-C001-81" ], [ "b6efc2f5239a0c5d9210d7da8466ab-C001-84" ], [ "b6efc2f5239a0c5d9210d7da8466ab-C001-192" ] ], "cite_sentences": [ "b6efc2f5239a0c5d9210d7da8466ab-C001-81", "b6efc2f5239a0c5d9210d7da8466ab-C001-84", "b6efc2f5239a0c5d9210d7da8466ab-C001-192" ] }, "@USE@": { "gold_contexts": [ [ "b6efc2f5239a0c5d9210d7da8466ab-C001-81" ], [ "b6efc2f5239a0c5d9210d7da8466ab-C001-84" ], [ "b6efc2f5239a0c5d9210d7da8466ab-C001-164" ] ], "cite_sentences": [ "b6efc2f5239a0c5d9210d7da8466ab-C001-81", "b6efc2f5239a0c5d9210d7da8466ab-C001-84", "b6efc2f5239a0c5d9210d7da8466ab-C001-164" ] } } }, "ABC_320a5c79d9884e652c42f85847172b_20": { "x": [ { "sent_id": "320a5c79d9884e652c42f85847172b-C001-111", "text": "Table 2 )." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-2", "text": "We propose a fully unsupervised framework for ad-hoc cross-lingual information retrieval (CLIR) which requires no bilingual data at all." 
}, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-3", "text": "The framework leverages shared cross-lingual word embedding spaces in which terms, queries, and documents can be represented, irrespective of their actual language." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-88", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-4", "text": "The shared embedding spaces are induced solely on the basis of monolingual corpora in two languages through an iterative process based on adversarial neural networks." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-5", "text": "Our experiments on the standard CLEF CLIR collections for three language pairs of varying degrees of language similarity (English-Dutch/Italian/Finnish) demonstrate the usefulness of the proposed fully unsupervised approach." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-6", "text": "Our CLIR models with unsupervised cross-lingual embeddings outperform baselines that utilize cross-lingual embeddings induced relying on word-level and document-level alignments." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-7", "text": "We then demonstrate that further improvements can be achieved by unsupervised ensemble CLIR models." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-8", "text": "We believe that the proposed framework is the first step towards development of effective CLIR models for language pairs and domains where parallel data are scarce or nonexistent." 
}, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-9", "text": "\u2022 Information systems \u2192 Multilingual and cross-lingual retrieval; Retrieval models and ranking;" }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-10", "text": "----------------------------------" }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-11", "text": "**INTRODUCTION**" }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-12", "text": "Retrieving relevant content across languages (i.e., cross-lingual information retrieval, termed CLIR henceforth) requires the ability Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-13", "text": "Copyrights for third-party components of this work must be honored." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-14", "text": "For all other uses, contact the owner/author(s)." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-15", "text": "SIGIR'18, July 2018, Ann Arbor, Michigan, USA \u00a9 2018 Copyright held by the owner/author(s)." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-16", "text": "ACM ISBN 978-x-xxxx-xxxx-x/YY/MM. . . $15.00 https://doi.org/10.1145/nnnnnnn.nnnnnnn to bridge the lexical gap between languages [8, 16] ." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-17", "text": "Traditional IR methods based on sparse text representations are not suitable for CLIR, since languages, in general, do not share much of the vocabulary." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-18", "text": "Even in the monolingual IR, they cannot bridge the lexical gap, being incapable of semantic generalization [6] ." 
}, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-19", "text": "A solution is to resort to structured real-valued semantic representations, that is, text embeddings [2, 6, 10] : these representations allow to generalize over the vocabularies observed in labelled data, and hence offer additional retrieval evidence and mitigate the ubiquitous problem of data sparsity." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-20", "text": "Their usefulness has been proven for monolingual [15] and cross-lingual ad-hoc IR models [21] ." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-21", "text": "Besides the embedding-based CLIR paradigms, other approaches to bridging the lexical gap for CLIR exist." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-22", "text": "1) Full-blown Machine Translation (MT) systems are employed to translate either queries or documents [8, 9] , but these require huge amounts of parallel data, while such resources are still scarce for many language pairs and domains." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-23", "text": "2) The lexical chasm can be crossed by grounding queries and documents in an external multilingual knowledge source (e.g., Wikipedia or BabelNet) [4, 20] ." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-24", "text": "However, the concept coverage is limited for resource-lean languages, and all content not present in a knowledge base is effectively ignored by a CLIR system." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-25", "text": "Bilingual text embeddings, while displaying a wider applicability and versatility than the two other paradigms, still suffer from one important limitation: a bilingual supervision signal is required to induce shared cross-lingual semantic spaces." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-26", "text": "This supervision takes form of sentence-aligned parallel data [5] , pre-built word translation pairs [11, 19] or document-aligned comparable data [21] ." 
}, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-27", "text": "1 Recently, methods for inducing shared cross-lingual embedding spaces without the need for any bilingual signal (not even word translation pairs) have been proposed [1, 3] ." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-28", "text": "These methods exploit inherent structural similarities of induced monolingual embedding spaces to learn vector space transformations that align the source language space to the target language space, with strong results observed for bilingual lexicon extraction." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-29", "text": "In this work, we show that these unsupervised cross-lingual word embeddings offer strong support to the construction of fully unsupervised adhoc CLIR models." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-30", "text": "We propose two different CLIR models: 1) termby-term translation through the shared cross-lingual space, and 2) query and document representations as IDF-weighted sums of constituent word vectors." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-31", "text": "To the best of our knowledge, our CLIR methodology is the first to allow the construction of CLIR models without any bilingual data and supervision at all, relying solely on monolingual corpora." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-32", "text": "Experimental evaluation on standard CLEF CLIR data for three different language pairs shows that the proposed fully unsupervised CLIR models outperform competitive baselines and models that exploit word translation pairs or comparable corpora." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-33", "text": "Our CLIR code and multilingual embedding spaces are publicly available at: https://github.com/rlitschk/UnsupCLIR." 
}, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-34", "text": "----------------------------------" }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-35", "text": "**METHODOLOGY**" }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-36", "text": "The proposed unsupervised CLIR models rely on the existence of a shared cross-lingual word embedding space in which all vocabulary terms of both languages are placed." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-37", "text": "We first outline three methods for the shared space induction, with a focus on the unsupervised method." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-38", "text": "We then explain in detail the query and document representations as well as the ranking functions of our CLIR models." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-39", "text": "----------------------------------" }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-40", "text": "**CROSS-LINGUAL WORD VECTOR SPACES**" }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-41", "text": "For our proposed CLIR models, we investigate cross-lingual embedding spaces produced with state-of-the-art representative methods requiring different amount and type of bilingual supervision: 1) document-aligned comparable data [21] , 2) word translation pairs [19] ; and 3) no bilingual data at all [3] ." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-42", "text": "----------------------------------" }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-43", "text": "**CROSS-LINGUAL EMBEDDINGS FROM COMPARABLE DOCUMENTS (CL-CD).**" }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-44", "text": "The BWE Skip-Gram (BWESG) model from Vuli\u0107 and Moens [21] exploits large document-aligned comparable corpora (e.g., Wikipedia)." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-45", "text": "BWESG first creates a merged corpus of bilingual pseudo-documents by intertwining pairs of available comparable documents." 
}, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-46", "text": "Then it applies a standard monolingual log-linear Skip-Gram model with negative sampling (SGNS) [10] on the merged corpus in which words have bilingual contexts instead of monolingual ones." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-47", "text": "----------------------------------" }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-48", "text": "**CROSS-LINGUAL EMBEDDINGS FROM WORD TRANSLATION PAIRS (CL-WT).**" }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-49", "text": "This class of models [1, 11, 19] focuses on learning the projections (i.e., mappings) between independently trained monolingual embedding spaces." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-50", "text": "Let { S w i } V S i =1 , S w i \u2208 R ds be the monolingual word embedding space of the source language L S with V S vectors, and { T w i } V T i =1 , T w i \u2208 R dt the monolingual space for the target language L T containing V T vectors; ds and dt are the respective space dimensionalities." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-51", "text": "The models learn a parametrized mapping function f ( |\u03b8) that projects the source language vectors into the target space:" }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-52", "text": ". The projection parameters \u03b8 are learned using the training set of K word translation pairs:" }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-53", "text": ", typically via second-order stochastic optimisation techniques." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-54", "text": "According to the comparative evaluation from [18] , all projectionbased methods for inducing cross-lingual embedding spaces perform similarly." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-55", "text": "We therefore opt for the recent model of Smith et al. 
[19] to serve as a baseline, due to its competitive performance, large coverage, and readily available implementation." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-56", "text": "2 Technically, the method of Smith et al. [19] learns two projection functions f_S(\u00b7|\u03b8_S) and f_T(\u00b7|\u03b8_T), projecting the source and target monolingual embedding spaces, respectively, to the new shared space." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-57", "text": "----------------------------------" }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-58", "text": "**CROSS-LINGUAL EMBEDDINGS WITHOUT BILINGUAL SUPERVISION (CL-UNSUP).**" }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-59", "text": "Most recently, Conneau et al. [3] have proposed an adversarial learning-based model in order to automatically, in a fully unsupervised fashion, create word translation pairs that can then be used to learn the same projection functions f_S and f_T as in the model of Smith et al. [19]. Let X be the set of all monolingual word embeddings from the source language, and Y the set of all target language embeddings." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-60", "text": "In the first, adversarial learning step, they jointly learn (1) the projection matrix W that maps one embedding space to the other and (2) the parameters of the discriminator model which, given an embedding vector (either Wx with x \u2208 X, or y \u2208 Y), needs to predict whether it is an original vector from the target embedding space (y) or a vector from the source embedding space mapped via the projection W to the target embedding space (Wx)." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-61", "text": "The discriminator model is a multi-layer perceptron network." 
}, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-62", "text": "In the second step, the projection matrix W trained with adversarial objective is used to find the mutual nearest neighbors between the two vocabularies -this set of automatically obtained word translation pairs becomes a synthetic training set for the refined projection functions f S and f T computed via the SVD-based method similar to the previously described model of Smith et al. [19] ." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-63", "text": "----------------------------------" }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-64", "text": "**UNSUPERVISED CLIR MODELS**" }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-89", "text": "Language Pairs and Training Data." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-65", "text": "With the induced cross-lingual spaces we can directly measure semantic similarity of words from the two languages, but we still need to define how to represent queries and documents." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-66", "text": "To this end, we outline two models that exploit the induced cross-lingual embedding spaces for CLIR tasks." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-67", "text": "----------------------------------" }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-68", "text": "**BWE AGGREGATION MODEL (BWE-AGG).**" }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-69", "text": "In the first approach, we derive the cross-lingual embeddings of queries and documents by aggregating the cross-lingual embeddings of their constituent terms." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-70", "text": "Let \u2212 \u2192 t be the embedding of the term t, obtained from the crosslingual embedding space and let d = {t 1 , t 2 , . . . , t N d } be a document from the collection consisting of N d terms." 
}, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-71", "text": "The embedding of the document d in the shared space can then be computed as:" }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-72", "text": "where \u2022 is a semantic composition operator: it aggregates constituent term embeddings into a document embedding." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-73", "text": "3 We opt for vector addition as composition for two reasons: 1) word embedding spaces exhibit linear linguistic regularities [12] , and 2) addition displays robust performance in compositional and IR tasks [14, 21] ." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-74", "text": "A representation of the query vector \u2212 \u2192 q is then the sum of embeddings of constituent terms:" }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-75", "text": "To obtain document representations, we compare two aggregation functions." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-76", "text": "First, we experiment with a simple non-weighted addition (BWE-Agg-Add):" }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-77", "text": "i ." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-78", "text": "Second, we use weighted addition where each term's embedding is weighted with the term's inverse document frequency (IDF) (BWE-Agg-IDF):" }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-79", "text": "on the common assumption that not all terms equally contribute to the document meaning: it emphasizes vectors of more documentspecific terms." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-80", "text": "4 Finally, we compute the relevance score simply as the cosine similarity between query and document embeddings in the shared cross-lingual space:" }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-81", "text": "Term-by-term query translation model (TbT-QT)." 
}, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-82", "text": "Our second CLIR model exploits the cross-lingual word embedding space in a different manner: it performs a term-by-term translation of the query into the language of the document collection relying solely on the shared cross-lingual space." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-83", "text": "Each source language query term t q is replaced by the target language term tr(t q ), that is, its cross-lingual nearest neighbour in the embedding space." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-84", "text": "The cosine similarity is used for computing cross-lingual semantic similarities of terms." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-85", "text": "In other words, the query q = {t 5 By effectively transforming a CLIR task into a monolingual IR task, we can apply any of the traditional IR ranking functions designed for sparse text representations." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-86", "text": "We opt for the ubiquitous query likelihood model [17] , smoothing the unigram language model of individual documents with the unigram language model of the entire collection, using the Dirichlet smoothing scheme [23] :" }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-87", "text": "----------------------------------" }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-90", "text": "We experiment with three language pairs of varying degree of similarity: English (EN) -{Dutch (NL), Italian (IT), Finnish (FI)}. 6 We use precomputed monolingual T vectors [2] (available online) 7 as monolingual word embeddings required by CL-WT and CL-UNSUP embedding models." 
}, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-91", "text": "For the CL-CD embeddings, the BWESG model trains on full documentaligned Wikipedias 8 using SGNS with suggested parameters from prior work [22] : 15 negative samples, global decreasing learning rate is .025, subsampling rate is 1e \u2212 4, window size is 16." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-92", "text": "The CL-WT embeddings of Smith et al. [19] use 10K translation pairs obtained from Google Translate to learn the linear mapping functions." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-93", "text": "The CL-UNSUP training setup closely follows the default setup of Conneau et al. [3] : we refer the reader to the original 4 Note that with both variants of BWE-Agg, we effectively ignore both query and document terms that are not represented in the cross-lingual embedding space." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-94", "text": "5 If the representation of a query term t q i is not present in the cross-lingual embedding space, we retain the query term t q i itself." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-95", "text": "We have also attempted eliminating out-ofvocabulary query terms, but the former consistently leads to better performance." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-96", "text": "6 English and Dutch are Germanic languages, Italian is a Romance language, whereas Finnish is an Uralic language (i.e., not Indo-European) 7 Table 1 : Basic statistics of used CLEF test collections: number of documents (#doc), number of tokens (#tok), and average number of relevant documents per query (#rel)." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-97", "text": "paper and the model implementation accessible online for more information and technical details." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-98", "text": "9 Test Collections and Queries." 
}, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-99", "text": "We evaluate the models on the standard test collections from the CLEF 2000-2003 ad-hoc retrieval Test Suite." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-100", "text": "10 We select all NL, IT, and FI document collections from years 2001-2003 11 and paired them with English queries from the respective year." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-101", "text": "The statistics for test collections are shown in Table 1 ." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-102", "text": "Following a standard practice [7, 21] , queries were created by concatenating the title and the description of each CLEF \"topic\"." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-103", "text": "The test collections for years 2001-2003 respectively contain 50, 50, and 60 EN queries." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-104", "text": "Queries and documents were lowercased; stop words, punctuations and one-character words were removed." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-105", "text": "----------------------------------" }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-106", "text": "**MODELS IN COMPARISON.**" }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-107", "text": "We evaluate six different CLIR models, obtained by combining each of the three models for inducing crosslingual word vector spaces -CL-CD, CL-WT, and CL-UNSUP -with each of the two ranking models -BWE-Agg and TbT-QT." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-108", "text": "For each cross-lingual vector space, we also evaluate an ensemble ranker that combines the two ranking functions: BWE-Agg-IDF and TbT-QT." 
}, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-109", "text": "If r 1 is the rank of document d for query q according to the TbT-QT model and r 2 is the rank produced by BWE-Agg-IDF, the ensemble ranker ranks the documents in the increasing order of the scores \u03bb \u00b7 r 1 + (1 \u2212 \u03bb) \u00b7 r 2 ." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-110", "text": "We evaluate ensembles with values \u03bb = 0.5, i.e., with equal contributions of both models; and \u03bb = 0.7, i.e., with more weight allocated to the more powerful TbT-QT model (cf." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-112", "text": "Additionally, we evaluate the standard query likelihood model (LM-UNI ) [17] Table 2 : CLIR performance on all three test language pairs for all models in comparison (MAP scores reported)." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-113", "text": "shows that we can perform robust CLIR without any cross-lingual information, that is, by relying purely on monolingual data." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-114", "text": "Ensemble CLIR Models." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-115", "text": "Ensembles generally outperform the bestperforming individual CLIR models, and for some test collections (e.g., EN\u2192NL 2002, EN\u2192FI 2003) by a wide margin." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-116", "text": "For the CL-CD and CL-WT spaces, we observe similar results for both values of the interpolation factor (\u03bb = 0.5 and \u03bb = 0.7)." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-117", "text": "This is not surprising, since the single models BWE-Agg-IDF and TbT-QT exhibit similar performance for CL-CD and CL-WT." 
}, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-118", "text": "In contrast, the combined model with \u03bb = 0.7 (i.e., more weight for the TbT-QT ranking) yields larger performance gains for CL-UNSUP spaces, for which the TbT-QT model consistently outperforms BWE-Agg-IDF." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-119", "text": "Language Similarity and Aggregation." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-120", "text": "The results in Table 2 imply that the proximity of CLIR languages plays a role only to a certain extent." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-121", "text": "Most models do exhibit lower performance for EN\u2192FI than for the other two language pairs: this is expected since Finnish is lexically and typologically more distant from English than Italian and Dutch." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-122", "text": "However, even though NL is linguistically closer to EN than IT, for the unsupervised CLIR models we generally observe slightly better performance for EN\u2192IT than for EN\u2192NL." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-123", "text": "We speculate that this is due to the compounding phenomenon in word formation, which is present in NL, but is not a property of EN and IT." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-124", "text": "The reported performance on bilingual lexicon extraction (BLE) using cross-lingual embedding spaces is also lower for EN-NL compared to EN-IT (see, e.g., [19] )." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-125", "text": "We observe the same pattern (4-5% lower BLE performance for EN-NL than for EN-IT) with the CL-UNSUP embedding spaces." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-126", "text": "The weighted variant of BWE-Agg (BWE-Agg-IDF) outperforms the simpler non-weighted summation model (BWE-Agg-Add) across the board." 
}, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-127", "text": "These results suggest that the common IR assumption about document-specific terms being more important than the terms occurring collection-wide is also valid for constructing dense document representations by summing word embeddings." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-128", "text": "----------------------------------" }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-129", "text": "**CONCLUSION**" }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-130", "text": "We have presented a fully unsupervised CLIR framework that leverages unsupervised cross-lingual word embeddings induced solely on the basis of monolingual corpora." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-131", "text": "We have shown the ability of our models to retrieve relevant content cross-lingually without any bilingual data at all, by reporting competitive performance on standard CLEF CLIR evaluation data for three test language pairs." }, { "sent_id": "320a5c79d9884e652c42f85847172b-C001-132", "text": "This unsupervised framework holds promise to support and guide the development of effective CLIR models for language pairs and domains where parallel data are scarce or unavailable." 
} ], "y": { "@SIM@": { "gold_contexts": [ [ "320a5c79d9884e652c42f85847172b-C001-26" ], [ "320a5c79d9884e652c42f85847172b-C001-41" ], [ "320a5c79d9884e652c42f85847172b-C001-55" ], [ "320a5c79d9884e652c42f85847172b-C001-59" ], [ "320a5c79d9884e652c42f85847172b-C001-62" ], [ "320a5c79d9884e652c42f85847172b-C001-124" ] ], "cite_sentences": [ "320a5c79d9884e652c42f85847172b-C001-26", "320a5c79d9884e652c42f85847172b-C001-41", "320a5c79d9884e652c42f85847172b-C001-55", "320a5c79d9884e652c42f85847172b-C001-59", "320a5c79d9884e652c42f85847172b-C001-62", "320a5c79d9884e652c42f85847172b-C001-124" ] }, "@USE@": { "gold_contexts": [ [ "320a5c79d9884e652c42f85847172b-C001-26" ], [ "320a5c79d9884e652c42f85847172b-C001-41" ], [ "320a5c79d9884e652c42f85847172b-C001-55" ], [ "320a5c79d9884e652c42f85847172b-C001-59" ], [ "320a5c79d9884e652c42f85847172b-C001-62" ] ], "cite_sentences": [ "320a5c79d9884e652c42f85847172b-C001-26", "320a5c79d9884e652c42f85847172b-C001-41", "320a5c79d9884e652c42f85847172b-C001-55", "320a5c79d9884e652c42f85847172b-C001-59", "320a5c79d9884e652c42f85847172b-C001-62" ] }, "@BACK@": { "gold_contexts": [ [ "320a5c79d9884e652c42f85847172b-C001-49" ], [ "320a5c79d9884e652c42f85847172b-C001-56" ] ], "cite_sentences": [ "320a5c79d9884e652c42f85847172b-C001-49", "320a5c79d9884e652c42f85847172b-C001-56" ] }, "@DIF@": { "gold_contexts": [ [ "320a5c79d9884e652c42f85847172b-C001-92" ] ], "cite_sentences": [ "320a5c79d9884e652c42f85847172b-C001-92" ] } } }, "ABC_0cc576e90c5ee2af043e09234792f5_20": { "x": [ { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-85", "text": "----------------------------------" }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-2", "text": "This study aims at determining whether collocational features automatically extracted from EFL (English as a foreign language) texts are useful for quality scoring, and allow the improvement of a competitive baseline based on, 
amongst other factors, bigram frequencies." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-3", "text": "The collocational features were gathered by assigning to each bigram in an EFL text eight association scores computed on the basis of a native reference corpus." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-4", "text": "The distributions of the association scores were then summarized by a few global statistical features and by a discretizing procedure." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-5", "text": "An experiment conducted on a publicly available dataset confirmed the effectiveness of these features and the benefit brought by using several discretized association scores." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-6", "text": "----------------------------------" }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-8", "text": "The importance of preformed units in language use is well established (Pawley and Syder, 1983; Schmitt, 2004; Sinclair, 1991) ." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-9", "text": "While some of these sequences belong to the traditional phraseological category, signalled by their syntactic fixedness and semantic non-compositionality, the vast majority of them are conventional word combinations that display statistical idiomaticity (Baldwin and Kim, 2010; Smiskova et al., 2012) ." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-10", "text": "This phraseological dimension of language has important implications for learning a foreign language, as shown by many studies in applied linguistics." 
}, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-11", "text": "It not only distinguishes native speakers from nonnative ones, but the number of phraseological units in a learner text is related to the overall level of proficiency in the learned language (e.g., Forsberg, 2010; Levitzky-Aviad and Laufer, 2013; Santos et al., 2012; ." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-12", "text": "In these studies, a limited number of expressions were analysed in a small number of texts, giving a very detailed, but also very punctual, view of the phenomenon." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-13", "text": "In addition, the phraseological nature of a lexical sequence was determined manually using dictionaries or by asking native speakers, making the analysis of numerous texts difficult." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-14", "text": "These limitations were overcome by Durrant and Schmitt (2009) , who proposed 1 assigning to the bigrams present in an EFL text two association scores (ASs), computed on the basis of a large native reference corpus: (pointwise) Mutual Information (MI), which favours bigrams made up of low-frequency words, and the t-score, which highlights those composed of high-frequency words." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-15", "text": "They observed that, compared to native speakers, EFL learners tend to underuse collocations with high MI scores while overusing those with high t-scores." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-16", "text": "More recently, Bestgen and Granger (2014, 2015) and showed that these ASs distinguish advanced learners from intermediate learners, and that the average MI score and the proportion of bigrams in the text that are absent from the reference corpus were good predictors of text quality, but that the average t-score was much less successful." 
}, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-17", "text": "These studies have a major drawback: the effectiveness of phraseological indices was not compared to that of other features known to be effective predictors." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-18", "text": "It is therefore impossible to determine whether the phraseological indices are really effective and if they can improve the prediction when combined with other indices." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-19", "text": "This limitation is probably partly due to the fact that these analyses were not conducted in the field of automatic scoring, but in applied linguistics." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-20", "text": "In automatic scoring, phraseological expres-sions have long been used almost exclusively for detecting errors, a task for which they have been very useful (e.g., Chodorow and Leacock, 2000; Futagi et al., 2008; Wu et al., 2010) ." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-21", "text": "It is noteworthy that a feature tracking the correct use of collocations was considered for inclusion in eRater, but its usefulness for predicting text quality seems rather limited (Higgins et al., 2015) ." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-86", "text": "**COLLOCATIONAL AND BASELINE FEATURES**" }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-22", "text": "Very recently, however, Somasundaran and Chodorow (2014) and Somasundaran et al. (2015) Even if these results were extremely promising, they leave a number of questions unanswered." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-23", "text": "First, they were obtained by studying short oral responses." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-24", "text": "Can they be generalized to longer written texts, a situation that allows the learner to spend much more time on its production? 
Then one can wonder whether the use of MI is sufficient, or if additional benefits can be obtained by taking into account other association measures for collocations." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-25", "text": "In this context, extracting richer features than the mean scores, as done by Somasundaran and Chodorow (2014) , seems particularly promising, because it was found that the best learner texts contain more middle-level t-score bigrams and fewer low- and high-level t-score bigrams." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-26", "text": "This observation may be related to the fact that the low t-score bigrams are often erroneous combinations of words, while high scores indicate extremely common bigrams in the language, which are easy to learn." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-27", "text": "It is therefore far from obvious that there is a simple linear or monotonic relationship between the distribution of the association scores (ASs) in a text and its quality." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-28", "text": "Finally, it would be interesting to determine whether using ASs extracted from a corpus of native texts enables a better prediction than that obtained by using the simple frequency of the unigrams and bigrams (Yannakoudakis et al., 2011) ." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-29", "text": "This study attempts to answer these questions by extracting from the bigrams in EFL texts richer features from several association measures as described in Section 2, and by comparing the effectiveness of these collocational features to that of lexical features (Section 3)." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-30", "text": "The conclusion proposes several paths for further research." 
}, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-31", "text": "----------------------------------" }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-32", "text": "**EXTRACTING COLLOCATION FEATURES**" }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-33", "text": "Somasundaran and Chodorow (2014) used only one AS, while Durrant and Schmitt (2009) used two, but there are many other ASs (Pecina, 2010) ." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-34", "text": "Evert (2009) recommends a heuristic approach by testing a series of ASs to keep the one that is most appropriate for the task at hand, while Pecina recommends using several ASs simultaneously." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-35", "text": "These recommendations were followed here by comparing the performance of eight ASs and by combining them (i.e., using simultaneously all of them in the feature set)." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-36", "text": "In addition to MI and tscore (Church et al., 1991) , the six following ASs were evaluated:" }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-37", "text": "1. MI3 (Daille, 1994), a heuristic modification of MI, proposed to reduce its tendency to assign inflated scores to rare words that occur together, 2." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-38", "text": "z (Berry-Rogghe, 1973), the signed squareroot of the cell contribution to the Pearson Chi-square for a 2x2 contingency table, 3. simple-ll (Evert, 2009), the signed cell contribution to the log-likelihood Chi-square test recommended by Dunning (1993) , 4." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-39", "text": "Fisher's exact test (Pedersen et al., 1996) , which corresponds to the probability of observing, under the null hypothesis of independence, at least as many collocations as the number actually observed, 5." 
}, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-40", "text": "Mutual rank ratio (mrr, Dean, 2005), a nonparametric measure that has been successful in detecting collocation errors in EFL texts (Futagi et al., 2008), 6 . logDice (Rychly, 2008), a logarithmic transformation of the Dice coefficient used in the Sketch Engine (Kilgarriff et al., 2014) ." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-62", "text": "The collocational features were then computed on the basis of all the different bigrams present in each text (types) to give more weight to their diversity (Durrant and Schmitt, 2009 )." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-41", "text": "In order to extract more information from the distribution of the ASs in each text than the mean or the median, Durrant and Schmitt (2009) and Somasundaran et al. (2015) used a standard procedure in descriptive statistics and automatic information processing known as discretization, binning or quantization (Garcia et al., 2013) ." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-42", "text": "It divides a continuous variable into bins and counts the proportion of scores that fall into each bin." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-43", "text": "In their analyses, the boundaries of the bins were manually and arbitrarily defined." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-44", "text": "This approach can be used for any AS, but it makes the comparison of the effectiveness of them difficult because a weaker performance may come from a less effective AS or from poorly chosen bin boundaries." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-45", "text": "To reduce the potential impact of the choice of boundaries, a very simple and completely automatic discretization procedure was used: the Equal Frequency Discretizer, which divides the sorted values into k intervals so that each interval contains approximately the same number of values (Dougherty et al., 1995) ." 
}, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-46", "text": "It is unsupervised and depends on only one parameter (i.e., the number of bins)." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-47", "text": "In the present study, it was applied separately for each AS, to every bigram present in the learners' texts and consists of two steps:" }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-48", "text": "1. Partitioning the distribution of scores in bins containing the same number of bigrams, 2. Computing for each text the proportion of bigrams whose AS falls into each bin, using as a denominator the total number of bigrams in the text." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-49", "text": "----------------------------------" }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-50", "text": "**EXPERIMENT**" }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-51", "text": "To assess the benefits of relying on collocational features to predict an EFL text's quality, an experiment was conducted." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-52", "text": "This section describes the corpus used, as well as the procedures for extracting the collocational and baseline features and for scoring the texts." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-53", "text": "----------------------------------" }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-54", "text": "**EXPERIMENT SETUP**" }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-55", "text": "Dataset: The analyses were conducted on the First Certificate in English (FCE) ESOL examination scripts described in Yannakoudakis et al. (2011 Yannakoudakis et al. ( , 2012 ." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-56", "text": "Extracted from the Cambridge Learner Corpus, this dataset consists of 1238 texts of between 200 and 400 words, to which an overall mark has been assigned." 
}, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-57", "text": "As in Yannakoudakis et al. (2011) , the 1141 texts from the year 2000 were used for training, while the 97 texts from the year 2001 were used for testing." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-58", "text": "Collocational Features: The global statistical features in Somasundaran et al. (2015) and were used: the mean, the median, the maximum and the minimum of the ASs, and the proportion of bigrams that are present in the learner text but absent from the reference corpus." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-59", "text": "Because the best number of bins for discretizing the distributions was not known, the following ones were compared: 3, 5, 8, 10, 15, 20, 25, 33, 50, 75 and 100. To get all these features, each learner text was tokenized and POS-tagged by means of CLAWS7 2 and all bigrams were extracted." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-60", "text": "Punctuation marks and any sequence of characters that did not correspond to a word interrupt the bigram extraction." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-61", "text": "Each bigram was then looked up in the 100 million word British National Corpus (BNC 3 ) and, if found, assigned its ASs." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-63", "text": "Lexical Features: As a benchmark for comparison, the lexical features that were showed to be good predictors of the quality of the texts in this dataset (Yannakoudakis et al., 2011) were chosen." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-64", "text": "They consist of the frequency of the word unigrams and bigrams." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-65", "text": "This baseline is particularly relevant because it includes the lexical bigrams that are the basis of the collocational features." 
}, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-66", "text": "These features were extracted as described in Yannakoudakis et al. (2011) ; the only difference is that they used the RASP tagger and not the CLAWS tagger." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-67", "text": "Supervised Learning Approach and Evaluation: As in Yannakoudakis et al. (2011) , the automated scoring task was treated as a rankpreference learning problem by means of the SVM-Rank package (Joachims, 2006) , which is a much faster version of the SVM-Light package used by Yannakoudakis et al. (2011) ." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-68", "text": "The procedure was identical to that described in their study." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-69", "text": "Since the quality ratings are distributed on a zero to 40 scale, I chose Pearson's correlation coefficient, also used by Yannakoudakis et al. (2011) , as the measure of performance." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-70", "text": "----------------------------------" }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-71", "text": "**RESULTS**" }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-72", "text": "Initial analyses focused on the interest of discretizing the ASs by assessing the benefits obtained when these features were added to the global statistical features." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-73", "text": "Collocational features were then compared to the lexical features and added to them to determine the maximum level of performance that could be achieved." 
}, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-74", "text": "----------------------------------" }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-75", "text": "**COLLOCATIONAL FEATURES**" }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-76", "text": "When no discretization procedure was used (the 0 row), Fisher was far more effective than the other ASs, followed by MI." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-77", "text": "Adding the discretized features led to far better performances (except for logDice), as shown by the Mean row." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-78", "text": "For a small number of bins, Fisher remained the best, but for an intermediate number, the best were t and simple-ll, and for a large number, z became competitive." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-79", "text": "Still, the differences between the best ASs were quite small." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-80", "text": "From eight bins and beyond, using all the ASs gave the best result, but the gain was relatively small." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-81", "text": "Regarding the number of bins, at least five seems necessary, but using many more did not harm performance." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-82", "text": "It is noteworthy that all the correlations reported in table 1 are much larger that the correlation of a baseline system based purely on length (r = 0.27)." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-83", "text": "To determine if the automatic procedure for discretizing the ASs is at least as effective as the bin boundaries manually set by Somasundaran et al. (2015) , I used them instead of the automatic bins for the model with eight bins based on MI." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-84", "text": "The correlation obtained was 0.60, a value slightly lower than that reported in Table 1 (0.61)." 
}, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-87", "text": "The lexical features used alone allowed a 0.68 correlation 4 ." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-88", "text": "These features are thus more effective than the best combinations of collocational features reported in Table 1 , but, as shown in Table 2 , adding the collocational features to the lexical ones produces far better performances." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-89", "text": "Steiner's ttest (Howell, 2008, p. 269-271) for comparing two non-independent correlations showed that collocational features significantly improve the prediction when compared to the baseline (all ps <0.005)." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-90", "text": "If MI is always one of the best performing ASs, the differences between the ASs are quite low." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-91", "text": "For all numbers of bins, using all the ASs allows the best performance." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-92", "text": "To get an idea of how well the collocational and lexical features perform, the correlations in Table 2 can be compared to the average correlation between the Examiners' scores reported by Yannakoudakis et al. (2011) , which give an upper bound of 0.80 while the All models with more than three bins obtain a correlation of at least 0.75." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-93", "text": "Adding collocational features to lexical ones thus reduces by 58% the difference between the lexical features alone and the upper bound." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-94", "text": "However, the most difficult part of the work is still to be done." 
}, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-95", "text": "----------------------------------" }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-96", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-97", "text": "Following on from Durrant and Schmitt (2009), Somasundaran and Chodorow (2014) and , this study confirms the benefits conferred by collocational features for the automated scoring of EFL texts." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-98", "text": "It also shows that these features improve a competitive baseline, based among other factors on the bigram frequenprocedure, the difference probably comes from the SVMRank/SVM-Light parameters." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-99", "text": "The SVM-Rank default settings were used except for the squared slacks for the L-norm (i.e., -p 2) because it provided a high performance without having to optimize other parameters such as C. cies in the texts." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-100", "text": "As proposed by Somasundaran and Chodorow (2014) , binning the AS distributions improves the efficiency and, as proposed by Durrant and Schmitt (2009) , considering several ASs also gives extra efficiency." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-101", "text": "Compared to , the binning allows t to be as effective as the MI." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-102", "text": "This result suggests that it might be interesting to analyse more thoroughly the complex relationship between the AS distributions in a text and its quality." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-103", "text": "It must be kept in mind that these observations result from the analysis of a single dataset and replications are more than desirable." 
}, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-104", "text": "It is also necessary to determine whether the collocational features can improve not only the baseline used here, but also a predictive model that includes many other features known for their effectiveness." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-105", "text": "Further developments are worth mentioning." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-106", "text": "Unlike Somasundaran et al. (2015) , I only used bigrams' collocational features." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-107", "text": "Whether adding trigrams would further improve the performance is an open question." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-108", "text": "Trying to answer it requires a thorough study of the association measures for ngrams longer than two words since they have received much less attention (Bestgen, 2014; Gries, 2010) ." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-109", "text": "It might also be interesting to evaluate other techniques to discretize the AS distributions, since this study rests on one of the simplest techniques." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-110", "text": "Further studies are also needed to better understand the impact of the combination of ASs." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-111", "text": "On the one hand, it is likely that some ASs are partially redundant and that keeping only one might be enough." }, { "sent_id": "0cc576e90c5ee2af043e09234792f5-C001-112", "text": "On the other hand, it would be interesting to determine whether, rather than combining the AS bin proportions independently, it would be better to create the bins on the simultaneous basis of two or more ASs, such as one bin for the bigrams with high MI scores and medium t-scores." 
} ], "y": { "@FUT@": { "gold_contexts": [ [ "0cc576e90c5ee2af043e09234792f5-C001-28" ] ], "cite_sentences": [ "0cc576e90c5ee2af043e09234792f5-C001-28" ] }, "@SIM@": { "gold_contexts": [ [ "0cc576e90c5ee2af043e09234792f5-C001-55" ], [ "0cc576e90c5ee2af043e09234792f5-C001-57" ], [ "0cc576e90c5ee2af043e09234792f5-C001-63" ], [ "0cc576e90c5ee2af043e09234792f5-C001-69" ], [ "0cc576e90c5ee2af043e09234792f5-C001-92" ] ], "cite_sentences": [ "0cc576e90c5ee2af043e09234792f5-C001-55", "0cc576e90c5ee2af043e09234792f5-C001-57", "0cc576e90c5ee2af043e09234792f5-C001-63", "0cc576e90c5ee2af043e09234792f5-C001-69", "0cc576e90c5ee2af043e09234792f5-C001-92" ] }, "@USE@": { "gold_contexts": [ [ "0cc576e90c5ee2af043e09234792f5-C001-55" ], [ "0cc576e90c5ee2af043e09234792f5-C001-63" ], [ "0cc576e90c5ee2af043e09234792f5-C001-69" ] ], "cite_sentences": [ "0cc576e90c5ee2af043e09234792f5-C001-55", "0cc576e90c5ee2af043e09234792f5-C001-63", "0cc576e90c5ee2af043e09234792f5-C001-69" ] }, "@DIF@": { "gold_contexts": [ [ "0cc576e90c5ee2af043e09234792f5-C001-66" ], [ "0cc576e90c5ee2af043e09234792f5-C001-67" ] ], "cite_sentences": [ "0cc576e90c5ee2af043e09234792f5-C001-66", "0cc576e90c5ee2af043e09234792f5-C001-67" ] }, "@EXT@": { "gold_contexts": [ [ "0cc576e90c5ee2af043e09234792f5-C001-66" ], [ "0cc576e90c5ee2af043e09234792f5-C001-67" ] ], "cite_sentences": [ "0cc576e90c5ee2af043e09234792f5-C001-66", "0cc576e90c5ee2af043e09234792f5-C001-67" ] } } }, "ABC_d66ca5ff22e508da239fc7fdf5ac29_20": { "x": [ { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-2", "text": "We present approaches for the identification of sentences understandable by second language learners of Swedish, which can be used in automatically generated exercises based on corpora." 
}, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-3", "text": "In this work we merged methods and knowledge from machine learning-based readability research, from rule-based studies of Good Dictionary Examples and from second language learning syllabuses." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-4", "text": "The proposed selection methods have also been implemented as a module in a free web-based language learning platform." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-5", "text": "Users can use different parameters and linguistic filters to personalize their sentence search with or without a machine learning component assessing readability." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-6", "text": "The sentences selected have already found practical use as multiple-choice exercise items within the same platform." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-7", "text": "Out of a number of deep linguistic indicators explored, we found mainly lexical-morphological and semantic features informative for second language sentence-level readability." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-8", "text": "We obtained a readability classification accuracy result of 71%, which approaches the performance of other models used in similar tasks." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-9", "text": "Furthermore, during an empirical evaluation with teachers and students, about seven out of ten sentences selected were considered understandable, the rulebased approach slightly outperforming the method incorporating the machine learning model." 
}, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-10", "text": "----------------------------------" }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-11", "text": "****" }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-12", "text": "We present approaches for the identification of sentences understandable by second language learners of Swedish, which can be used in automatically generated exercises based on corpora." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-13", "text": "In this work we merged methods and knowledge from machine learning-based readability research, from rule-based studies of Good Dictionary Examples and from second language learning syllabuses." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-14", "text": "The proposed selection methods have also been implemented as a module in a free web-based language learning platform." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-15", "text": "Users can use different parameters and linguistic filters to personalize their sentence search with or without a machine learning component assessing readability." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-16", "text": "The sentences selected have already found practical use as multiple-choice exercise items within the same platform." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-17", "text": "Out of a number of deep linguistic indicators explored, we found mainly lexical-morphological and semantic features informative for second language sentence-level readability." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-18", "text": "We obtained a readability classification accuracy result of 71%, which approaches the performance of other models used in similar tasks." 
}, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-19", "text": "Furthermore, during an empirical evaluation with teachers and students, about seven out of ten sentences selected were considered understandable, the rulebased approach slightly outperforming the method incorporating the machine learning model." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-20", "text": "----------------------------------" }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-21", "text": "**INTRODUCTION AND MOTIVATION**" }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-22", "text": "Despite the fact that there is a vast selection of existing materials, many language teachers opt for completing course syllabuses with either invented examples or authentic resources, customized to the need of specific learners (Howard and Major, 2004) ." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-23", "text": "Collections with millions of tokens of digital text are available for several languages today, part of which would offer adequate practice material for learners of a second or foreign language (L2) to develop their skills further." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-24", "text": "However, a necessary first step representing a major challenge when reusing copora for automatic exercise generation is how to assess the suitability of the available material." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-25", "text": "In this study, we explored how we could exploit existing Natural Language Processing (NLP) tools and resources for this purpose." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-26", "text": "To overcome copyright issues often limiting full-text access to certain corpora, we decided to work with sentences as linguistic unit when assessing the characteristics of suitability and when generating exercise items." 
}, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-27", "text": "Although a large number of studies exist investigating readability, i.e. understandability, at the text level, the sentence level remains little explored." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-28", "text": "Similarly, the focus of previous investigations has mainly been readability from native language (L1) readers' perspective, but aspects of L2 readability have been less widely studied." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-29", "text": "To our knowledge no previous research have explored this latter dimension for Swedish before, hence we aim at filling this gap, which can be useful, besides the purposes mentioned above, also in future sentence and text simplification and adaptation tasks." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-30", "text": "We propose a rule-based as well as a combination of rule-based and machine learning methods for the identification of sentences understandable by L2 learners and suitable as exercise items." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-51", "text": "Studies focusing on L2 readability are considerably fewer in the literature." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-31", "text": "During the selection of linguistic indicators, we have taken into consideration previously studied features of readability (Fran\u00e7ois and Fairon, 2012; Heimann M\u00fchlenbock, 2013; Vajjala and Meurers, 2012) , L2 Swedish curricula (Levy Scherrer and Lindemalm, 2009; Folkuniversitet, 2013) and aspects of Good Dictionary Examples (GDEX) (Hus\u00e1k, 2010; Kilgarriff et al., 2008) , being that we believe they have some properties in common with exercise items." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-32", "text": "The current version of the machine learning model distinguishes sentences readable by students at an intermediate level of proficiency from sentences of a higher readability level." 
}, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-33", "text": "The approaches have been implemented and integrated into an online Intelligent ComputerAssisted Language Learning (ICALL) platform, L\u00e4rka ." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-34", "text": "Besides a module where users can experiment with the filtering of corpus hits, a module with inflectional and vocabulary exercises (making use of the selected sentences with our method) is also available." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-35", "text": "An initial evaluation with students, teachers and linguists indicated that more than 70% of the sentences selected were understandable, and about 60% of them would be suitable as exercise items according to the two latter respondent groups." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-36", "text": "----------------------------------" }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-37", "text": "**BACKGROUND**" }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-38", "text": "----------------------------------" }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-39", "text": "**TEXT-LEVEL READABILITY**" }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-40", "text": "Readability of texts in different languages has been the subject of several studies and they range from simpler formulas, taking into account superficial text properties, to more sophisticated NLP methods." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-41", "text": "Traditional readability measures for L1 Swedish at the text level include LIX (L\u00e4sbarthetsindex, \"Readability index\") (Bj\u00f6rnsson, 1968) and the Nominal Ratio (Hultman and Westman, 1977) ." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-42", "text": "In recent years a number of studies, mostly focusing on the L1 context, appeared which take into consideration linguistic features based on a deeper text processing." 
}, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-43", "text": "Morphosyntactic aspects informative for L1 readability include, among others, parse tree depth, subordination features and dependency link depth (length) (Dell'Orletta et al., 2011) ." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-44", "text": "Language models have also been commonly used for readability predictions (Collins-Thompson and Callan, 2004; Schwarm and Ostendorf, 2005) ." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-45", "text": "A recently proposed measure, the Coh-Metrix (Graesser et al., 2011) , aims at a multilevel analysis of texts, inspired by psycholinguistic principles." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-46", "text": "It measures not only linguistic difficulty, but also cohesion in texts." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-47", "text": "Research on L1 readability for Swedish, using machine learning, is described in Heimann M\u00fchlenbock (2013) and Falkenjack et al. (2013) ." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-48", "text": "Heimann M\u00fchlenbock (2013) examined readability along five dimensions: surface features, word usage, sentence structure, idea density and human interest." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-49", "text": "Mean dependency distance, subordinate clauses and modifiers proved good predictors for L1 Swedish." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-50", "text": "Although a number of readability formulas exist for native language users, these might not be suitable predictors of L2 difficulty being that the acquisition processes of L1 and L2 present a number of differences (Beinborn et al., 2012) ." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-52", "text": "The linguistic features in this context include, among others, relative clauses, passive voice (Heilman et al., 2007) and the number of coordinate phrases per clause (Vajjala and Meurers, 2012) ." 
}, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-53", "text": "Crossley et al. (2008) applied some Coh-Metrix indicators to English L2 readability." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-54", "text": "The authors found that lexical coreferentiality, syntactic similarity and word frequency measures outperformed traditional L1 readability formulas." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-55", "text": "A language-independent approach to L2 readability assessment, using an online machine learning algorithm, is presented by Shen et al. (2013) which, however, employed only the surface features of average sentence and word length, and word frequencies as lexical feature." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-56", "text": "The authors found that none of the features in isolation was able to clearly distinguish between the levels." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-57", "text": "In the second language teaching scenario, a widely used scale is the Common European Framework of Reference for Languages (CEFR) (Council of Europe, 2001), which, however, has been less frequently adopted so far in readability studies." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-58", "text": "The CEFR guidelines for L2 teaching and assessment define six different proficiency levels: A1 (beginner), A2 (elementary), B1 (intermediate), B2 (upper intermediate), C1 (advanced) and C2 (proficiency)." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-59", "text": "Fran\u00e7ois and Fairon (2012) proposed a CEFR-based readability formula for L2 French." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-60", "text": "Some of the predictive features proved to be structural properties, including shallow length features as well as different morpho-syntactic categories (e.g. present participles) and the presence of words in a list of easy words." 
}, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-61", "text": "----------------------------------" }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-62", "text": "**SENTENCE-LEVEL READABILITY**" }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-63", "text": "Many of the text readability measures mentioned above have shortcomings when used on very short passages containing 100 words or less (Kilgarriff et al., 2008) ." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-64", "text": "The concept of readability at the sentence level can be related to the selection of appropriate vocabulary example sentences." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-65", "text": "GDEX (Hus\u00e1k, 2010; Kilgarriff et al., 2008 ) is a sentence evaluation algorithm, which, on the basis of lexical and syntactical criteria, automatically ranks example candidates from corpora." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-66", "text": "Some of the influential linguistic aspects of appropriate example sentences are: their length and structure, the presence of short and common vocabulary items which do not need disambiguation and the absence of anaphoric pronouns." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-67", "text": "Segler (2007) focuses on the L2 rather than on the lexicographic context." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-68", "text": "He explores the characteristics of helpful vocabulary examples to be used via an ICALL system for L2 German and underlines the importance of syntactic complexity." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-69", "text": "Research about ranking Swedish corpus examples is presented in Volodina et al. (2012b) ." 
}, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-70", "text": "Their first algorithm includes four heuristic rules concerning sentence length, infrequent lexical items, keyword position and the presence of finite verbs, complemented by a sentence similarity measure in the second algorithm." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-71", "text": "Readability experiments focusing at the sentence level have started to appear recently both for language learning purposes (Pil\u00e1n et al., 2013) and for detecting differences between simplified and unsimplified sentence pairs (Vajjala and Meurers, 2014) ." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-72", "text": "----------------------------------" }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-73", "text": "**RESOURCES**" }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-74", "text": "Our sentence selection module utilizes a number of tools, resources and web services available for Swedish." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-75", "text": "Korp 1 , an infrastructure for accessing and maintaining corpora (Borin et al., 2012) , contains a large number of Swedish texts which are equipped with automatic annotations (with some exceptions) for part-of-speech (POS), syntactic (dependency) relations, lemma forms and sense ids." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-76", "text": "Korp offers, among others, a web service for concordances, which makes a search in corpora based on a query (e.g. a keyword and its POS) and returns hits with a sentence-long context." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-77", "text": "Moreover, with the corpus pipeline of Korp, tools for automatically annotating corpora are also available." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-78", "text": "A variety of different modern Swedish corpora from Korp have been used throughout this study including novel, newspaper and blog texts." 
}, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-79", "text": "Another source for sentences was the CEFR corpus , a collection of CEFR-related L2 Swedish course book texts." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-80", "text": "The corpus contains: (a) manual annotations indicating the structure of each lesson in the book (exercises, instructions, texts etc.); 1 http://spraakbanken.gu.se/korp/ (b) automatic linguistic annotations obtained with the annotation tools available through Korp." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-81", "text": "The CEFR corpus at the time of writing included B1 texts from three course books and B2 texts from one course book." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-82", "text": "The annotation of additional material covering other CEFR levels was ongoing." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-83", "text": "Not only corpora, but also information from frequency word lists has been used for determining the appropriateness of a sentence." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-84", "text": "The Kelly list (Volodina and Kokkinakis, 2012 ) is a frequencybased vocabulary list mostly built on a corpus of web texts from 2010." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-85", "text": "Besides frequency information, an associated CEFR level is available for each item." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-86", "text": "Another frequency-based word list employed for the machine learning experiments is the Wikipedia list (Volodina et al., 2012b) ." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-87", "text": "It contains the POS and the number of occurrences for each word form in a corpus of Swedish Wikipedia texts." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-88", "text": "A central resource of the present study is L\u00e4rka 2 , a freely available online ICALL platform." 
}, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-89", "text": "Currently its exercise generator module offers tasks both for students of linguistics and learners of L2 Swedish (Figure 1) ." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-90", "text": "Additional parts include a corpus editor used for the annotation of the CEFR corpus and the sentence selection module presented in this paper, Hit-Ex 3 (Hitta Exempel, \"Find Examples\" or Hit Examples)." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-91", "text": "The version under development contains also dictation and spelling exercises ." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-92", "text": "----------------------------------" }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-93", "text": "**MACHINE LEARNING EXPERIMENTS FOR READABILITY 4.1 DATASET**" }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-94", "text": "We distinguished two different classes in the dataset for the machine learning experiments: (a) sentences understandable at (within) B1 level and (b) sentences above B1 level." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-95", "text": "For the former group, sentences were collected from B1-level texts from the CEFR corpus." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-96", "text": "Sentences above B1 level consisted partly of B2-level sentences from the CEFR corpus, and partly of native language sentences from Korp retrieved on the basis of keywords between B2 and C2 levels according to the Kelly list." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-97", "text": "Only sentences between the length of 5 and 30 tokens were collected from all resources to decrease the influence of sentence length on the decisions made by the classifiers and to increase the importance of other linguistic features." 
}, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-98", "text": "The" }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-99", "text": "----------------------------------" }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-100", "text": "**METHOD**" }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-101", "text": "We performed supervised classification using as training and test data the set of sentences described in section 4.1." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-102", "text": "Thus, we aimed at a two-way classification distinguishing sentences within B1 level from those above." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-103", "text": "This level, besides being approximately a middle point of the CEFR scale, is typically divided into sub-levels in language courses (Folkuniversitet, 2013) which indicates a more substantial linguistic content." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-104", "text": "Consequently, additional practice for learners can be beneficial at this stage." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-105", "text": "Self-study activities may also be more common in this phase since students have sufficient L2 autonomy." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-106", "text": "We experimented with different classification algorithms 4 available through the Scikit-learn Python package (Pedregosa et al., 2011) , out of which we present the results only of the best performing one here, a linear Support Vector Machine (SVM) classifier." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-107", "text": "The SVM classifier aims at separating instances into classes with a hyperplane (Tanwani et al., 2009 ), equivalent to a line in a two-dimensional space." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-108", "text": "This hyperplane is defined based on the feature values of instances and weights associated with them." 
}, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-109", "text": "Once extracted, the values for each feature were scaled and centered." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-110", "text": "Evaluation was carried out with stratified 10-fold cross-validation, i.e. the proportion of labels in each fold was kept the same as that in the whole training set during the ten iterations of training and testing." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-111", "text": "The evaluation measures taken into consideration were accuracy, precision, recall and the F1 score, a combination of precision and recall, the two of them being equally important (Pedregosa et al., 2011)." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-112", "text": "----------------------------------" }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-113", "text": "**FEATURES**" }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-114", "text": "After a thorough overview of the machine learning approaches for readability in the literature, a number of features were chosen to be tested in our experiments." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-115", "text": "The features selected aimed at a deep analysis of the sentences at different linguistic levels." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-116", "text": "Besides traditional readability indicators, a number of syntactic, morphological, lexical and semantic aspects have been taken into consideration." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-117", "text": "Our initial set contained altogether 28 features, as presented in Table 2 on the next page." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-118", "text": "A number of popular traditional (shallow) features were included in the feature set (features 1-4)." 
}, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-119", "text": "These required less sophisticated text processing and had previously been used in several studies with success (Beinborn et al., 2012; Dell'Orletta et al., 2011; Fran\u00e7ois and Fairon, 2012; Heimann M\u00fchlenbock, 2013; Vajjala and Meurers, 2012) ." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-120", "text": "We computed sentence length as the number of tokens including punctuation, and token length as the number of characters per token." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-121", "text": "Part of the syntactic features was based on the depth (length) and direction of dependency arcs (features 5-8)." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-122", "text": "Another group of these features relied on the type of dependency relations." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-123", "text": "In feature 9 (Mod) nominal pre-modifiers (e.g. adjectives) and post-modifiers (e.g. relative clauses, prepositional phrases) were counted, similarly to Heimann M\u00fchlenbock (2013)." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-124", "text": "Variation features (ModVar, AdvVar) measured the ratio of a morphosyntactic category to the number of lexical (content) words in the sentence, as in Vajjala and Meurers (2012) ." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-125", "text": "These lexical categories comprised nouns, verbs, adverbs and adjectives." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-126", "text": "Subordinates (11) were detected on the basis of the \"UA\" (subordinate clause minus subordinating conjunction) dependency relation tag (Heimann M\u00fchlenbock, 2013) ." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-127", "text": "Features DepDepth, Mod, Sub and RightDep, PrepComp have previously been empoyed for Swedish L1 readability at the text level in Heimann M\u00fchlenbock (2013) and Falkenjack et al. 
(2013) respectively." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-128", "text": "The lexical-morphological features (features 13-25) constituted the largest group." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-129", "text": "Difficulty at the lexical level was determined based on both the TTR feature mentioned above, expressing vocabulary diversity, and on the basis of the rarity of words (features 13-17) according to the Kelly list and the Wikipedia word list." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-130", "text": "An analogous approach was adopted also by Fran\u00e7ois and Fairon (2012) , Vajjala and Meurers (2012) and Heimann M\u00fchlenbock (2013) with positive results." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-131", "text": "The LexD feature considers the ratio of lexical words (nouns, verbs, adjectives and adverbs) to the sum of tokens in the sentence (Vajjala and Meurers, 2012) ." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-132", "text": "The NN/VB ratio feature, which has a higher value in written text, can also indicate a more complex sentence (Biber et al., 2004; Heimann M\u00fchlenbock, 2013) ." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-133", "text": "Features 21-25 are based on evidence from the content of L2 Swedish course syllabuses (Folkuniversitet, 2013) and course books (Levy Scherrer and Lindemalm, 2009), part of them being language-dependent, namely S-VB/VB and S-VB%." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-134", "text": "These two features cover different types of Swedish verbs ending in -s which can indicate either a reciprocal verb, a passive construction or a deponent verb, active in meaning but passive in form (Fasth and Kannermark, 1989) ." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-135", "text": "Our feature set included three semantic features (26-28)." 
}, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-136", "text": "The intuition behind 28 is that words with multiple senses (polysemous words), increase reading complexity as, in order to understand the sentence, word senses need to be disambiguated (Graesser et al., 2011) ." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-137", "text": "This feature was computed by counting the number of sense IDs per token according to a lexical-semantic resource for Swedish, SALDO (Borin et al., 2013) , and dividing this value by the number of tokens in the sentence." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-138", "text": "As pronouns indicate a potentially more difficult text (Graesser et al., 2011), we included PN/NN in our set." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-139", "text": "Both NomR and PN/NN capture idea density, i.e. how complex the relation between the ideas expressed are (Heimann M\u00fchlenbock, 2013)." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-140", "text": "----------------------------------" }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-141", "text": "**CLASSIFICATION RESULTS**" }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-142", "text": "The results obtained using the complete set of 28 features is shown in Table 3 ." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-143", "text": "The results of the SVM are presented in comparison to a baseline classifier assigning the most frequent output label in the dataset to each instance." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-144", "text": "classes was about 50-50%." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-145", "text": "The SVM classified 7 out of 10 sentences accurately." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-146", "text": "The precision and recall values for the identification of B1 sentences was 73% and 68%." 
}, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-147", "text": "Previous classification results for a similar task obtained an average of 77.25% of precision for the classification of easy-to-read texts within an L1 Swedish text-level readability study (Heimann M\u00fchlenbock, 2013) ." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-148", "text": "Another classification at the sentence level, but for Italian and from an L1 perspective achieved an accuracy of 78.2%, thus 7% higher compared to our results (Dell'Orletta et al., 2011) ." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-149", "text": "The 73% precision of our SVM model for classifying B1 sentences was close to the precision of 75.1% obtained for the easy-to-read sentences from Dell'Orletta et al. (2011) ." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-150", "text": "Fran\u00e7ois and Fairon (2012) in a classification study from the L2 perspective, aiming at distinguishing all 6 CEFR levels for French at the text level, concluded that intermediate levels are harder to distinguish than the levels at the edges of the CEFR scale." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-151", "text": "The authors reported an adjacent accuracy of 67% for B1 level, i.e. the level of almost 7 out of 10 texts was predicted either correctly or with only one level of difference compared to the original level." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-152", "text": "Precise comparison with previous results is, however, difficult since, to our knowledge, there are no results reported for L2 readability at the sentence level." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-153", "text": "Thus, the values mentioned above serve more as a side-by-side illustration." 
}, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-154", "text": "----------------------------------" }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-155", "text": "**CLASSIFIER**" }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-156", "text": "Besides experimenting with the complete feature set, groups of features were also separately tested." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-157", "text": "The results are presented in Table 4 ." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-158", "text": "terestingly, although semantic features represented the smallest group, they performed 2% better than traditional or syntactic features." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-159", "text": "The largest group of features including lexical-morphological indicators performed around 10% more accurately than other feature groups." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-160", "text": "Among the 10 features that influenced most the decisions of our SVM classifier, we can find attributes from different feature groups." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-161", "text": "The ID of these features together with the SVM weights are reported in Table 5 ." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-162", "text": "An informative traditional measure was sentence length, similarly to the results of previous studies (Beinborn et al., 2012; Dell'Orletta et al., 2011; Fran\u00e7ois and Fairon, 2012; Heimann M\u00fchlenbock, 2013; Vajjala and Meurers, 2012) ." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-163", "text": "Lexical-morphological features based on information about the frequency and the CEFR level of items in the Kelly list (DiffW%, DiffWs and KellyFr) also proved to be influential for the classification, as well as AdvVar." 
}, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-164", "text": "Two out of our three semantic features, namely NomR and, in particular, Sense/W, were also highly predictive." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-165", "text": "Syntactic features Ddep/SentLen and DeepDep, based on information about dependency arcs, were also among the ten features with highest weights, but they were somewhat less useful, as the weights in Table 5 show." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-166", "text": "Contrary to our results, Fran\u00e7ois and Fairon (2012) found syntactic features more informative than semantic ones for L2 French." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-167", "text": "This may depend either on the difference between the features used or the target languages." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-168", "text": "Moreover, in the case of Swedish L1 text readability the noun/pronoun ratio and modifiers proved to be indicative of textlevel difficulty (Heimann M\u00fchlenbock, 2013 ), but at the sentence level from the L2 perspective only the latter seemed influential in our experiments." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-169", "text": "The data used for the experiments was labeled for CEFR levels at the text level, not at the sentence level." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-170", "text": "This introduced some noise in the data and made the classification task somewhat harder." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-171", "text": "In the future, the availability of data labeled at the sentence level could contribute to more accurate results." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-172", "text": "Excluding potentially lower level sentences from those appearing in higher level texts based on the distance between feature vectors could also be explored, in a similar fashion to Dell'Orletta et al. (2011)." 
}, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-173", "text": "5 Heuristics: GDEX parameters for sentence filtering and ranking" }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-174", "text": "Besides SVM classification, our sentence selection module, Hit-Ex, offers also a number of heuristic parameter options 5 , usable either in combination or as an alternative to the machine learning model (for further details see section 6)." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-175", "text": "Part of these search parameters are generic preferences including the keyword to search for, its POS, the corpora from Korp to be used during selection and the desired CEFR level of the sentences." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-176", "text": "Furthermore, it is possible to avoid sentences containing: abbreviations, proper names, keyword repetition, negative formulations (inte \"not\" or utom \"except\" in the sentence), modal verbs, participles, s-verbs and sentences lacking finite verbs." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-177", "text": "Users can also allow these categories and choose a penalty point between 0 and -50 for them." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-178", "text": "The penalty score for each filtering criteria is summed for obtaining a final score per sentence, based on which a final ranking is produced for all sentences retrieved from Korp, the ranking reflecting the extent to which they satisfy the search criteria." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-179", "text": "Some additional parameters, partly overlapping with the machine learning model's features, are also available for users to experiment with, being that the machine learning model does not cover all CEFR levels." 
}, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-180", "text": "Based on statistical evidence from corpora, we suggested default values for all parameters for retrieving sentences of B1, B2, C1 level with rulebased parameters only." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-181", "text": "However, additional data and further testing is required to verify the appropriateness of the proposed values." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-182", "text": "----------------------------------" }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-183", "text": "**COMBINED APPROACH**" }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-184", "text": "As mentioned in the previous subsection, the heuristic parameters and the machine learning approach have been implemented and tested also in combination." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-185", "text": "Parameters are kept to perform a GDEX-like filtering, whilst the SVM model is employed to ensure that hits were of a suitable level for learners." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-186", "text": "During this combined filtering, first a ranking for each unfiltered sentence coming from the web service of Korp is computed with heuristics." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-187", "text": "During these calculations, the parameters partly or fully overlapping with certain features of the machine learning model are deactivated, i.e. receive penalty points set to 0, thus, they do not influence the ranking." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-188", "text": "Instead, those aspects are taken care of by the machine learning model, in a subsequent step." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-189", "text": "Only the 100 sentences ranked highest are given for classification to the machine learning model for efficiency reasons." 
}, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-190", "text": "Finally, once the classification has been performed, sentences classified as understandable at B1 level are returned in the order of their heuristic ranking." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-191", "text": "Figure 2 shows part of the interface of Hit-Ex, as well as the highest ranked three sentences 6 of an example search for the noun hund \"dog\" at B1 level." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-192", "text": "Besides the Hit-Ex page, both the heuristics-only and the combined approaches are available also as web services." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-193", "text": "----------------------------------" }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-194", "text": "**EVALUATION**" }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-195", "text": "The purpose of the evaluation was to explore how many sentences, collected from native language corpora in Korp with our algorithms, were understandable at B1 level (at B1 or below) and thus, appropriate to be presented to learners of L2 Swedish of that CEFR level." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-216", "text": "Our results indicate the success of lexical-morphological and semantic factors over syntactic ones in the L2 context." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-196", "text": "Participants included three L2 Swedish teachers, twenty-six L2 Swedish students at B1 level, according to their current or most recent language course, and five linguists familiar with the CEFR scale." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-197", "text": "Besides the criteria of understandability (readability), the aspect of being an appropriate exercise item was also explored." 
}, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-198", "text": "We selected altogether 196 sentences using both our approaches, with two different parameter settings for the rule-based method (See Pil\u00e1n et al. (2013) and Pil\u00e1n (2013) for further details about the evaluation)." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-199", "text": "Evaluators were asked to indicate whether they found the sentences understandable 6 English translations of the selected sentences: (1)\"It would be enough for a normal dog.\"; (2)\"They left the body in the form of a dog.\"; (3)\"There was a person with a dog.\" at B1 level or not." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-200", "text": "Teachers and linguists (TL) rated the sentences also as potential exercise items." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-201", "text": "The results of the evaluation are presented in Table 6 ." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-202", "text": "Understandability Exercise item TL Students TL 76% 69% 59% 73% Table 6 : Evaluation results." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-203", "text": "Respondents found overall 73% percent of the sentences selected by both our methods understandable at B1 level, whilst somewhat less, about six out of ten items, proved to be suitable for being included in exercises for L2 Swedish learning." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-204", "text": "According to our evaluators, the two settings of the rule-based approach (Alg1-s1 and Alg1-s2) satisfied the two criteria observed between 1-5% more of the cases." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-205", "text": "On average, teachers, linguists and students considered 75% of the sentences selected with Alg1-s1 understandable, but only 70% of those identified with the combined approach (Alg2)." 
}, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-206", "text": "The detailed results per algorithm, criteria and user group are shown in Figure 3 ." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-207", "text": "According to our evaluators' comments, some of the selected sentences contained difficult aspects at the syntactic level, among others, difficult word order, subordinates and relative clauses." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-208", "text": "Moreover, at the lexical level, a stricter lexical filtering, and checking for a sufficient amount of lexical words in the sentence would be required." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-209", "text": "Respondents' comments revealed also the potential future improvement of filtering for context dependency which would make sentences more suitable as exercise items." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-210", "text": "----------------------------------" }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-211", "text": "**CONCLUSION**" }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-212", "text": "In this study we investigated linguistic factors influencing the sentence-level readability of Swedish from a L2 learning point of view." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-213", "text": "The main contribution of our work consists of two sentence selection methods and their implementation for identifying sentences from a variety of Swedish corpora which are not only readable, but potentially suitable also as automatically generated exercise items for learners at intermediate (CEFR B1) level and above." 
}, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-214", "text": "We proposed a heuristics-only and a combined selection approach, the latter merging rule-based parameters (targeting mainly the filtering of \"undesired\" linguistic elements), and machine learning methods for classifying the readability of sentences from L2 learners' perspective." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-215", "text": "We obtained a classification accuracy of 71% with an SVM classifier which compares well to previously reported results for similar tasks." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-217", "text": "The most predictive indicators include, besides sentence length, the amount of difficult words in the sentence, adverb variation, nominal pre-and postmodifiers and two semantic criteria, the average number of senses per word and nominal ratio (Table 5)." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-218", "text": "Within a smaller-scale evaluation, about 73% of the sentences selected by our methods were understandable at B1 level, whilst about 60% of the sentences proved to be suitable as exercise items, the heuristics-only approach being slightly preferred by evaluators." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-219", "text": "Further investigation of the salient properties of exercise items may contribute to the improvement of the current selection approach." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-220", "text": "The method, as well as most of the parameters and features used, are language independent and could, thus, be applied also to languages other than Swedish, provided that NLP tools performing similarly deep linguistic processing are available." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-221", "text": "Future additions to the filtering parameters may include aspects of word order, independence from a wider context, valency information and collocations." 
}, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-222", "text": "The optimization of the classifier could also be studied further; different algorithms and additional features could be tested to improve the classification results." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-223", "text": "The machine learning approach might show improvements in the future with training instances tagged at the sentence level and it can be easily extended, once additional data for other CEFR levels becomes available." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-224", "text": "Finally, additional evaluations could be carried out to confirm the appropriateness of the sentences ranked by the extended and improved selection method." }, { "sent_id": "d66ca5ff22e508da239fc7fdf5ac29-C001-225", "text": "To indicate the extent to which a sentence is understandable, 4-or 5-point scales may be used, and the employment of exercises instead of a list of sentences to read could also be investigated for verifying the suitability of the examples." 
} ], "y": { "@SIM@": { "gold_contexts": [ [ "d66ca5ff22e508da239fc7fdf5ac29-C001-31" ], [ "d66ca5ff22e508da239fc7fdf5ac29-C001-119", "d66ca5ff22e508da239fc7fdf5ac29-C001-120" ], [ "d66ca5ff22e508da239fc7fdf5ac29-C001-126" ], [ "d66ca5ff22e508da239fc7fdf5ac29-C001-127" ], [ "d66ca5ff22e508da239fc7fdf5ac29-C001-162" ] ], "cite_sentences": [ "d66ca5ff22e508da239fc7fdf5ac29-C001-31", "d66ca5ff22e508da239fc7fdf5ac29-C001-119", "d66ca5ff22e508da239fc7fdf5ac29-C001-126", "d66ca5ff22e508da239fc7fdf5ac29-C001-127", "d66ca5ff22e508da239fc7fdf5ac29-C001-162" ] }, "@USE@": { "gold_contexts": [ [ "d66ca5ff22e508da239fc7fdf5ac29-C001-31" ], [ "d66ca5ff22e508da239fc7fdf5ac29-C001-119", "d66ca5ff22e508da239fc7fdf5ac29-C001-120" ], [ "d66ca5ff22e508da239fc7fdf5ac29-C001-127" ] ], "cite_sentences": [ "d66ca5ff22e508da239fc7fdf5ac29-C001-31", "d66ca5ff22e508da239fc7fdf5ac29-C001-119", "d66ca5ff22e508da239fc7fdf5ac29-C001-127" ] }, "@BACK@": { "gold_contexts": [ [ "d66ca5ff22e508da239fc7fdf5ac29-C001-47" ], [ "d66ca5ff22e508da239fc7fdf5ac29-C001-132" ], [ "d66ca5ff22e508da239fc7fdf5ac29-C001-139" ] ], "cite_sentences": [ "d66ca5ff22e508da239fc7fdf5ac29-C001-47", "d66ca5ff22e508da239fc7fdf5ac29-C001-132", "d66ca5ff22e508da239fc7fdf5ac29-C001-139" ] }, "@DIF@": { "gold_contexts": [ [ "d66ca5ff22e508da239fc7fdf5ac29-C001-146", "d66ca5ff22e508da239fc7fdf5ac29-C001-147" ], [ "d66ca5ff22e508da239fc7fdf5ac29-C001-168" ] ], "cite_sentences": [ "d66ca5ff22e508da239fc7fdf5ac29-C001-147", "d66ca5ff22e508da239fc7fdf5ac29-C001-168" ] } } }, "ABC_b49807b058e5e1e50eae524e592401_20": { "x": [ { "sent_id": "b49807b058e5e1e50eae524e592401-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-2", "text": "Fine-grained entity typing is the task of assigning fine-grained semantic types to entity mentions." 
}, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-3", "text": "We propose a neural architecture which learns a distributional semantic representation that leverages a greater amount of semantic context -both document and sentence level information -than prior work." }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-4", "text": "We find that additional context improves performance, with further improvements gained by utilizing adaptive classification thresholds." }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-5", "text": "Experiments show that our approach without reliance on hand-crafted features achieves the state-ofthe-art results on three benchmark datasets." }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-6", "text": "----------------------------------" }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-8", "text": "Named entity typing is the task of detecting the type (e.g., person, location, or organization) of a named entity in natural language text." }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-9", "text": "Entity type information has shown to be useful in natural language tasks such as question answering (Lee et al., 2006) , knowledge-base population (Carlson et al., 2010; Mitchell et al., 2015) , and coreference resolution (Recasens et al., 2013) ." }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-10", "text": "Motivated by its application to downstream tasks, recent work on entity typing has moved beyond standard coarse types towards finer-grained semantic types with richer ontologies (Lee et al., 2006; Ling and Weld, 2012; Yosef et al., 2012; Gillick et al., 2014; Del Corro et al., 2015) ." }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-11", "text": "Rather than assuming an entity can be uniquely categorized into a single type, the task has been approached as a multi-label classification problem: e.g., in \"... became a top seller ... 
Monopoly is played in 114 countries." }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-12", "text": "...\" (Figure 1 ), \"Monopoly\" is considered both a game as well as a product." }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-13", "text": "The state-of-the-art approach (Shimaoka et al., 2017) for fine-grained entity typing employs an attentive neural architecture to learn representations of the entity mention as well as its context." }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-80", "text": "----------------------------------" }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-14", "text": "These representations are then combined with hand-crafted features (e.g., lexical and syntactic features), and fed into a linear classifier with a fixed threshold." }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-15", "text": "While this approach outperforms previous approaches which only use sparse binary features (Ling and Weld, 2012; Gillick et al., 2014) or distributed representations (Yogatama et al., 2015) , it has a few drawbacks: (1) the representations of left and right contexts are learnt independently, ignoring their mutual connection; (2) the attention on context is computed solely upon the context, considering no alignment to the entity; (3) document-level contexts which could be useful in classification are not exploited; and (4) hand-crafted features heavily rely on system or human annotations." }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-16", "text": "To overcome these drawbacks, we propose a neural architecture (Figure 1 ) which learns more context-aware representations by using a better attention mechanism and taking advantage of semantic discourse information available in both the document as well as sentence-level contexts." }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-17", "text": "Fur-ther, we find that adaptive classification thresholds leads to further improvements." 
}, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-18", "text": "Experiments demonstrate that our approach, without any reliance on hand-crafted features, outperforms prior work on three benchmark datasets." }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-19", "text": "----------------------------------" }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-20", "text": "**MODEL**" }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-21", "text": "Fine-grained entity typing is considered a multi-label classification problem: each entity e in the text x is assigned a set of types T* drawn from the fine-grained type set T." }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-22", "text": "The goal of this task is to predict, given entity e and its context x, the assignment of types to the entity." }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-23", "text": "This assignment can be represented by a binary vector y \u2208 {0, 1}^{|T|}, where |T| is the size of T." }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-24", "text": "y_t = 1 iff the entity is assigned type t \u2208 T." }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-25", "text": "----------------------------------" }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-26", "text": "**GENERAL MODEL**" }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-27", "text": "Given a type embedding vector w_t and a featurizer \u03d5 that takes entity e and its context x, we employ logistic regression (as shown in Figure 1) to model the probability of e being assigned t (i.e., y_t = 1)" }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-28", "text": "and we seek to learn a type embedding matrix W = [w_1, . . . , w_{|T|}] and a featurizer \u03d5 such that" }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-29", "text": "At inference, the predicted type set T\u0302 assigned to entity e is computed by" }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-30", "text": "with r_t the threshold for predicting that e has type t." }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-31", "text": "----------------------------------" }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-32", "text": "**FEATURIZER**" }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-33", "text": "As shown in Figure 1, featurizer \u03d5 in our model contains three encoders which encode entity e and its context x into feature vectors, and we consider both sentence-level context x_s and document-level context x_d, in contrast to prior work which only takes sentence-level context (Gillick et al., 2014; Shimaoka et al., 2017)." }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-81", "text": "**APPROACH STRICT MACRO MICRO**" }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-34", "text": "The output of featurizer \u03d5 is the concatenation of these feature vectors:" }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-35", "text": "We define the computation of these feature vectors as follows." }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-36", "text": "Entity Encoder: The entity encoder f computes the average of all the embeddings of tokens in entity e." }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-37", "text": "where \u2192f and \u2190f are L-layer stacked LSTM units (Hochreiter and Schmidhuber, 1997)." }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-38", "text": "This is different from Shimaoka et al. (2017) who use two separate bi-directional RNNs for context on each side of the entity mention."
}, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-39", "text": "Attention: The feature representation for x_s is a weighted sum of the hidden states: g_s(x_s, e) = \u2211_{i=1}^{n} a_i h_i, where a_i is the attention to hidden state h_i." }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-40", "text": "We employ the dot-product attention (Luong et al., 2015)." }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-41", "text": "It computes attention based on the alignment between the entity and its context:" }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-42", "text": "where W_a is the weight matrix." }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-43", "text": "The dot-product attention differs from the self attention (Shimaoka et al., 2017) which only considers the context." }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-44", "text": "Document-level Context Encoder: The encoder g_d for document-level context x_d is a multi-layer perceptron:" }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-45", "text": "where DM is a pretrained distributed memory model (Le and Mikolov, 2014) which converts the document-level context into a distributed representation." }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-46", "text": "W_d1 and W_d2 are weight matrices." }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-47", "text": "----------------------------------" }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-48", "text": "**ADAPTIVE THRESHOLDS**" }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-49", "text": "In prior work, a fixed threshold (r_t = 0.5) is used for classification of all types (Ling and Weld, 2012; Shimaoka et al., 2017)." }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-50", "text": "We instead assign a different threshold to each type that is optimized to maximize the overall strict F1 on the dev set."
}, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-51", "text": "We show the definition of strict F1 in Section 3.1." }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-52", "text": "----------------------------------" }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-53", "text": "**EXPERIMENTS**" }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-54", "text": "We conduct experiments on three publicly available datasets." }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-55", "text": "Table 1" }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-56", "text": "----------------------------------" }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-57", "text": "**METRICS**" }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-58", "text": "We adopt the metrics used in Ling and Weld (2012) where results are evaluated via strict, loose macro, and loose micro F1 scores." }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-59", "text": "For the i-th instance, let the predicted type set be T\u0302_i and the reference type set be T_i." }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-60", "text": "The precision (P) and recall (R) for each metric are computed as follows." }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-61", "text": "Strict:" }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-62", "text": "We made the source code and data publicly available at https://github.com/sheng-z/figet." }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-63", "text": "Loose Macro:" }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-64", "text": "----------------------------------" }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-65", "text": "**HYPERPARAMETERS**" }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-66", "text": "We use open-source GloVe vectors (Pennington et al., 2014) trained on Common Crawl 840B with 300 dimensions to initialize word embeddings used in all encoders."
}, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-67", "text": "All weight parameters are sampled from U(\u22120.01, 0.01)." }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-68", "text": "Adam (Kingma and Ba, 2014) with mini-batch gradients is used for optimization." }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-69", "text": "Batch size is 200." }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-70", "text": "Dropout (rate=0.5) is applied to three feature functions." }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-71", "text": "To avoid overfitting, we choose models which yield the best strict F1 on dev sets." }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-72", "text": "----------------------------------" }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-73", "text": "**RESULTS**" }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-74", "text": "We compare experimental results of our approach with previous approaches, and study the contribution of our base model architecture, document-level contexts, and adaptive thresholds via ablation." }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-75", "text": "To ensure our findings are reliable, we run each experiment twice and report the average performance." }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-76", "text": "Overall, our approach significantly increases the state-of-the-art macro F1 on both OntoNotes and BBN datasets." }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-77", "text": "On OntoNotes (Table 3), our approach improves the state of the art across all three metrics." }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-78", "text": "Note that (1) without adaptive thresholds or document-level contexts, our approach still outperforms other approaches on macro F1 and micro F1; (2) adding hand-crafted features (Shimaoka et al., 2017) does not improve the performance." }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-79", "text": "(Table 3 rows: Ma et al. (2016) 49.30 / 68.23 / 61.27; AFET (Ren et al., 2016a) 55.10 / 71.10 / 64.70; FNET (Abhishek et al., 2017) 52.20 / 68.50 / 63.30; NEURAL (Shimaoka et al., 2017).) This indicates the benefits of our proposed model architecture for learning fine-grained entity typing, which is discussed in detail in Section 3.4; and (3) BINARY and KWASIBIE were trained on a different dataset, so their results are not directly comparable." }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-82", "text": "(Table 4 rows: PLE (Ren et al., 2016b) 49.44 / 68.75 / 64.54; Ma et al. (2016) 70.43 / 75.78 / 76.50; AFET (Ren et al., 2016a) 67.00 / 72.70 / 73.50; FNET (Abhishek et al., 2017) 60.) Table 4: Results on the BBN dataset." }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-83", "text": "On BBN (Table 4), while Ma et al. (2016)'s label embedding algorithm holds the best strict F1, our approach notably improves both macro F1 and micro F1." }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-84", "text": "The performance drops to a level competitive with other approaches if adaptive thresholds or document-level contexts are removed." }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-85", "text": "On FIGER (Table 5), where no document-level context is currently available, our proposed approach still achieves the state-of-the-art strict and micro F1. (Integrating label embedding into our proposed approach is an avenue for future work.)" }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-86", "text": "----------------------------------" }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-87", "text": "**APPROACH**" }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-88", "text": "(Table 5 rows, Strict / Macro / Micro: KWASIBIE (Yogatama et al., 2015) N/A / N/A / 72.25; Attentive (Shimaoka et al., 2016) 58.97 / 77.96 / 74.94; FNET (Abhishek et al., 2017) 65.80 / 81.20 / 77.40.)" }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-89", "text": "(Table 5 rows, continued: Ling and Weld (2012) 52.30 / 69.90 / 69.30; PLE (Ren et al., 2016b) 49.44 / 68.75 / 64.54; Ma et al. (2016) 53.54 / 68.06 / 66.53; AFET (Ren et al., 2016a) 53.30 / 69.30 / 66.40; NEURAL (Shimaoka et al., 2017).)" }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-90", "text": "If compared with the ablation variant of the NEURAL approach, i.e., the one without hand-crafted features, our approach gains a significant improvement." }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-91", "text": "We notice that removing adaptive thresholds only causes a small performance drop; this is likely because the train and test splits of FIGER are from different sources, and the adaptive thresholds do not generalize well enough to the test data." }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-92", "text": "KWASIBIE, Attentive and FNET were trained on a different dataset, so their results are not directly comparable." }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-93", "text": "Table 2 shows examples illustrating the benefits brought by our proposed approach." }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-94", "text": "Example A illustrates that sentence-level context sometimes is not informative enough, and attention, though already placed on the head verbs, can be misleading." }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-95", "text": "Including document-level context (i.e., \"Canada's declining crude output\" in this case) helps preclude wrong predictions (i.e., /other/health and /other/health/treatment)." }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-96", "text": "Example B shows that the semantic patterns learnt by our attention mechanism help make the correct prediction." }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-97", "text": "As we observe in Table 3 and Table 5, adding hand-crafted features to our approach does not improve the results."
}, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-98", "text": "One possible explanation is that hand-crafted features are mostly about syntactic-head or topic information, and such information is already covered by our attention mechanism and document-level contexts, as shown in Table 2." }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-99", "text": "Compared to hand-crafted features that heavily rely on system or human annotations, the attention mechanism requires significantly less supervision, and document-level or paragraph-level contexts are much easier to obtain." }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-100", "text": "Through experiments, we observe no improvement from encoding hierarchical type information (Shimaoka et al., 2017)." }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-101", "text": "To explain this, we compute cosine similarity between each pair of fine-grained types based on the type embeddings learned by our model, i.e., w_t in Eq. (1)." }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-102", "text": "Table 6 shows several types and their closest types: these types do not always share coarse-grained types with their closest types, but they often co-occur in the same context." }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-103", "text": "----------------------------------" }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-104", "text": "**ANALYSIS**" }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-105", "text": "----------------------------------" }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-106", "text": "**CONCLUSION**" }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-107", "text": "We propose a new approach for fine-grained entity typing."
}, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-108", "text": "The contributions are: (1) we propose a neural architecture which learns a distributional semantic representation that leverages both document- and sentence-level information, (2) we find that context augmented with document-level information improves performance, and (3) we utilize adaptive classification thresholds to further boost the performance." }, { "sent_id": "b49807b058e5e1e50eae524e592401-C001-109", "text": "Experiments show our approach achieves new state-of-the-art results on three benchmarks." } ], "y": { "@BACK@": { "gold_contexts": [ [ "b49807b058e5e1e50eae524e592401-C001-13" ] ], "cite_sentences": [ "b49807b058e5e1e50eae524e592401-C001-13" ] }, "@DIF@": { "gold_contexts": [ [ "b49807b058e5e1e50eae524e592401-C001-33" ], [ "b49807b058e5e1e50eae524e592401-C001-38" ], [ "b49807b058e5e1e50eae524e592401-C001-43" ], [ "b49807b058e5e1e50eae524e592401-C001-49", "b49807b058e5e1e50eae524e592401-C001-50" ], [ "b49807b058e5e1e50eae524e592401-C001-78" ], [ "b49807b058e5e1e50eae524e592401-C001-79" ], [ "b49807b058e5e1e50eae524e592401-C001-89", "b49807b058e5e1e50eae524e592401-C001-90" ], [ "b49807b058e5e1e50eae524e592401-C001-100" ] ], "cite_sentences": [ "b49807b058e5e1e50eae524e592401-C001-33", "b49807b058e5e1e50eae524e592401-C001-38", "b49807b058e5e1e50eae524e592401-C001-43", "b49807b058e5e1e50eae524e592401-C001-49", "b49807b058e5e1e50eae524e592401-C001-78", "b49807b058e5e1e50eae524e592401-C001-79", "b49807b058e5e1e50eae524e592401-C001-89", "b49807b058e5e1e50eae524e592401-C001-100" ] }, "@EXT@": { "gold_contexts": [ [ "b49807b058e5e1e50eae524e592401-C001-100" ] ], "cite_sentences": [ "b49807b058e5e1e50eae524e592401-C001-100" ] } } }, "ABC_ca98f16fa3a118f83b16586bba04c8_20": { "x": [ { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-2", "text": "This paper addresses the question: In neural dialog systems, why do
sequence-to-sequence (Seq2Seq) neural networks generate short and meaningless replies for open-domain response generation?" }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-3", "text": "We conjecture that in a dialog system, due to the randomness of spoken language, there may be multiple equally plausible replies for one utterance, causing the deficiency of a Seq2Seq model." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-4", "text": "To evaluate our conjecture, we propose a systematic way to mimic the dialog scenario in machine translation systems with both real datasets and elaborately generated toy datasets." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-44", "text": "We propose to mimic the \"unaligned\" property in machine translation datasets by shuffling the source and target pairs." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-5", "text": "Experimental results show that we manage to reproduce the phenomenon of generating short and meaningless sentences in the translation setting." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-6", "text": "----------------------------------" }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-8", "text": "Open-domain human-computer dialog systems are attracting increasing attention in the NLP community." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-9", "text": "With the development of deep learning, sequence-to-sequence (Seq2Seq) neural networks, or more generally encoder-decoder frameworks, are among the most popular models for text-based response generation in dialog systems [1, 2, 3, 4]." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-10", "text": "Historically, Seq2Seq-like models were first designed for machine translation [5, 6] and later widely applied to image captioning [7], text summarization [8], etc."
}, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-11", "text": "When adapted to text-based open-domain dialog systems, however, Seq2Seq models are less satisfactory." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-12", "text": "A severe problem is that the Seq2Seq model tends to generate short and meaningless replies, e.g., \"I don't know\" [2] and \"Me too\" [3] ." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-13", "text": "They are universally relevant to most utterances, called universal replies in [3], and hence less desirable in real-world conversation systems." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-14", "text": "In previous studies, researchers have proposed a variety of approaches to address the problem of universal replies, ranging from heuristically modified training objectives [2] and diversified decoding algorithms [9] to content-introducing approaches [3, 10]." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-15", "text": "Although the problems of universal replies have been alleviated to some extent, there is still no empirical explanation for the curious question: Why does the same Seq2Seq model tend to generate shorter and less meaningful sentences in a dialog system than in a machine translation system?" }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-16", "text": "Considering the difference between dialog and translation data, our intuition is that, compared with translation data, a dialog system encounters a severe \"unaligned\" problem due to the randomness and uncertainty of spoken language: an utterance can be matched to multiple equally plausible replies, but these replies may have different meanings." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-17", "text": "On the contrary, the translation datasets typically have a more precise semantic matching between the source and target sides."
}, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-18", "text": "This conjecture has been casually expressed in previous work [3], but has so far not been supported by experiments." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-19", "text": "To verify our conjecture, we propose to mimic the unaligned phenomenon on machine translation datasets by shuffling the source and target sides of the translation pairs, artificially building a conditional distribution of target sentences with multiple plausible data points." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-20", "text": "We conduct experiments on a widely used translation dataset; we further conduct a simulation with some predefined distributions, serving as additional evidence." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-21", "text": "The experimental results show that shuffling of datasets tends to make translated sentences shorter and less meaningful." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-22", "text": "Therefore, the unaligned problem can be one reason that causes short and meaningless replies in neural dialog systems." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-23", "text": "To summarize, this paper compares Seq2Seq dialog with translation systems, and provides an explanation of the question: Why do neural dialog systems tend to generate short and meaningless replies?" }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-24", "text": "Our study also sheds light on the future development of neural dialog systems as well as the application scenarios where Seq2Seq models are appropriate."
}, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-25", "text": "----------------------------------" }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-26", "text": "**CONJECTURE**" }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-27", "text": "We hypothesize that given a source sequence, the conditional distribution of the target sequence having multiple plausible points is one cause of the deficiency of Seq2Seq models in dialog systems." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-28", "text": "Let us denote the source sequence by s = (s_1, s_2, \u00b7\u00b7\u00b7, s_{|s|}) and the target sequence by t = (t_1, t_2, \u00b7\u00b7\u00b7, t_{|t|})." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-29", "text": "Both (orthodox) training and prediction objectives are to maximize p_\u03b8(t|s), where the conditional probability p_\u03b8(\u00b7|\u00b7) is modeled by a Seq2Seq neural network with parameters \u03b8." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-30", "text": "In a machine translation system, the source and target information generally aligns well, although some meanings could have different expressions." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-31", "text": "Figure 1a shows a continuous analog of p(t|s)." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-32", "text": "In an open-domain dialog system, however, an utterance may have a variety of replies that are (nearly) equally plausible." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-33", "text": "For example, given a user-issued utterance \"What are you going to do?\" there could be multiple replies like \"having lunch,\" \"watching movies,\" and \"sleeping,\" shown in Figure 1b with an analog using continuous variables."
}, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-34", "text": "There is no particular reason why one reply should be favored over another without further context; even with context, this problem cannot be fully solved because of the true randomness of dialog." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-35", "text": "Points located near the \"mode\" could be viewed as replies with similar meanings but less fluent expressions." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-36", "text": "Other areas with low probabilities are nonsensical utterances that are either not fluent in spoken language or irrelevant to the previous utterance s." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-37", "text": "The above is, perhaps, the most salient difference between dialog and translation datasets." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-38", "text": "Although it is tempting to think of Seq2Seq's performance in this way [3], hardly any practical approach exists to verify the conjecture in the dialog setting alone." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-39", "text": "In the rest of this paper, we will verify it with a modification of machine translation tasks." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-40", "text": "----------------------------------" }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-41", "text": "**EXPERIMENTAL PROTOCOL**" }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-42", "text": "----------------------------------" }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-43", "text": "**MIMICKING A \"DIALOG SCENARIO\" IN TRANSLATION**" }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-45", "text": "This ensures that the resulting conditional distribution p(t|s) has multiple plausible target sequences, whereas other settings of translation remain unchanged, making for a rigorous controlled experiment."
}, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-46", "text": "Formally speaking, let {(s^(n), t^(n))}_{n=1}^{N} be the training dataset in a translation setting, where (s^(n), t^(n)) is a particular data point containing a source and target sentence pair; in total we have N data points." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-47", "text": "The shuffled dataset is {(s^(n), t\u0303^(n))}_{n=1}^{N}, where t\u0303^(n) = t^(\u03c4(n)) and \u03c4(1), \u00b7\u00b7\u00b7, \u03c4(N) is a random permutation of 1, 2, \u00b7\u00b7\u00b7, N." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-48", "text": "In this way, we artificially construct a conditional target distribution p(t\u0303^(n)|s^(n)) that allows multiple plausible sentences conditioned on a particular source sentence." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-49", "text": "Notice that, for the sake of constructing a distribution where the target sentences can have multiple plausible data points, there is no need to generate multiple random target sentences for a particular source sentence." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-50", "text": "In fact, it is preferable not to, so that the experiment is more controlled." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-51", "text": "In the case where we generate a single target sentence t\u0303^(n) = t^(\u03c4(n)) for a source sentence s^(n), {t\u0303^(n)|s^(n)}_{n=1}^{N} can still be viewed as samples from the marginal (unconditioned) distribution p(t), and thus the desired \"unaligned\" property is in place." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-52", "text": "It is straightforward to shuffle a subset of the translation dataset." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-53", "text": "This helps to analyze how Seq2Seq models behave when the \"unaligned\" problem becomes more severe."
}, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-54", "text": "The shuffling trick was previously used in [11] to compare the robustness of Seq2Seq models and phrase-based statistical machine translation." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-55", "text": "Our paper contains a novel insight that shuffling datasets mimics the unaligned property in dialog datasets, which facilitates the comparison between Seq2Seq dialog and translation systems." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-56", "text": "----------------------------------" }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-57", "text": "**THE SEQ2SEQ MODEL AND DATASETS**" }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-58", "text": "We adopted a Seq2Seq model (with an attention mechanism) as the neural network for both dialog and translation systems." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-59", "text": "The encoder is a bidirectional recurrent neural network with gated recurrent units (GRUs), whereas the decoder comprises two GRU transition blocks and an attention mechanism in between [12]." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-60", "text": "For the dialog system, we used a text-based dialog dataset, the Cornell Movie-Dialogs Corpus, containing 221k samples." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-61", "text": "For machine translation, we conducted experiments on a real-world dataset as well as a toy dataset." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-62", "text": "We applied the shuffling method in both scenarios, mimicking the \"unaligned\" property." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-63", "text": "Following are the details of the translation datasets." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-64", "text": "The Real-World Dataset."
}, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-65", "text": "We used the WMT-2017 dataset and focused on English-to-German translation, with 5.8M samples." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-66", "text": "We trained Seq2Seq models on sub-word units by using the Byte-Pair Encoding technique [13]." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-67", "text": "For validation and test, we used the newstest2014 and newstest2016 sections, each containing 3k pairs." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-68", "text": "The Toy Dataset." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-69", "text": "We further evaluated our experiments on a toy dataset where we generated sequence-to-sequence samples from some predefined distributions." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-70", "text": "In this way, we can eliminate the effect of noise in real-world data (where source and target sides cannot match perfectly), serving as additional evidence for our claim." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-71", "text": "In particular, the task for the toy dataset is to verbatim copy a source sequence." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-72", "text": "This can be thought of as a \"trivial\" translation dataset, where the source and target are exactly the same." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-73", "text": "Further, we sample the source string lengths and word frequencies from meaningful distributions so that the toy dataset more resembles true natural language." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-74", "text": "Specifically, we first sample the length of source strings from a Poisson distribution, which is a counting distribution that oftentimes models the number of events in a certain time period."
}, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-75", "text": "Formally, the probability of the length of a string being k is given by p(k) = \u03bb^k e^{\u2212\u03bb} / k!," }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-76", "text": "where \u03bb is the parameter of the Poisson distribution, indicating the average length of strings, and in our case it was set to 20." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-77", "text": "We then set the vocabulary to all lower-case letters in English." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-78", "text": "For simplicity, we do not model the dependency among characters in a string." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-79", "text": "In other words, each character is sampled from a unigram distribution." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-80", "text": "Considering that the frequency of words generally follows a power-law distribution, also known as Zipf's law [14], in natural language, we also use a power law to approximate the unigram distribution." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-81", "text": "The power law has the form p(i) \u221d i^{\u2212\u03b1}, where i is the frequency rank of a character," }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-82", "text": "where \u03b1 was set to 1.63 in our experiments." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-83", "text": "Our synthesized toy dataset consists of 500k training samples, 2k validation samples, and 2k test samples." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-84", "text": "----------------------------------" }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-85", "text": "**RESULTS**" }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-86", "text": "BLEU Scores." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-87", "text": "Table 1 presents the BLEU scores of dialog and machine translation systems."
}, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-88", "text": "In open-domain dialog, BLEU-2 exhibits some (not large) correlation with human satisfaction [15], although BLEU scores are generally low." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-89", "text": "For machine translation, we achieved 27.2 BLEU for the normal setting on the English-to-German translation, which is comparable to the 28.4 achieved by a baseline method in [16], and thus our replication of the machine translation system is fair." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-90", "text": "For the unshuffled toy dataset, we achieved a 99.9 BLEU-4 score, as expected, indicating that the aligned pattern is easily learned by the Seq2Seq model." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-91", "text": "If we begin to shuffle the WMT dataset, we see that BLEU drops gradually and finally reaches near zero if the training set is 100% shuffled." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-92", "text": "The results are not surprising and were also reported in [11]." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-93", "text": "This provides a quick understanding of how the Seq2Seq model is influenced by shuffled data." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-94", "text": "The same phenomenon is also observed on the toy dataset." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-95", "text": "With the increase of the shuffling rate, the BLEU score on the toy dataset decreases." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-96", "text": "The reason for this phenomenon is apparent." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-97", "text": "As the shuffling rate increases, the \"unaligned\" problem becomes more and more severe, which makes the pattern in the dataset difficult to learn."
}, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-98", "text": "When the training data in the toy dataset is 100% shuffled, the BLEU score is close to zero while 50% and 75% shuffling settings return relative high scores, which is because simple alignment patterns still lies in the un-shuffled subset in the training data and are easy to be learned by the Seq2Seq model." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-99", "text": "Length, Negative Log-Likelihood, and Entropy." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-100", "text": "We evaluated the quality of generated results ( .4349 .8871 Table 2 : R 2 correlation obtained by fitting a linear regression of the encoding/decoding step with hidden states." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-101", "text": "length metric counts the number of words in a generated reply." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-102", "text": "5 The negative log-likelihood (NLL) is computed as \u2212 1 |R| w\u2208R log p train (w) where R denotes all replies and p train (\u00b7) is the unigram distribution of words in the training set." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-103", "text": "Entropy is defined as \u2212 w\u2208R p gen (w) log p gen (w) where p gen (\u00b7) is the unigram distribution in generated replies." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-104", "text": "Intuitively, both NLL and entropy evaluate how much \"content\" is contained in the replies." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-105", "text": "These metrics are used in previous work [4, 3] , and are related to our research question." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-106", "text": "We first compare the dialog system with machine translation, both in the un-shuffled setting." 
}, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-107", "text": "We observe that the dialog system does generate short and meaningless replies with lower length, NLL, and entropy metrics than references, as opposed to machine translation where Seq2Seq's generated sentences are comparable to references in terms of these statistics on both two datasets." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-108", "text": "Quantitatively, in the dialog system, the length is 20% shorter than references." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-109", "text": "The NLL and entropy decrease by 0.71 and 0.99, respectively; a decrease of 1 in NLL and entropy metrics is large because they are logarithmic metrics." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-110", "text": "Although with a well-engineered Seq2Seq model (with attention, beam search, etc.), the phenomenon is less severe than a vanilla Seq2Seq, it is still perceivable and worth investigating." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-111", "text": "We then applied the shuffling setting to the translation system." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-112", "text": "With the increase of shuffling rate, the Seq2Seq model trained on translation datasets precisely exhibits the phenomenon as a dialog system: the length decreases, the NLL decreases, and the entropy decreases." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-113", "text": "In particular, the decreasing NLL implies that the generated words are more frequently appearing in the training set, whereas the decreasing entropy implies that the distribution of generated sentences spread less across the vocabulary." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-114", "text": "The phenomenon is consistent in both real and synthetic datasets." 
}, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-115", "text": "In summary, artificially constructing an unaligned property in translation datasets-with all other settings remain unchanged-enables to reproduce the phenomenon in a dialog system." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-116", "text": "This shows evidence that the unaligned property could be one reason that causes the problem of short and meaningless replies in a dialog system." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-117", "text": "Correlation between Time Steps and Hidden States." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-118", "text": "Shi 5 In some cases, an RNN fails to terminate by repeating a same word." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-119", "text": "Here, we assume a same word can be repeated at most four times." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-120", "text": "et al. [17] conduct an empirical study analyzing \"Why Neural Translations are the Right Length?\" They observe that the length of generated reply is likely to be right regardless of the correctness of meaning." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-121", "text": "They further find that some dimensions in RNN states are responsible for memorizing the current length in the process of sequence generation; the result is also reported in [18] ." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-122", "text": "Shi et al. [17] apply linear regression to predict the time step during sequence modeling based on hidden states, and compute the R 2 correlation as a quantitative measure." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-123", "text": "Since a dialog system usually generates short replies (and thus not right length), we wonder whether such correlation exists in dialog and shuffled translation settings." 
}, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-124", "text": "We computed R 2 correlation as in [17] and show results in Table 2 ." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-125", "text": "6 We find that the correlation is low with dialog system." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-126", "text": "In translation, the correlation first decreases then increases as the shuffling rate becomes larger." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-127", "text": "A possible explanation is that the lengths of generated translation sentences are similar when shuffling rate is at a high level." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-128", "text": "----------------------------------" }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-129", "text": "**CONCLUSION AND DISCUSSION**" }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-130", "text": "This paper addressed the question why dialog systems generate short and meaningless replies." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-131", "text": "We managed to reproduce this phenomenon in two translation datasets, artificially mimicking the scenario that a source sentence can have multiple equally plausible target sentences." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-132", "text": "Admittedly, it is impossible to construct identical scenario as dialog by using translation datasets (otherwise the translation just becomes dialog)." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-133", "text": "However, the unaligned property is a salient difference, and by controlling this, we observe the desired phenomenon, demonstrating our conjecture." 
}, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-134", "text": "Our findings also explain why referring to additional information-including dialog context [19] , keywords [3] and knowledge bases [20]-helps dialog systems: the number of plausible target sentences decreases if the generation is conditioned on more information; this intuition is helpful for future development of text-based response generation in Seq2Seq dialog systems." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-135", "text": "Besides, our experiments suggest that Seq2Seq models are more suitable to applications where the source and target information is aligned." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-136", "text": "----------------------------------" }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-137", "text": "**ACKNOWLEDGMENTS**" }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-138", "text": "We thank all reviewers for their constructive comments." }, { "sent_id": "ca98f16fa3a118f83b16586bba04c8-C001-139", "text": "This research is supported by the National Key R&D Program under Grant No. 2017YFB1001804, and the National Natural Science Foundation of China under Grant No. 61832009." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "ca98f16fa3a118f83b16586bba04c8-C001-9" ], [ "ca98f16fa3a118f83b16586bba04c8-C001-12" ], [ "ca98f16fa3a118f83b16586bba04c8-C001-13" ], [ "ca98f16fa3a118f83b16586bba04c8-C001-14" ], [ "ca98f16fa3a118f83b16586bba04c8-C001-18" ], [ "ca98f16fa3a118f83b16586bba04c8-C001-38" ] ], "cite_sentences": [ "ca98f16fa3a118f83b16586bba04c8-C001-9", "ca98f16fa3a118f83b16586bba04c8-C001-12", "ca98f16fa3a118f83b16586bba04c8-C001-13", "ca98f16fa3a118f83b16586bba04c8-C001-14", "ca98f16fa3a118f83b16586bba04c8-C001-18", "ca98f16fa3a118f83b16586bba04c8-C001-38" ] }, "@SIM@": { "gold_contexts": [ [ "ca98f16fa3a118f83b16586bba04c8-C001-105" ], [ "ca98f16fa3a118f83b16586bba04c8-C001-134" ] ], "cite_sentences": [ "ca98f16fa3a118f83b16586bba04c8-C001-105", "ca98f16fa3a118f83b16586bba04c8-C001-134" ] }, "@USE@": { "gold_contexts": [ [ "ca98f16fa3a118f83b16586bba04c8-C001-105" ] ], "cite_sentences": [ "ca98f16fa3a118f83b16586bba04c8-C001-105" ] } } }, "ABC_649eff228a47b484d01872a980e58f_20": { "x": [ { "sent_id": "649eff228a47b484d01872a980e58f-C001-44", "text": "**KEYWORD SPOTTING (KWS) SYSTEM**" }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-92", "text": "----------------------------------" }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-93", "text": "**RECURRENT NEURAL NETWORK (RNN)**" }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-45", "text": "A typical KWS system consists of a feature extractor and a neural network based classifier as shown in Fig. 1 ." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-2", "text": "Keyword spotting (KWS) is a critical component for enabling speech based user interactions on smart devices." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-3", "text": "It requires real-time response and high accuracy for good user experience." 
}, { "sent_id": "649eff228a47b484d01872a980e58f-C001-4", "text": "Recently, neural networks have become an attractive choice for KWS architecture because of their superior accuracy compared to traditional speech processing algorithms." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-5", "text": "Due to its always-on nature, KWS application has highly constrained power budget and typically runs on tiny microcontrollers with limited memory and compute capability." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-6", "text": "The design of neural network architecture for KWS must consider these constraints." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-7", "text": "In this work, we perform neural network architecture evaluation and exploration for running KWS on resource-constrained microcontrollers." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-8", "text": "We train various neural network architectures for keyword spotting published in literature to compare their accuracy and memory/compute requirements." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-9", "text": "We show that it is possible to optimize these neural network architectures to fit within the memory and compute constraints of microcontrollers without sacrificing accuracy." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-10", "text": "We further explore the depthwise separable convolutional neural network (DS-CNN) and compare it against other neural network architectures." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-11", "text": "DS-CNN achieves an accuracy of 95.4%, which is~10% higher than the DNN model with similar number of parameters." 
}, { "sent_id": "649eff228a47b484d01872a980e58f-C001-12", "text": "----------------------------------" }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-13", "text": "**INTRODUCTION**" }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-14", "text": "Deep learning algorithms have evolved to a stage where they have surpassed human accuracies in a variety of cognitive tasks including image classification [1] and conversational speech recognition [2] ." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-15", "text": "Motivated by the recent breakthroughs in deep learning based speech recognition technologies, speech is increasingly becoming a more natural way to interact with consumer electronic devices, for example, Amazon Echo, Google Home and smart phones." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-16", "text": "However, always-on speech recognition is not energy-efficient and may also cause network congestion to transmit continuous audio stream from billions of these devices to the cloud." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-17", "text": "Furthermore, such a cloud based solution adds latency to the application, which hurts user experience." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-18", "text": "There are also privacy concerns when audio is continuously transmitted to the cloud." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-19", "text": "To mitigate these concerns, the devices first detect predefined keyword(s) such as \"Alexa\", \"Ok Google\", \"Hey Siri\", etc., which is commonly known as keyword spotting (KWS)." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-66", "text": "----------------------------------" }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-67", "text": "**MICROCONTROLLER SYSTEMS**" }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-20", "text": "Detection of keyword wakes up the device and then activates the full scale speech recognition either on device [3] or in the cloud." 
}, { "sent_id": "649eff228a47b484d01872a980e58f-C001-21", "text": "In some applications, the sequence of keywords can be used as voice commands to a smart device such as a voice-enabled light bulb." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-22", "text": "Since KWS system is always-on, it should have very low power consumption to maximize battery life." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-23", "text": "On the other hand, the KWS system should detect the keywords with high accuracy and low latency, for best user experience." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-24", "text": "These conflicting system requirements make KWS an active area of research ever since its inception over 50 years ago [4] ." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-25", "text": "Recently, with the renaissance of artificial neural networks in the form of deep learning algorithms, neural network (NN) based KWS has become very popular [5, 6, 7, 8] ." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-26", "text": "Low power consumption requirement for keyword spotting systems make microcontrollers an obvious choice for deploying KWS in an always-on system." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-27", "text": "Microcontrollers are low-cost energy-efficient processors that are ubiquitous in our everyday life with their presence in a variety of devices ranging from home appliances, automobiles and consumer electronics to wearables." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-28", "text": "However, deployment of neural network based KWS on microcontrollers comes with following challenges:" }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-29", "text": "Limited memory footprint: Typical microcontroller systems have only tens to few hundred KB of memory available." 
}, { "sent_id": "649eff228a47b484d01872a980e58f-C001-30", "text": "The entire neural network model, including input/output, weights and activations, has to fit within this small memory budget." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-31", "text": "----------------------------------" }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-32", "text": "**LIMITED COMPUTE RESOURCES:**" }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-33", "text": "Since KWS is always-on, the real-time requirement limits the total number of operations per neural network inference." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-34", "text": "These microcontroller resource constraints in conjunction with the high accuracy and low latency requirements of KWS call for a resource-constrained neural network architecture exploration to find lean neural network structures suitable for KWS, which is the primary focus of our work." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-35", "text": "The main contributions of this work are as follows:" }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-36", "text": "\u2022 We first train the popular KWS neural net models from the literature [5, 6, 7, 8] on Google speech commands dataset [9] and compare them in terms of accuracy, memory footprint and number of operations per inference." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-37", "text": "\u2022 In addition, we implement a new KWS model using depth-wise separable convolutions and point-wise convolutions, inspired by the success of resource-efficient MobileNet [10] in computer vision." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-38", "text": "This model outperforms the other prior models in all aspects of accuracy, model size and number of operations." 
}, { "sent_id": "649eff228a47b484d01872a980e58f-C001-39", "text": "\u2022 Finally, we perform resource-constrained neural network architecture exploration and present comprehensive comparison of different network architectures within a set of compute and memory constraints of typical microcontrollers." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-40", "text": "The code, model definitions and pretrained models are available at https://github.com/ARM-software/ML-KWS-for-MCU." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-41", "text": "----------------------------------" }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-42", "text": "**BACKGROUND**" }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-43", "text": "----------------------------------" }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-46", "text": "First, the input speech signal of length L is framed into overlapping frames of length l with a stride s, giving a total of T = L\u2212l s + 1 frames." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-47", "text": "From each frame, F speech features are extracted, generating a total of T \u00d7 F features for the entire input speech signal of length L. Log-mel filter bank energies (LFBE) and Mel-frequency cepstral coefficients (MFCC) are the commonly used human-engineered speech features in deep learning based speech-recognition, that are adapted from traditional speech processing techniques." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-48", "text": "Feature extraction using LFBE or MFCC involves translating the time-domain speech signal into a set of frequency-domain spectral coefficients, which enables dimensionality compression of the input signal." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-49", "text": "The extracted speech feature matrix is fed into a classifier module, which generates the probabilities for the output classes." 
}, { "sent_id": "649eff228a47b484d01872a980e58f-C001-50", "text": "In a real-world scenario where keywords need to be identified from a continuous audio stream, a posterior handling module averages the output probabilities of each output class over a period of time, improving the overall confidence of the prediction." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-51", "text": "Traditional speech recognition technologies for KWS use Hidden Markov Models (HMMs) and Viterbi decoding [11, 12] ." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-52", "text": "While these approaches achieve reasonable accuracies, they are hard to train and are computationally expensive during inference." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-53", "text": "Other techniques explored for KWS include discriminative models adopting a large-margin problem formulation [13] or recurrent neural networks (RNN) [14] ." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-54", "text": "Although these methods significantly outperform HMM based KWS in terms of accuracy, they suffer from large detection latency." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-55", "text": "KWS models using deep neural networks (DNN) based on fully-connected layers with rectified linear unit (ReLU) activation functions are introduced in [5] , which outperforms the HMM models with a very small detection latency." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-56", "text": "Furthermore, low-rank approximation techniques are used to compress the DNN model weights achieving similar accuracy with less hardware resources [15, 16] ." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-57", "text": "The main drawback of DNNs is that they ignore the local temporal and spectral correlation in the input speech features." 
}, { "sent_id": "649eff228a47b484d01872a980e58f-C001-58", "text": "In order to exploit these correlations, different variants of convolutional neural network (CNN) based KWS are explored in [6] , which demonstrate higher accuracy than DNNs." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-59", "text": "The drawback of CNNs in modeling time varying signals (e.g. speech) is that they ignore long term temporal dependencies." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-60", "text": "Combining the strengths of CNNs and RNNs, convolutional recurrent neural network based KWS is investigated in [7] and demonstrate the robustness of the model to noise." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-61", "text": "While all the prior KWS neural networks are trained with cross entropy loss function, a max-pooling based loss function for training KWS model with long short-term memory (LSTM) is proposed in [8] , which achieves better accuracy than the DNNs and LSTMs trained with cross entropy loss." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-62", "text": "Although many neural network models for KWS are presented in literature, it is difficult to make a fair comparison between them as they are all trained and evaluated on different proprietary datasets (e.g. \"TalkType\" dataset in [7] , \"Alexa\" dataset in [8] , etc.) with different input speech features and audio duration." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-63", "text": "Also, the primary focus of prior research has been to maximize the accuracy with a small memory footprint model, without explicit constraints of underlying hardware, such as limits on number of operations per inference." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-64", "text": "In contrast, this work is more hardware-centric and targeted towards neural network architectures that maximize accuracy on microcontroller devices." 
}, { "sent_id": "649eff228a47b484d01872a980e58f-C001-65", "text": "The constraints on memory and compute significantly limit the neural network parameters and the number of operations." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-68", "text": "A typical microcontroller system consists of a processor core, an on-chip SRAM block and an on-chip embedded flash." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-69", "text": "Table 1 shows some commercially available microcontroller development boards with Arm Cortex-M processor cores with different compute capabilities running at different frequencies (16 MHz to 216 MHz), consisting of a wide range of on-chip memory (SRAM: 8 KB to 320 KB; Flash: 128 KB to 1 MB)." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-70", "text": "The program binary, usually preloaded into the non-volatile flash, is loaded into the SRAM at startup and the processor runs the program with the SRAM as the main data memory." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-71", "text": "Therefore, the size of the SRAM limits the size of memory that the software can use." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-72", "text": "Other than the memory footprint, performance (i.e., operations per second) is also a constraining factor for running neural networks on microcontrollers." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-73", "text": "Most microcontrollers are designed for embedded applications with low cost and high energy-efficiency as the primary targets, and do not have high throughput for compute-intensive workloads such as neural networks." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-74", "text": "Some microcontrollers have integrated DSP instructions that can be useful for running neural network workloads." 
}, { "sent_id": "649eff228a47b484d01872a980e58f-C001-75", "text": "For example, Cortex-M4 and Cortex-M7 have integrated SIMD and MAC instructions that can be used to accelerate low-precision computation in neural networks." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-76", "text": "----------------------------------" }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-77", "text": "**NEURAL NETWORK ARCHITECTURES FOR KWS**" }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-78", "text": "This section gives an overview of all the different neural network architectures explored in this work including the deep neural network (DNN), convolutional neural network (CNN), recurrent neural network (RNN), convolutional recurrent neural network (CRNN) and depthwise separable convolutional neural network (DS-CNN)." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-79", "text": "----------------------------------" }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-80", "text": "**DEEP NEURAL NETWORK (DNN)**" }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-81", "text": "The DNN is a standard feed-forward neural network made of a stack of fully-connected layers and non-linear activation layers." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-82", "text": "The input to the DNN is the flattened feature matrix, which feeds into a stack of d hidden fully-connected layers each with n neurons." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-83", "text": "Typically, each fully-connected layer is followed by a rectified linear unit (ReLU) based activation function." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-84", "text": "At the output is a linear layer followed by a softmax layer generating the output probabilities of the k keywords, which are used for further posterior handling." 
}, { "sent_id": "649eff228a47b484d01872a980e58f-C001-85", "text": "----------------------------------" }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-86", "text": "**CONVOLUTIONAL NEURAL NETWORK (CNN)**" }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-87", "text": "One main drawback of DNN based KWS is that they fail to efficiently model the local temporal and spectral correlation in the speech features." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-88", "text": "CNNs exploit this correlation by treating the input time-domain and spectral-domain features as an image and performing 2-D convolution operations over it." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-89", "text": "The convolution layers are typically followed by batch normalization [17] , ReLU based activation functions and optional max/average pooling layers, which reduce the dimensionality of the features." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-90", "text": "During inference, the parameters of batch normalization can be folded into the weights of the convolution layers." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-91", "text": "In some cases, a linear low-rank layer, which is simply a fully-connected layer without non-linear activation, is added in between the convolution layers and dense layers for the purpose of reducing parameters and accelerating training [18, 19] ." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-94", "text": "RNNs have shown superior performance in many sequence modeling tasks, especially speech recognition [20] , language modeling [21] , translation [22] , etc." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-95", "text": "RNNs not only exploit the temporal relation between the input signal, but also capture the long-term dependencies using \"gating\" mechanism." 
}, { "sent_id": "649eff228a47b484d01872a980e58f-C001-96", "text": "Unlike CNNs where input features are treated as 2-D image, RNNs operate for T time steps, where at each time step t the corresponding spectral feature vector f t \u2208 R F concatenated with the previous time step output h t\u22121 is used as input to the RNN." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-97", "text": "Figure 2 shows the model architecture of a typical RNN model, where the RNN cell could be an LSTM cell [23, 24] or a gated recurrent unit (GRU) cell [25, 26] ." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-98", "text": "Since the weights are reused across all the T time steps, the RNN models tend to have less number of parameters compared to the CNNs." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-99", "text": "Similar to batch normalization in CNNs, research show that applying layer normalization can be beneficial for training RNNs [27] , in which the hidden states are normalized during each time step." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-100", "text": "----------------------------------" }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-101", "text": "**CONVOLUTIONAL RECURRENT NEURAL NETWORK (CRNN)**" }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-102", "text": "Convolution recurrent neural network [7] is a hybrid of CNN and RNN, which takes advantages of both." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-103", "text": "It exploits the local temporal/spatial correlation using convolution layers and global temporal dependencies in the speech features using recurrent layers." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-104", "text": "As shown in Fig. 3 , a CRNN model starts with a convolution layer, followed by an RNN to encode the signal and a dense fully-connected layer to map the information." 
}, { "sent_id": "649eff228a47b484d01872a980e58f-C001-105", "text": "Here, the recurrent layer is bi-directional [28] and has multiple stages, increasing the network learning capability." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-106", "text": "Gated recurrent units (GRU) [25] is used as the base cell for recurrent layers, as it uses fewer parameters than LSTMs and gave better convergence in our experiments." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-107", "text": "----------------------------------" }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-108", "text": "**DEPTHWISE SEPARABLE CONVOLUTIONAL NEURAL NETWORK (DS-CNN)**" }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-109", "text": "Recently, depthwise separable convolution has been proposed as an efficient alternative to the standard 3-D convolution operation [29] and has been used to achieve compact network architectures in the area of computer vision [10, 30] ." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-110", "text": "DS-CNN first convolves each channel in the input feature map with a separate 2-D filter and then uses pointwise convolutions (i.e. 1x1) to combine the outputs in the depth dimension." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-111", "text": "By decomposing the standard 3-D convolutions into 2-D convolutions followed by 1-D convolutions, depthwise separable convolutions are more efficient both in number of parameters and operations, which makes deeper and wider architecture possible even in the resource-constrained microcontroller devices." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-112", "text": "In this work, we adopt a depthwise separable CNN based on the implementation of MobileNet [10] as shown in Fig. 4 ." 
}, { "sent_id": "649eff228a47b484d01872a980e58f-C001-113", "text": "An average pooling followed by a fully-connected layer is used at the end to provide global interaction and reduce the total number of parameters in the final layer." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-114", "text": "----------------------------------" }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-115", "text": "**EXPERIMENTS AND RESULTS**" }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-116", "text": "We use the Google speech commands dataset [9] for the neural network architecture exploration experiments." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-117", "text": "The dataset consists of 65K 1-second long audio clips of 30 keywords, by thousands of different people, with each clip consisting of only one keyword." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-118", "text": "The neural network models are trained to classify the incoming audio into one of the 10 keywords -\"Yes\", \"No\", \"Up\", \"Down\", \"Left\", \"Right\", \"On\", \"Off\", \"Stop\", \"Go\", along with \"silence\" (i.e. no word spoken) and \"unknown\" word, which is the remaining 20 keywords from the dataset." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-119", "text": "The dataset is split into training, validation and test set in the ratio of 80:10:10 while making sure that the audio clips from the same person stays in the same set." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-120", "text": "All models are trained in Google Tensorflow framework [31] using the standard cross-entropy loss and Adam optimizer [32] ." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-121", "text": "With a batch size of 100, the models are trained for 20K iterations with initial learning rate of 5 \u00d7 10 \u22124 , and reduced to 10 \u22124 after first 10K iterations." 
}, { "sent_id": "649eff228a47b484d01872a980e58f-C001-122", "text": "The training data is augmented with background noise and random time shift of up to 100ms." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-123", "text": "The trained models are evaluated based on the classification accuracy on the test set." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-124", "text": "Table 2 summarizes the accuracy, memory requirement and operations per inference for the network architectures for KWS from literature [5, 6, 7, 8] trained on Google speech commands dataset [9] ." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-125", "text": "For all the models, we use 40 MFCC features extracted from a speech frame of length 40ms with a stride of 20ms, which gives 1960 (49\u00d740) features for 1 second of audio." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-126", "text": "The accuracy shown in the table is the accuracy on test set." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-127", "text": "The memory shown in the table assumes 8-bit weights and activations, which is sufficient to achieve same accuracy as that from a full-precision network." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-128", "text": "----------------------------------" }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-129", "text": "**TRAINING RESULTS**" }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-130", "text": "----------------------------------" }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-131", "text": "**NN ARCHITECTURE ACCURACY MEMORY**" }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-132", "text": "Operations DNN [5] 84.3% 288 KB 0.57 MOps CNN-1 [6] 90.7% 556 KB 76.02 MOps CNN-2 [6] 84.6% 149 KB 1.46 MOps LSTM [8] 88.8% 26 KB 2.06 MOps CRNN [7] 87.8% 298 KB 5.85 MOps Table 2 : Neural network model accuracy." 
}, { "sent_id": "649eff228a47b484d01872a980e58f-C001-133", "text": "CNN-1, CNN-2 are (cnn-trad-fpool3, cnn-one-fstride4) architectures from [6] ." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-134", "text": "Also, we assume that the memory for activations is reused across different layers and hence memory requirement for the activations uses the maximum of two consecutive layers." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-135", "text": "The operations in the table counts the total number of multiplications and additions in the matrix-multiplication operations in each layer in the network, which is representative of the execution time of the entire network." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-136", "text": "The models from the existing literature are optimized for different datasets and use different memory/compute resources, hence a direct comparison of accuracy is unfair." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-137", "text": "That said, these results still provide useful insights on the different neural network architectures for KWS:" }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-138", "text": "\u2022 Although DNNs do not achieve the best accuracy and tend to be memory intensive, they have less number of operations/inference and hence suit well to systems that have limited compute capability (e.g. systems running at low operating frequencies for energy-efficiency)." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-139", "text": "\u2022 CNNs, on the other hand, achieve higher accuracy than DNNs but at the cost of large number of operations and/or memory requirement." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-140", "text": "\u2022 LSTMs and CRNNs achieve a balance between memory and operations while still achieving good accuracy." 
}, { "sent_id": "649eff228a47b484d01872a980e58f-C001-141", "text": "----------------------------------" }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-142", "text": "**CLASSIFYING NEURAL NETWORKS FOR KWS BASED ON RESOURCE REQUIREMENTS**" }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-143", "text": "As discussed in section 2.2, memory footprint and execution time are the two important considerations in being able to run keyword spotting on microcontrollers." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-144", "text": "These should be considered when designing and optimizing neural networks for running keyword spotting." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-145", "text": "Based on typical microcontroller system configurations (as described in Table 1 ), we derive three sets of constraints for the neural networks in Table 3 , targeting small, medium and large microcontroller systems." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-146", "text": "Both memory and compute limit are derived with assumptions that some amount of resources will be allocated for running other tasks such as OS, I/O, network communication, etc." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-147", "text": "The operations per inference limit assumes that the system is running 10 inferences per second." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-148", "text": "Table 3 : Neural network (NN) classes for KWS models considered in this work, assuming 10 inferences per second and 8-bit weights/activations." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-149", "text": "Figure 5 shows the number of operations per inference, memory requirement and test accuracy of neural network models from prior work [5, 6, 7, 8] trained on Google speech commands dataset overlayed with the memory and compute bounding boxes for the neural network classes from section 4." 
}, { "sent_id": "649eff228a47b484d01872a980e58f-C001-150", "text": "----------------------------------" }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-151", "text": "**RESOURCE CONSTRAINED NEURAL NETWORK ARCHITECTURE EXPLORATION**" }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-152", "text": "2." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-153", "text": "An ideal model would have high accuracy, small memory footprint and lower number of computations, i.e., close to the origin in Fig. 5 ." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-154", "text": "Apart from the LSTM model, the other models are too memory/compute resource heavy and do not fit into the bounding box S with 80KB/6MOps memory/compute limits." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-155", "text": "CNN-2, CRNN and DNN models fit in the M and L bounding boxes, but have lower accuracies as compared to the CNN-1 model, which does not fit in any of the boxes at all." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-156", "text": "The rest of this section discusses different hyperparameters of the feature extraction and neural network architectures that can be tuned in order to bring the models close to the origin and still achieve high accuracy." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-157", "text": "[5, 6, 7, 8] trained on the speech commands dataset [9] ." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-158", "text": "As shown in Fig. 1 , from each input speech signal, T \u00d7 F features are extracted and the number of these features impact the model size, number of operations and accuracy." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-159", "text": "The key parameters in the feature extraction step that impact the model size, number of operations and accuracy are (1) number of MFCC features per frame (F) and (2) the frame stride (S)." 
}, { "sent_id": "649eff228a47b484d01872a980e58f-C001-160", "text": "The number of MFCC features per audio frame (F) impacts the number of weights in fully-connected and recurrent layers, but not in convolution layers as weights are reused in convolution layers." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-161", "text": "The frame stride (S), which determines the number of frames to be processed per inference (i.e. T), impacts the number of weights in fully-connected layers but not in recurrent and convolution layers because of the weight reuse." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-162", "text": "Both F and S impact the number of operations per inference." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-163", "text": "An efficient model would maximize accuracy using small T \u00d7 F , i.e., small F and/or large S." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-164", "text": "The neural network architectures and the corresponding hyperparameters explored in this work are summarized in Table 4 ." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-165", "text": "The LSTM model mentioned in the table includes peephole connections and output projection layer similar to that in [8] , whereas basic LSTM model does not include those." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-166", "text": "CRNN uses one convolution layer followed by multi-layer GRU for the recurrent layers." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-167", "text": "We also use batch normalization for convolutional/fully-connected layers and layer normalization for recurrent layers." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-168", "text": "During inference, the parameters of batch normalization and layer normalization can be folded into the weights of the convolution or recurrent layers and hence these layers are ignored in memory/Ops computation." 
}, { "sent_id": "649eff228a47b484d01872a980e58f-C001-169", "text": "Table 4 : Neural network hyperparameters used in this study." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-170", "text": "We iteratively perform exhaustive search of feature extraction hyperparameters and NN model hyperparameters followed by manual selection to narrow down the search space." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-171", "text": "The final best performing models for each neural network architecture along with their memory requirements and operations are summarized in Table 5 and Fig. 6 ." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-172", "text": "The hyperparameters of these networks are summarized in Appendix A. From the results we can see that DNNs are memory-bound and achieve less accuracies and saturate at~87% even when the model is scaled up." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-173", "text": "CNNs achieve better accuracies than DNN, but are limited by the weights in the final fully-connected layers." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-174", "text": "RNN models (i.e. Basic LSTM, LSTM and GRU) achieve better accuracies than CNNs and yield even smaller models with less Ops in some cases, demonstrating that exploiting temporal dependencies maximizes accuracy within the same resource budget." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-175", "text": "CRNN models, which combine the best properties of CNNs and RNNs, achieve better accuracies than both CNNs and RNNs, even with less Ops." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-176", "text": "CRNN architecture also scales up well when more memory/compute resources are available." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-177", "text": "DS-CNN achieves the best accuracies and demonstrate good scalability owing to their deeper architecture enabled by depthwise separable convolution layers, which are less compute/memory intensive." 
}, { "sent_id": "649eff228a47b484d01872a980e58f-C001-178", "text": "Table 5 : Summary of best neural networks from the hyperparameter search." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-179", "text": "The memory required for storing the 8-bit weights and activations is shown in the table." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-180", "text": "----------------------------------" }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-181", "text": "**NEURAL NETWORK QUANTIZATION**" }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-182", "text": "Neural networks are typically trained with floating point weights and activations." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-183", "text": "Previous research [33, 34, 35] Figure 6 : Memory vs. Ops/inference of the best models described in Table 5 ." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-184", "text": "operations in typical microcontrollers, which is another reason for executing quantized model during deployment." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-185", "text": "In this work, we use the quantization flow described in [34] The weights are quantized to 8-bits progressively one layer at a time by finding the optimal N for each layer that minimizes the loss in accuracy because of quantization." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-186", "text": "After all the weights are quantized, the activations are also quantized in a similar way to find the appropriate fractional length N for each layer." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-187", "text": "Table 6 shows the accuracies of representative 8-bit networks quantized using this method and compared with those of the original full-precision networks." 
}, { "sent_id": "649eff228a47b484d01872a980e58f-C001-188", "text": "The table shows that the accuracy of the quantized network is either same or marginally better than the full-precision network, possibly due to better regularization because of quantization." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-189", "text": "We believe that the same conclusion will hold for the other neural network models explored in this work." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-190", "text": "Table 6 : Accuracy comparison of representative 8-bit quantized networks with full-precision networks." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-191", "text": "----------------------------------" }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-192", "text": "**KWS DEPLOYMENT ON MICROCONTROLLER**" }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-193", "text": "We deployed the KWS application on Cortex-M7 based STM32F746G-DISCO development board." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-194", "text": "A picture of the board performing KWS is shown in Fig. 7" }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-195", "text": "----------------------------------" }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-196", "text": "**CONCLUSIONS**" }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-197", "text": "Hardware optimized neural network architecture is key to get efficient results on memory and compute constrained microcontrollers." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-198", "text": "We trained various neural network architectures for keyword spotting published in literature on Google speech commands dataset to compare their accuracy and memory requirements vs. operations per inference, from the perspective of deployment on microcontroller systems." 
}, { "sent_id": "649eff228a47b484d01872a980e58f-C001-199", "text": "We quantized representative trained 32-bit floating-point KWS models into 8-bit fixed-point versions demonstrating that these models can easily be quantized for deployment without any loss in accuracy, even without retraining." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-200", "text": "Furthermore, we trained a new KWS model using depthwise separable convolution layers, inspired from MobileNet." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-201", "text": "Based on typical microcontroller systems, we derived three sets of memory/compute constraints for the neural networks and performed resource constrained neural network architecture exploration to find the best networks achieving maximum accuracy within these constraints." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-202", "text": "In all three sets of memory/compute constraints, depthwise separable CNN model (DS-CNN) achieves the best accuracies of 94.4%, 94.9% and 95.4% compared to the other model architectures within those constraints, which shows good scalability of the DS-CNN model." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-203", "text": "The code, model definitions and pretrained models are available at https://github.com/ARMsoftware/ML-KWS-for-MCU." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-204", "text": "A Appendix: Neural Network Hyperparameters Table 7 shows the summary of the hyperparameters of the best neural networks described in Table 5 , along with their memory, number of operations and accuracy on training, validation and test sets." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-205", "text": "All the models use 10 MFCC features, with a frame length (L) of 40ms, where as the frame stride (S) is shown in the table." 
}, { "sent_id": "649eff228a47b484d01872a980e58f-C001-206", "text": "F C stands for fully-connected layer and the number in the parentheses shows the number of neurons in the fully-connected layer." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-207", "text": "C stands for convolution layer and the numbers in parentheses correspond to the number of convolution features, kernel sizes in time and frequency axes, strides in time and frequency axes." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-208", "text": "Although not shown, all the convolution and fully connected layers have a ReLU as activation function." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-209", "text": "L stands for low-rank linear layer with the number of elements shown in parentheses." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-210", "text": "The number in the parentheses for LST M and GRU models correspond to the number of memory elements in those models." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-211", "text": "DSC is depthwise separable convolution layer (DSConv in Fig. 4 ) and the number in the parentheses correspond to the number of features, kernel size and stride in both time and frequency axes." }, { "sent_id": "649eff228a47b484d01872a980e58f-C001-212", "text": "Table 7 : Summary of hyperparameters of the best models described in Table 5 ." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "649eff228a47b484d01872a980e58f-C001-25" ], [ "649eff228a47b484d01872a980e58f-C001-60" ], [ "649eff228a47b484d01872a980e58f-C001-62" ], [ "649eff228a47b484d01872a980e58f-C001-102" ], [ "649eff228a47b484d01872a980e58f-C001-149" ], [ "649eff228a47b484d01872a980e58f-C001-157" ] ], "cite_sentences": [ "649eff228a47b484d01872a980e58f-C001-25", "649eff228a47b484d01872a980e58f-C001-60", "649eff228a47b484d01872a980e58f-C001-62", "649eff228a47b484d01872a980e58f-C001-102", "649eff228a47b484d01872a980e58f-C001-149", "649eff228a47b484d01872a980e58f-C001-157" ] }, "@SIM@": { "gold_contexts": [ [ "649eff228a47b484d01872a980e58f-C001-36" ], [ "649eff228a47b484d01872a980e58f-C001-124" ] ], "cite_sentences": [ "649eff228a47b484d01872a980e58f-C001-36", "649eff228a47b484d01872a980e58f-C001-124" ] }, "@USE@": { "gold_contexts": [ [ "649eff228a47b484d01872a980e58f-C001-36" ], [ "649eff228a47b484d01872a980e58f-C001-124" ] ], "cite_sentences": [ "649eff228a47b484d01872a980e58f-C001-36", "649eff228a47b484d01872a980e58f-C001-124" ] } } }, "ABC_48e3715c55fcc188367dcfdc26c05f_20": { "x": [ { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-2", "text": "This short paper summarizes a faithful implementation of the categorical framework of Coecke et al. (2010) , the aim of which is to provide compositionality in distributional models of lexical semantics." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-3", "text": "Based on Frobenius Algebras, our method enable us to (1) have a unifying meaning space for phrases and sentences of different structure and word vectors, (2) stay faithful to the linguistic types suggested by the underlying type-logic, and (3) perform the concrete computations in lower dimensions by reducing the space complexity." 
}, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-4", "text": "We experiment with two different parameters of the model and apply the setting to a verb disambiguation and a term/definition classification task with promising results." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-5", "text": "----------------------------------" }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-7", "text": "Distributional models of meaning work by building co-occurrence vectors for every word in a corpus depending on its context, following Firth's intuition that \"you should know a word by the company it keeps\" (Firth, 1957) ." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-8", "text": "In such models, the co-occurrence vector of each word is built by fixing a set of words as the basis of a vector space and a window of size k, then counting how many times the word in question has co-occurred with each base in that window." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-9", "text": "This approach has been proved useful in many natural language tasks (Curran, 2004; Sch\u00fctze, 1998; Landauer and Dumais, 1997; Manning et al., 2008) , but until now it lacks any means of compositionality that would allow the combination of two word vectors into a new one following some grammar rule." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-10", "text": "In fact, compositional abilities of distributional models have been subject of much discussion and research in recent years." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-11", "text": "For example, Mitchell and Lapata (2008) present results for intransitive sentences, Erk and Pad\u00f3 (2004) work on transitive verb phrases, while Baroni and Zamparelli (2010) and Guevara (2010) provide comprehensive analyses of adjective-noun phrases." 
}, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-12", "text": "Despite the experimental strength of these approaches, most of them only deal with phrases and sentences of two words." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-13", "text": "On the other hand, Socher et al. (2010 Socher et al. ( , 2011 use recursive neural networks in order to produce vectors for sentences of arbitrary length with good results." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-14", "text": "However, their method is somehow detached from the formal semantics view, paying little attention to the grammatical relations that hold between the words." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-15", "text": "Following a different path, Coecke et al. (2010) provide a solution that offers compositional abilities to distributional models while at the same time avoids all the above pitfalls." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-16", "text": "Based on the abstract setting of category theory, the authors develop a generic mathematical framework whereby the meaning of a sentence of any length and structure can, in principle, be turned into a vector, following the rules of the grammar." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-17", "text": "Implementations of this model for transitive and intransitive sentences have been provided by Grefenstette and Sadrzadeh (2011a,b) ." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-18", "text": "However, although their method outperforms the multiplicative and additive models of Mitchell and Lapata (2008) on simple transitive sentences, it has a non-scalability problem." 
}, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-19", "text": "Specifically, the concrete structures used in the actual computations are not faithful to the linguistic types of the underlying type-logic, hence the model does not generalize to more complex phrases and sentences where a relational structure can be found nested in another relational structure." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-20", "text": "Furthermore, the vectors obtained for sentences of different grammatical structures live in different vector spaces: sentences with intransitive verbs live in the same space as context vectors, denoted by N , sentences with transitive verbs in N 2 = N \u2297 N , and sentences with ditransitive verbs in N 3 ." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-21", "text": "A direct consequence of this instantiation is that one cannot compare meanings of sentences unless they have the same grammatical structure." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-22", "text": "In this work we outline a solution to the above problems by instantiating the sentence space to be the same space as one in which context vectors live, namely we stipulate that S = N ." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-23", "text": "As a result of this decision, we become able to compare lexical meanings of words with compositional meanings of phrases and sentences." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-24", "text": "We show how the theoretical computations of Coecke et al. (2010) instantiate in this concrete setting, and how the Frobenius Algebras, originating from group theory (Frobenius, 1903) and later extended to vector spaces (Coecke et al., 2008) , allow us to not only represent meanings of words with complex roles, such as verbs, adjectives, and prepositions, in an intuitive relational manner, but also to stay faithful to their original linguistic types." 
}, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-25", "text": "Equally as importantly, this model enables us to realize the concrete computations in lower dimensional spaces, thus reduce the space complexity of the implementation." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-26", "text": "We experiment in two different tasks with promising results: First, we repeat the disambiguation experiment of Grefenstette and Sadrzadeh (2011a) for transitive verbs." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-27", "text": "Then we proceed to a novel task: We use The Oxford Junior Dictionary (Sansome et al., 2000) , Oxford Concise School Dictionary (Hawkins et al., 2004) , and WordNet in order to derive a set of term/definition pairs, measure the similarity of each term with every definition, and use this measurement to classify the definitions to specific terms." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-28", "text": "----------------------------------" }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-29", "text": "**AN OVERVIEW OF THE CATEGORICAL MODEL**" }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-30", "text": "Using the abstract framework of category theory, Coecke et al. (2010) equip the distributional models of meaning with compositionality in a way that every grammatical reduction is in one-to-one correspondence with a linear map defining mathematical manipulations between vector spaces." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-31", "text": "In other words, given a sentence s = w 1 w 2 \u00b7 \u00b7 \u00b7 w n there exists a syntax-driven linear map f from the context vectors of the individual words to a vector for the whole sentence:" }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-32", "text": "allowing us to compare the synonymy of two different sentences as if they were words, by constructing their vectors and measuring the distance between them." 
}, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-33", "text": "This result is based on the fact that the base type-logic of the framework, a pregroup grammar (Lambek, 2008) , shares the same abstract structure with vector spaces, that of a compact closed category." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-34", "text": "If P is the free pregroup generated by such a grammar and FVect the category of finite dimensional vector spaces (with linear maps) over , it is possible then for one to work on the product category FVect \u00d7 P, pairing each grammatical type \u03b1 \u2208 P with a vector space V to an object (V, \u03b1)." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-35", "text": "More importantly, the morphisms of this product category will be pairs of linear maps and pregroup reductions between these objects of the following form:" }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-36", "text": "leading from the grammatical type p and its corresponding vector space V to type q and the vector space W ." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-37", "text": "Pregroups A pregroup grammar (Lambek, 2008 ) is a type-logical grammar built on the rigorous mathematical basis of pregroups, i.e. partially ordered monoids with unit 1, whose each element p has a left adjoint p l and a right adjoint p r , that is" }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-38", "text": "Each element p represents an atomic type of the grammar, for example n for noun phrases and s for sentences." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-39", "text": "Atomic types and their adjoints can be combined to form compound types, e.g. n r sn l for a transitive verb." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-40", "text": "The rules of the grammar are prescribed by the mathematical properties of pregroups, and specifically by the inequalities in (3) above." 
}, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-41", "text": "A partial order in the context of a logic denotes implication, so from (3) we derive:" }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-42", "text": "These cancellation rules to the unit object are called \u03b5 maps, and linear-algebraically correspond to the inner product between the involved context vectors." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-43", "text": "It also holds that 1p = p = p1." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-44", "text": "We will use the case of a transitive sentence as an example." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-45", "text": "Here, the subject and the object have the type n, whereas the type of the verb is n r sn l , denoting that the verb looks for a noun at its left and a noun at its right in order to return an entity of type s (a sentence)." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-46", "text": "The derivation has the form n(n r sn l )n = (nn r )s(n l n) \u2192 1s1 = s, and corresponds to the morphism" }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-47", "text": "For details of pregroup grammars and its type dictionary we refer the reader to Lambek (2008) ." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-48", "text": "For more information about the compositional-distributional framework, see Coecke et al. (2010) ; Coecke and Paquette (2011) provide a good introduction to category theory." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-49", "text": "----------------------------------" }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-50", "text": "**INSTANTIATING THE SENTENCE SPACE**" }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-51", "text": "The categorical framework of Coecke et al. (2010) is abstract in the sense that it does not prescribe concrete guidelines for constructing tensors for meanings of words with special roles such as verbs or adjectives." 
}, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-52", "text": "Even more importantly, it does not specify the exact form of the sentence space S, leaving these details as open questions for the implementor." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-53", "text": "----------------------------------" }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-54", "text": "**STIPULATING S = N \u2297 N**" }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-55", "text": "The work of Grefenstette and Sadrzadeh (2011a) was the first large-scale practical implementation of this framework for intransitive and transitive sentences, and thus a first step towards providing some concrete answers to these questions." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-56", "text": "Following ideas from formal semantics that verbs are actually relations, the authors argue that the distributional meaning of a verb is a weighted relation representing the extent according to which the verb is related to its subjects and objects." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-57", "text": "In vector spaces, these relations are represented by linear maps, equivalent to matrices for the case of binary relations and to tensors for relations of arity n. Hence transitive verbs can be represented by matrices created by structurally mixing and summing up all the contexts (subject and object pairs) in which the verb appears." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-58", "text": "More precisely, we have:" }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-59", "text": "where \u2212\u2192 sb j i and \u2212 \u2212 \u2192 ob j i are the context vectors of subject and object, respectively, and i iterates over all contexts in which the specific verb occurs." 
}, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-60", "text": "This method (which we refer to as \"relational\") is also extended to other relational words, such as adjectives whose vectors are constructed as the sum of all the nouns that the adjective modifies." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-61", "text": "One important design decision was that the meaning of a sentence was represented as a rank-n tensor, where n is the number of arguments for the head word of the sentence." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-62", "text": "In other words, an intransitive sentence lives in a space S = N , a transitive one in S = N \u2297 N and so on." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-63", "text": "Although this approach delivers good results for the disambiguation task on which it was tested, it inherently suffers from two important problems, the most obvious of which is that there is no direct way to compare sentences of different structures, say an intransitive one with a transitive one." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-64", "text": "Furthermore, the representation of the meaning of a sentence or a phrase as a rank-n tensor with n > 1 limits the ability of the model to scale up to larger fragments of the language, where more complex sentences with nested or recursive structure can occur, since the concrete objects used in the actual mathematical operations are not any more faithful to the linguistic types." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-65", "text": "Finally, the above design decision means that the space complexity of the algorithm is \u0398(d n ), where d is the cardinality of the vector space and n the number of arguments for the head word." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-66", "text": "This could create certain space problems for complex sentences." 
}, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-67", "text": "----------------------------------" }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-68", "text": "**STIPULATING S = N**" }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-69", "text": "The work presented in this paper stems from the observation that the theory does not impose a special choice of sentence space, in particular it does not impose that tensors for S should have ranks greater than 1." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-70", "text": "Hence we stipulate that S = N and show how this instantiation works by performing the computations on the example transitive sentence 'dogs chase cats'." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-71", "text": "Take \u2212 \u2212 \u2192 do g and \u2212\u2192 cat be the context vectors for the subject and the object, both living in N as prescribed by their types." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-72", "text": "As any vector, these can be expressed as weighted sums of their basis vectors, that is," }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-73", "text": ". By putting everything together, the meaning of the sentence is calculated as follows; this result lives in N , since it is a weighted sum over \u2212 \u2192 n j :" }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-74", "text": "An important consequence of our design decision is that it enables us to reduce the space complexity of the implementation from \u0398(d n ) (Grefenstette and Sadrzadeh, 2011a) to \u0398(d), making the problem much more tractable." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-75", "text": "What remains to be solved is a theoretical issue, that in practice the meaning of relational words such as 'chase' as calculated by Equation 5 is a matrix living in N 2 -however, the mathematical framework above prescribes that it should be a rank-3 tensor in N 3 ." 
}, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-76", "text": "The necessary expansions are achieved by using Frobenius algebraic operations, for which the following sections first provide the mathematical definitions and then a linguistic justification." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-77", "text": "----------------------------------" }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-78", "text": "**FROBENIUS ALGEBRAS**" }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-79", "text": "Frobenius algebras were originally introduced by F. G. Frobenius in group theory (Frobenius, 1903) ." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-80", "text": "Since then they have found applications in other fields of mathematics and physics, e.g. see Kock (2003) ." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-81", "text": "Carboni and Walters (1987) provided a general categorical definition, according to which a Frobenius algebra over a monoidal category ( , \u2297, I) is a tuple (F, \u03c3, \u03b9, \u00b5, \u03b6) consisting of an associative coalgebra (\u03c3, \u03b9) and an associative algebra (\u00b5, \u03b6), respectively given by the following types:" }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-82", "text": "The above should satisfy the Frobenius condition, stating that" }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-83", "text": "For the case of the category FVect over a field I (for us I = ), these morphisms become linear maps that form a Frobenius algebra over a vector space N with a fixed set of bases { \u2212 \u2192 n i } i , explicitly given as follows (Coecke et al., 2008) :" }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-84", "text": "Since the bases of our vector spaces are orthonormal, these maps moreover form a special commutative Frobenius algebra, meaning that they correspond to a uniform copying and uncopying of the basis vectors." 
}, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-85", "text": "When applied to v \u2208 N , the copying map \u03c3 recovers the bases of v and the unit map \u03b9 their corresponding weights." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-86", "text": "Together, they faithfully encode tensors of a lower dimensional N into a higher dimensional tensor space N \u2297 N ." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-87", "text": "In linear algebraic terms, \u03c3(v) is a diagonal tensor whose diagonal elements consist of weights of v. The uncopying map \u00b5, on the other hand, loses some information when encoding a higher dimensional tensor into a lower dimensional space." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-88", "text": "For w \u2208 N \u2297 N , we have that \u00b5(w) is a tensor consisting only of the diagonal elements of w, hence losing the information encoded in the non-diagonal part." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-89", "text": "----------------------------------" }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-90", "text": "**FROBENIUS PARAMETERS IN DISTRIBUTIONAL LINGUISTIC PRACTICE**" }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-91", "text": "It would be instructive to see how our decision for taking S = N and the Frobenius constructions affect the meaning of a sentence in practice." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-92", "text": "We use a pictorial calculus that allows convenient graphical representations of the derivations." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-93", "text": "In this notation, each tensor is represented by a triangle, and its rank can be determined by the outgoing wires." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-94", "text": "The tensor product is depicted as juxtaposition of triangles." 
}, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-95", "text": "We also remind to the reader that the relational method for constructing a tensor for the meaning of a verb (Grefenstette and Sadrzadeh, 2011a) provides us with a matrix in N 2 ." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-96", "text": "In order to embed this in N 3 , as required by the categorical framework, we apply a \u03c3 : N 2 \u2192 N 3 map to it." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-97", "text": "Now the Frobenius operation \u03c3 gives us some options for the form of the resulting tensor, which are presented below:" }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-98", "text": "CPSBJ The first option is to copy the \"row\" dimension of the matrix which, according to Equation 5, corresponds to the subject." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-99", "text": "In Part (a) below we see how \u03c3 transforms the verb this way." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-100", "text": "Once substituted in Equation 1, we obtain the interaction in Part (b)." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-101", "text": "Linear algebraically, the \u03c3 map transforms the matrix of the verb in the way depicted on the right:" }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-102", "text": "CPOBJ Our other option is to copy the \"column\" dimension of the matrix, i.e. the object dimension (the corresponding \u03c3 map again on the right):" }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-103", "text": "Geometrically, we can think of these two options as different ways for \"diagonally\" placing a plane into a cube." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-104", "text": "The diagrams provide us a direct way to simplify the calculations involved, since they suggest a closed form formula for each case." 
}, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-105", "text": "Taking as an example the diagram of the copy-subject method, we see that: (a) the object interacts with the verb; (b) the result of this interaction serves as input for the \u03c3 function; (c) one wire of the output of \u03c3 interacts with the object, while the other branch delivers the result." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-106", "text": "In terms of linear algebra, this corresponds to the computation \u03c3(ver b \u00d7 \u2212\u2192 o b j) \u00d7 \u2212 \u2192 s b j (where \u00d7 denotes matrix multiplication), which is equivalent to the following:" }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-107", "text": "where the symbol denotes component-wise multiplication and \u00d7 is matrix multiplication." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-108", "text": "Similarly, the meaning of a transitive sentence for the copy-object case is given by:" }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-109", "text": "We should bring to the reader's attention the fact that equipped with the above closed forms we do not need to create or manipulate rank-3 tensors at any point of the computation, something that would cause high computational overhead." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-110", "text": "Furthermore, note that the nesting problem of Grefenstette and Sadrzadeh (2011a) does not arise here, since the linguistic and concrete types are the same." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-111", "text": "----------------------------------" }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-112", "text": "**EXPERIMENTS**" }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-113", "text": "We train our vectors from a lemmatised version of the British National Corpus (BNC), following closely the parameters of the setting described in Mitchell and Lapata (2008) , later used by Grefenstette and Sadrzadeh (2011a) ." 
}, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-114", "text": "Specifically, we use the 2000 most frequent words as the basis for our vector space; this single space will serve as a semantic space for both nouns and sentences." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-115", "text": "The weights of the vectors are set to the ratio of the probability of the context word given the target word to the probability of the context word overall." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-116", "text": "As our similarity measure we use the cosine distance." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-117", "text": "----------------------------------" }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-118", "text": "**DISAMBIGUATION**" }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-119", "text": "We first test our models against the disambiguation task for transitive sentences described in Grefenstette and Sadrzadeh (2011a) ." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-120", "text": "The goal is to assess how well a model can discriminate between the different senses of an ambiguous verb, given the context (subject and object) of that verb." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-121", "text": "The entries of this dataset consist of a target verb, a subject, an object, and a landmark verb used for the comparison." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-122", "text": "One such entry for example is, \"write, pupil, name, spell\"." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-123", "text": "A good compositional model should be able to understand that the sentence \"pupil write name\" is closer to the sentence \"pupil spell name\" than, for example, to \"pupil publish name\"." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-124", "text": "On the other hand, given the context \"writer, book\" these results should be reversed." 
}, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-125", "text": "The dataset contains 200 such entries with verbs from CELEX, hence 400 sentences." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-126", "text": "The evaluation of this experiment is performed by calculating Spearman's \u03c1 correlation against the judgements of 25 human evaluators." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-127", "text": "As our baselines we use an additive (ADDTV) and a multiplicative (MULTP) model, where the meaning of a sentence is computed by adding and point-wise multiplying, respectively, the context vectors of its words." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-128", "text": "The results are shown in Table 1 ." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-129", "text": "The most successful S = N model for this task is the copyobject model, which is performing really close to the original relational model of Grefenstette and Sadrzadeh (2011a) , with the difference to be statistically insignificant." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-130", "text": "This is a promising result, since it suggests that the lower-dimensional new model performs similarly with the richer structure of the old model for transitive sentences, while at the same time allows generalisation to even more complex sentences 1 ." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-131", "text": "More importantly, note that the categorical models are the only ones that respect the word order and grammatical structure of sentences; a feature completely dismissed in the simple multiplicative model." 
}, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-132", "text": "----------------------------------" }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-133", "text": "**UPPER-BOUND**" }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-134", "text": "----------------------------------" }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-135", "text": "**DEFINITION CLASSIFICATION**" }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-136", "text": "The ability of reliably comparing the meaning of single words with larger textual fragments, e.g. phrases or even sentences, can be an invaluable tool for many challenging NLP tasks, such as definition classification, paraphrasing, sentiment analysis, or even the simple everyday search on the Internet." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-137", "text": "In this task we examine the extent to which our models can correctly match a number of terms (single words) with a number of definitions (phrases)." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-138", "text": "To our knowledge, this is the first time a compositional distributional model is tested for its ability to match words with phrases." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-139", "text": "Our dataset consists of 112 terms (72 nouns and 40 verbs) and their main definitions, extracted from The Oxford Junior Dictionary (Sansome et al., 2000) ." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-140", "text": "For each term, and in order to get a richer dataset, we added two more definitions that expressed the same or an" }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-141", "text": "1 The original relational model of Grefenstette and Sadrzadeh (2011a) with S = N 2 , provided a \u03c1 of 0.21." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-142", "text": "When computed with our program with the exact same parameters (without embedding them in the S = N model), we obtained a \u03c1 of 0.195." 
}, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-143", "text": "The differences between both of these and our best model are statistically insignificant." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-144", "text": "In Grefenstette and Sadrzadeh (2011b) , a direct non-relational model was used to compute verb matrices; this provided a \u03c1 of 0.28." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-145", "text": "However, as explained by the authors themselves, this method is not general and for instance cannot be used for intransitive verbs." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-146", "text": "alternative meaning, using the entries from the Oxford Concise School Dictionary (Hawkins et al., 2004) or by paraphrasing with the WordNet synonyms of the words in the definitions." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-147", "text": "So in total we obtained three definitions per term." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-148", "text": "In all cases a definition for a noun-term is a noun phrase, whereas the definitions for the verb-terms consist of verb phrases." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-149", "text": "For the latter case, we construct our verb vectors by summing over all context vectors of objects with which the verb appears in the corpus in a verb phrase; that is, we use ver b = i \u2212 \u2212 \u2192 ob j i ." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-150", "text": "A sample of the dataset is shown in Table 2 ; the complete dataset will be made available online." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-151", "text": "We approach the evaluation problem as a classification task, where the terms have the role of the classes." 
}, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-152", "text": "Specifically, we calculate the distance between each definition and every term in the dataset, and the definition is assigned to the term that gives the higher similarity." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-153", "text": "We evaluate the results by calculating accuracy (Table 3) ." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-154", "text": "Our model is referred to as the copy-object model (CPOBJ), and is compared with the multiplicative and additive models." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-155", "text": "The copy-object and multiplicative models perform similarly, with the former to have slightly better performance for nouns and the latter to be slightly better for verbs." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-156", "text": "We speculate that this lesser ability of our model in verbs terms is due to data sparsity, since the cases of pure verb phrases (from which we build the verb vectors for this task) are limited in BNC and not every verb of our dataset had a well-populated vector representation." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-157", "text": "----------------------------------" }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-158", "text": "**CPOBJ MULTP ADDT CONT**" }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-159", "text": "----------------------------------" }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-160", "text": "**CONCLUSION**" }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-161", "text": "The contribution of this work is that it provides a faithful implementation of the general categorical compositional distributional model of Coecke et al. 
(2010), with three important advantages compared to previous attempts: (1) it makes it possible to compare phrases and sentences with different structures, up to the extreme case of comparing a sentence with a single word; (2) it follows the types suggested by the type-logical approaches, hence enabling us to build concrete vectors for nested relational phrases; and (3) it drastically reduces the space complexity of previous implementations." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-162", "text": "We achieved this using operations of Frobenius algebras over vector spaces to expand and shrink the dimensions of the concrete tensors involved in the actual computations." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-163", "text": "This theoretical result stands on its own, since it provides a framework that can be used in conjunction with various compositional-distributional settings and techniques." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-164", "text": "For example, one could populate the relational matrices using machine-learning techniques, as Baroni and Zamparelli (2010) did for adjective-noun pairs, and then apply the categorical framework for the composition as described in this paper." }, { "sent_id": "48e3715c55fcc188367dcfdc26c05f-C001-165", "text": "As a proof of concept for the viability of our method, we presented experimental results in two tasks involving disambiguation and definition classification." 
} ], "y": { "@SIM@": { "gold_contexts": [ [ "48e3715c55fcc188367dcfdc26c05f-C001-26" ], [ "48e3715c55fcc188367dcfdc26c05f-C001-95" ], [ "48e3715c55fcc188367dcfdc26c05f-C001-113" ], [ "48e3715c55fcc188367dcfdc26c05f-C001-119" ] ], "cite_sentences": [ "48e3715c55fcc188367dcfdc26c05f-C001-26", "48e3715c55fcc188367dcfdc26c05f-C001-95", "48e3715c55fcc188367dcfdc26c05f-C001-113", "48e3715c55fcc188367dcfdc26c05f-C001-119" ] }, "@USE@": { "gold_contexts": [ [ "48e3715c55fcc188367dcfdc26c05f-C001-26" ], [ "48e3715c55fcc188367dcfdc26c05f-C001-113" ], [ "48e3715c55fcc188367dcfdc26c05f-C001-119" ] ], "cite_sentences": [ "48e3715c55fcc188367dcfdc26c05f-C001-26", "48e3715c55fcc188367dcfdc26c05f-C001-113", "48e3715c55fcc188367dcfdc26c05f-C001-119" ] }, "@BACK@": { "gold_contexts": [ [ "48e3715c55fcc188367dcfdc26c05f-C001-55" ] ], "cite_sentences": [ "48e3715c55fcc188367dcfdc26c05f-C001-55" ] }, "@DIF@": { "gold_contexts": [ [ "48e3715c55fcc188367dcfdc26c05f-C001-74" ], [ "48e3715c55fcc188367dcfdc26c05f-C001-110" ], [ "48e3715c55fcc188367dcfdc26c05f-C001-129" ], [ "48e3715c55fcc188367dcfdc26c05f-C001-141", "48e3715c55fcc188367dcfdc26c05f-C001-142" ] ], "cite_sentences": [ "48e3715c55fcc188367dcfdc26c05f-C001-74", "48e3715c55fcc188367dcfdc26c05f-C001-110", "48e3715c55fcc188367dcfdc26c05f-C001-129", "48e3715c55fcc188367dcfdc26c05f-C001-141" ] }, "@EXT@": { "gold_contexts": [ [ "48e3715c55fcc188367dcfdc26c05f-C001-74" ], [ "48e3715c55fcc188367dcfdc26c05f-C001-129" ], [ "48e3715c55fcc188367dcfdc26c05f-C001-141", "48e3715c55fcc188367dcfdc26c05f-C001-142" ] ], "cite_sentences": [ "48e3715c55fcc188367dcfdc26c05f-C001-74", "48e3715c55fcc188367dcfdc26c05f-C001-129", "48e3715c55fcc188367dcfdc26c05f-C001-141" ] } } }, "ABC_c4e0e12362bd7d505f6887abad78d4_20": { "x": [ { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-66", "text": "----------------------------------" }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-67", "text": "**MICROCONTROLLER SYSTEMS**" }, { "sent_id": 
"c4e0e12362bd7d505f6887abad78d4-C001-68", "text": "A typical microcontroller system consists of a processor core, an on-chip SRAM block and an on-chip embedded flash." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-93", "text": "**RECURRENT NEURAL NETWORK (RNN)**" }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-2", "text": "Keyword spotting (KWS) is a critical component for enabling speech based user interactions on smart devices." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-3", "text": "It requires real-time response and high accuracy for good user experience." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-4", "text": "Recently, neural networks have become an attractive choice for KWS architecture because of their superior accuracy compared to traditional speech processing algorithms." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-5", "text": "Due to its always-on nature, KWS application has highly constrained power budget and typically runs on tiny microcontrollers with limited memory and compute capability." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-6", "text": "The design of neural network architecture for KWS must consider these constraints." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-7", "text": "In this work, we perform neural network architecture evaluation and exploration for running KWS on resource-constrained microcontrollers." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-8", "text": "We train various neural network architectures for keyword spotting published in literature to compare their accuracy and memory/compute requirements." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-9", "text": "We show that it is possible to optimize these neural network architectures to fit within the memory and compute constraints of microcontrollers without sacrificing accuracy." 
}, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-10", "text": "We further explore the depthwise separable convolutional neural network (DS-CNN) and compare it against other neural network architectures." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-11", "text": "DS-CNN achieves an accuracy of 95.4%, which is~10% higher than the DNN model with similar number of parameters." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-12", "text": "----------------------------------" }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-13", "text": "**INTRODUCTION**" }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-14", "text": "Deep learning algorithms have evolved to a stage where they have surpassed human accuracies in a variety of cognitive tasks including image classification [1] and conversational speech recognition [2] ." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-15", "text": "Motivated by the recent breakthroughs in deep learning based speech recognition technologies, speech is increasingly becoming a more natural way to interact with consumer electronic devices, for example, Amazon Echo, Google Home and smart phones." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-61", "text": "While all the prior KWS neural networks are trained with cross entropy loss function, a max-pooling based loss function for training KWS model with long short-term memory (LSTM) is proposed in [8] , which achieves better accuracy than the DNNs and LSTMs trained with cross entropy loss." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-62", "text": "Although many neural network models for KWS are presented in literature, it is difficult to make a fair comparison between them as they are all trained and evaluated on different proprietary datasets (e.g. \"TalkType\" dataset in [7] , \"Alexa\" dataset in [8] , etc.) with different input speech features and audio duration." 
}, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-63", "text": "Also, the primary focus of prior research has been to maximize the accuracy with a small memory footprint model, without explicit constraints of underlying hardware, such as limits on number of operations per inference." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-64", "text": "In contrast, this work is more hardware-centric and targeted towards neural network architectures that maximize accuracy on microcontroller devices." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-65", "text": "The constraints on memory and compute significantly limit the neural network parameters and the number of operations." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-16", "text": "However, always-on speech recognition is not energy-efficient and may also cause network congestion to transmit continuous audio stream from billions of these devices to the cloud." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-17", "text": "Furthermore, such a cloud based solution adds latency to the application, which hurts user experience." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-18", "text": "There are also privacy concerns when audio is continuously transmitted to the cloud." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-19", "text": "To mitigate these concerns, the devices first detect predefined keyword(s) such as \"Alexa\", \"Ok Google\", \"Hey Siri\", etc., which is commonly known as keyword spotting (KWS)." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-20", "text": "Detection of keyword wakes up the device and then activates the full scale speech recognition either on device [3] or in the cloud." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-21", "text": "In some applications, the sequence of keywords can be used as voice commands to a smart device such as a voice-enabled light bulb." 
}, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-22", "text": "Since KWS system is always-on, it should have very low power consumption to maximize battery life." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-23", "text": "On the other hand, the KWS system should detect the keywords with high accuracy and low latency, for best user experience." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-24", "text": "These conflicting system requirements make KWS an active area of research ever since its inception over 50 years ago [4] ." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-25", "text": "Recently, with the renaissance of artificial neural networks in the form of deep learning algorithms, neural network (NN) based KWS has become very popular [5, 6, 7, 8] ." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-26", "text": "Low power consumption requirement for keyword spotting systems make microcontrollers an obvious choice for deploying KWS in an always-on system." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-27", "text": "Microcontrollers are low-cost energy-efficient processors that are ubiquitous in our everyday life with their presence in a variety of devices ranging from home appliances, automobiles and consumer electronics to wearables." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-28", "text": "However, deployment of neural network based KWS on microcontrollers comes with following challenges:" }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-29", "text": "Limited memory footprint: Typical microcontroller systems have only tens to few hundred KB of memory available." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-30", "text": "The entire neural network model, including input/output, weights and activations, has to fit within this small memory budget." 
}, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-31", "text": "----------------------------------" }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-32", "text": "**LIMITED COMPUTE RESOURCES:**" }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-33", "text": "Since KWS is always-on, the real-time requirement limits the total number of operations per neural network inference." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-34", "text": "These microcontroller resource constraints in conjunction with the high accuracy and low latency requirements of KWS call for a resource-constrained neural network architecture exploration to find lean neural network structures suitable for KWS, which is the primary focus of our work." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-35", "text": "The main contributions of this work are as follows:" }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-36", "text": "\u2022 We first train the popular KWS neural net models from the literature [5, 6, 7, 8] on Google speech commands dataset [9] and compare them in terms of accuracy, memory footprint and number of operations per inference." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-37", "text": "\u2022 In addition, we implement a new KWS model using depth-wise separable convolutions and point-wise convolutions, inspired by the success of resource-efficient MobileNet [10] in computer vision." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-38", "text": "This model outperforms the other prior models in all aspects of accuracy, model size and number of operations." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-39", "text": "\u2022 Finally, we perform resource-constrained neural network architecture exploration and present comprehensive comparison of different network architectures within a set of compute and memory constraints of typical microcontrollers." 
}, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-40", "text": "The code, model definitions and pretrained models are available at https://github.com/ARM-software/ML-KWS-for-MCU." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-41", "text": "----------------------------------" }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-42", "text": "**BACKGROUND**" }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-43", "text": "----------------------------------" }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-44", "text": "**KEYWORD SPOTTING (KWS) SYSTEM**" }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-45", "text": "A typical KWS system consists of a feature extractor and a neural network based classifier as shown in Fig. 1 ." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-46", "text": "First, the input speech signal of length L is framed into overlapping frames of length l with a stride s, giving a total of T = L\u2212l s + 1 frames." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-47", "text": "From each frame, F speech features are extracted, generating a total of T \u00d7 F features for the entire input speech signal of length L. Log-mel filter bank energies (LFBE) and Mel-frequency cepstral coefficients (MFCC) are the commonly used human-engineered speech features in deep learning based speech-recognition, that are adapted from traditional speech processing techniques." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-48", "text": "Feature extraction using LFBE or MFCC involves translating the time-domain speech signal into a set of frequency-domain spectral coefficients, which enables dimensionality compression of the input signal." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-49", "text": "The extracted speech feature matrix is fed into a classifier module, which generates the probabilities for the output classes." 
}, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-50", "text": "In a real-world scenario where keywords need to be identified from a continuous audio stream, a posterior handling module averages the output probabilities of each output class over a period of time, improving the overall confidence of the prediction." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-51", "text": "Traditional speech recognition technologies for KWS use Hidden Markov Models (HMMs) and Viterbi decoding [11, 12] ." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-52", "text": "While these approaches achieve reasonable accuracies, they are hard to train and are computationally expensive during inference." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-53", "text": "Other techniques explored for KWS include discriminative models adopting a large-margin problem formulation [13] or recurrent neural networks (RNN) [14] ." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-54", "text": "Although these methods significantly outperform HMM based KWS in terms of accuracy, they suffer from large detection latency." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-55", "text": "KWS models using deep neural networks (DNN) based on fully-connected layers with rectified linear unit (ReLU) activation functions are introduced in [5] , which outperforms the HMM models with a very small detection latency." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-56", "text": "Furthermore, low-rank approximation techniques are used to compress the DNN model weights achieving similar accuracy with less hardware resources [15, 16] ." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-57", "text": "The main drawback of DNNs is that they ignore the local temporal and spectral correlation in the input speech features." 
}, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-58", "text": "In order to exploit these correlations, different variants of convolutional neural network (CNN) based KWS are explored in [6] , which demonstrate higher accuracy than DNNs." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-59", "text": "The drawback of CNNs in modeling time varying signals (e.g. speech) is that they ignore long term temporal dependencies." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-60", "text": "Combining the strengths of CNNs and RNNs, convolutional recurrent neural network based KWS is investigated in [7] and demonstrate the robustness of the model to noise." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-69", "text": "Table 1 shows some commercially available microcontroller development boards with Arm Cortex-M processor cores with different compute capabilities running at different frequencies (16 MHz to 216 MHz), consisting of a wide range of on-chip memory (SRAM: 8 KB to 320 KB; Flash: 128 KB to 1 MB)." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-70", "text": "The program binary, usually preloaded into the non-volatile flash, is loaded into the SRAM at startup and the processor runs the program with the SRAM as the main data memory." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-71", "text": "Therefore, the size of the SRAM limits the size of memory that the software can use." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-72", "text": "Other than the memory footprint, performance (i.e., operations per second) is also a constraining factor for running neural networks on microcontrollers." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-73", "text": "Most microcontrollers are designed for embedded applications with low cost and high energy-efficiency as the primary targets, and do not have high throughput for compute-intensive workloads such as neural networks." 
}, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-74", "text": "Some microcontrollers have integrated DSP instructions that can be useful for running neural network workloads." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-75", "text": "For example, Cortex-M4 and Cortex-M7 have integrated SIMD and MAC instructions that can be used to accelerate low-precision computation in neural networks." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-76", "text": "----------------------------------" }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-77", "text": "**NEURAL NETWORK ARCHITECTURES FOR KWS**" }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-78", "text": "This section gives an overview of all the different neural network architectures explored in this work including the deep neural network (DNN), convolutional neural network (CNN), recurrent neural network (RNN), convolutional recurrent neural network (CRNN) and depthwise separable convolutional neural network (DS-CNN)." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-79", "text": "----------------------------------" }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-80", "text": "**DEEP NEURAL NETWORK (DNN)**" }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-81", "text": "The DNN is a standard feed-forward neural network made of a stack of fully-connected layers and non-linear activation layers." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-82", "text": "The input to the DNN is the flattened feature matrix, which feeds into a stack of d hidden fully-connected layers each with n neurons." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-83", "text": "Typically, each fully-connected layer is followed by a rectified linear unit (ReLU) based activation function." 
}, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-84", "text": "At the output is a linear layer followed by a softmax layer generating the output probabilities of the k keywords, which are used for further posterior handling." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-85", "text": "----------------------------------" }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-86", "text": "**CONVOLUTIONAL NEURAL NETWORK (CNN)**" }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-87", "text": "One main drawback of DNN based KWS is that they fail to efficiently model the local temporal and spectral correlation in the speech features." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-88", "text": "CNNs exploit this correlation by treating the input time-domain and spectral-domain features as an image and performing 2-D convolution operations over it." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-89", "text": "The convolution layers are typically followed by batch normalization [17] , ReLU based activation functions and optional max/average pooling layers, which reduce the dimensionality of the features." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-90", "text": "During inference, the parameters of batch normalization can be folded into the weights of the convolution layers." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-91", "text": "In some cases, a linear low-rank layer, which is simply a fully-connected layer without non-linear activation, is added in between the convolution layers and dense layers for the purpose of reducing parameters and accelerating training [18, 19] ." 
}, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-92", "text": "----------------------------------" }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-118", "text": "The neural network models are trained to classify the incoming audio into one of the 10 keywords -\"Yes\", \"No\", \"Up\", \"Down\", \"Left\", \"Right\", \"On\", \"Off\", \"Stop\", \"Go\", along with \"silence\" (i.e. no word spoken) and \"unknown\" word, which is the remaining 20 keywords from the dataset." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-119", "text": "The dataset is split into training, validation and test set in the ratio of 80:10:10 while making sure that the audio clips from the same person stays in the same set." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-120", "text": "All models are trained in Google Tensorflow framework [31] using the standard cross-entropy loss and Adam optimizer [32] ." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-121", "text": "With a batch size of 100, the models are trained for 20K iterations with initial learning rate of 5 \u00d7 10 \u22124 , and reduced to 10 \u22124 after first 10K iterations." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-122", "text": "The training data is augmented with background noise and random time shift of up to 100ms." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-123", "text": "The trained models are evaluated based on the classification accuracy on the test set." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-124", "text": "Table 2 summarizes the accuracy, memory requirement and operations per inference for the network architectures for KWS from literature [5, 6, 7, 8] trained on Google speech commands dataset [9] ." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-125", "text": "For all the models, we use 40 MFCC features extracted from a speech frame of length 40ms with a stride of 20ms, which gives 1960 (49\u00d740) features for 1 second of audio." 
}, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-126", "text": "The accuracy shown in the table is the accuracy on test set." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-127", "text": "The memory shown in the table assumes 8-bit weights and activations, which is sufficient to achieve same accuracy as that from a full-precision network." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-128", "text": "----------------------------------" }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-129", "text": "**TRAINING RESULTS**" }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-130", "text": "----------------------------------" }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-131", "text": "**NN ARCHITECTURE ACCURACY MEMORY**" }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-132", "text": "Operations DNN [5] 84.3% 288 KB 0.57 MOps CNN-1 [6] 90.7% 556 KB 76.02 MOps CNN-2 [6] 84.6% 149 KB 1.46 MOps LSTM [8] 88.8% 26 KB 2.06 MOps CRNN [7] 87.8% 298 KB 5.85 MOps Table 2 : Neural network model accuracy." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-133", "text": "CNN-1, CNN-2 are (cnn-trad-fpool3, cnn-one-fstride4) architectures from [6] ." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-134", "text": "Also, we assume that the memory for activations is reused across different layers and hence memory requirement for the activations uses the maximum of two consecutive layers." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-135", "text": "The operations in the table counts the total number of multiplications and additions in the matrix-multiplication operations in each layer in the network, which is representative of the execution time of the entire network." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-136", "text": "The models from the existing literature are optimized for different datasets and use different memory/compute resources, hence a direct comparison of accuracy is unfair." 
}, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-137", "text": "That said, these results still provide useful insights on the different neural network architectures for KWS:" }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-138", "text": "\u2022 Although DNNs do not achieve the best accuracy and tend to be memory intensive, they have less number of operations/inference and hence suit well to systems that have limited compute capability (e.g. systems running at low operating frequencies for energy-efficiency)." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-139", "text": "\u2022 CNNs, on the other hand, achieve higher accuracy than DNNs but at the cost of large number of operations and/or memory requirement." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-140", "text": "\u2022 LSTMs and CRNNs achieve a balance between memory and operations while still achieving good accuracy." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-141", "text": "----------------------------------" }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-142", "text": "**CLASSIFYING NEURAL NETWORKS FOR KWS BASED ON RESOURCE REQUIREMENTS**" }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-143", "text": "As discussed in section 2.2, memory footprint and execution time are the two important considerations in being able to run keyword spotting on microcontrollers." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-144", "text": "These should be considered when designing and optimizing neural networks for running keyword spotting." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-145", "text": "Based on typical microcontroller system configurations (as described in Table 1 ), we derive three sets of constraints for the neural networks in Table 3 , targeting small, medium and large microcontroller systems." 
}, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-146", "text": "Both memory and compute limit are derived with assumptions that some amount of resources will be allocated for running other tasks such as OS, I/O, network communication, etc." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-147", "text": "The operations per inference limit assumes that the system is running 10 inferences per second." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-148", "text": "Table 3 : Neural network (NN) classes for KWS models considered in this work, assuming 10 inferences per second and 8-bit weights/activations." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-149", "text": "Figure 5 shows the number of operations per inference, memory requirement and test accuracy of neural network models from prior work [5, 6, 7, 8] trained on Google speech commands dataset overlayed with the memory and compute bounding boxes for the neural network classes from section 4." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-150", "text": "----------------------------------" }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-151", "text": "**RESOURCE CONSTRAINED NEURAL NETWORK ARCHITECTURE EXPLORATION**" }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-152", "text": "2." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-153", "text": "An ideal model would have high accuracy, small memory footprint and lower number of computations, i.e., close to the origin in Fig. 5 ." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-154", "text": "Apart from the LSTM model, the other models are too memory/compute resource heavy and do not fit into the bounding box S with 80KB/6MOps memory/compute limits." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-155", "text": "CNN-2, CRNN and DNN models fit in the M and L bounding boxes, but have lower accuracies as compared to the CNN-1 model, which does not fit in any of the boxes at all." 
}, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-156", "text": "The rest of this section discusses different hyperparameters of the feature extraction and neural network architectures that can be tuned in order to bring the models close to the origin and still achieve high accuracy." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-157", "text": "[5, 6, 7, 8] trained on the speech commands dataset [9] ." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-158", "text": "As shown in Fig. 1 , from each input speech signal, T \u00d7 F features are extracted and the number of these features impact the model size, number of operations and accuracy." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-159", "text": "The key parameters in the feature extraction step that impact the model size, number of operations and accuracy are (1) number of MFCC features per frame (F) and (2) the frame stride (S)." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-160", "text": "The number of MFCC features per audio frame (F) impacts the number of weights in fully-connected and recurrent layers, but not in convolution layers as weights are reused in convolution layers." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-161", "text": "The frame stride (S), which determines the number of frames to be processed per inference (i.e. T), impacts the number of weights in fully-connected layers but not in recurrent and convolution layers because of the weight reuse." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-162", "text": "Both F and S impact the number of operations per inference." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-163", "text": "An efficient model would maximize accuracy using small T \u00d7 F , i.e., small F and/or large S." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-164", "text": "The neural network architectures and the corresponding hyperparameters explored in this work are summarized in Table 4 ." 
}, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-165", "text": "The LSTM model mentioned in the table includes peephole connections and output projection layer similar to that in [8] , whereas basic LSTM model does not include those." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-166", "text": "CRNN uses one convolution layer followed by multi-layer GRU for the recurrent layers." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-167", "text": "We also use batch normalization for convolutional/fully-connected layers and layer normalization for recurrent layers." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-168", "text": "During inference, the parameters of batch normalization and layer normalization can be folded into the weights of the convolution or recurrent layers and hence these layers are ignored in memory/Ops computation." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-169", "text": "Table 4 : Neural network hyperparameters used in this study." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-170", "text": "We iteratively perform exhaustive search of feature extraction hyperparameters and NN model hyperparameters followed by manual selection to narrow down the search space." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-171", "text": "The final best performing models for each neural network architecture along with their memory requirements and operations are summarized in Table 5 and Fig. 6 ." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-172", "text": "The hyperparameters of these networks are summarized in Appendix A. From the results we can see that DNNs are memory-bound and achieve less accuracies and saturate at~87% even when the model is scaled up." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-173", "text": "CNNs achieve better accuracies than DNN, but are limited by the weights in the final fully-connected layers." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-174", "text": "RNN models (i.e. 
Basic LSTM, LSTM and GRU) achieve better accuracies than CNNs and yield even smaller models with less Ops in some cases, demonstrating that exploiting temporal dependencies maximizes accuracy within the same resource budget." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-175", "text": "CRNN models, which combine the best properties of CNNs and RNNs, achieve better accuracies than both CNNs and RNNs, even with less Ops." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-176", "text": "CRNN architecture also scales up well when more memory/compute resources are available." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-177", "text": "DS-CNN achieves the best accuracies and demonstrate good scalability owing to their deeper architecture enabled by depthwise separable convolution layers, which are less compute/memory intensive." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-178", "text": "Table 5 : Summary of best neural networks from the hyperparameter search." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-179", "text": "The memory required for storing the 8-bit weights and activations is shown in the table." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-180", "text": "----------------------------------" }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-181", "text": "**NEURAL NETWORK QUANTIZATION**" }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-182", "text": "Neural networks are typically trained with floating point weights and activations." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-183", "text": "Previous research [33, 34, 35] Figure 6 : Memory vs. Ops/inference of the best models described in Table 5 ." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-184", "text": "operations in typical microcontrollers, which is another reason for executing quantized model during deployment." 
}, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-185", "text": "In this work, we use the quantization flow described in [34] The weights are quantized to 8-bits progressively one layer at a time by finding the optimal N for each layer that minimizes the loss in accuracy because of quantization." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-186", "text": "After all the weights are quantized, the activations are also quantized in a similar way to find the appropriate fractional length N for each layer." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-187", "text": "Table 6 shows the accuracies of representative 8-bit networks quantized using this method and compared with those of the original full-precision networks." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-188", "text": "The table shows that the accuracy of the quantized network is either same or marginally better than the full-precision network, possibly due to better regularization because of quantization." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-189", "text": "We believe that the same conclusion will hold for the other neural network models explored in this work." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-190", "text": "Table 6 : Accuracy comparison of representative 8-bit quantized networks with full-precision networks." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-191", "text": "----------------------------------" }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-192", "text": "**KWS DEPLOYMENT ON MICROCONTROLLER**" }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-193", "text": "We deployed the KWS application on Cortex-M7 based STM32F746G-DISCO development board." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-194", "text": "A picture of the board performing KWS is shown in Fig. 
7" }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-195", "text": "----------------------------------" }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-196", "text": "**CONCLUSIONS**" }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-197", "text": "Hardware optimized neural network architecture is key to get efficient results on memory and compute constrained microcontrollers." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-198", "text": "We trained various neural network architectures for keyword spotting published in literature on Google speech commands dataset to compare their accuracy and memory requirements vs. operations per inference, from the perspective of deployment on microcontroller systems." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-199", "text": "We quantized representative trained 32-bit floating-point KWS models into 8-bit fixed-point versions demonstrating that these models can easily be quantized for deployment without any loss in accuracy, even without retraining." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-200", "text": "Furthermore, we trained a new KWS model using depthwise separable convolution layers, inspired from MobileNet." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-201", "text": "Based on typical microcontroller systems, we derived three sets of memory/compute constraints for the neural networks and performed resource constrained neural network architecture exploration to find the best networks achieving maximum accuracy within these constraints." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-202", "text": "In all three sets of memory/compute constraints, depthwise separable CNN model (DS-CNN) achieves the best accuracies of 94.4%, 94.9% and 95.4% compared to the other model architectures within those constraints, which shows good scalability of the DS-CNN model." 
}, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-203", "text": "The code, model definitions and pretrained models are available at https://github.com/ARMsoftware/ML-KWS-for-MCU." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-204", "text": "A Appendix: Neural Network Hyperparameters Table 7 shows the summary of the hyperparameters of the best neural networks described in Table 5 , along with their memory, number of operations and accuracy on training, validation and test sets." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-205", "text": "All the models use 10 MFCC features, with a frame length (L) of 40ms, where as the frame stride (S) is shown in the table." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-206", "text": "F C stands for fully-connected layer and the number in the parentheses shows the number of neurons in the fully-connected layer." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-207", "text": "C stands for convolution layer and the numbers in parentheses correspond to the number of convolution features, kernel sizes in time and frequency axes, strides in time and frequency axes." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-208", "text": "Although not shown, all the convolution and fully connected layers have a ReLU as activation function." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-209", "text": "L stands for low-rank linear layer with the number of elements shown in parentheses." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-210", "text": "The number in the parentheses for LST M and GRU models correspond to the number of memory elements in those models." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-211", "text": "DSC is depthwise separable convolution layer (DSConv in Fig. 4 ) and the number in the parentheses correspond to the number of features, kernel size and stride in both time and frequency axes." 
}, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-212", "text": "Table 7 : Summary of hyperparameters of the best models described in Table 5 ." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-94", "text": "RNNs have shown superior performance in many sequence modeling tasks, especially speech recognition [20] , language modeling [21] , translation [22] , etc." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-95", "text": "RNNs not only exploit the temporal relation between the input signal, but also capture the long-term dependencies using \"gating\" mechanism." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-96", "text": "Unlike CNNs where input features are treated as 2-D image, RNNs operate for T time steps, where at each time step t the corresponding spectral feature vector f t \u2208 R F concatenated with the previous time step output h t\u22121 is used as input to the RNN." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-97", "text": "Figure 2 shows the model architecture of a typical RNN model, where the RNN cell could be an LSTM cell [23, 24] or a gated recurrent unit (GRU) cell [25, 26] ." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-98", "text": "Since the weights are reused across all the T time steps, the RNN models tend to have less number of parameters compared to the CNNs." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-99", "text": "Similar to batch normalization in CNNs, research show that applying layer normalization can be beneficial for training RNNs [27] , in which the hidden states are normalized during each time step." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-100", "text": "----------------------------------" }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-101", "text": "**CONVOLUTIONAL RECURRENT NEURAL NETWORK (CRNN)**" }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-102", "text": "Convolution recurrent neural network [7] is a hybrid of CNN and RNN, which takes advantages of both." 
}, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-103", "text": "It exploits the local temporal/spatial correlation using convolution layers and global temporal dependencies in the speech features using recurrent layers." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-104", "text": "As shown in Fig. 3 , a CRNN model starts with a convolution layer, followed by an RNN to encode the signal and a dense fully-connected layer to map the information." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-105", "text": "Here, the recurrent layer is bi-directional [28] and has multiple stages, increasing the network learning capability." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-106", "text": "Gated recurrent units (GRU) [25] is used as the base cell for recurrent layers, as it uses fewer parameters than LSTMs and gave better convergence in our experiments." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-107", "text": "----------------------------------" }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-108", "text": "**DEPTHWISE SEPARABLE CONVOLUTIONAL NEURAL NETWORK (DS-CNN)**" }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-109", "text": "Recently, depthwise separable convolution has been proposed as an efficient alternative to the standard 3-D convolution operation [29] and has been used to achieve compact network architectures in the area of computer vision [10, 30] ." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-110", "text": "DS-CNN first convolves each channel in the input feature map with a separate 2-D filter and then uses pointwise convolutions (i.e. 1x1) to combine the outputs in the depth dimension." 
}, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-111", "text": "By decomposing the standard 3-D convolutions into 2-D convolutions followed by 1-D convolutions, depthwise separable convolutions are more efficient both in number of parameters and operations, which makes deeper and wider architecture possible even in the resource-constrained microcontroller devices." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-112", "text": "In this work, we adopt a depthwise separable CNN based on the implementation of MobileNet [10] as shown in Fig. 4 ." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-113", "text": "An average pooling followed by a fully-connected layer is used at the end to provide global interaction and reduce the total number of parameters in the final layer." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-114", "text": "----------------------------------" }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-115", "text": "**EXPERIMENTS AND RESULTS**" }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-116", "text": "We use the Google speech commands dataset [9] for the neural network architecture exploration experiments." }, { "sent_id": "c4e0e12362bd7d505f6887abad78d4-C001-117", "text": "The dataset consists of 65K 1-second long audio clips of 30 keywords, by thousands of different people, with each clip consisting of only one keyword." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "c4e0e12362bd7d505f6887abad78d4-C001-25" ], [ "c4e0e12362bd7d505f6887abad78d4-C001-61" ], [ "c4e0e12362bd7d505f6887abad78d4-C001-62" ], [ "c4e0e12362bd7d505f6887abad78d4-C001-149" ], [ "c4e0e12362bd7d505f6887abad78d4-C001-157" ] ], "cite_sentences": [ "c4e0e12362bd7d505f6887abad78d4-C001-25", "c4e0e12362bd7d505f6887abad78d4-C001-61", "c4e0e12362bd7d505f6887abad78d4-C001-62", "c4e0e12362bd7d505f6887abad78d4-C001-149", "c4e0e12362bd7d505f6887abad78d4-C001-157" ] }, "@SIM@": { "gold_contexts": [ [ "c4e0e12362bd7d505f6887abad78d4-C001-36" ], [ "c4e0e12362bd7d505f6887abad78d4-C001-124" ], [ "c4e0e12362bd7d505f6887abad78d4-C001-165" ] ], "cite_sentences": [ "c4e0e12362bd7d505f6887abad78d4-C001-36", "c4e0e12362bd7d505f6887abad78d4-C001-124", "c4e0e12362bd7d505f6887abad78d4-C001-165" ] }, "@USE@": { "gold_contexts": [ [ "c4e0e12362bd7d505f6887abad78d4-C001-36" ], [ "c4e0e12362bd7d505f6887abad78d4-C001-124" ] ], "cite_sentences": [ "c4e0e12362bd7d505f6887abad78d4-C001-36", "c4e0e12362bd7d505f6887abad78d4-C001-124" ] } } }, "ABC_44916cd85311c78666839a3376ccc6_20": { "x": [ { "sent_id": "44916cd85311c78666839a3376ccc6-C001-149", "text": "----------------------------------" }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-146", "text": "----------------------------------" }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-147", "text": "**MODEL**" }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-148", "text": "GloVe" }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-2", "text": "Attention mechanisms have seen some success for natural language processing downstream tasks in recent years and generated new Stateof-the-Art results." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-3", "text": "A thorough evaluation of the attention mechanism for the task of Argumentation Mining is missing, though." 
}, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-4", "text": "With this paper, we report a comparative evaluation of attention layers in combination with a bidirectional long short-term memory network, which is the current state-of-the-art approach to the unit segmentation task." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-5", "text": "We also compare sentence-level contextualized word embeddings to pre-generated ones." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-6", "text": "Our findings suggest that for this task the additional attention layer does not improve upon a less complex approach." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-7", "text": "In most cases, the contextualized embeddings do also not show an improvement on the baseline score." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-8", "text": "----------------------------------" }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-10", "text": "Argumentation Mining (AM) is increasingly applied in different fields of research like fake-news detection and political argumentation and network analysis 1 ." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-11", "text": "One crucial part of the AM pipeline is to segment written text into argumentative and nonargumentative units." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-12", "text": "Recent research in the area of unit segmentation (Eger et al., 2017; Ajjour et al., 2017) has lead to promising results with F1-scores of up to 0.90 for in-domain segmentation (Eger et al., 2017) ." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-13", "text": "Nevertheless, there is still a need for more robust approaches." 
}, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-14", "text": "Given the recent progress of attention-based models in Neural Machine Translation (NMT) (Bahdanau et al., 2014; Vaswani et al., 2017) , this paper evaluates the effectiveness of attention for the task 1 See for example the MARDY project (https:// www.socium.uni-bremen.de/projekte/?" }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-15", "text": "proj=570&print=1, last accessed:" }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-16", "text": "2019-04-15, 09:50UTC+2)." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-17", "text": "of argumentative unit segmentation." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-18", "text": "The idea of the attention layers added to the recurrent network is to enable the model to prioritize those parts of the input sequence that are important for the current prediction (Bahdanau et al., 2014) ." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-19", "text": "This can be achieved by learning additional parameters during the training of the model." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-20", "text": "With the additional information gained, the model learns a better internal representation which improves performance." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-21", "text": "Additionally, we evaluate the impact of contextualized distributed term representations (also referred to as word embeddings hereinafter) on all our models." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-22", "text": "The goal of word embeddings is to represent a word as a high-dimensional vector that encodes its approximate meaning." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-23", "text": "This vector will be generated by a model trained on a language modeling task, like next-word prediction (Mikolov et al., 2013) , for a given text corpus." 
}, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-24", "text": "The approximation is based on the word's surrounding context in the train set and with that predefined by the chosen corpus." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-25", "text": "Words with a similar semantic meaning should then also have similar vector representations, as measured by their distance in the vector space (Heuer, 2015) ." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-26", "text": "Different methods to pre-compute the embeddings include word2vec (Mikolov et al., 2013) , FastText (Bojanowski et al., 2017) and GloVe (Pennington et al., 2014) ." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-27", "text": "To make use of the capabilities of pre-trained Language Models (LMs), such as BERT (Devlin et al., 2018) or Flair (Akbik et al., 2018) , we evaluate how well their semantic representations perform, by using contextualized word embeddings." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-28", "text": "Those are, in contrast to previously mentioned methods, specific to the context of the word in the input sequence." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-29", "text": "One major benefit is the fact that the time-consuming feature engineering could become obsolete since the features are implicitly encoded in the word embeddings." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-30", "text": "Furthermore, a better semantic representation of the input could lead to better generalization capabilities of the model and, therefore, to better crossdomain performance." 
}, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-31", "text": "This paper answers the following research questions, which will help to assess the importance of the attention layers and contextualized word embeddings for the argument unit segmentation task:" }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-32", "text": "To what extent can additional attention layers help the model focus on the, for the task of unit segmentation relevant, sequence parts and how much do they influence the predictions?" }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-33", "text": "\u2022 RQ2: What is the impact of contextualized distributed term representations like BERT (Devlin et al., 2018) and Flair (Akbik et al., 2018) on the task of unit segmentation and do they improve upon pre-defined representations like GloVe?" }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-34", "text": "The contributions of this paper are as follows: first, we present and evaluate new attention based architectures for the task of argumentative text segmentation." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-35", "text": "Second, we review the effectiveness of recently proposed contextualized word embedding approaches in regard to AM." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-36", "text": "We will continue by presenting the previous work on this specific task, followed by a description of the data set, the different architectures used and the generation of the word embeddings." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-37", "text": "Afterward, we will report the results, followed by a discussion and the limitations." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-38", "text": "We will finish with a conclusion and an outlook on possible future work." 
}, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-39", "text": "----------------------------------" }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-40", "text": "**RELATED WORK**" }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-41", "text": "Attention mechanisms have long been utilized in deep neural networks." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-42", "text": "Some of its roots are in the salient region detection for the processing of images (Itti et al., 1998) , which takes inspiration from human perception." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-43", "text": "The main idea is to focus the attention of the underlying network on pointsof-interest in the input that are often surrounded by irrelevant parts (Mnih et al., 2014) ." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-44", "text": "This allows the model to put more weight on the important chunks." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-45", "text": "While earlier salient detectors were task-specific, newer approaches (e.g. Mnih et al., 2014) can be adapted to different tasks, like image description generation (Xu et al., 2015) , and allow for the parameters of the attention to be tuned during the training." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-46", "text": "These additional tasks include sequence processing and the application of such networks to different areas of Natural Language Processing (NLP)." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-47", "text": "One of the first use-cases for attention mechanisms in the field of NLP was machine translation." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-48", "text": "Bahdanau et al. (2014) utilized the attention to improve their NMT model." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-49", "text": "A few years later, Vaswani et al. 
(2017) achieved new State-of-the-Art (SotA) results by presenting an encoder-decoder architecture that is based on the attention mechanism, only adding a position-wise feed-forward network and normalizations in between." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-50", "text": "Devlin et al. (2018) picked up on the encoder part of this architecture to pre-train a bidirectional LM." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-51", "text": "After fine-tuning, they achieved, again, a new SotA performance on different downstream NLP tasks like Part-of-speech tagging and Question Answering." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-52", "text": "A possible way of posing unit segmentation as an NLP task is token-based sequence labeling (Stab, 2017)." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-53", "text": "While Tobias et al. (2018) used rather simple, non-recurrent classifiers to approach this problem, others mostly applied recurrent networks to the task of unit boundary prediction." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-54", "text": "For example, Eger et al. (2017) reported different Long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) architectures." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-55", "text": "Further, Ajjour et al. (2017) proposed a setup with a total of three bidirectional LSTMs (Bi-LSTMs) (Schuster and Paliwal, 1997) as their best solution." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-56", "text": "While the first two of them are fully connected and work on word embeddings and task-specific features, respectively, the intention for the third is to take the output of the first two as input and learn to correct their errors."
}, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-57", "text": "Even though the third Bi-LSTM did not improve on the F1-score metric, it did succeed in resolving some of the wrong consecutive token predictions, without worsening the final results." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-58", "text": "To the best of the authors' knowledge, the attention mechanism has not been widely utilized so far for the task of argumentative unit segmentation." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-59", "text": "Stab et al. (2018) integrated the attention mechanism directly into their Bi-LSTM by calculating it at each time step t to evaluate the importance of the current hidden state h t ." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-60", "text": "To do that, they employed additive attention." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-61", "text": "A similar approach has been applied by Morio and Fujita (2018) for a three-label classification task (claim, premise or non-argumentative)." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-103", "text": "**DATA**" }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-62", "text": "In contrast to that, the approach presented in this paper uses attention as a separate layer that encodes all sequences before they are fed into a Bi-LSTM." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-63", "text": "This might enable the recurrent part of the network to learn from better representations that are specific to the task it is trained on." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-64", "text": "The aim is further to evaluate the possible applications of attention layers for the task of sequence segmentation and token classification." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-65", "text": "A recurrent architecture (Ajjour et al., 2017 ) is compared to multiple modified versions that utilize the attention mechanism." 
}, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-66", "text": "In order to derive a representation of the input text that better resembles the context of the input for a specific task, several approaches have been presented." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-67", "text": "Akbik et al. (2018) , for example, pretrain a character-level Bi-LSTM to predict the next character for a given text corpus." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-68", "text": "The pre-trained model is able to derive contextualized word embeddings by additionally utilizing the input sequence for a specific task." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-69", "text": "This allows it to encode the previous as well as the following words of the given input sequence into the word itself." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-70", "text": "In comparison to that, the pre-trained BERT-LM utilizes stacked attention layers (Vaswani et al., 2017) ." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-71", "text": "By feeding a sequence into it and extracting the output of the last sublayer for each token, the idea is to implicitly use the attention mechanism to derive a better representation for every token." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-72", "text": "As is the case for the LM from Akbik et al. (2018) , the BERT embeddings are contextualized by the whole input sequence of the specific task." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-73", "text": "This paper will compare the two contextualized approaches described above with the pre-defined GloVe (Pennington et al., 2014) embeddings in the light of their usefulness for AM." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-74", "text": "The goal is to encode the features necessary to detect arguments by utilizing the context of a sentence." 
}, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-75", "text": "----------------------------------" }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-76", "text": "**METHODOLOGY**" }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-77", "text": "This paper evaluates different machine learning architectures with added attention layers for the task of AM, and more specifically unit segmentation." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-78", "text": "The problem is framed as a multi-class token labeling task, in which each token is assigned one of three labels." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-79", "text": "A (B) label denotes that the token is at the beginning of an argumentative unit, an (I) label that it lies inside a unit and an (O) label that the token is not part of a unit." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-80", "text": "This framework has been applied previously for the same task (Stab, 2017; Eger et al., 2017; Ajjour et al., 2017) ." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-81", "text": "The architectures proposed in this section build on Ajjour et al. (2017) , omitting the second Bi-LSTM, which was used to process features other than word embeddings (see section 3.1)." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-82", "text": "They are further being modified by adding attention layers at different positions." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-83", "text": "The goal is to reuse existing approaches and possibly enhance their ability to model long-range dependencies." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-84", "text": "Additionally, a simpler architecture, consisting of a single Bi-LSTM paired with an attention layer, is built and evaluated with the aim of decreased complexity." 
}, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-85", "text": "In order to answer the second research question, this paper reports results in combination with improved input embeddings, in order to evaluate their effectiveness and impact on the AM downstream task." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-86", "text": "All models are compared to the modified reimplementation of the architecture, which is defined as the baseline architecture." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-87", "text": "----------------------------------" }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-88", "text": "**FEATURES**" }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-89", "text": "For each token, a set of three different embeddings is generated and compared regarding their capability as standalone input features." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-90", "text": "The resulting weighted F1-score is then used as a proxy for measuring the usefulness of the generated textrepresentation in light of this specific downstream task." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-91", "text": "In combination with the re-implemented architecture, the word vectorization approach GloVe (Pennington et al., 2014), trained on 6 billion tokens 2 , serves as the baseline." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-92", "text": "As a first approach to enhance the performance, the GloVe embeddings are stacked with the character-based Flair embeddings (Akbik et al., 2018) , which are generated by a Bi-LSTM model." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-93", "text": "Akbik et al. (2018) argue that the resulting embeddings are contextualized, since the LM was trained to predict the most probable next character and therefore to encode the context of the whole sequence." 
}, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-94", "text": "Similar to that, we also compare contextualized BERT-embeddings as standalone features (Devlin et al., 2018 )." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-95", "text": "An increased performance is expected because of the pre-training procedure of the LM." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-96", "text": "The BERT-LM was trained to predict a (randomly masked) word by utilizing the context of its appearance, as well as on next sentence prediction." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-97", "text": "Due to its SotA performance for both, token-level and sentence-level tasks, the authors of this paper argue that the derived representations are well suited for the task of unit segmentation." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-98", "text": "Also, the representation fits the needs of the inter- token and sentence dependencies of the task." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-99", "text": "It is expected that this enables the model to better grasp the notion or pattern of an argument." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-100", "text": "Both contextualized embeddings are generated using the Flair library (Zalando Research, 2018) ." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-101", "text": "Features specifically engineered for this task are not included in the input, following the argumentation of Eger et al. (2017) that they will probably not be generalizable to different data sets." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-102", "text": "----------------------------------" }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-104", "text": "In order to evaluate the different architectures, the \"Argument annotated Essays (version 2)\" corpus (also referred to as Persuasive Essays corpus) is used (Stab and Gurevych, 2017) ." 
}, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-105", "text": "It was utilized for the same task in previous literature (Ajjour et al., 2017; Eger et al., 2017) ." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-106", "text": "The corpus, compiled for parsing argumentative structures in written text, consists of a random sample of 402 student essays." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-107", "text": "The annotation scheme includes the argumentative units and the relations between them, as well as the major claim and stance of the author towards a specific topic." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-108", "text": "The texts were annotated by non-professionals, labeling the boundary of each argumentative unit alongside the unit type." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-109", "text": "A type can either be major-claim, claim or premise." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-110", "text": "For the unit segmentation task, the corpus is labeled by treating major claims, claims, and premises as argumentative units 3 ." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-111", "text": "For comparability reasons in the 3 All data pre-processing scripts are available in our code repository:" }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-112", "text": "https://gitlab." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-113", "text": "informatik.uni-bremen.de/covis1819/ worth-the-attention." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-114", "text": "evaluation process, the models are trained and tested with the train-test-split defined by Stab and Gurevych (2017) ." 
}, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-115", "text": "----------------------------------" }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-116", "text": "**MODELS**" }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-117", "text": "In order to evaluate the attention mechanisms, different architectures based on previous AM literature are implemented." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-118", "text": "The attention layer is added at different positions in the network." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-119", "text": "All models were implemented using Python and the Keras framework with a TensorFlow backend." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-120", "text": "For the self-attention and multi-head attention layers, an existing implementation is used (HG, 2018a,b) ." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-121", "text": "The difference between the two is that the multi-head attention divides the input into multiple chunks and each head therefore works on a different vector subspace (Vaswani et al., 2017) , while the self-attention works on the whole input sequence." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-122", "text": "This is supposed to allow the head to focus on specific features of the input." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-123", "text": "In this case, the self-attention layers use additive attention, while the multi-head attention layers use scaled dot-product attention, with the latter following the implementation of Vaswani et al. (2017) ." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-124", "text": "Baseline re-implementation The baseline model from Ajjour et al. (2017) uses a total of three Bi-LSTMs (two of them fully connected) to assign labels to tokens (see Figure 1a) ." 
}, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-125", "text": "The re-implementation does not include the two fully connected Bi-LSTMs but instead uses only a single one that works on the word embeddings (see Figure 1b) ." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-126", "text": "Due to the fact that the second Bi-LSTM in the first layer is only used to encode the non-semantic features like Part-of-speech tags and discourse marker labels, it is omitted in the re-implementation." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-127", "text": "Hereafter, we will refer to this model as baseline." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-128", "text": "Also, the batch size was increased from 8 to 64, compared to the original implementation, as a trade-off between convergence time and the model's generalization performance (Keskar et al., 2016) ." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-129", "text": "Nevertheless, this model achieves comparable scores to the ones presented in the original paper." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-130", "text": "The slightly lower performance can probably be attributed to implementation details." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-131", "text": "baseline +input and baseline +error For both variations, the baseline architecture was used as a basis, as can be seen in Figure 1b ." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-132", "text": "Multi-headattention layers are added at different positions in the network." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-133", "text": "The number of attention heads depends on the dimension of the embedding vectors." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-134", "text": "For the GloVe (300 features) and the BERT (3072 features) embeddings, six heads are used, while the Flair (4196 features) embeddings require four heads." 
}, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-135", "text": "Both numbers were the largest divisor for the respective input vector size that worked inside the computational boundaries available." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-136", "text": "In the first model, an attention layer was added before the first Bi-LSTM in an attempt to apply a relevance score directly to the tokens, in order to better capture dependencies of the input sequence." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-137", "text": "This model will be referred to as baseline +input ." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-138", "text": "The second variation adds the attention layer after the first and before the second Bi-LSTM, which will be called baseline +error ." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-139", "text": "According to Ajjour et al. (2017) , the latter Bi-LSTM is used to correct the errors of the first one." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-140", "text": "The attention layer should be able to support the model in the error correction process." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-141", "text": "In contrast to the first approach, this does not change the input data, but only works on the output of the first Bi-LSTM." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-142", "text": "bilstm and bilstm +input To decrease the complexity of the architecture, two additional models with a single Bi-LSTM are trained." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-143", "text": "The first variant has no attention layer, while the second one utilized the same input attention described above (see Figure 1c )." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-144", "text": "They will be refered to as bilstm and bilstm +input respectively." 
}, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-145", "text": "Both architectures use a self-attention mechanism instead of the above-mentioned multi-head-attention, due to better results in preliminary tests." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-150", "text": "**RESULTS**" }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-151", "text": "We evaluate the performance of all architectures on the Persuasive Essays data set detailed above." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-152", "text": "The models are re-initialized after every evaluation and do not share any weights." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-153", "text": "This allows us to answer the first research question of whether additional attention layers have a positive impact on the prediction quality." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-154", "text": "To answer the second research question, we rerun each training, replacing the GloVe with BERT and Flair embeddings." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-155", "text": "Both contextualized embedding methods are tested separately." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-156", "text": "We contextualize the tokens on the sentence level since the BERT model (Google AI Research, 2018) only allows for a maximum input length of 512 characters." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-157", "text": "This makes document-level or paragraphlevel embeddings impractical for the data set." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-158", "text": "As a performance measure, we report the weighted F1-score instead of the macro F1-score, since it takes the imbalance of the samples per label into account." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-159", "text": "For our re-implementation of the baseline, we are able to approximately reproduce the results reported by Ajjour et al. (2017) ." 
}, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-160", "text": "Additionally, we can verify that there is no major change in the performance when adding a second Bi-LSTM to the network (compare results for bilstm and baseline in Table 1 )." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-161", "text": "----------------------------------" }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-162", "text": "**ATTENTION LAYERS**" }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-163", "text": "The results of the token classification task are presented in Table 1 ." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-164", "text": "Generally speaking, the added attention encodings do not improve upon the original architecture's performance, no matter at which position they are added." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-165", "text": "Architectures with an input attention encoding, namely baseline +input and bilstm +input , do achieve similar performances compared to their respective baseline." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-166", "text": "But the F1-score performance is in strong contrast to the generalization error, which is in most cases lower for the baseline model." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-167", "text": "The baseline +error architecture on the other hand, which is supposed to help the second Bi-LSTM in the network to correct the errors made by the first one, performs worse across all tests." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-168", "text": "For the Flair embeddings, this results in a 0.20 points performance drop in the F1-score measure." 
}, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-169", "text": "----------------------------------" }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-170", "text": "**CONTEXTUALIZED WORD EMBEDDINGS**" }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-171", "text": "The results for the enhanced word embedding evaluations are reported in Table 1 ." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-172", "text": "In some cases, the models utilizing the word embeddings generated by the BERT-LM achieve a lower performance score than the other embeddings." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-173", "text": "This drop is most noticeable for the baseline +input model, while the performance for the bilstm +input decreases only slightly." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-174", "text": "The baseline +error model is able to achieve results that outperform both, GloVe and Flair embeddings." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-175", "text": "Compared to the GloVe vectors, the models trained on the Flair embeddings mostly lose in F1-score performance as well." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-176", "text": "For example, the baseline +input model drops by 0.18 points." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-177", "text": "On the other hand, the baseline model is able to slightly improve upon the GloVe score using the Flair embeddings, achieving a final score of 0.87, which also marks the best overall score in our testings." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-178", "text": "An interesting observation is the fact that the enhanced embeddings seem to increase the generalization error (compare Figure 2) ." 
}, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-179", "text": "The baseline model trained on the GloVe embeddings for example, shows a difference in the final validation and training loss of around 0.17 and increases for the BERT and Flair embeddings to roughly 0.60 and 0.48 points, respectively." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-180", "text": "----------------------------------" }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-181", "text": "**DISCUSSION**" }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-182", "text": "Given the experimental results, we discuss the resulting implications for our two research questions and conclude this section by presenting some limitations." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-183", "text": "----------------------------------" }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-184", "text": "**ATTENTION LAYERS**" }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-185", "text": "Our results suggest that the attention encoding does not increase the performance of the model, as we hypothesized above." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-186", "text": "This is true for both, the input and the error encoding." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-187", "text": "A potential explanation is the fact that we use the attention mechanism as an additional layer to encode the input." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-188", "text": "Other approaches, like Morio and Fujita (2018) or Stab et al. (2018) , incorporate it into the Bi-LSTM architecture and calculate the weight of the hidden states at every time step." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-189", "text": "While the performance does not decrease meaningfully for the baseline +input and bilstm +input models (using the GloVe embeddings as features), it does for the error encoding baseline +error model." 
}, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-190", "text": "This drop might be explained by the vector space the attention mechanism is working on." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-191", "text": "Due to its small size of only four features, it is unlikely that the resulting vector has a meaningful encoding." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-192", "text": "A deeper inspection of the output values from the different layers in the network and how they influence the overall classification task might give more insight into the cause of the problem." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-193", "text": "----------------------------------" }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-194", "text": "**CONTEXTUALIZED WORD EMBEDDINGS**" }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-195", "text": "For most of the tests we conduct, the contextualized embedding approaches do not improve upon the GloVe embeddings." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-196", "text": "This is especially true for the architectures that include an attention layer, which does not seem to be able to handle the encoding of high dimensional vectors very well." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-197", "text": "The results further suggest that the amount of neurons in the Bi-LSTMs is not an issue in this case, since the baseline model achieves comparable results across all three embeddings." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-198", "text": "A potential way to improve the results of the enhanced embeddings is to contextualize them on the paragraph level." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-199", "text": "While we contextualize them on a sentence level, the dependencies between arguments might span over multiple sentences, sometimes even a paragraph, as described by Stab and Gurevych (2017) for the Persuasive Essays data set." 
}, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-200", "text": "Following this reasoning, one might think that a document level contextualization makes sense and adds even more information to the embedding." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-201", "text": "For the task of AM, however, we argue against that for two reasons." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-202", "text": "First, argumentative units usually do not span over the whole document and it might include additional counter-arguments (Stab and Gurevych, 2017) ." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-203", "text": "The contextualization would most likely cause a lot of noise and make the vector less useful." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-204", "text": "Also, depending on the size of the document, the size of the vector might be too small to hold the contextual information of the full document." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-205", "text": "Second, the model trained on such embeddings would probably not generalize very well." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-206", "text": "An argumentative document can be written in different formats with different purposes, like an essay, a speech or a newspaper article." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-207", "text": "Contextualizing the embeddings on the document level might then also encode the structure of the text and decrease the cross-domain applicability of the model." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-208", "text": "----------------------------------" }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-209", "text": "**LIMITATIONS**" }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-210", "text": "The results we report and analyze above are the networks' performance as validated on the data splits provided by Stab and Gurevych (2017) ." 
}, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-211", "text": "Due to time and resource restrictions, we evaluate the results after a single training run and perform neither an averaging over multiple runs nor any cross-validation." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-212", "text": "Both could lead to more reliable results." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-213", "text": "As another consequence of the abovementioned restrictions, we are also not able to test the model's generalization capabilities on different data sets." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-214", "text": "For the learning rate, we perform only a basic Bayesian hyperparameter optimization (Snoek et al., 2012) with four iterations per model." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-215", "text": "These limitations are especially important for the variations of the baseline architecture, since the performed changes to the architecture, even though rather small, entail the need for independently tuned hyperparameters." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-216", "text": "Furthermore, an additional evaluation of the different contextualization levels for the embeddings could provide a clearer picture of how much the results actually improve, compared to noncontextualized methods." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-217", "text": "----------------------------------" }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-218", "text": "**CONCLUSION**" }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-219", "text": "Recent improvements in utilizing contextual information for sequence processing had a big impact on the area of NLP, namely advances of attention architectures and contextualized word embeddings." 
}, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-220", "text": "For example, the Transformer architecture (Vaswani et al., 2017 ) employs attention to achieve SotA scores on different NLP tasks." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-221", "text": "Further, the Flair model (Akbik et al., 2018) incorporates character-wise context to generate enhanced word representations." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-222", "text": "In this paper, we report on the usefulness of these two approaches for the task of AM." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-223", "text": "First, we are able to show that an attention layer as additional encoding of the input does not improve upon the current SotA approach of a Bi-LSTM." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-224", "text": "Additionally, the attention mechanism seems to fail for a low-dimensional vector space." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-225", "text": "Second, we present the impact of contextualized word embeddings for AM." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-226", "text": "Although the Flair embeddings slightly improve upon the performance of the GloVe embeddings for the baseline architecture, we can not confirm any advantage over non-contextualized embeddings." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-227", "text": "----------------------------------" }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-228", "text": "**FUTURE WORK**" }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-229", "text": "A first extension of this work could be a proper hyperparameter optimization for the attention-based models." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-230", "text": "Second, we plan to explore an attempt to fine-tune solely attention based pre-trained models like BERT (Devlin et al., 2018) to domain-specific data." 
}, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-231", "text": "Recent research by Howard and Ruder (2018) in transfer-learning for NLP has shown great improvement for several NLP-downstream tasks, while reducing the needed amount of labeled training data." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-232", "text": "Third, we contextualize the embeddings on the sentence level only." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-233", "text": "According to Stab and Gurevych (2017) , arguments can sometimes span over multiple sentences." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-234", "text": "Therefore, the contextualization of the embeddings could be extended to a paragraph level, in order to make use of possible inter-dependencies within it." }, { "sent_id": "44916cd85311c78666839a3376ccc6-C001-235", "text": "Additionally, a finetuning approach of the underlying LMs to the AM task could further enhance the embeddings." } ], "y": { "@BACK@": { "gold_contexts": [ [ "44916cd85311c78666839a3376ccc6-C001-12" ], [ "44916cd85311c78666839a3376ccc6-C001-55" ] ], "cite_sentences": [ "44916cd85311c78666839a3376ccc6-C001-12", "44916cd85311c78666839a3376ccc6-C001-55" ] }, "@SIM@": { "gold_contexts": [ [ "44916cd85311c78666839a3376ccc6-C001-80" ], [ "44916cd85311c78666839a3376ccc6-C001-124" ], [ "44916cd85311c78666839a3376ccc6-C001-139" ], [ "44916cd85311c78666839a3376ccc6-C001-159" ] ], "cite_sentences": [ "44916cd85311c78666839a3376ccc6-C001-80", "44916cd85311c78666839a3376ccc6-C001-124", "44916cd85311c78666839a3376ccc6-C001-139", "44916cd85311c78666839a3376ccc6-C001-159" ] }, "@DIF@": { "gold_contexts": [ [ "44916cd85311c78666839a3376ccc6-C001-81" ] ], "cite_sentences": [ "44916cd85311c78666839a3376ccc6-C001-81" ] }, "@EXT@": { "gold_contexts": [ [ "44916cd85311c78666839a3376ccc6-C001-81" ] ], "cite_sentences": [ "44916cd85311c78666839a3376ccc6-C001-81" ] }, "@USE@": { "gold_contexts": [ [ "44916cd85311c78666839a3376ccc6-C001-124" ], [ 
"44916cd85311c78666839a3376ccc6-C001-159" ] ], "cite_sentences": [ "44916cd85311c78666839a3376ccc6-C001-124", "44916cd85311c78666839a3376ccc6-C001-159" ] } } }, "ABC_3bbc588f06e326e1d75985fe253a5f_20": { "x": [ { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-2", "text": "We introduce unsupervised techniques based on phrase-based statistical machine translation for grammatical error correction (GEC) trained on a pseudo learner corpus created by Google Translation." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-3", "text": "We verified our GEC system through experiments on various GEC dataset, including a low resource track of the shared task at Building Educational Applications 2019 (BEA2019)." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-4", "text": "As a result, we achieved an F 0.5 score of 28.31 points with the test data of the low resource track." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-5", "text": "----------------------------------" }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-7", "text": "Research on grammatical error correction (GEC) has gained considerable attention recently." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-8", "text": "Many studies treat GEC as a task that involves translation from a grammatically erroneous sentence (sourceside) into a correct sentence (target-side) and thus, leverage methods based on machine translation (MT) for GEC." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-151", "text": "Bryant and Briscoe (2018) built a GEC system using minimally annotated data." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-9", "text": "For instance, some GEC systems use large parallel corpora and synthetic data (Ge et al., 2018; Xie et al., 2018) ." 
}, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-10", "text": "We introduce an unsupervised method based on MT for GEC that does not almost use parallel learner data." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-11", "text": "In particular, we use methods proposed by Marie and Fujita (2018) , Artetxe et al. (2018b) , and Lample et al. (2018) ." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-12", "text": "These methods are based on phrase-based statistical machine translation (SMT) and two phrase table refinements, i.e., forward and backward refinement." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-13", "text": "Forward refinement simply arguments a learner corpus with automatic corrections whereas backward refinement expends both source-side and target-side data to train GEC model using backtranslation (Sennrich et al., 2016a) ." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-14", "text": "Unsupervised MT techniques do not require a parallel but a comparable corpus as training data." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-15", "text": "Therefore, we use comparable translated texts using Google Translation as the source-side data." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-16", "text": "Specifically, we use News Crawl written in English as target-side data and News Crawl written in another language translated into English as source-side data." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-17", "text": "We identified the difference between forward and backward refinement with CoNLL-2014 dataset and JFLEG dataset; the former generates fluent outputs." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-18", "text": "We also verified our GEC system through experiments for a low resource track of the shared task at Building Educational Applications 2019 (BEA2019)." 
}, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-19", "text": "The experimental results show that our system achieved an F 0.5 score of 28.31 points in the low resource track of the shared task at BEA2019." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-20", "text": "2 Unsupervised GEC Algorithm 1 shows the pseudocode for unsupervised GEC." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-21", "text": "This code is derived from Artetxe et al. (2018b) ." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-22", "text": "First, the cross-lingual phrase embeddings are acquired." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-23", "text": "Second, a phrase table is created based on these cross-lingual embeddings." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-24", "text": "Third, the phrase table is combined with a language model trained by monolingual data to initialize a phrase-based SMT system." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-25", "text": "Finally, the SMT system is updated through iterative forwardtranslation or back-translation." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-26", "text": "Cross-lingual embeddings First, n-gram embeddings were created on the source-and targetsides." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-27", "text": "Specifically, each monolingual embedding was created based on the source-and target-sides using a variant of skip-gram (Mikolov et al., 2013) for unigrams, bigrams, and trigrams with high frequency 1 in the monolingual data." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-28", "text": "Next, the monolingual embeddings were mapped onto a shared space to obtain cross-lingual embeddings." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-29", "text": "The self-learning method of Artetxe et al. (2018a) was used for unsupervised mapping." 
}, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-30", "text": "Phrase table induction A phrase table was created based on the cross-lingual embeddings." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-31", "text": "In particular, this involved the creation of phrase translation models and lexical translation models." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-32", "text": "The translation candidates were limited in the source-to-target phrase translation model \u03c6(f |e) for each source phrase e to its 100 nearest neighbor phrases f on the target-side." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-33", "text": "The score of the phrase translation model was calculated based on the normalized cosine similarity between the source and target phrases." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-34", "text": "f \u2032 represents each phrase embedding on the targetside and \u03c4 is a temperature parameter that controls the confidence of prediction 2 ." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-35", "text": "The backward phrase translation probability \u03c6(e|f ) was determined in a similar manner." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-36", "text": "The source-to-target lexical translation model lex(f |e) considers the word with the highest translation probability in a target phrase for each word in a source phrase." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-37", "text": "The score of the lexical translation model was calculated based on the product of respective phrase translation probabilities." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-38", "text": "\u01eb is a constant term for the case where no alignments are found." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-39", "text": "As in Artetxe et al. (2018b) , the term was set to 0.001." 
}, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-40", "text": "The backward lexical translation probability lex(e|f ) is calculated in a similar manner." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-41", "text": "Refinement of SMT system The phrase table created is considered to include noisy phrase pairs." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-42", "text": "Therefore, we update the phrase table using an SMT system." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-43", "text": "The SMT system trained on synthetic data eliminates the noisy phrase pairs using language models trained on the target-side corpus." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-44", "text": "This process corresponds to lines 4-23 in Algorithm 1." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-45", "text": "The phrase table can be refined in either of two ways: forward and backward refinement." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-46", "text": "For forward refinement (Marie and Fujita, 2018) , target synthetic data were generated from the source monolingual data using the source-totarget phrase table P (0) s\u2192t and target language models LM t ." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-47", "text": "A new phrase table P (1) s\u2192t was then created with this target synthetic corpus." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-48", "text": "This operation was executed N times." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-69", "text": "We used moses truecaser for the training data; this truecaser model was learned from processed English News Crawl." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-152", "text": "Their model used LM and confusion sets based on the common English error types." 
}, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-49", "text": "For backward refinement (Artetxe et al., 2018b) , source synthetic data were generated from the target monolingual data using the target to source phrase table P (0) t\u2192s and source language model LM s ." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-50", "text": "A new source to target phrase table P" }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-51", "text": "(1) s\u2192t was created with this source synthetic parallel corpus." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-52", "text": "Next, target synthetic data were generated from the source monolingual data using P" }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-53", "text": "( 1) s\u2192t and target language model LM t ." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-54", "text": "The target to source phrase table P" }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-55", "text": "(1) t\u2192s was built using this target synthetic data." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-56", "text": "Construction of a comparable corpus This unsupervised method is based on the assumption that the source and target corpora are comparable." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-57", "text": "In fact, Lample et al. (2018) , Artetxe et al. (2018b) and Marie and Fujita (2018) use the News Crawl of source and target language as training data." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-58", "text": "To make a comparable corpus for GEC, we use translated texts using Google Translation as 3 Experiment of low resource GEC 3.1 Experimental setting Table 1 shows the training and development data size." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-59", "text": "Unless mentioned otherwise, Finnish News Crawl 2014-2015 translated into English was used as source training data and English News Crawl 2017 was used as target training data." 
}, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-60", "text": "To train the extra language model of the target-side (LM t ), we used training data of One Billion Word Benchmark (Chelba et al., 2014) ." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-61", "text": "We used googletrans v2.4.0 3 for Google Translation and obtained 2,122,714 translated sentences." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-62", "text": "We sampled the 3,000,000 sentences from English News Crawl 2017 and excluded the sentences with more than 150 words for either source-and targetside data." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-63", "text": "Finally, the synthetic comparable corpus comprises processed News Crawl data listed in Table 1 ." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-64", "text": "Our system was verified using three GEC datasets, CoNLL-14 (Ng et al., 2014) , JFLEG test set (Napoles et al., 2017) and W&I+LOCNESS (Bryant et al., 2019; Granger, 1998) ." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-65", "text": "We used the CoNLL-13 dataset (Ng et al., 2013) and JF-LEG dev set as tuning data for CoNLL-14 and JF-LEG test, respectively." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-66", "text": "The low resource track at BEA2019 permitted to use W&I+LOCNESS development set, so we split it in half; tune data and 3 https://github.com/ssut/py-googletrans dev data 4 ." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-67", "text": "These data were tokenized by spaCy v1.9.0 5 and the en_core_web_sm-1.2.0 model for W&I+LOCNESS." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-68", "text": "For CoNLL-14 and JFLEG test set, NLTK (Bird, 2006) tokenizer was used." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-70", "text": "We used byte-pair-encoding (Sennrich et al., 2016b) learned from processed English News Crawl; the number of operations was 50K." 
}, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-71", "text": "The implementation made by Artetxe et al. (2018b) 6 was modified to conduct the experiments." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-72", "text": "Specifically, some features were added; word-level Levenshtein distance, word-, and character-level edit operation, operation sequence model (Durrani et al., 2013) 7 , and 9-gram word class language model 8 , similar to Grundkiewicz and Junczys-Dowmunt (2018) without sparse features." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-73", "text": "Word class language model was trained with One Billion Word Benchmark data; the number of classes is 200, and the word class was estimated with fastText (Bojanowski et al., 2017) ." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-74", "text": "The distortion feature was not used." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-75", "text": "Moses (Koehn et al., 2007) was used to train the SMT system." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-76", "text": "FastAlign (Dyer et al., 2013) was used for word alignment and KenLM (Heafield, 2011) was used to train the 5-gram language model over each processed English News Crawl and One Billion Word Benchmark." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-77", "text": "MERT (Och, 2003) was used with M\u02c62 Scorer (Dahlmeier and Ng, 2012) for the tuning data of CoNLL-13 and W&I+LOCNESS and with GLEU (Napoles et al., 2015) for the tuning data of JF-LEG." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-78", "text": "Synthetic sentence pairs with a [3, 80] sentence length were used at the refinement step." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-79", "text": "The number of iterations N was set to 3 or 5, and the embedding dimension was set to 300." 
}, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-80", "text": "For the low resource track, we decided best iteration of forward refinement with the dev data and submitted the output of the best iteration model." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-81", "text": "We used pyspellchecker 9 as a spell checker." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-82", "text": "This tool uses Levenshtein distance to 4 Because W&I+LOCNESS data has four types of learner level, we split it so that each learner level is equal." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-83", "text": "Table 2 : M 2 and GLEU results." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-84", "text": "The bold scores represent the best score in unsupervised SMT." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-85", "text": "The underlined scores represent the best overall score." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-86", "text": "obtain permutations within an edit distance of 2 over the words included in a word list." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-87", "text": "We made the word list from One Billion Word Benchmark and included words that occur more than five times." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-88", "text": "For comparison, supervised GEC with SMT and neural MT (NMT) were conducted using the data extracted from Lang-8 (Mizumoto et al., 2011) as for CoNLL-14 and JFLEG." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-89", "text": "In supervised SMT, the feature weights were tuned and the setting was the same as that in unsupervised SMT (USMT)." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-90", "text": "In supervised NMT, a convolutional EncoderDecoder model (Gehring et al., 2017) was used and the parameter settings were similar to those in Ge et al. (2018) ." 
}, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-91", "text": "We report precision, recall, and F0.5 score for the CoNLL-14 and W&I+LOCNESS data, and GLEU score for the JFLEG test set." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-92", "text": "The outputs for CoNLL-14 and the W&I+LOCNESS dev data were evaluated using the M^2 scorer and the ERRANT scorer (Bryant et al., 2017), respectively." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-93", "text": "Table 2 shows the results of the GEC experiments for CoNLL-14 and JFLEG." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-94", "text": "The F0.5 score for USMT forward in iter 1 is 13.57 points lower than that of supervised SMT and 17.17 points lower than that of supervised NMT." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-95", "text": "On JFLEG, the highest score among the unsupervised SMT models was achieved with USMT forward in iter 1; its GLEU scores are 5.28 points and 3.39 points lower than those of supervised SMT and supervised NMT, respectively." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-96", "text": "----------------------------------" }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-97", "text": "**CONLL-14 AND JFLEG RESULTS**" }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-98", "text": "The improvement from iteration 0 to 1 confirms that the forward refinement works well." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-99", "text": "However, the system with forward refinement ceases to improve after iteration 1." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-100", "text": "In forward refinement, the source-side data is not changed, and the target-side data is regenerated from the source side at each iteration." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-101", "text": "Therefore, the quality of the source-side data is important for this refinement method." 
}, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-102", "text": "In this study, we use automatically translated text as the source-side data; thus, its quality is presumably not high, and refinement after iteration 1 does not work." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-103", "text": "Difference between forward and backward refinements: To examine how the refinement methods differ, we counted the number of corrections predicted by each method." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-104", "text": "The number of corrections for USMT forward in iter 1 and iter 2 is 3,437 and 3,257, respectively, whereas that for USMT backward is 4,092 and 2,789." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-105", "text": "For USMT backward, the number of corrections decreases by 1,303 from iter 1 to iter 2." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-106", "text": "Artetxe et al. (2018b) and Lample et al. (2018) reported that the BLEU score (Papineni et al., 2002) of unsupervised MT with backward refinement improves with increasing iterations." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-107", "text": "In GEC, increasing the iterations of USMT backward improves the GEC accuracy by predicting fewer corrections." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-108", "text": "The GLEU score for USMT forward is presumably higher than that for USMT backward because the language model compensates for the synthetic target data." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-130", "text": "Table 6 shows the error types that our system corrected well or badly on the dev data." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-109", "text": "To compare fluency, the outputs of each best iter on JFLEG were evaluated with perplexity based on the Common Crawl language model." 
}, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-110", "text": "The perplexity of USMT forward in iter 1 is 179.23 and that of USMT backward in iter 1 is 187.49; hence, the perplexity suggests that USMT forward produces more likely outputs than USMT backward under the language model of Common Crawl text (http://data.statmt.org/romang/gec-emnlp16/cclm.tgz)." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-111", "text": "Table 5: GEC results with dev data. The bold scores represent the best score without the spell checker." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-113", "text": "Effect of the source language: We also examine how the source language of machine translation affects performance." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-114", "text": "Table 3 shows the results when changing the source-side data on CoNLL-14." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-115", "text": "The output using Finnish data achieves the best score among the various languages; the more similar the source-side language is to English, the lower the F0.5 score of the output." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-116", "text": "Table 4 shows the results of the GEC experiments with the official test data for W&I+LOCNESS." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-117", "text": "The F0.5 score for our system (TMU) is 28.31; this score is eighth among the nine teams." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-118", "text": "In particular, the number of false positives of our system is 4,314, the worst result of all." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-119", "text": "Table 5 shows the results on the dev data listed in Table 1." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-120", "text": "On the dev data, the system of iteration 1 is the best among all." 
}, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-121", "text": "This tendency is the same as in the CoNLL and JFLEG results." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-122", "text": "----------------------------------" }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-123", "text": "**W&I+LOCNESS RESULTS**" }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-124", "text": "The results in Table 5 confirm that the spell checker works well." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-125", "text": "We also investigate the importance of the order: which should be applied first for a better result, SMT or the spell checker?" }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-126", "text": "As a result, it is better to run the SMT system after the spell checker." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-127", "text": "This is because the source-side data does not include the misspelled words, as mentioned above." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-128", "text": "Table 6: Error types for which our best system corrected errors well or badly on the dev data. Easy2 denotes the easiest two errors, and Hard2 denotes the hardest two errors in terms of F0.5." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-131", "text": "SPELL denotes misspelling errors; the correction of these errors depends only on the spell checker." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-132", "text": "PUNCT denotes punctuation errors, e.g., 'Unfortunately when we... \u2192 Unfortunately, when we...'." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-133", "text": "Our system can presumably correct errors such as these owing to the n-gram co-occurrence knowledge derived from the language models." 
}, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-134", "text": "In contrast, our system struggled to correct content word errors." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-135", "text": "For example, NOUN includes errors such as 'way \u2192 means', and VERB includes errors such as 'watch \u2192 see'." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-136", "text": "Our system is mostly unable to correct word usage errors based on the context, presumably because the phrase table is still noisy." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-137", "text": "Although we observed some usage error examples of 'watch' in the synthetic source data, our model was not able to replace 'watch' with 'see' based on the context." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-138", "text": "Unsupervised MT: Unsupervised machine translation methods have been proposed for both NMT (Lample et al., 2018; Marie and Fujita, 2018) and SMT (Artetxe et al., 2018b)." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-139", "text": "In this study, we apply the USMT methods of Artetxe et al. (2018b) and Marie and Fujita (2018) to GEC." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-140", "text": "The UNMT method (Lample et al., 2018) was ineffective under the GEC setting in our preliminary experiments." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-141", "text": "GEC with NMT/SMT: Several studies that introduce sequence-to-sequence models in GEC rely heavily on large amounts of training data." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-142", "text": "Ge et al. (2018), who presented state-of-the-art results in GEC, proposed a supervised NMT method trained on corpora totaling 5.4M sentence pairs." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-143", "text": "On the other hand, we mainly use monolingual corpora and use only a small amount of learner data as the tuning data." 
}, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-144", "text": "Despite the success of NMT, many studies on GEC traditionally use SMT (Junczys-Dowmunt and Grundkiewicz, 2014)." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-145", "text": "These studies apply an off-the-shelf SMT toolkit, Moses, to GEC." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-146", "text": "Junczys-Dowmunt and Grundkiewicz (2014) claimed that an SMT system optimized for BLEU learns not to change the source sentence." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-147", "text": "Instead of BLEU, they proposed tuning the SMT system using the M^2 score with annotated development data." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-148", "text": "In this study, we also tune the weights with the F0.5 score measured by the M^2 scorer." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-149", "text": "Low-resource GEC: Park and Levy (2011) proposed a GEC system based on a noisy channel model using an unannotated corpus of learner English." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-150", "text": "In contrast, our method does not require a learner corpus but instead requires monolingual corpora." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-153", "text": "Our method does not require knowledge about the common error types." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-154", "text": "Miao et al. (2019) proposed a language generation method using Metropolis-Hastings sampling." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-155", "text": "This method does not require parallel corpora for training; instead, monolingual data is required." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-156", "text": "They evaluate it on a variety of tasks including GEC and report a GLEU score of 45.5 on JFLEG." 
}, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-157", "text": "Because we used a parallel corpus for tuning weights, their results cannot be compared with ours." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-158", "text": "Zhao et al. (2019) reported that a neural GEC model was improved with only a denoising autoencoder, which was trained using a synthetic parallel corpus." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-159", "text": "Their parallel corpus was generated by adding artificial errors, such as random deletion of a token, to monolingual data instead of using machine translation." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-160", "text": "----------------------------------" }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-161", "text": "**CONCLUSION**" }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-162", "text": "In this paper, we described our GEC system with minimally annotated data." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-163", "text": "We introduced an unsupervised approach based on SMT for GEC." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-164", "text": "This method requires a comparable corpus, so we created a synthetic comparable corpus using Google Translate." }, { "sent_id": "3bbc588f06e326e1d75985fe253a5f-C001-165", "text": "The experimental results demonstrate that our system achieved an F0.5 score of 28.31 points on the W&I+LOCNESS test data." 
} ], "y": { "@SIM@": { "gold_contexts": [ [ "3bbc588f06e326e1d75985fe253a5f-C001-11" ], [ "3bbc588f06e326e1d75985fe253a5f-C001-39" ], [ "3bbc588f06e326e1d75985fe253a5f-C001-139" ] ], "cite_sentences": [ "3bbc588f06e326e1d75985fe253a5f-C001-11", "3bbc588f06e326e1d75985fe253a5f-C001-39", "3bbc588f06e326e1d75985fe253a5f-C001-139" ] }, "@USE@": { "gold_contexts": [ [ "3bbc588f06e326e1d75985fe253a5f-C001-11" ], [ "3bbc588f06e326e1d75985fe253a5f-C001-139" ] ], "cite_sentences": [ "3bbc588f06e326e1d75985fe253a5f-C001-11", "3bbc588f06e326e1d75985fe253a5f-C001-139" ] }, "@DIF@": { "gold_contexts": [ [ "3bbc588f06e326e1d75985fe253a5f-C001-21" ], [ "3bbc588f06e326e1d75985fe253a5f-C001-71" ] ], "cite_sentences": [ "3bbc588f06e326e1d75985fe253a5f-C001-21", "3bbc588f06e326e1d75985fe253a5f-C001-71" ] }, "@EXT@": { "gold_contexts": [ [ "3bbc588f06e326e1d75985fe253a5f-C001-21" ], [ "3bbc588f06e326e1d75985fe253a5f-C001-71" ] ], "cite_sentences": [ "3bbc588f06e326e1d75985fe253a5f-C001-21", "3bbc588f06e326e1d75985fe253a5f-C001-71" ] }, "@BACK@": { "gold_contexts": [ [ "3bbc588f06e326e1d75985fe253a5f-C001-49" ], [ "3bbc588f06e326e1d75985fe253a5f-C001-57" ], [ "3bbc588f06e326e1d75985fe253a5f-C001-106" ] ], "cite_sentences": [ "3bbc588f06e326e1d75985fe253a5f-C001-49", "3bbc588f06e326e1d75985fe253a5f-C001-57", "3bbc588f06e326e1d75985fe253a5f-C001-106" ] } } }, "ABC_bb2609c568540390a560757dd40b32_20": { "x": [ { "sent_id": "bb2609c568540390a560757dd40b32-C001-91", "text": "Thus, LSTM networks are well-suited for working with (very) long sequences [8] ." 
}, { "sent_id": "bb2609c568540390a560757dd40b32-C001-169", "text": "----------------------------------" }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-170", "text": "**CONCLUSION**" }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-2", "text": "This paper presents results of our experiments for the next utterance ranking on the Ubuntu Dialogue Corpus - the largest publicly available multi-turn dialog corpus." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-3", "text": "First, we use an in-house implementation of previously reported models to do an independent evaluation using the same data." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-4", "text": "Second, we evaluate the performances of various LSTMs, Bi-LSTMs and CNNs on the dataset." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-5", "text": "Third, we create an ensemble by averaging predictions of multiple models." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-6", "text": "The ensemble further improves the performance and it achieves a state-of-the-art result for the next utterance ranking on this dataset." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-7", "text": "Finally, we discuss our future plans using this corpus." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-8", "text": "----------------------------------" }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-10", "text": "The Ubuntu Dialogue Corpus is the largest freely available multi-turn based dialog corpus [1]." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-11", "text": "It was constructed from the Ubuntu chat logs - a collection of logs from Ubuntu-related chat rooms on the Freenode IRC network." 
}, { "sent_id": "bb2609c568540390a560757dd40b32-C001-12", "text": "Although multiple users can talk at the same time in the chat room, the logs were preprocessed using heuristics to create two-person conversations." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-13", "text": "The resulting corpus consists of almost one million two-person conversations, where a user seeks help with his/her Ubuntu-related problems (the average length of a dialog is 8 turns, with a minimum of 3 turns)." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-14", "text": "Because of its size, the corpus is well-suited for explorations of deep learning techniques in the context of dialogue systems." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-15", "text": "In this paper, we introduce our preliminary research and experiments with this corpus, and report state-of-the-art results." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-16", "text": "The rest of the paper continues as follows: 1. we introduce the setup - the data as well as the evaluation of the task; 2. we briefly describe the previously evaluated models; 3. we introduce three different models (one of them being the same as in the previous work); 4. we evaluate these models and experiment with different amounts of training data; 5. we conclude and discuss our plans for future work." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-17", "text": "----------------------------------" }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-18", "text": "**DATA**" }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-19", "text": "In this section we briefly describe the data and evaluation metrics used in [1]." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-20", "text": "First, all the collected data was preprocessed by replacing named entities with corresponding tags (name, location, organization, url, path)." 
}, { "sent_id": "bb2609c568540390a560757dd40b32-C001-21", "text": "This is analogous to the preprocessing of [2] (note that the IT helpdesk dataset used there is not publicly available)." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-22", "text": "Second, these data are further processed to create tuples of (context, response, flag)." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-23", "text": "The flag is a Boolean variable indicating whether the response is correct or incorrect." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-24", "text": "To form the training set, each utterance (starting from the third one) is considered as a potential response, while the previous utterances form its context." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-25", "text": "So a dialogue of length n yields (n \u2212 2) training examples (context, response, 1) and (n \u2212 2) training examples (context, response', 0)." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-26", "text": "The negative response response' is a randomly sampled utterance from the entire corpus." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-27", "text": "Finally, the training examples are shuffled." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-28", "text": "----------------------------------" }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-29", "text": "**EVALUATION METRIC**" }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-30", "text": "A randomly selected 2% of the conversations are used to create a test set." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-31", "text": "The proposed task is that of the best response selection." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-32", "text": "The system is presented with n response candidates, and it is asked to rank them." 
}, { "sent_id": "bb2609c568540390a560757dd40b32-C001-33", "text": "To vary the task's difficulty (and to remedy the fact that some of the sampled candidates flagged as incorrect can very well be correct), the system's ranking is considered correct if the correct response is among the first k candidates." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-34", "text": "This quantity is denoted as Recall@k." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-35", "text": "The baselines were reported with (n, k) of (2, 1), (10, 1), (10, 2) and (10, 5)." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-36", "text": "----------------------------------" }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-37", "text": "**APPROACHES**" }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-38", "text": "This task can naturally be formulated as a ranking problem, which is often tackled by three techniques [3]: (i) pointwise, (ii) pairwise and (iii) listwise ranking." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-39", "text": "While pairwise and listwise ranking approaches are empirically superior to the pointwise ranking approach, our preliminary experiments use the pointwise ranking approach for its simplicity." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-40", "text": "Note that the pointwise method was also used in the original baselines [1]." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-41", "text": "----------------------------------" }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-42", "text": "**POINTWISE RANKING**" }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-43", "text": "In pointwise ranking, only the context and the response are directly used to compute the probability of the pair." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-44", "text": "All the pairs are then sorted by their probabilities." 
}, { "sent_id": "bb2609c568540390a560757dd40b32-C001-45", "text": "We denote the function that outputs the probability of the pair as g(context, response)." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-46", "text": "In our settings, the function g is represented by a neural network (learned using the training data)." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-47", "text": "We describe the details of the network architectures used in the following sections." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-48", "text": "----------------------------------" }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-49", "text": "**PREVIOUS WORK**" }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-50", "text": "The pointwise architectures reported in [1] included (i) TF-IDF, (ii) RNN and (iii) LSTM." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-51", "text": "In this section, we briefly describe these models." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-52", "text": "A neural network is used to compute the embedding for the context and the response, denoted as c and r. These are fed through a sigmoid function to compute the pairwise probability." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-53", "text": "----------------------------------" }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-54", "text": "**TF-IDF**" }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-55", "text": "The motivation here is that the correct response tends to share more words with the context than the incorrect ones." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-56", "text": "First, the TF-IDF vectors are calculated for the context and each of the candidate responses." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-57", "text": "Next, the cosine similarity between the context vector and each response vector is used to rank the responses." 
}, { "sent_id": "bb2609c568540390a560757dd40b32-C001-58", "text": "tfidf(context) and tfidf(response) are the resulting TF-IDF vectors for the context and response, respectively." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-59", "text": "D stands for the corpus and w is a word." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-60", "text": "The dimension of the resulting vectors is thus equal to the dictionary size." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-61", "text": "----------------------------------" }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-62", "text": "**NEURAL NETWORK EMBEDDINGS**" }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-63", "text": "A neural network is used to create an embedding of both the context and the candidate response." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-64", "text": "These embeddings, denoted as c and r, are then multiplied using a matrix M and the result is fed into the sigmoid function to score the response." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-65", "text": "c and r are the resulting embeddings of the context and response, computed using a neural network." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-66", "text": "We present some different architectures to compute these embeddings." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-67", "text": "Figure 1 illustrates the approach." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-68", "text": "Note that the matrix M, the bias b and the parameters of the function f (which is a neural network) are all learned using the training data." 
}, { "sent_id": "bb2609c568540390a560757dd40b32-C001-69", "text": "One can think of this approach as a predictive one - given the context, we predict the embedding of the response as r' = cM, and measure the similarity of the predicted response r' to the actual response r using the dot product (or vice versa, predicting the context from the response as c' = Mr)." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-70", "text": "The authors experimented with vanilla RNN and LSTM [4] as the underlying networks producing the embeddings." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-71", "text": "LSTM significantly outperformed RNN in the author's experiments." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-72", "text": "----------------------------------" }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-73", "text": "**OUR ARCHITECTURES**" }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-74", "text": "All our architectures fall within the neural network embedding based approach." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-75", "text": "We implemented three different architectures: (i) CNN [5], (ii) LSTM and (iii) bi-directional [6] LSTM." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-76", "text": "We also report an ensemble of our models." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-77", "text": "All of our architectures share the same design, where the words from the input sequence (context or response) are projected into the words' embedding vectors." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-78", "text": "Thus, if the input sequence consists of 42 words, we project these words into a matrix E which has a dimension e \u00d7 42, where e is the dimensionality of the word embeddings." 
}, { "sent_id": "bb2609c568540390a560757dd40b32-C001-79", "text": "----------------------------------" }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-80", "text": "**CNN**" }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-81", "text": "While originating from computer vision [7], CNN models have recently been very successfully applied in NLP problems [5]." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-82", "text": "At the very heart of the CNN model, the convolving filters are sequentially applied over the input sequence." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-83", "text": "The width of the filters might vary, and in NLP it typically ranges from 1 to 5 (the filters can be thought of here as a form of n-grams)." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-84", "text": "These filters are followed by a max-pooling layer to get a fixed-length input." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-85", "text": "In our architecture, the output of the max-pooling operation forms the context/response embedding." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-86", "text": "Thus, the resulting embedding has a dimension equal to the number of filters." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-87", "text": "Figure 2a displays this architecture with two filters." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-88", "text": "----------------------------------" }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-89", "text": "**LSTM**" }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-90", "text": "Long short-term memory (LSTM) is a recurrent neural network (RNN) architecture designed to remedy the vanishing gradient problem of vanilla RNN [4]." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-92", "text": "We use the same model as the authors' LSTM network [1]." 
}, { "sent_id": "bb2609c568540390a560757dd40b32-C001-93", "text": "LSTM iterates over the sequence embeddings, and the resulting embedding is the last state of the LSTM's cells." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-94", "text": "Figure 2b illustrates this architecture." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-95", "text": "----------------------------------" }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-96", "text": "**BI-DIRECTIONAL LSTM**" }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-97", "text": "Although the LSTM is tailor-made to keep context over long sequences, empirically it can be problematic for the network to capture the meaning of the entire sequence as it gets longer." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-98", "text": "If the important parts of the sequence are found at the beginning of a long sequence, the LSTM might struggle to obtain a well-performing embedding." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-99", "text": "We decided to experiment with Bi-LSTMs to see whether this is the case in our settings." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-100", "text": "Bi-directional [6] LSTMs feed the sequence into two recurrent networks - one reads the sequence as it is, the second reads the sequence from the end to the beginning." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-101", "text": "To avoid forming cycles, only the outputs of the recurrent networks (not the state-to-state connections) lead to the same units in the next layers." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-102", "text": "Figure 2c illustrates this architecture." 
}, { "sent_id": "bb2609c568540390a560757dd40b32-C001-103", "text": "----------------------------------" }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-104", "text": "**EXPERIMENTS**" }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-105", "text": "----------------------------------" }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-106", "text": "**METHOD**" }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-107", "text": "To match the original setup of [1], we use the same training data." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-108", "text": "We use one million training examples and we use the same word vectors pre-trained by GloVe [9]." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-109", "text": "All our models were implemented using Theano [10] and Blocks [11]." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-110", "text": "For training we use the ADAM learning rule [12] and binary negative log-likelihood as the training objective." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-111", "text": "We stop the training once Recall@1 stops improving on a validation set." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-112", "text": "The experiments were executed on Nvidia K40 GPUs." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-113", "text": "The best meta-parameters were found by simple grid search." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-114", "text": "In all architectures we tried both (i) learning separate parameters for the networks encoding the context and the response, and (ii) learning shared parameters for both networks." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-115", "text": "Here we report only the results for the architectures with shared parameters, since they consistently achieved higher accuracy." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-116", "text": "Aside from learning single models, we also experimented with model ensembles." 
}, { "sent_id": "bb2609c568540390a560757dd40b32-C001-117", "text": "We found that averaging predictions of multiple models further improves performance, which is common in many machine learning tasks [13, 14] ." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-118", "text": "Our best classifier is an ensemble of 11 LSTMs, 7 Bi-LSTMs and 10 CNNs trained with different meta-parameters." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-119", "text": "Table 1 shows performance of the models with the best metaparameters in each category." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-120", "text": "An example prediction from the ensemble is shown in Table 2 ." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-121", "text": "The performance was evaluated after every epoch of training." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-122", "text": "Most of the models achieved the best cost on validation data after a single epoch of training." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-123", "text": "However, the best Recall metrics were usually recorded in the second epoch of training." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-124", "text": "----------------------------------" }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-125", "text": "**RESULTS**" }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-126", "text": "Baselines from [1] Our Table 1 : Results of our experiments compared to the results reported in [1] ." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-127", "text": "Meta-parameters of our architectures are the following: our CNN had 400 filters of length 1, 100 filters of length 2 and 100 filters of length 3; our LSTM had 200 hidden units and our bidirectional LSTM had 250 hidden units in each network." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-128", "text": "For CNNs and LSTMs, the best results were achieved with batch size 256." 
}, { "sent_id": "bb2609c568540390a560757dd40b32-C001-129", "text": "For Bi-LSTM, the best batch size was 128." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-130", "text": "Turn User Text 1 A: anyone know why \" aptitude update \" returns a non-successful status (255) ?" }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-131", "text": "2 B: does apt-get update work ?" }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-132", "text": "3 A: i ' ve been missing updates because my normal process is sudo bash -c \" aptitude update && aptitude safe-upgrade -y \". ahh , \" e : some index files failed to download ." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-133", "text": "they have been ignored , or old ones used instead .\". so i guess the issue is that \" aptitude update \" is n't giving an error at all N-Best Confidence Response 1 ****** 0.598 does the internet work on that box ?" }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-134", "text": "2 **** 0.444 what time is it saying to going to be released ?? 3 *** 0.348 ahh ok 4 ** 0.245 nice Table 2 : A dialog context with three turns and a set of four ranked possible responses." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-135", "text": "The highest ranked response is the ground truth response in this case." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-136", "text": "----------------------------------" }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-137", "text": "**DISCUSSION**" }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-138", "text": "Our ensemble of classifiers sets a new state-of-the art performance for response ranking on the Ubuntu Dialog Corpus -the largest, publicly available multi-turn dialog corpus." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-139", "text": "Interestingly LSTMs and Bi-LSTMs achieve almost the same accuracy." 
}, { "sent_id": "bb2609c568540390a560757dd40b32-C001-140", "text": "We hypothesise that: (i) either utterances that appear at the beginning of the context are less important than the later utterances or, (ii) LSTMs successfully capture all of the important parts of the sequence." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-141", "text": "When we inspect accuracy of individual models we see that recurrent models are superior to CNNs." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-142", "text": "However, CNNs proved to significantly improve performance of the ensemble." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-143", "text": "An ensemble without the 10 CNNs had Recall@1 accuracy of only 66.8 compared to 68.3 of the larger ensemble." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-144", "text": "This shows that CNNs learned representations that are complementary to the recurrent models." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-145", "text": "We believe that our results are important, since they can be used as baselines for more complicated models (see the Future Work)." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-146", "text": "----------------------------------" }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-147", "text": "**VARYING TRAINING DATA SIZE**" }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-148", "text": "We also experimented with different training data sizes in order to see how this affects the resulting models." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-149", "text": "We trained all networks on a training data size ranging from 100, 000 to the full 1, 000, 000 examples." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-150", "text": "The graph in Figure 3 shows the Recall@1 for all the three models (reported on the test data)." 
}, { "sent_id": "bb2609c568540390a560757dd40b32-C001-151", "text": "There are two main observations here: (i) CNNs outperform recurrent models if the training dataset is small." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-152", "text": "We believe that this is mostly due to the max operation performed on top of the feature maps." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-153", "text": "Thanks to the simplicity of this operation, the model does not over-fit the data and generalizes better when learned on small training datasets." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-154", "text": "On the other hand, the simplicity of the operation does not allow the model to properly handle more complicated dependencies (such as the order in which the n-grams occur in the text), thus recurrent models perform better given enough data; (ii) the recurrent models have not made its peak yet, suggesting that adding more training data would improve the model's accuracy." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-155", "text": "This agrees with Figure 3 of the previous evaluation [1] ." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-156", "text": "Figure 3 : Training data size ranging from 100, 000 to the full 1, 000, 000 examples (X axis) and the resulting Recall@1 (Y axis)." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-157", "text": "The CNN has 500, 100 and 100 filters of length 1, 2 and 3." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-158", "text": "The LSTM and Bi-LSTM has both 300 hidden units in each recurrent layer." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-159", "text": "----------------------------------" }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-160", "text": "**FUTURE WORK**" }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-161", "text": "In our future work, we plan to investigate applicability of neural networks architectures extended with memory (e.g., [15, 16, 17] ) on this task." 
}, { "sent_id": "bb2609c568540390a560757dd40b32-C001-162", "text": "It is an appealing idea to bootstrap the system with external source of information (e.g., user manual or man pages) to help the system pick the right answer." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-163", "text": "For successful application of this paradigm in the domain of reinforcement learning, see [18] ." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-164", "text": "An alternative direction for future research might be to extend the model with attention [19] over sentences in the dialog context." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-165", "text": "This would allow the model to explain which facts in the context were the most important for its prediction." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-166", "text": "Therefore, the prediction could be better interpreted by a human." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-167", "text": "Additional accuracy improvements might be also achieved by different text pre-processing pipelines." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-168", "text": "For instance, in the current dataset all named entities were replaced with generic tags, which could possibly harm the performance." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-171", "text": "In this work we achieved a new state-of-the-art results on the next utterance ranking problem recently introduced in [1] ." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-172", "text": "The best performing system is an ensemble of multiple diverse neural networks." }, { "sent_id": "bb2609c568540390a560757dd40b32-C001-173", "text": "In the future, we plan to use our system as a base for more complicated models going beyond the standard neural network paradigm." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "bb2609c568540390a560757dd40b32-C001-10" ], [ "bb2609c568540390a560757dd40b32-C001-19" ], [ "bb2609c568540390a560757dd40b32-C001-40" ], [ "bb2609c568540390a560757dd40b32-C001-50" ], [ "bb2609c568540390a560757dd40b32-C001-126" ], [ "bb2609c568540390a560757dd40b32-C001-154", "bb2609c568540390a560757dd40b32-C001-155" ] ], "cite_sentences": [ "bb2609c568540390a560757dd40b32-C001-10", "bb2609c568540390a560757dd40b32-C001-19", "bb2609c568540390a560757dd40b32-C001-40", "bb2609c568540390a560757dd40b32-C001-50", "bb2609c568540390a560757dd40b32-C001-126", "bb2609c568540390a560757dd40b32-C001-155" ] }, "@SIM@": { "gold_contexts": [ [ "bb2609c568540390a560757dd40b32-C001-107" ] ], "cite_sentences": [ "bb2609c568540390a560757dd40b32-C001-107" ] }, "@USE@": { "gold_contexts": [ [ "bb2609c568540390a560757dd40b32-C001-107" ], [ "bb2609c568540390a560757dd40b32-C001-126" ] ], "cite_sentences": [ "bb2609c568540390a560757dd40b32-C001-107", "bb2609c568540390a560757dd40b32-C001-126" ] }, "@DIF@": { "gold_contexts": [ [ "bb2609c568540390a560757dd40b32-C001-171" ] ], "cite_sentences": [ "bb2609c568540390a560757dd40b32-C001-171" ] }, "@EXT@": { "gold_contexts": [ [ "bb2609c568540390a560757dd40b32-C001-171" ] ], "cite_sentences": [ "bb2609c568540390a560757dd40b32-C001-171" ] } } }, "ABC_831342435ca0a4695e2a7f149891e4_20": { "x": [ { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-2", "text": "We present an attention-based sequence-to-sequence neural network which can directly translate speech from one language into speech in another language, without relying on an intermediate text representation." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-3", "text": "The network is trained end-to-end, learning to map speech spectrograms into target spectrograms in another language, corresponding to the translated content (in a different canonical voice)." 
}, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-4", "text": "We further demonstrate the ability to synthesize translated speech using the voice of the source speaker." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-5", "text": "We conduct experiments on two Spanish-to-English speech translation datasets, and find that the proposed model slightly underperforms a baseline cascade of a direct speech-to-text translation model and a text-to-speech synthesis model, demonstrating the feasibility of the approach on this very challenging task." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-6", "text": "----------------------------------" }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-8", "text": "We address the task of speech-to-speech translation (S2ST): translating speech in one language into speech in another." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-9", "text": "This application is highly beneficial for breaking down communication barriers between people who do not share a common language." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-10", "text": "Specifically, we investigate whether it is possible to train model to accomplish this task directly, without relying on an intermediate text representation." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-34", "text": "[12] transfered source speaker's voice to the synthesized translated speech by mapping hidden Markov model states from ASR to TTS." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-11", "text": "This is in contrast to conventional S2ST systems which are often broken down into three components: automatic speech recognition (ASR), text-to-text machine translation (MT), and text-to-speech (TTS) synthesis [1] [2] [3] [4] ." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-12", "text": "Cascaded systems have the potential problem of errors compounding between components, e.g. 
recognition errors leading to larger translation errors." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-13", "text": "Direct S2ST models avoid this issue by training to solve the task end-to-end." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-14", "text": "They also have advantages over cascaded systems in terms of reduced computational requirements and lower inference latency since only one decoding step is necessary, instead of three." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-15", "text": "In addition, direct models are naturally capable of retaining paralinguistic and nonlinguistic information during translation, e.g. maintaining the source speaker's voice, emotion, and prosody, in the synthesized translated speech." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-16", "text": "Finally, directly conditioning on the input speech makes it easy to learn to generate fluent pronunciations of words which do not need to be translated, such as names." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-17", "text": "However, solving the direct S2ST task is especially challenging for several reasons." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-18", "text": "Fully-supervised end-to-end training requires collecting a large set of input/output speech pairs." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-19", "text": "Such data are more difficult to collect compared to parallel text pairs for MT, or speech-text pairs for ASR or TTS." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-20", "text": "Decomposing into smaller tasks can take advantage of the lower training data requirements compared to a monolithic speech-to-speech model, and can result in a more robust system for a given training budget." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-21", "text": "Uncertain alignment between two spectrograms whose underlying spoken content differs also poses a major training challenge." 
}, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-22", "text": "* Equal contribution." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-23", "text": "In this paper we demonstrate Translatotron 1 , a direct speechto-speech translation model which is trained end-to-end." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-24", "text": "To facilitate training without predefined alignments, we leverage high level representations of the source or target content in the form of transcriptions, essentially multitask training with speechto-text tasks." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-25", "text": "However no intermediate text representation is used during inference." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-26", "text": "The model does not perform as well as a baseline cascaded system." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-27", "text": "Nevertheless, it demonstrates a proof of concept and serves as a starting point for future research." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-28", "text": "Extensive research has studied methods for combining different sub-systems within cascaded speech translation systems." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-29", "text": "[5, 6] gave MT access to the lattice of the ASR." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-30", "text": "[7, 8] integrated acoustic and translation models using a stochastic finite-state transducer which can decode the translated text directly using Viterbi search." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-31", "text": "For synthesis, [9] used unsupervised clustering to find F0-based prosody features and transfer intonation from source speech and target." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-32", "text": "[10] augmented MT to jointly predict translated words and emphasis, in order to improve expressiveness of the synthesized speech." 
}, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-33", "text": "[11] used a neural network to transfer duration and power from the source speech to the target." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-35", "text": "Similarly, recent work on neural TTS has focused on adapting to new voices with limited reference data [13] [14] [15] [16] ." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-36", "text": "Initial approaches to end-to-end speech-to-text translation (ST) [17, 18] performed worse than a cascade of an ASR model and an MT model." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-37", "text": "[19, 20] achieved better end-to-end performance by leveraging weakly supervised data with multitask learning." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-38", "text": "[21] further showed that use of synthetic training data can work better than multitask training." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-39", "text": "In this work we take advantage of both synthetic training targets and multitask training." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-40", "text": "The proposed model resembles recent sequence-to-sequence models for voice conversion, the task of recreating an utterance in another person's voice [22] [23] [24] ." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-41", "text": "For example, [23] proposes an attention-based model to generate spectrograms in the target voice based on input features (spectrogram concatenated with ASR bottleneck features) from the source voice." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-42", "text": "In contrast to S2ST, the input-output alignment for voice conversion is simpler and approximately monotonic." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-43", "text": "[23] also trains models that are specific to each input-output speaker pair (i.e. one-toone conversion), whereas we explore many-to-one and manyto-many speaker configurations." 
}, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-44", "text": "Finally, [25] demonstrated an attention-based direct S2ST model on a toy dataset with a 100 word vocabulary." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-45", "text": "In this work we train on real speech, including spontaneous telephone conversations, at a much larger scale." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-46", "text": "----------------------------------" }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-47", "text": "**SPEECH-TO-SPEECH TRANSLATION MODEL**" }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-48", "text": "An overview of the proposed Translatotron model architecture is shown in Figure 1 ." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-49", "text": "Following [15, 26] , it is composed of several separately trained components: 1) an attention-based sequence- to-sequence network (blue) which generates target spectrograms, 2) a vocoder (red) which converts target spectrograms to timedomain waveforms, and, 3) optionally, a pretrained speaker encoder (green) which can be used to condition the decoder on the identity of the source speaker, enabling cross-language voice conversion [27] simultaneously with translation." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-50", "text": "The sequence-to-sequence encoder stack maps 80-channel log-mel spectrogram input features into hidden states which are passed through an attention-based alignment mechanism to condition an autoregressive decoder, which predicts 1025-dim log spectrogram frames corresponding to the translated speech." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-51", "text": "Two optional auxiliary decoders, each with their own attention components, predict source and target phoneme sequences." 
}, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-52", "text": "Following recent speech translation [21] and recognition [28] models, the encoder is composed of a stack of 8 bidirectional LSTM layers." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-53", "text": "As shown in Fig. 1 , the final layer output is passed to the primary decoder, whereas intermediate activations are passed to auxiliary decoders predicting phoneme sequences." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-54", "text": "We hypothesize that early layers of the encoder are more likely to represent the source content well, while deeper layers might learn to encode more information about the target content." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-55", "text": "The spectrogram decoder uses an architecture similar to Tacotron 2 TTS model [26] , including pre-net, autoregressive LSTM stack, and post-net components." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-100", "text": "Overall, there remains a gap of 6 BLEU points to the baseline, indicating room for improvement." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-56", "text": "We make several changes to it in order to adapt to the more challenging S2ST task." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-57", "text": "We use multi-head additive attention [29] with 4 heads instead of location-sensitive attention, which shows better performance in our experiments." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-58", "text": "We also use a significantly narrower 32 dimensional pre-net bottleneck compared to 256-dim in [26] , which we find to be critical in picking up attention during training." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-59", "text": "We also use reduction factor [30] of 2, i.e. predicting two spectrogram frames for each decoding step." 
}, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-60", "text": "Finally, consistent with results on translation tasks [19, 31] , we find that using a deeper decoder containing 4 or 6 LSTM layers leads to good performance." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-61", "text": "We find that multitask training is critical in solving the task, which we accomplish by integrating auxiliary decoder networks to predict phoneme sequences corresponding to the source and/or target speech." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-62", "text": "Losses computed using these auxiliary recognition networks are used during training, which help the primary spectrogram decoder to learn attention." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-63", "text": "They are not used during inference." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-64", "text": "In contrast to the primary decoder, the auxiliary decoders use 2-layer LSTMs with single-head additive attention [32] ." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-65", "text": "All three decoders use attention dropout and LSTM zoneout regularization [33] , all with probability 0.1." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-66", "text": "Training Finally, in order to control the output speaker identity we incorporate an optional speaker encoder network as in [15] ." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-67", "text": "This network is discriminatively pretrained on a speaker verification task and is not updated during the training of Translatotron." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-68", "text": "We use the dvector V3 model from [37] , trained on a larger set of 851K speakers across 8 languages including English and Spanish." 
}, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-69", "text": "The model computes a 256-dim speaker embedding from the speaker reference utterance, which is passed into a linear projection layer (trained with the sequence-to-sequence model) to reduce the dimensionality to 16." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-70", "text": "This is critical to generalizing to source language speakers which are unseen during training." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-71", "text": "----------------------------------" }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-72", "text": "**EXPERIMENTS**" }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-73", "text": "We study two Spanish-to-English translation datasets: the large scale \"conversational\" corpus of parallel text and read speech pairs from [21] , and the Spanish Fisher corpus of telephone conversations and corresponding English translations [38] , which is smaller and more challenging due to the spontaneous and informal speaking style." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-74", "text": "In Sections 3.1 and 3.2, we synthesize target speech from the target transcript using a single (female) speaker English TTS system; In Section 3.4, we use real human target speech for voice transfer experiments on the conversational dataset." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-75", "text": "Models were implemented using the Lingvo framework [39] ." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-76", "text": "See Table 1 for dataset-specific hyperparameters." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-77", "text": "To evaluate speech-to-speech translation performance we compute BLEU scores [40] as an objective measure of speech intelligibility and translation quality, by using a pretrained ASR system to recognize the generated speech, and comparing the resulting transcripts to ground truth reference translations." 
}, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-78", "text": "Due to potential recognition errors (see Figure 2) , this can be thought of as a lower bound on the underlying translation quality." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-79", "text": "We use the 16k Word-Piece attention-based ASR model from [41] trained on the 960 hour LibriSpeech corpus [42] , which obtained word error rates of 4.7% and 13.4% on the test-clean and testother sets, respectively." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-80", "text": "In addition, we conduct listening tests to measure subjective speech naturalness mean opinion score (MOS), as well as speaker similarity MOS for voice transfer." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-81", "text": "----------------------------------" }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-82", "text": "**CONVERSATIONAL SPANISH-TO-ENGLISH**" }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-83", "text": "This proprietary dataset described in [21] was obtained by crowdsourcing humans to read the both sides of a conversational Spanish-English MT dataset." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-84", "text": "In this section, instead of using the human target speech, we use a TTS model to synthesize target In addition, we augment the input source speech by adding background noise and reverberation in the same manner as [21] ." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-85", "text": "The resulting dataset contains 979k parallel utterance pairs, containing 1.4k hours of source speech and 619 hours of synthesized target speech." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-86", "text": "The total target speech duration is much smaller because the TTS output is better endpointed, and contains fewer pauses." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-87", "text": "9.6k pairs are held out for testing." 
}, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-88", "text": "Input feature frames are created by stacking 3 adjacent frames of an 80-channel log-mel spectrogram as in [21] ." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-89", "text": "The speaker encoder was not used in these experiments since the target speech always came from the same speaker." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-90", "text": "Table 2 shows performance of the model trained using different combinations of auxiliary losses, compared to a baseline ST \u2192 TTS cascade model using a speech-to-text translation model [21] trained on the same data, and the same Tacotron 2 TTS model used to synthesize training targets." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-91", "text": "Note that the ground truth BLEU score is below 100 due to ASR errors during evaluation, or TTS failure when synthesizing the ground truth." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-92", "text": "Training without auxiliary losses leads to extremely poor performance." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-93", "text": "The model correctly synthesizes common words and simple phrases, e.g. translating \"hola\" to \"hello\"." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-94", "text": "However, it does not consistently translate full utterances." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-95", "text": "While it always generates plausible speech sounds in the target voice, the output can be independent of the input, composed of a string of nonsense syllables." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-96", "text": "This is consistent with failure to learn to attend to the input, and reflects the difficulty of the direct S2ST task." 
}, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-97", "text": "Integrating auxiliary phoneme recognition tasks helped regularize the encoder and enabled the model to learn attention, dramatically improving performance." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-98", "text": "The target phoneme PER is much higher than on source phonemes, reflecting the difficulty of the corresponding translation task." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-99", "text": "Training using both auxiliary tasks achieved the best quality, but the performance difference between different combinations is small." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-101", "text": "Nevertheless, the relatively narrow gap demonstrates the potential of the end-to-end approach." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-102", "text": "----------------------------------" }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-103", "text": "**FISHER SPANISH-TO-ENGLISH**" }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-104", "text": "This dataset contains about 120k parallel utterance pairs 2 , spanning 127 hours of source speech." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-105", "text": "Target speech is synthesized using Parallel WaveNet [43] using the same voice as the previous section." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-106", "text": "The result contains 96 hours of synthetic target speech." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-107", "text": "Following [19] , input features were constructed by stacking 80-channel log-mel spectrograms, with deltas and accelerations." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-108", "text": "Given the small size of the dataset compared to that in Sec. 3.1, we found that obtaining good performance required significantly 2 This is a subset of the Fisher data due to TTS errors on target text." 
}, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-109", "text": "more careful regularization and tuning." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-110", "text": "As shown in Table 1 , we used narrower encoder dimension of 256, a shallower 4-layer decoder, and added Gaussian weight noise to all LSTM weights as regularization, as in [19] ." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-111", "text": "The model was especially sensitive to the auxiliary decoder hyperparameters, with the best performance coming when passing activations from intermediate layers of the encoder stack as inputs to the auxiliary decoders, using slightly more aggressive dropout of 0.3, and decaying the auxiliary loss weight over the course of training in order to encourage the model training to fit the primary S2ST task." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-112", "text": "Experiment results are shown in Table 3 ." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-113", "text": "Once again using two auxiliary losses works best, but in contrast to Section 3.1, there is a large performance boost relative to using either one alone." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-114", "text": "Performance using only the source recognition loss is very poor, indicating that learning alignment on this task is especially difficult without strong supervision on the translation task." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-115", "text": "We found that 4-head attention works better than one head, unlike the conversational task, where both attention mechanisms had similar performance." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-116", "text": "Finally, as in [21] , we find that pretraining the bottom 6 encoder layers on an ST task improves BLEU scores by over 5 points." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-117", "text": "This is the best performing direct S2ST model, obtaining 76% of the baseline performance." 
}, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-118", "text": "----------------------------------" }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-119", "text": "**SUBJECTIVE EVALUATION OF SPEECH NATURALNESS**" }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-120", "text": "To evaluate synthesis quality of the best performing models from Tables 2 and 3 we use the framework from [15] to crowdsource 5-point MOS evaluations based on subjective listening tests." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-121", "text": "1k examples were rated for each dataset, each one by a single rater." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-122", "text": "Although this evaluation is expected to be independent of the correctness of the translation, translation errors can result in low scores for examples raters describe as \"not understandable\"." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-123", "text": "Results are shown in Table 4 , comparing different vocoders where results with Griffin-Lim correspond to identical model configurations as Sections 3.1 and 3.2." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-146", "text": "This is because the task of synthesizing arbitrary speakers is more difficult; the training targets are much noisier and training set is much smaller; and the ASR model used for evaluation makes more errors on the noisy, multispeaker targets." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-147", "text": "In terms of BLEU score, the difference between conditioning on ground truth and random targets is very small, verifying that content leak is not a concern (in part due to the low speaker embedding dimension)." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-148", "text": "However conditioning on the source trails by 1.8 BLEU points, reflecting the mismatch in conditioning language between the training and inference configurations." 
}, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-149", "text": "Naturalness MOS scores are close in all cases." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-150", "text": "However, conditioning on the source speaker significantly reduces similarity MOS by 1.4 points." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-151", "text": "Again this suggests that using English speaker embeddings during training does not generalize well to Spanish speakers." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-152", "text": "----------------------------------" }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-153", "text": "**CONCLUSIONS**" }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-154", "text": "We present a direct speech-to-speech translation model, trained end-to-end." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-155", "text": "We find that it is important to use speech transcripts during training, but no intermediate speech transcription is necessary for inference." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-156", "text": "Exploring alternate training strategies which alleviate this requirement is an interesting direction for future work." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-157", "text": "The model achieves high translation quality on two Spanish-to-English datasets, although performance is not as good as a baseline cascade of ST and TTS models." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-158", "text": "In addition, we demonstrate a variant which simultaneously transfers the source speaker's voice to the translated speech." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-159", "text": "The voice transfer does not work as well as in a similar TTS context [15] , reflecting the difficulty of the cross-language voice transfer task, as well as evaluation [44] ." 
}, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-160", "text": "Potential strategies to improve voice transfer performance include improving the speaker encoder by adding a language adversarial loss, or by incorporating a cycle-consistency term [13] into the S2ST loss." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-161", "text": "Other future work includes utilizing weakly supervision to scale up training with synthetic data [21] or multitask learning [19, 20] , and transferring prosody and other acoustic factors from the source speech to the translated speech following [45] [46] [47] ." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-162", "text": "----------------------------------" }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-163", "text": "**ACKNOWLEDGEMENTS**" }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-124", "text": "As expected, using WaveRNN vocoders dramatically improves ratings over Griffin-Lim into the \"Very Good\" range (above 4.0)." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-125", "text": "Note that it is most fair to compare the Griffin-Lim results to the ground truth training targets since they were generated using corresponding lower quality vocoders." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-126", "text": "In such a comparison it is clear that the S2ST models do not score as highly as the ground truth." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-127", "text": "Finally, we note the similar performance gap between Translatotron and the baseline under this evaluation." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-128", "text": "In part, this is a consequence of the different types of errors made by the two models." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-129", "text": "For example, Translatotron sometimes mispronounces words, especially proper nouns, using pronunciations from the source language, e.g. 
mispronouncing the /ae/ vowel in \"Dan\" as /ah/, consistent with Spanish but sounding less natural to English listeners, whereas by construction, the baseline consistently projects results to English." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-130", "text": "Figure 2 demonstrates other differences in behavior, where Translatotron reproduces the input \"eh\" disfluency (transcribed as \"a\", between 0.4 \u2212 0.8 sec in the bottom row of the figure), but the cascade does not." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-131", "text": "It is also interesting to note that the cascade translates \"Guillermo\" to its English form \"William\", whereas Translatotron speaks the Spanish name (although the ASR model mistranscribes it as \"of the ermo\"), suggesting that the direct model might have a bias toward more directly reconstructing the input." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-132", "text": "Similarly, in example 7 on the companion page Translatotron reconstructs \"pasejo\" as \"passages\" instead of \"tickets\", potentially reflecting a bias for cognates." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-133", "text": "We leave detailed analysis to future work." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-134", "text": "----------------------------------" }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-135", "text": "**CROSS LANGUAGE VOICE TRANSFER**" }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-136", "text": "In our final experiment, we synthesize translated speech using the voice of the source speaker by training the full model depicted in Figure 1 ." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-137", "text": "The speaker encoder is conditioned on the ground truth target speaker during training." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-138", "text": "We use a subset of the data from Sec. 3.1 for which we have paired source and target recordings." 
}, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-139", "text": "Note that the source and target speakers for each pair are always different -the data was not collected from bilingual speakers." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-140", "text": "This dataset contains 606k utterance pairs, resampled to 16 kHz, with 863 and 493 hours of source and target speech, respectively; 6.3k pairs, a subset of that from Sec. 3.1, are held out for testing." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-141", "text": "Since target recordings contained noise, we apply the denoising and volume normalization from [15] to improve output quality." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-142", "text": "Table 5 compares performance using different conditioning Ground truth 59.9 4.10 \u00b1 0.06 -strategies." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-143", "text": "The top row transfers the source speaker's voice to the translated speech, while row two is a \"cheating\" configuration since the speaker embedding can potentially leak information about the target content to the decoder." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-144", "text": "To verify that this does not negatively impact performance we also condition on random target utterances in row three." }, { "sent_id": "831342435ca0a4695e2a7f149891e4-C001-145", "text": "In all cases performance is worse than models trained on synthetic targets in Tables 2 and 4 ." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "831342435ca0a4695e2a7f149891e4-C001-38" ] ], "cite_sentences": [ "831342435ca0a4695e2a7f149891e4-C001-38" ] }, "@SIM@": { "gold_contexts": [ [ "831342435ca0a4695e2a7f149891e4-C001-52" ], [ "831342435ca0a4695e2a7f149891e4-C001-73" ], [ "831342435ca0a4695e2a7f149891e4-C001-83" ], [ "831342435ca0a4695e2a7f149891e4-C001-88" ], [ "831342435ca0a4695e2a7f149891e4-C001-90" ], [ "831342435ca0a4695e2a7f149891e4-C001-116" ] ], "cite_sentences": [ "831342435ca0a4695e2a7f149891e4-C001-52", "831342435ca0a4695e2a7f149891e4-C001-73", "831342435ca0a4695e2a7f149891e4-C001-83", "831342435ca0a4695e2a7f149891e4-C001-88", "831342435ca0a4695e2a7f149891e4-C001-90", "831342435ca0a4695e2a7f149891e4-C001-116" ] }, "@USE@": { "gold_contexts": [ [ "831342435ca0a4695e2a7f149891e4-C001-52" ], [ "831342435ca0a4695e2a7f149891e4-C001-73" ], [ "831342435ca0a4695e2a7f149891e4-C001-83" ], [ "831342435ca0a4695e2a7f149891e4-C001-90" ] ], "cite_sentences": [ "831342435ca0a4695e2a7f149891e4-C001-52", "831342435ca0a4695e2a7f149891e4-C001-73", "831342435ca0a4695e2a7f149891e4-C001-83", "831342435ca0a4695e2a7f149891e4-C001-90" ] }, "@DIF@": { "gold_contexts": [ [ "831342435ca0a4695e2a7f149891e4-C001-84" ] ], "cite_sentences": [ "831342435ca0a4695e2a7f149891e4-C001-84" ] }, "@EXT@": { "gold_contexts": [ [ "831342435ca0a4695e2a7f149891e4-C001-84" ] ], "cite_sentences": [ "831342435ca0a4695e2a7f149891e4-C001-84" ] }, "@FUT@": { "gold_contexts": [ [ "831342435ca0a4695e2a7f149891e4-C001-161" ] ], "cite_sentences": [ "831342435ca0a4695e2a7f149891e4-C001-161" ] } } }, "ABC_4f5a25d7a961e7e61c2caef81418e0_20": { "x": [ { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-2", "text": "Abstractive Sentence Summarization generates a shorter version of a given sentence while attempting to preserve its meaning." 
}, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-3", "text": "We introduce a conditional recurrent neural network (RNN) which generates a summary of an input sentence." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-4", "text": "The conditioning is provided by a novel convolutional attention-based encoder which ensures that the decoder focuses on the appropriate input words at each step of generation." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-5", "text": "Our model relies only on learned features and is easy to train in an end-to-end fashion on large data sets." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-6", "text": "Our experiments show that the model significantly outperforms the recently proposed state-of-the-art method on the Gigaword corpus while performing competitively on the DUC-2004 shared task." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-7", "text": "----------------------------------" }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-9", "text": "Generating a condensed version of a passage while preserving its meaning is known as text summarization." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-10", "text": "Tackling this task is an important step towards natural language understanding." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-11", "text": "Summarization systems can be broadly classified into two categories." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-12", "text": "Extractive models generate summaries by cropping important segments from the original text and putting them together to form a coherent summary." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-13", "text": "Abstractive models generate summaries from scratch without being constrained to reuse phrases from the original text." 
}, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-14", "text": "In this paper we propose a novel recurrent neural network for the problem of abstractive sentence summarization." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-15", "text": "Inspired by the recently proposed architectures for machine translation , our model consists of a conditional recurrent neural network, which acts as a decoder to generate the summary of an input sentence, much like a standard recurrent language model." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-16", "text": "In addition, at every time step the decoder also takes a conditioning input which is the output of an encoder module." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-17", "text": "Depending on the current state of the RNN, the encoder computes scores over the words in the input sentence." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-18", "text": "These scores can be interpreted as a soft alignment over the input text, informing the decoder which part of the input sentence it should focus on to generate the next word." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-19", "text": "Both the decoder and encoder are jointly trained on a data set consisting of sentence-summary pairs." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-20", "text": "Our model can be seen as an extension of the recently proposed model for the same problem by Rush et al. (2015) ." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-21", "text": "While they use a feed-forward neural language model for generation, we use a recurrent neural network." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-22", "text": "Furthermore, our encoder is more sophisticated, in that it explicitly encodes the position information of the input words." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-23", "text": "Lastly, our encoder uses a convolutional network to encode input words." 
}, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-24", "text": "These extensions result in improved performance." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-25", "text": "The main contribution of this paper is a novel convolutional attention-based conditional recurrent neural network model for the problem of abstractive sentence summarization." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-26", "text": "Empirically we show that our model beats the state-of-the-art systems of Rush et al. (2015) on multiple data sets." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-27", "text": "Particularly notable is the fact that even with a simple generation module, which does not use any extractive feature tuning, our model manages to significantly outperform their ABS+ system on the Gigaword data set and is comparable on the DUC-2004 task." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-28", "text": "----------------------------------" }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-29", "text": "**93**" }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-30", "text": "While there is a large body of work for generating extractive summaries of sentences (Jing, 2000; Knight and Marcu, 2002; McDonald, 2006; Clarke and Lapata, 2008; Filippova and Altun, 2013; Filippova et al., 2015) , there has been much less research on abstractive summarization." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-31", "text": "A count-based noisy-channel machine translation model was proposed for the problem in Banko et al. (2000) ." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-32", "text": "The task of abstractive sentence summarization was later formalized around the DUC-2003 and DUC-2004 competitions (Over et al., 2007 , where the TOP-IARY system (Zajic et al., 2004) was the state-ofthe-art." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-33", "text": "More recently Cohn and Lapata (2008) and later Woodsend et al. 
(2010) proposed systems which made heavy use of the syntactic features of the sentence-summary pairs." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-34", "text": "Later, along the lines of Banko et al. (2000) , MOSES was used directly as a method for text simplification by Wubben et al. (2012) ." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-35", "text": "Other works which have recently been proposed for the problem of sentence summarization include (Galanis and Androutsopoulos, 2010; Napoles et al., 2011; Cohn and Lapata, 2013) ." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-36", "text": "Very recently Rush et al. (2015) proposed a neural attention model for this problem using a new data set for training and showing state-of-the-art performance on the DUC tasks." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-37", "text": "Our model can be seen as an extension of their model." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-38", "text": "----------------------------------" }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-39", "text": "**ATTENTIVE RECURRENT ARCHITECTURE**" }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-40", "text": "Let x denote the input sentence consisting of a sequence of M words x = [x_1, . . . , x_M], where each word x_i is part of the vocabulary V, of size |V| = V." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-41", "text": "Our task is to generate a target sequence y = [y_1, . . . , y_N] of N words, where N < M, such that the meaning of x is preserved: y = argmax_y P(y|x), where y is a random variable denoting a sequence of N words." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-42", "text": "Typically the conditional probability is modeled by a parametric function with parameters \u03b8: P(y|x) = P(y|x; \u03b8)."
}, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-43", "text": "Training involves finding the \u03b8 which maximizes the conditional probability of sentence-summary pairs in the training corpus." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-44", "text": "If the model is trained to generate the next word of the summary, given the previous words, then the above conditional can be factorized into a product of individual conditional probabilities:" }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-45", "text": "In this work we model this conditional probability using an RNN Encoder-Decoder architecture, inspired by and subsequently extended in ." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-46", "text": "We call our model RAS (Recurrent Attentive Summarizer)." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-47", "text": "----------------------------------" }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-48", "text": "**RECURRENT DECODER**" }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-49", "text": "The above conditional is modeled using an RNN:" }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-50", "text": "where h t is the hidden state of the RNN:" }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-51", "text": "Here c t is the output of the encoder module (detailed in \u00a73.2)." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-52", "text": "It can be seen as a context vector which is computed as a function of the current state h t\u22121 and the input sequence x." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-53", "text": "Our Elman RNN takes the following form (Elman, 1990):" }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-54", "text": "where \u03c3 is the sigmoid function and \u03c1 is the softmax, defined as: \u03c1(o t ) = e ot / j e o j and W i (i = 1, . . . , 5) are matrices of learnable parameters of sizes W {1,2,3} \u2208 R d\u00d7d and W {4,5} \u2208 R d\u00d7V ." 
}, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-55", "text": "The LSTM decoder is defined as (Hochreiter and Schmidhuber, 1997) :" }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-56", "text": "Operator refers to component-wise multiplication, and W i (i = 1, . . . , 14) are matrices of learnable parameters of sizes W {1,...,12} \u2208 R d\u00d7d , and W {13,14} \u2208 R d\u00d7V ." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-57", "text": "----------------------------------" }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-58", "text": "**ATTENTIVE ENCODER**" }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-59", "text": "We now give the details of the encoder which computes the context vector c t for every time step t of the decoder above." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-60", "text": "With a slight overload of notation, for an input sentence x we denote by x i the d dimensional learnable embedding of the i-th word (x i \u2208 R d )." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-61", "text": "In addition the position i of the word x i is also associated with a learnable embedding l i of size d (l i \u2208 R d )." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-62", "text": "Then the full embedding for i-th word in x is given by a i = x i + l i . Let us denote by B k \u2208 R q\u00d7d a learnable weight matrix which is used to convolve over the full embeddings of consecutive words." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-63", "text": "Let there be d such matrices (k \u2208 {1, . . . , d})." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-64", "text": "The output of convolution is given by:" }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-65", "text": "where b k j is the j-th column of the matrix B k ." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-66", "text": "Thus the d dimensional aggregate embedding vector z i is defined as z i = [z i1 , . . . , z id ]." 
}, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-67", "text": "Note that each word x i in the input sequence is associated with one aggregate embedding vector z i ." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-68", "text": "The vectors z i can be seen as a representation of the word which captures the position in which it occurs in the sentence and also the context in which it appears in the sentence." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-69", "text": "In our experiments the width q of the convolution matrix B k was set to 5." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-70", "text": "To account for words at the boundaries of x we first pad the sequence on both sides with dummy words before computing the aggregate vectors z i 's." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-71", "text": "Given these aggregate vectors of words, we compute the context vector c t (the encoder output) as:" }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-72", "text": "where the weights \u03b1 j,t\u22121 are computed as" }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-73", "text": "----------------------------------" }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-74", "text": "**TRAINING AND GENERATION**" }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-75", "text": "Given a training corpus S = {(x i , y i )} S i=1 of S sentence-summary pairs, the above model can be trained end-to-end using stochastic gradient descent by minimizing the negative conditional log likelihood of the training data with respect to \u03b8:" }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-76", "text": "where the parameters \u03b8 constitute the parameters of the decoder and the encoder." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-77", "text": "Once the parametric model is trained we generate a summary for a new sentence x through a wordbased beam search such that P (y|x) is maximized, argmax P (y t |{y 1 , . . . , y t\u22121 }, x)." 
}, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-78", "text": "The search is parameterized by the number of paths k that are pursued at each time step." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-79", "text": "----------------------------------" }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-80", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-81", "text": "----------------------------------" }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-82", "text": "**DATASETS AND EVALUATION**" }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-83", "text": "Our models are trained on the annotated version of the Gigaword corpus (Graff et al., 2003; Napoles et al., 2012) and we use only the annotations for tokenization and sentence separation while discarding other annotations such as tags and parses." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-84", "text": "We pair the first sentence of each article with its headline to form sentence-summary pairs." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-85", "text": "The data is pre-processed in the same way as Rush et al. (2015) and we use the same splits for training, validation, and testing." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-86", "text": "For Gigaword we report results on the same randomly held-out test set of 2000 sentence-summary pairs as (Rush et al., 2015) ." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-87", "text": "1 We also evaluate our models on the DUC-2004 evaluation data set comprising 500 pairs (Over et al., 2007) ." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-88", "text": "Our evaluation is based on three variants of ROUGE (Lin, 2004) , namely, ROUGE-1 (unigrams), ROUGE-2 (bigrams), and ROUGE-L (longest-common substring)." 
}, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-89", "text": "----------------------------------" }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-90", "text": "**ARCHITECTURAL CHOICES**" }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-91", "text": "We implemented our models in the Torch library (http://torch.ch/) 2 ." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-92", "text": "To optimize our loss (Equation 5) we used stochastic gradient descent with mini-batches of size 32." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-93", "text": "During training we measure the perplexity of the summaries in the validation set and adjust our hyper-parameters, such as the learning rate, based on this number." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-94", "text": "For the decoder we experimented with both the Elman RNN and the Long-Short Term Memory (LSTM) architecture (as discussed in \u00a7 3.1)." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-95", "text": "We chose hyper-parameters based on a grid search and picked the one which gave the best perplexity on the validation set." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-96", "text": "In particular we searched over the number of hidden units H of the recurrent layer, the learning rate \u03b7, the learning rate annealing schedule \u03b3 (the factor by which to decrease \u03b7 if the validation perplexity increases), and the gradient clipping threshold \u03ba." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-97", "text": "Our final Elman architecture (RASElman) uses a single layer with H = 512, \u03b7 = 0.5, \u03b3 = 2, and \u03ba = 10." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-98", "text": "The LSTM model (RAS-LSTM) also has a single layer with H = 512, \u03b7 = 0.1, \u03b3 = 2, and \u03ba = 10." 
}, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-99", "text": "----------------------------------" }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-100", "text": "**RESULTS**" }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-101", "text": "On the Gigaword corpus we evaluate our models in terms of perplexity on a held-out set." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-102", "text": "We then pick the model with best perplexity on the held-out set and use it to compute the F1-score of ROUGE-1, ROUGE-2, and ROUGE-L on the test sets, all of which we report." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-103", "text": "For the DUC corpus however, inline with the standard, we report the recall-only ROUGE." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-104", "text": "As baseline we use the state-of-the-art attention-based system (ABS) of Rush et al. (2015) which relies on a feed-forward network decoder." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-105", "text": "Additionally, we compare to an enhanced version of their system (ABS+), which relies on a range of separate extractive summarization features that are added as log-linear features in a secondary learning step with minimum error rate training (Och, 2003) ." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-106", "text": "Table 1 shows that both our RAS-Elman and RAS-LSTM models achieve lower perplexity than ABS as well as other models reported in Rush et al. (2015) ." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-107", "text": "The RAS-LSTM performs slightly worse than RAS-Elman, most likely due to over-fitting." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-108", "text": "We attribute this to the relatively simple nature of this task which can be framed as English-to-English translation with few long-term dependencies." 
}, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-109", "text": "The ROUGE results (Table 2) show that our models comfortably outperform both ABS and ABS+ by a wide margin on all metrics." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-110", "text": "This is even the case when we rely only on very fast greedy search (k = 1), while as ABS uses a much wider beam of size k = 50; the stronger ABS+ system also uses additional extractive features which our model does not." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-111", "text": "These features cause ABS+ to copy 92% of words from the input into the summary, whereas our model copies only 74% of the words leading to more abstractive summaries." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-112", "text": "On DUC-2004 we report recall ROUGE as is customary on this dataset." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-113", "text": "The results (Table 3) show that our models are better than ABS+." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-114", "text": "However the improvements are smaller than for Gi-gaword which is likely due to two reasons: First, tokenization of DUC-2004 differs slightly from our training corpus." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-115", "text": "Second, headlines in Gigaword are much shorter than in DUC-2004." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-116", "text": "For the sake of completeness we also compare our models to the recently proposed standard Neural Machine Translation (NMT) systems." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-117", "text": "In particular, we compare to a smaller re-implementation of the attentive stacked LSTM encoder-decoder of Luong et al. (2015) ." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-118", "text": "Our implementation uses two-layer LSTMs for the encoder-decoder with 500 hidden units in each layer." 
}, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-119", "text": "Tables 2 and 3 report ROUGE scores on the two data sets." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-120", "text": "From the tables we observe that the proposed RAS-Elman model is able to match the performance of the NMT model of Luong at al. (2015) ." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-121", "text": "This is noteworthy because RAS-Elman is significantly simpler than the NMT model at multiple levels." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-122", "text": "First, the encoder used by RAS-Elman is extremely light-weight (attention over the convolutional representation of the input words), compared to Luong's (a 2 hidden layer LSTM)." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-123", "text": "Second, the decoder used by RAS-Elman is a single layer standard (Elman) RNN as opposed to a multi-layer LSTM." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-124", "text": "In an independent work, Nallapati et." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-125", "text": "al (2016) also trained a collection of standard NMT models and report numbers in the same ballpark as RAS-Elman on both datasets." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-126", "text": "In order to better understand which component of the proposed architecture is responsible for the improvements, we trained the recurrent model with Rush et." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-127", "text": "al., (2015) 's ABS encoder on a subset of the Gigaword dataset." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-128", "text": "The ABS encoder, which does not have the position features, achieves a final validation perplexity of 38 compared to 29 for the proposed encoder, which uses position features as well as context information." 
}, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-129", "text": "This clearly shows the benefits of using the position feature in the proposed encoder." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-130", "text": "Finally in Figure 1 we highlight anecdotal examples of summaries produced by the RAS-Elman system on the Gigaword dataset." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-131", "text": "The first two examples highlight typical improvements in the RAS model over ABS+." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-132", "text": "Generally the model produces more fluent summaries and is better able to capture the main actors of the input." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-133", "text": "For instance in Sentence 1, RASElman correctly distinguishes the actions of \"pepe\" from \"ferreira\", and in Sentence 2 it identifies the correct role of the \"think tank\"." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-134", "text": "The final two ex-I(1): brazilian defender pepe is out for the rest of the season with a knee injury , his porto coach jesualdo ferreira said saturday ." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-135", "text": "G: football : pepe out for season A+: ferreira out for rest of season with knee injury R: brazilian defender pepe out for rest of season with knee injury I(2): economic growth in toronto will suffer this year because of sars , a think tank said friday as health authorities insisted the illness was under control in canada 's largest city ." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-136", "text": "G: sars toll on toronto economy estimated at c$ # billion A+: think tank under control in canada 's largest city R: think tank says economic growth in toronto will suffer this year I(3): colin l. powell said nothing -a silence that spoke volumes to many in the white house on thursday morning ." 
}, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-137", "text": "G: in meeting with former officials bush defends iraq policy A+: colin powell speaks volumes about silence in white house R: powell speaks volumes on the white house I(4): an international terror suspect who had been under a controversial loose form of house arrest is on the run , british home secretary john reid said tuesday ." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-138", "text": "G: international terror suspect slips net in britain A+: reid under house arrest terror suspect on the run R: international terror suspect under house arrest Figure 1 : Example sentence summaries produced on Gigaword." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-139", "text": "I is the input, G is the true headline, A is ABS+, and R is RAS-ELMAN." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-140", "text": "amples highlight typical mistakes of the models." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-141", "text": "In Sentence 3 both models take literally the figurative use of the idiom \"a silence that spoke volumes,\" and produce fluent but nonsensical summaries." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-142", "text": "In Sentence 4 the RAS model mistakes the content of a relative clause for the main verb, leading to a summary with the opposite meaning of the input." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-143", "text": "These difficult cases are somewhat rare in the Gigaword, but they highlight future challenges for obtaining human-level sentence summary." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-144", "text": "----------------------------------" }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-145", "text": "**CONCLUSION**" }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-146", "text": "We extend the state-of-the-art model for abstractive sentence summarization (Rush et al., 2015) to a recurrent neural network architecture." 
}, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-147", "text": "Our model is a simplified version of the encoder-decoder framework for machine translation ." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-148", "text": "The model is trained on the Gigaword corpus to generate headlines based on the first line of each news article." }, { "sent_id": "4f5a25d7a961e7e61c2caef81418e0-C001-149", "text": "We comfortably outperform the previous state-of-the-art on both Gigaword data and the DUC-2004 challenge even though our model does not rely on additional extractive features." } ], "y": { "@EXT@": { "gold_contexts": [ [ "4f5a25d7a961e7e61c2caef81418e0-C001-20" ], [ "4f5a25d7a961e7e61c2caef81418e0-C001-146" ] ], "cite_sentences": [ "4f5a25d7a961e7e61c2caef81418e0-C001-20", "4f5a25d7a961e7e61c2caef81418e0-C001-146" ] }, "@DIF@": { "gold_contexts": [ [ "4f5a25d7a961e7e61c2caef81418e0-C001-20" ], [ "4f5a25d7a961e7e61c2caef81418e0-C001-106" ], [ "4f5a25d7a961e7e61c2caef81418e0-C001-146" ] ], "cite_sentences": [ "4f5a25d7a961e7e61c2caef81418e0-C001-20", "4f5a25d7a961e7e61c2caef81418e0-C001-106", "4f5a25d7a961e7e61c2caef81418e0-C001-146" ] }, "@SIM@": { "gold_contexts": [ [ "4f5a25d7a961e7e61c2caef81418e0-C001-26" ], [ "4f5a25d7a961e7e61c2caef81418e0-C001-85" ], [ "4f5a25d7a961e7e61c2caef81418e0-C001-86" ], [ "4f5a25d7a961e7e61c2caef81418e0-C001-104" ] ], "cite_sentences": [ "4f5a25d7a961e7e61c2caef81418e0-C001-26", "4f5a25d7a961e7e61c2caef81418e0-C001-85", "4f5a25d7a961e7e61c2caef81418e0-C001-86", "4f5a25d7a961e7e61c2caef81418e0-C001-104" ] }, "@BACK@": { "gold_contexts": [ [ "4f5a25d7a961e7e61c2caef81418e0-C001-36" ] ], "cite_sentences": [ "4f5a25d7a961e7e61c2caef81418e0-C001-36" ] }, "@USE@": { "gold_contexts": [ [ "4f5a25d7a961e7e61c2caef81418e0-C001-85" ], [ "4f5a25d7a961e7e61c2caef81418e0-C001-86" ], [ "4f5a25d7a961e7e61c2caef81418e0-C001-104" ] ], "cite_sentences": [ "4f5a25d7a961e7e61c2caef81418e0-C001-85", "4f5a25d7a961e7e61c2caef81418e0-C001-86", 
"4f5a25d7a961e7e61c2caef81418e0-C001-104" ] } } }, "ABC_73eaa7d5a54b2d60bd8128e0270683_20": { "x": [ { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-2", "text": "Conventional wisdom is that hand-crafted features are redundant for deep learning models, as they already learn adequate representations of text automatically from corpora." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-3", "text": "In this work, we test this claim by proposing a new method for exploiting handcrafted features as part of a novel hybrid learning approach, incorporating a feature auto-encoder loss component." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-4", "text": "We evaluate on the task of named entity recognition (NER), where we show that including manual features for partof-speech, word shapes and gazetteers can improve the performance of a neural CRF model." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-5", "text": "We obtain a F 1 of 91.89 for the CoNLL-2003 English shared task, which significantly outperforms a collection of highly competitive baseline models." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-6", "text": "We also present an ablation study showing the importance of autoencoding, over using features as either inputs or outputs alone, and moreover, show including the autoencoder components reduces training requirements to 60%, while retaining the same predictive accuracy." 
}, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-7", "text": "----------------------------------" }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-9", "text": "Deep neural networks have been proven to be a powerful framework for natural language processing, and have demonstrated strong performance on a number of challenging tasks, ranging from machine translation (Cho et al., 2014b,a) , to text categorisation (Zhang et al., 2015; Joulin et al., 2017; Liu et al., 2018b) ." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-10", "text": "Not only do such deep models outperform traditional machine learning methods, they also come with the benefit of not requiring difficult feature engineering." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-11", "text": "For instance, both Lample et al. (2016) and Ma and Hovy (2016) propose end-to-end models for sequence labelling task and achieve state-of-the-art results." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-12", "text": "* https://github.com/minghao-wu/CRF-AE \u2020 Work carried out at The University of Melbourne" }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-13", "text": "Orthogonal to the advances in deep learning is the effort spent on feature engineering." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-14", "text": "A representative example is the task of named entity recognition (NER), one that requires both lexical and syntactic knowledge, where, until recently, most models heavily rely on statistical sequential labelling models taking in manually engineered features (Florian et al., 2003; Chieu and Ng, 2002; Ando and Zhang, 2005) ." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-15", "text": "Typical features include POS and chunk tags, prefixes and suffixes, and external gazetteers, all of which represent years of accumulated knowledge in the field of computational linguistics." 
}, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-16", "text": "The work of Collobert et al. (2011) started the trend of feature engineering-free modelling by learning internal representations of compositional components of text (e.g., word embeddings)." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-17", "text": "Subsequent work has shown impressive progress through capturing syntactic and semantic knowledge with dense real-valued vectors trained on large unannotated corpora (Mikolov et al., 2013a,b; Pennington et al., 2014) ." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-18", "text": "Enabled by the powerful representational capacity of such embeddings and neural networks, feature engineering has largely been replaced with taking off-the-shelf pre-trained word embeddings as input, thereby making models fully end-to-end and the research focus has shifted to neural network architecture engineering." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-19", "text": "More recently, there has been increasing recognition of the utility of linguistic features Chen et al., 2017; Wu et al., 2017; Liu et al., 2018a) where such features are integrated to improve model performance." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-20", "text": "Inspired by this, taking NER as a case study, we investigate the utility of hand-crafted features in deep learning models, challenging conventional wisdom in an attempt to refute the utility of manually-engineered features." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-21", "text": "Of particular interest to this paper is the work by Ma and Hovy (2016) introduce a strong end-to-end model combining a bi-directional Long Short-Term Memory (Bi-LSTM) network with Convolutional Neural Network (CNN) character encoding in a Conditional Random Field (CRF)." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-22", "text": "Their model is highly capable of capturing not only word-but also characterlevel features." 
}, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-23", "text": "We extend this model by integrating an auto-encoder loss, allowing the model to take hand-crafted features as input and re-construct them as output, and show that, even with such a highly competitive model, incorporating linguistic features is still beneficial." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-24", "text": "Perhaps the closest to this study is the works by Ammar et al. (2014) and , who show how CRFs can be framed as auto-encoders in unsupervised or semisupervised settings." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-25", "text": "With our proposed model, we achieve strong performance on the CoNLL 2003 English NER shared task with an F 1 of 91.89, significantly outperforming an array of competitive baselines." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-26", "text": "We conduct an ablation study to better understand the impacts of each manually-crafted feature." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-27", "text": "Finally, we further provide an in-depth analysis of model performance when trained with varying amount of data and show that the proposed model is highly competent with only 60% of the training set." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-28", "text": "----------------------------------" }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-29", "text": "**METHODOLOGY**" }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-30", "text": "In this section, we first outline the model architecture, then the manually crafted features, and finally how they are incorporated into the model." 
}, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-31", "text": "----------------------------------" }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-32", "text": "**MODEL ARCHITECTURE**" }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-33", "text": "We build on a highly competitive sequence labelling model, namely Bi-LSTM-CNN-CRF, first introduced by Ma and Hovy (2016) ." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-34", "text": "Given an input sequence of x = {x 1 , x 2 , . . . , x T } of length T , the model is capable of tagging each input with a predicted label\u0177, resulting in a sequence of\u0177 = {\u0177 1 ,\u0177 2 , . . . ,\u0177 T } closely matching the gold label sequence y = {y 1 , y 2 , . . . , y T }." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-35", "text": "Here, we extend the model by incorporating an auto-encoder loss taking hand-crafted features as in/output, thereby forcing the model to preserve crucial information stored in such features and allowing us to evaluate the impacts of each feature on model performance." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-36", "text": "Specifically, our model, referred to as Neural-CRF+AE, consists of four major components: (1) a character-level CNN (char-CNN); (2) a word-level bi-directional LSTM (Bi-LSTM); (3) a conditional random field (CRF); and (4) an auto-encoder auxiliary loss." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-37", "text": "An illustration of the model architecture is presented in Figure 1 ." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-38", "text": "Zadrozny, 2014; Chiu and Nichols, 2016; Ma and Hovy, 2016) have demonstrated that CNNs are highly capable of capturing character-level features." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-39", "text": "Here, our character-level CNN is similar to that used in Ma and Hovy (2016) but differs in that we use a ReLU activation (Nair and Hinton, 2010) ." 
}, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-40", "text": "1" }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-41", "text": "----------------------------------" }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-42", "text": "**CHAR-CNN. PREVIOUS STUDIES (SANTOS AND**" }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-43", "text": "Bi-LSTM." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-44", "text": "We use a Bi-LSTM to learn contextual information of a sequence of words." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-45", "text": "As inputs to the Bi-LSTM, we first concatenate the pre-trained embedding of each word w i with its character-level representation c w i (the output of the char-CNN) and a vector of manually crafted features f i (described in Section 2.2):" }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-46", "text": "where [; ] denotes concatenation." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-47", "text": "The outputs of the forward and backward pass of the Bi-LSTM is then concatenated" }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-48", "text": "to form the output of the Bi-LSTM, where dropout is also applied." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-49", "text": "CRF." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-50", "text": "For sequence labelling tasks, it is intuitive and beneficial to utilise information carried between neighbouring labels to predict the best sequence of labels for a given sentence." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-51", "text": "Therefore, we employ a conditional random field layer (Lafferty et al., 2001) taking as input the output of the Bi-LSTM h i ." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-52", "text": "Training is carried out by maximising the log probability of the gold sequence: L CRF = log p(y|x) while decoding can be efficiently performed with the Viterbi algorithm." 
}, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-53", "text": "Auto-encoder loss." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-54", "text": "Alongside sequence labelling as the primary task, we also deploy, as auxiliary tasks, three auto-encoders for reconstructing the hand-engineered feature vectors." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-79", "text": "Exponential learning rate decay is applied every 5 epochs with a factor of 0.8." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-55", "text": "To this end, we add multiple independent fully-connected dense layers, all taking as input the Bi-LSTM output h i with each responsible for reconstructing a particular type of feature:f t i = \u03c3(W t h i ) where \u03c3 is the sigmoid activation function, t denotes the type of feature, and W t is a trainable parameter matrix." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-56", "text": "More formally, we define the auto-encoder loss as:" }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-57", "text": "Model training." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-58", "text": "Training is carried out by optimising the joint loss:" }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-59", "text": "where, in addition to L CRF , we also add the autoencoder loss, weighted by \u03bb t ." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-60", "text": "In all our experiments, we set \u03bb t to 1 for all ts." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-61", "text": "----------------------------------" }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-62", "text": "**HAND-CRAFTED FEATURES**" }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-63", "text": "We consider three categories of widely used features: (1) POS tags; (2) word shape; and (3) gazetteers and present an example in ] for the i-th word." 
}, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-64", "text": "In addition, we also experimented with including the label of the incoming dependency edge to each word as a feature, but observed performance deterioration on the development set." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-65", "text": "While we still study and analyse the impacts of this feature in Table 3 and Section 3.2, it is excluded from our model configuration (not considered as part of f i unless indicated otherwise)." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-66", "text": "----------------------------------" }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-67", "text": "**EXPERIMENTS**" }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-68", "text": "In this section, we present our experimental setup and results for name entity recognition over the CoNLL 2003 English NER shared task dataset (Tjong Kim Sang and De Meulder, 2003) ." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-69", "text": "----------------------------------" }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-70", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-71", "text": "Dataset." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-72", "text": "We use the CoNLL 2003 NER shared task dataset, consisting of 14,041/3,250/3,453 sentences in the training/development/test set respectively, all extracted from Reuters news articles during the period from 1996 to 1997." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-73", "text": "The dataset is annotated with four categories of name entities: PERSON, LOCATION, ORGANIZATION and MISC." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-74", "text": "We use the IOBES tagging scheme, as previous study have shown that this scheme provides a modest improvement to the model performance (Ratinov and Roth, 2009; Chiu and Nichols, 2016; Lample et al., 2016; Ma and Hovy, 2016) ." 
}, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-75", "text": "Model configuration." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-76", "text": "Following the work of Ma and Hovy (2016) , we initialise word embeddings with GloVe (Pennington et al., 2014 ) (300-dimensional, trained on a 6B-token corpus)." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-77", "text": "Character embeddings are 30-dimensional and randomly initialised with a uniform distribution in" }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-78", "text": "Parameters are optimised with stochastic gradient descent (SGD) with an initial learning rate of \u03b7 = 0.015 and momentum of 0.9." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-80", "text": "To reduce the impact of exploding gradients, we employ gradient clipping at 5.0 (Pascanu et al., 2013) ." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-81", "text": "We train our models on a single GeForce GTX TITAN X GPU." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-82", "text": "With the above hyper-parameter setting, training takes approximately 8 hours for a full run of 40 epochs." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-83", "text": "Evaluation." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-84", "text": "We measure model performance with the official CoNLL evaluation script and report span-level named entity F-score on the test set using early stopping based on the performance on the validation set." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-85", "text": "We report average F-scores and standard deviation over 5 runs for our model." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-86", "text": "Baseline." 
}, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-87", "text": "In addition to reporting a number of prior results of competitive baseline models, as listed in Table 2 , we also re-implement the Bi-LSTM-CNN-CRF model by Ma and Hovy (2016) (referred to as Neural-CRF in Table 2 ) and report its average performance." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-88", "text": "----------------------------------" }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-89", "text": "**RESULTS**" }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-90", "text": "The experimental results are presented in Table 2 . Observe that Neural-CRF+AE, trained either on the training set only or with the addition of the development set, achieves substantial improvements in F-score in both settings, superior to all but one of the benchmark models, highlighting the utility of hand-crafted features incorporated with the proposed auto-encoder loss." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-91", "text": "Compared against the Neural-CRF, a very strong model in itself, our model significantly improves performance, showing the positive impact of our technique for exploiting manually-engineered features." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-92", "text": "Although Peters et al. (2018) report a higher F-score using their ELMo embedding technique, our approach here is orthogonal, and accordingly we would expect a performance increase if we were to incorporate their ELMo representations into our model." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-93", "text": "----------------------------------" }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-94", "text": "**ABLATION STUDY**" }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-95", "text": "To gain a better understanding of the impacts of each feature, we perform an abModel F1" }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-96", "text": "Chieu and Ng (2002) 88.31 Florian et al. 
(2003) 88.76 Ando and Zhang (2005) 89.31 Collobert et al. (2011) 89.59 90." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-97", "text": "10 Passos et al. (2014) 90.90 Lample et al. (2016) 90.94 Luo et al. (2015) 91.20 Ma and Hovy (2016) 91.21 91.62 Peters et al. (2018) 90.15 Peters et al. (2018) lation study and present the results in Table 3 ." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-98", "text": "We observe performance degradation when eliminating POS, word shape and gazetteer features, showing that each feature contributes to NER performance beyond what is learned through deep learning alone." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-99", "text": "Interestingly, the contribution of gazetteers is much less than that of the other features, which is likely due to the noise introduced in the matching process, with many incorrectly identified false positives." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-100", "text": "Including features based on dependency tags into our model decreases the performance slightly." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-101", "text": "This might be a result of our simple implementation (as illustrated in Table 1 ), which does not include dependency direction, nor parent-child relationships." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-102", "text": "Next, we investigate the impact of different means of incorporating manually-engineered features into the model." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-103", "text": "To this end, we experiment with three configurations with features as: (1) input only; (2) output only (equivalent to multi-task learning); and (3) both input and output (Neural-CRF+AE) and present the results in Table 4 ." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-104", "text": "Simply using features as either input or output only improves model performance slightly, but insignificantly so." 
}, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-105", "text": "It is only when features are incorporated with the proposed auto-encoder loss do we observe a significant performance boost." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-106", "text": "Training Requirements Neural systems typically require a large amount of annotated data." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-107", "text": "Here we measure the impact of training with varying amount of annotated data, as shown in Figure 2 ." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-108", "text": "Wtih the proposed model architecture, the amount of labelled training data can be drastically reduced: our model, achieves comparable performance against the baseline Neural-CRF, with as little as 60% of the training data." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-109", "text": "Moreover, as we increase the amount of training text, the performance of Neural-CRF+AE continues to improve." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-110", "text": "Hyperparameters Three extra hyperparameters are introduced into our model, controlling the weight of the autoencoder loss relative to the CRF loss, for each feature type." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-111", "text": "Figure 3 shows the effect of each hyperparameter on test performance." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-112", "text": "Observe that setting \u03bb i = 1 gives strong performance, and that the impact of the gazetteer is less marked than the other two feature types." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-113", "text": "While increasing \u03bb is mostly beneficial, performance drops if the \u03bbs are overly large, that is, the auto-encoder loss overwhelms the main prediction task." 
}, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-114", "text": "----------------------------------" }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-115", "text": "**CONCLUSION**" }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-116", "text": "In this paper, we set out to investigate the utility of hand-crafted features." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-117", "text": "To this end, we have presented a hybrid neural architecture that validates this hypothesis, extending a Bi-LSTM-CNN-CRF with an auto-encoder loss that takes manual features as input and then reconstructs them." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-118", "text": "On the task of named entity recognition, we show significant improvements over a collection of competitive baselines, verifying the value of such features." }, { "sent_id": "73eaa7d5a54b2d60bd8128e0270683-C001-119", "text": "Lastly, the method presented in this work can also be easily applied to other tasks and models, where hand-engineered features provide key insights about the data."
} ], "y": { "@BACK@": { "gold_contexts": [ [ "73eaa7d5a54b2d60bd8128e0270683-C001-11" ], [ "73eaa7d5a54b2d60bd8128e0270683-C001-21" ], [ "73eaa7d5a54b2d60bd8128e0270683-C001-38" ] ], "cite_sentences": [ "73eaa7d5a54b2d60bd8128e0270683-C001-11", "73eaa7d5a54b2d60bd8128e0270683-C001-21", "73eaa7d5a54b2d60bd8128e0270683-C001-38" ] }, "@SIM@": { "gold_contexts": [ [ "73eaa7d5a54b2d60bd8128e0270683-C001-33" ], [ "73eaa7d5a54b2d60bd8128e0270683-C001-74" ], [ "73eaa7d5a54b2d60bd8128e0270683-C001-76" ], [ "73eaa7d5a54b2d60bd8128e0270683-C001-87" ] ], "cite_sentences": [ "73eaa7d5a54b2d60bd8128e0270683-C001-33", "73eaa7d5a54b2d60bd8128e0270683-C001-74", "73eaa7d5a54b2d60bd8128e0270683-C001-76", "73eaa7d5a54b2d60bd8128e0270683-C001-87" ] }, "@USE@": { "gold_contexts": [ [ "73eaa7d5a54b2d60bd8128e0270683-C001-33" ], [ "73eaa7d5a54b2d60bd8128e0270683-C001-74" ], [ "73eaa7d5a54b2d60bd8128e0270683-C001-87" ] ], "cite_sentences": [ "73eaa7d5a54b2d60bd8128e0270683-C001-33", "73eaa7d5a54b2d60bd8128e0270683-C001-74", "73eaa7d5a54b2d60bd8128e0270683-C001-87" ] }, "@MOT@": { "gold_contexts": [ [ "73eaa7d5a54b2d60bd8128e0270683-C001-38" ] ], "cite_sentences": [ "73eaa7d5a54b2d60bd8128e0270683-C001-38" ] }, "@DIF@": { "gold_contexts": [ [ "73eaa7d5a54b2d60bd8128e0270683-C001-39" ] ], "cite_sentences": [ "73eaa7d5a54b2d60bd8128e0270683-C001-39" ] }, "@EXT@": { "gold_contexts": [ [ "73eaa7d5a54b2d60bd8128e0270683-C001-39" ] ], "cite_sentences": [ "73eaa7d5a54b2d60bd8128e0270683-C001-39" ] } } }, "ABC_4e7ee576b07a8a21a42472bf921291_20": { "x": [ { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-25", "text": "The word embeddings reflect these patterns." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-53", "text": "Documents were represented by titles." 
}, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-2", "text": "This paper studies the consistency of the kernel-based neural ranking model (K-NRM), a recent state-of-the-art neural IR model, which is important for reproducible research and deployment in the industry." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-3", "text": "We find that K-NRM has low variance on relevance-based metrics across experimental trials." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-4", "text": "In spite of this low variance in overall performance, different trials produce different document rankings for individual queries." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-5", "text": "The main source of variance in our experiments was found to be different latent matching patterns captured by K-NRM." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-6", "text": "In the IR-customized word embeddings learned by K-NRM, the query-document word pairs follow two different matching patterns that are equally effective, but align word pairs differently in the embedding space." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-7", "text": "The different latent matching patterns enable a simple yet effective approach to construct ensemble rankers, which improve K-NRM's effectiveness and generalization abilities." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-8", "text": "----------------------------------" }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-10", "text": "Neural IR models have received much attention due to their continuous text representations, soft-matching of terms, and sophisticated non-linear models." 
}, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-11", "text": "However, the non-convexity and stochastic training of neural IR models raise questions about their consistency compared to heuristic and learning-to-rank models that use discrete representations and simpler methods of combining evidence." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-12", "text": "Consistent behavior under slightly different conditions is essential to reproducible research and deployment in industry." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-13", "text": "This paper studies the stability of K-NRM, a recent state-of-the-art neural ranking model [10] ." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-14", "text": "K-NRM learns the word embeddings and ranking model from relevance signals." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-15", "text": "Its effectiveness is due to word embeddings tailored for search tasks and kernels that group matches into bins of different quality."
}, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-20", "text": "(The first three authors contributed equally. SIGIR '18, July 8-12, 2018.)" }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-21", "text": "Its parameter space is large, the solution space is non-convex, and training is stochastic." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-22", "text": "To better understand its stability, we compare the behavior of multiple trained models under similar conditions." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-26", "text": "Trials whose kernel weights have the same pattern have similar word embeddings." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-23", "text": "We find that although K-NRM produces similar accuracy across different trials, it also produces rather different document rankings for individual queries." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-24", "text": "Analysis of weights learned for K-NRM kernel scores (soft-match features) revealed that weights from different trials match one of two patterns." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-27", "text": "Interestingly, the two patterns are equally effective." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-28", "text": "The difference in the ranking patterns from different K-NRM trials makes them a good fit for ensembles." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-29", "text": "Aggregating scores from different trials enables an ensemble to promote documents that multiple trials agree are most likely to be relevant." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-30", "text": "Experimental results show that simple K-NRM ensembles significantly boost its ranking accuracy and improve its generalization ability."
}, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-31", "text": "----------------------------------" }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-32", "text": "**RELATED WORK**" }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-33", "text": "Recent neural IR methods can be categorized as representation-based and interaction-based [2] ." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-34", "text": "Representation-based models use distributed representations of the query and document that are matched in the representation space [4, 8] ." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-35", "text": "Interaction-based models use local interactions between the query and document words and neural networks that learn matching patterns [2, 10] ." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-36", "text": "K-NRM [10] is an interaction-based model that uses kernel pooling to summarize word-word interactions." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-37", "text": "It builds a word-word similarity matrix from word embeddings, and uses kernel pooling to 'count' the soft matches at multiple similarity levels using Gaussian kernels." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-38", "text": "A linear learning-to-rank layer combines the kernel features." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-39", "text": "The whole model is end-to-end trainable." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-40", "text": "When trained from a search log, K-NRM outperforms neural IR methods and feature-based learning-to-rank methods." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-41", "text": "Most neural IR research focuses on ranking accuracy." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-42", "text": "However, the high variance of deep learning models causes concern about their consistency." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-43", "text": "Haber et al.
[3] identify the causes for lack of stability and high variance as the dimensionality and non-convexity of the optimization problem." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-44", "text": "A common method to reduce variance and improve generalization is to create an ensemble of models [6] ." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-45", "text": "Krogh and Vedelsby [6] argue that a good ensemble is one where the components are all accurate but disagree on individual examples." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-46", "text": "Ensembles of neural networks have been applied successfully to tasks such as image classification [5] and machine translation [9] ." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-47", "text": "----------------------------------" }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-48", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-49", "text": "Our experiments followed the original K-NRM work [10] and used its open-source implementation 1 ." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-50", "text": "We used the same click log data from Sogou.com, a Chinese web search engine." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-51", "text": "The training set contained 95M queries, each with 12 candidate documents on average." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-52", "text": "The testing set contained 1,000 queries, each with 30 candidate documents on average." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-54", "text": "Xiong et al. [10] built the vocabulary from queries and titles, but we built it from the queries, titles and URLs for better term coverage." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-55", "text": "Training Labels: The relevance labels for training were generated by the DCTR [1] click model from user clicks in the training sessions."
}, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-56", "text": "DCTR uses the clickthrough rate for each query-document pair as the relevance score." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-57", "text": "Testing Labels: Following Xiong et al. [10] , three testing conditions were used." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-58", "text": "Testing-SAME used DCTR to generate the testing labels while Testing-DIFF employed a more sophisticated model, TACM [7] ." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-59", "text": "TACM takes into account both clicks and dwell times to generate testing labels." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-60", "text": "Testing-RAW treated the only clicked document in each single-clicked session as a relevant document, and used MRR (Mean Reciprocal Rank) as the metric." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-61", "text": "Testing-DIFF and Testing-RAW were considered more reliable than Testing-SAME because they are less subject to over-fitting." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-62", "text": "Model Configuration: We adopted the same default hyperparameter configuration and 11 Gaussian kernels as in prior work [10] ." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-63", "text": "The first kernel had \u00b5 = 1, \u03c3 = 10\u207b\u00b3 to cover exact matches." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-64", "text": "The other 10 kernels were equally split in the cosine value range [\u22121, 1]: \u00b5_1 = 0.9, \u00b5_2 = 0.7, ..., \u00b5_10 = \u22120.9; \u03c3 was set to 0.1." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-65", "text": "Word embeddings were initialized with a pretrained word2vec." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-66", "text": "The model used the Adam optimizer and was trained with batch size 16, learning rate = 0.001, and \u03f5 = 1e\u22125."
}, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-67", "text": "The order of training data batches was fixed." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-68", "text": "An early stopping condition of 2 epochs was used in all experiments." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-69", "text": "The model was implemented using TensorFlow." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-70", "text": "All experiments were executed on a p2.xlarge AWS instance with 4 virtual CPUs and 1 NVIDIA GK210 GPU." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-71", "text": "----------------------------------" }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-72", "text": "**VARIANCE**" }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-73", "text": "The first experiment studied the consistency of K-NRM by running 50 stochastically trained models with random initialization." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-74", "text": "The consistency among the 50 trials is examined at the query-set level and the individual query level." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-75", "text": "The performance of 50 models on three metrics is summarized in Table 1 ." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-76", "text": "The min/max differences are large, especially for NDCG@1." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-77", "text": "However, the standard deviations are small, ranging from 0.5-1.3% absolute, and 1-4% relative to mean values." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-78", "text": "The min/max differences are due to a small number of outliers." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-79", "text": "Performance is stable across most trials." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-80", "text": "Table 1 also shows results reported by Xiong et al. [10] ."
}, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-81", "text": "Their model performance falls in the lower end of our trials, probably due to different vocabularies and stopping conditions." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-82", "text": "[Footnote 1: https://github.com/AdeDZY/K-NRM] The next analysis studied the consistency at the individual query level by examining document rankings generated by different trials." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-83", "text": "For each query we examined the top k ranked documents from 10 different trials." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-84", "text": "The total number of distinct documents indicates how well the models agree about which documents to place at the top of the ranking." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-85", "text": "A histogram (Figure 1) shows the number of queries at each agreement level for top 1, 3, and 10 ranked documents." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-86", "text": "Different trials rank different documents at the top to some extent." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-87", "text": "For about 50 of the 1000 queries, all 10 trials select the same document at rank 1 ( Figure 1a) ; for 35% of the queries, the trials select 2-3 different documents." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-88", "text": "Moderate consistency is observed across the trials." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-89", "text": "Only 15% of the queries get more than 5 different documents from the 10 trials." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-90", "text": "None of the queries get 10 completely different documents at the top 1." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-91", "text": "Figure 1b shows a similar trend."
}, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-92", "text": "For 66% of the queries, the 10 trials collectively select 3-9 different documents for the top 3 slots." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-93", "text": "The document sets from the 10 trials converge deeper in the rankings." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-94", "text": "In the top 10 slots (Figure 1c) , the histogram shifts left, indicating that the 10 trials have higher agreement." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-95", "text": "This is expected because K-NRM re-ranks the top 30 documents." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-96", "text": "The disagreement in the top 1 and 3 ranks indicates that although different trials have similar sets of documents, their rankings are slightly different." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-97", "text": "----------------------------------" }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-98", "text": "**LATENT MATCHING PATTERNS**" }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-99", "text": "To better understand the model differences, we investigated the model parameters through multiple K-NRM trials." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-100", "text": "K-NRM has two trainable components: the word embedding layer and the learning-to-rank layer." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-101", "text": "The word embedding layer aligns query-document word pairs and assigns them to the closest kernels by their cosine similarity." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-102", "text": "The learning-to-rank layer learns the importance of word pairs around each kernel." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-103", "text": "This analysis studied both parameters." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-104", "text": "Figure 2 plots the learning-to-rank weights from 10 random trials."
}, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-105", "text": "The trials fall into two main patterns." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-106", "text": "One pattern starts with a downward slope and then moves upward while the second pattern goes the other way." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-107", "text": "K-NRM allocates word pairs into kernels based on their contribution to relevance." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-108", "text": "Different learning-to-rank weights indicate different ways of allocating word pairs to kernels." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-109", "text": "We further studied the two patterns with word embeddings from multiple trials." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-110", "text": "We randomly picked 5 runs." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-111", "text": "Runs A1-A3 belonged to one learning-to-rank weight pattern (Pattern A); runs B1-B2 belonged to the other pattern (Pattern B)." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-112", "text": "We compared the word pair distribution between pairs of runs through a heat map with each cell (\u00b5_x, \u00b5_y) indicating the fraction of word pairs that fall into kernel \u00b5_x in one run and kernel \u00b5_y in the other run." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-113", "text": "Figure 3 shows the heat maps between Run A1 and the rest of runs." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-114", "text": "Runs from the same pattern have similar heat maps." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-115", "text": "As shown in Figure 3a and 3b, Runs A2 and A3 show a strong diagonal pattern, indicating that most of the word pairs are in the same kernel as in run A1." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-116", "text": "Runs from pattern B share another word pair distribution."
}, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-117", "text": "As can be seen from Figure 3c and 3d, many word pairs are assigned to a different kernel by runs B1/B2 as compared to the kernel assigned by run A1." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-118", "text": "The results reveal two distinct latent matching patterns." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-119", "text": "Trials from the same pattern have similar learning-to-rank weights and word embeddings." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-120", "text": "The two patterns differ largely in their word pair alignment." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-121", "text": "Although the two patterns align word embeddings differently, both are equally effective and produce similar accuracy (Table 1 )." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-122", "text": "----------------------------------" }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-123", "text": "**ENSEMBLE MODEL**" }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-124", "text": "The different rankings and distinct patterns in multiple K-NRM trials provided possibilities to reduce risk and improve the model's generalization ability using ensemble models [6] ." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-125", "text": "The following experiments studied the effectiveness of ensemble models." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-126", "text": "We used an unweighted-average ensemble model [5] that averages the scores from multiple trials." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-127", "text": "To investigate the effects of latent matching patterns, we tested different types of ensemble models: Ensemble-A used 10 base models randomly selected from Pattern A; Ensemble-B used 10 base models from Pattern B; Ensemble-A&B used base models from both patterns, 5 from each 2 ."
}, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-128", "text": "To make evaluation reliable, 10 ensemble rankers were generated for each method with different base models randomly chosen from a pool of 50 K-NRM trials." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-129", "text": "All ensemble methods significantly outperformed individual models ( Table 2 )." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-130", "text": "The differences in document rankings allowed multiple trials to 'vote' in the ensemble model." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-131", "text": "Documents favored by the majority of trials are voted up, whereas documents that are wrongly ranked high in a poor trial are voted down." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-132", "text": "Comparing NDCG scores at different depths, we see that ensembles are most effective at the top of the ranking." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-133", "text": "This is because the dataset mostly contains 20-30 documents per query." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-134", "text": "There is more opportunity for disagreement at the top, which gives more scope for improvement." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-135", "text": "The ensemble generalization ability is reflected by the improved performance on Testing-DIFF and Testing-RAW as compared to Testing-SAME (Table 2) ." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-136", "text": "Ensemble-A&B outperformed Ensemble-A and Ensemble-B (Table 2), which indicates that having and recognizing two distinctive matching patterns is beneficial to ensemble models." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-137", "text": "To further understand the effects of the two patterns, we tested ensemble models with m Pattern A models and n Pattern B models." 
}, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-138", "text": "Figure 4 shows MRR on Testing-RAW as a heatmap." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-139", "text": "It confirms that having two variations enables better ensembles; ensemble models that only used one pattern have the lowest accuracy." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-140", "text": "Compared to single pattern ensembles, mixed ensembles can achieve the same accuracy using a smaller ensemble model with fewer base models." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-141", "text": "For example, cell (3, 3) has higher accuracy than cell (10, 0)." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-142", "text": "Besides, ensemble models benefit from a balanced mix of the two patterns, as seen from the darker cells around the diagonal which have a similar number of base trials from each pattern." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-143", "text": "Prior research did not recognize that K-NRM consistently converges to a small number of distinct, equally-good local optima." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-144", "text": "Recognizing this helps in constructing high-quality ensembles." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-145", "text": "----------------------------------" }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-146", "text": "**CONCLUSION**" }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-147", "text": "This paper studies the consistency and variation of K-NRM, a recent state-of-the-art neural ranking model." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-148", "text": "Unlike feature-based methods where features are stable and ranking models are often convex, neural networks are non-convex and employ stochastic training, making it important to consider the ranking stability in neural IR."
}, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-149", "text": "By investigating multiple trials of K-NRM, we find that its accuracy is quite stable (has low standard deviation) in spite of its random components." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-150", "text": "However, stable NDCG does not imply identical rankings at the individual query level." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-151", "text": "Different trials have moderate agreement about which document to rank first." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-152", "text": "Ten trials collectively select 1-3 documents to rank first for 40% of our queries." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-153", "text": "Our analyses further demonstrate that multiple trials of K-NRM converge to two latent patterns that are about equally effective." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-154", "text": "Runs within the same pattern converge to similar ranking weights and word embeddings." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-155", "text": "This behavior was not recognized by prior work, and is worth additional study." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-156", "text": "The distinct but equally effective matching patterns make K-NRM a good fit for ensemble models." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-157", "text": "Recognizing different convergence patterns and selecting ensemble components equally from each pattern further improves K-NRM's accuracy and ability to generalize." }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-158", "text": "----------------------------------" }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-159", "text": "**ACKNOWLEDGMENTS**" }, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-160", "text": "This research was supported by National Science Foundation (NSF) grant IIS-1422676."
}, { "sent_id": "4e7ee576b07a8a21a42472bf921291-C001-161", "text": "Any opinions, findings, and conclusions are the authors' and do not necessarily reflect those of the sponsor." } ], "y": { "@BACK@": { "gold_contexts": [ [ "4e7ee576b07a8a21a42472bf921291-C001-13" ], [ "4e7ee576b07a8a21a42472bf921291-C001-20" ], [ "4e7ee576b07a8a21a42472bf921291-C001-35" ], [ "4e7ee576b07a8a21a42472bf921291-C001-36" ] ], "cite_sentences": [ "4e7ee576b07a8a21a42472bf921291-C001-13", "4e7ee576b07a8a21a42472bf921291-C001-20", "4e7ee576b07a8a21a42472bf921291-C001-35", "4e7ee576b07a8a21a42472bf921291-C001-36" ] }, "@SIM@": { "gold_contexts": [ [ "4e7ee576b07a8a21a42472bf921291-C001-49" ], [ "4e7ee576b07a8a21a42472bf921291-C001-57" ], [ "4e7ee576b07a8a21a42472bf921291-C001-62" ] ], "cite_sentences": [ "4e7ee576b07a8a21a42472bf921291-C001-49", "4e7ee576b07a8a21a42472bf921291-C001-57", "4e7ee576b07a8a21a42472bf921291-C001-62" ] }, "@USE@": { "gold_contexts": [ [ "4e7ee576b07a8a21a42472bf921291-C001-49" ], [ "4e7ee576b07a8a21a42472bf921291-C001-57" ], [ "4e7ee576b07a8a21a42472bf921291-C001-62" ] ], "cite_sentences": [ "4e7ee576b07a8a21a42472bf921291-C001-49", "4e7ee576b07a8a21a42472bf921291-C001-57", "4e7ee576b07a8a21a42472bf921291-C001-62" ] }, "@DIF@": { "gold_contexts": [ [ "4e7ee576b07a8a21a42472bf921291-C001-54" ], [ "4e7ee576b07a8a21a42472bf921291-C001-80", "4e7ee576b07a8a21a42472bf921291-C001-81" ] ], "cite_sentences": [ "4e7ee576b07a8a21a42472bf921291-C001-54", "4e7ee576b07a8a21a42472bf921291-C001-80" ] }, "@EXT@": { "gold_contexts": [ [ "4e7ee576b07a8a21a42472bf921291-C001-54" ] ], "cite_sentences": [ "4e7ee576b07a8a21a42472bf921291-C001-54" ] } } }, "ABC_7cb7cfed8b7e7bf2f0a810e02e6cbc_20": { "x": [ { "sent_id": "7cb7cfed8b7e7bf2f0a810e02e6cbc-C001-49", "text": "**UNGUIDED NLG FROM AMR**" }, { "sent_id": "7cb7cfed8b7e7bf2f0a810e02e6cbc-C001-73", "text": "**EXPERIMENTS**" }, { "sent_id": "7cb7cfed8b7e7bf2f0a810e02e6cbc-C001-1", "text": "**ABSTRACT**" }, { "sent_id": 
"7cb7cfed8b7e7bf2f0a810e02e6cbc-C001-2", "text": "Recent work on abstractive summarization has made progress with neural encoder-decoder architectures." }, { "sent_id": "7cb7cfed8b7e7bf2f0a810e02e6cbc-C001-3", "text": "However, such models are often challenged due to their lack of explicit semantic modeling of the source document and its summary." }, { "sent_id": "7cb7cfed8b7e7bf2f0a810e02e6cbc-C001-4", "text": "In this paper, we extend previous work on abstractive summarization using Abstract Meaning Representation (AMR) with a neural language generation stage which we guide using the source document." }, { "sent_id": "7cb7cfed8b7e7bf2f0a810e02e6cbc-C001-5", "text": "We demonstrate that this guidance improves summarization results by 7.4 and 10.5 points in ROUGE-2 using gold standard AMR parses and parses obtained from an off-the-shelf parser respectively." }, { "sent_id": "7cb7cfed8b7e7bf2f0a810e02e6cbc-C001-6", "text": "We also find that the summarization performance using the latter is 2 ROUGE-2 points higher than that of a well-established neural encoder-decoder approach trained on a larger dataset." }, { "sent_id": "7cb7cfed8b7e7bf2f0a810e02e6cbc-C001-7", "text": "Code is available at https://github.com/sheffieldnlp/AMR2Text-summ" }, { "sent_id": "7cb7cfed8b7e7bf2f0a810e02e6cbc-C001-9", "text": "----------------------------------" }, { "sent_id": "7cb7cfed8b7e7bf2f0a810e02e6cbc-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "7cb7cfed8b7e7bf2f0a810e02e6cbc-C001-11", "text": "Abstractive summarization is the task of automatically producing the summary of a source document through the process of paraphrasing, aggregating and/or compressing information." }, { "sent_id": "7cb7cfed8b7e7bf2f0a810e02e6cbc-C001-12", "text": "Recent work in abstractive summarization has made progress with neural encoder-decoder architectures (See et al., 2017; Chopra et al., 2016; Rush et al., 2015) ."
}, { "sent_id": "7cb7cfed8b7e7bf2f0a810e02e6cbc-C001-13", "text": "However, these models are often challenged when they are required to combine semantic information in order to generate a longer summary (Wiseman et al., 2017)." }, { "sent_id": "7cb7cfed8b7e7bf2f0a810e02e6cbc-C001-14", "text": "To address this shortcoming, several works have explored the use of Abstract Meaning Representation (AMR; Banarescu et al., 2013)." }, { "sent_id": "7cb7cfed8b7e7bf2f0a810e02e6cbc-C001-15", "text": "These were motivated by AMR's capability to capture the predicate-argument structure which can be utilized in information aggregation during summarization." }, { "sent_id": "7cb7cfed8b7e7bf2f0a810e02e6cbc-C001-16", "text": "However, the use of AMR also has its own shortcomings." }, { "sent_id": "7cb7cfed8b7e7bf2f0a810e02e6cbc-C001-17", "text": "While AMR is suitable for information aggregation, it ignores aspects of language such as tense, grammatical number, etc., which are important for the natural language generation (NLG) stage that normally occurs at the end of the summarization process." }, { "sent_id": "7cb7cfed8b7e7bf2f0a810e02e6cbc-C001-18", "text": "Due to the lack of such information, approaches for NLG from AMR typically infer it from regularities in the training data (Pourdamghani et al., 2016; Konstas et al., 2017; Song et al., 2016; Flanigan et al., 2016), which, however, is not suitable in the context of summarization." }, { "sent_id": "7cb7cfed8b7e7bf2f0a810e02e6cbc-C001-19", "text": "Consequently, the main previous work on AMR-based abstractive summarization (Liu et al., 2015) only generated a bag of words from the summary AMR graph." }, { "sent_id": "7cb7cfed8b7e7bf2f0a810e02e6cbc-C001-20", "text": "In this paper, we propose an approach to guide the NLG stage in AMR-based abstractive summarization using information from the source document." 
}, { "sent_id": "7cb7cfed8b7e7bf2f0a810e02e6cbc-C001-50", "text": "Our baseline is a standard (unguided) seq2seq model with attention (Luong et al., 2015), which consists of an encoder and a decoder." }, { "sent_id": "7cb7cfed8b7e7bf2f0a810e02e6cbc-C001-21", "text": "Our objective is twofold: (1) to retrieve the information missing from AMR but needed for NLG and (2) to improve the quality of the summary." }, { "sent_id": "7cb7cfed8b7e7bf2f0a810e02e6cbc-C001-22", "text": "We achieve this in a two-stage process: (1) estimating the probability distribution of the side information, and (2) using it to guide the seq2seq model of Luong et al. (2015) for NLG." }, { "sent_id": "7cb7cfed8b7e7bf2f0a810e02e6cbc-C001-23", "text": "Our approach is evaluated using the Proxy Report section from the AMR dataset (Knight et al., 2017), which contains manually annotated document and summary AMR graphs." }, { "sent_id": "7cb7cfed8b7e7bf2f0a810e02e6cbc-C001-24", "text": "Using our proposed guided AMR-to-text NLG, we improve summarization results using both gold standard AMR parses and parses obtained using the RIGA (Barzdins and Gosko, 2016) parser by 7.4 and 10.5 ROUGE-2 points, respectively." }, { "sent_id": "7cb7cfed8b7e7bf2f0a810e02e6cbc-C001-25", "text": "Our model also outperforms a strong baseline seq2seq model (See et al., 2017) for summarization by 2 ROUGE-2 points." }, { "sent_id": "7cb7cfed8b7e7bf2f0a810e02e6cbc-C001-26", "text": "----------------------------------" }, { "sent_id": "7cb7cfed8b7e7bf2f0a810e02e6cbc-C001-27", "text": "**RELATED WORK**" }, { "sent_id": "7cb7cfed8b7e7bf2f0a810e02e6cbc-C001-72", "text": "----------------------------------" }, { "sent_id": "7cb7cfed8b7e7bf2f0a810e02e6cbc-C001-28", "text": "Abstractive Summarization using AMR: In Liu et al. (2015)'s work, the source document's sentences were parsed into AMR graphs, which were then combined through merging, collapsing and graph expansion into a single AMR graph representing the source document." 
}, { "sent_id": "7cb7cfed8b7e7bf2f0a810e02e6cbc-C001-29", "text": "Following this, a summary AMR graph was extracted, from which a bag of concept words was obtained without attempting to form fluent text." }, { "sent_id": "7cb7cfed8b7e7bf2f0a810e02e6cbc-C001-30", "text": "Vilca and Cabezudo (2017) performed a summary AMR graph extraction augmented with discourse-level information and the PageRank (Page et al., 1998) algorithm." }, { "sent_id": "7cb7cfed8b7e7bf2f0a810e02e6cbc-C001-31", "text": "For text generation, Vilca and Cabezudo (2017) used a rule-based syntactic realizer (Gatt and Reiter, 2009), which requires substantial human input to perform adequately." }, { "sent_id": "7cb7cfed8b7e7bf2f0a810e02e6cbc-C001-32", "text": "Seq2seq using Side Information: In the Neural Machine Translation (NMT) field, recent work (Zhang et al., 2018) explored modifications to the decoder of seq2seq models to improve translation results." }, { "sent_id": "7cb7cfed8b7e7bf2f0a810e02e6cbc-C001-33", "text": "They used a search engine to retrieve sentences and their translation (referred to as translation pieces) that have high similarity with the source sentence." }, { "sent_id": "7cb7cfed8b7e7bf2f0a810e02e6cbc-C001-34", "text": "When similar n-grams from a source document were found in the translation pieces, they rewarded the presence of those n-grams during the decoding process through a scoring mechanism calculating the similarity between the source sentence and the source side of the translation pieces." }, { "sent_id": "7cb7cfed8b7e7bf2f0a810e02e6cbc-C001-35", "text": "Zhang et al. (2018) reported improvements in translation results up to 6 BLEU points over their seq2seq NMT baseline." }, { "sent_id": "7cb7cfed8b7e7bf2f0a810e02e6cbc-C001-36", "text": "In this paper we use the same principle and reward n-grams that are found in the source document during the AMR-to-Text generation process." 
}, { "sent_id": "7cb7cfed8b7e7bf2f0a810e02e6cbc-C001-37", "text": "However, we use a simpler approach based on a probabilistic language model in the scoring mechanism." }, { "sent_id": "7cb7cfed8b7e7bf2f0a810e02e6cbc-C001-38", "text": "----------------------------------" }, { "sent_id": "7cb7cfed8b7e7bf2f0a810e02e6cbc-C001-39", "text": "**GUIDING NLG FOR AMR-BASED SUMMARIZATION**" }, { "sent_id": "7cb7cfed8b7e7bf2f0a810e02e6cbc-C001-40", "text": "We first briefly describe the AMR-based summarization method of Liu et al. (2015) and then our guided NLG approach." }, { "sent_id": "7cb7cfed8b7e7bf2f0a810e02e6cbc-C001-41", "text": "----------------------------------" }, { "sent_id": "7cb7cfed8b7e7bf2f0a810e02e6cbc-C001-42", "text": "**AMR-BASED SUMMARIZATION**" }, { "sent_id": "7cb7cfed8b7e7bf2f0a810e02e6cbc-C001-43", "text": "In Liu et al. (2015)'s work, each sentence of the source document was parsed into an AMR graph, and these graphs were combined into a source graph, G = (V, E), where v \u2208 V and e \u2208 E are the unique concepts and the relations between pairs of concepts." }, { "sent_id": "7cb7cfed8b7e7bf2f0a810e02e6cbc-C001-44", "text": "They then extracted a summary graph, G \u2032 , using the following sub-graph prediction:" }, { "sent_id": "7cb7cfed8b7e7bf2f0a810e02e6cbc-C001-45", "text": "where f(v) and f(e) are the feature representations of node v and edge e, respectively." }, { "sent_id": "7cb7cfed8b7e7bf2f0a810e02e6cbc-C001-46", "text": "The final summary produced was a bag of concept words extracted from G \u2032 ." }, { "sent_id": "7cb7cfed8b7e7bf2f0a810e02e6cbc-C001-47", "text": "We will replace this output with our proposed guided NLG." }, { "sent_id": "7cb7cfed8b7e7bf2f0a810e02e6cbc-C001-48", "text": "----------------------------------" }, { "sent_id": "7cb7cfed8b7e7bf2f0a810e02e6cbc-C001-51", "text": "The encoder computes the hidden representation of the input, {z 1 , z 2 , . . . , z k }, which is the linearized summary AMR graph, G \u2032 , from Liu et al. 
(2015), following Van Noord and Bos (2017)'s preprocessing steps." }, { "sent_id": "7cb7cfed8b7e7bf2f0a810e02e6cbc-C001-52", "text": "Following this, the decoder generates the target words, {y 1 , y 2 , . . . , y m }, using the conditional probability P s2s (y j |y <j , z)." }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-82", "text": "The resulting corpus size is 133M with an <unk> rate of 1.43% for the training set and 2.30% for the test set." }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-83", "text": "The second corpus is the OBWB, which contains \u2248 0.8B tokens with a vocabulary size of \u2248 0.8M words." }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-84", "text": "The data processing follows the description provided in [7], leading to an <unk> rate of \u2248 0.3%." }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-85", "text": "Similarly to LTCB, a single </s> tag is added at the end of each sentence." }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-86", "text": "In the experiments described below, the first 5 held-out sets are used for validation whereas the remaining 45 sets are used for testing." }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-87", "text": "The obtained results, however, showed that the models' perplexity on these two sets is comparable, with an average difference of less than 0.5 points in perplexity." }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-88", "text": "The primary motive of using LTCB, with its medium vocabulary size (80K), is to be able to compare the performance of LMs trained using NCE to their counterparts that are trained using the full softmax." 
}, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-89", "text": "When using NCE to train the models, the evaluation is either performed using the NCE constant Z for normalization (PPL n ), in which case the target word probabilities are given by (2), or using the softmax function (PPL f ), which calculates these probabilities using (1)." }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-90", "text": "The difference in performance between these metrics will evaluate the ability of the models to learn to self-normalize after training." }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-91", "text": "For a comprehensive comparison of the different models, we also report the Number of Parameters (NoP) required by each model as well as its Training Speed (TS), which is calculated as the number of words processed per second (w/s) during training." }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-92", "text": "All experiments were conducted on a single Titan-X GPU." }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-93", "text": "----------------------------------" }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-94", "text": "**BASELINE MODELS**" }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-95", "text": "In order to assess the gap among established NNLMs, this paper also presents a comparative study of different standard architectures with comparable NoPs." }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-96", "text": "That is, we report results for the standard Feed-forward network (FFNN) [1], the Recurrent Neural Network (RNN) [10], as well as the Long Short-Term Memory network (LSTM) [21]." }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-97", "text": "Our RNN implementation uses a projection weight matrix to decouple the word embedding and the hidden layer sizes." 
}, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-98", "text": "We also report results after adding a bottleneck fully-connected ReLu layer right before the output layer in the recurrent models." }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-99", "text": "These models are marked with the prefix ReLu in the tables below." }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-100", "text": "Each of the models is trained using the proposed B-NCE approach and the shared noise NCE (S-NCE) [16]." }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-101", "text": "For the LTCB corpus, we also report results of the models trained with the full softmax function." }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-102", "text": "This is the primary motive for using this corpus." }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-103", "text": "We would also like to highlight that the goal of this paper is not to improve LM performance but rather to show how a significant training speed-up can be achieved without compromising model performance for large-vocabulary LMs." }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-104", "text": "Hence, we solely focus our experiments on NCE as a major approach to achieve this goal [17, 13, 16] in comparison to the reference full softmax function." }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-105", "text": "Comparison to other training approaches such as importance sampling will be conducted in future work." }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-106", "text": "----------------------------------" }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-107", "text": "**LTCB EXPERIMENTS**" }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-108", "text": "For the LTCB experiments, the embedding size is fixed at 200, the 5-gram FFNN has two hidden layers, whereas RNN and LSTM use a single recurrent layer." 
}, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-109", "text": "All non-recurrent layers use ReLu as activation function." }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-110", "text": "More details about the model architectures are shown in Table 1, where \"(R)\" stands for recurrent and \"(B)\" for bottleneck." }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-111", "text": "The batch size is fixed at 400 and the initial learning rate is set to 0.4." }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-112", "text": "The latter is halved when no improvement on the validation data is observed for an additional 7 epochs." }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-113", "text": "We also use a norm-based gradient clipping with a threshold of 5 but we do not use dropout." }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-114", "text": "Moreover, B-NCE and S-NCE use the unigram as noise distribution p n ." }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-115", "text": "Following the setup proposed in [13, 16], S-NCE uses K = 100 noise samples, whereas B-NCE uses only the target words in the batch (K=0)." }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-116", "text": "Note that S-NCE will process and update B + K words at its output layer during each forward-backward pass, whereas B-NCE updates only B words." }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-117", "text": "Similarly to [17], the NCE normalization constant is set to Z = exp(9), which approximates the mean of the normalization term using softmax." }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-118", "text": "The results in Table 2 clearly show that B-NCE reduces the training time by a factor of 4 to 8 with a slight degradation in model performance compared to the softmax." }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-119", "text": "Moreover, we can also see that B-NCE slightly outperforms S-NCE while being faster and simpler to implement." 
}, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-120", "text": "In particular, B-NCE does not require the sampling step since it uses the rest of the output words in the batch itself as noise samples to train the model on each target word." }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-121", "text": "This can be efficiently implemented using dense matrix operations (see Section 3)." }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-122", "text": "Table 2 also shows that PPL n is close to PPL f , which typically reflects that the models trained using NCE are able to self-normalize, i.e., the normalization term using softmax is, on average, very close to the NCE constant Z. We have also observed in our experiments that the models' degradation and the gap between PPL f and PPL n strongly depend on the amount of training data, the vocabulary size as well as the size of the last hidden layer." }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-123", "text": "More particularly, increasing the training data leads to more stable learning and therefore to a smaller gap between these two metrics and a much lower degradation of model performance (see OBWB experiments below)." }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-124", "text": "We can also conclude from Table 2 that the additional ReLu layer improves the performance while significantly decreasing the number of parameters (NoP)." }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-125", "text": "This conclusion is valid for both RNN and LSTM." }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-126", "text": "These results confirm that adding a fully-connected bottleneck layer can significantly boost the performance of recurrent models." 
}, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-127", "text": "This idea has been previously used in computer vision tasks in combination with Convolutional Neural Networks (CNN) [22], as well as in speech recognition [23], where the fully-connected layer is used as part of the LSTM recurrent module." }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-128", "text": "----------------------------------" }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-129", "text": "**ONE BILLION WORD BENCHMARK EXPERIMENTS**" }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-130", "text": "The OBWB experiments are similar to LTCB with minor differences." }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-131", "text": "Namely, the embedding size is set to 500 for all models, the batch size is fixed at 500, S-NCE uses K = 200 noise samples and the initial learning rate is set to 1.0." }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-132", "text": "Given that the vocabulary size is \u2248 0.8M, it was not possible to train the language models using the full softmax function." }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-133", "text": "Therefore, we only report results for B-NCE and S-NCE." }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-134", "text": "More details about the model configurations are shown in Table 3." }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-135", "text": "The results in Table 4 generally confirm the LTCB conclusions." }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-136", "text": "That is, B-NCE slightly outperforms S-NCE while being faster and simpler to train (training speed in the 3rd column)." }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-137", "text": "Moreover, these results also show a much smaller difference between PPL f and PPL n compared to LTCB, which suggests that the models learned to better self-normalize due to the larger amount of training data." 
}, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-138", "text": "Similarly to LTCB, we can see that the additional ReLu layer helps reduce the NoP while improving or maintaining model performance for RNN and LSTM." }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-139", "text": "We also compare our models to other results reported on the OBWB." }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-140", "text": "We can see that the small ReLu-RNN achieves performance close to that of the very large RNN model (PPL = 51.3 and NoP = 20B) proposed in [7]." }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-141", "text": "Moreover, the performance of the small ReLu-LSTM is comparable to the LSTM models proposed in [16] and [18], which use large hidden layers." }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-142", "text": "In particular, the first paper trains a large 4-layer LSTM model using S-NCE on 4 GPUs (PPL = 43.2 and NoP = 3.4B), whereas the second uses a recurrent bottleneck layer [23] and a total of K = 8192 noise samples with importance sampling on 32 Tesla K40 GPUs." }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-143", "text": "----------------------------------" }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-144", "text": "**CONCLUSIONS AND FUTURE WORK**" }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-145", "text": "We have presented a batch-NCE approach which allows a fast and simple training of large vocabulary LMs." }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-146", "text": "This approach eliminates the sampling step required in standard NCE and can be fully formulated using dense matrix operations." }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-147", "text": "Experiments on LTCB and OBWB have shown that this approach achieves performance comparable to the full softmax while significantly speeding up training." 
}, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-148", "text": "While the evaluation focused on NCE performance, future experiments will be conducted to evaluate B-NCE in comparison to other alternatives to the softmax." }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-149", "text": "----------------------------------" }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-150", "text": "**ACKNOWLEDGMENT**" }, { "sent_id": "2ef456a3f6b043350121c4c5cfd404-C001-151", "text": "This research was funded by the German Research Foundation (DFG) as part of SFB 1102." } ], "y": { "@BACK@": { "gold_contexts": [ [ "2ef456a3f6b043350121c4c5cfd404-C001-39" ], [ "2ef456a3f6b043350121c4c5cfd404-C001-48" ], [ "2ef456a3f6b043350121c4c5cfd404-C001-51" ], [ "2ef456a3f6b043350121c4c5cfd404-C001-73" ], [ "2ef456a3f6b043350121c4c5cfd404-C001-141" ] ], "cite_sentences": [ "2ef456a3f6b043350121c4c5cfd404-C001-39", "2ef456a3f6b043350121c4c5cfd404-C001-48", "2ef456a3f6b043350121c4c5cfd404-C001-51", "2ef456a3f6b043350121c4c5cfd404-C001-73", "2ef456a3f6b043350121c4c5cfd404-C001-141" ] }, "@USE@": { "gold_contexts": [ [ "2ef456a3f6b043350121c4c5cfd404-C001-100" ], [ "2ef456a3f6b043350121c4c5cfd404-C001-115" ] ], "cite_sentences": [ "2ef456a3f6b043350121c4c5cfd404-C001-100", "2ef456a3f6b043350121c4c5cfd404-C001-115" ] }, "@UNSURE@": { "gold_contexts": [ [ "2ef456a3f6b043350121c4c5cfd404-C001-104" ] ], "cite_sentences": [ "2ef456a3f6b043350121c4c5cfd404-C001-104" ] } } }, "ABC_d13502d44435988822e59bcf66b635_22": { "x": [ { "sent_id": "d13502d44435988822e59bcf66b635-C001-15", "text": "[Footnotes: http://www.itl.nist.gov/iad/mig/tests/tdt/resources.html (last update: 2008); 5,700 tweets per second, https://about.twitter.com/company (last updated: March 31, 2015).] Each incoming document is compared to all previously seen documents and the minimum distance determines the novelty score." 
}, { "sent_id": "d13502d44435988822e59bcf66b635-C001-16", "text": "Documents whose minimum distance falls above a certain threshold are considered to describe a new event and are declared first stories." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-17", "text": "Consequently, the computational effort increases with each document processed." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-18", "text": "----------------------------------" }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-19", "text": "**RELATED WORK**" }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-20", "text": "Researchers have proposed a range of approaches to scale FSD to large data streams." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-21", "text": "Sankaranarayanan et al. (2009) were among the first to apply FSD to Twitter." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-22", "text": "They reduced the volume by classifying documents into news/non-news and only compared to tweets within a 3-day window." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-23", "text": "They did not perform a quantitative evaluation of their approach." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-24", "text": "Sakaki et al. (2010) and Li et al. (2012) applied keyword filtering in conjunction with classification algorithms, which allowed them to efficiently detect certain events with high precision." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-25", "text": "These two approaches, although efficient and effective, require a user to explicitly define a set of keywords or to provide a set of examples that they want to track." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-26", "text": "These approaches cannot detect previously unknown events." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-27", "text": "Phuvipadawat and Murata (2010), Ozdikis et al. (2012) and Cordeiro (2012) scale their systems by only considering tweets containing hashtags." 
}, { "sent_id": "d13502d44435988822e59bcf66b635-C001-28", "text": "Although efficient, this method disregards 90% of the tweets (Petrovic, 2013), which limits its scope." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-29", "text": "Cataldi et al. (2010), Weng et al. (2011) and Cordeiro (2012) use the degree of burstiness of terms during a time interval to detect new events." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-30", "text": "This approach is not suitable for FSD as events are detected with a time lag, once they grow in popularity." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-31", "text": "Petrovic et al. (2010) were the first to demonstrate FSD on Twitter in constant time and space, while maintaining effectiveness comparable to those of pair-wise comparison systems." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-32", "text": "The key was to reduce the search space using Locality Sensitive Hashing (LSH)." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-33", "text": "Each tweet was hashed, placing it into buckets that contain other similar tweets, which are subsequently compared." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-34", "text": "Operation in constant space was ensured by keeping the number of tweets per bucket constant." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-35", "text": "Because LSH alone performed ineffectively, Petrovic et al. (2010) additionally compared each incoming tweet with the k most recent tweets." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-36", "text": "Allan et al. (2003) analysed scoring functions for novelty detection while focusing on their effectiveness." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-37", "text": "They presented a language-model (LM) based novelty measure using the KL divergence between the LM of a document and a single LM built on all previously scored documents, which they referred to as an aggregate measure language model." 
}, { "sent_id": "d13502d44435988822e59bcf66b635-C001-38", "text": "The idea of maintaining a single representation covering all previously seen documents, instead of performing pairwise comparisons with every document, is closely related to the approach presented in this paper." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-39", "text": "----------------------------------" }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-40", "text": "**APPROACH**" }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-41", "text": "First Story Detection is a challenging task (Allan et al., 2000)." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-42", "text": "The highest FSD accuracy is achieved by nearest-neighbour methods, where each incoming document (tweet) is compared to all documents that came before it, and the novelty score is determined by the most similar documents in the past." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-43", "text": "This approach requires us to make n\u22121 comparisons to determine the novelty of document d n ." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-44", "text": "The approach becomes progressively slower with each processed document, and cannot scale up to unbounded streams like Twitter." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-45", "text": "Prior attempts to speed up FSD involve organising previously seen documents d 1 , . . ." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-46", "text": ", d n\u22121 into clusters (Allan et al., 1989) or LSH buckets (Petrovic et al., 2010)." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-47", "text": "The document d n is then compared only to past documents in the nearest cluster or LSH bucket, resulting in substantially fewer than n comparisons." 
}, { "sent_id": "d13502d44435988822e59bcf66b635-C001-48", "text": "While this approach is reasonably effective, it does lead to decreased accuracy, as potentially relevant past documents may exist in other clusters/buckets and would not be compared against." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-49", "text": "----------------------------------" }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-50", "text": "**FIRST STORY DETECTION IN CONSTANT TIME**" }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-51", "text": "Our method computes the novelty of document d n in a time that is constant with respect to n. The main difference from previous approaches is that we do not compare d n to individual documents that came before it." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-52", "text": "Instead, we store the content of past documents d 1 , . . ." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-53", "text": ", d n\u22121 in a single lookup table H n\u22121 ." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-54", "text": "When d n arrives, we count what fraction of its content is novel by looking it up in H n\u22121 ." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-55", "text": "The number of lookups is polynomial in |d n | (the length of the document), and is independent of n. Formally, let d n denote the set of distinct words occurring in the n-th document in the stream. Let a k-term t = {w 1 , w 2 , . . .}" }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-56", "text": "denote a non-empty set of up to k distinct words." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-57", "text": "We define the content c n to be the set of all k-terms that can be formed from the words in the document d n : c n = { t : t \u2282 d n , |t| \u2264 k}. We estimate the novelty of document d n as the proportion of novel k-terms, i.e. 
k-terms that do not appear in the history H n\u22121 :" }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-58", "text": "Here \u03b1 |t| is the weight assigned to k-terms of size |t|, and the binomial coefficient (|d n | choose |t|) is the total number of such k-terms formed from d n ." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-59", "text": "After the novelty is computed, we update the history H to include all k-terms formed from d n :" }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-60", "text": "The computational cost of equations (1) and (2) is determined by the number of k-terms formed from the document d n , and can be bounded by O(|d n | k ) operations." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-61", "text": "The complexity is manageable, as tweets are short and we keep k small." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-62", "text": "----------------------------------" }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-63", "text": "**OPERATING IN CONSTANT TIME AND SPACE**" }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-64", "text": "We use a Bloom filter (Bloom, 1970) to maintain the history H n\u22121 of previously seen k-terms." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-65", "text": "For each k-term t we compute a 32-bit Murmur hashcode, and use it as an index into a fixed-length bit-array." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-66", "text": "This ensures that both membership testing (t\u2208H) and history update can be performed in constant time." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-67", "text": "Constraining H to be a fixed-length array also means that our method operates in constant space, irrespective of the size of the stream and its vocabulary growth." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-68", "text": "In contrast to our method, previous approaches to FSD required more and more memory to maintain the history of the stream (see Figure 3)." 
}, { "sent_id": "d13502d44435988822e59bcf66b635-C001-69", "text": "A potential downside of using a Bloom filter is that it introduces a small probability of false matches: a novel k-term t i may collide with a previously observed k-term t j and would be reported as non-novel." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-70", "text": "The probability of collision is directly proportional to the load factor of the Bloom filter, i.e. the fraction of non-zero bits in the array." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-71", "text": "By Heaps' law (Egghe, 2007) the number of distinct words (and k-terms) will continue to grow and will eventually saturate the bit-array." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-72", "text": "To mitigate this problem, we introduce a deletion strategy: whenever the load factor exceeds a pre-determined threshold \u03c1, we zero out a random bit in H. This allows us to keep the probability of false matches low, at the cost of forgetting some previously-seen k-terms." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-73", "text": "----------------------------------" }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-74", "text": "**PARAMETER SETTINGS**" }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-75", "text": "We make the following parameter choices based on initial experiments on our training dataset." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-76", "text": "We set the maximum size of k-terms to be k = 3 and keep the Bloom filter load factor under \u03c1 = 0.6." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-77", "text": "We tokenize the tweets on punctuation, treat all hashtags and mentions as words, stem them using the stemmer by Krovetz (1993) , but do not remove stopwords." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-78", "text": "We optimise the weights \u03b1 1 . ."
}, { "sent_id": "d13502d44435988822e59bcf66b635-C001-79", "text": ".\u03b1 k using grid search on the same training data set." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-80", "text": "----------------------------------" }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-81", "text": "**EXPERIMENTS**" }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-82", "text": "In a streaming setting, documents arrive one at a time on a continual basis." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-83", "text": "FSD requires computing a novelty score for each document in a single-pass over the data." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-84", "text": "High novelty scores indicate new topics." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-85", "text": "We use the standard TDT evaluation procedure (Allan, 2002) and the official TDT3 evaluation scripts with standard settings for evaluating FSD accuracy." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-86", "text": "The Detection Error Trade-off (DET) curve shows the trade-off between miss and false alarm probability for the full range of novelty scores." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-87", "text": "The normalized Topic Weighted Minimum Cost (C min ) is a linear combination of miss and false alarm probabilities, which allows comparing different methods based on a single value metric." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-88", "text": "Efficiency is measured by the throughput of tweets per second and the memory footprint." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-89", "text": "To ensure a fair comparison, all reported numbers are averaged over 5 runs on an idle machine using a single core (Intel-Xeon CPU with 2.27GHz)." 
}, { "sent_id": "d13502d44435988822e59bcf66b635-C001-90", "text": "----------------------------------" }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-91", "text": "**DATA SET**" }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-92", "text": "We use the data set developed by Petrovic (2013) , Petrovic et al. (2013b) as a test set, which consists of 27 topics and 116,000 tweets from the period of April till September 2011." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-93", "text": "Parameters were tuned using a sample of the data set annotated by Wurzer et al. (2015) as a training set." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-94", "text": "----------------------------------" }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-95", "text": "**BASELINES**" }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-96", "text": "We compare our system (k-term) against three baselines." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-97", "text": "UMass is a state-of-the-art FSD system, developed by Allan et al. (2000) ." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-98", "text": "It is known for its high effectiveness in the TDT2 and TDT3 competitions (Fiscus, 2001) and widely used as a benchmark for FSD systems (Petrovic et al., 2010; Kasiviswanathan et al., 2011; Petrovic, 2013) ." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-99", "text": "UMass makes use of an inverted index and k-nearest-neighbour clustering, which optimize the system for speed by ensuring a minimal number of comparisons." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-100", "text": "To maximise efficiency, we set up UMass to operate in-memory by turning off its default memory mapping to disk." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-101", "text": "This ensures fair comparisons, as all algorithms operate in memory." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-102", "text": "LSH-FSD is a highly-scalable system by Petrovic et al.
(2010) ." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-103", "text": "It is based on Locality Sensitive Hashing (LSH) and claims to operate in constant time and space while performing at a level of accuracy comparable to UMass." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-104", "text": "We configure their system using the default parameters (Petrovic et al., 2010) ." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-105", "text": "----------------------------------" }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-106", "text": "**KL-FSD**" }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-107", "text": "We also compare our approach with the aggregate measure language model (Allan et al., 2003) because it builds upon a similar principle." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-108", "text": "----------------------------------" }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-109", "text": "**EFFECTIVENESS AND EFFICIENCY**" }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-110", "text": "In Table 1 , the UMass system shows state-of-the-art accuracy (C min = 0.79, lower is better), but can only process 30 tweets per second." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-111", "text": "LSH-FSD operates 17 times faster, at the cost of a 13% decrease in accuracy (C min = 0.90)." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-112", "text": "Our system (k-term) operates on par with UMass in terms of accuracy, while being 197 times faster." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-113", "text": "KL-FSD, which is based on uni-grams, shows the highest throughput at a considerable cost of accuracy (C min = 0.96)." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-114", "text": "To further investigate accuracy, we also compare the systems over the full range of the novelty thresholds illustrated by the DET plot in Figure 1 ."
}, { "sent_id": "d13502d44435988822e59bcf66b635-C001-115", "text": "Additionally, we show the 90% confidence interval of UMass in two solid lines." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-116", "text": "We observe that both LSH-FSD and our system (k-term) are statistically indistinguishable from UMass at any Miss-False Alarm trade-off point: their DET curves fall entirely within the 90% confidence interval of UMass." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-117", "text": "Note that the DET curve of UMass is formed by the middle of its 90% confidence interval curves." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-118", "text": "KL-FSD, in contrast, results in significantly worse accuracy than UMass in the mid-range and in particular the high recall area of the DET plot." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-119", "text": "We conclude that uni-grams are insufficient for determining the novelty of tweets." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-120", "text": "----------------------------------" }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-121", "text": "**FSD IN CONSTANT TIME AND SPACE**" }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-122", "text": "High-volume streams require operation in constant time and space." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-123", "text": "Figure 2 compares the change in throughput of LSH-FSD, UMass and k-term as we process more and more tweets in the stream." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-124", "text": "Additionally, the plot also shows the average rate of tweets in the Twitter Firehose 6 at 5,787 tweets per second." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-125", "text": "Note that our system processes the equivalent of the full Twitter stream on a single core of modest hardware."
}, { "sent_id": "d13502d44435988822e59bcf66b635-C001-126", "text": "This surpasses the throughput of LSH-FSD, a system known for high efficiency, by more than an order of magnitude." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-127", "text": "The throughput of LSH-FSD and k-term increases up until 20k documents because both approaches require initialisation of their data structures, which makes them slow when the number of documents is low." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-128", "text": "UMass has no initialisation and performs the fastest when the number of documents is kept low." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-129", "text": "The pair-wise comparison of UMass causes its throughput to decrease drastically with every new document." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-130", "text": "In Figure 3 we compare the memory requirements of k-term and LSH-FSD at different points in the stream." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-131", "text": "Although Petrovic et al. (2010) designed their system (LSH-FSD) to operate in constant space, we found that the memory requirement gradually increases with the number of documents processed, as seen in Figure 3 ." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-132", "text": "We hypothesise that this increase results from new terms added to the vocabulary." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-133", "text": "Our system has a strictly constant memory footprint." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-134", "text": "----------------------------------" }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-135", "text": "**CONCLUSION**" }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-136", "text": "We presented an approach to FSD in a high volume streaming setting in constant time and space."
}, { "sent_id": "d13502d44435988822e59bcf66b635-C001-137", "text": "Our approach computes novelty based on a single lookup table that represents past documents." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-138", "text": "Shifting from direct comparisons with previous documents to comparisons with a single model that combines them accounts for a great increase in efficiency." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-139", "text": "For the first time, we showed that it is possible to perform FSD on the full Twitter stream on a single core of modest hardware." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-140", "text": "This outperforms state-of-the-art systems by an order of magnitude without sacrificing accuracy." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-2", "text": "First Story Detection is hard because the most accurate systems become progressively slower with each document processed." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-3", "text": "We present a novel approach to FSD, which operates in constant time/space and scales to very high volume streams." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-4", "text": "We show that when computing novelty over a large dataset of tweets, our method performs 192 times faster than a state-of-the-art baseline without sacrificing accuracy." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-5", "text": "Our method is capable of performing FSD on the full Twitter stream on a single core of modest hardware."
}, { "sent_id": "d13502d44435988822e59bcf66b635-C001-6", "text": "----------------------------------" }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-8", "text": "First Story Detection (FSD), also called New Event Detection, is the task of identifying the very first document in a stream to mention a new event 1 ." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-9", "text": "FSD was introduced as part of the TDT 2 initiative and has direct applications in finance, news and government security." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-10", "text": "The most accurate approaches to FSD involve a runtime of O(n^2) and cannot scale to unbounded high volume streams such as Twitter." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-11", "text": "We present a novel approach to FSD that operates in O(1) per tweet." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-12", "text": "Our method is able to process the load of the average Twitter Firehose 3 stream on a single core of modest hardware while retaining effectiveness on par with one of the most accurate FSD systems." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-13", "text": "During the TDT program, FSD was applied to news wire documents and solely focused on effectiveness, neglecting efficiency and scalability." }, { "sent_id": "d13502d44435988822e59bcf66b635-C001-14", "text": "The traditional approach to FSD (Petrovic et al., 2010) computes the distance of each incoming document (footnote 1: e.g. a natural disaster or a scandal; footnote 2: TDT by NIST, 1998-2004) ."
} ], "y": { "@BACK@": { "gold_contexts": [ [ "d13502d44435988822e59bcf66b635-C001-14" ], [ "d13502d44435988822e59bcf66b635-C001-35" ], [ "d13502d44435988822e59bcf66b635-C001-98" ], [ "d13502d44435988822e59bcf66b635-C001-102" ] ], "cite_sentences": [ "d13502d44435988822e59bcf66b635-C001-14", "d13502d44435988822e59bcf66b635-C001-35", "d13502d44435988822e59bcf66b635-C001-98", "d13502d44435988822e59bcf66b635-C001-102" ] }, "@USE@": { "gold_contexts": [ [ "d13502d44435988822e59bcf66b635-C001-104" ] ], "cite_sentences": [ "d13502d44435988822e59bcf66b635-C001-104" ] }, "@DIF@": { "gold_contexts": [ [ "d13502d44435988822e59bcf66b635-C001-131" ] ], "cite_sentences": [ "d13502d44435988822e59bcf66b635-C001-131" ] } } }, "ABC_375b9c865d9f1b559387aa01a20a78_22": { "x": [ { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-2", "text": "Applying conventional word embedding models to unsegmented languages, where word boundary is not clear, requires word segmentation as preprocessing." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-3", "text": "However, word segmentation is difficult and expensive to conduct without errors." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-4", "text": "Segmentation error degrades the quality of word embeddings, leading to performance degradation in downstream applications." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-5", "text": "In this paper, we propose a simple segmentation-free method to obtain unsupervised vector representations for words, phrases and sentences from an unsegmented raw corpus." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-6", "text": "Our model is based on subword information skip-gram model, but embedding targets and contexts are character n-grams instead of segmented words." 
}, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-7", "text": "We consider all possible character n-grams in a corpus as targets, and every target is modeled as the sum of its compositional sub-n-grams." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-8", "text": "Our method completely ignores word boundaries in a corpus and is not word-segmentation dependent." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-9", "text": "This approach may sound reckless, but it was found to work well through experiments on real-world datasets and benchmarks." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-10", "text": "----------------------------------" }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-11", "text": "**INTRODUCTION**" }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-12", "text": "Most existing word embedding methods (Bengio et al., 2003; Le and Mikolov, 2014; Pennington et al., 2014 ) take a sequence of words, or tokens, as their input." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-13", "text": "However, previous studies have reported the harmful impact of tokenization (Chung et al., 2016; Oshikiri, 2017) ." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-14", "text": "Some languages are morphologically rich, with a large collection of tokens." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-15", "text": "The number of unique tokens can also be increased by a vast number of misspellings in real-world datasets." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-16", "text": "The explicit tokenization step considers all unique tokens (or character strings) independently, and that makes it infeasible to cover all unique tokens." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-17", "text": "Moreover, tokenization fails to capture the structure that involves multiple tokens."
}, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-18", "text": "Although there are some heuristics to detect phrases (Mikolov et al., 2013) , it is still hard to cover everything perfectly." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-19", "text": "These problems become more significant in unsegmented languages, such as Chinese and Japanese, where word boundaries are not explicitly specified in text." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-20", "text": "A sequence of tokens is obtained by word segmentation tools from a raw character corpus." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-21", "text": "The accuracy of word segmentation tools strongly depends on the richness of dictionaries, and segmentation errors influence the performance or accuracy of subsequent processes (Xu et al., 2004) ." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-22", "text": "For example, poor dictionaries are undesirable when dealing with SNS data containing a vast amount of neologisms." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-23", "text": "However, rich dictionaries are not always available, and maintaining them up-to-date to cover neologisms is also expensive." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-24", "text": "Another problem of word embedding with an explicit tokenization step is the existence of Out-Of-Vocabulary (OOV) words." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-25", "text": "Due to tokenization error (or wrong segmentation), we may lose some words in training data." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-26", "text": "In addition, newly given real-world datasets may include a lot of unseen words and phrases." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-27", "text": "Practically, OOV words in a corpus are replaced with a special token representing OOV."
}, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-28", "text": "The larger OOV rate in a corpus affects the accuracies of downstream tasks (Sun et al., 2005) ." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-29", "text": "In recent years, an increasing number of studies have investigated character-level models with subwords in both unsupervised (Bojanowski et al., 2017; Pagliardini et al., 2018) and supervised learning (Zhang et al., 2015; Sennrich et al., 2016; Wieting et al., 2016; Lee et al., 2017) ." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-30", "text": "In these models, the notion of vocabularies is extended to include sub-words." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-31", "text": "By enriching the information of the word, sub-words are useful for capturing morphological changes (Bojanowski et al., 2017) and the meaning of short phrases (Wieting et al., 2016) ." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-32", "text": "In addition, OOV (or unseen) words can be composed from sub-words, which are present at training (Bojanowski et al., 2017) ." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-33", "text": "In this paper, we propose a simple unsupervised method of character n-gram embedding for unsegmented languages, where the segmentation step is completely omitted thus words, phrases and sentences are treated seamlessly." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-34", "text": "Our model considers all possible character n-grams as embedding targets in a corpus." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-35", "text": "Each n-gram is explicitly modeled as a composition of its sub-n-grams just like each word is modeled as a composition of sub-words in the subword information skipgram model (SISG) (Bojanowski et al., 2017) ." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-36", "text": "Our segmentation-free compositional n-gram embedding is referred to as SCNE in this paper." 
}, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-37", "text": "This kind of approach that does not consider any word boundaries for unsegmented languages may sound reckless since the embedding targets can include a lot of wrong boundaries." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-38", "text": "However, we found that we can compose vector representations for words and sentences with good quality by summing up the representations of their substrings." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-39", "text": "Oshikiri (2017) also proposed a segmentation-free word embedding for unsegmented languages, which is called Sembei, but the word and sentence vectors of our model are enriched by their substrings, and hence our method may achieve better performance in downstream tasks." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-40", "text": "----------------------------------" }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-41", "text": "**MODEL**" }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-42", "text": "Our method SCNE is inspired by recent successful sub-word models (Bojanowski et al., 2017; Zhang et al., 2015) as well as by the segmentation-free character n-gram embedding for unsegmented languages (Oshikiri, 2017) ." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-43", "text": "The vector representation of a target n-gram is defined as follows." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-44", "text": "Let x 1 x 2 \u00b7 \u00b7 \u00b7 x N be a raw unsegmented corpus of N characters." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-45", "text": "For a range" }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-46", "text": "We first count occurrences of n-gram s = s 1 s 2 \u00b7 \u00b7 \u00b7 s n in the raw corpus as x t = s with length n = j\u2212i+1 \u2264 n max ."
}, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-47", "text": "Then n-gram vocabulary V = {s} is constructed by collecting M -most frequent n-grams with n \u2264 n max ." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-48", "text": "For any target n-gram w = w 1 w 2 \u00b7 \u00b7 \u00b7 w n , n \u2265 1, the sequence of its sub-n-grams included in the vocabulary is where sub-n-gram representations z s \u2208 R d , s \u2208 V , are model parameters to be learned below." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-49", "text": "The objective is almost the same as the skipgram model (Mikolov et al., 2013) ," }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-50", "text": "Note that we do not employ compositional form for the contexts." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-51", "text": "The negative sampling distribution P neg of s \u2208 V is proportional to the frequency in the corpus." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-52", "text": "Model parameters z s , u s \u2208 R d , s \u2208 V , are learned by maximizing the objective." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-53", "text": "We set n target = n max in the experiments." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-54", "text": "Large n max , such as 16, contributes to capturing sentence-level representations while 4 to 8 is enough for word embeddings." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-55", "text": "Japanese, and Korean 1 ." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-56", "text": "Through the following two word-level tasks and one sentence-level task, we investigate the qualities of vector representations and their usefulness in practical applications." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-57", "text": "We will make the C++ implementation of our method and pre-trained models available open-source." 
}, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-58", "text": "----------------------------------" }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-59", "text": "**BASELINE SYSTEMS**" }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-60", "text": "As baseline systems, we use C-BOW, Skipgram (Mikolov et al., 2013) , Subword Information Skip-gram (SISG) (Bojanowski et al., 2017) and Segmentation-free word embedding for unsegmented languages (Sembei) (Oshikiri, 2017) for the word-level tasks." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-61", "text": "For the sentence-level task, baselines are PV-DBOW, PV-DM (Le and Mikolov, 2014) and Sent2vec (Pagliardini et al., 2018) ." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-62", "text": "In addition, we test sentence embedding baselines obtained by simple averaging of word embeddings over the sentence, denoted as C-BOW * , Skip-gram * and SISG * ." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-63", "text": "We also test a variant of Sembei, denoted by Sembei * , which calculates word or sentence embeddings by simple averaging of sub-n-gram embeddings, to see whether our model provides more effective compositional n-gram embeddings compared to the previously proposed non-compositional model." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-64", "text": "----------------------------------" }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-65", "text": "**EVALUATION TASKS**" }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-66", "text": "Word similarity task: Our model and the baselines are trained on portions of Wikipedia of increasing size to see the effect of the size of the training data." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-67", "text": "For pairs of words, the cosine similarity between embeddings is compared to human judgment, and the quality is measured by Spearman rank correlation." 
}, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-68", "text": "Most of the settings are the same as that of Bojanowski et al. (2017) ." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-69", "text": "Two widely-used benchmark datasets are used: Chinese word similarity dataset (Jin and Wu, 2012) , which contains 297 pairs of words, and Japanese word similarity dataset (Sakaizawa and Komachi, 2017) , which contains 4429 pairs of words." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-70", "text": "Conventional word embedding methods, C-BOW, Skip-gram, and Sembei, cannot provide the embeddings of OOV words in the test data." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-71", "text": "In contrast, SISG and our model can compute representations for almost all words, since both methods learn compositional n-gram features." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-72", "text": "In order to show comparable results, we use the null vector for these OOV words following Bojanowski et al. (2017) ." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-73", "text": "Noun category prediction task: We use 100MB of SNS data, Sina Weibo for Chinese and Twitter for Japanese and Korean, as training corpora." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-74", "text": "For evaluating the learned embeddings, noun words, including neologisms, and their categories are extracted from Wikidata with the predetermined semantic category set 2 ." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-75", "text": "For each category, a logistic regression classifier is trained from the vector representations, where unseen words are skipped in training and treated as errors in testing." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-76", "text": "Sentiment analysis task: Movie review datasets are used for evaluating sentence embeddings." 
}, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-77", "text": "We use 101,114, 55,837 and 200,000 movie reviews and their rating scores from Yahoo 3 , Yahoo! 4 and Naver Movies 5 for Chinese, Japanese and Korean, respectively." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-78", "text": "Each review is labeled as positive or negative by its rating score." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-79", "text": "We simply concatenate all reviews into a single document and use it as the training corpus." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-80", "text": "At the testing phase, 5,000 positive and 5,000 negative reviews are randomly selected and used for evaluation." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-81", "text": "In this experiment, we only include unsupervised sentence embedding models as baselines, i.e. not task-specific model, to ensure coherence." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-82", "text": "----------------------------------" }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-83", "text": "**COMMON SETTINGS**" }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-84", "text": "The dimension of word vector is d = 200 and the number of negative samples is k = 10 for all methods." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-85", "text": "The number of epochs is 10 for Sembei and SCNE, and 20 for the other baselines." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-86", "text": "The n-gram vocabulary size M = 2 \u00d7 10 6 is used for SISG, Sembei and SCNE." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-87", "text": "The initial learning rate \u03b3 = 10 \u22122 is used for Sembei and SCNE." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-88", "text": "n max = 8 is used for Sembei and SCNE in the word-level tasks, while n max = 16 is also used in the sentence-level task." 
}, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-89", "text": "The other hyperparameters of the baselines, such as window size, are adjusted via grid search." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-90", "text": "For the conventional word-segmentation dependent baselines, we employ widely-used word segmentation tools with only a basic dictionary (basic) or with a rich dictionary added (rich), to see the effect of rich resources." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-91", "text": "In the downstream supervised tasks, vector representations are combined with a logistic regression classifier." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-92", "text": "We repeat training and testing of the classifier 10 times, where the dataset is randomly split into train (60%) and test (40%) sets each time." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-93", "text": "We adopt mean accuracy as the evaluation metric." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-94", "text": "----------------------------------" }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-95", "text": "**RESULTS**" }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-96", "text": "Word similarity task: The results are shown in Figure 1 ." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-97", "text": "The first 10, 50, 100, 200, and 300MB of the Wikipedia corpus are used in each language setting." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-98", "text": "The number of OOV words in the benchmarks grows as the dataset shrinks, and therefore the performance of C-BOW, Skip-gram and Sembei is necessarily degraded." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-99", "text": "We can see that, for all datasets and all corpus sizes, our proposed method SCNE outperforms the baselines."
}, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-100", "text": "We can see that the proposed approach provides high-quality word vectors even when using very small training dataset, which is crucial for practical real-world settings where rich data is not available." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-101", "text": "Noun category prediction task: The results are reported in Table 1 ." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-102", "text": "Since the set of covered nouns (non-OOV words) depends on the methods, we calculate accuracies in two ways for a fair comparison: Using all the nouns and using the intersection of the covered nouns." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-103", "text": "We notice that our proposed model SCNE achieved the highest prediction accuracies in most of the settings." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-104", "text": "These results demonstrate the efficacy of the proposed method both quantitatively and qualitatively." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-105", "text": "Sentiment analysis task: The results are reported in Table 2 ." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-106", "text": "The classification accuracies show that our method SCNE is very effective in the sentence-level application." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-107", "text": "Furthermore, the larger n max contributes to the performance improvement by allowing our model to capture composed representations for longer phrases or sentences, whereas there is little improvement in Sembei * with larger n max ." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-108", "text": "This fact shows the efficacy of compositional n-gram features in our model." 
}, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-109", "text": "----------------------------------" }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-110", "text": "**CONCLUSION**" }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-111", "text": "We proposed segmentation-free compositional ngram embedding (SCNE), a new unsupervised method to acquire general-purpose vector representations of words, phrases and sentences seamlessly, which is especially effective for languages whose word boundaries are not obvious." }, { "sent_id": "375b9c865d9f1b559387aa01a20a78-C001-112", "text": "Although our method does not rely on any manually annotated resources, such as word boundaries or dictionaries for word segmentation tools, our experimental results on several corpora show that our method outperforms the conventional approaches that are dependent on such manually annotated resources." } ], "y": { "@BACK@": { "gold_contexts": [ [ "375b9c865d9f1b559387aa01a20a78-C001-29" ], [ "375b9c865d9f1b559387aa01a20a78-C001-31" ], [ "375b9c865d9f1b559387aa01a20a78-C001-32" ] ], "cite_sentences": [ "375b9c865d9f1b559387aa01a20a78-C001-29", "375b9c865d9f1b559387aa01a20a78-C001-31", "375b9c865d9f1b559387aa01a20a78-C001-32" ] }, "@SIM@": { "gold_contexts": [ [ "375b9c865d9f1b559387aa01a20a78-C001-34", "375b9c865d9f1b559387aa01a20a78-C001-35" ], [ "375b9c865d9f1b559387aa01a20a78-C001-68" ], [ "375b9c865d9f1b559387aa01a20a78-C001-72" ] ], "cite_sentences": [ "375b9c865d9f1b559387aa01a20a78-C001-35", "375b9c865d9f1b559387aa01a20a78-C001-68", "375b9c865d9f1b559387aa01a20a78-C001-72" ] }, "@UNSURE@": { "gold_contexts": [ [ "375b9c865d9f1b559387aa01a20a78-C001-42" ] ], "cite_sentences": [ "375b9c865d9f1b559387aa01a20a78-C001-42" ] }, "@USE@": { "gold_contexts": [ [ "375b9c865d9f1b559387aa01a20a78-C001-60" ], [ "375b9c865d9f1b559387aa01a20a78-C001-68" ], [ "375b9c865d9f1b559387aa01a20a78-C001-72" ] ], "cite_sentences": [ "375b9c865d9f1b559387aa01a20a78-C001-60", 
"375b9c865d9f1b559387aa01a20a78-C001-68", "375b9c865d9f1b559387aa01a20a78-C001-72" ] } } }, "ABC_87a190b1df5a7a941ba7b9a98064a3_22": { "x": [ { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-68", "text": "**3.2.EXPERIMENTAL RESULTS**" }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-2", "text": "Abstract." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-3", "text": "This paper proposes a convolution tree kernel-based approach for relation extraction where the parse tree is expanded with entity features such as entity type, subtype, and mention level etc." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-4", "text": "Our study indicates that not only can our method effectively capture both syntactic structure and entity information of relation instances, but also can avoid the difficulty with tuning the parameters in composite kernels." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-5", "text": "We also demonstrate that predicate verb information can be used to further improve the performance, though its enhancement is limited." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-6", "text": "Evaluation on the ACE2004 benchmark corpus shows that our system slightly outperforms both the previous best-reported feature-based and kernel-based systems." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-7", "text": "----------------------------------" }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-9", "text": "Information extraction is an important research sub-field in natural language processing (NLP) which aims to identify relevant information from large amount of text documents in digital archives and the WWW." 
}, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-10", "text": "Information extraction subsumes three main tasks, including Entity Detection and Tracking (EDT), Relation Detection and Characterization (RDC), and Event Detection and Characterization (EDC)." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-11", "text": "This paper will focus on the ACE RDC task 1 and employ kernel method to extract semantic relationships between named entity pairs." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-12", "text": "Many feature-based approaches transform relation instances into feature vectors of high dimension, and compute the inner dot product between these feature vectors." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-13", "text": "Current research (Kambhatla 2004 , Zhao et al 2005 , Zhou et al. 2005 , Wang et al. 2006 shows that it is very difficult to extract new effective features from relation examples." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-14", "text": "Kernel methods are non-parametric estimation techniques that computer a kernel function between data instances." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-15", "text": "By avoiding transforming data examples into feature vectors, kernel methods can implicitly explore much larger feature space than could be searched by a feature-based approach." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-16", "text": "Thereafter, kernel methods especially on discrete structures (Haussler 1999 ) attract more and more attentions in relation extraction as well as other fields in NLP." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-17", "text": "Prior work on kernel methods for relation extraction includes Zelenko et al. (2003) , Culotta and Sorensen (2004) , Bunescu and Mooney (2005) ." 
}, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-18", "text": "Due to strong constraints that matching nodes be at the same layer and in the identical path starting from the roots to the current nodes, their kernels achieve good precision but much lower recall on the ACE2003 corpus." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-19", "text": "Zhang et al. (2006) proposed a composite kernel that consists of two individual kernels: an entity kernel that allows for entity-related features and a convolution parse tree kernel that models syntactic information of relation examples." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-20", "text": "However, their method needs to manually tune parameters in composite kernels that are often difficult to determine." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-21", "text": "This paper describes an expanded convolution parse tree kernel to incorporate entity information into syntactic structure of relation examples." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-22", "text": "Similar to Zhang et al. (2006) , we employ a convolution parse tree kernel in order to model syntactic structures." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-23", "text": "Different from their method, we use the convolution parse tree kernel expanded with entity information other than a composite kernel." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-24", "text": "One of our motivations is to capture syntactic and semantic information in a single parse tree for further graceful refinement, the other is that we can avoid the difficulty with tuning parameters in composite kernels." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-25", "text": "Evaluation on the ACE2004 corpus shows that our method slightly outperforms the previous feature-base and kernel-based methods." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-26", "text": "The rest of the paper is organized as follows." 
}, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-27", "text": "First, we present our expanded convolution tree kernel in Section 2." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-28", "text": "Then, Section 3 reports the experimental setting and results." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-29", "text": "Finally, we conclude our work with some general observations and indicate future work in Section 4." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-30", "text": "----------------------------------" }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-31", "text": "**EXPANDED TREE KERNEL**" }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-32", "text": "In this section, we describe the expanded convolution parse tree kernel and demonstrate how entity information can be incorporated into the parse tree." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-33", "text": "Figure 1: Different representations of a relation instance in the example sentence \"in many cities, angry crowds roam the streets.\", which is excerpted from the ACE2004 corpus, where a relation \"PHSY.Located\" holds between the first entity \"crowds\"(PER) and the second entity \"streets\" (FAC)." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-34", "text": "We employ the same convolution tree kernel used by Collins and Duffy (2001) , Moschitti (2004) and Zhang et al. (2006) ." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-35", "text": "This convolution tree kernel counts the number of subtrees that have similar productions on every node between two parse trees." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-36", "text": "However, the kernel value will depend greatly on the size of the trees, so we should normalize the kernel." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-37", "text": "From ACE definition on relation types and subtypes, we know that entity features impose a strong constraint on relation types." 
}, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-38", "text": "For example, PER-SOC relations describe the relationship between entities of type PER." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-39", "text": "Zhang et al. (2006) described five cases to extract the portion of parse tree for relation extraction." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-40", "text": "Their experiments show that PT (Path-enclosed Tree) achieves best performance among those cases, so we begin with PT and then incorporate entity features at different locations as depicted in Figure 1 ." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-41", "text": "The four cases is listed as follows:" }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-42", "text": "(1) Compressed Path-enclosed Tree (CPT, T1 in Fig.1 ): Originated from PT in Zhang et al. (2006) , we further make two kinds of compression." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-43", "text": "One is to prune out the children nodes right before the second entity under the same parent node of NP." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-44", "text": "The other is to compress the sub-structure like \"X-->Y-->Z\" into \"X-->Z\" in the parse trees." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-45", "text": "(2) Bottom-attached CPT (B-CPT, T2 in Fig.1 ): the entity type information is attached to the bottom of the entity node, i.e., two more nodes whose tags are \"TP\" are added under the first and the second entity nodes respectively." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-46", "text": "(3) Entity-attached CPT (E-CPT, T3 in Fig.1 ): the entity type name is combined with entity order name, e.g. \"E1-PER\" denotes the first entity whose type is \"PER\"." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-47", "text": "This case is also explored by Zhang et al. (2006) , and we include it here just for the purpose of comparison." 
}, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-48", "text": "(4) Top-attached CPT (T-CPT, T4 in Fig.1 ): the entity type information is attached to the top node of the parse tree." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-49", "text": "In order to distinguish between two entities, we use tags \"TP1\" and \"TP2\" to represent the first entity type and the second entity type respectively." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-50", "text": "From the above four cases, we want to evaluate whether and how the entity information will be useful for relation extraction and in what way we can embed the entity information (especially the location where we attach) in the parse tree in order to achieve the best performance." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-51", "text": "----------------------------------" }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-52", "text": "**EXPERIMENTS**" }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-53", "text": "----------------------------------" }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-54", "text": "**3.1.EXPERIMENTAL CORPUS AND SETTING**" }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-55", "text": "We use the ACE RDC 2004 corpus as our experiment data." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-56", "text": "The ACE RDC 2004 data contains 451 documents and 5702 relation instances." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-57", "text": "It defines 7 entity types, 7 major relation types and 23 subtypes." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-58", "text": "The portion of training data we use contains 347 documents, 121K words and 4307 relations." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-59", "text": "Evaluation of kernel is done on the training data using 5-fold cross-validation." 
}, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-60", "text": "First, the corpus is parsed using Charniak's parser (Charniak, 2001) ." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-61", "text": "Then, we iterate over all pairs of entity mentions occurring in the same sentence to generate potential relation instances." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-62", "text": "We choose SVM (Vapnik 1998) as the binary classifier, since SVM has achieved the state-ofthe-art performances for many classification problems like text categorization (Joachims 1998) ." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-63", "text": "For efficiency, we apply the one-against-others approach to convert binary classifier to multiclass classifier." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-64", "text": "The final decision of a relation instance in the multi-class classification is determined by the classifier which has the maximal SVM output." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-65", "text": "In our implementation, we use the binary-class SVMLight (Joachims, 1998) and Tree Kernel Tools (Moschitti, 2004) ." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-66", "text": "For comparison with the composite kernels (Zhang et.al. 2006) , our training parameter C (SVM) and \u03bb (tree kernel) are set to 2.4 and 0.4 respectively." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-67", "text": "----------------------------------" }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-69", "text": "In this section, we present and analyze the experimental results with respect to different settings." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-70", "text": "(1) Different instance representations According to the above discussion, we select CPT with entity order information as our baseline to try to discover whether and how entity information will be effective to relation extraction." 
}, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-71", "text": "In order to reduce training time, we only add major type information into the parse tree." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-72", "text": "Table 1 compares the performance of seven major types for three different setups in the ACE2004 corpus using expanded convolution tree kernel." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-73", "text": "It shows that:" }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-74", "text": "Using convolution parse tree kernel only embedded with entity order information achieves the performance of 67.8%/52.3%/59.0 in precision/recall/F-measure." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-75", "text": "This indicates that convolution parse tree kernel is somewhat effective for relation extraction." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-76", "text": "Compared with CPT, other three setups B-CPT, E-CPT and T-CPT improve the F-measure by 8.5/10.1/10.5 units respectively due to the increase both in precision and recall." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-77", "text": "This shows that entity major type information incorporated into the parse tree of relation instances produces significant improvement for relation extraction." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-78", "text": "This further suggests that our parse tree kernel can effectively capture both the entity information and the structured syntactic information of relation examples." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-79", "text": "Among the three different instance representations except CPT, the T-CPT (highlighted in bold font) achieves slightly better performance of 2.0/0.4 units in F-measure than the other two representations B-CPT and E-CPT respectively." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-80", "text": "This may be due to the following reason." 
}, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-81", "text": "From the definition of the convolution parse tree kernel, we introduce a decay factor \u03bb (set to 0.4 here) to make the kernel less dependent on the tree size." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-82", "text": "However, this factor also decreases the contribution of the entity information on the kernel when they are attached to the bottom of the entity." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-83", "text": "Table 2 reports the contribution of different entity features over seven major types in the ACE2004 corpus using the above T-CPT kernel." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-84", "text": "It indicates that our system achieves the best performance of 79.0%/66.4%/72.2 in precision/recall/F-measure when combining some of the entity features." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-85", "text": "In order to measure the contribution of different entity features we add them one by one in the decreasing order of their potential importance." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-86", "text": "It also shows: Entity type feature is very effective for relation extraction and it increases precision/recall/Fmeasure by 8.2%/11.7%/10.5 units respectively." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-87", "text": "Entity subtype feature improves the F-measure by 1.2 units." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-88", "text": "This further shows that gracefully defined entity type and subtype features in the ACE2004 corpus contribute to most of the performance improvement among all entity features." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-89", "text": "Mention level feature is also useful and increases the F-measure by 1.5 units while both entity class and GPE role feature are futile because they don't lead to any improvement in Fmeasure." 
}, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-90", "text": "Other two entity features (i.e. \"head word\", \"LDC mention type\"), however, both decrease the performance by 0.6/0.4 units in F-measure respectively." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-91", "text": "This suggests that both of these features can't differentiate relation types from each other and their incorporations into parse tree make relation extraction even more difficult." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-92", "text": "In the last experiment (highlighted in bold and italic font) we add the base form of the predicate verb nearest to the second entity mention." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-93", "text": "Although it only improves the Fmeasure by 0.6 units largely due to the increase in recall, it indicates that moving verbs from the bottom to the top of the parse tree is helpful to relation extraction." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-94", "text": "This also suggests that constructing a parse tree that contains all necessary features and is designed specifically for relation extraction is very promising." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-95", "text": "(3) Different relation lexical condition In ACE vocabulary, relation lexical condition indicates the syntactic structure where the entity pair relates to each other." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-96", "text": "There are five relation lexical conditions in the ACE2004 corpus, i.e. \"Possessive\", \"Preposition\", \"PreMod\", \"Formulaic\" and \"Verbal\"." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-97", "text": "Table 3 separately measures the recall performance of different relation lexical condition on one of the testing sets in the ACE2004 corpus." 
}, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-98", "text": "It also indicates the number of testing instance, correctly classified instances and wrongly classified instances for each condition respectively." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-99", "text": "The recall performance is best in the condition \"Possessive\"." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-100", "text": "This may be largely due to consistency of syntactic structure for this condition in the ACE2004 corpus." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-101", "text": "It is somewhat surprising that our system performs worse than we expected in the condition \"Formulaic\", since we think that there should be several fixed patterns for this condition." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-102", "text": "The reason may be that there are many syntactic errors in the parse trees produced by Charniak's parser although this parser represents the-start-of-art in parsing." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-103", "text": "Finally our system achieves surprisingly lowest performance in the condition \"Verbal\" although they occur frequently in the testing data." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-104", "text": "This may be that the syntactic structure in this condition is diverse and it contains too much noise in this kind of parse tree." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-105", "text": "It also suggests that much more noise needs to be pruned out from the parse tree while the key relation structure should remain in this condition." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-106", "text": "(4) Comparison with recent work Table 4 compares our system with recent work on the ACE2004 corpus." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-107", "text": "It shows that our system slightly outperforms recently best-reported systems." 
}, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-108", "text": "Compared with the composite kernel (Zhang et al, 2006) , our system further prunes the parse tree and incorporates entity features into the convolution parse tree kernel." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-109", "text": "It shows that our system achieves higher precision, lower recall and slightly better F-measure than their method." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-110", "text": "Compared with featurebased systems (Zhou et al, 2006 and Zhao et al, 2005 ) that incorporate many lexical, syntactic and semantic features, our system improves the F-measure by 1.8/2.5 units over relation types respectively." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-111", "text": "This suggests that kernel-based systems can promisingly outperform feature-based systems, although much work like performance enhancement and reduction of training speed still needs to be done to further improve the system." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-112", "text": "----------------------------------" }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-113", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-114", "text": "In this paper, we have designed a convolution parse tree kernel expanded with entity features for relation extraction using Support Vector Machines." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-115", "text": "Evaluation on the ACE2004 corpus shows that the expanded convolution parse tree kernel achieves better performance on relation extraction than recent feature-based and kernel-based systems." 
}, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-116", "text": "This may result from the following reasons: First, syntactic structure information of relation examples is very useful and can be effectively captured by the convolution parse tree kernel, therefore the convolution parse tree alone achieves comparable performance on relation extraction." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-117", "text": "Second, the expanded convolution parse tree incorporated with entity features significantly improves performance. And the higher we put entity feature node in the parse tree, the better performance we can get." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-118", "text": "We also discover that entity type feature contributes to most of performances improvement while some other features such as \"head word\" or \"GPE role\" conversely decrease the performance." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-119", "text": "Last, compared with other recent systems, performance enhancement of our system is limited, for many parse errors still exist both in short-distance relations and longdistance relations even though the Charniak's parser we use in our system represents the-stateof-the-art in full parsing." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-120", "text": "This suggests that the parser needs to be further improved in order to provide more accurate syntactic structure information." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-121", "text": "In the future work, we will try to construct a dynamic relation tree to reflect both the syntactic structure and semantic information more accurately." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-122", "text": "First, we will further prune out the noise from the parse tree according to linguistic knowledge especially for lexical condition \"Verbal\"." 
}, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-123", "text": "Second, more weight will be assigned to discriminative features no matter where they are located (e.g. entity features, predicate verb and preposition etc) to reflect their contributions." }, { "sent_id": "87a190b1df5a7a941ba7b9a98064a3-C001-124", "text": "Last, we will use semantic resources such as WordNet to compute semantic similarity between terminal words (e.g. noun for entity and verb for predicate respectively) in the parse tree." } ], "y": { "@SIM@": { "gold_contexts": [ [ "87a190b1df5a7a941ba7b9a98064a3-C001-22" ], [ "87a190b1df5a7a941ba7b9a98064a3-C001-34" ], [ "87a190b1df5a7a941ba7b9a98064a3-C001-47" ] ], "cite_sentences": [ "87a190b1df5a7a941ba7b9a98064a3-C001-22", "87a190b1df5a7a941ba7b9a98064a3-C001-34", "87a190b1df5a7a941ba7b9a98064a3-C001-47" ] }, "@USE@": { "gold_contexts": [ [ "87a190b1df5a7a941ba7b9a98064a3-C001-22" ], [ "87a190b1df5a7a941ba7b9a98064a3-C001-34" ], [ "87a190b1df5a7a941ba7b9a98064a3-C001-42" ], [ "87a190b1df5a7a941ba7b9a98064a3-C001-39", "87a190b1df5a7a941ba7b9a98064a3-C001-40" ] ], "cite_sentences": [ "87a190b1df5a7a941ba7b9a98064a3-C001-22", "87a190b1df5a7a941ba7b9a98064a3-C001-34", "87a190b1df5a7a941ba7b9a98064a3-C001-42" ] }, "@DIF@": { "gold_contexts": [ [ "87a190b1df5a7a941ba7b9a98064a3-C001-107", "87a190b1df5a7a941ba7b9a98064a3-C001-108" ] ], "cite_sentences": [ "87a190b1df5a7a941ba7b9a98064a3-C001-108" ] }, "@BACK@": { "gold_contexts": [ [ "87a190b1df5a7a941ba7b9a98064a3-C001-39" ] ], "cite_sentences": [] }, "@MOT@": { "gold_contexts": [ [ "87a190b1df5a7a941ba7b9a98064a3-C001-39", "87a190b1df5a7a941ba7b9a98064a3-C001-40" ] ], "cite_sentences": [] } } }, "ABC_be39cfec0479ace0a7e08508239cb0_22": { "x": [ { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-87", "text": "now to our fine-tuning experiments." 
}, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-2", "text": "While paragraph embedding models are remarkably effective for downstream classification tasks, what they learn and encode into a single vector remains opaque." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-3", "text": "In this paper, we investigate a state-of-the-art paragraph embedding method proposed by Zhang et al. (2017) and discover that it cannot reliably tell whether a given sentence occurs in the input paragraph or not." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-4", "text": "We formulate a sentence content task to probe for this basic linguistic property and find that even a much simpler bag-of-words method has no trouble solving it." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-5", "text": "This result motivates us to replace the reconstructionbased objective of Zhang et al. (2017) with our sentence content probe objective in a semisupervised setting." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-6", "text": "Despite its simplicity, our objective improves over paragraph reconstruction in terms of (1) downstream classification accuracies on benchmark datasets, (2) faster training, and (3) better generalization ability." 
}, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-7", "text": "----------------------------------" }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-8", "text": "**1**" }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-9", "text": "----------------------------------" }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-11", "text": "Methods that embed a paragraph into a single vector have been successfully integrated into many NLP applications, including text classification (Zhang et al., 2017) , document retrieval (Le and Mikolov, 2014) , and semantic similarity and relatedness (Dai et al., 2015; Chen, 2017) ." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-12", "text": "However, downstream performance provides little insight into the kinds of linguistic properties that are encoded by these embeddings." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-13", "text": "Inspired by the growing body of work on sentence-level linguistic probe tasks (Adi et al., 2017; Conneau et al., 2018) , we set out to evaluate a state-of-the-art paragraph embedding method using a probe task to measure how well it encodes the identity of the sentences within a paragraph." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-14", "text": "We discover that the method falls short of capturing this basic property, and that implementing a simple objective to fix this issue improves classification performance, training speed, and generalization ability." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-15", "text": "We specifically investigate the paragraph embedding method of Zhang et al. (2017) , which consists of a CNN-based encoder-decoder model paired with a reconstruction objective to learn powerful paragraph embeddings that are capable of accurately reconstructing long paragraphs." 
}, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-16", "text": "This model significantly improves downstream classification accuracies, outperforming LSTM-based alternatives (Li et al., 2015) ." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-17", "text": "How well do these embeddings encode whether or not a given sentence appears in the paragraph?" }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-18", "text": "Conneau et al. (2018) show that such identity information is correlated with performance on downstream sentence-level tasks." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-19", "text": "We thus design a probe task to measure the extent to which this sentence content property is captured in a paragraph embedding." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-42", "text": "Our dataset has 346,033 training paragraphs, 19,368 for validation, and 19,350 for testing." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-20", "text": "Surprisingly, our experiments (Section 2) reveal that despite its impressive downstream performance, the model of Zhang et al. (2017) substantially underperforms a simple bagof-words model on our sentence content probe." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-21", "text": "Given this result, it is natural to wonder whether the sentence content property is actually useful for downstream classification." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-22", "text": "To explore this question, we move to a semi-supervised setting by pre-training the paragraph encoder in Zhang et al.'s model (2017) on either our sentence content objective or its original reconstruction objective, and then optionally fine-tuning it on supervised classification tasks (Section 3)." 
}, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-23", "text": "Sentence content significantly improves over reconstruction on standard benchmark datasets both with and without fine-tuning; additionally, this objective is four times faster to train than the reconstruction-based variant." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-24", "text": "Furthermore, pre-training with sentence content substantially boosts generalization ability: fine-tuning a pre-trained model on just 500 labeled reviews from the Yelp sentiment dataset surpasses the accuracy of a purely supervised model trained on 100,000 labeled reviews." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-25", "text": "Our results indicate that incorporating probe objectives into downstream models might help improve both accuracy and efficiency, which we hope will spur more linguistically-informed research into paragraph embedding methods." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-26", "text": "----------------------------------" }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-27", "text": "**PROBING PARAGRAPH EMBEDDINGS FOR SENTENCE CONTENT**" }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-28", "text": "In this section, we first fully specify our probe task before comparing the model of Zhang et al. (2017) to a simple bag-of-words model." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-29", "text": "Somewhat surprisingly, the latter substantially outperforms the former despite its relative simplicity." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-30", "text": "----------------------------------" }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-31", "text": "**PROBE TASK DESIGN**" }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-32", "text": "Our proposed sentence content task is a paragraph-level analogue to the word content task of Adi et al. (2017) (Le and Mikolov, 2014) , and BOW models using pre-trained word representations." 
}, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-33", "text": "DCNN as in the original paper." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-34", "text": "In all experiments, we use their publicly available code." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-35", "text": "4" }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-36", "text": "Bag-of-words (BoW): This model is simply an average of the word vectors learned by a trained CNN-R model." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-37", "text": "BoW models have been shown to be surprisingly good at sentence-level probe tasks (Adi et al., 2017; Conneau et al., 2018) ." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-38", "text": "----------------------------------" }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-39", "text": "**PROBE EXPERIMENTAL DETAILS**" }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-40", "text": "Paragraphs to train our classifiers are extracted from the Hotel Reviews corpus (Li et al., 2015) , which has previously been used for evaluating the quality of paragraph embeddings (Li et al., 2015; Zhang et al., 2017) ." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-41", "text": "We only consider paragraphs that have at least two sentences." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-43", "text": "The average numbers of sentences per paragraph, tokens per paragraph, and tokens per sentence are 8.0, 123.9, and 15.6, respectively." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-44", "text": "The vocabulary contains 25,000 tokens." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-45", "text": "To examine the effect of the embedding dimensionality d on the results, we trained models with d \u2208 {100, 300, 500, 700, 900}. Each classifier is a feed-forward neural network with a single 300-d ReLu layer." 
}, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-46", "text": "We use a minibatch size of 32, Adam optimizer (Kingma and Ba, 2015) with a learning rate of 2e-4, and a dropout rate of 0.5 (Srivastava et al., 2014) ." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-47", "text": "We trained classifiers for a maximum of 100 epochs with early stopping based on validation performance." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-48", "text": "----------------------------------" }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-49", "text": "**BOW OUTPERFORMS CNN-R ON SENTENCE CONTENT**" }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-50", "text": "Our probe task results are displayed in Figure 1 ." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-51", "text": "Interestingly, BoW performs significantly better" }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-52", "text": "----------------------------------" }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-53", "text": "**ENCODER**" }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-54", "text": "Figure 2: A visualization of our semi-supervised approach." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-55", "text": "We first train the CNN encoder (shown as two copies with shared parameters) on unlabeled data using our sentence content objective." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-56", "text": "The encoder is then used for downstream classification tasks." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-57", "text": "----------------------------------" }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-58", "text": "**SETTING CNN-R BOW**" }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-59", "text": "Without s + excluded from p 61.2 82.3 With s + excluded from p 57.5 61.7 Table 1 : Probe task accuracies without and with s + excluded from p, measured at d = 300." 
}, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-60", "text": "BoW's accuracy degrades quickly in the latter case, suggesting that it relies much more on low-level matching." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-61", "text": "than CNN-R, achieving an accuracy of 87.2% at 900 dimensions, compared to only 66.4% for CNN-R. We hypothesize that much of BoW's success is because it is easier for the model to perform approximate string matches between the candidate sentence and text segments within the paragraph than it is for the highly non-linear representations of CNN-R." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-62", "text": "To investigate this further, we repeat the experiment, but exclude the sentence s + from the paragraph p during both training and testing." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-63", "text": "As we would expect (see Table 1 ), BoW's performance degrades significantly (20.6% absolute) with s + excluded from p, whereas CNN-R experiences a more modest drop (3.6%)." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-64", "text": "While BoW still outperforms CNN-R in this new setting, the dramatic drop in accuracy suggests that it relies much more heavily on low-level matching." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-65", "text": "----------------------------------" }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-66", "text": "**SENTENCE CONTENT IMPROVES PARAGRAPH CLASSIFICATION**" }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-67", "text": "Motivated by our probe results, we further investigate whether incorporating the sentence content property into a paragraph encoder can help increase downstream classification accuracies." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-68", "text": "We propose a semi-supervised approach by pretraining the encoder of CNN-R using our sentence content objective, and optionally fine-tuning it on different classification tasks." 
}, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-69", "text": "this procedure can be seen in Figure 2 ." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-70", "text": "We compare our approach (henceforth CNN-SC) without and with fine-tuning against CNN-R, which uses a reconstruction-based objective." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-71", "text": "5 We report comparisons on three standard paragraph classification datasets: Yelp Review Polarity (Yelp), DBPedia, and Yahoo! Answers (Yahoo) (Zhang et al., 2015) , which are instances of common classification tasks, including sentiment analysis and topic classification." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-72", "text": "Table 2 shows the statistics for each dataset." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-73", "text": "Paragraphs from each training set without labels were used to generate training data for unsupervised pre-training." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-74", "text": "Sentence content significantly improves over reconstruction on both in-domain and out-ofdomain data We first investigate how useful each pre-training objective is for downstream classification without any fine-tuning by simply training a classifier on top of the frozen pre-trained CNN encoder." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-75", "text": "We report the best downstream performance for each model across different numbers of pre-training epochs." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-76", "text": "The first row of Table 3 shows the downstream accuracy on Yelp when the whole unlabeled data of the Yelp training set is used for unsupervised pre-training." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-77", "text": "Strikingly, CNN-SC achieves an accuracy of 90.0%, outperforming CNN-R by a large margin." 
}, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-78", "text": "Additionally, sentence content is four times as fast to train as the computationally-expensive reconstruction objective." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-79", "text": "6 Are representations obtained using these objectives more useful when learned from in-domain data?" }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-80", "text": "To examine the dataset effect, we repeat our experiments using paragraph embeddings pre-trained using these objectives on a subset of Wikipedia (560K paragraphs)." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-81", "text": "The second row of Table 3 shows that both approaches suffer a drop in downstream accuracy when pre-trained on out-of-domain data." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-82", "text": "Interestingly, CNN-SC still performs best, indicating that sentence content is more suitable for downstream classification." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-83", "text": "Another advantage of our sentence content objective over reconstruction is that it better correlates to downstream accuracy (see Appendix A.2)." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-84", "text": "For reconstruction, there is no apparent correlation between BLEU and downstream accuracy; while BLEU increases with the number of epochs, the downstream performance quickly begins to decrease." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-85", "text": "This result indicates that early stopping based on BLEU is not feasible with reconstruction-based pre-training objectives." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-86", "text": "With fine-tuning, CNN-SC substantially boosts accuracy and generalization We switch gears Table 4 : CNN-SC outperforms other baseline models that do not use external data, including CNN-R. All baseline models are taken from Zhang et al. (2017) ." 
}, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-88", "text": "Specifically, we take the CNN encoder pre-trained using our sentence content objective and then fine-tune it on downstream classification tasks with supervised labels." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-89", "text": "While our previous version of CNN-SC created just a single positive/negative pair of examples from a single paragraph, for our finetuning experiments we create a pair of examples from every sentence in the paragraph to maximize the training data." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-90", "text": "For each task, we compare against the original CNN-R model in (Zhang et al., 2017) ." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-91", "text": "Figure 3 shows the model performance with fine-tuning on 0.1% to 100% of the training set of each dataset." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-92", "text": "One interesting result is that CNN-SC relies on very few training examples to achieve comparable accuracy to the purely supervised CNN model." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-93", "text": "For instance, fine-tuning CNN-SC using just 500 labeled training examples surpasses the accuracy of training from scratch on 100,000 labeled examples, indicating that the sentence content encoder generalizes well." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-94", "text": "CNN-SC also outperforms CNN-R by large margins when only small amounts of labeled training data are available." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-95", "text": "Finally, when all labeled training data is used, CNN-SC achieves higher classification accuracy than CNN-R on all three datasets (Table 4) ." 
}, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-96", "text": "While CNN-SC exhibits a clear preference for target task unlabeled data (see Table 3 ), we can additionally leverage large amounts of unlabeled general-domain data by incorporating pretrained word representations from language models into CNN-SC." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-97", "text": "Our results show that further improvements can be achieved by training the sentence content objective on top of the pre-trained language model representations from ULMFiT (Howard and Ruder, 2018 ) (see Appendix A.3), indicating that our sentence content objective learns complementary information." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-98", "text": "On Yelp, it exceeds the performance of training from scratch on the whole labeled data (560K examples) with only 0.1% of the labeled data." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-99", "text": "CNN-SC implicitly learns to distinguish between class labels The substantial difference in downstream accuracy between pre-training on indomain and out-of-domain data (Table 3) implies that the sentence content objective is implicitly learning to distinguish between class labels (e.g., that a candidate sentence with negative sentiment is unlikely to belong to a paragraph with positive sentiment)." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-100", "text": "If true, this result implies that CNN-SC prefers not only in-domain data but also a representative sample of paragraphs from all class labels." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-101", "text": "To investigate, we conduct an additional experiment that restricts the class label from which negative sentence candidates s \u2212 are sampled." 
}, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-102", "text": "We experiment with two sources of s \u2212 : (1) paragraphs of the same class label as the probe paragraph (CNN-SC \u2212 ), and (2) paragraphs from a different class label (CNN-SC + )." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-103", "text": "Figure 4 reveals that the performance of CNN-SC drops dramatically when trained on the first dataset and improves when trained on the second dataset, which confirms our hypothesis." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-104", "text": "----------------------------------" }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-105", "text": "**RELATED WORK**" }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-125", "text": "A.3 Further improvements by training sentence content on top of pre-trained language model representations Figure 6 shows that further improvements can be achieved by training sentence content on top of the pre-trained language model representations from ULMFiT (Howard and Ruder, 2018) on Yelp and IMDB (Maas et al., 2011) datasets, indicating that our sentence content objective learns complementary information." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-106", "text": "Text embeddings and probe tasks A variety of methods exist for obtaining fixed-length dense vector representations of words (e.g., Mikolov et al., 2013; Pennington et al., 2014; Peters et al., 2018) , sentences (e.g., Kiros et al., 2015; Conneau et al., 2017; Subramanian et al., 2018; Cer et al., 2018) , and larger bodies of text (e.g., Le and Mikolov, 2014; Dai et al., 2015; Iyyer et al., 2015; Li et al., 2015; Chen, 2017; Zhang et al., 2017) To analyze word and sentence embeddings, recent work has studied classification tasks that probe them for various linguistic properties (Shi et al., 2016; Adi et al., 2017; Belinkov et al., 2017a,b; Conneau et al., 2018; Tenney et al., 2019) ." 
}, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-107", "text": "In this paper, we extend the notion of probe tasks to the paragraph level." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-108", "text": "Transfer learning Another line of related work is transfer learning, which has been the driver of recent successes in NLP." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-109", "text": "Recently-proposed objectives for transfer learning include surrounding sentence prediction (Kiros et al., 2015) , paraphrasing (Wieting and Gimpel, 2017) , entailment (Conneau et al., 2017) , machine translation (McCann et al., 2017) , discourse (Jernite et al., 2017; Nie et al., 2017) , and language modeling (Peters et al., 2018; Radford et al., 2018; Devlin et al., 2018) ." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-110", "text": "----------------------------------" }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-111", "text": "**CONCLUSIONS AND FUTURE WORK**" }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-112", "text": "In this paper, we evaluate a state-of-the-art paragraph embedding model, based on how well it captures the sentence identity within a paragraph." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-113", "text": "Our results indicate that the model is not fully aware of this basic property, and that implementing a simple objective to fix this issue improves classification performance, training speed, and generalization ability." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-114", "text": "Future work can investigate other embedding methods with a richer set of probe tasks, or explore a wider range of downstream tasks." 
}, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-115", "text": "----------------------------------" }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-116", "text": "**A APPENDICES**" }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-117", "text": "A.1 BoW models outperform more complex models on our sentence content probe" }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-118", "text": "In addition to the paragraph embedding models presented in the main paper, we also experiment (Chen, 2017) represents a document as an average of randomly-sampled words from within the document." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-119", "text": "The method introduces a corruption mechanism that favors rare but important words while suppressing frequent but uninformative ones." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-120", "text": "Doc2VecC was found to outperform other unsupervised BoW-style algorithms, including Paragraph Vector (Le and Mikolov, 2014) , on downstream tasks." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-121", "text": "Other BoW models: We also consider other BoW models with pre-trained word embeddings or contextualized word representations, including Word2Vec (Mikolov et al., 2013) , Glove (Pennington et al., 2014) , and ELMo (Peters et al., 2018) ." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-122", "text": "Paragraph embeddings are computed as the average of the word vectors." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-123", "text": "For ELMo, we take the average of the layers." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-124", "text": "The results of our sentence content probe task are summarized in Table 5 ." }, { "sent_id": "be39cfec0479ace0a7e08508239cb0-C001-126", "text": "7 On Yelp, it exceeds the performance of training from scratch on the whole labeled data (560K examples) with only 0.1% of the labeled data." 
} ], "y": { "@MOT@": { "gold_contexts": [ [ "be39cfec0479ace0a7e08508239cb0-C001-3" ], [ "be39cfec0479ace0a7e08508239cb0-C001-5" ] ], "cite_sentences": [ "be39cfec0479ace0a7e08508239cb0-C001-3", "be39cfec0479ace0a7e08508239cb0-C001-5" ] }, "@BACK@": { "gold_contexts": [ [ "be39cfec0479ace0a7e08508239cb0-C001-11" ], [ "be39cfec0479ace0a7e08508239cb0-C001-40" ], [ "be39cfec0479ace0a7e08508239cb0-C001-106" ] ], "cite_sentences": [ "be39cfec0479ace0a7e08508239cb0-C001-11", "be39cfec0479ace0a7e08508239cb0-C001-40", "be39cfec0479ace0a7e08508239cb0-C001-106" ] }, "@USE@": { "gold_contexts": [ [ "be39cfec0479ace0a7e08508239cb0-C001-15" ], [ "be39cfec0479ace0a7e08508239cb0-C001-86" ] ], "cite_sentences": [ "be39cfec0479ace0a7e08508239cb0-C001-15", "be39cfec0479ace0a7e08508239cb0-C001-86" ] }, "@UNSURE@": { "gold_contexts": [ [ "be39cfec0479ace0a7e08508239cb0-C001-20" ], [ "be39cfec0479ace0a7e08508239cb0-C001-90" ] ], "cite_sentences": [ "be39cfec0479ace0a7e08508239cb0-C001-20", "be39cfec0479ace0a7e08508239cb0-C001-90" ] }, "@DIF@": { "gold_contexts": [ [ "be39cfec0479ace0a7e08508239cb0-C001-28", "be39cfec0479ace0a7e08508239cb0-C001-29" ] ], "cite_sentences": [ "be39cfec0479ace0a7e08508239cb0-C001-28" ] } } }, "ABC_45551e674210bb9bbb56c8778d2f8c_22": { "x": [ { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-2", "text": "Abuse on the Internet represents a significant societal problem of our time." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-3", "text": "Previous research on automated abusive language detection in Twitter has shown that communitybased profiling of users is a promising technique for this task." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-4", "text": "However, existing approaches only capture shallow properties of online communities by modeling followerfollowing relationships." 
}, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-5", "text": "In contrast, working with graph convolutional networks (GCNs), we present the first approach that captures not only the structure of online communities but also the linguistic behavior of the users within them." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-6", "text": "We show that such a heterogeneous graph-structured modeling of communities significantly advances the current state of the art in abusive language detection." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-7", "text": "----------------------------------" }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-9", "text": "Matthew Zook (2012) carried out an interesting study showing that the racist tweets posted in response to President Obama's re-election were not distributed uniformly across the United States but instead formed clusters." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-10", "text": "This phenomenon is known as homophily: i.e., people, both in real life and online, tend to cluster with those who appear similar to themselves." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-11", "text": "To model homophily, recent research in abusive language detection on Twitter (Mishra et al., 2018a) incorporates embeddings for authors (i.e., users who have composed tweets) that encode the structure of their surrounding communities." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-12", "text": "The embeddings (called author profiles) are generated by applying a node embedding framework to an undirected unlabeled community graph where nodes denote the authors and edges the follower-following relationships amongst them on Twitter." 
}, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-13", "text": "However, these profiles do not capture the linguistic behavior of the authors and their communities and do not convey whether their tweets tend to be abusive or not." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-14", "text": "In contrast, we represent the community of authors as a heterogeneous graph consisting of two types of nodes, authors and their tweets, rather than a homogeneous community graph of authors only." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-15", "text": "The primary advantage of such heterogeneous representations is that they enable us to model both community structure as well as the linguistic behavior of authors in these communities." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-16", "text": "To generate richer author profiles, we then propose a semi-supervised learning approach based on graph convolutional networks (GCNs) applied to the heterogeneous graph representation." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-17", "text": "To the best of our knowledge, our work is the first to use GCNs to model online communities in social media." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-18", "text": "We demonstrate that our methods provide significant improvements over existing techniques." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-19", "text": "----------------------------------" }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-20", "text": "**RELATED WORK**" }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-21", "text": "Supervised learning for abusive language detection was first explored by Spertus (1997) who extracted rule-based features to train their classifier." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-22", "text": "Subsequently, manually-engineered lexicalsyntactic features formed the crux of most approaches to the task (Yin et al., 2009; Warner and Hirschberg, 2012) ." 
}, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-23", "text": "Djuric et al. (2015) showed that dense comment representations generated using paragraph2vec outperform bag-of-words features." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-24", "text": "Several works have since utilized (deep) neural architectures to achieve impressive results on a variety of abuse-annotated datasets (Nobata et al., 2016; Pavlopoulos et al., 2017a) ." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-25", "text": "Recently, the research focus has shifted towards extraction of features that capture behavioral and social traits of users." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-26", "text": "Pavlopoulos et al. (2017b) showed that including randomly-initialized user embeddings improved the performance of their RNN methods." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-27", "text": "Qian et al. (2018) Following previous work (Mishra et al., 2018a) , we experiment with a subset of the Twitter dataset compiled by Waseem and Hovy (2016" }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-28", "text": "----------------------------------" }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-29", "text": "**APPROACH**" }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-30", "text": "----------------------------------" }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-31", "text": "**REPRESENTING ONLINE COMMUNITIES**" }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-32", "text": "We create two different graphs: the first one is identical to the community graph of Mishra et al. (2018a) (referred to as the community graph)." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-33", "text": "It contains 1, 875 nodes representing each of the authors in the dataset." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-34", "text": "Two authors/nodes are connected by a single undirected edge if either one follows the other on Twitter." 
}, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-35", "text": "There are 453 solitary authors in the graph who are neither followed by nor follow any other author in the dataset." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-36", "text": "This graph is homogeneous, i.e., it has nodes (and hence edges) of a single type only." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-37", "text": "Our second graph is an extended version of the first (referred to as the extended graph) that additionally contains nodes representing the tweets of the authors." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-38", "text": "Specifically, in addition to the 1, 875 author nodes, the graph contains 16, 202 tweet nodes." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-39", "text": "Each tweet node is connected to a single author node, denoting that the tweet is elicited from that particular author." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-40", "text": "This graph is no longer homogeneous since it contains nodes and edges of two different types." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-41", "text": "----------------------------------" }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-42", "text": "**GENERATING AUTHOR PROFILES**" }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-43", "text": "We first describe the approach of Mishra et al. (2018a) that learns author embeddings using node2vec (Grover and Leskovec, 2016) ; this serves as our baseline." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-44", "text": "We then move on to our semi-supervised approach based on graph convolutional networks (Kipf and Welling, 2017)." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-45", "text": "Node2vec." 
}, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-46", "text": "Node2vec extends the word2vec skipgram model (Mikolov et al., 2013) to graphs in order to create low-dimensional embeddings for nodes based on their position and neighborhood." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-47", "text": "Specifically, for a given graph with nodes V = {v 1 , v 2 , . . . , v n }, node2vec aims to maximize the following log probability:" }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-69", "text": "Once the model is trained, we extract 200-dimensional embeddings E = A F W (1) from the first layer (i.e., the layer's output without activation)." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-48", "text": "where N s (v) denotes the neighbor set of node v generated using neighbor sampling strategy s. The framework utilizes two different strategies for sampling neighbor sets of nodes: DepthFirst Sampling (DFS) and Breadth-First Sampling (BFS)." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-49", "text": "The former captures the structural role of nodes, while the latter captures the local neighborhood around them." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-50", "text": "Two hyper-parameters control the overall contribution of each of these strategies." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-51", "text": "Following Mishra et al. (2018a) , we initialize these parameters to their default value of 1 and set the embedding size and number of iterations to 200 and 25 respectively." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-52", "text": "Since node2vec cannot produce embeddings for nodes without edges, we map the solitary authors to a single zero embedding as done by Mishra et al. Graph convolutional networks." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-53", "text": "We propose an approach for learning author profiles using GCNs applied to the extended graph." 
}, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-54", "text": "In contrast to node2vec, our method allows us to additionally propagate information with respect to whether tweets composed by authors and their communities are abusive or not." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-55", "text": "Specifically, as labels are available for a subset of nodes in our graph (i.e., the tweet nodes), we frame the task as a graphbased semi-supervised learning problem, allowing the model to distribute gradient information from the supervised loss on the labeled tweet nodes." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-56", "text": "This, in turn, allows us to create profiles for authors that not only capture the structural traits of their surrounding community but also their own linguistic behavior based on the types of tweets that they have composed." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-57", "text": "We consider a graph G = (V, E), where V is the set of nodes (|V | = n) and E is the set of edges." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-58", "text": "A denotes the adjacency matrix of G. We assume that A is symmetric (A ij = A ji ), and that all nodes in G have self loops (A ii = 1)." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-59", "text": "The significance of these assumptions is explained in Kipf and Welling (2017) ." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-60", "text": "Let D be the diagonal degree matrix defined as D ii = j A ij , and F \u2208 R n\u00d7m be the input feature matrix that holds feature vectors of length m for the nodes in G. 
We can now recursively define the computation that takes place at the i th convolutional layer of a k-layer GCN as:" }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-61", "text": "with the computation at the first layer being:" }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-62", "text": "Here, \u03c3 denotes an activation function; A = D In our experiments, we apply a 2-layer GCN to the extended graph." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-63", "text": "2 Specifically, our GCN performs the following computation, yielding a softmax distribution over the 3 classes in the dataset for each of the nodes:" }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-64", "text": "We set the input feature vectors in F to be the binary bag-of-words representations of the nodes (following Kipf and Welling 2017); for author nodes, these representations are constructed over the entire set of their respective tweets." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-65", "text": "Note that F is row-normalized prior to being fed to the GCN." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-66", "text": "We set the number of hidden units in the first convolutional layer to 200 in order to extract 200-dimensional embeddings for author nodes so that they are directly comparable with those from node2vec ." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-67", "text": "The number of hidden units in the second convolutional layer is set to 3 for the output O \u2208 R n\u00d73 of the GCN to be a softmax distribution over the 3 classes in the data." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-68", "text": "The GCN is trained by minimizing the crossentropy loss with respect to the labeled nodes of the graph." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-70", "text": "This contains embeddings for author nodes as well as tweet nodes." 
}, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-71", "text": "For our experiments on author profiles, we make use of the former." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-72", "text": "----------------------------------" }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-73", "text": "**CLASSIFICATION METHODS**" }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-74", "text": "We experiment with five different supervised classification methods for tweets in the dataset." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-75", "text": "The first three (LR, LR+AUTH, LR+EXTD) serve as our baselines, 3 and the last two with GCNs 4 are the methods we propose." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-76", "text": "LR." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-77", "text": "This method is adopted from Waseem and Hovy (2016) wherein they train a logistic regression classifier on character n-grams (up to 4-grams) of the tweets." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-78", "text": "Character n-grams have been shown to be highly effective for abuse detection due to their robustness to spelling variations." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-79", "text": "LR + AUTH." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-80", "text": "This is the state of the art method (Mishra et al., 2018a) for the dataset we are using." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-81", "text": "For each tweet, the profile of its author (generated by node2vec from the community graph) is appended onto the tweet's character n-gram representation for training the LR classifier as above." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-82", "text": "LR + EXTD." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-83", "text": "This method is identical to LR + AUTH, except that we now run node2vec on the extended graph to generate author profiles." 
}, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-84", "text": "Intuitively, since node2vec treats both author and tweet nodes as the same and does not take into account the labels of tweets, the author profiles generated should exhibit the same properties as those generated from the community graph." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-85", "text": "GCN." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-86", "text": "Here, we simply assign a label to each tweet based on the highest score from the softmax distribution provided by our GCN model for the (tweet) nodes of the extended graph." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-87", "text": "LR + GCN." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-88", "text": "Identical to LR + EXTD, except that we replace the author profiles from node2vec with those extracted by our GCN approach." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-89", "text": "----------------------------------" }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-90", "text": "**EXPERIMENTS AND RESULTS**" }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-91", "text": "----------------------------------" }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-92", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-93", "text": "We run every method 10 times with random initializations and stratified train-test splits." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-94", "text": "Specifically, in each run, the dataset is split into a randomly-sampled train set (90%) and test set (10%) with identical distributions of the 3 classes in each." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-95", "text": "In methods involving our GCN, a small part of the train set is held out as validation data to prevent over-fitting using early-stopping regularization." 
}, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-96", "text": "When training the GCN, we only have Table 1 : The baselines (LR, LR + AUTH/EXTD) vs. our GCN approaches ( \u2020 ) on the racism and sexism classes." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-97", "text": "Overall shows the macro-averaged metrics computed over the 3 classes: sexism, racism, and clean." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-98", "text": "labeled tweet nodes for those tweets in the extended graph that are part of the train set." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-99", "text": "Our GCN is trained using the parameters from the original paper (Kipf and Welling, 2017): Glorot initialization (Glorot and Bengio, 2010) , ADAM optimizer (Kingma and Ba, 2015) with a learning rate of 0.01, dropout regularization (Srivastava et al., 2014 ) rate of 0.5, 200 training epochs with an early-stopping patience of 10 epochs." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-100", "text": "----------------------------------" }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-101", "text": "**RESULTS AND ANALYSIS**" }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-102", "text": "In Table 1 , we report the mean precision, recall, and F 1 on the racism and sexism classes over the 10 runs." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-103", "text": "We further report the mean macroaveraged precision, recall, and F 1 for each method ('Overall') to investigate their overall performance on the data." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-104", "text": "LR + GCN significantly (p < 0.05 on paired t-test) outperforms all other methods." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-105", "text": "The author profiles from node2vec only capture the structural and community information of the authors; however, those from the GCN also take into account the (abusive) nature of the tweets composed by the authors." 
}, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-106", "text": "As a result, tweets like \"#MKR #mkr2015 Who is gonna win the peoples choice?\" that are misclassified as sexist by LR + AUTH (because their author is surrounded by others producing sexist tweets) are correctly classified as clean by LR + GCN." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-107", "text": "GCN on its own achieves a high performance, particularly on the sexism class where its performance is typical of a community-based profiling approach, i.e., high recall at the expense of precision." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-108", "text": "However, on the racism class, its recall is hindered by the same factor that Mishra et al. (2018a) highlighted for their node2vec-only method, i.e., that racist tweets come from 5 unique authors only who have also contributed sexist or clean tweets." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-109", "text": "The racist activity of these authors is therefore eclipsed, leading to misclassifications of their tweets." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-110", "text": "LR + GCN alleviates this problem by incorporating character n-gram representations of the tweets, hence not relying solely on the linguistic behavior of their authors." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-111", "text": "Figure 1 shows the t-SNE (van der Maaten and Hinton, 2008) visualizations of node2vec author profiles from the community and extended graphs." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-112", "text": "Both visualizations show that some authors belong to densely-connected communities while others are part of more sparse ones." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-113", "text": "The results from LR + AUTH and LR + EXTD have insignificant differences, further confirming that their author profiles have similar properties." 
}, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-114", "text": "In essence, node2vec is unable to gain anything more from the extended graph than what it does from the community graph." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-115", "text": "Figure 2 shows a t-SNE visualization of the author profiles generated using our GCN approach." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-116", "text": "Red dots denote the authors who are abusive (sexist or racist) according to our model (i.e., as per the softmax outputs for the author nodes)." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-117", "text": "5 The red dots are mostly clustered in a small portion of the visualization, which corroborates the notion of homophily amongst abusive authors." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-118", "text": "Despite the addition of improved author profiles, several abusive tweets remain misclassified." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-119", "text": "As per our analysis, many of these tend to contain URLs to abusive content but not the content itself, e.g., \"@MENTION: Logic in the world of Islam http://t.co/6nALv2HPc3\" and \"@MENTION Yes. http://t.co/ixbt0uc7HN\"." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-120", "text": "Since Twitter shortens all URLs into a standard format, there is no indication of what they refer to." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-121", "text": "One possible way to address this limitation could be to append the content of the URL to the tweet; however this can lead to misclassifications in cases where the tweet is disagreeing with the URL." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-122", "text": "Another factor in misclassifications is the deliberate obfuscation of words and phrases by authors in order to evade detection, e.g., \"Kat, a massive c*nt." 
}, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-123", "text": "The biggest ever on #mkr #cuntandandre\"." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-124", "text": "Mishra et al. (2018b) demonstrate in their work that character-based word composition models can be useful in dealing with this aspect." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-125", "text": "----------------------------------" }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-126", "text": "**CONCLUSIONS**" }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-127", "text": "In this paper, we built on the work of Mishra et al. (2018a) that introduces community-based profiling of authors for abusive language detection." }, { "sent_id": "45551e674210bb9bbb56c8778d2f8c-C001-128", "text": "We proposed an approach based on graph convolutional networks to show that author profiles that directly capture the linguistic behavior of authors along with the structural traits of their community significantly advance the current state of the art." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "45551e674210bb9bbb56c8778d2f8c-C001-11" ], [ "45551e674210bb9bbb56c8778d2f8c-C001-43" ], [ "45551e674210bb9bbb56c8778d2f8c-C001-80" ] ], "cite_sentences": [ "45551e674210bb9bbb56c8778d2f8c-C001-11", "45551e674210bb9bbb56c8778d2f8c-C001-43", "45551e674210bb9bbb56c8778d2f8c-C001-80" ] }, "@USE@": { "gold_contexts": [ [ "45551e674210bb9bbb56c8778d2f8c-C001-27" ], [ "45551e674210bb9bbb56c8778d2f8c-C001-32" ], [ "45551e674210bb9bbb56c8778d2f8c-C001-43" ], [ "45551e674210bb9bbb56c8778d2f8c-C001-51" ] ], "cite_sentences": [ "45551e674210bb9bbb56c8778d2f8c-C001-27", "45551e674210bb9bbb56c8778d2f8c-C001-32", "45551e674210bb9bbb56c8778d2f8c-C001-43", "45551e674210bb9bbb56c8778d2f8c-C001-51" ] }, "@SIM@": { "gold_contexts": [ [ "45551e674210bb9bbb56c8778d2f8c-C001-32" ], [ "45551e674210bb9bbb56c8778d2f8c-C001-108" ] ], "cite_sentences": [ "45551e674210bb9bbb56c8778d2f8c-C001-32", "45551e674210bb9bbb56c8778d2f8c-C001-108" ] }, "@EXT@": { "gold_contexts": [ [ "45551e674210bb9bbb56c8778d2f8c-C001-127" ] ], "cite_sentences": [ "45551e674210bb9bbb56c8778d2f8c-C001-127" ] } } }, "ABC_ebb79e6e223d4747987aa4abfd1a58_22": { "x": [ { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-2", "text": "We introduce unsupervised techniques based on phrase-based statistical machine translation for grammatical error correction (GEC) trained on a pseudo learner corpus created by Google Translation." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-3", "text": "We verified our GEC system through experiments on a low resource track of the shared task at Building Educational Applications 2019 (BEA2019)." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-4", "text": "As a result, we achieved an F 0.5 score of 28.31 points with the test data." 
}, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-5", "text": "----------------------------------" }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-7", "text": "Research on grammatical error correction (GEC) has gained considerable attention recently." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-8", "text": "Many studies treat GEC as a task that involves translation from a grammatically erroneous sentence (sourceside) into a correct sentence (target-side) and thus, leverage methods based on machine translation (MT) for GEC." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-9", "text": "For instance, some GEC systems use large parallel corpora and synthetic data (Ge et al., 2018; Xie et al., 2018) ." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-10", "text": "We introduce an unsupervised method based on MT for GEC that does not use parallel learner data." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-11", "text": "In particular, we use methods proposed by Marie and Fujita (2018) , Artetxe et al. (2018b) , and Lample et al. (2018) ." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-12", "text": "These methods are based on phrase-based statistical machine translation (SMT) and phrase table refinements." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-13", "text": "Forward refinement used by Marie and Fujita (2018) simply augments a learner corpus with automatic corrections." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-14", "text": "We also use forward refinement for improvement of phrase table." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-15", "text": "Unsupervised MT techniques do not require a parallel but a comparable corpus as training data." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-16", "text": "Therefore, we use comparable translated texts using Google Translation as the source-side data." 
}, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-17", "text": "Specifically, we use News Crawl written in English as target-side data and News Crawl written in another language translated into English as source-side data." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-18", "text": "We verified our GEC system through experiments for a low resource track of the shared task at Building Educational Applications 2019 (BEA2019)." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-19", "text": "The experimental results show that our system achieved an F 0.5 score of 28.31 points." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-20", "text": "2 Unsupervised GEC Algorithm 1 shows the pseudocode for unsupervised GEC." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-21", "text": "This code is derived from Artetxe et al. (2018b) ." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-22", "text": "First, the cross-lingual phrase embeddings are acquired." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-23", "text": "Second, a phrase table is created based on these cross-lingual embeddings." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-24", "text": "Third, the phrase table is combined with a language model trained by monolingual data to initialize a phrase-based SMT system." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-25", "text": "Finally, the SMT system is updated through iterative forwardtranslation." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-26", "text": "Cross-lingual embeddings First, n-gram embeddings were created on the source-and targetsides." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-27", "text": "Specifically, each monolingual embedding was created based on the source-and target-sides using a variant of skip-gram (Mikolov et al., 2013) for unigrams, bigrams, and trigrams with high frequency 1 in the monolingual data." 
}, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-28", "text": "Next, the monolingual embeddings were mapped onto a shared space to obtain cross-lingual embeddings." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-29", "text": "The self-learning method of Artetxe et al. (2018a) was used for unsupervised mapping." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-30", "text": "Phrase table induction A phrase table was created based on the cross-lingual embeddings." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-31", "text": "In particular, this involved the creation of phrase translation models and lexical translation models." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-32", "text": "The translation candidates were limited in the source-to-target phrase translation model \u03d5(f |e) for each source phrase e to its 100 nearest neighbor phrases f on the target-side." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-33", "text": "The score of the phrase translation model was calculated based on the normalized cosine similarity between the source and target phrases." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-34", "text": "f \u2032 represents each phrase embedding on the targetside and \u03c4 is a temperature parameter that controls the confidence of prediction 2 ." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-35", "text": "The backward phrase translation probability \u03d5(e|f ) was determined in a similar manner." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-36", "text": "The source-to-target lexical translation model lex(f |e) considers the word with the highest translation probability in a target phrase for each word in a source phrase." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-37", "text": "The score of the lexical translation model was calculated based on the product of respective phrase translation probabilities." 
}, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-38", "text": "\u03f5 is a constant term for the case where no alignments are found." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-39", "text": "As in Artetxe et al. (2018b) , the term was set to 0.001." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-40", "text": "The backward lexical translation probability lex(e|f ) is calculated in a similar manner." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-41", "text": "Refinement of SMT system The phrase table created is considered to include noisy phrase pairs." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-42", "text": "Therefore, we update the phrase table using an SMT system." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-43", "text": "The SMT system trained on synthetic data eliminates the noisy phrase pairs using 2 As in Artetxe et al. (2018b) , \u03c4 is estimated by maximizing the phrase translation probability between an embedding and the nearest embedding on the opposite side." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-44", "text": "language models trained on the target-side corpus." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-45", "text": "This process corresponds to lines 6-10 in Algorithm 1." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-46", "text": "The phrase table is refined with forward refinement (Marie and Fujita, 2018) ." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-47", "text": "For forward refinement, target synthetic data were generated from the source monolingual data using the source-to-target phrase s\u2192t was then created with this target synthetic corpus." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-48", "text": "This operation was executed N times." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-49", "text": "Construction of a comparable corpus This unsupervised method is based on the assumption that the source and target corpora are comparable." 
}, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-50", "text": "In fact, Lample et al. (2018) , Artetxe et al. (2018b) and Marie and Fujita (2018) use the News Crawl of source and target language as training data." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-51", "text": "To make a comparable corpus for GEC, we use translated texts using Google Translation as the source-side data." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-52", "text": "Specifically, we use Finnish News Crawl translated into English as source-side." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-53", "text": "English News Crawl is used as the target-side as is." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-54", "text": "Finnish data is used because Finnish is not similar to English." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-55", "text": "This translated data does not include misspelled words." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-56", "text": "To address these words, we use a spell checker as a preprocessing step before inference." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-57", "text": "3 Experiment of low resource GEC 3.1 Experimental setting Table 1 shows the training and development data size." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-58", "text": "Finnish News Crawl 2014-2015 translated into English was used as source training data and English News Crawl 2017 was used as target training data." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-59", "text": "To train the extra language model of the target-side (LM t ), we used training data of One Billion Word Benchmark (Chelba et al., 2014) ." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-60", "text": "We used googletrans v2.4.0 3 for Google Translation." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-61", "text": "This module did not work sometimes and thus, we obtained 2,122,714 trans- Table 2 : GEC results with test data." 
}, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-62", "text": "lated sentences 4 ." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-63", "text": "We sampled the 3,000,000 sentences from English News Crawl 2017 and excluded the sentences with more than 150 words for either source-and target-side data." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-64", "text": "Finally, the synthetic comparable corpus comprises processed News Crawl data listed in Table 1 ." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-65", "text": "The low resource track permitted to use W&I+LOCNESS (Bryant et al., 2019; Granger, 1998 ) development set, so we split it in half; tune data and dev data 5 ." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-66", "text": "These data are tokenized by spaCy v1.9.0 6 and the en_core_web_sm-1.2.0 model." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-67", "text": "We used moses truecaser for the training data; this truecaser model is learned from processed English News Crawl." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-68", "text": "We used byte-pair-encoding (Sennrich et al., 2016) learned from processed English News Crawl; the number of operations is 50K." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-69", "text": "The implementation proposed by Artetxe et al. (2018b) 7 was modified to conduct the experiments." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-70", "text": "Specifically, some features were added; word-level Levenshtein distance, word-, and character-level edit operation, operation sequence model, (Durrani et al., 2013) 8 and 9-gram word class language model, similar to Grundkiewicz and Junczys-Dowmunt (2018) without sparse features." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-71", "text": "Word class language model was trained with One Billion Word Benchmark data; the number of classes is 200, and the word class was estimated with fastText (Bojanowski et al., 2017) ." 
}, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-72", "text": "The distortion feature was not used." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-73", "text": "Moses (Koehn et al., 2007) was used to train the SMT system." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-74", "text": "FastAlign (Dyer et al., 2013) was used for word alignment and KenLM (Heafield, 2011) Crawl and One Billion Word Benchmark." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-75", "text": "MERT (Och, 2003) was used with the tuning data for M\u02c62 Scorer (Dahlmeier and Ng, 2012) ." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-76", "text": "Synthetic sentence pairs with a [3, 80] sentence length were used at the refinement step." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-77", "text": "The number of iterations N was set to 5, and the embedding dimension was set to 300." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-78", "text": "We decided best iteration using the dev data and submitted the output of the best iteration model." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-79", "text": "We used pyspellchecker 9 as a spell checker." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-80", "text": "This tool uses Levenshtein distance to obtain permutations within an edit distance of 2 over the words included in a word list." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-81", "text": "We made the word list from One Billion Word Benchmark and included words that occur more than five times." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-82", "text": "We report precision, recall, and F 0.5 score based on the dev data and official test data." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-83", "text": "The output of dev data was evaluated using ERRANT scorer (Bryant et al., 2017) similarly to official test data." 
}, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-84", "text": "Table 2 shows the results of the GEC experiments with test data." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-85", "text": "The F 0.5 score for our system (TMU) is 28.31; this score is eighth among the nine teams." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-86", "text": "In particular, the number of false positives of our system is 4,314; this is the worst result of all." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-87", "text": "Table 3 shows the results of the dev data listed in Table 1 ." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-88", "text": "On the dev data, the system of iteration 1 is the best among all." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-89", "text": "According to the improvement of iteration from 0 to 1, it is confirmed that the refinement method works well." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-90", "text": "However, it is observed that the system is not improved after iteration 1." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-91", "text": "The source-side data is fixed, and target-side data is generated from the source-side for each iteration." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-92", "text": "Therefore, the quality of the 9 https://github.com/barrust/pyspellchecker Table 3 : GEC results with dev data." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-93", "text": "The bold scores represent the best score without the spell checker." 
}, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-94", "text": "----------------------------------" }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-95", "text": "**RESULTS**" }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-96", "text": "----------------------------------" }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-97", "text": "**DISCUSSION**" }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-98", "text": "source-side data is important for this refinement method." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-99", "text": "In this study, we use the automatically translated text as source-side data; thus, it is considered that the quality is not high and the refinement after iteration 1 does not work." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-100", "text": "The results of Table 3 confirm that the spell checker works well." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-101", "text": "We also investigate the importance of the order; SMT or spell check, which is suitable for the first system for a better result?" }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-102", "text": "As a result, it is better to use the SMT system after using the spell checker." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-103", "text": "That is because the source-side data does not include the misspelled words as mentioned above." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-104", "text": "Table 4 shows the error types that our system corrected well or mostly did not correct on the dev data." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-105", "text": "SPELL means the misspell errors; the correction of these errors depends only on the spell checker." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-106", "text": "PUNCT means the errors about the punctuation; e.g., 'Unfortunately when we...\u2192 Unfortunately, when we...'." 
}, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-107", "text": "It is considered that our system can correct errors such as these owing to the n-gram co-occurrence knowledge derived from the language models." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-108", "text": "In contrast, our system struggled to correct content word errors." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-109", "text": "For example, NOUN includes errors such as 'way \u2192 means', and VERB includes errors such as 'watch \u2192 see'." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-110", "text": "We consider that our system is mostly unable to correct word-usage errors that depend on context because the phrase table was still noisy." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-111", "text": "Although we observed some usage error examples of 'watch' in the synthetic source data, our model was not able to replace 'watch' with 'see' based on the context." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-112", "text": "----------------------------------" }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-113", "text": "**RELATED WORK**" }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-114", "text": "Unsupervised Machine Translation Studies on unsupervised methods have been conducted for both NMT (Lample et al., 2018; Marie and Fujita, 2018) and SMT (Artetxe et al., 2018b). (Table 4 caption: Error types for which our best system corrected errors well or mostly did not correct on the dev data.)" }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-115", "text": "Top2 denotes the top two errors, and Bottom2 denotes the lowest two errors in terms of the F 0.5 score." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-116", "text": "In this study, we apply the USMT method of Artetxe et al. (2018b) and Marie and Fujita (2018) to GEC."
}, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-117", "text": "The UNMT method (Lample et al., 2018) was ineffective under the GEC setting in our preliminary experiments." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-118", "text": "GEC with NMT/SMT Several studies that introduce sequence-to-sequence models in GEC heavily rely on large amounts of training data." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-119", "text": "Ge et al. (2018) , who presented state-of-the-art results in GEC, proposed a supervised NMT method trained on corpora totaling 5.4M sentence pairs." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-120", "text": "We mainly use the monolingual corpus because the low resource track does not permit the use of the learner corpora." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-121", "text": "Despite the success of NMT, many studies on GEC traditionally use SMT (Susanto et al., 2014; Junczys-Dowmunt and Grundkiewicz, 2014) ." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-122", "text": "These studies apply an off-the-shelf SMT toolkit, Moses, to GEC." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-123", "text": "Junczys-Dowmunt and Grundkiewicz (2014) claimed that an SMT system optimized for BLEU learns not to change the source sentence." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-124", "text": "Instead of BLEU, they proposed tuning an SMT system using the M 2 score with annotated development data." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-125", "text": "In this study, we also tune the weights with an F 0.5 score measured by the M 2 scorer because the official score is an F 0.5 score."
}, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-126", "text": "----------------------------------" }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-127", "text": "**CONCLUSION**" }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-128", "text": "In this paper, we described our GEC system for the low resource track of the shared task at BEA2019." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-129", "text": "We introduced an unsupervised approach based on SMT for GEC." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-130", "text": "This track prohibited the use of learner data as training data, so we created a synthetic comparable corpus using Google Translation." }, { "sent_id": "ebb79e6e223d4747987aa4abfd1a58-C001-131", "text": "The experimental results demonstrate that our system achieved an F 0.5 score of 28.31 points with the test data." } ], "y": { "@USE@": { "gold_contexts": [ [ "ebb79e6e223d4747987aa4abfd1a58-C001-11" ], [ "ebb79e6e223d4747987aa4abfd1a58-C001-19", "ebb79e6e223d4747987aa4abfd1a58-C001-20", "ebb79e6e223d4747987aa4abfd1a58-C001-21" ], [ "ebb79e6e223d4747987aa4abfd1a58-C001-39" ], [ "ebb79e6e223d4747987aa4abfd1a58-C001-43" ], [ "ebb79e6e223d4747987aa4abfd1a58-C001-116" ] ], "cite_sentences": [ "ebb79e6e223d4747987aa4abfd1a58-C001-11", "ebb79e6e223d4747987aa4abfd1a58-C001-21", "ebb79e6e223d4747987aa4abfd1a58-C001-39", "ebb79e6e223d4747987aa4abfd1a58-C001-43", "ebb79e6e223d4747987aa4abfd1a58-C001-116" ] }, "@SIM@": { "gold_contexts": [ [ "ebb79e6e223d4747987aa4abfd1a58-C001-39" ], [ "ebb79e6e223d4747987aa4abfd1a58-C001-43" ] ], "cite_sentences": [ "ebb79e6e223d4747987aa4abfd1a58-C001-39", "ebb79e6e223d4747987aa4abfd1a58-C001-43" ] }, "@BACK@": { "gold_contexts": [ [ "ebb79e6e223d4747987aa4abfd1a58-C001-50" ], [ "ebb79e6e223d4747987aa4abfd1a58-C001-114" ] ], "cite_sentences": [ "ebb79e6e223d4747987aa4abfd1a58-C001-50", "ebb79e6e223d4747987aa4abfd1a58-C001-114" ] }, "@EXT@": { "gold_contexts": [ [ 
"ebb79e6e223d4747987aa4abfd1a58-C001-69" ] ], "cite_sentences": [ "ebb79e6e223d4747987aa4abfd1a58-C001-69" ] } } }, "ABC_3ced64da2c64b0963c4c3d88fd60e0_22": { "x": [ { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-64", "text": "**TYPE**" }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-2", "text": "We present a pointwise approach to Japanese morphological analysis (MA) that ignores structure information during learning and tagging." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-3", "text": "Despite the lack of structure, it is able to outperform the current state-of-the-art structured approach for Japanese MA, and achieves accuracy similar to that of structured predictors using the same feature set." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-4", "text": "We also find that the method is both robust to out-of-domain data and can be easily adapted through the use of a combination of partial annotation and active learning." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-5", "text": "----------------------------------" }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-7", "text": "Japanese morphological analysis (MA) takes an unsegmented string of Japanese text as input, and outputs a string of morphemes annotated with parts of speech (POSs)." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-8", "text": "As MA is the first step in Japanese NLP, its accuracy directly affects the accuracy of NLP systems as a whole." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-9", "text": "In addition, with the proliferation of text in various domains, there is increasing need for methods that are both robust and adaptable to out-of-domain data (Escudero et al., 2000) ."
}, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-10", "text": "Previous approaches have used structured predictors such as hidden Markov models (HMMs) or conditional random fields (CRFs), which consider the interactions between neighboring words and parts of speech (Nagata, 1994; Asahara and Matsumoto, 2000; Kudo et al., 2004) ." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-11", "text": "However, while structure does provide valuable information, Liang et al. (2008) have shown that gains provided by structured prediction can be largely recovered by using a richer feature set." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-12", "text": "This approach has also been called \"pointwise\" prediction, as it makes a single independent decision at each point (Neubig and Mori, 2010) ." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-13", "text": "While Liang et al. (2008) focus on the speed benefits of pointwise prediction, we demonstrate that it also allows for more robust and adaptable MA." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-14", "text": "We find experimental evidence that pointwise MA can exceed the accuracy of a state-of-the-art structured approach (Kudo et al., 2004) on in-domain data, and is significantly more robust to out-of-domain data." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-15", "text": "We also show that pointwise MA can be adapted to new domains with minimal effort through the combination of active learning and partial annotation (Tsuboi et al., 2008) , where only informative parts of a particular sentence are annotated." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-16", "text": "In a realistic domain adaptation scenario, we find that a combination of pointwise prediction, partial annotation, and active learning allows for easy adaptation." 
}, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-17", "text": "----------------------------------" }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-18", "text": "**JAPANESE MORPHOLOGICAL ANALYSIS**" }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-19", "text": "Japanese MA takes an unsegmented string of characters x_1^I as input, segments it into morphemes w_1^J, and annotates each morpheme with a part of speech t_1^J." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-20", "text": "This can be formulated as a two-step process of first segmenting words, then estimating POSs (Ng and Low, 2004) , or as a single joint process of finding a morpheme/POS string from unsegmented text (Kudo et al., 2004; Nakagawa, 2004; Kruengkrai et al., 2009) ." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-21", "text": "In this section we describe an existing joint sequence-based method for Japanese MA, as well as our proposed two-step pointwise method." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-22", "text": "----------------------------------" }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-23", "text": "**JOINT SEQUENCE-BASED MA**" }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-24", "text": "Japanese MA has traditionally used sequence-based models, finding a maximal POS sequence for entire sentences, as in Figure 1 (a)." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-25", "text": "----------------------------------" }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-26", "text": "**TYPE**" }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-27", "text": "Feature Strings" }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-28", "text": "The CRF-based method presented by Kudo et al. (2004) is generally accepted as the state-of-the-art in this paradigm."
}, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-29", "text": "CRFs are trained over segmentation lattices, which allows for the handling of variable length sequences that occur due to multiple segmentations." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-30", "text": "The model is able to take into account arbitrary features, as well as the context between neighboring tags." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-31", "text": "We follow Kudo et al. (2004) in defining our feature set, as summarized in Table 1 ." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-32", "text": "Lexical features were trained for the top 5000 most frequent words in the corpus." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-33", "text": "It should be noted that these are word-based features, and information about transitions between POS tags is included." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-34", "text": "When creating training data, the use of word-based features indicates that word boundaries must be annotated, while the use of POS transition information further indicates that all of these words must be annotated with POSs." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-35", "text": "1 More fine-grained POS tags have provided small boosts in accuracy in previous research (Kudo et al., 2004) , but these increase the annotation burden, which is contrary to our goal." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-36", "text": "Feature Strings Character" }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-37", "text": "----------------------------------" }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-38", "text": "**2-STEP POINTWISE MA**" }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-39", "text": "In our research, we take a two-step approach, first segmenting character sequence x_1^I into the word sequence w_1^J with the highest probability, then tagging each word with parts of speech t_1^J."
}, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-40", "text": "This approach is shown in Figure 1 (b) ." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-41", "text": "We follow Sassano (2002) in formulating word segmentation as a binary classification problem, estimating boundary tags b_1^(I-1)." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-42", "text": "Tag b_i = 1 indicates that a word boundary exists between characters x_i and x_(i+1), while b_i = 0 indicates that a word boundary does not exist." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-43", "text": "POS estimation can also be formulated as a multi-class classification problem, where we choose one tag t_j for each word w_j." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-44", "text": "These two classification problems can be solved by tools in the standard machine learning toolbox such as logistic regression (LR), support vector machines (SVMs), or conditional random fields (CRFs)." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-45", "text": "We use information about the surrounding characters (character and character-type n-grams), as well as the presence or absence of words in the dictionary as features (Table 2) ." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-46", "text": "Specifically, dictionary features for word segmentation l_s and r_s are active if a string of length s included in the dictionary is present directly to the left or right of the present word boundary, and i_s is active if the present word boundary is included in a dictionary word of length s. Dictionary feature d_jk for POS estimation indicates whether the current word w_j occurs as a dictionary entry with tag t_k." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-47", "text": "Previous work using this two-stage approach has used sequence-based prediction methods, such as maximum entropy Markov models (MEMMs) or CRFs (Ng and Low, 2004; Peng et al., 2004) ."
}, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-48", "text": "However, as Liang et al. (2008) note, and we confirm, sequence-based predictors are often not necessary when an appropriately rich feature set is used." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-49", "text": "One important difference between our formulation and that of Liang et al. (2008) and all other previous methods is that we rely only on features that are directly calculable from the surface string, without using estimated information such as word boundaries or neighboring POS tags 2 ." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-50", "text": "This allows for training from sentences that are partially annotated as described in the following section." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-51", "text": "----------------------------------" }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-52", "text": "**DOMAIN ADAPTATION FOR MORPHOLOGICAL ANALYSIS**" }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-53", "text": "NLP is now being used in domains such as medical text and legal documents, and it is necessary that MA be easily adaptable to these areas." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-54", "text": "In a domain adaptation situation, we have at our disposal both annotated general domain data, and unannotated target domain data." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-55", "text": "We would like to annotate the target domain data efficiently to achieve a maximal gain in accuracy for a minimal amount of work." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-56", "text": "Active learning has been used as a way to pick data that is useful to annotate in this scenario for several applications (Chan and Ng, 2007; Rai et al., 2010) so we adopt an active-learning-based approach here." 
}, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-57", "text": "When adapting sequence-based prediction methods, most active learning approaches have focused on picking full sentences that are valuable to annotate (Ringger et al., 2007; Settles and Craven, 2008) ." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-58", "text": "However, even within sentences, there are generally a few points of interest surrounded by large segments that are well covered by already annotated data." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-59", "text": "Partial annotation provides a solution to this problem (Tsuboi et al., 2008; Sassano and Kurohashi, 2010) ." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-60", "text": "In partial annotation, data that will not contribute to the improvement of the classifier is left untagged." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-61", "text": "For example, if there is a single difficult word in a long sentence, only the word boundaries and POS of the difficult word will be tagged." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-62", "text": "2 Dictionary features are active if the string exists, regardless of whether it is treated as a single word in w_1^J, and thus can be calculated without the word segmentation result." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-63", "text": "----------------------------------" }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-65", "text": "Table 3: General and target domain corpus sizes in words. General: 782k train, 87.5k test; Target: 153k train, 17.3k test." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-66", "text": "\"Difficult\" words can be selected using active learning approaches, choosing words with the lowest classifier accuracy to annotate." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-67", "text": "In addition, corpora that are tagged with word boundaries but not POS tags are often available; this is another type of partial annotation."
}, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-68", "text": "When using sequence-based prediction, learning on partially annotated data is not straightforward, as the data that must be used to train context-based transition probabilities may be left unannotated." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-69", "text": "In contrast, in the pointwise prediction framework, training using this data is both simple and efficient; unannotated points are simply ignored." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-70", "text": "A method for learning CRFs from partially annotated data has been presented by Tsuboi et al. (2008) ." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-71", "text": "However, when using partial annotation, CRFs' already slow training time becomes slower still, as they must be trained over every sequence that has at least one annotated point." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-72", "text": "Training time is important in an active learning situation, as an annotator must wait while the model is being re-trained." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-73", "text": "----------------------------------" }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-74", "text": "**EXPERIMENTS**" }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-75", "text": "In order to test the effectiveness of pointwise MA, we did an experiment measuring accuracy both on in-domain data, and in a domain-adaptation situation." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-76", "text": "We used the Balanced Corpus of Contemporary Written Japanese (BCCWJ) (Maekawa, 2008) , specifying the whitepaper, news, and books sections as our general domain corpus, and the web text section as our target domain corpus (Table 3) ." 
}, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-77", "text": "As a representative of joint sequence-based MA described in 2.1, we used MeCab (Kudo, 2006) , an open source implementation of Kudo et al. (2004)'s CRF-based method (we will call this JOINT)." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-78", "text": "For the pointwise two-step method, we trained logistic regression models with the LIBLINEAR toolkit (Fan et al., 2008) using the features described in Section 2.2 (2-LR)." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-79", "text": "In addition, we trained a CRF-based model with the CRFSuite toolkit (Okazaki, 2007) using the same features and set-up (for both word segmentation and POS tagging) to examine the contribution of context information (2-CRF)." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-80", "text": "To create the dictionary, we added all of the words in the corpus, but left out a small portion of singletons to prevent overfitting on the training data." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-81", "text": "As an evaluation measure, we follow Nagata (1994) and Kudo et al. (2004) and use word/POS tag pair F-measure, so that both word boundaries and POS tags must be correct for a word to be considered correct." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-82", "text": "----------------------------------" }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-83", "text": "**ANALYSIS RESULTS**" }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-84", "text": "In our first experiment we compared the accuracy of the three methods on both the in-domain and out-of-domain test sets (Table 4) ." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-85", "text": "It can be seen that 2-LR outperforms JOINT, and achieves similar but slightly inferior results to 2-CRF."
}, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-86", "text": "The reason for accuracy gains over JOINT lies largely in the fact that while JOINT is more reliant on the dictionary, and thus tends to mis-segment unknown words, the two-step methods are significantly more robust." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-87", "text": "The small difference between 2-LR and 2-CRF indicates that given a sufficiently rich feature set, context-based features provide little advantage, although the advantage is larger on out-of-domain data." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-88", "text": "In addition, training of 2-LR is significantly faster than 2-CRF." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-89", "text": "2-LR took 16m44s to train, while 2-CRF took 51m19s to train on a 3.33GHz Intel Xeon CPU." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-90", "text": "----------------------------------" }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-91", "text": "**DOMAIN ADAPTATION**" }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-92", "text": "Our second experiment focused on the domain adaptability of each method." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-93", "text": "Using the target domain training corpus as a pool of unannotated data, we performed active learning-based domain adaptation using two techniques." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-94", "text": "\u2022 Sentence-based annotation (SENT), where sentences with the lowest total POS and word boundary probabilities were annotated first." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-95", "text": "\u2022 Word-based partial annotation (PART), where the word or word boundary with the smallest probability margin between the first and second candidates was chosen." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-96", "text": "This can only be used with the pointwise 2-LR approach."
}, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-97", "text": "For both methods, 100 words (or for SENT until the end of the sentence in which the 100th word is reached) are annotated, then the classifier is retrained and new probability scores are generated." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-98", "text": "Each set of 100 words is a single iteration, and 100 iterations were performed for each method." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-99", "text": "From the results in Figure 2 , it can be seen that the combination of PART and 2-LR allows for significantly faster adaptation than other approaches, achieving accuracy gains in 15 iterations that are achieved in 100 iterations with SENT, and surpassing 2-CRF after 15 iterations." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-100", "text": "Finally, it can be seen that JOINT improves at a pace similar to PART, likely due to the fact that its pre-adaptation accuracy is lower than the other methods." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-101", "text": "It can be seen from Table 4 that even after adaptation with the full corpus, it will still lag behind the two-step methods." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-102", "text": "----------------------------------" }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-103", "text": "**CONCLUSION**" }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-104", "text": "This paper proposed a pointwise approach to Japanese morphological analysis." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-105", "text": "It showed that despite the lack of structure, it was able to achieve results that meet or exceed those of structured prediction methods." }, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-106", "text": "We also demonstrated that it is both robust and adaptable to out-of-domain text through the use of partial annotation and active learning."
}, { "sent_id": "3ced64da2c64b0963c4c3d88fd60e0-C001-107", "text": "Future work in this area will include examination of performance on other tasks and languages." } ], "y": { "@BACK@": { "gold_contexts": [ [ "3ced64da2c64b0963c4c3d88fd60e0-C001-10" ], [ "3ced64da2c64b0963c4c3d88fd60e0-C001-20" ], [ "3ced64da2c64b0963c4c3d88fd60e0-C001-28" ], [ "3ced64da2c64b0963c4c3d88fd60e0-C001-35" ] ], "cite_sentences": [ "3ced64da2c64b0963c4c3d88fd60e0-C001-10", "3ced64da2c64b0963c4c3d88fd60e0-C001-20", "3ced64da2c64b0963c4c3d88fd60e0-C001-28", "3ced64da2c64b0963c4c3d88fd60e0-C001-35" ] }, "@DIF@": { "gold_contexts": [ [ "3ced64da2c64b0963c4c3d88fd60e0-C001-14" ] ], "cite_sentences": [ "3ced64da2c64b0963c4c3d88fd60e0-C001-14" ] }, "@USE@": { "gold_contexts": [ [ "3ced64da2c64b0963c4c3d88fd60e0-C001-31" ], [ "3ced64da2c64b0963c4c3d88fd60e0-C001-81" ] ], "cite_sentences": [ "3ced64da2c64b0963c4c3d88fd60e0-C001-31", "3ced64da2c64b0963c4c3d88fd60e0-C001-81" ] }, "@UNSURE@": { "gold_contexts": [ [ "3ced64da2c64b0963c4c3d88fd60e0-C001-77" ] ], "cite_sentences": [ "3ced64da2c64b0963c4c3d88fd60e0-C001-77" ] } } }, "ABC_ce86cf36ee3b359c34b68e5d82b563_22": { "x": [ { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-46", "text": "we are looking for a good phonetic fit, not necessarily the best one." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-47", "text": "----------------------------------" }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-48", "text": "**GENERALIZING TO THREE OR MORE LANGUAGES**" }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-2", "text": "An essential step in comparative reconstruction is to align corresponding phonological segments in the words being compared." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-3", "text": "To do this, one must search among huge numbers of potential alignments to find those that give a good phonetic fit." 
}, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-4", "text": "This is a hard computational problem, and it becomes exponentially more difficult when more than two strings are being aligned." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-5", "text": "In this paper I extend the guided-search alignment algorithm of Covington (Computational Linguistics, 1996) to handle more than two strings." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-6", "text": "The resulting algorithm has been implemented in Prolog and gives reasonable results when tested on data from several languages." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-7", "text": "The Comparative Method for reconstructing languages consists of at least the following steps:" }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-8", "text": "1. Choose sets of words in the daughter languages that appear to be cognate. Parts of the Comparative Method have been computerized by Frantz (1970) , Hewson (1974) , Wimbish (1989) , and Lowe and Mazaudon (1994) , but none of them have tackled the alignment step." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-9", "text": "Covington (1996) presents a workable alignment algorithm for comparing two languages." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-10", "text": "In this paper I extend that algorithm to handle more than two languages at once." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-11", "text": "The alignment step is hard to automate because there are too many possible alignments to choose from." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-12", "text": "For example, French le [l\u0259] and Spanish el [el] can be lined up at least three ways:" }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-13", "text": "Of these, the second is etymologically correct, and the third would merit consideration if one did not know the etymology."
}, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-14", "text": "The number of alignments rises exponentially with the length of the strings and the number of strings being aligned." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-15", "text": "Two ten-letter strings have anywhere from 26,797 to 8,079,453 different alignments depending on exactly what alignments are considered distinct (Covington 1996, Covington and Canfield 1996) ." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-16", "text": "As for multiple strings, if two strings have A alignments then n strings have roughly A^(n-1) alignments, assuming the alignments are generated by aligning the first two strings, then aligning the third string against the second, and so forth." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-17", "text": "In fact, the search space isn't quite that large because some combinations are equivalent to others, but it is clearly too large to search exhaustively." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-18", "text": "----------------------------------" }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-19", "text": "**BACKGROUND**" }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-20", "text": "The Comparative Method for reconstructing languages consists of at least the following steps:" }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-21", "text": "1. Choose sets of words in the daughter languages that appear to be cognate; a regular correspondence, once discovered, can be used to refine one's choice of alignments and even putative cognates." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-22", "text": "Parts of the Comparative Method have been computerized by Frantz (1970) , Hewson (1974) , Wimbish (1989) , and Lowe and Mazaudon (1994) , but none of them have tackled the alignment step."
}, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-23", "text": "Covington (1996) presents a workable alignment algorithm for comparing two languages." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-24", "text": "In this paper I extend that algorithm to handle more than two languages at once." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-25", "text": "----------------------------------" }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-26", "text": "**MULTIPLE-STRING ALIGNMENT**" }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-27", "text": "The alignment step is hard to automate because there are too many possible alignments to choose from." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-28", "text": "For example, French le [l~] and Spanish el [el I can be lined up at least three ways:" }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-29", "text": "Of these, the second is etymologically correct, and the third would merit consideration if one did not know the etymology." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-30", "text": "The number of alignments rises exponentially with the length of the strings and the number of strings being aligned." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-31", "text": "Two ten-letter strings have anywhere from 26,797 to 8,079,453 different alignments depending on exactly what alignments are considered distinct (Covington 1996, Covington and Canfield 1996) ." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-32", "text": "As for multiple strings, if two strings have A alignments then n strings have roughly A '~-1 alignments, assuming the alignments are generated by aligning the first two strings, then aligning the third string against the second, and so forth." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-33", "text": "In fact, the search space isn't quite that large because some combinations are equivalent to others, but it is clearly too large to search exhaustively." 
}, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-34", "text": "Fortunately the comparative linguist is not looking for all possible alignments, only the ones that are likely to manifest regular sound correspondences -that is, those with a reasonable degree of phonetic similarity." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-35", "text": "Thus, phonetic similarity can be used to constrain the search." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-36", "text": "----------------------------------" }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-37", "text": "**APPLYING AN EVALUATION METRIC**" }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-38", "text": "The phonetic similarity criterion used by Covington (1996) is shown in Table 1 ." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-39", "text": "It is obviously just a stand-in for a more sophisticated, perhaps feature-based, system of phonology." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-40", "text": "The algorithm computes a \"badness\" or \"penalty\" for each step (column) in the alignment, summing the values to judge the badness of the whole alignment, thus:" }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-41", "text": "The alignment with the lowest total badness is the one with the greatest phonetic similarity." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-42", "text": "Note that two separate skips count exactly the same as one complete mismatch; thus the alignments e -e 1 lare equally valued." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-43", "text": "In fact, a \"no-alternatingskips rule\" prevents the second one from being generated; deciding whether [e] and [I] correspond is left for another, unstated, part of the comparison process." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-44", "text": "I will explain below why this is not satisfactory." 
}, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-45", "text": "Naturally, the alignment with the best overall phonetic similarity is not always the etymologically correct one, although it is usually close;" }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-49", "text": "When a guided search is involved, aligning strings from three or more languages is not simply a matter of finding the best alignment of the first two, then adding a third, and then a fourth, and so on." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-50", "text": "Thus, an algorithm to align two strings cannot be used iteratively to align more than two." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-51", "text": "The reason is that the best overall alignment of three or more strings is not necessarily the best alignment of any given pair in the set." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-52", "text": "Fox (1995:68) gives a striking example, originally from Haas (1969) ." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-53", "text": "The best alignment of the Choctaw and Cree words for 'squirrel' appears to be:" }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-54", "text": "Here the correspondence [a]:[i] is problematic." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-55", "text": "Add the Koasati word, though, and it becomes clear that the correct alignment is actually:" }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-56", "text": "Any algorithm that started by finding the best alignment of Choctaw against Cree would miss this solution." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-57", "text": "A much better strategy is to evaluate each column of the alignment (I'll call it a \"step\") before generating the next column." 
}, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-58", "text": "That is, evaluate the first step, and then the second step, That way, no string gets aligned against another without considering the rest of the strings in the set." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-59", "text": "Another detail has to do with skips." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-60", "text": "Empirically, I found that the badness of f P comes out too high if computed as" }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-61", "text": "----------------------------------" }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-62", "text": "**BADNESS(F,P) + BADNESS(P,-) + BADNESS(F,-);**" }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-63", "text": "that is, the algorithm is too reluctant to take skips." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-64", "text": "The reason, intuitively, is that in this alignment step, there is really only one skip, not two separate skips (one skipping If] and one skipping [p] )." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-65", "text": "This becomes even more apparent when more than three strings are being aligned." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-66", "text": "Accordingly, when computing badness I count each skip only once (assessing it 50 points), then ignore skips when comparing the segments against each other. I have not implemented the rule from Covington (1996) that gives a reduced penalty for adjacent skips in the same string to reflect the fact that affixes tend to be contiguous." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-67", "text": "5 Searching the set of alignments The standard way to find the best alignment of two strings is a matrix-based technique known as dynamic programming (Ukkonen 1985 , Waterman 1995 ." 
}, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-68", "text": "However, dynamic programming cannot accommodate rules that look ahead along the string to recognize assimilation or metathesis, a possibility that needs to be left open when implementing comparative reconstruction." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-69", "text": "Additionally, generalization of dynamic programming to multiple strings does not entirely appear to be a solved problem (cf." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-70", "text": "Kececioglu 1993)." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-71", "text": "Accordingly, I follow Covington (1996) in recasting the problem as a tree search." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-72", "text": "Consider the problem of aligning [el] with [le] ." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-73", "text": "Covington (1996) treats this as a process that steps through both strings and, at each step, performs either a \"match\" (accepting a character from both strings), a \"skip-l\" (skipping a character in the first string), or a \"skip-2\" (skipping a character in the second string)." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-74", "text": "That results in the search tree shown in Fig. 1 (ignoring Covington's \"no-alternating-skips rule\")." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-75", "text": "The search tree can be generalized to multiple strings by breaking up each step into a series of operations, one on each string, as shown in Fig. 2 ." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-76", "text": "Instead of three choices, match, skip-l, and skip-2, there are really 2x2: accept or skip on string 1 and then accept or skip on string 2." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-77", "text": "One of the four combinations is disallowedyou can't have a step in which no characters are accepted from any string." 
}, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-78", "text": "Similarly, if there were three strings, there would be three two-way decisions, leading to eight (= 2 3) states, one of which would be disallowed." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-79", "text": "Using search trees of this type, the decisions necessary to align any number of strings can be strung together in a satisfactory way." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-80", "text": "6 Alternating skips Covington (1996) Step 1 Step 2" }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-81", "text": "Step 3... which is undeniably equivalent." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-82", "text": "It also ensures that there is only one way of skipping several consecutive segments; we get" }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-83", "text": ".... def or numerous other equivalent combinations of skips." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-84", "text": "----------------------------------" }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-85", "text": "**PRUNING THE SEARCH**" }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-86", "text": "The goal of the algorithm is, of course, to generate not the whole search tree, but only the parts of it likely to contain the best alignments, thereby narrowing the intractably large search space into something manageable." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-87", "text": "Following Covington (1996) , I implemented a very simple pruning strategy." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-88", "text": "The program keeps track of the badness of the best complete alignment found so far." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-89", "text": "Every branch in the search tree is abandoned as soon as its total badness exceeds that value." 
}, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-90", "text": "Thus, bad alignments are abandoned when they have only partly been generated." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-91", "text": "A second part of the strategy is that the computer always tries matches before it tries skips." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-92", "text": "As a result, if not much material needs to be skipped, a good alignment is found very quickly." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-93", "text": "For example, three four-character strings have 10,536 alignments (generated my way), but when comparing Spanish tres, French trois, and The algorithm has been prototyped in LPA Prolog, and Table 2 shows some of the alignments it found." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-94", "text": "None of these took more than five seconds on a 133-MHz Pentium, and the Prolog program was written for versatility, not speed." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-95", "text": "As comparative linguists know, the alignment that gives the best phonetic fit (by any criterion) is not always the etymologically correct one." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-96", "text": "This is evident with my algorithm." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-97", "text": "Worse, occasionally the present algorithm doesn't consider the etymologically correct alignment at all because something that looks better has already been found." 
}, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-98", "text": "For example, taking the Avestan, Greek, and Latin words for '100', the algorithm settles on --satom hekaton ken-tum (badness 610) without ever considering the etymologically correct alignment:" }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-99", "text": "--sa-tom heka-ton --kentum (badness 690)" }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-100", "text": "The penalties for skips may still be too high here, but the real problem is, of course, that the algorithm is looking for the one best alignment, and that's not what comparative reconstruction needs." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-101", "text": "Instead, the computer should prune the search tree less eagerly, pursuing any alignment whose badness is, say, no more than 120% of the lowest found so far, and delivering all solutions that are reasonably close to the best one found during the entire procedure." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-102", "text": "Indeed, the availability of multiple potential alignments is the keystone of Kay's (1964) proposal to implement the Comparative Method, which could not be implemented at the time Kay proposed it because of the lack of an efficient search algorithm." }, { "sent_id": "ce86cf36ee3b359c34b68e5d82b563-C001-103", "text": "The requisite modification is easily made and I plan to pursue it in subsequent work." 
} ], "y": { "@EXT@": { "gold_contexts": [ [ "ce86cf36ee3b359c34b68e5d82b563-C001-10", "ce86cf36ee3b359c34b68e5d82b563-C001-9" ], [ "ce86cf36ee3b359c34b68e5d82b563-C001-23", "ce86cf36ee3b359c34b68e5d82b563-C001-24" ] ], "cite_sentences": [ "ce86cf36ee3b359c34b68e5d82b563-C001-9", "ce86cf36ee3b359c34b68e5d82b563-C001-23" ] }, "@BACK@": { "gold_contexts": [ [ "ce86cf36ee3b359c34b68e5d82b563-C001-10", "ce86cf36ee3b359c34b68e5d82b563-C001-9" ], [ "ce86cf36ee3b359c34b68e5d82b563-C001-15" ], [ "ce86cf36ee3b359c34b68e5d82b563-C001-23", "ce86cf36ee3b359c34b68e5d82b563-C001-24" ], [ "ce86cf36ee3b359c34b68e5d82b563-C001-31" ], [ "ce86cf36ee3b359c34b68e5d82b563-C001-38" ], [ "ce86cf36ee3b359c34b68e5d82b563-C001-73" ] ], "cite_sentences": [ "ce86cf36ee3b359c34b68e5d82b563-C001-9", "ce86cf36ee3b359c34b68e5d82b563-C001-15", "ce86cf36ee3b359c34b68e5d82b563-C001-23", "ce86cf36ee3b359c34b68e5d82b563-C001-31", "ce86cf36ee3b359c34b68e5d82b563-C001-38", "ce86cf36ee3b359c34b68e5d82b563-C001-73" ] }, "@DIF@": { "gold_contexts": [ [ "ce86cf36ee3b359c34b68e5d82b563-C001-66" ] ], "cite_sentences": [ "ce86cf36ee3b359c34b68e5d82b563-C001-66" ] }, "@MOT@": { "gold_contexts": [ [ "ce86cf36ee3b359c34b68e5d82b563-C001-66" ] ], "cite_sentences": [ "ce86cf36ee3b359c34b68e5d82b563-C001-66" ] }, "@USE@": { "gold_contexts": [ [ "ce86cf36ee3b359c34b68e5d82b563-C001-71" ], [ "ce86cf36ee3b359c34b68e5d82b563-C001-87" ] ], "cite_sentences": [ "ce86cf36ee3b359c34b68e5d82b563-C001-71", "ce86cf36ee3b359c34b68e5d82b563-C001-87" ] } } }, "ABC_6cd4235e66a6e6e9768250c3db7fc6_22": { "x": [ { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-3", "text": "Recently, other languages have also been addressed, e.g., HeidelTime was extended to process eight languages." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-4", "text": "Chinese temporal tagging has achieved less attention, and no Chinese temporal tagger is publicly available." 
}, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-30", "text": "This finding encouraged us to manually create Chinese HeidelTime resources instead of trying automatic methods." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-31", "text": "----------------------------------" }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-5", "text": "In this paper, we address the full task of Chinese temporal tagging (extraction and normalization) by developing Chinese HeidelTime resources." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-32", "text": "**THE TEMPEVAL-2 CHINESE CORPUS**" }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-86", "text": "(iii) We checked the training texts for undetected expressions and created rules to match them." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-2", "text": "Temporal information is important for many NLP tasks, and there has been extensive research on temporal tagging with a particular focus on English texts." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-87", "text": "In parallel, we adapted the Chinese pattern and normalization resources when necessary." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-6", "text": "Our evaluation on a publicly available corpus -which we also partially re-annotated due to its rather low quality -demonstrates the effectiveness of our approach, and we outperform a recent approach to normalize temporal expressions." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-7", "text": "The Chinese HeidelTime resource as well as the corrected corpus are made publicly available." 
}, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-8", "text": "----------------------------------" }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-10", "text": "Temporal information plays a crucial role in many documents, and temporal tagging, i.e., the extraction of temporal expressions and their normalization to some standard format, is crucial for several NLP tasks." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-11", "text": "So far, research on temporal information extraction mostly focused on western languages, especially English." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-12", "text": "In contrast, eastern languages, e.g., Chinese, are less explored." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-13", "text": "Nevertheless, there is research on Chinese temporal tagging." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-14", "text": "While some works addressed either the extraction or the normalization subtask, a few full temporal taggers exist, e.g., CTEMP (Wu et al., 2005b) and CTAT (Jing et al., 2008) , but none of them is publicly available." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-15", "text": "In contrast, some temporal taggers were recently made available, e.g., DANTE (Mazur and Dale, 2009) , TipSem (Llorens et al., 2010) , and HeidelTime ." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-16", "text": "Furthermore, showed that HeidelTime can be extended to further languages by developing language-specific resources without modifying the source code." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-17", "text": "Thus, when developing temporal tagging capabilities for an additional language, one is faced with the question of whether to develop a new temporal tagger or to extend an existing one." 
}, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-18", "text": "We decided to extend HeidelTime for Chinese for the following reasons: (i) HeidelTime was the best temporal tagger in the TempEval-2 (English) and TempEval-3 (English and Spanish) competitions (Verhagen et al., 2010; UzZaman et al., 2013) , (ii) it already supports eight languages, and (iii) it is the only multilingual temporal tagger for cross-domain temporal tagging, e.g., news-and narrative-style documents can be processed with high quality." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-19", "text": "----------------------------------" }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-20", "text": "**RELATED WORK**" }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-21", "text": "For Chinese temporal tagging, machine learning and rule-based approaches have been employed." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-22", "text": "Wu et al. (2005a) and Wu (2010) report that machine learning techniques do not achieve as good results as rule-based approaches when processing Chinese." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-23", "text": "Thus, it is reasonable to extend a rulebased system such as HeidelTime to Chinese." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-24", "text": "In general, temporal tagging approaches perform the extraction, the normalization, or both, and create TIDES TIMEX2 (Ferro et al., 2005) or TimeML's TIMEX3 (Pustejovsky et al., 2005) annotations." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-25", "text": "For development and evaluation, there are two Chinese temporally annotated corpora, the ACE 2005 training corpus and TempEval-2 (c.f. Section 3)." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-26", "text": "Table 1 lists approaches to Chinese temporal tagging with some further information." 
}, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-27", "text": "The most recent work is the learningbased language-independent discriminative parsing approach for normalizing temporal expressions by Angeli and Uszkoreit (2013 There are also (semi-)automatic approaches to port a temporal tagger from one language to another." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-28", "text": "For instance, TERSEO has been extended from Spanish to English and Italian by automatic ruletranslation and automatically developed parallel corpora." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-29", "text": "However, the normalization quality of this approach was rather low compared to a rulebased tagger manually developed for the specific language (Negri, 2007) ." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-33", "text": "There are two Chinese temporally annotated corpora available: While the Chinese part of the ACE 2005 multilingual training corpus (Walker et al., 2006) has been used by some approaches (c.f. Table 1), it only contains TIMEX2 extent annotations." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-34", "text": "In contrast, the TempEval-2 Chinese data sets (Verhagen et al., 2010) contain TIMEX3 annotations with extent and normalization information." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-35", "text": "However, no TempEval-2 participants addressed Chinese and only Angeli and Uszkoreit (2013) report evaluation results on this corpus." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-36", "text": "Since HeidelTime is TIMEX3-compliant, and we address the extraction and normalization subtasks, we use the TempEval-2 corpus in our work." 
}, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-37", "text": "----------------------------------" }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-38", "text": "**ANNOTATION STANDARD TIMEML**" }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-39", "text": "For temporal expressions, TimeML (Pustejovsky et al., 2005) contains TIMEX3 tags with several attributes." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-40", "text": "The two most important ones -also annotated in the TempEval-2 data -are type and value." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-41", "text": "Type specifies if an expression is a date, time, duration, or set (set of times), and value contains the normalized meaning in standard format." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-42", "text": "----------------------------------" }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-43", "text": "**ORIGINAL TEMPEVAL-2 CORPUS**" }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-44", "text": "The Chinese training and test sets consist of 44 and 15 documents with 746 and 190 temporal expressions, respectively." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-45", "text": "However, several expressions have no normalized value information (85 in the training and 47 in the test set), others no type." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-46", "text": "1 This issue was also reported by Angeli and Uszkoreit (2013) ." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-47", "text": "Thus, they report evaluation results on two versions of the data sets, the original version and a cleaned version, in which all expressions without value information were removed." 
}, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-48", "text": "----------------------------------" }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-49", "text": "**RE-ANNOTATION OF THE TEMPEVAL-2 CORPUS**" }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-50", "text": "Due to the significant amount of temporal expressions with undefined value attributes, we decided to manually assign normalized values to these expressions instead of excluding them." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-51", "text": "During this process, we recognized that the corpus contained several more errors, e.g., some expressions were annotated as dates although they refer to durations." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-52", "text": "Thus, instead of only substituting undefined values, we checked all annotations in the two data sets and corrected errors." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-53", "text": "For this, one Chinese native and two further TimeML experts discussed all modified annotations." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-54", "text": "Although there were several difficult expressions and not all normalizations were straightforward, we significantly improved the annotation quality." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-55", "text": "After our modification, the improved training and test sets contain 765 and 193 temporal expressions with value information, respectively." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-56", "text": "In Table 2 , statistics about the three versions of the data sets are provided." 
}, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-57", "text": "----------------------------------" }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-58", "text": "**CHINESE HEIDELTIME RESOURCES**" }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-59", "text": "HeidelTime is a cross-domain, multilingual temporal tagger that strictly separates the source code and language-dependent resources ." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-60", "text": "While the implementation takes care of domain-dependent normalization issues, language-dependent resources contain pattern, normalization, and rule files." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-61", "text": "We had to develop such Chinese resources to perform Chinese temporal tagging with HeidelTime." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-62", "text": "----------------------------------" }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-63", "text": "**CHINESE LINGUISTIC PREPROCESSING**" }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-64", "text": "As input, HeidelTime requires sentence, token, and part-of-speech information." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-65", "text": "For most of the supported languages, HeidelTime uses a UIMA wrapper of the TreeTagger (Schmid, 1994" }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-66", "text": "----------------------------------" }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-67", "text": "**RESOURCE DEVELOPMENT PROCESS**" }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-68", "text": "To develop Chinese HeidelTime resources, we followed the strategy applied by for Spanish: Using HeidelTime's English resources as starting point, we translated the pattern files, the normalization files, and the rules for extracting and normalizing temporal expressions." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-69", "text": "More details on these steps are provided next." 
}, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-70", "text": "Pattern & Normalization Resources." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-71", "text": "English patterns in the pattern files, which also exist in Chinese in a similar form, were directly translated." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-72", "text": "For instance, there are Chinese expressions for names of months and weekdays." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-73", "text": "Patterns existing in English but not used in Chinese were removed, e.g., there are no abbreviations of month names in Chinese." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-74", "text": "In contrast, for other patterns frequently used in Chinese, additional pattern files were created." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-75", "text": "Examples are Chinese numerals." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-76", "text": "Based on the pattern files, we built the normalization files." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-77", "text": "Here, the normalized values of the patterns are stored." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-78", "text": "An example of the Chinese resources is as follows: The three patterns \"\u661f \u671f \u4e8c\", \"\u793c \u62dc \u4e8c\", and \"\u5468 \u4e8c\" can all be translated as Tuesday and are thus part of the Weekday pattern resource." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-79", "text": "Since weekdays are internally handled by HeidelTime with their English names, the normalization file for Chinese weekdays contains \"\u661f \u671f \u4e8c,Tuesday\" \"\u793c \u62dc \u4e8c,Tuesday\" and \"\u5468\u4e8c,Tuesday\"." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-80", "text": "Chinese Rule Development." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-81", "text": "HeidelTime's rules contain three components, a name, an extraction and a normalization part." 
}, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-82", "text": "The extraction mainly makes use of regular expressions and the pattern resources, and in the normalization part, the matched expressions are normalized using the normalization resources." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-83", "text": "3 To develop the rules, we again followed and applied the following strategy:" }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-84", "text": "(i) A few simple Chinese rules were created based on the English rules." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-85", "text": "(ii) We reviewed extracted temporal expressions in the training set and improved the extraction and normalization parts of the rules." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-88", "text": "(iv) We translated more complex English rules to also cover valid expressions not occurring in the Chinese training documents." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-89", "text": "(v) Steps (ii) to (iv) were iteratively performed until the results on the training set could not be improved further." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-90", "text": "----------------------------------" }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-91", "text": "**CHINESE CHALLENGES**" }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-92", "text": "Chinese is an isolating language without inflection and depends on word order and function words to represent grammatical relations." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-93", "text": "Although we only consider modern Mandarin as it is the most widely used variety of Chinese in contemporary texts, many challenges occurred during the resource development process." 
}, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-94", "text": "Some examples are:" }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-95", "text": "Polysemous words: Many Chinese words have more than one meaning, e.g., dynasty names such as \"\u5510\" (Tang) or \"\u5b8b\" (Song) can refer to a certain time period, but also appear as family names." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-96", "text": "Further ambiguities: There are many ambiguous expressions in Chinese, e.g., the temporal expression \"\u4e94\u65e5\u524d\" has two meanings: \"before the 5th day of a certain month\" and also \"5 days ago\" -depending on the context." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-97", "text": "Calendars: There are various calendars in Chinese culture and thus also in Chinese texts, such as the lunar calendar and the 24 solar terms, which are different from the Gregorian calendar and thus very difficult to normalize." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-98", "text": "Besides, Taiwan has a different calendar, which numbers the year from the founding year of the Republic of China (1911)." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-99", "text": "Table 3 : Evaluation results for extraction and normalization (TempEval-2 training and test sets)." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-100", "text": "----------------------------------" }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-101", "text": "**EVALUATION**" }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-102", "text": "In this section, we present evaluation results of our newly developed Chinese HeidelTime resources." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-103", "text": "In addition, we compare our results for the normalization sub-task to Angeli and Uszkoreit (2013) ." 
}, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-104", "text": "----------------------------------" }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-105", "text": "**EVALUATION SETUP**" }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-106", "text": "Corpus: We use three versions of the TempEval-2 training and test sets: (i) the original versions, (ii) the improved versions described in Section 3.3, and (iii) the cleaned versions also used by Angeli and Uszkoreit (2013) in which temporal expressions without value information are removed." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-107", "text": "Setting: Since the TempEval-2 data already contains sentence and token information, we only had to perform part-of-speech tagging as linguistic preprocessing step." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-108", "text": "For this, we used the TreeTagger (Schmid, 1994 ) with its Chinese model." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-109", "text": "----------------------------------" }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-110", "text": "**MEASURES:**" }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-111", "text": "We use the official TempEval-2 evaluation script." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-112", "text": "For the extraction, precision, recall, and f-score are calculated on the token-level." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-113", "text": "For the normalization, accuracy for the attributes type and value are calculated on the expression-level." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-114", "text": "Note that the use of accuracy makes it difficult to compare systems having a different recall in the extraction, as will be explained below." 
}, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-115", "text": "----------------------------------" }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-116", "text": "**EVALUATION RESULTS**" }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-117", "text": "Table 3 (top) shows the evaluation results on the training set." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-118", "text": "Extraction and normalization quality are high, and value accuracies of over 90% on the cleaned and improved versions are promising." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-119", "text": "4 The results on the test sets ( (Angeli and Uszkoreit, 2013) ." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-120", "text": "are written in modern Mandarin, some test documents contain Taiwan-specific expressions (c.f. Section 4.3) not covered by our rules yet." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-121", "text": "Finally, we compare the normalization quality of our approach to the multilingual parsing approach of Angeli and Uszkoreit (2013) ." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-122", "text": "However, their approach performs only the normalization subtask assuming that the extents of temporal expressions are provided." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-123", "text": "For this, they used gold extents for evaluation." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-124", "text": "HeidelTime only normalizes those expressions that it knows how to extract." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-125", "text": "Thus, we run HeidelTime performing the extraction and the normalization." 
}, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-126", "text": "However, since the accuracy measure used by the TempEval-2 script calculates the ratio of correctly normalized expressions to all extracted expressions and not to all expressions in the gold standard, we additionally present the raw numbers of correctly normalized expressions for the two systems." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-127", "text": "Table 4 shows the comparison between our approach and the one by Angeli and Uszkoreit (2013) ." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-128", "text": "We outperform their approach not only with respect to the accuracy but also with respect to the numbers of correctly normalized expressions (574 vs. 484 5 and 121 vs. 86 5 on the training and test sets, respectively) -despite the fact that we perform the full task of temporal tagging and not only the normalization." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-129", "text": "----------------------------------" }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-130", "text": "**CONCLUSIONS & ONGOING WORK**" }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-131", "text": "In this paper, we addressed Chinese temporal tagging by developing Chinese HeidelTime resources." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-132", "text": "These make HeidelTime the first publicly available Chinese temporal tagger." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-133", "text": "Our evaluation showed the high quality of the new HeidelTime resources, and we outperform a recent normalization approach." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-134", "text": "Furthermore, the re-annotated Chinese TempEval-2 data sets will also be made available." }, { "sent_id": "6cd4235e66a6e6e9768250c3db7fc6-C001-135", "text": "Currently, we are performing a detailed error analysis and hope to gain insights to further improve HeidelTime's Chinese resources." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "6cd4235e66a6e6e9768250c3db7fc6-C001-27" ], [ "6cd4235e66a6e6e9768250c3db7fc6-C001-35" ], [ "6cd4235e66a6e6e9768250c3db7fc6-C001-46" ] ], "cite_sentences": [ "6cd4235e66a6e6e9768250c3db7fc6-C001-27", "6cd4235e66a6e6e9768250c3db7fc6-C001-35", "6cd4235e66a6e6e9768250c3db7fc6-C001-46" ] }, "@UNSURE@": { "gold_contexts": [ [ "6cd4235e66a6e6e9768250c3db7fc6-C001-103" ], [ "6cd4235e66a6e6e9768250c3db7fc6-C001-121" ] ], "cite_sentences": [ "6cd4235e66a6e6e9768250c3db7fc6-C001-103", "6cd4235e66a6e6e9768250c3db7fc6-C001-121" ] }, "@USE@": { "gold_contexts": [ [ "6cd4235e66a6e6e9768250c3db7fc6-C001-106" ] ], "cite_sentences": [ "6cd4235e66a6e6e9768250c3db7fc6-C001-106" ] }, "@SIM@": { "gold_contexts": [ [ "6cd4235e66a6e6e9768250c3db7fc6-C001-106" ] ], "cite_sentences": [ "6cd4235e66a6e6e9768250c3db7fc6-C001-106" ] }, "@DIF@": { "gold_contexts": [ [ "6cd4235e66a6e6e9768250c3db7fc6-C001-127", "6cd4235e66a6e6e9768250c3db7fc6-C001-128" ] ], "cite_sentences": [ "6cd4235e66a6e6e9768250c3db7fc6-C001-127" ] } } }, "ABC_022049c0e75a490978b2c49da41deb_22": { "x": [ { "sent_id": "022049c0e75a490978b2c49da41deb-C001-94", "text": "----------------------------------" }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-54", "text": "----------------------------------" }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-2", "text": "This paper presents the first attempt to use word embeddings to predict the compositionality of multiword expressions." }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-3", "text": "We consider both single-and multi-prototype word embeddings." }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-4", "text": "Experimental results show that, in combination with a back-off method based on string similarity, word embeddings outperform a method using count-based distributional similarity." 
}, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-5", "text": "Our best results are competitive with, or superior to, state-of-the-art methods over three standard compositionality datasets, which include two types of multiword expressions and two languages." }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-6", "text": "----------------------------------" }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-8", "text": "Multiword expressions (MWEs) are word combinations that display some form of idiomaticity (Baldwin and Kim, 2009 ), including semantic idiomaticity, wherein the semantics of the MWE (e.g. ivory tower) cannot be predicted from the semantics of the component words (e.g. ivory and tower)." }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-9", "text": "Recent NLP work on semantic idiomaticity has focused on the task of \"compositionality prediction\", in the form of a regression task whereby a given MWE is mapped onto a continuous-valued compositionality score, either for the MWE as a whole or for each of its component words (Reddy et al., 2011; Schulte im Walde et al., 2013; Salehi et al., 2014b) ." }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-10", "text": "Separately in NLP, there has been a recent surge of interest in learning distributed representations of word meaning, in the form of \"word embeddings\" (Collobert and Weston, 2008; Mikolov et al., 2013a) and composition over distributed representations (Socher et al., 2012; Baroni et al., 2014) ." }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-11", "text": "This paper is the first attempt to bring together the work on word embedding-style distributional analysis with compositionality prediction of MWEs." 
}, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-12", "text": "In the context of compositionality prediction, our primary research questions here are: RQ1: Are word embeddings superior to conventional count-based models of distributional similarity?" }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-13", "text": "RQ2: How sensitive to parameter optimisation are different word embedding approaches?" }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-14", "text": "RQ3: Are multi-prototype word embeddings empirically superior to single-prototype word embeddings?" }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-15", "text": "We explore these questions relative to three compositionality prediction datasets spanning two MWE construction types (noun compounds and verb particle constructions) and two languages (English and German), and arrive at the following conclusions:" }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-16", "text": "(1) consistent with recent work over other NLP tasks, word embeddings are superior to countbased models of distributional similarity (and also translation-based string similarity); (2) the results are relatively stable under parameter optimisation for a given word embedding learning approach; and (3) based on two simple approaches to composition, single word embeddings are empirically slightly superior to multi-prototype word embeddings overall." 
}, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-17", "text": "----------------------------------" }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-18", "text": "**RELATED WORK**" }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-19", "text": "Recent work on distributed approaches to distributional semantics has demonstrated their utility in a wide range of NLP tasks, including identifying various morphosyntactic and semantic relations (Mikolov et al., 2013a) , dependency parsing (Bansal et al., 2014) , sentiment analysis , named-entity recognition (Collobert and Weston, 2008; , and machine translation (Zou et al., 2013; Devlin et al., 2014) ." }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-20", "text": "Despite the wealth of research applying word embeddings within NLP, they have not yet been considered for predicting the compositionality of MWEs." }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-21", "text": "Much prior work on MWEs has been tailored to specific kinds of MWEs in particular languages (e.g. English verb-noun combinations (Fazly et al., 2009) )." }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-22", "text": "There has however been recent interest in approaches to MWEs that are more broadly applicable to a wider range of languages and MWE types (Brooke et al., 2014; Salehi et al., 2014b; Schneider et al., 2014) ." }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-23", "text": "Word embeddings could form the basis for such an approach to predicting MWE compositionality." }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-24", "text": "----------------------------------" }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-25", "text": "**METHODOLOGY**" }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-26", "text": "In this work, we estimate the compositionality of an MWE based on the similarity between the expression and its component words in vector space." 
}, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-27", "text": "We use three different vector-space models: (1) a simple count-based model of distributional similarity; (2) word embeddings based on WORD2VEC; and (3) a multi-sense skip-gram model that, unlike the previous two models, is able to learn multiple embeddings per target word (or MWE)." }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-28", "text": "For all three models, we first greedily pre-tokenise the corpus to represent each MWE as a single token, similarly to Baldwin et al. (2003) ." }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-29", "text": "In this, we apply the constraint that no language-specific pre-processing can be applied to the training corpus, in order to make the method maximally language independent." }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-30", "text": "As such, we cannot perform any form of lemmatisation, and MWE identification takes the form of simple string match for concatenated instances of the component words, naively assuming that all occurrences of that word combination are MWEs." }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-31", "text": "We detail each of the distributional similarity methods below." }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-32", "text": "----------------------------------" }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-33", "text": "**COUNT-BASED DISTRIBUTIONAL SIMILARITY**" }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-55", "text": "**DATASETS**" }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-95", "text": "**CONCLUSIONS**" }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-34", "text": "Our first method for building vectors is that of Salehi et al. 
(2014b) : the top 50 most-frequent words in the training corpus are considered to be stopwords and discarded, and words with frequency rank 51-1051 are considered to be the content-bearing words, which form the dimensions for our vectors, in the manner of Sch\u00fctze (1997) ." }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-35", "text": "To measure the similarity of the MWE vector and the component word vectors, we considered two different approaches." }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-36", "text": "The first approach is based on Reddy et al. (2011) and Schulte im Walde et al. (2013) ." }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-37", "text": "The similarity between the MWE and each of its components is measured, and the overall compositionality of the MWE is computed by combining the similarity scores for the two components as follows:" }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-38", "text": "where MWE is the vector associated with the MWE, C i is the vector associated with the ith component word of the MWE, sim is a vector similarity function, and \u03b1 \u2208 [0, 1] is a weight parameter." }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-39", "text": "We also experimented with the approach from Mitchell and Lapata (2010) , where MWE is compared directly with a composed vector of the component words, based on vector addition: 1" }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-40", "text": "For both comp 1 and comp 2 , we used cosine similarity as our similarity measure sim." }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-41", "text": "----------------------------------" }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-42", "text": "**WORD2VEC**" }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-43", "text": "Our second method is based on the recurrent neural network language model (RNNLM) approach to learning word embeddings of Mikolov et al. (2013a) and Mikolov et al. 
(2013b) , using the WORD2VEC package." }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-44", "text": "2 WORD2VEC uses a log-linear model inspired by the original RNNLM approach of Mikolov et al. (2010) , in two forms: (1) a continuous bagof-words (\"CBOW\") model, whereby all words in a context window are averaged in a single projection layer; and (2) a continuous skip-gram model (\"C-SKIP\"), whereby a given word in context is projected onto a projection layer, and used to predict its immediate context (preceding and following words)." }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-45", "text": "WORD2VEC generates a vector of fixed dimensionality d for each pre-tokenised word/MWE type with frequency above a certain threshold in the training corpus." }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-46", "text": "We again use comp 1 and comp 2 to estimate compositionality from these vectors." }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-47", "text": "----------------------------------" }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-48", "text": "**MULTI-SENSE SKIP-GRAM MODEL**" }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-49", "text": "One potential shortcoming of WORD2VEC is that it generates a single word embedding for each word, irrespective of the relative polysemy of the word." }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-50", "text": "Neelakantan et al. (2014) proposed a method motivated by WORD2VEC, which efficiently learns multiple embeddings per word/MWE." }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-51", "text": "We refer to this approach as the multi-sense skip-gram (MSSG) model." 
}, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-52", "text": "We once again compose the resultant vectors with comp 1 and comp 2 , but modify the formulation slightly to handle the variable number of vectors for each word/MWE, by searching over the cross-product of vectors in each sim calculation and taking the maximum in each case." }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-53", "text": "We initially set the number of embeddings to 2 in our MSSG experiments -in keeping with the findings in Neelakantan et al. (2014) -but come back to examine the impact of the number of embeddings on compositionality prediction in Section 5." }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-56", "text": "We evaluate our methods over three datasets: 3 (1) English noun compounds (\"ENCs\", e.g. spelling bee and swimming pool); (2) English verb particle constructions (\"EVPCs\", e.g. stand up and give away); and (3) German noun compounds (\"GNCs\", e.g. ahornblatt \"maple leaf\" and eidechse \"lizard\")." }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-57", "text": "The ENC dataset consists of 90 binary English noun compounds, and is annotated on a continuous [0, 5] scale for both overall compositionality and the component-wise compositionality of each of the modifier and head noun (Reddy et al., 2011) ." }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-58", "text": "The state-of-the-art method for this dataset (Salehi et al., 2014b ) is a supervised support vector regression model, trained over the distributional method from Section 3.1 as applied to both English and 51 target languages (under word and MWE translation)." }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-59", "text": "The EVPC dataset consists of 160 English verb particle constructions, and is manually annotated for compositionality on a binary scale for each of the head verb and particle (Bannard, 2006) ." 
}, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-60", "text": "In order to translate the dataset into a regression task, we calculate the overall compositionality as the number of annotations of entailment for the verb, divided by the total number of verb annotations for that VPC." }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-61", "text": "The state-of-the-art method for this dataset (Salehi et al., 2014b ) is a linear combination of: (1) the distributional method from Section 3.1; (2) the same method applied to 10 target languages (under word and MWE translation, selecting the languages using supervised learning); and (3) the string similarity method of Salehi and Cook (2013) ." }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-62", "text": "The GNC dataset consists of 246 German noun compounds, and is annotated on a continuous" }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-63", "text": "----------------------------------" }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-64", "text": "**EXPERIMENTS**" }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-65", "text": "For all experiments, we train our models over raw text Wikipedia corpora for either English or German, depending on the language of the dataset." }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-66", "text": "The raw English and German corpora were preprocessed using the WP2TXT toolbox 4 to eliminate XML and HTML tags and hyperlinks, and punctuation was removed." }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-67", "text": "Finally, word-tokenisation was performed based on simple whitespace delimitation, after which we greedily identified all string occurrences of the MWEs in each of our datasets and combined them into a single token." }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-68", "text": "5 The word embedding approaches are unable to generate vector representations for tokens which occur with frequency below a fixed cutoff." 
}, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-69", "text": "6 In order to Table 1 : Pearson's correlation (r) for the different methods over the three datasets; the state-of-the-art for each dataset is described in Section 4 generate a compositionality prediction back-off for the small numbers of MWEs in this category, we assign a default value, which is the mean of computed compositionality scores for other instances." }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-70", "text": "7 As a baseline, we use the translation string similarity approach of Salehi and Cook (2013) , including the cross-validation-based method for selecting the 10 best languages to use for each dataset." }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-71", "text": "We further include a linear combination of the string similarity method with each of the various approaches based on word embeddings." }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-72", "text": "Table 1 shows the results for the various methods, lack of lemmatisation." }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-73", "text": "7 We also experimented with using the string similarity approach as a back-off, which resulted in marginally lower results than what is reported in Table 1." }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-93", "text": "Further research is required to better understand this effect." }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-74", "text": "over a range of hyper-parameter settings for each of WORD2VEC (vector dimensionality d; we also present results for CBOW vs. C-SKIP) and MSSG (vector dimensionality d and window size w), informed by the experimental results in the respective publications." }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-75", "text": "Note that for EVPC, we don't use the vector for the particle, in keeping with Salehi et al. (2014b) ; as such, there are no results for comp 2 ." 
}, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-76", "text": "For comp 1 , \u03b1 is set to 1.0 for EVPC, and 0.7 for both ENC and GNC, also based on the findings of Salehi et al. (2014b) ." }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-77", "text": "The results indicate that the approaches using both WORD2VEC and MSSG outperform simple distributional and string similarity by a substantial margin." }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-78", "text": "Further, over a variety of parameteriza- tions, they surpass the state-of-the-art methods for ENC and EVPC; in the case of GNC, the bestperforming method (WORD2VEC with d = 500 and C-SKIP) roughly matches the state-of-the-art." }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-79", "text": "Note that in each case, the state-of-the-art is achieved using varying levels of supervision over labelled data (ENC and EVPC) or language-specific preprocessing (GNC), whereas the word embedding methods use no labelled data." }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-80", "text": "As such, the answer to RQ1 would appear to be a resounding yes." }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-81", "text": "Looking to RQ2, the models are remarkably insensitive to hyper-parameter optimisation for EVPC, but there are slight deviations in the results for ENC and GNC." }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-82", "text": "Having said that, they are largely between the different word embedding approaches, and the results for a given approach under different parameter settings is relatively stable." }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-83", "text": "A large part of the cause of the drop in results and greater parameter sensitivity over GNC is the lower token frequencies, through a combination of the Wikipedia corpus being markedly smaller and our naive tokenisation strategy having low recall over German due to the richer morphology." 
}, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-84", "text": "As such, the answer would appear to be a tentative \"relatively insensitive, assuming high token frequencies\"." }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-85", "text": "Finally, looking to RQ3, there was little separating WORD2VEC and MSSG over ENC, but over the other two datasets, WORD2VEC had a clear advantage." }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-86", "text": "Given the high levels of polysemy observed in high frequency English verb particle constructions (Salehi et al., 2014a) , this result for EVPC was particularly surprising, and suggests that, at least under our two basic forms of composition, multiprototype word embeddings are at best equal to, and in many cases, inferior to, single-prototype word embeddings." }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-87", "text": "According to the results, the string similarity approach complements all word-embedding approaches." }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-88", "text": "We hypothesise that this is because it is not based on any corpus, and is thus not biased by the frequency of token instances in the corpus." }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-89", "text": "In Table 1 , the number of embeddings for MSSG was set to 2 prototypes, based on the default recommendations of Neelakantan et al. (2014) ." }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-90", "text": "To investigate the impact of this parameter on our results, we retrained MSSG over the range [1, 6] and reran our experiments for each set of embeddings over the three datasets (without string similarity, to isolate the effect of the number of embeddings), as shown in Figure 1 ." 
}, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-91", "text": "For both English datasets (ENC and EVPC), setting the number of prototypes to a value higher than 2 boosts the results slightly, with 5 prototypes appearing to be the optimal value." }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-92", "text": "For the German dataset (GNC), on the other hand, the best results are actually achieved for a single prototype." }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-96", "text": "We presented the first approach to using word embeddings to predict the compositionality of MWEs." }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-97", "text": "We showed that this approach, in combination with information from string similarity, surpassed, or was competitive with, the current state-of-the-art on three compositionality datasets." }, { "sent_id": "022049c0e75a490978b2c49da41deb-C001-98", "text": "In future work we intend to explore the contribution of information from word embeddings of a target expression and its component words under translation into many languages, along the lines of Salehi et al. (2014b) ." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "022049c0e75a490978b2c49da41deb-C001-9" ], [ "022049c0e75a490978b2c49da41deb-C001-22" ], [ "022049c0e75a490978b2c49da41deb-C001-58" ], [ "022049c0e75a490978b2c49da41deb-C001-61" ] ], "cite_sentences": [ "022049c0e75a490978b2c49da41deb-C001-9", "022049c0e75a490978b2c49da41deb-C001-22", "022049c0e75a490978b2c49da41deb-C001-58", "022049c0e75a490978b2c49da41deb-C001-61" ] }, "@USE@": { "gold_contexts": [ [ "022049c0e75a490978b2c49da41deb-C001-34" ], [ "022049c0e75a490978b2c49da41deb-C001-76" ] ], "cite_sentences": [ "022049c0e75a490978b2c49da41deb-C001-34", "022049c0e75a490978b2c49da41deb-C001-76" ] }, "@SIM@": { "gold_contexts": [ [ "022049c0e75a490978b2c49da41deb-C001-75" ], [ "022049c0e75a490978b2c49da41deb-C001-98" ] ], "cite_sentences": [ "022049c0e75a490978b2c49da41deb-C001-75", "022049c0e75a490978b2c49da41deb-C001-98" ] }, "@FUT@": { "gold_contexts": [ [ "022049c0e75a490978b2c49da41deb-C001-98" ] ], "cite_sentences": [ "022049c0e75a490978b2c49da41deb-C001-98" ] } } }, "ABC_742d9ca22bf801b0ade5fd1671473c_22": { "x": [ { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-2", "text": "We explore contemporary, data-driven techniques for solving math word problems over recent large-scale datasets." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-3", "text": "We show that well-tuned neural equation classifiers can outperform more sophisticated models such as sequence to sequence and self-attention across these datasets." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-4", "text": "Our error analysis indicates that, while fully data driven models show some promise, semantic and world knowledge is necessary for further advances." 
}, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-5", "text": "----------------------------------" }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-7", "text": "Solving math word problems has been an interest of the natural language processing community since the 1960s (Feigenbaum et al., 1963; Bobrow, 1964) ." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-8", "text": "More recently, algorithms for learning to solve algebra problems have gone in complementary directions: semantic and purely data-driven." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-9", "text": "Semantic methods learn from data how to map problem texts to a semantic representation which can then be converted to an equation." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-10", "text": "These representations combine setlike constructs (Hosseini et al., 2014) with hierarchical representations like equation trees (Koncel-Kedziorski et al., 2015; Roy and Roth, 2015; Wang et al., 2018) ." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-11", "text": "Such methods have the benefit of being interpretable, but no semantic representation general enough to solve all varieties of math word problems, including proportion problems and those that map to systems of equations, has been found." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-12", "text": "Another popular line of research is on purely data-driven solvers." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-13", "text": "Given enough training data, data-driven models can learn to map word problem texts to arbitrarily complex equations or systems of equations." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-14", "text": "These models have the additional advantage of being more language-independent than semantic methods, which often rely on parsers and other NLP tools." 
}, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-15", "text": "To train these fully data driven models, large-scale datasets for both English and Chinese were recently introduced (Wang et al., 2017; Koncel-Kedziorski et al., 2016) ." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-16", "text": "In response to the success of representation learning elsewhere in NLP, sequence to sequence (seq2seq) models have been applied to algebra problem solving (Wang et al., 2017) ." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-17", "text": "These powerful models have been shown to outperform other data-driven approaches in a variety of tasks." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-18", "text": "However, it is not obvious that solving word problems is best modeled as a sequence prediction task rather than a classification or retrieval task." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-19", "text": "Downstream applications such as question answering or automated tutoring systems may never have to deal with arbitrarily complex or even unseen equation types, obviating the need for a sequence prediction model." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-20", "text": "These considerations beg the questions: how do data-driven approaches to math word problem solving compare to each other? How can datadriven approaches benefit from recent advances in neural representation learning? What are the limits of data-driven solvers?" }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-21", "text": "In this paper, we thoroughly examine datadriven techniques on three larger algebra word problem datasets (Huang et al., 2016; Koncel-Kedziorski et al., 2016; Wang et al., 2017) ." 
}, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-22", "text": "We study classification, generation, and information retrieval models, and examine popular extensions to these models such as structured self-attention (Lin et al., 2017) and the use of pretrained word embeddings (Pennington et al., 2014; Peters et al., 2018) ." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-23", "text": "Our experiments show that a well-tuned neural equation classifier consistently performs better than more sophisticated solvers." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-24", "text": "We provide evidence that pretrained word embeddings, useful in other tasks, are not helpful for word problem solving." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-25", "text": "Advanced modeling such as structured self-attention is not shown to improve performance versus a well-tuned BiLSTM Classifier." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-26", "text": "Our error analysis supports the idea that, while data-driven techniques are powerful and robust, many word problems require semantic or world knowledge that cannot be easily incorporated into an end-to-end learning framework." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-27", "text": "----------------------------------" }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-28", "text": "**PROBLEM FORMULATION**" }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-29", "text": "Solving an algebra word problem (as shown below) requires finding the correct solution given the text of the problem." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-30", "text": "----------------------------------" }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-31", "text": "**PROBLEM TEXT**" }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-32", "text": "Similar to previous data-driven methods, we frame the task as one of mapping the word problem texts to equations given the training data." 
}, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-33", "text": "Our models abstract the specific numbers away from both the word problem text and target equation, preserving the ordering of the numbers found in the problem text." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-34", "text": "The resulting abstracted equation is called an equation template." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-35", "text": "At inference time, our solvers produce an equation template given the test problem." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-36", "text": "The template is then populated with the actual numbers from the problem text and evaluated to produce a solution." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-37", "text": "----------------------------------" }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-38", "text": "**MODELS**" }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-39", "text": "----------------------------------" }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-40", "text": "**RETRIEVAL**" }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-41", "text": "Retrieval methods map test word problem texts at inference time to the nearest training problem according to some similarity metric." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-42", "text": "The nearest neighbor's equation template is then filled in with numbers from the test problem and solved." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-43", "text": "Following Wang et al. (2017) , we use Jaccard distance in this model." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-44", "text": "For test problem S and training problem T , the Jaccard similarity is computed as: jacc(S, T ) = S\u2229T S\u222aT ." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-45", "text": "We also evaluate the use of a cosine similarity metric." 
}, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-46", "text": "Words from S and T are associated with pretrained vectors v(w i ) (Pennington et al., 2014) ." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-47", "text": "These vectors are averaged across each problem, resulting in vectors S and T. The Cosine similarity is then computed as cos(S, T) = S\u00b7T ||ST|| ." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-48", "text": "Vector averaging has previously been used as a strong baseline for a variety of sentence similarity tasks (Mu et al., 2017) ." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-49", "text": "----------------------------------" }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-50", "text": "**CLASSIFICATION**" }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-51", "text": "Classification methods learn to map problem texts to equation templates by learning parameters that minimize a cross entropy loss function over the set of training instances." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-52", "text": "At inference time, these methods choose the most likely equation template (the class) given a test word problem text." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-53", "text": "In both retrieval and classification methods, model accuracy is upper bounded by the oracle accuracy, or the number of test equation templates which appear in the training data." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-54", "text": "BiLSTM The BiLSTM classification model encodes the word problem text using a bidirectional Long Short Term Memory network (Hochreiter and Schmidhuber, 1997) with learned parameters \u03b8." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-55", "text": "The final hidden state of this encoding h n is scaled to the number of classes by weights W = w 1 ...w k and passed through a softmax to produce a distribution over class labels." 
}, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-56", "text": "The probability of equation template j for problem S is given by:" }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-57", "text": "This model is trained end-to-end using cross entropy loss." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-58", "text": "Structured Self-Attention Sentence embeddings using self-attention mechanisms (Lin et al., 2017) were shown to be successful in question answering tasks ." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-59", "text": "We conjecture that algebra problem solvers can also benefit from the long distance dependencies information introduced by self-attention." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-60", "text": "Here, bi-directional LSTM encoders capture relationships among the words of the input text." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-61", "text": "A multi-hop self-attention mechanism is applied to the resulting hidden states to produce a fixed sized embedding." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-62", "text": "The different attention hops are constrained so as to reduce redundancy, ensuring that various semantic aspects of the input are included in the resulting embedding." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-63", "text": "We refer the reader to the original paper for details." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-64", "text": "----------------------------------" }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-65", "text": "**GENERATION**" }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-66", "text": "Generation methods treat equation templates as strings of formal symbols." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-67", "text": "The production of a template is considered a sequence prediction problem conditioned on the word problem text." 
}, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-68", "text": "By treating templates as sequences rather than monolithic structures, generation methods have the potential to learn finer-grained relationships between the input text and output template." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-69", "text": "They also are the only methods studied here which can induce templates during inference which were not seen at training." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-70", "text": "We generate equation templates with seq2seq models (Sutskever et al., 2014) with attention mechanisms (Luong et al., 2015) ." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-71", "text": "These models condition the token-by-token generation of the equation template on encodings of the word problem text." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-72", "text": "Following Wang et al. (2017) we evaluate a seq2seq with LSTMs as the encoder and decoder." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-73", "text": "We also evaluate the use of Convolutional Neural Networks (CNNs) in the encoder and decoder." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-74", "text": "----------------------------------" }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-75", "text": "**EXPERIMENTS**" }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-76", "text": "----------------------------------" }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-77", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-78", "text": "Datasets For comparison, we report solution accuracy on the Chinese language Math23K dataset (Wang et al., 2017) , and the English language DRAW (Upadhyay and Chang, 2015) and MAWPS (Koncel-Kedziorski et al., 2016) datasets." 
}, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-79", "text": "Math23K and MAWPS consist of single equation problems, and DRAW contains both single and simultaneous equation problems." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-80", "text": "Details on the datasets are shown in Table 1 ." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-81", "text": "The Math23K dataset contains problems with possibly irrelevant quantities." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-82", "text": "To prune these quantities, we implement a significant number identifier (SNI) as discussed in Wang et al. (2017) ." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-83", "text": "Our best accuracy for SNI is 97%, slightly weaker than previous results." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-84", "text": "each dataset." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-85", "text": "We also explore two modifications of the BiLSTM's embedding matrix W E , either by using pretrained GloVe embeddings (Pennington et al., 2014) or using the ELMo technique of (Peters et al., 2018) as implemented in the AllenNLP toolkit (Gardner et al.) with pretrained character embeddings." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-86", "text": "For seq2seq modeling, we use OpenNMT (Klein et al., 2017) with 500 dimensional hidden states and embeddings and a dropout rate of 0.3." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-87", "text": "The CNN uses a kernel width of 3." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-88", "text": "Optimization is done using SGD with a learning rate of 1, decayed by half if the validation perplexity does not decrease after an epoch." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-89", "text": "Table 2 reports the accuracies of the data-driven models for solving algebra word problems." 
}, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-90", "text": "The classification models perform better than retrieval or generation models, despite their limited modeling power." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-91", "text": "The self-attention classification model performs well across all datasets." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-92", "text": "For the largest dataset (Math23K), a simple, well-tuned classifier can outperform the more sophisticated sequenceto-sequence and self-attention models." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-93", "text": "Table 3 shows results of augmenting the clas-" }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-94", "text": "----------------------------------" }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-95", "text": "**IMPLEMENTATION DETAILS**" }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-96", "text": "----------------------------------" }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-97", "text": "**RESULTS**" }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-98", "text": "Kendra made punch for her friend's birthday party." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-99", "text": "She used 3/4 of a gallon of grape juice, 1/4 of a gallon of cranberry juice , and 3/5 of a gallon of club soda. How many gallons of punch did Kendra make? Sandy went to the mall to buy clothes." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-100", "text": "She spent $20 on shorts, $10 on a shirt, and $35 on a jacket. How much money did Sandy spend on clothes?" }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-101", "text": "World Knowledge (19%) Mary began walking home from school, heading south at a rate of 3 miles per hour." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-102", "text": "Sharon left school at the same time heading north at 5 miles per hour." 
}, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-103", "text": "How long will it take for them to be 20 miles apart? If you purchase a membership for 100 dollars to receive 5% off purchases, how much would you need to spend to pay off the membership? sifier with pretrained word and character embeddings." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-104", "text": "Neither of these methods help over the English language data." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-105", "text": "It appears that the ELMo technique may require more training examples before it can improve solution accuracy." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-106", "text": "The previous state of the art model for the DRAW dataset is described in Upadhyay and Chang (2015) ." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-107", "text": "The state of the art for Math23K, described in Wang et al. (2017) , uses a hybrid Jaccard retrieval and seq2seq model." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-108", "text": "All models shown here fall well short of the highest possible classification/retrieval accuracy, shown in Table 2 as \"Oracle\"." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-109", "text": "This gap invites a more detailed error analysis regarding the possible limitations of data-driven solvers." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-110", "text": "----------------------------------" }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-111", "text": "**ERROR ANALYSIS**" }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-112", "text": "Despite the sophistication of these data-driven models, they still do not achieve optimal performance." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-113", "text": "A closer analysis of the errors these models make can illuminate the reason for this gap." 
}, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-114", "text": "Consider Table 4 , which illustrates two classes of errors made by data-driven systems." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-115", "text": "Both stem from incomplete knowledge on the part of the learning algorithm. But it is worth distinguishing the \"semantic limitations\" errors as this kind of information (subset relations, counts of nonnumerical entities) may be possible to extract from the data provided, given a sufficiently powerful modeling technique." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-116", "text": "The second class of errors, labeled \"world knowledge\", are impossible to extract from the math data alone." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-117", "text": "Consider the first example of people walking in different directions." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-118", "text": "To solve this problem, it is necessary to know that \"north\" and \"south\" are away from each other." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-119", "text": "Complicating the problem, suppose Sharon walked east instead of north." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-120", "text": "Then the relationship between east and south would impact the problem semantics." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-121", "text": "This kind of knowledge is beyond what is conveyed in any dataset of math word problems, and is a known problem for many NLP applications." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-122", "text": "----------------------------------" }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-123", "text": "**RELATED WORK**" }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-124", "text": "Semantic solvers provide some scaffolding for the grounding of word problem texts to equations." 
}, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-125", "text": "Mitra and Baral (2015) solve simple word problems by categorizing their operations as partwhole, change, or comparison." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-126", "text": "Shi et al. (2015) learn a semantic parser by semi-automatically inducing 9600 grammar rules over a dataset of number word problems." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-127", "text": "Works such has Roy and Roth (2015) and Koncel-Kedziorski et al. (2015) treat arithmetic word problem templates as equation trees and perform efficient tree-search by learning how to combine quantities using textual information." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-128", "text": "Roy and Roth (2017) advance this approach by considering unit consistency in the tree-search procedure." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-129", "text": "Wang et al. (2018) advance this line of work even further by modeling the search using deep Q-learning." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-130", "text": "Still, these semantic approaches are limited by their inability to model systems of equations as well as use of hand-engineered features." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-131", "text": "Data-driven math word problem solvers include , who learn to predict equation templates and subsequently align numbers and unknowns from the text." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-132", "text": "Zhou et al. (2015) only assign numbers to the predicted template, reducing the search space significantly." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-133", "text": "More recently, Wang et al. (2017) provide a large dataset of Chinese algebra word problems and learn a hybrid model consisting of both retrieval and seq2seq components." 
}, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-134", "text": "The current work extends these approaches by exploring advanced techniques in data-driven solving." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-135", "text": "----------------------------------" }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-136", "text": "**CONCLUSION**" }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-137", "text": "We have thoroughly examined data-driven models for automatically solving algebra word problems, including retrieval, classification, and generation techniques." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-138", "text": "We find that a well-tuned classifier outperforms generation and retrieval on several datasets." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-139", "text": "One avenue for improving performance is to ensemble different models." }, { "sent_id": "742d9ca22bf801b0ade5fd1671473c-C001-140", "text": "However, in light of the error analysis provided, the incorporation of semantic and world knowledge will be necessary to achieve maximal success." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "742d9ca22bf801b0ade5fd1671473c-C001-15" ], [ "742d9ca22bf801b0ade5fd1671473c-C001-16" ], [ "742d9ca22bf801b0ade5fd1671473c-C001-107" ] ], "cite_sentences": [ "742d9ca22bf801b0ade5fd1671473c-C001-15", "742d9ca22bf801b0ade5fd1671473c-C001-16", "742d9ca22bf801b0ade5fd1671473c-C001-107" ] }, "@USE@": { "gold_contexts": [ [ "742d9ca22bf801b0ade5fd1671473c-C001-21" ], [ "742d9ca22bf801b0ade5fd1671473c-C001-43" ], [ "742d9ca22bf801b0ade5fd1671473c-C001-72" ], [ "742d9ca22bf801b0ade5fd1671473c-C001-78" ], [ "742d9ca22bf801b0ade5fd1671473c-C001-82" ] ], "cite_sentences": [ "742d9ca22bf801b0ade5fd1671473c-C001-21", "742d9ca22bf801b0ade5fd1671473c-C001-43", "742d9ca22bf801b0ade5fd1671473c-C001-72", "742d9ca22bf801b0ade5fd1671473c-C001-78", "742d9ca22bf801b0ade5fd1671473c-C001-82" ] }, "@EXT@": { "gold_contexts": [ [ "742d9ca22bf801b0ade5fd1671473c-C001-133", "742d9ca22bf801b0ade5fd1671473c-C001-134" ] ], "cite_sentences": [ "742d9ca22bf801b0ade5fd1671473c-C001-133" ] } } }, "ABC_a5f00f524fdf18e62a4e98a92a2d82_22": { "x": [ { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-2", "text": "Traditional approaches to Sentiment Analysis (SA) rely on large annotated data sets or wide-coverage sentiment lexica, and as such often perform poorly on under-resourced languages." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-3", "text": "This paper presents empirical evidence of an efficient SA approach using freely available machine translation (MT) systems to translate Arabic tweets to English, which we then label for sentiment using a state-of-theart English SA system." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-4", "text": "We show that this approach significantly outperforms a number of standard approaches on a gold-standard heldout data set, and performs equally well compared to more cost-intense methods with 76% accuracy." 
}, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-5", "text": "This confirms MT-based SA as a cheap and effective alternative to building a fully fledged SA system when dealing with under-resourced languages." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-6", "text": "----------------------------------" }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-8", "text": "Over the past decade, there has been a growing interest in collecting, processing and analysing usergenerated text from social media using Sentiment Analysis (SA)." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-9", "text": "SA determines the polarity of a given text, i.e. whether its overall sentiment is negative or positive." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-10", "text": "While previous work on SA for English tweets reports an overall accuracy of 65-71% on average (Abbasi et al., 2014) , recent studies investigating Arabic tweets only report accuracy scores ranging between 49-65% (Mourad and Darwish, 2013; Refaee and Rieser, 2014b) ." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-11", "text": "Arabic SA faces a number of challenges: first, Arabic used in social media is usually a mixture of Modern Standard Arabic (MSA) and one or more of its dialects (DAs)." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-12", "text": "Standard toolkits for Natural Language Processing (NLP) mainly cover the former and perform poorly on the latter 1 ." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-13", "text": "These tools are vital for the performance of machine learning (ML) approaches to Arabic SA: traditionally, ML approaches use a \"bag of words\" (BOW) model (e.g. Wilson et al. (2009) )." 
}, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-14", "text": "However, for morphologically rich languages, such as Arabic, a mixture of stemmed tokens and morphological features have shown to outperform BOW approaches (Abdul-Mageed et al., 2011; Mourad and Darwish, 2013) , accounting for the fact that Arabic contains a very large number of inflected words." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-15", "text": "In addition (or maybe as a result), there is much less interest from the research community in tackling the challenge of Arabic SA for social media." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-16", "text": "As such, there are much fewer open resources available, such as annotated data sets or sentiment lexica." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-17", "text": "We therefore explore an alternative approach to Arabic SA on social media, using off-the-shelf Machine Translation systems to translate Arabic tweets into English and then use a state-of-the-art sentiment classifier (Socher et al., 2013) to assign sentiment labels." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-18", "text": "To the best of our knowledge, this is the first study to measure the impact of automatically translated data on the accuracy of sentiment analysis of Arabic tweets." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-19", "text": "In particular, we address the following research questions:" }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-20", "text": "1. How does off-the-shelf MT on Arabic social data influence SA performance?" }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-21", "text": "2. Can MT-based approaches be a viable alternative to improve sentiment classification performance on Arabic tweets? 3." 
}, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-22", "text": "Given the linguistic resources currently available for Arabic and its dialects, is it more effective to adapt an MT-based approach instead of building a new system from scratch?" }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-23", "text": "----------------------------------" }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-24", "text": "**RELATED WORK**" }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-25", "text": "There are currently two main approaches to automatic sentiment analysis: using a sentiment lexicon or building a classifier using machine learning." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-26", "text": "Lexicon-based approaches, on the one hand, utilise sentiment lexica to retrieve and annotate sentiment bearing word tokens for their sentiment orientation and then utilise a set of rules to assign the overall sentiment label (Taboada et al., 2011) ." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-27", "text": "Machine Learning (ML) approaches, on the other hand, frequently make use of annotated data sets, to learn a statistical classifier (Mourad and Darwish, 2013; Abdul-Mageed et al., 2011; Wilson et al., 2009 )." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-28", "text": "These approaches gain high performance for English tweets: a benchmark test on commercial and freely-available SA tools report accuracy levels between 65% -71% on English tweets (Abbasi et al., 2014) ." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-29", "text": "For Arabic tweets, one of the best results for SA to date is reported in Mourad and Darwish (2013) with 72.5% accuracy using 10-fold-cross validation and SVM on a manually annotated data set (2300 tweets)." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-30", "text": "However, this performance drops dramatically to 49.65% -65.32% accuracy when testing an independent held-out set Refaee and Rieser, 2014c) ." 
}, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-31", "text": "One possible explanation is the time-changing nature of twitter (Eisenstein, 2013) : models trained on data collected at one point in time will not generalise to tweets collected at a later stage, due to changing topics and vocabulary." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-32", "text": "As such, current work investigates Distant Supervision (DS) to collect and annotate large data sets in order to train generalisable models (e.g. Go et al. (2009) )." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-33", "text": "Recent work by Refaee and Rieser (2014b) has evaluated DS approaches on Arabic Tweets." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-34", "text": "They report accuracy scores of around 57% which significantly outperforms a majority baseline and a fully supervised ML approach, but it is still considerably lower than scores achieved on English tweets." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-35", "text": "In the following, we compare these previous approaches to an approach using automatic Machine Translation (MT)." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-36", "text": "So far, there is only limited evidence that this approach works for languages lack large SA training data-set, such as Arabic." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-37", "text": "Bautin et al. (2008) investigate MT to aggregate sentiment from multiple news documents written in a number of different languages." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-38", "text": "The authors argue that despite the difficulties associated with MT, e.g. information loss, the translated text still maintains a sufficient level of captured sentiments for their purposes." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-39", "text": "This work differs from our work in terms of domain and in measuring summary consistency rather than SA accuracy." 
}, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-40", "text": "Balahur and Turchi (2013) investigate the use of an MT system (Google) to translate an annotated corpus of English tweets into four European languages in order to obtain an annotated training set for learning a classifier." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-41", "text": "The authors report an accuracy score of 64.75% on the English held-out test set." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-42", "text": "For the other languages, reported accuracy scores ranged between 60 -62%." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-43", "text": "Hence, they conclude that it is possible to obtain high quality training data using MT, which is an encouraging result to motivate our approach." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-44", "text": "Wan (2009) proposes a co-training approach to tackle the lack of Chinese sentiment corpora by employing Google Translate as publicly available machine translation (MT) service to translate a set of annotated English reviews into Chinese." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-45", "text": "Using a held-out test set, the best reported accuracy score was at 81.3% with SVM on binary classification task: positive vs negative." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-46", "text": "Our approach differs from the ones described, in that we use automatic MT to translate Arabic tweets into English and then perform SA using a stateof-the-art SA classifier for English (Socher et al., 2013) ." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-47", "text": "Most importantly, we empirically benchmark its performance towards previous SA approaches, including lexicon-based, fully supervised and distant supervision SA." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-48", "text": "tweets from the Twitter public stream." 
}, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-49", "text": "We restrict the language of all retrieved tweets to Arabic by setting the language parameter to ar." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-50", "text": "The data-set was manually labeled with gold-standard sentiment orientation by two native speakers of Arabic, obtaining a Kappa score of 0.81, which indicates highly reliable annotations." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-51", "text": "Table 1 summarises the data set and its distribution of labels." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-52", "text": "For SA, we perform binary classification using positive and negative tweets." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-53", "text": "We apply a number of common preprocessing steps following Go et al. (2009) and Pak and Paroubek (2010) to account for noise introduced by Twitter." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-54", "text": "The data set will be released as part of this submission." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-55", "text": "----------------------------------" }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-56", "text": "**MT-BASED APPROACH**" }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-57", "text": "In order to obtain the English translation of our Twitter data-set, we employ two common and freelyavailable MT systems: Google Translate and Microsoft Translator Service." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-58", "text": "We then use the Stanford Sentiment Classifier (SSC) developed by Socher et al. (2013) to automatically assign sentiment labels (positive, negative) to translated tweets." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-59", "text": "The classifier is based on a deep learning (DL) approach, using recursive neural models to capture syntactic dependencies and compositionality of sentiments." 
}, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-60", "text": "Socher et al. (2013) show that this model significantly outperforms previous standard models, such as Na\u00efve Bayes (NB) and Support Vector Machines (SVM) with an accuracy score of 85.4% for binary classification (positive vs. negative) at sentence level 2 ." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-61", "text": "The authors observe that the recursive models work well on shorter text while BOW features with NB and SVM perform well only on longer sentences." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-62", "text": "Using Socher et al. (2013) 's approach for directly training a sentiment classifier will require a larger training data-set, which is not available yet for Ara-bic 3 ." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-63", "text": "----------------------------------" }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-64", "text": "**BASELINE SYSTEMS**" }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-65", "text": "We benchmark the MT-approach against three baseline systems representing current standard approaches to SA: a lexicon-based approach, a fully supervised machine learning approach and a distant supervision approach (also see Section 2)." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-66", "text": "The lexicon-based baseline combines three sentiment lexica." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-67", "text": "We exploit two existing subjectivity lexica: a manually annotated Arabic subjectivity lexicon (Abdul-Mageed and Diab, 2012) and a publicly available English subjectivity lexicon, called MPQA (Wilson et al., 2009 ), which we automatically translate using Google Translate, following a similar technique to Mourad and Darwish (2013) ." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-68", "text": "The translated lexicon is manually corrected by removing translations with a no clear sentiment indicator 4 ." 
}, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-69", "text": "This results in 2,627 translated instances after correction." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-70", "text": "We then construct a third dialectal lexicon of 484 words that we extract from an independent Twitter development set and manually annotate for sentiment." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-71", "text": "All lexica are merged into a combined lexicon of 4,422 annotated sentiment words (duplicates removed)." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-72", "text": "In order to obtain automatic labels for positive and negative instances, we follow a simplified version of the rule-based aggregation approach of Taboada et al. (2011) ." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-73", "text": "First, all lexicons and tweets are lemmatised using MADAMIRA (Pasha et al., 2014) ." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-74", "text": "For each tweet, matched sentiment words are marked with either (+1) or (-1) to incorporate the semantic orientation of individual constituents." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-75", "text": "This achieves a coverage level of 76.62% (which is computed as a percentage of tweets with at least one lexicon word) using the combined lexicon." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-76", "text": "To account for negation, we reverse the polarity (switch negation) following Taboada et al. (2011) ." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-77", "text": "The sentiment orientation of the entire tweet is then computed by summing up the sentiment scores of all sentiment words in a given tweet into a single score that automatically determines the label as being: positive or negative." 
}, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-78", "text": "Instances where the score equals zero are excluded from the training set as they (Refaee and Rieser, 2014c) to train a classifier using word n-grams and SVMs (which we found to achieve the best performance amongst a number of other machine learning schemes we explored)." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-79", "text": "The Distant Supervision (DS) baseline uses lexicon-based annotation to create a training set of 134,069 automatically labeled tweets (using the approach we described for the lexicon-based baseline), where the identified sentiment-bearing words are replaced by place-holders to avoid bias." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-80", "text": "We then use these noisy sentiment labels to train a classifier using SVMs." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-81", "text": "Note that previous work has also experimented with emoticon-based DS, but has found that a lexicon-based DS approach leads to superior results (Refaee and Rieser, 2014b )." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-82", "text": "Table 2 summarises the results for comparing the above baselines to our MT-based approaches (using Google and Microsoft MT), reporting on per-class and average recall, precision and F-measure." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-83", "text": "We also measure statistical significance by performing a planned comparison between the top-performing approaches (namely, the lexicon-based baseline and the two MT systems) using \u03c7 2 with Bonferroni correction on binary accuracy values (see Table 3 )." 
}, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-84", "text": "We observe the following:" }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-85", "text": "----------------------------------" }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-86", "text": "**EXPERIMENT RESULTS**" }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-87", "text": "\u2022 In general, MT-based approaches reach a similar performance to the more resource-intense baseline systems." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-88", "text": "There is no significant distance in accuracy between the MT-based approaches and the overall best performing lexicon-based approach." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-89", "text": "\u2022 Microsoft MT significantly outperforms Google MT for this task." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-90", "text": "\u2022 Overall, the fully supervised baseline performs worst." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-91", "text": "A possible explanation for that is the timechanging nature of Twitter resulting in issues like topic-shift resulting in word token-based features being less effective in such a medium (Refaee and Rieser, 2014c )." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-92", "text": "\u2022 MT-based SA approaches in general have a problem of identifying positive tweets (low recall and precision), often misclassifying them as negative." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-93", "text": "The reverse it true for the DS and fully supervised baselines, which find it hard to identify negative tweets." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-94", "text": "This is in line with results reported by Refaee and Rieser (2014b) which evaluate DS approaches to Arabic SA." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-95", "text": "Only the lexiconapproach is balanced between the positive and negative class." 
}, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-96", "text": "Note that our ML baseline systems as well as the English SA classifier by Socher et al. (2013) are trained on balanced data sets, i.e. we can assume no prior bias towards one class." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-97", "text": "----------------------------------" }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-98", "text": "**PLANNED CONTRASTS**" }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-99", "text": "----------------------------------" }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-100", "text": "**ERROR ANALYSIS**" }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-101", "text": "The above results highlight the potential of an MTbased approach to SA for languages that lack a large Table 4 : Examples of misclassified tweets training data-set annotated for sentiment analysis, such as Arabic." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-102", "text": "In the following, we conduct a detailed error analysis to fully understand the strength and weaknesses of this approach." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-103", "text": "First, we investigate the superior performance of Microsoft over Google MT by manually examining examples where Microsoft translated data is assigned the correct SA label, but the reverse is true for Google translated data, which is the case for 108 instances of our test set (11.5%)." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-104", "text": "This analysis reveals that the main difference is the ability of Microsoft MT to maintain a better sentence structure (see Table 5 )." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-105", "text": "For the following example-based error analysis of the MT approach, we therefore only consider examples where both MT systems lead to the same SA label, taking a random sample of 100 misclassified tweets." 
}, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-106", "text": "We observe the following cases of incorrectly classified tweets (see examples in Table 4 ):" }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-107", "text": "1. Example 1 fails to translate the sentimentbearing dialectical word, 'elegant', transcribing it as Kchkh but not translating it." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-108", "text": "2. Incorrectly translated sentiment-bearing phrases/idioms, see e.g. that cub is from that lion in example 2. 3. Misspelled and hence incorrectly translated sentiment-bearing words in the original text, see example 3 'Farahhh' ('happpiness') with repeated letters." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-109", "text": "This problem is also highlighted by Abbasi et al. (2014) as one of challenges facing sentiment analysis for social networks." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-110", "text": "4. Example 4 shows a correctly translated tweet, but with an incorrect sentiment label." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-111", "text": "We assume that this is a case of cultural differences: the phrase \"oh God\" can have a negative connotation in English (Strapparava et al., 2012) ." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-112", "text": "Note that the Stanford Sentiment classifier makes use of a manually labeled English sentiment phrase-based lexicon, which may introduce a cultural bias." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-113", "text": "5. Example 5 represents a case of correctly translated sentiment-bearing words (love, life), but failed to translate surrounding text ('Ashan' and 'Amtlat')." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-114", "text": "Bautin et al. (2008) point out that this type of contextual information loss is one of the main challenges of MT-based SA." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-115", "text": "6. 
Example 6 represents a case of a correctly translated tweet, but with an incorrectly assigned sentiment label." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-116", "text": "We assume that this is due to changes in sentence structure introduced by the MT system." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-117", "text": "Balahur and Turchi (2013) state that word ordering is one of the most prominent causes of SA misclassification." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-118", "text": "In order to confirm this hypothesis, we manually corrected sentence structure before feeding it into the SA classifier." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-119", "text": "This approach led to the correct SA label, and thus, confirmed that the cause of the problem is word-ordering." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-120", "text": "Note that the Stanford SA system pays particular attention to sentence structure due to its \"deep\" architecture that adds to the model the feature of being sensitive to word ordering (Socher et al., 2013) ." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-121", "text": "In future work, we will verify this by comparing these results to other high performing English SA tools (see for example Abbasi et al. (2014) In sum, one of the major challenges of this approach seems to be the use of Arabic dialects in social media, such as Twitter." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-122", "text": "In order to confirm this hypothesis, we automatically label Dialectal Arabic (DA) vs. Modern Standard Arabic (MSA) using AIDA (Elfardy et al., 2014) and analyse the performance of MT-based SA." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-123", "text": "The results in Fig. 1 show a significant correlation (Pearson, p<0.05) between language class and SA accuracy, with MSA outperforming DA." 
}, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-124", "text": "This confirms DA as a major source of error in the MT-based approach." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-125", "text": "Issues like dialectal variation and the vowel-free writing system still present a challenge to machine-translation (Zbib et al., 2012) ." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-126", "text": "This is especially true for tweets as they tend to be less formal resulting in issues like misspelling and individual spelling variations." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-127", "text": "However, with more resources being released for informal Arabic and Arabic dialects, e.g. (Cotterell and Callison-Burch, 2014; Refaee and Rieser, 2014a) , we assume that off-the-shelf MT systems will improve their performance in the near future." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-128", "text": "----------------------------------" }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-129", "text": "**CONCLUSION**" }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-130", "text": "This paper is the first to investigate and empirically evaluate the performance of Machine Translation (MT)-based Sentiment Analysis (SA) for Arabic Tweets." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-131", "text": "In particular, we make use of off-theshelf MT tools, such as Google and Microsoft MT, to translate Arabic Tweets into English." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-132", "text": "We then use the Stanford Sentiment Classifier (Socher et al., 2013) to automatically assign sentiment labels (positive, negative) to translated tweets." 
}, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-133", "text": "In contrast to previous work, we benchmark this approach on a gold-standard test set of 937 manually annotated tweets and compare its performance to standard SA approaches, including lexicon-based, supervised and distant supervision approaches." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-134", "text": "We find that MT approaches reach a comparable performance or significantly outperform more resourceintense standard approaches." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-135", "text": "As such, we conclude that using off-the-shelf tools to perform SA for under-resourced languages, such as Arabic, is an effective and efficient alternative to building SA classifiers from scratch." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-136", "text": "Future directions of this work include quantifying the impact of the used off-the-shelf tools, e.g. by using alternative high performing English SA tools." }, { "sent_id": "a5f00f524fdf18e62a4e98a92a2d82-C001-137", "text": "In addition, we plan to investigate multi-classifier systems, given the strength and weaknesses identified for each of the approaches." 
} ], "y": { "@USE@": { "gold_contexts": [ [ "a5f00f524fdf18e62a4e98a92a2d82-C001-17" ], [ "a5f00f524fdf18e62a4e98a92a2d82-C001-46" ], [ "a5f00f524fdf18e62a4e98a92a2d82-C001-58" ], [ "a5f00f524fdf18e62a4e98a92a2d82-C001-62" ], [ "a5f00f524fdf18e62a4e98a92a2d82-C001-132" ] ], "cite_sentences": [ "a5f00f524fdf18e62a4e98a92a2d82-C001-17", "a5f00f524fdf18e62a4e98a92a2d82-C001-46", "a5f00f524fdf18e62a4e98a92a2d82-C001-58", "a5f00f524fdf18e62a4e98a92a2d82-C001-62", "a5f00f524fdf18e62a4e98a92a2d82-C001-132" ] }, "@SIM@": { "gold_contexts": [ [ "a5f00f524fdf18e62a4e98a92a2d82-C001-96" ] ], "cite_sentences": [ "a5f00f524fdf18e62a4e98a92a2d82-C001-96" ] }, "@FUT@": { "gold_contexts": [ [ "a5f00f524fdf18e62a4e98a92a2d82-C001-120", "a5f00f524fdf18e62a4e98a92a2d82-C001-121" ] ], "cite_sentences": [ "a5f00f524fdf18e62a4e98a92a2d82-C001-120" ] } } }, "ABC_26658b95c9bac96f1206da96b95921_22": { "x": [ { "sent_id": "26658b95c9bac96f1206da96b95921-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-2", "text": "Compositional matrix-space models of language were recently proposed for the task of meaning representation of complex text structures in natural language processing." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-3", "text": "These models have been shown to be a theoretically elegant way to model compositionality in natural language." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-4", "text": "However, in practical cases, appropriate methods are required to learn such models by automatically acquiring the necessary token-to-matrix assignments." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-5", "text": "In this paper, we introduce graded matrix grammars of natural language, a variant of the matrix grammars proposed by Rudolph and Giesbrecht (2010) , and show a close correspondence between this matrix-space model and weighted finite automata." 
}, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-6", "text": "We conclude that the problem of learning compositional matrix-space models can be mapped to the problem of learning weighted finite automata over the real numbers." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-7", "text": "----------------------------------" }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-9", "text": "Quantitative models of language have recently received considerable research attention in the field of Natural Language Processing (NLP)." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-10", "text": "In the application of meaning representation of text in NLP, much effort has been spent on semantic Vector Space Models (VSMs)." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-11", "text": "Such models capture word meanings quantitatively, based on their statistical co-occurrences in the documents." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-12", "text": "The basic idea is to represent words as vectors in a highdimensional space, where each dimension corresponds to a separate feature." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-13", "text": "In this way, semantic similarities can be computed based on measuring the distance between vectors in the vector * Supported by DFG Graduiertenkolleg 1763 (QuantLA) space (Mitchell and Lapata, 2010) ." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-14", "text": "Vectors which are close together in this space have similar meanings and vectors which are far away are distant in meaning (Turney and Pantel, 2010) ." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-15", "text": "VSMs typically represent each word separately, without considering representations of phrases or sentences." 
}, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-16", "text": "So, the compositionality properties of the language is lost in VSMs (Mitchell and Lapata, 2010) ." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-17", "text": "Recently, some approaches have been developed in the area of compositionality and distributional semantics in NLP." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-18", "text": "These approaches introduce different word representations and ways of combining those words." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-19", "text": "Mitchell and Lapata (2010) propose a framework for vector-based semantic composition." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-20", "text": "They define additive or multiplicative function for the composition of two vectors and show that compositional approaches generally outperform non-compositional approaches which treat the phrase as the union of single lexical items." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-21", "text": "However, VSMs still have some limitations in the task of modeling complex conceptual text structures." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-22", "text": "For example, in the bag-of-words model, the words order and therefore the structure of the language is lost." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-23", "text": "To overcome the limitations of VSMs, Rudolph and Giesbrecht (2010) proposed Compositional Matrix-Space Models (CMSM) as a recent alternative model to work with distributional approaches." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-24", "text": "These models employ matrices instead of vectors and make use of iterated matrix multiplication as the only composition operation." 
}, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-25", "text": "They show that these models are powerful enough to subsume many known models, both quantitative (vector-space models with diverse composition operations) and qualitative ones (such as regular languages)." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-26", "text": "It is also proved theoretically that this framework is an elegant way to model compositional, symbolic and distributional aspects of natural language." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-27", "text": "However, in practical cases, methods are needed to automatically acquire the token-to-matrix assignments from available data." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-28", "text": "Therefore, methods for training such models should be developed e.g. by leveraging appropriate machine learning methods." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-29", "text": "In this paper, we are concerned with Graded Matrix Grammars, a variant of the Matrix Grammars of Rudolph and Giesbrecht (2010) , where instead of the \"yes or no\" decision, if a sequence is part of a language, a real-valued score is assigned." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-30", "text": "This is a popular task in NLP, used, e.g., in sentiment analysis settings (Yessenalina and Cardie, 2011) ." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-31", "text": "Generally, in many tasks of NLP, we need to estimate functions which map arbitrary sequence of words (e.g. sentences) to some semantical space." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-32", "text": "Using Weighted Finite Automata (WFA), an extensive class of these functions can be defined, which assign values to these sequences (Balle and Mohri, 2012) ." 
}, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-33", "text": "Herein, inspired by the definition of weighted finite automata (Sakarovitch, 2009) and their applications in NLP (Knight and May, 2009 ), we show a tight correspondence between graded matrix grammars and weighted finite automata." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-34", "text": "Hence, we argue that the problem of learning CMSMs can be mapped to the problem of learning WFA." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-35", "text": "The rest of the paper is organized as follows." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-36", "text": "Section 2 provides the basic notions of weighted automata and the matrix-space model." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-37", "text": "A detailed description of correspondence between CMSM and WFA is presented in Section 3, followed by related work in Section 4 and conclusion and future work in Section 5." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-38", "text": "----------------------------------" }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-39", "text": "**PRELIMINARIES**" }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-40", "text": "In this section, we provide the definitions of weighted automata in (Balle and Mohri, 2015; Sakarovitch, 2009 ) and matrix-space models of language in (Rudolph and Giesbrecht, 2010) ." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-41", "text": "----------------------------------" }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-42", "text": "**WEIGHTED FINITE AUTOMATA**" }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-43", "text": "Weighted finite automata generalize classical automata in which transitions and states carry weights." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-44", "text": "These weights can be considered as the cost of the transitions or amount of resources needed to execute the transitions." 
}, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-45", "text": "Let \u03a3 be a finite alphabet." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-46", "text": "A weighted automaton A is a tuple of (Q A , \u03bb, \u03b1, \u03b2) and defined over a semi-ring (S, \u2295, \u2297,0,1)." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-47", "text": "Q A is a finite set of states, \u03bb : \u03a3 \u2192 S Q A \u00d7Q A is the transition weight function, and, \u03b1 : \u03a3 \u2192 S and \u03b2 : \u03a3 \u2192 S are two functions assigning to every state its initial and final weight." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-48", "text": "Thereby, for each transition e = (q, \u03c3, q ), \u03bb(\u03c3) q,q denotes the weight of the label \u03c3 associated with the transition e between q and q , which are the source and target state of the transition." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-49", "text": "Moreover, A path P in A is a sequence of transitions labeled with \u03c3 1 \u00b7 \u00b7 \u00b7 \u03c3 n , in more detail:" }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-50", "text": "The weight of P is defined as the \u2297-product of the weights of the starting state, its transitions, and final state:" }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-51", "text": ". Now, the weight of a word x = \u03c3 1 \u00b7 \u00b7 \u00b7 \u03c3 n \u2208 \u03a3 is the cumulative weight of all paths labeled with the sequence \u03c3 1 \u00b7 \u00b7 \u00b7 \u03c3 n which is computed as the \u2295-sum of the weights of the corresponding paths, also known as a rational power series:" }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-52", "text": "where P A (\u03c3 1 \u00b7 \u00b7 \u00b7 \u03c3 n ) denotes the (finite) set of paths in A labeled with \u03c3 1 \u00b7 \u00b7 \u00b7 \u03c3 n ." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-53", "text": "So, the function f A maps the set of strings in \u03a3 to S. 
In this work, we will assume that S is the set of the real numbers R with the usual multiplication and addition." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-54", "text": "Figure 1 illustrates an example of WFA over \u03a3 = {a, b}. Inside each state there is a tuple of the name, initial and final weight of the state, respectively." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-55", "text": "As an example, for x = ab we have:" }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-56", "text": "----------------------------------" }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-57", "text": "**COMPOSITIONALITY AND COMPOSITIONAL MATRIX-SPACE MODEL**" }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-58", "text": "The general principle of compositionality is that the meaning of a complex expression is a function of the meaning of its constituent tokens and some rules used to combine them (Frege, 1884) ." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-59", "text": "More formally, according to Rudolph and Giesbrecht (2010) , the underlying idea can be described as follows: \"Given a mapping \u00b7 : \u03a3 \u2192 S from a set of tokens in \u03a3 into some semantical space S, the composition operation is defined by mapping sequences of meanings to meanings: : S \u2192 S. So, the meaning of the sequence of tokens \u03c3 1 \u00b7 \u00b7 \u00b7 \u03c3 n can be obtained by first applying the function \u00b7 to each token and then to the sequence \u03c3 1 \u00b7 \u00b7 \u00b7 \u03c3 n , as shown in Figure 2 \"." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-60", "text": "Figure 2: Principle of compositionality, illustration taken from Rudolph and Giesbrecht (2010) In compositional matrix-space models, this general idea is instantiated as follows: we have S = R n\u00d7n , i.e., the semantical space consists of quadratic matrices of real numbers." 
}, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-61", "text": "The mapping function · maps the tokens into matrices so that the semantics of simple tokens is expressed by matrices." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-62", "text": "Then, using standard matrix multiplication as the only composition operation, the semantics of complex phrases is also described by matrices." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-63", "text": "Rudolph and Giesbrecht (2010) showed theoretically that, by employing matrices instead of vectors, CMSMs subsume a wide range of linguistic models such as statistical models (vector-space models and word space models)." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-64", "text": "----------------------------------" }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-65", "text": "**GRADED MATRIX GRAMMARS AND WEIGHTED FINITE AUTOMATA**" }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-66", "text": "In some applications of NLP, we need to derive the meaning of a sequence of words in a language, which can be done with CMSMs as described in Section 2.2." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-67", "text": "In this section, we introduce the notion of a graded matrix grammar, which constitutes a slight variation of the matrix grammars introduced by Rudolph and Giesbrecht (2010)." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-68", "text": "Definition 1 (Graded Matrix Grammars)." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-69", "text": "Let Σ be an alphabet." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-70", "text": "A graded matrix grammar M of degree n is defined as the tuple ⟨·, Σ, α, β⟩, where · is a function mapping tokens in Σ to n × n matrices of real numbers." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-71", "text": "Moreover, α, β ∈ R^n." 
}, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-72", "text": "Then we map each sequence of tokens σ_1 ··· σ_k ∈ Σ* to a real number (called the value of the sequence) using the target function ϕ : Σ* → R defined by: ϕ(σ_1 ··· σ_k) = α^T M_{σ_1} ··· M_{σ_k} β, where M_σ denotes the matrix assigned to σ by ·." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-73", "text": "However, as discussed before, to be used in practice, learning methods are needed to extract graded matrix grammars from textual data." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-74", "text": "Hence, the target function ϕ can generalize to all texts in the language and handle unseen word compositions." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-75", "text": "To this end, we show the correspondence between CMSMs and WFAs, with the consequence that existing learning methods for WFAs can be applied to learning CMSMs." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-76", "text": "As discussed in Section 2.1, in a WFA, for a rational power series f, the value f(x) is the ⊕-sum of the weights of all paths labeled with x = σ_1 ··· σ_n ∈ Σ*." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-77", "text": "However, this computation can be described via matrices, by the fact that a walk over a graph corresponds to a matrix multiplication (Sakarovitch, 2009)." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-78", "text": "More precisely, for any σ ∈ Σ, let A_σ ∈ R^(Q_A × Q_A) be the transition matrix of σ: [A_σ]_{p,q} = ∑_{e ∈ P_A(p,σ,q)} λ(σ)_{p,q}, where P_A(p, σ, q) is the set of all transitions labeled with σ from p to q. Also, the vectors α_A ∈ R^(Q_A) and β_A ∈ R^(Q_A) are the start and final weights of the states in Q_A, respectively." 
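A concrete instance of Definition 1 can be sketched as follows. The token matrices and the α/β vectors below are hypothetical illustrations (not learned values); they show how matrix composition can capture a token like "not" flipping the value's sign, which pure vector addition cannot.

```python
import numpy as np

# Hypothetical degree-2 graded matrix grammar: the token matrices
# and the alpha/beta vectors are illustrative, not learned values.
token_matrix = {
    "good": np.array([[ 1.0, 0.0],
                      [ 0.0, 1.0]]),
    "not":  np.array([[-1.0, 0.0],
                      [ 0.0, 1.0]]),
}
alpha = np.array([1.0, 0.0])
beta  = np.array([1.0, 0.0])

def phi(tokens):
    """Value of a token sequence: alpha^T M_t1 ... M_tk beta."""
    M = np.eye(2)
    for t in tokens:
        M = M @ token_matrix[t]
    return float(alpha @ M @ beta)

print(phi(["good"]))         # 1.0
print(phi(["not", "good"]))  # -1.0: "not" flips the value's sign
```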
}, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-79", "text": "Then, Equation 1 can be equally expressed as follows in terms of matrices with entries in R (Balle and Mohri, 2015): f_A(σ_1 ··· σ_n) = α_A^T A_{σ_1} ··· A_{σ_n} β_A." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-80", "text": "Hence, we see the correspondence between Equations 2 and 3." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-81", "text": "In more detail, consider each phrase p and its value r in a natural language." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-82", "text": "If we extract the words of the language as a finite alphabet Σ in an automaton, then p would be a string in Σ*." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-83", "text": "The function · in M, applied over the words, constructs the n × n transition matrices of the alphabet symbols in the automaton." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-84", "text": "Here, n can be the number of states of the automaton." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-85", "text": "So, estimating the function ϕ in a graded matrix grammar corresponds to estimating the target function f_A of the automaton, which computes exactly the value of the string translated from the phrase p in the language." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-86", "text": "That is, the representation of a string is obtained by multiplying the transition matrices of its tokens, which results in a new representation matrix for the string." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-87", "text": "Then, suitable predefined vectors α and β translate the resulting matrix to a real number which denotes the value of the associated phrase p in the natural language." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-88", "text": "The problem of learning WFAs is finding a WFA closely estimating a target function, using for training a finite sample of strings labeled with their target values (Balle and Mohri, 2015)." 
}, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-89", "text": "By learning WFAs, one obtains an automaton given as a tuple A = ⟨α, β, {A_a}_{a ∈ Σ}⟩, from which one can compute the target function f_A(x)." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-90", "text": "Since WFAs encode CMSMs, and based on this close correspondence between them, learning a graded matrix grammar to estimate the value of phrases can be mapped to the problem of learning a weighted automaton." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-91", "text": "----------------------------------" }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-92", "text": "**RELATED WORK**" }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-93", "text": "An application of CMSMs has been shown in the work of Yessenalina and Cardie (2011)." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-94", "text": "They proposed a learning-based approach for phrase-level sentiment analysis." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-95", "text": "Inspired by the work of Rudolph and Giesbrecht (2010), they use CMSMs to model composition, and present an algorithm for learning a matrix for each word via ordered logistic regression, which is evaluated with promising results." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-96", "text": "However, it is not trivial to learn a matrix-space model." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-97", "text": "Since the final optimization problem is non-convex, this method is sensitive to the initialization of the matrices." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-98", "text": "Socher et al. (2012) introduce a matrix-vector recursive neural network (MV-RNN) model that learns compositional vector representations for phrases and sentences." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-99", "text": "The model assigns a vector and a matrix to every node in a parse tree." 
}, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-100", "text": "The vector represents the meaning of the constituent, while the matrix captures how it affects the meaning of neighboring constituents." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-101", "text": "The model needs a parse tree to learn the vectors and matrices." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-102", "text": "Recently, new approaches have been proposed for learning weighted finite automata in NLP." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-103", "text": "Balle and Mohri (2012) and Balle et al. (2014) introduce a new family of algorithms for learning general WFAs and stochastic WFAs based on the combination of the matrix completion problem and spectral methods." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-104", "text": "These algorithms are designed for learning an arbitrary weighted automaton from sample data of strings and assigned labels." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-105", "text": "They formulate the missing information from the sample data as a Hankel matrix completion problem." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-106", "text": "Then, spectral learning is applied to the resulting Hankel matrix to obtain a WFA." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-107", "text": "Balle et al. (2014) also offer the main results in spectral learning, which are an interesting alternative to the classical EM algorithms in the context of grammatical inference, and show the computational efficiency of these algorithms." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-108", "text": "Moreover, Balle and Mohri (2015) discuss modern (spectral) learning methods for an arbitrary WFA in different scenarios." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-109", "text": "They provide WFA reconstruction algorithms and standardization." 
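The Hankel matrix at the heart of these spectral methods can be sketched numerically: H[u, v] = f(uv) for prefixes u and suffixes v. For a function f computed by a WFA, every entry factors through the states (H[u, v] = (α^T A_u)(A_v β)), so H has rank at most the number of states. The 2-state WFA and its weights below are made-up placeholders; the finite-rank property they illustrate is the general one.

```python
import numpy as np

# Sketch of the Hankel matrix H[u, v] = f(u + v) of a rational power
# series. Because f is computed by a 2-state WFA, H factors through
# the states and its rank is at most 2. Weights are illustrative.
alpha = np.array([1.0, 0.0])
beta  = np.array([0.5, 1.0])
A = {"a": np.array([[0.2, 0.8], [0.0, 0.6]]),
     "b": np.array([[1.0, 0.0], [0.3, 0.4]])}

def f(word):
    M = np.eye(2)
    for s in word:
        M = M @ A[s]
    return float(alpha @ M @ beta)

basis = ["", "a", "b", "aa", "ab", "ba", "bb"]
H = np.array([[f(u + v) for v in basis] for u in basis])
print(np.linalg.matrix_rank(H))  # 2, the number of states
```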
}, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-110", "text": "It is theoretically guaranteed that, for a Hankel matrix of finite rank representing a rational power series, there is a corresponding WFA whose number of states equals this rank, and this WFA is minimal." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-111", "text": "----------------------------------" }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-112", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-113", "text": "In this paper, we introduced graded matrix grammars for compositionality in language, where compositional matrix-space models are employed in different tasks of NLP." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-114", "text": "However, we need a learning method to train this model for value assignments in NLP." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-115", "text": "For this purpose, we showed the close correspondence between matrix grammars and weighted automata." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-116", "text": "So, the problem of learning a CMSM can be encoded as the problem of learning a WFA." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-117", "text": "Our future goal is to review the existing methods for learning WFAs and adapt them to solve the task of sentiment analysis/meaning representation in NLP." }, { "sent_id": "26658b95c9bac96f1206da96b95921-C001-118", "text": "Such learning methods would allow us to automatically learn CMSMs and induce the corresponding graded matrix grammars." 
} ], "y": { "@EXT@": { "gold_contexts": [ [ "26658b95c9bac96f1206da96b95921-C001-5" ], [ "26658b95c9bac96f1206da96b95921-C001-29" ], [ "26658b95c9bac96f1206da96b95921-C001-67" ] ], "cite_sentences": [ "26658b95c9bac96f1206da96b95921-C001-5", "26658b95c9bac96f1206da96b95921-C001-29", "26658b95c9bac96f1206da96b95921-C001-67" ] }, "@BACK@": { "gold_contexts": [ [ "26658b95c9bac96f1206da96b95921-C001-23" ], [ "26658b95c9bac96f1206da96b95921-C001-59" ], [ "26658b95c9bac96f1206da96b95921-C001-63" ], [ "26658b95c9bac96f1206da96b95921-C001-95" ] ], "cite_sentences": [ "26658b95c9bac96f1206da96b95921-C001-23", "26658b95c9bac96f1206da96b95921-C001-59", "26658b95c9bac96f1206da96b95921-C001-63", "26658b95c9bac96f1206da96b95921-C001-95" ] }, "@UNSURE@": { "gold_contexts": [ [ "26658b95c9bac96f1206da96b95921-C001-40" ] ], "cite_sentences": [ "26658b95c9bac96f1206da96b95921-C001-40" ] }, "@USE@": { "gold_contexts": [ [ "26658b95c9bac96f1206da96b95921-C001-60" ] ], "cite_sentences": [ "26658b95c9bac96f1206da96b95921-C001-60" ] } } }, "ABC_237ac6f9b635e56119be956d7521e1_22": { "x": [ { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-2", "text": "Despite the long history of named-entity recognition (NER) task in the natural language processing community, previous work rarely studied the task on conversational texts." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-3", "text": "Such texts are challenging because they contain a lot of word variations which increase the number of out-of-vocabulary (OOV) words." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-4", "text": "The high number of OOV words poses a difficulty for word-based neural models." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-5", "text": "Meanwhile, there is plenty of evidence to the effectiveness of character-based neural models in mitigating this OOV problem." 
}, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-6", "text": "We report an empirical evaluation of neural sequence labeling models with character embedding to tackle NER task in Indonesian conversational texts." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-7", "text": "Our experiments show that (1) character models outperform word embedding-only models by up to 4 F 1 points, (2) character models perform better in OOV cases with an improvement of as high as 15 F 1 points, and (3) character models are robust against a very high OOV rate." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-8", "text": "----------------------------------" }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-10", "text": "Critical to a conversational agent is the ability to recognize named entities." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-11", "text": "For example, in a flight booking application, to book a ticket, the agent needs information about the passenger's name, origin, and destination." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-12", "text": "While named-entity recognition (NER) task has a long-standing history in the natural language processing community, most of the studies have been focused on recognizing entities in well-formed data, such as news articles or biomedical texts." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-13", "text": "Hence, little is known about the suitability of the available named-entity recognizers for conversational texts." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-14", "text": "In this work, we tried to shed some light on this direction by evaluating neural sequence labeling models on NER task in Indonesian conversational texts." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-15", "text": "Unlike standard NLP corpora, conversational texts are typically noisy and informal." 
}, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-16", "text": "For example, in Indonesian, the word aku (\"I\") can be written as: aq, akuw, akuh, q. People also tend to use non-standard words to represent named entities." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-17", "text": "This creative use of language results in numerous word variations which may increase the number of out-of-vocabulary (OOV) words (Baldwin et al., 2013)." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-18", "text": "The most common approach to handle the OOV problem is to represent every OOV word with a single vector representation (embedding)." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-19", "text": "However, this treatment is not optimal because it ignores the fact that words can share similar morphemes, which can be exploited to estimate the OOV word embedding better." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-20", "text": "Meanwhile, word representation models based on subword units, such as characters or word segments, have been shown to perform well in many NLP tasks such as POS tagging (dos Santos and Zadrozny, 2014; Ling et al., 2015), language modeling (Ling et al., 2015; Kim et al., 2016; Vania and Lopez, 2017), machine translation (Vylomova et al., 2016; Lee et al., 2016; Sennrich et al., 2016), dependency parsing (Ballesteros et al., 2015), and sequence labeling (Rei et al., 2016; Lample et al., 2016)." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-21", "text": "These representations are effective because they can represent OOV words better by leveraging the orthographic similarity among words." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-22", "text": "As for Indonesian NER, the earliest work was done by Budi et al. (2005), which relied on a rule-based approach." 
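The intuition behind orthographic similarity can be sketched with a toy example. A bag-of-characters vector stands in here for the learned LSTM/CNN character composition; the words are the paper's own example aku/akuh plus makan ("to eat"), an unrelated word chosen for illustration.

```python
from collections import Counter

import numpy as np

# Toy illustration of why character-based representations help with
# OOV words: spelling variants share characters, so a character-based
# vector of an unseen variant stays close to the in-vocabulary form.
# A bag-of-characters vector stands in for a learned LSTM/CNN composer.
def char_vector(word):
    counts = Counter(word)
    chars = "abcdefghijklmnopqrstuvwxyz"
    return np.array([counts[c] for c in chars], dtype=float)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# 'akuh' is an informal variant of 'aku' ("I"); 'makan' is unrelated.
sim_variant   = cosine(char_vector("aku"), char_vector("akuh"))
sim_unrelated = cosine(char_vector("aku"), char_vector("makan"))
print(sim_variant > sim_unrelated)  # True
```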
}, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-23", "text": "More recent research mainly used machine learning methods such as conditional random fields (CRF) (Luthfi et al., 2014; Leonandya et al., 2015; Taufik et al., 2016) and support vector machines (Suwarningsih et al., 2014; Aryoyudanta et al., 2016) ." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-24", "text": "The most commonly used datasets are news articles (Budi et al., 2005) , Wikipedia/DBPedia articles (Luthfi et al., 2014; Leonandya et al., 2015; Aryoyudanta et al., 2016) , medical texts (Suwarningsih et al., 2014) , and Twitter data (Taufik et al., 2016) ." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-25", "text": "To the best of our knowledge, there has been no work that used neural networks for Indonesian NER nor NER for Indonesian conversational texts." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-26", "text": "In this paper, we report the ability of a neural network-based approach for Indonesian NER in conversational data." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-27", "text": "We employed the neural sequence labeling model of (Rei et al., 2016) and experimented with two word representation models: word-level and character-level." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-28", "text": "We evaluated all models on relatively large, manually annotated Indonesian conversational texts." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-29", "text": "We aim to address the following questions: 1) How do the character models perform compared to word embedding-only models on NER in Indonesian conversational texts?" }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-30", "text": "2) How much can we gain in terms of performance from using the character models on OOV cases?" }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-31", "text": "3) How robust (in terms of performance) are the character models on different levels of OOV rates?" 
}, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-32", "text": "Our experiments show that (1) the character models perform really well compared to word embedding-only with an improvement up to 4 F 1 points, (2) we can gain as high as 15 F 1 points on OOV cases by employing character models, and (3) the character models are highly robust against OOV rate as there is no noticeable performance degradation even when the OOV rate approaches 100%." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-33", "text": "----------------------------------" }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-34", "text": "**METHODOLOGY**" }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-35", "text": "We used our own manually annotated datasets collected from users using our chatbot service." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-36", "text": "There are two datasets: SMALL-TALK and TASK-ORIENTED." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-37", "text": "SMALL-TALK contains 16K conversational messages from our users having small talk with our chatbot, Jemma." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-59", "text": "They usually employ a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) with CRF as the output layer, and a CNN (Ma and Hovy, 2016) or LSTM (Lample et al., 2016; Rei et al., 2016) composes the character embeddings." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-60", "text": "Also, we do not try to achieve state-of-theart results but only are interested whether neural sequence labeling models with character embedding can handle the OOV problem well." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-61", "text": "Therefore, for the neural models, we just picked the implementation provided in (Rei et al., 2016) ." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-62", "text": "4 In their implementation, all the LSTMs have only one layer." 
}, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-63", "text": "Dropout (Srivastava et al., 2014 ) is used as the regularizer but only applied to the final word embedding as opposed to the LSTM outputs as proposed by Zaremba et al. (2015) ." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-64", "text": "The loss function contains not only the log likelihood of the training data and the similarity score but also a language modeling loss, which is not mentioned in (Rei et al., 2016) but discussed in the subsequent work (Rei, 2017) ." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-65", "text": "Thus, their implementation essentially does multi-task learning with sequence labeling as the primary task and language modeling as the auxiliary task." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-66", "text": "We used an almost identical setting to Rei et al. (2016) : words are lowercased, but characters are not, digits are replaced with zeros, singleton words in the training set are converted into unknown tokens, word and character embedding sizes are 300 and 50 respectively." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-67", "text": "The character embeddings were initialized randomly and learned during training." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-68", "text": "LSTMs are set to have 200 hidden units, the pre-output layer has an output size of 50, CRF layer is used as the output layer, and early stopping is used with a patience of 7." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-69", "text": "Some differences are: we did not use any pretrained word embedding, and we used Adam optimization (Kingma and Ba, 2014 ) with a learning rate of 0.001 and batch size of 16 to reduce GPU memory usage." 
}, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-70", "text": "We decided not to use any pretrained word embedding because, to the best of our knowledge, there is no off-the-shelf Indonesian pretrained word embedding trained on conversational data." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-71", "text": "The ones available are usually trained on Wikipedia articles (fastText), and we believe they share only a very small vocabulary with conversational texts." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-72", "text": "We tuned the dropout rate on the development set via grid search, trying multiples of 0.1." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-73", "text": "We evaluated all of our models using the CoNLL evaluation: micro-averaged F1 score based on exact span matching." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-74", "text": "**RESULTS AND DISCUSSION: PERFORMANCE** Table 5 shows the overall F1 score on the test set of each dataset." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-75", "text": "We see that the neural network models beat both baseline models significantly." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-76", "text": "We also see that the character models consistently outperform the word embedding-only model, where the improvement can be as high as 4 points on SMALL-TALK." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-77", "text": "An interesting observation is how the improvement is much larger on SMALL-TALK than on TASK-ORIENTED." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-78", "text": "We speculate that this is due to the higher OOV rate of SMALL-TALK, as can be seen in Table 4." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-79", "text": "To understand the character model better, we draw the confusion matrices of the word embedding-only and the concatenation model for each dataset in Figure 1." 
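The CoNLL-style evaluation can be sketched as follows: extract labeled spans from BIO tag sequences and compute micro-averaged F1 over exact span matches. The tag sequences below are illustrative, and this is a simplified re-implementation of the idea, not the official conlleval script.

```python
# Sketch of CoNLL-style evaluation: extract labeled spans from BIO
# tags and micro-average F1 over exact span matches.
def spans(tags):
    """Return {(label, start, end)} spans from a BIO tag sequence."""
    out, start, label = set(), None, None
    for i, tag in enumerate(tags + ["O"]):      # sentinel flushes last span
        if tag.startswith("B-") or tag == "O":
            if label is not None:
                out.add((label, start, i))      # close the open span
            label, start = (tag[2:], i) if tag.startswith("B-") else (None, None)
        # an "I-" tag simply continues the current span
    return out

def micro_f1(gold_seqs, pred_seqs):
    tp = fp = fn = 0
    for g, p in zip(gold_seqs, pred_seqs):
        G, P = spans(g), spans(p)
        tp += len(G & P); fp += len(P - G); fn += len(G - P)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec  = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "B-PER"]]
print(micro_f1(gold, pred))  # 0.5: one exact span match out of two each
```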
}, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-80", "text": "We chose only the concatenation model because both character models are better than the word embedding-only, so we just picked the simplest one." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-81", "text": "SMALL-TALK." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-82", "text": "Both word embedding-only and concatenation model seem to hallucinate PERSON and LOCATION often." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-83", "text": "This observation is indicated by the high false positive rate of those entities, where 56% of non-entities are recognized as PERSON, and about 30% of nonentities are recognized as LOCATION." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-84", "text": "Both models appear to confuse PHONE as DATETIME as marked by 11% and 17% misclassification rate of the models respectively." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-85", "text": "The two models also have some differences." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-86", "text": "The word embedding-only model has higher false negative than the concatenation model." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-87", "text": "DATE-TIME has the highest false negative, where the word embedding-only model incorrectly classified 30% of true entities as non-entity." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-88", "text": "Turning to the concatenation model, we see how the false negative decreases for almost all entities." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-89", "text": "DATETIME has the most significant drop of 20% (down from 30% to 10%), followed by PERSON, PHONE, LOCATION, and GENDER." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-90", "text": "TASK-ORIENTED." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-91", "text": "The confusion matrices of the two models are strikingly similar." 
}, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-92", "text": "The models seem to have a hard time dealing with LOC because it often hallucinates the existence of LOC (as indicated by the high false positive rate) and misses genuine LOC entities (as shown by the high false negative rate)." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-93", "text": "Upon closer look, we found that the two models actually can recognize LOC well, but sometimes they partition it into its parts while the gold annotation treats the entity as a single unit." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-94", "text": "Table 6 shows an example of such case." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-95", "text": "A long location like Kantor PKPK lt." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-96", "text": "3 is partitioned by the models into Kantor PKPK (office name) and lt." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-97", "text": "3 (floor number)." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-98", "text": "The models also partition Jl Airlangga no. 4-6 Sby into Jl Airlangga no. 4-6 (street and building number) and Sby (abbreviated city name)." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-99", "text": "We think that this partitioning behavior is reasonable because each part is indeed a location." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-100", "text": "There is also some amount of false positive on PER, signaling that the models sometimes falsely recognize a non-entity as a person's name." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-101", "text": "The similarity of the two confusion matrices appears to demonstrate that character embedding only provides a small improvement on the TASK-ORIENTED dataset." 
}, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-102", "text": "----------------------------------" }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-103", "text": "**PERFORMANCE ON OOV ENTITIES**" }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-104", "text": "Next, we want to understand better how much gain we can get from the character models on OOV cases." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-105", "text": "To answer this question, we ignored entities that do not have any OOV word on the test set and re-evaluated the word embedding-only and concatenation models." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-106", "text": "Table 7 shows the re-evaluated overall and per-entity F1 score on the test set of each dataset." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-107", "text": "We see how the concatenation model consistently outperforms the word embedding-only model for almost all entities on both datasets." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-108", "text": "On the SMALL-TALK dataset, the overall F1 score gap is as high as 15 points." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-109", "text": "It is also remarkable that the concatenation model manages to achieve 40 F1 points for GENDER on SMALL-TALK while the word embedding-only model cannot even recognize any GENDER." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-110", "text": "Table 6: An example displaying how the word embedding-only (word) and concatenation (concat) models can partition a long location entity into its parts." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-111", "text": "Therefore, in general, this result corroborates our hypothesis that the character model is indeed better at dealing with the OOV problem." 
}, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-112", "text": "----------------------------------" }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-113", "text": "**IMPACT OF OOV RATE TO MODEL PERFORMANCE**" }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-114", "text": "To better understand to what extent the character models can mitigate OOV problem, we evaluated the performance of the models on different OOV rates." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-115", "text": "We experimented by varying the OOV rate on each dataset and plot the result in Figure 2 ." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-116", "text": "Varying the OOV rate can be achieved by changing the minimum frequency threshold for a word to be included in the vocabulary." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-117", "text": "Words that occur fewer than this threshold in the training set are converted into the special token for OOV words." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-118", "text": "Thus, increasing this threshold means increasing the OOV rate and vice versa." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-119", "text": "From Figure 2 , we see that across all datasets, the models which employ character embedding, either by concatenation or attention, consistently outperform the word embedding-only model at almost every threshold level." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-120", "text": "The performance gap is even more pronounced when the OOV rate is high." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-121", "text": "Going from left to right, as the OOV rate increases, the character models performance does not seem to degrade much." 
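The thresholding procedure described above can be sketched directly: words occurring fewer than `min_freq` times in the training set are excluded from the vocabulary (i.e., mapped to the unknown token), so raising the threshold raises the OOV rate. The two corpora below are toy stand-ins.

```python
from collections import Counter

# Sketch of varying the OOV rate via a minimum-frequency threshold:
# training words below the threshold are treated as unknown, so the
# OOV rate (fraction of test word types outside the vocabulary) rises
# as the threshold rises. Corpora are toy examples.
def oov_rate(train_tokens, test_tokens, min_freq=1):
    counts = Counter(train_tokens)
    vocab = {w for w, c in counts.items() if c >= min_freq}
    test_types = set(test_tokens)
    return sum(w not in vocab for w in test_types) / len(test_types)

train = "aku mau pesan tiket pesawat aku mau".split()
test  = "aku mw pesan tiket kereta".split()
print(oov_rate(train, test, min_freq=1))  # 0.4: 'mw', 'kereta' are OOV
print(oov_rate(train, test, min_freq=2))  # 0.8: only 'aku', 'mau' survive
```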
}, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-122", "text": "Remarkably, this is true even when OOV rate is as high as 90%, even approaching 100%, whereas the word embeddingonly model already has a significant drop in performance when the OOV rate is just around 70%." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-123", "text": "This finding confirms that character embedding is useful to mitigate the OOV problem and robust against different OOV rates." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-124", "text": "We also observe that there seems no perceptible difference between the concatenation and attention model." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-125", "text": "----------------------------------" }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-126", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-127", "text": "We reported an empirical evaluation of neural sequence labeling models by Rei et al. (2016) on NER in Indonesian conversational texts." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-128", "text": "The neural models, even without character embedding, outperform the CRF baseline, which is a typical model for Indonesian NER." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-129", "text": "The models employing character embedding have an improvement up to 4 F 1 points compared to the word embeddingonly counterpart." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-130", "text": "We demonstrated that by using character embedding, we could gain improvement as high as 15 F 1 points on entities having OOV words." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-131", "text": "Further experiments on different OOV rates show that the character models are highly robust against OOV words, as the performance does not seem to degrade even when the OOV rate approaches 100%." 
}, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-132", "text": "While the character model by Rei et al. (2016) has produced good results, it is still quite slow because of the LSTM used for composing character embeddings." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-133", "text": "Recent work on sequence labeling by Reimers and Gurevych (2017) showed that replacing LSTM with CNN for composition has no significant performance drop but is faster because unlike LSTM, CNN computation can be parallelized." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-134", "text": "Using character trigrams as subword units can also be an avenue for future research, as their effectiveness has been shown by Vania and Lopez (2017) ." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-135", "text": "Entities like PHONE and EMAIL have quite clear patterns so it might be better to employ a regex-based classifier to recognize such entities and let the neural network models tag only person and location names." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-38", "text": "1 TASK-ORIENTED contains 72K task-oriented imperative messages such as flight booking, food delivery, and so forth obtained from YesBoss service." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-39", "text": "2 Thus, TASK-ORIENTED usually has longer texts and more precise entities (e.g., locations) compared to SMALL-TALK." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-40", "text": "Table 1 shows some example sentences for each dataset." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-41", "text": "A total of 13 human annotators annotated the two datasets." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-42", "text": "Unfortunately, we cannot publish the datasets because of proprietary reasons." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-43", "text": "SMALL-TALK has 6 entities: DATETIME, EMAIL, GENDER, LOCATION, PERSON, and PHONE." 
}, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-44", "text": "TASK-ORIENTED has 4 entities: EMAIL, LOC, PER, and PHONE." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-45", "text": "The two datasets have different entity inventory because the two chatbot purposes are different." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-46", "text": "In SMALL-TALK, we care about personal information such as date of birth, email, or gender to offer personalized content." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-47", "text": "In TASK-ORIENTED, the tasks usually can be performed by providing minimal personal information." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-48", "text": "Therefore, some of the entities are not necessary." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-49", "text": "Table 2 and 3 report some examples of each entity and the number of entities in both datasets respectively." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-50", "text": "The datasets are tagged using BIO tagging scheme and split into training, development, and testing set." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-51", "text": "The complete dataset statistics, along with the OOV rate for each split, are shown in Table 4 ." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-52", "text": "We define OOV rate as the percentage of word types that do not occur in the training set." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-53", "text": "As seen in the table, the OOV rate is quite high, especially for SMALL-TALK with more than 50% OOV rate." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-54", "text": "As baselines, we used a simple model which memorizes the word-tag assignments on the training data (Nadeau and Sekine, 2007 ) and a featurebased CRF (Lafferty et al., 2001) , as it is a common model for Indonesian NER." 
}, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-55", "text": "We used almost identical features as Taufik et al. (2016) since they experimented on the Twitter dataset which we regarded as the most similar to our conversational texts among other previous work on Indonesian NER." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-56", "text": "Some features that we did not employ were POS tags, lookup list, and non-standard word list as we did not have POS tags in our data nor access to the lists Taufik et al. (2016) used." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-57", "text": "For the CRF model, we used an implementation provided by Okazaki (2007) 3 ." }, { "sent_id": "237ac6f9b635e56119be956d7521e1-C001-58", "text": "Neural architectures for sequence labeling are pretty similar." } ], "y": { "@BACK@": { "gold_contexts": [ [ "237ac6f9b635e56119be956d7521e1-C001-20" ], [ "237ac6f9b635e56119be956d7521e1-C001-59" ], [ "237ac6f9b635e56119be956d7521e1-C001-64" ] ], "cite_sentences": [ "237ac6f9b635e56119be956d7521e1-C001-20", "237ac6f9b635e56119be956d7521e1-C001-59", "237ac6f9b635e56119be956d7521e1-C001-64" ] }, "@USE@": { "gold_contexts": [ [ "237ac6f9b635e56119be956d7521e1-C001-27" ], [ "237ac6f9b635e56119be956d7521e1-C001-61" ], [ "237ac6f9b635e56119be956d7521e1-C001-66" ], [ "237ac6f9b635e56119be956d7521e1-C001-127" ] ], "cite_sentences": [ "237ac6f9b635e56119be956d7521e1-C001-27", "237ac6f9b635e56119be956d7521e1-C001-61", "237ac6f9b635e56119be956d7521e1-C001-66", "237ac6f9b635e56119be956d7521e1-C001-127" ] }, "@SIM@": { "gold_contexts": [ [ "237ac6f9b635e56119be956d7521e1-C001-66" ] ], "cite_sentences": [ "237ac6f9b635e56119be956d7521e1-C001-66" ] }, "@UNSURE@": { "gold_contexts": [ [ "237ac6f9b635e56119be956d7521e1-C001-132" ] ], "cite_sentences": [ "237ac6f9b635e56119be956d7521e1-C001-132" ] } } }, "ABC_cd56849805cdb43bba567f74b31b87_22": { "x": [ { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-1", "text": "**ABSTRACT**" }, { "sent_id": 
"cd56849805cdb43bba567f74b31b87-C001-2", "text": "The availability of corpora to train semantic parsers in English has lead to significant advances in the field." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-3", "text": "Unfortunately, for languages other than English, annotation is scarce and so are developed parsers." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-4", "text": "We then ask: could a parser trained in English be applied to language that it hasn't been trained on?" }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-5", "text": "To answer this question we explore zero-shot cross-lingual semantic parsing where we train an available coarse-to-fine semantic parser (Liu et al., 2018) using cross-lingual word embeddings and universal dependencies in English and test it on Italian, German and Dutch." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-6", "text": "Results on the Parallel Meaning Bank -a multilingual semantic graphbank, show that Universal Dependency features significantly boost performance when used in conjunction with other lexical features but modeling the UD structure directly when encoding the input does not." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-7", "text": "----------------------------------" }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-99", "text": "Table 2 shows the result for the models trained and tested in English." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-9", "text": "Semantic parsing is a task of transducing natural language to meaning representations, which in turn can be expressed through many different semantic formalisms including lambda calculus (Zettlemoyer and Collins, 2012) , DCS (Liang et al., 2013) , Discourse Representation Theory (DRT) (Kamp and Reyle, 2013) , AMR (Banarescu et al., 2013) and so on." 
}, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-10", "text": "This availability of annotated data in English has translated into the development of a plethora of models, including encoder-decoders (Dong and Lapata, 2016; Jia and Liang, 2016) as well as tree or graph-structured decoders Lapata, 2016, 2018; Liu et al., 2018; Yin and Neubig, 2017) ." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-11", "text": "Whereas the majority of semantic banks focus on English, recent effort has focussed on *Work done when Jingfeng Yang was an intern and Federico Fancellu a post-doc at the University of Edinburgh building multilingual representations, e.g. PMB (Abzianidze et al., 2017) , MRS (Copestake et al., 1995) and FrameNet (Pad\u00f3 and Lapata, 2005) ." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-12", "text": "However, manually annotating meaning representations in a new language is a painstaking process which explains why there are only a few datasets available for different formalisms in languages other than English." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-13", "text": "As a consequence, whereas the field has made great advances for English, little work has been done in other languages." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-14", "text": "To answer this question, previous work have leveraged machine translation techniques to map the semantics from a language to another (e.g. Damonte and Cohen, 2018) ." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-15", "text": "However, these methods require parallel corpora to extract automatic alignments which are often noisy or not available at all." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-16", "text": "In this paper we explore parameter-shared models instead, where a model is trained on English using language independent features and tested in a target language." 
}, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-17", "text": "To show how this approach performs, we focus on the Parallel Meaning Bank (PMB Abzianidze et al., 2017 ) -a multilingual semantic bank, where sentences in English, German, Italian and Dutch have been annotated with their meaning representations." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-18", "text": "The annotations in the PMB are based on Discourse Representation Theory (DRT, Kamp and Reyle, 2013) , a popular theory of meaning representation designed to account for intra and inter-sentential phenomena, like temporal expressions and anaphora." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-19", "text": "Figure 1 shows an example DRT for the sentence 'I sat down and opened my laptop' in its canonical 'box' representation." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-20", "text": "A DRS is a nested structure with the top part containing the discourse references and the bottom with unary and binary predicates, as well as semantic constants (e.g. 'speaker')." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-21", "text": "DRS can be linked to each other via logic operator (e.g. \u00ac, \u2192, \u22c4) or, as in this case, discourse relations (e.g. CONTINUATION, RESULT, ELABORA-TION, etc.)." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-22", "text": "To test our approach we leverage the DRT parser of Liu et al. (2018) , an encoder-decoder architecture where the meaning representation is reconstructed in three stages, coarse-to-fine, by first building the DRS skeleton (i.e. the 'box' structures) and then fill each DRS with predicates and variables." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-23", "text": "Whereas the original parser utilizes a sequential Bi-LSTM encoder with monolingual lexical features, we experiment with languageindependent features in the form of cross-lingual word-embeddings, universal PoS tags and universal dependencies." 
}, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-24", "text": "In particular, we also make use of tree encoders to assess whether modelling syntax can be beneficial in cross-lingual settings, as shown for other semantic tasks (e.g. negation scope detection (Fancellu et al., 2018) )." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-25", "text": "Results show that language-independent features are a valid alternative to projection methods for cross-lingual semantic parsing." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-26", "text": "We show that adding dependency relation as features is beneficial, even when they are the only feature used during encoding." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-27", "text": "However, we also show that modeling the dependency structure directly via tree encoders does not outperform a sequential BiLSTM architecture for the three languages we have experimented with." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-28", "text": "----------------------------------" }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-29", "text": "**METHODS**" }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-30", "text": "----------------------------------" }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-31", "text": "**MODEL**" }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-32", "text": "In this section, we describe the modifications to the coarse-to-fine encoder-decoder architecture of Liu et al. (2018) ; for more detail, we refer the reader to the original paper." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-33", "text": "----------------------------------" }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-34", "text": "**ENCODER**" }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-35", "text": "BiLSTM." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-36", "text": "We use Liu et al. (2018) 's Bi-LSTM as baseline." 
}, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-37", "text": "However, whereas the original model represents each token in the input sentence as the concatenation of word (e w i ) and lemma embeddings, we discard the latter and add a POS tag embedding (e p i ) and dependency relation embedding (e d i ) feature." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-38", "text": "These embeddings are concatenated to represent the input token." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-39", "text": "The final encoder representation is obtained by concatenating both final forward and backward hidden states." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-40", "text": "TreeLSTM." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-41", "text": "To model the dependency structure directly, we use a child-sum tree-LSTM (Tai et al., 2015) , where each word in the input sentence corresponds to a node in the dependency tree." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-42", "text": "In particular, summing across children is advantageous for cross-lingual tasks since languages might display different word orders." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-43", "text": "Computation follows Equation (1)." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-44", "text": "Po/treeLSTM. Completely discarding word order might hurt performance for related languages, where a soft notion of positioning can help." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-45", "text": "To this end, we add a positional embeddings P i (Vaswani et al., 2017) that helps the child-sum tree-LSTM discriminating between the left and right child of a parent node." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-46", "text": "This is computed following Equation (2) where i is the position of the word, j is the jth dimension in total d dimensions." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-47", "text": "Bi/treeLSTM." 
}, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-48", "text": "Finally, similarly to Chen et al. (2017) , we combine tree-LSTM and Bi-LSTM, where a tree-LSTM come is initialized using the last layer of a Bi-LSTM, which encodes order information." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-49", "text": "Computation is shown in Equation (3)." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-50", "text": "----------------------------------" }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-51", "text": "**DECODER**" }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-52", "text": "The decoder of Liu et al. (2018) reconstructs the DRS in three steps, by first predicting the overall structure (the 'boxes'), then the predicates and finally the referents, with each subsequent step being conditioned on the output of the previous." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-53", "text": "During predicate prediction, the decoder uses a copying mechanism to predict those unary predicates that are also lemmas in the input sentence (e.g. 'eat')." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-54", "text": "For the those that are not, soft attention is used instead." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-55", "text": "No modifications were done to the decoder; for more detail, we refer the reader to the original paper." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-56", "text": "----------------------------------" }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-57", "text": "**DATA**" }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-58", "text": "We use the PMB v. In order to be used as input to the parser, Liu et al. (2018) first convert the DRS into treebased representations, which are subsequently linearized into PTB-style bracketed sequences." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-59", "text": "This transformation is lossless in that re-entrancies are duplicated to fit in the tree structure." 
}, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-60", "text": "We use the same conversion in this work; for further detail we refer the reader to the original paper." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-61", "text": "Finally, it is worth noting that lexical predicates in PMB are in English, even for non-English languages." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-62", "text": "Since this is not compatible with our copy mechanism, we revert predicates to their original language by substituting them with the lemmas of the tokens they are aligned to (since gold alignment information is included in the PMB)." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-63", "text": "----------------------------------" }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-64", "text": "**CROSS-LINGUAL FEATURES**" }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-65", "text": "In order to make the model directly transferable to the German, Italian and Dutch test data, we use the following language-independent features." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-66", "text": "Multilingual word embeddings." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-67", "text": "We use the MUSE (Conneau et al., 2017) pre-trained multilingual word embeddings and keep them fixed during training." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-68", "text": "UD relations and structure." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-69", "text": "We use UDPipe (Straka and Strakov\u00e1, 2017 ) to obtain parses for English, German, Italian and Dutch." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-70", "text": "UD relation embeddings are randomly initialized and updated." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-71", "text": "Universal POS tags." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-72", "text": "We use the Universal POS tags (Petrov et al., 2011) obtained with UDPipe parser." 
}, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-73", "text": "Universal POS tag embeddings are randomly initialized and updated during training." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-74", "text": "----------------------------------" }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-75", "text": "**MODEL COMPARISON**" }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-76", "text": "We use the BiLSTM model as baseline (Bi) and compare it to the child-sum tree-LSTM (tree) with positional information added (Po/tree), as well as to a treeLSTM initialized with the hidden states of the BiLSTM(Bi/tree)." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-77", "text": "We also conduct an ablation study on the features used, where WE, PE and DE are the word-embedding, PoS embedding and dependency relation embedding respectively." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-78", "text": "For completeness, along with the results for the crosslingual task, we also report results for monolingual English semantic parsing, where word embedding features are randomly initialized." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-79", "text": "----------------------------------" }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-80", "text": "**EVALUATION**" }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-81", "text": "We use Counter (Van Noord et al., 2018) to evaluate the performance of our models." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-82", "text": "Counter looks for the best alignment between the predicted and gold DRS and computes precision, recall and F1." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-83", "text": "For further details about Counter, the reader is referred to Van Noord et al. (2018) ." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-84", "text": "It is worth reminding that unlike other work on the PMB (e.g. van Noord et al., 2018), Liu et al. (2018) does not deal with presupposition." 
}, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-85", "text": "In the PMB, presupposed variables are extracted from a main box and included in a separate one." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-86", "text": "In our work, we revert this process so to ignore presupposed boxes." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-87", "text": "Similarly, we also do not deal with sense tags which we aim to include in future work." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-88", "text": "Table 1 shows the performance of our crosslingual models in German, Italian and Dutch." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-89", "text": "We summarize the results as follows: Dependency features are crucial for zeroshot cross-lingual semantic parsing." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-90", "text": "Adding dependency features dramatically improves the performance in all three languages, when compared to using multilingual word-embedding and universal PoS embeddings alone." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-91", "text": "We hypothesize that the quality of the multilingual word-embeddings is poor, given that models using embeddings for the dependency relations alone outperform those using the other two features." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-92", "text": "----------------------------------" }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-93", "text": "**RESULTS AND ANALYSIS**" }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-94", "text": "TreeLSTMs slightly improve performance only for German." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-95", "text": "TreeLSTMs do not outperform a baseline BiLSTM for Italian and Dutch and they show little improvement in performance for German." 
}, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-96", "text": "This might be due to different factors that deserve more analysis including the performance of the parsers and syntactic similarity between these languages." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-97", "text": "When only dependency features are available, we found treeLSTM to boost performance only for Dutch." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-98", "text": "BiLSTM are still state-of-the-art for monolingual semantic parsing for English." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-100", "text": "Dependency features in conjunction with word and PoS embeddings lead to the best performance; however, in all settings explored treeLSTMs do not outperform a BiLSTM." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-101", "text": "----------------------------------" }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-102", "text": "**ERROR ANALYSIS**" }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-103", "text": "We perform an error analysis to assess the quality of the prediction for operators (i.e. logic operators like \"Not\" as well as discourse relations \"Contrast\"), non-lexical predicates, such as binary predicates (e.g. Agent(e,x)) as well as unary predicates (e.g. time(t), entity(x), etc.), as well as for lexical predicates (e.g. open(e))." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-104", "text": "Results in Table 3 show that predicting operators and binary predicates across language is hard, compared to the other two categories." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-105", "text": "Prediction of lexical predicates is relatively good even though most tokens in the test set where never seen during training; this can be attributable to the copy mechanism that is able to transfer tokens from the input directly during predication." 
}, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-106", "text": "----------------------------------" }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-107", "text": "**RELATED WORK**" }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-108", "text": "Previous work have explored two main methods for cross-lingual semantic parsing." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-109", "text": "One method requires parallel corpora to extract alignments between source and target languages using machine translation (Pad\u00f3 and Lapata, 2005; Damonte and Cohen, 2017; Zhang et al., 2018) The other method is to use parameter-shared models in the target language and the source language by leveraging language-independent features such as multilingual word embeddings, Universal POS tags and UD Duong et al., 2017; Susanto and Lu, 2017; Mulcaire et al., 2018) ." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-110", "text": "For semantic parsing, encoder-decoder mod- els have achieved great success." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-111", "text": "Amongst these, tree or graph-structured decoders have recently shown to be state-of-the-art Lapata, 2016, 2018; Liu et al., 2018; Cheng et al., 2017; Yin and Neubig, 2017) ." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-112", "text": "----------------------------------" }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-113", "text": "**CONCLUSIONS**" }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-114", "text": "We go back to the questions in the introduction: Can we train a semantic parser in a language where annotation is available?." 
}, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-115", "text": "In this paper we show that this is indeed possible and we propose a zero-shot cross-lingual semantic parsing method based on language-independent features, where a parser trained in English -where labelled data is available, is used to parse sentences in three languages, Italian, German and Dutch." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-116", "text": "What would that require?" }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-117", "text": "We show that universal dependency features can dramatically improve the performance of a cross-lingual semantic parser but modelling the tree structure directly does not outperform sequential BiLSTM architectures, not even when the two are combined together." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-118", "text": "We are planning to extend this initial survey to other DRS parsers that does not exclude presupposition and sense as well as to other semantic formalisms (e.g. AMR, MRS) where data sets annotated in languages other than English are available." }, { "sent_id": "cd56849805cdb43bba567f74b31b87-C001-119", "text": "Finally, we want to understand whether adding a bidirectionality to the treeLSTM will help improving the performance on modelling the dependency structure directly." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "cd56849805cdb43bba567f74b31b87-C001-10" ], [ "cd56849805cdb43bba567f74b31b87-C001-52" ], [ "cd56849805cdb43bba567f74b31b87-C001-58" ], [ "cd56849805cdb43bba567f74b31b87-C001-84" ], [ "cd56849805cdb43bba567f74b31b87-C001-111" ] ], "cite_sentences": [ "cd56849805cdb43bba567f74b31b87-C001-10", "cd56849805cdb43bba567f74b31b87-C001-52", "cd56849805cdb43bba567f74b31b87-C001-58", "cd56849805cdb43bba567f74b31b87-C001-84", "cd56849805cdb43bba567f74b31b87-C001-111" ] }, "@EXT@": { "gold_contexts": [ [ "cd56849805cdb43bba567f74b31b87-C001-22", "cd56849805cdb43bba567f74b31b87-C001-23" ], [ "cd56849805cdb43bba567f74b31b87-C001-32" ] ], "cite_sentences": [ "cd56849805cdb43bba567f74b31b87-C001-22", "cd56849805cdb43bba567f74b31b87-C001-32" ] }, "@USE@": { "gold_contexts": [ [ "cd56849805cdb43bba567f74b31b87-C001-36" ] ], "cite_sentences": [ "cd56849805cdb43bba567f74b31b87-C001-36" ] } } }, "ABC_cd10d509dacd8f55993396258eb92a_22": { "x": [ { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-2", "text": "Contextual automatic speech recognition, i.e., biasing recognition towards a given context (e.g. user's playlists, or contacts), is challenging in end-to-end (E2E) models." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-3", "text": "Such models maintain a limited number of candidates during beam-search decoding, and have been found to recognize rare named entities poorly." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-4", "text": "The problem is exacerbated when biasing towards proper nouns in foreign languages, e.g., geographic location names, which are virtually unseen in training and are thus outof-vocabulary (OOV)." 
}, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-5", "text": "While grapheme or wordpiece E2E models might have a difficult time spelling OOV words, phonemes are more acoustically salient and past work has shown that E2E phoneme models can better predict such words." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-6", "text": "In this work, we propose an E2E model containing both English wordpieces and phonemes in the modeling space, and perform contextual biasing of foreign words at the phoneme level by mapping pronunciations of foreign words into similar English phonemes." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-7", "text": "In experimental evaluations, we find that the proposed approach performs 16% better than a grapheme-only biasing model, and 8% better than a wordpiece-only biasing model on a foreign place name recognition task, with only slight degradation on regular English tasks." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-8", "text": "----------------------------------" }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-10", "text": "End-to-end (E2E) models have attracted increasing attention recently." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-11", "text": "Instead of building an automatic speech recognition (ASR) system from different components such as the acoustic model (AM), language model (LM), and pronunciation model (PM), E2E models rely on a single neural network to directly learn speech-to-text mapping." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-12", "text": "Representative systems include a word-based connectionist temporal classification (CTC) model [1] , recurrent neural network transducer (RNN-T) [2, 3] , and attention-based models such as \"Listen, Attend, and Spell\" (LAS) [4] ." 
}, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-13", "text": "Recent advances have shown that E2E models can outperform the state-of-the-art conventional system when trained on thousands of hours of data [5, 6] ." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-14", "text": "In previous work [7] , it has been shown that contextual information (i.e., phrases relevant to recognition in the current context such as contact names, geographic place names, songs, etc.) can improve ASR accuracy." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-15", "text": "Such phrases are often foreign words, or are rarely seen in training." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-16", "text": "Recognizing these phrases is challenging." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-17", "text": "Conventional ASR systems model them as independent contextual LM using an n-gram weighted finite state transducer (WFST), and compose it with a baseline LM for on-the-fly (OTF) rescoring [7, 8] ." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-18", "text": "This idea is extended to a LAS model in [9] , where an n-gram LM and a word-tographeme \"speller\" are composed to produce a contextual FST for rescoring." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-19", "text": "The approach is similar to shallow fusion [10] which interpolates E2E model scores with an external LM in These authors contributed equally to this work." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-20", "text": "We would like to thank David Rybach for helpful comments and suggestions." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-21", "text": "beam search." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-22", "text": "To bring more relevant words for biasing in E2E models, [11] proposes to push biasing weights to each subword unit and deals with over-biasing." 
}, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-23", "text": "Further improvements such as biasing before beam pruning, and wordpiece-based biasing have been proposed to achieve state-of-the-art biasing results [12, 6, 13] ." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-24", "text": "Another class of contextual biasing uses an allneural approach." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-25", "text": "Contextual-LAS (CLAS) is proposed in [11] to use a bias encoder to model contextual phrases as embeddings and shows significant improvement than OTF rescoring [8] ." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-26", "text": "Phonetic information has been incorporated to CLAS to improve rare word recognition [14] ." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-27", "text": "Although biasing is improved by these techniques, they do not address cross-lingual recognition." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-28", "text": "In [15] , contextual biasing has been used to assist recognition of foreign words." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-29", "text": "With the phoneme mapping from a foreign language phoneme set to the recognizer's phoneme set, foreign words are modeled as a phoneme-level contextual FST for biasing." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-30", "text": "It is unclear whether such an approach can be directly applied to E2E models." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-31", "text": "Phoneme-only E2E systems have been shown to have inferior performance compared to grapheme or wordpiece models (WPM) in general [16, 17] , but shows better recognition of rare words and proper nouns." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-32", "text": "In this work we propose to incorporate phonemes to a wordpiece E2E model as modeling units and use phoneme-level FST for contextual biasing." 
}, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-33", "text": "We propose a word-frequency based sampling strategy to randomly tokenize rare words into phonemes in the target sequence using a lexicon." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-34", "text": "This approach also mitigates accuracy regressions that have been observed when using phoneme-only E2E models [16, 17] ." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-35", "text": "We train our model using only American English data and thus its wordpieces and phoneme set (no data from foreign languages)." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-36", "text": "In inference, given a list of foreign words, we bias the recognition using an English phoneme-level biasing FST, which is built by first tokenizing the words into foreign phonemes and then mapping them to English phonemes using [15] ." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-37", "text": "For example, given a navigation query \"directions to Cr\u00e9teil\" and the assumption that the French word \"Cr\u00e9teil\" is in our biasing list, \"Cr\u00e9teil\" is first tokenized to French phonemes as \"k R e t E j\", and then mapped to English phonemes \"k r\\ E t E j\" for biasing 1 ." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-38", "text": "The phoneme mapping is necessary since our modeling units contain only English phonemes." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-39", "text": "In decoding, we propose to incorporate the pronunciation FST of the biasing words to consume English phoneme symbols and produce foreign words, using the aforementioned foreign lexicon and phoneme mapping, i.e. \"k r\\ E t E j\" \u2192 Cr\u00e9teil (details in Section 3.2 and 3.3)." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-40", "text": "Wordpiece outputs are concatenated to form words." 
}, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-41", "text": "In experimental evaluations, we find that the proposed phoneme-based biasing using wordpiece-phoneme model successfully recognizes foreign words." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-42", "text": "It performs 16% relatively better in terms of WER than the grapheme-only biasing model, and 8% better than the wordpiece-only biasing model in a task of recognizing navigation queries containing French place names." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-43", "text": "The proposed model also has the advantage that it can be directly applied to other foreign languages for biasing without model scalability issues." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-44", "text": "----------------------------------" }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-45", "text": "**PRIOR WORK**" }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-46", "text": "----------------------------------" }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-47", "text": "**SHALLOW FUSION E2E BIASING**" }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-48", "text": "Shallow fusion has been used in E2E models for decoding [10] and contextual biasing [6] ." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-49", "text": "Biasing phrases are first represented as n-gram WFST in the word level (G), and then left composed with a \"speller\" FST (S) to produce a contextual LM: C = min(det(S \u2022 G))." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-50", "text": "The speller transduces a sequence of subword units to corresponding words." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-51", "text": "The contextual LM (C) is then used to rescore the log-probability outputs of the E2E model during beam search:" }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-52", "text": "Here, x denotes acoustic observations, and y the subword-unit sequence." 
}, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-53", "text": "P is the probability estimation from the E2E model, and PC is the biasing rescoring probability." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-54", "text": "\u03bb controls the weight of contextual LM in rescoring." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-55", "text": "For E2E biasing, [11] explored biasing at beginning of word, end of word, and at subword units." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-56", "text": "The authors find that unlike biasing at the end of word in conventional models, biasing at subword units with weight pushing prevents candidates from being pruned early from the beam." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-57", "text": "In [13] , wordpieces have been shown to outperform graphemes in biasing since they create a sparser match of biasing units." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-58", "text": "All these improvements lead to significantly better biasing which is comparable to the state-of-the-art conventional model [6] ." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-59", "text": "To avoid over-biasing, [13] also proposed to only activate biasing phrases when they are proceeded by a set of prefixes." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-60", "text": "----------------------------------" }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-61", "text": "**PHONEME MAPPING**" }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-62", "text": "Cross-lingual phoneme mapping has been used in conventional systems for recognizing foreign words [15] ." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-63", "text": "First, a phoneme mapping is learned by aligning the pronunciations between foreign and target languages using TTS-synthesized audio and a pronunciation learning algorithm [18] ." 
}, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-64", "text": "In inference, foreign words are first built to an FST (G)." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-65", "text": "Lexica (L) are constructed between target-language phonemes and foreign words using phoneme mapping, and then left composed with G to construct a dynamic class LM for decoding:" }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-66", "text": "where d denotes a dynamic class label." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-67", "text": "In Section 3.2, we describe how phoneme mapping is incorporated to a wordpiecephoneme E2E model for contextual biasing." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-68", "text": "----------------------------------" }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-69", "text": "**PHONEME-BASED BIASING**" }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-70", "text": "The focus of this work is to bias toward rare cross-lingual words which are typically missing from the training set." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-71", "text": "We propose to do that by utilizing phonemes, which are not affected by orthography." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-72", "text": "Specifically, we augment the wordpiece modeling space of an E2E model with phonemes to train a wordpiece-phoneme model." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-73", "text": "----------------------------------" }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-74", "text": "**WORDPIECE-PHONEME MODEL**" }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-75", "text": "A wordpiece-phoneme model differs from a wordpiece-only model in that it may decompose a few words to phonemes in training." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-76", "text": "The output of the model is a single softmax whose symbol set is the union of wordpiece and phoneme symbols." 
}, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-77", "text": "We use a pronunciation lexicon to obtain phoneme sequences of words." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-78", "text": "Since phonemes show strength in recognizing rare words [16] , we want to present these words as phonemes more often." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-79", "text": "In a target sentence, we decide to randomly present the i th word as phonemes with a probability" }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-80", "text": ", 1.0) where p0 and T are constants and c(i) is an integer representing the number of time the word appears in our entire training corpus." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-81", "text": "Therefore, the words that appear T times or less will be presented as phonemes with probability p0." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-82", "text": "For words that appear more than T times, the more frequent they are, the less likely they are presented as phonemes 2 ." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-83", "text": "Note that the decision of whether to use wordpieces or phonemes is made randomly at each gradient iteration, and thus a given sentence could have different target sequences at different epochs." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-84", "text": "We use context-independent phonemes as in [16] ." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-85", "text": "----------------------------------" }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-86", "text": "**BIASING FST FOR PHONEMES**" }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-87", "text": "In inference, cross-lingual biasing words are converted to English phonemes to rescore the phoneme outputs of the wordpiece-phoneme model." 
}, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-88", "text": "In our work, phoneme mapping is represented by a dictionary which contains human generated source-language to target-language phoneme pairs [15] , and the X-SAMPA phoneme set is used for all languages." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-89", "text": "For example, given a French word \"Cr\u00e9teil\", we tokenize it into phonemes using the French pronunciation lexicon, i.e. \"Cr\u00e9teil\" \u2192 \"k R e t E j\", and then map the French phonemes to English phonemes one by one: \"k R e t E j\" \u2192 \"k r\\ E t E j\"." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-90", "text": "Note that the mapping is needed since our wordpiecephoneme model contains only English phonemes." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-91", "text": "The English phoneme sequence is then used to construct a phoneme-level FST for biasing." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-92", "text": "Weight pushing is used to assign weights at the phoneme level and failure arcs are added to avoid over-biasing similar to [11] ." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-93", "text": "Figure 1 shows an example of a contextual FST for the word \"Cr\u00e9teil\" at the phoneme level." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-94", "text": "The biasing FST is then used to rescore the phoneme outputs of the wordpiecephoneme model on the fly, using Eq. (1)." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-95", "text": "Figure 1 : Contextual FST for the word \"Cr\u00e9teil\" using a sequence of English phonemes \"k r\\ E t E j\"." 
}, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-96", "text": "----------------------------------" }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-97", "text": "**DECODING GRAPH**" }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-98", "text": "To generate words as outputs, we search through a decoding graph similar to [16] but accept both phonemes and wordpieces." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-99", "text": "An example is shown in Figure 2 ." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-100", "text": "The decoding FST has wordpiece loops around state 0 (we show only a few for simplicity), but also has a pronunciation section (states 1 through 14) ." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-101", "text": "The pronunciation section is a prefix tree with phonemes as inputs, and outputs are wordpieces of the corresponding word produced by the WPM in Section 3.1." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-102", "text": "Specifically, for each word in the biasing list, we look up pronunciations from the lexicon and split the word into its constituent wordpieces." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-103", "text": "Input phoneme labels are accepted and transduced into wordpieces." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-104", "text": "Input wordpiece labels are accepted by the wordpiece loops." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-105", "text": "The final output symbols, which are always wordpieces, are concatenated into words." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-106", "text": "Figure 2 : Decoding graph for the words \"cr\u00e8che\" (daycare) with English cross lingual pronunciation \"k r\\ E S\" and \"cr\u00e9teil\" (a city) with pronunciation \"k r\\ E t E j\"." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-107", "text": "For clarity, we omitted most wordpieces for the state 0." 
}, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-108", "text": "Based on [16] , we add two improvements to the decoding strategy." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-109", "text": "First, during decoding we consume as many input epsilon arcs as possible thus guaranteeing that all wordpieces in word are produced when all corresponding phonemes are seen in the input." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-110", "text": "Second, we merge paths that have the same output symbols." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-111", "text": "Given the nature of our training and decoding, a given word can be output either directly in wordpieces, or transduced from phonemes to wordpieces." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-112", "text": "Since the input symbols are different, each hypothesis has a different probability." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-113", "text": "We keep track of equivalent hypotheses and recombine them by adding their probabilities, assigning the total probability to the most likely hypothesis, and dropping the others from the beam." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-114", "text": "----------------------------------" }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-115", "text": "**EXPERIMENTS**" }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-116", "text": "----------------------------------" }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-117", "text": "**DATA SETS**" }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-118", "text": "Our training set contains 35 million English utterances with a total of around 27,500 hours." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-119", "text": "These utterances are sampled from Google's general English traffic, and are anonymized and hand-transcribed for training." 
}, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-120", "text": "To increase training diversity, clean utterances are artificially corrupted by using a room simulator, varying degrees of noise, and reverberation such that the overall signal-to-noise ratio (SNR) is between 0dB and 30dB, and an average SNR is 12dB [19] ." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-121", "text": "The noise sources are from YouTube and daily life noisy environmental recordings." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-122", "text": "Utterances with cross-lingual words are hardly present in our data set, and thus we use a TTS engine, parallel-wavenet [20] , to synthesize utterances for evaluation." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-123", "text": "We choose French as the foreign language, and the utterances consist of navigation queries (e.g. \"directions to Cr\u00e9teil\")." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-124", "text": "There are in total 1K utterances and we refer to this set as the Directions test set." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-125", "text": "For each utterance, the bias set contains 1K words including the groundtruth place name and unrelated French place names." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-126", "text": "Since all biasing words are in a foreign language, they have never been seen in training." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-127", "text": "In decoding, all biasing words are used to construct a contextual FST with each arc having the same weight." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-128", "text": "In later evaluation, this weight is tuned independently for different models." 
}, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-129", "text": "On the other hand, to evaluate how the wordpiece-phoneme model performs on the regular English recognition task, we sampled a total of 30.5K English utterances from general Google traffic as the no-biasing test set." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-130", "text": "----------------------------------" }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-131", "text": "**MODEL TRAINING**" }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-132", "text": "Similarly to [6] , an input utterance is divided to 25-ms frames, windowed and shifted at a rate of 10 ms." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-133", "text": "A 80-dimensional logMel feature is extracted at each frame, and the current frame and two frames to the left are concatenated to produce a 240-dimensional log-Mel feature." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-134", "text": "These features are then downsampled at a rate of 30 ms." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-135", "text": "We use RNN-T as the sequenceto-sequence model." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-136", "text": "Similar to [6] , the encoder of the RNN-T consists of 8 Long Short-Term Memory (LSTM) [21] layers and the prediction network contains 2 layers." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-137", "text": "Each LSTM layer contains 2,048 hidden units followed by a 640-dimensional projection layer." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-138", "text": "A time-reduction layer is added after the second layer to improve the inference speed without accuracy loss." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-139", "text": "Outputs of the encoder and prediction network are fed to a joint-network which has 640 hidden units, which is then followed by a softmax layer containing 4,096 output units." 
}, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-140", "text": "Specifically, the output units contain 41 context-independent phonemes and the rest are wordpieces." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-141", "text": "As described in Section 3.1, we use a lexicon containing about 430,000 words with their frequencies to determine when to use phoneme sequences." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-142", "text": "The lexicon contains words and their frequencies from training data, and is trimmed by removing homophones (e.g. \"flower\" and \"flour\"), homographs (e.g. \"live\" as a verb or adjective), and pronunciation variants (e.g. \"either\")." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-143", "text": "It thus only contains entries that are unambiguous when going from spelling to pronunciation or the other way around." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-144", "text": "We do not generate phonemes for out-of-lexicon words using a trained grapheme-to-phoneme model." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-145", "text": "The intuition is that when pronunciation is not known, it is simpler and cleaner to let the E2E model infer the pronunciation rather than bring in another independently trained model." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-146", "text": "In addition, we use the written form of a transcript and do not use any verbalizer." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-147", "text": "Thus, words like \"$9.95\" were never presented as phonemes." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-148", "text": "In model training, a symbol is inserted between words to identify spacing." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-149", "text": "The model contains around 120M parameters in total." 
}, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-150", "text": "All RNN-T models are trained in Tensorflow [22] on 8 \u00d7 8 Tensor Processing Units (TPU) slices with a global batch size of 4,096." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-151", "text": "----------------------------------" }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-152", "text": "**WERS AND COMPARISONS**" }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-153", "text": "We compare the biasing results of the wordpiece-phoneme model to a grapheme-only model and a wordpiece-only model." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-154", "text": "The latter two models have the same structure as the wordpiecephoneme model." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-155", "text": "The difference is that the grapheme model has 76 graphemes as outputs and the wordpiece model has 4,096 wordpieces." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-156", "text": "This leads to around 117M and 120M parameters for the grapheme model and wordpiece model, respectively." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-157", "text": "Note that the two model's output symbols are in English and they are trained using all-English data described in Section 4.1." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-158", "text": "For these two models, biasing is done at the grapheme level or wordpiece level alone using the English transliterated versions of French biasing words." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-159", "text": "WERs of the Directions set are shown in Table 1 ." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-160", "text": "First, we see that all three models perform poorly without biasing." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-161", "text": "This is because the place names are in French and they have never been seen in training, i.e. an word OOV rate of nearly 100% 3 ." 
}, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-162", "text": "Secondly, we see in Table 1 that all models performs substantially better with biasing." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-163", "text": "The WER reductions range from 9%-23% relatively for different models when compared to the no-bias case." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-164", "text": "Comparing different biasing strategies, we find that the wordpiece-phoneme model performs the best: 16% relatively better than the grapheme model, and 8.3% better than the wordpiece model." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-165", "text": "We attribute the superior per- formance of the wordpiece-phoneme model to the robustness of phonemes to OOV words, as observed in [16] ." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-166", "text": "Since the wordpiece-phoneme model contains both wordpieces and phonemes as modeling units, we can further perform wordpiece biasing in addition to phoneme-based biasing by building a wordpiece FST in parallel to the phoneme FST." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-167", "text": "This further reduces the WER by 2%, as shown in the bottom row in Table 1 ." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-168", "text": "This shows that wordpiece and phoneme biasing are complementary to each other." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-169", "text": "We note that the same weights are used for both phoneme and wordpiece biasing, and empirically we did not find significant improvements by using different weights." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-170", "text": "On the other hand, for wordpiece model biasing, our results are consistent with the observation in [13] that the wordpieces perform better than graphemes because of its sparsity in matching longer units." 
}, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-171", "text": "To further understand how biasing helps recognizing French place names, we present some wins of wordpiecephoneme model in Table 2 ." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-172", "text": "We can see that biasing helps produce the correct French words, and in contrast, phonetically similar but wrong English words are produced when without biasing." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-173", "text": "On the other hand, we present some typical recognition errors in Table 3 ." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-174", "text": "We see that errors are mainly due to phonetically similar words in French." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-175", "text": "We will analyze how the biasing performance changes as the number of biasing words changes in Section 4.4." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-176", "text": "To ensure there is no regression in no-biasing scenarios, we compare three models in decoding regular English utterances from general Google traffic." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-177", "text": "In decoding, we turn the biasing mechanism off by using an empty list of biasing phrases." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-178", "text": "As shown in the last column of Table 1 , the wordpiece model performs better than the grapheme model as in [6] ." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-179", "text": "The wordpiecephoneme model performs a little better than the grapheme model, and we attribute that to the higher frequency of wordpieces during training." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-180", "text": "Compared to the wordpiece model, the wordpiece-phoneme model has a slight degradation (0.1% absolute WER)." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-181", "text": "This is due to the introduction of phonemes in modeling." 
}, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-182", "text": "One potential approach to improve regression is to incorporate an English external language model for phonemes in rescoring, similarly to the wordpiece-based rescoring in [10] ." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-183", "text": "However, we note that the regression is significantly smaller than the all-phoneme model in [16] ." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-184", "text": "----------------------------------" }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-185", "text": "**EFFECT OF NUMBER OF BIASING WORDS**" }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-186", "text": "Given examples in Table 3 , we are curious how competing biasing words affect recognition." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-187", "text": "We thus randomly choose a fixed number of biasing words (including the ground truth one) and vary the number to see how WER changes." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-188", "text": "Figure 3 shows that WER for the Directions set is 9.1% when only the ground truth word is present (i.e. number of biasing words is 1), and the rate increases quickly when the total number of biasing words increases." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-189", "text": "We attribute the quick degradation to the significant matching confusion in the phoneme prefixes of the words as the number of biasing words increases (as confirmed by phonetically similar French place names in Table 3 )." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-190", "text": "One interesting direction would be to increase the length of the phonemic units to create a sparser match." 
}, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-191", "text": "----------------------------------" }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-192", "text": "**CONCLUSION**" }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-193", "text": "In this work we proposed a wordpiece-phoneme RNN-T model and phoneme-level contextual biasing to recognize foreign words." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-194", "text": "Biasing at the phoneme level enables us to avoid the OOV problem in the wordpiece model." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-195", "text": "Evaluating on a test set containing navigation queries to French place names, we show the proposed approach performs significantly better than a state-of-the-art grapheme and wordpiece model, by 16% and 8%, respectively in terms of relative WER reductions." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-196", "text": "Wordpiece biasing is complimentary to phoneme biasing and adds a further 2% reduction." }, { "sent_id": "cd10d509dacd8f55993396258eb92a-C001-197", "text": "Lastly, since wordpieces perform better than graphemes [6] in E2E modeling, it would be interesting to explore longer phonemic units such as phoneme pieces for biasing." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "cd10d509dacd8f55993396258eb92a-C001-31" ] ], "cite_sentences": [ "cd10d509dacd8f55993396258eb92a-C001-31" ] }, "@MOT@": { "gold_contexts": [ [ "cd10d509dacd8f55993396258eb92a-C001-33", "cd10d509dacd8f55993396258eb92a-C001-34" ], [ "cd10d509dacd8f55993396258eb92a-C001-78" ] ], "cite_sentences": [ "cd10d509dacd8f55993396258eb92a-C001-34", "cd10d509dacd8f55993396258eb92a-C001-78" ] }, "@USE@": { "gold_contexts": [ [ "cd10d509dacd8f55993396258eb92a-C001-84" ] ], "cite_sentences": [ "cd10d509dacd8f55993396258eb92a-C001-84" ] }, "@SIM@": { "gold_contexts": [ [ "cd10d509dacd8f55993396258eb92a-C001-84" ], [ "cd10d509dacd8f55993396258eb92a-C001-98" ], [ "cd10d509dacd8f55993396258eb92a-C001-165" ] ], "cite_sentences": [ "cd10d509dacd8f55993396258eb92a-C001-84", "cd10d509dacd8f55993396258eb92a-C001-98", "cd10d509dacd8f55993396258eb92a-C001-165" ] }, "@DIF@": { "gold_contexts": [ [ "cd10d509dacd8f55993396258eb92a-C001-98" ], [ "cd10d509dacd8f55993396258eb92a-C001-183" ] ], "cite_sentences": [ "cd10d509dacd8f55993396258eb92a-C001-98", "cd10d509dacd8f55993396258eb92a-C001-183" ] }, "@EXT@": { "gold_contexts": [ [ "cd10d509dacd8f55993396258eb92a-C001-108" ] ], "cite_sentences": [ "cd10d509dacd8f55993396258eb92a-C001-108" ] } } }, "ABC_a0c0076fa8c3be914d93ec1d66d0c1_22": { "x": [ { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-32", "text": "**DERIVATION DATASETS**" }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-55", "text": "**ADDITIVE METHOD (AVGADD)**" }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-2", "text": "Recent models in distributional semantics consider derivational patterns (e.g., use \u2192 use + f ul ) as the result of a compositional process, where base term and affix are combined." 
}, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-3", "text": "We exploit such models for German particle verbs (PVs), and focus on the task of learning a mapping function between base verbs and particle verbs." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-4", "text": "Our models apply particle-verb motivated training-space restrictions relying on nearest neighbors, as well as recent advances from zeroshot-learning." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-5", "text": "The models improve the mapping between base terms and derived terms for a new PV derivation dataset, and also across existing derivation datasets for German and English." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-6", "text": "----------------------------------" }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-8", "text": "were the first to apply distributional semantic models (DSMs) to the task of deriving the meaning of morphologically complex words from their parts." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-9", "text": "They relied on high-dimensional vector representations to model the derived term (e.g., useful) as a result of a compositional process that combines the meanings of the base term (e.g., to use) and the affix (e.g., ful)." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-10", "text": "For evaluation, they compared the predicted vector of the complex word with the original, corpus-based vector." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-11", "text": "More recently, Kisselew et al. (2015) put the task of modeling derivation into the perspective of zero-shot-learning: instead of using cosine similarities they predicted the derived term by learning a mapping function between the base term and the derived term." 
}, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-12", "text": "Once the predicted vector was computed, a nearest neighbor search was applied to validate if the prediction corresponded to the derived term." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-13", "text": "In zero-shotlearning the task is to predict novel values, i.e., values that were never seen in training." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-14", "text": "More formally, zero-shot-learning trains a classifier f : X \u2192 Y that predicts novel values for Y (Palatucci et al., 2009) ." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-15", "text": "It is often applied across vector spaces, such as different domains (Mikolov et al., 2013; ." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-16", "text": "The experiments by Kisselew et al. (2015) were performed over six derivational patterns for German (cf." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-17", "text": "Table 1), including particle verbs (PVs) with two different particle prefixes (an and durch), which were particularly difficult to predict." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-18", "text": "PVs such as anfangen (to start) are compositions of a base verb (BV) such as fangen (to catch) and a verb particle such as an." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-19", "text": "Predicting PV meaning is challenging because German PVs are highly productive (Springorum et al., 2013b; Springorum et al., 2013a) , and the particles are notoriously ambiguous (Lechler and Ro\u00dfdeutscher, 2009; Haselbach, 2011; Kliche, 2011; Springorum, 2011) ." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-20", "text": "Furthermore, the particles often trigger meaning shifts when they combine with base verbs (Springorum et al., 2013b) , so the resulting PVs represent frequent cases of non-literal meaning." 
}, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-21", "text": "In this paper, we focus on predicting the meanings of German PV derivations." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-22", "text": "Our models provide two contributions to the research field of predicting derivations: (i) We suggest a novel idea of restricting the available training data, which has a positive impact on the mapping quality." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-23", "text": "(ii) We integrate a correction method for popular nearest neighbors into our models, so-called hubs (Radovanovi\u0107 et al., 2010) , to improve the prediction quality." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-24", "text": "Table 2 : New German PV derivation dataset." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-25", "text": "----------------------------------" }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-26", "text": "**PREDICTION EXPERIMENTS**" }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-27", "text": "As in Kisselew et al. (2015) , we treat every derivation type as a specific learning problem: we take a set of word pairs with a particular derivation pattern (e.g., \"-in\", B\u00e4cker::B\u00e4ckerin), and divide this set into training and test pairs by performing 10-fold cross-validation." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-28", "text": "For the test pairs, we predict the vectors of the derived terms (e.g.," }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-29", "text": "The search space includes all corpus words across parts-of-speech, except for the base term." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-30", "text": "The performance is measured in terms of recall-out-of-5 (McCarthy and Navigli, 2009), counting how often the correct derived term is found among the five nearest neighbors of the predicted vector." 
}, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-31", "text": "----------------------------------" }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-33", "text": "We created a new collection of German particle verb derivations 1 relying on the same resource as Kisselew et al. (2015) , the semiautomatic derivational lexicon for German DErivBase (Zeller et al., 2013) ." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-34", "text": "From DErivBase, we induced all pairs of base verbs and particle verbs across seven different particles." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-35", "text": "Nonexisting verbs were manually filtered out." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-36", "text": "In total, our collection contains 1 410 BV-PV combinations across seven particles, cf." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-37", "text": "Table 2 ." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-38", "text": "In addition, we apply our models to two existing collections for derivational patterns, the German dataset from Kisselew et al. (2015) , comprising six derivational patterns with 80 in-stances each (cf." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-39", "text": "Table 3 : English dataset (Lazaridou et al., 2013) ." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-40", "text": "----------------------------------" }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-41", "text": "**WORD EMBEDDING VECTORS**" }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-42", "text": "We relied on the German and English COW web corpora 2 (Sch\u00e4fer and Bildhauer, 2012 ) to obtain vector representations." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-43", "text": "The corpora contain 20 billion words and 9 billion words, respectively." 
}, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-44", "text": "We parsed the corpora using state-of-the-art pipelines integrating the MarMoT tagger and the MATE parser (M\u00fcller et al., 2013; Bohnet, 2010) , and induced window co-occurrences for all corpus lemma-POS pairs and co-occurring nouns, verbs and adjectives in a 5-lemma window." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-45", "text": "We then created 400-dimensional word representations using the hyperwords toolkit (Levy et al., 2015) , with context distribution smoothing of 0.75 and positive point-wise mutual information weighting together with singular value decomposition." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-46", "text": "The resulting vector space models contain approximately 460 000 lemmas for German and 240 000 lemmas for English." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-47", "text": "----------------------------------" }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-48", "text": "**PREDICTION METHODS**" }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-49", "text": "----------------------------------" }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-50", "text": "**BASELINE**" }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-51", "text": "A baseline method that simply guesses the derived term has a chance of approx." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-52", "text": "1 460 000 for German and 1 240 000 for English to predict the correct term." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-53", "text": "We thus apply a more informed baseline, the same as in Kisselew et al. (2015) , and predict the derived term at exactly the same position as the base term." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-54", "text": "----------------------------------" }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-56", "text": "AvgAdd is a re-implementation of the best method in Kisselew et al. 
(2015) :" }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-57", "text": "3 For each affix, the method learns a difference vector by computing the dimension-wise differences between the vector representations of base term A and derived term B ." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-58", "text": "The method thus learns a centroid c for all relevant training pairs (N ) with the same affix:" }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-59", "text": "For each PV test instance with this affix, the learned centroid vector is added dimensionwise to the vector representation of the base term to predict a position for the derived term." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-60", "text": "----------------------------------" }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-61", "text": "**RESTRICTING THE TRAINING SPACE (BESTADD)**" }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-62", "text": "Avg-Add learns a vector representation based on the full available training data for each derivational pattern." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-63", "text": "In this paper, we suggest a method BestAdd k that restricts the training items of a given base term to those BV-PV training instances that include the k nearest base verbs (using k = 1, 3, 5) according to their cosine." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-64", "text": "The motivation for our adjusted method relies on the observation that particles are very ambiguous and thus differ in their meanings across particle verbs." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-65", "text": "For example, the meanings of 'an' include a directed contact as in sprechen::ansprechen (to speak/to speak to s.o.) and in schreiben::anschreiben (to write/to write to s.o.), and also a start of an action as in spielen::anspielen (to play/to start playing) and in stimmen::anstimmen (to pitch/to start singing)." 
}, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-66", "text": "We assume that base verbs that are distributionally similar also behave in a similar way when combined with a specific particle, and that a more restricted training set that is however specified for BV semantics outperforms a larger training set across wider BV meanings." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-67", "text": "----------------------------------" }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-68", "text": "**3COSMUL**" }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-69", "text": "We also re-implemented 3CosMul (Levy and Goldberg, 2014) , a method that has been proven successful in solving analogy tasks, such as man" }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-70", "text": "----------------------------------" }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-71", "text": "**IS TO KI NG (B) AS WOMAN (C) IS TO QUEEN (D).**" }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-72", "text": "3CosMul does not explicitly predict a position in space but selects a target D in space that is close to B and C but not close to A. We applied 3Cos-Mul by always using the most similar training instance (as for BestAdd with k = 1)." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-73", "text": "----------------------------------" }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-74", "text": "**LOCAL SCALING**" }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-75", "text": "All methods introduced in the previous section perform a nearest neighbor search at the predicted position." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-76", "text": "We suggest to improve the prediction quality at this stage by mitigating the hubness problem ." 
}, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-77", "text": "Hubs are objects in vector space that are likely to appear disproportionately often among nearest neighbors, without necessarily being semantically related." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-78", "text": "Hubness has been shown an intrinsic problem of high-dimensional spaces (Tomasev, 2014) ." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-79", "text": "In order to reduce hubness, three unsupervised methods to re-scale the high-dimensional distances have been proposed (Schnitzer et al., 2014) : local scaling, global scaling, and shared nearest neighbors." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-80", "text": "We focus on a local scaling (LS) type of hubness-correcting distance measure, namely the non-iterative contextual measure N I (J\u00e9gou et al., 2007) :" }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-81", "text": "N I relies on the average distance \u00b5 of x and y to their k nearest neighbors." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-82", "text": "It increases the similarity between x and y in cases where we observe low average similarities between x, y and its k nearest neighbors." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-83", "text": "Intuitively, if a word x is not even close to its nearest neighbors but comparably close to y then we increase the similarity between x and y. For 3CosMul, we adapt local scaling by scaling over the neighborhood information for all four parts (A, B, C and D) in the analogy: BestAdd (with k = {3, 5}) are significantly 4 above AvgAdd (p < 0.01), the previously best method for the existing German and English datasets." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-84", "text": "BestAdd with k = 1 and 3CosMul perform at a similar level than AvgAdd, but for our new PV derivation dataset do not even outperform the baseline." 
}, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-85", "text": "Restricting the training process to a small selection of nearest neighbors therefore has a positive impact on the prediction quality." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-86", "text": "Furthermore, local scaling relying on k = 15 nearest neighbors (N I 15 ) improves the prediction results in all but one cases." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-87", "text": "These improvements are however not significant." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-88", "text": "----------------------------------" }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-89", "text": "**RESULTS**" }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-90", "text": "----------------------------------" }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-91", "text": "**BESTADD AND LOCAL SCALING**" }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-92", "text": "The results in Table 4 also demonstrate that predicting particle verbs is the most challenging derivation task, as the results are significantly lower than for the other two datasets." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-93", "text": "Figure 1 once more illustrates the recall-out-of-5 results for our new PV dataset." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-94", "text": "In the following, we zoom into dataset derivation types." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-95", "text": "----------------------------------" }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-96", "text": "**IMPROVEMENT ACROSS DERIVATION TYPES AND LANGUAGES**" }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-97", "text": "Figures 2 to 4 break down the results from Table 4 across the German and English derivation types." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-98", "text": "4 Significance relies on \u03c7 2 ." 
}, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-99", "text": "The blue bars show the BestAdd 3 results, and the green stacked bars represent the additional gain using local scaling (NI 15 )." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-100", "text": "The yellow points correspond to baseline performance, and the dotted black lines to the AvgAdd results." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-101", "text": "We can see that BestAdd 3 not only outperforms the previously best method AvgAdd on average but also for each derivation type." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-102", "text": "Also, local scaling provides an additional positive impact for all but one particle type in German, ab-, and for all but three derivation types in English, -able, -al, -less." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-103", "text": "At the same time, we can see that the impact of local scaling is different across derivation types." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-104", "text": "For example, looking into the data we observe that mit PVs are often wrongly mapped to nouns, and BestAdd and local scaling correct this behavior: The nearest neighbors of the verb erledigen (to manage sth.) with BestAdd 3 are Botengang (errand), Haushaltsarbeit (domestic work), Hausmeisterarbeit (janitor work), and further six compounds with the nominal head Arbeit (work)." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-105", "text": "Additional local scaling predicts the correct PV miterledigen (to manage sth." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-106", "text": "in addition) as second nearest neighbor." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-107", "text": "Kisselew et al. (2015) ." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-108", "text": "Figure 4: Performance gain for derivation types in Lazaridou et al. (2013) ." 
}, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-109", "text": "----------------------------------" }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-110", "text": "**RECALL-OUT-OF-X ACROSS PARTICLE TYPES**" }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-111", "text": "Figure 5 focuses on the particle types, but varies the strength of the evaluation measure." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-112", "text": "Relying on BestAdd 3 with local scaling NI 15 , we apply recall-out-of-x with x \u2208 [1, 10]." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-113", "text": "With one exception (zu), all particle types achieve a performance of 15-23% for recall-out-of-5, so zu had a negative impact on the average score in Table 4 . Looking at recall-out-of-10, the performances go up to 20-30%." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-114", "text": "While PVs with the rather nonambiguous mit are again modeled best, also PVs with strongly ambiguous particles (such as an and auf ) are modeled well." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-115", "text": "----------------------------------" }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-116", "text": "**CONCLUSION**" }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-117", "text": "We suggested two ways to improve the prediction of derived terms for English and German." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-118", "text": "Both (i) particle-verb motivated trainingspace restrictions and (ii) local scaling to address hubness in high-dimensional spaces had a positive impact on the prediction quality of derived terms across datasets." }, { "sent_id": "a0c0076fa8c3be914d93ec1d66d0c1-C001-119", "text": "Particle-specific explorations demonstrated the difficulty of this derivation, and differences across particle types." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "a0c0076fa8c3be914d93ec1d66d0c1-C001-11" ], [ "a0c0076fa8c3be914d93ec1d66d0c1-C001-16" ], [ "a0c0076fa8c3be914d93ec1d66d0c1-C001-56" ] ], "cite_sentences": [ "a0c0076fa8c3be914d93ec1d66d0c1-C001-11", "a0c0076fa8c3be914d93ec1d66d0c1-C001-16", "a0c0076fa8c3be914d93ec1d66d0c1-C001-56" ] }, "@SIM@": { "gold_contexts": [ [ "a0c0076fa8c3be914d93ec1d66d0c1-C001-27" ], [ "a0c0076fa8c3be914d93ec1d66d0c1-C001-33" ], [ "a0c0076fa8c3be914d93ec1d66d0c1-C001-53" ] ], "cite_sentences": [ "a0c0076fa8c3be914d93ec1d66d0c1-C001-27", "a0c0076fa8c3be914d93ec1d66d0c1-C001-33", "a0c0076fa8c3be914d93ec1d66d0c1-C001-53" ] }, "@USE@": { "gold_contexts": [ [ "a0c0076fa8c3be914d93ec1d66d0c1-C001-27" ], [ "a0c0076fa8c3be914d93ec1d66d0c1-C001-38" ], [ "a0c0076fa8c3be914d93ec1d66d0c1-C001-53" ] ], "cite_sentences": [ "a0c0076fa8c3be914d93ec1d66d0c1-C001-27", "a0c0076fa8c3be914d93ec1d66d0c1-C001-38", "a0c0076fa8c3be914d93ec1d66d0c1-C001-53" ] } } }, "ABC_c022b7cf4568e26c7408a835eaafb7_22": { "x": [ { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-2", "text": "Model combination techniques have consistently shown state-of-the-art performance across multiple tasks, including syntactic parsing." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-3", "text": "However, they dramatically increase runtime and can be difficult to employ in practice." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-4", "text": "We demonstrate that applying constituency model combination techniques to n-best lists instead of n different parsers results in significant parsing accuracy improvements." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-5", "text": "Parses are weighted by their probabilities and combined using an adapted version of Sagae and Lavie (2006) ." 
}, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-6", "text": "These accuracy gains come with marginal computational costs and are obtained on top of existing parsing techniques such as discriminative reranking and self-training, resulting in state-of-the-art accuracy: 92.6% on WSJ section 23." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-7", "text": "On out-of-domain corpora, accuracy is improved by 0.4% on average." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-8", "text": "We empirically confirm that six well-known n-best parsers benefit from the proposed methods across six domains." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-9", "text": "----------------------------------" }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-11", "text": "Researchers have proposed many algorithms to combine parses from multiple parsers into one final parse (Henderson and Brill, 1999; Zeman an\u010f Zabokrtsk\u1ef3, 2005; Sagae and Lavie, 2006; Nowson and Dale, 2007; Fossum and Knight, 2009; Petrov, 2010; Johnson and Ural, 2010; Huang et al., 2010; McDonald and Nivre, 2011; Shindo et al., 2012; Narayan and Cohen, 2015) ." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-12", "text": "These new parses are substantially better than the originals: Zhang et al. (2009) combine outputs from multiple n-best parsers and achieve an F 1 of 92.6% on the WSJ test set, a 0.5% improvement over their best n-best parser." 
}, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-13", "text": "Model combination approaches tend to fall into the following categories:" }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-14", "text": "hybridization, where multiple parses are combined into a single parse; switching, which picks a single parse according to some criteria (usually a form of voting); grammar merging where grammars are combined before or during parsing; and stacking, where one parser sends its prediction to another at runtime." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-15", "text": "All of these have at least one of the caveats that (1) overall computation is increased and runtime is determined by the slowest parser and (2) using multiple parsers increases the system complexity, making it more difficult to deploy in practice." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-16", "text": "In this paper, we describe a simple hybridization extension (\"fusion\") which obtains much of hybridization's benefits while using only a single n-best parser and minimal extra computation." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-17", "text": "Our method treats each parse in a single parser's n-best list as a parse from n separate parsers." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-18", "text": "We then adapt parse combination methods by Henderson and Brill (1999) , Sagae and Lavie (2006) , and Fossum and Knight (2009) to fuse the constituents from the n parses into a single tree." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-19", "text": "We empirically show that six n-best parsers benefit from parse fusion across six domains, obtaining stateof-the-art results." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-20", "text": "These improvements are complementary to other techniques such as reranking and self-training." 
}, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-21", "text": "Our best system obtains an F 1 of 92.6% on WSJ section 23, a score previously obtained only by combining the outputs from multiple parsers." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-22", "text": "A reference implementation is available as part of BLLIP Parser at http: //github.com/BLLIP/bllip-parser/ 2 Fusion Henderson and Brill (1999) propose a method to combine trees from m parsers in three steps: populate a chart with constituents along with the number of times they appear in the trees; remove any constituent with count less than m/2 from the chart; and finally create a final tree with all the remaining constituents." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-23", "text": "Intuitively their method constructs a tree with constituents from the majority of the trees, which boosts precision significantly." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-24", "text": "Henderson and Brill (1999) show that this process is guaranteed to produce a valid tree." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-25", "text": "Sagae and Lavie (2006) generalize this work by reparsing the chart populated with constituents whose counts are above a certain threshold." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-26", "text": "By adjusting the threshold on development data, their generalized method balances precision and recall." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-27", "text": "Fossum and Knight (2009) further extend this line of work by using n-best lists from multiple parsers and combining productions in addition to constituents." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-28", "text": "Their model assigns sums of joint probabilities of constituents and parsers to constituents." 
}, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-29", "text": "Surprisingly, exploiting n-best trees does not lead to large improvement over combining 1-best trees in their experiments." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-30", "text": "Our extension takes the n-best trees from a parser as if they are 1-best parses from n parsers, then follows Sagae and Lavie (2006) ." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-31", "text": "Parses are weighted by the estimated probabilities from the parser." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-32", "text": "Given n trees and their weights, the model computes a constituent's weight by summing weights of all trees containing that constituent." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-33", "text": "Concretely, the weight of a constituent spanning from ith word to jth word with label is" }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-34", "text": "where W (k) is the weight of kth tree and C k (i \u2192 j) is one if a constituent with label spanning from i to j is in kth tree, zero otherwise." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-35", "text": "After populating the chart with constituents and their weights, it throws out constituents with weights below a set threshold t. Using the threshold t = 0.5 emulates the method of Henderson and Brill (1999) in that it constructs the tree with the constituents in the majority of the trees." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-36", "text": "The CYK parsing algorithm is applied to the chart to produce the final tree." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-37", "text": "Note that populating the chart is linear in the number of words and the chart contains substantially fewer constituents than charts in well-known parsers, making this a fast procedure." 
}, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-38", "text": "----------------------------------" }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-39", "text": "**SCORE DISTRIBUTION OVER TREES**" }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-40", "text": "We assume that n-best parsers provide trees along with some kind of scores (often probabilities or log probabilities)." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-41", "text": "Given these scores, a natural way to obtain weights is to normalize the probabilities." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-42", "text": "However, parsers do not always provide accurate estimates of parse quality." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-43", "text": "We may obtain better performance from parse fusion by altering this distribution and passing scores through a nonlinear function, f (\u00b7)." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-44", "text": "The kth parse is weighted:" }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-45", "text": "where SCORE(i) is the score of ith tree." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-46", "text": "1 We explore the family of functions f (x) = x \u03b2 which can smooth or sharpen the score distributions." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-47", "text": "This includes a tunable parameter, \u03b2 \u2208 R + 0 :" }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-48", "text": "Employing \u03b2 < 1 flattens the score distribution over n-best trees and helps over-confident parsers." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-49", "text": "On the other hand, having \u03b2 > 1 skews the distribution toward parses with higher scores and helps under-confident parsers." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-50", "text": "Note that setting \u03b2 = 0 weights all parses equally and results in majority voting at the constituent level." 
}, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-51", "text": "We leave developing other nonlinear functions for fusion as future work." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-52", "text": "----------------------------------" }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-53", "text": "**EXPERIMENTS**" }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-54", "text": "Corpora: Parse fusion is evaluated on British National Corpus (BNC), Brown, GENIA, Question Bank (QB), Switchboard (SB) and Wall Street Journal (WSJ) (Foster and van Genabith, 2008; Francis and Ku\u010dera, 1989; Kim et al., 2003; Judge et al., 2006; Godfrey et al., 1992; Marcus et al., 1993) ." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-55", "text": "WSJ is used to evaluate indomain parsing, the remaining five are used for out-of-domain." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-56", "text": "For divisions, we use tune and test splits from Bacchiani et al. (2006) Supervised parsers are trained on the WSJ training set (sections 2-21) and use section 22 or 24 for development." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-57", "text": "Self-trained BLLIP is selftrained using two million sentences from Gigaword and Stanford RNN uses word embeddings trained from larger corpora." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-58", "text": "Parameter tuning: There are three parameters for our fusion process: the size of the n-best list (2 < n \u2264 50), the smoothing exponent from Section 2.1 (\u03b2 \u2208 [0.5, 1.5] with 0.1 increments), and the minimum threshold for constituents (t \u2208 [0.2, 0.7] with 0.01 increments)." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-59", "text": "We use grid search to tune these parameters for two separate scenarios." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-60", "text": "When parsing WSJ (in-domain), we tune parameters on WSJ section 24." 
}, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-61", "text": "For the remaining corpora (out-ofdomain), we use the tuning section from Brown." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-62", "text": "Each parser is tuned separately, resulting in 12 different tuning scenarios." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-63", "text": "In practice, though, in-domain and out-of-domain tuning regimes tend to pick similar settings within a parser." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-64", "text": "Across parsers, settings are also fairly similar (n is usually 30 or 40, t is usually between 0.45 and 0.5)." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-65", "text": "While the smoothing exponent varies from 0.5 to 1.3, setting \u03b2 = 1 does not significantly hurt accuracy for most parsers." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-66", "text": "To study the effects of these parameters, Figure 1 shows three slices of the tuning surface for BLLIP parser on WSJ section 24 around the optimal settings (n = 30, \u03b2 = 1.1, t = 0.47)." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-67", "text": "In each graph, one of the parameters is varied while the other is held constant." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-68", "text": "Increasing n-best size improves accuracy until about n = 30 where there seems to be sufficient diversity." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-69", "text": "For BLLIP, the Table 2 : F 1 of a baseline parser, fusion, and baselines on development sections of corpora (WSJ section 24 and Brown tune)." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-70", "text": "smoothing exponent (\u03b2) is best set around 1.0, with accuracy falling off if the value deviates too much." 
}, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-71", "text": "Finally, the threshold parameter is empirically optimized a little below t = 0.5 (the value suggested by Henderson and Brill (1999) )." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-72", "text": "Since score values are normalized, this means that constituents need roughly half the \"score mass\" in order to be included in the chart." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-73", "text": "Varying the threshold changes the precision/recall balance since a high threshold adds only the most confident constituents to the chart (Sagae and Lavie, 2006) ." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-74", "text": "Baselines: Table 2 gives the accuracy of fusion and baselines for BLLIP on the development corpora." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-75", "text": "Majority voting sets n = 50, \u03b2 = 0, t = 0.5 giving all parses equal weight and results in constituent-level majority voting." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-76", "text": "We explore a rank-based weighting which ignores parse probabilities and weight parses only using the rank: W rank (k) = 1/(2 k )." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-77", "text": "These show that accurate parse-level scores are critical for good performance." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-78", "text": "Final evaluation: Table 3 gives our final results for all parsers across all domains." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-79", "text": "Results in blue are significant at p < 0.01 using a randomized permutation test." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-80", "text": "Fusion generally improves F 1 for in-domain and out-of-domain parsing by a significant margin." 
}, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-81", "text": "For the self-trained BLLIP parser, in-domain F 1 increases by 0.4% and out-of-domain F 1 increases by 0.4% on average." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-82", "text": "Berkeley parser obtains the smallest gains from fusion since Berkeley's n-best lists are ordered by factors other than probabilities." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-83", "text": "As a result, the probabilities from Berkeley can mislead the fusion process." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-84", "text": "We also compare against model combination using our reimplementation of Sagae and Lavie (2006) ." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-85", "text": "For these results, all six parsers were given equal weight." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-86", "text": "The threshold was set to 0.42 to optimize model combination F 1 on development data (similar to Setting 2 for constituency parsing in Sagae and Lavie (2006) )." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-87", "text": "Model combination Table 3 : Evaluation of the constituency fusion method on six parsers across six domains." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-88", "text": "x/y indicates the F 1 from the baseline parser (x) and the baseline parser with fusion (y) respectively." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-89", "text": "Blue indicates a statistically significant difference between fusion and its baseline parser (p < 0.01)." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-90", "text": "performs better than fusion on BNC and GENIA, but surprisingly fusion outperforms model combination on three of the six domains (not usually not by a significant margin)." 
}, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-91", "text": "With further tuning (e.g., specific weights for each constituent-parser pair), the benefits from model combination should increase." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-92", "text": "Multilingual evaluation: We evaluate fusion with the Berkeley parser on Arabic (Maamouri et al., 2004; Green and Manning, 2010) , French (Abeill\u00e9 et al., 2003) , and German (Brants et al., 2002) from the SPMRL 2014 shared task (Seddah et al., 2014) but did not observe any improvement." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-93", "text": "We suspect this has to do with the same ranking issues seen in the Berkeley Parser's English results." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-94", "text": "On the other hand, fusion helps the parser of Narayan and Cohen (2015) on the German NEGRA treebank (Skut et al., 1997) to improve from 80.9% to 82.4%." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-95", "text": "Runtime: As discussed in Section 2, fusion's runtime overhead is minimal." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-96", "text": "Reranking parsers (e.g., BLLIP and Stanford RNN) already need to perform n-best decoding as input for the reranker." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-97", "text": "Using a somewhat optimized implementation fusion in C++, the overhead over BLLIP parser is less than 1%." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-98", "text": "Discussion: Why does fusion help? It is possible that a parser's n-list and its scores act as a weak approximation to the full parse forest." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-99", "text": "As a result, fusion seems to provide part of the benefits seen in forest reranking (Huang, 2008) ." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-100", "text": "Results from Fossum and Knight (2009) imply that fusion and model combination might not be complementary." 
}, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-101", "text": "Both n-best lists and additional parsers provide syntactic diversity." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-102", "text": "While additional parsers provide greater diversity, n-best lists from common parsers are varied enough to provide improvements for parse hybridization." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-103", "text": "We analyzed how often fusion produces completely novel trees." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-104", "text": "For BLLIP on WSJ section 24, this only happens about 11% of the time." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-105", "text": "Fusion picks the 1-best tree 72% of the time." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-106", "text": "This means that for the remaining 17%, fusion picks an existing parse from the rest of the nlist, acting similar to a reranker." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-107", "text": "When fusion creates unique trees, they are significantly better than the original 1-best trees (for the 11% subset of WSJ 24, F 1 scores are 85.5% with fusion and 84.1% without, p < 0.003)." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-108", "text": "This contrasts with McClosky et al. (2012) where novel predic-tions from model combination (stacking) were worse than baseline performance." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-109", "text": "The difference is that novel predictions with fusion better incorporate model confidence whereas when stacking, a novel prediction is less trusted than those produced by one or both of the base parsers." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-110", "text": "Preliminary extensions: Here, we summarize two extensions to fusion which have yet to show benefits." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-111", "text": "The first extension explores applying fusion to dependency parsing." 
}, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-112", "text": "We explored two ways to apply fusion when starting from constituency parses: (1) fuse constituents and then convert them to dependencies and (2) convert to dependencies then fuse the dependencies as in Sagae and Lavie (2006) ." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-113", "text": "Approach (1) does not provide any benefit (LAS drops between 0.5% and 2.4%)." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-114", "text": "This may result from fusion's artifacts including unusual unary chains or nodes with a large number of children -it is possible that adjusting unary handling and the precision/recall tradeoff may reduce these issues." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-115", "text": "Approach (2) provided only modest benefits compared to those from constituency parsing fusion." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-116", "text": "The largest LAS increase for (2) is 0.6% for the Stanford Parser, though for Berkeley and Self-trained BLLIP, dependency fusion results in small losses (-0.1% LAS)." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-117", "text": "Two possible reasons are that the dependency baseline is higher than its constituency counterpart and some dependency graphs from the n-best list are duplicates which lowers diversity and may need special handling, but this remains an open question." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-118", "text": "While fusion helps on top of a self-trained parser, we also explored whether a fused parser can self-train (McClosky et al., 2006) ." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-119", "text": "To test this, we (1) parsed two million sentences with BLLIP (trained on WSJ), (2) fused those parses, (3) added the fused parses to the gold training set, and (4) retrained the parser on the expanded training." 
}, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-120", "text": "The resulting model did not perform better than a selftrained parsing model that didn't use fusion." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-121", "text": "----------------------------------" }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-122", "text": "**CONCLUSIONS**" }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-123", "text": "We presented a simple extension to parse hybridization which adapts model combination techniques to operate over a single parser's n-best list instead of across multiple parsers." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-124", "text": "By weighting each parse by its probability from the n-best parser, we are able to better capture the confidence at the constituent level." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-125", "text": "Our best configuration obtains state-of-the-art accuracy on WSJ with an F 1 of 92.6%." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-126", "text": "This is similar to the accuracy obtained from actual model combination techniques but at a fraction of the computational cost." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-127", "text": "Additionally, improvements are not limited to a single parser or domain." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-128", "text": "Fusion improves parser accuracy for six n-best parsers both in-domain and out-of-domain." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-129", "text": "Future work includes applying fusion to n-best dependency parsers and additional (parser, language) pairs." }, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-130", "text": "We also intend to explore how to better apply fusion to converted dependencies from constituency parsers." 
}, { "sent_id": "c022b7cf4568e26c7408a835eaafb7-C001-131", "text": "Lastly, it would be interesting to adapt fusion to other structured prediction tasks where n-best lists are available." } ], "y": { "@EXT@": { "gold_contexts": [ [ "c022b7cf4568e26c7408a835eaafb7-C001-5" ], [ "c022b7cf4568e26c7408a835eaafb7-C001-18" ] ], "cite_sentences": [ "c022b7cf4568e26c7408a835eaafb7-C001-5", "c022b7cf4568e26c7408a835eaafb7-C001-18" ] }, "@BACK@": { "gold_contexts": [ [ "c022b7cf4568e26c7408a835eaafb7-C001-11" ], [ "c022b7cf4568e26c7408a835eaafb7-C001-73" ] ], "cite_sentences": [ "c022b7cf4568e26c7408a835eaafb7-C001-11", "c022b7cf4568e26c7408a835eaafb7-C001-73" ] }, "@USE@": { "gold_contexts": [ [ "c022b7cf4568e26c7408a835eaafb7-C001-30" ], [ "c022b7cf4568e26c7408a835eaafb7-C001-84" ], [ "c022b7cf4568e26c7408a835eaafb7-C001-112" ] ], "cite_sentences": [ "c022b7cf4568e26c7408a835eaafb7-C001-30", "c022b7cf4568e26c7408a835eaafb7-C001-84", "c022b7cf4568e26c7408a835eaafb7-C001-112" ] }, "@SIM@": { "gold_contexts": [ [ "c022b7cf4568e26c7408a835eaafb7-C001-86" ], [ "c022b7cf4568e26c7408a835eaafb7-C001-112" ] ], "cite_sentences": [ "c022b7cf4568e26c7408a835eaafb7-C001-86", "c022b7cf4568e26c7408a835eaafb7-C001-112" ] } } }, "ABC_0143619c1c54129702aafb585463d2_23": { "x": [ { "sent_id": "0143619c1c54129702aafb585463d2-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-2", "text": "Citation texts are sometimes not very informative or in some cases inaccurate by themselves; they need the appropriate context from the referenced paper to re ect its exact contributions." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-3", "text": "To address this problem, we propose an unsupervised model that uses distributed representation of words as well as domain knowledge to extract the appropriate context from the reference paper." 
}, { "sent_id": "0143619c1c54129702aafb585463d2-C001-4", "text": "Evaluation results show the e ectiveness of our model by signi cantly outperforming the state-of-the-art." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-5", "text": "We furthermore demonstrate how an e ective contextualization method results in improving citation-based summarization of the scienti c articles." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-6", "text": "----------------------------------" }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-8", "text": "In scienti c literature, related work is often referenced along with a short textual description regarding that work which we call citation text." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-9", "text": "Citation texts usually highlight certain contributions of the referenced paper and a set of citation texts to a reference paper can provide useful information about that paper." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-10", "text": "Therefore, citation texts have been previously used to enhance many downstream tasks in IR/NLP such as search and summarization (e.g. [2, 15, 16] )." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-11", "text": "While useful, citation texts might lack the appropriate context from the reference article [4, 5, 18] ." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-12", "text": "For example, details of the methods, assumptions or conditions for the obtained results are often not mentioned." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-13", "text": "Furthermore, in many cases the citing author might misunderstand or misquote the referenced paper and ascribe contributions to it that are not intended in that form." 
}, { "sent_id": "0143619c1c54129702aafb585463d2-C001-14", "text": "Hence, sometimes the citation text is not su ciently informative or in other cases, even inaccurate [17] ." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-15", "text": "This problem is more serious in life sciences where accurate dissemination of knowledge has direct impact on human lives." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-16", "text": "We present an approach for addressing such concerns by adding the appropriate context from the reference article to the citation texts." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-17", "text": "Enriching the citation texts with relevant context from the reference paper helps the reader to better understand the context for the ideas, methods or ndings stated in the citation text." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-18", "text": "Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for pro t or commercial advantage and that copies bear this notice and the full citation on the rst page." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-19", "text": "Copyrights for components of this work owned by others than ACM must be honored." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-20", "text": "Abstracting with credit is permitted." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-21", "text": "To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior speci c permission and/or a fee." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-22", "text": "Request permissions from permissions@acm.org." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-23", "text": "SIGIR '17, August 07-11, 2017, Shinjuku, Tokyo, Japan \u00a9 2017 ACM." 
}, { "sent_id": "0143619c1c54129702aafb585463d2-C001-24", "text": "A challenge in citation contextualization is the discourse and terminology variations between the citing and the referenced authors." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-25", "text": "Hence, traditional IR models that rely on term matching for nding the relevant information are ine ective." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-26", "text": "We propose to address this challenge by a model that utilizes word embeddings and domain speci c knowledge." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-27", "text": "Speci cally, our approach is a retrieval model for nding the appropriate context of citations, aimed at capturing terminology variations and paraphrasing between the citation text and its relevant reference context." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-28", "text": "We perform two sets of experiments to evaluate the performance of our system." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-29", "text": "First, we evaluate the relevance of extracted contexts intrinsically." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-30", "text": "Then we evaluate the e ect of citation contextualization on the application of scienti c summarization." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-31", "text": "Experimental results on TAC 2014 benchmark show that our approach signi cantly outperforms several strong baselines in extracting the relevant contexts." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-32", "text": "We furthermore, demonstrate that our contextualization models can enhance summarizing scienti c articles." 
}, { "sent_id": "0143619c1c54129702aafb585463d2-C001-33", "text": "----------------------------------" }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-34", "text": "**CONTEXTUALIZING CITATIONS**" }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-35", "text": "Given a citation text, our goal is to extract the most relevant context to it in the reference article." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-36", "text": "These contexts are essentially certain textual spans within the reference article." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-37", "text": "Throughout, colloquially, we refer to the citation text as query and reference spans in the reference article as documents." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-38", "text": "Our approach extends Language Models for IR (LM) by incorporating word embeddings and domain ontology to address shortcomings of LM for this research purpose." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-39", "text": "The goal in LM is to rank a document d according to the conditional probability p(d |q) \u221d p(q|d ) = q i \u2208q p(q i |d ) where q i shows the tokens in the query q. Estimating p(q i |d ) is often achieved by maximum likelihood estimate from term frequencies with some sort of smoothing." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-40", "text": "Using Dirichlet smoothing [21] , we have:" }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-41", "text": "where f (q i , d ) shows the frequency of term q i in document d, C is the entire collection, V is the vocabulary and \u00b5 the Dirichlet smoothing parameter." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-42", "text": "In the citation contextualization problem, (i) the target reference sentences are short documents and (ii) there exist terminology variations between the citing author and the referenced author." 
}, { "sent_id": "0143619c1c54129702aafb585463d2-C001-43", "text": "Hence, the citation terms usually do not appear in the documents and relying only on the frequencies of citation terms in the documents (f (q i , d )) for estimating p(q i |d ) yields an almost uniform smoothed distribution that is unable to decisively distinguish between the documents." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-44", "text": "Embeddings." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-45", "text": "Distributed representation (embedding) of a word w in a eld F is a mapping w \u2192 F n where words semantically similar to w will be ideally located close to it." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-46", "text": "Given a query q, we rank the documents d according to the following scoring function which leverages this property:" }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-47", "text": "where f sem is a function that measures semantic relatedness of the query term q i to the document d and is de ned as:" }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-48", "text": "is the relatedness between the the query term and document term which is calculated by applying a similarity function to the distributed representations of q i and d j ." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-49", "text": "We use a transformation (\u03d5) of dot products between the unit vectors e (q i ) and e (d j ) corresponding to the embeddings of the terms q i and d j for the similarity function:" }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-50", "text": "We rst explain the role of \u03c4 and then the reason for considering the function \u03d5 instead of raw dot product." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-51", "text": "\u03c4 is a parameter that controls the noise introduced by less similar words." 
}, { "sent_id": "0143619c1c54129702aafb585463d2-C001-52", "text": "Many unrelated word vectors have non-zero similarity scores and adding them up introduces noise to the model and reduces the performance." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-53", "text": "\u03c4 's function is to set the similarity between unrelated words to zero instead of a positive number." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-54", "text": "To identify an appropriate value for \u03c4 , we select a random set of words from the embedding model and calculate the average and standard deviation of pointwise absolute value of similarities between terms from these two samples." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-55", "text": "We then select the threshold \u03c4 to be two standard deviations larger than the average to only consider very high similarity values (this choice was empirically justi ed)." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-56", "text": "Examining term similarity values between words shows that there are many terms with high similarities associated with each term and these values are not highly discriminative." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-57", "text": "We apply a transfer function \u03d5 to the dot product e (q i ).e (d j ) to dampen the e ect of less similar words." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-58", "text": "In other words, we only want highly related words to have high similarity values and similarity should quickly drop as we move to less related words." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-59", "text": "We use the logit function for \u03d5 to achieve this dampening e ect: Figure 1 shows this e ect." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-60", "text": "The purple line is the normalized dot product of a sample word with the most similar words in the model." 
}, { "sent_id": "0143619c1c54129702aafb585463d2-C001-61", "text": "As illustrated, the similarity score di erences among top words is not very discriminative." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-62", "text": "However, applying the logit function (green line) causes the less similar words to have lower similarity values to the target word." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-63", "text": "Domain knowledge." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-64", "text": "Successful word embedding methods have previously shown to be e ective in capturing syntactic and semantic relatedness between terms." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-65", "text": "These co-occurrence based models are data driven." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-66", "text": "On the other hand, domain ontologies and lexicons that are built by experts include some information that might not be captured by embedding methods [8] ." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-67", "text": "Therefore, using domain knowledge can further help the embedding based retrieval model; we incorporate it in our model in the following ways: 1) Retro tting: Faruqui et al. [6] proposed a model that uses the constraints on WordNet lexicon to modify the word vectors and pull synonymous words closer to each other." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-68", "text": "To inject the domain knowledge in the embeddings, we apply this model on two domain speci c ontologies, namely, M and Protein Ontologies (PO) 1 ." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-69", "text": "We chose these two biomedical domain ontologies because they are in the same domain as the articles in the TAC dataset." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-70", "text": "M is a broad ontology that consists of biomedical terms and PO is a more focused ontology related to biology of proteins and genes." 
}, { "sent_id": "0143619c1c54129702aafb585463d2-C001-71", "text": "2) Interpolating in the LM: We also directly incorporate the domain knowledge in the retrieval model; we modify the LM into the following interpolated LM with parameter \u03bb:" }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-72", "text": "where p 1 is estimated using Eq. 2 and p 2 is similar to p 1 except that we replace f sem with the function f ont which considers domain ontology in calculating similarities:" }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-73", "text": "where" }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-74", "text": "----------------------------------" }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-75", "text": "**EXPERIMENTS**" }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-76", "text": "Data." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-77", "text": "We use the TAC 2014 Biomedical Summarization benchmark 3 ." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-78", "text": "This dataset contains 220 scienti c biomedical journal articles and 313 total citation texts where the relevant contexts for each citation text are annotated by 4 experts." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-79", "text": "Baselines." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-80", "text": "To our knowledge, the only published results on TAC 2014 is [4] , where the authors utilized query reformulation (QR) based on UMLS ontology." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-81", "text": "In addition to [4] , we also implement several other strong baselines to better evaluate the e ectiveness of our model: 1) BM25; 2) VSM: Vector Space Model that was used in [4] ; 3) DESM: Dual Embedding Space Model which is a recent embedding based retrieval model [12] ; and 4) LMD-LDA: Language modeling with LDA smoothing which is a recent extension of the LMD to also account for the latent topics [10] ." 
}, { "sent_id": "0143619c1c54129702aafb585463d2-C001-82", "text": "All the baseline parameters are tuned for the best performance, and the same preprocessing is applied to all the baselines and our methods." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-83", "text": "Our methods." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-84", "text": "We rst report results based on training the embeddings on Wikipedia (WE Wiki )." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-85", "text": "Since TAC dataset is in biomedical domain, many of the biomedical terms might be either outof-vocabulary or not captured in the correct context using general embeddings, therefore we also train biomedical embeddings (WE Bio ) 4 ." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-86", "text": "In addition, we report results for biomedical embeddings with retro tting (WE Bio +rtrft), as well as interpolating domain knowledge (WE Bio +dmn)" }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-87", "text": "----------------------------------" }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-88", "text": "**INTRINSIC EVALUATION**" }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-89", "text": "First, we analyze the e ectiveness of our proposed approach for contextualization intrinsically." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-90", "text": "That is, we evaluate the quality of the 1 https://www.nlm.nih.gov/mesh/; http://pir.georgetown.edu/pro/ 2 The values of the parameters \u03b3 and \u03bb were selected empirically by grid search 3 http://www.nist.gov/tac/2014/BiomedSumm/ 4 We train biomedical embeddings on TREC Genomics 2004 and 2006 collections (both Wikipedia and Genomics embeddings were trained using gensim implementation of Word2Vec, negative sampling, window size of 5 and 300 dimensions." 
}, { "sent_id": "0143619c1c54129702aafb585463d2-C001-91", "text": "extracted citation contexts using our contextualization methods in terms of how accurate they are with respect to human annotations." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-92", "text": "Evaluation." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-93", "text": "We consider the following evaluation metrics for assessing the quality of the retrieved contexts for each citation from multiple aspects: (i) Character o set overlaps of the retrieved contexts with human annotations in terms of precision (c-P), recall (c-R) and F-score (c-F)." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-94", "text": "These are the recommended metrics for the task per TAC 5 . (ii) nDCG: we treat any partial overlaps with the gold standard as a correct context and then calculate the nDCG scores." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-95", "text": "(iii) R -N scores: To also consider the content similarity of the retrieved contexts with the gold standard, we calculate the R scores between them." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-96", "text": "(iv) Character precision at K (c-P@K): Since we are usually interested in the top retrieved spans, we consider character o set precision only for the top K spans and we denote it with \"c-P@K\"." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-97", "text": "Results." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-98", "text": "The results of intrinsic evaluation of contextualization are presented in Table 1 ." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-99", "text": "Our models (last 4 rows of table 1) achieve signi cant improvements over the baselines consistently across most of the metrics." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-100", "text": "This shows the e ectiveness of our models viewed from di erent aspects in comparison with the baselines." 
}, { "sent_id": "0143619c1c54129702aafb585463d2-C001-101", "text": "The best baseline performance is the query reformulation (QR) method by [4] which improves over other baselines." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-102", "text": "We observe that using general domain embeddings does not provide much advantage in comparison with the best baseline (compare WE wiki and QR in the Table) ." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-103", "text": "However, using the domain speci c embeddings (WE Bio ) results in 10% c-F improvement over the best baseline." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-104", "text": "This is expected since word relations in the biomedical context are better captured with biomedical embeddings." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-105", "text": "In Table 2 an illustrative word \"expression\" gives better intuition why is that the case." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-106", "text": "As shown, using general embeddings (left column in the table), the most similar words to \"expression\" are those related to 5 https://tac.nist.gov/2014/BiomedSumm/guidelines.html the general meaning of it." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-107", "text": "However, many of these related words are not relevant in the biomedical context." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-108", "text": "In the biomedical context, \"expression\" refers to \"the appearance in a phenotype attributed to a particular gene\"." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-109", "text": "As shown on the right column, the domain speci c embeddings (Bio) trained on genomics data are able to capture this meaning." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-110", "text": "This further con rms the inferior performance of the out-of-domain word embeddings in capturing correct word-level semantics [13] ." 
}, { "sent_id": "0143619c1c54129702aafb585463d2-C001-111", "text": "Last two rows in Table 1 show incorporating the domain knowledge in the model which results in signi cant improvement over the best baseline in terms of most metrics (e.g. 14% and 16% c-F improvements)." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-112", "text": "This shows that domain ontologies provide additional information that the domain trained embeddings may not contain." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-113", "text": "While both our methods of incorporating domain ontologies prove to be e ective, interpolating domain knowledge directly (WE Bio +dmn) has the edge over retro tting (WE Bio +rtrft)." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-114", "text": "This is likely due to the direct e ect of ontology on the interpolated language model, whereas in retro tting, the ontology rst a ects the embeddings and then the context extraction model." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-115", "text": "To analyze the performance of our system more closely, we took the context identi ed by 1 annotator as the candidate and the other 3 as gold standard and evaluated the precision to obtain an estimate of human performance on each citation." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-116", "text": "We then divided the citations based on human performance to 4 groups by quartiles." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-117", "text": "Table 3 shows our system's performance on each of these groups." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-118", "text": "We observe that, when human precision is higher (upper quartiles in the table), our system also performs better and with more con dence (lower std)." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-119", "text": "Therefore, the system errors correlate well with human disagreement on the correct context for the citations." 
}, { "sent_id": "0143619c1c54129702aafb585463d2-C001-120", "text": "Averaged over the 4 annotators for each citation, the mean precision was 56.7% (note that this translates to our c-P@1 metric)." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-121", "text": "In Table 1 , we observe that our best method (c-P@1 of 56.1%) is comparable with average human precision score (c-P@1 of 56.7%) which further demonstrates the e ectiveness of our model." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-122", "text": "----------------------------------" }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-123", "text": "**EXTERNAL EVALUATION**" }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-124", "text": "Citation-based summarization can e ectively capture various contributions and aspects of the paper by utilizing citation texts [15] ." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-125", "text": "However; as argued in section 1, citation texts do not always accurately re ect the original paper." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-126", "text": "We show how adding context from the original paper can address this concern, while keeping the bene ts of citation-based summarization." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-127", "text": "Speci cally, we compare how using no contextualization, versus various proposed contextualization approaches a ect the quality of summarization." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-128", "text": "We apply the following well-known summarization algorithms on the set of citation texts, and the retrieved citation-contexts: LexRank, LSAbased, SumBasic, and KL-Divergence (For space constraints, we will not explain these approaches here; refer to [14] for details)." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-129", "text": "We then compare the e ect of our proposed contextualization methods using the standard R -N summarization evaluation metrics." 
}, { "sent_id": "0143619c1c54129702aafb585463d2-C001-130", "text": "summarization approach solely on the citations without any contextualization." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-131", "text": "The next 5 rows show the baselines and last 4 rows are our proposed contextualization methods." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-132", "text": "As shown, e ective contextualization positively impacts the generated summaries." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-133", "text": "For example, our best method is \"WE Bio + dmn\" which signi cantly improves the quality of generated summaries in terms of R over the ones without any context." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-134", "text": "We observe that two low-performing baseline methods for contextualization according to Table 1 (\"VSM\" and \"BM25\") also do not result in any improvements for summarization." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-135", "text": "Therefore, the intrinsic quality of citation contextualization has direct impact on the quality of generated summaries." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-136", "text": "These results further demonstrate that e ective contextualization is helpful for scienti c citation-based summarization." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-137", "text": "----------------------------------" }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-138", "text": "**RELATED WORK**" }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-139", "text": "Related work has mostly focused on extracting the citation text in the citing article (e.g. [1] )." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-140", "text": "In this work, given the citation texts, we focus on extracting its relevant context from the reference paper." 
}, { "sent_id": "0143619c1c54129702aafb585463d2-C001-141", "text": "Related work have also shown that citation texts can be used in di erent applications such as summarization [2, 3, 9, 11, 15, 20] ." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-142", "text": "Our proposed model utilizes word embeddings and the domain knowledge." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-143", "text": "Embeddings have been recently used in general information retrieval models." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-144", "text": "Vuli\u0107 and Moens [19] proposed an architecture for learning word embeddings in multilingual settings and used them in document and query representation." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-145", "text": "Mitra et al. [12] proposed dual embedded space model that predicts document aboutness by comparing the centroid of word vectors to query terms." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-146", "text": "Ganguly et al. [7] used embeddings to transform term weights in a translation model for retrieval." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-147", "text": "Their model uses embeddings to expand documents and use co-occurrences for estimation." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-148", "text": "Unlike these works, we directly use embeddings in estimating the likelihood of query given documents; we furthermore incorporate ways to utilize domain speci c knowledge in our model." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-149", "text": "The most relevant prior work to ours is [4] where the authors approached the problem using a vector space model similarity ranking and query reformulations." 
}, { "sent_id": "0143619c1c54129702aafb585463d2-C001-150", "text": "----------------------------------" }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-151", "text": "**CONCLUSIONS**" }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-152", "text": "Citation texts are textual spans in a citing article that explain certain contributions of a reference paper." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-153", "text": "We presented an e ective model for contextualizing citation texts (associating them with the appropriate context from the reference paper)." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-154", "text": "We obtained statistically signi cant improvements in multiple evaluation metrics over several strong baseline, and we matched the human annotators precision." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-155", "text": "We showed that incorporating embeddings and domain knowledge in the language modeling based retrieval is e ective for situations where there are high terminology variations between the source and the target (such as citations and their reference context)." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-156", "text": "Citation contextualization not only can help the readers to better understand the citation texts but also as we demonstrated, they can improve other downstream applications such as scienti c document summarization." }, { "sent_id": "0143619c1c54129702aafb585463d2-C001-157", "text": "Overall, our results show that citation contextualization enables us to take advantage of the bene ts of citation texts, while ensuring accurate dissemination of the claims, ideas and ndings of the original referenced paper." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "0143619c1c54129702aafb585463d2-C001-11" ], [ "0143619c1c54129702aafb585463d2-C001-80" ], [ "0143619c1c54129702aafb585463d2-C001-101" ] ], "cite_sentences": [ "0143619c1c54129702aafb585463d2-C001-11", "0143619c1c54129702aafb585463d2-C001-80", "0143619c1c54129702aafb585463d2-C001-101" ] }, "@DIF@": { "gold_contexts": [ [ "0143619c1c54129702aafb585463d2-C001-81" ] ], "cite_sentences": [ "0143619c1c54129702aafb585463d2-C001-81" ] }, "@EXT@": { "gold_contexts": [ [ "0143619c1c54129702aafb585463d2-C001-81" ] ], "cite_sentences": [ "0143619c1c54129702aafb585463d2-C001-81" ] }, "@SIM@": { "gold_contexts": [ [ "0143619c1c54129702aafb585463d2-C001-149" ] ], "cite_sentences": [ "0143619c1c54129702aafb585463d2-C001-149" ] } } }, "ABC_b1c06a67b03d81b249b320413a6e7e_23": { "x": [ { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-2", "text": "To answer the question \"What are the duties of a medical doctor?\", one would require knowledge about verb-based relations." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-3", "text": "A lot of effort has been invested in developing relation learners, however to our knowledge there is no repository (or system) which can return all verb relations for a given term." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-4", "text": "This paper describes an automated procedure which can learn and produce such information with minimal effort." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-5", "text": "To evaluate the performance of our verb harvesting procedure, we have conducted two types of evaluations: (1) in the human based evaluation we found that the accuracy of the described algorithm is .95 at rank 100; (2) in the comparative study with existing relation learner and knowledge bases we found that our approach yields 12 times more verb relations." 
}, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-6", "text": "----------------------------------" }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-8", "text": "To be able to answer the questions \"What causes ebola?\", \"What are the duties of a medical doctor?\", \"What are the differences between a terrorist and a victim?\", \"Which are the animals that have wings but cannot fly?\" one requires knowledge about verb-based relations." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-9", "text": "Over the years, researchers have developed various relation learning algorithms." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-10", "text": "Some (Ravichandran and Hovy, 2002; Bunescu and Mooney, 2007) targeted specific relations like BornInYear, CorporationAcquired, others (Wu and Weld, 2010; Fader et al., 2011 ) extracted any phrase denoting a relation in an English sentence." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-11", "text": "(Banko, 2009) used labeled data to learn relations, (Suchanek et al., 2007) used information encoded in the structured Wikipedia documents, (Riloff and Jones, 1999) bootstrapped patterns." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-12", "text": "As a result various knowledge bases have been produced like TopicSignatures (Agirre and Lacalle, 2004) , ConceptNet (Liu and Singh, 2004) , Yago (Suchanek et al., 2007) , NELL (Carlson et al., 2009) and ReVerb (Fader et al., 2011) ." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-13", "text": "Despite the many efforts to date, yet there is no universal repository (or even a system), which for a given term it can immediately return all verb relations related to the term." 
}, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-14", "text": "However, one would still like to dispose of an automated procedure, which on the fly can accurately and quickly produce such information for any term." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-15", "text": "If available, such resource can aid different natural language processing tasks such as preposition sense disambiguation (Litkowski and Hargraves, 2007) , selectional preferences (Resnik, 1996; Ritter et al., 2010) , question answering (Ferrucci et al., 2010) and textual entailment (Szpektor et al., 2004) ." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-16", "text": "----------------------------------" }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-17", "text": "**THE QUESTION WE ADDRESS IN THIS PAPER IS: IS IT POSSIBLE TO CREATE A PROCEDURE WHICH WILL GO BEYOND EXISTING TECHNIQUES AND LEARN IN A SEMI-SUPERVISED MANNER FOR A GIVEN TERM ALL VERB RELATIONS ASSOCIATED WITH IT?**" }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-18", "text": "The main contributions of the paper are:" }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-19", "text": "\u2022 We develop an automatic procedure, which on the fly can learn a diverse set of verb and verb-preposition relations for a given term." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-20", "text": "\u2022 We establish the effectiveness of our approach through human-based evaluation." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-21", "text": "\u2022 We conduct a comparative study with the verb-based relation extraction system ReVerb (Fader et al., 2011) and show that our approach accurately extracts more verb-based relations." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-22", "text": "\u2022 We also compare the verb relations produced by our system with those available in existing knowledge bases, and observe that despite their completeness these repositories lack many verb-based relations." 
}, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-23", "text": "The rest of the paper is organized as follows." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-24", "text": "Next, we present related work." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-25", "text": "Section 3 outlines the verb-based relation learner." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-26", "text": "Section 4 describes the data collection process." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-27", "text": "Section 5 reports on the experimental results." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-28", "text": "Finally, we conclude in Section 6." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-29", "text": "----------------------------------" }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-30", "text": "**RELATED WORK**" }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-31", "text": "Lots of attention has been payed on learning is-a and part-of relations (Hearst, 1992; Girju et al., 2003; Pasca, 2004; Etzioni et al., 2005; Kozareva et al., 2008; Pantel and Pennacchiotti, 2006; Carlson et al., 2009; Talukdar et al., 2008) ." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-32", "text": "Others (Ravichandran and Hovy, 2002; Bunescu and Mooney, 2007) have focused on learning specific relations like BornInYear, EmployedBy and CorporationAcquired." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-33", "text": "However to build a system that can learn a richer set of relations is not trivial, because often labeled training data is required (Kim and Moldovan, 1993; Soderland et al., 1999) and most methods do not scale to corpora where the number of relations is very large or when the relations are not specified in advance (Fader et al., 2011) ." 
}, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-34", "text": "However, recently developed OpenIE systems like TextRunner (Banko et al., 2007; Banko, 2009) and ReVerb (Fader et al., 2011) surmount the necessity of labeled data by extracting arbitrary phrases denoting relations in English sentences." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-35", "text": "(Banko et al., 2007; Banko, 2009 ) define relation to be any verb-prep, adj-noun construction." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-36", "text": "While such systems are great at learning general relations, they are not guided but simply gather in an undifferentiated way whatever happens to be contained in their input." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-37", "text": "In order to be able to extract all verb relations associated with a given term, such systems need to part-of-speech tag and parse a large document collection, then they have to extract all verb constructions and all arguments matching specific sets of patterns which were written by humans (or experts)." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-38", "text": "Finally, they must filter out the information and retrieve only those verb relations that are associated with the specific term." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-39", "text": "Once compiled the repository is straightforward to query and use, however if a term is not present in the compiled repository, repeating the whole process on a new document collection becomes time consuming and unpractical." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-40", "text": "The main objective and contribution of our research is the development of a dynamic and flexible knowledge harvesting procedure, which for any given term can learn on the fly verb based relations associated with the term in a very fast and accurate manner." 
}, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-41", "text": "----------------------------------" }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-42", "text": "**LEARNING VERB-BASED RELATIONS**" }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-43", "text": "----------------------------------" }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-44", "text": "**PROBLEM FORMULATION**" }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-45", "text": "We define our task as given a term, a relation expressed by a verb and a set of prepositions: (1) learn in bootstrapping fashion new relations (i.e. verbs) associated with the initial term and filter out erroneous extractions; (2) form triples of the term, the harvested verbs and the initial set of prepositions to learn additional relations (i.e. verb-prepositions) and their argument fillers. shows an example for the input term terrorists, the verb relation bomb and the recursive pattern \"terrorists bomb and *\"." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-46", "text": "The algorithm learns on the * position verbs like kill, murder, threaten, burn, assassinate." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-47", "text": "We denote this phase as verb extraction." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-48", "text": "Then each learned verb is used to form triples of the type term-verb-preposition to learn new verb-preposition relations and their argument fillers." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-49", "text": "For instance, \"terrorists kill with *\" extracts arguments like {bombs, suicide, impunity}. We denote this phase as verb-preposition extraction." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-50", "text": "Finally, the learned relations and arguments are ranked and arranged by their ranking score." 
}, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-51", "text": "The output of this harvesting procedure is triples of the kind \"terrorists kill people\", \"terrorists kill on purpose\", \"terrorists bomb buildings\" among others." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-52", "text": "----------------------------------" }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-53", "text": "**ALGORITHM DESCRIPTION**" }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-54", "text": "Because of their fixed nature, pattern-based methods often fail to extract information from small corpus or single document." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-55", "text": "However, nowadays we dispose of endless amount of data, which is easily accessible and is making it possible for such systems to work successfully by scanning billions of Web pages to extract the necessary information." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-56", "text": "Many of the existing and most accurate is-a relation learners rely on lexico-syntactic patterns (Hearst, 1992; Pasca, 2004; Etzioni et al., 2005) , therefore we decided to use patterns for the verb extraction procedure." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-57", "text": "----------------------------------" }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-58", "text": "**PHASE1: LEARNING VERB**" }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-59", "text": "Relations." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-60", "text": "The first phase of the algorithm focuses on verb extraction." 
}, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-61", "text": "We use (Kozareva et al., 2008) recursive DAP pattern for is-a relation learning and adapted it to verb extraction as follows: \" and *\", where is any term (noun) given by the user or taken from an existing knowledge base, is a seed relation expressed through a verb and * indicates the position on which new verbs will be extracted." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-62", "text": "The generated patterns are submitted to the search engine as a web query and all retrieved snippets are kept." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-63", "text": "The algorithm extracts on the position of the * all verb constructions and if they were not previously explored by the algorithm, they are placed on the position of DAP and used as seeds in the subsequent verb extraction iteration." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-64", "text": "The harvesting terminates when there are no more verbs to be explored." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-65", "text": "Following (Kozareva et al., 2008) , we filter out erroneous extractions using graph ranking." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-66", "text": "We build a directed graph G = (V, E), where each node v \u2208 V is an extracted verb candidate and (u, v) \u2208 E is an edge between two verb nodes indicating that the verb u lead to the extraction of the verb v. Each node u in the graph is ranked as u = \u2200(u,v)\u2208E (u, v) ." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-67", "text": "Confidence in u increases when u extracts more verbs." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-68", "text": "----------------------------------" }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-69", "text": "**PHASE2: LEARNING VERB-PREPOSITION**" }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-70", "text": "Relations." 
}, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-71", "text": "In the second phase, the learned verbs are paired with an initial set of 17 prepositions to learn new relations and argument fillers." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-72", "text": "The prepositions were taken from the SemEval 2007 task on preposition disambiguation (Litkowski and Hargraves, 2007) ." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-73", "text": "To extract more relations, the algorithm uses the pattern \" *\", where is the initial term for which we want to learn verb-based relations, are the leaned verbs from the previous phase and * is the position of the argument fillers." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-74", "text": "Given the relation kill for the term terrorists, new relations like terrorists kill on, terrorists kill with, terrorists kill for and terrorists kill without are instantiated 1 ." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-75", "text": "Similarly to the verb extraction phase, we rank terms by building a bipartite graph G = (V , E ) with two types of nodes." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-76", "text": "One set represents the verbs and verb-prepositions V , and the other set represents the arguments A. An edge e (v, a) \u2208 E between v \u2208 V and a \u2208 A shows that the verb (or verb-prep) v extracted the argument a. Each argument is ranked as a = \u2200(v,a)\u2208E (v, a)." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-77", "text": "Confidence in a increases when a is extracted multiple times by different verbs." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-78", "text": "----------------------------------" }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-79", "text": "**DATA COLLECTION**" }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-80", "text": "It is impossible to collect and report results for all terms in the world." 
}, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-81", "text": "Still to evaluate the effectiveness of our verb-based relation learner, we have randomly selected 36 terms, which capture daily activities like going to a restaurant to unpleasant events like bombing." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-82", "text": "For the purpose of visualization, we have organized the terms into the following groups (topics): Bombing, Diseases, Elections, Restaurants, and Animals. like (were killed, are killed, killed) ." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-83", "text": "For each domain, we also show the total number of verbs used to initiate the harvesting process and the total number of learned information." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-84", "text": "In total, we have submitted \u223c 101, 559 queries and we have collected 10.3GB snippets, which were cleaned, part-of-speech tagged (Schmid, 1994) and used for the extraction of the verb-based relations and arguments." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-85", "text": "In total for all terms the algorithm extracted 26, 678 candidate relations and 1, 040, 651 candidate arguments of which 26, 086 have rank a>5." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-86", "text": "----------------------------------" }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-87", "text": "**EVALUATION AND RESULTS**" }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-88", "text": "In this section, we evaluate the results of the verb-based relation learning procedure, which is extremely challenging because there is no universal knowledge repository against which one can compare performance in terms of precision and recall." 
}, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-89", "text": "To the extend to which it is possible, we conduct a human-based evaluation and we compare results to knowledge bases that have been extracted in a similar way (i.e., through pattern application over unstructured text)." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-90", "text": "----------------------------------" }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-91", "text": "**HUMAN-BASED EVALUATION**" }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-92", "text": "Among the most common approaches on evaluating the correctness of the harvested information is by using human annotators (Pantel and Pennacchiotti, 2006; Navigli et al., 2011) ." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-93", "text": "Conducting such evaluations is very important, because the harvested information is often used by QA, machine reading and IE systems (Ferrucci et al., 2010; Freedman et al., 2011) ." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-94", "text": "Since the evaluation of all 1, 067, 329 harvested terms is time consuming and costly, we decided to annotate for each term 100 verb relations and argument fillers." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-95", "text": "We conducted two separate annotations for the verbs and arguments, which resulted in 7200 annotations." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-96", "text": "We used two annotators who were instructed to mark as incorrect verbs (and argument fillers) that do not correspond to the term." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-97", "text": "For instance, \"drugs affect\" is marked as correct, while \"drugs discuss\" is marked as incorrect." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-98", "text": "We compute Accuracy as the number of Correct terms, divided by the total number of terms used in the annotation." 
}, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-99", "text": "Table 2 shows the accuracy of each domain at different ranks." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-100", "text": "The overall performance of our relation learner is .95 at rank 100 for the learned verbs and argument fillers." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-101", "text": "Tables 3 and 4 show examples of the harvested information." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-102", "text": "----------------------------------" }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-103", "text": "**COMPARISON WITH EXISTING KNOWLEDGE BASES**" }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-104", "text": "In this evaluation, we measure the ability of our system to learn verb-based relations of a term with respect to already existing knowledge bases, which have been created in a similar way." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-105", "text": "However, such comparative evaluations are not always possible to perform, because researchers have not fully explored the same terms and relations we have studied." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-106", "text": "When we compared results against existing knowledge bases, we noticed that Yago (Suchanek et al., 2007) has more detailed information for the arguments of the verb relations rather than the verb relations themselves." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-107", "text": "Repositories like ConceptNet 2 (Liu and Singh, 2004 ) contain 1.6 million assertions, however they only belong to twenty relation types such as is-a, part-of, made-of, effect-of among others." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-108", "text": "The only repository that we found with a diverse set of verb relations is the never-ending language learner NELL 3 (Carlson et al., 2009 )." 
}, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-109", "text": "However, there were only 11 verb relations for bomb and 2 verb relations for virus." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-110", "text": "This analysis shows that despite their completeness and richness, existing knowledge repositories can be further enriched with verb-based relations produced by our learning procedure. , shoot, beat, fought, fell, destroyed, fired, attacked, are trained, died, took, said, laughed, kicked, die, were humiliating, cheered, mocked, raised, drummed, captured, looted, ran, arrested, buried, defended" }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-111", "text": "----------------------------------" }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-112", "text": "**COMPARISON WITH EXISTING RELATION LEARNER**" }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-113", "text": "For our comparative study with existing systems, we used ReVerb 4 (Fader et al., 2011) , which similarly to our approach was specifically designed to learn verb-based relations from unstructured texts." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-114", "text": "Currently, ReVerb has extracted relations from ClueWeb09 5 and Wikipedia, which have been freely distributed to the public." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-115", "text": "ReVerb learns relations by taking as input any document and applies POS-tagging, NP-chunking and a set of rules over all sentences in the document to generate triples containing the verbs and the arguments associated with them." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-116", "text": "According to (Fader et al., 2011) ReVerb outperforms TextRunner (Banko et al., 2007) and the open Wikipedia extractor WOE (Wu and Weld, 2010) in terms of the quantity and quality of the learned relations." 
}, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-117", "text": "For comparison, we took five terms from our experiment: ant, bomb, president, terrorists, virus and collected all verbs found by ReVerb in the ClueWeb09 and Wikipedia triples." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-118", "text": "----------------------------------" }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-119", "text": "**CONCLUSION**" }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-120", "text": "Our key contribution is the development of a semi-supervised procedure, which starts with a term and a verb to learn from Web documents a large and diverse set of verb relations." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-121", "text": "We have conducted an experimental evaluation with 36 terms and have collected 26, 678 unique candidate verbs and 1, 040, 651 candidate argument fillers." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-122", "text": "We have evaluated the accuracy of our approach using human based evaluation and have compared results against the ReVerb (Fader et al., 2011) system and existing knowledge bases like NELL (Carlson et al., 2009) , Yago (Suchanek et al., 2007) and ConceptNet (Liu and Singh, 2004) ." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-123", "text": "Our study showed that despite their completeness these resources lack verb-based information and there is plenty of room for improvement since they can be further enriched with verbs using our harvesting procedure." }, { "sent_id": "b1c06a67b03d81b249b320413a6e7e-C001-124", "text": "In the future, we would like to test the usefulness of the generated resources in NLP applications." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "b1c06a67b03d81b249b320413a6e7e-C001-10" ], [ "b1c06a67b03d81b249b320413a6e7e-C001-12" ], [ "b1c06a67b03d81b249b320413a6e7e-C001-33" ], [ "b1c06a67b03d81b249b320413a6e7e-C001-34" ], [ "b1c06a67b03d81b249b320413a6e7e-C001-116" ] ], "cite_sentences": [ "b1c06a67b03d81b249b320413a6e7e-C001-10", "b1c06a67b03d81b249b320413a6e7e-C001-12", "b1c06a67b03d81b249b320413a6e7e-C001-33", "b1c06a67b03d81b249b320413a6e7e-C001-34", "b1c06a67b03d81b249b320413a6e7e-C001-116" ] }, "@DIF@": { "gold_contexts": [ [ "b1c06a67b03d81b249b320413a6e7e-C001-21" ] ], "cite_sentences": [ "b1c06a67b03d81b249b320413a6e7e-C001-21" ] }, "@SIM@": { "gold_contexts": [ [ "b1c06a67b03d81b249b320413a6e7e-C001-113" ], [ "b1c06a67b03d81b249b320413a6e7e-C001-122" ] ], "cite_sentences": [ "b1c06a67b03d81b249b320413a6e7e-C001-113", "b1c06a67b03d81b249b320413a6e7e-C001-122" ] } } }, "ABC_a334cda78f8ba6dea709809f0999b6_23": { "x": [ { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-2", "text": "Prior work has shown that word embeddings capture human stereotypes, including gender bias." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-3", "text": "However, there is a lack of studies testing the presence of specific gender bias categories in word embeddings across diverse domains." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-4", "text": "This paper aims to fill this gap by applying the WEAT bias detection method to four sets of word embeddings trained on corpora from four different domains: news, social networking, biomedical and a gender-balanced corpus extracted from Wikipedia (GAP)." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-5", "text": "We find that some domains are definitely more prone to gender bias than others, and that the categories of gender bias present also vary for each set of word embeddings." 
}, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-6", "text": "We detect some gender bias in GAP." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-7", "text": "We also propose a simple but novel method for discovering new bias categories by clustering word embeddings." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-8", "text": "We validate this method through WEAT's hypothesis testing mechanism and find it useful for expanding the relatively small set of wellknown gender bias word categories commonly used in the literature." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-9", "text": "----------------------------------" }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-11", "text": "Artificial intelligence (AI) acquired from machine learning is becoming more prominent in decisionmaking tasks in areas as diverse as industry, healthcare and education." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-12", "text": "AI-informed decisions depend on AI systems' input training data which, unfortunately, can contain implicit racial, gender or ideological biases." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-13", "text": "Such AI-informed decisions can thus lead to unfair treatment of certain groups." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-51", "text": "Instead of relying on reaction times, WEAT relies on cosine similarity." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-14", "text": "For example, in Natural Language Processing (NLP), r\u00e9sum\u00e9 search engines can produce rankings that disadvantage some candidates, when these ranking algorithms take demographic features into account (directly or indirectly) (Chen et al., 2018) , while abusive online language detection systems have been observed to produce false positives on terms associated with minorities and women (Dixon et al., 2018; Park et al., 2018) ." 
}, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-15", "text": "Another example where bias (specifically gender bias) can be harmful is in personal pronoun coreference resolution, where systems carry the risk of relying on societal stereotypes present in the training data (Webster et al., 2018) ." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-16", "text": "Whilst gender bias in the form of concepts of masculinity and femininity has been found inscribed in implicit ways in AI systems more broadly (Adam, 2006) , this paper focuses on gender bias on word embeddings." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-17", "text": "Word embeddings are one of the most common techniques for giving semantic meaning to words in text and are used as input in virtually every neural NLP system (Goldberg, 2017) ." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-18", "text": "It has been shown that word embeddings capture human biases (such as gender bias) present in these corpora in how they relate words to each other (Bolukbasi et al., 2016; Caliskan et al., 2017; Garg et al., 2018) ." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-19", "text": "For the purposes of this paper, gender bias is understood as the inclination towards or prejudice against one gender." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-20", "text": "Several methods have been proposed to test for the presence of gender bias in word embeddings; an example being the Word Embedding Association Test (WEAT) (Caliskan et al., 2017) ." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-21", "text": "WEAT is a statistical test that detects bias in word embeddings using cosine similarity and averaging methods, paired with hypothesis testing." 
}, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-22", "text": "WEAT's authors applied these tests to the publicly-available GloVe embeddings trained on the English-language \"Common Crawl\" corpus (Pennington et al., 2014) as well as the Skip-Gram (word2vec) embeddings trained on the Google News corpus (Mikolov et al., 2013) ." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-23", "text": "However, there is a diverse range of publicly-available word embeddings trained on corpora of different domains." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-24", "text": "To address this, we applied the WEAT test on four sets of word embeddings trained on corpora from four domains: social media (Twit-ter), a Wikipedia-based gender-balanced corpus (GAP) and a biomedical corpus (PubMed) and news (Google News, in order to reproduce and validate our results against those of Caliskan et al. (2017) ) (see Section 3)." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-25", "text": "Caliskan et al. (2017) confirmed the presence of gender bias using three categories of words wellknown to be prone to exhibit gender bias: (B1) career vs. family activities, (B2) Maths vs. Arts and (B3) Science vs. Arts." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-26", "text": "Garg et al. (2018) expanded on this work and tested additional gender bias word categories: (B4) differences on personal descriptions based on intelligence vs. appearance and on (B5) physical or emotional strength vs. weakness." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-27", "text": "In this paper, we use these five categories to test for the presence of gender bias in the aforementioned domain corpora." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-28", "text": "Notice that one of the tested corpora is the gender-balanced GAP corpus (Webster et al., 2018) ." 
}, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-29", "text": "We specifically chose this corpus in order to test whether the automatic method used to compile it (based on sampling an equal number of male and female pronouns from Wikipedia) yielded a set that was balanced according to these five well-known gender bias word categories." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-30", "text": "GAP's authors acknowledge that Wikipedia has been found to contain gender biased content (Reagle and Rhue, 2011) ." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-31", "text": "We confirmed bias in all five categories on the Google News embeddings but far less bias on the rest of the embeddings, with the biomedical PubMed embeddings showing the least bias." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-32", "text": "We did find some bias on GAP." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-33", "text": "However, given the small size of this corpus, many test words were not present (see Section 4)." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-34", "text": "The six word categories studied here are word lists manually curated by Psychology researchers based on their studies (e.g. Greenwald et al., 1998) ." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-35", "text": "However, it is difficult to establish whether they are exhaustive as there could be other word categories presenting bias, which may well be domain-dependant." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-36", "text": "In response, we developed a simple method to automatically discover new categories of gender bias words based on word clustering, and measuring statistical associations of the words in each cluster to known female and male attribute words." 
}, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-37", "text": "Assuming that each cluster roughly represents a topic in the corpus, the set of gender bias words in each cluster/topic in the corpus corresponds to a potentially new category of gender-biased words." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-38", "text": "As far as we are aware, this is the first time a method to discover new gender bias word categories is proposed." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-39", "text": "We used WEAT's hypothesis testing mechanism to automatically validate the induced gender bias word categories produced by our system." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-40", "text": "A visual inspection on a sample of these induced categories is consistent with the authors' intuitions of gender bias." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-41", "text": "We make these induced categories available to other researchers to study." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-42", "text": "1 An advantage of this discovery method is that it allows us to detect bias based on a corpus' own vocabulary, even if it is small, as is the case in the GAP corpus embeddings." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-43", "text": "----------------------------------" }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-44", "text": "**PREVIOUS WORK**" }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-45", "text": "In word embeddings, words are represented in a continuous vector space where semantically similar words are mapped to nearby points (Goldberg, 2017, ch. 10)." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-46", "text": "The underlying assumption is that words that appear in similar contexts share similar meaning (Harris, 1954; Miller and Charles, 1991) ." 
}, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-47", "text": "This context-based similarity is operationalised through cosine similarity, a well-established method for measuring the semantic similarity of words in vector space (Sch\u00fctze, 1998) ." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-48", "text": "Recently, however, researchers noticed that cosine similarity was able to exhibit gender biases captured through training on corpora and started developing methods for mitigating this bias (Bolukbasi et al., 2016) ." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-49", "text": "Caliskan et al. (2017) then developed the Word Embedding Association Test (WEAT), which is an adaptation of the Implicit Association Test (IAT) from Psychology (Greenwald et al., 1998) to measure biases in word embeddings." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-50", "text": "The IAT measures a person's automatic association between mental representations of concepts, based on their reaction times." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-52", "text": "WEAT is based on two statistical measures: (1) the effect size in terms of Cohen's d, which measures the association between suspected gender biased words and two sets of reference words (attribute words in WEAT's terminology) known to be intrinsically male and female, respectively; and (2) a statistical hypothesis test that confirms this association." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-53", "text": "We borrow these statistical measures in this paper." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-54", "text": "Garg et al. (2018) measured gender bias synchronically across historical data covering 100 years of English language use." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-55", "text": "Most work however has concentrated in meth-ods for mitigating gender bias in word embeddings." 
}, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-56", "text": "One approach is debiasing learnt corpora (Bolukbasi et al., 2016) , which is achieved using algorithms that modify word embeddings in such a way that neutralises stereotypical cosine similarities." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-57", "text": "Another approach is creating gender-balanced corpora, such as the GAP corpus (balanced corpus of Gendered Ambiguous Pronouns) (Webster et al., 2018) ." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-58", "text": "Roughly speaking, GAP was developed by sampling sentences from Wikipedia in such a way that an equal number of male and female personal pronouns was obtained." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-59", "text": "Its main use is in the evaluatiation of systems that resolve the coreference of gendered ambiguous pronouns in English." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-60", "text": "In a similar vain, Dixon et al. (2018) builds a balanced corpora that seeks to neutralise toxic mentions of identity terms." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-61", "text": "To the best of our knowledge there has not been work testing for bias on corpora from different domains." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-62", "text": "Also, we believe this is the first time an unsupervised method for discovering new gender bias word categories from word embeddings is proposed." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-63", "text": "----------------------------------" }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-64", "text": "**CHOICE OF WORD EMBEDDINGS**" }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-65", "text": "English-language word embeddings were selected with the intention of giving an insight into gender bias over a range of domains and with the expectation that some word embeddings would demonstrate much more bias than others." 
}, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-66", "text": "The word embeddings selected were: (a) Skip-Gram embeddings trained on the Google News corpus 2 , with a vocabulary of 3M word types (Mikolov et al., 2013) ; (b) Skip-Gram embeddings trained on 400 million Twitter micro-posts 3 , with a vocabulary of slightly more than 3M word types (Godin et al., 2015) ; (c) Skip-Gram embeddings trained on the PubMed Central Open Access subset (PMC) and PubMed 4 , with a vocabulary of about 2.2M word types (Chiu et al., 2016) and trained using two different sliding window sizes: 2 and 30 words; (d) FastText embeddings trained on the GAP corpus (Webster et al., 2018) by us 5 , with a vocabulary of 7,400 word types." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-67", "text": "----------------------------------" }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-68", "text": "**WEAT HYPOTHESIS TESTING 4.1 EXPERIMENTAL PROTOCOL**" }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-69", "text": "We largely follow the WEAT Hypothesis testing protocol introduced by Caliskan et al. (2017) ." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-70", "text": "The input is a suspected gender bias word category represented by two lists, X and Y , of target words, i.e. words which are suspected to be biased to one or another gender." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-71", "text": "E.g. X = {programmer, engineer, scientist}, Y = {nurse, teacher, librarian}. We wish to test whether X or Y is more biased to one gender or the other, or whether there is not difference in bias between the two lists." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-72", "text": "Bias is compared in relation to two reference lists of words that represent unequivocally male and female concepts." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-73", "text": "E.g. M = {man, male, he}, F = {woman, female, she}. 
In WEAT's terminology these reference lists are called the attribute words." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-74", "text": "Table 1 shows the target and attribute word sets used in our experiments." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-75", "text": "The null hypothesis H_o is that there is no difference between X and Y in terms of their relative (cosine) similarity to M and F." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-76", "text": "Assuming that there is a word embedding vector w (trained on some corpus from some domain) for each word w in X, Y, M and F, we compute the following test statistic:" }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-77", "text": "s(X, Y, M, F) = \u2211_{x\u2208X} s(x, M, F) \u2212 \u2211_{y\u2208Y} s(y, M, F) (1), where s(w, M, F) is the measure of association between target word w and the attribute words in M and F: s(w, M, F) = mean_{m\u2208M} cos(w, m) \u2212 mean_{f\u2208F} cos(w, f) (2)." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-78", "text": "In Caliskan et al. (2017) , H_o is tested through a permutation test, in which X \u222a Y is exhaustively partitioned into alternative target lists X\u0302 and \u0176 and the one-sided p-value p[s(X\u0302, \u0176, M, F) > s(X, Y, M, F)] is computed, i.e. the proportion of partition permutations X\u0302, \u0176 in which the test statistic s(X\u0302, \u0176, M, F) is greater than the observed test statistic s(X, Y, M, F)." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-79", "text": "This p-value is the probability that H_o is true." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-80", "text": "In other words, it is the probability that there is no difference between X and Y (in relation to M and F) and therefore that the word category is not biased." 
}, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-81", "text": "The" }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-82", "text": "----------------------------------" }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-83", "text": "**ATTRIBUTE WORDS**" }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-84", "text": "M male, man, boy, brother, he, him, his, son, father, uncle, grandfather F female, woman, girl, sister, she, her, hers, daughter, mother, aunt, grandmother (Hoeffding, 1952; Noreen, 1989) with a maximum of 100,000 iterations in each test." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-85", "text": "----------------------------------" }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-86", "text": "**WEAT RESULTS**" }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-87", "text": "Before experimentation we expected to find a great deal of gender bias across the Google News and Twitter embedding sets and far less in the PubMed and GAP sets." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-88", "text": "However, results in Table 2 are somewhat different to our expectations:" }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-89", "text": "Google News We detect statistically significant (p-values in bold) gender bias in all 5 categories (B1-B5) on this corpus." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-90", "text": "Although one would hope to find little gender bias in a news corpus, given that its authors are professional journalists, bias had already been detected by Caliskan et al. (2017) and Garg et al. (2018) using methods similar to ours." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-91", "text": "This is not surprising given that women represent only a third (33.3%) of the full-time journalism workforce (Byerly, 2011) ." 
}, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-92", "text": "In addition, it has been found that news coverage of female personalities more frequently mentions family situations and is more likely to invoke matters of superficial nature, such as personality, appearance and fashion decisions, whereas the focus on men in news coverage tends to be be given to their experience and accomplishments (Armstrong et al., 2006) ." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-93", "text": "Twitter On this social media set, we surprisingly only detected bias on the career vs. family (B1) category, although science vs. maths (B2) is a borderline case with a p-value of just 0.0715, and the rest of the values are not particularly high." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-94", "text": "We also observe that most effect sizes (Cohen's d) are under 1, indicating relatively weaker associations with the gender-specific attribute words from Table 1." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-95", "text": "We leave for future work further analysis on this set, however we hypothesise that the idiosyncratic language use common in micro-blogging, such as non-standard spelling and hashtags, divide up the semantic signal of word embeddings, perhaps diluting their association bias." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-96", "text": "Indeed, the word categories showing most gender bias in the discovery experiments (Section 5) include many hashtags, punctuation marks and words with nonstandard spellings such as \"alwaaaaays\", which will not be tested for bias using standard-spelling target words." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-97", "text": "PubMed This biomedical set showed the least gender bias, which was expected given its scientific nature." 
}, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-98", "text": "However, it has been documented that gender bias exists in biomedical studies given that more clinical studies involve males than females, and also based on the differences in which male and female patients report pain and other medical complaints and, in turn, the differences in which male and female health practitioners record and understand these complaints (Fillingim et al., 2009) ." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-99", "text": "It is possible that gender bias is still present in these texts but it is manifested differently and perhaps cannot be detected through word embed- dings." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-100", "text": "Also of note is that across all five categories, bias is greater (smaller p-values) on the 30-word window set than on the 2-word window set." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-101", "text": "It is known that window size affects semantic similarity: larger window sizes tend to capture broader, more topical similarities between words whilst smaller windows capture more linguistic or even syntactic similarities (Goldberg, 2017, Sec. 10.5) ." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-102", "text": "We leave for future work further analysis on the bias effects of window sizes." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-103", "text": "GAP Whilst GAP was specifically developed with gender balance in mind, we did find some degree of gender bias." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-104", "text": "In fact, given that it is derived from a gender-biased source text (Wikipedia), we actually expected to measure a higher degree of gender bias." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-105", "text": "This relatively low bias measurement could be due in part to the fact that GAP's vocabulary lacks many of the attribute and target word lists used in the tests." 
}, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-106", "text": "Table 3 shows the number of out-of-vocabulary words from these lists in PubMed and GAP (Google News and Twitter did not have any out-of-vocabulary words)." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-107", "text": "Notice that the category missing most target words (intelligence vs. appearance category, B4) shows the least bias." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-108", "text": "However, the second category that misses most words (strength vs weakness, B5) does indeed show bias to a medium-high effect size of 0.77." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-130", "text": "The effect size (Cohen's d) was quite high across all clusters, averaging 1.89 for Google News, 1.87 for Twitter, 1.88 for PubMed and 1.67 for GAP." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-109", "text": "This difficulty in assessing the reliability of these tests, in the face of a relatively high number of out-of-vocabulary attribute and target words, is one of the reasons that inspired us to develop a method for discovering new categories of biased words from an embedding set's own vocabulary." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-110", "text": "Section 5 covers this method." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-111", "text": "----------------------------------" }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-112", "text": "**DISCOVERING NEW GENDER BIAS WORD CATEGORIES**" }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-113", "text": "We propose a method for automatically detecting new categories of gender-biased words from a word embedding set." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-114", "text": "The simplest method in- volves constructing a list of male-and femalebiased words from a word embedding vocabulary through eq. (2)." 
}, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-115", "text": "However, the resulting list would not have a topical or semantic cohesion as the categories B1-B5 have." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-116", "text": "We propose instead to first cluster the word vectors in an embedding set and then return a list of male-and female-associated word lists per cluster." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-117", "text": "We expect these clusterbased biased word lists to be more topically cohesive." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-118", "text": "By controlling for the number and size of clusters it should be possible to find more or less fine-grained categories." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-119", "text": "We cluster embedings using K-Means++ (Arthur and Vassilvitskii, 2007) , as implemented by scikit-learn (Pedregosa et al., 2011) , using 100 clusters for GAP and 3,000 for Google News, Twitter and PubMed (window size 30 only)." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-120", "text": "This algorithm was chosen as it is fast and produces clusters of comparable sizes." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-121", "text": "For each cluster we then return the list of n most male-and female-associated words (as per eq. 2): these are the discovered gender bias word categories candidates." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-122", "text": "Table 4 shows a selection of these candidates." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-123", "text": "6 Upon visual inspection, most of these candidates seem to be somewhat cohesive." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-124", "text": "We notice that on Google News and GAP many of the clusters relate to people's names (Google News cluster 2369) whilst others mix people's names with more obviously biased words (Google News cluster 2995 and most GAP clusters)." 
}, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-125", "text": "It is clear that this method detects thematically-cohesive groups of gender-associated words." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-126", "text": "However, not all words seem to be genuinely gender biased in a harmful way." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-127", "text": "We leave for future work the development of a filtering or classification step capable of making this distinction." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-128", "text": "In order to test whether the candidates' bias is statistically significant, we applied the full WEAT hypothesis testing protocol, using randomised tests of 1,000 iterations per cluster to make the computation tractable." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-129", "text": "All clusters across all embedding sets returned a p-value < 0.001." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-131", "text": "We leave for future work to conduct a human-based experiment involving experts on gender bias on different domains and languages other than English to further validate our outputs." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-132", "text": "Emphasis will be placed on assessing the usefulness of this tool for domains and languages lacking or seeking to develop lists of gender bias word categories." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-133", "text": "----------------------------------" }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-134", "text": "**CONCLUSIONS AND FUTURE WORK**" }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-135", "text": "We have shown that there are varying levels of bias for word embeddings trained on corpora of different domains and that within the embeddings, there are different categories of gender bias present." 
}, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-136", "text": "We have also developed a method to discover potential new word categories of gender bias." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-137", "text": "Whilst our clustering method discovers new gender-associated word categories, the induced topics seem to mix harmless gender-associated words (like people names) with more obviously harmful gender-biased words." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-138", "text": "So as a future development, we would like to develop a classifier to distinguish between harmless gender-associated words and harmful gender-biased words." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-139", "text": "We wish to involve judgements by experts on gender bias in this effort, as well as exploiting existing thematic word categories from lexical databases like WordNet (Fellbaum, 1998) , ontologies and terminologies." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-140", "text": "At the same time, we will also seek to measure the negative impact of discovered categories in NLP systems' performance." }, { "sent_id": "a334cda78f8ba6dea709809f0999b6-C001-141", "text": "We also wish to more closely investigate the relationships between different word embedding hyperparameters, such sliding window size in the PubMed set, and their learned bias." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "a334cda78f8ba6dea709809f0999b6-C001-18" ], [ "a334cda78f8ba6dea709809f0999b6-C001-20" ], [ "a334cda78f8ba6dea709809f0999b6-C001-78" ] ], "cite_sentences": [ "a334cda78f8ba6dea709809f0999b6-C001-18", "a334cda78f8ba6dea709809f0999b6-C001-20", "a334cda78f8ba6dea709809f0999b6-C001-78" ] }, "@SIM@": { "gold_contexts": [ [ "a334cda78f8ba6dea709809f0999b6-C001-24" ], [ "a334cda78f8ba6dea709809f0999b6-C001-69" ], [ "a334cda78f8ba6dea709809f0999b6-C001-90" ] ], "cite_sentences": [ "a334cda78f8ba6dea709809f0999b6-C001-24", "a334cda78f8ba6dea709809f0999b6-C001-69", "a334cda78f8ba6dea709809f0999b6-C001-90" ] }, "@USE@": { "gold_contexts": [ [ "a334cda78f8ba6dea709809f0999b6-C001-24" ], [ "a334cda78f8ba6dea709809f0999b6-C001-69" ] ], "cite_sentences": [ "a334cda78f8ba6dea709809f0999b6-C001-24", "a334cda78f8ba6dea709809f0999b6-C001-69" ] } } }, "ABC_cc992a7a918858f9e04b9bb5c15c3f_23": { "x": [ { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-2", "text": "We use referential translation machines (RTMs) for predicting the semantic similarity of text." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-3", "text": "RTMs are a computational model effectively judging monolingual and bilingual similarity while identifying translation acts between any two data sets with respect to interpretants." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-4", "text": "RTMs pioneer a language independent approach to all similarity tasks and remove the need to access any task or domain specific information or resource." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-5", "text": "RTMs become the 2nd system out of 13 systems participating in Paraphrase and Semantic Similarity in Twitter, 6th out of 16 submissions in Semantic Textual Similarity Spanish, and 50th out of 73 submissions in Semantic Textual Similarity English." 
}, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-6", "text": "We present positive results from a fully automated judge for semantic similarity based on Referential Translation Machines (Bi\u00e7ici and Way, 2014b) in two semantic similarity tasks at SemEval-2015, Semantic Evaluation Exercises -International Workshop on Semantic Evaluation (Nakov et al., 2015)." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-7", "text": "Referential translation machine (RTM) is a computational model for identifying the acts of translation for translating between any given two data sets with respect to a reference corpus selected in the same domain." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-8", "text": "An RTM model is based on the selection of interpretants, training data close to both the training set and the test set, which allow shared semantics by providing context for similarity judgments." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-9", "text": "Each RTM model is a data translation and translation prediction model between the instances in the training set and the test set and translation acts are indicators of the data transformation and translation." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-10", "text": "RTMs present an accurate and language independent solution for making semantic similarity judgments." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-11", "text": "RTMs pioneer a computational model for quality and semantic similarity judgments in monolingual and bilingual settings using retrieval of relevant training data (Bi\u00e7ici and Yuret, 2015) as interpretants for reaching shared semantics." 
}, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-12", "text": "RTMs achieve (i) top performance when predicting the quality of translations (Bi\u00e7ici, 2013; Bi\u00e7ici and Way, 2014a) ; (ii) top performance when predicting monolingual cross-level semantic similarity; (iii) second performance when predicting paraphrase and semantic similarity in Twitter (iv) good performance when judging the semantic similarity of sentences; (iv) good performance when evaluating the semantic relatedness of sentences and their entailment (Bi\u00e7ici and Way, 2014b) ." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-13", "text": "RTMs use Machine Translation Performance Prediction (MTPP) System Bi\u00e7ici and Way, 2014b) , which is a state-of-the-art (SoA) performance predictor of translation even without using the translation." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-14", "text": "MTPP system measures the coverage of individual test sentence features found in the training set and derives indicators of the closeness of test sentences to the available training data, the difficulty of translating the sentence, and the presence of acts of translation for data transformation." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-15", "text": "MTPP features for translation acts are provided in (Bi\u00e7ici and Way, 2014b) ." 
}, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-16", "text": "RTMs become the 2nd system out of 13 systems participating in Paraphrase and Semantic Similarity in Twitter (Task 1) (Xu et al., 2015) and achieve good results in Semantic Tex-56" }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-17", "text": "----------------------------------" }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-18", "text": "**REFERENTIAL TRANSLATION MACHINE (RTM)**" }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-19", "text": "We present positive results from a fully automated judge for semantic similarity based on Referential Translation Machines (Bi\u00e7ici and Way, 2014b) in two semantic similarity tasks at SemEval-2015, Semantic Evaluation Exercises -International Workshop on Semantic Evaluation (Nakov et al., 2015) ." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-20", "text": "Referential translation machine (RTM) is a computational model for identifying the acts of translation for translating between any given two data sets with respect to a reference corpus selected in the same domain." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-21", "text": "An RTM model is based on the selection of interpretants, training data close to both the training set and the test set, which allow shared semantics by providing context for similarity judgments." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-22", "text": "Each RTM model is a data translation and translation prediction model between the instances in the training set and the test set and translation acts are indicators of the data transformation and translation." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-23", "text": "RTMs present an accurate and language independent solution for making semantic similarity judgments." 
}, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-24", "text": "RTMs pioneer a computational model for quality and semantic similarity judgments in monolingual and bilingual settings using retrieval of relevant training data (Bi\u00e7ici and Yuret, 2015) as interpretants for reaching shared semantics." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-25", "text": "RTMs achieve (i) top performance when predicting the quality of translations (Bi\u00e7ici, 2013; Bi\u00e7ici and Way, 2014a) ; (ii) top performance when predicting monolingual cross-level semantic similarity; (iii) second performance when predicting paraphrase and semantic similarity in Twitter (iv) good performance when judging the semantic similarity of sentences; (iv) good performance when evaluating the semantic relatedness of sentences and their entailment (Bi\u00e7ici and Way, 2014b) ." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-26", "text": "RTMs use Machine Translation Performance Prediction (MTPP) System Bi\u00e7ici and Way, 2014b) , which is a state-of-the-art (SoA) performance predictor of translation even without using the translation." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-27", "text": "MTPP system measures the coverage of individual test sentence features found in the training set and derives indicators of the closeness of test sentences to the available training data, the difficulty of translating the sentence, and the presence of acts of translation for data transformation." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-28", "text": "MTPP features for translation acts are provided in (Bi\u00e7ici and Way, 2014b) ." 
}, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-29", "text": "RTMs become the 2nd system out of 13 systems participating in Paraphrase and Semantic Similarity in Twitter (Task 1) (Xu et al., 2015) and achieve good results in Semantic Tex-" }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-30", "text": "tual Similarity (Task 2) (Agirre et al., 2015) becoming 6th out of 16 submissions in Spanish." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-31", "text": "We use the Parallel FDA5 instance selection model for selecting the interpretants (Bi\u00e7ici et al., 2014; Bi\u00e7ici and Yuret, 2015) , which allows efficient parameterization, optimization, and implementation of Feature Decay Algorithms (FDA), and build an MTPP model." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-32", "text": "We view that acts of translation are ubiquitously used during communication: the RTM algorithm." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-80", "text": "R+L correspond to using the features from both R and L, which doubles the number of features." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-33", "text": "Our encouraging results in the semantic similarity tasks increase our understanding of the acts of translation we ubiquitously use when communicating and how they can be used to predict semantic similarity." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-34", "text": "RTMs are powerful enough to be applicable in different domains and tasks with good performance." 
}, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-35", "text": "We describe the tasks we participated as follows:" }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-36", "text": "ParSS Paraphrase and Semantic Similarity in Twitter (ParSS) (Xu et al., 2015) :" }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-37", "text": "Given two sentences S 1 and S 2 in the same language, produce a similarity score indicating whether they express a similar meaning: a discrete real number in [0, 1]." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-38", "text": "We model as sentence MTPP between S 1 to S 2 ." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-39", "text": "STS Semantic Textual Similarity (STS) (Agirre et al., 2015) :" }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-40", "text": "Given two sentences S 1 and S 2 in the same language, quantify the degree of similarity: a real number in [0, 5] ." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-41", "text": "STS is in English and Spanish (a real number in [0, 4])." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-42", "text": "We model as sentence MTPP of S 1 and S 2 ." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-43", "text": "----------------------------------" }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-44", "text": "**SEMEVAL-15 RESULTS**" }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-45", "text": "We develop individual RTM models for each task and subtask that we participate at SemEval-2015 with the RTM-DCU team name." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-46", "text": "Interpretants are selected from the LM corpora distributed by the translation task of WMT14 (Bojar et al., 2014) and LDC for English (Parker et al., 2011) and Spanish (\u00c2ngelo Mendon\u00e7a et al., 2011) 1 ." 
}, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-47", "text": "We use the Stanford POS tagger (Toutanova et al., 2003) to obtain the lemmatized corpora for the ParSS task." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-48", "text": "The number of instances we select for the interpretants 1 English Gigaword 5th, Spanish Gigaword 3rd edition." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-49", "text": "in each task is given in Table 1 ." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-50", "text": "----------------------------------" }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-51", "text": "**RTM-DCU RESULTS**" }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-52", "text": "We use ridge regression (RR), support vector regression (SVR), and extremely randomized trees (TREE) (Geurts et al., 2006) as the learning models." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-53", "text": "These models learn a regression function using the features to estimate a numerical target value." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-54", "text": "We also use them after a dimensionality reduction and mapping step with partial least squares (PLS) (Specia et al., 2009 )." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-55", "text": "We optimize the learning parameters, the number of dimensions used for PLS, and the parameters for parallel FDA5." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-56", "text": "More details about the optimization processes are in (Bi\u00e7ici and Way, 2014b; Bi\u00e7ici et al., 2014) ." 
}, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-81", "text": "----------------------------------" }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-57", "text": "We optimize the learning parameters by selecting \u03b5 close to the standard deviation of the noise in the training set (Bi\u00e7ici, 2013) since the optimal value for \u03b5 is shown to have linear dependence to the noise level for different noise models (Smola et al., 1998) ." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-58", "text": "At testing time, the predictions are bounded to obtain scores in the corresponding ranges." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-59", "text": "We use Pearson's correlation (r P ), mean absolute error (MAE), and relative absolute error (RAE) for evaluation:" }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-60", "text": "We define MAER and MRAER for easier replication and comparability with relative errors for each instance:" }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-61", "text": "MAER is the mean absolute error relative to the magnitude of the target and MRAER is the mean absolute error relative to the absolute error of a predictor always predicting the target mean assuming that target mean is known." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-62", "text": "MAER and MRAER are capped from below 2 with = MAE(\u0177, y)/2, which is the measurement error and it is estimated as half of the mean absolute error or deviation of the predictions from target mean." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-63", "text": "represents half of the score step with which a decision about a change in measurement's value can be made." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-64", "text": "is similar to half of the standard deviation, \u03c3, of the data but over absolute differences." 
}, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-65", "text": "For discrete target scores, = step size 2" }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-66", "text": "." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-67", "text": "A method for learning decision thresholds for mimicking the human decision process when determining whether two translations are equivalent is described in (Bi\u00e7ici, 2013) ." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-68", "text": "MAER and MRAER are able to capture averaged fluctuations at the instance level and they may evaluate the performance of a predictor at performance prediction tasks at the instance level (e.g. performance of the similarity of sentences, performance of translation of different translation instances) better." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-69", "text": "RAE compares sums of prediction errors and MRAER averages instance prediction error comparisons." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-70", "text": "(Xu et al., 2015) ." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-71", "text": "Official evaluation metric is Pearson's correlation score, which we use to select the top systems on the training set." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-72", "text": "RTM-DCU results on the ParSS test set are given in Table 2 ." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-73", "text": "The setting R using SVR becomes 2nd out of 13 systems and 3rd out of 25 submissions." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-74", "text": "Looking at MAE and MAER allows us to obtain explanations to train and test performance differences for example without knowing their target distribution." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-75", "text": "Even though MAE of PLS-SVR is about %5 smaller on the ParSS test set, MAER is %55 smaller due to test set containing fewer zero entries (%16 vs. %39 on the train set)." 
}, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-76", "text": "Lower test MAE than training MAE may be attributed to RTMs." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-77", "text": "We obtained results with lemmatized datasets and further optimized the learning model parameters after the challenge." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-78", "text": "We present the performance of the top 5 individual RTM models on the training set in Table 3 ." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-79", "text": "R uses the regular truecase (Koehn et 2007; Koehn, 2010) corpora and L uses the lemmatized truecased corpora." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-82", "text": "**TASK 2: SEMANTIC TEXTUAL SIMILARITY (STS)**" }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-83", "text": "STS contains sentence pairs from different domains: answers-forums, answers-students, belief, headlines, and images for English and wikipedia and newswire for Spanish." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-84", "text": "Official evaluation metric in STS is the Pearson's correlation score." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-85", "text": "We build separate RTM models for headlines and images domains for STS English." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-86", "text": "Domain specific RTM models obtain improved performance in those domains (Bi\u00e7ici and Way, 2014b) ." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-87", "text": "STS English test set contains 2000, 1500, 2000, 1500, and 1500 sentences respectively from the specified domains however for evaluation, STS use a subset of the test set, 375, 750, 375, 750, and 750 instances respectively from the corresponding domains." 
}, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-88", "text": "This may lower the performance of RTMs by causing FDA5 to select more domain specific data and less task specific since RTMs use the test set to select interpretants and build a task specific RTM prediction model." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-89", "text": "(Bi\u00e7ici and Way, 2014b) are presented in Table 6, where we have used the top results from domain specific RTM models for headlines and images domains in the overall model results." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-90", "text": "Top 3 individual RTM model performance on the training set with further optimized learning model parameters after the challenge are presented in Table 7 ." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-91", "text": "Better r P , RAE, and MRAER on the test set than on the training set in STS 2015 English may be attributed to RTMs." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-92", "text": "----------------------------------" }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-93", "text": "**RTMS ACROSS TASKS AND YEARS**" }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-94", "text": "We compare the difficulty of tasks according to MRAER where the correlation of RAE and MRAER is 0.89." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-95", "text": "In Table 8 , we list the RAE, MAER, and MRAER obtained for different tasks and subtasks, also listing RTM results from SemEval-2013 , from SemEval-2014 (Bi\u00e7ici and Way, 2014b) , and and from quality estimation task (QET) (Bi\u00e7ici and Way, 2014a ) of machine translation (Bojar et al., 2014) ." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-96", "text": "RTMs at SemEval-2013 contain results from STS." 
}, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-97", "text": "RTMs at SemEval-2014 contain results from STS, semantic relatedness and entailment (SRE) (Marelli et al., 2014) , and cross-level semantic similarity (CLSS) tasks (Jurgens et al., 2014) ." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-98", "text": "RTMs at WMT2014 QET contain tasks involving the prediction of an integer in [1, 3] representing post-editing effort (PEE), a real number in [0, 1] representing human-targeted translation edit rate (HTER), or an integer representing post-editing time (PET) of translations." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-99", "text": "The best results are obtained for the CLSS paragraph to sentence subtask, which may be due to the larger contextual information that paragraphs can provide for the RTM models." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-100", "text": "For the ParSS task, we can only reduce the error with respect to knowing and predicting the mean by about 22.5%." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-101", "text": "Prediction of bilingual similarity as in quality estimation of translation can be expected to be harder and RTMs achieve SoA performance in this task as well (Bi\u00e7ici and Way, 2014a) ." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-102", "text": "Table 8 can be used to evaluate the difficulty of various tasks and domains based on our SoA predictor RTM." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-103", "text": "MRAER considers both the predictor's error and the target scores' fluctuations at the instance level." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-104", "text": "We separated the results having MRAER greater than 1 as in these tasks and subtasks RTM does not perform significantly better than mean predictor and fluctuations render these as tasks that may require more work." 
}, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-105", "text": "----------------------------------" }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-106", "text": "**CONCLUSION**" }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-107", "text": "Referential translation machines pioneer a clean and intuitive computational model for automatically measuring semantic similarity by measuring the acts of translation involved and achieve to become the 2nd system out of 13 systems participating in Paraphrase and Semantic Similarity in Twitter, 6th out of 16 submissions in Semantic Textual Similarity Spanish, and 50th out of 73 submissions in Semantic Textual Similarity English." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-108", "text": "RTMs make quality and semantic similarity judgments possible based on the retrieval of relevant training data as interpretants for reaching shared semantics." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-109", "text": "We define MAER, mean absolute error relative to the magnitude of the target, and MRAER, mean absolute error relative to the absolute error of a predictor always predicting the target mean assuming that target mean is known." }, { "sent_id": "cc992a7a918858f9e04b9bb5c15c3f-C001-110", "text": "RTM test performance on various tasks sorted according to MRAER can identify which tasks and subtasks may require more work." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "cc992a7a918858f9e04b9bb5c15c3f-C001-6" ], [ "cc992a7a918858f9e04b9bb5c15c3f-C001-12" ], [ "cc992a7a918858f9e04b9bb5c15c3f-C001-13" ], [ "cc992a7a918858f9e04b9bb5c15c3f-C001-15" ], [ "cc992a7a918858f9e04b9bb5c15c3f-C001-25" ], [ "cc992a7a918858f9e04b9bb5c15c3f-C001-26" ], [ "cc992a7a918858f9e04b9bb5c15c3f-C001-28" ], [ "cc992a7a918858f9e04b9bb5c15c3f-C001-56" ], [ "cc992a7a918858f9e04b9bb5c15c3f-C001-86" ], [ "cc992a7a918858f9e04b9bb5c15c3f-C001-95" ] ], "cite_sentences": [ "cc992a7a918858f9e04b9bb5c15c3f-C001-6", "cc992a7a918858f9e04b9bb5c15c3f-C001-12", "cc992a7a918858f9e04b9bb5c15c3f-C001-13", "cc992a7a918858f9e04b9bb5c15c3f-C001-15", "cc992a7a918858f9e04b9bb5c15c3f-C001-25", "cc992a7a918858f9e04b9bb5c15c3f-C001-26", "cc992a7a918858f9e04b9bb5c15c3f-C001-28", "cc992a7a918858f9e04b9bb5c15c3f-C001-56", "cc992a7a918858f9e04b9bb5c15c3f-C001-86", "cc992a7a918858f9e04b9bb5c15c3f-C001-95" ] }, "@USE@": { "gold_contexts": [ [ "cc992a7a918858f9e04b9bb5c15c3f-C001-19" ] ], "cite_sentences": [ "cc992a7a918858f9e04b9bb5c15c3f-C001-19" ] }, "@SIM@": { "gold_contexts": [ [ "cc992a7a918858f9e04b9bb5c15c3f-C001-19" ] ], "cite_sentences": [ "cc992a7a918858f9e04b9bb5c15c3f-C001-19" ] } } }, "ABC_86af8f2cc08b00821a7a83abdfd964_23": { "x": [ { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-7", "text": "----------------------------------" }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-2", "text": "In this paper, we study direct transfer methods for multilingual named entity recognition." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-3", "text": "Specifically, we extend the method recently proposed by T\u00e4ckstr\u00f6m et al. (2012) , which is based on cross-lingual word cluster features." 
}, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-4", "text": "First, we show that by using multiple source languages, combined with self-training for target language adaptation, we can achieve significant improvements compared to using only single source direct transfer." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-5", "text": "Second, we investigate how the direct transfer system fares against a supervised target language system and conclude that between 8,000 and 16,000 word tokens need to be annotated in each target language to match the best direct transfer system." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-6", "text": "Finally, we show that we can significantly improve target language performance, even after annotating up to 64,000 tokens in the target language, by simply concatenating source and target language annotations." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-9", "text": "Recognition of named entities in natural language text is an important subtask of information extraction and thus bears importance for modern text mining and information retrieval applications." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-10", "text": "The need to identify named entities such as persons, locations, organizations and places, arises both in applications where the entities are first class objects of interest, such as in Wikification of documents (Ratinov et al., 2011) , and in applications where knowledge of named entities is helpful in boosting performance, e.g., machine translation (Babych and Hartley, 2003) and question answering (Leidner et al., 2003) ." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-11", "text": "The advent of massive machine readable factual databases, such as Freebase 1 and the proposed Wikidata 2 , will likely push the need for automatic extraction tools further." 
}, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-12", "text": "While these databases store information about entity types and the relationships between those types, the named entity recognition (NER) task concerns finding occurrences of named entities in context." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-13", "text": "This view originated with the Message Understanding Conferences (MUC) (Grishman and Sundheim, 1996) ." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-14", "text": "As with the majority of tasks in contemporary natural language processing, most approaches to NER have been based on supervised machine learning." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-15", "text": "However, although resources for a handful of languages have been created, through initiatives such as MUC, the Multilingual Entity Task (Merchant et al., 1996) and the CoNLL shared tasks (Tjong Kim Sang, 2002; Tjong Kim Sang and De Meulder, 2003) , coverage is still very limited in terms of both domains and languages." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-16", "text": "With fine-grained entity taxonomies such as that proposed by Sekine and Nobata (2004) , who define over two hundred categories, we can expect an increase in the amount of annotated data required for acceptable performance, as well as an increased annotation cost for each entity occurrence." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-17", "text": "Although semi-supervised approaches have been shown to reduce the need for manual annotation (Freitag, 2004; Miller et al., 2004; Ando and Zhang, 2005; Suzuki and Isozaki, 2008; Lin and Wu, 2009; Turian et al., 2010; Dhillon et al., 2011; T\u00e4ckstr\u00f6m et al., 2012) , these methods still require a substantial amount of manual annotation for each target language." 
}, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-18", "text": "Manually creating a sufficient amount of annotated resources for all entity types in all languages thus seems like an Herculean task." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-19", "text": "In this study, we turn to direct transfer methods T\u00e4ckstr\u00f6m et al., 2012) as a way to combat the need for annotated resources in all languages." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-20", "text": "These methods allow one to train a system for a target language, using only annotations in some source language, as long as all source language features also have support in the target languages." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-21", "text": "Specifically, we extend the direct transfer method proposed by T\u00e4ckstr\u00f6m et al. (2012) in two ways." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-22", "text": "First, in \u00a73, we use multiple source languages for training." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-23", "text": "We then propose a self-training algorithm, which allows for the inclusion of additional target language specific features, in \u00a74." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-24", "text": "By combining these extensions, we achieve significant error reductions on all tested languages." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-25", "text": "Finally, in \u00a75, we assess the viability of the different direct transfer systems compared to a supervised system trained on target language annotations, and conclude that direct transfer methods may be useful even in this scenario." 
}, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-26", "text": "----------------------------------" }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-27", "text": "**DIRECT TRANSFER FOR CROSS-LINGUAL NER**" }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-28", "text": "Rather than starting from scratch when creating systems that predict linguistic structure in one language, we should be able to take advantage of any corresponding annotations that are available in other languages." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-29", "text": "This idea is at the heart of both direct transfer methods T\u00e4ckstr\u00f6m et al., 2012) and of annotation projection methods (Yarowsky et al., 2001; Diab and Resnik, 2002; Hwa et al., 2005) ." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-30", "text": "While the aim of the latter is to transfer annotations across languages, direct transfer methods instead aim to transfer systems, trained on some source language, directly to other languages." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-31", "text": "In this paper, we focus on direct transfer methods, however, we briefly discuss the relationship between these approaches in \u00a76." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-32", "text": "Considering the substantial differences between languages at the grammatical and lexical level, the prospect of directly applying a system trained on one language to another language may seem bleak." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-33", "text": "However, showed that a language independent dependency parser can indeed be created by training on a delexicalized treebank and by only incorporating features defined on universal part-of-speech tags (Das and Petrov, 2011) ." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-34", "text": "Recently, T\u00e4ckstr\u00f6m et al. 
(2012) developed an algorithm for inducing cross-lingual word clusters and proposed to use these clusters to enrich the feature space of direct transfer systems." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-35", "text": "The richer set of cross-lingual features was shown to substantially improve on direct transfer of both dependency parsing and NER from English to other languages." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-36", "text": "Cross-lingual word clusters are clusterings of words in two (or more) languages, such that the clusters are adequate in each language and at the same time consistent across languages." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-37", "text": "For cross-lingual word clusters to be useful in direct transfer of linguistic structure, the clusters should capture cross-lingual properties on both the semantic and syntactic level." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-38", "text": "T\u00e4ckstr\u00f6m et al. (2012) showed that this is, at least to some degree, achievable by coupling monolingual class-based language models, via word alignments." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-39", "text": "The basic building block is the following simple monolingual class-based language model (Saul and Pereira, 1997; Uszkoreit and Brants, 2008): L(w; C) = \u220f_i p(w_i | C(w_i)) \u00b7 p(C(w_i) | C(w_{i\u22121}))," }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-40", "text": "where L(w; C) is the likelihood of a sequence of words, w, and C is a (hard) clustering function, which maps words to cluster identities." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-41", "text": "These monolingual models are coupled through word alignments, which constrain the clusterings to be consistent across languages, and are optimized by approximately maximizing the joint likelihood across languages." 
}, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-42", "text": "Just as monolingual word clusters are broadly applicable as features in monolingual models for linguistic structure prediction (Turian et al., 2010) , the resulting cross-lingual word clusters can be used as features in various crosslingual direct transfer models." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-43", "text": "We believe that the extensions that we propose are likely to be useful for other tasks as well, e.g., direct transfer dependency parsing, in this paper, we focus solely on discriminative direct transfer models for NER." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-44", "text": "----------------------------------" }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-45", "text": "**MULTI-SOURCE DIRECT TRANSFER**" }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-46", "text": "Learning from multiple languages have been shown to be of benefit both in unsupervised learning of syntax and part-of-speech (Snyder et al., 2009; BergKirkpatrick and Klein, 2010) and in transfer learning of dependency syntax (Cohen et al., 2011; 256 cross-lingual word clusters and the same feature templates as T\u00e4ckstr\u00f6m et al. (2012) , with the exception that the transition factors are not conditioned on the input." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-47", "text": "3 The features used are similar to those used by Turian et al. (2010) , but include cross-lingual rather than monolingual word clusters." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-48", "text": "We remove the capitalization features when transferring to German, but keep them in all other cases, even when German is included in the set of source languages." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-49", "text": "We use the training, development and test data sets provided by the CoNLL 2002/2003 shared tasks (Tjong Kim Sang, 2002; Tjong Kim Sang and De Meulder, 2003) ." 
}, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-50", "text": "The multi-source training sets are created by concatenating each of the source languages' training sets." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-51", "text": "In order to have equivalent label sets across languages, we use the IO (inside/outside) encoding, rather than the BIO (begin/inside/outside) encoding, since the latter is available only for Spanish and Dutch." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-52", "text": "The models are trained using CRFSuite 0.12 (Okazaki, 2007) , by running stochastic gradient descent for a maximum of 100 iterations." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-53", "text": "Table 1 shows the result of using different source languages for different target languages." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-54", "text": "We see that multi-source transfer is somewhat helpful in general, but that the results are sensitive to the combination of source and target languages." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-55", "text": "On average, using all source languages only give a relative error reduction of about 3% on the test set." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-56", "text": "However, results for" }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-57", "text": "Return adapted model end procedure \u2020 If LEARN(\u00b7) supports instance weighting, we could weight each instance (x, y * ) \u2208 F i by p \u03b8 i\u22121 (y * |x) in the training objective, rather than performing sampling according to the same distribution." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-58", "text": "supervised target language models trained with different cluster features, 4 these clusters are not optimally adapted to the target language, compared to the monolingual native clusters that are induced solely on the target language, without any cross-lingual constraints." 
}, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-59", "text": "This is to be expected, as the probabilistic model used to learn the cross-lingual clusters strikes a balance between two language specific models." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-60", "text": "On the other hand, this suggests an opportunity for adapting to target language specific features through self-training." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-61", "text": "In fact, since the direct transfer models are trained using cross-lingual features, the target language can be viewed as simply representing a different domain from the source language." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-62", "text": "Self-training has previously been shown to be a simple and effective way to perform domain adaptation for syntactic parsers and other tasks (McClosky et al., 2006; Chen et al., 2011) ." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-63", "text": "The idea of selftraining for domain adaptation is to first train a supervised predictor on labeled instances from a source domain." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-64", "text": "This predictor is then used to label instances from some unlabeled target domain." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-144", "text": "First, we looked at the scenario where no annotated resources are available in the target language." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-65", "text": "Those instances for which the predictor is confident are added to the source training set, and the process is repeated until some stopping criterion is met." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-66", "text": "Recently, Daum\u00e9 et al. (2010) and Chen et al. (2011) proposed more complex domain adaptation techniques, based on cotraining." 
}, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-67", "text": "In this work, however, we stick with the simple single-view self-training approach just outlined." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-68", "text": "In the self-training for domain adaptation method, described by Chen et al. (2011) , the top-k instances for which the predictor is most confident are added to the training set in each iteration." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-69", "text": "We instead propose to weight the target instances selected for self-training in each iteration proportional to the confidence of the classifier trained in the previous iteration." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-70", "text": "In short, let x \u2208 D u t be an unlabeled target language input sequence (in our case a sentence) and y * \u2208 Y t (x) its top-ranked label sequence (in our case an IO sequence)." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-71", "text": "In the first iteration, a predictor is trained on the labeled source language data, D l s ." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-72", "text": "In each subsequent iteration the sequences are scored according to the probabilities assigned by the predictor trained in the previous iteration, p \u03b8 i\u22121 (y * |x)." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-73", "text": "When constructing the training set for the next iteration, we first filter out all instances for which the top-ranked label sequence is not \u03b4-dominating." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-74", "text": "That is, we filter out all instances x \u2208 D t u such that p \u03b8 i\u22121 (y * |x) < \u03b4, for some user-specified \u03b4." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-75", "text": "In this work, we set \u03b4 = 0.5, since this guarantees that the output associated with each instance that is kept is assigned the majority of the probability mass." 
}, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-76", "text": "This is important, as we only consider the most likely output y * for each input x, so that sampling low-confidence instances will result in a highly biased sample." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-77", "text": "After filtering, we sample from the remaining instances, i.e. from the set of instances x \u2208 D t u such that p \u03b8 i\u22121 (y * |x) \u2265 \u03b4, adding each instance (x, y * ) to the training set with probability p \u03b8 i\u22121 (y * |x)." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-78", "text": "This procedure is repeated for T iterations as outlined in Algorithm 1." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-79", "text": "By using instance weighting rather than a top-k list, we remove the need to heuristically set the number of instances to be selected for selftraining in each iteration." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-80", "text": "Further, although we have not verified this empirically, we hypothesize that using instance weighting is more robust than picking only the most confident instances, as it maintains diversity in the training set in the face of uncertainty." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-81", "text": "Note also that when we have access to target language test data during training, we can perform transductive learning by including the test set in the pool of unlabeled data." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-82", "text": "This gives the model the opportunity to adapt to the characteristics of the test domain." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-83", "text": "Our use of self-training for exploiting features na- tive to the target language resembles the way McDonald et al. (2011) re-lexicalize a delexicalized direct transfer parser." 
}, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-84", "text": "Both methods allow the model to move weights from shared parameters to more predictive target language specific parameters." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-85", "text": "However, rather than using the direct transfer parser's own predictions through self-training, these authors project head-modifier relations to the target language through loss-augmented learning ." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-86", "text": "The bootstrapping methods for language independent NER of Cucerzan and Yarowsky (1999) have a similar effect." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-87", "text": "Our self-training approach is largely orthogonal to these approaches." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-88", "text": "We therefore believe that combining these methods could be fruitful." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-89", "text": "----------------------------------" }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-90", "text": "**EXPERIMENTS**" }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-91", "text": "In these experiments we combine direct transfer with self-training using unlabeled target data." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-92", "text": "This is the transductive setting, as we include the test data (with labels removed, of course) in the unlabeled target data." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-93", "text": "We investigate the effect of adding self-training (SELF) to the single-source and multi-source transfer settings of \u00a73, where only cross-lingual features are used (SINGLE and MULTI, respectively)." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-94", "text": "We further study the effect of including native monolingual word cluster features in addition to the cross-lingual features (SELF/NATVE)." 
}, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-95", "text": "The experimental settings and datasets used are the same as those described in \u00a73." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-96", "text": "We performed self-training for T = 5 iterations for all languages, as preliminary experiments indicated that the procedure converges to a stable solution after this number of iterations." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-97", "text": "CRFSuite was used to compute all the required probabilities for the filtering and sampling steps." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-98", "text": "The results of these experiments are shown in Table 3." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-99", "text": "By itself, self-training without target specific features result in an average relative error reduction of less than 4%, compared to the baseline direct transfer system." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-100", "text": "This is only slightly better than the improvement achieved with multi-source transfer." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-101", "text": "However, when adding target specific features, selftraining works better, with a 7% reduction." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-102", "text": "Combining multi-source transfer with self-training, without target specific features, performs even better with a 10% reduction." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-103", "text": "Finally, combining multi-source transfer and self-training with target specific features, gives the best result across all three languages, with an average relative error reduction of more than 14%." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-104", "text": "The results for German are particularly interesting, in that they highlight a rather surprising general trend." 
}, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-105", "text": "The relative improvement achieved by combining multi-source transfer and self training with native clusters is almost twice as large as that achieved when using only self-training with native clusters, despite the fact that multi-source transfer is not very effective on its own -in the case of German, multisource transfer actually hurts results when used in isolation." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-106", "text": "One explanation for this behavior could be that the regularization imposed by the use of multiple source languages is beneficial to self-training, in that it generates better confidence estimates." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-107", "text": "Another, perhaps more speculative, explanation could be that each source language shares different characteristics with the target language." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-108", "text": "Even though the predictions on the target language are not much better on average in this case, as long as a large enough subset of the confident predictions are better than with singlesource transfer, these predictions can be exploited during self-training." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-109", "text": "In addition to using self-training with native word cluster features, we also experimented with creating target language specific versions of the cross-lingual features by means of the feature duplication trick (Daum\u00e9, 2007) ." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-110", "text": "However, preliminary experiments suggested that this is not an effective strategy in the cross-lingual direct transfer scenario." 
}, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-111", "text": "It thus seems likely that the significant improvements that we observe are at least in part explained by the fact that the native features are distinct from the cross-lingual features and not mere duplicates." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-112", "text": "----------------------------------" }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-113", "text": "**DIRECT TRANSFER VS. SUPERVISED LEARNING**" }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-114", "text": "Finally, we look at the relative performance of the different direct transfer methods and a target language specific supervised system trained with native and cross-lingual word cluster features." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-115", "text": "For these experiments we use the same settings as for the experiments in \u00a73 and \u00a74.1." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-116", "text": "Figures 1-3 show the learning curves for the supervised system, as more and more target language annotations, selected by picking sentences at random from the full training set, are added to the training set, compared to the same system when combined with different direct transfer methods." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-117", "text": "From these curves, we can see that the purely supervised model requires between 8,000 and 16,000 annotated word tokens (roughly corresponding to between 430 and 860 sentences) in each target language to match the best direct transfer system." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-118", "text": "The learning curves also show that adding source language data improves performance with as many as 64,000 annotated target language tokens." 
}, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-119", "text": "Although we believe that the results on combining source and target data are interesting, in practice the marginal cost of annotation is typically quite low compared to the initial cost." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-120", "text": "Therefore, the cost of going from 125 to 64,000 annotated tokens is likely not too high, so that the benefit of cross-lingual transfer is small on the margin in this scenario." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-121", "text": "However, we believe that direct transfer methods can reduce the initial cost as well, especially when a larger label set is used, since a larger label set implies a larger cognitive load throughout annotation, but especially in the initial phase of the annotation." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-122", "text": "Another aspect, which we were unable to investigate is the relative performance of these methods on domains other than news text." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-123", "text": "It is well known that the performance of supervised NER systems drop significantly when applied to data outside of the training domain (Nothman et al., 2008) ." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-145", "text": "We showed that multi-source direct transfer and self-training with additional features, exclusive to the target language, both bring benefits in this setting, but that combining these methods provide an even larger advantage." 
}, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-124", "text": "Although the direct transfer systems in these experiments are also trained on news data, we suspect that the advantage of these methods will be more pronounced when applied to other domains, since the supervised target system runs a higher risk of overfitting to the characteristics of the target language training domain compared to the direct transfer system, which has already to some degree overfitted to the source language." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-125", "text": "----------------------------------" }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-126", "text": "**DISCUSSION**" }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-127", "text": "We have focused on direct transfer methods that exploit cross-lingual word clusters, which are induced with the help of word alignments." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-128", "text": "A more common use of word alignments for cross-lingual linguistic structure prediction is for projecting annotations across languages (Yarowsky et al., 2001; Diab and Resnik, 2002; Hwa et al., 2005) ." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-129", "text": "Apart from the algorithmic differences between these approaches, there are more fundamental differences in terms of the assumptions they make." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-130", "text": "Annotation projection relies on the construction of a mapping from structures in the source language to structures in the target language, Y s \u2192 Y t ." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-131", "text": "Based on the direct correspondence assumption (Diab and Resnik, 2002; Hwa et al., 2005) , word alignments are assumed to be a good basis for this mapping." 
}, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-132", "text": "When projecting annotations, no consideration is taken to the source language input space, X s , nor to the target language input space, X t , except implicitly in the construction of the word alignments." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-133", "text": "The learning algorithm is thus free to use any parameters when training on instances from X t \u00d7 Y t , but can at the same time not exploit any additional information that may be present in X s \u00d7 Y s about X t \u00d7 Y t ." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-134", "text": "Furthermore, word alignments are noisy and often only provide partial information about the target side annotations." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-135", "text": "Direct transfer, on the other hand, makes a stronger assumption, as it relies on a mapping from the joint space of source inputs and output structures to the target language, X s \u00d7 Y s \u2192 X t \u00d7 Y t ." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-136", "text": "Actually, the assumption is even stronger, since in order to achieve low error on the target language with a discriminative model, we must further assume that the conditional distribution P (Y t |X t ) does not diverge too much from P (Y t |X t ) in regions where P (X t ) is large." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-137", "text": "This suggests that direct transfer might be preferable when source and target languages are sufficiently similar so that a good mapping can be found." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-138", "text": "These differences suggest that it may be fruitful to combine direct transfer with annotation projection." 
}, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-139", "text": "For example, direct transfer could be used to first map X s \u00d7 Y s \u2192 X t \u00d7 Y t , while annotation projection could be used to derive constraints on the target output space by means of a mapping Y s \u2192 Y t ." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-140", "text": "These constraints could perhaps be exploited in self-training, e.g., through posterior regularization (Ganchev et al., 2010) , or be used for co-training (Blum and Mitchell, 1998) ." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-141", "text": "----------------------------------" }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-142", "text": "**CONCLUSIONS**" }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-143", "text": "We investigated several open questions regarding the use of cross-lingual word clusters for direct transfer named entity recognition." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-146", "text": "We then examined the rate with which a supervised system, trained with cross-lingual and native word cluster features, approaches the performance of the direct transfer system." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-147", "text": "We found that on average between 8,000 and 16,000 word tokens need to be annotated in each target language to match our best direct transfer system." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-148", "text": "We also found that combining native and crosslingual word clusters leads to improved results across the board." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-149", "text": "Finally, we showed that direct transfer methods can aid even in the supervised target language scenario." 
}, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-150", "text": "By simply mixing annotated source language data with target language data, we can significantly reduce the annotation burden required to reach a given level of performance in the target language, even with up to 64,000 tokens annotated in the target language." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-151", "text": "We hypothesize that more elaborate domain adaptation techniques, such as that proposed by Chen et al. (2011) , can lead to further improvements in these scenarios." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-152", "text": "Our use of cross-lingual word clusters is orthogonal to several other approaches discussed in this paper." }, { "sent_id": "86af8f2cc08b00821a7a83abdfd964-C001-153", "text": "We therefore suggest that such clusters could be of general use in multilingual learning of linguistic structure, in the same way that monolingual word clusters have been shown to be a robust way to bring improvements in many monolingual applications (Turian et al., 2010; T\u00e4ckstr\u00f6m et al., 2012) ." 
} ], "y": { "@DIF@": { "gold_contexts": [ [ "86af8f2cc08b00821a7a83abdfd964-C001-3" ], [ "86af8f2cc08b00821a7a83abdfd964-C001-21" ], [ "86af8f2cc08b00821a7a83abdfd964-C001-46" ] ], "cite_sentences": [ "86af8f2cc08b00821a7a83abdfd964-C001-3", "86af8f2cc08b00821a7a83abdfd964-C001-21", "86af8f2cc08b00821a7a83abdfd964-C001-46" ] }, "@EXT@": { "gold_contexts": [ [ "86af8f2cc08b00821a7a83abdfd964-C001-3" ], [ "86af8f2cc08b00821a7a83abdfd964-C001-21" ] ], "cite_sentences": [ "86af8f2cc08b00821a7a83abdfd964-C001-3", "86af8f2cc08b00821a7a83abdfd964-C001-21" ] }, "@BACK@": { "gold_contexts": [ [ "86af8f2cc08b00821a7a83abdfd964-C001-17" ], [ "86af8f2cc08b00821a7a83abdfd964-C001-29" ], [ "86af8f2cc08b00821a7a83abdfd964-C001-34" ], [ "86af8f2cc08b00821a7a83abdfd964-C001-153" ] ], "cite_sentences": [ "86af8f2cc08b00821a7a83abdfd964-C001-17", "86af8f2cc08b00821a7a83abdfd964-C001-29", "86af8f2cc08b00821a7a83abdfd964-C001-34", "86af8f2cc08b00821a7a83abdfd964-C001-153" ] }, "@SIM@": { "gold_contexts": [ [ "86af8f2cc08b00821a7a83abdfd964-C001-19" ] ], "cite_sentences": [ "86af8f2cc08b00821a7a83abdfd964-C001-19" ] }, "@USE@": { "gold_contexts": [ [ "86af8f2cc08b00821a7a83abdfd964-C001-19" ] ], "cite_sentences": [ "86af8f2cc08b00821a7a83abdfd964-C001-19" ] } } }, "ABC_918caabbc0bcad04cd07761b29e767_23": { "x": [ { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-21", "text": "Testing data is TREC-10." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-48", "text": "----------------------------------" }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-49", "text": "**METHOD**" }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-2", "text": "This paper presents a re-examiniation of previous work on machine learning techniques for questions classification, as well as results from new experiments." 
}, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-3", "text": "The results suggest that some of the work done in the field have yielded biased results." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-4", "text": "The results also suggest that Na\u00efve Bayes, Decision Trees and Support Vector Machines perform on par with each other when faced with actual users' questions." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-5", "text": "----------------------------------" }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-7", "text": "One of the most important factors for a question answering system to succeed is the ability to correctly identify the expected answer's semantic type (Moldovan et al., 2002) ." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-8", "text": "This paper presents results from an evaluation of five different machine learning approaches to question classification (Na\u00efve Bayes, k Nearest Neighbours, Decision Tree Learning, Sparse Network of Winnows, and Support Vector Machines)." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-9", "text": "The paper also presents a review of earlier work on question classification as well as results from experiments using slightly different data than used in previous work." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-10", "text": "The reason for re-examining the results from previous work is that only performance in terms of accuracy has been reported in the literature." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-11", "text": "No significance testing has been made to see if there really is a difference in results between learners." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-12", "text": "Furthermore, the data used in many of the experiments have been submitted to manual selection, and also the test data is slightly different from the training data." 
}, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-13", "text": "----------------------------------" }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-14", "text": "**THE QUESTION CLASSIFICATION TASK**" }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-15", "text": "Question classification can loosely be defined as the task of given a question (represented by a set of features), assign the question to a single or a set of categories (answer types)." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-16", "text": "Adopting the formal definition of text categorization (Sebastiani, 2002) to the problem of question classification, the task can be defined as follows: Question classification is the task of assigning a boolean value to each pair q j , c i \u2208 Q\u00d7C, where Q is the domain of questions and C = {c 1 , c 2 , . . . , c |C| } is a set of predefined categories." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-17", "text": "The task therefore requires a taxonomy of answer types according to which questions should be categorized on the on hand, and a means for actually making this classification on the other." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-18", "text": "Radev et al. (2002) experiment with machine learning for question classification using decision rule learning with set-valued features." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-19", "text": "This is a standard decision tree/rule approach that has been augmented in that instead of being restricted to features with single values, the values can also be a set of values." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-20", "text": "The answer type taxonomy consists of 17 types, and the training data is TREC-8 and TREC-9 data." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-22", "text": "In the experiment, questions are represented by 13 features, 9 of which are semantic features based on WordNet." 
}, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-23", "text": "Li and Roth (2002) use a Sparse Network of Winnows (SNoW) to classify questions with respect to their expected answer type." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-24", "text": "The taxonomy consists of 6 coarse and 50 fine semantic classes." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-25", "text": "The training corpus used consists of 5,500 questions." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-26", "text": "Some of these are manually constructed, while other stems from the TREC-8 and TREC-9 conferences." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-27", "text": "The test corpus comprise 500 questions from the TREC-10 conference." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-28", "text": "The input to the classifiers is a list of features." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-29", "text": "The features used were words, part-ofspeech tags, chunks, named entities, head chunks (e.g. the first noun chunk in a sentence), and semantically related words (words that often occur with a specific question class)." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-30", "text": "Apart from these primitive features, a set of operators were used to compose more complex features." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-31", "text": "Zhang and Lee (2003) used the same taxonomy as Li and Roth (2002) , as well as the same training and testing data." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-32", "text": "In an initial experiment they compared different machine learning approaches with regards to the question classification problem: Nearest Neighbors (NN), Na\u00efve Bayes (NB), Decision Trees (DT), SNoW, and Support Vector Machines." 
}, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-33", "text": "The feature extracted and used as input to the machine learning algorithms in the initial experiment was bag-of-words and bag-of-ngrams (all continuous word sequences in the question)." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-34", "text": "Questions were represented as binary feature vectors." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-35", "text": "In a second experiment the linear kernel of the SVM was replaced with a tree kernel developed by the authors." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-36", "text": "Suzuki et al. (2003b) used a SVM with a hierarchical directed acyclic graph kernel (Suzuki et al., 2003a) for the question classification problem." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-37", "text": "The answer type taxonomy used consists of 150 different types." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-38", "text": "The corpus used was in Japanese and consisted of 1011 questions from NTCIR-QAC, 2000 questions of CRL-QA data, and 2000 other questions reported to be of TREC-style (Suzuki et al., 2002) ." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-39", "text": "After removing answer types with too few (less than 10) examples, a total of 68 answer types were actually used." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-40", "text": "Hacioglu and Ward (2003) used a SVM with error correcting codes to convert the multi-class classification problem into a number of binary ones." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-41", "text": "In essence each class is assigned a codeword of 1's and -1's of length m, where m equals or is greater than the number of classes." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-42", "text": "This splits the multi-class data into m binary class data." 
}, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-43", "text": "Therefore, m SVM classifiers can be designed and their output combined." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-44", "text": "The SVM:s also used linear kernels." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-45", "text": "The same taxonomy, training and testing data was used as in Li and Roth (2002)" }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-46", "text": "----------------------------------" }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-47", "text": "**PREVIOUS WORK**" }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-50", "text": "In order to compare the five algorithms (Na\u00efve Bayes (NB), k Nearest Neighbours (kNN), Decision Tree Learning (DT), Sparse Network of Winnows (SNoW), and Support Vector Machines (SVM)) significance testing have been used." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-51", "text": "Significance scores can not be found in any previous work on question classification and hence it is difficult to draw any real conclusions from this work." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-52", "text": "For present purposes the micro and macro sign tests established by Yang and Liu (1999) have been used." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-53", "text": "Thses were originally developed for the text categorization task, but as question classification bears many resemblances and can be seen as a special case of text categorization." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-54", "text": "The taxonomy used is the taxonomy proposed by Li and Roth (2002) ." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-55", "text": "This taxonomy has been chosen since it is the most frequently used one in earlier work in the field (Li and Roth, 2002; Zhang and Lee, 2003; Hacioglu and Ward, 2003) ." 
}, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-56", "text": "The corpora used is both the corpus constructed and tagged by Li and Roth (2002) , as well as a newly tagged corpus extracted from the AnswerBus logs." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-57", "text": "AnswerBus is a question answering system that has been online and logged real users questions." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-58", "text": "The AnswerBus corpus consists of 25,000 questions." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-59", "text": "For present purposes 2,000 questions have been selected and tagged according to the aforementioned taxonomy." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-60", "text": "Questions are in all experiments treated as a bag-of-words and represented as binary feature vectors." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-61", "text": "The results will be reported in terms of microand macro-averaged precision, recall and F-score." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-62", "text": "Micro-averaged precision and recall are dominated by the large categories, whereas macro-averaged precision and recall illustrate how well a classifier performs across all categories." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-63", "text": "Micro-averaged precision is denoted as \u03c0 \u00b5 , macro-averaged precision as \u03c0 M , micro-averaged recall as \u03c1 \u00b5 , and macroaveraged recall as \u03c1 M ." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-64", "text": "Combined measures for micro-averaged results is denoted as F \u00b5 1 while the corresponding macro-averaged measure is denoted as F M 1 ." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-65", "text": "Performance is for the purpose of this paper seen as solely related to accuracy in terms of precision and recall." 
}, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-66", "text": "The learning and classification speed of the algorithms are ignored." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-67", "text": "----------------------------------" }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-68", "text": "**EXPERIMENT 1**" }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-69", "text": "The first experiment is intended to be a straightforward re-examination of previous work to establish what differences in performance there really are between machine learners." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-70", "text": "This experiment has been done under two different settings." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-71", "text": "First, we have used the corpus originally developed by (Li and Roth, 2002) , but since the test corpus used consists of questions solely from TREC-10 and the TREC conferences have a specific agenda the test corpus might be slightly different from the training data." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-72", "text": "Therefore, a second setting was used where the questions from the training and test corpora were pooled together and a randomized test corpus was extracted." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-73", "text": "This will be refered to as the repartitioned corpus." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-74", "text": "The performance of the different learners on setting 1 can be found in table 1, setting 2 in table 2 while significance testing between the learners is shown in table 3 Classifier \u03c0 Table 2 : Performance of classifers on repartitioned TREC data." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-75", "text": "As can be seen in table 1 and 2 the performance of the different learning algorithms with regards to micro-averaged precision and recall is at best equal to and in most cases worse on the repartitioned data than on the original data." 
}, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-76", "text": "When it comes to macroaveraged precision and recall the results are more varied." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-77", "text": "In table 3 we can find differences when comparing the algorithms with regards to significant differences in performance." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-78", "text": "In the table, \"<\" means a Table 3 : Sigificance testing of classifiers on both original and repartitioned TREC data." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-79", "text": "significantly on the .05 level, \" \" and \" \" means a difference on the .01 level." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-80", "text": "NB < SNoW should be read as NB performs significantly worse than SNoW on the .05 level." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-81", "text": "The column \"s-test\" means micro sign test, and \"S-test\" means macro sign test." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-82", "text": "It is interesting to note that where there were no significant differences in performance on the original corpus there now are to some extent differences on the repartitioned corpus and also the other way around to a smaller extent." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-83", "text": "This might be an indication that the training and test corpora in fact are not balanced in the original setting, and some of the results reported in previous work is somewhat biased." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-84", "text": "----------------------------------" }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-85", "text": "**EXPERIMENT 2**" }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-86", "text": "To further investigate the performance of different machine learners in the face of a corpus consisting of actual users' questions a second experiment was conducted." 
}, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-87", "text": "As mentioned earlier, in this setting 2,000 questions from the AnswerBus logs are used, but everything else remains the same as in experiment 1." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-88", "text": "Results in terms of performance is found in Table 4 : Performance of classifiers on AnswerBus data." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-89", "text": "As can be seen in table 4 the performance in terms of micro-averaged precision and recall is higher on the AnswerBus corpus than on any of the TREC corpora." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-90", "text": "When it comes to macro-averaged performanve the results are more varied and it is hard to draw any clear conclusions." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-91", "text": "Table 5 : Sigificance testing of classifiers on AnswerBus data." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-92", "text": "In terms of significant differences between classifiers, the results from the AnswerBus corpus deviates from what could have been expected given the results on the TREC corpora." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-93", "text": "It seems than Na\u00efve Bayes, Decision Trees and Support Vector Machines are on par with each other, while k Nearest Neighbours and Sparse Network of Winnows are sigificantly worse in terms of performance." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-94", "text": "----------------------------------" }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-95", "text": "**CONCLUSIONS**" }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-96", "text": "The results in this paper indicate that some of the results found in previous work (Li and Roth, 2002; Zhang and Lee, 2003; Hacioglu and Ward, 2003) on question classification might be incorrect due to an unbiased training and test corpus." 
}, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-97", "text": "This bias stems from the fact that the training corpus is derived exclusively from TREC-10 data, while the training data stems from other sources." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-98", "text": "Since the TREC conferences have an explicit agenda that shifts from year to year this is perhaps no surprise." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-99", "text": "In relation to this, TREC material is maybe not the best source of information if one is interested in how different machine learners might perform on actual user data." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-100", "text": "----------------------------------" }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-101", "text": "**FUTURE WORK**" }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-102", "text": "The results from experiment 2 in this paper stems from a corpus of 2,000 questions." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-103", "text": "We will go on to categorize 3,000 more questions from the AnswerBus logs and run the learners on this data in order to get even more accurate results." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-104", "text": "This work is well on the way." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-105", "text": "We will also go on to make a deeper analysis of exactly which questions that pose problems for learning algorithms." }, { "sent_id": "918caabbc0bcad04cd07761b29e767-C001-106", "text": "Such work has not been reported in the literature thus far." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "918caabbc0bcad04cd07761b29e767-C001-23" ], [ "918caabbc0bcad04cd07761b29e767-C001-31" ], [ "918caabbc0bcad04cd07761b29e767-C001-55" ], [ "918caabbc0bcad04cd07761b29e767-C001-96" ] ], "cite_sentences": [ "918caabbc0bcad04cd07761b29e767-C001-23", "918caabbc0bcad04cd07761b29e767-C001-31", "918caabbc0bcad04cd07761b29e767-C001-55", "918caabbc0bcad04cd07761b29e767-C001-96" ] }, "@SIM@": { "gold_contexts": [ [ "918caabbc0bcad04cd07761b29e767-C001-45" ], [ "918caabbc0bcad04cd07761b29e767-C001-56" ] ], "cite_sentences": [ "918caabbc0bcad04cd07761b29e767-C001-45", "918caabbc0bcad04cd07761b29e767-C001-56" ] }, "@USE@": { "gold_contexts": [ [ "918caabbc0bcad04cd07761b29e767-C001-45" ], [ "918caabbc0bcad04cd07761b29e767-C001-54" ], [ "918caabbc0bcad04cd07761b29e767-C001-56" ] ], "cite_sentences": [ "918caabbc0bcad04cd07761b29e767-C001-45", "918caabbc0bcad04cd07761b29e767-C001-54", "918caabbc0bcad04cd07761b29e767-C001-56" ] }, "@DIF@": { "gold_contexts": [ [ "918caabbc0bcad04cd07761b29e767-C001-71" ] ], "cite_sentences": [ "918caabbc0bcad04cd07761b29e767-C001-71" ] }, "@EXT@": { "gold_contexts": [ [ "918caabbc0bcad04cd07761b29e767-C001-71" ] ], "cite_sentences": [ "918caabbc0bcad04cd07761b29e767-C001-71" ] } } }, "ABC_fb1061d28dbf80858c1a630621a975_23": { "x": [ { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-39", "text": "----------------------------------" }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-40", "text": "**CONTEXT-AWARE THREAD DETECTION**" }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-64", "text": "----------------------------------" }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-65", "text": "**CATD-MATCH (RIGHT IN**" }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-93", "text": "**EXPERIMENTS**" }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-2", "text": "In multi-party chat, it is common for multiple 
conversations to occur concurrently, leading to intermingled conversation threads in chat logs." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-3", "text": "In this work, we propose a novel Context-Aware Thread Detection (CATD) model that automatically disentangles these conversation threads." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-4", "text": "We evaluate our model on three realworld datasets and demonstrate an overall improvement in thread detection accuracy over state-of-the-art benchmarks." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-5", "text": "----------------------------------" }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-7", "text": "In multi-party chat conversations, such as in Slack 1 , multiple topics are often discussed at the same time (Wang et al., 2019; Zhang et al., 2017) ." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-8", "text": "For example, in Table 1 , Alice and Bob talk about work, while Bob and Chuck chat about lunch, and this results in intermingled messages." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-9", "text": "Automatic conversational thread detection could be used to disentangle and group the messages into their respective topic threads." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-10", "text": "The resulting thread information could in turn be used to improve response relevancy for conversational agents (Shamekhi et al., 2018) or improve chat summarization quality (Zhang and Cranshaw, 2018) ." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-11", "text": "Unlike most of today's Email and Forum systems that use threaded structure by default." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-12", "text": "However, the Instant Messaging systems (e.g., Slack) often require users to manually organize messages in threads." 
}, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-13", "text": "A recent study (Wang et al., 2019) found that users most likely do not manually create threads." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-14", "text": "On average, only 15.3 threads were created per Slack channel with 355 messages, when they discuss group projects." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-15", "text": "Prior work on conversation thread disentanglement is often based on pairwise message compar- ison." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-16", "text": "Some solutions use unsupervised clustering methods with hand-engineered features (Wang and Oard, 2009; Shen et al., 2006) , while others use supervised approaches with statistical (Du et al., 2017) or linguistic features (Wang et al., 2008; Wang and Ros\u00e9, 2010; Elsner and Charniak, 2008 , 2011 Mayfield et al., 2012) ." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-38", "text": "During inference, we use this model to sequentially perform message thread labelling (Subsection 2.2)." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-17", "text": "Recent work by (Jiang et al., 2018; Mayfield et al., 2012) adopt deep learning approaches to compute message pair similarity, using a combination of message content and simple contextual features (e.g., authorship and timestamps)." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-18", "text": "However, linguistic theories (Biber and Conrad, 2019) differentiate the following three concepts: register, genre and style, to describe the text varieties." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-19", "text": "Register refers to the linguistic features such as the choice of words in content." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-20", "text": "Genre and Style refer to the conversational structure such as the sentence sequence and distribution." 
}, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-21", "text": "All aforementioned thread disentanglement methods fail to take into account the contextual information of the thread, or the conversational flow and genre." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-22", "text": "A thread's contextual information is a useful feature for thread-detection because considering the relationship between a single new input message and an existing message alone may not be enough to accurately determine thread membership." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-23", "text": "Hence, using the full thread context history during comparison can improve pre-diction." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-24", "text": "Additionally, the conversational flow and genre may also be useful because (Butler et al., 2002) suggests this represents a conversation's signature." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-25", "text": "For example, we observe that users act distinctively in public Q&A (StackOverflow) and enterprise Q&A (IBM Social Q&A) online community (Wang et al., 2016) , even when they are answering a similar question." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-26", "text": "Based on these hypotheses, we propose two contextaware thread detection (CATD) models." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-27", "text": "The first model (CATD-MATCH) captures contexts of existing threads and computes the distance between the context and the input message; the second model (CATD-FLOW) captures the conversational flow, and computes the language genre consistency while attaching the input message to a thread." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-28", "text": "We also combine them with a dynamic gate for further performance improvement, followed by an efficient beam search mechanism in the inference step." 
}, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-29", "text": "The evaluation proves our approach improves over the existing methods." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-30", "text": "The contribution of this work is two-fold: 1) We propose context-aware deep learning models for thread detection and it advances the state-ofthe-art; 2) Based on the dataset in (Jiang et al., 2018) , we develop and release a more realistic multi-party multi-thread conversation dataset for future research." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-31", "text": "----------------------------------" }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-32", "text": "**METHODOLOGY**" }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-33", "text": "We model thread-detection as a topic detection and tracking task by deciding whether an incoming message starts a new thread or belongs to an existing thread (Allan, 2002) ." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-34", "text": "The goal is to assign each message m i in the sequence, M = {m i } N i=1 , a thread label t i , such that the complete thread label sequence T = {t i } N i=1 contains multiple threads (T 1 , T 2 , \u00b7 \u00b7 \u00b7), where each thread T l contains all messages with the same label." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-35", "text": "We denote" }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-36", "text": "is always before m i , but may not be the last message m i 1 ." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-37", "text": "The training and inference steps are as follows: we train an LSTM-based thread classification model to obtain the membership of the message m i to an existing-or a new thread, given the existing message sequence's thread tags T i 1 = {t j } i 1 j=1 , which form L threads {T l i 1 } L l=1 (Subsection 2.1)." 
}, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-41", "text": "We first adopt the Universal Sentence Encoder 2 with deep averaging network (USE) (Cer et al., 2018) to get a static feature representation for each message in the form of sentence embeddings." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-42", "text": "We encode each message m j as enc(m j ), by concatenating the USE output with two 20dimensional embeddings: (1) User-identity difference between m j and m i ." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-43", "text": "(2) Time difference by mapping the time difference between m j and m i into 11 ranges (from 1 minutes to 72 hours, details in Appendix A)." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-44", "text": "These two features are also used in (Jiang et al., 2018) , and another baseline model GTM uses only these features (Elsner and Charniak, 2008) ." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-45", "text": "Given a message sequence M i 1 , which has been detected with L threads" }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-46", "text": "i 1 indicates that m i starts a new thread." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-47", "text": "As shown in Fig.1 , we adopt a message-level single directional LSTM to encode each thread T l i 1 , whose inputs are enc(\u00b7) of maximum K last messages (set to 20) in the thread, denoted as {m (l,k) } K k=1 ." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-48", "text": "The messages outside that window are viewed as irrelevant to the prediction of m i ." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-49", "text": "In Fig.1 , we propose two CATD models, CATD-FLOW and CATD-MATCH, each one capturing the semantic relationship between the new message and the existing thread contexts." 
}, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-50", "text": "Fig.1 ): This model considers each thread as a conversation flow for a particular topic with its own genre, and the current message should belong to the thread with which it is more likely to form a fluent conversation." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-51", "text": "Therefore, we concatenate the enc(m i ) to each LSTM sequence of T l i 1 , and get the last output e l flow to dot-product a trainable vector w and compute the probability of m i being labeled with T l i 1 ," }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-52", "text": "----------------------------------" }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-53", "text": "**CATD-FLOW (LEFT IN**" }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-54", "text": "where is a scaling hyper-parameter." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-55", "text": "We differentiate T L+1 i 1 from other existing threads by introducing a parameter u as the only input for LSTM." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-56", "text": "Fig.1) : An alternative view of determining the thread tag of m i is to find , the last K messages for each thread T l i 1 in history are encoded by an LSTM." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-57", "text": "CATD-FLOW concatenates the current message m i as the final step of each LSTM, and gets the last output of LSTMs e l flow to perform thread classification." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-58", "text": "CATD-MATCH runs LSTM on T l i 1 and m i separately, performs attention to obtain the context embeddings and then gets the matching vector e l match for thread classification." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-59", "text": "For a newly-generated thread, we use a parameter u as the only input for its LSTM." 
}, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-60", "text": "a thread T l i 1 semantically closest to m i ." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-61", "text": "Thus we independently encode each thread and m i with the parameter-shared LSTM (Only one LSTM step for m i )." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-62", "text": "In order to dynamically point to more related messages in the thread history, we use an one-way attention, which has been successfully adopted in many NLP tasks (Tan et al., 2016; Hermann et al., 2015) ." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-63", "text": "Specifically, given the sequence outputs of each thread's LSTM, {h l (l,k) } K k=1 , we perform a weighted mean pooling to get a context embedding e l cxt , attended by the one-step LSTMencoded m i , denoted as\u0125 i ." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-66", "text": "Next, we compute a matching vector e l match below, which again computes dot-product w for classification." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-67", "text": "The function N (x) normalizes x by l 2 norm." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-68", "text": "\u2326 is element-wise multiplication." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-69", "text": "CATD-COMBINE: Finally, we propose the dynamically-gated combination of FLOW and MATCH model, such that the e l flow in Eq. 1 is replaced by a combination of the two models, e l combine = (1 g l )e l match + g l e l flow (5)" }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-70", "text": "where w 0 is a parameter vector." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-71", "text": "g l is determined by the distance between N (e l cxt ) and N (\u0125 i )." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-72", "text": "We use this dynamic gate g to linearly combine the two models." 
}, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-73", "text": "g is computed based on the difference of the MATCH vector of context and the input message." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-74", "text": "Intuitively, if they are close, both FLOW and MATCH will be considered equally for prediction." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-75", "text": "Otherwise, the model dynamically computes the weights of MATCH and FLOW." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-76", "text": "Training Procedure: Following (Jiang et al., 2018) , apart from a new thread, we consider the candidate threads (Active Threads) in Eq. 1 only from those appearing in one hour time-frame before m i ." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-77", "text": "During training, we treat the messages of a channel as a single sequence, and optimize Eq. 1 with training examples, containing m i and its active threads." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-78", "text": "Though messages are sorted by time, the training examples are shuffled during training." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-79", "text": "----------------------------------" }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-80", "text": "**THREAD INFERENCE**" }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-81", "text": "During inference, we want to find the optimal thread labeling by maximizing:" }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-82", "text": "where t i are selected from active threads and the new thread." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-83", "text": "However, searching the entire space of T is unfeasible." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-84", "text": "Hence, we resort to Beam Search, a generalized version of greedy search." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-85", "text": "It predicts sequentially from m 1 to m N , while keeping B states in the beam." 
}, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-86", "text": "For each m i , each state in the beam is a candidate T i 1 ." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-87", "text": "Each new state T i is ranked after labeling t i for m i :" }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-88", "text": "where T l i 1 is selected from the active threads in the previous state and a new thread tag." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-89", "text": "The new states with scores lower than top-B candidates are discarded." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-90", "text": "Similar to training, the active threads are also pruned by the \"one-hour\" constraint." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-91", "text": "However, they are not extracted from the groundtruth, but from previously-detected threads." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-92", "text": "----------------------------------" }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-94", "text": "Datasets: We conduct extensive experiments on three publicly available datasets from Reddit datasets." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-95", "text": "We strictly follow (Jiang et al., 2018) to construct our data." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-96", "text": "Comments under a post can be treated as messages in a single conversational thread, and we merge all comments in a" }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-97", "text": "----------------------------------" }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-98", "text": "**GADGETS**" }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-99", "text": "Iphones Politics NMI ARI F1 NMI ARI F1 NMI ARI F1 Table 2: CATD models are compared with baselines wrt." 
}, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-100", "text": "metrics of NMI, ARI and F1 for the three datasets sub-reddit to construct a synthetic dataset of interleaved conversations." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-101", "text": "We take three sub-reddits to build three datasets, Gadgets, IPhones and Politics." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-102", "text": "3 The data statistics and examples are shown in Appendix B." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-103", "text": "Reddit Dataset Improvement: We use the same pre-processing method in (Jiang et al., 2018) : we discard the messages which have less than 10 words or more than 100 words." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-104", "text": "Conversations less than 10 messages are also discarded." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-105", "text": "We guarantee that no more than 10 conversations happen at the same time." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-106", "text": "In their work, different message pairs of the same thread might be included in both train and test sets." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-107", "text": "Instead, we split the datasets on the thread level because in realistic settings, test threads should be completely unseen in train set." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-108", "text": "4" }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-109", "text": "Experimental Setup: We use Adam (Kingma and Ba, 2015) to optimize the training objective (Eq. 1)." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-110", "text": "During training, we fix in Eq. 1 as 10." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-111", "text": "In inference, this value may influence the search quality." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-112", "text": "We set it as 20.0 by the validation accuracy on Politics." 
}, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-113", "text": "We set LSTM output dimensions to 400, the batch size to 10 and the beam size to 5 by default." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-114", "text": "We train 50 epochs and select the model with the best validation-set performance." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-115", "text": "Baseline: (1) CISIR-SHCNN (Jiang et al., 2018) : A recently proposed model based on CNN and ranking message pairs." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-116", "text": "(2) CISIR-USE: We replace CNN encoder in CISIR with a USE to test the effect of different sentence encoders." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-117", "text": "(3) GTM (Elsner and Charniak, 2008) : A graph-theoretical model with chat and content specific features." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-118", "text": "----------------------------------" }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-119", "text": "**MODEL VARIATIONS**" }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-120", "text": "NMI ARI F1 A COMBINE (share) .832 .420 .422 B COMBINE (concat) .828 .446 .448 C SPLIT .824 .417 .420 D FLOW (K=5) .813 .395 .398 E FLOW (K=10) .823 .414 .417 F FLOW (K=20) .826 .420 .423 G MATCH (K=5) .820 .399 .402 H MATCH (K=10) .823 .405 .408 I MATCH (K=20) .831 .427 .430 J MATCH (K=20, bi-LSTM) .832 .428 .430 K COMBINE (K=5) .811 .378 .381 L COMBINE (K=10) .822 .403 .405 M COMBINE (K=20, B=1) ." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-121", "text": "828 .452 .455 N COMBINE (K=20, B=5) .834 .461 .464 O COMBINE (K=20, B=10) ." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-122", "text": "833 .431 .433 Evaluation Metrics: Normalized mutual information (NMI), Adjusted rand index (ARI) and F1 score, following (Jiang et al., 2018) ." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-123", "text": "F1 is computed based on all message pairs in a test set." 
}, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-124", "text": "Also, following their work, we assume the candidate threads of each message for our models and baselines are obtained from the ones which have messages in the previous hour." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-125", "text": "For examples, the CISIR-SHCNN models will take pairs only within the one-hour frame." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-126", "text": "Main Results: Table 2 compares the CATD models and baselines on NMI, ARI and F1." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-127", "text": "CISIR models are generally better than non-deeplearning GTM." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-128", "text": "There is a clear gap between CISIR-USE and our proposed models, which proves our models' improvement is not due to the usage of USE but the new model structures." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-129", "text": "CATD models are significantly superior to all baselines and CATD-COMBINE generally performs best." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-130", "text": "Specifically, all baselines failed on Politics, probably because there are more threads in Politics than the other two datasets (see Appendix B), making disentanglement more difficult. But CATD models achieve better results because they encode ac-tive threads in parallel, while considering longer history in each thread." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-131", "text": "Analysis : In Table 3 , we analyze our models on Politics, the largest dataset." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-132", "text": "First, we examine the effect of K. For all CATD models, with K from 5 to 20 (D-F, G-I, K, L and N), and all metrics improve, showing the importance of the longer history in LSTM." 
}, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-133", "text": "Second, we adopt bidirectional LSTMs (J) for CATD-MATCH, without an obvious improvement, probably because most messages in the datasets can be fully comprehended only with previous history." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-134", "text": "This assumption is consistent with a mild improvement when we increase beam size from 1 (M) to 5 (N)." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-135", "text": "We see a lower ARI with beam size as 10 (O), because of the incorrect candidates at lower ranking positions." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-136", "text": "Finally, the models are generally good when beam=1, enabling an \"online\" detection without knowing the future messages, which can not be directly fulfilled by most pairwise prior work." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-137", "text": "Table 3 , we also shared the LSTM parameters for MATCH and FLOW models (A), with 4% drop on ARI." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-138", "text": "This is because we need two independent LSTMs to capture different linguistic features." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-139", "text": "Next, we combine FLOW and MATCH (B) by concatenating e l flow and e l match , resulting in 1.5% drop on ARI, which proves the benefit of the gate in CATD-COMBINE." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-140", "text": "Also, we break the links between LSTM nodes and perform one-step LSTM on all the history messages (C), leading to over 4% drop on ARI." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-141", "text": "This reflects the necessity of a RNN encoding inter-messages information." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-142", "text": "In Table 4 and 5, we show the analysis for for Gadgets and Iphones datasets similar to Poli-tics dataset in Table 3 ." 
}, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-143", "text": "As compared to Politics, we observe that for Gadgets and Iphones, CATD-FLOW models have some fluctuations in performance when we increase K from 5 to 20, which may be due to the limited capability of LSTMs for memorizing long-term history." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-144", "text": "This issue is more prevalent when the training data size is small." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-145", "text": "NMI ARI F1 A COMBINE (share) .748 .415 .427 B COMBINE (concat) .745 .419 .433 C SPLIT .749 .418 .433 D FLOW (K=5) .742 .411 .425 E FLOW (K=10) .743 .434 .446 F FLOW (K=20) .740 .410 .424 G MATCH (K=5) .744 .385 .398 H MATCH (K=10) .744 .409 .420 I MATCH (K=20) .740 .428 .441 J MATCH (K=20, bi-LSTM) .739 .430 .445 K COMBINE (K=5) .751 .363 .374 L COMBINE (K=10) .751 .419 .429 M COMBINE (K=20, B=1) .750 .431 .444 N COMBINE (K=20, B=5) .750 .434 .445 O COMBINE (K=20, B=10) .750 .434 445" }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-146", "text": "----------------------------------" }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-147", "text": "**CONCLUSION**" }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-148", "text": "We propose context-aware thread detection models to perform thread detection for multi-party chat conversations which take into account threads' contextual information." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-149", "text": "These are integrated into an efficient beam search for inference." }, { "sent_id": "fb1061d28dbf80858c1a630621a975-C001-150", "text": "Our proposed method advances the state-of-the-art." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "fb1061d28dbf80858c1a630621a975-C001-17" ], [ "fb1061d28dbf80858c1a630621a975-C001-115" ] ], "cite_sentences": [ "fb1061d28dbf80858c1a630621a975-C001-17", "fb1061d28dbf80858c1a630621a975-C001-115" ] }, "@USE@": { "gold_contexts": [ [ "fb1061d28dbf80858c1a630621a975-C001-30" ], [ "fb1061d28dbf80858c1a630621a975-C001-95" ], [ "fb1061d28dbf80858c1a630621a975-C001-103" ], [ "fb1061d28dbf80858c1a630621a975-C001-122" ] ], "cite_sentences": [ "fb1061d28dbf80858c1a630621a975-C001-30", "fb1061d28dbf80858c1a630621a975-C001-95", "fb1061d28dbf80858c1a630621a975-C001-103", "fb1061d28dbf80858c1a630621a975-C001-122" ] }, "@SIM@": { "gold_contexts": [ [ "fb1061d28dbf80858c1a630621a975-C001-44" ], [ "fb1061d28dbf80858c1a630621a975-C001-95" ], [ "fb1061d28dbf80858c1a630621a975-C001-103" ], [ "fb1061d28dbf80858c1a630621a975-C001-122" ] ], "cite_sentences": [ "fb1061d28dbf80858c1a630621a975-C001-44", "fb1061d28dbf80858c1a630621a975-C001-95", "fb1061d28dbf80858c1a630621a975-C001-103", "fb1061d28dbf80858c1a630621a975-C001-122" ] }, "@DIF@": { "gold_contexts": [ [ "fb1061d28dbf80858c1a630621a975-C001-76" ] ], "cite_sentences": [ "fb1061d28dbf80858c1a630621a975-C001-76" ] }, "@EXT@": { "gold_contexts": [ [ "fb1061d28dbf80858c1a630621a975-C001-76" ] ], "cite_sentences": [ "fb1061d28dbf80858c1a630621a975-C001-76" ] } } }, "ABC_520588fbf0643725153b07a09430d1_23": { "x": [ { "sent_id": "520588fbf0643725153b07a09430d1-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-2", "text": "Sequence-to-sequence attention-based models integrate an acoustic, pronunciation and language model into a single neural network, which make them very suitable for multilingual automatic speech recognition (ASR)." 
}, { "sent_id": "520588fbf0643725153b07a09430d1-C001-3", "text": "In this paper, we are concerned with multilingual speech recognition on low-resource languages by a single Transformer, one of sequence-to-sequence attentionbased models." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-4", "text": "Sub-words are employed as the multilingual modeling unit without using any pronunciation lexicon." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-5", "text": "First, we show that a single multilingual ASR Transformer performs well on low-resource languages despite of some language confusion." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-6", "text": "We then look at incorporating language information into the model by inserting the language symbol at the beginning or at the end of the original sub-words sequence under the condition of language information being known during training." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-7", "text": "Experiments on CALLHOME datasets demonstrate that the multilingual ASR Transformer with the language symbol at the end performs better and can obtain relatively 10.5% average word error rate (WER) reduction compared to SHL-MLSTM with residual learning." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-8", "text": "We go on to show that, assuming the language information being known during training and testing, about relatively 12.4% average WER reduction can be observed compared to SHL-MLSTM with residual learning through giving the language symbol as the sentence start token." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-9", "text": "----------------------------------" }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-11", "text": "Multilingual speech recognition has been investigated for many years [1, 2, 3, 4, 5] ." 
}, { "sent_id": "520588fbf0643725153b07a09430d1-C001-12", "text": "Conventional studies concentrate on the area of multilingual acoustic modeling by the contextdependent deep neural network hidden Markov models (CD-DNN-HMM) [6] ." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-13", "text": "The hidden layers of DNN in CD-DNN-HMM can be thought of complicated feature transformation through multiple layers of nonlinearity, which can be used to extract universal feature transformation from multilingual datasets [1] ." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-14", "text": "Among the CD-DNN-HMM based approaches, the architecture of SHL-MDNN [1] , in which the hidden layers are shared across multiple languages while the softmax layers are language dependent, is a significant progress in the area of multilingual ASR." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-15", "text": "These shared hidden layers and language dependent softmax layers of SHL-MDNN are optimized jointly by multilingual datasets." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-16", "text": "SHL-MLSTM [5] further explores long short-term memory (LSTM) [7] with residual learning as the shared hidden layer instead of DNN and achieves better results than SHL-MDNN." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-17", "text": "Although these models achieve encouraging results on multilingual ASR tasks, a hand-designed language-specific pronunciation lexicon must be employed." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-18", "text": "This severely limits their application on low-resource languages, which may have not a well-designed pronunciation lexicon." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-19", "text": "Recent researches on sequence-to-sequence attention-based models try to remove this dependency on the pronunciation lexicon [8, 9, 10] ." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-20", "text": "Chiu et al. 
show that an attention-based encoder-decoder architecture, namely listen, attend, and spell (LAS), achieves a new state-of-the-art WER on a 12500-hour English voice search task using word piece models (WPM) [10]." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-21", "text": "Our previous work [9] demonstrates that lexicon-independent models can outperform lexicon-dependent models on Mandarin Chinese ASR tasks with the ASR Transformer, and the character-based model establishes a new state-of-the-art character error rate (CER) on HKUST datasets." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-22", "text": "Since the acoustic, pronunciation and language model are integrated into a single neural network by sequence-to-sequence attention-based models, they are very suitable for multilingual ASR." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-23", "text": "In this paper, we concentrate on multilingual ASR on low-resource languages." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-24", "text": "Building on our work [9], we employ sub-words generated by byte pair encoding (BPE) [11] as the multilingual modeling unit, which does not need any pronunciation lexicon." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-25", "text": "The ASR Transformer is chosen as the basic architecture of the sequence-to-sequence attention-based model [9, 12]." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-26", "text": "To alleviate the problem of scarce training data on low-resource languages, a well-trained ASR Transformer from a high-resource language is adopted as the initial model rather than random initialization, with its softmax layer replaced by a language-specific softmax layer."
}, { "sent_id": "520588fbf0643725153b07a09430d1-C001-27", "text": "We then look at incorporating language information into the model by inserting the language symbol at the beginning or at the end of the original sub-words sequence [13] under the condition of language information being known during training." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-28", "text": "A comparison with SHL-MLSTM [5] with residual learning is investigated on CALL-HOME datasets with 6 languages." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-29", "text": "Experimental results reveal that the multilingual ASR Transformer with the language symbol at the end performs better and can obtain relatively 10.5% average WER reduction compared to SHL-MLSTM with residual learning." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-30", "text": "We go on to show that, assuming the language information being known during training and testing, about relatively 12.4% average WER reduction can be observed compared to SHL-MLSTM with residual learning through giving the language symbol as the sentence start token." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-31", "text": "The rest of the paper is organized as follows." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-32", "text": "After an overview of the related work in Section 2, Section 3 describes the proposed method in detail." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-33", "text": "We then show experimental results in Section 4 and conclude this work in Section 5." 
}, { "sent_id": "520588fbf0643725153b07a09430d1-C001-34", "text": "----------------------------------" }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-35", "text": "**RELATED WORK**" }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-36", "text": "Although multilingual speech recognition has been studied [1, 2, 3, 4, 5] for a long time, these researches are commonly limited to making acoustic model (AM) multilingual, which require language-specific pronunciation model (PM) and language model (LM)." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-37", "text": "Recently, sequence-to-sequence attentionbased models, integrating the AM, PM and LM into a single network, have attracted a lot of attention on multilingual ASR [13, 14, 15, 16] ." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-38", "text": "[14, 15] have presented a single sequence-tosequence attention-based model can be capable of recognizing any of the languages seen in training." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-39", "text": "[13] explored the possibility of training a single model serve different English dialects and compared different methods incorporating dialect-specific information into the model." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-40", "text": "However, multilingual ASR on low-resource languages are few investigated by sequence-tosequence attention-based models." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-41", "text": "Furthermore, we argue that the modeling unit of sub-words allows for a much stronger decoder LM compared to graphemes [10] , so sub-words encoded by BPE are employed as the multilingual modeling unit rather than graphemes [13, 14] ." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-42", "text": "----------------------------------" }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-43", "text": "**SYSTEM OVERVIEW 3.1. 
ASR TRANSFORMER MODEL ARCHITECTURE**" }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-44", "text": "The ASR Transformer architecture used in this work is the same as in our work [9, 12], and is shown in Figure 1." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-45", "text": "It stacks multi-head attention (MHA) [17] and position-wise, fully connected layers for both the encoder and decoder." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-46", "text": "The encoder is composed of a stack of N identical layers." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-47", "text": "Each layer has two sub-layers." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-48", "text": "The first is an MHA, and the second is a position-wise fully connected feed-forward network." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-49", "text": "Residual connections are employed around each of the two sub-layers, followed by a layer normalization." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-50", "text": "The decoder is similar to the encoder except that it inserts a third sub-layer to perform MHA over the output of the encoder stack." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-51", "text": "To prevent leftward information flow and preserve the auto-regressive property in the decoder, the self-attention sub-layers in the decoder mask out all values corresponding to illegal connections." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-52", "text": "In addition, positional encodings [17] are added to the input at the bottoms of the encoder and decoder stacks, which inject some information about the relative or absolute position of the tokens in the sequence." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-53", "text": "The difference between the neural machine translation (NMT) Transformer [17] and the ASR Transformer is the input of the encoder."
}, { "sent_id": "520588fbf0643725153b07a09430d1-C001-54", "text": "We add a linear transformation with a layer normalization to convert the log-Mel filterbank feature to the model dimension d model for dimension matching, which is marked out by a dotted line in Figure 1 ." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-55", "text": "----------------------------------" }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-56", "text": "**MULTILINGUAL MODELING UNIT**" }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-57", "text": "Sub-words are employed as the multilingual modeling unit, which are generated by BPE 1 [11] ." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-58", "text": "Firstly, the symbol vocabulary with the character vocabulary is initialized, and each word is represented as a sequence of characters plus a special end-of-word symbol '@@', which allows to restore the original tokenization." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-59", "text": "Then, all symbol pairs are counted iteratively and each occurrence of the most frequent pair ('A', 'B') are replaced with a new symbol 'AB'." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-60", "text": "Each merge operation produces a new symbol which represents a character n-gram." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-61", "text": "Frequent character n-grams (or whole words) are eventually merged into a single symbol." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-62", "text": "Then the final symbol vocabulary size is equal to the size of the initial vocabulary, plus the number of merge operations \u03b1 , which is the only hyper-parameter of this algorithm [11] ." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-63", "text": "In our multilingual experiments, training transcripts in all languages are combined together to generate the multilingual symbol vocabulary, instead of directly merging each language symbol vocabulary together." 
}, { "sent_id": "520588fbf0643725153b07a09430d1-C001-64", "text": "So same sub-words are shared among different languages automatically, which is very beneficial for languages belonging to the same language family." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-65", "text": "For example, for a German word of \"universit\u00e4tsgeb\u00e4u\", it is encoded into \"univer@@ sit@@\u00e4@@ ts@@ ge@@ b@@ a@@ u\"; for an English word of \"university\", it is encoded into \"univer@@ sit@@ y\"." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-66", "text": "Two sub-words \"univer@@\" and \"sit@@\" are shared in these two languages." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-67", "text": "----------------------------------" }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-68", "text": "**LANGUAGE INFORMATION AS OUTPUT TARGETS**" }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-69", "text": "Similar to [13, 18] , we expand the symbol vocabulary of the multilingual ASR Transformer to include a list of special symbols, each corresponding to a language." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-70", "text": "For example, we add the symbol into the symbol vocabulary when including English." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-71", "text": "If the language information of training data can only be known beforehand, two methods of adding the language symbol are explored, i.e. inserting at the beginning (Transformer-B) or at the end (Transformer-E) of the original sub-words sequence [13, 18] ." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-72", "text": "What's more, if the language information of both training and testing data can be known beforehand, we directly take the language symbol as the sentence start token (Transformer-B2) rather than original sentence start token . 
It can force the multilingual ASR Transformer to decode a speech utterance into the pointed language, which greatly alleviates the language confusion during testing." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-73", "text": "The difference between Transformer-B and Transformer-B2 is whether the language information is utilized during testing." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-74", "text": "Transformer-B keeps the original sentence start token. It first predicts a language symbol by itself and then predicts the following tokens as usual." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-75", "text": "Therefore, Transformer-B does not need to know the language information beforehand during testing." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-76", "text": "In contrast, Transformer-B2 employs the language symbol as its sentence start token and predicts the following tokens as usual, which requires knowing the language information beforehand during testing." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-77", "text": "An example of adding the language symbol is shown in Table 1." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-78", "text": "The datasets in this paper come from the CALLHOME corpora collected by the Linguistic Data Consortium (LDC)." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-79", "text": "The following six languages are used: Mandarin (MA), English (EN), Japanese (JA), Arabic (AR), German (GE) and Spanish (SP)." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-80", "text": "We follow the Kaldi [19] recipe to process the CALLHOME datasets 2." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-81", "text": "The detailed information is listed in Table 2." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-82", "text": "We train the ASR Transformer for a given number of epochs, so validation sets are not employed in this paper."
}, { "sent_id": "520588fbf0643725153b07a09430d1-C001-83", "text": "All experiments are conducted using 80-dimensional log-Mel filterbank features, computed with a 25ms window and shifted every 10ms." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-84", "text": "The features are normalized via mean subtraction and variance normalization on the speaker basis." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-85", "text": "Similar to [20, 21] , at the current frame t, these features are stacked with 3 frames to the left and downsampled to a 30ms frame rate." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-86", "text": "We generate more training data by linearly scaling the audio lengths by factors of 0.9 and 1.1 [22] , since it is always beneficial for training the ASR Transformer [9] ." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-87", "text": "----------------------------------" }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-88", "text": "**MODEL AND TRAINING DETAILS**" }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-89", "text": "We perform our experiments on the big model (D1024-H16) [9, 17] of the ASR Transformer." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-90", "text": "Table 3 lists our experimental parameters." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-91", "text": "The Adam algorithm [23] with gradient clipping and warmup is used for optimization." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-92", "text": "During training, label smoothing of value ls = 0.1 is employed [24] ." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-93", "text": "After trained, the last 20 checkpoints are averaged to make the performance more stable [17] ." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-94", "text": "At the beginning we train the ASR Transformer on English data with a random initialization, but the result is poor although the CE loss looks good." 
}, { "sent_id": "520588fbf0643725153b07a09430d1-C001-95", "text": "One possible reason for the poor performance is that the training data is too small while the ASR Transformer has a relatively large number of parameters, about 230M in this work." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-96", "text": "To compensate for the lack of training data on low-resource languages, a well-trained ASR Transformer with a CER of 26.64% on the HKUST dataset, a corpus of Mandarin Chinese conversational telephone speech, is adopted from our work [9] ." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-97", "text": "Its softmax layer is replaced by a language-specific softmax layer, which is initialized randomly." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-98", "text": "Through this initialization method, the ASR Transformer can converge very well." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-99", "text": "All experiments in this paper are conducted with this initialization method." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-100", "text": "----------------------------------" }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-101", "text": "**NUMBER OF MERGE OPERATIONS**" }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-102", "text": "First, we evaluate how the number of merge operations \u03b1 in BPE affects the performance of the ASR Transformer." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-103", "text": "When \u03b1 is tiny, the number of sub-words is small." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-104", "text": "Otherwise, the number of sub-words is large." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-105", "text": "Since the training data for low-resource languages is quite limited, the number of sub-words cannot be too large, in order to make sure each sub-word has enough training samples."
}, { "sent_id": "520588fbf0643725153b07a09430d1-C001-106", "text": "For each monolingual ASR Transformer, we first experiment on the English dataset to choose an appropriate \u03b1." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-107", "text": "As shown in Table 4 , the performance is best when \u03b1 = 500, where the number of sub-words is 548 on the English dataset." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-108", "text": "After appending 4 extra tokens (i.e., an unknown token, a padding token, and the sentence start and end tokens), the total number of sub-words is 552." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-109", "text": "In this paper, we choose \u03b1 = 500 in monolingual ASR Transformer experiments." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-110", "text": "For the multilingual ASR Transformer, the training transcripts of all languages are combined to generate the multilingual symbol vocabulary by BPE." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-111", "text": "Table 6 shows that \u03b1 does not affect the performance much on average." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-112", "text": "We choose \u03b1 = 3000 in all multilingual ASR Transformer experiments, and the total number of sub-words is 8062." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-113", "text": "----------------------------------" }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-114", "text": "**RESULTS**" }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-115", "text": "The baseline systems come from our previous work [5] and all results are summarized in Table 5 ." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-116", "text": "First, we train six monolingual ASR Transformers (Mono-Transformer) independently on each language's data."
}, { "sent_id": "520588fbf0643725153b07a09430d1-C001-117", "text": "As can be seen from Table 5 , the monolingual ASR Transformer performs very well on each low-resource language and obtains a relative WER reduction of about 15.7% on average compared to the monolingual LSTM (Mono-LSTM)." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-118", "text": "Furthermore, we build a single multilingual ASR Transformer (Multi-Transformer) on all the training data together without using any language information during training and testing." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-119", "text": "We note that the Multi-Transformer achieves slightly better performance than the Mono-Transformer on average, which indicates that simply pooling the data together can give acceptable recognition performance with a single multilingual ASR Transformer." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-120", "text": "After analyzing the recognition results from the Multi-Transformer, we find that some recognition results are completely wrong because of language confusion, especially when the speech utterance is short." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-121", "text": "For example, sometimes an English word \"um\" is decoded into a German word \"ja\", because they have similar pronunciation." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-122", "text": "Since the language information of the training data can usually be known beforehand, we go on to build two multilingual ASR Transformers integrating language information as depicted in Section 3.3 to alleviate the problem of language confusion." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-123", "text": "Here, the language information is only used during training, and the model itself predicts the language symbol during testing."
}, { "sent_id": "520588fbf0643725153b07a09430d1-C001-124", "text": "From Table 5 , we can observe that inserting the language symbol at the end (Multi-Transformer-E) is better than inserting it at the beginning (Multi-Transformer-B)." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-125", "text": "Compared to SHL-MLSTM-RESIDUAL, Multi-Transformer-B obtains a relative average WER reduction of about 10.5%." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-126", "text": "If the language information of both training and testing data can be known beforehand, we directly take the language symbol as the sentence start token rather than the original sentence start token. This forces the multilingual ASR Transformer to decode a speech utterance into the pointed language, which greatly alleviates language confusion during testing." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-127", "text": "As can be seen from Table 5 , Multi-Transformer-B2 performs best, obtaining a relative 12.4% average WER reduction compared to SHL-MLSTM-RESIDUAL, although the improvement on Spanish is very small. Moreover, an interesting observation is that if we give a wrong language symbol as the sentence start token, Multi-Transformer-B2 is able to transliterate speech into the pointed language." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-128", "text": "An English example of predictions from Multi-Transformer-B2 with different sentence start tokens is shown in Table 7 ." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-129", "text": "We can see that the prediction from the wrong language symbol is an approximate pronunciation of the correct target."
}, { "sent_id": "520588fbf0643725153b07a09430d1-C001-130", "text": "----------------------------------" }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-131", "text": "**CONCLUSIONS**" }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-132", "text": "In this paper we investigated multilingual speech recognition on low-resource languages with a single multilingual ASR Transformer." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-133", "text": "Sub-words are chosen as the multilingual modeling unit to remove the dependency on the pronunciation lexicon." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-134", "text": "A comparison with SHL-MLSTM with residual learning is conducted on the CALLHOME datasets with 6 languages." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-135", "text": "Experimental results reveal that a single multilingual ASR Transformer with the language symbol inserted at the end can obtain a relative 10.5% average WER reduction compared to SHL-MLSTM with residual learning, if the language information of the training data is employed during training." }, { "sent_id": "520588fbf0643725153b07a09430d1-C001-136", "text": "We go on to show that a relative average WER reduction of about 12.4% can be observed compared to SHL-MLSTM with residual learning by giving the language symbol as the sentence start token, assuming the language information is known during training and testing."
} ], "y": { "@BACK@": { "gold_contexts": [ [ "520588fbf0643725153b07a09430d1-C001-19" ], [ "520588fbf0643725153b07a09430d1-C001-21" ], [ "520588fbf0643725153b07a09430d1-C001-86" ], [ "520588fbf0643725153b07a09430d1-C001-89" ] ], "cite_sentences": [ "520588fbf0643725153b07a09430d1-C001-19", "520588fbf0643725153b07a09430d1-C001-21", "520588fbf0643725153b07a09430d1-C001-86", "520588fbf0643725153b07a09430d1-C001-89" ] }, "@DIF@": { "gold_contexts": [ [ "520588fbf0643725153b07a09430d1-C001-24" ], [ "520588fbf0643725153b07a09430d1-C001-25" ] ], "cite_sentences": [ "520588fbf0643725153b07a09430d1-C001-24", "520588fbf0643725153b07a09430d1-C001-25" ] }, "@EXT@": { "gold_contexts": [ [ "520588fbf0643725153b07a09430d1-C001-24" ], [ "520588fbf0643725153b07a09430d1-C001-25" ] ], "cite_sentences": [ "520588fbf0643725153b07a09430d1-C001-24", "520588fbf0643725153b07a09430d1-C001-25" ] }, "@SIM@": { "gold_contexts": [ [ "520588fbf0643725153b07a09430d1-C001-44" ] ], "cite_sentences": [ "520588fbf0643725153b07a09430d1-C001-44" ] }, "@USE@": { "gold_contexts": [ [ "520588fbf0643725153b07a09430d1-C001-44" ], [ "520588fbf0643725153b07a09430d1-C001-96" ] ], "cite_sentences": [ "520588fbf0643725153b07a09430d1-C001-44", "520588fbf0643725153b07a09430d1-C001-96" ] } } }, "ABC_29294f2ed3cc2772ca57fd4294274c_23": { "x": [ { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-2", "text": "A training and test set for a dialogue system in the form of linked questions and responses is translated from English into Tamil." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-3", "text": "Accuracy of identifying an appropriate response in Tamil is 79%, compared to the English accuracy of 89%, suggesting that translation can be useful to start up a dialogue system." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-4", "text": "Machine translation of Tamil inputs into English also results in 79% accuracy." 
}, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-5", "text": "However, machine translation of the English training data into Tamil results in a drop in accuracy to 54% when tested on manually authored Tamil, indicating that there is still a large gap before machine translated dialogue systems can interact with human users." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-6", "text": "----------------------------------" }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-8", "text": "Much of the effort in creating a dialogue system is devoted to the collection of training data, to allow the system to process, understand, and react to input coming from real users." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-9", "text": "If a comparable system is available for a different language, it would be helpful to use some of the existing foreign language resources in order to cut down the development time and cost -an issue known as language portability." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-10", "text": "Recent efforts have shown machine translation to be an effective tool for porting dialogue system resources from French into Italian (Jabaian et al., 2010; Jabaian et al., 2013; Servan et al., 2010) ; this system used concept-based language understanding, and the findings were that machine translation of the target language inputs yielded better results than using translation to train an understanding component directly for the target language." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-11", "text": "Here we report similar findings on more challenging data, by exploring a dialogue system with a less structured understanding component, using off-the-shelf rather than domainadapted machine translation, and with languages that are not as closely related." 
}, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-12", "text": "Question-answering characters are designed to sustain a conversation driven primarily by the user asking questions." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-13", "text": "Leuski et al. (2006) developed algorithms for training such characters using linked questions and responses in the form of unstructured natural language text." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-14", "text": "Given a novel user question, the character finds an appropriate response from a list of available responses, and when a direct answer is not available, the character selects an \"off-topic\" response according to a set policy, ensuring that the conversation remains coherent even with a finite number of responses." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-15", "text": "The response selection algorithms are languageindependent, also allowing the questions and responses to be in separate languages." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-16", "text": "These algorithms have been incorporated into a tool which has been used to create characters for a variety of applications (e.g. Leuski et al., 2006; Artstein et al., 2009; Swartout et al., 2010) ." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-17", "text": "To date, most characters created using these principles understood and spoke only English; one fairly limited character spoke Pashto, a language of Afghanistan (Aggarwal et al., 2011) ." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-18", "text": "To test language portability we chose Tamil, a Dravidian language spoken primarily in southern India." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-19", "text": "Tamil has close to 70 million speakers worldwide (Lewis et al., 2015) , is the official language of Tamil Nadu and Puducherry in India (Wasey, 2014) , and an official language in Sri Lanka and Singapore." 
}, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-20", "text": "There is active development of language processing tools in Tamil such as stemmers (Thangarasu and Manavalan, 2013), POS taggers (Pandian and Geetha, 2008) , constituent and dependency parsers (Saravanan et al., 2003; Ramasamy and \u017dabokrtsk\u00fd, 2011) , and sentence generators (Pandian and Geetha, 2009). The main questions we want to answer in this paper are: (Q1) How good is a dialogue system created using translation between English and Tamil? (Q2) Is there a difference between manual and machine translation in this regard? (Q3) Can machine translation be used for interaction with users, that is with manually translated test data?" }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-21", "text": "To answer these questions, we translated linked questions and responses from an English question-answering system into Tamil both mechanically and manually, and tested the response selection algorithms on the English and both versions of the Tamil data." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-22", "text": "We found that translation caused a drop in performance of about 10% on either manually or mechanically translated text, giving a tentative answer of 'fair' to Q1 and 'no' to Q2." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-23", "text": "The answer to Q3 is mixed: a similar performance drop of about 10% was found with machine translation on the target language inputs (that is, translating test questions from Tamil into English); a much more severe drop in performance was observed when using machine translation to create a system in the target language (that is, translating the training data from English into Tamil, and testing on manually authored Tamil)." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-24", "text": "The remainder of the paper describes the experiment and results, and concludes with directions for future research."
}, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-25", "text": "----------------------------------" }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-26", "text": "**METHOD 2.1 MATERIALS**" }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-27", "text": "Our English data come from the New Dimensions in Testimony system, which allows people to ask questions in conversation with a representation of Holocaust Survivor Pinchas Gutter; this system had undergone an extensive process of user testing, so the linked questions and responses contain many actual user questions that are relevant to the domain ." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-28", "text": "The New Dimensions in Testimony system has more than 1700 responses, almost 7000 training questions, and 400 test questions, with a manyto-many linking between questions and responses ." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-29", "text": "To get a dataset that is large enough to yield meaningful results yet small enough to translate manually, we needed a subset that included questions with multiple links, and answers that were fairly short." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-30", "text": "We selected all the test questions that had exactly 4 linked responses, and removed all the responses that were more than 200 characters in length; this yielded a test set with 28 questions, 45 responses, and 63 links, with each test question linked to between 1 and 4 responses." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-31", "text": "We took all the training questions linked to the 45 test responses, resulting in a training set with 441 questions and 1101 links." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-32", "text": "This sampling procedure was deliberately intended to enable high performance on the English data, in order to provide a wide range of possible performance for the various translated versions." 
}, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-33", "text": "Automatic translation into Tamil was done using Google Translate, and manual translation was performed by the first author." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-34", "text": "Thus, each question and response in the training and test datasets has three versions: the original English, and automatic and manual translations into Tamil." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-35", "text": "----------------------------------" }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-36", "text": "**TOKENIZATION**" }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-37", "text": "We use unigrams as tokens for the response classification algorithm; these are expected to work well for Tamil, which has a fairly free word order (Lehmann, 1989) ." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-38", "text": "The English text was tokenized using the word tokenize routine from NLTK (Bird et al., 2009 )." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-39", "text": "This tokenizer does not work for Tamil characters, so we used a simple tokenizer that separates tokens by whitespace and removes periods, exclamation marks, question marks and quotation marks." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-40", "text": "The same simple tokenizer was used as a second option for the English text." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-41", "text": "----------------------------------" }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-42", "text": "**STEMMING**" }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-43", "text": "Tamil is an agglutinative language where stems can take many affixes (Lehmann, 1989 ), so we experimented with a stemmer (Rajalingam, 2013) ." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-44", "text": "2 For comparison, we also ran the English experiments with the SnowballStemmer(\"english\") routine from NLTK." 
}, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-45", "text": "3" }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-46", "text": "----------------------------------" }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-47", "text": "**RESPONSE RANKING**" }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-48", "text": "We reimplemented parts of the response ranking algorithms of Leuski et al. (2006) , including both the language modeling (LM) and cross-language modeling (CLM) approaches." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-49", "text": "The LM approach constructs language models for both questions and responses using the question vocabulary." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-50", "text": "For each training question S, a language model is estimated as the frequency distribution of tokens in S, smoothed by the distribution of tokens in the entire question corpus (eq. 1)." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-51", "text": "The language model of a novel question Q is estimated as the probability of each token in the vocabulary coinciding with Q (eq. 2)." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-52", "text": "Each available response R is associated with a pseudo-question Q R made up by the concatenation of all the questions linked to R in the training data." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-53", "text": "The responses are ranked by the distance between a novel question Q and the associated pseudo-questions Q R using Kullback-Leibler (KL) divergence (eq. 3)." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-54", "text": "In eq. (1), # S (w) is the number of times token w appears in sequence S; |S| is the length of sequence S; the variable S iterates over all the questions in the corpus, and \u03bb \u03c0 is a smoothing parameter." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-55", "text": "The sum in eq. 
(2) is over all the questions in the training corpus; the product iterates over the tokens in the question, and thus is an estimate of the probability of the question Q given a training question S ." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-56", "text": "In eq. (3), V S is the entire question vocabulary." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-57", "text": "The CLM approach constructs language models for both questions and responses using the response vocabulary." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-58", "text": "The language model of a response is estimated in a similar way to eq. (1), but with the smoothing factor using the response corpus (eq. 4)." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-59", "text": "The language model associated with a novel question Q represents the ideal response to Q, and is estimated as the probability of each token in the response vocabulary coinciding with Q in the linked question-response training data (eq. 5); available responses are ranked against this ideal response (eq. 6)." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-60", "text": "The sum in eq. (5) is over all linked question-response pairs {S j , R j } in the training data, and the product is an estimate of the probability of the question Q given the training question S j ." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-61", "text": "In eq. (6), V R is the entire response vocabulary." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-62", "text": "We did not implement the parameter learning of Leuski et al. (2006) ; instead we use a constant smoothing parameter \u03bb \u03c0 = \u03bb \u03c6 = 0.1." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-63", "text": "We also do not use the response threshold parameter, which Leuski et al. (2006) use to determine whether the top-ranked response is good enough."
}, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-64", "text": "Instead, we just check for the correctness of the top-ranked response." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-65", "text": "----------------------------------" }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-66", "text": "**PROCEDURE**" }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-67", "text": "Our basic tests kept the language and processing options the same for questions and responses." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-68", "text": "Each dataset (English and the two Tamil translations) was processed with both the LM and CLM approaches, both with and without a stemmer; English was also processed with the two tokenizer options." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-69", "text": "Additionally, we processed some crosslanguage datasets, with questions in Tamil and responses in English, and vice versa." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-70", "text": "We also performed two tests intended to check whether it is feasible to use machine-translated data with human questions: the manually translated Tamil test questions were machine translated back into English and tested with the original English training data (target language input translation); the manually translated Tamil test questions were also tested with the automatically translated Tamil training questions (creating a target language system)." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-71", "text": "----------------------------------" }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-72", "text": "**EVALUATION**" }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-73", "text": "We use accuracy as our success measure: the top ranked response to a test question is considered correct if it is identified as a correct response in the linked test data (there are up to 4 correct responses per question)." 
}, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-74", "text": "This measure does not take into account non-understanding, that is the classifier's determination that the best response is not good enough (Leuski et al., 2006) , since this capability was not implemented; however, since all of our test questions are known to have at least one appropriate response, any non-understanding of a question would necessarily count against accuracy anyway." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-75", "text": "----------------------------------" }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-76", "text": "**RESULTS**" }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-77", "text": "The results of the experiments with matched question and response languages are reported in Table 1." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-78", "text": "The LM approach almost invariably produced better results than the CLM approach; this is the opposite of the findings of Leuski et al. (2006) , where CLM fared consistently better." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-79", "text": "In most cases, the errors produced by the CLM approach were a superset of those of the LM approach; the only exceptions were Tamil with stemming." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-80", "text": "Accuracy of response selection on the Tamil data is about 10% lower than that of English, or twice the errors (6 errors rather than 3)." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-81", "text": "The errors of automatically translated Tamil are a superset of the English errors; however, manually translated Tamil did get right some of the errors of English." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-82", "text": "Some of the errors are due to the complexity of Tamil morphology." 
}, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-83", "text": "For example, the following test question receives a correct response in English but incorrect responses in Tamil:" }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-84", "text": "(7) How do you envision the future?" }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-85", "text": "The correct responses are linked to the following training questions." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-86", "text": "In English the word future, common to training and test questions, helps identify the desired responses." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-87", "text": "In Tamil, however, the word \"future\" appears in distinct case forms: unmarked etirkaalam in the test question, but genitive etirkaalattin in the training questions." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-88", "text": "It looks as though some morphological analysis of the Tamil text would be useful." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-89", "text": "However, while English appears almost invariant to the use of stemming, Tamil performs markedly worse with a stemmer." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-90", "text": "In this particular case, the stemmer does not unify the -am and -attin forms, and leaves both forms intact (these forms involve both a stem alternation -am/-att as well as a case morpheme -in)." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-91", "text": "We are still not able to say why the stemmer hurts performance, but it appears that our application could benefit from a different level of morphological analysis than provided by the current stemmer." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-92", "text": "Table 2 reports the results of the experiments which use different languages for the questions and responses."
}, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-93", "text": "The top two rows use the same language for training and test questions, and only the response language varies." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-94", "text": "Accuracy is the same as that of the question language: this is necessarily the case for the LM approach, which does not use any of the response content; but it turned out to be the case even for the CLM approach." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-95", "text": "The middle two rows show the effect of machine translation on the target language inputs: questions in Tamil (manually translated from English) are automatically translated into English, and tested with the original English system." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-96", "text": "The performance penalty turns out to be the same as for the Tamil systems with matched training and test data, when using the NLTK tokenizer; the simple tokenizer incurs a larger performance penalty." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-97", "text": "Finally, the bottom two rows show the effect of using machine translation to create a target language system: manually translated questions in Tamil are tested with a system trained on automatic translation from English into Tamil." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-98", "text": "Performance drops sharply, likely due to mismatches between automatically and manually translated Tamil; this probably speaks to the quality of the present state of machine translation from English to Tamil." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-99", "text": "The result means that at present, off-the-shelf machine translation into Tamil is not quite sufficient for a translated dialogue system to work well with human user questions."
}, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-100", "text": "----------------------------------" }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-101", "text": "**DISCUSSION**" }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-102", "text": "The experiments demonstrate that translating data in the form of linked questions and responses from one language to another can result in a classifier that works in the target language, though there is a drop in performance." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-103", "text": "The reasons for the drop are not clear, but it appears that simple tokenization is not as effective for Tamil as it is for English, and the level of morphological analysis provided by the Tamil stemmer is probably not appropriate for the task." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-104", "text": "We thus need to continue experimenting with Tamil morphology tools." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-105", "text": "The further drop in performance when mixing automatically and manually translated Tamil is probably due to translation mismatches." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-106", "text": "Several questions remain for future work." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-107", "text": "One possibility is to improve the machine translation itself, for example by adapting it to the domain." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-108", "text": "Another alternative is to use both languages together for classification; the fact that the manual Tamil translation identified some responses missed by the English classifier suggests that there may be benefit to this approach." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-109", "text": "Another direction for future work is identifying bad responses by using the distance between question and response to plot the tradeoff curve between errors and return rates (Artstein, 2011) ."
}, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-110", "text": "In our experiments the LM approach consistently outperforms the CLM approach, contra Leuski et al. (2006) ." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-111", "text": "Our data may not be quite natural: while the English data are well tested, our sampling method may introduce biases that affect the results. But even if we achieved full English-like performance using machine translation, the questions that Tamil speakers want to ask will likely be somewhat different than those of English speakers." }, { "sent_id": "29294f2ed3cc2772ca57fd4294274c-C001-112", "text": "A translated dialogue system is therefore only an initial step towards tailoring a system to a new population." } ], "y": { "@BACK@": { "gold_contexts": [ [ "29294f2ed3cc2772ca57fd4294274c-C001-13" ], [ "29294f2ed3cc2772ca57fd4294274c-C001-16" ], [ "29294f2ed3cc2772ca57fd4294274c-C001-74" ] ], "cite_sentences": [ "29294f2ed3cc2772ca57fd4294274c-C001-13", "29294f2ed3cc2772ca57fd4294274c-C001-16", "29294f2ed3cc2772ca57fd4294274c-C001-74" ] }, "@DIF@": { "gold_contexts": [ [ "29294f2ed3cc2772ca57fd4294274c-C001-48" ], [ "29294f2ed3cc2772ca57fd4294274c-C001-62" ], [ "29294f2ed3cc2772ca57fd4294274c-C001-63" ], [ "29294f2ed3cc2772ca57fd4294274c-C001-74" ], [ "29294f2ed3cc2772ca57fd4294274c-C001-78" ], [ "29294f2ed3cc2772ca57fd4294274c-C001-110" ] ], "cite_sentences": [ "29294f2ed3cc2772ca57fd4294274c-C001-48", "29294f2ed3cc2772ca57fd4294274c-C001-62", "29294f2ed3cc2772ca57fd4294274c-C001-63", "29294f2ed3cc2772ca57fd4294274c-C001-74", "29294f2ed3cc2772ca57fd4294274c-C001-78", "29294f2ed3cc2772ca57fd4294274c-C001-110" ] }, "@EXT@": { "gold_contexts": [ [ "29294f2ed3cc2772ca57fd4294274c-C001-48" ] ], "cite_sentences": [ "29294f2ed3cc2772ca57fd4294274c-C001-48" ] } } }, "ABC_bce5c3bf551a8aa211dfd962cde7a8_23": { "x": [ { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-17", "text": "First, the integration only experimented with the semantic 
role labeling part of the task." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-43", "text": "The importance of the first of these two aspects is often underestimated." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-2", "text": "In this paper we present our syntactic and se-" }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-3", "text": "----------------------------------" }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-4", "text": "**INTRODUCTION**" }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-5", "text": "The CoNLL 2009 shared task (Haji\u010d et al., 2009) continues the exploration on learning syntactic and semantic structures based on dependency notations in previous year's shared task." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-6", "text": "The new addition to this year's shared task is the extension to multiple languages." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-7", "text": "Being one of the leading competitions in the field, the shared task received submissions from systems built on top of the stateof-the-art data-driven dependency parsing and semantic role labeling systems." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-8", "text": "Although it was originally designed as a task for machine learning approaches, CoNLL shared tasks also feature an 'open' track since 2008, which encourages the use of extra linguistic resources to further improve the \u2020 We are indebted to our DELPH-IN colleagues, specifically Peter Adolphs, Francis Bond, Berthold Crysmann, and Montserrat Marimon for numerous hours of support in adapting their grammars and the PET software to parsing the CoNLL data sets." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-9", "text": "The first author thanks the German Excellence Cluster of Multimodal Computing and Interaction for the support of the work." 
}, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-10", "text": "The second author is funded by the PIRE PhD scholarship program." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-11", "text": "Participation of the third author in this work was supported by the University of Oslo, as part of its research partnership with the Center for the Study of Language and Information at Stanford University." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-12", "text": "Our deep parsing experimentation was executed on the TITAN HPC facilities at the University of Oslo." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-13", "text": "performance." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-14", "text": "This makes the task a nice testbed for the cross-fertilization of various language processing techniques." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-15", "text": "As an example of such work, Zhang et al. (2008) have shown in the past that deep linguistic parsing outputs can be integrated to help improve the performance of the English semantic role labeling task." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-16", "text": "But several questions remain unanswered." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-44", "text": "A detailed computational grammar, inevitably, comes with its own assumptions about tokenization-the ERG, for example, rejects the conventional assumptions underlying the PTB (and derived tools)." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-45", "text": "It opts for an analysis of punctuation akin to affixation (rather than as standalone tokens), does not break up contracted negated auxiliaries, and splits hyphenated words like illadvised into two tokens (the hyphen being part of the first component)." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-46", "text": "Thus, a string like Don't you! in the CoNLL data is tokenized as the four-element sequence do, n't, you, ! 
, 2 whereas the ERG analysis has only two leaf nodes: don't, you! ." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-47", "text": "Fortunately, the DELPH-IN toolchain recently incorporated a mechanism called chart mapping (Adolphs et al., 2008) , which allows one to map flexibly from 'external' input to grammar-internal assumptions, while keeping track of external token identities and their contributions to the final analysis." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-48", "text": "The February 2009 release of the ERG already had this machinery in place (with the goal of supporting extant, PTB-trained PoS taggers in pre-processing input to the deep parser), and we found that only a tiny number of additional chart mapping rules was required to 'fix up' CoNLL-specific deviations from the PTB tradition." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-49", "text": "With the help of the original developers, we created new chart mapping configurations for the German and Japanese grammars (with 17 and 16 such accommodation rules, respectively) in a similar spirit." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-50", "text": "All four DELPH-IN grammars include an account of unknown words, based on underspecified 'generic' lexical entries that are activated from PoS information." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-51", "text": "The Japanese case was interesting, in that the grammar assumes a different pre-processor (ChaSen, rather than Juman), such that not only token boundaries but also PoS tags and morphological features had to be mapped." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-52", "text": "From our limited experience to date, we found the chart mapping approach adequate in accommodating such discrepancies, and the addition of this extra layer of input processing gave substantial gains in parser coverage (see below)." 
}, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-53", "text": "For the Spanish data, on the other hand, we found it impossible to make effective use of the PoS and morphological information in the CoNLL data, due to more fundamental discrepancies (e.g. the treatment of enclitics and multi-word expressions)." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-54", "text": "----------------------------------" }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-55", "text": "**RETRAINING DISAMBIGUATION MODELS**" }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-56", "text": "The ERG includes a domain-specific parse selection model (for tourism instructions); GG only a stub model trained on a handful of test sentences." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-57", "text": "For use on the CoNLL data, thus, we had to train new parse selections models, better adapted to the shared task corpora." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-58", "text": "Disambiguation in PET is realized by conditional MaxEnt models (Toutanova, Manning, Flickinger, & Oepen, 2005) , usually trained on full HPSG treebanks." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-59", "text": "Lacking this kind of training material, we utilized the CoNLL dependency information instead, by defining an unlabeled dependency accuracy (DA) metric for HPSG analyses, essentially quantifying the degree of overlap in headdependent relations against the CoNLL annotations." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-60", "text": "Calculating DA for HPSG trees is similar to the procedure commonly used for extracting bi-lexical dependencies from phrase structure trees, in a sense even simpler as HPSG analyses fully determine headeness." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-61", "text": "Taking into account the technical complication of token-level mismatches, our DA metric loosely corresponds to the unlabeled attachment score." 
}, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-62", "text": "To train CoNLL-specific parse selection models, we parsed the development sections in 500-best mode (using the existing models) and then mechanically 'annotated' the HPSG analyses with maximum DA as preferred, all others as dis-preferred." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-63", "text": "In other words, this procedure constructs a 'binarized' empirical distribution where estimation of log-linear Using the [incr tsdb()] MaxEnt experimentation facilities, we trained new parse selection models for English and German, using the first 16,000 sentences of the English training data and the full German training corpus; seeing that only inputs that (a) parse successfully and (b) have multiple readings, with distinct DA values are relevant to this step, the final models reflect close to 13,000 sentences for English, and a little more than 4,000 items for German." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-64", "text": "Much like in the SRL component, these experiments are carried out with the TADM software, using tenfold cross-validation and exact match ranking accuracy (against the binarized training distribution) to optimize estimation hyper-parameters" }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-65", "text": "----------------------------------" }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-66", "text": "**DEEP PARSING FEATURES**" }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-67", "text": "HPSG parsing coverage and average cpu time per input for the four languages with DELPH-IN grammars are summarized in Table 1 ." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-68", "text": "The PoS-based unknown word mechanism was active for all grammars but no other robustness measures (which tend to lower the quality of results) were used, i.e. only complete spanning HPSG analyses were accepted." 
}, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-69", "text": "Parse times are for 1-best parsing, using selective unpacking (Zhang, Oepen, & Carroll, 2007) ." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-70", "text": "HPSG parsing outputs are available in several different forms." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-71", "text": "We investigated two types of structures: syntactic derivations and MRS meaningrepresentations." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-72", "text": "Representative features were extracted from both structures and selectively used in the statistical syntactic dependency parsing and semantic role labeling modules for the 'open' challenge." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-73", "text": "Deep Semantic Features Similar to Zhang et al. (2008) , we extract a set of features from the semantic outputs (MRS) of the HPSG parses." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-74", "text": "These features represent the basic predicate-argument structure, and provides a simplified semantic view on the target sentence." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-75", "text": "Deep Syntactic Dependency Features A HPSG derivation is a tree structure." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-76", "text": "The internal nodes are labeled with identifiers of grammar rules, and leaves with lexical entries." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-77", "text": "The derivation tree provides complete information about the actual HPSG analysis, and can be used together with the grammar to reproduce complete feature structure and/or MRS." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-78", "text": "Given that the shared task adopts dependency representation, we further map the derivation trees into token-token dependencies, labeled by corresponding HPSG rules, by defining a set of head-finding rules for each grammar." 
}, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-79", "text": "This dependency structure is different from the dependencies in CoNLL dataset, and provides an alternative HPSG view on the sentences." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-80", "text": "We refer to this structure as the dependency backbone (DB) of the HPSG anaylsis." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-81", "text": "A set of features were extracted from the deep syntactic dependency structures." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-82", "text": "This includes: i) the POS of the DB parent from the predicate and/or argument; ii) DB label of the argument to its parent (only for AI/AC); iii) labeled path from predicate to argument in DB (only for AI/AC); iv) POSes of the predicate's DB dependents" }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-83", "text": "----------------------------------" }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-84", "text": "**SYNTACTIC DEPENDENCY PARSING**" }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-85", "text": "For the syntactic dependency parsing, we use the MST Parser (McDonald et al., 2005) , which is a graph-based approach." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-86", "text": "The best parse tree is acquired by searching for a spanning tree which maximizes the score on either a partially or a fully connected graph with all words in the sentence as nodes (Eisner, 1996; McDonald et al., 2005) ." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-87", "text": "Based on our experience last year, we use the second order setting of the parser, which includes features over pairs of adjacent edges as well as features over single edges in the graph." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-88", "text": "For the projective or non-projective setting, we compare the results on the development datasets of different languages." 
}, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-89", "text": "According to the parser performance, we decide to use non-projective parsing for German, Japanese, and Czech, and use projective parsing for the rest." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-90", "text": "For the Closed Challenge, we first consider whether to use the morphological features." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-91", "text": "We find that except for Czech, parser performs better without morphological features on other languages (English and Chinese have no morphological features)." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-92", "text": "As for the other features (i.e. lemma and pos) given by the data sets, we also compare the gold standard features and P-columns." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-93", "text": "For all languages, the performance decreases in the following order: training with gold standard features and evaluating with the gold standard features, training with P-columns and evaluating with P-columns, training with gold standard features and testing with P-columns." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-94", "text": "Consequently, in the final submission, we take the second combination." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-95", "text": "The goal of the Open Challenge is to see whether using external resources can be helpful for the parsing performance." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-96", "text": "As we mentioned before, our deep parser gives us both the syntactic analysis of the input sentences using the HPSG formalism and also the semantic analysis using MRS as the representation." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-97", "text": "However, for the syntactic dependency parsing, we only extract features from the syntactic HPSG analyses and feed them into the MST Parser." 
}, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-98", "text": "Although, when parsing with gold standard lemma and POS features, our open system outperforms the closed system on out-domain tests (for English), when parsing with P-columns there is no substantial improvement observed after using the HPSG features." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-99", "text": "Therefore, we did not include it in the final submission." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-100", "text": "----------------------------------" }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-101", "text": "**SEMANTIC ROLE LABELING**" }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-102", "text": "The semantic role labeling component used in the submitted system is similar to the one described by Zhang et al. (2008) ." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-103", "text": "Since predicates are indicated in the data, the predicate identification module is removed from this year's system." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-104", "text": "Argument identification, argument classification and predicate classification are the three sub-components in the pipeline." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-105", "text": "All of them are MaxEnt-based classifiers." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-106", "text": "For parameter estimation, we use the open source TADM system (Malouf, 2002) ." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-107", "text": "The active features used in various steps of SRL are fine tuned separately for different languages using development datasets." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-108", "text": "The significance of feature types varies across languages and datasets." 
}, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-109", "text": "In the open challenge, two groups of extra features from HPSG parsing outputs, as described in Section 3.3, were used on languages for which we have HPSG grammars, that is English, German, Japanese, and Spanish." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-110", "text": "----------------------------------" }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-111", "text": "**RESULT ANALYSIS**" }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-112", "text": "The evaluation results of the submitted system are summarized in Table 2 ." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-113", "text": "The overall ranking of the system is #7 in the closed challenge, and #2 in the open challenge." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-114", "text": "While the system achieves mediocre performance, the clear performance difference between the closed and open challenges of the semantic role labeler indicates a substantial gain from the integration of HPSG parsing outputs." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-115", "text": "The most interesting observation is that even with grammars which only achieve very limited coverage, noticeable SRL improvements are obtained." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-116", "text": "Confirming the observation of Zhang et al. (2008) , the gain with HPSG features is more significant on outdomain tests, this time on German as well." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-117", "text": "The training of the syntactic parsing models for all seven languages with MST parser takes about 100 CPU hours with 10 iterations." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-118", "text": "The dependency parsing takes 6 -7 CPU hours." 
}, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-119", "text": "The training and testing of the semantic role labeler is much more efficient, thanks to the use of MaxEnt models and the efficient parameter estimation software." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-120", "text": "The training of all SRL models for 7 languages takes about 3 CPU hours in total." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-121", "text": "The total time for semantic role labeling on test datasets is less than 1 hour." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-122", "text": "Figure 2 shows the learning curve of the syntactic parser and semantic role labeler on the Czech and English datasets." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-123", "text": "While most of the systems continue to improve when trained on larger datasets, an exception was observed with the Czech dataset on the out-domain test for syntactic accuracy." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-124", "text": "In most of the cases, with the increase of training data, the out-domain test performance of the syntactic parser and semantic role labeler improves slowly relative to the in-domain test." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-125", "text": "For the English dataset, the SRL learning curve climbs more quickly than those of syntactic parsers." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-126", "text": "This is largely due to the fact that the semantic role annotation is sparser than the syntactic dependencies." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-127", "text": "On the Czech dataset which has dense semantic annotation, this effect is not observed." 
}, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-128", "text": "----------------------------------" }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-129", "text": "**CONCLUSION**" }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-130", "text": "In this paper, we described our syntactic parsing and semantic role labeling system participated in both closed and open challenge of the (Joint) CoNLL 2009 Shared Task." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-131", "text": "Four hand-written HPSG grammars of a variety of scale have been applied to parse the datasets, and the outcomes were integrated as features into the semantic role labeler of the system." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-132", "text": "The results clearly show that the integration of HPSG parsing results in the semantic role labeling task brings substantial performance improvement." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-133", "text": "The conclusion of Zhang et al. (2008) has been reconfirmed on multiple languages for which we handbuilt HPSG grammars exist, even where grammatical coverage is low." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-134", "text": "Also, the gain is more significant on out-of-domain tests, indicating that the hybrid system is more robust to cross-domain variation." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-18", "text": "It is not clear whether syntactic dependency parsing can also benefit from grammar-based parsing results." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-19", "text": "Second, the English grammar used to achieve the improvement is one of the largest and most mature hand-crafted linguistic grammars." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-20", "text": "It is not clear whether similar improvements can be achieved with less developed grammars." 
}, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-21", "text": "More specifically, the lack of coverage of hand-crafted linguistic grammars is a major concern." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-22", "text": "On the other hand, the CoNLL task is also a good opportunity for the deep processing community to (re-)evaluate their resources and software." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-23", "text": "----------------------------------" }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-24", "text": "**SYSTEM ARCHITECTURE**" }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-25", "text": "The overall system architecture is shown in Figure 1 ." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-26", "text": "It is similar to the architecture used by Zhang et al. (2008) ." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-27", "text": "Three major components were involved." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-28", "text": "The HPSG parsing component utilizes several handcrafted grammars for deep linguistic parsing." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-29", "text": "The outputs of deep parsings are passed to the syntactic dependency parser and semantic role labeler." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-30", "text": "The syntactic parsing component is composed of a modified MST parser which accepts HPSG parsing results as extra features." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-31", "text": "The semantic role labeler is comprised of a pipeline of 4 sub-components (predicate identification is not necessary in this year's task)." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-32", "text": "Comparing to Zhang et al. (2008) , this architecture simplified the syntactic component, and puts more focus on the integration of deep parsing outputs." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-33", "text": "While Zhang et al. 
(2008) only used semantic features from HPSG parsing in the SRL task, we added extra syntactic features from deep parsing to help both tasks." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-34", "text": "English (ERG; Flickinger, 2000) , German (GG; Crysmann, 2005) , Japanese (JaCY; Siegel & Bender, 2002) , and Spanish (SRG; Marimon, Bel, & Seghezzi, 2007) ." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-35", "text": "The grammars vary in their stage of development: the ERG comprises some 15 years of continuous development, whereas work on the SRG only started about five years ago, with GG and JaCY ranging somewhere in between." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-36", "text": "----------------------------------" }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-37", "text": "**HPSG PARSING FOR THE CONLL DATA**" }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-38", "text": "----------------------------------" }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-39", "text": "**OVERALL SETUP**" }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-40", "text": "We applied the DELPH-IN grammars to the CoNLL data using the PET parser (Callmeier, 2002) and ran it through the [incr tsdb()] environment (Oepen & Carroll, 2000) , for parallelization and distribution." }, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-41", "text": "Also, [incr tsdb()] provides facilities for (re-)training the MaxEnt parse selection models that PET uses for disambiguation." 
}, { "sent_id": "bce5c3bf551a8aa211dfd962cde7a8-C001-42", "text": "The two main challenges in applying DELPH-IN resources to parsing CoNLL data were (a) mismatches in basic assumptions, specifically tokenization and the inventory of PoS tags provided as part of the input, and (b) the need to adapt the resources for new domains and genres-in particular in terms of parse disambiguation-as the English and Spanish grammars at least had not been previously applied to the corpora used in the CoNLL shared task." } ], "y": { "@BACK@": { "gold_contexts": [ [ "bce5c3bf551a8aa211dfd962cde7a8-C001-15" ], [ "bce5c3bf551a8aa211dfd962cde7a8-C001-133" ] ], "cite_sentences": [ "bce5c3bf551a8aa211dfd962cde7a8-C001-15", "bce5c3bf551a8aa211dfd962cde7a8-C001-133" ] }, "@SIM@": { "gold_contexts": [ [ "bce5c3bf551a8aa211dfd962cde7a8-C001-26" ], [ "bce5c3bf551a8aa211dfd962cde7a8-C001-73" ], [ "bce5c3bf551a8aa211dfd962cde7a8-C001-102" ], [ "bce5c3bf551a8aa211dfd962cde7a8-C001-116" ] ], "cite_sentences": [ "bce5c3bf551a8aa211dfd962cde7a8-C001-26", "bce5c3bf551a8aa211dfd962cde7a8-C001-73", "bce5c3bf551a8aa211dfd962cde7a8-C001-102", "bce5c3bf551a8aa211dfd962cde7a8-C001-116" ] }, "@DIF@": { "gold_contexts": [ [ "bce5c3bf551a8aa211dfd962cde7a8-C001-32" ], [ "bce5c3bf551a8aa211dfd962cde7a8-C001-33" ] ], "cite_sentences": [ "bce5c3bf551a8aa211dfd962cde7a8-C001-32", "bce5c3bf551a8aa211dfd962cde7a8-C001-33" ] } } }, "ABC_836992d035c4be0c8eacdd419f151e_23": { "x": [ { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-78", "text": "Think The Warriors" }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-2", "text": "To disambiguate between closely related concepts, entity linking systems need to effectively distill cues from a mention's textual context." 
}, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-3", "text": "We investigate several techniques for using these cues in the task of noisy entity linking on short texts." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-4", "text": "Our starting point is a stateof-the-art attention-based model from prior work; while this model's attention typically identifies context that is topically relevant, it fails to identify some of the most indicative context words, especially those exhibiting lexical overlap with the true title." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-5", "text": "Augmenting the model with convolutional networks over characters still leaves it largely unable to pick up on these cues compared to sparse features that target them directly, indicating that automatically learning how to identify relevant character-level context features is a hard problem." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-6", "text": "Armed with these sparse features, our final system 1 outperforms past work on the WikilinksNED test set by 2.8% absolute." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-7", "text": "----------------------------------" }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-9", "text": "Effectively using an entity mention's context to disambiguate it is the crux of the entity linking task: in isolation, the mention Richard Wright could refer to three possible entities in Wikipedia's knowledge base corresponding to an artist, a musician, or an author." 
}, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-10", "text": "Previous work in this area has distilled context information by exploiting tfidf features (Cucerzan, 2007; Milne and Witten, 2008; Ratinov et al., 2011) , global link coherence (Hoffart et al.; Sil and Florian, 2016) , cues from coreference (Cheng and Roth, 2013; Hajishirzi et al., 2013; Durrett and Klein, 2014) , convolutional neural networks (Sun et al.; FrancisLandau et al., 2016) , or more sophisticated neural architectures (Gupta et al., 2017; Sil et al., 2018) ." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-11", "text": "These approaches typically focus on aggregating information from a mix of sources, including long-range information from the textual context or other linked entities." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-12", "text": "While this approach is suitable for entity linking settings such as newswire (Bentivogli, 2010) and Wikipedia (Ratinov et al., 2011) , we cannot always rely on this information in other settings like Twitter (Guo et al., 2013; Fang and Chang, 2014; Huang et al., 2014; Dredze et al., 2016) , Snapchat (Moon et al., 2018) , other web platforms (Eshel et al., 2017) , or dialogue systems (Bowden et al., 2018) ." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-13", "text": "We need models that can make effective use of limited context windows in noisy settings." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-14", "text": "In this work, we investigate this problem of effectively using context in the setting of the WikilinksNED dataset from Eshel et al. (2017) ." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-15", "text": "The examples in this dataset, which consists of 3.2 million entity disambiguation examples derived from Wikilinks (Singh et al., 2012) , have at most 20 words of context on either side and usually no other mentions of the entity being disambiguated." 
}, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-16", "text": "We build off a state-of-the-art attentive LSTM model from prior work (Eshel et al., 2017) and show that despite its good performance, it fails to resolve some examples that human readers would find trivial." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-17", "text": "For example, disambiguating the identity of the song Down in Figure 1 is easy if we can recognize the nearby string Jay Sean in the context, but the model sometimes fails to do this." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-18", "text": "We explore the performance of a standard attention mechanism as well as two modifications." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-19", "text": "First, we inject character information into the model through character-level CNNs; these give the model a deeper ability to recognize character correspondences between the context and entity title." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-20", "text": "However, these convolutional filters struggle to learn useful features in this noisy context and ultimately do not help performance." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-21", "text": "By contrast, sparse features explicitly targeting these overlaps" }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-22", "text": "----------------------------------" }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-23", "text": "**BASIC MODEL**" }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-24", "text": "The WikilinksNED dataset consists of entity mentions in context scraped from the web, with gold annotation derived from the fact that those mentions originally appeared with hyperlinks to Wikipedia." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-25", "text": "We denote the mention text (i.e., anchor text of the hyperlink) by m, and denote the left and right context of the mention by c l and c r respectively; these are at most 20 words." 
}, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-26", "text": "For this dataset, we can assume that the possible linked titles for a mention have been seen in training, and the main task is instead to disambiguate between them and identify the gold title t * ." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-27", "text": "We therefore follow prior work (Eshel et al., 2017) and take as candidates all gold entities in the training set whose mention was m rather than relying on a separate candidate generation scheme." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-28", "text": "Our model places a distribution over titles P (t|m, c l , c r ), where t takes values in the set of candidate Wikipedia titles for that mention." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-29", "text": "This model, depicted in Figure 1 , roughly follows that of Eshel et al. (2017) , with some key differences, as we discuss in the rest of this section." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-30", "text": "Embedding contexts Given an example of the form (m, c l , c r ), our model first uses a GRU layer (Cho et al., 2014) over each context to convert c l and c r into continuous vector representations l and r, respectively." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-31", "text": "Our word embeddings are trained over Wikipedia as described in the following paragraph." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-32", "text": "Embedding entities We follow the method of Eshel et al. (2017) for generating entity embeddings, using word2vecf (Levy and Goldberg, 2014) to jointly train word and entity embeddings simultaneously using Wikipedia article text." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-33", "text": "Each title t is associated in turn with each content word w in the article, yielding a set of (w, t) pairs that are consumed by the training procedure." 
}, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-34", "text": "This yields a set of title embeddings e t which we can treat as distributed representations of entities." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-35", "text": "Entity-context comparison We systematically compare the representations for l, r, and e t as follows:" }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-36", "text": "where \u00b7 denotes the conventional dot product and the other comparisons are elementwise." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-37", "text": "These features form the input to a final feedforward layer which produces a real-valued score s t for the given title." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-38", "text": "Repeating this computation for each title, our model's distribution is P (t|m, c l , c r ) = softmax t (s t )." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-39", "text": "Training Because our model involves substantial computation for each possible title, we want to limit the set of titles considered during training." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-40", "text": "For each example we consider, we construct a set T containing the gold title and 4 negative \"distractor\" titles from the candidate set." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-41", "text": "Unlike Eshel et al. (2017), we structure training as a multiclass decision among these titles rather than a binary prediction problem over each title as gold or not." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-42", "text": "We run our model over the candidates t \u2208 T to produce the distribution P (t|m, c l , c r ) and train to maximize the log probablity log P (t * |m, c l , c r ) of the gold title." 
}, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-43", "text": "Results The model set forth in this section is the basis for the remaining models in this paper; we call it the GRU model as that is the only context encoding mechanism it uses." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-44", "text": "As shown in Table 1 , this GRU model gets a score of 73.4 on the WikilinksNED development set." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-45", "text": "In the next section, we explore techniques for using the context in a more sophisticated way to improve further on this result." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-46", "text": "Eshel et al. (2017) , allowing the model to weight the importance of the outputs of the GRU at each time step." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-47", "text": "Each context (left and right) has its own attention weights." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-48", "text": "For a given side of context and candidate t, the attention first computes a transformation of the entity embedding e t as follows: q t = tanh(W e t )." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-49", "text": "This allows the model to learn an attention query q t distinct from the candidate embedding e t ." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-50", "text": "The model then computes attention probabilities \u03b1 i for each GRU output o i , normalized over the entire sequence of GRU outputs (of length n):" }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-51", "text": "The resulting probability distribution is used to take a weighted sum of GRU outputs to get a representation a:" }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-52", "text": "We compute a l and a r independently and symmetrically for the left and right context." 
}, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-53", "text": "These vectors are then fed forward through the model as the final continuous representation of the left or right context, l or r respectively." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-54", "text": "Results In Table 1 , we see that our model with attention (GRU+ATTN) outperforms our basic GRU model by around 1% absolute." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-55", "text": "It also outperforms the roughly similar model of Eshel et al. (2017) on the test set: this gain is due to a combination of factors including the improved training procedure and some small modeling changes." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-56", "text": "2 However, our attention scheme is not without its shortcomings, as we now discuss." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-57", "text": "----------------------------------" }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-58", "text": "**SHORTCOMINGS OF ATTENTION**" }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-59", "text": "One common and frustrating error our model makes is failing to correctly disambiguate mentions whose contexts share similar words or character overlap with the gold entity's actual Wikipedia title." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-60", "text": "In these instances, the model fails to attend correctly to words that we, as human readers, would most likely see as disambiguating terms." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-61", "text": "For instance, in this example's left context:" }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-62", "text": "...known also for the B.P. Koirala Institute of Health Sciences, one of the biggest government hospital." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-63", "text": "The indigenous people of Dharan are Limbu ... the model fails to identify people as a critical term for disambiguation." 
}, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-64", "text": "This failure is partially due to the model's sole reliance on distributed representations: the embedding for people and the title embedding for Limbu People need to somehow contain enough common information for the model to associate these, identify people as an important token, and use it to disambiguate between candidate titles such as Limbu People, Limbu Language, and Limbu Alphabet." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-65", "text": "Moreover, with such noisy, unstructured context, it is difficult for the model to learn to rely on other grammatical or semantic cues (such as are indicating that the title is probably a plural noun, which alphabet and language are not)." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-66", "text": "----------------------------------" }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-67", "text": "**CHARACTER CNNS**" }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-68", "text": "One way to address these issues in the model is to exploit more fine-grained character-level information." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-69", "text": "This circumvents the need to separately learn a distributed correspondence between terms with lexical overlap, and is especially useful when these terms may be unknown words; for example, a year mentioned within a context is often unknown and therefore assigned an UNK embedding, even if that year matches exactly with a year in the gold candidate's title." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-70", "text": "One solution to this is to allow our model to consult character-level information, which past models have done successfully for named entity recognition (Chiu and Nichols, 2015; Lample et al., 2016; Ma and Hovy, 2016 ), text classification (Zhang et al., 2015) , and POS tagging (Santos and Zadrozny, 2014) ." 
}, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-71", "text": "We use convolutional neural networks (CNNs) to distill character representations of words into vectors that we concatenate with our word representations." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-72", "text": "We additionally use character CNNs over entity titles and concatenate these representations with the title embeddings e t , to allow the model to learn to characterize similarities between contexts and entity titles." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-73", "text": "Our CNNs use window sizes of 6 and 100 filters each; these values were selected through hyperparameter tuning on the development set." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-74", "text": "Table 1 shows the impact of incorporating character CNNs (GRU+ATTN+CNN) ." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-75", "text": "Surprisingly, these have a mild negative impact on performance." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-76", "text": "One possible explanation of this is that it causes the model to split its attention between semantically important and lexically similar context terms." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-77", "text": "Consider the following example: really think Final Fight could be a lot of a fun as a vigilante justice movie with a high quotient of hand-to-hand fight sequences." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-79", "text": "The gold title is The Warrior (film) and the base model correctly places 90% of its attention weight on the word movie when calculating attention for this title." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-80", "text": "However, the character-level CNN model only places 60% of its attention weight on it, distributing its attention values more evenly across the rest of the words." 
}, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-81", "text": "Such cases are frequent: the average highest weight given by attention in GRU+ATTN+CNN is about 6% lower than the average highest attention weight given by GRU+ATTN." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-82", "text": "The CNNs seem to have generally decreased the model's confidence in what context clues are key for disambiguation, leading to lower performance." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-83", "text": "We will return to more analysis of this in Section 4." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-84", "text": "----------------------------------" }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-85", "text": "**LEXICAL FEATURE SET**" }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-86", "text": "To determine whether character level overlap between the entity title and context is useful, we take a more direct approach to incorporating that information into our model and build a set of sparse features that directly target it." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-87", "text": "Figure 2 shows an example of how our features are computed." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-88", "text": "We fire features on each word in the context that is either an exact match or a substring of a word in the candidate title; people is the relevant token here." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-89", "text": "We conjoin that match information with whether the word is in the left or right context along with the bucketed offset of the word from the mention." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-90", "text": "This feature set is then appended to the vector comparison features to form the input to the model's feedforward layer (see Figure 1) ." 
}, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-91", "text": "Table 1 shows the results of stacking these features on top of our model with attention (GRU+ATTN+FEATS)." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-92", "text": "We see our highest development set performance and correspondingly high test performance from this model." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-93", "text": "This indicates that character-level information is useful for disambiguation, but character CNNs as we incorporated them are not able to distill it as effectively as sparse features can." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-94", "text": "Our model augmented with these sparse features achieves state-of-the-art results on the test set." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-95", "text": "----------------------------------" }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-96", "text": "**ATTENTION AND \"OBVIOUS\" TERMS**" }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-97", "text": "Now that we have identified features which seem useful for this entity linking problem, we can ask how the tokens attended by our attention mechanism compare to those singled out by the features." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-98", "text": "Table 2 contains statistics regarding the attention values of our GRU+ATTN and GRU+ATTN+CNN model on a subset of examples that both models got wrong." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-99", "text": "We define accuracy as the percentage of examples in which the model gives the highest attention to a word Figure 3: Examples of our models putting high attention weight into irrelevant context words, not acknowledging the relevance of disambiguating terms that share lexical overlap with the correct title." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-100", "text": "We display the weight given to the top 4 attended words above each word for two of our models." 
}, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-101", "text": "that contains one of our lexical features, out of all examples where such a feature exists anywhere." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-102", "text": "The reported probability mass is the total attention mass that the model puts into words that associated with lexical features, averaged over all examples where such features exist." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-103", "text": "We see that the model frequently fails to exploit this information, and moreover the addition of CNNs does not strongly improve this." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-104", "text": "Figure 3 shows examples of this behavior." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-105", "text": "In the first example, rather than identifying cheese as a salient term, both models instead focus more heavily on milk and like." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-106", "text": "Similarly, in the second example, the model fails to recognize the importance of robot in the context." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-107", "text": "One possible reason that CNNs don't help more is that the sparse features only trigger on a subset of examples." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-108", "text": "Because the CNNs process every example, they may not see enough examples of lexical overlap to pick up on it, and instead try to augment what the word embedding model is already doing with subword information, which ends up being unstable for this task." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-109", "text": "Naturally, words with these overlap characteristics are not always the most disambiguating term." 
}, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-110", "text": "However, in light of noisy contexts, when the standard representation of context fails to be sufficient for allowing the model to disambiguate, we want the model to be able to leverage this character level information to help it make intuitive decisions, which the CNN fails to do." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-111", "text": "----------------------------------" }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-112", "text": "**CONCLUSION**" }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-113", "text": "In this paper, we observed that in noisy entity linking settings on short texts, neural models relying on attention do not always pick up on the correct context clues, even when those clues exhibit very obvious surface overlap with the correct entity title." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-114", "text": "These models can perform better when augmented with sparse features explicitly targeting this kind of lexical overlap: our system using these features achieves state-of-the-art disambiguation accuracy on the WikilinksNED dataset." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-115", "text": "By contrast, automatically learning learning finegrained character-level features with CNNs in this context is hard." }, { "sent_id": "836992d035c4be0c8eacdd419f151e-C001-116", "text": "More exploration is needed to better understand what inductive biases are necessary for an entity linking system to make maximally effective use of the information available to it." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "836992d035c4be0c8eacdd419f151e-C001-12" ], [ "836992d035c4be0c8eacdd419f151e-C001-46" ] ], "cite_sentences": [ "836992d035c4be0c8eacdd419f151e-C001-12", "836992d035c4be0c8eacdd419f151e-C001-46" ] }, "@SIM@": { "gold_contexts": [ [ "836992d035c4be0c8eacdd419f151e-C001-14" ], [ "836992d035c4be0c8eacdd419f151e-C001-16" ], [ "836992d035c4be0c8eacdd419f151e-C001-27" ], [ "836992d035c4be0c8eacdd419f151e-C001-32" ] ], "cite_sentences": [ "836992d035c4be0c8eacdd419f151e-C001-14", "836992d035c4be0c8eacdd419f151e-C001-16", "836992d035c4be0c8eacdd419f151e-C001-27", "836992d035c4be0c8eacdd419f151e-C001-32" ] }, "@USE@": { "gold_contexts": [ [ "836992d035c4be0c8eacdd419f151e-C001-14" ], [ "836992d035c4be0c8eacdd419f151e-C001-16" ], [ "836992d035c4be0c8eacdd419f151e-C001-27" ], [ "836992d035c4be0c8eacdd419f151e-C001-32" ] ], "cite_sentences": [ "836992d035c4be0c8eacdd419f151e-C001-14", "836992d035c4be0c8eacdd419f151e-C001-16", "836992d035c4be0c8eacdd419f151e-C001-27", "836992d035c4be0c8eacdd419f151e-C001-32" ] }, "@DIF@": { "gold_contexts": [ [ "836992d035c4be0c8eacdd419f151e-C001-29" ], [ "836992d035c4be0c8eacdd419f151e-C001-41" ], [ "836992d035c4be0c8eacdd419f151e-C001-55" ] ], "cite_sentences": [ "836992d035c4be0c8eacdd419f151e-C001-29", "836992d035c4be0c8eacdd419f151e-C001-41", "836992d035c4be0c8eacdd419f151e-C001-55" ] }, "@EXT@": { "gold_contexts": [ [ "836992d035c4be0c8eacdd419f151e-C001-29" ] ], "cite_sentences": [ "836992d035c4be0c8eacdd419f151e-C001-29" ] } } }, "ABC_4b65a59fc2331b9771ea09a12f32de_23": { "x": [ { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-2", "text": "We continue the study of generating semantically correct regular expressions from natural language descriptions (NL)." 
}, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-3", "text": "The current stateof-the-art model, SemRegex, produces regular expressions from NLs by rewarding the reinforced learning based on the semantic (rather than syntactic) equivalence between two regular expressions." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-4", "text": "Since the regular expression equivalence problem is PSPACE-complete, we introduce the EQ Reg model for computing the similarity of two regular expressions using deep neural networks." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-5", "text": "Our EQ Reg model essentially softens the equivalence of two regular expressions when used as a reward function." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-6", "text": "We then propose a new regex generation model, SoftRegex, using the EQ Reg model, and empirically demonstrate that SoftRegex substantially reduces the training time (by a factor of at least 3.6) and produces state-ofthe-art results on three benchmark datasets." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-7", "text": "----------------------------------" }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-9", "text": "Regular expressions are an efficient tool to represent structured data with specific rules in a variety of fields such as natural language processing or text classification." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-28", "text": "The SemRegex model improved the Deep-Regex model by using reinforcement learning based on the determinisitic finite automata (DFA) equivalence oracle (which determines if two regular expressions describe the same language) as a reward function." 
}, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-10", "text": "However, it is not always an easy task to write an exact regular expression for those who do not have a deep knowledge of regular expressions or when the expression is very complicated, and incorrect or sloppy regular expressions may cause unexpected consequences in practice (Bispo et al., 2006; Zhang et al., 2018) ." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-11", "text": "Indeed, even a single character difference between regular expressions can cause them * Now at Google. to represent completely different sets of strings." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-12", "text": "As such, researchers have begun working on a system than generates a regular expression from a natural language description provided by a human while reducing possible errors caused by incorrect regular expressions (Liu et al., 2019) ." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-13", "text": "Recently, Locascio et al. (2016) designed the Deep-Regex model based on the sequence-to-sequence (Seq2Seq) model (Sutskever et al., 2014) using minimal domain knowledge during the learning phase while still accurately predicting regular expressions from NLs." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-14", "text": "Later, Zhong et al. (2018a) improved the performance by training on not only syntactic content of the expressions (i.e. the exact textual representation of the expression that was used), but also the semantic content (the regular language described by the expression)." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-15", "text": "However, the reward function in the SemRegex model (Zhong et al., 2018a) that determines if the predicted regular expression is semantically equivalent to the ground truth expression is known to be PSPACE-complete and is a bottleneck in practice (Stockmeyer and Meyer, 1973) ." 
}, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-16", "text": "Thus, if we can solve this problem (even approximately) more quickly, then we can decrease the required learning time in the natural-language-to-regular expression (NL-RX) model." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-17", "text": "Reward shaping (Mataric, 1994; Ng et al., 1999) is a well-known mechanism to estimate the reward of an action in reinforced learning without executing the action." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-18", "text": "As a reward shaping mechanism for NL-RX, we build a new model, EQ Reg, based on deep learning that estimates the equivalence probability of two regular expressions and use it to improve the NL-RX training speed substantially." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-19", "text": "During the NL-RX model training phase, EQ Reg can quickly determine the equivalence of the predicted expression and the true expression, and pass this value as a reward for reinforcement learning." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-20", "text": "Our new NL-RX model together with the EQ Reg model as a reward function is called SoftRegex." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-21", "text": "We demonstrate that SoftRegex substantially reduces the training time and produces state-of-the-art results on three benchmark datasets." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-22", "text": "----------------------------------" }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-23", "text": "**RELATED WORK**" }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-24", "text": "Generating Regular Expressions: Ranta (1998) studied rule-based techniques for the conversion between multi-languages and regular expressions." 
}, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-25", "text": "Kushman and Barzilay (2013) built a parsing model that translates a natural language sentence to a regular expression, and provided a dataset which is now a popular benchmark dataset for related research." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-26", "text": "Locascio et al. (2016) proposed the Deep-Regex model based on Seq2Seq for generating regular expressions from natural language descriptions together with a dataset of 10,000 NL-RX pairs." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-27", "text": "However, due to the limitations of the standard Seq2Seq model, the Deep-Regex model can only generate regular expressions similar in shape to the training data." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-29", "text": "This model can generate correct regular expressions that may not resemble the ground truth answer." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-30", "text": "Comparing Regular Expressions: A regular language can have several syntactically different regular expressions." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-31", "text": "We say that two regular expressions are equivalent if they both define the same language." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-32", "text": "The most basic method to deciding whether or not two regular expressions are equivalent is to convert both regular expressions to DFAs." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-33", "text": "However, the time and space complexity of converting regular expressions to DFAs are both exponential (Hopcroft and Ullman, 1979) ." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-34", "text": "Stockmeyer and Meyer (1973) showed that it is PSPACE-complete to decide if two regular expressions generate the same set of words." 
}, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-35", "text": "Thus, deciding equivalence for two regular expressions is a bottleneck calculation even for small inputs." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-36", "text": "----------------------------------" }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-37", "text": "**METHODS**" }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-38", "text": "----------------------------------" }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-39", "text": "**NL-RX MODEL**" }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-40", "text": "We apply the Seq2Seq model with an attention mechanism (Luong et al., 2015) ." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-41", "text": "We utilize LSTM (Hochreiter and Schmidhuber, 1997) cells in the Seq2Seq model, which consists of an encoder and decoder." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-42", "text": "The encoder generates a latent vector from the given natural language description." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-43", "text": "Concurrently, the decoder receives the latent vector from the encoder and generates an output." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-44", "text": "Maximum Likelihood Estimation (MLE): Let \u03b8 be all parameters in the Seq2Seq model." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-45", "text": "The probability that the model outputs regular expressions R from a natural language input S is" }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-46", "text": "(1)" }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-47", "text": "Here, r t represents a predicted token at time step t. The Deep-Regex model trains itself to find the proper r t using MLE and optimizes through minimizing loss." 
}, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-48", "text": "Policy Gradient: The SemRegex model additionally trains the NL-RX model via a policy gradient (Williams, 1992 ) by rewarding the model if it generates regular expressions that are semantically equivalent to the ground truth." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-49", "text": "The reward function returns 1 if the output regular expression is equivalent to the answer, and 0 otherwise." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-50", "text": "In short, the SemRegex model trains itself to maximize the following objective function:" }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-51", "text": "where D is a training set, p(R | S) represents the probability that the model predicts regular expression for given sentence, and r(R) is the return value of the reward function that receives the predicted regular expression." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-52", "text": "----------------------------------" }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-53", "text": "**EQ REG MODEL**" }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-54", "text": "The reward function in the SemRegex model is based on a regular expression equivalence oracle." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-55", "text": "This test is time intensive (the regular expression equivalence problem is PSPACE-complete) and returns only a binary value (0/1) representing the equivalence." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-56", "text": "Therefore, the equivalence test is a major hurdle when training large amounts of data, which is closely related to model performance." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-57", "text": "Recently, there have been a few attempts to tackle intractable problems using deep learning (Prates Selsam et al., 2019) ." 
}, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-58", "text": "This motivates us to study a different approach that can speed up the equivalence test based on deep learning and propose the EQ Reg model that returns an equivalence probability of two regular expressions." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-59", "text": "We can then reward the NL-RX model with continuous values (the equivalence probability of two regular expressions) and significantly boost the learning speed of the NL-RX model using the EQ Reg model." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-60", "text": "The EQ Reg model converts the input regular expressions to two sequences of embedding vectors." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-61", "text": "Next, we feed the embedding vector sequences into two different LSTM layers." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-62", "text": "Each LSTM layer converts the sequences to latent vectors." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-63", "text": "Since a regular expression is a grammatical expression, we represent its context using a bidirectional RNN method (Graves et al., 2013) as" }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-64", "text": "where \u2192h is the forward hidden sequence and \u2190h is the backward hidden sequence." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-65", "text": "We concatenate the latent vectors from each LSTM layer into one vector so that a single vector contains information on both regular expressions." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-66", "text": "Finally, the fully connected layer predicts the equivalence of the two regular expressions from their concatenated vector." 
}, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-67", "text": "----------------------------------" }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-68", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-69", "text": "We run our experiments on a computer with the following specifications: AMD Ryzen 7 1700 8-core with 64GB RAM and a GTX 1080Ti GPU on Ubuntu 16.04.1." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-70", "text": "Our source code is written using PyTorch 0.4.0 (Ketkar, 2017) ." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-71", "text": "----------------------------------" }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-72", "text": "**EQ REG MODEL**" }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-73", "text": "Datasets: Locascio et al. (2016) created a set of NL-RX pair data by arbitrarily creating and combining data in a tree form." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-74", "text": "We define the depth of a regular expression in this dataset as the depth of the tree that generated the NL-RX pair (see Appendix A)." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-75", "text": "Similar to Locascio et al. (2016) , we randomly generate regular expression pairs up to depth three and label the equivalence between each pair." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-76", "text": "We sample approximately 200,000 pairs using this method with a ratio of equivalent and non-equivalent pairs of about 2:1." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-77", "text": "We prepare three sets of data having different depths for test data." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-78", "text": "One set is made up of only regular expressions with depth at most 3, which is the same as the training data (10,000 pairs)." 
}, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-79", "text": "The other two have depths 4 and 5, respectively, and each contains 1,000 pairs of regular expressions of which half are equivalent." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-80", "text": "Model Settings: We set the two LSTMs in our model to not have shared parameters as we found this gives better performance." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-81", "text": "The embedding dimension size is set to 4, since the size of vocabulary for regular expressions is relatively small compared to natural languages." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-82", "text": "We configure two independent LSTMs with 1 layer to receive the two regular expressions." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-83", "text": "The LSTM layers use the average value of the sequence output values as their final outputs and pass them to the next layer." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-84", "text": "We set the batch size to 256 and the learning rate to 0.1 and use a stochastic gradient descent optimizer (Bottou, 2010) ." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-85", "text": "----------------------------------" }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-86", "text": "**SOFTREGEX MODEL**" }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-87", "text": "Datasets: We use three public datasets to compare SoftRegex with the state-of-the-art model, SemRegex." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-88", "text": "The KB13 (Kushman and Barzilay, 2013) dataset was constructed by regex experts and is relatively small." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-89", "text": "On the other hand, NL-RX-Synth is automatically generated data, and NL-RX-Turk was created by ordinary people paraphrasing the NL descriptions in NL-RX-Synth using Mechanical Turk (Locascio et al., 2016) ." 
}, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-90", "text": "Both datasets have 10,000 pairs of NL-RX data." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-91", "text": "We follow the previous work in splitting the data (train: 65%, dev: 10%, test: 25%)." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-92", "text": "Model settings: We base the SoftRegex architecture on SemRegex." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-93", "text": "We embed the input sequence tokens with dimension size 128, stack two LSTM layers, and set the hidden state dimension size to 256." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-94", "text": "We set the batch size to 32 and learning rate to 0.05 with the ADADELTA optimizer (Zeiler, 2012) ." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-95", "text": "We substitute in the EQ Reg model as the reward function, which gives a high reward value of 1 if our model generates a regular expression that is equivalent to the ground truth (Figure 2) ." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-96", "text": "----------------------------------" }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-97", "text": "**RESULTS AND ANALYSIS**" }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-98", "text": "----------------------------------" }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-99", "text": "**MODEL PERFORMANCE**" }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-100", "text": "EQ Reg: We test the EQ Reg model with the three datasets from Section 4.1." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-101", "text": "The F1-score of test set 1 (depth 3) is 0.986, test set 2 (depth 4) is 0.868, and test set 3 (depth 5) is 0.853." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-102", "text": "As expected, we can see the model shows high performance on test data with the same range of depths as the training data." 
}, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-103", "text": "In addition, although the EQ Reg model has a simple structure, it shows reasonable accuracy for more complex data that has not been previously seen." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-104", "text": "Here, we can see that the model does not just remember the structure of the training data but generalizes to solve the equivalence problem by understanding the semantics of regular expressions." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-105", "text": "SoftRegex: Table 1 shows the experimental results of Deep-Regex, SemRegex, and our SoftRegex model." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-106", "text": "The average accuracy of 10 evaluations is given." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-107", "text": "The distinguishing test cases method is based on the membership test of samples for the case when an oracle is not available and is described in (Zhong et al., 2018a) ." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-108", "text": "The accuracy of SoftRegex is similar to or better than SemRegex (Oracle) and always better than Deep Regex and SemRegex (Distinguishing Test Cases)." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-109", "text": "Figure 3 shows the average training time per epoch over 30 epochs for each of the three models (SoftRegex and both SemRegex variants) and datasets." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-110", "text": "The training time with EQ Reg vastly outperforms the training time for SemRegex." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-111", "text": "In the worst case (the NL-RX-Synth dataset), we see that our new method is still 3.6 times faster than SemRegex." 
}, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-112", "text": "Though the speedup described in the experimental results may appear constant, our softened equivalence approximately decides a PSPACE-complete problem in time linear in the length of the regular expressions, whereas an exact test would otherwise take exponential time." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-113", "text": "Zhong et al. (2018b) pointed out some problems in the NL-RX datasets." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-114", "text": "Specifically, there are some ambiguities since Locascio et al. (2016) tried to obtain data from machine-generated sentences." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-115", "text": "Thus, there are situations that even expert humans cannot accurately classify." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-116", "text": "On the other hand, the NL-RX-Turk dataset is unreliable in that it is generated by non-experts who are paraphrasing previously generated data." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-117", "text": "We investigate all 921 incorrect predictions of SoftRegex on NL-RX-Turk and categorize the resulting errors into 4 types. For type-1 errors, we need to train on more NL-RX data where a single NL description is paired with several equivalent regular expressions to cope with the NL ambiguity problem." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-118", "text": "The type-2 and type-3 errors are caused by crowd-sourcing and might be hard to detect in the current learning model." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-119", "text": "We may consider a rule-based prescreening procedure." 
}, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-120", "text": "Finally, for the type-4 errors, we plan to incorporate the copying mechanism (Gu et al., 2016) and the sequence-to-tree translation model (Dong and Lapata, 2016) to enhance the performance of our model, considering the deterministic relationship between input and output sequences and the hierarchical structure of output regular expressions." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-121", "text": "Although errors are mostly caused by data errors that we can handle easily, we chose to conduct our experiments under the same condition as SemRegex to provide a fair comparison with prior work." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-122", "text": "----------------------------------" }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-123", "text": "**ERROR ANALYSIS**" }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-124", "text": "----------------------------------" }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-125", "text": "**CONCLUSIONS**" }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-126", "text": "Recently, there have been several successful attempts to generate regular expressions from natural language." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-127", "text": "The current state-of-the-art model is based on reinforcement learning using the (in)equivalence of two regular expressions." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-128", "text": "The regular expression equivalence procedure is a PSPACE-complete problem and, thus, is a major bottleneck in training both in time and space." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-129", "text": "We sidestep this bottleneck by using the EQ Reg model as reward shaping, which gives a regex equivalence probability for two regular expressions, and build a new NL-RX model called SoftRegex." 
}, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-130", "text": "Our SoftRegex model with EQ Reg as a reward function substantially speeds up the training phase (at least 3.6 times faster than SemRegex) while having similar or better performance on a series of standard benchmarks." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-131", "text": "----------------------------------" }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-132", "text": "**A SUPPLEMENTAL MATERIAL**" }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-133", "text": "We now describe how Locascio et al. (2016) generated their synthetic regular expression data." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-134", "text": "To begin, they manually mapped primitive regular expression operations, such as union and concatenation, to natural language." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-135", "text": "They then defined a small alphabet on which the operations could be performed." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-136", "text": "In the end, their system has 15 operations and 6 types of characters in the vocabulary." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-137", "text": "From this, they were able to build up NL-RX pairs automatically by creating a parse tree of repeated applications of operations to an initial regular expression." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-138", "text": "The operations (non-terminals) and alphabet (terminals) are shown in Table 3 ." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-139", "text": "Figure 4 gives an example of how a NL-RX pair is generated by creating a parse tree." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-140", "text": "It can be seen that it builds from the bottom up by starting with a series of words or characters (terminals) and applying some primitive operations (non-terminals)." 
}, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-141", "text": "Simultaneously, natural language descriptions of terminals and non-terminals are composed together to create a semantically identical natural language description." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-142", "text": "In this example, the NL-RX pair has depth 2." }, { "sent_id": "4b65a59fc2331b9771ea09a12f32de-C001-143", "text": "Locascio et al. (2016) ." } ], "y": { "@BACK@": { "gold_contexts": [ [ "4b65a59fc2331b9771ea09a12f32de-C001-13" ], [ "4b65a59fc2331b9771ea09a12f32de-C001-26" ], [ "4b65a59fc2331b9771ea09a12f32de-C001-73" ], [ "4b65a59fc2331b9771ea09a12f32de-C001-89" ], [ "4b65a59fc2331b9771ea09a12f32de-C001-114" ], [ "4b65a59fc2331b9771ea09a12f32de-C001-133" ] ], "cite_sentences": [ "4b65a59fc2331b9771ea09a12f32de-C001-13", "4b65a59fc2331b9771ea09a12f32de-C001-26", "4b65a59fc2331b9771ea09a12f32de-C001-73", "4b65a59fc2331b9771ea09a12f32de-C001-89", "4b65a59fc2331b9771ea09a12f32de-C001-114", "4b65a59fc2331b9771ea09a12f32de-C001-133" ] }, "@SIM@": { "gold_contexts": [ [ "4b65a59fc2331b9771ea09a12f32de-C001-75" ] ], "cite_sentences": [ "4b65a59fc2331b9771ea09a12f32de-C001-75" ] } } }, "ABC_dd603c79f87e98d23f6f8e13028ae9_23": { "x": [ { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-2", "text": "Modern Machine Translation (MT) systems perform consistently well on clean, in-domain text." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-71", "text": "We present quantitative results of our experiments in Table 3 below." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-96", "text": "This difference results in a gain of 0.9 BLEU of TBT over UBT." 
}, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-97", "text": "----------------------------------" }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-98", "text": "**CONCLUSION**" }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-3", "text": "However, human-generated text, particularly in the realm of social media, is full of typos, slang, dialect, idiolect and other noise which can have a disastrous impact on the accuracy of output translation." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-4", "text": "In this paper we leverage the Machine Translation of Noisy Text (MTNT) dataset (Michel and Neubig, 2018) to enhance the robustness of MT systems by emulating naturally occurring noise in otherwise clean data." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-5", "text": "Synthesizing noise in this manner, we are ultimately able to make a vanilla MT system resilient to naturally occurring noise and partially mitigate loss in accuracy resulting therefrom." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-6", "text": "----------------------------------" }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-8", "text": "Machine Translation (MT) systems have been shown to exhibit severely degraded performance when presented with translation of out-of-domain or noisy data (Luong and Manning, 2015; Sakaguchi et al., 2016; Belinkov and Bisk, 2017) ." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-9", "text": "This is particularly pronounced when systems trained on clean, formalized parallel data, such as Europarl (Koehn, 2005) , are tasked with the translation of unedited, human-generated text such as is common in domains like social media, where accurate translation is becoming of widespread relevance (Michel and Neubig, 2018) ." 
}, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-10", "text": "Improving the robustness of MT systems to naturally occurring noise presents an important and interesting task." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-11", "text": "Recent work on MT robustness (Belinkov and Bisk, 2017) has further demonstrated the need to build or adapt systems that are resilient to such noise." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-12", "text": "We approach the problem of adapting to noisy data through two primary research questions: * These authors contributed equally 1." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-13", "text": "Can we artificially synthesize the types of noise common to social media text in otherwise clean data?" }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-14", "text": "2. Are we able to improve the performance of vanilla MT systems on noisy data by leveraging artificially generated noise?" }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-15", "text": "In this work we present two primary methods of synthesizing natural noise in accordance with the types of noise identified in prior work (Eisenstein, 2013; Michel and Neubig, 2018) as naturally occurring in internet and social media based text." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-16", "text": "We present a series of experiments based on the Machine Translation of Noisy Text (MTNT) data set (Michel and Neubig, 2018 ) through which we demonstrate improved resilience of a vanilla MT system by adaptation using artificially noised data." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-17", "text": "The primary contributions of this work are our Synthetic Noise Induction model which specifically introduces types of noise unique to social media text and the introduction of back translation (Sennrich et al., 2015a) as a means of emulating target noise." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-18", "text": "Szegedy et al. 
(2013) demonstrate the fragility of neural networks to noisy input." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-19", "text": "This fragility has been shown to extend to MT systems (Belinkov and Bisk, 2017; Khayrallah and Koehn, 2018) where both artificial and natural noise are shown to negatively affect performance." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-20", "text": "----------------------------------" }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-21", "text": "**RELATED WORK**" }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-22", "text": "Human generated text on the internet and social media are a particularly rich source of natural noise (Eisenstein, 2013; Baldwin et al., 2015) which causes pronounced problems for MT (Michel and Neubig, 2018) ." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-23", "text": "Robustness to noise in MT can be treated as a domain adaptation problem (Koehn and Knowles, 2017) and several attempts have been made to handle noise from this perspective." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-24", "text": "Notable approaches include training on varying amounts of data from the target domain (Li et al., 2010; Axelrod et al., 2011); Luong and Manning (2015) suggest the use of fine-tuning on varying amounts of target-domain data, and Barone et al. (2017) note a logarithmic relationship between the amount of data used in fine-tuning and the relative success of MT models." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-25", "text": "Other approaches to domain adaptation include weighting of domains in the system objective function (Wang et al., 2017) and specifically curated datasets for adaptation (Blodgett et al., 2017) ." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-26", "text": "Kobus et al. (2016) introduce a method of domain tagging to assist neural models in differentiating domains." 
}, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-27", "text": "Whilst the above approaches have shown success in specifically adapting across domains, we contend that adaptation to noise is a nuanced task and treating the problem as a domain adaptation task may fail to fully account for the varied types of noise that can occur in internet and social media text." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-28", "text": "Experiments that specifically handle noise include text normalization approaches (Baldwin et al., 2015) and (most relevant to our work) the artificial induction of noise in otherwise clean data (Sperber et al., 2017; Belinkov and Bisk, 2017) ." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-29", "text": "----------------------------------" }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-30", "text": "**DATA**" }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-31", "text": "To date, work in the adaptation of MT to natural noise has been restricted by a lack of available parallel data." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-32", "text": "Michel and Neubig (2018) introduce a new data set of noisy social media content and demonstrate the success of fine-tuning which we leverage in the current work." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-33", "text": "The dataset consists of naturally noisy data from social media sources in both English to French and English to Japanese pairs." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-34", "text": "In our experimentation we utilize the subset of the data for English to French which contains data scraped from Reddit 1 ." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-35", "text": "The data set contains training, validation and test data." 
}, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-36", "text": "The training data is used in fine-tuning of our model in certain settings outlined below and all results are reported on the MTNT test set for French-English." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-37", "text": "We additionally use other datasets including Europarl (EP) (Koehn, 2005) and TED talks (TED) (Ye et al., 2018) for training our models as described in \u00a75." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-38", "text": "----------------------------------" }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-39", "text": "**BASELINE MODEL**" }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-40", "text": "Our baseline MT model architecture consists of a bidirectional Long Short-Term Memory (LSTM) network encoder-decoder model with two layers." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-41", "text": "The hidden and embedding sizes are set to 256 and 512, respectively." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-42", "text": "We also employ weighttying (Press and Wolf, 2016) between the embedding layer and projection layer of the decoder." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-43", "text": "For expediency and convenience of experimentation we have chosen to deploy a smaller, faster variant of the model used in Michel and Neubig (2018) , which allows us to provide comparative results across a variety of settings." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-44", "text": "Other model parameters reflect the implementation outlined in Michel and Neubig (2018) ." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-45", "text": "In all experimental settings we employ Byte-Pair Encoding (BPE) (Sennrich et al., 2015b) using Google's SentencePiece 2 ." 
}, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-46", "text": "----------------------------------" }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-47", "text": "**EXPERIMENTAL APPROACHES**" }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-48", "text": "We propose two primary approaches to increasing the resilience of our baseline model to the MTNT data, outlined as follows:" }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-49", "text": "----------------------------------" }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-50", "text": "**SYNTHETIC NOISE INDUCTION (SNI)**" }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-51", "text": "For this method, we inject artificial noise in the clean data according to the distribution of types of noise in MTNT specified in Michel and Neubig (2018) ." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-52", "text": "For every token we choose to introduce the different types of noise with some probability on both French and English sides in 100k sentences of EP." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-53", "text": "Specifically, we fix the probabilities of error types as follows: spelling (0.04), profanity (0.007), grammar (0.015) and emoticons (0.002)." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-54", "text": "To simulate spelling error, we randomly add or drop a character in a given word." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-55", "text": "For grammar error and profanity, we randomly select and insert a stop word or an expletive and its translation on either side." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-56", "text": "Similarly for emoticons, we randomly select an emoticon and insert it on both sides." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-57", "text": "Algorithm 1 elaborates on this procedure." 
}, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-58", "text": "----------------------------------" }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-59", "text": "**NOISE GENERATION THROUGH BACK-TRANSLATION**" }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-60", "text": "We further propose two experimental methods to inject noise into clean data using the back-translation technique (Sennrich et al., 2015a) ." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-61", "text": "----------------------------------" }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-62", "text": "**UN-TAGGED BACK-TRANSLATION (UBT)**" }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-63", "text": "We first train both our baseline model for fr-en and an en-fr model using TED. We subsequently take 100k French sentences from EP and generate a noisy version thereof by passing them sequentially through the trained models as shown in Figure 1 ." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-64", "text": "We hypothesize that the resulting translation will be inherently noisy as a result of imperfect translation of the intervening MT system." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-65", "text": "----------------------------------" }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-66", "text": "**TAGGED BACK-TRANSLATION (TBT)**" }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-67", "text": "The intuition behind this method is to generate noise in clean data whilst leveraging the particular style of the intermediate corpus." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-68", "text": "Both models are trained using TED and MTNT as in the preceding setting, save that we additionally append a tag in front of every sentence during training, in accordance with Kobus et al. (2016) , to indicate the origin data set of each sentence." 
}, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-69", "text": "----------------------------------" }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-70", "text": "**RESULTS**" }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-72", "text": "Of specific note is the apparent correlation between the amount of in-domain training data and the resulting BLEU score." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-73", "text": "The tagged back-translation technique produces the most pronounced increase in BLEU score being +6.07 points." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-74", "text": "This represents a particularly significant result given that we do not fine-tune the baseline model on in-domain data." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-75", "text": "We attribute this gain to the quality of the noise generated." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-76", "text": "The results for all our proposed experimental methods further imply that out-of-domain clean data can be leveraged to make the existing MT models robust on a noisy dataset." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-77", "text": "However, simply using clean data is not that beneficial as can be seen from the experiment involving FT Baseline w/ TED-100k." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-78", "text": "----------------------------------" }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-79", "text": "**ANALYSIS**" }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-80", "text": "In this section we present qualitative analysis of both methods introduced above." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-81", "text": "Figure 2 illustrates the relative effect of varying the level of SNI on the BLEU score as evaluated on the newsdiscuss2015 3 dev set." 
}, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-82", "text": "From this we note that the relationship between the amount of noise and the effect on BLEU score appears to be linear." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-83", "text": "We also note that the most negative effect is obtained by including profanity." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-84", "text": "Our current approach involves inserting expletives at random positions in a given sentence." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-85", "text": "However, we note that this approach may under-represent the nuanced linguistic usage of expletives in natural text, which may explain the above-mentioned effect on accuracy." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-86", "text": "Table 2 shows the decoded output produced by different models." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-87", "text": "We find that the output produced by our best model is reasonably successful at imitating the language and style of the reference." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-88", "text": "The output of Baseline + FT w/ EP-100k-TBT is far superior to that of Baseline, which highlights the quality of the back-translated noisy EP obtained through our tagging method." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-89", "text": "We also consider the effect of varying the amount of supervision which is added for fine-tuning the model." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-90", "text": "The Baseline + FT w/ EP-100k-TBT model already produces a reasonable translation for the input sentence." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-91", "text": "However, if we further fine-tune the model using only 10k MTNT data, we note that the model still struggles with generation of *very*. This error dissipates if we use 20k MTNT data for fine-tuning." 
}, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-92", "text": "These represent small nuances which the model learns to capture with increasing supervision." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-93", "text": "To better understand the performance difference between UBT and TBT, we evaluate the noised EP data." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-94", "text": "Figure 1 shows an example where we can clearly see that the style of translation obtained from TBT is very informal as opposed to the output generated by UBT." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-95", "text": "Both the outputs are noisy and different from the input but since the TBT method enforces the style of MTNT, the resulting output is perceptibly closer in style to the MTNT equivalent." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-99", "text": "In this paper we introduce two novel methods of improving the resilience of vanilla MT systems to noise occurring in internet and social media text." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-100", "text": "Namely a method of emulating specific types of noise and the use of back-translation to create artificial noise." }, { "sent_id": "dd603c79f87e98d23f6f8e13028ae9-C001-101", "text": "Both of these methods are shown to increase system accuracy when used in fine-tuning without the need for the training of a new system and for large amounts of naturally noisy parallel data." 
} ], "y": { "@USE@": { "gold_contexts": [ [ "dd603c79f87e98d23f6f8e13028ae9-C001-4" ], [ "dd603c79f87e98d23f6f8e13028ae9-C001-16" ], [ "dd603c79f87e98d23f6f8e13028ae9-C001-51" ] ], "cite_sentences": [ "dd603c79f87e98d23f6f8e13028ae9-C001-4", "dd603c79f87e98d23f6f8e13028ae9-C001-16", "dd603c79f87e98d23f6f8e13028ae9-C001-51" ] }, "@SIM@": { "gold_contexts": [ [ "dd603c79f87e98d23f6f8e13028ae9-C001-4" ], [ "dd603c79f87e98d23f6f8e13028ae9-C001-44" ], [ "dd603c79f87e98d23f6f8e13028ae9-C001-51" ] ], "cite_sentences": [ "dd603c79f87e98d23f6f8e13028ae9-C001-4", "dd603c79f87e98d23f6f8e13028ae9-C001-44", "dd603c79f87e98d23f6f8e13028ae9-C001-51" ] }, "@BACK@": { "gold_contexts": [ [ "dd603c79f87e98d23f6f8e13028ae9-C001-9" ], [ "dd603c79f87e98d23f6f8e13028ae9-C001-15" ], [ "dd603c79f87e98d23f6f8e13028ae9-C001-22" ] ], "cite_sentences": [ "dd603c79f87e98d23f6f8e13028ae9-C001-9", "dd603c79f87e98d23f6f8e13028ae9-C001-15", "dd603c79f87e98d23f6f8e13028ae9-C001-22" ] }, "@DIF@": { "gold_contexts": [ [ "dd603c79f87e98d23f6f8e13028ae9-C001-43" ] ], "cite_sentences": [ "dd603c79f87e98d23f6f8e13028ae9-C001-43" ] }, "@EXT@": { "gold_contexts": [ [ "dd603c79f87e98d23f6f8e13028ae9-C001-43" ] ], "cite_sentences": [ "dd603c79f87e98d23f6f8e13028ae9-C001-43" ] } } }, "ABC_874a8d4f847aff2895deb7c7560c56_23": { "x": [ { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-10", "text": "These patterns are used along with the othering pronoun and hate speech terms to build our 'Othering Lexicon'." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-2", "text": "Hateful and offensive language (also known as hate speech or cyber hate) posted and widely circulated via the World Wide Web can be considered as a key risk factor for individual and societal tension linked to regional instability." 
}, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-3", "text": "Automated Web-based hate speech detection is important for the observation and understanding trends of societal tension -especially in online social networks where hateful and antagonistic posts can be widely viewed and disseminated." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-4", "text": "Existing research has made significant in-roads into achieving this, albeit to levels of around 80% accuracy, leaving room for improvement." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-5", "text": "In this research, we improve on existing research by proposing different data mining feature extraction methods." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-6", "text": "While previous work has involved using lexicons, bags-ofwords or probabilistic parsing approach (e.g. using Typed Dependencies), they all suffer from a similar issue which is that hate speech can often be subtle and indirect, and depending on individual words or phrases can lead to a significant number of false negatives." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-7", "text": "This problem motivated us to conduct new experiments to identify subtle language use, such as references to immigration or job prosperity in a hateful context." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-8", "text": "We propose a novel 'Othering Lexicon' to identify these subtleties and we incorporate our lexicon with embedding learning for feature extraction and subsequent classification using a neural network approach." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-9", "text": "Our method first explores the context around othering terms in a corpus, and identifies context patterns that are relevant to the othering context." 
}, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-11", "text": "Embedding algorithim has the superior characteristic that the similar words have a closer distance, which is helpful to train our classifier on the negative and positive classes." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-12", "text": "captures the semantic meaning of the lexicon content jointly with a benign content, the negative content would be aligned in similar vectors with the othering pronoun and hateful content." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-13", "text": "For validation, several experiments were conducted on different types of hate speech, namely religion, disability, race and sexual orientation, with F-measure scores for classifying hateful instances obtained through applying our model of 0.93, 0.95, 0.97 and 0.92 respective" }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-14", "text": "----------------------------------" }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-15", "text": "**INTRODUCTION**" }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-16", "text": "The uptake of online social networks for social participation and social mobilisation is having a big impact on society." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-17", "text": "As people increasingly communicate through online applications, the need for high-quality, automated abusive language detection has become much more profound." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-18", "text": "While the benefits of online social media are enabling distributed societies to be connected, one unanticipated disadvantage of the technology is the ability for hateful and antagonistic content or cyberhate to be published and propagated [5] , [43] ." 
}, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-19", "text": "Several studies have shown how individuals with biased or negative views towards a range of minority groups are taking to the Web to spread such hateful messages [23] [32] ." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-20", "text": "Instances of cyberhate and racist tension on social media have also been shown to be triggered by antecedent events, such as terrorist acts [4] [44] ." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-21", "text": "A hate crime or bias-motivated crime occurs when the perpetrator of the crime intentionally selects the victim because of his or her membership of a certain group [35] ." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-22", "text": "Expressing hate speech or discriminative reactions implies different language uses." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-23", "text": "For example, to express emotions, hateful words might be used such as 'hate them' ; moreover, to encourage violence, an inflammatory verb could be used such as 'kill'." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-24", "text": "While the previous example contains a directly offensive word (kill), some hate speech samples contain no single negative words (e.g. send them home)." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-25", "text": "Although they do not contain hateful words, they setting a distance between different groups which promote the discriminative situations inside societies." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-26", "text": "There have been a number of attempts to address online hate speech classification using different approaches (e.g. lexicons, syntax features and semantic features), yet the problem lies in classifying text that does not contain clearly hateful words and would have an impact on classification accuracy." 
}, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-27", "text": "An example of a text that does not contain direct hate speech but imply hateful meaning is the tweets \"send them home\"." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-28", "text": "While Previous studies highlight the use of embedding learning on different features sets (e.g BOW [10] , POS, dependency, trained lexicon and n grams [28] , they do not consider the previous example as hate and their results depended on the occurrence in the negative or positive context of the words \"send\", \"them\" and \"home\"." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-29", "text": "To solve this problem, this study provides new insights into leveraging a specific hateful stereotype in the web content which is the othering terms as a pointer towards hate speech and discrimination." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-30", "text": "Our aim in this study is to add new feature that has not used before for advance the machine understanding of the negative text, specifically texts that do not contains explicit hateful expression." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-31", "text": "Othering terms in different languages (e.g. English: You and Me), are used for pointing towards the side of the speaker, or tweets' authors in our case." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-32", "text": "Our assumptions are based on studies that introduced the idea that some text content could contribute to detecting hateful speech, e.g. the othering terms [6] ." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-33", "text": "In our study, English pronouns are considered the othering language, and we named two groups of English pronouns the 'two-sided' othering language." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-34", "text": "The first group of English pronouns comprises those that express the speaker/writer side (e.g. 
I, we, us, our, etc.), and the other group contains English pronouns that are used when the speaker/writer points towards different individuals or groups (e.g. you, she, he, they, them, those, these, etc.)." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-35", "text": "The permutation of the pronouns is as follows: each pronoun from one side appears with a pronoun from the other side (You Our, You Us, She We, They I, etc.). In addition, we refer to the content that frequently appears with the othering terms as a pattern of othering use." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-36", "text": "For example, in 'send them all home', the verb 'send' appears with the pronoun 'them', which could be a pattern of the use of pronouns." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-37", "text": "The pronouns' patterns could comprise verbs, nouns, etc." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-38", "text": "We hypothesize that treating sentences that contain two-sided othering language as hateful content improves the overall hate speech detection accuracy." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-39", "text": "Building on that, we propose a very effective framework that captures the features that are expected to be seen with the two-sided othering language and extracts their semantic meaning using the paragraph2vec algorithm." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-40", "text": "On one side, this framework adds a novel feature expected to carry negative meaning by building a lexicon which comprises the two-sided othering terms, the two-sided othering patterns and the main negative components; the lexicon components were extracted from an external annotated dataset." 
}, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-41", "text": "From another side, this framework improves the use of paragraph2vec algorithim that used in previous works as we consider our lexicon to improve the embedding learning." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-42", "text": "Differently from the previous works that include the syntactic features and semantic learning, we add the two-sided pronouns as feature and we consider the (verbs, nouns and adjectives) which has no negative meaning but appear in a negative content as hateful content." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-43", "text": "The paragraph2vec algorithim learn the semantic meaning of the lexicon component jointly with the main data sets." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-44", "text": "The lexicon component advance the semantic learning as following: the samples that contain two sided othering terms and patterns and negative content would be aligned in similar vectors." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-45", "text": "For example, if we have the following sentence: (We want to hang them all) which contains clear hateful content \"hang\" and two-sided othering terms (we, them), from the embedding algorithim perspective, the previous sentence would aligned to similar vector of the following sentence: (We need to get them out)." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-46", "text": "We compared our model with state of art that applied combination of different text features -probabilistic (e.g. typed dependency text) and vector space features (e.g. doc2vec) Our model improve on the state of art studies by decreasing the false negative and false positive prediction." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-47", "text": "In this paper, we show our proposed method and how it leads to improving machine understanding of the 'othering' language, as follows:in Section 2 previous the data set and collection." 
}, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-48", "text": "In Section 3, we introduce the theoretical explanation of our model." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-49", "text": "In Section 4, we review related work that is relevant to our technical proposal." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-50", "text": "The othering terms used are defined in Section 5." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-51", "text": "We explain the experimental steps of the proposed framework in Section 6." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-52", "text": "The classification results are presented and compared in Section 7 with the baseline results from [6] through a deep discussion to clarify how the proposed approach leads to better performance of cyberhate classification than the main baseline one." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-53", "text": "In Section 8, we show the possibility of generalising our model and the limitations." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-54", "text": "Finally, in Section 8, the contributions of this paper are summarised and some further directions are suggested for advancing this research area further." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-55", "text": "----------------------------------" }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-56", "text": "**DATASETS**" }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-57", "text": "We used the dataset provided by [9] to extract the othering lexicon content." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-58", "text": "They used a crowdsourced hate speech lexicon to collect tweets containing different types of hate speech keywords." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-59", "text": "They used crowdsourcing to divide a sample of these tweets into three categories: those containing hate speech, those with only offensive language, and those with neither." 
}, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-60", "text": "Their dataset contains different samples of different hate speech types." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-61", "text": "We extracted the samples that contain hate speech or offensive language and then from the extracted samples we extracted those that only contain two-sided othering language, which resulted in 674 tweets." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-62", "text": "Those tweets are used for extracting the verbs, nouns and adjectives related to othering terms in hate speech for lexicon construction." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-63", "text": "In addition to this sample, we extracted all nouns and adjectives that were mentioned in the whole hateful samples." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-64", "text": "For testing, to demonstrate an improvement on the state of the art, we used four datasets from previous work [6] ." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-65", "text": "The datasets were collected from Twitter and annotated using the CrowdFlower human intelligence task service with a single question: 'Is this text antagonistic or hateful based on a protected characteristic?' The datasets comprise four different protected characteristics, as follows: sexual orientation \u00e2\u0102\u015e 1803 tweets, with 183 instances of offensive or antagonistic content (10.15% of the annotated sample); race 1876 tweets, with 70 instances of offensive or antagonistic content (3.73% of the annotated sample); disability 1914 tweets, with 51 instances of offensive or antagonistic content (2.66% of the annotated sample); and religion 1901 tweets, with 222 instances of offensive or antagonistic content (11.68% of the annotated sample)." 
}, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-66", "text": "The authors conducted all of the necessary tests so as to ensure agreement between annotators for the gold-standard samples [6] ." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-67", "text": "The amount of abusive or hate instances is small relative to the size of the sample." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-68", "text": "However, these are random instances of the full datasets for each event and they are considered representative of the overall levels of cyberhate within the corpus of tweets." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-69", "text": "----------------------------------" }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-70", "text": "**THEORETICAL FRAMEWORK**" }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-71", "text": "This study is based on leveraging the use of the othering pronouns in web-based content as features for hate speech classification." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-72", "text": "[4] identified that the 'othering' language was a useful feature for classifying cyberhate based on religious beliefs, specifically for identifying anti-Muslim sentiment." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-73", "text": "Othering is an established construct in rhetorical narrative surrounding hate speech [25] , and the 'we-they' dichotomy has always been identified in racist discourse [48] ." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-74", "text": "[47] , who focus on decoding racist discourse, argued that while the 'self' or the concept of 'us' is constructed as an in-group identity, the 'other or the concept of 'them' is constructed as an out-group identity [39] ." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-75", "text": "Therefore, polarisation and opposition are created by emphasising the differences between 'us' and 'them'." 
}, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-76", "text": "Positive self-representation and negative representation of the 'other' mark are considered an out-group that is undesirable [46] ." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-77", "text": "We assume that the text that appears with two-sided othering terms is more likely to represent negative speech, specifically in hate-related events." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-78", "text": "Anti-Hispanic speech might make a reference to border crossing or legal identification, antiAfrican American speech often references unemployment or single parent upbringing, and antiSemitic language often refers to money, banking and the media." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-79", "text": "The use of stereotypes also means that some languages may be regarded as hateful even if no single word in the passage is hateful by itself [11] ." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-80", "text": "Figure 1 presents an overview of textual features which are generated between different groups who discriminate themselves from others." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-81", "text": "The figure shows how the othering pronoun terms from one side (us, we, I, etc.) draw the boundary between the two different groups, e.g. people try to distinguish themselves from another group by using either 'we' or 'they'. When they use two-sided pronouns in the same context, they define the boundary of the discriminative or negative text (whether it is directly or indirectly hateful)." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-82", "text": "When the authors express their opinion or attitude using two-sided othering terms in related hateful web events, it is more likely that they will present the difference by drawing boundaries between the two different groups." 
}, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-83", "text": "The first boundary was drawn by expressing the first side of pronouns (e.g. we) and the second boundary was drawn by using the second side of pronouns (e.g. you)." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-84", "text": "In the hate-related event, the text that appears with these pronouns expresses the tension in using two-sided pronouns, and we assume in our study that this is more likely a negative or discriminative perspective." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-85", "text": "Our approach involves four main steps." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-86", "text": "The first step is to extract the othering term pattern, as well as all of the verbs, nouns and adjectives in the negative samples." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-87", "text": "The second step is to build the lexicon by adding the two previous results and the othering permutation." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-88", "text": "Thirdly, we learn the sentence embedding of the lexicon jointly with negative samples and benign samples." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-89", "text": "Finally, we feed our feature sets to MLP for training classifiers." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-90", "text": "Figure 9 explains the workflow of our method To leverage the use of the two-sided othering terms as new features in hate speech classification, firstly, we investigate whether expressing two-sided othering terms would be more likely to elict negative tension among short messages (e.g. tweets) and long messages (e.g. Wikipedia articles)." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-91", "text": "Secondly, depending on the previous findings, we begin to extract the othering pattern as features." 
}, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-92", "text": "The classification task is considered a validation one that aims at measuring the accuracy obtained by using our model." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-93", "text": "----------------------------------" }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-94", "text": "**MEASURING THE OTHERING LANGUAGE**" }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-95", "text": "Most existing efforts to measure hate speech require knowing the hateful words or targets [21] ." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-96", "text": "The key idea is as follows: if some users post their hateful emotions, they are likely to use pronouns that express their feelings (e.g. we like), but if another pronoun that points towards the other side is expressed in the same message, the post might be more likely a discriminative or negative post." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-97", "text": "This idea could not be applied in all instances for hate speech detection but it could boost the assumption of hate existence in a specific post, specifically within text that contains no directly hateful words." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-98", "text": "This step of the investigation does not aim to predict all hateful instances, yet points towards the usage of two-sided otherness in hateful instances, so we examine the instances that contain othering terms only." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-99", "text": "Hateful and neutral instances were collected from six types of hate speech characteristics: anti-Muslim, anti-Semitism, homosexuality, disability, racism and sexism." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-100", "text": "We investigate 9400 benign and 1329 negative instances." 
}, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-101", "text": "We found that two-sided otherness expression is considered a hate-related feature of hate speech classification, as otherness usage is 0.9% and 17.6% for benign and negative instances respectively in short messages such as tweets." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-102", "text": "Comparing the two-sided otherness appearance in benign and negative samples illustrates that negative hate clearly involves this language for expressing a hateful attitude, whereas it was used" }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-103", "text": "----------------------------------" }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-104", "text": "**FIGURE 1: THE GRAPH SHOWS THE BOUNDARY DEFINED BETWEEN TWO GROUPS USING THE OTHERING TERMS, AND THE SPACEBETWEEN THE BOUNDARY SHOWS HOW THE NEGATIVE TEXT COULD BE DEFIED**" }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-105", "text": "in a non-observable manner in the benign instances." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-106", "text": "We should notice that the percentage of 0.9% is considerably tiny for the benign sample." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-107", "text": "However, it contains 28% of the negative articles but produces a higher percentage of positive articles (46%), which means that two different sides of othering terms in long text (e.g. articles) are used frequently in both benign and negative text and could not be considered hateful features." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-108", "text": "Clearly, this strategy does not identify all existing hate speech in social media, which is fine when considering the purpose of the analysis presented in this study." 
}, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-109", "text": "Overall, despite the volume of hateful instances in comparison with the volume of benign samples, it is clear in Figure 3 that two-sided othering term existence in the hateful sample exceeds the occurrence of othering terms in the benign samples." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-110", "text": "In 4, we implemented a comparative experiment measure the frequent occurrence of two-sided othering terms in both hateful samples and non-hateful samples among different types of hate speech datasets." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-111", "text": "Figure 4 that the anti-Muslim dataset contains the most frequent usage of the twosided othering language, followed by anti-Semitism, which means that as [4] identified, 'othering' language was a useful feature for classifying cyberhate based on religious beliefs, specifically for identifying anti-Muslim sentiment." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-133", "text": "The dataset was provided by [9] ." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-112", "text": "The difference in this investigation from previous study [6] is that we investigate the use of two sides of othering terms (English pronouns in our case) as a pointer towards negative or discriminative attitudes in different types of hate speech events." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-113", "text": "The third hate speech characteristic that uses the othering term is the Racism dataset." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-114", "text": "In contrast, we noticed that in the Disability dataset, the use of the two-sided othering term was more frequent in the neutral language than in the negative one." 
}, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-115", "text": "This means that the othering terms are more likely to be benign in the disability event, an example extracted from the same Disability dataset (e.g. we" }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-116", "text": "----------------------------------" }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-117", "text": "**IT IS OBVIOUS IN**" }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-118", "text": "----------------------------------" }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-119", "text": "**FIGURE 2: OUR MODEL WORKFLOW USING TEN-FOLD CROSS-VALIDATION. OUR DATASET COMPRISES THE FOUR DATASETS OFRELIGION, RACE, DISABILITY, AND SEXUAL ORIENTATION**" }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-120", "text": "are proud of you), which means that two-sided othering terms are used to explain the situation of the achievement more than discrimination in the disability-related event." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-121", "text": "Ending with the Sexism and Bisexual hate speech characteristics, we found that the othering terms are frequented in a similar percentage for both hateful and neutral samples (higher in the negative sample), which indicates that the two-sided othering contributes intermediately to discriminating genders and sexual orientations." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-122", "text": "The size of the hateful samples is still a limitation." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-123", "text": "In summary, we found that the use of the two-sided othering language in short messages might be more likely to be that of hate speech, whereas the use of two-sided othering language in the articles tends to be neutral." 
}, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-124", "text": "In addition, we found that the othering language contributes significantly to expressing religious and racist attitudes and intermediately to expressing gender and sexual orientation discrimination." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-125", "text": "For both the former and latter we consider the othering terms to be effective features for hate speech detection." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-126", "text": "Clearly, this strategy does not identify all existing hate speech in social media, which is fine when considering the purpose of the analysis presented in this study." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-127", "text": "This investigation leads us to produce a novel feature that uses two-sided pronouns and their pattern usage as pointers of hate speech." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-128", "text": "----------------------------------" }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-129", "text": "**FEATURES ANALYSIS**" }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-130", "text": "This study discovers the effectiveness of a text portion which somehow could be considered a hate pointer, which is using two-sided othering pronouns." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-131", "text": "To leverage the use of the two-sided othering terms as a new feature in hate speech classification, two core steps were implemented to build our model: (a) building the othering lexicon and (b) learning the vector space embedding of the lexicon jointly with the negative and benign samples." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-132", "text": "The othering lexicon contains the othering terms and their patterns extracted by a dependency parser, as well as the verbs, nouns and adjectives that appear in all of the negative samples extracted by the POS tagger and all of the two-sided pronoun permutations." 
}, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-134", "text": "An example of the previous method for the tweet 'we want to send them all home' contains two-sided pronouns and the verb 'send', which grammatically relates to the pronoun 'them' as a subjective clause." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-135", "text": "Even though the verb 'send' is not offensive, we include it in the lexicon." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-136", "text": "This verb was extracted from the dependency parser of the sample tweets that con-train two-sided othering terms, which is considered a pattern verb of the othering terms (pronouns)." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-137", "text": "The reason for pronoun permutation was to increase the possibility of the pro-nouns appearing with each hateful content, as it was expected that people might use any combination of two pronouns from different sides so as to express discriminative attitudes using different hateful content." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-138", "text": "Algorithm 1 shows the steps of building our lexicon." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-139", "text": "All pronoun combinations will appear with each row that contains the othering pattern and the POS results (verbs, nouns and adjectives)." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-140", "text": "Figure 5 illustrates a focused part of the entire workflow which is presented in Figure 9 ." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-141", "text": "In Figure 5 we show how a specific negative sample contributes to building our lexicon." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-142", "text": "The authors will now turn to the second phase of the model, which is paragraph embedding learning." 
}, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-143", "text": "The two-sided pronouns, which appear frequently in the negative and discriminative attitudes, change the process of assigning the vectors of the negative content to similar vector space by using the paragraph2vec embedding algorithm." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-144", "text": "In this case, all of the two-sided combinations of the pronouns appear with all of the negative content." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-145", "text": "To show the othering effectiveness on the context, which would be learned by the embedding algorithm, assume that we have the following rows in our lexicon:" }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-146", "text": "\u2022we them nsubj send all ni**as home \u2022we you nsubj getout wogs country Figure 10 shows how the pronouns would be predicted by the paragraph2vec algorithm during the training phase of the dataset for producing vectors." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-147", "text": "This prediction schemecould be applied so as to predict unseen data instances that contain similar content." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-148", "text": "From an embedding learning perspective, two-sided othering words and dependency modifiers decrease the distance between the directly/indirectly hateful content and align them in similar vectors." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-149", "text": "The similarity increment advanced the classification accuracy as the negative vectors." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-150", "text": "[36] ." 
}, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-151", "text": "Furthermore, several works begin by first creating priorpolarity lexicons [50] [19] [17] ; in contrast, we build an othering lexicon that assumes that a portion of the hate text contains the othering language, which could be a useful feature for boosting hate speech prediction." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-152", "text": "However, while the dictionary-based approaches generally suffer from the inability to find offensive words with domain and context-specific orientations, Corpus-based approaches use a domain corpus to capture opinion words with preferred syntactic or co-occurrence patterns." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-153", "text": "Focusing on a theme-related lexicon, [15] generate a lexicon of sentiment expressions using semantic and subjectivity features with an orientation towards hate speech, and then use these features to create a classifier for hate speech detection." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-154", "text": "This is similar to our work somehow, in that they depend on extracting useful expressions related to their main themes of race, nationality and religion that can be used as a lexicon in hate speech detection." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-155", "text": "However, we are trying to extract the theme of the use of the othering terms in the antagonistic samples; in addition, we are learning the vector embeddings." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-156", "text": "In relation to the context pattern, [34] detect hate speech using sentence structure by assuming specific patterns starting with the word 'I'." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-157", "text": "They assume that the word 'I' means that the user is talking about the emotions that he or she is feeling." 
}, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-158", "text": "Meanwhile, our work assumes that if 'I' and 'you' exist in the same context, the context is more likely to be discriminative." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-159", "text": "Phrase-level sentiment analysis is different from lexicon-based analysis in that the former focuses on the polarity of the contextual content (e.g. whether there is negation in the sentence or not), whereas the latter predicts the polarity depending on the occurrence of the words that exist in the dictionary." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-160", "text": "A work conducted by [45] proposed a method for automatic hate detection using two steps: firstly, they concentrated on whether sample instances are neutral or polar in context (where polar in context refers to having a contextual polarity that is positive, negative or both)." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-161", "text": "They used word features (tokens), sentence/structure features (dependency relations) and document features (document topic)." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-162", "text": "They used Negated, which is a binary feature that captures whether the word is being locally negated." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-163", "text": "Its value is true if a negation word or phrase is found within the four proceeding words or in any of the word's children in the dependency tree, and if the negation word is not in a phrase that intensifies rather than negates." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-164", "text": "They combined sentence-level features and dictionary-based levels and achieved significant improvement in hate speech polarity prediction." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-165", "text": "While their experiments classified individual words and phrases, we focused on the tension of the author in respect of the othering terms." 
}, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-166", "text": "In addition, they focused on content used in the negative sentiment (e.g. negation), whereas we clarify the effectiveness of content that is not related yet is used in a specific form in hate speech." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-167", "text": "----------------------------------" }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-168", "text": "**LINGUISTIC FEATURES**" }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-169", "text": "One of the most basic forms of natural language processing is the Bag of Words (BoW) feature extraction approach." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-170", "text": "BoW has been successfully applied as a feature extraction method for the automated detection of hate speech, relying largely on keywords relating to offence and antagonism [5, 29, 42] ." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-171", "text": "However, the method also suffers from a high rate of false positives, since the presence of hateful words can lead to the misclassification of tweets being hateful when they are used in a different context (e.g. 'black') [16] ." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-172", "text": "Similar to BoW, n-grams consecutive words of varying sizes (from1...n) have been used to improve the accuracy of hate speech classification [14] [?] by capturing context within a sentence that is lost in the BoW model [3, 7, 28, 41] ." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-173", "text": "[41] found that character n-grams have been shown to be appropriate for abusive language tasks due to their ability to capture variations of words associated with hate." 
}, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-174", "text": "In general, while previous studies addressed the difficulty of the definition of hateful language, their experiments led to better results when combined with a large set of features such as BoW. They showed that BoW, n-grams, part-ofspeech tagging, and data preprocessing (stop word/punctuation removal) provided a significant improvement in sentiment classification among different datasets (blogs and movies) when applied as a sophisticated combination of feature sets." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-175", "text": "They speculated that engineering features based on deeper linguistic representations (e.g. dependencies and parse tree) may work for content on social media." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-176", "text": "In general, lexical and syntactic features are useful when they are applied directly for automatic categorisation of annoying behaviours or topic detection [40] , whereas in cyberhate detection we need a deep understanding of the posted text, which will be the focus of the current work." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-177", "text": "According to [49] , while part-of-speech tagging does not significantly improve classifier performance, we apply POS tagging for extracting specific content (verbs, nouns and adjectives) for lexicon building." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-178", "text": "[9] prepared their data by stemming and then creating unigram, bigram and trigram features, each of which was weighted by its TF-IDF metric." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-179", "text": "Then they demonstrated how non-hateful content might be misclassified due to the fact that it contains words used in racist text." 
}, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-180", "text": "In contrast, there were hateful instances misclassified because they did not contain any of the terms most strongly associated with cyberhate." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-181", "text": "We aim to overcome this limitation by moving away from dependencies on specific keywords." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-182", "text": "Typed dependencies have been widely used for extracting the functional role of context words for sentiment classification [17, 18] and document polarity [37] ." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-183", "text": "[5, 6] demonstrated the effectiveness of applying typed dependencies for classifying cyberhate." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-184", "text": "Their study showed that typed dependencies consistently improved the performance of machine classification for different types of cyberhate by reducing the false negatives by 7%, beyond the use of BoW and known hateful terms." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-185", "text": "----------------------------------" }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-186", "text": "**TEXT EMBEDDING**" }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-187", "text": "Embedding learning is aimed at training a model that can automatically transform a sentence/word into a vector that encodes its semantic meaning." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-188", "text": "It has been shown that embedding representation is very capable of semantic learning when word vectors are mapped into a vector space, such that distributed representations of sentences and doc-uments with semantically similar words have similar vector representations [26] [27] ." 
}, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-189", "text": "Based on the distributional representation of the text, many methods of deriving word representations that are related to cyberhate and offensive language detection are explored in the following works." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-190", "text": "In general, neural network applications were shown to be capable of capturing specific semantic features from complex natural language (e.g. location [31] , entity [33] and images feature [1] )." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-191", "text": "For hate speech detection purposes, [10] solved theproblem of high dimensionality and sparsity by applying sentence embed-ding (paragraph2vec) directly in the context." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-192", "text": "In their study, paragraph2vec, which is an extended version of word2vec for sentences, has been shown to outperform the BoW representation for cyberhate classification." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-193", "text": "However, they limited their study by comparing the classification results with TF-BoW and TF-IDF-BoW for the same comments." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-194", "text": "Similarly, [2] compared the classification accuracy of the combination of different baselines and classifiers (Char n-gram, TF-IDF, BoW and LSTM) and found that learning embedding with gradient-boosted decision trees led to the best classification performance." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-195", "text": "Word vector extraction was also applied to tweets for cyberhate classification by [12] , who built four feature sets: character 4-grams, word vectors based on semantic information built using word2vec, randomly generated word vectors, and word vectors combined with character n-grams." 
}, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-196", "text": "They showed that the second feature set (word2vec alone) outperformed the other ones, which clarified that n-grams might affect classifier performance negatively." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-197", "text": "Several works have merged typed dependencies with embedding learning and clarified the different levels of embedding learning when using the dependency context rather than the raw or liner text." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-198", "text": "[20] showed that dependencybased embeddings are less topical and exhibit more functional similarity than linear embeddings,and [24] showed that dependency context embeddings can provide valuable syntactic information for sentence classification tasks, which is a motivation for implementing classification tasks in respect of dependency embed-ding text." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-199", "text": "In addition, [51] defined the differences between flat text,which they called neighbour words, and the dependency context, and clarified through examples the drawbacks of learning embedding of flat text." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-200", "text": "While these studies introduced the effectiveness of the word distances of the dependency context, which capture the semantic relations, their works targeted other areas of research, not cyberhate; therefore, we are yet to see much evidence that the complex and nuanced 'us and them' narrative emerging on social media can be captured using a combination of typed dependencies and embeddings." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-201", "text": "One study that has combined typed dependencies with embedding learning in the context of cyberhate was reported by [28] ." 
}, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-202", "text": "They developed a machine learning approach to cyberhate based on different syntactic features as well as different types of embedding features, and reported its effectiveness when combined with the standard NLP features (n-gram, linguistics feature, dependency context) in detecting hate speech in online user comments." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-203", "text": "They showed how applying each feature set alone resulted in different classification performance, and found that character n-grams alone are useful in noisy datasets." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-204", "text": "While using n-grams boosted training performance, n-grams result in high dimensionality and render the models susceptible to overfitting." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-205", "text": "This study is arguably closest to ours, but differs insomuch as they used the dependency context to capture long-range dependencies between words, whereas we focused on employing the dependency modifiers as a connection between the pronouns and offensive language in order to capture distances between social groups." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-206", "text": "To determine the effectiveness of our novel approach, we have compared our method with [28] ." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-207", "text": "Our work is therefore distinct from the above literature in that we utilise the dependency modifiers for embedding learning in the context of cyberhate classification." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-208", "text": "Furthermore, we validate our model by training in an unseen dataset for hate speech classification so as to provide evidence that the model itself is applicable for training using various web-based sources of information and testing using short, informal posts from Twitter." 
}, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-209", "text": "----------------------------------" }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-210", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-211", "text": "----------------------------------" }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-212", "text": "**OTHERING LEXICON**" }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-213", "text": "Based on the previous findings in Section 3.1, we evaluate the effectiveness of the othering language as features in hate speech detection by implementing a classification based on training in a negative lexicon." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-214", "text": "As the othering language might appear in the benign and negative language, we construct the following basic expression in order to build a lexicon, with each row of the lexicon being as follows." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-215", "text": "The lexicon assumes that two-sided pronouns appear with a specific pattern and offensive words in each row." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-216", "text": "This row is then converted to vector space using the paragraph2vec algorithm: Each part was extracted as following: Our lexicon content was extracted from negative samples of an annotated dataset [9] ." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-217", "text": "From each instance in the negative dataset we constitute one row in the lexicon, with each row in the lexicon containing three columns: (1) all combinations of two-sided English pronouns, (2) the othering pronoun-related patterns which were extracted using typed dependency parser, and(3)the negative words which were extracted by POS 3 tagging [9] ." 
}, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-218", "text": "The pronoun patterns extracted from the samples that contain two-sided pronouns and the hateful words were extracted from all of the samples." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-219", "text": "A typed dependency parser provides subjective relations between the othering terms and nouns, verbs and adjectives (see Figure 7) ." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-220", "text": "We identify these relations by using four types of dependency relations: nsubj, dobj, detand compound ." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-221", "text": "All of the dependency identifiers and their relations were discarded from the context and we kept the nsubj, dobj and compound identifiers without discarding." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-222", "text": "The reason for preserving these identifiers is as follows: nsubj presents the grammatical relations between the othering and the verb in a tweet, and dobj concerns the direct object of a VP." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-223", "text": "It is the noun phrase, which is the (accusative) object of the verb which presents the grammatical relation between the verb and the noun that occur in the same tweet." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-224", "text": "The compound modifier compounds any noun that serves to modify the head noun, which exist in the same tweet." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-225", "text": "We apply the dependency parser to all of the 674 samples and start extracting all of the othering terms, verbs, nouns and adjectives related to the othering terms through a dependency relationship for building a lexicon." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-226", "text": "For all of the extracted content, we implemented a permutation between all of the othering terms of two groups." 
}, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-227", "text": "Then for each row in the lexicon, we added all of the combination permutations, the process of which resulted in 51,753 rows." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-228", "text": "The goal behind implementing a permutation between the othering terms is to advance the possibility of pronoun occurrence in each lexicon row." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-229", "text": "It seems to assume that each negative sample might contain any combination of two-sided pronouns." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-230", "text": "The dependency modifiers were included in the lexicon." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-231", "text": "In the discussion section, we will clarify the role of the dependency modifiers in embedding learning." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-232", "text": "To develop the lexicon content, POS tagging was applied in all of the negative samples (whether containing two-sided othering language or not)." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-233", "text": "Then we filtered the POS tagged dataset in order to obtain only three types of text content: verbs nouns and adjectives." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-340", "text": "At this sort of studies, different factors have an effect on the results." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-234", "text": "After the nouns, adjectives, verbs and adverbs were extracted, the POS tagged modifiers were removed and then the content was added to the lexicon." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-235", "text": "The POS modifiers were removed from the lexicon because the embedding algorithm takes any content into consideration in the learning vector stage." 
}, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-236", "text": "----------------------------------" }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-237", "text": "**EMBEDDING LEARNING**" }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-238", "text": "We learned the Distributed Memory (PV-DM) vectors using Gensim 4 implementation of distributed representations of the sentence [22] ." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-239", "text": "In paragraph2vec, every sentence is mapped to a unique vector, and every word included in the sentence is also mapped to a unique vector." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-240", "text": "Max \u2211 (tar,con,doc))log P(targetword | contextword ,documentcontext) (1) \u2200" }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-241", "text": "The paragraph vectors and word vectors are averaged or concatenated so as to predict the next word in a context." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-242", "text": "In the Distributed Memory component (PV-DM), the paragraph (the tweet) acts as a memory that remembers the missed word in the current context of the paragraph or tweet." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-243", "text": "----------------------------------" }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-244", "text": "**FIGURE 9: EXAMPLE OF STANFORD DEPENDENCY PAR**" }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-245", "text": "According to [27] , the distributed memory model is consistently better than PV-DBOW." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-246", "text": "To find the best implementation for our data, we experimented with both and found that distributed memory performed better in learning our dataset vectors." 
}, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-247", "text": "We used small window sizes because, according to [24] , a window of size 5 is commonly used to capture broad topical content, whereas smaller windows (e.g.k=2 windows) contain more focused information regarding the target word." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-248", "text": "For further explanation, using a window of size k around the target word w , 2k the contexts which are produced by using k Windows size as k words before and the k words after w. For example, for k= 2, the contexts of the target word comprise w-2, w-1, w+1, w+2." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-249", "text": "In fact, we cannot assign a meaning to any particular dimension." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-250", "text": "We observed the different results being captured by the model by examining which dimension behaved better in which dataset." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-251", "text": "We report results for 600 dimension embeddings and windows = 2, though similar trends were also observed with 100, 300, 800 and 1000 dimensions and windows = 3, 5, 6 and 10." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-252", "text": "We recorded the best performance for 600 dimensions and windows = 2 parameters." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-253", "text": "The final output is a vectorised dataset that is used as a feature set for developing a machine classification approach." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-254", "text": "5.3 Machine Classification Ten-fold cross-validation was used due to the volume of positive and negative in-stances for supervised classification." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-255", "text": "This method has previously been used for building machine classifiers for short text [37] ." 
}, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-256", "text": "In addition to that, we implemented semi-supervised classification by training in the positive samples of the [9] dataset and training in only the lexicon as negative samples." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-257", "text": "The model was evaluated by testing it on the four datasets: religion, disability, racism and sexual-orientation." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-258", "text": "Two classifiers were applied: Multilayer Perception (MLP) and Logistic Regression (LR)." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-259", "text": "Multilayer Perceptron (MLP) is a feed-forward artificial neural network model which maps input datasets on an appropriate set of outputs." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-260", "text": "MLP consists of multiple layers of nodes in a directed graph, with each layer being fully connected to the next layer [13] ." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-261", "text": "Non-linear MLP classifiers work better with our vectorised features." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-262", "text": "Two-layer MLP achieved better performance for our vectors with 200 iterations." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-263", "text": "An LR model is a simplified model on which MLP is based." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-264", "text": "We also implemented this model to test the potential for a much less computationally intensive approach to classification, possibly something that could be utilised in real time to classify content in streamed data." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-265", "text": "For ten-fold cross-validation, we learned the vector space of the negative samples and the othering lexicon (as a negative feature set) jointly with positive samples for supervised learning (see Figure 5 )." 
}, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-266", "text": "For semi-supervised learning, when we tested our lexicon on unseen datasets, we trained the embedding algorithm in our othering lexicon as negative and the positive samples of the [9] dataset (see Figure 9 )." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-267", "text": "Then we labelled each instance individually, assigning the positive and negative labels as 0 and 1 respectively." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-268", "text": "----------------------------------" }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-269", "text": "**RESULTS AND DISCUSSION**" }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-270", "text": "The main baselines in this work are the results produced in [6] , [10] and [28] ." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-271", "text": "The previous studies use different feature sets for hate speech classification, part of which we applied in our study." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-272", "text": "[6] apply lexicon, BoW, n-grams and dependency, [10] used paragraph2vec for joint modeling of comments and words, and [28] add to [6] paragraph2vec and word2vec." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-273", "text": "We validate our model (a) by using ten-fold cross-validation and (b) by training our classifier in the lexicon for negative training and in the benign samples of davidson2017automated for positive training, testing the four datasets as unseen data." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-274", "text": "Table 1 shows machine classification performance for cyberhate based on disability, race, sexual orientation and religion datasets." 
}, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-275", "text": "The table shows the effectiveness of including the othering language terms as features in embedding learning for classification of cyberhate, which leads to reducing the false positive rate in two out of three types of hate speech classification -race and sexual orientation -when compared with using a BoW model and/or hateful terms as features." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-276", "text": "In our work, we show how our model led to reducing the false negative and false positive rates in comparison with the baselines." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-277", "text": "The previous results for [6] were provided using the F-measure metric, with 0.77 for religion, 0.75 for disability, 0.75 for race, and 0.47 for sexual orientation." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-278", "text": "Table 1 shows that we have improved considerably on these results." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-279", "text": "The key finding from previous research was that the inclusion of othering language in the classification of religious cyberhate reduced the false negative rate by 7% when compared to using hateful terms alone." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-280", "text": "In addition, there was no significant improvement in using typed dependencies for disability over a standard BoW approach [6] ." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-281", "text": "In our study, the lowest false negative rate was achieved when using the new feature set for all of the datasets." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-282", "text": "For religious cyberhate, the classification task achieved a 56% reduction in false negatives through using an n-gram feature set, which was the lowest false negative rate in [6 ] ." 
}, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-283", "text": "In addition, the classifier predicted effectively the majority of the positive instances, with an overall false positive rate of 0.3%." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-284", "text": "For the religion dataset, the classifier correctly predicted 89.5% of the cyberhate and around 98% of non-hateful instances." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-285", "text": "This resulted in a 38% reduction in the false negative rate when compared with the lowest rate achieved when applying typed dependency with hateful terms." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-286", "text": "For disability, we improved on the result in [6] with a 16% reduction in false negatives." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-287", "text": "However, our model produced no improvement in the false positive rate when compared with the best rate reported by [6] ." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-288", "text": "For the race classifier, we detected 95% of the cyberhate instances, which is considered an improvement of 22% in detecting hateful instances when compared to." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-289", "text": "For the false positive rate, we achieved an improvement of 1.6% when compared to [6] ." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-290", "text": "For sexual orientation, there is a large reduction in the false negative rate achieved by our model." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-291", "text": "There is a 57% improvement on [6] ." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-292", "text": "While [6] produced no false positives when applying the n-grams feature and hateful terms to the sexual orientation dataset, we achieved 98% of correctly classified hateful instances." 
}, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-293", "text": "We recorded a 0.99 F-measurement, whereas [6] reported a 0.18 F-measurement." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-294", "text": "Our results were compared with using only the paragraph2vec algorithim which introduced in [10] , We noticed that our frame work improve the paragraph2vec learning by decreasing the false negative by 3%, 47%, 7% and 4% for the Religion, Disability, race and sexual orientation respectively." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-295", "text": "In addition to that, We notice the high reduction of the false positive measurements by 43%, 50%, 99% and 24% for the Religion, Disability, race and sexual orientation respectively, which mean that our frame work provide better understanding for semantic learning as take in consideration the effectiveness of the othering language use." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-296", "text": "Moreover, we compare our model with [28] , because they used a combination of several NLP methods (one of them are typed dependency and POS ) with paragraph2vec and word2vec and found them to be very powerful when combined with each other." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-297", "text": "For religion data set, our frame work outperforms [28] features by reducing the FP from 14 to 4 and decreasing the FN by %45." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-298", "text": "For Disability, the FP instances were reduced from 30 to just one and the FN from 36 incorrectly classified to 4 and same thing for the Race data set when the former was 8 and the later were 10." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-299", "text": "For Sexual-Orientation, we could notice the reduction of the FP and FN by approximately the half, from 8 to 4 and from 22 to 10 incorrectly classified instances for both FP and FN respectively." 
}, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-300", "text": "In our study, different features have an effect on the results: The othering terms, the othering patterns which extracted by typed dependency, and the negative content which extracted by POS tagging, the semantic meaning of the previous features which extracted jointly with the main data set by Paragraph2vec algorithim." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-301", "text": "Despite that the Disability has not been affected by the othering terms, we could notice the reduction of the FP and FN results when using our frame work." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-302", "text": "This mean that the semantic meaning of the othering patterns (which extracted by dependncy parser) and the negative content (which extracted by POS tagging) have an effect on results even if the Disability data set does not contain that much use of the two-sided othering terms." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-303", "text": "This mean that our frame work improved in understanding the negative content in general." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-304", "text": "To clarify the effectiveness of the othering terms in hate speech detection, we implemented the same experiments yet we discarded the othering terms from the lexicon (see the ninth row in Table 1 )." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-305", "text": "The results illustrate that there are increases in false negatives and false positives in only extracting the two-sided othering terms from the lexicon." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-306", "text": "This means that the two-sided othering terms (pronouns in our study) advance the paragraph embedding training, which has an effect on the classifier results." 
}, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-307", "text": "Our results are affected by three factors: (a) lexicon content, (b) embedding learning, and (c) MLP classifier." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-308", "text": "Changing any factors of the previous ones produces low-accuracy results." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-309", "text": "For example, we implemented a classification using a Logistic Regression (LR) classifier but we found that MLP behaves better with our feature sets." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-310", "text": "The F-measurements produced by LR were 0.79, 0.73, 0.76 and 0.86." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-311", "text": "The lexicon provides a pattern of the two-sided othering language and the related words (verb, noun, adjectives) which were extracted from humanannotated tweets." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-312", "text": "The othering terms, the othering patterns and the hateful related verbs, nouns and adjectives, which are not necessarily offensive, play an important role in boosting the accuracy of the results." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-313", "text": "Paragraph2vec learns the embedding algorithm and produces two matrices: vocabulary vector matrix and paragraph vector matrix." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-314", "text": "Paragraph2vec aligns the combination of two-sided othering language and the hateful content in similar vectors (in the vocabulary matrix)." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-315", "text": "This increases the probability that using two-sided othering language in the same sentence increases the possibility that the entire sentence is negative." 
}, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-316", "text": "As paragraph2vec aligns the lexicon content with similar vectors, the verb 'send' would be negative if it appeared with two-sided pronouns, as the paragraph2vec algorithm would predict them if they had 'them' and 'we' OR any combination of two-sided othering." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-317", "text": "The novelty of our lexicon is that it not only provides negative words but also provides a pattern or stereotype of negative speech." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-318", "text": "However, our lexicon does not provide an enhanced result when it is used for training the classifier without paragraph embedding learning, and vice versa." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-319", "text": "This means that the combination of the lexicon and paragraph2vec provides the best results." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-341", "text": "For example, similar tweet that related to different event might imply different meaning (e.g. \"send them\")." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-320", "text": "To be specific, the othering terms and dependency modifiers in our lexicon play an important role in aligning the negative in similar vectors in the embedding learning phase." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-321", "text": "While the othering terms as a feature have an effect on the previous hate speech types, it is noticeable that adding the othering language to the lexicon does not produce a significant change for the results prediction in the disability dataset and the improvement of the results due to the other content of the lexicon." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-322", "text": "It could be justified that disability hate speech is more related to hate terms than the othering terms (which are the pronouns in our study)." 
}, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-323", "text": "In other words, people tend to describe the situation of the disabled rather than discriminate themselves from them, as the boundary between the two groups (disabled and non-disabled) is clear." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-324", "text": "In contrast, people in the religion, racism and sexual orientation events tend to use the two-sided othering language because they tend to clarify the ambiguity of their tension in respect of others (e.g. 'my religion is different from yours', 'my ethnicity is different from yours', etc.)." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-325", "text": "With globalisation, people cannot clarify religious identity, racial identity and sexual orientation, so in their speaking or writing they tend to clarify their identity and their attitude towards other identities, which leads them to use two pronouns: the first points towards the speaker/writer and the second points towards others." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-326", "text": "Our framework composed to three main components, The lexicon, the embedding learning and the classification algorithim." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-327", "text": "At this paragraph, We discover the effectiveness of our classifier on our data embedding." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-328", "text": "We compare MLP classifier with different classifiers which used in previous studies for hate speech detection without embedding (e.g. SVM, DT, NB and RF [4] ) and with embedding (e.g. LR [10] )." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-329", "text": "Table 2 shows the results of F-score measurements of applying six different classifiers." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-330", "text": "on our framework." 
}, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-331", "text": "Multiple Layer Perception classifier produced the best results for both false benign and false negative detection." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-332", "text": "Turning to true hate speech classified as offensive it appears that our framework success in increasing the true negative and true positive tweets." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-333", "text": "As we consider the othering language and the indirect hateful content as feature, the improvement is returned to the the lexicon and the embedding learning." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-334", "text": "To show how our lexicon improved the detection results, using only our lexicon resulted in decreasing the false negative than the previous word which use m-gram, BoW and hateful terms." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-335", "text": "Looking at the tweets misclassified as hate speech, despite that our framework successfully advance the machine understanding of hateful content, the misclassified means that the there are tweets contains content which considered benign yet the machine consider them as negative." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-336", "text": "For Religion data set, this data set collected after Woolwich event, which contains language about Muslim and African man (religion and race), according to [9] Twitter users use this type of language in their everyday communications which could be contained in the benign tweets." 
}, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-337", "text": "----------------------------------" }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-338", "text": "**STUDY LIMITATIONS**" }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-339", "text": "As our study assume that the tweet which contains two-sided othering language is more likely to be hateful, this do not mean that all tweet that contain two-sided pronoun are considered as hateful tweet." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-342", "text": "The previous text if belong to races hash-tags show different meaning when it belong to a transportation hash-tags." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-343", "text": "----------------------------------" }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-344", "text": "**CONCLUSION**" }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-345", "text": "To conclude, two-sided othering language could be considered an effective feature for hate speech classification in events related to Religious (anti-Muslim, anti-Semitism), racism and sexual orientation hate speech." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-346", "text": "We investigate the use of two-sided othering terms in negative samples and conclude that a combination of two-sided pronoun terms is used frequently in negative tweets, whereas they are rarely used in the benign tweets." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-347", "text": "We leverage this finding to produce a new feature for hate speech detection, i.e. two-sided pronoun terms and their patterns." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-348", "text": "This leads us to build our own lexicon, learn the vector space of the lexicon and the datasets jointly, and validate the features using ten-fold cross-validation." 
}, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-349", "text": "We found that our framework decrease the incorrectly classified instances for both benign and negative samples for the four data sets including he disability data set which does not contain a noticeable use of two-sided othering terms." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-350", "text": "which mean that our framework provide better understanding for the hate speech." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-351", "text": "We examined the case of removing the othering terms from the lexicon in our framework and found how those terms have an effect on the results." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-352", "text": "Our work is different from the previous work as following: (a) using the two-sided othering terms as features, (b) use the two-sided othering patterns as features extracted by typed dependency, (c) using just the verbs, nouns and adjectives of an annotated negative samples which extracted by POS tagging and (d) applying paragraph embedding on the previous feature jointly with the main data set to extract the vector space features." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-353", "text": "differently, we did not apply the dependency learning on the examined data set yet for extracting the othering patterns also we use the POS for extracting the negative content from an annotate data set." }, { "sent_id": "874a8d4f847aff2895deb7c7560c56-C001-354", "text": "In addition, the semantic learning was implemented to extract the vector space of the lexicon content jointly with the data set, this step guarantee that the negative instances would be aligned in similar vector of the lexicon if the content of both is similar, which boost the classifier prediction accuracy." 
} ], "y": { "@SIM@": { "gold_contexts": [ [ "874a8d4f847aff2895deb7c7560c56-C001-57" ], [ "874a8d4f847aff2895deb7c7560c56-C001-216" ], [ "874a8d4f847aff2895deb7c7560c56-C001-256" ], [ "874a8d4f847aff2895deb7c7560c56-C001-266" ] ], "cite_sentences": [ "874a8d4f847aff2895deb7c7560c56-C001-57", "874a8d4f847aff2895deb7c7560c56-C001-216", "874a8d4f847aff2895deb7c7560c56-C001-256", "874a8d4f847aff2895deb7c7560c56-C001-266" ] }, "@USE@": { "gold_contexts": [ [ "874a8d4f847aff2895deb7c7560c56-C001-57" ], [ "874a8d4f847aff2895deb7c7560c56-C001-133" ], [ "874a8d4f847aff2895deb7c7560c56-C001-216" ], [ "874a8d4f847aff2895deb7c7560c56-C001-256" ] ], "cite_sentences": [ "874a8d4f847aff2895deb7c7560c56-C001-57", "874a8d4f847aff2895deb7c7560c56-C001-133", "874a8d4f847aff2895deb7c7560c56-C001-216", "874a8d4f847aff2895deb7c7560c56-C001-256" ] }, "@BACK@": { "gold_contexts": [ [ "874a8d4f847aff2895deb7c7560c56-C001-178" ], [ "874a8d4f847aff2895deb7c7560c56-C001-336" ] ], "cite_sentences": [ "874a8d4f847aff2895deb7c7560c56-C001-178", "874a8d4f847aff2895deb7c7560c56-C001-336" ] } } }, "ABC_b3f7051cbba3344f0aec0f2e80d5e0_23": { "x": [ { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-2", "text": "We present and evaluate a set of architectures for conversational dialogue systems, exploring rule-based and statistical classification approaches." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-3", "text": "In a case study, we show that while a rule-based dialogue policy is capable of high performance if perfect natural language understanding is assumed, a direct classification approach that combines the dialogue policy with NLU has practical advantages." 
}, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-4", "text": "----------------------------------" }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-5", "text": "**INTRODUCTION**" }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-6", "text": "In this paper we present and evaluate a set of alternative dialogue system architectures that could be used to implement dialogue policies for conversational characters or virtual humans." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-7", "text": "The motivation for this work is to improve our understanding of the development costs and performance benefits associated with alternative system architectures for virtual human dialogue systems (Traum et al., 2005; Swartout et al., 2006; Kenny et al., 2009; Jan et al., 2009; Swartout et al., 2010) ." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-8", "text": "We focus on the language processing steps used in a specific virtual human system described in Section 2." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-9", "text": "We analyze the relationship between Natural Language Understanding (NLU), which maps a user's natural language input to systemspecific semantic representations, and Dialogue Management (DM), which executes a dialogue policy that dictates what the virtual human will say or do in response to the user's input." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-10", "text": "Traditionally, designing a two step NLU+DM pipeline involves defining semantic representations for the dialogue domain and writing rules that constitute the dialogue policy." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-11", "text": "This modular design has the benefit of making the DM policy easy to express in explicit rules, but carries the development cost of requiring significant linguistic expertise." 
}, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-12", "text": "Additionally, as we illustrate in this paper, its performance can depend critically on the reliability of the NLU module." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-13", "text": "As an alternative, we contrast this design with a direct classification approach that relies only on textual examples and effectively combines the dialogue policy with NLU." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-14", "text": "In our case study evaluation, we find that this approach offers superior performance, owing to the high frequency of NLU errors in the two step pipeline." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-15", "text": "The research presented in this paper extends our previous work." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-16", "text": "As we summarize in Section 2, this paper relies on the same data set and evaluation metric as DeVault et al. (2011) , which reports results for learned policies based on maximum entropy models." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-17", "text": "In this paper, we add a comparison to a hand-authored policy (Rules) and a new policy based on relevance models (RM)." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-18", "text": "These new policies are described in Section 3." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-19", "text": "We conclude with some discussion of our new findings." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-20", "text": "----------------------------------" }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-21", "text": "**RESEARCH SETTING AND DATA SET**" }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-22", "text": "We begin by summarizing our research setting, data set, and evaluation metric." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-23", "text": "We refer the reader to DeVault et al. (2011) for additional details." 
}, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-24", "text": "We use an existing virtual human scenario designed for Tactical Questioning (TACQ) (Traum et al., 2008) , where military personnel interview individuals for information of military value." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-25", "text": "TACQ characters are designed to be non-cooperative at times." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-26", "text": "They may answer some of the interviewer's questions, but either lie or refuse to answer others until certain conditions are met ." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-27", "text": "The dialogue policy for a TACQ character is relatively simple in that the character is willing to answer most questions, but correctly implementing the policy requires that certain questions only be answered under certain conditions." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-28", "text": "Our work builds on an existing TACQ scenario involving a virtual human called Amani user interviews Amani, who was a witness to the incident and has information about the identity of the sniper." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-29", "text": "Amani is willing to tell the interviewer what she knows, but she will only reveal certain information in exchange for specific promises of safety, secrecy, and monetary compensation ." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-30", "text": "Figure 1 provides an excerpt of a user interaction with Amani." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-31", "text": "Gandhe et al.'s TACQ system uses speech acts (SAs) to represent the meaning of user and system utterances." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-32", "text": "In this paper, user utterances are modeled using 46 distinct SA labels." 
}, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-33", "text": "For example, the label elicit-whq-tellmemoreabouttheincident is assigned to the user's utterance of can you tell me what you know of the incident? in Figure 1 ." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-34", "text": "The system also defines a different set of 96 unique SAs (responses) for the Amani character." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-35", "text": "We perform our experiments and evaluation using an existing set of 19 annotated Amani dialogues (DeVault et al., 2011) ." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-36", "text": "The dialogues were collected through teletype-based role play." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-37", "text": "Each dialogue turn includes a single user utterance followed by the response chosen by a human role player in the role of Amani." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-38", "text": "There are a total of 296 turns, for an average of 15.6 turns/dialogue." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-39", "text": "The task of Amani's dialogue manager (DM) is to select the most appropriate system SA to use in response to a user utterance." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-40", "text": "In the experiments reported here, the user's utterance may be provided to the DM either directly as text or using a SA label." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-41", "text": "We call the DM's decision process a dialogue policy." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-42", "text": "The system builders' intended policy for Amani is detailed in DeVault et al. (2011) ." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-43", "text": "Because Amani has only a fixed set of system responses, the policy problem looks like a traditional classification task." 
}, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-44", "text": "However, there are two sources of uncertainty that complicate the task." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-45", "text": "Firstly, the mapping between the user's utterance and an appropriate system SA is often one-tomany." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-46", "text": "In our data set, 6 referees independently linked each user utterance to the best system SA response." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-47", "text": "In Figure 1 , we provide an example in which three different system SAs were selected by the 6 referees." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-48", "text": "In other cases, up to 6 different system SAs were selected (DeVault et al., 2011) ." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-49", "text": "Our first experimental question is therefore: how well can a dialogue policy select an appropriate system SA, if it is provided with an accurate user SA?" }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-50", "text": "Would a statistical classification-based policy perform as well as a rule-based policy?" }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-51", "text": "Secondly, the user SAs in the Amani dataset were assigned to the user's utterance by a computational linguist, and we may assume that these \"gold\" SAs accurately represent the user's intended meaning." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-52", "text": "In a run-time system, however, the SAs are identified by an automatic NLU module that is likely to introduce errors." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-53", "text": "It is not obvious a priori to what extent the dialogue policy will suffer due to these NLU errors, and our second experimental question is therefore: how well can a policy select an appropriate system SA, if provided with the NLU's hypothesized user SA?" 
}, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-54", "text": "In training the NLU module, as well as our dialogue policies, we can make use of an additional resource in the Amani data set, which is the availability of approximately 6 textual paraphrases for each utterance; see Figure 1 for an example." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-55", "text": "As a final empirical question, we consider combining the NLU and DM in a design that classifies user utterances, together with shallow features of the dialogue history, directly into system responses." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-56", "text": "This approach is similar to the NLU module, but tries to determine system SAs instead of user SAs." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-57", "text": "This makes unnecessary the labor and knowledge intensive steps of developing the user SA set and annotating utterances with these SAs." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-58", "text": "----------------------------------" }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-59", "text": "**EVALUATION METRIC**" }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-60", "text": "We evaluate the dialogue policies learned in each of our experimental conditions through 19-fold cross-validation of our set of 19 dialogues." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-61", "text": "In each fold, we hold out one dialogue and use the remaining 18 dialogues as training data." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-62", "text": "To measure the performance of the dialogue policy, we follow the approach of DeVault et al. (2011) , which counts an automatically produced system SA as correct if that SA was chosen by at least one referee for that dialogue turn in the data set." 
}, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-63", "text": "We then count the proportion of the correct SAs among all the SAs produced across all 19 dialogues, and use this measure of weak accuracy to score competing dialogue policies." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-64", "text": "We can use the weak accuracy of one referee, measured against all the others, to establish a performance ceiling for this metric." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-65", "text": "(We do not expect that an automatic system would outperform a human referee.) This score is .79; see DeVault et al. (2011) for discussion." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-66", "text": "----------------------------------" }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-67", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-68", "text": "Our experimental setup evaluates three different dialogue policy models: a rule-based approach (Rules), discussed in Section 3.2; a statistical classification technique that uses maximum-entropy classification (MaxEnt), discussed in Section 3.3; and another statistical technique called relevance models (RM), discussed in Section 3.4." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-69", "text": "For the Rules approach, the user's utterance is represented in SA form, and we evaluate the performance of the rules using both hand-annotated or \"gold\" SA (G-SA) as well as automatically assigned NLU SAs (NLU-SA), as described in Section 3.1." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-70", "text": "For the two statistical policy techniques, MaxEnt and RM, the user's utterance may be represented in SA form or in plain text form." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-71", "text": "In the latter case, the NLU and DM modules are effectively consolidated into a single classification step." 
}, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-72", "text": "----------------------------------" }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-73", "text": "**NLU**" }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-74", "text": "Our NLU module treats the problem of mapping an utterance text to a single SA label as a multiclass classification problem, which it solves using a maximum-entropy model (Berger et al., 1996) ." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-75", "text": "The utterance is represented using shallow features such as unigrams and the length of the user utterance (Sagae et al., 2009 )." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-76", "text": "Paraphrases of user utterances are included in the training set." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-77", "text": "----------------------------------" }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-78", "text": "**RULE-BASED POLICY**" }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-79", "text": "We developed our rule-based policy (Rules) by manually writing the simple rules needed to implement Amani's dialogue policy." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-80", "text": "Given a user SA label A t for turn t, the rules for determining Amani's response R t take one of three forms:" }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-81", "text": "The first rule form specifies that a given user SA should always lead to a given system response." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-82", "text": "The second and third rule forms enable the system's response to depend on the user having previously performed (or not performed) a specific SA." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-83", "text": "For example, Amani will only tell the name of the shooter if the user has previously promised to protect her from danger." 
}, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-84", "text": "If such a promise has not yet been made, she will ask the user to protect her in exchange for the information." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-85", "text": "Amani's set of 51 rules was developed in 115 minutes by a computational linguist and system developer." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-86", "text": "Given the existing set of SAs, the rules were very straightforward to develop." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-87", "text": "----------------------------------" }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-88", "text": "**MAXENT POLICY**" }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-89", "text": "Like the NLU, the MaxEnt policy is based on a multi-class maximum-entropy classifier." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-90", "text": "It uses text-based features including unigrams and the length of the current and previous user utterance, as well as the SA label for Amani's previous utterance." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-91", "text": "For experiments in which user utterances are represented as text, the MaxEnt policy is trained using all available paraphrases of user utterances." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-92", "text": "In experiments in which the user utterance is represented using SA labels rather than text, the paraphrase data is ignored, and the MaxEnt policy is trained using the user SA label in place of the text-based features." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-93", "text": "In all cases, the MaxEnt policy is trained using all the alternative acceptable Amani SA responses as examples of correct output." 
}, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-94", "text": "----------------------------------" }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-95", "text": "**RELEVANCE MODEL POLICY**" }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-96", "text": "The text classification task of assigning the system SAs using either the user SAs or the user text input can be viewed as a cross-language information retrieval (IR) task: we have a fixed collection of system SAs (\"documents\") and a user's input (\"query\"), and we need to find the best SA that matches the user's input." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-97", "text": "This is similar to the task of searching Chinese documents using an English query, where the training data that maps user inputs to the system SAs can be viewed as a \"parallel corpus\" (Lavrenko et al., 2002) ." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-98", "text": "For our third approach we use the Relevance Model (RM) information retrieval technique first suggested by Lavrenko et al. (2002) and recently adapted to a question-answering task by Leuski et al. (2006) ." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-99", "text": "We have experimented with different feature sets and we found that (1) when the text data is not available, the combination of the current user SA and the last system SA is the most effective; (2) when both the utterance text and SA are available, the combination of the current user SA and unigram text features from all available paraphrases works the best; and (3) when only the text is available, the unigram word features work well by themselves." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-100", "text": "We should note that we found it is significantly better to train the model on \"gold\" SAs even when testing on NLU-SAs." 
}, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-101", "text": "We also observed that integrating the unigram features with the history information in the form of SAs or words from previous utterances tended to over-fit the model, resulting in degraded performance." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-102", "text": "----------------------------------" }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-103", "text": "**RESULTS AND DISCUSSION**" }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-104", "text": "We present our results in Table 1." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-105", "text": "The highest performance is achieved when \"gold\" SAs (G-SA) are provided to Rules." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-106", "text": "Indeed, the weak accuracy of .79 is approximately at the ceiling level of performance observed when one human referee is scored against 5 other human referees." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-107", "text": "This suggests that, with human-level NLU performance, a hand-authored rule-based policy can effectively implement Amani's intended dialogue policy." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-108", "text": "However, the table also shows that when automatically-assigned NLU speech acts (NLU-SA) are provided as input to Rules, the performance drops significantly to .58." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-109", "text": "Note that Rules cannot interpret text representations of user utterances; SA labels are needed, which is a cost of using Rules." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-110", "text": "For the MaxEnt policy, a score of .71 is achieved with \"gold\" SAs, and a lower .57 with run-time SAs." 
}, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-111", "text": "Note that .71 is an inferior performance to the .79 achieved with G-SA/Rules, indicating that MaxEnt does not learn a policy as effective as the hand-authored Rules, even if it is trained and evaluated on gold SA labels." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-112", "text": "As previously reported in DeVault et al. (2011) , a performance of .66 is achieved with the MaxEnt policy when trained on text-based features." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-113", "text": "It is interesting to see here, however, that this .66 performance is significantly higher than the .58 that is achieved using Rules together with run-time SAs." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-114", "text": "In fact, the accuracy of the NLU-SA labels in this data set, with respect to the gold SAs, is 53%." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-115", "text": "Thus, while Rules can achieve very good performance with gold SAs, the high frequency of NLU errors causes a significant degradation in policy performance." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-116", "text": "Interestingly, the alternative Text/MaxEnt design that combines NLU and DM into a single step ends up performing significantly better (.66 )." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-117", "text": "The RM performance shows a pattern broadly similar to MaxEnt; performance is highest (.73) with gold SAs, and when trained to classify directly from Text to system SAs, performance is significantly better (.71) than NLU-SA/Rules (.58)." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-118", "text": "For RM, we additionally explored using both Text features as well as the NLU-SA label as input features, but observed performance degraded to .65 (presumably due to NLU errors)." 
}, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-119", "text": "Our best overall performance not requiring gold SAs, .71, was achieved by Text/RM." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-120", "text": "Our intuition is that a couple of factors helped RM to outperform the MaxEnt approach: (1) MaxEnt treats word features as binary, while RM explicitly takes into account the word occurrence frequencies; (2) RM is better designed to handle multi-label classification, where a single input instance can be assigned to multiple classes." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-121", "text": "----------------------------------" }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-122", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-123", "text": "We have presented and evaluated a set of alternative dialogue system architectures in a case study domain." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-124", "text": "In this domain, we have shown that the theoretical performance that is achievable with a rule-based dialogue policy is high, but that two classification approaches that omit a separate NLU step and directly select system responses perform significantly better." }, { "sent_id": "b3f7051cbba3344f0aec0f2e80d5e0-C001-125", "text": "In future work, we plan to address some of the remaining questions, including how these learned policies would perform in live dialogues, how these results would change if NLU performance could be improved, and to what extent this pattern of results would transfer to other domains with more complex NLU and policy requirements." 
} ], "y": { "@SIM@": { "gold_contexts": [ [ "b3f7051cbba3344f0aec0f2e80d5e0-C001-16" ], [ "b3f7051cbba3344f0aec0f2e80d5e0-C001-35" ], [ "b3f7051cbba3344f0aec0f2e80d5e0-C001-62" ] ], "cite_sentences": [ "b3f7051cbba3344f0aec0f2e80d5e0-C001-16", "b3f7051cbba3344f0aec0f2e80d5e0-C001-35", "b3f7051cbba3344f0aec0f2e80d5e0-C001-62" ] }, "@USE@": { "gold_contexts": [ [ "b3f7051cbba3344f0aec0f2e80d5e0-C001-16" ], [ "b3f7051cbba3344f0aec0f2e80d5e0-C001-35" ], [ "b3f7051cbba3344f0aec0f2e80d5e0-C001-62" ] ], "cite_sentences": [ "b3f7051cbba3344f0aec0f2e80d5e0-C001-16", "b3f7051cbba3344f0aec0f2e80d5e0-C001-35", "b3f7051cbba3344f0aec0f2e80d5e0-C001-62" ] }, "@BACK@": { "gold_contexts": [ [ "b3f7051cbba3344f0aec0f2e80d5e0-C001-23" ], [ "b3f7051cbba3344f0aec0f2e80d5e0-C001-42" ], [ "b3f7051cbba3344f0aec0f2e80d5e0-C001-48" ], [ "b3f7051cbba3344f0aec0f2e80d5e0-C001-65" ], [ "b3f7051cbba3344f0aec0f2e80d5e0-C001-112" ] ], "cite_sentences": [ "b3f7051cbba3344f0aec0f2e80d5e0-C001-23", "b3f7051cbba3344f0aec0f2e80d5e0-C001-42", "b3f7051cbba3344f0aec0f2e80d5e0-C001-48", "b3f7051cbba3344f0aec0f2e80d5e0-C001-65", "b3f7051cbba3344f0aec0f2e80d5e0-C001-112" ] } } }, "ABC_b5149b6136c8baaed8356b562d3f96_23": { "x": [ { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-22", "text": "**OVERVIEW**" }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-2", "text": "Abstract Ratnaparkhi (1996) introduced a method of inferring a tag dictionary from annotated data to speed up part-of-speech tagging by limiting the set of possible tags for each word." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-3", "text": "While Ratnaparkhi's tag dictionary makes tagging faster but less accurate, an alternative tag dictionary that we recently proposed (Moore, 2014) makes tagging as fast as with Ratnaparkhi's tag dictionary, but with no decrease in accuracy." 
}, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-4", "text": "In this paper, we show that a very simple semi-supervised variant of Ratnaparkhi's method results in a much tighter tag dictionary than either Ratnaparkhi's or our previous method, with accuracy as high as with our previous tag dictionary but much faster tagging: more than 100,000 tokens per second in Perl." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-5", "text": "In this paper, we present a new method of constructing tag dictionaries for part-of-speech (POS) tagging." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-6", "text": "A tag dictionary is simply a list of words 1 along with a set of possible tags for each word listed, plus one additional set of possible tags for all words not listed." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-7", "text": "Tag dictionaries are commonly used to speed up POS-tag inference by restricting the tags considered for a particular word to those specified by the dictionary." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-8", "text": "Early work on POS tagging generally relied heavily on manually constructed tag dictionaries, sometimes augmented with tag statistics derived from an annotated corpus (Leech et al., 1983; Church, 1988; Cutting et al., 1992) ." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-9", "text": "Merialdo (1994) relied only on a tag dictionary extracted from annotated data, but he used the annotated" }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-10", "text": "(Footnote 1: According to the conventions of the field, POS tags are assigned to all tokens in a tokenized text, including punctuation marks and other non-word tokens. In this paper, all of these will be covered by the term word.)" }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-11", "text": "tags from his test data as well as his training data to construct his tag dictionary, so his evaluation was not really fair." 
}, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-12", "text": "Ratnaparkhi (1996) seems to have been the first to use a tag dictionary automatically extracted only from training data." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-13", "text": "Ratnaparkhi's method of constructing a tag dictionary substantially speeds up tagging compared to considering every possible tag for every word, but it noticeably degrades accuracy when used with a current state-of-the-art tagging model." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-14", "text": "We recently presented (Moore, 2014) a new method of constructing a tag dictionary that produces a tagging speed-up comparable to Ratnaparkhi's, but with no decrease in tagging accuracy." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-15", "text": "In this paper, we show that a very simple semi-supervised variant of Ratnaparkhi's method results in a much tighter tag dictionary than either Ratnaparkhi's or our previous method, with accuracy as high as we previously obtained, while allowing much faster tagging: more than 100,000 tokens per second even in a Perl implementation." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-16", "text": "----------------------------------" }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-17", "text": "****" }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-18", "text": "Ratnaparkhi (1996) introduced a method of inferring a tag dictionary from annotated data to speed up part-of-speech tagging by limiting the set of possible tags for each word." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-19", "text": "While Ratnaparkhi's tag dictionary makes tagging faster but less accurate, an alternative tag dictionary that we recently proposed (Moore, 2014) makes tagging as fast as with Ratnaparkhi's tag dictionary, but with no decrease in accuracy." 
}, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-20", "text": "In this paper, we show that a very simple semi-supervised variant of Ratnaparkhi's method results in a much tighter tag dictionary than either Ratnaparkhi's or our previous method, with accuracy as high as with our previous tag dictionary but much faster tagging: more than 100,000 tokens per second in Perl." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-21", "text": "----------------------------------" }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-23", "text": "In this paper, we present a new method of constructing tag dictionaries for part-of-speech (POS) tagging." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-24", "text": "A tag dictionary is simply a list of words 1 along with a set of possible tags for each word listed, plus one additional set of possible tags for all words not listed." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-25", "text": "Tag dictionaries are commonly used to speed up POS-tag inference by restricting the tags considered for a particular word to those specified by the dictionary." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-26", "text": "Early work on POS tagging generally relied heavily on manually constructed tag dictionaries, sometimes augmented with tag statistics derived from an annotated corpus (Leech et al., 1983; Church, 1988; Cutting et al., 1992) ." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-27", "text": "Merialdo (1994) relied only on a tag dictionary extracted from annotated data, but he used the annotated tags from his test data as well as his training data to construct his tag dictionary, so his evaluation was not really fair." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-28", "text": "Ratnaparkhi (1996) seems to have been the first to use a tag dictionary automatically extracted only from training data." 
}, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-29", "text": "Ratnaparkhi's method of constructing a tag dictionary substantially speeds up tagging compared to considering every possible tag for every word, but it noticeably degrades accuracy when used with a current state-of-the-art tagging model." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-30", "text": "We recently presented (Moore, 2014) a new method of constructing a tag dictionary that produces a tagging speed-up comparable to Ratnaparkhi's, but with no decrease in tagging accuracy." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-31", "text": "In this paper, we show that a very simple semi-supervised variant of Ratnaparkhi's method results in a much tighter tag dictionary than either Ratnaparkhi's or our previous method, with accuracy as high as we previously obtained, while allowing much faster tagging: more than 100,000 tokens per second even in a Perl implementation." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-32", "text": "----------------------------------" }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-33", "text": "**TAG DICTIONARIES AND TAGGING SPEED**" }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-34", "text": "A typical modern POS tagger applies a statistical model to compute a score for a sequence of tags t_1, ..., t_n given a sequence of words w_1, ..., w_n." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-35", "text": "The tag sequence assigned the highest score by the model for a given word sequence is selected as the tagging for the word sequence." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-36", "text": "If T is the set of possible tags, and there are no restrictions on the form of the model, then the time to find the highest scoring tag sequence is potentially O(n|T|^n) or worse, which would be intractable." 
}, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-37", "text": "To make tagging practical, models are normally defined to be factorable in a way that reduces the time complexity to O(n|T|^k), for some small integer k. For models in which all tagging decisions are independent, or for higher-order models pruned by fixed-width beam search, k = 1, so the time to find the highest scoring tag sequence is O(n|T|)." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-38", "text": "But this linear dependence on the size of the tag set means that reducing the average number of tags considered per token should further speed up tagging, whatever the underlying model or tagger may be." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-39", "text": "----------------------------------" }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-40", "text": "**RATNAPARKHI'S METHOD**" }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-41", "text": "For each word observed in an annotated training set, Ratnaparkhi's tag dictionary includes all tags observed with that word in the training set, with all possible tags allowed for all other words." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-42", "text": "Ratnaparkhi reported that using this tag dictionary improved per-tag accuracy from 96.31% to 96.43% on his Penn Treebank (Marcus et al., 1993) Wall Street Journal (WSJ) development set, compared to considering all tags for all words." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-106", "text": "As noted above, we treated all digits as indistinguishable in constructing and applying the dictionary." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-43", "text": "With a more accurate model, however, we found (Moore, 2014) that while Ratnaparkhi's tag dictionary decreased the average number of tags per token from 45 to 3.7 on the current standard WSJ development set, it also decreased per-tag accuracy from 97.31% to 97.19%." 
}, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-44", "text": "This loss of accuracy can be explained by the fact that 0.5% of the development set tokens are known words with a tag not seen in the training set, for which our model achieved 44.5% accuracy with all word/tag pairs permitted." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-45", "text": "With Ratnaparkhi's dictionary, accuracy for these tokens is necessarily 0%." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-46", "text": "----------------------------------" }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-47", "text": "**OUR PREVIOUS METHOD**" }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-48", "text": "We previously presented (Moore, 2014) a tag dictionary constructed by using the annotated training set to compute a smoothed probability estimate for any possible tag given any possible word, and for each word in the training set, including in the dictionary the tags having an estimated probability greater than a fixed threshold T ." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-49", "text": "In this approach, the probability p(t|w) of tag t given word w is computed by interpolating a discounted relative frequency estimate of p(t|w) with an estimate of p(t) based on \"diversity counts\", taking the count of a tag t to be the number of distinct words ever observed with that tag." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-50", "text": "The distribution p(t) is also used to estimate tag probabilities for unknown words, so the set of possible tags for any word not explicitly listed is {t|p(t) > T }." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-51", "text": "If we think of w followed by t as a word bigram, this is exactly like a bigram language model estimated by the interpolated Kneser-Ney (KN) method described by Chen and Goodman (1999) ." 
}, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-52", "text": "The way tag diversity counts are used has the desirable property that closed-class tags receive a very low estimated probability of being assigned to a rare or unknown word, even though they occur very often with a small number of frequent words." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-53", "text": "A single value for discounting the count of all observed word/tag pairs is set to maximize the estimated probability of the reference tagging of the development set." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-54", "text": "When T was chosen to be the highest threshold that preserves our model's 97.31% per tag WSJ development set accuracy, we obtained an average of 3.5 tags per token." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-55", "text": "----------------------------------" }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-56", "text": "**OUR NEW APPROACH**" }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-57", "text": "We now present a new method that reduces the average number of tags per token to about 1.5, with no loss of tagging accuracy." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-58", "text": "We apply a simple variant of Ratnaparkhi's method, with a training set more than 4,000 times larger than the Penn Treebank WSJ training set." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-59", "text": "Since no such hand-annotated corpus exists, we create the training set automatically by running a version of our tagger on the LDC English Gigaword corpus." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-60", "text": "We thus describe our approach as a semi-supervised variant of Ratnaparkhi's method." 
}, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-61", "text": "Our method can be viewed as an instance of the well-known technique of self-training (e.g., McClosky et al., 2006) , but ours is the first use of self-training we know of for learning inference-time search-space pruning." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-62", "text": "We introduce two additional modifications of Ratnaparkhi's approach." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-63", "text": "First, with such a large training corpus, we find it unnecessary to keep in the dictionary every tag observed with every word in the automatically-annotated data." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-107", "text": "----------------------------------" }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-64", "text": "So, we estimate a probability distribution over tags for each word in the dictionary according to unsmoothed relative tag frequencies, and include for each word in the dictionary only tags whose probability given the word is greater than a fixed threshold." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-65", "text": "Second, since our tokenized version of the English Gigaword corpus contains more than 6 million unique words, we reduce the vocabulary of the dictionary to the approximately 1 million words having 10 or more occurrences in the corpus." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-66", "text": "We treat all other tokens as instances of unknown words, and we use their combined unsmoothed relative tag frequencies to estimate a tag probability distribution for unknown words." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-67", "text": "We use the same threshold on this distribution as we do for words explicitly listed in the dictionary, to obtain a set of possible tags for unknown words." 
}, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-68", "text": "----------------------------------" }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-69", "text": "**EXPERIMENTAL DETAILS**" }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-70", "text": "In our experiments, we use the WSJ corpus from Penn Treebank-3, split into the standard training (sections 0-18), development (sections 19-21), and test (sections 22-24) sets for POS tagging." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-71", "text": "The tagging model we use has the property that all digits are treated as indistinguishable for all features." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-72", "text": "We therefore also make all digits indistinguishable in constructing tag dictionaries (by internally replacing all digits by \"9\"), since it does not seem sensible to give two different dictionary entries based on containing different digits, when the tagging model assigns them the same features." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-73", "text": "----------------------------------" }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-74", "text": "**THE TAGGING MODEL**" }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-75", "text": "The model structure, feature set, and learning method we use for POS tagging are essentially the same as those in our earlier work, treating POS tagging as a single-token independent multiclass classification task." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-76", "text": "Word-class-sequence features obtained by supervised clustering of the annotated training set replace the hidden tag-sequence features frequently used for POS tagging, and additional word-class features obtained by unsupervised clustering of a very large unannotated corpus provide information about words not occurring in the training set." 
}, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-77", "text": "For full details of the feature set, see our previous paper (Moore, 2014) ." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-78", "text": "The model is trained by optimizing the multiclass SVM hinge loss objective (Crammer and Singer, 2001) , using stochastic subgradient descent as described by Zhang (2004) , with early stopping and averaging." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-79", "text": "The only difference from our previous training procedure is that we now use a tag dictionary to speed up training, while we previously used tag dictionaries only at test time." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-80", "text": "Our training procedure makes multiple passes through the training data considering each training example in turn, comparing the current model score of the correct tag for the example to that of the highest scoring incorrect tag and updating the model if the score of the correct tag does not exceed the score of the highest scoring incorrect tag by a specified margin." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-81", "text": "In our new version of this procedure, we use the KN-smoothed tag dictionary described in Section 1.3 to speed up finding the highest scoring incorrect tag." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-82", "text": "Recall that the KN-smoothed tag dictionary estimates a non-zero probability p(t|w) for every possible word/tag pair, and that the possible tags for a given word are determined by setting a threshold T on this probability." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-83", "text": "In each pass through the training set, we use the same probability distribution p(t|w) determined from the statistics of the annotated training data, but we employ an adaptive method to determine what threshold T to use in each pass." 
}, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-84", "text": "For the first pass through the training set, we set an initial threshold T_0 to the highest value such that for every token in the development set, p(t|w) \u2265 T_0, where t is the correct tag for the token and w is the word for the token." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-85", "text": "At the end of each training pass i, while evaluating the current model on the development set for early stopping using threshold T_{i-1}, we also find the highest probability threshold T_i such that choosing a lower threshold would not enable any additional correct taggings on the development set using the current model." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-86", "text": "This threshold will normally be higher than T_0, because we disregard tokens in the development set for which the correct tag would not be selected by the model resulting from the previous pass at any threshold." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-87", "text": "T_i is then used as the threshold for training pass i + 1." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-88", "text": "Whenever the selected threshold leaves only one tag remaining for a particular training example, we skip that example in training." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-89", "text": "On the first pass through the training set, use of this method resulted in consideration of an average of 31.36 tags per token, compared to 45 total possible tags." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-90", "text": "On the second and all subsequent passes, an average of 10.48 tags were considered per token." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-91", "text": "This sped up training by a factor of 3.7 compared to considering all tags for all tokens, with no loss of tagging accuracy when a development-set-optimized KN-smoothed tag dictionary is also used at test time." 
}, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-92", "text": "----------------------------------" }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-93", "text": "**TAGGING THE GIGAWORD CORPUS**" }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-94", "text": "To construct our new tag dictionary, we need an automatically-tagged corpus several orders of magnitude larger than the hand-tagged WSJ training set." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-95", "text": "To obtain this corpus we ran a POS tagger on the LDC English Gigaword Fifth Edition 3 corpus, which consists of more than 4 billion words of English text from seven newswire sources." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-96", "text": "We first removed all SGML mark-up, and performed sentence-breaking and tokenization using the Stanford CoreNLP toolkit (Manning et al., 2014) ." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-97", "text": "This produced 4,471,025,373 tokens of 6,616,812 unique words. (Table 1: WSJ development set token accuracy and tagging speed for different tag dictionaries.)" }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-98", "text": "We tagged this corpus using the model described in Section 2.1 and a KN-smoothed tag dictionary as described in Section 1.3, with a threshold T = 0.0005." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-99", "text": "The tagger we used is based on the fastest of the methods described in our previous work (Moore, 2014, Section 3.1) ." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-100", "text": "Tagging took about 26 hours using a single-threaded implementation in Perl on a Linux workstation equipped with Intel Xeon X5550 2.67 GHz processors." 
}, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-101", "text": "----------------------------------" }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-102", "text": "**EXTRACTING THE TAG DICTIONARY**" }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-103", "text": "We extracted a Ratnaparkhi-like tag dictionary for the 957,819 words with 10 or more occurrences in our corpus." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-104", "text": "Tokens of all other words in the corpus were treated as unknown word tokens and used to define a set of 24 tags 4 to be used for words not explicitly listed in the dictionary." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-105", "text": "To allow pruning the dictionary as described in Section 1.4, for each word (including the unknown word), we computed a probability distribution p(t|w) using unsmoothed relative frequencies." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-108", "text": "**EXPERIMENTAL RESULTS**" }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-109", "text": "Tagging the WSJ development set with an unpruned semi-supervised tag dictionary obtained from the automatic tagging of the English Gigaword corpus produced the same tagging accuracy as allowing all tags for all tokens or using the pruned KN-smoothed tag dictionary used in tagging the Gigaword corpus." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-110", "text": "Additional experiments showed that we could prune this dictionary with a threshold on p(t|w) as high as T = 0.0024 without decreasing development set accuracy." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-111", "text": "In addition to applying this threshold to the tag probabilities for all listed words, we also applied it to the tag probabilities for unknown words, leaving 13 possible tags 5 for those." 
}, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-112", "text": "Tagging the WSJ development set with these two dictionaries is compared in Table 1 to tagging with our previous pruned KN-smoothed dictionary." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-113", "text": "The second column shows the accuracy per tag, which is 97.31% for all three dictionaries." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-114", "text": "The third column shows the mean number of tags per token allowed by each dictionary." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-115", "text": "The fourth column shows the percentage of tokens with only one tag allowed, which is significant since the tagger need not apply the model for such tokens-it can simply output the single possible tag." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-116", "text": "The last column shows the tagging speed in tokens per second for each of the three tag dictionaries, using the fast tagging method we previously described (Moore, 2014) , in a singlethreaded implementation in Perl on a Linux workstation equipped with Intel Xeon X5550 2.67 GHz processors." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-117", "text": "Speed is rounded to the nearest 1,000 tokens per second, because we measured times to a precision of only about one part in one hundred." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-118", "text": "For the pruned KN-smoothed dictionary, we previously reported a speed of 49,000 tokens per second under similar conditions." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-119", "text": "Our current faster speed of 69,000 tokens per second is due to an improved low-level implementation for computing the model scores for permitted tags, and a slightly faster version of Perl (v5.18.2) ." 
}, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-120", "text": "The most restrictive tag dictionary, the pruned semi-supervised dictionary, allows only 1.51 tags per token, and our implementation runs at 103,000 tokens per second on the WSJ development set." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-121", "text": "For our final experiments, we tested our tagger with this dictionary on the standard Penn Treebank WSJ test set and on the Penn Treebank-3 parsed Brown corpus subset, as an out-of-domain evaluation." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-122", "text": "For comparison, we tested our previous tagger and the fast version (english-left3words-distsim) of the Stanford tagger (Toutanova et al., 2003; Manning, 2011) recommended for practical use on the Stanford tagger website, which we found to be by far the fastest of the six publicly available taggers tested in our previous work (Moore, 2014) ." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-123", "text": "The results of these tests are shown in Table 2 Table 2 : WSJ test set and Brown corpus tagging speeds and token accuracies" }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-124", "text": "For our previous tagger, we give three speeds: the speed we reported earlier, a speed for a duplicate of the earlier experiment using the faster version of Perl that we use here, and a third measurement including both the faster version of Perl and our improved low-level tagger implementation." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-125", "text": "With the pruned semi-supervised dictionary, our new tagger has slightly higher all-token accuracy than our previous tagger on both the WSJ test set and Brown corpus set, and it is much more accurate than the fast Stanford tagger." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-126", "text": "The accuracy on the standard WSJ test set is 97.36%, one of the highest ever reported." 
}, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-127", "text": "The new tagger is also much faster than either of the other taggers, achieving a speed of more than 100,000 tokens per second on the WSJ test set, and almost 100,000 tokens per second on the out-of-domain Brown corpus data." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-128", "text": "----------------------------------" }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-129", "text": "**CONCLUSIONS**" }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-130", "text": "Our method of constructing a tag dictionary is technically very simple, but remarkably effective." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-131", "text": "It reduces the mean number of possible tags per token by 57% and increases the number of unambiguous tokens by by 47%, compared to the previous state of the art (Moore, 2014) for a tag dictionary that does not degrade tagging accuracy." }, { "sent_id": "b5149b6136c8baaed8356b562d3f96-C001-132", "text": "When combined with our previous work on fast high-accuracy POS tagging, this tag dictionary produces by far the fastest POS tagger reported with anything close to comparable accuracy." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "b5149b6136c8baaed8356b562d3f96-C001-3" ], [ "b5149b6136c8baaed8356b562d3f96-C001-14" ], [ "b5149b6136c8baaed8356b562d3f96-C001-43" ], [ "b5149b6136c8baaed8356b562d3f96-C001-48" ], [ "b5149b6136c8baaed8356b562d3f96-C001-77" ], [ "b5149b6136c8baaed8356b562d3f96-C001-99" ], [ "b5149b6136c8baaed8356b562d3f96-C001-116" ], [ "b5149b6136c8baaed8356b562d3f96-C001-122" ] ], "cite_sentences": [ "b5149b6136c8baaed8356b562d3f96-C001-3", "b5149b6136c8baaed8356b562d3f96-C001-14", "b5149b6136c8baaed8356b562d3f96-C001-43", "b5149b6136c8baaed8356b562d3f96-C001-48", "b5149b6136c8baaed8356b562d3f96-C001-77", "b5149b6136c8baaed8356b562d3f96-C001-99", "b5149b6136c8baaed8356b562d3f96-C001-116", "b5149b6136c8baaed8356b562d3f96-C001-122" ] }, "@MOT@": { "gold_contexts": [ [ "b5149b6136c8baaed8356b562d3f96-C001-3" ], [ "b5149b6136c8baaed8356b562d3f96-C001-19" ], [ "b5149b6136c8baaed8356b562d3f96-C001-30" ] ], "cite_sentences": [ "b5149b6136c8baaed8356b562d3f96-C001-3", "b5149b6136c8baaed8356b562d3f96-C001-19", "b5149b6136c8baaed8356b562d3f96-C001-30" ] }, "@DIF@": { "gold_contexts": [ [ "b5149b6136c8baaed8356b562d3f96-C001-19" ], [ "b5149b6136c8baaed8356b562d3f96-C001-30" ], [ "b5149b6136c8baaed8356b562d3f96-C001-131" ] ], "cite_sentences": [ "b5149b6136c8baaed8356b562d3f96-C001-19", "b5149b6136c8baaed8356b562d3f96-C001-30", "b5149b6136c8baaed8356b562d3f96-C001-131" ] } } }, "ABC_09dfa2f17283fe6b3fc28383f36732_23": { "x": [ { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-67", "text": "Our input features are the 3D coordinates of the tracked human hand: the coordinates are obtained with a commodity depth sensor, then transformed to be centered on the person torso (to be invariant to the distance of the user from the sensor) and normalized to account for variability in amplitude (to be invariant to wide/emphatic vs narrow/subtle executions of the same gesture class)." 
}, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-68", "text": "The gesture recognition models are represented in Fig. 3 , and correspond to the Gesture HMMs block in Fig. 2 ." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-69", "text": "The HMM for one gesture is defined by a set of (hidden) discrete states S = {s1, . . . , sQ} which model the temporal phases comprising the dynamic execution of the gesture, and by a set of parameters \u03bb = {A, B, \u03a0}, where A = {aij} is the transition probability matrix, aij is the transition probability from state si at time t to state sj at time t + 1, B = {fi} is the set of Q observation probability functions (one per state i) with continuous mixtures of Gaussian values, and \u03a0 is the initial probability distribution for the states." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-70", "text": "At recognition (testing) time, we obtain likelihood scores of a new gesture being classified with the common Forward- Backward inference algorithm." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-71", "text": "In Sec. 3.3, we discuss different ways in which the output information of the gesture recognizer can be combined with the Bayesian Network of words and affordances." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-72", "text": "----------------------------------" }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-73", "text": "**COMBINING THE BN WITH GESTURE HMMS**" }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-74", "text": "In this study we wish to generalize the model of [10] by observing external (human) agents, as shown in Fig. 1 ." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-75", "text": "For this reason, the full model is now extended with a perception module capable of inferring the action of the agent from visual inputs." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-76", "text": "This corresponds to the Gesture HMMs block in Fig. 2 ." 
}, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-77", "text": "The Affordance-Words Bayesian Network (BN) model and the Gestures HMMs may be combined in different ways [19] :" }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-78", "text": "1. the Gesture HMMs may provide a hard decision on the action performed by the human (i. e., considering only the top result) to the BN, 2." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-79", "text": "the Gesture HMMs may provide a posterior distribution (i. e., soft decision) to the BN, 3. if the task is to infer the action, the posterior from the Gesture HMMs and the one from the BN may be combined as follows, assuming that they provide independent information:" }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-80", "text": "In the experimental section, we will show that what the robot has learned subjectively or alone (by self-exploration, knowing the action identity as a prior [10] ), can subsequently be used when observing a new agent (human), provided that the actions can be estimated with Gesture HMMs as in [4] ." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-81", "text": "----------------------------------" }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-82", "text": "**EXPERIMENTAL RESULTS**" }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-83", "text": "We present preliminary examples of two types of results: predictions over the effects of actions onto environment objects, and predictions over the associated word descriptions in the presence or absence of an action prior." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-84", "text": "In this section, we assume that the Gesture HMMs provide the discrete value of the recognized action performed by a human agent (i. e., we enforce a hard decision over the observed action, referring to the possible combination strategies listed in Sec. 3.3)." 
}, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-2", "text": "A growing field in robotics and Artificial Intelligence (AI) research is human-robot collaboration, whose target is to enable effective teamwork between humans and robots." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-3", "text": "However, in many situations human teams are still superior to human-robot teams, primarily because human teams can easily agree on a common goal with language, and the individual members observe each other effectively, leveraging their shared motor repertoire and sensorimotor resources." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-4", "text": "This paper shows that for cognitive robots it is possible, and indeed fruitful, to combine knowledge acquired from interacting with elements of the environment (affordance exploration) with the probabilistic observation of another agent's actions." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-5", "text": "We propose a model that unites (i) learning robot affordances and word descriptions with (ii) statistical recognition of human gestures with vision sensors." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-6", "text": "We discuss theoretical motivations, possible implementations, and we show initial results which highlight that, after having acquired knowledge of its surrounding environment, a humanoid robot can generalize this knowledge to the case when it observes another agent (human partner) performing the same motor actions previously executed during training." 
}, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-7", "text": "----------------------------------" }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-9", "text": "Robotics is progressing fast, with a steady and systematic shift from the industrial domain to domestic, public and leisure environments [1, ch. 65, Domestic Robotics] ." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-10", "text": "Application areas that are particularly relevant and being researched by the scientific community include: robots for people's health and active aging, mobility, advanced manufacturing (Industry 4.0)." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-11", "text": "In short, all domains that require direct and effective human-robot interaction and communication (including language and gestures [2] )." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-12", "text": "However, robots have not reached the level of performance that would enable them to work with humans in routine activities in a flexible and adaptive way, for example in the presence of sensor noise, or unexpected events not previously seen during the training or learning phase." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-13", "text": "One of the reasons to explain this performance gap between human-human teamwork and a human-robot teamwork is in the collaboration aspect, i. e., whether the members of a team understand one another." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-14", "text": "Humans have the ability of working successfully in groups." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-15", "text": "They can agree on common goals (e. 
g., through verbal and nonverbal communication), work towards the execution of these goals in a coordinated way, and understand each other's phys- Figure 1 : Experimental setup, consisting of an iCub humanoid robot and a human user performing a manipulation gesture on a shared table with different objects on top." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-16", "text": "The depth sensor in the top-left corner is used to extract human hand coordinates for gesture recognition." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-17", "text": "Depending on the gesture and on the target object, the resulting effect will differ." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-18", "text": "ical actions (e. g., body gestures) towards the realization of the final target." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-19", "text": "Human team coordination and mutual understanding is effective [3] because of (i) the capacity to adapt to unforeseen events in the environment, and re-plan one's actions in real time if necessary, and (ii) a common motor repertoire and action model, which permits us to understand a partner's physical actions and manifested intentions as if they were our own [4] ." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-20", "text": "In neuroscience research, visuomotor neurons (i. e., neurons that are activated by visual stimuli) have been a subject of ample study [5] ." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-21", "text": "Mirror neurons are one class of such neurons that responds to action and object interaction, both when the agent acts and when it observes the same action performed by others, hence the name \"mirror\"." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-22", "text": "This work takes inspiration from the theory of mirror neurons, and contributes towards using it on humanoid and cognitive robots." 
}, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-23", "text": "We show that a robot can first acquire knowledge by sensing and self-exploring its surrounding environment (e. g., by interacting with available objects and building up an affordance representation of the interactions and their outcomes) and, as a result, the robot is capable of generalizing its acquired knowledge while observing another agent (e. g., a human person) who performs similar physical actions to the ones executed during prior robot training." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-24", "text": "Fig. 1" }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-25", "text": "----------------------------------" }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-26", "text": "**RELATED WORK**" }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-27", "text": "A large and growing body of research is directed towards having robots learn new cognitive skills, or improving their capabilities, by interacting autonomously with their surrounding environment." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-85", "text": "----------------------------------" }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-86", "text": "**EFFECT PREDICTION**" }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-28", "text": "In particular, robots operating in an unstructured scenario may understand available opportunities conditioned on their body, perception and sensorimotor experiences: the intersection of these elements gives rise to object affordances (action possibilities), as they are called in psychology [6] ." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-29", "text": "The usefulness of affordances in cognitive robotics is in the fact that they capture essential properties of environment objects in terms of the actions that a robot is able to perform with them [7, 8] ." 
}, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-30", "text": "Some authors have suggested an alternative computational model called Object-Action Complexes (OACs) [9] , which links low-level sensorimotor knowledge with high-level symbolic reasoning hierarchically in autonomous robots." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-31", "text": "In addition, several works have demonstrated how combining robot affordance learning with language grounding can provide cognitive robots with new and useful skills, such as learning the association of spoken words with sensorimotor experience [10, 11] or sensorimotor representations [12] , learning tool use capabilities [13, 14] , and carrying out complex manipulation tasks expressed in natural language instructions which require planning and reasoning [15] ." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-32", "text": "In [10] , a joint model is proposed to learn robot affordances (i. e., relationships between actions, objects and resulting effects) together with word meanings." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-33", "text": "The data contains robot manipulation experiments, each of them associated with a number of alternative verbal descriptions uttered by two speakers for a total of 1270 recordings." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-34", "text": "That framework assumes that the robot action is known a priori during the training phase (e. g., the information \"grasping\" during a grasping experiment is given), and the resulting model can be used at testing to make inferences about the environment, including estimating the most likely action, based on evidence from other pieces of information." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-35", "text": "Several neuroscience and psychology studies build upon the theory of mirror neurons which we brought up in the Introduction." 
}, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-36", "text": "These studies indicate that perceptual input can be linked with the human action system for predicting future outcomes of actions, i. e., the effect of actions, particularly when the person possesses concrete personal experience of the actions being observed in others [16, 17] ." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-37", "text": "This has also been exploited under the deep learning paradigm [18] , by using a Multiple Timescales Recurrent Neural Network (MTRNN) to have an artificial simulated agent infer human intention from joint information about object affordances and human actions." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-38", "text": "One difference between this line of research and ours is that we use real, noisy data acquired from robots and sensors to test our models, rather than virtual simulations." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-39", "text": "----------------------------------" }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-40", "text": "**PROPOSED APPROACH**" }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-41", "text": "In this paper, we combine (1) the robot affordance model of [10] , which associates verbal descriptions to the physical interactions of an agent with the environment, with (2) the gesture recognition system of [4] , which infers the type of action from human user movements." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-42", "text": "We consider three manipulative gestures corresponding to physical actions performed by agent(s) onto objects on a table (see Fig. 1 ): grasp, tap, and touch." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-43", "text": "We reason on the effects of these actions onto the objects of the world, and on the co-occurring verbal description of the experiments." 
}, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-44", "text": "In the complete framework, we will use Bayesian Networks (BNs), which are a probabilistic model that represents random variables and conditional dependencies on a graph, such as in Fig. 2 ." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-45", "text": "One of the advantages of using BNs is that their expressive power allows the marginalization over any set of variables given any other set of variables." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-46", "text": "Our main contribution is that of extending [10] by relaxing the assumption that the action is known during the learning phase." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-47", "text": "This assumption is acceptable when the robot learns through self-exploration and interaction with the environment, but must be relaxed if the robot needs to generalize the acquired knowledge through the observation of another (human) agent." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-48", "text": "We estimate the action performed by a human user during a human-robot collaborative task, by employing statistical inference methods and Hidden Markov Models (HMMs)." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-49", "text": "This provides two advantages." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-50", "text": "First, we can infer the executed action during training." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-51", "text": "Secondly, at testing time we can merge the action information obtained from gesture recognition with the information about affordances." 
}, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-52", "text": "----------------------------------" }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-53", "text": "**BAYESIAN NETWORK FOR AFFORDANCE-WORDS MODELING**" }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-54", "text": "Following the method adopted in [10] , we use a Bayesian probabilistic framework to allow a robot to ground the basic world behavior and verbal descriptions associated to it." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-55", "text": "The world behavior is defined by random variables describing: the actions A, defined over the set A = {ai}, object properties F , over F = {fi}, and effects E, over E = {ei}. We denote X = {A, F, E} the state of the world as experienced by the robot." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-56", "text": "The verbal descriptions are denoted by the set of words W = {wi}. Consequently, the relationships between words and concepts are expressed by the joint probability distribution p(X, W ) of actions, object features, effects, and words in the spoken utterance." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-57", "text": "The symbolic variables and their discrete values are listed in Table 1 ." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-58", "text": "In addition to the symbolic variables, the model also includes word variables, describing Figure 3 : Structure of the HMMs used for human gesture recognition, adapted from [4] ." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-59", "text": "In this work, we consider three independent, multiple-state HMMs, each of them trained to recognize one of the considered manipulation gestures." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-60", "text": "the probability of each word co-occurring in the verbal description associated to a robot experiment in the environment." 
}, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-61", "text": "This joint probability distribution, that is illustrated by the part of Fig. 2 enclosed in the dashed box, is estimated by the robot in an ego-centric way through interaction with the environment, as in [10] ." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-62", "text": "As a consequence, during learning, the robot knows what action it is performing with certainty, and the variable A assumes a deterministic value." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-63", "text": "This assumption is relaxed in the present study, by extending the model to the observation of external (human) agents as explained below." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-64", "text": "----------------------------------" }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-65", "text": "**HIDDEN MARKOV MODELS FOR GESTURE RECOGNITION**" }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-66", "text": "As for the gesture recognition HMMs, we use the models that we previously trained in [4] for spotting the manipulationrelated gestures under consideration." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-87", "text": "From our combined model of words, affordances and observed actions, we report the inferred posterior value of the Object Velocity effect, given prior information about the action (provided" }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-110", "text": "In terms of future work, there are several avenues to explore." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-88", "text": ", where F = {Size=big, Shape=sphere}, E = {ObjVel=fast}. This variation corresponds to the difference of word probability when we add the tap action evidence (obtained from the Gesture HMMs) to the initial evidence about object features and effects." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-89", "text": "We have omitted words for which no significant variation was observed." 
}, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-90", "text": "by the Gesture HMMs) and also about object features (Shape and Size)." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-91", "text": "Fig. 4 shows the computed predictions in two cases." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-92", "text": "Fig. 4a shows the anticipated object velocity when the human user performs the tapping action onto a small spherical object, whereas Fig. 4b displays it when the target object is a big box." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-93", "text": "Indeed, given the same observed action prior (lateral tap on the object), the expected movement is very different depending on the physical properties of the target object." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-94", "text": "----------------------------------" }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-95", "text": "**PREDICTION OF WORDS**" }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-96", "text": "In this experiment, we compare the associated verbal description obtained by the Bayesian Network in the absence of an action prior, with the ones obtained in the presence of one." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-97", "text": "In particular, we compare the probability of word occurrence in the following two situations:" }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-98", "text": "1. when the robot prior knowledge (evidence in the BN) includes information about object features and effects only: Size=big, Shape=sphere, ObjVel=fast;" }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-99", "text": "2. when the robot prior knowledge includes, in addition to the above, evidence about the action as observed from the Gestures HMMs: Action=tap." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-100", "text": "Fig . 
5 shows the variation in word occurrence probabilities between the two cases, where we have omitted words for which no significant variation was observed in this case." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-101", "text": "We can interpret the difference in the predictions as follows:" }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-102", "text": "\u2022 as expected, the probabilities of words related to tapping and pushing increase when a tapping action evidence from the Gestures HMMs is introduced; conversely, the probabilities of other action words (touching and poking) decreases;" }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-103", "text": "\u2022 interestingly, the probability of the word rolling (which is an effect of an action onto an object) also increases when the tapping action evidence is entered." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-104", "text": "Even though the initial evidence of case 1 already included some effect information (the velocity of the object), it is only now, when the robot perceives that the physical action was a tap, that the event rolling is associated." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-105", "text": "----------------------------------" }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-106", "text": "**CONCLUSIONS AND FUTURE WORK**" }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-107", "text": "Within the scope of cognitive robots that operate in unstructured environments, we have discussed a model that combines word affordance learning with body gesture recognition." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-108", "text": "We have proposed such an approach, based on the intuition that a robot can generalize its previously-acquired knowledge of the world (objects, actions, effects, verbal descriptions) to the cases when it observes a human agent performing familiar actions in a shared human-robot environment." 
}, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-109", "text": "We have shown promising preliminary results that indicate that a robot's ability to predict the future can benefit from incorporating the knowledge of a partner's action, facilitating scene interpretation and, as a result, teamwork." }, { "sent_id": "09dfa2f17283fe6b3fc28383f36732-C001-111", "text": "The main directions for future work are (i) the implementation of a fully probabilistic fusion between the affordance and the gesture components (e.g., the soft decision discussed in Sec. 3.3); (ii) to run quantitative tests on larger corpora of human-robot data; (iii) to explicitly address the correspondence problem of actions between two agents operating on the same world objects (e.g., a pulling action from the perspective of the human corresponds to a pushing action from the perspective of the robot, generating specular effects)." } ], "y": { "@BACK@": { "gold_contexts": [ [ "09dfa2f17283fe6b3fc28383f36732-C001-31" ], [ "09dfa2f17283fe6b3fc28383f36732-C001-32" ], [ "09dfa2f17283fe6b3fc28383f36732-C001-61" ], [ "09dfa2f17283fe6b3fc28383f36732-C001-80" ] ], "cite_sentences": [ "09dfa2f17283fe6b3fc28383f36732-C001-31", "09dfa2f17283fe6b3fc28383f36732-C001-32", "09dfa2f17283fe6b3fc28383f36732-C001-61", "09dfa2f17283fe6b3fc28383f36732-C001-80" ] }, "@DIF@": { "gold_contexts": [ [ "09dfa2f17283fe6b3fc28383f36732-C001-41" ], [ "09dfa2f17283fe6b3fc28383f36732-C001-46" ] ], "cite_sentences": [ "09dfa2f17283fe6b3fc28383f36732-C001-41", "09dfa2f17283fe6b3fc28383f36732-C001-46" ] }, "@EXT@": { "gold_contexts": [ [ "09dfa2f17283fe6b3fc28383f36732-C001-41" ], [ "09dfa2f17283fe6b3fc28383f36732-C001-46" ], [ "09dfa2f17283fe6b3fc28383f36732-C001-74" ] ], "cite_sentences": [ "09dfa2f17283fe6b3fc28383f36732-C001-41", "09dfa2f17283fe6b3fc28383f36732-C001-46", "09dfa2f17283fe6b3fc28383f36732-C001-74" ] }, "@SIM@": { "gold_contexts": [ [ "09dfa2f17283fe6b3fc28383f36732-C001-54" ] ], "cite_sentences": [
"09dfa2f17283fe6b3fc28383f36732-C001-54" ] }, "@USE@": { "gold_contexts": [ [ "09dfa2f17283fe6b3fc28383f36732-C001-54" ] ], "cite_sentences": [ "09dfa2f17283fe6b3fc28383f36732-C001-54" ] } } }, "ABC_4cf805818bed233fabb81f5f64f4cc_23": { "x": [ { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-108", "text": "It also outperforms the pipeline neural net model of Weiss et al. (2015) by a considerable margin and matches the semi-supervised variant of Weiss et al. (2015)." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-109", "text": "----------------------------------" }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-110", "text": "**ENGLISH TREEBANK UNION**" }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-111", "text": "Turning to cross-domain results and the \"Treebank Union\" datasets, we use an identical setup to the one described in Weiss et al. (2015)." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-112", "text": "This setup includes the WSJ with Stanford Dependencies, the OntoNotes corpus version 5 (Hovy et al., 2006), the English Web Treebank (Petrov and McDonald, 2012), and the updated and corrected Question Treebank (Judge et al., 2006)." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-113", "text": "We train on the union of each corpus's training set and test on each domain separately." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-114", "text": "Results." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-115", "text": "The results of this evaluation are shown in Table 5." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-116", "text": "As for the WSJ, we find that the integrated transition system combined with our novel features performs better than previous work and in particular the model of Weiss et al. (2015), which serves as the starting point for this work."
}, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-117", "text": "The improvements on the out-of-domain Web and Question corpora are particularly promising." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-118", "text": "Weiss et al. (2015) presented a parser that advanced the state of the art for English Stanford dependency parsing." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-119", "text": "In this paper we showed that this parser can be significantly improved by introducing novel set features for morphology and POS tag ambiguities, which are added with almost no feature engineering effort." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-120", "text": "The resulting parser is already competitive in the multi-lingual setting of the CoNLL'09 shared task, but can be further improved by utilizing an integrated POS tagging and parsing transition system." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-2", "text": "We extend and improve upon recent work in structured training for neural network transition-based dependency parsing." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-3", "text": "We do this by experimenting with novel features, additional transition systems and by testing on a wider array of languages." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-4", "text": "In particular, we introduce set-valued features to encode the predicted morphological properties and part-of-speech confusion sets of the words being parsed." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-5", "text": "We also investigate the use of joint parsing and part-of-speech tagging in the neural paradigm." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-6", "text": "Finally, we conduct a multi-lingual evaluation that demonstrates the robustness of the overall structured neural approach, as well as the benefits of the extensions proposed in this work."
}, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-7", "text": "Our research further demonstrates the breadth of the applicability of neural network methods to dependency parsing, as well as the ease with which new features can be added to neural parsing models." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-8", "text": "----------------------------------" }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-10", "text": "Transition-based parsers (Nivre, 2008) are extremely popular because of their high accuracy and speed." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-11", "text": "Inspired by the greedy neural network transition-based parser of Chen and Manning (2014), Weiss et al. (2015) and Zhou et al. (2015) concurrently developed structured neural network parsers that use beam search and achieve state-of-the-art accuracies for English dependency parsing." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-12", "text": "While very successful, these parsers have made use only of a small fraction of the rich options provided inside the transition-based framework: for example, all of these parsers use virtually identical atomic features and the arc-standard transition system." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-13", "text": "In this paper we extend this line of work and introduce two new types of features that significantly improve parsing performance: (1) a set-valued (i.e., bag-of-words style) feature for each word's morphological attributes, and (2) a weighted set-valued feature for each word's k-best POS tags." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-14", "text": "These features can be integrated naturally as atomic inputs to the embedding layer of the network, and the model can learn arbitrary conjunctions with all other features through the hidden layers."
}, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-15", "text": "In contrast, integrating such features into a model with discrete features requires nontrivial manual tweaking." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-16", "text": "For example, Bohnet and Nivre (2012) had to carefully discretize the real-valued POS tag score in order to combine it with the other discrete binary features in their system." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-17", "text": "Additionally, we also experiment with different transition systems, most notably the integrated parsing and part-of-speech (POS) tagging system of Bohnet and Nivre (2012) and also the swap system of Nivre (2009) ." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-18", "text": "We evaluate our parser on the CoNLL '09 shared task dependency treebanks, as well as on two English setups, achieving the best published numbers in many cases." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-19", "text": "----------------------------------" }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-20", "text": "**MODEL**" }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-21", "text": "In this section, we review the baseline model, and then introduce the features (which are novel) and the transition systems (taken from existing work) that we propose as extensions." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-22", "text": "We measure the impact of each proposed change on the development sets of the multi-lingual CoNLL '09 shared task treebanks (Haji\u010d et al., 2009) ." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-23", "text": "For details on our experimental setup, see Section 3." 
}, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-24", "text": "----------------------------------" }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-25", "text": "**BASELINE MODEL**" }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-26", "text": "Our baseline model is the structured neural network transition-based parser with beam search of Weiss et al. (2015)." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-27", "text": "We use a feed-forward network with embedding, hidden and softmax layers." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-28", "text": "The input consists of a sequence of matrices extracted deterministically from a transition-based parse configuration (consisting of a stack and a buffer)." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-29", "text": "Each matrix X g corresponds to a feature group g (one of words, tags, or labels), and has dimension F g \u00d7 V g ." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-30", "text": "Here, X g f v is 1 if the f'th feature takes on value v for group g, i.e. each row of X g is a one-hot vector." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-31", "text": "These features are embedded and then concatenated to form the embedding layer, which in turn is input to the first hidden layer." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-32", "text": "The concatenated embedding layer can then be written as follows:" }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-33", "text": "where E g is a (learned) V g \u00d7 D g embedding matrix for group g, and D g is the embedding dimension for group g." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-34", "text": "Beyond the embedding layer, there are two non-linear hidden layers (with nonlinearity introduced using a rectified linear activation function), and a softmax layer that outputs class probabilities for each possible decision."
}, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-35", "text": "Training proceeds in two stages: We first train the network as a classifier by extracting decisions from gold derivations of the training set, as in Chen and Manning (2014) ." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-36", "text": "We then train a structured perceptron using the output of all network activations as features, as in Weiss et al. (2015) ." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-37", "text": "We use structured training and beam search during inference in all experiments." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-38", "text": "We train our models only on the treebank training set and do not use tri-training or other semi-supervised learning approaches (aside from using pre-trained word embeddings)." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-39", "text": "----------------------------------" }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-40", "text": "**NEW FEATURES**" }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-41", "text": "Prior work using neural networks for dependency parsing has not ventured beyond the use of one-hot feature activations for each feature type-location pair." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-42", "text": "In this work, we experiment with set-valued features, in which a set (or bag) of features for a given location fire at once, and are embedded into the same embedding space." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-43", "text": "Note that for both of the features we introduce, we extract features from the same 20 tokens as used in the tags and words features from Weiss et al. (2015) , i.e. various locations on the stack and input buffer." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-44", "text": "Morphology." 
}, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-45", "text": "It is well known that morphological information is very important for parsing morphologically rich languages (see for example Bohnet et al. (2013))." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-46", "text": "We incorporate morphological information into our model using a set-valued feature function." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-47", "text": "We define the feature group morph as the matrix X morph such that, for" }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-48", "text": "where N f is the number of morphological features active on the token indexed by f ." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-49", "text": "In other words, we embed a bag of features into a shared embedding space by averaging the individual feature embeddings." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-50", "text": "k-best Tags." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-51", "text": "The non-linear network models of Weiss et al. (2015) and Chen and Manning (2014) embed the 1-best tag, according to a first-stage tagger, for a select set of tokens for any configuration." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-52", "text": "Inspired by the work of Bohnet and Nivre (2012), we embed the set of top tags according to a first-stage tagger." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-53", "text": "Specifically, we define the feature group ktags as the matrix X ktags such that, for" }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-54", "text": "where P(POS = v | f ) is the marginal probability that the token indexed by f has the tag indexed by v, according to the first-stage tagger." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-55", "text": "Results." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-56", "text": "The contributions of our new features for pipelined arc-standard parsing are shown in Table 1."
}, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-57", "text": "Morphology features (+morph) contributed a labeled accuracy score (LAS) gain of 2.9% in Czech, 1.5% in Spanish, and 0.9% in Catalan." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-58", "text": "Adding the k-best tag feature (+morph +ktags) provides modest gains (and modest losses), peaking at 0.54% LAS for Spanish." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-59", "text": "This feature proves more beneficial in the integrated transition system, discussed in the next section." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-60", "text": "We note the ease with which we can obtain these gains in a multilayer embedding framework, without the need for any hand-tuning." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-61", "text": "----------------------------------" }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-62", "text": "**INTEGRATING PARSING AND TAGGING**" }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-121", "text": "We find that for all settings the dense neural network model produces higher POS tagging and parsing accuracy gains than its sparse linear counterpart." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-63", "text": "While past work on neural network transition-based parsing has focused exclusively on the arc-standard transition system, it is known that better results can often be obtained with more sophisticated transition systems that have a larger set of possible actions." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-64", "text": "The integrated arc-standard transition system of Bohnet and Nivre (2012) allows the parser to participate in tagging decisions, rather than being forced to treat the tagger's tags as given, as in the arc-standard system."
}, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-65", "text": "It does this by replacing the shift action in the arc-standard system with an action shift p , which, aside from shifting the top token on the buffer also assigns it one of the k best POS tags from a first-stage tagger." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-66", "text": "We also experiment with the swap action of Nivre (2009) , which allows reordering of the tokens in the input sequence." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-67", "text": "This transition system is able to produce non-projective parse trees, which is important for some languages." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-68", "text": "Results." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-69", "text": "The effect of using the integrated transition system is quantified in the bottom part of Table 1." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-70", "text": "The use of both 1) +morph +kbest features and 2) integrated parsing and tagging achieves the best score for 5 out of 7 languages tested." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-71", "text": "The use of integrated parsing and tagging provides, for example, a 0.8% LAS gain in German." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-72", "text": "----------------------------------" }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-73", "text": "**EXPERIMENTS**" }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-74", "text": "In this section we provide final test set results for our baseline and full models on three standard setups from the literature: CoNLL '09, English WSJ and English Treebank Union." 
}, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-75", "text": "----------------------------------" }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-76", "text": "**GENERAL SETUP**" }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-77", "text": "To train with predicted POS tags, we use a CRF-based POS tagger to generate 5-fold jack-knifed POS tags on the training set and predicted tags on the dev, test and tune sets; our tagger gets comparable accuracy to the Stanford POS tagger (Toutanova et al., 2003) with 97.44% on the WSJ test set." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-78", "text": "The candidate tags allowed by the integrated transition system on every shift p action are chosen by taking the top 4 tags for a token according to the CRF tagger, sorted by posterior probability, with no minimum posterior probability for a tag to be selected." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-79", "text": "We report unlabeled attachment score (UAS) and labeled attachment score (LAS)." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-80", "text": "Whether punctuation is included in the evaluation is specified in each subsection." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-81", "text": "We use 1024 units in all hidden layers, a choice made based on the development set." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-82", "text": "We found network sizes to be of critical importance for the accuracy of our models." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-83", "text": "For example, LAS improvements can be as high as 0.98% in CoNLL'09 German when increasing the size of the two hidden layers from 200 to 1024." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-84", "text": "We use B = 16 or B = 32 based on the development set performance per language."
}, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-85", "text": "For ease of experimentation, we deviate from Bohnet and Nivre (2012) and use a single unstructured beam, rather than separate beams for POS tag and parse differences." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-86", "text": "We train our neural networks on the standard training sets only, except for initializing with word embeddings generated by word2vec and using cluster features in our POS tagger." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-87", "text": "Unlike Weiss et al. (2015) we train our model only on the treebank training set and do not use tri-training, which can likely further improve the results." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-88", "text": "----------------------------------" }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-89", "text": "**CONLL '09**" }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-90", "text": "Our multilingual evaluation follows the setup of the CoNLL '09 shared task 2 (Haji\u010d et al., 2009) ." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-91", "text": "As standard, we use the supplied predicted morphological features from the shared task data; however, we predict k-best tags with our own POS tagger since k-best tags are not part of the given data." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-92", "text": "We follow standard practice and include all punctuation in the evaluation." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-93", "text": "We used the (integrated) arc-standard transition system for all languages except for Czech where we added a swap transition, obtaining a 0.4% absolute improvement in UAS and LAS over just using arc-standard." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-94", "text": "Results." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-95", "text": "In Table 3 , we compare our models to the winners of the CoNLL '09 shared task, Gesmundo et al. 
(2009), Bohnet (2009), Che et al. (2009), and Ren et al. (2009), as well as to more recent results on the same datasets." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-96", "text": "It is worth pointing out that Gesmundo et al. (2009) is itself a neural net parser." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-97", "text": "Our models achieve higher labeled accuracy than the winning systems in the shared task in all languages." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-98", "text": "Additionally, our pipelined neural network parser always outperforms its linear counterpart, an in-house reimplementation of the system of Zhang and Nivre (2011), as well as the more recent and highly accurate parsers of Zhang and McDonald (2014) and Lei et al. (2014). Our integrated model again outperforms its linear counterpart (Bohnet and Nivre, 2012); however, in some cases the addition of graph-based and cluster features (Bohnet and Nivre, 2012)+G+C can lead to even better results." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-99", "text": "The improvements in POS tagging (Table 2) range from 0.3% for English to 1.4% absolute for Chinese and are always higher for the neural network models compared to the linear models." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-100", "text": "----------------------------------" }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-101", "text": "**ENGLISH WSJ**" }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-102", "text": "We experiment on English using the Wall Street Journal (WSJ) part of the Penn Treebank (Marcus et al., 1993), with standard train/test splits." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-103", "text": "We convert the constituency trees to Stanford style dependencies (De Marneffe et al., 2006) using version 3.3.0 of the converter."
}, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-104", "text": "We use predicted POS tags and exclude punctuation from the evaluation, as is standard for English." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-105", "text": "Results." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-106", "text": "From the results shown in Table 4, we find that our full model surpasses, to our knowledge, all previously reported supervised parsing models for the Stanford dependency conversions." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-107", "text": "It surpasses its linear analog, the work of Bohnet and Nivre (2012) on Stanford Dependencies, by 0.9% UAS and by 1.14% LAS." }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-122", "text": "----------------------------------" }, { "sent_id": "4cf805818bed233fabb81f5f64f4cc-C001-123", "text": "**CONCLUSIONS**" } ], "y": { "@BACK@": { "gold_contexts": [ [ "4cf805818bed233fabb81f5f64f4cc-C001-16" ] ], "cite_sentences": [ "4cf805818bed233fabb81f5f64f4cc-C001-16" ] }, "@SIM@": { "gold_contexts": [ [ "4cf805818bed233fabb81f5f64f4cc-C001-17" ], [ "4cf805818bed233fabb81f5f64f4cc-C001-52" ] ], "cite_sentences": [ "4cf805818bed233fabb81f5f64f4cc-C001-17", "4cf805818bed233fabb81f5f64f4cc-C001-52" ] }, "@USE@": { "gold_contexts": [ [ "4cf805818bed233fabb81f5f64f4cc-C001-17" ], [ "4cf805818bed233fabb81f5f64f4cc-C001-52" ], [ "4cf805818bed233fabb81f5f64f4cc-C001-64" ] ], "cite_sentences": [ "4cf805818bed233fabb81f5f64f4cc-C001-17", "4cf805818bed233fabb81f5f64f4cc-C001-52", "4cf805818bed233fabb81f5f64f4cc-C001-64" ] }, "@DIF@": { "gold_contexts": [ [ "4cf805818bed233fabb81f5f64f4cc-C001-85" ], [ "4cf805818bed233fabb81f5f64f4cc-C001-98" ], [ "4cf805818bed233fabb81f5f64f4cc-C001-106", "4cf805818bed233fabb81f5f64f4cc-C001-107" ] ], "cite_sentences": [ "4cf805818bed233fabb81f5f64f4cc-C001-85", "4cf805818bed233fabb81f5f64f4cc-C001-98", "4cf805818bed233fabb81f5f64f4cc-C001-107" ] } } }, "ABC_1bcd442a685e5fb2d0f3f44d3c66c3_23": {
"x": [ { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-2", "text": "Learning to communicate is considered an essential task to develop a general AI." }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-3", "text": "While recent literature in language evolution has studied emergent language through discrete or continuous message symbols, there has been little work in the emergence of writing systems in artificial agents." }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-4", "text": "In this paper, we present a referential game setup with two agents, where the mode of communication is a written language system that emerges during the play." }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-5", "text": "We show that the agents can learn to coordinate successfully using this mode of communication." }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-6", "text": "Further, we study how the game rules affect the writing system taxonomy by proposing a consistency metric." }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-7", "text": "----------------------------------" }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-9", "text": "Recent advances in deep learning have shown exceptional results in language-related tasks such as machine translation, question answering, or sentiment analysis." }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-10", "text": "However, the supervised approaches that capture the underlying statistical patterns in language are not sufficient in perceiving the interactive nature of communication and how humans use it for coordination." }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-11", "text": "It is thus crucial to learn to communicate by interaction, i.e., communication must emerge out of necessity." 
}, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-12", "text": "Such study gives further insights into how communication protocols emerge for successful coordination and the ability of a learner to understand the emerged language." }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-13", "text": "Several recent works (Lazaridou, Peysakhovich, and Baroni 2016; Havrylov and Titov 2017; Lazaridou et al. 2018; Mordatch and Abbeel 2018) , have shown that in multi-agent cooperative setting of referential games, deep reinforcement learning can successfully induce communication protocols." }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-14", "text": "In these games, communication success is the only supervision during learning, and the meaning of the emergent messages gets grounded during the game." }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-15", "text": "In (Lazaridou, Peysakhovich, and Baroni 2016) , the authors have restricted the message to be a single symbol token picked from a fixed vocabulary while in (Havrylov and Titov 2017) , the message is considered to be a sequence of symbols." }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-16", "text": "(Lazaridou et al. 2018) demonstrates that successful communication can also emerge in environments which present raw pixel input." }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-17", "text": "Copyright c 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org)." }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-18", "text": "All rights reserved." }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-19", "text": "(Mordatch and Abbeel 2018) further extends the scope of mode of communication by also studying the emergence of non-verbal communication." 
}, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-20", "text": "While these works have studied a wide variety of game setups as well as variations in communication rules, none of them have considered a written language system as a mode of communication." }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-21", "text": "Historically, written language systems have shown complex patterns in evolution over time." }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-22", "text": "Moreover, the process of writing requires sophisticated graphomotor skills which involve both linguistic and non-linguistic factors." }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-23", "text": "Thus writing systems can be considered crucial for understanding autonomous system development." }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-24", "text": "We are further motivated by the work in (Ganin et al. 2018), where the authors demonstrate that artificial agents can produce visual representations similar to those created by humans." }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-25", "text": "This can only be achieved by giving them access to the same tools that we use to recreate the world around us." }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-26", "text": "We extend this idea to study the emergence of writing systems." }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-27", "text": "----------------------------------" }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-28", "text": "**REFERENTIAL GAME FRAMEWORK**" }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-29", "text": "In our work, we have used two referential game setups that are slight modifications to the ones used in (Lazaridou, Peysakhovich, and Baroni 2016; Lazaridou et al. 2018)." }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-30", "text": "There are two players, a sender and a receiver."
}, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-31", "text": "From a given set of images, a target image t and a set of distractor images D are drawn. The game has two variations: Distractor Agnostic (D-Agnostic):" }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-32", "text": "where the sender only has access to the target image t; Distractor Aware (D-Aware): where the sender has access to the candidate set C = t \u222a D. In both these variations, the sender has to come up with a message M l = {m 1 , . . . , m l }, which is a sequence of l brushstrokes." }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-33", "text": "A black-box renderer R accepts the sequence of brushstrokes M l and paints them onto a canvas." }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-34", "text": "This results in a written symbol image W = R(M l )." }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-35", "text": "Given the written symbol image W and the candidate set C, the receiver has to identify the target image t. Communicative success is achieved when the target is correctly identified and a payoff of 1 is assigned to both the players." }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-36", "text": "In all other cases, the payoff is 0." }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-37", "text": "The sender and receiver are modelled as reinforcement learning policy networks S \u03b8 and R \u03c6 ." }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-38", "text": "Specifically, the sender is a recurrent neural network which takes as input the current state of the canvas along with the visual input V, which can either be the target image t (D-Agnostic) or the candidate set C (D-Aware)." }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-39", "text": "At the i th timestep, the sender outputs a brushstroke m i ." }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-40", "text": "The canvas state is the intermediate rendering" }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-41", "text": "where h i is the internal hidden state maintained across timesteps."
}, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-42", "text": "The sequence is terminated when either the maximum sequence length L is reached or a terminal flag is produced along with the brushstroke." }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-43", "text": "The internal state is maintained across timesteps using an LSTM cell (Hochreiter and Schmidhuber 1997)." }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-44", "text": "The receiver agent first extracts features from the written symbol image W ." }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-45", "text": "For creating brushstrokes that are similar to written languages used by humans, we use feature extractor from a Siamese Neural Network (Koch, Zemel, and Salakhutdinov 2015) , pre-trained on the OMNIGLOT dataset (Lake, Salakhutdinov, and Tenenbaum 2015) ." }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-46", "text": "Given the written symbol image W , a candidate set U (a random permutation of C), and the feature extractor f s , the receiver returns an integer value t = R \u03c6 (f s (W ), U ) in the range 0 to K-1 that points to the target." }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-47", "text": "----------------------------------" }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-48", "text": "**LEARNING**" }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-49", "text": "For both the agents, we pose the learning of communication protocols as maximization of the expected return Er[R(r)], where R is the reward function." }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-50", "text": "The payoff is 1 for both the agents iff R \u03c6 (f s (S \u03b8 (R(M i ), h i , V )), U ) = t , where i is the last timestep of the episode." }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-51", "text": "In all other cases and intermediate timesteps, the payoff is 0. 
Because of the high-dimensional search space introduced by the brushstrokes, we use Proximal Policy Optimization (PPO) (Schulman et al. 2017) for optimizing the weights of the sender and receiver agents." }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-52", "text": "----------------------------------" }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-53", "text": "**IMAGES**" }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-54", "text": "We have used the CIFAR-10 dataset (Krizhevsky, Hinton, and others 2009) as a source of images." }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-55", "text": "From the test set of CIFAR-10, we randomly sample 100 images from each class and represent them as outputs from the relu7 layer of a pretrained VGG-16 ConvNet (Simonyan and Zisserman 2014)." }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-56", "text": "Figure 1 shows the performance of our game setup for both the sender variations." }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-57", "text": "The agents converge to coordination with both sender types, but the D-Aware sender reaches higher levels more quickly." }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-58", "text": "Further, we quantify the consistency of a writing system by studying the variability of the symbols produced for a given entity e. Let w_e be the set of all written symbol images representing e. We define the heatmap H_e = mean(w_e)." }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-59", "text": "For a writing system consistent for the entity e, H_e would contain sharp brushstrokes, while an inconsistent writing system would give a blurred heatmap." 
}, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-60", "text": "We thus compute Variance of Laplacian (VoL) of the heatmap to" }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-61", "text": "----------------------------------" }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-62", "text": "**RESULTS AND CONCLUSION**" }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-63", "text": "where E is the set of all the entities considered which can either be targets (t) or target-distractor combinations (t&d)." }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-64", "text": "We also report a baseline consistency score where heatmap is generated by averaging across the universal set of generated symbol images." }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-65", "text": "High consistency of D-Agnostic sender indicates a oneto-one mapping from target class to written symbols." }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-66", "text": "The D-Aware sender has low consistency over target class but high consistency for target-distractor combinations ." }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-67", "text": "This means that symbols are context dependent." }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-68", "text": "From our qualitative evaluations, we infer that D-Aware sender assigns meaning to brushstrokes that represent conceptual differences between target and distractors." }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-69", "text": "Furthermore, D-Agnostic sender uses a scheme akin to hierarchical encoding to attribute high level semantics to brushstrokes." }, { "sent_id": "1bcd442a685e5fb2d0f3f44d3c66c3-C001-70", "text": "Thus, the writing system emerging from D-Aware sender is an ideographic one representing concepts while D-Agnostic sender produces a writing system which has compositionality and shows logographic traits." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "1bcd442a685e5fb2d0f3f44d3c66c3-C001-13" ], [ "1bcd442a685e5fb2d0f3f44d3c66c3-C001-15" ], [ "1bcd442a685e5fb2d0f3f44d3c66c3-C001-19" ], [ "1bcd442a685e5fb2d0f3f44d3c66c3-C001-51" ] ], "cite_sentences": [ "1bcd442a685e5fb2d0f3f44d3c66c3-C001-13", "1bcd442a685e5fb2d0f3f44d3c66c3-C001-15", "1bcd442a685e5fb2d0f3f44d3c66c3-C001-19", "1bcd442a685e5fb2d0f3f44d3c66c3-C001-51" ] }, "@MOT@": { "gold_contexts": [ [ "1bcd442a685e5fb2d0f3f44d3c66c3-C001-13" ] ], "cite_sentences": [ "1bcd442a685e5fb2d0f3f44d3c66c3-C001-13" ] }, "@DIF@": { "gold_contexts": [ [ "1bcd442a685e5fb2d0f3f44d3c66c3-C001-29" ] ], "cite_sentences": [ "1bcd442a685e5fb2d0f3f44d3c66c3-C001-29" ] }, "@EXT@": { "gold_contexts": [ [ "1bcd442a685e5fb2d0f3f44d3c66c3-C001-29" ] ], "cite_sentences": [ "1bcd442a685e5fb2d0f3f44d3c66c3-C001-29" ] } } }, "ABC_982991efdb6b14f187702e0a577bac_23": { "x": [ { "sent_id": "982991efdb6b14f187702e0a577bac-C001-82", "text": "This is crucial because a single sample presents a biased view of the task." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-2", "text": "Disentangling conversations mixed together in a single stream of messages is a difficult task, made harder by the lack of large manually annotated datasets." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-3", "text": "We created a new dataset of 77,563 messages manually annotated with reply-structure graphs that both disentangle conversations and define internal conversation structure." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-4", "text": "Our dataset is 16 times larger than all previously released datasets combined, the first to include adjudication of annotation disagreements, and the first to include context." 
}, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-5", "text": "We use our data to re-examine prior work, in particular, finding that 80% of conversations in a widely used dialogue corpus are either missing messages or contain extra messages." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-6", "text": "Our manually-annotated data presents an opportunity to develop robust data-driven methods for conversation disentanglement, which will help advance dialogue research." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-7", "text": "----------------------------------" }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-9", "text": "When a group of people communicate in a common channel there are often multiple conversations occurring concurrently." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-10", "text": "Often there is no explicit structure identifying conversations or their structure, such as in Internet Relay Chat (IRC), Google Hangout, and comment sections on websites." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-11", "text": "Even when structure is provided it often has limited depth, such as threads in Slack, which provide one layer of branching." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-12", "text": "In all of these cases, conversations are entangled: all messages appear together, with no indication of separate conversations." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-13", "text": "Automatic disentanglement could be used to provide more interpretable results when searching over chat logs, and to help users understand what is happening when they join a channel." 
}, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-14", "text": "Over a decade of research has considered conversation disentanglement (Shen et al., 2006) , but using datasets that are either small (2,500 messages, Elsner and Charniak, 2008) or not released (Adams and Martell, 2008) ." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-15", "text": "* jkummerf@umich.edu" }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-16", "text": "We introduce a conversation disentanglement dataset of 77,563 messages of IRC manually annotated with reply-to relations between messages." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-17", "text": "1 Our data is sampled from a technical support channel at 173 points in time between 2004 and 2018, providing a diverse set of speakers and topics, while remaining in a single domain." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-18", "text": "Our data is the first to include context, which differentiates messages that start a conversation from messages that are responding to an earlier point in time." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-19", "text": "We are also the first to adjudicate disagreements in disentanglement annotations, producing higher quality development and test sets." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-20", "text": "We also developed a simple model that is more effective than prior work, and showed that having diverse data makes it perform better and more consistently." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-21", "text": "We also analyze prior disentanglement work." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-22", "text": "In particular, a recent approach from Lowe et al. (2015 Lowe et al. ( , 2017 ." 
}, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-23", "text": "By applying disentanglement to an enormous log of IRC messages, they developed a resource that has been widely used (over 315 citations), indicating the value of disentanglement in dialogue research." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-24", "text": "However, they lacked annotated data to evaluate the conversations produced by their method." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-25", "text": "We find that 20% of the conversations are completely right or a prefix of a true conversation; 58% are missing messages, 3% contain messages from other conversations, and 19% have both issues." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-26", "text": "As a result, systems trained on the data will not be learning from accurate humanhuman dialogues." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-27", "text": "----------------------------------" }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-28", "text": "**TASK DEFINITION**" }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-29", "text": "We consider a shared channel in which a group of people are communicating by sending messages that are visible to everyone." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-30", "text": "We label this data with a graph in which messages are nodes and edges indicate that one message is a response to another." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-31", "text": "Each connected component is a conversation." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-32", "text": "Figure 1 : #Ubuntu IRC log sample, earliest message first." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-33", "text": "Curved lines are our graph annotations of reply structure, which define two conversations shown with blue solid edges and green dashed edges." 
}, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-34", "text": "Figure 1 shows an example of two entangled conversations and their graph structure." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-35", "text": "It includes a message that receives multiple responses, when multiple people independently help BurgerMann, and the inverse, when the last message responds to multiple messages." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-36", "text": "We also see two of the users, delire and Seveas, simultaneously participating in two conversations." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-37", "text": "This multi-conversation participation is common." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-38", "text": "The example also shows two aspects of IRC we will refer to later." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-39", "text": "Directed messages, an informal practice in which a participant is named in the message." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-40", "text": "These cues are useful for understanding the discussion, but only around 48% of messages have them." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-41", "text": "System messages, which indicate actions like users entering the channel." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-42", "text": "These all start with ===, but not all messages starting with === are system messages, as shown by the second message in Figure 1 ." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-43", "text": "3 Related Work IRC Disentanglement Data: The most significant work on conversation disentanglement is a line of papers developing data and models for the #Linux IRC channel (Elsner and Charniak, 2008; Elsner and Schudy, 2009; Charniak, 2010, 2011) ." 
}, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-44", "text": "Until now, their dataset was the only publicly available set of messages with annotated conversations (partially re-annotated by Mehri and Carenini (2017) with reply-structure graphs), and has been used for training and evaluation in subsequent work (Wang and Oard, 2009; Mehri and Carenini, 2017; Jiang et al., 2018) ." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-45", "text": "We are aware of three other IRC disentanglement datasets." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-46", "text": "First, Adams and Martell (2008) studied disentanglement and topic identification, but did not release their data." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-47", "text": "Second, Riou et al. (2015) annotated conversations and discourse relations in the #Ubuntu-fr channel (French Ubuntu support)." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-48", "text": "Third, Lowe et al. (2015 Lowe et al. ( , 2017 heuristically extracted conversations from the #Ubuntu channel." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-49", "text": "2 Their work opened up a new research opportunity by providing 930,000 disentangled conversations, and has already been the basis of many papers (315 citations), particularly on developing dialogue agents." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-50", "text": "This is far beyond the size of resources previously collected, even with crowdsourcing (Lasecki et al., 2013) ." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-51", "text": "Using our data we provide the first empirical evaluation of their method." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-52", "text": "Other Disentanglement Data: IRC is not the only form of synchronous group conversation online." 
}, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-53", "text": "Other platforms with similar communication formats have been studied in settings such as classes (Wang et al., 2008; Dulceanu, 2016) , support communities (Mayfield et al., 2012) , and customer service (Du et al., 2017) ." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-54", "text": "Unfortunately, only one of these resources (Dulceanu, 2016) is available, possibly due to privacy concerns." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-55", "text": "Another stream of research has used userprovided structure to get conversation labels (Shen et al., 2006; Domeniconi et al., 2016) and replyto relations (Wang and Ros\u00e9, 2010; Wang et al., 2011a; Aumayr et al., 2011; Balali et al., 2013 Balali et al., , 2014 Chen et al., 2017a) ." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-56", "text": "By removing these labels and mixing conversations they create a disentanglement problem." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-57", "text": "While convenient, this risks introducing a bias, as people write differently when explicit structure is defined, and only a few papers have released data (Abbott et al., 2016; Zhang et al., 2017; Louis and Cohen, 2015) ." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-83", "text": "Having multiple samples also means our training and evaluation sets are from different points in time, preventing overfitting to specific users or topics of conversation." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-84", "text": "Context: We are the first to consider the fact that IRC data is sampled from a continuous stream and the context prior to the sample is important." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-85", "text": "In prior work, a message with no antecedent could either be the start of a conversation or a response to a message that occurs prior to the sample." 
}, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-86", "text": "Adjudication: Our labeling method is similar to prior work, but we are the first to perform adjudication of annotations." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-87", "text": "While some cases were ambiguous, often one option was clearly incorrect." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-88", "text": "By performing adjudication we can reduce these errors, creating high quality sets." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-89", "text": "----------------------------------" }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-90", "text": "**METHODOLOGY**" }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-91", "text": "Guidelines: We developed annotation guidelines through three rounds of pilot annotations in which annotators labeled a set of messages and discussed all disagreements." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-92", "text": "We instructed annotators to link each message to the one or more messages it is a response to." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-93", "text": "If a message started a new conversation it was linked to itself." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-94", "text": "We also described a series of subtle cases, using one to three examples to tease out differences." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-95", "text": "These included when a question is repeated, when a user responds multiple times, interjections, etc." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-96", "text": "For our full guidelines, see the supplementary material." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-97", "text": "All annotations were performed using SLATE (Kummerfeld, 2019), a custom-built tool with features designed specifically for this task." 
}, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-98", "text": "5 Adjudication: Table 1 shows the number of annotators for each subset of our data." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-99", "text": "For the development, test, out-of-domain data, and a small set of the training data, we labeled each sample multiple times and then resolved all disagreements in an adjudication step." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-100", "text": "During adjudication, there was no indication of who had given which annotation, and there was the option to choose a different annotation entirely." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-101", "text": "In order to maximize the volume annotated, we did not perform adjudication for most of the training data." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-102", "text": "Also, the 18,924 training message set initially only had 100 messages of context per sample, and we later added another 900 lines and checked every message that was not a reply to see if it was a response to something in the additional context." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-103", "text": "Annotators: The annotators were all fluent English speakers with a background in computer science (necessary to understand the technical content): a postdoc, a master's student, and three CS undergraduates." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-104", "text": "All adjudication was performed by the postdoc, who is a native English speaker." 
}, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-105", "text": "Time: Annotations took between 7 and 11 seconds per message depending on the complexity of the discussion, and adjudication took 5 seconds 5 https://jkk.name/slate" }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-106", "text": "----------------------------------" }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-107", "text": "**ANNOTATION QUALITY**" }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-108", "text": "Our annotations define two levels of structure: (1) links between pairs of messages, and (2) sets of messages, where each set is one conversation." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-109", "text": "Annotators label (1), from which (2) can be inferred." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-110", "text": "Table 2 presents inter-annotator agreement measures for both cases." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-111", "text": "These are measured in the standard manner, by comparing the labels from different annotators on the same data." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-112", "text": "We also include measurements for annotations in prior work." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-113", "text": "Figure 2 shows ambiguous examples from our data to provide some intuition for the source of disagreements." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-114", "text": "In both examples the disagreement involves one link, but the conversation structure in the second case is substantially changed." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-115", "text": "Some disagreements in our data are mistakes, where one annotation is clearly incorrect, and some are ambiguous cases, such as these." 
}, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-116", "text": "In Channel Two, we also see mistakes and ambiguous cases, including a particularly long discussion about a user's financial difficulties that could be divided in multiple ways (also noted by Elsner and Charniak (2008) )." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-117", "text": "Graphs: We measure agreement on the graph structure annotation using Cohen (1960) 's \u03ba." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-118", "text": "This measure of inter-rater reliability corrects for chance agreement, accounting for the class imbalance between linked and not-linked pairs." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-119", "text": "Values are in the good agreement range proposed by Altman (1990) , and slightly higher than for Mehri and Carenini (2017)'s annotations." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-120", "text": "Results are not shown for Elsner and Charniak (2008) because they did not annotate graphs." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-121", "text": "----------------------------------" }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-122", "text": "**CONVERSATIONS:**" }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-123", "text": "We consider three metrics: 6 (1) Variation of Information (VI, Meila, 2007) ." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-124", "text": "A measure of information gained or lost when going from one clustering to another." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-125", "text": "It is the sum of conditional entropies H(Y |X) + H(X|Y ), where X and Y are clusterings of the same set of items." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-126", "text": "We consider a scaled version, using the bound for n items that VI(X; Y ) \u2264 log(n), and present 1\u2212VI so that larger values are better." 
}, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-127", "text": "(2) One-to-One Overlap (1-1, Elsner and Charniak, 2008) ." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-128", "text": "Percentage overlap when conversations from two annotations are optimally paired up using the max-flow algorithm." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-129", "text": "We follow Mehri and Carenini (2017) and keep system messages." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-130", "text": "(3) Exact Match F 1 ." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-131", "text": "Calculated using the number of perfectly matching conversations, excluding conversations with only one message (mostly system messages)." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-132", "text": "This is an extremely challenging metric." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-133", "text": "We include it because it is easy to understand and it directly measures a desired value (perfectly extracted conversations)." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-134", "text": "Our scores are higher in 4 cases and lower in 5." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-135", "text": "Interestingly, while \u03ba was higher for us than Mehri and Carenini (2017) , our scores for conversations are lower." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-136", "text": "This is possible because a single link can merge two conversations, meaning a single disagreement in links can cause a major difference in conversations." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-137", "text": "This may reflect the fact that our annotation guide was developed for the Ubuntu channel, which differs in conversation style from the Channel Two data." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-138", "text": "Manually comparing the annotations, there was no clear differences in the types of disagreements." 
}, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-139", "text": "Agreement is lower on the Channel Two data, particularly on its test set." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-140", "text": "From this we conclude that there is substantial variation in the difficulty of conversation disentanglement across datasets." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-141", "text": "7" }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-142", "text": "----------------------------------" }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-143", "text": "**EVALUATING DISENTANGLEMENT QUALITY**" }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-144", "text": "In this section, we propose new simple disentanglement models that perform better than prior methods, and re-examine prior work." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-145", "text": "The models we consider are: Previous: Each message is linked to the most recent non-system message before it." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-146", "text": "6 Metrics such as Cohen's \u03ba and Krippendorff's \u03b1 are not applicable to conversations because there is no clear mapping from one set of conversations to another." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-147", "text": "7 Riou et al. (2015) also observe this, noting that their French IRC data is less entangled than Elsner's, making it possible to achieve an agreement level of 0.95." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-148", "text": "(Pennington et al., 2014) ." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-149", "text": "Union: Run 10 FF models trained with different random seeds and combine their output by keeping all edges predicted." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-150", "text": "Vote: Run 10 FF models and combine output by keeping the edges they all agree on." 
}, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-151", "text": "Link messages with no agreed antecedent to themselves." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-152", "text": "Intersect: Conversations that 10 FF models agree on, and other messages as singleton conversations." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-153", "text": "For Channel Two we also compare to Wang and Oard (2009) and Mehri and Carenini (2017) , but their code was unavailable, preventing evaluation on our data." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-154", "text": "We exclude Jiang et al. (2018) as they substantially modified the dataset." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-155", "text": "For details of models, including hyperparameters tuned on the development set, see the supplementary material." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-156", "text": "Table 5 : Performance with different training conditions on the Ubuntu test set." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-157", "text": "For Graph-F, * indicates a significant difference at the 0.01 level compared to Standard." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-158", "text": "Results are averages over 10 runs, varying the data and random seeds." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-159", "text": "The standard deviation is shown in parentheses." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-160", "text": "----------------------------------" }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-161", "text": "**GRAPH CONVERSATION**" }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-162", "text": "----------------------------------" }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-163", "text": "**RESULTS**" }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-164", "text": "Graphs: Table 3 presents precision, recall, and F-score over links." 
}, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-165", "text": "Our models perform much better than the baseline." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-166", "text": "As we would expect, vote has higher precision, while union has higher recall." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-167", "text": "Vote has higher recall than a single feedforward model because it identifies more of the selflink cases (its default when there is no agreement)." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-168", "text": "Conversations: Table 4 presents results on the metrics defined in Section 4.3." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-169", "text": "There are three regions of performance." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-170", "text": "First, the baseline has consistently low scores since it forms a single conversation containing all messages." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-171", "text": "Second, Elsner and Charniak (2008) and Lowe et al. (2017) per-form similarly, with one doing better on VI and the other on 1-1, though Elsner and Charniak (2008) do consistently better across the exact conversation extraction metrics." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-172", "text": "Third, our methods do best, with x10 vote best in all cases except precision, where the intersect approach is much better." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-173", "text": "Dataset Variations: Table 5 shows results for the feedforward model with several modifications to the training set, designed to test corpus design decisions." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-174", "text": "Removing context does not substantially impact results." 
}, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-175", "text": "Decreasing the data size to match Elsner and Charniak (2008) 's training set leads to worse results, both if the sentences are from diverse contexts (3rd row), and if they are from just two contexts (bottom row)." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-176", "text": "We also see a substantial increase in the standard deviation when only two samples are used, indicating that performance is not robust when the data is not widely sampled." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-177", "text": "----------------------------------" }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-178", "text": "**CHANNEL TWO RESULTS**" }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-179", "text": "For channel Two, we consider two annotations of the same underlying text: ours and Elsner and Charniak (2008)'s." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-180", "text": "To compare with prior work, we use the metrics defined by Shen et al. (2006, Shen) and Elsner and Charniak (2008, Loc) ." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-181", "text": "8 We do not use these for our data as they have been superseded by more rigorously studied metrics (VI for Shen) or make strong assumptions about the data (Loc)." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-182", "text": "We do not evaluate on graphs because Elsner and Charniak (2008) 's annotations do not include them." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-183", "text": "This also prevents us from training our method on their data." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-184", "text": "Model Comparison: For Elsner's annotations (top section of Table 6 ), their approach remains the most effective with just Channel Two data." 
}, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-185", "text": "However, training on our Ubuntu data, treating Channel Two as an out-of-domain sample, yields substantially higher performance on two metrics and comparable performance on the third." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-186", "text": "On our annotations (bottom section), we see the same trend." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-187", "text": "In both cases, the heuristic from Lowe et al. (2015 Lowe et al. ( , 2017 performs poorly." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-188", "text": "We suspect our model trained only on Channel Two data is overfitting, as the graph F-score on the training data is 94, whereas on the Ubuntu data it is 80." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-189", "text": "Data Comparison: Comparing the same models in the top and bottom section, scores are consistently higher for our annotations, except for the Lowe et al. (2015 Lowe et al. ( , 2017 heuristic." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-190", "text": "Comparing the annotations, we find that their annotators identified between 250 and 328 conversations (mean 281), while we identify 257." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-191", "text": "Beyond this difference it is hard to identify consistent variations in the annotations." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-192", "text": "Another difference is the nature of the evaluation." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-193", "text": "On Elsner's data, evaluation is performed by measuring relative to each annotators labels and averaging the scores." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-194", "text": "On our data, we adjudicated the annotations, providing a single gold standard." 
}, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-195", "text": "Evaluating our Channel-Two-trained Feedforward model on our two preadjudication annotations and averaging scores, the results are lower by 3.1, 1.8, and 4.3 on 1-1, Loc and Shen respectively." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-196", "text": "This suggests that our adjudication process removes annotator mistakes that introduce noise into the evaluation." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-197", "text": "Lowe et al. (2015 Lowe et al. ( , 2017 The previous section showed that only 10.8% of the conversations extracted by the heuristic in Lowe et al. (2015 Lowe et al. ( , 2017 are correct (P in Table 4 )." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-198", "text": "We focus on precision because the primary use of their method has been to extract conversations to train and test dialogue systems, which will be impacted by errors in the conversations." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-199", "text": "Recall errors (measuring missed conversations) are not as serious a problem because the Ubuntu chat logs are so large that even with low recall a large number of conversations will still be extracted." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-200", "text": "Additional Metrics: First, we must check this is Figure 3 : An example conversation extracted by the heuristic from Lowe et al. (2015 Lowe et al. ( , 2017 with the messages it misses and the one it incorrectly includes." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-201", "text": "not an artifact of our test set." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-202", "text": "On our development set, P, R, and F are slightly higher (11.6, 8.1 and 9.5), but VI and 1-1 are slightly lower (80.0 and 51.7)." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-203", "text": "We can also measure performance as the distribution of scores over all of the samples we annotated." 
}, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-204", "text": "The average precision was 10, and varied from 0 to 50, with 19% of cases at 0 and 95% below 23." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-205", "text": "To avoid the possibility that we made a mistake running their code, we also considered evaluating their released conversations." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-206", "text": "On the data that overlapped with our annotations, the precision was 9%." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-207", "text": "These results indicate that the test set performance is not an aberration: the heuristic's results are consistently low, with only about 10% of output conversations completely right." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-208", "text": "----------------------------------" }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-209", "text": "**EVALUATING**" }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-210", "text": "Error Types: Figure 3 shows an example heuristic output with several types of errors." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-211", "text": "The initial question was missed, as was the final resolution, and in the middle there is a message from a separate conversation." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-212", "text": "67% of conversations were a subset of a true conversation (ie., only missed messages), and 3% were a superset of a true conversation (ie., only had extra messages)." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-213", "text": "The subset cases were missing 1-187 messages (missing 56% of the conversation on average) and the superset cases had 1-3 extra messages (an extra 31% of the conversation on average)." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-214", "text": "The first message is particularly important because it is usually the question being resolved." 
}, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-215", "text": "In 47% of cases the first message is not the true start of a conversation." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-216", "text": "It is important to note that the dialogue task the conversations were intended for only uses a prefix of each conversation." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-217", "text": "For this purpose, missing the end of a conversation is not a problem." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-218", "text": "In 9% of cases, the conversation is a true prefix of a gold conversation." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-219", "text": "Combined with the exact match cases, that means 20% of the conversations are accurate as used in the next utterance selection task." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-220", "text": "A further 9% of cases are a continuous chunk of a conversation, but missing one or more messages at the start." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-221", "text": "Long Distance Links: One issue we observed is that conversations often spanned days." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-222", "text": "We manually inspected a random sample: 20 conversations 12 to 24 hours long, and 20 longer than 24 hours." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-223", "text": "All of the longer conversations and 17 of the shorter ones were clearly incorrect." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-224", "text": "9 This issue is not measured in the analysis above because our samples do not span days (they are 5.5 hours long on average when including context)." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-225", "text": "The original work notes this issue, but claims that it is rare." 
}, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-226", "text": "We measured the time between consecutive messages in conversations and plot the frequency of each value in Figure 4 ." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-227", "text": "10 The figure indicates that the conversations often extend over days, or even more than a month apart (note the point in the topright corner)." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-228", "text": "In contrast, our annotations rarely contain links beyond an hour, and the output of our model rarely contains links longer than 2 hours." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-229", "text": "Causes: To investigate possible reasons for these issues, we measured several properties of our data to test assumptions in the heuristic." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-230", "text": "First, the heuristic assumes if all directed messages from a user are in one conversation, all undirected messages from the user are in the same conversation." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-231", "text": "9 The exceptions were two cases where a user thanked another user for their help the previous day, and one case where a user asked if another user ended up resolving their question." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-232", "text": "10 In 68,002 conversations there was a negative time difference because a message was out of order." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-233", "text": "To resolve this, we sorted the messages in each conversation by timestamp." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-234", "text": "We find this is true 52.2% of the time." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-235", "text": "Second, it assumes that it is rare for two people to respond to an initial question." 
}, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-236", "text": "In our data, of the messages that start a conversation and receive a response, 37.7% receive multiple responses." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-237", "text": "Third, that a directed message can start a conversation, which we find in 6.8% of cases." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-238", "text": "Fourth, that the first response to a question is within 3 minutes, which we find is true in 94.8% of conversations." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-239", "text": "Overall, these assumptions have mixed support from our data, which may be why the heuristic produces so few accurate conversations." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-240", "text": "Dialogue Modeling: Most of the work building on Lowe et al. (2017) uses the conversations to train and evaluate dialogue systems." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-241", "text": "To see the impact on downstream work, we constructed a next utterance selection task as described in their work, disentangling the entire #Ubuntu logs with our feedforward model." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-242", "text": "We tried two dialogue models: a dual-encoder (Lowe et al., 2017) , and Enhanced Long Short-Term Memory (Chen et al., 2017b) ." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-243", "text": "For full details of the task and model hyperparameters, see the supplementary material." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-244", "text": "Table 7 show results when varying the training and test datasets." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-245", "text": "Training and testing on the same dataset leads to higher performance than training on one and testing on the other." 
}, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-246", "text": "This is true even though the heuristic data contains nine times as many training conversations." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-247", "text": "This is evidence that our conversations are fundamentally different despite being derived from the same resource and filtered in the same way." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-248", "text": "This indicates that our changes lead to quantitatively different downstream models." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-249", "text": "Fortunately, the relative performance of the two models remains consistent across the two datasets." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-250", "text": "----------------------------------" }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-251", "text": "**RE-EXAMINING DISENTANGLEMENT RESEARCH**" }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-252", "text": "Using our data we also investigate other assumptions made in prior work." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-253", "text": "The scale of our data provides a more robust test of these ideas." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-254", "text": "Number of samples: Table 1 shows that all prior work with available data has considered a small number of samples." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-255", "text": "In Table 5 , we saw that training on less diverse data samples led to models that performed worse and with higher variance." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-256", "text": "We can also investigate this by looking at performance on the different samples in our test set." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-257", "text": "The difficulty of samples varies considerably, with the F-score of our model varying from 11 to 40 and annotator agreement scores before adjudication varying from 0.65 to 0.78." 
}, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-258", "text": "The model performance and agreement levels are also strongly correlated, with a Spearman's rank correlation of 0.77." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-259", "text": "This demonstrates the importance of evaluating on data from more than one point in time to get a robust estimate of performance." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-260", "text": "How far apart consecutive messages in a conversation are: Elsner and Charniak (2008) and Mehri and Carenini (2017) use a limit of 129 seconds, Jiang et al. (2018) limit to within 1 hour, Guo et al. (2017) limit to within 8 messages, and we limit to within 100 messages." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-261", "text": "Figure 4 shows the distribution of time differences in our conversations." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-262", "text": "94.9% are within 2 minutes, and almost all are within an hour." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-263", "text": "88.3% are 8 messages or less apart, and 99.4% are 100 or less apart." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-264", "text": "This suggests that the lower limits in prior work are too low." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-265", "text": "However, in Channel Two, 98% of messages are within 2 minutes, suggesting this property is channel and sample dependent." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-266", "text": "Concurrent conversations: Adams and Martell (2008) forced annotators to label at most 3 conversations, while Jiang et al. (2018) remove conversations to ensure there are no more than 10 at once." 
}, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-267", "text": "We find there are 3 or fewer 46.4% of the time and 10 or fewer 97.3% of the time (where time is in terms of messages, not minutes, and we ignore system messages), Presumably the annotators in Adams and Martell (2008) would have proposed changes if the 3 conversation limit was problematic, suggesting that their data is less entangled than ours." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-268", "text": "Conversation and message length: Adams and Martell (2008) annotate blocks of 200 messages." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-269", "text": "If such a limit applied to our data, 13.7% of conversations would not finish before the cutoff point." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-270", "text": "This suggests that their conversations are typi-cally shorter, which is consistent with the previous conclusion that their conversations are less entangled." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-271", "text": "Jiang et al. (2018) remove conversations with fewer than 10 messages, describing them as outliers, and remove messages shorter than 5 words, arguing that they were not part of real conversations." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-272", "text": "Not counting conversations with only system messages, 83.4% of our conversations have fewer than 10 messages, 40.8% of which have multiple authors." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-273", "text": "88.5% of messages with less than 5 words are in conversations with more than one author." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-274", "text": "These values suggest that these messages and conversations are real and not outliers." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-275", "text": "Overall: This analysis indicates that working from a small number of samples can lead to major bias in system design for disentanglement." 
}, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-276", "text": "There is substantial variation across channels, and across time within a single channel." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-277", "text": "----------------------------------" }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-278", "text": "**CONCLUSION**" }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-279", "text": "Conversation disentanglement has been understudied because of a lack of public, annotated datasets." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-280", "text": "We introduce a new corpus that is larger and more diverse than any prior corpus, and the first to include context and adjudicated annotations." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-281", "text": "Using our data, we perform the first empirical analysis of Lowe et al. (2015 Lowe et al. ( , 2017 's widely used data, finding that only 20% of the conversations their method produces are true prefixes of conversations." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-282", "text": "The models we develop have already enabled new directions in dialogue research, providing disentangled conversations for DSTC 7 track 1 (Gunasekara et al., 2019; Yoshino et al., 2018) and will be used in DSTC 8." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-283", "text": "We also show that diversity is particularly important for the development of robust models." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-284", "text": "This work fills a key gap that has limited research, providing a new opportunity for understanding synchronous multiparty conversation online." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-58", "text": "Models: Elsner and Charniak (2008) explored various message-pair feature sets and linear classifiers, combined with local and global inference methods." 
}, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-59", "text": "Their system is the only publicly released statistical model for disentanglement of chat conversation, but most of the other work cited above applied similar models." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-60", "text": "We evaluate their model on both our data and our re-annotated version of their data." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-61", "text": "Recent work has applied neural networks (Mehri and Carenini, 2017; Guo et al. (2017) 1,500 1 48 hr 5 n/a 2 Table 1 : Annotated disentanglement dataset comparison." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-62", "text": "Our data is much larger than prior work, one of the only released sets, and the only one with context and adjudication. '+a' indicates there was an adjudication step to resolve disagreements. '?' indicates the value is not in the paper and the authors no longer have access to the data." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-63", "text": "----------------------------------" }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-64", "text": "**2018), WITH SLIGHT GAINS IN PERFORMANCE.**" }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-65", "text": "Graph Structure: Within a conversation, we define a graph of reply-to relations." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-66", "text": "Almost all prior work with annotated graph structures has been for threaded web forums (Schuth et al., 2007; Kim et al., 2010; Wang et al., 2011b) , which do not exhibit the disentanglement problem we explore." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-67", "text": "Studies that do consider graphs for disentanglement have used small datasets (Dulceanu, 2016; Mehri and Carenini, 2017) that are not always released (Wang et al., 2008; Guo et al., 2017) ." 
}, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-68", "text": "----------------------------------" }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-69", "text": "**DATA**" }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-70", "text": "We introduce a manually annotated dataset of 77,563 messages: 74,963 from the #Ubuntu IRC channel, 3 and 2,600 messages from the #Linux IRC channel." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-71", "text": "4 Annotating the #Linux data enables comparison with Elsner and Charniak (2008) , while the #Ubuntu channel has over 34 million messages, making it an interesting largescale resource for dialogue research." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-72", "text": "It also allows us to evaluate Lowe et al. (2015 Lowe et al. ( , 2017 's widely used heuristically disentangled conversations." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-73", "text": "When choosing samples we had to strike a balance between the number of samples and the size of each one." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-74", "text": "We sampled the training set in three ways: (1) 95 uniform length samples, (2) 10 smaller samples to check annotator agreement, and (3) 48 time spans of one hour that are diverse in terms of the number of messages, the number of participants, and what percentage of messages are directed." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-75", "text": "For additional details of the data selection process, see the supplementary material." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-76", "text": "----------------------------------" }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-77", "text": "**DATASET COMPARISON**" }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-78", "text": "Table 1 presents properties of our data and prior work on disentanglement in real-time chat." 
}, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-79", "text": "Availability: Only one other dataset, annotated twice, has been publicly released, and two others were shared when we contacted the authors." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-80", "text": "Scale: Our dataset is 31 times larger than almost any other dataset, the exception being one that was not released." }, { "sent_id": "982991efdb6b14f187702e0a577bac-C001-81", "text": "As well as being larger, our data is also based on many different points in time." } ], "y": { "@BACK@": { "gold_contexts": [ [ "982991efdb6b14f187702e0a577bac-C001-44" ], [ "982991efdb6b14f187702e0a577bac-C001-61" ], [ "982991efdb6b14f187702e0a577bac-C001-67" ] ], "cite_sentences": [ "982991efdb6b14f187702e0a577bac-C001-44", "982991efdb6b14f187702e0a577bac-C001-61", "982991efdb6b14f187702e0a577bac-C001-67" ] }, "@DIF@": { "gold_contexts": [ [ "982991efdb6b14f187702e0a577bac-C001-119" ], [ "982991efdb6b14f187702e0a577bac-C001-135" ], [ "982991efdb6b14f187702e0a577bac-C001-260" ] ], "cite_sentences": [ "982991efdb6b14f187702e0a577bac-C001-119", "982991efdb6b14f187702e0a577bac-C001-135", "982991efdb6b14f187702e0a577bac-C001-260" ] }, "@SIM@": { "gold_contexts": [ [ "982991efdb6b14f187702e0a577bac-C001-129" ], [ "982991efdb6b14f187702e0a577bac-C001-153" ] ], "cite_sentences": [ "982991efdb6b14f187702e0a577bac-C001-129", "982991efdb6b14f187702e0a577bac-C001-153" ] }, "@USE@": { "gold_contexts": [ [ "982991efdb6b14f187702e0a577bac-C001-129" ] ], "cite_sentences": [ "982991efdb6b14f187702e0a577bac-C001-129" ] } } }, "ABC_c0cac496ec0abdfd3f6bd9914f4cc4_24": { "x": [ { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-2", "text": "Nowadays, social media have become a platform where people can easily express their opinions and emotions about any topic such as politics, movies, music, electronic products and many others." 
}, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-3", "text": "On the other hand, politicians, companies, and businesses are interested in analyzing automatically people's opinions and emotions." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-4", "text": "In the last decade, a lot of efforts has been put into extracting sentiment polarity from texts." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-5", "text": "Recently, the focus has expanded to also cover emotion recognition from texts." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-6", "text": "In this work, we expand an existing emotion lexicon, DepecheMood, by leveraging semantic knowledge from English WordNet (EWN)." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-7", "text": "We create an expanded lexicon, EmoWordNet, consisting of 67K terms aligned with EWN, almost 1.8 times the size of DepecheMood." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-8", "text": "We also evaluate EmoWordNet in an emotion recognition task using SemEval 2007 news headlines dataset and we achieve an improvement compared to the use of DepecheMood." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-9", "text": "EmoWordNet is publicly available to speed up research in the field on http://oma-project.com." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-10", "text": "----------------------------------" }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-11", "text": "**INTRODUCTION**" }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-12", "text": "Emotion recognition models have been extensively explored based on different modalities such as human computer interaction (Cowie et al., 2001; Pantic and Rothkrantz, 2003; Fragopanagos and Taylor, 2005; Jaimes and Sebe, 2007; Hibbeln et al., 2017; Patwardhan and Knapp, 2017; Constantine et al., 2016) and facial images and expressions (Goldman and Sripada, 2005; Gunes and Piccardi, 2007; Trad et al., 2012; Wegrzyn et al., 2017) ." 
}, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-13", "text": "Recently, special attention has been given to emotion recognition from text (Wu et al., 2006; Alm et al., 2005; Shaheen et al., 2014; AbdulMageed and Ungar, 2017; Badaro et al., 2018b,a) ." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-32", "text": "WordNet Affect was also tested in different applications such as affective text sensing systems and computational humor." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-14", "text": "In fact, a tremendous amount of opinionated and emotionally charged text data is nowadays available on the Internet due to the increase of number of users of social networks such as Twitter and Facebook." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-15", "text": "For instance, Facebook reached more than 2 billion users on September 2017." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-16", "text": "1 Recognizing emotions from text has several applications: first, it helps companies and businesses in shaping their marketing strategies based on consumers' emotions (Bougie et al., 2003) ; second, it allows improving typical collaborative filtering based recommender systems (Badaro et al., 2013 (Badaro et al., , 2014c ) in terms of products or advertisements recommendations (Mohammad and Yang, 2011) ; third, politicians can learn how to adapt their political speech based on people emotions (Pang et al., 2008) and last but not least emotion classification helps in stock market predictions (Bollen et al., 2011) ." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-17", "text": "While plenty of works exist for sentiment analysis for different languages including analysis of social media data for sentiment characteristics (Al Sallab et al., 2015; Baly et al., , 2017b , few works focused on emotion recognition from text." 
}, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-18", "text": "Since sentiment lexicons helped in improving the accuracy of sentiment classification models (Liu and Zhang, 2012; Al-Sallab et al., 2017; Badaro et al., 2014a Badaro et al., ,b, 2015 , several researchers are working on developing emotion lexicons for different languages such as English, French, Polish and Chinese (Mohammad, 2017; Bandhakavi et al., 2017; Yang et al., 2007; Mohammad and Turney, 2013; Abdaoui et al., 2017; Staiano and Guerini, 2014; Maziarz et al., 2016; Janz et al., 2017) ." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-19", "text": "While sentiment is usually represented by three labels namely positive, negative or neutral, several representation models exist for emotions such as Ekman representation (Ekman, 1992) (happiness, sadness, fear, anger, surprise and disgust) or Plutchik model (Plutchik, 1994) that includes trust and anticipation in addition to Ekman's six emotions." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-20", "text": "Despite the efforts for creating large scale emotion lexicons for English, the size of existing emotion lexicons remain much smaller compared to sentiment lexicons." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-21", "text": "For example, DepecheMood (Staiano and Guerini, 2014) , one of the largest publicly available emotion lexicon for English, includes around 37K terms while SentiWordNet (SWN) (Esuli and Sebastiani, 2007; Baccianella et al., 2010) , a large scale English sentiment lexicon semi-automatically generated using English WordNet (EWN) (Fellbaum, 1998) , includes around 150K terms annotated with three sentiment scores: positive, negative and objective." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-22", "text": "In this paper, we focus on expanding coverage of existing emotion lexicon, namely DepecheMood, using the synonymy semantic relation available in English WordNet." 
}, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-23", "text": "We decide to expand DepecheMood since it is one of the largest emotion lexicon publicly available, and since its terms are aligned with EWN, thus allowing us to benefit from powerful semantic relations in EWN." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-24", "text": "The paper is organized as follows." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-25", "text": "In section 2, we conduct a brief literature survey on existing emotion lexicons." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-26", "text": "In section 3, we describe the expansion approach to build EmoWordNet." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-27", "text": "In section 4, we compare the performance of EmoWordNet against DepecheMood using SemEval 2007 dataset and in section 5, we present a conclusion of our results and future work." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-28", "text": "Strapparava et al. (2004) developed WordNet Affect by tagging specific synsets with affective meanings in EWN." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-29", "text": "They identified first a core number of synsets that represent emotions of a lexical database for emotions." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-30", "text": "They expanded then the coverage of the lexicon by checking semantically related synsets compared to the core set." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-31", "text": "They were able to annotate 2,874 synsets and 4,787 words." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-33", "text": "WordNet Affect is of good quality given that it was manually created and validated, however, it is of limited size." 
}, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-34", "text": "Mohammad and Turney (2013) presented challenges that researchers face for developing emotion lexicons and devised an annotation strategy to create a good quality and inexpensive emotion lexicon, EmoLex, by utilizing crowdsourcing." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-35", "text": "To create EmoLex, the authors first identified target terms for annotation extracted from Macquarie Thesaurus (Bernard and Bernard, 1986) , WordNet Affect and the General Inquirer (Stone et al., 1966) ." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-36", "text": "Then, they launched the annotation task on Amazon's Mechanical Turk." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-37", "text": "EmoLex has around 10K terms annotated for emotions as well as for sentiment polarities." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-38", "text": "They evaluated the annotation quality using different techniques such as computing inter-annotator agreement and comparing a subsample of EmoLex with existing gold data." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-39", "text": "AffectNet (Cambria et al., 2012) , part of the SenticNet project, includes also around 10K terms extracted from ConceptNet (Liu and Singh, 2004) and aligned with WordNet Affect." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-40", "text": "They extended WordNet Affect using the concepts in ConceptNet." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-41", "text": "While WordNet Affect, EmoLex and AffectNet include terms with emotion labels, Affect database (Neviarouskaya et al., 2007) and DepecheMood (Staiano and Guerini, 2014) include words that have emotion scores instead, which can be useful for compositional computations of emotion scores." 
}, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-42", "text": "Affect database extends SentiFul and covers around 2.5K words presented in their lemma form along with the corresponding part of speech (POS) tag." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-43", "text": "DepecheMood was automatically built by harvesting social media data that were implicitly annotated with emotions." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-44", "text": "Staiano and Guerini (2014) utilized news articles from rappler.com." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-45", "text": "The articles are accompanied by Rappler's Mood Meter, which allows readers to express their emotions about the article they are reading." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-46", "text": "DepecheMood includes around 37K lemmas along with their part of speech tags and the lemmas are aligned with EWN." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-47", "text": "Staiano and Guerini also evaluated DepecheMood in emotion regression and classification tasks in unsupervised settings." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-48", "text": "They claim that although they utilized a na\u00efve unsupervised model, they were able to outperform existing lexicons when applied on SemEval 2007 dataset (Strapparava and Mihalcea, 2007) ." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-49", "text": "Since DepecheMood is aligned with EWN, is publicly available and has a better coverage and claimed performance compared to existing emotion lexicons, we decide to expand it using EWN semantic relations as described below in section 3." 
}, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-50", "text": "----------------------------------" }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-51", "text": "**LITERATURE REVIEW**" }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-52", "text": "To summarize, there are mainly two approaches that have been followed for building emotion lexicons for English." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-53", "text": "The first set of methods relies on manual annotation either done by specific indi-viduals or through crowdsourcing, where the list of words is extracted from lexical resources." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-54", "text": "The second approach is automatic or semi-automatic and is based on annotated corpora for emotion." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-55", "text": "The first approach tends to produce limited size and highly accurate emotion lexicons but it is relatively expensive." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-56", "text": "On the other hand, the second approach is cheap and results in large scale emotion lexicons but with lower accuracy compared to manually developed emotion lexicons in terms of accurately representing the emotion of the term." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-57", "text": "----------------------------------" }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-58", "text": "**EMOWORDNET**" }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-59", "text": "In this section, we describe the approach we followed in order to expand DepecheMood and build EmoWordNet." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-60", "text": "DepecheMood consists of 37,771 lemmas along with their corresponding POS tags where each entry is appended with scores for 8 emotion labels: afraid, amused, angry, annoyed, don't care, happy, inspired and sad." 
}, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-61", "text": "Three variations of score representations exist for DepecheMood." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-62", "text": "We select to expand the DepecheMood variation with normalized scores since this variation performed best according to the presented results in (Staiano and Guerini, 2014) ." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-63", "text": "In Fig. 1 , we show an overview of the steps followed to expand DepecheMood." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-64", "text": "Step 1: EWN synsets that include lemmas of DepecheMood were retrieved." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-65", "text": "A score was then computed for each retrieved synset, s. Let S denotes the set of all such synsets." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-66", "text": "Two cases might appear: either the retrieved synset included only one lemma from DepecheMood, in this case the synset was assigned the same score of the lemma, or, the synset included multiple lemmas that exist in DepecheMood, in this case the synset's score was the average of the scores of its corresponding lemmas." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-67", "text": "Step 2: A synset, s, includes two set of terms: T, terms that are in DepecheMood, andT , terms not in DepecheMood." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-68", "text": "Using the synonymy semantic relation in EWN, and based on the concept that synonym words would likely share the same emotion scores, we assigned the synset's scores to its corresponding termsT ." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-69", "text": "Again, a term t inT might appear in one or multiple synsets from S. Hence, the score assigned to t would be either the one of its corresponding synset or the average of the scores of its corresponding synsets that belong to S." 
}, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-70", "text": "Step 3: after performing step 2, new synsets might be explored." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-71", "text": "Terms inT might also appear in synsetss that do not belong to S.s would get the score of its corresponding terms." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-72", "text": "Step 2 and 3 were repeated until no new terms or synsets were added and scores of added terms converged." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-73", "text": "It is important to note that we decided to consider only synonyms for expansion since synonymy is the only semantic relation that mostly preserves the emotion orientation and does not require manual validation as described by Strapparava et al. (2004) ." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-74", "text": "As a walking example of the steps described above, let us consider the DepecheMood term \"bonding\" having noun as POS tag." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-75", "text": "\"bonding\" can be found in three different EWN noun synsets with the following offset IDs: \"00148653; 05665769; 13781820\"." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-76", "text": "Since \"bonding\" is the only term having a DepecheMood representation in the three synsets, the three synsets will have the same emotion scores as \"bonding\"." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-77", "text": "While synsets \"05665769; 13781820\" have only the term \"bonding\", \"00148653\" includes as well the lemma \"soldering\" which is not in DepecheMood." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-78", "text": "Thus, from step 2, \"soldering\" will have the same scores as \"bonding\". \"soldering\" does not appear in any other EWN synset so there are no more iterations." 
}, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-79", "text": "Using the described automatic expansion approach, we were able to extend the size of DepecheMood by a factor of 1.8." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-80", "text": "We obtained emotion scores for an additional 29,967 EWN terms and for 59,952 EWN synsets." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-81", "text": "Overall, we construct EmoWordNet, an emotion lexicon consisting of 67,738 EWN terms and of 59,952 EWN synsets annotated with emotion scores." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-82", "text": "Next, we present a simple extrinsic evaluation of EmoWordNet similar to the one performed for DepecheMood." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-83", "text": "----------------------------------" }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-84", "text": "**88**" }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-85", "text": "In this section, we evaluate the effectiveness of EmoWordNet in emotion recognition task from text." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-86", "text": "We evaluate regression as well as classification of emotions in unsupervised settings using similar techniques used for evaluating DepecheMood." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-87", "text": "----------------------------------" }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-88", "text": "**DATASET & COVERAGE**" }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-89", "text": "We utilized the dataset provided publicly by SemEval 2007 task on Affective text (Strapparava and Mihalcea, 2007) ." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-90", "text": "The dataset consists of one thousand news headlines annotated with six emotion scores: anger, disgust, fear, joy, sadness and surprise." 
}, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-91", "text": "For the regression task, a score between 0 and 1 is provided for each emotion." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-92", "text": "For the classification task, a threshold is applied on the emotion scores to get a binary representation of the emotions: if the score of a certain emotion is greater than 0.5, the corresponding emotion label is set to 1, otherwise it is 0." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-93", "text": "The emotion labels used in the dataset correspond to the six emotions of the Ekman model (Ekman, 1992) while those in EmoWordNet, as well as DepecheMood, follow the ones provided by Rappler Mood Meter." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-94", "text": "We considered the same emotion mapping assumptions presented in the work of (Staiano and Guerini, 2014) : Fear \u2192 Afraid, Anger \u2192 Angry, Joy \u2192 Happy, Sadness \u2192 Sad and Surprise \u2192 Inspired." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-95", "text": "Disgust was not aligned with any emotion in EmoWordNet and hence was discarded as also assumed in (Staiano and Guerini, 2014) ." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-96", "text": "One important aspect of the extrinsic evaluation was checking the coverage of EmoWordNet against SemEval dataset." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-97", "text": "In order to compute coverage, we performed lemmatization of the news headlines using WordNet lemmatizer available through Python NLTK package." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-98", "text": "We excluded all words with POS tags different than noun, verb, adjective and adverb." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-99", "text": "EmoWordNet achieved a coverage of 68.6% while DepecheMood had a coverage of 67.1%." 
}, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-100", "text": "An increase in coverage was expected but since the size of the dataset is relatively small, the increase was only around 1.5%." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-101", "text": "In terms of headline coverage, only one headline (\"Toshiba Portege R400\") was left without any emotion scores when using both EmoWordNet and DepecheMood since none of its terms were found in any of the two lexicons." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-102", "text": "----------------------------------" }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-103", "text": "**REGRESSION AND CLASSIFICATION RESULTS**" }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-104", "text": "We followed an approach similar to the one presented for evaluating DepecheMood." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-105", "text": "For preprocessing, we first lemmatized the headlines using WordNet lemmatizer available in Python NLTK package." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-106", "text": "We also accounted for multi-word terms that were solely available in EmoWordNet by looking at n-grams (up to n=3) after lemmatization." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-107", "text": "We then removed all terms that did not belong to any of the four POS tags: noun, verb, adjective and adverbs." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-108", "text": "For features computation, we considered two variations: the sum and the average of the emotion scores for the five emotion labels that overlapped between EmoWordNet and SemEval dataset." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-109", "text": "Using average turned out to perform better than when using sum for both lexicons." 
}, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-110", "text": "As stated in (Staiano and Guerini, 2014) paper, 'Disgust' emotion was excluded since there was no corresponding mapping in EmoWordNet/DepecheMood." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-111", "text": "The first evaluation consisted of measuring Pearson Correlation between the scores computed using the lexicons and those provided in SemEval." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-112", "text": "The results are reported in Table 1." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-113", "text": "We could see that the results are relatively close to each other: EmoWordNet slightly outperformed DepecheMood for the five different emotions." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-114", "text": "It was expected to have close results given that the coverage of EmoWordNet is very close to DepecheMood." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-115", "text": "Given the slight improvement, we expect EmoWordNet to perform much better on larger datasets." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-116", "text": "For the classification task, we first transformed the numerical emotion scores of the headlines to a binary representation." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-117", "text": "We applied min-max normalization on the computed emotion scores per headline, and then assigned a '1' for the emotion label with score greater than '0.5', and a '0' otherwise." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-118", "text": "We used F1 measure for evaluation." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-119", "text": "Results are shown in Table 2 ." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-120", "text": "More significant improvement was observed in classification task compared to regression task when using EmoWordNet." 
}, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-121", "text": "----------------------------------" }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-122", "text": "**RESULTS ANALYSIS**" }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-123", "text": "In this section, we present some quantitative and qualitative analyses of the results." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-124", "text": "For quantitative analysis, we checked first whether the count of terms in a headline is correlated with having a correct emotion classification." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-125", "text": "Overall, the length of headlines was varying between 2 and 15 terms." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-126", "text": "Headlines with length between 5 and 10 terms were mostly correctly classified." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-127", "text": "Hence, one can conclude that having a headline with couple of terms only may not allow the system to clearly decide on the emotion label and having headlines with many terms may cause the system to over predict emotions." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-128", "text": "In addition to headline length, we checked whether POS tags are correlated with correct or erroneous emotion predictions." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-129", "text": "Given that the dataset consists of news headlines, the \"noun\" POS tag was the most frequent in both correctly classified headlines and misclassified ones." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-130", "text": "For qualitative analysis, we analyze few correctly classified headlines and few other misclassified ones." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-131", "text": "We show in Table 3 few examples of correctly classified headlines and in table 4 other examples of misclassified headlines." 
}, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-132", "text": "By looking at the misclassified examples, we observe that the golden annotation tend to be sometimes conflicting such as the second and the fifth examples in Table 4 where we have joy and sadness as assigned emotions for the two headlines." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-133", "text": "An explanation for having conflicting emotions for the same headline is that the annotators reflected their personal point of view of the information conveyed by the headline." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-134", "text": "Hence, some people were happy to read the headline others were sad." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-135", "text": "In order to incorporate such challenging aspect of emotion recognition from text, more sophisticated emotion recognition models need to be considered and tested." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-136", "text": "----------------------------------" }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-137", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-138", "text": "We presented EmoWordNet, a large scale emotion lexicon, consisting of around 67K EWN words and 58K EWN synsets annotated with 8 emotion scores." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-139", "text": "EmoWordNet is automatically constructed by applying a semantic expansion approach using EWN and DepecheMood." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-140", "text": "When utilized for emotion recognition, EmoWordNet outperformed existing emotion lexicons and had a better lexical coverage." }, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-141", "text": "For future work, we would like to evaluate the performance of EmoWordNet on larger datasets and we would like to improve the accuracy of the recognition model." 
}, { "sent_id": "c0cac496ec0abdfd3f6bd9914f4cc4-C001-142", "text": "EmoWordNet is publicly available on http://oma-project.com." } ], "y": { "@BACK@": { "gold_contexts": [ [ "c0cac496ec0abdfd3f6bd9914f4cc4-C001-18" ], [ "c0cac496ec0abdfd3f6bd9914f4cc4-C001-21" ], [ "c0cac496ec0abdfd3f6bd9914f4cc4-C001-41" ], [ "c0cac496ec0abdfd3f6bd9914f4cc4-C001-44" ] ], "cite_sentences": [ "c0cac496ec0abdfd3f6bd9914f4cc4-C001-18", "c0cac496ec0abdfd3f6bd9914f4cc4-C001-21", "c0cac496ec0abdfd3f6bd9914f4cc4-C001-41", "c0cac496ec0abdfd3f6bd9914f4cc4-C001-44" ] }, "@EXT@": { "gold_contexts": [ [ "c0cac496ec0abdfd3f6bd9914f4cc4-C001-62" ] ], "cite_sentences": [ "c0cac496ec0abdfd3f6bd9914f4cc4-C001-62" ] }, "@DIF@": { "gold_contexts": [ [ "c0cac496ec0abdfd3f6bd9914f4cc4-C001-62" ] ], "cite_sentences": [ "c0cac496ec0abdfd3f6bd9914f4cc4-C001-62" ] }, "@SIM@": { "gold_contexts": [ [ "c0cac496ec0abdfd3f6bd9914f4cc4-C001-94" ], [ "c0cac496ec0abdfd3f6bd9914f4cc4-C001-95" ], [ "c0cac496ec0abdfd3f6bd9914f4cc4-C001-110" ] ], "cite_sentences": [ "c0cac496ec0abdfd3f6bd9914f4cc4-C001-94", "c0cac496ec0abdfd3f6bd9914f4cc4-C001-95", "c0cac496ec0abdfd3f6bd9914f4cc4-C001-110" ] } } }, "ABC_6891aebc7bb1152884d2236a893b55_24": { "x": [ { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-2", "text": "Although parallel coreference corpora can to a high degree support the development of SMT systems, there are no large-scale parallel datasets available due to the complexity of the annotation task and the variability in annotation schemes." }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-3", "text": "In this study, we exploit an annotation projection method to combine the output of two coreference resolution systems for two different source languages (English, German) in order to create an annotated corpus for a third language (Russian)." 
}, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-4", "text": "We show that our technique is superior to projecting annotations from a single source language, and we provide an in-depth analysis of the projected annotations in order to assess the perspectives of our approach." }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-5", "text": "----------------------------------" }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-7", "text": "Most of the recent work on exploiting coreference relations in Machine Translation focused on improving the translation of anaphoric pronouns (Le Nagard and Koehn, 2010; Hardmeier and Federico, 2010; Guillou, 2012; Nov\u00e1k et al., 2015; Guillou and Webber, 2015) , disregarding other types of coreference relations, one of the reasons being the lack of annotated parallel corpora as well as the variability in the annotated data." }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-8", "text": "However, this could be alleviated by exploiting annotation projection across parallel corpora to create more linguistically annotated resources for new languages." }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-9", "text": "More importantly, applying annotation projection using several source languages would support the creation of corpora less biased towards the peculiarities of a single source annotation scheme." }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-10", "text": "In our study, we aim at exploring the usability of annotation projection for the transfer of automatically produced coreference chains." }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-11", "text": "In particular, our idea is that using several source annotations produced by different systems could improve the performance of the projection method." 
}, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-12", "text": "Our approach to the annotation projection builds upon the approach recently introduced by (Grishina and Stede, 2017) , who experimented with projecting manually annotated coreference chains from two source languages to the target language." }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-13", "text": "However, our goal is slightly different: We are interested in developing a fully automatic pipeline, which would support the automatic creation of parallel annotated corpora in new languages." }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-14", "text": "Therefore, in contrast to (Grishina and Stede, 2017) , we use automatic source annotations produced by two state-of-the-art coreference systems, and we combine the output of our projection method for two source languages (English and German) to obtain target annotations for a third language (Russian)." }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-15", "text": "Through performing the error analysis of the projected annotations, we investigate the most common projection errors and assess the benefits and drawbacks of our method." }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-16", "text": "The paper is organized as follows: Section 2 presents an overview of the related work and Section 3 describes the experimental setup." }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-17", "text": "In Section 4, we give a detailed error analysis and discuss the results of our experiment." }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-18", "text": "The conclusions and the avenues for future research are presented in Section 5." 
}, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-19", "text": "----------------------------------" }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-20", "text": "**RELATED WORK**" }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-21", "text": "Annotation projection is a method that allows for automatically transferring annotations from a well-studied (source) language to a low-resource (target) language in a parallel corpus in order to automatically obtain annotated data." }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-22", "text": "It was first introduced in the work of (Yarowsky et al., 2001) Rahman and Ng, 2012; Martins, 2015; Grishina and Stede, 2015) ." }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-23", "text": "Thereafter, (Grishina and Stede, 2017) proposed a multi-source method for annotation projection: They used a manually annotated trilingual coreference corpus and two source languages (English-German, English-Russian) to transfer annotations to the target language (Russian and German, respectively)." }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-24", "text": "Although their approach showed promising results, it was based on transferring manually produced annotations, which are typically not available for other languages and, more importantly, can not be acquired large-scale due to the complexity of the annotation task." }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-25", "text": "----------------------------------" }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-26", "text": "**ANNOTATION PROJECTION EXPERIMENT**" }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-27", "text": "In our experiment, we propose a fully automatic projection setup: First, we perform coreference resolution on the source language data and then we implement the single-and multi-source approaches to transfer the automatically produced annotations." 
}, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-28", "text": "We use the English-German-Russian unannotated corpus of (Grishina and Stede, 2017) as the basis for our experiment, which contains texts in two genres -newswire texts (229 sentences per language) and short stories (184 sentences per language)." }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-29", "text": "Furthermore, we use manual annotations present in the corpus as the gold standard for our evaluation." }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-30", "text": "It should be noted that the manual annotations were performed according to the parallel coreference annotation guidelines of (Grishina and Stede, 2016) that are in general compatible with the annotation of the the OntoNotes corpus (Hovy et al., 2006) and are therefore suitable for our evaluation." }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-31", "text": "----------------------------------" }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-32", "text": "**COREFERENCE RESOLUTION ON THE SOURCE LANGUAGE DATA**" }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-33", "text": "Since the main goal of this experiment is to assess the quality of the projection of automatic annotations, first we need to automatically label the source language data." }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-34", "text": "For the English side of the corpus, we chose the Berkeley Entity Resolution system (Durrett and Klein, 2014) , which was trained on the English part of the OntoNotes corpus (Hovy et al., 2006) and achieves the average F1 of 61.71 on the OntoNotes dataset (Durrett and Klein, 2014)." }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-35", "text": "For the German side of the corpus, we use the state-of-the-art CorZu system (Tuggener, 2016) to obtain the source annotations, which achieves the average of 66.9 F1 on the German part of the SemEval 2010 dataset (Klenner and Tuggener, 2011) ." 
}, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-36", "text": "Corpus statistics for the English and German datasets are presented in Table 1 ." }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-37", "text": "Interestingly, CorZu was able to resolve slightly more markables and coreference chains in total than Berkeley (1035 vs. 915, 268 vs. 182 respectively) ." }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-38", "text": "In particular, the numbers of found markables and chains in English and German diverge for the newswire texts, which further supports the claim that this part of the corpus contains more complex coreference relations than the short stories 1 ." }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-39", "text": "To estimate the quality of the automatically produced annotations, we evaluate the resulting dataset against the manually annotated English and German parts of the corpus ( Table 3 : Projection results from English and German into Russian" }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-40", "text": "----------------------------------" }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-41", "text": "**ANNOTATION PROJECTION STRATEGIES**" }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-42", "text": "For our experiment, we implement a direct projection method for coreference as described in (Grishina and Stede, 2015) ." }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-43", "text": "Our method works as follows: For each markable on the source side, we automatically select all the corresponding tokens on the target side aligned to it, and we then take the span between the first and the last word as the new target markable, which has the same coreference chain number as the source one." }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-44", "text": "Since the corpus was already sentence-and word-aligned 2 , we use the available alignments to transfer the annotations." 
}, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-45", "text": "Thereafter, we re-implement the multi-source approach as described in (Grishina and Stede, 2017) ." }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-46", "text": "In particular, they (a) looked at disjoint chains coming from different sources and (b) used the notion of chain overlap to measure the similarity between two coreference chains that contain some identical mentions 3 ." }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-47", "text": "In our experiment, we apply the following strategies from (Grishina and Stede, 2017):" }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-48", "text": "1. Setting 1 ('add'): disjoint chains from one source language are added to all the chains projected from the other source language; 2. Setting 2 ('unify-intersect'): the intersection of mentions for overlapping chains is selected." }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-49", "text": "3. Setting 3 ('unify-concatenate'): chains that overlap are treated as one chain starting from a certain percentage of overlap." }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-50", "text": "For both single-and multi-source approaches, we deliberately rely solely on word alignment information to project the annotations, in order to keep our approach easily transferable to other languages." }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-51", "text": "2 Sentence alignment was performed using HunAlign (Varga et al., 2007) ; word alignments were computed with GIZA++ (Och and Ney, 2003) on a parallel newswire corpus (Grishina and Stede, 2015) ." }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-52", "text": "3 Computed as Dice coefficient." 
}, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-53", "text": "----------------------------------" }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-54", "text": "**RESULTS**" }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-55", "text": "To evaluate the projection results, we computed the standard coreference metrics -MUC (Vilain et al., 1995) , B-cubed (Bagga and Baldwin, 1998) and CEAF (Luo, 2005) -and their average for each of the approaches (Table 3) ." }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-56", "text": "As one can see from the table, the quality of projections from English to Russian outperforms the quality of projections from German to Russian by 6.5 points F1." }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-57", "text": "Moreover, while Precision number are quite similar, projections from English exhibit higher Recall numbers." }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-58", "text": "As for the multi-source settings, we were able to achieve the highest F1 of 36.2 by combining disjoint chains (Setting 1), which is 1.9 point higher than the best single-source projection scores and constitutes almost 62% of the quality of the projection of gold standard annotations reported in (Grishina and Stede, 2017) ." }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-59", "text": "We were able to achieve the highest Precision scores by intersecting the overlapping chains (Setting 2) and the highest Recall by concatenating them (Setting 3)." }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-60", "text": "Finally, we evaluate the annotations coming from English and German against each other, in order to estimate their comparability and the percentage of overlap." }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-61", "text": "Interestingly, we achieve 52.0 F1, with Precision being slightly higher than Recall (53.9 vs. 50 .2), which shows the dissimilarity between the two projections." 
}, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-62", "text": "----------------------------------" }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-63", "text": "**ERROR ANALYSIS AND DISCUSSION**" }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-64", "text": "Analyzing the errors coming from each of the source languages, we first looked at the percentage of transferred mentions (Table 4) : Using our method we were able to automatically transfer 82.7% of all the source markable from English and only 57.6% of all the source markables from German; similarly, the percentage of the transferred chains is lower for German than for English." }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-65", "text": "Interestingly, while CorZu performs better on the source dataset than Berkeley, the results for the annotations projected from a single source are the opposite: Annotation projection from English to Russian performs better than from German to Russian." }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-66", "text": "Our hypothesis is that the reason for the lower percentage of transferred annotations is the lower quality of word alignments for GermanRussian as compared to English-Russian." }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-67", "text": "Furthermore, since the original language of the texts was English, we presume that the German and Russian translations are closer to English and less similar to each other." 
}, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-68", "text": "----------------------------------" }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-69", "text": "**ENGLISH**" }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-70", "text": "German # % # % Markables 757 82.7 596 57.6 Chains 182 100.0 227 84.7 Table 4 : Transferred chains and markables" }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-71", "text": "Since we do not have access to any gold alignment data, we estimate the quality of the word alignments by computing the number of unaligned tokens." }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-72", "text": "Not surprisingly, we see a higher percentage of unaligned words for German-Russian than for English-Russian: 17.03% vs. 14.96% respectively, which supports our hypothesis regarding the difference in the alignment quality for the two pairs." }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-73", "text": "Furthermore, we computed the distribution of unaligned words: The highest percentage of unaligned tokens disregarding punctuation marks are prepositions; pronouns constitute only 3% and 5% of all unaligned words for the alignments between English-Russian and German-Russian respectively." }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-74", "text": "However, these numbers do not constitute more than 5% of the overall number of pronouns in the corpus." }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-75", "text": "Following the work of (Grishina and Stede, 2017) , we analyse the projection accuracy for common nouns ('Nc'), named entities ('Np') and pronouns ('P') separately 4 : Table 5 shows the percentage of correctly projected markables of each type out of all the projected markables of this type." 
}, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-76", "text": "Our results conform to the results of (Grishina and Stede, 2017) : For both languages, pronouns exhibit the highest projection quality, while common and proper nouns are projected slightly less accurately, which is probably due to the fact that pronouns typically consist of single tokens and are better aligned than multi-token common and proper names." }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-77", "text": "Overall, for all the markables, the projection accuracy for English-Russian is around 10% better than projection accuracy for GermanRussian." }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-78", "text": "en-ru de-ru Nc 64.5 60.7 Np 70.5 66.6 P 83.6 76.5 All 65.1 55.6 Table 5 : Projection accuracy for common nouns, proper nouns and pronouns (%) Moreover, we compare the projected annotations across the two genres." }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-79", "text": "Interestingly, the results for the two languages vary: While the average coreference scores for English-Russian are quite comparable (news: 34.2 F1, stories: 33.3 F1), the scores for German-Russian differ considerably (news: 30.8 F1, stories: 20.8 F1)." }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-80", "text": "We attribute this difference to the quality of the source annotations and the performance of the source coreference resolvers on different genres of texts." }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-81", "text": "----------------------------------" }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-82", "text": "**SUMMARY AND OUTLOOK**" }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-83", "text": "In this study, we assessed the applicability of annotation projection in a scenario where we have access to two coreference resolvers in two source languages, the output of which is projected to a third language in a low-resource setting." 
}, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-84", "text": "Our results have shown that projection from two source languages is able to reach 62% of the quality of the projection of manual annotations and improves the projection scores by 1.9 F1." }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-85", "text": "Moreover, using the output of two completely different coreference resolution systems, we observed the similar tendencies as while projecting gold standard annotations: Projection from English to Russian achieves higher scores than projection from German to Russian, and pronouns have the highest projection accuracy." }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-86", "text": "Another important finding is that better source annotations does not necessarily result in better projection scores, which can be explained by the different quality of word alignments for both language pairs." }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-87", "text": "Having investigated this issue, we conclude that alignments between German and Russian contain more unaligned units than the alignments between English and Russian." }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-88", "text": "Our next steps include examining the alignment quality in more detail, which would require establishing a gold standard set of alignments (for at least noun phrases)." }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-89", "text": "Overall, we envision our future work in exploiting more than two source annotations as well as multiple coreference resolution systems for a single source language to improve the source coreference annotations." }, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-90", "text": "Specifically, we plan on applying our method on other language pairs and datasets, in order to explore its generalizabililty for a wider range of languages." 
}, { "sent_id": "6891aebc7bb1152884d2236a893b55-C001-91", "text": "Furthermore, we are interested in exploiting our approach as a first step to create coreference annotated corpora in new languages by providing automatically projected target coreference chains to human annotators for a subsequent validation." } ], "y": { "@SIM@": { "gold_contexts": [ [ "6891aebc7bb1152884d2236a893b55-C001-12" ], [ "6891aebc7bb1152884d2236a893b55-C001-28" ], [ "6891aebc7bb1152884d2236a893b55-C001-45" ], [ "6891aebc7bb1152884d2236a893b55-C001-47" ], [ "6891aebc7bb1152884d2236a893b55-C001-58" ], [ "6891aebc7bb1152884d2236a893b55-C001-75" ] ], "cite_sentences": [ "6891aebc7bb1152884d2236a893b55-C001-12", "6891aebc7bb1152884d2236a893b55-C001-28", "6891aebc7bb1152884d2236a893b55-C001-45", "6891aebc7bb1152884d2236a893b55-C001-47", "6891aebc7bb1152884d2236a893b55-C001-58", "6891aebc7bb1152884d2236a893b55-C001-75" ] }, "@USE@": { "gold_contexts": [ [ "6891aebc7bb1152884d2236a893b55-C001-12" ], [ "6891aebc7bb1152884d2236a893b55-C001-28" ], [ "6891aebc7bb1152884d2236a893b55-C001-45" ], [ "6891aebc7bb1152884d2236a893b55-C001-47" ], [ "6891aebc7bb1152884d2236a893b55-C001-58" ], [ "6891aebc7bb1152884d2236a893b55-C001-75" ] ], "cite_sentences": [ "6891aebc7bb1152884d2236a893b55-C001-12", "6891aebc7bb1152884d2236a893b55-C001-28", "6891aebc7bb1152884d2236a893b55-C001-45", "6891aebc7bb1152884d2236a893b55-C001-47", "6891aebc7bb1152884d2236a893b55-C001-58", "6891aebc7bb1152884d2236a893b55-C001-75" ] }, "@DIF@": { "gold_contexts": [ [ "6891aebc7bb1152884d2236a893b55-C001-14" ], [ "6891aebc7bb1152884d2236a893b55-C001-76" ] ], "cite_sentences": [ "6891aebc7bb1152884d2236a893b55-C001-14", "6891aebc7bb1152884d2236a893b55-C001-76" ] }, "@EXT@": { "gold_contexts": [ [ "6891aebc7bb1152884d2236a893b55-C001-14" ] ], "cite_sentences": [ "6891aebc7bb1152884d2236a893b55-C001-14" ] }, "@BACK@": { "gold_contexts": [ [ "6891aebc7bb1152884d2236a893b55-C001-23" ] ], "cite_sentences": [ 
"6891aebc7bb1152884d2236a893b55-C001-23" ] } } }, "ABC_939274ae40a68acc322b34d8f91f7e_24": { "x": [ { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-2", "text": "Inter-sentence relation extraction deals with a number of complex semantic relationships in documents, which require local, non-local, syntactic and semantic dependencies." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-3", "text": "Existing methods do not fully exploit such dependencies." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-4", "text": "We present a novel inter-sentence relation extraction model that builds a labelled edge graph convolutional neural network model on a document-level graph." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-5", "text": "The graph is constructed using various inter-and intra-sentence dependencies to capture local and non-local dependency information." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-6", "text": "In order to predict the relation of an entity pair, we utilise multi-instance learning with bi-affine pairwise scoring." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-7", "text": "Experimental results show that our model achieves comparable performance to the state-of-the-art neural models on two biochemistry datasets." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-8", "text": "Our analysis shows that all the types in the graph are effective for inter-sentence relation extraction." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-9", "text": "----------------------------------" }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-11", "text": "Semantic relationships between named entities often span across multiple sentences." 
}, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-12", "text": "In order to extract inter-sentence relations, most approaches utilise distant supervision to automatically generate document-level corpora (Peng et al., 2017; Song et al., 2018) ." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-13", "text": "Recently, Verga et al. (2018) introduced multi-instance learning (MIL) (Riedel et al., 2010; Surdeanu et al., 2012) to treat multiple mentions of target entities in a document." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-14", "text": "Inter-sentential relations depend not only on local but also on non-local dependencies." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-15", "text": "Dependency trees are often used to extract local dependencies of semantic relations (Culotta and Sorensen, 2004; Liu et al., 2015) in intra-sentence * Corresponding author." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-16", "text": "named entities." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-17", "text": "The red arrow represents a relation between co-referred entities and yellow arrows represent semantically dependent relations." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-18", "text": "Example adapted from the CDR dataset (Wei et al., 2015) ." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-19", "text": "relation extraction (RE)." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-20", "text": "However, such dependencies are not adequate for inter-sentence RE, since different sentences have different dependency trees." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-21", "text": "Figure 1 illustrates such a case between Oxytocin and hypotension." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-22", "text": "To capture their relation, it is essential to connect the co-referring entities Oxytocin and Oxt." 
}, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-23", "text": "RNNs and CNNs, which are often used for intra-sentence RE (Zeng et al., 2014; dos Santos et al., 2015; Zhou et al., 2016b; Lin et al., 2016) , are not effective on longer sequences (Sahu and Anand, 2018 ) thus failing to capture such non-local dependencies." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-24", "text": "We propose a novel inter-sentence RE model that builds a labelled edge Graph CNN (GCNN) model (Marcheggiani and Titov, 2017) on a document-level graph." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-25", "text": "The graph nodes correspond to words and edges represent local and nonlocal dependencies among them." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-26", "text": "The documentlevel graph is formed by connecting words with local dependencies from syntactic parsing and sequential information, as well as non-local dependencies from coreference resolution and other semantic dependencies (Peng et al., 2017) ." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-27", "text": "We infer relations between entities using MIL-based bi-affine pairwise scoring function (Verga et al., 2018) on the entity node representations." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-28", "text": "Our contribution is threefold." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-29", "text": "Firstly, we pro- Figure 2 : Proposed model architecture." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-30", "text": "The input word sequence is mapped to a graph structure, where nodes are words and edges correspond to dependencies." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-31", "text": "We omit several edges, such as self-node edges of all words and syntactic dependency edges of different labels, for brevity." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-32", "text": "GCNN is employed to encode the graph and a bi-affine layer aggregates all mention pairs." 
}, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-33", "text": "pose a novel model for inter-sentence RE using GCNN to capture local and non-local dependencies." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-34", "text": "Secondly, we apply the model on two biochemistry corpora and show its effectiveness." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-35", "text": "Finally, we developed a novel, distantly supervised dataset with chemical reactant-product relations from PubMed abstracts." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-36", "text": "1" }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-37", "text": "----------------------------------" }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-38", "text": "**PROPOSED MODEL**" }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-39", "text": "We formulate the inter-sentence, document-level RE task as a classification problem." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-40", "text": "Let [w 1 , w 2 , \u00b7 \u00b7 \u00b7 , w n ] be the words in a document t and e 1 and e 2 be the entity pair of interest in t. We name the multiple occurrences of these entities in the document entity mentions." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-41", "text": "A relation extraction model takes a triple (e 1 , e 2 , t) as input and returns a relation for the pair, including the \"no relation\" category, as output." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-42", "text": "We assume that the relationship of the target entities in t can be inferred based on all their mentions." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-43", "text": "We thus apply multi-instance learning on t to combine all mention-level pairs and predict the final relation category of a target pair." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-44", "text": "We describe the architecture of our proposed model in Figure 2 ." 
}, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-45", "text": "The model takes as input an entire abstract of scientific articles and two target entities with all their mentions in the input layer." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-46", "text": "It then constructs a graph structure with words as nodes and labelled edges that correspond to local and non-local dependencies." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-47", "text": "Next, it encodes the graph structure using a stacked GCNN layer and classifies the relation between the target entities by applying MIL (Verga et al., 2018) to aggregate all 1 The dataset is publicly available at http://nactem." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-48", "text": "ac.uk/CHR/. mention pair representations." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-49", "text": "----------------------------------" }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-50", "text": "**INPUT LAYER**" }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-51", "text": "In the input layer, we map each word i and its relative positions to the first and second target entities into real-valued vectors, w i , d 1 i , d 2 i , respectively." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-52", "text": "As entities can have more than one mention, we calculate the relative position of a word from the closest target entity mention." 
}, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-53", "text": "For each word i, we concatenate the word and position representations into an input representation," }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-54", "text": "----------------------------------" }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-55", "text": "**GRAPH CONSTRUCTION**" }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-98", "text": "----------------------------------" }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-56", "text": "In order to build a document-level graph for an entire abstract, we use the following categories of inter-and intra-sentence dependency edges, as shown with different colours in Figure 2 ." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-57", "text": "Syntactic dependency edge: The syntactic structure of a sentence reveals helpful clues for intrasentential RE (Miwa and Bansal, 2016) ." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-58", "text": "We thus use labelled syntactic dependency edges between the words of each sentence, by treating each syntactic dependency label as a different edge type." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-59", "text": "Coreference edge: As coreference is an important indicator of local and non-local dependencies (Ma et al., 2016) , we connect co-referring phrases in a document using coreference type edges." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-60", "text": "Adjacent sentence edge: We connect the syntactic root of a sentence with the roots of the previous and next sentences with adjacent sentence type edges (Peng et al., 2017) for non-local dependencies between neighbouring sentences." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-61", "text": "Adjacent word edge: In order to keep sequential information among the words of a sentence, we connect each word with its previous and next words with adjacent word type edges." 
}, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-62", "text": "Self-node edge: GCNN learns a node representation based solely on its neighbour nodes and their edge types." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-63", "text": "Hence, to include the node information itself into the representation, we form selfnode type edges on all the nodes of the graph." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-64", "text": "----------------------------------" }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-65", "text": "**GCNN LAYER**" }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-66", "text": "We compute the representation of each input word i by applying GCNN (Kipf and Welling, 2017; Defferrard et al., 2016) on the constructed document graph." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-67", "text": "GCNN is an advanced version of CNN for graph encoding that learns semantic representations for the graph nodes, while preserving its structural information." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-68", "text": "In order to learn edge type-specific representations, we use a labelled edge GCNN, which keeps separate parameters for each edge type (Vashishth et al., 2018) ." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-69", "text": "The GCNN iteratively updates the representation of each input word i as follows:" }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-70", "text": "is the i-th word representation resulted from the k-th GCNN block, \u03bd(i) is a set of neighbouring nodes to i, W k l(i,u) and b k l(i,u) are the parameters of the k-th block for edge type l between nodes i and u. We stack K GCNN blocks to accumulate information from distant neighbouring nodes and use edge-wise gating to control information from neighbouring nodes." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-71", "text": "Similar to Marcheggiani and Titov (2017), we maintain separate parameters for each edge direction." 
}, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-72", "text": "We, however, tune the number of model parameters by keeping separate parameters only for the top-N types and using the same parameters for all the remaining edge types, named \"rare\" type edges." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-73", "text": "This can avoid possible overfitting due to over-parameterisation for different edge types." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-74", "text": "----------------------------------" }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-75", "text": "**MIL-BASED RELATION CLASSIFICATION**" }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-99", "text": "**DATA PRE-PROCESSING**" }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-76", "text": "Since each target entity can have multiple mentions in a document, we employ a multi-instance learning (MIL)-based classification scheme to aggregate the predictions of all target mention pairs using bi-affine pairwise scoring (Verga et al., 2018) ." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-77", "text": "As shown in Figure 2 , each word i is firstly projected into two separate latent spaces using two-layered feed-forward neural networks (FFNN), which correspond to the first (head) or second (tail) argument of the target pair." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-78", "text": ", where x K i corresponds to the representation of the i-th word after |K| blocks of GCNN encoding, W (0) , W (1) are the parameters of two FFNNs for head and tail respectively and x head i , x tail i \u2208 R d are the resulted head/tail representations for the ith word." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-79", "text": "Then, mention-level pairwise confidence scores are generated by a bi-affine layer and aggregated to obtain the entity-level pairwise score." 
}, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-80", "text": "where, R \u2208 R d\u00d7r\u00d7d is a learned bi-affine tensor with r the number of relation categories, and E head , E tail denote a set of mentions for entities e head and e tail respectively." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-81", "text": "----------------------------------" }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-82", "text": "**EXPERIMENTAL SETTINGS**" }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-83", "text": "We first briefly describe the datasets where the proposed model is evaluated along with their preprocessing." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-84", "text": "We then introduce the baseline models we use for comparison." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-85", "text": "Finally, we show the training settings." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-86", "text": "----------------------------------" }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-87", "text": "**DATA SETS**" }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-88", "text": "We evaluated our model on two biochemistry datasets." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-89", "text": "----------------------------------" }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-90", "text": "**CHEMICAL-DISEASE RELATIONS DATASET (CDR):**" }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-91", "text": "The CDR dataset is a document-level, intersentence relation extraction dataset developed for the BioCreative V challenge (Wei et al., 2015 Table 1 shows the statistics for CDR and CHR datasets." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-92", "text": "For both datasets, the annotated entities can have more than one associated Knowledge Base (KB) ID." 
}, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-93", "text": "If there is at least one common KB ID between mentions then we considered all these mentions to belong to the same entity." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-94", "text": "This technique results in less negative pairs." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-95", "text": "We ignored entities that were not grounded to a known KB ID and removed relations between the same entity (self-relations)." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-96", "text": "For the CDR dataset, we performed hypernym filtering similar to Gu et al. (2017) and Verga et al. (2018) ." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-97", "text": "In the CHR dataset, both directions were generated for each candidate chemical pair as chemicals can be either a reactant (first argument) or a product (second argument) in an interaction." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-100", "text": "We processed the datasets using the GENIA Sentence Splitter 4 and GENIA tagger (Tsuruoka et al., 2005) for sentence splitting and word tokenisation, respectively." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-101", "text": "Syntactic dependencies were obtained using the Enju syntactic parser (Miyao and Tsujii, 2008) with predicate-argument structures." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-102", "text": "Coreference type edges were constructed using the Stanford CoreNLP software (Manning et al., 2014) ." 
}, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-103", "text": "----------------------------------" }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-104", "text": "**BASELINE MODELS**" }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-105", "text": "For the CDR dataset, we compare with five stateof-the-art models: SVM , ensemble of feature-based and neural-based models (Zhou et al., 2016a) , CNN and Maximum Entropy (Gu et al., 2017) , Piece-wise CNN (Li et al., 2018) and Transformer (Verga et al., 2018) ." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-106", "text": "We additionally prepare and evaluate the following models: CNN-RE, a re-implementation from Kim (2014) and Zhou et al. (2016a) and RNN-RE, a reimplementation from Sahu and Anand (2018) ." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-107", "text": "In all models we use bi-affine pairwise scoring to detect relations." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-108", "text": "----------------------------------" }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-109", "text": "**MODEL TRAINING**" }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-110", "text": "We used 100-dimentional word embeddings trained on PubMed with GloVe (Pennington et al., 2014; TH et al., 2015) ." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-111", "text": "Unlike Verga et al. (2018), we used the pre-trained word embeddings in place of sub-word embeddings to align with our word graphs." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-112", "text": "Due to the size of the CDR dataset, we merged the training and development sets to train the models, similarly to Xu et al. (2016a) and Gu et al. (2017) ." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-113", "text": "We report the performance as the average of five runs with different parameter initialisation seeds in terms of precision (P), recall (R) and F1-score." 
}, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-114", "text": "We used the frequencies of the edge types in the training set to choose the top-N edges in Section 2.3." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-115", "text": "We refer to the supplementary materials for the details of the training and hyper-parameter settings." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-116", "text": "----------------------------------" }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-117", "text": "**RESULTS**" }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-118", "text": "We show the results of our model for the CDR and CHR datasets in Table 2 ." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-119", "text": "We report the performance of state-of-the-art models without any additional enhancements, such as joint training with NER, model ensembling and heuristic rules, to avoid any effects from the enhancements in the comparison." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-120", "text": "We observe that the GCNN outperforms the baseline models (CNN-RE/RNN-RE) in both datasets." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-121", "text": "However, in the CDR dataset, the performance of GCNN is 1.6 percentage points lower than the best performing system of (Gu et al., 2017) ." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-144", "text": "**CONCLUSION**" }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-122", "text": "In fact, Gu et al. (2017) incorporates two separate neural and feature-based models for intra-and inter-sentence pairs, respectively, whereas we utilize a single model for both pairs." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-123", "text": "Additionally, GCNN performs comparably to the second state-of-the-art neural model Li et al. (2018) , which requires a two-step process for mention aggregation unlike our unified approach." 
}, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-124", "text": "Figure 3 illustrates the performance of our model on the CDR development set when using a varying number of most frequent edge types N ." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-125", "text": "While tuning N , we observed that the best performance was obtained for top-4 edge types, but it slightly deteriorated with more." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-126", "text": "We chose the top-4 edge types in other experiments." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-127", "text": "Table 3 : Ablation analysis on the CDR development set, in terms of F1-score (%), for intra-(Intra) and inter-sentence (Inter) pairs." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-128", "text": "We perform ablation analysis on the CDR dataset by separating the development set to intraand inter-sentence pairs (approximately 70% and 30% of pairs, respectively)." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-129", "text": "Table 3 shows the performance when removing an edge category at a time." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-130", "text": "In general, all dependency types have positive effects on inter-sentence RE and the overall performance, although self-node and adjacent sentence edges slightly harm the performance of intra-sentence relations." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-131", "text": "Additionally, coreference does not affect intra-sentence pairs." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-132", "text": "----------------------------------" }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-133", "text": "**RELATED WORK**" }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-134", "text": "Inter-sentence RE is a recently introduced task." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-135", "text": "Peng et al. (2017) and Song et al. 
(2018) used graph-based LSTM networks for n-ary RE in multiple sentences for protein-drug-disease associations." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-136", "text": "They restricted the relation candidates in up to two-span sentences." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-137", "text": "Verga et al. (2018) considered multi-instance learning for document-level RE." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-138", "text": "Our work is different from Verga et al. (2018) in that we replace Transformer with a GCNN model for full-abstract encoding using non-local dependencies such as entity coreference." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-139", "text": "GCNN was firstly proposed by Kipf and Welling (2017) and applied on citation networks and knowledge graph datasets." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-140", "text": "It was later used for semantic role labelling (Marcheggiani and Titov, 2017), multi-document summarization (Yasunaga et al., 2017) and temporal relation extraction (Vashishth et al., 2018) ." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-141", "text": "Zhang et al. (2018) used a GCNN on a dependency tree for intrasentence RE." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-142", "text": "Unlike previous work, we introduced a GCNN on a document-level graph, with both intra-and inter-sentence dependencies for intersentence RE." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-143", "text": "----------------------------------" }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-145", "text": "We proposed a novel graph-based method for inter-sentence RE using a labelled edge GCNN model on a document-level graph." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-146", "text": "The graph is constructed with words as nodes and multiple intra-and inter-sentence dependencies between them as edges." 
}, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-147", "text": "A GCNN model is employed to encode the graph structure and MIL is incorporated to aggregate the multiple mention-level pairs ." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-148", "text": "We show that our method achieves comparable performance to the state-of-the-art neural models on two biochemistry datasets." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-149", "text": "We tuned the number of labelled edges to maintain the number of parameters in the labelled edge GCNN." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-150", "text": "Analysis showed that all edge types are effective for inter-sentence RE." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-151", "text": "Although the model is applied to biochemistry corpora for inter-sentence RE, our method is also applicable to other relation extraction tasks." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-152", "text": "As future work, we plan to incorporate joint named entity recognition training as well as sub-word embeddings in order to further improve the performance of the proposed model." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-153", "text": "----------------------------------" }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-154", "text": "**A TRAINING AND HYPER-PARAMETER SETTINGS**" }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-155", "text": "We implemented all models using Tensorflow 5 ." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-156", "text": "The development set was used for hyperparameter tuning." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-157", "text": "For all models, parameters were optimised using the Adam optimisation algorithm with exponential moving average (Kingma and Ba, 2015) , learning rate of 0.0005, learning rate decay of 0.75 and gradient clipping 10." 
}, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-158", "text": "We used early stopping with patience equal to 5 epochs in order to determine the best training epoch." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-159", "text": "For other hyper-parameters, we performed a non-exhaustive hyper-parameter search based on the development set." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-160", "text": "We used the same hyperparameters of both CDR and CHR datasets." }, { "sent_id": "939274ae40a68acc322b34d8f91f7e-C001-161", "text": "The best hyper-parameter values are shown in Table 4 ." } ], "y": { "@BACK@": { "gold_contexts": [ [ "939274ae40a68acc322b34d8f91f7e-C001-13" ], [ "939274ae40a68acc322b34d8f91f7e-C001-137" ] ], "cite_sentences": [ "939274ae40a68acc322b34d8f91f7e-C001-13", "939274ae40a68acc322b34d8f91f7e-C001-137" ] }, "@SIM@": { "gold_contexts": [ [ "939274ae40a68acc322b34d8f91f7e-C001-27" ], [ "939274ae40a68acc322b34d8f91f7e-C001-47" ], [ "939274ae40a68acc322b34d8f91f7e-C001-76" ], [ "939274ae40a68acc322b34d8f91f7e-C001-96" ], [ "939274ae40a68acc322b34d8f91f7e-C001-105" ] ], "cite_sentences": [ "939274ae40a68acc322b34d8f91f7e-C001-27", "939274ae40a68acc322b34d8f91f7e-C001-47", "939274ae40a68acc322b34d8f91f7e-C001-76", "939274ae40a68acc322b34d8f91f7e-C001-96", "939274ae40a68acc322b34d8f91f7e-C001-105" ] }, "@USE@": { "gold_contexts": [ [ "939274ae40a68acc322b34d8f91f7e-C001-27" ], [ "939274ae40a68acc322b34d8f91f7e-C001-47" ], [ "939274ae40a68acc322b34d8f91f7e-C001-76" ], [ "939274ae40a68acc322b34d8f91f7e-C001-96" ], [ "939274ae40a68acc322b34d8f91f7e-C001-105" ] ], "cite_sentences": [ "939274ae40a68acc322b34d8f91f7e-C001-27", "939274ae40a68acc322b34d8f91f7e-C001-47", "939274ae40a68acc322b34d8f91f7e-C001-76", "939274ae40a68acc322b34d8f91f7e-C001-96", "939274ae40a68acc322b34d8f91f7e-C001-105" ] }, "@DIF@": { "gold_contexts": [ [ "939274ae40a68acc322b34d8f91f7e-C001-111" ], [ "939274ae40a68acc322b34d8f91f7e-C001-138" ] ], "cite_sentences": [ 
"939274ae40a68acc322b34d8f91f7e-C001-111", "939274ae40a68acc322b34d8f91f7e-C001-138" ] }, "@EXT@": { "gold_contexts": [ [ "939274ae40a68acc322b34d8f91f7e-C001-111" ], [ "939274ae40a68acc322b34d8f91f7e-C001-138" ] ], "cite_sentences": [ "939274ae40a68acc322b34d8f91f7e-C001-111", "939274ae40a68acc322b34d8f91f7e-C001-138" ] } } }, "ABC_dd875dd5c0f2558bb173f31bbdea00_24": { "x": [ { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-72", "text": "The corresponding clues and patterns are given in Table 2 ." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-2", "text": "Qualia Structures have many applications within computational linguistics, but currently there are no corresponding lexical resources such as WordNet or FrameNet." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-3", "text": "This paper presents an approach to automatically learn qualia structures for nominals from the World Wide Web and thus opens the possibility to explore the impact of qualia structures for natural language processing at a larger scale." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-4", "text": "Furthermore, our approach can be also used support a lexicographer in the task of manually creating a lexicon of qualia structures." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-5", "text": "The approach is based on the idea of matching certain lexicosyntactic patterns conveying a certain semantic relation on the World Wide Web using standard search engines." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-46", "text": "The downloaded abstracts are then part-of-speech tagged using QTag (Tufis and Mason, 1998" }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-70", "text": "**THE CONSTITUTIVE ROLE**" }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-71", "text": "The procedure for finding elements of the Constitutive role is similar to the one described above for the Formal role." 
}, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-6", "text": "We evaluate our approach qualitatively by comparing our automatically learned qualia structures with the ones from the literature, but also quantitatively by presenting results of a human evaluation." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-7", "text": "----------------------------------" }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-9", "text": "Qualia Structures have been originally introduced by (Pustejovsky, 1991) and are used for a variety of purposes in Natural Language processing such as the analysis of compounds (Johnston and Busa, 1996) , co-composition and coercion (Pustejovsky, 1991) as well as for bridging reference resolution (Bos et al., 1995) ." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-10", "text": "Further, it has also been argued that qualia structures and lexical semantic relations in general have applications in information retrieval (Voorhees, 1994; Pustejovsky et al., 1993) ." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-11", "text": "One major bottleneck however is that currently Qualia Structures need to be created by hand, which is probably also the reason why there are no practical system using qualia structures, but a lot of systems using globally available resources such as WordNet (Fellbaum, 1998) or FrameNet 1 1 http://framenet.icsi.berkeley.edu/ as source of lexical/world knowledge." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-12", "text": "The work described in this paper addresses this issue and presents an approach to automatically learning qualia structures for nominals from the Web." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-13", "text": "The approach is inspired in recent work on using the Web to identify instances of a relation of interest such as in (Markert et al., 2003) and (Cimiano and Staab, 2004) ." 
}, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-14", "text": "These approaches are in essence a combination of the usage of lexico-syntactic pattens conveying a certain relation of interest such as in (Hearst, 1992) , (Charniak and Berland, 1999) , (Iwanska et al., 2000) or (Poesio et al., 2002) with the idea of using the web as a big corpus (Resnik and Smith, 2003) , (Grefenstette, 1999) , (Keller et al., 2002) ." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-15", "text": "The idea of learning Qualia Structures from the Web is not only a very practical, it is in fact a principled one." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-16", "text": "While single lexicographers creating qualia structuresor lexicon entries in general -might take very subjective decisions, the structures learned from the Web do not mirror the view of a single person, but of the whole world as represented on the World Wide Web." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-17", "text": "Thus, an approach learning qualia structures from the Web is in principle more reliable than letting lexicographers craft lexical entries on their own." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-18", "text": "Obviously, on the other hand, using an automatic web based approach yields also a lot of inappropriate results which are due to 1) errors produced by the linguistic analysis (e.g. part-of-speech tagging), 2) idiosyncrasies of ranking algorithms of search machines, 3) the fact that the Web or in particular search engines are to a great extent commercially biased, 4) the fact that people also publish erroneous information on the Web, and 5) lexical ambiguities." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-19", "text": "Because of these reasons our aim is in fact not to replace lexicographers, but to support them in the task of creating qualia structures on the basis of the automatically learned qualia structures." 
}, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-20", "text": "The paper is structured as follows: Section 2 introduces qualia structures and describes the specific qualia structures we aim to acquire." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-21", "text": "Section 3 describes our approach in detail and section 4 presents a quantitative and qualitative evaluation of our approach." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-22", "text": "Before concluding, we discuss some related work in Section 5." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-23", "text": "----------------------------------" }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-24", "text": "**QUALIA STRUCTURES**" }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-45", "text": "With the use of such clues, we thus download a number of Google-abstracts in which a corresponding pattern will probably be matched thus restricting the linguistic analysis to a few promising pages." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-25", "text": "According to Aristotle, there are four basic factors or causes by which the nature of an object can be described (cf. (Kronlid, 2003) ): the material cause, i.e. the material an object is made of the agentive cause, i.e. the source of movement, creation or change the formal cause, i.e. its form or type the final cause, i.e. its purpose, intention or aim In his Generative Lexicon (GL) framework (Pustejovsky, 1991) reused Aristotle's basic factors for the description of the meaning of lexical elements." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-26", "text": "In fact he introduced so called Qualia Structures by which the meaning of a lexical element is described in terms of four roles:" }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-27", "text": "Constitutive: describing physical properties of an object, i.e. 
its weight, material as well as parts and components" }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-28", "text": "Agentive: describing factors involved in the bringing about of an object, i.e. its creator or the causal chain leading to its creation Formal: describing that properties which distinguish an object in a larger domain, i.e. orientation, magnitude, shape and dimensionality Telic: describing the purpose or function of an object Most of the qualia structures used in (Pustejovsky, 1991) however seem to have a more restricted interpretation." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-29", "text": "In fact, in most examples the Constitutive role seems to describe the parts or components of an object, while the Agentive role is typically described by a verb denoting an action which typically brings the object in question into existence." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-30", "text": "The Formal role normally consists in typing information about the object, i.e. its hypernym or superconcept." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-31", "text": "Finally, the Telic role describes the purpose or function of an object either by a verb or nominal phrase." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-32", "text": "The qualia structure for knife for example could look as follows (cf. (Johnston and Busa, 1996) ):" }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-33", "text": "Formal: artifact tool Constitutive: blade,handle,... Telic: cut act Agentive: make act Our understanding of Qualia Structure is in line with this restricted interpretation of the qualia roles." 
}, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-34", "text": "Our aim is to automatically acquire Qualia Structures from the Web for nominals, looking for (i) nominals describing the type of the object, (ii) verbs defining its agentive role, (iii) nominals describing its parts or components and (iv) nouns or verbs describing its intended purpose." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-35", "text": "----------------------------------" }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-36", "text": "**APPROACH**" }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-37", "text": "Our approach to learning qualia structures from the Web is on the one hand based on the assumption that instances of a certain semantic relation can be learned by matching certain lexico-syntactic patterns more or less reliably conveying the relation of interest in line with the seminal work of (Hearst, 1992) , who defined the following patterns conveying a hypernym relation:" }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-38", "text": "According to Hearst, from such patterns we can derive that for all (broken bone,injury) and hypernym (wound,injury) ." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-39", "text": "However, it is well known that Hearst-style patterns occur rarely, such that it seems intuitive to match them on the Web." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-40", "text": "So in our case we are looking not only for the hypernym relation (comparable to the Formal-Relation) but for similar patterns conveying a Constitutive, Telic or Agentive relation." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-41", "text": "As currently there is no support for searching using regular expressions in standard search engines such as Google or Altavista 3 , our approach consists of 5 phases (compare Figure 1 (Resnik and Elkiss, 2003) 4 The reason for using only the 10 first hits is to maintain efficiency." 
}, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-42", "text": "With the current setting the systems needs between 3 and 10 minutes to generate the qualia structure for a given nominal qualia element in a certain role is weighted according to some measure." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-43", "text": "The patterns in our pattern library are actually tuples" }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-44", "text": "where is a regular expression defined over part-of-speech tags and is a function returning the plural form of x. We implemented this function as a lookup in a lexicon in which plural nouns are mapped to their base form." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-47", "text": "The result is then a Weighted Qualia Structure (WQS) in which for each role the qualia elements are weighted according to this Jaccard coefficient." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-48", "text": "In what follows we describe in detail the procedure for acquiring qualia elements for each qualia role." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-49", "text": "In particular, we describe in detail the clues and lexico-syntactic patterns used." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-50", "text": "In general, the patterns have been crafted by hand, testing and refining them in an iterative process, paying attention to maximize their coverage but also accuracy." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-51", "text": "In general it is important to mention that by this approach we are not able to detect and separate multiple meanings of words, i.e. to handle polysemy, which is appropriately accounted for in the framework of the Generative Lexicon (Pustejovsky, 1991) ." 
}, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-52", "text": "----------------------------------" }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-53", "text": "**THE FORMAL ROLE**" }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-54", "text": "To derive qualia elements for the Formal role, we first download for each of the clues in Table 1 the first 10 abstracts matching the clue and then process them offline matching the patterns defined over part-of-speech-tags 5 thus yielding up to 10 different qualia element candidates per clue." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-55", "text": "The patterns are specified in form of regular expressions, whereby the part-of-speech tags are always 5 We use the well-known Penn Treebank tagset described at http://www.computing.dcu.ie/@ acahill/tagset.html." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-56", "text": "denoting any alphabetic character as well as [a-z] denoting the set of all lower case letters." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-57", "text": "As there are 4 different clues for the Formal role, we thus yield up to 40 qualia elements as potential candidates to fill the Formal role." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-58", "text": "In general, we paid attention to create clues relying on indefinite articles as we found out that they produce more general and reliable results than when using definite articles." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-59", "text": "In order to choose the correct indefinite article -a or an -or even using no article at all, we implemented some ad-hoc heuristics checking if the first letter of the term in question is a vowel and checking if the term is used more often with an article or without an article on the Web by a set of corresponding Google queries." 
}, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-60", "text": "The alternative '(a/an/?)' means that we use either the indefinite article 'a' 'an' or no article depending on the results of the above mentioned Google queries." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-61", "text": "A general question raised also by Hearst (Hearst, 1992) is how to deal with NP modification." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-62", "text": "Hearst's conclusion is that this depends on the application." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-63", "text": "In our case we mainly remove adjective modifiers, keeping only the heads of noun phrases as candidate qualia elements." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-64", "text": "The lemmatized heads of the NPE noun phrase are then regarded as qualia role candidates for the Formal role." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-65", "text": "These candidates are then weighted using the above defined Jaccard Coefficient measure." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-66", "text": "Hereby, a noun phrase is an instance matching the following regular expression:" }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-67", "text": "where the head is the underlined expression, which is lemmatized and considered as a candidate qualia element." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-68", "text": "After some initial experiments we decided not to use the patterns 'X is Y' and 'X is a kind of Y' such as in a book is an item or a book is a kind of publication as well as the pattern 'Y, including X' (compare (Hearst, 1992) ) as we found that in our settings they delivered quite spurious results." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-69", "text": "----------------------------------" }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-73", "text": "As above, the candidate qualia elements are then the lemmatized heads of the noun phrase NP ." 
}, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-74", "text": "As an additional heuristic, we test if the lemmatized head of NP is an element of the following list containing nouns denoting an indication of amount: bundle, majority, thousands, million, millions, hundreds, number, numbers, set, sets, series, range\u00a2 and furthermore this NP is followed by the preposition 'of'." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-75", "text": "In that case we would take the head of the noun phrase after the preposition 'of' as potential candidate of the Constitutive role." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-76", "text": "For example, when considering a conversation is made up of a series of observable interpersonal exchanges, we would take exchange as a potential qualia element candidate instead of series." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-77", "text": "----------------------------------" }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-78", "text": "**THE TELIC ROLE**" }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-79", "text": "The Telic Role is in principle acquired in the same way as the Formal and Constitutive roles with the exception that the qualia element is not only the head of a noun phrase, but also a verb or a verb followed by a noun phrase." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-80", "text": "Table 3 gives the corresponding clues and patterns." 
}, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-81", "text": "In particular, the returned candidate qualia elements are the lemmatized underlined expressions in PURP:=" }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-82", "text": "----------------------------------" }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-83", "text": "**THE AGENTIVE ROLE**" }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-84", "text": "As mentioned in (Hearst, 1992) , it is not always as straightforward to find lexico-syntactic patterns reliably conveying a certain relation." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-85", "text": "In fact, we did not find any patterns reliably identifying qualia elements for the Agentive role." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-86", "text": "Certainly, it would have been possible to find the source of the creation by using patterns such as X is made by Y or X is produced by Y. However, we found that these patterns do not reliably convey a verb describing how an object is brought into existence." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-87", "text": "The fact that it is far from straightforward to find patterns indicating an Agentive role is further corroborated by the research in (Yamada and Baldwin, 2004) , in which only one pattern indicating a qualia relation is used, namely 'NN BE V[+en]' in order to match passive constructions such as the book was written." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-88", "text": "On the other hand it is clear that constructing a reliable clue for this pattern is not straightforward given the current state-of-the-art concerning search engine queries." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-89", "text": "Nevertheless, in order to also get results for the Agentive role, we apply a different method here." 
}, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-90", "text": "Instead of issuing a query which is used to search for possible candidates for the role, we take advantage of the fact that the verbs which describe how something comes into being, particularly artificial things, are often quite general phrases like \"make, produce, write, build...\"." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-91", "text": "So instead of generating clues as above, we calculate the value" }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-92", "text": "for the nominal we want to acquire a qualia structure for as well as the following verbs: build, produce, make, write, plant, elect, create, cook, construct and design." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-93", "text": "If this value is over a threshold (0.0005 in our case), we assume that it is a valid filler of the Agentive qualia role." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-94", "text": "Busa, 1996) or (Pustejovsky, 1991) , as well as computer, an abstract noun, i.e. conversation, as well as two very specific multi-term words, i.e. natural language processing and data mining." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-95", "text": "We give the automatically learned weighted Qualia Structures for these entries in Figures 3, 4 , 5 and 6." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-96", "text": "The evaluation of our approach consists on the one hand of a discussion of the weighted qualia structures, in particular comparing them to the ideal structures form the literature." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-97", "text": "On the other hand, we also asked a student at our institute to assign credits to each of the qualia elements from 0 (incorrect) to 3 (totally correct) whereby 1 credit meaning 'not totally wrong' and 2 meaning 'still acceptable'." 
}, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-98", "text": "----------------------------------" }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-99", "text": "**QUANTITATIVE EVALUATION**" }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-100", "text": "The distribution of credits for each qualia role and term is given in Table 4 ." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-101", "text": "It can be seen that with three exceptions: beer formal, book agentive as well as beer constitutive, '3' is the mark assigned in most cases to the automatically learned qualia elements." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-102", "text": "Further, for almost every query term and qualia role, at least 50% of the automatically learned qualia structures have a mark of '2' or '3' -the only exceptions being beer formal with 45.45%, book agentive with 33.33% and beer constitutive with 28.57%." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-103", "text": "In general this shows that the automatically learned qualia roles are indeed reasonable." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-104", "text": "Considering the average over all the terms ('All' in the table), we observe that the qualia role which is recognized most reliably is the Telic one with 73.15% assignments of credit '3' and 75.93% of credits '2' or '3', followed by the Agentive role with 71.43% assignments of credit 3." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-105", "text": "The results for the Formal and Constitutive role are still reasonable with 62.09% assignments of credit '3' and 66.01% assignments of credits '2' or '3' for the Formal role; and respectively 61.61% and 64.61% for the Constitutive role." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-106", "text": "The worst results are achieved for the Constitutive role due to the fact that 26.26% of the qualia elements are regarded as totally wrong." 
}, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-107", "text": "Table 5 supports the above claims and shows the average credits assigned by the human evaluator per query term and role." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-108", "text": "It shows again that the roles with the best results are the Agentive and Telic roles, while the Formal and Constitutive roles are not identified as accurately." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-109", "text": "This is certainly due to the fact that the patterns for the Telic role are much less ambiguous than the ones for the Formal and Constitutive roles." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-110", "text": "Finally, we also discuss the correlation between the credits assigned and the Jaccard Coefficient." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-111", "text": "Figure 2 shows this correlation." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-112", "text": "While for the Formal role the correlation is as expected, i.e. the higher the credit assigned, the higher also the Jaccard Coefficient, for the Constitutive and Telic roles this correlation is unfortunately less clear, thus making the task of finding a cut-off threshold more difficult." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-113", "text": "----------------------------------" }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-114", "text": "**QUALITATIVE EVALUATION & DISCUSSION**" }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-115", "text": "In this section we provide a more subjective evaluation of the automatically learned qualia structures by comparing them to ideal qualia structures discussed in the literature wherever possible." 
}, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-116", "text": "In particular, we discuss in more detail the qualia structures for book, knife and beer, and leave the detailed assessment of the qualia structures for computer, natural language processing, data mining and conversation to the interested reader." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-117", "text": "For book, the first four candidates of the Formal role, i.e. product, item, publication and document, are very appropriate, though they allude to the physical object meaning of book as opposed to the meaning in the sense of information container (compare (Pustejovsky, 1991))." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-118", "text": "As candidates for the Agentive role we have make, write and create which are appropriate, write being the ideal filler of the Agentive role according to (Pustejovsky, 1991) ." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-119", "text": "For the Constitutive role of book we get - besides 'it' at the first position, which could easily be filtered out - sign (2nd position), letter (3rd position) and page (6th position), which are quite appropriate." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-120", "text": "The top four candidates for the Telic role are give, select, read and purchase." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-121", "text": "It seems that give emphasizes the role of a book as a gift, read refers to the most obvious purpose of a book as specified in the ideal qualia structures of (Pustejovsky, 1991) as well as (Johnston and Busa, 1996) and purchase denotes the more general purpose of a book, i.e. to be bought." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-122", "text": "The first element of the Formal role of knife unfortunately denotes the material it is typically made of, i.e. steel, but the next 5 elements are definitely appropriate: weapon, item, kitchenware, object and instrument." 
}, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-123", "text": "The ideal element artifact tool (compare (Johnston and Busa, 1996) ) can be found at the 10th position." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-124", "text": "The results are interesting in that on the one hand the most prominent meaning of knife according to the web is the one of a weapon." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-125", "text": "On the other hand our results are more specific, classifying a knife as kitchenware instead of merely as an artifact tool." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-126", "text": "Very interesting are the specific and accurate results at the end of the list." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-127", "text": "The reason why they appear at the end is that the Jaccard Coefficient ranks them lower because they are more specific, thus appearing less frequently." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-128", "text": "This shows that using some other measure less sensitive to frequency could yield more accurate results." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-129", "text": "The fillers of the Agentive role produce, make and create seem all appropriate, whereby make corresponds exactly to the ideal filler for the Agentive role as mentioned in (Johnston and Busa, 1996) ." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-130", "text": "The results for the Constitutive role contain not only parts but also materials a knife is made of and thus contain more information than the typical qualia structures assumed in the literature." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-131", "text": "The best results are (in this order) blade, metal, steel, wood and handle at the 6th position." 
}, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-132", "text": "In fact, in the ideal qualia structure in (Johnston and Busa, 1996), blade and handle are mentioned as fillers of the Constitutive role, while there are no elements describing the materials of which a knife is made." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-133", "text": "Finally, the top four candidates for the Telic role are kill, slit, cut and slice, whereby cut corresponds to the ideal filler of the qualia structure for knife as mentioned in (Johnston and Busa, 1996) ." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-134", "text": "Considering the qualia structure for beer, it is surprising that no purpose has been found." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-135", "text": "The reason is that currently no results are returned by Google for the clue 'a beer is used to', and the four snippets returned for 'the purpose of a beer' contain expressions of the form 'the purpose of a beer is to drink it', which our patterns do not match because 'it' is a pronoun rather than an NP (unless it is matched by an error as in the Qualia Structure for book in Figure 4)." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-136", "text": "Considering the results for the Formal role, the elements drink (1st), alcohol (2nd) and beverage (4th) are much more specific than liquid as given in (Pustejovsky, 1991) , while thing at the 3rd position is certainly too general." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-137", "text": "Furthermore, according to the automatically learned qualia structure, beer is made of rice, malt and hop, which are perfectly reasonable results." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-138", "text": "Very interesting are the results concoction and libation for the Formal role of beer, which unfortunately were rated low by our evaluator (compare Figure 3) ." 
}, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-139", "text": "Overall, the discussion has shown that the results produced by our method are reasonable when compared to the qualia structures from the literature." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-140", "text": "In general, our method produces in some cases additional qualia candidates, such as the ones describing the material a knife is typically made of." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-141", "text": "In other cases it discovers more specific candidates, such as for example weapon or kitchenware as elements of the Formal role for knife instead of the general term artifact tool." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-142", "text": "----------------------------------" }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-143", "text": "**RELATED WORK**" }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-144", "text": "There is quite a lot of work related to the use of linguistic patterns to discover certain ontological relations from text." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-145", "text": "Hearst's (Hearst, 1992) seminal work had the aim of discovering taxonomic relations from electronic dictionaries." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-146", "text": "The precision of the is-a-relations learned is 61/106 (57.55%) when measured against WordNet as gold standard, which is comparable to our results." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-147", "text": "Hearst's idea has been reapplied by different researchers with either slight variations in the patterns used (Iwanska et al., 2000) , to acquire knowledge for anaphora resolution (Poesio et al., 2002) , or to discover other kinds of semantic relations such as part-of relations (Charniak and Berland, 1999) or causation relations (Girju and Moldovan, 2002) ." 
}, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-148", "text": "Instead of matching these patterns in a large text collection, some researchers have recently turned to the Web to match these patterns such as in (Cimiano and Staab, 2004) or (Markert et al., 2003) ." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-149", "text": "(Cimiano and Staab, 2004) for example aim at learning instance-of as well as taxonomic (is-a) relations." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-150", "text": "This is closely related to the acquisition of the Formal role proposed here." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-151", "text": "(Markert et al., 2003) aim at acquiring knowledge for anaphora resolution, while (Etzioni et al., 2004) aim at learning the complete extension of a certain concept." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-152", "text": "For example, they aim at finding all the actors in the world." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-153", "text": "Our approach goes further in that it not only learns typing, superconcept or instance-of relations, but also Constitutive and Telic relations." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-154", "text": "There also exist approaches specifically aiming at learning qualia elements from corpora based on machine learning techniques." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-155", "text": "(Claveau et al., 2003) for example use Inductive Logic Programming to learn if a given verb is a qualia element or not." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-156", "text": "However, their approach does not go as far as learning the complete qualia structure for a lexical element in an unsupervised way, as presented in our approach." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-157", "text": "In fact, in their approach they do not distinguish between different qualia roles and restrict themselves to verbs as potential fillers of qualia roles." 
}, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-158", "text": "(Yamada and Baldwin, 2004) present an approach to learning Telic and Agentive relations from corpora analyzing two different approaches: one relying on matching certain lexico-syntactic patterns as in the work presented here, and a second approach consisting in training a maximum entropy model classifier." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-159", "text": "Their conclusion is that the results produced by the classification approach correlate better with two hand-crafted gold standards." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-160", "text": "The patterns used by (Yamada and Baldwin, 2004) differ substantially from the ones used in this paper, which is mainly due to the fact that search engines do not provide support for regular expressions and thus instantiating a pattern as 'V[+ing] Noun' is impossible in our approach as the verbs are unknown a priori." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-161", "text": "Finally, (Pustejovsky et al., 1993) present an interesting framework for the acquisition of semantic relations from corpora not only relying on statistics, but guided by theoretical lexicon principles." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-162", "text": "----------------------------------" }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-163", "text": "**CONCLUSION**" }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-164", "text": "We have presented an approach to automatically learning Qualia Structures from the Web." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-165", "text": "Such an approach is especially interesting not only for lexicographers aiming at constructing lexicons, but even more for natural language processing systems relying on deep lexical knowledge as represented by qualia structures." 
}, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-166", "text": "We have in particular shown that the qualia structures learned by our system are reasonable." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-167", "text": "In general, it is valid to claim that our system is the first one automatically producing complete qualia structures for a given nominal." }, { "sent_id": "dd875dd5c0f2558bb173f31bbdea00-C001-168", "text": "Our system can be tested online at http://km.aifb.unikarlsruhe.de/pankow/qualia/. Further work will aim at improving the system but also at using the automatically learned structures within NLP applications." } ], "y": { "@BACK@": { "gold_contexts": [ [ "dd875dd5c0f2558bb173f31bbdea00-C001-9" ], [ "dd875dd5c0f2558bb173f31bbdea00-C001-25" ], [ "dd875dd5c0f2558bb173f31bbdea00-C001-28" ], [ "dd875dd5c0f2558bb173f31bbdea00-C001-94" ] ], "cite_sentences": [ "dd875dd5c0f2558bb173f31bbdea00-C001-9", "dd875dd5c0f2558bb173f31bbdea00-C001-25", "dd875dd5c0f2558bb173f31bbdea00-C001-28", "dd875dd5c0f2558bb173f31bbdea00-C001-94" ] }, "@DIF@": { "gold_contexts": [ [ "dd875dd5c0f2558bb173f31bbdea00-C001-51" ], [ "dd875dd5c0f2558bb173f31bbdea00-C001-117" ] ], "cite_sentences": [ "dd875dd5c0f2558bb173f31bbdea00-C001-51", "dd875dd5c0f2558bb173f31bbdea00-C001-117" ] }, "@EXT@": { "gold_contexts": [ [ "dd875dd5c0f2558bb173f31bbdea00-C001-51" ], [ "dd875dd5c0f2558bb173f31bbdea00-C001-117" ] ], "cite_sentences": [ "dd875dd5c0f2558bb173f31bbdea00-C001-51", "dd875dd5c0f2558bb173f31bbdea00-C001-117" ] }, "@SIM@": { "gold_contexts": [ [ "dd875dd5c0f2558bb173f31bbdea00-C001-118" ] ], "cite_sentences": [ "dd875dd5c0f2558bb173f31bbdea00-C001-118" ] }, "@USE@": { "gold_contexts": [ [ "dd875dd5c0f2558bb173f31bbdea00-C001-118" ] ], "cite_sentences": [ "dd875dd5c0f2558bb173f31bbdea00-C001-118" ] }, "@MOT@": { "gold_contexts": [ [ "dd875dd5c0f2558bb173f31bbdea00-C001-121" ], [ "dd875dd5c0f2558bb173f31bbdea00-C001-136", "dd875dd5c0f2558bb173f31bbdea00-C001-137" ] ], 
"cite_sentences": [ "dd875dd5c0f2558bb173f31bbdea00-C001-121", "dd875dd5c0f2558bb173f31bbdea00-C001-136" ] } } }, "ABC_aa496bd71f380e02dd392cda969999_24": { "x": [ { "sent_id": "aa496bd71f380e02dd392cda969999-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-2", "text": "We introduce a new measure for unsupervised hypernym detection and directionality." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-124", "text": "One reason for this is that among the disambiguation pages, a word is often generalized to related terms." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-3", "text": "The motivation is to keep the measure computationally light and portable across languages." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-4", "text": "We show that the relative physical location of words in explanatory articles captures the directionality property." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-5", "text": "Further, the phrases in section titles of articles about the word capture the semantic similarity needed for the hypernym detection task." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-6", "text": "We experimentally show that the combination of features coming from these two simple measures suffices to produce results comparable with the best unsupervised measures in terms of the average precision." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-7", "text": "----------------------------------" }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-9", "text": "Given two words w 1 and w 2 , the hypernym detection task is to determine if there is a hypernym relation between the two words." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-10", "text": "If a hypernym is known to exist, the directionality task is to determine if w 1 is a hypernym or hyponym of w 2 ." 
}, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-11", "text": "More precisely, due to polysemy, the detection task asks if there is some meaning of w 1 in which it is a hypernym or hyponym of some meaning of w 2 ." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-12", "text": "The first approaches were pattern based (Hearst, 1992; Snow et al., 2004) ." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-13", "text": "However, these suffered from poor recall." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-14", "text": "This led to the development of methods based on the distributional hypothesis (Harris, 1954) or the Distributional Inclusion Hypotheses (Geffet and Dagan, 2005) ." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-15", "text": "The method used in these techniques was to take a very large corpus, and using either window based, or dependency path based approaches, along with measures like frequency, PPMI (Church and Hanks, 1990) , LPMI (Evert, 2005) , to find vectors to represent the words." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-16", "text": "In supervised settings, the vectors for two words are combined suitably and a classifier is trained (Baroni et al., 2012; Roller et al., 2014; Weeds et al., 2014; Shwartz et al., 2016) to predict the existence of a hypernym relation and later directionality (* work done as part of IBM Research, Bangalore)." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-17", "text": "However, recently there has been deeper research on what exactly is learned by these techniques (Levy et al., 2015) ." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-18", "text": "In the unsupervised setting a suitable measure, motivated by either the distributional inclusion hypothesis or the distributional informativeness hypothesis is used for hypernyms (Santus et al., 2016; Weeds et al., 2004; Santus et al., 2014; Geffet and Dagan, 2005) ." 
}, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-19", "text": "In this paper we present a simple and computationally light unsupervised technique for hypernym detection and directionality which is a combination of two measures, called the depth measure and the heading measure." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-20", "text": "We start off with a large corpus, but instead of finding window or dependency path based contexts, which are very expensive to compute, we argue that the internal organization of descriptive and explanatory documents naturally leads to strong signals that are indicative of hypernyms." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-21", "text": "By explanatory documents, we mean documents that have been produced with the express purpose of making the reader understand the concepts that the document is describing; text books, research papers and Wikipedia articles are prime examples of this." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-22", "text": "We verify this intuition empirically, wherein we achieve results comparable to, and in some cases better than, prior techniques in both the tasks of hypernym detection and directionality." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-23", "text": "One salient feature of our measures is that they exploit how humans organize information in explanatory documents, making them portable across all languages." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-24", "text": "This offers us an advantage over prior techniques which depend on the intricacies of syntax and semantics of the language of the documents." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-25", "text": "We will be using Wikipedia as the source of descriptive and explanatory documents in this paper." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-26", "text": "For ease of exposition, we define the concept of units." 
}, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-27", "text": "Given an article from Wikipedia, each page title, section title, sub-section title etc., irrespective of the depth of section, will be referred to as a heading." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-28", "text": "Each heading on the article usually consists of a title that describes what the text following the heading is about." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-29", "text": "Following the heading, are usually a few paragraphs that describe in more detail the heading." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-30", "text": "This may be followed by another heading, and this pattern repeats." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-31", "text": "We refer to each heading and the text following it, till (but not including) the next heading, as a unit." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-32", "text": "Thus the article is physically organized as a sequence of disjoint units." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-33", "text": "We represent a unit u as a pair (h, S), where h is the heading, and S is the sequence of sentences in the unit." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-34", "text": "----------------------------------" }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-35", "text": "**DEPTH MEASURE**" }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-36", "text": "Given a hypernym-hyponym pair, (w 1 , w 2 ), consider the organization of a Wikipedia article containing both of them." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-37", "text": "The very first unit at the top of the page is usually a broader introduction of the main topic of the article." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-38", "text": "It will use words that are more popular." 
}, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-39", "text": "However, from the next unit onward, the articles tend to be more specialized with higher detail in content." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-40", "text": "Thus the linguistic contexts used in the units occurring lower in the article tend to be more detail oriented than those occurring earlier (except for the first introductory unit)." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-41", "text": "Since more detailed context are indicative of a hyponym, they tend to occur later in the article than the hypernyms." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-42", "text": "This same reasoning applies within a unit." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-43", "text": "In this case the hypernym will tend to occur in earlier sentences in the unit than the hyponym." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-44", "text": "We generalize this to the case in which w 1 and w 2 do not co-occur in the same article, as follows." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-45", "text": "We take a large corpus of articles (e.g. all articles in Wikipedia), and check the depth at which w 1 and w 2 tend to occur (individually)." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-46", "text": "If w 2 tends to occur at larger depth than w 1 , we conclude that w 2 is a hyponym of w 1 ." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-47", "text": "Let P be the set of articles." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-48", "text": "Let a \u2208 P be an article, and let w be a given word or phrase." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-49", "text": "To formally define depth, we will assume that the article has a fixed rooted tree like topology with the units of a as its vertices, denoted by G(a)." 
}, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-50", "text": "The root will be the first unit of the article, and the depth will be the distance from the root." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-51", "text": "We experiment when G(a) is a Star-like tree topology, as indicated by the depths of its sections and sub-sections, or a Linear-like topology with a unit being a parent of the immediate next unit in the physical layout of the article." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-52", "text": "We define a function \u03bb(a, w) that captures the depth of each occurrence of w in a. Let I(a, w) denote the set of occurrences of w in a. Each occurrence consists of a pair (u i , s j ), where u i denotes a unit, and s j is the sentence in which it occurs." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-53", "text": "Multiple instances of w in the same sentence are treated as one instance." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-54", "text": "Let d(G(a)) denote the total depth of G(a)." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-55", "text": "If d(u i ) is the depth of unit u i in G(a), and |u i | is the number of sentences in it, then we define the depth \u03bb(a, w) as the sum, over all occurrences (u i , s j ) \u2208 I(a, w), of the product (d(u i )/d(G(a))) \u00b7 (j/|u i |)." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-56", "text": "The first factor gives a normalized measure (to ensure the same scale across articles of different sizes) of the depth of each occurrence of w in a. Similarly, the second factor gives a normalized depth of the instance within a unit." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-57", "text": "The larger \u03bb(a, w) is, the more likely w is to be a hypernym." 
}, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-58", "text": "To aggregate this measure across all articles:" }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-59", "text": "----------------------------------" }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-60", "text": "**HEADING MEASURE**" }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-61", "text": "For testing relatedness between words, we define the heading measure, inspired by (Do and Roth, 2012)." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-62", "text": "We search in Wikipedia for the article on the given phrase w (e.g., if w is the word jumping, then we get the article https://en.wikipedia.org/wiki/Jumping)." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-63", "text": "Since the page is about w, it is organized into sections that explain every property of w. We can thus represent w simply by the collection of headings (titles, sub-titles at every possible level)." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-64", "text": "If the page on w turns out to be a disambiguation page, then the page lists different possible meanings of w, along with the corresponding links." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-65", "text": "We follow each of the links to get possible articles on different meanings of w. In case any of the pages is again a disambiguation page, we iterate further." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-66", "text": "For each of the pages that are articles, we form a set of headings." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-67", "text": "Each set corresponds to a different meaning of w. We let S w denote the collection of the headings for different meanings of w (See Algorithm 1)." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-68", "text": "Note here, that S w is a set of sets." 
}, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-69", "text": "One advantage of this method is that we get the different meanings of the words up front, whereas, in context feature based approaches, there can be a mixing of the different contexts for polysemous words." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-70", "text": "Algorithm 1: Extract Heading Sets input : Word or phrase w output : S w , a collection of headings of pages on w Function ExtractHeadings(w) let P = {P 1 , . . . , P k } be the set of articles on w, S w \u2190 \u03c6 while P \u2260 \u03c6 do Select any P \u2208 P P \u2190 P \\ P if P is not a disambiguation page then Let C be the collection of headings on page P S w \u2190 S w \u222a {C} else Let D be the collection of articles that P points to as possible meanings of w P \u2190 P \u222a D end end return S w After computing S w 1 and S w 2 as shown in Algorithm (1), we compute the SimScore(w 1 , w 2 ) as the maximum similarity between an element of S w 1 and S w 2 ." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-71", "text": "For the similarity, we experimented with two measures, the Jaccard Similarity, and the cosine of the corresponding word2vec (Mikolov et al., 2013) vectors." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-72", "text": "For using word2vec, for each heading set C, we take the mean of the vectors for each heading." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-73", "text": "Since word2vec uses the context of words, this combines our features with the contextual features." 
}, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-74", "text": "The final measure we use for the pair of words is:" }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-75", "text": "(SimScore(w 1 , w 2 )) (2)" }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-76", "text": "**EXPERIMENTS AND RESULTS**" }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-77", "text": "----------------------------------" }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-78", "text": "**DATASETS AND CORPUS**" }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-79", "text": "We experimented with four datasets widely used in literature: BLESS (Baroni and Lenci, 2011) , EVALution (Santus et al., 2015) , Lenci/Benotto (Benotto, 2015) , and Weeds (Weeds et al., 2014) taken from the repository provided by (Shwartz et al., 2017) ." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-80", "text": "The corpus of articles we use is a complete xml dump of the English Wikipedia dated 3 Nov 2017." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-81", "text": "----------------------------------" }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-82", "text": "**TESTING DIRECTIONALITY**" }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-83", "text": "We extracted the pairs marked as hypernyms from each of the four data sets and computed the depth measure for each word in the pair." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-84", "text": "If the difference \u03bb(w 1 ) -\u03bb(w 2 ) is less than zero, we mark these pairs as False and compute precision." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-85", "text": "To identify the articles containing w, we indexed the corpus of Wikipedia using Elasticsearch (Gormley and Tong, 2015) and used the top thousand articles returned as the set for computing \u03bb(w)." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-86", "text": "We experimented with the Star and Linear topologies." 
}, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-87", "text": "It is seen that with Star topology our precision is very high on each of the datasets." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-88", "text": "The worst performance is on BLESS." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-89", "text": "However, here too, it is 91.8%." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-90", "text": "For BLESS, as seen in (Santus et al., 2014) , the performance of SLQS is 87%." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-91", "text": "Similar to SLQS, our depth measure is motivated by distributional informativeness hypothesis (Shwartz et al., 2017) ." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-92", "text": "However, without using the extensive computation of context vectors and entropy, we are able to demonstrate good performance." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-93", "text": "As can be seen by physically examining Wikipedia articles, many of them tend to have a Star topology." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-94", "text": "This also indicates that the topology used plays a major role in this feature." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-95", "text": "More sophisticated techniques will be needed to identify the topology of individual articles." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-96", "text": "----------------------------------" }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-97", "text": "**TESTING DETECTION**" }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-98", "text": "For this experiment, we aim to discriminate pairs of words connected by the hypernym relation, from words connected by other relations (meronym, coord, attribute, event, antonym, synonym)." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-99", "text": "For each pair, we evaluate our scoring function given in expression (2)." 
}, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-100", "text": "We compared our numbers with those given in (Shwartz et al., 2017) ." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-101", "text": "In that paper, multiple measures are used, and the best performing measure for every row of the table is presented." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-102", "text": "We conducted the experiments for both, Star as well as the Linear topology." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-103", "text": "However, the results for Star topology were slightly better, hence we present these in Table ( 2)." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-104", "text": "For the case of hypernym vs all other relations, except for EVALution, in all other data sets, our average precision (AP ) using both Jaccard and word2vec ( (Shwartz et al., 2017) call this as AP @all) is better than the best unsu- Table 2 : AP = average precision." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-105", "text": "The Best AP and Best Measure is taken from (Shwartz et al., 2017) ." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-106", "text": "pervised measure as reported in (Shwartz et al., 2017) ." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-107", "text": "For comparing hypernyms against individual relations, we find that with Jaccard similarity, it performs better than the best measures on meronyms in EVALution, and coordinates in Weeds." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-108", "text": "However, it performs worse for both the relations in BLESS." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-109", "text": "Our systems performs worse than the best measure whenever an Informativeness Measure (Shwartz et al., 2017) , like SLQS and its variants perform well." 
}, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-110", "text": "It performs better, or at least competitive, when the best performing measure is an Inclusion Measure or Similarity Measure (except for hypernym-vs-event in BLESS)." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-111", "text": "A possible explanation of this is that the heading features that we use do not capture how informative a phrase is." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-112", "text": "However, having common headings is an indication of shared features, implying similarity, which is also indicated by inclusion measures." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-113", "text": "However, it should be noted, that we are comparing our single system against the best performing one in each case." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-114", "text": "For finding the best measure, (Shwartz et al., 2017) finds the best by varying the measures as well as the features, whereas we have a fixed system." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-115", "text": "Our system took a day to set up (including coding effort), and a few mins to run." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-116", "text": "This is in contrast to methods mentioned above that rely on the computation of context vectors; calculation of dependency parse tree based features alone, from ukWack and Wackypedia corpus, took several days on the same machine." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-117", "text": "----------------------------------" }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-118", "text": "**ERROR ANALYSIS**" }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-119", "text": "One of the sources of error in our technique is a semantic drift due to disambiguation pages." 
}, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-120", "text": "For example, for the pair (alligator, wild), which is marked as attribute in BLESS, our system follows disambiguation links to wildlife, and then marks it as a hypernym." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-121", "text": "We find this pattern repeatedly." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-122", "text": "e.g. (scale, lizard) is a meronym in BLESS, but is classified as hypernym in the hypernym vs. meronym experiments." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-123", "text": "While (scale, snake) is a meronym in BLESS, and is marked correctly as not a hypernym." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-125", "text": "For the hypernym v/s all experiment, the proportion of false positives when at least one word needed a disambiguation page was 37% for BLESS, 29% for Weeds, and about 31% for EVALution and Lenci/Benotto." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-126", "text": "Selective link following during the disambiguation step can potentially solve this problem." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-127", "text": "----------------------------------" }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-128", "text": "**CONCLUSION**" }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-129", "text": "We showed that the organization of articles is an important feature for the task of both hypernym detection and directionality." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-130", "text": "Using just this simple and computationally cheap measure suffices to give performance that is comparable to the state of art unsupervised measures in these tasks." }, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-131", "text": "The proposed measure can also be trivially extended to any languages with a Wikipedia." 
}, { "sent_id": "aa496bd71f380e02dd392cda969999-C001-132", "text": "We believe future work in this area will benefit by using this feature in complex systems that can improve performance." } ], "y": { "@SIM@": { "gold_contexts": [ [ "aa496bd71f380e02dd392cda969999-C001-79" ], [ "aa496bd71f380e02dd392cda969999-C001-91" ], [ "aa496bd71f380e02dd392cda969999-C001-100" ], [ "aa496bd71f380e02dd392cda969999-C001-104" ], [ "aa496bd71f380e02dd392cda969999-C001-105" ], [ "aa496bd71f380e02dd392cda969999-C001-106" ], [ "aa496bd71f380e02dd392cda969999-C001-109" ] ], "cite_sentences": [ "aa496bd71f380e02dd392cda969999-C001-79", "aa496bd71f380e02dd392cda969999-C001-91", "aa496bd71f380e02dd392cda969999-C001-100", "aa496bd71f380e02dd392cda969999-C001-104", "aa496bd71f380e02dd392cda969999-C001-105", "aa496bd71f380e02dd392cda969999-C001-106", "aa496bd71f380e02dd392cda969999-C001-109" ] }, "@USE@": { "gold_contexts": [ [ "aa496bd71f380e02dd392cda969999-C001-79" ], [ "aa496bd71f380e02dd392cda969999-C001-91" ], [ "aa496bd71f380e02dd392cda969999-C001-100" ], [ "aa496bd71f380e02dd392cda969999-C001-104" ], [ "aa496bd71f380e02dd392cda969999-C001-105" ], [ "aa496bd71f380e02dd392cda969999-C001-106" ], [ "aa496bd71f380e02dd392cda969999-C001-109" ] ], "cite_sentences": [ "aa496bd71f380e02dd392cda969999-C001-79", "aa496bd71f380e02dd392cda969999-C001-91", "aa496bd71f380e02dd392cda969999-C001-100", "aa496bd71f380e02dd392cda969999-C001-104", "aa496bd71f380e02dd392cda969999-C001-105", "aa496bd71f380e02dd392cda969999-C001-106", "aa496bd71f380e02dd392cda969999-C001-109" ] }, "@DIF@": { "gold_contexts": [ [ "aa496bd71f380e02dd392cda969999-C001-114" ] ], "cite_sentences": [ "aa496bd71f380e02dd392cda969999-C001-114" ] }, "@EXT@": { "gold_contexts": [ [ "aa496bd71f380e02dd392cda969999-C001-114" ] ], "cite_sentences": [ "aa496bd71f380e02dd392cda969999-C001-114" ] } } }, "ABC_642aa9fe999d0b2b3793cb1603c04c_24": { "x": [ { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-30", "text": "Finally, CTC models each 
P (\u03c0 | X) non-autoregressively, as a sequence of conditionally-independent outputs:" }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-52", "text": "Operations per layer" }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-4", "text": "We propose SAN-CTC, a deep, fully self-attentional network for CTC, and show it is tractable and competitive for end-toend speech recognition." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-26", "text": "In ASR, X is the sequence of acoustic frames, L is the set of graphemes/phonemes/wordpieces, and y is the corresponding ground-truth transcription over L." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-27", "text": "For CTC, one assumes U \u2264 T and defines an intermediate al-" }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-53", "text": "----------------------------------" }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-54", "text": "**SEQUENTIAL OPERATIONS**" }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-2", "text": "The success of self-attention in NLP has led to recent applications in end-to-end encoder-decoder architectures for speech recognition." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-3", "text": "Separately, connectionist temporal classification (CTC) has matured as an alignment-free, non-autoregressive approach to sequence transduction, either by itself or in various multitask and decoding frameworks." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-28", "text": "In this way, paths are analogous to framewise alignments in the HMM-NN framework." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-5", "text": "SAN-CTC trains quickly and outperforms existing CTC models and most encoder-decoder models, with character error rates (CERs) of 4.7% in 1 day on WSJ eval92 and 2.8% in 1 week on LibriSpeech test-clean, with a fixed architecture and one GPU." 
}, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-6", "text": "Similar improvements hold for WERs after LM decoding." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-7", "text": "We motivate the architecture for speech, evaluate position and downsampling approaches, and explore how label alphabets (character, phoneme, subword) affect attention heads and performance." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-8", "text": "----------------------------------" }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-9", "text": "****" }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-10", "text": "later works proposed partially-or purely-convolutional CTC models [8] [9] [10] [11] and convolution-heavy encoder-decoder models [16] for ASR." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-11", "text": "However, convolutional models must be significantly deeper to retrieve the same temporal receptive field [23] ." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-12", "text": "Recently, the mechanism of self-attention [22, 24] was proposed, which uses the whole sequence at once to model feature interactions that are arbitrarily distant in time." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-13", "text": "Its use in both encoder-decoder and feedforward contexts has led to faster training and state-of-the-art results in translation (via the Transformer [22] ), sentiment analysis [25] , and other tasks." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-14", "text": "These successes have motivated preliminary work in self-attention for ASR." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-15", "text": "Time-restricted self-attention was used as a drop-in replacement for individual layers in the state-of-theart lattice-free MMI model [26] , an HMM-NN system." 
}, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-16", "text": "Hybrid self-attention/LSTM encoders were studied in the context of listenattend-spell (LAS) [27] , and the Transformer was directly adapted to speech in [19, 28, 29] ; both are encoder-decoder systems." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-17", "text": "In this work, we propose and evaluate fully self-attentional networks for CTC (SAN-CTC)." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-18", "text": "We are motivated by practicality: selfattention could be used as a drop-in replacement in existing CTClike systems, where only attention has been evaluated in the past [30, 31] ; unlike encoder-decoder systems, SAN-CTC is able to predict tokens in parallel at inference time; an analysis of SAN-CTC is useful for future state-of-the-art ASR systems, which may equip self-attentive encoders with auxiliary CTC losses [17, 20] ." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-19", "text": "Unlike past works, we do not require convolutional frontends [19] or interleaved recurrences [27] to train self-attention for ASR." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-20", "text": "In Section 2, we motivate the model and relevant design choices (position, downsampling) for ASR." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-21", "text": "In Section 3, we validate SAN-CTC on the Wall Street Journal and LibriSpeech datasets by outperforming existing CTC models and most encoder-decoder models in character error rates (CERs), with fewer parameters or less training time." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-22", "text": "Finally, we train our models with different label alphabets (character, phoneme, subword), use WFST decoding to give word error rates (WERs), and examine the learned attention heads for insights." 
}, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-23", "text": "----------------------------------" }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-24", "text": "**MODEL ARCHITECTURES FOR CTC AND ASR**" }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-25", "text": "Consider an input sequence of T feature vectors, viewed as a matrix X \u2208 R T \u00d7d fr . Let L denote the (finite) label alphabet, and denote the output sequence as y = (y1, . . . , yU ) \u2208 L U ." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-29", "text": "CTC models the distribution of sequences by marginalizing over all paths corresponding to an output:" }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-31", "text": "This model assumption means each P (\u03c0t, t | X) could be computed in parallel, after which one can do prediction via beam search, or training with gradient descent using the objective LCTC(X, y) = \u2212 log P (y | X); the order-monotonicity of B ensures LCTC can be efficiently evaluated with dynamic programming [1, 4] ." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-32", "text": "----------------------------------" }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-33", "text": "**RECURRENT AND CONVOLUTIONAL MODELS**" }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-34", "text": "In practice, one models P (\u03c0, t | X) with a neural network." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-35", "text": "As inspired by HMMs, the model simplification of conditional independence can be tempered by multiple layers of (recurrent) bidirectional long short-term memory units (BLSTMs) [1] [2] [3] [4] ." 
}, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-36", "text": "However, these are computationally expensive (Table 1) , leading to simplifications like gated recurrent units (GRUs) [8, 32] ; furthermore, the success of the ReLU(x) = max(0, x) nonlinearity in preventing vanishing gradients enabled the use of vanilla bidirectional recurrent deep neural networks (BRDNNs) [5, 6, 33] to further reduce operations per layer." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-37", "text": "Convolutions over time and/or frequency were first used as initial layers to recurrent neural models, beginning with HMM-NNs [34] and later with CTC, where they are viewed as promoting invariance to temporal and spectral translation in ASR [8] , or image translation in handwriting recognition [35] ; they also serve as a form of dimensionality reduction (Section 2.4)." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-38", "text": "However, these networks were still bottlenecked by the sequentiality of operations at the recurrent layers, leading [8] to propose row convolutions for unidirectional RNNs, which had finite lookaheads to enable online processing while having some future context." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-39", "text": "This led to convolution-only CTC models for long-range temporal dependencies [9] [10] [11] ." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-40", "text": "However, these models have to be very deep (e.g., 17-19 convolutional layers on LibriSpeech [23] ) to cover the same context (Table 1) ." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-41", "text": "While in theory, a relatively local context could suffices for ASR, this is complicated by alphabets L which violate the conditional independence assumption of CTC (e.g., English characters [36] )." 
}, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-42", "text": "Wide contexts also enable incorporation of noise/speaker contexts, as [27] suggest regarding the broad-context attention heads in the first layer of their self-attentional LAS model." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-43", "text": "----------------------------------" }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-44", "text": "**MOTIVATING THE SELF-ATTENTION LAYER**" }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-45", "text": "We now replace recurrent and convolutional layers for CTC with self-attention [24] ." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-46", "text": "Our proposed framework ( Figure 1a ) is built around self-attention layers, as used in the Transformer encoder [22] , previous explorations of self-attention in ASR [19, 27] , and defined in Section 2.3." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-47", "text": "The other stages are downsampling, which reduces input length T via methods like those in Section 2.4; embedding, which learns a dh-dim. embedding that also describes token position (Section 2.5); and projection, where each final representation is mapped framewise to logits over the intermediate alphabet L ." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-48", "text": "The first implements self-attention, where the success of attention in CTC and encoder-decoder models [14, 31] is parallelized by using each position's representation to attend to all others, giving a contextualized representation for that position." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-49", "text": "Hence, the full receptive field is immediately available at the cost of O(T 2 ) inner products (Table 1) , enabling richer representations in fewer layers." 
}, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-50", "text": "----------------------------------" }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-51", "text": "**MODEL**" }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-55", "text": "Maximum path length Table 1 : Operation complexity of each layer type, based on [22] ." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-56", "text": "T is input length, d is no. of hidden units, and k is filter/context width." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-57", "text": "We also see inspiration from convolutional blocks: residual connections, layer normalization, and tied dense layers with ReLU for representation learning." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-58", "text": "In particular, multi-head attention is akin to having a number of infinitely-wide filters whose weights adapt to the content (allowing fewer \"filters\" to suffice)." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-59", "text": "One can also assign interpretations; for example, [27] argue their LAS self-attention heads are differentiated phoneme detectors." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-60", "text": "Further inductive biases like filter widths and causality could be expressed through time-restricted self-attention [26] and directed self-attention [25] , respectively." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-61", "text": "----------------------------------" }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-62", "text": "**FORMULATION**" }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-63", "text": "Let H \u2208 R T \u00d7d h denote a sublayer's input." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-64", "text": "The first sublayer performs multi-head, scaled dot-product, self-attention [22] ." 
}, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-65", "text": "For each head i of nhds, we learn linear maps W" }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-66", "text": ", and values V (i) of the i-th head, which combine to give" }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-67", "text": "where \u03c3 is row-wise softmax." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-68", "text": "Heads are concatenated along the dh/nhds axis to give MltHdAtt = [HdAtt (1) , . . . , HdAtt (n hds ) ]." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-69", "text": "The second sublayer is a position-wise feed-forward network [22] FFN(H) = ReLU(HW1 + b1)W2 + b2 where parameters" }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-70", "text": "with the biases b1, b2 broadcasted over all T positions." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-71", "text": "This sublayer aggregates the multiple heads at time t into the attention layer's final output at t. All together, the layer is given by:" }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-72", "text": "----------------------------------" }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-73", "text": "**DOWNSAMPLING**" }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-74", "text": "In speech, the input length T of frames can be many times larger than the output length U , in contrast to the roughly word-to-word setting of machine translation." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-75", "text": "This is especially prohibitive for self-attention in terms of memory: recall that an attention matrix of dimension" }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-76", "text": "\u2208 R T \u00d7T is created, giving the T 2 factor in Table 1 ." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-77", "text": "A convolutional frontend is a typical downsampling strategy [8, 19] ; however, we leave integrating other layer types into SAN-CTC as future work." 
}, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-78", "text": "Instead, we consider three fixed approaches, from least-to most-preserving of the input data: subsampling, which only takes every k-th frame; pooling, which aggregates every k consecutive frames via a statistic (average, maximum); reshaping, where one concatenates k consecutive frames into one [27] ." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-79", "text": "Note that CTC will still require U \u2264 T /k, however." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-80", "text": "----------------------------------" }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-81", "text": "**POSITION**" }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-82", "text": "Self-attention is inherently content-based [22] , and so one often encodes position into the post-embedding vectors." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-83", "text": "We use standard trigonometric embeddings, where for 0 \u2264 i \u2264 demb/2, we define" }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-84", "text": "for position t. We consider three approaches: content-only [21] , which forgoes position encodings; additive [19] , which takes demb = dh and adds the encoding to the embedding; and concatenative, where one takes demb = 40 and concatenates it to the embedding." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-85", "text": "The latter was found necessary for self-attentional LAS [27] , as additive encodings did not give convergence." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-86", "text": "However, the monotonicity of CTC is a further positional inductive bias, which may enable the success of content-only and additive encodings." 
}, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-87", "text": "----------------------------------" }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-88", "text": "**EXPERIMENTS**" }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-89", "text": "We take (nlayers, dh, nheads, dff) = (10, 512, 8, 2048), giving \u223c30M parameters." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-90", "text": "This is on par with models on WSJ (10-30M) [4, 5, 9] and an order of magnitude below models on LibriSpeech (100-250M) [8, 23] ." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-91", "text": "We use MXNet [37] for modeling and Kaldi/EESEN [7, 38] for data preparation and decoding." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-92", "text": "Our self-attention code is based on GluonNLP's implementation." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-93", "text": "At train time, utterances are sorted by length: we exclude those longer than 1800 frames ( 1% of each training set)." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-94", "text": "We take a window of 25ms, a hop of 10ms, and concatenate cepstral mean-variance normalized features with temporal first-and second-order differences." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-95", "text": "1 We downsample by a factor of k = 3 (this also gave an ideal T /k \u2248 dh for our data; see Table 1 )." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-96", "text": "We perform Nesterov-accelerated gradient descent on batches of 20 utterances." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-97", "text": "As self-attention architectures can be unstable in early training, we clip gradients to a global norm of 1 and use the standard linear warmup period before inverse square decay associated with these architectures [19, 22] ." 
}, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-98", "text": "Let n denote the global step number of the batch (across epochs); the learning rate is given by" }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-99", "text": "1 Rescaling so that these differences also have var." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-100", "text": "\u2248 1 helped WSJ training." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-101", "text": "[17] 11.5 -9.0 -Enc-Dec (4-1) [17] 12.0 -8.2 -Enc-Dec+CTC (4-1) [17] 11.3 -7.4 -Enc-Dec (4-1) [39] --6.4 9.3 CTC/ASG (Gated CNN) [40] 6.9 9.5 4.9 6.6 Enc-Dec (2,1,3-1 Table 3 : CTC phoneme models with WFST decoding on WSJ." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-102", "text": "where we take \u03bb = 400 and nwarmup as a hyperparameter." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-103", "text": "However, such a decay led to early stagnation in validation accuracy, so we later divide the learning rate by 10 and run at the decayed rate for 20 epochs." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-104", "text": "We do this twice, then take the epoch with the best validation score." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-105", "text": "Xavier initialization gave validation accuracies of zero for the first few epochs, suggesting room for improvement." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-106", "text": "Like previous works on self-attention, we apply label smoothing (see Tables 2, 3, 5; we also tried model averaging to no gain)." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-107", "text": "To compute word error rates (WERs), we use the dataset's provided language model (LM) as incorporated by WFST decoding [7] to bridge the gap between CTC and encoder-decoder frameworks, allowing comparison with known benchmarks and informing systems that incorporate expert knowledge in this way (e.g., via a pronunciation lexicon)." 
}, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-108", "text": "----------------------------------" }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-109", "text": "**WALL STREET JOURNAL (WSJ)**" }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-110", "text": "We train both character-and phoneme-label systems on the 80-hour WSJ training set to validate our architectural choices." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-111", "text": "Similar to [17, 19] , we use 40-dim." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-112", "text": "mel-scale filter banks and hence 120-dim." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-113", "text": "features." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-114", "text": "We warmup for 8000 steps, use a dropout of 0.2, and switch schedules at epoch 40." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-115", "text": "For the WSJ dataset, we compare with similar MLE-trained, end-to-end, open-vocabulary systems in Table 2 ." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-116", "text": "We get an eval92 CER of 4.7%, outdoing all previous CTC-like results except 4.6% with a trainable frontend [40] ." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-117", "text": "We use the provided extended 3-gram LM to retrieve WERs." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-118", "text": "For phoneme training, our labels come from the CMU pronunciation lexicon (Table 3 )." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-119", "text": "These models train in one day (Tesla V100), comparable to the Speech Transformer [19] ; however, SAN-CTC gives further benefits at inference time as token predictions are generated in parallel." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-120", "text": "We also evaluate design choices in Table 4 ." 
}, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-121", "text": "Here, we consider the effects of downsampling and position encoding on accuracy for our fixed training regime." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-122", "text": "We see that unlike self-attentional LAS [27] , SAN-CTC works respectably even with no position en- coding; in fact, the contribution of position is relatively minor (compare with [21] , where location in an encoder-decoder system improved CER by 3% absolute)." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-123", "text": "Lossy downsampling appears to preserve performance in CER while degrading WER (as information about frame transitions is lost)." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-124", "text": "We believe these observations align with the monotonicity and independence assumptions of CTC." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-125", "text": "Inspired by [27] , we plot the standard deviation of attention weights for each head as training progresses; see Figure 2 for details." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-126", "text": "In the first layers, we similarly observe a differentiation of variances, along with wide-context heads; in later layers, unlike [27] we still see mild differentiation of variances." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-127", "text": "Inspired by [26] , we further plot the attention weights relative to the current time position (here, per head)." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-128", "text": "Character labels gave forward-and backward-attending heads (incidentally, averaging these would retrieve the bimodal distribution in [26] ) at all layers." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-129", "text": "This suggests a gradual expansion of context over depth, as is often engineered in convolutional CTC." 
}, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-130", "text": "This also suggests possibly using fewer heads, directed self-attention [25] , and restricted contexts for faster training (Table 1) ." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-131", "text": "Phoneme labels gave a sharp backward-attending head and more diffuse heads." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-132", "text": "We believe this to be a symptom of English characters being more contextdependent than phonemes (for example, emitting 'tt' requires looking ahead, as '-' must occur between two runs of 't' tokens)." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-133", "text": "----------------------------------" }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-134", "text": "**LIBRISPEECH**" }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-135", "text": "We give the first large-scale demonstration of a fully self-attentional ASR model using the LibriSpeech ASR corpus [42] , an English corpus produced from audio books giving 960 hours of training" }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-136", "text": "----------------------------------" }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-137", "text": "**MODEL**" }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-138", "text": "Tok. test-clean test-other CER WER CER WER CTC/ASG (Wav2Letter) [9] chr." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-139", "text": "6.9 7.2 --CTC (DS1-like) [33, 43] chr." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-140", "text": "-6.5 --Enc-Dec (4-4) [44] chr." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-141", "text": "6.5 -18.1 -Enc-Dec (6-1) [45] chr." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-142", "text": "4.5 -11.6 -CTC (DS2-like) [8, 32] chr." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-143", "text": "Table 5 ." 
}, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-144", "text": "At this scale, even minor label smoothing was detrimental." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-145", "text": "We run 70 epochs in slightly over a week (Tesla V100) then choose the epoch with the best validation score for testing." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-146", "text": "For comparison, the best CTC-like architecture [23] took 4-8 weeks on 4 GPUs for its results." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-147", "text": "2 The Enc-Dec+CTC model is comparable, taking almost a week on an older GPU (GTX 1080 Ti) to do its \u223c12.5 full passes over the data." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-148", "text": "3 Finally, we trained the same model with BPE subwords as CTC targets, to get more context-independent units [36] ." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-149", "text": "We did 300 merge operations (10k was unstable) and attained a CER of 7.4%." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-150", "text": "This gave a WER of 8.7% with no LM (compare with Table 5 's LMbased entries), and 5.2% with a subword WFST of the LM." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-151", "text": "We still observed attention heads in both directions in the first layer, suggesting our subwords were still more context-dependent than phonemes." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-152", "text": "----------------------------------" }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-153", "text": "**CONCLUSION**" }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-154", "text": "We introduced SAN-CTC, a novel framework which integrates a fully self-attentional network with a connectionist temporal classification loss." 
}, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-155", "text": "We addressed the challenges of adapting self-attention to CTC and to speech recognition, showing that SAN-CTC is competitive with or outperforms existing end-to-end models on WSJ and LibriSpeech." }, { "sent_id": "642aa9fe999d0b2b3793cb1603c04c-C001-156", "text": "Future avenues of work include multitasking SAN-CTC with other decoders or objectives, and streamlining network structure via directed or restricted attention." } ], "y": { "@BACK@": { "gold_contexts": [ [ "642aa9fe999d0b2b3793cb1603c04c-C001-12" ], [ "642aa9fe999d0b2b3793cb1603c04c-C001-13" ], [ "642aa9fe999d0b2b3793cb1603c04c-C001-64" ], [ "642aa9fe999d0b2b3793cb1603c04c-C001-69" ], [ "642aa9fe999d0b2b3793cb1603c04c-C001-82" ], [ "642aa9fe999d0b2b3793cb1603c04c-C001-97" ] ], "cite_sentences": [ "642aa9fe999d0b2b3793cb1603c04c-C001-12", "642aa9fe999d0b2b3793cb1603c04c-C001-13", "642aa9fe999d0b2b3793cb1603c04c-C001-64", "642aa9fe999d0b2b3793cb1603c04c-C001-69", "642aa9fe999d0b2b3793cb1603c04c-C001-82", "642aa9fe999d0b2b3793cb1603c04c-C001-97" ] }, "@SIM@": { "gold_contexts": [ [ "642aa9fe999d0b2b3793cb1603c04c-C001-46" ], [ "642aa9fe999d0b2b3793cb1603c04c-C001-55" ] ], "cite_sentences": [ "642aa9fe999d0b2b3793cb1603c04c-C001-46", "642aa9fe999d0b2b3793cb1603c04c-C001-55" ] }, "@USE@": { "gold_contexts": [ [ "642aa9fe999d0b2b3793cb1603c04c-C001-46" ], [ "642aa9fe999d0b2b3793cb1603c04c-C001-55" ] ], "cite_sentences": [ "642aa9fe999d0b2b3793cb1603c04c-C001-46", "642aa9fe999d0b2b3793cb1603c04c-C001-55" ] } } }, "ABC_2fbf5397a8219923d1d9bc0464cb59_24": { "x": [ { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-103", "text": "----------------------------------" }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-104", "text": "**EXPERIMENTATION**" }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-106", "text": "This ACE corpus contains ~3.9k pronouns in the training data and ~1.0k pronouns in the test data." 
}, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-2", "text": "This paper proposes a context-sensitive convolution tree kernel for pronoun resolution." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-3", "text": "It resolves two critical problems in previous researches in two ways." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-4", "text": "First, given a parse tree and a pair of an anaphor and an antecedent candidate, it implements a dynamic-expansion scheme to automatically d etermine a proper tree s pan for pronoun resolution by taking predicate-and antecedent competitor-related information into consideration." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-105", "text": "This paper focuses on the third-person pronoun resolution and, in all our experiments, uses the ACE 2003 corpus for evaluation." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-76", "text": "In some sense, this is a natural extension of the twin-candidate learning a pproach proposed in Yang et al (2003) , which explicitly models the competition between two antecedent candidates." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-77", "text": "3) For each node in the tree span, attaching the path from the node to the predicate terminal node if it is a predicate-headed node." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-78", "text": "As shown in Figure 1 (c), \"said\" and \"bit\" are attached." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-47", "text": "As for tree kernel-based methods, Yang et al (2006) captured syntactic structured information for pronoun resolution by using the convolution tree kernel (Collins and Duffy 2001) to measure the common sub-trees enumerated from the parse trees and achieved quite success on the ACE 2003 corpus." 
}, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-48", "text": "They also explored different tree span schemes and found that the simple-expansion scheme performed best." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-49", "text": "One problem with their method is that the sub-trees enumerated in Collins and Duffy's kernel computation are context-free, that is, they do not consider the information outside the sub-trees." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-50", "text": "As a result, their ability of exploring syntactic structured information is much limited." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-51", "text": "Another problem is that, among the three explored schemes, there exists no obvious overwhelming one, which can well cover syntactic structured information." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-52", "text": "The above discussion suggests that structured information in the parse trees may not be well utilized in the previous researches, regardless of feature-based or tree kernel-based methods." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-53", "text": "This paper follows tree kernel-based methods." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-101", "text": "In the tree kernel, a sub-tree becomes context-sensitive via the \"root node path\" moving along the sub-tree root." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-5", "text": "Second, it applies a context-sensitive convolution tree kernel, which enumerates both context-free and context-sensitive sub-trees by considering their ancestor node paths as their contexts." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-6", "text": "Evaluation on the ACE 2003 corpus shows that our dynamic-expansion tree span scheme can well cover necessary structured information in the parse tree for pronoun resolution and the context-sensitive tree kernel much outperforms previous tree kernels." 
}, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-7", "text": "----------------------------------" }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-9", "text": "It is well known that syntactic structured information plays a critical role in many critical NLP applications, such as parsing, semantic role labeling, semantic relation extraction and co-reference resolution." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-10", "text": "However, it is still an open question on what kinds of syntactic structured information are effective and how to well incorporate such structured information in these applications." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-11", "text": "Much research work has been done in this direction." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-12", "text": "Prior researches apply feature-based methods to select and define a set of flat features, which can be mined from the parse trees, to represent particular structured information in the parse tree, such as the grammatical role (e.g. subject or object), according to the particular application." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-13", "text": "Indeed, such feature-based methods have been widely applied in parsing (Collins 1999; Charniak 2001) , semantic role labeling (Pradhan et al 2005) , semantic relation extraction (Zhou et al 2005) and co-reference resolution (Lapin and Leass 1994; Aone and Bennett 1995; Mitkov 1998; Yang et al 2004; Luo and Zitouni 2005; Bergsma and Lin 2006) ." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-14", "text": "The major problem with feature-based methods on exploring structured information is that they may fail to well capture complex structured information, which is critical for further performance improvement." 
}, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-15", "text": "The current trend is to explore kernel-based methods (Haussler, 1999) which can implicitly explore features in a high dimensional space by employing a kernel to calculate the similarity between two objects directly." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-16", "text": "In particular, the kernel-based methods could be very effective at reducing the burden of feature engineering for structured objects in NLP, e.g. the parse tree structure in coreference resolution." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-17", "text": "During recent years, various tree kernels, such as the convolution tree kernel (Collins and Duffy 2001) , the shallow parse tree kernel (Zelenko et al 2003) and the dependency tree kernel (Culota and Sorensen 2004) , have been proposed in the literature." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-18", "text": "Among previous tree kernels, the convolution tree kernel represents the state-of-the-art and have been successfully applied by Collins and Duffy (2002) However, there exist two problems in Collins and Duffy's kernel." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-19", "text": "The first is that the sub-trees enumerated in the tree kernel are context-free." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-20", "text": "That is, each sub-tree enumerated in the tree kernel does not consider the context information outside the sub-tree." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-21", "text": "The second is how to decide a proper tree span in the tree kernel computation according to the particular application." 
}, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-22", "text": "To resolve above two problems, this paper proposes a new tree span scheme and applies a new tree kernel and to better capture syntactic structured information in pronoun resolution, whose task is to find the corresponding antecedent for a given pronominal anaphor in text." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-23", "text": "The rest of this paper is organized as follows." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-24", "text": "In Section 2, we review related work on exploring syntactic structured information in pronoun resolution and their comparison with our method." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-102", "text": "For more details, please refer to Zhou et al (2007) ." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-25", "text": "Section 3 first presents a dynamic-expansion tree span scheme by automatically expanding the shortest path to include necessary structured information, such as predicate-and antecedent competitorrelated i nformation." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-26", "text": "Then it presents a contextsensitive convolution tree kernel, which not only enumerates context-free sub-trees but also contextsensitive sub-trees by considering their ancestor node paths as their contexts." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-27", "text": "Section 4 shows the experimental results." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-28", "text": "Finally, we conclude our work in Section 5." 
}, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-29", "text": "----------------------------------" }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-30", "text": "**RELATED WORK**" }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-31", "text": "Related work on exploring syntactic structured information in pronoun resolution can be typically classified into three categories: parse tree-based search algorithms ( Hobbs 1978) , feature-based (Lappin and Leass 1994; Bergsma and Lin 2006) and tree kernel-based methods (Yang et al 2006) ." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-32", "text": "As a representative for parse tree-based search algorithms, Hobbs (1978) found the antecedent for a given pronoun by searching the parse trees of current text." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-33", "text": "It processes one sentence at a time from current sentence to the first sentence in text until an antecedent is found." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-34", "text": "For each sentence, it searches the corresponding parse tree in a left-toright breadth-first way." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-35", "text": "The first antecedent candidate, which satisfies hard constraints (such as gender and number agreement), would be returned as the antecedent." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-36", "text": "Since the search is completely done on the parse trees, one problem with the parse treebased search algorithms is that the performance would heavily rely on the accuracy of the parse trees." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-37", "text": "Another problem is that such algorithms are not good enough to capture necessary structured information for pronoun resolution." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-38", "text": "There is still a big performance gap even on correct parse trees." 
}, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-39", "text": "Similar to other NLP applications, featurebased methods have been widely applied in pronoun resolution to explore syntactic structured information from the parse trees." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-40", "text": "Lappin and Leass (1994) derived a set of salience measures (e.g. subject, object or accusative emphasis) with manually assigned weights from the syntactic structure output by McCord's Slot Grammar parser." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-41", "text": "The candidate with the highest salience score would be selected as the antecedent." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-42", "text": "Bergsma and Lin (2006) presented an approach to pronoun resolution based on syntactic paths." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-43", "text": "Through a simple bootstrapping procedure, highly co-reference paths can be learned reliably to handle previously challenging instances and robustly address traditional syntactic co-reference constraints." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-44", "text": "Although feature-based methods dominate on exploring syntactic structured information in the literature of pronoun resolution, there still exist two problems with them." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-45", "text": "One problem is that the structured features have to be selected and defined manually, usually by linguistic intuition." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-46", "text": "Another problem is that they may fail to effectively capture complex structured parse tree information." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-75", "text": "In this way, the competition between the considered candidate and other compatible candidates can be included in the tree span." 
}, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-54", "text": "Compared with Collins and Duffy's kernel and its application in pronoun resolution (Yang et al 2006) , the context-sensitive convolution tree kernel enumerates not only context-free sub-trees but also context-sensitive sub-trees by taking their ancestor node paths into consideration." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-55", "text": "Moreover, this paper also implements a dynamic-expansion tree span scheme by taking predicate-and antecedent competitor-related information into consideration." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-56", "text": "----------------------------------" }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-57", "text": "**CONTEXT SENSITIVE CONVOLUTION TREE KERNEL FOR PRONOUN RESOLUTION**" }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-58", "text": "In this section, we first propose an algorithm to dynamically determine a proper tree span for pronoun resolution and then present a contextsensitive convolution tree kernel to compute similarity between two tree spans." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-59", "text": "In this paper, all the texts are parsed u sing the Charniak parser (Charniak 2001 ) based on which the tree span is determined." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-60", "text": "----------------------------------" }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-61", "text": "**DYNAMIC-EXPANSION TREE SPAN SCHEME**" }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-62", "text": "Normally, parsing is done on the sentence level." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-63", "text": "To deal with the cases that an anaphor and an antecedent candidate do not occur in the same sentence, we construct a pseudo parse tree for an entire text by attaching the parse trees of all its sentences to an upper \"S \" node, similar to Yang et al (2006) ." 
}, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-64", "text": "Given the parse tree of a text, the problem is how to choose a proper tree span to well cover syntactic structured information in the tree kernel computation." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-65", "text": "Generally, the more a tree span includes, the more syntactic structured information would be provided, at the expense of more noisy information." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-66", "text": "Figure 2 shows the three tree span schemes explored in Yang et al (2006) : MinExpansion (only including the shortest path connecting the anaphor and the antecedent candidate), Simple-Expansion (containing not only all the nodes in Min-Expansion but also the first level children of these nodes) and Full-Expansion (covering the sub-tree between the anaphor and the candidate), such as the sub-trees inside the dash circles of Figures 2(a) , 2(b) and 2(c) respectively." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-67", "text": "It is found (Yang et al 2006) that the simpleexpansion tree span scheme performed best on the ACE 2003 corpus in pronoun resolution." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-68", "text": "This suggests that inclusion of more structured information in the tree span may not help in pronoun resolution." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-69", "text": "To better capture structured information in the parse tree, this paper presents a dynamic-expansion scheme by trying to include necessary structured information in a parse tree." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-70", "text": "The intuition behind our scheme is that predicate-and antecedent competitor-(all the other compatible 1 antecedent candidates between the anaphor and the considered antecedent candidate) related information plays a critical role in pronoun resolution." 
}, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-71", "text": "Given an ana-1 With matched number, person and gender agreements." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-72", "text": "phor and an antecedent candidate, e.g. \"Mary\" and \"her\" as shown in Figure 1 , this is done by: 1) Determining the min-expansion tree span via the shortest path, as shown in Figure 1(a) ." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-73", "text": "2) Attaching all the antecedent competitors along the corresponding paths to the shortest path." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-74", "text": "As shown in Figure 1(b) , \"the woman\" is attached while \"the room\" is not attached since the former is compatible with the anaphor and the latter is not compatible with the anaphor." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-79", "text": "4) Pruning those nodes (except POS nodes) with the single in-arc and the single out-arc and with its syntactic phrase type same as its child node." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-80", "text": "As shown in Figure 1(d) , the left child of the \"SBAR\" node, the \"NP\" node, is removed and the sub-tree (NP the/DT woman/NN) is a ttached to the \"SBAR\" node directly." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-81", "text": "To show the difference among min-, simple-, full-and dynamic-expansion schemes, Figure 2 compares them for three different sentences, given the anaphor \"her/herself\" and the antecedent candidate \"Mary\"." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-82", "text": "It shows that:" }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-83", "text": "\u2022 Min-, simple-and full-expansion schemes have the same tree spans (except the word nodes) for the three sentences regardless of the difference among the sentences while the d ynamicexpansion scheme can adapt to difference ones." 
}, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-84", "text": "\u2022 Normally, the min-expansion scheme is too simple to cover necessary information (e.g. \"the woman\" in the 1 st sentence is missing)." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-85", "text": "\u2022 The full-expansion scheme can cover all the information at the expense of much noise (e.g. \"the man in that room\" in the 2 nd sentence)." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-86", "text": "\u2022 The simple-expansion scheme can cover some necessary predicate-related information (e.g. \"said\" and \"bit\" in the sentences)." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-87", "text": "However, it may introduce some noise (e.g. the left child of the \"SBAR\" node, the \"NP\" node, may not be necessary in the 2 nd sentence) and ignore necessary antecedent competitor-related information (e.g. \"the woman\" in the 1 st sentence)." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-88", "text": "\u2022 The dynamic-expansion scheme normally works well." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-89", "text": "It can not only cover predicaterelated information but also structured information related with the competitors of the considered antecedent candidate." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-90", "text": "In this way, the competition between the considered antecedent candidate and other compatible candidates can be included in the dynamic-expansion scheme." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-91", "text": "----------------------------------" }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-92", "text": "**CONTEXT-SENSITIVE CONVOLUTION TREE KERNEL**" }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-93", "text": "Given any tree span scheme, e.g. the dynamicexpansion scheme in the last subsection, we now study how to measure the similarity between two tree spans using a convolution tree kernel." 
}, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-94", "text": "A convolution kernel (Haussler D., 1999) aims to capture structured information in terms of substructures." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-95", "text": "As a specialized convolution kernel, the convolution tree kernel, proposed in Collins and Duffy (2001) , counts the number of common subtrees (sub-structures) as the syntactic structure similarity between two parse trees." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-96", "text": "This convolution tree kernel has been successfully applied by Yang et al (2006) in pronoun resolution." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-97", "text": "However, there is one problem with this tree kernel: the subtrees involved in the tree kernel computation are context-free (That is, they do not consider the information outside the sub-trees.)." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-98", "text": "This is contrast to the tree kernel proposed in Culota and Sorensen (2004) which is context-sensitive, that is, it considers the path from the tree root node to the sub-tree root node." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-99", "text": "In order to integrate the advantages of both tree kernels and resolve the problem in Collins and Duffy's kernel, this paper applies the same context-sensitive convolution tree kernel, proposed by Zhou et al (2007) on relation extraction." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-100", "text": "It works by taking ancestral information (i.e. the root node path) of sub-trees into consideration: n ." 
}, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-107", "text": "Similar to Soon et al (2001) , an input raw text is first preprocessed automatically by a pipeline of NLP components, including sentence boundary detection, POS tagging, named entity recognition and phrase chunking, and then a training or test instance is formed by a pronoun and one of its antecedent candidates." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-108", "text": "During training, for each anaphor encountered, a positive instance is created by pairing the anaphor and its closest antecedent while a set of negative instances is formed by pairing the anaphor with each of the non-coreferential candidates." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-109", "text": "Based on the training instances, a binary classifier is generated using a particular learning algorithm." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-110", "text": "In this paper, we use SVMLight deleveloped by Joachims (1998) ." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-111", "text": "During resolution, an anaphor is first paired in turn with each preceding antecedent candidate to form a test instance, which is presented to a classifier." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-112", "text": "The classifier then returns a confidence value indicating the likelihood that the candidate is the antecedent." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-113", "text": "Finally, the candidate with the highest confidence value is selected as the antecedent." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-114", "text": "In this paper, the NPs occurring within the current and previous two sentences are taken as the initial antecedent candidates, and those with mismatched number, person and gender agreements are filtered out." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-115", "text": "On average, an anaphor has ~7 antecedent candidates." 
}, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-116", "text": "The performance is evaluated using F-measure instead of accuracy since evaluation is done on all the pronouns occurring in the data." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-117", "text": "In this paper, the m parameter in our contextsensitive convolution tree kernel as shown in Equation (1) indicates the maximal length of root node paths and is optimized to 3 using 5-fold cross validation on the training data." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-118", "text": "Table 1 systematically evaluates the impact of different m in our context-sensitive convolution tree kernel and compares our dynamic-expansion tree span scheme with the existing three tree span schemes, min-, simple-and full-expansions as described in Yang et al (2006) ." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-119", "text": "It also shows that that our tree kernel achieves best performance with m = 3 on the test data, which outperforms the one with m = 1 by ~2.2 in F-measure." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-120", "text": "This suggests that the parent and grandparent nodes of a sub-tree contain much information for pronoun resolution while considering more ancestral nodes doesnot further improve the performance." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-121", "text": "This may be due to that, although our experimentation on the training data indicates that more than 9 0% (on average) of subtrees has a root node path longer than 3 (since most of the subtrees are deep from the root node and more than 90% of the parsed trees are deeper than 6 levels in the ACE 2003 corpus), including a root node path longer than 3 may be vulnerable to the full parsing errors and have negative impact." 
}, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-122", "text": "It also shows that our dynamic-expansion tree span scheme outperforms min-expansion, simpleexpansion a nd full-expansion schemes by ~2.4, ~1.2 and ~2.1 in F-measure respectively." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-123", "text": "This suggests the usefulness of dynamically expanding tree spans to cover necessary structured information in pronoun resolution." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-124", "text": "In all the following experiments, we will apply our tree kernel with m=3 and the dynamic-expansion tree span scheme by default, unless specified." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-125", "text": "We also evaluate the contributions of antecedent competitor-related information, predicate-related information and pruning in our dynamic-expansion tree span scheme by excluding one of them from the dynamic-expansion scheme." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-126", "text": "Table 2 shows that 1) antecedent competitor-related information contributes much to our scheme; 2) predicate-related information contributes moderately; 3) pruning only has slight contribution." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-127", "text": "This suggests the importance of including the competition in the tree span and the effect of predicate-argument structures in pronoun resolution." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-128", "text": "This also suggests that our scheme can well make use of such predicateand antecedent competitor-related information." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-129", "text": "----------------------------------" }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-130", "text": "**CONCLUSION**" }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-131", "text": "Syntactic structured information holds great potential in many NLP applications." 
}, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-132", "text": "The purpose of this paper is to well capture syntactic structured information in pronoun resolution." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-133", "text": "In this paper, we proposes a context-sensitive convolution tree kernel to resolve two critical problems in previous researches in pronoun resolution by first automatically determining a dynamic-expansion tree span, which effectively covers structured information in the parse trees by taking predicate-and antecedent competitor-related information into consideration, and then applying a context-sensitive convolution tree kernel, which enumerates both context-free sub-trees and context-sensitive sub-trees." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-134", "text": "Evaluation on the ACE 2003 corpus shows that our dynamic-expansion tree span scheme can better capture necessary structured information than the existing tree span schemes and our tree kernel can better model structured information than the stateof-the-art Collins and Duffy's kernel." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-135", "text": "For the future work, we will focus on improving the context-sensitive convolution tree kernel by better modeling context-sensitive information and exploring new tree span schemes by better incorporating useful structured information." }, { "sent_id": "2fbf5397a8219923d1d9bc0464cb59-C001-136", "text": "In the meanwhile, a more detailed quantitative evaluation and thorough qualitative error analysis will be performed to gain more insights." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "2fbf5397a8219923d1d9bc0464cb59-C001-31" ], [ "2fbf5397a8219923d1d9bc0464cb59-C001-47" ], [ "2fbf5397a8219923d1d9bc0464cb59-C001-54" ], [ "2fbf5397a8219923d1d9bc0464cb59-C001-66" ], [ "2fbf5397a8219923d1d9bc0464cb59-C001-67" ], [ "2fbf5397a8219923d1d9bc0464cb59-C001-96" ] ], "cite_sentences": [ "2fbf5397a8219923d1d9bc0464cb59-C001-31", "2fbf5397a8219923d1d9bc0464cb59-C001-47", "2fbf5397a8219923d1d9bc0464cb59-C001-54", "2fbf5397a8219923d1d9bc0464cb59-C001-66", "2fbf5397a8219923d1d9bc0464cb59-C001-67", "2fbf5397a8219923d1d9bc0464cb59-C001-96" ] }, "@SIM@": { "gold_contexts": [ [ "2fbf5397a8219923d1d9bc0464cb59-C001-63" ], [ "2fbf5397a8219923d1d9bc0464cb59-C001-118" ] ], "cite_sentences": [ "2fbf5397a8219923d1d9bc0464cb59-C001-63", "2fbf5397a8219923d1d9bc0464cb59-C001-118" ] }, "@USE@": { "gold_contexts": [ [ "2fbf5397a8219923d1d9bc0464cb59-C001-118" ] ], "cite_sentences": [ "2fbf5397a8219923d1d9bc0464cb59-C001-118" ] } } }, "ABC_f3f61d50929f862e263e3f658852bc_24": { "x": [ { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-2", "text": "Argumentation quality is viewed differently in argumentation theory and in practical assessment approaches." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-3", "text": "This paper studies to what extent the views match empirically." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-4", "text": "We find that most observations on quality phrased spontaneously are in fact adequately represented by theory." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-5", "text": "Even more, relative comparisons of arguments in practice correlate with absolute quality ratings based on theory." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-6", "text": "Our results clarify how the two views can learn from each other." 
}, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-7", "text": "----------------------------------" }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-9", "text": "The assessment of argumentation quality is critical for any application built upon argument mining, such as debating technologies (Rinott et al., 2015) ." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-10", "text": "However, research still disagrees on whether quality should be assessed from a theoretical or from a practical viewpoint (Allwood, 2016) ." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-11", "text": "Theory states, among other things, that a cogent argument has acceptable premises that are relevant to its conclusion and sufficient to draw the conclusion (Johnson and Blair, 2006) ." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-12", "text": "Practitioners object that such quality dimensions are hard to assess for real-life arguments (Habernal and Gurevych, 2016b) ." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-13", "text": "Moreover, the normative nature of theory suggests absolute quality ratings, but in practice it seems much easier to state which argument is more convincing-a relative assessment." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-14", "text": "Consider two debate-portal arguments for \"advancing the common good is better than personal pursuit\", taken from the corpora analyzed later in this paper: Argument A \"While striving to make advancements for the common good you can change the world forever." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-15", "text": "Allot of people have succeded in doing so." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-16", "text": "Our founding fathers, Thomas Edison, George Washington, Martin Luther King jr, and many more." 
}, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-17", "text": "These people made huge advances for the common good and they are honored for it.\"" }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-18", "text": "Argument B \"I think the common good is a better endeavor, because it's better to give then to receive." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-19", "text": "It's better to give other people you're hand out in help then you holding your own hand.\"" }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-20", "text": "In the study of Habernal and Gurevych (2016b) , annotators assessed Argument A as more convincing than B. When giving reasons for their assessment, though, they saw A as more credible and well thought through; that does not seem to be too far from the theoretical notion of cogency." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-21", "text": "This paper gives empirical answers to the question of how different the theoretical and practical views of argumentation quality actually are. Section 2 briefly reviews existing theories and practical approaches." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-22", "text": "Section 3 then empirically analyzes correlations in two recent argument corpora, one annotated for 15 well-defined quality dimensions taken from theory (Wachsmuth et al., 2017a) and one with 17 reasons for quality differences phrased spontaneously in practice (Habernal and Gurevych, 2016a) ." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-23", "text": "In a crowdsourcing study, we test whether lay annotators achieve agreement on the theoretical quality dimensions (Section 4)." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-24", "text": "We find that assessments of overall argumentation quality largely match in theory and practice." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-25", "text": "Nearly all phrased reasons are adequately represented in theory." 
}, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-26", "text": "However, some theoretical quality dimensions seem hard to separate in practice." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-27", "text": "Most importantly, we provide evidence that the observed relative quality differences are reflected in absolute quality ratings." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-28", "text": "Still, our study underpins the fact that theory-based argumentation quality assessment remains complex." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-29", "text": "Our results do not generally answer the question of what view of argumentation quality is preferable, but they clarify where theory can learn from practice and vice versa." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-30", "text": "In particular, practical approaches indicate what to focus on to simplify theory, whereas theory seems beneficial to guide quality assessment in practice." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-31", "text": "Table 1: The 15 quality dimensions of Wachsmuth et al. (2017a)." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-32", "text": "----------------------------------" }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-33", "text": "**THEORY VERSUS PRACTICE**" }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-34", "text": "This section outlines major theories and practical approaches to argumentation quality assessment, including those we compare in the present paper." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-35", "text": "----------------------------------" }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-36", "text": "**THEORETICAL VIEWS OF QUALITY ASSESSMENT**" }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-37", "text": "Argumentation theory discusses logical, rhetorical, and dialectical quality." 
}, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-38", "text": "As few real-life arguments are logically sound, requiring true premises that deductively entail a conclusion, cogency (as defined in Section 1) is largely seen as the main logical quality (Johnson and Blair, 2006; Damer, 2009; Govier, 2010) ." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-39", "text": "Toulmin (1958) models the general structure of logical arguments, and Walton et al. (2008) analyze schemes of fallacies and strong arguments." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-40", "text": "A fallacy is a kind of error that undermines reasoning (Tindale, 2007) ." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-41", "text": "Strength may mean cogency but also rhetorical effectiveness (Perelman and Olbrechts-Tyteca, 1969) ." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-42", "text": "Rhetoric has been studied since Aristotle (2007) who developed the notion of the means of persuasion (logos, ethos, pathos) and their linguistic delivery in terms of arrangement and style." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-43", "text": "Dialectical quality dimensions resemble those of cogency, but arguments are judged specifically by their reasonableness for achieving agreement (van Eemeren and Grootendorst, 2004) ." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-44", "text": "Wachsmuth et al. (2017a) point out that dialectical builds on rhetorical, and rhetorical builds on logical quality." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-45", "text": "They derive a unifying taxonomy from the major theories, decomposing quality hierarchically into cogency, effectiveness, reasonableness, and subdimensions." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-46", "text": "5-1 B is attacking / abusive." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-47", "text": "5-2 B has language/grammar issues, or uses humour or sarcasm." 
}, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-48", "text": "5-3 B is unclear / hard to follow." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-49", "text": "6-1 B has no credible evidence / no facts." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-50", "text": "6-2 B has less or insufficient reasoning." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-51", "text": "6-3 B uses irrelevant reasons." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-52", "text": "7-1 B is only an opinion / a rant." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-53", "text": "7-2 B is non-sense / confusing." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-54", "text": "7-3 B does not address the topic." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-55", "text": "7-4 B is generally weak / vague." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-56", "text": "----------------------------------" }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-57", "text": "**POSITIVE PROPERTIES OF ARGUMENT A**" }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-58", "text": "8-1 A has more details/facts/examples, has better reasoning / is deeper." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-59", "text": "8-4 A is objective / discusses other views." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-60", "text": "8-5 A is more credible / confident." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-61", "text": "9-1 A is clear / crisp / well-written." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-62", "text": "9-2 A sticks to the topic." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-63", "text": "9-3 A makes you think. 9-4 A is well thought through / smart." 
}, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-64", "text": "----------------------------------" }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-65", "text": "**OVERALL**" }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-66", "text": "Conv A is more convincing than B. Table 2 : The 17+1 practical reason labels given in the corpus of Habernal and Gurevych (2016a) ." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-67", "text": "covered." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-68", "text": "In Section 3, we use their absolute quality ratings from 1 (low) to 3 (high) annotated by three experts for each dimension of 304 arguments taken from the UKPConvArg1 corpus detailed below." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-69", "text": "----------------------------------" }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-70", "text": "**PRACTICAL VIEWS OF QUALITY ASSESSMENT**" }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-71", "text": "There is an application area where absolute quality ratings of argumentative text are common practice: essay scoring (Beigman Klebanov et al., 2016) ." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-72", "text": "Persing and Ng (2015) annotated the argumentative strength of essays composing multiple arguments with notable agreement." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-73", "text": "For single arguments, however, all existing approaches that we are aware of assess quality in relative terms, e.g., Cabrio and Villata (2012) find accepted arguments based on attack relations, Wei et al. (2016) rank arguments by their persuasiveness, and Wachsmuth et al. (2017b) rank them by their relevance." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-74", "text": "Boudry et al. (2015) argue that normative concepts such as fallacies rarely apply to real-life arguments and that they are too sophisticated for operationalization." 
}, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-75", "text": "Based on the idea that relative assessment is easier, Habernal and Gurevych (2016b) crowdsourced the UKPConvArg1 corpus." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-76", "text": "Argument pairs (A, B) from a debate portal were classified as to which argument is more convincing." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-77", "text": "Without giving any guidelines, the authors also asked for reasons as to why A is more convincing than B. In a follow-up study (Habernal and Gurevych, 2016a), these reasons were used to derive a hierarchical annotation scheme." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-78", "text": "9111 argument pairs were then labeled with one or more of the 17 reason labels in Table 2 by crowd workers (UKPConvArg2). Table 3: Kendall's \u03c4 of each quality dimension of Wachsmuth et al. (2017a) given for each of the 17+1 reason labels of Habernal and Gurevych (2016a)." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-79", "text": "Bold/gray: Highest/lowest value in each column." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-80", "text": "Bottom row: The number of labels for each dimension." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-82", "text": "These pairs represent the practical view in our experiments." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-83", "text": "----------------------------------" }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-84", "text": "**MATCHING THEORY AND PRACTICE**" }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-85", "text": "We now report on experiments that we performed to examine to what extent the theory and practice of argumentation quality assessment match." 
}, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-87", "text": "----------------------------------" }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-88", "text": "**CORPUS-BASED COMPARISON OF THE VIEWS**" }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-89", "text": "Several dimensions and reasons in Tables 1 and 2 seem to refer to the same or opposite property, e.g., clarity and 5-3 (unclear)." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-90", "text": "This raises the question of how absolute ratings of arguments based on theory relate to relative comparisons of argument pairs in practice." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-91", "text": "We informally state three hypotheses:" }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-92", "text": "Hypothesis 1 The reasons for quality differences in practice are adequately represented in theory." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-93", "text": "Hypothesis 2 The perception of overall argumentation quality is the same in theory and practice." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-94", "text": "Hypothesis 3 Relative quality differences are reflected by differences in absolute quality ratings." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-95", "text": "As both corpora described in Section 2 are based on the UKPConvArg1 corpus and thus share many arguments, we can test the hypotheses empirically." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-96", "text": "----------------------------------" }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-97", "text": "**CORRELATIONS OF DIMENSIONS AND REASONS**" }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-98", "text": "For Hypotheses 1 and 2, we consider all 736 pairs of arguments from Habernal and Gurevych (2016a) where both arguments have been annotated by Wachsmuth et al. (2017a)." 
}, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-99", "text": "For each pair (A, B) with A being more convincing than B, we check whether the ratings of A and B for each dimension (averaged over all annotators) show a concordant difference (i.e., a higher rating for A), a discordant difference (a lower rating), or a tie. Source code and annotated data: http://www.arguana.com" }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-100", "text": "This way, we can correlate each dimension with all reason labels in Table 2, including Conv." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-101", "text": "In particular, we compute Kendall's \u03c4 based on all argument pairs given for each label." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-102", "text": "Table 3 presents all \u03c4-values." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-103", "text": "The phrasing of a reason can be assumed to indicate a clear quality difference; this is underlined by the generally high correlations." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-104", "text": "Analyzing the single values, we find much evidence for Hypothesis 1: Most notably, label 5-1 perfectly correlates with global acceptability, fitting the intuition that abuse is not acceptable." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-105", "text": "The high \u03c4 values of 8-5 (more credible) for local acceptability (.73) and of 9-4 (well thought through) for cogency (.75) confirm the match assumed in Section 1. Also, the values of 5-3 (unclear) for clarity (.91) and of 7-2 (non-sense) for reasonableness (.94), as well as the weaker correlation of 8-4 (objective) for emotional appeal (.35), make sense." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-106", "text": "Only the comparably low \u03c4 values of 6-1 (no credible evidence) for local acceptability (.49) and credibility (.52) seem really unexpected." 
}, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-107", "text": "Besides, the descriptions of 6-2 and 6-3 sound like local relevance and sufficiency, but they correlate more with global relevance and sufficiency, respectively. Table 4: The mean rating for each quality dimension of those arguments from Wachsmuth et al. (2017a) given for each reason label (Habernal and Gurevych, 2016a)." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-108", "text": "The bottom rows show that the minimum and maximum mean ratings are consistently higher for the positive properties than for the negative properties." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-110", "text": "Similarly, 7-3 (off-topic) correlates strongly with local and global relevance (both .95)." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-111", "text": "So, these dimensions seem hard to separate." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-112", "text": "In line with Hypothesis 2, the highest correlation of Conv is indeed given for overall quality (.64)." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-113", "text": "Thus, argumentation quality assessment seems to match in theory and practice to a broad extent." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-114", "text": "----------------------------------" }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-115", "text": "**ABSOLUTE RATINGS FOR RELATIVE DIFFERENCES**" }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-116", "text": "The correlations found imply that the relative quality differences captured are reflected in absolute differences." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-117", "text": "For explicitness, we computed the mean rating for each quality dimension of all arguments from Wachsmuth et al. (2017a) with a particular reason label from Habernal and Gurevych (2016a)." 
}, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-118", "text": "As each reason refers to one argument of a pair, this reveals whether the labels, although meant to signal relative differences, indicate absolute ratings." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-119", "text": "Table 4 compares the mean ratings of \"negative labels\" (5-1 to 7-4) and \"positive\" ones (8-1 to 9-4)." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-120", "text": "For all dimensions, the maximum and minimum values are higher for the positive than for the negative labels, providing clear support for Hypothesis 3." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-121", "text": "Also, Table 4 reveals which reasons predict absolute differences most: The mean ratings of 7-3 (off-topic) are very low, indicating a strong negative impact, while 6-3 (irrelevant reasons) still shows rather high values. (While the differences seem not very large, this is expected, as in many argument pairs from Habernal and Gurevych (2016a) both arguments are strong or weak, respectively.)" }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-123", "text": "Vice versa, especially 8-5 (more credible) and 9-4 (well thought through) are reflected in high ratings, whereas 9-2 (sticks to topic) does not have much positive impact." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-124", "text": "----------------------------------" }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-125", "text": "**ANNOTATING THEORY IN PRACTICE**" }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-126", "text": "The results of Section 3 suggest that theory may guide the assessment of argumentation quality in practice." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-127", "text": "In this section, we evaluate the reliability of a crowd-based annotation process." 
}, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-128", "text": "----------------------------------" }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-129", "text": "**ABSOLUTE QUALITY RATINGS BY THE CROWD**" }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-130", "text": "We emulated the expert annotation process carried out by Wachsmuth et al. (2017a) on CrowdFlower in order to evaluate whether lay annotators suffice for a theory-based quality assessment." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-131", "text": "In particular, we asked the crowd to rate the same 304 arguments as the experts for all 15 given quality dimensions with scores from 1 to 3 (or choose \"cannot judge\")." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-132", "text": "Each argument was rated 10 times at an offered price of $0.10 for each rating (102 annotators in total)." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-133", "text": "Given the crowd ratings, we then performed two comparisons as detailed in the following." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-134", "text": "----------------------------------" }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-135", "text": "**AGREEMENT OF THE CROWD WITH EXPERTS**" }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-136", "text": "First, we checked to what extent lay annotators and experts agree in terms of Krippendorff's \u03b1." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-137", "text": "On one hand, we compared the mean of all 10 crowd ratings to the mean of the three ratings of Wachsmuth et al. (2017a) ." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-138", "text": "On the other hand, we estimated a reliable rating from the crowd ratings using MACE (Hovy et al., 2013) and compared it to the experts." 
}, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-139", "text": "Table 5: Mean and MACE Krippendorff's \u03b1 agreement between (a) the crowd and the experts, (b) two independent crowd groups and the experts, (c) group 1 and the experts, and (d) group 2 and the experts." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-140", "text": "Table 5(a) presents the results." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-141", "text": "For the mean ratings, most \u03b1-values are above .40." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-142", "text": "This is similar to the study of Wachsmuth et al. (2017b), where a range of .27 to .51 is reported, meaning that lay annotators achieve agreement similar to that of experts." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-143", "text": "Considering the minimum of mean and MACE, we observe the highest agreement for overall quality (.43), analogous to Wachsmuth et al. (2017b)." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-144", "text": "Also, global sufficiency has the lowest agreement in both cases." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-145", "text": "In contrast, the experts hardly said \"cannot judge\" at all, whereas the crowd chose it for about 4% of all ratings (most often for global sufficiency), possibly due to a lack of training." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-146", "text": "Still, we conclude that the crowd generally handles the theory-based quality assessment almost as well as the experts." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-147", "text": "However, the complexity of the assessment is underlined by the generally limited agreement, suggesting that either simplification or stricter guidelines are needed." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-148", "text": "Regarding simplification, the most common practical reasons of Habernal and Gurevych (2016a) imply what to focus on." 
}, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-149", "text": "----------------------------------" }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-150", "text": "**RELIABILITY OF THE CROWD ANNOTATIONS**" }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-151", "text": "In the second comparison, we checked how many crowd annotators are needed to compete with the experts." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-152", "text": "For this purpose, we split the crowd ratings into two independent groups of 5 and treated the mean and MACE of each group as a single rating." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-153", "text": "We then computed the agreement of both groups and each group individually against the experts." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-154", "text": "The \u03b1-values for both groups are listed in Table 5(b)." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-155", "text": "On average, they are a bit lower than those of all 10 crowd annotators in Table 5(a)." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-156", "text": "Hence, five crowd ratings per argument do not seem enough for sufficient reliability." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-157", "text": "Tables 5(c) and 5(d) reveal the reason behind this: the results of crowd group 1 and group 2 differ clearly." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-158", "text": "At the same time, the values in Table 5(c) are close to those in Table 5(a), so 10 ratings might suffice." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-159", "text": "Moreover, we see that the most stable \u03b1-values in Table 5 are given for overall quality, indicating that the theory indeed helps assess quality reliably." 
}, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-160", "text": "----------------------------------" }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-161", "text": "**CONCLUSION**" }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-162", "text": "This paper demonstrates that the theory and practice of assessing argumentation quality can learn from each other." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-163", "text": "Most reasons for quality differences phrased in practice seem well-represented in the normative view of theory and correlate with absolute quality ratings." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-164", "text": "In our study, lay annotators had similar agreement on the ratings as experts." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-165", "text": "Considering that some common reasons are quite vague, the diverse and comprehensive theoretical view of argumentation quality may guide a more insightful assessment." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-166", "text": "On the other hand, some quality dimensions remain hard to assess and/or to separate in practice, resulting in limited agreement." }, { "sent_id": "f3f61d50929f862e263e3f658852bc-C001-167", "text": "Simplifying theory along the most important reasons will thus improve its practical applicability." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "f3f61d50929f862e263e3f658852bc-C001-22" ], [ "f3f61d50929f862e263e3f658852bc-C001-66" ], [ "f3f61d50929f862e263e3f658852bc-C001-77" ], [ "f3f61d50929f862e263e3f658852bc-C001-78" ], [ "f3f61d50929f862e263e3f658852bc-C001-107" ], [ "f3f61d50929f862e263e3f658852bc-C001-121" ], [ "f3f61d50929f862e263e3f658852bc-C001-148" ] ], "cite_sentences": [ "f3f61d50929f862e263e3f658852bc-C001-22", "f3f61d50929f862e263e3f658852bc-C001-66", "f3f61d50929f862e263e3f658852bc-C001-77", "f3f61d50929f862e263e3f658852bc-C001-78", "f3f61d50929f862e263e3f658852bc-C001-107", "f3f61d50929f862e263e3f658852bc-C001-121", "f3f61d50929f862e263e3f658852bc-C001-148" ] }, "@SIM@": { "gold_contexts": [ [ "f3f61d50929f862e263e3f658852bc-C001-98" ], [ "f3f61d50929f862e263e3f658852bc-C001-117" ] ], "cite_sentences": [ "f3f61d50929f862e263e3f658852bc-C001-98", "f3f61d50929f862e263e3f658852bc-C001-117" ] }, "@USE@": { "gold_contexts": [ [ "f3f61d50929f862e263e3f658852bc-C001-98" ], [ "f3f61d50929f862e263e3f658852bc-C001-117" ] ], "cite_sentences": [ "f3f61d50929f862e263e3f658852bc-C001-98", "f3f61d50929f862e263e3f658852bc-C001-117" ] } } }, "ABC_f797e7439bd78af2ef86271214f991_24": { "x": [ { "sent_id": "f797e7439bd78af2ef86271214f991-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-2", "text": "We present a toolkit for coreference resolution error analysis." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-3", "text": "It implements a recently proposed analysis framework and contains rich components for analyzing and visualizing recall and precision errors." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-4", "text": "----------------------------------" }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-5", "text": "**INTRODUCTION**" }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-6", "text": "Coreference resolution is the task of determining which mentions in a text refer to the same entity." 
}, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-7", "text": "Both the natural language processing engineer (who needs a coreference resolution system for the problem at hand) and the coreference resolution researcher need tools to facilitate and support system development, comparison and analysis." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-8", "text": "In Martschat and Strube (2014), we propose a framework for error analysis for coreference resolution." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-9", "text": "In this paper, we present cort, an implementation of this framework, and show how it can be useful for engineers and researchers." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-10", "text": "cort is released as open source and is available for download." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-11", "text": "----------------------------------" }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-12", "text": "**ERROR ANALYSIS FRAMEWORK**" }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-13", "text": "Due to the set-based nature of coreference resolution, it is not clear how to extract errors when an entity is not correctly identified." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-14", "text": "The idea underlying the analysis framework of Martschat and Strube (2014) is to employ spanning trees in a graph-based entity representation." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-15", "text": "cort is short for coreference resolution toolkit; it is available for download at http://smartschat.de/software." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-16", "text": "Figure 1 summarizes their approach." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-17", "text": "They represent reference and system entities as complete one-directional graphs (Figures 1a and 1b)." 
}, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-18", "text": "To extract recall errors, they compute a spanning tree of the reference entity ( Figure 1a )." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-19", "text": "All edges in the spanning tree which do not appear in the system output are extracted as recall errors (Figure 1c )." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-20", "text": "For extracting precision errors, the roles of reference and system entities are switched." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-21", "text": "The analysis algorithm is parametrized only by the spanning tree algorithm employed: different algorithms lead to different notions of errors." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-22", "text": "In Martschat and Strube (2014) , we propose an algorithm based on Ariel's accessibility theory (Ariel, 1990) for reference entities." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-23", "text": "For system entity spanning trees, we take each output pair as an edge." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-24", "text": "----------------------------------" }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-25", "text": "**ARCHITECTURE**" }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-26", "text": "Our toolkit is available as a Python library." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-27", "text": "It consists of three modules: the core module provides mention extraction and preprocessing, the coreference module implements features for and approaches to coreference resolution, and the analysis module implements the error analysis framework described above and ships with other analysis and visualization utilities." 
}, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-28", "text": "----------------------------------" }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-29", "text": "**CORE**" }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-30", "text": "All input and output must conform to the format of the CoNLL-2012 shared task on coreference resolution (Pradhan et al., 2012) ." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-31", "text": "We employ a rule-based mention extractor, which also computes a rich set of mention attributes, including tokens, head, part-ofspeech tags, named entity tags, gender, number, se-6 (a) Figure 1 : (a) a reference entity r (represented as a complete one-directional graph) and its spanning tree T r , (b) a set S of three system entities, (c) the errors: all edges in T r which are not in S." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-32", "text": "mantic class, grammatical function, coarse mention type and fine-grained mention type." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-33", "text": "----------------------------------" }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-34", "text": "**COREFERENCE**" }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-35", "text": "cort ships with two coreference resolution approaches." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-36", "text": "First, it includes multigraph, which is a deterministic approach using a few strong features (Martschat and Strube, 2014) ." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-37", "text": "Second, it includes a mention-pair approach (Soon et al., 2001 ) with a large feature set, trained via a perceptron on the CoNLL'12 English training data." 
}, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-38", "text": "In Table 1 , we compare both approaches with StanfordSieve ( Lee et al., 2013) , the winner of the CoNLL-2011 shared task, and BerkeleyCoref (Durrett and Klein, 2013), a state-of-the-art structured machine learning approach." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-39", "text": "The systems are evaluated via the CoNLL scorer (Pradhan et al., 2014) ." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-40", "text": "Both implemented approaches achieve competitive performance." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-41", "text": "Due to their modular implementation, both approaches are easily extensible with new features and with training or inference schemes." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-42", "text": "They therefore can serve as a good starting point for system development and analysis." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-43", "text": "----------------------------------" }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-44", "text": "**ANALYSIS**" }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-45", "text": "The core of this module is the ErrorAnalysis class, which extracts and manages errors extracted from one or more systems." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-46", "text": "The user can define own spanning tree algorithms to extract errors." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-47", "text": "We already implemented the algorithms discussed in Martschat and Strube (2014) ." 
}, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-48", "text": "Furthermore, this module provides functionality to" }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-49", "text": "\u2022 categorize and filter sets of errors," }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-50", "text": "\u2022 visualize these sets," }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-51", "text": "\u2022 compare errors of different systems, and \u2022 display errors in document context." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-52", "text": "Which of these features is interesting to the user depends on the use case." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-53", "text": "In the following, we will describe the popular use case of improving a coreference system in detail." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-54", "text": "Our system also supports other use cases, such as the cross-system analysis described in Martschat and Strube (2014) ." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-55", "text": "----------------------------------" }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-56", "text": "**USE CASE: IMPROVING A COREFERENCE RESOLUTION SYSTEM**" }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-57", "text": "A natural language processing engineer might be interested in improving the performance of a coreference resolution system since it is necessary for another task." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-58", "text": "The needs may differ depending on the task at hand: for some tasks proper name coreference may be of utmost importance, while other tasks need mostly pronoun coreference." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-59", "text": "Through model and feature redesign, the engineer wants to improve the system with respect to a certain error class." 
}, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-60", "text": "The user will start with a baseline system, which can be one of the implemented systems in our toolkit or a third-party system." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-61", "text": "We now describe how cort facilitates improving the system." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-62", "text": "----------------------------------" }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-63", "text": "**INITIAL ANALYSIS**" }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-64", "text": "To get an initial assessment, the user can extract all errors made by the system and then make use of the plotting component to compare these errors with the maximum possible number of errors 3 ." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-65", "text": "For a meaningful analysis, we have to find a suitable error categorization." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-66", "text": "Suppose the user is interested in improving recall for non-pronominal coreference." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-67", "text": "Hence, following Martschat and Strube (2014) , we categorize all errors by coarse mention type of anaphor and antecedent (proper name, noun, pronoun, demonstrative pronoun or verb) 4 ." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-68", "text": "----------------------------------" }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-69", "text": "**BOTH NAME**" }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-70", "text": "Noun-Name Name-Noun Both noun Figure 2 compares the recall error numbers of the multigraph system with the maximum possible number of errors for the categories of interest to the engineer." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-71", "text": "The plot was created by our toolkit via matplotlib (Hunter, 2007) ." 
}, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-72", "text": "We can see that the model performs very well for proper name pairs." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-73", "text": "Relative to the maximum number of errors, there are much more recall errors in the other categories." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-74", "text": "A plot for precision errors shows that the system makes only relatively few precision errors, especially for proper name pairs." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-75", "text": "After studying these plots the user decides to improve recall for pairs where the anaphor is a noun and the antecedent is a name." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-76", "text": "This is a frequent category which is handled poorly by the system." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-77", "text": "3 For recall, the maximum number of errors are the errors made by a system which puts each mention in its own cluster." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-78", "text": "For precision, we take all pairwise decisions of a model." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-79", "text": "4 For a pair of mentions constituting an error, we call the mention appearing later in the text the anaphor, the other mention antecedent." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-80", "text": "----------------------------------" }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-81", "text": "**DETAILED ANALYSIS**" }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-82", "text": "In order to determine how to improve the system, the user needs to perform a detailed analysis of the noun-name errors." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-83", "text": "Our toolkit provides several methods to do so." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-84", "text": "First of all, one can browse through the pairwise error representations." 
}, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-85", "text": "This suggests further subcategorization (for example by the presence of token overlap)." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-86", "text": "An iteration of this process leads to a fine-grained categorization of errors." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-87", "text": "However, this approach does not provide any document context, which is necessary to understand some errors." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-88", "text": "Maybe context features can help in resolving the error, or the error results from multiple competing antecedents." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-89", "text": "We therefore include a visualization component, which also allows to study the interplay between recall and precision." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-90", "text": "Figure 3 shows a screenshot of this visualization component, which runs in a web browser using JavaScript." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-91", "text": "The header displays the identifier of the document in focus." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-92", "text": "The left bar contains the navigation panel, which includes" }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-93", "text": "\u2022 a list of all documents in the corpus," }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-94", "text": "\u2022 a summary of all errors for the document in focus, and \u2022 lists of reference and system entities for the document in focus." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-95", "text": "To the right of the navigation panel, the document in focus is shown." 
}, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-96", "text": "When the user picks a reference or system entity from the corresponding list, cort displays all recall and precision errors for all mentions which are contained in the entity (as labeled red arrows between mentions)." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-97", "text": "Alternatively, the user can choose an error category from the error summary." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-98", "text": "In that case, all errors of that category are displayed." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-99", "text": "We use color to distinguish between entities: mentions in different entities have different background colors." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-100", "text": "Additionally mentions in reference entities have a yellow border, while mentions in system entities have a blue border (for example, the mention the U.S.-backed rebels is in a reference entity and in a system entity)." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-101", "text": "The user can choose to color the background of mentions either depending on their gold entity or depending on their system entity." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-102", "text": "These visualization capabilities allow for a detailed analysis of errors and enable the user to take all document information into account." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-103", "text": "The result of the analysis is that almost all errors are missed is-a relations, such as in the examples in Figure 3 (the U.S.-backed rebels and the Contras)." 
}, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-104", "text": "----------------------------------" }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-105", "text": "**ERROR COMPARISON**" }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-106", "text": "Motivated by this, the user can add features to the system, for example incorporating world knowledge from Wikipedia." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-107", "text": "The output of the changed model can be loaded into the ErrorAnalysis object which already manages the errors made by the baseline system." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-108", "text": "To compare the errors, cort implements various functions." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-109", "text": "In particular, the user can access common errors and errors which are unique to one or more systems." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-110", "text": "This allows for an assessment of the qualitative usefulness of the new feature." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-111", "text": "Depending on the results of the comparison, the user can decide between discarding, retaining and improving the feature." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-112", "text": "----------------------------------" }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-113", "text": "**RELATED WORK**" }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-114", "text": "Compared to our original implementation of the error analysis framework (Martschat and Strube, 2014) , we made the analysis interface more userfriendly and provide more analysis functionality." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-115", "text": "Furthermore, while our original implementation did not include any visualization capabilities, we now allow for both data visualization and document visualization." 
}, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-116", "text": "We are aware of two other software packages for coreference resolution error analysis." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-117", "text": "Our toolkit complements these." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-118", "text": "Kummerfeld and Klein (2013) present a toolkit which extracts errors from transformation of reference to system entities." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-119", "text": "Hence, their definition of what an error is not rooted in a pairwise representation, and is therefore conceptually different from our definition." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-120", "text": "They do not provide any visualization components." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-121", "text": "ICE (G\u00e4rtner et al., 2014 ) is a toolkit for coreference visualization and corpus analysis." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-122", "text": "In particular, the toolkit visualizes recall and precision errors in a tree-based visualization of coreference clusters." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-123", "text": "Compared to ICE, we provide more extensive functionality for error analysis and can accommodate for different notions of errors." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-124", "text": "----------------------------------" }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-125", "text": "**CONCLUSIONS AND FUTURE WORK**" }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-126", "text": "We presented cort, a toolkit for coreference resolution error analysis." }, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-127", "text": "It implements a graph-based analysis framework, ships with two strong coreference resolution baselines and provides extensive functionality for analysis and visualization." 
}, { "sent_id": "f797e7439bd78af2ef86271214f991-C001-128", "text": "We are currently investigating whether the analysis framework can also be applied to structurally related tasks, such as cross-document coreference resolution (Singh et al., 2011) or entity linking." } ], "y": { "@MOT@": { "gold_contexts": [ [ "f797e7439bd78af2ef86271214f991-C001-8" ] ], "cite_sentences": [ "f797e7439bd78af2ef86271214f991-C001-8" ] }, "@BACK@": { "gold_contexts": [ [ "f797e7439bd78af2ef86271214f991-C001-14" ], [ "f797e7439bd78af2ef86271214f991-C001-36" ] ], "cite_sentences": [ "f797e7439bd78af2ef86271214f991-C001-14", "f797e7439bd78af2ef86271214f991-C001-36" ] }, "@SIM@": { "gold_contexts": [ [ "f797e7439bd78af2ef86271214f991-C001-22" ], [ "f797e7439bd78af2ef86271214f991-C001-47" ], [ "f797e7439bd78af2ef86271214f991-C001-54" ], [ "f797e7439bd78af2ef86271214f991-C001-67" ] ], "cite_sentences": [ "f797e7439bd78af2ef86271214f991-C001-22", "f797e7439bd78af2ef86271214f991-C001-47", "f797e7439bd78af2ef86271214f991-C001-54", "f797e7439bd78af2ef86271214f991-C001-67" ] }, "@USE@": { "gold_contexts": [ [ "f797e7439bd78af2ef86271214f991-C001-22" ], [ "f797e7439bd78af2ef86271214f991-C001-47" ], [ "f797e7439bd78af2ef86271214f991-C001-54" ], [ "f797e7439bd78af2ef86271214f991-C001-67" ] ], "cite_sentences": [ "f797e7439bd78af2ef86271214f991-C001-22", "f797e7439bd78af2ef86271214f991-C001-47", "f797e7439bd78af2ef86271214f991-C001-54", "f797e7439bd78af2ef86271214f991-C001-67" ] }, "@DIF@": { "gold_contexts": [ [ "f797e7439bd78af2ef86271214f991-C001-114" ] ], "cite_sentences": [ "f797e7439bd78af2ef86271214f991-C001-114" ] } } }, "ABC_497b717bc4ff6b9e2160ee823f6b42_24": { "x": [ { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-2", "text": "Abstract." 
}, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-3", "text": "The most studied and most successful language models were developed and evaluated mainly for English and other close European languages, such as French, German, etc." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-4", "text": "It is important to study applicability of these models to other languages." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-5", "text": "The use of vector space models for Russian was recently studied for multiple corpora, such as Wikipedia, RuWac, lib.ru." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-6", "text": "These models were evaluated against word semantic similarity task." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-7", "text": "For our knowledge Twitter was not considered as a corpus for this task, with this work we fill the gap." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-8", "text": "Results for vectors trained on Twitter corpus are comparable in accuracy with other single-corpus trained models, although the best performance is currently achieved by combination of multiple corpora." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-9", "text": "----------------------------------" }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-11", "text": "Word semantic similarity task is an important part of contemporary NLP." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-12", "text": "It can be applied in many areas, like word sense disambiguation, information retrieval, information extraction and others." 
}, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-13", "text": "It has long history of improvements, starting with simple models, like bag-of-words (often weighted by TF-IDF score), continuing with more complex ones, like LSA [6] , which attempts to find \"latent\" meanings of words and phrases, and even more abstract models, like NNLM [3] ." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-14", "text": "Latest results are based on neural network experience, but are far more simple: various versions of Word2Vec, Skip-gram and CBOW models [8] , which currently show the State-of-the-Art results and have proven success with morphologically complex languages like Russian [1] , [10] ." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-15", "text": "These are corpus-based approaches, where one computes or trains the model from a large corpus." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-16", "text": "They usually consider some word context, like in bag-ofwords, where model is simple count of how often can some word be seen in context of a word being described." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-17", "text": "This model anyhow does not use semantic information." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-18", "text": "A step in semantic direction was made by LSA, which requires SVD transformation of co-occurrence matrix and produces vectors with latent, unknown structure." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-19", "text": "However, this method is rather computationally expensive, and can rarely be applied to large corpora." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-20", "text": "Distributed language model was proposed, where every word is initially assigned a random fixed-size vector." 
}, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-21", "text": "During training semantically close vectors (or close by means of context) become closer to each other; as matter of closeness the cosine similarity is usually chosen." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-22", "text": "This trick enables usage of neural networks and other machine learning techniques, which easily deal with fixed-size real vectors, instead of large and sparse cooccurrence vectors." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-23", "text": "It is worth mentioning non-corpus based techniques to estimate word semantic similarity." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-24", "text": "They usually make use of knowledge databases, like WordNet, Wikipedia, Wiktionary and others [14] , [2] ." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-25", "text": "It was shown that Wikipedia data can be used in graph-based methods [13] , and also in corpus-based ones." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-26", "text": "In this paper we are not focusing on non-corpus based techniques." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-27", "text": "In this paper we concentrate on usage of Russian Twitter stream as training corpus for Word2Vec model in semantic similarity task, and show results comparable with current (trained on a single corpus)." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-28", "text": "This research is part of molva.spb.ru project, which is a trending topic detection engine for Russian Twitter." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-29", "text": "Thus the choice of language of interest is narrowed down to only Russian, although there is strong intuition that one can achieve similar results with other languages." 
}, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-30", "text": "----------------------------------" }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-31", "text": "**GOALS OF THIS PAPER**" }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-32", "text": "The primary goal of this paper is to prove usefulness of Russian Twitter stream as word semantic similarity resource." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-33", "text": "Twitter is a popular social network 1 , or also called \"microblogging service\", which enables users to share and interact with short messages instantly and publicly (although private accounts are also available)." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-34", "text": "Users all over the world generate hundreds of millions of tweets per day, all over the world, in many languages, generating enormous amount of verbal data." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-35", "text": "Traditional corpora for the word semantic similarity task are News, Wikipedia, electronic libraries and others (e.g. RUSSE workshop [10] )." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-36", "text": "It was shown that type of corpus used for training affects the resulting accuracy." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-37", "text": "Twitter is not usually considered, and intuition behind this is that probably every-day language is too simple and too occasional to produce good results." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-38", "text": "On the other hand, the real-time nature of this user message stream seems promising, as it may reveal what certain word means in this given moment." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-39", "text": "The other counter-argument against Twitter-as-Dataset is the policy of Twitter, which disallows publication of any dump of Twitter messages larger than 50K 2 ." 
}, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-40", "text": "However, this policy permits publication of Twitter IDs in any amount." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-41", "text": "Thus the secondary goal of this paper is to describe how to create this kind of dataset from scratch." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-42", "text": "We provide the sample of Twitter messages used, as well as set of Twitter IDs used during experiments 3 ." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-43", "text": "----------------------------------" }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-44", "text": "**PREVIOUS WORK**" }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-45", "text": "Semantic similarity and relatedness task received significant amount of attention." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-46", "text": "Several \"Gold standard\" datasets were produced to facilitate the evaluation of algorithms and models, including WordSim353 [4] , RG-65 [11] for English language and others." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-47", "text": "These datasets consist of several pairs of words, where each pair receives a score from human annotators." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-48", "text": "The score represents the similarity between two words, from 0% (not similar) to 100% (identical meaning, words are synonyms)." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-49", "text": "Usually these scores are filled out by a number of human annotators, for instance, 13 in case of WordSim353 4 ." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-50", "text": "The inter-annotator agreement is measured and the mean value is put into dataset." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-51", "text": "Until recent days there was no such dataset for Russian language." 
}, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-52", "text": "To mitigate this the \"RUSSE: The First Workshop on Russian Semantic Similarity\" [10] was conducted, producing RUSSE Human-Judgements evaluation dataset (we will refer to it as HJ-dataset)." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-53", "text": "RUSSE dataset was constructed the following way." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-54", "text": "Firstly, datasets WordSim353, MC [9] and RG-65 were combined and translated." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-55", "text": "Then human judgements were obtained by crowdsourcing (using custom implementation)." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-56", "text": "Final size of the dataset is 333 word pairs, it is available online 5 ." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-57", "text": "The RUSSE contest was followed by paper from its organizers [10] and several participators [1] , [5] , thus filling the gap in word semantic similarity task for Russian language." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-58", "text": "In this paper we evaluate a Word2Vec model, trained on Russian Twitter corpus against RUSSE HJ-dataset, and show results comparable to top results of other RUSSE competitors." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-59", "text": "----------------------------------" }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-60", "text": "**DATA PROCESSING**" }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-61", "text": "In this section we describe how we receive data from Twitter, how we filter it and how we feed it to the model." 
}, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-62", "text": "----------------------------------" }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-63", "text": "**ACQUIRING DATA**" }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-64", "text": "Twitter provides well-documented API 6 , which allows to request any information about Tweets, users and their profiles, with respect to rate limits." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-65", "text": "There is special type of API, called Streaming API 7 , that provides a real-time stream of tweets." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-66", "text": "The key difference with regular API is that connection is kept alive as long as possible, and Tweets are sent in real-time to the client." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-67", "text": "There are three endpoints of Streaming API of our interest: \"sample\", \"filter\" and \"firehose\"." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-68", "text": "The first one provides a sample (random subset) of the full Tweet stream." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-69", "text": "The second one allows to receive Tweets matching some search criteria: matching to one or more search keywords, produced by subset of users, or coming from certain geo location." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-70", "text": "The last one provides the full set of Tweets, although it is not available by default." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-71", "text": "In order to get Twitter \"firehose\" one can contact Twitter, or buy this stream from third-parties." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-72", "text": "In our case the simplest approach would be to use \"sample\" endpoint, but it provides Tweets in all possible languages from all over the World, while we are concerned only about one language (Russian)." 
}, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-73", "text": "In order to use this endpoint we implemented filtering based on language." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-74", "text": "The filter is simple: if a Tweet does not contain a substring of 3 or more Cyrillic symbols, it is considered non-Russian." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-75", "text": "Although this approach keeps Tweets in Mongolian, Ukrainian and other languages written in the Cyrillic alphabet, the total number of false positives is negligible." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-76", "text": "To demonstrate this we conducted a simple experiment: in a random sample of 200 Tweets, only 5 were in a language other than Russian." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-77", "text": "In order not to rely on Twitter's language detection, we chose to proceed with this method of language-based filtering." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-78", "text": "However, the number of Tweets received through the \"sample\" endpoint was not satisfactory." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-79", "text": "This is probably because the \"sample\" endpoint always streams the same content to all its clients, and only a small portion of it is in Russian." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-80", "text": "In order to force mining of Tweets in Russian, we chose the \"filter\" endpoint, which requires a search query." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-81", "text": "We constructed a heuristic query containing auxiliary words specific to the Russian language: conjunctions, pronouns and prepositions."
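The Cyrillic-substring heuristic described above can be sketched as a short Python filter. This is a minimal sketch: the 3-character threshold comes from the text, while the function name and regular expression are illustrative.

```python
import re

# Heuristic from the text: a Tweet counts as Russian if it contains
# a run of 3 or more Cyrillic characters (Unicode block U+0400-U+04FF).
CYRILLIC_RUN = re.compile(r'[\u0400-\u04FF]{3,}')

def is_russian(tweet_text):
    return CYRILLIC_RUN.search(tweet_text) is not None
```

Note that isolated short Cyrillic words (under 3 characters) do not trigger the filter, which matches the "substring of 3 or more symbols" formulation.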
}, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-82", "text": "The full list is as follows: \u044f, \u0443, \u043a, \u0432, \u043f\u043e, \u043d\u0430, \u0442\u044b, \u043c\u044b, \u0434\u043e, \u043d\u0430, \u043e\u043d\u0430, \u043e\u043d, \u0438, \u0434\u0430." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-83", "text": "We evaluated our search query on data obtained from the \"sample\" endpoint, and 95% of Tweets matched it." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-84", "text": "We consider this coverage reasonable, and from now on we use the \"filter\" endpoint with the query and language filtering described above." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-85", "text": "In this paper we work with a Tweet stream acquired from 2015/07/21 till 2015/08/04." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-86", "text": "We refer to parts of the dataset by the day of acquisition: 2015/07/21, etc." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-87", "text": "The Tweet IDs used in our experiments are listed online 8 ." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-88", "text": "----------------------------------" }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-89", "text": "**CORPUS PREPROCESSING**" }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-90", "text": "Corpus-based algorithms like BoW and Word2Vec require text to be tokenized, and sometimes stemmed as well." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-91", "text": "It is common practice to filter out stop words (e.g. [5] ), but in this work we do not do so." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-92", "text": "The morphological richness of the Russian language forces us to use stemming, even though models like Word2Vec do not require it." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-93", "text": "In our experiments the stemmed version performs significantly better than the unstemmed one, so we only report results for the stemmed version."
}, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-94", "text": "For stemming we use Yandex Tomita Parser 9 , an extractor of simple facts from text in the Russian language." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-95", "text": "It is based on the Yandex stemmer mystem [12] ." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-96", "text": "It requires a set of grammar rules and facts (i.e. simple data structures) to be extracted." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-97", "text": "In this paper we use it with one simple rule: S -> Word interp (SimpleFact.Word);" }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-99", "text": "This rule tells the parser to interpret each word it sees and return it immediately." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-100", "text": "We use Tomita Parser because we find it more user-friendly than mystem." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-101", "text": "Tomita Parser performs the following operations: sentence splitting, tokenization, stemming, removal of punctuation marks, and lowercasing." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-102", "text": "Each Tweet is transformed into one or several lines of tab-separated sequences of words (if there are several sentences or lines in a Tweet)." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-103", "text": "Twitter-specific \"hashtags\" and \"user mentions\" are treated by Tomita Parser as normal words, except that the \"@\" and \"#\" symbols are stripped off." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-104", "text": "The HJ-dataset contains non-lemmatized words." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-105", "text": "This is understandable, because the dataset's task was oriented toward human annotators."
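A rough, dependency-free approximation of this preprocessing pipeline (sentence splitting, tokenization, "@"/"#" stripping, lowercasing) can be sketched in Python. The actual stemming step, which Tomita Parser delegates to mystem, is deliberately omitted here, and the function names are illustrative.

```python
import re

TOKEN = re.compile(r'\w+')

def preprocess(tweet):
    # One list of tokens per sentence, mirroring the tab-separated lines
    # produced by Tomita Parser. "@" and "#" are non-word characters, so
    # the \w+ tokenizer keeps mentions and hashtags as plain words.
    # NOTE: no stemming here; the real pipeline lemmatizes via mystem.
    sentences = re.split(r'[.!?\n]+', tweet)
    return [[token.lower() for token in TOKEN.findall(s)]
            for s in sentences if TOKEN.search(s)]
```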
}, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-106", "text": "In several cases the plural form is used (consider this pair: \"\u0442\u0438\u0433\u0440, \u043a\u043e\u0448\u0430\u0447\u044c\u0438\")." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-107", "text": "In order to compute similarity for those pairs, and bearing in mind that the Twitter data is pre-stemmed, we have to stem the HJ-dataset with the same parser as well." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-108", "text": "----------------------------------" }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-109", "text": "**TRAINING THE MODEL**" }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-110", "text": "We use Word2Vec to obtain word vectors from the Twitter corpus." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-111", "text": "In this model word vectors are initialized randomly for each unique word and are fed to a shallow neural network." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-112", "text": "The authors of Word2Vec propose two different models: Skip-gram and CBOW." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-113", "text": "The first is trained to predict the context of a word given the word vector itself." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-114", "text": "The second is roughly the opposite: it is trained to predict the word given its context." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-115", "text": "In our study CBOW always performs worse than Skip-gram, hence we describe only results for the Skip-gram model." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-116", "text": "These models have several training parameters, namely: vector size, vocabulary size (or minimal word frequency), context size, downsampling threshold, and number of training epochs." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-117", "text": "We choose the vector size based on the size of the corpus."
}, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-118", "text": "We define \"context size\" as the number of tokens before or after the current token." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-119", "text": "In all experiments presented in this paper we use one training epoch." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-120", "text": "There are several implementations of Word2Vec available, including the original C utility 10 and the Python library gensim 11 ." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-121", "text": "We use the latter as we find it more convenient." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-122", "text": "The output of Tomita Parser is fed directly line-by-line to the model." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-123", "text": "The model produces a set of vectors, which we then query to obtain similarities between word vectors, in order to compute the correlation with the HJ-dataset." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-124", "text": "To compute correlation we use the Spearman coefficient, since it was used as the accuracy measure in RUSSE [10] ." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-125", "text": "----------------------------------" }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-126", "text": "**PROPERTIES OF THE DATA**" }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-127", "text": "In order to train the Word2Vec model for the semantic similarity task we collected Twitter messages for 15 full days, from 2015/07/21 till 2015/08/04." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-128", "text": "Each day contains on average 3M Tweets and 40M tokens." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-129", "text": "All measured properties are shown in Table 1 ."
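The evaluation step just described, correlating model similarities with HJ-dataset judgements via the Spearman coefficient, can be illustrated with a small self-contained sketch. The word-pair scores below are toy values, not actual HJ-dataset entries; in practice the model similarities would come from the trained gensim model.

```python
def _ranks(xs):
    # average ranks; tied values share their mean rank
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(xs, ys):
    # Spearman rho = Pearson correlation of the two rank vectors
    rx, ry = _ranks(xs), _ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) *
           sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

# toy data: model cosine similarities vs. human judgements, one entry per word pair
model_sims = [0.9, 0.5, 0.1]
human_judgements = [10, 6, 1]
rho = spearman(model_sims, human_judgements)
```

Since Spearman correlation depends only on ranks, a monotone relation between model scores and human judgements is enough for a high coefficient.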
}, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-130", "text": "Our first observation was that given one day of Twitter data we cannot estimate vectors for all of the words in the HJ-dataset, because some appear too rarely." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-131", "text": "We fixed the frequency threshold at 40 occurrences per day and counted how many words from the HJ-dataset fall below this threshold." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-132", "text": "Our second observation was that the words \"missing\" from the HJ-dataset differ from day to day." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-133", "text": "This is not very surprising given the dynamic nature of Twitter data." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-134", "text": "Thus the estimation of word vectors differs from day to day." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-135", "text": "In order to estimate the fluctuation of this semantic measure, we train Word2Vec on each day in our corpus." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-136", "text": "We fix the vector size to 300, the context size to 5, the downsampling threshold to 1e-3, and the minimal word occurrence threshold (also called min-freq) to 40." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-137", "text": "The results are shown in Table 2 ." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-138", "text": "The mean Spearman correlation between daily Twitter splits and the HJ-dataset is 0.36 with a std.dev." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-139", "text": "of 0.04." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-140", "text": "Word pairs containing missing (infrequent) words were excluded." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-141", "text": "We also create a superset of all infrequent words, i.e. words having frequency below 40 in at least one daily split."
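The frequency-threshold bookkeeping described here, finding evaluation words that fall below min-freq in a daily split, amounts to a simple counting step. A minimal sketch with toy token lists follows; the names and data are illustrative.

```python
from collections import Counter

def missing_words(day_tokens, eval_vocab, threshold=40):
    # Words from the evaluation vocabulary whose frequency in this daily
    # split is below the Word2Vec min-freq threshold; their vectors would
    # not be trained, so pairs containing them must be excluded.
    freq = Counter(day_tokens)
    return {w for w in eval_vocab if freq[w] < threshold}
```

Running this per daily split and taking the union of the results yields the superset of infrequent words used to exclude word pairs.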
}, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-142", "text": "This set contains 50 words and produces 76 \"infrequent word\" pairs (out of 333)." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-143", "text": "Every pair containing at least one infrequent word was excluded." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-144", "text": "On that subset of the HJ-dataset the mean correlation is 0.29 with a std.dev." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-145", "text": "of 0.03." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-146", "text": "We consider this a reasonably stable result." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-147", "text": "Table 1 summarizes the corpus: Number of Tweets: 50M; Number of tokens: 580M; Number of sentences: 59M; Size of dictionary (full): 13M; Size of dictionary (min-freq=40): 236K; Number of tokens (min-freq=40): 555M; Average sentence length: 9.80." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-148", "text": "----------------------------------" }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-149", "text": "**DETERMINING OPTIMAL CORPUS SIZE**" }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-150", "text": "The Word2Vec model was designed to be trained on large corpora." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-151", "text": "There are reports of training it in reasonable time on a corpus of 1 billion tokens [8] ." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-152", "text": "It has been noted that the accuracy of the estimated word vectors improves with corpus size." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-153", "text": "Twitter provides an enormous amount of data, making it a perfect fit for Word2Vec." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-154", "text": "We fix the model parameters to the following values: vector size of 300, min-freq of 40, context size of 5 and downsampling of 1e-3." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-155", "text": "We train our model on corpora of increasing size; the results are shown in Table 3 ."
}, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-156", "text": "In this experiment the best result belongs to the 7-day corpus, with a correlation of 0.56 with the HJ-dataset, while the 15-day corpus scores slightly lower at 0.55." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-157", "text": "This can be explained as follows: in order to achieve better results with Word2Vec one should increase both corpus size and vector size." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-158", "text": "Indeed, training a model with a vector size of 600 on the full Twitter corpus (15 days) shows the best result of 0.59." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-159", "text": "It is also worth noting that the number of \"missing\" pairs is negligible in the 7-day corpus: the only missing word (and pair) is \"\u0439\u0435\u043b\u044c\", Yale, the name of a university in the USA." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-160", "text": "There are no \"missing\" words in the 15-day corpus." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-161", "text": "Training the model on the 15-day corpus took 8 hours on our machine with 2 cores and 4Gb of RAM." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-162", "text": "We have an intuition that further improvements are possible with a larger corpus." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-163", "text": "Comparing our results to those reported by RUSSE participants, we conclude that our best result of 0.598 is comparable to the others, as it would (virtually) place within the top 10 results." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-164", "text": "However, the best RUSSE submission outperforms our Twitter corpus result by a large margin of 0.16." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-165", "text": "Bearing in mind that the best results in RUSSE combine several corpora, it is more reasonable to compare our Twitter results to other single-corpus results."
}, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-166", "text": "For convenience we replicate the results for these corpora, originally presented in [10] , alongside our result in Table 5 ." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-167", "text": "Given these considerations we conclude that with a Twitter corpus of 500M tokens one can achieve reasonably good results on the task of word semantic similarity." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-168", "text": "----------------------------------" }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-169", "text": "**DETERMINING OPTIMAL CONTEXT SIZE**" }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-170", "text": "The authors of Word2Vec [8] and Paragraph Vector [7] advise determining the optimal context size for each distinct training session." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-171", "text": "In our Twitter corpus the average sentence length is 9.8 with a std.dev." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-172", "text": "of 4.9; this means that most sentences have fewer than 20 tokens." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-173", "text": "This is one of the peculiarities of Twitter data: Tweets are limited in size, hence sentences are short." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-174", "text": "A context size greater than 10 is therefore redundant." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-175", "text": "We choose to train word vectors with 3 different context size values: 2, 5 and 10." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-176", "text": "We perform two rounds of training: the first with Twitter data from 07/21 till 07/25, and the second from 07/26 till 07/30." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-177", "text": "The results of measuring correlation with the HJ-dataset are shown in Table 4 ."
}, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-178", "text": "According to these results a context size of 5 is slightly better than the others, but the difference is negligible compared to the fluctuation between several training attempts." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-179", "text": "----------------------------------" }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-180", "text": "**SOME FURTHER OBSERVATIONS**" }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-181", "text": "The vector space model can give more information than just a measure of the semantic distance between two given words." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-182", "text": "It has been shown that word vectors can capture multiple degrees of similarity." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-183", "text": "In particular, it is possible to model simple relations, like \"country\"-\"capital city\", gender, and syntactic relations, with algebraic operations over these vectors." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-184", "text": "The authors of [8] propose to assess the quality of these vectors on the task of exact prediction of such word relations." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-185", "text": "However, word vectors learned from Twitter seem to perform poorly on this task." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-186", "text": "We do not conduct systematic research on this subject here because it is outside the scope of the current paper, though it is an important direction for future studies." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-187", "text": "Twitter posts often contain three special types of words: user mentions, hashtags and hyperlinks." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-188", "text": "It can be beneficial to filter them out (treating them as stop words)."
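The algebraic relation modeling mentioned above (e.g. a "gender" direction) can be demonstrated with the standard vector-offset method over a toy embedding. The two-dimensional vectors below are hand-crafted stand-ins for illustration, not vectors learned from Twitter.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def analogy(emb, a, b, c):
    # answer "a is to b as c is to ?" via the offset vector b - a + c,
    # excluding the three query words from the candidate set
    target = [x - y + z for x, y, z in zip(emb[b], emb[a], emb[c])]
    candidates = (w for w in emb if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(emb[w], target))

# toy 2-d embedding: axis 0 encodes gender, axis 1 encodes royalty
emb = {"man": [1, 0], "woman": [-1, 0],
       "king": [1, 1], "queen": [-1, 1], "apple": [0, -1]}
```

With learned vectors the same query is usually issued via the embedding library's analogy method; the toy version above only shows the arithmetic.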
}, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-189", "text": "In the results presented in this paper, and in particular in Tables 3 and 4 , we do not filter such words." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-190", "text": "It is debatable whether one should remove hashtags from the analysis, since they are often valid words or multiwords." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-191", "text": "It can also be beneficial, in some tasks, to estimate word vectors for usernames." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-192", "text": "Hyperlinks in Twitter posts are mandatorily shortened 12 ." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-193", "text": "It is not clear how to treat them: filter them out completely, keep them, or even un-shorten them." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-194", "text": "However, some of our experiments show that filtering out \"user mentions\" and hyperlinks can improve accuracy on the word semantic relatedness task by 3-5%." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-195", "text": "----------------------------------" }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-196", "text": "**CONCLUSION**" }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-197", "text": "In this paper we investigated the use of a Twitter corpus for training a Word2Vec model for the task of word semantic similarity." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-198", "text": "We described a method to obtain a stream of Twitter messages and prepare it for training." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-199", "text": "We use the HJ-dataset, which was created for the RUSSE contest [10] , to measure the correlation between the similarity of word vectors and human judgements of word pair similarity."
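The filtering of user mentions and hyperlinks discussed above can be sketched with two regular expressions. The patterns are illustrative; note that shortened t.co links match the generic URL pattern, and hashtag words are kept with only the "#" sign dropped, as in the Tomita-based pipeline.

```python
import re

MENTION = re.compile(r'@\w+')
URL = re.compile(r'https?://\S+')

def strip_twitter_noise(text, keep_hashtag_words=True):
    # Remove user mentions and (shortened) hyperlinks; optionally keep
    # hashtag words, dropping only the "#" sign.
    text = MENTION.sub(' ', text)
    text = URL.sub(' ', text)
    if keep_hashtag_words:
        text = text.replace('#', '')
    return ' '.join(text.split())
```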
}, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-200", "text": "We achieve results comparable with results obtained while training Word2Vec on traditional corpora, like Wikipedia and Web pages [1] , [5] ." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-201", "text": "This is especially important because Twitter data is highly dynamic, and traditional sources are mostly static (rarely change over time)." }, { "sent_id": "497b717bc4ff6b9e2160ee823f6b42-C001-202", "text": "Thus verbal data acquired from Twitter may be used to estimate word vectors for neologisms, or determine other changes in word semantic, as soon as they appear in human speech." } ], "y": { "@BACK@": { "gold_contexts": [ [ "497b717bc4ff6b9e2160ee823f6b42-C001-14" ], [ "497b717bc4ff6b9e2160ee823f6b42-C001-35" ], [ "497b717bc4ff6b9e2160ee823f6b42-C001-52" ], [ "497b717bc4ff6b9e2160ee823f6b42-C001-57" ] ], "cite_sentences": [ "497b717bc4ff6b9e2160ee823f6b42-C001-14", "497b717bc4ff6b9e2160ee823f6b42-C001-35", "497b717bc4ff6b9e2160ee823f6b42-C001-52", "497b717bc4ff6b9e2160ee823f6b42-C001-57" ] }, "@SIM@": { "gold_contexts": [ [ "497b717bc4ff6b9e2160ee823f6b42-C001-124" ], [ "497b717bc4ff6b9e2160ee823f6b42-C001-166" ] ], "cite_sentences": [ "497b717bc4ff6b9e2160ee823f6b42-C001-124", "497b717bc4ff6b9e2160ee823f6b42-C001-166" ] }, "@USE@": { "gold_contexts": [ [ "497b717bc4ff6b9e2160ee823f6b42-C001-124" ], [ "497b717bc4ff6b9e2160ee823f6b42-C001-166" ], [ "497b717bc4ff6b9e2160ee823f6b42-C001-199" ] ], "cite_sentences": [ "497b717bc4ff6b9e2160ee823f6b42-C001-124", "497b717bc4ff6b9e2160ee823f6b42-C001-166", "497b717bc4ff6b9e2160ee823f6b42-C001-199" ] } } }, "ABC_242aacd35fb92d836fea9eb33961a3_24": { "x": [ { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-2", "text": "FAIRSEQ is an open-source sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language 
modeling, and other text generation tasks." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-3", "text": "The toolkit is based on PyTorch and supports distributed training across multiple GPUs and machines." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-4", "text": "We also support fast mixed-precision training and inference on modern GPUs." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-5", "text": "A demo video can be found here: https://www.youtube." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-6", "text": "com/watch?v=OtgDdWtHvto." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-7", "text": "----------------------------------" }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-9", "text": "Neural sequence-to-sequence models have been successful on a variety of text generation tasks, including machine translation, abstractive document summarization, and language modeling." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-10", "text": "Accordingly, both researchers and industry professionals can benefit from a fast and easily extensible sequence modeling toolkit." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-11", "text": "There are several toolkits with similar basic functionality, but they differ in focus area and intended audiences." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-12", "text": "For example, OpenNMT (Klein et al., 2017 ) is a community-built toolkit written in multiple languages with an emphasis on extensibility." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-13", "text": "MarianNMT (Junczys-Dowmunt et al., 2018) focuses on performance and the backend is written in C++ for fast automatic differentiation." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-14", "text": "OpenSeq2Seq provides reference implementations for fast distributed and mixed precision training." 
}, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-15", "text": "Tensor2tensor and Sockeye (Hieber et al., 2018) focus on production-readiness." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-16", "text": "In this paper, we present FAIRSEQ, a sequence modeling toolkit written in PyTorch that is fast, extensible, and useful for both research and production." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-17", "text": "FAIRSEQ features: (i) a common interface across models and tasks that can be extended * equal contribution \u2020 Work done while at Facebook AI Research." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-18", "text": "with user-supplied plug-ins ( \u00a72); (ii) efficient distributed and mixed precision training, enabling training over datasets with hundreds of millions of sentences on current hardware ( \u00a73); (iii) stateof-the-art implementations and pretrained models for machine translation, summarization, and language modeling ( \u00a74); and (iv) optimized inference with multiple supported search algorithms, including beam search, diverse beam search (Vijayakumar et al., 2016) , and top-k sampling." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-19", "text": "FAIRSEQ is distributed with a BSD license and is available on GitHub at https://github.com/ pytorch/fairseq." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-20", "text": "----------------------------------" }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-21", "text": "**DESIGN**" }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-22", "text": "Extensibility." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-23", "text": "FAIRSEQ can be extended through five types of user-supplied plug-ins, which enable experimenting with new ideas while reusing existing components as much as possible." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-24", "text": "Models define the neural network architecture and encapsulate all learnable parameters." 
}, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-25", "text": "Models extend the BaseFairseqModel class, which in turn extends torch.nn." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-26", "text": "Module." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-27", "text": "Thus any FAIRSEQ model can be used as a stand-alone module in other PyTorch code." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-28", "text": "Models can additionally predefine named architectures with common network configurations (e.g., embedding dimension, number of layers, etc.)." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-29", "text": "We also abstracted the methods through which the model interacts with the generation algorithm, e.g., beam search, through step-wise prediction." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-30", "text": "This isolates model implementation from the generation algorithm." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-31", "text": "Criterions compute the loss given the model and a batch of data, roughly: loss = criterion (model, batch) ." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-32", "text": "This formulation makes criterions very expressive, since they have complete access to the model." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-33", "text": "For example, a criterion may perform on-the-fly genera-tion to support sequence-level training or online backtranslation (Edunov et al., 2018a; Lample et al., 2018) ." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-34", "text": "Alternatively, in a mixture-of-experts model, a criterion may implement EM-style training and backpropagate only through the expert that produces the lowest loss (Shen et al., 2019) ." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-35", "text": "Tasks store dictionaries, provide helpers for loading and batching data and define the training loop." 
}, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-36", "text": "They are intended to be immutable and primarily interface between the various components." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-37", "text": "We provide tasks for translation, language modeling, and classification." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-38", "text": "Optimizers update the model parameters based on the gradients." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-39", "text": "We provide wrappers around most PyTorch optimizers and an implementation of Adafactor (Shazeer and Stern, 2018) , which is a memory-efficient variant of Adam." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-40", "text": "Learning Rate Schedulers update the learning rate over the course of training." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-41", "text": "We provide several popular schedulers, e.g., the inverse square-root scheduler from Vaswani et al. (2017) and cyclical schedulers based on warm restarts (Loshchilov and Hutter, 2016) ." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-42", "text": "Reproducibility and forward compatibility." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-43", "text": "FAIRSEQ includes features designed to improve reproducibility and forward compatibility." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-44", "text": "For example, checkpoints contain the full state of the model, optimizer and dataloader, so that results are reproducible if training is interrupted and resumed." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-45", "text": "FAIRSEQ also provides forward compatibility, i.e., models trained using old versions of the toolkit will continue to run on the latest version through automatic checkpoint upgrading." 
}, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-46", "text": "----------------------------------" }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-47", "text": "**IMPLEMENTATION**" }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-48", "text": "FAIRSEQ is implemented in PyTorch and it provides efficient batching, mixed precision training, multi-GPU as well as multi-machine training." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-49", "text": "Batching." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-50", "text": "There are multiple strategies to batch input and output sequence pairs (Morishita et al., 2017) ." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-51", "text": "FAIRSEQ minimizes padding within a minibatch by grouping source and target sequences of similar length." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-52", "text": "The content of each mini-batch stays the same throughout training, however minibatches themselves are shuffled randomly every epoch." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-53", "text": "When training on more than one GPU or machine, then the mini-batches for each worker are likely to differ in the average sentence length which results in more representative updates." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-54", "text": "Multi-GPU training." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-55", "text": "FAIRSEQ uses the NCCL2 library and torch.distributed for inter-GPU communication." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-56", "text": "Models are trained in a synchronous optimization setup where each GPU has a copy of the model to process a sub-batch of data after which gradients are synchronized between GPUs; all sub-batches constitute a minibatch." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-57", "text": "Even though sub-batches contain a similar number of tokens, we still observe a high variance in processing times." 
}, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-58", "text": "In multi-GPU or multi-machine setups, this results in idle time for most GPUs while slower workers are finishing their work (Figure 1(a))." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-59", "text": "FAIRSEQ mitigates the effect of stragglers by overlapping gradient synchronization between workers with the backward pass and by accumulating gradients over multiple mini-batches for each GPU." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-60", "text": "Overlapping gradient synchronization starts to synchronize gradients of parts of the network when they are computed." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-61", "text": "In particular, when the gradient computation for a layer finishes, FAIRSEQ adds the result to a buffer." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-62", "text": "When the size of the buffer reaches a predefined threshold, the gradients are synchronized in a background thread while back-propagation continues as usual (Figure 1(b))." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-63", "text": "Next, we accumulate gradients for multiple sub-batches on each GPU, which reduces the variance in processing time between workers since there is no need to wait for stragglers after each sub-batch (Figure 1(c))." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-64", "text": "This also increases the effective batch size, but we found that models can still be trained effectively." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-65", "text": "Mixed precision." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-66", "text": "Recent GPUs enable efficient half precision floating point (FP16) computation." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-67", "text": "FAIRSEQ provides support for both full precision (FP32) and FP16 at training and inference." 
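The gradient-accumulation part of the straggler mitigation described above can be sketched as follows; `grad_fn` stands in for a real backward pass and the synchronized-update step is simulated, so this is only an illustration of the control flow:

```python
def accumulate_gradients(grad_fn, sub_batches, accum_steps):
    """Accumulate gradients over several sub-batches before a single
    synchronized update, so workers wait on stragglers only once
    every `accum_steps` sub-batches instead of after every one."""
    total = None
    updates = []
    for step, sb in enumerate(sub_batches, 1):
        g = grad_fn(sb)                     # local backward pass
        total = g if total is None else [a + b for a, b in zip(total, g)]
        if step % accum_steps == 0:
            # one all-reduce + optimizer step per accum_steps sub-batches
            updates.append([a / accum_steps for a in total])
            total = None
    return updates
```

With `accum_steps = 2`, the effective batch size doubles while the number of synchronization points halves, matching the trade-off the text describes.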
}, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-68", "text": "We perform all forward-backward computations as well as the all-reduce for gradient synchronization between workers in FP16." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-69", "text": "However, the parameter updates remain in FP32 to preserve accuracy." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-70", "text": "FAIRSEQ implements dynamic loss scaling in order to avoid underflows for activations and gradients because of the limited precision offered by FP16." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-71", "text": "This scales the loss right after the forward pass to fit into the FP16 range while the backward pass is left unchanged." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-72", "text": "After the FP16 gradients are synchronized between workers, we convert them to FP32, restore the original scale, and update the weights." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-73", "text": "Inference." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-74", "text": "FAIRSEQ provides fast inference for non-recurrent models (Gehring et al., 2017; Vaswani et al., 2017; Fan et al., 2018b; Wu et al., 2019) through incremental decoding, where the model states of previously generated tokens are cached in each active beam and re-used." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-75", "text": "This can speed up a na\u00efve implementation without caching by up to an order of magnitude, since only new states are computed for each token." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-76", "text": "For some models, this requires a component-specific caching implementation, e.g., multi-head attention in the Transformer architecture." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-77", "text": "During inference we build batches with a variable number of examples up to a user-specified number of tokens, similar to training." 
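The dynamic loss scaling logic described above (shrink the scale on FP16 overflow, grow it back after a run of clean steps) can be sketched as a small state machine; the halving/doubling policy and `growth_interval` are common conventions for dynamic loss scaling, shown here as an assumption rather than FAIRSEQ's exact constants:

```python
def dynamic_loss_scale_step(grads_overflowed, state):
    """One step of dynamic loss scaling: halve the scale when FP16
    gradients overflow (and skip the update), double it again after
    `growth_interval` consecutive clean steps."""
    if grads_overflowed:
        state["scale"] /= 2.0        # shrink into the FP16 range
        state["good_steps"] = 0
        return False                 # no parameter update this step
    state["good_steps"] += 1
    if state["good_steps"] >= state["growth_interval"]:
        state["scale"] *= 2.0        # headroom regained, grow the scale
        state["good_steps"] = 0
    return True                      # safe to unscale to FP32 and update
```

The scale multiplies the loss after the forward pass; on a clean step the FP32 gradients are divided by the same scale before the weight update, as the text describes.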
}, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-78", "text": "FAIRSEQ also supports inference in FP16, which increases decoding speed by 54% compared to FP32 with no loss in accuracy (Table 1)." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-79", "text": "----------------------------------" }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-80", "text": "**APPLICATIONS**" }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-81", "text": "FAIRSEQ has been used in many applications, such as machine translation (Gehring et al., 2017; Edunov et al., 2018b,a; Chen et al., 2018; Song et al., 2018; Wu et al., 2019), language modeling, abstractive document summarization (Fan et al., 2018a; Narayan et al., 2018), story generation (Fan et al., 2018b), error correction (Chollampatt and Ng, 2018), multilingual sentence embeddings (Artetxe and Schwenk, 2018), and dialogue (Miller et al., 2017; Dinan et al., 2019)." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-82", "text": "----------------------------------" }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-83", "text": "**MACHINE TRANSLATION**" }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-84", "text": "We provide reference implementations of several popular sequence-to-sequence models which can be used for machine translation, including LSTM (Luong et al., 2015), convolutional models (Gehring et al., 2017; Wu et al., 2019) and Transformer (Vaswani et al., 2017)." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-85", "text": "We evaluate a \"big\" Transformer encoder-decoder model on two language pairs, WMT English to German (En-De) and WMT English to French (En-Fr)." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-86", "text": "For En-De we replicate the setup of Vaswani et al. (2017), which relies on WMT'16 for training with 4.5M sentence pairs; we validate on newstest13 and test on newstest14." 
}, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-87", "text": "The 32K vocabulary is based on a joint source and target byte pair encoding (BPE; Sennrich et al. 2016)." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-88", "text": "For En-Fr, we train on WMT'14 and borrow the setup of Gehring et al. (2017) with 36M training sentence pairs." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-89", "text": "We use newstest12+13 for validation and newstest14 for test." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-90", "text": "The 40K vocabulary is based on a joint source and target BPE." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-91", "text": "We measure case-sensitive tokenized BLEU with multi-bleu (Hoang et al., 2006) and detokenized BLEU with SacreBLEU (Post, 2018)." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-92", "text": "All results use beam search with a beam width of 4 and length penalty of 0.6, following Vaswani et al. (2017)." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-93", "text": "FAIRSEQ results are summarized in Table 2." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-94", "text": "We report improved BLEU scores over Vaswani et al. (2017) by training with a bigger batch size and an increased learning rate." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-95", "text": "----------------------------------" }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-96", "text": "**LANGUAGE MODELING**" }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-97", "text": "FAIRSEQ supports language modeling with gated convolutional models and Transformer models (Vaswani et al., 2017)." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-98", "text": "Models can be trained using a variety of input and output representations, such as standard token embeddings, convolutional character embeddings (Kim et al., 2016), adaptive softmax (Grave et al., 2017), and adaptive inputs. [Flattened table residue, En-De / En-Fr BLEU: a. Gehring et al. (2017) 25.2 / 40.5; b. Vaswani et al. (2017) 28.4 / 41.0; c. Ahmed et al. (2017) 28.9 / 41.4; d. Shaw et al. (2018) 29 / (value truncated)]" }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-99", "text": "We also provide tutorials and pre-trained models that replicate the results of and" }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-100", "text": "----------------------------------" }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-101", "text": "**ABSTRACTIVE DOCUMENT SUMMARIZATION**" }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-102", "text": "Next, we experiment with abstractive document summarization where we use a base Transformer to encode the input document and then generate a summary with a decoder network." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-103", "text": "We use the CNN-Dailymail dataset (Hermann et al., 2015; Nallapati et al., 2016) of news articles paired with multi-sentence summaries." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-104", "text": "We evaluate on the full-text version with no entity anonymization (See et al., 2017); we truncate articles to 400 tokens (See et al., 2017). [Flattened perplexity table residue: 31.9; J\u00f3zefowicz et al. (2016) 30.0; 28.0; FAIRSEQ Adaptive inputs 23.0]" }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-105", "text": "We use BPE with 30K operations to form our vocabulary following Fan et al. (2018a)." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-106", "text": "To evaluate, we use the standard ROUGE metric (Lin, 2004) and report ROUGE-1, ROUGE-2, and ROUGE-L. To generate summaries, we follow standard practice in tuning the minimum output length and disallow repeating the same trigram (Paulus et al., 2017)." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-107", "text": "Table 5 shows results of FAIRSEQ." 
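The no-repeat-trigram constraint mentioned above (Paulus et al., 2017) can be sketched as a filter applied at each decoding step; this is a generic illustration of the technique, not FAIRSEQ's implementation:

```python
def blocked_tokens(prefix, vocab):
    """Return vocabulary items that would repeat a trigram already
    present in `prefix`; a beam search masks these out (sets their
    log-probability to -inf) before picking the next token."""
    seen = {tuple(prefix[i:i + 3]) for i in range(len(prefix) - 2)}
    last_two = tuple(prefix[-2:])
    return {w for w in vocab if last_two + (w,) in seen}
```

For the prefix "a b c a b", continuing with "c" would recreate the trigram (a, b, c), so "c" is the only blocked candidate.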
}, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-108", "text": "We also consider a configuration where we input pre-trained language model representations to the encoder network and this language model was trained on newscrawl and CNN-Dailymail, totalling 193M sentences." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-109", "text": "----------------------------------" }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-110", "text": "**CONCLUSION**" }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-111", "text": "We presented FAIRSEQ, a fast, extensible toolkit for sequence modeling that is scalable and suitable for many applications." }, { "sent_id": "242aacd35fb92d836fea9eb33961a3-C001-112", "text": "In the future, we will continue the development of the toolkit to enable further research advances." } ], "y": { "@SIM@": { "gold_contexts": [ [ "242aacd35fb92d836fea9eb33961a3-C001-41" ], [ "242aacd35fb92d836fea9eb33961a3-C001-86" ], [ "242aacd35fb92d836fea9eb33961a3-C001-92" ] ], "cite_sentences": [ "242aacd35fb92d836fea9eb33961a3-C001-41", "242aacd35fb92d836fea9eb33961a3-C001-86", "242aacd35fb92d836fea9eb33961a3-C001-92" ] }, "@USE@": { "gold_contexts": [ [ "242aacd35fb92d836fea9eb33961a3-C001-41" ], [ "242aacd35fb92d836fea9eb33961a3-C001-86" ], [ "242aacd35fb92d836fea9eb33961a3-C001-92" ] ], "cite_sentences": [ "242aacd35fb92d836fea9eb33961a3-C001-41", "242aacd35fb92d836fea9eb33961a3-C001-86", "242aacd35fb92d836fea9eb33961a3-C001-92" ] }, "@BACK@": { "gold_contexts": [ [ "242aacd35fb92d836fea9eb33961a3-C001-74" ], [ "242aacd35fb92d836fea9eb33961a3-C001-84" ], [ "242aacd35fb92d836fea9eb33961a3-C001-97" ], [ "242aacd35fb92d836fea9eb33961a3-C001-98" ] ], "cite_sentences": [ "242aacd35fb92d836fea9eb33961a3-C001-74", "242aacd35fb92d836fea9eb33961a3-C001-84", "242aacd35fb92d836fea9eb33961a3-C001-97", "242aacd35fb92d836fea9eb33961a3-C001-98" ] }, "@DIF@": { "gold_contexts": [ [ "242aacd35fb92d836fea9eb33961a3-C001-94" ] ], "cite_sentences": [ 
"242aacd35fb92d836fea9eb33961a3-C001-94" ] } } }, "ABC_817576dbe36f79ac3e0031211f400d_24": { "x": [ { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-2", "text": "Pre-trained language models have been dominating the field of natural language processing in recent years, and have led to significant performance gains for various complex natural language tasks." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-3", "text": "One of the most prominent pre-trained language models is BERT (Bidirectional Encoder Representations from Transformers), which was released as an English as well as a multilingual version." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-4", "text": "Although multilingual BERT performs well on many tasks, recent studies showed that BERT models trained on a single language significantly outperform the multilingual results." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-5", "text": "Training a Dutch BERT model thus has a lot of potential for a wide range of Dutch NLP tasks." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-6", "text": "While previous approaches have used earlier implementations of BERT to train their Dutch BERT, we used RoBERTa, a robustly optimized BERT approach, to train a Dutch language model called RobBERT." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-7", "text": "We show that RobBERT improves state-of-the-art results in Dutch-specific language tasks, and also outperforms other existing Dutch BERT-based models in sentiment analysis." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-8", "text": "These results indicate that RobBERT is a powerful pre-trained model for fine-tuning for a large variety of Dutch language tasks." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-9", "text": "We publicly release this pre-trained model in the hope of supporting further downstream Dutch NLP applications." 
}, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-10", "text": "1 The model named itself RobBERT when it was prompted with \"Ik heet BERT.\"" }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-11", "text": "(\"My name is BERT.\"), which we found quite a suitable name." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-12", "text": "----------------------------------" }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-13", "text": "**INTRODUCTION**" }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-14", "text": "The advent of neural networks in natural language processing (NLP) has significantly improved state-of-the-art results within the field." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-15", "text": "While recurrent neural networks (RNNs) and long short-term memory networks (LSTMs) initially dominated the field, recent models started incorporating attention mechanisms and then later dropped the recurrent part and just kept the attention mechanisms in so-called transformer models (Vaswani et al., 2017)." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-16", "text": "This latter type of model caused a new revolution in NLP and led to popular language models like GPT-2 (Radford et al., 2018, 2019) and ELMo (Peters et al., 2018)." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-17", "text": "BERT (Devlin et al., 2019) improved over previous transformer models and recurrent networks by allowing the system to learn from input text in a bidirectional way, rather than only from left-to-right or the other way around." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-18", "text": "This model was later reimplemented, critically evaluated and improved in the RoBERTa model." 
}, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-19", "text": "These large-scale transformer models provide the advantage of being able to solve NLP tasks by having a common, expensive pre-training phase, followed by a smaller fine-tuning phase." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-20", "text": "The pre-training happens in an unsupervised way by providing large corpora of text in the desired language." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-21", "text": "The second phase only needs a relatively small annotated data set for fine-tuning to outperform previous popular approaches in one of a large number of possible language tasks." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-22", "text": "While language models are usually trained on English data, some multilingual models also exist." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-23", "text": "These are usually trained on a large quantity of text in different languages." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-24", "text": "For example, Multilingual-BERT is trained on a collection of corpora in 104 different languages (Devlin et al., 2019), and generalizes language components well across languages (Pires et al., 2019)." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-25", "text": "However, models trained on data from one specific language usually improve the performance of multilingual models for this particular language (Martin et al., 2019; de Vries et al., 2019)." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-26", "text": "Training a RoBERTa model on a Dutch dataset thus has a lot of potential for increasing performance for many downstream Dutch NLP tasks." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-27", "text": "In this paper, we introduce RobBERT, a Dutch RoBERTa-based pre-trained language model, and critically test its performance using natural language tasks against other Dutch language models." 
}, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-28", "text": "----------------------------------" }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-29", "text": "**RELATED WORK**" }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-30", "text": "Transformer models have been successfully used for a wide range of language tasks." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-31", "text": "Initially, transformers were introduced for use in machine translation, where they vastly improved state-of-the-art results for English to German in an efficient manner (Vaswani et al., 2017)." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-32", "text": "This transformer model architecture resulted in a new paradigm in NLP with the migration from sequence-to-sequence recurrent neural networks to transformer-based models by removing the recurrent component and only keeping attention." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-33", "text": "This cornerstone was used for BERT, a transformer model that obtained state-of-the-art results for eleven natural language processing tasks, such as question answering and natural language inference (Devlin et al., 2019)." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-34", "text": "BERT is pre-trained with large corpora of text using two unsupervised tasks." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-35", "text": "The first task is word masking (also called the Cloze task (Taylor, 1953) or masked language model (MLM)), where the model has to guess which word is masked in a certain position in the text." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-36", "text": "The second task is next sentence prediction." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-37", "text": "This is done by predicting if two sentences are subsequent in the corpus, or if they are randomly sampled from the corpus." 
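The word-masking task described above can be sketched as the input-corruption step below; the 80/10/10 split is the standard recipe from Devlin et al. (2019), and the function names are illustrative, not any library's API:

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", vocab=None, rng=None):
    """BERT-style masked-LM corruption: select ~15% of positions as
    prediction targets; of those, 80% become [MASK], 10% become a
    random word, and 10% are left unchanged."""
    rng = rng or random.Random(0)
    vocab = vocab or tokens
    out, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok                 # the model must recover this
            r = rng.random()
            if r < 0.8:
                out[i] = mask_token          # 80%: replace with [MASK]
            elif r < 0.9:
                out[i] = rng.choice(vocab)   # 10%: random replacement
            # else 10%: keep the original token
    return out, targets
```

The model then predicts the original token at every position recorded in `targets`; unselected positions contribute no loss.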
}, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-38", "text": "These tasks allowed the model to create internal representations about a language, which could thereafter be reused for different language tasks." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-39", "text": "This architecture has been shown to be a general language model that could be fine-tuned with little data in a relatively efficient way for a very distinct range of tasks and still outperform previous architectures (Devlin et al., 2019)." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-40", "text": "Transformer models are also capable of generating contextualized word embeddings." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-41", "text": "These contextualized embeddings were presented by Peters et al. (2018) and addressed the well-known issue with a word's meaning being defined by its context (e.g. \"a stick\" versus \"let's stick to\")." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-42", "text": "This context is something that traditional word embeddings like word2vec (Mikolov et al., 2013) or GloVe (Pennington et al., 2014) lack, whereas BERT automatically incorporates the context a word occurs in." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-43", "text": "Another advantage of transformer models is that attention allows them to better resolve coreferences between words." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-44", "text": "A typical example for the importance of coreference resolution is \"The trophy doesn't fit in the brown suitcase because it's too big.\", where the word \"it\" would refer to the suitcase instead of the trophy if the last word was changed to \"small\" (Levesque et al., 2012)." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-45", "text": "Being able to resolve these coreferences is for example important for translating to languages with gender, as suitcase and trophy have different genders in French." 
}, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-46", "text": "Although BERT has been shown to be a useful language model, it has also received some scrutiny on the training and pre-processing of the language model." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-47", "text": "As mentioned before, BERT uses next sentence prediction (NSP) as one of its two training tasks." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-48", "text": "In NSP, the model has to predict whether two sentences follow each other in the training text, or are just randomly selected from the corpora." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-49", "text": "The authors of RoBERTa showed that while this task made the model achieve a better performance, it was not due to its intended reason, as it might merely predict relatedness rather than subsequent sentences." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-50", "text": "That Devlin et al. (2019) trained a better model when using NSP than without NSP is likely due to the model learning long-range dependencies in text from its inputs, which are longer than just a single sentence on its own." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-51", "text": "As such, the RoBERTa model uses only the MLM task, and uses multiple full sentences in every input." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-52", "text": "Other research improved the NSP task by instead making the model predict the correct order of two sentences, where the model thus has to predict whether the sentences occur in the given order in the corpus, or occur in flipped order (Lan et al., 2019)." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-53", "text": "Devlin et al. (2019) also presented a multilingual model (mBERT) with the same architecture as BERT, but trained on Wikipedia corpora in 104 languages." 
}, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-54", "text": "Unfortunately, the quality of these multilingual embeddings is often considered worse than their monolingual counterparts." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-55", "text": "R\u00f6nnqvist et al. (2019) illustrated this difference in quality for German and English models in a generative setting." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-56", "text": "The monolingual French CamemBERT model (Martin et al., 2019 ) also compared their model to mBERT, which performed poorer on all tasks." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-57", "text": "More recently, de Vries et al. (2019) also showed similar results for Dutch using their BERTje model, outperforming multilingual BERT in a wide range of tasks, such as sentiment analysis and part-of-speech tagging." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-58", "text": "Since this work is concurrent with ours, we compare our results with BERTje in this paper." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-59", "text": "This section describes the data and training regime we used to train our Dutch RoBERTa-based language model called RobBERT." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-60", "text": "----------------------------------" }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-61", "text": "**DATA**" }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-62", "text": "We pre-trained our model on the Dutch section of the OSCAR corpus, a large multilingual corpus which was obtained by language classification in the Common Crawl corpus (Ortiz Su\u00e1rez et al., 2019) ." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-63", "text": "This Dutch corpus has 6.6 billion words, totalling 39 GB of text." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-64", "text": "It contains 126,064,722 lines of text, where each line can contain multiple sentences." 
}, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-65", "text": "Subsequent lines are however not related to each other, due to the shuffled nature of the OSCAR data set." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-66", "text": "For comparison, the French RoBERTa-based language model CamemBERT (Martin et al., 2019) has been trained on the French portion of OSCAR, which consists of 138 GB of scraped text." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-92", "text": "We replicated the high-level sentiment analysis task used to evaluate BERTje (de Vries et al., 2019) to be able to compare our methods." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-67", "text": "Our data differs in several ways from the data used to train BERTje, a BERT-based Dutch language model (de Vries et al., 2019)." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-68", "text": "Firstly, they trained the model on an assembly of multiple Dutch corpora totalling only 12 GB." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-69", "text": "Secondly, they used WordPiece as subword embeddings, since this is what the original BERT architecture uses." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-70", "text": "RobBERT on the other hand uses Byte Pair Encoding (BPE), which is also used by GPT-2 (Radford et al., 2019) and RoBERTa." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-71", "text": "----------------------------------" }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-72", "text": "**TRAINING**" }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-73", "text": "RobBERT shares its architecture with RoBERTa's base model, which itself is a replication and improvement over BERT." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-74", "text": "The architecture of our language model is thus equal to the original BERT model with 12 self-attention layers with 12 heads (Devlin et al., 2019)." 
}, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-75", "text": "One difference with the original BERT is due to the different pre-training task specified by RoBERTa, using only the MLM task and not the NSP task." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-76", "text": "The training thus only uses word masking, where the model has to predict which words were masked in certain positions of a given line of text." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-77", "text": "The training process uses the Adam optimizer (Kingma and Ba, 2017) with polynomial decay of the learning rate lr = 10^-6 and a ramp-up period of 1000 iterations, with parameters \u03b21 = 0.9 (a common default) and RoBERTa's default \u03b22 = 0.98." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-78", "text": "Additionally, we also used a weight decay of 0.1 as well as a small dropout of 0.1 to help prevent the model from overfitting (Srivastava et al., 2014)." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-79", "text": "We used a computing cluster in order to efficiently pre-train our model." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-80", "text": "More specifically, the pre-training was executed on a computing cluster with 20 nodes with 4 Nvidia Tesla P100 GPUs (16 GB VRAM each) and 2 nodes with 8 Nvidia V100 GPUs (having 32 GB VRAM each)." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-81", "text": "This pre-training happened in fixed batches of 8192 sentences by rescaling each GPU's batch size depending on the number of GPUs available, in order to maximally utilize the cluster without blocking it entirely for other users." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-82", "text": "The model trained for two epochs, which is over 16k batches in total." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-83", "text": "With the large batch size of 8192, this equates to 0.5M updates for a traditional BERT model." 
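The warmup-then-polynomial-decay schedule described above can be sketched as follows; the peak rate and 1000-step ramp-up come from the text, while `total_steps` and the linear decay power are illustrative assumptions:

```python
def lr_at(step, peak_lr=1e-6, warmup=1000, total_steps=16000, power=1.0):
    """Linear ramp-up over `warmup` iterations to `peak_lr`, then
    polynomial decay toward zero by `total_steps`."""
    if step < warmup:
        return peak_lr * step / warmup                 # linear ramp-up
    frac = (total_steps - step) / (total_steps - warmup)
    return peak_lr * max(frac, 0.0) ** power           # polynomial decay
```

With `power=1.0` this reduces to linear decay; larger powers front-load the decay, which is the knob the "polynomial decay" phrasing refers to.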
}, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-84", "text": "At this point, the perplexity did not decrease any further." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-85", "text": "----------------------------------" }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-86", "text": "**EVALUATION**" }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-87", "text": "We evaluated RobBERT in several different settings on multiple downstream tasks." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-88", "text": "First, we compare its performance with other BERT-models and state-of-the-art systems in sentiment analysis, to show its performance for classification tasks." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-89", "text": "Second, we compare its performance in a recent Dutch language task, namely the disambiguation of demonstrative pronouns, which allows us to additionally compare the zero-shot performance of our and other BERT models, i.e. using only the pre-trained model without any fine-tuning." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-90", "text": "----------------------------------" }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-91", "text": "**SENTIMENT ANALYSIS**" }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-115", "text": "We then test two different approaches for solving this task on this dataset." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-116", "text": "The first approach is making the BERT models use their MLM task and guess which word should be filled in this spot, and check if it has more confidence in either \"die\" or \"dat\" (by checking the first 2,048 guesses at most, as this seemed sufficiently large)." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-117", "text": "This allows us to compare the zero-shot BERT models, i.e. without any fine-tuning after pre-training, for which the results can be seen in Table 2 ." 
}, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-118", "text": "The second approach uses the same data, but creates two sentences by filling in the mask with both \"die\" and \"dat\", appending both with the [SEP] token and making the model predict which of the two sentences is correct." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-119", "text": "The fine-tuning was performed using 4 Nvidia GTX 1080 Ti GPUs and evaluated against the same test set of 399k utterances." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-120", "text": "As before, we fine-tuned the model twice: once with the full training set and once with a subset of 10k utterances from the training set for illustrating the benefits of pre-training on low-resource tasks." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-121", "text": "[Flattened results table, accuracy: ZeroR (majority class): 66.70; mBERT (Devlin et al., 2019): 90.21; BERTje (de Vries et al., 2019): 94.94; RobBERT (ours): 98.03]" }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-122", "text": "RobBERT outperforms previous models as well as other BERT models both with and without fine-tuning (see Table 1 and Table 2)." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-123", "text": "It is also able to reach similar performance using less data." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-124", "text": "The fact that zero-shot RobBERT outperforms other zero-shot BERT models is also an indication that the base model has internalised more knowledge about Dutch than the other two have." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-125", "text": "The reason RobBERT and other BERT models outperform the previous RNN-based approach is likely the transformer's ability to deal better with coreference resolution, and by extension to be better at deciding which word the \"die\" or \"dat\" belongs to." 
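The first, zero-shot approach described above (comparing the masked-LM's confidence for "die" versus "dat" in the gap) can be sketched as follows; `mlm_fill` is a hypothetical stand-in for a real masked-LM scoring call, not RobBERT's actual API:

```python
def zero_shot_die_dat(masked_sentence, mlm_fill):
    """Zero-shot 'die'/'dat' disambiguation: score both candidates for
    the masked position with a pre-trained masked LM (no fine-tuning)
    and keep the likelier one. `mlm_fill(sentence, word)` must return
    the model's probability of `word` filling the mask."""
    p_die = mlm_fill(masked_sentence, "die")
    p_dat = mlm_fill(masked_sentence, "dat")
    return "die" if p_die >= p_dat else "dat"
```

Because only the pre-trained model is queried, this directly measures how much Dutch the base model has internalised, which is the comparison Table 2 reports.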
}, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-126", "text": "----------------------------------" }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-127", "text": "**CODE**" }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-128", "text": "The training and evaluation code of this paper as well as the RobBERT model and the fine-tuned models are publicly available for download on https://github.com/iPieter/RobBERT." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-129", "text": "----------------------------------" }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-130", "text": "**FUTURE WORK**" }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-131", "text": "There are several possible improvements as well as interesting future directions for this research, for example in training similar models." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-132", "text": "First, as BERT-based models are a very active field of research, it is interesting to experiment with change the pre-training tasks with new unsupervised tasks when they are discovered, such as the sentence order prediction (Lan et al., 2019) ." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-133", "text": "Second, while" }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-134", "text": "RobBERT is trained on lines that contain multiple sentences, it does not put subsequent lines of the corpus after each other due to the shuffled nature of the OSCAR corpus (Ortiz Su\u00e1rez et al., 2019) ." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-135", "text": "This is unlike RoBERTa, which does put full sentences next to each other if they fit, in order to learn the long-range dependencies between words that the original BERT learned using its controversial NSP task." 
}, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-136", "text": "It could be interesting to use the processor used to create OS-CAR in order to create an unshuffled version to train on, such that this technique can be used on the data set." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-137", "text": "Third, RobBERT uses the same tokenizer as RoBERTa, meaning it uses a tokenizer built for the English language." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-138", "text": "Training a new model using a custom Dutch tokenizer, e.g. using the newly released HuggingFace tokenizers library (Wolf et al., 2019) , could increase the performance even further." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-139", "text": "On the same note, incorporating more Unicode glyphs as separate tokens can also be beneficial for example for tasks related to conversational agents (Delobelle and Berendt, 2019) ." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-140", "text": "RobBERT itself could also be used in new settings to help future research." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-141", "text": "First, Rob-BERT could be used in different settings thanks to the renewed interest of sequence-to-sequence models due to their results on a vast range of language tasks (Raffel et al., 2019; ." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-142", "text": "These models use a BERT-like transformer stack for the encoder and depending on the task a generative model as a decoder." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-143", "text": "These advances once again highlight the flexibility of the selfattention mechanism and it might be interesting to research the re-usability of RobBERT in these type of architectures." 
}, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-144", "text": "Second, there are many Dutch language tasks that we did not examine in this paper, for which it may also be possible to achieve state-of-the-art results when fine-tuned on this pretrained model." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-145", "text": "----------------------------------" }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-146", "text": "**CONCLUSION**" }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-147", "text": "We introduced a new language model for Dutch based on RoBERTa, called RobBERT, and showed that it outperforms earlier approaches for Dutch language tasks, as well as other BERT-based language models." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-148", "text": "We thus hope this model can serve as a base for fine-tuning on other tasks, and thus help foster new models that might advance results for Dutch language tasks." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-93", "text": "This task uses a dataset called Dutch Book Reviews Dataset (DBRD), in which book reviews scraped from hebban.nl are labeled as positive or negative (van der Burgh and Verberne, 2019)." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-94", "text": "Although the dataset contains 118,516 reviews, only 22,252 of these reviews are actually labeled as positive or negative." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-95", "text": "The DBRD dataset is already split in a balanced 10% test and 90% train split, allowing us to easily compare to other models trained for solving this task." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-96", "text": "This dataset was released in a paper analysing the performance of an ULMFiT model (Universal Language Model Fine-tuning for Text Classification model) (van der Burgh and Verberne, 2019)." 
}, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-97", "text": "We fine-tuned RobBERT on the first 10,000 training examples as well as on the full data set." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-98", "text": "While the ULMFiT model is first fine-tuned using the unlabeled reviews before training the classifier (van der Burgh and Verberne, 2019), it is unclear whether BERTje also first fine-tuned on the unlabeled reviews or only used the labeled data for fine-tuning the pretrained model." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-99", "text": "It is also unclear how it dealt with reviews being longer than the maximum number of tokens allowed as input in BERT models, as the average book review length is 547 tokens, with 40% of the documents being longer than our RobBERT model can handle." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-100", "text": "For a safe comparison, we thus decided to discard the unlabeled data and only use the labeled data for training and test purposes (20,028 and 2,224 examples respectively), and compare approaches for dealing with too long input sequences." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-101", "text": "We trained our model for 2000 iterations with a batch size of 128 and a warm-up of 500 iterations, reaching a learning rate of 10 \u22125 ." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-102", "text": "We found that our model performed better when trained on the last part of the book reviews than on the first part." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-103", "text": "This is likely due to this part containing concluding remarks summarizing the overall sentiment." 
}, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-104", "text": "While BERTje was slightly outperformed by ULMFiT (de Vries et al., 2019; van der Burgh and Verberne, 2019), we can see that RobBERT achieves better performance than both on the test set, although the performance difference is not statistically significantly better than the ULMFiT model, as can be seen in Table 1 ." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-105", "text": "----------------------------------" }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-106", "text": "**DIE/DAT DISAMBIGUATION**" }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-107", "text": "Aside from classic natural language processing tasks in previous subsections, we also evaluated its performance on a task that is specific to Dutch, namely disambiguating \"die\" and \"dat\" (= \"that\" in English)." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-108", "text": "In Dutch, depending on the sentence, both terms can be either demonstrative or relative pronouns; in addition they can also be used in a subordinating conjunction, i.e. to introduce a clause." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-109", "text": "The use of either of these words depends on the gender of the word it refers to." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-110", "text": "Distinguishing these words is a task introduced by Allein et al. (2020) , who presented multiple models trained on the Europarl (Koehn, 2005) and SoNaR corpora (Oostdijk et al., 2013) ." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-111", "text": "The results ranged from an accuracy of 75.03% on Europarl to 84.56% on SoNaR." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-112", "text": "For this task, we use the Dutch version of the Europarl corpus (Koehn, 2005) , which we split in 1.3M utterances for training, 319k for validation, and 399k for testing." 
}, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-113", "text": "We then process every sentence by checking if it contains \"die\" or \"dat\", and if so, add a training example for every occurrence of this word in the sentence, where a single occurrence is masked." }, { "sent_id": "817576dbe36f79ac3e0031211f400d-C001-114", "text": "For the test set for example, this resulted in about 289k masked sentences." } ], "y": { "@BACK@": { "gold_contexts": [ [ "817576dbe36f79ac3e0031211f400d-C001-17" ], [ "817576dbe36f79ac3e0031211f400d-C001-24" ], [ "817576dbe36f79ac3e0031211f400d-C001-33" ], [ "817576dbe36f79ac3e0031211f400d-C001-39" ], [ "817576dbe36f79ac3e0031211f400d-C001-50" ], [ "817576dbe36f79ac3e0031211f400d-C001-121", "817576dbe36f79ac3e0031211f400d-C001-122" ] ], "cite_sentences": [ "817576dbe36f79ac3e0031211f400d-C001-17", "817576dbe36f79ac3e0031211f400d-C001-24", "817576dbe36f79ac3e0031211f400d-C001-33", "817576dbe36f79ac3e0031211f400d-C001-39", "817576dbe36f79ac3e0031211f400d-C001-50", "817576dbe36f79ac3e0031211f400d-C001-121" ] }, "@SIM@": { "gold_contexts": [ [ "817576dbe36f79ac3e0031211f400d-C001-74" ] ], "cite_sentences": [ "817576dbe36f79ac3e0031211f400d-C001-74" ] } } }, "ABC_0e4deda746127b97f68080bc8f13c8_24": { "x": [ { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-2", "text": "The success of self-attention in NLP has led to recent applications in end-to-end encoder-decoder architectures for speech recognition." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-3", "text": "Separately, connectionist temporal classification (CTC) has matured as an alignment-free, non-autoregressive approach to sequence transduction, either by itself or in various multitask and decoding frameworks." 
}, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-4", "text": "We propose SAN-CTC, a deep, fully self-attentional network for CTC, and show it is tractable and competitive for end-toend speech recognition." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-5", "text": "SAN-CTC trains quickly and outperforms existing CTC models and most encoder-decoder models, with character error rates (CERs) of 4.7% in 1 day on WSJ eval92 and 2.8% in 1 week on LibriSpeech test-clean, with a fixed architecture and one GPU." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-6", "text": "Similar improvements hold for WERs after LM decoding." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-7", "text": "We motivate the architecture for speech, evaluate position and downsampling approaches, and explore how label alphabets (character, phoneme, subword) affect attention heads and performance." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-8", "text": "----------------------------------" }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-9", "text": "****" }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-10", "text": "later works proposed partially-or purely-convolutional CTC models [8] [9] [10] [11] and convolution-heavy encoder-decoder models [16] for ASR." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-11", "text": "However, convolutional models must be significantly deeper to retrieve the same temporal receptive field [23] ." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-12", "text": "Recently, the mechanism of self-attention [22, 24] was proposed, which uses the whole sequence at once to model feature interactions that are arbitrarily distant in time." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-13", "text": "Its use in both encoder-decoder and feedforward contexts has led to faster training and state-of-the-art results in translation (via the Transformer [22] ), sentiment analysis [25] , and other tasks." 
}, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-14", "text": "These successes have motivated preliminary work in self-attention for ASR." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-15", "text": "Time-restricted self-attention was used as a drop-in replacement for individual layers in the state-of-theart lattice-free MMI model [26] , an HMM-NN system." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-16", "text": "Hybrid self-attention/LSTM encoders were studied in the context of listenattend-spell (LAS) [27] , and the Transformer was directly adapted to speech in [19, 28, 29] ; both are encoder-decoder systems." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-17", "text": "In this work, we propose and evaluate fully self-attentional networks for CTC (SAN-CTC)." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-18", "text": "We are motivated by practicality: selfattention could be used as a drop-in replacement in existing CTClike systems, where only attention has been evaluated in the past [30, 31] ; unlike encoder-decoder systems, SAN-CTC is able to predict tokens in parallel at inference time; an analysis of SAN-CTC is useful for future state-of-the-art ASR systems, which may equip self-attentive encoders with auxiliary CTC losses [17, 20] ." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-19", "text": "Unlike past works, we do not require convolutional frontends [19] or interleaved recurrences [27] to train self-attention for ASR." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-20", "text": "In Section 2, we motivate the model and relevant design choices (position, downsampling) for ASR." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-21", "text": "In Section 3, we validate SAN-CTC on the Wall Street Journal and LibriSpeech datasets by outperforming existing CTC models and most encoder-decoder models in character error rates (CERs), with fewer parameters or less training time." 
}, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-22", "text": "Finally, we train our models with different label alphabets (character, phoneme, subword), use WFST decoding to give word error rates (WERs), and examine the learned attention heads for insights." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-23", "text": "----------------------------------" }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-24", "text": "**MODEL ARCHITECTURES FOR CTC AND ASR**" }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-25", "text": "Consider an input sequence of T feature vectors, viewed as a matrix X \u2208 R T \u00d7d fr . Let L denote the (finite) label alphabet, and denote the output sequence as y = (y1, . . . , yU ) \u2208 L U ." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-26", "text": "In ASR, X is the sequence of acoustic frames, L is the set of graphemes/phonemes/wordpieces, and y is the corresponding ground-truth transcription over L." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-27", "text": "For CTC, one assumes U \u2264 T and defines an intermediate al-" }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-28", "text": "In this way, paths are analogous to framewise alignments in the HMM-NN framework." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-29", "text": "CTC models the distribution of sequences by marginalizing over all paths corresponding to an output:" }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-30", "text": "Finally, CTC models each P (\u03c0 | X) non-autoregressively, as a sequence of conditionally-independent outputs:" }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-31", "text": "This model assumption means each P (\u03c0t, t | X) could be computed in parallel, after which one can do prediction via beam search, or training with gradient descent using the objective LCTC(X, y) = \u2212 log P (y | X); the order-monotonicity of B ensures LCTC can be efficiently evaluated with dynamic programming [1, 4] ." 
}, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-32", "text": "----------------------------------" }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-33", "text": "**RECURRENT AND CONVOLUTIONAL MODELS**" }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-34", "text": "In practice, one models P (\u03c0, t | X) with a neural network." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-35", "text": "As inspired by HMMs, the model simplification of conditional independence can be tempered by multiple layers of (recurrent) bidirectional long short-term memory units (BLSTMs) [1] [2] [3] [4] ." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-36", "text": "However, these are computationally expensive (Table 1) , leading to simplifications like gated recurrent units (GRUs) [8, 32] ; furthermore, the success of the ReLU(x) = max(0, x) nonlinearity in preventing vanishing gradients enabled the use of vanilla bidirectional recurrent deep neural networks (BRDNNs) [5, 6, 33] to further reduce operations per layer." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-37", "text": "Convolutions over time and/or frequency were first used as initial layers to recurrent neural models, beginning with HMM-NNs [34] and later with CTC, where they are viewed as promoting invariance to temporal and spectral translation in ASR [8] , or image translation in handwriting recognition [35] ; they also serve as a form of dimensionality reduction (Section 2.4)." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-38", "text": "However, these networks were still bottlenecked by the sequentiality of operations at the recurrent layers, leading [8] to propose row convolutions for unidirectional RNNs, which had finite lookaheads to enable online processing while having some future context." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-39", "text": "This led to convolution-only CTC models for long-range temporal dependencies [9] [10] [11] ." 
}, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-40", "text": "However, these models have to be very deep (e.g., 17-19 convolutional layers on LibriSpeech [23] ) to cover the same context (Table 1) ." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-41", "text": "While in theory, a relatively local context could suffices for ASR, this is complicated by alphabets L which violate the conditional independence assumption of CTC (e.g., English characters [36] )." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-42", "text": "Wide contexts also enable incorporation of noise/speaker contexts, as [27] suggest regarding the broad-context attention heads in the first layer of their self-attentional LAS model." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-43", "text": "----------------------------------" }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-44", "text": "**MOTIVATING THE SELF-ATTENTION LAYER**" }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-45", "text": "We now replace recurrent and convolutional layers for CTC with self-attention [24] ." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-46", "text": "Our proposed framework ( Figure 1a ) is built around self-attention layers, as used in the Transformer encoder [22] , previous explorations of self-attention in ASR [19, 27] , and defined in Section 2.3." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-47", "text": "The other stages are downsampling, which reduces input length T via methods like those in Section 2.4; embedding, which learns a dh-dim. embedding that also describes token position (Section 2.5); and projection, where each final representation is mapped framewise to logits over the intermediate alphabet L ." 
}, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-48", "text": "The first implements self-attention, where the success of attention in CTC and encoder-decoder models [14, 31] is parallelized by using each position's representation to attend to all others, giving a contextualized representation for that position." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-49", "text": "Hence, the full receptive field is immediately available at the cost of O(T 2 ) inner products (Table 1) , enabling richer representations in fewer layers." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-50", "text": "----------------------------------" }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-51", "text": "**MODEL**" }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-52", "text": "Operations per layer" }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-53", "text": "----------------------------------" }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-54", "text": "**SEQUENTIAL OPERATIONS**" }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-55", "text": "Maximum path length Table 1 : Operation complexity of each layer type, based on [22] ." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-56", "text": "T is input length, d is no. of hidden units, and k is filter/context width." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-57", "text": "We also see inspiration from convolutional blocks: residual connections, layer normalization, and tied dense layers with ReLU for representation learning." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-58", "text": "In particular, multi-head attention is akin to having a number of infinitely-wide filters whose weights adapt to the content (allowing fewer \"filters\" to suffice)." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-59", "text": "One can also assign interpretations; for example, [27] argue their LAS self-attention heads are differentiated phoneme detectors." 
}, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-60", "text": "Further inductive biases like filter widths and causality could be expressed through time-restricted self-attention [26] and directed self-attention [25] , respectively." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-61", "text": "----------------------------------" }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-62", "text": "**FORMULATION**" }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-63", "text": "Let H \u2208 R T \u00d7d h denote a sublayer's input." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-64", "text": "The first sublayer performs multi-head, scaled dot-product, self-attention [22] ." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-65", "text": "For each head i of nhds, we learn linear maps W" }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-66", "text": ", and values V (i) of the i-th head, which combine to give" }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-67", "text": "where \u03c3 is row-wise softmax." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-68", "text": "Heads are concatenated along the dh/nhds axis to give MltHdAtt = [HdAtt (1) , . . . , HdAtt (n hds ) ]." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-69", "text": "The second sublayer is a position-wise feed-forward network [22] FFN(H) = ReLU(HW1 + b1)W2 + b2 where parameters" }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-70", "text": "with the biases b1, b2 broadcasted over all T positions." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-71", "text": "This sublayer aggregates the multiple heads at time t into the attention layer's final output at t. 
All together, the layer is given by LN(H' + FFN(H')), where H' = LN(H + MltHdAtt(H)) and LN denotes layer normalization, with residual connections around each sublayer." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-72", "text": "----------------------------------" }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-73", "text": "**DOWNSAMPLING**" }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-74", "text": "In speech, the input length T of frames can be many times larger than the output length U, in contrast to the roughly word-to-word setting of machine translation." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-75", "text": "This is especially prohibitive for self-attention in terms of memory: recall that an attention matrix of dimension" }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-76", "text": "T \u00d7 T is created, giving the T^2 factor in Table 1." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-77", "text": "A convolutional frontend is a typical downsampling strategy [8, 19]; however, we leave integrating other layer types into SAN-CTC as future work." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-78", "text": "Instead, we consider three fixed approaches, from least- to most-preserving of the input data: subsampling, which only takes every k-th frame; pooling, which aggregates every k consecutive frames via a statistic (average, maximum); and reshaping, where one concatenates k consecutive frames into one [27]." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-79", "text": "Note that CTC will still require U \u2264 T/k, however." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-80", "text": "----------------------------------" }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-81", "text": "**POSITION**" }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-82", "text": "Self-attention is inherently content-based [22], and so one often encodes position into the post-embedding vectors." 
}, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-83", "text": "We use standard trigonometric embeddings, where for 0 \u2264 i \u2264 demb/2, we define" }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-84", "text": "for position t. We consider three approaches: content-only [21] , which forgoes position encodings; additive [19] , which takes demb = dh and adds the encoding to the embedding; and concatenative, where one takes demb = 40 and concatenates it to the embedding." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-85", "text": "The latter was found necessary for self-attentional LAS [27] , as additive encodings did not give convergence." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-86", "text": "However, the monotonicity of CTC is a further positional inductive bias, which may enable the success of content-only and additive encodings." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-87", "text": "----------------------------------" }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-88", "text": "**EXPERIMENTS**" }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-89", "text": "We take (nlayers, dh, nheads, dff) = (10, 512, 8, 2048), giving \u223c30M parameters." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-90", "text": "This is on par with models on WSJ (10-30M) [4, 5, 9] and an order of magnitude below models on LibriSpeech (100-250M) [8, 23] ." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-91", "text": "We use MXNet [37] for modeling and Kaldi/EESEN [7, 38] for data preparation and decoding." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-92", "text": "Our self-attention code is based on GluonNLP's implementation." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-93", "text": "At train time, utterances are sorted by length: we exclude those longer than 1800 frames ( 1% of each training set)." 
}, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-94", "text": "We take a window of 25ms, a hop of 10ms, and concatenate cepstral mean-variance normalized features with temporal first-and second-order differences." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-95", "text": "1 We downsample by a factor of k = 3 (this also gave an ideal T /k \u2248 dh for our data; see Table 1 )." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-96", "text": "We perform Nesterov-accelerated gradient descent on batches of 20 utterances." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-97", "text": "As self-attention architectures can be unstable in early training, we clip gradients to a global norm of 1 and use the standard linear warmup period before inverse square decay associated with these architectures [19, 22] ." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-98", "text": "Let n denote the global step number of the batch (across epochs); the learning rate is given by" }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-99", "text": "1 Rescaling so that these differences also have var." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-100", "text": "\u2248 1 helped WSJ training." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-101", "text": "[17] 11.5 -9.0 -Enc-Dec (4-1) [17] 12.0 -8.2 -Enc-Dec+CTC (4-1) [17] 11.3 -7.4 -Enc-Dec (4-1) [39] --6.4 9.3 CTC/ASG (Gated CNN) [40] 6.9 9.5 4.9 6.6 Enc-Dec (2,1,3-1 Table 3 : CTC phoneme models with WFST decoding on WSJ." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-102", "text": "where we take \u03bb = 400 and nwarmup as a hyperparameter." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-103", "text": "However, such a decay led to early stagnation in validation accuracy, so we later divide the learning rate by 10 and run at the decayed rate for 20 epochs." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-104", "text": "We do this twice, then take the epoch with the best validation score." 
}, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-105", "text": "Xavier initialization gave validation accuracies of zero for the first few epochs, suggesting room for improvement." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-106", "text": "Like previous works on self-attention, we apply label smoothing (see Tables 2, 3, 5; we also tried model averaging to no gain)." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-107", "text": "To compute word error rates (WERs), we use the dataset's provided language model (LM) as incorporated by WFST decoding [7] to bridge the gap between CTC and encoder-decoder frameworks, allowing comparison with known benchmarks and informing systems that incorporate expert knowledge in this way (e.g., via a pronunciation lexicon)." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-108", "text": "----------------------------------" }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-109", "text": "**WALL STREET JOURNAL (WSJ)**" }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-110", "text": "We train both character-and phoneme-label systems on the 80-hour WSJ training set to validate our architectural choices." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-111", "text": "Similar to [17, 19] , we use 40-dim." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-112", "text": "mel-scale filter banks and hence 120-dim." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-113", "text": "features." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-114", "text": "We warmup for 8000 steps, use a dropout of 0.2, and switch schedules at epoch 40." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-115", "text": "For the WSJ dataset, we compare with similar MLE-trained, end-to-end, open-vocabulary systems in Table 2 ." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-116", "text": "We get an eval92 CER of 4.7%, outdoing all previous CTC-like results except 4.6% with a trainable frontend [40] ." 
}, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-117", "text": "We use the provided extended 3-gram LM to retrieve WERs." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-118", "text": "For phoneme training, our labels come from the CMU pronunciation lexicon (Table 3 )." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-119", "text": "These models train in one day (Tesla V100), comparable to the Speech Transformer [19] ; however, SAN-CTC gives further benefits at inference time as token predictions are generated in parallel." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-120", "text": "We also evaluate design choices in Table 4 ." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-121", "text": "Here, we consider the effects of downsampling and position encoding on accuracy for our fixed training regime." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-122", "text": "We see that unlike self-attentional LAS [27] , SAN-CTC works respectably even with no position en- coding; in fact, the contribution of position is relatively minor (compare with [21] , where location in an encoder-decoder system improved CER by 3% absolute)." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-123", "text": "Lossy downsampling appears to preserve performance in CER while degrading WER (as information about frame transitions is lost)." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-124", "text": "We believe these observations align with the monotonicity and independence assumptions of CTC." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-125", "text": "Inspired by [27] , we plot the standard deviation of attention weights for each head as training progresses; see Figure 2 for details." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-126", "text": "In the first layers, we similarly observe a differentiation of variances, along with wide-context heads; in later layers, unlike [27] we still see mild differentiation of variances." 
}, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-127", "text": "Inspired by [26] , we further plot the attention weights relative to the current time position (here, per head)." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-128", "text": "Character labels gave forward-and backward-attending heads (incidentally, averaging these would retrieve the bimodal distribution in [26] ) at all layers." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-129", "text": "This suggests a gradual expansion of context over depth, as is often engineered in convolutional CTC." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-130", "text": "This also suggests possibly using fewer heads, directed self-attention [25] , and restricted contexts for faster training (Table 1) ." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-131", "text": "Phoneme labels gave a sharp backward-attending head and more diffuse heads." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-132", "text": "We believe this to be a symptom of English characters being more contextdependent than phonemes (for example, emitting 'tt' requires looking ahead, as '-' must occur between two runs of 't' tokens)." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-133", "text": "----------------------------------" }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-134", "text": "**LIBRISPEECH**" }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-135", "text": "We give the first large-scale demonstration of a fully self-attentional ASR model using the LibriSpeech ASR corpus [42] , an English corpus produced from audio books giving 960 hours of training" }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-136", "text": "----------------------------------" }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-137", "text": "**MODEL**" }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-138", "text": "Tok. test-clean test-other CER WER CER WER CTC/ASG (Wav2Letter) [9] chr." 
}, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-139", "text": "6.9 7.2 --CTC (DS1-like) [33, 43] chr." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-140", "text": "-6.5 --Enc-Dec (4-4) [44] chr." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-141", "text": "6.5 -18.1 -Enc-Dec (6-1) [45] chr." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-142", "text": "4.5 -11.6 -CTC (DS2-like) [8, 32] chr." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-143", "text": "Table 5 ." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-144", "text": "At this scale, even minor label smoothing was detrimental." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-145", "text": "We run 70 epochs in slightly over a week (Tesla V100) then choose the epoch with the best validation score for testing." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-146", "text": "For comparison, the best CTC-like architecture [23] took 4-8 weeks on 4 GPUs for its results." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-147", "text": "2 The Enc-Dec+CTC model is comparable, taking almost a week on an older GPU (GTX 1080 Ti) to do its \u223c12.5 full passes over the data." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-148", "text": "3 Finally, we trained the same model with BPE subwords as CTC targets, to get more context-independent units [36] ." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-149", "text": "We did 300 merge operations (10k was unstable) and attained a CER of 7.4%." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-150", "text": "This gave a WER of 8.7% with no LM (compare with Table 5 's LMbased entries), and 5.2% with a subword WFST of the LM." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-151", "text": "We still observed attention heads in both directions in the first layer, suggesting our subwords were still more context-dependent than phonemes." 
}, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-152", "text": "----------------------------------" }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-153", "text": "**CONCLUSION**" }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-154", "text": "We introduced SAN-CTC, a novel framework which integrates a fully self-attentional network with a connectionist temporal classification loss." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-155", "text": "We addressed the challenges of adapting self-attention to CTC and to speech recognition, showing that SAN-CTC is competitive with or outperforms existing end-to-end models on WSJ and LibriSpeech." }, { "sent_id": "0e4deda746127b97f68080bc8f13c8-C001-156", "text": "Future avenues of work include multitasking SAN-CTC with other decoders or objectives, and streamlining network structure via directed or restricted attention." } ], "y": { "@BACK@": { "gold_contexts": [ [ "0e4deda746127b97f68080bc8f13c8-C001-12" ], [ "0e4deda746127b97f68080bc8f13c8-C001-13" ], [ "0e4deda746127b97f68080bc8f13c8-C001-64" ], [ "0e4deda746127b97f68080bc8f13c8-C001-69" ], [ "0e4deda746127b97f68080bc8f13c8-C001-82" ] ], "cite_sentences": [ "0e4deda746127b97f68080bc8f13c8-C001-12", "0e4deda746127b97f68080bc8f13c8-C001-13", "0e4deda746127b97f68080bc8f13c8-C001-64", "0e4deda746127b97f68080bc8f13c8-C001-69", "0e4deda746127b97f68080bc8f13c8-C001-82" ] }, "@USE@": { "gold_contexts": [ [ "0e4deda746127b97f68080bc8f13c8-C001-46" ], [ "0e4deda746127b97f68080bc8f13c8-C001-55" ] ], "cite_sentences": [ "0e4deda746127b97f68080bc8f13c8-C001-46", "0e4deda746127b97f68080bc8f13c8-C001-55" ] }, "@SIM@": { "gold_contexts": [ [ "0e4deda746127b97f68080bc8f13c8-C001-46" ], [ "0e4deda746127b97f68080bc8f13c8-C001-55" ] ], "cite_sentences": [ "0e4deda746127b97f68080bc8f13c8-C001-46", "0e4deda746127b97f68080bc8f13c8-C001-55" ] }, "@DIF@": { "gold_contexts": [ [ "0e4deda746127b97f68080bc8f13c8-C001-97" ] ], "cite_sentences": [ 
"0e4deda746127b97f68080bc8f13c8-C001-97" ] }, "@EXT@": { "gold_contexts": [ [ "0e4deda746127b97f68080bc8f13c8-C001-97" ] ], "cite_sentences": [ "0e4deda746127b97f68080bc8f13c8-C001-97" ] } } }, "ABC_652534f801dbff0c009c4a39fdef4d_24": { "x": [ { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-74", "text": "**SIG (44).**" }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-2", "text": "We present TranscRater, an open-source tool for automatic speech recognition (ASR) quality estimation (QE)." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-3", "text": "The tool allows users to perform ASR evaluation bypassing the need of reference transcripts and confidence information, which is common to current assessment protocols." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-4", "text": "TranscRater includes: i) methods to extract a variety of quality indicators from (signal, transcription) pairs and ii) machine learning algorithms which make possible to build ASR QE models exploiting the extracted features." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-5", "text": "Confirming the positive results of previous evaluations, new experiments with TranscRater indicate its effectiveness both in WER prediction and transcription ranking tasks." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-6", "text": "----------------------------------" }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-8", "text": "How to determine the quality of an automatic transcription without reference transcripts and without confidence information?" }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-9", "text": "This is the key problem addressed by research on ASR quality estimation C. 
de Souza et al., 2015; Jalalvand et al., 2015b) , and the task for which TranscRater, the tool described in this paper, has been designed." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-10", "text": "The work on ASR quality estimation (ASR QE) has several motivations." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-11", "text": "First, the steady increase of applications involving automatic speech recognition (e.g. video/TV programs subtitling, voice search engines, voice question answering, spoken dialog systems, meeting and broadcast news transcriptions) calls for an accurate method to estimate ASR output quality at run-time." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-12", "text": "Often, indeed, the nature of such applications (consider for instance spoken dialog systems) requires quick response capabilities that are incompatible with traditional reference-based protocols." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-13", "text": "Second, even when real-time processing is not a priority, standard evaluation based on computing word-error rate (WER) against gold references is not always a viable solution." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-14", "text": "In many situations (as in the case of languages for which even the ASR training data is scarce), the bottleneck represented by the limited availability of reference transcripts and the costs of their manual creation calls for a method to predict ASR output quality that is reference-independent." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-15", "text": "Third, even when designed to bypass the need of references, current quality prediction methods heavily depend on confidence information about the inner workings of the ASR system that produced the transcriptions (Evermann and Woodland, 2000; Wessel et al., 2001) ." 
}, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-16", "text": "Such information, describing how the system is certain about the quality of its own hypotheses, often reflects a biased perspective influenced by individual decoder features." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-17", "text": "More importantly, it is not always accessible and, in this frequent case, the sole elements available for quality prediction are the signal and its transcription (consider, for instance, the increasing amount of captioned Youtube videos generated by a \"black-box\" ASR system 1 )." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-18", "text": "These issues call for a method to predict ASR output quality that is also confidence-independent." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-19", "text": "TranscRater (Transcription Rater) provides a unified ASR QE framework designed to meet the three aforementioned requirements." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-20", "text": "Its development was inspired by software previously released for the machine translation (MT) (Specia et al., 2013; Shah et al., 2014; Servan et al., 2015) equivalent of ASR QE, in which MT quality has to be estimated at run-time and without reference trans-lations (Mehdad et al., 2012; Camargo de Souza et al., 2013; ." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-21", "text": "Indeed, the two tasks deal with similar issues." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-22", "text": "In both cases, we have an input \"source\" (a written sentence and a recorded signal) and an output text (a translation and a transcription) that has to be assessed without any pre-created term of comparison." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-23", "text": "They can also be approached with similar supervised classification (C. de Souza et al., 2015) or regression strategies C. de Souza et al., 2015) ." 
}, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-24", "text": "Finally, they have similar applications like:" }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-25", "text": "\u2022 Deciding if an input source has been correctly processed;" }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-26", "text": "\u2022 Ranking the output of multiple independent systems (Jalalvand et al., 2015b );" }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-27", "text": "\u2022 Estimating the human effort required to manually revise an output segment;" }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-28", "text": "\u2022 Performing data selection for system improvement based on active learning." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-29", "text": "To support these applications, TranscRater provides an extensible ASR QE framework consisting of a variety of feature extractors and machine learning algorithms." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-30", "text": "The implemented feature extraction methods allow capturing predictive quality indicators both from the input signal and from the output transcription." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-31", "text": "This basic set of \"black box\" indicators has been successfully evaluated in a number of experiments, both on regression and on classification tasks, showing that ASR QE predictions can closely approximate the quality scores obtained with standard reference-based methods." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-32", "text": "The existing feature extractors can be easily extended to integrate new features, either capturing additional system-independent aspects, or relying on confidence information about the ASR system that produced the transcriptions, if available." 
}, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-33", "text": "Experimental results demonstrate that, also in the \"glass-box\" scenario in which the ASR system is known, the available features are able to improve the performance obtained with confidence information." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-34", "text": "The integration of different machine learning algorithms makes TranscRater a powerful framework to quickly set up an ASR QE model given some training data, tune it by choosing among the possible feature configurations and process new, unseen test data to predict their quality." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-35", "text": "As a standalone environment, with few documented external dependencies, TranscRater provides the first offthe-shelf solution to approach ASR QE and extend its application to new scenarios." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-36", "text": "----------------------------------" }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-37", "text": "**ASR QE**" }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-38", "text": "The basic ASR QE task consists in training a model from (signal, transcription, label) triplets, and using it to return quality predictions for a test set of unseen (signal, transcription) instances." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-39", "text": "In this supervised learning setting, the training labels can be either numeric scores or class identifiers (binary or multi-class) (C. de Souza et al., 2015) ." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-40", "text": "Class assignments can be manually done according to some criteria, or inferred by thresholding numeric scores." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-41", "text": "Numeric quality indicators can be easily obtained by measuring the similarity (or the distance) between the transcription and its manually-created reference." 
}, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-42", "text": "For instance, the models described in previous works on ASR QE learn from training data labelled with real values obtained by computing the transcription word error rate (WER 2 )." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-43", "text": "According to the type of training labels, the problem can be approached either as a regression or as a classification task." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-44", "text": "As a consequence, also the evaluation metrics will change." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-45", "text": "Precision/recall/F1 (or other metrics, such as balanced accuracy, in case of very unbalanced distributions) will be used for classification while, similar to MT QE, the mean absolute error (MAE) or similar metrics will be used for regression." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-46", "text": "A variant of the basic ASR QE task is to consider it as a QE-based ranking problem (Jalalvand et al., 2015b) , in which each utterance is captured by multiple microphones or transcribed by multiple ASR systems." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-47", "text": "In this case, the capability to rank transcriptions from the best to the worst can be evaluated in terms of normalized discounted cumulative gain (NDCG) or similar metrics." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-48", "text": "----------------------------------" }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-49", "text": "**THE TRANSCRATER TOOL**" }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-75", "text": "Signal features are extracted using the OpenSmile 4 toolkit (Eyben et al., 2013) ." 
}, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-50", "text": "TranscRater combines in a single open-source framework: i) a set of features capturing different aspects of transcription quality and ii) different learning algorithms suitable to address the challenges posed by different application scenarios." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-51", "text": "TranscRater internally consists of two main modules: feature extraction and machine learning." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-52", "text": "At training stage, the tool receives as input a set of signal recordings, their transcriptions and the corresponding reference transcripts." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-53", "text": "The speech signals are provided as separate files in the RIFF Microsoft PCM format with 16K sampling rate." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-54", "text": "Their transcriptions and the corresponding references are provided in single separate text files (one transcription per row)." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-55", "text": "References are used to compute the WER label of each training instance, thus connecting the problem to the task formulation provided in \u00a72." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-56", "text": "The features extracted from each training instance are passed to the learning module, together with the corresponding label." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-57", "text": "The label is a WER score which, depending on the type of problem addressed, can be used either to directly train a regressor or to infer a ranking for multiple hypotheses." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-58", "text": "In either case, the learning module will train the corresponding model with the proper learning algorithm, and tune it using kfold cross-validation." 
}, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-59", "text": "At test stage, the model is used to predict the label of new, unseen (signal, transcription) instances." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-60", "text": "For each test point, the output is either a WER prediction or a rank, whose reliability can be respectively evaluated in terms of MAE or NDCG (as discussed in \u00a72)." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-61", "text": "Output predictions are provided in a single file (one WER prediction per row for regression and one rank prediction per row for ranking)." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-62", "text": "MAE or NDCG scores are provided as the standard output of the test functions." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-63", "text": "Internally, TranscRater stores the extracted features in the SVM-light 3 format." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-64", "text": "This makes possible to use the tool as a feature extractor and to embed it in applications different from the ones described in this paper." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-65", "text": "The features to be used, the type of learning algorithm, the input files and the links to resources and libraries can be easily set through a configuration file." 
}, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-66", "text": "----------------------------------" }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-67", "text": "**FEATURE SETS**" }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-68", "text": "The feature extraction module of TranscRater allows the user to extract 72 features that can be categorized in the following four groups:" }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-69", "text": "\u2022 Signal (SIG) features, designed to capture the difficulty of transcribing the input signal given the general recording conditions in which it was acquired;" }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-70", "text": "\u2022 Lexical (LEX) features, designed to capture 3 http://svmlight.joachims.org/ the difficulty to transcribe the input signal given the pronunciation difficulty and the ambiguity of the terms it contains;" }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-71", "text": "\u2022 Language model (LM) features, designed to capture the plausibility of the transcription from the fluency point of view;" }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-72", "text": "\u2022 Part-of-speech (POS) features, designed to capture the plausibility of the transcription from the syntax point of view." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-73", "text": "----------------------------------" }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-76", "text": "Each speech signal is broken down into 25ms length frames with 10ms overlap." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-77", "text": "For each frame, we compute 13 Mel Frequency Cepstral Coefficients (MFCC), their delta, acceleration and log-energy as well as the prosody features like fundamental frequency (F0), voicing probability, loudness contours and pitch." 
}, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-78", "text": "The final SIG feature vector for the entire input signal is obtained by averaging the values of each feature computed on all the frames." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-79", "text": "LEX (7)." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-80", "text": "Lexicon-based features are extracted using a lexical feature dictionary (optionally provided by the user)." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-81", "text": "In this dictionary each individual word is assigned to a feature vector containing the frequency of fricatives, liquids, nasals, stops and vowels in its pronunciation." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-82", "text": "Other elements of the vector are the number of homophones (words with the same pronunciation) and quasihomophones (words with similar pronunciation)." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-83", "text": "LM (12)." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-84", "text": "Language model features include the mean of word probabilities, the sum of the log probabilities and the perplexity score for each transcription." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-85", "text": "In previous experiments (Jalalvand et al., 2015b; Jalalvand and Falavigna, 2015) we showed that, instead of only one LM, using a combination of neural network and n-gram LMs trained on task-specific and generic data can significantly improve the accuracy of quality prediction." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-86", "text": "For this reason, TranscRater allows using up to four different language models: two RNNLM (Mikolov et al., 2010 ) trained on generic and specific data and two n-gramLM trained on generic and specific data." 
}, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-87", "text": "To work with neural network LMs, the tool makes use of RNNLM, 5 while for n-gram LMs it uses SRILM 6 (Stolcke et al., 2000) ." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-88", "text": "----------------------------------" }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-89", "text": "**POS (9)**" }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-90", "text": ". Part-of-speech features are extracted using the TreeTagger." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-91", "text": "7 For each word in the transcription, they consider the score assigned to the predicted POS of the word itself, the previous and the following one." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-92", "text": "This sliding window is used to compute the average value for the entire transcription and obtain the sentence-level POS feature vector." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-93", "text": "The intuition is that a low confidence of the POS tagger in labeling a sentence is an indicator of possible syntax issues and, in turn, of poor transcription quality." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-94", "text": "POS features also include the number and the percentage of content words (numbers, nouns, verbs, adjectives, adverbs)." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-95", "text": "These feature groups were successfully tested in various conditions including clean/noisy data, single/multiple microphones and ASR systems (Jalalvand et al., 2015b; Jalalvand et al., 2015a) ." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-96", "text": "In such conditions, they proved to be a reliable predictor when confidence information about the ASR system inner workings is not accessible." 
}, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-97", "text": "----------------------------------" }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-98", "text": "**LEARNING ALGORITHMS**" }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-99", "text": "For regression-based tasks (WER prediction), TranscRater includes an interface to the Scikitlearn package (Pedregosa et al., 2011) , a Python machine learning library that contains a large set of classification and regression algorithms." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-100", "text": "Based on the empirical results reported in C. de Souza et al., 2015; Jalalvand et al., 2015b) , which indicate that Extremely Randomized Trees (XRT (Geurts et al., 2006) ) is a very competitive algorithm in several WER prediction tasks, the current version of the tool exploits XRT." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-101", "text": "However, adapting the interface to apply other algorithms is an easy task and one of the future extension directions." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-102", "text": "The main hyper-parameters of the model, such as the number of tree bags, the number of trees per bag, the number of features per tree and the number of instances in the leaves, are tuned using grid search with k-fold cross-validation on the training set to minimize the mean absolute error (MAE) between the true WERs and the predicted ones." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-103", "text": "As mentioned before, TranscRater provides the possibility to evaluate multiple transcriptions (e.g. obtained from different microphones or ASR systems) and rank them based on their quality." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-104", "text": "This can be done either indirectly, by exploiting the predicted WER labels in a \"ranking by regression\" approach (RR) or directly, by exploiting machinelearned ranking methods (MLR)." 
}, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-105", "text": "To train and test MLR models, TranscRater exploits RankLib 8 , a library of learning-to-rank algorithms." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-106", "text": "The current version of the tool includes an interface to the Random Forest algorithm (RF (Breiman, 2001) ), the same used in (Jalalvand et al., 2015b) ." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-107", "text": "MLR predicts ranks through pairwise comparison between the transcriptions." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-108", "text": "The main parameters such as the number of bags, the number of trees per bag and the number of leaves per tree are tuned on training set using k-fold cross-validation to maximize the NDCG measure." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-109", "text": "----------------------------------" }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-110", "text": "**IMPLEMENTATION**" }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-111", "text": "TranscRater is written in Python and is made of several parts linked together using bash scripts." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-112", "text": "In order to run the toolkit on Linux, the following libraries are required: i) Java 8 (JDK-1.8); ii) Python 2.7 (or above) and iii) Scikit-learn (version 0.15.2)." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-113", "text": "Moreover, the user has to download and compile the following libraries: OpenSmile, RNNLM, SRILM and TreeTagger for the feature extraction module as well as RankLib for using machine-learned ranking option." 
}, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-114", "text": "----------------------------------" }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-115", "text": "**BENCHMARKING**" }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-116", "text": "The features and algorithms contained in TranscRater have been successfully used in previous works C. de Souza et al., 2015; Jalalvand et al., 2015b; Jalalvand et al., 2015a) ." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-117", "text": "To further investigate their effectiveness, in this section we provide new results, both in WER prediction (MAE) and transcription ranking (NDCG), together with some efficiency analysis (Time in seconds 9 )." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-118", "text": "To this aim, we use data from the 3 rd CHiME challenge, 10 which were collected for multiple distant microphone speech recognition in noisy environments (Barker et al., 2015) ." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-119", "text": "CHiME-3 data consists of sentences of the Wall Street Journal corpus, uttered by four speakers in four noisy environments, and recorded by five microphones placed on the frame of a tablet PC (a sixth one, placed on the back, mainly records background noise)." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-120", "text": "Training and test respectively contain 1,640 and 1,320 sentences." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-143", "text": "We presented TranscRater, an open-source tool for ASR quality estimation." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-121", "text": "Transcriptions are produced by a baseline ASR system, provided by the task organizers, which uses the deep neural network recipe of Kaldi (Povey et al., 2011) ." 
}, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-122", "text": "In WER prediction, different models built with TranscRater are compared with a baseline commonly used for regression tasks, which labels all the test instances with the average WER value computed on the training set." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-123", "text": "In ranking mode, baseline results are computed by averaging the NDCG scores obtained in one hundred iterations in which test instances are randomly ranked." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-124", "text": "Table 1 shows the results of models trained with different feature groups for WER prediction with a single microphone." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-125", "text": "In terms of time, in this as in the following experiments, the total time (feature extraction + training + test) is mostly determined by feature extraction and the bottleneck is clearly represented by the extraction of signal (SIG) features." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-126", "text": "In terms of MAE, SIG features are also those achieving the worst result." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-127", "text": "Although they significantly improve over the baseline, they are outperformed by LEX+LM+POS and, even in combination with them, they do not help." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-128", "text": "However, as suggested by previous works like ( in which some of the SIG features are among the most predictive ones, the usefulness of signal features highly depends on data and, in specific conditions, they definitely improve results." 
}, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-129", "text": "Their ineffectiveness in the experiments of this paper likely depends on the lack of wordlevel time boundaries, which prevented us to compute more discriminative features like word logenergies, noise log-energies and signal-to-noise ratio (the best indicator of the acoustic quality of an input utterance)." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-130", "text": "Table 2 shows the results achieved by the same feature groups when ranking by regression (RR) the transcriptions from five microphones." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-131", "text": "In terms of computation time, the higher costs of SIG features are still evident (the significant increase for all groups is due to the higher number of audio files to be processed)." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-132", "text": "Also in this case, SIG features do not help, neither alone nor in combination with the other groups." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-133", "text": "Indeed, the highest results are achieved by the combination of LEX+LM+POS." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-134", "text": "Their large NDCG improvement over the baseline (+6.8), combined with the significantly lower computation time, seems to make this combination particularly suitable for the ranking by regression strategy." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-135", "text": "Table 3 shows the results achieved, in the same multi-microphone scenario, by the machinelearned ranking approach (MLR)." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-136", "text": "In terms of time, MLR is slightly more efficient than RR, at least on this dataset." 
}, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-137", "text": "Though surprising (MLR performs lots of pairwise comparisons, which are in principle more demanding), such difference is not very informative as it might depend on hyperparameter settings (e.g. the number of iterations for XRT, manually set to 20), whose optimization was out of the scope of our analysis." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-138", "text": "In terms of NDCG, the results are higher compared to RR but the differences between feature groups are confirmed." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-139", "text": "Interestingly, with MLR even the SIG features in isolation significantly improve over the baseline (+4.5 points)." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-140", "text": "The NDCG improvement with the combined feature groups is up to 9.5 points, confirming the effectiveness of the combined features shown in previous works." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-141", "text": "----------------------------------" }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-142", "text": "**CONCLUSION**" }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-144", "text": "TranscRater provides an extensible framework including feature extractors, machine learning algorithms (for WER prediction and transcription ranking), optimization and evaluation functions." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-145", "text": "Its source code can be downloaded from https://github." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-146", "text": "com/hlt-mt/TranscRater." }, { "sent_id": "652534f801dbff0c009c4a39fdef4d-C001-147", "text": "Its license is FreeBSD, a lax permissive non-copyleft license, compatible with the GNU GPL and with any use, including commercial." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "652534f801dbff0c009c4a39fdef4d-C001-9" ], [ "652534f801dbff0c009c4a39fdef4d-C001-46" ], [ "652534f801dbff0c009c4a39fdef4d-C001-100" ], [ "652534f801dbff0c009c4a39fdef4d-C001-116" ] ], "cite_sentences": [ "652534f801dbff0c009c4a39fdef4d-C001-9", "652534f801dbff0c009c4a39fdef4d-C001-46", "652534f801dbff0c009c4a39fdef4d-C001-100", "652534f801dbff0c009c4a39fdef4d-C001-116" ] }, "@SIM@": { "gold_contexts": [ [ "652534f801dbff0c009c4a39fdef4d-C001-106" ] ], "cite_sentences": [ "652534f801dbff0c009c4a39fdef4d-C001-106" ] }, "@USE@": { "gold_contexts": [ [ "652534f801dbff0c009c4a39fdef4d-C001-106" ] ], "cite_sentences": [ "652534f801dbff0c009c4a39fdef4d-C001-106" ] } } }, "ABC_db42e01dbc86b77335a0e488ff85e2_24": { "x": [ { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-7", "text": "**I. INTRODUCTION**" }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-2", "text": "Abstract-Recent papers in neural machine translation have proposed the strict use of attention mechanisms over previous standards such as recurrent and convolutional neural networks (RNNs and CNNs)." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-3", "text": "We propose that by running traditionally stacked encoding branches from encoder-decoder attentionfocused architectures in parallel, that even more sequential operations can be removed from the model and thereby decrease training time." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-4", "text": "In particular, we modify the recently published attention-based architecture called Transformer by Google, by replacing sequential attention modules with parallel ones, reducing the amount of training time and substantially improving BLEU scores at the same time." 
}, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-5", "text": "Experiments over the English to German and English to French translation tasks show that our model establishes a new state of the art." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-6", "text": "----------------------------------" }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-31", "text": "Google's earlier and path-breaking endto-end translation approach [9] uses 16 LSTM layers with attention; once again, ensembling produces the best results." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-32", "text": "Facebook's end-to-end translation approach [10] depends entirely on CNNs with attention mechanism." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-33", "text": "Our work reported in this paper is based on another translation work by Google." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-34", "text": "Google's Vaswani et al. [3] proposed the reduction in the sequential steps seen in CNNs and RNNs." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-35", "text": "The sole use of attention mechanisms and feed-forward networks within the common encoder-decoder sequential model replaces the necessity of deep convolutions for distant dependent relationships, and the memory and computation intensive operations required within recurrent networks." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-36", "text": "Original training and testing by Vaswani et al. were over both the WMT 2014 English-French (EN-FE) and English-German (EN-DE) data sets, while this paper uses only the WMT 2014 EN-DE set and the IWSLT 2014 EN-DE and EN-FR data sets." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-37", "text": "This model is discussed later in the paper." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-38", "text": "Works in the field of NMT recommend a particular focus on the encoder." 
}, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-39", "text": "Analysis by Domhan [11] poses two questions: what type of attention is needed, and where." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-40", "text": "In this analysis, self-attention had a higher correspondence with accuracy when placed in the encoder section of the architecture than the decoder, even claiming that the decoder, when replaced with a CNN or RNN, retained the same accuracy with little to no loss in robustness." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-41", "text": "Imamura, Fujita, and Sumita's [12] study shows that the current paradigm of using high-volume sets of parallel corpora are sufficient for decoders but are unreliable for the encoder." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-42", "text": "These conclusions encourage further research in the manipulation of position and design of the encoder and attention mechanisms within them." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-43", "text": "----------------------------------" }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-44", "text": "**III. ARCHITECTURE**" }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-45", "text": "The Transformer architectures proposed by Vaswani et al. [3] , seen in Figure 1 , inspires this paper's work." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-46", "text": "We have made modifications to this architecture, to make it more efficient." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-47", "text": "However, our modifications can be applied to any encoder-decoder based model and is architecture-agnostic." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-48", "text": "These alterations follow from the following two hypotheses." 
}, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-49", "text": "1) Reduction in the number of required sequential operations throughout the encoder section is likely to reduce training time without reducing performance." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-50", "text": "2) Replacing the subsequent encoder attention stack is expected to result in discarding of inter-dependencies, and possibly incorrect, assumptions of encoder attention mechanisms and layers, improving performance." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-51", "text": "For simplification, but without loss of generalization, this paper discusses the use and modification of Transformer based-models." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-52", "text": "The original Transformer model is composed of stacked self-attention layers." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-53", "text": "These self-attention mechanisms compare and relate multiple positions of one sequence in order to find a representation of itself." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-54", "text": "In Figure 1 , we see such attention layers, one working on the input embedding, another on the output embedding, and the third on the both the input and the output embeddings." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-55", "text": "Each of these layers contains two main sub-layers including multi-head self attention, which feeds a simple feed-forward network, and a final layer of normalization." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-56", "text": "Around each of the main sub-layers, a skip or residual connection [13] is also used." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-57", "text": "This same structure is used in the decoder with an attention mask to avoid attending to subsequent positions." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-58", "text": "The attention mechanism used by Vaswani et al. 
[3] can be thought of as a function that maps a query and set of keyvalue pairs to an output." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-59", "text": "The query, keys, values and output are all vectors." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-60", "text": "The output is obtained as a weighted sum of the values." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-61", "text": "The weight given to a value is learned by the system by considering how compatible the query is to the corresponding key." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-62", "text": "The particular form of attention used is called scaled dot-product attention." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-63", "text": "This is due to the mechanism being homologous to a scaled version of the multiplicative attention proposed by Luong, Pham, and Manning [14] ." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-64", "text": "Several attention layers used in parallel constitute what is called multi-head attention." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-65", "text": "A brief description the proposed modifications of this architecture is discussed below." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-66", "text": "----------------------------------" }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-67", "text": "**A. PARALLEL ENCODING BRANCHES**" }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-68", "text": "A motivation for creating the Transformer model was the sluggish training and generation times of other common sequence-to-sequence models such as RNNs and CNNs [3] ." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-69", "text": "This was done by simplifying and limiting sequential operations and computational requirements while also increasing the model's ability to exploit current hardware architecture." 
}, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-70", "text": "This paper proposes that removal of the previously stacked branches of the encoder (there is a stack of N encoder and other blocks on the left side of Figure 1 ), parallelizing these separate encoder 'trees', and incorporating their learned results for the decoder, will further eliminate sequential steps and accelerate learning within current sequence-to-sequence models." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-71", "text": "The architectures discussed are modeled in Figure 2 ." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-72", "text": "Alterations to this parallel Transformer model were made and the following models were trained, tested, and are discussed in this paper:" }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-73", "text": "\u2022 Additive Parallel Attention (APA), \u2022 Attended Concatenated Parallel Attention (ACPA), and" }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-74", "text": "\u2022 Attended Additive Parallel Attention (AAPA)." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-75", "text": "----------------------------------" }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-76", "text": "**B. MODEL VARIATIONS**" }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-77", "text": "Additive Parallel Attention (APA): We replace the entire stack of (multi-head attention, add and normalize, feed forward, add and normalize) repeated N times on the original Transformer architecture on the left column, on the input side." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-78", "text": "We instead have several such attention sub-networks in parallel." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-79", "text": "The output layers of these networks contain attention embeddings for the input." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-80", "text": "The values at the output layers among the stacks are added." 
}, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-81", "text": "This model is seen to the left in Fig. 2 ." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-82", "text": "Attended Concatenated Parallel Attention (ACPA): This approach is similar to APA and AAPA, but the values at the output layers of the attention sub-networks are concatenated instead of being added." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-83", "text": "This model is seen in the middle of Fig. 2 ." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-84", "text": "Attended Additive Parallel Attention (AAPA): This model is built similarly to the APA model." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-85", "text": "However, it removes one of the parallel stacks and uses it as a final sequential attention mechanism over the additive results." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-86", "text": "This model is seen to the right in Fig. 2 ." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-87", "text": "When incorporating the results of the parallel encoding branches, two models of thought are pursued: additive and concatenation." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-88", "text": "The APA and AAPA models directly add the results of all encoding branches, whereas the ACPA models concatenate all encoding results and use a simple non-linear layer to learn a dimension-reduction among all attention branches." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-89", "text": "The attended parts of both the ACPA and AAPA models incorporate a final attention layer over all encoding branches before they are sent to the decoding layers." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-90", "text": "----------------------------------" }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-91", "text": "**IV. 
EXPERIMENTS AND EVALUATION**" }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-92", "text": "All proposed architectures including the base Transformer model [3] are trained over the International Workshop on Spoken Language Translation (IWSLT) 2016 corpus and tested similarly over the IWSLT 2014 test corpus [15] ." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-93", "text": "The training corpus includes over 200,000 parallel sentence pairs, and 4 million tokens for each language." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-94", "text": "The testing set contains 1,250 sentences, and 20-30 thousand tokens for French and German." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-95", "text": "This paper also performed experiments over the larger WMT data set including 4.5 and 36 million training sentence pairs for the EN-DE and EN-FR tasks respectively." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-96", "text": "The testing set for these experiments was the standard Newstest 2014 test set including around 3000 sentence pairs for each language task." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-97", "text": "These statistics are noted in Table I a full measure of the tested models and robustness to both short and long input." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-98", "text": "Across all models, a greedy-decoding function for both training and testing time, the Kullback-Leibler divergence loss function, the Adam optimizer [16] , and the number of training epochs (10) were kept constant." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-99", "text": "The training and testing were done using the NMT task of English to German (EN-DE) and IWSLT English to French and English to German translation and each network was trained using one graphics processing unit (GPU)." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-100", "text": "The utilized machine GPU configuration was one NVIDIA GTX 1070." 
}, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-101", "text": "For the assessment of each model and translation task this paper uses the bilingual evaluation understudy (BLEU) metric [17] ." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-102", "text": "This is a modified precision calculation using ngrams such as unigram, grouped unigrams, and bigrams." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-103", "text": "The BLEU metric claims to have a high correlation to translation quality judgments made by humans." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-104", "text": "BLEU computes scores for individual sentences by comparing them with good quality reference translations." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-105", "text": "The individual scores are averaged over the the entire corpus, without taking intelligibility or grammatical correctness into account." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-106", "text": "----------------------------------" }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-107", "text": "**V. RESULTS**" }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-108", "text": "----------------------------------" }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-109", "text": "**A. ATTENTION VISUALIZATION**" }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-110", "text": "One concern during early hypothesis testing was that if each attention branch looks at the same input, that each one would learn to focus on the same properties of the original embedding." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-111", "text": "However, through visualization of each attention layer, it is obvious that regardless of the same input, the encoder branches through random initialization learn different focuses as seen in Figure 5 ." 
}, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-112", "text": "The final branch for the attended models however would learn very light to no attention weights as seen in Figure 6 ." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-113", "text": "This is one area of research this group wishes to pursue in the future." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-114", "text": "Table II shows that the AAPA model consistently performed on average nearly ten points higher in the BLEU metric on the English to German translation task on the IWSLT 2014 test set." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-115", "text": "It also performed very well on the English to French translation task." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-116", "text": "----------------------------------" }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-117", "text": "**B. MACHINE TRANSLATION**" }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-118", "text": "On the much larger WMT English-German test set, all our models achieve better results then Vaswani et al. [3] ." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-119", "text": "Our model with five parallel encoding branches has a BLEU score of 62.69 compared to 60.95 and 61.00 for the two Transformers shown in Table III ." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-120", "text": "Our approach also takes considerably less time than the large Transformer model with a stack of eight encoder attention heads, although it is a little slower than the smaller Transformer model reported by Vaswani et al. [3] ." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-121", "text": "In terms of the BLEU metric, we establish state-of-the-art performance for both EN-DE and EN-FR translation considering the IWSLT 2014, and comparable results for the WMT data sets." 
}, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-122", "text": "Since our results came up very good, surpassing state of the art for the IWSLT 2014 dataset, we ran our experiments multiple times to ensure the results are correct." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-123", "text": "During the Transformer and attended parallel model's training lifetime, it can be seen that loss was consistently lower for our modified parallel model with five parallel stacks as seen in Figure 3 ." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-124", "text": "In this task, loss doesn't always correspond to a higher metric, in this case our model also shows a continuous higher score in the BLEU metric over the validation set while the Transformer shows signs of plateauing early on Figure 4 ." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-125", "text": "However, our parallelized model did have a slightly higher training time over a single GPU." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-126", "text": "One final experiment conducted to improve this drawback, also seen in the same table, is the reduction of number of parallel branches in the encoder." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-127", "text": "By reducing the number incrementally, our BLEU score stays equivalent to higher perplexity layers, but linearly reduces the run-time." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-128", "text": "----------------------------------" }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-129", "text": "**VI. CONCLUSIONS**" }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-130", "text": "In step with the goals of the original Transformer, this work continued to pursue the removal of sequential operations within attention-based translation models." 
}, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-131", "text": "Although dependent on choice of tool-kit implementation as shown in Table IV , this new parallelized Transformer model reaches a new state-ofthe-art in machine translation and provides multiple new directions for future research." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-132", "text": "It also shows through random initialization that attention mechanisms can learn different focuses and that by eliminating possibly negative inter-dependencies among them, superior results can be obtained." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-133", "text": "----------------------------------" }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-134", "text": "**VII. ACKNOWLEDGEMENT**" }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-8", "text": "Historically, statistical machine translation involved extensive work in the alignment of words and phrases developed by linguistic experts working with computer scientists [1] ." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-9", "text": "Deep Learning surpasses these historically used methods and has primarily replaced these with the recent use of neural machine translation (NMT)." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-10", "text": "The predominant design of the state of the art is the encoder-decoder model." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-11", "text": "The encoder takes sequential text, turning it into an internal representation." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-12", "text": "The decoder then takes this internal representation and generates a subsequent output." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-13", "text": "Since their emergence, attention mechanisms [2] have added to the effectiveness of the encoder decoder model and have been at the forefront of machine translation." 
}, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-14", "text": "Attention mechanisms help the neural system focus on parts of the input, and possibly the output as it learns to translate." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-15", "text": "This concentration facilitates the capturing of dependencies between parts of the input and the output." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-16", "text": "After training the network, the attention mechanism enables the system to perform translations that can handle issues such as the movement of words and phrases, and fertility." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-17", "text": "However, even with these attention mechanisms, NMT models have their drawbacks, which include long training time and high computational requirements." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-18", "text": "Recent papers [3] , [4] in neural machine translation have proposed the strict use of attention mechanisms in networks such as the Transformer over previous approaches such as recurrent neural networks (RNNs) [5] and convolutional neural networks (CNNs) [6] ." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-19", "text": "In other words, these approaches dispense with recurrences and convolutions entirely." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-20", "text": "In practice, attention mechanisms have mostly been used with recurrent architectures because removing the recurrent nature of the architecture makes the training more efficient by the removal of necessary sequential steps." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-21", "text": "This paper contributes by continuing to pursue the removal of sequential operations within encoder-decoder models." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-22", "text": "These operations are removed through the parallelization of previously stacked encoder layers." 
}, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-23", "text": "This new parallelized model can obtain a new state of the art in machine translation after being trained on one NVIDIA GTX 1070 for as little as three hours." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-24", "text": "The paper includes the following: a discussion of related work in the field of machine translation including encoderdecoder models and attention mechanisms; an explanation of the proposed novel architecture with motivations; and a description of the used methodology, along with evaluation including used data sets, hardware, hyper-parameters, and metrics." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-25", "text": "This paper concludes with results and possible avenues for future research." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-26", "text": "----------------------------------" }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-27", "text": "**II. RELATED WORK**" }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-28", "text": "There has been a plethora of work in the past several years on end-to-end neural translation." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-29", "text": "ByteNet [7] uses CNNs with dilated convolutions for both encoding and decoding." }, { "sent_id": "db42e01dbc86b77335a0e488ff85e2-C001-30", "text": "Zhou et al. [8] use stacked interleaved bi-directional LSTM layers (up to 16 layers) with skipped connections; ensembling gives the best results." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "db42e01dbc86b77335a0e488ff85e2-C001-18" ], [ "db42e01dbc86b77335a0e488ff85e2-C001-34" ], [ "db42e01dbc86b77335a0e488ff85e2-C001-58" ], [ "db42e01dbc86b77335a0e488ff85e2-C001-68" ], [ "db42e01dbc86b77335a0e488ff85e2-C001-92" ] ], "cite_sentences": [ "db42e01dbc86b77335a0e488ff85e2-C001-18", "db42e01dbc86b77335a0e488ff85e2-C001-34", "db42e01dbc86b77335a0e488ff85e2-C001-58", "db42e01dbc86b77335a0e488ff85e2-C001-68", "db42e01dbc86b77335a0e488ff85e2-C001-92" ] }, "@SIM@": { "gold_contexts": [ [ "db42e01dbc86b77335a0e488ff85e2-C001-45" ] ], "cite_sentences": [ "db42e01dbc86b77335a0e488ff85e2-C001-45" ] }, "@DIF@": { "gold_contexts": [ [ "db42e01dbc86b77335a0e488ff85e2-C001-118" ], [ "db42e01dbc86b77335a0e488ff85e2-C001-120" ] ], "cite_sentences": [ "db42e01dbc86b77335a0e488ff85e2-C001-118", "db42e01dbc86b77335a0e488ff85e2-C001-120" ] } } }, "ABC_0763666190b6b4be1bcf494d7c6fe2_24": { "x": [ { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-100", "text": "The resulting roles assignment groups 116 verbs into 12 VerbNet classes, each associated with a unique thematic grid." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-2", "text": "This paper introduces a novel unsupervised approach to semantic role induction that uses a generative Bayesian model." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-3", "text": "To the best of our knowledge, it is the first model that jointly clusters syntactic verbs arguments into semantic roles, and also creates verbs classes according to the syntactic frames accepted by the verbs." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-4", "text": "The model is evaluated on French and English, outperforming, in both cases, a strong baseline." 
}, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-5", "text": "On English, it achieves results comparable to state-of-the-art unsupervised approaches to semantic role induction." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-6", "text": "----------------------------------" }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-7", "text": "**INTRODUCTION AND BACKGROUND**" }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-8", "text": "Semantic Role Labeling (SRL) is a major task in Natural Language Processing which provides a shallow semantic parsing of a text." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-9", "text": "Its primary goal is to identify and label the semantic relations that hold between predicates (typically verbs), and their associated arguments ." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-10", "text": "The extensive research carried out in this area resulted in a variety of annotated resources, which, in time, opened up new possibilities for supervised SRL systems." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-11", "text": "Although such systems show very good performance, they require large amounts of annotated data in order to be successful." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-101", "text": "Table 1 shows the set of roles used and their relation to VerbNet roles." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-102", "text": "This constitutes our gold evaluation corpus." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-103", "text": "The baseline model is the \"syntactic function\" used for instance in (Lang and Lapata, 2011a) , which simply clusters predicate arguments according to the dependency relation to their head." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-104", "text": "This is a standard baseline for unsupervised SRL, which, although simple, has been shown difficult to outperform." 
}, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-105", "text": "As done in previous work, it is implemented by allocating a different cluster to each of the 10 most frequent syntactic relations, and one extra cluster for all the other relations." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-106", "text": "Evaluation results are shown in Table 2 ." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-107", "text": "The proposed model significantly outperforms the deterministic baseline, which validates the unsupervised learning process." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-108", "text": "----------------------------------" }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-109", "text": "**EVALUATIONS ON ENGLISH**" }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-110", "text": "We made our best to follow the setup used in previous work (Lang and Lapata, 2011a; Titov and Kle-mentiev, 2012) , in order to compare with the current state of the art." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-111", "text": "The data used is the standard CoNLL 2008 shared task (Surdeanu et al., 2008) version of Penn Treebank WSJ and PropBank." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-112", "text": "Our model is evaluated on gold generated parses, using the gold PropBank annotations." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-113", "text": "In PropBank, predicates are associated with a set of roles, where roles A2-A5 or AA are verb specific, while adjuncts roles (AM) are consistent across verbs." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-12", "text": "This annotated data is not always available, very expensive to create and often domain specific (Pradhan et al., 2008) ." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-13", "text": "There is in particular no such data available for French." 
}, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-14", "text": "To bypass this shortcoming, \"annotation-by-projection\" approaches have been proposed (Pado and Lapata, 2006 ) which in essence, (i) project the semantic annotations available in one language (usually English) , to text in another language (in this case French); and (ii) use the resulting annotations to train a semantic role labeller." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-15", "text": "Thus Pado and Pitel (2007) show that the projection-based annotation framework permits bootstrapping a semantic role labeller for FrameNet which reaches an F-measure of 63%; and van der Plas et al. (2011) show that training a joint syntactic-semantic parser based on the projection approach permits reaching an F-measure for the labeled attachment score on PropBank annotation of 65%." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-16", "text": "Although they minimize the manual effort involved, these approaches still require both an annotated source corpus and an aligned target corpus." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-17", "text": "Moreover, they assume a specific role labeling (e.g., PropBank, FrameNet or VerbNet roles) and are not generally portable from one framework to another." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-18", "text": "These drawbacks with supervised approaches motivated the need for unsupervised methods capable of exploiting large amounts of unannotated data." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-19", "text": "In this context several approaches have been proposed." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-20", "text": "Swier and Stevenson (2004) were the first to introduce unsupervised SRL in an approach that used the VerbNet lexicon to guide unsupervised learning." 
}, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-21", "text": "Grenager and Manning (2006) proposed a directed graphical model for role induction that exploits linguistic priors for syntactic and semantic inference." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-22", "text": "Following this work, Lang and Lapata (2010) formulated role induction as the problem of detecting alternations and mapping non-standard linkings to cannonical ones, and later as a graph partitioning problem in (Lang and Lapata, 2011b) ." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-23", "text": "They also proposed an algorithm that uses successive splits and merges of semantic roles clusters in order to improve their quality in (Lang and Lapata, 2011a) ." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-24", "text": "Finally, Titov and Klementiev (2012) , introduce two new Bayesian models that treat unsupervised role induction as the clustering of syntactic argument signatures, with clusters corresponding to semantic roles, and achieve the best state-of-the-art results." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-25", "text": "In this paper, we propose a novel unsupervised approach to semantic role labeling that differs from previous work in that it integrates the notion of verb classes into the model (by analogy with VerbNet, we call these verb classes, frames)." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-26", "text": "We show that this approach gives good results both on the English PropBank and on a French corpus annotated with VerbNet style semantic roles." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-27", "text": "For the English PropBank, although the model is more suitable for a framework that uses a shared set of role labels such as VerbNet, we obtain results comparable to the state-of-the-art." 
}, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-28", "text": "For French, the model is shown to outperform a strong baseline by a wide margin." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-29", "text": "----------------------------------" }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-30", "text": "**PROBABILISTIC MODEL**" }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-31", "text": "As mentioned in the introduction, semantic role labeling comprises two sub-tasks: argument identification and role induction." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-32", "text": "Following common practice (Lang and Lapata, 2011a; Titov and Klementiev, 2012) , we assume oracle argument identification and focus on argument labeling." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-33", "text": "The approach we propose is an unsupervised generative Bayesian model that clusters arguments into classes each of which can be associated with a semantic role." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-34", "text": "The model starts by generating a frame assignment to each verb instance where a frame is a clustering of verbs and associated roles." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-35", "text": "Then, for each observed verb argument, a semantic role is drawn conditioned on the frame." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-36", "text": "Finally, the word and dependency label of this argument are generated." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-37", "text": "The model admits a simple Gibbs algorithm where the number of latent variables is proportional to the number of roles and frames to be clustered." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-38", "text": "There are two key benefits of this model architecture." 
}, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-39", "text": "First, it directly encodes linguistic intuitions about semantic frames: the model structure reflects the subcategorisation property of the frame variable, which also groups verbs that share the same set of semantic roles, something very close to the VerbNet notion of frames." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-40", "text": "Second, by ignoring the \"verbspecific\" nature of PropBank labels, we reduce the need for a large amount of data and we better share evidence across roles." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-41", "text": "In addition, because it is unsupervised, the model is independent both of the language and of the specific semantic framework (since no inventory of semantic role is a priori chosen)." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-42", "text": "----------------------------------" }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-43", "text": "**MODEL DESCRIPTION**" }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-44", "text": "The goal of the task is to assign argument instances to clusters, such that each argument cluster represents a specific semantic role, and each role corresponds to one cluster." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-45", "text": "The model is represented in the form of a plate diagram in Figure 1 ." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-46", "text": "The observed random variables are the verb V (lemma), its voice V o (active or passive), the words W (lemma) that are arguments of this verb, and the syntactic dependency labels D that link the argument to its head." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-47", "text": "There are two latent variables: the frame F that represents the class of the verb, and the role R assigned to each of its arguments." 
}, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-48", "text": "The parameters \u03b8 of all multinomial distributions are Dirichlet distributed, with fixed symmetric concentration hyper-parameter \u03b1." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-49", "text": "The frame plays a fundamental role in this setting, since it intends to capture classes of verbs that share similar distributions of role arguments." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-50", "text": "The model's generative story is described next, followed by a description of the inference algorithm used to apply the model to an unannotated corpus." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-51", "text": "----------------------------------" }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-52", "text": "**GENERATIVE STORY**" }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-53", "text": "For each verb instance, the proposed model first generates a frame cluster, a voice (active or passive), and then a verb lemma from the distribution of verbs in this frame." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-54", "text": "The number of arguments is assumed fixed." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-55", "text": "For each argument, a role is sampled conditioned on the frame." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-56", "text": "Then, a word is sampled from the distribution of words associated to this role, and finally a dependency label is generated, conditioned both on the role and the voice." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-57", "text": "All multinomial parameters are collapsed, and thus not sampled." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-58", "text": "All Dirichlet hyper-parameters are assumed constant." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-59", "text": "To identify words, we use either word lemmas or part-of-speech tags." 
}, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-60", "text": "In order to avoid data sparseness issues, we consider the word lemma only in cases where there are more than 9 instances of the word lemma in the corpus." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-61", "text": "Otherwise, if the number of word lemma instances is less than 10, we use the part-of-speech tags." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-62", "text": "----------------------------------" }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-63", "text": "**LEARNING AND INFERENCE**" }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-64", "text": "A collapsed Gibbs sampler is used to perform posterior inference on the model." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-65", "text": "Initially, all frames F i are sampled randomly from a uniform distribution, while the roles R i,j are assigned either randomly or following the deterministic syntactic function baseline, which simply clusters predicate arguments according to their syntactic function." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-66", "text": "This function is described in detail in Section 3." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-67", "text": "The Gibbs sampling algorithm samples each latent variable (F i and R i,j ) in turn according to its posterior distribution conditioned on all other instances of this variable (noted F \u00aci and R \u00ac(i,j) respectively) and all other variables." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-68", "text": "These posteriors are detailed next." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-69", "text": "In the following, R i,j represents the random variable for the j th role of the i th verb in the corpus: its value is R i,j = r i,j at a given iteration of the sampling algorithm." 
}, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-70", "text": "nr f,r is the count of occurrences of (F i = f, R i,j = r) in the whole corpus, excluding the i th instance when the superscript \u2212i is used." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-71", "text": "A star * matches any possible value." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-72", "text": "The joint probability over the whole corpus with collapsed multinomial parameters is:" }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-73", "text": "The posterior from which the frame is sampled is derived from the joint distribution as follows:" }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-74", "text": "where nr +i r is the count of occurrences of role r in the arguments of verb instance i (M i = r nr +i r )." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-75", "text": "The update equation for sampling the role becomes:" }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-76", "text": "After T iterations, the process is stopped and the expected value of the sampled frames and roles after the burn-in period (20 iterations) is computed." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-77", "text": "With deterministic (syntactic) initialization, T is set to 200, while it is set to 2000 with random initialization because of slower convergence." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-78", "text": "----------------------------------" }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-79", "text": "**EVALUATIONS AND RESULTS**" }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-80", "text": "We evaluate our model both on English to situate our approach with respect to the state of the art; and on French to demonstrate its portability to other languages." 
}, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-81", "text": "----------------------------------" }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-82", "text": "**COMMON EXPERIMENTAL SETUP**" }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-83", "text": "The model's parameters have been tuned with a few rounds of trial-and-error on the English development corpus: For the hyper-parameters, we set \u03b1 F = 0.5, \u03b1 R = 1.e \u22123 , \u03b1 V = 1.e \u22127 , \u03b1 V o = 1.e \u22123 , \u03b1 D = 1.e \u22128 and \u03b1 W = 0.5." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-84", "text": "For the evaluation on French, we only changed the \u03b1 F and \u03b1 W parameters." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-85", "text": "In order to reflect the rather uniform distribution of verb instances across verb classes we set \u03b1 F to 1." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-86", "text": "Moreover, we set \u03b1 W to 0.001 because of the smaller number of words and roles in the French corpus." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-87", "text": "The number of roles and frames were chosen based on the properties of each corpus." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-88", "text": "We set number of roles to 40 and 10, and the number of frames to 300 and 60 for English and French respectively." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-89", "text": "As done in (Lang and Lapata, 2011a) and (Titov and Klementiev, 2012) , we use purity and collocation measures to assess the quality of our role induction process." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-90", "text": "For each verb, the purity of roles' clusters is computed as follows:" }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-91", "text": "where C i is the set of arguments in the i th cluster found, G j is the set of arguments in the j th gold class, and N is the number of argument instances." 
}, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-92", "text": "In a similar way, the collocation of roles' clusters is computed as follows:" }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-93", "text": "Then, each score is averaged over all verbs." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-94", "text": "In the same way as (Lang and Lapata, 2011a) , we use the micro-average obtained by weighting the scores for individual verbs proportionally to the number of argument instances for that verb." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-95", "text": "Finally the F1 measure is the harmonic mean of the aggregated values of purity and collocation:" }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-96", "text": "----------------------------------" }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-97", "text": "**EVALUATIONS ON FRENCH**" }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-98", "text": "To evaluate our model on French, we used a manually annotated corpora consisting on sentences from the Paris 7 Treebank (Abeill\u00e9 et al., 2000) , containing verbs extracted from the gold standard V-GOLD (Sun et al., 2010) selected and annotated with VerbNet-style thematic roles." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-99", "text": "In some cases, the annotated roles were obtained by merging some of the VerbNet roles (e.g., Actor, Actor1 and Actor2 are merged); or by grouping together classes sharing the same thematic grids." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-114", "text": "Besides, roles A0 and A1 attempt to capture Proto-Agent and Proto-Patient roles (Dowty, 1991) , and thus are more valid across verbs and verb instances than A2-A5 roles." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-115", "text": "Table 3 reports the evaluation results of the proposed model along with those of the baseline system and of some of the latest state-of-the-art results." 
}, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-116", "text": "We can first note that, despite our efforts to reproduce the same baseline, there is still a difference between our baseline (Synt.Func.) and the baseline reported in (Lang and Lapata, 2011a)" }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-117", "text": "The other results respectively correspond to the Split Merge approach presented in (Lang and Lapata, 2011a ) (Split Merge), the Graph Partitioning algorithm (Graph Part.) presented in (Lang and Lapata, 2011b) , and two Bayesian approaches presented in (Titov and Klementiev, 2012) , which achieve the best current unsupervised SRL results." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-118", "text": "The first such model (TK-Bay.1) clusters argument fillers and directly maps some syntactic labels to semantic roles for some adjunct like modifiers that are explicitly represented in the syntax, while the second model (TK-Bay.2) does not include these two features." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-119", "text": "Two versions of the proposed model are reported in the last rows of Table 3 : one with random (uniform) initialization of all variables, and the other with deterministic initialization of all R i from the syntactic function." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-120", "text": "Indeed, although many unsupervised system are very sensitive to initialization, we observe that in the proposed model, unsupervised inference reaches reasonably good performances even with a knowledge-free initialization." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-121", "text": "Furthermore, when initialized with the strong deterministic baseline, the model still learns new evidences and improves over the baseline to give comparable results to the best unsupervised state-of-the-art systems." 
}, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-122", "text": "----------------------------------" }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-123", "text": "**CONCLUSIONS AND FUTURE WORK**" }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-124", "text": "We have presented a method for unsupervised SRL that is based on an intuitive generative Bayesian model that not only clusters arguments into semantic roles, but also explicitly integrates the concept of frames in SRL." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-125", "text": "Previous approaches to semantic role induction proposed some clustering of roles without explicitly focusing on the verb classes generated." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-126", "text": "Although there has been work on verb clustering, this is, to the best of our knowledge, the first approach that jointly considers both tasks." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-127", "text": "In this work in progress, we focused on the role induction task and we only evaluated this part, leaving the evaluation of verb classes as future work." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-128", "text": "We successfully evaluated the proposed model on two languages, French and English, showing, in both cases, consistent performances improvement over the deterministic baseline." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-129", "text": "Furthermore, its accuracy reaches a level comparable to that of the best state-of-the-art unsupervised systems." }, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-130", "text": "The model could be improved in many ways, and in particular by including some penalization term for sampling the same role for several arguments of a verb instance (at least for core roles)." 
}, { "sent_id": "0763666190b6b4be1bcf494d7c6fe2-C001-131", "text": "Moreover, we believe that our model better fits within a framework that allows roles sharing between verbs (or frames), such as VerbNet, and we would like to carry out a deeper evaluation on this concept." } ], "y": { "@BACK@": { "gold_contexts": [ [ "0763666190b6b4be1bcf494d7c6fe2-C001-23" ] ], "cite_sentences": [ "0763666190b6b4be1bcf494d7c6fe2-C001-23" ] }, "@SIM@": { "gold_contexts": [ [ "0763666190b6b4be1bcf494d7c6fe2-C001-32" ], [ "0763666190b6b4be1bcf494d7c6fe2-C001-89" ], [ "0763666190b6b4be1bcf494d7c6fe2-C001-94" ], [ "0763666190b6b4be1bcf494d7c6fe2-C001-103" ], [ "0763666190b6b4be1bcf494d7c6fe2-C001-110" ], [ "0763666190b6b4be1bcf494d7c6fe2-C001-117" ] ], "cite_sentences": [ "0763666190b6b4be1bcf494d7c6fe2-C001-32", "0763666190b6b4be1bcf494d7c6fe2-C001-89", "0763666190b6b4be1bcf494d7c6fe2-C001-94", "0763666190b6b4be1bcf494d7c6fe2-C001-103", "0763666190b6b4be1bcf494d7c6fe2-C001-110", "0763666190b6b4be1bcf494d7c6fe2-C001-117" ] }, "@USE@": { "gold_contexts": [ [ "0763666190b6b4be1bcf494d7c6fe2-C001-32" ], [ "0763666190b6b4be1bcf494d7c6fe2-C001-89" ], [ "0763666190b6b4be1bcf494d7c6fe2-C001-94" ], [ "0763666190b6b4be1bcf494d7c6fe2-C001-110" ] ], "cite_sentences": [ "0763666190b6b4be1bcf494d7c6fe2-C001-32", "0763666190b6b4be1bcf494d7c6fe2-C001-89", "0763666190b6b4be1bcf494d7c6fe2-C001-94", "0763666190b6b4be1bcf494d7c6fe2-C001-110" ] }, "@DIF@": { "gold_contexts": [ [ "0763666190b6b4be1bcf494d7c6fe2-C001-116" ] ], "cite_sentences": [ "0763666190b6b4be1bcf494d7c6fe2-C001-116" ] } } }, "ABC_8bd97eb118175c9fd2147b6456421c_24": { "x": [ { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-2", "text": "Contextual automatic speech recognition, i.e., biasing recognition towards a given context (e.g. user's playlists, or contacts), is challenging in end-to-end (E2E) models." 
}, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-3", "text": "Such models maintain a limited number of candidates during beam-search decoding, and have been found to recognize rare named entities poorly." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-4", "text": "The problem is exacerbated when biasing towards proper nouns in foreign languages, e.g., geographic location names, which are virtually unseen in training and are thus outof-vocabulary (OOV)." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-5", "text": "While grapheme or wordpiece E2E models might have a difficult time spelling OOV words, phonemes are more acoustically salient and past work has shown that E2E phoneme models can better predict such words." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-6", "text": "In this work, we propose an E2E model containing both English wordpieces and phonemes in the modeling space, and perform contextual biasing of foreign words at the phoneme level by mapping pronunciations of foreign words into similar English phonemes." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-7", "text": "In experimental evaluations, we find that the proposed approach performs 16% better than a grapheme-only biasing model, and 8% better than a wordpiece-only biasing model on a foreign place name recognition task, with only slight degradation on regular English tasks." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-8", "text": "----------------------------------" }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-10", "text": "End-to-end (E2E) models have attracted increasing attention recently." 
}, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-11", "text": "Instead of building an automatic speech recognition (ASR) system from different components such as the acoustic model (AM), language model (LM), and pronunciation model (PM), E2E models rely on a single neural network to directly learn speech-to-text mapping." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-12", "text": "Representative systems include a word-based connectionist temporal classification (CTC) model [1] , recurrent neural network transducer (RNN-T) [2, 3] , and attention-based models such as \"Listen, Attend, and Spell\" (LAS) [4] ." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-13", "text": "Recent advances have shown that E2E models can outperform the state-of-the-art conventional system when trained on thousands of hours of data [5, 6] ." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-14", "text": "In previous work [7] , it has been shown that contextual information (i.e., phrases relevant to recognition in the current context such as contact names, geographic place names, songs, etc.) can improve ASR accuracy." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-15", "text": "Such phrases are often foreign words, or are rarely seen in training." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-16", "text": "Recognizing these phrases is challenging." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-17", "text": "Conventional ASR systems model them as independent contextual LM using an n-gram weighted finite state transducer (WFST), and compose it with a baseline LM for on-the-fly (OTF) rescoring [7, 8] ." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-85", "text": "----------------------------------" }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-18", "text": "This idea is extended to a LAS model in [9] , where an n-gram LM and a word-tographeme \"speller\" are composed to produce a contextual FST for rescoring." 
}, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-19", "text": "The approach is similar to shallow fusion [10] which interpolates E2E model scores with an external LM in These authors contributed equally to this work." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-20", "text": "We would like to thank David Rybach for helpful comments and suggestions." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-21", "text": "beam search." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-22", "text": "To bring more relevant words for biasing in E2E models, [11] proposes to push biasing weights to each subword unit and deals with over-biasing." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-23", "text": "Further improvements such as biasing before beam pruning, and wordpiece-based biasing have been proposed to achieve state-of-the-art biasing results [12, 6, 13] ." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-24", "text": "Another class of contextual biasing uses an allneural approach." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-25", "text": "Contextual-LAS (CLAS) is proposed in [11] to use a bias encoder to model contextual phrases as embeddings and shows significant improvement than OTF rescoring [8] ." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-26", "text": "Phonetic information has been incorporated to CLAS to improve rare word recognition [14] ." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-27", "text": "Although biasing is improved by these techniques, they do not address cross-lingual recognition." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-28", "text": "In [15] , contextual biasing has been used to assist recognition of foreign words." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-29", "text": "With the phoneme mapping from a foreign language phoneme set to the recognizer's phoneme set, foreign words are modeled as a phoneme-level contextual FST for biasing." 
}, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-30", "text": "It is unclear whether such an approach can be directly applied to E2E models." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-31", "text": "Phoneme-only E2E systems have been shown to have inferior performance compared to grapheme or wordpiece models (WPM) in general [16, 17] , but shows better recognition of rare words and proper nouns." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-32", "text": "In this work we propose to incorporate phonemes to a wordpiece E2E model as modeling units and use phoneme-level FST for contextual biasing." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-33", "text": "We propose a word-frequency based sampling strategy to randomly tokenize rare words into phonemes in the target sequence using a lexicon." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-34", "text": "This approach also mitigates accuracy regressions that have been observed when using phoneme-only E2E models [16, 17] ." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-35", "text": "We train our model using only American English data and thus its wordpieces and phoneme set (no data from foreign languages)." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-36", "text": "In inference, given a list of foreign words, we bias the recognition using an English phoneme-level biasing FST, which is built by first tokenizing the words into foreign phonemes and then mapping them to English phonemes using [15] ." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-37", "text": "For example, given a navigation query \"directions to Cr\u00e9teil\" and the assumption that the French word \"Cr\u00e9teil\" is in our biasing list, \"Cr\u00e9teil\" is first tokenized to French phonemes as \"k R e t E j\", and then mapped to English phonemes \"k r\\ E t E j\" for biasing 1 ." 
}, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-38", "text": "The phoneme mapping is necessary since our modeling units contain only English phonemes." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-39", "text": "In decoding, we propose to incorporate the pronunciation FST of the biasing words to consume English phoneme symbols and produce foreign words, using the aforementioned foreign lexicon and phoneme mapping, i.e. \"k r\\ E t E j\" \u2192 Cr\u00e9teil (details in Section 3.2 and 3.3)." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-40", "text": "Wordpiece outputs are concatenated to form words." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-41", "text": "In experimental evaluations, we find that the proposed phoneme-based biasing using wordpiece-phoneme model successfully recognizes foreign words." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-42", "text": "It performs 16% relatively better in terms of WER than the grapheme-only biasing model, and 8% better than the wordpiece-only biasing model in a task of recognizing navigation queries containing French place names." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-43", "text": "The proposed model also has the advantage that it can be directly applied to other foreign languages for biasing without model scalability issues." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-44", "text": "----------------------------------" }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-45", "text": "**PRIOR WORK**" }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-46", "text": "----------------------------------" }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-47", "text": "**SHALLOW FUSION E2E BIASING**" }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-48", "text": "Shallow fusion has been used in E2E models for decoding [10] and contextual biasing [6] ." 
}, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-49", "text": "Biasing phrases are first represented as n-gram WFST in the word level (G), and then left composed with a \"speller\" FST (S) to produce a contextual LM: C = min(det(S \u2022 G))." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-50", "text": "The speller transduces a sequence of subword units to corresponding words." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-51", "text": "The contextual LM (C) is then used to rescore the log-probability outputs of the E2E model during beam search:" }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-52", "text": "Here, x denotes acoustic observations, and y the subword-unit sequence." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-53", "text": "P is the probability estimation from the E2E model, and PC is the biasing rescoring probability." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-54", "text": "\u03bb controls the weight of contextual LM in rescoring." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-55", "text": "For E2E biasing, [11] explored biasing at beginning of word, end of word, and at subword units." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-56", "text": "The authors find that unlike biasing at the end of word in conventional models, biasing at subword units with weight pushing prevents candidates from being pruned early from the beam." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-57", "text": "In [13] , wordpieces have been shown to outperform graphemes in biasing since they create a sparser match of biasing units." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-58", "text": "All these improvements lead to significantly better biasing which is comparable to the state-of-the-art conventional model [6] ." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-59", "text": "To avoid over-biasing, [13] also proposed to only activate biasing phrases when they are proceeded by a set of prefixes." 
}, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-60", "text": "----------------------------------" }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-61", "text": "**PHONEME MAPPING**" }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-62", "text": "Cross-lingual phoneme mapping has been used in conventional systems for recognizing foreign words [15] ." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-63", "text": "First, a phoneme mapping is learned by aligning the pronunciations between foreign and target languages using TTS-synthesized audio and a pronunciation learning algorithm [18] ." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-64", "text": "In inference, foreign words are first built to an FST (G)." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-65", "text": "Lexica (L) are constructed between target-language phonemes and foreign words using phoneme mapping, and then left composed with G to construct a dynamic class LM for decoding:" }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-66", "text": "where d denotes a dynamic class label." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-67", "text": "In Section 3.2, we describe how phoneme mapping is incorporated to a wordpiecephoneme E2E model for contextual biasing." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-68", "text": "----------------------------------" }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-69", "text": "**PHONEME-BASED BIASING**" }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-70", "text": "The focus of this work is to bias toward rare cross-lingual words which are typically missing from the training set." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-71", "text": "We propose to do that by utilizing phonemes, which are not affected by orthography." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-72", "text": "Specifically, we augment the wordpiece modeling space of an E2E model with phonemes to train a wordpiece-phoneme model." 
}, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-73", "text": "----------------------------------" }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-74", "text": "**WORDPIECE-PHONEME MODEL**" }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-75", "text": "A wordpiece-phoneme model differs from a wordpiece-only model in that it may decompose a few words to phonemes in training." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-76", "text": "The output of the model is a single softmax whose symbol set is the union of wordpiece and phoneme symbols." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-77", "text": "We use a pronunciation lexicon to obtain phoneme sequences of words." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-78", "text": "Since phonemes show strength in recognizing rare words [16] , we want to present these words as phonemes more often." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-79", "text": "In a target sentence, we decide to randomly present the i th word as phonemes with a probability" }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-80", "text": ", 1.0) where p0 and T are constants and c(i) is an integer representing the number of time the word appears in our entire training corpus." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-81", "text": "Therefore, the words that appear T times or less will be presented as phonemes with probability p0." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-82", "text": "For words that appear more than T times, the more frequent they are, the less likely they are presented as phonemes 2 ." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-83", "text": "Note that the decision of whether to use wordpieces or phonemes is made randomly at each gradient iteration, and thus a given sentence could have different target sequences at different epochs." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-84", "text": "We use context-independent phonemes as in [16] ." 
}, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-86", "text": "**BIASING FST FOR PHONEMES**" }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-87", "text": "In inference, cross-lingual biasing words are converted to English phonemes to rescore the phoneme outputs of the wordpiece-phoneme model." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-88", "text": "In our work, phoneme mapping is represented by a dictionary which contains human generated source-language to target-language phoneme pairs [15] , and the X-SAMPA phoneme set is used for all languages." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-89", "text": "For example, given a French word \"Cr\u00e9teil\", we tokenize it into phonemes using the French pronunciation lexicon, i.e. \"Cr\u00e9teil\" \u2192 \"k R e t E j\", and then map the French phonemes to English phonemes one by one: \"k R e t E j\" \u2192 \"k r\\ E t E j\"." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-90", "text": "Note that the mapping is needed since our wordpiecephoneme model contains only English phonemes." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-91", "text": "The English phoneme sequence is then used to construct a phoneme-level FST for biasing." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-92", "text": "Weight pushing is used to assign weights at the phoneme level and failure arcs are added to avoid over-biasing similar to [11] ." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-93", "text": "Figure 1 shows an example of a contextual FST for the word \"Cr\u00e9teil\" at the phoneme level." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-94", "text": "The biasing FST is then used to rescore the phoneme outputs of the wordpiecephoneme model on the fly, using Eq. (1)." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-95", "text": "Figure 1 : Contextual FST for the word \"Cr\u00e9teil\" using a sequence of English phonemes \"k r\\ E t E j\"." 
}, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-96", "text": "----------------------------------" }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-97", "text": "**DECODING GRAPH**" }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-98", "text": "To generate words as outputs, we search through a decoding graph similar to [16] but accept both phonemes and wordpieces." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-99", "text": "An example is shown in Figure 2 ." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-100", "text": "The decoding FST has wordpiece loops around state 0 (we show only a few for simplicity), but also has a pronunciation section (states 1 through 14) ." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-101", "text": "The pronunciation section is a prefix tree with phonemes as inputs, and outputs are wordpieces of the corresponding word produced by the WPM in Section 3.1." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-102", "text": "Specifically, for each word in the biasing list, we look up pronunciations from the lexicon and split the word into its constituent wordpieces." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-103", "text": "Input phoneme labels are accepted and transduced into wordpieces." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-104", "text": "Input wordpiece labels are accepted by the wordpiece loops." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-105", "text": "The final output symbols, which are always wordpieces, are concatenated into words." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-106", "text": "Figure 2 : Decoding graph for the words \"cr\u00e8che\" (daycare) with English cross lingual pronunciation \"k r\\ E S\" and \"cr\u00e9teil\" (a city) with pronunciation \"k r\\ E t E j\"." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-107", "text": "For clarity, we omitted most wordpieces for the state 0." 
}, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-131", "text": "**MODEL TRAINING**" }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-108", "text": "Based on [16] , we add two improvements to the decoding strategy." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-109", "text": "First, during decoding we consume as many input epsilon arcs as possible thus guaranteeing that all wordpieces in word are produced when all corresponding phonemes are seen in the input." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-110", "text": "Second, we merge paths that have the same output symbols." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-111", "text": "Given the nature of our training and decoding, a given word can be output either directly in wordpieces, or transduced from phonemes to wordpieces." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-112", "text": "Since the input symbols are different, each hypothesis has a different probability." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-113", "text": "We keep track of equivalent hypotheses and recombine them by adding their probabilities, assigning the total probability to the most likely hypothesis, and dropping the others from the beam." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-114", "text": "----------------------------------" }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-115", "text": "**EXPERIMENTS**" }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-116", "text": "----------------------------------" }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-117", "text": "**DATA SETS**" }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-118", "text": "Our training set contains 35 million English utterances with a total of around 27,500 hours." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-119", "text": "These utterances are sampled from Google's general English traffic, and are anonymized and hand-transcribed for training." 
}, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-120", "text": "To increase training diversity, clean utterances are artificially corrupted by using a room simulator, varying degrees of noise, and reverberation such that the overall signal-to-noise ratio (SNR) is between 0dB and 30dB, and an average SNR is 12dB [19] ." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-121", "text": "The noise sources are from YouTube and daily life noisy environmental recordings." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-122", "text": "Utterances with cross-lingual words are hardly present in our data set, and thus we use a TTS engine, parallel-wavenet [20] , to synthesize utterances for evaluation." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-123", "text": "We choose French as the foreign language, and the utterances consist of navigation queries (e.g. \"directions to Cr\u00e9teil\")." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-124", "text": "There are in total 1K utterances and we refer to this set as the Directions test set." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-125", "text": "For each utterance, the bias set contains 1K words including the groundtruth place name and unrelated French place names." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-126", "text": "Since all biasing words are in a foreign language, they have never been seen in training." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-127", "text": "In decoding, all biasing words are used to construct a contextual FST with each arc having the same weight." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-128", "text": "In later evaluation, this weight is tuned independently for different models." 
}, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-129", "text": "On the other hand, to evaluate how the wordpiece-phoneme model performs on the regular English recognition task, we sampled a total of 30.5K English utterances from general Google traffic as the no-biasing test set." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-130", "text": "----------------------------------" }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-132", "text": "Similarly to [6] , an input utterance is divided to 25-ms frames, windowed and shifted at a rate of 10 ms." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-133", "text": "A 80-dimensional logMel feature is extracted at each frame, and the current frame and two frames to the left are concatenated to produce a 240-dimensional log-Mel feature." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-134", "text": "These features are then downsampled at a rate of 30 ms." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-135", "text": "We use RNN-T as the sequenceto-sequence model." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-136", "text": "Similar to [6] , the encoder of the RNN-T consists of 8 Long Short-Term Memory (LSTM) [21] layers and the prediction network contains 2 layers." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-137", "text": "Each LSTM layer contains 2,048 hidden units followed by a 640-dimensional projection layer." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-138", "text": "A time-reduction layer is added after the second layer to improve the inference speed without accuracy loss." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-139", "text": "Outputs of the encoder and prediction network are fed to a joint-network which has 640 hidden units, which is then followed by a softmax layer containing 4,096 output units." 
}, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-140", "text": "Specifically, the output units contain 41 context-independent phonemes and the rest are wordpieces." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-141", "text": "As described in Section 3.1, we use a lexicon containing about 430,000 words with their frequencies to determine when to use phoneme sequences." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-142", "text": "The lexicon contains words and their frequencies from training data, and is trimmed by removing homophones (e.g. \"flower\" and \"flour\"), homographs (e.g. \"live\" as a verb or adjective), and pronunciation variants (e.g. \"either\")." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-143", "text": "It thus only contains entries that are unambiguous when going from spelling to pronunciation or the other way around." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-144", "text": "We do not generate phonemes for out-of-lexicon words using a trained grapheme-to-phoneme model." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-145", "text": "The intuition is that when pronunciation is not known, it is simpler and cleaner to let the E2E model infer the pronunciation rather than bring in another independently trained model." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-146", "text": "In addition, we use the written form of a transcript and do not use any verbalizer." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-147", "text": "Thus, words like \"$9.95\" were never presented as phonemes." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-148", "text": "In model training, a symbol is inserted between words to identify spacing." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-149", "text": "The model contains around 120M parameters in total." 
}, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-150", "text": "All RNN-T models are trained in Tensorflow [22] on 8 \u00d7 8 Tensor Processing Units (TPU) slices with a global batch size of 4,096." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-151", "text": "----------------------------------" }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-152", "text": "**WERS AND COMPARISONS**" }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-153", "text": "We compare the biasing results of the wordpiece-phoneme model to a grapheme-only model and a wordpiece-only model." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-154", "text": "The latter two models have the same structure as the wordpiecephoneme model." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-155", "text": "The difference is that the grapheme model has 76 graphemes as outputs and the wordpiece model has 4,096 wordpieces." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-156", "text": "This leads to around 117M and 120M parameters for the grapheme model and wordpiece model, respectively." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-157", "text": "Note that the two model's output symbols are in English and they are trained using all-English data described in Section 4.1." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-158", "text": "For these two models, biasing is done at the grapheme level or wordpiece level alone using the English transliterated versions of French biasing words." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-159", "text": "WERs of the Directions set are shown in Table 1 ." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-160", "text": "First, we see that all three models perform poorly without biasing." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-161", "text": "This is because the place names are in French and they have never been seen in training, i.e. an word OOV rate of nearly 100% 3 ." 
}, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-162", "text": "Secondly, we see in Table 1 that all models performs substantially better with biasing." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-163", "text": "The WER reductions range from 9%-23% relatively for different models when compared to the no-bias case." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-164", "text": "Comparing different biasing strategies, we find that the wordpiece-phoneme model performs the best: 16% relatively better than the grapheme model, and 8.3% better than the wordpiece model." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-165", "text": "We attribute the superior per- formance of the wordpiece-phoneme model to the robustness of phonemes to OOV words, as observed in [16] ." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-166", "text": "Since the wordpiece-phoneme model contains both wordpieces and phonemes as modeling units, we can further perform wordpiece biasing in addition to phoneme-based biasing by building a wordpiece FST in parallel to the phoneme FST." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-167", "text": "This further reduces the WER by 2%, as shown in the bottom row in Table 1 ." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-168", "text": "This shows that wordpiece and phoneme biasing are complementary to each other." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-169", "text": "We note that the same weights are used for both phoneme and wordpiece biasing, and empirically we did not find significant improvements by using different weights." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-170", "text": "On the other hand, for wordpiece model biasing, our results are consistent with the observation in [13] that the wordpieces perform better than graphemes because of its sparsity in matching longer units." 
}, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-171", "text": "To further understand how biasing helps recognizing French place names, we present some wins of wordpiecephoneme model in Table 2 ." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-172", "text": "We can see that biasing helps produce the correct French words, and in contrast, phonetically similar but wrong English words are produced when without biasing." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-173", "text": "On the other hand, we present some typical recognition errors in Table 3 ." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-174", "text": "We see that errors are mainly due to phonetically similar words in French." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-175", "text": "We will analyze how the biasing performance changes as the number of biasing words changes in Section 4.4." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-176", "text": "To ensure there is no regression in no-biasing scenarios, we compare three models in decoding regular English utterances from general Google traffic." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-177", "text": "In decoding, we turn the biasing mechanism off by using an empty list of biasing phrases." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-178", "text": "As shown in the last column of Table 1 , the wordpiece model performs better than the grapheme model as in [6] ." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-179", "text": "The wordpiecephoneme model performs a little better than the grapheme model, and we attribute that to the higher frequency of wordpieces during training." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-180", "text": "Compared to the wordpiece model, the wordpiece-phoneme model has a slight degradation (0.1% absolute WER)." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-181", "text": "This is due to the introduction of phonemes in modeling." 
}, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-182", "text": "One potential approach to improve regression is to incorporate an English external language model for phonemes in rescoring, similarly to the wordpiece-based rescoring in [10] ." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-183", "text": "However, we note that the regression is significantly smaller than the all-phoneme model in [16] ." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-184", "text": "----------------------------------" }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-185", "text": "**EFFECT OF NUMBER OF BIASING WORDS**" }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-186", "text": "Given examples in Table 3 , we are curious how competing biasing words affect recognition." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-187", "text": "We thus randomly choose a fixed number of biasing words (including the ground truth one) and vary the number to see how WER changes." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-188", "text": "Figure 3 shows that WER for the Directions set is 9.1% when only the ground truth word is present (i.e. number of biasing words is 1), and the rate increases quickly when the total number of biasing words increases." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-189", "text": "We attribute the quick degradation to the significant matching confusion in the phoneme prefixes of the words as the number of biasing words increases (as confirmed by phonetically similar French place names in Table 3 )." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-190", "text": "One interesting direction would be to increase the length of the phonemic units to create a sparser match." 
}, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-191", "text": "----------------------------------" }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-192", "text": "**CONCLUSION**" }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-193", "text": "In this work we proposed a wordpiece-phoneme RNN-T model and phoneme-level contextual biasing to recognize foreign words." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-194", "text": "Biasing at the phoneme level enables us to avoid the OOV problem in the wordpiece model." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-195", "text": "Evaluating on a test set containing navigation queries to French place names, we show the proposed approach performs significantly better than a state-of-the-art grapheme and wordpiece model, by 16% and 8%, respectively in terms of relative WER reductions." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-196", "text": "Wordpiece biasing is complimentary to phoneme biasing and adds a further 2% reduction." }, { "sent_id": "8bd97eb118175c9fd2147b6456421c-C001-197", "text": "Lastly, since wordpieces perform better than graphemes [6] in E2E modeling, it would be interesting to explore longer phonemic units such as phoneme pieces for biasing." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "8bd97eb118175c9fd2147b6456421c-C001-13" ], [ "8bd97eb118175c9fd2147b6456421c-C001-23" ], [ "8bd97eb118175c9fd2147b6456421c-C001-48" ], [ "8bd97eb118175c9fd2147b6456421c-C001-57", "8bd97eb118175c9fd2147b6456421c-C001-58" ], [ "8bd97eb118175c9fd2147b6456421c-C001-78" ] ], "cite_sentences": [ "8bd97eb118175c9fd2147b6456421c-C001-13", "8bd97eb118175c9fd2147b6456421c-C001-23", "8bd97eb118175c9fd2147b6456421c-C001-48", "8bd97eb118175c9fd2147b6456421c-C001-58", "8bd97eb118175c9fd2147b6456421c-C001-78" ] }, "@SIM@": { "gold_contexts": [ [ "8bd97eb118175c9fd2147b6456421c-C001-84" ], [ "8bd97eb118175c9fd2147b6456421c-C001-108" ], [ "8bd97eb118175c9fd2147b6456421c-C001-132" ], [ "8bd97eb118175c9fd2147b6456421c-C001-136" ], [ "8bd97eb118175c9fd2147b6456421c-C001-165" ] ], "cite_sentences": [ "8bd97eb118175c9fd2147b6456421c-C001-84", "8bd97eb118175c9fd2147b6456421c-C001-108", "8bd97eb118175c9fd2147b6456421c-C001-132", "8bd97eb118175c9fd2147b6456421c-C001-136", "8bd97eb118175c9fd2147b6456421c-C001-165" ] }, "@USE@": { "gold_contexts": [ [ "8bd97eb118175c9fd2147b6456421c-C001-84" ], [ "8bd97eb118175c9fd2147b6456421c-C001-108" ], [ "8bd97eb118175c9fd2147b6456421c-C001-132" ], [ "8bd97eb118175c9fd2147b6456421c-C001-136" ], [ "8bd97eb118175c9fd2147b6456421c-C001-165" ] ], "cite_sentences": [ "8bd97eb118175c9fd2147b6456421c-C001-84", "8bd97eb118175c9fd2147b6456421c-C001-108", "8bd97eb118175c9fd2147b6456421c-C001-132", "8bd97eb118175c9fd2147b6456421c-C001-136", "8bd97eb118175c9fd2147b6456421c-C001-165" ] }, "@DIF@": { "gold_contexts": [ [ "8bd97eb118175c9fd2147b6456421c-C001-98" ], [ "8bd97eb118175c9fd2147b6456421c-C001-183" ] ], "cite_sentences": [ "8bd97eb118175c9fd2147b6456421c-C001-98", "8bd97eb118175c9fd2147b6456421c-C001-183" ] }, "@EXT@": { "gold_contexts": [ [ "8bd97eb118175c9fd2147b6456421c-C001-98" ] ], "cite_sentences": [ "8bd97eb118175c9fd2147b6456421c-C001-98" ] }, "@FUT@": { "gold_contexts": [ [ 
"8bd97eb118175c9fd2147b6456421c-C001-197" ] ], "cite_sentences": [ "8bd97eb118175c9fd2147b6456421c-C001-197" ] } } }, "ABC_537ec54aac2c3e3c62070468dcd8a3_24": { "x": [ { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-88", "text": "Using our phrase-based model, the BLEU scores for lower/stem/prefix4 were 30.90/30.89/30.76, respectively." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-93", "text": "----------------------------------" }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-89", "text": "The differences of translation qualities were statistically significant at the 95% confidence level." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-80", "text": "If we could trust such consistently aligned words, reliable (hierarchical) phrase translation pairs would be extracted, which, in turn, would result in better estimates for relative count-based feature functions." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-90", "text": "Our phrase translation pairs aggregated from all the differently preprocessed corpora improved the translation quality." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-91", "text": "----------------------------------" }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-81", "text": "At the same time, differently biased word alignment annotations suggest alternative phrase translation pairs that is useful for increasing the coverage of translations." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-92", "text": "**RESULTS**" }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-94", "text": "**CONCLUSION**" }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-2", "text": "We present two translation systems experimented for the shared-task of \"Workshop on Statistical Machine Translation,\" a phrase-based model and a hierarchical phrase-based model." 
}, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-3", "text": "The former uses a phrasal unit for translation, whereas the latter is conceptualized as a synchronous-CFG in which phrases are hierarchically combined using non-terminals." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-4", "text": "Experiments showed that the hierarchical phrasebased model performed very comparable to the phrase-based model." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-5", "text": "We also report a phrase/rule extraction technique differentiating tokenization of corpora." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-6", "text": "----------------------------------" }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-8", "text": "We contrasted two translation methods for the Workshop on Statistical Machine Translation (WMT2006) shared-task." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-9", "text": "One is a phrase-based translation in which a phrasal unit is employed for translation (Koehn et al., 2003) ." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-10", "text": "The other is a hierarchical phrase-based translation in which translation is realized as a set of paired production rules (Chiang, 2005) ." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-11", "text": "Section 2 discusses those two models and details extraction algorithms, decoding algorithms and feature functions." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-12", "text": "We also explored three types of corpus preprocessing in Section 3." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-13", "text": "As expected, different tokenization would lead to different word alignments which, in turn, resulted in the divergence of the extracted phrase/rule size." 
}, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-14", "text": "In our method, phrase/rule translation pairs extracted from three distinctly word-aligned corpora are aggregated into one large phrase/rule translation table." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-15", "text": "The experiments and the final translation results are presented in Section 4." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-16", "text": "----------------------------------" }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-17", "text": "**TRANSLATION MODELS**" }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-18", "text": "We used a log-linear approach (Och and Ney, 2002) in which a foreign language sentence" }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-19", "text": "In this framework, the posterior probability Pr(e I 1 | f J 1 ) is directly maximized using a log-linear combination of feature functions h m (e I 1 , f J 1 ), such as a ngram language model or a translation model." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-20", "text": "When decoding, the denominator is dropped since it depends only on f J 1 ." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-21", "text": "Feature function scaling factors \u03bb m are optimized based on a maximum likelihood approach (Och and Ney, 2002) or on a direct error minimization approach (Och, 2003) ." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-22", "text": "This modeling allows the integration of various feature functions depending on the scenario of how a translation is constituted." 
}, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-23", "text": "In a phrase-based statistical translation (Koehn et al., 2003) , a bilingual text is decomposed as K phrase translation pairs (\u0113 1 ,f\u0101 1 ), (\u0113 2 ,f\u0101 2 ), ...: The input foreign sentence is segmented into phrasesf K 1 , mapped into corresponding English\u0113 K 1 , then, reordered to form the output English sentence according to a phrase alignment index mapping\u0101." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-24", "text": "In a hierarchical phrase-based translation (Chiang, 2005) , translation is modeled after a weighted synchronous-CFG consisting of production rules whose right-hand side is paired (Aho and Ullman, 1969) :" }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-25", "text": "X \u2192 \u03b3, \u03b1, \u223c where X is a non-terminal, \u03b3 and \u03b1 are strings of terminals and non-terminals." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-26", "text": "\u223c is a one-to-one correspondence for the non-terminals appeared in \u03b3 and \u03b1." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-27", "text": "Starting from an initial non-terminal, each rule rewrites non-terminals in \u03b3 and \u03b1 that are associated with \u223c." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-28", "text": "----------------------------------" }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-29", "text": "**PHRASE/RULE EXTRACTION**" }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-30", "text": "The phrase extraction algorithm is based on those presented by Koehn et al. (2003) ." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-31", "text": "First, manyto-many word alignments are induced by running a one-to-many word alignment model, such as GIZA++ (Och and Ney, 2003) , in both directions and by combining the results based on a heuristic (Och and Ney, 2004) ." 
}, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-32", "text": "Second, phrase translation pairs are extracted from the word aligned corpus (Koehn et al., 2003) ." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-33", "text": "The method exhaustively extracts phrase pairs ( f j+m j , e i+n i ) from a sentence pair ( f J 1 , e I 1 ) that do not violate the word alignment constraints a." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-34", "text": "In the hierarchical phrase-based model, production rules are accumulated by computing \"holes\" for extracted contiguous phrases (Chiang, 2005) :" }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-35", "text": "2." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-36", "text": "A rule X \u2192 \u03b3, \u03b1 and a phrase pair (f ,\u0113) s.t." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-37", "text": "\u03b3 = \u03b3 \u2032f \u03b3 \u2032\u2032 and \u03b1 = \u03b1 \u2032\u0113 \u03b1 \u2032\u2032 constitutes a rule:" }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-38", "text": "----------------------------------" }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-39", "text": "**DECODING**" }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-40", "text": "The decoder for the phrase-based model is a left-toright generation decoder with a beam search strategy synchronized with the cardinality of already translated foreign words." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-41", "text": "The decoding process is very similar to those described in (Koehn et al., 2003) : It starts from an initial empty hypothesis." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-42", "text": "From an existing hypothesis, new hypothesis is generated by consuming a phrase translation pair that covers untranslated foreign word positions." 
}, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-43", "text": "The score for the newly generated hypothesis is updated by combining the scores of feature functions described in Section 2.3." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-44", "text": "The English side of the phrase is simply concatenated to form a new prefix of English sentence." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-45", "text": "In the hierarchical phrase-based model, decoding is realized as an Earley-style top-down parser on the foreign language side with a beam search strategy synchronized with the cardinality of already translated foreign words (Watanabe et al., 2006) ." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-46", "text": "The major difference to the phrase-based model's decoder is the handling of non-terminals, or holes, in each rule." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-47", "text": "----------------------------------" }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-48", "text": "**FEATURE FUNCTIONS**" }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-49", "text": "Our phrase-based model uses a standard pharaoh feature functions listed as follows (Koehn et al., 2003) :" }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-50", "text": "\u2022 Relative-count based phrase translation probabilities in both directions." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-51", "text": "\u2022 Lexically weighted feature functions in both directions." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-52", "text": "\u2022 The supplied trigram language model." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-53", "text": "\u2022 Distortion model that counts the number of words skipped." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-54", "text": "\u2022 The number of words in English-side and the number of phrases that constitute translation." 
}, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-55", "text": "For details, please refer to Koehn et al. (2003) ." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-56", "text": "In addition, we added three feature functions to restrict reorderings and to represent globalized insertion/deletion of words:" }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-57", "text": "\u2022 Lexicalized reordering feature function scores whether a phrase translation pair is monotonically translated or not :" }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-58", "text": "\u2022 Deletion feature function penalizes words that do not constitute a translation according to a 80,260,191 111,153,303 103,523,206 80,666,414 110,787,982 102,940,840 lexicon model t( f |e) (Bender et al., 2004) :" }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-59", "text": "The deletion model simply counts the number of words whose lexicon model probability is lower than a threshold \u03c4 del ." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-60", "text": "Likewise, we also added an insertion model h ins (e I 1 , f J 1 ) that penalizes the spuriously inserted English words using a lexicon model t(e| f )." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-61", "text": "For the hierarchical phrase-based model, we employed the same feature set except for the distortion model and the lexicalized reordering model." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-62", "text": "----------------------------------" }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-63", "text": "**PHRASE EXTRACTION FROM DIFFERENT WORD ALIGNMENT**" }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-64", "text": "We prepared three kinds of corpora differentiated by tokenization methods." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-65", "text": "First, the simplest preprocessing is lower-casing (lower)." 
}, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-66", "text": "Second, corpora were transformed by a Porter's algorithm based multilingual stemmer (stem) 1 ." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-67", "text": "Third, mixed-cased corpora were truncated to the prefix of four letters of each word (prefix4)." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-68", "text": "For each differently tokenized corpus, we computed word alignments by a HMM translation model (Och and Ney, 2003) and by a word alignment refinement heuristic of \"grow-diagfinal\" (Koehn et al., 2003) ." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-69", "text": "Different preprocessing yields quite divergent alignment points as illustrated in Table 1 ." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-70", "text": "The table also shows the numbers for the intersection and union of three alignment annotations." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-71", "text": "The (hierarchical) phrase translation pairs are extracted from three distinctly word aligned corpora." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-72", "text": "In this process, each word is recovered into its lowercased form." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-73", "text": "The associated counts are aggregated to constitute relative count-based feature functions." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-74", "text": "Table 2 summarizes the size of phrase tables induced from the corpora." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-75", "text": "The number of rules extracted for the hierarchical phrase-based model was roughly twice as large as those for the phrase-based model." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-76", "text": "Fewer word alignments resulted in larger phrase translation table size as observed in the \"prefix4\" corpus." 
}, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-77", "text": "The size is further increased by our aggregation step (merged)." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-78", "text": "Different induction/refinement algorithms or preprocessings of a corpus bias word alignment." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-79", "text": "We found that some word alignments were consistent even with different preprocessings, though we could not justify whether such alignments would match against human intuition." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-82", "text": "results are very comparable to the last year's best results in test2005." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-83", "text": "Also found that our hierarchical phrase-based translation (Rule) performed slightly inferior to the phrase-based translation (Phrase) in both test sets." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-84", "text": "The hierarchically combined phrases seem to be too flexible to represent the relationship of similar language pairs." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-85", "text": "Note that our hierarchical phrase-based model performed better in the Englishto-German translation task." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-86", "text": "Those language pair requires rather distorted reordering, which could be represented by hierarchically combined phrases." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-87", "text": "We also conducted additional studies on how differently aligned corpora might affect the translation quality on Spanish-to-English task for the 2005 test set." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-95", "text": "We presented two translation models, a phrasebased model and a hierarchical phrase-based model." 
}, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-96", "text": "The former performed as well as the last year's best system, whereas the latter performed comparable to our phrase-based model." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-97", "text": "We are going to experiment new feature functions to restrict the too flexible reordering represented by our hierarchical phrasebased model." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-98", "text": "We also investigated different word alignment annotations, first using lower-cased corpus, second performing stemming, and third retaining only 4-letter prefix." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-99", "text": "Differently preprocessed corpora resulted in quite divergent word alignment." }, { "sent_id": "537ec54aac2c3e3c62070468dcd8a3-C001-100", "text": "Large phrase/rule translation tables were accumulated from three distinctly aligned corpora, which in turn, increased the translation quality." } ], "y": { "@BACK@": { "gold_contexts": [ [ "537ec54aac2c3e3c62070468dcd8a3-C001-9" ], [ "537ec54aac2c3e3c62070468dcd8a3-C001-23" ], [ "537ec54aac2c3e3c62070468dcd8a3-C001-55" ] ], "cite_sentences": [ "537ec54aac2c3e3c62070468dcd8a3-C001-9", "537ec54aac2c3e3c62070468dcd8a3-C001-23", "537ec54aac2c3e3c62070468dcd8a3-C001-55" ] }, "@SIM@": { "gold_contexts": [ [ "537ec54aac2c3e3c62070468dcd8a3-C001-30" ], [ "537ec54aac2c3e3c62070468dcd8a3-C001-32" ], [ "537ec54aac2c3e3c62070468dcd8a3-C001-41" ], [ "537ec54aac2c3e3c62070468dcd8a3-C001-49" ], [ "537ec54aac2c3e3c62070468dcd8a3-C001-68" ] ], "cite_sentences": [ "537ec54aac2c3e3c62070468dcd8a3-C001-30", "537ec54aac2c3e3c62070468dcd8a3-C001-32", "537ec54aac2c3e3c62070468dcd8a3-C001-41", "537ec54aac2c3e3c62070468dcd8a3-C001-49", "537ec54aac2c3e3c62070468dcd8a3-C001-68" ] }, "@USE@": { "gold_contexts": [ [ "537ec54aac2c3e3c62070468dcd8a3-C001-30" ], [ "537ec54aac2c3e3c62070468dcd8a3-C001-32" ], [ "537ec54aac2c3e3c62070468dcd8a3-C001-41" ], [ 
"537ec54aac2c3e3c62070468dcd8a3-C001-49" ], [ "537ec54aac2c3e3c62070468dcd8a3-C001-68" ] ], "cite_sentences": [ "537ec54aac2c3e3c62070468dcd8a3-C001-30", "537ec54aac2c3e3c62070468dcd8a3-C001-32", "537ec54aac2c3e3c62070468dcd8a3-C001-41", "537ec54aac2c3e3c62070468dcd8a3-C001-49", "537ec54aac2c3e3c62070468dcd8a3-C001-68" ] } } }, "ABC_3e0704e0928f2df8b7c2ffa9863a55_24": { "x": [ { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-2", "text": "Political identity is often manifested in language variation, but the relationship between the two is still relatively unexplored from a quantitative perspective." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-3", "text": "This study examines the use of Catalan, a language local to the semi-autonomous region of Catalonia in Spain, on Twitter in discourse related to the 2017 independence referendum." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-4", "text": "We corroborate prior findings that pro-independence tweets are more likely to include the local language than anti-independence tweets." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-5", "text": "We also find that Catalan is used more often in referendum-related discourse than in other contexts, contrary to prior findings on language variation." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-6", "text": "This suggests a strong role for the Catalan language in the expression of Catalonian political identity." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-7", "text": "----------------------------------" }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-9", "text": "Social identity is often constructed through language use, and variation in language therefore reflects social differences within the population (Labov, 1963) ." 
}, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-10", "text": "In a multilingual setting, an individual's preference to use a local language rather than the national one may reflect their political stance, as the local language can have strong ties to cultural and political identity (Moreno et al., 1998; Crameri, 2017) ." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-11", "text": "The role of linguistic identity is enhanced in extreme situations such as referenda, where the voting decision may be driven by identification with a local culture or language (Schmid, 2001) ." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-12", "text": "In October 2017, the semi-autonomous region of Catalonia held a referendum on independence from Spain, where 92% of respondents voted for independence (Fotheringham, 2017) ." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-13", "text": "To determine the role of the local language Catalan in * Equal contributions." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-14", "text": "this setting, we apply the methodology used by Shoemark et al. (2017) in the context of the 2014 Scottish independence referendum to a dataset of tweets related to the Catalonian referendum." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-15", "text": "We use the phenomenon of code-switching between Catalan and Spanish to pursue the following research questions in order to understand the choice of language in the context of the referendum:" }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-16", "text": "1. Is a speaker's stance on independence strongly associated with the rate at which they use Catalan?" }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-17", "text": "2. Does Catalan usage vary depending on whether the discussion topic is related to the referendum, and on the intended audience?" 
}, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-18", "text": "For the first question, our findings are similar to those in the Scottish case: pro-independence tweets are more likely to be written in Catalan than anti-independence tweets, and pro-independence Twitter users are more likely to use Catalan than anti-independence Twitter users (Section 4)." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-19", "text": "With respect to the second question, we find that Twitter users are more likely to use Catalan in referendumrelated tweets, and that they are more likely to use Catalan in tweets with a broader audience (Section 5)." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-20", "text": "1" }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-21", "text": "----------------------------------" }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-22", "text": "**RELATED WORK**" }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-23", "text": "Code-switching, the alternation between languages within conversation (Poplack, 1980) , has been shown to be the product of grammatical factors, such as syntax (Pfaff, 1979) , and social factors, such as intended audience (Gumperz, 1977) ." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-24", "text": "While many studies have examined codeswitching in the spoken context (Auer, 2013) social media platforms such as Twitter provide an opportunity to study code-switching in online discussions (Androutsopoulos, 2015) ." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-25", "text": "In the online context, choice of language may reflect the writer's intended audience (Kim et al., 2014) or identity (Christiansen, 2015; Lavendar, 2017) , and the explicit social signals in online discussions such as @-replies can be leveraged to test claims about code-switching at a large scale (Nguyen et al., 2015) ." 
}, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-26", "text": "A relatively unexplored area of code-switching behavior is politically-motivated code-switching, which we assume has a different set of constraints compared to everyday code-switching." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-27", "text": "With respect to political separatism, Shoemark et al. (2017) studied the use of Scots, a language local to Scotland, in the context of the 2014 Scotland independence referendum." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-28", "text": "They found that Twitter users who openly supported Scottish independence were more likely to incorporate words from Scots in their tweets." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-29", "text": "They also found that Twitter users who tweeted about the referendum were less likely to use Scots in referendum-related tweets than in non-referendum tweets." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-30", "text": "This study considers the similar scenario which took place in 2017 vis-\u00e0-vis the semi-autonomous region of Catalonia." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-31", "text": "Our main methodological divergence from Shoemark et al. (2017) relates to the linguistic phenomenon at hand: while Scots is mainly manifested as interleaving individual words within English text (code-mixing), Catalan is a distinct language which, when used, usually replaces Spanish altogether for the entire tweet (code-switching)." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-32", "text": "----------------------------------" }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-33", "text": "**DATA**" }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-34", "text": "The initial set of tweets for this study, T , was drawn from a 1% Twitter sample mined between January 1 and October 31, 2017, covering nearly a year of activity before the referendum, as well as its immediate aftermath." 
}, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-35", "text": "2 The first step in building this dataset was to manually develop a seed set of hashtags related to the referendum." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-36", "text": "Through browsing referendum content on Twitter, the following seed hashtags were selected: #Catalu\u00f1aLibre, #Independenci-aCatalu\u00f1a, #Catalu\u00f1aEsEspa\u00f1a, #Espa\u00f1aUnida, and #CatalanReferendum." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-37", "text": "All tweets containing at least one of these hashtags were extracted from T , and the top 1,000 hashtags appearing in the resulting dataset were manually inspected for relevance to the referendum." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-38", "text": "From these co-occurring hashtags, we selected a set of 46 hashtags and divided it into pro-independence, anti-independence, and neutral hashtags, based on translations of associated tweet content." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-39", "text": "3 After including ASCII-equivalent variants of special characters, as well as lowercased variants, our final hashtag set comprises 111 unique strings." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-40", "text": "Next, all tweets containing any referendum hashtag were extracted from T , yielding 190,061 tweets." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-41", "text": "After removing retweets and tweets from users whose tweets frequently contained URLs (i.e., likely bots), our final \"Catalonian Independence Tweets\" (CT) dataset is made up of 11,670 tweets from 10,498 users (cf. the Scottish referendum set IT with 59,664 tweets and 18,589 users in Shoemark et al. (2017) )." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-42", "text": "36 referendum-related hashtags appear in the filtered dataset." 
}, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-43", "text": "They are shown with their frequencies (including variants) in Table 1 (cf. the 47 hashtags and similar frequency distribution in Table 1 of Shoemark et al. (2017) )." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-44", "text": "To address the control condition, all authors of tweets in the CT dataset were collected to form a set U , and all other tweets in T written by these users were extracted into a control dataset (XT) of 45,222 tweets (cf." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-45", "text": "the 693,815 control tweets in Table 6 of Shoemark et al. (2017) )." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-46", "text": "The CT dataset is very balanced with respect to the number of tweets per user: only four users contribute over ten tweets (max = 14) and only 16 have more than five." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-47", "text": "The XT dataset also has only a few \"power\" users, such that nine users have over 1,000 tweets (max = 3,581) and a total of 173 have over 100 tweets." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-48", "text": "Since the results are macro-averaged over all users, these few power users should not significantly distort the findings." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-49", "text": "Language Identification." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-50", "text": "This study compares variation between two distinct languages, Catalan and Spanish." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-51", "text": "We used the langid language classification package (Lui and Baldwin, 2012) , based on character n-gram frequencies, to identify the language of all tweets in CT and XT." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-52", "text": "Tweets that were not classified as either Spanish or Catalan with at least 90% confidence were discarded." 
}, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-53", "text": "This threshold was chosen by manual inspection of the langid output." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-54", "text": "In the referendum dataset CT (control set XT), langid confidently labeled 4,014 (56,892) tweets as Spanish and 2,366 (10,178) as Catalan." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-55", "text": "To address the possibility of code-mixing within tweets, the first two authors manually annotated a sample of 100 tweets, of which half were confidently labeled as Spanish, and the other half as Catalan." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-56", "text": "They found only two examples of potential code-mixing, both of Catalan words in Spanish text." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-57", "text": "----------------------------------" }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-58", "text": "**CATALAN USAGE AND POLITICAL STANCE**" }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-59", "text": "The first research question concerns political stance: do pro-independence users tweet in Catalan at a higher rate than anti-independence users?" }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-60", "text": "We analyze the relationship between language use and stance on independence under two conditions, comparing the use of Catalan among pro-independence users vs. anti-independence users in (1) opinionated referendum-related tweets (tweets with Pro/Anti hashtags); and (2) all tweets." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-61", "text": "These conditions address the possibilities that the language distinction is relevant for pro/antiindependence Twitter users in political discourse and outside of political discourse, respectively." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-62", "text": "Method." 
}, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-63", "text": "The first step is to divide the Twitter users in U into pro-independence (PRO) and antiindependence (ANTI) groups." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-64", "text": "First, the proportion of tweets from each user that include a pro- independence hashtag is computed as" }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-65", "text": "anti ) is the count of tweets from user u that contain a pro-(anti-) independence hashtag." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-66", "text": "The PRO user set (U pro ) includes all users whose pro-independence proportion was above or equal to 75%, and the ANTI user set (U anti ) includes all users whose pro-independence proportion was below or equal to 25%." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-67", "text": "The counts of users and tweets identified as either Spanish or Catalan are presented in Table 2 ." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-68", "text": "To measure Catalan usage, let n To determine significance, the users are randomly shuffled between the two groups to recompute d over 100,000 iterations." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-69", "text": "The p-value is the proportion of permutations in which the randomized test statistic was greater than or equal to the original test statistic from the unpermuted data." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-70", "text": "Results. Catalan is used more often among the pro-independence users compared to the antiindependence users, across both the hashtagonly and all-tweet conditions." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-71", "text": "Table 3 shows that the proportion of tweets in Catalan for proindependence users (p pro ) is significantly higher than the proportion for anti-independence users (p anti )." 
}, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-72", "text": "This is consistent with Shoemark et al. (2017) , who found more Scots usage among proindependence users (d = 0.00555 for pro/anti tweets, d = 0.00709 for all tweets)." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-73", "text": "The relative differences between the groups are large: in the all-tweet condition,p pro is five times greater than p anti , whereas Shoemark et al. found a twofold difference (p pro = 0.01443 versusp anti = 0.00734 for all-tweet condition)." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-74", "text": "All raw proportions are two orders of magnitude greater than those in the Scottish study, a result of the denser language variable used in this study (full-tweet code-switching vs. intermittent code-mixing)." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-75", "text": "----------------------------------" }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-76", "text": "**CATALAN USAGE, TOPIC, AND AUDIENCE**" }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-77", "text": "One way to explain the variability in Catalan usage is through topic-induced variation, which proposes that people adapt their language style in response to a shift in topic (Rickford and McNair-Knox, 1994) ." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-78", "text": "This leads to our second research question: is Catalan more likely to be used in discussions of the referendum than in other topics?" }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-79", "text": "This analysis is conducted under three conditions." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-80", "text": "The first two conditions compare Catalan usage in referendum-hashtag tweets (pro, anti, and neutral) against (1) all tweets; and (2) tweets that contain a non-referendum hashtag." 
}, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-81", "text": "This second condition is meant to control for the general role of hashtags in reaching a wider audience (Pavalanathan and Eisenstein, 2015) , and its results motivate the third analysis, comparing (3) @-reply tweets with hashtag tweets." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-82", "text": "----------------------------------" }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-83", "text": "**REFERENDUM HASHTAGS**" }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-84", "text": "Method." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-85", "text": "We extract all users in U who have posted at least one referendum-related tweet and at least one tweet unrelated to the referendum into a new set, U R ." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-86", "text": "Tweet and user counts for all conditions are provided in Table 4 ." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-87", "text": "The small numbers are a result of the condition requirement and the language constraint (tweets must be identified as Spanish or Catalan with 90% confidence)." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-88", "text": "For a user u, we denote the proportion of u's referendum-related tweets written in Catalan byp (u) C , and the proportion of u's control tweets written in Catalan byp (u) X ." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-89", "text": "We are interested in the difference between these two propor-" }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-90", "text": "X and its average across all u) ." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-91", "text": "Under the null hypothesis that Catalan usage is unrelated to topic, d U R would be equal to 0, which we test for significance using a one-sample t-test." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-92", "text": "Results." 
}, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-93", "text": "Our results, presented in the middle columns of Table 5 , show that users tweet in Catalan at a significantly higher rate in referendum tweets than in all control tweets (first results column), but no significant difference was observed in the control condition where tweets include at least one hashtag (second results column)." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-94", "text": "The lack of a significant difference between referendum-related hashtags and other hashtags suggests that the topic being discussed is not as central in choosing one's language, compared with the audience being targeted." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-95", "text": "Our second result is the opposite of the prior finding that there were significantly fewer Scots words in referendum-related tweets than in control tweets (cf." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-96", "text": "Table 7 in Shoemark et al. (2017) ; d u = \u22120.0015 for all controls)." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-97", "text": "This suggests that Catalan may serve a different function than Scots in terms of political identity expression." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-98", "text": "Rather than suppressing their use of Catalan in broadcast tweets, users increase their Catalan use, perhaps to signal their Catalonian identity to a broader audience." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-99", "text": "This is supported by literature highlighting the integral role Catalan plays in the Catalonian national narrative (Crameri, 2017) , as well as the relatively high proportion of Catalan speakers in Catalonia: 80.4% of the population has speaking knowledge of Catalan (Government of Catalonia, 2013) , versus 30% population of Scotland with speaking knowledge of Scots (Scots Language Centre, 2011) ." 
}, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-100", "text": "There are also systemic differences between the political settings of the two cases: the Catalonian referendum had much larger support for separation among those who voted (92% in Catalonia vs. 45% in Scotland) (Fotheringham, 2017; Jeavens, 2014) ." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-101", "text": "These factors suggest a different public perception of national identity in the two regions within the context of the referenda, resulting in different motivations behind language choice." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-102", "text": "----------------------------------" }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-103", "text": "**REPLY TWEETS**" }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-104", "text": "Earlier work has highlighted the role of hashtags and @-replies as affordances for selecting large and small audiences, and their interaction with the use of non-standard vocabulary (Pavalanathan and Eisenstein, 2015) ." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-105", "text": "To test the role of audience size in Catalan use, we compare the proportion of Catalan in @-reply tweets against hashtag tweets." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-106", "text": "Method." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-107", "text": "In this analysis, we take the treatment set to be all tweets made by users in U R which contain an @-reply but not a hashtag (narrow audience), and control against all tweets which contain a hashtag but not an @-reply (wide audience)." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-108", "text": "Results." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-109", "text": "The results in the rightmost column of Table 5 demonstrate a significant tendency toward less Catalan use in @-replies than in hashtag tweets." 
}, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-110", "text": "This trend supports the hypothesis that Catalan is intended for a wider audience." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-111", "text": "This effect may also be explained by a subset of reply tweets in political discourse being targeted at national figures, possibly seeking to direct the message at the target's followers rather than to engage in discussion with the target." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-112", "text": "For example, one of the reply-tweets addresses a Spanish politician (\"user1\") Pavalanathan and Eisenstein (2015) : by replying to tweets from well-known individuals, it may be possible to reach a large audience, similar to the use of popular hashtags." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-113", "text": "----------------------------------" }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-114", "text": "**CONCLUSION**" }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-115", "text": "This study demonstrates the association of codeswitching with political stance, topic and audience, in the context of a political referendum." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-116", "text": "We corroborate prior work by showing that the use of a minority language is associated with proindependence political sentiment, and we also provide a result in contrast to prior work, that the use of a minority language is associated with a broader intended audience." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-117", "text": "This study extends the setting of code-switching from everyday conversation into specifically political conversation, which is subject to different expectations and constraints." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-118", "text": "This study does not use geographic signals, because the sparsity of geotagged tweets prevented us from restricting the scope to data generated in Catalonia proper." 
}, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-119", "text": "Another potential limitation is that assumption that political hashtags are robust signals for political stance." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-120", "text": "Other work has shown that political hashtags can be co-opted by opposing parties (Stewart et al., 2017) ." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-121", "text": "Our findings extend prior work on political use of Scots words on the inter-speaker level and Scots-English code-mixing on the intraspeaker level to examining language choice and code-switching, respectively." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-122", "text": "Further work is required to reconcile our results with prior work on topic differences and audience size (Pavalanathan and Eisenstein, 2015) ." }, { "sent_id": "3e0704e0928f2df8b7c2ffa9863a55-C001-123", "text": "Future work may also compare the Catalonian situation with multilingual societies in which a minority language is discouraged (Karrebaek, 2013) , or in which the languages are more equally distributed (Blommaert, 2011) ." 
} ], "y": { "@SIM@": { "gold_contexts": [ [ "3e0704e0928f2df8b7c2ffa9863a55-C001-14" ], [ "3e0704e0928f2df8b7c2ffa9863a55-C001-31" ], [ "3e0704e0928f2df8b7c2ffa9863a55-C001-41" ], [ "3e0704e0928f2df8b7c2ffa9863a55-C001-72" ] ], "cite_sentences": [ "3e0704e0928f2df8b7c2ffa9863a55-C001-14", "3e0704e0928f2df8b7c2ffa9863a55-C001-31", "3e0704e0928f2df8b7c2ffa9863a55-C001-41", "3e0704e0928f2df8b7c2ffa9863a55-C001-72" ] }, "@USE@": { "gold_contexts": [ [ "3e0704e0928f2df8b7c2ffa9863a55-C001-14" ], [ "3e0704e0928f2df8b7c2ffa9863a55-C001-41" ] ], "cite_sentences": [ "3e0704e0928f2df8b7c2ffa9863a55-C001-14", "3e0704e0928f2df8b7c2ffa9863a55-C001-41" ] }, "@BACK@": { "gold_contexts": [ [ "3e0704e0928f2df8b7c2ffa9863a55-C001-27" ], [ "3e0704e0928f2df8b7c2ffa9863a55-C001-43" ], [ "3e0704e0928f2df8b7c2ffa9863a55-C001-45" ] ], "cite_sentences": [ "3e0704e0928f2df8b7c2ffa9863a55-C001-27", "3e0704e0928f2df8b7c2ffa9863a55-C001-43", "3e0704e0928f2df8b7c2ffa9863a55-C001-45" ] }, "@DIF@": { "gold_contexts": [ [ "3e0704e0928f2df8b7c2ffa9863a55-C001-95", "3e0704e0928f2df8b7c2ffa9863a55-C001-96" ] ], "cite_sentences": [ "3e0704e0928f2df8b7c2ffa9863a55-C001-96" ] } } }, "ABC_805935a672f5d706bd878a73fa8171_24": { "x": [ { "sent_id": "805935a672f5d706bd878a73fa8171-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-2", "text": "This paper presents the introduction of WordNet semantic classes in a dependency parser, obtaining improvements on the full Penn Treebank for the first time." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-3", "text": "We tried different combinations of some basic semantic classes and word sense disambiguation algorithms." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-4", "text": "Our experiments show that selecting the adequate combination of semantic features on development data is key for success." 
}, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-5", "text": "Given the basic nature of the semantic classes and word sense disambiguation algorithms used, we think there is ample room for future improvements." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-6", "text": "Using semantic information to improve parsing performance has been an interesting research avenue since the early days of NLP, and several research works have tried to test the intuition that semantics should help parsing, as can be exemplified by the classical PP attachment experiments (Ratnaparkhi, 1994) ." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-7", "text": "Although there have been some significant results (see Section 2), this issue continues to be elusive." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-8", "text": "In principle, dependency parsing offers good prospects for experimenting with word-to-word-semantic relationships." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-9", "text": "We present a set of experiments using semantic classes in dependency parsing of the Penn Treebank (PTB)." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-10", "text": "We extend the tests made in Agirre et al. (2008) , who used different types of semantic information, obtaining significant improvements in two constituency parsers, showing how semantic information helps in constituency parsing." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-11", "text": "As our baseline parser, we use MaltParser (Nivre, 2006) ." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-12", "text": "We will evaluate the parser on both the full PTB (Marcus et al. 1993 ) and on a senseannotated subset of the Brown Corpus portion of PTB, in order to investigate the upper bound performance of the models given gold-standard sense information, as in Agirre et al. (2008) ." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-13", "text": "Agirre et al. 
(2008) trained two state-of-the-art statistical parsers (Charniak, 2000; Bikel, 2004 ) on semantically-enriched input, where content words had been substituted with their semantic classes." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-14", "text": "This was done trying to overcome the limitations of lexicalized approaches to parsing (Magerman, 1995; Collins, 1996; Charniak, 1997; Collins, 2003) , where related words, like scissors and knife cannot be generalized." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-15", "text": "This simple method allowed incorporating lexical semantic information into the parser." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-16", "text": "They tested the parsers in both a full parsing and a PP attachment context." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-17", "text": "The experiments showed that semantic classes gave significant improvement relative to the baseline, demonstrating that a simplistic approach to incorporating lexical semantics into a parser significantly improves its performance." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-18", "text": "This work presented the first results over both WordNet and the Penn Treebank to show that semantic processing helps parsing." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-19", "text": "Collins (2000) tested a combined parsing/word sense disambiguation model based in WordNet which did not obtain improvements in parsing." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-20", "text": "presented a semisupervised method for training dependency parsers, using word clusters derived from a large unannotated corpus as features." 
}, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-21", "text": "They demonstrate the effectiveness of the approach in a series of dependency parsing experiments on PTB and the Prague Dependency Treebank, showing that the cluster-based features yield substantial gains in performance across a wide range of conditions." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-22", "text": "Suzuki et al. (2009) also experiment with the same method combined with semi-supervised learning." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-23", "text": "699" }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-24", "text": "----------------------------------" }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-25", "text": "**INTRODUCTION**" }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-26", "text": "Using semantic information to improve parsing performance has been an interesting research avenue since the early days of NLP, and several research works have tried to test the intuition that semantics should help parsing, as can be exemplified by the classical PP attachment experiments (Ratnaparkhi, 1994) ." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-27", "text": "Although there have been some significant results (see Section 2), this issue continues to be elusive." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-28", "text": "In principle, dependency parsing offers good prospects for experimenting with word-to-word-semantic relationships." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-29", "text": "We present a set of experiments using semantic classes in dependency parsing of the Penn Treebank (PTB)." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-30", "text": "We extend the tests made in Agirre et al. (2008) , who used different types of semantic information, obtaining significant improvements in two constituency parsers, showing how semantic information helps in constituency parsing." 
}, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-31", "text": "As our baseline parser, we use MaltParser (Nivre, 2006) ." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-32", "text": "We will evaluate the parser on both the full PTB (Marcus et al. 1993 ) and on a senseannotated subset of the Brown Corpus portion of PTB, in order to investigate the upper bound performance of the models given gold-standard sense information, as in Agirre et al. (2008) ." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-33", "text": "Agirre et al. (2008) trained two state-of-the-art statistical parsers (Charniak, 2000; Bikel, 2004) on semantically-enriched input, where content words had been substituted with their semantic classes." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-34", "text": "This was done trying to overcome the limitations of lexicalized approaches to parsing (Magerman, 1995; Collins, 1996; Charniak, 1997; Collins, 2003) , where related words, like scissors and knife cannot be generalized." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-35", "text": "This simple method allowed incorporating lexical semantic information into the parser." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-36", "text": "They tested the parsers in both a full parsing and a PP attachment context." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-37", "text": "The experiments showed that semantic classes gave significant improvement relative to the baseline, demonstrating that a simplistic approach to incorporating lexical semantics into a parser significantly improves its performance." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-38", "text": "This work presented the first results over both WordNet and the Penn Treebank to show that semantic processing helps parsing." 
}, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-39", "text": "----------------------------------" }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-40", "text": "**RELATED WORK**" }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-41", "text": "Collins (2000) tested a combined parsing/word sense disambiguation model based in WordNet which did not obtain improvements in parsing." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-42", "text": "presented a semisupervised method for training dependency parsers, using word clusters derived from a large unannotated corpus as features." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-43", "text": "They demonstrate the effectiveness of the approach in a series of dependency parsing experiments on PTB and the Prague Dependency Treebank, showing that the cluster-based features yield substantial gains in performance across a wide range of conditions." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-44", "text": "Suzuki et al. (2009) also experiment with the same method combined with semi-supervised learning." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-45", "text": "Ciaramita and Attardi (2007) show that adding semantic features extracted by a named entity tagger (such as PERSON or MONEY) improves the accuracy of a dependency parser, yielding a 5.8% relative error reduction on the full PTB." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-46", "text": "Candito and Seddah (2010) performed experiments in statistical parsing of French, where terminal forms were replaced by more general symbols, particularly clusters of words obtained through unsupervised clustering." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-47", "text": "The results showed that word clusters had a positive effect." 
}, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-48", "text": "Regarding dependency parsing of the English PTB, currently Koo and Collins (2010) and Zhang and Nivre (2011) hold the best results, with 93.0 and 92.9 unlabeled attachment score, respectively." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-49", "text": "Both works used the Penn2Malt constituency-todependency converter, while we will make use of PennConverter (Johansson and Nugues, 2007) ." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-50", "text": "Apart from these, there have been other attempts to make use of semantic information in different frameworks and languages, as in (Hektoen 1997; Xiong et al. 2005; Fujita et al. 2007 )." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-51", "text": "----------------------------------" }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-52", "text": "**EXPERIMENTAL FRAMEWORK**" }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-53", "text": "In this section we will briefly describe the datadriven parser used for the experiments (subsection 3.1), followed by the PTB-based datasets (subsection 3.2)." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-54", "text": "Finally, we will describe the types of semantic representation used in the experiments." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-55", "text": "----------------------------------" }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-56", "text": "**MALTPARSER**" }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-57", "text": "MaltParser (Nivre et al. 2006 ) is a trainable dependency parser that has been successfully applied to typologically different languages and treebanks." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-58", "text": "We will use one of its standard versions (version 1.4)." 
}, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-59", "text": "The parser obtains deterministically a dependency tree in linear-time in a single pass over the input using two main data structures: a stack of partially analyzed items and the remaining input sequence." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-60", "text": "To determine the best action at each step, the parser uses history-based feature models and SVM classifiers." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-61", "text": "One of the main reasons for using MaltParser for our experiments is that it easily allows the introduction of semantic information, adding new features, and incorporating them in the training model." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-62", "text": "----------------------------------" }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-63", "text": "**DATASET**" }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-64", "text": "We used two different datasets: the full PTB and the Semcor/PTB intersection (Agirre et al. 2008 )." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-65", "text": "The full PTB allows for comparison with the stateof-the-art, and we followed the usual train-test split." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-66", "text": "The Semcor/PTB intersection contains both gold-standard sense and parse tree annotations, and allows to set an upper bound of the relative impact of a given semantic representation on parsing." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-67", "text": "We use the same train-test split of Agirre et al. (2008) , with a total of 8,669 sentences containing 151,928 words partitioned into 3 sets: 80% training, 10% development and 10% test data." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-68", "text": "This dataset is available on request to the research community." 
}, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-69", "text": "We will evaluate the parser via Labeled Attachment Score (LAS)." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-70", "text": "We will use Bikel's randomized parsing evaluation comparator to test the statistical significance of the results using word sense information, relative to the respective baseline parser using only standard features." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-71", "text": "We used PennConverter (Johansson and Nugues, 2007) to convert constituent trees in the Penn Treebank annotation style into dependency trees." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-72", "text": "Although in general the results from parsing Pennconverter's output are lower than with other conversions, Johansson and Nugues (2007) claim that this conversion is better suited for semantic processing, with a richer structure and a more finegrained set of dependency labels." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-73", "text": "For the experiments, we used the best configuration for English at the CoNLL 2007 Shared Task on Dependency Parsing (Nivre et al., 2007) as our baseline." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-74", "text": "----------------------------------" }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-75", "text": "**SEMANTIC REPRESENTATION AND DISAMBIGUATION METHODS**" }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-76", "text": "We will experiment with the range of semantic representations used in Agirre et al. (2008) , all of which are based on WordNet 2.1." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-77", "text": "Words in WordNet (Fellbaum, 1998) are organized into sets of synonyms, called synsets (SS)." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-78", "text": "Each synset in turn belongs to a unique semantic file (SF)." 
}, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-79", "text": "There are a total of 45 SFs (1 for adverbs, 3 for adjectives, 15 for verbs, and 26 for nouns), based on syntactic and semantic categories." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-80", "text": "For example, noun semantic files (SF_N) differentiate nouns denoting acts or actions, and nouns denoting animals, among others." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-81", "text": "We experiment with both full synsets and SFs as instances of fine-grained and coarse-grained semantic representation, respectively." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-82", "text": "As an example of the difference in these two representations, knife in its tool sense is in the EDGE TOOL USED AS A CUTTING INSTRUMENT singleton synset, and also in the ARTIFACT SF along with thousands of other words including cutter." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-83", "text": "Note that these are the two extremes of semantic granularity in WordNet." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-84", "text": "As a hybrid representation, we also tested the effect of merging words with their corresponding SF (e.g. knife+ARTIFACT)." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-85", "text": "This is a form of semantic specialization rather than generalization, and allows the parser to discriminate between the different senses of each word, but not generalize across words." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-86", "text": "For each of these three semantic representations, we experimented with using each of: (1) all open-class POSs (nouns, verbs, adjectives and adverbs), (2) nouns only, and (3) verbs only." 
}, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-87", "text": "There are thus a total of 9 combinations of representation type and target POS: SS (synset), SS_N (noun synsets), SS_V (verb synsets), SF (semantic file), SF_N (noun semantic files), SF_V (verb semantic files), WSF (wordform+SF), WSF_N (wordform+SF for nouns) and WSF_V (for verbs)." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-88", "text": "For a given semantic representation, we need some form of WSD to determine the semantics of each token occurrence of a target word." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-89", "text": "We experimented with three options: a) gold-standard (GOLD) annotations from SemCor, which gives the upper bound performance of the semantic representation, b) first Sense (1ST), where all token instances of a given word are tagged with their most frequent sense in WordNet, and c) automatic Sense Ranking (ASR) which uses the sense returned by an unsupervised system based on an independent corpus (McCarthy et al. 2004) ." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-90", "text": "For the full Penn Treebank experiments, we only had access to the first sense, taken from Wordnet 1.7." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-91", "text": "----------------------------------" }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-92", "text": "**RESULTS**" }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-93", "text": "In the following two subsections, we will first present the results in the SemCor/PTB intersection, with the option of using gold, 1st sense and automatic sense information (subsection 4.1) and the next subsection (4.2) will show the results on the full PTB, using 1st sense information." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-94", "text": "All results are shown as labelled attachment score (LAS)." 
}, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-95", "text": "----------------------------------" }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-96", "text": "**SEMCOR/PTB (GOLD/1ST/ASR)**" }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-97", "text": "We conducted a series of experiments testing:" }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-98", "text": "\u2022 Each individual semantic feature, which gives 9 possibilities, also testing different learning configurations for each one." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-99", "text": "\u2022 Combinations of semantic features, for instance, SF+SS_N+WSF would combine the semantic file with noun synsets and wordform+semantic file." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-100", "text": "Although there were hundreds of combinations, we took the best combination of semantic features on the development set for the final test." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-101", "text": "For that reason, the table only presents 10 results for each disambiguation method, 9 for the individual features and one for the best combination." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-102", "text": "Table 1 presents the results obtained for each of the disambiguation methods (gold standard sense information, 1st sense, and automatic sense ranking) and individual semantic feature." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-103", "text": "In all cases except two, the use of semantic classes is benefi- cial albeit small." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-104", "text": "Regarding individual features, the SF feature using GOLD senses gives the best improvement." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-105", "text": "However, GOLD does not seem to clearly improve over 1ST and ASR on the rest of the features." 
}, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-106", "text": "Comparing the automatically obtained classes, 1ST and ASR, there is no evident clue about one of them being superior to the other." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-107", "text": "Regarding the best combination as selected in the training data, each WSD method yields a different combination, with best results for 1ST." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-108", "text": "The improvement is statistically significant for both 1ST and GOLD." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-109", "text": "In general, the results in Table 1 do not show any winning feature across all WSD algorithms." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-110", "text": "The best results are obtained when using the first sense heuristic, but the difference is not statistically significant." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-111", "text": "This shows that perfect WSD is not needed to obtain improvements, but it also shows that we reached the upperbound of our generalization and learning method." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-112", "text": "----------------------------------" }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-113", "text": "**PENN TREEBANK AND 1ST SENSE**" }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-114", "text": "We only had 1st sense information available for the full PTB." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-115", "text": "We tested MaltParser on the best configuration obtained for the reduced Semcor/PTB on the full treebank, taking sections 2-21 for training and section 23 for the final test." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-116", "text": "Table 2 presents the results, showing that several of the individual features and the best combination give significant improvements." 
}, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-117", "text": "To our knowledge, this is the first time that WordNet semantic classes help to obtain improvements on the full Penn Treebank." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-118", "text": "It is interesting to mention that, although not shown on the tables, using lemmatization to assign semantic classes to wordforms gave a slight increase for all the tests (0.1 absolute point approximately), as it helped to avoid data sparseness." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-119", "text": "We applied Schmid's (1994) TreeTagger." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-120", "text": "This can be seen as an argument in favour of performing morphological analysis, an aspect that is many times neglected when processing morphologically poor languages as English." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-121", "text": "We also did some preliminary experiments using Koo et al.'s (2008) word clusters, both independently and also combined with the WordNetbased features, without noticeable improvements." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-122", "text": "----------------------------------" }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-123", "text": "**CONCLUSIONS**" }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-124", "text": "We tested the inclusion of several types of semantic information, in the form of WordNet semantic classes in a dependency parser, showing that:" }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-125", "text": "\u2022 Semantic information gives an improvement on a transition-based deterministic dependency parsing." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-126", "text": "\u2022 Feature combinations give an improvement over using a single feature." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-127", "text": "Agirre et al. 
(2008) used a simple method of substituting wordforms with semantic information, which only allowed using a single semantic feature." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-128", "text": "MaltParser allows the combination of several semantic features together with other features such as wordform, lemma or part of speech." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-129", "text": "Although tables 1 and 2 only show the best combination for each type of semantic information, this can be appreciated on GOLD and 1ST in Table 1 ." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-130", "text": "Due to space reasons, we only have showed the best combination, but we can say that in general combining features gives significant increases over using a single semantic feature." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-131", "text": "\u2022 The present work presents a statistically significant improvement for the full treebank using WordNet-based semantic information for the first time." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-132", "text": "Our results extend those of Agirre et al. (2008) , which showed improvements on a subset of the PTB." }, { "sent_id": "805935a672f5d706bd878a73fa8171-C001-133", "text": "Given the basic nature of the semantic classes and WSD algorithms, we think there is room for future improvements, incorporating new kinds of semantic information, such as WordNet base concepts, Wikipedia concepts, or similarity measures." 
} ], "y": { "@DIF@": { "gold_contexts": [ [ "805935a672f5d706bd878a73fa8171-C001-10" ], [ "805935a672f5d706bd878a73fa8171-C001-30" ], [ "805935a672f5d706bd878a73fa8171-C001-132" ] ], "cite_sentences": [ "805935a672f5d706bd878a73fa8171-C001-10", "805935a672f5d706bd878a73fa8171-C001-30", "805935a672f5d706bd878a73fa8171-C001-132" ] }, "@EXT@": { "gold_contexts": [ [ "805935a672f5d706bd878a73fa8171-C001-10" ], [ "805935a672f5d706bd878a73fa8171-C001-30" ], [ "805935a672f5d706bd878a73fa8171-C001-132" ] ], "cite_sentences": [ "805935a672f5d706bd878a73fa8171-C001-10", "805935a672f5d706bd878a73fa8171-C001-30", "805935a672f5d706bd878a73fa8171-C001-132" ] }, "@SIM@": { "gold_contexts": [ [ "805935a672f5d706bd878a73fa8171-C001-12" ], [ "805935a672f5d706bd878a73fa8171-C001-32" ], [ "805935a672f5d706bd878a73fa8171-C001-67" ], [ "805935a672f5d706bd878a73fa8171-C001-76" ] ], "cite_sentences": [ "805935a672f5d706bd878a73fa8171-C001-12", "805935a672f5d706bd878a73fa8171-C001-32", "805935a672f5d706bd878a73fa8171-C001-67", "805935a672f5d706bd878a73fa8171-C001-76" ] }, "@BACK@": { "gold_contexts": [ [ "805935a672f5d706bd878a73fa8171-C001-13" ], [ "805935a672f5d706bd878a73fa8171-C001-33" ], [ "805935a672f5d706bd878a73fa8171-C001-127" ] ], "cite_sentences": [ "805935a672f5d706bd878a73fa8171-C001-13", "805935a672f5d706bd878a73fa8171-C001-33", "805935a672f5d706bd878a73fa8171-C001-127" ] }, "@USE@": { "gold_contexts": [ [ "805935a672f5d706bd878a73fa8171-C001-32" ], [ "805935a672f5d706bd878a73fa8171-C001-64" ], [ "805935a672f5d706bd878a73fa8171-C001-67" ], [ "805935a672f5d706bd878a73fa8171-C001-76" ] ], "cite_sentences": [ "805935a672f5d706bd878a73fa8171-C001-32", "805935a672f5d706bd878a73fa8171-C001-64", "805935a672f5d706bd878a73fa8171-C001-67", "805935a672f5d706bd878a73fa8171-C001-76" ] } } }, "ABC_f881f6c65301fdfe2fffe7a18e05c4_25": { "x": [ { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-2", "text": "Cue 
phrases are linguistic expressions such as 'now' and 'well' that may explicitly mark the structure of a discourse." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-3", "text": "For example, while the cue phrase 'incidentally' may be used SENTENTIALLY as an adverbial, the DISCOURSE use initiates a digression." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-4", "text": "In [8], we noted the ambiguity of cue phrases with respect to discourse and sentential usage and proposed an intonational model for their disambiguation." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-5", "text": "In this paper, we extend our previous characterization of cue phrases and generalize its domain of coverage, based on a larger and more comprehensive empirical study: an examination of all cue phrases produced by a single speaker in recorded natural speech." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-6", "text": "We also associate this prosodic model with orthographic and part-of-speech analyses of cue phrases in text." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-7", "text": "Such a dual model provides both theoretical justification for current computational models of discourse and practical application to the generation of synthetic speech." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-8", "text": "----------------------------------" }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-10", "text": "Words and phrases that may directly mark the structure of a discourse have been termed CUE PHRASES, CLUE WORDS, DISCOURSE MARKERS, and DISCOURSE PARTICLES [3, 4, 14, 17, 19] ." 
}, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-11", "text": "Some exarnpies are 'now', which marks the introduction of a new subtopic or return to a previous one, 'incidentally' and 'by the way', which indicate the beginning of a digression, and 'anyway' and 'in any case', which indicate return from a digression." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-12", "text": "In a previous study [8] , we noted that such terms are potentially ambiguous between DISCOURSE and SENTENTIAL uses [18] ." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-13", "text": "So, 'now' may be used as a temporal adverbial as well as a discourse marker, 'incidentally' may also function as an adverbial, and other cue phrases similarly have one or more senses in addition to their function as markers of discourse structure." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-14", "text": "Based upon an empiricM study of 'now' in recorded speech, we proposed that such discourse and sentential uses of cue phrases can be disambiguated intonationally." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-15", "text": "In particular, we proposed a prosodic model for this disambiguation which discriminated all discourse from *We thank Bengt Altenberg, l=tichaa-d Omanson mid Jan van Santen for providing information and helpful comments on this work." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-16", "text": "sentential uses of tokens in our sample." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-17", "text": "This model provided not only a plausibility argument for the disambiguation of cue phrases, but also the beginnings of a model for the generation of cue phrases in synthetic speech." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-18", "text": "In this paper, we show that our prosodic model generalizes to other cue phrases as well." 
}, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-19", "text": "We further propose an initial model for the disambiguation of cuc phrases in text." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-20", "text": "Wc base these claims upon a further empirical study: an examination of all cue phrases produced by a single speaker in part of a recorded, transcribed lecture." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-21", "text": "In Section 2 we review our own and other work on cue phrases, in Section 3 we describe our current empirical studies, in Section 4 we present the results of our analysis, and in Section 5 we discuss theoretical and practical applications of our findings." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-22", "text": "----------------------------------" }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-23", "text": "**PREVIOUS STUDIES**" }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-24", "text": "The important role that cue phrases play in understanding and generating discourse has been well documented in the computational linguistics literature." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-25", "text": "For example, by indicating the presence of a structural boundary or a relationship between parts of a discourse, cue phrases caa assist in the resolution of anaphora [5, 4, 17] and in the identification of rhetorical relations [10, 12, 17] ." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-26", "text": "Cue phrases have also been used to reduce the complexity of discourse processing and to increase textual coherence [3, 11, 21] ." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-27", "text": "In Example (1) 1, interpretation of the anaphor 'it' as (correctly) co-indexed with THE SYSTEM is facilitated by the presence of the cue phrases 'say' and 'then', marking potential antecedents in '... as an EXPERT DATABASE for AN EXPERT SYSTEM ...' as structurally unavailable." 
}, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-28", "text": "2" }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-29", "text": "(1) \"If THE SYSTEM attenqpts to hold rules, say as AN EXPERT DATABASE for AN EXPERT SYSTEM, then we expect it not only to hold the rules but to in fact apply them for us in appropriate situations." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-30", "text": "\" 1The examples are taken from the corpus described in Section 3." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-31", "text": "2InformMly, 'say' indicates the beginning of a discourse subtopic and 'then' signals a return from that subtopic." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-32", "text": "Previous attempts to define the set of cue phrases have typically been extensional, 3 with such lists of cue phrases then further classified as to their discourse function." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-33", "text": "For example, Cohen [3] uses a taxonomy of connectives based on that of Quirk [16] to associate with each class of cue phrases a semantic function with respect to a model of argument understanding." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-34", "text": "Grosz and Sidner [4] classify cue phrases based on changes to the attentional stack and intentional structure found in their theory of discourse." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-35", "text": "Schiffrin [18] classifies cue phrases into groups based on their sentential usage (e.g. conjunctive, adverbial, and clausal markers), while Reichman [17] and Ilobbs [10] associate groups of cue phrases with the rhetorical relationships they signal." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-36", "text": "Finally, Zukerman [21] presents a taxonomy of cue phrases based on three functions relevant to her work in language generation: knowledge organization, knowledge acquisition, and affect maintenance." 
}, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-37", "text": "Once a cue phrase has been identified, however, it is not always clear whether to interpret it as a discourse marker or not [6, 4, 8, 18] ." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-38", "text": "The texts in Exampie (2) are potentially ambiguous between a temporal reading of 'now' and a discourse interpretation:" }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-39", "text": "\"Now in AI our approach is to look at a knowledge base as a set of symbolic items that represent something.\"" }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-40", "text": "b. \"Now some of you may suspect from the title of this talk that this word is coming to you from Krypton or some other possible world.\"" }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-41", "text": "On the temporal reading, (2a), for example, would convey that 'at this moment the AI approach to knowledge bases has changed'; on the discourse reading, 'now' simply initiates the topic of 'the AI approach to knowledge bases'." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-42", "text": "It has been suggested that this difference between discourse and sententiai use may be intonationally disambiguable." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-43", "text": "Halliday and Hassan [6] claim that, in general, items used COtIES1VELY --i.e., to relate one part of a text to another [6, p. 4 ] --tend to be intonationally non-prominent (to be unaccented and reduced) unless they are \"definitely contrastive\"." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-44", "text": "Non-cohesive uses, on the other hand, are indicated by non-reduced, accented forms.[6, p. 268] ttalliday and llassan particularly note that intonation disambiguates in this way between cohesive (discourse) and non-cohesive (sentential) uses of classes of items we term cue phrases, such as conjunctions and adverbials." 
}, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-45", "text": "Empirical studies to date have tended to bear out their observations." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-46", "text": "Studies of portions of the London-Lund corpus such as [1] have provided intonational profiles of word classes including DISCOURSE ITEMS, conjunctions and adverbials which are at least compatible with these views." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-47", "text": "However, the notion of 'discourse item' used appears much more restrictive 3An exception to this is found in the socio-linguistic work of Schifl'rin [18] ." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-48", "text": "than the notion of 'cue phrase', 4 so it is difficult to make comparative use of these results." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-49", "text": "In an earlier study [8] , we examined the use of various intonational, syntactic, and orthographic features to distinguish between discourse and sententim readings of a single cue phrase ('now')." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-50", "text": "5 While features such as tense, structural configuration, surface order, and orthographic indicators were sometimes useful, we found that intonational features provided only only significant correlation with discourse/sentential status." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-51", "text": "All of the tokens in our sample were disarnbiguable in terms of intonational phrasing and type of pitch accentfi" }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-52", "text": "In our study of now, we found that discourse uses were either uttered as a single intermediate phrase (or in a phrase containing only cue phrases) (Discourse Type A), or uttered at the beginning of a longer intermediate phrase (or preceded only by other cue phrases in the phrase) and with a L* pitch accent or without a pitch accent (Discourse Type B)." 
}, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-53", "text": "Cue phrases judged to be of Sentential Type were never uttered as a single phrase; if first in intermediate phrase they were nearly always uttered with a H* or complex pitch accent (Sentential Type A); if not first in phrase they could bear any type of pitch accent or be deaccented (Sentential Type B)." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-54", "text": "These results are summarized in Figure I ." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-55", "text": "Based on these findings, we proposed that listeners use prosodic information to disambiguate discourse from sentential uses of cue phrases." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-56", "text": "To investigate this possibility further, we conducted another multispeaker study of discourse and sentential uses of the cue phrase 'welt." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-57", "text": "Our findings were alrnost identical to results for the earlier study; briefly, of the 52 in- 4 For example, in the 48 minute text Altenberg examines, he finds only 23 discourse items, or about 17% of what our study of a similar corpus (described below) would have predicted." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-58", "text": "Our corpus consisted of recordings of four days of tile radio call-in program \"The Harry Gross Show: Speaking of Your Money,\" recorded during the week of 1 February 1982115] ." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-59", "text": "The four shows provided approximately ten hours of conversation between expert(s) m~d callers." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-60", "text": "6For the earlier study as well as the current one, we assume Pierrehumbel~,'s [13] system of phonological description." 
}, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-61", "text": "In this system, intonational contours are described as sequences of low (L) and high (H) tones in the FuraDAM~NTAL errs." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-62", "text": "QUENCV (F0) CONTOUrt." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-63", "text": "Pitch accents, peaks or valleys in the F0 contour that fall on the stressed syllables of lexical items, signify intonational prominence." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-64", "text": "A pitch accent consists either of a single tone or an ordered pair of tones, such as L*+H. The tone aligned with the stressed syllable is indicated by a star *; thus, in an L*+H accent, the low tone L* is aligned with the stressed syUahle." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-65", "text": "There are six pitch accents in English: two simple tones --H and L --and four To see whether these findings could be extended to cue phrases in general, we began a third study --of all cue phrases produced by a single speaker during 75 minutes of recorded speech." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-66", "text": "The remainder of this paper describes our first results from this study." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-67", "text": "----------------------------------" }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-68", "text": "**THE DATA**" }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-69", "text": "To test whether our prosodic model of discourse and sentential uses of 'now' and 'well' extended to cue phrases in general, we examined intonational chm'-acteristics of all single-word cue phrases 7 used in a keynote, address given by I~onald Brachman at the First lnlernalional Conference on Expert Database Syslems in 1986." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-70", "text": "The address provides approximately 75 minutes of speech t\u00a5om a single speaker." 
}, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-71", "text": "For our first sample, we examined the 211 cue phrases uttered during the first 17 minutes of the address." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-72", "text": "Our tokens had the following distribution: s actually (6) , also (2) , although (1) , and (68), basically (1) , because (2) , but (12) , finally (1) , ]i,'sl (1) , further (4) , however (2) , like (11) , look (11) , next (4) , now (26), ok (1), or (19) , say (12) , second (1) , see (5) , since (1) , so (9), then (3) , therefore (1) , well (7) ." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-73", "text": "To determine the classification of each token (ms-COURSE, SENTENTIAL, or AMBIGUOUS), the authors separately .judged each token by listening to the taped address while marking a transcription." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-74", "text": "9 rWe exmnined o~fly single-word cue plu, asea in tiffs study since our current prosodic model applies only to such items." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-75", "text": "In future work we plan to develop additional models for discourse a~nL(l aententiel uses of multl-word cue phrases, e.g 'that reminds me', 'first o] all', 'speaking off and so on." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-76", "text": "8Our set of cue phrases was derived from extensional definitions provided by ourselves and othel~ [3, 4, 17, 18, 21] ." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-77", "text": "Tim following lexicel items, although also cue phrases, are not present in the portion of the axlch-ess examined to date: 9The address was transcribed independently of our study by a meraber of the text processing pool at AT&T Bell Laboratories." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-78", "text": "We found that 20 cite phrases had been omitted by the traalscriber: 'and', 'now', 'ok', 'so', and 'well'." 
}, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-79", "text": "Significantly, ell but two of these were termed 'discourse' uses by In comparing our judgments, we were interested in areas of disagreement as well as agreement." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-80", "text": "The set of tokens whose classification as to discourse or sentential use we agreed upon provide a testbed for our continuing investigation of the intonational disambiguation of cue phrases." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-81", "text": "The set of tokens we found difficult to classify (i.e. those tokens we both found ambiguous or those whose cla.ssification we disagreed upon), provide insight into possible intonational correlates of discourse/sentential ambiguity. \"Fable 1 presents the distribution of our judgments, where 'classifiable' represent those tokens whose classification we agreed upon and 'unclassifiable' represents those we both found ambiguous or disagreed upon." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-82", "text": "Of the 211 tokens in this initial sample, we found only 133 cue phrases (63.03%) to be unambiguously discourse or sentential." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-83", "text": "When we looked more closely at the 'unclassifiable' cases, we found that fully 73.08% were coordinate conjunctions (and, or, and but)." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-84", "text": "In fact, when we compare percent classifiable for conjunctions with other cue phrases, we find that, while only 42.42% of conjunctions were found to be classifiable, fully 81.25% of non-conjunctions were classified." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-85", "text": "Thus, the case of coordinate conjunction appears to explain a large portion of the our difficulty in agreeing upon a classification." 
}, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-86", "text": "Once we had made these judgments, we analyzed the tokens for their prosodic and syntactic features as well ms their orthographic context, much as we had done with tokens for the earlier two studies./\u00b0 We noted whether each token was accented or not and, if accented, we noted the type of accent employed." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-87", "text": "We also identified the composition of the intermedi-" }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-88", "text": "1\u00b0We used a pitch tracker written by David Telkin and Talkin's Waves speech analysis software [20] in our prosodic analysis." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-89", "text": "ate phrase containing each token as to whether the token constituted a separate phrase (possibly with other cue phrases) or not. And we noted each token's position within its intermediate phrase --first (including tokens preceded only by other cue phrases) or not." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-90", "text": "We also noted syntactic characteristics of each item, including part of speech and its immediately dominating constituent, n Finally, we noted orthographic indicators in the transcript which might provide disambiguation, such as immediately preceding and succeeding punctuation and paragraph boundaries." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-91", "text": "In both the syntactic and orthographic analyses we were particularly interested in discovering how well non-prosodic features which might be obtained automatically from a text would do in differentiating discourse from sentential uses." 
}, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-92", "text": "----------------------------------" }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-93", "text": "**THE SINGLE-SPEAKER/MULTICUE PHRASE STUDY**" }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-94", "text": "Our findings from the classified data (133 tokens) in this pilot study confirmed our model of prosodic distinction between discourse and sentential uses of cue phrases." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-95", "text": "The distribution of these judgments with respect to the prosodic model of discourse and sentential cue phrases depicted in Figure 1 is shown in Table 2 . Recall that this model includes two intona- Table 2 that the ratio of discourse to sentential usage was about 1:2." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-96", "text": "Of the 44 tokens judged to represent discourse use and fitting our prosodic model, one third were of Discourse Type A and twothirds of Discourse Type B." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-97", "text": "llWe used Hindie's parser Fidditch [7] to obtain constituent structure and Fidditch aaid Church's part-of-speech program [2] for part of speech assignment." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-98", "text": "While overall results are quite significant, the 17 items judged sentential which nonetheless fit the discourse prosodic model must be explained." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-99", "text": "Of these 17, 14 (representing two thirds of the total error) are conjuncts (11 'and's and 3 '0r's) which fit the type (b) discourse prosodic model." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-100", "text": "While all are thus first in intermediate phrase --and, in fact, in intonational phrase --none are utterance-initial." 
}, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-101", "text": "Both judges found such items relatively difficult to distinguish between discourse and sentential use." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-102", "text": "12 In (3), for example, while the first and seems clearly sentential, the second seems much more problematic." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-103", "text": "(3) \"But instead actually we are bringing some thoughts on expert databases from a place that is even stranger and further away and that of course is the magical world of artificial intelligence.\"" }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-104", "text": "The difficulty in such cases appears confined to instances ofsentential coordination where the conjunct is not utterance initial." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-105", "text": "Table 3 shows how judgments were distributed with respect to our prosodic model when coordinate conjunctions are removed from the sample." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-106", "text": "Our model thus predicts 93.4% of non- conjunct cue phrase distinctions, as opposed to the 84.2% success rate shown in Table 2 ." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-107", "text": "Our prosodic model itself can of course be decomposed to examine the contributions of individual features to discourse/sentential judgments." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-108", "text": "Table 4 shows the distribution of judgments by all possible feature complexes for all tokens) 3 This distribution reveals that there is considerable agreement when cue phrases appear alone in their intermediate phrase (OF*, corresponding to Discourse type A in Figure 1) ; such items are most frequently judged to be discourse uses." 
}, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-109", "text": "There is even more agreement when cue phrases appear in non-initial position in a larger intermediate phrase (NONF* --Sentential type B in l~See Section 3." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-110", "text": "Of the 99 conjuncts in thin study, both judges agreed on a discom~e/sentential distinction only 42.4% of the time, compared to 78.6~ agreement on non-conjtmcts." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-111", "text": "Conjunct tokens represented two-thirds of all tokens the judges disagreed on, and 68:9% of tokens at least one judge was unable to assign." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-112", "text": "Figure 1) ; tlhese tend to be judged sentential." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-113", "text": "However, tokens which fit Discourse type B in Figure 1 (first in a larger phrase and deaccented (NOFD) or with a L* (NOFL)) appear more problematic: of the former, there was disagreement on fiflly two thirds) 4 While there is more agreement that tokens characterized as NOFIt (first in a larger phrase with a H* accent) or NOFC (same with a complex pitch accent) --Sentential type A in Figure 1 ---are sentential, this agreement is certainly less striMng than in the case of tokens characterized a,s NONF* (non-initial il~ a larger phrase with any type of pitch accent --Sentential type B)." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-114", "text": "Since Discourse type B and Sentcntial type A differ only in 'type of pitch accent', we wight conclude that the pitch accent feature is not as powerfid a discriminator as the phrasal features 'alone in intermediate phrase' or 'first in phrase'." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-115", "text": "As in our previous study, we also examined potential non-prosodic distinctions between discourse and sentential uses." 
}, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-116", "text": "Of the orthographic and syntactic fi:atures we examined, we found presence or absence of preceding punctuation and part-of-speech to be most successful in distinguishing discourse from sentential uses." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-117", "text": "For the 113 tokens on which both judges agreed a.s to discourse or sentential status, 1~ orthogral)hy distinguishes between discourse and sentential use in 101 (89.4%) of cases." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-118", "text": "Specifically, 21 of 30 discourse uses are preceded by punctuation and only 3 of 83 sentential items." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-119", "text": "We also tbund that part-of-speech distinguishes discourse from sentential use, although less successfully than orthography." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-120", "text": "If we simply predict discourse or se.ntential use by the assignment most frequently associated with a given part-of-speech, both 14And note that 91.3% of items in these two cells m'e conjmlcts." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-121", "text": "15Thls figm~ excludes those items which the transcriber omitted." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-122", "text": "Church's part-of-speech algorithm and Hindle's Fidditch predict discourse or sentential use in approximately 75% of cases where both judges agreed on discourse/sentential assigmnent." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-123", "text": "For example, we assume that since the majority of conjunctions and verbs are judged sentential that these parts-of-speech are predictors ofsentential status, and since most adverbials are associated with discourse uses, these are predictors of discourse status, and so on." 
}, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-124", "text": "While partof-speech thus might seem less useful than orthographic distinctions for our corpus, the fact that it is not subject to transcriber idiosyncracy might make it a more reliable predictor than orthographic indicators in the general case." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-125", "text": "Too, for text-to-speech applications, in which one would like to infer discourse or sentential use in order to employ the appropriate intonational features when synthesizing the item in question, these text-based results are encouraging." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-126", "text": "----------------------------------" }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-127", "text": "**DISCUSSION**" }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-128", "text": "Our findings for the first stage of our single-speaker multi-cue phrase study support the intonational model of discourse/sentential characteristics of cue phrases which we proposed in [8] ." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-129", "text": "Discourse uses of cue phrases fell into two groups: in one, the cue phrase was set apart as a separate intermediate phrase (possibly with other cue phrases); in the other, the cue phrase was first in its intermediate phrase (possibly preceded by other cue phrases) and either was deaccented or bore a L* pitch accent." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-130", "text": "Sentential uses were in general part of a larger intermediate phrase: if first in phrase, they bore a H* or complex pitch accent." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-131", "text": "The association between discourse/sentential models and discourse/sentential judgments is significant at the .0(/1 level." 
}, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-132", "text": "We also found that the tokens we found difficult to clas-sify were those in which disambiguation relied solely upon pitch accent, rather than some combination of pitch accent and phrasing." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-133", "text": "Furthermore, we found that orthographic cues (from transcription) successfully disarnbiguate between discourse and sentential usage in 89.4% of cases in our pilot study." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-134", "text": "Partof-speech was less successful in distinguishing discourse from sentential use, disambiguating only 75% of cases in the study." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-135", "text": "The disambiguating power of both our textual and our prosodic models has both theoretical and practical import." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-136", "text": "From a practical point of view, the construction of both text-based and prosodic models permit improvement in the generation of synthetic speech from unrestricted text [9] ." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-137", "text": "With a prosodic model, we know how to convey discourse/sentential distinctions; with a text-based model, we know when to convey such distinctions." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-138", "text": "From a theoretical point of view, our findings demonstrate the feasibility of cue phrase disambiguation in both text and speech and provide a model for how that disambiguation might be done." }, { "sent_id": "f881f6c65301fdfe2fffe7a18e05c4-C001-139", "text": "Furthermore, these results strengthen the claim that the discourse structures crucial to computational models of interaction can indeed be identified." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "f881f6c65301fdfe2fffe7a18e05c4-C001-10" ], [ "f881f6c65301fdfe2fffe7a18e05c4-C001-25" ], [ "f881f6c65301fdfe2fffe7a18e05c4-C001-34" ], [ "f881f6c65301fdfe2fffe7a18e05c4-C001-37" ] ], "cite_sentences": [ "f881f6c65301fdfe2fffe7a18e05c4-C001-10", "f881f6c65301fdfe2fffe7a18e05c4-C001-25", "f881f6c65301fdfe2fffe7a18e05c4-C001-34", "f881f6c65301fdfe2fffe7a18e05c4-C001-37" ] }, "@DIF@": { "gold_contexts": [ [ "f881f6c65301fdfe2fffe7a18e05c4-C001-76" ] ], "cite_sentences": [ "f881f6c65301fdfe2fffe7a18e05c4-C001-76" ] }, "@EXT@": { "gold_contexts": [ [ "f881f6c65301fdfe2fffe7a18e05c4-C001-76" ] ], "cite_sentences": [ "f881f6c65301fdfe2fffe7a18e05c4-C001-76" ] } } }, "ABC_c7821d22613ad91f77ea454d50a5ce_25": { "x": [ { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-87", "text": "Answers of the questions are text spans in the articles." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-117", "text": "We have the following findings to note about the results." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-118", "text": "First, as can be observed, BERT-QG offers poor performance." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-2", "text": "In this study, we investigate the employment of the pre-trained BERT language model to tackle question generation tasks." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-3", "text": "We introduce two neural architectures built on top of BERT for question generation tasks." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-4", "text": "The first one is a straightforward BERT employment, which reveals the defects of directly using BERT for text generation. And, the second one remedies the first one by restructuring the BERT employment into a sequential manner for taking information from previous decoded results." 
}, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-5", "text": "Our models are trained and evaluated on the question-answering dataset SQuAD." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-6", "text": "Experiment results show that our best model yields state-of-the-art performance which advances the BLEU 4 score of existing best models from 16.85 to 21.04." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-7", "text": "----------------------------------" }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-9", "text": "Question generation (QG) task, which takes a context and an answer as input and generates a question that targets the given answer, have received tremendous interests in recent years from both industrial and academic communities (Zhao et al., 2018) (Zhou et al., 2017) (Du et al., 2017) ." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-10", "text": "The state-of-the-art models mainly adopt neural approaches by training a neural network based on the sequence-to-sequence framework." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-11", "text": "So far, the best performing result is reported in (Zhao et al., 2018) , which advances the state-of-the-art results from 13.9 to 16.8 (BLEU 4) ." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-12", "text": "The existing QG models mainly rely on recurrent neural networks (RNN) augmented by attention mechanisms." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-13", "text": "However, the inherent sequential nature of the RNN models suffers from the problem of handling long sequences." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-14", "text": "As a result, the existing QG models (Du et al., 2017) (Zhou et al., 2017 ) mainly use only sentence-level information as context." 
}, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-15", "text": "When applied to a paragraphlevel context, the existing models show significant performance degradation." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-16", "text": "However, as indicated by (Du et al., 2017) , providing paragraph-level information can improve QG performance." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-17", "text": "For handling long context, the work (Zhao et al., 2018) introduces a maxout pointer mechanism with gated self-attention encoder for processing paragraphlevel input." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-18", "text": "The work reports state-of-the-art performance for QG tasks." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-19", "text": "Recently, the NLP community has seen excitement around neural learning models that make use of pre-trained language models (Devlin et al., 2018) (Radford et al., 2018) ." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-20", "text": "The latest development is BERT, which has shown significant performance improvement over various natural language understanding tasks, such as document summarization, document classification, etc." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-21", "text": "In this study, we investigate the employment of the pretrained BERT language model to tackle question generation tasks." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-22", "text": "We introduce two neural architectures built on top of BERT for question generation tasks." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-23", "text": "The first one is a straightforward BERT employment, which reveals the defects of directly using BERT for text generation." 
}, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-24", "text": "As will be shown in the experiment, naive employment of BERT offers poor performance, as, by construction, BERT produces all tokens at a time without considering decoding results in previous steps." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-25", "text": "Thus, we propose a sequential question generation model based on BERT as our second model for taking information from previous decoded results." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-26", "text": "Our model is simple but effective." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-27", "text": "We think this is a feature of BERT, as the power of BERT is able to simplify neural architecture design for natural language processing tasks." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-28", "text": "Our model outperforms the existing best models (Zhao et al., 2018) and pushes the state-of-the-art result from 16.85 to 21.04 (BLEU 4)." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-29", "text": "The rest of this paper is organized as follows." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-30", "text": "First, in Section 2, we review the BERT model which is the building block for our models." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-31", "text": "In Sec-tion 3, we introduce two BERT adaptions for QG tasks." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-32", "text": "Section 4 provides the performance evaluation and Section 5 concludes our findings and discuss future works." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-33", "text": "----------------------------------" }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-34", "text": "**BERT OVERVIEW**" }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-35", "text": "The BERT model is built by a stack of multi-layer bidirectional Transformer encoder (Vaswani et al., 2017) ." 
}, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-36", "text": "The BERT model has three architecture parameter settings: the number of layers (i.e., transformer blocks), the hidden size, and the number of self-attention heads in a transformer block." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-37", "text": "There are two BERT models with different model size released." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-38", "text": "\u2022 BERT base : 12 layers, 768 hidden dimensions and 12 attention heads (in transformer) with the total number of 110M parameters." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-39", "text": "\u2022 BERT large : 24 layers, 1024 hidden dimensions and 16 attention heads (in transformer) with the total number of 340M parameters." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-40", "text": "For using BERT model, the input is required to be aligned as the BERTs specific input sequence." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-41", "text": "In general, a special token [CLS] is inserted as the first token for BERT's input sequence." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-42", "text": "The final hidden state of the [CLS] token is designed to be used as a final sequence representation for classification tasks." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-43", "text": "The input token sequence can be a pack of multiple sentences." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-44", "text": "To distinguish the information from different sentences, a special token [SEP] is added between the tokens of two consecutive sentences." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-45", "text": "In addition, a learned embedding is added to every token to denote whether it belongs to sentence A or sentence B. 
For example, given a sentence pair (s_i, s_j) where s_i contains |s_i| tokens and s_j contains |s_j| tokens, the BERT input sequence is formulated as a sequence in the following form:" }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-46", "text": "The input representation of a given token is the sum of three embeddings: the token embeddings, the segmentation embeddings, and the position embeddings." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-47", "text": "Then the input representation is fed forward into extra layers to perform a fine-tuning procedure." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-48", "text": "BERT can be employed in three language modeling tasks: sequence-level classification, span-level prediction, and token-level prediction tasks." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-49", "text": "The fine-tuning procedure is performed" }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-50", "text": "----------------------------------" }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-51", "text": "**BERT FOR QUESTION GENERATION**" }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-52", "text": "----------------------------------" }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-53", "text": "**BERT-QG**" }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-54", "text": "As an initial attempt, we first adapt the BERT model for QG as follows." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-55", "text": "First, for a given context paragraph C = [c_1, ..., c_|C|] and an answer phrase A = [a_1, ..., a_|A|], the input sequence X is aligned as" }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-56", "text": "Let BERT() be the BERT model." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-57", "text": "We first obtain the hidden representation H \u2208 R^{|X|\u00d7h} by H = BERT(X), where |X| is the length of the input sequence and h is the size of the hidden dimension." 
}, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-58", "text": "Then, H is passed to a dense layer W \u2208 R h\u00d7|V | followed by a softmax function as follows." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-59", "text": "----------------------------------" }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-60", "text": "**P R(W|X**" }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-61", "text": "The softmax is applied along the dimension of the sequence." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-62", "text": "All the parameters are fine-tuned jointly to maximize the log-probability of the correct token q i ." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-63", "text": "The model architecture is illustrated in Figure 1 ." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-64", "text": "As shown in the figure, we align a given context paragraph and a given answer as the input sequence and feed the input sequence into the BERT model to generate a sequence of tokens as a generated question." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-65", "text": "----------------------------------" }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-66", "text": "**BERT-SQG**" }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-67", "text": "In text generation tasks, as suggested by (Sutskever et al., 2014) , considering the previous decoded results has significant impacts on the quality of the generated text." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-68", "text": "However, in BERT-QG, the token generation is performed without previously decoded result information." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-69", "text": "Due to this consideration, we propose a sequential question generation model based on BERT (called BERT-SQG)." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-70", "text": "In BERT-SQG, we take into consideration the previous decoded results for decoding a token." 
}, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-71", "text": "We adapt the BERT model for question generation as follows." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-72", "text": "First, for a given context paragraph C = [c 1 , ..., c |C| ] and an answer phase A = [a 1 , ..., a |A| ], andQ = [q1, ...,q i ] the input sequence X i is formulated as" }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-73", "text": "Then, the input sequence X i is represented by the BERT embedding layers and then travel forward into the BERT model." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-74", "text": "After that, we take the final hidden state of the last token [MASK] in the input sequence." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-75", "text": "We denote the final hidden vector of [MASK] as h [MASK] \u2208 R h ." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-76", "text": "We adapt BERT model by adding an affine layer W SQG \u2208 R h\u00d7|V | to the output of the [MASK] token." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-77", "text": "We compute the probabilities P r(w|X i ) \u2208 R |V | by a softmax function as follows." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-78", "text": "----------------------------------" }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-79", "text": "**P R(W|X**" }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-80", "text": "Subsequently, the newly generated tokenq i is appended into X and the question generation process is repeated (as illustrated in Figure 2 ) with the new X until [SEP] is predicted." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-81", "text": "We report the generated tokens as the predicted question." 
}, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-82", "text": "----------------------------------" }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-83", "text": "**PERFORMANCE EVALUATION**" }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-84", "text": "----------------------------------" }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-85", "text": "**DATASETS**" }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-86", "text": "The SQuAD dataset contains 536 Wikipedia articles and around 100K reading comprehension questions (and the corresponding answers) posed about the articles." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-88", "text": "We follow the same data split settings as previous work on the QG tasks (Du et al., 2017) (Zhao et al., 2018) to directly compare the state-of-theart results on QG tasks." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-89", "text": "Table 1 summarizes some statistics for the compared datasets." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-90", "text": "\u2022 SQuAD 73K In this set, we follow the same setting as (Du et al., 2017) ; the accessible parts of the SQuAD training data are randomly divided into a training set (80%), a development set (10%), and a test set (10%)." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-91", "text": "We report results on the 10% test set." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-92", "text": "\u2022 SQuAD 81K In this set, we follow the same setting as (Zhao et al., 2018) ; the accessible SQuAD development data set is divided into a development set (50%), and a test set (50%)." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-93", "text": "----------------------------------" }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-94", "text": "**IMPLEMENTATION DETAILS**" }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-95", "text": "We use the PyTorch version of BERT 1 to train our BERT-QG and BERT-SQG models." 
}, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-96", "text": "The (Du et al., 2017) , and SQuAD 81K is the setting of (Zhao et al., 2018) ." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-97", "text": "SQuAD 73K 73240 11877 10570 SQuAD 81K 81577 8964 8964 pre-trained model uses the officially provided BERT base model (12 layers, 768 hidden dimensions, and 12 attention heads.) with a vocab of 30522 words." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-98", "text": "Dropout probability is set to 0.1 between transformer layers." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-99", "text": "The Adamax optimizer is applied during the training process, with an initial learning rate of 5e-5." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-100", "text": "The batch size for the update is set at 28." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-101", "text": "All our models use two TITAN RTX GPUs for 5 epochs training." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-102", "text": "We use Dev." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-103", "text": "data for epoch model to make predictions and select the highest accuracy rate as our score evaluation model." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-104", "text": "Also, in our BERT-SQG model, we use the Beam Search strategy for sequence decoding." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-105", "text": "The beam size is set to 3." 
}, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-106", "text": "----------------------------------" }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-107", "text": "**TRAIN TEST DEV**" }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-108", "text": "----------------------------------" }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-109", "text": "**MODEL COMPARISON**" }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-110", "text": "In this paper, we compare our models with the best performing models (Du et al., 2017) (Zhao et al., 2018) in the literature." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-111", "text": "The compared models in the experiment are:" }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-112", "text": "\u2022 NQG-RC (Du et al., 2017) : A seq2seq question generation model based on bidirectional LSTM." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-113", "text": "\u2022 PLQG (Zhao et al., 2018) : A seq2seq network which contains a gated self-attention encoder and a maxout pointer decoder to enable the capability of handling long text input." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-114", "text": "PLQG model is the state-of-the-art models for QG tasks." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-115", "text": "Table 2 shows the comparison results using sentence-level context and Table 3 shows the results on paragraph level context." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-116", "text": "We compare the models using standard metric BLEU and ROUGE-L ( (Papineni et al., 2002) )." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-119", "text": "In fact, the performance of BERT-QG is far from the results by other models." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-120", "text": "This result is expected as BERT-QG generates the sentences without considering the previous decoded results." 
}, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-121", "text": "However, when taking into account the previous decoded results (BERT-SQG), we effectively utilize the power of BERT and yield the state-of-the-art result compared with the existing RNN variants for QG." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-122", "text": "As shown in Table 2 , BERT-SQG outperforms the existing best performing model by 2% on both benchmark datasets." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-123", "text": "----------------------------------" }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-124", "text": "**EVALUATION RESULTS**" }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-125", "text": "Second, the results in Table 3 further show that BERT-SQG successfully processes the paragraphlevel contexts and further push the state-of-the-art from 16.85 to 21.04 in terms of BLEU 4 score." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-126", "text": "Note that NQG-RC and PLQG both use the RNN architecture, and the RNN-based models all suffer from the issue of consuming long text input." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-127", "text": "We see that the BERT model based on transformer blocks effectively addresses the issue of processing long text." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-128", "text": "The results of our BERT-SQG model are consistent in two data set and have achieved the best score at the paragraph level." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-129", "text": "----------------------------------" }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-130", "text": "**CONCLUSION**" }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-131", "text": "In this paper, we demonstrate that BERT can be adapted to question generation tasks." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-132", "text": "We concede that our BERT-SQG model is simple." 
}, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-133", "text": "However, we think this is a feature of BERT, as the power of BERT is able to simplify neural architectures design for specific tasks." }, { "sent_id": "c7821d22613ad91f77ea454d50a5ce-C001-134", "text": "While our model is simple, our model achieves state-of-the-art performance at both sentence-level and paragraph-level input and provides strong baselines for future research." } ], "y": { "@MOT@": { "gold_contexts": [ [ "c7821d22613ad91f77ea454d50a5ce-C001-9" ], [ "c7821d22613ad91f77ea454d50a5ce-C001-15", "c7821d22613ad91f77ea454d50a5ce-C001-16" ] ], "cite_sentences": [ "c7821d22613ad91f77ea454d50a5ce-C001-9", "c7821d22613ad91f77ea454d50a5ce-C001-16" ] }, "@BACK@": { "gold_contexts": [ [ "c7821d22613ad91f77ea454d50a5ce-C001-13", "c7821d22613ad91f77ea454d50a5ce-C001-14" ], [ "c7821d22613ad91f77ea454d50a5ce-C001-96" ] ], "cite_sentences": [ "c7821d22613ad91f77ea454d50a5ce-C001-14", "c7821d22613ad91f77ea454d50a5ce-C001-96" ] }, "@USE@": { "gold_contexts": [ [ "c7821d22613ad91f77ea454d50a5ce-C001-88" ], [ "c7821d22613ad91f77ea454d50a5ce-C001-90" ], [ "c7821d22613ad91f77ea454d50a5ce-C001-111", "c7821d22613ad91f77ea454d50a5ce-C001-112" ] ], "cite_sentences": [ "c7821d22613ad91f77ea454d50a5ce-C001-88", "c7821d22613ad91f77ea454d50a5ce-C001-90", "c7821d22613ad91f77ea454d50a5ce-C001-112" ] }, "@SIM@": { "gold_contexts": [ [ "c7821d22613ad91f77ea454d50a5ce-C001-88" ], [ "c7821d22613ad91f77ea454d50a5ce-C001-90" ], [ "c7821d22613ad91f77ea454d50a5ce-C001-110" ], [ "c7821d22613ad91f77ea454d50a5ce-C001-111", "c7821d22613ad91f77ea454d50a5ce-C001-112" ] ], "cite_sentences": [ "c7821d22613ad91f77ea454d50a5ce-C001-88", "c7821d22613ad91f77ea454d50a5ce-C001-90", "c7821d22613ad91f77ea454d50a5ce-C001-110", "c7821d22613ad91f77ea454d50a5ce-C001-112" ] } } }, "ABC_09493a62815b4b826248d6d9be47cb_25": { "x": [ { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-30", "text": "**PROBLEM STATEMENT**" }, { "sent_id": 
"09493a62815b4b826248d6d9be47cb-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-2", "text": "In this work, we consider the medical concept normalization problem, i.e., the problem of mapping a health-related entity mention in a free-form text to a concept in a controlled vocabulary, usually to the standard thesaurus in the Unified Medical Language System (UMLS)." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-3", "text": "This is a challenging task since medical terminology is very different when coming from health care professionals or from the general public in the form of social media texts." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-4", "text": "We approach it as a sequence learning problem with powerful neural networks such as recurrent neural networks and contextualized word representation models trained to obtain semantic representations of social media expressions." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-5", "text": "Our experimental evaluation over three different benchmarks shows that neural architectures leverage the semantic meaning of the entity mention and significantly outperform an existing state of the art models." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-6", "text": "----------------------------------" }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-8", "text": "User-generated texts (UGT) on social media present a wide variety of facts, experiences, and opinions on numerous topics, and this treasure trove of information is currently severely underexplored." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-9", "text": "We consider the problem of discovering medical concepts in UGTs with the ultimate goal of mining new symptoms, adverse drug reactions (ADR), and other information about a disorder or a drug." 
}, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-77", "text": "The corpus consists of 6556 phrases mapped to 618 SNOMED codes." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-78", "text": "SMM4H 2017." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-10", "text": "An important part of this problem is to translate a text from \"social media language\" (e.g., \"can't fall asleep all night\" or \"head spinning a little\") to \"formal medical language\" (e.g., \"insomnia\" and \"dizziness\" respectively)." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-11", "text": "This is necessary to match user-generated descriptions with medical concepts, but it is more than just a simple matching of UGTs against a vocabulary." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-12", "text": "We call the task of mapping the language of UGTs to medical terminology medical concept normalization." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-13", "text": "It is especially difficult since in social media, patients discuss different concepts of illness and a wide array of drug reactions." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-14", "text": "Moreover, UGTs from social networks are typically ambiguous and very noisy, containing misspelled words, incorrect grammar, hashtags, abbreviations, smileys, different variations of the same word, and so on." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-15", "text": "Traditional approaches for concept normalization utilized lexicons and knowledge bases with string matching." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-16", "text": "The most popular knowledgebased system for mapping texts to UMLS identifiers is MetaMap (Aronson, 2001 )." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-17", "text": "This linguisticbased system uses lexical lookup and variants by associating a score with phrases in a sentence." 
}, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-18", "text": "The state-of-the-art baseline for clinical and scientific texts is DNorm (Leaman et al., 2013) ." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-19", "text": "DNorm adopts a pairwise learning-torank technique using vectors of query mentions and candidate concept terms." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-20", "text": "This model outperforms MetaMap significantly, increasing the macro-averaged F-measure by 25% on an NCBI disease dataset." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-21", "text": "However, while these tools have proven to be effective for patient records and research papers, they achieve moderate results on social media texts (Nikfarjam et al., 2015; Limsopatham and Collier, 2016) ." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-22", "text": "Recent works go beyond string matching: these works have tried to view the problem of matching a one-or multi-word expression against a knowledge base as a supervised sequence labeling problem." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-23", "text": "Limsopatham and Collier (2016) utilized convolutional neural networks (CNNs) for phrase normalization in user reviews, while Tutubalina et al. (2018) , Han et al. (2017) , and Belousov et al. (2017) applied recurrent neural networks (RNNs) to UGTs, achieving similar results." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-24", "text": "These works were among the first applications of deep learning techniques to medical concept normalization." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-25", "text": "The goal of this work is to study the use of deep neural models, i.e., contextualized word representation model BERT (Devlin et al., 2018) and Gated Recurrent Units (GRU) (Cho et al., 2014) with an attention mechanism, paired with word2vec word embeddings and contextualized ELMo embeddings (Peters et al., 2018) ." 
}, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-26", "text": "We investigate if a joint architecture with special provisions for domain knowledge can further improve the mapping of entity mentions from UGTs to medical concepts." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-27", "text": "We combine the representation of an entity mention constructed by a neural model and distance-like similarity features using vectors of an entity mention and concepts from the UMLS." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-28", "text": "We experimentally demonstrate the effectiveness of the neural models for medical concept normalization on three real-life datasets of tweets and user reviews about medications with two evaluation procedures." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-29", "text": "----------------------------------" }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-31", "text": "Our main research problem is to investigate the content of UGTs with the aim to learn the transition between a laypersons language and formal medical language." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-32", "text": "Examples from Table 1 show that an automated model has to account for the semantics of an entity mention." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-33", "text": "For example, it has to be able to map not only phases with shared ngrams no sexual interest and nonsexual being into the concept \"Lack of libido\" but also separate the phase bit of lower back pain from the broader concept \"Pain\" and map it to a narrower concept." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-34", "text": "While focusing on user-generated texts on social media, in this work we seek to answer the following research questions." 
}, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-35", "text": "To answer RQ1, we began by collecting UGTs from popular medical web portals and investigating distributed word representations trained on 2.6 millions of health-related user comments." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-36", "text": "In particular, we analyze drug name representations using clustering and chemoinformatics approaches." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-37", "text": "The analysis demonstrated that similar word vectors correspond to either drugs with the same active compound or to drugs with close therapeutic effects that belong to the same therapeutic group." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-38", "text": "It is worth noting that chemical similarity in such drug pairs was found to be low." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-39", "text": "Hence, these representations can help in the search for compounds with potentially similar biological effects among drugs of different therapeutic groups (Tutubalina et al., 2017) ." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-40", "text": "----------------------------------" }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-41", "text": "**RQ1:**" }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-42", "text": "To answer RQ2 and RQ3, we develop several models and conduct a set of experiments on three benchmark datasets where social media texts are extracted from user reviews and Twitter." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-43", "text": "We present this work in Sections 3 and 4." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-44", "text": "We discuss RQ4 and RQ5 with research plans in Section 5." 
}, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-45", "text": "----------------------------------" }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-46", "text": "**METHODS**" }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-47", "text": "Following state-of-the-art research (Limsopatham and Collier, 2016; Sarker et al., 2018) , we view concept normalization as a classification problem." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-48", "text": "To answer RQ2, we investigate the use of neural networks to learn the semantic representation of an entity before mapping its representation to a medical concept." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-49", "text": "First, we convert each mention into a vector representation using one of the following (well-known) neural models:" }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-50", "text": "(1) bidirectional LSTM (Hochreiter and Schmidhuber, 1997) or GRU (Cho et al., 2014) with an attention mechanism and a hyperbolic tangent activation function on top of 200-dimensional word embeddings obtained to answer RQ1;" }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-51", "text": "(2) a bidirectional layer with attention on top of deep contextualized word representations ELMo (Peters et al., 2018) ; (3) a contextualized word representation model BERT (Devlin et al., 2018) , which is a multilayer bidirectional Transformer encoder." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-52", "text": "We omit technical explanations of the neural network architectures due to space constraints and refer to the studies above." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-53", "text": "Next, the learned representation is concatenated with a number of semantic similarity features based on prior knowledge from the UMLS Metathesaurus." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-54", "text": "Lastly, we add a softmax layer to convert values to conditional probabilities." 
}, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-55", "text": "The most attractive feature of the biomedical domain is that domain knowledge is prevailing in this domain for dozens of languages." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-56", "text": "In particular, UMLS is undoubtedly the largest lexico-semantic resource for medicine, containing more than 150 lexicons with terms from 25 languages." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-57", "text": "To answer RQ3, we extract a set of features to enhance the representation of phrases." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-58", "text": "These features contain cosine similarities between the vectors of an input phrase and a concept in a medical terminology dictionary." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-59", "text": "We use the following strategy, which we call TF-IDF (MAX), to construct representations of a concept and a mention: represent a medical code as a set of terms; for each term, compute the cosine distance between its TF-IDF representation and the entity mention; then choose the term with the largest similarity." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-60", "text": "----------------------------------" }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-61", "text": "**EXPERIMENTS**" }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-62", "text": "We perform an extensive evaluation of neural models on three datasets of UGTs, namely CADEC (Karimi et al., 2015) , Psy-TAR (Zolnoori et al., 2019) , and SMM4H 2017 (Sarker et al., 2018) ." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-63", "text": "The basic task is to map a social media phrase to a relevant medical concept." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-64", "text": "----------------------------------" }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-65", "text": "**DATA**" }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-66", "text": "CADEC." 
}, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-67", "text": "CSIRO Adverse Drug Event Corpus (CADEC) (Karimi et al., 2015) is the first richly annotated and publicly available corpus of medical forum posts taken from AskaPatient 1 ." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-68", "text": "This dataset contains 1253 UGTs about 12 drugs divided into two categories: Diclofenac and Lipitor." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-69", "text": "All posts were annotated manually for 5 types of entities: ADR, Drug, Disease, Symptom, and Finding." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-70", "text": "The annotators performed terminology association using the Systematized Nomenclature Of Medicine Clinical Terms (SNOMED CT)." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-71", "text": "We removed \"conceptless\" or ambiguous mentions for the purposes of evaluation." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-72", "text": "There were 6,754 entities and 1,029 unique codes in total." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-73", "text": "PsyTAR." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-74", "text": "Psychiatric Treatment Adverse Reactions (PsyTAR) corpus (Zolnoori et al., 2019) is the second open-source corpus of user-generated posts taken from AskaPatient." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-75", "text": "This dataset includes 887 posts about four psychiatric medications from two classes: (i) Zoloft and Lexapro from the Selective Serotonin Reuptake Inhibitor (SSRI) class and (ii) Effexor and Cymbalta from the Serotonin Norepinephrine Reuptake Inhibitor (SNRI) class." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-76", "text": "All posts were annotated manually for 4 types of entities: ADR, withdrawal symptoms, drug indications, and sign/symptoms/illness." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-79", "text": "In 2017, Sarker et al. 
(2018) organized the Social Media Mining for Health (SMM4H) shared task, which introduced a dataset with annotated ADR expressions from Twitter." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-80", "text": "Tweets were collected using 250 keywords such as generic and trade names for medications along with misspellings." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-81", "text": "Manually extracted ADR expressions were mapped to Preferred Terms (PTs) of the Medical Dictionary for Regulatory Activities (MedDRA)." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-82", "text": "The training set consists of 6650 phrases mapped to 472 PTs." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-83", "text": "The test set consists of 2500 mentions mapped to 254 PTs." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-84", "text": "----------------------------------" }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-85", "text": "**EVALUATION DETAILS**" }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-86", "text": "We evaluate our models based on classification accuracy, averaged across five randomly divided folds of the CADEC and PsyTAR corpora." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-87", "text": "For SMM4H 2017 data, we adopted the official training and test sets (Sarker et al., 2018)." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-88", "text": "Analysis of randomly split folds shows that Random KFolds create a high overlap of expressions in exact matching between subsets (see the baseline results in Table 2)." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-89", "text": "Therefore, we set up a specific train/test split procedure for 5-fold cross-validation on the CADEC and PsyTAR corpora: we removed duplicate mentions and grouped the medical records we are working with into sets related to specific medical codes."
}, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-90", "text": "Then, each set has been split independently into k folds, and all folds have been merged into the final k folds named Custom KFolds." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-91", "text": "Random folds of CADEC are adopted from (Limsopatham and Collier, 2016 ) for a fair comparison." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-92", "text": "Custom folds of CADEC are adopted from our previous work (Tutubalina et al., 2018) ." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-93", "text": "PsyTAR folds are available on Zenodo.org 2 ." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-94", "text": "We have also implemented a simple baseline approach that uses exact lexical matching with lowercased annotations from the training set." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-95", "text": "Table 2 shows our results for the concept normalization task on the Random and Custom KFolds of the CADEC, PsyTAR, and SMM4H 2017 corpora." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-96", "text": "----------------------------------" }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-97", "text": "**RESULTS**" }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-98", "text": "To answer RQ2, we compare the performance of examined neural models with the baseline and state-of-the-art methods in terms of accuracy." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-99", "text": "Attention-based GRU with ELMo embeddings showed improvement over GRU with word2vec embeddings, increasing the average accuracy to 77.85 (+3.65)." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-100", "text": "The semantic information of an entity mention learned by BERT helps to improve the mapping abilities, outperforming other models (avg. accuracy 83.67)." 
}, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-101", "text": "Our experiments with recurrent units showed that GRU consistently outperformed LSTM on all subsets, and attention mechanism provided further quality improvements for GRU." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-102", "text": "From the difference in accuracy on the Random and Custom KFolds, we conclude that future research should focus on developing extrinsic test sets for medical concept normalization." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-103", "text": "In particular, the BERT model's accuracy on the CADEC Custom KFolds decreased by 9.23% compared to the CADEC Random KFolds." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-104", "text": "To answer RQ3, we compare the performance of models with additional similarity features (marked by \"w/\") with others." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-105", "text": "Indeed, joint models based on GRU and similarity features gain 2-5% improvement on sets with Custom KFolds." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-106", "text": "The joint model based on BERT and similarity 2 https://doi.org/10.5281/zenodo.3236318 features stays roughly on par with BERT on all sets." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-107", "text": "We also tested different strategies for constructing representations using word embeddings and TF-IDF for all synonyms' tokens that led to similar improvements for GRU." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-108", "text": "5 Future Directions RQ4." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-109", "text": "Future research might focus on developing an embedding method that jointly maps extracted entity mentions and UMLS concepts into the same continuous vector space." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-110", "text": "The methods could help us to easily measure the similarity between words and concepts in the same space." 
}, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-111", "text": "Recently, Yamada et al. (2016) demonstrated that cotrained vectors improve the quality of both word and entity representations in entity linking (EL) which is a task closely related to concept normalization." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-112", "text": "We note that most of the recent EL methods focus on the disambiguation sub-task, applying simple heuristics for candidate generation." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-113", "text": "The latter is especially challenging in medical concept normalization due to a significant language difference between medical terminology and patient vocabulary." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-114", "text": "RQ5." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-115", "text": "Error analysis has confirmed that models often misclassify closely related concepts (e.g., \"Emotionally detached\" and \"Apathy\") and antonymous concepts (e.g., \"Hypertension\" and \"Hypotension\")." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-116", "text": "We suggest to take into account not only the distance-like similarity between entity mentions and concepts but the mention's context, which is not used directly in recent studies on concept normalization." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-117", "text": "The context can be represented by the set of adjacent words or entities." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-118", "text": "As an alternative, one can use a conditional random field (CRF) to output the most likely sequence of medical concepts discussed in a review." 
}, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-119", "text": "----------------------------------" }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-120", "text": "**RELATED WORK**" }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-121", "text": "In 2004, the research community started to address the needs to automatically detect biomedical entities in free texts through shared tasks." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-122", "text": "Huang and Lu (2015) survey the work done in the organization of biomedical NLP (BioNLP) challenge evaluations up to 2014." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-123", "text": "These tasks are devoted to the normalization of (1) (Limsopatham and Collier, 2016) 73.39 ----CNN (Limsopatham and Collier, 2016) 81.41 ----RNN (Limsopatham and Collier, 2016) 79.98 ----Attentional Char-CNN (Niu et al., 2018) 84.65 ----Hierarchical Char-CNN (Han et al., 2017) - Table 2 : The performance of the proposed models and the state-of-the-art methods in terms of accuracy." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-124", "text": "(4) diseases from clinical reports (ShARe/CLEF eHealth 2013; SemEval 2014 task 7)." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-125", "text": "Similarly, the CLEF Health 2016 and 2017 labs addressed the problem of ICD coding of freeform death certificates (without specified entity mentions)." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-126", "text": "Traditionally, linguistic approaches based on dictionaries, association measures, and syntactic properties have been used to map texts to a concept from a controlled vocabulary (Aronson, 2001; Van Mulligen et al., 2016; Mottin et al., 2016; Ghiasvand and Kate, 2014; Tang et al., 2014) ." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-127", "text": "Leaman et al. (2013) proposed the DNorm system based on a pairwise learningto-rank technique using vectors of query mentions and candidate concept terms." 
}, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-128", "text": "These vectors are obtained from a tf-idf representation of all tokens from training mentions and concept terms." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-129", "text": "Zweigenbaum and Lavergne (2016) utilized a hybrid method combining simple dictionary projection and mono-label supervised classification from ICD coding." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-130", "text": "Nevertheless, the majority of biomedical research on medical concept extraction primarily focused on scientific literature and clinical records (Huang and Lu, 2015) ." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-131", "text": "Zolnoori et al. (2019) applied a popular dictionary look-up system cTAKES on user reviews." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-132", "text": "cTAKES based on additional PsyTAR's dictionaries achieves twice better results (0.49 F1 score on the exact matching)." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-133", "text": "Thus, dictionaries gathered from layperson language can efficiently improve automatic performance." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-134", "text": "The 2017 SMM4H shared task (Sarker et al., 2018) was the first effort for the evaluation of NLP methods for the normalization of health-related text from social media on publicly released data." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-135", "text": "Recent advances in neural networks have been utilized for concept normalization: recent studies have employed convolutional neural networks (Limsopatham and Collier, 2016; Niu et al., 2018) and recurrent neural networks (Belousov et al., 2017; Han et al., 2017) ." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-136", "text": "These works have trained neural networks from scratch using only entity mentions from training data and pre-trained word embeddings." 
}, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-137", "text": "To sum up, most methods have dealt with encoding information an entity mention itself, ignoring the broader context where it occurred." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-138", "text": "Moreover, these studies did not examine an evaluation methodology tailored to the task." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-139", "text": "----------------------------------" }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-140", "text": "**CONCLUSION**" }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-141", "text": "In this work, we have performed a fine-grained evaluation of neural models for medical concept normalization tasks." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-142", "text": "We employed several powerful models such as BERT and RNNs paired with pre-trained word embeddings and ELMo embeddings." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-143", "text": "We also developed a joint model that combines (i) semantic similarity features based on prior knowledge from UMLS and (ii) a learned representation that captures extensional semantic information of an entity mention." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-144", "text": "We have carried out experiments on three datasets using 5-fold cross-validation in two setups." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-145", "text": "Each dataset contains phrases and their corresponding SNOMED or MedDRA concepts." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-146", "text": "Analyzing the results, we have found that similarity features help to improve mapping abilities of joint models based on recurrent neural networks paired with pre-trained word embeddings or ELMo embeddings while staying roughly on par with the advanced language representation model BERT in terms of accuracy." 
}, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-147", "text": "Different setups of evaluation procedures affect the performance of models significantly: the accuracy of BERT is 7.25% higher on test sets with a simple random split than on test sets with the proposed custom split." }, { "sent_id": "09493a62815b4b826248d6d9be47cb-C001-148", "text": "Moreover, we have discussed some interesting future research directions and challenges to be overcome." } ], "y": { "@MOT@": { "gold_contexts": [ [ "09493a62815b4b826248d6d9be47cb-C001-21" ] ], "cite_sentences": [ "09493a62815b4b826248d6d9be47cb-C001-21" ] }, "@BACK@": { "gold_contexts": [ [ "09493a62815b4b826248d6d9be47cb-C001-23" ], [ "09493a62815b4b826248d6d9be47cb-C001-123" ] ], "cite_sentences": [ "09493a62815b4b826248d6d9be47cb-C001-23", "09493a62815b4b826248d6d9be47cb-C001-123" ] } } }, "ABC_e99baf9c4b8650f29f410501c5165b_25": { "x": [ { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-40", "text": "**RELATED WORK**" }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-2", "text": "The recent advances of deep learning in both computer vision (CV) and natural language processing (NLP) provide us a new way of understanding semantics, by which we can deal with more challenging tasks such as automatic description generation from natural images." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-3", "text": "In this challenge, the encoder-decoder framework has achieved promising performance when a convolutional neural network (CNN) is used as image encoder and a recurrent neural network (RNN) as decoder." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-4", "text": "In this paper, we introduce a sequential guiding network that guides the decoder during word generation." 
}, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-5", "text": "The new model is an extension of the encoder-decoder framework with attention that has an additional guiding long short-term memory (LSTM) and can be trained in an end-to-end manner by using image/descriptions pairs." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-6", "text": "We validate our approach by conducting extensive experiments on a benchmark dataset, i.e., MS COCO Captions." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-7", "text": "The proposed model achieves significant improvement comparing to the other state-ofthe-art deep learning models." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-8", "text": "----------------------------------" }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-10", "text": "Automatically describing the content of images is of one of the hardest tasks of scene understanding -a long standing problem of the field of artificial intelligence (AI)." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-11", "text": "This task is challenging as it requires semantic understanding both of image and natural language, building a good learning models to shrink semantic gaps in different modalities." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-12", "text": "This unsolved problem has attracted a lot of attentions in AI community [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] ." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-13", "text": "An image captioning model must detect all objects/scenes present in an image and sometimes even objects/scenes that are not present in the image but related to its description." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-14", "text": "Moreover, it should be able to capture the relations between the objects/scenes and express them in a well-formed, humanunderstandable sentence." 
}, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-15", "text": "This challenge has the significance not only in academic research, but also in various applications such as information retrieval and visual question-answering." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-16", "text": "Some recent research of image captioning take inspiration from neural machine translation systems (NMT) [15] [16] [17] [18] that successfully use sequence-to-sequence learning for translation." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-17", "text": "NMT models solve the task of translation by a two-fold pipeline." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-18", "text": "(1) A RNN is used to encode the source sentence in a fixed-length vector and then (2) a decoding RNN is conditioned on that fixed-length vector to generate a sentence in the target language, one word at each time step." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-39", "text": "----------------------------------" }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-19", "text": "This general encoder-decoder framework is well suited to the image description problem as the task is (sort of) equivalent to translating an image into its corresponding description." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-20", "text": "However, image captioning appears to be much harder than certain machine translation tasks." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-21", "text": "For instance, when translating from English to French, the source and target languages share similar sentence structures (similar part-of-speech order)." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-22", "text": "This similarity in structure is very useful for the translating system as the alignment will be much easier." 
}, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-23", "text": "Instead, for image captioning, the structure of the visual data is way different from the structure of the captions describing them." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-24", "text": "Moreover, the simple CNN+RNN pipeline squash the whole input image into a fixed-length embedding vector." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-25", "text": "This constitutes a major limitation of the basic encoder-decoder architecture." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-26", "text": "To overcome these limitations both for machine translation and image captioning, some new models were proposed by using the attention mechanism [3, 16, 18] ." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-27", "text": "For image captioning, attention mechanisms help the model focus on salient regions of the image while generating descriptions by dynamically updating the image representations at each time step." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-28", "text": "With this the input image is now represented as a sequence of context vector where the length of the sequence depends on the number of words in the sentence to be generated." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-29", "text": "Promising results has been published since attention was introduced in [16] then later refined in [18] ." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-30", "text": "Another group of models [4] [5] [6] tried to overcome the limitation of the basic encodedecoder framework by still representing the input image as a fixedlength vector but injecting external guiding information at each time step." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-31", "text": "The external guiding information can be any attribute features connecting the image to its description." 
}, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-32", "text": "For instance, the attribute features could be semantic information obtained from a multimodal space of images and their descriptions learned using Canonical Correlation Analysis [5] or the prediction of frequent word occurrences in captions [4] or even learned by an additional guiding network [6] ." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-33", "text": "The guiding information is however static in all of these models and couldn't be adjusted during the process of generation." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-34", "text": "In this work, we investigate how can we take advantages of these encoder-decoder models by constructing a joint neural model with attention that has an extra guiding network." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-35", "text": "Our approach is more closely related to the work of [6] but instead of learning one magic guiding vector, we propose to learn a sequential guiding network that can adapt its guiding vector during words generation." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-36", "text": "More specifically, the guiding network is a long short-term memory (LSTM) which outputs a guiding vector at each time step based on previous guiding vectors, current attention vector and attribute features." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-37", "text": "We use the Luong style of attention [18] which is a refined version of attention mechanism and that to the best of our knowledge, there has not been any published work reporting the performance of an image captioning model that is built following only the encoder-decoder pipeline with Luong style of attention." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-38", "text": "Furthermore, to demonstrate the usefulness of the guiding LSTM, we also compare the performance of our model with and without the the guiding LSTM in experiments." 
}, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-41", "text": "In the past few years, the advances in training deep neural network models both for CV [19] and NLP [15] give new perspectives for au-tomatic image description generation." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-42", "text": "The neural encoder-decoder framework of machine translation has recently been used for generating image captions because of the high level similarity of the two fields." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-43", "text": "Both fields aim to translate a source language to a target one." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-44", "text": "The model in [8] was the first to follow the encoder-decoder pipeline for image captioning." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-45", "text": "Authors of [8] use a CNN to compute image features and a LSTM model to encode the corresponding descriptions." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-46", "text": "The image features are projected into latent states of the LSTM encoder to construct a multimodal distributed representation learned by optimizing a simple pairwise ranking loss." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-47", "text": "Image descriptions were generated from the multimodal space using a novel multiplicative neural language model named Structure-Content Neural Language Model (SC-NLM)." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-48", "text": "Their approach gives superior results than all previous models but was later outperformed by the model described in [2] , which is a simpler encoder-decoder architecture, again directly inspired by Neural Machine Translation (NMT)." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-49", "text": "[2] pass images through a deep CNN, take the activation of the last fully-connected layer as image features and then initialize the hidden states of a RNN cell with the CNN image features." 
}, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-50", "text": "During training at each time step they input the current word, compute a distribution over all the words of the vocabulary based on the hidden states and maximize the likelihood of the true next word using a negative log likelihood loss." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-51", "text": "The work in [1] employs a more powerful RNN cell, and they incorporated the image features as first input word instead of using it as initial hidden state." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-52", "text": "Other similar approaches include [5, 10, 11] ." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-53", "text": "[5] were proposed as an extension of the LSTM model by exploring different kind of semantic information that can be used as extra guiding input to the LSTM during decoding steps." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-54", "text": "[4] followed this direction by injecting more powerful high-level image attributes into the decoder." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-55", "text": "In their work, they investigate different architecture for injecting word occurrence prediction attributes [9] into the CNN-RNN framework." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-56", "text": "Inspired by the use of attention in sequence-to-sequence learning for machine translation [16, 18] , visual attention has been proved to be a very effective way of improving image captioning." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-57", "text": "Some early research follows this direction, e.g., the model proposed in [3] can focus on important parts of images while generating descriptions." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-58", "text": "The captioning model in [3] is very similar in spirit to that in [7] , in which visual representation is constructed for sentence parts while the description is being emitted." 
}, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-59", "text": "In [13] , authors proposed a spatial and channel-wise attention mechanism over 3D CNN feature maps." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-60", "text": "In addition to spatial attention (standard visual attention), their model can also learn to pay attention over CNN channels, which they argued as extracting semantic attributes." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-61", "text": "In [6] , visual attention is combined with semantic information for generating image captions." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-62", "text": "Their model can learn an additional guiding vector while learning to focus on image regions." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-63", "text": "However in contrast to our model, their framework only learns a fix guiding vector that couldn't be adapted during words generation." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-64", "text": "A pure sequence-to-sequence architecture for image captioning is proposed in [12] ." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-65", "text": "Different from previous approaches, their model represents images as a sequence of detected objects and a 'sequential attention layer' is introduced to help the model focus on important objects." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-66", "text": "While resulting in a more complex architecture, their approach claims state-of-the-art results in all metrics." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-67", "text": "Instead of training via (penalized) maximum likelihood estimation, some recent works use Policy Gradient (PG) methods to directly optimize the non-differentiable testing metrics, claiming boost in term of performance measure." 
}, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-68", "text": "While [20] optimize for the standard CIDEr metric, [21] proposed to optimize for a new testing metric that is a linear combination of CIDEr [22] and SPICE [23] they called SPIDEr, which they found better correlated with human judgment." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-69", "text": "However in this line of work, it is not clear yet whether the improvement in testing metrics could result in captions with better quality." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-70", "text": "----------------------------------" }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-71", "text": "**PROPOSED MODEL**" }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-72", "text": "Technically, the ultimate goal of the neural CNN+RNN architecture for image captioning is to build an end-to-end model trainable by standard backpropagation that can generate a description S i given a certain image X i ." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-73", "text": "Inspired by NMT, such a model can translate an image into a describing sentence." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-74", "text": "A CNN is first used to obtain image features and a RNN decoder is conditioned on those CNN image features to generate the describing sentence." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-75", "text": "Given a training dataset consisting of (S i , X i ) pairs, these models aim to directly maximize the log-probability of generating S i when X i is the input." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-76", "text": "Thus, the optimization problem can be formulated by:" }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-77", "text": "where \u03b8 represents the set of parameters to be learned, X i is a single image and S i = [w i 1 , w i 2 , ..., w i N i ] is the corresponding caption which is a sequence of N i words." 
}, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-78", "text": "Because each caption represents a sequence of N i words, the log probability is calculated using Bayes chain rule:" }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-79", "text": "is the likelihood of generating the first word w i 1 given only the image X i and p(w i t |X i ; w i 1:t\u22121 ; \u03b8) represents the probability of emitting word w i t at time step t conditioned on the image X i and the words generated so far w i 1:t\u22121 ." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-80", "text": "In our work, we model the distribution p(w i t |X i , w i 1:t\u22121 ; \u03b8) with a LSTM cell wrapped with Luong-style attention mechanism [18] ." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-81", "text": "A RNN cell with Luong's attention computes its hidden state h t at each time step based on the current input x t , the previous hidden state h t\u22121 and the previous attention vectorh t\u22121 ." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-82", "text": "The current hidden state h t is combined with the image-side context vector c t to form the final output of the cell which is the current attention vectorh t ." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-83", "text": "Finally, the distribution p(w i t |X i ; w i 1:t\u22121 ; \u03b8) is computed by applying a Softmax layer on top of the current attention vectorh" }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-84", "text": "where W s, W c are projection matrices, c t is the image-side context vector detailed in section 3.1 and R is a recursive function whose details are given in Section 3.2." 
}, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-85", "text": "states, our model computes a context vector c t which is a weighted sum of attention states and can be seen as a dynamic representation of the image at every time step t:" }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-86", "text": "----------------------------------" }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-87", "text": "**IMAGE FEATURES: CONVOLUTIONAL NEURAL NETWORK**" }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-88", "text": "where c t \u2208 R D and \u03b1 t 1 , \u03b1 t 2 , . . . , \u03b1 t K are alignment weights and the function \u03a6 is known as alignment function." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-89", "text": "In our model, we use the general form described in [18] :" }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-90", "text": "where W a \u2208 R H\u00d7D is a transformation matrix." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-91", "text": "----------------------------------" }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-92", "text": "**SENTENCE GENERATOR: LSTM + LUONG'S ATTENTION**" }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-93", "text": "The form of the recursive function R (Equation 3) is a critical design choice for generating sequences." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-94", "text": "In this paper, we use a LSTM cell wrapped with the attention mechanism described in [18] to form R. LSTM [24] is a powerful form of recurrent neural network that is widely used now because of its ability to deal with issues like vanishing and exploding gradients." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-95", "text": "The final form of R is described by the following equations:" }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-96", "text": "here x t is input signal at time step t, m t and h t are respectively memory cell and hidden state of the LSTM cell andh t represents attention vector." 
}, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-97", "text": "The variables i t , f t , o t , g t stand respectively for input gate, forget gate, output gate and candidate memory cell." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-98", "text": "The various W z xy and b z are respectively parameter matrices and bias terms to be optimized." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-99", "text": "The non-linearity \u03c3 is element-wise sigmoid activation and is the element-wise dot product." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-100", "text": "----------------------------------" }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-101", "text": "**SEQUENTIAL GUIDING NETWORK**" }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-102", "text": "While the decoding function can access image features at each time step in the encoder-decoder framework with attention, injecting additional guiding vector to the decoder input signal can lead to higher performance." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-103", "text": "In our work, we extend the CNN+RNN architecture with attention by inserting an extra guiding network." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-104", "text": "Different from previous approaches that learn a static guiding vector, we explore the use of a sequential guiding network that can adapt its guiding vector at every time step." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-105", "text": "We model the sequential guiding network with a LSTM cell and name it LSTM-g." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-106", "text": "In this way, the guiding vector G t which is the hidden state of LSTM-g, can be adjusted based on previous guiding vectors and current guiding input signal z t ." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-107", "text": "We then use the guiding vector G t to construct the input signal x t for the decoding cell R ." 
}, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-108", "text": "The guiding input signal z t at every time step is formed by concatenating the previous attention vector and attribute features." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-109", "text": "Figure 1 shows an unrolled version of our framework." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-110", "text": "We referred the LSTM in the decoding cell as LSTM-d." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-111", "text": "----------------------------------" }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-112", "text": "**HIGH-LEVEL IMAGE ATTRIBUTES**" }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-113", "text": "In addition to the CNN features, our model also integrates other high-level attributes of the input image in the decoding phase." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-114", "text": "The probability distribution over most frequent words in captions has been shown to be powerful and very informative for image description [4, 9] ." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-115", "text": "We explore the use of this kind of image attributes in our model and denote by A \u2208 R Da the detected attribute representations." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-116", "text": "However, the high-level image attributes could be any additional attribute features connecting the image to its describing sentence." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-117", "text": "The attributes vector is used to construct initial state for the decoding LSTM and as additional guiding information for the guiding LSTM." 
}, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-118", "text": "----------------------------------" }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-119", "text": "**COMPLETE UPDATING PROCEDURE**" }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-120", "text": "Given the input image represented by the set of annotation vectors A and attribute features A and the describing sentence represented by the sequence w i 0 , w i 1 , . . . , w i N i , the decoding cell R update its hidden state h t at each time step following the procedure:" }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-121", "text": "where W ax \u2208 R Dx\u00d7Da , W wx \u2208 R Dx\u00d7Dw , W gx \u2208 R Dx\u00d7Dg are projection matrices of the attribute space, word embedding space and guiding vector space to the LSTM-d input space." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-122", "text": "Dx, Dw, Da, Dg denote the dimension of LSTM-d input space, word embedding space, attributes vector and guiding vector, respectively." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-123", "text": "The vector w i t is the distributed representation of the t -th word in caption S i ." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-124", "text": "We padded each caption with w i 0 to the left and w i N i to the right, where w i 0 and w i N i represent respectively the distributed representations of start-of-sentence token and end-of-sentence token . The sequential guiding network (LSTM-g) update its guiding vector G t with the following the procedure:" }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-125", "text": "where f lstm is the recursive function within a LSTM cell and W z is a projection matrix." 
}, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-126", "text": "----------------------------------" }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-127", "text": "**EXPERIMENTAL STUDIES**" }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-128", "text": "----------------------------------" }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-129", "text": "**DATASET AND PREPROCESSING**" }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-130", "text": "MS COCO Captions [25, 26] is a large scale benchmark dataset for image captioning." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-131", "text": "We work only with the MS COCO c5 dataset which annotates each image with 5 human-produced captions." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-132", "text": "It contains 82,783 images for training and 40,504 images for validation and 40,775 images for testing." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-133", "text": "We use the splits publicly available 1 in previous works [2, 5] ." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-134", "text": "These splits contain 5000 validation images and 5000 testing images taken from the original validation set." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-135", "text": "The attribute detectors are trained with the same training set and the 1000 most common words in training captions are used to form the attributes." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-136", "text": "We follow the same data preprocessing in [2] ." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-137", "text": "That is all captions are transformed to lowercase, non-alphanumeric characters are discarded and all words occurring less than 5 times in the training captions are filtered and replaced by special token . The preprocessing results in a vocabulary of 8791 words." 
}, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-138", "text": "----------------------------------" }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-139", "text": "**CONFIGURATIONS AND IMPLEMENTATION**" }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-140", "text": "To obtain image annotations A , we use Inception-V3 [27] vision model." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-141", "text": "We take the 8 \u00d7 8 \u00d7 2048 activations map of the last convolutional layer (Mixed 7c in TensorFlow) as annotations." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-142", "text": "That is A has dimensions of 64 \u00d7 2048 ." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-143", "text": "To avoid overfitting, we did not train the Inception-V3 from scratch but a model pre-trained on ImageNet is used." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-144", "text": "We did not fine-tune the weights of the vision model though it could give a small performance boost." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-145", "text": "The dimension of all layers in the decoding LSTM is set to 1024." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-146", "text": "For the guiding LSTM, the dimension is set to 512." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-147", "text": "Both LSTM cells are wrapped with dropout to avoid overfitting." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-148", "text": "We use word vectors with dimension of 512 and initialize all model parameters (except the CNN parameters) randomly with uniform distribution in [-0.1, 0.1]." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-149", "text": "We built our model based on TensorFlow [28] and used the publicly available code of Google's NIC model as base code." 
}, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-150", "text": "----------------------------------" }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-151", "text": "**MODEL COMPARISONS**" }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-152", "text": "We validate the effectiveness of our proposed framework by comparing it to several state-of-the-art captioning models based on CNN+RNN architecture such as NIC [1] , Soft-Attention [3] , LSTM-A5 [4] , LTG-Soft-Attention [6] and LTG-Review-Net [6] ." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-153", "text": "We measure the performance of our proposed model with four popular evaluation metrics: BLEU [29] , METEOR [30] , ROUGE [31] and CIDEr [22] ." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-154", "text": "To compute these metrics, we use the official MS COCO evaluation toolkit 2 that is made publicly available." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-155", "text": "Evaluation results across these metrics are shown in Table 1 ." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-156", "text": "Our framework is referenced as SGN+Luong-Attention." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-157", "text": "Note that for all models we report only single model performance knowing that ensemble of multiple models can always give few extra performance points." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-158", "text": "As shown in the table, our proposed model outperforms all other CNN+RNN architectures." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-159", "text": "It is worth nothing that our model can be seen as an extension of all previous methods and is more complex than most of them." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-160", "text": "However, the use of a sequential guiding network can be given credit by the fact that our proposed method outperforms the models LTG-Soft-Attention and LTG-Review-Net which are as complex as ours." 
}, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-161", "text": "In contrast to our model, in LTG-Soft-Attention and LTG-Review-Net, the learned guiding vector was fixed and couldn't be adjusted during words generation." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-162", "text": "To further investigate the usefulness of a sequential guiding network, we also compare our SGN+Luong-Attention captioning model with two other baseline that we built: Luong-Attention is a CNN+RNN captioning model based solely on Luong attention mechanism and does not have any extra guiding information at all." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-163", "text": "ATT+Luong-Attention is a version of SGN+Luong-Attention where we discard the guiding LSTM and replace the guiding vector at each time step by the high-level image attributes." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-164", "text": "It can be seen from the table that our SGN+Luong-Attention captioning model significantly outperforms ATT+Luong-Attention, again giving credit to the sequential guiding network." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-165", "text": "It is also worth noting that the Luong-Attention captioning model achieves better performance than the Soft-Attention model in [3] , which proves the advantage of using the Luong style of attention mechanism." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-166", "text": "----------------------------------" }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-167", "text": "**CONCLUSION**" }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-168", "text": "In this paper, we have extended the encoder-decoder framework for image captioning by inserting a guiding network." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-169", "text": "By modeling the guiding network with a Long Short-Term Memory, the guiding vector can be adjusted at each time step based on the current context and high-level image attributes." 
}, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-170", "text": "We have also explored a natural way of applying Luong's attention over image regions and demonstrated its effectiveness." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-171", "text": "We then combined these two strategies in a single joint model." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-172", "text": "Experiments were conducted on the MS COCO Captions dataset and showed that the proposed model achieves superior performance over existing models based on the CNN+RNN architecture trained using maximum likelihood estimation." }, { "sent_id": "e99baf9c4b8650f29f410501c5165b-C001-173", "text": "In future work, we will consider to apply this model to cross-modal information retrieval." } ], "y": { "@BACK@": { "gold_contexts": [ [ "e99baf9c4b8650f29f410501c5165b-C001-16" ], [ "e99baf9c4b8650f29f410501c5165b-C001-29" ] ], "cite_sentences": [ "e99baf9c4b8650f29f410501c5165b-C001-16", "e99baf9c4b8650f29f410501c5165b-C001-29" ] }, "@MOT@": { "gold_contexts": [ [ "e99baf9c4b8650f29f410501c5165b-C001-26" ], [ "e99baf9c4b8650f29f410501c5165b-C001-56" ] ], "cite_sentences": [ "e99baf9c4b8650f29f410501c5165b-C001-26", "e99baf9c4b8650f29f410501c5165b-C001-56" ] }, "@DIF@": { "gold_contexts": [ [ "e99baf9c4b8650f29f410501c5165b-C001-37" ], [ "e99baf9c4b8650f29f410501c5165b-C001-89" ], [ "e99baf9c4b8650f29f410501c5165b-C001-94" ] ], "cite_sentences": [ "e99baf9c4b8650f29f410501c5165b-C001-37", "e99baf9c4b8650f29f410501c5165b-C001-89", "e99baf9c4b8650f29f410501c5165b-C001-94" ] }, "@EXT@": { "gold_contexts": [ [ "e99baf9c4b8650f29f410501c5165b-C001-37" ], [ "e99baf9c4b8650f29f410501c5165b-C001-89" ], [ "e99baf9c4b8650f29f410501c5165b-C001-94" ] ], "cite_sentences": [ "e99baf9c4b8650f29f410501c5165b-C001-37", "e99baf9c4b8650f29f410501c5165b-C001-89", "e99baf9c4b8650f29f410501c5165b-C001-94" ] }, "@SIM@": { "gold_contexts": [ [ "e99baf9c4b8650f29f410501c5165b-C001-80" ] ], "cite_sentences": [ 
"e99baf9c4b8650f29f410501c5165b-C001-80" ] } } }, "ABC_d9aa77a03ff98cae29701eddb414d3_25": { "x": [ { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-2", "text": "Used for simple commands recognition on devices from smart routers to mobile phones, keyword spotting systems are everywhere." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-3", "text": "Ubiquitous as well are web applications, which have grown in popularity and complexity over the last decade with significant improvements in usability under cross-platform conditions." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-4", "text": "However, despite their obvious advantage in natural language interaction, voice-enabled web applications are still far and few between." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-5", "text": "In this work, we attempt to bridge this gap by bringing keyword spotting capabilities directly into the browser." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-6", "text": "To our knowledge, we are the first to demonstrate a fully-functional implementation of convolutional neural networks in pure JavaScript that runs in any standards-compliant browser." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-7", "text": "We also apply network slimming, a model compression technique, to explore the accuracy-efficiency tradeoffs, reporting latency measurements on a range of devices and software." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-8", "text": "Overall, our robust, cross-device implementation for keyword spotting realizes a new paradigm for serving neural network applications, and one of our slim models reduces latency by 66% with a minimal decrease in accuracy of 4% from 94% to 90%." 
}, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-9", "text": "----------------------------------" }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-11", "text": "With the rapid proliferation of voice-enabled devices, such as the Amazon Echo and Apple iPhone, speech recognition systems are becoming increasingly prevalent in our daily lives." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-12", "text": "Importantly, these systems improve safety and convenience in hands-free interactions, such as using Apple's Siri to dial contacts while driving." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-13", "text": "However, a prominent drawback is that most of these systems perform speech recognition in the cloud, where a remote server receives all audio to be transcribed, as recorded by the device." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-14", "text": "Clearly, the privacy and security implications are significant: the server may be accessed by other people-authorized or not." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-15", "text": "Thus, it is important to capture the relevant speech only and not all incoming audio, all the while providing a hands-free experience." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-16", "text": "Enter keyword spotting systems." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-17", "text": "They solve the aforementioned issues by implementing an on-device mechanism to \"wake up\" the intelligent agent, e.g., \"Okay, Google\" for triggering the Android assistant." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-18", "text": "This then allows the device to record and transmit a limited segment of relevant speech only, obviating the need to be always-listening." 
}, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-19", "text": "Specifically, the task of keyword spotting (KWS) is to detect the presence of pre-specified phrases in a stream of audio, often with the end goal of wake-word detection or simple command recognition on device." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-20", "text": "Currently, state of the art uses lightweight neural networks [1, 2, 3, 4] , which can perform inference in real-time even on low-end devices [4, 5] ." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-21", "text": "Despite the popularity of voice-enabled products, web applications have yet to make use of keyword spotting." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-22", "text": "This is surprising, since modern web applications are supported on billions of devices ranging from desktops to smartphones." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-23", "text": "Also, an in-browser KWS system would be able to perform the aforementioned simple commands recognition and wakeword detection." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-24", "text": "Thus, we attempt to close the gap between KWS systems and web applications in both research literature and industrial applications, building and evaluating such an in-browser system." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-25", "text": "Unfortunately, the browser is a highly inefficient platform for deploying neural networks, mainly due to poorly optimized matrix multiply routines." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-26", "text": "Fortunately, in recent years, the art of compressing neural networks has made significant advances in both general [6, 7, 8] and keyword spotting literature [4, 9] ." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-27", "text": "On our task, we demonstrate that network slimming [6] is a simple yet highly effective method to achieve low latency with minimal impact on accuracy." 
}, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-28", "text": "Thus, our main contributions are as follows: first, we develop a novel web application with an in-browser KWS system based on previous state-of-the-art models [3] ." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-29", "text": "Second, we provide the first set of comprehensive experimental results for the latency of an in-browser KWS system on a broad range of devices." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-30", "text": "Finally, to the best of our knowledge, we are the first to apply network slimming to examine various accuracy-efficiency operating points of a state-of-the-art KWS model." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-31", "text": "On the Google Speech Commands dataset [10] , our most accurate in-browser model achieves an accuracy of 94% while performing inference in less than 10 milliseconds." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-32", "text": "With network slimming, we further reduce latency by 66% while increasing the error rate by only 4%." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-33", "text": "----------------------------------" }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-34", "text": "**BACKGROUND AND RELATED WORK**" }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-35", "text": "Keyword spotting." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-36", "text": "KWS is the task of detecting a spoken phrase in audio, applicable to simple command recognition [3, 10] and wake-word detection [2, 1] ." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-37", "text": "A typical requirement is that such a KWS system must be small-footprint at inference time, since the target platforms are mobile phones, Internet-of-things (IoT) devices, and other portable electronics."
}, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-38", "text": "To achieve this goal, resource-efficient architectures using convolutional neural networks (CNNs) [3, 1] and recurrent neural networks (RNNs) [2] have been proposed, while other works make use of low-bitwidth weights [4, 9] ." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-39", "text": "However, despite the pervasiveness of modern web browsers in devices from smartphones to desktops, and in spite of the availability of JavaScript-based deep learning toolkits, implementing on-device KWS systems in web applications has never been done before." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-40", "text": "Compressing neural networks." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-41", "text": "Sparse matrix storage leads to inefficient computation and storage in general-purpose hardware; thus, inducing structured sparsity in neural networks, e.g., on entire rows and columns, has been the cornerstone of various compression techniques [6, 8] ." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-42", "text": "Network slimming [6] is one such state-of-the-art approach that has been applied successfully to CNNs: first, models are trained with an L1 penalty on the scale parameters in 2D batch normalization [11] layers, which encourages entire channels to approach zero." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-43", "text": "Then, a fixed percentage of the smallest and hence unimportant scale parameters are removed, along with the corresponding preceding and succeeding filters in the convolution layers (see Figure 1) ." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-44", "text": "Finally, the entire network is fine-tuned on the training set; this entire process can optionally be repeated multiple times."
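The channel-selection step of network slimming described above can be sketched in a few lines. This is an illustrative toy re-implementation, not the paper's code: all array names and shapes are assumptions, and only the ranking-by-gamma and filter-removal logic from [6] is shown.

```python
import numpy as np

# Hypothetical sketch of the pruning step in network slimming [6]:
# given the learned batch-norm scale parameters (gammas) of a conv layer,
# drop the channels with the smallest |gamma|, together with the
# corresponding output filters of the preceding conv and input channels
# of the succeeding conv.

def slim_channels(gammas, prune_ratio):
    """Return sorted indices of channels to keep after pruning."""
    gammas = np.asarray(gammas, dtype=float)
    n_prune = int(len(gammas) * prune_ratio)
    order = np.argsort(np.abs(gammas))   # ascending importance
    return np.sort(order[n_prune:])      # keep the most important channels

def prune_conv_pair(w_prev, w_next, keep):
    """Remove pruned output filters of the preceding conv and the matching
    input channels of the succeeding conv. Shapes: (out, in, kh, kw)."""
    return w_prev[keep], w_next[:, keep]

# Toy example: 10 channels, pruning 40% of them.
gammas = np.array([0.9, 0.01, 0.5, 0.02, 0.7, 0.03, 0.6, 0.04, 0.8, 0.55])
keep = slim_channels(gammas, prune_ratio=0.4)
w_prev = np.zeros((10, 3, 3, 3))   # preceding conv: 10 output filters
w_next = np.zeros((8, 10, 3, 3))   # succeeding conv: 10 input channels
w_prev_s, w_next_s = prune_conv_pair(w_prev, w_next, keep)
```

After this step, the slimmed network would be fine-tuned on the training set, as the text notes.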
}, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-45", "text": "----------------------------------" }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-46", "text": "**DATA AND IMPLEMENTATION**" }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-47", "text": "For consistency with past results [3, 5] , we train our models on the first version of the Google Speech Commands dataset [10] , which comprises a total of 65,000 spoken utterances for 30 short, one-second phrases." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-48", "text": "To compare with past work [3] , we pick the following twelve classes: \"yes,\" \"no,\" \"up,\" \"down,\" \"stop,\" \"go,\" \"left,\" \"right,\" \"on,\" \"off,\" unknown, and silence." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-49", "text": "It contains roughly 2,000 examples per class, including a few background noise samples of both man-made and artificial noise, e.g., washing dishes and white noise." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-50", "text": "As is standard in speech processing literature, all audio is in 16-bit PCM, 16kHz mono-channel WAV format." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-51", "text": "We use the standard 80%, 10%, and 10% splits for the training, validation, and test sets, respectively [3, 10] ." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-52", "text": "----------------------------------" }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-53", "text": "**INPUT PREPROCESSING**" }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-54", "text": "First, for dataset augmentation, the input is randomly mixed with additive noise from the background noise set [10]; this helps to decrease the generalization error [12] and improve the robustness of the model under noisy conditions." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-55", "text": "Following the official TensorFlow implementation, we also apply a random time shift of Uniform[\u2212100, 100] milliseconds."
}, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-56", "text": "Then, for the feature extraction step, 40-dimensional Mel-frequency cepstral coefficients (MFCCs) are computed, with a window size of 30 milliseconds and a frame shift of 10 milliseconds, yielding a final preprocessed input size of 101 \u00d7 40 for each one-second audio sample." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-57", "text": "----------------------------------" }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-58", "text": "**MODEL ARCHITECTURE**" }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-59", "text": "We use the res8 and res8-narrow architectures from Tang and Lin [3] as a starting point, which represent prior state of the art in residual CNNs [13] for KWS." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-60", "text": "In both models, given the input X \u2208 R^(101\u00d740), we first expand the input channel-wise by applying a 2D convolution layer with weights W \u2208 R^(C_out\u00d71\u00d7(3\u00d73)) and padding of one on all sides." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-61", "text": "This step results in an output X' \u2208 R^(C_out\u00d7101\u00d740), which we then downsample using an average pooling layer with a kernel size of (4, 3)." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-62", "text": "Next, inspired by insights in image classification [13] , the output is passed through a series of three residual blocks comprising convolution and batch normalization [11] layers; Figure 2 illustrates one such block." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-63", "text": "Finally, we average-pool across the channels and pass the features through a softmax layer across the twelve classes." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-64", "text": "In the previous description, we are free to choose C_out to dictate the expressiveness and computational footprint of the model."
}, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-65", "text": "res8 and res8-narrow choose 45 and 19, respectively, for C_out." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-66", "text": "In total, res8 contains 110K parameters and incurs 30 million multiplies per second of audio, while res8-narrow uses 19.9K parameters and incurs 5.65 million multiplies per second."
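The stated parameter counts and the 101 x 40 input shape can be reproduced from the description above. The sketch below is a hypothetical back-of-envelope check, not the authors' code; it assumes bias-free 3 x 3 convolutions (as is usual before batch normalization) and a single 12-way linear output layer.

```python
# Back-of-envelope parameter count for res8-style models, assuming:
# one 3x3 conv expanding 1 -> C_out channels, three residual blocks with
# two 3x3 C_out -> C_out convs each (no biases), and a 12-way classifier.

def res8_params(c_out, n_classes=12):
    first_conv = c_out * 1 * 3 * 3              # channel expansion
    res_convs = 6 * c_out * c_out * 3 * 3       # 3 blocks x 2 convs each
    classifier = c_out * n_classes + n_classes  # weights + biases
    return first_conv + res_convs + classifier

full = res8_params(45)    # res8: ~110K parameters
narrow = res8_params(19)  # res8-narrow: ~19.9K parameters

# The 101 x 40 input also follows from the preprocessing: a one-second
# 16 kHz clip framed with a 10 ms (160-sample) shift and centered windows.
frames = 1 + 16000 // 160  # 101 frames of 40 MFCCs each
```

The totals (roughly 110K and 19.9K) agree with the figures quoted in the text.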
}, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-75", "text": "Overall, we successfully enable KWS functionality in the browser without any server-side inference." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-76", "text": "Since the audio data is quickly processed within the browser, it is much more efficient than transferring data over the network for inference." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-77", "text": "Furthermore, users are now freed from security and privacy implications, such as eavesdropping of network traffic and collection of personal speech data." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-78", "text": "Since JavaScript does not guarantee the same efficiency as Python with native C++ optimizations, we look for ways to further optimize in-browser inference." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-79", "text": "After exploring a number of options, we find that network slimming [6] is a simple yet highly effective method to achieve this." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-80", "text": "Network slimming." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-81", "text": "Since the compression technique [6] relies on the presence of scale parameters in batch normalization layers, we cannot apply slimming as-is to the original res8-*, which does not use affine transforms." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-82", "text": "For pruning, we must introduce a scale parameter \u03b3 for each batch normalization operation, corresponding to \u03b3 \u00b7 (X \u2212 \u00b5)/\u03c3 for input X, mean \u00b5, and standard deviation \u03c3." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-83", "text": "Note that these new scale parameters are only introduced in the pruned architecture, because they are unnecessary for the full architecture."
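The role of the added scale parameter can be illustrated with a minimal per-channel batch normalization sketch. This is illustrative only; the shapes and names are assumptions, not taken from the paper's implementation.

```python
import numpy as np

# Minimal sketch of batch normalization with an affine scale gamma:
#   y = gamma * (X - mu) / sigma
# Without gamma (non-affine BN, as in the original res8-*), there is no
# per-channel magnitude for network slimming to rank channels by.
def batch_norm(x, gamma, eps=1e-5):
    # x: (batch, channels); statistics are computed per channel
    mu = x.mean(axis=0)
    sigma = np.sqrt(x.var(axis=0) + eps)
    return gamma * (x - mu) / sigma

x = np.array([[1.0, 10.0],
              [3.0, 30.0]])
gamma = np.array([1.0, 0.5])  # channel 1 is scaled down, a pruning signal
y = batch_norm(x, gamma)
```

Channels whose gamma is driven toward zero by the L1 penalty contribute little to the output and are the ones slimming removes.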
}, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-84", "text": "We create two configurations of pruned models: one with 40% of the parameters removed, and another with a more aggressive 80% removed." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-85", "text": "We append -40 and -80 to res8 and res8-narrow, depending on the level of pruning." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-86", "text": "----------------------------------" }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-87", "text": "**EVALUATION**" }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-88", "text": "The two main metrics for neural network applications are accuracy and inference latency." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-89", "text": "To be consistent with the training process, the experiments use the same test set partitioned from the data." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-90", "text": "We conduct experiments and evaluate performance on desktop, laptop, and smartphone configurations to demonstrate the feasibility of our web application on a broad range of devices." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-91", "text": "First, we evaluate our application on a desktop with 16GB RAM, an i7-4790k CPU, and a GTX 1080 Ti." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-92", "text": "Then, we use the MacBook Pro (2017) and MacBook Air (2013) for our laptop configurations; the former has a quad-core i5-7287U CPU and an Intel Iris Plus 650 GPU, while the latter has a lighter dual-core i5-4260U CPU and an Intel HD 6000 GPU." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-93", "text": "Finally, we choose the Galaxy S8 as our smartphone configuration." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-94", "text": "We select Firefox as the browser, and results are collected both with and without WebGL to evaluate the benefits of hardware acceleration."
}, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-95", "text": "----------------------------------" }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-96", "text": "**RESULTS**" }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-97", "text": "In-browser KWS inference efficiency." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-98", "text": "Measured with our university WiFi connection, the average latency to the Google server is roughly 25 ms with a standard deviation of 20 ms." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-99", "text": "Network latency is much higher when transferring audio data." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-100", "text": "With a server written in Python, our evaluation shows an average latency of 481 ms with a standard deviation of 183 ms for 16kHz mono-channel audio data." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-101", "text": "With in-browser inference, we achieve a serverless architecture which no longer suffers from variable network latency." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-102", "text": "Table 1 summarizes latency and accuracy results for both res8 and res8-narrow on various devices." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-103", "text": "Note that results on our PyTorch implementation are included on laptop and desktop setups to compare to the standard baseline; the original implementation achieves an accuracy of 94.34% for res8 and 91.16% for res8-narrow (see first few rows in the table)." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-104", "text": "Slight differences are observed among platforms due to mismatch of MFCCs between LibROSA and Meyda." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-105", "text": "However, the accuracy for each model is consistent for every platform, confirming that our in-browser web application is indeed robust."
}, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-106", "text": "Even though latency is processor-dependent, the res8-narrow model performs inference in real-time on every platform, ranging from 7 to 43 milliseconds on GPU and 86 to 265 milliseconds on CPU configurations." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-107", "text": "Given that these delays are perceived by humans to be near-instantaneous [16] , the latency we observe is sufficient for real-time interactive web applications, even considering the in-browser overhead." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-108", "text": "In fact, it is now feasible to deploy cross-platform neural network web applications even on mobile devices." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-109", "text": "Latency-accuracy tradeoff." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-110", "text": "Under the limited computational resources on mobile devices, network slimming can provide an option to trade off accuracy for inference latency." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-111", "text": "To understand tradeoffs between latency and accuracy, we also evaluate res8 and res8-narrow models with 40% and 80% of their batch normalization channels pruned (see Figure 3) ; to illustrate the trend concisely, the figure includes results on CPU configurations only." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-112", "text": "From res8-narrow-80 to res8-narrow-40, accuracy increases by 4% with minimal latency increase." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-113", "text": "However, starting from res8-narrow-40, the increase in latency is clear, indicating that obtaining higher accuracy comes at a cost." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-114", "text": "The slope of the curve increasingly steepens as accuracy increases, yielding tradeoff curves similar to those observed in other works [7, 17] ."
}, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-115", "text": "Between res8-40 and res8, the change in accuracy is less than 1%, even though the largest increase in latency is observed, from 111 ms to 363 ms." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-116", "text": "In other words, res8-40 performs as well as res8 while achieving lower latency." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-117", "text": "Overall, we achieve a 50% decrease in latency in res8-narrow-80 and 66% in res8-80, with only an absolute error rate increase of 4%." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-118", "text": "res8-narrow on the MacBook Pro requires 94 ms but drops down to 41 ms with res8-narrow-80." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-119", "text": "Similarly, latency on the Galaxy S8 starts from 265 ms and decreases to 137 ms." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-120", "text": "Also, given that both the accuracy and latency of res8-narrow are comparable to res8-80, network slimming provides the option to swap one model for the other depending on the target device." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-121", "text": "----------------------------------" }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-122", "text": "**CONCLUSION**" }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-123", "text": "In this paper, we realize a new paradigm for serving neural network applications by implementing KWS with in-browser inference." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-124", "text": "The serverless architecture allows our application to be efficient and cross-device compatible, with the additional benefit that users are freed from security and privacy implications, such as eavesdropping of network traffic and collection of personal speech data."
}, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-125", "text": "We implement a KWS web application that achieves an accuracy of 94% while maintaining an inference latency of less than 10 ms on modern devices." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-126", "text": "With the goal of understanding accuracy and inference latency tradeoffs, we also analyze the impact of network slimming on existing res8 and res8-narrow models." }, { "sent_id": "d9aa77a03ff98cae29701eddb414d3-C001-127", "text": "Our study shows that, with network slimming, our model yields a 66% decrease in latency with a minimal increase in error rate of 4%, along with accuracy-efficiency tradeoff curves like those observed in the past." } ], "y": { "@BACK@": { "gold_contexts": [ [ "d9aa77a03ff98cae29701eddb414d3-C001-20" ], [ "d9aa77a03ff98cae29701eddb414d3-C001-28" ], [ "d9aa77a03ff98cae29701eddb414d3-C001-36" ], [ "d9aa77a03ff98cae29701eddb414d3-C001-38" ] ], "cite_sentences": [ "d9aa77a03ff98cae29701eddb414d3-C001-20", "d9aa77a03ff98cae29701eddb414d3-C001-28", "d9aa77a03ff98cae29701eddb414d3-C001-36", "d9aa77a03ff98cae29701eddb414d3-C001-38" ] }, "@SIM@": { "gold_contexts": [ [ "d9aa77a03ff98cae29701eddb414d3-C001-47" ], [ "d9aa77a03ff98cae29701eddb414d3-C001-48" ], [ "d9aa77a03ff98cae29701eddb414d3-C001-51" ], [ "d9aa77a03ff98cae29701eddb414d3-C001-59" ] ], "cite_sentences": [ "d9aa77a03ff98cae29701eddb414d3-C001-47", "d9aa77a03ff98cae29701eddb414d3-C001-48", "d9aa77a03ff98cae29701eddb414d3-C001-51", "d9aa77a03ff98cae29701eddb414d3-C001-59" ] }, "@USE@": { "gold_contexts": [ [ "d9aa77a03ff98cae29701eddb414d3-C001-47" ], [ "d9aa77a03ff98cae29701eddb414d3-C001-48" ], [ "d9aa77a03ff98cae29701eddb414d3-C001-59" ] ], "cite_sentences": [ "d9aa77a03ff98cae29701eddb414d3-C001-47", "d9aa77a03ff98cae29701eddb414d3-C001-48", "d9aa77a03ff98cae29701eddb414d3-C001-59" ] } } }, "ABC_e0e21b4e473ad6fde28378b2dc4f34_25": { "x": [ { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-30", "text": "Darwish
and Oard (2003) termed this method the probabilistic structured query approach (PSQ)." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-2", "text": "We present an approach to cross-language retrieval that combines dense knowledge-based features and sparse word translations." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-3", "text": "Both feature types are learned directly from relevance rankings of bilingual documents in a pairwise ranking framework." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-4", "text": "In large-scale experiments for patent prior art search and cross-lingual retrieval in Wikipedia, our approach yields considerable improvements over learning-to-rank with either only dense or only sparse features, and over very competitive baselines that combine state-of-the-art machine translation and retrieval." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-5", "text": "----------------------------------" }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-7", "text": "Cross-Language Information Retrieval (CLIR) for the domain of web search successfully leverages state-of-the-art Statistical Machine Translation (SMT) to either produce a single most probable translation, or a weighted list of alternatives, that is used as a search query to a standard search engine (Chin et al., 2008; Ture et al., 2012) ." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-8", "text": "This approach is advantageous if large amounts of in-domain sentence-parallel data are available to train SMT systems, but relevance rankings to train retrieval models are not." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-9", "text": "The situation is different for CLIR in special domains such as patents or Wikipedia."
}, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-10", "text": "Parallel data for translation have to be extracted with some effort from comparable or noisy parallel data (Utiyama and Isahara, 2007; Smith et al., 2010) ; however, relevance judgments are often straightforwardly encoded in special domains." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-11", "text": "For example, in patent prior art search, patents granted at any patent office worldwide are considered relevant if they constitute prior art with respect to the invention claimed in the query patent." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-12", "text": "Since patent applicants and lawyers are required to list relevant prior work explicitly in the patent application, patent citations can be used to automatically extract large amounts of relevance judgments across languages (Graf and Azzopardi, 2008) ." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-13", "text": "In Wikipedia search, one can imagine a Wikipedia author trying to investigate whether a Wikipedia article covering the subject the author intends to write about already exists in another language." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-14", "text": "Since authors are encouraged to avoid orphan articles and to cite their sources, Wikipedia has a rich linking structure between related articles, which can be exploited to create relevance links between articles across languages (Bai et al., 2010) ." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-15", "text": "Besides a rich citation structure, patent documents and Wikipedia articles contain a number of further cues on relatedness that can be exploited as features in learning-to-rank approaches." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-16", "text": "For monolingual patent retrieval, Guo and Gomes (2009) and Oh et al.
(2013) advocate the use of dense features encoding domain knowledge on inventors, assignees, location and date, together with dense similarity scores based on bag-of-word representations of patents." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-17", "text": "Bai et al. (2010) show that for the domain of Wikipedia, learning a sparse matrix of word associations between the query and document vocabularies from relevance rankings is useful in monolingual and cross-lingual retrieval." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-18", "text": "Sokolov et al. (2013) apply the idea of learning a sparse matrix of bilingual phrase associations from relevance rankings to cross-lingual retrieval in the patent domain." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-19", "text": "Both show improvements of learning-to-rank on relevance data over SMT-based approaches on their respective domains." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-20", "text": "The main contribution of this paper is a thorough evaluation of dense and sparse features for learning-to-rank that have so far been used only monolingually or only on either patents or Wikipedia." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-21", "text": "We show that for both domains, patents and Wikipedia, jointly learning bilingual sparse word associations and dense knowledge-based similarities directly on relevance-ranked data improves significantly over approaches that use either only sparse or only dense features, and over approaches that combine query translation by SMT with standard retrieval in the target language." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-22", "text": "Furthermore, we show that our approach can be seen as supervised model combination that allows us to combine SMT-based and ranking-based approaches for further substantial improvements."
}, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-23", "text": "We conjecture that the gains are due to orthogonal information contributed by domain knowledge, ranking-based word associations, and translation-based information." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-24", "text": "----------------------------------" }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-25", "text": "**RELATED WORK**" }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-26", "text": "CLIR addresses the problem of translating or projecting a query into the language of the document repository across which retrieval is performed." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-27", "text": "In a direct translation approach (DT), a state-of-the-art SMT system is used to produce a single best translation that is used as a search query in the target language." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-28", "text": "For example, Google's CLIR approach combines their state-of-the-art SMT system with their proprietary search engine (Chin et al., 2008) ." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-29", "text": "Alternative approaches avoid solving the hard problem of word reordering, and instead rely on token-to-token translations that are used to project the query terms into the target language with a probabilistic weighting of the standard term tf-idf scheme." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-31", "text": "The advantage of this technique is an implicit query expansion effect due to the use of probability distributions over term translations (Xu et al., 2001) ." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-32", "text": "Ture et al. (2012) brought SMT back into this paradigm by projecting terms from n-best translations from synchronous context-free grammars." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-33", "text": "Ranking approaches have been presented by Guo and Gomes (2009) and Oh et al. (2013) ."
}, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-34", "text": "Their method is a classical learning-to-rank setup where pairwise ranking is applied to a few hundred dense features." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-35", "text": "Methods to learn sparse word-based translation correspondences from supervised ranking signals have been presented by Bai et al. (2010) and Sokolov et al. (2013) ." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-36", "text": "Both approaches work in a cross-lingual setting, the former on Wikipedia data, the latter on patents." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-37", "text": "Our approach extends the work of Sokolov et al. (2013) by presenting an alternative learning-to-rank approach that can be used for supervised model combination to integrate dense and sparse features, and by evaluating both approaches on cross-lingual retrieval for patents and Wikipedia." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-38", "text": "This relates our work to supervised model merging approaches (Sheldon et al., 2011) ." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-39", "text": "----------------------------------" }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-40", "text": "**TRANSLATION AND RANKING FOR CLIR**" }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-41", "text": "SMT-based Models." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-42", "text": "We will refer to DT and PSQ as SMT-based models that translate a query, and then perform monolingual retrieval using BM25." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-43", "text": "Translation is agnostic of the retrieval task." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-44", "text": "Linear Ranking for Word-Based Models. Let q \u2208 {0, 1}^Q be a query and d \u2208 {0, 1}^D be a document, where the j-th vector dimension indicates the occurrence of the j-th word for dictionaries of size Q and D.
A linear ranking model is defined as f(q, d) = q\u22a4W d" }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-45", "text": "where W \u2208 R^(Q\u00d7D) encodes a matrix of ranking-specific word associations (Bai et al., 2010) ." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-46", "text": "We optimize this model by pairwise ranking, which assumes labeled data in the form of a set R of tuples (q, d+, d\u2212)" }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-47", "text": ", where d+ is a relevant (or higher ranked) document and d\u2212 an irrelevant (or lower ranked) document for query q. The goal is to find a weight matrix W such that the inequality f(q, d+) > f(q, d\u2212)" }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-48", "text": "is violated for the fewest number of tuples from R. We present two methods for optimizing W in the following." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-49", "text": "Pairwise Ranking using Boosting (BM)." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-50", "text": "The Boosting-based Ranking baseline (Freund et al., 2003) optimizes an exponential loss:" }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-51", "text": "where" }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-52", "text": "is a non-negative importance function on tuples." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-53", "text": "The algorithm of Sokolov et al. (2013) combines batch boosting with bagging over a number of independently drawn bootstrap data samples from R. In each step, the single word pair feature is selected that provides the largest decrease of L_exp ." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-54", "text": "The resulting models are averaged." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-55", "text": "To reduce memory requirements, we used random feature hashing with a hash size of 30 bits (Shi et al., 2009) ." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-56", "text": "For regularization we rely on early stopping."
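The pairwise setup above can be sketched concretely. The snippet below is a toy illustration of a linear ranking model trained with a pairwise hinge update and optional L1 shrinkage; it is not the paper's boosting or VW implementation, and the vocabulary and vectors are invented for the example.

```python
import numpy as np

# Toy linear ranking model f(q, d) = q^T W d over binary bag-of-words
# vectors, trained so that f(q, d_plus) > f(q, d_minus) for tuples
# (q, d_plus, d_minus) drawn from a relevance set R.

def score(q, W, d):
    return q @ W @ d

def sgd_pairwise_hinge(R, Q, D, lr=0.1, lam=0.0, epochs=20):
    """SGD on the pairwise hinge loss (1 - f(q,d+) + f(q,d-))_+ ."""
    W = np.zeros((Q, D))
    for _ in range(epochs):
        for q, d_pos, d_neg in R:
            margin = score(q, W, d_pos) - score(q, W, d_neg)
            if margin < 1.0:  # hinge active: push the pair apart
                W += lr * np.outer(q, d_pos - d_neg)
            if lam:           # crude L1 shrinkage toward a sparse W
                W = np.sign(W) * np.maximum(np.abs(W) - lr * lam, 0.0)
    return W

# Toy vocabulary: query word 0 should associate with document word 1.
q = np.array([1.0, 0.0])
d_pos = np.array([0.0, 1.0, 0.0])
d_neg = np.array([1.0, 0.0, 0.0])
W = sgd_pairwise_hinge([(q, d_pos, d_neg)], Q=2, D=3)
```

After training, the relevant document outscores the irrelevant one for this query, and W carries the learned word association.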
}, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-57", "text": "Pairwise Ranking with SGD (VW)." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-58", "text": "The second objective is an 1 -regularized hinge loss:" }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-59", "text": "where (x) + = max(0, 1 \u2212 x) and \u03bb is the regularization parameter." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-60", "text": "This newly added model utilizes the standard implementation of online SGD from the Vowpal Wabbit (VW) toolkit (Goel et al., 2008) and was run on a data sample of 5M to 10M tuples from R. For Wikipedia, we implemented features that compare the relative length of documents, number of links and images, the number of common links and common images, and Wikipedia categories: Given the categories associated with a foreign query, we use the language links on the Wikipedia category pages to generate a set of \"translated\" English categories S. The Englishside category graph is used to construct sets of super-and subcategories related to the candidate document's categories." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-61", "text": "This expansion is done in both directions for two levels resulting in 5 category sets." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-62", "text": "The intersection between target set T n and the source category set S reflects the category level similarity between query and document, which we calculate as a mutual containment score s n = 1 2 (|S \u2229 T n |/|S| + |S \u2229 T n |/|T n |) for n \u2208 {\u22122, \u22121, 0, +1, +2} (Broder, 1997) ." 
}, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-63", "text": "Optimization for these additional models including domain knowledge features was done by overloading the vector representation of queries q and documents d in the VW linear learner: Instead of sparse word-based features, q and d are represented by real-valued vectors of dense domainknowledge features." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-64", "text": "Optimization for the overloaded vectors is done as described above for VW." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-65", "text": "----------------------------------" }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-66", "text": "**MODEL COMBINATION**" }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-67", "text": "Combination by Borda Counts." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-68", "text": "The baseline consensus-based voting Borda Count procedure endows each voter with a fixed amount of voting points which he is free to distribute among the scored documents (Aslam and Montague, 2001; Sokolov et al., 2013) ." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-69", "text": "The aggregate score for two rankings f 1 (q, d) and f 2 (q, d) for all (q, d) in the test set is then a simple linear interpolation:" }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-70", "text": ". Parameter \u03ba was adjusted on the dev set." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-71", "text": "Combination by Linear Learning." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-72", "text": "In order to acquire the best combination of more than two models, we created vectors of model scores along with domain knowledge features and reused the VW pairwise ranking approach." 
}, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-73", "text": "This means that the vector representation of queries q and documents d in the VW linear learner is overloaded once more: In addition to dense domainknowledge features, we incorporate arbitrary ranking models as dense features whose value is the score of the ranking model." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-74", "text": "Training data was sampled from the dev set and processed with VW." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-75", "text": "----------------------------------" }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-76", "text": "**DATA PATENT PRIOR ART SEARCH (JP-EN).**" }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-77", "text": "We use BoostCLIR 1 , a Japanese-English (JP-EN) corpus of patent abstracts from the MAREC and NTCIR data (Sokolov et al., 2013) ." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-78", "text": "It contains automatically induced relevance judgments for patent abstracts (Graf and Azzopardi, 2008) : EN patents are regarded as relevant with level (3) to a JP query patent, if they are in a family relationship (e.g., same invention), cited by the patent examiner (2), or cited by the applicant (1)." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-79", "text": "Statistics on the ranking data are given in Table 1 ." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-80", "text": "On average, queries and documents contain about 5 sentences." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-81", "text": "----------------------------------" }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-82", "text": "**WIKIPEDIA ARTICLE RETRIEVAL (DE-EN).**" }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-83", "text": "The intuition behind our Wikipedia retrieval setup is as follows: Consider the situation where the German (DE) Wikipedia article on geological sea stacks does not yet exist." 
}, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-84", "text": "A native speaker of German with profound knowledge in geology intends to write it, naming it \"Brandungspfeiler\", while seeking to align its structure with the EN counterpart." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-85", "text": "The task of a CLIR engine is to return relevant EN Wikipedia articles that may describe the very same concept (Stack (geology)), or relevant instances of it (Bako National Park, Lange Anna)." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-86", "text": "The information need may be paraphrased as a high-level definition of the topic." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-87", "text": "Since typically the first sentence of any Wikipedia article is such To create this data 4 we downloaded XML and SQL dumps of the DE and EN Wikipedia from, resp., 22 nd and 4 th of November 2013." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-88", "text": "Wikipedia markup removal and link extraction was carried out using the Cloud9 toolkit 5 ." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-89", "text": "Sentence extraction was done with NLTK 6 ." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-90", "text": "Since Wikipedia articles vary greatly in length, we restricted EN documents to the first 200 words after extracting the link graph to reduce the number of features for BM and VW models." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-91", "text": "To avoid rendering the task too easy for literal keyword matching of queries about named entities, we removed title words from the German queries." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-92", "text": "Statistics are given in Table 1 ." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-93", "text": "Preprocessing Ranking Data." 
}, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-94", "text": "In addition to lowercasing and punctuation removal, we applied Correlated Feature Hashing (CFH), that makes collisions more likely for words with close meaning (Bai et al., 2010) ." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-95", "text": "For patents, vocabularies contained 60k and 365k words for JP and EN." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-96", "text": "Filtering special symbols and stopwords reduced the JP vocabulary size to 50k (small enough not to resort to CFH)." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-97", "text": "To reduce the EN vocabulary to a comparable size, we applied similar preprocessing and CFH with F =30k and k=5." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-98", "text": "Since for Wikipedia data, the DE and EN vocabularies were both large (6.7M and 6M), we used the same filtering and preprocessing as for the patent data before applying CFH with F =40k and k=5 on both sides." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-99", "text": "Parallel Data for SMT-based CLIR." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-100", "text": "For both tasks, DT and PSQ require an SMT baseline system trained on parallel corpora that are disjunct from the ranking data." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-101", "text": "A JP-EN system was trained on data described and preprocessed by Sokolov et al. (2013) , consisting of 1.8M parallel sentences from the NTCIR-7 JP-EN PatentMT subtask (Fujii et al., 2008) and 2k parallel sentences for parameter development from the NTCIR-8 test collection." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-102", "text": "For Wikipedia, we trained a DE-EN system on 4.1M parallel sentences from Europarl, Common Crawl, and NewsCommentary." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-103", "text": "Parameter tuning was done on 3k parallel sentences from the WMT'11 test set." 
}, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-104", "text": "----------------------------------" }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-105", "text": "**EXPERIMENTS**" }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-106", "text": "Experiment Settings." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-107", "text": "The SMT-based models use cdec (Dyer et al., 2010) ." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-108", "text": "Word alignments were created with mgiza (JP-EN) and fast align (Dyer et al., 2013) (DE-EN) ." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-109", "text": "Language models were trained with the KenLM toolkit (Heafield, 2011) ." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-110", "text": "The JP-EN system uses a 5-gram language model from the EN side of the training data." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-111", "text": "For the DE-EN system, a 4-gram model was built on the EN side of the training data and the EN Wikipedia documents." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-112", "text": "Weights for the standard feature set were optimized using cdec's MERT (JP-EN) and MIRA (DE-EN) implementations (Och, 2003; Chiang et al., 2008) ." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-113", "text": "PSQ on patents reuses settings found by Sokolov et al. (2013) ; settings for Wikipedia were adjusted on its dev set (n=1000, \u03bb=0.4, L=0, C=1)." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-114", "text": "Patent retrieval for DT was done by sentencewise translation and subsequent re-joining to form one query per patent, which was ranked against the documents using BM25." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-115", "text": "For PSQ, BM25 is computed on expected term and document frequencies." 
}, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-116", "text": "For ranking-based retrieval, we compare several combinations of learners and features (Table 2) ." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-117", "text": "VW denotes a sparse model using word-based features trained with SGD." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-118", "text": "BM denotes a similar model trained using Boosting." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-119", "text": "DK denotes VW training of a model that represents queries q and documents d by dense domain-knowledge features instead of by sparse word-based vectors." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-120", "text": "In order to simulate pass-through behavior of out-ofvocabulary terms in SMT systems, additional features accounting for source and target term identity were added to DK and BM models." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-121", "text": "The parameter \u03bb for VW was found on dev set." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-122", "text": "Statistical significance testing was performed using the paired randomization test (Smucker et al., 2007) ." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-123", "text": "Borda denotes model combination by Borda Count voting where the linear interpolation parameter is adjusted for MAP on the respective development sets with grid search." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-124", "text": "This type of model combination only allows to combine pairs of rankings." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-125", "text": "We present a combination of SMTbased CLIR, DT+PSQ, a combination of dense and sparse features, DK+VW, and a combination of both combinations, (DT+PSQ)+(DK+VW)." 
}, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-126", "text": "LinLearn denotes model combination by overloading the vector representation of queries q and documents d in the VW linear learner by incorporating arbitrary ranking models as dense features." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-127", "text": "In difference to grid search for Borda, optimal weights for the linear combination of incorporated ranking models can be learned automatically." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-128", "text": "We investigate the same combinations of ranking models as described for Borda above." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-129", "text": "We do not report combination results including the sparse BM model since they were consistently lower than the ones with the sparse VW model." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-130", "text": "----------------------------------" }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-131", "text": "**TEST RESULTS.**" }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-132", "text": "Experimental results on test data are given in Table 2 ." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-133", "text": "Results are reported with respect to MAP (Manning et al., 2008) , NDCG (J\u00e4rvelin and Kek\u00e4l\u00e4inen, 2002) , and PRES (Magdy and Jones, 2010) ." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-134", "text": "Scores were computed on the top 1,000 retrieved documents." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-135", "text": "As can be seen from inspecting the two blocks of results, one for patents, one for Wikipedia, we find the same system rankings on both datasets." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-136", "text": "In both cases, as standalone systems, DT and PSQ are very close and far better than any ranking approach, irrespective of the objective function or the choice of sparse or dense features." 
}, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-137", "text": "Model combination of similar models, e.g., DT and PSQ, gives minimal gains, compared to combining orthogonal models, e.g. DK and VW." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-138", "text": "The best result is achieved by combining DT and PSQ with DK and VW." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-139", "text": "This is due to the already high scores of the combined models, but also to the combination of yet other types of orthogonal information." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-140", "text": "Borda voting gives the best result under MAP which is probably due to the adjustment of the interpolation parameter for MAP on the development set." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-141", "text": "----------------------------------" }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-142", "text": "**CONCLUSION**" }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-143", "text": "Special domains such as patents or Wikipedia offer the possibility to extract cross-lingual relevance data from citation and link graphs." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-144", "text": "These data can be used to directly optimizing crosslingual ranking models." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-145", "text": "We showed on two different large-scale ranking scenarios that a supervised combination of orthogonal information sources such as domain-knowledge, translation knowledge, and ranking-specific word associations by far outperforms a pipeline of query translation and retrieval." }, { "sent_id": "e0e21b4e473ad6fde28378b2dc4f34-C001-146", "text": "We conjecture that if these types of information sources are available, a supervised ranking approach will yield superior results in other retrieval scenarios as well." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "e0e21b4e473ad6fde28378b2dc4f34-C001-35" ], [ "e0e21b4e473ad6fde28378b2dc4f34-C001-53" ], [ "e0e21b4e473ad6fde28378b2dc4f34-C001-68" ] ], "cite_sentences": [ "e0e21b4e473ad6fde28378b2dc4f34-C001-35", "e0e21b4e473ad6fde28378b2dc4f34-C001-53", "e0e21b4e473ad6fde28378b2dc4f34-C001-68" ] }, "@DIF@": { "gold_contexts": [ [ "e0e21b4e473ad6fde28378b2dc4f34-C001-37" ] ], "cite_sentences": [ "e0e21b4e473ad6fde28378b2dc4f34-C001-37" ] }, "@EXT@": { "gold_contexts": [ [ "e0e21b4e473ad6fde28378b2dc4f34-C001-37" ] ], "cite_sentences": [ "e0e21b4e473ad6fde28378b2dc4f34-C001-37" ] }, "@USE@": { "gold_contexts": [ [ "e0e21b4e473ad6fde28378b2dc4f34-C001-77" ], [ "e0e21b4e473ad6fde28378b2dc4f34-C001-101" ], [ "e0e21b4e473ad6fde28378b2dc4f34-C001-113" ] ], "cite_sentences": [ "e0e21b4e473ad6fde28378b2dc4f34-C001-77", "e0e21b4e473ad6fde28378b2dc4f34-C001-101", "e0e21b4e473ad6fde28378b2dc4f34-C001-113" ] } } }, "ABC_820fa732cc4cedf2d5d94b2afb90fc_25": { "x": [ { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-2", "text": "Multi-head attention layers, as used in the Transformer neural sequence model, are a powerful alternative to RNNs for moving information across and between sequences." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-3", "text": "While training these layers is generally fast and simple, due to parallelizability across the length of the sequence, incremental inference (where such paralleization is impossible) is often slow, due to the memory-bandwidth cost of repeatedly loading the large \"keys\" and \"values\" tensors." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-4", "text": "We propose a variant called multi-query attention, where the keys and values are shared across all of the different attention \"heads\", greatly reducing the size of these tensors and hence the memory bandwidth requirements of incremental decoding." 
}, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-5", "text": "We verify experimentally that the resulting models can indeed be much faster to decode, and incur only minor quality degradation from the baseline." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-6", "text": "Neural Attention, introduced by [Bahdanau et al., 2014] , is a powerful tool for manipulating variable-length representations." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-7", "text": "A neural attention function takes a single query-vector q and a set of m different (key-vector, value-vector) pairs (represented by the matrices K and V ), and produces an output vector y. The output y is computed as a weighted sum of the different value vectors, where the weights are derived by comparing the query to the keys." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-8", "text": "The following code describes a common formulation, where the weights are computed as the softmax of the dot-products of the query with the different keys." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-9", "text": "----------------------------------" }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-11", "text": "The Transformer neural sequence model [Vaswani et al., 2017] has emerged as a popular alternative to recurrent sequence models." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-12", "text": "Transformer relies on attention layers to communicate information between and across sequences." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-13", "text": "One major challenge with Transformer is the speed of incremental inference." 
}, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-14", "text": "As we will discuss, the speed of incremental Transformer inference on modern computing hardware is limited by the memory bandwidth necessary to reload the large \"keys\" and \"values\" tensors which encode the state of the attention layers." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-15", "text": "In the following sections, we will review the multi-head-attention layers used by Transformer, provide a performance analysis, and propose an architectural variation (multi-query attention) which greatly improves inference speed with only minor quality degradation." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-16", "text": "d e f Do tP r o ductAttentio n ( q , K, V ) :" }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-17", "text": "\" \" \" Dot\u2212Product A t t e n t i o n on one quer y ." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-18", "text": "Args : q : a v e c t o r with sha pe [ k ] K: a ma tr ix with sha pe [m, k ] V: a ma tr ix with sha pe [m, v ] Retur ns : y : a v e c t o r with sha pe [ v ] \" \" \" l o g i t s = t f ." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-19", "text": "einsum ( \" k , mk\u2212>m\" , q , K) w e i g h t s = t f ." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-20", "text": "so ftma x ( l o g i t s ) r e t u r n t f ." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-21", "text": "einsum ( \"m, mv\u2212>v \" , weig hts , V) Our code samples use einsum notation, as defined in TensorFlow and numpy, for generalized contractions between tensors of arbitrary dimension." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-22", "text": "In this notation, an equation names the dimensions of the input and output Tensors." 
}, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-23", "text": "The computation is numerically equivalent to broadcasting each input to have the union of all dimensions, multiplying component-wise, and summing across all dimensions not in the desired output shape." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-24", "text": "----------------------------------" }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-25", "text": "**MULTI-HEAD ATTENTION**" }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-26", "text": "The \"Transformer\" seuqence-to-sequence model [Vaswani et al., 2017] uses h different attention layers (heads) in parallel, which the authors refer to as \"Multi-head attention\"." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-27", "text": "The query vectors for the h different layers are derived from h different learned linear projections P q of an input vector x. Similarly, the keys and values are derived from h different learned linear projections P k , P v of a collection M of m different input vectors." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-28", "text": "The outputs of the h layers are themselves passed through different learned linear projections P o , then summed." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-29", "text": "For simplicity, we give the input and output vectors identical dimensionality d. The The computation can be expressed as follows: d e f M u l t i h e a d A t t e n t i o n (" }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-30", "text": "x , M, P_q, P_k, P_v, P_o ) : \" \" \" Multi\u2212head A t t e n t i o n on one quer y ." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-31", "text": "Args :" }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-32", "text": "x : a v e c t o r with sha pe y : a v e c t o r with sha pe [ d ] \" \" \" q = t f ." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-33", "text": "einsum ( \" d , hdk\u2212>hk \" , x , P_q) K = t f ." 
}, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-34", "text": "einsum ( \"md, hdk\u2212>hmk\" , M, P_k) V = t f ." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-35", "text": "einsum ( \"md, hdv\u2212>hmv\" , M, P_v) l o g i t s = t f ." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-36", "text": "einsum ( \" hk , hmk\u2212>hm\" , q , K) w e i g h t s = t f ." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-37", "text": "so ftma x ( l o g i t s ) o = t f ." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-38", "text": "einsum ( \"hm, hmv\u2212>hv \" , weig hts , V) y = t f ." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-39", "text": "einsum ( \" hv , hdv\u2212>d \" , o , P_o) r e t u r n y Note: [Vaswani et al., 2017] include a constant scaling factor on the logits." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-40", "text": "We omit this in our code, as it can be folded into the linear projections P q or P k ." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-41", "text": "----------------------------------" }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-42", "text": "**MULTI-HEAD ATTENTION (BATCHED)**" }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-43", "text": "In practice, it is far more efficient to batch together multiple queries." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-44", "text": "The code below adds two types of batching." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-45", "text": "First, we generate queries from n different positions in a sequence." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-46", "text": "These queries all interact with the same keys and values." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-47", "text": "In addition, we process a batch of b different non-interacting sequences at once." 
}, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-48", "text": "Following [Vaswani et al., 2017] , in an autoregressive model, we can prevent backward-information-flow by adding a \"mask\" to the logits containing the value \u2212\u221e in the illegal positions." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-49", "text": "----------------------------------" }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-50", "text": "**PERFORMANCE ANALYSIS OF BATCHED MULTI-HEAD ATTENTION**" }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-51", "text": "To simplify the performance analysis, we will make several simplifying assumptions:" }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-52", "text": "h , as suggested by [Vaswani et al., 2017]" }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-53", "text": "The total number of arithmetic operations is \u0398(bnd 2 )." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-54", "text": "(Since the complexity of each of the tf.einsum operations above is O(bnd 2 ) given the simplifying assumptions." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-55", "text": "The total size of memory to be accessed is equal to the sum of the sizes of all the tensors involved: O(bnd + bhn 2 + d 2 )." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-56", "text": "The first term is due to X, M , Q, K, V , O and Y , the second term due to the logits and weights, and the third term due to the projection tensors P q , P k , P v and P o ." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-57", "text": "Dividing the two, we find that the ratio of memory access to arithmetic operations is O( 1 k + 1 bn )." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-58", "text": "This low ratio is necessary for good performance on modern GPU/TPU hardware, where the computational capacity can be two orders of magnitude higher than the memory bandwidth." 
}, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-59", "text": "----------------------------------" }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-60", "text": "**MULTIHEAD ATTENTION (INCREMENTAL)**" }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-61", "text": "In some settings, data dependencies make it is impossible to process queries from multiple positions in parallel." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-62", "text": "An example is a self-attention layer in an autoregressive language model such as Transformer [Vaswani et al., 2017] ." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-63", "text": "The queries produced at each position attend to key-value pairs produced at all positions up to and including that position." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-64", "text": "During training, the ground-truth target sequence is known, and we can use an efficient parallel implementation similar to that in section 2.3." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-65", "text": "However, when generating from the trained model, the output of the self-attention layer at a particular position affects the token that is generated at the next position, which in turn affects the input to that layer at the next position." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-66", "text": "This prevents parallel computation." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-67", "text": "Code for incrementally computing this self-attention layer is shown below." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-68", "text": "d e f M u l t i h e a d S e l f A t t e n t i o n I n c r e m e n t a l (" }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-69", "text": "x , prev_K , prev_V , P_q , P_k , P_v , P_o ) : \" \" \" Multi\u2212head S e l f \u2212A t t e n t i o n ( one s t e p ) ." 
}, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-70", "text": "Args :" }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-71", "text": "x : a t e n s o r with sha pe [ b , d ]" }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-72", "text": ", a x i s =2) l o g i t s = t f ." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-73", "text": "einsum ( \" bhk , bhmk\u2212>bhm\" , q , new_K) w e i g h t s = t f ." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-74", "text": "so ftma x ( l o g i t s ) o = t f ." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-75", "text": "einsum ( \"bhm, bhmv\u2212>bhv \" , weig hts , new_V) y = t f ." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-76", "text": "einsum ( \" bhv , hdv\u2212>bd \" , O, P_o) r e t u r n y , new_K, new_V" }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-77", "text": "----------------------------------" }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-78", "text": "**PERFORMANCE ANALYSIS**" }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-79", "text": "We make the same simplifying assumptions as in section 2.3.1." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-80", "text": "Across n calls, the total number of arithmetic operations is again \u0398(bnd 2 )." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-81", "text": "Across n calls, the total amount of memory access is \u0398(bn 2 d + nd 2 ), the first term due to K and V and the second term due to P q , P k , P v and P o ." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-82", "text": "Dividing the memory by the computations, we find that the ratio of memory access to arithmetic operations is \u0398( n d + 1 b )." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-83", "text": "When n \u2248 d or b \u2248 1, the ratio is close to 1, causing memory bandwidth to be a major performance bottleneck on modern computing hardware." 
}, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-84", "text": "In order to make incremental generation efficient, we must reduce both of these terms to be \u226a 1." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-85", "text": "The 1 b term is the easier one -we can just use a larger batch size, memory size permitting." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-86", "text": "Reducing the n d term is harder." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-87", "text": "This term is related to the expense of reloading at each step the K and V tensors representing the memory which have size bhmk = bn 2 ." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-88", "text": "One solution is to limit the sequence length n. Another is to reduce the number of positions being attended-to, either by attending to a local neighborhood, or by otherwise compressing the number of memory positions, as in [Liu et al., 2018] , [Zhang et al., 2018] , [Povey et al., 2018] ." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-89", "text": "In this paper we present an orthogonal approach to reducing the size of the K and V tensors -namely removing their \"heads\" dimension, while maintaining the \"heads\" dimension in the queries." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-90", "text": "----------------------------------" }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-91", "text": "**MULTI-QUERY ATTENTION**" }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-92", "text": "We introduce multi-query Attention as a variation of multi-head attention as described in [Vaswani et al., 2017] ." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-93", "text": "Multi-head attention consists of multiple attention layers (heads) in parallel with different linear transformations on the queries, keys, values and outputs." 
}, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-94", "text": "Multi-query attention is identical except that the different heads share a single set of keys and values." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-95", "text": "The code for (incremental) multi-query (self) attention is identical to the code listed above for multi-head attention, except that we remove the letter \"h\" from the tf.einsum equations where it represents the \"heads\" dimension of K, V , P k , or P v ." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-96", "text": "----------------------------------" }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-97", "text": "**PERFORMANCE ANALYSIS FOR INCREMENTAL MULTI-QUERY ATTENTION**" }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-98", "text": "We make the same simplifying assumptions as in section 2.3.1." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-99", "text": "Across n calls, the total number of arithmetic operations is again \u0398(bnd 2 )." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-100", "text": "Across n calls, the total amount of memory access is \u0398(bnd + bn 2 k + nd 2 ), the first term due to x, q, o and y, the second term due to K and V and the third term due to P q , P k , P v , P o ." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-101", "text": "Dividing the memory by the computations, we find that the ratio of memory access to arithmetic operations is \u0398( 1 d + n dh + 1 b )." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-102", "text": "We have reduced the offensive n d by a factor of h. Theoretically, given large batch size b, this should dramatically improve performance of incremental generation." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-103", "text": "In our experimental section, we will show that the performance gains are real and that model quality remains high." 
}, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-104", "text": "----------------------------------" }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-105", "text": "**EXPERIMENTS AND RESULTS**" }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-106", "text": "----------------------------------" }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-107", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-108", "text": "Following [Vaswani et al., 2017] , we evaluate on the WMT 2014 English-German translation task." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-109", "text": "As a baseline, we use an encoder-decoder Transformer model with 6 layers, using d model = 1024 d f f = 4096, h = 8, d k = d v = 128, learned positional embeddings, and weight-sharing between the token-embedding and output layers." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-110", "text": "The baseline model and all variations have 211 million parameters." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-111", "text": "All models were trained for 100,000 steps ( 20 epochs)." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-112", "text": "Each training batch consisted of 128 examples, each of which consisted of a 256-token input sequence and a 256-token target sequence (multiple training sentences were concatenated together to reach this length)." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-113", "text": "Models were trained on a 32-core TPUv3 cluster, with each model taking about 2 hours to train." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-114", "text": "We used an implementation from the tensor2tensor and mesh-tensorflow libraries." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-115", "text": "The configurations used can be found at [to be added before publication] , including details about learning rates, dropout, label smoothing, etc." 
}, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-116", "text": "In our \"multi-query\" model, we replace all of the attention layers in the model to multi-query attention." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-117", "text": "This includes the encoder-self-attention, decoder-self-attention and encoder-decoder-attention layers." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-118", "text": "We widen the feed-forward hidden layers from 4096 to 5440 to make the total parameter-count equal to that of the baseline." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-119", "text": "To demonstrate that local-attention and multi-query attention are orthogonal, we also trained \"local\" versions of the baseline and multi-query models, where the decoder-self-attention layers (but not the other attention layers) restrict attention to the current position and the previous 31 positions." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-120", "text": "A simpler alternative way to reduce the sizes of K and V is to reduce the number of heads h and/or to reduce the dimensionalities k and v of the keys and values." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-121", "text": "We trained several such models for comparison, again widening the feed-forward hidden layers to make the total parameter-count equal to that of the baseline." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-122", "text": "We preformed a similar set of experiments using \"transformer-decoder\" language models on the Billion-Word Language Modeling Benchmark [Chelba et al., 2013] ." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-123", "text": "For the baseline, we use a model with 6 layers," }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-124", "text": "The total parameter count is 192 million for the baseline and for all variations." 
}, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-125", "text": "We trained for 136K steps (10 epochs) at a batch size of 64K tokens." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-126", "text": "Again, we used a 32-core TPUv3 cluster for approximately 3 hours to train each model." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-127", "text": "Table 1 shows results for the machine-translation experiments." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-128", "text": "We decoded the dev set using greedy maximum-likelihood decoding and computed BLEU score with sacrebleu \"sacrebleu -t wmt13 -l en-de -tok intl\"." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-129", "text": "We also list per-subword-token perplexity on the dev set." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-130", "text": "According to both of these metrics, the multi-query attention model seems to be slightly worse than the baseline, but much closer than any of the alternatives involving decreasing h, d k and d v ." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-131", "text": "----------------------------------" }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-132", "text": "**MODEL QUALITY**" }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-133", "text": "We validated the results by decoding the test set using both greedy decoding and beam search (beam 4, \u03b1 = 0.6), and evaluated with sacrebleu \"sacrebleu -t wmt14 -l en-de -tok intl\"." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-134", "text": "Again, the multiquery model performed similarly to the baseline, and actually had the highest BLEU score (28.5) with beam-4 decoding." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-135", "text": "Table 3 shows results for the billion-word language modeling benchmark." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-136", "text": "Models were evaluated by perword (not per-subword-token) perplexity on the dev set." 
}, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-137", "text": "The results paint a similar picture to the translation results." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-138", "text": "The multi-query attention model was slightly worse than the baseline, but significantly better than any of the alternatives involving decreasing h, d k and d v ." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-139", "text": "Table 2 shows training and inference times for the various models." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-140", "text": "Both training and inference speeds were evaluated on one TPUv2 (8 cores)." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-141", "text": "A training step (consisting of 32,768 input tokens and 32,768 target tokens, as described above) took 433ms for the base model and 425ms for the multi-query model." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-142", "text": "Dividing by 32,768, we find that the training time is 13.2\u00b5s per (input-token + target-token), as listed in Table 2 ." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-143", "text": "----------------------------------" }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-144", "text": "**SPEED**" }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-145", "text": "We ran incremental greedy inference on a batch of 1024 sequences (128 per core) using a source-sequence length of 128 tokens and a target sequence length of 128." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-146", "text": "1 For the baseline model, the encoder part of the model took 222ms and each incremental step of the decoder took 47ms." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-147", "text": "Dividing by the respective numbers of tokens, we find that the amortized inference time is 1.7\u00b5s per token for the encoder and a much larger 46\u00b5s per token for the decoder, as listed in Table 2 ." 
}, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-148", "text": "For the multi-query model, the encoder took 195ms and the decoder took 3.9ms per step, for amortized per-token costs of 1.5\u00b5s and 3.8\u00b5s respectively." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-149", "text": "Table 2 shows these values as well as similar results for beam-search." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-150", "text": "1 Due to system limitations requiring fixed shapes, we used padding and masking in our decoder-self-attention implementation." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-151", "text": "The memory tensors were thus padded to the maximum length (128), or to the window-size (32) in the case of local attention." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-152", "text": "Each decoding step thus took the same amount of time." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-153", "text": "An alternative implementation of incrementally growing the tensors could save time near the beginning of the sequence." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-154", "text": "----------------------------------" }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-155", "text": "**CONCLUSION**" }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-156", "text": "We have proposed multi-query attention -an alternative to multi-head attention with much lower memorybandwidth requirements in the incremental setting." }, { "sent_id": "820fa732cc4cedf2d5d94b2afb90fc-C001-157", "text": "We believe that this enables wider adoption of attentionbased sequence models in inference-performance-critical applications." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "820fa732cc4cedf2d5d94b2afb90fc-C001-11" ], [ "820fa732cc4cedf2d5d94b2afb90fc-C001-26" ] ], "cite_sentences": [ "820fa732cc4cedf2d5d94b2afb90fc-C001-11", "820fa732cc4cedf2d5d94b2afb90fc-C001-26" ] }, "@DIF@": { "gold_contexts": [ [ "820fa732cc4cedf2d5d94b2afb90fc-C001-39", "820fa732cc4cedf2d5d94b2afb90fc-C001-40" ] ], "cite_sentences": [ "820fa732cc4cedf2d5d94b2afb90fc-C001-39" ] }, "@EXT@": { "gold_contexts": [ [ "820fa732cc4cedf2d5d94b2afb90fc-C001-39", "820fa732cc4cedf2d5d94b2afb90fc-C001-40" ] ], "cite_sentences": [ "820fa732cc4cedf2d5d94b2afb90fc-C001-39" ] }, "@USE@": { "gold_contexts": [ [ "820fa732cc4cedf2d5d94b2afb90fc-C001-48" ], [ "820fa732cc4cedf2d5d94b2afb90fc-C001-51", "820fa732cc4cedf2d5d94b2afb90fc-C001-52" ], [ "820fa732cc4cedf2d5d94b2afb90fc-C001-92" ], [ "820fa732cc4cedf2d5d94b2afb90fc-C001-108" ] ], "cite_sentences": [ "820fa732cc4cedf2d5d94b2afb90fc-C001-48", "820fa732cc4cedf2d5d94b2afb90fc-C001-52", "820fa732cc4cedf2d5d94b2afb90fc-C001-92", "820fa732cc4cedf2d5d94b2afb90fc-C001-108" ] }, "@SIM@": { "gold_contexts": [ [ "820fa732cc4cedf2d5d94b2afb90fc-C001-51", "820fa732cc4cedf2d5d94b2afb90fc-C001-52" ], [ "820fa732cc4cedf2d5d94b2afb90fc-C001-62" ], [ "820fa732cc4cedf2d5d94b2afb90fc-C001-92" ], [ "820fa732cc4cedf2d5d94b2afb90fc-C001-108" ] ], "cite_sentences": [ "820fa732cc4cedf2d5d94b2afb90fc-C001-52", "820fa732cc4cedf2d5d94b2afb90fc-C001-62", "820fa732cc4cedf2d5d94b2afb90fc-C001-92", "820fa732cc4cedf2d5d94b2afb90fc-C001-108" ] } } }, "ABC_d4563562cd0dfd8ef6cdb57117fb22_25": { "x": [ { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-67", "text": "For an example story, see Figure 1 ." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-117", "text": "None of the parameter settings are able to ouperform the baseline on recall at 50, though both PMI models tie the baseline." 
}, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-2", "text": "The automatic induction of scripts (Schank and Abelson, 1977) has been the focus of many recent works." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-3", "text": "In this paper, we employ a variety of these methods to learn Schank and Abelson's canonical restaurant script, using a novel dataset of restaurant narratives we have compiled from a website called \"Dinners from Hell.\" Our models learn narrative chains, script-like structures that we evaluate with the \"narrative cloze\" task (Chambers and Jurafsky, 2008) ." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-4", "text": "----------------------------------" }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-5", "text": "**INTRODUCTION**" }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-6", "text": "A well-known theory from the intersection of psychology and artificial intelligence posits that humans organize certain kinds of general knowledge in the form of scripts, or common sequences of events (Schank and Abelson, 1977) ." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-7", "text": "Though many early AI systems employed hand-encoded scripts, more recent work has attempted to induce scripts with automatic and scalable techniques." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-8", "text": "In particular, several related techniques approach the problem of script induction as one of learning narrative chains from text corpora (Chambers and Jurafsky, 2008; Chambers and Jurafsky, 2009; Jans et al., 2012; Pichotta and Mooney, 2014) ." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-9", "text": "These statistical approaches have focused on open-domain script acquisition, in which a large number of scripts may be learned, but the acquisition of any particular set of scripts is not guaranteed." 
}, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-10", "text": "For many specialized applications, however, knowledge of a few relevant scripts may be more useful than knowledge of many irrelevant scripts." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-11", "text": "With this scenario in mind, we attempt to learn the famous \"restaurant script\" (Schank and Abelson, 1977) by applying the aforementioned narrative chain learning methods to a specialized corpus of dinner narratives we compile from the website \"Dinners from Hell.\" Our results suggest that applying these techniques to a domain-specific dataset may be reasonable way to learn domain-specific scripts." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-12", "text": "----------------------------------" }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-13", "text": "**BACKGROUND**" }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-14", "text": "Previous work in the automatic induction of scripts or script-like structures has taken a number of different approaches." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-15", "text": "Regneri et al. (2010) attempt to learn the structure of specific scripts by eliciting event sequence descriptions (ESDs) from humans to which they apply multiple sequence alignment (MSA) to yield one global structure per script." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-16", "text": "(Orr et al. (2014) learn similar structures in a probabilistic framework with Hidden Markov Models.) Although Regneri et al. (2010) , like us, are concerned with learning pre-specified scripts, our approach is different in that we apply unsupervised techniques to scenario-specific collections of natural, pre-existing texts." 
}, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-17", "text": "Note that while the applicability of our approach to script learning may appear limited to domains for which a corpus conveniently already exists, previous work demonstrates the feasibility of assembling such a corpus by automatically retrieving relevant documents from a larger collection." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-18", "text": "For example, Chambers and Jurafsky (2011) use information retrieval techniques to gather a small number of bombing-related documents from the Gigaword corpus, which they successfully use to learn a MUCstyle (Sundheim, 1991) information extraction tem-plate for bombing events." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-19", "text": "Following the work of Church and Hanks (1990) in learning word associations via mutual information, and the DIRT system introduced by Lin and Pantel (2001) , Chambers and Jurafsky (2008) propose a PMI-based system for learning script-like structures called narrative chains." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-20", "text": "Several followup papers introduce variations and improvements on this original model for learning narrative chains (Chambers and Jurafsky, 2009; Jans et al., 2012; Pichotta and Mooney, 2014) ." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-21", "text": "It is from this body of work that we borrow techniques to apply to the Dinners from Hell dataset." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-22", "text": "As defined by Chambers and Jurafsky (2008) , a narrative chain is \"a partially ordered set of narrative events that share a common actor,\" where a narrative event is \"a tuple of an event (most simply a verb) and its participants, represented as typed dependencies.\" To learn narrative chains from text, Chambers and Jurafsky extract chains of narrative events linked by a common coreferent within a document." 
}, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-23", "text": "For example, the sentence \"John drove to the store where he bought some ice cream.\" would generate two narrative events corresponding to the protagonist John: (DRIVE, nsubj) followed by (BUY, nsubj)." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-24", "text": "Over these extracted chains of narrative events, pointwise mutual information (PMI) is computed between all pairs of events." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-25", "text": "These PMI scores are then used to predict missing events from such chains, i.e. the narrative cloze evaluation." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-26", "text": "Jans et al. (2012) expand on this approach, introducing an ordered PMI model, a bigram probability model, skip n-gram counting methods, coreference chain selection, and an alternative scoring metric (recall at 50)." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-27", "text": "Their bigram probability model outperforms the original PMI model on the narrative cloze task under many conditions." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-28", "text": "Pichotta and Mooney (2014) introduce an extended notion of narrative event that includes information about subjects and objects." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-29", "text": "They also introduce a competitive \"unigram model\" as a baseline for the narrative cloze task." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-30", "text": "To learn the restaurant script from our dataset, we implement the models of Chambers and Jurafsky (2008) and Jans et al. (2012) , as well as the unigram baseline of Pichotta and Mooney (2014) ." 
}, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-31", "text": "To evaluate our success in learning the restaurant script, we perform a modified version of the narrative cloze task, predicting only verbs that we annotate as \"restaurant script-relevant\" and comparing the performance of each model." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-32", "text": "Note that these annotations are not used for training." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-33", "text": "----------------------------------" }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-34", "text": "**METHODS**" }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-35", "text": "This section provides an overview of each of the different methods and parameter settings we employ to learn narrative chains from the Dinners from Hell corpus, starting with the original model (Chambers and Jurafsky, 2008) and extending to the modifications of Jans et al. (2012) ." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-36", "text": "As part of this work, we are releasing a program called NaChos, our integrated Python implementation of each of the methods for learning narrative chains described in this section." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-37", "text": "1" }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-38", "text": "----------------------------------" }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-39", "text": "**COUNTING METHODS FOR PMI**" }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-40", "text": "Formally, a narrative event, e := (v, d), is a verb, v, paired with a typed dependency (De Marneffe et al., 2006) , d, defining the role a \"protagonist\" (coreference mention) plays in an event (verb)." 
}, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-41", "text": "The main computational component of learning narrative chains in Chambers and Jurafsky's model is to learn the pointwise mutual information for any pair of narrative events:" }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-42", "text": "pmi(e 1 , e 2 ) := log C(e 1 , e 2 ) C(e 1 , * )C( * , e 2 )" }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-43", "text": "( 1) where C(e 1 , e 2 ) is the number of times that narrative events e 1 and e 2 \"co-occur\" and" }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-44", "text": "Chambers and Jurafsky define C(e 1 , e 2 ) as \"the number of times the two events e 1 and e 2 had a coreferring entity filling the values of the dependencies d 1 and d 2 .\" This is a symmetric value with respect to e 1 and e 2 ." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-45", "text": "We implement the following counting variants:" }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-46", "text": "Skip N-gram By default, C(e 1 , e 2 ) is incremented if e 1 and e 2 occur anywhere within the same chain of events derived from a single coreference chain (skip-all); we also implement an option to restrict the distance between e 1 and e 2 to 0 though 5 intervening events (skip-0 through skip-5)." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-47", "text": "(Jans et al., 2012) Coreference Chain Length The original model counts co-occurrences in all coreference chains; we include Jans et al. (2012) 's option to count over only the longest chains in each document, or to count only over chains of length 5 or greater (long)." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-48", "text": "Count Threshold Because PMI favors low-count events, we add an option to set C(e 1 , e 2 ) to zero for any e 1 , e 2 for which C(e 1 , e 2 ) is below some threshold, T , up to 5." 
}, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-49", "text": "----------------------------------" }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-50", "text": "**PREDICTIVE MODELS FOR NARRATIVE CLOZE**" }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-51", "text": "In order to perform the narrative cloze task, we need a model for predicting the missing narrative event, e, from a chain of observed narrative events, e 1 . . ." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-52", "text": "e n , at insertion point k. The original model, proposed by Chambers and Jurafsky (2008) , predicts the event that maximizes unordered pmi," }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-53", "text": "where V is the set of all observed events (the vocabulary) and C(e 1 , e 2 ) is symmetric." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-54", "text": "Two additional models are introduced by Jans et al. (2012) and we use them here, as well." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-55", "text": "First, the ordered pmi model," }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-56", "text": "pmi(e, e i ) (4) where C(e 1 , e 2 ) is asymmetric, i.e., C(e 1 , e 2 ) counts only cases in which e 1 occurs before e 2 ." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-57", "text": "Second, the bigram probability model:" }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-58", "text": "where p(e 2 |e 1 ) = C(e 1 ,e 2 )" }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-59", "text": "C(e 1 , * ) and C(e 1 , e 2 ) is asymmetric." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-60", "text": "Discounting For each model, we add an option for discounting the computed scores." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-61", "text": "In the case of the two PMI-based models, we use the discount score described in Pantel and Ravichandran (2004) and used by Chambers and Jurafsky (2008) ." 
}, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-62", "text": "For the bigram probability model, this PMI discount score would be inappropriate, so we instead use absolute discounting." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-63", "text": "Document Threshold We include a document threshold parameter, D, that ensures that, in any narrative cloze test, any event e that was observed during training in fewer than D distinct documents will receive a worse score (i.e. be ranked behind) any event e whose count meets the document threshold." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-64", "text": "----------------------------------" }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-65", "text": "**DATASET: DINNERS FROM HELL**" }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-66", "text": "The source of our data for this experiment is a blog called \"Dinners From Hell\" 2 where readers submit stories about their terrible restaurant experiences." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-68", "text": "To process the raw data, we stripped all HTML and other non-story content from each file and processed the remaining text with the Stanford CoreNLP pipeline version 3.3.1 (Manning et al., 2014) ." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-69", "text": "Of the 237 stories obtained, we manually filtered out 94 stories that were \"off-topic\" (e.g., letters to the webmaster, dinners not at restaurants), leaving a total of 143 stories." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-70", "text": "The average story length is 352 words." 
}, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-71", "text": "----------------------------------" }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-72", "text": "**ANNOTATION**" }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-73", "text": "For the purposes of evaluation only, we hired four undergraduates to annotate every non-copular verb in each story as either corresponding to an event \"related to the experience of eating in a restaurant\" (e.g., ordered a steak), \"unrelated to the experience of eating in a restaurant\" (e.g., answered the phone), or uncertain." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-74", "text": "We used the WebAnno platform for annotation (Yimam et al., 2013) ." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-75", "text": "A total of 8,202 verb (tokens) were annotated, each by three annotators." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-76", "text": "70.3% of verbs annotated achieved 3-way agreement; 99.4% had at least 2-way agreement." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-77", "text": "After merging the annotations (simple majority vote), 30.7% of verbs were labeled as restaurant-script-related, 68.6% were labeled as restaurant-script-unrelated, and the remaining 0.7% as uncertain." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-78", "text": "Corresponding to the 8,202 annotated verb tokens, there are 1,481 narrative events at the type level." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-79", "text": "580 of these narrative event types were annotated as script-relevant in at least one token instance." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-80", "text": "2 www.dinnersfromhell.com 207 \"A long time ago when I was still in college, my family decided to take me out for pizza on my birthday." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-81", "text": "We decided to try the new location for a favorite pizza chain of ours." 
}, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-82", "text": "It was all adults and there were about 8 of us, so we ordered 3 large pizzas." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-83", "text": "We got to chatting and soon realized that the pizzas should've been ready quite a bit ago, so we called the waitress over and she went to check on our pizzas." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-84", "text": "She did not come back." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-85", "text": "We waited about another 10 minutes, then called over another waitress, who went to check on our pizzas and waitress." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-86", "text": "It now been over an hour." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-87", "text": "About 10 minutes later, my Dad goes up to the check-out and asks the girl there to send the manager to our table." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-88", "text": "A few minutes later the manager comes out." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-89", "text": "He explains to us that our pizzas got stuck in the oven and burned." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-90", "text": "They were out of large pizza dough bread, so they were making us 6 medium pizzas for the price of 3 large pizzas." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-91", "text": "We had so many [pizzas] on our table we barely had [room] to eat! Luckily my family is pretty easy going so we just laughed about the whole thing." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-92", "text": "We did tell the manager that it would have been nice if someone, anyone, had said something earlier to us, instead of just disappearing, and he agreed." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-93", "text": "He even said it was his responsibility, but that he had been busy trying to fix what caused the pizzas to jam up in the oven." 
}, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-94", "text": "He went so far as to give us 1/2 off our bill, which was really nice." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-95", "text": "It was definitely a memorable birthday!\"" }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-96", "text": "----------------------------------" }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-97", "text": "**EVALUATION**" }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-98", "text": "----------------------------------" }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-99", "text": "**NARRATIVE CLOZE**" }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-100", "text": "We evaluate the various models on the narrative cloze task." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-101", "text": "What is different about our version of the narrative cloze task here is that we limit the cloze tests to only \"interesting\" events, i.e., those that have been identified as relevant to the restaurant script by our annotators (see Section 4.1)." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-102", "text": "Because our dataset is small (143 documents), we perform leave-one-out testing at the document level, training on 133 folds total." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-103", "text": "(Ten documents are excluded for a development set.) For each fold of training, we extract all of the narrative chains (mapped directly from coreference chains) in the held out test document." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-104", "text": "For each test chain, we generate one narrative cloze test per \"script-relevant\" event in that chain." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-105", "text": "For example, if a chain contains ten events, three of which are \"script-relevant,\" then three cloze tests will be generated, each containing nine \"observed\" events." 
}, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-106", "text": "Chains with fewer than two events are excluded." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-107", "text": "In this way, we generate a total of 2,273 cloze tests." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-108", "text": "Scoring We employ three different scoring metrics: average rank (Chambers and Jurafsky, 2008) , mean reciprocal rank, and recall at 50 (Jans et al., 2012) ." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-109", "text": "Baseline The baseline we use for the narrative cloze task is to rank events by frequency." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-110", "text": "This is the \"unigram model\" employed by Pichotta and Mooney (2014) , a competitive baseline on this task." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-111", "text": "For each model and scoring metric, we perform a complete grid search over all possible parameter settings to find the best-scoring combination on a cloze tests from a set-aside development set of ten documents." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-112", "text": "The parameter space is defined as the Cartesian product of each of the following possible parameter values: skip-n (all,0-5), coreference chain length (all, long, longest), count threshold (T=1-5), document threshold (D=1-5), and discounting (yes/no)." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-113", "text": "Bigram probability with and without discounting are treated as two separate models." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-114", "text": "Figure 2 reports the results of the narrative cloze evalutation." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-115", "text": "Each of the four models (unordered pmi, ordered pmi, bigram, and bigram with discounting) outperform the baseline on the average rank metric when the parameters are optimized for that metric." 
}, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-116", "text": "Both bigram models beat the baseline on mean reciprocal rank not only for MRR-optimized parameter settings, but for the average-rank-and recall-at-50-optimized settings." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-118", "text": "Overall, the model that performs the best is the bigram probability model with discounting (row 12 of Figure 2 ) which has the following parameter settings: skip-all, coref-all, T=1, and D=5." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-119", "text": "The fact that several model settings outperform an informed baseline on average rank and mean reciprocal rank indicates that these methods may in general be applicable to smaller, domain-specific corpora." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-120", "text": "Furthermore, it is apparent from the results that the bigram probability models perform better overall than PMI-based models, a finding also reported in Jans et al. (2012) ." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-121", "text": "This replication is futher evidence that these methods do in fact transfer." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-122", "text": "----------------------------------" }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-123", "text": "**QUALITATIVE EXAMPLE**" }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-124", "text": "To get a qualitative sense of the narrative events these models are learning to associate from this data, we use the conditional probabilities learned in the bigram model (Fig 2, row 12 ) to select the highest probability narrative chain of length three out of the 12 possible events in the \"we\" coreference chain in Figure 1 (bolded) ." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-125", "text": "The three events selected are boxed and highlighted in blue." 
}, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-126", "text": "The bigram model selects the \"deciding\" event (selecting restaurant) and the \"having\" event (having pizza), both reasonable components of the restaurant script." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-127", "text": "The third event selected is \"having room,\" which is not part of the restaurant script." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-128", "text": "This mistake illustrates a weakness of the narrative chains model; without considering the verb's object, the model is unable to distinguish \"have pizza\" from \"have room.\" Incorporating object information in future experiments, as in Pichotta and Mooney (2014), might resolve this issue, although it could introduce sparsity problems." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-129", "text": "----------------------------------" }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-130", "text": "**CONCLUSION**" }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-131", "text": "In this work, we describe the collection and annotation of a corpus of natural descriptions of restaurant visits from the website \"Dinners from Hell.\" We use this dataset in an attempt to learn the restaurant script, using a variety of related methods for learning narrative chains and evaluating on the narrative cloze task." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-132", "text": "Our results suggest that it may be possible in general to use these methods on domainspecific corpora in order to learn particular scripts from a pre-specified domain, although further experiments in other domains would help bolster this conclusion." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-133", "text": "In principle, a domain-specific corpus need not come from a website like Dinners from Hell; it could instead be sub-sampled from a larger corpus, retrieved from the web, or directly elicited." 
}, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-134", "text": "Our domain-specific approach to script learning is potentially useful for specialized NLP applications that require knowledge of only a particular set of scripts." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-135", "text": "One feature of the Dinners from Hell corpus that bears further inspection in future work is the fact that its stories contain many violations of the restaurant script." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-136", "text": "A question to investigate is whether these violations impact how the restaurant script is learned." }, { "sent_id": "d4563562cd0dfd8ef6cdb57117fb22-C001-137", "text": "Other avenues for future work include incorporating object information into event representations and applying domain adaptation techniques in order to leverage larger general-domain corpora." } ], "y": { "@USE@": { "gold_contexts": [ [ "d4563562cd0dfd8ef6cdb57117fb22-C001-3" ], [ "d4563562cd0dfd8ef6cdb57117fb22-C001-30" ], [ "d4563562cd0dfd8ef6cdb57117fb22-C001-61" ], [ "d4563562cd0dfd8ef6cdb57117fb22-C001-108" ] ], "cite_sentences": [ "d4563562cd0dfd8ef6cdb57117fb22-C001-3", "d4563562cd0dfd8ef6cdb57117fb22-C001-30", "d4563562cd0dfd8ef6cdb57117fb22-C001-61", "d4563562cd0dfd8ef6cdb57117fb22-C001-108" ] }, "@BACK@": { "gold_contexts": [ [ "d4563562cd0dfd8ef6cdb57117fb22-C001-8" ], [ "d4563562cd0dfd8ef6cdb57117fb22-C001-19" ], [ "d4563562cd0dfd8ef6cdb57117fb22-C001-22" ], [ "d4563562cd0dfd8ef6cdb57117fb22-C001-52" ] ], "cite_sentences": [ "d4563562cd0dfd8ef6cdb57117fb22-C001-8", "d4563562cd0dfd8ef6cdb57117fb22-C001-19", "d4563562cd0dfd8ef6cdb57117fb22-C001-22", "d4563562cd0dfd8ef6cdb57117fb22-C001-52" ] }, "@SIM@": { "gold_contexts": [ [ "d4563562cd0dfd8ef6cdb57117fb22-C001-30" ], [ "d4563562cd0dfd8ef6cdb57117fb22-C001-61" ], [ "d4563562cd0dfd8ef6cdb57117fb22-C001-108" ] ], "cite_sentences": [ "d4563562cd0dfd8ef6cdb57117fb22-C001-30", "d4563562cd0dfd8ef6cdb57117fb22-C001-61", 
"d4563562cd0dfd8ef6cdb57117fb22-C001-108" ] }, "@DIF@": { "gold_contexts": [ [ "d4563562cd0dfd8ef6cdb57117fb22-C001-35" ] ], "cite_sentences": [ "d4563562cd0dfd8ef6cdb57117fb22-C001-35" ] }, "@EXT@": { "gold_contexts": [ [ "d4563562cd0dfd8ef6cdb57117fb22-C001-35" ] ], "cite_sentences": [ "d4563562cd0dfd8ef6cdb57117fb22-C001-35" ] } } }, "ABC_0706cab049274ffc82c5e2ef6f7b99_25": { "x": [ { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-74", "text": "There is not a single case in which they agree on Okay's being used as an acknowledgment." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-98", "text": "**CHANCE-CORRECTED AGREEMENT: ASSUMED EQUAL CODER CATEGORY DISTRIBUTION**" }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-71", "text": "Instead they use percentage agreement to arrive at this conclusion." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-72", "text": "By examining the data, it is clear that this conclusion would be false." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-2", "text": "Since Jean Carletta (1996) exposed computational linguists to the desirability of using chance-corrected agreement statistics to infer the reliability of data generated by applying coding schemes, there has been a general acceptance of their use within the field." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-3", "text": "However, there are prevailing misunderstandings concerning agreement statistics and the meaning of reliability." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-4", "text": "Investigation of new dialogue types and genres has been shown to reveal new phenomena in dialogue that are ill suited to annotation by current methods and also new annotation schemes that are qualitatively different from those commonly used in dialogue analysis." 
}, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-5", "text": "Previously prescribed practices for evaluating coding schemes become less applicable as annotation schemes become more sophisticated." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-6", "text": "To compensate, we need a greater understanding of reliability statistics and how they should be interpreted." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-7", "text": "In this article we discuss the purpose of reliability testing, address certain misunderstandings, and make recommendations regarding the way in which coding schemes should be evaluated." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-8", "text": "After developing schemes for annotating discourse or dialogue, it is necessary to assess their suitability for the purpose for which they are designed." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-9", "text": "Although no statistical test can determine whether any form of annotation is worthwhile or how applications will benefit from it, we at least need to show that coders are capable of performing the annotation." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-10", "text": "This often means assessing reliability based on agreement between annotators applying the scheme." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-11", "text": "Agreement measures are discussed in detail in section 2." 
}, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-12", "text": "Much of the confusion regarding which agreement measures to apply and how their results should be interpreted stems from a lack of understanding of what it means to" }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-13", "text": "----------------------------------" }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-14", "text": "****" }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-15", "text": "Since Jean Carletta (1996) exposed computational linguists to the desirability of using chance-corrected agreement statistics to infer the reliability of data generated by applying coding schemes, there has been a general acceptance of their use within the field." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-16", "text": "However, there are prevailing misunderstandings concerning agreement statistics and the meaning of reliability." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-17", "text": "Investigation of new dialogue types and genres has been shown to reveal new phenomena in dialogue that are ill suited to annotation by current methods and also new annotation schemes that are qualitatively different from those commonly used in dialogue analysis." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-18", "text": "Previously prescribed practices for evaluating coding schemes become less applicable as annotation schemes become more sophisticated." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-19", "text": "To compensate, we need a greater understanding of reliability statistics and how they should be interpreted." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-20", "text": "In this article we discuss the purpose of reliability testing, address certain misunderstandings, and make recommendations regarding the way in which coding schemes should be evaluated." 
}, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-21", "text": "----------------------------------" }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-22", "text": "**AGREEMENT, RELIABILITY, AND CODING SCHEMES**" }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-23", "text": "After developing schemes for annotating discourse or dialogue, it is necessary to assess their suitability for the purpose for which they are designed." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-24", "text": "Although no statistical test can determine whether any form of annotation is worthwhile or how applications will benefit from it, we at least need to show that coders are capable of performing the annotation." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-25", "text": "This often means assessing reliability based on agreement between annotators applying the scheme." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-26", "text": "Agreement measures are discussed in detail in section 2." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-27", "text": "Much of the confusion regarding which agreement measures to apply and how their results should be interpreted stems from a lack of understanding of what it means to assess reliability." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-28", "text": "For example, the coding manual for the Switchboard DAMSL dialogue act annotation scheme (Jurafsky, Shriberg, and Biasca 1997, page 2) states that kappa is used to \"assess labelling accuracy,\" and Di Eugenio and Glass (2004) relate reliability to \"the objectivity of decisions,\" whereas Carletta (1996) regards reliability as the degree to which we understand the judgments that annotators are asked to make." 
}, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-29", "text": "Although most researchers recognize that reporting agreement statistics is an important part of evaluating coding schemes, there is frequently a lack of understanding of what the figures actually mean." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-30", "text": "The intended meaning of reliability should refer to the degree to which the data generated by coders applying a scheme can be relied upon." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-31", "text": "If we consider the coding process to involve mapping units of analysis onto categories, data are reliable if coders agree on the category onto which each unit should be mapped." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-32", "text": "The further from perfect agreement that coders stray, the less we can rely on the resulting annotation." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-33", "text": "If data produced by applying a scheme are shown to be reliable, then we have established two important properties of those data:" }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-34", "text": "The categories onto which the units are mapped are not inordinately dependent on the idiosyncratic judgments of any individual coder." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-35", "text": "----------------------------------" }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-36", "text": "**2.**" }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-37", "text": "There is a shared understanding of the meaning of the categories and how data are mapped onto them." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-38", "text": "The first of these is important for ensuring the reproducibility of the coding." 
}, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-39", "text": "To be able to trust the analysis of annotated corpora, we need to be confident that the categorization of the units of data is not dependent on which individual performed the annotation." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-40", "text": "The second governs the value of data resulting from the coding process." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-41", "text": "For an annotated corpus or the analysis thereof to be valuable, the phenomenon being annotated must represent some notion in which we can enjoy a shared understanding." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-42", "text": "----------------------------------" }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-43", "text": "**AGREEMENT MEASURES**" }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-44", "text": "There are many ways in which the level of agreement between coders can be evaluated, and the choice of which to apply in order to assess reliability is the source of much confusion." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-45", "text": "An appropriate statistic for this purpose must measure agreement as a function of the coding process and not of the coders, data, or categories." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-46", "text": "Only if the results of a test are solely dependent on the degree to which there is a shared understanding of how the phenomena to be described are mapped to the given categories can we infer the reliability of the resulting data." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-47", "text": "Some agreement measures do not behave in this manner and are therefore unsuitable for evaluating reliability." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-73", "text": "In Table 1 , the coders agree 90 out of 100 times, but all agreements occur when both coders choose accept." 
}, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-48", "text": "A great deal of importance is placed on domain specificity in discourse and dialogue studies and as such, researchers are often encouraged to evaluate schemes using corpora from more than one domain." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-49", "text": "Concerning agreement, this encouragement is misplaced." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-50", "text": "Since an appropriate agreement measure is a function of only the coding process, if the original agreement test is performed in a scientifically sound manner, little more can be proved by applying it again to different data." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-51", "text": "Any differences in the results between corpora are a function of the variance between samples and not of the reliability of the coding scheme." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-52", "text": "Di Eugenio and Glass (2004) identify three general classes of agreement statistics and suggest that all three should be used in conjunction in order to accurately evaluate coding schemes." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-53", "text": "However, this suggestion is founded on some misunderstandings of the role of agreement measure in reliability studies." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-54", "text": "We shall now rectify these and conclude that only one class of agreement measure is suitable." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-55", "text": "----------------------------------" }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-56", "text": "**PERCENTAGE AGREEMENT**" }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-57", "text": "The first of the recommended agreement tests, percentage agreement, measures the proportion of agreements between coders." 
}, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-58", "text": "This is an unsuitable measure for inferring reliability, and it was the use of this measure that prompted Carletta (1996) to recommend chance-corrected measures." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-59", "text": "Percentage agreement is inappropriate for inferring reliability because it excludes any notion of the level of agreement that we could expect to achieve by chance." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-60", "text": "Reliability should be inferred by locating the achieved level of agreement on a scale between the best possible (coders agree perfectly) and the worst possible (coders do not understand or cannot perform the mapping and behave randomly)." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-61", "text": "Without any indication of the agreement that coders would achieve by behaving randomly, any deviation from perfect agreement is uninterpretable (Krippendorff 2004b) ." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-62", "text": "The justification given for using percentage agreement is that it does not suffer from what Di Eugenio and Glass (2004) referred to as the \"prevalence problem." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-63", "text": "\" Prevalence refers to the unequal distribution of label use by coders." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-64", "text": "For example, Table 1 shows an example taken from Di Eugenio and Glass (2004) showing the classification of the utterance Okay as an acceptance or acknowledgment." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-65", "text": "It represents a confusion matrix describing the number of occasions that coders used pairs of labels for a given turn." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-66", "text": "This table shows that the two coders favored the use of accept strongly over acknowledge." 
}, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-67", "text": "They correctly state that this skew in the distribution of categories increases the expected chance agreement, thus lowering the overall agreement in chance-corrected tests." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-68", "text": "The reason for this is that since one category is more popular than others, the likelihood of coders' agreeing by chance by choosing this category increases." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-69", "text": "We therefore require a comparable increase in observed agreement to accommodate this." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-70", "text": "Di Eugenio and Glass (2004) perceive this as an \"unpleasant behavior\" of chancecorrected tests, one that prevents us from concluding that the example given in Table 1 shows satisfactory levels of agreement." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-75", "text": "The only conclusion one may justifiably draw is that the coders cannot distinguish the use of Okay as an acceptance from its use as an acknowledgment." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-76", "text": "Rather than being an unpleasant behavior, accounting for prevalence in the data is an" }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-77", "text": "----------------------------------" }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-78", "text": "**CHANCE-CORRECTED AGREEMENT: UNEQUAL CODER CATEGORY DISTRIBUTION**" }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-79", "text": "The second class of agreement measure recommended in Di Eugenio and Glass (2004) is that of chance-corrected tests that do not assume an equal distribution of categories between coders." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-80", "text": "Chance-corrected tests compute agreement according to the ratio of observed (dis)agreement to that which we could expect by chance, estimated from the data." 
}, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-81", "text": "The measures differ in the way in which this expected (dis)agreement is estimated." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-82", "text": "Those that do not assume an equal distribution between coders calculate expected (dis)agreement based on the individual distribution of each coder." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-83", "text": "The concern that in discourse and dialogue coding, coders will differ in the frequency with which they apply labels leads Di Eugenio and Glass to conclude that Cohen's (1960) kappa is the best chance-corrected test to apply." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-84", "text": "To clarify, by unequal distribution of categories, we do not refer to the disparity in the frequency with which categories occur (e.g., verbs are more common than pronouns) but rather to the difference in proclivity between coders (e.g., coder A is more likely to label something a noun than coder B)." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-85", "text": "Cohen's kappa calculates expected chance agreement, based on the individual coders' distributions, in a manner similar to association measures, such as chi-square." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-86", "text": "This means that its results are dependent on the preferences of the individual coders taking part in the tests." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-87", "text": "This violates the condition set out at the beginning of this section whereby agreement must be a function of the coding process, with coders being viewed as interchangeable." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-88", "text": "The purpose of assessing the reliability of coding schemes is not to judge the performance of the small number of individuals participating in the trial, but rather to predict the performance of the schemes in general." 
}, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-89", "text": "The proposal that in most discourse and dialogue studies, the assumption of equal distribution between coders does not hold is, in fact, an argument against the use of Cohen's kappa." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-90", "text": "Assessing the agreement between coders and accounting for their idiosyncratic proclivity toward or against certain labels tells us little about how the coding scheme will perform when applied by others." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-91", "text": "The solution is not to apply a test that panders to individual differences, but rather to increase the number of coders so that the influence of any individual on the final result becomes less pronounced." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-92", "text": "----------------------------------" }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-93", "text": "**1**" }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-94", "text": "Another reason provided for using Cohen's kappa is that its sensitivity to bias (differences in coders' category distribution) can be exploited to improve coding schemes." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-95", "text": "However, there is no need to calculate kappa in order to observe bias, since it will be evident in a contingency table of the data in question." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-96", "text": "Even if it were necessary to compute kappa for this purpose, however, this would not justify its use as a reliability test." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-97", "text": "----------------------------------" }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-99", "text": "The remaining class of agreement measure assumes an equal distribution of categories for all coders." 
}, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-100", "text": "Once we have accepted that this assumption is necessary in order to predict the performance of the scheme in general, there appears to be no objection to using this type of statistical test for assessing agreement in discourse and dialogue work." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-101", "text": "Tests that fall into this class include Siegel and Castellan's (1988) extension of Scott's (1955) pi, confusingly called kappa, and Krippendorff's (2004a) alpha." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-102", "text": "Both of these measures calculate expected (dis)agreement based on the frequency with which each category is used, estimated from the overall usage by the coders." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-103", "text": "Kappa is more frequently described in statistics textbooks and more commonly implemented in statistical software." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-104", "text": "In circumstances in which mechanisms other than nominal labels are used to annotate data, alpha has the benefit of being able to deal with different degrees of disagreement between pairs of interval, ordinal, and ratio values, among others." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-105", "text": "Di Eugenio and Glass (2004) conclude with the proposal that these three forms of agreement measure collectively provide better means with which to judge agreement than any individual test." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-106", "text": "We would argue, to the contrary, that applying three different metrics to measure the same property suggests a lack of confidence in any of them." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-107", "text": "Percentage agreement and Cohen's kappa do not provide an insight into a scheme's reliability, so reporting their results is potentially misleading." 
}, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-108", "text": "----------------------------------" }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-109", "text": "**INFERRING RELIABILITY**" }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-110", "text": "To reiterate, when testing reliability we are assessing whether the data that a scheme generates can be relied on." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-111", "text": "This may be inferred from the level of agreement between coders applying the scheme." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-112", "text": "In section 1 we described two properties of reliable data that are important to establish in discourse and dialogue analysis." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-113", "text": "In this section we explain how the gap between agreement and reliability may be bridged." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-114", "text": "When inferring reliability from agreement, a common error is to believe that there are a number of thresholds against which agreement scores can be measured in order to gauge whether or not a coding scheme produces reliable data." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-115", "text": "Most commonly this is Krippendorff's decision criterion, in which scores greater than 0.8 are considered satisfactory and scores greater than 0.667 allow tentative conclusions to be drawn (Krippendorff 2004a) ." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-116", "text": "The prevalent use of this criterion despite repeated advice that it is unlikely to be suitable for all studies (Carletta 1996; Di Eugenio and Glass 2004; Krippendorff 2004a ) is probably due to a desire for a simple system that can be easily applied to a scheme." 
}, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-117", "text": "Unfortunately, because of the diversity of both the phenomena being coded and the applications of the results, it is impossible to prescribe a scale against which all coding schemes can be judged." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-118", "text": "Instead we provide discussion and some recommendations, all founded on the premise that reliability must \"correlate with the conditions under which one is willing to rely on imperfect data\" (Krippendorff 2004b, page 6) ." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-119", "text": "A common concern regarding the application of standards from other fields, such as the one described above, to discourse and dialogue research is that the subjectivity of the phenomena being coded may mean that we never obtain the necessary agreement levels." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-120", "text": "In this context, subjectivity describes the absence of an obvious mapping for each unit of analysis onto categories that describe the phenomenon in question." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-121", "text": "However, the fact that we consider these subjective phenomena worthy of study shows that we are, in fact, \"willing to rely on imperfect data,\" which is fine as long as we recognize the limitations of a scheme that delivers less-than-ideal levels of reliability and use the resulting annotated corpora accordingly." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-122", "text": "In order to discuss the acceptable levels of agreement for discourse and dialogue coding, let us consider two popular uses of coded data: to train systems to perform some automated task and to study the relationship between the coded phenomena and some other feature of the data." 
}, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-123", "text": "----------------------------------" }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-124", "text": "**RELIABILITY AND TRAINING FOR AUTOMATIC ANNOTATION**" }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-125", "text": "Considering the effort involved in manually annotating linguistic data, it is unsurprising that attempts are often made to train a system to perform such annotation automatically (Mast et al. 1996; Wrede and Shriberg 2003) ." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-126", "text": "The reliability of manually annotated data is clearly a concern when they are used to train a system." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-127", "text": "If the level of agreement for the annotation scheme is low, then the system is going to replicate the inconsistent behavior of human annotators." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-128", "text": "Any deviant behavior by the system resulting in less than 100% accuracy in comparison with the manual annotation will compound the problem, possibly leading to meaningless data." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-129", "text": "Worse still, if a system is to learn how to annotate from manually annotated data, it will do so based on the patterns observed in those data." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-130", "text": "If the manual annotation is not reliable, then those patterns may be nonexistent or misleading." 
}, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-131", "text": "Returning to our original premise, we would suggest that if a coding scheme is to be used to generate data from which a system will learn to perform similar coding, then we should be \"unwilling to rely on imperfect data.\"" }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-132", "text": "----------------------------------" }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-133", "text": "**RELIABILITY AND CORPUS ANALYSIS**" }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-134", "text": "Manually annotated corpora can also be used to infer a relationship between the phenomena in question and some other facet of the data." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-135", "text": "When performing this sort of analysis, we may be more willing to work with imperfect data and therefore accept lower levels of agreement." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-136", "text": "However, the conclusions that are gleaned from the analysis must be tempered according to the level of agreement achieved." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-137", "text": "For example, when it is suggested that a correlation exists between the occurrence of one phenomenon and that of another, less agreement observed in the sample annotation requires stronger evidence of the correlation in order for the conclusion to be valid." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-138", "text": "To summarize, there are no magic thresholds that, once crossed, entitle us to claim that a coding scheme is reliable." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-139", "text": "One must decide for oneself, based on the intended use of a scheme, whether the observed level of agreement is sufficient and conduct one's analysis accordingly." 
}, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-140", "text": "----------------------------------" }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-141", "text": "**CONCLUSION**" }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-142", "text": "The application of agreement statistics has done much to improve the scientific rigor of discourse and dialogue research." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-143", "text": "However, unless we understand what we are attempting to prove and which tests are appropriate, the results of evaluation can be unsatisfactory or, worse still, misleading." }, { "sent_id": "0706cab049274ffc82c5e2ef6f7b99-C001-144", "text": "In this article we have encouraged researchers to clarify their reasons for assessing agreement and have suggested that in many cases the most suitable test for this purpose is one that corrects for expected agreement, based on an assumed equal distribution between coders." } ], "y": { "@BACK@": { "gold_contexts": [ [ "0706cab049274ffc82c5e2ef6f7b99-C001-28" ], [ "0706cab049274ffc82c5e2ef6f7b99-C001-52" ], [ "0706cab049274ffc82c5e2ef6f7b99-C001-64" ], [ "0706cab049274ffc82c5e2ef6f7b99-C001-79" ], [ "0706cab049274ffc82c5e2ef6f7b99-C001-105" ], [ "0706cab049274ffc82c5e2ef6f7b99-C001-116" ] ], "cite_sentences": [ "0706cab049274ffc82c5e2ef6f7b99-C001-28", "0706cab049274ffc82c5e2ef6f7b99-C001-52", "0706cab049274ffc82c5e2ef6f7b99-C001-64", "0706cab049274ffc82c5e2ef6f7b99-C001-79", "0706cab049274ffc82c5e2ef6f7b99-C001-105", "0706cab049274ffc82c5e2ef6f7b99-C001-116" ] }, "@DIF@": { "gold_contexts": [ [ "0706cab049274ffc82c5e2ef6f7b99-C001-62" ], [ "0706cab049274ffc82c5e2ef6f7b99-C001-70" ] ], "cite_sentences": [ "0706cab049274ffc82c5e2ef6f7b99-C001-62", "0706cab049274ffc82c5e2ef6f7b99-C001-70" ] } } }, "ABC_7b5ca6526f460139f273484bd276bc_25": { "x": [ { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-123", "text": "The hidden state size, d, was set to 300 for all GRUs." 
}, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-2", "text": "This paper proposes dynamic chunk reader (DCR), an end-toend neural reading comprehension (RC) model that is able to extract and rank a set of answer candidates from a given document to answer questions." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-3", "text": "DCR is able to predict answers of variable lengths, whereas previous neural RC models primarily focused on predicting single tokens or entities." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-4", "text": "DCR encodes a document and an input question with recurrent neural networks, and then applies a word-by-word attention mechanism to acquire question-aware representations for the document, followed by the generation of chunk representations and a ranking module to propose the top-ranked chunk as the answer." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-5", "text": "Experimental results show that DCR achieves stateof-the-art exact match and F1 scores on the SQuAD dataset (Rajpurkar et al. 2016) ." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-6", "text": "----------------------------------" }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-8", "text": "Reading comprehension-based question answering (RCQA) is the task of answering a question with a chunk of text taken from related document(s)." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-9", "text": "A variety of neural models have been proposed recently either for extracting a single entity or a single token as an answer from a given text (Hermann et al. 2015 ; Kadlec et al. 2016; Trischler et al. 2016b; Dhingra et al. 2016 ; Chen, Bolton, and Manning 2016; Sordoni, Bachman, and Cui et al. 
2016a) ; or for selecting the correct answer by ranking a small set of human-provided candidates (Yin, Ebert, and Sch\u00fctze 2016; Trischler et al. 2016a)." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-10", "text": "In both cases, an answer boundary is either easy to determine or already given." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-11", "text": "Different from the above two assumptions for RCQA, in the real-world QA scenario, people may ask questions about both entities (factoid) and non-entities such as explanations and reasons (non-factoid) (see Table 1 for examples)." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-12", "text": "In this regard, RCQA has the potential to complement other QA approaches that leverage structured data (e.g., knowledge bases) for both the above question types." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-13", "text": "This is because RCQA can exploit the textual evidence to ensure increased answer coverage, which is particularly helpful for non-factoid answers." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-14", "text": "However, it is also challenging for RCQA to identify answers at arbitrary positions in the passage with arbitrary lengths, especially for non-factoid answers, which might be clauses or sentences." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-15", "text": "As a result, apart from a few exceptions (Rajpurkar et al. 2016; Wang and Jiang 2016), this research direction has not been fully explored yet." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-16", "text": "Compared to the relatively easier RC task of predicting single tokens/entities 1 , predicting answers of arbitrary lengths and positions significantly increases the search space complexity: the number of possible candidates to consider is on the order of O(n^2), where n is the number of passage words." 
}, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-17", "text": "In contrast, for previous works in which answers are single tokens/entities or from candidate lists, the complexity is in O(n) or the size of candidate lists l (usually l \u22645), respectively." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-18", "text": "To address the above complexity, Rajpurkar et al. (2016) used a two-step chunk-and-rank approach that employs a rule-based algorithm to extract answer candidates from a passage, followed by a ranking approach with hand-crafted features to select the best answer." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-19", "text": "The rule-based chunking approach suffered from low coverage (\u2248 70% recall of answer chunks) that cannot be improved during training; and candidate ranking performance depends greatly on the quality of the hand-crafted features." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-20", "text": "More recently, Wang and Jiang (2016) proposed two endto-end neural network models, one of which chunks a candidate answer by predicting the answer's two boundary indices and the other classifies each passage word into answer/notanswer." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-21", "text": "Both models improved significantly over the method proposed by Rajpurkar et al. (2016) ." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-22", "text": "Our proposed model, called dynamic chunk reader (DCR), not only significantly differs from both the above systems in the way that answer candidates are generated and ranked, but also shares merits with both works." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-23", "text": "First, our model uses deep networks to learn better representations for candidate answer chunks, instead of using fixed feature representations as in (Rajpurkar et al. 2016) ." 
}, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-24", "text": "Second, it represents answer candidates as chunks, as in (Rajpurkar et al. 2016 ), instead of word-level representations (Wang and Jiang 2016) , to make the model aware of the subtle differences among candidates (importantly, overlapping candidates)." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-25", "text": "The contributions of this paper are three-fold." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-26", "text": "(1) We pro- We also propose several simple but effective features to strengthen the attention mechanism, which fundamentally improves candidate ranking, with the by-product of higher exact boundary match accuracy." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-27", "text": "The experiments on the Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al. 2016) , which contains a variety of human-generated factoid and non-factoid questions, have shown the effectiveness of above three contributions." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-28", "text": "Our paper is organized as follows." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-29", "text": "We formally define the RCQA problem first." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-30", "text": "Next, we describe our baseline with a neural network component." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-31", "text": "We present the end-to-end dynamic chunk reader model next." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-32", "text": "Finally, we analyze our experimental results and discuss the related work." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-33", "text": "Table 1 shows an example of our RC setting where the goal is to answer a question Q i , factoid (Q1) or non-factoid (Q2 and Q3), based on a supporting passage P i , by selecting a continuous sequence of text A i \u2286 P i as answer." 
}, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-34", "text": "Q i , P i , and A i are all word sequences, where each word is drawn from a vocabulary, V ." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-35", "text": "The i-th instance in the training set is a triple in the form of (P i , Q i , A i ), where P i = (p i1 , . . . , p i|Pi| ), Q i = (q i1 , . . . , q i|Qi| ), and A i = (a i1 , . . . , a i|Ai| ) (p i\u00b7 , q i\u00b7 , a i\u00b7 \u2208 V )." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-36", "text": "Owing to the disagreement among annotators, there could be more than one correct answer for the same question; and the k-th answer to Q i is denoted by" }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-37", "text": "----------------------------------" }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-38", "text": "**PROBLEM DEFINITION**" }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-39", "text": "}." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-40", "text": "An answer candidate for the i-th training example is defined as c m,n i , a sub-sequence in P i , that spans from position m to n (1 \u2264 m \u2264 n \u2264 |P i |)." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-41", "text": "The ground truth answer A i could be included in the set of all candidates C i = {c m,n i |\u2200m, n \u2208 N + , subj(m, n, P i ) and 1 \u2264 m \u2264 n \u2264 |P i |}, where subj(m, n, P i ) is the constraint put on the candidate chunk for P i , such as, \"c m,n i can have at most 10 tokens\", or \"c m,n i must have a pre-defined POS pattern\"." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-42", "text": "To evaluate a system's performance, its top answer to a question is matched against the corresponding gold standard answer(s)." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-43", "text": "Remark: Categories of RC Tasks Other simpler variants of the aforementioned RC task were explored in the past." 
}, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-44", "text": "For example, quiz-style datasets (e.g., MCTest (Richardson, Burges, and Renshaw 2013), MovieQA (Tapaswi et al. 2015) ) have multiple-choice questions with answer options." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-45", "text": "Cloze-style datesets Hill et al. 2015; Onishi et al. 2016) , usually automatically generated, have factoid \"question\"s created by replacing the answer in a sentence from the text with blank." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-46", "text": "For the answer selection task this paper focuses on, several datasets exist, e.g. TREC-QA for factoid answer extraction from multiple given passages, bAbI (Weston, Chopra, and Bordes 2014) designed for inference purpose, and the SQuAD dataset (Rajpurkar et al. 2016) used in this paper." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-47", "text": "To the best of our knowledge, the SQuAD dataset is the only one for both factoid and nonfactoid answer extraction with a question distribution more close to real-world applications." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-48", "text": "----------------------------------" }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-49", "text": "**BASELINE: CHUNK-AND-RANK PIPELINE WITH**" }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-50", "text": "Neural RC" }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-51", "text": "In this section we modified a state-of-the-art RC system for cloze-style tasks for our answer extraction purpose, to see how much gap we have for the two type of tasks, and to inspire our end-to-end system in the next section." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-52", "text": "In order to make the cloze-style RC system to make chunk-level decision, we use the RC model to generate features for chunks, which are further used in a feature-based ranker like in (Rajpurkar et al. 2016) ." 
}, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-53", "text": "As a result, this baseline can be viewed as a deep learning based counterpart of the system in (Rajpurkar et al. 2016 )." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-54", "text": "It has two main components: 1) a standalone answer chunker, which is trained to produce overlapping candidate chunks, and 2) a neural RC model, which is used to score each word in a given passage to be used thereafter for generating chunk scores." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-55", "text": "Answer Chunking To reduce the errors generated by the rule-based chunker in (Rajpurkar et al. 2016) , first, we capture the part-of-speech (POS) pattern of all answer subsequences in the training dataset to form a POS pattern trie tree, and then apply the answer POS patterns to passage P i to acquire a collection of all subsequences (chunk candidates) C i whose POS patterns can be matched to the POS pattern trie." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-56", "text": "This is equivalent to putting an constraint subj(m, n, P i ) to candidate answer chunk generation process that only choose the chunk with a POS pattern seen for answers in the training data." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-57", "text": "Then the sub-sequences C i are used as answer candidates for P i ." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-58", "text": "Note that overlapping chunks could be generated for a passage, and we rely on the ranker to choose the best candidate based on features from the cloze-style RC system." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-59", "text": "Experiments showed that for > 90% of the questions on the development set, the ground truth answer is included in the candidate set constructed in such manner. ." 
}, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-60", "text": "In step (1) to acquire s ij we train and apply a word-level single-layer Gated Attention Reader 2 (Dhingra et al. 2016) , which has state-of-the-art performance on CNN/DailyMail cloze-style RC task." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-61", "text": "In step (3) for chunk c m,n i , we designed 5 features, including 4 statistics on (s im , . . . , s in ): maximum, minimum, average and sum; as well as the count of matched POS pattern within the chunk, which serves as an answer prior." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-62", "text": "We use these 5 features in a state-of-the-art ranker (Ganjisaffar, Caruana, and Lopes 2011)." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-63", "text": "----------------------------------" }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-64", "text": "**DYNAMIC CHUNK READER**" }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-65", "text": "The dynamic chunk reader (DCR) model is presented in Figure 1 ." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-66", "text": "Inspired by the baseline we built, DCR is deemed to be superior to the baseline for 3 reasons." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-67", "text": "First, each chunk has a representation constructed dynamically, instead of having a set of pre-defined feature values." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-68", "text": "Second, each passage word's representation is enhanced by word-by-word attention that evaluates the relevance of the passage word to the question." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-69", "text": "Third, these components are all within a single, end-to-end model that can be trained in a joint manner." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-70", "text": "DCR works in four steps." 
}, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-71", "text": "First, the encoder layer encodes passage and question separately, by using bidirectional recurrent neural networks (RNN)." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-72", "text": "Second, the attention layer calculates the relevance of each passage word to the question." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-73", "text": "Third, the chunk representation layer dynamically extracts the candidate chunks from the given passage, and create chunk representation that encodes the contextual information of each chunk." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-74", "text": "Fourth, the ranker layer scores the relevance between the representations of a chunk and the given question, and ranks all candidate chunks using a softmax layer." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-75", "text": "We describe each step in details below." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-76", "text": "Encoder Layer We use bi-directional RNN encoder to encode P i and Q i of example i, and get hidden state for each word position p ij and q ik ." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-77", "text": "3 As RNN input, a word is represented by a row vector x \u2208 R n ." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-78", "text": "x can be the concatenation of word embedding and word features (see Fig. 1 )." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-79", "text": "The word vector for the t-th word is x t ." 
}, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-80", "text": "A word sequence is processed using an RNN encoder with gated recurrent units (GRU) (Bengio, Goodfellow, and Courville 2015), which was proved to be effective in RC and neural machine translation tasks (Bahdanau," }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-81", "text": "(4) where h t , r t , and u t \u2208 R d are d-dimensional hidden state, reset gate, and update gate, respectively; W {r,u} , W \u2208 R n\u00d7d and U {r,u} , U \u2208 R d\u00d7d are the parameters of the GRU; \u03c3 is the sigmoid function, and denotes elementwise production." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-82", "text": "For a word at t, we use the hidden state \u2212 \u2192 h t from the forward RNN as a representation of the preceding context, and the \u2190 \u2212 h t from a backward RNN that encodes text reversely, to incorporate the context after t. Next," }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-83", "text": ", the bi-directional contextual encoding of x t , is formed." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-84", "text": "[\u00b7; \u00b7] is the concatenation operator." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-85", "text": "To distinguish hidden states from different sources, we denote the h j of jth word in P and the h k of k-th word in Q as h" }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-86", "text": "where h p j and h q k are hidden states from the bi-directional RNN encoders (see Figure 1 )." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-87", "text": "An inner product, \u03b1 jk , is calculated between h p j and every question word h q k ." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-88", "text": "It indicates how well the passage word p j matches with every question word q k ." 
}, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-89", "text": "\u03b2 j is a weighted pooling of |Q| question hidden states, which serves as a p j -aware question representation." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-90", "text": "The" }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-91", "text": "Chunk Representation Layer A candidate answer chunk representation is dynamically created given attention layer output." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-92", "text": "We first decide the text boundary for the candidate chunk, and then form a chunk representation using all or part of those \u03b3 j outputs inside the chunk." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-93", "text": "To decide a candidate chunk (boundary): we tried two ways: (1) adopt the POS trie-based approach used in our baseline, and (2) enumerate all possible chunks up to a maximum number of tokens." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-94", "text": "For (2), we create up to N (max chunk length) chunks starting from any position j in P j ." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-95", "text": "Approach (1) can generate candidates with arbitrary lengths, but fails to recall candidates whose POS pattern is unseen in training set; whereas approach (2) considers all possible candidates within a window and is more flexible, but over-generates invalid candidates." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-96", "text": "For a candidate answer chunk c m,n spanning from position m to n inclusively, we construct chunk representation \u03b3 m,n \u2208 R 2d using every \u03b3 j within range [m, n], with a function g(\u00b7)." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-97", "text": "Formally, \u03b3 m,n = g(\u03b3 m , . . . 
, \u03b3 n ) We experimented with several pooling functions (e.g., max, average) for g(\u00b7), and found out that, instead of pooling, the best function is to concatenate the hidden state of the first word in a chunk in forward RNN and that of the last word in backward RNN." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-98", "text": "Formally," }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-99", "text": "We hypothesize that the hidden states at that two ends can better represent the chunk's contexts, which is critical for this task, than the states within the chunk." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-100", "text": "This observation also agrees with (Kobayashi et al. 2016) ." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-101", "text": "Ranker Layer Each chunk c m,n is evaluated on its context similarity to the question, by taking the cosine similarity between the chunk context representation\u03b3 m,n acquired from chunk representation layer, and the question representation which is the concatenation of the last hidden state in forward RNN and the first hidden state in backward RNN." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-102", "text": "Thus, for training example i, we have the probability of the chunk c" }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-103", "text": "," }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-104", "text": "Qi k is the k-th hidden state output from question Q i 's forward and backward RNN encoder, respectively." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-105", "text": "In runtime, the chunk with the highest probability is taken as the answer." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-106", "text": "In training, the following negative log likelihood is minimized:" }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-107", "text": "Note that the i-th training instance is only used when A i is included in the corresponding candidate chunk set C i , i.e. 
\u2203m,n A_i = c_i^{m,n}." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-108", "text": "The softmax in the final layer serves as the list-wise ranking module similar in spirit to (Cao et al. 2007)." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-109", "text": "----------------------------------" }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-110", "text": "**EXPERIMENTS**" }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-111", "text": "Dataset We used the Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al. 2016) for the experiment." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-112", "text": "We chose SQuAD because it is a mix of factoid and non-factoid questions, consists of real-world (crowd-sourced) data, and is of large scale (over 100K question-answer pairs collected from 536 Wikipedia articles)." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-113", "text": "Answers range from single words to long, variable-length phrases/clauses." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-114", "text": "It is a relaxation of the assumptions made by the cloze-style and quiz-style RC datasets in the Problem Definition section." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-115", "text": "Features The input vector representation of each word w to the encoder RNNs has six parts including a pre-trained 300-dimensional GloVe embedding (Pennington, Socher, and Manning 2014) and five features (see Figure 1): (1) a one-hot encoding (46 dimensions) for the part-of-speech (POS) tag of w; (2) a one-hot encoding (14 dimensions) for the named entity (NE) tag of w; (3) a binary value indicating whether w's surface form is the same as any word in the question; (4) a binary value indicating whether the lemma form of w is the same as any word in the question; and (5) a binary value indicating whether w is capitalized." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-116", "text": "Features (3) and (4) are designed to help the model align the passage text with the question." 
}, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-117", "text": "Note that some types of questions (e.g., \"who\", \"when\" questions) have answers that have a specific POS/NE tag pattern." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-118", "text": "For instance, \"who\" questions mostly have proper nouns/persons as answers and \"when\" questions may frequently have numbers/dates (e.g., a year) as answers." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-119", "text": "Thus, we believe that the model could exploit the co-relation between question types and answer POS/NE patterns easier with POS and NE tag features." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-120", "text": "Implementation Details We pre-processed the SQuAD dataset using Stanford CoreNLP tool 5 with its default setting to tokenize the text and obtain the POS and NE annotations." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-121", "text": "To train our model, we used stochastic gradient descent with the ADAM optimizer (Kingma and Ba 2014) , with an initial learning rate of 0.001." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-122", "text": "All GRU weights were initialized from a uniform distribution between (-0.01, 0.01)." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-124", "text": "The question bi-GRU shared parameters with the passage bi-GRU, while the attention-based passage bi-GRU had its own parameters." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-125", "text": "We shuffled all training examples at the beginning of each epoch and adopted a curriculum learning approach (Bengio et al. 2009 ), by sorting training instances by length in every 10 batches, to enable the model start learning from relatively easier instances and to harder ones." 
}, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-126", "text": "We also applied dropout of rate 0.2 to the embedding layer of input bi-GRU encoder, and gradient clipping when the norm of gradients exceeded 10." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-127", "text": "We trained in mini-batch style (mini-batch size is 180) and applied zero-padding to the passage and question inputs in each batch." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-128", "text": "We also set the maximum passage length to be 300 tokens, and pruned all the tokens after the 300-th token in the training set to save memory and speed up the training process." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-129", "text": "This step reduced the training set size by about 1.6%." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-130", "text": "During test, we test on the full length of passage, so that we don't prune out the potential candidates." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-131", "text": "We trained the model for at most 30 epochs, and in case the accuracy did not improve for 10 epochs, we stopped training." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-132", "text": "For the feature ranking-based system, we used jforest ranker (Ganjisaffar, Caruana, and Lopes 2011) with LambdaMART-RegressionTree algorithm and the ranking metric was NDCG@10." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-133", "text": "For the Gated Attention Reader in baseline system, we replicated the method and use the same configurations as in (Dhingra et al. 2016 )." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-134", "text": "Results Table 2 shows our main results on the SQuAD dataset." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-135", "text": "Compared to the scores reported in (Wang and Jiang 2016), our exact match (EM) and F1 on the development set and EM score on the test set are better, and F1 on the test set is comparable." 
}, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-136", "text": "We also studied how each component in our model contributes to the overall performance." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-137", "text": "Table 3 shows the details as well as the results of the baseline ranker." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-138", "text": "As the first row of Table 3 shows, our baseline system improves 10% (EM) over Rajpurkar et al. (2016) (Table 2 , row 1), the feature-based ranking system." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-139", "text": "However when compared to our DCR model (Table 3 , row 2), the baseline (row 1) is more than 12% (EM) behind even though it is based on the state-of-the-art model for cloze-style RC tasks." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-140", "text": "This can be attributed to the advanced model structure and end-to-end manner of DCR." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-141", "text": "We also did ablation tests on our DCR model." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-142", "text": "First, replacing the word-by-word attention with Attentive Reader style attention decreases the EM score by about 4.5%, showing the strength of our proposed attention mechanism." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-143", "text": "Second, we remove the features in input to see the contribution of each feature." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-144", "text": "The result shows that POS feature (1) and question-word feature (3) are the two most important features." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-145", "text": "Finally, combining the DCR model with the proposed POS-trie constraints yields a score similar to the one obtained using the DCR model with all possible n-gram chunks." 
}, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-146", "text": "The result shows that (1) our chunk representations are powerful enough to differentiate even a huge amount of chunks when no constraints are applied; and (2) the proposed POS-trie reduces the search space at the cost of a small drop in performance." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-147", "text": "Analysis To better understand our system, we calculated the accuracy of the attention mechanism of the gated attention reader used in our deep learning-based baseline." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-148", "text": "We found that it is 72% accurate i.e., 72% of the times a word with the highest attention score is inside the correct answer span." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-149", "text": "This means that, if we could accurately detect the boundary around the word with the highest attention score to form the answer span, we could achieve an accuracy close to 72%." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-150", "text": "In addition, we checked the answer recall of our candidate chunking approach." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-151", "text": "When we use a window size of 10, 92% of the time, the ground truth answer will be included in the extracted Candidate chunk set." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-152", "text": "Thus the upper bound of the exact match score of our baseline system is around 66% (92% (the answer recall) \u00d7 72%)." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-153", "text": "From the results, we see our DCR system's exact match score is at 62%." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-154", "text": "This shows that DCR is proficient at differentiating answer spans dynamically." 
}, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-155", "text": "To further analyze the system's performance while predicting answers of different lengths, we show the exact match (EM) and F1 scores for answers with lengths up to 10 tokens in Figure 2(a) ." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-156", "text": "From the graph, we can see that, as answer length increases, both EM and F1 drop, but at different speeds." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-157", "text": "The gap between F1 and exact match also widens as answer length increases." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-158", "text": "However, the model still yields a decent accuracy when the answer is longer than a single word." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-159", "text": "Additionally, Figure 2(b) shows that the system is better at \"when\" and \"who\" questions, but performs poorly on \"why\" questions." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-160", "text": "The large gap between exact match and F1 on \"why\" questions means that perfectly identifying the span is harder than locating the core of the answer span." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-161", "text": "Since \"what\", \"which\", and \"how\" questions contain a broad range of question types, we split them further based on the bigram a question starts with; Figure 3 shows this breakdown. Models such as that of Zaremba and Sutskever (2015) were also potential candidates for the task, and Gulcehre et al. (2016) reported results on the bAbI task, which are worse than memory networks." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-162", "text": "Similarly, sequence-to-sequence models were also used (Hermann et al. 2015) , but they did not yield better results either." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-163", "text": "Recently, several models have been proposed to enable more complex inference for the RC task." 
}, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-164", "text": "For instance, gated attention model (Dhingra et al. 2016 ) employs a multi-layer architecture, where each layer encodes the same document, but the attention is updated from layer to layer." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-165", "text": "EpiReader (Trischler et al. 2016b ) adopted a joint training model for answer extractor and reasoner, where the extractor proposes top candidates, and the reasoner weighs each candidate by examining entailment relationship between question-answer representation and the document." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-166", "text": "An iterative alternating attention mechanism and gating strategies were proposed in (Sordoni, Bachman, and to optimize the attention through several hops." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-167", "text": "In contrast, Cui et al. (2016a; introduced fine-grained document attention from each question word and then aggregated those attentions from each question token by summation with or without weights." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-168", "text": "This system achieved the state-of-the-art score on the CNN dataset." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-169", "text": "Those different variations all result in roughly 3-5% improvement over attention sum reader, but none of those could achieve higher than that." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-170", "text": "Other methods include using dynamic entity representation with maxpooling (Kobayashi et al. 2016 ) that aims to change entity representation with context, and Weissenborn's (2016) system, which tries to separate entity from the context and then matches the question to context, scoring an accuracy around 70% on the CNN dataset." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-171", "text": "However, all of those models assume that the answers are single tokens." 
}, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-172", "text": "This limits the type of questions the models can answer." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-173", "text": "Wang and Jiang (2016) proposed a matchlstm and achieved good results on SQuAD." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-174", "text": "However, this approach predicts a chunk boundary or whether a word is part of a chunk or not." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-175", "text": "In contrast, our approach explicitly constructs the chunk representations and similar chunks are compared directly to determine correct answer boundaries." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-176", "text": "----------------------------------" }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-177", "text": "**CONCLUSION**" }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-178", "text": "In this paper we proposed a novel neural reading comprehension model for question answering." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-179", "text": "Different from the previously proposed models for factoid RCQA, the proposed model, dynamic chunk reader, is not restricted to predicting a single named entity as an answer or selecting an answer from a small, pre-defined candidate list." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-180", "text": "Instead, it is capable of answering both factoid and non-factoid questions as it learns to select answer chunks that are suitable for an input question." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-181", "text": "DCR achieves this goal with a joint deep learning model enhanced with a novel attention mechanism and five simple yet effective features." }, { "sent_id": "7b5ca6526f460139f273484bd276bc-C001-182", "text": "Error analysis shows that the DCR model achieves good performance, but still needs to improve on predicting longer answers, which are usually non-factoid in nature." 
} ], "y": { "@USE@": { "gold_contexts": [ [ "7b5ca6526f460139f273484bd276bc-C001-5" ], [ "7b5ca6526f460139f273484bd276bc-C001-27" ], [ "7b5ca6526f460139f273484bd276bc-C001-46" ], [ "7b5ca6526f460139f273484bd276bc-C001-55" ], [ "7b5ca6526f460139f273484bd276bc-C001-111" ] ], "cite_sentences": [ "7b5ca6526f460139f273484bd276bc-C001-5", "7b5ca6526f460139f273484bd276bc-C001-27", "7b5ca6526f460139f273484bd276bc-C001-46", "7b5ca6526f460139f273484bd276bc-C001-55", "7b5ca6526f460139f273484bd276bc-C001-111" ] }, "@BACK@": { "gold_contexts": [ [ "7b5ca6526f460139f273484bd276bc-C001-14", "7b5ca6526f460139f273484bd276bc-C001-15" ], [ "7b5ca6526f460139f273484bd276bc-C001-20", "7b5ca6526f460139f273484bd276bc-C001-21" ] ], "cite_sentences": [ "7b5ca6526f460139f273484bd276bc-C001-15", "7b5ca6526f460139f273484bd276bc-C001-21" ] }, "@MOT@": { "gold_contexts": [ [ "7b5ca6526f460139f273484bd276bc-C001-14", "7b5ca6526f460139f273484bd276bc-C001-15" ] ], "cite_sentences": [ "7b5ca6526f460139f273484bd276bc-C001-15" ] }, "@DIF@": { "gold_contexts": [ [ "7b5ca6526f460139f273484bd276bc-C001-23" ], [ "7b5ca6526f460139f273484bd276bc-C001-138" ] ], "cite_sentences": [ "7b5ca6526f460139f273484bd276bc-C001-23", "7b5ca6526f460139f273484bd276bc-C001-138" ] }, "@SIM@": { "gold_contexts": [ [ "7b5ca6526f460139f273484bd276bc-C001-24" ], [ "7b5ca6526f460139f273484bd276bc-C001-52", "7b5ca6526f460139f273484bd276bc-C001-53" ] ], "cite_sentences": [ "7b5ca6526f460139f273484bd276bc-C001-24", "7b5ca6526f460139f273484bd276bc-C001-52", "7b5ca6526f460139f273484bd276bc-C001-53" ] } } }, "ABC_e48a1eac39987cb2f504b66d135572_25": { "x": [ { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-84", "text": "**EXPERIMENTAL DESIGN**" }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-106", "text": "Attention heads from all three models show decreasing entropy with depth of layers." 
}, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-2", "text": "Large scale contextual representation models, such as BERT, have significantly advanced natural language processing (NLP) in recent years." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-3", "text": "However, in certain areas, such as healthcare, accessing diverse large scale text data from multiple institutions is extremely challenging due to privacy and regulatory reasons." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-4", "text": "In this article, we show that it is possible to both pretrain and fine tune BERT models in a federated manner using clinical texts from different silos without moving the data." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-10", "text": "----------------------------------" }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-11", "text": "**INTRODUCTION**" }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-12", "text": "In recent years, natural language processing (NLP) has been revolutionized by large contextual representation models pre-trained with large amounts of data, such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2018) ." 
}, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-55", "text": "An updated model is generated by averaging the parameters of the distributively trained models, weighted by sample size (Kone\u010dn\u1ef3 et al., 2016; McMahan et al., 2016) ." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-56", "text": "In this study, sample size is defined as the number of patients in the pretraining stage and the number of notes in the fine-tuning stage." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-57", "text": "After model aggregation, the updated model was sent out to all sites again to repeat the global training cycle." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-58", "text": "Formally, the weight update is specified by Q^t_ag = \\sum_{k=1}^{K} (n_k / N) Q_k," }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-59", "text": "where Q^t_ag is the parameter of the aggregated model at global cycle t, and K is the number of data silos." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-60", "text": "n_k is the number of samples at the k-th site, N is the total number of samples across all sites, and Q_k denotes the parameters learned from the k-th data site alone." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-61", "text": "t is the global cycle number in the range of [1,T]." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-13", "text": "Compared with traditional word level representations, such as word2vec (Mikolov et al., 2013) , GloVe (Pennington et al., 2014) and fastText (Bojanowski et al., 2017) , which assign a representation vector to a word regardless of its surrounding context, contextual representation methods model the context of a word." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-14", "text": "For example, the embeddings of the word \"bank\" are different between the context of \"Bank of England\" and \"the bank of the Charles river." 
}, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-15", "text": "\" ELMo (Peters et al., 2018) uses a recurrent neural network model to learn information about the contexts of words from unlabelled texts." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-16", "text": "The trained contextual embeddings were then used for downstream tasks." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-17", "text": "More recently, another contextual word representation model based on the transformer, called BERT, was released, and has become widely used for many NLP tasks (Vaswani et al., 2017; Devlin et al., 2018) ." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-18", "text": "Instead of conducting directional modeling of the context of a word, transformers like BERT model relations between all pairs of tokens using a self-supervised strategy." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-19", "text": "Advances in these contextual representation-based models have created new state-of-the-art performance in many NLP benchmark tasks." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-20", "text": "Recent studies have shown that training contextual representations on texts from specific domains improves the power of the model by capturing domain specific linguistic characteristics." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-21", "text": "Texts from biomedical publications and electronic medical records have been used to pre-train BERT models for NLP tasks in this domain and showed considerable improvement in many downstream tasks (Lee et al., 2019; Alsentzer et al., 2019; Si et al., 2019) ." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-22", "text": "Despite the success of the efforts mentioned above, many challenges remain." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-23", "text": "Training a medically useful and generalizable contextual representation model requires access to a large amount of clinical texts." 
}, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-24", "text": "In addition, as the way clinicians write texts varies significantly among hospitals and healthcare systems, it is important to have access to data from many sources." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-25", "text": "However, sharing clinical data is difficult due to privacy and regulatory issues." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-26", "text": "Training NLP models in a federated manner is a good option to overcome these challenges." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-27", "text": "In this article, we conduct a proof-of-concept study to train BERT across clinical notes from multiple sites in a federated manner without moving notes outside of their silos." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-28", "text": "Our main contributions include:" }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-29", "text": "1. We show that it is possible to conduct federated pre-training of a BERT model using clinical notes from multiple silos without data transfer." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-30", "text": "2. We show that it is possible to do federated fine tuning of a BERT model for different downstream tasks such as named entity recognition (NER)." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-32", "text": "Later in the year, Alsentzer et al. (2019) and Si et al. (2019) published, almost at the same time, BERT models pre-trained on publicly available clinical notes from MIMIC3, either starting from the trained parameters of the original BERT or BioBERT model, and showed improvement on clinical NLP tasks." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-33", "text": "There has been only limited work on federated NLP in the clinical domain." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-34", "text": "In 2019, Liu et al. 
(2019) published a study doing federated training of machine learning models on clinical notes and used the model for patient level phenotyping based on Concept Unique Identifiers (CUIs)." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-35", "text": "In comparison, in our study, we use raw clinical texts instead of CUIs and, to the best of our knowledge, we are the first to use contextual representation methods for federated clinical NLP tasks." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-36", "text": "----------------------------------" }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-37", "text": "**METHODS**" }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-38", "text": "----------------------------------" }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-39", "text": "**CLINICAL DATA**" }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-40", "text": "In this study, the publicly available MIMIC-III corpus (Johnson et al., 2016) was used for contextual representation learning." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-41", "text": "This corpus contains information for more than 58,000 admissions for more than 45,000 patients admitted to Beth Israel Deaconess Medical Center in Boston between 2001 and 2012." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-42", "text": "Different types of clinical notes are available in this corpus, including discharge summaries, nursing notes, and so on." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-43", "text": "We included only discharge summaries in our study, as previous studies have shown that the performance of a model trained on only discharge summaries in this corpus is only marginally worse than that of a model trained on all note types (Alsentzer et al., 2019) ." 
}, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-44", "text": "----------------------------------" }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-45", "text": "**FEDERATED BERT TRAINING**" }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-46", "text": "Our goal is to develop methods for federated learning for both (1) pre-training models to capture linguistic characteristics of clinical text, and (2) fine-tuning for a specific downstream task, here named entity recognition (NER) from clinical notes." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-47", "text": "These methods will allow researchers and clinicians to utilize data from multiple health care providers to train contextual representation models like BERT, without the need to share the data directly, obviating issues related to data transfer and privacy." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-48", "text": "In the following sections, we first describe data processing and a simple notes pre-processing step." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-49", "text": "We then discuss the method for federated pre-training of the BERT model and the method for fine tuning." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-50", "text": "In the pre-training stage, clinical notes from the MIMIC-III corpus were randomly split by patient into 5 groups to mimic 5 different silos." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-51", "text": "The preprocessing and tokenization pipeline from Alsentzer et al. (2019) was adapted." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-52", "text": "To train the BERT model, we simulated sending out models with identical initial parameters to all silos." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-53", "text": "At each silo, a model was trained using only data from that site." 
}, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-54", "text": "Only the model parameters were then sent back to the analyzer for aggregation." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-62", "text": "In the pre-training stage, in each global cycle the BERT model was trained for one epoch through all clinical notes data at each of the silos using the default settings from the original BERT publication (Devlin et al., 2018) ." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-63", "text": "A total of 15 global cycles were run." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-64", "text": "The downstream task performance plateaued around cycle 10." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-65", "text": "Therefore, the BERT model federatedly trained for 10 global cycles was used for downstream tasks." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-66", "text": "Centralized fine-tuning of the i2b2 NER tasks plateaued after 4 epochs, with the learning rate set at 2e \u2212 5 and a batch size of 32 (Alsentzer et al., 2019) ." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-67", "text": "When conducting federated fine tuning using the same settings as centralized fine tuning, one epoch of training was conducted in each global cycle, and a total of 6 global cycles were conducted before the performance plateaued." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-68", "text": "Each pre-training global cycle took around 4 hours on a Tesla K80 GPU, which has a single precision GFLOPs of 55918736." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-69", "text": "Figure 1: Federated BERT training can be conducted at both the pre-training and fine tuning stages." 
}, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-70", "text": "In the federated pre-training stage, unlabelled clinical texts from different silos, such as hospitals, are used for self-supervised training to learn domain-specific linguistic characteristics in a federated manner without moving data outside their silos." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-71", "text": "In the task-specific fine tuning stage, the pre-trained BERT model was further trained using labelled texts from different silos in a federated manner." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-72", "text": "A single federated fine tuning global cycle took around 20 mins on the same device." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-73", "text": "----------------------------------" }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-74", "text": "**DOWNSTREAM TASKS**" }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-75", "text": "Clinical BERT pre-trained on the MIMIC corpus has been reported to have superior performance on NER tasks in Inside-Outside-Beginning (IOB) format (Ramshaw and Marcus, 1999) using i2b2 2010 (Uzuner et al., 2011) and 2012 (Sun et al., 2013) data (Alsentzer et al., 2019) . The original training/development/test splits from the challenges were used." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-76", "text": "The NER tasks classify whether a token is within a span of a class, outside spans of any classes, or at the beginning of a span of a class." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-77", "text": "For example, the sentence \"He has severe asthma\" is labelled as \"Null, Null, B-problem and I-problem\"." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-78", "text": "This means the first two words are not in any class, the third word is the beginning of a span of the class \"problem\", and the fourth word is in the class \"problem\"." 
}, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-79", "text": "In this study we use these two datasets for the NER tasks and for fine tuning." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-80", "text": "To conduct federated fine tuning of the BERT model, training notes for the downstream task were randomly split into 5 silos by note." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-81", "text": "----------------------------------" }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-82", "text": "**EXPERIMENTS AND RESULTS**" }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-83", "text": "----------------------------------" }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-85", "text": "In order to understand whether a large contextual language representation model like BERT can be pre-trained and fine tuned in a federated manner using data from different silos, we designed and conducted the following 6 experiments." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-86", "text": "First of all, we looked at the scenarios where no domain specific BERT model pre-training was conducted." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-87", "text": "In those cases, the parameters (checkpoint) from the original BERT base model trained on the Books Corpus and English Wikipedia were used (Zhu et al., 2015; Devlin et al., 2018) ." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-88", "text": "We also looked at the scenarios where the BERTbase model was pre-trained on MIMIC3 discharge summaries in a centralized manner (Alsentzer et al., 2019) ." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-89", "text": "Lastly, we looked at the scenarios where the BERTbase model was pre-trained on MIMIC3 discharge summaries in a federated manner." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-90", "text": "For each of these conditions, we fine tuned BERT models for downstream tasks using centralized vs. federated learning." 
}, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-91", "text": "To summarize, six experiments were conducted:" }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-92", "text": "----------------------------------" }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-93", "text": "**EXPERIMENTAL RESULTS**" }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-94", "text": "The results of our experiments are shown in Table 1 ." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-95", "text": "In experiment 1, when the original BERT was used without domain-specific pre-training on MIMIC discharge summaries and fine tuning was conducted in a centralized manner, an F1 score of 0.784 was achieved for i2b2 2010 and 0.728 for i2b2 2012." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-96", "text": "In experiment 2, where both BERT pre-training and fine tuning were conducted in a centralized manner, the F1 was 0.858 for i2b2 2010 and 0.741 for i2b2 2012." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-97", "text": "In experiment 3, where BERT was pretrained in a federated manner and fine tuned using centralized data, the F1 was 0.820 for i2b2 2010 and 0.735 for i2b2 2012." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-98", "text": "In Experiment 4, where the original BERT was not pre-trained using MIMIC3 discharge summaries and fine tuning was conducted in a federated manner, the F1 was 0.716 for i2b2 2010 and 0.686 for i2b2 2012." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-99", "text": "In comparison, if BERT is pre-trained using centralized clinical notes before federated fine tuning (Experiment 5), the F1 score of the i2b2 2010 NER task improved to 0.843 and the F1 score of the i2b2 2012 NER task improved to 0.731." 
}, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-100", "text": "When both the pre-training and fine tuning were conducted in a federated manner, as in Experiment 6, the F1 scores were 0.808 and 0.715 for i2b2 2010 and i2b2 2012 respectively, which is superior to the BERT model without pre-training." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-101", "text": "We made our BERT model federatedly trained with discharge summaries publicly available at XXXX and all the code at XXXX." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-102", "text": "----------------------------------" }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-103", "text": "**ATTENTION ANALYSIS**" }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-104", "text": "To understand how the different BERT models work, we looked at the behavior of attention heads." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-105", "text": "Firstly, the entropy of each head's attention distribution was analyzed (Clark et al., 2019) . We compared the original BERTbase model, the Clinical BERT model and the federatedly trained clinical BERT model without fine tuning in this analysis." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-107", "text": "By comparing the correlation of attention entropy among heads in the different models, we found that the federated clinical BERT model is more similar to the BERTbase model than to the centralized clinical BERT model (Table 2 ) ." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-108", "text": "We analyzed the attention from three different types of BERT models to understand if there are different clustering patterns." 
}, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-109", "text": "We define distance among Jensen-Shannon Divergence ma-" }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-110", "text": "----------------------------------" }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-111", "text": "**DISCUSSION AND CONCLUSION**" }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-112", "text": "In this proof-of-concept study, we demonstrate the possibility of both pre-training and fine tuning the large contextual representation language model BERT in a federated manner." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-113", "text": "Our analyses suggest that conducting pretraining and fine tuning in a federated manner using data from different silos resulted in reduced performance compared with training on centralized data." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-114", "text": "We can refer to this loss of performance due to the separation of data as \"federated communication loss\"." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-115", "text": "When only conducting pre-training on discharge summaries in a federated manner and fine tuning the model for NER tasks in a centralized manner, performance on downstream tasks had less than a 5% drop compared with the model both pre-trained and fine-tuned in a centralized manner." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-116", "text": "When only conducting fine tuning in a federated manner but keeping pre-training centralized, the performance dropped by less than 2%." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-117", "text": "However, when conducting both pre-training and fine tuning in a federated manner, the performance had a non-negligible decrease of around 6%." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-118", "text": "There are several limitations in this study." 
}, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-119", "text": "First of all, due to limits on data access, we used clinical notes from a single healthcare system to simulate different silos." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-120", "text": "In future studies, we would like to conduct analysis on clinical notes from a wide range of health care providers." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-121", "text": "Secondly, the MIMIC3 training set has a certain percentage of overlap with the task text in i2b2, which could confound model performance." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-122", "text": "This is the reason why we did not include MedNLI tasks in our study, as all of the MedNLI texts are directly from the MIMIC3 database." }, { "sent_id": "e48a1eac39987cb2f504b66d135572-C001-123", "text": "In future studies, more datasets without any overlapping data should be used." } ], "y": { "@BACK@": { "gold_contexts": [ [ "e48a1eac39987cb2f504b66d135572-C001-21" ], [ "e48a1eac39987cb2f504b66d135572-C001-32" ], [ "e48a1eac39987cb2f504b66d135572-C001-66" ], [ "e48a1eac39987cb2f504b66d135572-C001-75" ] ], "cite_sentences": [ "e48a1eac39987cb2f504b66d135572-C001-21", "e48a1eac39987cb2f504b66d135572-C001-32", "e48a1eac39987cb2f504b66d135572-C001-66", "e48a1eac39987cb2f504b66d135572-C001-75" ] }, "@MOT@": { "gold_contexts": [ [ "e48a1eac39987cb2f504b66d135572-C001-21" ] ], "cite_sentences": [ "e48a1eac39987cb2f504b66d135572-C001-21" ] }, "@SIM@": { "gold_contexts": [ [ "e48a1eac39987cb2f504b66d135572-C001-43" ], [ "e48a1eac39987cb2f504b66d135572-C001-51" ], [ "e48a1eac39987cb2f504b66d135572-C001-88" ] ], "cite_sentences": [ "e48a1eac39987cb2f504b66d135572-C001-43", "e48a1eac39987cb2f504b66d135572-C001-51", "e48a1eac39987cb2f504b66d135572-C001-88" ] }, "@USE@": { "gold_contexts": [ [ "e48a1eac39987cb2f504b66d135572-C001-51" ] ], "cite_sentences": [ "e48a1eac39987cb2f504b66d135572-C001-51" ] } } },
"ABC_60c1245eff625441383913f947a8b1_25": { "x": [ { "sent_id": "60c1245eff625441383913f947a8b1-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-2", "text": "Technologies for abusive language detection are being developed and applied with little consideration of their potential biases." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-3", "text": "We examine racial bias in five different sets of Twitter data annotated for hate speech and abusive language." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-4", "text": "We train classifiers on these datasets and compare the predictions of these classifiers on tweets written in African-American English with those written in Standard American English." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-5", "text": "The results show evidence of systematic racial bias in all datasets, as classifiers trained on them tend to predict that tweets written in African-American English are abusive at substantially higher rates." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-6", "text": "If these abusive language detection systems are used in the field they will therefore have a disproportionate negative impact on African-American social media users." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-7", "text": "Consequently, these systems may discriminate against the groups who are often the targets of the abuse we are trying to detect." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-8", "text": "----------------------------------" }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-10", "text": "Recent work has shown evidence of substantial bias in machine learning systems, which is typically a result of bias in the training data." 
}, { "sent_id": "60c1245eff625441383913f947a8b1-C001-11", "text": "This includes both supervised (Blodgett and O'Connor, 2017; Tatman, 2017; Kiritchenko and Mohammad, 2018; De-Arteaga et al., 2019) and unsupervised natural language processing systems (Bolukbasi et al., 2016; Caliskan et al., 2017; Garg et al., 2018) ." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-12", "text": "Machine learning models are currently being deployed in the field to detect hate speech and abusive language on social media platforms including Facebook, Instagram, and Youtube." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-13", "text": "The aim of these models is to identify abusive language that directly targets certain individuals or groups, particularly people belonging to protected categories (Waseem et al., 2017) ." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-14", "text": "Bias may reduce the accuracy of these models, and at worst, will mean that the models actively discriminate against the same groups they are designed to protect." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-15", "text": "Our study focuses on racial bias in hate speech and abusive language detection datasets (Waseem, 2016; Waseem and Hovy, 2016; Golbeck et al., 2017; Founta et al., 2018) , all of which use data collected from Twitter." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-16", "text": "We train classifiers using each of the datasets and use a corpus of tweets with demographic information to compare how each classifier performs on tweets written in African-American English (AAE) versus Standard American English (SAE) (Blodgett et al., 2016) ." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-17", "text": "We use bootstrap sampling (Efron and Tibshirani, 1986) to estimate the proportion of tweets in each group that each classifier assigns to each class." 
}, { "sent_id": "60c1245eff625441383913f947a8b1-C001-18", "text": "We find evidence of systematic racial biases across all of the classifiers, with AAE tweets predicted as belonging to negative classes like hate speech or harassment significantly more frequently than SAE tweets." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-19", "text": "In most cases the bias decreases in magnitude when we condition on particular keywords which may indicate membership in negative classes, yet it still persists." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-20", "text": "We expect that these biases will result in racial discrimination if classifiers trained on any of these datasets are deployed in the field." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-21", "text": "----------------------------------" }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-22", "text": "**RELATED WORKS**" }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-23", "text": "Scholars and practitioners have recently been devoting more attention to bias in machine learning models, particularly as these models are becoming involved in more and more consequential decisions (Athey, 2017)." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-24", "text": "Bias often derives from the data used to train these models." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-25", "text": "For example, Buolamwini and Gebru (2018) show how facial recognition technologies perform worse for darker-skinned people, particularly darker-skinned women, due to the disproportionate presence of white, male faces in the training data." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-26", "text": "Natural language processing systems also inherit biases from the data they were trained on."
}, { "sent_id": "60c1245eff625441383913f947a8b1-C001-27", "text": "For example, in unsupervised learning, word embeddings often contain biases (Bolukbasi et al., 2016; Caliskan et al., 2017; Garg et al., 2018), which persist even after attempts to remove them (Gonen and Goldberg, 2019)." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-28", "text": "There are many examples of bias in supervised learning contexts: YouTube's captioning models make more errors when transcribing women (Tatman, 2017), AAE is more likely to be misclassified as non-English by widely used language classifiers (Blodgett and O'Connor, 2017), numerous gender and racial biases exist in sentiment classification systems (Kiritchenko and Mohammad, 2018), and errors in both co-reference resolution systems and occupational classification models reflect gendered occupational patterns (Zhao et al., 2018; De-Arteaga et al., 2019)." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-29", "text": "While hate speech and abusive language detection has become an important area for natural language processing research (Schmidt and Wiegand, 2017; Waseem et al., 2017; Fortuna and Nunes, 2018), there has been little work addressing the potential for these systems to be biased." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-30", "text": "The danger posed by bias in such systems is, however, particularly acute, since it could result in negative impacts on the same populations the systems are designed to protect." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-31", "text": "For example, if we mistakenly consider speech by a targeted minority group as abusive we might unfairly penalize the victim, but if we fail to identify abuse against them we will be unable to take action against the perpetrator."
}, { "sent_id": "60c1245eff625441383913f947a8b1-C001-32", "text": "Although no model can perfectly avoid such problems, we should be particularly concerned about the potential for such models to be systematically biased against certain social groups, particularly protected classes." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-33", "text": "A number of studies have shown that false positive cases of hate speech are associated with the presence of terms related to race, gender, and sexuality (Kwok and Wang, 2013; Burnap and Williams, 2015)." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-34", "text": "While not directly measuring bias, prior work has explored how annotation schemes and the identity of the annotators (Waseem, 2016) might be manipulated to help avoid bias." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-35", "text": "Dixon et al. (2018) directly measured biases in the Google Perspective API classifier,1 trained on data from Wikipedia talk comments, finding that it tended to give high toxicity scores to innocuous statements like \"I am a gay man\"." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-36", "text": "They called this \"false positive bias\", caused by the model overgeneralizing from the training data, in this case from examples where \"gay\" was used pejoratively." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-37", "text": "They find that a number of such \"identity terms\" are disproportionately represented in the examples labeled as toxic." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-38", "text": "Park et al. (2018) build upon this study, using templates to study gender differences in performance across two hate speech and abusive language detection datasets." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-39", "text": "They find that classifiers trained on these data tend to perform worse when female identity terms are used, indicating gender bias in performance."
}, { "sent_id": "60c1245eff625441383913f947a8b1-C001-40", "text": "We build upon this work by auditing a series of abusive language and hate speech detection datasets for racial biases." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-41", "text": "We evaluate how classification models trained on these datasets perform in the field, comparing their predictions for tweets written in language used by whites or African-Americans." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-42", "text": "----------------------------------" }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-43", "text": "**RESEARCH DESIGN**" }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-44", "text": "----------------------------------" }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-45", "text": "**HATE SPEECH AND ABUSIVE LANGUAGE DATASETS**" }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-46", "text": "We focus on Twitter, the most widely used data source in abusive language research." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-47", "text": "We use all available datasets where tweets are labeled as various types of abuse and are written in English." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-48", "text": "We now briefly describe each of these datasets in chronological order." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-49", "text": "Waseem and Hovy (2016) collected 130k tweets containing one of seventeen different terms or phrases they considered to be hateful." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-50", "text": "They then annotated a sample of these tweets themselves, using guidelines inspired by critical race theory." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-51", "text": "These annotations were then reviewed by \"a 25 year old woman studying gender studies and a non-activist feminist\" to check for bias." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-52", "text": "This dataset consists of 16,849 tweets labeled as either racism, sexism, or neither."
}, { "sent_id": "60c1245eff625441383913f947a8b1-C001-53", "text": "Most of the tweets categorized as sexist relate to debates over an Australian TV show and most of those considered as racist are anti-Muslim." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-54", "text": "To account for potential bias in the previous dataset, Waseem (2016) relabeled 2876 tweets in the dataset, along with a new sample from the tweets originally collected." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-55", "text": "The tweets were annotated by \"feminist and anti-racism activists\", based upon the assumption that they are domain experts." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-56", "text": "A fourth category, racism and sexism, was also added to account for the presence of tweets which exhibit both types of abuse." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-57", "text": "The dataset contains 6,909 tweets." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-58", "text": "collected tweets containing terms from the Hatebase,2 a crowdsourced hate speech lexicon, then had a sample coded by crowdworkers located in the United States." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-59", "text": "To avoid false positives that occurred in prior work which considered all uses of particular terms as hate speech, crowdworkers were instructed not to make their decisions based upon any words or phrases in particular, no matter how offensive, but on the overall tweet and the inferred context." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-60", "text": "The dataset consists of 24,783 tweets annotated as hate speech, offensive language, or neither." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-61", "text": "Golbeck et al. (2017) selected tweets using ten keywords and phrases related to anti-black racism, Islamophobia, homophobia, anti-semitism, and sexism."
}, { "sent_id": "60c1245eff625441383913f947a8b1-C001-62", "text": "The authors developed a coding scheme to distinguish between potentially offensive content and serious harassment, such as threats or hate speech." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-63", "text": "After an initial round of coding, where tweets were assigned to a number of different categories, they simplified their analysis to include a binary harassment or non-harassment label for each tweet." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-64", "text": "The dataset consists of 20,360 tweets, each hand-labeled by the authors.3" }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-65", "text": "Founta et al. (2018) constructed a dataset intended to better approximate a real-world setting where abuse is relatively rare." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-66", "text": "They began with a random sample of tweets then augmented it by adding tweets containing one or more terms from the Hatebase lexicon and that had negative sentiment." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-67", "text": "They criticized prior work for defining labels in an ad hoc manner." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-68", "text": "To develop a more comprehensive annotation scheme they initially labeled a sample of tweets, allowing each tweet to belong to multiple classes." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-69", "text": "After analyzing the overlap between different classes they settled on a coding scheme with four distinct classes: abusive, hateful, spam, and normal." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-70", "text": "We use a dataset they published containing 91,951 tweets coded into these categories by crowdworkers."
}, { "sent_id": "60c1245eff625441383913f947a8b1-C001-71", "text": "----------------------------------" }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-72", "text": "**TRAINING CLASSIFIERS**" }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-73", "text": "For each dataset we train a classifier to predict the class of unseen tweets." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-74", "text": "We use regularized logistic regression with bag-of-words features, a commonly used approach in the field." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-75", "text": "While we expect that we could improve predictive performance by using more sophisticated classifiers, we expect that any bias is likely a function of the training data itself rather than the classifier." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-76", "text": "Moreover, although features like word embeddings can work well for this task (Djuric et al., 2015), we wanted to avoid inducing any bias in our models by using pre-trained embeddings (Park et al., 2018)." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-77", "text": "We pre-process each tweet by removing excess white-space and replacing URLs and mentions with placeholders." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-78", "text": "We then tokenize them, stem each token, and construct n-grams with a maximum length of three." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-79", "text": "Next we transform each dataset into a TF-IDF matrix, with a maximum of 10,000 features." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-80", "text": "We use 80% of each dataset to train models and hold out the remainder for validation." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-81", "text": "Each model is trained using stratified 5-fold cross-validation."
}, { "sent_id": "60c1245eff625441383913f947a8b1-C001-82", "text": "We conduct a grid-search over different regularization strength parameters to identify the best performing model." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-83", "text": "Finally, for each dataset we identify the model with the best average F1 score and retrain it using all of the training data." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-84", "text": "The performance of these models on the 20% held-out validation data is reported in Table 1 ." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-85", "text": "Overall we see varying performance across the classifiers, with some performing much better out-of-sample than others." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-86", "text": "In particular, we see that hate speech and harassment are particularly difficult to detect." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-87", "text": "Since we are primarily interested in within classifier, between corpora performance, any variation between classifiers should not impact our results." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-88", "text": "----------------------------------" }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-89", "text": "**RACE DATASET**" }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-90", "text": "We use a dataset of tweets labeled by race from Blodgett et al. (2016) to measure racial biases in these classifiers." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-91", "text": "They collected geolocated tweets in the U.S. and matched them with demographic data from the Census on the population of non-Hispanic whites, non-Hispanic blacks, Hispanics, and Asians in the block group where the tweets originated." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-92", "text": "They then identified words associated with particular demographics and trained a probabilistic mixed-membership language model." 
}, { "sent_id": "60c1245eff625441383913f947a8b1-C001-93", "text": "This model learns demographically-aligned language models for each of the four demographic categories and is used to calculate the posterior proportion of language from each category in each tweet." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-94", "text": "Their validation analyses indicate that tweets with a high posterior proportion of non-Hispanic black language exhibit lexical, phonological, and syntactic variation consistent with prior research on AAE." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-95", "text": "Their publicly-available dataset contains 59.2 million tweets." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-96", "text": "We define a user as likely non-Hispanic black if the average posterior proportion across all of their tweets for the non-Hispanic black language model is \u2265 0.80 (and \u2264 0.10 Hispanic and Asian combined) and as non-Hispanic white using the same formula but for the white language model.5" }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-97", "text": "This allows us to restrict our analysis to tweets written by users who predominantly use one of the language models." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-98", "text": "Due to space constraints we discard users who predominantly use either the Hispanic or the Asian language model." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-99", "text": "This results in a set of 1.1m tweets written by people who generally use non-Hispanic black language and 14.5m tweets written by users who tend to use non-Hispanic white language." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-100", "text": "Following Blodgett and O'Connor (2017), we call these datasets black-aligned and white-aligned tweets, reflecting the fact that they contain language associated with either demographic category but which may not all be produced by members of these categories."
}, { "sent_id": "60c1245eff625441383913f947a8b1-C001-101", "text": "We use this threshold following Blodgett and O'Connor (2017) and after consulting with the lead author; while these cut-offs should provide high confidence that the users tend to use AAE or SAE, and hence serve as a proxy for race, it is important to note that not all African-Americans use AAE and that not all AAE users are African-American, although use of the AAE dialect suggests a social proximity to or affinity for African-American communities (Blodgett et al., 2016)." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-102", "text": "We now describe how we use these data in our experiments." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-103", "text": "----------------------------------" }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-104", "text": "**EXPERIMENTS**" }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-105", "text": "We examine whether the probability that a tweet is predicted to belong to a particular class varies in relation to the racial alignment of the language it uses." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-106", "text": "The null hypothesis of no racial bias is that the probability a tweet will belong to a negative class is independent of the racial group the tweet's author is a member of." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-107", "text": "Formally, for class c_i, where c_i = 1 denotes membership in the class and c_i = 0 the opposite, we aim to test H_N: P(c_i = 1|black) = P(c_i = 1|white)." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-108", "text": "If P(c_i = 1|black) > P(c_i = 1|white) and the difference is statistically significant then we can reject the null hypothesis H_N in favor of the alternative hypothesis H_A that black-aligned tweets are classified into c_i at a higher rate than white-aligned tweets."
}, { "sent_id": "60c1245eff625441383913f947a8b1-C001-109", "text": "Conversely, if P(c_i = 1|black) < P(c_i = 1|white) we can conclude that the classifier is more likely to classify white-aligned tweets as c_i." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-110", "text": "We should expect that white-aligned tweets are more likely to use racist language or hate speech than black-aligned tweets, given that African-Americans are often targeted with racism and hate speech by whites." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-111", "text": "However for some classes like sexism we have no reason to expect there to be racial differences in either direction." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-112", "text": "To test this hypothesis we use bootstrap sampling (Efron and Tibshirani, 1986) to estimate the proportion of tweets in each dataset that each classifier predicts to belong to each class." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-113", "text": "We draw n random samples with replacement of k tweets from each of the two race corpora, where n = k = 1000." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-114", "text": "For each sample we use each classifier to predict the class membership of each tweet, then store the proportion of tweets that were assigned to each class, p_i." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-115", "text": "For each classifier-class pair, we thus obtain a pair of vectors, one for each corpus, each containing n sampled proportions." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-116", "text": "The bootstrap estimates for the proportion of tweets belonging to class i for each group, p_i^black and p_i^white, are calculated by taking the mean of the elements in each vector; estimates with p_i^black > p_i^white indicate that black-aligned tweets are classified as belonging to class i at a higher rate than white-aligned tweets."
}, { "sent_id": "60c1245eff625441383913f947a8b1-C001-117", "text": "We also conduct a second experiment, where we assess whether there is racial bias conditional upon a tweet containing a keyword likely to be associated with a negative class." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-118", "text": "While differences in language will undoubtedly remain, this should help to account for the possibility that results in Experiment 1 are driven by differences in the true distribution of the different classes of interest, or of words associated with these classes, in the two corpora." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-119", "text": "For classifier c and category i, we evaluate H_N: P(c_i = 1|black, t) = P(c_i = 1|white, t) for a given term t. We conduct this experiment for two different terms, each of which occurs frequently enough in the data to enable our bootstrapping approach." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-120", "text": "We select the term \"n*gga\", since it is a particularly prevalent source of false positives for hate speech detection (Kwok and Wang, 2013; Waseem et al., 2018).6" }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-121", "text": "In this case, we expect that tweets containing the word should be classified as more negative when used by whites, thus H_A1: P(c_i = 1|black, t) < P(c_i = 1|white, t)." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-122", "text": "The other alternative, H_A2: P(c_i = 1|black, t) > P(c_i = 1|white, t), would indicate that black-aligned tweets containing the term are penalized at a higher rate than comparable white-aligned tweets."
}, { "sent_id": "60c1245eff625441383913f947a8b1-C001-123", "text": "We also assess the results for the word \"b*tch\" since it is a widely used sexist term which is often also used casually, but we have no theoretical reason to expect there to be racial differences in its usage." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-124", "text": "We also planned to conduct the same analysis using the \"-er\" suffix; however, the sample was too small, with the word being used in 555 tweets in the white-aligned corpus (0.004%) and 61 in the black-aligned corpus (0.005%)." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-125", "text": "The term \"n*gga\" was used in around 2.25% of black-aligned and 0.15% of white-aligned tweets." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-126", "text": "The term \"b*tch\" was used in 1.7% of black-aligned and 0.5% of white-aligned tweets." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-127", "text": "The substantial differences in the distributions for these two terms alone are consistent with our intuition that some of the results in Experiment 1 may be driven by differences in the frequencies of words associated with negative classes in the training datasets." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-128", "text": "Since we are using a subsample of the available data, we use smaller bootstrap samples, drawing k = 100 tweets each time." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-129", "text": "----------------------------------" }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-130", "text": "**RESULTS**" }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-131", "text": "The results of Experiment 1 are shown in Table 2." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-132", "text": "We observe substantial racial disparities in the performance of all classifiers."
}, { "sent_id": "60c1245eff625441383913f947a8b1-C001-133", "text": "In all but one of the comparisons, there are statistically significant (p < 0.001) differences, and in all but one of these we see that tweets in the black-aligned corpus are assigned negative labels more frequently than those by whites." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-134", "text": "The only case where black-aligned tweets are classified into a negative class less frequently than white-aligned tweets is the racism class in the Waseem and Hovy (2016) classifier." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-135", "text": "Note, however, the extremely low rate at which tweets are predicted to belong to this class for both groups." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-136", "text": "On the other hand, this classifier is 1.7 times more likely to classify tweets in the black-aligned corpus as sexist." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-137", "text": "For Waseem (2016) we see that there is no significant difference in the estimated rates at which tweets are classified as racist across groups, although the rates remain low." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-138", "text": "Tweets in the black-aligned corpus are classified as containing sexism almost twice as frequently and 1.1 times as frequently classified as containing racism and sexism compared to those in the white-aligned corpus." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-139", "text": "Moving on to the next dataset, we find large disparities, with around 5% of tweets in the black-aligned corpus classified as hate speech compared to 2% of those in the white-aligned set." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-140", "text": "Similarly, 17% of black-aligned tweets are predicted to contain offensive language compared to 6.5% of white-aligned tweets." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-141", "text": "The classifier trained on the Golbeck et al.
(2017) dataset predicts black-aligned tweets to be harassment 1.4 times as frequently as white-aligned tweets." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-142", "text": "The Founta et al. (2018) classifier labels around 11% of tweets in the black-aligned corpus as hate speech and almost 18% as abusive, compared to 6% and 8% of white-aligned tweets respectively." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-143", "text": "It also classifies black-aligned tweets as spam 1.8 times as frequently." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-144", "text": "The results of Experiment 2 are consistent with the previous results, although there are some notable differences." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-145", "text": "In most cases the racial disparities persist, although they are generally smaller in magnitude and in some cases the direction even changes." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-146", "text": "Table 3 shows that for tweets containing the word \"n*gga\", classifiers trained on Waseem and Hovy (2016) and Waseem (2016) both predict black-aligned tweets to be instances of sexism approximately 1.5 times as often as white-aligned tweets." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-147", "text": "The classifier trained on the data is significantly less likely to classify black-aligned tweets as hate speech, although it is more likely to classify them as offensive." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-148", "text": "Golbeck et al. (2017) classifies black-aligned tweets as harassment at a higher rate for both groups than in the previous experiment, although the disparity is narrower." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-149", "text": "For the Founta et al. (2018) classifier we see that black-aligned tweets are slightly less frequently considered to be hate speech but are much more frequently classified as abusive." 
}, { "sent_id": "60c1245eff625441383913f947a8b1-C001-150", "text": "The results for the second variation of Experiment 2, where we conditioned on the word \"b*tch\", are shown in Table 4." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-151", "text": "We see similar results for Waseem and Hovy (2016) and Waseem (2016)." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-152", "text": "In both cases the classifiers trained upon their data are still more likely to flag black-aligned tweets as sexism." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-153", "text": "The Waseem and Hovy (2016) classifier is particularly sensitive to the word \"b*tch\", with 96% of black-aligned and 94% of white-aligned tweets predicted to belong to this class." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-154", "text": "Almost all of these tweets are classified as offensive; however, those in the black-aligned corpus are 1.15 times as frequently classified as hate speech." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-155", "text": "We see a very similar result for Golbeck et al. (2017) compared to the previous experiment, with black-aligned tweets flagged as harassment at 1.1 times the rate of those in the white-aligned corpus." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-156", "text": "Finally, for the Founta et al. (2018) classifier we see a substantial racial disparity, with black-aligned tweets classified as hate speech at 2.7 times the rate of white-aligned ones, a higher rate than in Experiment 1." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-157", "text": "----------------------------------" }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-158", "text": "**DISCUSSION**" }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-159", "text": "Our results demonstrate consistent, systematic and substantial racial biases in classifiers trained on all five datasets." 
}, { "sent_id": "60c1245eff625441383913f947a8b1-C001-160", "text": "In almost every case, black-aligned tweets are classified as sexism, hate speech, harassment, and abuse at higher rates than white-aligned tweets." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-161", "text": "To some extent, the results in the first experiment may be driven by underlying differences in the rates at which speakers of different dialects use particular words and phrases associated with these negative classes in the training data." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-162", "text": "For example, the word \"n*gga\" appears fifteen times as frequently in the black-aligned corpus compared to the white-aligned corpus." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-163", "text": "However, the second experiment shows that these disparities tend to persist even when comparing tweets containing keywords likely to be associated with negative classes." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-164", "text": "While some of the remaining disparities are likely due to differences in the distributions of other keywords we did not condition on, we expect that other more innocuous aspects of black-aligned language may be associated with negative labels in the training data, leading classifiers to disproportionately predict that tweets by African-Americans belong to negative classes." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-165", "text": "We now discuss the results as they pertain to each of the datasets used." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-166", "text": "Classifiers trained on data from Waseem and Hovy (2016) and Waseem (2016) only predicted a small fraction of the tweets to be racism." 
}, { "sent_id": "60c1245eff625441383913f947a8b1-C001-167", "text": "We suspect that this is due to the composition of their dataset, since the majority of the racist training examples consist of anti-Muslim rather than anti-black language. [Table 4: Experiment 2, t = \"b*tch\"]" }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-168", "text": "Across both datasets the words \"n*gger\" and \"n*gga\" appear in 4 and 10 tweets respectively." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-169", "text": "Looking at the sexism class on the other hand, we see that both models were consistently classifying tweets in the black-aligned corpus as sexism at a substantially higher rate than those in the white-aligned corpus." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-170", "text": "Given this result, and the gender biases identified in these data by Park et al. (2018), it is not apparent that the purportedly expert annotators were any less biased than amateur annotators (Waseem, 2016)." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-171", "text": "The classifier trained on the data shows the largest disparities in Experiment 1, with tweets in the black-aligned corpus classified as hate speech and offensive language at substantially higher rates than white-aligned tweets." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-172", "text": "We expect that this result occurred for two reasons." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-173", "text": "First, the dataset contains a large number of cases where AAE is used (Waseem et al., 2018)." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-174", "text": "Second, many of the AAE tweets also use words like \"n*gga\" and \"b*tch\", and are thus frequently associated with the hate speech and offensive classes, resulting in \"false positive bias\" (Dixon et al., 2018)." 
}, { "sent_id": "60c1245eff625441383913f947a8b1-C001-196", "text": "In particular, we need to consider whether the linguistic markers we use to identify potentially abusive language may be associated with language used by members of protected categories." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-175", "text": "On the other hand, the distinction between hate speech and offensive language appears to hold up to scrutiny: while a large proportion of tweets in Experiment 2 containing the word \"n*gga\" are classified as hate speech, the rate is substantially higher for white-aligned tweets." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-176", "text": "Without this category we expect that many of the tweets classified as offensive would instead be mistakenly classified as hate speech." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-177", "text": "Turning to the Golbeck et al. (2017) classifier, we found that tweets in the black-aligned dataset were significantly more likely to be classified as harassment in all experiments, although the disparity decreased substantially after conditioning on certain keywords." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-178", "text": "It seems likely that their simple binary labelling scheme may not be sufficient to capture the variation in language used, resulting in high rates of false positives." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-179", "text": "Finally, Founta et al. (2018) is the largest and perhaps the most comprehensive of the available datasets." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-180", "text": "In Experiment 1 we see that this classifier has the second highest rates of racial disparities, classifying black-aligned tweets as hate speech, abusive, and spam at substantially higher rates than white-aligned tweets." 
}, { "sent_id": "60c1245eff625441383913f947a8b1-C001-181", "text": "In Experiment 2 the classifier is slightly less likely to classify black-aligned tweets containing the word \"n*gga\" as hate speech but is 2.7 times more likely to predict that black-aligned tweets using \"b*tch\" belong to this category." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-182", "text": "----------------------------------" }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-183", "text": "**CONCLUSION**" }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-184", "text": "Our study is the first to measure racial bias in hate speech and abusive language detection datasets." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-185", "text": "We find evidence of substantial racial bias in all of the datasets tested." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-186", "text": "This bias tends to persist even when comparing tweets containing certain relevant keywords." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-187", "text": "While these datasets are still valuable for academic research, we caution against using them in the field to detect, and particularly to take enforcement action against, different types of abusive language." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-188", "text": "If they are used in this way we expect that they will systematically penalize African-Americans more than whites, resulting in racial discrimination." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-189", "text": "We have not evaluated these datasets for bias related to other ethnic and racial groups, nor other protected categories like gender and sexuality, but expect that such bias is also likely to exist." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-190", "text": "We recommend that efforts to measure and mitigate bias should start by focusing on how bias enters into datasets as they are collected and labeled." 
}, { "sent_id": "60c1245eff625441383913f947a8b1-C001-191", "text": "In particular, future work should focus on the following three areas." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-192", "text": "First, we expect that some biases emerge at the point of data collection." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-193", "text": "Some studies sampled tweets using small, ad hoc sets of keywords created by the authors (Waseem and Hovy, 2016; Waseem, 2016; Golbeck et al., 2017), an approach demonstrated to produce poor results (King et al., 2017)." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-194", "text": "Others start with large crowdsourced dictionaries of keywords, which tend to include many irrelevant terms, resulting in high rates of false positives (Founta et al., 2018)." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-195", "text": "In both cases, by using keywords to identify relevant tweets we are likely to get non-representative samples of training data that may over- or under-represent certain communities." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-197", "text": "For example, although data collection started with thousands of terms from the Hatebase lexicon, AAE is over-represented in the dataset (Waseem et al., 2018) because some keywords associated with this speech community were used more frequently on Twitter than other keywords in the lexicon and were consequently over-sampled." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-198", "text": "Second, we expect that the people who annotate data have their own biases." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-199", "text": "Since individual biases reflect societal prejudices, they aggregate into systematic biases in training data." 
}, { "sent_id": "60c1245eff625441383913f947a8b1-C001-200", "text": "The datasets considered here relied upon a range of different annotators, from the authors (Golbeck et al., 2017; Waseem and Hovy, 2016) and crowdworkers (Founta et al., 2018) to activists (Waseem, 2016)." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-201", "text": "Even the classifier trained on expert-labeled data (Waseem, 2016) flags black-aligned tweets as sexist at almost twice the rate of white-aligned tweets." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-202", "text": "While we agree that there is value in working with domain-experts to annotate data, these results suggest that activists may be prone to similar biases as academics and crowdworkers." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-203", "text": "Further work is therefore necessary to better understand how to integrate expertise into the process and how training can be used to help mitigate bias." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-204", "text": "We also need to consider how sociocultural context influences annotators' decisions." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-205", "text": "For example, 48% of the workers employed by Founta et al. (2018) were located in Venezuela but the authors did not consider whether this affected their results (or if the annotators understood English sufficiently for the task)." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-206", "text": "Third, we observed substantial variation in the rates of class membership across classifiers and datasets." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-207", "text": "In Experiment 1 the rate at which tweets were assigned to negative classes varied from 1% to 18%." 
}, { "sent_id": "60c1245eff625441383913f947a8b1-C001-208", "text": "Some of the low proportions may indicate a preponderance of false negatives due to a lack of training data, suggesting that these models may not be able to sufficiently generalize beyond the data they were trained on." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-209", "text": "The high proportions may signal too many false positives, which may be a result of the over-sampling of abusive language in labeled datasets." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-210", "text": "Founta et al. (2018) claim that, on average, between 0.1% and 3% of tweets are abusive, depending upon the category of abuse." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-211", "text": "Identifying such content is therefore a highly imbalanced classification problem." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-212", "text": "When labeling datasets and evaluating our models we must pay more attention to the baseline rates of usage of different types of abusive language and how they may vary across populations (Silva et al., 2016)." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-213", "text": "Finally, we need to more carefully consider how contextual factors interact with linguistic subtleties and our definitions of abuse." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-214", "text": "The \"n-word\" is a particularly useful illustration of this issue." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-215", "text": "It exhibits polysemy, as it can be extremely racist or quotidian, depending on the speaker, the context, and the spelling." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-216", "text": "While the history of the word and its usages is too complex to be summarized here (Neal, 2013), when used with the \"-er\" suffix it is generally considered to be a racist epithet, associated with white supremacy." 
}, { "sent_id": "60c1245eff625441383913f947a8b1-C001-217", "text": "Prior work has confirmed that the use of this variant online is generally considered to be hateful (Kwok and Wang, 2013), although this is not always the case, for example when a victim of abuse shares an insult they have received." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-218", "text": "However, the variant with the \"-a\" suffix is typically used innocuously by African-Americans (Kwok and Wang, 2013); indeed, our results indicate that it is used far more frequently in black-aligned tweets (although it is still used by many white people)." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-219", "text": "Despite this distinction, some studies have considered this variant to be hateful (Silva et al., 2016; Alorainy et al., 2018)." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-220", "text": "This approach results in high rates of false positive cases of hate speech; thus, a class was included for offensive language which does not appear to be hateful, and annotators were left to decide which class tweets belonged to based upon their interpretation of the context; many of them labeled tweets containing the term as offensive." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-221", "text": "Waseem et al. (2018) criticized this decision, claiming that it is problematic to ever consider the word to be offensive due to its widespread use among AAE speakers." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-222", "text": "This critique appears to be reasonable in the sense that we should not penalize African-Americans for using the word, but it avoids grappling with how to act when the word is used by other speakers and in other contexts. What should be done if it is used by a white social media user in reference to a black user? How should the context of their interaction and the nature of their relationship affect our decision?" 
}, { "sent_id": "60c1245eff625441383913f947a8b1-C001-223", "text": "A \"one-size-fits-all\", context-independent approach to defining and detecting abusive language is clearly inappropriate." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-224", "text": "Different communities have different speech norms, such that a model suitable for one community may discriminate against another." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-225", "text": "However there is no consensus in the field on how and if we can develop detection systems sensitive to different social and cultural contexts." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-226", "text": "In addition to our recommendations for improving training data, we emphasize the necessity of considering how context matters and how detection systems will have uneven effects across different communities." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-227", "text": "----------------------------------" }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-228", "text": "**LIMITATIONS**" }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-229", "text": "First, while the Blodgett et al. (2016) dataset is the best available source of tweets labeled as AAE, we do not have ground truth labels for the racial identities of the authors." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-230", "text": "By filtering on users who predominantly used one type of language we may also miss users who frequently code-switch between AAE and SAE." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-231", "text": "Second, although we roughly approximate this in Experiment 2, we cannot rule out the possibility that the results, rather than evidence of bias, are a function of different distributions of negative classes in the corpora studied." 
}, { "sent_id": "60c1245eff625441383913f947a8b1-C001-232", "text": "It is possible that words associated with negative categories in our abusive language datasets are also used to predict race by Blodgett et al. (2016), potentially contributing to the observed disparities." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-233", "text": "To more thoroughly investigate this issue we therefore require ground truth labels for abuse and race." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-234", "text": "Third, the results may vary for different classifiers or feature sets." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-235", "text": "It is possible that more sophisticated modeling approaches could enable us to alleviate bias, although they could also exacerbate it." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-236", "text": "Fourth, we did not interpret the results of the classifiers to determine why they made particular predictions." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-237", "text": "Further work is needed to identify what features of AAE the classifiers are learning to associate with negative classes." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-238", "text": "Finally, this study has only focused on one dimension of racial bias." }, { "sent_id": "60c1245eff625441383913f947a8b1-C001-239", "text": "Further work is necessary to investigate the extent to which data and models are biased against people belonging to other protected categories." 
} ], "y": { "@MOT@": { "gold_contexts": [ [ "60c1245eff625441383913f947a8b1-C001-15" ] ], "cite_sentences": [ "60c1245eff625441383913f947a8b1-C001-15" ] }, "@BACK@": { "gold_contexts": [ [ "60c1245eff625441383913f947a8b1-C001-15" ], [ "60c1245eff625441383913f947a8b1-C001-49" ], [ "60c1245eff625441383913f947a8b1-C001-153" ], [ "60c1245eff625441383913f947a8b1-C001-193" ], [ "60c1245eff625441383913f947a8b1-C001-200" ] ], "cite_sentences": [ "60c1245eff625441383913f947a8b1-C001-15", "60c1245eff625441383913f947a8b1-C001-49", "60c1245eff625441383913f947a8b1-C001-153", "60c1245eff625441383913f947a8b1-C001-193", "60c1245eff625441383913f947a8b1-C001-200" ] }, "@DIF@": { "gold_contexts": [ [ "60c1245eff625441383913f947a8b1-C001-133", "60c1245eff625441383913f947a8b1-C001-134" ], [ "60c1245eff625441383913f947a8b1-C001-166" ] ], "cite_sentences": [ "60c1245eff625441383913f947a8b1-C001-134", "60c1245eff625441383913f947a8b1-C001-166" ] }, "@SIM@": { "gold_contexts": [ [ "60c1245eff625441383913f947a8b1-C001-146" ], [ "60c1245eff625441383913f947a8b1-C001-151" ] ], "cite_sentences": [ "60c1245eff625441383913f947a8b1-C001-146", "60c1245eff625441383913f947a8b1-C001-151" ] }, "@USE@": { "gold_contexts": [ [ "60c1245eff625441383913f947a8b1-C001-146" ] ], "cite_sentences": [ "60c1245eff625441383913f947a8b1-C001-146" ] } } }, "ABC_d44648766e68cb914c5489e385f42e_25": { "x": [ { "sent_id": "d44648766e68cb914c5489e385f42e-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-2", "text": "Although parallel sentences rarely exist in quasi-comparable corpora, there could be parallel fragments that are also helpful for statistical machine translation (SMT)." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-3", "text": "Previous studies cannot accurately extract parallel fragments from quasi-comparable corpora." 
}, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-4", "text": "To solve this problem, we propose an accurate parallel fragment extraction system that uses an alignment model to locate the parallel fragment candidates, and uses an accurate lexicon filter to identify the truly parallel ones." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-5", "text": "Experimental results indicate that our system can accurately extract parallel fragments, and our proposed method significantly outperforms a state-of-the-art approach." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-6", "text": "Furthermore, we investigate the factors that may affect the performance of our system in detail." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-7", "text": "----------------------------------" }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-9", "text": "In statistical machine translation (SMT) (Brown et al., 1993; Koehn et al., 2007) , since translation knowledge is acquired from parallel data, the quality and quantity of parallel data are crucial." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-10", "text": "However, except for a few language pairs, such as English-French, English-Arabic, English-Chinese and several European language pairs, parallel data remains a scarce resource." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-11", "text": "As non-parallel corpora are far more available, extracting parallel data from non-parallel corpora is an attractive research field." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-12", "text": "Most previous studies focus on extracting parallel sentences from comparable corpora (Zhao and Vogel, 2002; Utiyama and Isahara, 2003; Munteanu and Marcu, 2005; Tillmann, 2009; Smith et al., 2010; Abdul-Rauf and Schwenk, 2011) ." 
}, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-13", "text": "Quasi-comparable corpora, which contain far more disparate, very-non-parallel bilingual documents that may either be on the same topic (in-topic) or not (out-topic) (Fung and Cheung, 2004), are available in far larger quantities than comparable corpora." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-14", "text": "In quasi-comparable corpora, there are few or no parallel sentences." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-15", "text": "However, there could be parallel fragments in comparable sentences that are also helpful for SMT." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-16", "text": "Previous studies for parallel fragment extraction from comparable sentences have the problem that they cannot extract parallel fragments accurately." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-17", "text": "Some studies extract parallel fragments relying on a probabilistic translation lexicon estimated on an external parallel corpus." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-18", "text": "They locate the source and target fragments independently, making the extracted fragments unreliable (Munteanu and Marcu, 2006)." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-19", "text": "Some studies develop alignment models for comparable sentences to extract parallel fragments (Quirk et al., 2007)." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-20", "text": "Because the comparable sentences are quite noisy, the extracted fragments are not accurate." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-21", "text": "In this paper, we propose an accurate parallel fragment extraction system." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-22", "text": "We locate parallel fragment candidates using an alignment model, and use an accurate lexicon filter to identify the truly parallel ones." 
}, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-23", "text": "Experimental results on Chinese-Japanese corpora show that our proposed method significantly outperforms a state-of-the-art approach, which indicates the effectiveness of our parallel fragment extraction system." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-24", "text": "Moreover, we investigate the factors that may affect the performance of our system in detail." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-25", "text": "2 Related Work: Munteanu and Marcu (2006) made the first attempt to extract parallel fragments from comparable sentences." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-26", "text": "They extract sub-sentential parallel fragments by using a Log-Likelihood-Ratio (LLR) lexicon estimated on an external parallel corpus and a smoothing filter." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-27", "text": "They show the effectiveness of fragment extraction for SMT." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-28", "text": "This study has the drawback that they do not locate the source and target fragments simultaneously, which cannot guarantee that the extracted fragments are translations of each other. [Figure 1: Parallel fragment extraction system.]" }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-29", "text": "We solve this problem by using an alignment model to locate the source and target fragments simultaneously." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-30", "text": "Quirk et al. (2007) introduce two generative alignment models for extracting parallel fragments from comparable sentences." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-31", "text": "However, the extracted fragments slightly decrease MT performance when appending them to in-domain training data." 
}, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-32", "text": "We think the reason is that because the comparable sentences are quite noisy, the alignment models cannot accurately extract parallel fragments." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-33", "text": "To solve this problem we only use alignment models for parallel fragment candidate detection, and use an accurate lexicon filter to guarantee the accuracy of the extracted parallel fragments." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-34", "text": "Besides the above studies, there are some other efforts." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-35", "text": "Hewavitharana and Vogel (2011) propose a method that calculates both the inside and outside probabilities for fragments in a comparable sentence pair, and show that the context of the sentence helps fragment extraction." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-36", "text": "However, the proposed method is only efficient in a controlled setting where the source fragment is assumed to be known and only the target fragment is searched for." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-37", "text": "Another study uses a syntax-based alignment model to extract parallel fragments from noisy parallel data (Riesa and Marcu, 2012)." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-38", "text": "Since their method is designed for noisy parallel data, we believe that the method cannot accurately extract parallel fragments from comparable sentences." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-39", "text": "3 Proposed Method. 3.1 System Overview: Figure 1 shows an overview of our parallel fragment extraction system." 
}, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-40", "text": "We first apply comparable sentence extraction using a combination of the methods of Abdul-Rauf and Schwenk (2011) (steps 1 and 2) and Munteanu and Marcu (2005) (step 3), which were originally used for extracting parallel sentences from comparable corpora." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-41", "text": "We translate the source sentences into the target language with an SMT system trained on a parallel corpus (1)." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-42", "text": "Then we use the translated sentences as queries for IR." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-43", "text": "We retrieve the top 10 target documents for each source sentence using Indri, and use all sentences in the documents as comparable sentence candidates (2)." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-44", "text": "Next, we identify the comparable sentences from the candidates using a classifier trained on a part of a parallel corpus, following Munteanu and Marcu (2005) (3)." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-45", "text": "As the noise in comparable sentences will decrease MT performance, we further apply parallel fragment extraction." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-46", "text": "We apply two steps to accurately extract parallel fragments." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-47", "text": "We first detect parallel fragment candidates using bidirectional IBM models (Brown et al., 1993) with symmetrization heuristics (Koehn et al., 2007) (4)." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-48", "text": "The generative alignment models proposed by Quirk et al. (2007) may be more efficient for parallel fragment candidate detection; we leave this for future work." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-49", "text": "Then we filter the candidates with a probabilistic translation lexicon to produce accurate results (5)." 
}, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-50", "text": "We present the details of our proposed method in following sections." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-51", "text": "Figure 2 shows an example of comparable sentences extracted from Chinese-Japanese quasicomparable corpora by our system." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-52", "text": "The alignment results are computed by IBM models." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-53", "text": "We notice that the truly parallel fragments \"lead ion selective electrode\" and \"potentiometric titration method\" are aligned, although there are some incorrectly aligned word pairs." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-54", "text": "We think this kind of alignment information can be helpful for fragment extraction." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-55", "text": "What we need to do is develop a method to identify the true parallel fragments from the aligned fragments." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-56", "text": "----------------------------------" }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-57", "text": "**A BRIEF EXAMPLE**" }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-58", "text": "----------------------------------" }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-59", "text": "**PARALLEL FRAGMENT CANDIDATE DETECTION**" }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-60", "text": "We treat the longest spans that have monotonic and non-null alignment as parallel fragment candidates." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-61", "text": "The reason we only consider monotonic ones is that based on our observation, ordering of IBM models on comparable sentences is unreliable." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-62", "text": "Quirk et al. (2007) also produce monotonic alignments in their generative model." 
}, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-63", "text": "Monotonic alignments are not sufficient for many language pairs." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-64", "text": "In the future, we plan to develop a method to deal with this problem." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-65", "text": "The non-null constraint can limit us from extracting incorrect fragments." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-66", "text": "Similar to previous studies, we are interested in fragment pairs with size greater than 3." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-67", "text": "Taking the comparable sentences in Figure 2 as an example, we will extract the fragments in dashed rectangles as parallel fragment candidates." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-68", "text": "----------------------------------" }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-69", "text": "**LEXICON-BASED FILTER**" }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-70", "text": "The parallel fragment candidates cannot be used directly, because many of them are still noisy as shown in Figure 2 ." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-71", "text": "Aiming to produce accurate results, we use a lexicon-based filter." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-72", "text": "We filter a candidate parallel fragment pair with a probabilistic translation lexicon." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-73", "text": "The lexicon-pair may be extracted from a parallel corpus, or from comparable corpora using some state-of-the-art approaches such as (Vuli\u0107 et al., 2011) ." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-74", "text": "In this study, we use the lexicon extracted from a parallel corpus." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-75", "text": "Different lexicons may have different effects for filtering." 
}, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-76", "text": "Here, we compare three types of lexicon." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-77", "text": "The first lexicon we use is the IBM Model 1 lexicon, which is obtained by running GIZA++ 3 that implements sequential word-based statistical alignment model of IBM models." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-78", "text": "The second lexicon we use is the LLR lexicon." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-79", "text": "Munteanu and Marcu (2006) show that the LLR lexicon performs better than the IBM Model 1 lexicon for parallel fragment extraction." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-80", "text": "One advantage of the LLR lexicon is that it can produce both positive and negative associations." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-81", "text": "Munteanu and Marcu (2006) develop a smoothing filter applying this advantage." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-82", "text": "We extract the LLR lexicon from a word-aligned parallel corpus using the same method as (Munteanu and Marcu, 2006) ." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-83", "text": "The last lexicon we use is the SampLEX lexicon." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-84", "text": "Vuli\u0107 and Moens (2012) propose an associative approach for lexicon extraction from par-allel corpora that relies on the paradigm of data reduction." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-85", "text": "They extract translation pairs from many smaller sub-corpora that are randomly sampled from the original corpus, based on some frequency-based criteria of similarity." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-86", "text": "They show that their method outperforms IBM Model 1 and other associative methods such as LLR in terms of precision and F-measure." 
}, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-87", "text": "We extract SampLEX lexicon from a parallel corpus using the same method as (Vuli\u0107 and Moens, 2012) ." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-88", "text": "Aiming to gain new knowledge that does not exist in the lexicon, we apply a smoothing filter similar to (Munteanu and Marcu, 2006) ." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-89", "text": "For each aligned word pair in the fragment candidates, we set scores to the words in two directions according to the extracted lexicon." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-90", "text": "If the aligned word pair exists in the lexicon, we set the corresponding translation probabilities as scores." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-91", "text": "For LLR lexicon, we use both positive and negative association values." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-92", "text": "If the aligned word pair does not exist in the lexicon, we set the scores in both directions to \u22121." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-93", "text": "There is the one exception that the aligned words are the same number, punctuation or abbreviation." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-94", "text": "In this case, we set the scores to 1 without considering the existence of the word pair in the lexicon." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-95", "text": "After this process, we get initial scores for the words in the fragment candidates in two directions." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-96", "text": "We then apply an averaging filter to the initial scores to obtain filtered scores in both directions." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-97", "text": "The averaging filter sets the score of one word to the average score of several words around it." 
}, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-98", "text": "We think the words with initial positive scores are reliable, because they satisfy two strong constraints, namely alignment by IBM models and existence in the lexicon." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-99", "text": "Therefore, unlike (Munteanu and Marcu, 2006), we only apply the averaging filter to the words with negative scores." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-100", "text": "Moreover, we add another constraint that only filtering a word when both the left and right words around it have positive scores, which can further guarantee accuracy." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-101", "text": "For the number of words used for averaging, we used 5 (2 preceding words and 2 following words)." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-102", "text": "The heuristics presented here produced good results on a development set." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-103", "text": "Finally, we extract parallel fragments according to the filtered scores." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-104", "text": "We extract word aligned fragment pairs with continuous positive scores in both directions." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-105", "text": "Fragments with less than 3 words may be produced in this process, and we discard them like previous studies." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-106", "text": "----------------------------------" }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-107", "text": "**EXPERIMENTS**" }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-108", "text": "In our experiments, we compared our proposed fragment extraction method with (Munteanu and Marcu, 2006) ." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-109", "text": "We manually evaluated the accuracy of the extracted fragments." 
}, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-110", "text": "Moreover, we used the extracted fragments as additional MT training data, and evaluated the effectiveness of the fragments for MT." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-111", "text": "We conducted experiments on Chinese-Japanese data." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-112", "text": "In all our experiments, we preprocessed the data by segmenting Chinese and Japanese sentences using a segmenter proposed by Chu et al. (2012) and JUMAN (Kurohashi et al., 1994) respectively." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-113", "text": "----------------------------------" }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-114", "text": "**DATA**" }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-115", "text": "----------------------------------" }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-116", "text": "**PARALLEL CORPUS**" }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-117", "text": "The parallel corpus we used is a scientific paper abstract corpus provided by JST 4 and NICT 5 ." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-118", "text": "This corpus was created by the Japanese project \"Development and Research of Chinese-Japanese Natural Language Processing Technology\", containing 680k sentences (18.2M Chinese and 21.8M Japanese tokens respectively)." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-119", "text": "This corpus contains various domains such as chemistry, physics, biology and agriculture etc." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-120", "text": "----------------------------------" }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-121", "text": "**QUASI-COMPARABLE CORPORA**" }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-122", "text": "The quasi-comparable corpora we used are scientific paper abstracts collected from academic websites." 
}, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-123", "text": "The Chinese corpora were collected from CNKI 6 , containing 420k sentences and 90k articles." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-124", "text": "The Japanese corpora were collected from CiNii 7 web portal, containing 5M sentences and 880k articles." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-125", "text": "Most articles in the Chinese corpora belong to the domain of chemistry, while the Japanese corpora contain various domains such as chemistry, physics and biology etc." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-126", "text": "Note that since the articles in these two websites were written by Chinese and Japanese researchers respectively, the collected corpora are very-non-parallel." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-127", "text": "----------------------------------" }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-128", "text": "**EXTRACTION EXPERIMENTS**" }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-129", "text": "We first applied sentence extraction on the quasicomparable corpora using our system, and 30k comparable sentences of chemistry domain were extracted." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-130", "text": "We then applied fragment extraction on the extracted comparable sentences." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-131", "text": "We compared our proposed method with (Munteanu and Marcu, 2006)." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-132", "text": "We applied word alignment using GIZA++." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-133", "text": "External parallel data might be helpful for alignment models to detect parallel fragment candidates from comparable sentences." 
}, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-134", "text": "Therefore, we compared two different settings to investigate the influence of external parallel data for alignment to our proposed method:" }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-135", "text": "\u2022 Only: Only use the extracted comparable sentences." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-136", "text": "\u2022 External: Use a small number of external parallel sentences together with the comparable sentences (In our experiment, we used chemistry domain data of the parallel corpus described in Section 4.1.1, containing 11k sentences)." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-137", "text": "We also compared IBM Model 1, LLR and SampLEX lexicon for filtering." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-138", "text": "All lexicons were extracted from the parallel corpus." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-139", "text": "Table 1 shows the results for fragment extraction." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-140", "text": "We can see that the average size of fragments extracted by (Munteanu and Marcu, 2006 ) is unusually long, which is also reported in (Quirk et al., 2007) ." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-141", "text": "Our proposed method extracts shorter fragments." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-142", "text": "The number of extracted fragments and the average size are similar among the three lexicons when using the same alignment setting." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-143", "text": "Using the external parallel data for alignment extracts more fragments than only using the comparable sentences, and the average size is slightly larger." 
}, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-144", "text": "We think the reason is that the external parallel data is helpful to improve the recall of alignment for the parallel fragments in the comparable sentences, thus more parallel fragments will be detected." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-145", "text": "To evaluate accuracy, we randomly selected 100 fragments extracted by the different methods." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-146", "text": "We manually evaluated the accuracy based on the number of exact match." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-147", "text": "Note that exact match criteria has a bias against (Munteanu and Marcu, 2006) , because their method extacts subsentential fragments which are quite long." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-148", "text": "We found that only one of the fragments extracted by \"Munteanu+, 2006\" is exact match, while for the remainder only partial matches are contained in long fragments." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-149", "text": "Our proposed method have a accuracy over 80%, while the remainder are partial matches." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-150", "text": "For the effects of different lexicons, LLR and SampLEX shows better performance than IBM Model 1 lexicon." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-151", "text": "We think the reason is the same one reported in previous studies that LLR and SampLEX lexicon are more accurate than IBM Model 1 lexicon." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-152", "text": "Also, LLR lexicon performs slightly better than SampLEX lexicon in this experiment." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-153", "text": "The accuracy of only using the comparable sentences for alignment are better than using the external parallel data, except for IBM Model 1 lexicon." 
}, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-154", "text": "We think the reason is that the external parallel data may have a bad effect on the precision of alignment for the parallel fragments in the comparable sentences." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-155", "text": "----------------------------------" }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-156", "text": "**TRANSLATION EXPERIMENTS**" }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-157", "text": "We further conducted Chinese-to-Japanese translation experiments by appending the extracted fragments to a baseline system." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-177", "text": "This result indicates that external parallel data is not indispensable for the alignment model of our proposed method." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-178", "text": "----------------------------------" }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-179", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-180", "text": "In this paper, we proposed an accurate parallel fragment extraction system using alignment model together with translation lexicon." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-181", "text": "Experiments conducted on Chinese-Japanese data showed that our proposed method significantly outperforms a state-of-the-art approach and improves MT performance." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-182", "text": "Our system can be improved in several aspects." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-183", "text": "Firstly, we only use IBM models for parallel fragment candidate detection, alignment models such as the ones proposed by (Quirk et al., 2007) could be more effective." 
}, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-184", "text": "Secondly, currently our proposed method cannot deal with ordering, an alignment model that is effective for ordering even on comparable sentences should be developed." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-185", "text": "Thirdly, although the experimental results indicate that external parallel data is not indispensable for the alignment model, we still use a parallel corpus for comparable sentence selection and lexicon filtering." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-186", "text": "An alternative method is constructing a large bilingual dictionary from comparable corpora, and use it for comparable sentence selection and lexicon filtering." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-187", "text": "Finally, although our proposed method is designed to be language and domain independent, the effectiveness for other language pairs and domains needs to be verified." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-158", "text": "For comparison, we also conducted translation experiments by appending the extracted comparable sentences." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-159", "text": "For decoding, we used the state-of-the-art phrasebased SMT toolkit Moses (Koehn et al., 2007) with default options, except for the distortion limit (6\u219220)." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-160", "text": "The baseline system used the parallel corpus (680k sentences)." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-161", "text": "We used another 368 and 367 sentences from the chemistry domain for tuning and testing respectively." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-162", "text": "We trained a 5-gram language model on the Japanese side of the parallel corpus using the SRILM toolkit 8 ." 
}, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-163", "text": "Table 2 : Results for Chinese-to-Japanese translation experiments (\" \u2020\" and \" \u2021\"denotes the result is better than \"Baseline\" significantly at p < 0.05 and p < 0.01 respectively, \" * \" denotes the result is better than \"+Munteanu+, 2006\" significantly at p < 0.05)." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-164", "text": "Translation results evaluated on BLEU-4, are shown in Table 2 ." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-165", "text": "We can see that appending the extracted comparable sentences have a positive effect on translation quality." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-166", "text": "Adding the fragments extracted by (Munteanu and Marcu, 2006) has a negative impact, compared to appending the sentences." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-167", "text": "Our proposed method outperforms both \"+sentences\" and \"Munteanu+, 2006\", which indicates the effectiveness of our proposed method for extracting useful parallel fragments for MT." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-168", "text": "We compared the phrase tables produced by different methods to investigate the reason for different MT performance." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-169", "text": "We found that all the methods increased the size of phrase table, meaning that new phrases were acquired from the extracted data." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-170", "text": "However, the noise contained in the data extracted by \"+sentences\" and \"Munteanu+, 2006\" produced many noisy phrase pairs, which may decrease MT performance." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-171", "text": "Our proposed method extracted accurate parallel fragments, which led to correct new phrases." 
}, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-172", "text": "Among all the settings of our proposed method, \"+External (IBM Model 1)\" showed the best performance." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-173", "text": "The reason for this is that it extracted more correct parallel fragments than the other settings, thus more new phrase pairs were produced." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-174", "text": "Surprisingly, the translation performance after appending the fragments extracted by our proposed method only using the comparable sentences for alignment shows comparable results when using LLR and SampLEX lexicon for filtering, compared to the ones using the external parallel data for alignment." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-175", "text": "We think the reason is that the extracted fragments not only can produce new phrases, but also can improve the quality of phrase pairs extracted from the original parallel corpus." }, { "sent_id": "d44648766e68cb914c5489e385f42e-C001-176", "text": "Because the fragments extracted only using the comparable sentences are more accurate than the ones using the external parallel data, they are more helpful to extract good phrase pairs from the original parallel corpus." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "d44648766e68cb914c5489e385f42e-C001-18" ], [ "d44648766e68cb914c5489e385f42e-C001-25" ], [ "d44648766e68cb914c5489e385f42e-C001-79" ], [ "d44648766e68cb914c5489e385f42e-C001-81" ], [ "d44648766e68cb914c5489e385f42e-C001-140" ], [ "d44648766e68cb914c5489e385f42e-C001-147" ] ], "cite_sentences": [ "d44648766e68cb914c5489e385f42e-C001-18", "d44648766e68cb914c5489e385f42e-C001-25", "d44648766e68cb914c5489e385f42e-C001-79", "d44648766e68cb914c5489e385f42e-C001-81", "d44648766e68cb914c5489e385f42e-C001-140", "d44648766e68cb914c5489e385f42e-C001-147" ] }, "@SIM@": { "gold_contexts": [ [ "d44648766e68cb914c5489e385f42e-C001-82" ], [ "d44648766e68cb914c5489e385f42e-C001-88" ], [ "d44648766e68cb914c5489e385f42e-C001-108" ], [ "d44648766e68cb914c5489e385f42e-C001-131" ] ], "cite_sentences": [ "d44648766e68cb914c5489e385f42e-C001-82", "d44648766e68cb914c5489e385f42e-C001-88", "d44648766e68cb914c5489e385f42e-C001-108", "d44648766e68cb914c5489e385f42e-C001-131" ] }, "@USE@": { "gold_contexts": [ [ "d44648766e68cb914c5489e385f42e-C001-82" ], [ "d44648766e68cb914c5489e385f42e-C001-88" ] ], "cite_sentences": [ "d44648766e68cb914c5489e385f42e-C001-82", "d44648766e68cb914c5489e385f42e-C001-88" ] }, "@DIF@": { "gold_contexts": [ [ "d44648766e68cb914c5489e385f42e-C001-99" ], [ "d44648766e68cb914c5489e385f42e-C001-166" ] ], "cite_sentences": [ "d44648766e68cb914c5489e385f42e-C001-99", "d44648766e68cb914c5489e385f42e-C001-166" ] }, "@EXT@": { "gold_contexts": [ [ "d44648766e68cb914c5489e385f42e-C001-99" ] ], "cite_sentences": [ "d44648766e68cb914c5489e385f42e-C001-99" ] } } }, "ABC_3188ee1583a9c711cf147fc596768d_25": { "x": [ { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-2", "text": "This paper evaluates two semi-supervised techniques for the adaptation of a parse selection model to Wikipedia domains." 
}, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-3", "text": "The techniques examined are Structural Correspondence Learning (SCL) (Blitzer et al., 2006) and Self-training (Abney, 2007; McClosky et al., 2006) ." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-4", "text": "A preliminary evaluation favors the use of SCL over the simpler self-training techniques." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-5", "text": "----------------------------------" }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-6", "text": "**INTRODUCTION AND MOTIVATION**" }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-7", "text": "Parse selection constitutes an important part of many parsing systems (Hara et al., 2005; van Noord and Malouf, 2005; McClosky et al., 2006 )." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-8", "text": "Yet, there is little to no work focusing on the adaptation of parse selection models to novel domains." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-9", "text": "This is most probably due to the fact that potential gains for this task are inherently bounded by the underlying grammar." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-10", "text": "The few studies on adapting parse disambiguation models, like Hara et al. (2005) , have focused exclusively on supervised domain adaptation, i.e. one has access to a comparably small, but labeled amount of target data." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-11", "text": "In contrast, in semisupervised domain adaptation one has only unlabeled target data." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-12", "text": "It is a more realistic situation, but at the same time also considerably more difficult." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-13", "text": "In this paper we evaluate two semi-supervised approaches to domain adaptation of a discriminative parse selection model." 
}, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-14", "text": "We examine Structural Correspondence Learning (SCL) (Blitzer et al., 2006) for this task, and compare it to several variants of Self-training (Abney, 2007; McClosky et al., 2006) ." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-15", "text": "For empirical evaluation (section 4) we use the Alpino parsing system for Dutch (van Noord and Malouf, 2005) ." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-16", "text": "As target domain, we exploit Wikipedia as primary test and training collection." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-17", "text": "----------------------------------" }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-18", "text": "**PREVIOUS WORK**" }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-19", "text": "So far, Structural Correspondence Learning has been applied successfully to PoS tagging and Sentiment Analysis (Blitzer et al., 2006; )." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-20", "text": "An attempt was made in the CoNLL 2007 shared task to apply SCL to non-projective dependency parsing (Shimizu and Nakagawa, 2007) ." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-21", "text": "However, the system just ended up at rank 7 out of 8 teams." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-22", "text": "Based on annotation differences in the datasets and a bug in their system (Shimizu and Nakagawa, 2007) , their results are inconclusive." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-23", "text": "A recent attempt (Plank, 2009) shows promising results on applying SCL to parse disambiguation." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-24", "text": "In this paper, we extend that line of work and compare SCL to bootstrapping approaches such as self-training." 
}, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-25", "text": "Studies on self-training have focused mainly on generative, constituent based parsing (Steedman et al., 2003; McClosky et al., 2006; Reichart and Rappoport, 2007) ." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-26", "text": "Steedman et al. (2003) as well as Reichart and Rappoport (2007) examine self-training for PCFG parsing in the small seed case (< 1k labeled data), with different results." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-27", "text": "In contrast, McClosky et al. (2006) focus on large seeds and exploit a reranking-parser." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-28", "text": "Improvements are obtained (McClosky et al., 2006; McClosky and Charniak, 2008) , showing that a reranker is necessary for successful self-training in such a high-resource scenario." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-29", "text": "While they self-trained a generative model, we examine self-training and SCL for semi-supervised adaptation of a discriminative parse selection system." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-30", "text": "----------------------------------" }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-31", "text": "**SEMI-SUPERVISED DOMAIN ADAPTATION**" }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-32", "text": "----------------------------------" }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-33", "text": "**STRUCTURAL CORRESPONDENCE LEARNING**" }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-34", "text": "Structural Correspondence Learning (Blitzer et al., 2006) exploits unlabeled data from both source and target domain to find correspondences among features from different domains." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-35", "text": "These correspondences are then integrated as new features in the labeled data of the source domain." 
}, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-36", "text": "The outline of SCL is given in Algorithm 1." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-37", "text": "The key to SCL is to exploit pivot features to automatically identify feature correspondences." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-38", "text": "Pivots are features occurring frequently and behaving similarly in both domains (Blitzer et al., 2006) ." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-39", "text": "They correspond to auxiliary problems in Ando and Zhang (2005) ." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-40", "text": "For every such pivot feature, a binary classifier is trained (step 2 of Algorithm 1) by masking the pivot feature in the data and trying to predict it with the remaining non-pivot features." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-41", "text": "Non-pivots that correlate with many of the same pivots are assumed to correspond." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-42", "text": "These pivot predictor weight vectors thus implicitly align non-pivot features from source and target domain." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-43", "text": "Intuitively, if we are able to find good correspondences through 'linking' pivots, then the augmented source data should transfer better to a target domain (Blitzer et al., 2006) ." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-44", "text": "----------------------------------" }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-45", "text": "**SCL FOR DISCRIMINATIVE PARSE SELECTION**" }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-46", "text": "So far, pivot features on the word level were used (Blitzer et al., 2006; ." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-47", "text": "However, for parse disambiguation based on a conditional model they are irrelevant." 
}, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-48", "text": "Hence, we follow Plank (2009) and actually first parse the unlabeled data." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-49", "text": "This allows a possibly noisy, but more abstract representation of the underlying data." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-50", "text": "Features thus correspond to properties of parses: application of grammar rules (r1,r2 features), dependency relations (dep), PoS tags (f1,f2), syntactic features (s1), precedence (mf ), bilexical preferences (z), apposition (appos) and further features for unknown words, temporal phrases, coordination (h,in year and p1, respectively)." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-51", "text": "These features are further described in van Noord and Malouf (2005) ." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-52", "text": "----------------------------------" }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-53", "text": "**SELECTION OF PIVOT FEATURES**" }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-54", "text": "As pivot features should be common across domains, here we restrict our pivots to be of the type r1,p1,s1 (the most frequently occurring feature types)." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-55", "text": "In more detail, r1 indicates which grammar rule applied, p1 whether coordination conjuncts are parallel, and s1 whether local/non-local extraction occurred." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-56", "text": "We count how often each feature appears in the parsed source and target domain data, and select those r1,p1,s1 features as pivot features, whose count is > t, where t is a specified threshold." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-57", "text": "In all our experiments, we set t = 5000." 
}, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-58", "text": "In this way we obtained on average 360 pivot features, on the datasets described in Section 4." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-59", "text": "----------------------------------" }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-60", "text": "**SELF-TRAINING**" }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-61", "text": "Self-training (Algorithm 2) is a simple single-view bootstrapping algorithm." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-62", "text": "In self-training, the newly labeled instances are taken at face value and added to the training data." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-63", "text": "There are many possible ways to instantiate selftraining (Abney, 2007) ." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-64", "text": "One variant, introduced in Abney (2007) is the notion of '(in)delibility': in the delible case the classifier relabels all of the unlabeled data from scratch in every iteration." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-65", "text": "The classifier may become unconfident about previously selected instances and they may drop out (Steven Abney, personal communication)." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-66", "text": "In contrast, in the indelible case, labels once assigned do not change again (Abney, 2007) ." 
}, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-67", "text": "In this paper we look at the following variants of self-training:" }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-68", "text": "\u2022 single versus multiple iterations," }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-69", "text": "\u2022 selection versus no selection (taking all selflabeled data or selecting presumably higher quality instances); different scoring functions for selection," }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-70", "text": "\u2022 delibility versus indelibility for multiple iterations." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-71", "text": "Algorithm 2 Self-training (indelible) (Abney, 2007) ." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-72", "text": "c \u2190 train(L) 6: until stopping criterion is met Scoring methods We examine three simple scoring functions for instance selection: i) Entropy (\u2212 y\u2208Y (s) p(\u03c9|s, \u03b8) log p(\u03c9|s, \u03b8))." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-73", "text": "ii) Number of parses (|Y (s)|); and iii) Sentence Length (|s|)." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-74", "text": "----------------------------------" }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-75", "text": "**EXPERIMENTS AND RESULTS**" }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-76", "text": "----------------------------------" }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-77", "text": "**EXPERIMENTAL DESIGN**" }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-78", "text": "The system used in this study is Alpino, a two-stage dependency parser for Dutch (van Noord and Malouf, 2005) ." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-79", "text": "The first stage consists of a HPSG-like grammar that constitutes the parse generation component." 
}, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-80", "text": "The second stage is a Maximum Entropy (MaxEnt) parse selection model." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-81", "text": "To train the MaxEnt model, parameters are estimated based on informative samples (Osborne, 2000) ." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-82", "text": "A parse is added to the training data with a score indicating its \"goodness\" (van Noord and Malouf, 2005) ." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-83", "text": "The score is obtained by comparing it with the gold standard (if available; otherwise the score is approximated through parse probability)." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-84", "text": "The source domain is the Alpino Treebank (van Noord and Malouf, 2005) (newspaper text; approx. 7,000 sentences; 145k tokens)." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-85", "text": "We use Wikipedia both as testset and as unlabeled target data source." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-86", "text": "We assume that in order to parse data from a very specific domain, say about the artist Prince, then data related to that domain, like information about the New Power Generation, the Purple rain movie, or other American singers and artists, should be of help." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-87", "text": "Thus, we exploit Wikipedia's category system to gather domain-specific target data." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-88", "text": "In our empirical setup, we follow Blitzer et al. (2006) and balance the size of source and target data." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-89", "text": "Thus, depending on the size of the resulting target domain dataset, and the \"broadness\" of the categories involved in creating it, we might wish to filter out certain pages." 
}, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-90", "text": "We implemented a filter mechanism that excludes pages of a certain category (e.g. a supercategory that is hypothesized to be \"too broad\")." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-91", "text": "Further details about the dataset construction are given in (Plank, 2009 The size of the target domain testsets is given in Table 2 ." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-92", "text": "As evaluation measure concept accuracy (CA) (van Noord and Malouf, 2005 ) is used (similar to labeled dependency accuracy)." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-93", "text": "The training data for the pivot predictors are the 1-best parses of source and target domain data as selected by the original Alpino model." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-94", "text": "We report on results of SCL with dimensionality parameter set to h = 25, and remaining settings identical to Plank (2009) (i.e., no feature-specific regularization and no feature normalization and rescaling)." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-95", "text": "----------------------------------" }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-96", "text": "**BASELINE**" }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-97", "text": "----------------------------------" }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-98", "text": "**SCL AND SELF-TRAINING RESULTS**" }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-99", "text": "The results for SCL (Table 3) show a small, but consistent increase in absolute performance on all testsets over the baselines (up to +0.27 absolute CA or 7.34% relative error reduction, which is significant at p < 0.05 according to sign test)." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-100", "text": "In contrast, basic self-training (Table 3 ) achieves roughly only baseline accuracy and lower performance than SCL, with one exception." 
}, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-101", "text": "On the DeMorgan testset, self-training scores slightly higher than SCL." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-102", "text": "However, the improvements of both SCL and self-training are not significant on this rather small testset." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-103", "text": "Indeed, self-training scores better than the baseline on only 5 parses out of 254, while its performance is lower on 2, leaving only 3 parses that account for the difference." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-104", "text": "Table 3 : Results of SCL and self-training (single iteration, no selection)." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-105", "text": "Entries marked with \u22c6 are statistically significant at p < 0.05." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-106", "text": "The \u03c6 score incorporates upperand lower-bounds." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-107", "text": "To gauge whether other instantiations of selftraining are more effective, we evaluated the selftraining variants introduced in section 3.2 on the Prince dataset." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-108", "text": "In the iterative setting, we follow Steedman et al. (2003) and parse 30 sentences from which 20 are selected in every iteration." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-109", "text": "With regard to the comparison of delible versus indelible self-training (whether labels may change), our empirical findings shows that the two cases achieve very similar performance; the two curves highly overlap (Figure 1 )." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-110", "text": "The accuracies of both curves fluctuate around 85.13, showing no upward or downward trend." 
}, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-111", "text": "In general, however, indelibility is preferred since it takes considerably less time (the classifier does not have to relabel U from scratch in every iteration)." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-112", "text": "In addition, we tested EM (which uses all unlabeled data in each iteration)." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-113", "text": "Its performance is consistently lower, varying around the baseline." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-114", "text": "Figure 2 compares several self-training variants with the supervised baseline and SCL." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-115", "text": "It summarizes the effect of i) selection versus no selection (and various selection techniques) as well as ii) single versus multiple iterations of self-training." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-116", "text": "For clarity, the figure shows the learning curve of the best selection technique only, but depicts the performance of the various selection techniques in a single iteration (non-solid lines)." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-117", "text": "In the iterative setting, taking the whole selflabeled data and not selecting certain instances (grey curve in Figure 2 ) degrades performance." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-118", "text": "In contrast, selecting shorter sentences slightly improves accuracy, and is the best selection method among the ones tested (shorter sentences, entropy, fewer parses)." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-119", "text": "For all self-training instantiations, running multiple iterations is on average just the same as running a single iteration (the non-solid lines are roughly the average of the learning curves)." 
}, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-120", "text": "Thus there is no real need to run several iterations of self-training." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-121", "text": "The main conclusion is that in contrast to SCL, none of the self-training instantiations achieves a significant improvement over the baseline." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-122", "text": "----------------------------------" }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-123", "text": "**CONCLUSIONS AND FUTURE WORK**" }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-124", "text": "The paper compares Structural Correspondence Learning (Blitzer et al., 2006) with (various instances of) self-training (Abney, 2007; McClosky et al., 2006) for the adaptation of a parse selection model to Wikipedia domains." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-125", "text": "The empirical findings show that none of the evaluated self-training variants (delible/indelible, single versus multiple iterations, various selection techniques) achieves a significant improvement over the baseline." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-126", "text": "The more 'indirect' exploitation of unlabeled data through SCL is more fruitful than pure self-training." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-127", "text": "Thus, favoring the use of the more complex method, although the findings are not confirmed on all testsets." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-128", "text": "Of course, our results are preliminary and, rather than warranting yet many definite conclusions, encourage further investigation of SCL (varying size of target data, pivots selection, bigger testsets as well as other domains etc.) as well as related semisupervised adaptation techniques." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-129", "text": "Figure 2: Self-training variants compared to supervised baseline and SCL." 
}, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-130", "text": "The effect of various selection techniques (Sec. 3.2) in a single iteration is depicted (non-solid lines; fewer parses and no selection achieve identical results)." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-131", "text": "For clarity, the figure shows the learning curve for the best selection technique only (shorter sent) versus no selection." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-132", "text": "On average running multiple iterations is just the same as a single iteration." }, { "sent_id": "3188ee1583a9c711cf147fc596768d-C001-133", "text": "In all cases SCL still performs best." } ], "y": { "@BACK@": { "gold_contexts": [ [ "3188ee1583a9c711cf147fc596768d-C001-3" ], [ "3188ee1583a9c711cf147fc596768d-C001-19" ], [ "3188ee1583a9c711cf147fc596768d-C001-34" ], [ "3188ee1583a9c711cf147fc596768d-C001-38" ], [ "3188ee1583a9c711cf147fc596768d-C001-46" ] ], "cite_sentences": [ "3188ee1583a9c711cf147fc596768d-C001-3", "3188ee1583a9c711cf147fc596768d-C001-19", "3188ee1583a9c711cf147fc596768d-C001-34", "3188ee1583a9c711cf147fc596768d-C001-38", "3188ee1583a9c711cf147fc596768d-C001-46" ] }, "@SIM@": { "gold_contexts": [ [ "3188ee1583a9c711cf147fc596768d-C001-14" ], [ "3188ee1583a9c711cf147fc596768d-C001-43" ], [ "3188ee1583a9c711cf147fc596768d-C001-88" ], [ "3188ee1583a9c711cf147fc596768d-C001-124" ] ], "cite_sentences": [ "3188ee1583a9c711cf147fc596768d-C001-14", "3188ee1583a9c711cf147fc596768d-C001-43", "3188ee1583a9c711cf147fc596768d-C001-88", "3188ee1583a9c711cf147fc596768d-C001-124" ] }, "@USE@": { "gold_contexts": [ [ "3188ee1583a9c711cf147fc596768d-C001-88" ] ], "cite_sentences": [ "3188ee1583a9c711cf147fc596768d-C001-88" ] } } }, "ABC_8ad5f7a658a8bb5377981ba6b098dc_25": { "x": [ { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-2", "text": "Abstract." 
}, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-3", "text": "We describe a simple approach to semantic parsing based on a tensor product kernel." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-4", "text": "We extract two feature vectors: one for the query and one for each candidate logical form." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-5", "text": "We then train a clasifier using the tensor product of the two vectors." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-6", "text": "Using very simple features for both, our system achieves an average F1 score of 40.1% on the WebQuestions dataset." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-7", "text": "This is comparable to more complex systems but is simpler to implement and runs faster." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-8", "text": "----------------------------------" }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-10", "text": "In recent years, the task of semantic parsing for querying large databases has been studied." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-11", "text": "This task differs from early work in semantic parsing in several ways:" }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-12", "text": "-The databases being queried are typically several orders of magnitude larger, contain much more diverse content, and are less structured." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-13", "text": "-In standard semantic parsing approaches, the aim is to learn a logical form to represent a query." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-14", "text": "In recent approaches the goal is to find the correct answer (entity or set of entities in the database), with learning a logical form a potential byproduct." 
}, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-15", "text": "-Because of this, the datasets, which would have consisted of queries together with their corresponding logical forms, now may consist of the queries together with the desired correct answer." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-16", "text": "-The datasets themselves are much larger, and cover a more diverse range of entities, however there may be a lot of overlap in the type of queries in the dataset." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-17", "text": "We believe it is the last of these points that means that simple techniques such as the one we present can work surprisingly well." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-18", "text": "For example, the WebQuestions dataset contains 83 questions containing the term \"currency\"; of these 79 are asking what the currency of a particular country is." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-19", "text": "These 79 questions can be answered using the same logical form template, thus a system only has to see the term \"currency\", and identify the correct country in the question to have a very good chance of getting the answer correct." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-20", "text": "Knowing this on its own is not enough to build an effective system however." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-21", "text": "We still need to be able to somehow identify that it is this particular term in the query that is associated with this logical form." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-22", "text": "In this paper we demonstrate one way that this can be achieved." 
}, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-23", "text": "We build on the paraphrasing approach of [1] in that we use a fixed set of templates to generate a set of candidate logical forms to answer a given query and map each logical form to a natural language expression, its canonical utterance." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-24", "text": "Instead of using a complex paraphrasing model however, we use tensor kernels to find relationships between terms occuring in the query and in the canonical utterance." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-25", "text": "The virtue of our approach is in its simplicity, which both aids implementation and speeds up execution." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-26", "text": "----------------------------------" }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-27", "text": "**BACKGROUND**" }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-28", "text": "The task of semantic parsing initially focussed on fairly small problems, such as the GeoQuery dataset, which initially consisted of 250 queries [2] and was later extended to around 1000 queries [3] ." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-29", "text": "Approaches to this task included inductive logic programming [2, 3] , probabilistic grammar induction [4, 5] , synchronous grammars [6] and induction of latent logical forms [7] , the current state of the art on this type of dataset." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-30", "text": "More recently, attention has focussed on answering queries in much larger domains, such as Freebase [8] , which contains at the time of writing of around 2.7 billion facts." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-31", "text": "There are two datasets of queries for this database: Free917 consisting of 917 questions annotated with logical forms [9] , and WebQuestions which consists of 5,810 question-answer pairs, with no logical forms [10] ." 
}, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-32", "text": "Approaches to this task include schema matching [9] , inducing latent logical forms [10] , application of paraphrasing techniques [1, 11] , information extraction [12] , learning low dimensional embeddings of words and knowledge base constituents [13] and application of logical reasoning in conjunction with statistical techniques [11] ." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-33", "text": "Note that most of these approaches do not require annotated logical forms, and either induce logical forms when training using the given answers, or bypass them altogether." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-34", "text": "----------------------------------" }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-35", "text": "**SEMANTIC PARSING VIA PARAPHRASING**" }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-36", "text": "The ParaSempre system of [1] is based on the idea of generating a set of candidate logical forms from the query using a set of templates." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-37", "text": "For example, the query Who did Brad Pitt play in Troy? would generate the logical form as well as many incorrect logical forms." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-38", "text": "These are built by finding substrings of the query that approximately match Freebase entities and then applying relations that match the type of the entity." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-39", "text": "Given a logical form, a canonical utterance is generated, again using a set of rules, which depend on the syntactic type of the description of the entities." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-40", "text": "To Fig. 1 ." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-41", "text": "Questions from the WebQuestions dataset containing the term \"currency\"." 
}, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-42", "text": "-Features extracted from the logical form itself, such as the size of the denotation of a logical form, i.e. the number of results returned when evaluating the logical form as a query on the database." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-43", "text": "This is important, since many incorrect logical forms have denotation zero; this feature acts as a filter removing these." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-44", "text": "-Features derived from an association model." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-45", "text": "This involves examining spans in the query and canonical utterance and looking for paraphrases between these spans." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-46", "text": "These paraphrases are derived from a large paraphrase corpus and WordNet [14] ." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-47", "text": "-Features derived from a vector space model built using Word2Vec [15] ." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-48", "text": "In an analysis on the development set of WebQuestions, the authors showed that removing the vector space model lead to a small drop in performance, removing the asssociation model gave a larger drop, and removing both of these halved the performance score." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-49", "text": "----------------------------------" }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-50", "text": "**TENSOR KERNERLS FOR SEMANTIC PARSING**" }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-51", "text": "We know that simple patterns or occurrences in the query can be used to identify a correct logical form with high probability, as with the \"currency\" example." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-52", "text": "We still need some way of identifying these patterns and linking them up to appropriate logical forms." 
}, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-53", "text": "In this section we discuss one approach for doing this." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-54", "text": "Our goal is to learn a mapping from queries to logical forms." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-55", "text": "One way of doing this to consider a fixed number of logical forms for each query sentence, and train a classifier to choose the best logical form given a sentence [1] ." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-56", "text": "In order to ues this approach, we need a single feature vector for each pair of queries and logical forms." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-57", "text": "Our proposal is to extract features for each query and logical form indepdendently, and to take their tensor product as the combined vector." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-58", "text": "Explicitly, let Q be the set of all possible queries and \u039b be the set of all possible logical forms." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-59", "text": "For each query q \u2208 Q and logical form \u03bb \u2208 \u039b, we represent the pair (q, \u03bb) by the vector:" }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-60", "text": "where \u03c6 Q and \u03c6 \u039b map queries and logical forms to a vector space, i.e. perform feature extraction." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-61", "text": "Whilst this could potentially be a large space, note that we can use the kernel trick to avoid computing very large vectors, using a simple identity of dot products on tensor spaces:" }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-62", "text": "The advantage of using the tensor product is that it preserves all the information of the original vectors, allowing us to learn how features relating to queries map to features relating to logical forms." 
}, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-63", "text": "More generally, instead of representing the query and logical form as vectors directly, this can be done implicitly using kernels." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-64", "text": "For example, we may use a string kernel \u03ba 1 on Q and a tree kernel \u03ba 2 on \u039b, then define the kernel \u03ba(q, \u03bb) = \u03ba 1 (q)\u03ba 2 (\u03bb) on Q \u00d7 \u039b. This idea is closely related to the Schur product kernel [16] ." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-65", "text": "It is worth noting at this point that, while what we really want is a one-toone mapping from queries to logical forms, the classifier actually gives us a set of logical forms for each query: we simply ask it to classify each pair (q, \u03bb)." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-66", "text": "In a probabilistic approach, such as logistic regression, we can choose the \u03bb for which the classifier gives the highest probability for (q, \u03bb)." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-67", "text": "----------------------------------" }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-68", "text": "**APPLICATION TO SEMANTIC PARSING VIA PARAPHRASING**" }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-69", "text": "There are clearly many ways we could map queries and logical forms to vectors." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-70", "text": "In this paper we will consider one simple approach in which we use unigrams as the features for both the query and the canonical utterance associated with the logical form." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-71", "text": "In this case, the tensor product of the vectors corresponds directly to the cartesian product of the unigrams derived from the query with those from the canonical utterance." 
}, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-72", "text": "Recall that given two vector spaces U and V of dimensionality n and m, the tensor product space U \u2297 V has dimensionality nm." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-73", "text": "If we have bases for U and V , then we can construct a basis for U \u2297 V ." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-74", "text": "For each pair of basis vectors u and v in U and V respectively, we take a single basis vector u \u2297 v \u2208 U \u2297 V ." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-75", "text": "In our case, the dimensions of U and V correspond to terms that can occur as unigram features in the query or canonical utterance respectively." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-76", "text": "Thus each basis vector of U \u2297 V corresponds to a pair of unigram features." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-77", "text": "As an example from the WebQuestions dataset, consider the query, What 5 countries border ethiopia?, and the canonical utterance The adjoins of ethiopia?, whose associated logical form gives the correct answer." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-78", "text": "Then there will be a dimension in the tensor product for each pair of words; for example the dimensions associated with (countries, adjoins) and (border, adjoins), as well as less useful pairs such as (5, ethiopia) would all have non-zero values in the tensor product." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-79", "text": "Thus we are able to learn that if we see borders in the query, then a logical form whose canonical utterance contains the term adjoins is a likely candidate to answer the query." 
}, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-80", "text": "----------------------------------" }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-81", "text": "**EMPIRICAL EVALUATION**" }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-82", "text": "----------------------------------" }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-83", "text": "**DATASET**" }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-84", "text": "We evaluated our system on the WebQuestions dataset [10] ." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-85", "text": "This consists of 5,810 question-answer pairs." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-86", "text": "The questions were obtained by querying the Google Suggest API, and answers were obtained using Amazon Mechanical Turk." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-87", "text": "We used the standard train/test split supplied with the dataset, and used crossvalidation on the training set for development purposes." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-88", "text": "----------------------------------" }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-89", "text": "**IMPLEMENTATION**" }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-90", "text": "We built our implementation on top of the ParaSempre system [1] , and so our evaluation exactly matches theirs." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-91", "text": "Our implementation is freely available online." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-92", "text": "1 We substituted the paraphrase system of ParaSempre with our tensor kernel-based system (i.e. we excluded features from both the association and vector space models), but we included the ParaSempre features derived from logical forms." 
}, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-93", "text": "To implement our tensor kernel of unigram features, we simply added all pairs of terms in the query and canonical utterance as features; in preliminary experiments we found that this was fast enough and we did not need to use the kernel trick, which could potentially provide further speed-ups." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-94", "text": "We did not implement any feature selection methods which may also help with efficiency." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-95", "text": "For evaluation, we report the average of the F1 score measured on the set of entities returned by the logical form when evaluated on the database, when compared to the correct set of entities." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-96", "text": "This allows, for example, to get a nonzero score for returning a similar set of entities to the correct one." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-97", "text": "For example, if we return the set {Jaxon Bieber} as an answer to the query Who is Justin Bieber's brother? we allow a nonzero score (the correct answer according to the dataset is {Jazmyn Bieber, Jaxon Bieber})." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-98", "text": "----------------------------------" }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-99", "text": "**RESULTS**" }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-100", "text": "Results are reported in Table 1 ." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-101", "text": "Our system achieves an average F1 score of 40.1%, compared to ParaSempre's 39.9%." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-102", "text": "Our system runs faster however, due to the simpler method of generating features." 
}, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-103", "text": "Evaluating using ParaSempre on the development set took 22h31m; using the tensor kernel took 14h44m on a comparable machine." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-104", "text": "Since we have adopted the logical form templates of ParaSempre, our upper bound or oracle F1 score is the same, 63% [1] ." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-105", "text": "This is the score that would be obtained if we knew which was the best logical form out of all those generated." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-106", "text": "In contrast, Microsoft's DeepQA has an oracle F1 score of 77.3% [11] ; this could account for a large amount of the overall increase in their system." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-107", "text": "There is no reported oracle score for the Facebook system [13] ." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-108", "text": "Table 2 shows the top unigram feature pairs after training on the WebQuestions training set." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-109", "text": "It is clear that, whilst there are some superfluous features that simply learn to replace a word with itself (for example currency with currency, there are obviously many useful features that would be nontrivial to identify accurately." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-110", "text": "There are also spurious ones such as the pair (live, birthplace); this is perhaps due to a large proportion of people who live in their birthplace." 
}, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-111", "text": "----------------------------------" }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-112", "text": "**DISCUSSION**" }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-113", "text": "Average F1 score Sempre [10] 35.7 ParaSempre [1] 39.9 Facebook [13] 41.8 DeepQA [11] 45.3 Tensor kernel with unigrams 40.1 In development, we found that ordering the training alphabetically by the text of the query lead to a large reduction in accuracy." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-114", "text": "2 Ordering alphabetically when performing the split for cross validation (instead of random ordering) means that a lot of queries on the same topic are grouped together, increasing the likelihood that a query on a topic seen at test time would not have been seen at training time." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-115", "text": "This validates our hypothesis that simple techniques work well because of the homogeneous nature of the dataset." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-116", "text": "We would argue that this does not invalidate the techniques however, as it is likely that real-world datasets also have this property." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-117", "text": "It is a feature of our tensor product model that there is no direct interaction between the features from the query and those from the logical form." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-118", "text": "This is evidenced by the fact that the system has to learn that the term currency in the query maps to currency in the canonical utterance." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-119", "text": "This hints at ways of improving over our current system." 
}, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-120", "text": "More interestingly, it also means that we are currently making very light use of the canonical utterance generation; in the canonical utterance, currency could be replaced by any symbol and our system would learn the same relationship." }, { "sent_id": "8ad5f7a658a8bb5377981ba6b098dc-C001-121", "text": "This points at another route of investigation involving generating features for use in the tensor kernel directly from the logical form instead of via canonical utterances." } ], "y": { "@SIM@": { "gold_contexts": [ [ "8ad5f7a658a8bb5377981ba6b098dc-C001-23" ], [ "8ad5f7a658a8bb5377981ba6b098dc-C001-32" ], [ "8ad5f7a658a8bb5377981ba6b098dc-C001-90" ], [ "8ad5f7a658a8bb5377981ba6b098dc-C001-104" ] ], "cite_sentences": [ "8ad5f7a658a8bb5377981ba6b098dc-C001-23", "8ad5f7a658a8bb5377981ba6b098dc-C001-32", "8ad5f7a658a8bb5377981ba6b098dc-C001-90", "8ad5f7a658a8bb5377981ba6b098dc-C001-104" ] }, "@USE@": { "gold_contexts": [ [ "8ad5f7a658a8bb5377981ba6b098dc-C001-23" ], [ "8ad5f7a658a8bb5377981ba6b098dc-C001-32" ], [ "8ad5f7a658a8bb5377981ba6b098dc-C001-90" ], [ "8ad5f7a658a8bb5377981ba6b098dc-C001-104" ] ], "cite_sentences": [ "8ad5f7a658a8bb5377981ba6b098dc-C001-23", "8ad5f7a658a8bb5377981ba6b098dc-C001-32", "8ad5f7a658a8bb5377981ba6b098dc-C001-90", "8ad5f7a658a8bb5377981ba6b098dc-C001-104" ] }, "@BACK@": { "gold_contexts": [ [ "8ad5f7a658a8bb5377981ba6b098dc-C001-36" ] ], "cite_sentences": [ "8ad5f7a658a8bb5377981ba6b098dc-C001-36" ] }, "@MOT@": { "gold_contexts": [ [ "8ad5f7a658a8bb5377981ba6b098dc-C001-55" ] ], "cite_sentences": [ "8ad5f7a658a8bb5377981ba6b098dc-C001-55" ] }, "@DIF@": { "gold_contexts": [ [ "8ad5f7a658a8bb5377981ba6b098dc-C001-113" ] ], "cite_sentences": [ "8ad5f7a658a8bb5377981ba6b098dc-C001-113" ] }, "@EXT@": { "gold_contexts": [ [ "8ad5f7a658a8bb5377981ba6b098dc-C001-113" ] ], "cite_sentences": [ "8ad5f7a658a8bb5377981ba6b098dc-C001-113" ] } } }, 
"ABC_d987872352e4602fd48936cf2fdab8_25": { "x": [ { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-98", "text": "----------------------------------" }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-2", "text": "In large-scale domain classification, an utterance can be handled by multiple domains with overlapped capabilities." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-3", "text": "However, only a limited number of ground-truth domains are provided for each training utterance in practice while knowing as many as correct target labels is helpful for improving the model performance." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-4", "text": "In this paper, given one ground-truth domain for each training utterance, we regard domains consistently predicted with the highest confidences as additional pseudo labels for the training." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-5", "text": "In order to reduce prediction errors due to incorrect pseudo labels, we leverage utterances with negative system responses to decrease the confidences of the incorrectly predicted domains." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-6", "text": "Evaluating on user utterances from an intelligent conversational system, we show that the proposed approach significantly improves the performance of domain classification with hypothesis reranking." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-7", "text": "----------------------------------" }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-9", "text": "Domain classification is a task that predicts the most relevant domain given an input utterance [1] ." 
}, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-10", "text": "1 It is becoming more challenging since recent conversational interaction systems such as Amazon Alexa, Google Assistant, and Microsoft Cortana support more than thousands of domains developed by external developers [4, 3, 5] ." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-11", "text": "As they are independently and rapidly developed without a centralized ontology, multiple domains have overlapped capabilities that can process the same utterances." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-12", "text": "For example, \"make an elephant sound\" can be processed by AnimalSounds, AnimalNoises, and ZooKeeper domains." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-13", "text": "Since there are a large number of domains, which are even frequently added or removed, it is infeasible to obtain all the ground-truth domains of the training utterances, and domain classifiers for conversational interaction systems are usually trained given only a small number (usually one) of groundtruths in the training utterances." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-14", "text": "This setting corresponds to multi-label positive and unlabeled (PU) learning, where assigned labels are positive, unassigned labels are not neces-sarily negative, and one or more labels are assigned for an instance [6, 7] ." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-15", "text": "2 In this paper, we utilize user log data, which contain triples of an utterance, the predicted domain, and the response, for the model training." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-16", "text": "Therefore, we are given only one ground-truth for each training utterance." 
}, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-17", "text": "In order to improve the classification performance in this setting, if certain domains are repeatedly predicted with the highest confidences even though they are not the ground-truths of an utterance, we regard the domains as additional pseudo labels." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-18", "text": "This is closely related to pseudo labeling [8] or self-training [9, 10, 11] ." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-19", "text": "While the conventional pseudo labeling is used to derive target labels for unlabeled data, our approach adds pseudo labels to singly labeled data so that the data can have multiple target labels." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-20", "text": "Also, the approach is related to selfdistillation, which leverages the confidence scores of the nontarget outputs to improve the model performance [12, 13] ." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-21", "text": "While distillation methods utilize the confidence scores as the soft targets, pseudo labeling regards high confident outputs as the hard targets to further boost their confidences." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-22", "text": "We use both pseudo labeling and self-distillation in our work." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-23", "text": "Pseudo labels can be wrongly derived when irrelevant domains are top predicted, which can lead the model training with wrong supervision." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-24", "text": "To mitigate this issue, we leverage utterances with negative system responses to lower the prediction confidences of the failing domains." 
}, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-25", "text": "For example, if a system response of a domain for an input utterance is \"I don't know that one\", the domain is regarded as a negative ground-truth since it fails to handle the utterance." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-26", "text": "Evaluating on an annotated dataset from the user logs of a large-scale conversation interaction system, we show that the proposed approach significantly improves the domain classification especially when hypothesis reranking is used [14, 5] ." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-27", "text": "----------------------------------" }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-28", "text": "**MODEL OVERVIEW**" }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-29", "text": "We take a hypothesis reranking approach, which is widely used in large-scale domain classification for higher scalabil- ity [14, 5] ." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-30", "text": "Within the approach, a shortlister, which is a light-weighted domain classifier, suggests the most promising k domains as the hypotheses." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-31", "text": "We train the shortlister along with the added pseudo labels, leveraging negative system responses, and self-distillation, which are described in Section 3. Then a hypothesis reranker selects the final prediction from the k hypotheses enriched with additional input features, which is described in Section 4." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-32", "text": "----------------------------------" }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-33", "text": "**SHORTLISTER MODEL**" }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-34", "text": "Our shortlister architecture is shown in Figure 1 ." 
}, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-35", "text": "The words of an input utterance are represented as contextualized word vectors by bidirectional long short-term memory (BiLSTM) on top of the word embedding layer [15] ." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-36", "text": "Then, the concatenation of the last outputs of the forward LSTM and the backward LSTM is used to represent the utterance as a vector." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-37", "text": "3 Following [3] and [18] , we leverage the domain enablement information 4 through attention mechanism [19] , where the weighted sum of enabled domain vectors followed by sigmoid activation is concatenated to the utterance vector for representing a personalized utterance." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-38", "text": "On top of the personalized utterance vector, a feed-forward neural network followed by sigmoid activation is used to obtain n-dimensional output vector o, where the prediction confidence of each domain is represented as a scalar value between 0 and 1." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-39", "text": "Given an input utterance and its target label, binary cross entropy is used as the baseline loss function as follows:" }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-40", "text": "where o, y, and n denote the model output vector, the onehot vector of the target label, and the number of total labels." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-41", "text": "We describe other proposed loss functions in the following subsections." 
}, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-42", "text": "----------------------------------" }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-43", "text": "**DERIVING PSEUDO LABELS**" }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-44", "text": "We hypothesize that the outputs repeatedly predicted with the highest confidences are indeed correct labels in many cases in multi-label PU learning setting." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-45", "text": "This approach is closely related to pseudo labeling [8] or self-training [9, 10, 11] in semi-supervised learning since our model is supervised with additional pseudo labels, but differs in that our approach assigns pseudo labels to singly labeled train sets rather than unlabeled data sets." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-46", "text": "We derive the pseudo labels when the following conditions are met:" }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-47", "text": "\u2022 Maximally p domains predicted with the highest confidences that are higher than the confidence of the known ground-truth." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-48", "text": "\u2022 Domains predicted with the highest confidences for r times consecutively so that consistent top predictions are used as pseudo labels." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-49", "text": "For the experiments in Section 5, we use p=2 and r=4, which show the best dev set performance." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-50", "text": "Those derived pseudo labels are used in the model training as follows:" }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-51", "text": "where\u1ef9 denotes an n-hot vector such that the elements corresponding to the original ground-truth and the additional pseudo labels are set to 1." 
}, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-52", "text": "----------------------------------" }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-53", "text": "**LEVERAGING NEGATIVE FEEDBACK**" }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-54", "text": "During the model training, irrelevant domains could be top predicted, and regarding them as additional target labels results in wrong confirmation bias [20] , which causes incorrect model training." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-55", "text": "To reduce the side effect, we leverage utterances with negative responses in order to discourage the utterances' incorrect predictions." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-56", "text": "This setting can be considered as a multi-label variant of Positive, Unlabeled, and Biased Negative Data (PUbN) learning [21] ." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-57", "text": "We obtain training utterances from log data, where utterances with positive system responses are used as the positive train set in Equation 1 and 2 while the utterances with negative responses are used as the negative train set in Equation 3." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-58", "text": "For example, AnimalSounds is a (positive) ground-truth domain for \"a monkey sound\" because the system response to the utterance is \"Here comes a monkey sound\" while it is a negative ground-truth for \"a dragon sound\" as the response is \"I don't know what sound a dragon makes\"." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-59", "text": "5 6 Previous work [22, 23] excludes such negative utterances from the training set." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-60", "text": "We find that it is more effective to explicitly demote the prediction confidences of the domains resulted in negative responses if they are top ranked." 
}, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-61", "text": "It is formulated as a loss function:" }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-62", "text": "where j denotes the index corresponding to the negative ground-truth domain." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-63", "text": "We demote the confidences of the negative ground-truths only when they are the highest so that the influence of using the negative ground-truths is not overwhelming." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-64", "text": "7" }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-65", "text": "----------------------------------" }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-66", "text": "**SELF-DISTILLATION**" }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-67", "text": "Knowledge distillation has been shown to improve the model performance by leveraging the prediction confidence scores from another model or from previous epochs [12, 13, 18] ." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-68", "text": "Inspired by [18] , we utilize the model at the epoch showing the best dev set performance before the current epoch to obtain the prediction confidence scores as the soft target." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-69", "text": "The selfdistillation in our work can be formulated as follows:" }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-70", "text": "where\u00f5 i denotes the model output at the epoch showing the best dev set performance so far." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-71", "text": "Before taking sigmoid to obtain\u00f5 i , we use 16 as the temperature to increase the influence of distillation [12] , which shows the best dev set performance following [18] ." 
}, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-72", "text": "----------------------------------" }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-73", "text": "**COMBINED LOSS**" }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-74", "text": "The model is optimized with a combined loss function as follows:" }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-75", "text": "where \u03b1 t = 1 \u2212 0.95 t and t is the current epoch so that the baseline loss is mainly used in the earlier epochs while the 5 Recent intelligent conversational systems can support thousands of domains, many of which are with very specific/narrow capabilities." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-76", "text": "For \"a dragon sound\", DungeonSound and DragonFire are the correct domains, and predicting ZooKeeper is incorrect." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-77", "text": "6 Negative responses can be easily identified since the responses are generated from predefined templates." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-78", "text": "We use 2K template patterns to extract such responses." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-79", "text": "7 In our experiments, reducing confidences of negative ground-truths regardless of the confidence ranks shows worse performance." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-80", "text": "----------------------------------" }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-81", "text": "**BILSTM \u2026**" }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-82", "text": "Hypotheses pseudo labels and self-distillation are more contributing in the later epochs following [24] ." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-83", "text": "\u03b2 is a hyperparameter for utilizing negative ground-truths, which is set to 0.00025 showing the best dev set performance." 
}, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-84", "text": "Figure 2 shows the overall architecture of the hypothesis reranker that is similar to [5] ." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-85", "text": "First, we run intent classification and slot filling for the k most confident domains from the shortlister outputs to obtain additional information for those domains [1] ." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-86", "text": "8 Then, we compose k hypotheses, each of which is a vector consists of the shortlister confidence score, intent score, Viterbi score of slot-filling, domain vector, intent vector, and the summation of the slot vectors." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-87", "text": "On top of the k hypothesis vectors, a BiLSTM is utilized for representing contextualized hypotheses and a shared feed-forward neural network is used to obtain final confidence score for each hypothesis." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-88", "text": "We set k=3 in our experiments following [5] ." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-89", "text": "We leverage the given ground-truth and the derived pseudo labels from the shortlister at the epoch showing the best dev set performance as target labels for training the reranker." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-90", "text": "We use hinge loss with margin 0.4 as the loss function." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-91", "text": "One issue of the hypothesis reranking is that a training utterance cannot be used if no ground-truth exist in the top k predictions of the shortlister." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-92", "text": "This is problematic in the multilabel PU setting since correct domains can indeed exist in the top k list but unknown, which makes the training utterance less useful in the reranking." 
}, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-93", "text": "Our pseudo labeling method can address this issue." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-94", "text": "If correct pseudo labels are derived from the shortlister's top predictions for such utterances, we can use them properly in the reranker training, which was unavailable without them." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-95", "text": "This allows our approach make more improvement in hypothesis reranking than shortlisting." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-96", "text": "Table 2 ." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-97", "text": "Examples of additional pseudo labels." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-99", "text": "**HYPOTHESIS RERANKING MODEL**" }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-100", "text": "----------------------------------" }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-101", "text": "**EXPERIMENTS**" }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-102", "text": "In this section, we show training and evaluation sets, and experiment results." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-103", "text": "----------------------------------" }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-104", "text": "**DATASETS**" }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-105", "text": "We utilize utterances with explicit invocation patterns from an intelligent conversational system for the model training similarly to [5] and [18] ." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-106", "text": "For example, given \"ask {AmbientSounds} to {play thunderstorm sound}\", we extract \"play thunderstorm\" as the input utterance and Ambient Sounds as the ground-truth." 
}, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-107", "text": "One difference from the previous work is that we utilize utterances with positive system responses as the positive train set and the dev set, and use those with the negative responses as the negative train set as described in Section 3.2." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-108", "text": "We have extracted 3M positive train, 400K negative train, and 600K dev sets from 4M log data with 2,500 most frequent domains as the ground-truths." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-109", "text": "Pseudo labels are added to 53K out of 3M in the positive train set as described in Section 3.1." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-110", "text": "For the evaluation, we have extracted 10K random utterances from the user log data and independent annotators labeled the top three predictions of all the evaluated models for each utterance so that we can correctly compute nDCG at rank position 3." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-111", "text": "Table 1 shows the evaluation results of the shortlister and the hypothesis reranker with the proposed approaches." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-112", "text": "For the shortlisters, we show nDCG 3 scores, which are highly correlated with the F1 scores of the rerankers than other metrics since the second and third top shortlister predictions contribute the metric." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-113", "text": "We find that just using the pseudo labels as the additional targets degrades the performance (2) ." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-114", "text": "However, when both the pseudo labels and the negative ground-truths are utilized, we observe significant improvements for both precision and recall (5) ." 
}, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-115", "text": "In addition, recall is increased when self-distillation is used, which achieves the best F1 score (6) ." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-116", "text": "Each of utilizing the negative feedback ((1) \u2192 (3) and (2) \u2192 (5)) and then additional pseudo labels ((3) \u2192 (5) and (4) \u2192 (6)) show statistically significant improvements with McNemar test for p=0.05 for the final reranker results." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-117", "text": "----------------------------------" }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-118", "text": "**EXPERIMENT RESULTS**" }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-119", "text": "Using self-distillation ((3) \u2192 (4) and (5) \u2192 (6)) shows increased F-1 score by increasing recall and decreasing precision, but the improvements are not significant." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-120", "text": "One issue is that pseudo labeling and self-distillation are contrary since the former encourages entropy minimization [26, 8] while the latter can increase entropy by soft targeting the non-target labels." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-121", "text": "More investigation of self-distillation along with the proposed pseudo labeling would be future work." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-122", "text": "Table 2 shows examples of derived pseudo labels from model (6) ." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-123", "text": "It demonstrates that the domains capable of processing the utterances can be derived, which helps more correct model training." 
}, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-124", "text": "----------------------------------" }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-125", "text": "**CONCLUSION**" }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-126", "text": "We have proposed deriving pseudo labels along with leveraging utterances with negative system responses and selfdistillation to improve the performance of domain classification when multiple domains are ground-truths even if only one ground-truth is known in large-scale domain classification." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-127", "text": "Evaluating on the test utterances with multiple groundtruths from an intelligent conversational system, we have showed that the proposed approach significantly improves the performance of domain classification with hypothesis reranking." }, { "sent_id": "d987872352e4602fd48936cf2fdab8-C001-128", "text": "As future work, combining our approach with pure semisupervised learning, and the relation between pseudo labeling and distillation should be further studied." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "d987872352e4602fd48936cf2fdab8-C001-10" ] ], "cite_sentences": [ "d987872352e4602fd48936cf2fdab8-C001-10" ] }, "@DIF@": { "gold_contexts": [ [ "d987872352e4602fd48936cf2fdab8-C001-26" ] ], "cite_sentences": [ "d987872352e4602fd48936cf2fdab8-C001-26" ] }, "@SIM@": { "gold_contexts": [ [ "d987872352e4602fd48936cf2fdab8-C001-29" ], [ "d987872352e4602fd48936cf2fdab8-C001-84" ], [ "d987872352e4602fd48936cf2fdab8-C001-88" ] ], "cite_sentences": [ "d987872352e4602fd48936cf2fdab8-C001-29", "d987872352e4602fd48936cf2fdab8-C001-84", "d987872352e4602fd48936cf2fdab8-C001-88" ] }, "@USE@": { "gold_contexts": [ [ "d987872352e4602fd48936cf2fdab8-C001-29" ], [ "d987872352e4602fd48936cf2fdab8-C001-105" ] ], "cite_sentences": [ "d987872352e4602fd48936cf2fdab8-C001-29", "d987872352e4602fd48936cf2fdab8-C001-105" ] } } }, "ABC_22dc2a38e29a1f5ac55c9ac220782b_25": { "x": [ { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-2", "text": "Despite their superior performance, deep learning models often lack interpretability." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-3", "text": "In this paper, we explore the modeling of insightful relations between words, in order to understand and enhance predictions." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-4", "text": "To this effect, we propose the Self-Attention Network (SANet), a flexible and interpretable architecture for text classification." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-5", "text": "Experiments indicate that gains obtained by self-attention is task-dependent." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-6", "text": "For instance, experiments on sentiment analysis tasks showed an improvement of around 2% when using selfattention compared to a baseline without attention, while topic classification showed no gain." 
}, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-7", "text": "Interpretability brought forward by our architecture highlighted the importance of neighboring word interactions to extract sentiment." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-8", "text": "----------------------------------" }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-10", "text": "Deep neural networks have achieved great successes on numerous tasks." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-11", "text": "However, they are often seen as black boxes, lacking interpretability." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-12", "text": "Research efforts in order to solve this issue have steadily increased (Simonyan et al., 2013; Zeiler and Fergus, 2014; Bach et al., 2015; Ribeiro et al., 2016; Fong and Vedaldi, 2017) ." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-13", "text": "In language modeling, interpretability often takes place via an attention mechanism in the neural network (Bahdanau et al., 2014; Xu et al., 2015; Sukhbaatar et al., 2015; Choi et al., 2017) ." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-14", "text": "In this context, attention essentially allows a network to identify which words in a sentence are more relevant." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-15", "text": "Beyond interpretability, this often results in improved decision making by the network." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-16", "text": "Recently, Vaswani et al. (2017) proposed the Transformer architecture for machine translation." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-17", "text": "It relies only on attention mechanisms, instead of making use of either recurrent or convolutional * Authors contributed equally to this work." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-18", "text": "neural networks." 
}, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-19", "text": "This architecture contains layers called self-attention (or intra-attention) which allow each word in the sequence to pay attention to other words in the sequence, independently of their positions." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-20", "text": "We modified this architecture, resulting in the following contributions:" }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-21", "text": "\u2022 A novel architecture for text classification called Self-Attention Network (SANet) that models the interactions between all input word pairs." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-22", "text": "It is sequence length-agnostic, thanks to a global max pooling layer." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-23", "text": "\u2022 A study on the impact of this self-attention mechanism on large scale datasets." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-24", "text": "In particular, we empirically demonstrate the positive impact of self-attention in terms of performance and interpretability for sentiment analysis, compared to topic classification." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-25", "text": "In the study, we make use of two quantitative metrics (Gini coefficient and diagonality) that exhibit particular behaviors for attention mechanisms in sentiment analysis." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-26", "text": "----------------------------------" }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-27", "text": "**RELATED WORK**" }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-28", "text": "The majority of text classification techniques either use convolutional or recurrent neural networks on the words or the characters of the sentence (Zhang et al., 2015 Yang et al., 2016; Conneau et al., 2017; Zhang, 2016, 2017; Howard and Ruder, 2018) ." 
}, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-29", "text": "One notable exception is the fastText architecture (Joulin et al., 2016) which essentially employs a bag-of-words approach with word embeddings of the sentence." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-30", "text": "Attention mechanisms are a way to add interpretability in neural networks." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-31", "text": "They were introduced by Bahdanau et al. (2014) , where they achieved state-of-the-art in machine translation." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-32", "text": "Since then, attention mechanisms have been used in other language modeling tasks such as image captioning (Xu et al., 2015) , question answer-ing (Sukhbaatar et al., 2015; Choi et al., 2017) , and text classification (Yang et al., 2016) ." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-33", "text": "The concept of self-attention (Cheng et al., 2016; Parikh et al., 2016) , central to our proposed approach, has shown great promises in natural language processing; It produced state-of-the-art results for machine translation (Vaswani et al., 2017) ." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-34", "text": "In text classification, the focus on interpretability has thus far been limited." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-35", "text": "Lee et al. (2018) used a convolutional neural network (CNN) with Class Activation Mapping (CAM) (Oquab et al., 2015) to do sentiment analysis." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-36", "text": "CAM basically uses the weights of the classification layer to derive a heatmap on the input." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-37", "text": "Wang et al. (2018) used a densely connected CNN (Huang et al., 2017) to apply attention to n-grams." 
}, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-38", "text": "However, their approach limits the range and acuteness of the interactions between the words in the text. and Yang et al. (2016) both combined an attention mechanism with a recurrent neural network." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-39", "text": "The main difference with our work is, while being interpretable, these approaches do not perform true word-on-word attention across a whole sequence such as our self-attention layer." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-40", "text": "3 SANet: Self-Attention Network Inspired by the Transformer architecture (Vaswani et al., 2017) which performed machine translation without recurrent or convolutional layers, we propose the Self-Attention Network (SANet) architecture targeting instead text classification." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-41", "text": "One key difference between our approach and Vaswani et al. (2017) 's is that we only perform input-input attention with self-attention, as we do not have sequences as output but a text classification." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-42", "text": "Moreover, we employ global max pooling at the top, which enables our architecture to process input sequences of arbitrary length." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-43", "text": ". . . ; x T n ] be the concatenation of a sequence of n vectors giving a matrix X \u2208 R n\u00d7d such that x i \u2208 R d ." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-44", "text": "Vaswani et al. (2017) defined attention as a function with as input a triplet containing queries Q, keys K with associated values V ." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-45", "text": "In the case of self-attention, Q, K and V are linear projections of X. Thus, we define the dot-product (Vaswani et al., 2017) ." 
}, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-46", "text": "The self-attention block is repeated N times." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-47", "text": "self-attention mechanism as follows." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-48", "text": "Hence, W QK and W V are learned parameters." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-49", "text": "Our network (depicted in Figure 1 ) first encodes each word to its embedding." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-50", "text": "Pre-trained embeddings, like GloVe (Pennington et al., 2014) , may be used and fine-tuned during the learning process." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-51", "text": "Next, to inject information about the order of the words, the positional encoding layer adds location information to each word." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-52", "text": "We use the positional encoding vectors that were defined by Vaswani et al. (2017) as follows." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-78", "text": "The words embeddings were initialized using pre-trained word vectors from GloVe (Pennington et al., 2014) when available, or randomly initialized otherwise." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-53", "text": "A linear layer then performs dimensionality reduction/augmentation of the embedding space to a vector space of dimension d, which is kept constant throughout the network." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-54", "text": "It is followed by one or several \"self-attention blocks\" stacked one onto another." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-55", "text": "These blocks are comprised of a selfattention layer followed by a feed-forward network, both with residual connections." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-56", "text": "Contrary to Vaswani et al. 
(2017) , we only use a single attention head, with attention performed on the complete sequence with constant d-dimensional inputs." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-57", "text": "The feed-forward network consists of a single hidden layer with a ReLU." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-58", "text": "Where W 1 , W 2 \u2208 R d\u00d7d are learned parameters." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-59", "text": "The \"Add & Norm\" layer is a residual connection defined by LayerNorm(x + SubLayer(x)), where SubLayer(x) is the output of the previous layer and LayerNorm is a layer normalization method introduced by Ba et al. (2016) ." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-60", "text": "Let x i be the vector representation of an element in the input sequence." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-61", "text": "The normalization layer simply normalizes x i by the mean and the variance of its elements." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-62", "text": "Throughout this paper, dropout of 0.1 is applied to the output of SubLayer (x) Finally, since we restrict ourselves to classification, we need a fixed-size representation of the sequence before the classification layer." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-63", "text": "To achieve this, we apply a global max pooling operation for each dimension across all the n words of the sequence." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-64", "text": "That is, if X \u2208 R n\u00d7d , then the max pooling on X outputs a vector in R d ." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-65", "text": "This technique was inspired by global average pooling introduced by Lin et al. (2013) for image classification in CNNs." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-66", "text": "Global max pooling allows us to handle sequences of any length (up to memory limitations)." 
}, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-67", "text": "Thus, our approach is length-agnostic contrary to some approaches based on CNN, where sequences are truncated or padded to obtain a fixed-length representation." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-68", "text": "----------------------------------" }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-69", "text": "**EXPERIMENTS**" }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-70", "text": "We evaluated our model on seven large scale text classification datasets introduced by Zhang et al. (2015) , grouped into two kinds of tasks." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-71", "text": "The first one is topic classification: AG's News with 4 classes of news articles, DBPedia with 14 classes of the Wikipedia ontology and Yahoo! Answers containing 10 categories of questions/answers." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-72", "text": "Yelp and Amazon reviews involve sentiment analysis with ratings from 1 to 5 stars." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-73", "text": "Two versions are derived from those datasets: one for predicting the number of stars, and the other involving the polarity of the reviews (negative for 1-2 stars, positive for 4-5 stars)." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-74", "text": "Each text entry was split into sentences and tokenized using NLTK (Bird et al., 2009) ." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-75", "text": "Sequences longer than 1000 tokens were truncated to accommodate GPU memory limitations, only affecting a negligible portion of the texts." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-76", "text": "See Figure 2 for We used 20% of the training texts for validation." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-77", "text": "The vocabulary was built using every word appearing in the training and validation sets." 
}, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-79", "text": "We experimented with two configurations for our proposed SANet." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-80", "text": "The base model used N = 1 self-attention blocks, an embedding size of 100 and a hidden size of d = 128." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-81", "text": "The big model doubled these numbers, i.e. N = 2 self-attention blocks, embedding size of 200 and hidden size d = 256." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-82", "text": "For each configuration, we also trained a baseline network without any attention mechanisms, replacing each self-attention layer with a feed forward layer." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-83", "text": "Training was performed using SGD with a momentum of 0.9, a learning rate of 0.01 and minibatches of size 128." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-84", "text": "For the embeddings, a learning rate of 0.001 was applied without momentum." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-85", "text": "All learning rates were halved for the big model." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-86", "text": "We trained for 40 epochs and selected the best epoch, based on validation accuracy." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-87", "text": "----------------------------------" }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-88", "text": "**RESULTS AND DISCUSSION**" }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-89", "text": "From a performance perspective, as shown in Table 1, our model based entirely on attention is competitive while offering high level interpretability." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-90", "text": "There is a notable exception with Yelp Review Polarity that will be discussed." 
}, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-91", "text": "Our results also indicate that the increase in depth and representation size in the big model is beneficial, compared to the simpler base model." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-92", "text": "Most noteworthy, we noticed considerably different behaviors of the attention mechanism depending on the type of task." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-93", "text": "We offer an analysis below." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-94", "text": "----------------------------------" }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-95", "text": "**TOPIC CLASSIFICATION TASKS**" }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-96", "text": "On the topic classification task, the self-attention behavior can be described as looking for interactions between important concepts, without considering relative distance." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-97", "text": "As such, it acts similarly to a bag-of-word approach, while highlighting key elements and their associations." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-98", "text": "Thus, the attention matrix takes shape of active columns, one per concept." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-99", "text": "One such matrix is depicted in Figure 3a , where the attention is focused on distanced pairs such as (microsoft, class-action) or (settlement, billions) to help SANet predict the Business category, while the baseline wrongfully predicts Sci/Tech." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-100", "text": "We observed this column-based structure for attention matrix for every topic classification dataset, see Figure 4 GT: 4, SANet: 4, Baseline: 4 multiple examples." 
}, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-101", "text": "Although it adds interpretability to the model, our results seem to indicate that self-attention does not improve performances for topic classification, compared to the baseline." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-102", "text": "----------------------------------" }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-103", "text": "**SENTIMENT ANALYSIS TASKS**" }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-104", "text": "For sentiment analysis tasks, self-attention improves accuracy for every dataset and model configurations that we tested." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-105", "text": "For Yelp Review Polarity, although attention helps, the overall performances remain subpar." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-106", "text": "Noticeably for the other datasets, SANet is able to extract subtle interactions between words, with a strong focus on neighboring relation." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-107", "text": "Hence, the attention matrices are close to being band matrices, with interest concentrated on very small regions near the diagonal." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-108", "text": "This is observable in Figure 5 where multiple examples from all sentiment analysis datasets are presented." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-109", "text": "Concentration of the attention around the diagonal indicates that the useful features learned by the attention mechanism consist essentially of skip-bigrams with relatively small gaps." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-110", "text": "Of note, Wang and Manning (2012) previously observed consistent gains when including word bigram features to extract sentiment." 
}, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-111", "text": "Thus, our model corroborates this intuition about sentiment analysis while yielding interpretable insights on relevant word pairs across all possible skip-bigrams." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-112", "text": "Figure 3b is a typical example of such matrix with a band diagonal structure, for a 5-star Yelp review." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-113", "text": "A number of positive elements are highlighted by the self-attention mechanism such as i) the initial strong sentiment with the interaction between this with love and ! ii) the favorable comparison with even and better iii) the enticing openness to experiences with try and something and iv) the positive combination of two negative words with never and disappointed." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-114", "text": "Positional encoding helps the self-attention mechanism when interpreting words repetitions, in order to extract sentiment gradation." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-115", "text": "When repeating three times an adjective before the modified noun, attention on the adjective increases with their proximity to the noun: horrible horrible horrible service." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-116", "text": "Punctuation repetitions exhibit a similar behavior, as in the sentence \"love this place!!!\", where the words love and all three exclamation points apply attention to this with varying intensities: love this place ! ! ! ." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-117", "text": "This particular behavior of the model reinforces our belief that it learns intricate knowledge for the task of sentiment analysis." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-118", "text": "Entire attention heatmaps for complete sequences can be found in Figure 6 ." 
}, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-119", "text": "----------------------------------" }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-120", "text": "**QUANTITATIVE ANALYSIS**" }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-121", "text": "We now present a quantitative analysis of the attention matrices to support the qualitative intuition stated previously." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-122", "text": "Two metrics are used in order to assess the properties of the matrices; the first one (Gini coefficient) quantifies the sparsity of the attention, whereas the second one (diagonality) focuses on the diagonal concentration." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-123", "text": "These two properties are relevant for interpretability issues." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-124", "text": "The results are presented in Table 2 ." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-125", "text": "The Gini coefficient which measures the inequality in the attention weights distribution is first computed." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-126", "text": "For topic classification datasets, the mean of the Gini coefficient is 63.57%, whereas, for sentiment analysis datasets, it raises at 87.15% without considering Yelp Review Polarity." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-127", "text": "Thus, for topic classification it reveals that every word interacts with multiple other words in the sequence." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-128", "text": "On the other hand, for sentiment analysis, the attention is focused on a fewer number of word pairs." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-129", "text": "The second metric will also point out that the sentiment analysis attention is sparse and specifically based on pair of words that are close in the sentence." 
}, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-130", "text": "This structurally corresponds to an attention matrix concentrated near the diagonal and justifies the introduction of the following metric." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-131", "text": "This new metric evaluates the resemblance with a band matrix by computing the proportion of attention weights which occur inside the band diagonal of a given bandwidth b, thus the band diagonality or diagonality for short." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-132", "text": "It expresses the interactions of every element with itself, and the b elements before and after in the sequence." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-133", "text": "This metric of diagonality was computed for a bandwidth of b = 1, 2, . . . , 5 as presented in Table 2 ." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-134", "text": "Results clearly reveal that sentiment analysis attention matrices are structurally close to being band matrices." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-135", "text": "Notably, with a bandwidth of b = 3 for topic classification, 16.12% of the weights occur inside the band diagonal, as for sentiment analysis without considering Yelp Review Polarity, 63.48% is located inside the band diagonal." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-136", "text": "In our opinion, the combination of these two metrics supports our qualitative observations of the attention matrices." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-137", "text": "It strengthens the difference in attention behavior between the topic classification and sentiment analysis task." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-138", "text": "Moreover, this quantitative analysis clearly exposes SANet inability to learn the appropriate attention behavior for sentiment analysis with Yelp Review Polarity." 
}, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-139", "text": "Its failure to adequately exploit the self-attention mechanism coincide with its poor performance to extract sentiment." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-140", "text": "Interestingly, Yelp Review Polarity examples are a subset of Yelp Review Full with merged classes, for which SANet performs well with the expected attention behavior." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-141", "text": "The cause of this discrepancy with the Yelp datasets is unknown and left for future work as is some linguistic investigation of the impact of close interacting words in sentiment analysis." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-142", "text": "----------------------------------" }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-143", "text": "**CONCLUSION**" }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-144", "text": "In this paper, we introduced the Self-Attention Network (SANet), an attention-based lengthagnostic model architecture for text classification." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-145", "text": "Our experiments showed that self-attention is important for sentiment analysis." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-146", "text": "Moreover, the improved interpretability of the model through attention visualization enabled us to discover considerably different behaviors of our attention mechanism between the topic classification and sentiment analysis tasks." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-147", "text": "The interpretable perspective of this work gives insights on the importance of modeling interaction between neighboring words in order to accurately extract sentiment, as noted by (Wang and Manning, 2012) for bigrams." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-148", "text": "It highlights how interpretability can help us understand models behavior to guide future research." 
}, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-149", "text": "In the future, we hope to apply our Self-Attention Network to other datasets such as bullying detection on social network data and tasks from various fields, such as genomic data in bioinformatics." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-150", "text": "Finally, we wish to study the properties of the introduced global max pooling layer as a complementary tool for interpretability in a similar way that was done with CAM (Oquab et al., 2015) for global average pooling." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-151", "text": "The outcome will be some attention on individual words that can take into account the context given by the self-attention mechanism." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-152", "text": "This contrast with the approach of this paper which focuses on interaction between elements as pairs." }, { "sent_id": "22dc2a38e29a1f5ac55c9ac220782b-C001-153", "text": "Thus we are allowed to expect that these two mechanisms will act in a complementary way to enrich interpretability." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "22dc2a38e29a1f5ac55c9ac220782b-C001-16" ], [ "22dc2a38e29a1f5ac55c9ac220782b-C001-44" ], [ "22dc2a38e29a1f5ac55c9ac220782b-C001-45" ] ], "cite_sentences": [ "22dc2a38e29a1f5ac55c9ac220782b-C001-16", "22dc2a38e29a1f5ac55c9ac220782b-C001-44", "22dc2a38e29a1f5ac55c9ac220782b-C001-45" ] }, "@MOT@": { "gold_contexts": [ [ "22dc2a38e29a1f5ac55c9ac220782b-C001-16" ] ], "cite_sentences": [ "22dc2a38e29a1f5ac55c9ac220782b-C001-16" ] }, "@SIM@": { "gold_contexts": [ [ "22dc2a38e29a1f5ac55c9ac220782b-C001-33" ], [ "22dc2a38e29a1f5ac55c9ac220782b-C001-52" ] ], "cite_sentences": [ "22dc2a38e29a1f5ac55c9ac220782b-C001-33", "22dc2a38e29a1f5ac55c9ac220782b-C001-52" ] }, "@DIF@": { "gold_contexts": [ [ "22dc2a38e29a1f5ac55c9ac220782b-C001-40" ], [ "22dc2a38e29a1f5ac55c9ac220782b-C001-41" ], [ "22dc2a38e29a1f5ac55c9ac220782b-C001-56" ] ], "cite_sentences": [ "22dc2a38e29a1f5ac55c9ac220782b-C001-40", "22dc2a38e29a1f5ac55c9ac220782b-C001-41", "22dc2a38e29a1f5ac55c9ac220782b-C001-56" ] } } }, "ABC_1527ce2786adfe0decf8c926a3d846_25": { "x": [ { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-38", "text": "----------------------------------" }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-39", "text": "**GENERATING SIMPLIFICATIONS**" }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-2", "text": "Lexical simplification of scientific terms represents a unique challenge due to the lack of a standard parallel corpora and fast rate at which vocabulary shift along with research." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-3", "text": "We introduce SimpleScience, a lexical simplification approach for scientific terminology." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-4", "text": "We use word embeddings to extract simplification rules from a parallel corpora containing scientific publications and Wikipedia." 
}, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-5", "text": "To evaluate our system we construct SimpleSciGold, a novel gold standard set for science-related simplifications." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-6", "text": "We find that our approach outperforms prior context-aware approaches at generating simplifications for scientific terms." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-7", "text": "----------------------------------" }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-9", "text": "Lexical simplification, the process of reducing the complexity of words by replacing them with simpler substitutes (e.g., sodium in place of Na; insects in place of lepidopterans) can make scientific texts more accessible to general audiences." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-10", "text": "Human-inthe-loop interfaces present multiple possible simplifications to a reader (on demand) in place of jargon and give the reader familiar access points to understanding jargon (Kim et al., 2015) ." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-11", "text": "Unfortunately, simplification techniques are not yet of high enough quality for fully automated scenarios." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-12", "text": "Currently lexical simplification pipelines for scientific texts are rare." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-13", "text": "The vast majority of prior methods assume a domain independent context, and rely on Wikipedia and Simple English Wikipedia, a subset of Wikipedia using simplified grammar and terminology, to learn simplifications (Biran et al., 2011; Paetzold and Specia, 2015) , with translationbased approaches using an aligned version (Coster and Kauchak, 2011; Horn et al., 2014; Yatskar et al., 2010) ." 
}, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-14", "text": "However, learning simplifications from Wikipedia is not well suited to lexical simplification of scientific terms." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-15", "text": "Though generic or established terms may appear in Wikipedia, novel terms associated with new advances may not be reflected." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-37", "text": "We obtained all datasets from DeepDive (R\u00e9 and Zhang, 2015" }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-16", "text": "Wikipedia's editing rules also favor generality over specificity and eliminate redundancy, both of which are problematic in providing a rich training set that matches simple and complex terms." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-17", "text": "Further, some approaches work by detecting all pairs of words in a corpus and filtering to isolate synonym or hypernym-relationship pairs using WordNet (Biran et al., 2011) ." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-18", "text": "Like Wikipedia, WordNet is a general purpose semantic database (Miller, 1995) , and does not cover all branches of science nor integrate new terminology quickly." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-19", "text": "Word embeddings do not require the use of prebuilt ontologies to identify associated terms like simplifications." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-20", "text": "Recent work indicates that they may improve results for simplification selection: determining which simplifications for a given complex word can be used without altering the meaning of the text (Paetzold and Specia, 2015) ." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-21", "text": "Embeddings have also been explored to extract hypernym relations from general corpora (Rei and Briscoe, 2014) ." 
}, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-22", "text": "However, word embeddings have not been used for generating lexical simplifications." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-23", "text": "We provide a novel demonstration of how using embeddings on a scientific corpus is better suited to learning scientific term simplifications than prior approaches that use WordNet as a filter and Wikipedia as a corpus." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-24", "text": "INPUT: Finally we show that the transient immune activation that renders mosquitoes resistant to the human malaria parasite has little to no effect on mosquito fitness as a measure of survival or fecundity under laboratory conditions." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-25", "text": "CANDIDATE RULES: {fecundity\u2192fertility} {fecundity\u2192productivity} OUTPUT: Finally we show that the transient immune activation that renders mosquitoes resistant to the human malaria parasite has little to no effect on mosquito fitness as a measure of survival or (fertility; productivity) under laboratory conditions." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-26", "text": "We introduce SimpleScience, a novel lexical simplification pipeline for scientific terms, which we apply to a scientific corpus of nearly 500k publications in Public Library of Science (PLOS) and PubMed Central (PMC) paired with a general corpus from Wikipedia." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-27", "text": "We validate our approach using SimpleSciGold, a gold standard set that we create using crowdsourcing that contains 293 sentences containing scientific terms with an average of 21 simplifications per term." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-28", "text": "We show how the SimpleScience pipeline achieves better performance (Fmeasure: 0.285) than the prior approach to simplification generation applied to our corpus (F-measure: 0.136)." 
}, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-29", "text": "We further find that the simplification selection techniques used in prior work to determine which simplifications are a good fit for a sentence do not improve performance when our generation pipeline is used." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-30", "text": "1" }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-31", "text": "----------------------------------" }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-32", "text": "**PARALLEL CORPORA: SCIENTIFIC AND GENERAL**" }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-33", "text": "We assembled a scientific corpus of papers from the entire catalog of PLOS articles and the National Library of Medicine's Pubmed Central (PMC) archive (359,324 fulltext articles)." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-34", "text": "The PLOS corpus of 125,378 articles includes articles from PLOS One and each speciality PLOS journal (e.g., Pathogens, Computational Biology)." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-35", "text": "Our general corpus includes all 4,776,093 articles from the Feb. 2015 English Wikipedia snapshot." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-36", "text": "We chose Wikipedia as it covers many scientific concepts and usually contains descriptions of those concepts using simpler language than the research literature." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-40", "text": "Our goal is to learn simplification rules in the form complex word\u2192simple word." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-41", "text": "One approach identifies all pairwise permutations of 'content' terms and then applies semantic (i.e., WordNet) and simplicity filters to eliminate pairs that are not simplifications (Biran et al., 2011) ." 
}, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-42", "text": "We adopt a similar pipeline but leverage distance metrics on word embeddings and a simpler frequency filter in place of WordNet." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-43", "text": "Embeddings identify words that share context in an unsupervised, scalable way and are more efficient than constructing co-occurrence matrices (Biran et al., 2011) ." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-44", "text": "As our experiments demonstrate, our approach improves performance on a scientific test set over prior work." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-45", "text": "----------------------------------" }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-46", "text": "**STEP 1: GENERATING WORD EMBEDDINGS**" }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-47", "text": "We used the Word2Vec system (Mikolov et al., 2013) to learn word vectors for each content word in the union of vocabulary of the scientific and general corpus." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-48", "text": "While other approaches exist (Pennington et al., 2014; Levy and Goldberg, 2014) , Word2Vec has been shown to produce both fast and accurate results (Mikolov et al., 2013) ." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-49", "text": "We set the embedding dimension to 300, the context-window to 10, and use the skip-gram architecture with negativesampling,which is known to produce quality results for rare entities (Mikolov et al., 2013) ." 
}, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-50", "text": "----------------------------------" }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-51", "text": "**STEP 2: FILTERING PAIRS**" }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-52", "text": "Given the set of all pairwise permutations of words, we retain a simplification rule of two words w 1 , w 2 if the cosine similarity cos(w 1 , w 2 ) between the word vectors is greater than a threshold a. We use grid search, described below, to parameterize a." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-53", "text": "We then apply additional filtering rules." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-54", "text": "To avoid rules comprised of words with the same stem (e.g., permutable, permutation) we stem all words (using the Porter stemmer in the Python NLTK library (Bird et al., 2009) )." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-55", "text": "The POS of each word is determined by Morphadorner (Burns, 2013) and pairs that differ in POS are omitted (e.g., permutation (noun), change(d) (verb)); Finally, we omit rules where one word is a prefix of the other and the suffix is one of s, es, ed, ly, er, or ing." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-56", "text": "To retain only rules of the form complex word \u2192 simple word we calculate the corpus complexity, C (Biran et al., 2011) of each word w as the ratio between the frequency (f ) in the scientific versus general corpus: C w = f w,scientif ic /f w,general ." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-57", "text": "The lexical complexity, L, of a word is calculated as the word's character length, and the final complexity of the word as C w \u00d7 L w ." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-58", "text": "We require that the final complexity score of the first word in the rule be greater than the second." 
}, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-59", "text": "While this simplicity filter has been shown to work well in general corpora (Biran et al., 2011) , it is sensitive to very small differences in the frequencies with which both words appear in the corpora." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-60", "text": "This is problematic given the distribution of terms in our corpora, where many rarer scientific terms may appear in small numbers in both corpora." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-61", "text": "We introduce an additional constraint that requires that the second (simple) word in the rule occur in the general corpus at least k times." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-62", "text": "This helps ensure that we do not label words that are at a similar complexity level as simplifications." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-63", "text": "We note that this filter aligns with prior work that suggests that features of the hypernym in hypernym-hyponym relations influence performance more than features of the hyponym (Rei and Briscoe, 2014) ." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-64", "text": "Parameterization: We use a grid search analysis to identify which measures of the set including cos(w 1 , w 2 ), f w1,scientif ic , f w2,scientif ic , f w1,general , and f w2,general most impact the Fmeasure when we evaluate our generation approach against our scientific gold standard set (Sec. 4), and to set the specific parameter values." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-65", "text": "Using this method we identify a=0.4 for cosine similarity and k=3,000 for the frequency of the simple term in the general corpus." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-66", "text": "Full results are available in supplementary material." 
}, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-67", "text": "----------------------------------" }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-68", "text": "**APPLYING SIMPLIFICATIONS**" }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-69", "text": "In prior context-aware simplification systems, the decision of whether to apply a simplification rule in an input sentence is complex, involving several similarity operations on word co-occurrence matrices (Biran et al., 2011) or using embeddings to incorporate co-occurrence context for pairs generated using other means (Paetzold and Specia, 2015) ." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-70", "text": "However, the SimpleScience pipline already considers the context of appearance for each word in deriving simplifications via word embeddings learned from a large corpus." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-71", "text": "We see no additional improvements in F-measure when we apply two variants of context similarity thresholds to decide whether to apply a rule to an input sentence." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-72", "text": "The first is the cosine similarity between the distributed representation of the simple word and the sum of the distributed representations of all words within a window l surrounding the complex word in the input sentence (Paetzold and Specia, 2015) ." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-73", "text": "The second is the cosine similarity of a minimum shared frequency co-occurrence matrix for the words in the pair and the co-occurrence matrix for the input sentence (Biran et al., 2011) ." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-74", "text": "In fully automated applications, the top rule from the ranked candidate rules is used." 
}, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-75", "text": "We find that ranking by the cosine similarity between the word embeddings for the complex and simple word in the rule leads to the best performance at the top slot (full results in supplementary material)." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-76", "text": "----------------------------------" }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-77", "text": "**EVALUATION**" }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-78", "text": "----------------------------------" }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-79", "text": "**SIMPLESCIGOLD TEST SET**" }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-80", "text": "To evaluate our pipeline, we develop SimpleSciGold, a scientific gold standard set of sentences containing complex scientific terms which is modeled after the general purpose gold standard set created by Horn et al. (2014) ." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-81", "text": "To create SimpleSciGold, we start with scientific terms from two sources: we utilized all 304 complex terms from unigram rules by (Vydiswaran et al., 2014) , and added another 34,987 child terms from rules found by mining direct parent-child relations for unigrams in the Medical Subject Headings (MeSH) ontology (United States National Library of Medicine, 2015) ." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-82", "text": "We chose complex terms with preexisting simplifications as it provided a means by which we could informally check the crowd generated simplifications for consistency." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-83", "text": "To obtain words in context, we extracted 293 sentences containing unique words in this set from PLOS abstracts from PLOS Biology, Pathology, Genetics, and several other journals." 
}, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-84", "text": "We present 10 MTurk crowdworkers with a task (\"HIT\") showing one of these sentences with the complex word bolded." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-85", "text": "Workers are told to read the sentence, consult online materials (we provide direct links to a Wikipedia search, a standard Google search, and Table 2 : Simplification Generation Results." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-86", "text": "SimpleScience achieves the highest F-measure with a cosine threshold of 0.4 and a frequency of the simple word in the general corpus of 3000." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-87", "text": "a Google \"define\" search on the target term), and add their simplification suggestions." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-88", "text": "Crowdworkers first passed a multiple choice practice qualification in which they were presented with sentences containing three complex words in need of simplification along with screenshots of Wikipedia and dictionary pages for the terms." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-89", "text": "The workers were asked to identify which of 5 possible simplifications listed for each complex word would preserve the meaning while simplifying." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-90", "text": "108 workers took part in the gold standard set creation task, completing an average of 27 HITs each." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-91", "text": "The resulting SimpleSciGold standard set consists of an average of 20.7 simplifications for each of the 293 complex words in corresponding sentences." 
}, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-92", "text": "----------------------------------" }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-93", "text": "**SIMPLIFICATION GENERATION**" }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-94", "text": "We compare our word embedding generation process (applied to our corpora) to Biran et al.'s (2011) approach (applied to the Wikipedia and Simple English Wikipedia corpus as well as our scientific corpora)." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-95", "text": "Following the evaluation method used in Paetzold and Specia (2015), we calculate potential as the proportion of instances for which at least one of the substitutions generated is present in the gold standard set, precision as the proportion of generated instances which are present in the gold standard set, and F-measure as their harmonic mean." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-96", "text": "Our SimpleScience approach outperforms the original approach by Biran et al. (2011) applied to the Wikipedia and SEW corpus as well as to the scientific corpus (Table 1) ." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-97", "text": "----------------------------------" }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-98", "text": "**APPLYING SIMPLIFICATIONS**" }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-99", "text": "We find that neither prior selection approaches yield performance improvements over our generation process." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-100", "text": "We evaluate the performance of ranking candidate rules by cosine similarity (to find the top rule for a fully automated application), and achieve precision of 0.389 at the top slot." 
}, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-101", "text": "In our supplementary materials, we provide additional results for potential, precision and F-measure at varying numbers of slots (up to 5), where we test ranking by cosine similarity of embeddings as well as by the second filter used in our pair generation step: the frequency of the simple word in the simple corpus." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-102", "text": "----------------------------------" }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-103", "text": "**ANTONYM PREVALENCE ANALYSIS**" }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-104", "text": "A risk of using Word2Vec in place of WordNet is that the simpler terms generated by our approach may represent terms with opposite meanings (antonyms)." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-105", "text": "While a detailed analysis is beyond the scope of this paper, we compared the likelihood of seeing antonyms in our results using a gold standard set of antonyms for biology, chemistry, and physics terms from WordNik (Wordnik, 2009 )." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-106", "text": "Specifically, we created an antonym set consisting of the 304 terms from the biology, chemistry, and physics categories in Wictionary for which at least one antonym is listed in WordNik." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-107", "text": "We compared antonym pairs with rules that produced by the SimpleScience pipeline (Fig. 1) ." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-108", "text": "We observed that 14.5% of the time (44 out of 304 instances), an antonym appeared at the top slot among results." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-109", "text": "51.3% of the time (156 out of 304 instances), no antonyms in the list appeared within the top 100 ranked results." 
}, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-110", "text": "These results suggest that further filters are necessary to ensure high enough quality results for fully automated applications of scientific term simplification." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-111", "text": "----------------------------------" }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-112", "text": "**LIMITATIONS AND FUTURE WORK**" }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-113", "text": "A risk of using Word2Vec to find related terms, rather than querying a lexical database like WordNet, is that generated rules may include antonyms." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-114", "text": "Adding techniques to filter antonym rules, such as using co-reference chains (Adel and Sch\u00fctze, 2014) , is important in future work." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-115", "text": "We achieve a precision of 0.389 at the top slot on our SimpleSciGold standard set when we apply our generation method and rank candidates by cosine similarity." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-116", "text": "This level of precision is higher than that achieved by various prior ranking methods used in Lexenstein (Paetzold and Specia, 2015) , with the exception of using machine learning techniques like SVM (Paetzold and Specia, 2015) ." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-117", "text": "Future work should explore how much the precision of our SimpleScience pipeline can be improved by adopting more sophisticated ranking methods." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-118", "text": "However, we suspect that even the highest precision obtained on general corpora and gold standard sets in prior work is not sufficient for fully automated simplification." 
}, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-119", "text": "An exciting area for future work is in applying the SimpleScience pipeline in interactive simplification suggestion interfaces for those reading or writing about science (Kim et al., 2015) ." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-120", "text": "----------------------------------" }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-121", "text": "**CONCLUSION**" }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-122", "text": "In this work, we introduce SimpleScience, a lexical simplification approach to address the unique challenges of simplifying scientific terminology, including a lack of parallel corpora, shifting vocabulary, and mismatch with using general purpose databases for filtering." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-123", "text": "We use word embeddings to extract simplification rules from a novel parallel corpora that contains scientific publications and Wikipedia." }, { "sent_id": "1527ce2786adfe0decf8c926a3d846-C001-124", "text": "Using SimpleSciGold, a gold standard set that we created using crowdsourcing, we show that using embeddings and simple frequency filters on a scientific corpus outperforms prior approaches to simplification generation, and renders the best prior approach to simplification selection unnecessary." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "1527ce2786adfe0decf8c926a3d846-C001-13" ], [ "1527ce2786adfe0decf8c926a3d846-C001-17" ], [ "1527ce2786adfe0decf8c926a3d846-C001-41" ], [ "1527ce2786adfe0decf8c926a3d846-C001-43" ], [ "1527ce2786adfe0decf8c926a3d846-C001-69" ], [ "1527ce2786adfe0decf8c926a3d846-C001-73" ] ], "cite_sentences": [ "1527ce2786adfe0decf8c926a3d846-C001-13", "1527ce2786adfe0decf8c926a3d846-C001-17", "1527ce2786adfe0decf8c926a3d846-C001-41", "1527ce2786adfe0decf8c926a3d846-C001-43", "1527ce2786adfe0decf8c926a3d846-C001-69", "1527ce2786adfe0decf8c926a3d846-C001-73" ] }, "@MOT@": { "gold_contexts": [ [ "1527ce2786adfe0decf8c926a3d846-C001-43" ] ], "cite_sentences": [ "1527ce2786adfe0decf8c926a3d846-C001-43" ] }, "@SIM@": { "gold_contexts": [ [ "1527ce2786adfe0decf8c926a3d846-C001-56" ], [ "1527ce2786adfe0decf8c926a3d846-C001-96" ] ], "cite_sentences": [ "1527ce2786adfe0decf8c926a3d846-C001-56", "1527ce2786adfe0decf8c926a3d846-C001-96" ] }, "@DIF@": { "gold_contexts": [ [ "1527ce2786adfe0decf8c926a3d846-C001-58", "1527ce2786adfe0decf8c926a3d846-C001-59" ] ], "cite_sentences": [ "1527ce2786adfe0decf8c926a3d846-C001-59" ] }, "@USE@": { "gold_contexts": [ [ "1527ce2786adfe0decf8c926a3d846-C001-96" ] ], "cite_sentences": [ "1527ce2786adfe0decf8c926a3d846-C001-96" ] }, "@FUT@": { "gold_contexts": [ [ "1527ce2786adfe0decf8c926a3d846-C001-114" ] ], "cite_sentences": [ "1527ce2786adfe0decf8c926a3d846-C001-114" ] } } }, "ABC_05d1ecc230c7907d9a14d3351070c3_25": { "x": [ { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-17", "text": "The second method is to stand on the shoulder of the pre-trained language models." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-2", "text": "The introduction of pre-trained language models has revolutionized natural language research communities." 
}, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-3", "text": "However, researchers still know relatively little regarding their theoretical and empirical properties." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-4", "text": "In this regard, Peters et al. [1] performs several experiments which demonstrate that it is better to adapt BERT with a light-weight task-specific head, rather than building a complex one on top of the pre-trained language model, and freeze the parameters in the said language model." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-5", "text": "However, there is another option to adopt." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-6", "text": "In this paper, we propose a new adaptation method which we first train the task model with the BERT parameters frozen and then fine-tune the entire model together." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-7", "text": "Our experimental results show that our model adaptation method can achieve 4.7% accuracy improvement in semantic similarity task, 0.99% accuracy improvement in sequence labeling task and 0.72% accuracy improvement in the text classification task." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-8", "text": "----------------------------------" }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-10", "text": "The introduction of pre-trained language models, such as BERT [2] and Open-GPT [3] , among many others, has brought tremendous progress to the NLP research and industrial communities." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-11", "text": "The contribution of these models can be categorized into two aspects." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-12", "text": "First, pre-trained language models allow modelers to achieve reasonable accuracy without the need an excessive amount of manually labeled data." 
}, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-13", "text": "This strategy is in contrast with the classical deep learning methods, which requires a multitude more data to reach comparable results." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-14", "text": "Second, for many NLP tasks, including but not limited to, SQuAD [4] , CoQA [5] , named entity recognition [6] , Glue [7] , machine translation [8] , pre-trained model allows the creation of new state-of-art, given a reasonable amount of labelled data." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-15", "text": "In the post pre-trained language model era, to pursue new state-of-art, two directions can be followed." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-16", "text": "The first method, is to improve the pre-training process, such as in the work of ERNIE [9] , GPT2.0 [3] and MT-DNN [10] ." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-18", "text": "Among the many possibilities, one of them is to build new neural network structures on top of pre-trained language models." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-19", "text": "In principles, there are three ways to train the networks with stacked neural networks on top of pre-trained language models, as shown in Table 1 ." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-20", "text": "In Peters et al . [1] , the authors compare the possibility of option stack-only and finetune-only, and conclude that option finetune-only is better than option stack-only." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-21", "text": "More specifically, Peter et al. [1] arXiv:1907.05338v1 [ argue that it is better to add a task-specific head on top of BERT than to freeze the weights of BERT and add more complex network structures." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-22", "text": "However, Peters et al. [1] did not compare option stack-and-finetune and finetune-only." 
}, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-23", "text": "On the other hand, before pre-trained deep language models became popular, researchers often use a strategy analog to option stack-and-finetune." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-24", "text": "That is, modelers first train the model until convergence, and then fine-tune the word embeddings with a few epochs." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-25", "text": "If pre-trained language models can be understood as at least partially resemblance of word embeddings, then it will be imprudent not to consider the possibility of option stack-and-finetune." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-26", "text": "In this study, we aim to compare the strategy stack-and-finetune and strategy finetune-only." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-27", "text": "More specifically, we perform three NLP tasks, sequence labeling, text classification, and question similarity." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-28", "text": "In the first tasks, we demonstrate that even without modifying the network structures, building networks on top of pre-trained language models might improve accuracy." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-29", "text": "In the second tasks, we show that by ensembling different neural networks, one can even improve the accuracy of fine-tuning only methods even further." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-30", "text": "Finally, in the last task, we demonstrate that if one can tailor-made a neural network that specifically fit the characteristics of the pre-trained language models, one can improve the accuracy even further." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-31", "text": "All the results indicate the strategy stack-and-finetune is superior to strategy finetune-only." 
}, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-32", "text": "This leads us to conclude that, at least, by overlooking the possibility strategy stack-and-finetune is imprudent." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-33", "text": "The contribution of this paper is two-fold." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-34", "text": "First, we propose a new strategy to improve the fine-tune-only strategy proposed by Peter et al. [1] , this allows us to achieve better results, at least on the selected tasks." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-35", "text": "More importantly, the results of this study demonstrate the importance of neural networks design, even in the presence of all-powerful pre-trained language models." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-36", "text": "Second, during the experiment, we have found that although simply using the proposed training strategy can result in higher accuracies compared to that of Peter et al. [1] , it is still a challenging task to find the appropriate methods to design and to utilize pre-trained networks." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-37", "text": "In this regard, we find that pre-trained models differ significantly from word embeddings in terms of their training strategies." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-38", "text": "Especially, since word embeddings can be viewed as shallow transfer learning, while pre-trained model should be viewed as deep transfer learning, one must try to combat over-fitting problems with more care due to the enormous number of parameters presented in the pre-trained models." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-81", "text": "Therefore, in the first phase we fix the weights in the pre-training language models, and only train the model on top of it." 
}, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-39", "text": "Besides, we also find that in order to achieve the maximal performance in the post-pre-trained language model era, one must design, either manually or via Auto ML, networks that best fit the structure, especially the depth of the pre-trained language models." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-40", "text": "The rest of the paper is organized as follows." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-41", "text": "First, we review the relevant literature on pre-trained deep neural networks, the argument in Peter et al. [1] as well as fine-tuning strategies with word embeddings." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-42", "text": "Second, we present three experiments and showed the superiority of strategy stack-and-finetune compared to strategy finetune-only." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-43", "text": "Finally, we conclude with some remarks and future research possibilities." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-44", "text": "----------------------------------" }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-45", "text": "**RELATED STUDIES**" }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-46", "text": "Before the introduction of deep neural networks, researchers in the field of NLP have been using pre-trained models." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-47", "text": "Among all of them, one of the most famous is the word embeddings, which maps each word into a continuous vector, instead of one-hot encodings [11] ." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-48", "text": "By doing so, not only are we able to reduce the dimensionality of the input features, which helps to avoid over-fitting, but also capture, at least partially, the internal meaning of each word." 
}, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-49", "text": "However, since each word is only endowed with a fixed numerical vector in the methodology of word embeddings, word embeddings are unable to capture the contextual meaning in the text." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-50", "text": "For example, consider the word \"bank\" sentences \"I am walking on the bank of the river.\" with \"I am going to rob the bank\"." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-51", "text": "It is obvious that the word \"bank\" represents completely different meaning, which the word embeddings techniques fail to capture." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-52", "text": "The aforementioned deficiencies prompt researchers to propose deep neural networks that are able to be trained in an unsupervised fashion while being able to capture the contextual meaning of the words presented in the texts." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-53", "text": "Some early attempts include pre-trained models includes, CoVe [12] , CVT [13, 14] , ELMo [15] and ULMFiT [16] ." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-54", "text": "However, the most successful ones are BERT [2] and Open-GPT [3] ." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-55", "text": "Unlike standard NLP deep learning model, BERT and Open-GPT are built on top of transformer [17] structures, instead of LSTM [18] or GRU [19] ." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-56", "text": "The difference between BERT and Open-GPT is that BERT uses bi-directional self-attentions while Open-GPT uses only unidirectional ones, as shown in Figure 2 ." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-57", "text": "The transformer structures differ from the LSTM's in the two important aspects." 
}, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-58", "text": "First, it allows for stacking of multiple layers with residual connections and batch normalizations, which allows for free gradient flow." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-59", "text": "Second, the core computational unit is matrix multiplications, which allows researchers to utilize the full computational potential of TPU [20] ." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-60", "text": "After training on a large corpus, both BERT and Open-GPT are able to renew the SOTA of many important natural language tasks, such as such as SQuAD [4] , CoQA [5] , named entity recognition [6] , Glue [7] , machine translation [8] ." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-61", "text": "In the presence of the success of pre-trained language models, especially BERT [2] , it is natural to ask how to best utilize the pre-trained language models to achieve new state-of-the-art results." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-62", "text": "In this line of work, Liu et al. [21] investigated the linguistic knowledge and transferability of contextual representations by comparing BERT [2] with ELMo [15] , and concluded that while the higher levels of LSTM's are more task-specific, this trend does not exhibit in transformer based models." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-63", "text": "Stickland and Murray [22] invented projected attention layer for multi-task learning using BERT, which results in an improvement in various state-of-the-art results compared to the original work of Devlin et al. [2] ." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-64", "text": "Xu et al. [23] propose a \"post-training\" algorithms, which does not directly fine-tune BERT, but rather first \"post-train\" BERT on the task related corpus using the masked language prediction task next sentence prediction task, which helps to reduce the bias in the training corpus." 
}, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-65", "text": "Finally, Sun et al. [24] added additional fine-tuning tasks based on multi-task training, which further improves the prediction power of BERT in the tasks of text classification." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-66", "text": "In this aspect, however, there is a simple yet crucial question that needs to be addressed." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-67", "text": "That is, whether it is possible to top BERT with the commonly used or task specific layers, and if this is possible, how to best utilize the pre-trained language models in this situation." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-68", "text": "In this regards, Peters et al. [1] investigated how to best adapt the pre-trained model to a specific task, and focused on two different adaptation method,feature extraction and directly fine-tuning the pre-trained model, which corresponding to the strategy finetune-only and the strategy stack-only in Table 1 ." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-69", "text": "On this regard, Peters et al. [1] performs five experiments, including: (1) named entity recognition [6] ; (2) sentiment analysis [25] ; (3) natural language inference [26] ; (4) paraphrase detection [27] ; (5) semantic textual similarity [28] ." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-70", "text": "By the results of these tasks, Peters et al. [1] concludes that adding a light task-specific head and performing fine-tuning on BERT is better than building a complex network on top without BERT fine-tuning." 
}, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-71", "text": "----------------------------------" }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-72", "text": "**METHODOLOGY**" }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-73", "text": "Under our strategy stack-and-finetune, the model training process is divided into two phases, which are described in detail below." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-74", "text": "In the first phase, the parameters of the pre-training model are fixed, and only the upper-level models added for a specific task is learned." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-75", "text": "In the second phase, we fine-tune the upper-level models together with the pre-trained language models." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-76", "text": "We choose this strategy for the following reasons." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-77", "text": "Pre-training models have been used to obtain more effective word representations through the study of a large number of corpora." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-78", "text": "In the paradigm proposed in the original work by Devlin et al. [2] , the author directly trained BERT along with with a light-weighted task-specific head." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-79", "text": "In our case though, we top BERT with a more complex network structure, using Kaiming initialization [29] ." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-80", "text": "If one would fine-tune directly the top models along with the weights in BERT, one is faced with the following dilemma: on the one hand, if the learning rate is too large, it is likely to disturb the structure innate to the pre-trained language models; on the other hand, if the learning rate is too small, since we top BERT with relatively complex models, the convergence of the top models might be impeded." 
}, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-82", "text": "Another aspect that is worth commenting in the first phase is that it is most beneficial that one does not train the top model until it reaches the highest accuracy on the training or validation data sets, but rather only up to a point where the prediction accuracy of the training and validation data sets do not differ much." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-83", "text": "This is intuitively reasonable for the following reasons." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-84", "text": "Unlike word embeddings, the pre-trained language models possess a large number of parameters compared to the task-specific models we build on top them." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-85", "text": "Therefore, if one were to train the top models until they reach the highest prediction accuracy in the training or validation data sets, it would likely cause the models to over-fit." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-86", "text": "Therefore, in our experiment, we found that this leads to the highest performance increase in the fine-tuning stage." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-87", "text": "----------------------------------" }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-88", "text": "**EXPERIMENTS 4.1 OVERVIEW**" }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-89", "text": "We perform three different experiments to test our hypotheses." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-90", "text": "First, we perform a named entity recognition tasks, by adding a bi-LSTM on top of the BERT model." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-91", "text": "In this experiment, we hope to test whether, without any modification to the commonly used network structure, our proposed training strategy will improve the overall accuracy." 
}, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-92", "text": "Second, we perform a text classification experiments, in this experiments, we trained three models, and perform a model ensemble." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-93", "text": "We hope to show that even the added network has not contributed to significantly in improving the accuracy, it does provide opportunities for model ensembles." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-94", "text": "Finally, we perform the textual similarity tests, in which we show that if one can tailor make a network that specifically fit the characteristics of the pre-trained languages, more significant improvement can be expected." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-95", "text": "Under the strategy finetune-only, we use only single BERT.In order to adapt to different tasks, we will add a fully connected layer upon BERT." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-96", "text": "In the sequence labeling task, the BERT word embedding of each word passes through two fully connected layers, and the prediction probability of named entity can be obtained." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-97", "text": "In the next two verification tasks, we use \"[CLS]\" for prediction and add two fully connected layers subsequently." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-98", "text": "Under our strategy stackand-finetune, we set different learning rates for the two phases." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-99", "text": "We tried to set the learning rate of the first stage to 1e \u22121 ,1e \u22122 ,5e \u22123 ,1e \u22123 and 5e \u22124 , and set it to a smaller number in the latter stage, such as 1e \u22123 ,1e \u22124 ,5e \u22125 and 1e \u22125 ." 
}, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-100", "text": "After our experiments, we found that it gets better results while the learning rate is set to 0.001 in the stage of training only the upper model and set to 5e \u22125 in the later stage." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-101", "text": "Since BERT-Adam [2] has excellent performance, in our experiments, we use it as an optimizer with \u03b2 1 = 0.9, \u03b2 2 = 0.999,L 2 -weight decay of 0.01.We apply a dropout trick on all layers and set the dropout probability as 0.1." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-102", "text": "----------------------------------" }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-103", "text": "**EXPERIMENT A: SEQUENCE LABELING**" }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-152", "text": "This poses an even more essential question for Auto ML researchers in the following sense." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-104", "text": "In the sequence labeling task,we explore sub-task named entity recognition using CoNLL03 dataset [6] , which is a public available used in many studies to test the accuracy of their proposed methods [30, 31, 32, 33, 2] ." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-105", "text": "For strategy finetune-only and strategy stack-and-finetune, we implemented two models: one with BERT and the other with BERT adding a Bi-LSTM on top." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-106", "text": "Eval measure is accuracy and F1 score." 
}, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-107", "text": "----------------------------------" }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-108", "text": "**MODEL**" }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-109", "text": "Accuracy(%) F1(%) BERT 98.75 85.62 BERT + Bi-LSTM 99.74 92.51 Table 2 : Results for named entity recognition As is shown in Table 2 , even without modifying the networks to specifically adapt to the pre-trained model, our training strategy still brought improvement towards overall accuracy of 0.99% for the accuracy and 0.068 on the F1 score, proving the success of our proposed methods." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-110", "text": "----------------------------------" }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-111", "text": "**EXPERIMENT B: TEXT CLASSIFICATION**" }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-112", "text": "In the task of text categorization, we used Yahoo Answer Classification Dataset." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-113", "text": "The Dataset is consists of 10 classes, but due to the huge amount of the dataset, we just select two class of them." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-114", "text": "As for the upper model,we choose DenseNet [34] and HighwayLSTM [35] ." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-115", "text": "The DenseNet structure contains four independent blocks and each block has four CNNs connected by residual." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-116", "text": "We initialize word embedding in the word representation layer with BERT." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-117", "text": "We initialize each character as a 768-dimension vector." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-118", "text": "In the experiment of training DenseNet,we concat the output vector of DenseNet with [CLS] for prediction." 
}, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-119", "text": "We find the ensembled model enjoys a 0.72% improvements compared to the fine-tune only model and 0.005 improvement for the F1 score." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-120", "text": "----------------------------------" }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-121", "text": "**EXPERIMENT C: SEMANTIC SIMILARITY TASKS**" }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-122", "text": "We use \"Quora-Question-Pair\" dataset 1." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-123", "text": "This is a commonly used dataset containing 400k question pairs, annotated manually to be semantically equivalent or not." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-124", "text": "Due to its high quality, it is a standard dataset to test the success of various semantic similarity tasks." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-125", "text": "Various models which are tested on this data set are proposed, including but not limited to [36, 37, 38, 39] ." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-126", "text": "Apart from the BERT fine-tuning only model and BERT+ BIMPM model, we also devise two new network structures by modifying the BIMPM model." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-127", "text": "In the first model is to remove the first bi-LSTM of BIMPM, which is the input layer for the matching layer in BIMPM." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-128", "text": "In the second model, we combine the matching layer of BIMPM and with a transformer [17] , a model we call Sim-Transformer by replacing the output layer of the matching layer, originally a bi-LSTM model, with a transformer model." 
}, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-129", "text": "From the experimental results shown in Table 4 , we can see that due to the strong expressive ability of the BERT, there is almost no difference in the experimental results of removing the first bi-LSTM and BIMPM." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-130", "text": "In addition, we also find that Sim-Transformer's performance without fine-tuning is nearly four percentage points lower than BIMPM, but it out-performs BIMPM after fine-tuning." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-131", "text": "In general, the results show that BERT + Sim-Transformer out-performs BERT-only model by 4.7%, thus confirming our hypotheses again." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-132", "text": "----------------------------------" }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-133", "text": "**MODEL**" }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-134", "text": "Accuracy(BERT fixed) Accuracy(Fine-tune BERT) BERT -0.8445 BERT + BIMPM [39] 0.8739 0.8899 BERT + BIMPM(First bi-LSTM removed) 0.869 0.8855 BERT + Sim-Transformer 0.8341 0.8915 Table 4 : Results for semantic similarity task" }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-135", "text": "----------------------------------" }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-136", "text": "**DISCUSSIONS AND CONCLUSIONS**" }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-137", "text": "In summary, we find that in all the three tasks, our proposed method out-performs the methods of simply tuning pre-trained language models, as is proposed in [1] ." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-138", "text": "However, we would like to caution the readers in two aspects when reading the conclusion of this study." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-139", "text": "First, this study does not argue that our proposed methods are always superior to fine-tuning only methods." 
}, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-140", "text": "For example, all the experiments in our study are based on data sets of relatively large size." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-141", "text": "In the other spectrum, if one is only given a limited data set, then building complex networks upon pre-trained language models might lead to disastrous over-fitting." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-142", "text": "If this is the case, then it is possible that deep domain adaptation [40] might be a better choice if one desires to stack neural networks on top of pre-trained language models." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-143", "text": "However, most domain adaptation applications belong to the field of computer vision, therefore, a call for domain adaptations research in the NLP fields." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-144", "text": "During the experimentation, we also discover some tricks to obtain higher quality networks." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-145", "text": "The first is that due to the enormous number of parameters presented in the pre-trained language models, to achieve generalizable results on the test data sets, it is vital to combat over-fitting." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-146", "text": "In classical embedding + training networks, the general training method is to fix the word-embeddings, then train the top model until it converges, and finally fine-tuning the word-embeddings for a few epochs." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-147", "text": "This training strategy does not work when we replace pre-trained language models with word-embeddings." 
}, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-148", "text": "In our experiment, we first fix the pre-trained language models, and then we train the top neural networks only for a few epochs, until it reaches a reasonable accuracy, while closely monitoring the discrepancy between training accuracy and testing accuracy." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-149", "text": "After that, we fine-tune the pre-trained language model as well as our models on top together." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-150", "text": "This allows us to achieve better results on the experimentation." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-151", "text": "However, it is not yet clear to us when to stop the training of top neural networks." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-153", "text": "In the classical computer vision based Auto ML approaches, since one seldom build networks on already trained models, there is no particular need to auxiliary measure for over-fittings." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-154", "text": "While if Auto ML is to be performed on NLP tasks successfully, it might be essential that the gap between training accuracy and test accuracy to be incorporated when one evaluates the model." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-155", "text": "----------------------------------" }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-156", "text": "**FINALLY, IT**" }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-157", "text": "is not yet clear what is the most proper way to build networks that tops the pre-trained language models." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-158", "text": "However, there are several principles that we can follow when designing such networks." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-159", "text": "First, such networks must be able to ensure the gradient flow from the top of the model to the bottom." 
}, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-160", "text": "This is essential due to the depth of the pre-trained language model." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-161", "text": "Second, this also means, one does not need explicitly to build extremely complex networks on top of pre-trained language models unless it complements the mechanisms of self-attention." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-162", "text": "Finally, a challenge remains as to how to use the depth of pre-trained language models." }, { "sent_id": "05d1ecc230c7907d9a14d3351070c3-C001-163", "text": "The process of our experiment shows that utilizing deeper layers might be a fruitful way to achieve better accuracy." } ], "y": { "@BACK@": { "gold_contexts": [ [ "05d1ecc230c7907d9a14d3351070c3-C001-10" ], [ "05d1ecc230c7907d9a14d3351070c3-C001-53", "05d1ecc230c7907d9a14d3351070c3-C001-54" ], [ "05d1ecc230c7907d9a14d3351070c3-C001-63" ], [ "05d1ecc230c7907d9a14d3351070c3-C001-62" ], [ "05d1ecc230c7907d9a14d3351070c3-C001-78" ] ], "cite_sentences": [ "05d1ecc230c7907d9a14d3351070c3-C001-10", "05d1ecc230c7907d9a14d3351070c3-C001-54", "05d1ecc230c7907d9a14d3351070c3-C001-63", "05d1ecc230c7907d9a14d3351070c3-C001-62", "05d1ecc230c7907d9a14d3351070c3-C001-78" ] }, "@MOT@": { "gold_contexts": [ [ "05d1ecc230c7907d9a14d3351070c3-C001-61" ], [ "05d1ecc230c7907d9a14d3351070c3-C001-63" ] ], "cite_sentences": [ "05d1ecc230c7907d9a14d3351070c3-C001-61", "05d1ecc230c7907d9a14d3351070c3-C001-63" ] }, "@SIM@": { "gold_contexts": [ [ "05d1ecc230c7907d9a14d3351070c3-C001-101" ], [ "05d1ecc230c7907d9a14d3351070c3-C001-104" ] ], "cite_sentences": [ "05d1ecc230c7907d9a14d3351070c3-C001-101", "05d1ecc230c7907d9a14d3351070c3-C001-104" ] }, "@USE@": { "gold_contexts": [ [ "05d1ecc230c7907d9a14d3351070c3-C001-101" ], [ "05d1ecc230c7907d9a14d3351070c3-C001-104" ] ], "cite_sentences": [ "05d1ecc230c7907d9a14d3351070c3-C001-101", "05d1ecc230c7907d9a14d3351070c3-C001-104" ] } } }, 
"ABC_13d3d973a4be832f66b049b364fea5_25": { "x": [ { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-127", "text": "An example of error annotations is:" }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-2", "text": "We propose a novel word embedding pretraining approach that exploits writing errors in learners' scripts." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-3", "text": "We compare our method to previous models that tune the embeddings based on script scores and the discrimination between correct and corrupt word contexts in addition to the generic commonly-used embeddings pre-trained on large corpora." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-4", "text": "The comparison is achieved by using the aforementioned models to bootstrap a neural network that learns to predict a holistic score for scripts." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-5", "text": "Furthermore, we investigate augmenting our model with error corrections and monitor the impact on performance." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-6", "text": "Our results show that our error-oriented approach outperforms other comparable ones which is further demonstrated when training on more data." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-7", "text": "Additionally, extending the model with corrections provides further performance gains when data sparsity is an issue." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-8", "text": "----------------------------------" }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-10", "text": "Assessing students' writing plays an inherent pedagogical role in the overall evaluation of learning outcomes." 
}, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-11", "text": "Traditionally, human graders are required to mark essays, which is cost-and timeinefficient, especially with the growing numbers of students." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-12", "text": "Moreover, the evaluation process is subjective, which leads to possible variations in the awarded scores when more than one human assessor is employed." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-77", "text": "In their experiment, they set \u03b1 to 0.1 giving most of the weight to score-related information." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-76", "text": "where \u03b1 is a hyperparameter." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-13", "text": "To remedy this, the automated assessment (AA) of writing has been motivated in order to automatically evaluate writing competence and hence not only reduce grader workload, but also bypass grader inconsistencies as only one system would be responsible for the assessment." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-14", "text": "Numerous AA systems have been developed for research purposes or deployed for commercial use, including Project Essay Grade (PEG) (Page, 2003) , e-Rater (Attali and Burstein, 2006) , Intelligent Essay Assessor (IEA) (Landauer et al., 2003) and Bayesian Essay Test Scoring sYstem (BETSY) (Rudner and Liang, 2002) among others." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-15", "text": "They employ statistical approaches that exploit a wide range of textual features." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-16", "text": "A recent direction of research has focused on applying deep learning to the AA task in order to circumvent the heavy feature engineering involved in traditional systems." 
}, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-17", "text": "Several neural architectures have been employed including variants of Long Short-Term Memory (LSTM) (Alikaniotis et al., 2016; Taghipour and Ng, 2016) and Convolutional Neural Networks (CNN) (Dong and Zhang, 2016) ." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-18", "text": "They were all applied to the Automated Student Assessment Prize (ASAP) dataset, released in a Kaggle contest 1 , which contains essays written by middle-school English speaking students." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-19", "text": "On this dataset, neural models that only operate on word embeddings outperformed state-of-the-art statistical methods that rely on rich linguistic features (Yannakoudakis et al., 2011; Phandi et al., 2015) ." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-20", "text": "The results obtained by neural networks on the ASAP dataset demonstrate their ability to capture properties of writing quality without recourse to handcrafted features." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-21", "text": "However, other AA datasets pose a challenge to neural models and they still fail to beat state-of-the-art methods when evaluated on these sets." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-22", "text": "An example of such datasets is the First Certificate in English (FCE) set where applying a rank preference Support Vector Machine (SVM) trained on various lexical and grammatical features achieved the best results (Yannakoudakis et al., 2011) ." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-23", "text": "This motivates further investigation into neural networks to determine what minimum useful information they can utilize to enhance their predictive power." 
}, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-24", "text": "Initializing neural models with contextually rich word embeddings pre-trained on large corpora (Mikolov et al., 2013; Pennington et al., 2014; Turian et al., 2010) has been used to feed the networks with meaningful embeddings rather than random initialization." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-25", "text": "Those embeddings are generic and widely employed in Natural Language Processing (NLP) tasks, yet few attempts have been made to learn more task-specific embeddings." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-26", "text": "For instance, Alikaniotis et al. (2016) developed score-specific word embeddings (SSWE) to address the AA task on the ASAP dataset." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-27", "text": "Their embeddings are constructed by ranking correct ngrams against their \"noisy\" counterparts, in addition to capturing words' informativeness measured by their contribution to the overall score of the essay." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-28", "text": "We propose a task-specific approach to pre-train word embeddings, utilized by neural AA models, in an error-oriented fashion." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-29", "text": "Writing errors are strong indicators of the quality of writing competence and good predictors for the overall script score, especially in scripts written by language learners, which is the case for the FCE dataset." 
}, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-30", "text": "For example, the Spearman's rank correlation coefficient between the FCE script scores and the ratio of errors is \u22120.63 which is indicative of the importance of errors in writing evaluation: ratio of errors = number of erroneous script words script length" }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-31", "text": "This correlation could even be higher if error severity is accounted for as some errors could be more serious than others." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-78", "text": "Error-specific Word Embeddings (ESWE)." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-32", "text": "Therefore, it seems plausible to exploit writing errors and integrate them into AA systems, as was successfully done by Yannakoudakis et al. (2011) and , but not by capturing this information directly in word embeddings in a neural AA model." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-33", "text": "Our pre-training model learns to predict a score for each ngram based on the errors it contains and modifies the word vectors accordingly." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-34", "text": "The idea is to arrange the embedding space in a way that discriminates between \"good\" and \"bad\" ngrams based on their contribution to writing errors." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-35", "text": "Bootstrapping the assessment neural model with those learned embeddings could help detect wrong patterns in writing which should improve its accuracy of predicting the script's holistic score." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-36", "text": "We implement a CNN as the AA model and compare its performance when initialized with our embeddings, tuned based on natural writing errors, to the one obtained when bootstrapped with the SSWE, proposed by Alikaniotis et al. (2016) , that relies on random noisy contexts and script scores." 
}, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-37", "text": "Furthermore, we implement another version of our model that augments ngram errors with their corrections and investigate the effect on performance." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-38", "text": "Additionally, we compare the aforementioned pre-training approaches to the commonly used embeddings trained on large corpora (Google or Wikipedea)." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-39", "text": "The results show that our approach outperforms other initialization methods and augmenting the model with error corrections helps alleviate the effects of data sparsity." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-40", "text": "Finally, we further analyse the pre-trained representations and demonstrate that our embeddings are better at detecting errors which is inherent for AA." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-41", "text": "----------------------------------" }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-42", "text": "**RELATED WORK**" }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-43", "text": "There have been various attempts to employ neural networks to assess the essays in the ASAP dataset." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-44", "text": "Taghipour and Ng (2016) compared the performance of a few neural network variants and obtained the best results with an LSTM followed by a mean over time layer that averages the output of the LSTM layer." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-45", "text": "Alikaniotis et al. (2016) assessed the same dataset by building a bidirectional double-layer LSTM which outperformed Distributed Memory Model of Paragraph Vectors (PV-DM) (Le and Mikolov, 2014) and Support Vector Machines (SVM) baselines." 
}, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-46", "text": "Dong and Zhang (2016) implemented a CNN where the first layer convolves a filter of weights over the words in each sentence followed by an aggregative pooling function to construct sentence representations." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-47", "text": "Subsequently, a second filter is applied over sentence representations followed by a pooling operation then a fully-connected layer to predict the final score." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-48", "text": "Their CNN was applied to the ASAP dataset and its efficacy in in-domain and domain-adaptation essay evaluation was demonstrated in comparison to traditional state-of-the-art baselines." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-49", "text": "Several AA approaches in the literature have exploited the \"quality\" or \"correctness\" of ngrams as a feature to discriminate between good and poor essays." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-50", "text": "Phandi et al. (2015) defined good essays as the ones with grades above or equal to the average score and the rest as poor ones." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-51", "text": "They calculated the Fisher scores (Fisher, 1922) of ngrams and selected 201 with the highest scores as \"useful ngrams\"." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-52", "text": "Similarly, they generated correct POS ngrams from grammatically correct texts, classified the rest as \"bad POS ngrams\" and used them along with the useful ngrams and other shallow lexical features as bag-of-words features." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-53", "text": "They applied Bayesian linear ridge regression (BLRR) and SVM regression for domain-adaptation essay scoring using the ASAP dataset." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-54", "text": "Alikaniotis et al. 
(2016) applied a similar idea; in their SSWE model, they trained word embeddings to distinguish between correct and noisy contexts in addition to focusing more on each word's contribution to the overall text score." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-55", "text": "Bootstrapping their LSTM model with those embeddings offered further performance gains." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-56", "text": "Other models have directly leveraged error information exhibited in text." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-57", "text": "For example, Yannakoudakis et al. (2011) demonstrated that adding an \"error-rate\" feature to their SVM ranking model that uses a wide range of lexical and grammatical writing competence features further improves the AA performance." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-58", "text": "They calculated the error-rate using the error annotations in the Cambridge Learner Corpus (CLC) in addition to classifying a trigram as erroneous if it does not occur in the large ukWaC corpus (Ferraresi et al., 2008) or highly scoring CLC scripts." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-59", "text": "proposed a bidirectional LSTM for error detection in learner data, where the model predicts the probability of a word being correct for each word in text." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-60", "text": "As an extension to their experiment, they incorporated the average predicted probability of word correctness as an additional feature to the self-assessment and tutoring system (SAT) (Andersen et al., 2013) that applied a supervised ranking perceptron to rich linguistic features." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-61", "text": "Adding their correctness probability feature successfully enhanced the predictive power of the SAT." 
}, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-62", "text": "----------------------------------" }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-63", "text": "**APPROACH**" }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-64", "text": "----------------------------------" }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-65", "text": "**WORD EMBEDDING PRE-TRAINING**" }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-66", "text": "In this section, we describe three different neural networks to pre-train word representations: the model implemented by Alikaniotis et al. (2016) and the two error-oriented models we propose in this work." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-67", "text": "The models' output embeddings -referred to as AA-specific embeddings -are used later to bootstrap the AA system." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-68", "text": "----------------------------------" }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-69", "text": "**SCORE-SPECIFIC WORD EMBEDDINGS (SSWE).**" }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-70", "text": "We compare our pre-training models to the SSWE developed by Alikaniotis et al. (2016) ." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-71", "text": "Their method is inspired by the work of Collobert and Weston (2008) which learns word representations by distinguishing between a target word's context (window of surrounding words) and its noisy counterparts." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-72", "text": "These counterparts are generated by replacing the target word with a randomly selected word from the vocabulary." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-73", "text": "The network is trained to rank the positive correct contexts higher than the negative corrupt ones." 
}, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-74", "text": "Additionally, the model is augmented with score specific information to focus on the informative words that contribute to the overall score of essays rather than the frequent words that occur equally in good and bad essays." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-75", "text": "They optimize the overall loss function as a weighted sum of the ranking loss between correct and noisy ngrams and the score specific loss:" }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-79", "text": "We propose a model that fine-tunes the embedding space using a supervised method that leverages the errors appearing in the training data." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-80", "text": "It modifies the embedding space to discriminate between erroneous ngrams and correct ones." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-81", "text": "The core difference between this approach and SSWE is that it relies on the writing errors occurring naturally in the training data instead of randomly generating incorrect ngrams or capturing words' informativeness." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-82", "text": "The motivation for adopting this approach is twofold." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-83", "text": "First, we believe that the model could learn more useful AA features from actual errors rather than introducing random contexts that are unlikely to happen." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-84", "text": "Second, SSWE ignores the frequent words as they have less predictive power (they are used equally in highly and lowly scored texts)." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-85", "text": "However, despite the fact that frequent words (e.g. 
function words) carry less topical information than content ones, the errors associated with them constitute a substantial portion of the errors committed by non-native English speakers." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-86", "text": "For instance, determiner errors account for more than 9% of the total errors in public FCE training data." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-87", "text": "Therefore, learning representations from both function and content word errors in their contexts could be advantageous." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-88", "text": "The ESWE model predicts error scores for word ngrams." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-89", "text": "First, we demonstrate how the true error scores for ngrams are calculated and second, we describe the approach applied to estimate these scores." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-90", "text": "Each word w_i in a training document is given an error-indicating score e_i \u2208 {1, 0} based on whether it is part of an error or not, respectively." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-91", "text": "Subsequently, an ngram gold score (n_score) is calculated based on the sum of the errors it contains as follows: n_score = 1 \u2212 (e_i + ... + e_{i+n\u22121}) / n (Equation 2)" }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-92", "text": "where n is the ngram length." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-93", "text": "For the model to estimate the ngram scores, a convolutional operation is applied as depicted in Figure 1 ." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-94", "text": "First, each word is mapped to a unique vector v_i^{wrd} \u2208 R^{d_wrd} retrieved from an embedding space E \u2208 R^{|V|\u00d7d_wrd}, where |V| is the vocabulary size." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-95", "text": "Consequently, an ngram is represented as a concatenation of its word vectors v_ng = [v_i^{wrd}; ...; v_{i+n\u22121}^{wrd}]." 
}, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-96", "text": "Scoring the ngrams is accomplished by sliding a convolutional linear filter W e \u2208 R n\u00d7d wrd -hereafter error filter 2 -over all the ngrams in the script, followed by a sigmoid non-linearity to map the predicted score to a [0, 1] probability space:" }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-97", "text": "where \u03c3 is the sigmoid function." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-98", "text": "3 The error filter should work as an error detector that evaluates the correctness of words given their contexts and arranges them in the embedding space accordingly." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-99", "text": "For optimization, the sum of squared errors loss is minimized between the gold ngram scores and the estimated ones and the error gradients are backpropagated to the embedding matrix E building the ESWE space:" }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-100", "text": "where k is the ngram index." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-101", "text": "Error-correction-specific Word Embeddings (ECSWE)." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-102", "text": "As an extension to ESWE, we propose augmenting it with the errors' corrections as follows." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-103", "text": "We build a corrected version of each script by replacing all its errors with their suggested corrections and train the ESWE model using the corrected scripts together with the original ones." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-104", "text": "In the corrected version, all the ngrams are given e i = 0 and consequently, n score = 1 according to Equation 2." 
}, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-105", "text": "All the above ESWE equations are applied and the loss for each script is calculated as the sum of both the loss of the original script and its corrected version (Equation 4 applied to obtain both)." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-106", "text": "The motivation for this model is twofold." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-107", "text": "First, it could enrich the embedding space by allowing the model to learn from faulty ngrams and their correct counterparts (both occur naturally in text) and construct ECSWE which is a modified version of ESWE that is more capable of distinguishing between good and bad contexts." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-108", "text": "Second, it could alleviate the effects of data sparsity, when training on small datasets, by learning from more representations." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-109", "text": "4" }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-110", "text": "----------------------------------" }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-111", "text": "**AA MODEL**" }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-112", "text": "The previous section discusses pre-training approaches for word embeddings that are later used to initialize the AA model." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-113", "text": "For this model, we use a second CNN to predict a holistic score for the script (Figure 2) script's subsequences to generate the feature maps M \u2208 R h\u00d7(l\u2212m+1) , where m is the filter height (window size) and h is the number of the output feature maps." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-114", "text": "We refer to this filter as the script filter." 
}, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-115", "text": "Previously, for the error filter used in the ESWE and ECSWE approaches, h was set to 1 which represents the predicted ngram score (n score), whereas here, the system extracts various contextual features from each ngram as a prestep towards predicting the script's score, hence setting h to a large value." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-116", "text": "The convolutional operation is followed by a ReLU non-linearity to capture more complex linguistic phenomena: 5" }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-117", "text": "Subsequently, an average pooling function is applied to the output feature maps in order to select the useful features and unify the scripts' representations to a vector S \u2208 R h of fixed length." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-118", "text": "Finally, the last layer of the network is a fully connected one by applying linear regression to the script representation in order to predict the final score:" }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-119", "text": "where W reg \u2208 R h is a learned parameter matrix." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-120", "text": "The network optimizes the sum of squared errors loss between the scripts' predicted scores and the gold ones." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-121", "text": "----------------------------------" }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-122", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-123", "text": "Baselines." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-124", "text": "We compare our error-oriented approaches to the SSWE model as well as generic pre-trained models commonly used to initialize prompts asking the learner to write either an article, a letter, a report, a composition or a short story." 
}, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-125", "text": "We apply script-level evaluation by concatenating the two answers and using a special answer end token to separate the answers in the same script." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-126", "text": "The writing errors committed in the scripts are manually annotated using a taxonomy of 80 error types (Nicholls, 2003) together with suggested corrections." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-128", "text": "The problems started inat the box office." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-129", "text": "where is the error, is the suggested correction and the error type \"RT\" refers to \"replace preposition\"." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-130", "text": "For error-oriented models, a word is considered an error if it occurs inside an error tag and the correction is retrieved according to the correction tag." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-131", "text": "We train the models on the released public FCE dataset which contains 1, 141 scripts for training and 97 scripts for testing." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-132", "text": "In order to examine the effects of training with extra data, we conduct experiments where we augment the public set with additional FCE scripts and refer to this extended version as FCE ext , which contains 9, 822 scripts." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-133", "text": "We report the results of both datasets on the released test set." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-134", "text": "The public FCE dataset is divided into 1, 061 scripts for training and 80 for development while for FCE ext , 8, 842 scripts are used for training and 980 are held out for development." 
}, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-135", "text": "The only data preprocessing employed is word tokenization which is achieved using the Robust Accurate Statistical Parsing (RASP) system (Briscoe et al., 2006) ." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-136", "text": "Training." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-137", "text": "Hyperparameter tuning is done for each model separately." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-138", "text": "The SSWE, ESWE and ECSWE models are initialized with GloVe (d wrd = 50) vectors, trained for 20 epochs and the learning rate is set to 0.01." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-139", "text": "For SSWE, \u03b1 is set to 0.1, batch size to 128, the number of randomly gen- erated counterparts per ngram to 20 and the size of hidden layer to 100." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-140", "text": "9 For the AA network, initialized with any of the 5 models, h is set to 100, and learning rate to 0.001 when training on public FCE and 0.0001 on FCE ext ." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-141", "text": "The sizes used for error and script filters are shown in Table 1 ." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-142", "text": "10 All the networks are optimized using Stochastic Gradient Descent (SGD)." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-143", "text": "The AA system is regularized with L2 regularization with rate = 0.0001 and trained for 50 epochs during which performance is monitored on the dev sets." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-144", "text": "Finally, the AA model with the best mean square error over the dev sets is selected." 
}, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-145", "text": "----------------------------------" }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-146", "text": "**RESULTS AND DISCUSSION**" }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-147", "text": "The public FCE results shown in Table 2 reveal that AA-specific embedding pre-training offers further gains in performance over the traditional embeddings trained on large corpora (Google and GloVe embeddings), which suggests that they are more suited for the AA task." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-148", "text": "The table also demonstrates that the ESWE model outperforms the SSWE one on correlation metrics, with a slight difference in the RMSE value." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-149", "text": "While the variance in the correlations between the two models is noticeable and suggests that the ESWE model is a more powerful one, the RMSE values weaken this assumption." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-150", "text": "This result could be attributed to the fact that public FCE is a small dataset with sparse error representations and SSWE is trained on 20 times more data as each ngram is paired with 20 randomly generated counterparts." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-151", "text": "Therefore, a more relevant comparison is needed and could be achieved by either training on more data, as will be discussed later, or further enriching the embedding space with corrections (ECSWE)." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-152", "text": "Table 2 demonstrates that learning from the er-9 Using the same parameters as Alikaniotis et al. (2016) ." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-153", "text": "10 Tuning the filter sizes was done for each model separately; for the Glove and Word2Vec models, a filter of size 3 performed better than 9, on both datasets." 
}, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-154", "text": "rors and their corrections enhances the error pretraining performance on public FCE which indicates the usefulness of the approach and its ability to mitigate the effects of data sparsity." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-155", "text": "According to the results, training the model based on naturally occurring errors and their correct counterparts is better suited to the AA task rather than introducing artificial noisy contexts and tuning the embeddings according to scripts' scores or relying on word distributions learned from large corpora." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-156", "text": "For a more robust analysis, we examine the performance when training on additional data (FCE ext ) as shown in Table 3 ." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-157", "text": "Comparing the results in Tables 2 and 3 proves that training with more data boosts the predictive power of all the models." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-158", "text": "It is also clear from Table 3 that with more data, the discrepancy in the performance between SSWE and ESWE models becomes more prominent and ESWE provides a superior performance on all evaluation metrics which suggests that, qualitatively, learning from learners' errors is a more efficient bootstrapping method." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-159", "text": "However, with FCE ext , the ECSWE approach outperforms ESWE on correlation metrics while giving a worse RMSE value." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-160", "text": "This change in the results when training on a bigger dataset indicates that the effect of incorporating the corrections in training becomes less obvious with enough data as the distribution of correct and incorrect ngrams is enough to learn from." 
}, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-161", "text": "----------------------------------" }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-162", "text": "**ANALYSIS**" }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-163", "text": "We conduct further analysis to the scores predicted by AA-specific embeddings by investigating the ability of the ESWE and SSWE models to detect errors in text." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-164", "text": "We run each model for 20 epochs on the public FCE (ngram size = 3) and FCE ext (ngram size = 9) training sets, then test the models on the respective dev sets and examine the output predictions." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-165", "text": "For simplicity, we assign a binary true score for each ngram with a correct score that it maximizes in comparison to the noisy ngrams and script score that should be high for good ngrams that occur in highly-graded scripts." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-166", "text": "The two scores are hence expected to be high for high-quality ngrams and low otherwise, which suggests that they can be used as proxies for error detection." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-167", "text": "We calculate the ngram predicted score of the SSWE model as a weighted sum of the correct and script scores, similar to its loss function (Equation 1 with \u03b1 = 0.1), and map the output to a [0, 1] probability based on the minimum and maximum generated scores." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-168", "text": "11 We calculate the average precision (AP) between the true scores and predicted ones with respect to the error representing class (true score = 0) and compare it to a random baseline, where random probability scores are generated." 
}, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-169", "text": "The results are displayed in Table 4 which shows that ESWE achieves a higher AP score on all evaluation sets, particularly with public FCE, and SSWE's performance is similar to the random baseline." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-170", "text": "This result is expected since the ESWE model is trained to predict actual errors, yet an empirical verification was required." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-171", "text": "We conclude from this analysis that tuning the embeddings based on training writing errors increases their sensitivity to unseen errors which is key for learners' data assessment and yields better performance than comparable pre-training approaches." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-172", "text": "----------------------------------" }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-173", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-174", "text": "In this work, we have presented two error-oriented approaches to train the word embeddings used by 11 Different score combinations were implemented including using only one score, but they all led to similar results." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-175", "text": "writing assessment neural networks." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-176", "text": "The first approach learns to discriminate between good and bad ngrams by leveraging writing errors occurring in learner data." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-177", "text": "The second extends the first by combining the error representations with their suggested corrections and tuning the embedding space accordingly." 
}, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-178", "text": "Our motivation for applying these models is to provide neural assessment systems with the minimum features useful for the task in an attempt to boost their performance on challenging datasets while still avoiding heavy feature engineering." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-179", "text": "The presented results demonstrate that our error-oriented embeddings are better suited for learners' script assessment than generic embeddings trained on large corpora when both are used to bootstrap a neural assessment model." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-180", "text": "Additionally, our embeddings have yielded superior performance to those that rely on ranking correct and noisy contexts as well as words' contributions to the script's overall score." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-181", "text": "Furthermore, extending our error embeddings with error corrections has enhanced the performance when trained on small data, while having a less obvious effect when trained on greater amounts of data which shows their efficacy to enrich the embedding space and mitigate data sparsity issues." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-182", "text": "We further analysed our embeddings and the score-specific ones and showed empirically that error-oriented representations are better at error detection which explicates their superior performance in learners' data evaluation." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-183", "text": "Our best performing model still underperforms the state-of-the-art system by Yannakoudakis et al. (2011) that utilises a wide variety of features, even when they exclude error related features." 
}, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-184", "text": "However, the improvement obtained by error-oriented models over employing generic embeddings or score-specifc ones suggests that our pre-training approach is a promising avenue of research as it provides neural network assessment with useful information and motivates learning relevant properties associated with language proficiency." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-185", "text": "For future work, it will be interesting to jointly train the score-specific model with the errororiented one and test if this could further improve the performance." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-186", "text": "We also suggest fully automating the assessment process by using the outputs of automated error detection and correction systems to build the embeddings rather than relying on handcrafted error annotations." }, { "sent_id": "13d3d973a4be832f66b049b364fea5-C001-187", "text": "Finally, we encourage further examination for other types of fea-" } ], "y": { "@BACK@": { "gold_contexts": [ [ "13d3d973a4be832f66b049b364fea5-C001-17" ], [ "13d3d973a4be832f66b049b364fea5-C001-45" ], [ "13d3d973a4be832f66b049b364fea5-C001-54" ] ], "cite_sentences": [ "13d3d973a4be832f66b049b364fea5-C001-17", "13d3d973a4be832f66b049b364fea5-C001-45", "13d3d973a4be832f66b049b364fea5-C001-54" ] }, "@MOT@": { "gold_contexts": [ [ "13d3d973a4be832f66b049b364fea5-C001-26" ], [ "13d3d973a4be832f66b049b364fea5-C001-54" ] ], "cite_sentences": [ "13d3d973a4be832f66b049b364fea5-C001-26", "13d3d973a4be832f66b049b364fea5-C001-54" ] }, "@SIM@": { "gold_contexts": [ [ "13d3d973a4be832f66b049b364fea5-C001-36" ], [ "13d3d973a4be832f66b049b364fea5-C001-70" ], [ "13d3d973a4be832f66b049b364fea5-C001-152" ] ], "cite_sentences": [ "13d3d973a4be832f66b049b364fea5-C001-36", "13d3d973a4be832f66b049b364fea5-C001-70", "13d3d973a4be832f66b049b364fea5-C001-152" ] }, "@USE@": { "gold_contexts": [ [ 
"13d3d973a4be832f66b049b364fea5-C001-36" ], [ "13d3d973a4be832f66b049b364fea5-C001-66" ], [ "13d3d973a4be832f66b049b364fea5-C001-152" ] ], "cite_sentences": [ "13d3d973a4be832f66b049b364fea5-C001-36", "13d3d973a4be832f66b049b364fea5-C001-66", "13d3d973a4be832f66b049b364fea5-C001-152" ] } } }, "ABC_70d41cad40091bcc30a1fd544c277d_26": { "x": [ { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-26", "text": "Our key contribution is that we are the first to demonstrate the success of BERT on document classification tasks and datasets." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-27", "text": "We establish state-of-the-art results on four popular datasets for this task." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-28", "text": "----------------------------------" }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-29", "text": "**BACKGROUND AND RELATED WORK**" }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-79", "text": "**RESULTS AND DISCUSSION**" }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-106", "text": "----------------------------------" }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-107", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-2", "text": "Pre-trained language representation models achieve remarkable state of the art across a wide range of tasks in natural language processing." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-3", "text": "One of the latest advancements is BERT, a deep pre-trained transformer that yields much better results than its predecessors do." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-4", "text": "Despite its burgeoning popularity, however, BERT has not yet been applied to document classification." 
}, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-5", "text": "This task deserves attention, since it contains a few nuances: first, modeling syntactic structure matters less for document classification than for other problems, such as natural language inference and sentiment classification." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-6", "text": "Second, documents often have multiple labels across dozens of classes, which is uncharacteristic of the tasks that BERT explores." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-7", "text": "In this paper, we describe fine-tuning BERT for document classification." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-8", "text": "We are the first to demonstrate the success of BERT on this task, achieving state of the art across four popular datasets." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-9", "text": "----------------------------------" }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-11", "text": "Until recently, the dominant paradigm in approaching natural language processing (NLP) tasks has been to concentrate on neural architecture design, using only task-specific data and shallow pre-trained word embeddings, such as GloVe (Pennington et al., 2014) and word2vec (Mikolov et al., 2013) ." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-12", "text": "Numerous literature surveys detail this historical neural progress: Young et al. (2018) describe a series of increasingly intricate neural NLP approaches, all of which follow the classical recipe of training on word embeddings of task-specific data." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-13", "text": "In their targeted review of sentence-pair modeling, Lan and Xu (2018) likewise examine neural networks that abide by this paradigm." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-14", "text": "* Equal contribution." 
}, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-15", "text": "The NLP community is, however, witnessing a dramatic paradigm shift toward the pretrained deep language representation model, which achieves state of the art in question answering, sentiment classification, and similarity modeling, to name a few." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-16", "text": "Bidirectional Encoder Representations from Transformers (BERT; Devlin et al., 2018) represents one of the latest developments in this line of work." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-17", "text": "It outperforms its predecessors, ELMo (Peters et al., 2018) and GPT (Radford et al.) , staggeringly exceeding state of the art by a wide margin on multiple natural language understanding tasks." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-18", "text": "The approach consists of two stages: first, BERT is pre-trained on vast amounts of text, with an unsupervised objective of masked language modeling and next-sentence prediction." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-19", "text": "Next, this pre-trained network is then fine-tuned on taskspecific, labeled data." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-20", "text": "Crucially, the pre-trained weights for BERT are already provided-all that is required is to fine-tune the weights on a specific task and dataset, a relatively quick process." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-21", "text": "BERT, however, has not yet been fine-tuned for document classification." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-22", "text": "Why is this worth exploring? For one, modeling syntactic structure is arguably less important for document classification than for BERT's tasks, such as natural language inference and paraphrasing." 
}, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-23", "text": "This claim is supported by our observation that logistic regression and support vector machines are exceptionally strong document classification baselines." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-24", "text": "For another, documents often have several labels across many classes, which is again uncharacteristic of the tasks that BERT examines." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-25", "text": "Thus, in this paper, we explore fine-tuning BERT for document classification." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-30", "text": "Over the last few years, neural network-based architectures have achieved state of the art for document classification." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-31", "text": "Liu et al. (2017) develop XMLCNN for addressing this problem's multi-label nature, which they call extreme classification." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-32", "text": "XMLCNN is based on the popular KimCNN (Kim, 2014) , except with wider convolutional filters, adaptive dynamic maxpooling (Chen et al., 2015; Johnson and Zhang, 2015) , and an additional bottleneck layer to better capture the features of large documents." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-33", "text": "Another popular model, Hierarchical Attention Network (HAN; Yang et al., 2016) explicitly models hierarchical information from documents to extract meaningful features, incorporating word-and sentence-level encoders (with attention) to classify documents." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-34", "text": "Yang et al. (2018) propose a generative approach for multi-label document classification, using encoder-decoder sequence generation models (SGMs) for generating labels for each document." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-35", "text": "Contrary to the previous papers, Adhikari et al. 
(2019) propose LSTM reg, a simple, properly regularized single-layer BiLSTM, which represents current state of the art." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-36", "text": "These task-specific neural architectures have dominated the NLP literature, until recently." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-37", "text": "Enabled by more computational resources and data, the deep language representation model has greatly improved state of the art on a variety of tasks." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-38", "text": "Under this paradigm, a neural network is first pre-trained on vast amounts of text under an unsupervised objective (e.g., masked language modeling and next-sentence prediction), and then fine-tuned on task-specific data." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-39", "text": "The resulting models achieve state of the art in question answering, named-entity recognition, and natural language inference, to name a few." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-40", "text": "Bidirectional Encoder Representations from Transformers (BERT; Devlin et al., 2018) currently represents state of the art, vastly outperforming previous models, such as the Generative Pretrained Transformer (GPT; Radford et al.) and Embeddings from Language Models (ELMo; Peters et al., 2018) ." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-41", "text": "----------------------------------" }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-42", "text": "**MODEL**" }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-43", "text": "We begin with the pre-trained BERT base and BERT large models, which respectively represent the normal and large model variants (Devlin et al., 2018) ." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-44", "text": "To adapt BERT for document classification, we feed the final hidden state of the [CLS] token into an additional softmax classifier (Figure 1)."
}, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-45", "text": "During fine-tuning, we optimize the entire model end-to-end, with the additional softmax classifier parameters W \u2208 IR K\u00d7H , where H is the dimension of the hidden state vectors and K is the number of classes." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-46", "text": "We minimize the crossentropy and binary cross-entropy loss for singlelabel and multi-label tasks, respectively." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-47", "text": "----------------------------------" }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-48", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-49", "text": "Using Hedwig, 1 a deep learning toolkit with preimplemented document classification models, we compare the fine-tuned BERT models against HAN, KimCNN, XMLCNN, SGM, and regularized BiLSTMs." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-50", "text": "For simple yet competitive baselines, we run the default logistic regression (LR) and support vector machine (SVM) implementations from Scikit-Learn (Pedregosa et al., 2011) , trained on the tf-idf vectors of the documents." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-51", "text": "We use Nvidia Tesla V100 and P100 GPUs for fine- tuning BERT, while we run the rest of the experiments on RTX 2080 Ti and GTX 1080 GPUs." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-52", "text": "----------------------------------" }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-53", "text": "**DATASETS**" }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-54", "text": "We use the following four datasets to evaluate BERT: Reuters-21578 (Reuters; Apt\u00e9 et al., 1994) , IMDB reviews, arXiv Academic Paper dataset (AAPD; Yang et al., 2018) and Yelp 2014 reviews." 
}, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-55", "text": "Reuters and AAPD are multi-label datasets, while IMDB and Yelp are single label." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-56", "text": "Table 1 summarizes the statistics of the datasets." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-57", "text": "For AAPD, we use the splits provided by Yang et al. (2018) ; for Reuters, we use the standard ModApt\u00e9 splits (Apt\u00e9 et al., 1994) ." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-58", "text": "For IMDB and Yelp, following Yang et al. (2016) , we randomly sample 80% of the data for training and 10% each for validation and test." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-59", "text": "----------------------------------" }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-60", "text": "**TRAINING AND HYPERPARAMETERS**" }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-61", "text": "We tune the number of epochs, batch size, learning rate, and maximum sequence length (MSL), the number of tokens that documents are truncated to." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-62", "text": "We observe that model quality is quite sensitive to the number of epochs, and thus the number must be tailored for each dataset." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-63", "text": "We finetune on Reuters, AAPD, and IMDB for 30, 20 and 4 epochs, respectively." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-64", "text": "Due to resource constraints, we fine-tune on Yelp for only one epoch." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-65", "text": "As is the case with Devlin et al. (2018) , we find that choosing a batch size of 16, learning rate of 2\u00d710 \u22125 , and MSL of 512 tokens yields optimal performance on the validation sets of all datasets." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-66", "text": "Hyperparameter study." 
}, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-67", "text": "To gauge the improvement over the default hyperparameters, as well as to highlight the differences in fine-tuning BERT for document classification, we explore varying several key hyperparameters: namely, the number of epochs and the MSL." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-68", "text": "Originally, Devlin et al. (2018) find that fine-tuning for three or four epochs works well for both small and large datasets alike." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-69", "text": "They also apply a generous MSL of 512, which may be unnecessary for document classification, where fewer tokens may suffice in determining the topic." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-70", "text": "Furthermore, while conducting our experiments, we find that even fine-tuning BERT is a computationally intensive task." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-71", "text": "We argue that it is important to study these two hyperparameters, as they are major determinants of the computational resources required to fine-tune BERT." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-72", "text": "BERT large , for example, requires eight V100s to fine-tune on our datasets, which is of course prohibitive." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-73", "text": "The number of epochs determines the duration of finetuning, while maximum sequence length dictates the models' memory and computational footprint during both fine-tuning and inference." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-74", "text": "Thus, we vary the number of epochs and MSL on a few selected datasets." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-75", "text": "We choose Reuters and AAPD for analyzing the choice of epochs, since they use a non-standard 20 and 30 epochs, respectively." 
}, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-76", "text": "We select Reuters and IMDB for varying the MSL, since these two datasets differ greatly in average document length: Reuters documents average 144 words, while IMDB averages 394." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-77", "text": "(4) Ours (30) Theirs (4) Ours (20) Reuters AAPD Figure 2 : From left to right, results on the validation set from varying the MSL and the number of epochs." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-78", "text": "----------------------------------" }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-80", "text": "We report the mean F 1 scores and accuracy across five runs for the reimplementations of KimCNN, XMLCNN, HAN and LSTM reg in Table 2 ." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-81", "text": "Due to resource limitations, we report the scores from only a single run for BERT base and BERT large ." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-82", "text": "We also copy the value from Yang et al. (2018) for SGM on AAPD in Table 2 , row 8, as we have failed to replicate the authors' results using their codebase, environment, and data splits." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-83", "text": "----------------------------------" }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-84", "text": "**MODEL QUALITY**" }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-85", "text": "Trending with Devlin et al. (2018) , BERT large achieves state-of-the-art results on all four datasets, followed by BERT base (see Table 2 , rows 11 and 12)." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-86", "text": "The considerably simpler LSTM reg model (row 10) achieves a high F 1 and accuracy of 87.0 and 52.8, respectively, coming close to the quality of BERT base ." 
}, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-87", "text": "Surprisingly, the LR and SVM baselines yield competitive results for the multi-label datasets." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-88", "text": "For instance, the SVM approaches BERT base results on Reuters, with an F 1 score of 86.1, astonishingly exceeding most of our neural baselines (rows 2-11)." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-89", "text": "This can also be observed on AAPD, where the SVM surpasses most of the neural models, except SGM, LSTM reg , and BERT." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-90", "text": "However, on the single-label datasets, both LR and SVM perform worse than even our simple neural baselines, such as KimCNN." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-91", "text": "It is worth noting that LR and SVM take only a fraction of the time and resources for training our neural models." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-92", "text": "----------------------------------" }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-93", "text": "**HYPERPARAMETER ANALYSIS**" }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-94", "text": "MSL analysis." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-95", "text": "A decrease in the MSL corresponds to only a minor loss in F 1 on Reuters (see leftmost chart in Figure 2 ), possibly due to Reuters having shorter sentences." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-96", "text": "On IMDB (second subplot from left), lowering the MSL corresponds to a drastic fall in accuracy, suggesting that the entire document is necessary for this dataset." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-97", "text": "On the one hand, these results appear obvious." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-98", "text": "Alternatively, one can argue that, since IMDB contains longer documents, truncating tokens may hurt less." 
}, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-99", "text": "Figure 2 shows that this is not the case, since truncating to even 256 tokens causes accuracy to fall lower than that of the much smaller LSTM reg (see Table 2 )." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-100", "text": "From these results, we conclude that any amount of truncation is detrimental in document classification, but the level of degradation may differ." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-101", "text": "Epoch analysis." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-102", "text": "The rightmost two subplots in Figure 2 illustrate the F 1 score of BERT fine-tuned using a various number of epochs for AAPD and Reuters." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-103", "text": "Contrary to Devlin et al. (2018) , who achieve state of the art on small datasets with only a few epochs of fine-tuning, we find that smaller datasets require many more epochs to converge." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-104", "text": "On both the datasets (see Figure 2) , we see a significant drop in model quality when the BERT models are fine-tuned on only four epochs, as suggested in the original paper." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-105", "text": "On Reuters, using four epochs results in an F 1 worse than even logistic regression (Table 2 , row 1)." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-108", "text": "We demonstrate that BERT can be fine-tuned successfully to achieve state of the art across four popular datasets in document classification." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-109", "text": "We describe and explore a few differences in this task, conducting an analysis for the maximum sequence length and number of epochs." 
}, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-110", "text": "One direction of future work involves compressing BERT to both fine-tune and run tractably under resource-constrained scenarios." }, { "sent_id": "70d41cad40091bcc30a1fd544c277d-C001-111", "text": "Currently, fine-tuning the large variant is prohibitively expensive, while inference is still too heavyweight for practical deployment." } ], "y": { "@BACK@": { "gold_contexts": [ [ "70d41cad40091bcc30a1fd544c277d-C001-16" ], [ "70d41cad40091bcc30a1fd544c277d-C001-40" ], [ "70d41cad40091bcc30a1fd544c277d-C001-68" ], [ "70d41cad40091bcc30a1fd544c277d-C001-85" ] ], "cite_sentences": [ "70d41cad40091bcc30a1fd544c277d-C001-16", "70d41cad40091bcc30a1fd544c277d-C001-40", "70d41cad40091bcc30a1fd544c277d-C001-68", "70d41cad40091bcc30a1fd544c277d-C001-85" ] }, "@USE@": { "gold_contexts": [ [ "70d41cad40091bcc30a1fd544c277d-C001-43" ] ], "cite_sentences": [ "70d41cad40091bcc30a1fd544c277d-C001-43" ] }, "@EXT@": { "gold_contexts": [ [ "70d41cad40091bcc30a1fd544c277d-C001-43", "70d41cad40091bcc30a1fd544c277d-C001-44", "70d41cad40091bcc30a1fd544c277d-C001-45" ] ], "cite_sentences": [ "70d41cad40091bcc30a1fd544c277d-C001-43" ] }, "@SIM@": { "gold_contexts": [ [ "70d41cad40091bcc30a1fd544c277d-C001-65" ], [ "70d41cad40091bcc30a1fd544c277d-C001-85" ] ], "cite_sentences": [ "70d41cad40091bcc30a1fd544c277d-C001-65", "70d41cad40091bcc30a1fd544c277d-C001-85" ] }, "@MOT@": { "gold_contexts": [ [ "70d41cad40091bcc30a1fd544c277d-C001-68", "70d41cad40091bcc30a1fd544c277d-C001-69", "70d41cad40091bcc30a1fd544c277d-C001-70", "70d41cad40091bcc30a1fd544c277d-C001-71" ] ], "cite_sentences": [ "70d41cad40091bcc30a1fd544c277d-C001-68" ] }, "@DIF@": { "gold_contexts": [ [ "70d41cad40091bcc30a1fd544c277d-C001-103" ] ], "cite_sentences": [ "70d41cad40091bcc30a1fd544c277d-C001-103" ] } } }, "ABC_c067711a58722737ef8b7ea987bcf3_26": { "x": [ { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-1", "text": "**ABSTRACT**" }, { "sent_id": 
"c067711a58722737ef8b7ea987bcf3-C001-2", "text": "Recently, sequence-to-sequence models have achieved impressive performance on a number of semantic parsing tasks." }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-3", "text": "However, they often do not exploit available linguistic resources, while these, when employed correctly, are likely to increase performance even further." }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-4", "text": "Research in neural machine translation has shown that employing this information has a lot of potential, especially when using a multi-encoder setup." }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-5", "text": "We employ a range of semantic and syntactic resources to improve performance for the task of Discourse Representation Structure Parsing." }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-6", "text": "We show that (i) linguistic features can be beneficial for neural semantic parsing and (ii) the best method of adding these features is by using multiple encoders." }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-7", "text": "----------------------------------" }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-8", "text": "**INTRODUCTION 2 DATA AND METHODOLOGY**" }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-9", "text": "----------------------------------" }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-10", "text": "**DISCOURSE REPRESENTATION STRUCTURES**" }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-11", "text": "DRSs are formal meaning representations based on Discourse Representation Theory (Kamp and Reyle, 1993) ." }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-12", "text": "We use the version of DRT as provided in the Parallel Meaning Bank (PMB, ), a semantically annotated parallel corpus, with texts in English, Italian, German and Dutch." 
}, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-13", "text": "DRSs are rich meaning representations containing quantification, negation, reference resolution, comparison operators, discourse relations, concepts based on WordNet, and semantic roles based on VerbNet." }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-14", "text": "All experiments are performed using the data of the PMB." }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-15", "text": "In our experiments, we only use the English texts and corresponding DRSs." }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-16", "text": "We use PMB release 2.2.0, which contains gold standard (fully manually annotated) data of which we use 4,597 as train, 682 as dev and 650 as test instances." }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-17", "text": "It also contains 67,965 silver (partially manually annotated) and 120,662 bronze (no manual annotations) instances." }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-18", "text": "Most sentences are between 5 and 15 tokens in length." }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-19", "text": "Since we will compare our results mainly to Van Noord, Abzianidze, Toral, and Bos (2018) , we will only employ the gold and silver data." }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-20", "text": "Name x3 \"tom\"" }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-21", "text": "Figure 1: DRS in box format (a), gold clause representation (b) and example system output (c) for I am not working for Tom, with precision of 5/8 and recall of 5/9, resulting in an F-score of 58.8." 
}, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-22", "text": "----------------------------------" }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-23", "text": "**REPRESENTING INPUT AND OUTPUT**" }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-24", "text": "We represent the source and target data in the same way as Van Noord, Abzianidze, Toral, and Bos (2018) , who represent the source sentence as a sequence of characters, with a special character indicating uppercase characters." }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-25", "text": "The target DRS is also represented as a sequence of characters, with the exception of DRS operators, thematic roles and DRS variables, which are represented as super characters (Van Noord and Bos, 2017b), i.e. individual tokens." }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-26", "text": "Since the variable names itself are meaningless, the DRS variables are rewritten to a more general representation, using the De Bruijn index (de Bruijn, 1972) ." }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-27", "text": "In a post-processing step, the original clause structured is restored." }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-28", "text": "1 To include morphological and syntactic information, we apply a lemmatizer, POS-tagger and dependency parser using Stanford CoreNLP (Manning et al., 2014) , similar to Sennrich and Haddow (2016) for machine translation." }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-29", "text": "The lemmas and POS-tags are added as a token after each word." }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-30", "text": "For the dependency parse, we add the incoming arc for each word." }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-31", "text": "We also apply the easyCCG parser of Lewis and Steedman (2014) , using the supertags." 
}, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-32", "text": "2 Finally, we exploit semantic information by using semantic tags (Bjerva et al., 2016; ." }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-33", "text": "Semantic tags are language-neutral semantic categories, which get assigned to a word in a similar fashion as part-of-speech tags." }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-34", "text": "Semantic tags are able to express important semantic distinctions, such as negation, modals and types of quantification." }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-35", "text": "We train a semantic tagger with the TnT tagger (Brants, 2000) on the gold and silver standard data in the PMB release." }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-36", "text": "Examples of the input to the model for each source of information are shown in Table 1 ." }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-37", "text": "There are two ways to add the linguistic information; (1) merging all the information (i.e., input text and linguistic information) in a single encoder, or (2) using multiple encoders (i.e., encoding separately the input text and the linguistic information)." }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-38", "text": "Multi-source encoders were initially introduced for multilingual translation (Zoph and Knight, 2016; Firat et al., 2016; Libovick\u00fd and Helcl, 2017) , but recently were used to introduce syntactic information to the model (Currey and Heafield, 2018) ." }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-39", "text": "Table 2 shows examples of how the input is structured for using one or more encoders." }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-40", "text": "Experiments showed that using more than two encoders drastically decreased performance." 
}, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-41", "text": "Therefore, we merge all the linguistic information in a single encoder (see last row of Table 2 )." }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-42", "text": "----------------------------------" }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-43", "text": "**NEURAL ARCHITECTURE**" }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-44", "text": "We employ a recurrent sequence-to-sequence neural network with attention (Bahdanau et al., 2014) and two bi-LSTM layers, similar to the one used by Van Noord, Abzianidze, Toral, and Bos (2018) ." }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-45", "text": "However, their model was trained with OpenNMT (Klein et al., 2017) , which does not support multiple encoders." }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-46", "text": "Therefore, we switch to the sequence-to-sequence framework implemented in Marian (Junczys-Dowmunt et al., 2018)." }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-47", "text": "We use model-type s2s (for a single encoder) or multi-s2s (for multiple encoders)." }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-48", "text": "For the latter, this means that the multiple inputs are encoded separately by an identical RNN (without sharing parameters)." }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-49", "text": "The encoders share a single decoder, in which the resulting context vectors are concatenated." }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-50", "text": "An attention layer 3 is then applied to selectively give more attention to certain parts of the vector (i.e. it can learn that the words themselves are more important than just the POS-tags)." }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-51", "text": "A detailed overview of our parameter settings, found after a search on the dev set, can be found in Table 3 ." 
}, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-52", "text": "When only using gold data, training is stopped after 15 epochs." }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-53", "text": "For gold + silver data, we stop training after 6 epochs, after which we restart the training process from that checkpoint to finetune on only the gold data, also for 6 epochs." }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-54", "text": "----------------------------------" }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-55", "text": "**EVALUATION PROCEDURE**" }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-56", "text": "Produced DRSs are compared with the gold standard representations by using COUNTER (Van Noord, Abzianidze, Haagsma, and Bos, 2018) ." }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-57", "text": "This is a tool that calculates micro precision, recall and F-score over matching clauses, similar to the SMATCH (Cai and Knight, 2013 ) evaluation tool for AMR parsing." }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-58", "text": "All clauses have the same weight in matching, except for REF clauses, which are ignored." }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-59", "text": "An example of the matching procedure is shown in Figure 1 ." }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-60", "text": "The produced DRSs go through a strict syntactic and semantic validation process, as described in Van Noord, Abzianidze, Toral, and Bos (2018) ." }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-61", "text": "If a produced DRS is invalid, it is replaced by a dummy DRS, which gets an F-score of 0.0." }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-62", "text": "We check whether two systems differ significantly by performing approximate randomization (Noreen, 1989) , with \u03b1 = 0.05, R = 1000 and F (model 1 ) > F (model 2 ) as test statistic for each DRS pair." 
}, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-63", "text": "----------------------------------" }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-64", "text": "**RESULTS AND DISCUSSION**" }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-65", "text": "We perform all our experiments twice: (i) only using gold data for training and (ii) with both gold (fully manually annotated) and silver (partially manually annotated) data." }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-66", "text": "The results of adding external sources of linguistic information are shown in Table 4 ." }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-67", "text": "We clearly see that using an additional encoder for the linguistic information is superior to merging all the information in a single encoder." }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-68", "text": "For two encoders and only using gold data, the scores increase by at least 0.7 for each source of information individually." }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-69", "text": "Lemmatization shows the highest improvement, most likely because the DRS concepts that need to be produced are often lemmatized versions of the source words." }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-70", "text": "When we stack the linguistic features, we observe an improvement for each addition, resulting in a final 2.7 point F-score increase over the baseline." }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-71", "text": "If we also employ silver data, we again observe that the multi-encoder setup is preferable over a single encoder, for both isolating and stacking the linguistic features." }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-72", "text": "On isolation, the results are similar to only using gold data, with the exception of the semantic tags, which even hurt the performance now." 
}, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-73", "text": "Interestingly, when stacking the linguistic features, there is no improvement over only using the lemma of the source words." }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-74", "text": "We now compare our best models to previous parsers 4 (Bos, 2015; Van Noord, Abzianidze, Toral, and Bos, 2018) and two baseline systems, SPAR and SIM-SPAR." }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-75", "text": "As previously indicated, Van Noord, Abzianidze, Toral, and Bos (2018) used a similar sequence-to-sequence model as our current approach, but implemented in OpenNMT and without the linguistic features." }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-76", "text": "Boxer (Bos, 2008 (Bos, , 2015 ) is a DRS parser that uses a statistical CCG parser for syntactic analysis and a compositional semantics based on \u03bb-calculus, followed by pronoun and presupposition resolution." }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-77", "text": "SPAR is a baseline system that outputs the same DRS for each test instance 5 , while SIM-SPAR outputs the DRS of the most similar sentence in the training set, based on a simple word embedding metric." }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-78", "text": "6 The results are shown in Table 5 ." }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-79", "text": "Our model clearly outperforms the previous systems, even when only using gold standard data." }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-80", "text": "When compared to Van Noord, Abzianidze, Toral, and Bos (2018) , retrained with the same data used in our systems, the largest improvement (3.6 and 3.5 for dev and test) comes from switching framework and changing certain parameters such as the optimizer and learning rate." 
}, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-81", "text": "However, the linguistic features are clearly still beneficial when using only gold data (increase of 2.7 and 1.9 for dev and test), and also still help when employing additional silver data (1.1 and 0.3 increase for dev and test, both significant)." }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-82", "text": "Abzianidze, Toral, and Bos (2018)" }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-83", "text": "----------------------------------" }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-84", "text": "**CONCLUSIONS**" }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-85", "text": "In this paper we have shown that a range of linguistic features can improve performance of sequence-tosequence models for the task of parsing Discourse Representation Structures." }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-86", "text": "We have shown empirically that the best method of adding these features is by using a multi-encoder setup, as opposed to merging the sources of linguistic information in a single encoder." }, { "sent_id": "c067711a58722737ef8b7ea987bcf3-C001-87", "text": "We believe that this method can also be beneficial for other semantic parsing tasks in which sequence-to-sequence models do well." 
} ], "y": { "@USE@": { "gold_contexts": [ [ "c067711a58722737ef8b7ea987bcf3-C001-19" ], [ "c067711a58722737ef8b7ea987bcf3-C001-24" ], [ "c067711a58722737ef8b7ea987bcf3-C001-44" ], [ "c067711a58722737ef8b7ea987bcf3-C001-60" ], [ "c067711a58722737ef8b7ea987bcf3-C001-74" ], [ "c067711a58722737ef8b7ea987bcf3-C001-75" ] ], "cite_sentences": [ "c067711a58722737ef8b7ea987bcf3-C001-19", "c067711a58722737ef8b7ea987bcf3-C001-24", "c067711a58722737ef8b7ea987bcf3-C001-44", "c067711a58722737ef8b7ea987bcf3-C001-60", "c067711a58722737ef8b7ea987bcf3-C001-74", "c067711a58722737ef8b7ea987bcf3-C001-75" ] }, "@DIF@": { "gold_contexts": [ [ "c067711a58722737ef8b7ea987bcf3-C001-75" ], [ "c067711a58722737ef8b7ea987bcf3-C001-80" ] ], "cite_sentences": [ "c067711a58722737ef8b7ea987bcf3-C001-75", "c067711a58722737ef8b7ea987bcf3-C001-80" ] }, "@SIM@": { "gold_contexts": [ [ "c067711a58722737ef8b7ea987bcf3-C001-75" ] ], "cite_sentences": [ "c067711a58722737ef8b7ea987bcf3-C001-75" ] } } }, "ABC_ab5788da3f24e01b0ec40fba0bdbec_26": { "x": [ { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-2", "text": "Back-translation provides a simple yet effective approach to exploit monolingual corpora in Neural Machine Translation (NMT)." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-3", "text": "Its iterative variant, where two opposite NMT models are jointly trained by alternately using a synthetic parallel corpus generated by the reverse model, plays a central role in unsupervised machine translation." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-4", "text": "In order to start producing sound translations and provide a meaningful training signal to each other, existing approaches rely on either a separate machine translation system to warm up the iterative procedure, or some form of pre-training to initialize the weights of the model." 
}, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-5", "text": "In this paper, we analyze the role that such initialization plays in iterative back-translation." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-6", "text": "Is the behavior of the final system heavily dependent on it? Or does iterative back-translation converge to a similar solution given any reasonable initialization?" }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-7", "text": "Through a series of empirical experiments over a diverse set of warmup systems, we show that, although the quality of the initial system does affect final performance, its effect is relatively small, as iterative back-translation has a strong tendency to convergence to a similar solution." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-8", "text": "As such, the margin of improvement left for the initialization method is narrow, suggesting that future research should focus more on improving the iterative mechanism itself." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-9", "text": "----------------------------------" }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-11", "text": "Back-translation (Sennrich et al., 2016) allows to naturally exploit monolingual corpora in Neural Machine Translation (NMT) by using a reverse model to generate a synthetic parallel corpus." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-12", "text": "Despite its simplicity, this technique has become a key component in state-of-the-art NMT systems." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-13", "text": "For instance, the majority of WMT19 submissions, including the best performing systems, made extensive use of it (Barrault et al., 2019) ." 
}, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-14", "text": "While the synthetic parallel corpus generated through back-translation is typically combined with real parallel corpora, iterative or online variants of this technique also play a central role in unsupervised machine translation (Artetxe et al., 2018a (Artetxe et al., ,b, 2019 Lample et al., 2018a,b; Marie and Fujita, 2018; Conneau and Lample, 2019; Song et al., 2019; Liu et al., 2020) ." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-15", "text": "In iterative back-translation, both NMT models are jointly trained using synthetic parallel data generated on-the-fly with the reverse model, alternating between both translation directions iteratively." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-16", "text": "While this enables fully unsupervised training without parallel corpora, some initialization mechanism is still required so the models can start producing sound translations and provide a meaningful training signal to each other." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-17", "text": "For that purpose, state-of-the-art approaches rely on either a separately trained unsupervised Statistical Machine Translation (SMT) system, which is used for warmup during the initial back-translation iterations (Marie and Fujita, 2018; Artetxe et al., 2019) , or large-scale pre-training through masked denoising, which is used to initialize the weights of the underlying encoder-decoder (Conneau and Lample, 2019; Song et al., 2019; Liu et al., 2020) ." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-18", "text": "In this paper, we aim to understand the role that the initialization mechanism plays in iterative backtranslation." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-19", "text": "For that purpose, we mimic the experimental settings of Artetxe et al. 
(2019), and measure the effect of using different initial systems for warmup: the unsupervised SMT system proposed by Artetxe et al. (2019) themselves, supervised NMT and SMT systems trained on both small and large parallel corpora, and a commercial Rule-Based Machine Translation (RBMT) system." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-20", "text": "Despite the fundamentally different nature of these systems, our analysis reveals that iterative back-translation has a strong tendency to converge to a similar solution." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-21", "text": "Given the relatively small impact of the initial system, we conclude that future research on unsupervised machine translation should focus more on improving the iterative back-translation mechanism itself." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-22", "text": "----------------------------------" }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-23", "text": "**ITERATIVE BACK-TRANSLATION**" }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-24", "text": "We next describe the iterative back-translation implementation used in our experiments, which was proposed by Artetxe et al. (2019) ." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-25", "text": "Note, however, that the underlying principles of iterative back-translation are very general, so our conclusions should be valid beyond this particular implementation." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-26", "text": "The method in question trains two NMT systems in opposite directions following an iterative process where, at every iteration, each model is updated by performing a single pass over a set of N synthetic parallel sentences generated through back-translation." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-27", "text": "After iteration a, the synthetic parallel corpus is entirely generated by the reverse NMT model." 
}, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-28", "text": "However, so as to ensure that the NMT models produce sound translations and provide meaningful training signal to each other, the first a warmup iterations progressively transition from a separate initial system to the reverse NMT model itself." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-29", "text": "More concretely, iteration t uses N init = N \u00b7 max(0, 1 \u2212 t/a) back-translated sentences from the reverse initial system, and the remaining N \u2212 N initial sentences are generated by the reverse NMT model." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-30", "text": "In the latter case, half of the translations use random sampling , which produces more varied translations, whereas the other half are generated through greedy decoding, which produces more fluent and predictable translations." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-31", "text": "Following Artetxe et al. (2019) , we set N = 1, 000, 000 and a = 30, and perform a total of 60 such iterations." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-32", "text": "Both NMT models use the big transformer implementation from Fairseq 1 , training with a total batch size of 20,000 tokens with the exact same hyperparameters as ." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-33", "text": "At test time, we use beam search decoding with a beam size of 5." 
}, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-34", "text": "----------------------------------" }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-35", "text": "**EXPERIMENTAL SETTINGS**" }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-36", "text": "So as to better understand the role of initialization in iterative back-translation, we train different English-German models using the following initial systems for warmup:" }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-37", "text": "\u2022 RBMT: We use the commercial Lucy LT translator (Alonso and Thurmair, 2003), a tra-1 https://github.com/pytorch/fairseq ditional transfer-based RBMT system combining human crafted computational grammars and monolingual and bilingual lexicons." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-38", "text": "\u2022 Supervised NMT: We use the Fairseq implementation of the big transformer model using the same hyperparameters as ." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-39", "text": "We train two separate models: one using the concatenation of all parallel corpora from WMT 2014, and another one using a random subset of 100,000 sentences." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-40", "text": "In both cases, we use early stopping according to the cross-entropy in newstest2013." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-41", "text": "\u2022 Supervised SMT: We use the Moses (Koehn et al., 2007) implementation of phrase-based SMT (Koehn et al., 2003) with default hyperparameters, using FastAlign (Dyer et al., 2013) for word alignment." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-42", "text": "We train two separate models using the same parallel corpus splits as for NMT." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-43", "text": "In both cases, we use a 5-gram language model trained with KenLM (Heafield et al., 2013) on News Crawl 2007-2013, and apply MERT tuning (Och, 2003) over newstest2013." 
}, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-44", "text": "\u2022 Unsupervised: We use the unsupervised SMT system proposed by Artetxe et al. (2019) , which induces an initial phrase-table using cross-lingual word embedding mappings, combines it with an n-gram language model, and further improves the resulting model through unsupervised tuning and joint refinement." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-45", "text": "For each initial system, we train a separate NMT model through iterative back-translation as described in Section 2." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-46", "text": "For that purpose, we use the News Crawl 2007-2013 monolingual corpus as distributed in the WMT 2014 shared task." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-47", "text": "2 Preprocessing is done using standard Moses tools, and involves punctuation normalization, tokenization with aggressive hyphen splitting, and truecasing." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-48", "text": "We evaluate in newstest2014 using tokenized BLEU, and compare the performance of the different final systems after iterative back-translation and the initial systems used in their warmup." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-49", "text": "3 However, this only provides a measure of the quality of the different systems, but not the similarity of the translations they produce." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-50", "text": "So as to quantify how similar the translations of two systems are, we compute their corresponding BLEU scores taking one of them as the reference." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-51", "text": "This way, we report the average similarity of each final system with the rest of final systems, and analogously for the initial ones." 
}, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-52", "text": "Finally, we also compute the similarity between each initial system and its corresponding final system, which measures how much the final solution found by iterative back-translation differs from the initial one." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-53", "text": "Table 1 reports the test scores of different initial systems along with their corresponding final systems after iterative-backtranslation." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-54", "text": "As it can be seen, the standard deviation across final systems is substantially lower than across initial systems (1.7 vs 5.6 in German-to-English and 1.4 vs 4.9 in English-to-German), which shows that iterative back-translation tends to converge to solutions of a similar quality." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-55", "text": "This way, while the initial system does have certain influence in final performance, differences greatly diminish after applying iterative back-translation." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-56", "text": "For instance, the full NMT system is 13.4 points better than the RBMT system in German-to-English, but this difference goes down to 2.3 points after iterative back-translation." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-57", "text": "Interestingly, better initial systems do not always lead to better final systems." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-58", "text": "For instance, the initial RBMT system is weaker than both the unsupervised system and the small SMT system, yet it leads to a better final system after iterative backtranslation." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-59", "text": "Similarly, the small SMT model is substantially better than the small NMT model in German-to-English (19.6 vs 15.2), yet they both lead to the exact same BLEU score of 25.0 after iterative back-translation." 
}, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-60", "text": "We hypothesize that certain properties of the initial system are more relevant than others and, in particular, our results suggest that the adequacy and lexical coverage of the initial systems has a larger impact than its fluency." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-61", "text": "At the same time, it is remarkable that iterative back-translation has a generally positive impact, bringing an average improvement of 4.9 BLEU points for German-to-English and 4.3 BLEU points for English-to-German." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-62", "text": "Nevertheless, the full NMT system is a notable exception, as the final system learned through iterative back-translation is weaker than the initial system used for warmup." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-63", "text": "This reinforces the idea that iterative back-translation converges to a solution of a similar quality regardless of that of the initial system, to the extent that it can even deteriorate performance when the initial system is very strong." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-64", "text": "----------------------------------" }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-65", "text": "**RESULTS**" }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-66", "text": "So as to get a more complete picture of this behavior, Table 2 reports the average similarity between each final system and the rest of the final systems, and analogously for the initial ones." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-67", "text": "As it can be seen, final systems trained through iterative back-translation tend to produce substantially more similar translations than the initial systems used in their warmup (49.3 vs 28.2 for German-to-English and 42.9 vs 24.3 for English-to-German)." 
}, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-68", "text": "This suggests that iterative back-translation does not only converge to solutions of similar quality, but also to solutions that have a similar behavior." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-69", "text": "Interestingly, this also applies to systems that follow a fundamentally different paradigm as it is the case of RBMT." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-70", "text": "In relation to that, note that the similarity of each final system and its corresponding initial system is rather low, which reinforces the idea that the solution found by iterative back-translation is not heavily dependent on the initial system." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-71", "text": "----------------------------------" }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-72", "text": "**RELATED WORK**" }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-73", "text": "Originally proposed by Sennrich et al. (2016) , backtranslation has been widely adopted by the machine translation community (Barrault et al., 2019) , yet its behavior is still not fully understood." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-74", "text": "Several authors have studied the optimal balance between real and synthetic parallel data, concluding that using too much synthetic data can be harmful (Poncelas et al., 2018; Fadaee and Monz, 2018; ." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-75", "text": "In addition to that, Fadaee and Monz (2018) observe that back-translation is most helpful for tokens with a high prediction loss, and use this insight to design a better selection method for monolingual data." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-76", "text": "At the same time, show that random sampling provides a stronger training signal than beam search or greedy decoding." 
}, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-77", "text": "Closer to our work, the impact of the system used for backtranslation has also been explored by some authors (Sennrich et al., 2016; Burlot and Yvon, 2018) , although the iterative back-translation variant, which allows to jointly train both systems so they can help each other, was not considered, and synthetic data was always combined with real parallel data." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-78", "text": "While all the previous authors use a fixed system to generate synthetic parallel corpora, Hoang et al. (2018) propose performing a second iteration of back-translation." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-79", "text": "Iterative back-translation was also explored by Marie and Fujita (2018) and Artetxe et al. (2019) in the context of unsupervised machine translation, relying on an unsupervised SMT system (Lample et al., 2018b; Artetxe et al., 2018a) for warmup." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-80", "text": "Early work in unsupervised NMT also incorporated the idea of on-the-fly backtranslation, which was combined with denoising autoencoding and a shared encoder initialized through unsupervised cross-lingual embeddings (Artetxe et al., 2018b; Lample et al., 2018a) ." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-81", "text": "More recently, several authors have performed large-scale unsupervised pre-training through masked denoising to initialize the full model, which is then trained through iterative back-translation (Conneau and Lample, 2019; Song et al., 2019; Liu et al., 2020) ." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-82", "text": "Finally, iterative back-translation is also connected to the reconstruction loss in dual learning (He et al., 2016) , which incorporates an additional language modeling loss and also requires a warm start." 
}, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-83", "text": "----------------------------------" }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-84", "text": "**CONCLUSIONS**" }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-85", "text": "In this paper, we empirically analyze the role that initialization plays in iterative back-translation." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-86", "text": "For that purpose, we try a diverse set of initial systems for warmup, and analyze the behavior of the resulting systems in relation to them." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-87", "text": "Our results show that differences in the initial systems heavily diminish after applying iterative back-translation." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-88", "text": "At the same time, we observe that iterative backtranslation has a hard ceiling, to the point that it can even deteriorate performance when the initial system is very strong." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-89", "text": "As such, we conclude that the margin for improvement left for the initialization is rather narrow, encouraging future research to focus more on improving the iterative back-translation mechanism itself." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-90", "text": "In the future, we would like to better characterize the specific factors of the initial systems that are most relevant." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-91", "text": "At the same time, we would like to design a simpler unsupervised system for warmup that is sufficient for iterative back-translation to converge to a good solution." }, { "sent_id": "ab5788da3f24e01b0ec40fba0bdbec-C001-92", "text": "Finally, we would like to incorporate pre-training methods like masked denoising into our analysis." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "ab5788da3f24e01b0ec40fba0bdbec-C001-17" ], [ "ab5788da3f24e01b0ec40fba0bdbec-C001-79" ] ], "cite_sentences": [ "ab5788da3f24e01b0ec40fba0bdbec-C001-17", "ab5788da3f24e01b0ec40fba0bdbec-C001-79" ] }, "@USE@": { "gold_contexts": [ [ "ab5788da3f24e01b0ec40fba0bdbec-C001-19" ], [ "ab5788da3f24e01b0ec40fba0bdbec-C001-24" ], [ "ab5788da3f24e01b0ec40fba0bdbec-C001-31" ], [ "ab5788da3f24e01b0ec40fba0bdbec-C001-44" ] ], "cite_sentences": [ "ab5788da3f24e01b0ec40fba0bdbec-C001-19", "ab5788da3f24e01b0ec40fba0bdbec-C001-24", "ab5788da3f24e01b0ec40fba0bdbec-C001-31", "ab5788da3f24e01b0ec40fba0bdbec-C001-44" ] } } }, "ABC_d92e92b9a375914f3dd74868f463fc_26": { "x": [ { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-2", "text": "In this paper, we explore a multilingual translation model with a cross-lingually shared layer that can be used as fixed-size sentence representation in different downstream tasks." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-3", "text": "We systematically study the impact of the size of the shared layer and the effect of including additional languages in the model." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-4", "text": "In contrast to related previous work, we demonstrate that the performance in translation does correlate with trainable downstream tasks." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-5", "text": "In particular, we show that larger intermediate layers not only improve translation quality, especially for long sentences, but also push the accuracy of trainable classification tasks." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-6", "text": "On the other hand, shorter representations lead to increased compression that is beneficial in non-trainable similarity tasks." 
}, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-7", "text": "We hypothesize that the training procedure on the downstream task enables the model to identify the encoded information that is useful for the specific task whereas nontrainable benchmarks can be confused by other types of information also encoded in the representation of a sentence." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-8", "text": "----------------------------------" }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-10", "text": "Neural Machine Translation (NMT) has rapidly become the new Machine Translation (MT) paradigm, significantly improving over the traditional statistical machine translation procedure ." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-11", "text": "Recently, several models and variants have been proposed with increased research efforts towards multilingual machine translation (Firat et al., 2016; Lakew et al., 2018; Wang et al., 2018; Blackwood et al., 2018; Lu et al., 2018) ." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-12", "text": "The main motivation of multilingual models is the effect of transfer learning that enables machine translation systems to benefit from relationships between languages and training signals that come from different datasets (Ha et al., 2016; Johnson et al., 2017; Gu et al., 2018) ." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-13", "text": "Another aspect that draws interest in translation models is the effective computation of sentence representations using the translation task as an auxiliary semantic signal (Hill et al., 2016; McCann et al., 2017; Schwenk and Douze, 2017; Subramanian et al., 2018 )." 
}, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-14", "text": "An important feature that enables an immediate use of the MT-based representations in other downstream tasks is the creation of fixed-sized sentence embeddings (C\u00edfka and Bojar, 2018) ." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-15", "text": "However, the effects of the size of sentence embeddings and the relation between translation performance and meaning representation quality are not entirely clear." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-16", "text": "Recent studies based on NMT either focus entirely on the use of MT-based sentence embeddings in other tasks (Schwenk, 2018) , on translation quality (Lu et al., 2018) , on speed comparison (Britz et al., 2017) , or only exploring a bilingual scenario (C\u00edfka and Bojar, 2018) ." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-17", "text": "In this paper, we are interested in exploring a cross-lingual intermediate shared layer (called attention bridge) in an attentive encoder-decoder MT model." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-18", "text": "This shared layer serves as a fixedsize sentence representation that can be straightforwardly applied to downstream tasks." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-19", "text": "We examine this model with a systematic evaluation on different sizes of the attention bridge and extensive experiments to study the abstractions it learns from multiple translation tasks." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-20", "text": "In contrast to previous work (C\u00edfka and Bojar, 2018) , we demonstrate that there is a correlation between translation performance and trainable downstream tasks when adjusting the size of the intermediate layer." 
}, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-21", "text": "The trend is different for non-trainable tasks that benefit from the increased compression that denser representations achieve, which typically hurts the translation performance because of the decreased capacity of the model." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-22", "text": "We also show that multilingual models improve trainable downstream tasks even further, demonstrating the additional abstraction that is pushed into the representations through additional translation tasks involved in training." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-23", "text": "----------------------------------" }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-24", "text": "**ARCHITECTURE**" }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-25", "text": "Our architecture follows the standard setup of an encoder-decoder model in machine translation with a traditional attention mechanism (Luong et al., 2015) ." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-26", "text": "However, we augment the network with language specific encoders and decoders to enable multilingual training as in Lu et al. (2018) , plus we introduce an inner-attention layer (Liu et al., 2016; Lin et al., 2017) that summarizes the encoder information in a fixed-size vector representation that can easily be shared among different translation tasks with the language-specific encoders and decoders connecting to it." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-27", "text": "The overall architecture is illustrated in Figure 1 (see also V\u00e1zquez et al., 2019) ." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-28", "text": "Due to the attentive connection between encoders and decoders we call this layer attention bridge, and its architecture is an adaptation from the model proposed by C\u00edfka and Bojar (2018) ." 
}, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-29", "text": "Finally, each decoder follows a common attention mechanism in NMT, with the only exception that the context vector is computed on the attention bridge, and the initialization is performed by a mean pooling over it." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-30", "text": "Hence, the decoder receives the information only through the shared attention bridge." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-31", "text": "The fixed-sized representation coming out of the shared layer can immediately be applied to downstream tasks." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-32", "text": "1 However, selecting a reasonable size of the attention bridge in terms of attention heads (m i in Figure 1 ) is crucial for the performance both in a bilingual and multilingual sce-1 As in Lu et al. (2018) , we note that the attention bridge is independent of the underlying encoder and decoder." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-33", "text": "While we use LSTM, it could be easily replaced with a transformer type network (Vaswani et al., 2017) or with a CNN (Gehring et al., 2017) ." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-34", "text": "nario as we will see in the experiments below." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-35", "text": "----------------------------------" }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-36", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-37", "text": "All models are implemented using the OpenNMT framework (Klein et al., 2017) trained using the same set of hyper-parameters." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-38", "text": "2 We use embedding layers of 512 dimensions, two stacked bidirectional LSTM layers with 512 hidden units (256 per direction) and an attentive decoder composed of two unidirectional LSTM layers with 512 units." 
}, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-39", "text": "Regarding the attention bridge, we experimented with four different configurations: 1, 10, 25 and 50 attention heads with 1024 hidden units each." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-40", "text": "For multilingual models, we used a language-rotating scheduler, in which each mini-batch contains sentences from a different language pair, cycling through all the language pairs uniformly." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-41", "text": "We selected the best model according to the BLEU score on the validation set." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-42", "text": "We train all the models using the Europarl Corpus v7 (Koehn, 2005) , focusing on 4 languages: English (EN), French (FR), German (DE) and Spanish (ES)." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-43", "text": "First we train bilingual models for EN\u2192DE; then we train multilingual models {DE,ES,FR}\u2194EN; lastly we train a final Many-to-Many model using the biggest size, i.e., 50 attention heads, involving all translation directions between the three languages, i.e., we also include DE-ES, DE-FR and ES-FR." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-44", "text": "To evaluate the sentence representations we utilize the SentEval toolkit (Conneau and Kiela, 2018) sentences." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-45", "text": "3 In order to obtain a sentence vector out of multiple attention heads we apply mean pooling over the attention bridge." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-46", "text": "We are also interested in the translation quality to verify the appropriateness of our models with respect to the main objective they are trained for." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-47", "text": "For this, we adopt the in-domain development and evaluation dataset from the ACL-WMT07 shared task." 
}, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-48", "text": "Sentences are encoded using Byte-Pair Encoding (Sennrich et al., 2016) , with 32,000 merge operations for each language." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-49", "text": "4 SentEval: Classification tasks Table 1 shows the performance of our models on two popular tasks (SNLI and SICK-E) as in C\u00edfka and Bojar (2018) as well as the average of all 10 SentEval downstream tasks." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-50", "text": "The experiments reveal two important findings:" }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-51", "text": "(1) In contrast with the results from C\u00edfka and Bojar (2018), our scores demonstrate that an increasing number of attention heads is beneficial for classification-based downstream tasks." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-52", "text": "All models perform best with more than one attention head and the general trend is that the accuracies improve with larger representations." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-53", "text": "The previous claim was that there is the opposite effect and lower numbers of attention heads lead to higher performances in downstream tasks, but we do not see that effect in our setup, at least not in the classification tasks." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-54", "text": "(2) The second outcome is the positive effect 3 Due to the large number of SentEval tasks, we report results on natural language inference (SNLI, SICK-E/SICK-R) and the average of all tasks." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-55", "text": "Table 2 : Results from supervised similarity tasks (SICK-R and STSB), measured using Pearson's (r) and Spearman's (\u03c1) correlation coefficients (r/\u03c1)." 
}, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-56", "text": "The average across unsupervised similarity tasks on Pearson's measures are displayed in the right-most column." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-57", "text": "Results with \u2020 taken from C\u00edfka and Bojar (2018)." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-58", "text": "of multilingual training." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-59", "text": "We can see that multilingual training objectives are generally helpful for the trainable downstream tasks." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-60", "text": "Particularly interesting is the fact that the Manyto-Many model performs best on average even though it does not add any further training examples for English (compared to the other multilingual models), which is the target language of the downstream tasks." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-61", "text": "This suggests that the model is able to improve generalizations even from other language pairs (DE-ES, FR-ES, FR-DE) that are not directly involved in training the representations of English sentences." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-62", "text": "Comparing against benchmarks, our results are in line with competitive baselines (Arora et al., 2017) ." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-63", "text": "While our aim is not to beat the state of the art trained on different data, but rather to understand the impact of various sizes of attention heads in a bi-and multilingual scenario, we argue that a larger attention bridge and multilinguality constitute a preferable starting point to learn more meaningful sentence representations." 
}, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-64", "text": "5 SentEval: Similarity tasks Table 2 summarizes the results using Pearson's and Spearman's coefficient on the two SentEval supervised textual similarity tasks, SICK-R and STSB, and the average Pearson's measure on the remaining unsupervised similarity tasks." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-65", "text": "Two different trends become visible: i) On the unsupervised textual similarity tasks, having fewer attention heads is beneficial." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-66", "text": "Contrary to the results in the classification tasks, the best overall k=1 k=10 k=25 k=50 M-to-M Table 3 : BLEU scores for multilingual models." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-67", "text": "Baseline system in the right-most column." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-68", "text": "model is provided by a bilingual setting with only one attention head." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-69", "text": "This is in line with the findings of C\u00edfka and Bojar (2018) and could also be expected as the model is more strongly pushed into a dense semantic abstraction that is beneficial for measuring similarities without further training." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-70", "text": "More surprising is the negative effect of the multilingual models." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-71", "text": "We believe that the multilingual information encoded jointly in the attention bridge hampers the results for the monolingual semantic similarity measured with the cosine distance, while it becomes easier in a bilingual scenario where the vector encodes only one source language, English in this case." 
}, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-72", "text": "ii) On the supervised textual similarity tasks, we find a similar trend as in the previous section for SICK: both a higher number of attention heads and multilinguality contribute to better scores, while for STSB, we notice a different pattern." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-73", "text": "This general discrepancy between results in supervised and unsupervised tasks is not new in the literature (Hill et al., 2016) ." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-74", "text": "We hypothesize that the training procedure is able to pick up the information needed for the task, while in the unsupervised case a more dense representation is essential." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-75", "text": "----------------------------------" }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-76", "text": "**TRANSLATION QUALITY**" }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-77", "text": "Finally, we also look at the translation performance of the multilingual models we have introduced above compared with a baseline, an standard encoder-decoder model with attention (Luong et al., 2015) ." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-78", "text": "In this section, we verify that the attention bridge model is stable and successfully learns to translate in the multilingual case." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-79", "text": "Table 3 shows the comparison between the multilingual models." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-80", "text": "In general, we observe the same trend as in the bilingual evaluation concerning the size of the attention bridge." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-81", "text": "Namely, more attention heads lead to a higher BLEU score." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-82", "text": "The model with 50 heads achieves the best results among our models." 
}, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-83", "text": "It obtains scores that range in the same ballpark as the baseline, only in a few cases there is a degradation of few BLEU points." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-84", "text": "Notably, we do not see any increase in translation quality from the {DE,ES,FR}\u2194EN model to the Many-to-Many model; the BLEU scores are statistically equivalent for all six translation directions." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-85", "text": "One of the main motivations for having more attention heads lies in the better support of longer sentences." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-86", "text": "To study the effect, we group sentences of similar length and compute the BLEU score for each group." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-87", "text": "As we can see from Figure 2 a larger number of attention heads has, indeed, a positive impact when translating longer sentences." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-88", "text": "Interestingly enough, on sentences with up to 45 words, there is no real gap between the results of the baseline model and our bridge models with a high number of attention heads." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-89", "text": "It looks like the performance drop of the attention bridge models is entirely due to sentences longer than 45 words." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-90", "text": "We hypothesize that this might be due to the increasing syntactic divergences between the languages that have to be encoded." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-91", "text": "The shared selfattention layer needs to learn to focus on different parts of a sentence depending on the language it reads and, with increasing lengths of a sentence, this ability becomes harder and more difficult to pick up from the data alone." 
}, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-92", "text": "----------------------------------" }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-93", "text": "**CONCLUSION**" }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-94", "text": "We have shown that fixed-size sentence representations can effectively be learned with multilingual machine translation using a inner-attention layer and scheduled training with multiple translation tasks." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-95", "text": "The performance of the model heavily depends on the size of the intermediate representation layer and we show that a higher number of attention heads leads to improved translation and stronger representations in supervised downstream tasks (contradicting earlier findings) and multilinguality also helps in the same downstream tasks." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-96", "text": "Our analysis reveals that the attention bridge model mainly suffers on long sentences." }, { "sent_id": "d92e92b9a375914f3dd74868f463fc-C001-97", "text": "The next steps will include a deeper linguistic analysis of the translation model and the extension to multilingual models with more languages with greater linguistic diversity." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "d92e92b9a375914f3dd74868f463fc-C001-14" ], [ "d92e92b9a375914f3dd74868f463fc-C001-16" ] ], "cite_sentences": [ "d92e92b9a375914f3dd74868f463fc-C001-14", "d92e92b9a375914f3dd74868f463fc-C001-16" ] }, "@MOT@": { "gold_contexts": [ [ "d92e92b9a375914f3dd74868f463fc-C001-14", "d92e92b9a375914f3dd74868f463fc-C001-15" ] ], "cite_sentences": [ "d92e92b9a375914f3dd74868f463fc-C001-14" ] }, "@DIF@": { "gold_contexts": [ [ "d92e92b9a375914f3dd74868f463fc-C001-20" ], [ "d92e92b9a375914f3dd74868f463fc-C001-49", "d92e92b9a375914f3dd74868f463fc-C001-50", "d92e92b9a375914f3dd74868f463fc-C001-51" ] ], "cite_sentences": [ "d92e92b9a375914f3dd74868f463fc-C001-20", "d92e92b9a375914f3dd74868f463fc-C001-49", "d92e92b9a375914f3dd74868f463fc-C001-51" ] }, "@USE@": { "gold_contexts": [ [ "d92e92b9a375914f3dd74868f463fc-C001-28" ], [ "d92e92b9a375914f3dd74868f463fc-C001-57", "d92e92b9a375914f3dd74868f463fc-C001-58" ] ], "cite_sentences": [ "d92e92b9a375914f3dd74868f463fc-C001-28", "d92e92b9a375914f3dd74868f463fc-C001-57" ] }, "@SIM@": { "gold_contexts": [ [ "d92e92b9a375914f3dd74868f463fc-C001-49" ], [ "d92e92b9a375914f3dd74868f463fc-C001-69" ] ], "cite_sentences": [ "d92e92b9a375914f3dd74868f463fc-C001-49", "d92e92b9a375914f3dd74868f463fc-C001-69" ] } } }, "ABC_d42e2a9175e024e3ae44118e12fb58_26": { "x": [ { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-94", "text": "----------------------------------" }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-95", "text": "**DISCUSSION**" }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-2", "text": "For the purpose of automatically evaluating speakers' humor usage, we build a presentation corpus containing humorous utterances based on TED talks." 
}, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-3", "text": "Compared to previous data resources supporting humor recognition research, ours has several advantages, including (a) both positive and negative instances coming from a homogeneous data set, (b) containing a large number of speakers, and (c) being open." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-4", "text": "Focusing on using lexical cues for humor recognition, we systematically compare a newly emerging text classification method based on Convolutional Neural Networks (CNNs) with a well-established conventional method using linguistic knowledge." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-5", "text": "The CNN method shows its advantages on both higher recognition accuracies and being able to learn essential features automatically." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-6", "text": "----------------------------------" }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-8", "text": "The ability to make effective presentations has been found to be linked with success at school and in the workplace." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-9", "text": "Humor plays important roles in successful public speaking, e.g., helping to reduce public speaking anxiety, which was treated as the most prevalent type of social phobia, generating shared amusement to boost persuasive power, and serving as a means to attract attention and reduce tension (Xu, 2016) ." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-10", "text": "Automatically simulating an audience's reactions to humor will not only be useful for presentation training, but also improve other conversational systems by giving machines more empathetic power." 
}, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-11", "text": "The present study reports our efforts in recognizing utterances that cause laughter in presentations." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-12", "text": "These include building a corpus from TED talks and using Convolutional Neural Networks (CNNs) in the recognition." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-13", "text": "The remainder of the paper is organized as follows: Section 2 briefly reviews the previous related research; Section 3 describes the corpus we collected from TED talks; Section 4 describes the text classification methods; Section 5 reports on our experiments; and finally Section 6 discusses the findings of our study and plans for future work." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-14", "text": "----------------------------------" }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-15", "text": "**PREVIOUS RESEARCH**" }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-16", "text": "Humor recognition refers to the task of deciding whether a sentence/spoken-utterance expresses a certain degree of humor." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-17", "text": "In most of the previous studies (Mihalcea and Strapparava, 2005; Purandare and Litman, 2006; Yang et al., 2015) , humor recognition was modeled as a binary classification task" }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-18", "text": "In the seminal work (Mihalcea and Strapparava, 2005) , a corpus of 16,000 \"one-liners\" was created using daily joke websites to collect humorous instances while using formal writing resources (e.g., news titles) to obtain non-humorous instances." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-19", "text": "Three humor-specific stylistic features, including alliteration, antonymy, and adult slang were utilized together with content-based features to build classifiers." 
}, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-20", "text": "In a recent work (Yang et al., 2015) , a new corpus was constructed from a Pun of the Day website." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-21", "text": "It systematically explained and computed latent semantic structure features based on the following four aspects: (a) Incongruity, (b) Ambiguity, (c) Interpersonal Effect, and (d) Phonetic Style." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-22", "text": "In addition, Word2Vec (Mikolov et al., 2013) distributed representations were utilized in the model building." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-23", "text": "Beyond lexical cues from text inputs, other research has also utilized speakers' acoustic cues (Purandare and Litman, 2006; Bertero and Fung, 2016b) ." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-24", "text": "These studies have typically used audio tracks from TV shows and their corresponding captions in order to categorize characters' speaking turns as humorous or non-humorous." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-25", "text": "Utterances prior to canned laughter that was manually inserted into the shows were treated as humorous, while other utterances were treated as negative cases." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-26", "text": "Convolutional Neural Networks (CNNs) have recently been successfully used in several text categorization tasks (e.g., review rating, sentiment recognition, and question type recognition)." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-27", "text": "Kim (2014) ; Johnson and Zhang (2015) ; Zhang and Wallace (2015) suggested that using a simple CNN setup, which entails one layer of convolution on top of word embedding vectors, achieves excellent results on multiple tasks." 
}, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-28", "text": "Deep learning is rapidly being applied to computational humor research (Bertero and Fung, 2016b,a) ." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-29", "text": "In Bertero and Fung (2016b), CNN was found to be the best model that uses both acoustic and lexical cues for humor recognition." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-30", "text": "By using Long Short Time Memory (LSTM) cells (Hochreiter and Schmidhuber, 1997), Bertero and Fung (2016a) showed that Recurrent Neural Networks (RNNs) perform better on modeling sequential information than Conditional Random Fields (CRFs) (Lafferty et al., 2001) ." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-31", "text": "From the brief review, we can find that the limited number of previously created corpora only cover one-line puns or jokes and conversations from TV comedy shows." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-32", "text": "There is a great need for an open corpus that can support investigating humor in presentations." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-33", "text": "1 CNN-based text categorization methods have been applied for humor recognition (e.g., in (Bertero and Fung, 2016b) ) but with limitations: (a) a rigorous comparison with the state-of-the-art conventional method examined in Yang et al. (2015) is missing; (b) CNN's performance in the previous research is not quite clear 2 ; and (c) some important techniques that can improve CNN performance (e.g., using variedsized filters and dropout regularization (Hinton et al., 2012) ) were missing." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-34", "text": "Therefore, the present study is meant to address these limitations." 
}, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-35", "text": "----------------------------------" }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-36", "text": "**TED TALK DATA**" }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-37", "text": "TED Talks 3 are recordings from TED conferences and other special TED programs." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-38", "text": "In the present study, we focused on the transcripts of the talks." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-39", "text": "Most transcripts of the talks contain the markup '(Laughter)', which represents where audiences laughed aloud during the talks." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-40", "text": "This special markup was used to determine utterance labels." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-41", "text": "We collected 1,192 TED Talk transcripts 4 ." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-42", "text": "An example transcription is given in Figure 1 ." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-43", "text": "The collected transcripts were split into sentences using the Stanford CoreNLP tool (Manning et al., 2014) ." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-44", "text": "In this study, sentences containing or immediately followed by '(Laughter)' were used as humorous sentences, as shown in Figure 1 ; all other sentences were defined as non-humorous sentences." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-45", "text": "Following (Mihalcea and Strapparava, 2005; Yang et al., 2015) , we selected the same sizes (n = 4726) of humorous and non-humorous sentences." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-46", "text": "To minimize possible topic shifts between positive and negative instances, for each positive instance, we picked up one negative instance nearby (the context window was 7 sentences in this study)." 
}, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-47", "text": "For example, in Figure 1 , a negative instance (corresponding to 'sent-2') was selected from the nearby sentences ranging from 'sent-7' and 'sent+7'." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-48", "text": "----------------------------------" }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-49", "text": "**METHODS**" }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-50", "text": "----------------------------------" }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-51", "text": "**CONVENTIONAL MODEL**" }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-52", "text": "Following Yang et al. (2015) , we applied Random Forest (Breiman, 2001 ) to do humor recognition by using the following two groups of features." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-53", "text": "The first group are latent semantic structural features covering the following 4 categories 5 : Incongruity (2), Ambiguity (6), Interpersonal Effect (4), and Phonetic Pattern (4)." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-54", "text": "The second group are semantic distance features, including the humor label classes from 5 sentences in the training set that are closest to this sentence (found by using a k-Nearest Neighbors (kNN) method), and each sentence's averaged Word2Vec representations (n = 300)." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-55", "text": "----------------------------------" }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-56", "text": "**CNN MODEL**" }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-57", "text": "Our CNN-based text classification's setup follows Kim (2014) ." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-58", "text": "Figure 2 depicts the model's details." 
}, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-59", "text": "From the left side's input texts to the right side's prediction labels, different shapes of tensors flow through the entire network for solving the classification task in an end-to-end mode." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-60", "text": "Firstly, input tokenized text strings were converted to a 2D tensor with shape (L \u00d7 d), where L represents sentences' maximum length while d represents the word-embedding dimension." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-61", "text": "In this study, we utilized the Word2Vec (Mikolov et al., 2013) embedding vectors (d = 300) that were trained from 100 billion words of Google News." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-62", "text": "Next, the embedding matrix was fed into a 1D convolution network with multiple filters." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-63", "text": "To cover varied reception fields, the size of the filters changed from f w\u22121 , f w , and f w+1 ." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-64", "text": "For each filter size, f n filters were utilized." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-65", "text": "Then, max pooling, which stands for finding the largest value from a vector, was applied to each feature map (total 3 \u00d7 f n feature maps) output by the 1D convolution." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-66", "text": "Finally, maximum values from all of 3 \u00d7 f n filters were formed as a flattened vector to go through a fully connected (FC) layer to predict two possible labels (Humor vs. Non-Humor)." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-67", "text": "Note that for 1D convolution and FC layer's input, we applied 'dropout' (Hinton et al., 2012) regularization, which entails randomly setting a proportion of network weights to be zero during model training, to overcome overfitting." 
}, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-68", "text": "By using cross-entropy as the learning metric, the whole sequential network (all weights and bias) could be optimized by using any SGD optimization, e.g., Adam (Kingma and Ba, 2014) , Adadelta (Zeiler, 2012) , and so on." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-69", "text": "----------------------------------" }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-70", "text": "**EXPERIMENTS**" }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-71", "text": "We used two corpora: the TED Talk corpus (denoted as TED) and the Pun of the Day corpus 6 (denoted as Pun)." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-72", "text": "Note that we normalized words in the Pun data to lowercase to avoid avoid a possibly elevated result caused by a special pattern: in the original format, all negative instances started with capital letters." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-73", "text": "The Pun data allows us to verify that our implementation is consistent with the work reported in Yang et al. (2015) ." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-74", "text": "In our experiment, we firstly divided each corpus into two parts." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-75", "text": "The smaller part (the Held-Out Partition) was used for tweaking various hyper- Table 1 : Humor recognition on both Pun and TED data sets by using (a) random prediction (Chance), conventional method (Base) and CNN method; the sizes of the dev and CV partitions are provided for each data set." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-76", "text": "parameters used in text classifiers." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-77", "text": "The larger portion (the CV Partition) was then formulated as a 10-fold cross-validation setup for obtaining a stable and comprehensive model evaluation result." 
}, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-78", "text": "Note that, with a goal of building a speakerindependent humor detector, when partitioning our TED data set, we always kept all of utterances of a single a talk within the same partition." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-79", "text": "Therefore, in our experimental setup, we always evaluated \"unseen\" utterances from the talks that had not been used in the training stage." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-80", "text": "To our knowledge, this is the first time that such a strict experimental setup has been used in recognizing humor in conversations, and it makes the humor recognition task on the TED data quite challenging." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-81", "text": "When building conventional models, we developed our own feature extraction scripts and used the SKLL 7 python package for tweaking and training Random Forest models." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-82", "text": "When implementing CNN, we used the Keras 8 Python package." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-83", "text": "9 Regarding hyper-parameter tweaking, we utilized the Tree Parzen Estimation (TPE) method as detailed in Bergstra et al. (2012) ." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-84", "text": "After running 200 iterations of tweaking, we ended up with the following selection: f w is 6 (entailing that the various filter sizes are (5, 6, 7)), f n is 100, dropout 1 is 0.7 and dropout 2 is 0.35, optimization uses Adam (Kingma and Ba, 2014) ." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-85", "text": "When training the CNN model, we randomly selected 10% of the training data as the validation set for using early stopping to avoid overfitting." 
}, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-86", "text": "7 https://github.com/ EducationalTestingService/skll 8 https://github.com/fchollet/keras 9 The implementation will be released with the paper On the Pun data, the CNN model shows consistent improved performance over the conventional model, as suggested in Yang et al. (2015) ." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-87", "text": "In particular, precision has been greatly increased from 0.762 to 0.864." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-88", "text": "On the TED data, we also observed that the CNN model helps to increase precision (from 0.515 to 0.582) and accuracy (from 52.0% to 58.9%)." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-89", "text": "The empirical evaluation results suggests that the CNN-based model has an advantage on the humor recognition task." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-90", "text": "In addition, focusing on the system development time, generating and implementing those features in the conventional model would take days or even weeks." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-91", "text": "However, the CNN model automatically learns its optimal feature representation and can adjust the features automatically across data sets." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-92", "text": "This makes the CNN model quite versatile for supporting different tasks and data domains." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-93", "text": "Compared with the humor recognition results on the Pun data, the results on the TED data is still quite low, and more research is needed to fully handle humor in authentic presentations." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-96", "text": "For the purpose of monitoring how well speakers can use humor during their presentations, we have created a corpus from TED talks." 
}, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-97", "text": "Compared to the existing (albeit limited) corpora for humor recognition research, ours has the following advantages: (a) it was collected from authentic talks, rather than from TV shows performed by professional actors based on scripts; (b) it contains about 100 times more speakers compared to the limited number of actors in existing corpora." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-98", "text": "We compared two types of leading text-based humor recognition methods: a conventional classifier (e.g., random forest) based on human-engineered features vs. an end-to-end CNN method, which relies on its inherent representation learning." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-99", "text": "We found that the CNN method has better performance." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-100", "text": "More importantly, the representation learning of the CNN method makes it very efficient when facing new data sets." }, { "sent_id": "d42e2a9175e024e3ae44118e12fb58-C001-101", "text": "Stemming from the present study, we envision that more research is worth pursuing: (a) for presentations, cues from other modalities such as audio or video will be included, similar to Bertero and Fung (2016b) ; (b) context information from multiple utterances will be modeled by using sequential modeling methods." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "d42e2a9175e024e3ae44118e12fb58-C001-17" ], [ "d42e2a9175e024e3ae44118e12fb58-C001-20" ], [ "d42e2a9175e024e3ae44118e12fb58-C001-45" ], [ "d42e2a9175e024e3ae44118e12fb58-C001-86" ] ], "cite_sentences": [ "d42e2a9175e024e3ae44118e12fb58-C001-17", "d42e2a9175e024e3ae44118e12fb58-C001-20", "d42e2a9175e024e3ae44118e12fb58-C001-45", "d42e2a9175e024e3ae44118e12fb58-C001-86" ] }, "@MOT@": { "gold_contexts": [ [ "d42e2a9175e024e3ae44118e12fb58-C001-33" ] ], "cite_sentences": [ "d42e2a9175e024e3ae44118e12fb58-C001-33" ] }, "@USE@": { "gold_contexts": [ [ "d42e2a9175e024e3ae44118e12fb58-C001-45" ], [ "d42e2a9175e024e3ae44118e12fb58-C001-52" ] ], "cite_sentences": [ "d42e2a9175e024e3ae44118e12fb58-C001-45", "d42e2a9175e024e3ae44118e12fb58-C001-52" ] }, "@SIM@": { "gold_contexts": [ [ "d42e2a9175e024e3ae44118e12fb58-C001-73" ], [ "d42e2a9175e024e3ae44118e12fb58-C001-86" ] ], "cite_sentences": [ "d42e2a9175e024e3ae44118e12fb58-C001-73", "d42e2a9175e024e3ae44118e12fb58-C001-86" ] } } }, "ABC_3d84cf97f48dad66b4d8de0baf79b1_26": { "x": [ { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-2", "text": "Feature norm datasets of human conceptual knowledge, collected in surveys of human volunteers, yield highly interpretable models of word meaning and play an important role in neurolinguistic research on semantic cognition." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-3", "text": "However, these datasets are limited in size due to practical obstacles associated with exhaustively listing properties for a large number of words." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-4", "text": "In contrast, the development of distributional modelling techniques and the availability of vast text corpora have allowed researchers to construct effective vector space models of word meaning over large lexicons." 
}, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-5", "text": "However, this comes at the cost of interpretable, human-like information about word meaning." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-6", "text": "We propose a method for mapping human property knowledge onto a distributional semantic space, which adapts the word2vec architecture to the task of modelling concept features." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-7", "text": "Our approach gives a measure of concept and feature affinity in a single semantic space, which makes for easy and efficient ranking of candidate human-derived semantic properties for arbitrary words." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-8", "text": "We compare our model with a previous approach, and show that it performs better on several evaluation tasks." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-9", "text": "Finally, we discuss how our method could be used to develop efficient sampling techniques to extend existing feature norm datasets in a reliable way." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-10", "text": "----------------------------------" }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-11", "text": "**INTRODUCTION**" }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-12", "text": "Distributional semantic modelling of word meaning has become a popular method for building pretrained lexical representations for downstream Natural Language Processing (NLP) tasks (Baroni and Lenci, 2010; Mikolov et al., 2013; Pennington et al., 2014; Bojanowski et al., 2017) ." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-13", "text": "In this approach, meaning is encoded in a dense vector space model, such that words (or concepts) that have vector representations that are spatially close together are similar in meaning." 
}, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-14", "text": "A criticism of these real-valued embedding vectors is the opaqueness of their representational dimensions and their lack of cognitive plausibility and interpretability (Murphy et al., 2012; \u015e enel et al., 2018) ." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-15", "text": "In contrast, human conceptual property knowledge is often modelled in terms of relatively sparse and interpretable vectors, based on verbalizable, human-elicited features collected in property knowledge surveys (McRae et al., 2005; ." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-16", "text": "However, gathering and collating human-elicited property knowledge for concepts is very labour intensive, limiting both the number of words for which a rich feature set can be gathered, as well as the completeness of the feature listings for each word." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-17", "text": "Neural embedding models, on the other hand, learn from large corpora of text in an unsupervised fashion, allowing very detailed, high-dimensional semantic models to be constructed for a very large number of words." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-18", "text": "In this paper, we propose Feature2Vec, a computational framework that combines information from human-elicited property knowledge and information from distributional word embeddings, allowing us to exploit the strengths and advantages of both approaches." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-19", "text": "Feature2Vec maps human property norms onto a pretrained vector space model of word meaning." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-20", "text": "The embedding of feature-based information in the pretrained embedding space makes it possible to rank the relevance of features using cosine similarity, and we demonstrate how simple composition of features can be used to approximate concept vectors." 
}, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-21", "text": "----------------------------------" }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-22", "text": "**RELATED WORK**" }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-23", "text": "Several property-listing studies have been conducted with human participants in order to build property norms -datasets of normalized humanverbalizable feature listings for lexical concepts (McRae et al., 2005; ." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-24", "text": "One use of feature norms is to critically examine distributional semantic models on their ability to encode grounded, human-elicited semantic knowledge." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-25", "text": "For example, Rubinstein et al. (2015) demonstrated that state-of-the-art distributional semantic models fail to predict attributive properties of concept words (e.g. the properties is-red and is-round for the word apple) as accurately as taxonomic properties (e.g. is-a-fruit)." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-26", "text": "Similarly, Sommerauer and Fokkens (2018) investigated the types of semantic knowledge encoded within pretrained word embeddings, concluding that some properties cannot be learned by supervised classifiers." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-27", "text": "Collell and Moens (2016) compared linguistic and visual representations of object concepts on their ability to represent different types of property knowledge." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-28", "text": "Research has shown that state-of-the-art distributional semantic models built from text corpora fail to capture important aspects of meaning related to grounded perceptual information, as this kind of information is not adequately represented in the statistical regularities of text data (Li and Gauthier, 2017; Kelly et al., 2014) ." 
}, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-29", "text": "Motivated by these issues, Silberer (2017) constructed multimodal semantic models from text and image data, with the goal of grounding word meaning using visual attributes." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-30", "text": "More recently, Derby et al. (2018) built similar models with the added constraint of sparsity, demonstrating that sparse multimodal vectors provide a more faithful representation of human semantic representations." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-31", "text": "Finally, the work that most resembles ours is that of Fagarasan et al. (2015) , who use Partial Least Squares Regression (PLSR) to learn a mapping from a word embedding model onto specific conceptual properties." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-32", "text": "Concurrent work recently undertaken by Li and Summers-Stay (2019) replaces the PLSR model with a feedforward neural network." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-33", "text": "In our work, we instead map property knowledge directly into vector space models of word meaning, rather than learning a supervised predictive function from concept embedding dimensions to feature terms." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-34", "text": "----------------------------------" }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-35", "text": "**METHOD**" }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-36", "text": "We make primary comparison with the work of Fagarasan et al. (2015) , although their approach differs from ours in that they map from an embedding space onto the feature space, while we learn a mapping from the feature domain onto the embedding space." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-37", "text": "We outline both methods below." 
}, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-38", "text": "----------------------------------" }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-39", "text": "**DISTRIBUTIONAL SEMANTIC MODELS**" }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-40", "text": "For our experiments, we make use of the pretrained GloVe embeddings (Pennington et al., 2014) provided in the Spacy 1 package trained on the Common Crawl 2 ." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-41", "text": "The GloVe model includes 685, 000 tokens with embedding vectors of dimension 300, providing excellent lexical coverage with a rich set of semantic representations." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-42", "text": "In our analyses we use both the McRae property norms (McRae et al., 2005) , which contain 541 concepts and 2526 features, and the CSLB norms which have 638 concepts with 2725 features." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-43", "text": "For both sets of norms, a feature is listed for a concept if it has been elicited by five or more participants in the property norming study." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-44", "text": "The number of participants listing a given feature for a given concept name is termed the production frequency for that concept\u00d7feature pair." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-45", "text": "This gives sparse production frequency vectors for each concept over all the features in the norms." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-46", "text": "----------------------------------" }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-47", "text": "**PARTIAL LEAST SQUARE REGRESSION (PLSR)**" }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-48", "text": "Fagarasan et al. (2015) used partial least squares regression (PLSR) to map between the GloVe embedding space and property norm vectors." 
}, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-49", "text": "Suppose we have two real-valued matrices G \u2208 R n\u00d7m and F \u2208 R n\u00d7k ." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-50", "text": "In this context, G and F represent GloVe embedding vectors and property norm feature vectors, respectively." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-51", "text": "For n available concept words, G is a matrix which consists of stacked pretrained embeddings from GloVe and F is the (sparse) matrix of production frequencies for each concept\u00d7feature pair." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-52", "text": "G and F share the same row indexing for concept words." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-53", "text": "For a new dimension size p \u2208 N, a partial least squared regression learns two new subspaces with dimensions n \u00d7 p, which have maximal covariance between them." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-54", "text": "The algorithm solves this problem by learning a mapping from the matrix G onto F , similar to a regression model." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-55", "text": "The fitted regression model thus provides a framework for predicting vectors in the feature space from vectors in the embedding space." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-56", "text": "In this work, we use the PLSR approach as a baseline for our model." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-57", "text": "In implementing PLSR, we set the intermediate dimension size to 50, following Fagarasan et al. (2015) ." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-58", "text": "We also build a PLSR model using 120 dimensions, which in preliminary experimentation we found gave the best performance from a range of values tested." 
}, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-59", "text": "----------------------------------" }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-60", "text": "**SKIP-GRAM WORD2VEC**" }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-61", "text": "Mikolov et al. (2013) proposed learning word embeddings using a predictive neural-network approach." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-62", "text": "In particular, the skip-gram implementation with negative sampling mines word cooccurrences within a text window, and the network must learn to predict the surrounding context from the target word." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-63", "text": "More specifically, for a vocabulary V , two sets of embeddings are learned through gradient decent, one for target embeddings and one for context embeddings." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-64", "text": "Given a target word w \u2208 V and a context word c \u2208 V in its window, the network calculates the dot product for the embeddings for w and v and a sigmoid activation is applied to the output ( Fig. 1(a) )." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-65", "text": "Negative samples are also generated for training, where the context is not in the target word's window." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-66", "text": "Let (w, c) \u2208 D be the positive word and context pairs and (w, c) \u2208 D be the negative word and context pairs." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-67", "text": "Then, using binary cross entropy loss, we learn a parameterization \u03b8 of the neural network that maximizes the function" }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-68", "text": "where \u03c3 is the sigmoid function and v w and v c are the corresponding real-valued embeddings for the target words and context words ." 
}, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-69", "text": "In this work, we adapt this skip-gram approach to the task of constructing semantic representations of human property norms by mapping properties into an embedding space (Figure 1 )." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-70", "text": "We achieve this by using a neural network to predict the properties from the input word, using the skipgram architecture with negative sampling on the properties." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-71", "text": "We replace context embeddings and windowed co-occurrence counts from the conventional skip-gram architecture with property embeddings and concept-feature production frequencies." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-72", "text": "The loss function for training remains the same; however, there are two modifications to the learning process." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-73", "text": "The first is that the target embeddings for the concept words are pre-trained (i.e. the GloVe embeddings), and gradients are not applied to this embedding matrix." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-74", "text": "The layer for the property norms is randomly initialized, and gradients are applied to these vectors to learn a semantic representation for properties aligned to the pre-trained distributional semantic space for the words." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-75", "text": "Secondly, the negative samples are generated from randomly sampled properties." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-76", "text": "We downweight negative samples by multiplying their associated loss by one over the negative sampling rate, so that the system pays more attention to real cases and less to the incorrect negative examples." 
}, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-77", "text": "Due to the sparsity of word-feature production frequencies, we generate all positive instances and randomly sample negative examples after each epoch to create a new set of training samples." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-78", "text": "We name this approach Feature2Vec 3 ." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-79", "text": "We use a learn-" }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-80", "text": "----------------------------------" }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-81", "text": "**EXPERIMENTS**" }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-82", "text": "We train with the McRae and CSLB property norms separately and report evaluations for each dataset." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-83", "text": "For the McRae dataset we use 400 randomly selected concepts for training and the remaining 141 for testing, and for the CSLB dataset we use 500 randomly selected concepts for training and the remaining 138 for testing 4 ." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-84", "text": "----------------------------------" }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-85", "text": "**PREDICTING FEATURE VECTORS**" }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-86", "text": "We first evaluate how well the baseline PLSR model performs on the feature vector reconstruction task used by Fagarasan et al. (2015) ." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-87", "text": "In this evaluation, the feature vector for a test concept is predicted and we test whether the real concept vector is within the top N most similar neighbours of the predicted vector." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-88", "text": "We report results over both 50 (as in Fagarasan et al. (2015) ) and 120 dimensions for a range of values of N (Table 1) ." 
}, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-89", "text": "----------------------------------" }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-90", "text": "**CONSTRUCTING CONCEPT REPRESENTATIONS FROM FEATURE VECTORS**" }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-91", "text": "For Feature2Vec, we embed property norm features into the GloVe semantic space, giving a representation of properties in terms of GloVe dimensions." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-92", "text": "To predict a held-out concept embedding, we build a representation of the concept word by averaging the learned feature embedding vectors for that word using the ground truth information from the property norm dataset." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-93", "text": "This gives a method to construct embeddings for new words 4 We use the Python pickle package to store the numpy state for reproducible results in our code." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-94", "text": "using property knowledge and associated production frequencies (for example, for a held-out word unicorn, its GloVe embedding vector might be predicted from all features of horse, along with the features is-white, has-a-horn, and is-fantastical)." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-95", "text": "We compare these predicted embeddings to the held-out Glove embeddings (Table 1) ." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-96", "text": "However, we note that this approach is different to the PLSR models, so we do not make a direct comparison between PLSR and Feature2vec nearest neighbour results." 
}, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-97", "text": "Nevertheless, the results show that the word embeddings composed from the learned Feature2Vec feature embeddings appear relatively frequently amongst the most similar neighbour words in the pretrained GloVe space, indicating that feature embedding composition approximates the original word embedddings reasonably well." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-98", "text": "----------------------------------" }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-99", "text": "**PREDICTING PROPERTY KNOWLEDGE**" }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-100", "text": "The evaluation task that we are most interested in is how well the models can predict feature knowledge for concepts, given the distributional semantic vectors." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-101", "text": "More specifically, for a given concept with K features, we wish to take the top K predicted features according to each method, and record the overlap with the true property norm listing." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-102", "text": "In this evaluation, we make direct comparisons between all three models (PLSR 50, PLSR 120, & Feature2Vec) ." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-103", "text": "For the PLSR models, we predict the feature vector for a given target word using the embedding vector as input and take the top K weighted features." 
}, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-104", "text": "For Feature2Vec, we" }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-105", "text": "----------------------------------" }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-106", "text": "**CONCEPTS PROPERTIES**" }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-107", "text": "Kingfisher has wings does fly has a beak has feathers is a bird does eat has a tail does swim has legs does lay eggs Avocado is eaten edible is tasty does grow is green is healthy is used in cooking has skin peel is red is food is a vegetable Door made of metal has a door doors is useful has a handle handles made of wood made of plastic is heavy is furniture does contain hold is found in kitchens Dragon is big large is an animal has a tail does eat is dangerous has legs has claws is grey is small does fly Table 3 : Top 10 predictions from CSLB-trained Feature2Vec model." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-108", "text": "The properties in bold and underlined are those that agree with the available ground truth features from the CSLB dataset." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-109", "text": "The first two concepts are from the CSLB test set, whilst the final two words were randomly sampled from the word embedding lexicon." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-110", "text": "rank all feature embeddings by their distance to the embedding for the target word, using cosine similarity, and take the top K most similar features ( Table 2 )." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-111", "text": "The results demonstrate that Feature2Vec outperforms the PLSR models on property knowledge prediction, for both training and testing datasets." 
}, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-112", "text": "----------------------------------" }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-113", "text": "**ANALYSIS**" }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-114", "text": "Following previous work, we provide the top 10 feature predictions for a few sample concepts, displayed in Table 3 ." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-115", "text": "Properties underlined and in bold represent features that match the available ground truth data (i.e., the concept\u00d7feature pair occurs in the norms)." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-116", "text": "The first two words in Table 3 were sampled from the CSLB norms test set, whilst the last two words were randomly sampled from the word embedding lexicon and are not concept words appearing in the CSLB norms." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-117", "text": "We find that the predicted features that are not contained within the ground truth property set still tend to be quite reasonable, even for the two concepts not in the test dataset." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-118", "text": "As property norms do not represent an exhaustive listing of property knowledge, this is not surprising, and predicted properties not in the norms are not necessarily errors (Devereux et al., 2009; Fagarasan et al., 2015) ." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-119", "text": "Moreover, the set of features used within the norms are dependent on the concepts that were presented to the human participants." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-120", "text": "It is therefore notable that the conceptual representations predicted by our model for the two outof-norms concept words are particularly plausible, even though the attributes were never intended to conceptually represent these words." 
}, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-121", "text": "Our analysis supports the view that such supervised models could be utilised as an assistive tool for surveying much larger vocabularies of words." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-122", "text": "----------------------------------" }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-123", "text": "**CONCLUSION**" }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-124", "text": "We proposed a method for constructing distributional semantic vectors for human property norms from a pretrained vector space model of word meaning, which outperforms previous methods for predicting concept features on two property norm datasets." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-125", "text": "As discussed by Fagarasan et al. (2015) and others, it is clear that property norm datasets provide only a semi-complete picture of human conceptual knowledge, and more extensive surveys may provide additional useful property knowledge information." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-126", "text": "By predicting plausible semantic features for concepts through the leveraging of corpus-derived word embedding data, our method offers a useful tool for guiding the expensive and laborious process of collecting property norm listings." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-127", "text": "For example, existing property norm datasets can be extended through human verification of features predicted with high confidence by Feature2Vec, with these features being added to the norms and subsequently incorporated into Feature2Vec in an iterative, semi-supervised manner (Kelly et al., 2012) ." }, { "sent_id": "3d84cf97f48dad66b4d8de0baf79b1-C001-128", "text": "Thus, Feature2Vec provides a useful heuristic to add interpretable feature-based information to these datasets for new words in a practical and efficient way." 
} ], "y": { "@USE@": { "gold_contexts": [ [ "3d84cf97f48dad66b4d8de0baf79b1-C001-31" ], [ "3d84cf97f48dad66b4d8de0baf79b1-C001-36" ], [ "3d84cf97f48dad66b4d8de0baf79b1-C001-57" ], [ "3d84cf97f48dad66b4d8de0baf79b1-C001-86" ] ], "cite_sentences": [ "3d84cf97f48dad66b4d8de0baf79b1-C001-31", "3d84cf97f48dad66b4d8de0baf79b1-C001-36", "3d84cf97f48dad66b4d8de0baf79b1-C001-57", "3d84cf97f48dad66b4d8de0baf79b1-C001-86" ] }, "@EXT@": { "gold_contexts": [ [ "3d84cf97f48dad66b4d8de0baf79b1-C001-31", "3d84cf97f48dad66b4d8de0baf79b1-C001-32", "3d84cf97f48dad66b4d8de0baf79b1-C001-33" ] ], "cite_sentences": [ "3d84cf97f48dad66b4d8de0baf79b1-C001-31" ] }, "@DIF@": { "gold_contexts": [ [ "3d84cf97f48dad66b4d8de0baf79b1-C001-31", "3d84cf97f48dad66b4d8de0baf79b1-C001-32", "3d84cf97f48dad66b4d8de0baf79b1-C001-33" ], [ "3d84cf97f48dad66b4d8de0baf79b1-C001-36" ] ], "cite_sentences": [ "3d84cf97f48dad66b4d8de0baf79b1-C001-31", "3d84cf97f48dad66b4d8de0baf79b1-C001-36" ] }, "@SIM@": { "gold_contexts": [ [ "3d84cf97f48dad66b4d8de0baf79b1-C001-88" ] ], "cite_sentences": [ "3d84cf97f48dad66b4d8de0baf79b1-C001-88" ] }, "@BACK@": { "gold_contexts": [ [ "3d84cf97f48dad66b4d8de0baf79b1-C001-118" ] ], "cite_sentences": [ "3d84cf97f48dad66b4d8de0baf79b1-C001-118" ] }, "@MOT@": { "gold_contexts": [ [ "3d84cf97f48dad66b4d8de0baf79b1-C001-125" ] ], "cite_sentences": [ "3d84cf97f48dad66b4d8de0baf79b1-C001-125" ] } } }, "ABC_c437e447603ecdbe4053651169770a_26": { "x": [ { "sent_id": "c437e447603ecdbe4053651169770a-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-2", "text": "This paper describes the Rouletabille participation to the Hyperpartisan News Detection task." }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-3", "text": "We propose the use of different text classification methods for this task." }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-4", "text": "Preliminary experiments using a similar collection used in Potthast et al. 
(2018) show that neural-based classification methods reach state-of-the art results." }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-5", "text": "Our final submission is composed of a unique run that ranks among all runs at 3/49 position for the by-publisher test dataset and 43/96 for the by-article test dataset in terms of Accuracy." }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-6", "text": "----------------------------------" }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-8", "text": "Printed press have been in the last decades the main way to access to news in written format." }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-9", "text": "This tendency is changing with the appearance of online channels but usually the main factors of the journalistic content generation are still there: events, journalists, and editors." }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-10", "text": "One of the problems of the generation of this content is the influence of each factor in the veracity of the generated content." }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-11", "text": "Two main factors may influence the final view of an article: writer's preferences and affiliation of the editor house." }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-12", "text": "Identifying partisan preferences in news, based only on text content, has been shown to be a challenging task (Potthast et al., 2018) ." }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-13", "text": "This problem requires to identify if a news article was written in such a way that it includes an overrated appreciation of one of the participants in the news (a political party, a person, a company, etc.)." 
}, { "sent_id": "c437e447603ecdbe4053651169770a-C001-14", "text": "Despite the fact that sharply polarized documents are not necessarily fake, it is an early problem to solve for the identification of fake content." }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-15", "text": "A recent paper (Potthast et al., 2018) claims that stylometric features are a key factor to tackle this task." }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-16", "text": "In this paper, we present the description of our participation to the Hyperpartisan classification task at SemEval-2019 (Kiesel et al., 2019) ." }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-17", "text": "This task was composed of two subtasks, the first consist to identify hyperpartisan bias in documents classified by its individual content (bias of the writer or by-article category) and the second by the editorial house that published the article (bias of the editorial house or by-publisher category) as depicted in Figure 1 1 ." }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-18", "text": "To address this problem, we experimented with well-known models based on deep learning (Honnibal and Montani, 2017; Kim, 2014) ." }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-19", "text": "They achieve state-of-the-art results on a publicly available collection (Potthast et al., 2018) , showing that neural models can effectively address the task of hyperpartisan detection without including stylometric features." }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-20", "text": "Our final submission ranked in the top-3 for the by-publisher category, and 43/96 for the by-article category (or 21/42 in the official ranking)." 
}, { "sent_id": "c437e447603ecdbe4053651169770a-C001-21", "text": "----------------------------------" }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-22", "text": "**CLASSIFICATION MODELS**" }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-23", "text": "We have considered that the hyperpartisan classification task can be addressed as a binary classification task where only two classes ('hyperpartisan' and 'mainstream') ." }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-24", "text": "Three different models were considered for our participation." }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-25", "text": "The first of them is based on a classical document-level representation and the other two are based on word-level representations through the use of word embedding." }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-26", "text": "All of them can be seen as baselines and no specific adaptation to the dataset 2 was performed." }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-27", "text": "3" }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-28", "text": "----------------------------------" }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-29", "text": "**TF-IDF + ADABOOST**" }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-30", "text": "For this model we represented our documents using the classical TF-IDF representation." }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-31", "text": "Finally, the Adaboost classifier (Freund and Schapire, 1997) is used under the default configuration." }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-32", "text": "Note that this is a very basic baseline, as it does not use recent representation techniques such as word embeddings." 
}, { "sent_id": "c437e447603ecdbe4053651169770a-C001-33", "text": "----------------------------------" }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-34", "text": "**SPACY MODEL**" }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-35", "text": "In this case, we used the SpaCy (Honnibal and Montani, 2017 ) library 4 ." }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-36", "text": "We used the text categorisation algorithm implemented in SpaCy which is based on the hierarchical attention network proposed in Yang et al. (2016) ." }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-37", "text": "The main improvement to the original model is the use of hashbased embeddings." }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-38", "text": "We only defined two hyperparameters for the model: number of epochs and dropout rate." }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-39", "text": "These parameters were set to 3 and 0.2, respectevelly." }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-40", "text": "5" }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-41", "text": "----------------------------------" }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-42", "text": "**CONVOLUTIONAL MODEL**" }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-43", "text": "We also tested the neural classification model proposed by Kim (2014) ." }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-44", "text": "This model uses convolutional neural networks that are finally reduced to a binary classification." }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-45", "text": "This method is known as a highly competitive classification model for short documents." }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-46", "text": "As SpaCy, this model is based on word embeddings representation." 
}, { "sent_id": "c437e447603ecdbe4053651169770a-C001-47", "text": "However, in this case we preferred to use the pre-calculated embeddings of GloVe (Pennington et al., 2014) ." }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-48", "text": "Hyperparameters were defined using the training data." }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-49", "text": "----------------------------------" }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-50", "text": "**EXPERIMENTS AND RESULTS**" }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-51", "text": "----------------------------------" }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-52", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-53", "text": "Experiments were performed using two collections, the ACL2018 collection (Potthast et al., 2018) and the SemEval19 collection ." }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-54", "text": "The first collection is composed of 1627 articles including 801 hyperpartisan and 826 2 Different to the classical training of the involved classifiers." }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-55", "text": "3 Further experiments were performed using networkbased models but as results did not show improvement in an existing collection, we decided to not include these results." }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-56", "text": "mainstream manually annotated documents." }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-57", "text": "As this collection is not originally split in training-test sets, results are presented using cross-validation." }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-58", "text": "The second collection was split in train, validation, and test sets for the by-publisher category, and in train and test for the by-article category as presented in Table 1 ." 
}, { "sent_id": "c437e447603ecdbe4053651169770a-C001-59", "text": "Results in this second collection are exclusively calculated using the TIRA evaluation system ." }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-60", "text": "In order to determine the best configuration to our participation using the SemEval collection, we decided to perform experiments and fix hyperparameters using the ACL2018 collection." }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-61", "text": "We only used the first fold produced by the authors' code 6 ." }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-62", "text": "As our results are not directly comparable with the values reported in Potthast et al. (2018) , we re-evaluated their approach on this single fold." }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-63", "text": "----------------------------------" }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-64", "text": "**RESULTS IN THE ACL2018 COLLECTION**" }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-65", "text": "Values of the three F-measures were calculated with sklearn 7 ." }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-66", "text": "Note that in binary classification, micro F-measure values are equivalent to accuracy values." }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-67", "text": "Two state-of-the-art models (SpaCy and Kim (2014)) outperform the approach presented in Potthast et al. (2018) , showing that stylometric features are probably not necessary for the task." }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-68", "text": "----------------------------------" }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-69", "text": "**RESULTS IN THE SEMEVAL2019 COLLECTION**" }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-70", "text": "Experiments on the official collection were performed through the use of TIRA SpaCy: it can be seen as an 'easy-to-implement' but strong baseline." 
}, { "sent_id": "c437e447603ecdbe4053651169770a-C001-71", "text": "The same model was trained on the by-publisher training set for both submissions (on the by-publisher and by-article dataset)." }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-72", "text": "Tables 3 and 4 respectively present official results on the by-publisher 9 and the by-article datasets." }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-73", "text": "One can see that relative results (i.e. regarding the official ranking) are strongly better on the bypublisher dataset than on the by-article one." }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-74", "text": "This can be easily explained by the fact that collections were differently annotated." }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-75", "text": "If we now compare accuracy scores of the SpaCy model between the ACL2018 collection and the SemEval2019 one, we can notice a decrease in performance (0.6640 vs 0.8091 on the 9 https://www.tira.io/task/hyperpartisan-newsdetection/dataset/pan19-hyperpartisan-news-detectionby -publisher-test-dataset-2018-12-12 by-publisher dataset for example), leading us to think that there exist some differences between the two collections." }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-76", "text": "Both collections seem to be complementary for the evaluation of hyperpartisan detection." }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-77", "text": "Another important observation is that the SpaCy model performs remarkably well on the bypublisher set, although not specifically tuned for the hyperpartisan detection task." }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-78", "text": "Indeed, we are ranked first on the F1 metric, and 3rd on the Accuracy one." 
}, { "sent_id": "c437e447603ecdbe4053651169770a-C001-79", "text": "Some other experiments are needed to get a fine-tuned model for the task, but this version can already be considered as a strong baseline for the by-publiser subtask." }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-80", "text": "----------------------------------" }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-81", "text": "**CONCLUSION**" }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-82", "text": "Our experiments and participation to the Hyperpartisan task led us to conclude that:" }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-83", "text": "\u2022 stylometric features seem not to be necessary to achieve state-of-the-art results for hyperpartisan detection in the ACL2018 collection." }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-84", "text": "This deserves a set of extra experiments to better understand the real contribution of stylometric features when combined with strong representations/classifiers to validate the work of Potthast et al. (2018) ." }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-85", "text": "\u2022 a state-of-the-art classification model in its default configuration (SpaCy) can be considered as a strong baseline for next experiments." }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-86", "text": "Indeed, SpaCy is top-ranked according to the F1 metric on the by-publisher dataset." }, { "sent_id": "c437e447603ecdbe4053651169770a-C001-87", "text": "One question is thus now if other top-ranked approaches are also from the text classification literature or dedicated ones." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "c437e447603ecdbe4053651169770a-C001-4" ], [ "c437e447603ecdbe4053651169770a-C001-12" ], [ "c437e447603ecdbe4053651169770a-C001-15" ], [ "c437e447603ecdbe4053651169770a-C001-19" ], [ "c437e447603ecdbe4053651169770a-C001-67" ] ], "cite_sentences": [ "c437e447603ecdbe4053651169770a-C001-4", "c437e447603ecdbe4053651169770a-C001-12", "c437e447603ecdbe4053651169770a-C001-15", "c437e447603ecdbe4053651169770a-C001-19", "c437e447603ecdbe4053651169770a-C001-67" ] }, "@MOT@": { "gold_contexts": [ [ "c437e447603ecdbe4053651169770a-C001-12" ] ], "cite_sentences": [ "c437e447603ecdbe4053651169770a-C001-12" ] }, "@USE@": { "gold_contexts": [ [ "c437e447603ecdbe4053651169770a-C001-53" ], [ "c437e447603ecdbe4053651169770a-C001-62" ], [ "c437e447603ecdbe4053651169770a-C001-84" ] ], "cite_sentences": [ "c437e447603ecdbe4053651169770a-C001-53", "c437e447603ecdbe4053651169770a-C001-62", "c437e447603ecdbe4053651169770a-C001-84" ] }, "@DIF@": { "gold_contexts": [ [ "c437e447603ecdbe4053651169770a-C001-62" ] ], "cite_sentences": [ "c437e447603ecdbe4053651169770a-C001-62" ] } } }, "ABC_d9877fc29c2e4f20805076392a70d0_26": { "x": [ { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-16", "text": "It is thus a continuation of prior work, in which we investigated historical English texts only (Hellrich and Hahn, 2016a) , and also influenced by the design decisions of Kim et al. (2014) and Kulkarni et al. (2015) which were the first to use word embeddings in diachronic studies." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-29", "text": "It is conceptually not affected by reliability problems, as there is no random initialization or relevant processing order." 
}, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-17", "text": "Our results cast doubt on the reproducibility of such experiments where neighborhoods between words in embedding space are taken as a computationally valid indicator for properly capturing lexical meaning (and, consequently, meaning shifts)." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-18", "text": "linguistics." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-2", "text": "We assess the reliability and accuracy of (neural) word embeddings for both modern and historical English and German." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-3", "text": "Our research provides deeper insights into the empirically justified choice of optimal training methods and parameters." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-4", "text": "The overall low reliability we observe, nevertheless, casts doubt on the suitability of word neighborhoods in embedding spaces as a basis for qualitative conclusions on synchronic and diachronic lexico-semantic matters, an issue currently high up in the agenda of Digital Humanities." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-5", "text": "----------------------------------" }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-7", "text": "Distributional methods applied to large-sized, often temporally stratified corpora have markedly enhanced the methodological repertoire of both synchronic and diachronic computational linguistics and are getting more and more popular in the Digital Humanities (see Section 2.2)." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-8", "text": "However, using such quantitative data as a basis for qualitative, empirically-grounded theories requires that measurements should not only be accurate, but also reliable." 
}, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-9", "text": "Only under such a guarantee, quantitative data can be assembled from different experiments as a foundation for trustful theories." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-10", "text": "Measuring word similarity by word neighborhoods in embedding space can be used to detect diachronic shifts or domain specific usage, by training word embeddings on suited corpora and comparing these representations." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-11", "text": "Additionally, lexical items near in the embedding space to the lexical item under scrutiny can be considered as approximating its meaning at a given point in time or in a specific domain." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-12", "text": "These two lines of research converge in prior work to show, e.g., the increasing association of the lexical item 'gay' with the meaning dimension of homosexuality (Kim et al., 2014; Kulkarni et al., 2015) ." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-13", "text": "Neural word embeddings (Mikolov et al., 2013) are probably the most influential among all embedding types (see Section 2.1)." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-14", "text": "Yet, we gathered evidence that the inherent randomness involved in their generation affects the reliability of word neighborhood judgments and demonstrate how this hampers qualitative conclusions based on such models." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-15", "text": "Our investigation was performed on both historical (for the time span of 1900 to 1904) and contemporary texts (for the time span of 2005 to 2009) in two languages, English and German." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-91", "text": "Influence of Word Frequency." 
}, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-19", "text": "The word2vec family of algorithms, developed from heavily trimmed artificial neural networks, is a widely used and robust way to generate such embeddings (Mikolov et al., 2013; Levy et al., 2015) ." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-20", "text": "Its skip-gram variant predicts plausible contexts for a given word, whereas the alternative continuous bag-of-words variant tries to predict words from contexts; we focus on the former as it is generally reported to be superior (see e.g., Levy et al. (2015) )." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-21", "text": "There are two strategies for managing the huge number of potential contexts a word can appear in." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-22", "text": "Skip-gram hierarchical softmax (SGHS) uses a binary tree to more efficiently represent the vocabulary, whereas skip-gram negative sampling (SGNS) updates only a limited number of word vectors during each training step." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-23", "text": "SGNS is preferred in general, yet SGHS showed slight benefits in some reliability scenarios in our prior investigations (Hellrich and Hahn, 2016a) ." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-24", "text": "There are two sources of randomness involved in the training of neural word embeddings: First, the random initialization of all word vectors before any examples are processed." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-25", "text": "Second, the order in which these examples are processed." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-26", "text": "Both can be replaced by deterministic alternatives, 1 yet this would simply replace a random distortion with a fixed one, thus providing faux reliability only useful for testing purposes." 
}, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-27", "text": "A range of other word embedding algorithms was inspired by word2vec, either trying to avoid the opaqueness stemming from its neural network heritage (GloVe; still using random initialization, see Pennington et al. (2014) ) or adding capabilities, like using syntactic information during training (Levy and Goldberg, 2014) or modeling multiple word senses (Bartunov et al., 2016; Panchenko, 2016) ." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-28", "text": "Levy et al. (2015) created SVD PPMI , a variant of the classical pointwise mutual information co-occurrence metric (see e.g., Manning and Sch\u00fctze (1999, pp.178-183) ), by transferring pre-processing steps and hyper-parameters uncovered by the development of these algorithms, and reported similar or slightly better performance than SGNS on evaluation tasks." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-30", "text": "Word embeddings capture both syntactic and semantic information (and arguably also social biases, see Bolukbasi et al. (2016) ) in vector form and can thus be evaluated by their ability to calculate the similarity of two words and perform analogy-based reasoning; there exist several other evaluation methods and more test sets than discussed here, see e.g., Baroni et al. (2014) ." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-31", "text": "Mikolov et al. (2013) provide an analogy test set for measuring performance as the percentage of correctly calculated analogies for test cases such as the frequently cited 'king'-'queen' example (see Section 3)." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-32", "text": "Word similarity is evaluated by calculating Spearman's rank coefficient between embedding-derived predictions and a gold standard of human word similarity judgments." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-33", "text": "Finkelstein et al. 
(2002) developed a widely used test set with 353 English word pairs, 2 a similar resource for German with 350 word pairs was provided by Zesch and Gurevych (2006) ." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-34", "text": "3 Recent work cautions that performance on such tasks is not always predictive for performance in down-stream applications (Batchkarov et al., 2016) ." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-35", "text": "----------------------------------" }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-36", "text": "**DIACHRONIC APPLICATION**" }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-37", "text": "Word embeddings can be used rather directly for tracking semantic changes, namely by measuring the similarity of word representations generated for one word at different points in time-words which underwent semantic shifts will be dissimilar with themselves." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-38", "text": "These models must either be trained in a continuous manner where the model for each time span is initialized with its predecessor (Kim et al., 2014; Hellrich and Hahn, 2016b) , or a mapping between models for different points in time must be calculated (Kulkarni et al., 2015; Hamilton et al., 2016) ." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-39", "text": "The first approach cannot be performed in parallel and is thus rather time-consuming, if texts are not subsampled." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-40", "text": "We nevertheless discourage using samples instead of full corpora, as we observed extremely low reliability values between different samples (Hellrich and Hahn, 2016a) ." 
}, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-41", "text": "Word embeddings can also be used in diachronic studies without any kind of mapping to track clusters of similar words over time and, thus, model the evolution of topics (Kenter et al., 2015) or compare neighborhoods in embedding space for preselected words (Jo, 2016) ." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-42", "text": "Besides temporal variations, word embeddings can also used to analyze geographic ones, e.g., the distinction between US American and British English variants (Kulkarni et al., 2016) ." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-43", "text": "Most of these studies were performed with algorithms from the word2vec family, respectively GloVe in Jo (2016), and are thus likely to be affected by the same systematic reliability problems on which we focus here." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-44", "text": "Only Hamilton et al. (2016) used SVD PPMI in some of their very recent experiments and showed it to be adequate for exploring historical semantics." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-45", "text": "The Google Books Ngram corpus (GBN; Michel et al. (2011 ), Lin et al. (2012 ) is used in most of the studies we already mentioned, including our current study and its predecessor (Hellrich and Hahn, 2016a) ." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-46", "text": "It contains about 6% of all books published between 1500 and 2009 in the form of n-grams (up to pentagrams), together with their frequency for each year." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-47", "text": "This corpus has often been criticized for its opaque sampling strategy, as its constituent books remain unknown and can be shown to form an unbalanced collection (Pechenick et al., 2015) ." 
}, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-48", "text": "GBN is multilingual, with its English part being subdivided into regional segments (British, US) and topic categories (general language and fiction texts)." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-49", "text": "Diachronic research focuses on the English Fiction part, with the exception of some work relating to German data (Hellrich and Hahn, 2016b) ." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-50", "text": "----------------------------------" }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-51", "text": "**EVALUATION METHODS**" }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-52", "text": "Reliability, in this study, is judged by training three identically parametrized models for each experiment and by comparing the n next neighbors (by cosine distance) for each word modeled by the experiments with a variant of the Jaccard coefficient (Manning and Sch\u00fctze, 1999, p.299) ." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-53", "text": "The 3-dimensional array W i,j,k contains words ordered by closeness (i) for a word in question (j) according to an experiment (k)." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-54", "text": "The reliability r for a specific value of n (r@n) is defined as the magnitude of the intersection of similar words produced by all three experiments with a rank of n or lower, averaged over all t words modeled by these experiments and normalized by n, which is the maximally achievable score for this value of n:" }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-55", "text": "Accuracy, in this study, is measured considering two different approaches-analogy and similarity." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-56", "text": "The analogy approach uses the English test set developed by Mikolov et al. (2013) by calculating the percentage of correct analogies made by a word2vec model." 
}, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-57", "text": "It contains groups of four words connected via the analogy relation '::' and the similarity relation '\u223c', as exemplified by the expression 'king' \u223c 'queen' :: 'man' \u223c 'woman'." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-58", "text": "The similarity approach covers both English and German by calculating Spearman's rank correlation coefficient between the similarity judgments made by a word2vec model for a word pair (e.g., 'bread' and 'butter') and the human judgment thereof (Finkelstein et al., 2002; Zesch and Gurevych, 2006) ." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-59", "text": "Pairs containing words not modeled for the time span in question, such as the at that time non-existent 'FBI' in the early 20th century, are simply ignored." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-60", "text": "All three test sets are based on contemporary language and current world knowledge and might thus not fully match the requirements for historical texts, yet are also used for these due to the lack of a suitable alternative." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-61", "text": "Accuracy values were calculated independently for each of the three identically parametrized models and subsequently averaged, but resulting deviations were negligible." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-62", "text": "----------------------------------" }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-63", "text": "**EXPERIMENTAL SET-UP 4.1 CORPUS**" }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-64", "text": "Our experiments 4 were performed on the German part and the English Fiction part of the GBN; the latter is known to be less unbalanced than the general English part (Pechenick et al., 2015) ." 
}, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-65", "text": "Both corpus splits differ in size and contain mainly contemporary texts (from the past fifty years), as is evident from Figure 1 ; note the logarithmic axis and the negative impact of both World Wars on book production." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-66", "text": "Following Kulkarni et al. (2015) , we trained our models on all 5-grams occurring during five consecutive years for the two time spans, 5 1900-1904 and 2005-2009 ; the number of 5-grams 6 for each time span is listed in Table 1 ." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-67", "text": "The two languages share a similar number of 5-grams for 1900-1904, yet not for [2005] [2006] [2007] [2008] [2009] ." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-68", "text": "5-grams from both corpus parts were lower cased for training." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-69", "text": "The German part was not only taken as is, but also orthographically normalized using the CAB service (Jurish, 2013) ." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-70", "text": "7 We incorporated this step because major changes in German orthography occurred during the 20th century, an issue that could hamper diachronic comparisons, e.g., archaic 'Gem\u00fcth' (in English: \"mind, emotional disposition\") became modern 'Gem\u00fct'." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-71", "text": "Table 1 shows the resulting reduction in the number of types, bringing the morphologically richer German to levels below English (yet this reduction is in line with the respective corpus sizes)." 
}, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-72", "text": "----------------------------------" }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-73", "text": "**TRAINING**" }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-74", "text": "We used the PYTHON-based GENSIM 8 implementation of word2vec to independently train word embeddings for each time span with 200 dimensions, a context window of 4 (limited by the 5-gram size), a minimum frequency of 10, and 10 \u22125 as the threshold for downsampling frequent words." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-75", "text": "We processed the full subcorpora for each time span, due to the extremely low reliability values between samples we observed in previous investigations (Hellrich and Hahn, 2016a) ." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-76", "text": "We tested both SGNS with 5 noise words and SGHS training strategies and trained for 10 iterations, saving the resulting embeddings after each epoch." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-77", "text": "During each epoch the learning rate was decreased from 0.025 to 0.0001." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-78", "text": "The averaged cosine values between word embeddings before and after an epoch are used as a convergence measure c (Kim et al., 2014; Kulkarni et al., 2015) ." 
}, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-79", "text": "It is defined for a vocabulary with n words and a matrix W containing word embedding vectors (normalized to length 1) for words i from training epochs e and e-1:" }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-80", "text": "We also define \u2206c, the change of c during subsequent epochs e-1, as another convergence criterion:" }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-81", "text": "5 Results Table 2 shows the performance of the systems trained according to the settings described in Section 4.2, as measured by similarity accuracy and top-1 reliability (see below for other cut-offs)." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-82", "text": "We make the following observations:" }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-83", "text": "1. Both accuracy and reliability are higher for SGNS than for SGHS for all tested combinations of languages and time spans, if 10 training epochs are used." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-84", "text": "We also measured analogy accuracy for the English Fiction data sets, and observed no negative effect of multiple training epochs, yet a more pronounced gap between training methods, e.g., 36% of all analogies were correct for SGNS and only 27% for SGHS after one epoch on 1900-1904 data." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-85", "text": "In the following, we further explore system performance as influenced, e.g., by word frequency, word ambiguity and the number of training epochs." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-86", "text": "For German, we focus on the normalized version due to the overall similar performance and suitability for further applications." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-87", "text": "Influence of Neighborhood Size." 
}, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-88", "text": "Reliability at different top-n cut-offs is very similar for all languages and time spans under scrutiny, confirming previous observations in Hellrich and Hahn (2016a) and strengthening the suggestion to use only top-1 reliability for evaluation." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-89", "text": "Figure 2 illustrates this phenomenon with an SGNS trained on 1900-1904 English Fiction data." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-90", "text": "We assume this to be connected with the general decrease in word2vec embedding utility for high values of n already observed by Schnabel et al. (2015) ." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-92", "text": "Figures 3 and 4 depict the influence of word frequency (as percentile ranks) for English, as well as orthographically normalized German." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-93", "text": "Negative sampling is overall more reliable, especially for words with low or medium frequency." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-94", "text": "Word frequency has a less pronounced effect on reliability for German and negative sampling is again preferable, especially for low or medium frequency words." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-95", "text": "The 21 English words reported to have undergone traceable semantic changes in prior work 9 are all frequent with percentiles between 89 and 99-for such high-frequency words hierarchical softmax performs similarly or even slightly better." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-96", "text": "The relatively low reliability for medium-frequency English words, as compared to German ones, could be caused by a peculiar pattern of word co-occurrences, illustrated in Figures 5 and 6 for 1900-1904 English Fiction, respectively normalized German." 
}, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-97", "text": "Medium-frequency English words have fewer co-occurrences with low-frequency words than German ones, which might result in a lack of specific contexts for these words during training and thus hamper embedding quality." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-98", "text": "Influence of Word Ambiguity." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-99", "text": "Entries in lexical databases, such as WORDNET 10 (Fellbaum, 1998) and its German counterpart GERMANET 11 (Lemnitzer and Kunze, 2002) , can be employed to approximate the effect of word ambiguity on reliability." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-100", "text": "The number of synsets a word belongs to (i.e., the number of its senses) seems to be positively correlated with top-1 reliability for English, as shown in Figure 7 , whereas orthographically normalized German is less affected by ambiguity as Figure 8 reveals." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-101", "text": "This counter-intuitive effect for English seems to be caused by the low ambiguity of infrequent words-results become more uniform, if analysis is limited to high frequency words (e.g., 90th frequency percentile or higher)." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-102", "text": "Influence of the Number of Training Epochs." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-103", "text": "Table 2 )." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-104", "text": "Figures 11 and 12 show the results for English and orthographically normalized German, respectively." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-105", "text": "Note that accuracy is assessed on a test set for modern-day language, and can thus not be considered a fully valid yardstick." 
}, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-106", "text": "Accuracy behaves similar to reliability, as under the negative sampling condition it clearly profits from multiple training epochs." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-107", "text": "This effect is more pronounced for smaller corpora; the biggest corpus (i.e., English Fiction 2005 -2009 ) shows a slight regression in accuracy after more than 5 training epochs." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-108", "text": "Conclusions." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-109", "text": "Both reliability and accuracy point towards negative sampling with 4 to 6 training epochs (6 being better for smaller and 4 being better for larger corpora) as the optimal training regime for all tested combinations of languages and time spans (implicitly, this is also a test on largely varying corpus sizes, see Table 1 )." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-110", "text": "Such a training scheme yields models with high reliability without losses in accuracy (that would indicate overfitting)." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-111", "text": "Figure 13 shows \u2206c, i.e., the difference of the convergence measure c (Equations (2) and (3) averaged over all three models) between subsequent epochs, for both German and English data from the intervals 1900-1904 and 2005-2009 ." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-112", "text": "Few changes occur after 4-6 epochs, which could be alternatively expressed as a \u2206c of about 0.003." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-113", "text": "The convergence criterion proposed by Kulkarni et al. (2015) , i.e., c = 0.9999, was never reached (this observation might be explained by Kulkarni et al.'s decision not to reset the learning rate for each training epoch, as was done by us and Kim et al. (2014) )." 
}, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-114", "text": "SVD PPMI , which are conceptually not bothered by the reliability problems we discussed here, were not a good fit for the hyperparameters we adopted from Kulkarni et al. (2015) ." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-115", "text": "Hamilton et al. (2016) reports similarity accuracy superior to SGNS, whereas for our set-up results in pretests were about 10 percent points worse than skip-gram embeddings, e.g., only 0.35 for 1900-1904 English Fiction." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-116", "text": "Finally, to want to illustrate how this reliability problem affects qualitative conclusions." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-117", "text": "In Table 3 we provide some examples in which three negative sampling models for 1900-1904 English Fiction did not agree on the closest neighbor for words in question (mostly drawn from the list in Footnote 9)." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-118", "text": "The most inconsistent word neighborhoods are provided for 'romantic' which is connected to 'lazzaroni', 12 'fanciful' and 'melancholies'." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-119", "text": "This holds despite the high frequency (94th percentile) and moderate ambiguity (5 synsets) of the target item 'romantic'." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-120", "text": "----------------------------------" }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-121", "text": "**DISCUSSION**" }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-122", "text": "Our investigation into the accuracy and reliability of skip-gram word embeddings shows even the most reliable systems too often provide inconsistent word neighborhoods." 
}, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-123", "text": "This carries unwarranted potential for erroneous conclusions on a word's semantic evolution as was shown, e.g., for the lexical item 'romantic' and English Fiction texts from the 1900-1904 time slice." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-124", "text": "We are thus skeptical about using word neighborhoods in skip-gram embedding space to adequately capture natural languages' lexical semantics (for English and German, at least)." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-125", "text": "While we found some mitigation strategies, i.e., training for multiple epochs or using our convergence criterion of \u2206c 0.003, we assume SVD PPMI to be conceptually superior." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-126", "text": "Future work might try to provide general guidelines for proper hyperparameter selection for SVD PPMI , especially regarding complete temporal slices of the GBN (Hamilton et al. (2016) used samples)." }, { "sent_id": "d9877fc29c2e4f20805076392a70d0-C001-127", "text": "Alternatively, training several identically parametrized SGNS/SGHS models and combining them into an ensemble might constitute an easy way to reduce the reliability problems we described, yet at the price of exorbitant computational costs." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "d9877fc29c2e4f20805076392a70d0-C001-12" ], [ "d9877fc29c2e4f20805076392a70d0-C001-16" ], [ "d9877fc29c2e4f20805076392a70d0-C001-38" ] ], "cite_sentences": [ "d9877fc29c2e4f20805076392a70d0-C001-12", "d9877fc29c2e4f20805076392a70d0-C001-16", "d9877fc29c2e4f20805076392a70d0-C001-38" ] }, "@USE@": { "gold_contexts": [ [ "d9877fc29c2e4f20805076392a70d0-C001-16" ], [ "d9877fc29c2e4f20805076392a70d0-C001-66" ], [ "d9877fc29c2e4f20805076392a70d0-C001-78" ], [ "d9877fc29c2e4f20805076392a70d0-C001-114" ] ], "cite_sentences": [ "d9877fc29c2e4f20805076392a70d0-C001-16", "d9877fc29c2e4f20805076392a70d0-C001-66", "d9877fc29c2e4f20805076392a70d0-C001-78", "d9877fc29c2e4f20805076392a70d0-C001-114" ] }, "@MOT@": { "gold_contexts": [ [ "d9877fc29c2e4f20805076392a70d0-C001-38", "d9877fc29c2e4f20805076392a70d0-C001-39" ] ], "cite_sentences": [ "d9877fc29c2e4f20805076392a70d0-C001-38" ] }, "@DIF@": { "gold_contexts": [ [ "d9877fc29c2e4f20805076392a70d0-C001-113" ], [ "d9877fc29c2e4f20805076392a70d0-C001-114" ] ], "cite_sentences": [ "d9877fc29c2e4f20805076392a70d0-C001-113", "d9877fc29c2e4f20805076392a70d0-C001-114" ] } } }, "ABC_cee22bd0384d3d3fd4e45833341e77_26": { "x": [ { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-80", "text": "We only use pairs for which target and foil are found in the original captions." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-2", "text": "The last years have seen an explosion of work on the integration of vision and language data." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-3", "text": "New tasks like Image Captioning and Visual Questions Answering have been proposed and impressive results have been achieved." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-4", "text": "There is now a shared desire to gain an in-depth understanding of the strengths and weaknesses of those models." 
}, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-5", "text": "To this end, several datasets have been proposed to try and challenge the state-of-the-art." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-6", "text": "Those datasets, however, mostly focus on the interpretation of objects (as denoted by nouns in the corresponding captions)." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-7", "text": "In this paper, we reuse a previously proposed methodology to evaluate the ability of current systems to move beyond objects and deal with attributes (as denoted by adjectives), actions (verbs), manner (adverbs) and spatial relations (prepositions)." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-8", "text": "We show that the coarse representations given by current approaches are not informative enough to interpret attributes or actions, whilst spatial relations somewhat fare better, but only in attention models." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-9", "text": "----------------------------------" }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-11", "text": "Nouns are a crucial component of natural language sentences." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-12", "text": "It is not a coincidence that children first learn to use nouns and only afterwords expand their vocabulary with verbs, adjectives and other parts of speech (Waxman et al., 2013) ." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-13", "text": "Interestingly, the same development has taken place with Language and Vision models." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-14", "text": "Object classification has long been the main concern of the computer vision field, only then followed by action classification shared tasks." 
}, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-15", "text": "Recently, more ambitious competitions have been proposed, aiming to evaluate models' ability to connect whole sentences to images, through both Image Captioning (IC) or Visual Question Answering (VQA) tasks." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-16", "text": "Progress in this area has seemed swift and impressive, but the community is now scrutinising the results to understand whether enthusiasm is warranted." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-17", "text": "Several diagnostic datasets have been proposed with this goal in mind, highlighting various flaws in existing tasks (Johnson et al., 2017; Zhang et al., 2015) ." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-18", "text": "Our paper is a contribution to these efforts, showing that the field may have moved too fast from noun to sentence interpretation, overlooking difficulties in understanding other parts-of-speech." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-19", "text": "Our paper expands the existing FOIL dataset (Shekhar et al., 2017) ." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-20", "text": "FOIL consists of a set of images matched with captions containing one single mistake." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-21", "text": "The mistakes are always nouns referring to objects not actually present in the image." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-22", "text": "The work demonstrates that the language and vision modalities are not truly integrated in current computational models, as they fail to spot the mistake in the caption and to correct it appropriately (humans, on the other hand, obtain almost 100% accuracy on those tasks)." 
}, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-23", "text": "In the present paper, we exploit the FOIL strategy to evaluate Language and Vision models on a larger set of possible mismatches between language and vision." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-24", "text": "Beside considering nouns as possible 'foil' words, we also consider verbs, adjectives, adverbs and prepositions, as illustrated in Figure 1 ." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-25", "text": "The results obtained by state-of-the-art systems on this data demonstrate that current models are indeed little able to move beyond object understanding." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-26", "text": "Figure 1 : Sample image, corresponding original caption and the generated foil caption for the different parts of speech." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-27", "text": "The model has to be able to classify the caption as 'correct' or 'foil' (Task 1); detect the foil word in the foil caption (see words highlighted in red) (Task 2); and correct the foil word with an appropriate replacement (see words highlighted in green) (Task 3)." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-28", "text": "----------------------------------" }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-29", "text": "**THE FOIL METHODOLOGY**" }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-30", "text": "We follow the methodology highlighted in Shekhar et al. (2017) , which consists of replacing a single word in a human-generated caption with a 'foil' item, making the caption unsuitable to describe the original image." 
}, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-31", "text": "Given such replacements, the system should be able to perform three tasks: a) a classification task (T1): given an image and a caption, the model has to predict whether the caption is correct or inappropriate for the image (evaluating whether the model has a coarse understanding of the linguistic and visual inputs and their relations); b) a foil word detection task (T2): given an image and a foil caption, detect the foil word in the caption (evaluating whether the model reaches a fine-grained representation of the linguistic input); a foil word correction task (T3): given an image, a foil caption and the foil word, the model has to correct the mistake (verifying whether the model reaches a fine-grained representation of the image)." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-79", "text": "We use all word pairs in both directions (e.g. replacing push with pull and pull with push)." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-32", "text": "Four models are tested on tasks 1-3: one baseline (a 'blind' model), and three state-of-the-art models from the Visual Question Answering (VQA) and Image Captioning (IC) literature." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-33", "text": "Blind Model: this model is based on the caption only; in other words, the system does not have access to the visual data." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-34", "text": "The caption is modelled by an LSTM, fully connected to a hidden layer followed by a softmax to perform the classification." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-35", "text": "This blind baseline affords an evaluation of the 'language bias' of the data (i.e., the phenomenon by which a Language and Vision dataset can be suitably modelled using language only)." 
}, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-36", "text": "Discriminative VQA Models: two VQA models are used, namely the LSTM + norm I of Antol et al. (2015) and the Hierarchical Co-Attention model (HieCoAtt) of Lu et al. (2016) ." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-37", "text": "In LSTM + norm I, the text is represented by two stacked LSTMs and the image is represented by a normalisation of the last fully connected layer of VGG network (Simonyan and Zisserman, 2014) ." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-38", "text": "Both representations are projected onto a 1024-dimensional feature space." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-39", "text": "The combination of language and vision features is performed by point-wise multiplication followed by a fully connected and a softmax layer." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-40", "text": "HieCoAtt has a similar architecture, with the addition of an attention layer." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-41", "text": "Attention is provided to both image and text in alternation, in a hierarchical fashion." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-42", "text": "Generative IC Model: we use the IC system of Wang et al. (2016) (henceforth, IC-Wang) , which generates a word in a caption by considering both past and future contexts, using a bi-directional LSTM." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-43", "text": "IC-Wang consists of three modules: a CNN to encode the image, a text LSTM to encode captions, and a multimodal LSTM for mapping visual and text representations to a common space." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-44", "text": "For T1, the models are directly trained to classify a given caption as 'good' vs. 'foil'." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-45", "text": "For T2 and T3, the model trained on T1 is adopted." 
}, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-46", "text": "For T2, we subsequently occlude one word (Goyal et al. (2016) ) at a time and calculate the probability of the new caption to be good vs. foil." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-47", "text": "The model selects as foil word, the one which has generated the caption with the highest probability." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-48", "text": "For T3, we regress over all Shekhar et al. (2017) ." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-49", "text": "the target words on the position of the foil word and select the one which generates the caption with the highest probability to be \"good\"." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-50", "text": "Due to the generative nature of IC models, adapting IC-Wang for the classification purpose is less straightforward." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-51", "text": "For T1, we generate all possible captions by subsequently predicting one word at a time provided all other words in the caption and the image." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-52", "text": "We compare the probability of these generated captions with the given caption." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-53", "text": "When the test caption probability is higher than generated captions probabilities, we classify the given caption as good caption, else as foil caption." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-54", "text": "----------------------------------" }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-55", "text": "**DATASET CREATION**" }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-56", "text": "Following Shekhar et al. (2017) , we aim at creating a dataset of images associated with both correct and foil captions, where the latter are obtained by replacing one word in the original text." 
}, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-57", "text": "Expanding on the original paper, our target/foil pairs do not merely consist of nouns." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-58", "text": "The introduced error can also be an adjective (an object's attribute), a verb (an action), a preposition (a relation between objects) or an adverb (a manner of action)." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-59", "text": "In total, we produce 196,284 datapoints, each corresponding to an triple." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-60", "text": "The starting point for images and correct captions is Microsoft's Common Objects in Context (MS-COCO) (Lin et al. (2014) )." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-61", "text": "----------------------------------" }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-62", "text": "**CREATING NEW TARGET/FOIL PAIRS**" }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-63", "text": "We describe below our procedure to expand the original dataset with new parts-of-speech." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-64", "text": "Verbs: We use three resources: a) VerbOcean, a semi-automatically generated broad-coverage semantic network of verbs extracted from the Web by exploiting a pattern-based approach (Chklovski and Pantel (2004) ); b) Computing Lexical Contrast (CLC), a resource of contrasting words selected from direct and indirect WordNet opposites, (Mohammad et al. (2013) ); c) SimLex999, a set of related word pairs rated with respect to their similarity (Hill et al. (2016) )." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-65", "text": "From VerbOcean and CLC, we extract all antonyms (e.g., pull-push)." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-66", "text": "From SimLex999, we select those pairs with a similarity score lower than the average in the database (e.g., allowing-preventing)." 
}, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-67", "text": "We end up with 902, 44, and 30 verb pairs from VerbOcean, CLC and SimLex999 respectively." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-68", "text": "Adjectives and adverbs: As in the verb case, we use antonyms from CLC and we select pairs from SimLex999 which have a similarity score lower than average." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-69", "text": "We extract 46 and 127 adjectives pairs from CLC and SimLex999 respectively." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-70", "text": "All adverbial pairs come from CLC, and amount to 52 datapoints." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-71", "text": "Prepositions: We extract prepositions from Berry et al. (1995) , divided into three classes: place (e.g., under, below), direction (e.g., inside, outside) and device (e.g., by, with)." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-72", "text": "Using these prepositions, we generate target/foil pairs by coupling prepositions which belong to the same class." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-73", "text": "We obtain a total of 206 pairs (110, 90 and 6 for place, direction and device respectively)." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-74", "text": "Nouns: The target/foil noun pairs are built using words that belong to the same category in MS-COCO (e.g., bird/dog, from the MS-COCO category ANIMAL)." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-75", "text": "In order to obtain a balanced dataset across the various PoS, we only use a subset of the FOIL-COCO dataset of Shekhar et al. (2017) ." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-76", "text": "From the FOIL dataset, we retain the 37,536 images for which foil captions could be generated, using the target/foil pairs extracted from the resources mentioned above." 
}, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-77", "text": "Of the FOIL datapoints generated for the noun pairs, only those containing images used for the other PoS are selected." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-78", "text": "Hence, the number of unique images of the whole dataset is the same of those used for nouns (see Table 1 for details of the train/test set division.)" }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-81", "text": "This ensures that the model will not learn to recognise a foil caption simply by recording the presence of an unknown word." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-82", "text": "From each resource, we randomly split the target/foil pairs into training and test sets." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-83", "text": "The number of unique pairs per PoS is provided in Table 1 ." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-84", "text": "----------------------------------" }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-85", "text": "**FOIL CAPTION GENERATION**" }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-86", "text": "From the word pair lists above, foil captions are generated from MS-COCO original captions." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-87", "text": "The foil captions are generated by replacing nouns are directly extracted from the FOIL dataset by Shekhar et al. (2017) ." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-88", "text": "In this case, for each original MS-COCO caption, several foil ones are generated and subsequently filtered using several heuristics." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-89", "text": "The aim of filtering is to prioritise salient objects in the image, and to minimise the language bias in the data." 
}, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-90", "text": "Ideally, these filters would have to be applied also for the generation of the foil caption for the other PoS, but we found that they reduced the size of our data in an unacceptably small size." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-91", "text": "As a consequence, the results we report are obviously affected by the language bias, as shown by the reasonable performance of a 'blind' model without access to visual data." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-92", "text": "However, as we will see, our broad claim is not affected by this heightened baseline." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-93", "text": "Details on the number of the unique images and of of datapoints generated for each PoS are reported in Table 1 ." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-94", "text": "Table 2 reports the accuracy of the various models described in \u00a72 for Task T1." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-95", "text": "The blind model's accuracy is well above chance level for all PoS, with lower results observed on the captions generated by noun and adverb replacement." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-96", "text": "Recall that for nouns, the language prior has been minimised, whereas the datapoints generated with verb, adjective and preposition replacements have some language prior that the models can exploit." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-97", "text": "The comparatively low performance on adverbs may be explained by the fact that all generated target/foil pairs are antonyms which behave very similarly from a distributional point of view (e.g. upwards/downwards, partially/completely, etc)." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-98", "text": "HieCoAtt is the overall best performing model, but we note that it only outperforms the blind model by a few points." 
}, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-99", "text": "These numbers, however, do not show to which extent the models are able to avoid the trap of the dataset: Shekhar et al. (2017) showed that on the FOIL data, models tend to detect correct captions with reasonable accuracy but fail to identify the incorrect ones, leading to a large bias in classification." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-100", "text": "Taking this insight into account, for the rest of this paper, we focus on the accuracy of the systems in dealing with foil captions, across all three tasks." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-101", "text": "As shown in Table 3 , the blind model's accuracy is still reasonable on T1, but lower than chance for nouns and adverbs." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-102", "text": "In the case of nouns, the visual input helps obtaining a higher accuracy, whereas this is not the case for the other PoS. This could be due to the ability of vision models to 'see' objects but not their properties (adjectives) or relations (verbs, prepositions) ." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-103", "text": "It is a known shortcoming of such systems that they have difficulties in recognising anything that is not straightforwardly defined by a bounding box (Johnson et al. (2017) )." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-104", "text": "IC-Wang performs very poorly on verbs, adjectives and prepositions, even though it is the best system for nouns." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-105", "text": "Other models improve minimally on the baseline, with prepositions getting the best improvement: +7% for HieCoAtt." 
}, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-106", "text": "When looking more in detail into this result, we observe that most instances in the preposition data indicate location: it is not surprising that an attention model would perform well on those, since it is trained to focus on particular areas of the image." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-107", "text": "----------------------------------" }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-108", "text": "**EXPERIMENTS AND RESULTS**" }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-109", "text": "----------------------------------" }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-110", "text": "**RESULTS AND ANALYSIS**" }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-111", "text": "For task T2 (see Table 4 ), all models perform well under baseline on verbs, adjectives and adverbs." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-112", "text": "IC-Wang does however provide some improvement on prepositions." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-113", "text": "The reason for this may be that the system, being trained to generate sequences, has a better internal language model than other approaches." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-114", "text": "Whilst a good language model is unlikely to help in the case of content words, we can expect some benefits for function words." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-115", "text": "This trend has been observed in work on L2 error detection, where mistakes in words from closed classes are easier to spot and correct (Herbelot and Kochmar (2016) )." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-116", "text": "For task T3 (see Table 4 ), improvements over the baseline are minimal." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-117", "text": "IC-Wang performs best overall, but at a level well below its achievement on nouns." 
}, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-118", "text": "We do not only confirm that foil correction is hard, but that it is particularly challenging on parts-of-speech that represent attributes or relations." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-119", "text": "We note that IC-Wang improves more on prepositions than on adjectives and adverbs, confirming what was observed in T2 (i.e. closed classes are easier to deal with). But it also provides a good improvement on the verb baseline, which is puzzling given its inability to spot verb foils in T2." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-120", "text": "----------------------------------" }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-121", "text": "**CONCLUSION**" }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-122", "text": "Language and Vision integration has been studied in a fine-grained, but single-minded way, when focusing on objects (nouns)." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-123", "text": "The level of events (sentences) has also received attention, but through coarse representations." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-124", "text": "Our work aims to highlight the importance of a fine-grained representation for all components of a sentence, including attributes and relations." }, { "sent_id": "cee22bd0384d3d3fd4e45833341e77-C001-125", "text": "Our results show that none of the current SoA models achieve this overall goal: attention models may have the right components to detect location (e.g., see locative prepositions), but some image captioning systems probably provide a better language model, in particular for closed-class words." 
} ], "y": { "@USE@": { "gold_contexts": [ [ "cee22bd0384d3d3fd4e45833341e77-C001-19" ], [ "cee22bd0384d3d3fd4e45833341e77-C001-30" ], [ "cee22bd0384d3d3fd4e45833341e77-C001-48" ], [ "cee22bd0384d3d3fd4e45833341e77-C001-56" ], [ "cee22bd0384d3d3fd4e45833341e77-C001-75" ], [ "cee22bd0384d3d3fd4e45833341e77-C001-87" ], [ "cee22bd0384d3d3fd4e45833341e77-C001-100", "cee22bd0384d3d3fd4e45833341e77-C001-99" ] ], "cite_sentences": [ "cee22bd0384d3d3fd4e45833341e77-C001-19", "cee22bd0384d3d3fd4e45833341e77-C001-30", "cee22bd0384d3d3fd4e45833341e77-C001-48", "cee22bd0384d3d3fd4e45833341e77-C001-56", "cee22bd0384d3d3fd4e45833341e77-C001-75", "cee22bd0384d3d3fd4e45833341e77-C001-87", "cee22bd0384d3d3fd4e45833341e77-C001-99" ] }, "@EXT@": { "gold_contexts": [ [ "cee22bd0384d3d3fd4e45833341e77-C001-56", "cee22bd0384d3d3fd4e45833341e77-C001-57" ] ], "cite_sentences": [ "cee22bd0384d3d3fd4e45833341e77-C001-56" ] }, "@BACK@": { "gold_contexts": [ [ "cee22bd0384d3d3fd4e45833341e77-C001-99" ] ], "cite_sentences": [ "cee22bd0384d3d3fd4e45833341e77-C001-99" ] } } }, "ABC_74db2b52e81969742f8f7e5681bd2b_26": { "x": [ { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-32", "text": "is a unary function and C : R d \u00d7 R n\u00d7d \u2192 R calculates a normalizer, but dispense with the assumption that" }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-2", "text": "This paper presents some preliminary investigations of a new co-attention mechanism in neural transduction models." }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-3", "text": "We propose a paradigm, termed Two-Headed Monster (THM), which consists of two symmetric encoder modules and one decoder module connected with co-attention." }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-4", "text": "As a specific and concrete implementation of THM, Crossed Co-Attention Networks (CCNs) are designed based on the Transformer model." 
}, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-5", "text": "We demonstrate CCNs on WMT 2014 EN-DE and WMT 2016 EN-FI translation tasks and our model outperforms the strong Transformer baseline by 0.51 (big) and 0.74 (base) BLEU points on EN-DE and by 0.17 (big) and 0.47 (base) BLEU points on EN-FI." }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-6", "text": "----------------------------------" }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-8", "text": "Attention has emerged as a prominent neural module extensively adopted in a wide range of deep learning research problems (Das et al., 2017; Rockt\u00e4schel et al., 2015; Santos et al., 2016; Xu and Saenko, 2016; Yang et al., 2016; Yin et al., 2016; Zhu et al., 2016; Xu et al., 2015; Chorowski et al., 2015) such as VQA, reading comprehension, textual entailment, image captioning, speech recognition and so forth." }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-9", "text": "It's remarkable success is also embodied in machine translation tasks (Bahdanau et al., 2014; Vaswani et al., 2017) ." }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-31", "text": "We basically follow the definition of no-local operation in (Wang et al., 2018) where f :" }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-10", "text": "This work proposes an end-to-end co-attentional neural structure, named Crossed Co-Attention Networks (CCNs) to address machine translation, a typical sequence-to-sequence NLP task." }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-11", "text": "We customize the transformer (Vaswani et al., 2017) featured by non-local operations (Wang et al., 2018) with two * The work was done when Yaoyiran was working at Living Analytics Research Centre, Singapore Management University who is now a PhD student at University of Cambridge." 
}, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-12", "text": "input branches and tailor the transformer's multihead attention mechanism to the needs of information exchange between these two parallel branches." }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-13", "text": "A higher-level and more abstract paradigm generalized from CCNs is denoted as \"Two-Headed Monster\" (THM), representing a broader class of neural structure benefiting from two parallel neural channels that would be intertwined with each other through, for example, co-attention mechanism as illustrated in Fig. 1 ." }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-14", "text": "Needless to say, co-attention is widely adopted in multi-modal scenarios (Lu et al., 2016a; Yu et al., 2017; Tay et al., 2018; Xiong et al., 2016; Lu et al., 2016b) , the basic idea of which is to make two feature maps from different domains to attend to each other symmetrically and thus output summarized representations for each domain." }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-15", "text": "In this work, we emphasize a parallel and symmetric manifold operating on two input channels and possessing two output channels but do not assume that the two channels of input must be disparate." }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-16", "text": "Our co-attention mechanism is designed in a \"Transformer\" style, and to the best of our knowledge, our proposed Crossed Co-Attention Network is one of the first (if not the only) implementations of co-attention on transformer model." }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-17", "text": "As a preliminary investigation, we apply our model on the popular machine translation task where two input channels are in one same domain." 
}, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-18", "text": "Our code also leverages half-precision floating point format (FP16) training and synchronous distributed training for inter-GPU communication (we do not discard gradients calculated by \"stragglers\") which dramatically accelerate our training procedure (Ott et al., 2018; Micikevicius et al., 2018) ." }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-19", "text": "We will release our code after the paper is de-anonymized." }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-20", "text": "----------------------------------" }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-21", "text": "**MODEL ARCHITECTURE**" }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-22", "text": "We propose an end-to-end neural architecture, based on the transformer, to address a class of sequence to sequence tasks where the model takes input from two channels." }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-23", "text": "We design a Crossed Co-Attention Mechanism to make our model capable of attending to two information flows simultaneously in both the encoding and the decoding stages." }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-24", "text": "Our co-attention mechanism is naively realized by a crossed connection of Value, key and Query gates of a regular multi-head attention module, so we term our model Crossed Co-Attention Networks." }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-25", "text": "----------------------------------" }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-26", "text": "**GENERIC CO-ATTENTION**" }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-27", "text": "In this section, we first review non-local operations and bridge them to the dot-product attention that is widely used in self-attention modules and then formulate the co-attention mechanism in a generic way." 
}, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-28", "text": "A non-local operation is defined as a building block in deep neural networks which captures long-range dependencies where every response is computed as a linear combination of all features in the input feature map (Wang et al., 2018) ." }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-29", "text": "Suppose the input feature maps are V = [v 1 , v 2 , ..., v n ] T \u2208 R n\u00d7d , K = [k 1 , k 2 , ..., k n ] T \u2208 R n\u00d7d and Q = [q 1 , q 2 , ..., q n ] T \u2208 R n\u00d7d and the output feature map Y = [y 1 , y 2 , ..., y n ] T \u2208 R n\u00d7d is of the same size as the input." }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-30", "text": "Then a generic non-local operation is formulated as follows:" }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-33", "text": "then the non-local operation degrades to the multihead self-attention as is described in (Vaswani et al., 2017) (formula 2 describes only one attention head):" }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-34", "text": "Considering two input channels, denoted as 'left' and 'right', we present the following non-local operation as a definition of co-attention where \u03b1(\u00b7), \u03b2(\u00b7) \u2208 { lef t , right }." }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-35", "text": "Note that when \u03b1(\u00b7) = lef t , \u03b2(\u00b7) = right the co-attention degrades to two self-attention modules." }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-36", "text": "----------------------------------" }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-37", "text": "**CROSSED CO-ATTENTION NETWORKS**" }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-38", "text": "Based on the transformer model (Vaswani et al., 2017) , we design a novel co-attention mechanism." 
}, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-39", "text": "Our proposed mechanism consists of two symmetrical branches that work in parallel to assimilate information from two input channels respectively." }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-40", "text": "Different from previously known coattention mechanisms such as (Xiong et al., 2017; Lu et al., 2016a) , our co-attention is built through connecting two multiplicative attention modules (Vaswani et al., 2017 ) each containing three gates, i.e., Value, Key and Query." }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-41", "text": "The information flows from two input channels then interact with and benefit from each other via crossed connections." }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-42", "text": "Suppose the input fed into the left branch is X Lef t , and the right branch X right ." }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-43", "text": "In our encoder, the left branch takes input from X Lef t as Value (V) and Key (K) and takes the input X right as Query (Q)." }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-44", "text": "The right branch, however, takes the input X Lef t as Query (Q) and X right as Value (V) and Key (K)." }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-45", "text": "This design is, in a sense, meant for the two branches to relatively keep the information in their own domains." }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-46", "text": "A special case is, if g(v i ) = v i , then the response y i will be in the row space of V ." }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-47", "text": "Because when an attention takes input V from its own branch, the output responses will by and large carry the information of the branch." 
}, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-48", "text": "For machine translation, the two encoder branches take in one same input sequence, but in order to reduce the redundancy of two parallel branches, we apply dropout and input corruption on input embeddings for two branches respectively." }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-49", "text": "While our model shares BPE embeddings (Sennrich et al., 2015) globally, for input matrices encoder branches, we randomly select and swap two sub-word tokens at a probability of 0.5." }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-50", "text": "In the encoder-decoder attention layers, the multi-head attention on two decoder branches uses the output from two encoder branches as Value and Key alternatively while absorbing the self-attended output embedding from below as Query." }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-51", "text": "The output of the two branches in decoder is processed through concatenation, linear transformation and then fed into a feed-forward network." }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-52", "text": "In addition to our co-attention mechanism, we keeps one self-attention layer in the decoder for reading in shifted output embedding." }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-53", "text": "We adopt the same input masking and sinusoidal position encoding as the Transformer which will not be expanded here." 
}, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-54", "text": "3 Experiments" }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-55", "text": "----------------------------------" }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-56", "text": "**SETUP**" }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-82", "text": "While the novel attention mechanism, eschewing recurrence, is famous for modeling global dependencies and considered faster than recurrent layers (Vaswani et al., 2017) , recent work points out that it may tend to overlook neighboring information (Yang et al., 2019a; Xu et al., 2019) ." }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-83", "text": "It is found that applying an adaptive attention span could be conducive to character level language modeling tasks (Sukhbaatar et al., 2019) ." }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-84", "text": "Yang et al. propose to model localness for self-attention which would be conducive to capturing local information by learning a Gaussian bias predicting the region of local attention (Yang et al., 2018a) ." }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-85", "text": "Other work indicates that adding convolution layers would ameliorate the aforementioned issue (Yang et al., 2018b (Yang et al., , 2019b ." }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-86", "text": "Multi-head attention can also be used in multi-modal scenarios when V, K and Q gates take in data from different domains." }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-87", "text": "(Helcl et al., 2018) (Wu et al., 2019) ." }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-88", "text": "Other work shows that training on 128 GPUs can significantly boost the experimental results and shorten the training time (Ott et al., 2018) ." 
}, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-89", "text": "A novel research direction is semi-or un-supervised machine translation aimed at addressing low-resource languages where parallel data is usually unavailable (Cheng, 2019; Artetxe et al., 2017; Lample et al., 2017) ." }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-90", "text": "----------------------------------" }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-91", "text": "**CONCLUSION**" }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-92", "text": "We propose a novel co-attention mechanism consisting of two parallel attention modules connected with each other in a crossed manner." }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-93", "text": "First we formulate the co-attention in a general sense as a nonlocal operation and then show a specific type of co-attention, known as crossed co-attention can improve the machine translation tasks by 0.17 \u223c 0.74 BLEU points and enhance the capability of model selection." }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-94", "text": "However, the time efficiency is reduced since the number of parameters increases." }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-57", "text": "We demonstrate our model on WMT 2014 EN-DE and WMT 2016 EN-FI machine translation tasks." }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-58", "text": "For convenience, in this section, we do not differentiate between the notion of THM and CCN which is an implementation of THM." }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-59", "text": "The raw in-put data is pre-processed with length filtering as previous work (Ott et al., 2018 Our CCNs are established with 6 encoder and decoder blocks and a hidden state of size 512 for base models and with also 6 such blocks but a hidden state of 1, 024 neurons for big models." }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-60", "text": "That exactly corresponds to the settings of Transformer paper." 
}, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-61", "text": "We train our models on a NVIDIA DGX-1 GPU server with 4 TESLA V100-16GB GPUs." }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-62", "text": "In order to make full use of the computational resources, FP16 computation is adopted and we use a batch size of 6, 528 tokens/GPU for base models and 2, 176 for big models (both Transformer and THM)." }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-63", "text": "We adopt the Sequence-to-Sequence Toolkit FairSeq (Ott et al., 2019) released by the Facebook AI Research for our Transformer baseline 1 , upon which our THM code is built as well." }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-64", "text": "We train all base models for around one day and big models for around two days." }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-65", "text": "For model selection, we strictly choose the model that achieves the highest BLEU on Dev set." }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-66", "text": "----------------------------------" }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-67", "text": "**EXPERIMENTAL RESULTS**" }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-68", "text": "Main Results: Our experiments demonstrate the efficiency of our proposed crossed co-attention mechanism which significantly improves the BLEU scores of machine translation as illustrated in Table 1 ." }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-69", "text": "Besides, the co-attention mechanism has, by and large, reduced training, valid and test loss from the first training epoch compared with the transformer baselines as shown in Appendices A.1." }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-70", "text": "However, since the number of parameters doubles, the epoch time also increases by roughly 60% \u223c 80%." 
}, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-71", "text": "----------------------------------" }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-72", "text": "**CAPABILITY OF MODEL SELECTION:**" }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-73", "text": "In addition to the BLEU, loss and time efficiency, we also find that the THM/CCN models demonstrate better capability of selecting good models with Dev set from all models derived in all training epochs." }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-74", "text": "As is shown is Table 2 , for THM/CCN, the models that achieved hightest BLEU on Dev set are also highranking on the Test set." }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-75", "text": "In 75% cases, THM will select TOP 3 models and in all cases, it will select TOP 10 models whereas Transformer can only select TOP 10 models in 50% cases." }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-76", "text": "----------------------------------" }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-77", "text": "**PERFORMANCE ACROSS LANGUAGES:**" }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-78", "text": "We test our proposed method on two language pairs, EN-DE and EN-FI and the improved BLEU scores and the capability of model selection on both base and big models demonstrate the universality of our proposed method." }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-79", "text": "----------------------------------" }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-80", "text": "**RELATED WORK**" }, { "sent_id": "74db2b52e81969742f8f7e5681bd2b-C001-81", "text": "Attention: Multi-head self-attention has demonstrated its capacity in neural transduction models (Vaswani et al., 2017) , language model pre-training (Devlin et al., 2018; Radford et al., 2018) and speech synthesis (Yang et al., 2019c) ." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "74db2b52e81969742f8f7e5681bd2b-C001-8", "74db2b52e81969742f8f7e5681bd2b-C001-9" ], [ "74db2b52e81969742f8f7e5681bd2b-C001-81" ], [ "74db2b52e81969742f8f7e5681bd2b-C001-82" ] ], "cite_sentences": [ "74db2b52e81969742f8f7e5681bd2b-C001-9", "74db2b52e81969742f8f7e5681bd2b-C001-81", "74db2b52e81969742f8f7e5681bd2b-C001-82" ] }, "@EXT@": { "gold_contexts": [ [ "74db2b52e81969742f8f7e5681bd2b-C001-11", "74db2b52e81969742f8f7e5681bd2b-C001-12" ], [ "74db2b52e81969742f8f7e5681bd2b-C001-40" ] ], "cite_sentences": [ "74db2b52e81969742f8f7e5681bd2b-C001-11", "74db2b52e81969742f8f7e5681bd2b-C001-40" ] }, "@USE@": { "gold_contexts": [ [ "74db2b52e81969742f8f7e5681bd2b-C001-31", "74db2b52e81969742f8f7e5681bd2b-C001-32", "74db2b52e81969742f8f7e5681bd2b-C001-33" ], [ "74db2b52e81969742f8f7e5681bd2b-C001-38", "74db2b52e81969742f8f7e5681bd2b-C001-40" ] ], "cite_sentences": [ "74db2b52e81969742f8f7e5681bd2b-C001-33", "74db2b52e81969742f8f7e5681bd2b-C001-38", "74db2b52e81969742f8f7e5681bd2b-C001-40" ] } } }, "ABC_013f0e54384a8a4662a746eb4c30d9_26": { "x": [ { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-79", "text": "We use the open source FLAIR framework in all our experiments." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-103", "text": "Less pronounced impact on WNUT-17." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-2", "text": "Contextual string embeddings are a recent type of contextualized word embedding that were shown to yield state-of-the-art results when utilized in a range of sequence labeling tasks." 
}, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-3", "text": "They are based on character-level language models which treat text as distributions over characters and are capable of generating embeddings for any string of characters within any textual context." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-4", "text": "However, such purely character-based approaches struggle to produce meaningful embeddings if a rare string is used in a underspecified context." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-5", "text": "To address this drawback, we propose a method in which we dynamically aggregate contextualized embeddings of each unique string that we encounter." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-6", "text": "We then use a pooling operation to distill a global word representation from all contextualized instances." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-7", "text": "We evaluate these pooled contextualized embeddings on common named entity recognition (NER) tasks such as CoNLL-03 and WNUT and show that our approach significantly improves the state-of-the-art for NER." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-8", "text": "We make all code and pre-trained models available to the research community for use and reproduction." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-9", "text": "----------------------------------" }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-11", "text": "Word embeddings are a crucial component in many NLP approaches (Mikolov et al., 2013; Pennington et al., 2014) since they capture latent semantics of words and thus allow models to better train and generalize." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-12", "text": "Recent work has moved away from the original \"one word, one embedding\" paradigm to investigate contextualized embedding models (Peters et al., 2017 (Peters et al., , 2018 Akbik et al., 2018) ." 
}, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-13", "text": "Such approaches produce different embeddings for the same word depending on its context and are thus capable of capturing latent contextualized semantics of ambiguous words." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-14", "text": "Recently, Akbik et al. (2018) proposed a character-level contextualized embeddings ap- context." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-15", "text": "This leads to an underspecified contextual word embedding for the string \"Indra\" that ultimately causes a misclassification of \"Indra\" as an organization (ORG) instead of person (PER) in a downstream NER task." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-16", "text": "proach they refer to as contextual string embeddings." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-17", "text": "They leverage pre-trained character-level language models from which they extract hidden states at the beginning and end character positions of each word to produce embeddings for any string of characters in a sentential context." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-18", "text": "They showed these embeddings to yield state-of-the-art results when utilized in sequence labeling tasks such as named entity recognition (NER) or part-of-speech (PoS) tagging." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-19", "text": "Underspecified contexts." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-20", "text": "However, such contextualized character-level models suffer from an inherent weakness when encountering rare words in an underspecified context." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-21", "text": "Consider the example text segment shown in Figure 1 : \"Fung Permadi (Taiwan) v Indra\", from the English CONLL-03 test data split (Tjong Kim Sang and De Meulder, 2003) ." 
}, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-22", "text": "If we consider the word \"Indra\" to be rare (meaning no prior occurrence in the corpus used to generate word embeddings), the underspecified context allows this word to be interpreted as either a person or an organization." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-23", "text": "This leads to an underspecified embedding that ultimately causes an incorrect classification of \"Indra\" as an organization in a downstream NER task." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-24", "text": "Pooled Contextual Embeddings." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-25", "text": "In this paper, we present a simple but effective approach to address this issue." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-26", "text": "We intuit that entities are normally only used in underspecified contexts if they are expected to be known to the reader." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-27", "text": "That is, they are either more clearly introduced in an earlier sentence, or part of general in-domain knowl-figure2-crop.pdf edge a reader is expected to have." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-28", "text": "Indeed, the string \"Indra\" in the CONLL-03 data also occurs in the earlier sentence \"Indra Wijaya (Indonesia) beat Ong Ewe Hock\"." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-29", "text": "Based on this, we propose an approach in which we dynamically aggregate contextualized embeddings of each unique string that we encounter as we process a dataset." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-30", "text": "We then use a pooling operation to distill a global word representation from all contextualized instances that we use in combination with the current contextualized representation as new word embedding." 
}, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-31", "text": "We evaluate our proposed embedding approach on the task of named entity recognition on the CONLL-03 (English, German and Dutch) and WNUT datasets." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-32", "text": "In all cases, we find that our approach outperforms previous approaches and yields new state-of-the-art scores." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-33", "text": "We contribute our approach and all pre-trained models to the open source FLAIR 1 framework, to ensure reproducibility of these results." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-34", "text": "----------------------------------" }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-35", "text": "**METHOD**" }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-36", "text": "Our proposed approach dynamically builds up a \"memory\" of contextualized embeddings and applies a pooling operation to distill a global contextualized embedding for each word." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-37", "text": "It requires an embed() function that produces a contextualized embedding for a given word in a sentence context (see Akbik et al. (2018) )." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-38", "text": "It also requires a memory that records for each unique word all previous contextual embeddings, and a pool() operation to pool embedding vectors." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-39", "text": "This is illustrated in Algorithm 1: to embed a word (in a sentential context), we first call the embed() function (line 2) and add the resulting 1 https://github.com/zalandoresearch/flair embedding to the memory for this word (line 3)." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-40", "text": "We then call the pooling operation over all contextualized embeddings for this word in the memory (line 4) to compute the pooled contextualized embedding." 
}, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-41", "text": "Finally, we concatenate the original contextual embedding together with the pooled representation, to ensure that both local and global interpretations are represented (line 5)." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-42", "text": "This means that the resulting pooled contextualized embedding has twice the dimensionality of the original embedding." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-43", "text": "word.embedding \u2190 concat(emb pooled , emb context ) 6: end for Crucially, our approach expands the memory each time we embed a word." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-44", "text": "Therefore, the same word in the same context may have different embeddings over time as the memory is built up." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-45", "text": "Pooling operations." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-46", "text": "Per default, we use mean pooling to average a word's contextualized embedding vectors." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-47", "text": "We also experiment with min and max pooling to compute a vector consisting of all element-wise minimum or maximum values." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-48", "text": "Training downstream models." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-49", "text": "When training downstream task models (such as for NER), we typically make many passes over the training data." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-50", "text": "As Algorithm 2 shows, we reset the memory at the beginning of each pass over the training data (line Example of how we generate our proposed embedding (emb proposed ) for the word \"Indra\" in the example text segment \"Fung Permadi v Indra\"." 
}, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-51", "text": "We extract a contextual string embedding (embcontext) for this word and retrieve from the memory all embeddings that were produced for this string on previous sentences." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-52", "text": "We pool and concatenate all local contextualized embeddings to produce the final embedding. edge a reader is expected to have." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-53", "text": "Indeed, the string \"Indra\" in the CONLL-03 data also occurs in the earlier sentence \"Indra Wijaya (Indonesia) beat Ong Ewe Hock\"." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-54", "text": "Based on this, we propose an approach in which we dynamically aggregate contextualized embeddings of each unique string that we encounter as we process a dataset." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-55", "text": "We then use a pooling operation to distill a global word representation from all contextualized instances that we use in combination with the current contextualized representation as new word embedding." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-56", "text": "Our approach thus produces evolving word representations that change over time as more instances of the same word are observed in the data." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-57", "text": "We evaluate our proposed embedding approach on the task of named entity recognition on the CONLL-03 (English, German and Dutch) and WNUT datasets." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-58", "text": "In all cases, we find that our approach outperforms previous approaches and yields new state-of-the-art scores." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-59", "text": "We contribute our approach and all pre-trained models to the open source FLAIR 1 framework (Akbik et al., 2019) , to ensure reproducibility of these results." 
}, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-60", "text": "Our proposed approach (see Figure 2 ) dynamically builds up a \"memory\" of contextualized embeddings and applies a pooling operation to distill a global contextualized embedding for each word." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-61", "text": "It requires an embed() function that produces a contextualized embedding for a given word in a 1 https://github.com/zalandoresearch/flair sentence context (see Akbik et al. (2018) )." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-62", "text": "It also requires a memory that records for each unique word all previous contextual embeddings, and a pool() operation to pool embedding vectors." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-63", "text": "This is illustrated in Algorithm 1: to embed a word (in a sentential context), we first call the embed() function (line 2) and add the resulting embedding to the memory for this word (line 3)." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-64", "text": "We then call the pooling operation over all contextualized embeddings for this word in the memory (line 4) to compute the pooled contextualized embedding." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-65", "text": "Finally, we concatenate the original contextual embedding together with the pooled representation, to ensure that both local and global interpretations are represented (line 5)." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-66", "text": "This means that the resulting pooled contextualized embedding has twice the dimensionality of the original embedding." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-67", "text": "element-wise minimum or maximum values." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-68", "text": "Training downstream models." 
}, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-69", "text": "When training downstream task models (such as for NER), we typically make many passes over the training data." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-70", "text": "As Algorithm 2 shows, we reset the memory at the beginning of each pass over the training data (line 2), so that it is build up from scratch at each epoch." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-71", "text": "This approach ensures that the downstream task model learns to leverage pooled embeddings that are built up (e.g. evolve) over time." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-72", "text": "It also ensures that pooled embeddings during training are only computed over training data." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-73", "text": "After training, (i.e. during NER prediction), we do not reset embeddings and instead allow our approach to keep expanding the memory and evolve the embeddings." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-74", "text": "----------------------------------" }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-75", "text": "**EXPERIMENTS**" }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-76", "text": "We verify our proposed approach in four named entity recognition (NER) tasks: We use the English, German and Dutch evaluation setups of the CONLL-03 shared task (Tjong Kim Sang and De Meulder, 2003) to evaluate our approach on classic newswire data, and the WNUT-17 task on emerging entity detection (Derczynski et al., 2017) to evaluate our approach in a noisy user-generated data setting with few repeated entity mentions." 
}, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-77", "text": "----------------------------------" }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-78", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-80", "text": "It implements the standard BiLSTM-CRF sequence labeling architecture (Huang et al., 2015) and includes pre-trained contextual string embeddings for many languages." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-81", "text": "To FLAIR, we add an implementation of our proposed pooled contextualized embeddings." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-82", "text": "Hyperparameters." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-83", "text": "For our experiments, we follow the training and evaluation procedure outlined in Akbik et al. (2018) and follow most hyperparameter suggestions as given by the in-depth study presented in Reimers and Gurevych (2017) ." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-84", "text": "That is, we use an LSTM with 256 hidden states and one layer (Hochreiter and Schmidhuber, 1997) , a locked dropout value of 0.5, a word dropout of 0.05, and train using SGD with an annealing rate of 0.5 and a patience of 3." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-85", "text": "We perform model selection over the learning rate \u2208 {0.01, 0.05, 0.1} and mini-batch size \u2208 {8, 16, 32}, choosing the model with the best F-measure on the validation set." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-86", "text": "Following Peters et al. (2017) , we then repeat the experiment 5 times with different random seeds, and train using both train and development set, reporting both average performance and standard deviation over these runs on the test set as final performance." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-87", "text": "Standard word embeddings." 
}, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-88", "text": "The default setup of Akbik et al. (2018) recommends contextual string embeddings to be used in combination with standard word embeddings." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-89", "text": "We use GLOVE embeddings (Pennington et al., 2014) for the English tasks and FASTTEXT embeddings (Bojanowski et al., 2017) for all newswire tasks." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-90", "text": "Baselines." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-91", "text": "Our baseline are contextual string embeddings without pooling, i.e. the original setup proposed in Akbik et al. (2018) 2 ." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-92", "text": "By comparing against this baseline, we isolate the impact of our proposed pooled contextualized embeddings." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-93", "text": "Table 2 : Ablation experiment using contextual string embeddings without word embeddings." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-94", "text": "We find a more significant impact on evaluation numbers across all datasets, illustrating the need for capturing global next to contextualized semantics." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-95", "text": "In addition, we list the best reported numbers for the four tasks." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-96", "text": "This includes the recent BERT approach using bidirectional transformers by Devlin et al. (2018) , the semi-supervised multitask learning approach by , the ELMo word-level language modeling approach by Peters et al. (2018) , and the best published numbers for WNUT-17 (Aguilar et al., 2018) and German and Dutch CONLL-03 (Lample et al., 2016) ." 
}, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-97", "text": "----------------------------------" }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-98", "text": "**RESULTS**" }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-99", "text": "Our experimental results are summarized in Table 1 for each of the four tasks." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-100", "text": "New state-of-the-art scores." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-101", "text": "We find that our approach outperforms all previously published results, raising the state-of-the-art for CONLL-03 on English to 93.18 F1-score (\u21910.32 pp vs. previous best), German to 88.27 (\u21910.86 pp) and Dutch to 90.44 (\u21910.28 pp)." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-102", "text": "The consistent improvements against the contextual string embeddings baseline indicate that our approach is generally a viable option for embedding entities in sequence labeling." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-104", "text": "However, we also find no significant improvements on the WNUT-17 task on emerging entities." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-105", "text": "Depending on the pooling operation, we find comparable results to the baseline." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-106", "text": "This result is expected since most entities appear only few times in this dataset, giving our approach little evidence to aggregate and pool." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-107", "text": "Nevertheless, since recent work" }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-108", "text": "has not yet experimented with contextual embeddings on WNUT, as side result we report a new state-of-the-art of 49.59 F1 vs. the previous best reported number of 45.55 (Aguilar et al., 2018) ." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-109", "text": "Pooling operations." 
}, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-110", "text": "Comparing the pooling operations discussed in Section 2, we generally find similar results." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-111", "text": "As Table 1 shows, min pooling performs best for English and German CoNLL, while mean pooling is best for Dutch and WNUT." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-112", "text": "----------------------------------" }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-113", "text": "**ABLATION: CHARACTER EMBEDDINGS ONLY**" }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-114", "text": "To better isolate the impact of our proposed approach, we run experiments in which we do not use any classic word embeddings, but rather rely solely on contextual string embeddings." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-115", "text": "As Table 2 shows, we observe more pronounced improvements of pooling vis-a-vis the baseline approach in this setup." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-116", "text": "This indicates that pooled contextualized embeddings capture global semantics words similar in nature to classical word embeddings." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-117", "text": "----------------------------------" }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-118", "text": "**DISCUSSION AND CONCLUSION**" }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-119", "text": "We presented a simple but effective approach that addresses the problem of embedding rare strings in underspecified contexts." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-120", "text": "Our experimental evaluation shows that this approach improves the stateof-the-art across named entity recognition tasks, enabling us to report new state-of-the-art scores for CONLL-03 NER and WNUT emerging entity detection." 
}, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-121", "text": "These results indicate that our embedding approach is well suited for NER." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-122", "text": "Evolving embeddings." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-123", "text": "Our dynamic aggregation approach means that embeddings for the same words will change over time, even when used in exactly the same contexts." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-124", "text": "Assuming that entity names are more often used in well-specified contexts, their pooled embeddings will improve as more data is processed." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-125", "text": "The embedding model thus continues to \"learn\" from data even after the training of the downstream NER model is complete and it is used in prediction mode." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-126", "text": "We consider this idea of constantly evolving representations a very promising research direction." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-127", "text": "Future work." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-128", "text": "Our pooling operation makes the conceptual simplification that all previous instances of a word are equally important." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-129", "text": "However, we may find more recent mentions of a word -such as words within the same document or news cycle -to be more important for creating embeddings than mentions that belong to other documents or news cycles." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-130", "text": "Future work will therefore examine methods to learn weighted poolings of previous mentions." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-131", "text": "We will also investigate applicability of our proposed embeddings to tasks beside NER." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-132", "text": "Public release." 
}, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-133", "text": "We contribute our code to the FLAIR framework 3 ." }, { "sent_id": "013f0e54384a8a4662a746eb4c30d9-C001-134", "text": "This allows full reproduction of all experiments presented in this paper, and al-lows the research community to use our embeddings for training downstream task models." } ], "y": { "@BACK@": { "gold_contexts": [ [ "013f0e54384a8a4662a746eb4c30d9-C001-12" ], [ "013f0e54384a8a4662a746eb4c30d9-C001-14" ], [ "013f0e54384a8a4662a746eb4c30d9-C001-88" ] ], "cite_sentences": [ "013f0e54384a8a4662a746eb4c30d9-C001-12", "013f0e54384a8a4662a746eb4c30d9-C001-14", "013f0e54384a8a4662a746eb4c30d9-C001-88" ] }, "@MOT@": { "gold_contexts": [ [ "013f0e54384a8a4662a746eb4c30d9-C001-14", "013f0e54384a8a4662a746eb4c30d9-C001-15" ] ], "cite_sentences": [ "013f0e54384a8a4662a746eb4c30d9-C001-14" ] }, "@USE@": { "gold_contexts": [ [ "013f0e54384a8a4662a746eb4c30d9-C001-36", "013f0e54384a8a4662a746eb4c30d9-C001-37" ], [ "013f0e54384a8a4662a746eb4c30d9-C001-60", "013f0e54384a8a4662a746eb4c30d9-C001-61" ], [ "013f0e54384a8a4662a746eb4c30d9-C001-83" ], [ "013f0e54384a8a4662a746eb4c30d9-C001-88", "013f0e54384a8a4662a746eb4c30d9-C001-89" ], [ "013f0e54384a8a4662a746eb4c30d9-C001-91" ] ], "cite_sentences": [ "013f0e54384a8a4662a746eb4c30d9-C001-37", "013f0e54384a8a4662a746eb4c30d9-C001-61", "013f0e54384a8a4662a746eb4c30d9-C001-83", "013f0e54384a8a4662a746eb4c30d9-C001-88", "013f0e54384a8a4662a746eb4c30d9-C001-91" ] }, "@SIM@": { "gold_contexts": [ [ "013f0e54384a8a4662a746eb4c30d9-C001-88", "013f0e54384a8a4662a746eb4c30d9-C001-89" ] ], "cite_sentences": [ "013f0e54384a8a4662a746eb4c30d9-C001-88" ] } } }, "ABC_d52dfb30158deae64a1c3d787d9b95_26": { "x": [ { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-2", "text": "Abstract." 
}, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-3", "text": "There have been multiple attempts to resolve various inflection matching problems in information retrieval." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-4", "text": "Stemming is a common approach to this end." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-5", "text": "Among many techniques for stemming, statistical stemming has been shown to be effective in a number of languages, particularly highly inflected languages." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-6", "text": "In this paper we propose a method for finding affixes in different positions of a word." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-7", "text": "Common statistical techniques heavily rely on string similarity in terms of prefix and suffix matching." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-8", "text": "Since infixes are common in irregular/informal inflections in morphologically complex texts, it is required to find infixes for stemming." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-9", "text": "In this paper we propose a method whose aim is to find statistical inflectional rules based on minimum edit distance table of word pairs and the likelihoods of the rules in a language." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-10", "text": "These rules are used to statistically stem words and can be used in different text mining tasks." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-11", "text": "Experimental results on CLEF 2008 and CLEF 2009 English-Persian CLIR tasks indicate that the proposed method significantly outperforms all the baselines in terms of MAP." 
}, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-12", "text": "----------------------------------" }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-13", "text": "**INTRODUCTION**" }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-14", "text": "Uniforming different inflections of words is a required task in a wide range of text mining algorithms, including, but not limited to text classification, text clustering, document retrieval, and language modeling [1] ." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-15", "text": "Stemming has been considered as a common approach for this goal in several studies [9, 1] ." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-16", "text": "Stemmers usually remove affixes from the words to present them in the form of their morphological roots." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-17", "text": "Conventional rule-based stemmers tailor the linguistic knowledge of experts." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-18", "text": "On the other hand, statistical stemmers provide languageindependent approaches which generally group related words based on various string-similarity measures." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-19", "text": "Such approaches often involve n-grams; equivalence classes can be formed from words that share the same properties: word-initial letter n-grams, common n-grams throughout the word, or by refining these classes with clustering techniques." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-20", "text": "This kind of statistical stemming has been shown to be effective for many languages, including English, Turkish, and Malay." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-21", "text": "For example, Bhat introduced a method for Kannada where the similarity of two words is determined by three distance measures based on prefix and suffix matching and the first mismatch point in the words [2] ." 
}, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-22", "text": "Defining precise rules for morphologically complex texts, especially for the purpose of infix removal is sometimes impossible [5] ." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-23", "text": "Informal/irregular forms usually do not obey the conventional rules in the languages." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-24", "text": "For instance, 'khunh' (home) is a frequent form for 'khanh' in Persian conversations or 'goood ' and 'good ' are used interchangeably in English tweets." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-25", "text": "In this paper, we propose a statistical technique for finding inflectional and derivation formations of words." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-26", "text": "To this end, we introduce an unsupervised method to cluster all morphological variants of a word." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-27", "text": "The proposed algorithm learns linguistic patterns to match a word and its morphological variants based on a given large collection of documents, which is readily available on the Web." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-28", "text": "A linguistic pattern captures a transformation rule between a word and its morphological variant." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-29", "text": "The extracted rules indicate which letters in which positions of a word should be modified." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-30", "text": "Affix characters, positions of the characters, operations on the characters based on the minimum edit distance (MED) algorithm (i.e., insertion or deletion) [10] , and part-of-speech (POS) tag of the input word are the attributes of a rule." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-31", "text": "Our algorithm assigns a score to each rule, indicating its confidence." 
}, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-32", "text": "The higher the frequency of a rule in the input collection, the higher the confidence value of that rule." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-33", "text": "Finally, a small subset of the obtained rules are selected based on their scores and a learned threshold as valid rules." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-34", "text": "We demonstrate that using this subset for query expansion can significantly improve English-Persian CLIR performance compared to comprehensive baselines." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-35", "text": "In Section 2 we elaborate on the subject, in Section 3 we assess its quality in an IR task, and in Section 4 we conclude the paper." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-36", "text": "----------------------------------" }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-37", "text": "**SS4MCT: A STATISTICAL STEMMER FOR MORPHOLOGICALLY COMPLEX TEXTS**" }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-38", "text": "In this section we propose an unsupervised method for finding inflections in a language." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-39", "text": "To this end, we first introduce a transformation rule which is an edit path transforming the word w into w based on a number of actions." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-40", "text": "Our goal is to estimate the probability of each transformation rule P (R) and compute the likelihood of generating inflections for a given term (i.e., p(w |w))." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-41", "text": "In Section 2.1 and Section 2.2 we introduce the proposed method in more details and in Section 2.3 we propose an evaluation framework for the method." 
}, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-42", "text": "----------------------------------" }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-43", "text": "**TRANSFORMATION RULES**" }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-44", "text": "Each transformation rule contains a number of actions transforming a word into an inflection." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-45", "text": "If two terms are k points distant from each other, the rule that transforms the input term to the output term contains k actions and the maximum likelihood POS tag of the input word." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-46", "text": "Each action consists of the following attributes: c, the character in difference, p, the position of that character (begin," }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-47", "text": "middle, and end ), and o, the corresponding operation on the character in the MED algorithm (deletion or insertion)." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-48", "text": "Intuitively we define a few general positions for affixes to prevent sparsity of the rules." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-49", "text": "A substitution operation can be replaced by a couple of insertion/deletion operations; therefore we ignore the substitution operation." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-50", "text": "Table 1 shows a number of examples for the rules." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-51", "text": "----------------------------------" }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-52", "text": "**PROBABILITY OF THE RULES**" }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-53", "text": "To compute the probability of generating an inflection for a given term (i.e., p(w |w)) we can compute the transformation rule (R) between w and w and estimate p(w |w) by p(R)." 
}, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-54", "text": "To compute the probability of each rule we use a large collection of words extracted from a document collection." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-55", "text": "For each pair of words in the collection (w and w ), we compute the rule for transforming w into w and count the number of times this rule has happened in the collection." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-56", "text": "The higher the occurrences of a rule, the more likely it is to be a valid one." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-57", "text": "Finally we estimate the rules probability with maximum likelihood estimator." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-58", "text": "----------------------------------" }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-59", "text": "**HOW TO EVALUATE THE ALGORITHM**" }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-60", "text": "In this section we provide a framework for the stemming algorithm to evaluate its effectiveness." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-61", "text": "We use dictionary-based cross-lingual information retrieval (CLIR) to this end." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-62", "text": "In highly inflected languages, bilingual dictionaries contain only original forms of the words." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-63", "text": "Therefore, in dictionary-based CLIR, retrieval systems are obliged either to stem documents and queries, or to leave them intact [8, 4, 12] , or expand the query with inflections." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-64", "text": "We opted the query expansion approach which is a widely used approach to compensate the shortage of inflections [9, 3, 5] ." 
}, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-65", "text": "We used the following probabilistic framework to this end [5] :" }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-66", "text": "where q i is a query term and c i is the set of translation candidates provided in a bilingual dictionary for q i ." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-67", "text": "c i is the set of the most probable inflections of the words appeared in c i selected by a tuned threshold." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-68", "text": "Then, we compute the translation probability of c i,j or c i,j for the given q i ." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-69", "text": "To avoid adding noisy terms, we only compute the joint probabilities between either a pair of translation candidates from the dictionary (c i,j and c i ,j ) or a pair of a candidate from the dictionary and an inflection from the collection (c i,j and c i ,j ) [5] ." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-70", "text": "Our goal is to findc i using the proposed SS4MCT (i.e. set of top-ranked c i ,j according to p t (c i ,j |c i,j )) and then evaluate its impact on the performance of the CLIR task." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-71", "text": "Figure 1 shows the whole process of extracting rules (off-line part) and the evaluation framework (on-line part)." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-72", "text": "----------------------------------" }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-73", "text": "**EXPERIMENTS**" }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-74", "text": "----------------------------------" }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-75", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-76", "text": "The statistics of the collection used for both rule extraction and evaluation is provided in Table 2 ." 
}, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-77", "text": "We employed the statistical language modeling framework with Kullback-Leibler similarity measure of Lemur toolkit for our retrieval task." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-78", "text": "Dirichlet Prior is selected as our document smoothing strategy." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-79", "text": "Top 30 documents are used for the mixture pseudo-relevance feedback algorithm." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-80", "text": "Queries are expanded by the top 50 terms generated by the feedback model [14, 6] ." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-81", "text": "We removed Persian stop words from the queries and documents [4, 5] ." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-82", "text": "We used STeP1 [13] in our stemming process in Persian." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-83", "text": "We also stem the source English queries in all experiments with the Porter stemmer." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-84", "text": "We use Google EnglishPersian dictionary 1 as the translation resource." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-85", "text": "Dadashkarimi et al., demonstrated that Google has better coverage compared to other English-Persian dictionaries [5] ." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-86", "text": "We have exploited 40 Persian POS tags in our experiments." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-87", "text": "2 The retrieval results are mainly evaluated by Mean Average Precision (MAP) over top 1000 retrieved documents." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-88", "text": "Significance tests are computed using two-tailed paired t-test with 95% confidence." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-89", "text": "Precision at top 5 documents (P@5) and top 10 documents (P@10) are also reported." 
}, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-90", "text": "----------------------------------" }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-91", "text": "**COMPARING DIFFERENT MORPHOLOGICAL PROCESSING METHODS**" }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-92", "text": "In this section we aim at evaluating the proposed SS4MCT method." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-93", "text": "To this end we compare the proposed SS4MCT with a number of dictionary-based CLIR methods; the 5-gram truncation method (SPLIT) proposed in [11] , rule-based query expansion (RBQE) based on inflectional/derivation rules from Farazzin machine translator 3 , and the STeP1 stemmer [13] are the morphological processing approaches for the retrieval system." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-94", "text": "On the other hand, we run another set of experiments without applying any morphological processing method similar to the Persian state-of-the-art CLIR methods." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-95", "text": "Iterative translation disambiguation (ITD) [11] , joint cross-lingual topical relevance model (JCLTRLM) [7] , top-ranked translation (TOP-1), and the bi-gram coherence translation method (BiCTM), introduced in [5] (assume |c i | = 0), are the baselines without any morphological processing units." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-96", "text": "As shown in Table 3 BiCTM outperforms all the baselines when there is no morphological processing unit." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-97", "text": "Although the improvement compared to JCLTRLM is not statistically significant, for simplicity we assume this model as a base of comparisons in the next set of experiments." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-98", "text": "In other words, we study the effect of the morphological processing units on the performance of BiCTM." 
}, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-99", "text": "As shown in Table 3 the performance of the CLIR task degraded when we use the SPLIT approach." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-100", "text": "It is due to expanding the query with irrelevant tokens (e.g., normal/abnormal )." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-101", "text": "RBQE suffers from a similar problem to some extent; for example jat is a valid suffix for sabzi (=vegetable) in Persian whereas it is an invalid suffix for ketab (=book)." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-102", "text": "The results demonstrate that SS4MCT outperforms all the baselines in terms of MAP." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-103", "text": "This is due to a couple of reasons; first the ability of SS4MCT at finding infixes along with other affixes, particularly in irregular inflections and second its ability at deriving the likelihood/relevance of the rules in the collection/query." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-104", "text": "----------------------------------" }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-105", "text": "**CONCLUSION AND FUTURE WORKS**" }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-106", "text": "In this paper we proposed a new method for statistical stemming in morphologically complex texts." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-107", "text": "SS4MCT extracts a number of morphological rules based on edit distances of a large number of word pairs from a collection." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-108", "text": "Evaluating SS4MCT on a dictionary-based English-Persian CLIR task demonstrates its effectiveness." }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-109", "text": "Considering adjacency of the characters in the rules and evaluating the method on informal text mining remained as future works." 
}, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-110", "text": "----------------------------------" }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-111", "text": "**ACKNOWLEDGEMENT**" }, { "sent_id": "d52dfb30158deae64a1c3d787d9b95-C001-112", "text": "The author would like to thank Razieh Rahimi and the anonymous reviewers for their helpful comments and feedback." } ], "y": { "@MOT@": { "gold_contexts": [ [ "d52dfb30158deae64a1c3d787d9b95-C001-22" ] ], "cite_sentences": [ "d52dfb30158deae64a1c3d787d9b95-C001-22" ] }, "@USE@": { "gold_contexts": [ [ "d52dfb30158deae64a1c3d787d9b95-C001-64" ], [ "d52dfb30158deae64a1c3d787d9b95-C001-65" ], [ "d52dfb30158deae64a1c3d787d9b95-C001-69" ], [ "d52dfb30158deae64a1c3d787d9b95-C001-81" ], [ "d52dfb30158deae64a1c3d787d9b95-C001-95" ] ], "cite_sentences": [ "d52dfb30158deae64a1c3d787d9b95-C001-64", "d52dfb30158deae64a1c3d787d9b95-C001-65", "d52dfb30158deae64a1c3d787d9b95-C001-69", "d52dfb30158deae64a1c3d787d9b95-C001-81", "d52dfb30158deae64a1c3d787d9b95-C001-95" ] }, "@BACK@": { "gold_contexts": [ [ "d52dfb30158deae64a1c3d787d9b95-C001-85" ] ], "cite_sentences": [ "d52dfb30158deae64a1c3d787d9b95-C001-85" ] } } }, "ABC_eebf1edb6dbd3e58a904eff309f548_26": { "x": [ { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-2", "text": "We contrast two seemingly distinct approaches to the task of question answering (QA) using Freebase: one based on information extraction techniques, the other on semantic parsing." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-3", "text": "Results over the same test-set were collected from two state-ofthe-art, open-source systems, then analyzed in consultation with those systems' creators." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-4", "text": "We conclude that the differences between these technologies, both in task performance, and in how they get there, is not significant." 
}, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-5", "text": "This suggests that the semantic parsing community should target answering more compositional open-domain questions that are beyond the reach of more direct information extraction methods." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-6", "text": "----------------------------------" }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-8", "text": "Question Answering (QA) from structured data, such as DBPedia (Auer et al., 2007) , Freebase (Bollacker et al., 2008) and Yago2 (Hoffart et al., 2011) , has drawn significant interest from both knowledge base (KB) and semantic parsing (SP) researchers." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-9", "text": "The majority of such work treats the KB as a database, to which standard database queries (SPARQL, MySQL, etc.) are issued to retrieve answers." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-10", "text": "Language understanding is modeled as the task of converting natural language questions into queries through intermediate logical forms, with the popular two approaches including: CCG parsing (Zettlemoyer and Collins, 2005; Zettlemoyer and Collins, 2007; Zettlemoyer and Collins, 2009; Kwiatkowski et al., 2010; Kwiatkowski et al., 2011; Krishnamurthy and Mitchell, 2012; Kwiatkowski et al., 2013; Cai and Yates, 2013a) , and dependencybased compositional semantics (Liang et al., 2011; Berant et al., 2013; Berant and Liang, 2014) ." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-11", "text": "We characterize semantic parsing as the task of deriving a representation of meaning from language, sufficient for a given task." 
}, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-12", "text": "Traditional information extraction (IE) from text may be coarsely characterized as representing a certain level of semantic parsing, where the goal is to derive enough meaning in order to populate a database with factoids of a form matching a given schema." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-13", "text": "1 Given the ease with which reasonably accurate, deep syntactic structure can be automatically derived over (English) text, it is not surprising that IE researchers would start including such \"features\" in their models." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-14", "text": "Our question is then: what is the difference between an IE system with access to syntax, as compared to a semantic parser, when both are targeting a factoid-extraction style task?" }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-15", "text": "While our conclusions should hold generally for similar KBs, we will focus on Freebase, such as explored by Krishnamurthy and Mitchell (2012) , and then others such as Cai and Yates (2013a) and Berant et al. (2013) ." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-16", "text": "We compare two open-source, state-ofthe-art systems on the task of Freebase QA: the semantic parsing system SEMPRE (Berant et al., 2013) , and the IE system jacana-freebase (Yao and Van Durme, 2014) ." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-17", "text": "We find that these two systems are on par with each other, with no significant differences in terms of accuracy between them." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-18", "text": "A major distinction between the work of Berant et al. (2013) and Yao and Van Durme (2014) is the ability of the former to represent, and compose, aggregation operators (such as argmax, or count), as well as integrate disparate pieces of information." 
}, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-19", "text": "This representational capability was important in previous, closed-domain tasks such as GeoQuery." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-20", "text": "The move to Freebase by the SP community was meant to" }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-21", "text": "provide richer, open-domain challenges." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-22", "text": "While the vocabulary increased, our analysis suggests that compositionality and complexity decreased." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-23", "text": "We therefore conclude that the semantic parsing community should target more challenging opendomain datasets, ones that \"standard IE\" methods are less capable of attacking." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-24", "text": "----------------------------------" }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-25", "text": "**IE AND SP SYSTEMS**" }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-26", "text": "jacana-freebase 2 (Yao and Van Durme, 2014) treats QA from a KB as a binary classification problem." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-27", "text": "Freebase is a gigantic graph with millions of nodes (topics) and billions of edges (relations)." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-28", "text": "For each question, jacana-freebase first selects a \"view\" of Freebase concerning only involved topics and their close neighbors (this \"view\" is called a topic graph)." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-29", "text": "For instance, for the question \"who is the brother of justin bieber?\", the topic graph of Justin Bieber, containing all related nodes to the topic (think of the \"Justin Bieber\" page displayed by the browser), is selected and retrieved by the Freebase Topic API." 
}, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-30", "text": "Usually such a topic graph contains hundreds to thousands of nodes in close relation to the central topic." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-31", "text": "Then each of the node is judged as answer or not by a logistic regression learner." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-32", "text": "Features for the logistic regression learner are first extracted from both the question and the topic graph." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-33", "text": "An analysis of the dependency parse of the question characterizes the question word, topic, verb, and named entities of the main subject as the question features, such as qword=who." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-34", "text": "Features on each node include the types of relations and properties the node possesses, such as type=person." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-35", "text": "Finally features from both the question and each node are combined as the final features used by the learner, such as qword=who|type=person." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-36", "text": "In this way the association between the question and answer type is enforced." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-37", "text": "Thus during decoding, for instance, if there is a who question, the nodes with a person property would be ranked higher as the answer candidate." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-38", "text": "SEMPRE 3 is an open-source system for training semantic parsers, that has been utilized to train a semantic parser against Freebase by Berant et al. (2013) ." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-39", "text": "SEMPRE maps NL utterances to logical forms by performing bottom-up parsing." 
}, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-40", "text": "First, a lexicon is used to map NL phrases to KB predicates, and then predicates are combined to form a full logical form by a context-free grammar." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-41", "text": "Since logical forms can be derived in multiple ways from the grammar, a log-linear model is used to rank possible derivations." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-42", "text": "The parameters of the model are trained from question-answer pairs." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-43", "text": "----------------------------------" }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-44", "text": "**ANALYSIS**" }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-45", "text": "----------------------------------" }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-46", "text": "**EVALUATION METRICS**" }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-47", "text": "Both Berant et al. (2013) and Yao and Van Durme (2014) tested their systems on the WEBQUESTIONS dataset, which contains 3778 training questions and 2032 test questions collected from the Google Suggest API." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-48", "text": "Each question came with a standard answer from Freebase annotated by Amazon Mechanical Turk." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-49", "text": "Berant et al. (2013) reported a score of 31.4% in terms of accuracy (with partial credit if inexact match) on the test set and later in Berant and Liang (2014) revised it to 35.7%." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-50", "text": "Berant et al. focused on accuracy -how many questions were correctly answered by the system." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-51", "text": "Since their system answered almost all questions, accuracy is roughly identical to F 1 ." 
}, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-52", "text": "Yao and Van Durme (2014)'s system on the other hand only answered 80% of all test questions." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-53", "text": "Thus they report a score of 42% in terms of F 1 on this dataset." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-54", "text": "For the purpose of comparing among all test questions, we lowered the logistic regression prediction threshold (usually 0.5) on jacana-freebase for the other 20% of questions where jacana-freebase had not proposed an answer to, and selected the best-possible prediction with the highest prediction score as the answer." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-55", "text": "In this way jacana-freebase was able to answer all questions with a lower accuracy of 35.4%." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-56", "text": "In the following we present analysis results based on the test questions where the two systems had very similar performance (35.7% vs. 35.4%)." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-57", "text": "4 The difference is not significant according to the paired permutation test (Smucker et al., 2007) ." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-58", "text": "----------------------------------" }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-59", "text": "**ACCURACY VS. COVERAGE**" }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-60", "text": "First, we were interested to see the proportions of questions SEMPRE and jacana-freebase jointly and separately answered correctly." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-61", "text": "The answer to Since turkers did not exhaustively pick out all possible answers, evaluation is performed by computing the F 1 between the set of answers given by the system and the answers provided by turkers." 
}, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-62", "text": "With a strict threshold of F 1 = 1 and a permissive threshold of F 1 \u2265 0.5 to judge the correctness, we list the pair-wise correctness matrix in Table 1 . Not surprisingly, both systems had most questions wrong given that the averaged F 1 's were only around 35%." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-63", "text": "With the threshold F 1 = 1, SEMPRE answered more questions exactly correctly compared to jacana-freebase, while when F 1 \u2265 0.5, it was the other way around." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-64", "text": "This shows that SEMPRE is more accurate in certain questions." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-65", "text": "The reason behind this is that SEMPRE always fires queries that return exactly one set of answers from Freebase, while jacana-freebase could potentially tag multiple nodes as the answer, which may lower the accuracy." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-66", "text": "We have shown that both systems can be more accurate in certain questions, but when? Is there a correlation between the system confidence and accuracy?" }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-67", "text": "Thus we took the logistic decoding score (between 0 and 1) from jacana-freebase and the probability from the log-linear model used by SEMPRE as confidence, and plotted an \"accuracy vs. coverage\" curve, which shows the accuracy of a QA engine with respect to its coverage of all questions." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-68", "text": "The curve basically answers one question: at a fixed accuracy, what is the proportion of questions that can be answered?" }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-69", "text": "A better system should be able to answer more questions correctly with the same accuracy." 
}, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-70", "text": "The curve was drawn in the following way." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-71", "text": "For each question, we select the best answer candidate with the highest confidence score." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-72", "text": "Then for the whole test set, we have a list of (question, highest ranked answer, confidence score) tuples." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-73", "text": "Running The two curves generally follow a similar trend, but while jacana-freebase has higher accuracy when coverage is low, SEMPRE obtains slightly better accuracy when more questions are answered." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-74", "text": "----------------------------------" }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-75", "text": "**ACCURACY BY QUESTION LENGTH AND TYPE**" }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-76", "text": "Do accuracies of the two systems differ with respect to the complexity of questions?" }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-77", "text": "Since there is no clear way to measure question complexity, we use question length as a surrogate and report accuracies by question length in Figure 2 ." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-78", "text": "Most of the questions were 5 to 8 words long and there was no substantial difference in terms of accuracies." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-79", "text": "The major difference lies in questions of length 3, 12 and 13." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-80", "text": "However, the number of such questions was not high enough to show any statistical significance." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-81", "text": "Figure 3 further shows the accuracies with respect to the question types (as reflected by the WH-word)." 
}, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-82", "text": "Again, there is no significant difference between the two systems." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-83", "text": "----------------------------------" }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-84", "text": "**LEARNED FEATURES**" }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-85", "text": "What did the systems learn during training?" }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-86", "text": "We compare them by presenting the top features by weight, as listed in Table 2 ." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-87", "text": "Clearly, the type of knowledge learned by the systems in these features is similar: both systems learn to associate certain phrases with predicates from the KB." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-88", "text": "(929) where (357) who (261) which (35) when (100) how ( We note, however, that SEMPRE also obtains information from the fully constructed logical form." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-89", "text": "For instance, SEMPRE learns that logical forms that return an empty set when executed against the KB are usually incorrect (the weight for this feature is -8.88)." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-90", "text": "In this respect the SP approach \"understands\" more than the IE approach." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-91", "text": "We did not further compare on other datasets such as GeoQuery (Tang and Mooney, 2001 ) and FREE917 (Cai and Yates, 2013b) ." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-92", "text": "The first one involves geographic inference and multiple contraints in queries, directly fitting the compositional nature of semantic parsing." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-93", "text": "The second one was manually generated by looking at Freebase topics." 
}, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-94", "text": "Both datasets were less realistic than the WEBQUESTIONS dataset." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-95", "text": "Both datasets were also less challenging (accuracy/F 1 were between 80% and 90%) compared to WEBQUESTIONS (around 40%)." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-96", "text": "similarity between features used in both systems shown in Table 2 : the systems learned the same \"knowledge\" from data, with the distinction that the IE approach acquired this through a direct association between dependency parses and answer properties, while the SP approach acquired this through optimizing on intermediate logic forms." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-97", "text": "----------------------------------" }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-98", "text": "**DISCUSSION AND CONCLUSION**" }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-99", "text": "With a direct information extraction technology easily getting on par with the more sophisticated semantic parsing method, it suggests that SP-based approaches for QA with Freebase has not yet shown its power from a \"deeper\" understanding of the questions, among questions of various lengths." }, { "sent_id": "eebf1edb6dbd3e58a904eff309f548-C001-100", "text": "We suggest that more compositional open-domain datasets should be created, and that SP researchers should focus on utterances in existing datasets that are beyond the reach of direct IE methods." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "eebf1edb6dbd3e58a904eff309f548-C001-10" ], [ "eebf1edb6dbd3e58a904eff309f548-C001-18" ], [ "eebf1edb6dbd3e58a904eff309f548-C001-38" ], [ "eebf1edb6dbd3e58a904eff309f548-C001-47" ] ], "cite_sentences": [ "eebf1edb6dbd3e58a904eff309f548-C001-10", "eebf1edb6dbd3e58a904eff309f548-C001-18", "eebf1edb6dbd3e58a904eff309f548-C001-38", "eebf1edb6dbd3e58a904eff309f548-C001-47" ] }, "@USE@": { "gold_contexts": [ [ "eebf1edb6dbd3e58a904eff309f548-C001-15" ], [ "eebf1edb6dbd3e58a904eff309f548-C001-16" ] ], "cite_sentences": [ "eebf1edb6dbd3e58a904eff309f548-C001-15", "eebf1edb6dbd3e58a904eff309f548-C001-16" ] } } }, "ABC_499580e888a1598681a8d877b07866_26": { "x": [ { "sent_id": "499580e888a1598681a8d877b07866-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "499580e888a1598681a8d877b07866-C001-2", "text": "This paper describes the recurrent neural network based model for translation quality estimation." }, { "sent_id": "499580e888a1598681a8d877b07866-C001-3", "text": "Recurrent neural network based quality estimation model consists of two parts." }, { "sent_id": "499580e888a1598681a8d877b07866-C001-4", "text": "The first part using two bidirectional recurrent neural networks generates the quality information about whether each word in translation is properly translated." }, { "sent_id": "499580e888a1598681a8d877b07866-C001-5", "text": "The second part using another recurrent neural network predicts the final quality of translation." }, { "sent_id": "499580e888a1598681a8d877b07866-C001-6", "text": "We apply this model to sentence, word and phrase level of WMT16 Quality Estimation Shared Task." }, { "sent_id": "499580e888a1598681a8d877b07866-C001-7", "text": "Our results achieve the excellent performance especially in sentence and phraselevel QE." 
}, { "sent_id": "499580e888a1598681a8d877b07866-C001-8", "text": "----------------------------------" }, { "sent_id": "499580e888a1598681a8d877b07866-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "499580e888a1598681a8d877b07866-C001-10", "text": "We introduce the recurrent neural network based quality estimation (QE) model for predicting the sentence, word and phrase-level translation qualities, without relying on manual efforts to find QE related features." }, { "sent_id": "499580e888a1598681a8d877b07866-C001-11", "text": "Existing QE researches have been usually focused on finding desirable QE related features to use machine learning algorithms." }, { "sent_id": "499580e888a1598681a8d877b07866-C001-12", "text": "Recently, however, there have been efforts to apply neural networks to QE and these neural approaches have shown potential for QE." }, { "sent_id": "499580e888a1598681a8d877b07866-C001-13", "text": "Shah et al. (2015) use continuous space language model features for sentence-level QE and word embedding features for word-level QE, in combination with other features produced by QuEst++ ." }, { "sent_id": "499580e888a1598681a8d877b07866-C001-14", "text": "Kreutzer et al. (2015) apply neural networks using pre-trained alignments and word lookup-table to word-level QE, which achieve the excellent performance by using the combination of baseline features at word level." }, { "sent_id": "499580e888a1598681a8d877b07866-C001-15", "text": "However, these are not 'pure' neural approaches for QE." }, { "sent_id": "499580e888a1598681a8d877b07866-C001-16", "text": "Kim and Lee (2016) apply neural machine translation (NMT) models, based on recurrent neural network, to sentence-level QE." }, { "sent_id": "499580e888a1598681a8d877b07866-C001-17", "text": "This is the first try of using NMT models for the translation quality estimation." 
}, { "sent_id": "499580e888a1598681a8d877b07866-C001-18", "text": "This recurrent neural network based quality estimation model is a pure neural approach for QE and achieves a competitive performance in sentence-level QE (English-Spanish)." }, { "sent_id": "499580e888a1598681a8d877b07866-C001-19", "text": "In this paper, we extend the recurrent neural network based quality estimation model to word and phrase level." }, { "sent_id": "499580e888a1598681a8d877b07866-C001-20", "text": "Also, we apply this model to sentence, word and phrase-level QE shared task (English-German) of WMT16." }, { "sent_id": "499580e888a1598681a8d877b07866-C001-21", "text": "----------------------------------" }, { "sent_id": "499580e888a1598681a8d877b07866-C001-22", "text": "**RECURRENT NEURAL NETWORK BASED QUALITY ESTIMATION MODEL**" }, { "sent_id": "499580e888a1598681a8d877b07866-C001-23", "text": "Recurrent neural network (RNN) based quality estimation model (Kim and Lee, 2016) consists of two parts: two bidirectional RNNs on the source and target sentences in the first part and another RNN for predicting the quality in the second part." }, { "sent_id": "499580e888a1598681a8d877b07866-C001-24", "text": "In the first part (Figure 1 ), modified RNN-based NMT model generates quality vectors, which indicate a sequence of vectors about target words' translation qualities." }, { "sent_id": "499580e888a1598681a8d877b07866-C001-25", "text": "Each quality vector for each target word has, as not a number unit but a vector unit, the quality information about whether each target word is properly translated from source sentence." }, { "sent_id": "499580e888a1598681a8d877b07866-C001-26", "text": "Each quality vector is generated by decomposing the probability of each target word from the modified NMT model." 
}, { "sent_id": "499580e888a1598681a8d877b07866-C001-27", "text": "1 Kim and Lee (2016) modify the NMT model to 1) use source and target A Recurrent Neural Networks Approach for Estimating the Quality of Machine Translation Output" }, { "sent_id": "499580e888a1598681a8d877b07866-C001-28", "text": "Figure 1: First part of recurrent neural network based quality estimation model for generating quality vectors (Kim and Lee, 2016) sentences as inputs, 2 2) apply bidirectional RNNs both on source and target sentences, which enable to fully utilize the bidirectional quality information, and 3) generate quality vectors for target words as outputs." }, { "sent_id": "499580e888a1598681a8d877b07866-C001-29", "text": "In the second part ( Figure 2 , 3 and 4), the final quality of translation at various level (sentencelevel/word-level/phrase-level) is predicted by using the quality vectors as inputs." }, { "sent_id": "499580e888a1598681a8d877b07866-C001-30", "text": "Kim and Lee (2016) apply RNN based model to sentence-level QE and we extend this model to word and phraselevel QE." }, { "sent_id": "499580e888a1598681a8d877b07866-C001-31", "text": "In subsection 2.1, 2.2 and 2.3, we describe the RNN based 3 (second part) sentence, word and phrase-level QE models." }, { "sent_id": "499580e888a1598681a8d877b07866-C001-32", "text": "4 The cause of these separated parts of the QE model comes from the insufficiency of QE datasets to train the whole QE model." }, { "sent_id": "499580e888a1598681a8d877b07866-C001-33", "text": "Thus, the QE model is divided into two parts, and then different training data are used to train each of the separated parts: large-scale parallel corpora such as Europarl for training the first part and QE datasets, provided in Quality Estimation Shared Task of WMT, for training the second part." 
}, { "sent_id": "499580e888a1598681a8d877b07866-C001-34", "text": "----------------------------------" }, { "sent_id": "499580e888a1598681a8d877b07866-C001-35", "text": "**RNN BASED SENTENCE-LEVEL QE MODEL**" }, { "sent_id": "499580e888a1598681a8d877b07866-C001-36", "text": "In RNN based sentence-level QE model (Figure 2) , HTER (human-targeted translation edit rate) (Snover et al., 2006) in [0,1] for target sentence is predicted by using a logistic sigmoid func-A Recurrent Neural Networks Approach for Estimating the Quality of Machine Translation Output (Kim and Lee, 2016) tion such that" }, { "sent_id": "499580e888a1598681a8d877b07866-C001-37", "text": "(1)" }, { "sent_id": "499580e888a1598681a8d877b07866-C001-38", "text": "W s is the weight matrix of sigmoid function 5 at sentence-level QE." }, { "sent_id": "499580e888a1598681a8d877b07866-C001-39", "text": "s is a summary unit of the sequential quality vectors and is fixed to the last hidden state 6 h T y of RNN." }, { "sent_id": "499580e888a1598681a8d877b07866-C001-40", "text": "The hidden state h j is computed by" }, { "sent_id": "499580e888a1598681a8d877b07866-C001-41", "text": "where f is the activation function of RNN (Kim and Lee, 2016) ." }, { "sent_id": "499580e888a1598681a8d877b07866-C001-42", "text": "----------------------------------" }, { "sent_id": "499580e888a1598681a8d877b07866-C001-43", "text": "**RNN BASED WORD-LEVEL QE MODEL**" }, { "sent_id": "499580e888a1598681a8d877b07866-C001-44", "text": "In RNN based word-level QE model (Figure 3 ), we apply bidirectional RNN based binary classification (OK/BAD) using quality vectors as inputs." 
}, { "sent_id": "499580e888a1598681a8d877b07866-C001-45", "text": "Through the bidirectional RNN, bidirectional hidden states { h j , h j } for each target word y j are" }, { "sent_id": "499580e888a1598681a8d877b07866-C001-46", "text": "The forward hidden state h j indicates summary information about the forward translation quality of target word y j , reflecting qualities of preceding target words {y 1 , ..., y j\u22121 }." }, { "sent_id": "499580e888a1598681a8d877b07866-C001-47", "text": "And the backward hidden state h j indicates summary information about the backward translation quality of target word y j , reflecting qualities of following target words 8 {y j+1 , ..., y Ty }." }, { "sent_id": "499580e888a1598681a8d877b07866-C001-48", "text": "We use the concatenated hidden state h w j (= [ h j ; h j ]) to predict the wordlevel quality for target word y j such that" }, { "sent_id": "499580e888a1598681a8d877b07866-C001-49", "text": "W w is the weight matrix of sigmoid function at word-level QE." }, { "sent_id": "499580e888a1598681a8d877b07866-C001-50", "text": "7 f (g ) is the activation function of the forward(backward) RNN at word-level QE." }, { "sent_id": "499580e888a1598681a8d877b07866-C001-51", "text": "8 Ty is the length of target sentence." }, { "sent_id": "499580e888a1598681a8d877b07866-C001-52", "text": "----------------------------------" }, { "sent_id": "499580e888a1598681a8d877b07866-C001-53", "text": "**RNN BASED PHRASE-LEVEL QE MODEL**" }, { "sent_id": "499580e888a1598681a8d877b07866-C001-54", "text": "RNN based phrase-level QE model is the extended version of RNN based word-level QE model (in subsection 2.2)." }, { "sent_id": "499580e888a1598681a8d877b07866-C001-55", "text": "In RNN based phrase-level QE model (Figure 4) , we also apply bidirectional RNN based binary classification." 
}, { "sent_id": "499580e888a1598681a8d877b07866-C001-56", "text": "We use the simply averaged quality vector q ph j to predict the phrase-level quality of the phrase 9 ph j , composed of the corresponding target words {y k , y k+1 , ...}, such that" }, { "sent_id": "499580e888a1598681a8d877b07866-C001-57", "text": "----------------------------------" }, { "sent_id": "499580e888a1598681a8d877b07866-C001-58", "text": "**RESULTS**" }, { "sent_id": "499580e888a1598681a8d877b07866-C001-59", "text": "RNN based QE models were evaluated on the WMT16 Quality Estimation Shared Task 11 at sentence, word and phrase level of English-German." }, { "sent_id": "499580e888a1598681a8d877b07866-C001-60", "text": "Because whole QE models are separated into two parts, each part of the QE models is trained separately by using different training data." }, { "sent_id": "499580e888a1598681a8d877b07866-C001-61", "text": "To train the first part of the QE models, English-German parallel corpus of Europarl v7 (Koehn, 2005) were used." }, { "sent_id": "499580e888a1598681a8d877b07866-C001-62", "text": "To train the second part of the QE models, WMT16 QE datasets of English-German (Specia et al., 2016) is the number of iterations while the first part is trained by using large-scale parallel corpora to make quality vectors (QV)." }, { "sent_id": "499580e888a1598681a8d877b07866-C001-63", "text": "----------------------------------" }, { "sent_id": "499580e888a1598681a8d877b07866-C001-64", "text": "**RESULTS OF SENTENCE-LEVEL QE (TASK 1)**" }, { "sent_id": "499580e888a1598681a8d877b07866-C001-65", "text": "Pearson's correlation (r), mean absolute error (MAE), and root mean squared error (RMSE) are used to evaluate the scoring variant of sentencelevel QE. And Spearman's rank correlation (\u03c1) and DeltaAvg are used to evaluate the ranking variant of sentence-level QE." 
}, { "sent_id": "499580e888a1598681a8d877b07866-C001-66", "text": "Table 1 and 2 (Table B .1 and B.2) present the results of the QE models on test (development) set for the scoring and ranking variants of the WMT16 sentence-level QE shared task (Task 1)." }, { "sent_id": "499580e888a1598681a8d877b07866-C001-67", "text": "In all aspects of evaluation at sentencelevel QE, the RNN based QE model (SENT/RNN) showed the better performance than the FNN based QE model (SENT/FNN)." }, { "sent_id": "499580e888a1598681a8d877b07866-C001-68", "text": "Our two methods (SENT/RNN-QV2 and SENT/RNN-QV3), participated in WMT16 sentence-level QE shared task, achieved top rank: each 2nd and 4th at the scoring variant and each 1st and 3rd at the ranking variant." }, { "sent_id": "499580e888a1598681a8d877b07866-C001-69", "text": "----------------------------------" }, { "sent_id": "499580e888a1598681a8d877b07866-C001-70", "text": "**RESULTS OF WORD-LEVEL AND PHRASE-LEVEL QE (TASK 2)**" }, { "sent_id": "499580e888a1598681a8d877b07866-C001-71", "text": "The multiplication of F1-scores for the 'OK' and 'BAD' classes and F1-score for the 'BAD' class are used to evaluate the word-level and phraselevel QE." }, { "sent_id": "499580e888a1598681a8d877b07866-C001-72", "text": "Table 3 and 4 (Table B. 3 and B.4) respectively present the results on test (development) set of the WMT16 word-level and phrase-level QE shared task (Task 2)." }, { "sent_id": "499580e888a1598681a8d877b07866-C001-73", "text": "In all aspects of evaluation at wordlevel and phrase-level QE, the RNN based QE" }, { "sent_id": "499580e888a1598681a8d877b07866-C001-74", "text": "----------------------------------" }, { "sent_id": "499580e888a1598681a8d877b07866-C001-75", "text": "**CONCLUSION**" }, { "sent_id": "499580e888a1598681a8d877b07866-C001-76", "text": "This paper described recurrent neural network based quality estimation models of sentence, word and phrase level." 
}, { "sent_id": "499580e888a1598681a8d877b07866-C001-77", "text": "We extended the (existing sentence-level) recurrent neural network based quality estimation model to word and phrase level. And we applied these models to sentence, word and phrase-level QE shared task of WMT16." }, { "sent_id": "499580e888a1598681a8d877b07866-C001-78", "text": "These recurrent neural network based quality estimation models are pure neural approaches for QE and achieved excellent performance especially in sentence and phrase-level QE." }, { "sent_id": "499580e888a1598681a8d877b07866-C001-79", "text": "----------------------------------" }, { "sent_id": "499580e888a1598681a8d877b07866-C001-80", "text": "**A.3 FNN BASED PHRASE-LEVEL QE MODEL**" }, { "sent_id": "499580e888a1598681a8d877b07866-C001-81", "text": "In FNN based phrase-level QE model (Figure A. 3), we also apply FNN based binary classification (OK/BAD)." }, { "sent_id": "499580e888a1598681a8d877b07866-C001-82", "text": "By only using the averaged quality vector q ph j for the phrase ph j , composed of the corresponding target words {y k , y k+1 , ...}, the hidden state h j (= h ph j ) is made." }, { "sent_id": "499580e888a1598681a8d877b07866-C001-83", "text": "We predict the phrase-level QE by (5)." }, { "sent_id": "499580e888a1598681a8d877b07866-C001-84", "text": "ity of Machine Translation Output Table B .4: Results on development set of WMT16 phrase-level QE (Task 2)." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "499580e888a1598681a8d877b07866-C001-16" ], [ "499580e888a1598681a8d877b07866-C001-23" ], [ "499580e888a1598681a8d877b07866-C001-27" ], [ "499580e888a1598681a8d877b07866-C001-28" ], [ "499580e888a1598681a8d877b07866-C001-36" ], [ "499580e888a1598681a8d877b07866-C001-40", "499580e888a1598681a8d877b07866-C001-41" ] ], "cite_sentences": [ "499580e888a1598681a8d877b07866-C001-16", "499580e888a1598681a8d877b07866-C001-23", "499580e888a1598681a8d877b07866-C001-27", "499580e888a1598681a8d877b07866-C001-28", "499580e888a1598681a8d877b07866-C001-36", "499580e888a1598681a8d877b07866-C001-41" ] }, "@USE@": { "gold_contexts": [ [ "499580e888a1598681a8d877b07866-C001-19", "499580e888a1598681a8d877b07866-C001-23" ], [ "499580e888a1598681a8d877b07866-C001-30" ] ], "cite_sentences": [ "499580e888a1598681a8d877b07866-C001-23", "499580e888a1598681a8d877b07866-C001-30" ] }, "@EXT@": { "gold_contexts": [ [ "499580e888a1598681a8d877b07866-C001-19", "499580e888a1598681a8d877b07866-C001-23" ], [ "499580e888a1598681a8d877b07866-C001-30" ] ], "cite_sentences": [ "499580e888a1598681a8d877b07866-C001-23", "499580e888a1598681a8d877b07866-C001-30" ] } } }, "ABC_35ef3eba487c3cd97d32210670678a_26": { "x": [ { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-2", "text": "Neural network models have been very successful for natural language inference, with the best models reaching 90% accuracy in some benchmarks." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-3", "text": "However, the success of these models turns out to be largely benchmark specific." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-4", "text": "We show that models trained on natural language inference dataset drawn from one benchmark fail to perform well in others, even if the notion of inference assumed in these benchmark tasks is the same or similar." 
}, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-5", "text": "We train five state-of-the-art neural network models on different datasets and show that each one of these fail to generalize outside of the respective benchmark." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-6", "text": "In light of these results we conclude that the current neural network models are not able to generalize in capturing the semantics of natural language inference, but seem to be overfitting to the specific dataset." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-7", "text": "----------------------------------" }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-54", "text": "However, its focus are distributional semantic approaches." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-9", "text": "Natural Language Inference (NLI) has attracted considerable interest in the NLP community and, recently, a large number of neural network-based systems have been proposed to deal with the task." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-10", "text": "These approaches can be usually categorized into: a) sentence encoding models, and b) other neural network models." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-11", "text": "Both of them have been very successful, with the state of the art on the SNLI and MultiNLI datasets being 90.1% (Kim et al., 2018) and 86.7% (Devlin et al., 2018) respectively." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-12", "text": "However, a big question w.r.t to these systems is their ability to generalize outside the specific datasets they are trained and tested on." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-13", "text": "Recently, Glockner et al. (2018) have shown that state-of-the-art NLI systems break considerably easily when instead of tested on the original SNLI test set, they are tested on a test set which Preprint." 
}, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-14", "text": "Work in progress. is constructed by taking premises from the training set and creating several hypotheses from them by changing at most one word within the premise." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-15", "text": "The results show a very significant drop in accuracy for three of the four systems." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-16", "text": "The system that was more difficult to break and had the less loss in accuracy was the system by Chen et al. (2018) which utilizes external knowledge taken from WordNet (Miller, 1995) ." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-17", "text": "In this paper we show that NLI systems that have been very successful in specific NLI benchmarks, fail to generalize when trained on a specific NLI datset and then tested across different NLI benchmarks." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-18", "text": "The results we get are in line with Glockner et al. (2018) , showing that the generalization capability of the individual NLI systems is very limited, but, what is more, they further show the only system that was less prone to breaking in Glockner et al. (2018) , breaks in the experiments we have conducted as well." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-19", "text": "----------------------------------" }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-20", "text": "**RELATED WORK**" }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-21", "text": "The ability of NLI systems to generalize and related skepticism has been raised in a recent paper by Glockner et al. (2018) ." 
}, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-22", "text": "There, the authors show that the generalization capabilities of stateof-the-art NLI systems, in cases where some kind of external lexical knowledge is needed, drops dramatically when the SNLI test set is replaced by a test set where the premise and the hypothesis are otherwise identical except for at most one word." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-23", "text": "The results show a very significant drop in accuracy. recognize the generalization problem that comes with training on datasets like SNLI, which tend to be homogeneous with linguistic variation." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-24", "text": "In this context, they propose to better train NLI models by making use of adversarial examples." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-25", "text": "Gururangan et al. (2018) Chatzikyriakidis et al. (2017) present an overview of the most standard datasets for NLI and show that the definitions of inference in each of them are actually quite different." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-26", "text": "----------------------------------" }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-27", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-28", "text": "----------------------------------" }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-29", "text": "**DATA**" }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-30", "text": "We chose three different datasets for the experiments: SNLI, MultiNLI and SICK." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-31", "text": "All of them have been designed for NLI involving three-way classification." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-32", "text": "The selected datasets use the same three labels entailment, neutral and contradiction." 
}, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-33", "text": "We did not include any datasets with two-way classification, e.g. SciTail ." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-34", "text": "As SICK is a relatively small dataset with approximately only 10k sentence pairs, we did not use it as training data in any experiment." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-35", "text": "We also trained our models with a combined SNLI + MultiNLI training set." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-36", "text": "All the experimental combinations are listed in Table 1 ." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-37", "text": "Examples from the selected dataset are provided in Table 2 ." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-38", "text": "We describe the three datasets in more detail below." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-39", "text": "----------------------------------" }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-40", "text": "**SNLI**" }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-41", "text": "The Stanford Natural Language Inference (SNLI) corpus (Bowman et al., 2015) is a dataset of 570k human-written sentence pairs manually labeled with the labels entailment, contradiction, and neutral." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-42", "text": "The dataset is divided into training (550,152 pairs), development (10,000 pairs) and test sets (10,000 pairs)." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-43", "text": "The source for the premise sentences in SNLI were image captions taken from the Flickr30k corpus (Young et al., 2014) ." 
}, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-44", "text": "----------------------------------" }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-45", "text": "**MULTINLI**" }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-46", "text": "The Multi-Genre Natural Language Inference (MultiNLI) corpus (Williams et al., 2018 ) is a broad-coverage corpus for NLI, consisting of 433k human-written sentence pairs labeled with entailment, contradiction and neutral." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-47", "text": "Unlike the SNLI corpus, which draws the premise sentence from image captions, MultiNLI consists of sentence pairs from ten distinct genres of both written and spoken English." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-48", "text": "The dataset is divided into training (392,702 pairs), development (20,000 pairs) and test sets (20,000 pairs)." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-49", "text": "Only five genres are included in the training set." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-50", "text": "The development and test sets have been divided into matched and mismatched, where the former includes only sentences from the same genres as the training data, and the latter includes sentences from the remaining genres not present in the training data." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-51", "text": "We used the matched development set (MultiNLI-m) for the experiments." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-52", "text": "SICK SICK (Marelli et al., 2014 ) is a dataset that was originally constructed to test compositional distributional semantics (DS) models." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-53", "text": "The dataset contains 9,840 examples pertaining to logical inference (negation, conjunction, disjunction, apposition, relative clauses, etc.)." 
}, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-55", "text": "Therefore, it normalises several cases DS is not expected to account for." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-56", "text": "The dataset consists of approximately 10k test pairs annotated for inference (three-way) and relatedness." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-57", "text": "The dataset is constructed taking pairs of sentences from a random subset of the 8K ImageFlickr data set 1 and the SemEval 2012 STS MSRVideo Description dataset 2 ." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-58", "text": "----------------------------------" }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-59", "text": "**MODEL AND TRAINING DETAILS**" }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-60", "text": "We perform experiments with five models, two from sentence encoding approaches and three coming from cross-sentence approaches." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-61", "text": "For sentence encoding models, we have chosen a simple one-layer bidirectional LSTM with max pooling (BiLSTM-max) with the hiddens size of 600D per direction, used e.g. in InferSent (Conneau et al., 2017) , and HBMP (Talman et al., 2018) ." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-62", "text": "For the other models, we have chosen ESIM (Chen et al., 2017) , which includes cross-sentence attention, and KIM (Chen et al., 2018) , which has cross-entailment SICK A person, who is riding a bike, is wearing gear which is black A biker is wearing gear which is black SNLI A young family enjoys feeling ocean waves lap at their feet." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-63", "text": "A family is at the beach." 
}, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-64", "text": "----------------------------------" }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-65", "text": "**MULTINLI**" }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-66", "text": "Kal tangled both of Adrin's arms, keeping the blades far away." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-67", "text": "Adrin's arms were tangled, keeping his blades away from Kal." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-68", "text": "contradiction SICK There is no man wearing a black helmet and pushing a bicycle One man is wearing a black helmet and pushing a bicycle SNLI A man with a tattoo on his arm staring to the side with vehicles and buildings behind him." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-69", "text": "A man with no tattoos is getting a massage." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-70", "text": "Also in Eustace Street is an information office and a cultural center for children, The Ark. The Ark, a cultural center for kids, is located in Joyce Street." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-71", "text": "neutral SICK A little girl in a green coat and a boy holding a red sled are walking in the snow A child is wearing a coat and is carrying a red sled near a child in a green and black coat SNLI An old man with a package poses in front of an advertisement." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-72", "text": "A man poses in front of an ad for beer." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-73", "text": "Enthusiasm for Disney's Broadway production of The Lion King dwindles." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-74", "text": "The broadway production of The Lion King was amazing, but audiences are getting bored." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-75", "text": "sentence attention and utilizes external knowledge." 
}, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-76", "text": "We also selected one model involving a pretrained language model, namely ESIM + ELMo ." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-77", "text": "All of the models perform well on the SNLI dataset, reaching near stateof-the-art accuracy in the sentence encoding and the other category respectively." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-78", "text": "KIM is particularly interesting in this context as it performed significantly better than other models in the Breaking NLI experiment conducted by Glockner et al. (2018) ." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-79", "text": "For BiLSTM-max we used the Adam optimizer (Kingma and Ba, 2014) and a learning rate of 5e-4." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-80", "text": "The learning rate was decreased by the factor of 0.2 after each epoch if the model did not improve." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-81", "text": "We used a batch size of 64." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-82", "text": "Dropout of 0.1 was used between the layers of the multilayer perceptron classifier, except before the last layer." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-83", "text": "The models were evaluated with the development data after each epoch and training was stopped if the development loss increased for more than 3 epochs." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-84", "text": "The model with the highest development accuracy was selected for testing." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-85", "text": "The BiLSTM-max models were initialized with pretrained GloVe 840B word embeddings of size 300 dimensions (Pennington et al., 2014) , which were fine-tuned during training." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-86", "text": "Our implementation of BiLSTM-max was done in PyTorch." 
}, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-87", "text": "For HBMP, ESIM and KIM we used the original implementations as well as the default settings and hyperparameter values as described in (Talman et al., 2018) , (Chen et al., 2017) and (Chen et al., 2018) respectively, adjusting only the vocabulary size based on the dataset used." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-88", "text": "For ESIM + ELMo we used the AllenNLP PyTorch implementation with the default settings and hyperparameter values." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-89", "text": "3 Our experiments show that, while all of the five models perform well when the test set is taken from the same corpus as the training and development set, accuracy drops significantly when the test data is drawn from a separate corpus, the average drop in accuracy being 25.4 points across all experiments." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-90", "text": "----------------------------------" }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-91", "text": "**EXPERIMENTAL RESULTS**" }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-92", "text": "The accuracy drops the most when a model is tested on SICK." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-93", "text": "The drop in this case is between 19.0-28.9 points when trained on MultiNLI, between 31.6-33.7 points when trained on SNLI and between 31.1-33.0 when trained on SNLI + MultiNLI." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-94", "text": "This result was somewhat expected, as the method of constructing the sentence pairs was different, and therefore there is too much difference in the kind of sentence pairs between the training and test sets for the models to be able to transfer what it has learned to the test examples." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-95", "text": "However, the drop in accuracy was not expected to be that dramatic." 
}, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-96", "text": "The most surprising result was that the accuracy of all models drops significantly even in the set-up where the models were trained on MultiNLI and tested on SNLI (7.9-11.1 points)." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-97", "text": "This is surprising as both of these datasets have been constructed with a similar data collection method using the same definition of inference (i.e. same definition of entailment, contradiction and neutral)." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-98", "text": "The sentences included in SNLI are also much simpler compared to those in MultiNLI." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-99", "text": "This might also explain why the drop in accuracy for all of the four models is lowest when the models are trained on MultiNLI and tested on SNLI." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-100", "text": "It is also very surprising that the model with biggest drop in accuracy was ESIM + ELMo which includes a pretrained ELMo language model." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-101", "text": "ESIM + ELMo did, however, get the highest accuracy of 69.1% in this experiment." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-102", "text": "All the models perform almost equally poorly across all the experiments." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-103", "text": "Both BiLSTM-max and HBMP have an average drop in accuracy of 24.4 points, while the average for KIM is 25.5 and for ESIM + ELMo 25.6." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-104", "text": "ESIM has the highest average drop of 27.0 points." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-105", "text": "In contrast to the findings of Glockner et al. (2018) , utilizing external knowledge did not improve the model's generalization capability, as KIM performed equally poorly across all dataset combinations." 
}, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-106", "text": "Also including a pretrained language model did not improve the results significantly." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-107", "text": "----------------------------------" }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-108", "text": "**CONCLUSION**" }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-109", "text": "In this paper we have shown that neural network models for NLI fail to generalize across different NLI benchmarks." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-110", "text": "We experimented with five state-of-the-art models covering both sentence encoding approaches and cross-sentence attention models." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-111", "text": "For all the systems, the accuracy drops between 7.9-33.7 points (the average drop being 25.4 points), when testing with a test set drawn from a separate corpus from that of the training data, as compared to when the test and training data are splits from the same corpus." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-112", "text": "Our findings, together with the previous negative findings e.g. by Glockner et al. (2018) and Gururangan et al. (2018) , indicate that the current state-of-the-art neural network models fail to capture the semantics of NLI in a way that will enable them to generalize across different NLI situations." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-113", "text": "The results indicate two issues to be taken into consideration: a) using datasets involving a fraction of what NLI is, will fail when tested in datasets that are testing for a slightly different definition." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-114", "text": "This is evident when we move from the SNLI to the SICK dataset." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-115", "text": "b) NLI is to some extent also genre/context dependent." 
}, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-116", "text": "Training on SNLI and testing on MultiNLI gives worse results than vice versa." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-117", "text": "This can be seen as an indication that training on multiple genres helps." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-118", "text": "However, this is still not enough given that, even in case of training on MultiNLI and testing on SNLI, accuracy drops significantly." }, { "sent_id": "35ef3eba487c3cd97d32210670678a-C001-119", "text": "Further work is required on better data resources as well as on better neural network models to tackle these issues." } ], "y": { "@MOT@": { "gold_contexts": [ [ "35ef3eba487c3cd97d32210670678a-C001-13" ], [ "35ef3eba487c3cd97d32210670678a-C001-21" ] ], "cite_sentences": [ "35ef3eba487c3cd97d32210670678a-C001-13", "35ef3eba487c3cd97d32210670678a-C001-21" ] }, "@SIM@": { "gold_contexts": [ [ "35ef3eba487c3cd97d32210670678a-C001-18" ], [ "35ef3eba487c3cd97d32210670678a-C001-112" ] ], "cite_sentences": [ "35ef3eba487c3cd97d32210670678a-C001-18", "35ef3eba487c3cd97d32210670678a-C001-112" ] }, "@DIF@": { "gold_contexts": [ [ "35ef3eba487c3cd97d32210670678a-C001-18" ], [ "35ef3eba487c3cd97d32210670678a-C001-105" ] ], "cite_sentences": [ "35ef3eba487c3cd97d32210670678a-C001-18", "35ef3eba487c3cd97d32210670678a-C001-105" ] }, "@EXT@": { "gold_contexts": [ [ "35ef3eba487c3cd97d32210670678a-C001-18" ] ], "cite_sentences": [ "35ef3eba487c3cd97d32210670678a-C001-18" ] }, "@BACK@": { "gold_contexts": [ [ "35ef3eba487c3cd97d32210670678a-C001-21" ], [ "35ef3eba487c3cd97d32210670678a-C001-78" ] ], "cite_sentences": [ "35ef3eba487c3cd97d32210670678a-C001-21", "35ef3eba487c3cd97d32210670678a-C001-78" ] }, "@FUT@": { "gold_contexts": [ [ "35ef3eba487c3cd97d32210670678a-C001-112", "35ef3eba487c3cd97d32210670678a-C001-113" ] ], "cite_sentences": [ "35ef3eba487c3cd97d32210670678a-C001-112" ] } } }, 
"ABC_0a55859a36d0887ba4febc98762715_26": { "x": [ { "sent_id": "0a55859a36d0887ba4febc98762715-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-2", "text": "The latency in the current neural based dialogue state tracking models prohibits them from being used efficiently for deployment in production systems, albeit their highly accurate performance." }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-3", "text": "This paper proposes a new scalable and accurate neural dialogue state tracking model, based on the recently proposed Global-Local Self-Attention encoder (GLAD) model by Zhong et al. (2018) which uses global modules to share parameters between estimators for different types (called slots) of dialogue states, and uses local modules to learn slot-specific features." }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-4", "text": "By using only one recurrent networks with global conditioning, compared to (1 + # slots) recurrent networks with global and local conditioning used in the GLAD model, our proposed model reduces the latency in training and inference times by 35% on average, while preserving performance of belief state tracking, by 97.38% on turn request and 88.51% on joint goal and accuracy." }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-5", "text": "Evaluation on Multi-domain dataset (Multi-WoZ) also demonstrates that our model outperforms GLAD on turn inform and joint goal accuracy." }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-6", "text": "----------------------------------" }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-8", "text": "Dialog State Tracking (DST) is an important component of task-oriented dialogue systems." }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-9", "text": "DST keeps track of the interaction's goal and what has happened in the dialog history." 
}, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-10", "text": "Majority of the deployed dialogue systems in commercial settings such as common customer support systems and intelligent assistants, such as Amazon Alexa, Apple Siri and Google Assistant, take advantage of dialogue state tracking." }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-11", "text": "Dialog state tracking uses the information from user utterance at each turn, context from previous turns, and other external information as well as the output of the system at every turn." }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-12", "text": "Decision made by the dialogue state tracker, is then used to determine what action should be taken by the system in next steps." }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-13", "text": "This is a critical role to play in the design of any task oriented dialogue system." }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-14", "text": "State of the art approaches for dialogue state tracking rely on deep learning models, which represent the dialogue state as a distribution over all candidate slot values that are defined in the ontology." }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-15", "text": "Recently, several neural-based DST systems have been proposed." }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-16", "text": "proposed a Neural Belief Tracker (NBT) model based on binary decision making of each state-values, where representation of user utterance, system action, and candidate pairs are computed based on deep distribiutional representation of word vectors." }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-17", "text": "In their model, they used deep network (DNN) and convolutional network (CNN) to compute such representation vectors." }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-18", "text": "proposed a sequence-to-sequence model for estimating the next dialogue state." 
}, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-19", "text": "In their work, the encoded hidden vector of user utterance is used to determine the current dialogue state, followed by a policy network to query over knowledge databse." }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-20", "text": "Then, the retrieved information is used as a conditionining input to the decoder, to generate the system response." }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-21", "text": "Recently, Zhong et al. (2018) proposed a model based on training a binary classifier for each slot-value, Global-Locally Self Attentive encoder (GLAD, by employing recurrent and self attention for each utterance and previous system actions, and measuring similaity of these computed representation to each slot-value, which achieve state of the art results on WoZ and DSTC2 (Williams et al., 2013) datasets." }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-22", "text": "Although the proposed neural based models achieves state of the art results on several benchmark, they are still inefficient for deployment in production system, due to their latency which stems from using recurrent networks." }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-23", "text": "In this paper, we propose a new encoder, by improving GLAD architecture (Zhong et al., 2018) ." }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-24", "text": "The proposed encoder is based on removing slot-dependent recurrent network for utterance and system action encoder, and employing a global conditioning of aforementioned encoder on the slot type embedding vector." }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-25", "text": "By removing the slot-dependent recurrent network, the proposed model is able to preserve the performance in predicting correct belief state, while improving computational complexity." 
}, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-26", "text": "The detailed description of encoder is explained in the section 2." }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-27", "text": "----------------------------------" }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-28", "text": "**RELATED WORKS**" }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-29", "text": "A similar scalable dialogue state tracking model is also proposed by Rastogi et al. (2017) , which is based on conditioning the encoder input." }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-30", "text": "They used a similar conditioning of user utterance representation on slot values (candidate sets) and slot type." }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-31", "text": "However, our proposed model is based on conditioning only on slot type." }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-32", "text": "Therefor, our proposed model is simpler since it contains only one conditioned encoder for user utterance, whereas Rastogi et al. (2017) model requires two independet conditioned encoder." }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-33", "text": "Recently, Xu and Hu (2018) proposed a model for unknown slot type by using a pointer network, based on conditioning to slot type embedding." }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-34", "text": "Our proposed model is also relaxing the current GLAD architecture for unknown slot types during inference." }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-35", "text": "----------------------------------" }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-36", "text": "**PROPOSED MODEL**" }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-37", "text": "In this section, we describe the proposed model." 
}, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-38", "text": "First, section 2.1 explains the recently proposed GLAD encoder (Zhong et al., 2018) architecture, followed by our proposed encoder in section 2.2." }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-39", "text": "----------------------------------" }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-40", "text": "**GLOBAL-LOCALLY SELF-ATTENTIVE MODEL**" }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-41", "text": "GLAD model is based on learning multiple binary classifier for each slot-value pair." }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-42", "text": "In this architecture, separate encoders are considered for utterance, previous system action, and all slot values." }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-43", "text": "The output of these encoders are then used by two scores model, i.e. previous system action and utterance, to predict the probability distribution on slot-value pairs." }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-44", "text": "This means, each scores model compute the similarity of each slot-value to the utterance representation or previous system action." }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-45", "text": "All encoders used similar architecture, i.e. global-local self attention (GLAD)." }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-46", "text": "To compute the hidden representation of its input sequence and its summary (context), GLAD a combination of bidirectional LSTM (Hochreiter and Schmidhuber, 1997) , to compute the temporal representation, followed by self-attention layer to extract the context vector." }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-47", "text": "To incorporate information regarding each slot, there is a dedicated recurrent and self-attention network for each slot." 
}, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-48", "text": "Therefor, to estimate the probability distribution over values of each slot, GLAD encoder has to learn a different hidden and context vector for utterance and previous system action." }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-49", "text": "----------------------------------" }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-50", "text": "**GLOBALLY-CONDITIONED ENCODER (GCE)**" }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-51", "text": "In this section, we describe the proposed globally-conditioned encoder (GCE) model." }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-52", "text": "Here, we employ the similar approach of learning slot-specific temporal and context representation of user utterance and system actions, as proposed in GLAD (Zhong et al., 2018) ." }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-53", "text": "However, we emphasize the limitation of GLAD encoder in using slot-specific recurrent and self-attention layers in their encoders." }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-54", "text": "Our proposed encoder is based on improving the latency and speed of inference by remving the inefficient recurrent layers and self-attention layers, without degrading the performance." }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-55", "text": "The proposed model is based on removing slot-specific recurrent and self-attention layers, and using only slot embedding vector (i.e. s k for k-th slot), as a conditioning vector to the temporal and context extraction layers, as shown in Figure 1 ." }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-56", "text": "To compute k-th slot-based representation H k , the slot embedding s k is concatenated with sequence tokens X, i.e. user utterance or previous system actions, as input to the recurrent layer, where concatenation denoted as f (X, s k )." 
}, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-57", "text": "Then a slot-based attention score a k i is computed for each token hidden representation H k i , by concatenating them to slot embedding s k and passing to a linear layer." }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-58", "text": "In this way, the computed attention is conditioned on the slot embedding, to pay attention to the slot-only information in the input sequence X." }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-59", "text": "Therefor, the GCE encoder function can be represented as," }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-60", "text": "Encoding Modules: Based on the definition of the proposed GCE encoder, the representation of user utterance, previous system action and current slot-value pair is computed as below," }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-61", "text": "where U denotes the user utterance word embeddings, A j is the j-th previous system action, and V is the current slot value pair to be evaluated (e.g food=italian)." }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-62", "text": "Scoring Model: We follow the proposed architecture in GLAD (Zhong et al., 2018) for computing score of each slot-value pair, in the user utterance and previous system actions." }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-63", "text": "To determine whether the user has mentioned a specific value of slot k, we compute the slot-kth conditioned scores for its values." }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-64", "text": "Similarly, to determine whether any slot-value is mentioned in previous system actions, that the user is referring to in the current utterance, we compute slot-conditioned scores of previous j system actions." }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-65", "text": "The final scores of slot k is the weighted sum of user-based and action-based scores, i.e. 
y k utt and y k act , which are normalized by sigmoid function \u03c3." }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-66", "text": "where \u03c9 is a learned parameter." }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-67", "text": "----------------------------------" }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-68", "text": "**EXPERIMENT**" }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-69", "text": "In this section, we evaluate the proposed encoder for the task of dialogue state tracking om single and multi-domain dialogue state tracking." }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-70", "text": "Wizard of oz (WoZ) restaurant reservation dataset is chosen for single-domain, and the performance is compared with the recent neural belief tracking models." }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-71", "text": "Moreover, we also evaluate on recen;t proposed multi-domain dataset, MultiWoZ (Budzianowski et al., 2018) , which consists of seven domains, i.e. restaurant, hotel, train, attraction, hospital, taxi, and police." }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-72", "text": "The evaluation metric is based on joint goal and turn-level request and joint goal tracking accuracy." }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-73", "text": "The joint goal is the accumulation of turn goals as described in Zhong et al. (2018) ." }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-74", "text": "The fixed pretrained GLoVe embedding (Pennington et al., 2014) with character-n gram embedding (Hashimoto et al., 2017) are used in embedding layer." }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-75", "text": "The implementation details and code of the GCE model can be found at https://github.com/elnaaz/GCE-Model." }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-76", "text": "Single-Domain: Table 1 shows the evaluation performance on WoZ dataset." 
}, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-77", "text": "It is indicated that our proposed GCE model performance is on par with GLAD model." }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-78", "text": "To further compare the latency of GCE and GLAD during training and testing, computation time for a batch of turn and the overall epoch time during training is measured." }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-79", "text": "We further evaluate the complete test time, which contains 400 dialogue and 1646 turns (WoZ test set), as shown in Table 2 ." }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-80", "text": "The computation time is measured in second, and it is indicated that GCE improves latency in both training and testing by 35% on average." }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-81", "text": "Multi-Domain: Table 3 shows the evauation on Multi-Woz (Budzianowski et al., 2018) dataset which consists of 10k dialogues." }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-82", "text": "In this setting, we completely ignore the domain information and use slot names only." }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-83", "text": "The results indicate that GCE model outperforms GLAD on turn inform and join goal accuracy." }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-84", "text": "70.8% 87.1% Delex. 
+ Semantic Dictionary 83.7% 87.6%" }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-85", "text": "Neural Belief Tracker-DNN 84.4% 91.2%" }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-86", "text": "Neural Belief Tracker-CNN 84.2% 91.6%" }, { "sent_id": "0a55859a36d0887ba4febc98762715-C001-87", "text": "GLAD (Zhong et al., 2018) 88.1\u00b10.4% 97.1\u00b10.2% GCE (Ours) 88.51% 97.38%" } ], "y": { "@USE@": { "gold_contexts": [ [ "0a55859a36d0887ba4febc98762715-C001-3" ], [ "0a55859a36d0887ba4febc98762715-C001-23" ], [ "0a55859a36d0887ba4febc98762715-C001-38" ], [ "0a55859a36d0887ba4febc98762715-C001-52" ], [ "0a55859a36d0887ba4febc98762715-C001-62" ], [ "0a55859a36d0887ba4febc98762715-C001-73" ] ], "cite_sentences": [ "0a55859a36d0887ba4febc98762715-C001-3", "0a55859a36d0887ba4febc98762715-C001-23", "0a55859a36d0887ba4febc98762715-C001-38", "0a55859a36d0887ba4febc98762715-C001-52", "0a55859a36d0887ba4febc98762715-C001-62", "0a55859a36d0887ba4febc98762715-C001-73" ] }, "@BACK@": { "gold_contexts": [ [ "0a55859a36d0887ba4febc98762715-C001-21" ], [ "0a55859a36d0887ba4febc98762715-C001-38" ] ], "cite_sentences": [ "0a55859a36d0887ba4febc98762715-C001-21", "0a55859a36d0887ba4febc98762715-C001-38" ] }, "@MOT@": { "gold_contexts": [ [ "0a55859a36d0887ba4febc98762715-C001-21", "0a55859a36d0887ba4febc98762715-C001-22" ] ], "cite_sentences": [ "0a55859a36d0887ba4febc98762715-C001-21" ] }, "@SIM@": { "gold_contexts": [ [ "0a55859a36d0887ba4febc98762715-C001-73" ] ], "cite_sentences": [ "0a55859a36d0887ba4febc98762715-C001-73" ] } } }, "ABC_58b423c4ea2a3530d0c469fc0f5528_26": { "x": [ { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-2", "text": "This paper explores the task of translating natural language queries into regular expressions which embody their meaning." 
}, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-3", "text": "In contrast to prior work, the proposed neural model does not utilize domain-specific crafting, learning to translate directly from a parallel corpus." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-4", "text": "To fully explore the potential of neural models, we propose a methodology for collecting a large corpus 1 of regular expression, natural language pairs." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-5", "text": "Our resulting model achieves a performance gain of 19.6% over previous state-of-the-art models." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-6", "text": "----------------------------------" }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-8", "text": "This paper explores the task of translating natural language text queries into regular expressions which embody their meaning." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-9", "text": "Regular expressions are built into many application interfaces, yet most users of these applications have difficulty writing them (Friedl, 2002) ." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-10", "text": "Thus a system for automatically generating regular expressions from natural language would be useful in many contexts." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-11", "text": "Furthermore, such technologies can ultimately scale to translate into other formal representations, such as program scripts (Raza et al., 2015) ." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-12", "text": "Prior work has demonstrated the feasibility of this task." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-13", "text": "Kushman and Barzilay (2013) proposed a model that learns to perform the task from a parallel corpus of regular expressions and the text descriptions." 
}, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-14", "text": "To account for the given representational disparity between formal regular expressions and natural language, their model utilizes a domain specific 1 Code and data are submitted as supplementary material." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-15", "text": "component which computes the semantic equivalence between two regular expressions." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-16", "text": "Since their model relies heavily on this component, it cannot be readily applied to other formal representations where such semantic equivalence calculations are not possible." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-17", "text": "In this paper, we reexamine the need for such specialized domain knowledge for this task." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-18", "text": "Given the same parallel corpus used in Kushman and Barzilay (2013) , we use an LSTM-based sequence to sequence neural network to perform the mapping." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-19", "text": "Our model does not utilize semantic equivalence in any form, or make any other special assumptions about the formalism." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-20", "text": "Despite this and the relatively small size of the original dataset (814 examples), our neural model exhibits a small 0.1% boost in accuracy." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-21", "text": "To further explore the power of neural networks, we created a much larger public dataset, NL-RX." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-22", "text": "Since creation of regular expressions requires specialized knowledge, standard crowd-sourcing methods are not applicable here." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-23", "text": "Instead, we employ a two-step generate-and-paraphrase procedure that circumvents this problem." 
}, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-24", "text": "During the generate step, we use a small manually-crafted grammar that translates regular expression into natural language." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-25", "text": "In the paraphrase step, we rely on crowd-sourcing to paraphrase these rigid descriptions into more natural and fluid descriptions." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-26", "text": "Using this methodology, we have constructed a corpus of 10,000 regular expressions, with corresponding verbalizations." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-27", "text": "Our results demonstrate that our sequence to sequence model significantly outperforms the domain specific technique on the larger dataset, reaching a gain of 19.6% over of the state-of-the-art technique." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-28", "text": "(Ranta, 1998) ." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-29", "text": "Our work, however, is closest to Kushman and Barzilay (2013) ." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-30", "text": "They learned a semantic parsing translation model from a parallel dataset of natural language and regular expressions." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-31", "text": "Their model used a regular expressionspecific semantic unification technique to disambiguate the meaning of the natural language descriptions." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-32", "text": "Our method is similar in that we require only description and regex pairs to learn." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-33", "text": "However, we treat the problem as a direct translation task without applying any domain-specific knowledge." 
}, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-34", "text": "Neural Machine Translation Recent advances in neural machine translation (NMT) (Bahdanau et al., 2014; Devlin et al., 2014; Luong et al., 2015) using the framework of sequence to sequence learning (Sutskever et al., 2014) have demonstrated the effectiveness of deep learning models at capturing and translating language semantics." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-35", "text": "----------------------------------" }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-36", "text": "**REGEX GENERATION AS TRANSLATION**" }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-37", "text": "Our model is inspired by the recent advancements in sequence to sequence neural translation." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-38", "text": "We use a Recurrent Neural Network (RNN) with attention (Mnih et al., 2014) for both encoding and decoding ( Figure 1 )." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-39", "text": "Let W = w 1 , w 2 ...w m be the input text description where each w i is a word in the vocabulary." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-40", "text": "We wish to generate the regex R = r 1 , r 2 , ...r n where each r i is a character in the regex." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-41", "text": "We use LSTM (Hochreiter and Schmidhuber, 1997) cells in our model, the transition equations for which can be summarized as: where \u03c3 represents the sigmoid function and is elementwise multiplication." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-42", "text": "The input x t is a word (w t ) for the encoder and the previously generated character r t\u22121 for the decoder." 
}, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-43", "text": "The attention mechanism is essentially a 'soft' weighting over the encoder's hidden states during decoding:" }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-44", "text": "e exp(score(h t , h e )) where h e is a hidden state in the encoder and score is the scoring function." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-45", "text": "We use the general attention matrix weight for our scoring function." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-46", "text": "Our model is six layers deep, with one word embedding layer, two encoder layers, two decoder layers, and one dense output layer." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-47", "text": "Our encoder and decoder layers use a stacked LSTM architecture with a width of 512 nodes." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-48", "text": "We use a global attention mechanism (Bahdanau et al., 2014) , which considers all hidden states of the encoder when computing the model's context vector." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-49", "text": "We perform standard dropout during training (Srivastava et al., 2014) after every LSTM layer with dropout probability equal to 0.25." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-50", "text": "We train for 20 epochs, utilizing a minibatch size of 32, and a learning-rate of 1.0." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-51", "text": "The learning rate is decayed by a factor 0.5 if evaluation perplexity does not increase." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-52", "text": "----------------------------------" }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-53", "text": "**CREATING A LARGE CORPUS OF NATURAL LANGUAGE / REGULAR EXPRESSION PAIRS**" }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-54", "text": "Previous work in regular expression generation has used fairly small datasets for training and evaluation." 
}, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-55", "text": "In order to fully utilize the power of neural translation models, we create a new large corpus of regular expression, natural language pairs titled NL-RX." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-56", "text": "The challenge in collecting such corpora is that typical crowdsourcing workers do not possess the specialized knowledge to write regular expressions." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-57", "text": "To solve this, we employ a two-step generate-andparaphrase procedure to gather our data." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-58", "text": "This technique is similar to the methods used by Wang et al. (2015) to create a semantic parsing corpus." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-59", "text": "Table 1 are applied to a node's children and the resulting string is passed to the node's parent." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-60", "text": "In the generate step, we generate regular expression representations from a small manually-crafted grammar (Table 1) ." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-61", "text": "Our grammar includes 15 nonterminal derivations and 6 terminals and of both basic and high-level operations." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-62", "text": "We identify these via frequency analysis of smaller datasets from previous work (Kushman and Barzilay, 2013) ." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-63", "text": "Every grammar rule has associated verbalizations for both regular expressions and language descriptions." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-64", "text": "We use this grammar to stochastically generate regular expressions and their corresponding synthetic language descriptions." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-65", "text": "This generation process is shown in Figure 2 ." 
}, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-66", "text": "While the automatically generated descriptions are semantically correct, they do not exhibit richness and variability of human-generated descriptions." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-67", "text": "To obtain natural language (non-synthetic) descriptions, we perform the paraphrase step." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-68", "text": "In this step, Mechanical Turk (Amazon, 2003) human workers paraphrase the generated synthetic descrip- tions into the fluent verbalizations." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-69", "text": "NL-RX Using the procedure described above, we create a new public dataset (NL-RX) comprising of 10,000 regular expressions and their corresponding natural language descriptions." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-70", "text": "Table 2 shows an example from our dataset." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-71", "text": "Our data collection procedure enables us to create a substantially larger and more varied dataset than previously possible." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-72", "text": "Employing standard crowdsource workers to paraphrase is more cost-efficient and scalable than employing professional regex programmers, enabling us to create a much larger dataset." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-73", "text": "Furthermore, our stochastic generation of regular expressions from a grammar results in a more varied dataset because it is not subject to the bias of human workers who, in previous work, wrote many duplicate examples (see Results)." 
}, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-74", "text": "----------------------------------" }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-75", "text": "**EXPERIMENTS**" }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-76", "text": "Datasets We split the 10,000 regexp and description pairs in NL-RX into 65% train, 10% dev, and 25% test sets." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-77", "text": "In addition, we also evaluate our model on the dataset used by Kushman and Barzilay (2013) (KB13), although it contains far fewer data points (824)." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-78", "text": "We use the 75/25 train/test split used in their work in order directly compare our performance to theirs." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-79", "text": "Training We perform a hyper-parameter gridsearch (on the dev set), to determine our model hyper-parameters: learning-rate = 1.0, encoderdepth = 2, decoder-depth = 2, batch size = 32, dropout = 0.25." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-80", "text": "We use a Torch (Collobert et al., 2002) implementation of attention sequence to sequence networks from (Kim, 2016) ." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-81", "text": "We train our models on the train set for 20 epochs, and choose the model with the best average loss on the dev set." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-82", "text": "called DFA-Equal." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-83", "text": "We employ functional equality because there are many ways to write equivalent regular expressions." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-84", "text": "For example, (a|b) is functionally equivalent to (b|a), despite their string representations differing." 
}, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-85", "text": "We report DFA-Equal accuracy as our model's evaluation metric, using Kushman and Barzilay (2013) 's implementation to directly compare our results." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-86", "text": "----------------------------------" }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-87", "text": "**EVALUATION METRIC**" }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-88", "text": "Baselines We compare our model against two baselines: BoW-NN: BoW-NN is a simple baseline that is a Nearest Neighbor classifier using Bag Of Words representation for each natural language description." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-89", "text": "For a given test example, it finds the closest cosinesimilar neighbor from the training set and uses the regexp from that example for its prediction." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-90", "text": "Semantic-Unify: Our second baseline, SemanticUnify, is the previous state-of-the-art model from (Kushman and Barzilay, 2013) , explained above." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-91", "text": "2" }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-92", "text": "----------------------------------" }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-93", "text": "**RESULTS**" }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-94", "text": "Our model significantly outperforms the baselines on the NL-RX dataset and achieves comparable performance to Semantic Unify on the KB13 dataset (Table 3) ." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-95", "text": "Despite the small size of KB13, our model achieves state-of-the-art results on this very resource-constrained dataset (814 examples)." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-96", "text": "Using NL-RX, we investigate the impact of training data size on our model's accuracy." 
}, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-97", "text": "Figure 3 shows how our model's performance improves as the number of training examples grows." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-98", "text": "----------------------------------" }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-99", "text": "**DIFFERENCES IN DATASETS**" }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-100", "text": "Keeping the previous section in mind, a seemingly unusual finding is that the model's accuracy is higher for the smaller dataset, KB13, than for the larger dataset, NL-RXTurk." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-101", "text": "On further analysis, we learned that the KB13 dataset is a much less varied and complex dataset than NL-RX-Turk." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-102", "text": "KB13 contains many duplicates, with only 45% of its regular expressions being unique." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-103", "text": "This makes the translation task easier because over half of the correct test predictions will be exact repetitions from the training set." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-104", "text": "In contrast, NL-RX-Turk does not suffer from this variance problem and contains 97% unique regular expressions." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-105", "text": "The relative easiness of the KB13 dataset is further illustrated by the high performance of the Nearest-Neighbor baselines on the KB13 dataset." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-106", "text": "----------------------------------" }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-107", "text": "**CONCLUSIONS**" }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-108", "text": "In this paper we demonstrate that generic neural architectures for generating regular expressions outperform customized, heavily engineered models." 
}, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-109", "text": "The results suggest that this technique can be employed to tackle more challenging problems in broader families of formal languages, such as mapping between language description and program scripts." }, { "sent_id": "58b423c4ea2a3530d0c469fc0f5528-C001-110", "text": "We also have created a large parallel corpus of regular expressions and natural language queries using typical crowd-sourcing workers, which we make available publicly." } ], "y": { "@BACK@": { "gold_contexts": [ [ "58b423c4ea2a3530d0c469fc0f5528-C001-13" ] ], "cite_sentences": [ "58b423c4ea2a3530d0c469fc0f5528-C001-13" ] }, "@USE@": { "gold_contexts": [ [ "58b423c4ea2a3530d0c469fc0f5528-C001-18" ], [ "58b423c4ea2a3530d0c469fc0f5528-C001-62" ], [ "58b423c4ea2a3530d0c469fc0f5528-C001-77" ], [ "58b423c4ea2a3530d0c469fc0f5528-C001-85" ], [ "58b423c4ea2a3530d0c469fc0f5528-C001-90" ] ], "cite_sentences": [ "58b423c4ea2a3530d0c469fc0f5528-C001-18", "58b423c4ea2a3530d0c469fc0f5528-C001-62", "58b423c4ea2a3530d0c469fc0f5528-C001-77", "58b423c4ea2a3530d0c469fc0f5528-C001-85", "58b423c4ea2a3530d0c469fc0f5528-C001-90" ] }, "@SIM@": { "gold_contexts": [ [ "58b423c4ea2a3530d0c469fc0f5528-C001-29" ] ], "cite_sentences": [ "58b423c4ea2a3530d0c469fc0f5528-C001-29" ] } } }, "ABC_28900a293048fdb0c40dc540985cf1_26": { "x": [ { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-47", "text": "**EXPLORATION OF TRANSFER LEARNING**" }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-2", "text": "Multi-task training is an effective method to mitigate the data sparsity problem." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-45", "text": "The best model is applied at test time." 
}, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-46", "text": "----------------------------------" }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-3", "text": "It has recently been applied for crosslingual transfer learning for paradigm completion-the task of producing inflected forms of lemmata-with sequenceto-sequence networks." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-4", "text": "However, it is still vague how the model transfers knowledge across languages, as well as if and which information is shared." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-5", "text": "To investigate this, we propose a set of data-dependent experiments using an existing encoder-decoder recurrent neural network for the task." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-6", "text": "Our results show that indeed the performance gains surpass a pure regularization effect and that knowledge about language and morphology can be transferred." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-7", "text": "----------------------------------" }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-9", "text": "Neural sequence-to-sequence models define the state of the art for paradigm completion (Cotterell et al., 2016 (Cotterell et al., , 2017 Kann and Sch\u00fctze, 2016) , the task of generating inflected forms of a lemma's paradigm, e.g., filling the empty fields in Table 1 using one of the non-empty fields." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-10", "text": "However, those models are in general very datahungry, and do not reach good performances in low-resource settings." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-11", "text": "Therefore, Kann et al. 
(2017) propose to leverage morphological knowledge from a high-resource language (source language) to improve paradigm completion in a closely related language with insufficient resources (target language)." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-12", "text": "This is achieved by a form of multi-task learning -they train an encoder-decoder model simultaneously on training examples for both languages." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-13", "text": "While closely related languages seem to help more than distant ones, the mechanisms by which this transfer works remain largely obscure." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-14", "text": "Several possibilities exist: (i) learning of target tag specific word transformations from the high-resource language (trans); (ii) training of the character language model of the decoder (LM); (iii) learning a bias to copy a large part of the input (copy), since members of the same paradigm mostly share the same stem; (iv) a general regularization effect obtained by multitask training (reg)." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-15", "text": "In this work, we intend to shed light on the way cross-lingual transfer learning for paradigm completion with an encoder-decoder model works, and will especially focus on the role of the character and tag embeddings." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-16", "text": "In particular, we aim to answer the following questions: (i) What does the neural model learn from the tags of a high-resource language for the tags of a low-resource language? (ii) Is sharing an alphabet important for the transfer? (iii) How much of the transfer learning can be reduced to a regularization effect achieved by multi-task learning?" }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-17", "text": "For our analysis, we present a set of detailed experiments for the target language Spanish [ES] on the paradigm completion task in the low-resource language."
}, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-18", "text": "----------------------------------" }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-19", "text": "**TRANSFER LEARNING FOR PARADIGM COMPLETION**" }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-20", "text": "In this section, we describe cross-lingual transfer learning for morphology and the model used for it." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-21", "text": "Cross-lingual transfer." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-22", "text": "Transfer learning for paradigm completion is much more languagespecific than most semantic natural language processing tasks, like entity typing or machine translation." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-23", "text": "An extreme example is the infeasible task of transferring morphological knowledge from Chinese to Portuguese as Chinese does not make use of inflection at all." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-24", "text": "Even between two morphologically rich languages transfer is difficult if they are unrelated, since inflections often mark dissimilar subcategories and word forms do not share similarities." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-25", "text": "However, Kann et al. (2017) show that transferring morphological knowledge from Spanish to Portuguese, two languages with similar morphology and 89% lexical similarity, works well and, more surprisingly, even supposedly very different languages like Arabic and Spanish can benefit from each other." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-26", "text": "They make this possible by training an encoder-decoder model and appending a special tag (i.e., embedding) for each language to the input of the system, similar to (Johnson et al., 2016) ." 
}, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-27", "text": "It is currently unclear, though, what the nature of this transfer is, motivating our work which explores this in more detail." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-28", "text": "Model description." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-29", "text": "The model Kann et al. (2017) use and we explore in more detail here is an encoder-decoder recurrent neural network (RNN) with attention (Bahdanau et al., 2015) ." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-30", "text": "It is trained on maximizing the following log-likelihood:" }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-31", "text": "We denote the source training examples as D s and the target training examples as D s ." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-32", "text": "w s represents Figure 1 : Overview of an encoder-decoder RNN, mapping the Spanish lemma so\u00f1ar to the target form sue\u00f1a." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-33", "text": "The thickness of the arrows towards the circled plus symbol corresponds to each attention weight." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-34", "text": "All tags in the input are omitted." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-35", "text": "a lemma in a high-resource source language s and w t represents a lemma in a low-resource target language t ." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-36", "text": "k represents a given slot in the paradigm and f k [w ] is the inflected form of w corresponding to the morphological tag t k ." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-37", "text": "The parameters \u03b8 of the model are tied for both the high-resource language and the low-resource language to enable transfer learning." 
}, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-38", "text": "In detail, a bidirectional gated RNN is used to encode the input sequence, which consists of a language tag, morphological tags and characters of the input language." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-39", "text": "The decoder generates the output sequence from the characters of the same language, and consists of a unidirectional RNN with an attention mechanism over the encoder hidden states." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-40", "text": "Notably, the elements of the input and the output are represented by embeddings living in separate spaces." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-41", "text": "Hyperparameters." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-42", "text": "Encoder and decoder RNNs have 100 hidden units and we use 300-dimensional embeddings." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-43", "text": "We train using ADADELTA (Zeiler, 2012) with minibatch size 20." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-44", "text": "All models for all experiments are trained for a maximum of 150 epochs." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-48", "text": "In order to answer the questions raised in the introduction, we conduct the following experiments." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-49", "text": "----------------------------------" }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-50", "text": "**DATA**" }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-51", "text": "We use the Romance and Arabic language data from Kann et al. (2017) ." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-52", "text": "In particular, each training file contains 12, 000 high-resource examples mixed with 50 or 200 fixed Spanish instances." 
}, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-53", "text": "We" }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-54", "text": "----------------------------------" }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-55", "text": "**EXPERIMENTS**" }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-56", "text": "Letter cipher (l-ciph)." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-57", "text": "Let C = C low \u222a C high be the union of the sets of all characters in the alphabets of the low-resource language and the high-resource language, respectively." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-58", "text": "1 We define a bijective cipher function f ciph : C \u2192 C, mapping each character to a different character, chosen at random." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-59", "text": "Then, we apply this function to the elements of the input and output words in the high-resource language and train the model on this modified data." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-60", "text": "The low-resource samples in train, dev and test remain unchanged." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-61", "text": "We expect this to have the following effects: (i) languages do not share affixes anymore; (ii) as we use the same embeddings for the changed and unchanged characters, the model might learn wrong affixes for tags; (iii) an incorrect character language model could be learned; and (iv) a general bias to copy should remain unchanged." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-62", "text": "Tag cipher (t-ciph)." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-63", "text": "We further consider the union of the sets of all morphological tags existing in the low-and high-resource languages: T = T low \u222a T high ." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-64", "text": "We define a bijective cipher function f ciph : T \u2192 T ." 
}, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-65", "text": "We then apply this function to all tags in the high-resource language input and train a new model." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-66", "text": "The low-resource examples in train, dev and test are not changed." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-67", "text": "We expect this to: (i) disturb the learning of correspondences between target tags and output characters; (ii) not influence anything else." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-68", "text": "Language-dependent letter embeddings (lemb)." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-69", "text": "We now use different embeddings for the characters of the two languages." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-70", "text": "This corresponds to a setting where the source and target languages do not share the same vocabulary." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-71", "text": "This should result in: (i) making it impossible for the model to learn which affixes have to be produced for which tag, maybe resulting in benefits for more distant and worse performance for extremely close languages; and (ii) transfer of the decoder's character language model getting impossible." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-72", "text": "Language-dependent tag embedding (t-emb)." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-73", "text": "Additionally, we also experiment with different embeddings for the morphological tags in different languages." 
}, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-74", "text": "We expect the following to happen: (i) the model can learn a character language model in the output, which might be good for related and bad for more distant languages; (ii) it should not be possible for the model to learn a correspondence between tags and characters in the output sequence; and (iii) the model cannot get information about tags in the low-resource language from the high-resource language's examples." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-75", "text": "We additionally perform two last experiments: Language-dependent letter embeddings with separation symbol (l-emb-sep)." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-76", "text": "This is the same as l-emb, but we introduce a new separation symbol SEP between the tags and the characters, solving the problem that it is not clear where the tag ends and the word starts." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-77", "text": "We expect equal or better performance than for l-emb." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-78", "text": "Language-dependent tag embedding with separation symbol (t-emb-sep)." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-79", "text": "This is equivalent to t-emb, but we again insert a new separation symbol SEP between the tags and the input word's characters." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-80", "text": "We expect equal or better performance than for t-emb." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-81", "text": "----------------------------------" }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-82", "text": "**INTUITION**" }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-83", "text": "In Table 3 we display an overview of which of the working mechanisms of cross-lingual transfer learning we expect to be effected by which changes to the high-resource training data." 
}, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-84", "text": "Depending on the relationship between the source and the target language, e.g., whether they use the same affixes to express the same morphosyntactic properties, we anticipate stronger or weaker effects." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-85", "text": "The regularization effect should not be influenced by our changes to the data." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-86", "text": "----------------------------------" }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-87", "text": "**RESULTS AND ANALYSIS**" }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-88", "text": "For the low-resource training set of size 50, the models with the original setup and without transfer perform best and worst, respectively." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-89", "text": "However, Accuracy ( for low-resource training size 200, t-emb-sep performs best in most case, and without transfer still performs worst." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-90", "text": "The order of the accuracies averaged over all languages can be seen in Figure 2 : original > t-emb-sep > t-emb > t-ciph > lemb-sep > l-emb > l-ciph for 50 and t-emb-sep > l-emb-sep > l-emb > t-emb > original > t-ciph > l-ciph for 200 low-resource examples." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-91", "text": "The detailed results of each language can be found in Table 4 ." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-92", "text": "First, this shows clearly that the character embeddings are more important for the task than the tag embeddings." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-93", "text": "Second, l-emb (resp." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-94", "text": "t-emb) and l-ciph (resp." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-95", "text": "t-ciph) correspond to a setting with no additional information vs. 
a setting with potentially wrong information." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-96", "text": "Generally higher accuracies for separate embedding spaces indicate that the model can learn incorrect information via transfer." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-97", "text": "Thus, the choice of the source language seems to be very important." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-98", "text": "The differences in performance between original and l-emb represent the influence of shared vs. separate embedding spaces, i.e., vocabularies in the case of the letters." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-99", "text": "Sharing a vocabulary seems to influence the final accuracy considerably, and more positively for 50 low-resource examples." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-100", "text": "We can explain this with the model learning to copy -it has no intrinsic way of knowing which input character equals which output character in the vocabulary unless it has seen it at least once." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-101", "text": "However, for 200 Spanish examples, we can expect all characters to appear in the Spanish training data, such that the character language model and tag-output correspondence get more important." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-102", "text": "This explains the unexpected result that l-emb performs best for Arabic (200) and Portuguese (200): both source languages potentially confuse the language model; for Portuguese we attribute this to a large overlap of lemmata in the two languages with Portuguese often inflecting in a different way (Kann et al., 2017) ." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-103", "text": "Further, the differences in performance between original and t-emb show that the model indeed learns information from the tags, supposedly which output sequence is more likely to appear with which tag."
}, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-104", "text": "The l-emb-sep and t-emb-sep results show that a separation symbol clearly improves the model's performance." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-105", "text": "----------------------------------" }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-106", "text": "**RELATED WORK**" }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-107", "text": "Transfer learning with encoder-decoder networks." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-108", "text": "Encoder-decoder RNNs were introduced by Cho et al. (2014) and Sutskever et al. (2014) and extended by an attention mechanism by Bahdanau et al. (2015) ." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-109", "text": "Lately, much work was done on multi-task learning and transfer learning with encoder-decoder RNNs." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-110", "text": "Luong et al. (2015) investigated multi-task setups for sequence-to-sequence learning, combining multiple encoders and decoders." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-111", "text": "In contrast, in our experiments, we use only one encoder and one decoder." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-112", "text": "There exists much work on multi-task learning with encoderdecoder RNNs for machine translation (Johnson et al., 2016; Dong et al., 2015; Firat et al., 2016; Ha et al., 2016) ." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-113", "text": "Alonso and Plank (2016) explored multi-task learning empirically, analyzing when it improves performance." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-114", "text": "Here, we focus on how transfer via multi-task learning works." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-115", "text": "Paradigm completion." 
}, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-116", "text": "SIGMORPHON hosted two shared tasks on paradigm completion (Cotterell et al., 2016 (Cotterell et al., , 2017 , in order to encourage the development of systems for the task." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-117", "text": "One approach is to treat it as a string transduction problem by applying an alignment model with a semi-Markov model (Durrett and DeNero, 2013; Nicolai et al., 2015) ." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-118", "text": "Recently, neural sequenceto-sequence models are also widely used (Faruqui et al., 2016; Kann and Sch\u00fctze, 2016; Aharoni and Goldberg, 2017; Zhou and Neubig, 2017) ." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-119", "text": "All the above mentioned work were designed for one single language." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-120", "text": "----------------------------------" }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-121", "text": "**CONCLUSION**" }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-122", "text": "We conducted a set of experiments to explore the mechanisms behind cross-lingual transfer learning for morphological reinflection." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-123", "text": "Our findings indicate that knowledge about a language's typical character sequences and outputs for certain morphological tags can be transferred." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-124", "text": "In particular, this means that the effect cannot be reduced to sole regularization." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-125", "text": "Chunting Zhou and Graham Neubig. 2017." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-126", "text": "Multispace variational encoder-decoders for semisupervised labeled sequence transduction." }, { "sent_id": "28900a293048fdb0c40dc540985cf1-C001-127", "text": "In ACL." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "28900a293048fdb0c40dc540985cf1-C001-10", "28900a293048fdb0c40dc540985cf1-C001-11", "28900a293048fdb0c40dc540985cf1-C001-25", "28900a293048fdb0c40dc540985cf1-C001-9" ], [ "28900a293048fdb0c40dc540985cf1-C001-29" ], [ "28900a293048fdb0c40dc540985cf1-C001-102" ], [ "28900a293048fdb0c40dc540985cf1-C001-116" ] ], "cite_sentences": [ "28900a293048fdb0c40dc540985cf1-C001-11", "28900a293048fdb0c40dc540985cf1-C001-25", "28900a293048fdb0c40dc540985cf1-C001-9", "28900a293048fdb0c40dc540985cf1-C001-29", "28900a293048fdb0c40dc540985cf1-C001-102", "28900a293048fdb0c40dc540985cf1-C001-116" ] }, "@MOT@": { "gold_contexts": [ [ "28900a293048fdb0c40dc540985cf1-C001-10", "28900a293048fdb0c40dc540985cf1-C001-11", "28900a293048fdb0c40dc540985cf1-C001-13", "28900a293048fdb0c40dc540985cf1-C001-9" ] ], "cite_sentences": [ "28900a293048fdb0c40dc540985cf1-C001-11", "28900a293048fdb0c40dc540985cf1-C001-9" ] }, "@USE@": { "gold_contexts": [ [ "28900a293048fdb0c40dc540985cf1-C001-51" ] ], "cite_sentences": [ "28900a293048fdb0c40dc540985cf1-C001-51" ] } } }, "ABC_ce8997b630e9544b0f5812be319a59_26": { "x": [ { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-34", "text": "----------------------------------" }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-35", "text": "**METHOD**" }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-2", "text": "Existing recurrent neural language models often fail to capture higher-level structure present in text: for example, rhyming patterns present in poetry." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-3", "text": "Much prior work on poetry generation uses manually defined constraints which are satisfied during decoding using either specialized decoding procedures or rejection sampling." 
}, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-4", "text": "The rhyming constraints themselves are typically not learned by the generator." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-5", "text": "We propose an alternate approach that uses a structured discriminator to learn a poetry generator that directly captures rhyming constraints in a generative adversarial setup." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-6", "text": "By causing the discriminator to compare poems based only on a learned similarity matrix of pairs of line ending words, the proposed approach is able to successfully learn rhyming patterns in two different English poetry datasets (Sonnet and Limerick) without explicitly being provided with any phonetic information." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-7", "text": "----------------------------------" }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-9", "text": "Many existing approaches to text generation rely on recurrent neural networks trained using likelihood on sequences of words or characters." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-10", "text": "However, such models often fail to capture overall structure and coherency in multi-sentence or longform text Holtzman et al., 2018) ." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-11", "text": "To rectify this, prior work has proposed losses which encourage overall coherency or other desired behavior (Li et al., 2016; Zhang and Lapata, 2017; ." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-12", "text": "However, most of these approaches rely on manually provided definitions of what constitutes a good or suitable structure, thereby limiting their applicability." 
}, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-13", "text": "In this paper we propose a method for English poetry generation that directly learns higher-level rhyming constraints as part of a generator without requiring strong manual intervention." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-14", "text": "Prior works on poetry generation have focused mostly on ad-hoc decoding procedures to generate reasonable poetry, often relying on pruning from a set of candidate outputs to encourage desired behavior such as presence of explicitly-defined rhyming patterns (Oliveira, 2017; Ghazvininejad et al., 2018) ." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-15", "text": "We propose an adversarial approach to poetry generation that, by adding structure and inductive bias into the discriminator, is able to learn rhyming constraints directly from data without prior knowledge." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-16", "text": "The role of the discriminator is to try to distinguish between generated and real poems during training." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-17", "text": "We propose to add inductive bias via the choice of discriminator architecture: We require the discriminator to reason about poems through pairwise comparisons between line ending words." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-18", "text": "These learned word comparisons form a similarity matrix for the poem within the discriminator's architecture." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-19", "text": "Finally, the discriminator evaluates the poem through a 2D convolutional classifier applied directly to this matrix." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-20", "text": "This final convolution is naturally biased to identify spatial patterns across word comparisons, which, in turn, biases learned word comparisons to pick up rhyming since rhymes are typically the most salient spatial patterns." 
}, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-21", "text": "Recent work by Lau et al. (2018) proposes a quatrain generation method that relies on specific domain knowledge about the dataset to train a classifier for learning the notion of rhyming: that a line ending word always rhymes with exactly one more ending word in the poem." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-22", "text": "This limits the applicability of their method to other forms of poetry with different rhyming patterns." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-23", "text": "They train the classifier along with a language model in a multi-task setup." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-24", "text": "Further, at generation time, they heavily rely on rejection sampling to produce quatrains which satisfy any valid rhyming pattern." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-25", "text": "In contrast, we find that generators trained using our structured adversary produce samples that satisfy rhyming constraints with much higher frequency." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-26", "text": "We propose a structured discriminator to learn a poetry generator in a generative adversarial setup." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-27", "text": "Similarities between pairs of end-of-line words are obtained by computing cosine similarity between their corresponding representations, produced by a learned character-level LSTM encoder." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-28", "text": "The discriminator operates on the resulting matrix S representing pair-wise similarities of end words." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-29", "text": "The proposed discriminator learns to identify rhyming word pairs as well as rhyming constraints present in the dataset without being provided phonetic information in advance." 
}, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-30", "text": "Our main contributions are as follows: We introduce a novel structured discriminator to learn a poetry generation model in a generative adversarial setup." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-31", "text": "We show that the discriminator induces an accurate rhyming metric and the generator learns natural rhyming patterns without being provided with phonetic information." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-32", "text": "We successfully demonstrate the applicability of our proposed approach on two datasets with different structural rhyming constraints." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-33", "text": "Our poem generation model learned with the structured discriminator is more sampling efficient compared to prior work -many fewer generation attempts are required in order to obtain an valid sample which obeys the rhyming constraints of the corresponding poetry dataset." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-36", "text": "Many forms of poetry make use of rhyming patterns on line-ending words (Oliveira, 2017) ." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-37", "text": "Therefore, to characterize a rhyming poem, a model needs (1) to know what it means to rhyme (2) to identify the specific permissible rhyming patterns for a particular poem type." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-38", "text": "For example, a limerick is a 5 line poem with a rhyming constraint of the type AABBA, i.e. the ends of the first, second, and fifth lines rhyme." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-39", "text": "We consider an adversarial learning setup with a hierarchical language model and a structured discriminator, such that the discriminator is trained to distinguish between generated examples and training examples, and the generator is trained to fool the discriminator." 
}, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-40", "text": "Our novel structured discriminator operates on a matrix which encodes a learned pair-wise similarity function of the line ending words." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-41", "text": "We refer to our model as RHYME-GAN." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-42", "text": "----------------------------------" }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-43", "text": "**NEURAL GENERATION MODEL**" }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-44", "text": "Our generator is a hierarchical neural language model ( Figure 1 ) that first generates a sequence of line-ending words, and thereafter generates the poem's lines conditioned on the ending words." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-45", "text": "We use recurrent neural networks for ending word generation as well line generation conditioned on ending words." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-46", "text": "Following prior work (Lau et al., 2018), we generate words in each line in reverse order (i.e. right to left), and begin generation with the last line first." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-47", "text": "Letx represent a sample from the current generator distribution, denoted by p \u03b8 , where \u03b8 represents the generator parameters." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-48", "text": "We initialize the word embeddings in the generator with pre-trained word embeddings (Lau et al., 2018) trained on a separate non-sonnet corpus." 
}, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-49", "text": "----------------------------------" }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-50", "text": "**STRUCTURED DISCRIMINATOR**" }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-51", "text": "We introduce a structured discriminator, denoted by function f \u03c6 (x), which outputs the probability that x is a sample from the dataset as opposed to generated." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-52", "text": "Our architecture defines an intermediate matrix S \u2208 R T \u00d7T , where T denotes the number of lines in the poem, which encodes pair-wise similarities between line ending words in order to capture rhyming structure." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-53", "text": "The discriminator's output is determined by a two layer 2D convolutional neural network applied to S. Convolutional neural networks have been shown to capture local as well as global patterns in 2D data -for example, images." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-54", "text": "Thus, our discriminator is composed of two main components: computation of a matrix S, and a convolutional neural network to classify the computed matrix S. The pair-wise computation provides a useful inductive bias to identify the notion of rhyming, whereas the convolutional network is a suitable choice to capture overall rhyming patterns." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-55", "text": "More specifically, let the words at the ends of lines in x be denoted by e. The number of ending words will be same as the number of lines in x, which we denote as T ." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-56", "text": "We encode each ending word using a character-level LSTM (Hochreiter and Schmidhuber, 1997) denoted by g \u03c6g , and use the last hidden state of the LSTM as a vector representation of the word." 
}, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-57", "text": "We let S ij be the cosine similarity between the representations of ending words e i , e j , given by following equation:" }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-58", "text": "The matrix S is passed through a convolutional neural network composed with a linear layer, together denoted by c \u03c6c ." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-59", "text": "The final output is passed through a sigmoid non-linearity, so that f \u03c6 (x) \u2208 [0, 1]." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-60", "text": "The value of f \u03c6 (x) represents the discriminator's assessment of the probability that datum x belongs to the real dataset, rather than being a generated sample." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-61", "text": "The discriminator's objective will train it to distinguish between a sample x from training data X , and a generated samplex, in a binary classification setup." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-62", "text": "Specifically, we define the discriminator loss for x,x as follows:" }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-63", "text": "----------------------------------" }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-64", "text": "**LEARNING**" }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-65", "text": "Generator parameters \u03b8 and discriminator parameters \u03c6 are trained together under following objective:" }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-66", "text": "Note, in addition to using a traditional adversarial objective, we also include a likelihood term to help stabilize the generator." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-67", "text": "\u03bb is a hyperparameter which controls the relative weight of the two terms." 
}, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-68", "text": "Since sampling ofx from generator involves discrete choices, we use the REINFORCE (Williams, 1992) algorithm to train the generator using learning signal from the adversarial loss term." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-69", "text": "The generator simultaneously gets an exact gradient from the likelihood portion of the objective." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-70", "text": "We observe training is more stable when we pretrain the LSTM word encoder g \u03c6g (.) part of the discriminator, along with a separate LSTM decoder, using an auto-encoding objective on words in the vocabulary." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-71", "text": "----------------------------------" }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-72", "text": "**EXPERIMENTS AND RESULTS**" }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-73", "text": "----------------------------------" }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-74", "text": "**DATASETS**" }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-75", "text": "We work with the Shakespeare SONNET dataset (Lau et al., 2018 ) and a new LIMERICK corpus." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-76", "text": "Each sonnet in the Sonnet dataset is made up of 3 quatrains of 4 lines each, and a couplet." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-77", "text": "The dataset consists of 2685 sonnets in train, and 335 each in validation and test splits." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-78", "text": "The quatrains typically have one of the following rhyming structures: AABB, ABAB, ABBA, though some deviations are observed in the dataset." 
}, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-79", "text": "This may be because rhyming patterns are not always strictly followed in writing quatrains, and there are possible inaccuracies in the word pronunciation dictionaries used (e.g. some words can have multiple different pronunciations based on context)." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-80", "text": "A limerick is a form of verse with five lines." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-81", "text": "Limericks typically follow a rhyming pattern of AABBA." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-82", "text": "We collect limericks from an online collection 1 ." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-83", "text": "Due to a large vocabulary in the full collection, we filter the dataset to retain only those limericks whose all the words are in a subset of 9K most frequent words." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-108", "text": "Our generator implementation is largely based on that of Lau et al. (2018) ." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-109", "text": "The main difference is that we first generate all the line-ending words and then condition on them to generate the remaining words." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-110", "text": "The change was made to make it more amenable to our proposed discriminator." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-111", "text": "However, our hierarchical language model (RHYME-LM) performs worse than DEEP-SPEARE as per sampling efficiency." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-112", "text": "Therefore, structured discriminator is the driving factor behind the observed improvement with RHYME-GAN." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-113", "text": "However, committing to the ending words of all lines before completing preceding lines can be a limitation, and addressing it is a possible future direction." 
}, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-114", "text": "----------------------------------" }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-115", "text": "**ANALYZING LEARNED DISCRIMINATOR**" }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-116", "text": "We probe the the word representations g(.) to check if rhyming words are close-by in the learned manifold." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-117", "text": "We consider all pairs of words among the ending words in a quatrain/limerick, and label each pair to be rhyming or non-rhyming based on previously stated definition of rhyming." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-118", "text": "If the cosine similarity score between the representations of pairs of words is above a certain threshold, we predict that word pair as rhyming, else it is predicted as non-rhyming." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-119", "text": "We report F1 scores for the binary classification setup of predicting word-pairs to be rhyming or not." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-120", "text": "We consider some additional baselines: RHYM-EM (Reddy and Knight, 2011) uses latent variables to model rhyming schemes, and train parameters using EM." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-121", "text": "GRAPHEME-K baselines predict a word pair as rhyming only if the last K = {1, 2, 3} characters of the two words are same." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-122", "text": "For SONNET data, we observe that RHYME-GAN obtains a F1 score of 0.90 (Table 3) on the test split (threshold chosen to maximize f1 on dev split)." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-123", "text": "We repeat the above analysis on the LIMERICK dataset and observe an F1 of 0.92 for RHYME-GAN." 
}, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-124", "text": "DEEP-SPEARE model reports F1 Table 3 : Rhyming probe: We use the cosine similarity score of the learned representations to predict a word pair as rhyming or not, and report F1 score for this classification task." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-125", "text": "RHYM-EM is an unsupervised rhyming pattern discovery method." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-126", "text": "GRAPHEME-K baselines predict based on exact match of last k characters." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-127", "text": "of 0.91 on SONNET." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-128", "text": "As stated earlier, DEEP-SPEARE'S model is not amenable to LIMERICKwe do compare though with the max-margin classifier in DEEP-SPEARE model trained on LIMER-ICK which gets F1 score of 0.81." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-129", "text": "The scores are understandably lower since the AABBA pattern in limericks is not amenable to SONNET specific assumptions made in DEEP-SPEARE model." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-130", "text": "On the other hand, RHYME-GAN achieves high F1 scores for both the datasets without incorporating any domain specific rhyming pattern information." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-131", "text": "RHYME-GAN performs much better than RHYM-EM and GRAPHEME-K baselines." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-132", "text": "RHYM-EM does not perform well -probably because it operates at word-level and fails to generalize." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-133", "text": "Note that RHYME-GAN-NS gets F1 score of 0.85 in case of SONNET dataset and 0.87 for LIMER-ICK." 
}, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-134", "text": "These values are lower than corresponding scores for RHYME-GAN, demonstrating that the proposed structure in the discriminator was useful in learning the notion of rhyming." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-135", "text": "----------------------------------" }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-136", "text": "**HUMAN EVALUATIONS**" }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-137", "text": "Following prior work (Lau et al., 2018) , we requested human annotators to identify the humanwritten poem when presented with two samples at a time -a quatrain from the Sonnet corpus and a machine-generated quatrain, and report the annotator accuracy on this task." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-138", "text": "Note that a lower accuracy value is favorable as it signifies higher quality of machine-generated samples." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-139", "text": "Using 150 valid samples (i.e. samples belonging to one of the allowed rhyming patterns), we observe 56.0% annotator accuracy for RHYME-GAN, and 53.3% for DEEP-SPEARE -indicating that the post-rejection sampling outputs from the two methods are of comparable quality (the difference in annotator accuracy is not statistically significant as per McNemar's test under p < 0.05)." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-140", "text": "If we use pre-rejection samples, we observe 60.0% annotator accuracy for RHYME-GAN, and 81.3% for DEEP-SPEARE (the difference being statistically significant as per McNemar's test under p < 0.05) -indicating that unfiltered samples from RHYME-GAN are of higher quality compared to DEEP-SPEARE." 
}, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-141", "text": "----------------------------------" }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-142", "text": "**RELATED WORK**" }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-143", "text": "Early works on poetry generation mostly used rule based methods (Gerv\u00e1s, 2000; Wu et al., 2009; Oliveira, 2017) ." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-144", "text": "More recently, neural models for poetry generation have been proposed (Zhang and Lapata, 2014; Ghazvininejad et al., 2016 Ghazvininejad et al., , 2017 Hopkins and Kiela, 2017; Lau et al., 2018; Liu et al., 2019) ." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-145", "text": "Yan et al. (2013) retrieve high ranking sentences for a given user query, and repeatedly swap words to satisfy poetry constraints." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-146", "text": "Ghazvininejad et al. (2018) worked on poetry translation using an unconstrained machine translation model and separately learned Finite State Automata for enforcing rhythm and rhyme." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-147", "text": "Similar to rhyming and rhythm patterns in poetry, certain types of musical compositions showcase rhythm and repetition patterns, and some prior works model such patterns in music generation (Walder and Kim, 2018; Jhamtani and BergKirkpatrick, 2019) ." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-148", "text": "Generative adversarial learning (Goodfellow et al., 2014) for text generation has been used in prior works (Fedus et al., 2018; Wang et al., 2018 Wang et al., , 2019 Rao and Daum\u00e9 III, 2019) , though has not been explored with regard to the similarity structure proposed in this paper." 
}, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-149", "text": "----------------------------------" }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-150", "text": "**CONCLUSIONS**" }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-151", "text": "In this paper we have proposed a novel structured discriminator to learn a poem generator." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-152", "text": "The generator learned utilizing the structured adversary is able to identify rhyming structure patterns present in data, as demonstrated through the improved sampling efficiency." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-153", "text": "Through the rhyming classification probe, we demonstrate that the proposed discriminator is better at learning the notion of rhyming compared to baselines." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-84", "text": "Our final dataset consists of 10, 400 limericks in train and 1300 each in validation and test splits." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-85", "text": "We train and evaluate the models separately on each corpus." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-86", "text": "----------------------------------" }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-87", "text": "**POEM GENERATOR**" }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-88", "text": "Sampling efficiency We compute the expected number of samples needed before we sample a quatrain which satisfies one of the hand-defined rhyming patterns." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-89", "text": "Towards this end, we get 10K samples from each model without any constraints (except avoiding UNK -unknown tokens)." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-90", "text": "Fol-lowing prior work (Lau et al., 2018) , words are sampled with a temperature value between 0.6 and 0.8." 
}, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-91", "text": "We use CMU dictionary (Weide, 1998) to look up the phonetic representation of a word, and extract the sequence of phonemes from the last stressed syllable onward." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-92", "text": "Two words are considered to be rhyming if their extracted sequences match (Parrish, 2015) ." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-93", "text": "We consider a generated quatrain to have an acceptable pattern if the four ending words follow one of the three rhyming patterns: AABB, ABBA, ABAB." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-94", "text": "Similarly for LIM-ERICK, we consider only those samples to be acceptable which have line endings of the rhyming form AABBA." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-95", "text": "We consider a baseline RHYME-LM, which has the same generator architecture as RHYME-GAN but is trained without the discriminator." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-96", "text": "We also compare with RHYME-GAN-NS which uses a simpler non-structured discriminator." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-97", "text": "Specifically, it uses a discriminator which first runs a characterlevel encoder for each ending word -similar to RHYME-GAN -but then instead of computing pairwise similarity matrix, it utilizes a LSTM on the sequence of the computed representations." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-98", "text": "As can be observed from Table 1 , RHYME-GAN needs fewer samples than other methods to produce an acceptable quatrain or a limerick, indicating that it has learned natural rhyming structures more effectively from data." 
}, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-99", "text": "Note we do not report DEEP-SPEARE on Limerick due to their SONNET specific assumption that for a given end-of-line word there is exactly one more rhyming word among other end-of-line words." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-100", "text": "Additionally, RHYME-GAN-NS performs worse than RHYME-GAN, and the difference in performance is more prominent in LIMERICK -demonstrating that the proposed structure in the discriminator provided useful inductive bias." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-101", "text": "Note that compared to 4 line quatrains in SONNET, LIMERICK has 5 line poems and has arguably more complex rhyming pattern constraints." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-102", "text": "Likelihood on held out data We report negative log likelihood (NLL) on test splits (Table 2) ." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-103", "text": "For SONNET, RHYME-GAN achieves a test set NLL of 3.98." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-104", "text": "Our model without adversarial learning i.e. RHYME-LM, achieves a test set NLL of 3.97." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-105", "text": "DEEP-SPEARE reports a test set NLL of 4.38." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-106", "text": "Note that our language model is hierarchical while DEEP-SPEARE has a linear model." }, { "sent_id": "ce8997b630e9544b0f5812be319a59-C001-107", "text": "The NLL for RHYME-LM and RHYME-GAN are very similar, though RHYME-GAN gets much better sampling efficiency scores than RHYME-LM." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "ce8997b630e9544b0f5812be319a59-C001-21" ], [ "ce8997b630e9544b0f5812be319a59-C001-144" ] ], "cite_sentences": [ "ce8997b630e9544b0f5812be319a59-C001-21", "ce8997b630e9544b0f5812be319a59-C001-144" ] }, "@MOT@": { "gold_contexts": [ [ "ce8997b630e9544b0f5812be319a59-C001-21", "ce8997b630e9544b0f5812be319a59-C001-22" ] ], "cite_sentences": [ "ce8997b630e9544b0f5812be319a59-C001-21" ] }, "@DIF@": { "gold_contexts": [ [ "ce8997b630e9544b0f5812be319a59-C001-21", "ce8997b630e9544b0f5812be319a59-C001-23", "ce8997b630e9544b0f5812be319a59-C001-25" ] ], "cite_sentences": [ "ce8997b630e9544b0f5812be319a59-C001-21" ] }, "@USE@": { "gold_contexts": [ [ "ce8997b630e9544b0f5812be319a59-C001-46" ], [ "ce8997b630e9544b0f5812be319a59-C001-48" ], [ "ce8997b630e9544b0f5812be319a59-C001-75" ], [ "ce8997b630e9544b0f5812be319a59-C001-90" ], [ "ce8997b630e9544b0f5812be319a59-C001-108" ], [ "ce8997b630e9544b0f5812be319a59-C001-137" ] ], "cite_sentences": [ "ce8997b630e9544b0f5812be319a59-C001-46", "ce8997b630e9544b0f5812be319a59-C001-48", "ce8997b630e9544b0f5812be319a59-C001-75", "ce8997b630e9544b0f5812be319a59-C001-90", "ce8997b630e9544b0f5812be319a59-C001-108", "ce8997b630e9544b0f5812be319a59-C001-137" ] }, "@EXT@": { "gold_contexts": [ [ "ce8997b630e9544b0f5812be319a59-C001-108", "ce8997b630e9544b0f5812be319a59-C001-109" ] ], "cite_sentences": [ "ce8997b630e9544b0f5812be319a59-C001-108" ] } } }, "ABC_cb81d56412d1e800074777687fb45a_26": { "x": [ { "sent_id": "cb81d56412d1e800074777687fb45a-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-2", "text": "The task of semantic parsing is highly useful for dialogue and question answering systems." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-3", "text": "Many datasets have been proposed to map natural language text into SQL, among which the recent Spider dataset provides crossdomain samples with multiple tables and complex queries." 
}, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-4", "text": "We build a Spider dataset for Chinese, which is currently a low-resource language in this task area." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-30", "text": "Existing datasets for semantic parsing can be classified into two major categories." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-5", "text": "Interesting research questions arise from the uniqueness of the language, which requires word segmentation, and also from the fact that SQL keywords and columns of DB tables are typically written in English." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-6", "text": "We compare character-and wordbased encoders for a semantic parser, and different embedding schemes." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-7", "text": "Results show that word-based semantic parser is subject to segmentation errors and cross-lingual word embeddings are useful for text-to-SQL." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-8", "text": "----------------------------------" }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-10", "text": "The task of semantic parsing is highly useful for tasks such as dialogue (Chen et al., 2013; Gupta et al., 2018; Einolghozati et al., 2019) and question answering (Gildea and Jurafsky, 2002; Yih et al., 2015; Reddy et al., 2016) ." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-11", "text": "Among a wide range of possible semantic representations, SQL offers a standardized interface to knowledge bases across tasks (Astrova, 2009; Xu et al., 2017; Dong and Lapata, 2018; Lee et al., 2011) ." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-12", "text": "Recently, Yu et al. (2018b) released a manually labelled dataset for parsing natural language questions into complex SQL, which facilitates related research." 
}, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-13", "text": "Yu et al. (2018b)'s dataset is exclusively for English questions." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-14", "text": "Intuitively, the same semantic parsing task can be applied cross-lingually, since SQL is a universal semantic representation and database interface." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-15", "text": "However, for languages other than English, there can be added difficulties in parsing into SQL." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-16", "text": "Take Chinese for example: the additional challenges are at least two-fold." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-17", "text": "First, structures of relational databases, in particular the names and column names of DB tables, are typically represented in English." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-18", "text": "This adds to the challenge of question-to-DB mapping." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-19", "text": "Second, the basic semantic units for denoting columns or cells can be words, but word segmentation can be erroneous." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-20", "text": "It is also interesting to study the influence of other linguistic characteristics of Chinese, such as zero pronouns, on its SQL parsing." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-21", "text": "We investigate parsing Chinese questions into SQL by creating a first dataset, and empirically evaluating a strong baseline model on the dataset." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-22", "text": "In particular, we translate the Spider (Yu et al., 2018b) dataset into Chinese." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-23", "text": "Using the model of Yu et al. (2018a), we compare several key model configurations."
}, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-24", "text": "Results show that our human-translated dataset is significantly more reliable compared to a dataset composed of machine-translated questions." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-25", "text": "In addition, the overall accuracy for Chinese SQL semantic parsing can be comparable to that for English." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-26", "text": "We found that cross-lingual word embeddings are useful for matching Chinese questions with English table columns and keywords and that language characteristics have a significant influence on parsing results." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-27", "text": "We release our dataset named CSpider and code at https://github.com/taolusi/chisp." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-28", "text": "----------------------------------" }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-29", "text": "**RELATED WORK**" }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-31", "text": "The first uses logic for semantic representation, including ATIS (Price, 1990; Dahl et al., 1994) and GeoQuery (Zelle and Mooney, 1996)." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-32", "text": "The second and dominant category of datasets uses SQL, which includes Restaurants (Tang and Mooney, 2001; Popescu et al., 2003), Academic (Iyer et al., 2017), Yelp and IMDB (Yaghmazadeh et al., 2017), Advising (Finegan-Dollak et al., 2018) and the recently proposed WikiSQL (Zhong et al., 2017) and Spider (Yu et al., 2018b)." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-33", "text": "One salient difference between Spider and prior work is that Spider uses different databases across domains for training and testing, which can verify the generalization power of a semantic parsing model."
}, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-34", "text": "Compared with WikiSQL, Spider further has multiple tables in each database and correspondingly more complex queries." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-35", "text": "We thus consider Spider for sourcing our dataset." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-36", "text": "Existing semantic parsing datasets for Chinese include a small corpus for assigning semantic roles (Sun and Jurafsky, 2004) and SemEval-2016 Task 9 for Chinese semantic dependency parsing (Che et al., 2012), but these data are not related to SQL." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-37", "text": "To our knowledge, we are the first to release a Chinese SQL semantic parsing dataset." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-38", "text": "There has been a line of work improving the model of Yu et al. (2018a) since the release of the Spider dataset (Guo et al., 2019; Lin et al., 2019)." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-39", "text": "At the time of our investigation, however, these models were not published." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-40", "text": "We thus chose the model of Yu et al. (2018a) as our baseline." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-41", "text": "The choice of other neural models is orthogonal to our dataset contribution, but could empirically give more insights into the conclusions." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-42", "text": "----------------------------------" }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-43", "text": "**DATASET**" }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-44", "text": "We translate all English questions in the Spider dataset into Chinese." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-45", "text": "The work is undertaken by 2 NLP researchers and 1 computer science student."
}, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-46", "text": "Each question is first translated by one annotator, and then cross-checked and corrected by a second annotator." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-47", "text": "Finally, a third annotator verifies the original and corrected versions." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-48", "text": "Statistics of the dataset are shown in Table 1." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-49", "text": "There are originally 10181 questions from Spider, but only 9691 for the training and development sets are publicly available." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-50", "text": "We thus translated only these sentences." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-51", "text": "Following the database split setting of Yu et al. (2018b), we split the training, development and test sets so that no database overlaps among them, as shown in Table 1." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-52", "text": "The translation work is performed on a database-by-database basis." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-53", "text": "For each database, the same translator translates the relevant questions sentence by sentence." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-54", "text": "The translator is asked to read the original question as well as the SQL query before making the Chinese translation." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-55", "text": "If a literal translation is possible, the translator is asked to stick to the original sentence style as much as feasible." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-56", "text": "For complex questions, the translator is allowed to rephrase the English question, so that the most natural Chinese translation is made."
}, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-57", "text": "In addition, we keep the diversity of style in the English dataset by matching different English expressions to different Chinese expressions." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-58", "text": "A sample of our dataset is shown in Table 2." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-59", "text": "Our dataset is named CSpider." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-60", "text": "----------------------------------" }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-61", "text": "**MODEL**" }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-62", "text": "We use the neural semantic parsing method of Yu et al. (2018a) as the baseline model, which can be regarded as a sequence-to-tree model." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-63", "text": "In particular, the input question is encoded using an LSTM sequence encoder, and the output is a SQL query in its syntactic tree form." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-64", "text": "The tree is generated incrementally top-down, in a pre-order traversal sequence." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-65", "text": "Tree nodes include keyword nodes (e.g., SELECT, WHERE, EXCEPT) and table column name nodes (e.g., ID, City, Surname, which are defined in specific tables), which are represented in respective embedding spaces." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-66", "text": "Each keyword or column is generated by attention to the embedding space using the question representation as a key." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-67", "text": "A stack is used for incremental decoding, where the whole output history is leveraged as a feature for deciding the next term." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-68", "text": "This method gave the best published results at the time this paper was submitted."
}, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-69", "text": "We provide a visualization of the model in Figure 1." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-70", "text": "----------------------------------" }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-71", "text": "**EXPERIMENTS**" }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-72", "text": "We focus on comparing different word segmentation methods and different embedding representations." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-73", "text": "As discussed above, column names are selected by attention over column embeddings using the sentence representation as a key." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-74", "text": "Hence there must be a link between the embeddings of columns and those of the questions." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-75", "text": "Since columns are written in English and questions in Chinese, we consider two embedding methods." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-76", "text": "The first method is to use two separate sets of embeddings for Chinese and English, respectively." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-77", "text": "We use GloVe (Pennington et al., 2014; https://nlp.stanford.edu/projects/glove/) for embeddings of English keywords, column names etc., and Tencent embeddings (Song et al., 2018; https://ai.tencent.com/ailab/nlp/embedding.html) for Chinese." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-78", "text": "The second method is to directly use the cross-lingual word embeddings." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-79", "text": "To this end, the Tencent multilingual embeddings are chosen, which contain both Chinese and English words in a multi-lingual embedding matrix." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-80", "text": "Evaluation Metrics." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-81", "text": "We follow Yu et al.
(2018b), evaluating the results using two major types of metrics." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-82", "text": "The first is exact matching accuracy, namely the percentage of questions that have exactly the same SQL output as its reference." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-83", "text": "The second is component matching F1, namely the F1 scores for SELECT, WHERE, GROUP BY, ORDER BY and all keywords, respectively." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-84", "text": "Hyperparameters." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-85", "text": "Our hyperparameters are mostly taken from Yu et al. (2018a), but tuned on the Chinese Spider development set." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-86", "text": "We use character and word embeddings from the Tencent embeddings; neither is fine-tuned during model training." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-87", "text": "Embedding sizes are set to 200 for both characters and words." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-88", "text": "For the keyword and column name embeddings, sizes are set to 200 and 300, respectively." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-89", "text": "Adam (Kingma and Ba, 2014) is used for optimization, with a learning rate of 1e-4." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-90", "text": "Dropout is used for the output of LSTM with a rate of 0.5." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-91", "text": "For word-based models, segmentation is necessary." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-92", "text": "We use two segmentors with different performance levels: the Jieba segmentor and the model of Yang et al. (2017), which we name Jieba and YZ, respectively."
}, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-93", "text": "To verify their accuracy, we manually segment the first 100 sentences from the test set." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-94", "text": "Jieba and YZ give F1 scores of 89.8% and 91.7%, respectively." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-95", "text": "----------------------------------" }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-96", "text": "**OVERALL RESULTS**" }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-97", "text": "The overall exact matching results are shown in Table 3." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-98", "text": "In this table, ENG represents the results of Yu et al. (2018a)'s model on their English dataset, but under our split." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-99", "text": "HT and MT denote human translation and machine translation of questions, respectively." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-100", "text": "Both HT and MT results are evaluated on human translated questions." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-101", "text": "C-ML and C-S denote the results of our Chinese models based on characters with multi-lingual embeddings and monolingual embeddings, respectively, while WY-ML and WY-S denote the word-based models applying the YZ segmentor with multi-lingual embeddings and monolingual embeddings, respectively." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-102", "text": "Finally, WJ-ML and WJ-S denote the word-based models with multi-lingual embeddings and monolingual embeddings using the Jieba segmentor, respectively." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-103", "text": "First, compared to the best results with human translation (C-ML and WY-ML), machine translation results show a large disadvantage (e.g. 7.1% vs 12.1% using C-ML)."
}, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-104", "text": "We further did a manual inspection of 100 randomly picked machine-translated sentences." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-105", "text": "Out of the 100 translated sentences ... Second, comparisons among C-ML, WY-ML and WJ-ML, and among C-S, WY-S and WJ-S show that multi-lingual embeddings give superior results compared to monolingual embeddings, which is likely because they bring a better connection between natural language questions and database columns." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-106", "text": "Third, comparisons between WY-ML and WJ-ML, and between WY-S and WJ-S, indicate that better segmentation accuracy has a significant influence on question parsing." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-107", "text": "Word-based methods are subject to segmentation errors." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-108", "text": "Moreover, with the current segmentation accuracy of 92%, a word-based model underperforms a character-based model." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-109", "text": "Intuitively, since words carry more direct semantic information as compared with database columns and keywords, improved segmentation may allow a word-based model to outperform a character-based model." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-110", "text": "Finally, for easy questions, the character-based model shows strong advantages over the word-based models." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-111", "text": "However, for medium to extremely hard questions, the trend becomes less obvious, which is likely because the intrinsic semantic complexity overwhelms the encoding differences." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-112", "text": "Our best Chinese system gives an overall accuracy of 12.1%, which is lower than but comparable to the English results."
}, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-113", "text": "This shows that for text-to-SQL, Chinese semantic parsing may not be significantly more challenging compared to English." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-114", "text": "Component matching." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-115", "text": "Figure 2 shows F1 scores of several typical components. Note that the results are lower than those reported by Yu et al. (2018a) under their split, due to different training/test splits." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-116", "text": "Our split has less training data and more test instances in the \"Hard\" category and fewer in \"Easy\" and \"Medium\"." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-117", "text": "Table 4 ." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-118", "text": "Specifically, the char-based methods achieve around 41% on SELN and SEL (SELECT), which is about 5% higher compared to the word-based methods." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-119", "text": "This result may be due to the fact that word-based models are sensitive to OOV words (Zhang and Yang, 2018; Li et al., 2019)." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-120", "text": "Unlike other components, SEL and SELN face more severe OOV challenges, caused by unseen schemas during testing." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-121", "text": "In addition, the models using multi-lingual embeddings outperform the models using separate embeddings on both WHERE and OB (ORDERBY), which further demonstrates that embeddings in the same space help to strengthen the connection between the question and the schema." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-122", "text": "Contrary to the overall result, the models employing the Jieba segmentor perform better than those using the YZ segmentor on OB."
}, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-123", "text": "The reason is that the Jieba segmentor segments the superlative forms of adjectives differently." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-124", "text": "For example, the word \"\u6700\u9ad8\" (the highest) is segmented into \"\u6700\" (most) and \"\u9ad8\" (high) by the YZ segmentor, but kept whole as \"\u6700\u9ad8\" by the Jieba segmentor." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-125", "text": "This again demonstrates the influence of word segmentation." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-126", "text": "Finally, for GB (GROUPBY) there is no regular contrast pattern between different models, which is likely because of the lack of sufficient training data." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-127", "text": "Figure 3 shows the negative influence of segmentation errors." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-128", "text": "In particular, the incorrect segmentation of the word \"\u5e97\u540d\" (shop name) leads to incorrect SQL for the whole sentence, since the character \"\u5e97\" (shop) can typically be associated with \"\u5e97\u957f\" (shop manager)." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-129", "text": "Figure 4 shows the sensitivity of our model to sentence patterns." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-130", "text": "In particular, the word-based model frequently gives incorrect predictions for question sentences." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-131", "text": "As shown in the first row, the word \"where\" confuses the system when making a choice between \"ORDER BY\" and \"GROUP BY\"." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-132", "text": "When we manually change the sentence pattern into \"List the most common hometown of teachers\", the parser gives the correct keyword."
}, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-133", "text": "In contrast, the character-based model is less sensitive to question sentences, which is likely because characters are less sparse compared with words." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-134", "text": "More training data or contextualized embeddings may alleviate the issue for the word-based method, which we leave for future work." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-135", "text": "Figure 5 shows the sensitivity of the model to Chinese linguistic patterns." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-136", "text": "In particular, the first sentence has a zero pronoun \"\u5404\u515a\u7684\" (in each party), which is omitted later." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-137", "text": "As a result, a semantic parser cannot tell the correct database columns from the sentence." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-138", "text": "We manually add the correct entity for the zero pronoun, resulting in the second sentence." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-139", "text": "The parser can correctly identify both the column name and the table name for this corrected sentence." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-140", "text": "Since zero pronouns are frequent in Chinese (Chen and Ng, 2016), they add difficulty to its semantic parsing."
}, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-141", "text": "----------------------------------" }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-142", "text": "**CASE STUDY**" }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-143", "text": "----------------------------------" }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-144", "text": "**CONCLUSION**" }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-145", "text": "We constructed a first resource named CSpider for Chinese text-to-SQL, evaluating the performance of a strong English model on this dataset." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-146", "text": "Results show that the input representation, embedding forms and linguistic factors all influence results on this Chinese-specific task." }, { "sent_id": "cb81d56412d1e800074777687fb45a-C001-147", "text": "Our dataset can serve as a starting point for further research on this task, which can be beneficial to the investigation of Chinese QA and dialogue models."
} ], "y": { "@USE@": { "gold_contexts": [ [ "cb81d56412d1e800074777687fb45a-C001-23" ], [ "cb81d56412d1e800074777687fb45a-C001-62" ], [ "cb81d56412d1e800074777687fb45a-C001-85" ] ], "cite_sentences": [ "cb81d56412d1e800074777687fb45a-C001-23", "cb81d56412d1e800074777687fb45a-C001-62", "cb81d56412d1e800074777687fb45a-C001-85" ] }, "@MOT@": { "gold_contexts": [ [ "cb81d56412d1e800074777687fb45a-C001-38", "cb81d56412d1e800074777687fb45a-C001-39" ] ], "cite_sentences": [ "cb81d56412d1e800074777687fb45a-C001-38" ] }, "@BACK@": { "gold_contexts": [ [ "cb81d56412d1e800074777687fb45a-C001-38", "cb81d56412d1e800074777687fb45a-C001-39" ] ], "cite_sentences": [ "cb81d56412d1e800074777687fb45a-C001-38" ] }, "@EXT@": { "gold_contexts": [ [ "cb81d56412d1e800074777687fb45a-C001-85" ] ], "cite_sentences": [ "cb81d56412d1e800074777687fb45a-C001-85" ] }, "@DIF@": { "gold_contexts": [ [ "cb81d56412d1e800074777687fb45a-C001-115", "cb81d56412d1e800074777687fb45a-C001-116" ], [ "cb81d56412d1e800074777687fb45a-C001-101", "cb81d56412d1e800074777687fb45a-C001-103", "cb81d56412d1e800074777687fb45a-C001-105", "cb81d56412d1e800074777687fb45a-C001-98" ] ], "cite_sentences": [ "cb81d56412d1e800074777687fb45a-C001-115", "cb81d56412d1e800074777687fb45a-C001-98" ] } } }, "ABC_9227b5afd1ef18ecf83400dc402459_27": { "x": [ { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-16", "text": "----------------------------------" }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-17", "text": "**LEXICAL DISTRIBUTION IN SPOKEN MANDARIN**" }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-2", "text": "This paper addresses recent results on Mandarin spoken dialogues and introduces the collection of a large Mandarin conversational dialogue corpus." 
}, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-3", "text": "In the context of data processing, principles of transcription are proposed and accordingly a transcription tool is specifically developed for Mandarin spoken conversations." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-4", "text": "----------------------------------" }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-5", "text": "**INTRODUCTION**" }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-6", "text": "Large speech corpora have become indispensable for current linguistic research and information science applications dealing with spoken data (Gibbon et al. 1997)." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-7", "text": "Concretely, they provide real phonetic data and empirical data-driven knowledge on linguistic features of spoken language." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-8", "text": "The corpus presented here is composed of conversational dialogues." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-9", "text": "Conversations contain a considerable variety of linguistic phenomena as well as phonetic-acoustic variations." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-10", "text": "Furthermore, they open up a wide range of research issues such as dialogue acts, turn-taking, lexical use of spoken language and prosodic use in conversation." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-11", "text": "From a diachronic point of view, such a large dialogue corpus archives the contemporary daily conversational use of a given language." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-12", "text": "----------------------------------" }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-13", "text": "**GENERAL ISSUES ON MANDARIN DIALOGUES**" }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-14", "text": "In the following, issues on Mandarin dialogues relevant to spontaneous dialogue annotation are summarized and discussed."
}, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-15", "text": "These include lexical distribution, discourse markers, turn-taking and prosodic characterization." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-18", "text": "Results presented by Tseng (2001) show that speakers of Mandarin adopt some 30 words for building core structures of utterances in conversation, independently of individual speakers." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-19", "text": "All subjects used these words more than three times." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-20", "text": "The occurrences of these 30 core words make up about 80% of the overall tokens in conversation." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-21", "text": "Interestingly, but also as expected in conversational dialogues, the distribution of token frequency across all subjects is highly symmetric (Tseng 2001)." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-22", "text": "For instance, verbs \"is located\", \"is\", \"that is\", \"say\", \"want\" and \"have\" were frequently used, as were pronouns \"s/he\", \"you\" and \"I\"." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-23", "text": "The negation \"don't have\" was a high-frequency word, as were the words \"right\", \"this/these\" and \"that/those\"." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-24", "text": "Grammatical particles as well as discourse particles were also among the core words." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-25", "text": "----------------------------------" }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-26", "text": "**1.2**" }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-27", "text": "Discourse Markers. It is now well known that what most differentiates written texts from spontaneous speech is the use of discourse particles."
}, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-28", "text": "Among the core words, eleven words were discourse particles, or they were used as discourse markers." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-29", "text": "In the literature, there is still no consistent definition of discourse markers (Hirschberg and Litman 1993)." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-30", "text": "Discourse markers can be defined as elements whose original semantic meaning tends to weaken, and whose use in spoken discourse becomes more pragmatic and indicative of discourse structuring." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-31", "text": "In addition to several adverbs and determiners, discourse particles can also be categorized as discourse markers." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-32", "text": "They are very often observed in Mandarin spoken conversations, as mentioned in Tseng (2001) and Clancy et al. (1996)." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-33", "text": "In Tseng (2001), each subject used on average 1.6 discourse particles per turn." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-34", "text": "This result raises the question of whether special categories for discourse particles or particle-like words are needed for spoken Mandarin." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-35", "text": "Discourse particles were found to have different and specific discourse uses in conversation." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-36", "text": "Namely, some discourse particles appear preferentially in turn-beginning position, while others may exclusively mark the location of repairs."
}, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-37", "text": "The small size of the data used in Tseng (2001) is one reason why the ongoing project is necessary for research on Mandarin spontaneous conversations." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-38", "text": "----------------------------------" }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-39", "text": "**1.3**" }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-40", "text": "Taking Turns in Dialogues. In spontaneous conversation, turn-taking usually takes place arbitrarily, to the extent that every individual interacts differently with the others under different circumstances." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-41", "text": "Thus, how to annotate overlapping sequences is one of the essential tasks in developing annotation systems." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-42", "text": "In Mandarin conversation, there are words preferably used in turn-initial position (Tseng 2001, Chui 2000)." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-43", "text": "They normally have their own discourse-related pragmatic function associated with their positioning in utterances." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-44", "text": "Similarly, how to mark up turn-initial positions is also directly connected with the annotation convention." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-45", "text": "----------------------------------" }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-46", "text": "**1.4**" }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-47", "text": "Prosody in Spoken Mandarin. Lexical tones are typically characteristic of spoken Mandarin." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-48", "text": "The interaction of lexical tones and other prosodic means such as stress and intonation is related to a number of research issues, particularly in conversation."
}, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-49", "text": "Falling tones may not show falling tendency anymore, when the associated words are used for specific discourse functions such as for indicating hesitation or the beginning of a turn (Tseng 2001) ." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-50", "text": "----------------------------------" }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-51", "text": "**MANDARIN CONVERSATIONAL DIALOGUE CORPUS**" }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-52", "text": "This section deals with the design and collection of a large Mandarin dialogue corpus currently produced in Academia Sinica." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-53", "text": "----------------------------------" }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-54", "text": "**DESIGN AND METHODOLOGY**" }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-55", "text": "The long-term goal of this project is to collect domain-independent Mandarin spoken dialogues." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-56", "text": "They are daily conversations with specific topics given for each stage of recording." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-57", "text": "Since the design of scenario aims to collect natural and spontaneous conversations, limitations on the topics are reduced to a minimum." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-58", "text": "Different from task-oriented dialogues such as air-planning or instruction-construction tasks (Kowtko and Price 1989, Sagerer et al. 1994) , subjects participating in this project were told to converse as naturally as possible." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-59", "text": "The scenario is similar to a situation where two strangers meet at the first time, try to find common topics interested by both of them and have a chat." 
}, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-60", "text": "To make sure that we will obtain at least some \"seriously spoken\" materials, the subjects should interact with attention to fulfil the WHERE task, namely the route-asking and route-describing task." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-61", "text": "----------------------------------" }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-62", "text": "**SUBJECTS AND INSTRUCTIONS**" }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-63", "text": "60, 23 male and 37 female, Taipei residents of Taipei City who volunteered to participate in the project were recorded in pairs." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-64", "text": "Age ranges from 16 to 45." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-65", "text": "Subjects did not know each other and their task was to introduce themselves first." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-66", "text": "Then they chose topics from those given on the instruction sheet or they were also free of choosing any other not-listed topics to talk about." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-67", "text": "In addition, they asked some questions about routes within conversation." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-68", "text": "The topics given to the subjects are family, work, shopping, food, travelling, politics and economics." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-69", "text": "Both subjects can be information-acquirer or information-giver." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-70", "text": "However, they were told that the person who asked route questions had to make sure that s/he has completely understood the described routes." 
}, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-71", "text": "----------------------------------" }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-72", "text": "**RECORDING**" }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-73", "text": "The dialogues were recorded by a SONY TCD-D10 Pro II DAT tape recorder with Audio-Technica ATM-33a handheld/stand cardioid condenser microphones at a sampling rate of 48 kHz." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-74", "text": "Each subject was recorded on a separate channel on a DAT tape." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-75", "text": "There was no time constraint given in advance." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-76", "text": "Once the subjects completed their task and wished to end the conversation, the recording was stopped." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-77", "text": "Total length of corpus is about 25 hours recording." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-78", "text": "----------------------------------" }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-79", "text": "**3**" }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-80", "text": "An Extensible Transcription Tool" }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-81", "text": "----------------------------------" }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-82", "text": "**FUNCTIONAL CONSIDERATIONS**" }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-83", "text": "This section discusses three principles for constructing word-based database for Mandarin dialogues from audio data to lexical database." 
}, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-84", "text": "Three functions have to be included in a transcription system, either directly or potentially: 1) connecting the tasks of transcribing, annotating and labelling of sound data, 2) being able to deal with overlapping turns and 3) making available possible tiers for later time-alignment." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-85", "text": "There are three working levels for processing spoken data: transcribe, annotate and label." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-86", "text": "First, transcription is the transliteration of audio data." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-87", "text": "Normally, it is verbatim transcription in plain text form." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-88", "text": "A transcription tool has exclusively been developed for broadcast spoken data, named Transcriber (Barras et al. 2001) ." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-89", "text": "Audio data can be nicely combined with the other information." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-90", "text": "However, it lacks flexible possibility for defining new annotation tags." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-91", "text": "It is especially difficult to use Transcriber to transcribe Mandarin conversations because of the written system of Mandarin." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-92", "text": "For the understanding of content and for the completeness of written system, Chinese characters are as representative and important as Latin transcriptions." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-93", "text": "Secondly, to annotate spoken data is to add linguistic explanations to the plain transcript to represent linguistic structures at selected levels." 
}, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-94", "text": "And lastly, to label sound data is to temporally align transcript texts with speech signals." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-95", "text": "The tool we develop for our corpus collection aims to transcribe and annotate the speech data as well as to build potential temporal tiers for future labelling work." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-96", "text": "Traditional annotations of spoken data orient at turn-structured separation of utterances or sets of utterances (Sagerer et al. 1994) ." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-97", "text": "This leads to the following inconveniences." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-98", "text": "The beginning and ending boundaries between utterances are not represented, because it is presupposed that the current ending boundary is the beginning boundary of the next unit." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-99", "text": "While doing temporal alignment, pauses between utterance units and speakers may be missing." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-100", "text": "From the point of view of searching mechanism, an annotation system should also satisfy the demand on classifying sequences produced by given speakers or sequences produced in a given time period." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-101", "text": "Thus, it will be statistically effective and useful to output annotated transcription data in a database format." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-102", "text": "----------------------------------" }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-103", "text": "**TRANSLIST**" }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-104", "text": "Recorded dialogues are currently being transcribed by using a computer-aided program TransList, specifically developed for transcribing Mandarin conversations." 
}, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-105", "text": "The interface is illustrated in Figure 2 ." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-106", "text": "Input contents include location of sound file, subject, start and end times of the transcribed segment in the correspondent sound file, the person who transcribes the segment." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-107", "text": "The actual transcription is done twofold: in characters and in Pinyin." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-108", "text": "Tag list can be flexibly extended to annotate different phenomena any time while doing the transcription." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-109", "text": "Each transcribed segment is referred to its original sound file." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-110", "text": "However, a direct connection to processing audio files is not available yet." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-111", "text": "Regarding the output format of TransList, two variations are currently in use." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-112", "text": "One is conversation-typed format." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-113", "text": "In other words, all sound files split from one conversation form an independent text file." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-114", "text": "In order of time and subject, the program outputs a turn-oriented format, as illustrated in the next section." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-115", "text": "More important is the second output format." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-116", "text": "All transcribed segments belonging to one conversation will be listed in a database, having the following columns: characters, pinyin, sound file and all added tags." 
}, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-117", "text": "Words marked up by different tags will all have the values of added tags as their attribute in the database." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-118", "text": "By doing this, we plan to do word segmentation and tagging for spoken Mandarin." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-119", "text": "An automatic word segmentation and tagging system is currently available for written Mandarin (Chen et al. 1996) ." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-120", "text": "We intend to test this program for spontaneous Mandarin." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-121", "text": "By outputting the transcription in database format, we will make the continuing data processing and searching more effective." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-122", "text": "----------------------------------" }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-123", "text": "**3.3**" }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-124", "text": "An Output Example This section gives an example produced by TransList." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-125", "text": " cong shili yao dao shili meishuguan dehua women jiushi keyi zhiyao ni en jiushi cong women xuexiao yeshi yiyang da ersanliu zui fangbian de a mhm ranhou da dao yeshi dao gongguan huan danshuixian ranhou ni zhiyao zuo dao dagai yuanshanzhan 1 " }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-126", "text": "In the above example, the brackets <> and mark up the beginning and ending boundaries of a speaker production sequence." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-127", "text": "A and B stand for speakers." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-128", "text": "Turns are not explicitly separated, but marked up in the annotation." 
}, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-129", "text": "Numbers after the speaker abbreviations indicate numbers of production sequences by the speaker." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-130", "text": "Thus, whether it is a turn-taking or it is a overlapping can be evaluated by means of the third parameter time (msec)." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-131", "text": "With respect to tags added into the transcribed segments, it is optional to include or to exclude the annotation tags." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-132", "text": "As shown in Figure 2 , these can be non-speech sounds, repairs or discourse markers (Heeman and Allen 1999) ." }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-133", "text": "----------------------------------" }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-134", "text": "**CONCLUSION**" }, { "sent_id": "9227b5afd1ef18ecf83400dc402459-C001-135", "text": "This paper discussed general issues on Mandarin spoken dialogues and analysed components of a new developed transcription and annotation tool for spoken Mandarin." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "9227b5afd1ef18ecf83400dc402459-C001-18" ], [ "9227b5afd1ef18ecf83400dc402459-C001-21" ], [ "9227b5afd1ef18ecf83400dc402459-C001-32", "9227b5afd1ef18ecf83400dc402459-C001-33" ], [ "9227b5afd1ef18ecf83400dc402459-C001-42" ], [ "9227b5afd1ef18ecf83400dc402459-C001-49" ] ], "cite_sentences": [ "9227b5afd1ef18ecf83400dc402459-C001-18", "9227b5afd1ef18ecf83400dc402459-C001-21", "9227b5afd1ef18ecf83400dc402459-C001-32", "9227b5afd1ef18ecf83400dc402459-C001-33", "9227b5afd1ef18ecf83400dc402459-C001-42", "9227b5afd1ef18ecf83400dc402459-C001-49" ] }, "@MOT@": { "gold_contexts": [ [ "9227b5afd1ef18ecf83400dc402459-C001-32", "9227b5afd1ef18ecf83400dc402459-C001-33", "9227b5afd1ef18ecf83400dc402459-C001-34" ], [ "9227b5afd1ef18ecf83400dc402459-C001-37" ] ], "cite_sentences": [ "9227b5afd1ef18ecf83400dc402459-C001-32", "9227b5afd1ef18ecf83400dc402459-C001-33", "9227b5afd1ef18ecf83400dc402459-C001-37" ] } } }, "ABC_6f4dc72277119f0df3d4a7155c61fc_27": { "x": [ { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-2", "text": "Bayesian methods have been successfully applied to sparsify weights of neural networks and to remove structure units from the networks, e. g. neurons." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-3", "text": "We apply and further develop this approach for gated recurrent architectures." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-4", "text": "Specifically, in addition to sparsification of individual weights and neurons, we propose to sparsify preactivations of gates and information flow in LSTM." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-5", "text": "It makes some gates and information flow components constant, speeds up forward pass and improves compression." 
}, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-6", "text": "Moreover, the resulting structure of gate sparsity is interpretable and depends on the task." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-7", "text": "----------------------------------" }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-9", "text": "Recurrent neural networks (RNNs) yield high-quality results in many applications [1, 4, 18, 21] but often overfit due to overparametrization." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-10", "text": "In many practical problems, RNNs can be compressed orders of times with only slight quality drop or even with quality improvement [2, 15, 20] ." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-11", "text": "Methods for RNN compression can be divided into three groups: based on matrix factorization [6, 19] , quantization [7] or sparsification [2, 15, 20] ." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-12", "text": "We focus on RNNs sparsification." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-13", "text": "Two main groups of approaches for sparsification are pruning and Bayesian sparsification." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-14", "text": "In pruning [15, 20] , weights with absolute values less than a predefined threshold are set to zero." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-15", "text": "Such methods imply a lot of hyperparameters (thresholds, pruning schedule etc)." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-16", "text": "Bayesian sparsification techniques [14, 16, 8, 9, 2] treat weights of an RNN as random variables and approximate posterior distribution over them given sparsity-inducing prior distribution." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-17", "text": "After training weights with low signal-to-noise ratio are set to zero." 
}, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-18", "text": "This allows eliminating the majority of weights from the model without time-consuming hyperparameters tuning." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-19", "text": "Also, Bayesian sparsification techniques can be easily extended to permanently set to zero intermediate variables in the network's computational graph [16, 8] (e.g. neurons in fully-connected networks or filters in convolutional networks)." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-20", "text": "It is achieved by multiplying such a variable on a learnable weight, finding posterior over it and setting the weight to zero if the corresponding signal-to-noise ratio is small." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-21", "text": "In this work, we investigate the last mentioned property for gated architectures, particularly for LSTM." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-22", "text": "Following [2, 14] , we sparsify individual weights of the RNN." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-23", "text": "Following [8] , we eliminate neurons from the RNN by introducing multiplicative variables on activations of neurons." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-24", "text": "Our main contribution is the introduction of multiplicative variables on preactivations of the gates and information flow in LSTM." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-25", "text": "This leads to several positive effects." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-26", "text": "Firstly, when some component of preactivations is permanently set to zero, the corresponding gate becomes constant." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-27", "text": "It simplifies LSTM structure and speeds up computations." 
}, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-28", "text": "Secondly, we obtain a three-level hierarchy of sparsification: sparsification of individual weights helps to sparsify gates and information flow (make their components constant), and sparsification of gates and information flow helps to sparsify neurons (remove them from the model)." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-29", "text": "As a result, the overall compression of the model is higher." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-30", "text": "----------------------------------" }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-31", "text": "**PRELIMINARIES**" }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-32", "text": "Consider a dataset of N objects (x i , y i ) and a model p(y|x, W ) parametrized by a neural network with weights W ." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-33", "text": "In [14] , the authors propose a Bayesian technique called Sparse variational dropout (SparseVD) for neural networks sparsification." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-34", "text": "This model comprises log-uniform prior over weights: p(|w ij |) \u221d 1 |wij | and fully factorized normal approximate posterior: q(w ij ) = N (w ij |m ij , \u03c3 2 ij )." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-35", "text": "To find parameters of the approximate posterior distribution, evidence lower bound (ELBO) is optimized:" }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-36", "text": "Because of the log-uniform prior, for the majority of weights signal-to-noise ratio m 2 ij /\u03c3 2 ij \u2192 0 and these weights do not affect network's output." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-37", "text": "In [2] SparseVD is adapted to RNNs." 
}, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-38", "text": "In [8] the authors propose to multiply activations of neurons on group variables z and to learn and sparsify group variables along with W ." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-39", "text": "They put standard normal prior on W and log-uniform prior on z. The first prior moves mean values of W to 0, and it helps to set to zero z and to remove neurons from the model." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-40", "text": "This model is equivalent to multiplying rows of weight matrices on group variables." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-41", "text": "----------------------------------" }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-42", "text": "**PROPOSED METHOD**" }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-43", "text": "To sparsify individual weights, we apply SparseVD [14] to all weights of the RNN, taking into account recurrent specifics underlined in [2] ." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-44", "text": "To compress layers and remove neurons, we follow [8] and introduce group variables for the neurons of all layers (excluding output predictions), and specifically, z" }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-45", "text": "x and z h for input and hidden neurons of LSTM." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-46", "text": "The key component of our model is introducing groups variables z i , z f , z g , z o on preactivations of gates and information flow." 
}, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-47", "text": "The resulting LSTM layer looks as follows:" }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-48", "text": "Described model is equivalent to multiplying rows and columns of weight matrices on group variables:" }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-49", "text": "{same for i, o and g}" }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-50", "text": "We learn group variables z in the same way as weights W : approximate posterior with fully factorized normal distribution given fully factorized log-uniform prior distribution 2 ." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-51", "text": "To find approximate posterior distribution, we maximize ELBO (1)." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-52", "text": "After learning, we set all weights and group variables with signal-to-noise ratio less than 0.05 to 0." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-53", "text": "----------------------------------" }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-54", "text": "**IF SOME COMPONENT OF**" }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-55", "text": "g is set to 0, the corresponding gate or information flow component becomes constant (equal to activation function of bias)." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-56", "text": "It means that we don't need to compute this component, and the forward pass through LSTM is accelerated." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-57", "text": "Related work." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-58", "text": "In [20] the authors propose a pruning-based method that removes neurons from LSTM and argue that independent removing of i, f, g, o components may lead to invalid LSTM units." 
}, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-59", "text": "In our model, we do not remove these components but make them constant, gaining compression and acceleration with correct LSTM structure." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-60", "text": "----------------------------------" }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-61", "text": "**EXPERIMENTS**" }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-62", "text": "We perform experiments with LSTM architecture on two types of problems: text classification (datasets IMDb [10] and AGNews [22] ) and language modeling (dataset PTB [11] , character and word level tasks)." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-63", "text": "For text classification, we use networks with an embedding layer, one recurrent layer and an output dense layer at the last timestep." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-64", "text": "For language modeling, we use networks with one We compare four models in terms of quality and sparsity: baseline model without any regularization, standard SparseVD model for weights sparsification only (W), SparseVD model with group variables for neurons sparsification (W+N) and SparseVD model with group variables for gates and neurons sparsification (W+G+N)." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-65", "text": "In all SparseVD models, we sparsify weights matrices of all layers." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-66", "text": "Since in text classification tasks usually only a small number of input words are important, we use additional multiplicative weights to sparsify the input vocabulary in case of group sparsification (W+N, W+G+N) following [2] ." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-67", "text": "On the contrary, in language modeling tasks all input characters or words are usually important, therefore we do not use z x for this task." 
}, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-68", "text": "Additional sparsification of input neurons in this case noticeably damage models quality and sparsity level of hidden neurons." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-69", "text": "To measure the sparsity level of our models we calculate the compression rate of individual weights as follows: |w|/|w = 0|." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-70", "text": "To compute the number of remaining neurons or non-constant gates we use corresponding rows/columns of W and corresponding weights z if applicable." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-71", "text": "Quantitative results are shown in Table 1 ." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-72", "text": "Multiplicative variables for neurons boost group sparsity level without a significant quality drop." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-73", "text": "Additional variables for gates and information flow not only make some gates constant but also increase group sparsity level even further." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-74", "text": "Moreover, for a lot of constant gates bias values tend to be very large or small making corresponding gates either always open or close." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-75", "text": "Proposed gate sparsification technique also reveals an interesting work-flow structure of LSTM networks for different tasks." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-76", "text": "Figure 1 shows typical examples of gates of remaining hidden neurons." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-77", "text": "For language modeling tasks output gates are very important because models need both store all the information about the input in the memory and output only the current prediction at each timestep." 
}, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-78", "text": "On the contrary, for text classification tasks models need to output the answer only once at the end of the sequence, hence they do not really use output gates." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-79", "text": "Also, the character level language modeling task is more challenging than the word level one: the model uses the whole gating mechanism to solve it." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-80", "text": "We think this is the main reason why gate sparsification does not help here." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-81", "text": "----------------------------------" }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-82", "text": "**A EXPERIMENTAL SETUP**" }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-83", "text": "Datasets." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-84", "text": "To evaluate our approach on text classification task we use two standard datasets: IMDb dataset [10] for binary classification and AGNews dataset [22] for four-class classification." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-85", "text": "We set aside 15% and 5% of training data for validation purposes respectively." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-86", "text": "For both datasets, we use the vocabulary of 20,000 most frequent words." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-87", "text": "To evaluate our approach on language modeling task we use the Penn Treebank corpus [11] with the train/valid/test partition from [12] ." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-88", "text": "The dataset has a vocabulary of 50 characters or 10,000 words." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-89", "text": "Architectures for text classification." 
}, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-90", "text": "We use networks with one embedding layer of 300 units, one LSTM layer of 128 / 512 hidden units for IMDb / AGNews, and finally, a fully connected layer applied to the last output of the LSTM." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-91", "text": "Embedding layer is initialized with word2vec [13] / GloVe [17] and SparseVD models are trained for 800 / 150 epochs on IMDb / AGNews." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-92", "text": "Hidden-to-hidden weight matrices W h are initialized orthogonally and all other matrices are initialized uniformly using the method from [3] ." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-93", "text": "We train our networks using Adam [5] with batches of size 128 and a learning rate of 0.0005." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-94", "text": "Baseline networks overfit for all our tasks, therefore, we present results for them with early stopping." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-95", "text": "Architectures for language modeling." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-96", "text": "To solve character / word-level tasks we use networks with one LSTM layer of 1000 / 256 hidden units and fully-connected layer with softmax activation to predict next character or word." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-97", "text": "We train SparseVD models for 250 / 150 epochs on character-level / word-level tasks." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-98", "text": "All weight matrices of the networks are initialized orthogonally and all biases are initialized with zeros." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-99", "text": "Initial values of hidden and cell elements are not trainable and equal to zero." 
}, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-100", "text": "For the character-level task, we train our networks on non-overlapping sequences of 100 characters in mini-batches of 64 using a learning rate of 0.002 and clip gradients with threshold 1." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-101", "text": "For the word-level task, networks are unrolled for 35 steps." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-102", "text": "We use the final hidden states of the current mini-batch as the initial hidden state of the subsequent mini-batch (successive mini-batches sequentially traverse the training set)." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-103", "text": "The size of each mini-batch is 32." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-104", "text": "We train models using a learning rate of 0.002 and clip gradients with threshold 10." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-105", "text": "Baseline networks overfit for all our tasks; therefore, we present results for them with early stopping." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-106", "text": "Sparsification." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-107", "text": "For all weights that we sparsify, we initialize log \u03c3 with -3." }, { "sent_id": "6f4dc72277119f0df3d4a7155c61fc-C001-108", "text": "We eliminate weights with signal-to-noise ratio less than \u03c4 = 0.05."
} ], "y": { "@BACK@": { "gold_contexts": [ [ "6f4dc72277119f0df3d4a7155c61fc-C001-10" ], [ "6f4dc72277119f0df3d4a7155c61fc-C001-11" ], [ "6f4dc72277119f0df3d4a7155c61fc-C001-16" ], [ "6f4dc72277119f0df3d4a7155c61fc-C001-37" ] ], "cite_sentences": [ "6f4dc72277119f0df3d4a7155c61fc-C001-10", "6f4dc72277119f0df3d4a7155c61fc-C001-11", "6f4dc72277119f0df3d4a7155c61fc-C001-16", "6f4dc72277119f0df3d4a7155c61fc-C001-37" ] }, "@USE@": { "gold_contexts": [ [ "6f4dc72277119f0df3d4a7155c61fc-C001-11", "6f4dc72277119f0df3d4a7155c61fc-C001-12" ], [ "6f4dc72277119f0df3d4a7155c61fc-C001-22" ], [ "6f4dc72277119f0df3d4a7155c61fc-C001-43" ], [ "6f4dc72277119f0df3d4a7155c61fc-C001-66" ] ], "cite_sentences": [ "6f4dc72277119f0df3d4a7155c61fc-C001-11", "6f4dc72277119f0df3d4a7155c61fc-C001-22", "6f4dc72277119f0df3d4a7155c61fc-C001-43", "6f4dc72277119f0df3d4a7155c61fc-C001-66" ] }, "@SIM@": { "gold_contexts": [ [ "6f4dc72277119f0df3d4a7155c61fc-C001-37", "6f4dc72277119f0df3d4a7155c61fc-C001-43" ] ], "cite_sentences": [ "6f4dc72277119f0df3d4a7155c61fc-C001-37", "6f4dc72277119f0df3d4a7155c61fc-C001-43" ] } } }, "ABC_5a039a2af7e07cffff76d3470f32f1_27": { "x": [ { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-75", "text": "The first three are based on a large corpus of Arabic documents constructed by Zahran et al. (Zahran et al., 2015) , which consists of 2,340,895 words." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-99", "text": "The model was trained for 150 epochs on 3,852 sentences and tested on 963 sentence using Columbia's University Arabic Named Entity Recognition Corpus (Columbia University, 2016)." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-55", "text": "Once tuples have been generated, they can be used as word analogy questions to evaluate different word embeddings as defined by Mikolov et al. (Mikolov et al., 2013) ." 
}, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-56", "text": "A word analogy question for a tuple consisting of two word pairs (a, b) and (c, d) can be formulated as follows: \"a to b is like c to ?\"." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-57", "text": "Each such question will then be answered by calculating a target vector t = b \u2212 a + c. We then calculate the cosine similarity between the target vector t and the vector representation of each word w in a given word embeddings V." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-58", "text": "Finally, we retrieve the most similar word w to t, i.e., argmax_{w \u2208 V, w \u2209 {a,b,c}} (w \u00b7 t) / (||w|| ||t||)." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-59", "text": "If w = d (i.e., the same word) then we assume that the word embeddings V has answered the question correctly." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-60", "text": "We also use our benchmark to generate additional analogy questions by using more than two word pairs per question." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-61", "text": "This provides a more accurate representation of a relation as mentioned in (Mikolov et al., 2013)." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-62", "text": "For each relation, we generate a question per word pair consisting of the word pair plus 10 random word pairs from the same relation." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-63", "text": "Thus, each question would consist of 11 word pairs (a_i, b_i) where 1 \u2264 i \u2264 11." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-64", "text": "We then use the average of the first 10 word pairs to generate the target vector t as follows: t = (1/10) \u03a3_{i=1}^{10} (b_i \u2212 a_i) + a_{11}." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-65", "text": "Finally, we retrieve the closest word w to the target vector t using cosine similarity as in the previous case."
}, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-66", "text": "The question is considered to be answered correctly if the answer word w is the same as b_{11}." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-67", "text": "Moreover, we also extend the traditional word analogy task by taking into consideration whether the correct answer is among the top-5 closest words in the embedding space to the target vector t, which allows us to more leniently evaluate the embeddings." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-68", "text": "This is particularly important in the case of Arabic since many forms of the same word exist, usually with additional prefixes or suffixes such as the equivalent of the article \"the\" or possessive determiners such as \"her\", \"his\", or \"their\"." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-69", "text": "For example, consider one question which asks \" to is like to ?\", i.e., \"man to woman is like king to ?\", with the answer being \" \" or \"queen\"." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-70", "text": "Now, if we rely only on the top-1 word and it happens to be \" \", which means \"his queen\" in English, the question would be considered to be answered wrongly." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-71", "text": "To relax this and ensure that different forms of the same word will not result in a mismatch, we use the top-5 words for evaluation rather than the top-1." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-72", "text": "----------------------------------" }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-73", "text": "**EVALUATION**" }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-2", "text": "Many unsupervised learning techniques have been proposed to obtain meaningful representations of words from text."
}, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-3", "text": "In this study, we evaluate these various techniques when used to generate Arabic word embeddings." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-4", "text": "We first build a benchmark for the Arabic language that can be utilized to perform intrinsic evaluation of different word embeddings." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-5", "text": "We then perform additional extrinsic evaluations of the embeddings based on two NLP tasks." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-6", "text": "----------------------------------" }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-8", "text": "Distributed word representations, commonly referred to as word embeddings, represent words as vectors in a low-dimensional space." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-9", "text": "The goal of this deep representation of words is to capture syntactic and semantic relationships between words." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-10", "text": "These word embeddings have been proven to be very useful in various NLP applications, particularly those employing deep learning." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-11", "text": "Word embeddings are typically learned using unsupervised learning techniques on large text corpora." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-12", "text": "Many techniques have been proposed to learn such embeddings (Pennington et al., 2014; Mikolov et al., 2013; Mnih and Kavukcuoglu, 2013) ." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-13", "text": "While most of the work has focused on English word embeddings, few attempts have been carried out to learn word embeddings for other languages, mostly using the above mentioned techniques." 
}, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-14", "text": "In this paper, we focus on Arabic word embeddings." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-15", "text": "Particularly, we provide a thorough evaluation of the quality of four Arabic word embeddings that have been generated by previous work (Zahran et al., 2015; Al-Rfou et al., 2013) ." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-16", "text": "We use both intrinsic and extrinsic evaluation methods to evaluate the different embeddings." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-17", "text": "For the intrinsic evaluation, we build a benchmark consisting of over 115,000 word analogy questions for the Arabic language." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-18", "text": "Unlike previous attempts to evaluate Arabic embeddings, which relied on translating existing English benchmarks, our benchmark is the first specifically built for the Arabic language and is publicly available for future work in this area 1 . Translating an English benchmark is not the best strategy to evaluate Arabic embeddings for the following reasons." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-19", "text": "First, the currently available English benchmarks are specifically designed for the English language and some of the questions there are not applicable to Arabic." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-20", "text": "Second, Arabic has more relations compared to English and these should be included in the benchmark as well." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-21", "text": "Third, translating an English benchmark is subject to errors since it is usually carried out in an automatic fashion." 
}, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-22", "text": "In addition to the new benchmark, we also extend the basic analogy reasoning task by taking into consideration more than two word pairs when evaluating a relation, and by considering the top-5 words rather than only the top-1 word when answering an analogy question." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-23", "text": "Finally, we perform an extrinsic evaluation of the different embeddings using two different NLP tasks, namely Document Classification and Named Entity Recognition." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-24", "text": "----------------------------------" }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-25", "text": "**RELATED WORK**" }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-26", "text": "There is a wealth of research on evaluating unsupervised word embeddings, which can be broadly divided into intrinsic and extrinsic evaluations (Mikolov et al., 2013; Gao et al., 2014; Schnabel et al., 2015)." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-27", "text": "Extrinsic evaluations assess the quality of the embeddings as features in models for other tasks, such as semantic role labeling and part-of-speech tagging (Collobert et al., 2011), or noun-phrase chunking and sentiment analysis (Schnabel et al., 2015)." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-28", "text": "However, all of these tasks and benchmarks are built for English and thus cannot be used to assess the quality of Arabic word embeddings, which is the main focus here." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-29", "text": "To the best of our knowledge, only a handful of recent studies attempted evaluating Arabic word embeddings." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-74", "text": "We compare four different Arabic word embeddings that have been generated by previous work."
}, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-30", "text": "Zahran et al. (Zahran et al., 2015) translated the English benchmark in (Mikolov et al., 2013) and used it to evaluate different embedding techniques when applied to a large Arabic corpus." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-31", "text": "However, as the authors themselves point out, translating an English benchmark is not the best strategy to evaluate Arabic embeddings." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-32", "text": "Zahran et al. also consider extrinsic evaluation on two NLP tasks, namely query expansion for IR and short answer grading." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-33", "text": "Dahou et al. (Dahou et al., 2016) used the analogy questions from (Zahran et al., 2015) after correcting some Arabic spelling mistakes resulting from the translation and after adding new analogy questions to make up for the inadequacy of the English questions for the Arabic language." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-34", "text": "They also performed an extrinsic evaluation using sentiment analysis." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-35", "text": "Finally, Al-Rfou et al. (Al-Rfou et al., 2013) generated word embeddings for 100 different languages, including Arabic, and evaluated the embeddings using part-of-speech tagging; however, the evaluation was done only for a handful of European languages." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-36", "text": "----------------------------------" }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-37", "text": "**BENCHMARK**" }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-38", "text": "Our benchmark is the first specifically designed for the Arabic language." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-39", "text": "It consists of nine relations, each consisting of over 100 word pairs."
}, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-40", "text": "An Arabic linguist who was properly introduced to the word-analogy task provided the list of relations." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-41", "text": "Once the nine relations were defined, two different people collectively generated the word pairs." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-42", "text": "The two people are native Arabic speakers, and one of them is a co-author and the other is not." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-43", "text": "Table 1 displays the list of all relations in our benchmark as well as two example word pairs for each relation." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-44", "text": "The full benchmark and the evaluation tool can be obtained from the following link: http://oma-project.com/res_home." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-45", "text": "Translating an English benchmark is not adequate for many reasons." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-46", "text": "First, the currently available English benchmarks contain many questions that are not applicable to Arabic." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-47", "text": "For example, comparative and superlative relations are the same in Arabic, except that the superlatives are usually prefixed with the Arabic equivalent of \"the\"." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-48", "text": "Another example is the opposite relation, where some words in Arabic do not have antonyms, in which case the antonym is typically expressed by prefixing the word with \"not\"." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-49", "text": "Second, Arabic has more relations compared to English." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-50", "text": "For instance, in Arabic there is the pair relation (see Table 1 for an example)." 
}, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-51", "text": "Third, translating an English benchmark is considerably difficult due to the high ambiguity of the Arabic language." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-52", "text": "Given our benchmark, we generate a test bank consisting of over 100,000 tuples." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-53", "text": "Each tuple consists of two word pairs (a, b) and (c, d) from the same relation." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-54", "text": "For each of our nine relations, we generate a tuple by combining two different word pairs from the same relation." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-76", "text": "Using this corpus, the authors generated three different word embeddings using three different techniques, namely the Continuous Bag-of-Words (CBOW) model (Mikolov et al., 2013), the Skip-gram model (Mikolov et al., 2013) and GloVe (Pennington et al., 2014)." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-77", "text": "The fourth word embeddings we evaluate in this paper is the Arabic part of the Polyglot word embeddings, which was trained on the Arabic Wikipedia by Al-Rfou et al. and consists of over 100,000 words (Al-Rfou et al., 2013)." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-78", "text": "To the best of our knowledge, these are the only available word embeddings that have been constructed for the Arabic language." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-79", "text": "----------------------------------" }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-80", "text": "**INTRINSIC EVALUATION**" }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-81", "text": "As we mentioned in the previous section, we use our word analogy benchmark to evaluate the embeddings using four different criteria, namely using top-1 and top-5 words when representing relations using two versus 11 word pairs."
}, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-82", "text": "Table 2 displays the accuracy of each embedding technique for the four evaluation criteria." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-83", "text": "Note that we consider a question to be answered wrongly if at least one of the words in the question is not present in the word embeddings." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-84", "text": "That is, we take into consideration the coverage of the embeddings as well (Gao et al., 2014)." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-85", "text": "As can be seen in Table 2, the CBOW model consistently outperforms all other compared models for all four evaluation criteria." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-86", "text": "The performance of Polyglot is particularly low since the embeddings were trained on a much smaller corpus (Arabic portion of Wikipedia), and thus both the coverage and the quality of the embeddings are much lower." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-87", "text": "As can also be seen from the table, accuracy improves when representing a relation using 11 pairs rather than just two pairs. (Table 3: F-measure for two NLP tasks.)" }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-88", "text": "This validates that it is indeed more appropriate to use more than two pairs to represent relations in word analogy tasks." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-89", "text": "When considering the top-5 matches, the accuracies of the embeddings are boosted drastically, which indeed shows that relying on just the top-1 word to assess the quality of embeddings might be unduly harsh, particularly in the case of Arabic."
}, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-90", "text": "----------------------------------" }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-91", "text": "**EXTRINSIC EVALUATION**" }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-92", "text": "We perform extrinsic evaluation of the four word embeddings using two NLP tasks, namely: Arabic Document Classification and Arabic Named Entity Recognition (NER)." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-93", "text": "In the Document Classification task, the goal is to classify Arabic Wikipedia articles into four different classes (person (PER), organization (ORG), location (LOC), or miscellaneous (MISC))." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-94", "text": "To do this, we relied on a neural network with a Long Short-Term Memory (LSTM) layer (Hochreiter and Schmidhuber, 1997), which is fed from the word embeddings." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-95", "text": "The LSTM layer is followed by two fully-connected layers, which in turn are followed by a softmax layer that predicts class-assignment probabilities." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-96", "text": "The model was trained for 150 epochs on 8,000 articles, validated on 1,000 articles, and tested on another 1,000 articles." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-97", "text": "In the NER task, the goal is to label each word in a given sequence using one of the following labels: PER, LOC, ORG, and MISC, which represent different Named Entity classes." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-98", "text": "The same architecture as in the Document Classification task was used for this task as well." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-100", "text": "We used an LSTM neural network for both tasks since LSTMs flexibly make use of contextual data and thus are commonly used in NLP tasks such as Document Classification and NER."
}, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-101", "text": "As can be seen in Table 3, the first three methods, CBOW, Skip-gram and GloVe, seem to perform relatively well for both the Document Classification and NER tasks, with very comparable performance in terms of F-measure." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-102", "text": "They also clearly outperform Polyglot on both tasks." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-103", "text": "----------------------------------" }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-104", "text": "**DISCUSSION**" }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-105", "text": "Our experimental results indicate the superiority of CBOW and Skip-gram as word embeddings compared to Polyglot." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-106", "text": "This can be mainly attributed to the fact that the first two embeddings were trained using a much larger corpus and thus had both better coverage and higher accuracies when it comes to the word analogy task." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-107", "text": "This is also evident in the case of the extrinsic evaluation." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-108", "text": "Thus, when training word embeddings, it is crucial to use large training data to obtain meaningful embeddings." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-109", "text": "Moreover, when performing the intrinsic evaluation of the different embeddings, we observed that relying on just the top-1 word is unduly harsh for Arabic." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-110", "text": "This is mainly attributed to the fact that for Arabic, and unlike other languages such as English, different forms of the same word exist and these must be taken into consideration when evaluating the embeddings."
}, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-111", "text": "Thus, it is advised to use the top-k matches to perform the evaluation, where k is 5 for instance." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-112", "text": "It is also advisable to represent a relation with multiple word pairs, rather than just two as is currently done in most similar studies, to guarantee that the relation is well represented." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-113", "text": "----------------------------------" }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-114", "text": "**CONCLUSION**" }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-115", "text": "In this paper, we described the first word analogy benchmark specifically designed for the Arabic language." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-116", "text": "We used our benchmark to evaluate available Arabic word embeddings using the basic analogy reasoning task as well as extensions of it." }, { "sent_id": "5a039a2af7e07cffff76d3470f32f1-C001-117", "text": "In addition, we also evaluated the quality of the various embeddings using two NLP tasks, namely Document Classification and NER." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "5a039a2af7e07cffff76d3470f32f1-C001-12" ], [ "5a039a2af7e07cffff76d3470f32f1-C001-26" ], [ "5a039a2af7e07cffff76d3470f32f1-C001-30" ] ], "cite_sentences": [ "5a039a2af7e07cffff76d3470f32f1-C001-12", "5a039a2af7e07cffff76d3470f32f1-C001-26", "5a039a2af7e07cffff76d3470f32f1-C001-30" ] }, "@USE@": { "gold_contexts": [ [ "5a039a2af7e07cffff76d3470f32f1-C001-55" ], [ "5a039a2af7e07cffff76d3470f32f1-C001-76" ] ], "cite_sentences": [ "5a039a2af7e07cffff76d3470f32f1-C001-55", "5a039a2af7e07cffff76d3470f32f1-C001-76" ] }, "@SIM@": { "gold_contexts": [ [ "5a039a2af7e07cffff76d3470f32f1-C001-61" ] ], "cite_sentences": [ "5a039a2af7e07cffff76d3470f32f1-C001-61" ] } } }, "ABC_167511f278a8596aed0124c3a4242b_27": { "x": [ { "sent_id": "167511f278a8596aed0124c3a4242b-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-2", "text": "Simultaneous speech translation aims to maintain translation quality while minimizing the delay between reading input and incrementally producing the output." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-3", "text": "We propose a new general-purpose prediction action which predicts future words in the input to improve quality and minimize delay in simultaneous translation." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-4", "text": "We train this agent using reinforcement learning with a novel reward function." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-5", "text": "Our agent with prediction has better translation quality and less delay compared to an agent-based simultaneous translation system without prediction." 
}, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-6", "text": "----------------------------------" }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-8", "text": "One of the next significant challenges in machine translation research is to make translation ubiquitous using real-time translation." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-9", "text": "Simultaneous machine translation aims to address this issue by interleaving reading the input with writing the output translation." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-10", "text": "Current Simultaneous Neural Machine Translation (SNMT) systems (Satija and Pineau, 2016; Cho and Esipova, 2016; Gu et al., 2017) use an AGENT to control an incremental encoder-decoder (or sequence to sequence) NMT model." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-11", "text": "Each READ adds more information to the encoder RNN, and each WRITE produces more output using the decoder RNN." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-12", "text": "In this paper, we propose adding a new action to the AGENT: a PREDICT action that predicts what words might appear in the input stream." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-13", "text": "Prediction was previously proposed in simultaneous statistical machine translation (Grissom II et al., 2014) but has not been studied in the context of Neural Machine Translation (NMT)." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-14", "text": "In SNMT systems, prediction of future words augments the encoder-decoder model with possible future contexts to produce output translations earlier (minimize delay) and/or produce better output translations (improve translation quality)." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-15", "text": "Our experiments show that prediction improves SNMT in both these measures." 
}, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-16", "text": "----------------------------------" }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-17", "text": "**SIMULTANEOUS TRANSLATION FRAMEWORK**" }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-18", "text": "An agent-based framework whose actions decide whether to translate or wait for more input is a natural way to extend neural MT to simultaneous neural MT and has been explored in (Satija and Pineau, 2016; Gu et al., 2017). It contains two main components: the ENVIRONMENT, which receives the input words X = {x_1, . . . , x_N} from the source language and incrementally generates translated words W = {w_1, . . . , w_M} in the target language; and the AGENT, which decides an action for each time step, a_t." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-19", "text": "The AGENT generates an action sequence A = {a_1, . . . , a_T} to control the ENVIRONMENT." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-20", "text": "Previous models only include two actions: READ and WRITE." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-21", "text": "We extend the model by adding the third action called PREDICT." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-22", "text": "Action READ is simply sending a new word to the ENVIRONMENT and generating a candidate word in the target language." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-23", "text": "In action WRITE, the AGENT takes the current candidate word and sends it to the output." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-24", "text": "For PREDICT, the AGENT predicts the next word in the input and treats it like a READ action." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-25", "text": "The following section explains how the ENVIRONMENT deals with different actions."
}, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-26", "text": "----------------------------------" }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-27", "text": "**ENVIRONMENT**" }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-28", "text": "The ENVIRONMENT is an attention-based Encoder-Decoder MT system (Bahdanau et al., 2014) which is adapted to the simultaneous translation task." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-29", "text": "The Encoder receives the embedded representation of the input words (including predicted ones) and converts them into context vectors H^\u03c1_n = {h_1, . . . , h_{n+\u03c1}} using a gated RNN (GRU), where n is the number of input words so far and \u03c1 is the number of predicted words since the last READ." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-30", "text": "Whenever the AGENT decides to READ, \u03c1 will be set to 0, and" }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-31", "text": ", where x_n is the next input word (n \u2264 N). But if the action is PREDICT, \u03c1 > 0, the AGENT predicts a new word x\u2032_\u03c1 and the" }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-32", "text": "At each time step t, the decoder uses the current context vectors (H^\u03c1_n) to generate the next candidate output (y_t):" }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-33", "text": "where w_{m\u22121} is the previous output word, a^ATT is an attention model (Bahdanau et al., 2014), f and g are nonlinear functions, and c_t is the current context vector." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-34", "text": "If the action a_t is either READ or PREDICT, the current candidate y_t will be ignored (wait for better predictions)." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-35", "text": "But in the case of WRITE, the candidate y_t is produced as the next output word w_m and then the decoder state will be updated (w_m \u2190 y_t, z_m \u2190 s_t)."
}, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-36", "text": "Note that as soon as the AGENT decides to READ, all the hidden vectors generated by PRE-DICT actions will be discarded (H" }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-37", "text": "Figure 1 shows an example of how a sentence can be translated using our modified translation framework 1 ." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-38", "text": "----------------------------------" }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-39", "text": "**AGENT**" }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-40", "text": "The AGENT is a separate component which examines the ENVIRONMENT at each time step and decides on the actions that lead to better translation quality and lower delay." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-41", "text": "The agent in the greedy decoding framework (Gu et al., 2017) was trained using reinforcement learning with the policy gradient algorithm (Williams, 1992) , which observes the current state of the ENVIRONMENT at time step t as o t where o t = [c t ; s t ; w m ]." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-42", "text": "A RNN with one hidden layer passed through a softmax function generates the probability distribution over the actions a t at each step." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-43", "text": "Therefore, policy \u21e1 \u2713 will be computed as:" }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-44", "text": "Where u t is the hidden state of the AGENT's RNN." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-45", "text": "----------------------------------" }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-46", "text": "**TRAINING THE AGENT WITH PREDICTION**" }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-47", "text": "In order to speed-up the training process, we have restricted AGENT's options by removing redundant operations." 
}, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-48", "text": "As illustrated in Figure 2 ter a series of WRITE, the AGENT cannot choose to PREDICT, and after a sequence of PREDICTs, READ is not an option." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-49", "text": "Reward Function: The total reward at any time step is calculated as the cumulative sum of rewards for actions at each preceding step." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-50", "text": "All the evaluation metrics have been modified to be computed for every time step." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-51", "text": "Quality: We use a modified smoothed version of BLEU score (Chen and Cherry, 2014) multiplied by Brevity Penalty (Lin and Och, 2004) for evaluating the impact of each action on translation quality." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-52", "text": "At each point in time, the reward for translation quality is:" }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-53", "text": "The BLEU(t) is the difference between BLEU score of the translated sentence at the previous time step and the current time step; BLEU(t) = BLEU(W t , W \u21e4 ) BLEU(W t 1 , W \u21e4 ); where W t is the prefix of the translated sentence at time t. Delay: The Delay reward is used to motivate the AGENT to minimize delay." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-54", "text": "We use Average Proportion (AP) (Cho and Esipova, 2016) for this purpose, which is the average number of source words needed when translating each word." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-55", "text": "Given the source words X and translated words W , AP can be computed as:" }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-80", "text": "We tried different settings for these hyperparameters during training and picked values that gave us the best Quality and Delay on the training data." 
}, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-81", "text": "Results and Analysis Figure 4 shows that as the sentence length increases, prediction helps translation quality due to complex reordering and multiclausal sentences; However, for shorter samples where the structure of the sentences are simpler, the prediction action cannot improve translation quality." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-82", "text": "Table 2 compares our model with the Greedy Decoding (GD) model in terms of translation quality and latency." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-83", "text": "It shows that the prediction mechanism outperforms the GD model in terms of BLEU and average proportion (AP)." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-84", "text": "The delay evaluation measure (AP) counts about the same number of READs and WRITEs." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-85", "text": "It does not account for less delay as longer sentences are produced." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-86", "text": "A better measure than AP might be needed to emphasize delay differences." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-87", "text": "Therefore we also report the average segment length (\u00b5), which is computed as the average number of consecutive READs in each sentence." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-88", "text": "In both EN!DE and DE!EN experiments, our model constantly decreases the segment length by around 1 word which results in less latency." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-89", "text": "In order to evaluate the effectiveness of our proposed reward function for PREDICT action, we have explored various values for its hyperparameters (\u21b5, , and )." 
}, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-90", "text": "Our empirical results show that the best trade-off between quality and delay is achieved when around 20 percent of the actions are PREDICT for both EN!DE and DE!EN translation tasks (Figure 3 )." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-91", "text": "When there is no reward for PREDICT action ( = 0), the AGENT prefers other actions, and the number of PREDICT actions turns into zero immediately after training the AGENT." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-92", "text": "If the reward for prediction is valued too highly, the ENVIRONMENT depends more on predicted words and the translation quality decreases 2 ." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-93", "text": "----------------------------------" }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-94", "text": "**RELATED WORK**" }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-95", "text": "Early work in SNMT was done in speech, where the incoming signals were segmented based on acoustic or statistical cues (Bangalore et al., 2012; F\u00fcgen et al., 2007 Yarmohammadi et al., 2013; Siahbani et al., 2014 ) use a separate segmentation step and incrementally translate each segment using a standard phrase-based MT system." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-96", "text": "(Matsubara et al., 2000) applied pattern matching to predict target-side verbs in Japanese to English translation." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-97", "text": "(Grissom II et al., 2014) used reinforcement learning to predict the next word and the sentencefinal verb in a statistical MT model." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-98", "text": "These models reduce the delay but are not trained end-toend like our agent-based SNMT system." 
}, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-99", "text": "(Cho and Esipova, 2016) proposed a non-trainable heuristic agent which is not able to trade-off quality with delay." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-100", "text": "It always prefers to read more words from the input and this approach does not work well in practice." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-101", "text": "(Satija and Pineau, 2016) introduced a trainable agent which they trained using Deep Q networks (Mnih et al., 2015) ." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-102", "text": "We modified the SNMT trainable agent in (Gu et al., 2017) and added a new non-trivial PREDICT action to the agent." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-103", "text": "We compare to their model and show better results in delay and quality." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-104", "text": "----------------------------------" }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-105", "text": "**CONCLUSION**" }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-106", "text": "We introduce a new prediction action in a trainable agent for simultaneous neural machine translation." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-107", "text": "With prediction, the agent can be informed about future time steps in the input stream." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-108", "text": "Compared to a very strong baseline our results show that prediction can lower delay and improve the translation quality, especially for longer sentences and translating from an SOV (subject-object-verb) language (DE) to an SVO language (EN)." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-109", "text": "decoding strategies for simultaneous translation." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-110", "text": "In Proceedings of the Sixth International Joint Conference on Natural Language Processing, pages 1032-1036." 
}, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-56", "text": "Where s(t) denotes the number of source words the WRITE action uses at time step t (for any other actions, s(t) would be zero)." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-57", "text": "The delay reward is smoothed using a Target Delay which is a scalar constant denoted by d \u21e4 (Gu et al., 2017) :" }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-58", "text": "\u21e4 c Prediction Rewards for Quality and Delay alone do not motivate the AGENT to choose prediction and in preliminary experiments, after a number of steps, the number of prediction actions became zero." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-59", "text": "We address this problem by defining Prediction Quality (PQ) which rewards the AGENT for changes in BLEU score after each prediction action." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-60", "text": "By initializing r p 0 = 0, the prediction reward can be written as:" }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-61", "text": "The final reward function is calculated as the combination of quality, delay, and prediction rewards:" }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-62", "text": "(1) The trade-off between better translation quality and minimal delay is achieved by modifying the parameters \u21b5, , and ." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-63", "text": "Reinforcement Learning is used to train the AGENT using a policy gradient algorithm (Gu et al., 2017; Williams, 1992) which searches for the maximum in" }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-64", "text": "The gradient for a sentence is the cumulative sum of gradients at each time step." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-65", "text": "We pre-train the ENVIRONMENT on full sentences using log-loss log p(y|x)." 
}, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-66", "text": "----------------------------------" }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-67", "text": "**EXPERIMENTS**" }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-68", "text": "We train and evaluate our model on EnglishGerman (EN-DE) in both directions." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-69", "text": "We use WMT 2015 for training and Newstest 2013 for validation and testing." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-70", "text": "All sentences have been tokenized and the words are segmented using byte pair encoding (BPE) (Sennrich et al., 2016 Model Configuration For a fair comparison, we follow the settings that worked the best for the greedy decoding model in (Gu et al., 2017) and set the target delay d \u21e4 for the AGENT to 0.7." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-71", "text": "The EN-VIRONMENT consists of two unidirectional layers with 1028 GRU units for encoder and decoder." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-72", "text": "We train the network using AdaDelta optimizer, a batch of size 32 and a fixed learning rate of 0.0001 without decay." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-73", "text": "We use softmax policy via recurrent networks with 512 GRU units and a softmax function for the AGENT and train it using Adam optimizer (Kingma and Ba, 2014)." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-74", "text": "The batch size for the AGENT is 10, and the learning rate is 2e-6." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-75", "text": "The word predictor is a two layer RNN Language model which consists of two layers of 1024 units, followed by a softmax layer." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-76", "text": "The batch size is 64 with a learning rate of 2e-5." 
}, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-77", "text": "The predictor has been trained on the WMT'16 dataset and tested on Newstest'16 corpora for both languages." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-78", "text": "The perplexity of our language model is reported in Table 1 ." }, { "sent_id": "167511f278a8596aed0124c3a4242b-C001-79", "text": "We set \u21b5 = 1, = 0.5 and = 0.5." } ], "y": { "@BACK@": { "gold_contexts": [ [ "167511f278a8596aed0124c3a4242b-C001-10" ], [ "167511f278a8596aed0124c3a4242b-C001-18" ] ], "cite_sentences": [ "167511f278a8596aed0124c3a4242b-C001-10", "167511f278a8596aed0124c3a4242b-C001-18" ] }, "@EXT@": { "gold_contexts": [ [ "167511f278a8596aed0124c3a4242b-C001-10", "167511f278a8596aed0124c3a4242b-C001-12" ], [ "167511f278a8596aed0124c3a4242b-C001-102" ] ], "cite_sentences": [ "167511f278a8596aed0124c3a4242b-C001-10", "167511f278a8596aed0124c3a4242b-C001-102" ] }, "@USE@": { "gold_contexts": [ [ "167511f278a8596aed0124c3a4242b-C001-41" ], [ "167511f278a8596aed0124c3a4242b-C001-57" ], [ "167511f278a8596aed0124c3a4242b-C001-63" ], [ "167511f278a8596aed0124c3a4242b-C001-70" ] ], "cite_sentences": [ "167511f278a8596aed0124c3a4242b-C001-41", "167511f278a8596aed0124c3a4242b-C001-57", "167511f278a8596aed0124c3a4242b-C001-63", "167511f278a8596aed0124c3a4242b-C001-70" ] } } }, "ABC_e92c6b44f4482ca868221bff551d67_27": { "x": [ { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-80", "text": "the parser is less than one-hundred percent." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-2", "text": "In this paper, we extend an existing parser to produce richer output annotated with function labels." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-3", "text": "We obtain state-of-the-art results both in function labelling and in parsing, by automatically relabelling the Penn Treebank trees." 
}, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-4", "text": "In particular, we obtain the best published results on semantic function labels." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-5", "text": "This suggests that current statistical parsing methods are sufficiently general to produce accurate shallow semantic annotation." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-6", "text": "----------------------------------" }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-8", "text": "With recent advances in speech recognition, parsing, and information extraction, some domain-specific interactive systems are now of practical use for tasks such as question-answering, flight booking, or restaurant reservation (Stallard, 2000) ." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-9", "text": "One of the challenges ahead lies in moving from hand-crafted programs of limited scope to robust systems independent of a given domain." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-10", "text": "While this ambitious goal will remain in the future for some time to come, recent efforts to develop language processing systems producing richer semantic outputs will likely be the cornerstone of many successful developments in natural language understanding." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-11", "text": "In this paper, we present a parser that outputs labels indicating the syntactic or semantic function of a constituent in the tree, such as NP-SBJ or PP-TMP shown in bold face in the tree in Figure 1 ." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-12", "text": "These labels indicate that the NP is the subject of the sentence and that the PP conveys temporal information." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-13", "text": "(Labels in parentheses will be explained later in the paper.) 
Output annotated with such informative labels underlies all domain-independent question answering or shallow semantic interpretation systems (Collins and Miller, 1998; Ge and Mooney, 2005)." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-14", "text": "We test the hypothesis that a current statistical parser can output such richer information without any degradation of the parser's accuracy on the original parsing task." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-15", "text": "Briefly, our method consists in augmenting a state-of-the-art statistical parser (Henderson, 2003) , whose architecture and properties make it particularly adaptive to new tasks." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-16", "text": "We achieve state-of-the-art results both for parsing and function labelling." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-17", "text": "Statistical parsers trained on the Penn Treebank (PTB) (Marcus et al., 1993) produce trees annotated with bare phrase structure labels (Collins, 1999; Charniak, 2000) ." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-18", "text": "The trees of the Penn Treebank, however, are also decorated with function labels." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-19", "text": "Figure 1 shows the simplified tree representation with function labels for a sample sentence from the Penn Treebank corpus (section 00): The Government's borrowing authority dropped at midnight Tuesday to 2.8 trillion from 2.87 trillion." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-20", "text": "Function labels are context-dependent and encode a shallow level of phrasal and lexical semantics, as observed first in (Blaheta and Charniak, 2000) ." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-21", "text": "To a large extent, they overlap with semantic role labels as defined in PropBank."
}, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-22", "text": "Current statistical parsers do not use this richer information because performance of the parser usually decreases considerably, since a more complex task is being solved." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-23", "text": "(Klein and Manning, 2003) , for instance report a reduction in parsing accuracy of an unlexicalised PCFG from 77.8% to 72.9% if using function labels in training." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-24", "text": "(Blaheta, 2004) also reports a decrease in performance when attempting to integrate his function labelling system with a full parser." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-25", "text": "Conversely, researchers interested in producing richer semantic outputs have concentrated on two-stage systems, where the semantic labelling task is performed on the output of a parser, in a pipeline architecture divided in several stages (Gildea and Jurafsky, 2002) ." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-26", "text": "See also the common task of (CoNLL, 2004 (CoNLL, 2005 Senseval, 2004) ." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-27", "text": "Our approach maintains state-of-the-art results in parsing, while also reaching state-of-the-art results in function labelling, by suitably extending a Simple Synchrony Network (SSN) parser (Henderson, 2003) into a single integrated system." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-78", "text": "Finally, we also assess function labelling performance on its own." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-28", "text": "This is an interesting result, as a task combining function labelling and parsing is more complex than simple parsing." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-29", "text": "While the function of a constituent and its structural position are often correlated, they some-times diverge." 
}, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-30", "text": "For example, some nominal temporal modifiers occupy an object position without being objects, like Tuesday in the tree above." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-31", "text": "Moreover, given current limited availability of annotated tree banks, this more complex task will have to be solved with the same overall amount of data, aggravating the difficulty of estimating the model's parameters due to sparse data." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-32", "text": "----------------------------------" }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-33", "text": "**METHOD**" }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-34", "text": "Successfully addressing function parsing requires accurate parsing models and training data." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-35", "text": "Understanding the causes and the relevance of the observed results requires appropriate evaluation measures." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-36", "text": "In this section, we describe the methodology that will be used to assess our main hypothesis." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-37", "text": "----------------------------------" }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-38", "text": "**THE BASIC PARSING ARCHITECTURE**" }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-39", "text": "Our main hypothesis says that function labels can be successfully and automatically recovered while parsing, without affecting negatively the performance of the parser." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-40", "text": "It is possible that attempting to solve the function labelling and the parsing problem at the same time would require modifying existing parsing models, since their underlying independence assumptions might no longer hold." 
}, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-41", "text": "Moreover, many more parameters are to be estimated." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-42", "text": "It is therefore important to choose a statistical parser that can model our augmented labelling problem." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-43", "text": "We use a family of statistical parsers, the Simple Synchrony Network (SSN) parsers (Henderson, 2003) , which crucially do not make any explicit independence assumptions, and learn to smooth across rare feature combinations." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-44", "text": "They are therefore likely to adapt without much modification to the current problem." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-45", "text": "This architecture has shown state-of-the-art performance and is very adaptive to properties of the input." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-46", "text": "The architecture of an SSN parser comprises two components, one which estimates the parameters of a stochastic model for syntactic trees, and one which searches for the most probable syntactic tree given the parameter estimates." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-47", "text": "As with many other statistical parsers (Collins, 1999; Charniak, 2000) , the model of parsing is history-based." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-48", "text": "Its events are derivation moves." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-49", "text": "The set of well-formed sequences of derivation moves in this parser is defined by a Predictive LR pushdown automaton (Nederhof, 1994) , which implements a form of left-corner parsing strategy." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-50", "text": "2 The probability of a phrase-structure tree is equated to the probability of a finite (but unbounded) sequence of derivation moves." 
}, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-51", "text": "To bound the number of parameters, standard history-based models partition the set of prefixes of well-formed sequences of transitions into equivalence classes." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-52", "text": "While such a partition makes the problem of searching for the most probable parse polynomial, it introduces hard independence assumptions: a derivation move only depends on the equivalence class to which its history belongs." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-79", "text": "Note that the maximal precision or recall score of function labelling is strictly smaller than one-hundred percent if the precision or the recall of Table 2 : Confusion matrix for simple baseline model, tested on the validation set (section 24 of PTB)." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-53", "text": "SSN parsers, on the other hand, do not state any explicit independence assumptions: they induce a finite history representation of an unbounded sequence of moves, so that the representation of a move i \u2212 1 is included in the inputs to the represention of the next move i, as explained in more detail in (Henderson, 2003) ." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-54", "text": "SSN parsers only impose soft inductive biases to capture relevant properties of the derivation, thereby exhibiting adaptivity to the input." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-55", "text": "The art of designing SSN parsers consists in selecting and introducing such biases." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-56", "text": "To this end, it is sufficient to specify features that extract some information relevant to the next derivation move from previous ones, or some set of nodes that are structurally local to the node on top of the stack." 
}, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-57", "text": "These features and these nodes are input to the computation of a hidden history representation of the sequence of previous derivation moves." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-58", "text": "Given the hidden representation of a derivation, a log-linear distribution over possible next moves is computed." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-59", "text": "Thus, the set D of structurally local nodes and the set f of predefined features determine the inductive bias of an SSN system." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-60", "text": "Unless stated otherwise, for each of the experiments reported here, the set D that is input to the computation of the history representation of the derivation moves d 1 , . . . , d i\u22121 includes the following nodes: top i , the node on top of the pushdown stack before the ith move; the left-corner ancestor of top i ; the leftmost child of top i ; and the most recent child of top i , if any." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-61", "text": "The set of features f includes the last move in the derivation, the label or tag of top i , the tag-word pair of the most re-cently shifted word, the leftmost tag-word pair that top i dominates." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-62", "text": "----------------------------------" }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-63", "text": "**THE SET OF FUNCTION LABELS**" }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-64", "text": "The bracketting guidelines for the Penn Treebank II list 20 function labels, shown in Table 1 (Bies et al., 1995) ." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-65", "text": "Based on their description in the Penn Treebank guidelines, we partition the set of function labels into four classes, as indicated in the table." 
}, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-66", "text": "Following (Blaheta and Charniak, 2000) , we refer to the first class as syntactic function labels, and to the second class as semantic function labels." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-67", "text": "In the rest of the paper, we will ignore the other two classes, for they do not intersect with PropBank labels, and they do not form natural classes." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-68", "text": "Like previous work (Blaheta and Charniak, 2000) , we complete the sets of syntactic and semantic labels by labelling constituents that do not bear any function label with a NULL label." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-69", "text": "3" }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-70", "text": "----------------------------------" }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-71", "text": "**EVALUATION**" }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-72", "text": "To evaluate the performance of our function parsing experiments, we will use several measures." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-73", "text": "First of all, we apply the standard Parseval measures of labelled recall and precision to a parser whose training data contains the Penn Treebank function labels, to assess how well we solve the standard phrase structure parsing problem." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-74", "text": "We call these figures FLABEL-less figures in the tables below and we will call the task the (simple) parsing task in the rest of the paper." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-75", "text": "Second, we measure the accuracy of this parser with an extension of the Parseval measures of labelled precision and recall applied to the set of complex labels -the phrase structure non-terminals augmented with function labels-to evaluate how well the parser solves this complex parsing problem." 
}, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-76", "text": "These are the FLABEL figures in the tables below." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-77", "text": "We call this task the function parsing task." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-81", "text": "Following (Blaheta and Charniak, 2000) , incorrectly parsed constituents will be ignored (roughly 11% of the total) in the evaluation of the precision and recall of the function labels, but not in the evaluation of the parser." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-82", "text": "Of the correctly parsed constituents, some bear function labels, but the overwhelming majority do not bear any label, or rather, in our notation, they bear a NULL label." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-83", "text": "To avoid calculating excessively optimistic scores, constituents bearing the NULL label are not taken into consideration for computing overall recall and precision figures." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-84", "text": "NULL-labelled constituents are only needed to calculate the precision and recall of other function labels." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-85", "text": "(In other words, NULL-labelled constituents never contribute to the numerators of our calculations.) For example, consider the confusion matrix M in Table 2 , which reports scores for the semantic labels recovered by the baseline model described below." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-86", "text": "Precision is computed as" }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-87", "text": ". Recall is computed analogously." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-88", "text": "Notice that M [n, n], that is the [SEM-NULL,SEM-NULL] cell in the matrix, is never taken into account." 
}, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-89", "text": "----------------------------------" }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-90", "text": "**LEARNING FUNCTION LABELS**" }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-91", "text": "In order to assess the complexity of the task of predicting function labels while parsing, we run first the SSN on the function parsing task, without modifications to the parser." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-92", "text": "The confusion matrix for semantic function labels of this simple baseline model is illustrated in Table 2 ." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-93", "text": "It is apparent that the baseline model's largest cause of error is confusion between the labels and the NULL label." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-94", "text": "These misclassfications affect recall in particular." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-95", "text": "Consider, for example, the MNR label, where 40 out of 94 occurrences are not given a function label." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-96", "text": "We add two augmentations to the parser to alleviate this problem." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-97", "text": "The simple baseline parser treats NULL labels like other labels, and it does not distinguish subtypes of NULL labels." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-98", "text": "Our first augmentation of the parser is designed to discriminate among constituents with these NULL labels." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-99", "text": "We hypothesize that the label NULL (ie. SYN-NULL and SEM-NULL) is a mixture of types, which will be more accurately learnt separately." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-100", "text": "If the label NULL is learnt more precisely, the recall of the other labels will increase." 
}, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-101", "text": "The NULL label in the training set was automatically split into the mutually exclusive labels CLR, OBJ and OTHER." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-102", "text": "Constituents were assigned the OBJ label according to the conditions stated in (Collins, 1999) ." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-103", "text": "4 Another striking property of the simple baseline function parser is that the SSN tends to project NULL labels more than any other label." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-104", "text": "Since SSNs decide the label of a non-terminal at projection, this behaviour indicates that the parser does not have enough information at this point in the parse to project the correct function label." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-105", "text": "We hypothesize that finer-grained labelling will improve parsing performance." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-106", "text": "This observation is consistent with results reported in (Klein and Manning, 2003) , who showed that part-of-speech tags occurring in the Treebank are not fine-grained enough to discriminate between preterminals." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-107", "text": "For example, the tag TO labels both the preposition to and the infinitival marker." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-108", "text": "Extending (Klein and Manning, 2003) 's technique to function labelling, we split some part-of-speech tags into tags marked with semantic function labels." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-109", "text": "More precisely, we concentrate on the function labels DIR, LOC, MNR, PRP or TMP, which appear to cause the most trouble to the parser, as illustrated in Table 2 ." 
}, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-110", "text": "The label attached to a non-terminal was propagated down to the pre-terminal tag of its head." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-111", "text": "The labels in parentheses in Figure 1 illustrate the effect of this lowering of the labels." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-112", "text": "The goal of this tagsplitting is to indicate more clearly to the parser what kind of label to project on reading a word-tag pair in the input." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-113", "text": "To this end, re-labelling is applied only if the non-terminal dominates the pre-terminal immediately." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-114", "text": "This constraint guarantees that only those non-terminals that are actual projections of the preterminal are affected by this tag-splitting method." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-115", "text": "Linguistically, we are trying to capture the notion of maximal projection." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-116", "text": "5 This augmented model has a total of 188 non-terminals to represent labels of constituents, instead of the 33 of the original SSN parser." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-117", "text": "As a result of lowering the five function labels, 83 new part-of-speech tags were introduced to partition the original tagset of the Treebank." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-118", "text": "There are 819 tag-word pairs in this model, while the original SSN parser has a vocabulary size of 508 tagword pairs." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-119", "text": "These augmented tags as well as the 155 new non-terminals are included in the set f of features input to parsing decisions as described in section 2.1." 
}, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-120", "text": "SSN parsers do not tag their input sentences." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-121", "text": "To provide the augmented model with tagged input sentences, we trained an SVM tagger whose features and parameters are described in detail in (Gimenez and Marquez, 2004) ." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-122", "text": "Trained on section 2-21, the tagger reaches a performance of 95.8% on the test set (section 23) of the PTB using our new tag set." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-123", "text": "----------------------------------" }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-124", "text": "**EXPERIMENTS**" }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-125", "text": "In this section, we report the results of the experiments testing hypotheses concerning our function parser." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-126", "text": "All SSN function parsers were trained on 5 This condition was relaxed in a few cases to capture constructs such as coordinated PPs Table 3 : Percentage F-measure (F), recall (R), and precision (P) of the SSN baseline (Base) and augmented (Aug) parsers." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-127", "text": "H03 indicates the model illustrated in (Henderson, 2003) ." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-128", "text": "sections 2-21 from the Penn Treebank, validated on section 24, and tested on section 23." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-129", "text": "All models are trained on parse trees whose labels include function labels." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-130", "text": "Both results taking function labels into account (FLABEL) and results not taking them into account (FLABEL-less) are reported." 
}, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-131", "text": "All our models, as well as the parser described in (Henderson, 2003) , are run only once." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-132", "text": "6 These results are reported in Table 3 ." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-133", "text": "Our hypothesis states that we can perform function labelling and parsing at the same time, without loss in parsing performance." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-134", "text": "For this to be an interesting statement, we need to show that function labelling is not a straightforward extension of simple parsing." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-135", "text": "If simple parsing could be easily applied to function parsing, we should not have a degradation of an SSN parser model evaluated on the complex labels, compared to the same SSN parser evaluated only on phrase structure labels." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-136", "text": "As the results on the validation set indicate, our baseline model with function labels (FLABEL) is indeed lower than the performance of the parser when function labels are not taken into account (FLABEL-less), indicating that the function parsing task is more difficult than the simple parsing task." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-137", "text": "Since the function parsing problem is more difficult than simple parsing, it is then interesting to observe that performance of the augmented parser increases significantly (FLABEL column) (p < .001) without losing accuracy on the parsing task (FLABEL-less column), compared to the initial parsing performance (as indicated by the performance of H03)." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-138", "text": "Notice that, numerically, we do in fact a little better than H03, but this difference is not significant." 
}, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-139", "text": "7 Beside confirming that learning function labels does not increase parsing errors, we can also confirm that the nature of the errors remains the same." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-140", "text": "A separate comparison of labelled and unlabelled scores of our complex function parser indicates that unlabelled results are roughly 1% better than labelled results (F measure 89.8% on the validation set)." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-141", "text": "The original SSN parser exhibits the same differential." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-142", "text": "This shows that, like other simple parsers, the function parser makes mostly node attachment mistakes rather than labelling mistakes." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-143", "text": "A separate experiment only discriminating NULL labels indicates that this modification is indeed useful, but not as much as introducing new tags, on which we concentrate to explain the results." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-144", "text": "There is converging evidence indicating that the improvement in performance is due to having introduced new tag-word pairs, and not simply new words." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-145", "text": "First of all, of the 311 new tag-word pairs only 122 introduce truly new words." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-146", "text": "The remaining pairs are constituted by words that were already in the original vocabulary and have been retagged, or by tags associated to unknown words." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-147", "text": "Second, this interpretation of the results is confirmed by comparing different ways of enlarging the vocabulary size input to the SSN." 
}, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-148", "text": "(Henderson, 2003) tested the effect of larger input vocabulary on SSN performance by changing the frequency cut-off that selects the input tag-word pairs." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-149", "text": "A frequency cutoff of 200 yields a vocabulary of 508 pairs, while a cut-off of 20 yields 4242 pairs, 3734 of which comprise new words." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-150", "text": "This difference in input size does not give rise to an appreciable difference in performance." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-151", "text": "On the contrary, we observe that introducing 122 new words and 83 new tags improves results considerably." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-152", "text": "This leads us to conclude that the performance of the augmented model is not simply due to a larger vocabulary." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-153", "text": "We think that our tag-word pairs are effective because they are selected by a linguistically meaning-7 Significance was measured with the randomized significance test described in (Yeh, 2000 Table 4 : Percentage F-measure (F), recall (R), and precision (P) function labelling, separated for syntactic and semantic labels, for our models and Blaheta and Charniak's (BC00) and Blaheta's models (B04 FT, B04 KP)." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-154", "text": "The feature trees (FT) and kernel perceptrons (KP) are optimised separately for the two different sets of labels." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-155", "text": "ful criterion and are more informative exemplars for the parser." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-156", "text": "Instead, simply decreasing the frequency cut-off adds mostly types of words for which the parser already possesses enough evidence (in general, nouns)." 
}, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-157", "text": "Our method of lowering function labels acts as a finer-grained classification that partitions different kinds of complements based on their lexical semantic characteristics, yielding classes that are relevant to constituent structure." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-158", "text": "For instance, it is well known that lexical semantic properties of arguments of verbs are related to the verb's argument structure, and consequently to the parse tree that the verb occurs in." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-159", "text": "Partitioning a verb's complements into function classes could influence attachment decisions beneficially." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-160", "text": "We also think that the parser we use is particularly able to take advantage of these subclasses." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-161", "text": "One of the main properties of SSN parsers is that they do not need large vocabularies, because the SSN is good at generalising itemspecific properties into an internal hidden representation of word classes." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-162", "text": "Finally, to provide a meaningful and complete evaluation of the parser, it is necessary to examine the level of performance on the function labels for those constituents that are correctly parsed according to the usual Parseval measure, i.e. for those constituents for which the phrase structure labels and the string covered by the label have been correctly Table 5 : Percentage F-measure (F), recall (R), and precision (P) function labelling, separated for individual semantic labels, for validation set." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-163", "text": "recovered." 
}, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-164", "text": "Clearly, our parsing results would be uninteresting if our recall on function labels were very low." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-165", "text": "In that case, we would have failed to learn the function parsing task, and that would trivially yield a good performance on the simple parsing task." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-166", "text": "Table 4 reports the aggregated numbers for the baseline and the augmented model, while Table 5 reports separate figures for each semantic function label." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-167", "text": "These tables show that we also perform well on the labelling task alone." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-168", "text": "8 Comparison to other researchers (last three lines of Table 4) shows that we achieve state-of-the-art results with a single integrated model that is jointly optimised for all the different types of function labels and for parsing, while previous attempts are optimised separately for the two different sets of labels." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-169", "text": "In particular, our method performs better on semantic labels." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-170", "text": "----------------------------------" }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-171", "text": "**RELATED WORK**" }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-172", "text": "As far as we are aware, there is no directly comparable work, as nobody has so far attempted to fully merge function labelling or semantic role labelling into parsing." 
}, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-173", "text": "We will therefore discuss separately those pieces of work that have made limited use of function labels for parsing (Klein and Manning, 2003) , and those that have concentrated on recovering function labels as a separate task (Blaheta and Charniak, 2000; Blaheta, 2004) ." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-174", "text": "We cannot discuss here the large recent literature on semantic role labelling for reasons of space, apart from work that also recovers function labels and work that trains a parser on Propbank labels as the first stage of a semantic role labelling pipeline (Yi and Palmer, 2005) ." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-175", "text": "(Klein and Manning, 2003) and, to a much more limited extent, (Collins, 1999) are the only researchers we are aware of who used function labels for parsing." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-176", "text": "In both cases, the aim was actually to improve parser performance, consequently only few carefully chosen labels were used." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-177", "text": "(Klein and Manning, 2003) suggest the technique of tag splitting for the constituent bearing the label TMP." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-178", "text": "They also speculate that locative labels could be fruitfully percolated down the tree onto the preterminals." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-179", "text": "Results in Table 5 indicate more precisely that lowering locative labels does indeed bring about some improvement, but not as much as the MNR and TMP labels." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-180", "text": "In work that predates the availability of Framenet and Propbank, (Blaheta and Charniak, 2000) define the task of function labelling for the first time and highlight its relevance for NLP." 
}, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-181", "text": "Their method is in two-steps." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-182", "text": "First, they parse the Penn Treebank using a state-of-the-art parser (Charniak, 2000) ." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-183", "text": "Then, they assign function labels using features from the local context, mostly limited to two levels up the tree and only one next label." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-184", "text": "(Blaheta, 2004) extends on this method by developing specialised feature sets for the different subproblems of function labelling and slightly improves the results, as reported in Table 4 ." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-185", "text": "approach the problem of enriching the output of a parser in several steps." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-186", "text": "The first step applies memory-based learning to the output of a parser mapped to dependency structures." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-187", "text": "This step learns function labels." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-188", "text": "Only aggregated results for all function labels, and not only for syntactic or semantic labels, are provided." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-189", "text": "Although they cannot be compared directly to our results, it is interesting to notice that they are slightly better in F-measure than Blaheta's (F=88.5%)." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-190", "text": "(Yi and Palmer, 2005) share the motivation of our work, although they apply it to a different task." 
}, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-191", "text": "Like the current work, they observe that the distributions of semantic labels could potentially interact with the distributions of syntactic labels and redefine the boundaries of constituents, thus yielding trees that reflect generalisations over both these sources of information." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-192", "text": "----------------------------------" }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-193", "text": "**CONCLUSIONS**" }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-194", "text": "In this paper we have presented a technique to extend an existing parser to produce richer output, annotated with function labels." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-195", "text": "We show that both state-of-the-art results in function labelling and in parsing can be achieved." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-196", "text": "Application of these results are many-fold, such as information extraction or question answering where shallow semantic annotation is necessary." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-197", "text": "The technique illustrated in this paper is of wide applicability to all other semantic annotation schemes available today, such as Propbank and Framenet, and can be easily extended." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-198", "text": "Work to extend this technique to Propbank annotation is underway." }, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-199", "text": "Since function labels describe dependence relations between the predicative head and its complements, whether they be arguments or adjuncts, this paper suggests that a left-corner parser and its probabilistic model, which are defined entirely on configurational criteria, can be used to produce a dependency output." 
}, { "sent_id": "e92c6b44f4482ca868221bff551d67-C001-200", "text": "Consequences of this observation will be explored in future work." } ], "y": { "@EXT@": { "gold_contexts": [ [ "e92c6b44f4482ca868221bff551d67-C001-15" ], [ "e92c6b44f4482ca868221bff551d67-C001-27" ] ], "cite_sentences": [ "e92c6b44f4482ca868221bff551d67-C001-15", "e92c6b44f4482ca868221bff551d67-C001-27" ] }, "@USE@": { "gold_contexts": [ [ "e92c6b44f4482ca868221bff551d67-C001-43" ], [ "e92c6b44f4482ca868221bff551d67-C001-127" ], [ "e92c6b44f4482ca868221bff551d67-C001-147", "e92c6b44f4482ca868221bff551d67-C001-148" ] ], "cite_sentences": [ "e92c6b44f4482ca868221bff551d67-C001-43", "e92c6b44f4482ca868221bff551d67-C001-127", "e92c6b44f4482ca868221bff551d67-C001-148" ] }, "@BACK@": { "gold_contexts": [ [ "e92c6b44f4482ca868221bff551d67-C001-53" ], [ "e92c6b44f4482ca868221bff551d67-C001-148" ] ], "cite_sentences": [ "e92c6b44f4482ca868221bff551d67-C001-53", "e92c6b44f4482ca868221bff551d67-C001-148" ] }, "@SIM@": { "gold_contexts": [ [ "e92c6b44f4482ca868221bff551d67-C001-131" ] ], "cite_sentences": [ "e92c6b44f4482ca868221bff551d67-C001-131" ] } } }, "ABC_d8e3cdea7f61152ed37395c5f9393e_27": { "x": [ { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-59", "text": "**AUTOMATED METHOD -WORD INTRUSION**" }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-2", "text": "When evaluating the quality of topics generated by a topic model, the convention is to score topic coherence -either manually or automatically -using the top-N topic words." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-3", "text": "This hyper-parameter N , or the cardinality of the topic, is often overlooked and selected arbitrarily." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-4", "text": "In this paper, we investigate the impact of this cardinality hyper-parameter on topic coherence evaluation." 
}, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-5", "text": "For two automatic topic coherence methodologies, we observe that the correlation with human ratings decreases systematically as the cardinality increases." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-6", "text": "More interestingly, we find that performance can be improved if the system scores and human ratings are aggregated over several topic cardinalities before computing the correlation." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-7", "text": "In contrast to the standard practice of using a fixed value of N (e.g. N = 5 or N = 10), our results suggest that calculating topic coherence over several different cardinalities and averaging results in a substantially more stable and robust evaluation." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-8", "text": "We release the code and the datasets used in this research, for reproducibility." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-9", "text": "----------------------------------" }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-10", "text": "**1**" }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-11", "text": "----------------------------------" }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-12", "text": "**INTRODUCTION**" }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-13", "text": "Latent Dirichlet Allocation (\"LDA\": Blei et al. (2003) ) is an approach to document clustering, in which \"topics\" (multinomial distributions over terms) and topic allocations (multinomial distributions over topics per document) are jointly learned." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-14", "text": "When the topic model output is to be presented to humans, optimisation of the number of topics is a non-trivial problem." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-15", "text": "In the seminal paper of Chang et al. 
(2009) , e.g., the authors showed that, contrary to expectations, extrinsically measured topic coherence correlates negatively with model perplexity." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-16", "text": "They introduced the word intrusion task, whereby a randomly selected \"intruder\" word is injected into the top-N words of a given topic and users are asked to identify the intruder word." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-17", "text": "Low reliability in identifying the intruder word indicates low coherence (and vice versa), based on the intuition that the more coherent the topic, the more clearly the intruder word should be an outlier." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-18", "text": "Since then, several methodologies have been introduced to automate the evaluation of topic coherence." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-19", "text": "Newman et al. (2010) found that aggregate pairwise PMI scores over the top-N topic words correlated well with human ratings." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-20", "text": "Mimno et al. (2011) proposed replacing PMI with conditional probability based on co-document frequency." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-21", "text": "Aletras and Stevenson (2013) showed that coherence can be measured by a classical distributional similarity approach." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-22", "text": "More recently, Lau et al. (2014) proposed a methodology to automate the word intrusion task directly." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-23", "text": "Their results also reveal the differences between these methodologies in their assessment of topic coherence." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-24", "text": "A hyper-parameter in all these methodologies is the number of topic words, or its cardinality."
}, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-25", "text": "These methodologies evaluate coherence over the top-N topic words, where N is selected arbitrarily: for Chang et al. (2009) , N = 5, whereas for Newman et al. (2010) , Aletras and Stevenson (2013) and Lau et al. (2014) , N = 10." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-26", "text": "The germ of this paper came when using the automatic word intrusion methodology (Lau et al., 2014) , and noticing that introducing one extra word to a given topic can dramatically change the accuracy of intruder word prediction." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-27", "text": "This forms the kernel of this paper: to better understand the impact of the topic cardinality hyper-parameter on the evaluation of topic coherence." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-28", "text": "To investigate this, we develop a new dataset with human-annotated coherence judgements for a range of cardinality settings (N = {5, 10, 15, 20})." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-29", "text": "We experiment with the automatic word intrusion (Lau et al., 2014) and discover that correlation with human ratings decreases systematically as cardinality increases." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-30", "text": "We also test the PMI methodology (Newman et al., 2010) and make the same observation." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-31", "text": "To remedy this, we show that performance can be substantially improved if system scores and human ratings are aggregated over different cardinality settings before computing the correlation." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-32", "text": "This has broad implications for topic model evaluation." 
}, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-33", "text": "----------------------------------" }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-34", "text": "**DATASET AND GOLD STANDARD**" }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-35", "text": "To examine the relationship between topic cardinality and topic coherence, we require a dataset that has topics for a range of cardinality settings." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-36", "text": "Although there are existing datasets with human-annotated coherence scores (Newman et al., 2010; Aletras and Stevenson, 2013; Lau et al., 2014; Chang et al., 2009) , these topics were annotated using a fixed cardinality setting (e.g. 5 or 10)." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-37", "text": "We thus develop a new dataset for this experiment." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-38", "text": "Following Lau et al. (2014) , we use two domains: (1) WIKI, a collection of 3.3 million English Wikipedia articles (retrieved November 28th 2009); and (2) NEWS, a collection of 1.2 million New York Times articles from 1994 to 2004 (English Gigaword)." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-39", "text": "We sub-sample approximately 50M tokens (100K and 50K articles for WIKI and NEWS respectively) from both domains to create two smaller document collections." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-40", "text": "We then generate 300 LDA topics for each of the sub-sampled collection." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-41", "text": "2 There are two primary approaches to assessing topic coherence: (1) via word intrusion (Chang et (2) by directly measuring observed coherence (Newman et al., 2010; Lau et al., 2014) ." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-42", "text": "With the first method, Chang et al. 
(2009) injects an intruder word into the top-5 topic words, shuffles the topic words, and sets the task of selecting the single intruder word out of the 6 words." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-43", "text": "In preliminary experiments, we found that the word intrusion task becomes unreasonably difficult for human annotators when the topic cardinality is high, e.g. when N = 20." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-44", "text": "As such, we use the second approach as the means for generating our gold standard, asking users to judge topic coherence directly over different topic cardinalities." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-45", "text": "3 To collect the coherence judgements, we used Amazon Mechanical Turk and asked Turkers to rate topics in terms of coherence using a 3-point ordinal scale, where 1 indicates incoherent and 3 very coherent (Newman et al., 2010) ." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-46", "text": "For each topic (600 topics in total) we experiment with 4 cardinality settings: N = {5, 10, 15, 20}. For example, for N = 5, we display the top-5 topic words for coherence judgement." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-47", "text": "For annotation quality control, we embed a bad topic generated using random words into each HIT." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-48", "text": "Workers who fail to consistently rate these bad topics low are filtered out." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-49", "text": "4 On average, we collected approximately 9 ratings per topic in each cardinality setting (post-filtered), from which we generate the gold standard via the arithmetic mean." 
}, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-50", "text": "To understand the impact of cardinality (N ) on topic coherence, we analyse: (a) the mean topic rating for each N (Table 1) , and (b) the pairwise Pearson correlation coefficient between the same topics for different values of N (Table 2) ." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-51", "text": "Coherence decreases slightly but systematically as N increases, suggesting that users find topics less coherent (but marginally more consistently interpretable, as indicated by the slight drop in standard deviation) when more words are presented in a topic." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-52", "text": "The strong pairwise correlations, however, indicate that the ratings are relatively stable across different cardinality settings." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-53", "text": "To better understand the data, in Figure 1 we present scatter plots of the ratings for all pairwise cardinality settings (where a point represents a topic)." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-54", "text": "Note the vertical lines for x = 3.0 (cf." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-55", "text": "the weaker effect of horizontal lines for y = 3.0), in particular for the top 3 plots where we are comparing N = 5 against higher cardinality settings." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-56", "text": "This implies that topics that are rated as perfectly coherent (3.0) for N = 5 exhibit some variance in coherence ratings when N increases." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-57", "text": "Intuitively, it means that a number of perfectly coherent 5-word topics become less coherent as more words are presented." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-58", "text": "----------------------------------" }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-60", "text": "Lau et al. 
(2014) proposed an automated approach to the word intrusion task." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-61", "text": "The methodology computes pairwise word association features for the top-N words, and trains a support vector regression model to rank the words." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-62", "text": "The top-ranked word is then selected as the predicted intruder word." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-63", "text": "Note that even though it is supervised, no manual annotation is required as the identity of the true intruder word is known." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-64", "text": "Following the original paper, we use as features normalised PMI (NPMI) and two conditional probabilities (CP1 and CP2), computed over the full collection of WIKI (3.3 million articles) and NEWS (1.2 million articles), respectively." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-65", "text": "We use 10-fold cross validation to predict the intruder words for all topics." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-66", "text": "To generate an intruder for a topic, we select a random word that has a low probability (P < 0.0005) in the topic but high probability (P > 0.01) in another topic." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-67", "text": "We repeat this ten times to generate 10 different intruder words for a topic." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-68", "text": "The 4 cardinalities of a given topic share the same set of intruder words." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-69", "text": "To measure the coherence of a topic, we compute model precision, or the accuracy of intruder word prediction." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-70", "text": "For evaluation we compute the Pearson correlation coefficient r of model precisions and human ratings for each cardinality setting." 
}, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-71", "text": "Results are summarised in Table 3 Each domain has 2 sets of correlation figures, based on in-domain and out-of-domain features." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-72", "text": "Indomain (out-of-domain) features are word association features computed using the same (different) domain as the topics, e.g. when we compute coherence of WIKI topics using word association features derived from WIKI (NEWS)." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-73", "text": "The correlations using in-domain features are in general lower than for out-of-domain features." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-74", "text": "This is due to idiosyncratic words that are closely related in the collection, e.g. remnant Wikipedia markup tags." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-75", "text": "The topic model discovers them as topics and the word statistics derived from the same collection supports the association, but these topics are generally not coherent, as revealed by out-of-domain statistics." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-76", "text": "This result is consistent with previous studies (Lau et al., 2014) ." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-77", "text": "We see that correlation decreases systematically as N increases, implying that N has high impact on topic coherence evaluation and that if a single value of N is to be used, a lower value is preferable." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-78", "text": "To test whether we can leverage the additional information from the different values of N , we aggregate the model precision values and human ratings per-topic before computing the correlation (Table 3 : Cardinality = \"Avg\")." 
}, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-79", "text": "We also test the significance of difference for each N with the aggregate correlation using the Steiger Test (Steiger, 1980) The correlation improves substantially." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-80", "text": "In fact, for NEWS using in-domain features, the correlation is higher than that of any individual cardinality setting." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-81", "text": "This observation suggests that a better approach to automatically computing topic coherence is to aggregate coherence scores over different cardinality settings, and that it is sub-optimal to evaluate a topic by only assessing a single setting of N ." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-82", "text": "Instead, we should repeat it several times, varying N ." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-83", "text": "----------------------------------" }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-84", "text": "**AUTOMATED METHOD -NPMI**" }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-85", "text": "The other mainstream approach to evaluating topic coherence is to directly measure the average pairwise association between the top-N words." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-86", "text": "Newman et al. (2010) found PMI to be the best association measure, and later studies (Aletras and Stevenson, 2013; Lau et al., 2014) found that normalised PMI (NPMI: Bouma (2009)) improves PMI further." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-87", "text": "To see if the benefit of aggregating coherence measures over several cardinalities transfers across to other methodologies, we test the NPMI methodology." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-88", "text": "We compute the topic coherence using the full collection of WIKI and NEWS, respectively, for varying N ." 
}, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-89", "text": "Results are presented in Table 4 ." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-90", "text": "The in-domain features perform much worse, especially for the WIKI topics." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-91", "text": "NPMI assigns very high scores to several incoherent topics, thereby reducing the correlation to almost zero." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-92", "text": "These topics consist predominantly of Wikipedia markup tags, and the high association is due to word statistics idiosyncratic to the collection." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-93", "text": "Once again, aggregating the topic coherence over multiple N values boosts results further." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-94", "text": "The correlations using aggregation and out-of-domain features again produce the best results for both WIKI and NEWS." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-95", "text": "It is important to note that, while these findings were established based on manual annotation of topic coherence, for practical applications, topic coherence would be calculated in a fully-unsupervised manner (averaged over different topic cardinalities), without the use of manual annotations." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-96", "text": "----------------------------------" }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-97", "text": "**CONCLUSION**" }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-98", "text": "We investigate the impact of the cardinality of topic words on topic coherence evaluation." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-99", "text": "We found that human ratings decrease systematically when cardinality increases, although pairwise correlations are relatively high." 
}, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-100", "text": "We discovered that the performance of two automated methods -word intrusion and pairwise NPMI -can be substantially improved if the system scores and human ratings are aggregated over several cardinality settings before computing the correlation." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-101", "text": "Contrary to the standard practice of using a fixed cardinality setting, our findings suggest that we should assess topic coherence using several cardinality settings and then aggregate over them." }, { "sent_id": "d8e3cdea7f61152ed37395c5f9393e-C001-102", "text": "The human-judged coherence ratings, along with code to compute topic coherence, are available online." } ], "y": { "@BACK@": { "gold_contexts": [ [ "d8e3cdea7f61152ed37395c5f9393e-C001-25" ], [ "d8e3cdea7f61152ed37395c5f9393e-C001-36" ], [ "d8e3cdea7f61152ed37395c5f9393e-C001-41" ] ], "cite_sentences": [ "d8e3cdea7f61152ed37395c5f9393e-C001-25", "d8e3cdea7f61152ed37395c5f9393e-C001-36", "d8e3cdea7f61152ed37395c5f9393e-C001-41" ] }, "@USE@": { "gold_contexts": [ [ "d8e3cdea7f61152ed37395c5f9393e-C001-30" ], [ "d8e3cdea7f61152ed37395c5f9393e-C001-41", "d8e3cdea7f61152ed37395c5f9393e-C001-44" ], [ "d8e3cdea7f61152ed37395c5f9393e-C001-45" ] ], "cite_sentences": [ "d8e3cdea7f61152ed37395c5f9393e-C001-30", "d8e3cdea7f61152ed37395c5f9393e-C001-41", "d8e3cdea7f61152ed37395c5f9393e-C001-45" ] }, "@DIF@": { "gold_contexts": [ [ "d8e3cdea7f61152ed37395c5f9393e-C001-36", "d8e3cdea7f61152ed37395c5f9393e-C001-37" ] ], "cite_sentences": [ "d8e3cdea7f61152ed37395c5f9393e-C001-36" ] } } }, "ABC_fca75d394e9f7007e1f674c7b99794_27": { "x": [ { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-2", "text": "Multi-party linguistic entrainment refers to the phenomenon that speakers tend to speak more similarly during conversation." 
}, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-3", "text": "We first developed new measures of multi-party entrainment on features describing linguistic style, and then examined the relationship between entrainment and team characteristics in terms of gender composition, team size, and diversity." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-4", "text": "Next, we predicted the perception of team social outcomes using multi-party linguistic entrainment and team characteristics with a hierarchical regression model." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-5", "text": "We found that teams with greater gender diversity had higher minimum convergence than teams with less gender diversity." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-6", "text": "Entrainment contributed significantly to predicting perceived team social outcomes both alone and controlling for team characteristics." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-7", "text": "----------------------------------" }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-9", "text": "Linguistic entrainment is the phenomenon that speakers tend to speak similarly in conversations (Brennan and Clark 1996) ." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-10", "text": "It has attracted attention for its potential to improve performance when being incorporated into spoken dialogue systems." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-11", "text": "One successful example is that Lopes, Eskenazi, and Trancoso (2015) utilized entrainment to automatically choose better primes in the prompt of a dialogue system." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-12", "text": "In dyadic conversations, the existence of entrainment has been widely investigated for various linguistic features." 
}, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-13", "text": "Brennan and Clark (1996) have found that speakers tend to refer to the same object using identical lexical terms in their conversation." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-14", "text": "Levitan and Hirschberg (2011) found evidence of prosodic entrainment in intensity, pitch and voice quality." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-15", "text": "Niederhoffer and Pennebaker (2002) found that speaker pairs gradually matched their linguistic style in conversation." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-16", "text": "Besides linguistic features, researchers have also linked dyadic entrainment to other aspects of conversations, such as task success (Reitter and Moore 2007) , gender factors (Namy, Nygaard, and Sauerteig 2002) and social behaviors (Levitan et al. 2012) ." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-17", "text": "In recent years, researchers have started studying multi-party entrainment in conversations with more than two speakers." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-18", "text": "As with dyadic entrainment, the existence of multi-party entrainment has also been found for different linguistic features." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-19", "text": "For example, Rahimi et al. (2017) found speakers in the same group entrained on high Copyright c 2019, Association for the Advancement of Artificial Intelligence (www.aaai.org)." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-20", "text": "All rights reserved." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-21", "text": "frequency words and topic words." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-22", "text": "Litman et al. (2016) found significant group-level differences in pitch, jitter and shimmer between first and second halves of conversation." 
}, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-23", "text": "Multiparty entrainment has also been associated with other aspects of conversations such as task performance and group cohesiveness." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-24", "text": "Friedberg, Litman, and Paletz (2012) found that higher scoring teams are more likely to entrain in the use of task-related terms." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-25", "text": "Gonzales, Hancock, and Pennebaker (2010) suggested that group linguistic style matching is a significant indicator of team cohesiveness." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-26", "text": "Compared to studies demonstrating the existence of multi-party entrainment, however, fewer studies have investigated the links between multi-party entrainment and other aspects of conversations." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-27", "text": "We present two investigations of such relationships." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-64", "text": "For instance, the Engineer uttered 24 words in this excerpt but only one word belongs to the category negate." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-28", "text": "First, we investigate whether group or team characteristics relate to multi-party entrainment, since multiple individuals simultaneously engage in the same conversation." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-29", "text": "While dyad research has analyzed entrainment and gender composition (Levitan et al. 2012; Namy, Nygaard, and Sauerteig 2002) , relationships between team characteristics and multiparty entrainment could be more complex, given the increasing number of person-to-person and person-to-team communications." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-30", "text": "We examine such relationships using cooperative game conversations, in which multiple speakers are brought together as a team." 
}, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-31", "text": "Three types of team characteristics are investigated: gender composition as in prior work, as well as team size and diversity." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-32", "text": "Second, since relationships between multi-party entrainment and other conversational aspects are not well established, it is plausible that correlations found in prior studies (Friedberg, Litman, (Amason and Sapienza 1997) ." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-33", "text": "We hypothesize that both entrainment and team characteristics, specifically team size, gender composition and the diversity of a team, are associated with the perception of team social outcomes." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-34", "text": "We use hierarchical regression models to examine the unique contribution of multiparty entrainment in explaining perceived team social outcomes above and beyond team characteristics." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-35", "text": "Finally, to support our studies, we have developed an innovative representation of multi-party entrainment by extending the measurement from Litman et al. (2016) and adapting it to study the feature of linguistic style from Pennebaker and King (1999) ." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-36", "text": "We used this measure to statistically examine relationships between multi-party entrainment and team characteristics in a set of dialogues from a freely available cooperative game corpus." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-37", "text": "We also demonstrated that multi-party entrainment is associated with team outcomes, even after controlling for team characteristics." 
}, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-38", "text": "----------------------------------" }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-39", "text": "**THE TEAMS CORPUS**" }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-40", "text": "The freely available Teams Corpus (Litman et al. 2016) The corpus also includes survey data." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-41", "text": "A pre-game survey collected personal information such as age, gender, and eight options for ethnicity." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-42", "text": "While each participant could choose multiple options, in this paper we categorize each speaker into nine exclusive categories: Caucasian (150), East Asian (12), South Asian (11), Pacific Islander (0), Black (15), Native American (0), Hispanic (3), Middle Eastern (2), and Multiple Ethnicity (20) for participants who chose more than one of the other categories." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-43", "text": "The gender data yields seven types of team gender composition: 0% female (2), 25% female (4), 33% female (7), 50% female (9), 66% female (18), 75% female (10), 100% female (12)." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-44", "text": "Participants also took post-game surveys to evaluate team processes." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-45", "text": "These surveys contained a series of self-report questions on team cohesion, satisfaction, and other team social outcome constructs." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-46", "text": "The studies presented in this paper are based only on the data related to the first of the two games, as only these transcriptions were available." 
}, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-47", "text": "Before computing entrainment, we further processed these transcripts by removing punctuation, converting all words to lower case, and removing a list of interjections, e.g., 'hmm', that are not discussed in linguistic style (Pennebaker and King 1999) ." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-48", "text": "We then concatenated all the processed IPU transcriptions for each speaker." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-49", "text": "----------------------------------" }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-50", "text": "**FEATURES AND MEASURES**" }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-51", "text": "----------------------------------" }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-52", "text": "**FEATURES OF SPEAKER LINGUISTIC STYLE**" }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-53", "text": "Before computing entrainment, we first extracted linguistic style features for each speaker in each transcript using LIWC2007 (Pennebaker, Booth, and Francis 2007), a computational application for text analysis that includes a dictionary mapping a list of words to 64 psychological and linguistic categories." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-54", "text": "We used this dictionary to label each word in each speaker's concatenated IPU transcripts with potentially multiple LIWC categories." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-55", "text": "The final number of occurrences of each category was then converted into a percent." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-56", "text": "In our study we only focused on a limited subset of LIWC categories, namely function words." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-57", "text": "The first reason is that function words reflect the speaker's psychological state and convey information about the interactive process (Chung and Pennebaker 2007) ." 
}, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-58", "text": "Function words represent a highlevel linguistic difference in style." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-59", "text": "Second, in contrast to content words, function words do not rely on any specific task domain (Gonzales, Hancock, and Pennebaker 2010) and have a very high frequency in daily speech (Rochon et al. 2000) ." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-60", "text": "Using function words as features can alleviate feature sparsity." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-61", "text": "Since a considerable number of studies about linguistic style have used function words (Gonzales, Hancock, and Pennebaker 2010; Danescu-Niculescu-Mizil, Gamon, and Dumais 2011; Mukherjee and Liu 2012), we directly adopted Gonzales, Hancock, and Pennebaker's (2010) selection of 9 LIWC categories as function words." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-62", "text": "Figure 1 shows how we used LIWC to create function word features from a transcript excerpt." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-63", "text": "After the transcript preprocessing and speaker IPU concatenation discussed above, LIWC scored each speaker's input text and generated the category percentages for each of the 64 categories." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-65", "text": "Thus, the category percentage for negate is 1/24 = 4.20%." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-66", "text": "Since one word may belong to multiple categories, the sum of category percentages for the 64 categories may exceed 100." 
}, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-67", "text": "----------------------------------" }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-68", "text": "**MEASURES OF TEAM LINGUISTIC STYLE ENTRAINMENT**" }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-69", "text": "There are various methods to directly calculate multi-party entrainment using linguistic style." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-70", "text": "Some text-based studies have proposed probabilistic frameworks in linguistic style matching based on pairwise comparisons between speakers (Mukherjee and Liu 2012; Danescu-Niculescu-Mizil, Gamon, and Dumais 2011)." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-71", "text": "However, compared to their data, our data has a lower density of reciprocated interactions per pair." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-72", "text": "The number of conversations between speakers is insufficient for constructing such probability models." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-73", "text": "Addressee identification to create appropriate pairs is also not straightforward." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-74", "text": "Gonzales, Hancock, and Pennebaker (2010) developed a method to perform linguistic style matching based on multi-party speech, but they only focused on a global measure rather than on the degree of change." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-75", "text": "Recently, Litman et al. (2016) proposed a method to compute multi-party entrainment on acoustic-prosodic features based on the same Teams Corpus as used here." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-76", "text": "Their method highlighted feature change over time, which is more relevant to linguistic style entrainment." 
}, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-77", "text": "For each feature, they calculated the difference between a pair of speakers as the absolute difference of feature values, and the team difference as the average difference over all pairs." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-78", "text": "In our study, linguistic style is a single feature with multiple categories, so we converted their calculation of pair differences by summing up all the category differences." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-79", "text": "Moreover, we weighted category differences by the frequency of categories." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-80", "text": "More specifically, T Dif f unw (unweighted team difference) converts the team difference of Litman et al. (2016) to deal with multiple feature categories." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-81", "text": "T Dif f w (weighted team difference) extends T Dif f unw by weighting the category differences similarly to Gonzales, Hancock, and Pennebaker (2010) ." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-82", "text": "We calculated both T Dif f unw and T Dif f w for each pair of speakers and then averaged over all pairs." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-120", "text": "Diversity of age, which has continuous values, is measured by the population standard deviation." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-83", "text": "The formulas are shown in Equations 1, 2, and 3, where F, K, and |team size| respectively refer to the function word category set, an arbitrary function word category, and the team size." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-84", "text": "KDif f ij refers to the weighted category difference of category K between speakers i and j." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-85", "text": "Litman et al. 
(2016) then define convergence, a type of entrainment measuring increase in feature similarity, by comparing the T Dif f of two non-overlapping temporal intervals of a game as in Equation 4." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-86", "text": "C ij and T Dif f refer to the team's convergence and the weighted (or unweighted) team differences, respectively." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-87", "text": "Assuming the game is divided into n disjoint temporal intervals, i and j refer to two predetermined temporal intervals in chronological order." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-88", "text": "However, this definition leaves two unanswered questions." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-89", "text": "First, the measure of convergence allows negative values that represent divergence, which is the tendency that team members speak differently." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-90", "text": "Second, it requires the researcher to hand pick temporal intervals that are not guaranteed to result in an optimal measurement of entrainment." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-91", "text": "Hence, we derived four new variables of convergence (see Equations 5 and 6): Max and Min calculating the maximum and minimum positive C ij , and absMax and absMin calculating the absolute maximum and minimum |C ij |." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-92", "text": "Rather than two fixed intervals, we iterated over all two arbitrary temporal intervals in chronological order and conducted the comparison." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-93", "text": "Consequently, the Max and Min only measure maximum and minimum convergence so that they directly reflect the decrement of T Dif f between two intervals." 
}, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-94", "text": "The absMax and absMin measure the maximum and minimum magnitude of the change of T Dif f in the entire conversation." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-95", "text": "Unlike the Min and Max, the absMax and absMin are determined by the values of convergence or divergence." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-96", "text": "We added the absMax and absMin beyond Min and Max so that they reflect the overall fluctuation ranges of T Dif f , which might also be an important aspect of entrainment." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-97", "text": "Therefore in total, we defined eight measures of team entrainment: unweighted and weighted Max, Min, absMin, and absMax convergence." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-98", "text": "The parameter n in Equation 4 determines the length of temporal intervals being compared." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-99", "text": "Many studies defined n as two so that the conversation is evenly divided into two halves (Levitan and Hirschberg 2011; Rahimi et al. 2017 )." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-119", "text": "Note that the female percentage measures the numerical female dominance in a team, while gender diversity indicates the variability of gender composition." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-100", "text": "Since Litman et al. (2016) previously found that in the Teams corpus the highest acoustic-prosodic convergence occurred within the first and last three minutes, we used this finding to define our n. We evenly divided each game, which was limited to 30 minutes, into ten intervals, so each interval is less than three minutes." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-101", "text": "Since our focus is on measure development in this paper, methods for optimally tuning this temporal parameter are left for future work." 
}, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-102", "text": "We will use the excerpt in Figure 1 to illustrate our calculations." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-103", "text": "Assuming n is set to two, we first divide the excerpt into two time intervals." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-104", "text": "Assuming that the temporal midpoint of the excerpt occurs after the fourth IPU, the first interval includes the first through fourth IPUs." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-105", "text": "The second interval includes the fifth through seventh IPUs." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-106", "text": "For each speaker, all IPUs in each interval are concatenated and input to LIWC." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-107", "text": "The interval division and LIWC category percentage output are shown in Figure 2." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-108", "text": "Based on Equation 1, the unweighted pair difference between the Engineer and Pilot in the first interval is calculated as the sum of the absolute differences of all categories, which is equivalent to |0 \u2212 11.11| + |6.25 \u2212 0| + |12.5 \u2212 0| + |12.5 \u2212 22.22| + |18.75 \u2212 11.11| + |6.25 \u2212 0| + |12.5 \u2212 11.11| + |0 \u2212 0| + |25 \u2212 22.22| = 57.64." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-109", "text": "Similarly, the pair differences between the other two pairs (Engineer and Messenger, Pilot and Messenger) are 52.08 and 50." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-110", "text": "The unweighted team difference is the average of these pair differences, which is 53.24." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-111", "text": "The weighted team difference is calculated using Equation 2, with the pair difference now being normalized by the frequency of each category."
}, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-112", "text": "For instance, the absolute difference between Engineer and Pilot for negate is |6.25 \u2212 0| = 6.25." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-113", "text": "This number is less than the absolute difference of |18.75 \u2212 11.11| = 7.64 for the category ppron." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-114", "text": "However, the occurrence of negate is less common than ppron in the speech of Engineer and Pilot." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-115", "text": "The weighted difference of negate is |6.25 \u2212 0|/(6.25 + 0) = 1, which is now greater than the weighted difference of ppron, which is |18.75 \u2212 11.11|/(18.75 + 11.11) = 0.26." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-116", "text": "----------------------------------" }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-117", "text": "**MEASURES OF TEAM CHARACTERISTICS**" }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-118", "text": "This paper focuses on the following team characteristics: team size, gender diversity (Blau's index and female percentage), ethnic diversity, and age diversity." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-121", "text": "Diversity of ethnicity and gender with categorical values is measured by Blau's index of heterogeneity (Blau 1977) as in Equation 7, where P_k is the proportion of a specific category k."
}, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-122", "text": "----------------------------------" }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-123", "text": "**ETHNIC/GENDER DIVERSITY**" }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-124", "text": "Blau's = 1 \u2212 \u03a3_k P_k^2 (7)" }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-125", "text": "----------------------------------" }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-126", "text": "**MEASURES OF PERCEIVED TEAM SOCIAL OUTCOMES**" }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-127", "text": "We assessed the perception of team social outcomes using the existing self-reported post-game survey responses." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-128", "text": "----------------------------------" }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-129", "text": "**RESULTS**" }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-130", "text": "----------------------------------" }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-131", "text": "**RELATING TEAM CHARACTERISTICS AND ENTRAINMENT**" }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-132", "text": "We first tested the relationship between linguistic style entrainment and team characteristics with continuous values (gender, ethnic and age diversity) using Spearman rho correlations." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-133", "text": "There was a significant positive correlation between unweighted convergence Min and gender diversity (r(62) = .22, p < .05)." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-134", "text": "This correlation indicated that teams with greater gender diversity had higher minimum convergence than teams with less gender diversity." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-135", "text": "We then performed one-way ANOVA tests between linguistic style entrainment and the categorical team characteristics, i.e., percentage of females and team size."
}, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-136", "text": "The unweighted absMax was found to significantly vary with female percentage for the 7 conditions (see corpus section), F(6,55) = 2.79, p = .019." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-137", "text": "Tukey HSD post hoc tests indicated that the 25% condition (N = 4, M = 40.15, SD = 13.263) differed significantly from the 50% condition (N = 9, M = 19.56, SD = 9.435), 66% condition (N = 18, M = 19.39, SD = 9.407) and 75% condition (N = 10, M = 18.92, SD = 8.117)." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-138", "text": "The mean of the 25% condition was larger than those of the other three conditions." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-139", "text": "This finding suggests that the maximum magnitude of the change of unweighted team differences in the 4-person team with one female was greater than in other mixed-gender teams with more than one female." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-140", "text": "----------------------------------" }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-141", "text": "**PREDICTING PERCEIVED TEAM SOCIAL OUTCOMES**" }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-142", "text": "We predicted four measures of perceived team social outcomes: team processes (MIN = -2.57, MAX = 1.51, M = 0.00, SD = 0.80); task conflict (MIN = 1.00, MAX = 3.33, M = 1.75, SD = 0.46); process conflict (MIN = 1.00, MAX = 3.00, M = 1.58, SD = 0.41) and relationship conflict (MIN = 1.00, MAX = 1.75, M = 1.15, SD = 0.20)." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-143", "text": "A hierarchical linear regression (HLR) model allows us to consider the impact of team characteristics on the perception of team social outcomes, and then examine the significance of multi-party entrainment as a predictor beyond or controlling for team characteristics."
}, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-144", "text": "Two models, Model 1 (M1) and Model 2 (M2), were constructed." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-145", "text": "Team size and team diversity (gender, ethnic and age) were entered simultaneously as independent team characteristic variables (IVs)." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-146", "text": "Multi-party entrainment was entered into M2 as an IV beyond the team characteristics." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-147", "text": "Only variables of multi-party entrainment that significantly contributed to the model were selected in M2." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-148", "text": "The dependent variable (DV) of each HLR was the variable describing the perceived team social outcomes." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-149", "text": "Significant HLR models are shown in Table 1." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-150", "text": "The M2 models predicting task and process conflict were both significant, but neither M1 was." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-151", "text": "In the HLR predicting task conflict, no team characteristics contributed significantly to M1." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-152", "text": "Introducing variables of multi-party entrainment to M2 explained an additional 9.2% of the variation in task conflict and the \u2206R\u00b2 was significant, \u2206F(1, 55) = 6.46, p < 0.05." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-153", "text": "Unweighted absMax and team size were both significant contributors to task conflict in M2." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-154", "text": "The negative association between unweighted absMax and task conflict suggested that a higher maximum magnitude of the change of team difference signaled less task conflict."
}, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-155", "text": "Meanwhile, the positive association between team size and task conflict in M2 added evidence to previous findings that team size is positively associated with team conflict (Amason and Sapienza 1997; Smith et al. 1994)." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-156", "text": "In the HLR predicting process conflict, team size contributed significantly to M1." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-157", "text": "Adding multi-party entrainment (the weighted absMax) to M2 explained an additional 12.2% of the variability in process conflict, and the \u2206R\u00b2 was significant, \u2206F(1, 55) = 9.36, p < 0.01." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-158", "text": "Multi-party entrainment, team size and ethnic diversity were all important predictors in M2." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-159", "text": "Team size and ethnic diversity were both positively associated with process conflict." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-160", "text": "We observed a negative association between the weighted absMax and process conflict." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-161", "text": "This finding implied that a higher maximum magnitude of the change of team difference signaled less process conflict." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-162", "text": "Overall, we found a negative association between maximum magnitude of the change of team difference and team conflict, specifically process and task conflict." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-163", "text": "Team size and ethnic diversity both had effects on team conflict." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-164", "text": "Maximum magnitude of the change of team difference was a significant predictor of team conflict."
}, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-165", "text": "To determine whether the team characteristics had a significant impact on the conflict variables above and beyond the effect for entrainment, we switched the IVs in M1 and M2." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-166", "text": "Variables of entrainment were entered into M1 stepwise and then the team characteristics that had shown significance in the previous HLR were entered into M2 (see Table 2 )." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-167", "text": "We observed similar findings in that both M1 and M2 significantly predicted task and process conflict." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-168", "text": "The maximum magnitude of the change of the team difference was significantly negatively associated with task and process conflict." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-169", "text": "Team size and, for process conflict, ethnic diversity were significantly related to conflict above and beyond entrainment." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-170", "text": "----------------------------------" }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-171", "text": "**CONCLUSIONS AND FUTURE WORK**" }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-172", "text": "We first proposed a new method for measuring multi-party linguistic style entrainment by converting and extending methods developed in prior studies of both linguistic style matching and team acoustic-prosodic entrainment." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-173", "text": "We then examined the relationship between multi-party entrainment and team characteristics." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-174", "text": "Our analysis implies that teams with greater gender diversity had greater minimum convergence than teams with less gender diversity, similarly to the findings of Levitan et al. 
(2012) and Namy, Nygaard, and Sauerteig (2002) that mixed-gender pairs generally entrain more in dyadic conversations." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-175", "text": "Moreover, the 4-person teams with only one female had a higher maximum magnitude of change in team difference than the other mixed-gender teams." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-176", "text": "Perhaps in those other teams the existence of a female subgroup reconciled the team difference." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-177", "text": "In conclusion, different gender compositions affect the entraining behaviors of the overall team." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-178", "text": "These findings show that gender plays an important role in linguistic entrainment in human interactions." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-179", "text": "They also reveal a need to study the underlying process of multi-party entrainment at different granularity levels." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-180", "text": "Next, we predicted the perception of team social outcomes by team characteristics and variables of entrainment with hierarchical regression models." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-181", "text": "The experimental results indicated that the maximum magnitude of the change of the team difference was negatively associated with team conflict." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-182", "text": "Adding this variable of entrainment beyond team characteristics resulted in statistically significant improvements in model prediction." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-183", "text": "Finally, by entering entrainment variables in the first rather than second model, we showed that entrainment was significantly negatively associated with task and process conflict, both when controlling for team characteristics and when not."
}, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-184", "text": "Although the overall models did not account for a large amount of variance, the base model of only team characteristics was improved significantly by adding entrainment." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-185", "text": "In sum, we found that entrainment is a promising feature for predicting team social outcomes." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-186", "text": "In terms of broader impact, it may now be possible to evaluate the success of team conversations using linguistic style entrainment." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-187", "text": "Additional interdisciplinary research building on our findings could test whether entrainment mediates the effects of team characteristics on social and task outcomes in different settings." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-188", "text": "In future work, we will investigate different feature combinations and prediction models to improve the performance of our statistical models." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-189", "text": "To further improve the calculation of multi-party entrainment, we intend to search for an optimal temporal window." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-190", "text": "Additionally, we plan to review the validity and accuracy of the team social outcomes, which were measured with self-reported surveys." }, { "sent_id": "fca75d394e9f7007e1f674c7b99794-C001-191", "text": "We also plan to investigate the relationship between multi-party entrainment and individual differences, such as personality and education background."
} ], "y": { "@BACK@": { "gold_contexts": [ [ "fca75d394e9f7007e1f674c7b99794-C001-22" ] ], "cite_sentences": [ "fca75d394e9f7007e1f674c7b99794-C001-22" ] }, "@EXT@": { "gold_contexts": [ [ "fca75d394e9f7007e1f674c7b99794-C001-35" ] ], "cite_sentences": [ "fca75d394e9f7007e1f674c7b99794-C001-35" ] }, "@USE@": { "gold_contexts": [ [ "fca75d394e9f7007e1f674c7b99794-C001-40" ], [ "fca75d394e9f7007e1f674c7b99794-C001-80" ], [ "fca75d394e9f7007e1f674c7b99794-C001-100" ] ], "cite_sentences": [ "fca75d394e9f7007e1f674c7b99794-C001-40", "fca75d394e9f7007e1f674c7b99794-C001-80", "fca75d394e9f7007e1f674c7b99794-C001-100" ] }, "@SIM@": { "gold_contexts": [ [ "fca75d394e9f7007e1f674c7b99794-C001-75" ], [ "fca75d394e9f7007e1f674c7b99794-C001-85" ] ], "cite_sentences": [ "fca75d394e9f7007e1f674c7b99794-C001-75", "fca75d394e9f7007e1f674c7b99794-C001-85" ] } } }, "ABC_f3c2c538019b1d9daa8e6c932d9826_27": { "x": [ { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-21", "text": "Hence, for efficiency the candidate images are selected a priori using an information retrieval engine." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-2", "text": "Abstract. Topics generated by topic models are usually represented by lists of t terms or alternatively using short phrases or images." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-3", "text": "The current state-of-the-art work on labeling topics using images selects images by re-ranking a small set of candidates for a given topic." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-4", "text": "In this paper, we present a more generic method that can estimate the degree of association between any arbitrary pair of an unseen topic and image using a deep neural network." 
}, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-5", "text": "Our method achieves better runtime performance, O(n) compared to O(n\u00b2) for the current state-of-the-art method, and is also significantly more accurate." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-6", "text": "----------------------------------" }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-8", "text": "Topic models [5] are a popular method for organizing and interpreting large document collections by grouping documents into various thematic subjects (e.g. sports, politics or lifestyle) called topics." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-9", "text": "Topics are multinomial distributions over a predefined vocabulary whereas documents are represented as probability distributions over topics." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-10", "text": "Topic models have proven to be an elegant way to build exploratory interfaces (i.e. topic browsers) for visualizing document collections by presenting users with lists of topics [6, 15, 14] from which they select documents on a particular topic of interest." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-11", "text": "A topic is traditionally represented by a list of t terms with the highest probability." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-12", "text": "In recent works, short phrases [11, 4], images [3] or summaries [19] have been used as alternatives." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-13", "text": "Particularly, images offer a language independent representation of the topic which can also be complementary to textual labels." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-14", "text": "The visual representation of a topic has been shown to be as effective as textual labels for retrieving information using a topic browser, while it can be understood quickly by the users [1, 2]."
}, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-15", "text": "The task of labeling topics consists of two main components: (1) a candidate generation component where candidate labels are obtained for a given topic (usually using information retrieval techniques and knowledge bases [11, 3]), and (2) a ranking (or label selection) component that scores the candidates according to their relevance to the topic." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-16", "text": "In the case of labeling topics with images the candidate labels consist of images." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-17", "text": "The method presented by [3] generates a graph whose nodes are the candidate images." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-18", "text": "The edges are weighted with a similarity score between the images they connect." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-19", "text": "Then, an image is selected by re-ranking the candidates using PageRank." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-20", "text": "The method is iterative and has a runtime complexity of O(n\u00b2), which makes it infeasible to run over a large number of images." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-22", "text": "Thus this method is limited to the local problem of re-ordering a small set of candidate images for a given topic." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-23", "text": "Furthermore, its accuracy is limited by the recall of the information retrieval engine." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-24", "text": "Finally, whenever new candidates appear, they must be added to the graph and the process of computing pairwise similarities and re-ranking the nodes must be repeated." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-25", "text": "In this work, we present a more generic method that directly estimates the appropriateness of any arbitrary pair of topic and image."
}, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-26", "text": "We refer to this method as a global method to differentiate it from the localized approach described above." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-27", "text": "We utilize a Deep Neural Network (DNN) to estimate the suitability of an image for labeling a given topic." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-28", "text": "DNNs have proven to be effective in various IR and NLP tasks [7, 16]." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-29", "text": "They combine multiple layers that perform non-linear transformations to the data allowing the automatic learning of high-level abstractions." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-30", "text": "At runtime our method computes dot products between various features and the model weights to obtain the relevance score, which gives it a runtime complexity of O(n)." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-31", "text": "Hence, it is suitable for use over large image sources such as Flickr, Getty or ImageNet [9]." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-32", "text": "The proposed model obtains state-of-the-art results for labeling completely unseen topics with images compared to previous methods and strong baselines." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-33", "text": "----------------------------------" }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-34", "text": "**MODEL**" }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-35", "text": "For a topic T and an image I, we want to compute a real value s \u2208 R that denotes how good the image I is for representing the topic T." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-36", "text": "T consists of the ten terms t with the highest probability for the topic." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-37", "text": "We denote the visual information of the image as V."
}, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-38", "text": "The image is also associated with text in its caption, C." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-39", "text": "For the topic T = {t_1, t_2, ..., t_10} and the image caption C = {c_1, c_2, ..., c_n}, each term is transformed into a vector x \u2208 R^d, where d is the dimensionality of the distributed semantic space." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-40", "text": "We use pre-computed dependency-based word embeddings [12], whose d is 300." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-41", "text": "The resulting representations of T and C are the mean vectors of their constituent words, x_t and x_c respectively." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-42", "text": "The visual information from the image V is converted into a dense vectorized representation, x_v." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-43", "text": "This is the output of the publicly available 16-layer VGG-net [13] trained on the ImageNet dataset [9]." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-44", "text": "VGG-net provides a 1000 dimensional vector which is the soft-max classification output of ImageNet classes." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-45", "text": "The input to the network is the concatenation of topic, caption and visual vectors." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-46", "text": "i.e.," }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-47", "text": "This results in a 1600-dimensional input vector." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-48", "text": "Then, X is passed through a series of four hidden layers, H_1, ..., H_4." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-49", "text": "In this way the network learns a combined representation of topics and images and the non-linear relationships that they share."
}, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-50", "text": "where g is the rectified linear unit (ReLU) and h_0 = X. The output of each hidden layer is regularized using dropout [17]." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-51", "text": "The output sizes of H_1, H_2, H_3 and H_4 are set to 256, 128, 64 and 32 nodes respectively." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-52", "text": "The output layer of the network maps the input to a real value s \u2208 R that denotes how good the image I is for the topic T." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-53", "text": "The network is trained by minimizing the mean absolute error:" }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-54", "text": "where s_g is the ground-truth relevance value." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-55", "text": "The network is optimized using a standard mini-batch gradient descent method with the RMSProp adaptive learning rate algorithm [18]." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-56", "text": "----------------------------------" }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-57", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-58", "text": "We evaluate our model on the publicly available data set provided by [3]." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-59", "text": "It consists of 300 topics generated using Wikipedia articles and news articles taken from the New York Times." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-60", "text": "Each topic is represented by ten terms with the highest probability." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-61", "text": "They are also associated with 20 candidate image labels and their human ratings between 0 (lowest) and 3 (highest) denoting the appropriateness of these images for the topic."
}, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-62", "text": "This results in a total of 6K images and their associated textual metadata, which are considered as captions." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-63", "text": "The task is to choose the image with the highest rating from the set of the 20 candidates for a given topic." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-64", "text": "The 20 candidate image labels per topic are collected by [3] using an information retrieval engine (Google)." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-65", "text": "Hence most of them are expected to be relevant to the topic." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-66", "text": "This jeopardizes the training of our supervised model due to the lack of sufficient negative examples." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-67", "text": "To address this issue we generate extra negative examples." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-68", "text": "For each topic we sample another 20 images from random topics in the training set and assign them a relevance score of 0." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-69", "text": "These extra images are added to the training data." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-70", "text": "Our evaluation follows prior work [11, 3] using two metrics." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-71", "text": "The Top-1 average rating is the average human rating assigned to the top-ranked label proposed by the topic labeling method." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-72", "text": "This metric provides an indication of the overall quality of the label selected and takes values from 0 (irrelevant) to 3 (relevant)."
}, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-73", "text": "The normalized discounted cumulative gain (nDCG) compares the label ranking proposed by the labeling method to the gold-standard ranking provided by the human annotators [10, 8]." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-74", "text": "Table 1: Results obtained for the various topic labeling methods." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-75", "text": "\u2020, \u2021 and * denote statistically significant difference to Local PPR, Global PPR and WSABIE respectively (paired t-test, p < 0.01)." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-76", "text": "We set the dropout value to 0.2, which randomly sets 20% of the input units to 0 at each update during training." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-77", "text": "We train the model in a 5-fold cross-validation for 30 epochs and set the batch size for training data to 16." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-78", "text": "In each fold, data from 240 topics are used for training, which results in 9,600 examples (20 original and 20 negative candidates per topic)." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-79", "text": "The remaining 60 completely unseen topics are used for testing, which results in 1,200 test examples (note that we do not add negative examples to the test data)." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-80", "text": "----------------------------------" }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-81", "text": "**RESULTS AND DISCUSSION**" }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-82", "text": "We compare our approach to the state-of-the-art method that uses Personalized PageRank [3] to re-rank image candidates (Local PPR) and an adapted version that computes the PageRank scores of all the available images in the test set (Global PPR)."
}, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-83", "text": "We also test other baselines methods: (1) a relevant approach originally proposed for image annotation that learns a joint model of text and image features (WSABIE) [20] , (2) linear regression and SVM models that use the concatenation of the topic, the caption and the image vectors as input, LR (Topic+Caption+VGG) and SVM (Topic+Caption+VGG) respectively." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-84", "text": "Finally, we test two versions of our own DNN using only either the caption (DNN (Topic+Caption)) or the visual information of the image (DNN (Topic+VGG))." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-85", "text": "Table 1 shows the Top-1 average and nDCG scores obtained." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-86", "text": "First, we observe that the DNN methods perform better for both the evaluation metrics compared to the baseline methods." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-87", "text": "They achieve a Top-1 average rating between 1.94 and 2.12 better than the Global PPR, Local PPR, WSABIE, LR and SVM baselines." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-88", "text": "Specifically, the DNN (Topic+Caption+VGG) method significantly outperforms these models (paired t-test, p < 0.01)." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-89", "text": "This demonstrates that our simple DNN model captures high-level associations between topics and images." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-90", "text": "We should also highlight that the network has not seen either the topic or the image during Topic #288: surgery, body, medical, medicine, surgical, blood, organ, transplant, health, patient" }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-91", "text": "Topic #99: wedding, camera, bride, photographer, rachel, lens, sarah, couple, guest, shot DNN (Topic+Caption+VGG) )." 
}, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-92", "text": "An interesting finding is that using only the visual information (DNN (Topic+VGG)) achieves better results (2.04) compared to using only text." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-93", "text": "This demonstrates that images contain less noisy information compared to their captions for this particular task." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-94", "text": "The DNN models also provide a better ranking for the image candidates." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-95", "text": "The nDCG scores for the majority of the DNN methods are higher than the other methods." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-96", "text": "DNN (Topic+Caption+VGG) consistently obtains the best nDCG scores, 0.79, 0.80 and 0.81 respectively." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-97", "text": "Figure 1 shows two topics and the top-3 images selected by the DNN (Topic+Caption+VGG) model from the candidate set." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-98", "text": "The labels selected for the topic #288 are all very relevant to a Surgical operation." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-99", "text": "On the other hand, the images selected for topic #99 are irrelevant to Wedding photography." }, { "sent_id": "f3c2c538019b1d9daa8e6c932d9826-C001-100", "text": "For this topic the candidate set of labels do not contain any relevant images." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "f3c2c538019b1d9daa8e6c932d9826-C001-12" ], [ "f3c2c538019b1d9daa8e6c932d9826-C001-15" ], [ "f3c2c538019b1d9daa8e6c932d9826-C001-17" ] ], "cite_sentences": [ "f3c2c538019b1d9daa8e6c932d9826-C001-12", "f3c2c538019b1d9daa8e6c932d9826-C001-15", "f3c2c538019b1d9daa8e6c932d9826-C001-17" ] }, "@USE@": { "gold_contexts": [ [ "f3c2c538019b1d9daa8e6c932d9826-C001-43" ], [ "f3c2c538019b1d9daa8e6c932d9826-C001-58" ], [ "f3c2c538019b1d9daa8e6c932d9826-C001-64" ], [ "f3c2c538019b1d9daa8e6c932d9826-C001-70" ], [ "f3c2c538019b1d9daa8e6c932d9826-C001-82" ] ], "cite_sentences": [ "f3c2c538019b1d9daa8e6c932d9826-C001-43", "f3c2c538019b1d9daa8e6c932d9826-C001-58", "f3c2c538019b1d9daa8e6c932d9826-C001-64", "f3c2c538019b1d9daa8e6c932d9826-C001-70", "f3c2c538019b1d9daa8e6c932d9826-C001-82" ] } } }, "ABC_27ee0fbed3a88854ebe945dfffefd8_27": { "x": [ { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-88", "text": "**RECOGNITION OF TIMEXES**" }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-109", "text": "----------------------------------" }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-110", "text": "**DATA SET DOCUMENTS PART**" }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-2", "text": "The article introduces a new set of Polish word embeddings, built using KGR10 corpus, which contains more than 4 billion words." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-3", "text": "These embeddings are evaluated in the problem of recognition of temporal expressions (timexes) for the Polish language." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-4", "text": "We described the process of KGR10 corpus creation and a new approach to the recognition problem using Bidirectional Long-Short Term Memory (BiLSTM) network with additional CRF layer, where specific embeddings are essential." 
}, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-5", "text": "We presented experiments and conclusions drawn from them." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-6", "text": "----------------------------------" }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-8", "text": "Recent studies in information extraction domain (but also in other natural language processing fields) show that deep learning models produce state-of-the-art results [38] ." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-9", "text": "Deep architectures employ multiple layers to learn hierarchical representations of the input data." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-10", "text": "In the last few years, neural networks based on dense vector representations provided the best results in various NLP tasks, including named entities recognition [32] , semantic role labelling [6] , question answering [39] and multitask learning [4] ." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-11", "text": "The core element of most deep learning solutions is the dense distributed semantic representation of words, often called word embeddings." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-12", "text": "Distributional vectors follow the distributional hypothesis that words with a similar meaning tend to appear in similar contexts." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-13", "text": "Word embeddings capture the similarity between words and are often used as the first layer in deep learning models." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-14", "text": "Two of the most common and very efficient methods to produce word embeddings are Continuous Bag-of-Words (CBOW) and Skip-gram (SG), which produce distributed representations of words in a vector space, grouping them by similarity [19, 20] ." 
}, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-15", "text": "With the progress of machine learning techniques, it is possible to train such models on much larger data sets, and these often outperform the simple ones." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-16", "text": "It is possible to use a set of text documents containing even billions of words as training data." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-17", "text": "Both architectures (CBOW and SG) describe how the neural network learns the vector word representations for each word." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-18", "text": "In CBOW architecture the task is predicting the word given its context and in SG the task in predicting the context given the word." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-19", "text": "Due to a significant increase of quality using deep learning methods together with word embeddings as the input layer for neural networks, many word vector sets have been created, using different corpora." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-20", "text": "The widest range of available word embeddings is available for English [14] and there were not so many options for less popular languages, e.g. Polish." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-21", "text": "There was a definite need within CLARIN-PL 1 project and Sentimenti 2 to increase the quality of NLP methods for Polish which were utilising available Polish word vectors [25, 2, 23, 31] but only FastText modification of Skip-gram [2] was able to produce vectors for unknown words, based on character n-grams." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-22", "text": "The observation was that even using a sophisticated deep neural structure, the result strongly depends on the initial distributional representation." 
}, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-23", "text": "There was a need to build a massive corpus of Polish and create high-quality word vectors from that corpus." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-24", "text": "This work describes how we extended KGR7 1G corpus to become KGR10 with 4 billion words." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-25", "text": "Next, we present the different variants of word embeddings produced using this corpus." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-26", "text": "In the article about the recognition of named entities for Polish from the previous year, these embeddings were used in one of the three voting models to obtain the best results and the final system PolDeepNer [17] took the second place in PolEval2018 Task 2 [24] ." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-27", "text": "In this article, we evaluated KGR10 FastText word embeddings in recognition of timexes." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-28", "text": "----------------------------------" }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-29", "text": "**AVAILABLE WORD EMBEDDINGS**" }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-30", "text": "At the time we were testing word embeddings for different applications, there were 2 most popular sources of word vectors." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-31", "text": "The first one, called IPIPAN 3 , is the result of the project Compositional distributional semantic models for identification, discrimination and disambiguation of senses in Polish texts, the process of creating word embeddings is described in article [23] and corpora used were National Corpus of Polish (NKJP) [27] as the linguistic data source." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-32", "text": "Table 1 shows the number of tokens in each corpus and the name of the institution which prepared it." 
}, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-33", "text": "There is also information about the public availability of the resource." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-34", "text": "----------------------------------" }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-35", "text": "**C_ID CORPUS**" }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-36", "text": "Prepared [7, 18] was created at the Wroclaw University of Science and Technology by G4.19 Group." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-37", "text": "Due to the licences of documents in this corpus, this resource is not publicly available." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-38", "text": "Table 3 contains KGR7 subcorpora and statistics [36] ." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-39", "text": "One of the subcorpora in KGR7 is KIPI (the IPI PAN Corpus) [28] ." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-40", "text": "KGR7 covers texts from a wide range of domains like: blogs, science, stenographic recordings, news, journalism, books and parliamentary transcripts." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-41", "text": "All texts come from the second half of the 20th century and represent the modern Polish language." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-42", "text": "----------------------------------" }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-43", "text": "**PLWORDNET CORPUS 10.0 (KGR10)**" }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-65", "text": "Previous solutions ignored the morphology of words and were assigning a distinct vector to each word." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-44", "text": "KGR10, also known as plWordNet Corpus 10.0 (PLWNC 10.0), is the result of the work on the toolchain to automatic acquisition and extraction of the website content, called CorpoGrabber 6 [13] ." 
}, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-45", "text": "It is a pipeline of tools to get the most relevant content of the website, including all subsites (up to the user-defined depth)." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-46", "text": "The proposed toolchain can be used to build a big Web corpus of text documents." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-47", "text": "It requires the list of the root websites as the input." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-48", "text": "Tools composing CorpoGrabber are adapted to Polish, but most subtasks are language independent." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-49", "text": "The whole process can be run in parallel on a single machine and includes the following tasks: download of the HTML subpages of each input page URL with HTTrack 7 , extraction of plain text from each subpage by removing boilerplate content (such as navigation links, headers, footers, advertisements from HTML pages) [26] , deduplication of plain text [26] , bad quality documents removal utilising Morphological Analysis Converter and Aggregator (MACA) [30] , documents tagging using Wroc\u0142aw CRF Tagger (WCRFT) [29] ." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-50", "text": "Last two steps are available only for Polish." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-51", "text": "Table 3 : Names and the number of tokens in KGR7 subcorpora." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-52", "text": "In order to significantly expand the set of documents in KGR7, we utilised DMOZ (short for directory.mozilla.org) -a multilingual open content directory of World Wide Web links, also known as Open Directory Project (ODP)." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-53", "text": "The website with directory was closed in 2017, but the database still can be found on the web." 
}, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-54", "text": "Polish part of this directory contains more than 30,000 links to Polish websites." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-55", "text": "We used these links as root URLs for CorpoGrabber, and we downloaded more than 7TB of HTML web pages." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-56", "text": "After the extraction of text from HTML pages, deduplication of documents (including texts from KGR7) and removing bad quality documents (containing more than 30% of words outside the Morfeusz [37] dictionary) the result is KGR10 corpus, which contains 4,015,569,051 tokens and 18,084,712 unique words." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-57", "text": "Due to component licenses, KGR10 corpus is not publicly available." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-58", "text": "----------------------------------" }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-59", "text": "**KGR10 WORD EMBEDDINGS**" }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-60", "text": "We created a new Polish word embeddings models using the KGR10 corpus." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-61", "text": "We built 16 models of word embeddings using the implementation of CBOW and Skip-gram methods in the FastText tool [2] ." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-62", "text": "These models are available under an open license in the CLARIN-PL project DSpace repository 8 ." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-63", "text": "The internal encoding solution based on embeddings of n-grams composing each word makes it possible to obtain FastText vector representations, also for words which were not processed during the creation of the model." 
}, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-64", "text": "A vector representation is associated with character n-gram and each word is represented as the sum of its n-gram vector representations." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-66", "text": "This is a limitation for languages with large vocabularies and many rare words, like Turkish, Finnish or Polish [2] ." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-67", "text": "Authors observed that using word representations trained with subword information outperformed the plain Skip-gram model and the improvement was most significant for morphologically rich Slavic languages such as Czech (8% reduction of perplexity over SG) and Russian (13% reduction) [2] ." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-68", "text": "We expected that word embeddings created that way for Polish should also provide such improvements." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-69", "text": "There were also previous attempts to build KGR10 word vectors with other methods (including FastText), and the results are presented in the article [25] ." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-70", "text": "We selected the best models from that article -with embedding ID prefix EP (embeddings, previous) in Table 4 -to compare with new models, marked as embedding ID prefix EC in Table 4 )." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-71", "text": "The word embeddings models used in PolDeepNer for recognition of timexes and named entities were EE1, ." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-72", "text": "It was built on a plain KGR10." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-73", "text": "The dimension of word embedding is 300, the method of constructing vectors was Skip-gram [2] , and the number of negative samples for each positive example was 10." 
}, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-74", "text": "----------------------------------" }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-75", "text": "**TEMPORAL EXPRESSIONS**" }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-76", "text": "Temporal expressions (henceforth timexes) tell us when something happens, how long something lasts, or how often something occurs." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-77", "text": "The correct interpretation of a timex often involves knowing the context." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-78", "text": "Usually, a person is aware of their location in time, i.e., they know what day, month and year it is, and whether it is the beginning or the end of week or month." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-79", "text": "Therefore, they refer to specific dates, using incomplete expressions such as 12 November, Thursday, the following week, after three days." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-80", "text": "The temporal context is often necessary to determine to which specific date and time timexes refer." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-81", "text": "These examples do not exhaust the complexity of the problem of recognising timexes." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-82", "text": "TimeML [33] is a markup language for describing timexes that has been adapted to many languages." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-83", "text": "One of the best-known methods of recognition of timexes called HeidelTime [34] , which uses the TIMEX3 annotation standard, currently supports 13 languages (with the use of hand-crafted resources)." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-84", "text": "PLIMEX is a specification for the description of Polish timexes." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-85", "text": "It is based on TIMEX3 used in TimeML." 
}, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-86", "text": "Classes proposed in TimeML are adapted, namely: date, time, duration, set." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-87", "text": "----------------------------------" }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-89", "text": "There are many methods for recognising timexes that are widely used in natural language engineering." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-90", "text": "For English (but not exclusively), in approaches based on supervised learning, sequence labelling methods are often used, especially Conditional Random Fields [15] ." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-91", "text": "A review of the methods in the article [35] about the recognition of timexes for English and Spanish has shown a certain shift within the most popular solutions." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-92", "text": "As with the normalisation of timexes, the best results are still achieved with rule-based methods, many new solutions have been introduced in the area of recognition." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-93", "text": "The best systems listed in [35] , called TIPSem [16] and ClearTK [1] , use CRFs for recognition, so initially, we decided to apply the CRF-based approach for this task." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-94", "text": "The results were described in [12, 10] ." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-95", "text": "In recent years, solutions based on deep neural networks, using word representation in the form of word embeddings, created with the use of large linguistic corpus, have begun to dominate in the field of recognition of word expressions." 
}, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-96", "text": "The most popular solutions include bidirectional long short-term memory neural networks (henceforth Bi-LSTM), often in combination with conditional random fields, as presented in the paper [5] dedicated to the recognition of proper names." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-97", "text": "For the Polish language, deep networks have also recently been used to recognise word expressions." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-98", "text": "In the issue of recognition of timexes, a bidirectional gated recurrent unit network (GRU) has been used [21, 22] ." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-99", "text": "GRU network is described in detail in the article [3] ." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-100", "text": "In case of recognition of event descriptions using Bi-LSTM and Bi-GRU, where most of the Liner2 features were included in the input feature vector, better results were obtained [8] than for the Liner2 method (but without taking into account domain dictionaries)." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-101", "text": "In last year's publication on the issue of named entities recognition using BiLSTM+CRF (together with G4.19 Group 9 members), we received a statistically significant improvement in the quality of recognition compared to a solution using CRF only." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-102", "text": "The solution has been called PolDeepNer 10 [17] ." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-103", "text": "----------------------------------" }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-104", "text": "**EXPERIMENTS AND RESULTS**" }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-105", "text": "Experiments were carried out by the method proposed in [35] ." 
}, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-106", "text": "The first part is described as Task A, the purpose of which is to identify the boundaries of timexes and assign them to one of the following classes:" }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-107", "text": "9 http://nlp.pwr.edu.pl/ 10 https://github.com/CLARIN-PL/PolDeepNer date, time, duration, set." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-108", "text": "[%] all 1635 100 train 1227 50 test 408 25 Table 5 : Evaluation data sets (source: KPWr)." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-111", "text": "We trained the final models using the train set and we evaluated it using the test set, which was the reproduction of analysis performed in articles [11, 9] ." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-112", "text": "The division is presented in Table ? ?." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-113", "text": "We used BiLSTM+CRF classifier as in previous work [17] ." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-114", "text": "We used precision, recall and F1 metrics from the classic NER task [17] , where true positive system answer has the same boundaries and type as annotation in gold data set." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-115", "text": "We evaluated all 17 word embeddings models using these metrics." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-116", "text": "The results are presented in Tables 6, 7 and 8." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-117", "text": "We chose the best 3 results from each word embeddings group (EE, EP, EC) from Table 8 presenting F1-scores for all models." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-118", "text": "Then we evaluated these results using more detailed measures for timexes, presented in [35] ." 
}, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-119", "text": "The following measures were used to evaluate the quality of boundaries and class recognition, socalled strict match: strict precision (Str.P), strict recall (Str.R) and strict F1-score (Str.F1)." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-120", "text": "A relaxed match (Rel." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-121", "text": "P, Rel." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-122", "text": "R, Rel." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-123", "text": "F1) evaluation has also been carried out to determine whether there is an overlap between the system entity and gold entity, e.g." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-124", "text": "[Sunday] and [Sunday morning] [35] ." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-125", "text": "If there was an overlap, a relaxed type F1-score (Type.F1) was calculated [35] ." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-126", "text": "The results are presented in Table 9 ." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-127", "text": "Table 6 : Evaluation results (precision) for 17 word embeddings models for each TIMEX3 class (date, time, duration and set)." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-128", "text": "----------------------------------" }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-129", "text": "**EMBEDDING**" }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-130", "text": "----------------------------------" }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-131", "text": "**CONCLUSIONS**" }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-132", "text": "The analysis of results from Tables 6, 7 and 8 show that 12 of 15 best results were obtained using new word embeddings." 
}, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-133", "text": "The evaluation results presented in Table 9 (the chosen best embeddings models from Table 8) prove that the best group of word embeddings is EC." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-134", "text": "The highest type F1-score was obtained for EC1 model, built using binary FastText Skip-gram method utilising subword information, with vector dimension equal to 300 and negative sampling equal to 10." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-135", "text": "The ability of the model to provide vector representation for the unknown words seems to be the most important." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-136", "text": "Also, previous models built using KGR10 (EP) are probably less accurate due to an incorrect tokenisation of the corpus." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-137", "text": "We used WCRFT tagger [29] , which utilises Toki [30] to tokenise the input text before the creation of the embeddings model." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-138", "text": "The comparison of EC1 with previous results obtained using only CRF [9] show the significant improvement across all the tested metrics: 3.6pp increase in strict F1-score, 1.36pp increase in relaxed precision, 5.61pp increase in relaxed recall and 3.51pp increase in relaxed F1-score." }, { "sent_id": "27ee0fbed3a88854ebe945dfffefd8-C001-139", "text": "Table 9 : Evaluation results for all TIMEX3 classes (total) for 9 word embeddings models (3 best models from each embeddings group: EE, EP, EC from Table 8 ) using the following measures from [35] : strict precision, strict recall, strict F1-score, relaxed precision, relaxed recall, relaxed F1-score, type F1-score." 
} ], "y": { "@USE@": { "gold_contexts": [ [ "27ee0fbed3a88854ebe945dfffefd8-C001-91" ], [ "27ee0fbed3a88854ebe945dfffefd8-C001-93" ], [ "27ee0fbed3a88854ebe945dfffefd8-C001-105" ], [ "27ee0fbed3a88854ebe945dfffefd8-C001-118" ], [ "27ee0fbed3a88854ebe945dfffefd8-C001-125" ], [ "27ee0fbed3a88854ebe945dfffefd8-C001-139" ] ], "cite_sentences": [ "27ee0fbed3a88854ebe945dfffefd8-C001-91", "27ee0fbed3a88854ebe945dfffefd8-C001-93", "27ee0fbed3a88854ebe945dfffefd8-C001-105", "27ee0fbed3a88854ebe945dfffefd8-C001-118", "27ee0fbed3a88854ebe945dfffefd8-C001-125", "27ee0fbed3a88854ebe945dfffefd8-C001-139" ] } } }, "ABC_74420437db295ca874d5c946891f69_27": { "x": [ { "sent_id": "74420437db295ca874d5c946891f69-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "74420437db295ca874d5c946891f69-C001-2", "text": "As multiword expressions (MWEs) exhibit a range of idiosyncrasies, their automatic detection warrants the use of many different features." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-3", "text": "Tsvetkov and Wintner (2014) proposed a Bayesian network model that combines linguistically motivated features and also models their interactions." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-4", "text": "In this paper, we extend their model with new features and apply it to Croatian, a morphologically complex and a relatively free word order language, achieving a satisfactory performance of 0.823 F1-score." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-5", "text": "Furthermore, by comparing against (semi)na\u00efve Bayes models, we demonstrate that manually modeling feature interactions is indeed important." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-6", "text": "We make our annotated dataset of Croatian MWEs freely available." 
}, { "sent_id": "74420437db295ca874d5c946891f69-C001-7", "text": "----------------------------------" }, { "sent_id": "74420437db295ca874d5c946891f69-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "74420437db295ca874d5c946891f69-C001-9", "text": "Multiword expressions (MWEs) have attracted a great deal of attention in the natural language processing community." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-10", "text": "While MWEs span a wide range of types, common to all is the idiosyncrasy at the lexical, syntactic, semantic, pragmatic, or statistical level (Baldwin and Kim, 2010) ." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-11", "text": "A variety of models has been proposed for the automatic identification of MWE in corpora, including statistical (Church and Hanks, 1990; Lin, 1999; Pecina, 2010) and linguistic-based approaches (Cook et al., 2007; Baldwin, 2005; Green et al., 2011) ; see (Ramisch, 2015) for a recent overview." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-12", "text": "Sag et al. (2002) argued for a combination of the two approaches." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-13", "text": "Recently, Tsvetkov and Wintner (2014) proposed an approach for the detection of MWE candidates that combines a number of statistical and linguistic features." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-14", "text": "The most interesting aspect of their work is that they explicitly model the linguistically motivated interactions between the features using a Bayesian network (BN)." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-15", "text": "The advantages of BNs lie in their interpretability and the possibility to encode linguistic knowledge in the form of the network structure." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-16", "text": "Furthermore, unlike most previous work, Tsvetkov and Wintner address MWE of various types and flexible syntactic constructions." 
}, { "sent_id": "74420437db295ca874d5c946891f69-C001-17", "text": "They show that the manually-designed BN outperforms a number of strong baselines, including an SVM model, on English, French, and Hebrew datasets." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-18", "text": "Another advantage of their model is that it is in principle language-independent, aside from a few language-specific features." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-19", "text": "In this paper, we address the task of MWE detection (type-level MWE classification) for Croatian, a South Slavic language with a rich morphology and a relatively free word order." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-20", "text": "The starting point of our work is the model of Tsvetkov and Wintner (2014) , which we extend with a number of features, including language-specific ones that account for the relatively free word order." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-21", "text": "Our main research question is whether modeling the interactions between features is important, and whether these can be learned automatically." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-43", "text": "In our sample-based analysis of Croatian MWEs, we concluded that in many cases this restriction is not limited to the right context." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-22", "text": "Tsvetkov and Wintner (2014) showed that a manually-designed BN substantially outperforms the one whose structure is learned automatically, hypothesizing that the cause for this might be the increased model complexity." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-23", "text": "We conduct a similar experiment using a structurelearning algorithm, but also model the interactions using a simpler, semi-naive Bayes classifier, for which the number of parameters is restricted." 
}, { "sent_id": "74420437db295ca874d5c946891f69-C001-24", "text": "Finally, we compare these models against a structurefree counterpart, a na\u00efve Bayes classifier." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-25", "text": "For the experiments, we compile a new manually annotated dataset of Croatian MWEs." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-26", "text": "Unlike Tsvetkov and Wintner (2014) , who only consider bigrams, we consider MWEs of up to five words in length." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-27", "text": "We make the dataset freely available, along with all feature sets needed to replicate the experiments." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-28", "text": "----------------------------------" }, { "sent_id": "74420437db295ca874d5c946891f69-C001-29", "text": "**MODEL**" }, { "sent_id": "74420437db295ca874d5c946891f69-C001-30", "text": "We adopt the BN model of Tsvetkov and Wintner (2014) , but extend it with language-specific as well as semantically motivated features." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-31", "text": "Most newly added features were inspired by the analysis of Croatian MWEs of Blagus Bartolec (2008) , and a sample-based analysis of a MWE from a dictionary of Croatian MWEs (Kova\u010devi\u0107, 2012) and their occurrences in the hrWaC corpus (Ljube\u0161i\u0107 and Erjavec, 2011) ." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-32", "text": "The MWE candidates were POStagged using the tagger from (Pinnis et al., 2012) ." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-33", "text": "----------------------------------" }, { "sent_id": "74420437db295ca874d5c946891f69-C001-34", "text": "**FEATURES**" }, { "sent_id": "74420437db295ca874d5c946891f69-C001-35", "text": "Original features." 
}, { "sent_id": "74420437db295ca874d5c946891f69-C001-36", "text": "The model of Tsvetkov and Wintner (2014) uses nine statistically and linguistically motivated features, computed for each MWE candidate and designed to discriminate between MWEs and ordinary word sequences." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-37", "text": "We adopted eight of these features: 1 (1) capitalization (indicating which MWE constituents are capitalized), (2) hyphenation (which constituents are hyphenated), (3) fossil word (whether constituents also occur outside of the MWE), (4) frozen form (whether the MWE is morphologically frozen), (5) partial morphological inflection (whether MWE admits only limited inflection), (6) syntactic pattern (the MWE's part-of-speech pattern), (7) semantic context, and (8) association measure." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-38", "text": "The values of statistical features were computed from hrWaC, a 1.2B-token Croatian web corpus compiled by Ljube\u0161i\u0107 and Erjavec (2011) ." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-39", "text": "All numeric features were discretized into five reference levels based on their average values in the corpus." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-40", "text": "Interesting MWE examples from the corpus that showcase the above-mentioned statistical properties are curriculum vitae, which is made of fossil words, hodati po jajima (to walk on eggshells), which is a frozen form, and zlatno doba (golden age), which almost exclusively appears in the nominative and locative singular (partial inflection)." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-41", "text": "Modified features." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-42", "text": "In the original model, the semantic context feature computes the lexical variety of the words following a MWE candidate vary, the idea being that MWEs have a more restricted context." 
}, { "sent_id": "74420437db295ca874d5c946891f69-C001-44", "text": "Thus, we introduced two additional features: one for the left context and another considering a 5-word window around the MWE." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-45", "text": "Likewise, we used the Dice coefficient association measure, rather than PMI as used in the original model, as the former turned out to be more discriminative." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-46", "text": "New features." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-47", "text": "We introduced six new features, four of which were inspired by our analysis of Croatian MWEs." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-48", "text": "The simile feature is motivated by the observation that many Croatian similes are MWEs, e.g., plakati kao ljuta godina, (to cry like a bitter year -to cry heavily)." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-49", "text": "We consider a MWE to be a simile if it contains a preposition kao (like) or poput (as)." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-50", "text": "We furthermore observe many Croatian MWEs contain loanwords." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-51", "text": "The foreign word feature indicates, for each MWE constituent, whether it has been tagged as a foreign word by the POS tagger." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-52", "text": "We also introduced two features to account for the relatively free word order of Croatian: constituent adjacency and constituent permutation." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-53", "text": "The former is turned on if there are more contiguous than discontinuous MWE candidate occurrences, while the latter is turned on if the corpus contains five or more word permutations of the MWE candidate." 
}, { "sent_id": "74420437db295ca874d5c946891f69-C001-54", "text": "While most MWEs in Croatian nominally do not allow intervening words between its components, in fact most types of MWEs will allow the insertion of copula and pronoun enclitics; e.g., zadnji [je]\u010das ([is] last moment)." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-55", "text": "When searching for discontinuous MWE candidates of length n, we only consider n-grams for which the number of tokens between the first and final constituent is less than or equal to 2n." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-56", "text": "On the other hand, permutation of MWE constituents is much less frequent, even for a relatively free word order language such as Croatian." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-57", "text": "Thus, there may be a benefit to capturing which types of MWE -presumably mostly characterized by their POS patternsallow for permutations; e.g., jednim udarcem ubiti dvije muhe / dvije muhe ubiti jednim udarcem, etc. (to kill two flies with one stone)." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-58", "text": "Finally, inspired by a growing body of research on semantic non-compositionality of MWEs (Baldwin et al., 2003; Kim and Baldwin, 2006; Biemann and Giesbrecht, 2011; Kr\u010dm\u00e1\u0159 et al., 2013) , we introduced a simple semantic opacity feature." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-59", "text": "We opted for a simple approach proposed by (Mitchell and Lapata, 2008) , and computed this feature by deriving distributional vectors from hrWaC for the MWE and the additive composition of its con- stituents, and then computing the cosine between the two vectors." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-60", "text": "For opaque MWEs, we expect the cosine to be lower than for semantically transparent MWEs." 
}, { "sent_id": "74420437db295ca874d5c946891f69-C001-61", "text": "Similarly as with other numeric features, we discretized the cosine scores into five levels." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-62", "text": "----------------------------------" }, { "sent_id": "74420437db295ca874d5c946891f69-C001-63", "text": "**FEATURE INTERACTIONS**" }, { "sent_id": "74420437db295ca874d5c946891f69-C001-64", "text": "The structure of a BN defines feature interactions by means of conditional independence assumptions between the variables." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-65", "text": "When constructed manually, the structure of the network essentially models our knowledge about the causal links between the features." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-66", "text": "We extended the structure of the original BN model by introducing additional links for the newly added features." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-67", "text": "We primarily based our design choices on linguistic intuition, but also on experimental validation." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-68", "text": "To this end, we compiled a small validation set of 33 MWEs and 33 non-MWEs, for which we computed the features over 50K sentences from hrWaC. We used this dataset to verify whether adding an interaction link improves the accuracy of the model." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-69", "text": "The resulting BN is shown in Fig. 1 ." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-70", "text": "All nodes depend on the MWE node, which is the label to be predicted." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-71", "text": "2 We introduced feature interaction between the caps and foreign node, given that a high number of loanwords pertain to proper names." 
}, { "sent_id": "74420437db295ca874d5c946891f69-C001-72", "text": "Additionally, we defined interactions between comp, context, and overlap, as the semantic opacity influences the general context of an expression, and the ratio of overlapping context 2 When using the BN model for MWE detection, we simply run a maximum a posteriori query on the MWE variable with all feature variables set to the observed values." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-73", "text": "words depends upon both features." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-74", "text": "Finally, since similes and hyphenated expressions signal a strict word order, we defined interactions between perm, adjac, hyphen, and simile." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-75", "text": "3 Dataset MWE definition." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-76", "text": "As there is no publicly available annotated datasets of Croatian MWEs, we decided to create one." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-77", "text": "We first established a working definition of Croatian MWEs, starting out from the taxonomy proposed by Blagus Bartolec (2008) , and adopted it to the universal classification of Sag et al. (2002) ." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-78", "text": "We identified five major groups of MWEs: (1) idioms, semantically opaque expressions; (2) fixed expressions, common phrases whose meaning can clearly be gleaned from its constituents, but whose constituents are rarely replaced with synonyms in practice; (3) technical terms, expressions pertaining to the technical language of a particular profession; (4) foreign terms, any expression adopted from another language, as well as imaginary and nonsensical phrases; and (5) proper names, names of persons, institutions, geographical terms, etc., composed of two or more words." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-79", "text": "Annotation." 
}, { "sent_id": "74420437db295ca874d5c946891f69-C001-80", "text": "As a source of data for our dataset, we use hrMWELex, a lexicon of Croatian MWEs candidate n-grams compiled by Ljube\u0161i\u0107 et al. (2015) ." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-81", "text": "The lexicon was obtained by matching parse trees from hrWaC against a set of predefined syntactic patterns (POS patterns) for Croatian, yielding a high-recall, low-precision MWE lexicon." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-82", "text": "The resulting lexicon contains 12M n-grams with matching POS patterns." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-83", "text": "We next sorted the n-grams by corpus frequency, and made a balanced 2-, 3-, and 4-gram selection from the most frequent candidates, selecting 4000 MWE candidates." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-84", "text": "We then asked four native speakers of Croatian to label the dataset." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-85", "text": "Each annotated all 4000 instances, presented in random order to minimize the effect of a context bias." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-86", "text": "We also included 124 gold positive MWEs, extracted from (Ani\u0107, 2003) , to serve as a control set." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-87", "text": "To measure the inter-annotator agreement, we calculated the Cohen's coefficient (Cohen, 1960) between all pairs of annotations (Table 1) ." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-88", "text": "The agreement ranges between 0.413 and 0.578, which, according to Landis and Koch (1977) , is considered a moderate agreement." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-89", "text": "Gold dataset." 
}, { "sent_id": "74420437db295ca874d5c946891f69-C001-90", "text": "For the final dataset, we adjudicated the annotations by considering a MWE can- (Ani\u0107, 2003) and a dictionary of multiword expressions (Kova\u010devi\u0107, 2012) , yielding a total of 461 positive MWEs." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-91", "text": "3 Finally, we add an equal number of n-grams annotated as negative MWE instances by at least three annotators, yielding a perfectly-balanced dataset of 922 n-grams." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-92", "text": "Table 2 shows a breakdown of positive and negative examples by n-gram length." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-93", "text": "For each n-gram from this dataset, we computed the feature values on a random sample of the hrWaC corpus comprising 200K sentences (\u223c5M tokens)." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-94", "text": "We make the dataset and the precomputed features publicly available." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-95", "text": "4" }, { "sent_id": "74420437db295ca874d5c946891f69-C001-96", "text": "----------------------------------" }, { "sent_id": "74420437db295ca874d5c946891f69-C001-97", "text": "**EVALUATION**" }, { "sent_id": "74420437db295ca874d5c946891f69-C001-98", "text": "We compare the BN model from Section 2 against two commonly used statistical baselines: Dice and PMI association measures." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-99", "text": "Furthermore, we compare the BN model to three variants of Bayes classifiers, differing in their ability to model feature interactions: a Naive Bayes classifier (NB), a treeaugmented Naive Bayes classifier (TAN) (Friedman et al., 1997) , and a Bayesian network classifier trained using the K2 structure learning algorithm (BN-K2) (Cooper and Herskovits, 1992 such as those among the syntax, frozen, and partial features in Fig. 1 ." 
}, { "sent_id": "74420437db295ca874d5c946891f69-C001-100", "text": "The NB is even simpler, as it does not model any feature interactions at all, i.e., it assumes all feature pairs are conditionally independent within the MWE and non-MWE classes." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-101", "text": "In contrast, the BN and BN-K2 models can model (undirected) circular dependencies." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-102", "text": "The difference between them is that for the BN model the feature interactions were designed manually, based on linguistic insights, whereas in case of BN-K2 the interactions are learned from the train set." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-103", "text": "Table 3 shows the MWE classification accuracy, precision, recall, and F1-scores of the two baselines and the four Bayes classifiers." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-104", "text": "All models were trained and tested using 10-fold cross-validation on the gold dataset." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-105", "text": "The threshold of the two baseline models was optimized on the train sets." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-106", "text": "We observe that all four Bayes classifiers outperform the baselines in terms of accuracy and F1-score, except for the NB model which performs worse than Dice in terms of F1-score." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-107", "text": "On the other hand, the BN model outperforms all considered models in terms of both accuracy and F1-score by a considerable margin." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-108", "text": "This demonstrates that manual modeling of feature interactions is indeed important for MWE detection, and that BN does a reasonably good job in modeling these interactions." 
}, { "sent_id": "74420437db295ca874d5c946891f69-C001-109", "text": "The more simple NB and TAN models even out in terms of F1-score, but differ in precision and recall scores, while the BN-K2 model performs comparably to TAN." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-110", "text": "----------------------------------" }, { "sent_id": "74420437db295ca874d5c946891f69-C001-111", "text": "**CONCLUSION**" }, { "sent_id": "74420437db295ca874d5c946891f69-C001-112", "text": "We described the experiments on using a combination of linguistically motivated features for MWE detection in Croatian." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-113", "text": "We adopted the Bayesian network model of Tsvetkov and Wintner (2014) and extended it with new features and manually-designed feature interactions, inspired by an analysis of Croatian MWEs." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-114", "text": "To train and evaluate the model, we built a manually annotated dataset of Croatian MWEs." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-115", "text": "On this dataset, our model substantially outperforms statistical baselines, reaching a satisfactory performance of 0.823 F1-score on our dataset." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-116", "text": "The model also outperforms the (semi)na\u00efve Bayes models, which limit the feature interactions, as well as a Bayesian network model with automatically learned feature interactions." }, { "sent_id": "74420437db295ca874d5c946891f69-C001-117", "text": "Thus, the main finding of our work is that the model benefits from the linguistically motivated, manually-designed feature interactions, which proves that MWE features interact in rather intricate ways." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "74420437db295ca874d5c946891f69-C001-3" ], [ "74420437db295ca874d5c946891f69-C001-13" ], [ "74420437db295ca874d5c946891f69-C001-22" ], [ "74420437db295ca874d5c946891f69-C001-36" ] ], "cite_sentences": [ "74420437db295ca874d5c946891f69-C001-3", "74420437db295ca874d5c946891f69-C001-13", "74420437db295ca874d5c946891f69-C001-22", "74420437db295ca874d5c946891f69-C001-36" ] }, "@EXT@": { "gold_contexts": [ [ "74420437db295ca874d5c946891f69-C001-3", "74420437db295ca874d5c946891f69-C001-4" ], [ "74420437db295ca874d5c946891f69-C001-20" ], [ "74420437db295ca874d5c946891f69-C001-30" ], [ "74420437db295ca874d5c946891f69-C001-113" ] ], "cite_sentences": [ "74420437db295ca874d5c946891f69-C001-3", "74420437db295ca874d5c946891f69-C001-20", "74420437db295ca874d5c946891f69-C001-30", "74420437db295ca874d5c946891f69-C001-113" ] }, "@SIM@": { "gold_contexts": [ [ "74420437db295ca874d5c946891f69-C001-22", "74420437db295ca874d5c946891f69-C001-23" ] ], "cite_sentences": [ "74420437db295ca874d5c946891f69-C001-22" ] }, "@DIF@": { "gold_contexts": [ [ "74420437db295ca874d5c946891f69-C001-26" ] ], "cite_sentences": [ "74420437db295ca874d5c946891f69-C001-26" ] }, "@USE@": { "gold_contexts": [ [ "74420437db295ca874d5c946891f69-C001-36", "74420437db295ca874d5c946891f69-C001-37" ] ], "cite_sentences": [ "74420437db295ca874d5c946891f69-C001-36" ] } } }, "ABC_91723cf7f22ba6405c85a929ac2d8e_27": { "x": [ { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-66", "text": "----------------------------------" }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-67", "text": "**3 EVALUATION**" }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-2", "text": "We propose a novel neural network model for machine reading, DER Network, which explicitly implements a reader building dynamic meaning representations for entities by gathering and accumulating information around the entities as 
it reads a document." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-3", "text": "Evaluated on a recent large scale dataset (Hermann et al., 2015) , our model exhibits better results than previous research, and we find that max-pooling is suited for modeling the accumulation of information on entities." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-4", "text": "Further analysis suggests that our model can put together multiple pieces of information encoded in different sentences to answer complicated questions." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-5", "text": "Our code for the model is available at https://github." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-6", "text": "com/soskek/der-network" }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-7", "text": "----------------------------------" }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-9", "text": "Machine reading systems (Poon et al., 2010; Richardson et al., 2013) can be tested on their ability to answer queries about contents of documents that they read, thus a central problem is how the information of documents should be organized in the system and retrieved by the queries." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-10", "text": "Recently, large scale datasets of document-queryanswer triples have been constructed from online newspaper articles and their summaries (Hermann et al., 2015) , by replacing named entities in the summaries with placeholders to form Cloze (Taylor, 1953 ) style questions ( Figure 1 )." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-11", "text": "These datasets have enabled training and testing of complicated neural network models of hypothesized machine readers (Hermann et al., 2015; Hill et al., 2015) ." 
}, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-12", "text": "( @entity1 ) @entity0 may be @entity2 in the popular @entity4 superhero films , but he recently dealt in some advanced bionic technology himself ." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-13", "text": "@entity0 recently presented a robotic arm to young @entity7 , a @entity8 boy who is missing his right arm from just above his elbow ." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-14", "text": "the arm was made by @entity12 , a \u2026! \" [X] \" star @entity0 presents a young child with a bionic arm! Figure 1 : A document-query-answer triple constructed from a news article and its bullet point summary." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-15", "text": "An entity in the summary (Robert Downey Jr.) is replaced by the placeholder [X] to form a query." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-16", "text": "All entities are anonymized to exclude world knowledge and focus on reading comprehension." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-17", "text": "----------------------------------" }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-18", "text": "**QUERY**" }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-19", "text": "In this paper, we hypothesize that a reader without world knowledge can only understand a named entity by dynamically constructing its meaning from the contexts." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-20", "text": "For example, in Figure 1 , a reader reading the sentence \"Robert Downey Jr. may be Iron Man ." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-21", "text": ". . \" can only understand \"Robert Downey Jr.\" as something that \"may be Iron Man\" at this stage, given that it does not know Robert Downey Jr. a priori." 
}, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-22", "text": "Information about this entity can only be accumulated by its subsequent occurrence, such as \"Downey recently presented a robotic arm . . . \"." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-23", "text": "Thus, named entities basically serve as anchors to link multiple pieces of information encoded in different sentences." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-24", "text": "This insight has been reflected by the anonymization process in construction of the dataset, in which coreferent entities (e.g. \"Robert Downey Jr.\" and \"Downey\") are replaced by randomly permuted abstract entity markers (e.g. \"@en-tity0\"), in order to prevent additional world knowledge from being attached to the surface form of the entities (Hermann et al., 2015) ." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-25", "text": "We, however, take it as a strong motivation to implement a reader that dynamically builds meaning representations for each entity, by gathering and accumulating information on that entity as it reads a document (Section 2)." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-26", "text": "Evaluation of our model, DER Network, exhibits better results than previous research (Section 3)." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-27", "text": "In particular, we find that max-pooling of entity representations, which is intended to model the accumulation of information on entities, can drastically improve performance." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-28", "text": "Further analysis suggests that max-pooling can help our model draw multiple pieces of information from different sentences." 
}, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-29", "text": "----------------------------------" }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-30", "text": "**MODEL**" }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-31", "text": "Following Hermann et al. (2015) , our model estimates the conditional probability p(e|D, q), where q is a query and D is a document." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-32", "text": "A candidate answer for the query is denoted by e, which in this paper is any named entity." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-33", "text": "Our model can be factorized as:" }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-34", "text": "in which u(q) is the learned meaning for the query and v(e; D, q) the dynamically constructed meaning for an entity, depending on the document D and the query q. We note that (1) is in contrast to the factorization used by Hermann et al. (2015):" }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-35", "text": "in which a vector u(D, q) is learned to represent the status of a reader after reading a document and a query, and this vector is used to retrieve an answer by coupling with the answer vector v(a)." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-36", "text": "1 Factorization (2) relies on the hypothesis that there exists a fixed vector for each candidate answer representing its meaning." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-37", "text": "However, as we argued in Section 1, an entity surface does not possess meaning; rather, it serves as an anchor to link pieces of information about it." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-38", "text": "Therefore, we hypothesize that the meaning representation v(e; D, q) of an entity e should be dynamically constructed from its surrounding contexts, and the meanings are \"accumulated\" through the reader reading the document D. 
We explain the construction of v(e; D, q) in Section 2.1, and propose a max-pooling process for modeling information accumulation in Section 2.2." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-39", "text": "----------------------------------" }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-40", "text": "**DYNAMIC ENTITY REPRESENTATION**" }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-41", "text": "For any entity e, we take its context c as any sentence that includes a token of e. Then, we use bidirectional single-layer LSTMs (Hochreiter and Schmidhuber, 1997; Graves et al., 2005) to encode c into vectors." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-42", "text": "LSTM is a neural cell that outputs a vector h c,t for each token t in the sentence c; taking the word vector x c,t of the token as input, each h c,t is calculated recurrently from its precedent vector h c,t\u22121 or h c,t+1 , depending on the direction of the encoding." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-43", "text": "Formally, we write forward and backward LSTMs as:" }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-44", "text": "Then, denoting the length of the sentence c as T and the index of the entity e token as \u03c4 , we define the dynamic entity representation d e,c as the concatenation of the vectors [ h c,T , h c,1 , h c,\u03c4 , h c,\u03c4 ] encoded by a feed-forward layer (Figure 2 ):" }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-45", "text": "in which s e,c (q) is calculated by the attention mechanism (Bahdanau et al., 2015) , modeling the degree to which our reader should attend to a particular occurrence of an entity, given the query q. More precisely, s e,c (q) is defined as the following:" }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-46", "text": "where s e,c (q) is calculated by taking the softmax of s e,c (q), which is calculated from the dynamic entity representation d e,c and the query vector q. 
The vector m, matrix W dm , and the bias b s in (7) are learned parameters in the attention mechanism." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-47", "text": "Vector m is used here to map a vector value to a scalar." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-48", "text": "The query vector 3 u(q) is constructed similarly as dynamic entity representations, using bidirectional LSTMs 4 to encode the query and then encoding the output vectors." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-49", "text": "More precisely, if we denote the length of the query as T and the index of the placeholder as \u03c4 , the query vector is calculated as:" }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-50", "text": "Then, v(e; D, q) and u(q) are used in (1) to calculate probability p(e|D, q)." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-51", "text": "2 Following a heuristic used in Hill et al. (2015) , we add a secondary bias b v to v(e; D, q) if the entity e already appears in the query q." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-52", "text": "3 u(q) and another query vector q, are calculated respectively, in the same way (8) with unshared model parameters, while sharing the parameters is also promising." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-53", "text": "4 The parameters of the bi-LSTM for queries are not shared with the ones for entity contexts." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-54", "text": "----------------------------------" }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-55", "text": "**MAX-POOLING**" }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-56", "text": "We expect the dynamic entity representation to capture information about an entity mentioned in a sentence." 
}, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-57", "text": "However, as an entity occurs multiple times in a document, information is accumulated as subsequent occurrences of the entity draw information from previous mentions." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-58", "text": "For example, in Figure 1, the first sentence mentioning \"Robert Downey Jr.\" relates Downey to Iron Man, whereas a subsequent mention of \"Downey\" also relates him to a robotic arm." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-59", "text": "Both pieces of information are necessary to answer the query \"Iron Man star [X] presents . . . with a bionic arm\"." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-60", "text": "Therefore, dynamic entity representations constructed individually from single sentences may not provide enough information for our reader model." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-61", "text": "We thus propose the use of max-pooling to model information accumulation of dynamic entity representations." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-62", "text": "More precisely, for each entity e, max-pooling takes the max value of each dimension of the vectors d_{e,c} from all preceding contexts c (Figure 3)." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-63", "text": "Then, in a subsequent sentence c where the entity occurs again at index \u03c4, we use the vector" }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-64", "text": "as input for the LSTMs in (3) and (4) for encoding the context." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-65", "text": "This vector x_{c,\u03c4} draws information from preceding contexts, and is regarded as the meaning of the entity e that the reader understands so far, before reading the sentence c. It is used, in place of a previously randomly initialized vector representing e, in the construction of the new dynamic entity representation d_{e,c}." 
}, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-68", "text": "We use the CNN-QA dataset (Hermann et al., 2015) for evaluating our model's ability to answer questions about named entities." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-69", "text": "The dataset consists of (D, q, e)-triples, where the document D is taken from online news articles, and the query q is formed by hiding a named entity e in a summarizing bullet point of the document (Figure 1)." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-70", "text": "The training set has 90k articles and 380k queries, and both validation and test sets have 1k articles and 3k queries." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-71", "text": "An average article has about 25 entities and 700 word tokens." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-72", "text": "One trains a machine reading system on the data by maximizing the likelihood of correct answers." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-73", "text": "We use Chainer 5 (Tokui et al., 2015) to implement our model 6." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-74", "text": "Experimental Settings: Named entities in CNN-QA are already recognized." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-75", "text": "For preprocessing, we segment sentences at punctuation marks \".\", \"!\", and \"?\"." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-76", "text": "7 We train our model 8 with hyper-parameters lightly tuned on the validation set 9, and we conduct an ablation test on several techniques that improve our basic model." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-77", "text": "As shown in Table 1, max-pooling described in Section 2.2 drastically improves performance, showing the effect of accumulating information on entities." 
}, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-78", "text": "Another technique, called \"Byway\", is based on the observation that the attention mechanism (5) must always promote some entity occurrences (since all weights sum to 1), which could be difficult if the entity does not answer the query." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-79", "text": "To counter this, we make an artificial occurrence for each entity with no contexts, which serves as a byway to attend when no other occurrences can be reasonably related to the query." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-80", "text": "This simple trick shows 5 http://chainer.org/ 6 The implementation is available at https://github.com/soskek/der-network." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-81", "text": "----------------------------------" }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-82", "text": "**RESULTS**" }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-83", "text": "7 Text in CNN-QA is tokenized without any sentence segmentation." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-84", "text": "8 The training process takes roughly a week (3-5 passes over the training data) on a 6-core 2.4GHz Xeon CPU." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-85", "text": "9 Vector dimension: 300, Dropout: 0.3, Batch: 50, Optimization: RMSProp with momentum (Tieleman and Hinton, 2012; Graves, 2013) (momentum: 0.9, decay: 0.95), Learning rate: 1e-4 divided by 2.0 per epoch, Gradient clipping factor: 10." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-86", "text": "We initialize word vectors by a uniform distribution [-0.05, 0.05], and other matrix parameters by Gaussians of mean 0 and variance 2/(# rows + # columns)." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-87", "text": "* from Hermann et al. (2015) and ** from Hill et al. (2015)." 
}, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-88", "text": "[Figure 4: an example CNN-QA article with anonymized @entity markers, omitted here] clear effects, suggesting that the attention mechanism plays a key role in our model." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-89", "text": "Combining these two techniques helps more." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-90", "text": "Further, we note that initializing our model with pre-trained word vectors 10 is helpful, even though the anonymization process prevents world knowledge of entities from being used." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-91", "text": "This suggests that pre-trained word vectors may still bring extra linguistic knowledge encoded in ordinary words." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-92", "text": "Finally, we note that our model, the full DER Network, shows the best results compared to several previous reader models (Hermann et al., 2015; Hill et al., 2015), endorsing our approach as promising." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-93", "text": "The 99% confidence intervals of the results of the full DER Network and the one initialized by word2vec on the test set were [0.700, 0.740] and [0.708, 0.749], respectively (measured by bootstrap tests)." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-94", "text": "Analysis: In the example shown in Figure 4, our basic model missed by paying little attention to the second and third sentences, probably because they do not mention @entity0 (Downey)." 
}, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-95", "text": "In contrast, max-pooling of @entity2 (Iron Man) draws attention to the second and third sentences, because Iron Man is related to Downey in the first sentence." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-96", "text": "This helps Iron Man surpass @entity26 (Transformers), the name of a different movie series in which robots appear but Downey does not." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-97", "text": "Quantitatively, in the 479 test-set samples correctly answered by max-pooling but missed by the basic model, the average number of occurrences of the answer entity (8.0) is higher than that (7.2) in the 1782 samples correctly answered by both models." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-98", "text": "This suggests that max-pooling especially helps samples with more entity mentions." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-99", "text": "----------------------------------" }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-100", "text": "**DISCUSSION**" }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-101", "text": "It is actually a surprise for us that deep learning models, despite their vast number of parameters, seem able to learn as intended by their designers." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-102", "text": "This also indicates a potential that additional linguistic intuitions modeled by deep learning methods can improve performance, as in the other work using max-pooling (LeCun et al., 1998; Socher et al., 2011; Le et al., 2012; Collobert et al., 2011; Kalchbrenner et al., 2014), attention (Bahdanau et al., 2015; Luong et al., 2015; Xu et al., 2015; Rush et al., 2015), etc." }, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-103", "text": "In this work, we have focused on modeling a reader that dynamically builds meanings for entities." 
}, { "sent_id": "91723cf7f22ba6405c85a929ac2d8e-C001-104", "text": "We believe the methodology can be inspiring to other problems as well." } ], "y": { "@USE@": { "gold_contexts": [ [ "91723cf7f22ba6405c85a929ac2d8e-C001-3" ], [ "91723cf7f22ba6405c85a929ac2d8e-C001-31" ], [ "91723cf7f22ba6405c85a929ac2d8e-C001-68" ] ], "cite_sentences": [ "91723cf7f22ba6405c85a929ac2d8e-C001-3", "91723cf7f22ba6405c85a929ac2d8e-C001-31", "91723cf7f22ba6405c85a929ac2d8e-C001-68" ] }, "@BACK@": { "gold_contexts": [ [ "91723cf7f22ba6405c85a929ac2d8e-C001-10" ], [ "91723cf7f22ba6405c85a929ac2d8e-C001-11" ], [ "91723cf7f22ba6405c85a929ac2d8e-C001-24" ] ], "cite_sentences": [ "91723cf7f22ba6405c85a929ac2d8e-C001-10", "91723cf7f22ba6405c85a929ac2d8e-C001-11", "91723cf7f22ba6405c85a929ac2d8e-C001-24" ] }, "@MOT@": { "gold_contexts": [ [ "91723cf7f22ba6405c85a929ac2d8e-C001-24", "91723cf7f22ba6405c85a929ac2d8e-C001-25" ] ], "cite_sentences": [ "91723cf7f22ba6405c85a929ac2d8e-C001-24" ] }, "@DIF@": { "gold_contexts": [ [ "91723cf7f22ba6405c85a929ac2d8e-C001-34" ], [ "91723cf7f22ba6405c85a929ac2d8e-C001-92" ] ], "cite_sentences": [ "91723cf7f22ba6405c85a929ac2d8e-C001-34", "91723cf7f22ba6405c85a929ac2d8e-C001-92" ] } } }, "ABC_3881903212a2d0fea039c8967ab553_27": { "x": [ { "sent_id": "3881903212a2d0fea039c8967ab553-C001-5", "text": "----------------------------------" }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-7", "text": "The field of computational social sciences has created many interesting applications for natural language processing in recent years." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-8", "text": "One of the areas where NLP techniques have shown great promise is in the analysis of political speech." 
}, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-9", "text": "For example, researchers have applied NLP techniques to political texts for a variety of tasks such as predicting voting patterns (Thomas et al., 2006) , identifying markers of persuasion (Guerini et al., 2008) , capturing cues that signal charisma (Rosenberg and Hirschberg, 2009) , and detecting ideological positions (Sim et al., 2013) ." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-10", "text": "Our work also analyzes political speech, more specifically, presidential debates." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-11", "text": "The contribution of this paper is to show that the topic shifting tendency of a presidential candidate changes over the course of the election campaign, and that it is correlated with his or her relative power." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-12", "text": "We also show that this insight can help computational systems that predict the candidates' relative rankings based on their interactions in the debates." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-13", "text": "----------------------------------" }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-14", "text": "**MOTIVATION**" }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-15", "text": "The motivation for this paper stems from prior work done by the first author in collaboration with other researchers (Prabhakaran et al., 2013a; Prabhakaran et al., 2013b) ." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-16", "text": "Prabhakaran et al. (2013a) introduced the notion of power in the domain of presidential debates, and Prabhakaran et al. (2013b) followed it up with an automatic power ranker system based on interactions within the debates." 
}, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-17", "text": "The power that a candidate had at a certain point in the election campaign was modeled based on his or her recent poll standings: in elections, popularity is power." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-18", "text": "Those studies analyzed the 2012 Republican presidential primary debates and found that a candidate's power at the time of a debate correlates with the structure of interactions within the debate (e.g., turn frequency and interruption patterns)." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-19", "text": "Another finding was that the candidates' power correlates with the distribution of topics they speak about in the debates: candidates with more power spoke significantly more about certain topics (e.g., economy) and less about certain other topics (e.g., energy)." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-20", "text": "However, these findings relate to the specific election cycle that was analyzed and will not carry over to political debates in general." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-21", "text": "A further dimension with relevance beyond a specific election campaign is how topics evolve during the course of an interaction (e.g., who attempts to shift topics)." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-22", "text": "In , we explored this dimension and found that candidates with higher power introduce significantly more topics in the debates, but attempt to shift topics significantly less often while responding to a moderator." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-23", "text": "We used the basic LDA topic modeling method (with a filter for substantivity of turns) to assign topics to turns, which were then used to detect shifts in topics." 
}, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-24", "text": "However, segmenting interactions into coherent topic segments is an active area of research, and a variety of topic modeling approaches have been proposed for that purpose." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-25", "text": "In this paper, we explore the utility of one such topic modeling approach to tackle this problem." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-26", "text": "While most of the early approaches to topic segmentation in interactions have focused on the content of the contribution, Nguyen et al. (2012) introduced a system called Speaker Identity for Topic Segmentation (SITS) which also takes into account the topic shifting tendencies of the participants of the conversation." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-27", "text": "In later work, Nguyen et al. (2013) demonstrated the SITS system's utility in detecting influencers in Crossfire debates and Wikipedia discussions." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-28", "text": "They also applied the SITS system to the domain of political debates." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-29", "text": "However, they were able to perform only a qualitative analysis of its utility in the debates domain, since the debates data did not have influence annotations." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-30", "text": "In this paper, we use the SITS system to assign topics to turns and perform a quantitative analysis of how the topic shift features calculated using the SITS system relate to the notion of power as captured by (Prabhakaran et al., 2013a)." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-31", "text": "The SITS system associates each debate participant with a constant scalar value that captures his or her tendency to shift topics." 
}, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-32", "text": "However, since we want to investigate how each candidate's topic shifting tendency relates to his or her changing power over the course of the campaign, we introduce a variation of the SITS analysis in which we represent a different \"persona\" for each candidate in each debate." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-33", "text": "Once equipped with this notion of \"persona\", we find that the topic shifting tendency of a candidate does indeed show a great deal of fluctuation during the election campaign period." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-34", "text": "We also find that this fluctuation in topic shifting tendencies is significantly correlated with the candidates' power." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-35", "text": "As an additional contribution of this paper, we demonstrate the utility of our topic shift features extracted using both types of SITS-based analyses in improving the performance of the automatic power ranker system presented in (Prabhakaran et al., 2013b) ." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-36", "text": "We also investigated the utility of topic shifting features described in (Prabhakaran et al., 2014) extracted using LDA based topic modeling." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-37", "text": "However, they did not improve the performance of the ranker, and hence we do not discuss them in detail in this paper." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-38", "text": "----------------------------------" }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-39", "text": "**DATA**" }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-40", "text": "We use the presidential debates corpus released by Prabhakaran et al. (2013a) , which contains manual transcripts of 20 debates held between May 2011 and February 2012 as part of the 2012 Republican presidential primaries." 
}, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-41", "text": "The corpus also captures each candidate's power at the time of each debate, computed based on their relative standing in recent public polls." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-42", "text": "The poll numbers capture how successful candidates are in convincing the electorate of their candidature, which in turn affects their confidence within the debates." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-43", "text": "These debates serve as a rich domain to explore manifestations of power since they are a medium through which candidates pursue and maintain power over other candidates." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-44", "text": "Prabhakaran et al. (2013b) offers a detailed description of how the relative standings in national and state-level polls from various sources are aggregated to obtain candidates' power." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-45", "text": "The transcripts are originally obtained from The American Presidency Project, where each turn of the conversation is manually demarcated and their speakers identified." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-46", "text": "The turns in the corpus are preprocessed using the Stanford CoreNLP package to perform basic NLP steps such as tokenization, sentence segmentation, parts-of-speech tagging and lemmatization." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-47", "text": "----------------------------------" }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-48", "text": "**MODELING TOPIC SHIFTS**" }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-49", "text": "Topic segmentation, the task of segmenting interactions into coherent topic segments, is an important step in analyzing interactions." 
}, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-50", "text": "In addition to its primary purpose, topic segmentation also identifies the speaker turn where the conversation changed from one topic to another, i.e., where the topic shifted, which may shed light on the characteristics of the speaker who changed the topic." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-51", "text": "We use the SITS approach proposed by (Nguyen et al., 2012) to detect topic shifts." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-52", "text": "We also propose a different way of using SITS to obtain an analysis of our corpus, which we call SITS var ." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-53", "text": "We discuss both in turn, and then provide a discussion." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-54", "text": "----------------------------------" }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-55", "text": "**SEGMENTATION USING SITS**" }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-56", "text": "Most computational approaches towards automatic topic segmentation have focused mainly on the content of the contribution without taking into account the social aspects or speaker characteristics." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-57", "text": "Different discourse participants may have different tendencies to introduce or shift topics in interactions." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-58", "text": "In order to address this shortcoming, Nguyen et al. (2012) proposed a new topic segmentation model called Speaker Identity for Topic Segmentation (SITS), in which they explicitly model the individual's tendency to introduce new topics." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-59", "text": "Like traditional topic modeling approaches, the SITS system also considers each turn to be a bag of words generated from a mixture of topics." 
}, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-60", "text": "These topics themselves are multinomial distributions over terms." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-61", "text": "In order to account for the topic shifts that happen during the course of an interaction, they introduce a binary latent variable l_{d,t}, called the topic shift, to indicate whether the speaker changed the topic or not in conversation d at turn t. To capture an individual speaker's topic shifting tendency, they introduced another latent variable called the topic shift tendency (\u03c0_x) of speaker x. The \u03c0_x value represents the propensity of speaker x to perform a topic shift." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-62", "text": "----------------------------------" }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-63", "text": "**SEGMENTATION USING SITS VAR**" }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-64", "text": "Within the SITS formulation, the topic shifting tendency of an individual (\u03c0_x) is considered a constant across conversations." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-65", "text": "While an individual may have an inherent propensity to shift topics or not, we argue that the topic shifting tendency he or she displays can vary based on the social settings in which he or she interacts and his or her status within those settings." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-66", "text": "In other words, the same discourse participant may behave differently in different social situations and at different points in time." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-67", "text": "This is especially relevant in the context of our dataset, where the debates happen over a period of 10 months, and the power and status of each candidate in the election campaign vary greatly within that time period." 
}, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-68", "text": "We propose a variant of SITS which takes this issue into account." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-69", "text": "We consider each candidate to have a different \"persona\" in each debate." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-70", "text": "To accomplish this, we create new identities for each candidate x for each debate d, denoted by x_d. For example, 'ROMNEY 08-11-2011' denotes the persona of the candidate ROMNEY in the debate held on 08-11-2011." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-71", "text": "Running the SITS system using this formulation, we obtain different \u03c0_{x_d} values for candidate x for different debates, capturing the different topic shift tendencies of x." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-72", "text": "----------------------------------" }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-73", "text": "**EXECUTION**" }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-74", "text": "We perform both the SITS and SITS var analyses on the 20 debates in our corpus." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-75", "text": "We used the nonparametric version of SITS for both runs, since it systematically estimates the number of topics in the data." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-76", "text": "We set the maximum number of iterations at 5000, sample lag at 100 and initial number of topics at 25." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-77", "text": "We refer the reader to (Nguyen et al., 2013) for details on these parameters." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-78", "text": "For each candidate, we calculate the mean and standard deviation of the topic shift tendency (\u03c0_{x_d}) of his or her personas across all debates he or she participated in." 
}, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-79", "text": "We then average these means and standard deviations, and obtain an average mean of 0.14 and an average standard deviation of 0.09." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-80", "text": "This shows that the topic shift tendencies of candidates vary by a considerable amount across debates." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-81", "text": "Figure 1 shows the \u03c0_{x_d} value fluctuating across different debates." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-82", "text": "WTS(x, d) is the value of TTS(x, d) weighted by 1 \u2212 \u03c0_x." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-83", "text": "The intuition here is that a topic shift by a speaker with a low topic shift tendency must be weighted higher than one by a speaker with a high topic shift tendency." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-84", "text": "We use these two features as well, and denote the set of these two features as TopSh." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-85", "text": "----------------------------------" }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-86", "text": "**ANALYSIS OF TOPIC SHIFT FEATURES**" }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-87", "text": "We also extract the TTS and WTS features using our SITS var variation of the topic segmentation analysis and denote them as TTS var and WTS var respectively." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-88", "text": "In addition, we also use a feature PI var (x, d), which is the \u03c0_{x_d} value obtained by the SITS var analysis for candidate x in debate d. It captures the topic shifting tendency of candidate x in debate d. (We do not include the SITS \u03c0_x value in our correlation analysis since it is constant across debates.) We denote the set of these three features obtained from the SITS var run as TopSh var ." 
}, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-89", "text": "Table 1 shows the Pearson product-moment correlation between each topical feature and the candidate's power." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-90", "text": "We obtain a highly significant (p = 0.002) negative correlation between the topic shift tendency of a candidate (PI) and his/her power." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-91", "text": "In other words, the variation in the topic shifting tendencies is significantly correlated with the candidates' recent poll standings." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-92", "text": "Candidates who are higher up in the polls tend to stay on topic while the candidates with less power attempt to shift topics more often." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-93", "text": "This is in line with our previous finding that candidates with higher power attempt to shift topics less often than others when responding to moderators." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-94", "text": "It is also in line with the findings by Prabhakaran et al. (2013a) that candidates with higher power tend not to interrupt others." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-95", "text": "On the other hand, we did not obtain any significant correlation for the features proposed by Nguyen et al. (2013)." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-96", "text": "----------------------------------" }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-97", "text": "**TOPIC SHIFT FEATURES IN POWER RANKER**" }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-98", "text": "In this section, we investigate the utility of the SITS and SITS var based topic shift features described above in the problem of automatically ranking the participants of debates based on their power." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-99", "text": "Prabhakaran et al. 
(2013b) define the problem as follows: given a debate d with a set of participants C_d = {x_1, x_2, ..., x_n} and corresponding power indices P(x_i) for 1 \u2264 i \u2264 n, find a ranking function r:" }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-100", "text": "For our experiments, we use the SVM rank based supervised power ranker presented in that work to estimate this ranking function." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-101", "text": "As we do in (Prabhakaran et al., 2013b), we report here Kendall's Tau and Normalized Discounted Cumulative Gain values (NDCG and NDCG@3) on 5-fold cross validation (at the debate level)." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-102", "text": "All three metrics are based on the number of rank inversions between the original and predicted rankings." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-103", "text": "While Tau treats all rank inversions equal, NDCG and NDCG@3 penalize the inversions happening in the top of the ranked list more than those happening in the bottom." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-104", "text": "NDCG@3 focuses only on the top 3 positions in the ranked list." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-105", "text": "Adding our topic shift features to the best performing feature set of (Prabhakaran et al., 2013b) posted the overall best system, obtaining a Tau of 0.60, an NDCG of 0.970, and an NDCG@3 of 0.937." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-106", "text": "These results demonstrate the utility of topic shift features in the power ranking problem, especially using the SITS var formulation." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-107", "text": "We also experimented with all subsets of TopSh and TopSh var ; the best results were obtained using all features in each set." 
}, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-108", "text": "----------------------------------" }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-109", "text": "**RELATED WORK**" }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-110", "text": "Studies in sociolinguistics (e.g., (Ng et al., 1993; Ng et al., 1995; Reid and Ng, 2000) ) have long established that dialog structure in interactions relates to power and influence." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-111", "text": "Researchers in the NLP community have studied power and influence in various genres of interactions, such as organizational email threads (Bramsen et al., 2011; Gilbert, 2012; Prabhakaran and Rambow, 2013; , online discussion forums (Danescu-Niculescu-Mizil et al., 2012; Biran et al., 2012) and online chat dialogs (Strzalkowski et al., 2012) ." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-112", "text": "The correlates analyzed in these studies range from word and phrase patterns, to derivatives of such patterns such as linguistic coordination, to deeper dialogic features such as argumentation and dialog acts." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-113", "text": "Our work differs from these studies in that we study the correlates of power in topic dynamics." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-114", "text": "Furthermore, we analyze spoken interactions." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-115", "text": "----------------------------------" }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-116", "text": "**CONCLUSION**" }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-117", "text": "In this paper, we studied how topic shift patterns in the 2012 Republican presidential debates correlate with the power of candidates." 
}, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-118", "text": "We proposed an alternate formulation of the SITS topic segmentation system that captures fluctuations in each candidate's topic shifting tendencies, which we found to be correlated with their power." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-119", "text": "We also showed that features based on topic shift improve the prediction of the relative rankings of candidates." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-120", "text": "In future work, we will explore a model that captures individuals' inherent topic shift propensities, while also capturing their fluctuations due to social factors." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-2", "text": "We study the topic dynamics of interactions in political debates using the 2012 Republican presidential primary debates as data." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-3", "text": "We show that the tendency of candidates to shift topics changes over the course of the election campaign, and that it is correlated with their relative power." }, { "sent_id": "3881903212a2d0fea039c8967ab553-C001-4", "text": "We also show that our topic shift features help predict candidates' relative rankings." 
} ], "y": { "@MOT@": { "gold_contexts": [ [ "3881903212a2d0fea039c8967ab553-C001-15" ] ], "cite_sentences": [ "3881903212a2d0fea039c8967ab553-C001-15" ] }, "@BACK@": { "gold_contexts": [ [ "3881903212a2d0fea039c8967ab553-C001-16" ] ], "cite_sentences": [ "3881903212a2d0fea039c8967ab553-C001-16" ] }, "@DIF@": { "gold_contexts": [ [ "3881903212a2d0fea039c8967ab553-C001-35" ] ], "cite_sentences": [ "3881903212a2d0fea039c8967ab553-C001-35" ] }, "@USE@": { "gold_contexts": [ [ "3881903212a2d0fea039c8967ab553-C001-101" ], [ "3881903212a2d0fea039c8967ab553-C001-105" ] ], "cite_sentences": [ "3881903212a2d0fea039c8967ab553-C001-101", "3881903212a2d0fea039c8967ab553-C001-105" ] } } }, "ABC_0753a2be70f9844d353ec54c04fd53_27": { "x": [ { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-2", "text": "Implicit discourse relation classification is one of the most difficult tasks in discourse parsing." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-3", "text": "Previous studies have generally focused on extracting better representations of the relational arguments." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-4", "text": "In order to solve the task, it is however additionally necessary to capture what events are expected to cause or follow each other." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-5", "text": "Current discourse relation classifiers fall short in this respect." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-6", "text": "We here show that this shortcoming can be effectively addressed by using the bidirectional encoder representation from transformers (BERT) proposed by Devlin et al. (2019), which were trained on a nextsentence prediction task, and thus encode a representation of likely next sentences." 
}, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-7", "text": "The BERT-based model outperforms the current state of the art in 11-way classification by 8% points on the standard PDTB dataset." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-8", "text": "Our experiments also demonstrate that the model can be successfully ported to other domains: on the BioDRB dataset, the model outperforms the state of the art system around 15% points." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-9", "text": "----------------------------------" }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-11", "text": "Discourse relation classification has been shown to be beneficial to multiple down-stream NLP tasks such as machine translation (Li et al., 2014) , question answering (Jansen et al., 2014) and summarization (Yoshida et al., 2014) ." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-12", "text": "Following the release of the Penn Discourse Tree Bank (Prasad et al., 2008, PDTB) , discourse relation classification has received a lot of attention from the NLP community, including two CoNLL shared tasks (Xue et al., 2015 (Xue et al., , 2016 ." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-13", "text": "Discourse relations in texts are sometimes marked with an explicit connective (e.g., but, because, however), but these explicit signals are often absent." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-14", "text": "With explicit connectives acting as informative cues, it is relatively easy to classify the discourse relation with high accuracy (93.09% on four-way classification in (Pitler et al., 2008) )." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-15", "text": "When there is no connective, classification has to rely on semantic information from the relational arguments." 
}, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-16", "text": "This task is very challenging, with state-of-the-art systems achieving accuracy of only 45% to 48% on 11-way classification." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-17", "text": "Consider example 1:" }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-18", "text": "(1) [The joint venture with Mr. Lang wasn't a good one.] Arg1 [The venture, formed in 1986, was supposed to be Time's low-cost, safe entry into women's magazines.] Arg2 implicit Comp.Concess.expectation relation from PDTB: wsj 1903 In order to correctly classify the relation, it is necessary to understand that Arg1 raises the expectation that the next discourse segment may provide an explanation for why the venture wasn't good (e.g., that it was risky), and Arg2 contrasts with this discourse expectation." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-19", "text": "More generally, this means that a successful discourse relation classification model would have to be able to learn typical temporal event sequences, reasons, consequences etc." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-20", "text": "for all kinds of events." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-21", "text": "Statistical models attempted to address this intuition by giving models word pairs from the two arguments as features (Lin et al., 2009; Park and Cardie, 2012; Biran and McKeown, 2013; Rutherford and Xue, 2014) , so that models could for instance learn to recognize antonym relations between words in the two arguments." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-22", "text": "Recent models exploit such similarity relations between the two arguments, as well as simpler surface features that occur in one relational argument and correlate with specific coherence relations (e.g., the presence of negation, temporal expressions etc. 
may give hints as to what coherence relation may be present; see Park and Cardie (2012); Asr and Demberg (2015))." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-23", "text": "However, relations between arguments are often a lot more diverse than simple contrasts that can be captured through antonyms, and may rely on world knowledge (Kishimoto et al., 2018)." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-24", "text": "It is hence clear that one cannot learn all these diverse relations from the very small amounts of available training data." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-25", "text": "Instead, we would have to learn a more general representation of discourse expectations." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-26", "text": "Many recent discourse relation classification approaches have focused on cross-lingual data augmentation, training models to better represent the relational arguments using various neural network models, including feed-forward networks (Rutherford et al., 2017), convolutional neural networks (Zhang et al., 2015), recurrent neural networks (Ji et al., 2016; Bai and Zhao, 2018), character-based models (Qin et al., 2016), or formulating relation classification as an adversarial task (Qin et al., 2017)." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-27", "text": "These models typically use pre-trained semantic embeddings generated from language modeling tasks, like Word2Vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014) and ELMo (Peters et al., 2018)." 
}, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-28", "text": "However, previously proposed neural models still crucially lack a representation of the typical relations between sentences: to solve the task properly, a model should ideally be able to form discourse expectations, i.e., to represent the typical causes, consequences, next events or contrasts to a given event described in one relational argument, and then assess the content of the second relational argument with respect to these expectations (see Example 1)." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-29", "text": "Previous models would have to learn these relations only from the annotated training data, which is much too sparse for learning all possible relations between all events, states or claims." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-30", "text": "The recently proposed BERT model (Devlin et al., 2019) takes a promising step towards addressing this problem: the BERT representations are trained using a language modelling and, crucially, a \"next sentence prediction\" task, where the model is presented with the actual next sentence vs. a different sentence and needs to select the original next sentence." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-31", "text": "We believe it is a good fit for discourse relation recognition, since the task allows the model to represent what a typical next sentence would look like." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-32", "text": "In this paper, we show that a BERT-based model outperforms the current state of the art by 8% points in 11-way implicit discourse relation classification on PDTB." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-33", "text": "We also show that after pre- trained with small size cross-domain data, the model can be easily transferred to a new domain: it achieves around 16% accuracy gain on BioDRB compared to state of the art model." 
}, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-34", "text": "We also show that the Next Sentence Prediction task played an important role in these improvements." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-35", "text": "Devlin et al. (2019) proposed the Bidirectional Encoder Representation from Transformers (BERT), which is designed to pre-train a deep bidirectional representation by jointly conditioning on both left and right contexts." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-36", "text": "BERT is trained using two novel unsupervised prediction tasks: Masked Language Modeling and Next Sentence Prediction (NSP)." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-37", "text": "The NSP task has been formulated as a binary classification task: the model is trained to distinguish the original following sentence from a randomly chosen sentence from the corpus, and it showed great helps in multiple NLP tasks especially inference ones." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-38", "text": "The resulting BERT representations thus encode a representation of upcoming discourse content, and hence contain discourse expectation representations which, as we argued above, are required for classifying coherence relations." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-39", "text": "ment 2 are separated with token \"[SEP]\"; \"[CLS]\" is the special classification embedding while \"C\" is the same as \"[CLS]\" in pre-training but the ground-truth label in the fine-tuning." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-40", "text": "In the experiments, we used the uncased base model 1 provided by Devlin et al. (2019) , which is trained on BooksCorpus and English Wikipedia with 3300M tokens in total." 
}, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-41", "text": "----------------------------------" }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-42", "text": "**NEXT SENTENCE PREDICTION**" }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-43", "text": "----------------------------------" }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-44", "text": "**EVALUATION ON PDTB**" }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-45", "text": "We used the Penn Discourse Tree Bank (Prasad et al., 2008) , the largest available manually annotated discourse corpus." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-46", "text": "It provides a three level hierarchy of relation tags." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-47", "text": "Following the experimental settings and evaluation metrics in Bai and Zhao (2018) , we use two most-used splitting methods of PDTB data, denoted as PDTB-Lin (Lin et al., 2009) , which uses sections 2-21, 22, 23 as training, validation and test sets, and PDTB-Ji (Ji and Eisenstein, 2015) , which uses 2-20, 0-1, 21-22 as training, validation and test sets and report the overall accuracy score." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-48", "text": "In addition, we also performed 10-fold cross validation among sections 0-22, as promoted in ." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-49", "text": "We also follow the standard in the literature to formulate the task as an 11-way classification task." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-50", "text": "Results are presented in Table 1 ." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-51", "text": "We evaluated three versions of the BERT-based model." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-52", "text": "All of our BERT models use the pre-trained representations and are fine-tuned on the PDTB training data." 
}, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-53", "text": "The version marked as \"BERT\" does not do any additional pre-training." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-54", "text": "BERT+WSJ in addition performs further pre-training on the 1 https://github.com/google-research/ bert#pre-trained-models parts of the Wall Street Journal corpus that do not have discourse relation annotation." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-55", "text": "The model version \"BERT+WJS w/o NSP\" also performs pre-training on the WSJ corpus, but only uses the Masked Language Modelling task, not the Next Sentence Prediction task in the pre-training." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-56", "text": "We added this variant to measure the benefit of in-domain NSP on discourse relation classification (note though that the downloaded pre-trained BERT model contains the NSP task in the original pre-training)." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-57", "text": "We compared the results with four state-of-theart systems: Cai and Zhao (2017) proposed a model that takes a step towards calculating discourse expectations by using attention over an encoding of the first argument, to generate the representation of the second argument, and then learning a classifier based on the concatenation of the encodings of the two discourse relation arguments." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-58", "text": "Kishimoto et al. (2018) fed external world knowledge (ConceptNet relations and coreferences) explicitly into MAGE-GRU (Dhingra et al., 2017) and achieved improvements compared to only using the relational arguments." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-59", "text": "However, we here show that it works even better when we learn this knowledge implicit through next sentence prediction task." 
}, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-60", "text": "used a seq2seq model that learns better argument representations due to being trained to explicitate the implicit connective." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-61", "text": "In addition, their classifier also uses a memory network that is intended to help remember similar argument pairs encountered during training." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-62", "text": "The current best performance was achieved by Bai and Zhao (2018) , who combined representations from different grained em-beddings including contextualized word vectors from ELMo (Peters et al., 2018) , which has been proved very helpful." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-63", "text": "In addition, we compared our results with a simple bidirectional LSTM network and pre-trained word embeddings from Word2Vec." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-64", "text": "We can see that on all settings, the model using BERT representations outperformed all existing systems with a substantial margin." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-65", "text": "It obtained improvements of 7.3% points on PDTB-Lin, 5.5% points on PDTB-Ji, compared with the ELMobased method proposed in (Bai and Zhao, 2018) ." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-66", "text": "What's more, the BERT model outperformed on cross validation by around 8%, with significance of p<0.01." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-67", "text": "Significance test was performed by estimating variance of the model from the performance on different folds in cross-validation (paired t-test)." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-68", "text": "For the Lin and Ji evaluations, we estimated variance due to random initialization by running them 5 times and calculating the likelihood that the state-of-the-art model result would come from that distribution." 
}, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-69", "text": "----------------------------------" }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-70", "text": "**EVALUATION ON BIODRB**" }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-71", "text": "The Biomedical Discourse Relation Bank (Prasad et al., 2011) also follows PDTB-style annotation." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-72", "text": "It is a corpus annotated over 24 open access fulltext articles from the GENIA corpus (Kim et al., 2003) in the biomedical domain." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-73", "text": "Compared with PDTB, some new discourse relations and changes have been introduced in the annotation of Bio-DRB." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-74", "text": "In order to make the results comparable, we preprocessed the BioDRB annotations to map the relations to the PDTB ones, following the instructions in Prasad et al. (2011) ." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-75", "text": "The biomedical domain is very different from the WSJ or the data on which the BERT model was trained." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-76", "text": "The BioDRB contains a lot of professional words / phrases that are extremely hard to model." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-77", "text": "In order to test the ability of the BERT model on cross-domain data, we performed finetuning on PDTB while testing on BioDRB." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-78", "text": "We also tested the state of the art model of implicit discourse relation classification proposed by Bai and Zhao (2018) on BioDRB." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-79", "text": "From Table 2 , we can see that the BERT base model achieved almost 12% points improvement over the Bi-LSTM baseline and 15% points over Bai and Zhao (2018) ." 
}, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-80", "text": "When fine-tuned on in-domain data in the crossvalidation setting, the improvement increases to around 17% points." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-81", "text": "----------------------------------" }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-82", "text": "**METHOD**" }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-83", "text": "Cross-Domain In-Domain Bi-LSTM + w2v 300 32.97 46.49 Bai and Zhao (2018) 29.52 55.90" }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-84", "text": "BioBERT Table 2 : Accuracy (%) on BioDRB level 2 relations with different settings." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-85", "text": "Cross-Domain means trained on PDTB and tested on BioDRB." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-86", "text": "For the In-Domain setting, we used 5-fold cross-validation and report average accuracy." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-87", "text": "Numbers in bold are significantly better than the state of the art system with p<0.01 and numbers with * denote denote significant improvements over BERT + GENIA w/o NSP with p<0.01." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-88", "text": "It is also interesting to know whether the performance of the BERT model can be improved if we add additional pre-training on in-domain data." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-89", "text": "BioBert continues pretraining BERT with bio-medical texts including PubMed and PMC corpora (around 18B tokens), which achieved the best results on in-domain setting." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-90", "text": "Similarly, BERT+GENIA refers to a model in which the downloaded BERT representations are further pre-trained on the parts of the GENIA corpus which consists of 18k sentences and is not annotated with coherence relations." 
}, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-91", "text": "Evaluation shows that this in-domain pre-training yields another 3% point improvement; our tests also show that the NSP task again plays a substantial role in the improvement." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-92", "text": "We believe that gains for further pre-training on GENIA for the biomedical domain are higher than for pre-training on WSJ for PDTB because the domain difference between the BooksCorpus and the biomedical domain is larger." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-93", "text": "Currently there are not so many published results that we can compare with on BioDRB for implicit discourse relation classification." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-94", "text": "We compared BERT model with na\u00efve Bayes and Max-Ent methods proposed in Xu et al. (2012) on oneversus-all binary classification." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-95", "text": "We followed the settings in Xu et al. (2012) and used two articles (\"GENIA 1421503\", \"GENIA 1513057\") for testing and one article (\"GENIA 111020\") for validation." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-96", "text": "During training, we employed downsampling or up-sampling to keep the numbers of positive and negative samples in each relation consistent." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-97", "text": "The BERT base model achieved 43.03% average F 1 score and 77.34% average accuracy in one-versus-all level-1 classification." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-98", "text": "Compared with the current state-of-the-art perfor- Table 3 ." 
}, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-99", "text": "----------------------------------" }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-100", "text": "**DISCUSSION**" }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-101", "text": "The usage of the BERT model in this paper was motivated primarily by the use of the nextsentence prediction task during training." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-102", "text": "The results in Table 1 and Table 2 confirm that removing the \"Next Sentence Prediction\" hurts the performance on both PDTB and BioDRB." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-103", "text": "In order to have better insights about which relation has benefited from the NSP task, we also reported the detailed performance for each relation with and without it in BERT." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-104", "text": "As illustrated in Table 4 , we can see that performances on relations like Temporal." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-105", "text": "Synchrony, Comparison." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-106", "text": "Contrast, Expansion." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-107", "text": "Conjunction and Expansion." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-108", "text": "Alternative have been improved by a large margin." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-109", "text": "This shows that representing the likely upcoming sentence helps the model form discourse expectations, which the classifier can then use to predict the coherence relation between the actually observed arguments." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-110", "text": "However, compared with BERT+GENIA, the results of BioBert in Table 2 show that having large in-domain data for pretraining also has limited ability in learning domain specific representations." 
}, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-111", "text": "We therefore believe that the model could be further improved by including external domain-specific knowledge from an ontology (as in Kishimoto et al. (2018) ) or a causal graph for biomedical concepts and events." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-112", "text": "----------------------------------" }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-113", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-114", "text": "In this paper, we show that BERT has very good ability in encoding the semantic relationship between sentences with its \"next sentence prediction\" task in pre-training." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-115", "text": "It outperformed the current state-of-the-art systems significantly with a substantial margin on both in-domain and cross domain data." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-116", "text": "Our results also indicate that the next-sentence prediction task during training indeed plays a role in this improvement." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-117", "text": "Future work should explore the joint representation of discourse expectations through implicit representations that are learned during training and the inclusion of external knowledge." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-118", "text": "In addition, Yang et al. (2019) showed that NSP only helps tasks with longer texts." }, { "sent_id": "0753a2be70f9844d353ec54c04fd53-C001-119", "text": "It would be interesting to see whether it has the same effect on implicit discourse relation classification task, we'd like to leave that in the future work." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "0753a2be70f9844d353ec54c04fd53-C001-26" ], [ "0753a2be70f9844d353ec54c04fd53-C001-62" ] ], "cite_sentences": [ "0753a2be70f9844d353ec54c04fd53-C001-26", "0753a2be70f9844d353ec54c04fd53-C001-62" ] }, "@MOT@": { "gold_contexts": [ [ "0753a2be70f9844d353ec54c04fd53-C001-26", "0753a2be70f9844d353ec54c04fd53-C001-28" ] ], "cite_sentences": [ "0753a2be70f9844d353ec54c04fd53-C001-26" ] }, "@USE@": { "gold_contexts": [ [ "0753a2be70f9844d353ec54c04fd53-C001-47" ], [ "0753a2be70f9844d353ec54c04fd53-C001-78" ] ], "cite_sentences": [ "0753a2be70f9844d353ec54c04fd53-C001-47", "0753a2be70f9844d353ec54c04fd53-C001-78" ] }, "@DIF@": { "gold_contexts": [ [ "0753a2be70f9844d353ec54c04fd53-C001-65" ], [ "0753a2be70f9844d353ec54c04fd53-C001-79" ] ], "cite_sentences": [ "0753a2be70f9844d353ec54c04fd53-C001-65", "0753a2be70f9844d353ec54c04fd53-C001-79" ] } } }, "ABC_3e5c070a6966361b54f069248438ec_27": { "x": [ { "sent_id": "3e5c070a6966361b54f069248438ec-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-2", "text": "We present a neural network based shiftreduce CCG parser, the first neural-network based parser for CCG." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-3", "text": "We also study the impact of neural network based tagging models, and greedy versus beam-search parsing, by using a structured neural network model." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-4", "text": "Our greedy parser obtains a labeled F-score of 83.27%, the best reported result for greedy CCG parsing in the literature (an improvement of 2.5% over a perceptron based greedy parser) and is more than three times faster." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-5", "text": "With a beam, our structured neural network model gives a labeled F-score of 85.57% which is 0.6% better than the perceptron based counterpart." 
}, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-6", "text": "----------------------------------" }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-8", "text": "Shift-reduce parsing is interesting for practical realworld applications like parsing the web, since parsing can be achieved in linear time." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-9", "text": "Although greedy parsers are fast, accuracies of these parsers are typically much lower than graph-based parsers." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-10", "text": "Conversely, beam-search parsers achieve accuracies comparable to graph-based parsers (Zhang and Nivre, 2011) but are much slower than their greedy counterparts." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-11", "text": "Recently, Chen and Manning (2014) have showed that fast and accurate parsing can be achieved using neural network based parsers." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-12", "text": "Improving their work, presented a structured neural network model which gave stateof-the-art results for English dependency parsing." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-64", "text": "Our input layer is a 34 (feature templates) X 50 (embedding size) dimensional vector." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-13", "text": "There has been increasing interest in Combinatory Categorial Grammar (CCG) (Steedman, 2000) parsing due to the simplicity of its interface between syntax and semantics." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-14", "text": "In addition to predicateargument structure, CCG captures the unbounded dependencies found in grammatical constructions like relativization, coordination, etc." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-15", "text": "We present a neural network based shift-reduce CCG parser, the first neural network based parser for CCG." 
}, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-16", "text": "We first adapt Chen and Manning (2014)'s shift-reduce dependency parser for CCG parsing." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-17", "text": "We then develop a structured neural network model based on , in order to explore the impact of a beam-search on the parser." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-18", "text": "We also analyze the impact of neural network taggers (for both POS-tagging and CCG supertagging) as compared to maximum entropy taggers." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-19", "text": "Our greedy neural network parser achieves unlabeled and labeled F-scores of 89.78% and 83.27% respectively, an improvement of around 2.5% over a perceptron based greedy parser, and is more than three times faster." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-20", "text": "Due to its relevance for large-scale parsing, we make this parser available for public usage." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-21", "text": "By using a beam search, our structured neural network model gave even better results of 91.95% and 85.57% unlabeled and labeled F-scores respectively." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-22", "text": "To the best of our knowledge, ours is the first neural network based parser for CCG and also the first work on exploring neural network taggers for shift-reduce CCG parsing."
}, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-23", "text": "----------------------------------" }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-24", "text": "**RELATED WORK**" }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-25", "text": "----------------------------------" }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-26", "text": "**CCG PARSERS**" }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-27", "text": "Due to the availability of the English CCGbank (Hockenmaier and Steedman, 2007), several wide-coverage CCG parsers have been developed (Hockenmaier and Steedman, 2002; Clark and Curran, 2007; Auli and Lopez, 2011; Zhang and Clark, 2011; Lewis and Steedman, 2014a)." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-28", "text": "While the majority of CCG parsers are chart-based (Clark and Curran, 2007; Lewis and Steedman, 2014a), there has been some work on shift-reduce CCG parsing (Zhang and Clark, 2011; Xu et al., 2014; Ambati et al., 2015)." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-29", "text": "Zhang and Clark (2011) used a global linear model trained discriminatively with the averaged perceptron (Collins, 2002) and beam search for their shift-reduce CCG parser." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-30", "text": "A dependency model for shift-reduce CCG parsing using a dynamic oracle technique (Goldberg and Nivre, 2012) was developed by Xu et al. (2014)." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-31", "text": "Ambati et al. (2015) presented an incremental algorithm for transition based CCG parsing which introduced two novel revealing actions to overcome the consequences of the greedy nature of the previous parsers."
}, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-32", "text": "----------------------------------" }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-33", "text": "**NEURAL NETWORK PARSERS**" }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-34", "text": "Neural network parsers are attracting interest for both their speed and accuracy." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-35", "text": "There has been some work on neural networks for constituent based parsing (Collobert, 2011; Socher et al., 2013; Watanabe and Sumita, 2015)." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-36", "text": "Chen and Manning (2014) developed a neural network architecture for dependency parsing." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-37", "text": "This parser was fast and accurate, parsing around 1000 sentences per second and achieving an unlabeled attachment score of 92.0% on the standard Penn Treebank test data for English." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-38", "text": "Chen and Manning (2014)'s parser used a feed-forward neural network." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-39", "text": "Several improvements were made to this architecture in terms of using Long Short-Term Memory (LSTM) networks (Dyer et al., 2015), deep neural networks, and structured neural networks (Zhou et al., 2015)." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-40", "text": "----------------------------------" }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-41", "text": "**OUR NEURAL NETWORK PARSER (NNPAR)**" }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-42", "text": "The architecture of our neural network based shift-reduce CCG parser is similar to that of Chen and Manning (2014)." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-43", "text": "We present the details of the network and the model settings in this section."
}, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-44", "text": "We also discuss our structured neural network model." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-45", "text": "Figure 1 shows the architecture of our neural network parser." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-46", "text": "There are three layers in the parser: input, hidden and output layers." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-47", "text": "We first extract discrete features like words, POS-tags and CCG supertags from the parser configuration." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-48", "text": "----------------------------------" }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-49", "text": "**ARCHITECTURE**" }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-51", "text": "For each of these discrete features we obtain a continuous vector representation in the form of their corresponding embeddings and use them in the input layer." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-52", "text": "Following Chen and Manning (2014), we use a cube activation function and a softmax for the output layer." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-53", "text": "----------------------------------" }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-54", "text": "**FEATURE AND MODEL SETTINGS**" }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-55", "text": "We extract features from a) the top four nodes in the stack, b) the next four nodes in the input and c) the left and right children of the top two nodes in the stack." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-56", "text": "We obtain the words and POS-tags of all these 12 nodes."
}, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-57", "text": "In the case of CCG supertags, in addition to the CCG categories of the nodes in the stack (top four nodes, left and right children of top two nodes), we also obtain the lexical head categories for the top two nodes." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-58", "text": "We use a special token 'NULL' if a feature is not present in the parser configuration." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-59", "text": "So, in total we have 34 features: 12 word, 12 POS-tag and 10 CCG supertag features." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-60", "text": "For each of these 34 features we obtain their corresponding embeddings." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-61", "text": "We use Turian embeddings of dimensionality 50 (Turian-50)." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-62", "text": "For the words which are not in the word embeddings dictionary, embeddings of the '-UNKNOWN-' token are used as a backoff." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-63", "text": "For POS-tags and CCG supertags, the parameters are randomly initialized with values between -0.01 and 0.01." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-65", "text": "We use 200 hidden units in the hidden layer." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-66", "text": "For the output layer we compute softmax probabilities only for the actions which are possible in a particular parser configuration instead of all the actions." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-67", "text": "We use the training settings of Chen and Manning (2014) for our parser." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-68", "text": "The training objective is to minimize the cross-entropy loss with an l2-regularization, and the training error derivatives are backpropagated during training."
}, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-69", "text": "For optimization we use AdaGrad (Duchi et al., 2011)." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-70", "text": "The regularization parameter and the AdaGrad initial learning rate are set to 10^\u22128 and 0.01, respectively." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-71", "text": "Parameters that give the best labeled F-score on the development data are used for the testing data." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-72", "text": "----------------------------------" }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-73", "text": "**STRUCTURED NEURAL NETWORK**" }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-74", "text": "Chen and Manning (2014)'s parser is a greedy parser and it is not straightforward to add a beam during training to their parser." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-75", "text": "As a way of introducing a beam, presented structured perceptron training for the neural network parser." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-76", "text": "They first pre-trained their neural network model." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-77", "text": "For the final layer, they trained a structured perceptron using beam search decoding which takes the neural network hidden and output layers as the input." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-78", "text": "This method, known as a structured neural network, gave the state-of-the-art results for English dependency parsing." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-79", "text": "In addition to using a softmax for the output layer, we also applied this structured neural network approach for our experiments using a beam."
}, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-80", "text": "Unlike 's neural network architecture, which consists of two hidden layers with 2048 hidden units each, we use the Chen and Manning (2014) style architecture described in the previous sections." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-81", "text": "----------------------------------" }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-82", "text": "**COMPARISON TO CHEN AND MANNING (2014)**" }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-83", "text": "Our neural network parser differs from Chen and Manning (2014) in a number of respects." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-84", "text": "We use CCG supertags in the input layer rather than dependency labels." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-85", "text": "For word embeddings, we use Turian embeddings (Turian et al., 2010)." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-86", "text": "----------------------------------" }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-87", "text": "**EXPERIMENTS AND RESULTS**" }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-88", "text": "We first compare our neural network parser (NNPar) with a perceptron based parser in the greedy setting." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-89", "text": "Then we analyze the impact of the beam using neural network (NNPar) and structured neural network (Structured NNPar) models." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-90", "text": "The perceptron based parser is a reimplementation of Zhang and Clark (2011)'s parser (Z&C*)." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-91", "text": "A global linear model trained with the averaged perceptron (Collins, 2002) is used for this parser and an early-update (Collins and Roark, 2004) strategy is used during training."
}, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-92", "text": "In the greedy setting (beam=1), when the predicted action differs from the gold action, decoding stops and weights are updated accordingly." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-93", "text": "When a beam is used (beam=16), weights are updated when the gold parse configuration falls out of the beam." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-94", "text": "For Z&C*, the feature set of Zhang and Clark (2011), which comprises 64 feature templates, is used." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-95", "text": "For NNPar, the 34 feature templates described in section 3.2 are used." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-96", "text": "We employ an arc-standard style shift-reduce algorithm for CCG parsing, similar to Zhang and Clark (2011), for all our experiments." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-97", "text": "----------------------------------" }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-98", "text": "**DATA AND SETTINGS**" }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-99", "text": "We use the standard CCGbank training (sections 02\u221221), development (section 00) and testing (section 23) splits for our experiments." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-100", "text": "All the experiments are performed using automatic POS-tags and CCG supertags." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-101", "text": "We compare performance using two types of taggers: maximum entropy and neural network based taggers (NNT)." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-102", "text": "The C&C taggers (Clark and Curran, 2004) are used as the maximum entropy taggers." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-103", "text": "For neural network taggers, the SENNA tagger (version 3.0) is used for POS-tagging and the EasyCCG tagger (Lewis and Steedman, 2014a) is used for supertagging."
}, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-104", "text": "Both these taggers use a feed-forward neural network architecture with a single hidden layer similar to our NNPar architecture." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-105", "text": "In the case of POS-tags, we consider the first best tag given by the POS tagger." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-106", "text": "For CCG supertags, we use a multitagger which gives the n-best supertags for a word." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-107", "text": "Following Zhang and Clark (2011) and Xu et al. (2014), only during training, the gold CCG lexical category is added to the list of supertags for a word if it is not present in the list assigned by the multitagger." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-108", "text": "----------------------------------" }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-109", "text": "**GREEDY SETTING**" }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-110", "text": "In this section, we compare the performance of the perceptron (Z&C*) and neural network (NNPar) parsers in the greedy setting." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-111", "text": "Table 1 presents the unlabeled F-score (UF), labeled F-score (LF) and lexical category accuracy (Cat.) for Z&C* and NNPar on the CCGbank development data." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-112", "text": "NNPar outperformed Z&C* on all the metrics." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-113", "text": "There is an improvement of 2.14% in UF and 2.4% in LF, when both parsers used maximum-entropy (C&C) taggers." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-114", "text": "We also experimented with the revealing based incremental algorithm of Ambati et al. (2015)."
}, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-115", "text": "The neural network parser gave better results than the perceptron parser for the incremental algorithm as well." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-116", "text": "Using the incremental algorithm, our NNPar obtained UF and LF of 89.08% and 81.07%, which are 0.3% and 1.6% lower, respectively, than the results with the non-incremental algorithm." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-143", "text": "Structured NNPar gave improvements of 1.1% over NNPar and 0.6% over the structured perceptron model, Z&C*. Structured NNPar gets better category accuracy, but lower LF than Xu et al. (2014)." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-144", "text": "Note however that we use a much smaller beam size of 16 (similar to Z&C) as compared to theirs (128)." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-145", "text": "Increasing the beam size improved the accuracy but significantly reduced the parsing speed." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-146", "text": "Testing with a beam of size 128 gave a 0.2% improvement in labeled F-score but slowed the parser down by a factor of ten." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-147", "text": "----------------------------------" }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-148", "text": "**SPEED**" }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-149", "text": "Beam-search parsers are more accurate than greedy parsers but are much slower." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-150", "text": "With neural network models we can build parsers that give a good trade-off between speed and accuracy."
}, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-151", "text": "----------------------------------" }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-152", "text": "**CONCLUSION**" }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-153", "text": "We presented the first neural network based shift-reduce parsers for CCG, a greedy and a beam-search parser." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-154", "text": "On the standard CCGbank test data, we achieved a labeled F-score of 85.57% with our structured neural network parser, an improvement of 0.6% over the structured perceptron parser (Z&C*)." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-155", "text": "Our greedy parser gets UF and LF of 89.78% and 83.27% respectively, the best reported results for a greedy CCG parser, and is more than three times faster." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-156", "text": "In future work, we plan to explore more sophisticated tagging and parsing models like deep neural networks, recurrent neural networks (Dyer et al., 2015), and bi-directional LSTMs (Lewis et al., 2016) for shift-reduce CCG parsing." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-157", "text": "The parser code can be downloaded at https://github.com/bharatambati/tranccg." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-117", "text": "So, for the rest of the experiments we use the non-incremental parsing algorithm of Z&C*." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-118", "text": "Using neural network based taggers (NNT) didn't give any improvement for Z&C* in the greedy setting." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-119", "text": "The performance of the NNT is slightly lower than that of the C&C tagger, which could be the reason for this (Lewis and Steedman, 2014a)." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-120", "text": "But for NNPar, NNT improved the performance over C&C by 0.7%."
}, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-121", "text": "Lewis and Steedman (2014a) and Xu et al. (2015) showed improvements in the performance of C&C, a graph-based parser, by using neural network taggers." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-122", "text": "Our result with NNPar is in line with theirs and shows that neural network taggers can improve the performance of shift-reduce CCG parsers as well." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-123", "text": "We obtained final unlabeled and labeled F-scores of 90.09% and 83.33% respectively on the development data." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-124", "text": "To the best of our knowledge these are the best reported results for greedy shift-reduce CCG parsing." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-125", "text": "Table 1: Performance of greedy CCG parsers on CCGbank development data (Sec. 00)." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-126", "text": "----------------------------------" }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-127", "text": "**BEAM SEARCH**" }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-128", "text": "We next analyze the impact of beam-search on the various parsers." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-129", "text": "For Z&C* and Structured NNPar, we use a beam of size 16 both during training and testing; for NNPar, a beam (of 16) can be used only during testing." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-130", "text": "Zhang and Clark (2011) and Xu et al. (2014)." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-131", "text": "Using a beam improved the performance of both the perceptron and neural network parsers." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-132", "text": "Since NNPar uses a beam only during testing, there is only a slight improvement in the F-score."
}, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-133", "text": "Using a structured neural network gave a significant boost in performance." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-134", "text": "Structured NNPar is better than NNPar on all the metrics, which shows that Structured NNPar is a stronger model than NNPar." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-135", "text": "We obtained a final LF of 85.69% on the development data, which is 1.3% better than Z&C*, the structured perceptron counterpart, and 1.1% better than NNPar." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-136", "text": "This is the best published result on the development data for shift-reduce CCG parsing. Table 2: Impact of the beam on CCGbank development data (Sec. 00)." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-137", "text": "----------------------------------" }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-138", "text": "**FINAL TEST RESULTS**" }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-139", "text": "and Xu et al. (2014)." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-140", "text": "With the greedy setting, our NNPar outperformed Z&C* by around 2.5%, obtaining 89.78% and 83.27% UF and LF respectively." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-141", "text": "These are the best reported results for greedy shift-reduce CCG parsing." }, { "sent_id": "3e5c070a6966361b54f069248438ec-C001-142", "text": "In the case of the beam search parsers, we achieved final best scores of 91.95% in UF and 85.57% in LF with our Structured NNPar."
} ], "y": { "@BACK@": { "gold_contexts": [ [ "3e5c070a6966361b54f069248438ec-C001-11" ], [ "3e5c070a6966361b54f069248438ec-C001-36" ], [ "3e5c070a6966361b54f069248438ec-C001-38" ], [ "3e5c070a6966361b54f069248438ec-C001-74" ] ], "cite_sentences": [ "3e5c070a6966361b54f069248438ec-C001-11", "3e5c070a6966361b54f069248438ec-C001-36", "3e5c070a6966361b54f069248438ec-C001-38", "3e5c070a6966361b54f069248438ec-C001-74" ] }, "@EXT@": { "gold_contexts": [ [ "3e5c070a6966361b54f069248438ec-C001-16" ] ], "cite_sentences": [ "3e5c070a6966361b54f069248438ec-C001-16" ] }, "@SIM@": { "gold_contexts": [ [ "3e5c070a6966361b54f069248438ec-C001-42" ] ], "cite_sentences": [ "3e5c070a6966361b54f069248438ec-C001-42" ] }, "@USE@": { "gold_contexts": [ [ "3e5c070a6966361b54f069248438ec-C001-52" ], [ "3e5c070a6966361b54f069248438ec-C001-67" ], [ "3e5c070a6966361b54f069248438ec-C001-80" ] ], "cite_sentences": [ "3e5c070a6966361b54f069248438ec-C001-52", "3e5c070a6966361b54f069248438ec-C001-67", "3e5c070a6966361b54f069248438ec-C001-80" ] }, "@DIF@": { "gold_contexts": [ [ "3e5c070a6966361b54f069248438ec-C001-83" ] ], "cite_sentences": [ "3e5c070a6966361b54f069248438ec-C001-83" ] } } }, "ABC_98d8ea63896cc80f0989130e7cbbf1_27": { "x": [ { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-97", "text": "**TESTING OUR THEORY**" }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-24", "text": "**OUR FRAMEWORK**" }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-48", "text": "These results are consistent when validating against other datasets." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-2", "text": "This paper proposes a new unsupervised technique for clustering a collection of documents written by distinct individuals into authorial components." 
}, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-3", "text": "We highlight the importance of utilizing syntactic structure to cluster documents by author, and demonstrate experimental results that show the method we outline performs on par with state-of-the-art techniques." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-96", "text": "----------------------------------" }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-4", "text": "Additionally, we argue that this feature set outperforms previous methods in cases where authors consciously emulate each other's style or are otherwise rhetorically similar." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-5", "text": "----------------------------------" }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-7", "text": "Unsupervised authorial clustering is the process of partitioning n documents written by k distinct authors into k groups of documents segmented by authorship." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-8", "text": "Nothing is assumed about each document except that it was written by a single author." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-9", "text": "Koppel et al. (2011) formulated this problem in a paper focused on clustering five books from the Hebrew Bible. They also consider a 'multi-author document' version of the problem: decomposing sentences from a single composite document generated by merging randomly sampled chunks of text from k authors." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-10", "text": "Akiva and Koppel (2013) followed that work with an expanded method, and Aldebei et al. (2015) have since presented an improved technique in the 'multi-author document' context by exploiting posterior probabilities of a Naive-Bayesian Model." 
}, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-11", "text": "We consider only the case of clustering n documents written by k authors because we believe that, in most cases of authorial decomposition, there is some minimum size of text (a 'document') for which it can be reliably asserted that only a single author is present." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-12", "text": "Furthermore, this formulation precludes results dependent on a random document generation procedure." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-13", "text": "In this paper, we argue that the biblical clusterings done by Koppel et al. (2011) and by Aldebei et al. (2015) do not represent a grouping around true authorship within the Bible, but rather around common topics or shared style." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-14", "text": "We demonstrate a general technique that can accurately discern multiple authors contained within the Books of Ezekiel and Jeremiah." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-15", "text": "Prior work assumes that each prophetic book reflects a single source, and does not consider the consensus among modern biblical scholars that the books of Ezekiel and Jeremiah were written by multiple individuals." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-16", "text": "To cluster documents by true authorship, we propose that considering part-of-speech (POS) n-grams as features most distinctly identifies an individual writer." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-17", "text": "The use of syntactic structure in authorial research has been studied before." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-18", "text": "Baayen et al. (1996) introduced syntactic information measures for authorship attribution and Stamatatos (2009) argued that POS information could reflect a more reliable authorial fingerprint than lexical information."
}, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-19", "text": "Both Zheng et al. (2006) and Layton et al. (2013) propose that syntactic feature sets are reliable predictors for authorial attribution, and Tschuggnall and Specht (2014) demonstrate, with modest success, authorial decomposition using pq-grams extracted from sentences' syntax trees." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-20", "text": "We found that by combining the feature set of POS n-grams with a clustering approach similar to the one presented by Akiva (2013), our method of decomposition attains higher accuracy than Tschuggnall's method, which also considers grammatical style." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-21", "text": "Additionally, in cases where authors are rhetorically similar, our framework outperforms techniques outlined by Akiva (2013) and Aldebei (2015), which both rely on word occurrences as features." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-22", "text": "This paper is organized as follows: section 2 outlines our proposed framework, section 3 clarifies our method in detail through an example, section 4 contains results, section 5 tests an explanation of our results, and section 6 concludes our findings and discusses future work." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-23", "text": "----------------------------------" }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-25", "text": "Given n documents written by k distinct authors, where it is assumed that each document is written entirely by one of the k authors, our method proceeds in the following way:" }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-26", "text": "First, represent each document as a frequency vector reflecting all n-grams occurring in the 'POS-translated' document." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-27", "text": "Second, cluster documents into k groups using an unsupervised clustering algorithm."
}, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-28", "text": "Third, determine 'core elements', documents that most strongly represent authorship attributes of their respective clusters." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-29", "text": "Fourth, use 'core elements' to train a supervised classifier in order to improve accuracies of documents that were not central to any cluster." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-30", "text": "A key improvement our framework presents over prior techniques is in step one, where we represent documents in terms of POS n-grams." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-31", "text": "Specifically, each document, x_i, is transformed into a 'POS-translated' version, x_i', such that every word or punctuation symbol from the original document is replaced with its respective POS or punctuation token in the translated version." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-32", "text": "Consider the following sentences from a New York Times (NYT) column written by Paul Krugman: \"Last week the Federal Reserve chose not to raise interest rates." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-33", "text": "It was the right decision.\" In the 'POS-translated' version these sentences appear as \"JJ NN DT NNP NNP NN RB TO VB NN NNS PERIOD PRP VBD DT JJ NN PERIOD\"." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-34", "text": "We use a POS tagger from the Natural Language Toolkit to translate English documents (Bird et al., 2009) and use hand annotations for the Hebrew Bible." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-35", "text": "Our framework will work with any text for which POS-annotations are obtainable." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-36", "text": "The requirement that k is a fixed parameter is a limitation of the set of unsupervised clustering algorithms available in step two."
}, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-37", "text": "----------------------------------" }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-38", "text": "**CLARIFYING DETAILS WITH NYT COLUMNS**" }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-39", "text": "We shall describe a clustering of New York Times columns to clarify our framework." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-40", "text": "The NYT corpus is used both because the author of each document is known with certainty and because it is a canonical dataset that has served as a benchmark for both Akiva and Koppel (2013) and Aldebei et al. (2015)." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-41", "text": "The corpus comprises texts from four columnists: Gail Collins (274 documents), Maureen Dowd (298 documents), Thomas Friedman (279 documents), and Paul Krugman (331 documents)." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-42", "text": "Each document is approximately the same length and the columnists discuss a variety of topics." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-43", "text": "Here we consider the binary (k = 2) case of clustering the set of 629 Dowd and Krugman documents into two groups." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-44", "text": "In step one, the documents are converted into their 'POS-translated' form as previously outlined." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-45", "text": "Each document is represented as a frequency vector that reflects all 3, 4, and 5-grams that appear in the 'POS-translated' corpus." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-46", "text": "This range of n-grams was determined through validation of different values for n across several datasets." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-47", "text": "Results of this validation for the two-way split over NYT columnists are displayed in Table 1." 
}, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-49", "text": "Using 3, 4, and 5-grams, the resulting design matrix has dimension 629 by 302,395." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-50", "text": "We re-weight every element in the design matrix according to its term frequency-inverse document frequency." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-51", "text": "In step two, we apply spectral clustering to the design matrix to partition the documents into two clusters." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-52", "text": "This is implemented with the Shi and Malik (2000) algorithm, which solves a convex relaxation of the normalized cuts problem on the affinity graph (Pedregosa et al., 2011)." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-53", "text": "Edge weights of the affinity graph are computed using a linear kernel." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-54", "text": "In the case of clustering several (k > 2) authors, we apply the Yu and Shi (2003) algorithm to perform multiclass spectral clustering." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-55", "text": "In step three, we calculate the centroid of each cluster produced by step two." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-56", "text": "For each document x_i, we determine \u03b8_i, the angle between that document and the centroid of its cluster, and call a document a 'core element' if \u03b8_i is within 2 standard deviations of the average of \u03b8 in x_i's cluster." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-57", "text": "In step four, 'core elements' are used to train a 500-tree random forest where at each split the standard heuristic of \u221ap features is considered (here p = 302,395)." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-58", "text": "Finally, we reclassify all 629 documents according to this random forest to produce our final class labels, summarized in Table 2." 
}, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-59", "text": "The final accuracy of the Dowd-Krugman clustering, measured as an F1-score, is 98.8%." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-60", "text": "----------------------------------" }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-61", "text": "**RESULTS**" }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-62", "text": "All accuracy scores given in the rest of this paper are calculated using the F1-score." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-63", "text": "Because our technique contains stochastic elements, results reflect an average of 20 runs." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-64", "text": "----------------------------------" }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-65", "text": "**NYT COLUMNS**" }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-66", "text": "When clustering over all six binary-pairs of NYT columnists, our framework achieves an average accuracy of 94.5%, ranging from 90.0% to 98.8%." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-67", "text": "Aldebei et al. (2015) address the slightly different problem of decomposing artificially merged NYT documents; acknowledging the distinction between the two problems, our results are comparable to their accuracies, which range from 93.3% to 96.1%." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-68", "text": "----------------------------------" }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-69", "text": "**SANDITON: AN UNCOMPLETED NOVEL**" }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-70", "text": "Another canonical authorship test is that of the novel Sanditon, a text left incomplete at the death of its author, Jane Austen, and finished some years later by an anonymous person known as \"Another Lady.\" She closely emulated Austen's style and added 19 chapters to Austen's original 11." 
}, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-71", "text": "Researchers have long examined this text, and most recently Moon et al. (2006) analyzed Sanditon using supervised techniques in the context of authorship attribution." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-72", "text": "Much progress has been made in the field since then, but examining Sanditon has fallen out of style." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-73", "text": "Our framework clusters Austen's chapters from Another Lady's with 93.8% accuracy, only mislabeling two documents." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-74", "text": "----------------------------------" }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-75", "text": "**OBAMA-MCCAIN & EZEKIEL-JEREMIAH**" }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-76", "text": "In order to confirm our framework is accurate over a variety of documents, we considered campaign speeches from the 2008 presidential election." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-77", "text": "Collecting 27 speeches from President Obama and 20 from Senator McCain, we expected our technique to excel in this context." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-78", "text": "We found instead that our method performed exceptionally poorly, clustering these speeches with only 74.2% accuracy." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-79", "text": "Indeed, we were further surprised to discover that by adjusting our framework to be similar to that presented in Akiva and Koppel (2013) and Aldebei et al. (2015), replacing POS n-grams with ordinary word occurrences in step one, our framework performed very well, clustering at 95.3%." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-80", "text": "Similarly, our framework performed poorly on the Books of Ezekiel and Jeremiah from the Hebrew Bible." 
}, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-81", "text": "Using the English-translated King James Version, and considering each chapter as an individual document, our framework clusters the 48 chapters of Ezekiel and the 52 chapters of Jeremiah at 54.7%." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-82", "text": "Aldebei et al. (2015) report 98.0% on this dataset, and when considering the original English text instead of the POS-translated text, our framework achieves 99.0%." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-83", "text": "The simultaneous success of word features and failure of POS features on these two datasets seemed to completely contradict our previous results." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-84", "text": "We propose two explanations." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-85", "text": "First, perhaps too much syntactic structure is lost during translation." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-86", "text": "This could certainly be a factor, but does not explain the Obama-McCain results." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-87", "text": "The second explanation comes from the wide consensus among biblical scholars that there was no single 'Ezekiel' or 'Jeremiah' entirely responsible for each book." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-88", "text": "Instead, the books are composites from a number of authors, sometimes written over the span of hundreds of years (McKane, 1986; Zimmerli, 1979; Mowinckel, 1914)." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-89", "text": "Koppel et al. (2011) acknowledge this shortcoming in their original paper, and suggest that in this authorial interpretation their clustering is one of style, not authorship." 
}, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-90", "text": "We hypothesize that in both failed cases, accuracy is low because our assumption that only two authors were represented among the documents is incorrect." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-91", "text": "This theory holds for the Obama-McCain dataset, because Obama had up to three primary speechwriters during the '08 election and McCain likely had a similar number (Parker, 2008)." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-92", "text": "Perhaps emulating syntactic patterns is more difficult than emulating word choice." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-93", "text": "If so, using word features, a model can discern Obama's rhetoric from that of McCain." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-94", "text": "However, since the syntax of more than two individuals is present in the text, POS features cannot accurately cluster the documents into two groups." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-95", "text": "Our goal is for POS features to cluster more accurately than word features when the true authorship of the documents is correctly considered." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-98", "text": "We first attempt to cluster the Ezekiel and Jeremiah texts in the original Hebrew in order to test whether too much syntactic structure is lost during translation." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-99", "text": "For the Hebrew text, we use hand-tagged POS information because a reliable automatic tagger was not available (van Peursen et al., 2015; Roorda, 2015)." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-100", "text": "Clustering Ezekiel and Jeremiah using Hebrew POS features obtains 62.5% accuracy." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-101", "text": "This is an improvement over the English text, but still performs far worse than lexical feature sets." 
}, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-102", "text": "We next attempt to cluster the Ezekiel and Jeremiah texts according to the authorial strata within each book that are widely agreed upon by biblical scholars, in order to test whether incorrect authorial assumptions were causing the decrease in accuracy." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-103", "text": "Unfortunately, there is no public breakdown of Obama and McCain speeches by speechwriter, so testing our hypothesis is limited here to the biblical dataset." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-104", "text": "We therefore cluster the Book of Ezekiel assuming there are two nested authors, which according to modern scholarship are Ezekiel 1 (chapters 1-39) and Ezekiel 2 (chapters 40-48) (Zimmerli, 1979)." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-105", "text": "As summarized in Table 3, according to this division our framework clusters the Ezekiel chapters with 93.6% accuracy, mislabeling only three documents." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-106", "text": "We also consider the Book of Jeremiah, which is composed of two primary authors with four secondary authors." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-107", "text": "In clustering a corpus containing Jeremiah 1 (23 noncontiguous chapters) and Jeremiah 2 (14 noncontiguous chapters) (McKane, 1986), our framework divides the 37 chapters into two groups with 94.5% accuracy, mislabeling only two documents." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-108", "text": "These results are summarized in Table 4." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-109", "text": "When considering the 4-way split between Ezekiel 1, Ezekiel 2, Jeremiah 1 and Jeremiah 2, our method achieves 87.5% accuracy as summarized in Table 5." 
}, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-110", "text": "When comparing these results with those obtained by looking at word frequencies in the original Hebrew texts partitioned into the four correct authors, we find that our approach performs significantly better." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-111", "text": "With word frequencies as features, our framework clusters Ezekiel 1 from Ezekiel 2 with only 76.3% accuracy, Jeremiah 1 from Jeremiah 2 with only 74.9% accuracy, and crucially, clusters the four-way between both Ezekiels and both Jeremiahs with only 47.9% accuracy." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-112", "text": "While lexical features outperform syntactic features when considering incorrect authorship, syntactic features substantially outperform lexical features when considering the true authorial divisions of Ezekiel and Jeremiah." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-113", "text": "----------------------------------" }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-114", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-115", "text": "We have demonstrated a new framework for authorial clustering that not only clusters canonical datasets with state-of-the-art accuracy, but also discerns nested authorship within the Hebrew Bible more accurately than prior work." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-116", "text": "While we believe it is possible for an author to emulate another author's word choice, it is much more difficult to emulate unconscious syntactic structure." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-117", "text": "These syntactic patterns, rather than lexical frequencies, may therefore be key to understanding authorial fingerprints." 
}, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-118", "text": "Finding testing data for this problem is difficult, since documents for which authorship is misconstrued or obfuscated but for which true authorship is known with certainty are rare." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-119", "text": "However, when clustering texts for which authorship is not known, one would wish to have a framework which most accurately discerns authorship, rather than rhetorical similarity." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-120", "text": "We believe that our framework, and syntactic feature sets in particular, clusters documents based on authorship more accurately than prior work." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-121", "text": "While we have shown that POS feature sets can succeed independently, future work should examine combining syntactic and lexical feature sets in order to utilize the benefits of each." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-122", "text": "Finally, authorial clustering performs poorly when the true and expected numbers of authors within a corpus do not match." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-123", "text": "An important next step is to automatically identify the number of authors contained within a set of documents." }, { "sent_id": "98d8ea63896cc80f0989130e7cbbf1-C001-124", "text": "We believe that a more reliable method of generating 'core elements' is essential, and should not be reliant on a predetermined number of authors." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "98d8ea63896cc80f0989130e7cbbf1-C001-10" ] ], "cite_sentences": [ "98d8ea63896cc80f0989130e7cbbf1-C001-10" ] }, "@DIF@": { "gold_contexts": [ [ "98d8ea63896cc80f0989130e7cbbf1-C001-13" ], [ "98d8ea63896cc80f0989130e7cbbf1-C001-21" ] ], "cite_sentences": [ "98d8ea63896cc80f0989130e7cbbf1-C001-13", "98d8ea63896cc80f0989130e7cbbf1-C001-21" ] }, "@SIM@": { "gold_contexts": [ [ "98d8ea63896cc80f0989130e7cbbf1-C001-40" ], [ "98d8ea63896cc80f0989130e7cbbf1-C001-79" ] ], "cite_sentences": [ "98d8ea63896cc80f0989130e7cbbf1-C001-40", "98d8ea63896cc80f0989130e7cbbf1-C001-79" ] } } }, "ABC_684349c06be7ff51c0944b25be10ce_27": { "x": [ { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-2", "text": "Characters are fundamental to literary analysis." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-3", "text": "Current approaches are heavily reliant on NER to identify characters, causing many to be overlooked." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-4", "text": "We propose a novel technique for character detection, achieving significant improvements over state of the art on multiple datasets." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-5", "text": "----------------------------------" }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-7", "text": "How many literary characters appear in a novel?" }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-8", "text": "Despite the seeming simplicity of the question, precisely identifying which characters appear in a story remains an open question in literary and narrative analysis." 
}, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-9", "text": "Characters form the core of many computational analyses, from inferring prototypical character types (Bamman et al., 2014) to identifying the structure of social networks in literature (Elson et al., 2010; Lee and Yeung, 2012; Agarwal et al., 2013; Ardanuy and Sporleder, 2014; Jayannavar et al., 2015) ." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-10", "text": "These current approaches have largely assumed that characters can be reliably identified in text using standard techniques such as Named Entity Recognition (NER) and that the variations in how a character is named can be found through coreference resolution." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-11", "text": "However, such treatment of character identity often overlooks minor characters that serve to enrich the social structure and serve as foils for the identities of major characters (Eder et al., 2010) ." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-12", "text": "This work provides a comprehensive examination of literary character detection, with three key contributions." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-13", "text": "First, we formalize the task with evaluation criteria and offer two datasets, including a complete, manually-annotated list of all characters in 58 literary works." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-14", "text": "Second, we propose a new technique for character detection based on inducing character prototypes, and in comparisons with three state-of-the-art methods, demonstrate superior performance, achieving significant improvements in F1 over the next-best method." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-15", "text": "Third, as practical applications, we analyze literary trends in character density over 20 decades and revisit the character-based literary hypothesis tested by Elson et al. (2010) ." 
}, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-16", "text": "----------------------------------" }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-17", "text": "**RELATED WORK**" }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-18", "text": "Character detection has primarily been performed in the context of mining literary social networks." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-19", "text": "Elson et al. (2010) extract character mentions from conversational segments, using the Stanford CoreNLP NER system to discover character names (Manning et al., 2014) ." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-20", "text": "To account for variability in character naming, alternate forms of a name are generated using the method of Davis et al. (2003) and merged together as a single character." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-21", "text": "Furthermore, the set of aliases for a character is expanded by creating coreference chains originating from these proper names and merging all coreferent expressions." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-22", "text": "Agarwal et al. (2013) also rely on the CoreNLP NER and coreference resolution systems for character detection; however for literary analysis, they use gold character mentions that have been marked and resolved by a team of trained annotators, highlighting the difficulty of the task." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-23", "text": "He et al. (2013) propose an alternate approach for identifying speaker references in novels, using a probabilistic model to identify which character is speaking." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-24", "text": "However, to account for the multiple aliases used to refer to a character, the authors first manually constructed a list of characters and their aliases, which is the task proposed in this work and underscores the need for automated methods." 
}, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-25", "text": "Two approaches mined social interaction networks without relying on dialogue, unlike the methods of Elson et al. (2010) and He et al. (2013)." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-26", "text": "Lee and Yeung (2012) build social networks by recognizing characters from explicit markers (e.g., kinship) and implicit markers (e.g., physical collocation)." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-27", "text": "Similarly, Agarwal and Rambow (2010) build character networks using tree kernels on parse trees to identify interacting agents." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-28", "text": "In the two most-related works, Bamman et al. (2014) and Ardanuy and Sporleder (2014), character names are extracted and clustered under a set of constraints." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-29", "text": "In the BookNLP system developed by Bamman et al. (2014), NER-identified names are retained and merged based on animacy, determined through dependencies with \"sentient\" lemmas from a small dictionary (including, for example, say and smile), and gender, assigned through pronominal resolution and a dictionary of gender-specific honorifics." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-30", "text": "Ardanuy and Sporleder (2014) similarly use NER to identify character name mentions." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-31", "text": "These names are grouped through the application of a series of deterministic rules, beginning with recognizing gender constraints, where gender assignments are based on gender-specific honorifics and names." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-32", "text": "If a gender cannot be assigned, then one is derived from the majority count of gender-specific pronouns (e.g. he, herself) appearing in the immediate context of the name mentions." 
}, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-33", "text": "The extracted names are then clustered, while respecting the gender impositions, based on a sieve of name variant heuristics." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-34", "text": "In the final step, any remaining ambiguous referents, i.e., those that can be matched to multiple characters, are assigned to the more prominent character in the story." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-35", "text": "The authors achieve F1-scores > 0.9 for extracting the 10 most relevant characters in a small collection of novels, but the performance on all characters is unknown." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-36", "text": "----------------------------------" }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-37", "text": "**DETECTING CHARACTERS**" }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-38", "text": "We propose an eight-stage pipeline for detecting characters, which builds a graph where nodes are names and edges connect names belonging to the same character." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-39", "text": "The vertices in the graph are initially populated by running NER over the corpus and also incorporating names following an honorific." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-40", "text": "Second, coreference resolution is run to identify names that occur together in a coreference chain, and edges are added where two nodes' names co-occur in a chain." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-41", "text": "Stanford CoreNLP is used for both NER and coreference." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-42", "text": "Third, we apply a series of name variation rules to link names potentially referring to the same character (e.g., by removing an honorific)." 
}, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-43", "text": "Fourth, a gazetteer of 1,859 hypocorisms for 560 names is used to link variations (e.g., Tim and Timmy)." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-44", "text": "Stages 2-4 potentially introduce edges connecting names of different characters." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-45", "text": "Therefore, in the fifth stage, three heuristics are applied to add prohibitions on merging two names into the same character." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-46", "text": "Two vertices cannot be merged if (1) the inferred genders of both names differ, (2) both names share a common surname but different first names, or (3) the honorifics of both names differ, e.g., \"Miss\" and \"Mrs.\" Similarly, the sixth stage inserts prohibitions by extracting pairs of names from the novel where (1) both names appear connected by a conjunction, (2) one name appears as the speaker mentioning the other name in direct speech, and (3) both names appear together in a quotation." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-47", "text": "Together, Stages 1-6 are applied by first linking all nodes by edges, then identifying pairs prohibited from being connected and removing the edges along the shortest path between those two nodes, effectively creating two new disconnected components in the name graph." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-48", "text": "Figure 1 illustrates this transformation on a simple character graph." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-49", "text": "Next, the seventh stage attempts to identify characters whose names may not be recognized by NER." 
}, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-50", "text": "For example, many minor characters do not appear as named entities and instead have general role-based referents such as \"the governor\" or \"the archbishop.\" However, despite the lack of proper names, such characters behave and interact in similar ways as major characters, including having dialogue." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-51", "text": "Therefore, to discover such characters, we adopt a bootstrapping technique aimed at uncovering prototypical character behaviors from the novels themselves, inspired by the semantic predicate work of Flati and Navigli (2013)." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-52", "text": "The Project Gutenberg fiction corpus was dependency parsed to identify all verbs in a dependency relation with nouns, where each noun was categorized as (a) a named entity, (b) having its first sense in WordNet refer to an animate entity (Fellbaum, 1998), or (c) neither of the above." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-53", "text": "All verbs associated with these were then ranked according to their ratio of types (a) and (c) to identify verbs strongly associated with character-like behaviors, which avoids including the behavior of nouns in (b) which may refer to minor characters." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-54", "text": "Ultimately, 2,073 verb-and-dependency pairs with a ratio of at least 0.25 were retained as predicates selecting for character-like entities, after limited experimental testing showed this threshold extracted sensible verbs such as \"rejoice,\" \"accost,\" and \"frown.\" Using this set of predicates, nouns appearing with the verb in the appropriate dependency relation are added as characters." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-55", "text": "We prohibit adding names contained in a small stop list of 22 generic nouns (e.g., \"man\")." 
}, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-56", "text": "Finally, the eighth and last stage removes nodes that are disconnected from the rest of the graph and represent a name that is a portion of one or more names for other nodes." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-57", "text": "These nodes are typically ambiguous first or last names." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-58", "text": "Thus, the remaining set of nodes are merged to create sets of names, each associated with a different character." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-59", "text": "----------------------------------" }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-60", "text": "**EXPERIMENTS**" }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-61", "text": "Given a literary novel, our objective is to produce a list of characters, where each character may be associated with one or more names." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-62", "text": "Datasets Two datasets are used." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-63", "text": "The first is a manually-annotated collection of 58 works with complete lists of all characters and their possible referents in texts." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-64", "text": "Of these works, 56 were generated as a part of an ongoing longitudinal study of author style for all Sherlock Holmes stories written by Sir Arthur Conan Doyle." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-65", "text": "The remaining two works are the full-length novels Pride and Prejudice by Jane Austen and The Moonstone by Wilkie Collins." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-66", "text": "Characters and their aliases were manually coded by expert annotators, with multiple passes to ensure completeness." 
}, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-67", "text": "The Moonstone was treated as truly held-out test data and results were only generated once prior to submission." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-68", "text": "The second dataset consists of 30 novels listed on Sparknotes (sparknotes.com) and their corresponding lists of characters, with supplemental naming variations of these characters provided by our annotators." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-69", "text": "These character lists often contain only the major characters in a novel; for example, their list for Pride and Prejudice contains only 17 characters, whereas our manually-annotated list identifies 73 characters." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-70", "text": "Nevertheless, the Sparknotes data serves as a baseline of those characters any method should be able to detect." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-71", "text": "----------------------------------" }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-72", "text": "**EVALUATION**" }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-73", "text": "Character recognition systems produce a list of sets, each containing the names associated with one character, denoted E = {E_1, ..., E_n}, where E_i is a set of names for a character." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-74", "text": "These lists are evaluated against a gold standard list, denoted G, containing all naming variations for each character." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-75", "text": "To evaluate, we formalize the problem as finding a maximum bipartite matching where the sets of names in E and those in G constitute the two node types." 
}, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-76", "text": "For precision, matching is measured in the purity of an extracted set of names, E i , with respect to the gold-standard names, G j : 1" }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-77", "text": "; simply, a match is maximal when the set of extracted names is a subset of the gold standard names, with penalties for including wrong names." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-78", "text": "Recall uses a looser definition of matching with the aim of measuring whether a character G j was found at all; matching is measured as a binary function that is 1 if E i \\ G j 6 = ; and 0 otherwise." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-79", "text": "----------------------------------" }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-80", "text": "**COMPARISON SYSTEMS**" }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-81", "text": "The task of character recognition has largely been subsumed into the task of extracting the social network of novels." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-82", "text": "Therefore, three state-of-the-art systems for social network extraction were selected: the method described in Elson et al. (2010) , BookNLP (Bamman et al., 2014) , and the method described in Ardanuy and Sporleder (2014) ." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-83", "text": "For each method, we follow their procedures for identifying the characters in the social network, which produces sets of one or more aliases associated with each identified character." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-84", "text": "As a baseline, we use the output of Stanford NER, where every name is considered a separate character; this baseline represents the upper-bound in recall from any system using only NER to identify character names." 
}, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-85", "text": "Table 1 shows the results for the manually-annotated and SparkNotes corpora." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-86", "text": "The Sherlock Holmes corpus presents a notable challenge due to the presence of many minor characters, which are not detected by NER." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-87", "text": "An error analysis for our approach revealed that while many characters were extracted, the coreference resolution did not link a characters' different referents together and hence, each name was reported as a separate character, which caused a drop in performance." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-88", "text": "Nevertheless, our system provided the highest performance for character recognition." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-89", "text": "ferent set of challenges due to multiple characters sharing the same last name or the same first name." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-90", "text": "Here, coreference resolution frequently creates incorrect links between the similar names of different characters, creating a drop in precision for most systems." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-91", "text": "Our precision value particularly benefited from the heuristics for distinguishing characters by gender and stringent name-merging constraints." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-92", "text": "BookNLP and the approach of Ardanuy and Sporleder (2014) performed quite similarly in identifying characters, which is expected given the overlap in rules applied by both systems." 
}, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-93", "text": "----------------------------------" }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-94", "text": "**EXPERIMENT 1: ACCURACY**" }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-95", "text": "----------------------------------" }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-96", "text": "**THE PRIDE AND PREJUDICE NOVEL PRESENTS A DIF-**" }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-97", "text": "Moonstone contains a unique novel structure with multiple first-person narrators, group-based characters (e.g., \"the jugglers\") that present a challenge to co-reference systems, and 419 different names for the 78 unique characters." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-98", "text": "An error analysis of our system revealed that majority of mistakes were due to the multiple names for a character not being merged into a single identity." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-99", "text": "Nevertheless, our system performs best of those tested." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-100", "text": "For the SparkNotes data, the NER baseline achieves the highest recall, indicating that many of the major character names listed in SparkNotes' data can be directly found by NER." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-101", "text": "Nevertheless, in reality, the baseline's performance is offset by its significantly lower precision, as shown in its performance on the other novels; indeed the baseline grossly overestimates the number of characters for the SparkNotes novels, reporting 339 characters per novel on average." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-102", "text": "Table 2 shows our system's performance without stage 7, which involved the extraction of minor characters." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-103", "text": "Stage 7 overall improves recall with a slight hindrance to precision." 
}, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-104", "text": "For the Sherlock Holmes corpus, stage 7 is slightly detrimental to overall performance, which as we stipulated earlier is caused by missing co-referent links." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-105", "text": "Finally, returning to the initially-posed question of how many characters are present, we find that despite the detection error in our method, the overall predicted number of characters is quite close to the actual: for Sherlock Holmes stories, the number of characters was estimated within 2.4 on average, for Pride and Prejudice our method predicted 72 compared with 73 actual characters, and for The Moonstone our method predicted 87 compared with 78." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-106", "text": "Thus, we argue that our procedure can provide a reasonable estimate for the total number of characters." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-107", "text": "(For comparison, BookNLP, the next best system, extracted 69 and 72 characters for Pride and Prejudice and The Moonstone, respectively, and within 1.2, on average, on the Sherlock Holmes set.) Experiment 2: Literary Theories Elson et al. (2010) analyze 60 novels to computationally test literary theories for novels in urban and rural settings (Williams, 1975; Moretti, 1999) ." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-108", "text": "Recently, Jayannavar et al. (2015) challenged this analysis, showing their improved method for social network extraction did not support the same conclusions." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-109", "text": "While our work focuses only on character detection, we are nevertheless able to test the related hypothesis of whether the number of characters in novels with urban settings is more than those in rural." 
}, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-110", "text": "Character detection was run on the same novels from Elson et al. (2010) and we found no statistically-significant difference in the mean number of characters in urban and rural settings, even when accounting for text size." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-111", "text": "Thus, our work raises questions about how these character interact and whether the setting influences the structure of the social network, despite similar numbers of characters." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-112", "text": "Experiment 3: Historical Trends As a second application of our technique, we examine historical trends in how many characters appear in a novel." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-113", "text": "All fiction novels listed on Project Gutenberg were compiled and publication dates were automatically extracted for 1066 and manually entered for an additional 637." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-114", "text": "This set was combined with a corpus of 6333 novels, including works such as To The Lighthouse by Virginia Woolf, not available on Project Gutenberg." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-115", "text": "Books were then partitioned into the decade in which they were au- 1 8 4 0 -1 8 4 9 1 8 5 0 -1 8 5 9 1 8 6 0 -1 8 6 9 1 8 7 0 -1 8 7 9 1 8 8 0 -1 8 8 9 1 8 9 0 -1 8 9 9 1 9 0 0 -1 9 0 9 1 9 1 0 -1 9 1 9 1 9 2 0 -1 9 2 9 1 9 3 0 -1 9 3 9 1 9 4 0 -1 9 4 9 1 9 5 0 -1 9 5 9 1 9 6 0 -1 9 6 9 1 9 7 0 -1 9 7 9 1 9 8 0 -1 9 8 9 1 9 9 0 -1 9 9 9" }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-116", "text": "Ratio of number of characters to book tokens Figure 2 : Distributions of the size-normalized number of characters per novel per decade." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-117", "text": "thored." 
}, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-118", "text": "We limit our focus to trends starting in 1800 to 1990, when at least 11 books are available for each decade." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-119", "text": "To account for variability in novel length, we normalize the novel's number of characters by its number of tokens." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-120", "text": "Figure 2 shows the box-andwhisker plot of the normalized number of characters per novel, where the box denotes the first and third quartile and the bar denotes the median." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-121", "text": "Surprisingly, we did not observe any significant change in the relative number of characters per novel, despite the underlying socio-economic changes that accompanied this time period." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-122", "text": "While novels written before 1850 had slightly more characters on average, this effect may be due to the smaller number of works available from this period." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-123", "text": "However, our finding raises many questions about whether the social networks for these characters obey similar trends in their size and density." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-124", "text": "----------------------------------" }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-125", "text": "**CONCLUSION**" }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-126", "text": "Although a fundamental task to character analysis, identifying the number of characters in a literary novel presents a significant challenge to current state of the art." 
}, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-127", "text": "To lay the foundation towards solving the task, we provide three contributions: (1) an annotated corpus of 58 books, (2) an evaluation framework for measuring performance on the task, (3) a new state-of-the-art method for character extraction." }, { "sent_id": "684349c06be7ff51c0944b25be10ce-C001-128", "text": "Furthermore, to promote future work we make all software and data available upon request." } ], "y": { "@BACK@": { "gold_contexts": [ [ "684349c06be7ff51c0944b25be10ce-C001-9" ], [ "684349c06be7ff51c0944b25be10ce-C001-25" ] ], "cite_sentences": [ "684349c06be7ff51c0944b25be10ce-C001-9", "684349c06be7ff51c0944b25be10ce-C001-25" ] }, "@USE@": { "gold_contexts": [ [ "684349c06be7ff51c0944b25be10ce-C001-15" ], [ "684349c06be7ff51c0944b25be10ce-C001-82" ], [ "684349c06be7ff51c0944b25be10ce-C001-107" ], [ "684349c06be7ff51c0944b25be10ce-C001-110" ] ], "cite_sentences": [ "684349c06be7ff51c0944b25be10ce-C001-15", "684349c06be7ff51c0944b25be10ce-C001-82", "684349c06be7ff51c0944b25be10ce-C001-107", "684349c06be7ff51c0944b25be10ce-C001-110" ] } } }, "ABC_2e967f8560ffdb216135ae387776eb_27": { "x": [ { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-67", "text": "**TAGESSCHAU/LOGO**" }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-2", "text": "We analyze two novel data sets of German educational media texts targeting adults and children." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-3", "text": "The analysis is based on 400 automatically extracted measures of linguistic complexity from a wide range of linguistic domains." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-4", "text": "We show that both data sets exhibit broad linguistic adaptation to the target audience, which generalizes across both data sets." 
}, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-5", "text": "Our most successful binary classification model for German readability robustly shows high accuracy between 89.4%-98.9% for both data sets." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-6", "text": "To our knowledge, this comprehensive German readability model is the first for which robust cross-corpus performance has been shown." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-7", "text": "The research also contributes resources for German readability assessment that are externally validated as successful for different target audiences: we compiled a new corpus of German news broadcast subtitles, the Tagesschau/Logo corpus, and crawled a GEO/GEOlino corpus substantially enlarging the data compiled by Hancke et al. (2012" }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-8", "text": "----------------------------------" }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-10", "text": "Readability assessment refers to the task of (automatically) linking a text to the appropriate target audience based on its complexity." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-11", "text": "A diverse spectrum of potential application domains has been identified for this task in the literature, ranging from the design and evaluation of education materials, to information retrieval, and text simplification." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-12", "text": "Given the increasing need for learning material adapted to different audiences and the barrier-free access to information required for political and social participation, automatic readability assessment is of immediate social relevance." 
}, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-13", "text": "Accordingly, it has attracted considerable research interest over the last decades, particularly for the assessment of English (Crossley et al., 2011; Chen and Meurers, 2017; Feng et al., 2010) ." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-14", "text": "For German readability assessment, however, little progress has been made in recent years, despite a series of promising results published around the turn of the decade (Vor der Br\u00fcck et al., 2008; Hancke et al., 2012) ." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-15", "text": "In particular, German readability research has suffered from the lack of a shared reference corpus and sufficiently comparable corpora for cross-corpus testing of readability models: While for English research, the Common Core corpus consisting of examples from the English Language Arts Standards of the Common Core State Standards, and the WeeklyReader corpus of online news articles have been widely used in studies on English readability and text simplification (Vajjala and Meurers, 2014; Petersen and Ostendorf, 2009; Feng et al., 2010) , there are no comparable resources for German." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-16", "text": "This is particularly problematic, as over-fitting is a potential issue for classification algorithms, especially given the limited size of the typical data sets." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-17", "text": "To address these issues, we first present two new data sets for German readability assessment in Section 3: a set of German news broadcast subtitles based on the primary German TV news outlet Tagesschau and the children's counterpart Logo!, and a GEO/GEOlino corpus crawled from the educational GEO magazine's web site, a source first identified by Hancke et al. (2012) , but double in size." 
}, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-18", "text": "1 The longstanding success of these outlets with their target audiences provides some external validity to the nature of the implicit linguistic adaptation of the language used." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-19", "text": "As showed for German secondary school textbooks, this is not necessarily the case across all linguistic dimensions and adjustments may even be limited to only the surface level of text, sentence, and word length." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-20", "text": "We conducted a series of analyses on these two data sets to accomplish the following objectives:" }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-21", "text": "1. Investigate how instances of German educational news language differ in terms of language complexity across adult and child target audiences." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-22", "text": "2. Build a binary readability model for German educational language targeting adults and children that shows high, robust classification performance across corpora." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-23", "text": "For the purposes of our studies, we operationalize child target audience of German educational news language as children aged between 8 and 14." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-24", "text": "This is the typical audience age range of the child-targeting news media we analyzed." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-25", "text": "2 Adult target audience then is defined as over 14 years of age." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-26", "text": "To address our first research question, after introducing a broad set of complexity measures in Section 4, we compare their informativeness for distinguishing adult and child level in the two data sets in Section 5." 
}, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-27", "text": "In Section 6, we define a series of readability models for German, including one showing high classification accuracy between 89.4% and 98.9% on both data sets." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-28", "text": "The paper closes with a discussion of the implications of our results for the current research discussion and an outlook on future work." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-29", "text": "----------------------------------" }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-30", "text": "**RELATED WORK**" }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-31", "text": "For over a century, text readability has been assessed using surface measure-based readability formula such as the Flesch-Kincaid formula (Kincaid et al., 1975) or the Dale-Chall readability formula (Chall and Dale, 1995) , see for an overview DuBay (2004) ." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-32", "text": "While these formula are still used in some nonlinguistic studies (Woodmansey, 2010; Grootens-Wiegers et al., 2015; Esfahani et al., 2016) , a decade ago research shifted towards using more elaborate statistical modeling approaches based on larger sets of linguistically more informed features." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-33", "text": "Automatic readability assessment has benefited from the use of Natural Language Processing tools for the assessment of syntactic, lexical, and discourse measures and from adapting complexity measures employed in Second Language Acquisition research Feng et al., 2010) ." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-34", "text": "There has also been extensive research on the relevance of cohesion and discourse measures for readability assessment that have successfully been employed for proficiency assessment in the CohMetrix project (Crossley et al., 2008; Crossley et al., 2011) ." 
}, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-35", "text": "Another example is the work by Feng et al. (2010) , who evaluate which of the typically proposed measures of text readability are most promising by studying their relevance on primary school students reading material." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-36", "text": "They find language model features and cohesion in terms of entity density to be particularly useful, as well as measures of nouns." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-37", "text": "Interestingly, they also observe overall sentence length to be more informative than more elaborate syntactic features." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-38", "text": "While Feng et al. (2010) do not elaborate further on other lexical measures than POS features, Chen and Meurers (2017) conduct an elaborate cross-corpus study on the use of word frequency features for readability assessment." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-39", "text": "They show, that the typical aggregation of word frequencies across documents are less informative than richer representations including frequency standard deviations." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-40", "text": "In contrast to English, research on readability assessment for other languages, such as German, is more limited." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-41", "text": "There was a series of articles on this issue from the late 2000s to the early 2010s that demonstrated the benefits of broad linguistic modeling, in particular the use of morphological complexity measures for languages with rich morphological systems like German (Vor der Br\u00fcck et al., 2008; Hancke et al., 2012) , but also Russian (Reynolds, 2016) or French (Fran\u00e7ois and Fairon, 2012) ." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-42", "text": "The readability checker DeLite of Vor der Br\u00fcck et al. 
(2008) is one of the first more sophisticated approaches that went beyond using simple readability formulas for German." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-43", "text": "The tool employs morphological, lexical, syntactic, semantic, and discourse measures, which they trained on municipal administration texts rated for their readability by humans in an online readability study involving 500 texts and 300 participants, resulting in 3,000 ratings overall." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-44", "text": "However, due to the specific nature of the data, the robustness of the approach across genres is unclear." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-45", "text": "Municipal administration language is so particular that results are unlikely to generalize to educational or literary materials, which are more attractive in first and second language acquisition contexts." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-46", "text": "Later work by Hancke et al. (2012) also combines traditional readability formula measures, such as text or word length, with more sophisticated lexical, syntactic, language model, and morphological features to assess German readability, but employs an overall broader and more diverse feature set than DeLite." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-47", "text": "They investigate readability of educational magazines on the GEO/GEOlino data set, which they compiled from online articles freely available at the GEO magazine's web page." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-48", "text": "Their work illustrates the relevance of rich linguistic modeling for readability assessment and in particular the value of morphological complexity features for German." 
}, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-49", "text": "The latest large scale research endeavor for the assessment of German text readability has focused more on identifying linguistic differences between texts targeting different audiences than on building readability models: In the Reading Demands project, complexity differences in German secondary school book texts across grade levels and school types were investigated. and analyze to which extent publishers successfully adapt their reading material to their target audiences." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-50", "text": "They find a lack of consistent adaptation for passive constructions, concessive and adversative connectives, and relative clauses, and only some limited adaptation in terms of lexical variation, noun complexity, and dependency length measures." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-51", "text": "----------------------------------" }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-52", "text": "**DATA SETS**" }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-53", "text": "----------------------------------" }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-54", "text": "**GEO/GEOLINO**" }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-55", "text": "The GEO/GEOlino data set consists of online articles from one of the leading German monthly educational magazines, GEO, and the counterpart for children, GEOlino." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-56", "text": "3 They are comparable to the National Geographic magazine and cover a variety of topics ranging from culture and history to technology and nature." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-57", "text": "Hancke et al. (2012) first compiled and analyzed a data set from this web resource." 
}, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-58", "text": "We followed their lead and crawled 8,263 articles from the GEO/GEOlino online archive, almost doubling the size of the original corpus." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-59", "text": "We removed all material flagged as non-article contents by GEO as well as all articles that contained less than 15 words." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-60", "text": "We further cleaned our data from crawling artifacts and performed near-duplicate detection with the Simhash algorithm." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-61", "text": "We then grouped all texts into topic categories based on the subdomains they were published under, following the web page topic structure." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-62", "text": "4 Table 1 shows the composition of the corpus in terms of the topic groups." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-63", "text": "Since the number of documents in the different topic groups differ between GEO and the smaller GEOlino set, we created a more balanced subset (GEO/GEOlino S )." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-64", "text": "For this, we included only topic categories existing in both GEO and GEOlino, included all GEOlino texts in those categories and sampled from the GEO texts in those categories until we reached the same overall size of 2480 texts each." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-65", "text": "Table 1 : Distribution of topics in the full and sampled GEO/GEOlino data set." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-68", "text": "The Tagesschau/Logo data set is compiled from subtitles of German daily news broadcasts of Tagesschau and its children's counterpart Logo!. Tagesschau is the dominant national television news service of Germany, produced by the German public-service television network ARD." 
}, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-69", "text": "It broadcasts multiple updated editions of daily news throughout the day." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-70", "text": "Logo! is a television news service for children produced by the German public-service television broadcaster ZDF airing once a day." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-71", "text": "The data set consists of subtitles for all editions of both news outlets that have been broadcasted from December 2015 to January 2017." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-72", "text": "For this paper, we limited the Tagesschau data to the main edition broadcasted at 8pm." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-73", "text": "This amounts to overall 421 editions for Tagesschau and 415 editions for Logo!, with the small difference arising from a lack of Logo! broadcasts on some public holidays or due to special broadcasts." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-74", "text": "We cleaned the subtitles by removing non-spoken comments (e.g., * music playing * or * cheering *)." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-75", "text": "Table 2 : Corpus profile for sampled GEO/GEOlino data set and the Tagesschau/Logo data set." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-76", "text": "----------------------------------" }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-77", "text": "**CHARACTERISTICS OF THE TWO DATA SETS**" }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-78", "text": "While GEO/GEOlino contains more documents than Tagesschau/Logo, they are considerably shorter in terms of the number of words and sentences they contain." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-79", "text": "Another difference arises in terms of the medium: GEO/GEOlino articles are self-contained reading material and Tagesschau/Logo subtitles complement video material." 
}, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-80", "text": "At the same time, they consist of German educational media language and share the functional goal of conveying information to the reader, so that we consider them to be sufficiently similar to support a cross-corpus analysis." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-81", "text": "----------------------------------" }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-82", "text": "**COMPLEXITY ANALYSIS**" }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-83", "text": "For the assessment of German language complexity, we extract 400 complexity measures using state of the art NLP techniques." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-84", "text": "All features are theoretically grounded in the contemporary research in linguistic subdisciplines, in particular Second Language Acquisition research, where Complexity is one of three dimensions of language proficiency, together with Accuracy and Fluency (Housen et al., 2012) ." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-85", "text": "SLA research has a rich tradition of analyzing the complexity development of learner language, see Lu (2010; 2012) for an overview." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-86", "text": "show that these measures can be successfully applied to readability research." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-87", "text": "Building on these findings, we follow the SLA definition of complexity as the elaborateness and variability of language (Ellis and Barkhuizen, 2005) ." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-88", "text": "Our measures can be grouped into seven categories: i) lexical complexity, ii) clausal complexity, iii) phrasal complexity, iv) morphological complexity, v) discourse complexity, vi) cognitive complexity, and vii) language use." 
}, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-89", "text": "While the former five groups are rooted in the linguistic system, the latter two categories were derived from psycholinguistic research." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-90", "text": "The resulting complexity assessment covers a broad variety of measures." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-91", "text": "To the best of our knowledge, this is currently the most extensive feature collection for German complexity assessment." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-92", "text": "Table 3 gives an overview of the feature categories and how much they contribute to our assessment." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-93", "text": "Table 3 also lists the external resources drawn on, such as Subtlex-DE (Brysbaert et al., 2011) , dlexDB (Heister et al., 2011) , and the Karlsruhe Children's Texts (Lavalley et al., 2015) , the latter providing an approximation of the age of active use of words." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-94", "text": "In order to extract these measures, we employ an elaborate analysis pipeline which relies on a number of NLP tools and external linguistic resources." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-95", "text": "We use OpenNLP 1.6.0 for tokenization and sentence segmentation." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-96", "text": "This serves as input for the Mate tools 3.6.0 (Bohnet and Nivre, 2012) , which perform morphological analysis, lemmatization, POS tagging, and dependency parsing." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-97", "text": "We then use the JWord-Splitter 3.4.0 for compound analysis." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-98", "text": "The Mate POS tags are further used to inform the Stanford PCFG parser 3.6.0 (Rafferty and Manning, 2008) and the Berkeley parser 1.7.0 (Petrov and Klein, 2007) , which we use for constituency and topological field parsing."
}, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-99", "text": "For all tools, we use the German default models that were provided with them, except for the Berkeley parser, for which we use the topological field model by Ziai and Meurers (2018) ." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-100", "text": "With these annotations, we extract all instances of the linguistic constructs that we need to calculate the final 400 complexity ratios." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-101", "text": "5 Study 1: Which complexity measures are informative?" }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-102", "text": "----------------------------------" }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-103", "text": "**SET-UP**" }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-104", "text": "We first want to determine the informativeness of each measure for distinguishing between adult and child target audience." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-105", "text": "For this, we calculate the information gain of each measure on both data sets using 10-fold cross-validation for training and testing." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-106", "text": "We then compare across both data sets i) the number of features that are informative, and ii) the 20 most informative measures that show a Pearson correlation smaller than \u00b10.8 with each other." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-107", "text": "This allows us to gain insights into the range of linguistic properties of the documents targeting adults and children." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-108", "text": "We used WEKA (Hall et al., 2009) to calculate information gain and R for the correlation analysis." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-109", "text": "Overall, 79.00% of the measures are informative for the GEO/GEOlino data and 88.25% for the Tagesschau/Logo data."
}, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-110", "text": "This shows that the documents are adjusted to their different target audiences in terms of a broad range of dimensions of linguistic complexity." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-111", "text": "Table 5 provides a deeper look into the linguistic design of the documents by showing the 20 most informative measures distinguishing adult from child targeted documents, including only measures with a correlation less than \u00b10.8." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-112", "text": "The table shows the original rank of each measure before removal of correlated measures, the average merit of each measure for the distinction of the target audience, the type of complexity measures it belongs to, and the feature name." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-113", "text": "----------------------------------" }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-114", "text": "**RESULTS AND DISCUSSION**" }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-115", "text": "The results for both data sets show a diverse collection of features, some of which are similar for both data sets, but also some interesting differences." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-116", "text": "In total, the measures seem to be more informative for Tagesschau/Logo, as indicated by the higher average merit, and more correlated, as can be seen from the wider range of original ranks." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-117", "text": "Language use as captured by frequency measures is particularly relevant for both data sets." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-118", "text": "The table includes seven measures of word frequency for GEO/GEOlino and five for Tagesschau/Logo."
}, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-119", "text": "For both data sets, the most informative measure is one of language use: For GEO/GEOlino it is the average minimal age of active use of lexical types found in the Karlsruhe Children's Corpus (KCT) of Lavalley et al. (2015) ." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-120", "text": "For Tagesschau/Logo it is the average log lexical type frequency based on Google Books 2000." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-121", "text": "The other language-use measures are very similar across data sets: Lexical types unknown to the Subtlex-DE database (Brysbaert et al., 2011) , for example, rank 4th and 2nd on the two data sets, and while on Tagesschau/Logo the lemma frequency per lexical type found in KCT is the 12th most informative measure, its log counterpart ranks 8th on GEO/GEOlino." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-122", "text": "Table 5: Top 20 most informative measures on balanced GEO/GEOlino and Tagesschau/Logo data based on information gain with r \u2264 0.8." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-123", "text": "----------------------------------" }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-124", "text": "**GEO/GEOLINO**" }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-125", "text": "Cohesion measures are highly informative, too, although more so for Tagesschau/Logo." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-126", "text": "In particular the use of certain personal or possessive pronouns is highly informative for GEO/GEOlino." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-127", "text": "The use of second person pronouns ranks highly for both data sets, which may easily be explained by its use for the informal German address appropriate when speaking to children."
}, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-128", "text": "This is further corroborated by the ratio of second person verb inflections being ranked as the 13th most important measure." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-129", "text": "For Tagesschau/Logo, other implicit measures of textual cohesion based on content overlap are also informative, as is the use of causal connectives." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-130", "text": "Overall, 55% of the 20 most informative measures for both data sets are captured by these two categories." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-131", "text": "The other feature groups are less frequently represented, but provide some interpretable insights into the data." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-132", "text": "First, both data sets show indications of differences in the degree of nominalization used in language targeting adults and children: For GEO/GEOlino, complex noun phrases per t-unit and finite clause are highly informative, as is the use of the nominalization suffix -ion." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-133", "text": "On Tagesschau/Logo, genitive case, determiners per noun phrase, and the percentage of compound nouns indicate a similar relevance of differences regarding the organisation of the nominal domain." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-134", "text": "Lexical and sentential complexity seem to be less homogeneous for the distinction of adult and child targeted language across data sets: There are two measures of lexical complexity assessing word length in syllables and the semantic inter-relatedness of words ranked high for GEO/GEOlino, while on Tagesschau/Logo, lexical diversity and verb variation are particularly informative."
}, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-135", "text": "For sentential complexity, constituency tree complexity, the average length of the middle field, and the use of prepositional phrases per t-unit are particularly informative on GEO/GEOlino." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-136", "text": "On Tagesschau/Logo, parse tree height and the use of conjunctional clauses are relevant." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-137", "text": "Cognitive measures do not seem to play an important role on either data set, except for the sum of longest dependencies per clause on Tagesschau/Logo." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-138", "text": "Overall, these results clearly show that for both data sets the distinction between target audiences is not just made based on surface modifications such as sentence or word length." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-139", "text": "In fact, these measures do not occur among the most informative measures at all." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-140", "text": "Rather, measures of language use and cohesion are predominantly informative for the distinction of adult and child targeting texts, along with measures of phrasal, sentential, lexical, and morphological complexity." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-141", "text": "The adjustment of the data to their audience observed here thus seems to be more linguistically refined than that found in the ReadingDemands textbook data, where only few adjustments across dimensions were found." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-142", "text": "6 Study 2: Can we successfully model readability for German, also across data sets?"
}, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-143", "text": "----------------------------------" }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-144", "text": "**SET-UP**" }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-145", "text": "Our second objective is the design of a robust model of educational media language that distinguishes between language targeting adults and children across corpora and genres." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-146", "text": "For this, we train two binary Sequential Minimal Optimization (SMO) support vector classifiers (Platt, 1998) with linear kernels using the WEKA machine learning toolkit (Hall et al., 2009)." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-147", "text": "Each model is tested i) on the same corpus it is trained on, using 10-fold cross-validation, and afterwards ii) on the other data set for cross-corpus testing after training on the full data set." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-148", "text": "For model performance evaluation, we report classification accuracy, the classification confusion matrices, and a random baseline as reference point." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-149", "text": "Table 6 shows the accuracy of our SMO models on both data sets and compares them with a random baseline." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-150", "text": "Both models clearly outperform the baseline of 50.0%." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-151", "text": "On GEO/GEOlino S , the performance is comparable to the performance observed by Hancke et al. (2012) on the original GEO/GEOlino data." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-152", "text": "As Table 7a shows, erroneous classifications are roughly balanced across both classes, showing that the model does not prefer one class over the other."
}, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-153", "text": "When training a model using only the 20 most informative measures identified in Study 1, we reach an accuracy of 85.1%, i.e., the additional measures account for only 3.3%." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-154", "text": "When testing the models on the Tagesschau/Logo corpus, accuracy increases to 98.8% for both models." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-155", "text": "The confusion matrix for the model using 400 measures is shown in Table 7b. Table 7: Confusion matrices for testing models with 400 features trained on GEO/GEOlino S ." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-156", "text": "----------------------------------" }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-157", "text": "**RESULTS AND DISCUSSION**" }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-158", "text": "Table 7b shows a minor tendency towards classifying Logo! texts as Tagesschau texts, but due to the low number of incorrect classifications this is not conclusive." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-159", "text": "Overall, the performance of both models trained on GEO/GEOlino S on the Tagesschau/Logo data is comparable to the performance of both models trained and tested on Tagesschau/Logo with 10-fold cross-validation, although the confusion matrix for the cross-validated Tagesschau/Logo model using 400 measures does not exhibit any tendency towards preferring one class over the other, as may be seen in Table 8a. The model trained and tested on Tagesschau/Logo reaches an unexpectedly high accuracy of 99.9% when using 400 measures and 99.8% when using only the 20 most informative measures reported in Study 1." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-160", "text": "Since the performance remains high when using only 20 measures and the standard deviation across folds is very low, the results do not seem to be due to over-fitting."
}, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-161", "text": "The model learns linguistic properties of the data set that generalize across its folds." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-162", "text": "It is important to stress here that none of our measures include n-gram language models or any other lexical content features, only complexity measures aggregated over each document." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-163", "text": "13 model with 91.1%, confirming that our approach is in fact competitive with the state of the art." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-164", "text": "12 We do not show the confusion matrices for the models with 20 features, because they are equivalent to the matrices in Table 7 ." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-165", "text": "The same holds for the models tested on Tagesschau/Logo and their matrices in Table 8 ." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-166", "text": "13 Content features are problematic since they can pick up recurring phrases that are characteristic of particular media outlets rather than generalizable linguistic complexity characteristics." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-167", "text": "E.g., the Tagesschau always starts with the greeting \"Hier ist das Erste Deutsche Fernsehen mit der Tagesschau.\" (Here is the first public German TV channel with the daily news.)." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-168", "text": "When testing the models trained on the Tagesschau/Logo data set on the GEO/GEOlino S data, it becomes apparent that the characteristics learned from the Tagesschau/Logo data set do not generalize, with the model based on 400 measures performing only marginally above chance, and the model using the 20 measures performing slightly better with 56.2%."
}, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-169", "text": "When considering the confusion matrix for this model in Table 8b , we see that most texts are classified as GEOlino texts, irrespective of whether they belong to GEO or GEOlino." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-170", "text": "The Tagesschau/Logo trained models do not generalize well to the other adult/child corpus." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-171", "text": "Since the model trained on GEO/GEOlino S is highly successful when tested on Tagesschau/Logo, this cannot be due to an actual lack of generalizable differences in the linguistic characteristics of the adult and child targeting texts contained in both data sets." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-172", "text": "One possible reason for these results may be that, as Study 1 showed, the measures are considerably more informative on Tagesschau/Logo than on GEO/GEOlino S ." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-173", "text": "It could be that the differences between the news subtitles designed for different target audiences are more extreme than those observable for the GEO magazines." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-174", "text": "This would explain the surprisingly good performance of the GEO/GEOlino S model on the Tagesschau/Logo data, which would then be easier to distinguish, while also accounting for the poor performance in the opposite case." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-175", "text": "----------------------------------" }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-176", "text": "**SUMMARY AND OUTLOOK**" }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-177", "text": "We presented a study of the differences between German texts targeting adults and children, to our knowledge the most broadly based linguistic complexity analysis to date."
}, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-178", "text": "We created and analyzed a novel data set compiled from German news subtitles that consists of news broadcasts for adults and children from the same days, ensuring a relatively parallel selection of topics." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-179", "text": "We compared this with a newly compiled GEO/GEOlino corpus consisting of online articles of two magazines for adults and children by the same publisher discussing similar topics." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-180", "text": "Based on these two data sets, we presented within-corpus (10-fold CV) and cross-corpus experiments and built binary classification models of German educational media text readability that perform with very high accuracy across both data sets." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-181", "text": "The model is based on a broad range of features that are highly informative for both data sets." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-182", "text": "This model is a valuable contribution since i) it is based on a considerably broader data basis than previous approaches to German readability, and ii) it successfully generalizes across the data sets, illustrating surprising robustness across rather different text types." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-183", "text": "The approach presented thus extends the state of the art in Hancke et al. (2012) in terms of the breadth of features integrated and the accuracy and generalizability of the model, and provides two new data sources for this line of research." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-184", "text": "The paper also contributes some new insights into the linguistic characteristics of German media language targeting adults and children."
}, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-185", "text": "Since all the language is produced by adults, it is not necessarily clear how well it is in fact adjusted to the target audience." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-186", "text": "As demonstrated in prior work, German textbook publishers indeed do not seem to be adjusting the complexity of the language used according to school type and grade level in any systematic way." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-187", "text": "Our results for educational media language indicate that i) both data sets are successfully and broadly adapted towards their target audiences; and ii) that they form two distinct, cross-corpus generalizable constructs of German educational media language for children and adults." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-188", "text": "In a next step, we plan to test to what extent this linguistically diverse and generalized construct matches the language competence of the intended children target group by comparing it with the Karlsruhe Children's Text corpus (Lavalley et al., 2015) ." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-189", "text": "We also plan to further investigate the linguistic properties of our two data sets." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-190", "text": "In particular, the Tagesschau/Logo data set requires further statistical and qualitative analyses to investigate why its linguistic characteristics generalize well across all folds of the data set itself but not across GEO/GEOlino." }, { "sent_id": "2e967f8560ffdb216135ae387776eb-C001-191", "text": "We also plan to conduct more analyses of the informativeness of the different complexity feature groups for the target audience distinction."
} ], "y": { "@BACK@": { "gold_contexts": [ [ "2e967f8560ffdb216135ae387776eb-C001-14" ], [ "2e967f8560ffdb216135ae387776eb-C001-41" ], [ "2e967f8560ffdb216135ae387776eb-C001-46" ], [ "2e967f8560ffdb216135ae387776eb-C001-57" ] ], "cite_sentences": [ "2e967f8560ffdb216135ae387776eb-C001-14", "2e967f8560ffdb216135ae387776eb-C001-41", "2e967f8560ffdb216135ae387776eb-C001-46", "2e967f8560ffdb216135ae387776eb-C001-57" ] }, "@EXT@": { "gold_contexts": [ [ "2e967f8560ffdb216135ae387776eb-C001-17" ], [ "2e967f8560ffdb216135ae387776eb-C001-183" ] ], "cite_sentences": [ "2e967f8560ffdb216135ae387776eb-C001-17", "2e967f8560ffdb216135ae387776eb-C001-183" ] }, "@USE@": { "gold_contexts": [ [ "2e967f8560ffdb216135ae387776eb-C001-57", "2e967f8560ffdb216135ae387776eb-C001-58" ] ], "cite_sentences": [ "2e967f8560ffdb216135ae387776eb-C001-57" ] }, "@SIM@": { "gold_contexts": [ [ "2e967f8560ffdb216135ae387776eb-C001-151" ] ], "cite_sentences": [ "2e967f8560ffdb216135ae387776eb-C001-151" ] } } }, "ABC_99b26d9151c7c0a1df1df1300fc764_27": { "x": [ { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-2", "text": "In this paper, we propose a novel deep neural network architecture, Speech2Vec, for learning fixed-length vector representations of audio segments excised from a speech corpus, where the vectors contain semantic information pertaining to the underlying spoken words, and are close to other vectors in the embedding space if their corresponding underlying spoken words are semantically similar." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-3", "text": "The proposed model can be viewed as a speech version of Word2Vec [1]." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-4", "text": "Its design is based on an RNN Encoder-Decoder framework, and borrows the methodology of skipgrams or continuous bag-of-words for training."
}, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-5", "text": "Learning word embeddings directly from speech enables Speech2Vec to make use of the semantic information carried by speech that does not exist in plain text." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-6", "text": "The learned word embeddings are evaluated and analyzed on 13 widely used word similarity benchmarks, and outperform word embeddings learned by Word2Vec from the transcriptions." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-7", "text": "----------------------------------" }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-9", "text": "Natural language processing (NLP) techniques such as Word2Vec [1, 2] and GloVe [3] transform words into fixed-dimensional vectors, or word embeddings." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-10", "text": "The embeddings are obtained via unsupervised learning from co-occurrence information in text, and contain semantic information about the word which is useful for many NLP tasks [4, 5, 6, 7, 8] ." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-11", "text": "Researchers have also explored the concept of learning vector representations from speech [9, 10, 11, 12, 13, 14] ." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-12", "text": "These approaches are based on notions of acoustic-phonetic (rather than semantic) similarity, so that different instances of the same underlying word would map to the same point in a latent embedding space." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-13", "text": "Our work, highly inspired by Word2Vec [1] , uses a skipgram or continuous bag-of-words formulation to focus on neighboring acoustic regions, rather than the acoustic segment associated with the word itself."
}, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-14", "text": "We show that the resulting acoustic embedding space is more semantic in nature." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-15", "text": "Recent research by [15, 16, 17] has presented a deep neural network model capable of rudimentary spoken language acquisition using raw speech training data paired with contextually relevant images." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-16", "text": "Using this contextual grounding, the model learned a latent semantic audio-visual embedding space." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-17", "text": "In this paper, we propose a deep neural network architecture capable of learning embeddings of audio segments corresponding to words from raw speech without any other modalities." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-18", "text": "The proposed model, called Speech2Vec, integrates an RNN Encoder-Decoder framework [18, 19] with the concept of skipgrams or continuous bag-of-words, and can handle arbitrary length audio segments." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-19", "text": "The resulting word embeddings contain information pertaining to the meaning of the underlying spoken words such that semantically similar words produce vector representations that are nearby in the embedding space." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-20", "text": "Speech2Vec can be viewed as a speech version of Word2Vec." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-21", "text": "Traditionally, when we want to learn word embeddings from speech, we need to first transcribe the speech into text with an ASR system, and then apply a textual word embedding method to the transcripts." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-22", "text": "The motivation for this work is that learning word embeddings directly from speech circumvents the recognition errors caused by the transcription process."
}, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-23", "text": "Moreover, speech contains richer information than text such as prosody, and a machine should be able to make use of this information in order to learn better semantic representations." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-24", "text": "In this paper, we build on a preliminary version of the Speech2Vec model [20] by introducing additional methodologies and details for training the model, comparing the model with additional baseline approaches, and providing systematic analysis and visualizations of the learned word embeddings." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-25", "text": "----------------------------------" }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-26", "text": "**PROPOSED APPROACH**" }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-27", "text": "Our goal is to learn a fixed-length embedding of an audio segment corresponding to a word that is represented by a variable-length sequence of acoustic features such as Mel-Frequency Cepstral Coefficients (MFCCs), x = (x1, x2, ..., xT ), where xt is the acoustic feature at time t and T is the length of the sequence." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-28", "text": "We desire that this word embedding is able to describe the semantics of the original audio segment to some degree." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-29", "text": "Here we first review the RNN Encoder-Decoder framework, followed by formally proposing the Speech2Vec model." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-30", "text": "----------------------------------" }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-31", "text": "**RNN ENCODER-DECODER FRAMEWORK**" }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-32", "text": "A Recurrent Neural Network (RNN) Encoder-Decoder consists of an Encoder RNN and a Decoder RNN [18, 19] ."
}, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-33", "text": "For an input sequence x = (x1, x2, ..., xT ), the Encoder reads each of its symbols xi sequentially, and the hidden state ht of the RNN is updated accordingly." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-34", "text": "After the last symbol xT is processed, the corresponding hidden state hT is interpreted as the learned representation of the entire input sequence." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-35", "text": "Subsequently, by initializing its hidden state using hT , the Decoder generates an output sequence y = (y1, y2, ..., yT' ) sequentially, where T and T' can be different." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-36", "text": "Such a sequence-to-sequence framework does not constrain the input or target sequences, and has been successfully applied to tasks such as machine translation [21, 22] , video caption generation [23] , abstract meaning representation parsing and generation [24] , and acoustic word embeddings acquisition [11] ." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-37", "text": "The structures of Speech2Vec trained with skipgrams and cbow, respectively." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-38", "text": "All audio segments were padded with zero vectors to the same length T ." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-39", "text": "Note that when training Speech2Vec with skipgrams, it is the same Decoder RNN that generates all the output audio segments; when training Speech2Vec with cbow, it is the same Encoder RNN that encodes all the input audio segments." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-40", "text": "----------------------------------" }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-41", "text": "**SPEECH2VEC**" }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-42", "text": "The backbone of Speech2Vec is the RNN Encoder-Decoder framework."
}, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-43", "text": "Inspired by Word2Vec, here we propose two methodologies for training Speech2Vec: skipgrams and continuous bag-of-words (cbow)." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-44", "text": "The two variants are depicted in Figure 1 (a) and Figure 1(b) , respectively." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-45", "text": "----------------------------------" }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-46", "text": "**TRAINING SPEECH2VEC WITH SKIPGRAMS**" }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-47", "text": "The idea of training Speech2Vec with skipgrams is that for each audio segment x (n) (corresponding to a word) in a speech corpus, the model is trained to predict the audio segments {x (n\u2212k) , ...," }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-48", "text": ".., x (n+k) } (corresponding to nearby words) within a certain range k before and after x (n) ." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-49", "text": "During training, the Encoder first takes x (n) as input and encodes it into a vector representation of fixed dimensionality z (n) ." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-50", "text": "The Decoder then maps z (n) to several output sequences y (i) , i \u2208 {n\u2212k, ..., n\u22121, n+1, ..., n+k}. The model is trained by minimizing the gap between the output sequences and their corresponding nearby audio segments, measured by the general mean squared error i x (i) \u2212 y (i) 2 ." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-51", "text": "The intuition behind the this approach is that, in order to successfully decode nearby audio segments, the encoded vector representation z (n) should contain sufficient semantic information about the current audio segment x (n) ." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-52", "text": "After training, z (n) is taken as the word embedding of x (n) ." 
}, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-53", "text": "----------------------------------" }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-54", "text": "**TRAINING SPEECH2VEC WITH CBOW**" }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-55", "text": "In contrast to training Speech2Vec with skipgrams that aim to predict nearby audio segments from z (n) , training Speech2Vec with cbow sets x (n) as the target and aims to infer it from nearby audio segments." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-56", "text": "During training, all nearby audio segments are encoded by a shared Encoder into h (i) , i \u2208 {n \u2212 k, ..., n \u2212 1, n + 1, ..., n + k}, and their sum" }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-57", "text": "used by the Decoder to generate x (n) ." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-58", "text": "After training, z (n) is taken as the word embedding for x (n) ." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-59", "text": "In our experiments, we found that Speech2Vec trained with skipgrams consistently outperforms that trained with cbow." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-60", "text": "----------------------------------" }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-61", "text": "**DIFFERENCES BETWEEN SPEECH2VEC AND WORD2VEC**" }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-62", "text": "The proposed Speech2Vec aims to learn a fixed-length embedding of an audio segment that captures the semantic information of the spoken word directly from audio." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-63", "text": "It can be viewed as a speech version of Word2Vec." 
}, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-64", "text": "Although they have many properties in common, such as sharing the same training methodologies (skipgrams and cbow), and learning word embeddings that capture semantic information from their respective modalities, it is important to identify two fundamental differences." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-65", "text": "First, the architecture of a Word2Vec model is a two-layered fullyconnected neural network with one-hot encoded vectors as input and output." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-66", "text": "In contrast, the Speech2Vec model is composed of Encoder and Decoder RNNs, in order to handle variablelength input and output sequences of acoustic features." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-67", "text": "Second, in a Word2Vec model, the embedding for a particular word is deterministic." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-68", "text": "Every instance of the same word will be represented by one, and only one, embedding vector." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-69", "text": "In contrast, in the Speech2Vec model, due to the fact that every instance of a spoken word will be different (due to speaker, channel, and other contextual differences etc.), every instance of the same underlying word will be represented by a different (though hopefully similar) embedding vector." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-70", "text": "For experimental purposes, in this work, all vectors representing instances of the same spoken word are averaged to obtain a single word embedding." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-71", "text": "The effect of this averaging operation is discussed in Section 3." 
}, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-72", "text": "----------------------------------" }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-73", "text": "**EXPERIMENTS**" }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-74", "text": "----------------------------------" }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-75", "text": "**DATASET**" }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-76", "text": "For our experiments we used LibriSpeech [25] , a corpus of read English speech, to learn Speech2Vec embeddings." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-77", "text": "In particular, we used a 500 hour subset of broadband speech produced by 1,252 speakers." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-78", "text": "Speech features consisting of 13 dimensional Mel Frequency Cepstral Coefficients (MFCCs) were produced every 10ms." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-79", "text": "The speech was segmented according to word boundaries obtained by forced alignment with respect to the reference transcriptions such that each audio segment corresponds to a spoken word." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-80", "text": "This resulted in a large set of audio segments {x (1) , x (2) , ..., x (|C|) }, where |C| denotes the total number of audio segments (words) in the corpus." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-81", "text": "10 50 100 200 10 50 100 200 10 50 100 200 10 50 100" }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-82", "text": "----------------------------------" }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-83", "text": "**IMPLEMENTATION AND TRAINING DETAILS**" }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-84", "text": "We implemented the Speech2Vec model with PyTorch [26] ." 
}, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-85", "text": "The Encoder RNN is a single-layered bidirectional LSTM [27] , and the Decoder RNN is another single-layered unidirectional LSTM." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-86", "text": "To facilitate the learning process, we also adopted the attention mechanism that enables the Decoder to condition every decoding step on the last hidden state of the Encoder [28] , in other words, the Decoder can refer to hT when generating every symbol yt of the output sequence y. The window size k for training the model with skipgrams and cbow is set to three." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-87", "text": "The model was trained by stochastic gradient descent (SGD) without momentum, with a fixed learning rate of 1e \u2212 3 and 500 epochs." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-88", "text": "We experimented with hyperparameter combinations for training the Speech2Vec model, including the depths of the Encoder and Decoder RNNs, which memory cell (LSTM or GRU) to use, and bidirectional or unidirectional RNNs." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-89", "text": "We conducted experiments using the specified architecture since it produced the most stable and satisfactory results." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-90", "text": "----------------------------------" }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-91", "text": "**EVALUATION**" }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-92", "text": "Existing schemes for evaluating methods for word embeddings fall into two major categories: extrinsic and intrinsic [29] ." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-93", "text": "With the extrinsic method, the learned word embeddings are used as input features to a downstream task [4, 5, 6, 7, 8] , and the performance metric varies from task to task." 
}, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-94", "text": "The intrinsic method directly tests for semantic or syntactic relationships between words, and includes the tasks of word similarity and word analogy [1] ." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-95", "text": "In this paper, we focus on the intrinsic method, especially the word similarity task, for evaluating and analyzing the Speech2Vec word embeddings." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-96", "text": "We used 13 benchmarks [30] [40] , and SimVerb-3500 [41] ." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-97", "text": "These 13 benchmarks contain different numbers of pairs of English words that have been assigned similarity ratings by humans, and each of them evaluates the word embeddings in terms of different aspects." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-98", "text": "For example, RG-65 and MC-30 focus on nouns, YC-130 and SimVerb-3500 focus on verbs, and Rare-Word focuses on rare-words." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-99", "text": "The similarity between a given pair of words was calculated by computing the cosine similarity between their corresponding word embeddings." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-100", "text": "We then reported the Spearman's rank correlation coefficient \u03c1 between the rankings produced by each model against the human rankings [42] ." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-101", "text": "We compared Speech2Vec trained with skipgrams or cbow with its Word2Vec counterpart trained on the transcriptions of the LibriSpeech corpus using the fastText implementation [2] ." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-102", "text": "For convenience, we refer to these four models as skipgrams Speech2Vec, cbow Speech2Vec, skipgrams Word2Vec, and cbow Word2Vec, respectively." 
}, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-103", "text": "----------------------------------" }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-104", "text": "**RESULTS AND DISCUSSIONS**" }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-105", "text": "We trained the four models with different embedding sizes to understand how large the embedding size should be to capture sufficient semantic information about the word." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-106", "text": "The results are shown in Table 1 ." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-107", "text": "We also varied the size of the corpus used for training the four models and report the results in Table 2 ." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-108", "text": "1 The numbers in both tables are the average of running the experiment 10 times and the standard deviations are negligible." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-109", "text": "From Table 1 and Table 2 , we have the following observations." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-110", "text": "Embedding size impact on performance." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-111", "text": "We found that increasing the embedding size does not always result in improved performance." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-112", "text": "For cbow Speech2Vec, skipgrams Speech2Vec, and cbow Word2Vec, word embeddings of 50-dimensions are able to capture enough semantic information of the words, as the best performance (highest \u03c1) of each benchmark is mostly achieved by them." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-113", "text": "For skipgrams Word2Vec, although the best performance of 7 out of 13 benchmarks is achieved by word embeddings of 200-dims, there are 6 benchmarks whose best performance is achieved by word embeddings of other sizes." 
}, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-114", "text": "Comparing Speech2Vec to Word2Vec." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-115", "text": "From Table 1 we see that skipgrams Speech2Vec achieves the highest \u03c1 in 8 out of 13 benchmarks, outperforming cbow and skipgrams Word2Vec in combination." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-116", "text": "We believe a possible reason for such results is due to skipgrams Speech2Vec's ability to capture semantic information present in speech such as prosody that is not in text." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-117", "text": "Comparing skipgrams to cbow Speech2Vec." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-118", "text": "From Table 1 we observe that skipgrams Speech2Vec consistently outperforms cbow Speech2Vec on all benchmarks for all embedding sizes." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-119", "text": "This result aligns with the empirical fact that skipgrams Word2Vec is likely to work better than cbow Word2Vec with small training corpus size [1] ." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-120", "text": "Training size 10% 40% 70% 100% 10% 40% 70% 100% 10% 40% 70% 100% 10% 40% 70% Impact of training corpus size." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-121", "text": "From Table 2 we observe that when 10% of the corpus was used for training, the resulting word embeddings perform poorly." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-122", "text": "Unsurprisingly, the performance continues to improve as training size increases." 
}, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-123", "text": "----------------------------------" }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-124", "text": "**VARIANCE STUDY**" }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-125", "text": "In Section 2.3 we mention that in Speech2Vec, every instance of a spoken word will produce a different embedding vector." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-126", "text": "Here we try to understand how the vectors for a given word vary." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-127", "text": "i.e., are they similar, or is there considerable variance that the averaging operation adopted in this paper smooths out?" }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-128", "text": "To study this, we partitioned all words into four sub-groups based on the number of times, N , that they appeared in the corpus, ranging from 5 \u223c 99, 100 \u223c 999, 1000 \u223c 9999, and \u2265 10k." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-129", "text": "Then, for all vector representations {w 1 , w 2 , ..., w N } of a given word w that appeared N times, we computed the mean of the standard deviations of each dimensions mw =" }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-130", "text": ", where d denotes the embedding size." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-131", "text": "Finally, we averaged mw for every word w that belongs to the same sub-group and reported the results in Figure 2 ." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-132", "text": "From Figure 2 we observe that when N falls in 5 \u223c 99, the variances of the vectors generated by cbow Speech2Vec are smaller than those generated by skipgrams Speech2Vec." 
}, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-133", "text": "However, when N becomes bigger, variances of the vectors generated by skipgrams Speech2Vec become smaller than those generated by cbow Speech2Vec, and the gap continues to grow as N increases." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-134", "text": "We suspect the lower variation of the skipgrams model relative to the cbow model is related to the overall superior performance of the skipgrams Speech2Vec model." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-135", "text": "We are encouraged that the deviation of the skipgrams model gets smaller as N increases, as it suggests stability in the model." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-136", "text": "We visualized the word embeddings learned by skipgrams Speech2Vec with t-SNE [43] using http://www." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-137", "text": "wordvectors.org/ in Figure 3 ." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-138", "text": "We see that words with positive meanings (colored in green) are mainly located at the upper part of the figure, while words with negative meanings (colored in red) are mostly located at the bottom." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-139", "text": "Such distribution suggests that the learned word embeddings do capture notions of antonym and synonyms to some degree." 
}, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-140", "text": "----------------------------------" }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-141", "text": "**VISUALIZATIONS**" }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-142", "text": "----------------------------------" }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-143", "text": "**CONCLUSIONS AND FUTURE WORK**" }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-144", "text": "Speech2Vec, which integrates a RNN Encoder-Decoder framework with skipgrams or cbow for training, extends the textbased Word2Vec [1] model to learn word embeddings directly from speech." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-145", "text": "Speech2Vec has access to richer information in the speech signal that does not exist in plain text." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-146", "text": "In our experiments, the learned word embeddings outperform those produced by Word2Vec from the transcriptions." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-147", "text": "In the future, we plan to evaluate the word embeddings on speech-related extrinsic tasks such as machine listening comprehension [44, 45] and speech-based visual question answering [46] by initializing the embedding layers of the neural network models." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-148", "text": "Finally, in this work, some supervision was incorporated into the learning by using forced alignment segmentations as the basis for audio segments." }, { "sent_id": "99b26d9151c7c0a1df1df1300fc764-C001-149", "text": "It would be interesting to explore less supervised segmentations to learn word boundaries [47, 48] ." 
} ], "y": { "@EXT@": { "gold_contexts": [ [ "99b26d9151c7c0a1df1df1300fc764-C001-3" ], [ "99b26d9151c7c0a1df1df1300fc764-C001-144" ] ], "cite_sentences": [ "99b26d9151c7c0a1df1df1300fc764-C001-3", "99b26d9151c7c0a1df1df1300fc764-C001-144" ] }, "@BACK@": { "gold_contexts": [ [ "99b26d9151c7c0a1df1df1300fc764-C001-9" ] ], "cite_sentences": [ "99b26d9151c7c0a1df1df1300fc764-C001-9" ] }, "@MOT@": { "gold_contexts": [ [ "99b26d9151c7c0a1df1df1300fc764-C001-13" ] ], "cite_sentences": [ "99b26d9151c7c0a1df1df1300fc764-C001-13" ] }, "@USE@": { "gold_contexts": [ [ "99b26d9151c7c0a1df1df1300fc764-C001-94" ] ], "cite_sentences": [ "99b26d9151c7c0a1df1df1300fc764-C001-94" ] }, "@SIM@": { "gold_contexts": [ [ "99b26d9151c7c0a1df1df1300fc764-C001-119" ] ], "cite_sentences": [ "99b26d9151c7c0a1df1df1300fc764-C001-119" ] } } }, "ABC_bdeffcf02a86d06f57dbfae979b098_27": { "x": [ { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-2", "text": "Many tasks in NLP and IR require efficient document similarity computations." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-3", "text": "Beyond their common application to exploratory data analysis, latent variable topic models have been used to represent text in a low-dimensional space, independent of vocabulary, where documents may be compared." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-4", "text": "This paper focuses on the task of searching a large multilingual collection for pairs of documents that are translations of each other." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-5", "text": "We present (1) efficient, online inference for representing documents in several languages in a common topic space and (2) fast approximations for finding near neighbors in the probability simplex." 
}, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-6", "text": "Empirical evaluations show that these methods are as accurate as-and significantly faster thanGibbs sampling and brute-force all-pairs search." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-7", "text": "----------------------------------" }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-9", "text": "Statistical topic models, such as latent Dirichlet allocation (LDA) , have proven to be highly effective at discovering hidden structure in document collections (Hall et al., 2008, e.g.) ." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-10", "text": "Often, these models facilitate exploratory data analysis, by revealing which collocations of terms are favored in different kinds of documents or which terms and topics rise and fall over time (Blei and Lafferty, 2006; Wang and McCallum, 2006) ." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-11", "text": "One of the greatest advantages in using topic models to analyze and process large document collections is their ability to represent documents as probability distributions over a small number of topics, thereby mapping documents into a low-dimensional latent space-the T -dimensional probability simplex, where T is the number of topics." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-12", "text": "A document, represented by some point in this simplex, is said to have a particular \"topic distribution\"." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-13", "text": "Representing documents as points in a lowdimensional shared latent space abstracts away from the specific words used in each document, thereby facilitating the analysis of relationships between documents written using different vocabularies." 
}, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-14", "text": "For instance, topic models have been used to identify scientific communities working on related problems in different disciplines, e.g., work on cancer funded by multiple Institutes within the NIH (Talley et al., 2011) ." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-15", "text": "While vocabulary mismatch occurs within the realm of one language, naturally this mismatch occurs across different languages." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-16", "text": "Therefore, mapping documents in different languages into a common latent topic space can be of great benefit when detecting document translation pairs (Mimno et al., 2009; Platt et al., 2010) ." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-17", "text": "Aside from the benefits that it offers in the task of detecting document translation pairs, topic models offer potential benefits to the task of creating translation lexica, aligning passages, etc." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-18", "text": "The process of discovering relationship between documents using topic models involves: (1) representing documents in the latent space by inferring their topic distributions and (2) comparing pairs of topic distributions to find close matches." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-19", "text": "Many widely used techniques do not scale efficiently, however, as the size of the document collection grows." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-20", "text": "Posterior inference by Gibbs sampling, for instance, may make thousands of passes through the data." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-21", "text": "For the task of comparing topic distributions, recent work has also resorted to comparing all pairs of documents (Talley et al., 2011) ." 
}, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-22", "text": "This paper presents efficient methods for both of these steps and performs empirical evaluations on the task of detected translated document pairs embedded in a large multilingual corpus." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-23", "text": "Unlike some more exploratory applications of topic models, translation detection is easy to evaluate." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-24", "text": "The need for bilingual training data in many language pairs and domains also makes it attractive to mitigate the quadratic runtime of brute force translation detection." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-25", "text": "We begin in \u00a72 by extending the online variational Bayes approach of Hoffman et al. (2010) to polylingual topic models (Mimno et al., 2009) ." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-26", "text": "Then, in \u00a73, we build on prior work on efficient approximations to the nearest neighbor problem by presenting theoretical and empirical evidence for applicability to topic distributions in the probability simplex and in \u00a74, we evaluate the combination of online variational Bayes and approximate nearest neighbor methods on the translation detection task." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-27", "text": "----------------------------------" }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-28", "text": "**ONLINE VARIATIONAL BAYES FOR POLYLINGUAL TOPIC MODELS**" }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-29", "text": "Hierarchical generative Bayesian models, such as topic models, have proven to be very effective for modeling document collections and discovering underlying latent semantic structures." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-30", "text": "Most current topic models are based on Latent Dirichlet Allocation (LDA) ." 
}, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-31", "text": "In some early work on the subject, showed the usefulness of LDA on the task of automatic annotation of images." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-32", "text": "Hall et al. (2008) used LDA to analyze historical trends in the scientific literature; Wei and Croft (2006) showed improvements on an information retrieval task." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-33", "text": "More recently Eisenstein et al. (2010) modeled geographic linguistic variation using Twitter data." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-34", "text": "Aside from their widespread use on monolingual text, topic models have also been used to model multilingual data (Boyd-Graber and Blei, 2009; Platt et al., 2010; Jagarlamudi and Daum\u00e9, 2010; Fukumasu et al., 2012) , to name a few." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-35", "text": "In this paper, we focus on the Polylingual Topic Model, introduced by Mimno et al. (2009) ." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-36", "text": "Given a multilingual set of aligned documents, the PLTM assumes that across an aligned multilingual document tuple, there exists a single, tuple-specific, distribution across topics." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-37", "text": "In addition, PLTM assumes that for each language-topic pair, there exists a distribution over words in that language \u03b2 l ." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-38", "text": "As such, PLTM assumes that the multilingual corpus is created through a generative process where first a document tuple is generated by drawing a tuple-specific distribution over topics \u03b8 1 which, as it is the case with LDA, is drawn from a Dirichlet prior \u03b8 \u223c Dir (\u03b1) ." 
}, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-39", "text": "For each of the languages l in the tuple and for each of the N words w l n in the document the generative process: first chooses a topic assignment z l n \u223c M ultinomial (\u03b8) which is then followed by choosing a word w l n from a multinomial distribution conditioned on the topic assignment and the language specific topics distribution over words \u03b2 l \u223cDir (\u03b7 l )." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-40", "text": "Both \u03b1 and \u03b7 1,...,L are symmetric priors, i.e. the priors are exchangeable Dirichlet distributions." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-41", "text": "Finally, each word is generated from a language-and topic-specific multinomial distribution \u03b2 l t as selected by the topic assignment variable z l n :" }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-42", "text": "(1) Figure 1 shows a graphical representation of the PLTM using plate notation." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-43", "text": "In their original work Mimno et al. (2009) used the Gibbs sampling approach as a posterior inference algorithm to assign topics distributions over their test collection." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-44", "text": "While more straightforward to implement, this sampling approach is inherently slow when applied to large collections which makes the original PLTM work practically infeasible to be used on real-world data sets." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-45", "text": "In general, performing posterior inference over the latent variables of a Bayesian model is usually done with two of the three approximate approaches, Gibbs sampling, variational Bayes (VB) and expectation-propagation." 
}, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-46", "text": "While Gibbs Sampling is a variation of Markov Chain Monte Carlo method (MCMC) which generates a sample from the true posterior after converging to a stationary distribution; in VB, a set of free variational parameters characterizes a simpler family of probability distributions." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-47", "text": "These variational parameters are then optimized by finding the minimum KullbackLeibler (KL) divergence between the variational distribution q (\u03b8, z, \u03b2|\u03b3, \u03c6, \u03bb) and the true posterior P (\u03b8, z, \u03b2|w, \u03b1, \u03b7)." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-48", "text": "From an algorithmic perspective, the variational Bayes approach follows the Expectation-Maximization (EM) procedure where for a given document, the E-step updates the per document variational parameters \u03b3 d and \u03c6 d while holding the per words-topic distribution parameter \u03bb fixed." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-49", "text": "It then updates the variational parameter \u03bb using the sufficient statistics computed in the E step." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-50", "text": "In order to converge to a stationary point, both approaches require going over the whole collection multiple times which makes their time complexity to grown linearly with the size of the data collection." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-51", "text": "The mere fact that they require continuous access to the whole collection makes both inference approaches impracticable to use on very large or streaming collections." 
}, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-52", "text": "To alleviate this problem, several algorithms have been proposed that draws from belief propagation (Zeng et al., 2012) , the Gibbs sampling approach such as (Canini et al., 2009 ), variational Bayes (Hoffman et al., 2010) as well as a combination of the latter two (Hoffman et al., 2012) to name a few." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-53", "text": "In this paper we use Hoffman et al. (2010) approach." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-54", "text": "Hoffman et al. (2010) proposed a new inference approach called Online LDA which relies on the stochastic gradient descent to optimize the variational parameters." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-55", "text": "This approach can produce good estimates of LDA posteriors in a single pass over the whole collection." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-56", "text": "----------------------------------" }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-57", "text": "**ALGORITHMIC IMPLEMENTATION**" }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-58", "text": "We now derive an online variational Bayes algorithm for PLTM to infer topic distributions over multilingual collections." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-59", "text": "Figure 2 shows the variational model and free parameters used in our approach." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-60", "text": "As in the case of Hoffman et al. (2010) , our algorithm updates the variational parameters \u03b3 l d and \u03c6 l d on each batch of documents while the variational parameter \u03bb is computed as a weighted average of the value on the previous batch and its approximate version\u03bb." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-61", "text": "Averaging is performed using a decay function whose parameters control the rate at which old values of \u03bb l are forgotten." 
}, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-62", "text": "Within the E step of the VB approach, we compute the updates over the variational parameter \u03c6 l T . . ." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-63", "text": "Figure 2: Graphical model representation of the free variational parameters for the online variational Bayes approximation of the PLTM posterior for each language L present in our document tuple while the update on the \u03b3 parameter accumulates the language specific sufficient statistics:" }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-64", "text": "We detail these steps in Algorithm 1." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-65", "text": "----------------------------------" }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-66", "text": "**PERFORMANCE ANALYSIS**" }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-67", "text": "To demonstrate the efficacy of online PLTM, we ran topic inference on a subset of the EnglishSpanish Europarl collection consisting of \u223c64k parallel speeches and compared the accuracy results vs. the training and inference speed against the original PLTM model using topic sets of T=50,100, 200 and 500." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-68", "text": "We explain in details the evaluation task and the performance metric used in \u00a74." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-69", "text": "Shown in Figure 3 are the results of these comparisons." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-70", "text": "Our speed measurements were performed on Xeon quad processors with a clock speed of 2.66GHz and a total of 16GB of memory." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-71", "text": "As we increase the number of topics we gain in accuracy over the evaluation task across both inference approaches." 
}, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-72", "text": "When we increase the number of topics from 50 to 500 the speed improvement obtained by Online VB PLTM drops by a factor of 2.9 within the training step and by a factor of 4.45 in the test step." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-73", "text": "Our total running time for the Online VB PLTM with T=500 approaches the running time of the Gibbs sampling approach with T=50." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-74", "text": "The gradual drop in speed improvement with the increase of the number topics is mostly attributed to the commutation of the Algorithm 1 Online variational Bayes for PLTM initialize \u03bb l randomly obtain the tth mini-batch of tuples M t for t = 1 to \u221e do Online VB PLTM and Gibbs Sampling PLTM at T=50,100, 200 and 500." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-75", "text": "We used a Python implementation of Online VB and Mallet's Java implementation of PLTM with in-memory Gibbs Sampling using 1000 iterations." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-76", "text": "While a multilingual collection of \u223c64k document pairs is considered relatively big, our goal of deriving the Online VB PLTM approach was to be able to utilize PLTM on very large multilingual collections." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-77", "text": "To analyze the potential of using Online VB PLTM on such collections we ran speed comparisons within the training step by creating multilingual collections of different lengths multiplying the original English-Spanish Europarl collection." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-78", "text": "Speed comparisons using collections of length 50K, 100K, 250K, 500K, 750K and 1M are shown in Figure 4 ." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-79", "text": "Training was performed with the number of topics T set to T=50 and T=500." 
}, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-80", "text": "As we increase the collection size we observe the real benefit of using Online VB compared to Gibbs sampling." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-81", "text": "This is mostly attributed to the fact that the Gibbs sampling approach requires multiple iterations over the whole collection in order to achieve a convergence point." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-82", "text": "For collection sizes of 50k and 100k the training time for the Online VB PLTM with T=500 approaches the training time of Gibbs sampling with T=50 and as we increase the collection size this proximity dissipates." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-83", "text": "In Figure 5 we show a sample set of the aligned topics extracted using Online VB PLTM with T=400 on the English-Spanish Europarl collection." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-84", "text": "For a given topic tuple words are ordered based on probability of occurrence within the given topic." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-85", "text": "3 Approximate NN Search in the Probability Simplex" }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-86", "text": "One of the most attractive applications for topic models has involved using the latent variables as a low-dimensional representation for document similarity computations (Hall et al., 2008; BoydGraber and Resnik, 2010; Talley et al., 2011) ." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-87", "text": "After computing topic distributions for documents, however, researchers in this line of work have almost always resorted to brute-force all-pairs similarity comparisons between topic distributions." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-88", "text": "In this section, we present efficient methods for approximate near neighbor search in the probability simplex in which topic distributions live." 
}, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-89", "text": "Measurements for similarity between two probability distributions are information-theoretic, and distance metrics, typical for the metric space, are not appropriate (measurements such as Euclidean, cosine, Jaccard, etc.)." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-90", "text": "Divergence metrics, such as Kullback-Leibler (KL), Jensen-Shannon (JS), and Hellinger distance are used instead." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-91", "text": "Shown in Figure 6 are the formulas of the divergence metrics along with the Euclidean distance." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-92", "text": "When dealing with a large data set of N documents, the O(N 2 ) time complexity of all-pairs comparison makes the task practically infeasible." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-93", "text": "With some distance measures, however, the time complexity on near neighbor tasks has been alleviated using approximate methods that reduce the time complexity of each query to a sub-linear number of comparisons." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-94", "text": "For example, Euclidean distance (3) has been efficiently used on all-pairs comparison tasks in large data sets thanks to its approximate based versions developed using locality sensitive hashing (LSH) (Andoni et al., 2005) and k-d search trees (Friedman et al., 1977) ." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-95", "text": "In order to alleviate the all-pairs computational complexity in the probability simplex, we will use a reduction of the Hellinger divergence measure (4) to Euclidean distance and therefore utilize preexisting approximation techniques for the Euclidean distance in the probability simplex." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-96", "text": "This reduction comes from the fact that both measurements have similar algebraic expressions." 
}, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-97", "text": "If we discard the square root used in the Euclidean distance, Hellinger distance (4) becomes equivalent to the Euclidean distance metric (3) between \u221a p i and \u221a q i ." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-98", "text": "The task of finding nearest neighbors for a given point (whether in the metric space or the probability simplex) involves ranking all nearest points discovered and as such not computing the square root function does not affect the overall ranking and the nearest neighbor discovery." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-99", "text": "Moreover, depending on its functional form, the Hellinger distance is often defined as square root over the whole summation." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-100", "text": "Aside from the Hellinger distance, we also approximate JensenShannon divergence which is a symmetric version of the Kullback-Liebler divergence." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-101", "text": "For the JS approximation, we will use a constant factor relationship between the Jensen-Shannon divergence an Hellinger distance previously explored by (Tops\u00f8e, 2000) ." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-102", "text": "More specifically, we will be using its more concise form (7) also presented by" }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-103", "text": "Figure 6: Distance measures and bounds (Guha et al., 2006) ." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-104", "text": "The constant factor relationship provides us with the theoretical guarantees necessary for this approximation." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-105", "text": "In practice, we can often do much better than this theoretical bound." 
}, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-106", "text": "Figure 7 shows the empirical relation of JS and Hellinger on a translationdetection task." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-107", "text": "As will be described in \u00a74, we computed the JS and Hellinger divergences between topic distributions of English and Spanish Europarl speeches for a total of 1 million document pairs." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-108", "text": "Each point in the figure represents one Spanish-English document pair that might or might not be translations of each other." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-109", "text": "In this figure we emphasize the lower left section of the plot where the nearest neighbors (i.e., likely translations) reside, and the relationship between JS and Hellinger is much tighter than the theoretical bounds and from pratical perspective as we will show in the next section." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-110", "text": "As a summary for the reader, using the above approaches, we will approximate JS divergence by using the Euclidean based representation of the Hellinger distance." 
}, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-111", "text": "As stated earlier, the Euclidean based representation is computed using well established approximation approaches and in our case we will use two such approaches: the Exact Euclidean LSH (E2LSH) (Andoni et al., 2005)" }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-112", "text": "----------------------------------" }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-113", "text": "**EFFICIENT APPROXIMATE TRANSLATION DETECTION**" }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-114", "text": "Mapping multilingual documents into a common, language-independent vector space for the purpose of improving machine translation (MT) and performing cross-language information retrieval (CLIR) tasks has been explored through various techniques." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-115", "text": "Mimno et al. (2009) introduced polylingual topic models (PLTM), an extension of latent Dirichlet allocation (LDA), and, more recently, Platt et al. (2010) proposed extensions of principal component analysis (PCA) and probabilistic latent semantic indexing (PLSI)." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-116", "text": "Both the PLTM and PLSI represent bilingual documents in the probability simplex, and thus the task of finding document translation pairs is formulated as finding similar probability distributions." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-117", "text": "While the nature of both works was exploratory, results shown on fairly large collections of bilingual documents (less than 20k documents) offer convincing argument of their potential." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-118", "text": "Expanding these approaches to much large collections of multilingual documents would require utilizing fast NN search for computing similarity in the probability simplex." 
}, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-119", "text": "While there are many other proposed approaches to the task of finding document translation pairs that represent documents in metric space, such as Krstovski and Smith (2011) which utilizes LSH for cosine distance, there is no evidence that they yield good results on documents of small lengths such as paragraphs and even sen-tences." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-120", "text": "In this section, we empirically show how to utilize approaches that deal with representing documents in the probability simplex without a significant loss in accuracy while significantly improving the processing time." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-121", "text": "We use PLTM representations of bilingual documents." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-122", "text": "In addition, we show how the results as reported by Platt et al. (2010) can be obtained using the PLTM representation with a significant speed improvement." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-123", "text": "As in (Platt et al., 2010) and (Mimno et al., 2009 ) the task is to find document translation pairs in a multilingual collection of documents by representing documents in the probability simplex and computing similarity between their probability distribution representation across all document pairs." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-124", "text": "For this experimental setup, accuracy is defined as the number of times (in percentage) that the target language document was discovered at rank 1 (i.e. % @Rank 1.) across the whole test collection." 
}, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-125", "text": "----------------------------------" }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-126", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-127", "text": "We use Mallet's (McCallum, 2002) implementation of the PLTM to train and infer topics on the same data set used in Platt et al. (2010) ." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-128", "text": "That paper used the Europarl (Koehn, 2005) (Mimno et al., 2009) , these performance comparisons are not done on the same training and test sets-a gap that we fill below." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-129", "text": "We train PLTM models with number of topics T set to 50, 100, 200, and 500." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-130", "text": "In order to compare exactly the same topic distributions when computing speed vs. accuracy of various approximate and exhaustive all-pairs comparisons we focus only on one inference approach -the Gibbs sampling and ignore the online VB approach as it yields similar performance." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-131", "text": "For all four topic models, we use the same settings for PLTM (hyperparameter values and number of Gibbs sampling iterations) as in (Mimno et al., 2009) 2 ." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-132", "text": "Topic distributions were then inferred on the test collection using the trained topics." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-133", "text": "We then performed all-pairs comparison using JS divergence, Hellinger distance, and approximate, LSH and kd-trees based, Hellinger distance." 
}, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-134", "text": "We measured the total time that it takes to perform exhaustive all-pairs comparison using JS divergence, the LSH and kdtrees version on a single machine consisting of a core 2 duo quad processors with a clock speed of 2.66GHz on each core and a total of 8GB of memory." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-135", "text": "Since the time performance of the E2LSH depends on the radius R of data set points considered for each query point (Indyk and Motwani, 1998) , we performed measurements with different values of R. For this task, the all-pairs JS code implementation first reads both source and target sets of documents and stores them in hash tables." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-136", "text": "We then go over each entry in the source table and compute divergence against all target table entries." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-137", "text": "We refer to this code implementation as hash map implementation." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-138", "text": "----------------------------------" }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-139", "text": "**EVALUATION TASK AND RESULTS**" }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-140", "text": "Performance of the four PLTM models and the performance across the four different similarity measurements was evaluated based on the percentage of document translation pairs (out of the whole test set) that were discovered at rank one." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-141", "text": "This same approach was used by (Platt et al., 2010) to show the absolute performance comparison." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-142", "text": "As in the case of the previous two tasks, in order to evaluate the approximate, LSH based, Hellinger distance we used values of R=0.4, R=0.6 and R=0.8." 
}, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-143", "text": "Since in (Platt et al., 2010) numbers were reported on the test speeches whose word length is greater or equal to 100, we used the same subset (total of 14150 speeches) of the original test collection." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-144", "text": "Shown in Table 1 are results across the four different measurements for all four PLTM models." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-145", "text": "When using regular JS divergence, our PLTM model with 200 topics performs the best with 99.42% of the top one ranked candidate translation documents being true translations." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-146", "text": "When using approximate, kd-trees based, Hellinger distance, we outperform regular JS and Hellinger divergence across all topics and for T=500 we achieve the best overall accuracy of 99.61%." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-147", "text": "We believe that this is due to the small amount of error in the search introduced by ANN, due to its approximate nature, which for this task yields positive results." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-148", "text": "On the same data set, (Platt et al., 2010) report accuracy of 98.9% using 50 topics, a slightly different prior distribution, and MAP instead of posterior inference." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-149", "text": "Shown in Table 2 are the relative differences in time between all pairs JS divergence, approximate kd-trees and LSH based Hellinger distance with different value of R. Rather than showing absolute speed numbers, which are often influenced by the processor configuration and available memory, we show relative speed improvements where we take the slowest running configuration as a referent value." 
}, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-150", "text": "In our case we assign the referent speed value of 1 to the configuration with T=500 and allpairs JS computation." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-151", "text": "Results shown are based on comparing running time of E2LSH and ANN against the all-pairs similarity comparison implementation that uses hash tables to store all documents in the bilingual collection which is significantly faster than the other code implementation." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-152", "text": "For the approximate, LSH based, Hellinger distance with T=100 we obtain a speed improvement of 24.2 times compared to regular all-pairs JS divergence while maintaining the same performance compared to Hellinger distance metric and insignificant loss over all-pairs JS divergence." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-153", "text": "From Table 2 it is evident that as we increase the radius R we reduce the relative speed of performance since the range of points that LSH considers for a given query point increases." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-154", "text": "Also, as the number of topics increases, the speed benefit is reduced for both the LSH and k-d tree techniques." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-155", "text": "----------------------------------" }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-156", "text": "**CONCLUSION**" }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-157", "text": "Hierarchical Bayesian models, such as Polylingual Topic Models, have been shown to offer great potential in analyzing multilingual collections, extracting aligned topics and finding document translation pairs when trained on sufficiently large aligned collections." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-158", "text": "Online stochastic optimization inference allows us to generate good parameter estimates." 
}, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-159", "text": "By combining these two approaches we are able to infer topic distributions across documents in large multilingual document collections in an efficient manner." }, { "sent_id": "bdeffcf02a86d06f57dbfae979b098-C001-160", "text": "Utilizing approximate NN search techniques in the probability simplex, we showed that fast document translation detection could be achieved with insignificant loss in accuracy." } ], "y": { "@USE@": { "gold_contexts": [ [ "bdeffcf02a86d06f57dbfae979b098-C001-25" ], [ "bdeffcf02a86d06f57dbfae979b098-C001-35" ], [ "bdeffcf02a86d06f57dbfae979b098-C001-115", "bdeffcf02a86d06f57dbfae979b098-C001-121" ], [ "bdeffcf02a86d06f57dbfae979b098-C001-127", "bdeffcf02a86d06f57dbfae979b098-C001-128" ], [ "bdeffcf02a86d06f57dbfae979b098-C001-131" ] ], "cite_sentences": [ "bdeffcf02a86d06f57dbfae979b098-C001-25", "bdeffcf02a86d06f57dbfae979b098-C001-35", "bdeffcf02a86d06f57dbfae979b098-C001-115", "bdeffcf02a86d06f57dbfae979b098-C001-128", "bdeffcf02a86d06f57dbfae979b098-C001-131" ] }, "@BACK@": { "gold_contexts": [ [ "bdeffcf02a86d06f57dbfae979b098-C001-43" ], [ "bdeffcf02a86d06f57dbfae979b098-C001-115" ], [ "bdeffcf02a86d06f57dbfae979b098-C001-128" ] ], "cite_sentences": [ "bdeffcf02a86d06f57dbfae979b098-C001-43", "bdeffcf02a86d06f57dbfae979b098-C001-115", "bdeffcf02a86d06f57dbfae979b098-C001-128" ] }, "@DIF@": { "gold_contexts": [ [ "bdeffcf02a86d06f57dbfae979b098-C001-43", "bdeffcf02a86d06f57dbfae979b098-C001-44", "bdeffcf02a86d06f57dbfae979b098-C001-53" ] ], "cite_sentences": [ "bdeffcf02a86d06f57dbfae979b098-C001-43" ] } } }, "ABC_5c5abc2773143af41d49087e17310e_27": { "x": [ { "sent_id": "5c5abc2773143af41d49087e17310e-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-2", "text": "Unsupervised approaches to multi-document summarization consist of two steps: finding a content model of the documents to be summarized, and then generating a summary that 
best represents the most salient information of the documents." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-3", "text": "In this paper, we present a sentence selection objective for extractive summarization in which sentences are penalized for containing content that is specific to the documents they were extracted from." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-4", "text": "We modify an existing system, HIERSUM (Haghighi & Vanderwende, 2009), to use our objective, which significantly outperforms the original HIERSUM in pairwise user evaluation." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-5", "text": "Additionally, our ROUGE scores advance the current state of the art for both supervised and unsupervised systems with statistical significance." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-6", "text": "----------------------------------" }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-8", "text": "Multi-document summarization is the task of generating a single summary from a set of documents that are related to a single topic." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-9", "text": "Summaries should contain information that is relevant to the main ideas of the entire document set, and should not contain information that is too specific to any one document." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-10", "text": "For example, a summary of multiple news articles about the Star Wars movies could contain the words \"Lucas\" and \"Jedi\", but should not contain the name of a fan who was interviewed in one article." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-11", "text": "Most approaches to this problem generate summaries extractively, selecting whole or partial sentences from the original text, then attempting to piece them together in a coherent manner."
}, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-12", "text": "Extracted text is selected based on its relevance to the main ideas of the document set." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-13", "text": "Summaries can be evaluated manually, or with automatic metrics such as ROUGE (Lin, 2004) ." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-14", "text": "The use of structured probabilistic topic models has made it possible to represent document set content with increasing complexity (Daum\u00e9 & Marcu, 2006; Tang et al., 2009; Celikyilmaz & HakkaniTur, 2010) ." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-15", "text": "Haghighi and Vanderwende (2009) demonstrated that these models can improve the quality of generic multi-document summaries over simpler surface models." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-16", "text": "Their most complex hierarchial model improves summary content by teasing out the words that are not general enough to represent the document set as a whole." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-17", "text": "Once those words are no longer included in the content word distribution, they are implicitly less likely to appear in the extracted summary as well. But this objective does not sufficiently keep document-specific content from appearing in multi-document summaries." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-18", "text": "In this paper, we present a selection objective that explicitly excludes document-specific content." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-19", "text": "We re-implement the HIERSUM system from Haghighi and Vanderwende (2009) , and show that using our objective dramatically improves the content of extracted summaries." 
}, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-20", "text": "----------------------------------" }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-21", "text": "**MODELING CONTENT**" }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-22", "text": "The easiest way to model document content is to find a probability distribution of all unigrams that appear in the original documents." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-23", "text": "The highest frequency words (after removing stop words) have a high likelihood of appearing in human-authored summaries (Nenkova & Vanderwende, 2005) ." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-24", "text": "However, the raw (Haghighi & Vanderwende, 2009)." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-25", "text": "unigram distribution may contain words that appear frequently in one document, but do not reflect the content of the document set as a whole." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-26", "text": "Probabilistic topic models provide a more principled approach to finding a distribution of content words." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-27", "text": "This idea was first presented by Daum\u00e9 and Marcu (2006) for their BAYESUM system for query-focused summarization, and later adapted for non-query summarization in the TOPICSUM system by Haghighi and Vanderwende (2009) ." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-28", "text": "1 In these systems, each word from the original documents is drawn from one of three vocabulary distributions." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-29", "text": "The first, \u03c6 b , is the background distribution of general English words." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-30", "text": "The second, \u03c6 d , contains vocabulary that is specific to that one document. 
And the third, \u03c6_c, is the distribution of content words for that document set, and contains relevant words that should appear in the generated summary." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-31", "text": "HIERSUM (Haghighi & Vanderwende, 2009) adds more structure to TOPICSUM by further splitting the content distribution into multiple sub-topics." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-32", "text": "The content words in each sentence can be generated by either the general content topic or the content sub-topic for that sentence, and the words from the general content distribution are considered when building the summary." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-33", "text": "1 The original BAYESUM can also be used without a query, in which case BAYESUM and TOPICSUM are exactly the same model." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-34", "text": "----------------------------------" }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-35", "text": "**KL SELECTION**" }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-36", "text": "The KL-divergence between two unigram word distributions P and Q is given by KL(P||Q) = \u03a3_w P(w) log [P(w)/Q(w)]." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-37", "text": "This quantity is used for summary sentence selection in several systems, including Lerman and McDonald (2009) and Haghighi and Vanderwende (2009), and was used as a feature in the discriminative sentence ranking of Daum\u00e9 and Marcu (2006)." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-38", "text": "TOPICSUM and HIERSUM use the following KL objective, which finds S*, the summary that minimizes the KL-divergence between the estimated content distribution \u03c6_c and the summary word distribution P_S:" }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-39", "text": "A greedy approximation is used to find S*."
}, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-40", "text": "Starting with an empty summary, sentences are greedily added to the summary one at a time until the summary has reached the maximum word limit, L. The values of P S are smoothed uniformly in order to ensure finite values of KL(\u03c6 c ||P S )." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-41", "text": "----------------------------------" }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-42", "text": "**WHY DOCUMENT-SPECIFIC WORDS ARE A PROBLEM**" }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-43", "text": "The KL selection objective effectively ensures the presence of highly weighted content words in the generated summary. But it is asymmetric in that it allows a high proportion of words in the summary to be words that appear infrequently, or not at all, in the content word distribution." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-44", "text": "This asymmetry is the reason why the KL selection metric does not sufficiently keep document-specific words out of the generated summary." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-45", "text": "Consider what happens when a document-specific word is included in summary S. Assume that the word w i does not appear (has zero probability) in the content word distribution \u03c6 c , but does appear in the document-specific distribution \u03c6 d for document d. Then w i appearing in S has very little impact on KL(\u03c6 c ||P S ) = j \u03c6 c (w j ) log \u03c6c(w j ) P S (w j ) because \u03c6 c (w i ) = 0." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-46", "text": "There will be a slight impact because the presence of the word w i in S will cause the probability of other words in the summary to be sligntly smaller." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-47", "text": "But in a summary of length 250 words (the length used for the DUC summarization task) the difference is negligible." 
}, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-48", "text": "The reason why we do not simply substitute a symmetrical metric for comparing distributions (e.g., Information Radius) is because we want the selection objective to disprefer only document-specific words." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-49", "text": "Specifically, the selection objective should not disprefer background English vocabulary." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-50", "text": "----------------------------------" }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-51", "text": "**KL(C)-KL(D) SELECTION**" }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-52", "text": "In contrast to the KL selection objective, our objective measures the similarity of both content and document-specific word distributions to the extracted summary sentences." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-53", "text": "We combine these measures linearly:" }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-54", "text": "Our objective can be understood in comparison to the MMR criterion by (Carbonell & Goldstein, 1998) , which also utilizes a linear metric in order to maximize informativeness of summaries while minimizing some unwanted quality of the extracted sentences (in their case, redundancy)." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-55", "text": "In contrast, our criterion utilizes information about what kind of information should not be included in the summary, which to our knowledge has not been done in previous summarization systems." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-56", "text": "2 For comparison to the previous KL objective, we also use a greedy approximation for S * ." 
}, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-57", "text": "However, because we are extracting sentences from many documents, the distribution \u03c6 d is actually several distributions, a separate distribution for each document in the document set." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-58", "text": "The implementation we used in our experiments is that, as we consider a sentence s to be added to the previously selected sentences S, we set \u03c6 d to be the document-specific distribution of the document that s has been extracted from." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-59", "text": "So each time we add a sentence to the summary, we find the sentence that minimizes KL(\u03c6 c ||P S\u222as ) \u2212 KL(\u03c6 d(s) ||P S\u222as )." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-60", "text": "Another implementation we tried was combining all of the \u03c6 d distributions into one distribution, but we did not notice any difference in the extracted summaries." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-61", "text": "----------------------------------" }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-62", "text": "**EVALUATION**" }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-63", "text": "----------------------------------" }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-64", "text": "**DATA**" }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-65", "text": "We developed our sentence selection objective using data from the Document Understanding Conference 3 (DUC) 2006 summarization task, and used data from DUC 2007 task for evaluations." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-66", "text": "In these tasks, the system is given a set of 25 news articles related to an event or topic, and needs to generate a summary of under 250 words from those documents." 
}, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-67", "text": "4 For each document set, four humanauthored summaries are provided for use with evaluations." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-68", "text": "The DUC 2006 data has 50 document sets, and the DUC 2007 data has 45 document sets." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-69", "text": "----------------------------------" }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-70", "text": "**AUTOMATIC EVALUATION**" }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-71", "text": "Systems are automatically evalatued using ROUGE (Lin, 2004) , which has good correlation with human judgments of summary content." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-72", "text": "ROUGE compares n-gram recall between system-generated summaries, and human-authored reference summaries." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-73", "text": "The first two metrics we compare are unigram and bigram recall, R-1 and R-2, respectively." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-74", "text": "The last metric, R-SU4, measures recall of skip-4 bigrams, which may skip one or two words in between the two words to be measured." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-75", "text": "We set ROUGE to stem both the system and reference summaries, scale our results by 10 2 and present scores with and without stopwords removed." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-76", "text": "The ROUGE scores of the original HIERSUM system are given in the first row of table 1, followed by the scores of HIERSUM using our KL(c-d) selection." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-77", "text": "The KL(c-d) selection outperforms the KL selection in each of the ROUGE metrics shown." 
}, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-78", "text": "In fact, these results are statistically significant over the baseline KL selection for all but the unigram metrics (R-1 with and without stopwords)." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-79", "text": "These results show that our KL(c-d) selection yields significant improvements in terms of ROUGE performance, since having fewer irrelevant words in the summaries leaves room for words that are more relevant to the content topic, and therefore more likely to appear in the reference summaries." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-80", "text": "The last two rows of Table 1 : ROUGE scores on the DUC 2007 document sets." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-81", "text": "The first two rows compare the results of the unigram HIERSUM system with its original and our improved selection metrics." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-82", "text": "Bolded scores represent where our system has a significant improvement over the orignal HIERSUM." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-83", "text": "For further comparison, the last two rows show the ROUGE scores of two other state-of-the-art multi-document summarization systems (Toutanova et al., 2007; Celikyilmaz & Hakkani-Tur, 2010 )." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-84", "text": "See section 6.2 for more details." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-85", "text": "marization systems." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-86", "text": "Both of these systems select sentences discriminatively on many features in order to maximize ROUGE scores." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-87", "text": "The first, PYTHY (Toutanova et al., 2007) , trains on dozens of sentence-level features, such as n-gram and skipgram frequency, named entities, sentence length and position, and also utilizes sentence compression." 
}, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-88", "text": "The second, HYBHSUM (Celikyilmaz & HakkaniTur, 2010) , uses a nested Chinese restaurant process (Blei et al., 2004) to model a hierarchical content distribution with more complexity than HIERSUM, and uses a regression model to predict scores for new sentences." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-89", "text": "For both of these systems, our summaries are significantly better for R-2 and R-SU4 without stopwords, and comparable in all other metrics." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-90", "text": "5 These results show that our selection objective can make a simple unsupervised model competitive with more complicated supervised models." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-91", "text": "----------------------------------" }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-92", "text": "**MANUAL EVALUATION**" }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-93", "text": "For manual evaluation, we performed a pairwise comparison of summaries generated by HIERSUM with both the original and our modified sentence selection objective." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-94", "text": "Users were given the two summaries to compare, plus a human-generated reference summary." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-95", "text": "The order that the summaries appeared in was random." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-96", "text": "We asked users to select which summary was better for the following ques-5 Haghighi and Vanderwende (2009) Q1 Which was better in terms of overall content?" }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-97", "text": "Q2 Which summary had less repetition?" }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-98", "text": "Q3 Which summary was more coherent?" }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-99", "text": "Q4 Which summary had better focus?" 
}, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-100", "text": "We took 87 pairwise preferences from participants over Mechanical Turk." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-101", "text": "7 The results of our evaluation are shown in table 2." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-102", "text": "For all attributes, our criterion performs better than the original HIERSUM selection criterion, and our results for Q1 and Q3 are significantly better as determined by Fisher sign test (two-tailed P value < 0.01)." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-103", "text": "These results confirm that our objective noticably improves the content of extractive summaries by selecting sentences that contain less document-specific information." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-104", "text": "This leaves more room in the summary for content that is relevant to the main idea of the document set (Q1) and keeps out content that is not relevant (Q4)." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-105", "text": "Additionally, although neither criterion explicitly addresses coherence, we found that a significant proportion of users found our summaries to be more coherent (Q3)." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-106", "text": "We believe this may be the case because the presence of document-specific information can distract from the main ideas of the summary, and make it less likely that the extracted sentences will flow together." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-107", "text": "There is no immediate explanation for why users found our our summaries less repetitive (Q2), since if anything the narrowing of topics due to the negative KL(\u03c6 d ||P S ) term should make for more repetition." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-108", "text": "We currently hypothesize that the improved score is simply a spillover from the general improvement in document quality." 
}, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-109", "text": "----------------------------------" }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-110", "text": "**CONCLUSION**" }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-111", "text": "We have described a new objective for sentence selection in extractive multi-document summarization, which is different in that it explicitly gives negative weight to sentences that contain document-specific words." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-112", "text": "Our objective significantly improves the performance of an existing summarization system, and improves on current best ROUGE scores with significance." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-113", "text": "We have observed that while the content in our extracted summaries is often comparable to the content in human-written summaries, the extracted summaries are still far weaker in terms of coherence and repetition." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-114", "text": "Even though our objective significantly improves coherence, more sophisticated methods of decoding are still needed to produce readable summaries." }, { "sent_id": "5c5abc2773143af41d49087e17310e-C001-115", "text": "These problems could be addressed through further refinement of the selection objective, through simplification or compression of selected sentences, and through improving the coherence of generated summaries." 
} ], "y": { "@EXT@": { "gold_contexts": [ [ "5c5abc2773143af41d49087e17310e-C001-4" ], [ "5c5abc2773143af41d49087e17310e-C001-19" ] ], "cite_sentences": [ "5c5abc2773143af41d49087e17310e-C001-4", "5c5abc2773143af41d49087e17310e-C001-19" ] }, "@BACK@": { "gold_contexts": [ [ "5c5abc2773143af41d49087e17310e-C001-15" ], [ "5c5abc2773143af41d49087e17310e-C001-24", "5c5abc2773143af41d49087e17310e-C001-25" ], [ "5c5abc2773143af41d49087e17310e-C001-27" ], [ "5c5abc2773143af41d49087e17310e-C001-31" ], [ "5c5abc2773143af41d49087e17310e-C001-37" ] ], "cite_sentences": [ "5c5abc2773143af41d49087e17310e-C001-15", "5c5abc2773143af41d49087e17310e-C001-24", "5c5abc2773143af41d49087e17310e-C001-27", "5c5abc2773143af41d49087e17310e-C001-31", "5c5abc2773143af41d49087e17310e-C001-37" ] }, "@USE@": { "gold_contexts": [ [ "5c5abc2773143af41d49087e17310e-C001-96" ] ], "cite_sentences": [ "5c5abc2773143af41d49087e17310e-C001-96" ] } } }, "ABC_fe443d5e13b525cbdfa58dafb83162_28": { "x": [ { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-2", "text": "We present a simple, language-independent method for integrating recovery of empty elements into syntactic parsing." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-3", "text": "This method outperforms the best published method we are aware of on English and a recently published method on Chinese." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-4", "text": "----------------------------------" }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-5", "text": "**INTRODUCTION**" }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-6", "text": "Empty elements in the syntactic analysis of a sentence are markers that show where a word or phrase might otherwise be expected to appear, but does not." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-7", "text": "They play an important role in understanding the grammatical relations in the sentence." 
}, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-8", "text": "For example, in the tree of Figure 2a , the first empty element (*) marks where John would be if believed were in the active voice (someone believed. . .), and the second empty element (*T*) marks where the man would be if who were not fronted (John was believed to admire who?)." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-9", "text": "Empty elements exist in many languages and serve different purposes." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-10", "text": "In languages such as Chinese and Korean, where subjects and objects can be dropped to avoid duplication, empty elements are particularly important, as they indicate the position of dropped arguments." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-11", "text": "Figure 1 gives an example of a Chinese parse tree with empty elements." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-12", "text": "The first empty element (*pro*) marks the subject of the whole sentence, a pronoun inferable from context." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-13", "text": "The second empty element (*PRO*) marks the subject of the dependent VP (sh\u00edsh\u012b f\u01cel\u01dc ti\u00e1ow\u00e9n)." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-14", "text": "The Penn Treebanks (Marcus et al., 1993; Xue et al., 2005) Figure 1 : Chinese parse tree with empty elements marked." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-15", "text": "The meaning of the sentence is, \"Implementation of the law is temporarily suspended.\" notable exceptions." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-38", "text": "processing community (Hall, 2005; Chappelier et al., 1999) , and was recently applied to the task of joint clitic-segmentation and syntactic-parsing in Hebrew (Goldberg and Tsarfaty, 2008; Goldberg and Elhadad, 2011) and Arabic (Green and Manning, 2010) ." 
}, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-39", "text": "Here, we use lattice parsing for emptyelement recovery." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-40", "text": "We use a modified version of the Berkeley parser which allows handling lattices as input." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-41", "text": "2 The modification is fairly straightforward: Each lattice arc correspond to a lexical item." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-42", "text": "Lexical items are now indexed by their start and end states rather than by their sentence position, and the initialization procedure of the CKY chart is changed to allow lexical items of spans greater than 1." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-43", "text": "We then make the necessary adjustments to the parsing algorithm to support this change: trying rules involving preterminals even when the span is greater than 1, and not relying on span size for identifying lexical items." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-44", "text": "At test time, we first construct a lattice for each test sentence that allows 0, 1, or 2 empty symbols (\u03f5) between each pair of words or at the start/end of the sentence." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-45", "text": "Then we feed these lattices through our lattice parser to produce trees with empty elements." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-46", "text": "Finally, we reverse the transformations that had been applied to the training data." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-47", "text": "----------------------------------" }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-48", "text": "**EVALUATION MEASURES**" }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-49", "text": "Evaluation metrics for empty-element recovery are not well established, and previous studies use a variety of metrics." 
}, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-50", "text": "We review several of these here and additionally propose a unified evaluation of parsing and empty-element recovery." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-51", "text": "3 If A and B are multisets, let A(x) be the number of occurrences of x in A, let |A| = \u2211 x A(x), and let A \u2229 B be the multiset such that (A \u2229 B)(x) = min(A(x), B(x))." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-52", "text": "If T is the multiset of \"items\" in the trees being tested and G is the multiset of \"items\" in the gold-standard trees, then" }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-53", "text": "The modified parser is available at http://www.cs.bgu." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-54", "text": "where \"items\" are defined differently for each metric, as follows." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-16", "text": "Johnson (2002) studied emptyelement recovery in English, followed by several others (Dienes and Dubey, 2003; Campbell, 2004; Gabbard et al., 2006) ; the best results we are aware of are due to Schmid (2006) ." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-17", "text": "Recently, empty-element recovery for Chinese has begun to receive attention: Yang and Xue (2010) treat it as classification problem, while Chung and Gildea (2010) pursue several approaches for both Korean and Chinese, and explore applications to machine translation." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-18", "text": "Our intuition motivating this work is that empty elements are an integral part of syntactic structure, and should be constructed jointly with it, not added in afterwards." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-19", "text": "Moreover, we expect empty-element recovery to improve as the parsing quality improves." 
}, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-20", "text": "Our method makes use of a strong syntactic model, the PCFGs with latent annotation of Petrov et al. (2006) , which we extend to predict empty cate-gories by the use of lattice parsing." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-21", "text": "The method is language-independent and performs very well on both languages we tested it on: for English, it outperforms the best published method we are aware of (Schmid, 2006) , and for Chinese, it outperforms the method of Yang and Xue (2010) ." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-22", "text": "1" }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-23", "text": "----------------------------------" }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-24", "text": "**METHOD**" }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-25", "text": "Our method is fairly simple." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-26", "text": "We take a state-of-theart parsing model, the Berkeley parser (Petrov et al., 2006) , train it on data with explicit empty elements, and test it on word lattices that can nondeterministically insert empty elements anywhere." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-27", "text": "The idea is that the state-splitting of the parsing model will enable it to learn where to expect empty elements to be inserted into the test sentences." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-28", "text": "Tree transformations Prior to training, we alter the annotation of empty elements so that the terminal label is a consistent symbol (\u03f5), the preterminal label is the type of the empty element, and -NONEis deleted (see Figure 2b )." 
}, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-29", "text": "This simplifies the lattices because there is only one empty symbol, and helps the parsing model to learn dependencies between nonterminal labels and empty-category types because there is no intervening -NONE-." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-30", "text": "Then, following Schmid (2006) , if a constituent contains an empty element that is linked to another node with label X, then we append /X to its label." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-31", "text": "If there is more than one empty element, we process them bottom-up (see Figure 2b )." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-32", "text": "This helps the parser learn to expect where to find empty elements." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-33", "text": "In our experiments, we did this only for elements of type *T*. Finally, we train the Berkeley parser on the preprocessed training data." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-34", "text": "Lattice parsing Unlike the training data, the test data does not mark any empty elements." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-35", "text": "We allow the parser to produce empty elements by means of lattice-parsing (Chappelier et al., 1999) , a generalization of CKY parsing allowing it to parse a wordlattice instead of a predetermined list of terminals." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-36", "text": "Lattice parsing adds a layer of flexibility to existing parsing technology, and allows parsing in situations where the yield of the tree is not known in advance." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-37", "text": "Lattice parsing originated in the speech 1 Unfortunately, not enough information was available to carry out comparison with the method of Chung and Gildea (2010)." 
}, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-55", "text": "Define a nonterminal node, for present purposes, to be a node which is neither a terminal nor preterminal node." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-56", "text": "The standard PARSEVAL metric (Black et al., 1991) counts labeled nonempty brackets: items are (X, i, j) for each nonempty nonterminal node, where X is its label and i, j are the start and end positions of its span." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-57", "text": "Yang and Xue (2010) simply count unlabeled empty elements: items are (i, i) for each empty element, where i is its position." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-58", "text": "If multiple empty elements occur at the same position, they only count the last one." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-59", "text": "The metric originally proposed by Johnson (2002) counts labeled empty brackets: items are (X/t, i, i) for each empty nonterminal node, where X is its label and t is the type of the empty element it dominates, but also (t, i, i) for each empty element not dominated by an empty nonterminal node." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-60", "text": "4 The following structure has an empty nonterminal dominating two empty elements:" }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-61", "text": ". . . . (SBAR-S/*T*, i, i)." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-62", "text": "5 We tried to follow Schmid in a generic way: we collapse any vertical chain of empty nonterminals into a single nonterminal." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-63", "text": "In order to avoid problems associated with cases like this, we suggest a pair of simpler metrics." 
}, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-64", "text": "The first is to count labeled empty elements, i.e., items are (t, i, i) for each empty element, and the second, similar in spirit to SParseval (Roark et al., 2006) , is to count all labeled brackets, i.e., items are (X, i, j) for each nonterminal node (whether nonempty or empty)." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-65", "text": "These two metrics, together with part-ofspeech accuracy, cover all possible nodes in the tree." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-66", "text": "----------------------------------" }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-67", "text": "**EXPERIMENTS AND RESULTS**" }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-68", "text": "English As is standard, we trained the parser on sections 02-21 of the Penn Treebank Wall Street Journal corpus, used section 00 for development, and section 23 for testing." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-69", "text": "We ran 6 cycles of training; then, because we were unable to complete the 7th split-merge cycle with the default setting of merging 50% of splits, we tried increasing merges to 75% and ran 7 cycles of training." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-70", "text": "Table 1 presents our results." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-71", "text": "We chose the parser settings that gave the best labeled empty elements F 1 on the dev set, and used these settings for the test set." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-72", "text": "We outperform the state of the art at recovering empty elements, as well as achieving state of the art accuracy at recovering phrase structure." 
}, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-73", "text": "----------------------------------" }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-74", "text": "**5**" }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-75", "text": "This difference is not small; scores using Schmid's metric are lower by roughly 1%." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-76", "text": "There are other minor differences in Schmid's metric which we do not detail here." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-77", "text": "Chinese We also experimented on a subset of the Penn Chinese Treebank 6.0." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-78", "text": "For comparability with previous work (Yang and Xue, 2010) , we trained the parser on sections 0081-0900, used sections 0041-0080 for development, and sections 0001-0040 and 0901-0931 for testing." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-79", "text": "The results are shown in Table 2 ." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-80", "text": "We selected the 6th split-merge cycle based on the labeled empty elements F 1 measure." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-81", "text": "The unlabeled empty elements column shows that our system outperforms the baseline system of Yang and Xue (2010) ." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-82", "text": "We also analyzed the emptyelement recall by type (Table 3) ." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-83", "text": "Our system outperformed that of Yang and Xue (2010) especially on *pro*, used for dropped arguments, and *T*, used for relative clauses and topicalization." 
}, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-84", "text": "----------------------------------" }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-85", "text": "**DISCUSSION AND FUTURE WORK**" }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-86", "text": "The empty-element recovery method we have presented is simple, highly effective, and fully integrated with state of the art parsing." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-87", "text": "We hope to exploit cross-lingual information about empty elements in machine translation." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-88", "text": "Chung and Gildea (2010) have shown that such information indeed helps translation, and we plan to extend this work by handling more empty categories (rather Table 3 : Recall on different types of empty categories." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-89", "text": "YX = (Yang and Xue, 2010) , Ours = split 6\u00d7." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-90", "text": "than just *pro* and *PRO*), and to incorporate them into a syntax-based translation model instead of a phrase-based model." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-91", "text": "We also plan to extend our work here to recover coindexation information (links between a moved element and the trace which marks the position it was moved from)." }, { "sent_id": "fe443d5e13b525cbdfa58dafb83162-C001-92", "text": "As a step towards shallow semantic analysis, this may further benefit other natural language processing tasks such as machine translation and summary generation." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "fe443d5e13b525cbdfa58dafb83162-C001-17" ], [ "fe443d5e13b525cbdfa58dafb83162-C001-57" ] ], "cite_sentences": [ "fe443d5e13b525cbdfa58dafb83162-C001-17", "fe443d5e13b525cbdfa58dafb83162-C001-57" ] }, "@DIF@": { "gold_contexts": [ [ "fe443d5e13b525cbdfa58dafb83162-C001-21" ], [ "fe443d5e13b525cbdfa58dafb83162-C001-81" ], [ "fe443d5e13b525cbdfa58dafb83162-C001-83" ] ], "cite_sentences": [ "fe443d5e13b525cbdfa58dafb83162-C001-21", "fe443d5e13b525cbdfa58dafb83162-C001-81", "fe443d5e13b525cbdfa58dafb83162-C001-83" ] }, "@UNSURE@": { "gold_contexts": [ [ "fe443d5e13b525cbdfa58dafb83162-C001-78" ] ], "cite_sentences": [ "fe443d5e13b525cbdfa58dafb83162-C001-78" ] } } }, "ABC_622bd6f16d55ab5853389286cdda56_28": { "x": [ { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-2", "text": "While there is increasing interest in automatically recognizing the argumentative structure of a text, recognizing the argumentative purpose of revisions to such texts has been less explored." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-3", "text": "Furthermore, existing revision classification approaches typically ignore contextual information." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-4", "text": "We propose two approaches for utilizing contextual information when predicting argumentative revision purposes: developing contextual features for use in the classification paradigm of prior work, and transforming the classification problem to a sequence labeling task." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-5", "text": "Experimental results using two corpora of student essays demonstrate the utility of contextual information for predicting argumentative revision purposes." 
}, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-6", "text": "----------------------------------" }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-8", "text": "Incorporating natural language processing into systems that provide writing assistance beyond grammar is an area of increasing research and commercial interest (e.g., (Writelab, 2015; Roscoe et al., 2015) )." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-9", "text": "As one example, the automatic recognition of the purpose of each of an author's revisions allows writing assistance systems to provide better rewriting suggestions." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-10", "text": "In this paper, we propose contextbased methods to improve the automatic identification of revision purposes in student argumentative writing." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-11", "text": "Argumentation plays an important role in analyzing many types of writing such as persuasive essays , scientific papers (Teufel, 2000) and law documents (Palau and Moens, 2009 )." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-12", "text": "In student papers, identifying revision purposes with respect to argument structure has been used to predict the grade improvement in the paper after revision (Zhang and Litman, 2015) ." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-13", "text": "Existing works on the analysis of writing revisions (Adler et al., 2011; Bronner and Monz, 2012; Daxenberger and Gurevych, 2013; Zhang and Litman, 2015) typically compare two versions of a text to extract revisions, then classify the purpose of each revision in isolation." 
}, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-14", "text": "That is, while limited contextual features such as revision location have been utilized in prior work, such features are computed from the revision being classified but typically not its neighbors." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-15", "text": "In addition, ordinary classifiers rather than structured prediction models are typically used." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-16", "text": "To increase the role of context during prediction, in this paper we 1) introduce new contextual features (e.g., the impact of a revision on local text cohesion), and 2) transform revision purpose classification to a sequential labeling task to capture dependencies among revisions (as in Table 1 )." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-17", "text": "An experimental evaluation demonstrates the utility of our approach." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-18", "text": "----------------------------------" }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-19", "text": "**RELATED WORK**" }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-20", "text": "There are multiple works on the classification of revisions (Adler et al., 2011; Javanmardi et al., 2011; Bronner and Monz, 2012; Daxenberger and Gurevych, 2013; Zhang and Litman, 2015) ." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-21", "text": "While different classification tasks were explored, similar approaches were taken by extracting features (location, text, meta-data, language) from the revised text to train a classification model (SVM, Random Forest, etc.) on the annotated data." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-22", "text": "One problem with prior works is that the contextual features used were typically shallow (location), while we cap- and sentence 2 acts as the Warrant for the Claim." 
}, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-23", "text": "Sentence 1 in Draft 1 is modified to sentence 1 (also acts as the Claim) of Draft 2." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-24", "text": "Sentence 2 in Draft 1 is deleted in Draft 2." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-47", "text": "Corpus B was collected in the same manor as A with agreement Kappa 0.69." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-25", "text": "The first revision is a Claim revision as it modifies the Claim of the paper by removing \"rhetorical questions.\" This leads to the second Warrant revision, which deletes the Warrant for \"rhetorical questions." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-26", "text": "\" ture additional contextual information as text cohesion/coherence changes and revision dependencies." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-27", "text": "As our task focuses on identifying the argumentative purpose of writing revisions, work in argument mining is also relevant." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-28", "text": "In fact, many features for predicting argument structure (e.g., location, discourse connectives, punctuation) Moens et al., 2007; Palau and Moens, 2009; Feng and Hirst, 2011) are also used in revision classification." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-29", "text": "In addition, Lawrence et al. (2014) use changes in topic to detect argumentation, which leads us to hypothesize that different types of argumentative revisions will have different impacts on text cohesion and coherence." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-30", "text": "Guo et al. (2011) and Park et al. (2015) both utilize Conditional Random Fields (CRFs) for identifying argumentative structures." 
}, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-31", "text": "While we focus on the different task of identifying revisions to argumentation, we similarly hypothesize that dependencies exist between revisions and thus utilize CRFs in our task." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-32", "text": "While our task is similar to argument mining, a key difference is that the revisions do not always appear near each other." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-33", "text": "For example, a 5-paragraph long essay might have only two or three revisions located at different paragraphs." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-34", "text": "Thus, the types of previous revisions cannot always be used as the contextual information." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-35", "text": "Moreover, the type of the revision is not necessarily the argument type of its revised sentence." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-36", "text": "For example, a revision on the evidence argument can be just a correction of spelling mistakes." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-37", "text": "----------------------------------" }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-38", "text": "**DATA DESCRIPTION**" }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-39", "text": "Revision purposes." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-40", "text": "To label our data, we adapt the schema defined in (Zhang and Litman, 2015) as it can be reliably annotated and is argument- (Faigley and Witte, 1981) ." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-41", "text": "As we focus on argumentative changes, we merge all the Surface subcategories into one Surface category." 
}, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-42", "text": "As Zhang and Litman (2015) reported that both Rebuttals and multiple labels for a single revision were rare, we merge Rebuttal and Warrant into one Warrant category 1 and allow only a single (primary) label per revision." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-43", "text": "Corpora." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-44", "text": "Our experiments use two corpora consisting of Drafts 1 and 2 of papers written by high school students taking AP-English courses; papers were revised after receiving and generating peer feedback." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-45", "text": "Corpus A was collected in our earlier pa-per (Zhang and Litman, 2015) , although the original annotations were modified as described above." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-46", "text": "It contains 47 paper draft pairs about placing contemporaries in Dante's Inferno." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-72", "text": "**TRANSFORMING TO SEQUENCE LABELING**" }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-48", "text": "It contains 63 paper draft pairs explaining the rhetorical strategies used by the speaker/author of a previously read lecture/essay." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-49", "text": "Both corpora were double coded and gold standard labels were created upon agreement of two annotators." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-50", "text": "Two example annotated revisions from Corpus B are shown in Table 1 , while the distribution of annotated revision purposes for both corpora are shown in Table 2 ." 
}, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-51", "text": "4 Utilizing Context" }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-52", "text": "----------------------------------" }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-53", "text": "**ADDING CONTEXTUAL FEATURES**" }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-54", "text": "Our previous work (Zhang and Litman, 2015) used three types of features primarily from prior work (Adler et al., 2011; Bronner and Monz, 2012; Daxenberger and Gurevych, 2013) for argumentative revision classification." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-55", "text": "Location features encode the location of the sentence in the paragraph and the location of the sentence's paragraph in the essay." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-56", "text": "Textual features encode revision operation, sentence length, edit distance between aligned sentences and the difference in sentence length and punctuation numbers." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-57", "text": "Language features encode part of speech (POS) unigrams and difference in POS tag counts." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-58", "text": "We implement this feature set as the baseline as our tasks are similar, then propose two new types of contextual features." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-59", "text": "The first type (Ext) extends prior work by extracting the baseline features from not only the aligned sentence pair representing the revision in question, but also for the sentence pairs before and after the revision." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-60", "text": "The second type (Coh) measures the cohesion and coherence changes in a 2-sentence block around the revision 2 ." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-61", "text": "Utilizing the cohesion and coherence difference." 
}, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-62", "text": "Inspired by (Lee et al., 2015; Vaughan and McDonald, 1986) , we hypothesize that different revisions can have different impacts on the cohesion and coherence of the essay." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-63", "text": "We propose to extract features for both impact on cohesion (lexical) and impact on coherence (semantic)." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-64", "text": "Inspired by (Hearst, 1997), sequences of blocks are created for sentences 2 In this paper we consider the most adjacent sentence only." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-65", "text": "each sentence block after stop-word filtering and stemming." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-66", "text": "Jaccard similarity is used for the calculation of lexical similarity between sentence blocks." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-67", "text": "Word embedding vectors (Mikolov et al., 2013) are used for the calculation of semantic similarity." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-68", "text": "A vector is calculated for each sentence block by summing up the embedding vectors of words that are not stop-words 5 ." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-69", "text": "Afterwards the similarity is calculated as the cosine similarity between the block vectors." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-70", "text": "This approach has been taken by multiple groups in the SemEval-2015 semantic similarity task (SemEval-2015 Task 1) (Xu et al., 2015) ." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-71", "text": "----------------------------------" }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-73", "text": "To capture dependencies among predicted revisions, we transform the revisions to a consecutive sequence and label it with Conditional Random Fields (CRFs) as demonstrated in Figure 2 ." 
}, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-74", "text": "For both drafts, sentences are sorted according to their order of occurrence in the essay." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-75", "text": "Aligned sentences are put into the same row and each aligned pair of sentences is treated as a unit of revision." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-76", "text": "The \"cross-aligned\" pairs of sentences 6 (which does not often occur) are broken into deleted and added sentences (i.e, the cross-aligned sentences in Draft 1 are treated as deleted and the sentences in Draft 2 are treated as added.)." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-77", "text": "After generating the sequence, each revision unit in the sequence is assigned the revision purpose label according to the annotations, with unchanged sentence pairs labeled as Nochange." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-78", "text": "We conducted labeling on both essay-level and paragraph-level sequences." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-79", "text": "The essay-level treats the whole essay as a sequence segment while the paragraph-level treats each paragraph as a segment." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-80", "text": "After labeling, the label of each changed sentence pair is marked as the purpose of the revision 7 ." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-81", "text": "----------------------------------" }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-82", "text": "**EXPERIMENTS AND RESULTS**" }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-83", "text": "Our prior work (Zhang and Litman, 2014) proposed an approach for the alignment of sentences." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-84", "text": "The approach achieves 92% accuracy on both corpora." 
}, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-85", "text": "In this paper we focus on the prediction task and assume we have gold-standard sentence alignments 8 ." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-86", "text": "The first four columns of Table 3 show the performance of baseline features with and without our new contextual features using an SVM prediction model 9 ." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-87", "text": "The last four columns show the performance of CRFs 10 ." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-88", "text": "All experiments are conducted using 10-fold (student) cross-validation with 300 features selected using learning gain ratio 11 ." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-89", "text": "For the SVM approach, we observe that the Coh features yield a significant improvement over the baseline features in Corpus B, and a nonsignificant improvement in Corpus A. This indicates that changes in text cohesion and coherence can in-7 Revisions on cross-aligned pairs are marked as Surface." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-90", "text": "8 Similar to settings in (Daxenberger and Gurevych, 2013) 9 We compared three models used in discourse analysis and revision classification (C4.5 Decision Tree, SVM and Random Forests) Bronner and Monz, 2012; and SVM yielded the best performance." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-91", "text": "10 SVM model implemented with Weka (Hall et al., 2009) and CRF model implemented with CRFSuite (Okazaki, 2007) 11 We tested with parameters 100, 200, 300, 500 on a development dataset disjoint from Corpora A and B and chose 300 which yielded the best performance." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-92", "text": "deed improve the prediction of argumentative revision types." 
}, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-93", "text": "The Ext feature set -which computes features for not only the revision but also its immediately adjacent sentences -also yields a slight (although not significant) improvement." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-94", "text": "However, adding the two feature sets together does not further improve the performance using the SVM model." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-95", "text": "The CRF approach almost always yields the best results for both corpora, with all such CRF results better than all other results." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-96", "text": "This indicates that dependencies exist among argumentative revisions that cannot be identified with traditional classification approaches." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-97", "text": "----------------------------------" }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-98", "text": "**ERROR ANALYSIS**" }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-99", "text": "To have a better understanding of how the sequence labeling approach improves the classification performance, we counted the errors of the cross-validation results on Corpus A (where the revisions are more evenly distributed)." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-100", "text": "Figure 3 demonstrates the comparison of errors made by SVM and CRFs 12 ." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-101", "text": "We notice that the CRF approach makes less errors than the SVM approach in recognizing Claim changes (General-Claim, Evidence-Claim, WarrantClaim, Surface-Claim)." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-102", "text": "This matches our intuition that there exists dependency between revisions on supporting materials and revisions on Claim." 
}, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-103", "text": "We also observe that same problems exist in both approaches." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-104", "text": "The biggest difficulty is the differentiation between General and Warrant revisions, which counts 37.6% of the SVM errors and 40.1% of CRFs errors." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-105", "text": "It is also common that Claim and Evidence 12 Both use models with all the features." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-106", "text": "revisions are classified as Warrant revisions." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-107", "text": "Approaches need to be designed for such cases to further improve the classification performance." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-108", "text": "----------------------------------" }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-109", "text": "**CONCLUSION**" }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-110", "text": "In this paper we proposed different methods for utilizing contextual information when predicting the argumentative purpose of revisions in student writing." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-111", "text": "Adding features that captured changes in text cohesion and coherence, as well as using sequence modeling to capture revision dependencies, both significantly improved predictive performance in an experimental evaluation." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-112", "text": "In the future, we plan to investigate whether performance can be further improved when more sentences in the context are included." }, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-113", "text": "Also, we plan to investigate whether revision dependencies exist in other types of corpora such as Wikipedia revisions." 
}, { "sent_id": "622bd6f16d55ab5853389286cdda56-C001-114", "text": "While the corpora used in this study cannot be published because of the lack of required IRB, we are starting a user study project (Zhang et al., 2016) on the application of our proposed techniques and will publish the data collected from this project." } ], "y": { "@BACK@": { "gold_contexts": [ [ "622bd6f16d55ab5853389286cdda56-C001-12" ], [ "622bd6f16d55ab5853389286cdda56-C001-13" ], [ "622bd6f16d55ab5853389286cdda56-C001-20" ], [ "622bd6f16d55ab5853389286cdda56-C001-54" ] ], "cite_sentences": [ "622bd6f16d55ab5853389286cdda56-C001-12", "622bd6f16d55ab5853389286cdda56-C001-13", "622bd6f16d55ab5853389286cdda56-C001-20", "622bd6f16d55ab5853389286cdda56-C001-54" ] }, "@EXT@": { "gold_contexts": [ [ "622bd6f16d55ab5853389286cdda56-C001-40" ] ], "cite_sentences": [ "622bd6f16d55ab5853389286cdda56-C001-40" ] }, "@MOT@": { "gold_contexts": [ [ "622bd6f16d55ab5853389286cdda56-C001-42" ] ], "cite_sentences": [ "622bd6f16d55ab5853389286cdda56-C001-42" ] }, "@USE@": { "gold_contexts": [ [ "622bd6f16d55ab5853389286cdda56-C001-44", "622bd6f16d55ab5853389286cdda56-C001-45" ] ], "cite_sentences": [ "622bd6f16d55ab5853389286cdda56-C001-45" ] } } }, "ABC_318487ac270ca272ec11a3de6c0685_28": { "x": [ { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-2", "text": "Matrix and tensor factorization have been applied to a number of semantic relatedness tasks, including paraphrase identification." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-3", "text": "The key idea is that similarity in the latent space implies semantic relatedness." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-4", "text": "We describe three ways in which labeled data can improve the accuracy of these approaches on paraphrase classification." 
}, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-5", "text": "First, we design a new discriminative term-weighting metric called TF-KLD, which outperforms TF-IDF." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-6", "text": "Next, we show that using the latent representation from matrix factorization as features in a classification algorithm substantially improves accuracy." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-7", "text": "Finally, we combine latent features with fine-grained n-gram overlap features, yielding performance that is 3% more accurate than the prior state-of-the-art." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-8", "text": "----------------------------------" }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-10", "text": "Measuring the semantic similarity of short units of text is fundamental to many natural language processing tasks, from evaluating machine translation (Kauchak and Barzilay, 2006) to grouping redundant event mentions in social media (Petrovi\u0107 et al., 2010) ." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-11", "text": "The task is challenging because of the infinitely diverse set of possible linguistic realizations for any idea (Bhagat and Hovy, 2013) , and because of the short length of individual sentences, which means that standard bag-of-words representations will be hopelessly sparse." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-12", "text": "Distributional methods address this problem by transforming the high-dimensional bag-of-words representation into a lower-dimensional latent space." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-13", "text": "This can be accomplished by factoring a matrix or tensor of term-context counts (Turney and Pantel, 2010) ; proximity in the induced latent space has been shown to correlate with semantic similarity (Mihalcea et al., 2006) ." 
}, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-14", "text": "However, factoring the term-context matrix means throwing away a considerable amount of information, as the original matrix of size M \u00d7 N (number of instances by number of features) is factored into two smaller matrices of size M \u00d7 K and N \u00d7 K, with K M, N ." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-15", "text": "If the factorization does not take into account labeled data about semantic similarity, important information can be lost." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-16", "text": "In this paper, we show how labeled data can considerably improve distributional methods for measuring semantic similarity." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-17", "text": "First, we develop a new discriminative term-weighting metric called TF-KLD, which is applied to the term-context matrix before factorization." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-18", "text": "On a standard paraphrase identification task (Dolan et al., 2004) , this method improves on both traditional TF-IDF and Weighted Textual Matrix Factorization (WTMF; Guo and Diab, 2012) ." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-19", "text": "Next, we convert the latent representations of each sentence pair into a feature vector, which is used as input to a linear SVM classifier." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-20", "text": "This yields further improvements and substantially outperforms the current state-of-the-art on paraphrase classification." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-21", "text": "We then add \"finegrained\" features about the lexical similarity of the sentence pair." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-22", "text": "The combination of latent and finegrained features yields further improvements in accuracy, demonstrating that these feature sets provide complementary information on semantic similarity." 
}, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-23", "text": "----------------------------------" }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-24", "text": "**RELATED WORK**" }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-25", "text": "Without attempting to do justice to the entire literature on paraphrase identification, we note three high-level approaches: (1) string similarity metrics such as n-gram overlap and BLEU score (Wan et al., 2006; Madnani et al., 2012) , as well as string kernels (Bu et al., 2012) ; (2) syntactic operations on the parse structure (Wu, 2005; Das and Smith, 2009) ; and (3) distributional methods, such as latent semantic analysis (LSA; Landauer et al., 1998) , which are most relevant to our work." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-26", "text": "One application of distributional techniques is to replace individual words with distributionally similar alternatives (Kauchak and Barzilay, 2006) ." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-27", "text": "Alternatively, Blacoe and Lapata (2012) show that latent word representations can be combined with simple elementwise operations to identify the semantic similarity of larger units of text." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-28", "text": "Socher et al. (2011) propose a syntactically-informed approach to combine word representations, using a recursive auto-encoder to propagate meaning through the parse tree." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-29", "text": "We take a different approach: rather than representing the meanings of individual words, we directly obtain a distributional representation for the entire sentence." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-30", "text": "This is inspired by Mihalcea et al. (2006) and Guo and Diab (2012) , who treat sentences as pseudo-documents in an LSA framework, and identify paraphrases using similarity in the latent space." 
}, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-31", "text": "We show that the performance of such techniques can be improved dramatically by using supervised information to (1) reweight the individual distributional features and (2) learn the importance of each latent dimension." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-32", "text": "----------------------------------" }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-33", "text": "**DISCRIMINATIVE FEATURE WEIGHTING**" }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-34", "text": "Distributional representations (Turney and Pantel, 2010) can be induced from a co-occurrence matrix W \u2208 R M \u00d7N , where M is the number of instances and N is the number of distributional features." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-35", "text": "For paraphrase identification, each instance is a sentence; features may be unigrams, or may include higher-order n-grams or dependency pairs." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-36", "text": "By decomposing the matrix W, we hope to obtain a latent representation in which semantically-related sentences are similar." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-37", "text": "Singular value decomposition (SVD) is traditionally used to perform this factorization." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-38", "text": "However, recent work has demonstrated the robustness of nonnegative matrix factorization (NMF; Lee and Seung, 2001 ) for text mining tasks (Xu et al., 2003; Arora et al., 2012) ; the difference from SVD is the addition of a non-negativity constraint in the latent representation based on non-orthogonal basis." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-39", "text": "While W may simply contain counts of distributional features, prior work has demonstrated the utility of reweighting these counts (Turney and Pantel, 2010) ." 
}, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-40", "text": "TF-IDF is a standard approach, as the inverse document frequency (IDF) term increases the importance of rare words, which may be more discriminative." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-41", "text": "Guo and Diab (2012) show that applying a special weight to unseen words can further improvement performance on paraphrase identification." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-42", "text": "We present a new weighting scheme, TF-KLD, based on supervised information." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-43", "text": "The key idea is to increase the weights of distributional features that are discriminative, and to decrease the weights of features that are not." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-44", "text": "Conceptually, this is similar to Linear Discriminant Analysis, a supervised feature weighting scheme for continuous data (Murphy, 2012) ." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-45", "text": "More formally, we assume labeled sentence pairs of the form w is the binarized vector of distributional features for the second sentence, and r i \u2208 {0, 1} indicates whether they are labeled as a paraphrase pair." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-46", "text": "Assuming the order of the sentences within the pair is irrelevant, then for k-th distributional feature, we define two Bernoulli distributions:" }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-47", "text": "." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-48", "text": "This is the probability that sentence w (1) i contains feature k, given that k appears in w (2) i and the two sentences are labeled as paraphrases, r i = 1." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-49", "text": "." 
}, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-50", "text": "This is the probability that sentence w (1) i contains feature k, given that k appears in w (2) i and the two sentences are labeled as not paraphrases, r i = 0." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-51", "text": "is then a measure of the discriminability of feature k, and is guaranteed to be non- negative." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-52", "text": "1 We use this divergence to reweight the features in W before performing the matrix factorization." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-53", "text": "This has the effect of increasing the weights of features whose likelihood of appearing in a pair of sentences is strongly influenced by the paraphrase relationship between the two sentences." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-54", "text": "On the other hand, if p k = q k , then the KL-divergence will be zero, and the feature will be ignored in the matrix factorization." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-55", "text": "We name this weighting scheme TF-KLD, since it includes the term frequency and the KL-divergence." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-56", "text": "Taking the unigram feature not as an example, we have p k = [0.66, 0.34] and q k = [0.31, 0.69], for a KL-divergence of 0.25: the likelihood of this word being shared between two sentence is strongly dependent on whether the sentences are paraphrases." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-57", "text": "In contrast, the feature then has p k = [0.33, 0.67] and q k = [0.32, 0.68], for a KL-divergence of 3.9 \u00d7 10 \u22124 ." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-58", "text": "Figure 1 shows the distributions of these and other unigram features with respect to p k and 1 \u2212 q k ." 
}, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-59", "text": "The diagonal line running through the middle of the plot indicates zero KL-divergence, so features on this line will be ignored." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-60", "text": "1 unigram recall 2 unigram precision 3 bigram recall 4 bigram precision 5 dependency relation recall 6 dependency relation precision 7 BLEU recall 8 BLEU precision 9 Difference of sentence length 10 Tree-editing distance Table 1 : Fine-grained features for paraphrase classification, selected from prior work (Wan et al., 2006) ." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-61", "text": "----------------------------------" }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-62", "text": "**SUPERVISED CLASSIFICATION**" }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-63", "text": "While previous work has performed paraphrase classification using distance or similarity in the latent space (Guo and Diab, 2012; Socher et al., 2011) , more direct supervision can be applied." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-64", "text": "Specifically, we convert the latent representations of a pair of sentences v 1 and v 2 into a sample vector," }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-65", "text": "concatenating the element-wise sum v 1 + v 2 and ab-" }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-66", "text": "." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-67", "text": "Given this representation, we can use any supervised classification algorithm." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-68", "text": "A further advantage of treating paraphrase as a supervised classification problem is that we can apply additional features besides the latent representation." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-69", "text": "We consider a subset of features identified by Wan et al. (2006) , listed in Table 1 ." 
}, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-70", "text": "These features mainly capture fine-grained similarity between sentences, for example by counting specific unigram and bigram overlap." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-71", "text": "----------------------------------" }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-72", "text": "**EXPERIMENTS**" }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-73", "text": "Our experiments test the utility of the TF-KLD weighting towards paraphrase classification, using the Microsoft Research Paraphrase Corpus (Dolan et al., 2004) ." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-74", "text": "The training set contains 2753 true paraphrase pairs and 1323 false paraphrase pairs; the test set contains 1147 and 578 pairs, respectively." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-75", "text": "The TF-KLD weights are constructed from only the training set, while matrix factorizations are per-formed on the entire corpus." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-76", "text": "Matrix factorization on both training and (unlabeled) test data can be viewed as a form of transductive learning (Gammerman et al., 1998) , where we assume access to unlabeled test set instances." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-77", "text": "2 We also consider an inductive setting, where we construct the basis of the latent space from only the training set, and then project the test set onto this basis to find the corresponding latent representation." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-78", "text": "The performance differences between the transductive and inductive settings were generally between 0.5% and 1%, as noted in detail below." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-79", "text": "We reiterate that the TF-KLD weights are never computed from test set data." 
}, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-80", "text": "Prior work on this dataset is described in section 2." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-81", "text": "To our knowledge, the current state-of-theart is a supervised system that combines several machine translation metrics (Madnani et al., 2012 ), but we also compare with state-of-the-art unsupervised matrix factorization work (Guo and Diab, 2012) ." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-82", "text": "----------------------------------" }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-83", "text": "**SIMILARITY-BASED CLASSIFICATION**" }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-84", "text": "In the first experiment, we predict whether a pair of sentences is a paraphrase by measuring their cosine similarity in latent space, using a threshold for the classification boundary." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-85", "text": "As in prior work (Guo and Diab, 2012) , the threshold is tuned on held-out training data." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-86", "text": "We consider two distributional feature sets: FEAT 1 , which includes unigrams; and FEAT 2 , which also includes bigrams and unlabeled dependency pairs obtained from MaltParser (Nivre et al., 2007) ." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-87", "text": "To compare with Guo and Diab (2012) , we set the latent dimensionality to K = 100, which was the same in their paper." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-88", "text": "Both SVD and NMF factorization are evaluated; in both cases, we minimize the Frobenius norm of the reconstruction error." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-89", "text": "Table 2 compares the accuracy of a number of different configurations." 
}, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-90", "text": "The transductive TF-KLD weighting yields the best overall accuracy, achieving 72.75% when combined with nonnegative matrix factorization." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-91", "text": "While NMF performs slightly better than SVD in both comparisons, the major difference is the performance of discriminative TF-KLD weighting, which outperforms TF-IDF regardless of the factorization technique." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-92", "text": "When we 2 Another example of transductive learning in NLP is when Turian et al. (2010) induced word representations from a corpus that included both training and test data for their downstream named entity recognition task." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-93", "text": "perform the matrix factorization on only the training data, the accuracy on the test set is 73.58%, with F1 score 80.55%." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-94", "text": "----------------------------------" }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-95", "text": "**SUPERVISED CLASSIFICATION**" }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-96", "text": "Next, we apply supervised classification, constructing sample vectors from the latent representation as shown in Equation 1." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-97", "text": "For classification, we choose a Support Vector Machine with a linear kernel (Fan et al., 2008) , leaving a thorough comparison of classifiers for future work." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-98", "text": "The classifier parameter C is tuned on a development set comprising 20% of the original training set." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-99", "text": "Figure 2 presents results for a range of latent dimensionalities." 
}, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-100", "text": "Supervised learning identifies the important dimensions in the latent space, yielding significantly better performance that the similaritybased classification from the previous experiment." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-101", "text": "In Table 3 , we compare against prior published work, using the held-out development set to select the best value of K (again, K = 400)." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-102", "text": "The best result is from TF-KLD, with distributional features FEAT 2 , achieving 79.76% accuracy and 85.87% F1." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-103", "text": "This is well beyond all known prior results on this task." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-104", "text": "When we induce the latent basis from only the training data, we get 78.55% on accuracy and 84.59% F1, also better than the previous state-of-art." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-105", "text": "Finally, we augment the distributional representation, concatenating the ten \"fine-grained\" features in Table 1 (Bu et al., 2012) 76.3 not reported (Socher et al., 2011) 76.8 83.6 (Madnani et al., 2012) 77. racy now improves to 80.41%, with an F1 score of 85.96%." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-106", "text": "When the latent representation is induced from only the training data, the corresponding results are 79.94% on accuracy and 85.36% F1, again better than the previous state-of-the-art." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-107", "text": "These results show that the information captured by the distributional representation can still be augmented by more fine-grained traditional features." 
}, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-108", "text": "----------------------------------" }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-109", "text": "**CONCLUSION**" }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-110", "text": "We have presented three ways in which labeled data can improve distributional measures of semantic similarity at the sentence level." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-111", "text": "The main innovation is TF-KLD, which discriminatively reweights the distributional features before factorization, so that discriminability impacts the induction of the latent representation." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-112", "text": "We then transform the latent representation into a sample vector for supervised learning, obtaining results that strongly outperform the prior state-of-the-art; adding fine-grained lexical features further increases performance." }, { "sent_id": "318487ac270ca272ec11a3de6c0685-C001-113", "text": "These ideas may have applicability in other semantic similarity tasks, and we are also eager to apply them to new, large-scale automatically-induced paraphrase corpora (Ganitkevitch et al., 2013) ." 
} ], "y": { "@DIF@": { "gold_contexts": [ [ "318487ac270ca272ec11a3de6c0685-C001-17", "318487ac270ca272ec11a3de6c0685-C001-18" ] ], "cite_sentences": [ "318487ac270ca272ec11a3de6c0685-C001-18" ] }, "@BACK@": { "gold_contexts": [ [ "318487ac270ca272ec11a3de6c0685-C001-30" ], [ "318487ac270ca272ec11a3de6c0685-C001-63" ] ], "cite_sentences": [ "318487ac270ca272ec11a3de6c0685-C001-30", "318487ac270ca272ec11a3de6c0685-C001-63" ] }, "@MOT@": { "gold_contexts": [ [ "318487ac270ca272ec11a3de6c0685-C001-29", "318487ac270ca272ec11a3de6c0685-C001-30" ] ], "cite_sentences": [ "318487ac270ca272ec11a3de6c0685-C001-30" ] }, "@UNSURE@": { "gold_contexts": [ [ "318487ac270ca272ec11a3de6c0685-C001-81" ] ], "cite_sentences": [ "318487ac270ca272ec11a3de6c0685-C001-81" ] }, "@SIM@": { "gold_contexts": [ [ "318487ac270ca272ec11a3de6c0685-C001-85" ], [ "318487ac270ca272ec11a3de6c0685-C001-87" ] ], "cite_sentences": [ "318487ac270ca272ec11a3de6c0685-C001-85", "318487ac270ca272ec11a3de6c0685-C001-87" ] }, "@USE@": { "gold_contexts": [ [ "318487ac270ca272ec11a3de6c0685-C001-85" ], [ "318487ac270ca272ec11a3de6c0685-C001-87" ] ], "cite_sentences": [ "318487ac270ca272ec11a3de6c0685-C001-85", "318487ac270ca272ec11a3de6c0685-C001-87" ] } } }, "ABC_49b42346795d541dbcac9e2b9ad00a_28": { "x": [ { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-105", "text": "----------------------------------" }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-106", "text": "**DATASET**" }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-2", "text": "This paper investigates the attempts to augment neural-based inflection models with characterbased language models." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-3", "text": "We found that in most cases this slightly improves performance, however, the effect is marginal." 
}, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-4", "text": "We also propose another language-model based approach that can be used as a strong baseline in low-resource setting." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-5", "text": "----------------------------------" }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-7", "text": "Morphological inflection is a task of automatic reconstruction of surface word form given its source form, called lemma, and morphological characteristic of required form." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-8", "text": "For example, in Spanish the input word contar together with features v;fin;ind;imp;pst;3;pl should be transformed to contaban." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-9", "text": "The obvious way to solve such a task is to handcode transformations using finitestate rules." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-10", "text": "However, this approach requires an expert knowledge of the language under consideration and can be extremely time-consuming for the languages with complex morphology." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-11", "text": "Therefore a machine learning algorithm should be developed to efficiently solve this task for any language." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-12", "text": "Such an algorithm must be able to generalize from known lemma-features-word triples to previously unseen ones, mimicking human behaviour when inflecting a neologism in its native language or an unknown word in a foreign one." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-13", "text": "In this setting automatic inflection becomes an instance of string transduction problem, which makes conditional random fields a natural baseline model as suggested in (Nicolai et al., 2015) ." 
}, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-14", "text": "Another popular approach is to predict transformation pattern, either as a pair of prefix and suffix changes as used in the baseline model for Sigmorphon 2018 Shared Task (Cotterell et al., 2018) ." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-15", "text": "For example, consider a Czech adjective kr\u00e1sn\u00fd and its superlative form nejkr\u00e1\u0161nej\u0161\u00ed." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-16", "text": "The inflection pattern can be encoded as a pair of prefix rule $ \u2192 $nej and a suffix rule\u00fd \u2192 ej\u0161\u00ed." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-17", "text": "Such encoding is, however, too weak to deal with infixation and root vowel alterations, required, for example, for Spanish verb volver and its +Pres+Sg+1 form vuelvo." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-18", "text": "An abstract paradigm approach (Ahlberg et al., 2015; Sorokin, 2016) compresses this transformation to 1+o+2+er#1+ue+2+o, where digits stand for variables (the parts of verb stem), and constant fragments define the paradigm." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-19", "text": "In both cases in order to predict the inflected word form one suffices to guess the transformation pattern, thus solving a standard classification problem." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-20", "text": "Both mentioned models (CRFs and abstract paradigms) were quite successful in Sigmorphon 2016 Shared Task (Cotterell et al., 2016) , however, they were clearly outperformed by neural network approaches." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-21", "text": "Indeed, string transduction problems are successfully solved using neural methods, for example, in machine translation." 
}, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-22", "text": "The work of (Kann and Sch\u00fctze, 2016 ) adopts the seq2seq model with soft attention of (Bahdanau et al., 2014) ." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-23", "text": "It defeated not only non-neural systems mentioned earlier, but also other neural approaches." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-24", "text": "It shows that attention mechanism is crucial for word inflection." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-25", "text": "However, in contrast to machine translation, a symbol of output word is less prone to depend from multiple input symbols, than a translated word from multiple source words." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-26", "text": "Consequently, the attention weight is usually concentrated on a single source symbol, being more a pointer than a distributed probability mass." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-27", "text": "Moreover, this pointer traverses the source word from left to right in order to generate the inflected form." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-28", "text": "All that motivated the hard attention model of (Aharoni and Goldberg, 2017) , which outperformed the soft attention approaches." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-29", "text": "The key feature of this model is that it predicts not only the output word, but also the alignment between source and target using an additional step symbol which shifts the pointer to the next symbol." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-30", "text": "This model was further improved by (Makarov et al., 2017) , whose system was the winner of Sigmorphon 2017 evaluation campaign (Cotterell et al., 2017) ." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-31", "text": "The approach of Makarov et al. 
was especially successful in low and medium resource setting, while in high resource setting it achieves an impressive accuracy of over 95% 1 ." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-32", "text": "Does it mean that no further research is required and hard attention method equipped with copy mechanism is the final solution for automatic inflection problem?" }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-33", "text": "Actually, not, since the quality of the winning approach was much lower on medium (about 85%) and low (below 50%) datasets." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-34", "text": "This lower quality is easy to explain since in low resource setting the system might even see no examples of the required form 2 or observe just one or two inflection pairs which do not cover all possible paradigms for this particular form." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-35", "text": "For example, Russian verbs has several tens of variants to produce the +Pres+Sg+1 form." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-36", "text": "Consequently, to improve the inflection accuracy the system should extract more information from the whole language, not only the instances of the given form." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-37", "text": "This task is easier for agglutinative languages with regular inflection paradigm: to predict, say, the +Pres+Sg+1 form in Turkish, the system has just to observe several singular verb form (not necessarily of the first person) to extract the singular suffix and several first person form (of any number and tense)." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-38", "text": "In presence of fusion, like in Russian and other Slavonic languages, the decomposition is not that easy or even impossible." 
}, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-39", "text": "However, this decomposition is already realised in model of (Makarov et al., 2017) since the grammatical features are treated as a list of atomic elements, not as entire label." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-40", "text": "A new source of information about the whole language are the laws of its phonetics." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-41", "text": "For example, to detect the vowel in the suffix of the Turkish verb one do not need to observe any verbs at all, but to extract the vowel harmony patterns from the inflection of nouns." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-42", "text": "A natural way to capture the phonetic patterns are character language models." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-43", "text": "They were already applied to the problem of inflection in (Sorokin, 2016) and produced a strong boost over the baseline system." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-44", "text": "The work of Sorokin used simple ngram models, however, neural language models (Tran et al., 2016) has shown their superiority over earlier approaches for various tasks." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-45", "text": "Summarizing, our approach was to enrich the model of (Makarov et al., 2017 ) with the language model component." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-46", "text": "We followed the architecture of (Gulcehre et al., 2017) , whose approach is simply to concatenate the state of the neural decoder with the state of the neural language model before passing it to the output projection layer." 
}, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-47", "text": "We expected to improve performance especially in low and medium resource setting, however, our approach does not have clear advantages: our joint system is only slightly ahead the baseline system of (Makarov et al., 2017) for most of the languages." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-48", "text": "We conclude that the language model job is already executed by the decoder." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-49", "text": "However, given the vitality of language model approach in other areas of modern NLP (Peters et al., 2018), we describe our attempts in detail to give other researchers the ideas for future work in this direction." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-50", "text": "----------------------------------" }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-51", "text": "**MODEL STRUCTURE**" }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-52", "text": "----------------------------------" }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-53", "text": "**BASELINE MODEL**" }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-54", "text": "As the state-of-the-art baseline we choose the model of Makarov et al. (Makarov et al., 2017) , the winner of previous Sigmorphon Shared Task." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-55", "text": "This system is based on earlier work of Aharoni and Goldberg (Aharoni and Goldberg, 2017) ." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-56", "text": "We briefly describe the structure of baseline model (we call it AGM-model further) and refer the reader to these two papers for more information." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-57", "text": "AGMmodel consists of encoder and decoder, where an encoder is just a bidirectional LSTM." 
}, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-58", "text": "Each element of the input sequence contains a 0-1 encoding of a current letter and two LSTMs traverse this sequence in opposite directions." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-59", "text": "After encoding, each element of obtained sequence contains information about current letter and its context." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-60", "text": "The main feature of the encoder is that it operates on the level on alignments, not on the level of letter sequences." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-61", "text": "Assume a pair volver-vuelvo appears in the training set." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-62", "text": "The natural alignment is as input the lower string of Figure 1 ." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-63", "text": "Let i be the number of current timestep and j be current position in the input string." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-64", "text": "On i-th step the decoder takes a concatenation of 3 vectors: x j -the j-th element in the output of the encoder,f = W f eat f -the embedding of the grammatical feature vector and g i = W emb y i\u22121 -the embedding of previous output symbol." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-65", "text": "The feature vector is obtained as 0/1-encoding of the list of grammatical features." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-66", "text": "We actually take the concatenation of output vectors for d \u2265 1 previous output symbols as y i\u22121 , in our experiments d was set to 4." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-67", "text": "On each step the decoder produces a vector z i as output and propagates updated hidden state vector h i to the next timestep." 
}, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-68", "text": "z i is then passed to a two-layer perceptron with ReLU activation on the intermediate layer and softmax activation on the output layer, which produces the output distribution p i over output letters, formally:" }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-69", "text": "If y i is the index of step symbol, we move the pointer to the next input letter." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-70", "text": "We also use the copy gate from (Makarov et al., 2017) : since the neural network copies the vast majority of its symbols, the output distribution p i is obtained as a weighted sum of singleton distribution which outputs current input symbol and the preliminary distribution p i specified above." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-71", "text": "The weight \u03c3 i is the output of another one-layer perceptron:" }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-72", "text": "----------------------------------" }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-73", "text": "**CHARACTER-BASED MODEL**" }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-74", "text": "Our proposal is to explicitly equip the decoder with the information from the character-based language model." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-75", "text": "We suppose it will help the model to avoid outputting phonetically implausible sequences of letters." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-76", "text": "We choose the simplest possible architecture of the language model, namely, on each step it takes a concatenation of d previous symbol embeddings u i = [g i\u2212d , . . . , g i\u22121 ] and applies an LSTM cell to obtain a vector v i and update LSTM hidden state h i ." 
}, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-77", "text": "v i is propagated through a two-layer perceptron to predict the next output symbol analogously to the output layer of the baseline model:" }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-78", "text": "The model is trained to predict next output symbol separately from the basic model." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-79", "text": "In principle, one can use more complex neural architectures, for example, a multilayer LSTM or apply attention mechanism." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-80", "text": "However, our preliminary experiments have shown that attention over recent history as in (Tran et al., 2016) leads to slightly worse performance." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-81", "text": "To join the baseline model and the language model we concatenate the decoder output z i with the analogous vector from the language model z LM i ." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-82", "text": "The language model is conditioned over previously output vectors (excluding step symbol)." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-83", "text": "That is the fusion mechanism as used in (Gulcehre et al., 2017) ." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-84", "text": "We also experimented with concatenating the pre-output vectors z i , z LM i , however, the former variant leads to slightly better performance." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-85", "text": "To avoid exposure bias we mask language model state with all zeros with the probability of 0.4 (it teaches the model to recover from language model errors)." 
}, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-86", "text": "----------------------------------" }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-87", "text": "**DATA AND IMPLEMENTATION**" }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-88", "text": "----------------------------------" }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-89", "text": "**IMPLEMENTATION**" }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-90", "text": "The initial alignment was obtained using longest common subsequence (LCS) method." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-91", "text": "Then this alignment was optimized using Chinese Restaurant process as in (Cotterell et al., 2016) ." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-92", "text": "The optimization phase did 5 passes over training data." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-93", "text": "The aligner trained on the training set was also used to align the validation data." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-94", "text": "We implemented our model using Keras library with Tensorflow backend 3 ." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-95", "text": "For all the setting we used the encoder with 96 hidden units in each direction, the decoder contained 128 units and the pre-output projection layer was of dimension 96." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-96", "text": "Morphological features were embedded to 48 dimensions." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-97", "text": "We used batch size of 32 when training, the batches contained the words of approximately the same size to reduce the amount of padding." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-98", "text": "We trained the model for 100 epochs with Adam optimizer, training was stopped when the accuracy on the validation data did not improve for 15 epochs." 
}, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-99", "text": "During decoding, the beam search of width 10 was applied." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-100", "text": "When learning the weights of a language model, we used the same training and validation sets as for inflection network." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-101", "text": "The language model used history of 5 symbols and contained 64 units in LSTM layers." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-102", "text": "The number of layers was set to 2." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-103", "text": "The rate of dropout was the same as for basic model." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-104", "text": "The model was trained for 20 epochs, training was stopped when perplexity on validation set did not improve for 5 epochs." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-129", "text": "However, language models demonstrate its utility even when little training data is available." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-130", "text": "The results for low subtask (see 2) demonstrate that they are powerful enough to discriminate between correct and incorrect variants proposed by the abstract paradigm generator." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-131", "text": "This is especially impressive since this method simply returns the input form in case it has not seen the given set of grammatical features." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-132", "text": "So it cannot recover the value of missing paradigm cells generalizing other elements of the paradigm table, which clearly limits its performance." 
}, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-133", "text": "Moreover, even for an observed grammatical values a small training set does not cover all possible inflection patterns, either due to their irregularity and multiplicity, like in case of Arabic or Latin, or complex phonetic rules as in case of Navajo." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-134", "text": "Nevertheless, this approach clearly beats our neural models since it requires less data when the number of possible inflection patterns is small." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-135", "text": "So language models are actually good in ranking inflection variants even in case of little data available." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-136", "text": "What remains is to generate enough candidate forms to improve their recall." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-137", "text": "We tried to solve this problem by adding top 10 candidates proposed by the neural network model to the list of possible outputs." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-138", "text": "However, this approach fails: for most languages the results fall below the level of neural models themselves." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-139", "text": "Doing a quick error analysis, we found that in low setting neural networks often are not able to discriminate between different forms, predicting a correct variant for another tense or person." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-140", "text": "The language model also does not learn enough well to distinguish different inflectional affixes due to the same lack of data." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-141", "text": "Therefore it favors either a shorter form or the endings it has observed more frequently, even if these endings does not refer to the set of features under consideration." 
}, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-142", "text": "On the contrary, abstract paradigms simply do not produce these variants, making the choice more easy." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-143", "text": "A possible workaround may be to predict the set of grammatical features for the generated form, however, we have not implemented this method due to the lack of time." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-144", "text": "This reranking approach appears to be less successful for medium and high datasets." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-145", "text": "In this case the number of proposed candidate paradigms becomes too high." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-146", "text": "Some of these paradigms generate phonetically plausible forms but are applicable only in particular conditions not satisfied by a given word." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-147", "text": "For example, consider the Russian input \u0434\u0435\u043b\u0430\u0442\u044c;v;prs;ind;3;sg; the paradigm 1+\u0430\u0442\u044c#1+\u0438\u0442 produces the form \u0434\u0435\u043b\u0438\u0442, which is correct, but for another verb \u0434\u0435\u043b\u0438\u0442\u044c." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-148", "text": "Therefore the application of language models in case of more training data looks problematic: we tried to use them to filter out forms generated by neural models without reranking remaining candidates." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-149", "text": "That marginally improved performance for complex languages like Navajo and Latin but had a slight negative effect in most other cases." 
}, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-150", "text": "----------------------------------" }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-151", "text": "**CONCLUSION**" }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-152", "text": "We investigated the applications of character language models to automatic reinflection." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-153", "text": "Despite their usefulness for other task, they do not produce significant boost, though improve the quality for all the settings." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-154", "text": "However, reranking-based approach, which also uses language models, reaches slightly higher scores in case of low amount of training data." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-155", "text": "In case of larger training sets the phonetic plausibility is effectively checked by the neural decoder itself without applying additional mechanisms." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-156", "text": "The relative success of paradigmbased approach in low-resource setting implies that neural networks lack control mechanism provided by abstract paradigms." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-157", "text": "Therefore the combination of neural networks with finite state techniques seems a perspective direction of study." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-158", "text": "Another promising direction not touched in the current work are different methods of data augmentation, either by training on data from related languages, or by generating additional training instances." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-159", "text": "At least for the second approach character language models seem useful to check the quality of generated source-target pairs." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-107", "text": "We tested our model in Sigmorphon 2018 Shared Task (Cotterell et al., 2018) ." 
}, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-108", "text": "For an extended description we refer the reader to this papers." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-109", "text": "The dataset contained three subsets: high, medium and low." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-110", "text": "The size of the training dataset was 10000 words in the high subset 4 , 1000 in medium and 100 in low." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-111", "text": "The dataset also contained a development set containing 1000 instances most of the time, for all languages we used this subset as validation data." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-112", "text": "Overall, there were 86 languages in the high setting, 102 in medium and 103 in low." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-113", "text": "----------------------------------" }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-114", "text": "**RESULTS AND DISCUSSION**" }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-115", "text": "We submitted three systems, one replicating the algorithm of (Makarov et al., 2017) , the second equipped with language models." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-116", "text": "The third one used only the language models: we extracted all possible abstract inflection paradigms for a given set of grammatical features and created a set of possible candidate forms applying all paradigms to the lemma." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-117", "text": "For example, consider the word \u0434\u0435\u043b\u0430\u0442\u044c and paradigms 1+\u0430\u0442\u044c#1+\u0435\u0442, 1+\u0430\u0442\u044c#1+\u0438\u0442, 1+\u044c#1 and 1+\u0447\u044c#1+\u0436\u0435\u0442; the first three produce the forms \u0434\u0435\u043b\u0430\u0435\u0442, \u0434\u0435\u043b\u0438\u0442, \u0434\u0435\u043b\u0430\u0442, while the fourth yields nothing since the given word does not end in -\u0447\u044c." 
}, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-118", "text": "Then all these forms are ranked using sum of logarithmic probabilities from forward and backward language models." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-119", "text": "Our results are mostly negative, since our language-model based architecture produced only marginal improvement over the model of Makarov et al. which it is based on." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-120", "text": "Moreover, for the lowresource setting the performance of both system was mediocre, even our third paradigm-based system was able to overperform them despite its obvious weakness." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-121", "text": "The results are presented in Table 1 5 , M1 stands for the baseline model and M2 -for the LM-based one." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-122", "text": "The numbers in brackets count the number of large gaps (more than 2% for high dataset, 3% for medium and 5% for low)." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-123", "text": "We observe that the influence of language models is marginal, the strength of this effect grows with the size of training data, which contradicts our expectations." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-124", "text": "In low and medium setting we expected slightly higher performance, which probably implies that our choice of hyperparameters is suboptimal." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-125", "text": "We made several observations when comparing our two models: first, the LM-based one demonstrates the highest quality after reducing output history of the baseline model from 4 to 2 and setting LM state dropout to 0.4." 
}, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-126", "text": "It shows that memory containing last output symbols plays the role of a language model for local dependencies and the memory of LSTM encoder -for global and often there is no need to duplicate them." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-127", "text": "However, most of the time LM-based variant converges much faster which implies that language model learns to throw out incorrect sequences of letters, but seems to overfit in the same time." }, { "sent_id": "49b42346795d541dbcac9e2b9ad00a-C001-128", "text": "In any case, these questions require future investigation." } ], "y": { "@BACK@": { "gold_contexts": [ [ "49b42346795d541dbcac9e2b9ad00a-C001-30" ], [ "49b42346795d541dbcac9e2b9ad00a-C001-39" ] ], "cite_sentences": [ "49b42346795d541dbcac9e2b9ad00a-C001-30", "49b42346795d541dbcac9e2b9ad00a-C001-39" ] }, "@EXT@": { "gold_contexts": [ [ "49b42346795d541dbcac9e2b9ad00a-C001-45" ] ], "cite_sentences": [ "49b42346795d541dbcac9e2b9ad00a-C001-45" ] }, "@DIF@": { "gold_contexts": [ [ "49b42346795d541dbcac9e2b9ad00a-C001-47" ] ], "cite_sentences": [ "49b42346795d541dbcac9e2b9ad00a-C001-47" ] }, "@USE@": { "gold_contexts": [ [ "49b42346795d541dbcac9e2b9ad00a-C001-54" ], [ "49b42346795d541dbcac9e2b9ad00a-C001-70" ], [ "49b42346795d541dbcac9e2b9ad00a-C001-115" ] ], "cite_sentences": [ "49b42346795d541dbcac9e2b9ad00a-C001-54", "49b42346795d541dbcac9e2b9ad00a-C001-70", "49b42346795d541dbcac9e2b9ad00a-C001-115" ] } } }, "ABC_afee292717afe1b0dcd77e155a5121_28": { "x": [ { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-2", "text": "The accuracy of named entity recognition systems relies heavily upon the volume and quality of available training data." 
}, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-3", "text": "Improving the process of automatically producing such training data is an important task, as manual acquisition is both time consuming and expensive." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-4", "text": "We explore the use of a variety of machine learning algorithms for categorising Wikipedia articles, an initial step in producing the named entity training data." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-5", "text": "We were able to achieve a categorisation accuracy of 95% F -score over six coarse categories, an improvement of up to 5% F -score over previous methods." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-6", "text": "----------------------------------" }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-8", "text": "Named Entity Recognition (NER) is the task of identifying proper nouns, such as location, organisation and personal names, in text." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-9", "text": "It emerged as a distinct type of information extraction during the sixth Message Understanding Conference (MUC) evaluation in 1995, and was further defined and explored in the CONLL NER evaluations of 2002 and 2003." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-10", "text": "A set of four broad categories became the standard scheme for marking named entities (NEs) in text: person (PER), organisation (ORG), location (LOC), and miscellaneous (MISC)." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-11", "text": "This scheme remains the most common, despite the development of more complex hierarchical category schemes (e.g. Brunstein (2002) ; Sekine et al. (2002) )." 
}, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-12", "text": "Domainspecific category schemes have also been developed in many areas, such as astroinformatics (Murphy et al., 2006) , bioinformatics (Kim et al., 2003) and the travel industry (Vijayakrishna and Sobha, 2008) ." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-13", "text": "We also extend the broad scheme with a DAB category for Wikipedia \"disambiguation\" pagespages used to group articles with identical titles." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-14", "text": "NER systems that categorise NEs under these schemes require a large amount of highly accurate training data to perform well at the task." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-15", "text": "Expert annotation is time consuming and expensive, so there is an imperative to generate this data automatically." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-16", "text": "Wikipedia is emerging as a significant resource due to its immense size and rich structural information, such as its link structure." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-17", "text": "Nothman et al. (2009) introduced a novel approach to exploiting Wikipedia's internal structure to produce training data for NER systems." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-18", "text": "Their process involved an initial step of categorising all Wikipedia articles using a simple heuristic-based bootstrapping algorithm." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-19", "text": "Potential NEs were then identified as the words in an article's text that served as links to other Wikipedia articles." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-20", "text": "To label a NE they then used the category assigned to the article that it linked to." 
}, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-21", "text": "We have explored the use of Na\u00efve Bayes (NB) and support vector machines (SVMs) as replacements for the text categorisation approach taken by Nothman." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-22", "text": "This involved the conversion of heuristics used by Nothman into features as well as the incorporation of a number of new features." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-23", "text": "We demonstrate the superiority of our approach, providing a comparison of the individual text categorisation step to both Nothman's system and other previous research." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-24", "text": "Our state-of-the-art text categorisation system for Wikipedia achieved an improvement of up to 5% F -score over previous approaches." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-25", "text": "Accurate classifications for Wikipedia articles are useful for a number of natural language processing (NLP) tasks, such as question answering and NER." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-26", "text": "To produce article classifications for generating NER training data, Nothman et al. (2009) used a heuristicbased text categorisation system." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-27", "text": "This involved extracting the first head noun after the copula, head nouns from an article's categories, and incoming link information." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-28", "text": "They reported an F -score of 89% when evaluating on a set of 1,300 hand-labelled articles." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-29", "text": "Dakka and Cucerzan (2008) explored the use of NB and SVM classifiers for categorising Wikipedia." 
}, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-30", "text": "They expanded each article's bag-of-words representation with disambiguated surface forms, as well as terms extracted from its first paragraph, abstract, and any tables present." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-31", "text": "They also extracted a small amount of context surrounding links to other Wikipedia articles." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-32", "text": "Dakka and Cucerzan (2008) expanded their set of 800 hand-labelled articles using a semisupervised approach, extracting training samples from Wikipedia \"List\" pages -pages that group other articles by type." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-33", "text": "For each \"List\" page containing a link to an article from the hand-labelled set they used the hand-labelled article's category to classify other articles on the list." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-34", "text": "They neglected to report how many training instances this left them with, but noted that they maintained the original class distribution of the hand-labelled data." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-35", "text": "They achieved an F -score of 89.7% with an SVM classifier and the category set PER, LOC, ORG, MISC and COM (for common nouns) when classifying their full article set." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-36", "text": "We experimented with a combination of the classification techniques used by Dakka and Cucerzan (2008) and the feature extraction methods used by Nothman et al. (2009) and others (Ponzetto and Strube, 2007; Hu et al., 2008; Biadsy et al., 2008) , focusing on the extraction of features from Wikipedia's rich metadata." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-37", "text": "(Loper and Bird, 2002) were used to tokenise the corpus." 
}, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-38", "text": "----------------------------------" }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-39", "text": "**DATA**" }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-40", "text": "----------------------------------" }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-41", "text": "**ANNOTATION SCHEME**" }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-42", "text": "Annotation was performed under a slightly modified BBN category hierarchy (Brunstein, 2002) ." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-43", "text": "During annotation we discovered the need for a number of additional categories due to the large number of articles Wikipedia contains relating to popular culture, for example the new categories Organisation \u2192 Band and M isc \u2192 W ork of Art \u2192 T V Series were quite common." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-44", "text": "We map these categories back to the \"Other\" subcategory of their parent category to allow accurate comparison with the original BBN scheme." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-45", "text": "Table 1 lists some of our new categories and gives an example for each." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-46", "text": "We also discovered a number of ambiguities in the original BBN scheme." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-47", "text": "A number of Wikipedia articles were border cases in the BBN scheme -they related to a number of categories, but did not fit perfectly into any single one." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-48", "text": "The category M isc \u2192 F ranchise is an example of an additional category to label articles such as \"Star Wars\" and \"Final Fantasy\"." 
}, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-49", "text": "We also noticed some unresolvable overlaps in categories, such as Location \u2192 Location \u2192 Island and Location \u2192 GPE \u2192 State for articles such as \"Tasmania\" and \"Hawaii\"." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-50", "text": "----------------------------------" }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-51", "text": "**MANUAL ANNOTATION**" }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-52", "text": "A list of Wikipedia articles was selected for annotation based on several criteria." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-53", "text": "Given the large number of stub articles that exist within Wikipedia and the poor representation of categories that selecting random articles would achieve, our list of articles was primarily based on their popularity as detailed by Ringland et al. (2009) ." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-99", "text": "With this exclusion the NB classifier became much more competitive with the SVM classifier." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-54", "text": "We took into consideration the number of different language versions of Wikipedia that the article existed in to try and maximise the usefulness of our annotated data for further multi-lingual NLP tasks." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-55", "text": "We took a list of the most popular articles from August 2008 and checked for an article's existence on that list." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-56", "text": "We also considered the number of incoming links an article attracted." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-57", "text": "Based on these three criteria we produced a list of 2,311 articles for annotation." 
}, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-58", "text": "Our resulting set of articles was of much higher quality than one that a random article selection process would produce." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-59", "text": "Random article selection fails to achieve good coverage of some important article categories, such as Location \u2192 GP E \u2192 Country which annotators are likely to never come across using a random selection method." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-60", "text": "Random selection also yields a high number of stub articles with fewer features for a machine learner to learn from." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-61", "text": "Our final set of Wikipedia articles was doubleannotated with an inter-annotator agreement of 99.7% using the fine-grained category scheme, and an agreement of 99.87% on the broad NER categories." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-62", "text": "The remaining classification discrepancies were due to fundamental conflicts in the category hierarchy that could not be resolved." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-63", "text": "This set of handlabelled articles will be released after publication." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-64", "text": "----------------------------------" }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-65", "text": "**FEATURES FOR TEXT CATEGORISATION**" }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-66", "text": "Our baseline system used a simple bag-of-words including tokens from the entire article body and the article title." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-67", "text": "This did not include tokens that appear in templates used in the generation of an article." 
}, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-68", "text": "We then experimented with a number of different feature extraction methods, focusing primarily on the document structure for identifying useful features." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-69", "text": "Tokens in the first paragraph were identified by Dakka and Cucerzan (2008) as useful features for a machine learner, an idea stemming from the fact that most human annotators will recognise an article's category after reading just the first paragraph." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-70", "text": "We extended this idea by also marking the first sentence and title tokens as separate from other tokens, as we found that often the first sentence was all that was required for a human annotator to classify an article." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-71", "text": "We ran experiments limiting the feature space to these smaller portions of the document." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-72", "text": "Wikipedia articles often have a large amount of metadata that helps in identifying an article's category, in particular Wikipedia categories and templates." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-73", "text": "Wikipedia categories are informal user defined and applied categories, forming a \"folksonomy\" rather than a strict taxonomy suitable for classification tasks, but the terms in the category names are usually strong indicators of an article's class." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-74", "text": "We extracted the list of categories applied to each article, tokenised the category names and added each token to the bag-of-words representation of the article." 
}, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-75", "text": "Using the same reasoning we also extracted a list of each article's templates, tokenised their names, and expanded the article's bag-of-words representation with these tokens." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-76", "text": "Furthermore, we expanded the templates \"Infobox\", \"Sidebar\" and \"Taxobox\" to extract tokens from their content." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-77", "text": "These templates often contain a condensed set of important facts relating to the article, and so are powerful additions to the bag-of-words representation of an article." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-78", "text": "Category, template and infobox features were marked with prefixes to distinguish them from each other and from features extracted from the article body." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-79", "text": "We reduced our raw set of features using a stop list of frequent terms, and removing terms with frequency less than 20 in a set of 1,800,800 articles taken from a separate Wikipedia dump." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-80", "text": "The assumption is that the majority of low frequency tokens will be typographical errors, or otherwise statistically unreliable data." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-81", "text": "----------------------------------" }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-82", "text": "**RESULTS**" }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-83", "text": "We compared our two classifiers against the heuristic-based system described by Nothman et al. (2009) and the classifiers described by Dakka and Cucerzan (2008) ." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-84", "text": "We also tested a baseline system that used a bag-of-words representation of Wikipedia articles with rich metadata excluded." 
}, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-85", "text": "All SVM experiments were run using LIB-SVM (Chang and Lin, 2001 ) using a linear kernel with parameter C = 2." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-86", "text": "For NB experiments we used the NLTK." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-87", "text": "The text categorisation system developed by Nothman et al. (2009) was provided to us by the authors, and we evaluated it using our hand-labelled training data." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-88", "text": "Direct comparison with this system was difficult, as it has the ability to mark an article as \"unknown\" or \"conflict\" and defer classification." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-89", "text": "Given that these classifications cannot be considered correct we marked them as classification errors." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-90", "text": "There were also a number of complications when comparing our system with the system described by Dakka and Cucerzan (2008) : they used a different, and substantially smaller, hand-labelled data set; they did not specify how they handled disambiguation pages; they provided no results for experiments using only hand-labelled data, instead incorporating training data produced via their semi-automated approach into the final results; and they neglected to report the final size of the training data produced by their semi-automated annotation." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-91", "text": "However, these two systems provided the closest benchmarks for comparison." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-92", "text": "We found that across all experiments the NB classifier performed best when using a bag-of-words representation incorporating the first sentence of an article only, along with tokens extracted from categories, templates and infoboxes." 
}, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-93", "text": "Conversely, the SVM classifier performed best using a bag-of-words representation incorporating the entire body of an article, along with category, template and infobox tokens." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-94", "text": "All experiment results listed were run with these respective configurations." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-95", "text": "We evaluated our system on two coarse-grained sets of data: the first containing all articles from our hand-labelled set, and the second containing only those articles that described NEs." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-96", "text": "Table 2 lists results from the top scoring configurations for both the NB and SVM classifiers." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-97", "text": "The SVM classifier performed significantly better than the NB classifier." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-98", "text": "Limiting the categorisation scheme to NE-only classes improved the classification accuracy for both classifiers, as the difficult NON class was excluded." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-100", "text": "Table 3 is a comparison of precision, recall and F -scores between our baseline and final systems, and the systems produced by Nothman et al. (2009) and Dakka and Cucerzan (2008) ." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-101", "text": "The difference between results from Nothman's system, our baseline and our full feature classifier were all found to be statistically significant at the p < 0.05 level." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-102", "text": "We performed this significance test using a stratified sam- pling approach outlined by Chinchor (1992) ." 
}, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-103", "text": "----------------------------------" }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-104", "text": "**CONCLUSION**" }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-105", "text": "We exploited Wikipedia's rich document structure and content, such as categories, templates and infoboxes, to classify its articles under a categorisation scheme using NB and SVM machine learners." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-106", "text": "Our system produced state-of-the-art results, achieving an F -score of 95%, an improvement of up to 5% over previous approaches." }, { "sent_id": "afee292717afe1b0dcd77e155a5121-C001-107", "text": "These high quality classifications are useful for a number of NLP tasks, in particular named entity recognition." } ], "y": { "@BACK@": { "gold_contexts": [ [ "afee292717afe1b0dcd77e155a5121-C001-29" ], [ "afee292717afe1b0dcd77e155a5121-C001-32" ] ], "cite_sentences": [ "afee292717afe1b0dcd77e155a5121-C001-29", "afee292717afe1b0dcd77e155a5121-C001-32" ] }, "@USE@": { "gold_contexts": [ [ "afee292717afe1b0dcd77e155a5121-C001-36" ] ], "cite_sentences": [ "afee292717afe1b0dcd77e155a5121-C001-36" ] }, "@EXT@": { "gold_contexts": [ [ "afee292717afe1b0dcd77e155a5121-C001-69", "afee292717afe1b0dcd77e155a5121-C001-70" ] ], "cite_sentences": [ "afee292717afe1b0dcd77e155a5121-C001-69" ] }, "@UNSURE@": { "gold_contexts": [ [ "afee292717afe1b0dcd77e155a5121-C001-83" ], [ "afee292717afe1b0dcd77e155a5121-C001-100" ] ], "cite_sentences": [ "afee292717afe1b0dcd77e155a5121-C001-83", "afee292717afe1b0dcd77e155a5121-C001-100" ] }, "@DIF@": { "gold_contexts": [ [ "afee292717afe1b0dcd77e155a5121-C001-90" ] ], "cite_sentences": [ "afee292717afe1b0dcd77e155a5121-C001-90" ] } } }, "ABC_c594df62c01bef2ffb1a7ee9c5ea28_28": { "x": [ { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-81", "text": "We hypothesise the difference lies in the nature of the two tasks." 
}, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-2", "text": "The functional approach to compositional distributional semantics considers transitive verbs to be linear maps that transform the distributional vectors representing nouns into a vector representing a sentence." }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-3", "text": "We conduct an initial investigation that uses a matrix consisting of the parameters of a logistic regression classifier trained on a plausibility task as a transitive verb function." }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-4", "text": "We compare our method to a commonly used corpus-based method for constructing a verb matrix and find that the plausibility training may be more effective for disambiguation tasks." }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-5", "text": "----------------------------------" }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-7", "text": "The field of compositional distributional semantics seeks principled ways to combine distributional representations of words to form larger units." }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-8", "text": "Representations of full sentences, besides their theoretical interest, have the potential to be useful for tasks such as automatic summarisation and recognising textual entailment." }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-9", "text": "A number of recent studies have investigated ways to combine distributional representations of Subject, Verb, Object (SVO) triples to form transitive sentences, that is, sentences based on a transitive verb [2-4, 6, 7, 12, 14] ." 
}, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-10", "text": "Under the functional approach [1] [2] [3] [4] , argument-taking words such as verbs and adjectives are represented as tensors, which take word vectors as arguments." }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-80", "text": "Overall, reg performs better than dist on verb disambiguation, while dist performs better on sentence similarity." }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-11", "text": "A transitive verb can be viewed as a third-order tensor with input dimensions for the subject and object, and an output dimension for the meaning of the sentence as a whole." }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-12", "text": "This approach has achieved promising initial results [6] [7] [8] [9] 14] , but many questions remain." }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-13", "text": "Two outstanding questions are the best method of learning verb tensors from a corpus, and the best sentence space for a variety of different tasks." }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-14", "text": "This paper presents work in progress which addresses both of these questions." }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-15", "text": "It compares two methods for learning verb representations, the distributional model of [7] in which positive examples of subject-object pairs for a given verb are structurally mixed, and the regression model of [14] in which positive and negative examples of subject-object pairs for a given verb are mapped into a plausibility space." }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-16", "text": "A variety of methods for reducing the noun space and composing the verb, subject, and object representations are investigated." 
}, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-17", "text": "The results show that the plausibility training outperforms the distributional method on a verb disambiguation task, while the purely distributional approach performs better on sentence similarity." }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-18", "text": "----------------------------------" }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-19", "text": "**METHODS**" }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-20", "text": "In the definition of the functional approach to compositional distributional semantics [1] [2] [3] [4] , a transitive verb is a map that takes as arguments noun vectors representing the subject and object, and produces a vector in the sentence space." }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-21", "text": "Typically, noun vectors for subject and object reside in a \"topic space\" where the dimensions correspond to co-occurrence features; we use a reduced space resulting from applying Singular Value Decomposition (SVD) to the raw co-occurrence space." }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-22", "text": "The correct sentence space to use is less obvious; previous approaches have either mapped sentence meaning to the same topic-based noun space [6, 7] or defined a new space for sentence meaning, particularly plausibility space [11, 14] ." }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-23", "text": "If the verb function is a multi-linear map, then the verb is naturally represented by a third-order tensor." }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-24", "text": "However, tensor training can be expensive and in practice, for some tasks, the verb can be approximated as a matrix [7, 14] ." }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-25", "text": "Below we describe two ways of learning a verb matrix." 
}, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-26", "text": "In the regression method, the learnt matrix consists of parameters from a plausibility classifier." }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-27", "text": "The classifier is trained to distinguish plausible sentences like animals eat plants from implausible sentences like animals eat planets." }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-28", "text": "In the distributional method, training is based on a sum of positive (i.e. attested) SVO triples." }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-29", "text": "The acquisition of positive SVO data and plausibility training data is described in Section 2.2." }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-30", "text": "----------------------------------" }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-31", "text": "**VERBS**" }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-32", "text": "Regression (reg) Following [14] , we formulate regression learning as a plausibility task where the class membership can be estimated with a single variable." }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-33", "text": "To produce a scalar output, we can learn the parameters for a single K \u00d7 K matrix (V) using standard logistic regression with the mean squared error cost function and K-dimensional subject (n s ) and object (n o ) noun vectors as input:" }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-34", "text": "where t i are the true plausibility labels of the N training examples." 
}, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-35", "text": "T ) is a sigmoid transformation of the scalar that results from the inner product and the objective is regularised by the parameter \u03bb:" }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-36", "text": "All of the plausibility data is used for training and validation, as we employ only the resulting verb matrix to produce a sentence space for transitive verb phrases in the test data." }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-37", "text": "The regression algorithm is trained through gradient descent with Adagrad [5] and 10% of the training triples are used as a validation set for early stopping." }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-38", "text": "----------------------------------" }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-39", "text": "**DISTRIBUTIONAL (DIST)**" }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-40", "text": "Following [7] , we generate a K \u00d7 K matrix for each verb as the average of outer products of subject and verb vectors from the positively labelled subset of the training data:" }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-41", "text": "where \u2297 is outer product and N p is the number of positive training examples." }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-42", "text": "The intuition is that the matrix encodes higher weights for contextual features of frequently attested subjects and objects; for example, multiplying by the matrix for eat may yield a higher scalar value when its subject exhibits features common to animate nouns, and its object exhibits features common to edible nouns." 
}, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-43", "text": "----------------------------------" }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-44", "text": "**TRAINING DATA**" }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-45", "text": "In order to generate training data we find plausible SVO triples that occur in an October 2013 dump of Wikipedia." }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-46", "text": "To ensure quality we choose triples whose nouns occur at least 100 times." }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-47", "text": "For some verbs there are thousands of such triples, so we choose the top 300 most frequent triples for each verb." }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-48", "text": "For each verb we generate negative examples by substituting plausible subjects and objects with maximally dissimilar nouns." }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-49", "text": "Specifically, for a given subject (or object) noun we calculate its average sum with the centroid of plausible subject (or object) vectors, and then select the frequencymatched noun with lowest cosine similarity to this average." }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-50", "text": "The noun vectors are generated from the Wikipedia corpus using the t-test weighting scheme and normalisation techniques described in [13] ." }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-51", "text": "These techniques enable us to learn using SVD reduced vectors with dimensions as low as 20." }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-52", "text": "----------------------------------" }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-53", "text": "**COMPOSITION METHODS**" }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-54", "text": "We investigate the following methods of composing the verb matrix with the subject and object vectors to form a vector representation for a transitive sentence." 
}, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-55", "text": "----------------------------------" }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-56", "text": "**DATASET METHOD SPEARMAN SETTINGS**" }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-57", "text": "KS2014 reg 0.28 \u00b1 0.0000 comb: REL, K=20 dist 0.35 \u00b1 0.0000 comb: REL, K=100 GS2011 reg 0.37 \u00b1 0.0037 comb: VO, K=200 dist 0.25 \u00b1 0.0000 comb: CS, K=200" }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-58", "text": "Relational: from [7, 10] , the meaning of a transitive sentence is a matrix, obtained by the following formula, where \u2297 is outer product, \u2299 is elementwise product, and \u00d7 is matrix multiplication:" }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-59", "text": "Copy-subject: from [9, 10] , the meaning of a transitive sentence is a vector, obtained by:" }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-60", "text": "Verb-object: this method tests whether the verb and object alone are enough to measure sentence similarity." }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-61", "text": "It can be compared directly to Copy-subject and reflects the linguistic generalisation that a verb selects its object more strongly than its subject." }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-62", "text": "The meaning of a transitive sentence is a vector, obtained by:" }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-63", "text": "For the Relational method, sentence similarity is measured as the Frobenius inner product of the two sentence matrices." }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-64", "text": "For Copy-subject and Verb-object, sentence similarity is measured as the cosine of the two sentence vectors." 
}, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-65", "text": "----------------------------------" }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-66", "text": "**TASKS**" }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-67", "text": "We investigate the performance of the regression learning method on two tasks: verb disambiguation, and transitive sentence similarity." }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-68", "text": "In each case the system must compose SVO triples and compare the resulting semantic representations." }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-69", "text": "For the verb disambiguation task we use the GS2011 dataset [7] ." }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-70", "text": "This dataset consists of pairs of SVO triples in which the subject and object are held constant, and the verb is manipulated to highlight different word senses." }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-71", "text": "For example, the verb draw has senses that correspond to attract and depict." }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-72", "text": "The SVO triple report draw attention has high similarity to report attract attention, but low similarity to report depict attention." }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-73", "text": "Conversely, child draw picture has high similarity to child depict picture, but low similarity to child attract picture." }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-74", "text": "The gold standard consists of human judgements of the similarity between pairs, and we measure the correlation of the system's similarity scores to the gold standard judgements." }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-75", "text": "For the transitive sentence similarity task we use the KS2013 dataset [9] ." 
}, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-76", "text": "This dataset also consists of pairs of SVO triples, but the verb as well as the subject and object vary." }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-77", "text": "For example, author write book and delegate buy land are judged by most annotators to be very dissimilar, while programme offer support and service provide help are considered highly similar." }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-78", "text": "Again, we measure the correlation between the system's similarity scores and the gold standard judgements." }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-79", "text": "Table 1 and Figure 1 summarise the results of our experiments." }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-82", "text": "The verb disambiguation task is inherently plausibilitybased, because one member of the low-similarity pairs (with the non-relevant sense of the verb) is always implausible." }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-83", "text": "For example, when disambiguating the verb meet, the triple system satisfy requirement has high plausibility, while system visit requirement has low plausibility." }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-84", "text": "In fact, the Verb-object (VO) composition method alone is enough to disambiguate these triples using reg." }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-85", "text": "----------------------------------" }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-86", "text": "**RESULTS**" }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-87", "text": "On the other hand, both triples in the sentence similarity task tend to be highly plausible, even when their topic differs." }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-88", "text": "Because dist uses a topic-based space, it may better capture these distinctions." 
}, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-89", "text": "For example, girl like people and woman ask man are both highly plausible, but annotators have judged the pair to have low similarity on average." }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-90", "text": "The best distributional method (K=100, Relational composition) ranks this pair the middle, while the corresponding regression method (K=100, Relational composition) ranks this pair near the top of the range of similarities." }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-91", "text": "Figure 1 shows the overall trends of the two methods." }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-92", "text": "The reg method consistently outperforms dist when low-dimensional vectors are used." }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-93", "text": "Due to the nature of functional composition of distributional representations, where each word-type other than noun is represented by a higher-order tensor, low-dimensional representations are particularly advantageous." }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-94", "text": "On the other hand, the dist method is very quick to train and therefore can be used with higher dimensional vectors for tasks where data storage is not an issue." }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-95", "text": "----------------------------------" }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-96", "text": "**CONCLUSION**" }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-97", "text": "The difference in performance of the two methods underlines the need to find the appropriate sentence spaces for particular tasks." }, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-98", "text": "This preliminary study indicates that plausibility training may be better suited for disambiguation." 
}, { "sent_id": "c594df62c01bef2ffb1a7ee9c5ea28-C001-99", "text": "Further work will consist of more in depth analysis and optimisation of the training procedure, as well as investigation into ways of low-cost learning of task-specific sentence spaces." } ], "y": { "@MOT@": { "gold_contexts": [ [ "c594df62c01bef2ffb1a7ee9c5ea28-C001-12", "c594df62c01bef2ffb1a7ee9c5ea28-C001-13" ] ], "cite_sentences": [ "c594df62c01bef2ffb1a7ee9c5ea28-C001-12" ] }, "@UNSURE@": { "gold_contexts": [ [ "c594df62c01bef2ffb1a7ee9c5ea28-C001-15" ], [ "c594df62c01bef2ffb1a7ee9c5ea28-C001-58" ] ], "cite_sentences": [ "c594df62c01bef2ffb1a7ee9c5ea28-C001-15", "c594df62c01bef2ffb1a7ee9c5ea28-C001-58" ] }, "@BACK@": { "gold_contexts": [ [ "c594df62c01bef2ffb1a7ee9c5ea28-C001-22" ], [ "c594df62c01bef2ffb1a7ee9c5ea28-C001-24" ] ], "cite_sentences": [ "c594df62c01bef2ffb1a7ee9c5ea28-C001-22", "c594df62c01bef2ffb1a7ee9c5ea28-C001-24" ] }, "@USE@": { "gold_contexts": [ [ "c594df62c01bef2ffb1a7ee9c5ea28-C001-40" ], [ "c594df62c01bef2ffb1a7ee9c5ea28-C001-69" ] ], "cite_sentences": [ "c594df62c01bef2ffb1a7ee9c5ea28-C001-40", "c594df62c01bef2ffb1a7ee9c5ea28-C001-69" ] } } }, "ABC_4ad830d8377d2584798a30bed65254_28": { "x": [ { "sent_id": "4ad830d8377d2584798a30bed65254-C001-152", "text": "We report results for the case of using pre-trained word embeddings on both the encoder and decoder and only in the encoder." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-2", "text": "Neural machine translation has significantly pushed forward the quality of the field." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-3", "text": "However, there are remaining big issues with the translations and one of them is fairness." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-4", "text": "Neural models are trained on large text corpora which contains biases and stereotypes." 
}, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-5", "text": "As a consequence, models inherit these social biases." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-6", "text": "Recent methods have shown results in reducing gender bias in other natural language processing applications such as word embeddings." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-7", "text": "We take advantage of the fact that word embeddings are used in neural machine translation to propose the first debiased machine translation system." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-8", "text": "Specifically, we propose, experiment and analyze the integration of two debiasing techniques over GloVe embeddings in the Transformer translation architecture." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-9", "text": "We evaluate our proposed system on a generic English-Spanish task, showing gains up to one BLEU point." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-10", "text": "As for the gender bias evaluation, we generate a test set of occupations and we show that our proposed system learns to equalize existing biases from the baseline system." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-11", "text": "----------------------------------" }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-12", "text": "**INTRODUCTION**" }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-13", "text": "Language is one of the most interesting and complex skills used in our daily life, and may even be taken for granted on our ability to communicate." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-14", "text": "However, the understanding of meanings between lines in natural languages is not straightforward for the logic rules of programming languages." 
}, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-15", "text": "Natural language processing (NLP) is a sub-field of artificial intelligence that focuses on making natural languages understandable to computers." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-16", "text": "Similarly, the translation between different natural languages is a task for machine translation (MT)." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-153", "text": "As shown on the right part of the For the cases studied, values do not differ much, all around 30 BLEU points." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-154", "text": "Using pre-trained embeddings for initialization seems to improve the translation is slightly coherently with previous studies (Qi et al., 2018) ." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-155", "text": "Furthermore, debiasing word embeddings keeps this improvement or even increases it when using GN-Glove in the encoder and the decoder." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-156", "text": "In all cases, BLEU does not decrease, except for the case of GN-Glove (only Enc)." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-180", "text": "BLEU performance was similar when using bias and debiased models and even improved when using the latter." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-17", "text": "Neural machine translation (NMT) is a recent approach in MT which learns patterns between source and target language corpora to produce text translations using deep neural networks (Sutskever et al., 2014) ." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-18", "text": "One downside of models trained with human generated corpora is that social biases present in the data are learned." 
}, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-19", "text": "This is shown when training word embeddings, a vector representation of words, in news sets with crowd-sourcing evaluation to quantify the presence of biases, such as gender bias, in those representation (Bolukbasi et al., 2016) ." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-20", "text": "This can affect downstream applications (Zhao et al., 2018a) and are at risk of being amplified (Zhao et al., 2017) ." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-21", "text": "The objective of this work is to study the presence of gender bias in MT and give insight on the impact of debiasing in such systems." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-22", "text": "An example of this gender bias is the word \"friend\" in the English sentence \"She works in a hospital, my friend is a nurse\" would be correctly translated to \"amiga\" (feminine of friend) in Spanish, while \"She works in a hospital, my friend is a doctor\" would be incorrectly translated to \"amigo\" (masculine of friend) in Spanish." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-23", "text": "We consider that this translation contains gender bias since it ignores the fact that, for both cases, \"friend\" is a female and translates by focusing on the occupational stereotypes, i.e. translating doctor as male and nurse as female." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-24", "text": "The main contribution of this study is providing progress on the recent detected problem which is gender bias in MT (Prates et al., 2018) ." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-25", "text": "The progress towards reducing gender bias in MT is made in two directions: first, we define a framework to experiment, detect and evaluate gender bias in MT for a particular task; second, we propose to use debiased word embeddings techniques in the MT system to reduce the detected bias." 
}, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-26", "text": "This is the first study in proposing debiasing techniques for MT." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-27", "text": "The rest the paper is organized as follows." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-28", "text": "Section 2 reports material relevant to the background of the study." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-29", "text": "Section 3 presents previous work on the bias problem." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-30", "text": "Section 4 reports the methodology used for experimentation and section 5 details the experimental framework." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-31", "text": "The results and discussion are included in section 6 and section 7 presents the main conclusions and ideas for further work." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-32", "text": "----------------------------------" }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-33", "text": "**BACKGROUND**" }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-34", "text": "This section presents the two most important models that are used in this paper." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-35", "text": "First, we describe what is the transformer model which is the stateof-the-art model in MT and second, we report a brief description of word embeddings and the corresponding techniques to debias them." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-36", "text": "----------------------------------" }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-37", "text": "**TRANSFORMER**" }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-179", "text": "The models were evaluated using the BLEU metric on the standard task of the WMT newstest2013 test set." 
}, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-38", "text": "The Transformer (Vaswani et al., 2017 ) is a neural network architecture purely based on selfattention mechanisms that show an improvement in performance on MT tasks over previous recurrent and convolutional models, also is more efficient in using computation resources and faster in training." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-39", "text": "Neural networks start by representing individual words as vectors, word embeddings (more on this later), to process language as a vector space representation which can have a fixed or variable length." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-40", "text": "Words surrounding another word determine its meaning and how it is represented in this space, thus context influences in deciding the appropriate meaning for a given task using such representation." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-41", "text": "The Transformer computes a reduced constant number of steps using a self-attention mechanism on each one." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-42", "text": "This mechanism models the relations between words independently of their position, thus improving the number of steps needed to determine a target word." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-43", "text": "An attention score is computed for all words in a sentence when comparing the contribution of each word to the next representation." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-44", "text": "An encoder reads an input sentence to generate a representation which is later used by a decoder to produce a sentence output word by word." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-45", "text": "New representations are generated at each step in parallel for all words." 
}, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-46", "text": "The decoder uses self-attention in the generated words and also uses the representations from the last words in the encoder to produce a single word each time." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-47", "text": "----------------------------------" }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-48", "text": "**WORD EMBEDDINGS**" }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-49", "text": "Word embeddings are vector representations of words." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-50", "text": "This representation is less sparse and more expressive, opposite to discrete atomic symbols and one-hot vectors." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-51", "text": "It is used in many NLP applications." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-52", "text": "Based on the hypothesis that words appearing in same contexts share semantic meaning, this continuous vector space representation gathers semantically similar words." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-53", "text": "Arithmetic operations can be performed with these embeddings, in order to find analogies between pairs of nouns with the pattern \"A is to B what C is to D\" (Mikolov et al., 2013) ." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-54", "text": "For nouns, such as countries and their respective capitals or for the conjugations of verbs." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-55", "text": "While there are many techniques for extracting word embeddings, in this work we are using Global Vectors, or GloVe (Pennington et al., 2014) ." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-56", "text": "Glove is an unsupervised method for learning word embeddings." 
}, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-57", "text": "This count-based method uses statistical information about word occurrences in a given corpus to train a vector space in which each vector corresponds to a word and its values describe the word's semantic relations." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-58", "text": "----------------------------------" }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-59", "text": "**DEBIASING WORD EMBEDDINGS**" }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-60", "text": "The presence of biases in word embeddings has arisen as a topic of discussion about fairness." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-61", "text": "More specifically, gender stereotypes are learned from human-generated corpora, as shown by (Bolukbasi et al., 2016) ." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-62", "text": "Several debiasing approaches have been proposed." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-63", "text": "Debiaswe is a post-process method for debiasing previously generated embeddings (Bolukbasi et al., 2016) ." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-64", "text": "GN-GloVe is a method for generating gender-neutral embeddings (Zhao et al., 2018b) ." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-65", "text": "The main ideas behind these algorithms are described next." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-66", "text": "Debiaswe (Bolukbasi et al., 2016) is a post-process method for debiasing word embeddings." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-67", "text": "It consists of two main parts: first, the direction of the embedding space in which the bias is present is identified." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-68", "text": "Second, the gender-neutral words are neutralized to zero along this direction, and the algorithm also equalizes sets of words by making each neutral word equidistant to the remaining words in the set." 
}, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-69", "text": "The disadvantage of the first part of the process is that it can remove valuable information in the embeddings about semantic relations between words with several meanings that are not related to the bias being treated." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-70", "text": "GN-GloVe (Zhao et al., 2018b) is an algorithm for learning gender-neutral word embedding models." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-71", "text": "It is modified so that protected attributes of the embeddings are captured in a specific dimension." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-72", "text": "For a protected attribute like gender, the minimization objective first captures word proximity as in GloVe (Pennington et al., 2014) and second restricts gender information to a specific dimension so that the remaining dimensions are neutral." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-73", "text": "A set of seed male and female words is used to define metrics for the optimization, and a set of gender-neutral words is used to restrict neutral words along the gender direction." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-74", "text": "----------------------------------" }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-75", "text": "**GN-GLOVE**" }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-76", "text": "----------------------------------" }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-77", "text": "**RELATED WORK**" }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-78", "text": "There are studies on the presence of biases in many NLP applications." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-79", "text": "Word embeddings can learn biases from human-generated corpora." 
}, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-80", "text": "(Bolukbasi et al., 2016) showed that stereotypical analogies are present in word embeddings, both for gender and race." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-81", "text": "(Caliskan et al., 2017) also found a strong presence of gender and racial bias in pre-trained embeddings and proposed a method for measuring bias in word embeddings." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-82", "text": "(Zhao et al., 2018b) proposed GN-GloVe, an algorithm to generate gender-neutral word embeddings." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-83", "text": "The approach is to restrict gender information to certain dimensions in order to keep the remaining dimensions free of these attributes." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-84", "text": "(Zhao et al., 2018a) shows that the sexism present in a coreference resolution system is due to its word embedding components." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-85", "text": "Applications that use these embeddings, such as curriculum filtering, may discriminate candidates because of their gender." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-86", "text": "The amplification of biases in downstream applications is also a concerning problem that can enlarge the gap between genders, for example in search engines for professions, where candidates' names may be discriminated against by the algorithm because of its bias towards a specific gender." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-87", "text": "This broadens gender inequality in a given field even further." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-88", "text": "(Zhao et al., 2017) shows that gender bias is learned and amplified in models trained on data sets containing web images used in language modelling tasks." 
}, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-89", "text": "As an example, the word \"cooking\" is more likely to be related to females than to males, and this association can be further amplified." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-90", "text": "(Park et al., 2018) studies the reduction of such biases in abusive language detection." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-91", "text": "These models have a strong bias towards words that identify gender because of the data sets on which they are trained." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-92", "text": "Sentences that do not necessarily show sexism are detected as false positives, compromising the robustness of the models." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-93", "text": "Debiased word embeddings combined with augmenting and swapping gender data is the most effective method for reducing gender bias for this task." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-94", "text": "(Prates et al., 2018) performs a case study on gender bias in machine translation." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-95", "text": "They build a test set consisting of a list of jobs and gender-specific sentences." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-96", "text": "Using English as a target language and a variety of gender neutral languages as a source, i.e. languages that do not explicitly give gender information about the subject, they test these sentences on the translating service Google Translate." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-97", "text": "They find that occupations related to science, engineering and mathematics present a strong stereotype toward male subjects." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-98", "text": "However, in late 2018, Google announced on their developers blog 1 that efforts are being put into providing gender-specific translations in Google Translate." 
}, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-99", "text": "Thus, it gives both the female and the male translation when translating from gender-neutral languages." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-100", "text": "----------------------------------" }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-101", "text": "**METHODOLOGY**" }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-102", "text": "In this section, we describe the methodology used for this study." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-103", "text": "In order to study the translation system, the embedding layer preceding both the Encoder and the Decoder of the Transformer (Vaswani et al., 2017), where the word embeddings are normally trained, is adapted to use pre-trained word embeddings." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-104", "text": "Then, we train the system with different pre-trained word embeddings (based on GloVe (Pennington et al., 2014)) to obtain a set of models." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-105", "text": "The scenarios are the following:" }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-106", "text": "\u2022 No pre-trained word embeddings, i.e. they are learned within the training of the model." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-107", "text": "\u2022 Pre-trained word embeddings which are learned from the same corpus." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-108", "text": "\u2022 Pre-trained word embeddings which are learned from the same corpus using a debiasing algorithm." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-109", "text": "The models with pre-trained embeddings given to the Transformer also cover two cases." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-110", "text": "In one case, pre-trained embeddings are used on both the encoder and decoder sides; see Figure 1 ." 
}, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-111", "text": "In the other case, pre-trained embeddings are used only on the encoder side." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-112", "text": "See Figure 2 ." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-113", "text": "To evaluate the performance of the models we use the BLEU metric (Papineni et al., 2002) ." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-114", "text": "This metric gives a score for a predicted translation set compared to its expected output." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-115", "text": "**EXPERIMENTAL FRAMEWORK**" }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-116", "text": "----------------------------------" }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-117", "text": "**CORPORA**" }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-118", "text": "The language pair used for the experiments is English-Spanish." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-119", "text": "A large training set of 16,554,790 sentences has been used for training the model." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-120", "text": "The validation and test sets used are the newstest2012 and newstest2013, respectively from the Workshop on Machine Translation 2 (WMT), which comprise 3,003 and 3,000 sentences." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-121", "text": "See Table 2 ." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-122", "text": "To study gender bias, we have developed an additional test set with custom sentences to evaluate the quality of the translation in the models." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-123", "text": "We built this test set using a sentence pattern \"I've known {her, him, } for a long time, my friend is {a, an} \" for a list of occupations from different professional areas." 
}, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-124", "text": "We refer to this test as the Occupations test; its sizes are also listed in Table 2 and sample sentences from this set are in Table 1 ." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-125", "text": "We use Spanish proper names to reduce ambiguity in this particular test." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-126", "text": "These sentences are properly tokenized before being used in the test." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-127", "text": "With these test sentences we see how \"friend\" is translated into its Spanish equivalent \"amiga\" or \"amigo\", each of which carries a gender, female and male, respectively." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-128", "text": "Note that we are formulating sentences with the ambiguous word \"friend\", which can be translated into either of the two words, and we are adding context in the same sentence so that the system has enough information to translate it correctly." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-129", "text": "The list of occupations is from the U.S. Bureau of Labor Statistics 3 , which also includes statistical data for gender and race for most professions." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-130", "text": "We use a pre-processed version of this list from (Prates et al., 2018)." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-131", "text": "----------------------------------" }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-132", "text": "**MODELS**" }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-133", "text": "The word embeddings are trained from the same corpus, using GloVe (Pennington et al., 2014) and GN-GloVe (Zhao et al., 2018b) ." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-134", "text": "The dimension of the vectors is set to 512 as standard and kept throughout all the experiments in this study." 
}, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-135", "text": "The parameter values for training the word embedding models with GloVe and GN-GloVe methods are listed in Table 3 ." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-136", "text": "Debiaswe (Bolukbasi et al., 2016) is a debiasing post-process performed on trained embeddings." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-137", "text": "Instead of having parameters for learning the representation, it uses sets of words to define the gender direction and to neutralize and equalize the bias in the word vectors." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-138", "text": "Three sets of words are used in the Debiaswe algorithm." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-139", "text": "One set of ten pairs of words such as woman-man, girl-boy, she-he... is used to define the gender direction." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-140", "text": "Another set of 218 gender-specific words such as aunt, uncle, wife, husband... is used for learning a larger set of gender-specific words." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-141", "text": "Finally, a set of crowd-sourced male-female equalization pairs such as dad-mom, boy-girl, grandpa-grandma... that represent the gender direction is equalized by the algorithm." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-142", "text": "For the Spanish side, the sets are adapted for the task and slightly modified to avoid unclear words from the English language or unnecessary repetitions." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-143", "text": "The sets from GN-GloVe are similarly adapted to the Spanish language." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-144", "text": "The architecture to train the models for the translation task is the Transformer (Vaswani et al., 2017) ." 
}, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-145", "text": "The evaluation of the performance of the model is obtained by its BLEU score (Papineni et al., 2002) ." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-146", "text": "The parameter values used in the Transformer are the same as proposed in the baseline provided by the toolkit OpenNMT 4 and listed in Table 4 ." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-147", "text": "OpenNMT has built-in tools for training with pre-trained embeddings." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-148", "text": "These pre-trained embeddings have been implemented with the cor-" }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-149", "text": "----------------------------------" }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-150", "text": "**RESULTS**" }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-151", "text": "BLEU scores for the test set newstest2013 are listed in Table 5 . Note that we can use the pre-trained word embeddings either in the encoder, in the decoder, or in both of them." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-157", "text": "In the next subsection, we show how each of the models performs on gender debiasing, but we want to underline that these models do not decrease the quality of translation in terms of BLEU when tested on a standard MT task." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-158", "text": "Qualitative analysis is performed on the Occupations test set, examples of which are shown in Table 1 ." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-159", "text": "This test set includes context information to translate the ambiguous word \"friend\" either into \"amigo\" or \"amiga\"." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-160", "text": "The lower the bias in the system, the better the system will be able to translate to the correct gender." 
}, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-161", "text": "See Table 6 for the percentages of how \"friend\" is translated for each model." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-162", "text": "Note that we are using \"updated\" embeddings in all cases, since they are the best-quality systems from Table 5 ." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-163", "text": "On the one hand, the system translates the masculine pronoun with almost 100% accuracy for all models, although not all occupations are well translated." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-164", "text": "On the other hand, for the feminine pronoun, the accuracy is not as high as for its counterpart, and it shows a slight decrease for all models." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-165", "text": "For the case using a proper noun (instead of the pronoun as coreference of \"friend\"), for example \"John\" or \"Mary\", the accuracy on this task shows further biases." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-166", "text": "However, it is worth mentioning that proper names can induce another kind of bias such as racial stereotypes." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-167", "text": "Note that gender debiasing is performed by increasing the percentage of \"amiga\" in the translation in the presence of the female pronoun while keeping the quality of translation (coherently with the generic results in Table 5 )." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-168", "text": "The most neutral system is achieved with GloVe and Debiaswe pre-trained embeddings that are also updated during training." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-169", "text": "The improvement is over 80% compared to the baseline system and over 4% compared to the non-debiased pre-trained word embeddings." 
}, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-170", "text": "----------------------------------" }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-171", "text": "**CONCLUSIONS AND FURTHER WORK**" }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-172", "text": "Biases learned from human-generated corpora in natural language processing applications are a topic that has been gaining relevance over the last few years." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-173", "text": "Specifically, for machine translation, studies quantifying gender bias present in news corpora and proposing debiasing approaches for word embedding models have shown improvements on this matter." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-174", "text": "We studied the impact of gender debiasing on neural machine translation." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-175", "text": "We trained sets of word embeddings with the standard GloVe algorithm." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-176", "text": "Then, we debiased the embeddings using Debiaswe (Bolukbasi et al., 2016) and also trained a gender-neutral version with GN-GloVe (Zhao et al., 2018b) ." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-177", "text": "We used all these different models on the Transformer (Vaswani et al., 2017) ." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-178", "text": "Experiments were reported using these word embeddings on both the encoder and decoder sides, or only on the encoder side." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-181", "text": "To study the bias in the translations of the models, we developed a specific test set." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-182", "text": "This set consists of sentences that include context about the gender of the ambiguous word \"friend\" in English-to-Spanish translation." 
}, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-183", "text": "This word can be translated as feminine or masculine, and the proper translation has to be derived from context." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-184", "text": "Our hypothesis is that if the translation system is gender biased, the context will be disregarded, while if the system is neutral, the translation will be correct (since it has the information of gender in the sentence)." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-185", "text": "Results show that the male pronoun is always identified, despite not all occupations being well translated, while the female pronoun has a different rate of appearance for different models." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-186", "text": "We successfully achieve almost 100% translation accuracy for these pronouns when using the debiased word embeddings of Debiaswe on both the encoder and decoder." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-187", "text": "Also, this system slightly improves the BLEU performance from the baseline translation system." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-188", "text": "Therefore, we are \"equalizing\" the translation, while keeping its quality." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-189", "text": "Experimental material from this paper is available on github 8 ." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-190", "text": "As already mentioned, this is the first work proposing gender-debiased translation systems. Future work on the topic of debiasing translation algorithms is required." 
}, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-191", "text": "In this paper, we studied gender as a bias in machine translation; however, other social constructs and stereotypes may be present in corpora, whether individually or combined, such as race, religious beliefs or age; these are just a small subset of possible biases which will present new challenges for fairness in both ML and MT." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-192", "text": "For the task at hand, the type of corpora used for the study is commonly related to news articles." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-193", "text": "Human corpora have a broad spectrum of categories, for instance industrial, medical and legal, and other biases particular to each area may present interesting problems." }, { "sent_id": "4ad830d8377d2584798a30bed65254-C001-194", "text": "Also, other language pairs with different degrees of gender specification in their written or spoken communication could be studied for the evaluation of debiasing in machine translation." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "4ad830d8377d2584798a30bed65254-C001-19" ], [ "4ad830d8377d2584798a30bed65254-C001-61" ], [ "4ad830d8377d2584798a30bed65254-C001-63" ], [ "4ad830d8377d2584798a30bed65254-C001-66" ], [ "4ad830d8377d2584798a30bed65254-C001-136" ] ], "cite_sentences": [ "4ad830d8377d2584798a30bed65254-C001-19", "4ad830d8377d2584798a30bed65254-C001-61", "4ad830d8377d2584798a30bed65254-C001-63", "4ad830d8377d2584798a30bed65254-C001-66", "4ad830d8377d2584798a30bed65254-C001-136" ] }, "@USE@": { "gold_contexts": [ [ "4ad830d8377d2584798a30bed65254-C001-176" ] ], "cite_sentences": [ "4ad830d8377d2584798a30bed65254-C001-176" ] } } }, "ABC_188f10a5b78a5e691e10d180dfde6f_28": { "x": [ { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-2", "text": "Recently, a variety of representation learning approaches have been developed in the literature to induce latent generalizable features across two domains." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-3", "text": "In this paper, we extend the standard hidden Markov models (HMMs) to learn distributed state representations to improve cross-domain prediction performance." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-4", "text": "We reformulate the HMMs by mapping each discrete hidden state to a distributed representation vector and employ an expectationmaximization algorithm to jointly learn distributed state representations and model parameters." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-5", "text": "We empirically investigate the proposed model on cross-domain part-ofspeech tagging and noun-phrase chunking tasks." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-6", "text": "The experimental results demonstrate the effectiveness of the distributed HMMs on facilitating domain adaptation." 
}, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-7", "text": "----------------------------------" }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-9", "text": "Domain adaptation aims to obtain an effective prediction model for a particular target domain where labeled training data is scarce by exploiting labeled data from a related source domain." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-10", "text": "Domain adaptation is very important in the field of natural language processing (NLP) as it can reduce the expensive manual annotation effort in the target domain." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-11", "text": "Various NLP tasks have benefited from domain adaptation techniques, including part-of-speech tagging (Blitzer et al., 2006; Huang and Yates, 2010a) , chunking (Daum\u00e9 III, 2007; Huang and Yates, 2009) , named entity recognition (Guo et al., 2009; Turian et al., 2010) , dependency parsing (Dredze et al., 2007; Sagae and Tsujii, 2007) and semantic role labeling (Dahlmeier and Ng, 2010; Huang and Yates, 2010b) ." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-12", "text": "In a typical domain adaptation scenario of NLP, the source and target domains contain text data of different genres (e.g., newswire vs biomedical (Blitzer et al., 2006) )." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-13", "text": "Under such circumstances, the original lexical features may not perform well in cross-domain learning, since different genres of text may use very different vocabularies, producing cross-domain feature distribution divergence and feature sparsity issues." 
}, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-14", "text": "A number of techniques have been developed in the literature to tackle the problem of cross-domain feature divergence and feature sparsity, including clustering based word representation learning methods (Huang and Yates, 2009; Candito et al., 2011) , word embedding based representation learning methods (Turian et al., 2010; Hovy et al., 2015) and some other representation learning methods (Blitzer et al., 2006) ." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-15", "text": "In this paper, we extend the standard hidden Markov models (HMMs) to perform distributed state representation learning and induce context-aware distributed word representations for domain adaptation." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-16", "text": "Instead of learning a single discrete latent state for each observation in a given sentence, we learn a distributed representation vector." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-17", "text": "We define a state embedding matrix to map each latent state value to a low-dimensional distributed vector and reformulate the three local distributions of HMMs based on the distributed state representations." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-18", "text": "We then simultaneously learn the state embedding matrix and the model parameters using an expectation-maximization (EM) algorithm." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-19", "text": "The hidden states of each word in a sentence can be decoded using the standard Viterbi decoding procedure of HMMs, and its distributed representation can be obtained by a simple mapping with the state embedding matrix." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-20", "text": "We then use the context-aware distributed representations of the words as their augmenting features to perform cross-domain part-of-speech (POS) tagging and noun-phrase (NP) chunking." 
}, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-21", "text": "The proposed approach is closely related to the clustering based method (Huang and Yates, 2009 ) as we both use latent state representations as generalizable features." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-22", "text": "However, they use standard HMMs to produce discrete hidden state features for each observation word, while we induce distributed state representation vectors." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-23", "text": "Our distributed HMMs share similarities with the word embedding based method (Hovy et al., 2015) , and can be more space-efficient than the standard HMMs." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-24", "text": "Moreover, our model can incorporate context information into observation feature vectors to perform representation learning in a context-aware manner." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-25", "text": "The distributed state representations induced by our model hence have larger representing capacities and generalizing capabilities for cross-domain learning than standard HMMs." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-26", "text": "----------------------------------" }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-27", "text": "**RELATED WORK**" }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-28", "text": "A variety of representation learning approaches have been developed in the literature to address NLP domain adaptation problems." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-29", "text": "The clustering based word representation learning methods perform word clustering within the sentence structure and use word cluster indicators as generalizable features to address domain adaptation problems." 
}, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-30", "text": "For example, Huang and Yates (2009) used the discrete hidden state of a word under HMMs as augmenting features for cross-domain POS tagging and NP chunking." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-31", "text": "Brown clusters (Brown et al., 1992) , which were used as latent features for simple in-domain dependency parsing (Koo et al., 2008) , have recently been exploited for out-of-domain statistical parsing (Candito et al., 2011) ." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-32", "text": "The word embedding based representation learning methods learn a dense real-valued representation vector for each word as latent features for domain adaptation." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-33", "text": "Turian et al. (2010) empirically studied using word embeddings learned from hierarchical log-bilinear models (Mnih and Geoffrey, 2008) and neural language models (Collobert and Weston, 2008) for cross-domain NER tasks." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-34", "text": "Hovy et al. (2015) used the word embeddings learned from the Skip-gram Model (SGM) (Mikolov et al., 2013) to develop a POS tagger for Twitter data with labeled newswire training data." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-35", "text": "Some other representation learning methods have been developed to tackle NLP cross-domain problems as well." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-36", "text": "For example, Blitzer et al. (2006) proposed a structural correspondence learning method for POS tagging, which first selects a set of pivot features (occurring frequently in" }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-37", "text": "the two domains) and then models the correlations between pivot features and non-pivot features to induce generalizable features. [Figure 1: Hidden Markov models with distributed state representations (dHMM).]" 
}, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-38", "text": "In terms of performing distributed representation learning for output variables, our proposed model shares similarity with the structured output representation learning approach developed by Srikumar and Manning (2014) , which extends the structured support vector machines to simultaneously learn the prediction model and the distributed representations of the output labels." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-39", "text": "However, the approach in (Srikumar and Manning, 2014 ) assumes the training labels (i.e., output values) are given and performs learning in the standard supervised in-domain setting, while our proposed distributed HMMs address cross-domain learning problems by performing unsupervised representation learning." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-40", "text": "There are also a few works that extended standard HMMs in the literature, including the observable operator models (Jaeger, 1999) , and the spectral learning method (Stratos et al., 2013) . But none of them performs representation learning to address cross-domain adaptation problems." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-41", "text": "----------------------------------" }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-42", "text": "**PROPOSED MODEL**" }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-43", "text": "In this paper, we propose a novel distributed hidden Markov model (dHMM) for representation learning over sequence data." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-44", "text": "This model extends the hidden Markov models (Rabiner and Juang, 1986) to learn distributed state representations." 
}, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-45", "text": "Similar to HMMs, a dHMM (shown in Figure 1) is a two-layer generative graphical model, which generates a sequence of observations from a sequence of latent state variables using Markov properties." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-46", "text": "Let O = {o 1 , o 2 , . . . , o T } be the sequence of observations with length T , where each observation o t \u2208 R d is a d-dimensional feature vector." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-47", "text": "Let S = {s 1 , s 2 , . . . , s T } be the sequence of T hidden states, where each hidden state s t has a discrete state value from a total of H hidden states H = {1, 2, . . . , H}." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-48", "text": "Besides, we assume that there is a low-dimensional distributed representation vector associated with each hidden state." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-49", "text": "Let M \u2208 R H\u00d7m be the state embedding matrix where the i-th row M i: denotes the m-dimensional representation vector for the i-th state." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-50", "text": "Previous works have demonstrated the usefulness of discrete hidden states induced from an HMM in addressing feature sparsity in domain adaptation (Huang and Yates, 2009)." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-51", "text": "However, representing a word's semantics with a single discrete state value is too restrictive, as it has been shown in the literature that words have many different features in a multidimensional space where they can be separately characterized by number, POS tag, gender, tense, voice and other aspects (Sag and Wasow, 1999; Huang et al., 2011)." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-52", "text": "Our proposed model aims to overcome this inherent drawback of standard HMMs in learning word representations."
}, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-53", "text": "Given a set of observation sequences in two domains, the dHMM induces a distributed representation vector with continuous real values for each observation word as generalizable features, which has the capacity of capturing multi-aspect latent characteristics of the word clusters." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-54", "text": "----------------------------------" }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-55", "text": "**MODEL FORMULATION**" }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-56", "text": "To build the dHMMs, we reformulate the standard HMMs by defining three main local distributions based on the distributed state representations, i.e., the initial state distribution, the state transition distribution, and the observation emission distribution." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-57", "text": "Below we introduce them by using \u0398 to denote the set of parameters involved and using 1 to denote a column vector with all 1s." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-58", "text": "First we use the following multinomial distribution as the initial state distribution," }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-59", "text": "where \u03c6(s t ) \u2208 {0, 1} H is a one-hot vector with a single 1 value at its s t -th entry, and \u03bb \u2208 R H is the parameter vector such that \u03bb \u2265 0 and \u03bb 1 = 1." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-60", "text": "We then define a multinomial logistic regression model for the state transition distribution," }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-61", "text": "where W \u2208 R H\u00d7m is the regression parameter matrix and Z(s t ; \u0398) is the normalization term." 
}, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-62", "text": "Finally, we assume the observation vector is generated from a multivariate Gaussian distribution, i.e., o t \u223c N (\u03c6(s t ) M Q, \u03c3I d ), and use the following model for the emission distribution. The standard HMMs (Rabiner and Juang, 1986) use a conditional probability table for the state transition distribution, which grows quadratically with respect to the number of hidden states, and one for the emission distribution, which grows linearly with respect to the observed vocabulary size that is usually very large in NLP tasks." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-63", "text": "Instead, the dHMMs can significantly reduce the sizes of these conditional probability tables by introducing the low-dimensional state embedding vectors, and the dHMM is much more efficient in terms of memory storage." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-64", "text": "In fact, the complexity of dHMMs can be independent of the vocabulary size by using flexible observation features." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-65", "text": "We represent the dHMM parameter set as \u0398 = {M \u2208 R H\u00d7m , W \u2208 R H\u00d7m , Q \u2208 R m\u00d7d , \u03c3 \u2208 R, \u03bb \u2208 [0, 1] H }, where m is a small constant." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-66", "text": "----------------------------------" }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-67", "text": "**MODEL TRAINING**" }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-68", "text": "Given a data set of N observed sequences {O 1 , . . . , O N }, its regularized log-likelihood function is" }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-69", "text": "L(\u0398) = \u2211 n log P (O n ; \u0398) \u2212 \u03b7 R(\u0398), (1)" }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-70", "text": "where the regularization function R(\u0398) is defined with Frobenius norms over the parameter matrices." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-71", "text": "Moreover, each log-likelihood term has the following lower bound: log P (O n ; \u0398) = log \u2211 S n Q(S n ) P (O n , S n ; \u0398)/Q(S n ) \u2265 \u2211 S n Q(S n ) log [P (O n , S n ; \u0398)/Q(S n )], (2)" }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-72", "text": "where Q(S n ) is any valid distribution over the hidden state variables S n and KL(.||.) denotes the Kullback-Leibler divergence; the gap of the bound (2) equals KL(Q(S n )||P (S n |O n ; \u0398))." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-73", "text": "Let F(Q, \u0398) denote the regularized lower bound function obtained by plugging the lower bound (2) back into the objective function (1)." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-74", "text": "We then perform training by using an expectation-maximization (EM) algorithm (Dempster et al., 1977) that iteratively maximizes F(Q, \u0398) to reach a local optimal solution." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-75", "text": "We first randomly initialize the model parameters while enforcing \u03bb to be in the feasible region (\u03bb \u2265 0, \u03bb 1 = 1)." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-76", "text": "In the (k+1)-th iteration, given {Q (k) , \u0398 (k) }, we then sequentially update Q with an E-step (3) and update \u0398 with an M-step (4)." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-77", "text": "----------------------------------" }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-78", "text": "**DOMAIN ADAPTATION WITH DISTRIBUTED STATE REPRESENTATIONS**" }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-79", "text": "We use all training data from the two domains to train dHMMs for local optimal model parameters \u0398 * = {M * , W * , Q * , \u03c3 * , \u03bb * }." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-80", "text": "We then infer the latent state sequence S * = {s * 1 , s * 2 , . . . 
, s * T } using the standard Viterbi algorithm (Rabiner and Juang, 1986) for each labeled source training sentence and each target test sentence." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-81", "text": "The corresponding distributed state representation vectors can be obtained as the rows of the learned state embedding matrix M * indexed by the inferred states." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-82", "text": "We then train a supervised NLP system (e.g., POS tagging or NP chunking) on the labeled source training sentences using the distributed state representations as augmenting input features and perform prediction on the augmented test sentences." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-83", "text": "----------------------------------" }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-84", "text": "**EXPERIMENTS**" }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-85", "text": "We conducted experiments on cross-domain part-of-speech (POS) tagging and noun-phrase (NP) chunking tasks." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-86", "text": "We used the same experimental datasets as in (Huang and Yates, 2009) for cross-domain POS tagging from the Wall Street Journal (WSJ) domain (Marcus et al., 1993) to the MEDLINE domain (PennBioIE, 2005) and for cross-domain NP chunking from the CoNLL shared task dataset (Tjong et al., 2000) to the Open American National Corpus (OANC) (Reppen et al., 2005)." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-87", "text": "----------------------------------" }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-88", "text": "**REPRESENTATION LEARNING**" }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-89", "text": "We first built a unified vocabulary with all the data in the two domains." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-90", "text": "We then conducted latent semantic analysis (LSA) over the sentence-word frequency matrix to get a low-dimensional representation vector for each word." 
}, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-91", "text": "We used a sliding window with size 3 to construct the d-dimensional feature vector (d = 1500) for each observation in a given sentence." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-92", "text": "We used \u03b7 = 0.5, set the number of hidden states H to be 80 and the dimensionality m = 20." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-93", "text": "We used all the labeled and unlabeled training data in the two domains to train dHMMs." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-94", "text": "----------------------------------" }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-95", "text": "**TEST RESULTS**" }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-96", "text": "We used the induced distributed state representations of each observation as augmenting features to train conditional random fields (CRF) with the CRFSuite package (Okazaki, 2007) on the labeled source sentences and perform prediction on the target test sentences." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-97", "text": "We compared with the following systems: a Baseline system without representation learning, an SGM based word embedding system (Hovy et al., 2015), and a discrete hidden state based clustering system (Huang and Yates, 2009)." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-98", "text": "We used the word id and orthographic features as the baseline features for POS tagging and added POS tags for NP chunking." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-99", "text": "We reported the POS tagging accuracy for all words and out-of-vocabulary (OOV) words (which appear fewer than three times in the labeled source training sentences), and NP chunking F1 scores for all NPs and only OOV NPs (whose beginning word is an OOV word) in Table 1." 
}, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-100", "text": "We can see that the Baseline method performs poorly on both tasks, especially on the OOV words/NPs, which shows that the original lexical based features are not sufficient to develop a robust POS tagger/NP chunker for the target domain with labeled source training sentences." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-101", "text": "By using unlabeled training sentences from the two domains, all representation learning approaches increase the cross-domain test performance, especially on the OOV words/NPs." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-102", "text": "These improvements over the Baseline method demonstrate that the induced latent features do alleviate the feature sparsity issue across the two domains and help the trained NLP system generalize well in the target domain." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-103", "text": "Among these representation learning approaches, the proposed distributed state representation learning method outperforms both the word embedding based and the discrete HMM hidden state based systems." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-104", "text": "This suggests that by learning distributed representations in a context-aware manner, dHMMs can effectively bridge domain divergence." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-105", "text": "----------------------------------" }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-106", "text": "**SENSITIVITY ANALYSIS OVER THE DIMENSIONALITY OF STATE EMBEDDINGS**" }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-107", "text": "We also conducted experiments to investigate how the dimensionality of the distributed state representations, m, in our proposed approach affects cross-domain test performance given a fixed state number H = 80." 
}, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-108", "text": "We" }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-109", "text": "----------------------------------" }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-110", "text": "**CONCLUSION**" }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-111", "text": "In this paper, we extended the standard HMMs to learn distributed state representations and facilitate cross-domain sequence predictions." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-112", "text": "We mapped each state variable to a distributed representation vector and simultaneously learned the state embedding matrix and the model parameters with an EM algorithm." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-113", "text": "The experimental results on cross-domain POS tagging and NP chunking tasks demonstrated the effectiveness of the proposed approach for domain adaptation." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-114", "text": "In the future, we plan to apply this approach to other cross-domain prediction tasks such as named entity recognition or semantic role labeling." }, { "sent_id": "188f10a5b78a5e691e10d180dfde6f-C001-115", "text": "We also plan to extend our method to learn cross-lingual representations with auxiliary resources such as bilingual dictionaries or parallel sentences." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "188f10a5b78a5e691e10d180dfde6f-C001-11" ], [ "188f10a5b78a5e691e10d180dfde6f-C001-14" ], [ "188f10a5b78a5e691e10d180dfde6f-C001-30" ], [ "188f10a5b78a5e691e10d180dfde6f-C001-50" ] ], "cite_sentences": [ "188f10a5b78a5e691e10d180dfde6f-C001-11", "188f10a5b78a5e691e10d180dfde6f-C001-14", "188f10a5b78a5e691e10d180dfde6f-C001-30", "188f10a5b78a5e691e10d180dfde6f-C001-50" ] }, "@SIM@": { "gold_contexts": [ [ "188f10a5b78a5e691e10d180dfde6f-C001-21" ], [ "188f10a5b78a5e691e10d180dfde6f-C001-86" ] ], "cite_sentences": [ "188f10a5b78a5e691e10d180dfde6f-C001-21", "188f10a5b78a5e691e10d180dfde6f-C001-86" ] }, "@USE@": { "gold_contexts": [ [ "188f10a5b78a5e691e10d180dfde6f-C001-21" ], [ "188f10a5b78a5e691e10d180dfde6f-C001-86" ] ], "cite_sentences": [ "188f10a5b78a5e691e10d180dfde6f-C001-21", "188f10a5b78a5e691e10d180dfde6f-C001-86" ] }, "@UNSURE@": { "gold_contexts": [ [ "188f10a5b78a5e691e10d180dfde6f-C001-97" ] ], "cite_sentences": [ "188f10a5b78a5e691e10d180dfde6f-C001-97" ] } } }, "ABC_9666fd26c7e9a02505ff26a687076d_28": { "x": [ { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-19", "text": "This is commonly implemented for the HMM alignment model as well as Models 4 and 5." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-66", "text": "----------------------------------" }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-67", "text": "**EMPIRICAL RESULTS**" }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-2", "text": "A modification of a reparameterisation of IBM Model 2 is presented, which makes the model more flexible, and able to model a preference for aligning to words to either the right or left, and take into account POS tags on the target side of the corpus." 
}, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-3", "text": "We show that this extension has a very small impact on training times, while obtaining better alignments in terms of BLEU scores." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-4", "text": "----------------------------------" }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-5", "text": "**INTRODUCTION**" }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-6", "text": "Word alignment is at the basis of most statistical machine translation." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-7", "text": "The models that are generally used are often slow to train, and have a large number of parameters." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-8", "text": "Dyer et al. (2013) present a simple reparameterization of IBM Model 2 that is very fast to train, and achieves results similar to IBM Model 4." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-9", "text": "While this model is very effective, it also has a very low number of parameters, and as such doesn't have a large amount of expressive power." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-10", "text": "For one thing, it forces the model to consider alignments on both sides of the diagonal equally likely." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-11", "text": "However, it isn't clear that this is the case, as for some languages an alignment to earlier or later in the sentence (above or below the diagonal) could be common, due to word order differences." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-12", "text": "For example, when aligning to Dutch, it may be common for one verb to be aligned near the end of the sentence that would be at the beginning in English." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-13", "text": "This would mean most of the other words in the sentence would also align slightly away from the diagonal in one direction." 
}, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-14", "text": "Figure 1 shows an example sentence in which this happens." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-15", "text": "Here, a circle denotes an alignment, and darker squares are more likely under the alignment model." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-16", "text": "In this case the modified Model 2 would simply make both directions equally likely, where we would really like for only one direction to be more likely." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-17", "text": "In some cases it could be that the prior probability for a word alignment should be off the diagonal." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-18", "text": "Furthermore, it is common in word alignment to take word classes into account." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-20", "text": "Och and Ney (2003) show that for larger corpora, using word classes leads to lower Alignment Error Rate (AER)." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-21", "text": "This is not implemented for Model 2, as it already has an alignment model that is dependent on both source and target length, and the position in both sentences, and adding a dependency on word classes would make the model even more prone to overfitting than it already is." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-22", "text": "However, using the reparameterization in (Dyer et al., 2013) would leave the model simple enough even with a relatively large number of word classes." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-23", "text": "word class, and so can have different gradients for alignment probability over the English words." 
}, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-24", "text": "If the model has learned that prepositions and nouns are more likely to align to words later in the sentence, it could have a lower lambda for both word classes, resulting in a less steep slope." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-25", "text": "If we also split lambda into two variables, we can get alignment probabilities as shown above for the Dutch word 'de', where aligning to one side of the diagonal is made more likely for some word classes." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-26", "text": "Finally, instead of just having one side of the diagonal less steep than the other, it may be useful to instead move the peak of the alignment probability function off the diagonal, while keeping it equally likely." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-27", "text": "In Figure 2, this is done for the past participle 'gezien'." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-28", "text": "We present a simple model adding the above extensions (splitting the parameter, adding an offset and conditioning the parameters on the POS tag of the target word) in section 2, report results on a set of experiments in section 3 and present our conclusions in section 4." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-29", "text": "----------------------------------" }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-30", "text": "**METHODS**" }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-31", "text": "We make use of a modified version of Model 2, from Dyer et al. (2013), which has an alignment model that is parameterised in its original form solely by the variable \u03bb." 
}, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-32", "text": "Specifically, the probability of a sentence e given a sentence f is given as:" }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-33", "text": "Here, m is the length of the target sentence e, n that of the source sentence f , \u03b4 is the alignment model and \u03b8 is the translation model." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-34", "text": "In this paper we are mainly concerned with the alignment model \u03b4." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-35", "text": "In the original formulation (with a minor tweak to ensure symmetry through the center), this function is defined as:" }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-36", "text": "where h(\u00b7) is defined as" }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-37", "text": "i.e., a normalising function." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-38", "text": "Like the original Model 2 (Brown et al., 1993), this model is trained using Expectation-Maximisation." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-39", "text": "However, it is not possible to directly update the \u03bb parameter during training, as it cannot be computed analytically." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-40", "text": "Instead, a gradient-based approach is used during the M-step." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-41", "text": "Two different optimisations are employed, the first of which is used for calculating Z \u03bb ." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-42", "text": "This function forms a geometric series away from the diagonal (for each target word), which can be computed efficiently for each of the directions from the diagonal." 
}, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-43", "text": "The second is used during the M-step when computing the derivative, and is very similar, but instead of using a geometric series, an arithmetico-geometric series is used." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-44", "text": "In order to allow the model to have a different parameter above and below the diagonal, the only change needed is to redefine h(\u00b7) to use a different parameter for \u03bb above and below the diagonal." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-45", "text": "We denote these parameters as \u03bb and \u03b3 for below and above the diagonal respectively." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-46", "text": "Further, the offset is denoted as \u03c9." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-47", "text": "We change the definition of h(\u00b7) to the following instead:" }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-48", "text": "Here, j \u2193 is the point closest to or on the diagonal, calculated as:" }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-49", "text": "Here, \u03c9 can range from \u22121 to 1, and thus the calculation for the diagonal j \u2193 is clamped to be in a valid range for alignments." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-50", "text": "As the partition function (Z(\u00b7)) used in (Dyer et al., 2013) consists of two calculations for each target position i, one for above and one for below the diagonal, we can simply substitute \u03b3 for the geometric series calculations in order to use different parameters for each:" }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-51", "text": "where j \u2191 is j \u2193 + 1." 
}, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-52", "text": "----------------------------------" }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-53", "text": "**OPTIMIZING THE PARAMETERS**" }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-54", "text": "As in the original formulation, we need to use gradient-based optimisation in order to find good values for \u03bb, \u03b3 and \u03c9." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-55", "text": "Unfortunately, optimizing \u03c9 would require taking the derivative of h(\u00b7), and thus the derivative of the absolute value." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-56", "text": "This derivative is undefined when the argument is 0; we work around this by choosing a subgradient of 0 at that point." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-57", "text": "This means the steps we take do not always improve the objective function, but in practice the method works well." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-58", "text": "The first derivative of L with respect to \u03bb at a single target word becomes:" }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-59", "text": "The first derivative with respect to \u03b3 is found similarly, but summing from j \u2191 to n instead." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-60", "text": "The first derivative with respect to \u03c9, then, is:" }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-61", "text": "Where h'(\u00b7) is the first derivative of h(\u00b7) with respect to \u03c9." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-62", "text": "For obtaining this derivative, the arithmetico-geometric series (Fernandez et al., 2006) was originally used as an optimization; for the gradient with respect to \u03c9, a geometric series should suffice, as there is no conditioning on the source words." 
}, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-63", "text": "This is not done in the current work however, so timing results will not be directly comparable to those found in (Dyer et al., 2013)." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-64", "text": "Conditioning on the POS of the target words then becomes as simple as using a different \u03bb, \u03b3, and \u03c9 for each POS tag in the input, and calculating a separate derivative for each of them, using only the derivatives at those target words that use the POS tag." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-65", "text": "A minor detail is to keep a count of alignment positions used for finding the derivative for each different parameter, and normalizing the resulting derivatives with those counts, so the step size can be kept constant across POS tags." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-68", "text": "The above described model is evaluated with experiments on a set of three language pairs, on which AER scores and BLEU scores are computed." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-69", "text": "We use similar corpora to those used in (Dyer et al., 2013): a French-English corpus made up of Europarl version 7 and news-commentary corpora, the Arabic-English parallel data consisting of the non-UN portions of the NIST training corpora, and the FBIS Chinese-English corpora." 
}, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-70", "text": "The models that are compared are the original reparameterization of Model 2; a version where \u03bb is split around the diagonal (split); one where POS tags are used, but \u03bb is not split around the diagonal (pos); one where an offset is used, but parameters aren't split about the diagonal (offset); and one that is split about the diagonal and uses POS tags (pos & split). All are trained as in (Dyer et al., 2013), with a step size for updates to \u03bb and \u03b3 during gradient ascent of 1000, and that for \u03c9 of 0.03, decaying after every gradient ascent step by 0.9, using 8 steps every iteration." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-71", "text": "Both \u03bb and \u03b3 are initialised to 6, and \u03c9 is initialised to 0." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-72", "text": "For these experiments the pos and pos & split models use POS tags generated using the Stanford POS tagger (Toutanova and Manning, 2000), using the supplied models for all of the languages used in the experiments." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-73", "text": "For comparison, Model 4 is trained for 5 iterations using 5 iterations each of Model 1 and Model 3 as initialization, using GIZA++ (Och and Ney, 2003)." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-74", "text": "For the comparisons in AER, the corpora are used as-is, but for the BLEU comparisons, sentences longer than 50 words are filtered out." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-75", "text": "In Table 2 the sizes of the corpora before filtering are listed, as well as the time taken in hours to align the corpora for AER." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-76", "text": "As the training times for the different versions barely differ, only the average is displayed for the models here described and Model 4 training times are given for comparison." 
}, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-77", "text": "Note that the times for the models optimizing only \u03bb and \u03b3, and the model only optimizing \u03c9 still calculate the derivatives for the other parameters, and so could be made to be faster than here displayed." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-78", "text": "For both the BLEU and AER results, the alignments are generated in both directions, and symmetrised using the grow-diag-final-and heuristic, which in preliminary tests had shown to do best in terms of AER." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-79", "text": "The results are given in Table 2 ." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-80", "text": "These scores were computed using the WMT2012 data as gold standard." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-81", "text": "The different extensions to the model make no difference to the AER scores for ChineseEnglish, and actually do slightly worse for FrenchEnglish." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-82", "text": "In both cases, Model 4 does better than the models introduced here." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-83", "text": "For the comparisons of translation quality, the models are trained up using a phrase-based translation system (Koehn et al., 2007 ) that used the above listed models to align the data." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-84", "text": "Language models were augmented with data outside of the corpora for Chinese-English (200M words total) and Arabic-English (100M words total)." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-85", "text": "Test sets for Chinese are MT02, MT03, MT06 and MT08, for Arabic they were MT05, MT06 and MT08, and for French they were the newssyscomb2009 data and the newstest 2009-2012 data." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-86", "text": "The results are listed in Table 3 1 ." 
}, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-87", "text": "BLEU scores for Arabic-English and Chinese-English are computed with multiple references, while those for French-English are against a single reference." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-88", "text": "Although the different models made little difference in AER, there is quite a bit of variation in the BLEU scores between the different models." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-89", "text": "In all cases, the models conditioned on POS tags did better than the original model, by as much as 0.5 BLEU points." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-90", "text": "For Arabic-English as well as Chinese-English, the full model outperformed Model 4, in the case of Chinese-English by 0.9 BLEU points." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-91", "text": "The low impact of the split and offset models are most likely due to the need to model all alignments in the corpus." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-92", "text": "The distributions can't skew too far to aligning to one direction, as that would lower the probability of a large amount of alignments." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-93", "text": "This is reflected in the resulting parameters \u03bb, \u03b3 and \u03c9 that are estimated, as the first two do not differ much from the parameters estimated when both are kept the same, and the second tends to be very small." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-94", "text": "As for the Pos model, it seems that only varying the symmetrical slope for the different POS tags doesn't capture the differences between distributions for POS tags." 
}, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-95", "text": "For example, the \u03bb and \u03b3 parameters can differ quite a lot in the Pos & Split model when compared to the Pos model, with one side having a much smaller parameter and the other a much larger parameter for a given POS tag in the first model, and the single parameter being closer to the model average for the same POS tag in the second model." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-96", "text": "The low variation in results between the different models for French-English might be explained by less word movement when translating between these languages, which could mean the original model is sufficient to capture this behaviour." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-97", "text": "----------------------------------" }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-98", "text": "**CONCLUSION**" }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-99", "text": "We have shown some extensions to a reparameterized IBM Model 2, allowing it to model word reordering better." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-100", "text": "Although these models don't improve on the baseline in terms of AER, they do better than the original in all three languages tested, and outperform M4 in two of these languages, at little cost in terms of training time." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-101", "text": "Future directions for this work include allowing for more expressivity of the alignment model by using a Beta distribution instead of the current exponential model." }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-102", "text": "----------------------------------" }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-103", "text": "**ACKNOWLEDGMENTS**" }, { "sent_id": "9666fd26c7e9a02505ff26a687076d-C001-104", "text": "Dr Cohn is the recipient of an Australian Research Council Future Fellowship (project number FT130101105)." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "9666fd26c7e9a02505ff26a687076d-C001-8" ], [ "9666fd26c7e9a02505ff26a687076d-C001-22" ], [ "9666fd26c7e9a02505ff26a687076d-C001-50" ] ], "cite_sentences": [ "9666fd26c7e9a02505ff26a687076d-C001-8", "9666fd26c7e9a02505ff26a687076d-C001-22", "9666fd26c7e9a02505ff26a687076d-C001-50" ] }, "@EXT@": { "gold_contexts": [ [ "9666fd26c7e9a02505ff26a687076d-C001-31" ] ], "cite_sentences": [ "9666fd26c7e9a02505ff26a687076d-C001-31" ] }, "@DIF@": { "gold_contexts": [ [ "9666fd26c7e9a02505ff26a687076d-C001-63" ] ], "cite_sentences": [ "9666fd26c7e9a02505ff26a687076d-C001-63" ] }, "@SIM@": { "gold_contexts": [ [ "9666fd26c7e9a02505ff26a687076d-C001-69" ] ], "cite_sentences": [ "9666fd26c7e9a02505ff26a687076d-C001-69" ] }, "@UNSURE@": { "gold_contexts": [ [ "9666fd26c7e9a02505ff26a687076d-C001-70" ] ], "cite_sentences": [ "9666fd26c7e9a02505ff26a687076d-C001-70" ] } } }, "ABC_0d1fb27d847ca44af36862cf78744e_28": { "x": [ { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-2", "text": "In order to realize the full potential of dependency-based syntactic parsing, it is desirable to allow non-projective dependency structures." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-3", "text": "We show how a datadriven deterministic dependency parser, in itself restricted to projective structures, can be combined with graph transformation techniques to produce non-projective structures." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-4", "text": "Experiments using data from the Prague Dependency Treebank show that the combined system can handle nonprojective constructions with a precision sufficient to yield a significant improvement in overall parsing accuracy." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-5", "text": "This leads to the best reported performance for robust non-projective parsing of Czech." 
}, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-6", "text": "----------------------------------" }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-8", "text": "It is sometimes claimed that one of the advantages of dependency grammar over approaches based on constituency is that it allows a more adequate treatment of languages with variable word order, where discontinuous syntactic constructions are more common than in languages like English (Mel'\u010duk, 1988; Covington, 1990) ." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-9", "text": "However, this argument is only plausible if the formal framework allows non-projective dependency structures, i.e. structures where a head and its dependents may correspond to a discontinuous constituent." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-10", "text": "From the point of view of computational implementation this can be problematic, since the inclusion of non-projective structures makes the parsing problem more complex and therefore compromises efficiency and in practice also accuracy and robustness." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-11", "text": "Thus, most broad-coverage parsers based on dependency grammar have been restricted to projective structures." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-12", "text": "This is true of the widely used link grammar parser for English (Sleator and Temperley, 1993) , which uses a dependency grammar of sorts, the probabilistic dependency parser of Eisner (1996) , and more recently proposed deterministic dependency parsers (Yamada and Matsumoto, 2003; ." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-13", "text": "It is also true of the adaptation of the Collins parser for Czech and the finite-state dependency parser for Turkish by Oflazer (2003) ." 
}, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-14", "text": "This is in contrast to dependency treebanks, e.g. Prague Dependency Treebank (Haji\u010d et al., 2001b) , Danish Dependency Treebank (Kromann, 2003) , and the METU Treebank of Turkish , which generally allow annotations with nonprojective dependency structures." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-15", "text": "The fact that projective dependency parsers can never exactly reproduce the analyses found in non-projective treebanks is often neglected because of the relative scarcity of problematic constructions." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-16", "text": "While the proportion of sentences containing non-projective dependencies is often 15-25%, the total proportion of non-projective arcs is normally only 1-2%." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-17", "text": "As long as the main evaluation metric is dependency accuracy per word, with state-of-the-art accuracy mostly below 90%, the penalty for not handling non-projective constructions is almost negligible." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-18", "text": "Still, from a theoretical point of view, projective parsing of non-projective structures has the drawback that it rules out perfect accuracy even as an asymptotic goal." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-19", "text": "There exist a few robust broad-coverage parsers that produce non-projective dependency structures, notably Tapanainen and J\u00e4rvinen (1997) and Wang and Harper (2004) for English, Foth et al. (2004) for German, and Holan (2004) for Czech." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-20", "text": "In addition, there are several approaches to non-projective dependency parsing that are still to be evaluated in the large (Covington, 1990; Kahane et al., 1998; Duchier and Debusmann, 2001; Holan et al., 2001; Hellwig, 2003) ." 
}, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-21", "text": "Finally, since non-projective constructions often involve long-distance dependencies, the problem is closely related to the recovery of empty categories and non-local dependencies in constituency-based parsing (Johnson, 2002; Dienes and Dubey, 2003; Jijkoun and de Rijke, 2004; Cahill et al., 2004; Levy and Manning, 2004; Campbell, 2004) ." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-22", "text": "In this paper, we show how non-projective dependency parsing can be achieved by combining a datadriven projective parser with special graph transformation techniques." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-23", "text": "First, the training data for the parser is projectivized by applying a minimal number of lifting operations (Kahane et al., 1998) and encoding information about these lifts in arc labels." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-24", "text": "When the parser is trained on the transformed data, it will ideally learn not only to construct projective dependency structures but also to assign arc labels that encode information about lifts." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-25", "text": "By applying an inverse transformation to the output of the parser, arcs with non-standard labels can be lowered to their proper place in the dependency graph, giving rise 1 The dependency graph has been modified to make the final period a dependent of the main verb instead of being a dependent of a special root node for the sentence." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-26", "text": "to non-projective structures." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-95", "text": "The last four Table 3 show the distribution of nonprojective arcs with respect to the number of lifts required." 
}, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-27", "text": "We call this pseudoprojective dependency parsing, since it is based on a notion of pseudo-projectivity (Kahane et al., 1998) ." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-28", "text": "The rest of the paper is structured as follows." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-29", "text": "In section 2 we introduce the graph transformation techniques used to projectivize and deprojectivize dependency graphs, and in section 3 we describe the data-driven dependency parser that is the core of our system." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-30", "text": "We then evaluate the approach in two steps." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-31", "text": "First, in section 4, we evaluate the graph transformation techniques in themselves, with data from the Prague Dependency Treebank and the Danish Dependency Treebank." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-32", "text": "In section 5, we then evaluate the entire parsing system by training and evaluating on data from the Prague Dependency Treebank." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-33", "text": "----------------------------------" }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-34", "text": "**DEPENDENCY GRAPH TRANSFORMATIONS**" }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-35", "text": "We assume that the goal in dependency parsing is to construct a labeled dependency graph of the kind depicted in Figure 1 ." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-36", "text": "Formally, we define dependency graphs as follows:" }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-37", "text": "1. Let R = {r 1 , . . . , r m } be the set of permissible dependency types (arc labels)." 
}, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-38", "text": "----------------------------------" }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-39", "text": "**A DEPENDENCY GRAPH FOR A STRING OF WORDS**" }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-40", "text": "(a) W is the set of nodes, i.e. word tokens in the input string, ordered by a linear precedence relation <, (b) A is a set of labeled arcs (w i , r, w j ), where w i , w j \u2208 W , r \u2208 R, (c) for every w j \u2208 W , there is at most one arc (w i , r, w j ) \u2208 A. If (w i , r, w j ) \u2208 A, we say that w i is the head of w j and w j a dependent of w i ." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-41", "text": "In the following, we use the notation w i r \u2192 w j to mean that (w i , r, w j ) \u2208 A; we also use w i \u2192 w j to denote an arc with unspecified label and w i \u2192 * w j for the reflexive and transitive closure of the (unlabeled) arc relation." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-42", "text": "The dependency graph in Figure 1 satisfies all the defining conditions above, but it fails to satisfy the condition of projectivity (Kahane et al., 1998): 1." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-43", "text": "An arc w i \u2192 w k is projective iff, for every word w j occurring between w i and w k in the string" }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-44", "text": "The arc connecting the head jedna (one) to the dependent Z (out-of) spans the token je (is), which is not dominated by jedna." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-45", "text": "As observed by Kahane et al. (1998) , any (nonprojective) dependency graph can be transformed into a projective one by a lifting operation, which replaces each non-projective arc w j \u2192 w k by a projective arc w i \u2192 w k such that w i \u2192 * w j holds in the original graph." 
}, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-46", "text": "Here we use a slightly different notion of lift, applying to individual arcs and moving their head upwards one step at a time:" }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-47", "text": "Intuitively, lifting an arc makes the word w k dependent on the head w i of its original head w j (which is unique in a well-formed dependency graph), unless w j is a root in which case the operation is undefined (but then w j \u2192 w k is necessarily projective if the dependency graph is well-formed)." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-48", "text": "Projectivizing a dependency graph by lifting nonprojective arcs is a nondeterministic operation in the general case." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-49", "text": "However, since we want to preserve as much of the original structure as possible, we are interested in finding a transformation that involves a minimal number of lifts." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-93", "text": "As shown in Table 3 , the proportion of sentences containing some non-projective dependency ranges from about 15% in DDT to almost 25% in PDT." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-50", "text": "Even this may be nondeterministic, in case the graph contains several non-projective arcs whose lifts interact, but we use the following algorithm to construct a minimal projective transformation D = (W, A ) of a (nonprojective) dependency graph D = (W, A):" }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-51", "text": "The function SMALLEST-NONP-ARC returns the non-projective arc with the shortest distance from head to dependent (breaking ties from left to right)." 
}, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-52", "text": "Applying the function PROJECTIVIZE to the graph in Figure 1 yields the graph in Figure 2 , where the problematic arc pointing to Z has been lifted from the original head jedna to the ancestor je." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-53", "text": "Using the terminology of Kahane et al. (1998) , we say that jedna is the syntactic head of Z, while je is its linear head in the projectivized representation." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-54", "text": "Unlike Kahane et al. (1998) , we do not regard a projectivized representation as the final target of the parsing process." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-55", "text": "Instead, we want to apply an in-Lifted arc label Path labels Number of labels Baseline d p n Head d\u2191h p n(n + 1) Head+Path d\u2191h p\u2193 2n(n + 1) Path d\u2191 p\u2193 4n Table 1 : Encoding schemes (d = dependent, h = syntactic head, p = path; n = number of dependency types) verse transformation to recover the underlying (nonprojective) dependency graph." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-56", "text": "In order to facilitate this task, we extend the set of arc labels to encode information about lifting operations." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-57", "text": "In principle, it would be possible to encode the exact position of the syntactic head in the label of the arc from the linear head, but this would give a potentially infinite set of arc labels and would make the training of the parser very hard." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-58", "text": "In practice, we can therefore expect a trade-off such that increasing the amount of information encoded in arc labels will cause an increase in the accuracy of the inverse transformation but a decrease in the accuracy with which the parser can construct the labeled representations." 
}, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-59", "text": "To explore this tradeoff, we have performed experiments with three different encoding schemes (plus a baseline), which are described schematically in Table 1 ." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-60", "text": "The baseline simply retains the original labels for all arcs, regardless of whether they have been lifted or not, and the number of distinct labels is therefore simply the number n of distinct dependency types." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-61", "text": "2 In the first encoding scheme, called Head, we use a new label d\u2191h for each lifted arc, where d is the dependency relation between the syntactic head and the dependent in the non-projective representation, and h is the dependency relation that the syntactic head has to its own head in the underlying structure." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-62", "text": "Using this encoding scheme, the arc from je to Z in Figure 2 would be assigned the label AuxP\u2191Sb (signifying an AuxP that has been lifted from a Sb)." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-63", "text": "In the second scheme, Head+Path, we in addition modify the label of every arc along the lifting path from the syntactic to the linear head so that if the original label is p the new label is p\u2193. Thus, the arc from je to jedna will be labeled Sb\u2193 (to indicate that there is a syntactic head below it)." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-64", "text": "In the third and final scheme, denoted Path, we keep the extra infor-mation on path labels but drop the information about the syntactic head of the lifted arc, using the label d\u2191 instead of d\u2191h (AuxP\u2191 instead of AuxP\u2191Sb)." 
}, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-65", "text": "As can be seen from the last column in Table 1 , both Head and Head+Path may theoretically lead to a quadratic increase in the number of distinct arc labels (Head+Path being worse than Head only by a constant factor), while the increase is only linear in the case of Path." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-66", "text": "On the other hand, we can expect Head+Path to be the most useful representation for reconstructing the underlying non-projective dependency graph." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-67", "text": "In approaching this problem, a variety of different methods are conceivable, including a more or less sophisticated use of machine learning." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-68", "text": "In the present study, we limit ourselves to an algorithmic approach, using a deterministic breadthfirst search." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-94", "text": "However, the overall percentage of non-projective arcs is less than 2% in PDT and less than 1% in DDT." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-69", "text": "The details of the transformation procedure are slightly different depending on the encoding schemes: In section 4 we evaluate these transformations with respect to projectivized dependency treebanks, and in section 5 they are applied to parser output." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-70", "text": "Before we turn to the evaluation, however, we need to introduce the data-driven dependency parser used in the latter experiments." 
}, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-71", "text": "----------------------------------" }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-72", "text": "**MEMORY-BASED DEPENDENCY PARSING**" }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-73", "text": "In the experiments below, we employ a data-driven deterministic dependency parser producing labeled projective dependency graphs, 3 previously tested on Swedish and English (Nivre and Scholz, 2004) ." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-74", "text": "The parser builds dependency graphs by traversing the input from left to right, using a stack to store tokens that are not yet complete with respect to their dependents." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-75", "text": "At each point during the derivation, the parser has a choice between pushing the next input token onto the stack -with or without adding an arc from the token on top of the stack to the token pushed -and popping a token from the stack -with or without adding an arc from the next input token to the token popped." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-76", "text": "More details on the parsing algorithm can be found in Nivre (2003) ." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-77", "text": "The choice between different actions is in general nondeterministic, and the parser relies on a memorybased classifier, trained on treebank data, to predict the next action based on features of the current parser configuration." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-78", "text": "Table 2 shows the features used in the current version of the parser." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-79", "text": "At each point during the derivation, the prediction is based on six word tokens, the two topmost tokens on the stack, and the next four input tokens." 
}, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-80", "text": "For each token, three types of features may be taken into account: the word form; the part-of-speech assigned by an automatic tagger; and labels on previously assigned dependency arcs involving the token -the arc from its head and the arcs to its leftmost and rightmost dependent, respectively." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-81", "text": "Except for the left-most dependent of the next input token, dependency type features are limited to tokens on the stack." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-82", "text": "The prediction based on these features is a knearest neighbor classification, using the IB1 algorithm and k = 5, the modified value difference metric (MVDM) and class voting with inverse distance weighting, as implemented in the TiMBL software package (Daelemans et al., 2003) ." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-83", "text": "More details on the memory-based prediction can be found in and Nivre and Scholz (2004) ." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-84", "text": "----------------------------------" }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-85", "text": "**EXPERIMENT 1: TREEBANK TRANSFORMATION**" }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-86", "text": "The first experiment uses data from two dependency treebanks." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-87", "text": "The Prague Dependency Treebank (PDT) consists of more than 1M words of newspaper text, annotated on three levels, the morphological, analytical and tectogrammatical levels (Haji\u010d, 1998) ." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-88", "text": "Our experiments all concern the analytical annotation, and the first experiment is based only on the training part." 
}, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-89", "text": "The Danish Dependency Treebank (DDT) comprises about 100K words of text selected from the Danish PAROLE corpus, with annotation of primary and secondary dependencies (Kromann, 2003) ." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-90", "text": "The entire treebank is used in the experiment, but only primary dependencies are considered." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-91", "text": "4 In all experiments, punctuation tokens are included in the data but omitted in evaluation scores." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-92", "text": "In the first part of the experiment, dependency graphs from the treebanks were projectivized using the algorithm described in section 2." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-96", "text": "It is worth noting that, although nonprojective constructions are less frequent in DDT than in PDT, they seem to be more deeply nested, since only about 80% can be projectivized with a single lift, while almost 95% of the non-projective arcs in PDT only require a single lift." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-97", "text": "In the second part of the experiment, we applied the inverse transformation based on breadth-first search under the three different encoding schemes." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-98", "text": "The results are given in Table 4 ." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-99", "text": "As expected, the most informative encoding, Head+Path, gives the highest accuracy with over 99% of all non-projective arcs being recovered correctly in both data sets." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-100", "text": "However, it can be noted that the results for the least informative encoding, Path, are almost comparable, while the third encoding, Head, gives substantially worse results for both data sets." 
}, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-101", "text": "We also see that the increase in the size of the label sets for Head and Head+Path is far below the theoretical upper bounds given in Table 1 ." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-102", "text": "The increase is generally higher for PDT than for DDT, which indicates a greater diversity in non-projective constructions." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-103", "text": "----------------------------------" }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-104", "text": "**EXPERIMENT 2: MEMORY-BASED PARSING**" }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-105", "text": "The second experiment is limited to data from PDT." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-106", "text": "5 The training part of the treebank was projectivized under different encoding schemes and used to train memory-based dependency parsers, which were run on the test part of the treebank, consisting of 7,507" }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-107", "text": "5 Preliminary experiments using data from DDT indicated that the limited size of the treebank creates a severe sparse data problem with respect to non-projective constructions." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-108", "text": "sentences and 125,713 tokens." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-109", "text": "6 The inverse transformation was applied to the output of the parsers and the result compared to the gold standard test set." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-110", "text": "Table 5 shows the overall parsing accuracy attained with the three different encoding schemes, compared to the baseline (no special arc labels) and to training directly on non-projective dependency graphs." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-111", "text": "Evaluation metrics used are Attachment Score (AS), i.e. 
the proportion of tokens that are attached to the correct head, and Exact Match (EM), i.e. the proportion of sentences for which the dependency graph exactly matches the gold standard." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-112", "text": "In the labeled version of these metrics (L) both heads and arc labels must be correct, while the unlabeled version (U) only considers heads." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-113", "text": "The first thing to note is that projectivizing helps in itself, even if no encoding is used, as seen from the fact that the projective baseline outperforms the non-projective training condition by more than half a percentage point on attachment score, although the gain is much smaller with respect to exact match." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-114", "text": "The second main result is that the pseudo-projective approach to parsing (using special arc labels to guide an inverse transformation) gives a further improvement of about one percentage point on attachment score." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-115", "text": "With respect to exact match, the improvement is even more noticeable, which shows quite clearly that even if non-projective dependencies are rare on the token level, they are nevertheless important for getting the global syntactic structure correct." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-116", "text": "All improvements over the baseline are statistically significant beyond the 0.01 level (McNemar's test)." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-117", "text": "Table 6: Precision, recall and F-measure for non-projective arcs."
}, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-118", "text": "By contrast, when we turn to a comparison of the three encoding schemes it is hard to find any significant differences, and the overall impression is that it makes little or no difference which encoding scheme is used, as long as there is some indication of which words are assigned their linear head instead of their syntactic head by the projective parser." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-119", "text": "This may seem surprising, given the experiments reported in section 4, but the explanation is probably that the non-projective dependencies that can be recovered at all are of the simple kind that only requires a single lift, where the encoding of path information is often redundant." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-120", "text": "It is likely that the more complex cases, where path information could make a difference, are beyond the reach of the parser in most cases." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-121", "text": "However, if we consider precision, recall and F-measure on non-projective dependencies only, as shown in Table 6, some differences begin to emerge." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-122", "text": "The most informative scheme, Head+Path, gives the highest scores, although with respect to Head the difference is not statistically significant, while the least informative scheme, Path - with almost the same performance on treebank transformation - is significantly lower (p < 0.01)." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-123", "text": "On the other hand, given that all schemes have similar parsing accuracy overall, this means that the Path scheme is the least likely to introduce errors on projective arcs." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-124", "text": "The overall parsing accuracy obtained with the pseudo-projective approach is still lower than for the best projective parsers."
}, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-125", "text": "Although the best published result for the Collins parser is 80% UAS (Collins, 1999), this parser reaches 82% when trained on the entire training data set, and an adapted version of Charniak's parser (Charniak, 2000) performs at 84% (Jan Haji\u010d, pers. comm.)." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-126", "text": "However, the accuracy is considerably higher than previously reported results for robust non-projective parsing of Czech, with a best performance of 73% UAS (Holan, 2004)." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-127", "text": "Compared to related work on the recovery of long-distance dependencies in constituency-based parsing, our approach is similar to that of Dienes and Dubey (2003) in that the processing of non-local dependencies is partly integrated in the parsing process, via an extension of the set of syntactic categories, whereas most other approaches rely on post-processing only." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-128", "text": "However, while Dienes and Dubey recognize empty categories in a pre-processing step and only let the parser find their antecedents, we use the parser both to detect dislocated dependents and to predict either the type or the location of their syntactic head (or both), and use post-processing only to transform the graph in accordance with the parser's analysis." }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-129", "text": "----------------------------------" }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-130", "text": "**CONCLUSION**" }, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-131", "text": "We have presented a new method for non-projective dependency parsing, based on a combination of data-driven projective dependency parsing and graph transformation techniques."
}, { "sent_id": "0d1fb27d847ca44af36862cf78744e-C001-132", "text": "The main result is that the combined system can recover non-projective dependencies with a precision sufficient to give a significant improvement in overall parsing accuracy, especially with respect to the exact match criterion, leading to the best reported performance for robust non-projective parsing of Czech." } ], "y": { "@BACK@": { "gold_contexts": [ [ "0d1fb27d847ca44af36862cf78744e-C001-20" ], [ "0d1fb27d847ca44af36862cf78744e-C001-23" ], [ "0d1fb27d847ca44af36862cf78744e-C001-45" ] ], "cite_sentences": [ "0d1fb27d847ca44af36862cf78744e-C001-20", "0d1fb27d847ca44af36862cf78744e-C001-23", "0d1fb27d847ca44af36862cf78744e-C001-45" ] }, "@UNSURE@": { "gold_contexts": [ [ "0d1fb27d847ca44af36862cf78744e-C001-27" ], [ "0d1fb27d847ca44af36862cf78744e-C001-42" ] ], "cite_sentences": [ "0d1fb27d847ca44af36862cf78744e-C001-27", "0d1fb27d847ca44af36862cf78744e-C001-42" ] }, "@USE@": { "gold_contexts": [ [ "0d1fb27d847ca44af36862cf78744e-C001-53" ] ], "cite_sentences": [ "0d1fb27d847ca44af36862cf78744e-C001-53" ] }, "@DIF@": { "gold_contexts": [ [ "0d1fb27d847ca44af36862cf78744e-C001-54" ] ], "cite_sentences": [ "0d1fb27d847ca44af36862cf78744e-C001-54" ] } } }, "ABC_fe30705e03f0475f9ab9d044a3c9ca_28": { "x": [ { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-2", "text": "We introduce a new highly scalable approach for computing Distributional Thesauri (DTs)." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-3", "text": "By employing pruning techniques and a distributed framework, we make the computation for very large corpora feasible on comparably small computational resources." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-4", "text": "We demonstrate this by releasing a DT for the whole vocabulary of Google Books syntactic n-grams." 
}, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-5", "text": "Evaluating against lexical resources using two measures, we show that our approach produces higher-quality DTs than previous approaches, and is thus preferable in terms of speed and quality for large corpora." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-6", "text": "----------------------------------" }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-8", "text": "Using larger data to estimate models for machine learning applications as well as for applications of Natural Language Processing (NLP) has repeatedly been shown to be advantageous; see e.g. (Banko and Brill, 2001; Brants et al., 2007)." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-9", "text": "In this work, we tackle the influence of corpus size on building a distributional thesaurus (Lin, 1998)." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-10", "text": "In particular, we shed light on the interaction of similarity measures and corpus size, as well as aspects of scalability." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-11", "text": "We briefly introduce the JoBimText framework for distributional semantics and show its scalability for large corpora." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-12", "text": "For the computation of the data we follow the MapReduce (Dean and Ghemawat, 2004) paradigm." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-13", "text": "The computation of similarities between terms becomes challenging on large corpora, as both the number of terms to be compared and the number of context features increase." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-14", "text": "This makes standard similarity calculations as proposed in (Lin, 1998; Curran, 2002; Lund and Burgess, 1996; Weeds et al., 2004) computationally infeasible."
}, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-15", "text": "These approaches first calculate an information measure between each word and the corresponding context and then calculate the similarity between all words, based on the information measure for all shared contexts." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-16", "text": "----------------------------------" }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-17", "text": "**RELATED WORK**" }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-18", "text": "A variety of approaches to compute DTs have been proposed to tackle issues regarding size and runtime." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-19", "text": "Reducing the feature space is one possibility, but it still requires computing such a reduction, cf." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-20", "text": "(Blei et al., 2003; Golub and Kahan, 1965)." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-21", "text": "Other approaches use randomised indexing for storing counts, or hashing functions to approximate counts and measures (Gorman and Curran, 2006; Goyal et al., 2010; Sahlgren, 2006)." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-22", "text": "Another possibility is the usage of distributed processing like MapReduce." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-23", "text": "In (Pantel et al., 2009; Agirre et al., 2009) a DT is computed using MapReduce on 200 quad-core nodes (for 5.2 billion sentences) and on 2000 cores (for 1.6 terawords), respectively, an amount of hardware only available to commercial search engines." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-24", "text": "Whereas Agirre uses a \u03c7\u00b2 test to measure the information between terms and context, Pantel uses the Pointwise Mutual Information (PMI)."
}, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-25", "text": "Then, both approaches use the cosine similarity to calculate the similarity between terms." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-26", "text": "Furthermore, Pantel describes an optimization for the calculation of the cosine similarity." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-27", "text": "While Pantel and Lin (2002) primarily describe a method for sense clustering, they also use a method to calculate similarities between terms." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-28", "text": "Here, they propose a pruning scheme similar to ours, but do not explicitly evaluate its effect." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-29", "text": "The evaluation of DTs has been performed in an extrinsic and an intrinsic manner." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-30", "text": "Extrinsic evaluations have been performed using DTs e.g. for automatic set expansion (Pantel et al., 2009) or phrase polarity identification (Goyal and Daum\u00e9, 2011)." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-31", "text": "In this work we will concentrate on intrinsic evaluations: Lin (1997; 1998) introduced two measures using WordNet (Miller, 1995) and Roget's Thesaurus." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-32", "text": "Using WordNet, he defines context features (the synsets a word occurs in; subsets when using Roget's Thesaurus) and then builds a gold standard thesaurus using a similarity measure." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-33", "text": "Then he evaluates his generated Distributional Thesaurus (DT) with respect to the gold standard thesauri." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-34", "text": "Weeds et al. (2004) evaluate various similarity measures based on 1000 frequent and 1000 infrequent words."
}, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-35", "text": "Curran (2004) created a gold standard thesaurus by manually extracting entries from several English thesauri for 70 words." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-36", "text": "His automatically generated DTs are evaluated against this gold standard thesaurus using several measures." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-37", "text": "We will report on his measure and additionally propose a measure based on WordNet paths." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-38", "text": "----------------------------------" }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-39", "text": "**BUILDING A DISTRIBUTIONAL THESAURUS**" }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-40", "text": "Here we present our scalable DT algorithm using the MapReduce paradigm, which is divided into two parts: the holing system and a computational method to calculate distributional similarities." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-41", "text": "A more detailed description, especially of the MapReduce steps, can be found in (Biemann and Riedl, 2013)." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-42", "text": "----------------------------------" }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-43", "text": "**HOLING SYSTEM**" }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-44", "text": "The holing operation splits an observation (e.g. a dependency relation) into a pair consisting of a term and a context feature." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-45", "text": "This captures their first-order relationship." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-46", "text": "These pairs are subsequently used for the computation of the similarities between terms, leading to a second-order relation."
}, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-47", "text": "The representation can be formalized as a pair \u27e8x, y\u27e9, where x is the term and y represents the context feature." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-48", "text": "The position of x in y is denoted by the hole symbol '@'." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-49", "text": "As an example, the dependency relation (nsubj; gave-2; I-1) could be transferred to \u27e8I, (nsubj; gave, @)\u27e9 and \u27e8gave, (nsubj; @, I)\u27e9. This representation scheme is more generic than the schemes introduced in (Lin, 1998; Curran, 2002), as it allows characterising pairs by several holes, which could be used to learn analogies, cf." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-50", "text": "(Turney and Littman, 2005)." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-51", "text": "----------------------------------" }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-52", "text": "**DISTRIBUTIONAL SIMILARITY**" }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-53", "text": "First, we count the frequency of each first-order relation and remove all features that occur with more than w terms, as these context features tend to be too general to characterise the similarity between other words (cf. Rychl\u00fd and Kilgarriff, 2007; Goyal et al., 2010)." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-54", "text": "From this, we calculate a significance score for all first-order relations." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-55", "text": "For this work, we implemented two different significance measures: Pointwise Mutual Information (PMI) (Church and Hanks, 1990) and Lexicographer's Mutual Information (LMI):" }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-56", "text": "We then prune all negatively correlated pairs (s<0)."
}, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-57", "text": "The maximum number of context features per term is defined by p, as we argue that it is sufficient to keep only the p most salient context features per term (ordered descending by their significance score)." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-58", "text": "Features of low saliency generally should not contribute much to the similarity of terms and could also lead to spurious similarity scores." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-59", "text": "Afterwards, all terms are aggregated by their features, which allows us to compute similarity scores between all terms that share at least one such feature." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-60", "text": "While the method introduced by Pantel and Lin (2002) is very similar to the one proposed in this paper (the similarity between terms is calculated solely by the number of features two terms share), they use PMI to rank features and do not use pruning to scale to large corpora, as they use a rather small corpus." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-61", "text": "Additionally, they do not evaluate the effect of such pruning." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-62", "text": "In contrast to the best measures proposed by Lin (1998), Curran (2002), Pantel et al. (2009) and Goyal et al. (2010), we do not calculate any information measure using frequencies of features and terms (we use significance ranking instead), as shown in Table 1." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-63", "text": "Additionally, we avoid any similarity measurement that uses the information measure, as is done in these approaches, to calculate the similarity over the feature counts of each term: we merely count how many salient features two terms share."
}, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-64", "text": "All these constraints make this approach more scalable to larger corpora, as we do not need to know the full list of features for a term pair at any time. [Table 1: information measures, e.g. Lin's formula I(term, feature) = lin(term, feature)]" }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-65", "text": "----------------------------------" }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-66", "text": "**SIMILARITY MEASURES**" }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-67", "text": "Lin's formula: sim(t1, t2) = \u2211_{f \u2208 features(t1) \u2229 features(t2)} (I(t1, f) + I(t2, f)) / (\u2211_{f \u2208 features(t1)} I(t1, f) + \u2211_{f \u2208 features(t2)} I(t2, f))" }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-68", "text": "Curran's Dice: sim(t1, t2) = \u2211_{f \u2208 features(t1) \u2229 features(t2)} min(I(t1, f), I(t2, f)) / \u2211_{f \u2208 features(t1) \u2229 features(t2)} (I(t1, f) + I(t2, f))" }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-69", "text": "with I(t, f) > 0" }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-70", "text": "Our measure: sim(t1, t2) = \u2211_{f \u2208 features(t1) \u2229 features(t2)} 1, with s > 0" }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-71", "text": "While our computations might seem simplistic, we demonstrate their adequacy for large corpora in Section 5." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-72", "text": "----------------------------------" }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-73", "text": "**EVALUATION**" }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-74", "text": "The evaluation is performed using a recent dump of English Wikipedia, containing 36 million sentences, and a newspaper corpus of 120 million sentences (about 2 gigawords), compiled from the Leipzig Corpora Collection (Richter et al., 2006) and the Gigaword corpus (Parker et al., 2011)." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-75", "text": "The DTs are based on collapsed dependencies from the Stanford Parser (Marneffe et al., 2006) in the holing operation."
}, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-76", "text": "For all DTs we use the pruning parameters s=0, p=1000 and w=1000." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-77", "text": "In a final evaluation, we use the syntactic n-grams built from Google Books (Goldberg and Orwant, 2013)." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-78", "text": "To show the impact of corpus size, we downsampled our corpora to 10 million, 1 million and 100,000 sentences." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-79", "text": "We compare our results against DTs calculated using Lin's measure (Lin, 1998) and the best measure proposed by Curran (2002) (see Table 1)." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-80", "text": "Our evaluation is performed using the same 1000 frequent and 1000 infrequent nouns as previously employed by Weeds et al. (2004)." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-81", "text": "We create a gold standard by extracting entries for these 2000 nouns from Roget's 1911 thesaurus, Moby Thesaurus, Merriam-Webster's Thesaurus, the Big Huge Thesaurus and the OpenOffice Thesaurus, and employ the inverse ranking measure (Curran, 2002) to evaluate the DTs." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-82", "text": "Furthermore, we introduce a WordNet-based method." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-83", "text": "To calculate the similarity between two terms, we use the WordNet::Similarity path measure (Pedersen et al., 2004)." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-84", "text": "While its absolute scores are hard to interpret due to inhomogeneity in the granularity of WordNet, they are well-suited for relative comparison." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-85", "text": "The score between two terms is inversely proportional to the length of the shortest path between the synsets of the two terms."
}, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-86", "text": "The highest possible score is one, if two terms share a synset." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-87", "text": "We use the average score of the top five (or ten) DT entries for each of the 2000 selected words for our comparison." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-88", "text": "----------------------------------" }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-89", "text": "**RESULTS**" }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-90", "text": "First, we inspect the results of Curran's measure using the Wikipedia and newspaper corpus for the frequent nouns, shown in Figure 1." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-91", "text": "Both graphs show the inverse ranking score against the size of the corpus." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-92", "text": "Our method scores consistently higher when using LMI instead of PMI for ranking the features per term." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-93", "text": "The PMI measure declines when the corpus becomes larger." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-94", "text": "This can be attributed to the fact that PMI favors term-context pairs involving rare contexts (Bordag, 2008)." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-95", "text": "Computing similarities between terms should not be performed on the basis of rare contexts, as these do not generalize well because of their sparseness." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-96", "text": "All other measures improve with larger corpora." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-97", "text": "It is surprising that recent works (Goyal et al., 2010; Pantel et al., 2009) use PMI to calculate similarities between terms; however, they evaluate their approaches only with respect to their own implementations or extrinsically, and do not prune on saliency."
}, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-98", "text": "Apart from the PMI measure, Curran's measure leads to the weakest results." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-99", "text": "We could not confirm that his measure outperforms Lin's measure as stated in (Curran, 2002).1" }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-100", "text": "An explanation for these results might be his use of a different parser, very few test words, and a different gold standard thesaurus in his evaluation." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-101", "text": "Comparing our method using LMI to Lin's method, we achieve lower scores on small corpora, but surpass Lin's measure from 10 million sentences onwards." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-102", "text": "Next, we show the results of the WordNet evaluation measure in Figure 2." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-103", "text": "Comparing the top 10 (upper) to the top 5 words (lower) used for the evaluation, we can observe higher scores for the top 5 words, which validates the ranking." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-104", "text": "These results are highly correlated to the results achieved with the inverse ranking measure." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-105", "text": "This is a positive result, as the WordNet measure can be performed automatically using a single public resource.2" }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-106", "text": "In Figure 3, we show results for the 1000 infrequent nouns using the inverse ranking (upper) and the WordNet measure (lower)." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-107", "text": "We can see that our method using PMI does not decline for larger corpora, as the limit on first-order features is not reached and frequent features are still being used."
}, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-108", "text": "Our LMI-based DT is on par with Lin's measure for 10 million sentences, and makes better use of large data when using the complete dataset." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-109", "text": "Again, the inverse ranking and the WordNet Path measure are highly correlated." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-110", "text": "2 Building a gold standard thesaurus following Curran (2002) requires access to all the thesauri used." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-111", "text": "Whereas for some, programming interfaces exist, often with limited access and licence restrictions, others have to be extracted manually." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-112", "text": "The results shown here validate our pruning approach." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-113", "text": "Whereas Lin and Curran propose approaches to filter features that have low word-feature scores, they do not remove features that occur with too many words, which is done in this work." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-114", "text": "Using these pruning steps, a simplistic similarity measure not only leads to reduced computation times, but also to better results when using larger corpora." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-115", "text": "----------------------------------" }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-116", "text": "**USING A LARGE 3 CORPUS**" }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-117", "text": "We demonstrate the scalability of our method using the very large Google Books dataset (Goldberg and Orwant, 2013), consisting of dependencies extracted from 17.6 billion sentences." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-118", "text": "The evaluation results, using different measures, are given in Table 2."
}, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-119", "text": "Comparing the results for the Google Books DT to the ones achieved using Wikipedia and the newspaper corpus, we can observe a boost in performance, both for the inverse ranking and the WordNet measures." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-120", "text": "Additionally, we show results for the P@1 measure, which indicates the percentage of words whose first DT entry is in the gold standard thesaurus." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-121", "text": "Remarkably, we get a P@1 against our gold standard thesaurus of 76% for frequent and 64% for infrequent nouns using the Google Books DT." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-122", "text": "Most of the computation time was needed for the dependency parsing, which took two weeks on a small cluster (64 cores on 8 nodes) for the 120 million newspaper sentences." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-123", "text": "The DT for Google Books was calculated in under 30 hours on a Hadoop cluster (192 cores on 16 nodes), and could be calculated within 10 hours for the newspaper corpus." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-124", "text": "The computation of a DT using this huge corpus would be intractable with standard vector-based measurements." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-125", "text": "Even computing Lin's and Curran's vector-based similarity measures for the whole vocabulary of the newspaper corpus was not possible with our Hadoop cluster, as too much memory would have been required; thus we computed similarities only for the 2000 test nouns on a server with 92GB of main memory."
}, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-126", "text": "----------------------------------" }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-127", "text": "**CONCLUSION**" }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-128", "text": "We have introduced a highly scalable approach to DT computation and showed its adequacy for very large corpora." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-129", "text": "Evaluating against thesauri and WordNet, we demonstrated that our similarity measure yields better-quality DTs and scales to corpora of billions of sentences, even on comparably small compute clusters." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-130", "text": "We achieve this by a number of pruning operations, and distributed processing." }, { "sent_id": "fe30705e03f0475f9ab9d044a3c9ca-C001-131", "text": "The framework and the DTs for Google Books, Newspaper and Wikipedia are available online 3 under the ASL 2.0 licence." } ], "y": { "@BACK@": { "gold_contexts": [ [ "fe30705e03f0475f9ab9d044a3c9ca-C001-14" ], [ "fe30705e03f0475f9ab9d044a3c9ca-C001-110" ] ], "cite_sentences": [ "fe30705e03f0475f9ab9d044a3c9ca-C001-14", "fe30705e03f0475f9ab9d044a3c9ca-C001-110" ] }, "@DIF@": { "gold_contexts": [ [ "fe30705e03f0475f9ab9d044a3c9ca-C001-49" ], [ "fe30705e03f0475f9ab9d044a3c9ca-C001-62" ], [ "fe30705e03f0475f9ab9d044a3c9ca-C001-99" ] ], "cite_sentences": [ "fe30705e03f0475f9ab9d044a3c9ca-C001-49", "fe30705e03f0475f9ab9d044a3c9ca-C001-62", "fe30705e03f0475f9ab9d044a3c9ca-C001-99" ] }, "@UNSURE@": { "gold_contexts": [ [ "fe30705e03f0475f9ab9d044a3c9ca-C001-79" ] ], "cite_sentences": [ "fe30705e03f0475f9ab9d044a3c9ca-C001-79" ] }, "@USE@": { "gold_contexts": [ [ "fe30705e03f0475f9ab9d044a3c9ca-C001-81" ] ], "cite_sentences": [ "fe30705e03f0475f9ab9d044a3c9ca-C001-81" ] } } }, "ABC_3351b13fc0c9d4d2de16d897c78aee_28": { "x": [ { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-1", "text": "**ABSTRACT**" }, { "sent_id": 
"3351b13fc0c9d4d2de16d897c78aee-C001-2", "text": "Recently, string kernels have obtained stateof-the-art results in various text classification tasks such as Arabic dialect identification or native language identification." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-3", "text": "In this paper, we apply two simple yet effective transductive learning approaches to further improve the results of string kernels." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-4", "text": "The first approach is based on interpreting the pairwise string kernel similarities between samples in the training set and samples in the test set as features." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-5", "text": "Our second approach is a simple self-training method based on two learning iterations." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-6", "text": "In the first iteration, a classifier is trained on the training set and tested on the test set, as usual." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-62", "text": "**POLARITY CLASSIFICATION**" }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-63", "text": "Data set." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-7", "text": "In the second iteration, a number of test samples (to which the classifier associated higher confidence scores) are added to the training set for another round of training." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-8", "text": "However, the ground-truth labels of the added test samples are not necessary." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-9", "text": "Instead, we use the labels predicted by the classifier in the first training iteration." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-10", "text": "By adapting string kernels to the test set, we report significantly better accuracy rates in English polarity classification and Arabic dialect identification." 
}, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-11", "text": "----------------------------------" }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-12", "text": "**INTRODUCTION**" }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-13", "text": "In recent years, methods based on string kernels have demonstrated remarkable performance in various text classification tasks ranging from authorship identification (Popescu and Grozea, 2012) and sentiment analysis (Gim\u00e9nez-P\u00e9rez et al., 2017; to native language identification (Popescu and Ionescu, 2013; Ionescu et al., 2014 Ionescu et al., , 2016 , dialect identification (Ionescu and Popescu, 2016b; Ionescu and Butnaru, 2017; and automatic essay scoring (Cozma et al., 2018) ." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-14", "text": "As long as a labeled training set is available, string kernels can reach state-of-the-art results in various languages including English (Ionescu et al., 2014; Gim\u00e9nez-P\u00e9rez et al., 2017; Cozma et al., 2018) , Arabic (Ionescu, 2015; Ionescu et al., 2016; Ionescu and Butnaru, 2017; , Chinese and Norwegian (Ionescu et al., 2016) ." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-15", "text": "Different from all these recent approaches, we use unlabeled data from the test set to significantly increase the performance of string kernels." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-16", "text": "More precisely, we propose two transductive learning approaches combined into a unified framework." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-17", "text": "We show that the proposed framework improves the results of string kernels in two different tasks (cross-domain sentiment classification and Arabic dialect identification) and two different languages (English and Arabic)." 
}, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-18", "text": "To the best of our knowledge, transductive learning frameworks based on string kernels have not been studied in previous works." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-19", "text": "----------------------------------" }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-20", "text": "**TRANSDUCTIVE STRING KERNELS**" }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-21", "text": "String kernels." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-22", "text": "Kernel functions (Shawe-Taylor and Cristianini, 2004) capture the intuitive notion of similarity between objects in a specific domain." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-23", "text": "For example, in text mining, string kernels can be used to measure the pairwise similarity between text samples, simply based on character ngrams." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-24", "text": "Various string kernel functions have been proposed to date (Lodhi et al., 2002; Shawe-Taylor and Cristianini, 2004; Ionescu et al., 2014) ." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-25", "text": "Perhaps one of the most recently introduced string kernels is the histogram intersection string kernel (Ionescu et al., 2014 )." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-26", "text": "For two strings over an alphabet \u03a3, x, y \u2208 \u03a3 * , the intersection string kernel is formally defined as follows:" }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-27", "text": "where num v (x) is the number of occurrences of ngram v as a substring in x, and p is the length of v. The spectrum string kernel or the presence bits string kernel can be defined in a similar fashion ( Ionescu et al., 2014) ." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-28", "text": "The standard kernel learning pipeline is presented in Figure 1 ." 
}, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-29", "text": "String kernels help to efficiently compute the dual representation directly, thus skipping the first step in the pipeline illustrated in Figure 1 ." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-30", "text": "Transductive string kernels." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-31", "text": "We propose a simple and straightforward approach to produce a transductive similarity measure suitable for strings, as illustrated in Figure 2 ." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-32", "text": "We take the following steps to derive transductive string kernels." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-33", "text": "For a given kernel (similarity) function k, we first build the full kernel matrix K, by including the pairwise similarities of samples from both the train and the test sets (step S1 in Figure 2 ) ." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-34", "text": "For a training set X = {x 1 , x 2 , ..., x m } of m samples and a test set Y = {y 1 , y 2 , ..., y n } of n samples, such that X \u2229 Y = \u2205, each component in the full kernel matrix is defined as follows (step S2 in Figure 2 ):" }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-35", "text": "where z i and z j are samples from the set Z = X \u222a Y = {x 1 , x 2 , ..., x m , y 1 , y 2 , ..., y n }, for all 1 \u2264 i, j \u2264 m + n. We then normalize the kernel matrix by dividing each component by the square root of the product of the two corresponding diagonal components:" }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-36", "text": "We transform the normalized kernel matrix into a radial basis function (RBF) kernel matrix as follows:K" }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-37", "text": "As the kernel matrix is already normalized, we can choose \u03c3 2 = 0.5 for simplicity." 
}, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-38", "text": "Therefore, Equation (4) becomes:" }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-39", "text": "Each row in the RBF kernel matrixK is now interpreted as a feature vector, going from step S2 to step S3 in Figure 2 ." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-40", "text": "In other words, each sample z i is represented by a feature vector that contains the similarity between the respective sample z i and all the samples in Z (step S3 in Figure 2 )." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-41", "text": "Since Z includes the test samples as well, the feature vector is inherently adapted to the test set." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-42", "text": "Indeed, it is easy to see that the features will be different if we choose to apply the string kernel approach on a set of test samples" }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-43", "text": "It is important to note that through the features, the subsequent classifier will have some information about the test samples at training time." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-44", "text": "More specifically, the feature vector conveys information about how similar is every test sample to every training sample." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-45", "text": "We next consider the linear kernel, which is given by the scalar product between the new feature vectors." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-46", "text": "To obtain the final linear kernel matrix, we simply need to compute the product between the RBF kernel matrix and its transpose (step S4 in Figure 2 ):" }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-47", "text": "In this way, the samples from the test set, which are included in Z, are used to obtain new (transductive) string kernels that are adapted to the test set at hand." 
}, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-48", "text": "Transductive kernel classifier." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-49", "text": "After obtaining the transductive string kernels, we use a simple transductive learning approach that falls in the category of self-training methods (McClosky et al., 2006; Chen et al., 2011) ." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-50", "text": "The transductive approach is divided into two learning iterations." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-51", "text": "In the first iteration, a kernel classifier is trained on the training data and applied on the test data, just as usual." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-52", "text": "Next, the test samples are sorted by the classifier's confidence score to maximize the probability of correctly predicted labels in the top of the sorted list." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-53", "text": "In the second iteration, a fixed number of samples (1000 in the experiments) from the top of the list are added to the training set for another round of training." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-54", "text": "Even though a small percent (less than 8% in all experiments) of the predicted labels corresponding to the newly included samples are wrong, the classifier has the chance to learn some useful patterns (from the correctly predicted labels) only visible in the test data." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-55", "text": "The transductive kernel classifier (TKC) is based on the intuition that the added test samples bring more useful information than noise, since the majority of added test samples have correct labels." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-56", "text": "Finally, we would like to stress out that the groundtruth test labels are never used in our transductive algorithm." 
}, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-57", "text": "The proposed transductive learning approaches are used together in a unified framework." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-58", "text": "As any other transductive learning method, the main disadvantage of the proposed framework is that the unlabeled test samples from the target domain need to be used in the training stage." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-59", "text": "Nevertheless, we present empirical results indicating that our approach can obtain significantly better accuracy rates in cross-domain polarity classification and Arabic dialect identification compared to state-of-the-art methods based on string kernels (Gim\u00e9nez-P\u00e9rez et al., 2017; Ionescu and Butnaru, 2017) ." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-60", "text": "We also report better results than other domain adaptation methods (Pan et al., 2010; Bollegala et al., 2013; Franco-Salvador et al., 2015; Sun et al., 2016; Huang et al., 2017) ." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-61", "text": "----------------------------------" }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-64", "text": "For the cross-domain polarity classification experiments, we use the second version of Multi-Domain Sentiment Dataset (Blitzer et al., 2007) ." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-65", "text": "The data set contains Amazon product reviews of four different domains: Books (B), DVDs (D), Electronics (E) and Kitchen appliances (K)." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-66", "text": "Reviews contain star ratings (from 1 to 5) which are converted into binary labels as follows: reviews rated with more than 3 stars are labeled as positive, and those with less than 3 stars as negative." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-67", "text": "In each domain, there are 1000 positive and 1000 negative reviews." 
}, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-68", "text": "Baselines." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-69", "text": "We compare our approach with several methods (Pan et al., 2010; Bollegala et al., 2013; Franco-Salvador et al., 2015; Sun et al., 2016; Gim\u00e9nez-P\u00e9rez et al., 2017; Results in single-source setting." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-70", "text": "The results for the single-source cross-domain polarity classification setting are presented in Table 2 ." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-71", "text": "We considered all possible combinations of source and target domains in this experiment, and we improve the results in each and every case." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-72", "text": "Without exception, the accuracy rates reached by the transductive string kernels are significantly better than the best baseline string kernel (Gim\u00e9nez-P\u00e9rez et al., 2017) , according to the McNemar's test performed at a confidence level of 0.01." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-73", "text": "The highest improvements (above 2.7%) are obtained when the source domain contains Books reviews and the target domain contains Kitchen reviews." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-74", "text": "As in the multi-source setting, we obtain much better results when the transductive classifier is employed for the learning task." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-75", "text": "In all cases, the accuracy rates of the transductive classifier are more than 2% better than the best baseline string kernel." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-76", "text": "Remarkably, in four cases (E\u2192B, E\u2192D, B\u2192K and D\u2192K) our improvements are greater than 4%." 
}, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-77", "text": "The improvements brought by our transductive classifier based on string kernels are statistically significant in each and every case." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-78", "text": "In comparison with SFA (Pan et al., 2010) , we obtain better results in all but one case (K\u2192D)." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-79", "text": "With respect to KMM (Huang et al., 2007) , we also obtain better results in all but one case (B\u2192E)." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-80", "text": "Remarkably, we surpass the Table 2 : Single-source cross-domain polarity classification accuracy rates (in %) of our transductive approaches versus a state-of-the-art (sota) baseline based on string kernels (Gim\u00e9nez-P\u00e9rez et al., 2017) , as well as SFA (Pan et al., 2010) , KMM (Huang et al., 2007) , CORAL (Sun et al., 2016) and TR-TrAdaBoost (Huang et al., 2017) ." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-81", "text": "The best accuracy rates are highlighted in bold." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-82", "text": "The marker * indicates that the performance is significantly better than the best baseline string kernel according to a paired McNemar's test performed at a significance level of 0.01." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-83", "text": "other state-of-the-art approaches (Sun et al., 2016; Huang et al., 2017) in all cases." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-84", "text": "----------------------------------" }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-85", "text": "**ARABIC DIALECT IDENTIFICATION**" }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-86", "text": "Data set." 
}, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-87", "text": "The Arabic Dialect Identification (ADI) data set (Ali et al., 2016) contains audio recordings and Automatic Speech Recognition (ASR) transcripts of Arabic speech collected from the Broadcast News domain." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-88", "text": "The classification task is to discriminate between Modern Standard Arabic and four Arabic dialects, namely Egyptian, Gulf, Levantine, and Maghrebi." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-89", "text": "The training set contains 14000 samples, the development set contains 1524 samples, and the test contains another 1492 samples." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-90", "text": "The data set was used in the ADI Shared Task of the 2017 VarDial Evaluation Campaign ." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-91", "text": "Baseline." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-92", "text": "We choose as baseline the approach of Ionescu and Butnaru (2017) , which is based on string kernels and multiple kernel learning." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-93", "text": "The approach that we consider as baseline is the winner of the 2017 ADI Shared Task ." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-94", "text": "In addition, we also compare with the second-best approach (Meta-classifier) ." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-95", "text": "Evaluation procedure and parameters." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-96", "text": "Ionescu and Butnaru (2017) combined four kernels into a sum, and used Kernel Ridge Regression for training." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-97", "text": "Three of the kernels are based on character ngrams extracted from ASR transcripts." 
}, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-98", "text": "These are the presence bits string kernel (K 0/1 ), the intersection string kernel (K \u2229 ), and a kernel based on Local Rank Distance (K LRD ) ." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-99", "text": "The fourth kernel is an RBF kernel (K ivec ) based on the i-vectors provided with the ADI data set (Ali et al., 2016) ." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-100", "text": "In our experiments, we employ the exact same kernels as Ionescu and Butnaru (2017) to ensure an unbiased comparison with their ap- Butnaru, 2017) and the first runner up ." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-101", "text": "The best accuracy rates are highlighted in bold." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-102", "text": "The marker * indicates that the performance is significantly better than (Ionescu and Butnaru, 2017) according to a paired McNemar's test performed at a significance level of 0.01." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-103", "text": "proach." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-104", "text": "As in the polarity classification experiments, we select r = 1000 unlabeled test samples to be included in the training set for the second round of training the transductive classifier, and we use Kernel Ridge Regression with a regularization of 10 \u22125 in all our ADI experiments." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-105", "text": "Results." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-106", "text": "The results for the cross-domain Arabic dialect identification experiments on both the development and the test sets are presented in Table 3 ." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-107", "text": "The domain-adapted sum of kernels obtains improvements above 0.8% over the stateof-the-art sum of kernels (Ionescu and Butnaru, 2017) ." 
}, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-108", "text": "The improvement on the development set (from 64.17% to 65.42%) is statistically significant." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-109", "text": "Nevertheless, we obtain higher and significant improvements when we employ the transductive classifier." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-110", "text": "Our best accuracy is 66.73% (2.56% above the baseline) on the development set and 78.35% (2.08% above the baseline) on the test set." }, { "sent_id": "3351b13fc0c9d4d2de16d897c78aee-C001-111", "text": "The results show that our domain adaptation framework based on string kernels attains the best performance on the ADI Shared Task data set, and the improvements over the state-of-the-art are statistically significant, according to the McNemar's test." } ], "y": { "@BACK@": { "gold_contexts": [ [ "3351b13fc0c9d4d2de16d897c78aee-C001-13" ], [ "3351b13fc0c9d4d2de16d897c78aee-C001-14" ], [ "3351b13fc0c9d4d2de16d897c78aee-C001-96" ] ], "cite_sentences": [ "3351b13fc0c9d4d2de16d897c78aee-C001-13", "3351b13fc0c9d4d2de16d897c78aee-C001-14", "3351b13fc0c9d4d2de16d897c78aee-C001-96" ] }, "@DIF@": { "gold_contexts": [ [ "3351b13fc0c9d4d2de16d897c78aee-C001-59" ], [ "3351b13fc0c9d4d2de16d897c78aee-C001-102" ], [ "3351b13fc0c9d4d2de16d897c78aee-C001-107" ] ], "cite_sentences": [ "3351b13fc0c9d4d2de16d897c78aee-C001-59", "3351b13fc0c9d4d2de16d897c78aee-C001-102", "3351b13fc0c9d4d2de16d897c78aee-C001-107" ] }, "@USE@": { "gold_contexts": [ [ "3351b13fc0c9d4d2de16d897c78aee-C001-92" ], [ "3351b13fc0c9d4d2de16d897c78aee-C001-100" ] ], "cite_sentences": [ "3351b13fc0c9d4d2de16d897c78aee-C001-92", "3351b13fc0c9d4d2de16d897c78aee-C001-100" ] }, "@SIM@": { "gold_contexts": [ [ "3351b13fc0c9d4d2de16d897c78aee-C001-100" ] ], "cite_sentences": [ "3351b13fc0c9d4d2de16d897c78aee-C001-100" ] } } }, "ABC_5e6d5bb4fb5be2b18ce3256302bf28_28": { "x": [ { "sent_id": 
"5e6d5bb4fb5be2b18ce3256302bf28-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-2", "text": "This paper presents a Prefix Tree (Trie) based model for Generation of Referring Expression (GRE)." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-3", "text": "The existing algorithms in GRE lie in two extremities." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-4", "text": "Incremental algorithm is simple and speedy but less expressive in nature whereas others are complex and exhaustive but more expressive in nature." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-5", "text": "Our prefix tree based model not only incorporates all relevant features of GRE (like describing set, generating Boolean and context sensitive description etc.) but also try to attain simplicity and speed properties of Incremental algorithm." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-6", "text": "Thus this model provides a simple and linguistically rich approach to GRE." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-7", "text": "----------------------------------" }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-9", "text": "Generation of referring expression (GRE) is an important task in the field of Natural Language Generation (NLG) systems (Reiter and Dale, 1995) ." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-10", "text": "The task of any GRE algorithm is to find a combination of properties that allow the audience to identify an object (target object) from a set of objects (domain or environment)." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-11", "text": "The properties should satisfy the target object and dissatisfy all other objects in the domain." 
}, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-12", "text": "We sometimes call it distinguishing description because it helps us to distinguish the target from potential distractors, called contrast set." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-13", "text": "When we generate any natural language text in a particular domain, it has been observed that the text is centered on certain objects for that domain." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-14", "text": "When we give introductory description of any object, we generally give full length description (e.g. \"The large black hairy dog\"). But the later references to that object tend to be shorter and only support referential communication goal of distinguishing the target from other objects." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-15", "text": "For example the expression \"The black dog\" suffices if the other dogs in the environment are all non black." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-16", "text": "Grice, an eminent philosopher of language, has stressed on brevity of referential communication to avoid conversational implicature." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-17", "text": "Dale (1992) developed Full Brevity algorithm based on this observation." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-18", "text": "It always generates shortest possible referring description to identify an object. But Reiter and Dale (1995) later proved that Full Brevity requirement is an NP-Hard task, thus computationally intractable and offered an alternative polynomial time Incremental Algorithm." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-19", "text": "This algorithm adds properties in a predetermined order, based on the observation that human speakers and audiences prefer certain kinds of properties when describing an object in a domain (Krahmer et al. 2003) ." 
}, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-20", "text": "The Incremental Algorithm is accepted as state of the art algorithm in NLG domain." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-21", "text": "Later many refinements (like Boolean description and set representation (Deemter 2002), context sensitivity (Krahmer et al 2002) etc) have been incorporated into this algorithm." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-22", "text": "Several approaches have also been made to propose an alternative algorithmic framework to this problem like graph-based (Krahmer et al. 2003) , conceptual graph based (Croitoru and Deemter 2007) etc that also handle the above refinements." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-23", "text": "In this paper we propose a new Prefix Tree (Trie) based framework for modeling GRE problems." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-24", "text": "Trie is an ordered tree data structure which allows the organization of prefixes in such a way that the branching at each level is guided by the parts of prefixes." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-25", "text": "There are several advantages of this approach: 1) Trie data structure has already been extensively used in many domains where search is the key operation." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-26", "text": "2) The structure is scalable and various optimized algorithms are there for time, space optimizations." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-27", "text": "In this paper it is shown how scenes can be represented using a Trie (section 2) and how description generation can be formalized as a search problem (section 3)." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-28", "text": "In section 4 the algorithm is explained using an example scene." 
}, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-29", "text": "In section 5, the basic algorithm is extended to take care of different scenarios." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-30", "text": "The algorithm is analyzed for time com-plexity in section 6 and conclusion is drawn in section 7." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-31", "text": "----------------------------------" }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-32", "text": "**MODELING GRE USING TRIE STRUCTURE**" }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-33", "text": "In this section, it is shown how a scene can be represented using a trie data structure." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-34", "text": "The scheme is based on Incremental algorithm (Reiter and Dale 1995) and incorporates the attractive properties (e.g. speed, simplicity etc) of that algorithm." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-35", "text": "Later it is extended to take care of different refinements (like relational, boolean description etc) that could not be handled by Incremental algorithm." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-36", "text": "Reiter and Dale (1995) pointed out the notion of 'PreferredAttributes' (e.g. Type, Size, Color etc) which is a sequence of attributes of an object that human speakers generally use to identify that object from the contrast set." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-37", "text": "We assume that the initial description of an entity is following this sequence (e.g. \"The large black dog\") then the later references will be some subset of initial description (like \"The dog\" or \"The large dog\") which is defined as the prefix of the initial description." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-38", "text": "So, we have to search for a prefix of the initial full length description so that it is adequate to distinguish the target object." 
}, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-39", "text": "Following the Incremental version we will add properties one by one from the 'PreferredAttributes' list." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-40", "text": "In our model, the root consists of all entities in the domain and has empty description." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-41", "text": "Then at each level, branching is made based on different values of corresponding preferred attribute." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-42", "text": "The outgoing edge is labeled with that value." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-43", "text": "For example, at the first level, branching is made based on different values of 'Type' attribute like 'Dog', 'Cat', 'Poodle' etc." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-44", "text": "A node in Trie will contain only those objects which have the property(s) expressed by the edges, constituting the path from root to that node." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-45", "text": "After construction of the Trie structure for a given domain in this way, referring expression generation problem for an object r is reduced to search the tree for a node which consists of r and no other object." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-46", "text": "Description for r can be found from the search path itself as we have said earlier." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-47", "text": "Now we will introduce some notations that we will use to describe the actual algorithm." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-48", "text": "Let D be the Domain, r be the target object and P be the 'PreferredAttributes' List." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-49", "text": ") ." 
}, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-50", "text": "Since our construction is basically a tree, each node is reachable from root and there exists a unique path from root to that node." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-51", "text": "So, for each node in the tree we will get some description." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-52", "text": "We will formulate referring expression construction as search in the constructed tree for the node min(k) {N k } such that N k = {r}. If N k is leaf node then description of r will be same with the full description but if it is an intermediate node then description is some proper prefix of initial description. But the point is that, in both cases the later reference is prefix of initial one (as both \"ab\" and \"abc\" are prefixes of \"abc\")." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-53", "text": "----------------------------------" }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-54", "text": "**BASIC ALGORITHM**" }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-55", "text": "Based on above discussions, algorithms are developed for construction of Trie from the domain and generation of reference description for any object in that domain." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-56", "text": "The Trie construction algorithm ConstructTrie(D,P,T) is shown in figure 1 , Referring expression generation algorithm MakeRefExpr(r,p,T,L) is shown in figure 2 , where T is a node pointer and p is pointer to parent of that node." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-57", "text": "Our algorithm MakeRefExpr returns set of attribute-values L to identify r in the domain." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-58", "text": "As discussed earlier, it is basically a node searching algorithm." 
}, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-59", "text": "In course of searching, if it is found that an intermediate node N doesn't have r i.e. r\u2209 N then our search will not move forward through the subtree rooted at N. Our search will proceed through next level iff r\u2208 N ." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-60", "text": "For a node N k , if we get N k = {r} then we have succeeded and our algorithm will return L, set of descriptions for that node." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-61", "text": "If there is no distinguishing description exists for r, then \u2205 (null) will be returned." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-62", "text": "We would like to point out that our algorithm will find out only one description that exists at the minimum level of the tree." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-63", "text": "Moreover, a description is added to L only if it is distinguishing i.e. the connecting edge must remove some contrasting object(s)." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-64", "text": "Thus, the child node should contain less number of objects than that of parent node." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-65", "text": "In this case, cardinality of parent N i (Card(N i )) will be greater than that of child (Card(N i+1 ) )." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-66", "text": "This condition is included in our algorithm and if (Card (P\u2192N)) > Card (T\u2192N) holds then only the value is added P->N and T->N respectively represents parent and child node." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-67", "text": "After finding a distinguishing description for r, search will neither move further down the tree nor explore the remaining branches of the current node." 
}, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-68", "text": "Search will explore the next branch only if the search in current branch returned NULL description i.e. when L\u2032 = \u2205 in the algorithm." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-69", "text": "If we reach a leaf node and that contains r along with other objects then it is not possible to distinguish r'." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-70", "text": "In that case, the algorithm returns NULL indicating that no description exists at all." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-71", "text": "It has been later shown that some distinguishing description may still exist and the algorithm will be modified to find that." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-72", "text": "It should be mentioned that once the prefix tree is constructed offline, it can be used repetitively to find description for any object in the domain throughout the text generation phase." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-73", "text": "Our MakeRefExpr() algorithm is very simple and it doesn't employ any set theoretic operation, which is a non trivial task, to find current contrast set at every steps of algorithm." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-74", "text": "In existing algorithms, computing referential description for every object require computing similar things (like finding current contrast set, ruled out objects) again and again." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-75", "text": "And it has to be repeated every time the object is referred." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-76", "text": "It is not possible to generate description once, store it and use it later because of the fact that domain may also change in course of time (Krahmer, 2002 )." 
}, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-77", "text": "That's why every time we want to refer to 'r', such rigorous set operations need to be computed." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-78", "text": "But in our prefix tree structure, once the tree is constructed, it is very easy to find description for that object using simple tree search function." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-79", "text": "It is also very easy to add/delete objects to/from domain." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-80", "text": "We have to follow just the initial properties of that object to find the proper branching at each level, followed by addition /deletion of that object to /from relevant nodes, which is essentially a search operation." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-81", "text": "The disadvantage of our algorithm is that space complexity is high but it can be tackled using bit Vector representation of individual nodes of the prefix tree." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-82", "text": "Besides, several methods are there for compressing Trie structure." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-83", "text": "But these optimization techniques are beyond the scope of our current discussion." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-84", "text": "----------------------------------" }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-85", "text": "**FORMALIZING A SCENE USING PREFIX TREE**" }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-86", "text": "Consider an example scene in figure 3 , from [Krahmer 2002] ." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-87", "text": "In this scene, there is a finite domain of entities D. Let D = {d1, d2, d3, d4}, P = {Type, Size, Color} and values are Type = {dog, cat}; Size = {small, large}; Color = {black, white}. 
A scene is usually represented as a database (or \u2329 Type : cat \u232a , \u2329 Size: small \u232a , \u2329 Color: white \u232a Now it will be shown how our MakeRefExpr() algorithm will find a description for a target object r. Let r = {d 1 }." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-88", "text": "In the first phase, starting from root, edge labeled D is traversed." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-89", "text": "Since d 1 exists in the node and D discards some objects (d 4 ), D is distinguishing description and it is added to L. In the next phase the node connected by the edge labeled L does not contain d 1 so search will not proceed further." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-90", "text": "Rather the node connected by the edge labeled S contains d 1 ." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-91", "text": "Since, d 1 is the only object, then we are done and the referring expression is \"The small dog\". But for d 2 , we have to search upto the leaf node which generates the description \"The large white dog\". can't distinguish a from b at second phase." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-92", "text": "Deemter(2002) suggested inclusion of all overlapping values that are true of target while also removing some distractors." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-93", "text": "So, referring expression for a is \"The red orange desk\". But it fails to obey Gricean maxims of conversational implicature." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-94", "text": "We consider the failure as 'Early Decision' problem and defer the decision making in our model." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-95", "text": "We keep in our mind the fact that human beings seldom take instantaneous decision." 
}, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-96", "text": "Rather they consider all opportunities in parallel and take decision in the favor of the best one at later point of time." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-97", "text": "Since, our algorithm searches in parallel through all promising branches until some description is found; it mimics the capabilities of human mind to consider in parallel." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-98", "text": "Our algorithm will generate \"The large orange desk\" which will help audiences to better identify the desk." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-99", "text": "The execution sequence is shown in figure 4 ." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-100", "text": "----------------------------------" }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-101", "text": "**DESCRIBING SET OF OBJECTS**" }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-102", "text": "Generation of referring description for a set of objects is very important in NLG." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-103", "text": "Deemter's (2002) suggestion can be easily incorporated into our framework." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-104", "text": "We will represent target r as set of objects." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-105", "text": "Now our algorithm will try to find a node in the tree which only consists of all objects in the set r. In this way, we can find a distinguishing description for any set, for which description exists." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-106", "text": "In figure 3 , the description for the set {d 2 ,d 3 } is \"The large dogs\"." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-107", "text": "Thus, our basic algorithm is able to describe set of objects." 
}, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-108", "text": "In case of set like {d 2 , d 3 , d 4 } where there is no separate node consisting all the object, we need to partition the set and find description for individual set." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-109", "text": "In our case the possible partitions are {d 2 , d 3 } and {d 4 } for which separate nodes exist." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-110", "text": "----------------------------------" }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-111", "text": "**BOOLEAN DESCRIPTIONS**" }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-112", "text": "Deemter (2002) shown that Incremental algorithm is only intersectively complete. But he argues that other Boolean combination of properties can be used to generate description for an object." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-113", "text": "Consider the example from (Deemter, 2002) ." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-114", "text": "Let D = {a, b, c, d, e} Type: {Dog(a,b,c,d,e) ; Poodle(a,b)} Color: {Black(a,b,c); White(d,e)} and r = {c}. In this scenario Incremental algorithm is not able to individuate any of the animals." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-115", "text": "However a description for c exists, \"The black dog that is not a poodle\"." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-116", "text": "Deemter (2002) has modified the Incremental algorithm by adding negative values for each attribute." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-117", "text": "Now we will show that our basic algorithm can be modified to take care of this situation." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-118", "text": "In our basic algorithm ConstructTrie(), we add branches at each level for negative values also." 
}, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-119", "text": "In this case our simple routine MakeRefExpr() is able to find boolean description while remaining as close as to Incremental algorithm." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-120", "text": "In figure 5 , we show part of the trie structure, which is generated for the above scene." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-121", "text": "The dashed arrows show the alternative search paths for node containing {c}. For referring objects using disjunction of properties we have do same thing as negations." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-122", "text": "We have to extend our prefix tree structure by adding extra edges at different levels for making implicit information explicit as described in [Krahmer 2002 ]." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-123", "text": "Krahmer and Theune [2002] have added the notion of context sensitivity into GRE." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-124", "text": "Earlier algorithms assumed that all objects in environment are equally salient." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-125", "text": "Krahmer and Theune refined the idea by assigning some degree of salience to each object." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-126", "text": "They proposed that during referring any object, the object needs to be distinguished only from those objects which are more salient (having higher salience weight)." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-127", "text": "An object that has been mentioned recently, is linguistically more salient than other objects and can be described using fewer properties (\"The dog\" instead of \"The large black hairy dog\")." 
}, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-128", "text": "They introduced the concept of centering theory, hierarchical focus constraints in the field of NLG and devised a constant function mapping sw: D \u2192 \u2192 \u2192 \u2192 \u2115 , where sw is salience weight function, D is domain and \u2115 is set of natural numbers." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-129", "text": "We can incorporate this idea into our model easily." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-130", "text": "In each node of the prefix tree we keep a field 'salience weight' (sw) for each of the object stored in that node in the form (d i , sw i )." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-131", "text": "During describing an object if we find a node that is containing r where it is the most salient then we need not traverse higher depth of the tree." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-132", "text": "So, we have to modify then r is the most salient and the edges constituting the path from root to N represents distinguishing description for r. In figure 6 , a is most salient dog and referred to as \"The dog\" whereas b is referred to as \"The small dog\"." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-133", "text": "----------------------------------" }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-134", "text": "**INCORPORATING CONTEXT SENSITIVITY**" }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-135", "text": "----------------------------------" }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-136", "text": "**RELATIONAL DESCRIPTIONS**" }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-137", "text": "Relational descriptions are used to single out an object with reference to other one." 
}, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-138", "text": "For example \"The cup on the table\" is used to distinguish a cup from other cups which are not on the" }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-139", "text": "----------------------------------" }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-140", "text": "**MODELING FULL BREVITY**" }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-141", "text": "In this section, we will show that our prefix tree structure can be so modified that it can generate shortest possible description which is requirement of Full Brevity (Dale, 1992) ." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-142", "text": "Consider a scene where a domain is identified by set of n attributes {A 1 , A 2 \u2026A n }." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-143", "text": "We can generate n! number of different permutations of A i 's \u2200 i \u2208 [1,n]." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-144", "text": "We consider each permutation as different PreferredAttributes list P k and generate all possible prefix trees T k for each P k \u2200 k\u2208 [1,n!] for same domain D. Now, we connect roots of all trees with a common dummy root node with edges having empty description (\u03b5)." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-145", "text": "Now, if we search the branches of new combined tree in parallel, it's obvious that we can always find the target node at lowest possible level." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-146", "text": "Thus we can generate shortest length description using our algorithm." 
}, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-147", "text": "----------------------------------" }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-148", "text": "**COMPLEXITY OF THE ALGORITHM**" }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-149", "text": "Let the domain entities are identified by a number of attributes and each attribute has on the average k number of different values." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-150", "text": "So, our ConstructTrie() algorithm takes \u039f(k a ) time." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-151", "text": "Now we will consider different cases for analyzing the time complexity of our MakeRefExpr() algorithm." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-152", "text": "1) In case of non overlapping properties, our search tree will be pruned at each level by a factor of k. Thus the time complexity will be \u039f(log k (k a )) = \u039f(a) which is linear." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-153", "text": "2) In case of overlapping properties, we have to search whole tree in worst case (although in average cases also there will be large pruning, as found from test cases) which will take \u039f(k a ) time." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-154", "text": "3) In case of achieving full brevity requirement, both time and space complexity will be exponential as in the original algorithm by Dale (1992) ." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-155", "text": "----------------------------------" }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-156", "text": "**CONCLUSIONS**" }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-157", "text": "In this paper, we present a new Prefix tree (Trie) based approach for modeling GRE problems." 
}, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-158", "text": "We construct the trie in such a way that a node at a particular level consists of only those objects which are satisfied by values of the edges, constituting the path from root to that node." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-159", "text": "We formulate description generation as a search problem." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-160", "text": "So, when we reach the target node, the attribute values corresponding to the edges in the path automatically form the distinguishing description." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-161", "text": "Different scenarios of GRE problems like representation of set, boolean descriptions etc. is taken care of in this paper." }, { "sent_id": "5e6d5bb4fb5be2b18ce3256302bf28-C001-162", "text": "We have shown that in simple non overlapping scenarios, our algorithm will find distinguishing description in linear time." } ], "y": { "@BACK@": { "gold_contexts": [ [ "5e6d5bb4fb5be2b18ce3256302bf28-C001-9" ], [ "5e6d5bb4fb5be2b18ce3256302bf28-C001-18" ] ], "cite_sentences": [ "5e6d5bb4fb5be2b18ce3256302bf28-C001-9", "5e6d5bb4fb5be2b18ce3256302bf28-C001-18" ] }, "@USE@": { "gold_contexts": [ [ "5e6d5bb4fb5be2b18ce3256302bf28-C001-33", "5e6d5bb4fb5be2b18ce3256302bf28-C001-34" ], [ "5e6d5bb4fb5be2b18ce3256302bf28-C001-36", "5e6d5bb4fb5be2b18ce3256302bf28-C001-37" ] ], "cite_sentences": [ "5e6d5bb4fb5be2b18ce3256302bf28-C001-34", "5e6d5bb4fb5be2b18ce3256302bf28-C001-36" ] } } }, "ABC_a6f32017135e984fbe59f2171c50f4_28": { "x": [ { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-59", "text": "**DATASETS AND EVALUATION METRICS**" }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-84", "text": "Table 1 shows the BLEU scores for different models." 
}, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-133", "text": "**QUALITATIVE ANALYSIS**" }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-134", "text": "----------------------------------" }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-135", "text": "**RELATED WORK**" }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-2", "text": "Neural Machine Translation (NMT) can be improved by including document-level contextual information." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-3", "text": "For this purpose, we propose a hierarchical attention model to capture the context in a structured and dynamic manner." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-4", "text": "The model is integrated in the original NMT architecture as another level of abstraction, conditioning on the NMT model's own previous hidden states." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-5", "text": "Experiments show that hierarchical attention significantly improves the BLEU score over a strong NMT baseline with the state-of-the-art in context-aware methods, and that both the encoder and decoder benefit from context in complementary ways." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-6", "text": "----------------------------------" }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-8", "text": "Neural machine translation (NMT) (Bahdanau et al., 2015; Wu et al., 2016; Vaswani et al., 2017) trains an encoder-decoder network on sentence pairs to maximize the likelihood of predicting a target-language sentence given the corresponding source-language sentence, without considering the document context." 
}, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-9", "text": "By ignoring discourse connections between sentences and other valuable contextual information, this simplification potentially degrades the coherence and cohesion of a translated document (Hardmeier, 2012; Meyer and Webber, 2013; Sim Smith, 2017) ." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-10", "text": "Recent studies (Tiedemann and Scherrer, 2017; Wang et al., 2017; have demonstrated that adding contextual information to the NMT model improves the general translation performance, and more importantly, improves the coherence and cohesion of the translated text (Bawden et al., 2018; Lapshinova-Koltunski and Hardmeier, 2017) ." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-11", "text": "Most of these methods use an additional encoder Wang et al., 2017) to extract contextual information from previous source-side sentences." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-12", "text": "However, this requires additional parameters and it does not exploit the representations already learned by the NMT encoder." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-13", "text": "More recently, have shown that a cache-based memory network performs better than the above encoder-based methods." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-14", "text": "The cache-based memory keeps past context as a set of words, where each cell corresponds to one unique word keeping the hidden representations learned by the NMT while translating it." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-15", "text": "However, in this method, the word representations are stored irrespective of the sentences where they occur, and those vector representations are disconnected from the original NMT network." 
}, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-16", "text": "We propose to use a hierarchical attention network (HAN) (Yang et al., 2016) to model the contextual information in a structured manner using word-level and sentence-level abstractions." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-17", "text": "In contrast to the hierarchical recurrent neural network (HRNN) used by (Wang et al., 2017) , here the attention allows dynamic access to the context by selectively focusing on different sentences and words for each predicted word." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-18", "text": "In addition, we integrate two HANs in the NMT model to account for target and source context." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-19", "text": "The HAN encoder helps in the disambiguation of source-word representations, while the HAN decoder improves the target-side lexical cohesion and coherence." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-20", "text": "The integration is done by (i) re-using the hidden representations from both the encoder and decoder of previous sentence translations and (ii) providing input to both the encoder and decoder for the current translation." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-21", "text": "This integration method enables it to jointly optimize for multiple-sentences." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-22", "text": "Furthermore, we extend the original HAN with a multi-head attention (Vaswani et al., 2017) to capture different types of discourse phenomena." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-23", "text": "Our main contributions are the following: (i) We propose a HAN framework for translation to capture context and inter-sentence connections in a structured and dynamic manner." 
}, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-24", "text": "(ii) We integrate the HAN in a very competitive NMT ar-chitecture (Vaswani et al., 2017) and show significant improvement over two strong baselines on multiple data sets." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-25", "text": "(iii) We perform an ablation study of the contribution of each HAN configuration, showing that contextual information obtained from source and target sides are complementary." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-26", "text": "----------------------------------" }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-27", "text": "**THE PROPOSED APPROACH**" }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-28", "text": "The goal of NMT is to maximize the likelihood of a set of sentences in a target language represented as sequences of words y = (y 1 , ..., y t ) given a set of input sentences in a source language x = (x 1 , ..., x m ) as:" }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-29", "text": "so, the translation of a document D is made by translating each of its sentences independently." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-30", "text": "In this study, we introduce dependencies on the previous sentences from the source and target sides:" }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-31", "text": "where D x n = (x n\u2212k , ..., x n\u22121 ) and D y n = (y n\u2212k , ..., y n\u22121 ) denote the previous k sentences from source and target sides respectively." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-32", "text": "The contexts D x n and D y n are modeled with HANs." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-33", "text": "----------------------------------" }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-34", "text": "**HIERARCHICAL ATTENTION NETWORK**" }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-35", "text": "The proposed HAN has two levels of abstraction." 
}, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-36", "text": "The word-level abstraction summarizes information from each previous sentence j into a vector s j as:" }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-37", "text": "where h denotes a hidden state of the NMT network." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-38", "text": "In particular, h t is the last hidden state of the word to be encoded, or decoded at time step t, and h j i is the last hidden state of the i-th word of the j-th sentence of the context." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-39", "text": "The function f w is a linear transformation to obtain the query q w ." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-40", "text": "We used the MultiHead attention function proposed by (Vaswani et al., 2017) to capture different types of relations among words." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-41", "text": "It matches the query against each of the hidden representations h j i (used as value and key for the attention)." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-42", "text": "The sentence-level abstraction summarizes the contextual information required at time t in d t as: Figure 1 : Integration of HAN during encoding at time step t,h t is the context-aware hidden state of the word x t ." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-43", "text": "Similar architecture is used during decoding." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-44", "text": "where f s is a linear transformation, q s is the query for the attention function, FFN is a position-wise feed-forward layer (Vaswani et al., 2017 )." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-45", "text": "Each layer is followed by a normalization layer (Lei Ba et al., 2016) ." 
}, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-46", "text": "----------------------------------" }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-47", "text": "**CONTEXT GATING**" }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-48", "text": "We use a gate to regulate the information at sentence-level h t and the contextual information at document-level d t ." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-49", "text": "The intuition is that different words require different amount of context for translation:" }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-50", "text": "where W h , W p are parameter matrices, and h t is the final hidden representation for a word x t or y t ." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-51", "text": "----------------------------------" }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-52", "text": "**INTEGRATED MODEL**" }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-53", "text": "The context can be used during encoding or decoding a word, and it can be taken from previously encoded source sentences, previously decoded target sentences, or from previous alignment vectors (i.e. context vectors (Bahdanau et al., 2015) )." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-54", "text": "The different configurations will define the input query and values of the attention function." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-55", "text": "In this work we experiment with five of them: one at encoding time, three at decoding time, and one combining both." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-56", "text": "At encoding time the query is a function of the hidden state h xt of the current word to be encoded x t , and the values are the encoded states of previous sentences h j x i (HAN encoder)." 
}, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-57", "text": "At decoding time, the query is a function of the hidden state h yt of the current word to be decoded y t , and the values can be (a) the encoded states of previous sentences h 3 Experimental Setup" }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-58", "text": "----------------------------------" }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-60", "text": "We carry out experiments with Chinese-to-English (Zh-En) and Spanish-to-English (Es-En) sets on three different domains: talks, subtitles, and news." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-61", "text": "TED Talks is part of the IWSLT 2014 and 2015 (Cettolo et al., 2012 (Cettolo et al., , 2015 evaluation campaigns 1 ." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-62", "text": "We use dev2010 for development; and tst2010-2012 (Es-En), tst2010-2013 (Zh-En) for testing." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-63", "text": "The Zh-En subtitles corpus is a compilation of TV subtitles designed for research on context (Wang et al., 2018) ." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-64", "text": "In contrast to the other sets, it has three references to compare." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-65", "text": "The Es-En corpus is a subset of OpenSubtitles2018 (Lison and Tiedemann, 2016) 2 ." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-66", "text": "We randomly select two episodes for development and testing each." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-67", "text": "Finally, we use the Es-En News-Commentaries11 3 corpus which has document-level delimitation." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-68", "text": "We evaluate on WMT sets (Bojar et al., 2013) : newstest2008 for development, and newstest2009-2013 for testing." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-69", "text": "A similar corpus for Zh-En is too small to be comparable." 
}, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-70", "text": "Table 2 shows the corpus statistics." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-71", "text": "For evaluation, we use BLEU score (Papineni et al., 2002 ) (multi-blue) on tokenized text, and we measure significance with the paired bootstrap resampling method proposed by Koehn (2004) (implementations by Koehn et al. (2007) )." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-72", "text": "----------------------------------" }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-73", "text": "**MODEL CONFIGURATION AND TRAINING**" }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-74", "text": "As baselines, we use a NMT transformer, and a context-aware NMT transformer with cache memory which we implemented for comparison following the best model described by , with memory size of 25 words." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-75", "text": "We used the OpenNMT (Klein et al., 2017) implementation of the transformer network." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-76", "text": "The configuration is the same as the model called \"base model\" in the original paper (Vaswani et al., 2017) ." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-77", "text": "The encoder and decoder are composed of 6 hidden layers each." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-78", "text": "All hidden states have dimension of 512, dropout of 0.1, and 8 heads for the multi-head attention." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-79", "text": "The target and source vocabulary size is 30K." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-80", "text": "The optimization and regularization methods were the same as proposed by Vaswani et al. (2017) ." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-81", "text": "Inspired by we trained the models in two stages." 
}, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-82", "text": "First we optimize the parameters for the NMT without the HAN, then we proceed to optimize the parameters of the whole network." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-83", "text": "We use k = 3 previous sentences, which gave the best performance on the development set." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-85", "text": "The baseline NMT transformer already has better performance than previously published results on these datasets, and we replicate previous previous improvements from the cache method over the this stronger baseline." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-86", "text": "All of our proposed HAN models perform at least as well as the cache method." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-87", "text": "The best scores are obtained by the combined encoder and decoder HAN model, which is significantly better than the cache method on all datasets without compromising training speed (2.3K vs 2.6K tok/sec)." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-88", "text": "An important portion of the improvement comes from the HAN encoder, which can be attributed to the fact that the sourceside always contains correct information, while the target-side may contain erroneous predictions at testing time. But combining HAN decoder with HAN encoder further improves translation performance, showing that they contribute complementary information." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-89", "text": "The three ways of incorporating information into the decoder all perform similarly." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-90", "text": "Table 3 shows the performance of our best HAN model with a varying number k of previous sentences in the test-set." 
}, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-91", "text": "We can see that the best performance for TED talks and news is archived with 3, while for subtitles it is similar between 3 and 7." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-92", "text": "----------------------------------" }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-93", "text": "**EXPERIMENTAL RESULTS**" }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-94", "text": "----------------------------------" }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-95", "text": "**TRANSLATION PERFORMANCE**" }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-96", "text": "----------------------------------" }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-97", "text": "**ACCURACY OF PRONOUN/NOUN TRANSLATIONS**" }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-98", "text": "We evaluate coreference and anaphora using the reference-based metric: accuracy of pronoun translation (Miculicich Werlen and Popescu-Belis, 2017b), which can be extended for nouns." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-99", "text": "The list of evaluated pronouns is predefined in the metric, while the list of nouns was extracted using NLTK POS tagging (Bird, 2006 of Table 4 shows the results." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-100", "text": "For nouns, the joint HAN achieves the best accuracy with a significant improvement compared to other models, showing that target and source contextual information are complementary." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-101", "text": "Similarity for pronouns, the joint model has the best result for TED talks and news." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-102", "text": "However, HAN encoder alone is better in the case of subtitles." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-103", "text": "Here HAN decoder produces mistakes by repeating past translated personal pronouns." 
}, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-104", "text": "Subtitles is a challenging corpus for personal pronoun disambiguation because it usually involves dialogue between multiple speakers." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-105", "text": "----------------------------------" }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-106", "text": "**COHESION AND COHERENCE EVALUATION**" }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-107", "text": "We use the metric proposed by Wong and Kit (2012) to evaluate lexical cohesion." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-108", "text": "It is defined as the ratio between the number of repeated and lexically similar content words over the total number of content words in a target document." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-109", "text": "The lexical similarity is obtained using WordNet." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-110", "text": "Table 4 (bottom-left) displays the average ratio per tested document." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-111", "text": "In some cases, HAN decoder achieves the best score because it produces a larger quantity of repetitions than other models." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-112", "text": "However, as previously demonstrated in 4.2, repetitions do not always make the translation better." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-113", "text": "Although HAN boosts lexical cohesion, the scores are still far from the human reference, so there is room for improvement in this aspect." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-114", "text": "For coherence, we use a metric based on Latent Semantic Analysis (LSA) (Foltz et al., 1998) ." 
}, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-115", "text": "LSA is used to obtain sentence representations, then cosine similarity is calculated from one sentence to the next, and the results are averaged to get a document score." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-116", "text": "We employed the pre-trained LSA model Wiki-6 from (Stefanescu et al., 2014) ." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-117", "text": "Table 4 (bottom-right) shows the average coherence score of documents." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-118", "text": "The joint HAN model consistently obtains the best coherence score, but close to other HAN models." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-119", "text": "Most of the improvement comes from the HAN decoder." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-120", "text": "Table 5 shows an example where HAN helped to generate the correct translation." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-121", "text": "The first box shows the current sentence with the analyzed word in bold; and the second, the past context at source and target." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-122", "text": "For the context visualization we use the toolkit provided by Pappas and Popescu-Belis (2017) ." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-123", "text": "Red corresponds to sentences, and blue to words." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-124", "text": "The intensity of color is proportional to the weight." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-125", "text": "We see that HAN correctly translates the ambiguous Spanish pronoun \"su\" into the English \"his\"." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-126", "text": "The HAN decoder highlighted a previous mention of \"his\", and the HAN encoder highlighted the antecedent \"Nathaniel\"." 
}, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-127", "text": "This shows that HAN can capture interpretable inter-sentence connections." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-128", "text": "More samples with different attention heads are shown in the Appendix ??." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-129", "text": "----------------------------------" }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-130", "text": "**TED TALKS**" }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-131", "text": "Subtitles News k Zh-En Es-En Zh-En Es-En Es-En 1 17" }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-132", "text": "----------------------------------" }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-136", "text": "Statistical Machine Translation (SMT) Initial studies were based on cache memories (Tiede- mann, 2010; Gong et al., 2011) ." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-137", "text": "However, most of the work explicitly models discourse phenomena (Sim Smith, 2017) such as lexical cohesion (Meyer and Popescu-Belis, 2012; Xiong et al., 2013; Lo\u00e1iciga and Grisot, 2016; Pu et al., 2017; Mascarell, 2017) , coherence (Born et al., 2017) , and coreference (Rios Gonzales and Tuggener, 2017; Miculicich Werlen and PopescuBelis, 2017a)." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-138", "text": "Hardmeier et al. (2013) introduced the document-level SMT paradigm." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-139", "text": "Sentence-level NMT Initial studies on NMT enhanced the sentence-level context by using memory networks (Wang et al., 2016) , self-attention (Miculicich Werlen et al., 2018; Zhang et al., 2016) , and latent variables ." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-140", "text": "Document-level NMT Tiedemann and Scherrer (2017) use the concatenation of multiple sentences as NMT's input/output, add a context encoder for the previous source sentence, Wang et al. 
(2017) includes a HRNN to summarize source-side context, and use a dynamic cache memory to store representations of previously translated words." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-141", "text": "Recently, Bawden et al. (2018) proposed test-sets for evaluating discourse in NMT, Voita et al. (2018) shows that context-aware NMT improves the of anaphoric pronouns, and Maruf and Haffari (2018) proposed a document-level NMT using memory-networks." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-142", "text": "----------------------------------" }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-143", "text": "**CONCLUSION**" }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-144", "text": "We proposed a hierarchical multi-head HAN NMT model 5 to capture inter-sentence connections." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-145", "text": "We integrated context from source and target sides by directly connecting representations from previous sentence translations into the current sentence translation." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-146", "text": "The model significantly outperforms two competitive baselines, and the ablation study shows that target and source context is complementary." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-147", "text": "It also improves lexical cohesion and coherence, and the translation of nouns and pronouns." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-148", "text": "The qualitative analysis shows that the model is able to identify important previous sentences and words for the correct prediction." }, { "sent_id": "a6f32017135e984fbe59f2171c50f4-C001-149", "text": "In future work, we plan to explicitly model discourse connections with the help of annotated data, which may further improve translation quality." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "a6f32017135e984fbe59f2171c50f4-C001-8" ], [ "a6f32017135e984fbe59f2171c50f4-C001-44" ] ], "cite_sentences": [ "a6f32017135e984fbe59f2171c50f4-C001-8", "a6f32017135e984fbe59f2171c50f4-C001-44" ] }, "@USE@": { "gold_contexts": [ [ "a6f32017135e984fbe59f2171c50f4-C001-22" ], [ "a6f32017135e984fbe59f2171c50f4-C001-24" ], [ "a6f32017135e984fbe59f2171c50f4-C001-40" ], [ "a6f32017135e984fbe59f2171c50f4-C001-76" ], [ "a6f32017135e984fbe59f2171c50f4-C001-80" ] ], "cite_sentences": [ "a6f32017135e984fbe59f2171c50f4-C001-22", "a6f32017135e984fbe59f2171c50f4-C001-24", "a6f32017135e984fbe59f2171c50f4-C001-40", "a6f32017135e984fbe59f2171c50f4-C001-76", "a6f32017135e984fbe59f2171c50f4-C001-80" ] }, "@SIM@": { "gold_contexts": [ [ "a6f32017135e984fbe59f2171c50f4-C001-76" ], [ "a6f32017135e984fbe59f2171c50f4-C001-80" ] ], "cite_sentences": [ "a6f32017135e984fbe59f2171c50f4-C001-76", "a6f32017135e984fbe59f2171c50f4-C001-80" ] } } }, "ABC_d3f5f9b1ef8bda3d33c563d252d58a_28": { "x": [ { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-2", "text": "Word embedding has been found to be highly powerful to translate words from one language to another by a simple linear transform." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-3", "text": "However, we found some inconsistence among the objective functions of the embedding and the transform learning, as well as the distance measurement." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-4", "text": "This paper proposes a solution which normalizes the word vectors on a hypersphere and constrains the linear transform as an orthogonal transform." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-5", "text": "The experimental results confirmed that the proposed solution can offer better performance on a word similarity task and an English-toSpanish word translation task." 
}, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-6", "text": "----------------------------------" }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-8", "text": "Word embedding has been extensively studied in recent years (Bengio et al., 2003; Turian et al., 2010; Collobert et al., 2011; Huang et al., 2012) ." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-9", "text": "Following the idea that the meaning of a word can be determined by 'the company it keeps' (Baroni and Zamparelli, 2010) , i.e., the words that it co-occurs with, word embedding projects discrete words to a low-dimensional and continuous vector space where co-occurred words are located close to each other." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-10", "text": "Compared to conventional discrete representations (e.g., the one-hot encoding), word embedding provides more robust representations for words, particulary for those that infrequently appear in the training data." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-11", "text": "More importantly, the embedding encodes syntactic and semantic content implicitly, so that relations among words can be simply computed as the distances among their embeddings, or word vectors." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-12", "text": "A well-known efficient word embedding approach was recently proposed by (Mikolov et al., 2013a) , where two log-linear models (CBOW and skip-gram) are proposed to learn the neighboring relation of words in context." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-13", "text": "A following work proposed by the same authors introduces some modifications that largely improve the efficiency of model training (Mikolov et al., 2013c )." 
}, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-14", "text": "An interesting property of word vectors learned by the log-linear model is that the relations among relevant words seem linear and can be computed by simple vector addition and substraction (Mikolov et al., 2013d) ." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-15", "text": "For example, the following relation approximately holds in the word vector space: ParisFrance + Rome = Italy." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-16", "text": "In (Mikolov et al., 2013b) , the linear relation is extended to the bilingual scenario, where a linear transform is learned to project semantically identical words from one language to another." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-17", "text": "The authors reported a high accuracy on a bilingual word translation task." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-18", "text": "Although promising, we argue that both the word embedding and the linear transform are ill-posed, due to the inconsistence among the objective function used to learn the word vectors (maximum likelihood based on inner product), the distance measurement for word vectors (cosine distance), and the objective function used to learn the linear transform (mean square error)." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-19", "text": "This inconsistence may lead to suboptimal estimation for both word vectors and the bilingual transform, as we will see shortly." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-20", "text": "This paper solves the inconsistence by normalizing the word vectors." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-21", "text": "Specifically, we enforce the word vectors to be in a unit length during the learning of the embedding." 
}, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-22", "text": "By this constraint, all the word vectors are located on a hypersphere and so the inner product falls back to the cosine distance." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-23", "text": "This hence solves the inconsistence between the embedding and the distance measurement." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-24", "text": "To respect the normalization constraint on word vectors, the linear transform in the bilingual projection has to be constrained as an orthogonal transform." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-25", "text": "Finally, the cosine distance is used when we train the orthogonal transform, in order to achieve full consistence." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-26", "text": "----------------------------------" }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-27", "text": "**RELATED WORK**" }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-28", "text": "This work largely follows the methodology and experimental settings of (Mikolov et al., 2013b) , while we normalize the embedding and use an orthogonal transform to conduct bilingual translation." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-29", "text": "Multilingual learning can be categorized into projection-based approaches and regularizationbased approaches." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-30", "text": "In the projection-based approaches, the embedding is performed for each language individually with monolingual data, and then one or several projections are learned using multilingual data to represent the relation between languages." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-31", "text": "Our method in this paper and the linear projection method in (Mikolov et al., 2013b ) both belong to this category." 
}, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-76", "text": "One can show that this problem can be solved by taking the singular value decomposition (SVD) of W and replacing the singular values to ones." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-32", "text": "Another interesting work proposed by (Faruqui and Dyer, 2014) learns linear transforms that project word vectors of all languages to a common low-dimensional space, where the correlation of the multilingual word pairs is maximized with the canonical correlation analysis (CCA)." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-33", "text": "The regularization-based approaches involve the multilingual constraint in the objective function for learning the embedding." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-34", "text": "For example, (Zou et al., 2013) adds an extra term that reflects the distances of some pairs of semantically related words from different languages into the objective funtion." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-35", "text": "A similar approach is proposed in (Klementiev et al., 2012) , which casts multilingual learning as a multitask learning and encodes the multilingual information in the interaction matrix." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-36", "text": "All the above methods rely on a multilingual lexicon or a word/pharse alignment, usually from a machine translation (MT) system." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-37", "text": "(Blunsom et al., 2014) proposed a novel approach based on a joint optimization method for word alignments and the embedding." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-38", "text": "A simplified version of this approach is proposed in (Hermann and Blunsom, 2014) , where a sentence is represented by the mean vector of the words involved." 
}, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-39", "text": "Multilingual learning is then reduced to maximizing the overall distance of the parallel sentences in the training corpus, with the distance computed upon the sentence vectors." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-40", "text": "----------------------------------" }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-41", "text": "**NORMALIZED WORD VECTORS**" }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-42", "text": "Taking the skip-gram model, the goal is to predict the context words with a word in the central position." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-43", "text": "Mathematically, the training process maximizes the following likelihood function with a word sequence" }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-44", "text": "where C is the length of the context in concern, and the prediction probability is given by:" }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-45", "text": "where w is any word in the vocabulary, and c w denotes the vector of word w. Obviously, the word vectors learned by this way are not constrained and disperse in the entire M -dimensional space, where M is the dimension of the word vectors." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-46", "text": "An inconsistence with this model is that the distance measurement in the training is the inner product c T w c w , however when word vectors are applied, e.g., to estimate word similarities, the metric is often the cosine distance c T w c w ||cw||||c w || ." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-47", "text": "A way to solve this consistence is to use the inner product in applications, however using the cosine distance is a convention in natural language processing (NLP) and this measure does show better performance than the inner product in our experiments." 
}, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-48", "text": "We therefore perform in an opposite way, i.e., enforcing the word vectors to be unit in length." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-49", "text": "Theoretically, this changes the learning of the embedding to an optimization problem with a quadratic constraint." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-50", "text": "Solving this problem by Lagrange multipliers is possible, but here we simply divide a vector by its l-2 norm whenever the vector is updated." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-51", "text": "This does not involve much code change and is efficient enough." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-52", "text": "1 The consequence of the normalization is that all the word vectors are located on a hypersphere, as illustrated in Figure 1 ." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-53", "text": "In addition, by the normalization, the inner product falls back to the cosine distance, hence solving the inconsistence between the embedding learning and the distance measurement." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-54", "text": "----------------------------------" }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-55", "text": "**ORTHOGONAL TRANSFORM**" }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-56", "text": "The bilingual word translation provided by (Mikolov et al., 2013b ) learns a linear transform from the source language to the target language by the linear regression." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-57", "text": "The objective function is as follows:" }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-58", "text": "1 For efficiency, this normalization can be conducted every n mini-batches." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-59", "text": "The performance is expected to be not much impacted, given that n is not too large." 
}, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-60", "text": "where W is the projection matrix to be learned, and x i and z i are word vectors in the source and target language respectively." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-61", "text": "The bilingual pair (x i , z i ) indicates that x i and z i are identical in semantic meaning." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-62", "text": "A high accuracy was reported on a word translation task, where a word projected to the vector space of the target language is expected to be as close as possible to its translation (Mikolov et al., 2013b) ." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-63", "text": "However, we note that the 'closeness' of words in the projection space is measured by the cosine distance, which is fundamentally different from the Euler distance in the objective function (3) and hence causes inconsistence." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-64", "text": "We solve this problem by using the cosine distance in the transform learning, so the optimization task can be redefined as follows:" }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-65", "text": "Note that the word vectors in both the source and target vector spaces are normalized, so the inner product in (4) is equivalent to the cosine distance." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-66", "text": "A problem of this change, however, is that the projected vector W x i has to be normalized, which is not guaranteed so far." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-67", "text": "To solve the problem, we first consider the case where the dimensions of the source and target vector spaces are the same." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-68", "text": "In this case, the normalization constraint on word vectors can be satisfied by constraining W as an orthogonal matrix, which turns the unconstrained problem (4) to a constrained optimization problem." 
}, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-69", "text": "A general solver such as SQP can be used to solve the problem." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-70", "text": "However, we seek a simple approximation in this work." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-71", "text": "Firstly, solve (4) by gradient descendant without considering any constraint." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-72", "text": "A simple calculation shows that the gradient is as follows:" }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-73", "text": "and the update rule is simply given by:" }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-74", "text": "where \u03b1 is the learning rate." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-75", "text": "After the update, W is orthogonalized by solving the following constrained quadratic problem:" }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-77", "text": "For the case where the dimensions of the source and target vector spaces are different, the normalization constraint upon the projected vectors is not easy to satisfy." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-78", "text": "We choose a pragmatic solution." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-79", "text": "First, we extend the low-dimensional vector space by padding a small tunable constant at the end of the word vectors so that the source and target vector spaces are in the same dimension." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-80", "text": "The vectors are then renormalized after the padding to respect the normalization constraint." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-81", "text": "Once this is done, the same gradient descendant and orthognalization approaches are ready to use to learn the orthogonal transform." 
}, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-82", "text": "----------------------------------" }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-83", "text": "**EXPERIMENT**" }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-84", "text": "We first present the data profile and configurations used to learn monolingual word vectors, and then examine the learning quality on the word similarity task." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-85", "text": "Finally, a comparative study is reported on the bilingual word translation task, with Mikolov's linear transform and the orthogonal transform proposed in this paper." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-86", "text": "----------------------------------" }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-87", "text": "**MONOLINGUAL WORD EMBEDDING**" }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-88", "text": "The monolingual word embedding is conducted with the data published by the EMNLP 2011 SMT workshop (WMT11) 2 ." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-89", "text": "For an easy comparison, we largely follow Mikolov's settings in (Mikolov et al., 2013b) and set English and Spanish as the source and target language, respectively." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-90", "text": "The data preparation involves the following steps." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-91", "text": "Firstly, the text was tokenized by the standard scripts provided by WMT11 3 , and then duplicated sentences were removed." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-92", "text": "The numerical expressions were tokenized as 'NUM', and special characters (such as !?,:) were removed." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-93", "text": "The word2vector toolkit 4 was used to train the word embedding model." 
}, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-94", "text": "We chose the skip-gram model and the text window was set to 5." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-95", "text": "The training resulted in embedding of 169k English tokens and 116k Spanish tokens." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-96", "text": "----------------------------------" }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-97", "text": "**MONOLINGUAL WORD SIMILARITY**" }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-98", "text": "The first experiment examines the quality of the learned word vectors in English." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-99", "text": "We choose the word similarity task, which tests to what extent the word similarity computed based on word vectors agrees with human judgement." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-100", "text": "The WordSimilarity-353 Test Collection 5 provided by (Finkelstein et al., 2002 ) is used." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-101", "text": "The dataset involves 154 word pairs whose similarities are measured by 13 people and the mean values are used as the human judgement." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-102", "text": "In the experiment, the correlation between the cosine distances computed based on the word vectors and the humane-judged similarity is used to measure the quality of the embedding." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-103", "text": "The results are shown in Figure 2 , where the dimension of the vector space varies from 300 to 1000." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-104", "text": "It can be observed that the normalized word vectors offer a high correlation with human judgement than the unnormalized counterparts." 
}, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-105", "text": "----------------------------------" }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-106", "text": "**BILINGUAL WORD TRANSLATION**" }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-107", "text": "The second experiment focuses on bilingual word translation." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-108", "text": "We select 6000 frequent words in English and employ the online Google's translation service to translate them to Spanish." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-109", "text": "The resulting 6000 English-Spanish word pairs are used to train and test the bilingual transform in the way of cross validation." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-110", "text": "Specifically, the 6000 pairs are randomly divided into 10 subsets, and at each time, 9 subsets are used for training and the rest 1 subset for testing." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-111", "text": "The average of the results of the 10 tests is reported as the final result." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-112", "text": "Note that not all the words translated by Google are in the vocabulary of the target language; the vocabulary coverage is 99.5% in our test." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-113", "text": "----------------------------------" }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-114", "text": "**RESULTS WITH LINEAR TRANSFORM**" }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-115", "text": "We first reproduce Mikolov's work with the linear transform." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-116", "text": "A number of dimension settings are experimented with and the results are reported in Table 1." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-117", "text": "The proportions that the correct translations are in the top 1 and top 5 candidate list are reported as P@1 and P@5 respectively." 
}, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-118", "text": "As can be seen, the best dimension setting is 800 for English and 200 for Spanish, and the corresponding P@1 and P@5 are 35.36% and 53.96%, respectively." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-119", "text": "These results are comparable with the results reported in (Mikolov et al., 2013b" }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-120", "text": "----------------------------------" }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-121", "text": "**RESULTS WITH ORTHOGONAL TRANSFORM**" }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-122", "text": "The results with the normalized word vectors and the orthogonal transform are reported in Table 2 ." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-123", "text": "It can be seen that the results with the orthogonal transform are consistently better than those reported in Table1 which are based on the linear transform." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-124", "text": "This confirms our conjecture that bilingual translation can be largely improved by the normalized embedding and the accompanied orthogonal transform." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-125", "text": "----------------------------------" }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-126", "text": "**CONCLUSIONS**" }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-127", "text": "We proposed an orthogonal transform based on normalized word vectors for bilingual word translation." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-128", "text": "This approach solves the inherent inconsistence in the original approach based on unnormalized word vectors and a linear transform." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-129", "text": "The experimental results on a monolingual word similarity task and an English-to-Spanish word translation task show clear advantage of the proposal." 
}, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-130", "text": "This work, however, is still preliminary." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-131", "text": "It is unknown if the normalized embedding works on other tasks such as relation prediction, although we expect so." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-132", "text": "The solution to the orthogonal transform between vector spaces with mismatched dimensions is rather ad-hoc." }, { "sent_id": "d3f5f9b1ef8bda3d33c563d252d58a-C001-133", "text": "Nevertheless, locating word vectors on a hypersphere opens a door to study the properties of the word embedding in a space that is yet less known to us." } ], "y": { "@BACK@": { "gold_contexts": [ [ "d3f5f9b1ef8bda3d33c563d252d58a-C001-16" ], [ "d3f5f9b1ef8bda3d33c563d252d58a-C001-56" ], [ "d3f5f9b1ef8bda3d33c563d252d58a-C001-62" ] ], "cite_sentences": [ "d3f5f9b1ef8bda3d33c563d252d58a-C001-16", "d3f5f9b1ef8bda3d33c563d252d58a-C001-56", "d3f5f9b1ef8bda3d33c563d252d58a-C001-62" ] }, "@USE@": { "gold_contexts": [ [ "d3f5f9b1ef8bda3d33c563d252d58a-C001-28" ], [ "d3f5f9b1ef8bda3d33c563d252d58a-C001-89" ] ], "cite_sentences": [ "d3f5f9b1ef8bda3d33c563d252d58a-C001-28", "d3f5f9b1ef8bda3d33c563d252d58a-C001-89" ] }, "@SIM@": { "gold_contexts": [ [ "d3f5f9b1ef8bda3d33c563d252d58a-C001-28" ], [ "d3f5f9b1ef8bda3d33c563d252d58a-C001-31" ], [ "d3f5f9b1ef8bda3d33c563d252d58a-C001-89" ], [ "d3f5f9b1ef8bda3d33c563d252d58a-C001-119" ] ], "cite_sentences": [ "d3f5f9b1ef8bda3d33c563d252d58a-C001-28", "d3f5f9b1ef8bda3d33c563d252d58a-C001-31", "d3f5f9b1ef8bda3d33c563d252d58a-C001-89", "d3f5f9b1ef8bda3d33c563d252d58a-C001-119" ] } } }, "ABC_2db9f6c8540d8d2b7a5946c3c132e9_28": { "x": [ { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-2", "text": "Recent advances in distributional semantics combined with the availability of large-scale diachronic corpora offer new research 
avenues for the Digital Humanities." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-3", "text": "JESEME, the Jena Semantic Explorer, renders assistance to a non-technical audience to investigate diachronic semantic topics." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-4", "text": "JESEME runs as a website with query options and interactive visualizations of results, as well as a REST API for access to the underlying diachronic data sets." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-5", "text": "----------------------------------" }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-7", "text": "Scholars in the humanities frequently deal with texts whose lexical items have become antiquated or have undergone semantic changes." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-8", "text": "Thus their proper understanding is dependent on translational knowledge from manually compiled dictionaries." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-9", "text": "To complement this workflow with modern NLP tooling, we developed JESEME, 1 the Jena Semantic Explorer." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-10", "text": "It supports both lexicologists and scholars with easy-to-use state-of-the-art distributional semantics machinery via an interactive public website and a REST API." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-11", "text": "JESEME can be queried for change patterns of lexical items over decades and centuries (resources permitting)." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-12", "text": "The website and the underlying NLP pipelines are open source and available via GitHub." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-13", "text": "2 JESEME currently covers five diachronic corpora, two for German and three for English." 
}, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-14", "text": "To the best of our knowledge, it is the first tool ever with such capabilities." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-15", "text": "Its development owes credits to the interdisciplinary Graduate School \"The Romantic Model\" at Friedrich-Schiller-Universit\u00e4t Jena (Germany)." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-16", "text": "1 http://jeseme.org 2 https://github.com/hellrich/JeSemE 2 Related Work" }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-17", "text": "----------------------------------" }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-18", "text": "**DISTRIBUTIONAL SEMANTICS**" }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-19", "text": "Distributional semantics can be broadly conceived as a staged approach to capture the semantics of a lexical item in focus via contextual patterns." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-20", "text": "Concordances are probably the most simple scheme to examine contextual semantic effects, but leave semantic inferences entirely to the human observer." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-21", "text": "A more complex layer is reached with collocations which can be identified automatically via statistical word co-occurrence metrics (Manning and Sch\u00fctze, 1999; Wermter and Hahn, 2006) , two of which are incorporated in JESEME as well: Positive pointwise mutual information (PPMI), developed by Bullinaria and Levy (2007) as an improvement over the probability ratio of normal pointwise mutual information (PMI; Church and Hanks (1990) ) and Pearson's \u03c7 2 , commonly used for testing the association between categorical variables (e.g., POS tags) and considered to be more robust than PMI when facing sparse information (Manning and Sch\u00fctze, 1999) ." 
}, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-22", "text": "The currently most sophisticated and most influential approach to distributional semantics employs word embeddings, i.e., low (usually 300-500) dimensional vector word representations of both semantic and syntactic information." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-23", "text": "Alternative approaches are e.g., graph-based algorithms (Biemann and Riedl, 2013) or ranking functions from information retrieval (Claveau et al., 2014) ." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-24", "text": "The premier example for word embeddings is skip-gram negative sampling, which is part of the word2vec family of algorithms (Mikolov et al., 2013) ." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-25", "text": "The random processes involved in training these embeddings lead to a lack of reliability which is dangerous during interpretationexperiments cannot be repeated without predicting severely different relationships between words Hahn, 2016a, 2017) ." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-26", "text": "Word embeddings based on singular value decomposition (SVD; historically popular in the form of Latent Semantic Analysis (Deerwester et al., 1990) ) are not affected by this problem." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-27", "text": "Levy et al. (2015) created SVD PPMI after investigating the implicit operations performed while training neural word embeddings (Levy and Goldberg, 2014) ." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-28", "text": "As SVD PPMI performs very similar to word2vec on evaluation tasks while avoiding reliability problems we deem it the best currently available word embedding method for applying distributional semantics in the Digital Humanities (Hamilton et al., 2016; Hellrich and Hahn, 2016a) ." 
}, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-29", "text": "----------------------------------" }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-30", "text": "**AUTOMATIC DIACHRONIC SEMANTICS**" }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-31", "text": "The use of statistical methods is getting more and more the status of a commonly shared methodology in diachronic linguistics (see e.g., Curzan (2009))." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-32", "text": "There exist already several tools for performing statistical analysis on user provided corpora, e.g., WORDSMITH 3 or the UCS TOOLKIT, 4 as well as interactive websites for exploring precompiled corpora, e.g., the \"advanced\" interface for Google Books (Davies, 2014) or DIACOLLO (Jurish, 2015) ." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-33", "text": "Meanwhile, word embeddings and their application to diachronic semantics have become a novel state-of-the-art methodology lacking, however, off-the-shelves analysis tools easy to use for a typically non-technical audience." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-34", "text": "Most work is centered around word2vec (e.g., Kim et al. (2014) ; Kulkarni et al. (2015) ; Hellrich and Hahn (2016b) ), whereas alternative approaches are rare, e.g., Jo (2016) using GloVe (Pennington et al., 2014) and Hamilton et al. (2016) using SVD PPMI ." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-35", "text": "Embeddings trained on corpora specific for multiple time spans can be used for two research purposes, namely, screening the semantic evolution of lexical items over time (Kim et al., 2014; Kulkarni et al., 2015; Hamilton et al., 2016) and exploring the meaning of lexical items during a specific time span by finding their closest neighbors in embedding space." 
}, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-36", "text": "This information can then be exploited for automatic (Buechel et al., 2016) or manual (Jo, 2016) interpretation." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-37", "text": "research." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-38", "text": "We employ five corpora, including the four largest diachronic corpora of acceptable quality for English and German." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-39", "text": "The Google Books Ngram Corpus (GB; Michel et al. (2011 ), Lin et al. (2012 ) contains about 6% of all books published between 1500 and 2009 in the form of n-grams (up to pentagrams)." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-40", "text": "GB is multilingual; its English subcorpus is further divided into regional segments (British, US) and genres (general language and fiction texts)." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-41", "text": "It can be argued to be not so useful for Digital Humanities research due to digitalization artifacts and its opaque and unbalanced nature, yet the English Fiction part is least effected by these problems (Pechenick et al., 2015; Koplenig, 2017) ." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-42", "text": "We use its German (GB German) and English Fiction (GB fiction) subcorpora." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-43", "text": "The Corpus of Historical American English 5 (COHA; Davies (2012)) covers texts from 1800 to 2009 from multiple genres balanced for each decade, and contains annotations for lemmata." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-44", "text": "The Deutsches Textarchiv 6 (DTA, 'German Text Archive'; Geyken (2013); Jurish (2013)) is a German diachronic corpus and consists of manually transcribed books selected for their representativeness and balance between genres." 
}, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-45", "text": "A major benefit of DTA are its annotation layers which offer both orthographic normalization (mapping archaic forms to contemporary ones) and lemmatization via the CAB tool (Jurish, 2013) ." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-46", "text": "Finally, the Royal Society Corpus (RSC) contains the first two centuries of the Philosophical Transactions of the Royal Society of London (Kermes et al., 2016) , thus forming the most specialized corpus in our collection." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-47", "text": "Orthographic normalization as well as lemmatization information are provided, just as in DTA." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-48", "text": "RSC is far smaller than the other corpora, yet was included due to its relevance for research projects in our graduate school." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-49", "text": "----------------------------------" }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-50", "text": "**SEMANTIC PROCESSING**" }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-51", "text": "The five corpora described in Section 3 were divided into multiple non-overlapping temporal slices, covering 10 years each for COHA and the two GB subcorpora, 30 years each for the smaller DTA and finally two 50 year slices and one 19 year slice for the even smaller RSC (as provided in the corpus, roughly similar in size)." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-52", "text": "We removed non-alphanumeric characters during pre-processing and transformed all English text to lowercase." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-53", "text": "Lemmata were used for the stronger inflected German (provided in DTA, respectively a mapping table created with the CAB webservice (Jurish, 2013) for the German GB subcorpus) and the rather antiquated RSC (provided in the corpus)." 
}, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-54", "text": "We calculated PPMI and \u03c7 2 for each slice, with a context window of 4 words, no random sampling, context distribution smoothing of 0.75 for PPMI, and corpus dependent minimum word frequency thresholds of 50 (COHA, DTA and RSC) respectively 100 (GB subcorpora)." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-55", "text": "7 The PPMI matrices were then used to create SVD PPMI embeddings with 500 dimensions." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-56", "text": "These calculations were performed with a modified version of HY-PERWORDS 8 (Levy et al., 2015) , using custom extensions for faster pre-processing and \u03c7 2 ." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-57", "text": "The resulting models have a size of 32 GB and are available for download on JESEME's Help page." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-58", "text": "9 To ensure JESEME's responsiveness, we finally pre-computed similarity (by cosine between word embeddings), as well as context specificity based on PPMI and \u03c7 2 ." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-59", "text": "These values are stored in a POSTGRESQL 10 database, occupying about 60GB of space." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-60", "text": "Due to both space constraints (scaling with O(n 2 ) for vocabulary size n) and the lower quality of representations for infrequent words, we limited this step to words which were among the 10k most frequent words for all slices of a corpus, resulting in 3,1k -6,5k words per corpus." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-61", "text": "In accordance with this limit, we also discarded slices with less than 10k (5k for RSC) 7 Parameters were chosen in accordance with Levy et al. (2015) and Hamilton et al. (2016 words above the minimum frequency threshold used during PPMI and \u03c7 2 calculation, e.g., the 1810s and 1820s COHA slices." 
}, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-62", "text": "Figure 1 illustrates this sequence of processing steps, while Table 1 summarizes the resulting models for each corpus." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-63", "text": "5 Website and API JESEME provides both an interactive website and an API for querying the underlying database." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-64", "text": "Both are implemented with the SPARK 11 framework running inside a JETTY 12 Web server." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-65", "text": "On JESEME's initial landing page, users can enter a word into a search field and select a corpus." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-66", "text": "They are then redirected to the result page, as depicted in Figure 2 ." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-67", "text": "Query words are automatically lowercased or lemmatized, depending on the respective corpus (see Section 4)." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-68", "text": "The result page provides three kinds of graphs, i.e., Similar Words, Typical Context and Relative Frequency." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-69", "text": "Similar Words depicts the words with the highest similarity relative to the query term for the first and last time slice and how their similarity values changed over time." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-70", "text": "We follow Kim et al. (2014) in choosing such a visualization, while we refrain from using the two-dimensional projection used in other studies (Kulkarni et al., 2015; Hamilton et al., 2016) ." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-71", "text": "We stipulate that the latter could Figure 2 : Screenshot of JESEME's result page when searching for the lexical item \"heart\" in COHA." 
}, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-72", "text": "be potentially misleading by implying a constant meaning of those words used as the background (which are actually positioned by their meaning at a single point in time)." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-73", "text": "Typical Context offers two graphs, one for \u03c7 2 and one for PPMI, arranged in tabs." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-74", "text": "Values in typical context graphs are normalized to make them comparable across different metrics." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-75", "text": "Finally, Relative Frequency plots the relative frequency measure against all words above the minimum frequency threshold (see Section 4)." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-76", "text": "All graphs are accompanied by a short explanation and a form for adding further words to the graph under scrutiny." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-77", "text": "The result page also provides a link to the corresponding corpus, to help users trace JESEME's computational results." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-78", "text": "As an example, consider JESEME's search for \"heart\" in COHA as depicted in Figure 2 ." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-79", "text": "The Similar Words graph depicts a lowered similarity to \"soul\" and increased similarity to \"lungs\", and more recently also \"stroke\", which we interpret as a gradual decrease in metaphorical usage." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-80", "text": "Since COHA is balanced, we assume this pattern to indicate a true semantic change; a similar change is also observable in the GB Fiction dataset, yet not in the highly domainspecific RSC." 
}, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-81", "text": "Note that this change is unlikely to be linked with the decreased frequency of \"soul\", as PMI-derived metrics are known to be biased towards infrequent words (Levy et al., 2015) ." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-82", "text": "This shift in meaning is also visible in the Typical Context graphs, with \"attack\" and \"disease\" being increasingly specific by both \u03c7 2 and PPMI." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-83", "text": "Note that metaphorical or metonymical usage of \"heart\" is historically quite common (Niemeier, 2003) , despite its long-known anatomical function (Aird, 2011)." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-84", "text": "The database underlying JESEME's graphs can also be queried via a REST API which provides JSON encoded results." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-85", "text": "API calls need to specify the corpus to be searched and one (frequency) or two (similarity, context) words as GET parameters." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-86", "text": "13 Calling conventions are further detailed on JESEME's Help page." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-87", "text": "14 13 For example http://jeseme.org/api/ similarity?word1=Tag&word2=Nacht&corpus= dta 14 http://jeseme.org/help.html#api" }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-88", "text": "----------------------------------" }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-89", "text": "**CONCLUSION**" }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-90", "text": "We presented JESEME, the Jena Semantic Explorer, an interactive website and REST API for exploring changes in lexical semantics over long periods of time." 
}, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-91", "text": "In contrast to other corpus exploration tools, JESEME is based on cutting-edge word embedding technology (Levy et al., 2015; Hamilton et al., 2016; Hahn, 2016a, 2017) and provides access to five popular corpora for the English and German language." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-92", "text": "JESEME is also the first tool of its kind and under continuous development." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-93", "text": "Future technical work will add functionality to compare words across corpora which might require a mapping between embeddings (Kulkarni et al., 2015; Hamilton et al., 2016) and provide optional stemming routines." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-94", "text": "Both goals come with an increase in precomputed similarity values and will thus necessitate storage optimizations to ensure long-term availability." }, { "sent_id": "2db9f6c8540d8d2b7a5946c3c132e9-C001-95", "text": "Finally, we will conduct a user study to investigate JESEME's potential for the Digital Humanities community." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "2db9f6c8540d8d2b7a5946c3c132e9-C001-28" ], [ "2db9f6c8540d8d2b7a5946c3c132e9-C001-34" ], [ "2db9f6c8540d8d2b7a5946c3c132e9-C001-35" ] ], "cite_sentences": [ "2db9f6c8540d8d2b7a5946c3c132e9-C001-28", "2db9f6c8540d8d2b7a5946c3c132e9-C001-34", "2db9f6c8540d8d2b7a5946c3c132e9-C001-35" ] }, "@USE@": { "gold_contexts": [ [ "2db9f6c8540d8d2b7a5946c3c132e9-C001-61" ] ], "cite_sentences": [ "2db9f6c8540d8d2b7a5946c3c132e9-C001-61" ] }, "@SIM@": { "gold_contexts": [ [ "2db9f6c8540d8d2b7a5946c3c132e9-C001-61" ] ], "cite_sentences": [ "2db9f6c8540d8d2b7a5946c3c132e9-C001-61" ] }, "@DIF@": { "gold_contexts": [ [ "2db9f6c8540d8d2b7a5946c3c132e9-C001-70" ] ], "cite_sentences": [ "2db9f6c8540d8d2b7a5946c3c132e9-C001-70" ] }, "@UNSURE@": { "gold_contexts": [ [ "2db9f6c8540d8d2b7a5946c3c132e9-C001-91" ] ], "cite_sentences": [ "2db9f6c8540d8d2b7a5946c3c132e9-C001-91" ] }, "@FUT@": { "gold_contexts": [ [ "2db9f6c8540d8d2b7a5946c3c132e9-C001-93" ] ], "cite_sentences": [ "2db9f6c8540d8d2b7a5946c3c132e9-C001-93" ] } } }, "ABC_9a52e0ea1f12e3455fca48ac8f8936_28": { "x": [ { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-2", "text": "Abstract-Applying state of the art deeplearning models to novel real world datasets gives a practical evaluation of the generalizability of these models." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-3", "text": "Of importance in this process is how sensitive the hyper parameters of such models are to novel datasets as this would affect the reproducibility of a model." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-4", "text": "We present work to characterize the hyper parameter space of an LSTM for language modeling on a code-mixed corpus." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-5", "text": "We observe that the evaluated model shows minimal sensitivity to our novel dataset bar a few hyper parameters." 
}, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-6", "text": "----------------------------------" }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-7", "text": "**I. INTRODUCTION**" }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-8", "text": "Hyper parameter tuning is an integral part of building deep learning models." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-9", "text": "State of the art models are often benchmarked on a small set of datasets such as Penn Treebank [1] , WikiText, GigaWord, MNIST, CIFAR10 to name a few of the limited set." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-10", "text": "The hyper parameters values on these datasets are however not directly applicable to other use case specific datasets." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-11", "text": "Advances in deep learning research including its applications to Natural Language Processing (NLP) is correlated to the introduction of new increasing strategies for regularization and optimization of neural networks." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-12", "text": "These strategies, more often than not introduce new hyper parameters, thus, compounding the challenge of hyper parameter tuning; even more so if hyper parameter values are overly sensitive to the dataset." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-13", "text": "The effect of this would be that reproducing state of the art neural models on a unique dataset would require significant hyper parameter search thus limiting the reach of these models to parties with significant computing resources." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-14", "text": "We present work done to understand the effect of the set of parameters selected on the perplexity (the exponential of the average negative log-likelihood of prediction of the next word in a sequence [2] ) of a Neural Language Model (NLM)." 
}, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-15", "text": "We apply hyper parameter search methods given baseline hyper parameter values for benchmark datasets to modeling codemixed text." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-16", "text": "Code-mixed text is text which draws from elements of two or more grammatical systems [3] )." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-17", "text": "Code-mixed text is common in countries in which multiple languages co-exist." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-18", "text": "In this work we assess the performance of the AWD-LSTM model [4] for language modeling to better understand how relevant the published hyper parameters may be for a codemixed corpus and to isolate which hyper parameters could be further tuned to improve performance." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-19", "text": "Our results show that as a whole, the set of hyperparameters considered the best [4] are reasonably good, however ther are better sets hyperparamers for the code-mixed corpora." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-20", "text": "Moreover, even with the best set of hyper parameters, the perplexity observed for our data are significantly higher (i.e. performance is worse at the task of language modeling) than the performance demonstrated in the literature." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-21", "text": "Finally, our implemented approach is one that not only enables confirmation of the goodness of the hyper parameters values, but we can also develop inferences about which hyper parameter values would yield better results." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-22", "text": "----------------------------------" }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-23", "text": "**II. 
BACKGROUND**" }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-24", "text": "Deeplearning has found sucess in various applications including natural language processing tasks such as language modeling, parts of speech tagging, summarization and many others." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-25", "text": "The learning performance of deep neural networks however depends on systematic tuning of the hyper parameters." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-26", "text": "As such finding optimal hyper parameters is an integral part of building neural models including neural language models." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-27", "text": "Recurrent neural networks (RNNs) being well suited to dealing with sequences of vectors, have found much success in NLP by leveraging the sequential structure of language, A variant of RNNs known as Long Short-Term Memory Networks (LSTMs) [5] have particularly been widely used and stands as the state of the art technique for language modeling on benchmark datasets such as Penn Treebank (PTB) [1] and One billion words [6] among others." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-28", "text": "Language models (LMs) by themselves are valuable because well trained LMs improve the underlying metrics of downstream tasks such as word error rate for speech recognition, BLEU score for translation." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-29", "text": "In addition, LMs compactly extract knowledge encoded in training data [7] ." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-30", "text": "The current state of the art on modeling both PTB and WikiText 2 [8] datasets as reported in [4] shows little sensitivity to hyper parameters; sharing almost all hyper parameters values between both datasets." 
}, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-31", "text": "In [9] , its is also shown that deep learning model can jointly learn a number of large-scale tasks from multiple domains by designing a multi-modal architecture in which as many parameters as possible are shared." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-32", "text": "Training and evaluating a neural network involves mapping the hyper parameter configuration (set of values for each hyper parameter) used in training the network to the validation error obtained at the end." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-33", "text": "Strategies for searching and obtaining an optimal configuration that have been applied and found considerable success include grid search, random search, Bayesian optimization, Sequential Model-based Bayesian Optimization (SMBO) [10] , deterministic hyperparameter optimization methods that employs radial basis functions as error surrogates proposed by [11] , Gaussian Process Batch Upper Confidence Bound (GP-BUCB) [12] ; an upper confidence bound-based algorithm, which models the reward function as a sample from a Gaussian process." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-34", "text": "In [13] , the authors propose initializing Bayesian hyper parameters using meta-learning." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-35", "text": "The idea being initializing the configuration space for a novel dataset based on configurations that are known to perform well on similar, previously evaluated, datasets." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-36", "text": "Following a meta-learning approach, we apply a genetic algorithm and a sequential search algorithm, described in the next section, initialized using the best configuration reported in [4] to search the space around optimal hyper parameters for the AWD-LSTM model." 
}, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-37", "text": "Twitter tweets collected using a geolocation filter for Nigeria and Kenya with the goal of acquiring a code-mixed text corpus serve as our evaluation datasets." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-38", "text": "We report the test perplexity distributions of the various evaluated configurations and draw inferences on the sensitivity of each hyper parameter to our unique dataset." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-39", "text": "----------------------------------" }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-40", "text": "**III. METHODOLOGY**" }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-41", "text": "We begin our work by establishing what the baseline and current state of the art model is for a language modeling task [4] ." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-42", "text": "Applying the AWD-LSTM model, based on the open sourced code and trained on code-mixed Twitter data, we sample 84 different hyper parameter configurations for each dataset, and evaluate the resulting test perplexity distributions while varying individual hyperparameter values to understand the effect of the set of hyper parameter values selected on the model perplexity." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-43", "text": "----------------------------------" }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-44", "text": "**A. DATASETS**" }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-45", "text": "Two sources of data are collected using the Twitter streaming API with a geolocation filter set to geo-cordinates for Kenya and Nigeria." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-46", "text": "The resulting data is code-mixed with the Kenya corpus (Dataset 1) containing several mixes of English and Swahili both official languages in Kenya." 
}, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-47", "text": "The Nigeria data (Dataset 2) on the other hand, does not predominantly contain mixes of English with another language in the same sentence." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-48", "text": "Rather, English is simply often completety rewritten in a pidgin form." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-49", "text": "The training data for Kenya and Nigeria contains 13,185 words and 27,855 words respectively." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-50", "text": "All tweets are stripped of mentions and hashtags as well as converted to lower-case." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-51", "text": "The phenomenon of code-mixed language use is common in locales that are surrounded by others which speak different languages or locales with a large number of immigrants." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-52", "text": "In Kenya and Nigeria as such, the use of English is influenced by the presence of one or more local languages and this is evident in the corpus." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-53", "text": "----------------------------------" }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-54", "text": "**B. MODEL HYPER PARAMETERS**" }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-55", "text": "We considered 11 hyper parameters for tuning including the size of the word embedding (emsize), the number of hidden units in each LSTM layer (nhid), the number of LSTM layers (nlayers), the initial learning rate of the optimizer (lr), the maximum norm for gradient clipping (clip), the backpropagation through time sequence length (bptt), dropoutapplied to the layers (dropout), weight dropout applied to the LSTM hidden to hidden matrix (wdrop), the input embedding layer dropout (dropouti), dropout for the LSTM layer (dropouth), and dropout to remove words from embedding layer (dropoute)." 
}, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-56", "text": "Table I contains the default values of the individual hyper parameters." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-57", "text": "All experiments involved training for 100 epochs inline with available GPU resources." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-58", "text": "The training criteron was the cross-entropy loss which is the average negative log-likelihood of predicting the right next word by the LM." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-59", "text": "It took approximately two hours wall clock time to train the model for each hyper parameter configuration." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-60", "text": "----------------------------------" }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-61", "text": "**C. SEQUENTIAL SEARCH**" }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-62", "text": "The search process begins by setting the values of each hyper parameter (configuration) to known best values (see Table I )." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-63", "text": "We then iteratively search for the best value for each hyper parameter." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-64", "text": "The order used in this search is defined in the rows of Table II ." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-65", "text": "Performance is evaluated based on the text perplexity for the modeling task." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-66", "text": "Once the best perplexity is identified from the list of possible values for the given hyper parameter, it is fixed and the space of the next hyper parameter in the sequence is searched." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-67", "text": "In this manner the the configuration space of the model is explored." 
}, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-68", "text": "This approach shares similarities with the method applied in [14] , though it remains an open question what the impact of the sequence is on the quality of best configuration produced." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-69", "text": "For the context of this work, our aim is not to find the best configuration." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-70", "text": "Instead it is to better understand the configuration space defined by these hyper parameters to determine the impact of their values on the performance when considering a code-mixed corpora." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-71", "text": "----------------------------------" }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-72", "text": "**D. POPULATION BASED SEARCH**" }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-73", "text": "We apply a genetic algorithm (GA) to provide a complementary approach to the sequential search for the exploration of hyper parameter configurations." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-74", "text": "The GA is a biologically inspired population based search technique presented by [15] ." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-75", "text": "The algorithm is a meta-heuristic inspired by biological natural selection." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-76", "text": "The test perplexity for a particular hyper parameter configuration is the measure of its fitness." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-77", "text": "Given an evaluated population i.e a set of hyper parameter configurations, we derive the next generation of the population by first selecting the candidate configurations using roulette wheel selection [16] ." 
}, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-78", "text": "This biases the selection of good configuration to pass their \"genetic material\" to the subsequent generation." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-79", "text": "The probability of selection of the jth configuration in a generation, p j , is defined in (1) where f j is its fitness which is a function of the test perplexity." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-80", "text": "Each hyper parameter configuration in the subsequent generation is derived from two parent configurations selected via this approach." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-81", "text": "Mimicking biological crossover chromosomes [15] , the two configurations selected are mixed, and one of the resulting configurations are selected at random." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-82", "text": "Finally, a random subset of the components of each derived configuration is perturbed by adding noise." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-83", "text": "This sequence of processes define how configurations from a current generation are used to derive the next generation." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-84", "text": "----------------------------------" }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-85", "text": "**E. META-LEARNING INITIALIZATION**" }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-86", "text": "Both the population based and sequential search space were manually initialized with four (4) values of each hyperparameter in the neighbourhood of the best values reported in [4] as shown in Table I ." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-87", "text": "It is important to note that the sampled configuration space is very small compared to the overall space which is of size 4" }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-88", "text": "11 . 
84 samples for each dataset constitute the set of sampled configurations which is a far cry from the size of the universal set." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-89", "text": "----------------------------------" }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-90", "text": "**IV. RESULTS**" }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-91", "text": "We use the term default value when refering to an individual hyper parameter value that makes up the configuration with the best result for the AWD-LSTM model as reported in [4] ." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-92", "text": "We evaluate the sensitivity of the hyper parameters by observing the test perplexity distribution comparing it with the default values." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-93", "text": "----------------------------------" }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-94", "text": "**1) BETTER:**" }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-95", "text": "When considering each of these hyper parameters, it's possible to identify that the default value is correlated with a statistically better performance." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-96", "text": "This is the case for dropouti, dropoute, dropouth, clip, wdrop, and lr." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-97", "text": "----------------------------------" }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-98", "text": "**2) NOT BETTER BUT NOT WORSE! (THOUGH GENERALLY THE BEST):**" }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-99", "text": "When considering each of these hyper parameters, it's possible to identify that the default value is not correlated with a statistically better performance but also not correlated with statistically worse performance." 
}, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-100", "text": "However the default values are closely in the neighbourhood of the best hyper parameter values for both datasets." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-101", "text": "This is the case for dropout and bptt as shown in 1 and 2 respectively." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-102", "text": "----------------------------------" }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-103", "text": "**3) NOT BETTER BUT NOT WORSE! (THOUGH GENERALLY NOT THE BEST):**" }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-104", "text": "When considering each of these hyper parameters, it's possible to identify that the default value is not correlated with a statistically better performance but also not correlated with The data from varying the number of LSTM layers suggests that shallow models yields lower perplexity." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-105", "text": "And this is consistent across both datasets and supported by [17] ." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-106", "text": "The number of hidden units indicates a bound on the number of nonlinear transformations in the network with an increase leading to an increase in the number of calculations between inputs and corresponding outputs." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-107", "text": "A higher number of hidden units is expected to improve the model performance. What is observed however on both datasets is the lowest value of hidden units yielding the best result." 
}, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-108", "text": "----------------------------------" }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-109", "text": "**4) WORSE:**" }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-110", "text": "The emsize is observed to be the only hyperparameter for which the default values is correlated with statistically significant worse performance on both Datasets 1 and 2 as shown in Figure 5 ." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-111", "text": "Encoding context, morphology, relationships between words through training the word embedding is directly tied to the vocabulary." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-112", "text": "As there are varying degree of similarity between words across corpora the embedding size is a hyperparameter that is expected to be sensitive to the dataset." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-113", "text": "Every corpus has varying level of semantic and syntatic context that needs to be encoded as features that affects the NLM." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-114", "text": "Thus, the sensitivity of the embedding size hyper-parameter is not overall surprising." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-115", "text": "We present a comparison of the test perplexity using the default values of the AWD-LSTM model with the best test perplexity from the both sequential and population based search in Table III." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-116", "text": "V. CONCLUSION In this work we set out to characterize the space of hyper parameter values for a neural language model trained to per- form the task of language modeling." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-117", "text": "The performance of such models is sensitive to the selection of hyper parameters which define their operation." 
}, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-118", "text": "Our work applied language modeling to the domain of code-mixed text, and we found that although the published hyper parameters the performance of the state of the art architecture, the AWD-LSTM model, were largely good, they did not define the best combination of hyper parameters for the task." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-119", "text": "Our hyper parameter searches uncovered that the AWD-LSTM model is not generally sensitive to novel datasets." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-120", "text": "Specifically, the size of the word embedding, and the number of hidden units in each LSTM layer are observed to be the only two hyper parameters from the set of 11 evaluated hyper parameters that differ from the published work." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-121", "text": "This work thus can serve as a solid baseline model derivation of better sets of hyper parameters for this type of data." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-122", "text": "Of particular interest to us is the performance of the AWD-LSTM model which is the current state of the art on a codemixed corpus." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-123", "text": "The perplexity values are a far cry from what is generally known to be 'good' perplexity of NLMs." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-124", "text": "As such, while the AWD-LSTM model shows promising results on benchmark datasets, evaluating on a code-mixed corpus with hyper parameter values found to be the best on such benchmark datasets, as well as values in the same neighbourhood, results in unacceptable and impractical perplexity values for a NLM." }, { "sent_id": "9a52e0ea1f12e3455fca48ac8f8936-C001-125", "text": "We hope to explore the various strategies for developing a better language model of the datasets introduced in this work." 
} ], "y": { "@USE@": { "gold_contexts": [ [ "9a52e0ea1f12e3455fca48ac8f8936-C001-18" ], [ "9a52e0ea1f12e3455fca48ac8f8936-C001-36" ] ], "cite_sentences": [ "9a52e0ea1f12e3455fca48ac8f8936-C001-18", "9a52e0ea1f12e3455fca48ac8f8936-C001-36" ] }, "@UNSURE@": { "gold_contexts": [ [ "9a52e0ea1f12e3455fca48ac8f8936-C001-19" ], [ "9a52e0ea1f12e3455fca48ac8f8936-C001-91" ] ], "cite_sentences": [ "9a52e0ea1f12e3455fca48ac8f8936-C001-19", "9a52e0ea1f12e3455fca48ac8f8936-C001-91" ] }, "@BACK@": { "gold_contexts": [ [ "9a52e0ea1f12e3455fca48ac8f8936-C001-30" ], [ "9a52e0ea1f12e3455fca48ac8f8936-C001-41" ] ], "cite_sentences": [ "9a52e0ea1f12e3455fca48ac8f8936-C001-30", "9a52e0ea1f12e3455fca48ac8f8936-C001-41" ] }, "@SIM@": { "gold_contexts": [ [ "9a52e0ea1f12e3455fca48ac8f8936-C001-86" ] ], "cite_sentences": [ "9a52e0ea1f12e3455fca48ac8f8936-C001-86" ] } } }, "ABC_ffcefdc73338187d4a6b2dc2f0bb47_28": { "x": [ { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-85", "text": "We unk words that appear at most once in the training (21,755 types)." }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-114", "text": "Table 3 : Evaluation of models trained on the WSJ and additional resources." }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-82", "text": "We address this problem in Section 5.3." }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-83", "text": "----------------------------------" }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-84", "text": "**SEMI-SUPERVISION**" }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-24", "text": "----------------------------------" }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-25", "text": "**PREVIOUS WORK**" }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-26", "text": "We look here at three neural net (NN) models closest to our research along various dimensions." 
}, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-2", "text": "We recast syntactic parsing as a language modeling problem and use recent advances in neural network language modeling to achieve a new state of the art for constituency Penn Treebank parsing -93.8 F 1 on section 23, using 2-21 as training, 24 as development, plus tri-training." }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-3", "text": "When trees are converted to Stanford dependencies, UAS and LAS are 95.9% and 94.1%." }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-4", "text": "----------------------------------" }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-5", "text": "**INTRODUCTION**" }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-6", "text": "Recent work on deep learning syntactic parsing models has achieved notably good results, e.g., Dyer et al. (2016) with 92.4 F 1 on Penn Treebank constituency parsing and Vinyals et al. (2015) with 92.8 F 1 ." }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-7", "text": "In this paper we borrow from the approaches of both of these works and present a neural-net parse reranker that achieves very good results, 93.8 F 1 , with a comparatively simple architecture." }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-8", "text": "In the remainder of this section we outline the major difference between this and previous workviewing parsing as a language modeling problem." }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-9", "text": "Section 2 looks more closely at three of the most relevant previous papers." }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-10", "text": "We then describe our exact model (Section 3), followed by the experimental setup and results (Sections 4 and 5)." }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-11", "text": "There is a one-to-one mapping between a tree and its sequential form." 
}, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-12", "text": "(Part-of-speech tags are not used.)" }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-13", "text": "----------------------------------" }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-14", "text": "**LANGUAGE MODELING**" }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-15", "text": "Formally, a language model (LM) is a probability distribution over strings of a language:" }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-16", "text": "where x is a sentence and t indicates a word position." }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-17", "text": "The efforts in language modeling go into computing P (x t |x 1 , \u00b7 \u00b7 \u00b7 , x t\u22121 ), which as described next is useful for parsing as well." }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-18", "text": "----------------------------------" }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-19", "text": "**PARSING AS LANGUAGE MODELING**" }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-20", "text": "A generative parsing model parses a sentence (x) into its phrasal structure (y) according to" }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-21", "text": "where Y(x) lists all possible structures of x. If we think of a tree (x, y) as a sequence (z) (Vinyals et al., 2015) as illustrated in Figure 1 , we can define a probability distribution over (x, y) as follows:" }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-22", "text": "which is equivalent to Equation (1)." }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-23", "text": "We have reduced parsing to language modeling and can use language modeling techniques of estimating P (z t |z 1 , \u00b7 \u00b7 \u00b7 , z t\u22121 ) for parsing." 
}, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-27", "text": "The first (Zaremba et al., 2014) gives the basic language modeling architecture that we have adopted, while the other two (Vinyals et al., 2015; Dyer et al., 2016) are parsing models that have the current best results in NN parsing." }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-28", "text": "----------------------------------" }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-29", "text": "**LSTM-LM**" }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-30", "text": "The LSTM-LM of Zaremba et al. (2014) turns (x 1 , \u00b7 \u00b7 \u00b7 , x t\u22121 ) into h t , a hidden state of an LSTM (Hochreiter and Schmidhuber, 1997; Gers et al., 2003; Graves, 2013) , and uses h t to guess x t :" }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-31", "text": "where W is a parameter matrix and [i] indexes ith element of a vector." }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-32", "text": "The simplicity of the model makes it easily extendable and scalable, which has inspired a character-based LSTM-LM that works well for many languages (Kim et al., 2016) and an ensemble of large LSTM-LMs for English with astonishing perplexity of 23.7 (Jozefowicz et al., 2016) ." }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-33", "text": "In this paper, we build a parsing model based on the LSTM-LM of Zaremba et al. (2014) ." }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-34", "text": "Vinyals et al. 
(2015) observe that a phrasal structure (y) can be expressed as a sequence and build a machine translation parser (MTP), a sequence-tosequence model, which translates x into y using a conditional probability:" }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-35", "text": "----------------------------------" }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-36", "text": "**MTP**" }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-37", "text": "where the conditioning event (x, y 1 , \u00b7 \u00b7 \u00b7 , y t\u22121 ) is modeled by an LSTM encoder and an LSTM decoder." }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-38", "text": "The encoder maps x into h e , a set of vectors that represents x, and the decoder obtains a summary vector (h t ) which is concatenation of the decoder's hidden state (h d t ) and weighted sum of word representations (" }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-39", "text": "with an alignment vector (\u03b1)." }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-40", "text": "Finally the decoder predicts y t given h t ." }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-41", "text": "Inspired by MTP, our model processes sequential trees." }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-42", "text": "----------------------------------" }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-43", "text": "**RNNG**" }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-44", "text": "Recurrent Neural Network Grammars (RNNG), a generative parsing model, defines a joint distribution over a tree in terms of actions the model takes to generate the tree (Dyer et al., 2016) :" }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-45", "text": "where a is a sequence of actions whose output precisely matches the sequence of symbols in z, which implies Equation (3) is the same as Equation (2)." 
}, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-46", "text": "RNNG and our model differ in how they compute the conditioning event (z 1 , \u00b7 \u00b7 \u00b7 , z t\u22121 ): RNNG combines hidden states of three LSTMs that keep track of actions the model has taken, an incomplete tree the model has generated and words the model has generated whereas our model uses one LSTM's hidden state as shown in the next section." }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-47", "text": "----------------------------------" }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-48", "text": "**MODEL**" }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-49", "text": "Our model, the model of Zaremba et al. (2014) applied to sequential trees and we call LSTM-LM from now on, is a joint distribution over trees:" }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-50", "text": "where h t is a hidden state of an LSTM." }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-51", "text": "Due to lack of an algorithm that searches through an exponentially large phrase-structure space, we use an n-best parser to reduce Y(x) to Y (x), whose size is polynomial, and use LSTM-LM to find y that satisfies" }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-52", "text": "----------------------------------" }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-53", "text": "**HYPER-PARAMETERS**" }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-54", "text": "The model has three LSTM layers with 1,500 units and gets trained with truncated backpropagation through time with mini-batch size 20 and step size 50." }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-55", "text": "We initialize starting states with previous minibatch's last hidden states (Sutskever, 2013) ." }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-56", "text": "The forget gate bias is initialized to be one (Jozefowicz et al., 2015) and the rest of model parameters are sampled from U(\u22120.05, 0.05)." 
}, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-57", "text": "Dropout is applied to non-recurrent connections (Pham et al., 2014) and gradients are clipped when their norm is bigger than 20 (Pascanu et al., 2013) ." }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-58", "text": "The learning rate is 0.25 \u00b7 0.85 max ( \u221215, 0) where is an epoch number." }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-59", "text": "For simplicity, we use vanilla softmax over an entire vocabulary as opposed to hierarchical softmax (Morin and Bengio, 2005) or noise contrastive estimation (Gutmann and Hyv\u00e4rinen, 2012)." }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-60", "text": "----------------------------------" }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-61", "text": "**EXPERIMENTS**" }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-62", "text": "We describe datasets we use for evaluation, detail training and development processes." }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-63", "text": "1" }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-64", "text": "----------------------------------" }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-65", "text": "**DATA**" }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-66", "text": "We use the Wall Street Journal (WSJ) of the Penn Treebank (Marcus et al., 1993) for training (2-21), development (24) and testing (23) and millions of auto-parsed \"silver\" trees (McClosky et al., 2006; Huang et al., 2010; Vinyals et al., 2015) for tritraining." }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-67", "text": "To obtain silver trees, we parse the entire section of the New York Times (NYT) of the fifth Gigaword (Parker et al., 2011 ) with a product of eight Berkeley parsers (Petrov, 2010) 2 and ZPar (Zhu et al., 2013) and select 24 million trees on which both parsers agree (Li et al., 2014) ." 
}, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-68", "text": "We do not resample trees to match the sentence length distribution of the NYT to that of the WSJ (Vinyals et 1 The code and trained models used for experiments are available at github.com/cdg720/emnlp2016." }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-69", "text": "2 We use the reimplementation by Huang et al. (2010) ." }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-70", "text": "(Charniak, 2000) performed better when trained on all of 24 million trees than when trained on resampled two million trees." }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-71", "text": "Given x, we produce Y (x), 50-best trees, with Charniak parser and find y with LSTM-LM as Dyer et al. (2016) do with their discriminative and generative models." }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-72", "text": "3" }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-73", "text": "----------------------------------" }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-74", "text": "**TRAINING AND DEVELOPMENT**" }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-75", "text": "----------------------------------" }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-76", "text": "**SUPERVISION**" }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-77", "text": "We unk words that appear fewer than 10 times in the WSJ training (6,922 types) and drop activations with probability 0.7." }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-78", "text": "At the beginning of each epoch, we shuffle the order of trees in the training data." }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-79", "text": "Both perplexity and F 1 of LSTM-LM (G) improve and then plateau (Figure 2) ." }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-80", "text": "Perplexity, the Base Final Vinyals et al. (2015) 88.3 90.5 Dyer et al. 
(2016) 89.8 92.4 LSTM-LM (G) 89.7 92.6 We also evaluate our model with varying n-best trees including optimal 51-best trees that contain gold trees (51 o )." }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-81", "text": "As shown in Table 1, the LSTM-LM (G) is robust given sufficiently large n, i.e. 50, but does not exhibit its full capacity because of search errors in Charniak parser." }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-86", "text": "We drop activations with probability 0.45, smaller than 0.7, thanks to many silver trees, which help regularization." }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-87", "text": "We train LSTM-LM (GS) on the WSJ and a different set of 400,000 NYT trees for each epoch except for the last one during which we use the WSJ only." }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-88", "text": "Training takes 26 epochs and 68 hours on a Titan X. LSTM-LM (GS) achieves 92.5 F 1 on the development." }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-89", "text": "----------------------------------" }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-90", "text": "**RESULTS**" }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-91", "text": "----------------------------------" }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-92", "text": "**SUPERVISION**" }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-93", "text": "As shown in Table 2 , with 92.6 F 1 LSTM-LM (G) outperforms an ensemble of five MTPs (Vinyals et al., 2015) and RNNG (Dyer et al., 2016) , both of which are trained on the WSJ only." 
}, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-94", "text": "----------------------------------" }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-95", "text": "**SEMI-SUPERVISION**" }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-96", "text": "We compare LSTM-LM (GS) to two very strong semi-supervised NN parsers: an ensemble of five MTPs trained on 11 million trees of the highconfidence corpus 4 (HC) (Vinyals et al., 2015) ; and an ensemble of six one-to-many sequence models trained on the HC and 4.5 millions of EnglishGerman translation sentence pairs (Luong et al., 2016) ." }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-97", "text": "We also compare LSTM-LM (GS) to best performing non-NN parsers in the literature." }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-98", "text": "Parsers' parsing performance along with their training data is reported in Table 3 ." }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-99", "text": "LSTM-LM (GS) outperforms all the other parsers with 93.1 F 1 ." }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-100", "text": "----------------------------------" }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-101", "text": "**IMPROVED SEMI-SUPERVISION**" }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-102", "text": "Due to search errors -good trees are missing in 50-best trees -in Charniak (G), our supervised and semi-supervised models do not exhibit their full potentials when Charniak (G) provides Y (x)." }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-103", "text": "To mitigate the search problem, we tri-train Charniak (GS) on all of 24 million NYT trees in addition to the WSJ, to yield Y (x)." }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-104", "text": "As shown in Table 3 , both LSTM-LM (G) and LSTM-LM (GS) are affected by the quality of Y (x)." 
}, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-105", "text": "A single LSTM-LM (GS) together with Charniak (GS) reaches 93.6 and an ensemble of eight LSTM-LMs (GS) with Charniak (GS) achieves a new state of the art, 93.8 F 1 ." }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-106", "text": "When trees are converted to Stanford dependencies, 5 UAS and LAS are 95.9% and 94.1%, 6 more than 1% higher than those of the state of the art dependency parser (Andor et al., 2016) ." }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-107", "text": "Why an indirect method (converting trees to dependencies) is more accurate than a direct one (dependency parsing) remains unanswered (Kong and Smith, 2014) ." }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-108", "text": "----------------------------------" }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-109", "text": "**CONCLUSION**" }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-110", "text": "The generative parsing model we presented in this paper is very powerful." }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-111", "text": "In fact, we see that a generative parsing model, LSTM-LM, is more effective than discriminative parsing models (Dyer et al., 2016) ." }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-112", "text": "We suspect building large models with character embeddings would lead to further improvement as in language modeling (Kim et al., 2016; Jozefowicz et al., 2016) ." }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-113", "text": "We also wish to develop a complete parsing model using the LSTM-LM framework." }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-115", "text": "Note that the numbers of Vinyals et al. (2015) and Luong et al. (2016) are not directly comparable as their models are evaluated on OntoNotesstyle trees instead of PTB-style trees." 
}, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-116", "text": "E(LSTM-LMs (GS)) is an ensemble of eight LSTM-LMs (GS)." }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-117", "text": "X/Y in Silver column indicates the number of silver trees used to train Charniak parser and LSTM-LM." }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-118", "text": "For the ensemble model, we report the maximum number of trees used to train one of LSTM-LMs (GS)." }, { "sent_id": "ffcefdc73338187d4a6b2dc2f0bb47-C001-119", "text": "at Brown University for setting up GPU machines and David McClosky for helping us train Charniak parser on millions trees." } ], "y": { "@BACK@": { "gold_contexts": [ [ "ffcefdc73338187d4a6b2dc2f0bb47-C001-6" ], [ "ffcefdc73338187d4a6b2dc2f0bb47-C001-27" ], [ "ffcefdc73338187d4a6b2dc2f0bb47-C001-44" ], [ "ffcefdc73338187d4a6b2dc2f0bb47-C001-93" ] ], "cite_sentences": [ "ffcefdc73338187d4a6b2dc2f0bb47-C001-6", "ffcefdc73338187d4a6b2dc2f0bb47-C001-27", "ffcefdc73338187d4a6b2dc2f0bb47-C001-44", "ffcefdc73338187d4a6b2dc2f0bb47-C001-93" ] }, "@SIM@": { "gold_contexts": [ [ "ffcefdc73338187d4a6b2dc2f0bb47-C001-71" ] ], "cite_sentences": [ "ffcefdc73338187d4a6b2dc2f0bb47-C001-71" ] }, "@DIF@": { "gold_contexts": [ [ "ffcefdc73338187d4a6b2dc2f0bb47-C001-111" ] ], "cite_sentences": [ "ffcefdc73338187d4a6b2dc2f0bb47-C001-111" ] } } }, "ABC_fe0f9312caccf41def06e4311d15fb_28": { "x": [ { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-2", "text": "Induction of common sense knowledge about prototypical sequences of events has recently received much attention (e.g., (Chambers & Jurafsky, 2008; Regneri et al., 2010) )." 
}, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-3", "text": "Instead of inducing this knowledge in the form of graphs, as in much of the previous work, in our method, distributed representations of event realizations are computed based on distributed representations of predicates and their arguments, and then these representations are used to predict prototypical event orderings." }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-4", "text": "The parameters of the compositional process for computing the event representations and the ranking component of the model are jointly estimated from texts." }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-5", "text": "We show that this approach results in a substantial boost in ordering performance with respect to previous methods." }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-6", "text": "----------------------------------" }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-8", "text": "It is generally believed that natural language understanding systems would benefit from incorporating common-sense knowledge about prototypical sequences of events and their participants." }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-9", "text": "Early work focused on structured representations of this knowledge (called scripts (Schank & Abelson, 1977) ) and manual construction of script knowledge bases." }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-10", "text": "However, these approaches do not scale to complex domains (Mueller, 1998; Gordon, 2001 )." }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-11", "text": "More recently, automatic induction of script knowledge from text have started to attract attention: these methods exploit either natural texts (Chambers & Jurafsky, 2008; 2009) or crowdsourced data (Regneri et al., 2010) , and, consequently, do not require expensive expert annotation." 
}, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-12", "text": "Given a text corpus, they extract structured representations (i.e. graphs), for example chains (Chambers & Jurafsky, 2008) or more gen- eral directed acyclic graphs (Regneri et al., 2010) ." }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-13", "text": "These graphs are scenario-specific, nodes in them correspond to events (and associated with sets of potential event mentions) and arcs encode the temporal precedence relation." }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-14", "text": "These graphs can then be used to inform NLP applications (e.g., question answering) by providing information whether one event is likely to precede or succeed another." }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-15", "text": "In this work we advocate constructing a statistical model which is capable of \"answering\" at least some of the questions these graphs can be used to answer, but doing this without explicitly representing the knowledge as a graph." }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-16", "text": "In our method, the distributed representations (i.e. vectors of real numbers) of event realizations are computed based on distributed representations of predicates and their arguments, and then the event representations are used in a ranker to predict the expected ordering of events." }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-17", "text": "Both the parameters of the compositional process for computing the event representation and the ranking component of the model are estimated from data." }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-18", "text": "In order to get an intuition why the embedding approach may be attractive, consider a situation where a prototypical ordering of events the bus disembarked passengers and the bus drove away needs to be predicted." 
}, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-19", "text": "An approach based on frequency of predicate pairs (Chambers & Jurafsky, 2008) , is unlikely to make a right prediction as driving usually precedes disembarking." }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-20", "text": "Similarly, an approach which treats the whole predicate-argument structure as an atomic unit (Regneri et al., 2010) will probably fail as well, as such a sparse model is unlikely to be effectively learnable even from large amounts of data." }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-21", "text": "However, our embedding method would be expected to capture relevant features of the verb frames, namely, the transitive use for the predicate disembark and the effect of the particle away, and these features will then be used by the ranking component to make the correct prediction." }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-22", "text": "In previous work on learning inference rules (Berant et al., 2011) , it has been shown that enforcing transitivity constraints on the inference rules results in significantly improved performance." }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-23", "text": "The same is true for the event order- ing task, as scripts have largely linear structure, and observing that a \u227a b and b \u227a c is likely to imply a \u227a c. Interestingly, in our approach we implicitly learn the model which satisfies transitivity constraints, without the need for any explicit global optimization on a graph." }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-24", "text": "The approach is evaluated on crowdsourced dataset of Regneri et al. (2010) and we demonstrate that using our model results in the 13.5% absolute improvement in F 1 on event ordering with respect to their graph induction method (84% vs. 71%)." 
}, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-25", "text": "----------------------------------" }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-26", "text": "**MODEL**" }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-27", "text": "In this section we describe the model we use for computing event representations as well as the ranking component of our model." }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-28", "text": "----------------------------------" }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-29", "text": "**EVENT REPRESENTATION**" }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-30", "text": "Learning and exploiting distributed word representations (i.e. vectors of real values, also known as embeddings) have been shown to be beneficial in many NLP applications (Bengio et al., 2001; Turian et al., 2010; Collobert et al., 2011) ." }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-31", "text": "These representations encode semantic and syntactic properties of a word, and are normally learned in the language modeling setting (i.e. learned to be predictive of local word context), though they can also be specialized by learning in the context of other NLP applications such as PoS tagging or semantic role labeling (Collobert et al., 2011) ." }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-32", "text": "More recently, the area of distributional compositional semantics have started to emerge (Baroni & Zamparelli, 2011; Socher et al., 2012) , they focus on inducing representations of phrases by learning a compositional model." }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-33", "text": "Such a model would compute a representation of a phrase by starting with embeddings of individual words in the phrase, often this composition process is recursive and guided by some form of syntactic structure." 
}, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-34", "text": "----------------------------------" }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-35", "text": "**ALGORITHM 1 LEARNING ALGORITHM**" }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-36", "text": "Notation w : ranking weight vector E k : k th sequence of events in temporal order t k : array of model scores for events in E k \u03b3 : fixed global margin for ranking" }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-37", "text": "In our work, we use a simple compositional model for representing semantics of a verb frame (i.e. the predicate and its arguments)." }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-38", "text": "The model is shown in Figure 1 ." }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-39", "text": "Each word w i in the vocabulary is mapped to a real vector based on the corresponding lemma (the embedding function C)." }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-40", "text": "The hidden layer is computed by summing linearly transformed predicate and argument embeddings and passing it through the logistic sigmoid function." }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-41", "text": "1 We use different transformation matrices for arguments and predicates, T and R, respectively." }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-42", "text": "The event representation x is then obtained by applying another linear transform (matrix A) followed by another application of the sigmoid function." }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-43", "text": "These event representations are learned in the context of 1 Only syntactic heads of arguments are used in this work." }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-44", "text": "If an argument is a coffee maker, we will use only the word maker." 
}, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-45", "text": "event ranking: the transformation parameters as well as representations of words are forced to be predictive of the temporal order of events." }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-46", "text": "However, one important characteristic of neural network embeddings is that they can be induced in a multitasking scenario, and consequently can be learned to be predictive of different types of contexts providing a general framework for inducing different aspects of (semantic) properties of events, as well as exploiting the same representations in different applications." }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-47", "text": "----------------------------------" }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-48", "text": "**LEARNING TO ORDER**" }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-49", "text": "The task of learning stereotyped order of events naturally corresponds to the standard ranking setting." }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-50", "text": "Here, we assume that we are provided with sequences of events, and our goal is to capture this order." }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-51", "text": "We discuss how we obtain this learning material in the next section." }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-52", "text": "We learn a linear ranker (characterized by a vector w) which takes an event representation and returns a ranking score." }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-53", "text": "Events are then ordered according to the score to yield the model prediction." }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-54", "text": "Note that during the learning stage we estimate not only w but also the event representation parameters, i.e. matrices T , R and A, and the word embedding C. 
Note that by casting the event ordering task as a global ranking problem we ensure that the model implicitly exploits transitivity of the temporal relation, a property which is crucial for successful learning from a finite amount of data, as we argued in the introduction and will confirm in our experiments." }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-55", "text": "We use an online ranking algorithm based on the Perceptron Rank (PRank, (Crammer & Singer, 2001) ), or, more accurately, its large-margin extension." }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-56", "text": "One crucial difference, though, is that the error is computed not only with respect to w but also propagated back through the structure of the neural network." }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-57", "text": "The learning procedure is sketched in Algorithm 1." }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-58", "text": "Additionally, we use a Gaussian prior on the weights, regularizing both the embedding parameters and the vector w. We initialize word representations using the SENNA embeddings (Collobert et al., 2011) ." }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-59", "text": "----------------------------------" }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-60", "text": "**EXPERIMENTS**" }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-61", "text": "We evaluate our approach on crowdsourced data collected for script induction by Regneri et al. (2010) , though, in principle, the method is applicable in the arguably more general setting of Chambers & Jurafsky (2008) ." }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-62", "text": "----------------------------------" }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-63", "text": "**DATA AND TASK**" }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-64", "text": "Regneri et al.
(2010) collected short textual descriptions (called event sequence descriptions, ESDs) of various types of human activities (e.g., going to a restaurant, ironing clothes) using crowdsourcing (Amazon Mechanical Turk); this dataset was also complemented by descriptions provided in the OMICS corpus (Gupta & Kochenderfer, 2004) ." }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-65", "text": "The datasets are fairly small, containing 30 ESDs per activity type on average (we will refer to different activities as scenarios), but the collection can easily be extended given the low cost of crowdsourcing." }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-66", "text": "The ESDs are written in a bullet-point style and the annotators were asked to follow the temporal order in writing." }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-67", "text": "Consider an example ESD for the scenario prepare coffee :" }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-68", "text": "{go to coffee maker} \u2192 {fill water in coffee maker} \u2192 {place the filter in holder} \u2192 {place coffee in filter} \u2192 {place holder in coffee maker} \u2192 {turn on coffee maker}" }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-69", "text": "Though individual ESDs may seem simple, the learning task is challenging because of the limited amount of training data, variability in the vocabulary used, optionality of events (e.g., going to the coffee machine may not be mentioned in an ESD), different granularities of events and variability in the ordering (e.g., coffee may be put in a filter before placing it in a coffee maker)." }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-70", "text": "Unlike our work, Regneri et al. (2010) relies on WordNet to provide extra signal when using the Multiple Sequence Alignment (MSA) algorithm."
}, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-71", "text": "As in their work, each description was preprocessed to extract a predicate and heads of argument noun phrases to be used in the model." }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-72", "text": "The methods are evaluated on human annotated scenariospecific tests: the goal is to classify event pairs as appearing in a given stereotypical order or not (Regneri et al., 2010 )." }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-73", "text": "----------------------------------" }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-74", "text": "**3**" }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-75", "text": "The model was estimated as explained in Section 2.2 with the order of events in ESDs treated as gold standard." }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-76", "text": "We used 4 held-out scenarios to choose model parameters, no scenario-specific tuning was performed, and the 10 test scripts were not used to perform model selection." }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-77", "text": "When testing, we predicted that the event pair (e 1 ,e 2 ) is in the stereotypical order (e 1 \u227a e 2 ) if the ranking score for e 1 exceeded the ranking score for e 2" }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-78", "text": "----------------------------------" }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-79", "text": "**RESULTS AND DISCUSSION**" }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-80", "text": "In our experiments, we compared our event embedding model (EE) against three baseline systems (BL , MSA) and BSMSA is the system of Regneri et al. (2010) ." }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-81", "text": "BS is a a hierarchical Bayesian system of Frermann et al. (2014) ." 
}, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-82", "text": "BL chooses the order of events based on the preferred order of the corresponding verbs in the training set: (e 1 , e 2 ) is predicted to be in the stereotypical order if the number of times the corresponding verbs v 1 and v 2 appear in this order in the training ESDs exceeds the number of times they appear in the opposite order (not necessary at adjacent positions); a coin is tossed to break ties (or if v 1 and v 2 are the same verb)." }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-83", "text": "We also compare to the version of our model which uses only verbs (EE verbs )." }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-84", "text": "Note that EE verbs is conceptually very similar to BL, as it essentially induces an ordering over verbs." }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-85", "text": "However, this ordering can benefit from the implicit transitivity assumption used in EE verbs (and EE), as we discussed in the introduction." }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-86", "text": "The results are presented in Table 1 ." }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-87", "text": "The first observation is that the full model improves substantially over the baseline and the previous methods (MSA and BS) (13.5% and 6.5% improvement over MSA and BS respectively in F1), this improvement is largely due to an increase in the recall but the precision is not negatively affected." }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-88", "text": "We also observe a substantial improvement in all metrics from using transitivity, as seen by comparing the results of BL and EE verb (11.3% improvement in F1)." }, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-89", "text": "This simple approach already outperforms the pipelined MSA system." 
}, { "sent_id": "fe0f9312caccf41def06e4311d15fb-C001-90", "text": "These results seem to support our hypothesis in" } ], "y": { "@BACK@": { "gold_contexts": [ [ "fe0f9312caccf41def06e4311d15fb-C001-11" ], [ "fe0f9312caccf41def06e4311d15fb-C001-12" ], [ "fe0f9312caccf41def06e4311d15fb-C001-20" ], [ "fe0f9312caccf41def06e4311d15fb-C001-64" ], [ "fe0f9312caccf41def06e4311d15fb-C001-72" ] ], "cite_sentences": [ "fe0f9312caccf41def06e4311d15fb-C001-11", "fe0f9312caccf41def06e4311d15fb-C001-12", "fe0f9312caccf41def06e4311d15fb-C001-20", "fe0f9312caccf41def06e4311d15fb-C001-64", "fe0f9312caccf41def06e4311d15fb-C001-72" ] }, "@USE@": { "gold_contexts": [ [ "fe0f9312caccf41def06e4311d15fb-C001-24" ], [ "fe0f9312caccf41def06e4311d15fb-C001-61" ] ], "cite_sentences": [ "fe0f9312caccf41def06e4311d15fb-C001-24", "fe0f9312caccf41def06e4311d15fb-C001-61" ] }, "@DIF@": { "gold_contexts": [ [ "fe0f9312caccf41def06e4311d15fb-C001-70" ] ], "cite_sentences": [ "fe0f9312caccf41def06e4311d15fb-C001-70" ] }, "@UNSURE@": { "gold_contexts": [ [ "fe0f9312caccf41def06e4311d15fb-C001-80" ] ], "cite_sentences": [ "fe0f9312caccf41def06e4311d15fb-C001-80" ] } } }, "ABC_7d89a96743d9db5667d90cbd3ebd30_28": { "x": [ { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-2", "text": "Sentiment analysis is a common task in natural language processing that aims to detect polarity of a text document (typically a consumer review)." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-3", "text": "In the simplest settings, we discriminate only between positive and negative sentiment, turning the task into a standard binary classification problem." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-4", "text": "We compare several machine learning approaches to this problem, and combine them to achieve a new state of the art." 
}, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-5", "text": "We show how to use for this task the standard generative language models, which are slightly complementary to the state of the art techniques." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-6", "text": "We achieve strong results on a well-known dataset of IMDB movie reviews." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-7", "text": "Our results are easily reproducible, as we publish also the code needed to repeat the experiments." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-8", "text": "This should simplify further advance of the state of the art, as other researchers can combine their techniques with ours with little effort." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-9", "text": "----------------------------------" }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-11", "text": "Sentiment analysis is among the most popular, simple and useful tasks in natural language processing." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-12", "text": "It aims at predicting the attitude of text, typically a sentence or a review." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-13", "text": "For instance, movies or restaurant are often rated with a certain number of stars, which indicate the degree to which the reviewer was satisfied." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-14", "text": "This task is often considered as one of the simplest in NLP because basic machine learning techniques can yield strong baselines (Wang & Manning, 2012) , often beating much more intricate approaches (Socher et al., 2011) ." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-15", "text": "In the simplest settings, this task can be seen as a binary classification between positive and negative sentiment." 
}, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-16", "text": "However, there are several challenges towards achieving the best possible accuracy." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-17", "text": "It is not obvious how to represent variable length documents beyond simple bag of words approaches that lose word order information." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-18", "text": "One can use advanced machine learning techniques such as recurrent neural networks and their variations (Mikolov et al., 2010; Socher et al., 2011) , however it is not clear if these provide any significant gain over simple bag-of-words and bag-of-ngram techniques (Pang & Lee, 2008; Wang & Manning, 2012) ." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-19", "text": "In this work, we compared several different approaches and realized, without much surprise, that model combination performs better than any individual technique." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-20", "text": "The ensemble best benefits from models that are complementary, thus having diverse set of techniques is desirable." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-21", "text": "The vast majority of models proposed in the literature are discriminative in nature, as their parameters are tuned for the classification task directly." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-22", "text": "In this work, we boost the performance of the ensemble by considering a generative language model." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-23", "text": "To this end, we train two language models, one on the positive reviews and one on the negative ones, and use the likelihood ratio of these two models evaluated on the test data as an additional feature." 
}, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-24", "text": "For example, we assume that a positive review will have higher likelihood to be generated by a model that was trained on a large set of positive reviews, and lower likelihood given the negative model." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-25", "text": "In this paper, we constrained our work to binary classification where we trained two generative models, positive and negative." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-26", "text": "One could consider a higher number of classes since this approach scales linearily with the number of models to be train, i.e. one for each class." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-27", "text": "The large pool of diverse models is a) simple to implement (in line with previous work by Wang and Manning (Wang & Manning, 2012) ) and b) it yields state of the art performance on one of the largest publicly available benchmarks of movie reviews, the Stanford IMDB dataset of reviews." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-28", "text": "Code to reproduce our experiments is available at https://github.com/mesnilgr/iclr15." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-29", "text": "----------------------------------" }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-30", "text": "**DESCRIPTION OF THE MODELS**" }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-31", "text": "In this section we describe in detail the approaches we considered in our study." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-32", "text": "The novelty of this paper consists in combining both generative and discriminative models together for sentiment prediciton." 
}, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-33", "text": "----------------------------------" }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-34", "text": "**GENERATIVE MODEL**" }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-35", "text": "A generative model defines a distribution over the input." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-36", "text": "By training a generative model for each class, we can then use Bayes rule to predict which class a test sample belongs to." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-37", "text": "More formally, given a dataset of pairs {x" }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-38", "text": "is the i-th document in the training set, y (i) \u2208 {\u22121, +1} is the corresponding label and N is the number of training samples, we train two models: p + (x|y = +1) for {x (i) subject to y (i) = +1} and p \u2212 (x|y = \u22121) for {x subject to y = \u22121}. Then, given an input x at test time we compute the ratio (derived from Bayes rule):" }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-39", "text": ", then x is assigned to the positive class, otherwise to the negative class." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-40", "text": "We have a few different choices of distribution we can choose from." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-41", "text": "The most common one is the n-gram, a count-based non-parametric method to compute p(x" }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-42", "text": "k is the k-th word in the i-th document." 
}, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-43", "text": "In order to compute the likelihood of a document, we use the Markov assumption and simply multiply the n-gram probabilities over all words in the document: p(" }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-44", "text": "As mentioned before, we train one n-gram language model using the positive documents and one model using the negative ones." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-45", "text": "In our experiments, we used SRILM toolkit (Stolcke et al., 2002) to train the n-gram language models using modified Kneser-Ney smoothing (Kneser & Ney, 1995) ." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-46", "text": "Furthermore, as both language models are trained on different datasets, there is a mismatch between vocabularies: some words can appear only in one of the training sets." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-47", "text": "This can be a problem during scoring, as the test data contain novel words that were not seen in at least one of the training datasets." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-48", "text": "To avoid this problem, it is needed to add penalty during scoring for each out of vocabulary word." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-49", "text": "N-grams are a very simple data-driven way to build language models." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-50", "text": "However, they suffer from both data sparsity and large memory requirement." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-51", "text": "Since the number of word combinations grows exponentially with the length of the context, there is always little data to accurately estimate probabilities for higher order n-grams." 
}, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-52", "text": "In contrast with N-grams languages models, Recurrent neural networks (RNNs) (Mikolov et al., 2010) are parametric models that can address these issues." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-53", "text": "The inner architecture of the RNNs gives them potentially infinite context window, allowing them to perform smoother predictions." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-54", "text": "We know that in practice, the context window is limited due to exploding and vanishing gradients (Pascanu et al., 2012) ." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-55", "text": "Still, RNNs outperform significantly n-grams and are the state of the art for statistical language modeling." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-56", "text": "A review of these techniques is beyond the scope of this short paper and we point the reader to (Mikolov, 2012) for a more in depth discussion on this topic." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-57", "text": "Both when using n-grams and RNNs, we compute the probability of the test document belonging to the positive and negative class via Bayes' rule." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-58", "text": "These scores are then averaged in the ensemble with other models, as explained in Section 2.4." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-59", "text": "----------------------------------" }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-60", "text": "**LINEAR CLASSIFICATION OF WEIGHTED N-GRAM FEATURES**" }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-61", "text": "Among purely discriminative methods, the most popular choice is a linear classifier on top of a bagof-word representation of the document." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-62", "text": "The input representation is usually a tf-idf weighted word counts of the document." 
}, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-63", "text": "In order to preserve local ordering of the words, a better representation would consider also the position-independent n-gram counts of the document (bag-of-n-grams)." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-64", "text": "In our ensemble, we used a supervised reweighing of the counts as in the Naive Bayes Support Vector Machine (NB-SVM) approach (Wang & Manning, 2012) ." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-65", "text": "This approach computes a log-ratio vector between the average word counts extracted from positive documents and the average word counts extracted from negative documents." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-66", "text": "The input to the logistic regression classifier corresponds to the log-ratio vector multiplied by the binary pattern for each word in the document vector." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-67", "text": "Note that the logictic regression can be replaced by a linear SVM." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-68", "text": "Our implementation 1 slightly improved the performance reported in (Wang & Manning, 2012) by adding tri-grams (improvement of +0.6%), as shown in Table 1 ." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-69", "text": "----------------------------------" }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-70", "text": "**SENTENCE VECTORS**" }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-71", "text": "Recently, (Le & Mikolov, 2014) proposed an unsupervised method to learn distributed representations of words and paragraphs." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-72", "text": "The key idea is to learn a compact representation of a word or paragraph by predicting nearby words in a fixed context window." 
}, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-73", "text": "This captures co-occurence statistics and it learns embeddings of words and paragraphs that capture rich semantics." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-74", "text": "Synonym words and similar paragraphs often are surrounded by similar context, and therefore, they will be mapped into nearby feature vectors (and vice versa)." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-75", "text": "Such embeddings can then be used to represent a new document (for instance, by averaging the representations of the paragraphs that constitute the document) via a fixed size feature vector." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-76", "text": "The authors then use such a document descriptor as input to a one hidden layer neural network for sentiment discrimination." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-77", "text": "----------------------------------" }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-78", "text": "**MODEL ENSEMBLE**" }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-79", "text": "In this work, we combine the log probability scores of the above mentioned models via linear interpolation." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-80", "text": "More formally, we define the overall probability score as the weighted geometric mean of baseline models:" }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-81", "text": "We find the best setting of weights via brute force grid search, quantizing the coefficient values in the interval [0, 1] at increments of 0.1." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-82", "text": "The search is evaluated on a validation set to avoid overfitting." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-83", "text": "We do not focus on a smarter way to find the \u03b1 since we consider only 3 models in our approach and we consider it out of the scope of this paper." 
}, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-84", "text": "Using more models would make the use of such method prohibitive." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-85", "text": "For a larger number of models, one might want to consider random search of the \u03b1 coefficients or even Bayesian approaches as these techniques will give better running time performance." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-86", "text": "----------------------------------" }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-87", "text": "**RESULTS**" }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-88", "text": "In this section we report results on one of the largest publicly available sentiment analysis datasets, the IMDB dataset of movie reviews." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-89", "text": "The dataset consists of 50, 000 movie reviews which are categorized as being either positive or negative." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-90", "text": "We use 25, 000 reviews for training and the rest for testing, using the same protocol proposed by (Maas et al., 2011) ." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-91", "text": "All experiments can be reproduced using the code available at https://github.com/mesnilgr/iclr15." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-92", "text": "Table 2 reports the results of each individual model." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-93", "text": "We have found that generative models performed the worst, with RNNs slightly better than n-grams." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-94", "text": "The most competitive method is the method based on reweighed bag-of-words (Wang & Manning, 2012) 2 ." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-95", "text": "Favoring simplicity and reproducibility of our performance, all results reported in this paper were produced by a linear classifier." 
}, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-96", "text": "Finally, Table 3 reports the results of combining the previous models into an ensemble." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-97", "text": "When we interpolate the scores of RNN, sentence vectors and NB-SVM, we achieve a new state-of-the-art performance of 92.57%, to be compared to 91.22% reported by (Wang & Manning, 2012) ." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-98", "text": "Notice that our implementation of the Sentence Vectors method (Le & Mikolov, 2014) alone yielded only 89.3% (a difference of \u2243 3%)." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-99", "text": "In order to measure the contribution of each model to the final ensemble classifier, we remove one model at a time from the ensemble." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-100", "text": "We observe that the removal of the generative model affects the least the ensemble performance." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-101", "text": "Overall, all three models contribute to the success of the overall ensemble, suggesting that these three models pick up complimentary features useful for discrimination." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-102", "text": "In Table 4 , we show test reviews misclassified by single models but classified accurately by the ensemble." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-103", "text": "----------------------------------" }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-104", "text": "**CONCLUSION**" }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-105", "text": "We have proposed a very simple yet powerful ensemble system for sentiment analysis." 
}, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-106", "text": "We combine three rather complementary and conceptually different baseline models: one based on a generative approach (language models), one based on continuous representations of sentences and one based on a clever reweighing of tf-idf bag-of-word representation of the document." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-107", "text": "Each such model contributes to the success of the overall system, achieving the new state of the art performance on the challenging IMDB movie review dataset." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-108", "text": "Code to reproduce our experiments is available at: https://github.com/mesnilgr/iclr15." }, { "sent_id": "7d89a96743d9db5667d90cbd3ebd30-C001-109", "text": "We hope researchers will take advantage of our code to include their new results into our ensemble and focus on improving the state of the art for Sentiment Analysis." } ], "y": { "@BACK@": { "gold_contexts": [ [ "7d89a96743d9db5667d90cbd3ebd30-C001-14" ], [ "7d89a96743d9db5667d90cbd3ebd30-C001-18" ] ], "cite_sentences": [ "7d89a96743d9db5667d90cbd3ebd30-C001-14", "7d89a96743d9db5667d90cbd3ebd30-C001-18" ] }, "@SIM@": { "gold_contexts": [ [ "7d89a96743d9db5667d90cbd3ebd30-C001-27" ], [ "7d89a96743d9db5667d90cbd3ebd30-C001-64" ] ], "cite_sentences": [ "7d89a96743d9db5667d90cbd3ebd30-C001-27", "7d89a96743d9db5667d90cbd3ebd30-C001-64" ] }, "@USE@": { "gold_contexts": [ [ "7d89a96743d9db5667d90cbd3ebd30-C001-64" ] ], "cite_sentences": [ "7d89a96743d9db5667d90cbd3ebd30-C001-64" ] }, "@DIF@": { "gold_contexts": [ [ "7d89a96743d9db5667d90cbd3ebd30-C001-68" ], [ "7d89a96743d9db5667d90cbd3ebd30-C001-97" ] ], "cite_sentences": [ "7d89a96743d9db5667d90cbd3ebd30-C001-68", "7d89a96743d9db5667d90cbd3ebd30-C001-97" ] }, "@UNSURE@": { "gold_contexts": [ [ "7d89a96743d9db5667d90cbd3ebd30-C001-94" ] ], "cite_sentences": [ "7d89a96743d9db5667d90cbd3ebd30-C001-94" ] } } }, 
"ABC_591e2873606d6171e48fd34a731fc7_29": { "x": [ { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-2", "text": "Geolocating social media posts relies on the assumption that language carries sufficient geographic information." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-3", "text": "However, locations are usually given as continuous latitude/longitude tuples, so we first need to define discrete geographic regions that can serve as labels." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-4", "text": "Most studies use some form of clustering to discretize the continuous coordinates (Han et al., 2016) ." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-5", "text": "However, the resulting regions do not always correspond to existing linguistic areas." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-6", "text": "Consequently, accuracy at 100 miles tends to be good, but degrades for finer-grained distinctions, when different linguistic regions get lumped together." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-7", "text": "We describe a new algorithm, Point-to-City (P2C), an iterative k-d tree-based method for clustering geographic coordinates and associating them with towns." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-8", "text": "We create three sets of labels at different levels of granularity, and compare performance of a state-of-the-art geolocation model trained and tested with P2C labels to one with regular k-d tree labels." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-9", "text": "Even though P2C results in substantially more labels than the baseline, model accuracy increases significantly over using traditional labels at the fine-grained level, while staying comparable at 100 miles." 
}, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-10", "text": "The results suggest that identifying meaningful linguistic areas is crucial for improving geolocation at a fine-grained level." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-11", "text": "----------------------------------" }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-12", "text": "**INTRODUCTION**" }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-13", "text": "Predicting the location of a Social Media post involves first and foremost ways to identify the words that indicate geographic location." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-14", "text": "Secondly, and perhaps even more fundamentally, though, we also need to determine an effective notion of what a \"location\" is, i.e., what do our labels represent: a state, a city, a neighborhood, a street? In many NLP tasks, labels are ambiguous and open to interpretation ." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-15", "text": "In geolocation, the information initially given is an unambiguous latitude/longitude pair, but this format captures a level of detail (precise down to a centimeter) that is both unnecessary and unrealistic for most practical applications." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-16", "text": "Collapsing coordinates to geographic categories is therefore a common step in geolocation." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-17", "text": "However, this discretization step is open to interpretation: what method should we choose?" }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-18", "text": "Previous work includes three different approaches to discretizing continuous values into location labels (see also Section 2):" }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-19", "text": "1.) 
Geodesic grids are the most straightforward, but do not \"lead to a natural representation of the administrative, population-based or language boundaries in the region\" (Han et al., 2012) ." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-20", "text": "2.) Clustering coordinates prevents the identification of (nearly) empty locations and keeps points which are geographically close together in one location." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-21", "text": "Unfortunately, in crowded regions, clusters might be too close to each other, and therefore divide cultural/linguistic areas into meaningless groups." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-22", "text": "3.) Predefined administrative regions, like cities, can provide homogeneous interpretable areas." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-23", "text": "However, mapping coordinates to the closest city can be ambiguous." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-24", "text": "Previous work typically considered cities with a population of at least 100K (Han et al., 2012 (Han et al., , 2014 ." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-25", "text": "This approach has the opposite problem of clustering: different linguistic areas might be contained within a single administrative region." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-26", "text": "Here, we propose Point-To-City (P2C), a new method mapping continuous coordinates to locations." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-27", "text": "It combines the strengths of the last two approaches, keeping coordinates which appear close to each other in the same location, while also representing them in terms of meaningful administrative regions, with adjustable granularity." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-28", "text": "We show that these two criteria also result in superior prediction performance for geolocation." 
}, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-29", "text": "Relying on k-d trees (Maneewongvatana and Mount, 1999) , P2C iteratively clusters points within a specified maximum distance d, and maps them to the coordinates of the closest town with a minimum population size." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-30", "text": "We evaluate P2C on two data sets commonly used for geolocation." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-31", "text": "We create three different conditions by using three different values for d as maximum distance between points, and compare the results to those obtained using k-d tree labels (as used in the W-NUT shared task (Han et al., 2016) )." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-32", "text": "For all four labeling schemes, we train an attention-based convolutional neural network, and evaluate mean and median distance between target and predicted point, and accuracy within 161 km (Acc@161)." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-33", "text": "We also show the standard accuracy score relative to the specific labels, usually much worse than Acc@161, and often not reported in the literature." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-34", "text": "Our results show that P2C reliably produces Acc@161 performance which is comparable with state-of-the-art models." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-35", "text": "For exact accuracy, however, P2C labels always result in substantially better performance than previous methods, in spite of the larger set of classes." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-36", "text": "This suggests that P2C captures more meaningful location distinctions (backed up by a qualitative analysis), and that previous labels capture only broader, linguistically mixed areas." 
}, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-37", "text": "More generally, our results show that language reflects social and geographical distinctions in the world, and that more meaningful realworld labels help language-based prediction models to perform their task more efficiently." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-38", "text": "Contributions The contributions of this paper are the following: 1.) we propose P2C, a k-d tree based procedure to cluster geographic points associated with existing towns within a certain distance between town and cluster centroid." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-39", "text": "2.) we show that P2C produces more meaningful, interpretable cultural and linguistic locations 3.) we show that P2C labels substantially improve model performance in exact, fine-grained classification 2 Related work Geolocation prediction can, in principle, be modeled both as regression and as classification problem." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-40", "text": "In practice, however, given the difficulty of predicting continuous coordinate values, regression is often carried out in conjunction with the classification (Eisenstein et al., 2010; Lourentzou et al., 2017; Fornaciari and Hovy, 2019b) ." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-41", "text": "In general, however, the task is considered a classification problem, which requires solutions for the identification of geographic regions as labels." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-42", "text": "Geodesic grids were used for the geolocation of posts on Flickr, Twitter and Wikipedia (Serdyukov et al., 2009; Wing and Baldridge, 2011) ." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-43", "text": "Hulden et al. (2015) noticed that \"using smaller grid sizes leads to an immediate sparse data problem since very few features/words are [selectively] observed in each cell\"." 
}, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-44", "text": "In order to enhance the expressiveness of the geographic cells, Wing and Baldridge (2014), constructed both flat and hierarchical grids relying on k-d tree, and testing their methods at different levels of granularity." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-45", "text": "The same labels were used in the study of Rahimi et al. (2018) ." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-46", "text": "Han et al. (2012 Han et al. ( , 2014 , who released TWITTER-WORLD, use the information provided by the Geoname dataset 1 in order to identify a set of cities around the world with at least 100K inhabitants." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-47", "text": "Then they refer their geo-tagged texts to those cities, creating easily interpretable geographic places." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-48", "text": "Cha et al. (2015) proposed a voting-based grid selection scheme, with the classification referred to regions/states in US." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-49", "text": "Most works use deep learning techniques for classification (Miura et al., 2016) ." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-50", "text": "Often, they include multi-view models, considering different sources (Miura et al., 2017; Lau et al., 2017; Ebrahimi et al., 2018; Fornaciari and Hovy, 2019a) ." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-51", "text": "In particular, Lau et al. (2017) implemented a multi-channel convolutional network, structurally similar to our model." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-52", "text": "Rahimi et al. (2018) proposes a Graph-Convolutional neural network, though the text features are represented by a bag-of-words, while we rely on word embeddings." 
}, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-53", "text": "The ability of the labels to reflect real anthropological areas, however, affects primarily the models which rely on linguistic data." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-54", "text": "This is the case of the studies of Han et al. (2012) and Han et al. (2014) who based their predictions on the so-called Location-Indicative Words (LIW)." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-55", "text": "Recently, neural models have been built with the same purpose (Rahimi et al., 2017; Tang et al., 2019) ." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-56", "text": "----------------------------------" }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-57", "text": "**METHODS**" }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-58", "text": "Data sets We apply our method to two widely used data sets for geolocation: TWITTER-US (Roller et al., 2012) , and TWITTER-WORLD (Han et al., 2012) ." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-59", "text": "They are all collections of En-glish tweets aggregated by author and labeled with geographic coordinates." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-60", "text": "TWITTER-US and TWITTER-WORLD contain 450K and 1.39M texts, respectively." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-61", "text": "They are each divided into their own training, development, and test sets." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-62", "text": "Readers are referred to the respective papers for additional details." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-63", "text": "We round the coordinates to the second decimal number." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-64", "text": "A distance of 0.01 degrees corresponds to less than 1.1 km on the longitude axis (the distance is not constant on the latitude axis)." 
}, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-65", "text": "Smaller distinctions are not relevant for any common NLP task." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-66", "text": "Point-To-City (P2C) For the classification, we need to identify labels corresponding to existing cultural/linguistic areas, so that the geographic information conveyed through language can be fully exploited." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-67", "text": "To this end, P2C iteratively creates clusters of points, and afterwards associates the final clusters with specific towns." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-68", "text": "The parameter d controls the maximum spherical distance we allow between points assigned to the same cluster at the initialization step." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-69", "text": "We run P2C considering three values: 0.1, 0.25, and 0.5 coordinate decimal points, which correspond to 11.12 km (6.91 miles), 27.80 km (17.27 miles), and 55.60 km (34.55 miles) on the longitude axis." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-70", "text": "We use these values to explore the feasibility of finer (and more challenging) predictions than those usually accepted in the literature." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-71", "text": "One of the most popular metrics in previous studies (see Section 2 and 4) is the accuracy of the predictions within 161 km, or 100 mi, from the target point." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-95", "text": "Finally, we substitute these words with their replacement token in the whole corpus, including development and test set." 
}, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-72", "text": "In contrast, we are interested in the accuracy relative to the precise prediction of the labels, and we want labels representing points aggregated according to a distance much smaller than 161 km/100 mi: even the highest value we choose for d, 0.5, is about one third the distance of accuracy at 161 km (Acc@161)." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-73", "text": "However, since P2C iteratively creates clusters of clusters, it is possible that the original points belonging to different clusters are further apart than the threshold of d. For this reason, we selected values of d which are about three to fifteen times smaller than 161 km/100 mi." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-74", "text": "Given d and a set of coordinate points/instances in the data set, P2C iterates over the following steps until convergence:" }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-75", "text": "1. for each point, apply k-d trees to find clusters of points where each pair has a reciprocal distance less than d; 5. assign the points which fall into more than one cluster to the one with the nearest centroid;" }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-76", "text": "6. substitute each instance's coordinates with the centroid coordinates of the corresponding cluster." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-77", "text": "The algorithm converges when the final number of points cannot be further reduced, since they all are farther apart from each other than the maximum distance d. After assigning each instance its new coordinates, we follow Han et al. (2012 Han et al. ( , 2014 in using the GeoNames data set to associate clusters with cities, by substituting the instance coordinates with those of the closest town center." 
}, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-78", "text": "In our case, however, rather than collecting cities with a population of at least 100K, we consider all towns with a population of at least 15K." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-79", "text": "This last step further reduces the set of points associated with our instances." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-80", "text": "Table 1 shows the resulting number of labels, and the mean distance in km between the new instance coordinates and the respective town center." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-81", "text": "This choice of 15K inhabitants is coherent with the settings of d: we aim to account for linguistic/social environments more specific than the broad and compound communities of densely populated cities." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-82", "text": "This is helpful for high resolution geolocation both in the case of crowded regions and of areas with low density of inhabitants." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-83", "text": "However, we found that in spite of qualified information, such as the annual Worlds Cities report of the United Nations, it is actually difficult to set an optimal threshold." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-84", "text": "In fact, not even that document provides a detailed profile of small towns at a global level." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-85", "text": "Therefore we rely on the format of the information offered by Geonames." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-86", "text": "The code for computing P2C is available at github.com/Bocconi-NLPLab." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-87", "text": "Feature selection The two corpora have very different vocabulary sizes." 
}, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-88", "text": "Despite fewer instances, TWITTER-US contains a much richer vocabulary than TWITTER-WORLD: 14 vs. 6.5 millions words." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-89", "text": "This size is computationally infeasible." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-90", "text": "In order to maximize discrimination, we filter the vocabulary with several steps." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-91", "text": "In order not to waste the possible geographic information carried by the huge amount of low frequency terms, we use replacement tokens as follows: We again take only the training data into account." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-92", "text": "First, we discard the hapax legomena, that is the words with frequency 1, as there is no evidence that these words could be found elsewhere." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-93", "text": "Then, we discard words with frequency greater than 1, if they appear in more than one place." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-94", "text": "We replace low frequency terms which appear uniquely in on place with a replacement token specific for that place, i.e., label." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-96", "text": "Since the word distribution follows the Zipf curve (Powers, 1998) we are able to exploit the geographic information of millions of words using only a small number of replacement tokens." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-97", "text": "The use of this information is fair, as it relies on the information present in the training set only." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-98", "text": "In terms of performance, however, the effect of the replacement tokens is theoretically not different from that resulting from the direct inclusion of the single words in the vocabulary." 
}, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-99", "text": "The benefit is in terms of noise reduction, for the selective removal of geographically ambiguous words, and computational affordability." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-100", "text": "Following Han et al. (2012) , we further filter the vocabulary via Information Gain Ratio (IGR), selecting the terms with the highest values until we reach a computationally feasible vocabulary size: here, 750K and 460K for TWITTER-US and TWITTER-WORLD." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-101", "text": "Attention-based CNN For classification, we use an attention-based convolutional neural model." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-102", "text": "We first train our own word embeddings for each corpus, and feed the texts into two convolutional channels (with window size 4 and 8) and max-pooling, followed by an overall attention mechanism, and finally a fully-connected layer with softmax activation for prediction." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-103", "text": "For evaluation, as discussed in Section 3, we use the common metrics considered in literature: acc@161, that is the accuracy within 161 km (100 mi) from the target point, and mean and median distance between the predicted and the target points." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-104", "text": "We are also interested in the exact accuracy." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-105", "text": "This metric is often not shown in literature, but is important for the geolocation in real case scenarios." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-106", "text": "We evaluate significance via bootstrap sampling, following ." 
}, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-107", "text": "----------------------------------" }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-108", "text": "**RESULTS**" }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-109", "text": "The model performance is shown in table 2." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-110", "text": "When applied to the W-NUT labels, our model replicates the results of Rahimi et al. (2017) : in TWITTER-US the values correspond perfectly, in TWITTER-WORLD the Att-CNN performance is slightly lower." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-111", "text": "Compared to the W-NUT labels, the P2C labels are much more granular in every condition and, in spite of their apparent greater difficulty, they help to reach better performance in all metrics, with very high levels of significance." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-112", "text": "Such differences are surprisingly wide with respect to the accuracy: in TWITTER-US, for P2C with d = .5, the performance is more than doubled compared to the same model with the W-NUT k-d tree labels (54% vs. 26%)." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-113", "text": "Figure 1 shows the coordinates of the W-NUT (1a) and of the P2C cluster centroids (1b)." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-114", "text": "The diameter of the circles represent the rate correct prediction for those points." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-115", "text": "As can be seen, P2C identifies a unique linguistic region around Washington, while different W-NUT labels cover more or less the same area." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-116", "text": "P2C labels also allow a much better concentration of predictions in the same administrative/linguistic area." 
}, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-117", "text": "----------------------------------" }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-118", "text": "**CONCLUSION**" }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-119", "text": "P2C is a method for geographic labeling that dynamically clusters points and links them to specific towns." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-120", "text": "The aims are 1) to gather the points belonging to the same linguistic areas; 2) to associate such areas with distinct, existing administrative regions; 3) to improve the models' effectiveness, training them with texts showing consistent linguistic patterns." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-121", "text": "Compared to the W-NUT k-d tree labels, P2C leads to remarkably higher performance in all metrics, and in particular in the accuracy, even in spite of the higher number of labels identified." }, { "sent_id": "591e2873606d6171e48fd34a731fc7-C001-122", "text": "This suggests that techniques like P2C might be particularly useful when high performance at high levels of granularity is required." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "591e2873606d6171e48fd34a731fc7-C001-19" ], [ "591e2873606d6171e48fd34a731fc7-C001-24" ] ], "cite_sentences": [ "591e2873606d6171e48fd34a731fc7-C001-19", "591e2873606d6171e48fd34a731fc7-C001-24" ] }, "@MOT@": { "gold_contexts": [ [ "591e2873606d6171e48fd34a731fc7-C001-19" ], [ "591e2873606d6171e48fd34a731fc7-C001-24", "591e2873606d6171e48fd34a731fc7-C001-25" ], [ "591e2873606d6171e48fd34a731fc7-C001-53", "591e2873606d6171e48fd34a731fc7-C001-54" ] ], "cite_sentences": [ "591e2873606d6171e48fd34a731fc7-C001-19", "591e2873606d6171e48fd34a731fc7-C001-24", "591e2873606d6171e48fd34a731fc7-C001-54" ] }, "@USE@": { "gold_contexts": [ [ "591e2873606d6171e48fd34a731fc7-C001-58" ], [ "591e2873606d6171e48fd34a731fc7-C001-77" ], [ "591e2873606d6171e48fd34a731fc7-C001-100" ] ], "cite_sentences": [ "591e2873606d6171e48fd34a731fc7-C001-58", "591e2873606d6171e48fd34a731fc7-C001-77", "591e2873606d6171e48fd34a731fc7-C001-100" ] } } }, "ABC_2bb115f1c3e753e9dc66735887a52d_29": { "x": [ { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-2", "text": "A desirable quality of a coreference resolution system is the ability to handle transitivity constraints, such that even if it places high likelihood on a particular mention being coreferent with each of two other mentions, it will also consider the likelihood of those two mentions being coreferent when making a final assignment." }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-3", "text": "This is exactly the kind of constraint that integer linear programming (ILP) is ideal for, but, surprisingly, previous work applying ILP to coreference resolution has not encoded this type of constraint." 
}, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-4", "text": "We train a coreference classifier over pairs of mentions, and show how to encode this type of constraint on top of the probabilities output from our pairwise classifier to extract the most probable legal entity assignments." }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-5", "text": "We present results on two commonly used datasets which show that enforcement of transitive closure consistently improves performance, including improvements of up to 3.6% using the b 3 scorer, and up to 16.5% using cluster f-measure." }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-6", "text": "----------------------------------" }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-8", "text": "Much recent work on coreference resolution, which is the task of deciding which noun phrases, or mentions, in a document refer to the same real world entity, builds on Soon et al. (2001) ." }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-9", "text": "They built a decision tree classifier to label pairs of mentions as coreferent or not." }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-10", "text": "Using their classifier, they would build up coreference chains, where each mention was linked up with the most recent previous mention that the classifier labeled as coreferent, if such a mention existed." }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-11", "text": "Transitive closure in this model was done implicitly." }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-12", "text": "If John Smith was labeled coreferent with Smith, and Smith with Jane Smith, then John Smith and Jane Smith were also coreferent regardless of the classifier's evaluation of that pair." 
}, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-13", "text": "Much work that followed improved upon this strategy, by improving the features (Ng and Cardie, 2002b) , the type of classifier (Denis and Baldridge, 2007) , and changing mention links to be to the most likely antecedent rather than the most recent positively labeled antecedent (Ng and Cardie, 2002b) ." }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-14", "text": "This line of work has largely ignored the implicit transitivity of the decisions made, and can result in unintuitive chains such as the Smith chain just described, where each pairwise decision is sensible, but the final result is not." }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-15", "text": "Ng and Cardie (2002a) and Ng (2004) highlight the problem of determining whether or not common noun phrases are anaphoric." }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-16", "text": "They use two classifiers, an anaphoricity classifier, which decides if a mention should have an antecedent and a pairwise classifier similar those just discussed, which are combined in a cascaded manner." }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-17", "text": "More recently, Denis and Baldridge (2007) utilized an integer linear programming (ILP) solver to better combine the decisions made by these two complementary classifiers, by finding the globally optimal solution according to both classifiers." }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-18", "text": "However, when encoding constraints into their ILP solver, they did not enforce transitivity." }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-19", "text": "The goal of the present work is simply to show that transitivity constraints are a useful source of information, which can and should be incorporated into an ILP-based coreference system." 
}, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-20", "text": "For this goal, we put aside the anaphoricity classifier and focus on the pairwise classifier and transitivity constraints." }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-21", "text": "We build a pairwise logistic classifier, trained on all pairs of mentions, and then at test time we use an ILP solver equipped with transitivity constraints to find the most likely legal assignment to the variables which represent the pairwise decisions." }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-22", "text": "1 Our results show a significant improvement compared to the na\u00efve use of the pairwise classifier." }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-23", "text": "Other work on global models of coreference (as opposed to pairwise models) has included: Luo et al. (2004) who used a Bell tree whose leaves represent possible partitionings of the mentions into entities and then trained a model for searching the tree; McCallum and Wellner (2004) who defined several conditional random field-based models; Ng (2005) who took a reranking approach; and Culotta et al. (2006) who use a probabilistic first-order logic model." }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-24", "text": "----------------------------------" }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-25", "text": "**COREFERENCE RESOLUTION**" }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-26", "text": "For this task we are given a document which is annotated with a set of mentions, and the goal is to cluster the mentions which refer to the same entity." }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-27", "text": "When describing our model, we build upon the notation used by Denis and Baldridge (2007) ." 
}, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-28", "text": "----------------------------------" }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-29", "text": "**PAIRWISE CLASSIFICATION**" }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-30", "text": "Our baseline systems are based on a logistic classifier over pairs of mentions." }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-31", "text": "The probability of a pair of mentions takes the standard logistic form:" }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-32", "text": "( 1) where m i and m j correspond to mentions i and j respectively; f (m i , m j ) is a feature function over a pair of mentions; \u03b8 are the feature weights we wish to learn; and x i,j is a boolean variable which takes value 1 if m i and m j are coreferent, and 0 if they are not." }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-33", "text": "The log likelihood of a document is the sum of the log likelihoods of all pairs of mentions:" }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-34", "text": "(2) where m is the set of mentions in the document, and x is the set of variables representing each pairwise coreference decision x i,j . Note that this model is degenerate, because it assigns probability mass to nonsensical clusterings." }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-35", "text": "Specifically, it will allow" }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-36", "text": "Prior work (Soon et al., 2001; Denis and Baldridge, 2007) has generated training data for pairwise classifiers in the following manner." }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-37", "text": "For each mention, work backwards through the preceding mentions in the document until you come to a true coreferent mention." }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-38", "text": "Create negative examples for all intermediate mentions, and a positive example for the mention and its correct antecedent." 
}, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-39", "text": "This approach made sense for Soon et al. (2001) because testing proceeded in a similar manner: for each mention, work backwards until you find a previous mention which the classifier thinks is coreferent, add a link, and terminate the search." }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-40", "text": "The COREF-ILP model of Denis and Baldridge (2007) took a different approach at test time: for each mention they would work backwards and add a link for all previous mentions which the classifier deemed coreferent." }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-41", "text": "This is equivalent to finding the most likely assignment to each x i,j in Equation 2." }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-42", "text": "As noted, these assignments may not be a legal clustering because there is no guarantee of transitivity." }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-43", "text": "The transitive closure happens in an ad-hoc manner after this assignment is found: any two mentions linked through other mentions are determined to be coreferent." }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-44", "text": "Our SOON-STYLE baseline used the same training and testing regimen as Soon et al. (2001) ." }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-45", "text": "Our D&B-STYLE baseline used the same test time method as Denis and Baldridge (2007) , however at training time we created data for all mention pairs." }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-46", "text": "----------------------------------" }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-47", "text": "**INTEGER LINEAR PROGRAMMING TO ENFORCE TRANSITIVITY**" }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-48", "text": "Because of the ad-hoc manner in which transitivity is enforced in our baseline systems, we do not necessarily find the most probable legal clustering." 
}, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-49", "text": "This is exactly the kind of task at which integer linear programming excels." }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-50", "text": "We need to first formulate the objective function which we wish the ILP solver to maximize at test time." }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-51", "text": "2 Let p i,j = log P (x i,j |m i , m j ; \u03b8), which is the log probability that m i and m j are coreferent according to the pairwise logistic classifier discussed in the previous section, and letp i,j = log(1 \u2212 p i,j ), be the log probability that they are not coreferent." }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-52", "text": "Our objective function is then the log probability of a particular (possibly illegal) variable assignment:" }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-53", "text": "We add binary constraints on each of the variables: x i,j \u2208 {0, 1}. We also add constraints, over each triple of mentions, to enforce transitivity:" }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-54", "text": "This constraint ensures that whenever x i,j = x j,k = 1 it must also be the case that x i,k = 1." }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-55", "text": "----------------------------------" }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-56", "text": "**EXPERIMENTS**" }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-57", "text": "We used lp solve 3 to solve our ILP optimization problems." }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-58", "text": "We ran experiments on two datasets." }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-59", "text": "We used the MUC-6 formal training and test data, as well as the NWIRE and BNEWS portions of the ACE (Phase 2) corpus." 
}, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-60", "text": "This corpus had a third portion, NPAPER, but we found that several documents where too long for lp solve to find a solution." }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-61", "text": "4 We added named entity (NE) tags to the data using the tagger of Finkel et al. (2005) ." }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-62", "text": "The ACE data is already annotated with NE tags, so when they conflicted they overrode the tags output by the tagger." }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-63", "text": "We also added part of speech (POS) tags to the data using the tagger of Toutanova et al. (2003) , and used the tags to decide if mentions were plural or singular." }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-64", "text": "The ACE data is labeled with mention type (pronominal, nominal, and name), but the MUC-6 data is not, so the POS and NE tags were used to infer this information." }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-65", "text": "Our feature set was simple, and included many features from (Soon et al., 2001) , including the pronoun, string match, definite and demonstrative NP, number and gender agreement, proper name and appositive features." }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-66", "text": "We had additional features for NE tags, head matching and head substring matching." }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-67", "text": "----------------------------------" }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-68", "text": "**EVALUATION METRICS**" }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-69", "text": "The MUC scorer (Vilain et al., 1995) is a popular coreference evaluation metric, but we found it to be fatally flawed." }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-70", "text": "As observed by Luo et al. 
(2004) , if all mentions in each document are placed into a single entity, the results on the MUC-6 formal test set are 100% recall, 78.9% precision, and 88.2% F1 score -significantly higher than any published system." }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-71", "text": "The b 3 scorer (Amit and Baldwin, 1998) was proposed to overcome several shortcomings of the MUC scorer." }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-72", "text": "However, coreference resolution is a clustering task, and many cluster scorers already exist." }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-73", "text": "In addition to the MUC and b 3 scorers, we also evaluate using cluster f-measure (Ghosh, 2003) , which is the standard f-measure computed over true/false coreference decisions for pairs of mentions; the Rand index (Rand, 1971) , which is pairwise accuracy of the clustering; and variation of information (Meila, 2003) , which utilizes the entropy of the clusterings and their mutual information (and for which lower values are better)." }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-74", "text": "----------------------------------" }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-75", "text": "**RESULTS**" }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-76", "text": "Our results are summarized in Table 1 ." }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-77", "text": "We show performance for both baseline classifiers, as well as our ILP-based classifier, which finds the most probable legal assignment to the variables representing coreference decisions over pairs of mentions." }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-78", "text": "For comparison, we also give the results of the COREF-ILP system of Denis and Baldridge (2007) , which was also based on a na\u00efve pairwise classifier." 
}, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-79", "text": "They used an ILP solver to find an assignment for the variables, but as they note at the end of Section 5.1, it is equivalent to taking all links for which the classifier returns a probability \u2265 0.5, and so the ILP solver is not really necessary." }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-80", "text": "We also include their JOINT-ILP numbers, however that system makes use of an additional anaphoricity classifier." }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-81", "text": "For all three corpora, the ILP model beat both baselines for the cluster f-score, Rand index, and variation of information metrics." }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-82", "text": "Using the b 3 metric, the ILP system and the D&B-STYLE baseline performed about the same on the MUC-6 corpus, though for both ACE corpora, the ILP system was the clear winner." }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-83", "text": "When using the MUC scorer, the ILP system always did worse than the D&B-STYLE baseline." }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-84", "text": "However, this is precisely because the transitivity constraints tend to yield smaller clusters (which increase precision while decreasing recall)." }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-85", "text": "Remember that going in the opposite direction and simply putting all mentions in one cluster produces a MUC score which is higher than any in the table, even though this clustering is clearly not useful in applications." }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-86", "text": "Hence, we are skeptical of this measure's utility and provide it primarily for comparison with previous work." 
}, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-87", "text": "The improvements from the ILP system are most clearly shown on the ACE NWIRE corpus, where the b 3 f-score improved 3.6%, and the cluster f-score improved 16.5%." }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-88", "text": "----------------------------------" }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-89", "text": "**CONCLUSION**" }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-90", "text": "We showed how to use integer linear programming to encode transitivity constraints in a corefer- ence classifier which models pairwise decisions over mentions." }, { "sent_id": "2bb115f1c3e753e9dc66735887a52d-C001-91", "text": "We also demonstrated that enforcing such constraints at test time can significantly improve performance, using a variety of evaluation metrics." } ], "y": { "@BACK@": { "gold_contexts": [ [ "2bb115f1c3e753e9dc66735887a52d-C001-13" ], [ "2bb115f1c3e753e9dc66735887a52d-C001-17" ], [ "2bb115f1c3e753e9dc66735887a52d-C001-36" ], [ "2bb115f1c3e753e9dc66735887a52d-C001-40" ] ], "cite_sentences": [ "2bb115f1c3e753e9dc66735887a52d-C001-13", "2bb115f1c3e753e9dc66735887a52d-C001-17", "2bb115f1c3e753e9dc66735887a52d-C001-36", "2bb115f1c3e753e9dc66735887a52d-C001-40" ] }, "@USE@": { "gold_contexts": [ [ "2bb115f1c3e753e9dc66735887a52d-C001-27" ], [ "2bb115f1c3e753e9dc66735887a52d-C001-45" ], [ "2bb115f1c3e753e9dc66735887a52d-C001-78" ] ], "cite_sentences": [ "2bb115f1c3e753e9dc66735887a52d-C001-27", "2bb115f1c3e753e9dc66735887a52d-C001-45", "2bb115f1c3e753e9dc66735887a52d-C001-78" ] } } }, "ABC_18ef4e4fafdf62839d6797d62eb76b_29": { "x": [ { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-109", "text": "These results indicate the generalizability of our proposed model across datasets." 
}, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-2", "text": "The rising growth of fake news and misleading information through online media outlets demands an automatic method for detecting such news articles." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-3", "text": "Of the few limited works which differentiate between trusted vs other types of news article (satire, propaganda, hoax), none of them model sentence interactions within a document." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-4", "text": "We observe an interesting pattern in the way sentences interact with each other across different kind of news articles." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-5", "text": "To capture this kind of information for long news articles, we propose a graph neural networkbased model which does away with the need of feature engineering for fine grained fake news classification." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-6", "text": "Through experiments, we show that our proposed method beats strong neural baselines and achieves state-of-the-art accuracy on existing datasets." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-7", "text": "Moreover, we establish the generalizability of our model by evaluating its performance in out-of-domain scenarios." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-8", "text": "Code is available at https: //github.com/MysteryVaibhav/ fake_news_semantics." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-9", "text": "----------------------------------" }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-11", "text": "In today's day and age of social media, there are ample opportunities for fake news production, dissemination and consumption." 
}, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-12", "text": "Rashkin et al. (2017) break down fake news into three categories, hoax, propaganda and satire." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-13", "text": "A hoax article typically tries to convince the reader about a cookedup story while propaganda ones usually mislead the reader into believing a false political or social agenda." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-14", "text": "Burfoot and Baldwin (2009) defines a satirical article as the one which deliberately exposes real-world individuals, organisations and events to ridicule." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-15", "text": "Previous works (Rubin et al., 2016; Rashkin et al., 2017) rely on various linguistic and handcrafted semantic features for differentiating between news articles." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-16", "text": "However, none of them try to model the interaction of sentences within the document." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-17", "text": "We observed a pattern in the way sentences cluster in different kind of news articles." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-18", "text": "Specifically, satirical articles had a more coherent story and thus all the sentences in the document seemed similar to each other." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-19", "text": "On the other hand, the trusted news articles were also coherent but the similarity between sentences from different parts of the document was not that strong, as depicted in Figure 1 ." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-20", "text": "We believe that the reason for such kind of behaviour is the presence of factual jumps across sections in a trusted document." 
}, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-21", "text": "In this work, we propose a graph neural network-based model to classify news articles while capturing the interaction of sentences across the document." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-22", "text": "We present a series of experiments on News Corpus with Varying Reliability dataset (Rashkin et al., 2017) and Satirical Legitimate News dataset (Rubin et al., 2016) ." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-23", "text": "Our results demonstrate that the proposed model achieves state-of-the-art performance on these datasets and provides interesting insights." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-24", "text": "Experiments performed in out-of-domain settings establish the generalizability of our proposed method." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-25", "text": "----------------------------------" }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-26", "text": "**RELATED WORK**" }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-27", "text": "Satire, according to Simpson (2003) , is complicated because it occupies more than one place in the framework for humor, proposed by Ziv (1988) : it clearly has an aggressive and social function, and often expresses an intellectual aspect as well." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-28", "text": "Rubin et al. (2016) defines news satire as a genre of satire that mimics the format and style of journalistic reporting." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-29", "text": "Datasets created for the task of identifying satirical news articles from the trusted ones are often constructed by collecting documents from different online sources (Rubin et al., 2016) ." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-30", "text": "McHardy et al. 
(2019) hypothesized that this encourages the models to learn characteristics for different publication sources rather than characteristics of satire." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-31", "text": "In this work, we show that our proposed model generalizes to articles from unseen publication sources." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-32", "text": "Rubin et al. (2016) 's work by offering a quantitative study of linguistic differences found in articles of different types of fake news such as hoax, propaganda and satire." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-33", "text": "They also proposed predictive models for graded deception across multiple domains." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-34", "text": "Rashkin et al. (2017) found that neural methods didn't perform well for this task and proposed to use a Max-Entropy classifier." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-35", "text": "We show that our proposed neural network based on graph convolutional layers can outperform this model." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-36", "text": "Recent works by Yang et al. (2017) ; De Sarkar et al. (2018) show that sophisticated neural models can be used for satirical news detection." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-37", "text": "To the best of our knowledge, none of the previous works represent individual documents as graphs where the nodes represent the sentences for performing clas-sification using a graph neural network." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-38", "text": "----------------------------------" }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-39", "text": "**RASHKIN ET AL. 
(2017) EXTENDS**" }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-40", "text": "----------------------------------" }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-41", "text": "**DATASET AND BASELINE**" }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-110", "text": "We also present results of four way classification in Table 3 ." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-42", "text": "We use SLN: Satirical and Legitimate News Database (Rubin et al., 2016) , RPN: Random Political News Dataset (Horne and Adali, 2017) and LUN: Labeled Unreliable News Dataset Rashkin et al. (2017) for our experiments." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-43", "text": "Table 1 shows the statistics." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-44", "text": "Since all of the previous methods on the aforementioned datasets are non-neural, we implement the following neural baselines," }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-45", "text": "\u2022 CNN: In this model, we apply a 1-d CNN (Convolutional Neural Network) layer (Kim, 2014) with filter size 3 over the word embeddings of the sentences within a document." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-46", "text": "This is followed by a max-pooling layer to get a single document vector which is passed to a fully connected projection layer to get the logits over output classes." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-47", "text": "\u2022 LSTM: In this model, we encode the document using a LSTM (Long Short-Term Memory) layer (Hochreiter and Schmidhuber, 1997) ." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-48", "text": "We use the hidden state at the last time step as the document vector which is passed to a fully connected projection layer to get the logits over output classes." 
}, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-49", "text": "\u2022 BERT: In this model, we extract the sentence vector (representation corresponding to [CLS] token) using BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2019) for each sentence in the document." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-50", "text": "We then apply a LSTM layer on the sentence embeddings, followed by a projection layer to make the prediction for each document." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-51", "text": "----------------------------------" }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-52", "text": "**PROPOSED MODEL**" }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-53", "text": "Capturing sentence interactions in long documents is not feasible using a recurrent network because of the vanishing gradient problem (Pascanu et al., 2013) ." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-54", "text": "Thus, we propose a novel way of encoding documents as described in the next subsection." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-55", "text": "Figure 2 shows the overall framework of our graph based neural network." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-56", "text": "----------------------------------" }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-57", "text": "**INPUT REPRESENTATION**" }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-58", "text": "Each document in the corpus is represented as a graph." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-59", "text": "The nodes of the graph represent the sentences of a document while the edges represent the semantic similarity between a pair of sentences." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-60", "text": "Representing a document as a fully connected graph allows the model to directly capture the interaction of each sentence with every other sentence in the document." 
}, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-61", "text": "Formally," }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-62", "text": "We initialize the edge scores using BERT (Devlin et al., 2019) finetuned on the semantic textual similarity task 1 for computing the semantic similarity (SS) between two sentences." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-63", "text": "Refer to the Supplementary Material for more details regarding the SS model." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-64", "text": "Note that this representation drops the sentence order information but is better able to capture the interaction between far off sentences within a document." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-65", "text": "----------------------------------" }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-66", "text": "**GRAPH BASED NEURAL NETWORKS**" }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-67", "text": "We reformulate the fake news classification problem as a graph classification task, where a graph represents a document." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-68", "text": "Given a graph G = (E, S) where E is the adjacency matrix and S is the sentence feature matrix." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-69", "text": "We randomly initialize the word embeddings and use the last hidden state of a LSTM layer as the sentence embedding, shown in Figure 2 ." 
}, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-70", "text": "We experiment with two kinds of graph neural networks," }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-71", "text": "----------------------------------" }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-72", "text": "**GRAPH CONVOLUTION NETWORK (GCN)**" }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-73", "text": "The graph convolutional network (Kipf and Welling, 2017) is a spectral convolutional operation denoted by f (Z l , E|W l )," }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-74", "text": "1 Task 1 of SemEval-2017 Here, Z l is the output feature corresponding to the nodes after l th convolution." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-75", "text": "W l is the parameter associated with the l th layer." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-76", "text": "We set Z 0 = S. Based on the above operation, we can define arbitrarily deep networks." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-77", "text": "For our experiments, we just use a single layer unless stated otherwise." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-78", "text": "By default, the adjacency matrix (E) is fully connected i.e. all the elements are 1 except the diagonal elements which are all set to 0." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-79", "text": "We set E based on semantic similarity model in our GCN + SS model." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-80", "text": "For the GCN + Attn model, we just add a self attention layer (Vaswani et al., 2017) after the GCN layer and before the pooling layer." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-81", "text": "----------------------------------" }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-82", "text": "**GRAPH ATTENTION NETWORK (GAT)**" }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-83", "text": "Velikovi et al. 
(2018) introduced graph attention networks to address various shortcomings of GCNs." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-84", "text": "Most importantly, they enable nodes to attend over their neighborhoods features without depending on the graph structure upfront." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-85", "text": "The key idea is to compute the hidden representations of each node in the graph, by attending over its neighbors, following a self-attention (Vaswani et al., 2017) strategy." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-86", "text": "By default, there is one attention head in the GAT model." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-87", "text": "For our GAT + 2 Attn Heads model, we use two attention heads and concatenate the node embeddings obtained from different heads before passing it to the pooling layer." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-88", "text": "For a fully connected graph, the GAT model allows every node to attend on every other node and learn the edge weights." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-89", "text": "Thus, initializing the edge weights using the SS model is useless as they are being learned." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-90", "text": "Mathematical details are provided in the Supplementary Material." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-91", "text": "----------------------------------" }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-92", "text": "**HYPERPARAMETERS**" }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-93", "text": "We use a randomly initialized embedding matrix with 100 dimensions." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-94", "text": "We use a single layer LSTM to encode the sentences prior to the graph neural networks." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-95", "text": "All the hidden dimensions used in our networks are set to 100." 
}, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-96", "text": "The node embedding dimension is 32." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-97", "text": "For GCN and GAT, we set \u03c3 as LeakyRelU with slope 0.2." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-98", "text": "We train the models for a maximum of 10 epochs and use Adam optimizer with learning rate 0.001." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-99", "text": "For all the models, we use max-pool for pooling, which is followed by a fully connected projection layer with output nodes equal to the number of classes for classification." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-100", "text": "----------------------------------" }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-101", "text": "**EXPERIMENTAL SETTING**" }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-102", "text": "We conduct experiments across various settings and datasets." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-103", "text": "We report macro-averaged scores in Table 3 : 4-way classification results for different models." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-104", "text": "We only report F1-score following the SoTA paper." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-105", "text": "similarity model does not seem to have much impact on the GCN model, and considering the computing cost, we don't experiment with it for the 4-way classification scenario." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-106", "text": "Given that we use SLN as an out of domain test set (just one overlapping source, no overlap in articles), whereas the SoTA paper (Rubin et al., 2016) reports a 10fold cross validation number on SLN." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-107", "text": "We believe that our results are quite strong, the GAT + 2 Attn Heads model achieves an accuracy of 87% on the entire RPN dataset when used as an out-of-domain test set." 
}, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-108", "text": "The SoTA paper (Horne and Adali, 2017) on RPN reports a 5-fold cross validation accuracy of 91%." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-111", "text": "All of our proposed methods outperform SoTA on both the in-domain and out of domain test set." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-112", "text": "To further understand the working of our proposed model, we closely inspect the attention maps generated by the GAT model for satirical and trusted news articles for the SLN dataset." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-113", "text": "From Figure 3 , we can see that the attention map generated for the trusted news article only focuses on two specific sentence whereas the attention weights are much more distributed in case of a satirical article." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-114", "text": "Interestingly enough the highlighted sentences in case of the trusted news article were the starting sentence of two different paragraphs in the article indicating the presence of similar sentence clusters within a document." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-115", "text": "This opens a new avenue for understanding the differences between different kind of text articles for future research." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-116", "text": "----------------------------------" }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-117", "text": "**CONCLUSION**" }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-118", "text": "This paper introduces a novel way of encoding articles for fake news classification." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-119", "text": "The intuition behind representing documents as a graph is motivated by the fact that sentences interact differently with each other across different kinds of article." 
}, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-120", "text": "Recurrent networks are unable to maintain long term dependencies in large documents, whereas a fully connected graph captures the interaction between sentences at unit distance." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-121", "text": "The quantitative result shows the effectiveness of our proposed model and the qualitative results validate our hypothesis about difference in sentence interaction across different articles." }, { "sent_id": "18ef4e4fafdf62839d6797d62eb76b-C001-122", "text": "Further, we show that our proposed model generalizes to unseen datasets." } ], "y": { "@BACK@": { "gold_contexts": [ [ "18ef4e4fafdf62839d6797d62eb76b-C001-15" ], [ "18ef4e4fafdf62839d6797d62eb76b-C001-28" ], [ "18ef4e4fafdf62839d6797d62eb76b-C001-29" ] ], "cite_sentences": [ "18ef4e4fafdf62839d6797d62eb76b-C001-15", "18ef4e4fafdf62839d6797d62eb76b-C001-28", "18ef4e4fafdf62839d6797d62eb76b-C001-29" ] }, "@USE@": { "gold_contexts": [ [ "18ef4e4fafdf62839d6797d62eb76b-C001-22" ], [ "18ef4e4fafdf62839d6797d62eb76b-C001-42" ] ], "cite_sentences": [ "18ef4e4fafdf62839d6797d62eb76b-C001-22", "18ef4e4fafdf62839d6797d62eb76b-C001-42" ] }, "@DIF@": { "gold_contexts": [ [ "18ef4e4fafdf62839d6797d62eb76b-C001-32", "18ef4e4fafdf62839d6797d62eb76b-C001-33", "18ef4e4fafdf62839d6797d62eb76b-C001-35" ], [ "18ef4e4fafdf62839d6797d62eb76b-C001-106" ] ], "cite_sentences": [ "18ef4e4fafdf62839d6797d62eb76b-C001-32", "18ef4e4fafdf62839d6797d62eb76b-C001-106" ] } } }, "ABC_e2705ae777acffc894f7aa18d42771_29": { "x": [ { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-57", "text": "Here, we take a minimalist approach." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-2", "text": "We present a simple method to improve neural translation of a low-resource language pair using parallel data from a related, also low-resource, language pair." 
}, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-3", "text": "The method is based on the transfer method of Zoph et al., but whereas their method ignores any source vocabulary overlap, ours exploits it." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-4", "text": "First, we split words using Byte Pair Encoding (BPE) to increase vocabulary overlap." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-5", "text": "Then, we train a model on the first language pair and transfer its parameters, including its source word embeddings, to another model and continue training on the second language pair." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-6", "text": "Our experiments show that transfer learning helps word-based translation only slightly, but when used on top of a much stronger BPE baseline, it yields larger improvements of up to 4.3 BLEU." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-7", "text": "----------------------------------" }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-9", "text": "Neural machine translation (NMT) (Sutskever et al., 2014; Bahdanau et al., 2015) is rapidly proving itself to be a strong competitor to other statistical machine translation methods." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-10", "text": "However, it still lags behind other statistical methods on very lowresource language pairs (Zoph et al., 2016; Koehn and Knowles, 2017) ." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-11", "text": "A common strategy to improve learning of lowresource languages is to use resources from related languages (Nakov and Ng, 2009) ." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-12", "text": "However, adapting these resources is not trivial." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-13", "text": "NMT offers some simple ways of doing this." 
}, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-103", "text": "**RESULTS AND ANALYSIS**" }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-14", "text": "For example, Zoph et al. (2016) train a parent model on a (possibly unrelated) high-resource language pair, then use this model to initialize a child model which is further trained on a low-resource language pair." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-15", "text": "In particular, they showed that a French-English model could be used to improve translation on a wide range of low-resource language pairs such as Hausa-, Turkish-, and Uzbek-English." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-16", "text": "In this paper, we explore the opposite scenario, where the parent language pair is also lowresource, but related to the child language pair." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-17", "text": "We show that, at least in the case of three Turkic languages (Turkish, Uzbek, and Uyghur), the original method of Zoph et al. (2016) does not always work, but it is still possible to use the parent model to considerably improve the child model." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-18", "text": "The basic idea is to exploit the relationship between the parent and child language lexicons." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-19", "text": "Zoph et al.'s original method makes no assumption about the relatedness of the parent and child languages, so it effectively makes a random assignment of the parent-language word embeddings to child-language words. But if we assume that the parent and child lexicons are related, it should be beneficial to transfer source word embeddings from parent-language words to their child-language equivalents." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-20", "text": "Thus, the problem amounts to finding a representation of the data that ensures a sufficient overlap between the vocabularies of the languages." 
}, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-21", "text": "To do this, we map the source languages to a common alphabet and use Byte Pair Encoding (BPE) (Sennrich et al., 2016) on the union of the vocabularies to increase the number of common subwords." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-22", "text": "In our experiments, we show that transfer learning helps word-based translation, but not always significantly. But when used on top of a much stronger BPE baseline, it yields larger and statistically significant improvements." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-23", "text": "Using Uzbek as a parent language and Turkish and Uyghur as child languages, we obtain improvements over BPE of 0.8 and 4.3 BLEU, respectively." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-24", "text": "2 Background" }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-25", "text": "----------------------------------" }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-26", "text": "**ATTENTIONAL MODEL**" }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-27", "text": "We use the 2-layer, 512-hidden-unit global attentional model with general scoring function and input feeding by Luong et al. (2015) ." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-28", "text": "For the purposes of this paper, the most important detail of the model is that (as in many other models) the word types of both the source and target languages are mapped to vector representations called word embeddings, which are learned automatically with the rest of the model." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-29", "text": "----------------------------------" }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-30", "text": "**LANGUAGE TRANSFER**" }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-31", "text": "We follow the transfer learning approach proposed by Zoph et al. (2016) ." 
}, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-32", "text": "In their work, a parent model is first trained on a high-resource language pair." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-33", "text": "Then the child model's parameter values are copied from the parent's and are fine-tuned on its low-resource data." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-104", "text": "Our results are shown in Table 3 ." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-34", "text": "The source word embeddings are copied with the rest of the model, with the ith parent-language word embedding being assigned to the ith childlanguage word." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-35", "text": "Because the parent and child source languages have different vocabularies, this amounts to randomly assigning parent source word embeddings to child source words." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-36", "text": "In other words, even if a word exists in both the parent and child vocabularies, it's extremely unlikely that it will be assigned the same embedding in both models." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-37", "text": "By contrast, because the target language is the same in both the parent and child models, the target word embeddings are frozen during finetuning." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-38", "text": "----------------------------------" }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-39", "text": "**RELATED LANGUAGES**" }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-40", "text": "The experiments described below are on translation from three Turkic languages to English." 
}, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-41", "text": "The Turkic language family is a group of related lan-" }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-42", "text": "----------------------------------" }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-43", "text": "**BYTE PAIR ENCODING**" }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-44", "text": "BPE (Sennrich et al., 2016) is an efficient word segmentation algorithm." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-45", "text": "It initially treats the words as sequences of character tokens, then iteratively finds and merges the most common token pair into one." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-46", "text": "The algorithm stops after a controllable number of operations, or when no token pair appears more than once." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-47", "text": "At test time, the learned merge operations are applied to merge the character sequences in test data into larger symbols." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-48", "text": "----------------------------------" }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-49", "text": "**METHOD**" }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-50", "text": "The basic idea of our method is to extend the transfer method of Zoph et al. (2016) to share the parent and child's source vocabularies, so that when source word embeddings are transferred, a word that appears in both vocabularies keeps its embedding." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-51", "text": "In order for this to work, it must be the case that the parent and child languages have considerable vocabulary overlap, and that when a word occurs in both languages, it often has a similar meaning in both languages." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-52", "text": "Thus, we need to process the data to make these two assumptions hold as much as possible." 
}, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-53", "text": "----------------------------------" }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-54", "text": "**TRANSLITERATION**" }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-55", "text": "If the parent and child language have different orthographies, it should help to map them into a common orthography." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-56", "text": "Even if the two use the same script, some transformation could be applied; for example, we might change French -eur endings to Spanish -or." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-58", "text": "Turkish and Uzbek are both written using Latin script, and we did not apply any transformations to them." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-59", "text": "Our Uyghur data is written in Arabic script, so we transliterated it to Latin script using an off-the-shelf transliterator." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-60", "text": "1 The transliteration is a string homomorphism, replacing Arabic letters with English letters or consonant clusters independent of context." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-61", "text": "----------------------------------" }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-62", "text": "**SEGMENTATION**" }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-63", "text": "To increase the overlap between the parent and child vocabularies, we use BPE to break words into subwords." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-64", "text": "For the BPE merge rules to not only find the common subwords between two source languages but also ensure consistency between source and target segmentation among each language pair, we learn the rules from the union of source and target data of both the parent and child models." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-65", "text": "The rules are then used to segment the corpora." 
}, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-66", "text": "It is important to note that this results in a single vocabulary, used for both the source and target languages in both models." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-67", "text": "----------------------------------" }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-68", "text": "**EXPERIMENTS**" }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-69", "text": "We used Turkish-, Uzbek-, and Uyghur-English parallel texts from the LORELEI program." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-70", "text": "We tokenized all data using the Moses toolkit (Koehn et al., 2007) ; for Turkish-English experiments, we also truecased the data." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-71", "text": "For Uyghur-English, the word-based models were trained in the original Uyghur data written in Arabic script; for BPEbased systems, we transliterated it to Latin script as described above." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-72", "text": "For the word-based systems, we fixed the vocabulary size and replaced out-of-vocabulary words with UNK." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-73", "text": "We tried different sizes for each language pair; however, each word-based system's target vocabulary size is limited by that of the child, so we could only use up to 45,000 word types for Turkish-English and 20,000 for UyghurEnglish." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-74", "text": "The BPE-based systems could make use of bigger vocabulary size thanks to the combination of both parent and child source and target vocabularies." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-75", "text": "We varied the number of BPE merge operations from 5,000 to 60,000." 
}, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-76", "text": "Instead of using a fixed vocabulary cutoff, we used the full vocabulary; to ensure the model still learns how to deal with unknown words, we trained on two copies of the training data: one unchanged, and one in which all rare words (whose frequency is less than 5) were replaced with UNK." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-77", "text": "Accordingly, the number of epochs was halved." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-78", "text": "Following common practice, we fixed an upper limit on training sentence length (discarding longer sentences)." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-79", "text": "Because the BPE-based systems have shorter tokens and therefore longer sentences, we set this upper limit differently for the word-based and BPE-based systems to approximately equalize the total size of the training data." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-80", "text": "This led to a limit of 50 tokens for word-based models and 60 tokens for BPE-based models." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-81", "text": "See Table 2 for statistics of the resulting datasets." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-82", "text": "We trained using Adadelta (Zeiler, 2012) , with a minibatch size of 32 and dropout with a dropout rate of 0.2." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-83", "text": "We rescaled the gradient when its norm exceeded 5." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-84", "text": "For the Uzbek-English to TurkishEnglish experiment, the parent and child models were trained for 100 and 50 epochs, respectively." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-85", "text": "For the Uzbek-English to Uyghur-English experiment, the parent and child models were trained for 50 and 200, respectively." 
}, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-86", "text": "As mentioned above, the BPE models were trained for half as many epochs because their data is duplicated." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-87", "text": "We used beam search for translation on the dev and test sets." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-88", "text": "Since NMT tends to favor short translations (Cho et al., 2014) , we use the length normalization approach of Wu et al. (2016) which uses a different score s(e | f ) instead of logprobability:" }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-89", "text": "lp(e) = (5 + |e|) \u03b1 (5 + 1) \u03b1 ." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-90", "text": "We set \u03b1 = 0.8 for all of our experiments." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-91", "text": "Table 3 : Whereas transfer learning at word-level does not always help, our method consistently yields a significant improvement over the stronger BPE systems." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-92", "text": "Scores are case-sensitive test BLEU." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-93", "text": "Key: size = vocabulary size (word-based) or number of BPE operations (BPE)." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-94", "text": "The symbols \u2020 and \u2021 indicate statistically significant improvements with p < 0.05 and p < 0.01, respectively, while * indicates a statistically insignificant improvement (p > 0.05)." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-95", "text": "Table 4 : Amount of child's source types that appear in parent." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-96", "text": "We stopped training when the tokenized BLEU score was maximized on the development set." 
}, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-97", "text": "We also optimized the vocabulary size and the number of BPE operations for the word-based and BPEbased systems, respectively, to maximize the tokenized BLEU on the development set." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-98", "text": "After translation at test time, we rejoined BPE segments, recased, and detokenized." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-99", "text": "Finally, we evaluated using case-sensitive BLEU." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-100", "text": "As a baseline, we trained a child model using BPE but without transfer (that is, with weights randomly initialized)." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-101", "text": "We also compared against a word-based baseline (without transfer) and two word-based systems using transfer without vocabulary-sharing, corresponding with the method of Zoph et al. (2016) ( \u00a72.2): one where the target word embeddings are fine-tuned, and one where they are frozen." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-102", "text": "----------------------------------" }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-105", "text": "In this lowresource setting, we find that transferring wordbased models does not consistently help." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-106", "text": "On Turkish-English, both transfer methods give only a statistically insignificant improvement (p > 0.05); on Uyghur-English, transfer without freezing target embeddings helps somewhat, but transfer with freezing helps only insignificantly." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-107", "text": "In both language pairs, the models that use BPE perform much better than their word-based counterparts." 
}, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-108", "text": "When we apply transfer learning to this much stronger baseline, we find that the relative improvements actually increase; that is, the combined effect of BPE and transfer learning is more than additive." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-109", "text": "On Turkish-English, the improvement is +0.8 BLEU over the BPE baseline; on Uyghur-English, a healthy +4.3 over the BPE baseline." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-110", "text": "A similar pattern emerges when we examine the best BLEU scores on the development set (Figure 1) ." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-111", "text": "Whereas word-based transfer methods help very little for Turkish-English, and help or hurt slightly for Uyghur-English, our BPE-based transfer approach consistently improves over both the baseline and transfer word-based models." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-112", "text": "We surmise that the improvement is primarily due to the vocabulary overlap created by BPE (see Table 4 )." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-113", "text": "----------------------------------" }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-114", "text": "**CONCLUSION**" }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-115", "text": "In this paper, we have shown that the transfer learning method of Zoph et al. (2016) , while appealing, might not always work in a low-resource context." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-116", "text": "However, by combining it with BPE, we can improve NMT performance on a low-resource language pair by exploiting its lexical similarity with another related, low-resource language." }, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-117", "text": "Our results show consistent improvement in two Turkic languages." 
}, { "sent_id": "e2705ae777acffc894f7aa18d42771-C001-118", "text": "Our approach, which relies on segmenting words into subwords, seems well suited to agglutinative languages; further investigation would be needed to confirm whether our method works on other types of languages." } ], "y": { "@MOT@": { "gold_contexts": [ [ "e2705ae777acffc894f7aa18d42771-C001-10", "e2705ae777acffc894f7aa18d42771-C001-9" ], [ "e2705ae777acffc894f7aa18d42771-C001-17" ] ], "cite_sentences": [ "e2705ae777acffc894f7aa18d42771-C001-10", "e2705ae777acffc894f7aa18d42771-C001-17" ] }, "@USE@": { "gold_contexts": [ [ "e2705ae777acffc894f7aa18d42771-C001-17" ], [ "e2705ae777acffc894f7aa18d42771-C001-31" ], [ "e2705ae777acffc894f7aa18d42771-C001-50" ], [ "e2705ae777acffc894f7aa18d42771-C001-101" ], [ "e2705ae777acffc894f7aa18d42771-C001-115" ] ], "cite_sentences": [ "e2705ae777acffc894f7aa18d42771-C001-17", "e2705ae777acffc894f7aa18d42771-C001-31", "e2705ae777acffc894f7aa18d42771-C001-50", "e2705ae777acffc894f7aa18d42771-C001-101", "e2705ae777acffc894f7aa18d42771-C001-115" ] }, "@BACK@": { "gold_contexts": [ [ "e2705ae777acffc894f7aa18d42771-C001-14" ] ], "cite_sentences": [ "e2705ae777acffc894f7aa18d42771-C001-14" ] }, "@EXT@": { "gold_contexts": [ [ "e2705ae777acffc894f7aa18d42771-C001-50" ], [ "e2705ae777acffc894f7aa18d42771-C001-115", "e2705ae777acffc894f7aa18d42771-C001-116" ] ], "cite_sentences": [ "e2705ae777acffc894f7aa18d42771-C001-50", "e2705ae777acffc894f7aa18d42771-C001-115" ] } } }, "ABC_eab79e8aa2cbe6f3aeef0018129208_29": { "x": [ { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-2", "text": "We propose solutions to enhance the Inside-Outside Recursive Neural Network (IORNN) reranker of Le and Zuidema (2014) ." 
}, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-3", "text": "Replacing the original softmax function with a hierarchical softmax using a binary tree constructed by combining output of the Brown clustering algorithm and frequency-based Huffman codes, we significantly reduce the reranker's computational complexity." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-4", "text": "In addition, enriching contexts used in the reranker by adding subtrees rooted at (ancestors') cousin nodes, the accuracy is increased." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-5", "text": "----------------------------------" }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-7", "text": "Using neural networks for syntactic parsing has become popular recently, thanks to promising results that those neural-net-based parsers achieved." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-8", "text": "For constituent parsing, Socher et al. (2013) using a recursive neural network (RNN) got an F1 score close to the state-of-the-art on the Penn WSJ corpus." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-9", "text": "For dependency parsing, the inside-outside recursive neural net (IORNN) reranker proposed by Le and Zuidema (2014) is among the top systems, including the Chen and Manning (2014)'s extremely fast transition-based parser employing a traditional feed-forward neural network." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-10", "text": "There are many reasons why neural-net-based systems perform very well." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-11", "text": "First, Bansal et al. (2014) showed that using word-embeddings can lead to significant improvement for dependency parsing." 
}, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-12", "text": "Interestingly, those neural-net-based systems can transparently and easily employ wordembeddings by initializing their networks with those vectors." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-13", "text": "Second, by comparing a countbased model with their neural-net-based model on perplexity, Le and Zuidema (2014) suggested that predicting with neural nets is an effective solution for the problem of data sparsity." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-14", "text": "Last but not least, as showed in the work of Socher and colleagues on RNNs, e.g. Socher et al. (2013) , neural networks are capable of 'semantic transfer', which is essential for disambiguation." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-15", "text": "We focus on how to enhance the IORNN reranker of Le and Zuidema (2014) by both reducing its computational complexity and increasing its accuracy." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-16", "text": "Although this reranker is among the top systems in accuracy, its computation is very costly due to its softmax function used to compute probabilities of generating tokens: all possible words in the vocabulary are taken into account." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-17", "text": "Solutions for this are to approximate the original softmax function by using a hierarchical softmax (Morin and Bengio, 2005) , noise-contrastive estimation (Mnih and Teh, 2012) , or factorization using classes (Mikolov et al., 2011) ." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-18", "text": "A cost of using those approximations is, however, drop of the system performance." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-19", "text": "To reduce the drop, we use a hierarchical softmax with a binary tree constructed by combining Brown clusters and Huffman codes." 
}, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-20", "text": "We show that, thanks to the reranking framework and the IORNN's ability to overcome the problem of data sparsity, more complex contexts can be employed to generate tokens." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-21", "text": "We introduce a new type of contexts, named full-history." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-22", "text": "By employing both the hierarchical softmax and the new type of context, our new IORNN reranker has significantly lower complexity but higher accuracy than the original reranker." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-23", "text": "----------------------------------" }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-24", "text": "**THE IORNN RERANKER**" }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-25", "text": "We firstly introduce the IORNN reranker (Le and Zuidema, 2014) ." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-26", "text": "----------------------------------" }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-27", "text": "**THE \u221e-ORDER GENERATIVE MODEL**" }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-28", "text": "The reranker employs the generative model proposed by Eisner (1996) ." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-29", "text": "Intuitively, this model is top-down: starting with ROOT, we generate its left dependents and its right dependents." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-30", "text": "We then generate dependents for each ROOT's dependent." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-31", "text": "The generative process recursively continues until there is no dependent to generate." 
}, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-32", "text": "Formally, this model is described by the following formula" }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-33", "text": "where H is the current head, T (N ) is the subtree rooted at N , and C N is the context to generate N ." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-34", "text": "H L , H R are respectively H's left dependents and right dependents, plus EOC (End-Of-Children), a special token to inform that there are no more dependents to generate." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-35", "text": "Thus, P (T (ROOT )) is the probability of generating the entire T ." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-36", "text": "Le and Zuidema's \u221e-order generative model is defined as the Eisner's model in which the context C \u221e D to generate D contains all of D's generated siblings, its ancestors and theirs siblings." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-37", "text": "Because of very large fragments that contexts are allowed to hold, traditional count-based methods are impractical (even if we use smart smoothing techniques)." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-38", "text": "They thus introduced the IORNN architecture to estimate the model." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-39", "text": "----------------------------------" }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-40", "text": "**ESTIMATION WITH THE IORNN**" }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-41", "text": "Each tree node y carries three vectors: inner representation i y , representing y, outer representation o y , representing the full context of y, and partial outer representation\u014d y , representing the partial context C \u221e y which generates the token of y. 
Without loss of generality and ignoring directions for simplicity, we assume that the model is generating dependent y for node h conditioning on context C^\u221e_y (see Figure 1)." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-42", "text": "Under the approximation that the inner representation of a phrase equals the inner representation of its head, and thanks to the recursive definition of full/partial contexts (C^\u221e_y is a combination of C^\u221e_h and y's previously generated sisters), the (partial) outer representations of y are computed as follows." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-43", "text": "Figure 1: The process to (a) generate y, (b) compute outer representation o_y, given head h and sibling x. Black, grey, white boxes are respectively inner, partial outer, and outer representations." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-44", "text": "(Le and Zuidema, 2014) where r = 0 if y is the first dependent of h; otherwise" }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-45", "text": "is the set of y's sisters generated before. And," }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-46", "text": "is the set of y's sisters." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-47", "text": "dr(v) is the dependency relation of v with h. W_{hi/ho/dr(v)} are n \u00d7 n real matrices, and b_o is an n-d real vector." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-48", "text": "The probability P(w | C^\u221e_y) of generating a token w at node y is given by softmax(w, \u014d_y) = e^{u(w, \u014d_y)} /" }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-49", "text": "\u03a3_{w' \u2208 V} e^{u(w', \u014d_y)} (2)" }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-50", "text": "and V is the set of all possible tokens (i.e. vocabulary)." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-51", "text": "W is a |V| \u00d7 n real matrix, b is an |V|-d real vector." 
}, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-52", "text": "----------------------------------" }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-53", "text": "**THE RERANKER**" }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-54", "text": "Le and Zuidema's (mixture) reranker is" }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-55", "text": "where D(S) and s(S, T ) are a k-best list and scores given by a third-party parser, and \u03b1 \u2208 [0, 1]." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-56", "text": "----------------------------------" }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-57", "text": "**REDUCE COMPLEXITY**" }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-58", "text": "The complexity of the IORNN reranker above for computing P (T (ROOT )) is approximately 1" }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-59", "text": "1 Look at Figure 1 , we can see that each node requires four matrix-vector multiplications: two for computing children's (partial) outer representation, one for computing sisters' (partial) outer representations, and one for computing the softmax." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-60", "text": "where l is the length of the given sentence, n and |V | are respectively the dimensions of representations and the vocabulary size (sums of vectors are ignored because their computational cost is small w.r.t l \u00d7 n \u00d7 n)." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-61", "text": "In Le and Zuidema's reranker, |V | \u2248 14000 n = 200." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-62", "text": "It means that the reranker spends most of its time on computing sof tmax(w,\u014d y ) in Equation 2." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-63", "text": "This is also true for the complexity in the training phase." 
}, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-64", "text": "To reduce the reranker's complexity, we need to approximate this softmax." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-65", "text": "Mnih and Teh (2012) propose using the noise-contrastive estimation method which is to force the system to discriminate correct words from randomly chosen candidates (i.e., noise)." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-66", "text": "This approach is very fast in training thanks to fixing normalization factors to one, but slow in testing because normalization factors are explicitly computed." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-67", "text": "Vaswani et al. (2013) use the same approach, and also fix normalization factors to one when testing." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-68", "text": "This, however, doest not guarantee to give us properly normalized probabilities." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-69", "text": "We thus employ the hierarchical sofmax proposed by Morin and Bengio (2005) which is fast in both training and testing and outputs properly normalized probabilities." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-70", "text": "Assume that there is a binary tree whose leaf nodes each correspond to a word in the vocabulary." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-71", "text": "Let (u w 1 , u w 2 , ..., u w L ) be a path from the root to the leaf node w (i.e. u w 1 = root and u w L = w)." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-72", "text": "Let L(u) the left child of node u, and [x] be 1 if x true and \u22121 otherwise." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-73", "text": "We then replace Equation 2 by" }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-74", "text": "where \u03c3(z) = 1/(1 + e \u2212z )." 
}, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-75", "text": "If the binary tree is perfectly balanced, the new complexity is approximately l \u00d7 (3 \u00d7 n \u00d7 n + n \u00d7 log(|V |)), which is less than 4l \u00d7 n \u00d7 n if |V | < 2 n (1.6 \u00d7 10 60 as n = 200 in the Le and Zuidema's reranker)." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-76", "text": "Constructing a binary tree for this hierarchical softmax turns out to be nontrivial." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-77", "text": "Morin and Bengio (2005) relied on WordNet whereas Mikolov et al. (2013) used only frequency-based Huffmann codes." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-78", "text": "In our case, an ideal tree should reflect both semantic similarities between words (e.g. leaf nodes for 'dog' and 'cat' should be close to each other), and word frequencies (since we want to minimize the complexity)." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-79", "text": "Therefore we propose combining output of the Brown hierarchical clustering algorithm (Brown et al., 1992) and frequency-based Huffman codes." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-80", "text": "2 Firstly, we use the Brown algorithm to find c hierarchical clusters (c = 500 in our experiments)." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-81", "text": "3 We then, for each cluster, compute the Huffman code for each word in that cluster." 
}, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-82", "text": "----------------------------------" }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-83", "text": "**ENRICH CONTEXTS**" }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-84", "text": "Although suggesting that predicting with neural networks is a solution to overcome the problem of sparsity, Le and Zuidema's reranker still relies on two widely-used independence assumptions: (i) the two Markov chains that generate dependents in the two directions are independent, given the head, and (ii) non-overlapping subtrees are generated independently." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-85", "text": "4 That is why its partial context (e.g. the red-dashed shape in Figure 2 ) used to generate a node ignores: (i) sisters in the other direction and (ii) ancestors' cousins and their descendants." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-86", "text": "We, in contrast, eliminate those two assumptions by proposing the following top-down left-toright generative story." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-87", "text": "From the head node, we generate its dependents from left to right." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-88", "text": "The partial context to generate a dependent is the whole fragment that is generated so far (e.g. the blue shape in Figure 2 )." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-89", "text": "We then generate subtrees rooted at those nodes also from left to right." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-90", "text": "The full context given to a node to generate the subtree rooted at this node is thus the whole fragment that is generated so far (e.g. the combination of the blue shape and the blue-dotted shape in Figure 2) ." 
}, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-91", "text": "In this way, the model always uses the whole up-to-date fragment to generate a dependent or to generate a subtree rooted at a node." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-92", "text": "To our knowledge, these contexts, which contain full derivation histories, are the most complete ones ever used for graph-based parsing." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-93", "text": "Extending the IORNN reranker in this way is straight-forward." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-94", "text": "For example, we first generate a subtree tr(x) rooted at node x in Figure 3 ." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-95", "text": "We then compute the inner representation for tr(x): if tr(x) contains only x then i tr(x) = i x ; otherwise where S(x) is the set of x's dependents, dr(u) is the dependency relation of u with x, W i h/dr(u) are n \u00d7 n real matrices, and b i is an n-d real vector." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-96", "text": "5" }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-97", "text": "----------------------------------" }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-98", "text": "**EXPERIMENTS**" }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-99", "text": "We use the same setting reported in Le and Zuidema (2014, Section 5.3) ." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-100", "text": "The Penn WSJ Treebank is converted to dependencies using the Yamada and Matsumoto (2003)'s head rules." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-101", "text": "Sections 2-21 are for training, section 22 for development, and section 23 for testing." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-102", "text": "The development and test sets are tagged by the Stanford POStagger trained on the whole training data, whereas 10-way jackknifing is used to generate tags for the training set." 
}, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-103", "text": "For the new IORNN reranker, we set n = 200, initialise it with the 50-dim word embeddings from Collobert et al. (2011) ." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-104", "text": "We use the MSTParser (with the 2nd-order feature mode) (McDonald et al., 2005) to generate k-best lists, and optimize k and \u03b1 (Equation 3) on the development set." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-105", "text": "Table 1 shows the comparison of our new reranker against other systems." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-106", "text": "It is a surprise that our reranker with the proposed hierarchical softmax alone can achieve an almost equivalent score with Le and Zuidema's reranker." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-107", "text": "We conjecture that drawbacks of the hierarchical softmax compared to the original can be lessened by probabilities of generating other elements like POS-tags, System UAS MSTParser (baseline) 92.06 Koo and Collins (2010) 93.04 Zhang and McDonald (2012) 93.06 Martins et al. (2013) 93.07 Bohnet and Kuhn (2012) 93.39 Reranking Hayashi et al. (2013) 93.12 Le and Zuidema (2014) 93.12 Our reranker (h-softmax only, k = 45) 93.10 Our reranker (k = 47)" }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-108", "text": "93.27 dependency relations." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-109", "text": "Adding enriched contexts, our reranker achieves the second best accuracy among those systems." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-110", "text": "Because in this experiment no words have paths longer than 20 n = 200, our new reranker has a significantly lower complexity than the one of Le and Zuidema's reranker." 
}, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-111", "text": "On a computer with an Intel Core-i5 3.3GHz CPU and 8GB RAM, it takes 20 minutes to train this reranker, which is implemented in C++, and 2 minutes to evaluate it on the test set." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-112", "text": "----------------------------------" }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-113", "text": "**CONCLUSION**" }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-114", "text": "Solutions to enhance the IORNN reranker of Le and Zuidema (2014) were proposed." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-115", "text": "We showed that, by replacing the original softmax function with a hierarchical softmax, the reranker's computational complexity significantly decreases." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-116", "text": "The cost of this, which is drop on accuracy, is avoided by enriching contexts with subtrees rooted at (ancestors') cousin nodes." }, { "sent_id": "eab79e8aa2cbe6f3aeef0018129208-C001-117", "text": "The new reranker, according to experimental results on the Penn WSJ Treebank, has even higher accuracy than the old one." 
} ], "y": { "@USE@": { "gold_contexts": [ [ "eab79e8aa2cbe6f3aeef0018129208-C001-2" ], [ "eab79e8aa2cbe6f3aeef0018129208-C001-15" ], [ "eab79e8aa2cbe6f3aeef0018129208-C001-25" ], [ "eab79e8aa2cbe6f3aeef0018129208-C001-44", "eab79e8aa2cbe6f3aeef0018129208-C001-45" ] ], "cite_sentences": [ "eab79e8aa2cbe6f3aeef0018129208-C001-2", "eab79e8aa2cbe6f3aeef0018129208-C001-15", "eab79e8aa2cbe6f3aeef0018129208-C001-25", "eab79e8aa2cbe6f3aeef0018129208-C001-44" ] }, "@EXT@": { "gold_contexts": [ [ "eab79e8aa2cbe6f3aeef0018129208-C001-2", "eab79e8aa2cbe6f3aeef0018129208-C001-3" ], [ "eab79e8aa2cbe6f3aeef0018129208-C001-15" ], [ "eab79e8aa2cbe6f3aeef0018129208-C001-114", "eab79e8aa2cbe6f3aeef0018129208-C001-115" ] ], "cite_sentences": [ "eab79e8aa2cbe6f3aeef0018129208-C001-2", "eab79e8aa2cbe6f3aeef0018129208-C001-15", "eab79e8aa2cbe6f3aeef0018129208-C001-114" ] }, "@BACK@": { "gold_contexts": [ [ "eab79e8aa2cbe6f3aeef0018129208-C001-9" ], [ "eab79e8aa2cbe6f3aeef0018129208-C001-13" ], [ "eab79e8aa2cbe6f3aeef0018129208-C001-44", "eab79e8aa2cbe6f3aeef0018129208-C001-45" ] ], "cite_sentences": [ "eab79e8aa2cbe6f3aeef0018129208-C001-9", "eab79e8aa2cbe6f3aeef0018129208-C001-13", "eab79e8aa2cbe6f3aeef0018129208-C001-44" ] } } }, "ABC_17d44521cfdd351d29b4e5f80d41cd_29": { "x": [ { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-23", "text": "----------------------------------" }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-24", "text": "**THE TRANSITION-BASED PARSING ALGORITHM**" }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-93", "text": "----------------------------------" }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-94", "text": "**FINAL TEST RESULTS**" }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-95", "text": "Our experiments were performed on a Linux platform with a 2GHz CPU." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-96", "text": "The speed of our baseline parser was 50 sentences per second." 
}, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-97", "text": "With all new features added, the speed dropped to 29 sentences per second." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-21", "text": "For the Chinese Treebank, our parser gets a score of 86.0%, the best reported result so far." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-2", "text": "Transition-based dependency parsers generally use heuristic decoding algorithms but can accommodate arbitrarily rich feature representations." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-3", "text": "In this paper, we show that we can improve the accuracy of such parsers by considering even richer feature sets than those employed in previous systems." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-4", "text": "In the standard Penn Treebank setup, our novel features improve attachment score form 91.4% to 92.9%, giving the best results so far for transitionbased parsing and rivaling the best results overall." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-5", "text": "For the Chinese Treebank, they give a signficant improvement of the state of the art." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-6", "text": "An open source release of our parser is freely available." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-7", "text": "----------------------------------" }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-9", "text": "Transition-based dependency parsing (Yamada and Matsumoto, 2003; Nivre et al., 2006b; Zhang and Clark, 2008; Huang and Sagae, 2010 ) utilize a deterministic shift-reduce process for making structural predictions." 
}, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-10", "text": "Compared to graph-based dependency parsing, it typically offers linear time complexity and the comparative freedom to define non-local features, as exemplified by the comparison between MaltParser and MSTParser (Nivre et al., 2006b; McDonald et al., 2005; McDonald and Nivre, 2007) ." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-11", "text": "Recent research has addressed two potential disadvantages of systems like MaltParser." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-12", "text": "In the aspect of decoding, beam-search (Johansson and Nugues, 2007; Zhang and Clark, 2008; Huang et al., 2009 ) and partial dynamic-programming (Huang and Sagae, 2010) have been applied to improve upon greedy one-best search, and positive results were reported." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-13", "text": "In the aspect of training, global structural learning has been used to replace local learning on each decision (Zhang and Clark, 2008; Huang et al., 2009) , although the effect of global learning has not been separated out and studied alone." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-14", "text": "In this short paper, we study a third aspect in a statistical system: feature definition." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-15", "text": "Representing the type of information a statistical system uses to make predictions, feature templates can be one of the most important factors determining parsing accuracy." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-16", "text": "Various recent attempts have been made to include non-local features into graph-based dependency parsing (Smith and Eisner, 2008; Martins et al., 2009; Koo and Collins, 2010) ." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-17", "text": "Transitionbased parsing, by contrast, can easily accommodate arbitrarily complex representations involving nonlocal features." 
}, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-18", "text": "Complex non-local features, such as bracket matching and rhythmic patterns, are used in transition-based constituency parsing (Zhang and Clark, 2009; Wang et al., 2006) , and most transitionbased dependency parsers incorporate some nonlocal features, but current practice is nevertheless to use a rather restricted set of features, as exemplified by the default feature models in MaltParser (Nivre et al., 2006a) ." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-19", "text": "We explore considerably richer feature representations and show that they improve parsing accuracy significantly." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-20", "text": "In standard experiments using the Penn Treebank, our parser gets an unlabeled attachment score of 92.9%, which is the best result achieved with a transition-based parser and comparable to the state of the art." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-22", "text": "188" }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-25", "text": "In a typical transition-based parsing process, the input words are put into a queue and partially built structures are organized by a stack." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-26", "text": "A set of shiftreduce actions are defined, which consume words from the queue and build the output parse." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-27", "text": "Recent research have focused on action sets that build projective dependency trees in an arc-eager (Nivre et al., 2006b; Zhang and Clark, 2008) or arc-standard (Yamada and Matsumoto, 2003; Huang and Sagae, 2010) process." 
}, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-28", "text": "We adopt the arc-eager system 1 , for which the actions are:" }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-29", "text": "\u2022 Shift, which removes the front of the queue and pushes it onto the top of the stack;" }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-30", "text": "\u2022 Reduce, which pops the top item off the stack;" }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-31", "text": "\u2022 LeftArc, which pops the top item off the stack, and adds it as a modifier to the front of the queue;" }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-32", "text": "\u2022 RightArc, which removes the front of the queue, pushes it onto the stack and adds it as a modifier to the top of the stack." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-33", "text": "Further, we follow Zhang and Clark (2008) and Huang et al. (2009) and use the generalized perceptron (Collins, 2002) for global learning and beamsearch for decoding." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-34", "text": "Unlike both earlier globallearning parsers, which only perform unlabeled parsing, we perform labeled parsing by augmenting the LeftArc and RightArc actions with the set of dependency labels." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-35", "text": "Hence our work is in line with Titov and Henderson (2007) in using labeled transitions with global learning." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-36", "text": "Moreover, we will see that label information can actually improve link accuracy." 
}, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-37", "text": "----------------------------------" }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-38", "text": "**FEATURE TEMPLATES**" }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-39", "text": "At each step during a parsing process, the parser configuration can be represented by a tuple S, N, A , where S is the stack, N is the queue of incoming words, and A is the set of dependency arcs that have been built." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-40", "text": "Denoting the top of stack 1 It is very likely that the type of features explored in this paper would be beneficial also for the arc-standard system, although the exact same feature templates would not be applicable because of differences in the parsing order." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-41", "text": "w -word; p -POS-tag." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-42", "text": "distance S 0 wd; S 0 pd; N 0 wd; N 0 pd; S 0 wN 0 wd; S 0 pN 0 pd; valency S 0 wv r ; S 0 pv r ; S 0 wv l ; S 0 pv l ; N 0 wv l ; N 0 pv l ; unigrams S 0h w; S 0h p; S 0 l; S 0l w; S 0l p; S 0l l; S 0r w; S 0r p; S 0r l;N 0l w; N 0l p; N 0l l; third-order S 0h2 w; S 0h2 p; S 0h l; S 0l2 w; S 0l2 p; S 0l2 l; S 0r2 w; S 0r2 p; S 0r2 l; N 0l2 w; N 0l2 p; N 0l2 l; S 0 pS 0l pS 0l2 p; S 0 pS 0r pS 0r2 p; S 0 pS 0h pS 0h2 p; N 0 pN 0l pN 0l2 p; label set S 0 ws r ; S 0 ps r ; S 0 ws l ; S 0 ps l ; N 0 ws l ; N 0 ps l ; w -word; p -POS-tag; v l , v r -valency; ldependency label, s l , s r -labelset." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-43", "text": "with S 0 , the front items from the queue with N 0 , N 1 , and N 2 , the head of S 0 (if any) with S 0h , the leftmost and rightmost modifiers of S 0 (if any) with S 0l and S 0r , respectively, and the leftmost modifier of N 0 (if any) with N 0l , the baseline features are shown in Table 1 ." 
}, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-44", "text": "These features are mostly taken from Zhang and Clark (2008) and Huang and Sagae (2010) , and our parser reproduces the same accuracies as reported by both papers." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-45", "text": "In this table, w and p represents the word and POS-tag, respectively." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-46", "text": "For example, S 0 pN 0 wp represents the feature template that takes the word and POS-tag of N 0 , and combines it with the word of S 0 ." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-47", "text": "189" }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-48", "text": "In this short paper, we extend the baseline feature templates with the following:" }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-49", "text": "Distance between S 0 and N 0 Direction and distance between a pair of head and modifier have been used in the standard feature templates for maximum spanning tree parsing (McDonald et al., 2005) ." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-50", "text": "Distance information has also been used in the easy-first parser of (Goldberg and Elhadad, 2010) ." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-51", "text": "For a transition-based parser, direction information is indirectly included in the LeftArc and RightArc actions." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-52", "text": "We add the distance between S 0 and N 0 to the feature set by combining it with the word and POS-tag of S 0 and N 0 , as shown in Table 2 ." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-53", "text": "It is worth noticing that the use of distance information in our transition-based model is different from that in a typical graph-based parser such as MSTParser." 
}, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-54", "text": "The distance between S 0 and N 0 will correspond to the distance between a pair of head and modifier when an LeftArc action is taken, for example, but not when a Shift action is taken." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-55", "text": "----------------------------------" }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-56", "text": "**VALENCY OF S 0 AND N 0**" }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-57", "text": "The number of modifiers to a given head is used by the graph-based submodel of Zhang and Clark (2008) and the models of Martins et al. (2009) and Sagae and Tsujii (2007) ." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-58", "text": "We include similar information in our model." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-59", "text": "In particular, we calculate the number of left and right modifiers separately, calling them left valency and right valency, respectively." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-60", "text": "Left and right valencies are represented by v l and v r in Table 2 , respectively." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-61", "text": "They are combined with the word and POS-tag of S 0 and N 0 to form new feature templates." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-62", "text": "Again, the use of valency information in our transition-based parser is different from the aforementioned graph-based models." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-63", "text": "In our case, valency information is put into the context of the shift-reduce process, and used together with each action to give a score to the local decision." 
}, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-64", "text": "Unigram information for S 0h , S 0l , S 0r and N 0l The head, left/rightmost modifiers of S 0 and the leftmost modifier of N 0 have been used by most arc-eager transition-based parsers we are aware of through the combination of their POS-tag with information from S 0 and N 0 ." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-65", "text": "Such use is exemplified by the feature templates \"from three words\" in Table 1 ." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-66", "text": "We further use their word and POS-tag information as \"unigram\" features in Table 2 ." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-67", "text": "Moreover, we include the dependency label information in the unigram features, represented by l in the table." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-68", "text": "Unigram label information has been used in MaltParser (Nivre et al., 2006a; Nivre, 2006) ." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-69", "text": "Third-order features of S 0 and N 0 Higher-order context features have been used by graph-based dependency parsers to improve accuracies (Carreras, 2007; Koo and Collins, 2010) ." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-70", "text": "We include information of third order dependency arcs in our new feature templates, when available." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-71", "text": "In Table 2 , S 0h2 , S 0l2 , S 0r2 and N 0l2 refer to the head of S 0h , the second leftmost modifier and the second rightmost modifier of S 0 , and the second leftmost modifier of N 0 , respectively." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-72", "text": "The new templates include unigram word, POS-tag and dependency labels of S 0h2 , S 0l2 , S 0r2 and N 0l2 , as well as POS-tag combinations with S 0 and N 0 ." 
}, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-73", "text": "Set of dependency labels with S 0 and N 0 As a more global feature, we include the set of unique dependency labels from the modifiers of S 0 and N 0 ." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-74", "text": "This information is combined with the word and POS-tag of S 0 and N 0 to make feature templates." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-75", "text": "In Table 2 , s l and s r stands for the set of labels on the left and right of the head, respectively." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-76", "text": "----------------------------------" }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-77", "text": "**EXPERIMENTS**" }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-78", "text": "Our experiments were performed using the Penn Treebank (PTB) and Chinese Treebank (CTB) data." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-79", "text": "We follow the standard approach to split PTB3, using sections 2 -21 for training, section 22 for development and 23 for final testing." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-80", "text": "Bracketed sentences from PTB were transformed into dependency formats using the Penn2Malt tool." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-81", "text": "2 Following Huang and Sagae (2010), we assign POS-tags to the training data using ten-way jackknifing." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-82", "text": "We used our implementation of the Collins (2002) tagger (with 97.3% accuracy on a standard Penn Treebank test) to perform POS-tagging." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-83", "text": "For all experiments, we set the beam size to 64 for the parser, and report unlabeled and labeled attachment scores (UAS, LAS) and unlabeled exact match (UEM) for evaluation." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-84", "text": "Table 4 : Final test accuracies for English." 
}, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-85", "text": "UAS = unlabeled attachment score; UEM = unlabeled exact match; LAS = labeled attachment score." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-86", "text": "Table 3 shows the effect of new features on the development test data for English." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-87", "text": "We start with the baseline features in Table 1 , and incrementally add the distance, valency, unigram, third-order and label set feature templates in Table 2 ." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-88", "text": "Each group of new feature templates improved the accuracies over the previous system, and the final accuracy with all new features was 93.14% in unlabeled attachment score." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-89", "text": "Table 4 shows the final test results of our parser for English." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-90", "text": "We include in the table results from the pure transition-based parser of Zhang and Clark (2008) (row 'Z&C08 transition'), the dynamic-programming arc-standard parser of Huang and Sagae (2010) (row 'H&S10'), and graphbased models including MSTParser (McDonald and Pereira, 2006) , the baseline feature parser of Koo et al. (2008) (row 'K08 baeline') , and the two models of Koo and Collins (2010 ing the highest attachment score reported for a transition-based parser, comparable to those of the best graph-based parsers." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-91", "text": "----------------------------------" }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-92", "text": "**DEVELOPMENT EXPERIMENTS**" }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-98", "text": "As an alternative to Penn2Malt, bracketed sentences can also be transformed into Stanford dependencies (De Marneffe et al., 2006) ." 
}, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-99", "text": "Our parser gave 93.5% UAS, 91.9% LAS and 52.1% UEM when trained and evaluated on Stanford basic dependencies, which are projective dependency trees." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-100", "text": "Cer et al. (2010) report results on Stanford collapsed dependencies, which allow a word to have multiple heads and therefore cannot be produced by a regular dependency parser." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-101", "text": "Their results are relevant although not directly comparable with ours." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-102", "text": "Table 5 shows the results of our final parser, the pure transition-based parser of Zhang and Clark (2008) , and the parser of Huang and Sagae (2010) on Chinese." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-103", "text": "We take the standard split of CTB and use gold segmentation and POS-tags for the input." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-104", "text": "Our scores for this test set are the best reported so far and significantly better than the previous systems." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-105", "text": "----------------------------------" }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-106", "text": "**CHINESE TEST RESULTS**" }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-107", "text": "----------------------------------" }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-108", "text": "**CONCLUSION**" }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-109", "text": "We have shown that enriching the feature representation significantly improves the accuracy of our transition-based dependency parser." 
}, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-110", "text": "The effect of the new features appears to outweigh the effect of combining transition-based and graph-based models, reported by Zhang and Clark (2008) , as well as the effect of using dynamic programming, as in- Huang and Sagae (2010) ." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-111", "text": "This shows that feature definition is a crucial aspect of transition-based pars-191 ing." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-112", "text": "In fact, some of the new feature templates in this paper, such as distance and valency, are among those which are in the graph-based submodel of Zhang and Clark (2008) , but not the transition-based submodel." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-113", "text": "Therefore our new features to some extent achieved the same effect as their model combination." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-114", "text": "The new features are also hard to use in dynamic programming because they add considerable complexity to the parse items." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-115", "text": "Enriched feature representations have been studied as an important factor for improving the accuracies of graph-based dependency parsing also." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-116", "text": "Recent research including the use of loopy belief network (Smith and Eisner, 2008) , integer linear programming (Martins et al., 2009 ) and an improved dynamic programming algorithm (Koo and Collins, 2010) can be seen as methods to incorporate nonlocal features into a graph-based model." }, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-117", "text": "An open source release of our parser, together with trained models for English and Chinese, are freely available." 
}, { "sent_id": "17d44521cfdd351d29b4e5f80d41cd-C001-118", "text": "3" } ], "y": { "@BACK@": { "gold_contexts": [ [ "17d44521cfdd351d29b4e5f80d41cd-C001-9" ], [ "17d44521cfdd351d29b4e5f80d41cd-C001-12" ], [ "17d44521cfdd351d29b4e5f80d41cd-C001-27" ] ], "cite_sentences": [ "17d44521cfdd351d29b4e5f80d41cd-C001-9", "17d44521cfdd351d29b4e5f80d41cd-C001-12", "17d44521cfdd351d29b4e5f80d41cd-C001-27" ] }, "@USE@": { "gold_contexts": [ [ "17d44521cfdd351d29b4e5f80d41cd-C001-27", "17d44521cfdd351d29b4e5f80d41cd-C001-28", "17d44521cfdd351d29b4e5f80d41cd-C001-29", "17d44521cfdd351d29b4e5f80d41cd-C001-30", "17d44521cfdd351d29b4e5f80d41cd-C001-31", "17d44521cfdd351d29b4e5f80d41cd-C001-32" ], [ "17d44521cfdd351d29b4e5f80d41cd-C001-44" ], [ "17d44521cfdd351d29b4e5f80d41cd-C001-81" ], [ "17d44521cfdd351d29b4e5f80d41cd-C001-89", "17d44521cfdd351d29b4e5f80d41cd-C001-90" ], [ "17d44521cfdd351d29b4e5f80d41cd-C001-102" ] ], "cite_sentences": [ "17d44521cfdd351d29b4e5f80d41cd-C001-27", "17d44521cfdd351d29b4e5f80d41cd-C001-44", "17d44521cfdd351d29b4e5f80d41cd-C001-81", "17d44521cfdd351d29b4e5f80d41cd-C001-90", "17d44521cfdd351d29b4e5f80d41cd-C001-102" ] }, "@SIM@": { "gold_contexts": [ [ "17d44521cfdd351d29b4e5f80d41cd-C001-44" ] ], "cite_sentences": [ "17d44521cfdd351d29b4e5f80d41cd-C001-44" ] }, "@DIF@": { "gold_contexts": [ [ "17d44521cfdd351d29b4e5f80d41cd-C001-102", "17d44521cfdd351d29b4e5f80d41cd-C001-104" ], [ "17d44521cfdd351d29b4e5f80d41cd-C001-110" ] ], "cite_sentences": [ "17d44521cfdd351d29b4e5f80d41cd-C001-102", "17d44521cfdd351d29b4e5f80d41cd-C001-110" ] } } }, "ABC_1c86f563ababf5ec3c67cbf259252b_29": { "x": [ { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-2", "text": "Abstractive summarization typically relies on large collections of paired articles and summaries, however parallel data is scarce and costly to obtain." 
}, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-3", "text": "We develop an abstractive summarization system that only relies on having access to large collections of example summaries and non-matching articles." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-4", "text": "Our approach consists of an unsupervised sentence extractor, which selects salient sentences to include in the final summary; as well as a sentence abstractor, trained using pseudo-parallel and synthetic data, which paraphrases each of the extracted sentences." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-5", "text": "We achieve promising results on the CNN/DailyMail benchmark without relying on any article-summary pairs." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-6", "text": "----------------------------------" }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-8", "text": "Text summarization aims to produce a shorter, informative version of an input text." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-9", "text": "While extractive summarization only selects important sentences from the input, abstractive summarization generates content without explicitly re-using whole sentences (Nenkova et al., 2011) resulting summaries that are more fluent." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-10", "text": "In recent years, a number of successful approaches have been proposed for both extractive (Nallapati et al., 2017; Narayan et al., 2018) and abstractive (See et al., 2017; Chen and Bansal, 2018) summarization paradigms." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-11", "text": "State-of-the-art abstractive approaches are supervised, relying on large collections of paired articles and summaries." 
}, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-12", "text": "However, competitive performance of abstractive systems remains a challenge when availability of parallel data is limited, such as in low-resource domains or for languages other than English." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-13", "text": "Even when parallel data is limited, we may have access to example summaries and to large collections of articles on similar topics." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-14", "text": "Examples are blog posts or scientific press releases, for which the original articles may be unavailable or behind a paywall." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-15", "text": "In this paper, we develop a system for abstractive document summarization that only relies on having access to example summaries and nonmatching articles, bypassing the need for largescale parallel corpora." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-16", "text": "Our system consists of two components: An unsupervised sentence extractor first selects salient sentences." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-17", "text": "Each extracted sentence is subsequently paraphrased using a sentence abstractor." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-18", "text": "The abstractor is trained on pseudo-parallel data extracted from raw corpora, as well as on additional synthetic data generated through backtranslation." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-19", "text": "Our approach achieves promising results on the CNN/DailyMail single-document summarization benchmark (Section 5) without relying on any parallel documentsummary pairs." 
}, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-20", "text": "----------------------------------" }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-21", "text": "**BACKGROUND**" }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-22", "text": "Unsupervised summarization has a long history within the extractive summarization paradigm." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-23", "text": "Given an input article a a a = {a 1 , ..., a N } consisting of N sentences, the goal of extractive summarization is to select the K most salient sentences as the output summary, without employing any paraphrasing or fusion." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-24", "text": "A typical approach is to either weigh each sentence with respect to the document as a whole or through an adjacency-based measure of sentence importance (Erkan and Radev, 2004) ." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-25", "text": "In recent years, there have been large advances in supervised abstractive summarization, for headline generation (Rush et al., 2015; Nallapati et al., 2017) as well as for generation of multisentence summaries (See et al., 2017) ." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-26", "text": "State-ofthe-art approaches are typically trained to generate summaries either in a fully end-to-end fashion (See et al., 2017) , processing the entire article at once; or hierarchically, first extracting content and then paraphrasing it sentence-by-sentence (Chen and Bansal, 2018) ." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-27", "text": "Both approaches rely on large collections of article-summary pairs such as the annotated Gigaword (Napoles et al., 2012) or the CNN/DailyMail (Nallapati et al., 2016) dataset." 
}, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-28", "text": "The heavy reliance on manually created resources prohibits the use of abstractive summarization in domains other than news articles, or languages other than English, where parallel data may not be as abundantly available." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-29", "text": "In such areas, extractive summarization often remains the preferred choice." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-30", "text": "Our work focuses on abstractive summarization using large-scale non-parallel resources, such as collections of summaries without matching articles." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-31", "text": "Recently, a number of methods have been proposed to reduce the need for parallel data: through harvesting pseudo-parallel data from raw corpora (Nikolov and Hahnloser, 2019) or by synthesizing data using backtranslation (Sennrich et al., 2016a) ." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-32", "text": "Such methods have been shown to be viable for tasks such as unsupervised machine translation (Lample et al., 2018) , sentence compression (Fevry and Phang, 2018) , and style transfer (Lample et al., 2019) ." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-33", "text": "To the best of our knowledge, this work is the first to extend such methods to single-document summarization in order to generate multi-sentence abstractive summaries in a data-driven fashion." 
}, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-34", "text": "----------------------------------" }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-35", "text": "**APPROACH**" }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-36", "text": "Our system consists of two components: an extractor (Section 3.1) which picks salient sentences to include in the final summary; and an abstractor (Section 3.2) which subsequently paraphrases each of the extracted sentences, rewriting them to meet the target summary style." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-37", "text": "Our approach is similar to (Chen and Bansal, 2018) , except that they use parallel data to train their extractors and abstractors." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-38", "text": "In contrast, during training, we only assume access to example summaries S S S = {s s s 0 , .., s s s M } without matching articles." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-39", "text": "During testing, given an input article a a a = {a 0 , ..., a N } consisting of N sentences, our system is capable of generating a multi-sentence abstractive summary consisting of K sentences (where K is a hyperparameter)." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-40", "text": "----------------------------------" }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-41", "text": "**SENTENCE EXTRACTOR**" }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-42", "text": "The extractor selects the K most salient article sentences to include in the summary." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-43", "text": "We consider two unsupervised variants for the extractor: LEAD picks the first K sentences from the article and returns them as the summary." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-44", "text": "For many datasets, such as CNN/DailyMail, LEAD is a simple but tough baseline to beat, especially using abstractive methods (See et al., 2017) ." 
}, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-45", "text": "Because LEAD may not be the optimal choice for other domains or datasets, we experiment with another unsupervised extractive approach." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-46", "text": "LEXRANK (Erkan and Radev, 2004) represents the input as a highly connected graph, in which vertices represent sentences, and edges between sentences are assigned weights equal to their TF-IDF similarity, provided that the similarity is higher than a predefined threshold t. The centrality of a sentence is then computed using the PageRank algorithm." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-47", "text": "----------------------------------" }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-48", "text": "**SENTENCE ABSTRACTOR**" }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-98", "text": "However, the results are promising: for example, the ROUGE-L gap between our LEAD model and the LSTM model is only 1.8." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-49", "text": "The sentence abstractor (P PM ) is trained to generate a paraphrase s i for every article sentence a i , rewriting it to meet the target sentence style of the summaries." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-50", "text": "We implement P PM as an LSTM encoder-decoder with an attention mechanism (Bahdanau et al., 2014) ." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-51", "text": "Instead on parallel examples of sentences from articles and summaries, the abstractor is trained on a synthetic dataset that is created in two steps:" }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-52", "text": "Pseudo-parallel dataset." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-53", "text": "The first step is to obtain an initial set of pseudo-parallel articlesummary sentence pairs." 
}, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-54", "text": "Because we assume access to example summaries, our approach is to align summary sentences to an external collection of articles in the same format as our target summaries." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-55", "text": "Here, we apply the large-scale alignment method from (Nikolov and Hahnloser, 2019) , which hierarchically aligns documents followed by sentences in the two datasets." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-56", "text": "The alignment is implemented through nearest neighbour search of document and sentence embeddings." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-57", "text": "Backtranslated pairs." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-58", "text": "We use the initial pseudo-parallel dataset to train a backtranslation model P BT (a i |s i ), following (Sennrich et al., 2016a) ." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-59", "text": "The model learns to synthesize \"fake\" article sentences given a summary sentence." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-60", "text": "We use P BT to generate multiple synthetic article sentences for each summary sentence we have available, taking the N top hypotheses predicted by beam search 1 ." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-61", "text": "To train our final sentence paraphrasing model P PM (s i |a i ), we combine all pseudo-parallel and backtranslated pairs into a single dataset of article-summary sentence pairs." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-62", "text": "----------------------------------" }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-63", "text": "**EXPERIMENTAL SET-UP**" }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-64", "text": "Dataset." 
}, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-65", "text": "We use the CNN/DailyMail (CNN/DM) dataset (Hermann et al., 2015) consisting of pairs of news articles from CNN and Daily Mail, along with summaries in the form of bullet points." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-66", "text": "We choose this dataset because it allows us to compare our approach to existing fully supervised methods and to measure the gap between unsupervised and supervised summarization." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-67", "text": "We follow the preprocessing pipeline of (Chen and Bansal, 2018) , splitting the dataset into 287k/11k/11k pairs for training/validation/testing." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-68", "text": "Note that our method relies only on the bullet-point summaries from this training set." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-69", "text": "Obtaining synthetic data." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-70", "text": "To obtain training data for our sentence abstractor P PM , we follow the procedure from Section 3.2." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-71", "text": "We align all summaries from the CNN/DM training set to 8.5M news articles from the Gigaword dataset (Napoles et al., 2012) , which contains no articles from CNN or Daily Mail." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-72", "text": "After alignment 2 , we obtained 1.2M pseudo-parallel pairs which we use to train our backtranslation model P BT ." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-73", "text": "Using P BT , we synthesize 5 article sentences for each of the 1M summary sentences by picking the top 5 beam hypotheses." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-74", "text": "Our best sentence paraphrasing dataset used to train our final abstractor P PM contains 6.7 million pairs, 18% of which are pseudo-parallel pairs and 82% are backtranslated pairs." 
}, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-75", "text": "Implementation details." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-76", "text": "P PM and P BT are both implemented as bidirectional LSTM encoderdecoder models with 256 hidden units, embedding dimension 128, and an attention mechanism (Bahdanau et al., 2014) ." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-77", "text": "We pick this model size to be comparable to recent work (See et al., 2017; Chen and Bansal, 2018) ." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-78", "text": "Our models are initialized and trained separately, but they share the same 50k byte pair encoding (Sennrich et al., 2016b) vocab- 1 We also experimented with sampling (Edunov et al., 2018) but found it to be too noisy in the current setting." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-79", "text": "2 We follow the set-up from (Nikolov and Hahnloser, 2019) using the Sent2Vec embedding method (Pagliardini et al., 2017) for computing document/sentence embeddings." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-80", "text": "We use hyperparameters \u03b8 d = 0.5 and \u03b8s = {0.60, 0.63, 0.67}. ulary extracted from a joint collection of articles and summaries." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-81", "text": "We train both models until convergence with Adam (Kingma and Ba, 2015); P PM uses beam search with a beam of 5 during testing." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-82", "text": "Because both of our extractors are unsupervised, we directly apply them to the CNN/DM articles to select salient sentences." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-83", "text": "We always set K, the number of sentences to be extracted, to 4, which is the average number of summary sentences in the CNN/DM dataset." 
}, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-84", "text": "We additionally tune the similarity threshold t of the LEXRANK extractor on 1k pairs from the CNN/DM valid set." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-85", "text": "Evaluation details." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-86", "text": "We evaluate our systems on the CNN/DM test set, using the recallbased ROUGE-1/2/L metrics (Lin, 2004 ) and the precision-based METEOR (Banerjee and Lavie, 2005) ." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-87", "text": "----------------------------------" }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-88", "text": "**RESULTS**" }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-89", "text": "Baselines." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-90", "text": "We compare our models with a number of supervised and unsupervised baselines." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-91", "text": "LSTM is a standard bidirectional LSTM model, trained to directly generate the CNN/DM summaries from the full CNN/DM articles." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-92", "text": "EXT-ABS is the hierarchical model from (Chen and Bansal, 2018) , consisting of a supervised LSTM extractor and separate abstractor, both of which are individually trained on the CNN/DM dataset by aligning summary to article sentences." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-93", "text": "Our work best resembles EXT-ABS except that we do not rely on any parallel data." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-94", "text": "EXT-ABS-RL is a state-of-theart summarization system that extends EXT-ABS by jointly tuning the two supervised components using reinforcement learning." 
}, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-95", "text": "We additionally report the performance of our unsupervised extractive baselines, LEAD and LEXRANK, as well as the result of an oracle (ORACLE) which computes an upper bound for extractive summarization." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-96", "text": "Automatic evaluation." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-97", "text": "Our best abstractive models trained on non-parallel data (LEAD + ABS and LEXRANK + ABS in Table 1) performed worse than the baselines trained on parallel data." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-99", "text": "Our models generated much shorter summaries than the other systems, indicating that they potentially summarize much more aggressively." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-100", "text": "Model analysis." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-101", "text": "Is the gap between our approach and fully supervised models due to poor" }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-102", "text": "MET is METEOR, while # is the average number of tokens in the summaries." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-103", "text": "\u2020 are from (Chen and Bansal, 2018) ." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-104", "text": "(1) ABSPP-0.63 : cnn is the first time in three years ." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-105", "text": "the other contestants told the price of the price ." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-106", "text": "the game show will be hosted by the tv game show ." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-107", "text": "the game of the game is the first of the show ." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-108", "text": "(2) ABSPP+SYN-5 : a tv legend has returned to the first time in eight years ." 
}, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-109", "text": "contestants told the price of \" the price is right \" bob barker hosted the tv game show for 35 years ." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-110", "text": "the game is the first of the show 's \" lucky seven \" (3) ABSPAR : a tv legend returned to doing what he does best ." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-111", "text": "contestants told to \" come on down ! \" on april 1 edition ." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-112", "text": "he hosted the tv game show for 35 years before stepping down in 2007 ." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-113", "text": "barker handled the first price-guessing game of the show , the classic \" lucky seven \" (2) are trained on pseudo-parallel/synthetic data, while the abstractor in (3) is trained on parallel data." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-114", "text": "(ABSPAR) vs. pseudo-parallel data (ABS PP-\u03b8s , using different sentence alignment thresholds \u03b8s; ABSPP-UB is the upper bound for large-scale alignment) or using a mixture of pseudo-parallel and synthetic data (ABSPP+SYN-N , using the ABSPP-0.63 dataset and backtranslated data from the top N beam hypotheses)." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-115", "text": "We always use the LEAD extractor." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-116", "text": "sentence extraction or inadequate sentence abstraction?" }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-117", "text": "The CNN/DM dataset is a special case, in which, due to the strong performance of LEAD, the main bottleneck is abstraction." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-118", "text": "For datasets in which the first sentences are less salient, alternative approaches to extractive summarization such as LEXRANK become increasingly important." 
}, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-119", "text": "In Table 3 , we compare the effect of training our abstractor on pseudo-parallel datasets of different sizes (ABS PP-* ) as well as on a mixture of pseudoparallel and backtranslated data (ABS PP+SYN-* )." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-120", "text": "For reference, we also include results from aligning the original CNN/DM articles and summaries directly." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-121", "text": "We construct a parallel dataset of sentence pairs by aligning the original CNN/DM document pairs (ABS PAR system); as well as a pseudoparallel dataset, without using the CNN/DM document labels (ABS PP-UB system)." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-122", "text": "ABS PP-UB is an upper bound for large-scale alignment, where the raw dataset of articles perfectly matches the summaries." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-123", "text": "Our best pseudo-parallel abstractor performs poorly in comparison to the result of the parallel abstractor." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-124", "text": "Adding additional synthetic data is helpful but insufficient to compensate the gap and we observe a diminishing improvement from adding synthetic pairs." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-125", "text": "The large-scale alignment method is able to construct a pseudo-parallel upper bound that almost perfectly matches the parallel dataset, indicating that potentially the main bottleneck in our system is the domain difference between the articles in Gigaword and the CNN/DM." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-126", "text": "Example summaries." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-127", "text": "In Table 2 , we also provide example summaries produced by our system." 
}, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-128", "text": "Our final model, trained on additional backtranslated data, produced much more relevant and coherent sentences than the model trained on pseudo-parallel data only." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-129", "text": "Despite having seen no parallel examples, the system is capable of generating fluent, abstractive sentences." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-130", "text": "However, in comparison to the abstractor trained on parallel data, there is still room for further improvement." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-131", "text": "----------------------------------" }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-132", "text": "**CONCLUSION**" }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-133", "text": "We developed an abstractive summarization system that does not rely on any parallel resources, but can instead be trained using example summaries and a large collection of non-matching articles, making it particularly relevant to lowresource domains and languages." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-134", "text": "Perhaps surprisingly, our system performs competitively to fully supervised models." }, { "sent_id": "1c86f563ababf5ec3c67cbf259252b-C001-135", "text": "Future work will focus on developing novel unsupervised extractors; on decreasing the gap between abstractors trained on parallel and non-parallel data; as well as on developing methods for combining the abstractor and extractor into a single system." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "1c86f563ababf5ec3c67cbf259252b-C001-10" ], [ "1c86f563ababf5ec3c67cbf259252b-C001-26" ], [ "1c86f563ababf5ec3c67cbf259252b-C001-92" ] ], "cite_sentences": [ "1c86f563ababf5ec3c67cbf259252b-C001-10", "1c86f563ababf5ec3c67cbf259252b-C001-26", "1c86f563ababf5ec3c67cbf259252b-C001-92" ] }, "@SIM@": { "gold_contexts": [ [ "1c86f563ababf5ec3c67cbf259252b-C001-37" ], [ "1c86f563ababf5ec3c67cbf259252b-C001-77" ], [ "1c86f563ababf5ec3c67cbf259252b-C001-92", "1c86f563ababf5ec3c67cbf259252b-C001-93" ] ], "cite_sentences": [ "1c86f563ababf5ec3c67cbf259252b-C001-37", "1c86f563ababf5ec3c67cbf259252b-C001-77", "1c86f563ababf5ec3c67cbf259252b-C001-92" ] }, "@DIF@": { "gold_contexts": [ [ "1c86f563ababf5ec3c67cbf259252b-C001-37" ], [ "1c86f563ababf5ec3c67cbf259252b-C001-92", "1c86f563ababf5ec3c67cbf259252b-C001-93" ] ], "cite_sentences": [ "1c86f563ababf5ec3c67cbf259252b-C001-37", "1c86f563ababf5ec3c67cbf259252b-C001-92" ] }, "@USE@": { "gold_contexts": [ [ "1c86f563ababf5ec3c67cbf259252b-C001-67" ], [ "1c86f563ababf5ec3c67cbf259252b-C001-77" ], [ "1c86f563ababf5ec3c67cbf259252b-C001-92", "1c86f563ababf5ec3c67cbf259252b-C001-93" ] ], "cite_sentences": [ "1c86f563ababf5ec3c67cbf259252b-C001-67", "1c86f563ababf5ec3c67cbf259252b-C001-77", "1c86f563ababf5ec3c67cbf259252b-C001-92" ] } } }, "ABC_d0fa481abaf6d1b5529e40ff73f00a_29": { "x": [ { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-2", "text": "Interest in neural machine translation has grown rapidly as its effectiveness has been demonstrated across language and data scenarios." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-3", "text": "New research regularly introduces architectural and algorithmic improvements that lead to significant gains over \"vanilla\" NMT implementations." 
}, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-4", "text": "However, these new techniques are rarely evaluated in the context of previously published techniques, specifically those that are widely used in state-of-theart production and shared-task systems." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-5", "text": "As a result, it is often difficult to determine whether improvements from research will carry over to systems deployed for real-world use." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-6", "text": "In this work, we recommend three specific methods that are relatively easy to implement and result in much stronger experimental systems." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-7", "text": "Beyond reporting significantly higher BLEU scores, we conduct an in-depth analysis of where improvements originate and what inherent weaknesses of basic NMT models are being addressed." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-8", "text": "We then compare the relative gains afforded by several other techniques proposed in the literature when starting with vanilla systems versus our stronger baselines, showing that experimental conclusions may change depending on the baseline chosen." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-9", "text": "This indicates that choosing a strong baseline is crucial for reporting reliable experimental results." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-10", "text": "----------------------------------" }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-11", "text": "**INTRODUCTION**" }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-12", "text": "In the relatively short time since its introduction, neural machine translation has risen to prominence in both academia and industry." 
}, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-13", "text": "Neural models have consistently shown top performance in shared evaluation tasks (Bojar et al., 2016; Cettolo et al., 2016) and are becoming the technology of choice for commercial MT service providers (Wu et al., 2016; Crego et al., 2016) ." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-14", "text": "New work from the research community regularly introduces model extensions and algorithms that show significant gains over baseline NMT." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-15", "text": "However, the continuous improvement of real-world translation systems has led to a substantial performance gap between the first published neural translation models and the current state of the art." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-16", "text": "When promising new techniques are only evaluated on very basic NMT systems, it can be difficult to determine how much (if any) improvement will carry over to stronger systems; is new work actually solving new problems or simply re-solving problems that have already been addressed elsewhere?" }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-17", "text": "In this work, we recommend three specific techniques for strengthening NMT systems and empirically demonstrate how their use improves reliability of experimental results." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-18", "text": "We analyze in depth how these techniques change the behavior of NMT systems by addressing key weaknesses and discuss how these findings can be used to understand the effect of other types of system extensions." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-19", "text": "Our recommended techniques include: (1) a training approach using Adam with multiple restarts and learning rate annealing, (2) sub-word translation via byte pair encoding, and (3) decoding with ensembles of independently trained models." 
}, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-20", "text": "We begin the paper content by introducing a typical NMT baseline system as our experimental starting point ( \u00a72.1)." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-21", "text": "We then present and examine the effects of each recommended technique: Adam with multiple restarts and step size annealing ( \u00a73), byte pair encoding ( \u00a74), and independent model ensembling ( \u00a75)." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-22", "text": "We show that combining these techniques can lead to a substantial improvement of over 5 BLEU ( \u00a76) and that results for several previously published techniques can dramati-cally differ (up to being reversed) when evaluated on stronger systems ( \u00a76.2)." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-23", "text": "We then conclude by summarizing our findings ( \u00a77)." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-24", "text": "----------------------------------" }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-25", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-26", "text": "----------------------------------" }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-27", "text": "**TRANSLATION SYSTEM**" }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-28", "text": "Our starting point for experimentation is a standard baseline neural machine translation system implemented using the Lamtram 1 and DyNet 2 toolkits (Neubig, 2015; Neubig et al., 2017) ." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-29", "text": "This system uses the attentional encoder-decoder architecture described by Bahdanau et al. (2015) , building on work by Sutskever et al. (2014) ." 
}, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-30", "text": "The translation model uses a bi-directional encoder with a single LSTM layer of size 1024, multilayer perceptron attention with a layer size of 1024, and word representations of size 512." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-31", "text": "Translation models are trained until perplexity convergence on held-out data using the Adam algorithm with a maximum step size of 0.0002 (Kingma and Ba, 2015; Wu et al., 2016) ." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-32", "text": "Maximum training sentence length is set to 100 words." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-33", "text": "Model vocabulary is limited to the top 50K source words and 50K target words by frequency, with all others mapped to an unk token." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-34", "text": "A post-processing step replaces any unk tokens in system output by attempting a dictionary lookup 3 of the corresponding source word (highest attention score) and backing off to copying the source word directly (Luong et al., 2015) ." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-35", "text": "Experiments in each section evaluate this system against incremental extensions such as improved model vocabulary or training algorithm." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-36", "text": "Evaluation is conducted by average BLEU score over multiple independent training runs (Papineni et al., 2002; Clark et al., 2011) ." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-37", "text": "----------------------------------" }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-38", "text": "**DATA SETS**" }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-39", "text": "We evaluate systems on a selection of public data sets covering a range of data sizes, language directions, and morphological complexities." 
}, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-40", "text": "These sets, described in Table 1 3 Training Algorithms" }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-41", "text": "----------------------------------" }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-42", "text": "**BACKGROUND**" }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-43", "text": "The first neural translation models were optimized with stochastic gradient descent (Sutskever et al., 2014) ." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-44", "text": "After training for several epochs with a fixed learning rate, the rate is halved at prespecified intervals." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-45", "text": "This widely used rate \"annealing\" technique takes large steps to move parameters from their initial point to a promising part of the search space followed by increasingly smaller steps to explore that part of the space for a good local optimum." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-46", "text": "While effective, this approach can be time consuming and relies on hand-crafted learning schedules that may not generalize to different models and data sets." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-47", "text": "To eliminate the need for schedules, subsequent NMT work trained models using the Adadelta algorithm, which automatically and continuously adapts learning rates for individual parameters during training (Zeiler, 2012) ." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-48", "text": "Model performance is reported to be equivalent to SGD with annealing, though training still takes a considerable amount of time (Bahdanau et al., 2015; Sennrich et al., 2016b) ." 
}, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-49", "text": "More recent work seeks to accelerate training with the Adam algorithm, which applies momentum on a per-parameter basis and automatically adapts step size subject to a user-specified maximum (Kingma and Ba, 2015) ." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-50", "text": "While this can lead to much faster convergence, the resulting models are shown to slightly underperform compared to annealing SGD (Wu et al., 2016) ." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-51", "text": "However, Adam's speed and reputation of generally being \"good enough\" have made it a popular choice for researchers and NMT toolkit authors 6 (Arthur et al., 2016; Lee et al., 2016; Britz et al., 2017; Sennrich et al., 2017) ." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-52", "text": "While differences in automatic metric scores between SGD and Adam-trained systems may be relatively small, they raise the more general question of training effectiveness." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-53", "text": "In the following section, we explore the relative quality of the optima found by these training algorithms." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-54", "text": "----------------------------------" }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-55", "text": "**RESULTS AND ANALYSIS**" }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-56", "text": "To compare the behavior of SGD and Adam, we conduct training experiments with all data sets listed in \u00a72.2." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-57", "text": "For each set, we train instances of the baseline model described in \u00a72.1 with both optimizers using empirically effective initial settings." 
}, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-58", "text": "7 In the only departure from the described baseline, we use a byte-pair encoded vocabulary with 32K merge operations in place of a limited full-word vocabulary, leading to faster training and higher metric scores (see experiments in \u00a74)." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-59", "text": "For SGD, we begin with a learning rate of 0.5 and train the model to convergence as measured by dev set perplexity." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-60", "text": "We then halve the learning rate and restart training from the best previous point." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-61", "text": "This continues until training has been run a total of 5 times." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-62", "text": "The choice of training to convergence is made both to avoid the need for handcrafted learning schedules and to give the optimizers a better chance to find good neighborhoods to explore." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-63", "text": "For Adam, we use a learning rate (maximum step size) of 0.0002." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-64", "text": "While Adam's use of momentum can be considered a form of \"self-annealing\", we also evaluate the novel extension of explicitly annealing the maximum step size by applying the same halving and restarting process used for SGD." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-65", "text": "It is important to note that while restarting SGD has no effect beyond changing the learning rate, restarting Adam causes the optimizer to \"forget\" the per-parameter learning rates and start fresh." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-66", "text": "For all training, we use a mini-batch size of 512 words." 
}, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-67", "text": "8 For WMT systems, we evaluate dev 6 Adam is the default optimizer for the Lamtram, Nematus (https://github.com/rsennrich/nematus), and Marian toolkits (https://github.com/amunmt/ marian)." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-68", "text": "7 Learning rates of 0.5 for SGD and 0.0002 for Adam or very similar are shown to work well in NMT implementations including GNMT (Wu et al., 2016) , Nematus, Marian, and OpenNMT (http://opennmt.net)." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-69", "text": "8 For each mini-batch, sentences are added until the word set perplexity every 50K training sentences for the first training run and every 25K sentences for subsequent runs." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-70", "text": "For IWSLT systems, we evaluate every 25K sentences and then every 6,250 sentences." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-71", "text": "Training stops when no improvement in perplexity has been seen in 20 evaluations." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-72", "text": "For each experimental condition, we conduct 3 independent optimizer runs and report averaged metric scores." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-73", "text": "All training results are visualized in Figure 1 ." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-74", "text": "Our first observation is that these experiments are largely in concert with prior work: Adam without annealing (first point) is significantly faster than SGD with annealing (last point) and often comparable or slightly worse in accuracy, with the exception of Czech-English where SGD underperforms." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-75", "text": "However, Adam with just 2 restarts and SGD-style rate annealing is actually both faster than the fully annealed SGD and obtains significantly better results in both perplexity and BLEU." 
}, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-76", "text": "We conjecture that the reason for this is twofold." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-77", "text": "First, while Adam has the ability to automatically adjust its learning rate, like SGD it still benefits from an explicit adjustment when it has begun to overfit." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-78", "text": "Second, Adam's adaptive learning rates tend to reduce to sub-optimally low values as training progresses, leading to getting stuck in a local optimum." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-79", "text": "Restarting training when reducing the learning rate helps jolt the optimizer out of this local optimum and continue to find parameters that are better globally." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-80", "text": "----------------------------------" }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-81", "text": "**SUB-WORD TRANSLATION**" }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-82", "text": "----------------------------------" }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-83", "text": "**BACKGROUND**" }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-84", "text": "Unlike phrase-based approaches, neural translation models must limit source and target vocabulary size to keep computational complexity manageable." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-85", "text": "Basic models typically include the most frequent words (30K-50K) plus a single unk token to which all other words are mapped." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-86", "text": "As described in \u00a72.1, unk words generated by the NMT system are translated in post-processing by dictionary lookup or pass-through, often with significantly degraded quality (Luong et al., 2015) ." 
}, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-87", "text": "Realworld NMT systems frequently sidestep this problem with sub-word translation, where models operate on a fixed number of word pieces that can be chained together to form words in an arbitrarcount is reached." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-88", "text": "Counting words versus sentences leads to more uniformly-sized mini-batches." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-89", "text": "We choose the size of 512 based on contrastive experiments that found it to be the best balance between speed and effectiveness of updates during training." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-90", "text": "ily large vocabulary." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-91", "text": "In this section, we examine the impact of sub-words on NMT, specifically when using the technique of byte pair encoding (Sennrich et al., 2016b) ." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-92", "text": "Given the full parallel corpus (concatenation of source and target sides), BPE first splits all words into individual characters and then begins merging the most frequently adjacent pairs." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-93", "text": "Merged pairs become single units that are candidates for further merging and the process continues to build larger word pieces for a fixed number of operations." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-94", "text": "The final result is an encoded corpus where the most frequent words are single pieces and less frequent words are split into multiple, higher frequency pieces." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-95", "text": "At test time, words are split using the operations learned during training, allowing the model to translate with a nearly open vocabulary." 
}, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-96", "text": "9 The model vocabulary size grows with and is limited by the number of merge operations." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-97", "text": "While prior work has focused on using sub-words as a method for translating 9 It is possible that certain intermediate word pieces will not appear in the encoded training data (and thus the model's vocabulary) if all occurrences are merged into larger units." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-98", "text": "If these pieces appear in test data and are not merged, they will be true OOVs for the model." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-99", "text": "For this reason, we map singleton word pieces in the training data to unk so the model has some ability to handle these cases (dictionary lookup or pass-through)." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-100", "text": "Table 2 : BLEU scores for training NMT models with full word and byte pair encoded vocabularies." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-101", "text": "Full word models limit vocabulary size to 50K." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-102", "text": "All models are trained with annealing Adam and scores are averaged over 3 optimizer runs." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-103", "text": "----------------------------------" }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-104", "text": "**WMT IWSLT DE-EN EN-FI RO-EN EN-FR CS-EN**" }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-105", "text": "unseen words in morphologically rich languages (Sennrich et al., 2016b) or reducing model size (Wu et al., 2016) , we examine how using BPE actually leads to broad improvement by addressing inherent weaknesses of word-level NMT." 
}, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-106", "text": "----------------------------------" }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-107", "text": "**RESULTS AND ANALYSIS**" }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-108", "text": "We measure the effects of byte pair encoding by training full-word and BPE systems for all data sets as described in \u00a72.1 with the incremental improvement of using Adam with rate annealing ( \u00a73)." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-109", "text": "As Wu et al. (2016) show different levels of effectiveness for different sub-word vocabulary sizes, we evaluate running BPE with 16K and 32K merge operations." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-110", "text": "As shown in Table 2 , sub-word systems outperform full-word systems across the board, despite having fewer total parameters." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-111", "text": "Systems built on larger data generally benefit from larger vocabularies while smaller systems perform better with smaller vocabularies." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-112", "text": "Based on these results, we recommend 32K as a generally effective vocabulary size and 16K as a contrastive condition when building systems on less than 1 million parallel sentences." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-113", "text": "To understand the origin of these improvements, we divide the words in each test set into classes based on how the full-word and BPE models handle them and report the unigram F-1 score for each model on each class." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-114", "text": "We also plot the fullword and BPE vocabularies for context." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-115", "text": "As shown in Figure 2 , performance is comparable for the most frequent words that both models represent as single units." 
}, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-116", "text": "The identical shapes on the leftmost part of each vocabulary plot indicate that the two systems have the same number of training instances from which to learn translations." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-117", "text": "For words that are split in the BPE model, performance is tied to data sparsity." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-118", "text": "With larger data, performance is comparable as both models have enough training instances to learn reliable statistics; with smaller data or morphologically rich languages such as Finnish, significant gains can be realized by modeling multiple higher-frequency sub-words in place of a single lower-frequency word." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-119", "text": "This can be seen as effectively moving to the left in the vocabulary plot where translations are more reliable." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-120", "text": "In the next category of words beyond the 50K cutoff, the BPE system's ability to actually model rare words leads to consistent improvement over the full-word system's reliance on dictionary substitution." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-121", "text": "The final two categories evaluate handling of true out-of-vocabulary items." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-122", "text": "For OOVs that should be translated, the full-word system will always score zero, lacking any mechanism for producing words not in its vocabulary or dictionary." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-123", "text": "The more interesting result is in the relatively low scores for OOVs that should simply be copied from source to target." 
}, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-124", "text": "While phrase-based systems can reliably pass OOVs through 1:1, fullword neural systems must generate unk tokens and correctly map them to source words using attention scores." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-125", "text": "Differences in source and target true vocabulary sizes and frequency distributions often lead to different numbers of unk tokens in source and target sentences, resulting in models that are prone to over or under-generating unks at test time." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-126", "text": "BPE systems address these weaknesses, although their performance is not always intuitive." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-127", "text": "While some OOVs are successfully translated using word pieces, overall scores are still quite low, indicating only limited success for the notion of open vocabulary translation often associated with sub-word NMT." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-128", "text": "However, the ability to learn when to self-translate sub-words 10 leads to significant gains in pass-through accuracy." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-129", "text": "In summary, our analysis indicates that while BPE does lead to smaller, faster models, it also significantly improves translation quality." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-130", "text": "Rather than being limited to only rare and unseen words, modeling higher-frequency sub-words in place of lower-frequency full words can lead to significant improvement across the board." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-131", "text": "The specific improvement in pass-through OOV handling can be particularly helpful for handling named entities and open-class items such as numbers and URLs without additional dedicated techniques." 
}, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-132", "text": "----------------------------------" }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-133", "text": "**ENSEMBLES AND MODEL DIVERSITY**" }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-134", "text": "The final technique we explore is the combination of multiple translation models into a single, more powerful ensemble by averaging their predictions at the word level." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-135", "text": "The idea of ensemble averaging is well understood and widely used across machine learning fields and work from the earliest encoder-decoder papers to the most recent system descriptions reports dramatic improvements in BLEU scores for model ensembles (Sutskever et al., 2014; Sennrich et al., 2016a) ." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-136", "text": "While this technique is conceptually simple, it requires training and decoding with multiple translation models, often at significant resource costs." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-137", "text": "However, these costs are either mitigated or justified when building real-world systems or evaluating techniques that should be applicable to those systems." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-138", "text": "Decoding costs can be reduced by using knowledge distillation techniques to train a single, compact model to replicate the output of an ensemble (Hinton et al., 2015; Kuncoro et al., 2016; Kim and Rush, 2016) ." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-139", "text": "Researchers can skip this timeconsuming step, evaluating the ensemble directly, while real-world system engineers can rely on it to make deployment of ensembles practical." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-140", "text": "To re- Left figures: Source vocabulary visualizations for NMT training data using full words and bytepair encoded tokens." 
}, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-141", "text": "The number of merge operations is set to either 32K or 16K, chosen by best BLEU score." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-142", "text": "BPE reduces vocabulary size by 1-2 orders of magnitude and allows models to cover the entire training corpus." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-143", "text": "Full-word systems for all scenarios use a much larger vocabulary size of 50K (labeled horizontal line) that leaves much of the total vocabulary uncovered." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-144", "text": "Right figures: Class-wise test set unigram F1 scores for NMT systems using full words and bytepair encoded tokens." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-145", "text": "Scores are reported separately for the following classes: words in the vocabulary of both the full-word and BPE models (Full), words in the vocabulary of the fullword model that are split in the BPE model (Split), words outside the vocabulary of the full-word model but covered by its dictionary (Dict), words outside the vocabulary of the full-word model and its dictionary that should be translated (OOV-T), and words outside the vocabulary of the fullword model and its dictionary that should be passed through (OOV-P)." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-146", "text": "All reported scores are averaged over 3 independent optimizer runs." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-147", "text": "Table 3 : Test set BLEU scores for \"vanilla\" NMT (full words and standard Adam), and our recommended systems (byte pair encoding and annealing Adam, with and without ensembling)." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-148", "text": "Scores for single models are averaged over 3 independent optimizer runs while scores for ensembles are the result of combining 3 runs." 
}, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-149", "text": "duce training time, some work ensembles different training checkpoints of the same model rather than using fully independent models (Jean et al., 2015; Sennrich et al., 2016a) ." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-150", "text": "While checkpoint ensembling is shown to be effective for improving BLEU scores under resource constraints, it does so with less diverse models." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-151", "text": "As discussed in recent work and demonstrated in our experiments in \u00a76, model diversity is a key component in building strong NMT ensembles (Jean et al., 2015; Sennrich et al., 2016a; Farajian et al., 2016) ." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-152", "text": "For these reasons, we recommend evaluating new techniques on systems that ensemble multiple independently trained models for the most reliable results." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-153", "text": "Results showing both the effectiveness of ensembles and the importance of model diversity are included in the larger experiments conducted in the next section." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-154", "text": "6 On Trustable Evaluation" }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-155", "text": "----------------------------------" }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-156", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-157", "text": "In this section, we evaluate and discuss the effects that choice of baseline can have on experimental conclusions regarding neural MT systems." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-158", "text": "First, we build systems that include Adam with rate annealing, byte pair encoding, and independent model ensembling and compare them to the vanilla baselines described in \u00a72.1." 
}, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-159", "text": "As shown in Table 3 , combining these techniques leads to a consistent improvement of 4-5 BLEU points across all scenarios." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-160", "text": "These improvements are the result of addressing several underlying weaknesses of basic NMT models as described in previous sections, leading to systems that behave much closer to those deployed for real-world tasks." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-161", "text": "Next, to empirically demonstrate the importance of evaluating new methods in the context of these stronger systems, we select several tech- Table 4 : Dropout: Apply the improved dropout technique for sequence models described by Gal and Ghahramani (2016) to LSTM layers with a rate of 0.2." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-162", "text": "We find this version to significantly outperform standard dropout." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-163", "text": "Lexicon bias: Incorporate scores from a pretrained lexicon (fast align model learned on the same data) directly as additional weights when selecting output words (Arthur et al., 2016) ." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-164", "text": "Target word lexicon scores are computed as weighted sums over source words based on attention scores." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-165", "text": "Pre-translation: Translate source sentences with a traditional phrase-based system trained on the same data." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-166", "text": "Input for the neural system is the original source sentence concatenated with the PBMT output ." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-167", "text": "Input words are prefixed with either s or t to denote source or target language." 
}, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-168", "text": "We improve performance with a novel extension where word alignments are used to weave together source and PBMT output so that each original word is immediately followed by its suggested translation from the phrase-based system." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-169", "text": "As pre-translation doubles source vocabulary size and input length, we only apply it to sub-word systems to keep complexity reasonable." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-170", "text": "Data bootstrapping: Expand training data by extracting phrase pairs (sub-sentence translation examples) and including them as additional training instances ." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-171", "text": "We apply a novel extension where we train a phrase-based system and use it to re-translate the training data, providing a near-optimal phrase segmentation as a byproduct." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-172", "text": "We use these phrases in place of the heuristically chosen phrases in the original work, improving coverage and leading to more fine-grained translation examples." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-173", "text": "----------------------------------" }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-174", "text": "**EXPERIMENTAL RESULTS**" }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-175", "text": "The immediately noticeable trend from Table 4 is that while all techniques improve basic systems, only a single technique, data bootstrapping, improves the fully strengthened system for both data sets (and barely so)." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-176", "text": "This can be attributed to a mix of redundancy and incompatibility between the improvements we've discussed in previous sections and the techniques evaluated here." 
}, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-177", "text": "Lexicon bias and pre-translation both incorporate scores from pre-trained models that are shown to improve handling of rare words." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-178", "text": "When NMT models are sub-optimally trained, they can benefit from the suggestions of a better-trained model." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-179", "text": "When full-word NMT models struggle to learn translations for infrequent words, they can learn to simply trust the lexical or phrase-based model." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-180", "text": "However, when annealing Adam and BPE alleviate these underlying problems, the neural model's accuracy can match or exceed that of the pretrained model, making external scores either completely redundant or (in the worst case) harmful bias that must be overcome to produce correct translations." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-181", "text": "While pre-translation fares better than lexicon bias, it suffers a reversal in one scenario and a significant degradation in the other when moving from a single model to an ensemble." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-182", "text": "Even when bias from an external model improves translation, it does so at the cost of diversity by pushing the neural model's preferences toward those of the pre-trained model." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-183", "text": "These results further validate claims of the importance of diversity in model ensembles." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-184", "text": "Applying dropout significantly improves all configurations of the Czech-English system and some configurations of the English-French system, leveling off with the strongest." 
}, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-185", "text": "This trend follows previous work showing that dropout combats overfitting of small data, though the point of inflection is worth noting (Sennrich et al., 2016a; Wu et al., 2016) ." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-186", "text": "Even though the English-French data is still relatively small (220K sentences), BPE leads to a smaller vocabulary of more general translation units, effectively reducing sparsity, while annealing Adam can avoid getting stuck in poor local optima." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-187", "text": "These techniques already lead to better generalization without the need for dropout." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-188", "text": "Finally, we can observe a few key properties of data bootstrapping, the best performing technique on fully strengthened systems." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-189", "text": "Unlike lexicon bias and pre-translation, it modifies only the training data, allowing \"purely neural\" models to be learned from random initialization points." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-190", "text": "This preserves model diversity, allowing ensembles to benefit as well as single models." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-191", "text": "Further, data bootstrapping is complementary to annealing Adam and BPE; better optimization and a more general vocabulary can make better use of the new training instances." 
}, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-192", "text": "While evaluation on simple vanilla NMT systems would indicate that all of the techniques in this section lead to significant improvement for both data sets, only evaluation on systems using annealing Adam, byte pair encoding, and independent model ensembling reveals both the reversals of results on state-of-the-art systems and nuanced interactions between techniques that we have reported." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-193", "text": "Based on these results, we highly recommend evaluating new techniques on systems that are at least this strong and representative of those deployed for real-world use." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-194", "text": "----------------------------------" }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-195", "text": "**CONCLUSION**" }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-196", "text": "In this work, we have empirically demonstrated the effectiveness of Adam training with multiple restarts and step size annealing, byte pair encoding, and independent model ensembling both for improving BLEU scores and increasing the reliability of experimental results." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-197", "text": "Out of four previously published techniques for improving vanilla NMT, only one, data bootstrapping via phrase extraction, also improves a fully strengthened model across all scenarios." }, { "sent_id": "d0fa481abaf6d1b5529e40ff73f00a-C001-198", "text": "For these reasons, we recommend evaluating new model extensions and algorithms on NMT systems at least as strong as those we have described for maximally trustable results." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "d0fa481abaf6d1b5529e40ff73f00a-C001-13" ], [ "d0fa481abaf6d1b5529e40ff73f00a-C001-31" ], [ "d0fa481abaf6d1b5529e40ff73f00a-C001-50" ] ], "cite_sentences": [ "d0fa481abaf6d1b5529e40ff73f00a-C001-13", "d0fa481abaf6d1b5529e40ff73f00a-C001-31", "d0fa481abaf6d1b5529e40ff73f00a-C001-50" ] }, "@USE@": { "gold_contexts": [ [ "d0fa481abaf6d1b5529e40ff73f00a-C001-28", "d0fa481abaf6d1b5529e40ff73f00a-C001-31" ], [ "d0fa481abaf6d1b5529e40ff73f00a-C001-68" ], [ "d0fa481abaf6d1b5529e40ff73f00a-C001-109" ] ], "cite_sentences": [ "d0fa481abaf6d1b5529e40ff73f00a-C001-31", "d0fa481abaf6d1b5529e40ff73f00a-C001-68", "d0fa481abaf6d1b5529e40ff73f00a-C001-109" ] }, "@DIF@": { "gold_contexts": [ [ "d0fa481abaf6d1b5529e40ff73f00a-C001-50" ] ], "cite_sentences": [ "d0fa481abaf6d1b5529e40ff73f00a-C001-50" ] }, "@MOT@": { "gold_contexts": [ [ "d0fa481abaf6d1b5529e40ff73f00a-C001-50" ] ], "cite_sentences": [ "d0fa481abaf6d1b5529e40ff73f00a-C001-50" ] }, "@SIM@": { "gold_contexts": [ [ "d0fa481abaf6d1b5529e40ff73f00a-C001-109" ], [ "d0fa481abaf6d1b5529e40ff73f00a-C001-185" ] ], "cite_sentences": [ "d0fa481abaf6d1b5529e40ff73f00a-C001-109", "d0fa481abaf6d1b5529e40ff73f00a-C001-185" ] } } }, "ABC_7176d3dd72e781dca42f8c146d062d_29": { "x": [ { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-2", "text": "In this paper we present an extension of a successful simple and effective method for extracting parallel sentences from comparable corpora and we apply it to an Arabic/English NIST system." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-3", "text": "We experiment with a new TERp filter, along with WER and TER filters." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-4", "text": "We also report a comparison of our approach with that of (Munteanu and Marcu, 2005) using exactly the same corpora and show performance gain by using much lesser data." 
}, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-5", "text": "Our approach employs an SMT system built from small amounts of parallel texts to translate the source side of the nonparallel corpus." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-6", "text": "The target side texts are used, along with other corpora, in the language model of this SMT system." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-7", "text": "We then use information retrieval techniques and simple filters to create parallel data from a comparable news corpora." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-8", "text": "We evaluate the quality of the extracted data by showing that it significantly improves the performance of an SMT systems." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-9", "text": "----------------------------------" }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-11", "text": "Parallel corpora, a requisite resource for Statistical Machine Translation (SMT) as well as many other natural language processing applications, remain a sparse resource due to the huge expense (human as well as monetary) required for their creation." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-12", "text": "A parallel corpus, also called bitext, consists in bilingual texts aligned at the sentence level." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-13", "text": "SMT systems use parallel texts as training material and monolingual corpora for target language modeling." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-14", "text": "Though enough monolingual data is available for most language pairs, it is the parallel corpus that is a sparse resource." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-15", "text": "The performance of an SMT system heavily depends on the parallel corpus used for training." 
}, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-16", "text": "Generally, more bitexts lead to better performance." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-17", "text": "The existing resources of parallel corpora cover a few language pairs and mostly come from one domain (proceedings of the Canadian or European Parliament, or of the United Nations)." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-18", "text": "The language jargon used in such corpora is not very well suited for everyday life translations or translations of some other domain, thus a dire need arises for more parallel corpora well suited for everyday life and domain adapted translations." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-19", "text": "One option to increase this scarce resource could be to produce more human translations, but this is a very expensive option, in terms of both time and money." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-20", "text": "Crowd sourcing could be another option, but this has its own costs and thus is not very practical for all cases." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-21", "text": "The world wide web can also be crawled for potential \"parallel sentences\", but most of the found bilingual texts are not direct translations of each other and not very easy to align." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-22", "text": "In recent works less expensive but very productive methods of creating such sentence aligned bilingual corpora were proposed." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-23", "text": "These are based on generating \"parallel\" texts from already available \"almost parallel\" or \"not much parallel\" texts." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-24", "text": "The term \"comparable corpus\" is often used to define such texts." 
}, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-25", "text": "A comparable corpus is a collection of texts composed independently in the respective languages and combined on the basis of similarity of content (Yang and Li, 2003) ." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-26", "text": "The raw material for comparable documents is often easy to obtain but the alignment of individual documents is a challenging task (Oard, 1997) ." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-27", "text": "Potential sources of comparable corpora are multilingual news reporting agencies like AFP, Xinhua, Al-Jazeera, BBC etc, or multilingual encyclopedias like Wikipedia, Encarta etc." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-28", "text": "Such comparable corpora are widely available from LDC, in particular the Gigaword corpora, or over the WEB for many languages and domains, e.g. Wikipedia." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-29", "text": "They often contain many sentences that are reasonable translations of each other." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-30", "text": "Reliable identification of these pairs would enable the automatic creation of large and diverse parallel corpora." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-31", "text": "The ease of availability of these comparable corpora and the potential for parallel corpus as well as dictionary creation has sparked an interest in trying to make maximum use of these comparable resources, some of these works include dictionary learning and identifying word translations (Rapp, 1995) , named entity recognition (Sproat et al., 2006) , word sense disambiguation (Kaji, 2003) , improving SMT performance using extracted parallel sentences (Munteanu and Marcu, 2005) , (Rauf and Schwenk, 2009 )." 
}, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-32", "text": "There has been considerable amount of work on bilingual comparable corpora to learn word translations as well as discovering parallel sentences." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-33", "text": "Yang and Lee (2003) use an approach based on dynamic programming to identify potential parallel sentences in title pairs." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-34", "text": "Longest common sub sequence, edit operations and match-based score functions are subsequently used to determine confidence scores." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-35", "text": "Resnik and Smith (2003) propose their STRAND web-mining based system and show that their approach is able to find large numbers of similar document pairs." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-36", "text": "Works aimed at discovering parallel sentences include (Utiyama and Isahara, 2003) , who use cross-language information retrieval techniques and dynamic programming to extract sentences from an English-Japanese comparable corpus." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-37", "text": "They identify similar article pairs, and then, treating these pairs as parallel texts, align their sentences on a sentence pair similarity score and use DP to find the least-cost alignment over the document pair." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-38", "text": "Fung and Cheung (2004) approach the problem by using a cosine similarity measure to match foreign and English documents." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-39", "text": "They work on \"very non-parallel corpora\"." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-40", "text": "They then generate all possible sentence pairs and select the best ones based on a threshold on cosine similarity scores." 
}, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-41", "text": "Using the extracted sentences they learn a dictionary and iterate over with more sentence pairs." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-42", "text": "Recent work by Munteanu and Marcu (2005) uses a bilingual lexicon to translate some of the words of the source sentence." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-43", "text": "These translations are then used to query the database to find matching translations using information retrieval (IR) techniques." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-44", "text": "Candidate sentences are determined based on word overlap and the decision whether a sentence pair is parallel or not is performed by a maximum entropy classifier trained on parallel sentences." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-45", "text": "Bootstrapping is used and the size of the learned bilingual dictionary is increased over iterations to get better results." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-46", "text": "Our technique is similar to that of (Munteanu and Marcu, 2005) but we bypass the need of the bilingual dictionary by using proper SMT translations and instead of a maximum entropy classifier we use simple measures like the word error rate (WER) and the translation edit rate (TER) to decide whether sentences are parallel or not." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-47", "text": "We also report an extension of our work (Rauf and Schwenk, 2009 ) by experimenting with an additional filter TERp, and building a named entity noun dictionary using the unknown words from the SMT (section 5.2)." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-48", "text": "TERp has been tried encouraged by the outperformance of TER in our previous study on French-English." 
}, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-49", "text": "We have applied our technique on a different language pair Arabic-English, versus French-English that we reported the technique earlier on." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-50", "text": "Our use of full SMT sentences, gives us an added advantage of being able to detect one of the major errors of these approaches, also identified by (Munteanu and Marcu, 2005) , i.e, the cases where the initial sentences are identical but the retrieved sentence has a tail of extra words at sentence end." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-51", "text": "We discuss this problem as detailed in section 5.1." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-52", "text": "We apply our technique to create a parallel corpus for the Arabic/English language pair." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-53", "text": "We show that we achieve significant improvements in the BLEU score by adding our extracted corpus to the already available human-translated corpora." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-54", "text": "We also perform a comparison of the data extracted by our approach and that by (Munteanu and Marcu, 2005) and report the results in Section 5.3." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-55", "text": "This paper is organized as follows." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-56", "text": "In the next section we first describe the baseline SMT system trained on human-provided translations only." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-57", "text": "We then proceed by explaining our parallel sentence selection scheme and the post-processing." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-58", "text": "Section 5 summarizes our experimental results and the paper concludes with a discussion and perspectives of this work." 
}, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-59", "text": "----------------------------------" }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-60", "text": "**TASK DESCRIPTION**" }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-61", "text": "In this paper, we consider the translation from Arabic into English, under the same conditions as the official NIST 2008 evaluation." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-62", "text": "The used bi-texts include various news wire translations 1 as well as some texts from the GALE project." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-63", "text": "2 We also added the 2002 to 2005 test data to the parallel training data (using all reference translations)." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-64", "text": "This corresponds to a total of about 8M Arabic words." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-65", "text": "Our baseline system is trained on these bitexts only." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-66", "text": "We use the 2006 NIST test data as development data and the official NIST 2008 test data as internal test set." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-67", "text": "All case sensitive BLEU scores are calculated with the NIST scoring tool with respect to four reference translations." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-68", "text": "Both data sets include texts from news wires as well as newsgroups." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-69", "text": "LDC provides large collections of monolingual data, namely the LDC Arabic and English Gigaword corpora." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-70", "text": "There are two text sources that do exist in Arabic and English: the AFP and XIN collection." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-71", "text": "It is likely that each corpora contains sentences which are translations of the other." 
}, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-72", "text": "We aim to extract those." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-73", "text": "We have used the XIN corpus for all of our reported results and the collection of the AFP and XIN for comparison with ISI." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-74", "text": "Table 1 summarizes the characteristics of the corpora used." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-75", "text": "Note that the English part is much larger than the Arabic one (we found the same to be the case for French-English AFP comparable corpora that we used in our previous study)." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-76", "text": "The number of words are given after tokenization." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-77", "text": "----------------------------------" }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-78", "text": "**BASELINE SMT SYSTEM**" }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-79", "text": "The goal of statistical machine translation (SMT) is to produce a target sentence e from a source sentence f ." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-80", "text": "It is today common practice to use phrases as translation units (Koehn et al., 2003; Och and Ney, 2003) and a log linear framework in order to introduce several models explaining the translation process: 2004E72, T17, T18, 2005E46 and 2006E25." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-81", "text": "2 LDC2005E83, 2006E24, E34, E85 and E92." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-82", "text": "The feature functions h i are the system models and the \u03bb i weights are typically optimized to maximize a scoring function on a development set (Och and Ney, 2002) ." 
}, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-83", "text": "In our system fourteen features functions were used, namely phrase and lexical translation probabilities in both directions, seven features for the lexicalized distortion model, a word and a phrase penalty and a target language model (LM)." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-84", "text": "The system is based on the Moses SMT toolkit (Koehn et al., 2007) and constructed as follows." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-85", "text": "First, Giza++ is used to perform word alignments in both directions." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-86", "text": "Second, phrases and lexical reorderings are extracted using the default settings of the Moses SMT toolkit." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-87", "text": "The target 4-gram back-off language model is trained on the English part of all bitexts as well as the whole English Gigaword corpus." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-88", "text": "----------------------------------" }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-89", "text": "**SYSTEM ARCHITECTURE**" }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-90", "text": "The general architecture of our parallel sentence extraction system is shown in figure 1." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-91", "text": "Starting from comparable corpora for the two languages, Arabic and English, we first translate Arabic to English using an SMT system as described in the above sections." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-92", "text": "These translated texts are then used to perform information retrieval from the English corpus, followed by simple metrics like WER, TER or TERp to filter out good sentence pairs and eventually generate a parallel corpus." 
}, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-93", "text": "We show that a parallel corpus obtained using this technique helps considerably to improve an SMT system." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-94", "text": "----------------------------------" }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-95", "text": "**SYSTEM FOR EXTRACTING PARALLEL SENTENCES FROM COMPARABLE CORPORA**" }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-96", "text": "We start by translating the Arabic XIN and AFP texts to English using the SMT systems discussed in section 2. can safely say that a news item reported on day X in the Arabic corpus will be most probably found in the day X-5 and day X+5 time period." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-97", "text": "We experimented with several window sizes and found the window size of is to be the most accurate in terms of time and the quality of the retrieved sentences." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-98", "text": "(Munteanu and Marcu, 2005) have also worked with a \u00b15 day window." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-99", "text": "Using the ID and date information for each sentence of both corpora, we first collect all sentences from the SMT translations corresponding to the same day (query sentences) and then the corresponding articles from the English Gigaword corpus (search space for IR)." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-100", "text": "These day-specific files are then used for information retrieval using a robust information retrieval system." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-101", "text": "The Lemur IR toolkit (Ogilvie and Callan, 2001 ) was used for sentence extraction." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-102", "text": "The information retrieval step is the most time consuming task in the whole system." 
}, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-103", "text": "The time taken depends upon various factors like size of the index to search in, length of the query sentence etc." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-104", "text": "To give a time estimate, using a \u00b15 day window required 9 seconds per query vs 15 seconds per query when a \u00b17 day window was used." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-105", "text": "We placed a limit of approximately 90 words on the queries and the indexed sentences." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-106", "text": "This choice was motivated by the fact that the word alignment toolkit Giza++ does not process longer sentences." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-107", "text": "A Krovetz stemmer was used while building the index as provided by the toolkit." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-108", "text": "English stop words, i.e. frequently used words, such as \"a\" or \"the\", are normally not indexed because they are so common that they are not useful to query on." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-109", "text": "The stop word list provided by the IR Group of University of Glasgow 3 was used." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-110", "text": "The resources required by our system are minimal : translations of one side of the comparable corpus." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-111", "text": "It has already been demonstrated in (Rauf and Schwenk, 2009 ) that when using translations as queries, the quality of the initial SMT is not a factor for better sentence retrieval and that an SMT system trained on small amounts of humantranslated data can 'retrieve' potentially good parallel sentences." 
}, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-112", "text": "----------------------------------" }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-113", "text": "**CANDIDATE SENTENCE PAIR SELECTION**" }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-114", "text": "The information retrieval process gives us the potential parallel sentences per query sentence, the decision of their being parallel or not needs to be made about them." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-115", "text": "At this stage we choose the best scoring sentence as determined by the toolkit and pass the sentence pair through further filters." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-139", "text": "The name of the bitext indicates the filter threshold used, for example, TER-50 means sentences selected based on TER filter threshold of 50." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-140", "text": "Generally, sentences selected based on TER filter showed better BLEU scores on NIST06 than their WER and TERp counter parts up to almost 21M words." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-141", "text": "Also for the same filter threshold TERp selected longer sentences, followed by TER and then WER, this fact is evident from table 2, where for the filter threshold of 60, TERp and TER select 20.8M and 17.3 words respectively, whereas WER selects 14.5M words." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-142", "text": "Figure 2 shows the trend obtained in function of the number of words added." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-143", "text": "These experiments were performed by adding our extracted sentences to only 5.8M words of human-provided translations." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-144", "text": "Our best results are obtained when 11.5M of our extracted parallel sentences based on TER filter are added to 5.8M of News wire and gale parallel corpora." 
}, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-145", "text": "We gain an improvement of 1.66 BLEU points on NIST06 and 2.38 BLEU points An interesting thing to notice in figure 2 is that no filter was able to clearly outperform the others, which is contradictory to our experiments with the French-English language pair (Rauf and Schwenk, 2009) , where the TER filter clearly outperformed the WER filter." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-146", "text": "WER is worse than TER but less evident here than for our previous experiments for the French-English language pair." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-147", "text": "This performance gain by using the TER filter for FrenchEnglish was our main motivation for trying TERp." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-148", "text": "We expected TERp to get better results compared to WER and TER, but TER filter seems the better one among the three filters." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-149", "text": "Note that all conditions in all the experiments were identical." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-150", "text": "This gives a strong hint of language pair dependency, making the decision of suitability of a particular filter dependent on the language pair in consideration." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-151", "text": "----------------------------------" }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-152", "text": "**SENTENCE TAIL REMOVAL**" }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-153", "text": "Two main classes of errors are known when extracting parallel sentences from comparable corpora: firstly, cases where the two sentences share many common words but actually convey different meaning, and secondly, cases where the two sentences are (exactly) parallel except at sentence ends where one sentence has more information than the other." 
}, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-154", "text": "This second case of errors can be detected using WER as we have the advantage of having both the sentences in English." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-155", "text": "We detected the extra insertions at the end of the IR result sentence and removed them." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-156", "text": "Some examples of such sentences along with tails detected and removed are shown in figure 3." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-157", "text": "Since this gives significant improvement in the SMT scores we used it for all our extracted sentences (Rauf and Schwenk, 2009)." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-158", "text": "However, similar to our observations in the last section, the tails were much shorter as compared to our previous experiments with French-English, also most of the tails in this Arabic-English data were of type as shown in last line figure 3." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-159", "text": "This is a factor dependent on reporting agency and its scheme for reporting, i.e, whether it reports an event independently in each language or uses the translation from one language to the other ." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-160", "text": "----------------------------------" }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-161", "text": "**DICTIONARY CREATION**" }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-162", "text": "In our translations, we keep the unknown words as they are, i.e. in Arabic (normally a flag is used so that Moses skips them)." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-163", "text": "This enables us to build a dictionary." 
}, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-164", "text": "Consider the case with translation with one unknown word in Arabic, if all the other words around align well with the English sentence that we found with IR, we could conclude the translation of the unknown Arabic word, see figure 3 line 5." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-165", "text": "We were able to make a dictionary using this scheme which was comprised mostly of proper nouns often not found in Arabic-English dictionaries." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-166", "text": "Our proper noun dictionary comprises of about 244K words, some sample words are shown in figure 4." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-167", "text": "Adding the proper nouns found by this technique to the initial SMT system should help improve translations for new sentences, as these words were before unknown to the system." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-168", "text": "However, the impact of addition of these words on translation quality is to be evaluated at the moment. ." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-169", "text": "Query: Bono adopted this position after some legislators asked the government to rethink the Spanish military presence in Afghanistan ." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-170", "text": "Result: Bono adopted this attitude after some legislators asked the government to reconsider the Spanish military presence in Afghanistan ." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-171", "text": "( SPAIN-AFGHANISTAN ) ." 
}, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-172", "text": "----------------------------------" }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-173", "text": "**COMPARISON WITH PREVIOUS WORK**" }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-174", "text": "LDC provides extracted parallel texts extracted with the algorithm published by (Munteanu and Marcu, 2005) ." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-175", "text": "This corpus contains 1.1M sentence pairs (about 35M words) which were automatically extracted and aligned from the monolingual Arabic and English Gigaword corpora, a confidence score being provided for each sentence pair." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-176", "text": "We also applied our approach on data provided by LDC, but on a different subset." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-177", "text": "Since we had used the recent data sets our corpora were till year 2006, whereas ISI's data were till year 2004." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-178", "text": "We filtered our data according to the time interval of their data (date information was provided for each sentence pair) and used them to compare the two data sets." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-179", "text": "Both AFP and XIN were used in these comparison experiments since the available ISI's data was comprised of these two collections." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-180", "text": "To perform the comparison, we have, firstly, the ISI parallel sentences and secondly the parallel sentences extracted by using our approach using the same time frame and comparable corpora as ISI." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-181", "text": "We used our sentences as filtered by the TER filter and added them to the already available 5.8M of human-translated (as done in previous experiments)." 
}, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-182", "text": "The result is shown graphically in figure 5 ." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-183", "text": "Adding the ISI parallel data to the 5.8M baseline parallel corpus (total 27.5M words) yielded a BLEU score of 43.59 on NIST06 Dev set and 41.84 BLEU points on NIST08 test set." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-184", "text": "Whereas we were able to achieve a BLEU score of 43.88 on NIST06 Dev and 41.35 on NIST08 test set (using a total of 16.1M words), which amounts to an increase of 0.29 BLEU points on the NIST06 Dev set." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-185", "text": "Note that this gain is achieved by using a total of only 10.3M of our extracted words as compared to 21.7M of ISI corpus to get their best result." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-186", "text": "However we were not able to improve as much on the NIST08 test corpus." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-187", "text": "The trend in BLEU score in figure 5 clearly shows that our sentence selection scheme selects good sentences, and is capable of achieving the same scores but with much less sentences." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-188", "text": "This is because in the scheme of ISI, the confidence scores provided are based on the IR and maximum entropy classifier scoring scheme, whereas our filters score the sentences based on linguistic sentence similarity, allowing us to retrieve the good sentence pairs from the bad ones." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-189", "text": "Once information retrieval is done, which is the most time consuming task in both the techniques, our approach is better able to sort out the good IR extracted sentences as is evident from the results obtained." 
}, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-190", "text": "Moreover our scheme does not require any complex operations, just simple filters which are well adapted to the problem at hand." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-191", "text": "----------------------------------" }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-192", "text": "**CONCLUSION AND DISCUSSION**" }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-193", "text": "Sentence-aligned bilingual texts are a crucial resource to build SMT systems." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-194", "text": "For some language pairs bilingual corpora just do not exist, the existing corpora are too small to build a good SMT system or they are not of the same genre or domain." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-195", "text": "This need for parallel corpora, has made the researchers employ new techniques and methods in an attempt to reduce the dire need of this crucial resource of the SMT systems." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-196", "text": "Our study also contributes in this regard by employing an SMT itself and information retrieval techniques to produce additional parallel corpora from easily available comparable corpora." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-197", "text": "We use translations of the source language comparable corpus to find the corresponding parallel sentences from the target language comparable corpus." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-198", "text": "We only used a limited amount of human-provided bilingual resources." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-199", "text": "Starting with small amounts of sentence aligned bilingual data large amounts of monolingual data are translated." 
}, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-200", "text": "These translations are then employed to find the corresponding matching sentences in the target side corpus, using information retrieval methods." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-201", "text": "Simple filters are used to determine whether the retrieved sentences are parallel or not." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-202", "text": "By adding these retrieved parallel sentences to already available human translated parallel corpora we were able to improve the BLEU score on the test set(NIST08) by 2.38 points for the ArabicEnglish language pair." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-203", "text": "Contrary to the previous approaches as in (Munteanu and Marcu, 2005) which used small amounts of in-domain parallel corpus as an initial resource, our system exploits the target language side of the comparable corpus to attain the same goal, thus the comparable corpus itself helps to better extract possible parallel sentences." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-204", "text": "We have also presented a comparison with their approach and found our bitexts to achieve nice improvements using much less words." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-205", "text": "The LDC comparable corpora were used in this paper, but the same approach can be extended to extract parallel sentences from huge amounts of corpora available on the web by identifying comparable articles using techniques such as (Yang and Li, 2003) and (Resnik and Y, 2003) .We have successfully applied our approach to French-English and ArabicEnglish language pairs." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-206", "text": "As this study strongly hinted towards language pair dependancy on the choice of the filter to use to select better sentences, we intend to investigate this trend in detail." 
}, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-116", "text": "Gale and Church (1993) based their align program on the fact that longer sentences in one language tend to be translated into longer sentences in the other language, and that shorter sentences tend to be translated into shorter sentences." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-117", "text": "We initially used the same logic in our selection of the candidate sentence pairs." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-118", "text": "However our observation was that the filters that we use, WER, TER and TERp implicitly place a penalty when the length differ-ence between two sentences is too large." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-119", "text": "Thus using this inherent property, we did not apply any explicit sentence length filtering." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-120", "text": "The candidate sentences pairs are then judged based on simple filters." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-121", "text": "Our choice of filters in accordance to the task in consideration were the WER (Levenshtein distance), Translation Edit Rate (TER) and the relatively new Translation Edit Rate plus (TERp)." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-122", "text": "WER measures the number of operations required to transform one sentence into the other (insertions, deletions and substitutions)." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-123", "text": "A zero WER would mean the two sentences are identical, subsequently lower WER sentence pairs would be sharing most of the common words." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-124", "text": "However two correct translations may differ in the order in which the words appear, something that WER is incapable of taking into account." 
}, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-125", "text": "This shortcoming is addressed by TER which allows block movements of words and thus takes into account the reorderings of words and phrases in translation (Snover et al., 2006) ." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-126", "text": "TERp is an extension of Translation Edit Rate and was one of the top performing metrics at the NIST Metric MATR workshop 4 ." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-127", "text": "It had the highest absolute correlation, as measured by the Pearson correlation coefficient, with human judgments in 9 of the 45 test conditions." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-128", "text": "TERp tries to address the weaknesses of TER through the use of paraphrases, morphological stemming, and synonyms, as well as edit costs that are optimized to correlate better with various types of human judgments (Snover et al., 2009 )." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-129", "text": "The TER filter allows shifts if the two strings (the word sequence in the translated and the IR retrieved sentence) match exactly, however TERp allows shifts if the words being shifted are exactly the same, are synonyms, stems or paraphrases of each other, or any such combination." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-130", "text": "This allows better sentence comparison by incorporation of sort of linguistic information about words." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-131", "text": "----------------------------------" }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-132", "text": "**EXPERIMENTAL EVALUATION**" }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-133", "text": "Our main goal was to be able to create an additional parallel corpus to improve machine translation quality, especially for the domains where we have less or no parallel data available." 
}, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-134", "text": "In this section we report the results of adding these extracted parallel sentences to the already available humantranslated parallel sentences." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-135", "text": "We conducted a range of experiments by adding our extracted corpus to various combinations of already available human-translated parallel corpora." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-136", "text": "For our experiments on effect on SMT quality we use only the XIN extracted corpus." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-137", "text": "We experimented with WER, TER and TERp as filters to select the best scoring sentences." }, { "sent_id": "7176d3dd72e781dca42f8c146d062d-C001-138", "text": "Table 2 shows some of the scores obtained based on BLEU scores on the Dev and test data as a function of the size of the added extracted corpus." } ], "y": { "@DIF@": { "gold_contexts": [ [ "7176d3dd72e781dca42f8c146d062d-C001-4" ], [ "7176d3dd72e781dca42f8c146d062d-C001-46" ], [ "7176d3dd72e781dca42f8c146d062d-C001-203" ] ], "cite_sentences": [ "7176d3dd72e781dca42f8c146d062d-C001-4", "7176d3dd72e781dca42f8c146d062d-C001-46", "7176d3dd72e781dca42f8c146d062d-C001-203" ] }, "@SIM@": { "gold_contexts": [ [ "7176d3dd72e781dca42f8c146d062d-C001-4" ], [ "7176d3dd72e781dca42f8c146d062d-C001-46" ], [ "7176d3dd72e781dca42f8c146d062d-C001-50" ], [ "7176d3dd72e781dca42f8c146d062d-C001-174", "7176d3dd72e781dca42f8c146d062d-C001-175", "7176d3dd72e781dca42f8c146d062d-C001-176" ] ], "cite_sentences": [ "7176d3dd72e781dca42f8c146d062d-C001-4", "7176d3dd72e781dca42f8c146d062d-C001-46", "7176d3dd72e781dca42f8c146d062d-C001-50", "7176d3dd72e781dca42f8c146d062d-C001-174" ] }, "@BACK@": { "gold_contexts": [ [ "7176d3dd72e781dca42f8c146d062d-C001-31" ], [ "7176d3dd72e781dca42f8c146d062d-C001-42" ] ], "cite_sentences": [ "7176d3dd72e781dca42f8c146d062d-C001-31", 
"7176d3dd72e781dca42f8c146d062d-C001-42" ] }, "@USE@": { "gold_contexts": [ [ "7176d3dd72e781dca42f8c146d062d-C001-54" ] ], "cite_sentences": [ "7176d3dd72e781dca42f8c146d062d-C001-54" ] } } }, "ABC_0ae49d1618e18eb794666543d924ed_29": { "x": [ { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-87", "text": "**EXPERIMENTS**" }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-2", "text": "Character-level patterns have been widely used as features in English Named Entity Recognition (NER) systems." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-3", "text": "However, to date there has been no direct investigation of the inherent differences between name and nonname tokens in text, nor whether this property holds across multiple languages." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-4", "text": "This paper analyzes the capabilities of corpus-agnostic Character-level Language Models (CLMs) in the binary task of distinguishing name tokens from non-name tokens." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-5", "text": "We demonstrate that CLMs provide a simple and powerful model for capturing these differences, identifying named entity tokens in a diverse set of languages at close to the performance of full NER systems." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-6", "text": "Moreover, by adding very simple CLM-based features we can significantly improve the performance of an off-the-shelf NER system for multiple languages." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-7", "text": "----------------------------------" }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-9", "text": "In English, there is strong empirical evidence that the character sequences that make up proper nouns tend to be distinctive." 
}, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-10", "text": "Even divorced of context, a human reader can predict that \"hoekstenberger\" is an entity, but \"abstractually\" 2 is not." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-11", "text": "Some NER research explores the use of characterlevel features including capitalization, prefixes and suffixes (Cucerzan and Yarowsky, 1999; Ratinov and Roth, 2009) , and character-level models (CLMs) (Klein et al., 2003) to improve the performance of NER, but to date there has been no systematic study isolating the utility of CLMs in capturing distinctions between name and non-name tokens in English or across other languages." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-12", "text": "We conduct an experimental assessment of the discriminative power of CLMs for a range of lan- 1 The code and resources for this publication can be found at: https://cogcomp.org/page/publication_ view/846 2 Not a real name or a real word." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-13", "text": "Figure 1: Perplexity histogram of entity (left) and nonentity tokens (right) in CoNLL Train calculated by entity CLM for both sides." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-14", "text": "The graphs show the percentage of tokens (y axis) with different levels of CLM perplexities (x axis)." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-15", "text": "The entity CLM gives a low average perplexity and small variance to entity tokens (left), while giving non-entity tokens much higher perplexity and higher variance (right)." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-16", "text": "guages: English, Amharic, Arabic, Bengali, Farsi, Hindi, Somali, and Tagalog." 
}, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-17", "text": "These languages use a variety of scripts and orthographic conventions (for example, only three use capitalization), come from different language families, and vary in their morphological complexity." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-18", "text": "We demonstrate the effectiveness of CLMs in distinguishing name tokens from non-name tokens, as illustrated by Figure 1 , which shows perplexity histograms from a CLM trained on entity tokens." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-19", "text": "Our models use only individual tokens, but perform extremely well in spite of taking no account of word context." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-20", "text": "We then assess the utility of directly adding simple features based on this CLM implementation to an existing NER system, and show that they have a significant positive impact on performance across many of the languages we tried." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-21", "text": "By adding very simple CLM-based features to the system, our scores approach those of a state-of-the-art NER system (Lample et al., 2016) across multiple languages, demonstrating both the unique importance and the broad utility of this approach." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-22", "text": "----------------------------------" }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-23", "text": "**METHODS**" }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-24", "text": "----------------------------------" }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-25", "text": "**CHARACTER LANGUAGE MODELS**" }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-26", "text": "We propose a very simple model in which we train an entity CLM on a list of entity tokens, and a nonentity CLM on a list of non-entity tokens." 
}, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-27", "text": "Both lists are unordered, with all entries treated independently." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-28", "text": "Each token is split into characters and treated as a \"sentence\" where the characters are the \"words.\" For example, \"Obama\" is an entity token, and is split into \"O b a m a\"." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-29", "text": "From these examples we learn a score measuring how likely it is that a sequence of characters forms an entity." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-30", "text": "At test time, we also split each word into characters and determine perplexity using the entity and nonentity CLMs." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-31", "text": "We assign the label corresponding to the lower perplexity CLM." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-32", "text": "We experiment with four different kinds of language model: N-gram model, Skip-gram model, Continuous Bag-of-Words model (CBOW), and Log-Bilinear model (LB)." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-33", "text": "We demonstrate that the N-gram model is best suited for this task." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-34", "text": "Following Peng and Roth (2016) , we implement N-gram using SRILM (Stolcke, 2002) with order 6 and Witten-Bell discounting." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-35", "text": "3 For Skip-Gram and CBOW CLMs, we use the Gensim implementation (Rehurek and Sojka, 2010) for training and inference, and we build the LB CLM using the OxLM toolkit (Baltescu et al., 2014) ." 
}, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-36", "text": "----------------------------------" }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-37", "text": "**DATA**" }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-38", "text": "To determine whether name identifiability applies to languages other than English, we conduct experiments on a range of languages for which we had previously gathered resources (such as Brown clusters): English, Amharic, Arabic, Bengali, Farsi, Hindi, Somali, and Tagalog." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-39", "text": "For English, we use the original splits from the ubiquitous CoNLL 2003 English dataset (Sang and Meulder, 2003) , which is a newswire dataset annotated with Person (PER), Organization (ORG), Location (LOC) and Miscellaneous (MISC)." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-40", "text": "To collect the list of entities and nonentities as the training data for the Entity and Non-Entity CLMs, we sample a large number of PER/ORG/LOC and non-entities from Wikipedia, using types derived from their corresponding FreeBase entities (Ling and Weld, 2012) ." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-41", "text": "For all other languages, we use a subset of the corpora from the LORELEI project annotated for the NER task (Strassel and Tracey, 2016) ." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-42", "text": "We build our entity list using the tokens labeled as entities in the training data, and our non-entity list from the remaining tokens." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-43", "text": "These two lists are then used to train two CLMs, as described above." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-44", "text": "Our datasets vary in size of entity and non-entity tokens, as shown in Table 1 ." 
}, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-45", "text": "The smallest, Farsi, has 4.5K entity and 50K non-entity tokens; the largest, English, has 29K entity and 170K nonentity tokens." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-46", "text": "----------------------------------" }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-47", "text": "**CLM FOR NAMED ENTITY IDENTIFICATION**" }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-48", "text": "In this section, we first show the power of CLMs for distinguishing between entity and non-entity tokens in English, and then that this power is robust across a variety of languages." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-49", "text": "We refer to this task as Named Entity Identification (NEI), because we are concerned only with finding an entity span, not its label." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-50", "text": "We differentiate it from Named Entity Recognition (NER), in which both span and label are required." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-51", "text": "To avoid complicating this straightforward approach by requiring a separate mention detection step, we evaluate at the token-level, as opposed to the more common phrase-level evaluation." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-52", "text": "We also apply one heuristic: if a word has length 1, we automatically predict 'O' (or non-entity)." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-53", "text": "This captures most punctuation and words like 'I' and 'a'. Figure 1 shows that for the majority of entity tokens, the entity CLM computes a relatively low perplexity compared to non-entity tokens." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-54", "text": "Though there also exist some non-entities with low entity CLM perplexity, we can still reliably identify a large proportion of non-entity words by setting a threshold value for entity CLM perplexity." 
}, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-55", "text": "If a token perplexity lies above this threshold, we label it as a non-entity token." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-56", "text": "The threshold is tuned on development data." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-57", "text": "Since we also build a CLM for non-entities, we can also compare the entity and non-entity perplexity scores for a token." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-58", "text": "For those tokens not excluded using the threshold as described above, we compare the perplexity scores of the two models and assign the label corresponding to the model yielding the lower score." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-59", "text": "We compare SRILM against Skip-gram and CBOW, as implemented in Gensim, and the LogBilinear (LB) model." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-60", "text": "We trained both CBOW and Skip-gram with window size 3, and size 20." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-61", "text": "We tuned LB, and report results with embedding size 150, and learning rate 0.1." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-62", "text": "Despite tuning the neural models, the simple N-gram model outperforms them significantly, perhaps because of the relatively small amount of training data." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-63", "text": "4 We compare the CLM's Entity Identification against two state-of-the-art NER systems: CogCompNER (Khashabi et al., 2018) and LSTM-CRF (Lample et al., 2016) ." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-64", "text": "We train the NER systems as usual, but at test time we convert all predictions into binary token-level annotations to get the final score." 
}, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-65", "text": "As Table 2 shows, the result of Ngram CLM, which yields the highest performance, is remarkably close to the result of state-of-theart NER systems (especially for English) given the simplicity of the model." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-66", "text": "----------------------------------" }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-67", "text": "**IMPROVING NER WITH CLM FEATURES**" }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-68", "text": "In this section we show that we can augment a standard NER system with simple features based on our entity/non-entity CLMs to improve performance in many languages." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-69", "text": "Based on their superior performance as reported in Section 3, we use the N-gram CLMs." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-70", "text": "----------------------------------" }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-71", "text": "**FEATURES**" }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-72", "text": "We define three simple features that capture information provided by CLMs and which we expect to be useful for NER." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-73", "text": "Entity Feature We define one \"isEntity\" feature based on the perplexities of the entity and non-entity CLMs." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-74", "text": "We compare the perplexity calculated by entity CLM and non-entity CLM described in Section 3, and return a boolean value indicating whether the entity CLM score is lower." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-75", "text": "Language Features We define two languagerelated features: \"isArabic\" and \"isRussian\"." 
}, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-76", "text": "We observe that there are many names in English text that originate from other languages, resulting in very different orthography than native English names." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-77", "text": "We therefore build two languagebased CLMs for Arabic and Russian." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-78", "text": "We collect a list of Arabic names and a list of Russian names by scraping name-related websites, and train an Arabic CLM and a Russian CLM." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-79", "text": "For each token, when the perplexity of either the Arabic or the Russian CLM is lower than the perplexity of the Non-Entity CLM, we return True, indicating that this entity is likely to be a name from Arabic/Russian." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-80", "text": "Otherwise, we return False." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-81", "text": "Table 3 : NER results on 8 languages show that even a simplistic addition of CLM features to a standard NER model boosts performance." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-82", "text": "CogCompNER is run with standard features, including Brown clusters; (Lample et al., 2016) is run with default parameters and pre-trained embeddings." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-83", "text": "Unseen refers to performance on named entities in Test that were not seen in the training data." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-84", "text": "Full is performance on all entities in Test." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-85", "text": "Averages are computed over all languages other than English." 
}, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-86", "text": "----------------------------------" }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-88", "text": "We use CogCompNER (Khashabi et al., 2018) as our baseline NER system because it allows easy integration of new features, and evaluate on the same datasets as before." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-89", "text": "For English, we add all features described above." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-90", "text": "For other languages, due to the limited training data, we only use the \"isEntity\" feature." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-91", "text": "We compare with the state-of-theart character-level neural NER system of (Lample et al., 2016) , which inherently encodes comparable information to CLMs, as a way to investigate how much of that system's performance can be attributed directly to name-internal structure." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-92", "text": "The results in Table 3 show that for six of the eight languages we studied, the baseline NER can be significantly improved by adding simple CLM features; for English and Arabic, it performs better even than the neural NER model of (Lample et al., 2016) ." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-93", "text": "For Tagalog, however, adding CLM features actually impairs system performance." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-94", "text": "In the same table, the rows marked \"unseen\" report systems' performance on named entities in Test that were not seen in the training data." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-95", "text": "This setting more directly assesses the robustness of a system to identify named entities in new data." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-96", "text": "By this measure, Farsi NER is not improved by nameonly CLM features and Tagalog is impaired." 
}, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-97", "text": "Benefits for English, Hindi, and Somali are limited, but are quite significant for Amharic, Arabic, and Bengali." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-98", "text": "----------------------------------" }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-99", "text": "**DISCUSSION**" }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-100", "text": "Our results demonstrate the power of CLMs for recognizing named entity tokens in a diverse range of languages, and that in many cases they can improve off-the-shelf NER system performance even when integrated in a simplistic way." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-101", "text": "However, the results from Section 4.2 show that this is not true for all languages, especially when only considering unseen entities in Test: Tagalog and Farsi do not follow the trend for the other languages we assessed even though CLM performs well for Named Entity Identification." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-102", "text": "While the end-to-end model developed by (Lample et al., 2016) clearly includes information comparable to that in the CLM, it requires a fully annotated NER corpus, takes significant time and computational resources to train, and is non-trivial to integrate into a new NER system." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-103", "text": "The CLM approach captures a very large fraction of the entity/non-entity distinction capacity of full NER systems, and can be rapidly trained using only entity and non-entity token lists -i.e., it is corpus-agnostic." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-104", "text": "For some languages it can be used directly to improve NER performance; for others (such as Tagalog), the strong NEI performance indicates that while it does not immediately boost performance, it can ultimately be used to improve NER there too." 
}, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-105", "text": "6 Related Work Cucerzan and Yarowsky (1999) is one of the earliest works to use character-based features (character tries) for NER." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-106", "text": "The approach of Klein et al. (2003) was one of the original papers in the CoNLL 2003 NER shared task." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-107", "text": "Their approach, which ranked in the top 3 for both English and German shared tasks, used character-based features for NER." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-108", "text": "They do two experiments: one with a character-based HMM, another with using character n-grams as features to a maximum entropy model." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-109", "text": "The focus on character-level patterns is similar to our work, but without the specific exploration of language models alone." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-110", "text": "Using character-based models similar to ours, Smarr and Manning (2002) show that unseen noun phrases can be accurately classified into a small number of categories using only a character-based model independent of context." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-111", "text": "We tackle a somewhat more challenging task of distinguishing entities from non-entities." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-112", "text": "Lample et al. (2016) use character embeddings in an LSTM-CRF model." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-113", "text": "Their ablation studies show that character-level features improve performance significantly." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-114", "text": "We are not aware of any work that directly evaluates CLMs for identifying name tokens, nor of work that demonstrates the utility of characterlevel information for identifying names in multiple languages." 
}, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-115", "text": "----------------------------------" }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-116", "text": "**CONCLUSIONS AND FUTURE WORK**" }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-117", "text": "We have shown, in a series of simple experiments, that in many languages names are identifiable by character patterns alone, and that character level patterns have strong potential for building better NER systems." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-118", "text": "In the future, we plan to make a more thorough analysis of reasons for the high variance in NER performance." }, { "sent_id": "0ae49d1618e18eb794666543d924ed-C001-119", "text": "In particular, we will study why it is possible, as with Tagalog, to have high Named Entity Identification results but lose points in NER." } ], "y": { "@SIM@": { "gold_contexts": [ [ "0ae49d1618e18eb794666543d924ed-C001-21" ], [ "0ae49d1618e18eb794666543d924ed-C001-63", "0ae49d1618e18eb794666543d924ed-C001-65" ] ], "cite_sentences": [ "0ae49d1618e18eb794666543d924ed-C001-21", "0ae49d1618e18eb794666543d924ed-C001-63" ] }, "@USE@": { "gold_contexts": [ [ "0ae49d1618e18eb794666543d924ed-C001-63" ], [ "0ae49d1618e18eb794666543d924ed-C001-82" ], [ "0ae49d1618e18eb794666543d924ed-C001-91" ] ], "cite_sentences": [ "0ae49d1618e18eb794666543d924ed-C001-63", "0ae49d1618e18eb794666543d924ed-C001-82", "0ae49d1618e18eb794666543d924ed-C001-91" ] }, "@DIF@": { "gold_contexts": [ [ "0ae49d1618e18eb794666543d924ed-C001-92" ], [ "0ae49d1618e18eb794666543d924ed-C001-102", "0ae49d1618e18eb794666543d924ed-C001-103" ] ], "cite_sentences": [ "0ae49d1618e18eb794666543d924ed-C001-92", "0ae49d1618e18eb794666543d924ed-C001-102" ] }, "@MOT@": { "gold_contexts": [ [ "0ae49d1618e18eb794666543d924ed-C001-102" ] ], "cite_sentences": [ "0ae49d1618e18eb794666543d924ed-C001-102" ] }, "@BACK@": { "gold_contexts": [ [ "0ae49d1618e18eb794666543d924ed-C001-112" ] ], 
"cite_sentences": [ "0ae49d1618e18eb794666543d924ed-C001-112" ] } } }, "ABC_b8d0e66901698d201b9fb1f362b8c6_29": { "x": [ { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-38", "text": "Note there is no need for the morphosyntactic features to be harmonized with the tagset to predict." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-2", "text": "Neural part-of-speech tagging has achieved competitive results with the incorporation of character-based and pre-trained word embeddings." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-3", "text": "In this paper, we show that a state-of-the-art bi-LSTM tagger can benefit from using information from morphosyntactic lexicons as additional input." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-4", "text": "The tagger, trained on several dozen languages, shows a consistent, average improvement when using lexical information, even when also using character-based embeddings, thus showing the complementarity of the different sources of lexical information." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-5", "text": "The improvements are particularly important for the smaller datasets." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-6", "text": "----------------------------------" }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-8", "text": "Part-of-speech tagging is now a classic task in natural language processing." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-9", "text": "Its aim is to associate each \"word\" with a morphosyntactic tag, whose granularity can range from a simple morphosyntactic category, or part-of-speech (hereafter PoS), to finer categories enriched with morphological features (gender, number, case, tense, mood, person, etc.) ." 
}, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-10", "text": "The use of machine learning algorithms trained on manually annotated corpora has long become the standard way to develop PoS taggers." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-11", "text": "A large variety of algorithms have been used, such as (in approximative chronological order) bigram and trigram hidden Markov models (Merialdo, 1994; Brants, 1996 Brants, , 2000 , decision trees (Schmid, 1994; Magerman, 1995) , maximum entropy Markov models (MEMMs) (Ratnaparkhi, 1996) and Conditional Random Fields (CRFs) (Lafferty et al., 2001; Constant and Tellier, 2012) ." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-12", "text": "Recently, neural approaches have reached very competitive accuracy levels, improving over the state of the art in a number of settings (Plank et al., 2016) ." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-13", "text": "As a complement to annotated training corpora, external lexicons can be a valuable source of information." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-14", "text": "First, morphosyntactic lexicons provide a large inventory of (word, PoS) pairs." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-15", "text": "Such lexical information can be used in the form of constraints at tagging time (Kim et al., 1999; Haji\u010d, 2000) or during the training process as additional features combined with standard features extracted from the training corpus (Chrupa\u0142a et al., 2008; Goldberg et al., 2009; Denis and Sagot, 2012) ." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-16", "text": "Second, lexical information encoded in vector representations, known as word embeddings, have emerged more recently (Bengio et al., 2003; Collobert and Weston, 2008; Chrupa\u0142a, 2013; Ling et al., 2015; Ballesteros et al., 2015; M\u00fcller and Sch\u00fctze, 2015) ." 
}, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-17", "text": "Such representations, often extracted from large amounts of raw text, have proved very useful for numerous tasks including PoS tagging, in particular when used in recurrent neural networks (RNNs) and more specifically in mono-or bi-directional, word-level or characterlevel long short-term memory networks (LSTMs) (Hochreiter and Schmidhuber, 1997; Ling et al., 2015; Ballesteros et al., 2015; Plank et al., 2016) ." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-18", "text": "Character-level embeddings are of particular interest for PoS tagging as they generate vector representations that result from the internal characterlevel make-up of each word." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-19", "text": "It can generalise over relevant sub-parts such as prefixes or suffixes, thus directly addressing the problem of unknown words." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-20", "text": "However, unknown words do not always follow such generalisations." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-21", "text": "In such cases, character-level models cannot bring any advantage." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-22", "text": "This is a difference with external lexicons, which provides information about any word it contains, yet without any quantitative distinction between relevant and less relevant information." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-23", "text": "Therefore, a comparative assessment of the ad-vantages of using character-level embeddings and external lexical information is an interesting idea to follow." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-24", "text": "However, the inclusion of morphosyntactic information from lexicons into neural PoS tagging architecture, as a replacement or complement to character-based or pre-computed word embeddings, remains to be investigated." 
}, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-25", "text": "In this paper, we describe how such an inclusion can be achieved and show, based on experiments using the Universal Dependencies corpora (version 1.3), that it leads to significant improvements over Plank et al.'s (2016) state-of-the-art results." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-26", "text": "----------------------------------" }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-27", "text": "**BASELINE BI-LSTM TAGGER**" }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-28", "text": "As shown by Plank et al. (2016) , state-of-the-art performance can be achieved using a bi-LSTM architecture fed with word representations." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-29", "text": "Optimal performance is achieved representing words using the concatenation of (i) a word vector w built using a word embedding layer, called its word embedding, and (ii) a representation c of the word's characters, called its character-based embedding built using a character-level bi-LSTM, which is trained jointly with the word-level layers." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-30", "text": "Further improvements can be obtained on most but not all languages by initialising the word embedding layer with pre-computed word embeddings." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-31", "text": "We refer to Plank et al. (2016) for further details." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-32", "text": "----------------------------------" }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-33", "text": "**INTEGRATING LEXICAL INFORMATION**" }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-34", "text": "We extend this bi-LSTM architecture with an additional input layer that contains token-wise features obtained from a lexicon." 
}, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-35", "text": "The input vector l for a given word is an n-hot vector where each active value corresponds to one of the possible labels in the lexicon." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-36", "text": "For instance, the English word house, which is both a singular noun and a verb in its base form, will be associated to a 2-hot input vector." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-37", "text": "Words that are not in the lexicon are represented in the form of a zero vector." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-39", "text": "Figure 1 shows how the output of this input layer is concatenated to that of the two baseline input layers, i.e. the word embedding w and (if enabled) the character-based embedding c. The result of this concatenation feeds the bi-LSTM layer." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-40", "text": "----------------------------------" }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-41", "text": "**DATA**" }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-42", "text": "We use the Universal Dependencies (UD) datasets for our experiments." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-43", "text": "In order to facilitate compar- ison with Plank et al.'s (2016) , we performed our experiments on the version 1.3 of UD (Nivre et al., 2016) ." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-44", "text": "Lexicons Our sources of lexical information we used are twofold." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-45", "text": "The first one is the Apertium 2 and the Giellatekno 3 projects." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-46", "text": "We used Apertium morphological lexicons whenever available." 
}, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-47", "text": "For other languages, we downloaded the corresponding monolingual part of OPUS's OpenSubtitles2016 corpus, tokenised it, extracted the 1 million most frequent tokens, and retrieved all their morphological analyses by the corresponding morphological analyser provided by Apertium (or, failing that, Giellatekno)." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-48", "text": "All these analyses were then gathered in the form of a lexicon." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-49", "text": "In a second step, we converted all lexicons obtained using manually crafted rules, so that each lexical entry contains a (inflected) wordform, a lemma, a Universal PoS, 4 and morphological features from the Universal Features." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-50", "text": "5 We then created two variants of the lexicons obtained: a coarse variant in which labels are Universal PoS, and a full variant 1 We also provide the type-token ratio of the corpus (TTR) and whether there were available Polyglot embeddings (PG) to initialize w." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-51", "text": "in which labels are the concatenation of the Universal PoS and Universal Features." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-52", "text": "We also took advantage of other existing lexicons." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-53", "text": "For space reasons, we are not able to describe here the language-specific transformations we applied to some of these lexicons." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-54", "text": "See Table 1 and its caption for more information." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-55", "text": "We determine the best performing lexicon for each language based on tagging accuracy on the development set." 
}, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-56", "text": "In the remainder of this paper, all information about the lexicons (Table 1) and accuracy results are restricted to these best performing lexicons." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-57", "text": "Coverage information on the test sets for both the training data and the best external lexicon for each dataset is provided in Table 2 : Coverage of the training set and of the best lexicon on the test set for each dataset of the UD 1.3 corpora." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-58", "text": "\"OOTC\" stands for \"out of training corpus\" and OOLex for \"out of (external) lexicon\"." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-59", "text": "The \"OOTC, in Lex.\" column displays the percentage of words that are not in the training corpus but are covered by the lexicon." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-60", "text": "Best improvements are expected for these words." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-61", "text": "Pre-computed embeddings Whenever available and following Plank et al. (2016) , we performed experiments using Polyglot pre-computed embeddings (Al-Rfou et al., 2013) ." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-62", "text": "Languages for which Polyglot embeddings are available are indicated in Table 1 ." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-86", "text": "When also using pre-computed embeddings, improvements are only slightly lower." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-63", "text": "We trained our tagger with and without character-based embeddings, and with or without Polyglot-based initialisation (when available), both without lexical information and with lexicon information from all available lexicons, resulting in 4 to 12 training configurations." 
}, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-64", "text": "----------------------------------" }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-65", "text": "**LANGUAGE BASELINE**" }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-66", "text": "With best lexicon Gain when using (no lexicon) (selected on dev, cf." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-67", "text": "Tab." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-68", "text": "1) best lexicon w w + c w P + c w + l w + c + l w P + c + l w(+ l) w + c(+ l) w P + c(+ l) Table 3 : Overall results." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-69", "text": "PoS accuracy scores are given for each language in the baseline configuration (the same as Plank et al., 2016) and in the lexicon-enabled configuration." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-70", "text": "For each configuration, scores are given when using word embeddings only ( w), word and character-based embeddings ( w + c), and word and character-based embeddings with initialisation of word embeddings with Polyglot vectors ( w P + c)." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-71", "text": "The last columns show the difference between lexicon-enabled and baseline configurations." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-72", "text": "----------------------------------" }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-73", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-74", "text": "We use as a baseline the state-of-the-art bi-LSTM PoS tagger bilty, a freely available 6 and \"significantly refactored version of the code originally used\" by Plank et al. (2016) ." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-75", "text": "We use its standard configuration, with one bi-LSTM layer, characterbased embeddings size of 100, word embedding size of 64 (same as Polyglot embeddings), no multitask learning, 7 and 20 iterations for training." 
}, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-76", "text": "We extended bilty for enabling integration of lexical morphosyntactic information, in the way described in the previous section." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-77", "text": "5 Bouma et al., 2000; Oliver and Tadi\u0107, 2004; Heslin, 2007; Borin et al., 2008; Molinero et al., 2009; Sagot, 2010; Erjavec, 2010; Sagot and Walther, 2010; M\u011bchura, 2014; Sagot, 2014." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-78", "text": "6 https://github.com/bplank/bilstm-aux 7 Plank et al.'s (2016) secondary task-predicting the frequency class of each word-results in better OOV scores but virtually identical overall scores when averaged over all tested languages/corpora." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-79", "text": "For each lexicon-related configuration, we trained three variants of the tagger: (i) a variant without using character-based embeddings and standard (zero) initialisation of word embeddings before training, (ii) a variant with character-based embeddings and standard initialisation of word embeddings, and (iii) when Polyglot embeddings are available for the language at hand, a variant with character-based embeddings and initialisation of the word embeddings with the Polyglot embeddings." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-80", "text": "This is deliberately similar to Plank et al.'s (2016) Table 4 : Accuracy of the best system using a lexicon for words out of the training corpus (OOTC), and for words out of the training corpus that are present in the lexicon (OOTC in Lex.), as well as difference between the best system and the baseline without lexicon for these two subsets of words." 
}, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-81", "text": "----------------------------------" }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-82", "text": "**RESULTS**" }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-83", "text": "Our results show that using lexical information as an additional input layer to a bi-LSTM PoS tagger results in consistent improvements over 35 corpora." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-84", "text": "The improvement holds for all configurations on almost all corpora." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-85", "text": "As expected, the greatest improvements are obtained without characterbased embeddings, with a macro-averaged improvement of +2.56, versus +0.57 points when also using character-based embeddings." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-87", "text": "External lexical information is useful as it covers both words with an irregular morphology and words not present in the training data." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-88", "text": "The improvements are particularly high for the smaller datasets; in the w + c setup, the three languages with the highest improvements when using a lexicon are those with smallest datasets." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-89", "text": "Table 4 shows the accuracy of the best system, compared with the baseline, for words not in the training data (OOTC), and for whose that are present in the lexicon but not in the training data (OOTC in Lex)." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-90", "text": "While lexicon coverage is an important, it is not the only factor." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-91", "text": "we observe the improvements are much larger for the smaller datasets like Kazakh (kk) or Russian (ru)." 
}, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-92", "text": "However, the improvement is smaller for words that are not in the training data but are nevertheless present in the lexicon, which indicates that the contribution of the lexicon features to PoS prediction is not limited to the words that are covered by the lexicon but spreads throught the contexts by means of the bi-LSTM architecture." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-93", "text": "Moreover, we argue that the presence of the lexicon features aids compensate for character embeddings fit on smaller datasets, which are not necessarily more trustworthy." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-94", "text": "----------------------------------" }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-95", "text": "**CONCLUSION**" }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-96", "text": "Our work shows that word embeddings and external lexical information are complementary sources of morphological information, which both improve the accuracy of a state-of-the-art neural partof-speech tagger." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-97", "text": "It also confirms that both lexical information and character-based embeddings capture morphological information and help part-ofspeech tagging, especially for unknown words." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-98", "text": "Interestingly, we also observe improvements when using external lexical information together with character-based embeddings, and even when initialising with pre-computed word embeddings." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-99", "text": "This shows that the use of character-based embeddings is not sufficient for addressing the problem of out-of-vocabulary words." 
}, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-100", "text": "Further work includes using lexicons to tag finer-grained tag inventories, as well as a more thorough analysis on the relation between lexicon and training data properties." }, { "sent_id": "b8d0e66901698d201b9fb1f362b8c6-C001-101", "text": "Another natural follow-up to the work presented here would be to examine the interplay between lexical features and more complex neural architectures, for instance by using more than one bi-LSTM layer, or by embedding the n-hot lexiconbased vector before concatenating it to the wordand character-based embeddings." } ], "y": { "@BACK@": { "gold_contexts": [ [ "b8d0e66901698d201b9fb1f362b8c6-C001-12" ], [ "b8d0e66901698d201b9fb1f362b8c6-C001-17" ], [ "b8d0e66901698d201b9fb1f362b8c6-C001-28" ] ], "cite_sentences": [ "b8d0e66901698d201b9fb1f362b8c6-C001-12", "b8d0e66901698d201b9fb1f362b8c6-C001-17", "b8d0e66901698d201b9fb1f362b8c6-C001-28" ] }, "@MOT@": { "gold_contexts": [ [ "b8d0e66901698d201b9fb1f362b8c6-C001-30", "b8d0e66901698d201b9fb1f362b8c6-C001-31" ] ], "cite_sentences": [ "b8d0e66901698d201b9fb1f362b8c6-C001-31" ] }, "@USE@": { "gold_contexts": [ [ "b8d0e66901698d201b9fb1f362b8c6-C001-61" ], [ "b8d0e66901698d201b9fb1f362b8c6-C001-69" ], [ "b8d0e66901698d201b9fb1f362b8c6-C001-74" ] ], "cite_sentences": [ "b8d0e66901698d201b9fb1f362b8c6-C001-61", "b8d0e66901698d201b9fb1f362b8c6-C001-69", "b8d0e66901698d201b9fb1f362b8c6-C001-74" ] } } }, "ABC_603f49fc6ecf90da67a9a55986f217_29": { "x": [ { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-2", "text": "Incorporating lexical knowledge from semantic resources (e.g., WordNet ) has been shown to improve the quality of distributed word representations." 
}, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-3", "text": "This knowledge often comes in the form of relational triplets (x, r, y) where words x and y are connected by a relation type r. Existing methods either ignore the relation types, essentially treating the word pairs as generic related words, or employ rather restrictive assumptions to model the relational knowledge." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-4", "text": "We propose a novel approach to model relational knowledge based on low-rank subspace regularization, and conduct experiments on standard tasks to evaluate its effectiveness." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-5", "text": "----------------------------------" }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-7", "text": "Distributed word representations, also known as word embeddings, are low-dimensional vector representations for words that capture semantic aspects (Bengio et al., 2003; Pennington et al., 2014; Mikolov et al., 2013a) ." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-8", "text": "The algorithms for learning the word embeddings rely on distributional hypothesis (Harris, 1954) that words occurring in similar contexts tend to have similar meanings." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-9", "text": "Word embeddings have been shown to capture interesting linguistic regularities by simple vector arithmetic (e.g., v(king)-v(man)+v(woman)\u2248 v(queen)) (Mikolov et al., 2013c) ." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-10", "text": "They have also been used to derive downstream features for various NLP tasks, such as named entity recognition, chunking, dependency parsing, sentiment analysis, paraphrase detection and machine translation (Turian et al., 2010; Dhillon et al., 2011; Bansal et al., 2014; Maas et al., 2011; Socher et al., 2011; Zou et al., 2013) ." 
}, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-11", "text": "Their promise as semantic word representations has led to increasing research efforts on improving their quality." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-12", "text": "To this end, researchers have attempted to incorporate lexical knowledge into word embeddings by using additional regularization or loss terms in the learning objective." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-13", "text": "This lexical knowledge is often available in the form of triplets {(w i , r, w j )}, where the words w i and w j are connected by relation type r. These methods can be broadly classified into two categories." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-14", "text": "First family of methods use a (over-)generalized notion of similarity between words and ignore the type of relations, essentially treating the two words as generic similar words (Yu and Dredze, 2014; Faruqui et al., 2015; Liu et al., 2015) ." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-15", "text": "This places an implicit restriction on the types of relations that can be used with these methods." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-16", "text": "Second family of methods model each relation type by a distinct operator." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-17", "text": "Bordes et al. (2013) assumed a distinct relation vector r for every relation and minimize the distance between the translated first word and the second word, i.e., d(w i + r, w j ) for every triplet (w i , r, w j )." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-18", "text": "Socher et al. (2013) proposed a neural tensor network which uses a distinct tensor operator for every relation." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-19", "text": "These methods were used to learn entity and relation embeddings from a large collection of relation triplets for the task of knowledge base completion." 
}, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-20", "text": "Since these methods did not use any co-occurrence information from a text corpus, all entities were required to appear at least once in the training data, ruling out generalization to unseen entities 1 ." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-21", "text": "More recently, Xu et al. (2014) combined the training objective of SKIP-GRAM (Mikolov et al., 2013a) with the training objective of (Bordes et al., 2013) to incorporate lexical 1 There exists work on relation extraction and knowledgebase completion that combines structured relation triplets and logical rules with unstructured text using various forms of latent variable models (Riedel et al., 2013; Chang et al., 2014; Toutanova et al., 2015; Rockt\u00e4schel et al., 2015) ." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-22", "text": "knowledge into word embeddings." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-23", "text": "Fried and Duh (2014) combine the training objective of (Bordes et al., 2013) with that of neural language model (Collobert et al., 2011) using alternating direction method of multipliers (Boyd et al., 2011) ." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-24", "text": "Constant translation model (Bordes et al., 2013; Xu et al., 2014; Fried and Duh, 2014 ) (referred as CTM from now on), although an important step in modeling relational knowledge, makes a rather restrictive assumption requiring all triplets (w i , r, w j ) pertaining to a relation type r to satisfy w i + r \u2248 w j , \u2200(i, j)." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-25", "text": "This restriction can be severe when learning from a large text corpus since vector representation of a word also needs to respect a huge set of co-occurrence instances with other words." 
}, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-26", "text": "CTM is also not suitable for (i) modeling symmetric relations (e.g., synonyms, antonyms), and (ii) modeling transitive relations (e.g., synonyms, hypernyms)." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-27", "text": "In this paper, we propose a novel formulation for modeling the relational knowledge which addresses these issues by relaxing the constant translation assumption and modeling each relation by a low-rank subspace, i.e., all the word pairs pertaining to a relation are assumed to lie in a low-rank subspace." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-28", "text": "We demonstrate effectiveness of the learned word representations on the tasks of knowledge-base completion and word analogy." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-29", "text": "2 Subspace-regularized word embedding Although our proposed framework for relational modeling is general enough to use with any existing word embedding method, we work with Word2Vec model (Mikolov et al., 2013a) in this paper for illustrating our ideas and later for empirical evaluations." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-30", "text": "Word2Vec is a neural network model trained on sequence of words and its hidden layer activations can be read out as the word representations." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-31", "text": "Two variants were proposed in (Mikolov et al., 2013a ) -SKIP-GRAM, which maximizes the log likelihood of the local context words given the target word, and CBOW, which maximizes the log likelihood of the target word given its local context." 
}, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-32", "text": "More specifically, CBOW maximizes the objective" }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-33", "text": "where w t+c t\u2212c represents the words (or tokens) in the local context window around the t'th word (or token) and v t = \u2212c\u2264i\u2264c,i =0 w t+i can be seen as the average context vector." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-34", "text": "The vectors w, w \u2208 R d denote the input and output embeddings for word w, respectively." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-35", "text": "The input embeddings are taken as the final word representations." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-36", "text": "Negative sampling was proposed to efficiently optimize Eq. 1 (Mikolov et al., 2013b) ." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-37", "text": "We report empirical results with CBOW since it was computationally faster than SKIP-GRAM while giving similar results in our early explorations." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-38", "text": "We assume access to relational knowledge in the form of triplets R k = {(w i , r k , w j )} \u22001 \u2264 k \u2264 m, where words w i and w j are connected by relation r k and R k is the set of all triplets corresponding to relation r k with |R k | = n k ." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-39", "text": "This form of knowledge is commonly available from Knowledge Bases like WordNet (Fellbaum, 1998) ." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-40", "text": "Our framework is suitable for both symmetric relations where words can be interchanged (e.g., synonyms) and asymmetric relations which have a directional nature (e.g., hypernyms)." 
}, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-41", "text": "Let" }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-42", "text": "denote the difference vector for the triplet (w i , r k , w j ) which points from the vector of word w i to that of word w j . Let us construct a matrix D k \u2208 R d\u00d7n k by stacking the difference vectors corresponding to all the triplets in relation r k , i.e.," }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-43", "text": "To incorporate this relational knowledge into word embeddings, we enforce an approximate low-rank constraint on D k assuming" }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-44", "text": "where U k \u2208 R d\u00d7p , p d is the relation basis whose linear span contains all the difference vectors pertaining to relation r k ." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-45", "text": "For p = 2, this assumption implies that all the difference vectors pertaining to a relation lie in a 2-D plane." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-46", "text": "For" }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-47", "text": "implying that all the difference vectors for a relation are collinear." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-48", "text": "In this paper, we mainly study the rank-1 model (p=1) since it seems to be a natural starting point for evaluating the idea of subspace-regularized relational modeling." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-49", "text": "The study of higher rank models will potentially require a careful exploration of various structural regularizers for reconstruction matrix A k as well as a different evaluation scheme." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-50", "text": "We leave this study for future work." 
}, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-51", "text": "Rank-1 subspace regularization can also be motivated from the fact that word embeddings are able to capture some linguistic regularities (Mikolov et al., 2013c) along certain directions in the vector space." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-52", "text": "For example, the difference vector for word pair (king, queen) is approximately aligned with the difference vector for (man,woman), encoding the gender relation." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-53", "text": "The direction of the difference vectors carries significant information for these regularities which is evident from the success of cosine similarity metrics in the word analogy problems ." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-54", "text": "CTM that assumes w i + u k = w j \u2200(w i , r k , w j ) \u2208 R enforces an additional equal length constraint on the difference vectors, which may be rather restrictive, especially when the word vectors are also influenced by co-occurrence statistics (apart from relational knowledge)." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-55", "text": "Moreover, it may face following challenges in relational modeling:" }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-56", "text": "\u2022 It does not have a natural interpretation for modeling symmetric relations (e.g., synonyms, antonyms) that allow interchangeability of words in a given relation triplet (i.e., (w i , r k , w j ) \u21d0\u21d2 (w j , r k , w i ))." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-57", "text": "Having a constant translation of u k \u2208 R d from the first word to the second word leads to contradiction." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-58", "text": "\u2022 It is also not natural for modeling relations with transitive property (i.e., (w i , r k , w j ) \u2227 (w j , r k , w l ) =\u21d2 (w i , r k , w l )), again leading to contradictions." 
}, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-59", "text": "Common examples of such relations are synonyms and hypernyms." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-60", "text": "The proposed rank-1 subspace relation model naturally allows for modeling such relations by doing away with the constant length restriction on the difference vectors." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-61", "text": "Our empirical evaluations verify that this relaxation indeed leads to improved quality of word vectors with respect to capturing linguistic regularities." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-62", "text": "We incorporate the proposed relational model into the learning objective for word vectors by regularizing the matrix of difference vectors towards a rank-1 matrix." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-63", "text": "We impose a nonnegativity constraint on the reconstruction coefficients \u03b1 k if relation r k is asymmetric." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-64", "text": "This respects the unidirectional nature of asymmetric relations." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-65", "text": "To ensure uniqueness of solution for u k and \u03b1 k , we constrain u k 2 = 1. Leaving \u03b1 k completely free can end up creating spurious relations between any two words that are arbitrarily far but whose difference vector is directionally aligned with any of the relation basis vectors {u k } m k=1 ." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-66", "text": "To avoid this, we further impose a upper limit of c on the absolute value of elements of \u03b1 k ." 
}, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-67", "text": "We minimize the following joint objective for word vectors {w i } |V | i=1 and relation parameters {u k , \u03b1 k } m k=1 :" }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-68", "text": "where D k is the matrix of difference vectors as defined earlier and \u03bb is the regularization parameter." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-69", "text": "The first term in the objective takes into account the co-occurrence information text corpus while the second term incorporates the relational knowledge." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-70", "text": "----------------------------------" }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-71", "text": "**OPTIMIZING FOR WORD VECTORS:**" }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-72", "text": "We adopt parallel asynchronous stochastic gradient descent (SGD) with negative sampling approach of (Mikolov et al., 2013b) ." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-73", "text": "The model parameters for optimization are input embeddings (weights connecting input and hidden layer) and output embeddings (weights connecting hidden and output layer)." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-74", "text": "Input embeddings are taken as the final word embeddings." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-75", "text": "Each computing thread works with a predefined segment of the text corpus and updates parameters that are stored in a shared memory." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-76", "text": "In each gradient step of CBOW, a thread samples a target word and its local context window and updates the parameters of the neural network." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-77", "text": "It can be seen as sampling one of the f t (\u00b7), t = 1, 2, . . . , T and taking a gradient step with it." 
}, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-78", "text": "A small number of random target words are also sampled for the same context, treating them as negative examples for the gradient update." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-79", "text": "In the CBOW architecture, representations for context words are directly encoded as columns of the linear weight matrix W \u2208 R^{d\u00d7|V|} that maps the input bag-of-words layer to the hidden layer." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-80", "text": "The columns of W are taken as the word embeddings for the corresponding words in the vocabulary V." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-81", "text": "The reader is referred to Mikolov et al. (2013b) for more details on the optimization procedure for CBOW." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-82", "text": "If a word appears in the set of relation triplets R, our regularization term gets activated." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-83", "text": "Since we place the regularizer only on input embeddings, the following gradient updates due to the regularization term act only on input embeddings." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-84", "text": "where \u03b7 is the learning rate, and \u03b1^k_{ij} denotes the element of \u03b1_k corresponding to the column of matrix D_k which contains the difference vector (w_j \u2212 w_i) (and similarly for \u03b1^k_{ji})." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-85", "text": "The modifications in the learning rate as the SGD progresses are kept the same as in the original implementation of CBOW." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-86", "text": "Optimization for u_k and \u03b1_k: Instead of having stochastic gradient updates, we adopt an asynchronous batch update strategy for the relation basis {u_k}_{k=1}^m and reconstruction coefficients {\u03b1_k}_{k=1}^m." 
}, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-87", "text": "We launch one compute thread that keeps solving the batch optimization problem for {u_k}_{k=1}^m and {\u03b1_k}_{k=1}^m in an infinite loop until the optimization for word embeddings finishes." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-88", "text": "The batch optimization problem for a symmetric relation r_k is:" }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-89", "text": "where D_k \u2208 R^{d\u00d7n_k} is the matrix of difference vectors for all triplets corresponding to relation r_k as defined in Eq. 2." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-90", "text": "Without the absolute value constraint on \u03b1_k, this problem can be exactly solved by SVD." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-91", "text": "We follow an alternating optimization procedure for solving this problem." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-92", "text": "We initialize u_k to the top left singular vector of D_k and then alternate between solving two least-squares sub-problems for u_k and \u03b1_k respectively, with the corresponding constraints." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-93", "text": "For asymmetric relations, there is an additional nonnegativity constraint on \u03b1_k." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-94", "text": "We use projected gradient descent to solve these constrained least-squares problems." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-95", "text": "----------------------------------" }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-96", "text": "**EMPIRICAL OBSERVATIONS**" }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-97", "text": "We report preliminary evaluations of the proposed model (termed RELSUB) on the tasks of word analogy and knowledge-base completion." 
}, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-98", "text": "We use English Wikipedia for training, which contains approximately 4.8 million articles and 2 billion tokens." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-99", "text": "We lowercase all the text and tokenize using the Stanford NLP tokenizer." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-100", "text": "We use two datasets for evaluating the proposed method." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-101", "text": "Google word analogy data (Mikolov et al., 2013a) contains 19544 analogy relations (14 relation types: 5 semantic, 9 syntactic) of the form a:b::c:d, constructed from 550 unique relation triplets." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-102", "text": "We use this data only for evaluation (test phase)." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-103", "text": "WordRep contains a large collection of relation triplets (44584 triplets in total, 25 relation types: 18 semantic, 7 syntactic) extracted from WordNet, Wikipedia and Dictionary." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-104", "text": "For each relation type, we randomly split the triplets in a 4:1 ratio, with the larger split used for training and the smaller split used for testing." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-105", "text": "We make sure that there is no word overlap between training and test triplets." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-106", "text": "We also remove triplets containing words from Google Analogy data from the training set." 
}, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-107", "text": "We compare the proposed RELSUB model with two methods: (i) CBOW (Mikolov et al., 2013a), and (ii) RELCONST, which is based on the constant translation model for relations that was originally proposed by Bordes et al. (2013) for embedding knowledge bases and was recently used by Xu et al. (2014) for learning word embeddings. (Table 2: Google analogy data: accuracy on the word analogy task.)" }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-108", "text": "Our objective for RELCONST is the same as Eq. 4, except that {\u03b1_k}_{k=1}^m are set equal to the vector of all 1's and the norm constraint on u_k is removed." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-109", "text": "This enables us to directly test the merit of the proposed rank-1 subspace relational model over that of the constant translational model in the same regularization framework." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-110", "text": "Note that this objective is similar in spirit to Xu et al. (2014) in the sense that it also uses a constant translation model for relations." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-111", "text": "However, Xu et al. (2014) employ a maximum-margin objective on the relation triplets, as originally proposed by Bordes et al. (2013)." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-112", "text": "It encourages the loss (measured in terms of \u2113_2 distance) for true relation triplets to be smaller than the loss for randomly corrupted relation triplets." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-113", "text": "Instead of a maximum-margin objective for relational knowledge, our model uses a simpler regularization-based objective." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-114", "text": "We could not obtain the implementation of RC-NET (Xu et al., 2014) due to copyright issues cited by its authors." 
}, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-115", "text": "We also cannot compare with approaches that use only a knowledge base for training (Faruqui et al., 2015), since they do not learn or modify the embeddings of unseen words and our evaluation triplets do not overlap with training triplets." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-116", "text": "We use the CBOW implementation in the publicly available Word2Vec code (https://code.google.com/p/word2vec/) for our experiments." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-117", "text": "Our vocabulary has 400k words and we use a dimensionality of 300 for embeddings." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-118", "text": "For all other parameters, we use the default values that the Word2Vec code comes with, including a context window size of 5 tokens to each side, 5 negative samples per positive sample for the negative sampling technique, etc." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-119", "text": "For both RELSUB and RELCONST, we set the regularization parameter to \u03bb/|R| = 1e\u22124 in all our experiments." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-120", "text": "We set the upper limit c in Eq. 4 to 1." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-121", "text": "The parameters were not fine-tuned rigorously, but these values seemed to work reasonably well for us." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-122", "text": "We run a total of 5 epochs of SGD over the text corpus for all methods." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-123", "text": "In the knowledge-base completion task, we want to predict the missing word of a relation triplet." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-124", "text": "For a triplet (x, r, y), we assume that x (the first word) and r (the relation type) are observed, and the task is to predict the missing word y. 
We restrict the search for the missing word to the most frequent 300k words (75% of the vocabulary)." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-125", "text": "The missing word is predicted to be the closest word along the rank-1 subspace spanned by the relation vector (restricted by c in Eq. 4)." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-126", "text": "For RELCONST, the missing word is predicted by translating the first word by the relation vector and then searching for the nearest word." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-127", "text": "The accuracy results on the WordRep data are shown in Table 1." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-128", "text": "Relaxing the constant translation assumption to the rank-1 subspace assumption results in significant improvements on this task." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-129", "text": "In the analogy task, we want to predict the missing word in an analogy tuple a:b::c:?." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-130", "text": "We use the Google word-analogy data (Mikolov et al., 2013a) for this evaluation." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-131", "text": "We observe considerable gains with RELSUB over CBOW for semantic categories." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-132", "text": "The accuracy of knowledge-regularized methods on syntactic categories is a little worse than CBOW and only slightly better than RELCONST, which is contrary to our observation on the knowledge-base completion task." }, { "sent_id": "603f49fc6ecf90da67a9a55986f217-C001-133", "text": "This is due to the fact that the analogy task uses the difference vector (b \u2212 a) instead of the learned relation vector, which is assumed to be unknown." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "603f49fc6ecf90da67a9a55986f217-C001-7" ], [ "603f49fc6ecf90da67a9a55986f217-C001-21" ], [ "603f49fc6ecf90da67a9a55986f217-C001-31" ], [ "603f49fc6ecf90da67a9a55986f217-C001-101" ] ], "cite_sentences": [ "603f49fc6ecf90da67a9a55986f217-C001-7", "603f49fc6ecf90da67a9a55986f217-C001-21", "603f49fc6ecf90da67a9a55986f217-C001-31", "603f49fc6ecf90da67a9a55986f217-C001-101" ] }, "@USE@": { "gold_contexts": [ [ "603f49fc6ecf90da67a9a55986f217-C001-29" ], [ "603f49fc6ecf90da67a9a55986f217-C001-31", "603f49fc6ecf90da67a9a55986f217-C001-37" ], [ "603f49fc6ecf90da67a9a55986f217-C001-101", "603f49fc6ecf90da67a9a55986f217-C001-102" ], [ "603f49fc6ecf90da67a9a55986f217-C001-107" ], [ "603f49fc6ecf90da67a9a55986f217-C001-130" ] ], "cite_sentences": [ "603f49fc6ecf90da67a9a55986f217-C001-29", "603f49fc6ecf90da67a9a55986f217-C001-31", "603f49fc6ecf90da67a9a55986f217-C001-101", "603f49fc6ecf90da67a9a55986f217-C001-107", "603f49fc6ecf90da67a9a55986f217-C001-130" ] }, "@DIF@": { "gold_contexts": [ [ "603f49fc6ecf90da67a9a55986f217-C001-107", "603f49fc6ecf90da67a9a55986f217-C001-131" ] ], "cite_sentences": [ "603f49fc6ecf90da67a9a55986f217-C001-107" ] } } }, "ABC_4d8ae52583d41b4124800c419963df_29": { "x": [ { "sent_id": "4d8ae52583d41b4124800c419963df-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-2", "text": "This paper presents a model for disfluency detection in spontaneous speech transcripts called LSTM Noisy Channel Model." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-25", "text": "**RELATED WORK**" }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-3", "text": "The model uses a Noisy Channel Model (NCM) to generate n-best candidate disfluency analyses and a Long Short-Term Memory (LSTM) language model to score the underlying fluent sentences of each analysis." 
}, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-4", "text": "The LSTM language model scores, along with other features, are used in a MaxEnt reranker to identify the most plausible analysis." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-5", "text": "We show that using an LSTM language model in the reranking process of the noisy channel disfluency model improves the state-of-the-art in disfluency detection." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-6", "text": "----------------------------------" }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-8", "text": "Disfluency is a characteristic of spontaneous speech which is not present in written text." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-9", "text": "Disfluencies are informally defined as interruptions in the normal flow of speech that occur in different forms, including false starts, corrections, repetitions and filled pauses." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-10", "text": "According to Shriberg's (1994) definition, the basic pattern of speech disfluencies contains three parts: reparandum (sometimes called edit), interregnum and repair." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-11", "text": "Example 1 illustrates a disfluent structure, where the reparandum to Boston is the part of the utterance that is replaced, the interregnum uh, I mean is an optional part of a disfluent structure that consists of a filled pause uh and a discourse marker I mean, and the repair to Denver replaces the reparandum." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-12", "text": "The fluent version of Example 1 is obtained by deleting" }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-13", "text": "reparandum and interregnum words." 
}, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-14", "text": "While the disfluency rate varies with the context, age and gender of the speaker, Bortfeld et al. (2001) reported disfluencies once in every 17 words." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-15", "text": "Such a frequency is high enough to reduce the readability of speech transcripts." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-16", "text": "Moreover, disfluencies pose a major challenge to natural language processing tasks, such as dialogue systems, that rely on speech transcripts (Ostendorf et al., 2008)." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-17", "text": "Since such systems are usually trained on fluent, clean corpora, it is important to apply a speech disfluency detection system as a pre-processor to find and remove disfluencies from input data." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-18", "text": "By disfluency detection, we usually mean identifying and deleting reparandum words." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-19", "text": "Filled pauses and discourse markers belong to a closed set of words, so they are trivial to detect." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-20", "text": "In this paper, we introduce a new model for detecting restart and repair disfluencies in spontaneous speech transcripts called the LSTM Noisy Channel Model (LSTM-NCM)." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-21", "text": "The model uses a Noisy Channel Model (NCM) to generate n-best candidate disfluency analyses, and a Long Short-Term Memory (LSTM) language model to rescore the NCM analyses." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-22", "text": "The language model scores are used as features in a MaxEnt reranker to select the most plausible analysis." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-23", "text": "We show that this novel approach improves the current state-of-the-art." 
}, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-24", "text": "----------------------------------" }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-26", "text": "Approaches to the disfluency detection task fall into three main categories: sequence tagging, parsing-based and noisy channel model." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-27", "text": "The sequence tagging models label words as fluent or disfluent using different techniques, including conditional random fields (Ostendorf and Hahn, 2013; Zayats et al., 2014; Ferguson et al., 2015), hidden Markov models (Liu et al., 2006; Schuler et al., 2010) or recurrent neural networks (Hough and Schlangen, 2015; Zayats et al., 2016)." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-28", "text": "Although sequence tagging models can be easily generalized to a wide range of domains, they require a specific state space for disfluency detection, such as begin-inside-outside (BIO) style states that label words as being inside or outside of a reparandum word sequence." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-29", "text": "The parsing-based approaches refer to parsers that detect disfluencies, as well as identifying the syntactic structure of the sentence (Rasooli and Tetreault, 2013; Honnibal and Johnson, 2014; Yoshikawa et al., 2016)." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-30", "text": "Training a parsing-based model requires large annotated treebanks that contain both disfluencies and syntactic structures." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-31", "text": "Noisy channel models (NCMs) use the similarity between reparandum and repair as an indicator of disfluency." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-32", "text": "However, applying an effective language model (LM) inside an NCM is computationally complex." 
}, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-33", "text": "To alleviate this problem, some researchers use more effective LMs to rescore the NCM disfluency analyses." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-34", "text": "applied a syntactic parsing-based LM trained on the fluent version of the Switchboard corpus to rescore the disfluency analyses." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-35", "text": "Zwarts and Johnson (2011) trained external n-gram LMs on a variety of large speech and non-speech corpora to rank the analyses." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-36", "text": "Using the external LM probabilities as features to the reranker improved the baseline NCM." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-37", "text": "The idea of applying external language models in the reranking process of the NCM motivates our model in this work." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-38", "text": "----------------------------------" }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-39", "text": "**LSTM NOISY CHANNEL MODEL**" }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-40", "text": "We follow in using an NCM to find the n-best candidate disfluency analyses for each sentence." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-41", "text": "The NCM, however, lacks an effective language model to capture more complicated language structures." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-42", "text": "To overcome this problem, our idea is to use different LSTM language models to score the underlying fluent sentences of the analyses proposed by the NCM and use the language model scores as features to a MaxEnt reranker to select the best analysis." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-43", "text": "In the following, we describe our model and its components in detail." 
}, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-44", "text": "In the NCM of speech disfluency, we assume that there is a well-formed source utterance X to which some noise is added and generates a disfluent utterance Y as follows." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-45", "text": "X = a flight to Denver; Y = a flight to Boston uh I mean to Denver. Given Y, the goal of the NCM is to find the most likely source sentence X\u0302 such that:" }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-46", "text": "As shown in Equation 2, the NCM contains two components: the channel model P(Y|X) and the language model P(X)." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-70", "text": "**LSTM**" }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-47", "text": "Calculating the channel model and language model probabilities, the NCM generates the 25-best candidate disfluency analyses as follows." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-48", "text": "(3)" }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-49", "text": "Example 3 shows sample outputs of the NCM, where potential reparandum words are specified with strikethrough text." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-50", "text": "The MaxEnt reranker is applied to the candidate analyses of the NCM to select the most plausible one." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-51", "text": "----------------------------------" }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-52", "text": "**CHANNEL MODEL**" }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-53", "text": "We assume that X is a substring of Y, so the source sentence X is obtained by deleting words from Y." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-54", "text": "For each sentence Y, there are only a finite number of potential source sentences." 
}, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-55", "text": "However, with the increase in the length of Y, the number of possible source sentences X grows exponentially, so it is not feasible to do an exhaustive search." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-56", "text": "Moreover, since disfluent utterances may contain an unbounded number of crossed dependencies, a context-free grammar is not suitable for finding the alignments." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-57", "text": "The crossed dependencies refer to the relation between repair and reparandum words, which are usually the same or very similar words in roughly the same order, as in Example 4. We apply a Tree Adjoining Grammar (TAG) based transducer, which is a more expressive formalism and provides a systematic way of formalising the channel model." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-58", "text": "The TAG channel model encodes the crossed dependencies of speech disfluency, rather than reflecting the syntactic structure of the sentence." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-59", "text": "The TAG transducer is effectively a simple first-order Markov model which generates each word in the reparandum conditioned on the preceding word in the reparandum and the corresponding word in the repair." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-60", "text": "More details about the TAG channel model can be found in ." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-61", "text": "----------------------------------" }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-62", "text": "**LANGUAGE MODEL**" }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-63", "text": "The language model of the NCM evaluates the fluency of the sentence with disfluency removed." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-64", "text": "The language model is expected to assign a very high probability to a fluent sentence X (e.g. 
a flight to Denver) and a lower probability to a sentence Y which still contains disfluency (e.g. a flight to Boston uh I mean to Denver)." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-65", "text": "However, it is computationally complex to use an effective language model within the NCM." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-66", "text": "The reason is that the polynomial-time dynamic programming parsing algorithms of TAG can be used to search for likely repairs only if they are combined with simple language models, such as a bigram LM." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-67", "text": "The bigram LM within the NCM is too simple to capture more complicated language structure." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-68", "text": "In order to alleviate this problem, we follow Zwarts and Johnson (2011) by training LMs on different corpora, but we apply state-of-the-art recurrent neural network (RNN) language models." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-69", "text": "----------------------------------" }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-71", "text": "We use a long short-term memory (LSTM) neural network for training language models." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-72", "text": "LSTM is a particular type of recurrent neural network which has achieved state-of-the-art performance in many tasks including language modelling (Mikolov et al., 2010; Jozefowicz et al., 2016)." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-73", "text": "LSTM is able to learn long dependencies between words, which can be highly beneficial for the speech disfluency detection task." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-74", "text": "Moreover, it allows for adopting a distributed representation of words by constructing word embeddings (Mikolov et al., 2013)." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-75", "text": "We train forward and backward (i.e. 
input sentences are given in reverse order) LSTM language models using the truncated backpropagation-through-time algorithm (Rumelhart et al., 1986) with a minibatch size of 20 and a total of 13 epochs." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-76", "text": "The LSTM model has two layers and 200 hidden units." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-77", "text": "The initial learning rate for the stochastic gradient optimizer is set to 1 and is decayed by a factor of 0.5 for each epoch after epoch 4." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-78", "text": "We limit the maximum sentence length for training our model due to the high computational complexity of longer histories in the LSTM." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-79", "text": "In our experiments, considering a maximum of 50 words for each sentence leads to good results." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-80", "text": "The size of the word embeddings is 200, and they are randomly initialized for all LSTM LMs." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-81", "text": "Using each forward and backward LSTM language model, we assign a probability to the underlying fluent parts of each candidate analysis." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-82", "text": "----------------------------------" }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-83", "text": "**RERANKER**" }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-84", "text": "In order to rank the 25-best candidate disfluency analyses of the NCM and select the most suitable one, we apply the MaxEnt reranker proposed by ." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-85", "text": "We use the feature set introduced by Zwarts and Johnson (2011), but instead of n-gram scores, we apply the LSTM language model probabilities." 
}, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-86", "text": "The features are so good that the reranker without any external language model is already a state-of-the-art system, providing a very strong baseline for our work." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-87", "text": "The reranker uses both model-based scores (including NCM scores and LM probabilities) and surface pattern features (which are boolean indicators), as described in Table 1." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-88", "text": "Our reranker optimizes the expected f-score approximation described in Zwarts and Johnson (2011) with L2 regularisation." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-89", "text": "----------------------------------" }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-90", "text": "**CORPORA FOR LANGUAGE MODELLING**" }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-91", "text": "In this work, we train forward and backward LSTM language models on the Switchboard (Godfrey and Holliman, 1993) and Fisher (Cieri et al., 2004) corpora." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-92", "text": "Fisher consists of 2.2 \u00d7 10^7 tokens of transcribed text, but disfluencies are not annotated in it." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-93", "text": "Switchboard is a widely available corpus (1.2 \u00d7 10^6 tokens) where disfluencies are annotated according to Shriberg's (1994) scheme." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-94", "text": "Since the bigram language model of the NCM is trained on this corpus, we cannot directly use Switchboard to build the LSTM LMs." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-95", "text": "The reason is that if the training data of Switchboard is used both for predicting language fluency and optimizing the loss function, the reranker will overestimate the" 
}, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-96", "text": "(Table 1, model-based features: 1-2. forward & backward LSTM LM scores; 3-7. log probability of the entire NCM;" }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-97", "text": "8. sum of the log LM probability & the log channel model probability plus the number of edits in the sentence;" }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-98", "text": "9. channel model probability; features follow Zwarts and Johnson (2011).)" }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-99", "text": "weight related to the LM features extracted from Switchboard." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-100", "text": "This is because the fluent sentence itself is part of the language model (Zwarts and Johnson, 2011)." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-101", "text": "As a solution, we apply k-fold cross-validation (k = 20) to train the LSTM language models when using the Switchboard corpus." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-102", "text": "We follow Charniak and Johnson (2001) in splitting the Switchboard corpus into training, development and test sets." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-103", "text": "The training data consists of all sw[23]*.dps files, the development set consists of all sw4[5-9]*.dps files and the test data consists of all sw4[0-1]*.dps files." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-104", "text": "Following , we remove all partial words and punctuation from the training data." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-105", "text": "Although partial words are very strong indicators of disfluency, standard speech recognizers never produce them in their outputs, so this makes our evaluation both harder and more realistic." 
}, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-106", "text": "----------------------------------" }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-107", "text": "**RESULTS AND DISCUSSION**" }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-108", "text": "We assess the proposed model for disfluency detection with all MaxEnt features described in Table 1 against the baseline model." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-109", "text": "The noisy channel model with exactly the same reranker features, except the LSTM LMs, forms the baseline model." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-110", "text": "To evaluate our system, we use two metrics: f-score and error rate." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-111", "text": "Charniak and Johnson (2001) used the f-score of labelling reparanda or \"edited\" words, while Fiscus et al. (2004) defined an \"error rate\" measure, which is the number of words falsely labelled divided by the number of reparanda words." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-112", "text": "Since only 6% of words are disfluent in the Switchboard corpus, accuracy is not a good measure of system performance." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-113", "text": "F-score, on the other hand, focuses more on detecting \"edited\" words, so it is a decent metric for highly skewed data." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-114", "text": "According to Tables 2 and 3, the LSTM noisy channel model outperforms the baseline." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-115", "text": "The experiments on the Switchboard and Fisher corpora demonstrate that the LSTM LMs provide information about the global fluency of an analysis that the local features of the reranker do not capture." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-116", "text": "The LSTM language model trained on the Switchboard corpus results in the greatest improvement." 
}, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-117", "text": "Switchboard is in the same domain as the test data and it is also disfluency annotated." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-118", "text": "Either or both of these might be the reason why Switchboard seems to be better in comparison with Fisher which is a larger corpus and might be expected to make a better language model." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-119", "text": "Moreover, the backward LSTMs have better performance in comparison with the forward ones." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-120", "text": "It seems when sentences are fed in reverse order, the model can more easily detect the unexpected word order associated with the reparandum to detect disfluencies." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-121", "text": "In other words, that the disfluency is observed \"after\" the fluent repair in a backward language model is helpful for recognizing disfluencies." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-122", "text": "Table 3 : Expected error rates on the dev set for a variey of LSTM language models." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-123", "text": "We compare the performance of Kneser-Ney smoothed 4-gram language models with the LSTM corresponding on the reranking process of the noisy channel model." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-124", "text": "We estimate the 4-gram models and assign probabilities to the fluent parts of disflueny analyses using the SRILM toolkit (Stolcke, 2002) ." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-125", "text": "As Tables 4 and 5 We also compare our best model on the development set to the state-of-the-art methods in the literature." 
}, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-126", "text": "As shown in Table 6 , the LSTM noisy channel model outperforms the results of prior work, achieving a state-of-the-art performance of 86.8." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-127", "text": "It also has better performance in comparison with Ferguson et al. (2015) and Zayat et al.'s (2016) models, even though they use richer input that includes prosodic features or partial words." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-128", "text": "----------------------------------" }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-129", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-130", "text": "In this paper, we present a new model for disfluency detection from spontaneous speech tranModel f-score Yoshikawa et al. (2016) 62.5 79.7 81.0 Rasooli and Tetreault (2013) 81." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-131", "text": "4 Qian and Liu (2013) 82.1 Honnibal and Johnson (2014) 84.1 Ferguson et al. (2015) * 85.4 Zwarts and Johnson (2011) 85.7 Zayats et al. (2016) * 85.9 LSTM-NCM 86.8 Table 6 : Comparison of the LSTM-NCM to stateof-the-art methods on the dev set." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-132", "text": "*Models have used richer input." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-133", "text": "scripts." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-134", "text": "It uses a long short-term memory neural network language model to rescore the candidate disfluency analyses produced by a noisy channel model." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-135", "text": "The LSTM language model scores as features in a MaxEnt reranker improves the model's ability to detect and correct restart and repair disfluencies." 
}, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-136", "text": "The model outperforms other models reported in the literature, including models that exploit richer information from the input." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-137", "text": "As future work, we apply more complex LSTM language models such as sequence-to-sequence on the reranking process of the noisy channel model." }, { "sent_id": "4d8ae52583d41b4124800c419963df-C001-138", "text": "We also intend to investigate the effect of integrating LSTM language models into other kinds of disfluency detection models, such as sequence labelling and parsing-based models." } ], "y": { "@EXT@": { "gold_contexts": [ [ "4d8ae52583d41b4124800c419963df-C001-68" ], [ "4d8ae52583d41b4124800c419963df-C001-85" ] ], "cite_sentences": [ "4d8ae52583d41b4124800c419963df-C001-68", "4d8ae52583d41b4124800c419963df-C001-85" ] }, "@USE@": { "gold_contexts": [ [ "4d8ae52583d41b4124800c419963df-C001-88" ], [ "4d8ae52583d41b4124800c419963df-C001-100", "4d8ae52583d41b4124800c419963df-C001-94", "4d8ae52583d41b4124800c419963df-C001-95" ] ], "cite_sentences": [ "4d8ae52583d41b4124800c419963df-C001-88", "4d8ae52583d41b4124800c419963df-C001-100" ] }, "@UNSURE@": { "gold_contexts": [ [ "4d8ae52583d41b4124800c419963df-C001-100" ] ], "cite_sentences": [ "4d8ae52583d41b4124800c419963df-C001-100" ] } } }, "ABC_685b0b0da37b81765bb78f0f87505b_29": { "x": [ { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-90", "text": "**RESULTS AND ANALYSIS**" }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-2", "text": "Abstractive summarization, the task of rewriting and compressing a document into a short summary, has achieved considerable success with neural sequence-tosequence models." 
}, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-3", "text": "However, these models can still benefit from stronger natural language inference skills, since a correct summary is logically entailed by the input document, i.e., it should not contain any contradictory or unrelated information." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-4", "text": "We incorporate such knowledge into an abstractive summarization model via multi-task learning, where we share its decoder parameters with those of an entailment generation model." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-5", "text": "We achieve promising initial improvements based on multiple metrics and datasets (including a test-only setting)." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-6", "text": "The domain mismatch between the entailment (captions) and summarization (news) datasets suggests that the model is learning some domain-agnostic inference skills." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-7", "text": "----------------------------------" }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-9", "text": "Abstractive summarization, the task of rewriting a document into a short summary is a significantly more challenging (and natural) task than extractive summarization, which only involves choosing which sentence from the original document to keep or discard in the output summary." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-10", "text": "Neural sequence-to-sequence models have led to substantial improvements on this task of abstractive summarization, via machine translation inspired encoder-aligner-decoder approaches, further enhanced via convolutional encoders, pointer-copy mechanisms, and hierarchical attention (Rush et al., 2015; Nallapati et al., 2016; See et al., 2017) ." 
}, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-11", "text": "Despite these promising recent improvements, Input Document: may is a pivotal month for moving and storage companies ." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-12", "text": "Ground-truth Summary: moving companies hit bumps in economic road Baseline Summary: a month to move storage companies Multi-task Summary: pivotal month for storage firms there is still scope in better teaching summarization models about the general natural language inference skill of logical entailment generation." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-13", "text": "This is because the task of abstractive summarization involves two subtasks: salient (important) event detection as well as logical compression, i.e., the summary should not contain any information that is contradictory or unrelated to the original document." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-14", "text": "Current methods have to learn both these skills from the same dataset and a single model." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-15", "text": "Therefore, there is benefit in learning the latter ability of logical compression via external knowledge from a separate entailment generation task, that will specifically teach the model how to rewrite and compress a sentence such that it logically follows from the original input." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-16", "text": "To achieve this, we employ the recent paradigm of sequence-to-sequence multi-task learning (Luong et al., 2016) ." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-17", "text": "We share the decoder parameters of the summarization model with those of the entailment-generation model, so as to generate summaries that are good at both extracting important facts from as well as being logically entailed by the input document." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-18", "text": "Fig. 
1 shows such an (actual) output example from our model, where it successfully learns both salient information extraction as well as entailment, unlike the strong baseline model." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-19", "text": "Empirically, we report promising initial improvements over some solid baselines based on several metrics, and on multiple datasets: Gigaword and also a test-only setting of DUC." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-91", "text": "----------------------------------" }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-64", "text": "The dataset has approximately 3.8 million training pairs." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-20", "text": "Importantly, these improvements are achieved despite the fact that the domain of the entailment dataset (image captions) is substantially different from the domain of the summarization datasets (general news), which suggests that the model is learning certain domain-independent inference skills." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-21", "text": "Our next steps for this workshop paper include incorporating stronger pointer-based models and employing the new multi-domain entailment corpus (Williams et al., 2017)." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-22", "text": "----------------------------------" }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-23", "text": "**RELATED WORK**" }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-24", "text": "Earlier summarization work focused more on extractive (and compression) based summarization, i.e., selecting which sentences to keep vs discard, and also compressing based on choosing grammatically correct sub-sentences having the most important pieces of information (Jing, 2000; Knight and Marcu, 2002; Clarke and Lapata, 2008; Filippova et al., 2015)."
}, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-25", "text": "Bigger datasets and neural models have allowed the addressing of the complex reasoning involved in abstractive summarization, i.e., rewriting and compressing the input document into a new summary." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-26", "text": "Several advances have been made in this direction using machine translation inspired encoder-aligner-decoder models, convolution-based encoders, switching pointer and copy mechanisms, and hierarchical attention models (Rush et al., 2015; Nallapati et al., 2016; See et al., 2017) ." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-27", "text": "Recognizing textual entailment (RTE) is the classification task of predicting whether the relationship between a premise and hypothesis sentence is that of entailment (i.e., logically follows), contradiction, or independence (Dagan et al., 2006) ." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-28", "text": "The SNLI corpus Bowman et al. (2015) allows training accurate end-to-end neural networks for this task." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-29", "text": "Some previous work (Mehdad et al., 2013; Gupta et al., 2014) has explored the use of textual entailment recognition for redundancy detection in summarization." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-30", "text": "They label relationships between sentences, so as to select the most informative and non-redundant sentences for summarization, via sentence connectivity and graphbased optimization and fusion." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-31", "text": "Our focus, on the other hand, is entailment generation and not recognition, i.e., to teach summarization models the general natural language inference skill of generating a compressed sentence that logically entails the original longer sentence, so as to produce more effective short summaries." 
}, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-32", "text": "We achieve this via multi-task learning with entailment generation." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-33", "text": "Multi-task learning involves sharing parameters between related tasks, whereby each task benefits from extra information in the training signals of the related tasks, and also improves its generalization performance." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-34", "text": "Luong et al. (2016) showed improvements on translation, captioning, and parsing in a shared multi-task setting." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-35", "text": "Recently, Pasunuru and Bansal (2017) extend this idea to video captioning with two related tasks: video completion and entailment generation." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-36", "text": "We demonstrate that abstractive text summarization models can also be improved by sharing parameters with an entailment generation task." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-37", "text": "----------------------------------" }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-38", "text": "**MODELS**" }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-39", "text": "First, we discuss our baseline model which is similar to the machine translation encoder-alignerdecoder model of Luong et al. (2015) , and presented by Chopra et al. (2016) ." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-89", "text": "----------------------------------" }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-40", "text": "Next, we introduce our multi-task learning approach of sharing the parameters between abstractive summarization and entailment generation models." 
}, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-41", "text": "----------------------------------" }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-42", "text": "**BASELINE MODEL**" }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-43", "text": "Our baseline model is a strong, multi-layered encoder-attention-decoder model with bilinear attention, similar to Luong et al. (2015) and following the details in Chopra et al. (2016) ." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-44", "text": "Here, we encode the source document with a two-layered LSTM-RNN and generate the summary using another two-layered LSTM-RNN decoder." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-45", "text": "The word probability distribution at time step t of the decoder is defined as follows:" }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-46", "text": "where g is a non-linear function and c t and s t are the context vector and LSTM-RNN decoder hidden state at time step t, respectively." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-47", "text": "The context vector c t = \u03b1 t,i h i is a weighted combination of encoder hidden states h i , where the attention weights are learned through the bilinear attention mechanism proposed in Luong et al. (2015) ." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-48", "text": "For the rest of the paper, we use same notations." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-49", "text": "We also use the same model architecture for the entailment generation task, i.e., a sequence-tosequence model encoding the premise and decoding the entailed hypothesis, via bilinear attention between them." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-50", "text": "Figure 2: Multi-task learning of the summarization task (left) with the entailment generation task (right)." 
}, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-51", "text": "----------------------------------" }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-52", "text": "**MULTI-TASK LEARNING**" }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-53", "text": "Multi-task learning helps in sharing knowledge between related tasks across domains (Luong et al., 2015) ." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-54", "text": "In this work, we show improvements on the task of abstractive summarization by sharing its parameters with the task of entailment generation." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-55", "text": "Since a summary is entailed by the input document, sharing parameters with the entailment generation task improves the logically-directed aspect of the summarization model, while maintaining the salient information extraction aspect." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-56", "text": "In our multi-task setup, we share the decoder parameters of both the tasks (along with the word embeddings), as shown in Fig. 2 , and we optimize the two loss functions (one for summarization and another for entailment generation) in alternate mini-batches of training." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-57", "text": "Let \u03b1 s be the number of mini-batches of training for summarization after which it is switched to train \u03b1 e number of minibatches for entailment generation." 
}, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-58", "text": "Then, the mixing ratio is defined as" }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-59", "text": "----------------------------------" }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-60", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-61", "text": "----------------------------------" }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-62", "text": "**DATASETS**" }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-63", "text": "Gigaword Corpus We use the exact annotated Gigaword corpus provided by Rush et al. (2015) ." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-65", "text": "We use 10, 000 pairs as validation set and the exact test sample provided by Rush et al. (2015) as our test set." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-66", "text": "We use the first sentence of the article as the source with vocabulary size of 119, 505 and article headline as target with vocabulary size of 68, 885." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-67", "text": "----------------------------------" }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-68", "text": "**DUC TEST CORPUS**" }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-69", "text": "----------------------------------" }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-70", "text": "**SNLI CORPUS**" }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-71", "text": "For the task of entailment generation, we use the Standford Natural Language Inference (SNLI) corpus (Bowman et al., 2015) , where we only use the entailment-labeled pairs and regroup the splits to have a zero overlap traintest split and have a multi-reference test set, as suggested by Pasunuru and Bansal (2017) ." 
}, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-72", "text": "Out of 190, 113 entailments pairs, we use 145, 822 unique premise pairs for training, and the rest of them are equally divided into dev and test sets." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-73", "text": "----------------------------------" }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-74", "text": "**EVALUATION**" }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-75", "text": "Following previous work (Nallapati et al., 2016; Chopra et al., 2016; Rush et al., 2015) , we use the full-length F1 variant of Rouge (Lin, 2004) for the Gigaword results, and the 75-bytes length limited Recall variant of Rouge for DUC." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-76", "text": "Additionally, we also report other standard language generation metrics (as motivated recently by See et al. (2017) ): METEOR (Denkowski and Lavie, 2014) , BLEU-4 (Papineni et al., 2002) , and CIDEr-D , based on the MS-COCO evaluation script (Chen et al., 2015) ." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-77", "text": "----------------------------------" }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-78", "text": "**TRAINING DETAILS**" }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-79", "text": "We use the following simple settings for all the models, unless otherwise specified." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-80", "text": "We unroll the encoder RNN's to a maximum of 50 time steps and decoder RNN's to a maximum of 30 time steps." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-81", "text": "Models ROUGE-1 ROUGE-2 ROUGE-L METEOR BLEU-4 CIDEr-D PREVIOUS WORK ABS+ (Rush et al., 2015) 29.76 11.88 26.96 ---RAS-Elman (Chopra et al., 2016) 33.78 15.97 31.15 ---words-lvt2k-1sent (Nallapati et al., 2016) 32 Table 1 : Summarization results on Gigaword." 
}, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-82", "text": "Rouge scores are full length F-1, following previous work." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-83", "text": "We use RNN hidden state dimension of 512 and word embedding dimension of 256." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-84", "text": "We do not initialize our word embeddings with any pre-trained models, i.e., we learn them from scratch." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-85", "text": "We use the Adam (Kingma and Ba, 2015) optimizer with a learning rate of 0.001." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-86", "text": "During training, to handle the large vocabulary, we use the sampled loss trick of Jean et al. (2014) ." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-87", "text": "We always tune hyperparameters on the validation set of the corresponding dataset, where applicable." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-88", "text": "For multi-task learning, we tried a few mixing ratios and found 1 : 0.05 to work better, i.e., 100 mini-batches of summarization with 5 mini-batches of entailment generation task in alternate training rounds." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-92", "text": "**SUMMARIZATION RESULTS: GIGAWORD**" }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-93", "text": "Baseline Results and Previous Work Our baseline is a strong encoder-attention-decoder model based on Luong et al. (2015) and presented by Chopra et al. (2016) ." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-94", "text": "As shown in Table 1 , it is reasonably close to some of the state-of-theart (comparable) results in previous work, though making this baseline further strong (e.g., based on pointer-copy mechanism) is our next step." 
}, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-95", "text": "----------------------------------" }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-96", "text": "**MULTI-TASK RESULTS**" }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-97", "text": "We show promising initial multi-task improvements on top of our baseline, based on several metrics." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-98", "text": "This suggests that the entailment generation model is teaching the summarization model some skills about how to choose a logical subset of the events in the full input document." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-99", "text": "This is especially promising given that the domain of the entailment dataset (image captions) is very different from the domain of the summarization datasets (news), suggesting that the model might be learning some domain-agnostic inference skills." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-100", "text": "----------------------------------" }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-101", "text": "**SUMMARIZATION RESULTS: DUC**" }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-102", "text": "Here, we directly use the Gigaword-trained model to test on the DUC-2004 dataset (see tuning discussion in Sec. 4.1)." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-103", "text": "In Table 2 , we again see that et al. (2015) 28.18 8.49 23.81 Chopra et al. (2016) 28.97 8.26 24.06 Nallapati et al. (2016) our Luong et al. (2015) baseline model achieves competitive performance with previous work, esp." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-104", "text": "on Rouge-2 and Rouge-L. Next, we show promising multi-task improvements over this baseline of around 0.4% across all metrics, despite being a test-only setting and also with the mismatch between the summarization and entailment domains." 
}, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-105", "text": "Figure 3 shows some additional interesting output examples of our multi-task model and how it generates summaries that are better at being logically entailed by the input document, whereas the baseline model contains some crucial contradictory or unrelated information." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-106", "text": "----------------------------------" }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-107", "text": "**ANALYSIS EXAMPLES**" }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-108", "text": "----------------------------------" }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-109", "text": "**CONCLUSION AND NEXT STEPS**" }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-110", "text": "We presented a multi-task learning approach to incorporate entailment generation knowledge into summarization models." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-111", "text": "We demonstrated promising initial improvements based on multiple datasets and metrics, even when the entailment knowledge was extracted from a domain different from the summarization domain." }, { "sent_id": "685b0b0da37b81765bb78f0f87505b-C001-112", "text": "Our next steps to this workshop paper include: (1) stronger summarization baselines, e.g., using pointer copy mechanism (See et al., 2017; Nallapati et al., 2016) , and also adding this capability to the entailment generation model; (2) results on CNN/Daily Mail corpora (Nallapati et al., 2016) ; (3) incorporating entailment knowledge from other news-style domains such as the new Multi-NLI corpus (Williams et al., 2017) , and (4) demonstrating mutual improvements on the entailment generation task." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "685b0b0da37b81765bb78f0f87505b-C001-10" ], [ "685b0b0da37b81765bb78f0f87505b-C001-26" ] ], "cite_sentences": [ "685b0b0da37b81765bb78f0f87505b-C001-10", "685b0b0da37b81765bb78f0f87505b-C001-26" ] }, "@MOT@": { "gold_contexts": [ [ "685b0b0da37b81765bb78f0f87505b-C001-10", "685b0b0da37b81765bb78f0f87505b-C001-11", "685b0b0da37b81765bb78f0f87505b-C001-12" ] ], "cite_sentences": [ "685b0b0da37b81765bb78f0f87505b-C001-10" ] }, "@USE@": { "gold_contexts": [ [ "685b0b0da37b81765bb78f0f87505b-C001-75" ] ], "cite_sentences": [ "685b0b0da37b81765bb78f0f87505b-C001-75" ] }, "@DIF@": { "gold_contexts": [ [ "685b0b0da37b81765bb78f0f87505b-C001-103", "685b0b0da37b81765bb78f0f87505b-C001-104" ] ], "cite_sentences": [ "685b0b0da37b81765bb78f0f87505b-C001-103" ] }, "@FUT@": { "gold_contexts": [ [ "685b0b0da37b81765bb78f0f87505b-C001-112" ] ], "cite_sentences": [ "685b0b0da37b81765bb78f0f87505b-C001-112" ] } } }, "ABC_09f627b9a70966dc7b63316c56a2a0_29": { "x": [ { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-2", "text": "We present a simple algorithm to efficiently train language models with noise-contrastive estimation (NCE) on graphics processing units (GPUs)." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-3", "text": "Our NCE-trained language models achieve significantly lower perplexity on the One Billion Word Benchmark language modeling challenge, and contain one sixth of the parameters in the best single model in Chelba et al. (2013) ." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-4", "text": "When incorporated into a strong Arabic-English machine translation system they give a strong boost in translation quality." 
}, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-5", "text": "We release a toolkit so that others may also train large-scale, large vocabulary LSTM language models with NCE, parallelizing computation across multiple GPUs." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-6", "text": "----------------------------------" }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-8", "text": "Language models are used to compute probabilities of sequences of words." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-9", "text": "They are crucial for good performance in tasks like machine translation, speech recognition, and spelling correction." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-10", "text": "They can be classified into two categories: count-based and continuous-space language models." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-11", "text": "The language modeling literature abounds with successful approaches for learning-count based language models: modified Kneser-Ney smoothing, JelinekMercer smoothing, etc." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-12", "text": "In recent years, continuousspace language models such as feed-forward neural probabilistic language models (NPLMs) and recurrent neural network language models (RNNs) 1 * Equal contribution." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-13", "text": "1 Henceforth we will use terms like \"RNN\" and \"LSTM\" with the understanding that we are referring to language models that use these formalisms have outperformed their count-based counterparts (Chelba et al., 2013; Zaremba et al., 2014; Mikolov, 2012) ." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-14", "text": "RNNs are more powerful than n-gram language models, as they can exploit longer word contexts to predict words." 
}, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-15", "text": "Long short-term memory language models (LSTMs) are a class of RNNs that have been designed to model long histories and are easier to train than standard RNNs." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-16", "text": "LSTMs are currently the best performing language models on the Penn Treebank (PTB) dataset (Zaremba et al., 2014) ." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-17", "text": "The most common method for training LSTMs, maximum likelihood estimation (MLE), is prohibitively expensive for large vocabularies, as it involves time-intensive matrix-matrix multiplications." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-18", "text": "Noise-contrastive estimation (NCE) has been a successful alternative to train continuous space language models with large vocabularies (Mnih and Teh, 2012; Vaswani et al., 2013) ." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-19", "text": "However, NCE in its standard form is not suitable for GPUs, as the computations are not amenable to dense matrix operations." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-20", "text": "In this paper, we present a natural modification to the NCE objective function for language modeling that allows a very efficient GPU implementation." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-21", "text": "Using our new objective, we train large multi-layer LSTMs on the One Billion Word benchmark (Chelba et al., 2013) , with its full 780k word vocabulary." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-22", "text": "We achieve significantly lower perplexities with a single model, while using only a sixth of the parameters of a very strong baseline model (Chelba et al., 2013) ." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-23", "text": "We release our toolkit 2 to allow researchers to train large-scale, large-vocabulary LSTMs with NCE." 
}, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-24", "text": "The contributions in this paper are the following: 2 www.github.com/isi-nlp/Zoph_RNN" }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-25", "text": "\u2022 A fast and simple approach for handling large vocabularies effectively on the GPU." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-26", "text": "\u2022 Significantly improved perplexities (43.2) on the One Billion Word benchmark over Chelba et al. (2013) \u2022 Extrinsic machine translation improvement over a strong baseline." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-27", "text": "\u2022 Fast decoding times because in practice there is no need to normalize." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-28", "text": "----------------------------------" }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-29", "text": "**LONG SHORT TERM MEMORY LANGUAGE MODELS**" }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-30", "text": "In recent years, LSTMs (Hochreiter and Schmidhuber, 1997) have achieved state-of-the-art performance in many natural language tasks such as language modeling (Zaremba et al., 2014) and statistical machine translation Luong et al., 2015) ." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-31", "text": "LSTMs were designed to have longer memories than standard RNNs, allowing them to exploit more context to make predictions." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-32", "text": "To compute word probabilities, the LSTM reads words left-to-right, updating its memory after each word and producing a hidden state h, which summarizes all of the history." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-33", "text": "For details on the architecture and computations of the LSTM, the reader can refer to (Zaremba et al., 2014) ." 
}, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-34", "text": "In this model the probability of word w given history u is" }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-35", "text": "where p(w | u) = exp D w h T + b w is an unnormalized probability." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-36", "text": "D w and b w are the output word embedding and biases respectively, which are learned during training." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-37", "text": "The normalization constant is Z(u) = w\u2208V p(w | u), and V is the vocabulary." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-38", "text": "----------------------------------" }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-39", "text": "**NOISE CONTRASTIVE ESTIMATION FOR TRAINING NEURAL LANGUAGE MODELS**" }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-40", "text": "The standard approach for estimating neural language models is maximum liklelihood estimation (MLE), where we learn the parameters \u03b8 * that maximize the likelihood of the training data," }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-41", "text": "However, for each training instance, gradient-based approaches for MLE require a summation over all units in the output layer, one for each word in V ." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-42", "text": "This makes MLE training infeasible for large vocabularies." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-43", "text": "Noise-contrastive estimation (NCE) (Gutmann and Hyv\u00e4rinen, 2010) has been successfully adopted for training neural language models with large vocabularies (Mnih and Teh, 2012; Vaswani et al., 2013; Baltescu and Blunsom, 2014; Williams et al., 2015) ." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-44", "text": "NCE avoids repeated summations by training the model to correctly classify between generated noise samples and words observed in the training data." 
}, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-45", "text": "For each training pair u i , w i , we generate k noise samples,w i1 . . . ,w ik from a noise distribution q(w), which is known." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-46", "text": "We label u i , w i as true (C = 1), and all u i ,w ik as false (C = 0)." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-47", "text": "We then train the parameters to maximize the binary classification log likelihood," }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-48", "text": "where" }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-49", "text": ", (4) and" }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-50", "text": "We learn parameters to maximize this objective with gradient ascent." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-51", "text": "In NCE, we treat Z(u) as another parameter and learn its estimate along with the rest of the parameters." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-52", "text": "Following Mnih and Teh (2012) and Vaswani et al. (2013) , we freeze Z(u) to 1 and the model learns to approximately satisfy the constraint." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-53", "text": "In practice, we find that our mean for Z(u) is very close to 1 and the variance is quite small (Section 6)." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-54", "text": "For each training instance, we compute gradients for the observed word and each noise word, reducing the time complexity from O(|V |) for MLE to O(k)." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-55", "text": "However, unlike MLE, since we typically sample different noise samples for each training instance, our gradient computations for the NCE objective are not amenable to dense matrix operations, making it difficult to benefit from fast dense matrix arithmetic on GPUs." 
}, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-56", "text": "In this paper, we present a simple solution to this problem: sharing the noise samples across all the training instances in a minibatch." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-57", "text": "Sharing noise samples allows us to describe NCE gradients with dense matrix operations, and implement them easily on the GPU." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-58", "text": "In the next section, we describe our NCE implementation on the GPU with shared noise samples." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-59", "text": "----------------------------------" }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-60", "text": "**OUR NCE MODIFICATION**" }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-61", "text": "In typical Noise-Contrastive Estimation, the objective function requires noise samples coming from some distribution (in our case, the uniform distribution)." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-62", "text": "The objective function for NCE is shown above in Equation 3, where we must calculate P (C = 1 | w, u) for the true word and the noise samples generated." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-63", "text": "There are three components to this calculation: p(w | u) , Z(u) , and k \u00b7 q(w)." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-64", "text": "In NCE we fix Z(u) to be one, so we only need to calculate p(w | u) and k \u00b7 q(w)." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-65", "text": "The term k \u00b7 q(w) is simply an O(1) lookup, so the only costly operation is computing p(w | u) for all k noise samples and the true word." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-66", "text": "The operation to compute p(w | u) for a single word w is exp D w h T + b w where D w and b w represent the word embedding and bias corresponding to the word we are computing it for." 
}, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-67", "text": "The main cost in calculating the NCE objective function is computing exp D w h T + b w , where in normal NCE a dense matrix multiplication cannot be done." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-68", "text": "The reason is that the noise samples generated per training example will be different." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-69", "text": "Therefore, when we parallelize our algorithm by running multiple training examples in parallel (a minibatch), the D w we need are different per h T that we are running in parallel." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-93", "text": "The training is parallelized across 4 GPUs, such that each layer lies on its own GPU and communicates its activations to the next layer once it finishes its computation." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-70", "text": "If a constraint is set such that the noise samples must be the same across all training examples in the minibatch, then a matrix multiplication can be done to compute exp D w h T + b w for all words across that minibatch." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-71", "text": "This matrix multiplication is Dh T M , where h T M is now a matrix of all the hidden states being used in parallel, whereas before h T was just a vector." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-72", "text": "Additionally, D is the matrix of word embedding for the samples that are being shared across a minibatch." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-73", "text": "Before, this was not possible as the D matrix would have to be much larger, being (minibatch size) \u00b7 (k + 1) in size, making the algorithm run much slower." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-74", "text": "An alternative is to not restrict the noise samples to be the same, but then we must use a sparse matrix multiplication as in Williams et al. 
(2015) , which is neither as fast nor as easy to implement." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-75", "text": "A comparison between these two approaches is shown in Figure 1 ." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-76", "text": "We find that sharing noise samples across the minibatch gives us significant speedups over a baseline of using different noise samples for each word in the minibatch." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-77", "text": "These timings were calculated for a single layer LSTM with 1000 hidden units, a vocabulary of 350k, and a minibatch of 128." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-78", "text": "Not surprisingly, MLE is quite expensive, limiting it's use for large vocabularies." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-79", "text": "Additionally, the memory requirements for NCE are much lower than MLE, since we do not need to store the gradient which has the same size as the output embedding matrix." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-80", "text": "For this MLE run, we had to distribute the computation across two GPUs because of memory limitations." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-81", "text": "----------------------------------" }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-82", "text": "**EXPERIMENTS**" }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-83", "text": "We conducted two series of experiments to validate the efficiency of our approach and the quality of the models we learned using it: An intrinsic study of language model perplexity using the standard One Billion Word benchmark (Chelba et al., 2013) and an extrinsic end-to-end statistical machine translation task that uses an LSTM as one of several feature functions in re-ranking." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-84", "text": "Both experiments achieve excellent results." 
}, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-85", "text": "----------------------------------" }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-86", "text": "**LANGUAGE MODELING**" }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-87", "text": "For our language modeling experiment we use the One Billion Word benchmark proposed by Chelba et al. (2013) ." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-88", "text": "In this task there are roughly 0.8 billion words of training data." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-89", "text": "We use perplexity to evaluate the quality of language models we train on this data." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-90", "text": "We train an LSTM with 4 layers, where each layer has 2048 hidden units, with a target vocabulary size of 793,471." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-91", "text": "For training, we also use dropout to prevent overfitting." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-92", "text": "We follow Zaremba et al. (2014) for dropout locations, and we use a dropout rate of 0.2." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-94", "text": "----------------------------------" }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-95", "text": "**STATISTICAL MACHINE TRANSLATION**" }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-96", "text": "We incorporate our LSTM as a rescoring feature on top of the output of a strong Arabic-English syntax-based string-to-tree statistical machine translation system (Galley et al., 2006; Galley et al., 2004) ." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-97", "text": "That system is trained on 208 million words of parallel Arabic-English data from a variety of sources, much of which is Egyptian discussion forum content." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-98", "text": "The Arabic side is morphologically segmented with MADA-ARZ (Habash et al., 2013) ." 
}, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-99", "text": "We use a standard set of features, including two conventional count-based language models, as well as thousands of sparsely-occurring, lexicalized syntactic features (Chiang et al., 2009 )." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-100", "text": "The system additionally uses a source-to-target feed-forward neural network translation model during decoding, analogous to the model of (Devlin et al., 2014) ." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-101", "text": "These features are tuned with MIRA (Chiang et al., 2009) on approximately 63,000 words of Arabic discussion forum text, along with three references." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-102", "text": "We evaluate this baseline on two test sets, each with approximately 34,000 words from the same genre used in tuning." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-103", "text": "We generate n-best lists (n = 1000) of unique translations for each sentence in the tuning set and re-rank the translations using an approach based on MERT (Och, 2003) ." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-104", "text": "We collapse all features other than language models, text length, and derivation size into a single feature, formed by taking the dot product of the previously learned feature and weight vectors." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-105", "text": "We then run a single iteration of MERT on the n-best lists to determine optimal weights for the collapsed feature, the uncollapsed features, and an LSTM feature formed by taking the score of the hypothesis according to the LSTM described in Section 5.1." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-106", "text": "We use the weights to rerank hypotheses from the n-best lists of the two test sets." 
}, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-107", "text": "We repeated this experiment, substituting instead a twolayer LSTM trained on the English side of the training data." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-108", "text": "----------------------------------" }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-109", "text": "**RESULTS**" }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-110", "text": "Our two experiments with LSTMs trained with our modification of NCE show strong results in their corresponding tasks." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-111", "text": "Our perplexity results are shown in Table 1 , where we get significantly lower perplexities than the best single model from Chelba et al. (2013) , while having almost 6 times fewer parameters." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-112", "text": "We also compute the partition function values, log Z(u) , for our development set and we find that the mean is 0.058 and the variance is 0.139, indicating that training has encouraged self-normalization." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-113", "text": "----------------------------------" }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-114", "text": "**MODEL**" }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-115", "text": "Parameters Perplexity Chelba et al. (2013) 20m 51.3 NCE (ours) 3.4m 43.2 Recently, (J\u00f3zefowicz et al., 2016) achieved stateof-the-art language modeling perplexities (30.0) on the billion word dataset with a single model, using importance sampling to approximate the normalization constant, Z(u)." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-116", "text": "Independent of our work, they also share noise samples across the minibatch." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-117", "text": "However, they use 8192 noise samples, while we achieve strong perplexities with 100 noise samples." 
}, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-118", "text": "We also show significant improvements in machine translation, exploiting self-normalization for fast decoding, in addition to releasing a efficient toolkit that practitioners can use." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-119", "text": "Arabic-to-English n-best lists." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-120", "text": "The baseline is a state-of-theart, statistical string-to-tree system." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-121", "text": "BOLT is a 208m-word, indomain English corpus; \"1b\" refers to the One Billion Word benchmark corpus." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-122", "text": "these re-scoring experiments we simply use the unnormalized numerator p(w | u) as our word score, which means we never have to compute the costly partition function, Z(u)." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-123", "text": "This is because the partition function is so close to 1 that the un-normalized scores are very close to the normalized ones." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-124", "text": "----------------------------------" }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-125", "text": "**CONCLUSION**" }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-126", "text": "We describe a natural extension to NCE that achieves a large speedup on the GPU while also achieving good empirical results." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-127", "text": "LSTM models trained with our NCE modification achieve strong results on the One Billion Word dataset." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-128", "text": "Additionally, we get good BLEU gains when re-ranking nbest lists from a strong string-to-tree machine translation system." }, { "sent_id": "09f627b9a70966dc7b63316c56a2a0-C001-129", "text": "We also release an efficient toolkit for training LSTM language models with NCE." 
} ], "y": { "@DIF@": { "gold_contexts": [ [ "09f627b9a70966dc7b63316c56a2a0-C001-3" ], [ "09f627b9a70966dc7b63316c56a2a0-C001-13" ], [ "09f627b9a70966dc7b63316c56a2a0-C001-22" ], [ "09f627b9a70966dc7b63316c56a2a0-C001-24", "09f627b9a70966dc7b63316c56a2a0-C001-26" ], [ "09f627b9a70966dc7b63316c56a2a0-C001-111" ], [ "09f627b9a70966dc7b63316c56a2a0-C001-115" ] ], "cite_sentences": [ "09f627b9a70966dc7b63316c56a2a0-C001-3", "09f627b9a70966dc7b63316c56a2a0-C001-13", "09f627b9a70966dc7b63316c56a2a0-C001-22", "09f627b9a70966dc7b63316c56a2a0-C001-26", "09f627b9a70966dc7b63316c56a2a0-C001-111", "09f627b9a70966dc7b63316c56a2a0-C001-115" ] }, "@USE@": { "gold_contexts": [ [ "09f627b9a70966dc7b63316c56a2a0-C001-21" ], [ "09f627b9a70966dc7b63316c56a2a0-C001-83" ], [ "09f627b9a70966dc7b63316c56a2a0-C001-87" ] ], "cite_sentences": [ "09f627b9a70966dc7b63316c56a2a0-C001-21", "09f627b9a70966dc7b63316c56a2a0-C001-83", "09f627b9a70966dc7b63316c56a2a0-C001-87" ] }, "@EXT@": { "gold_contexts": [ [ "09f627b9a70966dc7b63316c56a2a0-C001-24", "09f627b9a70966dc7b63316c56a2a0-C001-26" ] ], "cite_sentences": [ "09f627b9a70966dc7b63316c56a2a0-C001-26" ] } } }, "ABC_68d41bee7361b6680103c9951a6570_29": { "x": [ { "sent_id": "68d41bee7361b6680103c9951a6570-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-2", "text": "Deep learning has dramatically improved the performance of speech recognition systems through learning hierarchies of features optimized for the task at hand." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-3", "text": "However, true end-toend learning, where features are learned directly from waveforms, has only recently reached the performance of handtailored representations based on the Fourier transform." 
}, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-4", "text": "In this paper, we detail an approach to use convolutional filters to push past the inherent tradeoff of temporal and frequency resolution that exists for spectral representations." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-5", "text": "At increased computational cost, we show that increasing temporal resolution via reduced stride and increasing frequency resolution via additional filters delivers significant performance improvements." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-6", "text": "Further, we find more efficient representations by simultaneously learning at multiple scales, leading to an overall decrease in word error rate on a difficult internal speech test set by 20.7% relative to networks with the same number of parameters trained on spectrograms." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-7", "text": "----------------------------------" }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-9", "text": "Raw speech waveforms are densely sampled in time, and thus require downsampling to make many analysis techniques computationally tractable." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-10", "text": "For speech recognition, this presents the challenge to reduce the number of timesteps in the signal without throwing away relevant information." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-11", "text": "Representations based on the Fourier transform have proven effective at this task as the transform forms a complete basis for signal reconstruction." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-12", "text": "Deep learning's recent success in speech recognition is based on learning feature hierarchies atop these representations [1, 2] ." 
}, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-13", "text": "There has been increasing focus on extending this end-toend learning approach down to the level of the raw waveform." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-14", "text": "A popular approach is pass the waveform through strided convolutions, or networks connected to local temporal frames, often followed by a pooling step to create invariance to phase shifts and further downsample the signal [3, 4, 5, 6, 7, 8] ." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-15", "text": "While some studies find inferior performance for convolutional filters learned in this way, deeper networks have recently matched the performance of hand-engineered features on large vocabulary speech recognition tasks [4] ." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-16", "text": "Features based on the Fourier transform are computationally efficient, but exhibit an intrinsic tradeoff of temporal and frequency resolution." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-17", "text": "Convolutional filters decouple time and frequency resolution as the number of filters and stride are chosen independently." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-18", "text": "Despite this, a filter bank is constrained by its window size to a single scale." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-19", "text": "Herein, we explore jointly learning filter banks at multiple different scales on raw wave- Figure 1 : Diagram of multiscale convolutions for learning directly from waveforms." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-20", "text": "Sampled at 16kHz, the waveform is originally 1/16ms per frame." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-21", "text": "Three convolutions with different window sizes and strides are applied, leading to feature maps with High (1ms/frame), Mid (5ms/frame), and Low (10ms/frame) temporal resolution." 
}, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-22", "text": "Next, maxpooling and concatenation ensure a consistent sampling frequency (20ms/frame) for the rest of the network." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-23", "text": "The diagram shows real feature maps that are extracted by applying learned features (also shown) to a recording of one of the authors saying \"I like cats\" in Mandarin." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-24", "text": "forms." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-25", "text": "Multiscale convolutions have already been successfully applied to address tasks in the computer vision field, such as image classification [9] , scene labeling [10] , and gesture detection [11] ." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-26", "text": "These successful applications exploit structure at different scales, which encourages us to explore multiscale representations for waveforms as well." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-27", "text": "Further, multiscale convolutions enable us to split the spectrum into different filter banks with independent choice of stride, window size, and number of filters." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-28", "text": "To learn high frequency features, filters with a short window are applied at a small stride on the waveforms." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-29", "text": "Low-frequency features, on the contrary, employ a long window that can be applied at a larger stride." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-30", "text": "Finally, based on the needs of the speech recognition system we can then independently tune the number of filters for the different frequency bands." 
}, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-31", "text": "While much research has already been conducted on learn- ing directly from waveforms for speech recognition, the unique contributions of this paper are threefold:" }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-32", "text": "\u2022 We perform with an in-depth analysis of scaling to low strides and large numbers of filters and discover that a convolutional front end can significantly outperform Fourier features by independently tuning temporal and frequency resolution, at the cost of additional computation and memory." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-33", "text": "\u2022 We propose a new multiscale convolutional front end, composed of concatenated filters with different window sizes, that requires less computation and outperforms features learned on a single scale (20.7% relative to spectrogram baseline)." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-34", "text": "\u2022 We find that multiscale features naturally learn the frequencies they can most efficiently represent, with large and small windows learning low and high frequencies respectively." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-35", "text": "This contrasts with the single scale features which try to cover the entire frequency spectrum regardless of window size." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-36", "text": "----------------------------------" }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-37", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-38", "text": "The experimental design of this study is modelled after our previous work on end-to-end speech recognition [12, 2] ." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-39", "text": "However, to decrease the experimental latency, we train on a reduced version of the model and a subset of the training data." 
}, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-40", "text": "The basic architecture is shown in Table 1 ." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-41", "text": "While we vary the front end processing, the backend remains the same: a convolutional (through time) layer, followed by three bidirectional simple recurrent layers, and a fully connected layer." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-42", "text": "Batch normalization [13] , is employed between each layer, but not between individual timesteps [2] ." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-43", "text": "Rectified linear unit (ReLU) activation functions are used for all layers, including between timesteps." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-44", "text": "We use the Connectionist Temporal Classification (CTC) cost function to integrate over all possible alignments between the network outputs and characters of the English alphabet [14] ." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-45", "text": "Training is conducted on 2,400 hours of audio randomly sampled from 12,000 hours of data." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-46", "text": "The training data is drawn from a diverse collection of sources including read, conversational, accented, and noisy speech [2] ." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-47", "text": "At each epoch, 40% of the utterances are randomly selected to have background noise Table 2 : Single scale waveform convolution outperforms the spectrogram baseline at low strides." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-48", "text": "The trend is visualized in Figure 2 ." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-49", "text": "(superpositions of YouTube clips) added at signal-to-noise ratios ranging from 0dB to 15dB [12] ." 
}, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-50", "text": "All input data (either spectrogram or waveform) is sampled at 16kHz, and normalized so that each input feature has zero mean and unit variance." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-51", "text": "For models that learn directly from waveforms, no other preprocessing is applied and the network inputs have 1 feature per 1/16ms frame." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-52", "text": "For models that learn from spectrograms, linear FFT features are extracted with a hop size of 10ms, and window size of 20ms." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-53", "text": "The network inputs are thus spectral magnitude maps ranging from 0-8kHz with 161 features per 10ms frame." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-54", "text": "We train using stochastic gradient descent with Nesterov momentum and a batch size of 128." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-55", "text": "Hyperparameters are tuned for each model by optimizing a hold-out set." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-56", "text": "Typical values are a learning rate of 3e-4 and momentum of 0.99, and training converges after 20 epochs." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-57", "text": "Following [2] , we sort the first epoch by utterance length (SortaGrad), to promote stability of training on long utterances." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-58", "text": "While the CTC-trained acoustic model learns a rudimentary language model itself from the training data, for testing, we supplement it with a Kneser-Ney smoothed 5-gram model that is trained using the KenLM toolkit [15] on cleaned text from the Common Crawl Repository." 
}, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-59", "text": "Decoding is done via beam search, where a weighted combination of the acoustic model and language model with an added word insert penalty is used as the value function." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-60", "text": "Test set word error rates (WER) are reported on a difficult in-house test set of 2048 utterances, diversely composed of noisy, conversational, voice-command, and accented speech." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-61", "text": "The test set is collected internally and from industry partners and is not represented in the training data." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-62", "text": "As previously observed [2] , deep neural networks trained on sufficient data perform better as the model size grows." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-63", "text": "In order to make fair comparisons, the number of parameters of the models used in our experiments is held constant at 35M." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-64", "text": "We are aware that the results are not directly comparable to literature due to the use of proprietary datasets." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-65", "text": "However, we attempt to tightly control our experiments rather than focus on overall performance, with the aim that the conclusions will generalize to other architectures and datasets as well." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-66", "text": "If optimizing for performance, it is worth noting that the WER drops by \u223c50% when training on all 12,000 hours of data with 7 bidirectional layers (70M parameters) in the backend." 
}, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-67", "text": "----------------------------------" }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-68", "text": "**RESULTS**" }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-69", "text": "----------------------------------" }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-70", "text": "**DECOUPLING TEMPORAL RESOLUTION**" }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-71", "text": "While many studies have compared convolution on raw waveforms to features such as spectrograms, MFCCs, and melscale filterbanks, they often compare the two at a similar strides and window sizes [5, 6] ." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-72", "text": "Spectrograms can employ a high stride because of the unique analytic structure of the basis functions." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-73", "text": "Integrating twice, once over the real cosine and once over its imaginary sine counterpart, identifies both the phase and magnitude of the response." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-74", "text": "This is in many ways similar to performing a convolution over every timestep and max-pooling, with the phase represented by the index at which the max occurs, but is much more computationally efficient to perform with an FFT." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-75", "text": "We decided to test this hypothesis that a convolution with low stride and pooling can find at least as good a basis as the spectrogram." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-76", "text": "Figure 2 and Table 2 show the effects of replacing the spectrogram with a single convolution and pooling layer of the same number of filters and window size." 
}, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-77", "text": "For each decrease in stride of the convolution, the stride of the pooling is increased to give a consistent total stride of 20ms, which is the same as the stride of the spectrogram with pooling." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-78", "text": "At comparable strides to the spectrogram (10ms) the convolution is unable to perform as well, likely due to needing to represent phase shifts as well as frequency variation." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-79", "text": "However, as the stride dips below 2ms, the convolution asymptotically reaches a superior level of performance to the spectrogram." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-80", "text": "To be fair, this improved performance comes at increased computational cost and memory usage." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-81", "text": "----------------------------------" }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-82", "text": "**MULTISCALE FEATURES ARE MORE EFFICIENT**" }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-83", "text": "Representing high frequency information with a large filter is difficult because of the many places the information can occur in the filter window." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-84", "text": "Similarly, representing low frequency information with a small filter is challenging because separate filters are required for separate parts of the wave." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-85", "text": "We hypothe- Figure 1 )." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-86", "text": "sized that applying convolution simultaneously at several scales could allow each scale to learn filters selective to the frequencies that it can most efficiently represent." 
}, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-87", "text": "To test our hypothesis, we compare the performance of convolutional front ends with a constant number of filters at three different scales (High: (1ms window, 1/4ms stride), Mid: (4ms, 1ms), and Low: (40ms, 10ms)), to a front end employing all three scales (Diagramed in Figure 1 ." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-88", "text": "Table 3 shows that WER significantly decreases from High to Mid to Low frequency filter banks." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-89", "text": "The improved performance of the lower frequency banks is evidence for the importance of low frequency vocalization features in speech recognition [4] ." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-90", "text": "Figure 3 displays the spectral centroids of each filter bank sorted by frequency, with the mean value printed alongside." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-91", "text": "It is clear that the High (2800Hz), Mid (1700Hz), and Low (940Hz) filter banks live up to their names, each learning filters that capture different frequency bands." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-92", "text": "When we jointly learn several scales, the filters exhibit a an heightened preference for representing different bands." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-93", "text": "Relaxing the requirement of each bank to cover the entire spectrum causes the High (3500Hz) banks to increase in mean frequency, and the Mid (1300Hz) and Low (480Hz) banks to decrease." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-94", "text": "Further, the WER is lowest for the multiscale features, despite the fact that it has fewer parameters and three times less large filters." 
}, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-95", "text": "----------------------------------" }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-96", "text": "**DECOUPLING FREQUENCY RESOLUTION**" }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-97", "text": "With methods based on the Fourier transform, there is an intrinsic tradeoff of temporal and frequency resolution." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-98", "text": "As the number of basis functions increases, helping identify which fre- quencies are present, so does the window size, which smears knowledge of when they occur." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-99", "text": "Decreasing the stride cannot increase temporal resolution, it can only provide more samples of the smeared signal." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-100", "text": "Convolutions do not suffer from this same tradeoff, as the temporal resolution is limited by the number of filters and temporal resolution by the stride, both of which are independent." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-101", "text": "This advantage comes at the added cost of increased computation and memory, both by increasing number of filters or decreasing stride." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-102", "text": "We explore the value of increased resolution by performing the same multiscale experiments as the previous section, but increasing the number of filters." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-103", "text": "Table 4 demonstrates that increasing the number of filters, even by a factor of 3, leads to a significant (8% relative) improvement in WER." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-104", "text": "A fullyconnected layer with output dimension of 161 is added above the concatenated feature maps in order to produce the same number of features as the previous experiments." 
}, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-105", "text": "Since using more filters at short strides is more costly in terms of both computation and memory, we specifically increase the number of filters at long strides." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-106", "text": "Further increasing the number of filters, and expanding the size of the bottle neck leads to smaller gains." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-107", "text": "showing an impressive 20.7% cumulative improvement in WER relative to the spectrogram baseline." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-108", "text": "----------------------------------" }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-109", "text": "**DISCUSSION**" }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-110", "text": "In this paper, we have consistently demonstrated that learning features directly from waveforms can outperform spectrograms, especially when applied at multiple scales." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-111", "text": "However, several interesting research questions remain to be answered before such techniques likely see widespread adoption." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-112", "text": "Learning convolutional filters overcomes the time/frequency tradeoff of Fourier representations, but with considerable computational and memory cost." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-113", "text": "Many modern state of the art systems train on clusters of GPUs, where memory is precious, as requiring memory transfer between GPU and CPU can be prohibitively slow for training." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-114", "text": "This is especially problematic for training on long utterances, where the amount of memory required to save all the activations increases both with the number of filters and the reduction of stride." 
}, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-115", "text": "It remains to be seen whether the power of learned input features can be combined with the efficiency of analytic signal transformations such as the Fourier transform." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-116", "text": "One such approach could be to learn basis functions in the real and imaginary domain by performing backpropagation through the Hilbert transformation, enabling the use of larger strides." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-117", "text": "Alternatively, learned features can be fixed and used to augment other fixed features at train time." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-118", "text": "Sainath et al. [4] found noticeable improvements from supplementing log-mel filterbanks in such a manner." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-119", "text": "While learned features outperformed spectrograms feeding into temporal convolution in this study, many state of the art systems apply two-dimensional convolutions to their inputs [2, 16] ." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-120", "text": "Our learned features underperform in this context, which is understandable as they are not spectrally ordered, and lack spatial structure." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-121", "text": "Regularization techniques such as [17] could perhaps be key to learning ordered filter maps with useful structure." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-122", "text": "In our experiments, we made sure to downsample each scale equally with appropriate stride such that the signals can be concatenated for the later recurrent layers." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-123", "text": "This temporal pooling only takes into account local structure and has no explicit knowledge of what information to preserve based on longrange dependencies." 
}, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-124", "text": "Recently proposed architectures that operate simultaneously at different timescales, such as the Clockwork RNN [18] , could provide a more elegant way of combining multiscale signals." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-125", "text": "Beyond incorporating recurrence, low frequency features that require fewer temporal samples could then also require less recurrent computation and facilitate modeling long-range structure." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-126", "text": "Finally, from observing representative filters learned at each scale in Figure 4 , we can see that there is some redundancy in the representation." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-127", "text": "Some filter shapes appear similar at multiple scales." }, { "sent_id": "68d41bee7361b6680103c9951a6570-C001-128", "text": "An interesting future direction could be to investigate learning features in a scale-free basis, similar to wavelets, where a reduced basis set could be applied across a range of scales." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "68d41bee7361b6680103c9951a6570-C001-12" ] ], "cite_sentences": [ "68d41bee7361b6680103c9951a6570-C001-12" ] }, "@USE@": { "gold_contexts": [ [ "68d41bee7361b6680103c9951a6570-C001-38" ], [ "68d41bee7361b6680103c9951a6570-C001-42" ], [ "68d41bee7361b6680103c9951a6570-C001-46" ], [ "68d41bee7361b6680103c9951a6570-C001-57" ], [ "68d41bee7361b6680103c9951a6570-C001-62" ] ], "cite_sentences": [ "68d41bee7361b6680103c9951a6570-C001-38", "68d41bee7361b6680103c9951a6570-C001-42", "68d41bee7361b6680103c9951a6570-C001-46", "68d41bee7361b6680103c9951a6570-C001-57", "68d41bee7361b6680103c9951a6570-C001-62" ] }, "@SIM@": { "gold_contexts": [ [ "68d41bee7361b6680103c9951a6570-C001-62" ], [ "68d41bee7361b6680103c9951a6570-C001-119" ] ], "cite_sentences": [ "68d41bee7361b6680103c9951a6570-C001-62", "68d41bee7361b6680103c9951a6570-C001-119" ] }, "@DIF@": { "gold_contexts": [ [ "68d41bee7361b6680103c9951a6570-C001-119", "68d41bee7361b6680103c9951a6570-C001-120" ] ], "cite_sentences": [ "68d41bee7361b6680103c9951a6570-C001-119" ] } } }, "ABC_40d370558d873499e493a83f106f17_29": { "x": [ { "sent_id": "40d370558d873499e493a83f106f17-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "40d370558d873499e493a83f106f17-C001-2", "text": "Much attention has been given to the impact of informativeness and similarity measures on distributional thesauri." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-3", "text": "We investigate the effects of context filters on thesaurus quality and propose the use of cooccurrence frequency as a simple and inexpensive criterion." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-4", "text": "For evaluation, we measure thesaurus agreement with WordNet and performance in answering TOEFL-like questions." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-5", "text": "Results illustrate the sensitivity of distributional thesauri to filters." 
}, { "sent_id": "40d370558d873499e493a83f106f17-C001-6", "text": "----------------------------------" }, { "sent_id": "40d370558d873499e493a83f106f17-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "40d370558d873499e493a83f106f17-C001-8", "text": "Large-scale distributional thesauri created automatically from corpora (Grefenstette, 1994; Lin, 1998; Weeds et al., 2004; Ferret, 2012) are an inexpensive and fast alternative for representing semantic relatedness between words, when manually constructed resources like WordNet (Fellbaum, 1998 ) are unavailable or lack coverage." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-9", "text": "To construct a distributional thesaurus, the (collocational or syntactic) contexts in which a target word occurs are used as the basis for calculating its similarity with other words." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-10", "text": "That is, two words are similar if they share a large proportion of contexts." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-11", "text": "Much attention has been devoted to refining thesaurus quality, improving informativeness and similarity measures (Lin, 1998; Curran and Moens, 2002; Ferret, 2010) , identifying and demoting bad neighbors (Ferret, 2013) , or using more relevant contexts (Broda et al., 2009; Biemann and Riedl, 2013) ." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-12", "text": "For the latter in particular, as words vary in their collocational tendencies, it is difficult to determine how informative a given context is." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-13", "text": "To remove uninformative and noisy contexts, filters have often been applied like pointwise mutual information (PMI), lexicographer's mutual information (LMI) (Biemann and Riedl, 2013) , t-score (Piasecki et al., 2007) and z-score (Broda et al., 2009 )." 
}, { "sent_id": "40d370558d873499e493a83f106f17-C001-14", "text": "However, the selection of a measure and of a threshold value for these filters is generally empirically determined." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-15", "text": "We argue that these filtering parameters have a great influence on the quality of the generated thesauri." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-16", "text": "The goal of this paper is to quantify the impact of context filters on distributional thesauri." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-17", "text": "We experiment with different filter methods and measures to assess context significance." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-18", "text": "We propose the use of simple cooccurrence frequency as a filter and show that it leads to better results than more expensive measures such as LMI or PMI." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-19", "text": "Thus we propose a cheap and effective way of filtering contexts while maintaining quality." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-20", "text": "This paper is organized as follows: in \u00a72 we discuss evaluation of distributional thesauri." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-21", "text": "The methodology adopted in the work and the results are discussed in \u00a73 and \u00a74." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-22", "text": "We finish with some conclusions and discussion of future work." 
}, { "sent_id": "40d370558d873499e493a83f106f17-C001-23", "text": "----------------------------------" }, { "sent_id": "40d370558d873499e493a83f106f17-C001-24", "text": "**RELATED WORK**" }, { "sent_id": "40d370558d873499e493a83f106f17-C001-25", "text": "In a nutshell, the standard approach to build a distributional thesaurus consists of: (i) the extraction of contexts for the target words from corpora, (ii) the application of an informativeness measure to represent these contexts and (iii) the application of a similarity measure to compare sets of contexts." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-26", "text": "The contexts in which a target word appears can be extracted in terms of a window of cooccurring (content) words surrounding the target (Freitag et al., 2005; Ferret, 2012; Erk and Pado, 2010) or in terms of the syntactic dependencies in which the target appears (Lin, 1998; McCarthy et al., 2003; Weeds et al., 2004) ." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-27", "text": "The informativeness of each context is calculated using measures like PMI, and t-test while the similarity between contexts is calculated using measures like Lin's (1998) , cosine, Jensen-Shannon divergence, Dice or Jaccard." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-28", "text": "Evaluation of the quality of distributional thesauri is a well know problem in the area (Lin, 1998; Curran and Moens, 2002) ." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-29", "text": "For instance, for intrinsic evaluation, the agreement between thesauri has been examined, looking at the average similarity of a word in the thesauri (Lin, 1998) , and at the overlap and rank agreement between the thesauri for target words like nouns (Weeds et al., 2004) ." 
}, { "sent_id": "40d370558d873499e493a83f106f17-C001-30", "text": "Although much attention has been given to the evaluation of various informativeness and similarity measures, a careful assessment of the effects of filtering on the resulting thesauri is also needed." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-31", "text": "For instance, Biemann and Riedl (2013) found that filtering a subset of contexts based on LMI increased the similarity of a thesaurus with WordNet." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-32", "text": "In this work, we compare the impact of using different types of filters in terms of thesaurus agreement with WordNet, focusing on a distributional thesaurus of English verbs." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-33", "text": "We also propose a frequency-based saliency measure to rank and filter contexts and compare it with PMI and LMI." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-34", "text": "Extrinsic evaluation of distributional thesauri has been carried out for tasks such as English lexical substitution (McCarthy and Navigli, 2009), phrasal verb compositionality detection (McCarthy et al., 2003) and the WordNet-based synonymy test (WBST) (Freitag et al., 2005) ." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-35", "text": "For comparative purposes in this work we adopt the latter." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-36", "text": "----------------------------------" }, { "sent_id": "40d370558d873499e493a83f106f17-C001-37", "text": "**METHODOLOGY**" }, { "sent_id": "40d370558d873499e493a83f106f17-C001-38", "text": "We focus on thesauri of English verbs constructed from the BNC (Burnard, 2007) 1 ." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-39", "text": "Contexts are extracted from syntactic dependencies generated by RASP (Briscoe et al., 2006) , using nouns (heads of NPs) which have subject and direct object relations with the target verb." 
}, { "sent_id": "40d370558d873499e493a83f106f17-C001-40", "text": "Thus, each target verb is represented by a set of triples containing (i) the verb itself, (ii) a context noun and (iii) a syntactic relation (object, subject)." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-41", "text": "The thesauri were constructed using Lin's (1998) method." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-42", "text": "Lin's version of the distributional hypothesis states that two words (verbs v 1 and v 2 in our case) are similar if they share a large proportion of contexts weighted by their information content, assessed with PMI (Bansal et al., 2012; Turney, 2013) ." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-43", "text": "In the literature, little attention is paid to context filters." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-44", "text": "To investigate their impact, we compare two kinds of filters, and before calculating similarity using Lin's measure, we apply them to remove potentially noisy triples:" }, { "sent_id": "40d370558d873499e493a83f106f17-C001-45", "text": "\u2022 Threshold (th): we remove triples that occur less than a threshold th." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-46", "text": "Threshold values vary from 1 to 50 counts per triple." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-47", "text": "\u2022 Relevance (p): we keep only the top p most relevant contexts for each verb, were relevance is defined according to the following measures: (a) frequency, (b) PMI, and (c) LMI (Biemann and Riedl, 2013) ." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-48", "text": "Values of p vary between 10 and 1000." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-49", "text": "In this work, we want to answer two questions: (a) Do more selective filters improve intrinsic evaluation of thesaurus? and (b) Do they also help in extrinsic evaluation?" 
}, { "sent_id": "40d370558d873499e493a83f106f17-C001-50", "text": "For intrinsic evaluation, we determine agreement between a distributional thesaurus and WordNet as the path similarities for the first k distributional neighbors of a verb." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-51", "text": "A single score is obtained by averaging the similarities of all verbs with their k first neighbors." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-52", "text": "The higher this score is, the closer the neighbors are to the target in WordNet, and the better the thesaurus." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-53", "text": "Several values of k were tested and the results showed exactly the same curve shapes for all values, with WordNet similarity decreasing linearly with k. For the remainder of the paper we adopt k = 10, as it is widely used in the literature." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-54", "text": "For extrinsic evaluation, we use the WBST set for verbs (Freitag et al., 2005) with 7,398 questions and an average polysemy of 10.4." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-55", "text": "The task consists of choosing the most suitable synonym for a word among a set of four options." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-56", "text": "The thesaurus is used to rank the candidate answers by similarity scores, and select the first one as the correct synonym." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-57", "text": "As discussed by Freitag et al. (2005) , the upper bound reached by English native speakers is 88.4% accuracy, and simple lower bounds are 25% (random choice) and 34.5% (always choosing the most frequent option)." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-58", "text": "Figure 1 shows average WordNet similarities for thesauri built filtering by frequency threshold th and by p most frequent contexts." 
}, { "sent_id": "40d370558d873499e493a83f106f17-C001-59", "text": "When using a threshold filter (Figure 1 left) , high values lead to better performance for midand low-frequency verbs." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-60", "text": "This is because, for high th values, there are few low and mid-frequency verbs left, since a verb that occurs less has less chances to be seen often in the same context." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-61", "text": "The similarity for verbs with no contexts over the frequency threshold cannot be assessed and as a consequence those verbs are not included in the final thesaurus." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-62", "text": "As Figure 2 shows, the number of verbs decreases much faster for low and mid frequency verbs when th increases." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-63", "text": "3 For example, for th = 50, there are only 7 remaining lowfrequency verbs in the thesaurus and these tend to be idiosyncratic multiword expressions." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-64", "text": "One example is wreak, and the only triple containing this verb that appeared more than 50 times is wreak havoc (71 occurrences)." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-65", "text": "The neighbors of this verb are cause and play, which yield a good similarity score in WordNet." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-66", "text": "Therefore, although higher thresholds result in higher similarities for low and mid-frequency verbs, this comes at a cost, as the number of verbs included in the thesaurus decreases considerably." 
}, { "sent_id": "40d370558d873499e493a83f106f17-C001-67", "text": "----------------------------------" }, { "sent_id": "40d370558d873499e493a83f106f17-C001-68", "text": "**RESULTS**" }, { "sent_id": "40d370558d873499e493a83f106f17-C001-69", "text": "(||v|| \u2265 500), mid-frequency (150 \u2264 ||v|| < 500) and lowfrequency (||v|| < 150)." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-70", "text": "3 For p most salient contexts, the number of verbs does not vary and is the same shown in Figure 2 for th = 1 (no filter)." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-71", "text": "As expected, the best performance is obtained for high-frequency verbs and no filter, since it results in more context information per verb." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-72", "text": "Increasing th decreases similarity due to the removal of some of these contexts." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-73", "text": "In average, higher th values lead to better overall similarity among the frequency ranges (from 0.148 with th = 1 to 0.164 with th = 50)." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-74", "text": "The higher the threshold, the more high-frequency verbs will prevail in the thesauri, for which the WordNet path similarities are higher." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-75", "text": "On the other hand, when adopting a relevance filter of keeping the p most relevant contexts for each verb (Figure 1 right) , we obtain similar results, but more stable thesauri." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-76", "text": "The number of verbs remains constant, since we keep a fixed number of contexts for each verb and verbs are not removed when the threshold is modified." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-77", "text": "WordNet similarity increases as more contexts are taken into account, for all frequency ranges." 
}, { "sent_id": "40d370558d873499e493a83f106f17-C001-78", "text": "There is a maximum around p = 200, though larger values do not lead to a drastic drop in quality." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-79", "text": "This suggests that the noise introduced by low-frequency contexts is compensated by the increase of informativeness for other contexts." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-80", "text": "An ideal balance is reached by the lowest possible p that maintains high WordNet similarity, since the lower the p the faster the thesaurus construction." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-81", "text": "In terms of saliency measure, when keeping only the p most relevant contexts, sorting them with PMI leads to much worse results than LMI or frequency, as PMI gives too much weight to infrequent combinations." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-82", "text": "This is consistent with results of Biemann and Riedl (2013) ." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-83", "text": "Regarding LMI versus frequency, the results using the latter are slightly better (or with no significant difference, depending on the frequency range)." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-84", "text": "The advantage of using frequency instead of LMI is that it makes the process simpler and faster while leading to equal or better performance in all frequency ranges." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-85", "text": "Therefore for the extrinsic evaluation using WBST task, we use frequency to select the p most relevant contexts and then compute Lin's similarity using only those contexts." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-86", "text": "Figure 3 shows the performance of the thesauri in the WBST task in terms of precision, recall and F1." 
}, { "sent_id": "40d370558d873499e493a83f106f17-C001-87", "text": "4 For precision, the best filter is to remove con-4 Filters based on LMI and PMI were also tested with the texts occurring less than th times, but, this also leads to poor recall, since many verbs are left out of the thesauri and their WSBT questions cannot be answered." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-88", "text": "On the other hand, keeping the most relevant p contexts leads to more stable results and when p is high (right plot), they are similar to those shown in the left plot of Figure 3 ." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-89", "text": "----------------------------------" }, { "sent_id": "40d370558d873499e493a83f106f17-C001-90", "text": "**DISCUSSION**" }, { "sent_id": "40d370558d873499e493a83f106f17-C001-91", "text": "The answer to our questions in Section 3 is yes, more selective filters improve intrinsic and extrinsic thesaurus quality." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-92", "text": "The use of both filtering methods results in thesauri in which the neighbors of target verbs are closer in WordNet and get better scores in TOEFL-like tests." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-93", "text": "However, the fact that filtering contexts with frequency under th removes verbs in the final thesaurus is a drawback, as highlighted in the extrinsic evaluation on the WBST task." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-94", "text": "Furthermore, we demonstrated that competitive results can be obtained keeping only the p most relevant contexts per verb." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-95", "text": "On the one hand, this method leads to much more stable thesauri, with the same verbs for all values of p. 
On the other hand, it is important to highlight that the best results to assess the relevance of the contexts are obtained using frequency while more sophisticated filters such as LMI do not improve thesaurus quality." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-96", "text": "Although an LMI filter is relatively fast compared to dimensionality reduction techniques such as singular value decomposition (Landauer and Dumais, 1997), it is still considerably more expensive than a simple frequency filter." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-97", "text": "In short, our experiments indicate that a reasonsame results as intrinsic evaluation: sorting contexts by frequency leads to better results." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-98", "text": "able trade-off between noise, coverage and computational efficiency is obtained for p = 200 most frequent contexts, as confirmed by intrinsic and extrinsic evaluation." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-99", "text": "Frequency threshold th is not recommended: it degrades recall because the contexts for many verbs are not frequent enough." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-100", "text": "This result is useful for extracting distributional thesauri from very large corpora like the UKWaC (Ferraresi et al., 2008) by proposing an alternative that minimizes the required computational resources while efficiently removing a significant amount of noise." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-101", "text": "----------------------------------" }, { "sent_id": "40d370558d873499e493a83f106f17-C001-102", "text": "**CONCLUSIONS AND FUTURE WORK**" }, { "sent_id": "40d370558d873499e493a83f106f17-C001-103", "text": "In this paper we addressed the impact of filters on the quality of distributional thesauri, evaluating a set of standard thesauri and different filtering methods." 
}, { "sent_id": "40d370558d873499e493a83f106f17-C001-104", "text": "The results suggest that the use of filters and their parameters greatly affect the thesauri generated." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-105", "text": "We show that it is better to use a filter that selects the most relevant contexts for a verb than to simply remove rare contexts." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-106", "text": "Furthermore, the best performance was obtained with the simplest method: frequency was found to be a simple and inexpensive measure of context salience." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-107", "text": "This is especially important when dealing with large amounts of data, since computing LMI for all contexts would be computationally costly." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-108", "text": "With our proposal to keep just the p most frequent contexts per verb, a great deal of contexts are cheaply removed and thus the computational power required for assessing similarity is drastically reduced." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-109", "text": "As future work, we plan to use these filters to build thesauri from larger corpora." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-110", "text": "We would like to generalize our findings to other syntactic configurations (e.g. noun-adjective) as well as to other similarity and informativeness measures." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-111", "text": "For instance, ongoing experiments indicate that the same parameters apply when Lin's similarity is replaced by cosine." }, { "sent_id": "40d370558d873499e493a83f106f17-C001-112", "text": "Finally, we would like to compare the proposed heuristics with more sophisticated filtering strategies like singular value decomposition (Landauer and Dumais, 1997) and non-negative matrix factorization (Van de Cruys, 2009 )." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "40d370558d873499e493a83f106f17-C001-8" ], [ "40d370558d873499e493a83f106f17-C001-11" ], [ "40d370558d873499e493a83f106f17-C001-26" ], [ "40d370558d873499e493a83f106f17-C001-27" ], [ "40d370558d873499e493a83f106f17-C001-28" ] ], "cite_sentences": [ "40d370558d873499e493a83f106f17-C001-8", "40d370558d873499e493a83f106f17-C001-11", "40d370558d873499e493a83f106f17-C001-26", "40d370558d873499e493a83f106f17-C001-27", "40d370558d873499e493a83f106f17-C001-28" ] }, "@MOT@": { "gold_contexts": [ [ "40d370558d873499e493a83f106f17-C001-28", "40d370558d873499e493a83f106f17-C001-29", "40d370558d873499e493a83f106f17-C001-30" ] ], "cite_sentences": [ "40d370558d873499e493a83f106f17-C001-28", "40d370558d873499e493a83f106f17-C001-29" ] }, "@USE@": { "gold_contexts": [ [ "40d370558d873499e493a83f106f17-C001-41" ] ], "cite_sentences": [ "40d370558d873499e493a83f106f17-C001-41" ] } } }, "ABC_3d99ad1ba1696c8ef743f233530601_29": { "x": [ { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-144", "text": "Probably, with many more iterations the results would be better." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-2", "text": "Some languages do not have enough labeled data to obtain good discourse parsing, specially in the relation identification step, and the additional use of unlabeled data is a plausible solution." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-3", "text": "A workflow is presented that uses a semi-supervised learning approach." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-4", "text": "Instead of only a predefined additional set of unlabeled data, texts obtained from the web are continuously added." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-5", "text": "This obtains near human perfomance (0.79) in intra sentential rhetorical relation identification." 
}, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-6", "text": "An experiment for English also shows improvement using a similar workflow." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-7", "text": "----------------------------------" }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-9", "text": "A text is composed of coherent propositions (phrases and sentences, for example) ordered and connected according to the intentions of the author of the text." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-10", "text": "This composition may be recognized and structured according to many theories and this type of information is valuable to many natural language processing applications." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-11", "text": "A process to recognize, automatically, the coherent or discursive (or also rhetorical) structure of a text is named discourse parsing (DP)." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-12", "text": "The most prominent theory in Computational Linguistics to structure the discourse of a text is the Rhetorical Structure Theory (RST) proposed by Mann and Thompson (1987) ." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-13", "text": "In this theory, the text is segmented into elementary discourse units (EDUs), which each contain a proposition (basic idea) of the text." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-14", "text": "The theory proposes a set of rhetorical relations that may hold between Figure 1 : An example of sentence-level structure according to RST." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-15", "text": "From Soricut and Marcu (2003) ." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-16", "text": "the EDUs, explicating the intentions of the author." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-17", "text": "For example, consider the sentence in Figure 1 ." 
}, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-18", "text": "It is segmented into three EDUs, numbered from 1 to 3." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-19", "text": "EDUs 2 and 3 are related by the relation Enablement, forming a new span of text, which is related to 1 by the relation Attribution." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-20", "text": "In each relation, EDUs can be Nucleus (more essential) or Satellite to the writer's purpose." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-21", "text": "Many approaches have been used in DP, the majority of them using machine learning algorithms, such as probabilistic models (Soricut and Marcu, 2003) , SVMs (Reitter, 2003; duVerle and Prendinger, 2009; Hernault et al., 2010; Feng and Hirst, 2012) and dynamic conditional random field (Joty et al., 2012) ." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-22", "text": "To obtain acceptable results, these approaches need plenty of labeled data." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-23", "text": "But even more than other levels of linguistic information, such as morphology or syntax, the annotation of discourse is an expensive task." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-24", "text": "Given this fact, what can we do when there is not enough data to perform effective learning of DP, as in languages with little annotated data?" }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-25", "text": "This paper describes a methodology to overcome the problem of insufficient labeled data in the task of identifying rhetorical relations between Figure 2 : Lexicalized syntactic tree used by SPADE." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-26", "text": "The circles indicate the node used as the most indicative information to identify the rhetorical relation and structure." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-27", "text": "EDUs, which is the most important step during DP." 
}, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-28", "text": "The language used in our work is Portuguese and two well-known systems of DP for English were adapted to this language." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-29", "text": "Portuguese is a language with insufficient annotated data to obtain a good discourse parser, but has all the tools to adapt some English discourse parsers." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-30", "text": "A framework of semi-supervised never-ending learning (SSNEL) (see Section 2.2 below) was created and evaluated with the adapted models." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-31", "text": "The results show that this approach improved the results to achieve nearhuman perfomance, even with the use of automatic tools (syntax parser and discourse segmenter)." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-32", "text": "2 Related Work 2.1 Supervised Discourse Parsing Soricut and Marcu (2003) use two probabilistic models to perform a sentence-level analysis, one for segmentation and other to identify the relations and build the rhetorical structure." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-33", "text": "The parser is named SPADE (Sentence-level Parsing of DiscoursE) and the authors base their model on lexical and syntactic information, extracting features from a lexicalized syntactic tree." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-56", "text": "Specially, it is important to note that, to obtain good performance in the task more data is necessary." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-34", "text": "They assume that the features extracted at the jointing point of two discursive segments are the most indicative information to identify the rhetorical structure of the sentence." 
}, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-35", "text": "For example, in Figure 2 , the circled nodes correspond to the most indicative cues to identify the structure and relation between each two adjacent segments." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-36", "text": "The authors report a F-measure of 0.49 in a set of 18 RST relations." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-37", "text": "The human performance in this same task is 0.77 (measured by interannotation agreement)." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-38", "text": "The authors, then, use the probabilistic model with manual segmentation and syntactic trees to see the impact of this information in the parsing and the model achieves 0.75." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-39", "text": "Hernault et al. (2010) use support vector machine (SVM) classifiers to perform DP." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-40", "text": "This discourse parser is named HILDA (HIgh-Level Discourse Analyser)." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-41", "text": "This work used a set of 41 rhetorical relations and achieves a F-measure of 0.48 in the step of relation identification, both intra-sentential and inter-sentential." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-42", "text": "Feng and Hirst (2012) improve HILDA by incorporating new proposed features and some adapted from Lin et al. (2009) ." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-43", "text": "Another important decision was the specification of features for intra-sentential and inter-sentential relationships and the use of contextual features in the building of the rhetorical tree." 
}, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-44", "text": "Considering the approach to intra-sentential relation identification, with 18 RST relations this work achieves a macro average F-measure of 0.49 and weighted average Fmeasure of 0.77 in relation identification." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-45", "text": "Joty et al. (2012) use a joint modelling approach to identify the structure and the relations at the sentence-level using DCRFs (dynamic conditional random fields) and a non-greedy bottom-up method in the construction of the rhetorical structure." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-46", "text": "The features used in this work were similar to those used by HILDA." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-47", "text": "They achieve a F-measure of 0.77, using manual segmentation, and 0.65 using automatic segmentation." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-48", "text": "Some languages, such as Portuguese, do not have enough data to train a good DP and there is no work treating this limitation in this language." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-49", "text": "The first attempt to perform DP in Portuguese was made by Pardo and Nunes (2006) , who used an approach based on lexical patterns extracted from an RST-annotated corpus of academic texts to create DiZer (Discourse analyZer)." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-50", "text": "More than 740 lexical patterns were manually extracted from the corpus." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-51", "text": "A lexical pattern is composed of the discursive markers, its position in the EDU, and corresponding nuclearity." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-52", "text": "The use of lexical patterns is a unique approach for Portuguese, and achieves a F-measure of 0.625 in relation detection when evaluated in academic texts; in news texts, DiZer achieves an F-measure of 0.405." 
}, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-53", "text": "----------------------------------" }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-54", "text": "**SEMI-SUPERVISED DISCOURSE PARSING**" }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-55", "text": "All the above cited approaches to DP use annotated data to extract discursive knowledge and are limited to the availability of this resource, which is expensive to obtain." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-57", "text": "Semi-supervised learning (SSL) is employed in scenarios in which there is some labeled data and large availability of unlabeled data, and manual annotation is an expensive task (Zhu, 2008) ." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-58", "text": "Related to the use of SSL in DP, Marcu and Echihabi (2002) used naive Bayes to train binary classifiers to distinguish between some types of relations, as Elaboration vs. Cause-ExplanationEvidence." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-59", "text": "For example, for this binary classifier, applying SSL, the accuracy increased from approximately 0.6 to 0.95 after the use of millions of new instances." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-60", "text": "Chiarcos (2012) used SSL to develop a probabilistic model mapping the occurrence of discourse markers and verbs to rhetorical relations." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-61", "text": "For Italian, Soria and Ferrari (1998) conducted work in the same direction." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-62", "text": "Sporleder and Lascarides (2005) performed similar work to Marcu and Echihabi, with similar results for a different set of relations and a more sophisticated classifier." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-63", "text": "Building on this, there is an interesting idea, known as never-ending learning (NEL) by Carlson et al. 
(2010) , in which they apply SSL with infinite unlabeled data." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-64", "text": "The needed data is widely and freely available on the web." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-65", "text": "Their architecture runs 24 hours per day, forever, obtaining new information and performing a learning task." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-66", "text": "With the aim of surpassing the limitation of labeled RST in Portuguese to develop a good DP, we employ SSNEL in the task by adapting the work of Soricut and Marcu (2003) and Hernault et al. (2010) ." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-67", "text": "This choice for SSLNEL was made considering the large and free availability of news texts on the web." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-68", "text": "----------------------------------" }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-69", "text": "**RST CORPORA**" }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-70", "text": "RST-DT (RST Discourse TreeBank) (Carlson et al., 2001 ) is the most widely used corpus annotated with RST in English." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-71", "text": "Table 1 compares it with available Portuguese corpora labeled according to RST (these corpora will be referred to as RST-DT-PT hereafter)." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-72", "text": "The corpora CSTNews (Cardoso et al., 2011) , Summ-it (Collovini et al., 2007) and two-thirds of Rhetalho (Pardo and Seno, (45) and many more words (55,536) than RST-DT-PT." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-73", "text": "This work focuses on the identification of rhetorical relations at the sentence level, and as is common since the work of Soricut and Marcu (2003) , fine-grained relations were grouped: 29 sentence-level rhetorical relations were found and grouped into 16 groups." 
}, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-74", "text": "The imbalance of the relations is a natural characteristic in discourse and, to avoid overfitting of a learning model on the lessfrequent relations, no balancing was made." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-75", "text": "The relation Summary, for example, occurs only 2 times, and Elaboration occurs 1491 times, making very difficult the identification of the Summary relation." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-76", "text": "----------------------------------" }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-77", "text": "**ADAPTED MODELS**" }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-78", "text": "Syntactic information is crucial in SPADE (Soricut and Marcu, 2003) and for Portuguese the parser most similar to that used by Soricut and Marcu is the LX-parser (Stanford parser trained to Portuguese (Silva et al., 2010) )." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-79", "text": "After the parsing of the text by the syntactic parser, the same lexicalization procedure (Magerman, 1995) was applied and adapted according to the tagset used by LX-parser." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-80", "text": "In this adaptation, only pairs of adjacent segments at sentence-level were considered, and nuclearity was not considered, in order to avoid sparseness in the data." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-81", "text": "Training the adapted model (here called SPADE-PT) using the RST-DT-PT achieved F-measure of 0.30." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-82", "text": "The precision was 0.69, but the recall was only 0.19." 
}, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-83", "text": "The same features used by HILDA (Hernault et al., 2010) were extracted from the pairs of adjacent segments at sentence-level and many machine learning algorithms were tested, besides the SVM, which was used in the original work." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-84", "text": "The algorithm which obtained the best F-measure was J48, an implementation of decision trees (Quinlan, 1993) ." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-85", "text": "The RST-DT-PT corpora was used and the adaptation (here called of HILDA-PT) achieved an F-measure of 0.548, which is much better than that of SPADE-PT." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-86", "text": "A possible explanation is that the feature set in SPADE is composed only of syntactic tags and words." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-87", "text": "The resulting probabilistic model is sparse and many equal instances may indicate different relations (classes)." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-88", "text": "However HILDA adds more features over which the classifier can work better, even when some values are absent." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-89", "text": "Given the results of the adapted models, HILDA-PT was chosen to be incorporated into the SSNEL, explicated below." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-90", "text": "----------------------------------" }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-91", "text": "**SEMI-SUPERVISED NEVER-ENDING LEARNING WORKFLOW**" }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-92", "text": "Here, an adaptation of Carlson et al. (2010) selftraining algorithm was used." 
}, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-93", "text": "Two different approaches to relation identification are used, that is to say, a lexical pattern set LPS (the relation identification module of DiZer), and a multi-class classifier C generated according to some machine learning algorithm." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-94", "text": "All the new instances obtained from the lexical module are used together with the more confident classifications of C to retrain this last." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-95", "text": "For each classification, J48 returns a confidence value used to choose the most confident classifications." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-96", "text": "Also, there is interest in observing the behaviour of the classifier in each iteration of the semi-supervision, searching for the best F-measure it may achieve." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-97", "text": "In this way, a workflow of never-ending learning (NEL) was proposed and is presented in Figure 3 ." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-98", "text": "Workflow 1 is presented as an alternative visualization to the illustration in Figure 3 ." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-99", "text": "Continuously, a crawler gets pages from online news on the web and performs cleaning to obtain the main text (Text)." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-100", "text": "In a first iteration, a Segmenter (Maziero et al., 2007) is applied to obtain the EDUs in each sentence and, for each pair of adjacent EDUs (PairEDUs), the C 1 classifier (C 1 initially trained with the LabeledData 1 from the RST-DT-PT) and the lexical pattern set LPS are used to identify the relations between the segments." 
}, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-101", "text": "To retrain C 1 , all the new instances from the lexical pattern set LabeledDataLPS (as LPS does not provide a confidence value, all the labelled instances are" }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-102", "text": "Workflow 1: Workflow of the SSNEL using two models to identify rhetorical relations between each PairEDUs." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-103", "text": "used in the semi-supervision) and the classifications LabeledDataC with confidence greater than 0.7 by C 1 are joined with LabeledData 1 to obtain LabeledData 2 (LabeledData 2 = LabeledDataLPS + LabeledDataC)." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-104", "text": "After the retraining, a Monitor verifies the new F-measure of C 2 (FmC 2 , obtained using 10-fold cross validation) and, if it decreased compared with the F-measure of C 1 (FmC 1 ), C 2 is discarded and, for the next iteration, C 1 will continue to be used." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-105", "text": "If FmC 1 did not decrease, C 2 will be used in the next iteration." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-106", "text": "Monitor also plots a graph G to present the behaviour of FmC during SSNEL." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-107", "text": "This process continues iteratively." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-108", "text": "It is important to note that, given the small size of the training data, we opted to use 10-fold crossvalidation during the training and testing of the classifiers, instead of separating the data into three sets (training, development, and test)." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-109", "text": "The total number of instances was 6163 and some relations, such as Restatement with 28 instances, would have few relations when split into three sets." 
}, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-110", "text": "During the semi-supervision of SPADE-PT, the model of relation identification was incrementally obtained at each iteration, since the addition of a new instance only modifies the probabilities of the instances already present in the model." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-111", "text": "If the instance is new, it is added to the model and the probabilities are adjusted." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-112", "text": "However, in the semi-supervision of the HILDA-PT, the algorithm J48 does not allow incremental learning." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-113", "text": "There are some implementations of incremental decision trees, but the resulting models are not as accurate as J48 because they work with an incomplete set of training instances." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-114", "text": "As we want the best F-measure for relation identification, the algorithm J48 was employed, even though it is not an incremental learning." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-115", "text": "Another important decision is to monitor the concept-drift (CD) (Klinkenberg, 2004) during the SSNEL, given that a concept may change over time." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-116", "text": "In this work, CD refers to different sources and topics to which the classifier is applied." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-117", "text": "To treat CD, the algorithm may detect the evolution of the concept and be able to modify the model to accommodate the concept, avoiding the decrease in the performance of the model being generated." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-118", "text": "One technique to monitor the CD is statistical process control (SPC) (Gama et al., 2004) ." 
}, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-119", "text": "This technique constantly analyses the error during the learning: if the F-measure drops, it may indicate some changes in the concept and the model needs to be modified." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-143", "text": "The test shown improvements (at the level p < .1), even though they are low for both experiments." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-120", "text": "In the SSNEL workflow, this is treated by the Monitor, which discards new instances used to retrain the model if its F-measure decreases, ensuring that the learned model always acquires correct new learning." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-121", "text": "----------------------------------" }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-122", "text": "**EXPERIMENTS**" }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-123", "text": "Considering Workflow 1, the two adapted models were instantiated as C, and many iterations were executed." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-124", "text": "After 1,640 iterations and the addition of 1,592 new training instances, the F-measure of SPADE-PT increased only 0.05." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-125", "text": "HILDA-PT, after 180 iterations and with the addition of 21,740 new instances, increased 0.24, achieving 0.79 using automatic segmentation." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-126", "text": "Table 2 presents a summary of the results." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-127", "text": "As explained in Section 4, the features used by SPADE-PT lead to a sparse model (when there is not enough initial data), and this is the reason that, during 1,640 iterations, only 1,592 new instances were acquired, compared to the number of iterations and new instances during the experiment with HILDA-PT." 
}, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-128", "text": "To evaluate the parsers, two baselines were considered." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-129", "text": "One of them (Elaboration Relation) is the labeling of all the instances with the most frequent relation in the corpus (Elaboration); the second is the use of LPS (DiZer) applied to all PairEDUs in RST-DT-PT." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-130", "text": "SPADE-PT, even after many iterations in SSNEL, performed lower than the two baselines." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-131", "text": "HILDA-PT, since even before the use of SSNEL, performed better than the baselines." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-132", "text": "The class composed of relations Interpretation, Evaluation and Conclusion had 40 labeled exam-ples, initially." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-133", "text": "After the iterations, its F-measure increased from 0.054 to 0.916." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-134", "text": "Except for Comparison and Summary, all the other relations increased their F-measures." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-135", "text": "This reinforces the results obtained by Marcu and Echihabi (2002) , which increased (the result of a binary classifier to distinguish between two relations) from 0.6 to 0.95 after the use of millions of new instances." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-136", "text": "The relation Summary, however, with only 2 labeled instances, continued its zero F-measure." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-137", "text": "SSNEL of HILDA-PT was executed for 23 days." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-138", "text": "Documents used had on average 28 sentences and 749 words." 
}, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-139", "text": "The choice of only 10 documents per batch is to have a fine-grained control over the new instances, given that if a new classifier decreases the F-measure, it is discarded." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-140", "text": "Out of 70 generated classifiers were discarded." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-141", "text": "As the use of 10-fold cross-validation in the SSNEL may lead to some overfitting on the data which was already classified in the workflow, two other SSNEL experiments were performed, for English and Portuguese, with separated training and test sets." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-142", "text": "These experiments had less time to run, and, in order to determine whether the improvements during the SSNEL were statistically significant, paired T-tests were employed to compare initial classifier and the best classifier obtained during iterations in the workflow." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-145", "text": "Table 3 shows the improvements in the accuracy during the SSNEL, the number of iterations, and the number of new instances incorporated in the training data." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-146", "text": "Although a direct comparison between the experiments is not fair, due to different corpora, the improvements show that this workflow is promising to increase the accuracy of classifiers with unlabeled data." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-147", "text": "The experiment with SSNEL for English was realized in order to see the results that could be obtained when large annotated corpora are available." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-148", "text": "In the SSNEL for English, only decision-tree classifiers were used to classify new instances." 
}, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-149", "text": "For Portuguese, a symbolic model (lexical patterns) was also used together with the classifiers." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-150", "text": "The improved results presented in Table 2 and 3 are very different due to differing evaluation strategies." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-151", "text": "Using separated test data, we tried to avoid possible overfitting on training data, but the size of test data may not lead to a fair evaluation Soricut and Marcu (2003) or Joty et al. (2012) , since HILDA-PT used different corpora (RST-DT-PT instead of RST-DT), and some reported results are for the complete DP." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-152", "text": "However, our results show the potential of the SSNEL workflow when not enough labeled data is available for supervised learning, since the same approach for relation identification of Hernault et al. (2010) was used in HILDA-PT and 0.531 was initially obtained." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-153", "text": "These results constitute the state of art for rhetorical relation identification for Portuguese and it is believed that with more time (iterations in SSNEL), the results may increase." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-154", "text": "----------------------------------" }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-155", "text": "**CONCLUSION**" }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-156", "text": "Even though the results obtained in the SSNEL were satisfactory, new features will be added to the HILDA-PT, for example, types of discourse signals, beyond the discourse markers (Taboada and Das, 2013) , and the use of semantic information, as synonymity." 
}, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-157", "text": "Also, given that the number of features will increase, feature selection may be applied to select the most informative features in each iteration of the SSNEL." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-158", "text": "Since this work treats only rhetorical relations, without nuclearity, a classifier of nuclearity was trained (with the same features of HILDA-PT) and obtained a F-score of 0.86." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-159", "text": "As done by Feng and Hirst (2012), a better set of features will be selected to identify relations between inter-sentential spans." }, { "sent_id": "3d99ad1ba1696c8ef743f233530601-C001-160", "text": "A procedure similar to tree building used by Feng and Hirst (2012) will be employed in the future DP." } ], "y": { "@BACK@": { "gold_contexts": [ [ "3d99ad1ba1696c8ef743f233530601-C001-21" ], [ "3d99ad1ba1696c8ef743f233530601-C001-32" ], [ "3d99ad1ba1696c8ef743f233530601-C001-73" ], [ "3d99ad1ba1696c8ef743f233530601-C001-78" ], [ "3d99ad1ba1696c8ef743f233530601-C001-151" ] ], "cite_sentences": [ "3d99ad1ba1696c8ef743f233530601-C001-21", "3d99ad1ba1696c8ef743f233530601-C001-32", "3d99ad1ba1696c8ef743f233530601-C001-73", "3d99ad1ba1696c8ef743f233530601-C001-78", "3d99ad1ba1696c8ef743f233530601-C001-151" ] }, "@MOT@": { "gold_contexts": [ [ "3d99ad1ba1696c8ef743f233530601-C001-21", "3d99ad1ba1696c8ef743f233530601-C001-22", "3d99ad1ba1696c8ef743f233530601-C001-24" ] ], "cite_sentences": [ "3d99ad1ba1696c8ef743f233530601-C001-21" ] }, "@USE@": { "gold_contexts": [ [ "3d99ad1ba1696c8ef743f233530601-C001-66" ], [ "3d99ad1ba1696c8ef743f233530601-C001-73" ], [ "3d99ad1ba1696c8ef743f233530601-C001-78", "3d99ad1ba1696c8ef743f233530601-C001-79" ] ], "cite_sentences": [ "3d99ad1ba1696c8ef743f233530601-C001-66", "3d99ad1ba1696c8ef743f233530601-C001-73", "3d99ad1ba1696c8ef743f233530601-C001-78" ] }, "@SIM@": { "gold_contexts": [ [ 
"3d99ad1ba1696c8ef743f233530601-C001-73" ] ], "cite_sentences": [ "3d99ad1ba1696c8ef743f233530601-C001-73" ] } } }, "ABC_84ae490de92cb9d064993be751b3e0_29": { "x": [ { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-2", "text": "A major axis of research at LIMSI is directed at multilingual, speaker-independent, large vocabulary speech dictation." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-3", "text": "In this paper the LIMSI recognizer which was evaluated in the ARPA NOV93 CSR test is described, and experimental results on the WSJ and BREF corpora under closely matched conditions are reported." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-4", "text": "For both corpora word recognition expenrnents were carried out with vocabularies containing up to 20k words." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-5", "text": "The recognizer makes use of continuous density HMM with Gaussian mixture for acoustic modeling and n-gram statistics estimated on the newspaper texts for language modeling." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-6", "text": "The recognizer uses a time-synchronous graph-search strategy which is shown to still be viable with a 20k-word vocabulary when used with bigram back-off language models." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-7", "text": "A second forward pass, which makes use of a word graph generated with the bigram, incorporates a trigram language model." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-8", "text": "Acoustic modeling uses cepstrum-based features, context-dependent phone models (intra and interword), phone duration models, and sexdependent models." 
}, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-9", "text": "----------------------------------" }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-179", "text": "Improving the model accuracy, at the acoustic level and at the language model level, by taking advantage of the available gaining data, has led to better system performance." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-11", "text": "Speech recognition research at LIMSI aims to develop recognizers that are task-, speaker-, and vocabulary-independent so as to be easily adapted to a variety of applications." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-12", "text": "The applicability of speech recognition techniques used for one language to other languages is of particular importance in Europe." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-13", "text": "The multilingual aspects are in part carried out in the context of the LRE SQALE (Speech recognizer Quality Assessment for Linguistic Engineering) project, which is aimed at assessing language dependent issues in multilingual recognizer evaluation." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-14", "text": "In this project, the same system will be evaluated on comparable tasks in different languages (English, French and German) to determine cross-lingual differences, and different recognizers will be compared on the same language to compare advantages of different recognition strategies." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-15", "text": "In this paper some of the primary issues in large vocabulary, speaker-independent, continuous speech recognition for dictation are addressed." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-16", "text": "These issues include language modeling, acoustic modeling, lexical representation, and search." 
}, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-17", "text": "Acoustic modeling makes use of continuous density HMM with Gaussian mixture of context-dependent phone models." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-18", "text": "For language modeling n-gram statistics are estimated on tThis work is partially funded by the LRE project 62-058 SQALE." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-19", "text": "----------------------------------" }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-20", "text": "**319**" }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-21", "text": "text material." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-22", "text": "To deal with phonological variability alternate pronunciations are included in the lexicon, and optional phonological rules are applied during training and recognition." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-23", "text": "The recognizer uses a time-synchronous graph-search strategy [16] for a first pass with a bigram back-off language model (LM) [10] ." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-24", "text": "A trigram LM is used in a second acoustic decoding pass which makes use of the word graph generated using the bigram LM [6] ." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-25", "text": "Experimental results are reported on the ARPA Wall Street Journal (WSJ) [19] and BREF [14] corpora, using for both corpora over 37k utterances for acoustic training and more than 37 million words of newspaper text for language model training." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-26", "text": "While the number of speakers is larger for WSJ, the total amount of acoustic training material is about the same (see Table 1 )." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-27", "text": "It is shown that for both corpora increasing the amount of training utterances by an order of magnitude reduces the word error by about 30%." 
}, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-28", "text": "The use of a trigram LM in a second pass also gives an error reduction of 20% to 30%." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-29", "text": "The combined error reduction is on the order of 50%." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-30", "text": "----------------------------------" }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-31", "text": "**LANGUAGE MODELING**" }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-32", "text": "Language modeling entails incorporating constraints on the allowable sequences of words which form a sentence." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-33", "text": "Statistical n-gram models attempt to capture the syntactic and semantic constraints by estimating the frequencies of sequences of n words." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-34", "text": "In this work bigram and trigram language models are estimated on the training text material for each corpus." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-35", "text": "This data consists of 37M words of the WSJ 1 and 38M words of Le Monde." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-36", "text": "A backoff mechanism [10] is used to smooth the estimates of the probabilities of rare n-grams by relying on a lower order n-gram when there is insufficient training data, and to provide a means of modeling unobserved n-grams." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-37", "text": "Another advantage of the backoff mechanism is that LM size can be arbitrarily reduced by relying more on the backoff, by increasing the minimum number of required n-gram observations needed to include the n-gram." 
}, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-38", "text": "This property can be used in the first bigram decod1While we have built n-gram-backoff LMs directly from the 37M-word standardized WSJ training text material, in these experiments all results are reported using the 5k or 20k, bigram and tfigram backoff LMs provided by Lincoln Labs [ 19] as required by ARPA so as to be compatible with the other sites participating in the tests." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-39", "text": "ing pass to reduce computational requirements." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-40", "text": "The trigram langage model is used in the second pass of the decoding process:." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-41", "text": "In order to be able to constnact LMs for BREF, it was necessary to normalize the text material of Le Monde newpaper, which entailed a pre-treatment rather different from that used to normalize the WSJ texts [19] ." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-42", "text": "The main differences are in the treatment of compound words, abbreviations, and case." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-43", "text": "In BREF the distinction between the cases is kept if it designates a distinctive graphemic feature, but not when the upper case is simply due to the fact that the word occurs at the beginning of the sentence." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-44", "text": "Thus, the first word of each sentence was semi-automatically verified to determine if a transformation to lower case was needed." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-45", "text": "Special treatment is also needed for the symbols hyphen (-), quote ('), and period (.) which can lead to ambiguous separations." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-46", "text": "For example, the hyphen in compound words like beaux-arts and au-dessus is considered word-internal." 
}, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-47", "text": "Alternatively the hyphen may be associated with the first word as in ex-, or anti-, or with the second word as in -Id or -nL Finally, it may appear in the text even though it is not associated with any word." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-48", "text": "ruction (see Figure 1 )." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-49", "text": "There are also a larger number of frequent, monophone words for Le Monde than for WSJ, accounting for about 17% and 3% of all word occurrences in the respective training texts." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-50", "text": "----------------------------------" }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-51", "text": "**ACOUSTIC-PHONETIC MODELING**" }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-52", "text": "The recognizer makes use of continuous density HMM (CDHMM) with Gaussian mixture for acoustic modeling." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-53", "text": "The main advantage continuous density modeling offers over discrete or semi-continuous (or tied-mixture) observation density modeling is that the number of parameters used to model an HMM observation distribution can easily be adapted to the amount of available training data associated to this state." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-54", "text": "As a consequence, high precision modeling can be achieved for highly frequented states without the explicit need of smoothing techniques for the densities of less frequented states." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-55", "text": "Discrete and semi-continuous modeling use a fixed number of parameters to represent a given observation density and therefore cannot achieve high precision without the use of smoothing techniques." 
}, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-56", "text": "This problem can be alleviated by tying some states of the Markov models in order to have more training data to estimate each state distribution." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-57", "text": "However, since this kind of tying requires careful design and some a priori assumptions, these techniques are primarily of interest when the training data is limited and cannot easily be increased." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-58", "text": "In the experimental section we demonstrate the improvement in performance obtained on the same test data by simply using additional training material." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-59", "text": "A 48-component feature vector is computed every 10 ms." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-60", "text": "This feature vector consists of 16 Bark-frequency scale cepstrum coefficients computed on the 8kHz bandwidth and their first and second order derivatives." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-61", "text": "For each frame (30 ms window), a 15 channel Bark power spectrum is obtained by applying triangular windows to the DFT output." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-62", "text": "The cepstrum coefficients are then computed using a cosinus transform [2] ." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-63", "text": "The acoustic models are sets of context-dependent(CD), position independent phone models, which include both intra-word and cross-word contexts." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-64", "text": "The contexts are automatically selected based on their frequencies in the training data." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-65", "text": "The models include tfiphone models, fight-and left-context phone models, and context-independent phone models." 
}, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-66", "text": "Each phone model is a left-to-right CDHMM with Gaussian mixture observation densities (typically 32 components)." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-67", "text": "The covariance matrices of all the Gaussians are diagonal." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-68", "text": "Duration is modeled with a gamma distribution per phone model." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-69", "text": "The HMM and duration parameters are estimated separately and combined in the recognition process for the Viterbi search." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-70", "text": "Maximum a postedori estimators are used for the HMM parameters [8] and moment estimators for the gamma distributions." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-71", "text": "Separate male and female models are used to more accurately model the speech data." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-72", "text": "Dunng system development phone recognition has been used to evaluate different acoustic model sets." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-73", "text": "It has been shown that improvements in phone accuracy are directly indicative of improvements in word accuracy when the same phone models are used for recognition [12] ." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-74", "text": "Phone recognition provides the added benefit that the recognized phone string can be used to understand word recognition errors and problems in the lexical representation." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-75", "text": "----------------------------------" }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-76", "text": "**LEXICAL REPRESENTATION**" }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-77", "text": "Lexicons containing 5k, 20k, and 64k words have been used in these experiments." 
}, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-78", "text": "The lexicons are represented phonemically, using language-specific sets of phonemes." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-79", "text": "Each lexicon has alternate pronunciations for some of the words, and allows some of the phones to be optional) A pronunciation graph is generated for each word from the baseform transcription to which word internal phonological rules are optionally applied during training and recognition to account for some of the phonological variations observed in fluent speech." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-80", "text": "The WSJ lexicons are represented using a set of 46 phonemes, including 21 vowels, 24 consonants, and silence." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-81", "text": "Training and test lexicons were created at LIMSI and include some input from modified versions of the TIMIT, Pocket and Moby lexicons." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-82", "text": "Missing forms were generated by rule when possible, or added by hand." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-83", "text": "Some pronounciations for proper names were kindly provided by Murray Spiegel at Bellcore from the Orator system." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-84", "text": "The BREF lexicons, corresponding to the 5k and 20k most common words in the Le Monde texts are represented with 35 phonemes including 14 vowels, 20 consonants, and silence [3] ." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-85", "text": "The base pronunciations, obtained using text-to-phoneme rules [20] , were extended to annotate potential liaisons and pronunciation variants." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-86", "text": "Some example lexical entries are given in Figure 1 ." 
}, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-87", "text": "Word boundary phonological rules are applied in building the phone graph used by the recognizer so as to allow for some of the phonological variations observed in fluent speech [11] ." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-88", "text": "The principle behind the phonological rules is to modify the phone network to take into account such vari3About 10% of the lexical entries have multiple transcriptions, if the word final optional phonemes marking possible liaisons for BREF are not included." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-89", "text": "Including these raises the number of entries with multiple transcriptions to almost 40%." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-90", "text": "ations." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-91", "text": "These rules are optionally applied during training and recognition." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-92", "text": "Using optional phonological rules during training results in better acoustic models, as they are less \"polluted\" by wrong transcriptions." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-93", "text": "Their use during recognition reduces the number of mismatches." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-94", "text": "For English, only \u2022 well known phonological rules, such as glide insertion, stop deletion, homorganic stop insertion, palatalization, and voicing assimilation have been incorporated in the system." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-95", "text": "The same mechanism has been used to handle liaisons, mute-e, and final consonant cluster reduction for French." 
}, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-96", "text": "----------------------------------" }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-97", "text": "**SEARCH STRATEGY**" }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-98", "text": "One of the most important problems in implementing a large vocabulary speech recognizer is the design of an efficient search algorithm to deal with the huge search space, especially when using language models with a longer span than two successive words, such as trigrams." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-99", "text": "The most commonly used approach for small and medium vocabulary sizes is the one-pass frame-synchronous beam search [16] which uses a dynamic programming procedure." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-100", "text": "This basic strategy has been recently extended by adding other features such as \"fast match\" [9, 1] , N-best rescoring [21] , and progressive search [15] ." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-101", "text": "The two-pass approach used in our system is based on the idea of progressive search where the information between levels is transmitted via word graphs." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-102", "text": "Prior to word recognition, sex identification is performed for each sentence using phone-based ergodic HMMs [13] ." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-103", "text": "The word recognizer is then run with a bigram LM using the acoustic model set corresponding to the identified sex." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-104", "text": "The first pass uses a bigram-backoff LM with a tree organization of the lexicon for the backoff component." 
}, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-105", "text": "This one-pass frame-synchronous beam search, which includes intra-and inter-word CD phone models, intra-and interword phonological rules, phone duration models, and genderdependent models, generates a list of word hypotheses resulting in a word lattice." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-106", "text": "Two problems need to be considered at this level." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-107", "text": "The first is whether or not the dynamic programming procedure used in the first pass, which guarantees the optimality of the search for the bigram, generates an \"optimal\" lattice to be used with a trigram LM." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-108", "text": "For example, any giwen word in the lattice will have many possible ending points, but only a few starting points." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-109", "text": "This problem was in fact less severe than expected since the time information is not critical to generate an \"optimal\" word graph from the lattice, i.e. the multiple word endings provide enough flexibility to compensate for single word beginnings." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-110", "text": "The second consideration is that the lattice generated in this way cannot be too large or there is no interest in a two pass approach." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-111", "text": "To solve this second problem, two pruning thresholds are used during r, he first pass, a beam search pruning threshold which is kept to a level insuring almost no search errors (from the bigram point of view) and a word lattice pruning threshold used to control the lattice size." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-112", "text": "A description of the exact procedure used to generate the word graph from the word lattice is beyond the scope of this paper." 
}, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-113", "text": "The following steps give the key elements behind the procedure." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-114", "text": "4 First, a word graph is generated from the lattice by merging three consecutive frames (i.e. the minimum duration for a word in our system)." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-115", "text": "Then, \"similar\" graph nodes are merged with the goal of reducing the overall graph size and generalizing the word lattice." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-116", "text": "This step is reiterated until no further reductions are possible." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-117", "text": "Finally, based on the trigram backoff language model a trigram word graph is then generated by duplicating the nodes having multiple language model contexts." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-118", "text": "Bigram backoff nodes are created when possible to limit the graph expansion." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-119", "text": "To fix these ideas, let us consider some numbers for the WSJ 5k-closed vocabulary." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-120", "text": "With the pruning threshold set at a level such that there are only a negligible number of search errors, the first pass generates a word lattice containing on average 10,000 word hypotheses per sentence." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-121", "text": "The generated word graph before trigram expansion contains on average 1400 arcs." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-122", "text": "After expansion with the trigram backoff LM, there are on average 3900 word instanciations including silences which are treated the same way as words." 
}, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-123", "text": "It should be noted that this decoding strategy based on two forward passes can in fact be implemented in a single forward pass using one or two processors." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-124", "text": "We are using a two pass solution because it is conceptually simpler, and also due to memory constraints." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-125", "text": "----------------------------------" }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-126", "text": "**EXPERIMENTAL RESULTS**" }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-127", "text": "----------------------------------" }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-128", "text": "**WSJ:**" }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-129", "text": "The ARPA WSJ corpus [19] was designed to provide general-purpose speech data with large vocabularies." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-130", "text": "Text materials were selected to provide training and test data for 5k and 20k word, closed and open vocabularies, and with both verbalized (VP) and non-verbalized (NVP) punctuation." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-131", "text": "41n our implementation, a word lattice differs from a word graph only because it includes word endpoint information." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-132", "text": "The 20k open test is also referred to as a 64k test since all of the words in these sentences occur in the 63,495 most frequent words in the normalized WSJ text material [ 19] ." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-133", "text": "Two sets of standard training material have been used for these experiments: The standard WSJ0 SI84 training data which include 7240 sentences from 84 speakers, and the standard set of 37,518 WSJ0/WSJ1 SI284 sentences from 284 speakers." 
}, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-134", "text": "Only the primary microphone data were used for training." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-135", "text": "The WSJ corpus provides a wealth of material that can be used for system development." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-136", "text": "We have worked primarily with the WSJ0-Dev (410 sentences, 10 speakers), and the WSJ1-Dev from spokes s5 and s6 (394 sentences, 10 speakers)." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-137", "text": "Development of the word recognizer was done with the 5k closed vocabulary system in order to reduce the computational requirements." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-138", "text": "The Nov92 5k and 20k nvp test sets were used to assess progress during this development phase." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-139", "text": "The WSJ system was evaluated in the Nov92 ARPA evaluation test [17] for the 5k-closed vocabulary and in the Nov93 ARPA evaluation test [18] for the 5k and 64k hubs." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-140", "text": "Except when explicitly stated otherwise, all of the results reported for WSJ use the standard language models [19] ." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-141", "text": "Using a set of 1084 CD models trained with the WSJ0 si84 training data, the word error is 6.6% on the Nov92 5k test data and 9.4% on the Nov93 test data." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-142", "text": "Using the combined WSJ0]WSJ1 si284 training data reduces the error by about 27% for both tests." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-143", "text": "When a trigram LM is used in the second pass, the word error is reduced by an addition 35% on the Nov92 test and by 22% on the Nov93 test." 
}, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-144", "text": "Results are given in the Table 3 for the Nov92 nvp 64K test data using both closed and open 20k vocabularies." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-145", "text": "With si84 training (si84c, a slightly smaller model set than si84) the word error rate is doubled when the vocabulary increases from 5k to 20k words and the test perplexity goes from 111 to 244." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-146", "text": "The higher error rate with the 20k open lexicon can be largely attributed to the out-of-vocabulary (OOV) words, which account for almost 2% of the words in the test sentences." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-147", "text": "Processing the same test data with a system trained on the si284 training data, reduces the word error by 30%." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-148", "text": "The word error on the Nov93 20k test is 15.2% with the si284 system." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-149", "text": "Using the trigrarn LM reduces the error rate by 18% on the Nov92 test and 22% on the Nov93 test." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-150", "text": "The 20k trigram sentence error rates for Nov92 and Nov93 are 60% and 62% respectively." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-151", "text": "Since this is an open vocabulary test, the lower bound for the sentence error is given by the percent of sentences with OOV words, which is 26% for Nov92 and 21% for Nov93." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-152", "text": "In addition there are errors introduced by the use of word graphs generated by the first pass." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-153", "text": "The graph error rate (ie. the correct solution was not in the graph) was 6% and 12% respectively for Nov92 and Nov93." 
}, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-154", "text": "In fact, in most of these cases the errors should not be considered search errors as the recognized string has a higher likelihood than the correct string." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-155", "text": "A final test was run using a 64k lexicon in an attempt to eliminate errors due to unknown words." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-156", "text": "(In principle, all of the read WSJ prompts are found in the 64k most frequent words, however, since the WSJ1 data were recorded with non-normalized prompts, additional OOV words can occur.) Running a full 64k system was not possible with the computing facilities available, so we added a third decoding pass to extend the vocabulary size." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-157", "text": "Starting with the phone string corresponding to the hypothesis of the trigram 20k system, an A* algorithm is used to generate a word graph using phone confusion statistics and the 64k lexicon." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-158", "text": "This word graph is then used by the recognizer with a 64k trigram LM trained on the standard WSJ training texts (37M words)." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-159", "text": "Using this approach only about 30% of the errors due to OOV words on the Nov93 64k test are recovered, reducing the word error to 11.2% from 11.8%." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-160", "text": "BREF: BREF [14] is a large read-speech corpus, containing over 100 hours of speech material, from 120 speakers (55m/65f)." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-161", "text": "The text materials were selected verbatim from the French newspaper Le Monde, so as to provide a large vocabulary (over 20,000 words) and a wide range of phonetic environments [7] ." 
}, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-162", "text": "The material in BREF was selected to maximize the number of different phonemic contexts." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-163", "text": "5 Containing 1115 distinct diphones and over 17,500 triphones, BREF can be used to train vocabulary-independent acoustic models." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-164", "text": "The text material was read without verbalized punctuation using the verbatim prompts." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-165", "text": "6 5This is in contrast to the WSJ texts which were selected so as to contain only words in the most frequent 64,000 words in the original text material." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-166", "text": "6Another difference between BREF and WSJ0 is that the prompts for" }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-167", "text": "----------------------------------" }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-168", "text": "**DISCUSSION AND SUMMARY**" }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-169", "text": "The recognizer has been evaluated on 5k and 20k test data for the English and French languages using similar style corpora." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-170", "text": "It should be pointed out however, that although the Nov92 5k WSJ test data and the BREF 5k test data were closed-vocabulary, the conditions are not quite the same." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-171", "text": "For WSJ, paragraphs were selected ensuring not more than one word was out of the 5.6k most frequent words [ 19] , and these additional words were then included as part of the vocabulary." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-172", "text": "For BREF, a lexicon was first constructed containing the 5k/20k most frequent words, and sentences covered by this vocabulary were selected from the development test material." 
}, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-173", "text": "The situation was slightly different for the Nov93 5k test in that the prompt texts were not normalized, and therefore several OOV words (0.3%) occurred in the test data despite it being a closed-vocabulary test." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-174", "text": "However, looking at the recognition results for individual speakers, it appears that interspeaker differences are much more important than differences in perplexity, and perhaps more than language differences." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-175", "text": "Just considering the relationship between speaking rate and word accurracy, in general, speakers that are faster or slower than the average have a higher word error." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-176", "text": "It has been observed that the better/worse speakers are the same on both the 5k and 20k tests." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-177", "text": "We have observed some language dependencies, such as the higher number of homophones in BREF, which has the effect of reducing the efficiency of the search and the large number of frequent monophone words which results in larger networks." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-178", "text": "At the same time, the phone accuracy for BREF is better than that for WSJ, which speeds up the search." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-180", "text": "For both WSJ and BREF increasing the amount of training utterances by an order of magnitude reduces the word error by about 30%." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-181", "text": "By using larger training text materials it is possible to train a trigram LM which was incorporated in a second acoustic pass." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-182", "text": "The trigram pass gives an error rate reduction of 20% to 30%." 
}, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-183", "text": "The combined error reduction is on the order of 50%." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-184", "text": "It remains a general problem of how to define comparable test conditions for different languages." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-185", "text": "This may even depend on the definition of a word in a given language, which is linked to the lexical coverage." }, { "sent_id": "84ae490de92cb9d064993be751b3e0-C001-186", "text": "A primary aim of the aforementioned LRE Sqale project is to address this issue." } ], "y": { "@BACK@": { "gold_contexts": [ [ "84ae490de92cb9d064993be751b3e0-C001-22", "84ae490de92cb9d064993be751b3e0-C001-24", "84ae490de92cb9d064993be751b3e0-C001-25", "84ae490de92cb9d064993be751b3e0-C001-27", "84ae490de92cb9d064993be751b3e0-C001-28" ], [ "84ae490de92cb9d064993be751b3e0-C001-129" ] ], "cite_sentences": [ "84ae490de92cb9d064993be751b3e0-C001-25", "84ae490de92cb9d064993be751b3e0-C001-129" ] }, "@USE@": { "gold_contexts": [ [ "84ae490de92cb9d064993be751b3e0-C001-38" ], [ "84ae490de92cb9d064993be751b3e0-C001-41" ], [ "84ae490de92cb9d064993be751b3e0-C001-132" ], [ "84ae490de92cb9d064993be751b3e0-C001-140" ], [ "84ae490de92cb9d064993be751b3e0-C001-169", "84ae490de92cb9d064993be751b3e0-C001-171" ] ], "cite_sentences": [ "84ae490de92cb9d064993be751b3e0-C001-38", "84ae490de92cb9d064993be751b3e0-C001-41", "84ae490de92cb9d064993be751b3e0-C001-132", "84ae490de92cb9d064993be751b3e0-C001-140", "84ae490de92cb9d064993be751b3e0-C001-171" ] }, "@DIF@": { "gold_contexts": [ [ "84ae490de92cb9d064993be751b3e0-C001-41" ] ], "cite_sentences": [ "84ae490de92cb9d064993be751b3e0-C001-41" ] } } }, "ABC_1deb67be8226867fe6b9514cdecdec_29": { "x": [ { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-2", "text": "In this paper, we extend an existing statistical parsing model to produce richer 
output parse trees, annotated with PropBank semantic role labels." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-3", "text": "Our results show that the model can be robustly extended to produce more complex output parse trees without any loss in performance and suggest that joint inference of syntactic and semantic representations is a viable alternative to approaches based on a pipeline of local processing steps." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-4", "text": "----------------------------------" }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-5", "text": "**INTRODUCTION**" }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-6", "text": "Recent successes in statistical syntactic parsing based on supervised learning techniques trained on a large corpus of syntactic trees (Collins, 1999; Charniak, 2000; Henderson, 2003) have brought forth the hope that the same approaches could be applied to the more ambitious goal of recovering the propositional content and the frame semantics of a sentence." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-7", "text": "Moving towards a shallow semantic level of representation is a first step towards the distant goal of natural language understanding and has immediate applications in question-answering and information extraction." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-8", "text": "For example, an automatic flight reservation system processing the sentence I want to book a flight from Geneva to Trento will need to know that from Geneva denotes the origin of the flight and to Trento denotes its destination. Knowing that these two phrases are prepositional phrases (the information provided by a syntactic parser) is only moderately useful."
}, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-9", "text": "The growing interest in learning deeper information is to a large extent supported and due to the recent development of semantically annotated databases such as FrameNet (Baker et al., 1998) or the Proposition Bank , that can be used as training resources for a number of supervised learning paradigms." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-10", "text": "We focus here on the Proposition Bank (PropBank)." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-11", "text": "PropBank encodes propositional information by adding a layer of argument structure annotation to the syntactic structures of the Penn Treebank (Marcus et al., 1993) ." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-12", "text": "Verbal predicates in the Penn Treebank (PTB) receive a label REL and their arguments are annotated with abstract semantic role labels A0-A5 or AA for those complements of the predicative verb that are considered arguments while those complements of the verb labelled with a semantic functional label in the original PTB receive the composite semantic role label AM-X, where X stands for labels such as LOC, TMP or ADV, for locative, temporal and adverbial modifiers respectively." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-13", "text": "A tree structure with PropBank labels for a sentence from the PTB (section 00) is shown in Figure 1 below." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-14", "text": "PropBank uses two levels of granularity in its annotation, at least conceptually." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-15", "text": "Arguments receiving labels A0-A5 or AA do not express consistent semantic roles and are specific to a verb, while arguments receiving an AM-X label are supposed to be adjuncts and the respective roles they express are consistent across all verbs." 
}, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-16", "text": "1 Recent approaches to learning semantic role labels are based on two-stage architectures." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-17", "text": "The first stage selects the elements to be labelled, while the second determines the labels to be assigned to the selected elements." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-18", "text": "While some of these models are based on full parse trees (Gildea and Jurafsky, 2002; Gildea and Palmer, 2002) , other methods have been proposed that eschew the need for a full parse (CoNNL, 2004; CoNLL, 2005) ." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-19", "text": "Because of the way the problem has been formulated -as a pipeline of parsing (or chunking) feeding into labelling -specific investigations of integrated approaches that solve both the parsing and the semantic role labelling problems at the same time have not been studied." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-20", "text": "We present work to test the hypothesis that a current statistical parser (Henderson, 2003) can output richer information robustly, that is without any significant degradation of the parser's accuracy on the original parsing task, by explicitly modelling semantic role labels as the interface between syntax and semantics." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-21", "text": "We achieve promising results both on the simple parsing task, where the accuracy of the parser is measured on the standard Parseval measures, and also on the parsing task where the more complex labels of PropBank are taken into account." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-22", "text": "We will call the former task Penn Treebank parsing (PTB parsing) and the latter task PropBank parsing below." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-23", "text": "These results have several consequences." 
}, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-24", "text": "On the one hand, we show that it is possible to build a single integrated robust system successfully." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-25", "text": "This is a meaningful achievement, as a task combining semantic role labelling and parsing is more complex than simple syntactic parsing." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-26", "text": "While the shallow semantics of a constituent and its structural position are often correlated, they sometimes diverge." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-27", "text": "For example, some nominal temporal modifiers occupy an object position without being objects, like Tuesday in Figure 1 below." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-28", "text": "On the other hand, our results indicate that the proposed models are robust." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-29", "text": "To model our task accurately, additional parameters must be estimated." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-30", "text": "However, given the current limited availability of annotated treebanks, this more complex task will have to be solved with the same overall amount of data, aggravating the difficulty of estimating the model's parameters due to sparse data." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-31", "text": "The limited availability of data is increased further by the high variability of the argumental labels A0-A5 whose semantics is specific to a given verb or a given verb sense." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-32", "text": "Solving this more complex problem successfully, then, indicates that the models used are robust." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-33", "text": "Finally, we achieve robustness without simplifying the parsing architecture." 
}, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-34", "text": "Specifically, robustness is achieved without resorting to the stipulation of strong independence assumptions to compensate for the limited availability and high variability of data." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-35", "text": "Consequently, such an achievement demonstrates not only that the robustness of the parsing model, but also its scalability and portability." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-36", "text": "----------------------------------" }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-37", "text": "**THE BASIC PARSING ARCHITECTURE**" }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-38", "text": "To achieve the complex task of assigning semantic role labels while parsing, we use a family of statistical parsers, the Simple Synchrony Network (SSN) parsers (Henderson, 2003) , which do not make any explicit independence assumptions, and are therefore likely to adapt without much modification to the current problem." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-39", "text": "This architecture has shown state-of-the-art performance." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-40", "text": "SSN parsers comprise two components, one which estimates the parameters of a stochastic model for syntactic trees, and one which searches for the most probable syntactic tree given the parameter estimates." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-41", "text": "As with many other statistical parsers (Collins, 1999; Charniak, 2000) , SSN parsers use a history-based model of parsing." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-42", "text": "Events in such a model are derivation moves." 
}, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-43", "text": "The set of well-formed sequences of derivation moves in this parser is defined by a Predictive LR pushdown automaton (Nederhof, 1994) , which implements a form of left-corner parsing strategy." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-44", "text": "The derivation moves include: projecting a constituent with a specified label, attaching one constituent to another, and shifting a tag-word pair onto the pushdown stack." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-45", "text": "Unlike standard history-based models, SSN parsers do not state any explicit independence assumptions between derivation steps." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-46", "text": "They use a neural network architecture, called Simple Synchrony Network (Henderson and Lane, 1998) , to induce a finite history representation of an unbounded sequence of moves." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-47", "text": "The history representation of a parse history d 1 , . . . , d i\u22121 , which we denote h(d 1 , . . . , d i\u22121 ), is assigned to the constituent that is on the top of the stack before the ith move." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-48", "text": "The representation h(d 1 , . . . , d i\u22121 ) is computed from a set f of features of the derivation move d i\u22121 and from a finite set D of recent history representations h(d 1 , . . . , d j ), where j < i \u2212 1." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-49", "text": "Because the history representation computed for the move i \u2212 1 is included in the inputs to the computation of the representation for the next move i, virtually any information about the derivation history could flow from history representation to history representation and be used to estimate the probability of a derivation move." 
}, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-50", "text": "However, the recency preference exhibited by recursively defined neural networks biases learning towards information which flows through fewer history representations." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-51", "text": "(Henderson, 2003) exploits this bias by directly inputting information which is considered relevant at a given step to the history representation of the constituent on the top of the stack before that step." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-52", "text": "In addition to history representations, the inputs to h(d 1 , . . . , d i\u22121 ) include handcrafted features of the derivation history that are meant to be relevant to the move to be chosen at step i. For each of the experiments reported here, the set D that is input to the computation of the history representation of the derivation moves d 1 , . . . , d i\u22121 includes the most recent history representation of the following nodes: top i , the node on top of the pushdown stack before the ith move; the left-corner ancestor of top i (that is, the second top-most node on the parser's stack); the leftmost child of top i ; and the most recent child of top i , if any." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-53", "text": "The set of features f includes the last move in the derivation, the label or tag of top i , the tag-word pair of the most recently shifted word, and the leftmost tag-word pair that top i dominates." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-54", "text": "Given the hidden history representation h(d 1 , \u00b7 \u00b7 \u00b7 , d i\u22121 ) of a derivation, a normalized exponential output function is computed by SSNs to estimate a probability distribution over the possible next derivation moves d i ." 
}, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-55", "text": "2 The second component of SSN parsers, which searches for the best derivation given the parameter estimates, implements a severe pruning strategy." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-56", "text": "Such pruning handles the high computational cost of computing probability estimates with SSNs, and renders the search tractable." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-57", "text": "The space of possible derivations is pruned in two different ways." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-58", "text": "The first pruning occurs immediately after a tag-word pair has been pushed onto the stack: only a fixed beam of the 100 best derivations ending in that tag-word pair are expanded." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-59", "text": "For training, the width of such beam is set to five." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-60", "text": "A second reduction of the search space prunes the space of possible project or attach derivation moves: a best-first search strategy is applied to the five best alternative decisions only." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-61", "text": "The next section describes our model, extended to produce richer output parse trees annotated with semantic role labels." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-62", "text": "----------------------------------" }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-63", "text": "**LEARNING SEMANTIC ROLE LABELS**" }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-64", "text": "Previous work on learning function labels during parsing assumed that function labels represent the interface between lexical semantics and syntax." 
}, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-65", "text": "We extend this hypothesis to the semantic role labels assigned in PropBank, as they are an exhaustive extension of function labels, which have been reorganised in a coherent inventory of labels and assigned exhaustively to all sentences in the PTB." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-66", "text": "Because PropBank is built on the PTB, it inherits in part its notion of function labels which is directly integrated into the AM-X role labels." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-67", "text": "A0-A5 or AA labels correspond to many of the unlabelled elements in the PTB and also to those elements that PTB annotators had classified as re-2 The on-line version of Backpropagation is used to train SSN parsing models." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-68", "text": "It performs a gradient descent with a maximum likelihood objective function and weight decay regularization (Bishop, 1995) ." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-69", "text": "ceiving a syntactic functional label such as SBJ (subject) or DTV (dative)." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-70", "text": "Because they are projections of the lexical semantics of the elements in the sentence, semantic role labels are projected bottom-up, they tend to appear low in the tree and they are infrequently found on the higher levels of the parse tree, where projections of grammatical, as opposed to lexical, elements usually reside." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-71", "text": "Because they are the interface level with syntax, semantic labels are also subject to distributional constraints that govern syntactic dependencies, such as argument structure or subcategorization." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-72", "text": "We attempt to capture such constraints by modelling the c-command relation." 
}, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-73", "text": "Recall that the c-command relation relates two nodes in a tree, even if they are not close to each other, provided that the first node dominating one node also dominate the other." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-74", "text": "This notion of c-command captures both linear and hierarchical constraints and defines the domain in which semantic role labelling applies." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-75", "text": "While PTB function labels appear to overlap to a large extent with PropBank semantic rolel labels, work by (Ye and Baldwin, 2005) on semantic labelling prepositional phrases, however, indicates that the function labels in the Penn Treebank are assigned more sporadically and heterogeneously than in PropBank." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-76", "text": "Apparently only the \"easy\" cases have been tagged functionally, because assigning these function tags was not the main goal of the annotation." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-77", "text": "PropBank instead was annotated exhaustively, taking all cases into account, annotating multiple roles, coreferences and discontinuous constituents." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-78", "text": "It is therefore not void of interest to test our hypothesis that, like function labels, semantic role labels are the interface between syntax and semantics, and they need to be recovered by applying constraints that model both higher level nodes and lower level ones." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-79", "text": "We assume that semantic roles are very often projected by the lexical semantics of the words in the sentence." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-80", "text": "We introduce this bottom-up lexical information by fine-grained modelling of semantic role labels." 
}, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-81", "text": "Extending a technique presented in (Klein and Manning, 2003 ) and adopted in for function labels, we split some part-of-speech tags into tags marked with semantic role labels." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-82", "text": "The semantic role labels attached to a non-terminal directly projected by a preterminal and belonging to a few selected categories (DIR, EXT, LOC, MNR, PNC, CAUS and TMP) were propagated down to the pre-terminal part-of-speech tag of its head." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-83", "text": "To affect only labels that are projections of lexical semantics properties, the propagation takes into account the distance of the projection from the lexical head to the label, and distances greater than two are not included." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-84", "text": "Figure 2 illustrates the result of this operation." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-85", "text": "In our augmented model, inputs to each history representation are selected according to a linguistically motivated notion of structural locality over which dependencies such as argument structure or subcategorization could be specified." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-86", "text": "In SSN parsing models, the set D of nodes that are structurally local to a given node on top of the stack defines the structural distance between this given node and other nodes in the tree." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-87", "text": "Such a notion of distance determines the number of history representations through which information passes to flow from the representation of a node i to the representation of a node j. By adding nodes to the set D, one can shorten the structural distance between two nodes and enlarge the locality domain over which dependencies can be specified." 
}, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-88", "text": "To capture a locality domain appropriate for semantic role parsing, we add the most recent child of top i labelled with a semantic role label to the set D. These additions yield a model that is sensitive to regularities in structurally defined sequences of nodes bearing semantic role labels, within and across constituents." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-89", "text": "This modification of the biases is illustrated in Figure 3 ." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-90", "text": "This figure displays two constituents, S and VP, with some of their respective child nodes." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-91", "text": "The VP node is assumed to be on the top of the parser's stack, and the S one is supposed to be its left-corner ancestor." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-92", "text": "The directed arcs represent the information that flows from one node to another." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-93", "text": "According to the original SSN model in (Henderson, 2003) , only the information carried over by the leftmost child and the most recent child of a constituent directly flows to that constituent." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-94", "text": "In the figure above, only the information conveyed by the nodes \u03b1 and \u03b4 is directly input to the node S. Similarly, the only bottom-up information directly input to the VP node is conveyed by the child nodes \u01eb and \u03b8." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-95", "text": "In the original SSN models, nodes bearing a function label such as \u03c6 1 and \u03c6 2 are not directly input to their respective parents." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-96", "text": "In our extended model, information conveyed by \u03c6 1 and \u03c6 2 directly flows to their respective parents." 
}, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-97", "text": "So the distance between the nodes \u03c6 1 and \u03c6 2 , which stand in a c-command relation, is shortened." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-98", "text": "For more information on this technique to capture domains induced by the c-command relation, see ." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-99", "text": "We report the effects of these augmentations on parsing results in the experiments described below." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-100", "text": "----------------------------------" }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-101", "text": "**EXPERIMENTS**" }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-102", "text": "Our extended semantic role SSN parser was trained on sections 2-21 and validated on section 24 from the PropBank." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-103", "text": "Training, validating and testing data sets consist of the PTB data annotated with PropBank semantic role labels, as provided in the CoNLL-2005 shared task (Carreras and Marquez, 2005) ." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-104", "text": "Our augmented model has a total of 613 nonterminals to represent both the PTB and PropBank labels of constituents, instead of the 33 of the original SSN parser." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-105", "text": "The 580 newly introduced labels consist of a standard PTB label followed by a set of one or more PropBank semantic roles, such as PP-AM-TMP or NP-A0-A1." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-106", "text": "As a result of lowering the six AM-X semantic role labels, 240 new part-of-speech tags were introduced to partition the original tag set, which consisted of 45 tags." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-107", "text": "SSN parsers do not tag their input sentences." 
}, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-108", "text": "To provide the augmented model with tagged input sentences, we trained an SVM tagger whose features and parameters are described in detail in (Gimenez and Marquez, 2004) ." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-109", "text": "Trained on sections 2-21, the tagger reaches a performance of 95.45% on the test set (section 23) using our new tag set." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-110", "text": "As already mentioned, argumental labels A0-A5 are specific to a given verb or a given verb sense; thus their distribution is highly variable." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-111", "text": "To reduce variability, we add some of the tag-verb pairs licensing these argumental labels to the vocabulary of our model." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-112", "text": "Table 1: Percentage F-measure (F), recall (R), and precision (P) of our SSN parser on two different tasks and the original SSN parser. PropBank training and PropBank parsing task: F 82.3, R 82.1, P 82.4. PropBank training and PTB parsing task: F 88.8, R 88.6, P 88.9. PTB training and PTB parsing task (Henderson, 2003): F 88.6, R 88.3, P 88.9." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-113", "text": "We reach a total of 4970 tag-word pairs (see footnote 3)." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-114", "text": "This vocabulary comprises the original 512 pairs of the original SSN model, and our added pairs, which must occur at least 10 times in the training data." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-115", "text": "Our vocabulary as well as the new 240 POS tags and the new 580 non-terminal labels are included in the set f of features input to the history representations as described in section 2." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-116", "text": "We perform two different evaluations on our model trained on PropBank data." 
}, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-117", "text": "Recall that we distinguish between two parsing tasks: the PropBank parsing task and the PTB parsing task." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-118", "text": "To evaluate the first parsing task, we compute the standard Parseval measures of labelled recall and precision of constituents, taking into account not only the 33 original labels but also the 580 newly introduced PropBank labels." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-119", "text": "This evaluation gives us an indication of how accurately and exhaustively we can recover this richer set of nonterminal labels." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-120", "text": "The results, computed on the testing data set from the PropBank, are shown on the first line of Table 1 ." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-121", "text": "To evaluate the PTB task, we compute the labelled recall and precision of constituents, ignoring the set of PropBank semantic role labels that our model assigns to constituents." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-122", "text": "This evaluation indicates how well we perform on the standard PTB parsing task alone, and its results on the testing data set from the PTB are shown on the second line of Table 1 ." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-123", "text": "The third line of Table 1 gives the performance on the simpler PTB parsing task of the original SSN parser (Henderson, 2003), which was trained on the PTB data sets, in contrast to our SSN model, which was trained on the PropBank data sets." 
}, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-124", "text": "----------------------------------" }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-125", "text": "**DISCUSSION**" }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-126", "text": "These results clearly indicate that our model can perform the PTB parsing task at levels of performance comparable to state-of-the-art statistical parsing, by extensions that take into account the nature of the richer labels to be recovered." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-127", "text": "(Footnote 3: Such pairs consist of a tag and a word token. No attempt at collecting word types was made.)" }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-129", "text": "They also suggest that the relationship between syntactic PTB parsing and semantic PropBank parsing is strict enough that an integrated approach to the problem of semantic role labelling is beneficial." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-130", "text": "In particular, recent models of semantic role labelling separate input indicators of the correlation between the structural position in the tree and the semantic label, such as path, from those indicators that encode constraints on the sequence, such as the previously assigned role (Kwon et al., 2004) ." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-131", "text": "In this way, they can never encode directly the constraining power of a certain role in a given structural position onto a following node in its structural position." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-132", "text": "In our augmented model, we attempt to capture these constraints by directly modelling syntactic domains defined by the notion of c-command." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-133", "text": "Our results also confirm the findings in ." 
}, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-134", "text": "They take a critical look at some commonly used features in the semantic role labelling task, such as the path feature." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-135", "text": "They suggest that the path feature is not very effective because it is sparse." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-136", "text": "Its sparseness is due to the occurrence of intermediate nodes that are not relevant for the syntactic relations between an argument and its predicate." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-137", "text": "Our model of domains is less noisy, and consequently more robust, because it can focus only on c-commanding nodes bearing semantic role labels, thus abstracting away from those nodes that smear the pertinent relations." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-138", "text": "(Yi and Palmer, 2005) share the motivation of our work." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-139", "text": "Like the current work, they observe that the distributions of semantic labels could potentially interact with the distributions of syntactic labels and redefine the boundaries of constituents, thus yielding trees that reflect generalisations over both these sources of information." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-140", "text": "To our knowledge, no results have yet been published on parsing the PropBank." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-141", "text": "Accordingly, it is not possible to draw a straightforward quantitative comparison. Table 2: F-measure (F), recall (R), and precision (P): (Haghighi et al., 2005) 83.4, 83.1, 83.7; (Pradhan et al., 2005) 83.3, 83.0, 83.5; (Punyakanok et al., 2005) 83.1, 82.8, 83.3; (Surdeanu and Turmo, 2005) 83.1, 82.8, 83.3. These systems rely on parse trees produced by external parsers (Collins, 1999; Charniak, 2000), both for training and testing, and return partial trees annotated with semantic role labels." 
}, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-142", "text": "An indirect way of comparing our parser with semantic role labellers suggests itself." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-143", "text": "We merge the partial trees output by a semantic role labeller with the output of a parser it was trained on, and compute PropBank parsing performance measures on the resulting parse trees." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-144", "text": "The first five lines of Table 2 report such measures for the five best semantic role labelling systems (Haghighi et al., 2005; Pradhan et al., 2005; Punyakanok et al., 2005; Surdeanu and Turmo, 2005) according to (CoNLL, 2005) ." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-145", "text": "The partial trees output by these systems were merged with the parse trees returned by (Charniak, 2000)'s parser." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-146", "text": "These systems use (Charniak, 2000)'s parse trees both for training and testing as well as various other information sources, including sets of n-best parse trees (Punyakanok et al., 2005; Haghighi et al., 2005) or chunks (Pradhan et al., 2005) and named entities (Surdeanu and Turmo, 2005) ." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-147", "text": "While our preliminary results indicated in the last line of Table 2 are not state-of-the-art, they do demonstrate the viability of SSN parsers for joint inference of syntactic and semantic representations." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-148", "text": "----------------------------------" }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-149", "text": "**CONCLUSIONS**" }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-150", "text": "In this paper, we have explored extensions to an existing state-of-the-art parsing model." 
}, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-151", "text": "We have achieved promising results on parsing the Proposition Bank, showing that our extensions are sufficiently robust to produce parse trees annotated with shallow semantic information." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-152", "text": "Future work will lie in extracting semantic role relations from such richly annotated trees, for applications such as information extraction or question answering." }, { "sent_id": "1deb67be8226867fe6b9514cdecdec-C001-153", "text": "In addition, further research will explore the relevance of semantic role features to parse reranking." } ], "y": { "@BACK@": { "gold_contexts": [ [ "1deb67be8226867fe6b9514cdecdec-C001-6" ], [ "1deb67be8226867fe6b9514cdecdec-C001-51" ], [ "1deb67be8226867fe6b9514cdecdec-C001-93" ] ], "cite_sentences": [ "1deb67be8226867fe6b9514cdecdec-C001-6", "1deb67be8226867fe6b9514cdecdec-C001-51", "1deb67be8226867fe6b9514cdecdec-C001-93" ] }, "@USE@": { "gold_contexts": [ [ "1deb67be8226867fe6b9514cdecdec-C001-20" ], [ "1deb67be8226867fe6b9514cdecdec-C001-38" ] ], "cite_sentences": [ "1deb67be8226867fe6b9514cdecdec-C001-20", "1deb67be8226867fe6b9514cdecdec-C001-38" ] }, "@EXT@": { "gold_contexts": [ [ "1deb67be8226867fe6b9514cdecdec-C001-20" ] ], "cite_sentences": [ "1deb67be8226867fe6b9514cdecdec-C001-20" ] }, "@MOT@": { "gold_contexts": [ [ "1deb67be8226867fe6b9514cdecdec-C001-38" ], [ "1deb67be8226867fe6b9514cdecdec-C001-50", "1deb67be8226867fe6b9514cdecdec-C001-51" ] ], "cite_sentences": [ "1deb67be8226867fe6b9514cdecdec-C001-38", "1deb67be8226867fe6b9514cdecdec-C001-51" ] }, "@SIM@": { "gold_contexts": [ [ "1deb67be8226867fe6b9514cdecdec-C001-123", "1deb67be8226867fe6b9514cdecdec-C001-126" ] ], "cite_sentences": [ "1deb67be8226867fe6b9514cdecdec-C001-123" ] } } }, "ABC_b335178d833e26190b7056469d3fa7_30": { "x": [ { "sent_id": "b335178d833e26190b7056469d3fa7-C001-1", "text": "**ABSTRACT**" }, { "sent_id": 
"b335178d833e26190b7056469d3fa7-C001-2", "text": "Multilingual neural machine translation (NMT) enables training a single model that supports translation from multiple source languages into multiple target languages." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-3", "text": "In this paper, we push the limits of multilingual NMT in terms of number of languages being used." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-4", "text": "We perform extensive experiments in training massively multilingual NMT models, translating up to 102 languages to and from English within a single model." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-5", "text": "We explore different setups for training such models and analyze the trade-offs between translation quality and various modeling decisions." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-6", "text": "We report results on the publicly available TED talks multilingual corpus where we show that massively multilingual many-to-many models are effective in low resource settings, outperforming the previous state-of-the-art while supporting up to 59 languages." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-7", "text": "Our experiments on a large-scale dataset with 102 languages to and from English and up to one million examples per direction also show promising results, surpassing strong bilingual baselines and encouraging future work on massively multilingual NMT." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-8", "text": "----------------------------------" }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-10", "text": "Neural machine translation (NMT) (Kalchbrenner and Blunsom, 2013; Bahdanau et al., 2014; Sutskever et al., 2014) is the current state-of-the-art approach for machine translation in both academia (Bojar et al., 2016, 2017, 2018) and industry (Wu et al., 2016; ." 
}, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-11", "text": "Recent works (Dong et al., 2015; Firat et al., 2016a; Ha et al., 2016; Johnson et al., 2017) extended the approach to support multilingual translation, i.e. training a single model that is capable of translating between multiple language pairs." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-12", "text": "Multilingual models are appealing for several reasons." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-13", "text": "First, they are more efficient in terms * Work carried out during an internship at Google AI." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-14", "text": "of the number of required models and model parameters, enabling simpler deployment." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-15", "text": "Another benefit is transfer learning; when low-resource language pairs are trained together with highresource ones, the translation quality may improve (Zoph et al., 2016; Nguyen and Chiang, 2017 )." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-16", "text": "An extreme case of such transfer learning is zero-shot translation (Johnson et al., 2017) , where multilingual models are able to translate between language pairs that were never seen during training." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-17", "text": "While very promising, it is still unclear how far one can scale multilingual NMT in terms of the number of languages involved." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-18", "text": "Previous works on multilingual NMT typically trained models with up to 7 languages (Dong et al., 2015; Firat et al., 2016b; Ha et al., 2016; Johnson et al., 2017; Gu et al., 2018) and up to 20 trained directions (Cettolo et al., 2017) simultaneously." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-19", "text": "One recent exception is Neubig and Hu (2018) which trained many-to-one models from 58 languages into English." 
}, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-20", "text": "While utilizing significantly more languages than previous works, their experiments were restricted to many-to-one models in a lowresource setting with up to 214k examples per language-pair and were evaluated only on four translation directions." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-21", "text": "In this work, we take a step towards practical \"universal\" NMT -training massively multilingual models which support up to 102 languages and with up to one million examples per languagepair simultaneously." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-22", "text": "Specifically, we focus on training \"English-centric\" many-to-many models, in which the training data is composed of many language pairs that contain English either on the source side or the target side." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-23", "text": "This is a practical setting since English parallel data is widely available for many language pairs." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-45", "text": "These languages present an extreme low-resource case, with as few as 4.5k training examples for Belarusian-English." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-24", "text": "We restrict our experiments to Transformer models (Vaswani et al., 2017) as they were shown to be very effective in recent benchmarks (Ott et al., 2018) , also in the context of multilingual models (Lakew et al., 2018; ." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-25", "text": "We evaluate the performance of such massively multilingual models while varying factors like model capacity, the number of trained directions (tasks) and low-resource vs. high-resource settings." 
}, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-26", "text": "Our experiments on the publicly available TED talks dataset (Qi et al., 2018) show that massively multilingual many-to-many models with up to 58 languages to-and-from English are very effective in low resource settings, allowing us to use high-capacity models while avoiding overfitting and achieving superior results to the current state-of-the-art on this dataset (Neubig and Hu, 2018; Wang et al., 2019) when translating into English." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-27", "text": "We then turn to experiment with models trained on 103 languages in a high-resource setting." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-28", "text": "For this purpose we compile an English-centric in-house dataset, including 102 languages aligned to-and-from English with up to one million examples per language pair." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-29", "text": "We then train a single model on the resulting 204 translation directions and find that such models outperform strong bilingual baselines by more than 2 BLEU averaged across 10 diverse language pairs, both to-and-from English." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-30", "text": "Finally, we analyze the trade-offs between the number of involved languages and translation accuracy in such settings, showing that massively multilingual models generalize better to zero-shot scenarios." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-31", "text": "We hope these results will encourage future research on massively multilingual NMT." 
}, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-32", "text": "2 Low-Resource Setting: 59 Languages" }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-33", "text": "----------------------------------" }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-34", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-35", "text": "The main question we wish to answer in this work is how well a single NMT model can scale to support a very large number of language pairs." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-36", "text": "The answer is not trivial: on the one hand, training multiple language pairs together may result in transfer learning (Zoph et al., 2016; Nguyen and Chiang, 2017) ." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-37", "text": "This may improve performance as we increase the number of language pairs, since more information can be shared between the different translation tasks, allowing the model to learn which information to share." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-38", "text": "On the other hand, adding many language pairs may result in a bottleneck; the model has a limited capacity while it needs to handle this large number of translation tasks, and sharing all parameters between the different languages can be sub-optimal (Wang et al., 2018) especially if they are not from the same typological language family (Sachan and ." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-39", "text": "We begin tackling this question by experimenting with the TED Talks parallel corpus compiled by Qi et al. (2018) 1 , which is unique in that it includes parallel data from 59 languages." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-40", "text": "For comparison, this is significantly \"more multilingual\" than the data available from all previous WMT news translation shared task evaluations throughout the years - the latest being Bojar et al. (2016, 2017, 2018), which included 14 languages so far."
}, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-41", "text": "2 We focus on the setting where we train \"English-centric\" models, i.e. training on all language pairs that contain English in either the source or the target, resulting in 116 translation directions." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-42", "text": "This dataset is also highly imbalanced, with language pairs including between 3.3k and 214k sentence pairs for training." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-43", "text": "Since the dataset is already tokenized we did not apply additional preprocessing other than applying joint subword segmentation (Sennrich et al., 2016) with 32k symbols." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-44", "text": "Regarding the languages we evaluate on, we begin with the same four languages as Neubig and Hu (2018) - Azerbaijani (Az), Belarusian (Be), Galician (Gl) and Slovak (Sk)." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-46", "text": "In order to better understand the effect of training set size in these settings, we evaluate on four additional languages that have more than 167k training examples each - Arabic (Ar), German (De), Hebrew (He) and Italian (It)." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-47", "text": "----------------------------------" }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-48", "text": "**MODEL DETAILS**" }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-49", "text": "Using the same data, we trained three massively multilingual models: a many-to-many model which we train using all 116 translation directions with 58 languages to-and-from English, a one-to-many model from English into 58 languages, and a many-to-one model from 58 languages into English." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-50", "text": "We follow the method of Ha et al. (2016) ; Johnson et al. 
(2017) and add a target-language prefix token to each source sentence to enable many-to-many translation." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-51", "text": "These different setups enable us to examine the effect of the number of translation tasks on the translation quality as measured in BLEU (Papineni et al., 2002) ." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-52", "text": "We also compare our massively multilingual models to bilingual baselines and to two recently published results on this dataset (Neubig and Hu (2018) ; Wang et al. (2019) )." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-53", "text": "Regarding the models, we focused on the Transformer in the \"Base\" configuration." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-54", "text": "We refer the reader to Vaswani et al. (2017) for more details on the model architecture." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-55", "text": "Specifically, we use 6 layers in both the encoder and the decoder, with model dimension set at 512, hidden dimension size of 2048 and 8 attention heads." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-56", "text": "We also applied dropout at a rate of 0.2 in the following components: on the sum of the input embeddings and the positional embeddings, on the output of each sub-layer before it is added to the previous layer input (residual connection), on the inner layer output after the ReLU activation in each feed-forward sub-layer, and to the attention weight in each attention sub-layer." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-57", "text": "This results in a model with approximately 93M trainable parameters." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-58", "text": "For all models we used the inverse square root learning rate schedule from Vaswani et al. (2017) with the learning rate set at 3 and 40k warmup steps." 
}, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-59", "text": "All models are implemented in Tensorflow-Lingvo (Shen et al., 2019) ." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-60", "text": "In all cases we report test results for the checkpoint that performed best on the development set in terms of BLEU." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-61", "text": "For the multilingual models we create a development set that includes examples we uniformly sample from a concatenation of all the individual language pair development sets, resulting in 13k development examples per model." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-62", "text": "Another important detail regarding multilingual training is the batching scheme." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-63", "text": "In all of our multilingual models we use heterogeneous batching, where each batch contains examples which are uniformly sampled from a concatenation of all the language pairs the model is trained on." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-64", "text": "Specifically, we use batches of 64 examples for sequences shorter than 69 tokens and batches of 16 examples for longer sequences." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-65", "text": "We did not use oversampling as the dataset is relatively small." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-66", "text": "----------------------------------" }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-67", "text": "**RESULTS**" }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-68", "text": "We use tokenized BLEU in order to be comparable with Neubig and Hu (2018) ." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-69", "text": "Table 1 shows the results of our experiments when evaluating on the same language pairs as they did." 
}, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-70", "text": "The results under \"Neubig & Hu 18\" are their bilingual baselines and their best many-to-one models." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-71", "text": "Their many-to-one models use similar-language regularization, i.e. fine-tuning a pre-trained many-to-one model with data from the language pair of interest together with data from a language pair that has a typologically-similar source language and more data (i.e. Russian and Belarusian, Turkish and Azerbaijani)." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-72", "text": "The results under \"Ours\" are our many-to-one and many-to-many models we trained identically in terms of model architecture and hyper-parameters." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-73", "text": "We first note that our many-to-many model outperforms all other models when translating into English, with 1.82 BLEU improvement (when averaged across the four language pairs) over the best fine-tuned many-to-one models of Neubig and Hu (2018) and 2.44 BLEU improvement over our many-to-one model when averaged across the four low-resource language pairs (Table 1) ." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-74", "text": "This is surprising as it uses the same X\u2192En data, model architecture and capacity as our many-to-one model, while handling a heavier burden since it also supports 58 additional translation tasks (from English into 58 languages)." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-75", "text": "Our models also outperform the more complex models of (Wang et al., 2019) which use a \"soft decoupled encoding\" for the input tokens, while our models use a simple subword segmentation." 
}, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-76", "text": "One possible explanation is that the many-toone model overfits the English side of the corpus as it is multi-way-parallel: in such setting the English sentences are overlapping across the different language pairs, making it much easier for the model to memorize the training set instead of generalizing (when enough capacity is available)." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-77", "text": "On the other hand, the many-to-many model is trained on additional target languages other than English, which can act as regularizers for the X\u2192En tasks, reducing such overfitting." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-78", "text": "To further illustrate this, Figure 1 tracks the BLEU scores on the individual development sets during training for Italian (It), Romanian (Ro), Dutch (Nl), German (De) and Arabic (Ar) into English (left), together with BLEU scores on a subset of the training set for each model." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-79", "text": "We can see that while the many-to-one model degrades in performance on the development set, the manyto-many model still improves." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-80", "text": "Note the large gap in the many-to-one model between the training set BLEU and the development set BLEU, which points on the generalization issue that is not present in the many-to-many setting." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-81", "text": "We also note that our many-to-one model is on average 0.75 BLEU behind the best many-to-one models in Neubig and Hu (2018) ." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-82", "text": "We attribute this to the fact that their models are fine-tuned using similarlanguage-regularization while our model is not." 
}, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-83", "text": "We find an additional difference between the results on the resource-scarce languages (Table 1) and the higher-resource languages (Table 2) ." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-84", "text": "Specifically, the bilingual baselines outperform the many-to-one models only in the higherresource setting." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-85", "text": "This makes sense as in the lowresource setting the baselines have very few training examples to outperform the many-to-one mod- Table 3 : En\u2192X test BLEU on the TED Talks corpus els, while in the higher resource setting they have access to more training data." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-86", "text": "This corroborates the results of Gu et al. (2018) that showed the sensitivity of such models to similar low resource conditions and the improvements gained from using many-to-one models (however with much fewer language pairs)." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-87", "text": "Table 3 shows the results of our massively multilingual models and bilingual baselines when evaluated out-of-English." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-88", "text": "In this case we see an opposite trend: the many-to-many model performs worse than the one-to-many model by 2.53 BLEU on average." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-111", "text": "More details about this dataset are available in Similarly to our previous experiments, we compare the massively multilingual models to bilingual baselines trained on the same data." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-89", "text": "While previous works (Wang et al., 2018; discuss the phenomena of quality degradation in English-to-many settings, this shows that increasing the number of source languages also causes additional degradation in a many-to-many model." 
}, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-90", "text": "This degradation may be due to the English-centric setting: since most of the translation directions the model is trained on are into English, this leaves less capacity for the other target languages (while still performing better than the bilingual baselines on all 8 language pairs)." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-91", "text": "We also note that in this case the results are consistent among the higher and lower resource pairs -the one-to-many model is better than the many-to-many model, which outperforms the bilingual baselines in all cases." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-92", "text": "This is unlike the difference we saw in the X\u2192 En experiments since here we do not have the multi-way-parallel overfitting issue." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-93", "text": "----------------------------------" }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-94", "text": "**DISCUSSION**" }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-95", "text": "From the above experiments we learn that NMT models can scale to 59 languages in a lowresource, imbalanced, English-centric setting, with the following observations: (1) massively multilingual many-to-many models outperform many-to-one and bilingual models with similar capacity and identical training conditions when averaged over 8 language pairs into English." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-96", "text": "We attribute this improvement over the many-to-one models to the multiple target language pairs which may act as regularizers, especially in this lowresource multi-way-parallel setting that is prone to memorization." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-97", "text": "(2) many-to-many models are inferior in performance when going out-of-English in comparison to a one-to-many model." 
}, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-98", "text": "We attribute this to English being over-represented in the English-centric many-to-many setting, where it appears as a target language in 58 out of 116 trained directions, which may harm the performance on the rest of the target languages as the model capacity is limited." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-99", "text": "3 It is important to stress the fact that we compared the different models under identical training conditions and did not perform extensive hyperparameter tuning for each setting separately." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-100", "text": "However, we believe that such tuning may improve performance even further, as the diversity in each training batch is very different between the different settings." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-101", "text": "For example, while the baseline model batches include only one language in the source and one language in the target, the manyto-many model includes 59 languages in each side with a strong bias towards English." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-102", "text": "These differences may require tailored hyper-parameter choices for each settings (i.e. different batch sizes, learning rate schedules, dropout rates etc.) which would be interesting to explore in future work." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-103", "text": "In the following experiments we investigate whether these observations hold using (1) an even larger set of languages, and (2) a much larger, balanced training corpus that is not multi-wayparallel." 
}, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-104", "text": "3 High-Resource Setting: 103 Languages" }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-105", "text": "----------------------------------" }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-106", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-107", "text": "In this setting we scale the number of languages and examples per language pair further when training a single massively multilingual model." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-108", "text": "Since we are not aware of a publicly available resource for this purpose, we construct an in-house dataset." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-109", "text": "This dataset includes 102 language pairs which we \"mirror\" to-and-from English, with up to one million examples per language pair." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-110", "text": "This results in 103 languages in total, and 204 translation directions which we train simultaneously." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-112", "text": "We tokenize the data using an in-house tokenizer and then apply joint subword segmentation to achieve an open-vocabulary." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-113", "text": "In this setting we used a vocabulary of 64k subwords rather than 32k." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-114", "text": "Since the dataset contains 24k unique characters, a 32k symbol vocabulary will consist of mostly characters, thereby increase the average sequence length." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-115", "text": "Regarding the model, for these experiments we use a Transformer model with 6 layers in both the encoder and the decoder, model dimension set to 1024, hidden dimension size of 8192, and 16 attention heads." 
}, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-116", "text": "This results in a model with approximately 473.7M parameters." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-117", "text": "5 Since the model and data are much larger in this case, we used a dropout rate of 0.1 for our multilingual models and tuned it to 0.3 for our baseline models as it improved the translation quality on the development set." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-118", "text": "We evaluate our models on 10 languages from different typological families: Semitic -Arabic (Ar), Hebrew (He), Romance -Galician (Gl), Italian (It), Romanian (Ro), Germanic -German (De), Dutch (Nl), Slavic -Belarusian (Be), Slovak (Sk) and Turkic -Azerbaijani (Az) and Turkish (Tr)." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-119", "text": "We evaluate both to-and-from English, where each language pair is trained on one million examples." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-120", "text": "As in the previous experiment, we report test results from the model that performed best in terms of BLEU on the development set." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-121", "text": "Table 5 describes the results when translating into English." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-122", "text": "First, we can see that both multilingual models perform better than the baselines in terms of average BLEU." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-123", "text": "This shows that massively multilingual many-to-many models can work well in realistic settings with millions of training examples, 102 languages and 204 jointly trained directions to-and-from English." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-124", "text": "Looking more closely, we note several different behaviors in comparison to the low-resource experiments on the TED Talks corpus." 
}, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-125", "text": "First, the many-to-one model here performs better than the many-to-many model." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-126", "text": "This shows that the previous result was indeed due to the pathologies of the low-resource dataset; when the training data is large enough and not multiway-parallel there is no overfitting in the many-toone model, and it outperforms the many-to-many model in most cases while they are trained identically." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-127", "text": "----------------------------------" }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-128", "text": "**RESULTS**" }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-129", "text": "One particular outlier in this case is German-to-English, where the many-to-one model is 2 BLEU points below the many-to-many model." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-130", "text": "We examine the BLEU score of this language pair on its dedicated German-English development set during training in the many-to-one model and find that it highly fluctuates." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-131", "text": "We then measure the performance on the test set for this language pair by choosing the best checkpoint on the dedicated German-English development set (instead of on the mixed multilingual development set) and find it to be 38.07, which is actually higher by 1 BLEU than the best result of the many-to-many model." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-132", "text": "This shows that while training many languages together, there is no \"silver bullet\": some languages may suffer from severe interference during training (i.e. a reduction 3 BLEU in this case, from 38.07 to 35.05) while other languages continue to improve with more updates." 
}, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-133", "text": "Table 6 describes the results when translating out-of-English." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-157", "text": "We then study the effect of the number of languages on zero-shot translation accuracy." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-134", "text": "Again, both of the massively multilingual models perform better than the baselines when averaged across the 10 evaluated language pairs, while handling up to 102 languages to-and-from English and 204 translation tasks simultaneously." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-135", "text": "In this case the results are similar to those we observed on the TED talks corpus, where the one-to-many model performs better than the many-to-many model." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-136", "text": "Again, this advantage may be due to the one-to-many model handling a smaller number of tasks while not being biased towards English in the target side like the many-to-many model." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-137", "text": "----------------------------------" }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-138", "text": "**ANALYSIS**" }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-139", "text": "The above results show that massively multilingual NMT is indeed possible in large scale settings and can improve performance over strong bilingual baselines." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-140", "text": "However, it was shown in a somewhat extreme case with more than 100 languages trained jointly, where we saw that in some cases the joint training may harm the performance for some language pairs (i.e. German-English above)." 
}, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-141", "text": "In the following analysis we would like to better understand the trade-off between the number of languages involved and the translation accuracy while keeping the model capacity and training configuration fixed." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-142", "text": "----------------------------------" }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-143", "text": "**MULTILINGUALITY & SUPERVISED PERFORMANCE**" }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-144", "text": "We first study the effect of varying the number of languages on the translation accuracy in a supervised setting, where we focus on many-to-many Table 8 : Zero-Shot performance while varying the number of languages involved models." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-145", "text": "We create four subsets of the in-house dataset by sub-sampling it to a different number of languages in each subset." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-146", "text": "In this way we create four additional English-centric datasets, containing 5, 25, 50 and 75 languages each to-and-from English." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-147", "text": "We make sure that each subset contains all the languages from the next smaller subsetsi.e." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-148", "text": "the 25 language subset contains the 5 language subset, the 50 language subset contains the 25 language subset and so on." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-149", "text": "We train a similarcapacity Transformer model (with 473.7M parameters) on each of these subsets and measure the performance for each model on the 8 supervised language pairs from the smallest subset -{Arabic, French, Russian, Ukrainian}\u2194English." 
}, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-150", "text": "In this way we can analyze to what extent adding more languages improves or harms translation quality while keeping the model capacity fixed, testing the capacity vs. accuracy saturation point." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-151", "text": "Table 7 shows the results of this experiment, reporting the test results for the models that performed best on the multilingual development set." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-152", "text": "We can see that in most cases the best results are obtained using the 5-to-5 model, showing that there is indeed a trade off between the number of languages and translation accuracy when using a fixed model capacity and the same training setup." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-153", "text": "One may expect that the gaps between the different models should become smaller and even close with more updates, as the models with more languages see less examples per language in each batch, thus requiring more updates to improve in terms of BLEU." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-154", "text": "However, in our setting these gaps did not close even after the models converged, leaving 2.73 average BLEU difference between the 5-to-5 and the 103-to-103 model." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-155", "text": "----------------------------------" }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-156", "text": "**MULTILINGUALITY & ZERO-SHOT PERFORMANCE**" }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-158", "text": "Since we find zero-shot accuracy as an interesting measure for model generalization, we hypothesize that by adding more languages, the model is forced to create a more generalized representation to better utilize its capacity, which may improve zeroshot performance." 
}, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-159", "text": "We choose four language pairs for this purpose: Arabic\u2194French which are distant languages, and Ukrainian\u2194Russian which are similar." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-160", "text": "Table 8 shows the results of our models on these language pairs." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-161", "text": "For Arabic\u2194French the BLEU scores are very low in all cases, with the 50-to-50 and 25-to-25 models being slightly better than rest on Ar-Fr and Fr-Ar respectively." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-162", "text": "On Russian\u2194Ukrainian we see clear improvements when increasing the number of languages to more than five." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-163", "text": "Figure 2 further illustrates this, showing the better generalization performance of the massively multilingual models under this zero-shot setting." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-164", "text": "While the zero-shot performance in this case is low and unstable for the 5-to-5 and 25-to-25 models, it is much better for the 50-to-50, 75-to-75 and 103-to-103 models." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-165", "text": "Given these results we can Figure 2: Zero-shot BLEU during training for Ukranian to Russian say that the balance between capacity and generalization here favors the mid range 50-to-50 model, even when using models with more than 473M trained parameters." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-166", "text": "This may hint at the necessity of even larger models for such settings, which is a challenging avenue for future work." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-167", "text": "We also note that our 103 language corpus includes up to one million examples per language pair -while in real-world MT deployments, systems are trained on much more examples per pair." 
}, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-168", "text": "This again emphasizes the need for better techniques for training such massively multilingual models as we may already be hitting the capacity barrier in our setting." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-169", "text": "Dong et al. (2015) extended the NMT model of Bahdanau et al. (2014) to one-to-many translation (from English into 4 languages) by adding a dedicated decoder per target language, showing improvements over strong single-pair baselines." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-170", "text": "Firat et al. (2016a,b) proposed many-to-many models (with up to 6 languages) by using separate encoders and decoders per language while sharing the attention mechanism." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-171", "text": "They also introduced the notion of zero-resource translation, where they use synthetic training data generated through pivoting to train translation directions without available training data." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-172", "text": "Ha et al. (2016) and Johnson et al. (2017) proposed to use a shared encoderdecoder-attention model for many-to-many translation (with up to 7 languages in the latter)." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-173", "text": "In order to determine the target language in such scenarios they proposed adding dedicated targetlanguage symbols to the source." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-174", "text": "This method enabled zero-shot translation, showing the ability of the model to generalize to unseen pairs." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-175", "text": "Recent works propose different methods for parameter sharing between language pairs in multilingual NMT." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-176", "text": "Blackwood et al. 
(2018) propose sharing all parameters but the attention mechanism and show improvements over sharing all parameters." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-177", "text": "Sachan and Neubig (2018) explore sharing various components in self-attentional (Transformer) models." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-178", "text": "Lu et al. (2018) add a shared \"interlingua\" layer while using separate encoders and decoders." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-179", "text": "Zaremoodi et al. (2018) utilize recurrent units with multiple blocks together with a trainable routing network." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-180", "text": "Platanios et al. (2018) propose to share the entire network, while using a contextual parameter generator that learns to generate the parameters of the system given the desired source and target languages." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-181", "text": "Gu et al. (2018) propose a \"Universal Language Representation\" layer together with a Mixture-of-Language-Experts component to improve a many-to-one model from 5 languages into English." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-182", "text": "----------------------------------" }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-183", "text": "**RELATED WORK**" }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-184", "text": "While the mentioned studies provide valuable contributions to improving multilingual models, they apply their models on only up to 7 languages (Johnson et al., 2017) and 20 trained directions (Cettolo et al., 2017 ) in a single model, whereas we focus on scaling NMT to much larger numbers of languages and trained directions." 
}, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-185", "text": "Regarding massively multilingual models, Neubig and Hu (2018) explored methods for rapid adaptation of NMT to new languages by training multilingual models on the 59-language TED Talks corpus and fine-tuning them using data from the new languages." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-186", "text": "While modeling significantly more languages than previous studies, they only train many-to-one models, which we show are inferior in comparison to our proposed massively multilingual many-to-many models when evaluated into English on this dataset." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-187", "text": "Tiedemann (2018) trained an English-centric many-to-many model on translations of the bible including 927 languages." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-188", "text": "While this work pointed to an interesting phenomena in the latent space learned by the model where it clusters representations of typologically-similar languages together, it did not include any evaluation of the produced translations." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-189", "text": "Similarly, Malaviya et al. (2017) trained a many-to-English system including 1017 languages from bible translations, and used it to infer typological features for the different languages (without evaluating the translation quality)." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-190", "text": "In another relevant work, Artetxe and Schwenk (2018) trained an NMT model on 93 languages and used the learned representations to perform cross-lingual transfer learning." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-191", "text": "Again, they did not report the performance of the translation model learned in that massively multilingual setting." 
}, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-192", "text": "----------------------------------" }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-193", "text": "**CONCLUSIONS AND FUTURE WORK**" }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-194", "text": "We showed that NMT models can successfully scale to 102 languages to-and-from English with 204 trained directions and up to one million examples per direction." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-195", "text": "Such models improve the translation quality over similar single-pair baselines when evaluated to and from English by more than 2 BLEU when averaged over 10 diverse lan-guage pairs in each case." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-196", "text": "We show a similar result on the low-resource TED Talks corpus with 59 languages and 116 trained directions." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-197", "text": "We analyze the trade-offs between translation quality and the number of languages involved, pointing on capacity bottlenecks even with very large models and showing that massively multilingual models can generalize better to zero-shot settings." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-198", "text": "We hope this work will encourage future research on massively multilingual NMT, enabling easier support for systems that can serve more people around the globe." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-199", "text": "There are many possible avenues for future work, including semi-supervised learning in such settings, exploring ways to reduce the performance degradation when increasing the number of languages, or using such models for multilingual transfer learning (McCann et al., 2017; Eriguchi et al., 2018; Artetxe and Schwenk, 2018) ." }, { "sent_id": "b335178d833e26190b7056469d3fa7-C001-200", "text": "Understanding and improving zero-shot performance in such scenarios is also a promising direction for future work." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "b335178d833e26190b7056469d3fa7-C001-19" ], [ "b335178d833e26190b7056469d3fa7-C001-185" ] ], "cite_sentences": [ "b335178d833e26190b7056469d3fa7-C001-19", "b335178d833e26190b7056469d3fa7-C001-185" ] }, "@USE@": { "gold_contexts": [ [ "b335178d833e26190b7056469d3fa7-C001-26" ], [ "b335178d833e26190b7056469d3fa7-C001-44" ], [ "b335178d833e26190b7056469d3fa7-C001-52" ], [ "b335178d833e26190b7056469d3fa7-C001-68" ] ], "cite_sentences": [ "b335178d833e26190b7056469d3fa7-C001-26", "b335178d833e26190b7056469d3fa7-C001-44", "b335178d833e26190b7056469d3fa7-C001-52", "b335178d833e26190b7056469d3fa7-C001-68" ] }, "@DIF@": { "gold_contexts": [ [ "b335178d833e26190b7056469d3fa7-C001-73" ], [ "b335178d833e26190b7056469d3fa7-C001-81" ] ], "cite_sentences": [ "b335178d833e26190b7056469d3fa7-C001-73", "b335178d833e26190b7056469d3fa7-C001-81" ] } } }, "ABC_35522a080b41f716723d2a619f59c4_30": { "x": [ { "sent_id": "35522a080b41f716723d2a619f59c4-C001-60", "text": "----------------------------------" }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-2", "text": "In this paper, we study the problem of using an annotated corpus in English for the same natural language processing task in another language." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-3", "text": "While various machine translation systems are available, automated translation is still far from perfect." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-4", "text": "To minimize the noise introduced by translations, we propose to use only key 'reliable\" parts from the translations and apply structural correspondence learning (SCL) to find a low dimensional representation shared by the two languages." 
}, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-5", "text": "We perform experiments on an EnglishChinese sentiment classification task and compare our results with a previous cotraining approach." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-6", "text": "To alleviate the problem of data sparseness, we create extra pseudo-examples for SCL by making queries to a search engine." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-7", "text": "Experiments on real-world on-line review data demonstrate the two techniques can effectively improve the performance compared to previous work." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-8", "text": "----------------------------------" }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-10", "text": "In this paper we are interested in the problem of transferring knowledge gained from data gathered in one language to another language." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-11", "text": "A simple and straightforward solution for this problem might be to use automatic machine translations." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-12", "text": "However, while machine translation has been the subject of a great deal of development in recent years, many of the recent gains in performance manifest as syntactically as opposed to semantically correct sentences." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-13", "text": "For example, \"PIANYI\" is a word mainly used in positive comments in Chinese but its translation from the online Google translator is always \"cheap\", a word typically used in a negative context in English." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-14", "text": "To reduce this kind of error introduced by the translator, Wan in (Wan, 2009 ) applied a co-training scheme." 
}, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-15", "text": "In this setting classifiers are trained in both languages and the two classifiers teach each other for the unlabeled examples." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-16", "text": "The co-training approach manages to boost the performance as it allows the text similarity in the target language to compete with the \"fake\" similarity from the translated texts." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-17", "text": "However, the translated texts are still used as training data and thus can potentially mislead the classifier." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-18", "text": "As we are not really interested in predicting something on the language created by the translator, but rather on the real one, it may be better to further diminish the role of the translated texts in the learning process." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-19", "text": "Motivated by this observation, we suggest here to view this problem as a special case of domain adaptation, in the source domain, we mainly observe English features, while in the other domain mostly features from Chinese." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-20", "text": "The problem we address is how to associate the features under a unified setting." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-21", "text": "There has been a lot of work in domain adaption for NLP (Dai et al., 2007) (Jiang and Zhai, 2007) and one suitable choice for our problem is the approach based on structural correspondence learning (SCL) as in (Blitzer et al., 2006) and (Blitzer et al., 2007b) ." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-22", "text": "The key idea of SCL is to identify a low-dimensional representations that capture correspondence between features from both domains (x s and x t in our case) by modeling their correlations with some special pivot features." 
}, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-23", "text": "The SCL approach is a good fit for our problem as it performs knowledge transfer through identifying important features." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-24", "text": "In the cross-lingual setting, we can restrict the translated texts by using them only through the pivot features." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-25", "text": "We believe this form is more robust to errors in the language produced by the translator." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-26", "text": "Adapting language resources and knowledge to a new language was first studied for general text categorization and information retrieval as in (Bel et al., 2003) , where the authors translate a keyword lexicon to perform cross-lingual text categorization." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-27", "text": "In (Mihalcea et al., 2007) , different shortcomings of lexicon-based translation scheme was discussed for the more semantic-oriented task subjective analysis, instead the authors proposed to use a parallel-corpus, apply the classifier in the source language and use the corresponding sentences in the target language to train a new classifier." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-28", "text": "With the rapid development of automatic machine translations, translating the whole corpus becomes a plausible option." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-29", "text": "One can either choose to translate a corpus in the target language and apply the classifier in the source language to obtain labeled data, or directly translated the existing data set to the new language." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-30", "text": "Various experiments of the first strategy are performed in (Banea et al., 2008) for the subjective analysis task and an average 65 F1 score was reported." 
}, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-31", "text": "In (Wan, 2008) , the authors propose to combine both strategies with ensemble learning and train a bi-lingual classifier." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-32", "text": "In this paper, we are also interested in exploring whether a search engine can be used to improve the performance of NLP systems through reducing the effect of data sparseness." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-33", "text": "As the SCL algorithm we use here is based on co-occurrence statistics, we adopt a simple approach of creating pseudo-examples from the query counts returned by Google." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-34", "text": "----------------------------------" }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-35", "text": "**OUR APPROACH**" }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-36", "text": "To begin, we give a formal definition of the problem we are considering." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-37", "text": "Assume we have two languages l s and l t and denote features in these two languages as x s and x t respectively." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-38", "text": "We also have text-level translations and we use x t for features in the translations from l s to l t and x s for the other direction." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-39", "text": "Let y be the output variable we want to predict, we have labeled examples (y, x s ) and some unlabeled examples (x t )." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-40", "text": "Our task is to train a classifier for (y, x t )." 
}, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-41", "text": "In this paper, we consider the binary sentiment classification (positive or negative) problem where l s and l t correspond to English and Chinese (for general sentiment analysis, we refer the readers to the various previous studies as in (Turney, 2002) , (Pang et al., 2002 ),and (McDonald et al., 2007 )." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-42", "text": "With these definitions in place, we now describe our approach in further detail." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-43", "text": "----------------------------------" }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-44", "text": "**STRUCTURAL CORRESPONDENCE LEARNING(SCL)**" }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-45", "text": "Due to space limitations, we give a very brief overview of the SCL framework here." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-46", "text": "For a detailed illustration, please refer to (Ando and Zhang, 2005) ." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-47", "text": "When SCL is used in a domain adaptation problem, one first needs to find a set of pivot features x p ." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-48", "text": "These pivot features should behave in a similar manner in both domains, and can be used as \"references\" to estimate how much other features may contribute when used in a classifier to predict a target variable." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-49", "text": "These features can either be identified with heuristics (Blitzer et al., 2006) or by automatic selection (Blitzer et al., 2007b) ." 
}, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-50", "text": "Take sentiment classification as an example, \"very good\" and \"awful\" are good pivot features, if a certain feature in the target domain co-occurs often with \"very good\" but infrequently with \"awful\", we could expect this feature will play a similar role as \"very good\" in the final classifier but a different role from \"awful\"." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-51", "text": "We can make this observation purely based on the co-occurrence between these features." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-52", "text": "No hand-labeling is required and this specific feature doesn't need to be present in our labeled training data of the source domain." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-53", "text": "The SCL approach of (Ando and Zhang, 2005) formulates the above idea by constructing a set of linear predictors for each of the pivot features." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-54", "text": "Each of these linear predictor is binary like whether \"very good\" occurs in the text and we have a set of training instances (1|0, {x i })." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-55", "text": "The weight matrix of these linear predictors will encode the co-occurrence statistics between an ordinary feature and the pivot features." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-56", "text": "As the cooccurrence data are generally very sparse for a typical NLP task, we usually compress the weight matrix using the singular vector decomposition and only selects the top k eigenvectors v k ." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-57", "text": "This matrix w of the k vectors {v k } gives a mapping from the original feature space to a lower dimensional representation and is shown in (Ando and Zhang, 2005) to be the optimal choice of dimension k under common loss functions." 
}, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-58", "text": "In the next step we can then train a classifier on the extended feature (x, w * x) in the source domain." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-59", "text": "As w groups the features from different domains with similar behavior relative to the pivot features together, if such a classifier has good performance on the source domain, it will likely do well on the target domain as well." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-61", "text": "**SCL FOR THE CROSS-LINGUAL ADAPTATION**" }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-62", "text": "Viewing our task as a domain adaptation problem." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-63", "text": "The source domain correspond to English reviews and the target domain for Chinese ones." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-64", "text": "The full feature vector is (x s , x t )." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-65", "text": "The difficulty we are facing is, due to noise in the translations, the conditional probabilities p(y|x s ) and the one in the translated texts p(y|x s ) may be quite different. Consider the following two straightforward strategies of using automatic machine translations: one can translate the original English labeled data (y, x s ) into (y, x t ) in Chinese and train a classifier, or one can train a classifier on (y, x s ) and translate x t in Chinese into x s in English so as to use the classifier. But as the conditional distribution can be quite different for the original language and the pseudo language produced by the machine translators, these two strategies give poor performance as reported in (Wan, 2009) ." 
}, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-66", "text": "Our solution to this problem is simple: instead of using all the features as (x s , x t ) and (x s , x t ), we only preserves the pivot features in the translated texts x s and x t respectively and discard the other features produced by the translator." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-67", "text": "So, now we will have (x s , x tp ) and (x sp , x t ) where x (s|t)p are pivot features in the source and the target languages." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-68", "text": "In other words, when we use the SCL on our problem, the translations are only used to decide if a certain pivot feature occurs or not in the training of the linear predictors." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-69", "text": "All the other nonpivot features in the translators are blocked to reduce the noise." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-70", "text": "In the original SCL as we mentioned earlier, the final classifier is trained on the extended features (x, w * x)." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-71", "text": "However, as mentioned above we will only use the pivot features." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-72", "text": "To represent this constraint, we can modify the vector to be (w p * x, w * x) where w p is a constant matrix that only selects the pivot features." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-73", "text": "This modification will not affect the deduction procedure and results in (Ando and Zhang, 2005) ." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-74", "text": "Experiments show that using only pivot features actually outperforms the full feature setting." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-75", "text": "For the selection of the pivot features, we follow the automatic selection method proposed in (Blitzer et al., 2007a) ." 
}, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-76", "text": "We first select some candidates that occur at least some constant number of times in reviews of the two languages." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-77", "text": "Then, we rank these features according to their conditional entropy to the labels on the training set." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-78", "text": "In table 1, we give some of the pivot features with English English Pivot Features \"poor quality\", \"not buy\", \"easy use\", \"very easy\" \"excellent\", \"perfect\", \"still very\", \"garbage\", \"poor\", \"not work\", \"not to\", \"very comfortable\" Chinese Pivot Features wanmei(perfect), xiaoguo hen(effect is very...) tisheng(improve),feichang hao(very good), cha(poor), shushi(comfortable), chuse(excellent)" }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-79", "text": "----------------------------------" }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-80", "text": "**UTILIZING THE SEARCH ENGINE**" }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-81", "text": "Data sparseness is a common problem in NLP tasks." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-82", "text": "On the other hand, search engines nowadays usually index a huge amount of web pages." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-83", "text": "We now show how they can also be used as a valuable data source in a less obvious way." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-84", "text": "Previous studies like (Bollegala, 2007) have shown that search engine results can be comparable to language statistics from a large scale corpus for some NLP tasks like word sense disambiguation." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-85", "text": "For our problem, we use the query counts returned by a search engine to compute the correlations between a normal feature and the pivot features." 
}, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-86", "text": "Consider the word \"PIANYI\" which is mostly used in positive comments, the query \"CHAN-PIN(product) PING(comment) CHA(bad) PI-ANYI\" has 2,900,000 results, while \"CHAN-PIN(product) PING(comment) HAO(good) PI-ANYI\" returns 57,400,000 pages." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-87", "text": "The results imply the word \"PIANYI\" is closer to the pivot feature \"good\" and it behaves less similar with the pivot feature \"bad\"." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-88", "text": "To add the query counts into the SCL scheme, we create pseudo examples when training linear predictors for pivot features." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-89", "text": "To construct a pseudo-positive example between a certain feature x i and a certain pivot feature x p , we simply query the term x i x p and get a count c 1 ." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-90", "text": "We also query x p alone and get another count c 2 .Then we can create an example (1, {0, ..., 0," }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-91", "text": "The pseudo-negative examples are created similarly." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-92", "text": "These pseudo examples are equivalent to texts with a single word and the count is used to approximate the empirical expectation." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-93", "text": "As an initial experiment, we select 10,000 Chinese features that occur more than once in the Chinese unlabeled data set but not frequent enough to be captured by the original SCL. And we also select the top 20 most informative Chinese pivot features to perform the queries." 
}, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-94", "text": "----------------------------------" }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-95", "text": "**EXPERIMENT**" }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-96", "text": "----------------------------------" }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-97", "text": "**DATA SET**" }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-98", "text": "For comparsion, we use the same data set in (Wan, 2009) :" }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-99", "text": "Test Set(Labeled Chinese Reviews): The data set contains a total of 886 labeled product reviews in Chinese (451 positive reviews and 435 negative ones)." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-100", "text": "These reviews are extracted from a popular Chinese IT product website IT168 1 ." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-101", "text": "The reviews are mainly about electronic devices like mp3 players, mobile phones, digital cameras and computers." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-102", "text": "Training Set(Labeled English Reviews): This is the data set used in the domain adaption experiment of (Blitzer et al., 2007b )." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-103", "text": "It contains four major categories: books, DVDs, electronics and kitchen appliances." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-104", "text": "The data set consists of 8000 reviews with 4000 positive and 4000 negative, It is a public data set available on the web 2 ." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-105", "text": "Unlabeled Set (Unlabeled Chinese Reviews): 1000 Chinese reviews downloaded from the same website as the Chinese training set." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-106", "text": "They are of the same domain as the test set." 
}, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-107", "text": "We translate each English review into Chinese and vice versus through the public Google Translation service." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-108", "text": "Also following the setting in (Wan, 2009) , we only use the Chinese unlabeled data and English training sets for our SCL training procedures." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-109", "text": "The test set is blind to the training stage." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-110", "text": "The features we used are bigrams and unigrams in the two languages as in (Wan, 2009) ." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-111", "text": "In Chinese, we first apply the stanford Chinese word segmenter 3 to segment the reviews." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-112", "text": "Bigrams refers to a single Chinese word and a bigram refers to two adjacent Chinese words." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-113", "text": "The features are also pre-processed and normalized as in (Blitzer et al., 2007b) ." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-114", "text": "Table 3 : Results on the Negative Reviews" }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-115", "text": "----------------------------------" }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-116", "text": "**COMPARISONS**" }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-117", "text": "We compare our procedure with the co-training scheme reported in (Wan, 2009) :" }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-118", "text": "The method with the best performance in (Wan, 2009) ." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-119", "text": "Two standard SVMs are trained using the co-training scheme for the Chinese views and the English views. And the results of the two SVMs are combined to give the final output." 
}, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-120", "text": "----------------------------------" }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-121", "text": "**SCL-B:**" }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-122", "text": "The basic SCL procedure as explained." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-123", "text": "----------------------------------" }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-124", "text": "**SCL-O:**" }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-125", "text": "The basic SCL except that we use all features from the translated texts instead of only the pivot features." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-126", "text": "SCL-C: The training procedure is still the same as SCL-B except in the test time we only use the Chinese pivot features and neglect the English pivot features from translations." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-127", "text": "SCL-E: The same as SCL-B except that in the training of linear pivot predictors, we also use the pseudo examples constructed from queries of the search engine." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-128", "text": "Table 2 and 3 give results measured on the positive labeled reviews and negative reviews separately." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-129", "text": "Table 4 gives the overall accuracy on the whole 886 reviews." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-130", "text": "Our basic SCL approach SCL-B outperforms the original Co-Training approach by 2.2% in the overall accuracy." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-131", "text": "We can Table 4 : Overall Accuracy of Different Methods also notice that using all the features including the ones from translations actually deteriorate the performance from 0.835 to 0.826." 
}, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-132", "text": "The model incorporating the co-occurrence count information from the search engine has the best overall performance of 0.857." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-133", "text": "It is interesting to note that the simple scheme we have adopted increased the recall performance on the negative reviews significantly." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-134", "text": "After examining the reviews, we find the negative part contains some idioms and words mainly used on the internet and the query count seems to be able to capture their usage." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-135", "text": "Finally, as our final goal is to train a Chinese sentiment classifier, it will be best if our model can only rely on the Chinese features." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-136", "text": "The SCL-C model improves the performance from the CoTraining method a little but not as much as the SCL \u2212 B and the SCL \u2212 O approaches." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-137", "text": "This observation suggests that the translations are still helpful for the cross-lingual adaptation problem as the translators perform some implicit semantic mapping." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-138", "text": "----------------------------------" }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-139", "text": "**CONCLUSION**" }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-140", "text": "In this paper, we are interested in adapting existing knowledge to a new language." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-141", "text": "We show that instead of fully relying on automatic translation, which may be misleading for a highly semantic task like the sentiment analysis, using techniques like SCL to connect the two languages through feature-level mapping seems a more suitable choice." 
}, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-142", "text": "We also perform an initial experiment using the co-occurrence statistics from a search engine to handle the data sparseness problem in the adaptation process, and the result is encouraging." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-143", "text": "As future research we believe a promising avenue of exploration is to construct a probabilistic version of the SCL approach which could offer a more explicit model of the relations between the two domains and the relations between the search engine results and the model parameters." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-144", "text": "Also, in the current work, we select the pivot features by simple ranking with mutual information, which only considers the distribution information." }, { "sent_id": "35522a080b41f716723d2a619f59c4-C001-145", "text": "Incorporating the confidence from the translator may further improve the performance." } ], "y": { "@BACK@": { "gold_contexts": [ [ "35522a080b41f716723d2a619f59c4-C001-14" ] ], "cite_sentences": [ "35522a080b41f716723d2a619f59c4-C001-14" ] }, "@SIM@": { "gold_contexts": [ [ "35522a080b41f716723d2a619f59c4-C001-65" ] ], "cite_sentences": [ "35522a080b41f716723d2a619f59c4-C001-65" ] }, "@USE@": { "gold_contexts": [ [ "35522a080b41f716723d2a619f59c4-C001-98" ], [ "35522a080b41f716723d2a619f59c4-C001-108" ], [ "35522a080b41f716723d2a619f59c4-C001-110" ], [ "35522a080b41f716723d2a619f59c4-C001-117" ] ], "cite_sentences": [ "35522a080b41f716723d2a619f59c4-C001-98", "35522a080b41f716723d2a619f59c4-C001-108", "35522a080b41f716723d2a619f59c4-C001-110", "35522a080b41f716723d2a619f59c4-C001-117" ] }, "@UNSURE@": { "gold_contexts": [ [ "35522a080b41f716723d2a619f59c4-C001-118" ] ], "cite_sentences": [ "35522a080b41f716723d2a619f59c4-C001-118" ] } } }, "ABC_fe3e71020dfb32927f5c348a6fdcfc_30": { "x": [ { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-1", "text": "**ABSTRACT**" }, { "sent_id": 
"fe3e71020dfb32927f5c348a6fdcfc-C001-2", "text": "The ultimate goal when building dialogue systems is to satisfy the needs of real users, but quality assurance for dialogue strategies is a non-trivial problem." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-3", "text": "The applied evaluation metrics and resulting design principles are often obscure, emerge by trial-and-error, and are highly context dependent." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-4", "text": "This paper introduces data-driven methods for obtaining reliable objective functions for system design." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-5", "text": "In particular, we test whether an objective function obtained from Wizard-of-Oz (WOZ) data is a valid estimate of real users' preferences." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-6", "text": "We test this in a test-retest comparison between the model obtained from the WOZ study and the models obtained when testing with real users." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-7", "text": "We can show that, despite a low fit to the initial data, the objective function obtained from WOZ data makes accurate predictions for automatic dialogue evaluation, and, when automatically optimising a policy using these predictions, the improvement over a strategy simply mimicking the data becomes clear from an error analysis." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-8", "text": "----------------------------------" }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-10", "text": "The ultimate goal when building dialogue systems is to satisfy the needs of real users, but quality assurance for dialogue strategies is a non-trivial problem." 
}, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-11", "text": "In conventional dialogue design the dialogue often is designed following 'best practises' which are often obscure and emerge by trial-anderror (Paek, 2007) ." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-12", "text": "In addition, user preferences are highly context dependent (Hu et al., 2007) ." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-13", "text": "This is why dialogue strategy design is often referred to as being more of an art than a science (Jones and Galliers, 1996; Pieraccini, 2002) Over recent years, data-driven statistical optimisation methods (e.g. Reinforcement Learning (RL)) for dialogue strategy design have become more and more popular (Lemon and Pietquin, 2007) ." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-14", "text": "One major advantage of RL-based dialogue strategy development is that the dialogue strategy can be automatically trained and evaluated using the same objective function (Walker, 2005) ." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-15", "text": "In the context of RL the objective function is also called the \"reward\" (Sutton and Barto, 1998) ." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-16", "text": "Despite its central aspect for RL, quality assurance for objective functions has received little attention so far." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-17", "text": "In fact, the reward function is one of the most handcoded aspects in RL (Paek, 2006) ." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-18", "text": "In this paper we propose a new method for meta-evaluation of the objective function." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-19", "text": "We bring together two strands of research: one strand uses Reinforcement Learning to automatically optimise dialogue strategies, e.g. 
(Singh et al., 2002) , (Henderson et al., 2008) , (Rieser and Lemon, 2008a; Rieser and Lemon, 2008b) ; the other other focuses on automatic evaluation of dialogue strategies, e.g. the PARADISE framework (Walker et al., 1997) , and meta-evaluation of dialogue metrics, e.g. (Engelbrecht and M\u00f6ller, 2007; Paek, 2007) ." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-20", "text": "Clearly, automatic optimisation and evaluation of dialogue policies, as well as quality control of the objective function, are closely inter-related problems: how can we make sure that we optimise a system according to real users' preferences?" }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-21", "text": "In particular, we construct a data-driven objective function using the PARADISE framework, and use it for automatic dialogue strategy optimisation following pioneering work by (Walker et al., 1998) ." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-22", "text": "However, it is not clear how reliable such a predictive model is, i.e. if it indeed estimates real user preferences." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-23", "text": "The models obtained with PARADISE usually fit the data poorly (Engelbrecht and M\u00f6ller, 2007) ." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-24", "text": "It is also not clear how general they are across different systems and user groups (Walker et al., 2000) , (Paek, 2007) ." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-25", "text": "Furthermore, it is not clear how they perform when being used for automatic strategy optimisation within the RL framework." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-26", "text": "In the following we evaluate different aspects of an objective function obtained from Wizard-of-Oz (WOZ) data (Rieser and Lemon, 2008b) ." 
}, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-27", "text": "We proceed as follows: The next Section shortly summarises the overall dialogue system design." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-28", "text": "In Section 3. we test the model stability in a test-retest comparison across different user populations and data sets." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-29", "text": "In Section 4. we measure prediction accuracy." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-30", "text": "In Section 5. we conduct a detailed error analysis where we test the relationship between improved user ratings and dialogue behaviour, i.e. we investigate which factors lead the users to give higher scores, and whether this was correctly reflected in the original objective function." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-31", "text": "2." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-32", "text": "Overall framework 2.1." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-33", "text": "Dialogue System Design Our application domains are multimodal information seeking dialogue systems as an interface to an in-car MP3 player." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-34", "text": "The structure of information seeking dialogues consists of an information acquisition dialogue and an information presentation sub-dialogue (see Figure 1) ." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-35", "text": "For information acquisition the task of the dialogue policy is to gather 'enough' search constraints from the user, and then, 'at the right time', to start the information presentation phase where the task is to present 'the right amount' of information -either on the screen or listing the items verbally. What this actually means depends on the dialogue context and the preferences of our users as reflected in the objective function." 
}, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-36", "text": "We therefore formulate dialogue learning as a hierarchical optimisation problem (Rieser and Lemon, 2008b) ." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-37", "text": "The applied objective function follows this structure as well." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-38", "text": "Figure 1 : Hierarchical dialogue structure for information seeking multimodal systems." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-39", "text": "----------------------------------" }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-40", "text": "**METHOD**" }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-41", "text": "In the following the overall method is shortly summarised. Please see (Rieser and Lemon, 2008b; Rieser, 2008) for details." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-42", "text": "1. We obtain an objective function from the WOZ data of (Rieser et al., 2005) according to the PARADISE framework." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-43", "text": "In PARADISE multivariate linear regression is applied to experimental dialogue data in order to develop predictive models of user preferences (obtained from questionnaires) as a linear weighted function of dialogue performance measures (such as dialogue length)." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-44", "text": "This predictive model is used to automatically evaluate dialogues." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-45", "text": "For RL this function is used as the \"reward\" for training." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-46", "text": "2. We train an RL-based dialogue system with the obtained model." 
}, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-47", "text": "The hypothesis is that, by using the obtained quality measures as a reward function for RL, we will be able to learn an improved strategy over a policy which simply mimics observed patterns (i.e. the human wizard behaviour) in the data." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-48", "text": "The baseline policy is therefore constructed using Supervised Learning (SL) on the WOZ data." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-49", "text": "We then test both strategies (SL and RL) with real users using the same objective/evaluation function." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-50", "text": "3. Since the objective function plays such a central role in automatic dialogue design, we need to find methods that ensure its quality." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-51", "text": "In this paper, we evaluate the obtained function in a test-retest comparison between the model obtained from the WOZ study and the one obtained when testing the real system as described in the following." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-52", "text": "----------------------------------" }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-53", "text": "**MODEL STABILITY**" }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-54", "text": "For the information acquisition phase we applied stepwise multivariate linear regression to select the dialogue features which are most predictive for perceived Task Ease." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-55", "text": "Task Ease is a measure from the user questionnaires obtained by taking the average of two user ratings on a 5-point Likert scale." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-56", "text": "1. The task was easy to solve." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-57", "text": "2. I had no problems finding the information I wanted." 
}, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-58", "text": "We choose Task Ease as the ultimate measure to be optimised, following Clark's (1996) principle of least effort, which says: \"All things being equal, agents try to minimize their effort in doing what they intend to do\"." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-59", "text": "The PARADISE regression model is constructed from 3 different corpora: the SAMMIE WOZ experiment (Rieser et al., 2005), and the iTalk system used for the user tests (Rieser and Lemon, 2008b) running the supervised baseline policy and the RL-based policy." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-60", "text": "By replicating the regression model on different data sets we test whether the automatic estimate of Task Ease generalises beyond the conditions and assumptions of a particular experimental design." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-61", "text": "The resulting models are shown in Equations 1-3, where TaskEase_WOZ is the regression model obtained from the WOZ data, TaskEase_SL is obtained from the user test data running the supervised policy, and TaskEase_RL is obtained from the user test data running the RL-based policy." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-62", "text": "They all reflect the same trends: longer dialogues (measured in turns) predict a lower Task Ease, whereas a good performance in the multimodal information presentation phase (multimodal score) will positively influence Task Ease." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-63", "text": "For the iTalk user tests almost all the tasks were completed; therefore task completion was chosen as a predictive factor only for the WOZ model." 
}, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-64", "text": "TaskEase_WOZ = 1.58 + .12 * taskCompl + .09 * mmScore - .20 * dialogueLength (1)" }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-65", "text": "TaskEase_SL = 3.50 + .54 * mmScore - .34 * dialogueLength (2)" }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-66", "text": "TaskEase_RL = 3.80 + .49 * mmScore - .36 * dialogueLength (3)" }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-67", "text": "To evaluate the obtained regression models we use two measures: how well they fit the data (goodness-of-fit) and how close the functions are to each other (model replicability)." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-68", "text": "For the WOZ model the data fit was rather low (R^2_WOZ = .03; for R^2 we report the adjusted values), whereas for the models obtained from the iTalk system the fit has improved (R^2_RL = .48, and R^2_SL = .55)." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-69", "text": "To directly compare the functions we plotted them in 3D space (the 4th dimension for TaskEase_WOZ was omitted); see Figure 2." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-70", "text": "While the models obtained with the iTalk system show almost perfect overlap (R^2 = .98), the (reduced) WOZ model differs (R^2 = .22) in the sense that it assigns" }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-71", "text": "less weight to dialogue length and the multimodal presentation score." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-72", "text": "Figure 2: 3D visualisation of the objective functions obtained from WOZ data and real user data using a SL and RL-based strategy." 
}, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-73", "text": "----------------------------------" }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-74", "text": "**MODEL PERFORMANCE: PREDICTION ACCURACY**" }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-75", "text": "We now investigate how well these models generalise by testing their prediction accuracy." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-76", "text": "Previous research evaluated two aspects: how well a given objective function is able to predict unseen events from the original system (Engelbrecht and M\u00f6ller, 2007), and how well it is able to predict unseen events of a new/different system (Walker et al., 2000)." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-77", "text": "We evaluate these two aspects as well; the only difference is that we use the Root Mean Squared Error (RMSE) instead of R^2 for measuring the models' prediction accuracy." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-78", "text": "RMSE is, as we argue, more robust for small data sets." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-79", "text": "In particular, we argue that, by correcting for variance, R^2 can lead to artificially good results when using small test sets (which typically vary more) and is sensitive to outliers (see Equation 4)." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-80", "text": "RMSE instead measures the (root) mean difference between actual and predicted values (see Equation 5)." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-81", "text": "First, we measure the predictive power of our models within the same data set using 10-fold cross-validation, and across the different systems by testing models trained on one system to predict perceived Task Ease for another system, following a method introduced by Walker et al. (2000)." 
}, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-82", "text": "The results for comparing the RMSE (max. 7 for SL/RL, and max. 5 for WOZ) for training and testing within data sets (ID 1-3) and across data sets (ID 4-5) are shown in Table 1." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-83", "text": "In order to compare results from different scales we also report the RMSE as a percentage of the maximum error (% error)." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-84", "text": "The results show that predictions according to PARADISE can lead to accurate test results despite the low data fit." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-85", "text": "While for the regression model obtained from the WOZ data the fit was 10 times lower than for SL/RL, the prediction performance is comparably good (see Table 1, ID 1-3)." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-86", "text": "The models also generalise well across systems (see Table 1, ID 4-5). In addition, we evaluate model accuracy following a method introduced by Engelbrecht and M\u00f6ller (2007)." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-87", "text": "They suggest comparing model performance by plotting mean values for predicted and true ratings, averaging over conditions." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-88", "text": "We replicate this method, averaging mean ratings for observed and predicted Task Ease over number of turns." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-89", "text": "The resulting graphs in Table 2 show that the predicted mean values per turn are fairly accurate for the SL and RL objective functions (first two graphs from the left)." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-90", "text": "For the WOZ data, the predictions are less accurate, especially for low numbers of turns (graph on the right)." 
}, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-91", "text": "This is due to the fact that for low numbers of turns only very few observations are in the training set: 25% of the dialogues are between 5 and 6 turns long (where the predictions are close to the observations) and 42% of the dialogues are over 14 turns long (where the curves converge again)." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-92", "text": "Only 33% cover the span between 7-13 turns, where the graphical comparison indicates low prediction performance." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-93", "text": "However, as we argue, such results are misleading for small data sets." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-94", "text": "Quite the contrary is the case: the predicted values show that the linear model does well for the majority of the cases and is not sensitive to outliers, i.e. the graph only diverges where there are too few observations." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-95", "text": "It therefore generalises well." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-96", "text": "----------------------------------" }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-97", "text": "**ERROR ANALYSIS**" }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-98", "text": "In previous work we showed that the RL-based policy significantly outperforms the supervised policy in terms of improved user ratings and dialogue performance measures (Rieser and Lemon, 2008b)." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-99", "text": "Here, we test the relationship between improved user ratings and dialogue behaviour, i.e. we investigate which factors lead the users to give higher scores, and whether this was correctly reflected in the original reward function." 
}, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-100", "text": "We concentrate on the information presentation phase, since there is a simple two-way relationship between user scores and the number of presented items." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-101", "text": "To estimate this relationship we use curve fitting, an alternative to linear regression for cases where the relationship between two variables may be non-linear." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-102", "text": "For each presentation mode (verbal vs. multimodal) we select the (simplest) model with the closest fit to the data (R^2)." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-103", "text": "----------------------------------" }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-104", "text": "**TRAINING**" }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-105", "text": "We first use this method to construct the reward function for policy learning from the WOZ data." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-106", "text": "Figure 3 shows the employed reward function for information presentation modelled from the WOZ data." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-107", "text": "The straight line shows the objective function for verbal presentation and the quadratic curve the one for multimodal presentation." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-108", "text": "In the WOZ experiments wizards never presented more than 3 items using speech, resulting in a linearly decreasing line." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-109", "text": "This fact was captured by the learning schemes in different ways." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-110", "text": "SL extracted the rule \"never present more than 3 items using speech\"." 
}, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-111", "text": "For RL the extrapolated line assigns negative values to more than 5 verbally presented items and intersects with the multimodal reward at 2.62, i.e. for more than 3 items the returned reward is higher when presenting multimodally." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-112", "text": "Therefore the RL-based strategy learns to present up to 3 items verbally (on average not more than 2.4 items per dialogue)." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-113", "text": "----------------------------------" }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-114", "text": "**TESTING**" }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-115", "text": "We now apply the same curve-fitting method to the iTalk user test data in order to test whether the policy optimisation had been successful." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-116", "text": "We therefore compare the curve-fitting model obtained from the system running the RL policy against the model obtained from the SL policy." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-117", "text": "The hypothesis is that if the policy is good (i.e. consistently making the right decisions), this will result in equally high scores for all presented items, represented by a straight line; whereas if the curve is not linear, this indicates that the policy was sometimes making the right decision and sometimes not." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-118", "text": "The estimated relationship between the average number of items presented verbally and the verbal presentation score from the user questionnaire is shown in the left column of Table 3." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-119", "text": "The straight, slightly declining line indicates that the policies in general make the right decision, although the fewer items they present the better." 
}, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-120", "text": "For verbal presentation both learning schemes (RL and SL) were able to learn a policy from the WOZ data which received consistently good ratings from the users (between 5 and 6 for RL, and 4 and 5 for SL, on a 7-point Likert scale)." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-121", "text": "For multimodal presentation the WOZ objective function has a turning point at 14.8 (see Figure 3)." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-122", "text": "The RL-based policy learned to maximise the returned reward by displaying no more than 15 items." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-123", "text": "The SL policy, in contrast, did not learn an upper boundary for when to show items on the screen, since the wizards did not follow a specific pattern (Rieser and Lemon, 2008b)." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-124", "text": "When relating the number of items to user scores, the RL policy produces a linear (slightly declining) line between 7 and 6 (Table 3, bottom right), indicating that the applied policy reflected the users' preferences." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-125", "text": "Hence, we conclude that the objective function derived from the WOZ data gave the right feedback to the learner." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-126", "text": "For the SL policy the logarithmic function best describes the data." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-127", "text": "This function indicates that the multimodal presentation strategy received the highest scores if the number of items presented was just under 15 (Table 3, top right), which is the turning point of the WOZ objective function." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-128", "text": "This again indicates that, for the iTalk users, the preferred multimodal policy was indeed the one reflected in the WOZ objective function." 
}, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-129", "text": "----------------------------------" }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-130", "text": "**CONCLUSION**" }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-131", "text": "This paper introduces data-driven methods for obtaining reliable objective functions for dialogue system design, and so steers dialogue design towards science rather than art." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-132", "text": "We applied data-driven methods to build objective functions (for both dialogue policy learning and evaluation) reflecting the needs of real users (see Table 3: objective functions for information presentation)." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-133", "text": "In particular, we derived a non-linear objective function from Wizard-of-Oz data which is used to automatically train a Reinforcement Learning-based dialogue strategy, which was then evaluated with real users." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-134", "text": "To ensure the quality of the applied objective function we evaluated its stability, predictive power, and strategy improvements in a test-retest comparison." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-135", "text": "We also conducted a detailed error analysis." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-136", "text": "In sum, according to our measures, an objective function obtained from WOZ data is a valid first estimate of real users' preferences." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-137", "text": "Despite a low fit to the initial data, the objective function obtained from WOZ data makes accurate predictions for automatic dialogue evaluation, and, when automatically optimising a policy using these predictions, the improvement over a strategy just mimicking the data becomes clear from an error analysis." 
}, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-138", "text": "The models obtained from the tests with a real system follow the same trends, but can be seen as more reliable estimates of the objective function in this domain." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-139", "text": "In future work we will explore incrementally training a system according to improved representations of real user preferences, for example gathered online from a deployed spoken dialogue system." }, { "sent_id": "fe3e71020dfb32927f5c348a6fdcfc-C001-140", "text": "This work also introduces non-linear objective functions for dialogue optimization, which merit further exploration in future work." } ], "y": { "@USE@": { "gold_contexts": [ [ "fe3e71020dfb32927f5c348a6fdcfc-C001-19" ], [ "fe3e71020dfb32927f5c348a6fdcfc-C001-26" ], [ "fe3e71020dfb32927f5c348a6fdcfc-C001-36" ], [ "fe3e71020dfb32927f5c348a6fdcfc-C001-59" ] ], "cite_sentences": [ "fe3e71020dfb32927f5c348a6fdcfc-C001-19", "fe3e71020dfb32927f5c348a6fdcfc-C001-26", "fe3e71020dfb32927f5c348a6fdcfc-C001-36", "fe3e71020dfb32927f5c348a6fdcfc-C001-59" ] }, "@BACK@": { "gold_contexts": [ [ "fe3e71020dfb32927f5c348a6fdcfc-C001-41" ], [ "fe3e71020dfb32927f5c348a6fdcfc-C001-98" ] ], "cite_sentences": [ "fe3e71020dfb32927f5c348a6fdcfc-C001-41", "fe3e71020dfb32927f5c348a6fdcfc-C001-98" ] }, "@DIF@": { "gold_contexts": [ [ "fe3e71020dfb32927f5c348a6fdcfc-C001-123" ] ], "cite_sentences": [ "fe3e71020dfb32927f5c348a6fdcfc-C001-123" ] } } }, "ABC_56d1812bec8abbdb31a2346d96e5ca_30": { "x": [ { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-30", "text": "However, in low-resource scenarios it is less realistic to expect that NMT could learn this reordering from scratch on its own." 
}, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-2", "text": "Despite impressive empirical successes of neural machine translation (NMT) on standard benchmarks, limited parallel data impedes the application of NMT models to many language pairs." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-3", "text": "Data augmentation methods such as back-translation make it possible to use monolingual data to help alleviate these issues, but back-translation itself fails in extreme low-resource scenarios, especially for syntactically divergent languages." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-4", "text": "In this paper, we propose a simple yet effective solution, whereby target-language sentences are re-ordered to match the order of the source and used as an additional source of training-time supervision." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-5", "text": "Experiments with simulated low-resource Japanese-to-English, and real low-resource Uyghur-to-English scenarios find significant improvements over other semi-supervised alternatives." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-6", "text": "----------------------------------" }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-50", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-8", "text": "While neural machine translation (NMT; Bahdanau et al., 2015; Vaswani et al., 2017) now represents the state of the art in the majority of large-scale MT benchmarks (Bojar et al., 2017), it is highly dependent on the availability of copious parallel resources; NMT under-performs previous phrase-based methods when the training data is small (Koehn and Knowles, 2017)." 
}, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-9", "text": "Unfortunately, million-sentence parallel corpora are often unavailable for many language pairs." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-10", "text": "Conversely, monolingual sentences, particularly in English, are often much easier to find, making semi-supervised approaches that can use monolingual data a desirable solution to this problem." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-12", "text": "----------------------------------" }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-13", "text": "Reference Japanese: \u79c1 \u306f \u65b0\u3057\u3044 \u8eca \u3092 \u8cb7\u3063\u305f \u3002" }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-14", "text": "English: I bought a new car ." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-15", "text": "Japanese-ordered English: I var_1 a new car var_2 bought ." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-19", "text": "Figure 1: An English sentence re-ordered into Japanese order using the rule-based method of Isozaki et al. (2010b), and its reference Japanese translation." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-20", "text": "Semi-supervised approaches for NMT are often based on automatically creating pseudo-parallel sentences through methods such as back-translation (Irvine and Callison-Burch, 2013; Sennrich et al., 2016) or adding an auxiliary autoencoding task on monolingual data (Cheng et al., 2016; Currey et al., 2017)." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-21", "text": "However, both methods have problems with low-resource and syntactically divergent language pairs." 
}, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-22", "text": "Back-translation assumes enough data to create a functional NMT system, an unrealistic requirement in low-resource scenarios, while autoencoding target sentences by definition cannot learn source-target word reordering." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-23", "text": "This paper proposes a method to create pseudo-parallel sentences for NMT for language pairs with divergent syntactic structures." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-24", "text": "Prior to NMT, word reordering was a major challenge for statistical machine translation (SMT), and many techniques have emerged over the years to address this challenge (Xia and McCord, 2004)." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-25", "text": "Importantly, even simple heuristic reordering methods with a few hand-created rules have been shown to be highly effective in closing syntactic gaps (Collins et al., 2005; Isozaki et al., 2010b; Fig. 1)." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-26", "text": "Because these rules usually function solely in high-resource languages such as English with high-quality syntactic analysis tools, a linguist with rudimentary knowledge of the structure of the target language can create them in short order using these tools." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-27", "text": "...languages, but limited success on real low-resource settings and syntactically divergent language pairs (Neubig and Hu, 2018; Guzm\u00e1n et al., 2019)." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-28", "text": "Hence we focus on semi-supervised methods in this paper." 
}, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-29", "text": "However, similar pre-ordering methods have not proven useful in NMT (Du and Way, 2017), largely because in high-resource scenarios NMT is much more effective at learning reordering than previous SMT methods were (Bentivogli et al., 2016)." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-31", "text": "Here we ask \"how can we efficiently leverage the monolingual target data to improve the performance of the NMT system in low-resource, syntactically divergent language pairs?\" We tackle this problem via a simple two-step data augmentation method: (1) we first reorder monolingual target sentences to create source-ordered target sentences as shown in Fig. 1, (2) we then replace the words in the reordered sentences with source words using a bilingual dictionary, and add them as the source side of a pseudo-parallel corpus." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-32", "text": "Experiments demonstrate the effectiveness of our approach on translation from Japanese and Uyghur to English, with a simple, linguistically motivated method of head finalization (HF; Isozaki et al., 2010b) as our reordering method." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-33", "text": "----------------------------------" }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-34", "text": "**THE PROPOSED METHOD**" }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-35", "text": "Training Framework We assume that there are two types of available resources: a small parallel corpus P = {(s, t)} and a large monolingual target corpus Q. The goal of our method is to create a pseudo-parallel corpus Q\u0302 = {(\u015d, t)}, where \u015d is a pseudo-source sentence automatically created in two steps: (1) word reordering, and (2) word-by-word translation." 
}, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-36", "text": "Word Reordering The first step reorders monolingual target sentences t \u2208 Q into the source order t_s." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-37", "text": "Instead of devising an entirely new word-ordering method, we can simply rely on methods that have already been widely studied and proven useful in SMT." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-38", "text": "Reordering can be done either using rules based on linguistic knowledge (Isozaki et al., 2010b; Collins et al., 2005) or learning from aligned parallel data (Xia and McCord, 2004; Habash, 2007), and in principle our pseudo-corpus creation paradigm is compatible with any of these methods." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-39", "text": "Specifically, in this work we utilize rule-based methods, as our goal is to improve translation of low-resource languages, where large quantities of high-quality parallel data do not exist and we posit that current data-driven reordering methods are unlikely to function well." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-40", "text": "Examples of rule-based methods include those to reorder English into German (Navratil et al., 2012), Arabic (Badr et al., 2009), or Japanese (Isozaki et al., 2010b)." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-41", "text": "In experiments we use Isozaki et al.'s (2010b) method of reordering SVO languages (e.g. English) into the order of SOV languages (e.g. Japanese) by simply (1) applying a syntactic parser to English (Tsuruoka et al., 2004), (2) identifying the head constituent of each phrase and moving it to the end of the phrase, and (3) inserting special tokens after subjects and objects of predicates to mimic Japanese case markers." 
}, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-42", "text": "Word-by-word Translation To generate data for training MT models, we next perform word-by-word translation of t_s into a pseudo-source sentence \u015d using a bilingual dictionary (Xie et al., 2018)." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-43", "text": "There are many ways we can obtain this dictionary: even for many low-resource languages with a paucity of bilingual text, we can obtain manually-curated lexicons with reasonable coverage, or run unsupervised word alignment on whatever parallel data we have available." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-44", "text": "In addition, we can induce word translations for more words in the target language using methods for bilingual lexicon induction over pre-trained word embeddings (e.g. Grave et al. (2018))." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-45", "text": "----------------------------------" }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-46", "text": "**EXPERIMENTS**" }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-47", "text": "We evaluate our method on two language pairs: Japanese-to-English (ja-en) and Uyghur-to-English (ug-en)." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-48", "text": "Japanese and Uyghur are phylogenetically distant languages, but they share similar SOV syntactic structure, which is greatly divergent from English SVO structure." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-49", "text": "----------------------------------" }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-51", "text": "For both language pairs, we use an attention-based encoder-decoder NMT model with a one-layer bidirectional LSTM as the encoder and a one-layer uni-directional LSTM as the decoder." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-52", "text": "Embeddings and LSTM states were set to 300 and 256 dimensions respectively." 
}, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-53", "text": "Target word embeddings are shared with the softmax weight matrix in the decoder." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-54", "text": "As noted above, we use HF (Isozaki et al., 2010b) as our re-ordering rule." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-55", "text": "HF was designed for transforming English into Japanese order, but we use it as-is for the Uyghur-English pair as well, to demonstrate that simple, linguistically motivated rules can generalize across pairs with similar syntax with little or no modification." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-56", "text": "Further details regarding the experimental settings are in the supplementary material." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-57", "text": "----------------------------------" }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-58", "text": "**SIMULATED JAPANESE TO ENGLISH EXPERIMENTS**" }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-59", "text": "We first evaluate on a simulated low-resource ja-en translation task using the ASPEC dataset (Nakazawa et al., 2016)." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-60", "text": "We randomly select 400k ja-en parallel sentence pairs to use as our full training data." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-61", "text": "We then randomly sub-sample low-resource datasets of 3k, 6k, 10k, and 20k parallel sentences, and use the remainder of the 400k English sentences as monolingual data." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-62", "text": "In the training data augmented with the reordered pairs, we duplicate the parallel sentences 5 times." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-63", "text": "For settings with supervised parallel sentences of 3k, 6k, 10k and 20k, we set the maximum vocabulary size of both Japanese and English to be 10k, 10k, 15k and 20k respectively." 
}, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-64", "text": "To automatically learn a high-precision dictionary on the small amount of parallel data we have available for training, we use GIZA++ (Och and Ney, 2003) to learn alignments in both directions then take the intersection of alignments." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-65", "text": "We then learn the bilingual word embeddings with DeMa-BWE (Zhou et al., 2019) , an unsupervised method that has shown strong results on syntactically divergent language pairs." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-66", "text": "We give the more reliable alignments extracted from GIZA++ high priority by querying the alignment dictionary first, then follow by querying the embedding-induced dictionary." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-67", "text": "When an English word is not within any vocabulary, we output the English word as-is into the pseudo-source sentence." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-68", "text": "----------------------------------" }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-69", "text": "**REAL UYGHUR TO ENGLISH EXPERIMENTS**" }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-70", "text": "We also consider the harder case of Uyghur, a truly lowresource language." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-71", "text": "We create test and validation sets using the test data from the DARPA LORELEI corpus (Christianson et al., 2018) which contains 2,275 sentence pairs (after filtering out noisy ones) related to incidents that happened in the Uyghur area." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-72", "text": "We hold out 300 pairs as the validation data and use the rest as the test set." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-73", "text": "The LORELEI language pack also contains the bilingual lexicons between Uyghur and English, and thousands of in-domain English sentences." 
}, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-74", "text": "We also use a large monolingual English corpus containing sentences related to various incidents occurring all over the world collected from ReliefWeb." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-75", "text": "5 To sub-select a relevant subset of this corpus, we use the cross-entropy filtering (Moore and Lewis, 2010) to select 400k that are most like the in-domain English data." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-76", "text": "For parallel data, like many low-resource languages, we only have access to data from the Bible 6 and Wikipedia language links (the total number of parallel Uyghur-English Wikipedia titles is 3,088), but no other in-domain parallel data." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-77", "text": "We run GIZA++ on this parallel data to obtain an alignment dictionary." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-78", "text": "We learn the bilingual word embeddings via the supervised Geometric approach (Jawanpuria et al., 2019) on FastText (Grave et al., 2018) pre-trained Uyghur and English monolingual embeddings." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-79", "text": "the too high rotation speed produces the reverse deformation supervised however , the deformation of and the deformation of is caused by the dc rate ours however , the deformation of is generated when the rotation rate is large source 8000 3 3 29 3 12 2 reference a 3.3 magnitude earthquake with the depth of 8000 meters hit feb 12 at 3:29 urumqi time supervised 2 , on february 12 -12 of darkness , urumqi time hit a 3.3 earthquake , the earthquake hit ." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-80", "text": "ours 2 -on february 12 -on , urumqi time 3:29minus 3.3 magnitude earthquake hit , the earthquake under depth of 8000 meters ." 
}, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-81", "text": "Table 2 : Translation examples on ja-en (reorder with 6000 supervised pairs) and ug-en (reorder) from our model and the supervised counterpart." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-82", "text": "----------------------------------" }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-83", "text": "**RESULTS AND COMPARISON**" }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-84", "text": "Baselines In Tab." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-85", "text": "1, we compare our models with baselines including regular supervised training (sup) and back-translation (Sennrich et al., 2016) (back) ." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-86", "text": "7 To demonstrate the effectiveness of the reordering, we also compare our method against a copy-based data-augmentation method (No-reorder) where the original English sentences t \u2208 Q rather than the reordered ones t s are translated via the bilingual lexicon." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-87", "text": "8 For each of the above settings, we also experimented with a phrased-based statistical machine translation (SMT) system (Dyer et al., 2010) ." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-88", "text": "In Tab." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-89", "text": "1, we only show the results with supervised data and back-translation for SMT, since we observed that the data augmentation method performs poorly with SMT (complete results are presented in the Appendix)." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-90", "text": "Main Results In Tab." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-91", "text": "1, we observe consistent improvements on both ja-en and ug-en translation tasks against other baseline methods." 
}, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-92", "text": "First, comparing our results with the NMT models trained using the same amount of parallel data, our word reordering-based semi-supervised models consistently outperform standard NMT models by a large margin." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-93", "text": "In the case that we have no access to in-domain parallel data at all, our method can still achieve some success in ug-en translation." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-94", "text": "Second, comparing our Reorder method with the No-Reorder one, reordering English sentences into the source language order consistently brings large performance gains, which demonstrates the importance of reordering." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-95", "text": "These results are notable given previous reports that explicit reordering is not beneficial for NMT (Du and Way, 2017) ." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-96", "text": "Third, for ja-en translation, when gradually decreasing the amount of parallel data, the improvements of our model over the supervised NMT models become more significant, demonstrating the effectiveness of our approach in low-resource settings." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-97", "text": "Fourth, back-translation is not very beneficial or even harmful, likely because the backtranslation system trained on limited supervised data can not provide high-quality translations to train the model." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-98", "text": "Finally, we also notice that although SMT performs better than NMT with less supervised training data (3k, 6k supervised data and Uyghur), the performance gain is not as remarkable as NMT when the amount of supervised data increases." 
}, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-99", "text": "Moreover, in the case of less supervised data, our data augmentation method with reordering still outperforms SMT." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-100", "text": "We give two examples of the translation outputs of our model and a supervised NMT model for ug-en and ja-en (trained with 6k supervised pairs) in Tab. 2." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-101", "text": "In the first example from ja-en, our model is able to output terminology such as \"rotation rate\" thanks to the enlarged vocabulary while the supervised model can not." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-102", "text": "In the example from ug-en, our model can produce a more fluent sentence with better information coverage." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-103", "text": "Analysis To investigate the effects of reordering, we compare our method with \"No-Reorder\" described in 3.2." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-104", "text": "First, we bucket the test data by sentence length and compute the BLEU score accordingly." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-105", "text": "We present the comparison results in Fig. 2 , from which we observe that \"Reorder\" outperforms \"No-Reorder\" consistently under different sentence length buckets, and the improvement is larger when the sentence length is longer." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-106", "text": "Second, we also evaluate the model outputs on the test data with RIBES (Isozaki et al., 2010a) which is an automatic evaluation metric of translation quality designed for distant languages that is especially sensitive to word order." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-107", "text": "From Fig. 3 , we can see that \"Reorder\" consistently outperforms \"No-Reorder\" on ja-en translation especially when the amount of supervised data decreases." 
}, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-108", "text": "This suggests that with reordered pairs as the augmented training data, the model is able to output more syntactically correct sentences." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-109", "text": "----------------------------------" }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-110", "text": "**CONCLUSION**" }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-111", "text": "This paper proposed a simple yet effective semisupervised learning framework for low-resource machine translation that artificially creates sourceordered target sentences for data-augmentation." }, { "sent_id": "56d1812bec8abbdb31a2346d96e5ca-C001-112", "text": "Experimental results on ja-en and ug-en translations show that our approach achieves significant improvements over baseline systems, demonstrating the effectiveness of the proposed approach on divergent language pairs." } ], "y": { "@USE@": { "gold_contexts": [ [ "56d1812bec8abbdb31a2346d96e5ca-C001-19" ], [ "56d1812bec8abbdb31a2346d96e5ca-C001-32" ], [ "56d1812bec8abbdb31a2346d96e5ca-C001-38" ], [ "56d1812bec8abbdb31a2346d96e5ca-C001-41" ], [ "56d1812bec8abbdb31a2346d96e5ca-C001-54" ] ], "cite_sentences": [ "56d1812bec8abbdb31a2346d96e5ca-C001-19", "56d1812bec8abbdb31a2346d96e5ca-C001-32", "56d1812bec8abbdb31a2346d96e5ca-C001-38", "56d1812bec8abbdb31a2346d96e5ca-C001-41", "56d1812bec8abbdb31a2346d96e5ca-C001-54" ] }, "@BACK@": { "gold_contexts": [ [ "56d1812bec8abbdb31a2346d96e5ca-C001-25" ], [ "56d1812bec8abbdb31a2346d96e5ca-C001-40" ] ], "cite_sentences": [ "56d1812bec8abbdb31a2346d96e5ca-C001-25", "56d1812bec8abbdb31a2346d96e5ca-C001-40" ] } } }, "ABC_bd3663405d2d68f943acc73720b42d_30": { "x": [ { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-2", "text": "We explore the application of a Deep Structured Similarity Model (DSSM) to ranking in lexical simplification." 
}, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-3", "text": "Our results show that the DSSM can effectively capture fine-grained features to perform semantic matching when ranking substitution candidates, outperforming the stateof-the-art on two standard datasets used for the task." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-4", "text": "----------------------------------" }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-5", "text": "**INTRODUCTION**" }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-6", "text": "Lexical simplification is the task of automatically rewriting a text by substituting words or phrases with simpler variants, while retaining its meaning and grammaticality." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-7", "text": "The goal is to make the text easier to understand for children, language learners, people with cognitive disabilities and even machines." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-8", "text": "Approaches to lexical simplification generally follow a standard pipeline consisting of two main steps: generation and ranking." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-9", "text": "In the generation step, a set of possible substitutions for the target word is commonly created by querying semantic databases such as Wordnet (Devlin and Tait, 1998) , learning substitution rules from sentence-aligned parallel corpora of complex-simple texts (Horn et al., 2014; Paetzold and Specia, 2017) , and learning word embeddings from a large corpora to obtain similar words of the complex word (Glava\u0161 and\u0160tajner, 2015; Kim et al., 2016; Specia, 2016a, 2017) ." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-10", "text": "In the ranking step, features that discriminate a substitution candidate from other substitution candidates are leveraged and the candidates are ranked with respect to their simplicity and contextual fitness." 
}, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-11", "text": "* This research was conducted while the first author was a Post Doctoral Fellow at the City University of Hong Kong." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-12", "text": "The ranking step is challenging because the substitution candidates usually have similar meaning to the target word, and thus share similar context features." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-13", "text": "State-of-the-art approaches to ranking in lexical simplification exploit supervised machine learning-based methods that rely mostly on surface features, such as word frequency, word length and n-gram probability, for training the model (Horn et al., 2014; Bingel and S\u00f8gaard, 2016; Specia, 2016a, 2017) ." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-14", "text": "Moreover, deep architectures are not explored in these models." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-15", "text": "Surface features and shallow architectures might not be able to explore the features at different levels of abstractions that capture nuances that discriminate the candidates." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-16", "text": "In this paper, we propose to use a Deep Structured Similarity Model (DSSM) (Huang et al., 2013) to rank substitution candidates." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-17", "text": "The DSSM exploits a deep architecture by using a deep neural network (DNN), that can effectively capture contextual features to perform semantic matching between two sentences." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-18", "text": "It has been successfully applied to several natural language processing (NLP) tasks, such as machine translation , web search ranking (Huang et al., 2013; Shen et al., 2014; Liu et al., 2015) , question answering , and image captioning (Fang et al., 2015) ." 
}, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-19", "text": "To the best of our knowledge, this is the first time this model is applied to lexical simplification." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-20", "text": "We adapt the original DSSM architecture and objective function to our specific task." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-21", "text": "Our evaluation on two standard datasets for lexical simplification shows that this method outperforms state-of-the-art approaches that use supervised machine learning-based methods." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-22", "text": "----------------------------------" }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-23", "text": "**METHOD**" }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-24", "text": "----------------------------------" }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-25", "text": "**TASK DEFINITION**" }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-26", "text": "We focus on the ranking step of the standard lexical simplification pipeline." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-27", "text": "Given a dataset of tar-get words, their sentential contexts and substitution candidates for the target words, the goal is to train a model that accurately ranks the candidates based on their simplicity and semantic matching." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-28", "text": "For generating substitution candidates, we utilize the method proposed by Paetzold and Specia (2017) , which was recently shown to be the state-of-art method for generating substitution candidates." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-29", "text": "They exploit a hybrid substitution generation approach where candidates are first extracted from 550,644 simple-complex aligned sentences from the Newsela corpus (Xu et al., 2015) ." 
}, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-30", "text": "Then, these candidates are complemented with candidates generated with a retrofitted word embedding model." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-31", "text": "The word embedding model is retrofitted over WordNet's synonym pairs (for details, please refer to Paetzold and Specia (2017) )." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-32", "text": "For ranking substitution candidates, we use a DSSM, which we elaborate in the next section." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-33", "text": "----------------------------------" }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-34", "text": "**DSSM FOR RANKING SUBSTITUTION**" }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-35", "text": "Candidates" }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-36", "text": "Word Hashing" }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-37", "text": "----------------------------------" }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-38", "text": "**NONLINEAR PROJECTION**" }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-39", "text": "Relevance measured by cosine similarity Posterior probability computed by softmax" }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-40", "text": "Figure 1: Architecture of the Deep Structured Similarity Model (DSSM): The input X (either a target word or a substitute candidate and their sentential contexts, T and S, respectively) is first represented as a bag of words, then hashed into letter 3-grams H. Non-linear projection W t generates the semantic representation of T and nonlinear projection W s constructs the semantic representation of S. Finally, the cosine similarity is adopted to measure the relevance between the T and S. At last, the posterior probabilities over all candidates are computed." 
}, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-41", "text": "Compared to other latent semantic models, such as Latent Semantic Analysis (Deerwester et al., 1990) and its extensions, Deep Structured Similarity Model (also called Deep Semantic Similarity Model) or DSSM (Huang et al., 2013) can capture fine-grained local and global contextual features more effectively." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-42", "text": "The DSSM is trained by optimizing a similarity-driven objective, by projecting the whole sentence to a continuous semantic space." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-43", "text": "In addition, it is is built upon characters (rather than words) for scalability and generalizability (He, 2016) ." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-44", "text": "Figure 1 shows the architecture of the model used in this work." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-45", "text": "It consists of a typical DNN with a word hashing layer, a nonlinear projection layer, and an output layer." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-46", "text": "Each component is described in the following: Word Hashing Layer: the input is first mapped from a high-dimentional one-hot word vector into a low-dimentional letter-trigram space (with the dimentionality as low as 5k), a method called word hashing (Huang et al., 2013) ." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-47", "text": "For example, the word cat is hashed as the bag of letter trigram #-ca, c-a-t, a-t-#, where # is a boundary symbol (Liu et al., 2015) ." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-48", "text": "The word hashing helps the model generalize better for out-of-vocabulary words and for spelling variants of the same word (Liu et al., 2015) ." 
}, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-49", "text": "Nonlinear Projection Layer: This layers maps the substitution candidate and the target word in their sentential contexts, S and T respectively, which are represented as letter tri-grams, into ddimension semantic representations, S Sq and T Sq respectively:" }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-50", "text": "where the nonlinear activation tanh is defined as:" }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-51", "text": "1+e \u22122z ." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-52", "text": "Output Layer: This layer computes the semantic relevance score between S and T as:" }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-53", "text": "----------------------------------" }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-54", "text": "**FEATURES FOR DSSM**" }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-55", "text": "As baseline features, we use the same n-gram probability features as in Paetzold and Specia (2017) , who also employ a neural network to rank substitution candidates." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-56", "text": "As in Paetzold and Specia (2017) , the features were extracted using the SubIMDB corpus (Paetzold and Specia, 2015) ." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-57", "text": "We also experiment with additional features that have been reported as useful in this task." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-58", "text": "For each target word and a substitution candidate word we also compute: cosine similarity, word length, and alignment probability in the sentence-aligned Normal-Simple Wikipedia corpus (Kauchak, 2013) ." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-59", "text": "The cosine similarity feature is computed using the SubIMDB corpus." 
}, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-60", "text": "----------------------------------" }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-61", "text": "**IMPLEMENTATION DETAILS AND TRAINING PROCEDURE OF THE DSSM**" }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-62", "text": "Following previous works that used supervised machine learning for ranking in lexical simplification (Horn et al., 2014; Paetzold and Specia, 2017) , we train the DSSM using the LexMTurk dataset (Horn et al., 2014) , which contains 500 instances composed of a sentence, a target word and substitution candidates ranked by simplicity (Paetzold and Specia, 2017) ." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-63", "text": "In order to learn the parameters W t and W s (Figure 1 ) of the DSSM, we use the standard backpropagation algorithm (Rumelhart et al., 1988) ." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-64", "text": "The objective used in this paper follows the pair-wise learning-to-rank paradigm outlined in (Burges et al., 2005) ." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-65", "text": "Given a target word and its sentential context T , we obtain a list of candidates L. We set different positive values to the candidates based on their simplicity rankings." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-66", "text": "E.g., if the list of the candidates is ordered by simplificity as, L = {A + > B + > C + }, the labels are first constructed as L = {y A + = 3, y B + = 2, y C + = 1}. The values are then normalized by dividing by the maximum value in the list: L = {y A + = 1, y B + = 0.667, y C + = 0.333}. If the target word was not originally in L, we add it with label 0." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-67", "text": "This enables the model to reflect the label information in the similarity scores." 
}, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-68", "text": "We minimize the Bayesian expected loss as:" }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-69", "text": "Note that P (S l |T ) is computed as:" }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-70", "text": "here, \u03b3 is a tuning factor." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-71", "text": "We used 5-cross validation approach to select hyper-parameters, such as number of hidden nodes." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-72", "text": "We set the gamma factor as 10 as per Huang et al. (2013) ." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-73", "text": "The selected hyperparameters were used to train the model in the whole LexMTurk dataset." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-74", "text": "We employ earlystopping and select the model whose change of the average loss in each epoch was smaller than 1.0e-3." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-75", "text": "Since the training data is small (only 500 samples) we use a smaller number of hidden nodes, d = 32, in the nonlinear projection layer and adopt a higher dropout rate (0.4)." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-76", "text": "The model is optimized using Adam (Kingma and Ba, 2014) with the learning rate fixed at 0.001, and is trained for 30 epochs." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-77", "text": "The mini-batch is set to 16 during training." 
}, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-78", "text": "----------------------------------" }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-79", "text": "**EXPERIMENTS**" }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-80", "text": "----------------------------------" }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-81", "text": "**DATASETS AND EVALUATION METRICS**" }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-82", "text": "To evaluate the proposed model, we conduct experiments on two common datasets for lexical simplification: BenchLS (Paetzold and Specia, 2016b) , which contains 929 instances, and NNSEval (Paetzold and Specia, 2016a) , which contains 239 instances." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-83", "text": "Each instance is composed of a sentence, a target word, and a set of gold candidates ranked by simplicity (Paetzold and Specia, 2017) ." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-84", "text": "Since both datasets contain instances from the LexMturk dataset (Horn et al., 2014) , which we use for training the DNN, we remove the overlap instances between training and test datasets 1 ." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-85", "text": "We finally obtain 429 remaining instances in the BenchLS dataset, and 78 instances in the NNEval dataset, which are used in our evaluation." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-86", "text": "We adopt the same evaluation metrics featured in Glava\u0161 and\u0160tajner (2015) and Horn et al. (2014) : 1) precision: ratio of correct simplifications out of all the simplifications made by the system; 2) accuracy: ratio of correct simplifications out of all words that should have been simplified; and 3) changed: ratio of target words changed by the system." 
}, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-87", "text": "----------------------------------" }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-88", "text": "**BASELINES**" }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-89", "text": "We compared the proposed model (DSSM Ranking) to two state-of-the-art approaches to ranking in lexical simplification that exploit supervised machine learning-based methods." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-90", "text": "The first baseline is the Neural Substitution Ranking (NSR) approach described in (Paetzold and Specia, 2017) , which employs a multi-layer perceptron neural network." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-91", "text": "We reimplement their model as part of the LEXenstein toolkit (Paetzold and Specia, 2015) ." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-92", "text": "The network has 3 hidden layers with 8 nodes each." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-93", "text": "Unlike the proposed model, they treat ranking in lexical simplification as a standard classification problem." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-94", "text": "The second baseline is SVM rank (Joachims, 2006) Table 1 : Substitution candidates ranking results." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-95", "text": "n-gram probs." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-96", "text": "denotes the n-gram probability features described in Paetzold and Specia (2017) , and all denotes all features described in Section 2.3." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-97", "text": "All values marked in bold are significantly higher compared to the best baseline, SVM rank , measured by t-test at p-value of 0.05." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-98", "text": "with default parameters) for ranking substitution candidates, similar to the method described in (Horn et al., 2014) ." 
}, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-99", "text": "All the three models employ the n-gram probability features extracted from the SubIMDB corpus (Paetzold and Specia, 2015) , as described in (Paetzold and Specia, 2017) , and are trained using the LexMTurk dataset." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-100", "text": "----------------------------------" }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-101", "text": "**RESULTS**" }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-102", "text": "The top part of table 1 (Substitution Candidates Ranking) summarizes the results of all three systems." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-103", "text": "Overall, both SVM rank and DSSM Ranking outperform the NSR Baseline." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-104", "text": "The DSSM Ranking performs comparably to SVM rank when using only n-gram probabilities as features, and consistently leverages all features described in Section 2.3, outperforming all systems in accuracy, precision and changed ratio." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-105", "text": "We experimented with adding all features described in Section 2.3 to the baselines as well, however, we obtained no improvements compared to using only n-gram probability features." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-106", "text": "We also tried running all ranking systems on selected candidates that best replace the target word in the input sentence." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-107", "text": "We follow the Unsupervised Boundary Ranking Substitution Selection method described in Paetzold and Specia (2017) , which ranks candidates according to how well they fit the context of the target word, and discards 50% of the worst ranking candidates." 
}, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-108", "text": "The bottom part of the table 1 (Selection Step + Substitution Candidates Ranking) summarizes the results of all ranking systems after performing the selection step on generated substitution candidates." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-109", "text": "We obtain similar tendency in the results, with the DSSM Ranking outperforming both baselines." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-110", "text": "The results indicate the advantage of using a deep architecture, and of building a semantic representation of the whole sentence on top of the characters." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-111", "text": "To illustrate by examples, Table 2 lists the top candidate ranked by the systems for different input sentences." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-112", "text": "In the examples, the DSSM Ranking correctly ranked a substitute for the target word, while the two baselines either left the target word unchanged, or ranked an incorrect substitute." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-113", "text": "----------------------------------" }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-114", "text": "**CONCLUSIONS**" }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-115", "text": "We presented an effective method for ranking in lexical simplification." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-116", "text": "We explored the application of a DSSM that builds a semantic representation of the whole sentence on top of characters." }, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-117", "text": "The DSSM can effectively capture fine-grained features to perform semantic matching when ranking substitution candidates, outperforming state-of-art approaches that use supervised machine learning to ranking in lexical simplification." 
}, { "sent_id": "bd3663405d2d68f943acc73720b42d-C001-118", "text": "For future work, we plan to examine and incorporate a larger feature set to the DSSM, as well as try other DSSM architectures, such as the Convolutional DSSM (C-DSSM) (Shen et al., 2014) ." } ], "y": { "@BACK@": { "gold_contexts": [ [ "bd3663405d2d68f943acc73720b42d-C001-9" ], [ "bd3663405d2d68f943acc73720b42d-C001-13" ] ], "cite_sentences": [ "bd3663405d2d68f943acc73720b42d-C001-9", "bd3663405d2d68f943acc73720b42d-C001-13" ] }, "@MOT@": { "gold_contexts": [ [ "bd3663405d2d68f943acc73720b42d-C001-13", "bd3663405d2d68f943acc73720b42d-C001-14", "bd3663405d2d68f943acc73720b42d-C001-16" ] ], "cite_sentences": [ "bd3663405d2d68f943acc73720b42d-C001-13" ] }, "@USE@": { "gold_contexts": [ [ "bd3663405d2d68f943acc73720b42d-C001-62" ], [ "bd3663405d2d68f943acc73720b42d-C001-84" ], [ "bd3663405d2d68f943acc73720b42d-C001-86" ] ], "cite_sentences": [ "bd3663405d2d68f943acc73720b42d-C001-62", "bd3663405d2d68f943acc73720b42d-C001-84", "bd3663405d2d68f943acc73720b42d-C001-86" ] }, "@SIM@": { "gold_contexts": [ [ "bd3663405d2d68f943acc73720b42d-C001-98" ] ], "cite_sentences": [ "bd3663405d2d68f943acc73720b42d-C001-98" ] } } }, "ABC_80e3aec943c37927050f97459360b4_30": { "x": [ { "sent_id": "80e3aec943c37927050f97459360b4-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-2", "text": "Recently, the acoustic-to-word model based on the Connectionist Temporal Classification (CTC) criterion was shown as a natural end-to-end model directly targeting words as output units." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-3", "text": "However, this type of word-based CTC model suffers from the out-of-vocabulary (OOV) issue as it can only model limited number of words in the output layer and maps all the remaining words into an OOV output node." 
}, { "sent_id": "80e3aec943c37927050f97459360b4-C001-4", "text": "Therefore, such word-based CTC model can only recognize the frequent words modeled by the network output nodes." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-5", "text": "It also cannot easily handle the hot-words which emerge after the model is trained." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-6", "text": "In this study, we improve the acoustic-to-word model with a hybrid CTC model which can predict both words and characters at the same time." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-7", "text": "With a shared-hidden-layer structure and modular design, the alignments of words generated from the word-based CTC and the character-based CTC are synchronized." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-8", "text": "Whenever the acoustic-to-word model emits an OOV token, we back off that OOV segment to the word output generated from the character-based CTC, hence solving the OOV or hot-words issue." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-9", "text": "Evaluated on a Microsoft Cortana voice assistant task, the proposed model can reduce the errors introduced by the OOV output token in the acoustic-to-word model by 30%." 
}, { "sent_id": "80e3aec943c37927050f97459360b4-C001-10", "text": "CTC, OOV, acoustic-to-word, hybrid, LSTM 111 978-1-5090-4788-8/17/$31.00" }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-11", "text": "Index Terms-" }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-12", "text": "----------------------------------" }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-13", "text": "**INTRODUCTION**" }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-14", "text": "Recently, significant progress has been made in automatic speech recognition (ASR) when switching the acoustic model from deep neural networks (DNNs) to long short-term memory (LSTM) [1] [2] recurrent neural networks (RNNs) which can better model the speech sequence [3] [4] [5] [6] [7] [8] [9] [10] ." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-15", "text": "Like DNNs, LSTM-RNNs are usually trained with the cross entropy (CE) criterion, and then may be further optimized with the sequence discriminative training criterion [11] [12] [13] [14] ." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-16", "text": "Note that ASR is a sequence-to-sequence task, which maps the input speech waveform to a final word sequence or an intermediate phoneme sequence." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-17", "text": "What the acoustic modeling cares is the output of word or phoneme sequence, instead of the frame-by-frame labeling which the traditional CE training criterion optimizes." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-18", "text": "Hence, the Connectionist Temporal Classification (CTC) approach [15] [16] [17] [18] was introduced to map the speech input frames into an output label sequence." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-19", "text": "The building network is still a LSTM-RNN, but the training objective function is changed from CE to CTC." 
}, { "sent_id": "80e3aec943c37927050f97459360b4-C001-20", "text": "The most attractive characteristics of CTC is that it provides a path to end-to-end optimization of acoustic models." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-21", "text": "In the deep speech [19] [20] and EESEN [21] [22] work, the end-to-end speech recognition system was explored to directly predict characters instead of phonemes, hence removing the need of using lexicons and decision trees which are the building blocks in [17] [18] ." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-22", "text": "This is one step toward removing expert knowledge when building an ASR system." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-23", "text": "As the goal of ASR is to generate a word sequence from the speech acoustic, word unit is the most natural output unit for network modeling." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-24", "text": "In [17] , the CTC with up to 27 thousand (k) word output targets was explored but the ASR accuracy is not very good, partially due to the high out-of-vocabulary (OOV) rate when using only around 3k hours training data." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-25", "text": "In [23] , it was shown that by using 100k words as the output targets and by training the model with 125k hours of data, the word-based CTC, a.k.a." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-26", "text": "acoustic-to-word CTC, can beat the CTC system with phoneme unit." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-27", "text": "In [24] , the training strategy of word-based CTC was explored with better initialization." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-28", "text": "The ASR task of CTC with word-based output is very simple: the output word sequence is constructed by taking the words with the maximum posterior spikes in the sequence and reducing repeated words into one if there is no blank between them." 
}, { "sent_id": "80e3aec943c37927050f97459360b4-C001-29", "text": "No language model or complex decoding process is involved." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-30", "text": "Therefore, the word-based CTC is a very good endto-end ASR model." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-31", "text": "In addition to CTC, attention-based models [25] [26] [27] and RNN-transducers [28] [29] [30] are also end-to-end ASR models." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-32", "text": "Their effectiveness has been demonstrated when working with character output units." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-33", "text": "To our best knowledge, there is no report of using word output units in attention-based models and RNN-transducers." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-34", "text": "In this study, we focus on how to improve the CTC with word output units." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-35", "text": "There are two challenges to the word-based CTC." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-36", "text": "The first one is the OOV issue." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-37", "text": "In [17] [23] [24] , only frequent words in the training set are used as the targets and the remaining words are just tagged as the OOV." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-38", "text": "All these OOV words cannot be modeled by the LSTM-RNN and cannot be recognized during evaluation." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-39", "text": "The second issue of the wordbased CTC is that it cannot handle hot-words which emerge and become popular after the network has been built." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-40", "text": "It is impossible to get satisfactory performance by directly adding output nodes in the network with the specified hot-words without retraining the network." 
}, { "sent_id": "80e3aec943c37927050f97459360b4-C001-41", "text": "Inspired by the open vocabulary neural machine translation work [31] , we propose an acoustic-to-word model without OOV by first building a word-based CTC in which the output vocabulary contains the frequent words in the training set together with an OOV token which all the infrequent words are mapped to." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-42", "text": "Then we train a characterbased CTC by sharing most hidden layers of the word-based CTC." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-43", "text": "During recognition, the word-based CTC generates a word sequence, and the character-based CTC is only consulted at the OOV segments." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-44", "text": "Evaluated on a Microsoft internal Cortana voice assistant task, the proposed method can reduce the errors introduced by OOV output token in the acoustic-to-word model by 30%." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-45", "text": "Although the proposed work shares the same concept of open vocabulary neural translation in [31] , our work is very different from [31] as the fundament framework in our work is CTC-based speech recognition while [31] uses attentionbased framework for translation." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-46", "text": "Hence, the detailed implementations are very different in these two works." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-47", "text": "There are also many works in the traditional systems which handle OOV problem for open vocabulary ASR task." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-48", "text": "Some works [32] [33] only detect OOV words and recognize their phonetic transcriptions." 
}, { "sent_id": "80e3aec943c37927050f97459360b4-C001-49", "text": "Some studies [34] [35] go further to identify the character sequence of OOV words by first recognizing the OOV word as a phoneme sequence and then using phonemeto-character conversion to generate the character sequence." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-50", "text": "In contrast, the auxiliary character-based CTC in this study provides a much easier way to handle the OOV issue in the word-based CTC." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-51", "text": "The rest of the paper is organized as follows." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-52", "text": "In Section 2, we briefly introduce CTC modeling and then we present the proposed acoustic-to-word model without OOV." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-53", "text": "Experimental evaluation of the algorithm is provided in Section 3." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-54", "text": "We summarize our study and draw conclusions in Section 4." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-55", "text": "----------------------------------" }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-56", "text": "**ACOUSTIC-TO-WORD CTC WITHOUT OOV**" }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-57", "text": "In this section, we first briefly overview the CTC modeling technology and describe the OOV issue in the acoustic-toword CTC." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-58", "text": "Then, we propose the hybrid CTC model that can solve the OOV issue by consulting an auxiliary characterbased CTC to generate candidate words." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-59", "text": "Last, we discuss how to improve the character-based CTC for better prediction." 
}, { "sent_id": "80e3aec943c37927050f97459360b4-C001-60", "text": "----------------------------------" }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-61", "text": "**CTC MODELING**" }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-62", "text": "The CTC criterion [15] was introduced to map the speech input frames into an output label sequence [16] [17] [18] ." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-63", "text": "To deal with the issue that the number of output labels is smaller than that of input speech frames, CTC introduces a special blank label and allows the repetition of labels to map the label sequence into a CTC path, which forces the output and input sequences to have the same length." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-64", "text": "Denote as the speech input sequence, as the original label sequence, and B \u22121 ( ) represents all the CTC paths mapped from ." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-65", "text": "The CTC loss function is defined as the sum of negative log probabilities of all the CTC paths mapped from the correct label sequence as" }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-66", "text": "where is one CTC path." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-67", "text": "With the conditional independent assumption, ( | ) can be decomposed into a product of posterior from each frame as" }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-68", "text": "The calculation of ( | ) is done via the forward-backward process in [15] ." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-69", "text": "The CTC output labels can be phonemes [16] [17] [18] , characters [19] [20] [21] [22] [36] or even words [17] [23] [24] ." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-70", "text": "As the goal of ASR is to generate a word sequence from the speech waveform, word unit is the most natural output unit for network modeling." 
}, { "sent_id": "80e3aec943c37927050f97459360b4-C001-71", "text": "The recently proposed acoustic-toword models [23] [24], a.k.a." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-72", "text": "word-based CTC models, build multiple layer LSTM networks and use words as the network output units, optimized with the CTC training criterion." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-73", "text": "It is very simple to generate the word sequence with this wordbased CTC model: pick the words corresponding to posterior spikes to form the output word sequence." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-74", "text": "There is neither language model nor complex decoding process involved." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-75", "text": "However, when training the word-based CTC model, only frequent words in the training set are used as the targets and the remaining words are just tagged as the OOV." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-76", "text": "All these OOV words cannot be modeled by the network and cannot be recognized during evaluation." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-77", "text": "In next section, we proposed a hybrid CTC model to address the OOV issue and the hotwords issue discussed in the introduction." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-78", "text": "----------------------------------" }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-79", "text": "**ACOUSTIC-TO-WORD CTC WITHOUT OOV**" }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-80", "text": "The proposed acoustic-to-word CTC without OOV model is a hybrid model which uses a word-based CTC as the primary model and a character-based CTC as the auxiliary model." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-81", "text": "The word-based CTC model emits a word sequence, and the output of the character-based CTC is only consulted at the segment where the word-based CTC emits an OOV token." 
}, { "sent_id": "80e3aec943c37927050f97459360b4-C001-82", "text": "Figure 1 gives an example of the hybrid CTC model." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-83", "text": "The hybrid model has four shared hidden LSTM layers, on top of which the word-based CTC and the character-based CTC have individual one hidden LSTM layer and one softmax layer." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-84", "text": "The word-based CTC generates a sequence \"play artist OOV\" while the word sequence from the character-based CTC is \"play artist ratatat\". \"ratatat\" from the character-based CTC is the segment overlapped with the OOV token most, and is then used to replace the OOV token to form the final ASR output of the hybrid CTC as \"play artist ratatat\". replacing the OOV token generated from the word-based CTC with the word generated from the character-based CTC that has the largest time overlap with the OOV token." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-85", "text": "There are two ways to generate the word output sequence from the character-based CTC in step 3.c." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-86", "text": "The first way is to directly take the characters with maximum posteriors and collapse them into words." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-87", "text": "We refer this as the max output decoding." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-88", "text": "However, the character-based CTC without any decoding constraint usually gives very high WER as shown in [29] ." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-89", "text": "The second way is to constrain the character-based CTC to generate only valid words (e.g., only the words in training set) using a character graph as in [36] ." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-90", "text": "In this way, we avoid the character-based CTC generating invalid words." 
}, { "sent_id": "80e3aec943c37927050f97459360b4-C001-91", "text": "Furthermore, since the character graph includes all the words in training data, the rare words that are mapped into OOV in word-based CTC can be recognized by the character-based CTC." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-92", "text": "The character graph also allows us to handle the hotwords that emerge after the model is trained by adding the hot-words into the valid words list and reconstructing the character graph." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-93", "text": "To measure the overlap in step 3.d, we need to define what is the segment corresponding to an output token." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-94", "text": "As blank dominates most of the time in CTC, it is not suitable to use only the frames corresponding to the token spike as the segment, which will be very short." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-95", "text": "Instead, we treat the spike frames as well as all the immediate preceding blank frames as the segment of the token." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-96", "text": "To get the segment corresponding to an OOV token in word-based CTC is very straightforward from the above definition." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-97", "text": "To get the segment corresponding to a word in character-based CTC, we need to first get the segment of each character with the above definition, and then concatenate all the character segments to form the word segment." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-98", "text": "This hybrid CTC model is guaranteed to improve the accuracy of the word-based CTC because it only replaces the OOV tokens generated from the word-based CTC without changing any other word outputs." 
}, { "sent_id": "80e3aec943c37927050f97459360b4-C001-99", "text": "With the shared-hiddenlayer structure, the alignments of words from the word-based CTC and the character-based CTC are well synchronized." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-100", "text": "Because the character-based CTC inside the hybrid CTC can generate any word without revisiting the model training, the hot-words issue can also be solved." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-101", "text": "----------------------------------" }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-102", "text": "**IMPROVE CHARACTER-BASED CTC**" }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-103", "text": "The baseline character-based CTC has 28 outputs: 'a'-'z', space, blank." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-104", "text": "We refer it as the \"28-character set\"." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-105", "text": "We need to generate word sequences from the output of characterbased CTC." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-106", "text": "A word is generated by first reducing repeated characters into one and then combining all the characters except blank between two spaces." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-107", "text": "The word cannot be right if any character gets wrong." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-108", "text": "Using the context information should make the prediction of the characters better." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-109", "text": "Therefore, we add a row-convolution layer on top of the last LSTM layer as" }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-110", "text": "is the activation vector of the last hidden LSTM layer, is the row convolution matrix associate with the cth context hidden vector, and 2C+1 is the total number of context hidden vectors." 
}, { "sent_id": "80e3aec943c37927050f97459360b4-C001-111", "text": "\u0302 is then connected to the last softmax layer to predict characters." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-112", "text": "Different from the row convolution layer in [20] which only uses future hidden vectors, we use both history and future (left and right) hidden vectors to introduce more context information." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-113", "text": "Following [36] , we also construct a new character set by adding additional characters on top of the 28-character set." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-114", "text": "These additional characters include capital letters used in the word-initial position, double-letter units representing repeated characters like ll, apostrophes followed by letters such as 'd, 're etc. Please refer to [36] for more details." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-115", "text": "Altogether such a large unit inventory has 83 characters, and we refer it as the \"83-character set\"." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-116", "text": "----------------------------------" }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-117", "text": "**EXPERIMENTS**" }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-118", "text": "In this section, we use a Microsoft Cortana voice assistant task to evaluate the proposed method." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-119", "text": "The training data consists of 3400 hours of transcribed US-English Cortana audio." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-120", "text": "The test set consists of 3 hours of data from the same Cortana task." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-121", "text": "The audio data is 16k HZ sampled, recorded in mobile environments." 
}, { "sent_id": "80e3aec943c37927050f97459360b4-C001-122", "text": "All experiments were conducted using the computational network toolkit (CNTK) [37] , which allows us to build and evaluate various network structures efficiently without deriving and implementing complicated training algorithms." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-123", "text": "We first built a LSTM model trained with the CE criterion." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-124", "text": "The input consists of 80-dimensional log Mel-filterbank features." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-125", "text": "It has 5 LSTM hidden layers: each has 1024 memory units and the output size of each LSTM layer is reduced to 512 using a linear projection layer [5] ." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-126", "text": "There is no frame stacking, and the output HMM state label is delayed by 5 frames as in [5] ." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-127", "text": "There are totally 5980 tied HMM states." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-128", "text": "This model is denoted as LSTM-CE in Table 1 , with 10.05% word error rate (WER)." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-129", "text": "Because of the latency restriction, we always use uni-directional models in our work." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-130", "text": "Then, we built a phoneme-based LSTM model trained with the CTC criterion, modeling around 6000 tied contextdependent phonemes." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-131", "text": "It has the same 5-layer LSTM structure with projection layer as the previous LSTM-CE model." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-132", "text": "Eight frames of 80-dim log Mel-filter-bank features are stacked together as the input, and the time step shift is three frames as in [17] ." 
}, { "sent_id": "80e3aec943c37927050f97459360b4-C001-133", "text": "Without mentioning explicitly, all the CTC models in this study use the same structure as this model." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-134", "text": "This model is denoted as LSTM-CTC (phoneme) in Table 1 , with 9.87 % WER." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-135", "text": "Both the LSTM-CE model and LSTM-CTC (phoneme) model use a strong 5-gram language model (LM) for decoding." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-136", "text": "The gap between the LSTM-CE model and the LSTM-CTC (phoneme) model is not large, consistent with the recent report [38] ." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-137", "text": "Next, we built an acoustic-to-word CTC model by modeling around 27k most frequent words in the training data." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-138", "text": "These frequent words occurred at least 10 times in the training data." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-139", "text": "All other infrequent words are mapped to an OOV output token." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-140", "text": "This model, LSTM-CTC (word), gets 13.59% WER, among which the OOV token contributes 1.70% WER." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-141", "text": "In other words, if every OOV token can be converted to the right word, the WER will be reduced to 11.89%." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-142", "text": "Note that the WER gap between the phoneme-based CTC and the word-based CTC is not small because the wordbased CTC doesn't use any LM while the phoneme-based CTC uses a very strong LM trained from much larger amount of text than the 3400hr speech transcription." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-143", "text": "The WER gap is consistent with what has been observed in [17] [24] ." 
}, { "sent_id": "80e3aec943c37927050f97459360b4-C001-144", "text": "All the CTC models except the phoneme-based CTC model in this study purely rely on the network score to generate outputs without using LM." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-145", "text": "We use the structure in Figure 1 to build hybrid CTC models." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-146", "text": "The first step is to build character-based CTC models by sharing 4 hidden LSTM layers of the word-based CTC model." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-147", "text": "On top of the shared hidden layers, we add a new LSTM hidden layer and a softmax layer to model character outputs." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-148", "text": "The output units of the character-based CTC can be from either the 28-character set or the 83-character set described in Section 2." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-149", "text": "When training the character-based CTC model, only the added LSTM hidden layer and softmax layer are updated." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-150", "text": "The bottom 4 hidden LSTM layers are not updated because they are shared with the word-based CTC." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-151", "text": "Next, the character-based CTC model is improved with row convolution described in Section 2.3." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-152", "text": "The row convolution operates on 9 frame hidden vectors from the last LSTM layer, with 4 history frames, a central frame, and 4 future frames." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-153", "text": "Table 2 gives the WER of different character-based CTC models." 
}, { "sent_id": "80e3aec943c37927050f97459360b4-C001-154", "text": "The baseline CTC with the 28-character set has 33.79% WER when just using the max output decoding which picks the characters with maximum posteriors and then collapses to words." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-155", "text": "Such a high WER is consistent with what has been observed in other sites [29] ." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-156", "text": "Adding the constraint that only valid words from training set can be generated, the WER is reduced by 10% absolute to 23.87% WER." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-157", "text": "In the following, the default decoding setup of the character-based CTC is with character-graph decoding with the valid word list constraint." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-158", "text": "Clearly, the vanilla character-based CTC is far behind the word-based CTC, and hence can only be used as an auxiliary model." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-159", "text": "By taking 9-frame hidden vector context with row convolution, the character-based CTC can be improved to 20.83% WER." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-160", "text": "Then, the CTC with the 83-character set improves its counterpart with the 28-character set from 23.87% WER to 20.25% WER." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-161", "text": "Finally, the CTC model with the 83-character set and row convolution gets 18.91% WER, still 5% absolute higher than the WER from the word-based CTC." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-162", "text": "The row convolution method can get 12.74% relative WER reduction (from 23.87% WER to 20.83% WER) with the 28-character set, but only gets 6.62% relative WER reduction (from 20.25% WER to 18.91% WER) with the 83character set." 
}, { "sent_id": "80e3aec943c37927050f97459360b4-C001-163", "text": "One reason is that the 83-character set also somehow handles the context information (e.g., with double letters), which is also handled by the row convolution method." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-164", "text": "Table 3 gives several examples showing how the row convolution method helps to improve the WER of the CTC with the 28-character set." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-165", "text": "Without consulting context frames, the CTC model sometimes misses several characters while the row convolution model can emit the right words out based on its context." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-166", "text": "Table 3 : Examples that the CTC with row convolution gets the right recognition result." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-167", "text": "CTC (28-character) CTC (28-character + row convolution) how much one how much money wake me up in a hour wake me up in an hour tell me good joke tell me a good joke Table 4 shows how the CTC with the 83-character set is better than the CTC with the 28-character set with several examples." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-168", "text": "Modeling the double letters helps to win these examples." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-169", "text": "Table 5 gives the WERs of hybrid CTC models." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-170", "text": "When the CTC with 28 characters is used with max output decoding, the hybrid CTC only slightly improves the baseline word-based CTC because such a character-based CTC setup cannot give too much helps due to high WER as shown in Table 2 ." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-171", "text": "When decoding with character graph constrained by valid words, the hybrid CTC obtains around 13.09% WER, with 0.5% absolute WER reduction from the word-based CTC." 
}, { "sent_id": "80e3aec943c37927050f97459360b4-C001-172", "text": "Because the OOV token brings 1.7% absolute WER to the word-based CTC model, this means the hybrid CTC can recover 30% errors introduced by the OOV token." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-173", "text": "It is somehow surprising that although both the row convolution CTC modeling and the CTC with the 83-character set have better WER than the CTC with the 28-character set, neither setup can help the final WER of the hybrid CTC." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-174", "text": "----------------------------------" }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-175", "text": "**13.08**" }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-176", "text": "In Table 6 , we show how the hybrid CTC model performs with some examples." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-177", "text": "The first three are the examples that the hybrid CTC can recover the right words from the OOV token." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-178", "text": "\"azusa\", \"ratatat\", and \"wanna\", all these addresses and names, are the words not in the frequent words in the training set, and haven't been modeled by any output node in the word-based CTC model." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-179", "text": "However, they can be successfully recovered by the character-based CTC." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-180", "text": "The last three rows in Table 6 are the examples that the hybrid CTC still fails to recover the right words from the OOV token. \"margera\" is recognized as \"marger\" by the character-based CTC." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-181", "text": "Such error happens with one character missing, revealing the weakness of character-based CTC. \"purr\" is recognized as \"per\", and \"kristi\" is recognized as \"christi\" by the character-based CTC." 
}, { "sent_id": "80e3aec943c37927050f97459360b4-C001-182", "text": "These errors are homophone errors, which cannot be handled by characterbased CTC unless high level information is blended into the decision." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-183", "text": "Table 5 , it is a little disappointing that neither row convolution nor 83-character set modeling improves the final WER of the hybrid CTC." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-184", "text": "We also examined the results and found that most of times these two methods help to improve the recognition results that the word-based CTC succeeds." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-185", "text": "For the failed cases in Table 6 , they cannot help too much." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-186", "text": "For example, \"margera\" is recognized as \"marger\" by the CTC with the 28-character set, and recognized as \"marera\" by the CTC with the 83-character set and row convolution." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-187", "text": "They also cannot help the homophone error cases." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-188", "text": "Even with better modeling, it is sometimes still very challenging for the character-based CTC to get words right for the cases that the word-based CTC fails." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-189", "text": "----------------------------------" }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-190", "text": "**CONCLUSIONS AND FUTURE WORKS**" }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-191", "text": "In this paper, we have presented a hybrid CTC model that solves the OOV issue and the hot-words issue presented in the acoustic-to-word CTC models, a.k.a. the word-based CTC, by using the output from the word-based CTC as the primary ASR result and consulting the character-based CTC at the segment where the word-based CTC emits an OOV token." 
}, { "sent_id": "80e3aec943c37927050f97459360b4-C001-192", "text": "By only replacing the OOV tokens with the words generated from the character-based CTC, the proposed method is guaranteed to improve the accuracy of the acousticto-word CTC." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-193", "text": "The shared hidden layer structure helps to align the word segments between the word-based CTC and the character-based CTC so that the OOV token lookup algorithm can work." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-194", "text": "Evaluated on a Microsoft Cortana voice assistant task in which the word-based CTC has 1.7% WER introduced by the OOV token, the hybrid CTC model can reduce 0.5% absolute WER, representing a recovery of 30% errors caused by the OOV token." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-195", "text": "Several research issues will be addressed in the future to further increase the effectiveness of the algorithm presented in this paper." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-196", "text": "First, a better character unit set should be considered to improve the accuracy of the character-based CTC model." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-197", "text": "Recently, gram-CTC [39] was proposed to automatically learn the most suitable decomposition of target sequences, which not only boosts the modeling flexibility but also improves the final ASR accuracy." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-198", "text": "We are now trying to incorporate the gram-CTC into our system." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-199", "text": "Second, the character-based CTC has very high WER (around 33%) when using the maximum output decoding." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-200", "text": "We add valid word constraint when generating words from the character-based CTC and bring down its WER to 23% so that the words used to replace OOV tokens are useful." 
}, { "sent_id": "80e3aec943c37927050f97459360b4-C001-201", "text": "However, a characterbased CTC model with decoding constraint is not a clean endto-end model as it still involves expert knowledge." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-202", "text": "We are now pursuing more advanced method which can improve the character-based WER to as low as 18% with the maximum output decoding [40] ." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-203", "text": "Last, with thousand hours of training data, the word-based CTC still has an accuracy gap from the phoneme-based CTC, which has been observed from various sites." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-204", "text": "We found that the word-based CTC can significantly improve the accuracy of the phoneme-based CTC by combining them together, given very different error patterns from these CTC models." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-205", "text": "Therefore, it is meaningful to invest on the word-based CTC even from the production point of view." }, { "sent_id": "80e3aec943c37927050f97459360b4-C001-206", "text": "At the same time, we are working on improving the modeling of word-based CTC so that we can deploy such an end-to-end acoustic-to-word model to production." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "80e3aec943c37927050f97459360b4-C001-18" ], [ "80e3aec943c37927050f97459360b4-C001-21" ], [ "80e3aec943c37927050f97459360b4-C001-24" ], [ "80e3aec943c37927050f97459360b4-C001-37" ], [ "80e3aec943c37927050f97459360b4-C001-62" ], [ "80e3aec943c37927050f97459360b4-C001-69" ] ], "cite_sentences": [ "80e3aec943c37927050f97459360b4-C001-18", "80e3aec943c37927050f97459360b4-C001-21", "80e3aec943c37927050f97459360b4-C001-24", "80e3aec943c37927050f97459360b4-C001-37", "80e3aec943c37927050f97459360b4-C001-62", "80e3aec943c37927050f97459360b4-C001-69" ] }, "@MOT@": { "gold_contexts": [ [ "80e3aec943c37927050f97459360b4-C001-34", "80e3aec943c37927050f97459360b4-C001-35", "80e3aec943c37927050f97459360b4-C001-36", "80e3aec943c37927050f97459360b4-C001-37" ] ], "cite_sentences": [ "80e3aec943c37927050f97459360b4-C001-37" ] }, "@USE@": { "gold_contexts": [ [ "80e3aec943c37927050f97459360b4-C001-132" ] ], "cite_sentences": [ "80e3aec943c37927050f97459360b4-C001-132" ] }, "@SIM@": { "gold_contexts": [ [ "80e3aec943c37927050f97459360b4-C001-143" ] ], "cite_sentences": [ "80e3aec943c37927050f97459360b4-C001-143" ] } } }, "ABC_fb75198b7c9e569932dfd486ba6c0a_30": { "x": [ { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-118", "text": "The document content will be comparable across languages." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-119", "text": "Figure 1 shows an example in English related to global warming." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-2", "text": "Domain portability and adaptation of NLP components and Word Sense Disambiguation systems present new challenges." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-3", "text": "The difficulties found by supervised systems to adapt might change the way we assess the strengths and weaknesses of supervised and knowledgebased WSD systems." 
}, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-4", "text": "Unfortunately, all existing evaluation datasets for specific domains are lexical-sample corpora." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-5", "text": "With this paper we want to motivate the creation of an allwords test dataset for WSD on the environment domain in several languages, and present the overall design of this SemEval task." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-6", "text": "----------------------------------" }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-8", "text": "Word Sense Disambiguation (WSD) competitions have focused on general domain texts, as attested in the last Senseval and Semeval competitions (Kilgarriff, 2001; Mihalcea et al., 2004; Pradhan et al., 2007) ." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-9", "text": "Specific domains pose fresh challenges to WSD systems: the context in which the senses occur might change, distributions and predominant senses vary, some words tend to occur in fewer senses in specific domains, and new senses and terms might be involved." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-10", "text": "Both supervised and knowledge-based systems are affected by these issues: while the first suffer from different context and sense priors, the later suffer from lack of coverage of domain-related words and information." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-11", "text": "Domain adaptation of supervised techniques is a hot issue in Natural Language Processing, including Word Sense Disambiguation." 
}, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-12", "text": "Supervised Word Sense Disambiguation systems trained on general corpora are known to perform worse when applied to specific domains (Escudero et al., 2000; Mart\u00ednez and Agirre, 2000) , and domain adaptation techniques have been proposed as a solution to this problem with mixed results." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-13", "text": "Current research on applying WSD to specific domains has been evaluated on three available lexicalsample datasets (Ng and Lee, 1996; Weeber et al., 2001; Koeling et al., 2005) ." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-14", "text": "This kind of dataset contains hand-labeled examples for a handful of selected target words." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-15", "text": "As the systems are evaluated on a few words, the actual performance of the systems over complete texts can not be measured." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-16", "text": "Differences in behavior of WSD systems when applied to lexical-sample and all-words datasets have been observed on previous Senseval and Semeval competitions (Kilgarriff, 2001; Mihalcea et al., 2004; Pradhan et al., 2007) : supervised systems attain results on the high 80's and beat the most frequent baseline by a large margin for lexical-sample datasets, but results on the all-words datasets were much more modest, on the low 70's, and a few points above the most frequent baseline." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-17", "text": "Thus, the behaviour of WSD systems on domainspecific texts is largely unknown." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-18", "text": "While some words could be supposed to behave in similar ways, and thus be amenable to be properly treated by a generic WSD algorithm, other words have senses closely linked to the domain, and might be disambiguated using purpose-built domain adaptation strategies (cf. 
Section 4)." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-19", "text": "While it seems that domain-specific WSD might be a tougher problem than generic WSD, it might well be that domain-related words are easier to disambiguate." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-20", "text": "The main goal of this task is to provide a multilingual testbed to evaluate WSD systems when faced with full-texts from a specific domain, that of environment-related texts." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-21", "text": "The paper is structured as follows." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-22", "text": "The next section presents current lexical sample datasets for domain-specific WSD." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-23", "text": "Section 3 presents some possible settings for domain adaptation." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-24", "text": "Section 4 reviews the state-of-the art in domain-specific WSD." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-25", "text": "Section 5 presents the design of our task, and finally, Section 6 draws some conclusions." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-26", "text": "----------------------------------" }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-27", "text": "**SPECIFIC DOMAIN DATASETS AVAILABLE**" }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-28", "text": "We will briefly present the three existing datasets for domain-related studies in WSD, which are all lexical-sample." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-29", "text": "The most commonly used dataset is the Defense Science Organization (DSO) corpus (Ng and Lee, 1996) , which comprises sentences from two different corpora." 
}, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-30", "text": "The first is the Wall Street Journal (WSJ), which belongs to the financial domain, and the second is the Brown Corpus (BC) which is a balanced corpora of English usage." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-31", "text": "191 polysemous words (nouns and verbs) of high frequency in WSJ and BC were selected and a total of 192,800 occurrences of these words were tagged with WordNet 1.5 senses, more than 1,000 instances per word in average." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-32", "text": "The examples from BC comprise 78,080 occurrences of word senses, and examples from WSJ consist on 114,794 occurrences." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-33", "text": "In domain adaptation experiments, the Brown Corpus examples play the role of general corpora, and the examples from the WSJ play the role of domain-specific examples." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-34", "text": "Koeling et al. (2005) present a corpus were the examples are drawn from the balanced BNC corpus (Leech, 1992) and the SPORTS and FINANCES sections of the newswire Reuters corpus (Rose et al., 2002) , comprising around 300 examples (roughly 100 from each of those corpora) for each of the 41 nouns." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-35", "text": "The nouns were selected because they were salient in either the SPORTS or FINANCES domains, or because they had senses linked to those domains." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-36", "text": "The occurrences were hand-tagged with the senses from WordNet version 1.7.1 (Fellbaum, 1998) ." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-37", "text": "In domain adaptation experiments the BNC examples play the role of general corpora, and the FINANCES and SPORTS examples the role of two specific domain corpora." 
}, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-38", "text": "Finally, a dataset for biomedicine was developed by Weeber et al. (2001) , and has been used as a benchmark by many independent groups." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-39", "text": "The UMLS Metathesaurus was used to provide a set of possible meanings for terms in biomedical text." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-40", "text": "50 ambiguous terms which occur frequently in MED-LINE were chosen for inclusion in the test set." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-41", "text": "100 instances of each term were selected from citations added to the MEDLINE database in 1998 and manually disambiguated by 11 annotators." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-42", "text": "Twelve terms were flagged as \"problematic\" due to substantial disagreement between the annotators." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-43", "text": "In addition to the meanings defined in UMLS, annotators had the option of assigning a special tag (\"none\") when none of the UMLS meanings seemed appropriate." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-44", "text": "Although these three corpora are useful for WSD research, it is difficult to infer which would be the performance of a WSD system on full texts." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-45", "text": "The corpus of Koeling et al., for instance, only includes words which where salient for the target domains, but the behavior of WSD systems on other words cannot be explored." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-46", "text": "We would also like to note that while the biomedicine corpus tackles scholarly text of a very specific domain, the WSJ part of the DSO includes texts from a financially oriented newspaper, but also includes news of general interest which have no strict relation to the finance domain." 
}, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-47", "text": "----------------------------------" }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-48", "text": "**POSSIBLE SETTINGS FOR DOMAIN ADAPTATION**" }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-49", "text": "When performing supervised WSD on specific domains the first setting is to train on a general domain data set and to test on the specific domain (source setting)." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-50", "text": "If performance would be optimal, this would be the ideal solution, as it would show that a generic WSD system is robust enough to tackle texts from new domains, and domain adaptation would not be necessary." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-51", "text": "The second setting (target setting) would be to train the WSD systems only using examples from the target domain." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-52", "text": "If this would be the optimal setting, it would show that there is no cost-effective method for domain adaptation." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-53", "text": "WSD systems would need fresh examples every time they were deployed in new domains, and examples from general domains could be discarded." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-54", "text": "In the third setting, the WSD system is trained with examples coming from both the general domain and the specific domain." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-55", "text": "Good results in this setting would show that supervised domain adaptation is working, and that generic WSD systems can be supplemented with hand-tagged examples from the target domain." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-56", "text": "There is an additional setting, where a generic WSD system is supplemented with untagged examples from the domain." 
}, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-57", "text": "Good results in this setting would show that semi-supervised domain adaptation works, and that generic WSD systems can be supplemented with untagged examples from the target domain in order to improve their results." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-58", "text": "Most of current all-words generic supervised WSD systems take SemCor (Miller et al., 1993) as their source corpus, i.e. they are trained on SemCor examples and then applied to new examples." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-59", "text": "SemCor is the largest publicly available annotated corpus." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-60", "text": "It's mainly a subset of the Brown Corpus, plus the novel The Red Badge of Courage." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-61", "text": "The Brown corpus is balanced, yet not from the general domain, as it comprises 500 documents drawn from different domains, each approximately 2000 words long." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-62", "text": "Although the Brown corpus is balanced, SemCor is not, as the documents were not chosen at random." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-63", "text": "----------------------------------" }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-64", "text": "**STATE-OF-THE-ART IN WSD FOR SPECIFIC DOMAINS**" }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-65", "text": "Initial work on domain adaptation for WSD systems showed that WSD systems were not able to obtain better results on the source or adaptation settings compared to the target settings (Escudero et al., 2000) , showing that a generic WSD system (i.e. based on hand-annotated examples from a generic corpus) would not be useful when moved to new domains." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-66", "text": "Escudero et al. 
(2000) tested the supervised adaptation scenario on the DSO corpus, which had examples from the Brown Corpus and Wall Street Journal corpus." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-67", "text": "They found that the source corpus did not help when tagging the target corpus, showing that tagged corpora from each domain would suffice, and concluding that hand tagging a large general corpus would not guarantee robust broad-coverage WSD." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-68", "text": "Agirre and Mart\u00ednez (2000) used the same DSO corpus and showed that training on the subset of the source corpus that is topically related to the target corpus does allow for domain adaptation, obtaining better results than training on the target data alone." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-69", "text": "In (Agirre and Lopez de Lacalle, 2008) , the authors also show that state-of-the-art WSD systems are not able to adapt to the domains in the context of the Koeling et al. (2005) dataset." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-70", "text": "While WSD systems trained on the target domain obtained 85.1 and 87.0 of precision on the sports and finances domains, respectively, the same systems trained on the BNC corpus (considered as a general domain corpus) obtained 53.9 and 62.9 of precision on sports and finances, respectively." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-71", "text": "Training on both source and target was inferior that using the target examples alone." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-72", "text": "----------------------------------" }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-73", "text": "**SUPERVISED ADAPTATION**" }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-74", "text": "Supervised adaptation for other NLP tasks has been widely reported." 
}, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-75", "text": "For instance, (Daum\u00e9 III, 2007) shows that a simple feature augmentation method for SVM is able to effectively use both labeled target and source data to provide the best domainadaptation results in a number of NLP tasks." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-76", "text": "His method improves or equals over previously explored more sophisticated methods (Daum\u00e9 III and Marcu, 2006; Chelba and Acero, 2004) ." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-77", "text": "In contrast, ) reimplemented this method and showed that the improvement on WSD in the (Koeling et al., 2005) data was marginal." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-78", "text": "Better results have been obtained using purposebuilt adaptation methods." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-79", "text": "Chan and Ng (2007) performed supervised domain adaptation on a manually selected subset of 21 nouns from the DSO corpus." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-80", "text": "They used active learning, count-merging, and predominant sense estimation in order to save target annotation effort." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-81", "text": "They showed that adding just 30% of the target data to the source examples the same precision as the full combination of target and source data could be achieved." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-82", "text": "They also showed that using the source corpus significantly improved results when only 10%-30% of the target corpus was used for training." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-83", "text": "In followup work (Zhong et Projections for 2100 suggest that temperature in Europe will have risen by between 2 to 6.3 C above 1990 levels." 
}, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-84", "text": "The sea level is projected to rise, and a greater frequency and intensity of extreme weather events are expected." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-85", "text": "Even if emissions of greenhouse gases stop today, these changes would continue for many decades and in the case of sea level for centuries." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-86", "text": "This is due to the historical build up of the gases in the atmosphere and time lags in the response of climatic and oceanic systems to changes in the atmospheric concentration of the gases." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-87", "text": "al., 2008), the feature augmentation approach was combined with active learning and tested on the OntoNotes corpus, on a large domain-adaptation experiment." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-88", "text": "They significantly reduced the effort of hand-tagging, but only obtained positive domainadaptation results for smaller fractions of the target corpus." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-89", "text": "In ) the authors report successful adaptation on the (Koeling et al., 2005 ) dataset on supervised setting." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-90", "text": "Their method is based on the use of unlabeled data, reducing the feature space with SVD, and combination of features using an ensemble of kernel methods." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-91", "text": "They report 22% error reduction when using both source and target data compared to a classifier trained on target the target data alone, even when the full dataset is used." 
}, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-92", "text": "----------------------------------" }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-93", "text": "**SEMI-SUPERVISED ADAPTATION**" }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-94", "text": "There are less works on semi-supervised domain adaptation in NLP tasks, and fewer in WSD task." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-95", "text": "Blitzer et al. (2006) used Structural Correspondence Learning and unlabeled data to adapt a Part-ofSpeech tagger." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-96", "text": "They carefully select so-called pivot features to learn linear predictors, perform SVD on the weights learned by the predictor, and thus learn correspondences among features in both source and target domains." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-97", "text": "Agirre and Lopez de Lacalle (2008) show that methods based on SVD with unlabeled data and combination of distinct feature spaces produce positive semi-supervised domain adaptation results for WSD." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-98", "text": "----------------------------------" }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-99", "text": "**UNSUPERVISED ADAPTATION**" }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-100", "text": "In this context, we take unsupervised to mean Knowledge-Based methods which do not require hand-tagged corpora." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-101", "text": "The predominant sense acquisition method was succesfully applied to specific domains in (Koeling et al., 2005) ." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-102", "text": "The methos has two steps: In the first, a corpus of untagged text from the target domain is used to construct a thesaurus of similar words." 
}, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-103", "text": "In the second, each target word is disambiguated using pairwise WordNet-based similarity measures, taking as pairs the target word and each of the most related words according to the thesaurus up to a certain threshold." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-104", "text": "This method aims to obtain, for each target word, the sense which is the most predominant for the target corpus." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-105", "text": "When a general corpus is used, the most predominant sense in general is obtained, and when a domain-specific corpus is used, the most predominant sense for that corpus is obtained (Koeling et al., 2005) ." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-106", "text": "The main motivation of the authors is that the most frequent sense is a very powerful baseline, but it is one which requires hand-tagging text, while their method yields similar information automatically." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-107", "text": "The results show that they are able to obtain good results." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-108", "text": "In related work, ) report improved results using the same strategy but applying a graph-based WSD method, and highlight the domain-adaptation potential of unsupervised knowledge-based WSD systems compared to supervised WSD." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-109", "text": "----------------------------------" }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-110", "text": "**DESIGN OF THE WSD-DOMAIN TASK**" }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-111", "text": "This task was designed in the context of Kyoto (Piek Vossen and VanGent, 2008) 1 , an AsianEuropean project that develops a community platform for modeling knowledge and finding facts across languages and cultures." 
}, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-112", "text": "The platform operates as a Wiki system with an ontological support that social communities can use to agree on the meaning of terms in specific domains of their interest." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-113", "text": "Kyoto will focus on the environmental domain because it poses interesting challenges for information sharing, but the techniques and platforms will be independent of the application domain." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-114", "text": "Kyoto will make use of semantic technologies based on ontologies and WSD in order to extract and represent relevant information for the domain, and is thus interested on measuring the performance of WSD techniques on this domain." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-115", "text": "The WSD-domain task will comprise comparable all-words test corpora on the environment domain." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-116", "text": "Texts from the European Center for Nature Conservation 2 and Worldwide Wildlife Forum 3 will be used in order to build domain specific test corpora." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-117", "text": "We will select documents that are written for a general but interested public and that involve specific terms from the domain." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-120", "text": "The data will be available in a number of languages: English, Dutch, Italian and Chinese." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-121", "text": "The sense inventories will be based on wordnets of the respective languages, which will be updated to include new vocabulary and senses." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-122", "text": "The test data will comprise three documents of around 2000 words each for each language." 
}, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-123", "text": "The annotation procedure will involve double-blind annotation plus adjudication, and inter-tagger agreement data will be provided." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-124", "text": "The formats and scoring software will follow those of Senseval-3 4 and SemEval-2007 5 English all-words tasks." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-125", "text": "There will not be training data available, but participants are free to use existing hand-tagged corpora and lexical resources (e.g. SemCor and previous Senseval and SemEval data)." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-126", "text": "We plan to make available a corpus of documents from the same domain as the selected documents, as well as wordnets updated to include the terms and senses in the selected documents." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-127", "text": "----------------------------------" }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-128", "text": "**CONCLUSIONS**" }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-129", "text": "Domain portability and adaptation of NLP components and Word Sense Disambiguation systems present new challenges." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-130", "text": "The difficulties found by supervised systems to adapt might change the way we assess the strengths and weaknesses of supervised and knowledge-based WSD systems." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-131", "text": "Unfortunately, all existing evaluation datasets for specific domains are lexical-sample corpora." }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-132", "text": "With this paper we have motivated the creation of an all-words test dataset for WSD on the environment domain in several languages, and presented the overall design of this SemEval task." 
}, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-133", "text": "Further details can be obtained from the Semeval-2010 6 website, our task website 7 , and in our distribution list 8" }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-134", "text": "----------------------------------" }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-135", "text": "**ACKNOWLEDGMENTS**" }, { "sent_id": "fb75198b7c9e569932dfd486ba6c0a-C001-136", "text": "The organization of the task is partially funded by the European Commission (KYOTO FP7 ICT-2007-211423 ) and the Spanish Research Department (KNOW TIN2006-15049-C03-01)." } ], "y": { "@BACK@": { "gold_contexts": [ [ "fb75198b7c9e569932dfd486ba6c0a-C001-13" ], [ "fb75198b7c9e569932dfd486ba6c0a-C001-34" ], [ "fb75198b7c9e569932dfd486ba6c0a-C001-69" ], [ "fb75198b7c9e569932dfd486ba6c0a-C001-77" ], [ "fb75198b7c9e569932dfd486ba6c0a-C001-89" ], [ "fb75198b7c9e569932dfd486ba6c0a-C001-105" ] ], "cite_sentences": [ "fb75198b7c9e569932dfd486ba6c0a-C001-13", "fb75198b7c9e569932dfd486ba6c0a-C001-34", "fb75198b7c9e569932dfd486ba6c0a-C001-69", "fb75198b7c9e569932dfd486ba6c0a-C001-77", "fb75198b7c9e569932dfd486ba6c0a-C001-89", "fb75198b7c9e569932dfd486ba6c0a-C001-105" ] }, "@USE@": { "gold_contexts": [ [ "fb75198b7c9e569932dfd486ba6c0a-C001-101" ] ], "cite_sentences": [ "fb75198b7c9e569932dfd486ba6c0a-C001-101" ] }, "@SIM@": { "gold_contexts": [ [ "fb75198b7c9e569932dfd486ba6c0a-C001-105", "fb75198b7c9e569932dfd486ba6c0a-C001-106" ] ], "cite_sentences": [ "fb75198b7c9e569932dfd486ba6c0a-C001-105" ] } } }, "ABC_332e252e09d28763deb1ded2171c90_30": { "x": [ { "sent_id": "332e252e09d28763deb1ded2171c90-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-2", "text": "Tree-based approaches to alignment model translation as a sequence of probabilistic operations transforming the syntactic parse tree of a sentence in one language into that of the other." 
}, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-3", "text": "The trees may be learned directly from parallel corpora (Wu, 1997), or provided by a parser trained on hand-annotated treebanks (Yamada and Knight, 2001) ." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-4", "text": "In this paper, we compare these approaches on Chinese-English and French-English datasets, and find that automatically derived trees result in better agreement with human-annotated word-level alignments for unseen test data." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-5", "text": "----------------------------------" }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-7", "text": "Statistical approaches to machine translation, pioneered by Brown et al. (1990) , estimate parameters for a probabilistic model of word-to-word correspondences and word re-orderings directly from large corpora of parallel bilingual text." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-8", "text": "In recent years, a number of syntactically motivated approaches to statistical machine translation have been proposed." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-9", "text": "These approaches assign a parallel tree structure to the two sides of each sentence pair, and model the translation process with reordering operations defined on the tree structure." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-10", "text": "The tree-based approach allows us to represent the fact that syntactic constituents tend to move as a unit, as well as systematic differences in word order in the grammars of the two languages." 
}, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-11", "text": "Furthermore, the tree structure allows us to make probabilistic independence assumptions that result in polynomial time algorithms for estimating a translation model from parallel training data, and for finding the highest probability translation given a new sentence." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-12", "text": "Wu (1997) modeled the reordering process with binary branching trees, where each production could be either in the same or in reverse order going from source to target language." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-13", "text": "The trees of Wu's Inversion Transduction Grammar were derived by synchronously parsing a parallel corpus, using a grammar with lexical translation probabilities at the leaves and a simple grammar with a single nonterminal providing the tree structure." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-14", "text": "While this grammar did not represent traditional syntactic categories such as verb phrases and noun phrases, it served to restrict the word-level alignments considered by the system to those allowable by reordering operations on binary trees." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-15", "text": "This restriction corresponds to intuitions about the alignments that could be produced by systematic differences between the two languages' grammars, and allows for a polynomial time algorithm for finding the highest-probability alignment, and for re-estimation of the lexical translation and grammar probabilities using the Expectation Maximization algorithm." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-16", "text": "Yamada and Knight (2001) present an algorithm for estimating probabilistic parameters for a similar model which represents translation as a sequence of re-ordering operations over children of nodes in a syntactic tree, using automatic parser output for the initial tree structures." 
}, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-17", "text": "This gives the translation model more information about the structure of the source language, and further constrains the reorderings to match not just a possible bracketing as in Wu (1997) , but the specific bracketing of the parse tree provided." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-18", "text": "In this paper, we make a direct comparison of a syntactically unsupervised alignment model, based on Wu (1997) , with a syntactically supervised model, based on Yamada and Knight (2001) ." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-19", "text": "We use the term syntactically supervised to indicate that the syntactic structure in one language is given to the training procedure." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-20", "text": "It is important to note, however, that both algorithms are unsupervised in that they are not provided any hand-aligned training data." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-21", "text": "Rather, they both use Expectation Maximization to find an alignment model by iteratively improving the likelihood assigned to unaligned parallel sentences." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-22", "text": "Our evaluation is in terms of agreement with word-level alignments created by bilingual human annotators." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-23", "text": "We describe each of the models used in more detail in the next two sections, including the clone operation of Gildea (2003) ." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-24", "text": "The reader who is familiar with these models may proceed directly to our experiments in Section 4, and further discussion in Section 5." 
}, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-25", "text": "----------------------------------" }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-26", "text": "**THE INVERSION TRANSDUCTION GRAMMAR**" }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-27", "text": "The Inversion Transduction Grammar of Wu (1997) can be thought of as a generative process which simultaneously produces strings in both languages through a series of synchronous context-free grammar productions." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-28", "text": "The grammar is restricted to binary rules, which can have the symbols in the right-hand side appear in the same order in both languages, represented with square brackets: X \u2192 [Y Z]" }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-29", "text": "or the symbols may appear in reverse order in the two languages, indicated by angle brackets: X \u2192 \u27e8Y Z\u27e9" }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-30", "text": "Individual lexical translations between English words e and French words f take place at the leaves of the tree, generated by grammar rules with a single right-hand side symbol in each language: X \u2192 e/f" }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-31", "text": "Given a bilingual sentence pair, a synchronous parse can be built using a two-dimensional extension of chart parsing, where chart items are indexed by their nonterminal Y and beginning and ending positions l, m in the source language string, and beginning and ending positions i, j in the target language string." 
}, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-32", "text": "For Expectation Maximization training, we compute inside probabilities \u03b2(Y, l, m, i, j) from the bottom up as outlined below:" }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-33", "text": "A similar recursion is used to compute outside probabilities for each chart item, and the inside and outside probabilities are combined to derive expected counts for occurrence of each grammar rule, including the rules corresponding to individual lexical translations." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-34", "text": "In our experiments we use a grammar with a start symbol S, a single preterminal C, and two nonterminals A and B used to ensure that only one parse can generate any given word-level alignment (ignoring insertions and deletions) (Wu, 1997; Zens and Ney, 2003) ." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-35", "text": "The individual lexical translations produced by the grammar may include a NULL word on either side, in order to represent insertions and deletions." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-36", "text": "----------------------------------" }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-37", "text": "**THE TREE-TO-STRING MODEL**" }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-38", "text": "The model of Yamada and Knight (2001) can be thought of as a generative process taking a tree in one language as input and producing a string in the other through a sequence of probabilistic operations." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-39", "text": "If we follow the process of an English sentence's transformation into French, the English sentence is first given a syntactic tree representation by a statistical parser (Collins, 1999) ." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-40", "text": "As the first step in the translation process, the children of each node in the tree can be re-ordered." 
}, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-41", "text": "For any node with m children, m! re-orderings are possible, each of which is assigned a probability P_order conditioned on the syntactic categories of the parent node and its children." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-42", "text": "As the second step, French words can be inserted at each node of the parse tree." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-43", "text": "Insertions are modeled in two steps, the first predicting whether an insertion to the left, an insertion to the right, or no insertion takes place with probability P_ins, conditioned on the syntactic category of the node and that of its parent." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-44", "text": "The second step is the choice of the inserted word, with probability P_t(f|NULL), which is predicted without any conditioning information." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-45", "text": "The final step, a French translation of each original English word, at the leaves of the tree, is chosen according to a distribution P_t(f|e)." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-46", "text": "The French word is predicted conditioned only on the English word, and each English word can generate at most one French word, or can generate a NULL symbol, representing deletion." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-47", "text": "Given the original tree, the re-ordering, insertion, and translation probabilities at each node are independent of the choices at any other node." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-48", "text": "These independence relations are analogous to those of a stochastic context-free grammar, and allow for efficient parameter estimation by an inside-outside Expectation Maximization algorithm." 
}, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-49", "text": "The computation of inside probabilities \u03b2, outlined below, considers possible reorderings of nodes in the original tree in a bottom-up manner:" }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-50", "text": "for all nodes \u03b5_i in input tree T do; for all k, l such that 1 < k < l < N do; for all orderings \u03c1 of the children \u03b5_1...\u03b5_m of \u03b5_i do; for all partitions of span k, l into subspans do. As with the Inversion Transduction Grammar, many alignments between source and target sentences are not allowed." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-51", "text": "As a minimal example, take the tree:" }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-52", "text": "Of the six possible re-orderings of the three terminals, the two which would involve crossing the bracketing of the original tree (XZY and YZX) are not allowed." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-53", "text": "While this constraint gives us a way of using syntactic information in translation, it may in many cases be too rigid." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-54", "text": "In part to deal with this problem, Yamada and Knight (2001) flatten the trees in a pre-processing step by collapsing nodes with the same lexical head-word." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-55", "text": "This allows, for example, an English subject-verb-object (SVO) structure, which is analyzed as having a VP node spanning the verb and object, to be re-ordered as VSO in a language such as Arabic." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-56", "text": "Larger syntactic divergences between the two trees may require further relaxation of this constraint, and in practice we expect such divergences to be frequent." 
}, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-57", "text": "For example, a nominal modifier in one language may show up as an adverbial in the other, or, due to choices such as which information is represented by a main verb, the syntactic correspondence between the two sentences may break down completely." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-58", "text": "While having flatter trees can make more reorderings possible than with the binary Inversion Transduction Grammar trees, fixing the tree in one language generally has a much stronger opposite effect, dramatically restricting the number of permissible alignments." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-59", "text": "----------------------------------" }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-60", "text": "**TREE-TO-STRING WITH CLONING**" }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-61", "text": "In order to provide more flexibility in alignments, a cloning operation was introduced for tree-to-string alignment by Gildea (2003) ." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-62", "text": "The model is modified to allow for a copy of a (translated) subtree from the English sentence to occur, with some cost, at any point in the resulting French sentence." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-63", "text": "For example, in the case of the input tree from the previous section, this operation, combined with the deletion of the original node Z, produces the alignment (XZY) that was disallowed by the original tree reordering model." 
}, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-64", "text": "The probability of adding a clone of original node \u03b5_i as a child of node \u03b5_j is calculated in two steps: first, the choice of whether to insert a clone under \u03b5_j, with probability P_ins(clone|\u03b5_j), and second, the choice of which original node to copy, with probability P_clone(\u03b5_i) = P_makeclone(\u03b5_i) / \u03a3_k P_makeclone(\u03b5_k)" }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-65", "text": "where P_makeclone is the probability of an original node producing a copy." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-66", "text": "In our implementation, P_ins(clone) is estimated by the Expectation Maximization algorithm conditioned on the label of the parent node \u03b5_j, and P_makeclone is a constant, meaning that the node to be copied is chosen from all the nodes in the original tree with uniform probability." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-67", "text": "----------------------------------" }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-68", "text": "**EXPERIMENTS**" }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-69", "text": "We trained our translation models on a parallel corpus of Chinese-English newswire text." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-70", "text": "We restricted ourselves to sentences of no more than 25 words in either language, resulting in a training corpus of 18,773 sentence pairs with a total of 276,113 Chinese words and 315,415 English words." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-71", "text": "The Chinese data were automatically segmented into tokens, and English capitalization was retained." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-72", "text": "We replaced words occurring only once with an unknown word token, resulting in a Chinese vocabulary of 23,783 words and an English vocabulary of 27,075 words." 
}, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-73", "text": "Our hand-aligned data consisted of 48 sentence pairs also with less than 25 words in either language, for a total of 788 English words and 580 Chinese words." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-74", "text": "A separate development set of 49 sentence pairs was used to control overfitting." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-75", "text": "These sets were the data used by Hwa et al. (2002) ." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-76", "text": "The hand-aligned test data consisted of 745 individual aligned word pairs." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-77", "text": "Words could be aligned one-to-many in either direction." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-78", "text": "This limits the performance achievable by our models; the IBM models allow one-to-many alignments in one direction only, while the tree-based models allow only one-to-one alignment unless the cloning operation is used." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-79", "text": "Our French-English experiments were based on data from the Canadian Hansards made available by Ulrich Germann." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-80", "text": "We used as training data 20,000 sentence pairs of no more than 25 words in either language." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-81", "text": "Our test data consisted of 447 sentence pairs of no more than 30 words, hand-aligned by Och and Ney (2000) ." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-82", "text": "A separate development set of 37 sentences was used to control overfitting." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-83", "text": "We used a vocabulary of words occurring at least 10 times in the entire Hansard corpus, resulting in 19,304 English words and 22,906 French words." 
}, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-84", "text": "Our test set is that used in the alignment evaluation organized by Mihalcea and Pedersen (2003) , though we retained sentence-initial capitalization, used a closed vocabulary, and restricted ourselves to a smaller training corpus." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-85", "text": "We parsed the English side of the data with the Collins parser." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-86", "text": "As an artifact of the parser's probability model, it outputs sentence-final punctuation attached at the lowest level of the tree." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-87", "text": "We raised sentence-final punctuation to be a daughter of the tree's root before training our parse-based model." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-88", "text": "As our Chinese-English test data did not include sentence-final punctuation, we also removed it from our French-English test set." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-89", "text": "We evaluate our translation models in terms of agreement with human-annotated word-level alignments between the sentence pairs." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-90", "text": "For scoring the Viterbi alignments of each system against gold-standard annotated alignments, we use the alignment error rate (AER) of Och and Ney (2000), which measures agreement at the level of pairs of words: AER = 1 - (|A \u2229 G_S| + |A \u2229 G_P|) / (|A| + |G_S|)" }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-91", "text": "where A is the set of word pairs aligned by the automatic system, G_S is the set marked in the gold standard as \"sure\", and G_P is the set marked as \"possible\" (including the \"sure\" pairs)." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-92", "text": "In our Chinese-English data, only one type of alignment was marked, meaning that G_P = G_S ." 
}, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-93", "text": "For a better understanding of how the models differ, we break this figure down into precision: |A \u2229 G_P| / |A|" }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-94", "text": "and recall: |A \u2229 G_S| / |G_S|" }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-95", "text": "Since none of the systems presented in this comparison make use of hand-aligned data, they may differ in the overall proportion of words that are aligned, rather than inserted or deleted." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-96", "text": "This affects the precision/recall tradeoff; better results with respect to human alignments may be possible by adjusting an overall insertion probability in order to optimize AER." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-97", "text": "Table 1 provides a comparison of results using the tree-based models with the word-level IBM models." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-98", "text": "IBM Models 1 and 4 refer to Brown et al. (1993) ." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-99", "text": "We used the GIZA++ package, including the HMM model of Och and Ney (2000) ." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-100", "text": "We ran Model 1 for three iterations, then the HMM model for three iterations, and finally Model 4 for two iterations, training each model until AER began to increase on our held-out cross validation data." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-101", "text": "\"Inversion Transduction Grammar\" (ITG) is the model of Wu (1997) , \"Tree-to-String\" is the model of Yamada and Knight (2001) , and \"Tree-to-String, Clone\" allows the node cloning operation described above." 
}, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-102", "text": "Our tree-based models were initialized from uniform distributions for both the lexical translation probabilities and the tree reordering operations, and were trained until AER began to rise on our held-out cross-validation data, which turned out to be four iterations for the tree-to-string models and three for the Inversion Transduction Grammar." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-103", "text": "French-English results are shown in Table 2 ." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-104", "text": "Here, IBM Model 1 was trained for 12 iterations, then the HMM model for 5 iterations and Model 4 for 5 iterations." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-105", "text": "The ITG and tree-to-string models were both trained for 5 iterations." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-106", "text": "A learning curve for the Inversion Transduction Grammar is shown in Figure 1 , plotting both perplexity on held-out data and alignment error rate." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-107", "text": "In general we found that while all models would increase in AER if trained for too many iterations, the increases were of only a few percent." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-108", "text": "----------------------------------" }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-109", "text": "**DISCUSSION**" }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-110", "text": "The Inversion Transduction Grammar significantly outperforms the syntactically supervised tree-to-string model of Yamada and Knight (2001) ." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-111", "text": "The tree-to-string and IBM models are roughly equivalent." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-112", "text": "Adding the cloning operation improves tree-to-string results by 2% precision and recall." 
}, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-113", "text": "It is particularly significant that the ITG gets higher recall than the other models, given that it is the only model entirely limited to one-to-one alignments, which bounds the maximum recall it can achieve." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-114", "text": "Our French-English experiments show only small differences between the various systems." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-115", "text": "Overall, performance on French-English is much better than for Chinese-English." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-116", "text": "French-English has less reordering overall, as shown by the percentage of productions in the Viterbi ITG parses that are inverted: 14% for French-English in comparison to 23% for Chinese-English." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-117", "text": "One possible explanation for our results is parser error." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-118", "text": "While we describe our system as \"syntactically supervised\", in fact this supervision comes in the form of the annotation of the Wall Street Journal treebank on which the parser is trained, rather than parses for our parallel training corpus." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-119", "text": "In particular, the text we are parsing has a different vocabulary and style of prose from the WSJ treebank, and often the fluency of the English translations leaves something to be desired." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-120", "text": "While both corpora consist of newswire text, a typical WSJ sentence, \u201cPierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29,\u201d" 
}, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-121", "text": "contrasts dramatically with" }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-122", "text": "\u201cIn the past when education on opposing Communists and on resisting Russia was stressed, retaking the mainland and unifying China became a slogan for the authoritarian system, which made the unification under the martial law a tool for oppressing the Taiwan people,\u201d" }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-123", "text": "a typical sentence from our corpus." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-124", "text": "While we did not have human-annotated gold-standard parses for our training data, we did have human annotated parses for the Chinese side of our test data, which was taken from the Penn Chinese Treebank (Xue et al., 2002) ." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-125", "text": "We trained a second tree-to-string model in the opposite direction, using Chinese trees and English strings." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-126", "text": "The Chinese training data was parsed with the Bikel (2002) parser, and we used the Chinese Treebank parses for our test data." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-127", "text": "Results are shown in Table 3 ." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-128", "text": "Because the ITG is a symmetric, generative model, the ITG results in Table 3 are identical to those in Table 1 ." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-129", "text": "While the experiment does not show a significant improvement, it is possible that better parses for the training data might be equally important." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-130", "text": "Even when the automatic parser output is correct, the tree structure of the two languages may not correspond." 
}, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-131", "text": "Dorr (1994) categorizes sources of syntactic divergence between languages, and Fox (2002) analyzes a parallel French-English corpus, quantifying how often parse dependencies cross when projecting an English tree onto a French string." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-132", "text": "Even in this closely related language pair with generally similar word order, crossed dependencies were caused by such common occurrences as adverb modification of a verb, or the correspondence of \"not\" to \"ne pas\"." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-133", "text": "Galley et al. (2004) extract translation rules from a large parsed parallel corpus that extend in scope to tree fragments beyond a single node; we believe that adding such larger-scale operations to the translation model is likely to significantly improve the performance of syntactically supervised alignment." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-134", "text": "Yamada and Knight (2002) found the syntactically supervised model to outperform the IBM word-level alignment models of Brown et al. (1993) for translation." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-135", "text": "An evaluation for the alignment task, measuring agreement with human judges, also found the syntax-based model to outperform the IBM models." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-136", "text": "However, a relatively small corpus was used to train both models (2121 Japanese-English sentence pairs), and the evaluations were performed on the same data used for training, meaning that one or both models might be significantly overfitting." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-137", "text": "Zens and Ney (2003) provide a thorough analysis of alignment constraints from the perspective of decoding algorithms." 
}, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-138", "text": "They train the models of Wu (1993) ." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-139", "text": "Decoding, meaning exact computation of the highest probability translation given a foreign sentence, is not possible in polynomial time for the IBM models, and in practice decoders search through the space of hypothesis translations using a set of additional, hard alignment constraints." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-140", "text": "Zens and Ney (2003) compute the viterbi alignments for German-English and French-English sentences pairs using IBM Model 5, and then measure how many of the resulting alignments fall within the hard constraints of both Wu (1997) and Berger et al. (1996) ." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-141", "text": "They find higher coverage for an extended version of ITG than for the IBM decoding constraint for both language pairs, with the unmodified ITG implementation covering about the same amount of German-English data as IBM, and significantly less French-English data." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-142", "text": "These results show promise for ITG as a basis for efficient decoding, but do not address which model best aligns the original training data, as IBMderived alignments were taken as the gold standard, rather than human alignments." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-143", "text": "We believe that our results show that syntactically-motivated models are a promising general approach to training translation models as well to searching through the resulting probability space." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-144", "text": "Computational complexity is an issue for the treebased models presented here." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-145", "text": "While training the IBM models with the GIZA++ software takes minutes, the tree-based EM takes hours." 
}, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-146", "text": "With our C implementation, one iteration of the syntactically supervised model takes 50 CPU hours, which can be parallelized across machines." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-147", "text": "Our tree-based models are estimated with complete EM, while the training procedure for the IBM models samples from a number of likely alignments when accumulating expected counts." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-148", "text": "Because not every alignment is legal with the tree-based models, the technique of sampling by choosing likely alignments according to a simpler model is not straightforward." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-149", "text": "Nonetheless, we feel that training times can be improved with the right pruning and sampling techniques, as will be necessary to train on the much larger amounts data now available, and on longer sentences." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-150", "text": "----------------------------------" }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-151", "text": "**CONCLUSION**" }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-152", "text": "We present a side-by-side comparison of syntactically supervised and unsupervised tree-based alignment, along with the non tree-based IBM Model 4." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-153", "text": "For Chinese-English, using trees helps the alignment task, but a data-derived tree structure gives better results than projecting automatic English parser output onto the Chinese string." }, { "sent_id": "332e252e09d28763deb1ded2171c90-C001-154", "text": "The FrenchEnglish task is easier overall, and exhibits smaller differences between the systems." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "332e252e09d28763deb1ded2171c90-C001-3" ], [ "332e252e09d28763deb1ded2171c90-C001-12" ], [ "332e252e09d28763deb1ded2171c90-C001-17" ], [ "332e252e09d28763deb1ded2171c90-C001-27" ], [ "332e252e09d28763deb1ded2171c90-C001-101" ], [ "332e252e09d28763deb1ded2171c90-C001-140" ] ], "cite_sentences": [ "332e252e09d28763deb1ded2171c90-C001-3", "332e252e09d28763deb1ded2171c90-C001-12", "332e252e09d28763deb1ded2171c90-C001-17", "332e252e09d28763deb1ded2171c90-C001-27", "332e252e09d28763deb1ded2171c90-C001-101", "332e252e09d28763deb1ded2171c90-C001-140" ] }, "@USE@": { "gold_contexts": [ [ "332e252e09d28763deb1ded2171c90-C001-3", "332e252e09d28763deb1ded2171c90-C001-4" ], [ "332e252e09d28763deb1ded2171c90-C001-18" ], [ "332e252e09d28763deb1ded2171c90-C001-34" ], [ "332e252e09d28763deb1ded2171c90-C001-100", "332e252e09d28763deb1ded2171c90-C001-101" ] ], "cite_sentences": [ "332e252e09d28763deb1ded2171c90-C001-3", "332e252e09d28763deb1ded2171c90-C001-18", "332e252e09d28763deb1ded2171c90-C001-34", "332e252e09d28763deb1ded2171c90-C001-101" ] } } }, "ABC_4a7fecf3b80c274739e9c83be9a36b_30": { "x": [ { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-7", "text": "----------------------------------" }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-72", "text": "----------------------------------" }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-73", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-2", "text": "Previous work on bridging anaphora recognition (Hou et al., 2013a) casts the problem as a subtask of learning fine-grained information status (IS)." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-3", "text": "However, these systems heavily depend on many hand-crafted linguistic features." 
}, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-4", "text": "In this paper, we propose a discourse context-aware self-attention neural network model for fine-grained IS classification." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-5", "text": "On the ISNotes corpus (Markert et al., 2012) , our model with the contextually-encoded word representations (BERT) (Devlin et al., 2018) achieves new state-of-the-art performances on fine-grained IS classification, obtaining a 4.1% absolute overall accuracy improvement compared to Hou et al. (2013a) ." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-6", "text": "More importantly, we also show an improvement of 3.9% F1 for bridging anaphora recognition without using any complex hand-crafted semantic features designed for capturing the bridging phenomenon." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-9", "text": "Information Structure (Halliday, 1967; Prince, 1981 Prince, , 1992 Gundel et al., 1993; Lambrecht, 1994; Birner and Ward, 1998; Kruijff-Korbayov\u00e1 and Steedman, 2003 ) studies structural and semantic properties of a sentence according to its relation to the discourse context." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-10", "text": "Information structure affects how discourse entities are referred to in a text, which is known as Information Status (Halliday, 1967; Prince, 1981; Nissim et al., 2004) ." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-11", "text": "Specifically, information status (IS henceforth) reflects the accessibility of a discourse entity based on the evolving discourse context and the speaker's assumption about the hearer's knowledge and beliefs." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-12", "text": "For instance, according to Markert et al. 
(2012) , old mentions 1 refer to entities that have been referred to previously; mediated men-tions have not been mentioned before but are accessible to the hearer by reference to another old mention or to prior world knowledge; and new mentions refer to entities that are introduced to the discourse for the first time and are not known to the hearer before." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-13", "text": "In this paper, we follow the IS scheme proposed by Markert et al. (2012) and focus on learning finegrained IS on written texts." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-14", "text": "A mention's semantic and syntactic properties can signal its information status." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-15", "text": "For instance, indefinite NPs tend to be new and pronouns are likely to be old." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-16", "text": "Moreover, referential patterns of how a mention is referred to in a sentence also affect this mention's IS." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-17", "text": "In Example 1, \"Friends\" is a bridging anaphor even if we do not know the antecedent (i.e., she); while the information status for \"Friends\" in Example 2 is mediated/worldKnowledge." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-18", "text": "Section 3.1 analyzes the characteristics of each IS category and the relations between IS and discourse context." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-19", "text": "(1) She made money, but spent more." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-20", "text": "Friends pitched in." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-21", "text": "(2) Friends are part of the glue that holds life and faith together." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-22", "text": "In this work, we propose a discourse contextaware self-attention neural network model for fine-grained IS classification." 
}, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-23", "text": "We find that the sentence containing the target mention as well as the lexical overlap information between the target mention and the preceding mentions are the most important discourse context when assigning IS for a mention." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-24", "text": "With self-attention, our model can capture important signals within a mention and the interactions between the mention and its context." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-25", "text": "On the ISNotes corpus (Markert et al., 2012) , our model with the contextually-encoded word representations (BERT) (Devlin et al., 2018) achieves new state-of-the-art performances on fine-grained IS classification, obtaining a 4.1% absolute overall accuracy improvement compared to Hou et al. (2013a) ." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-26", "text": "More importantly, we also show an improvement of 3.9% F1 for bridging anaphora recognition without using any sophisticated handcrafted semantic features." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-27", "text": "----------------------------------" }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-28", "text": "**RELATED WORK**" }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-29", "text": "IS classification and bridging anaphora recognition." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-30", "text": "Bridging resolution (Hou et al., 2014 contains two sub tasks: identifying bridging anaphors (Markert et al., 2012; Hou et al., 2013a; Hou, 2016) and finding the correct antecedent among candidates (Hou et al., 2013b; Hou, 2018a,b) ." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-31", "text": "Previous work handle bridging anaphora recognition as part of IS classification problem." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-32", "text": "Markert et al. (2012) et al. 
(2013a) regarding the overall IS classificiation accuracy but the result on bridging anaphora recognition is much worse than Hou et al. (2013a) ." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-33", "text": "Rahman and Ng (2012) incorporated carefully designed rules into an SVM multiclass algorithm for IS classification on the Switchboard dialogue corpus (Nissim et al., 2004) ." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-34", "text": "Cahill and Riester (2012) trained a CRF model with syntactic and surface features for fine-grained IS classification on the German DIRNDL radio news corpus (Riester et al., 2010) 2 ." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-35", "text": "Different from the above mentioned work, we do not use any complicated hand-crafted features and our model improves the previous state-of-theart results on both overall IS classification accuracy and bridging recognition by a large margin on the ISNotes corpus." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-36", "text": "Self-attention." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-37", "text": "Recently, multi-head selfattention encoder (Ashish et al., 2017) has been shown to perform well in various NLP tasks, including semantic role labelling (Strubell et al., 2018) , question answering and natural language inference (Devlin et al., 2018) ." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-38", "text": "In our model, we create a \"pseudo sentence\" for each mention and apply the transformer encoder for our task." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-39", "text": "The self-attention mechanism allows our model to attend to both the context and the mention itself for clues which are helpful for predicting the mention's IS." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-40", "text": "Fine-tuning with contextual word embeddings." 
}, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-41", "text": "Recent work (Peters et al., 2018; Howard and Ruder, 2018; Devlin et al., 2018) have shown that a range of downstream NLP tasks benefit from fine-tuning task-specific parameters with pre-trained contextual word representations." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-42", "text": "Our work belongs to this category and we fine-tune our model based on BERT BASE representations (Devlin et al., 2018) ." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-43", "text": "----------------------------------" }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-44", "text": "**APPROACH**" }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-45", "text": "----------------------------------" }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-46", "text": "**INFORMATION STATUS AND DISCOURSE CONTEXT**" }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-47", "text": "The IS scheme proposed by Markert et al. (2012) adopts three major IS categories (old, new and mediated) from Nissim et al. (2004) and distinguishes six subcategories for mediated." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-48", "text": "Table 1 lists the definitions for these IS categories and summarizes the main affecting factors for each IS class." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-49", "text": "As described in Section 1, a mention's internal syntactic and semantic properties can signal its IS." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-50", "text": "For instance, a mention containing a possessive pronoun modifier is likely to be mediated/syntactic (e.g., their father); and a mediated/comparative mention often contains a premodifier indicating that this entity is compared to another preceding entity (e.g., further attacks)." 
}, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-51", "text": "In addition, for some IS classes, the \"local context\" (the sentence s which contains the target mention) and \"previous context\" (sentences from the discourse which precede s) play an important role when assigning IS to a mention." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-52", "text": "Example 1 and Example 2 in Section 1 demonstrate the role of the local context for IS." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-53", "text": "And sometimes we need to look at the previous context when deciding IS for a mention." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-54", "text": "In Example 3, without looking at the previous context, we tend to think the IS for \"Poland\" in the second sentence is mediated/WorldKnowledge." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-55", "text": "Here the correct IS for \"Poland\" is old because it is mentioned before in the previous context." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-56", "text": "----------------------------------" }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-57", "text": "**IS CLASSIFICATION WITH DISCOURSE**" }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-58", "text": "Context-Aware Self-Attention" }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-59", "text": "To account for the different factors described in the previous section when predicting IS for a mention, we create a novel \"pseudo sentence\" for each mention and apply the multi-head self-attention encoder (Ashish et al., 2017) for this sentence." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-60", "text": "Figure 1 depicts the high-level structure of our model." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-61", "text": "The pseudo sentence consists of five parts: previous overlap info, local context, the delimiter token \"[delimiter]\", the content of the target mention, and the IS prediction token \"[IS]\"." 
}, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-62", "text": "The previous overlap info part contains two tokens, which indicate whether the target mention has the same string/head with a mention from the preceding sentences. And the local context is the sentence containing the target mention." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-63", "text": "The final prediction is made based on the hidden state of the prediction token \"[IS]\"." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-64", "text": "In principle, the structure of the pseudo sentence and the mechanism of multi-head self-attention help the model to learn the important cues from both the mention and its discourse context when predicting IS." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-65", "text": "----------------------------------" }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-66", "text": "**MODEL PARAMETERS**" }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-67", "text": "Our context-aware self-attention model has 12 transformer blocks, 768 hidden units, and 12 selfattention heads." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-68", "text": "We first initialize our model using BERT BASE , then fine-tune the model for 3 epochs with the learning rate of 5e \u2212 5." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-69", "text": "During training and testing, the max token size of the pseudo sentence is set as 128." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-70", "text": "----------------------------------" }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-71", "text": "**EXPERIMENTS**" }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-74", "text": "We perform experiments on the ISNotes corpus (Markert et al., 2012) , which contains 10,980 mentions annotated for information status in 50 news texts." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-75", "text": "Following Hou et al. 
(2013a) , all experiments are performed via 10-fold cross-validation on documents." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-76", "text": "We report overall accuracy as well as precision, recall and F-measure per IS class." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-77", "text": "In the following, we describe the baselines as well as our model with different settings." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-78", "text": "collective (baseline1)." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-79", "text": "Hou et al. (2013a) applied collective classification to account for the linguistic relations among IS categories." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-80", "text": "They explored a wide range of features (34 in total), including a large number of lexico-semantic features (for recognizing bridging) as well as a couple of surface features and syntactic features." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-81", "text": "cascaded collective (baseline2)." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-82", "text": "This is the cascading minority preference system for bridging anaphora recognition from Hou et al. (2013a) ." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-83", "text": "self-attention wo context." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-84", "text": "We apply our model (see Section 3) on the pseudo sentences containing only the target mentions and the prediction token \"[IS]\"." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-85", "text": "self-attention with context I. Based on selfattention wo context, we add the local context in the pseudo sentences." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-86", "text": "self-attention with context II." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-87", "text": "Based on selfattention with context I, we add the previous overlap info part in the pseudo sentences." 
}, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-88", "text": "Table 3 shows the results of our models compared to the baselines." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-89", "text": "Surprisingly, our model considering only the content of mentions (self-attention wo context) achieves competitive results as the baseline cascade collective which explores many handcrafted linguistic features." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-90", "text": "Also self-attention wo context outperforms the two baselines on several IS categories (m/syntactic, m/aggregate, m/comparative, m/bridging and new)." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-91", "text": "In Section" }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-92", "text": "----------------------------------" }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-93", "text": "**RESULTS AND DISCUSSION**" }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-94", "text": "3.1, we analyze that m/syntactic and m/aggregate are often signaled by mentions' internal syntactic structures, and that the semantics of certain premodifiers is a strong signal for m/comparative." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-95", "text": "The improvements on these categories show that our model can capture the semantic/syntactic properties of a mention when predicting its IS." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-96", "text": "The continuous improvements on self-attention with context I and self-attention with context II show the impact of the local context and the previous context on IS prediction, respectively." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-97", "text": "It seems that the local context has more impact on m/bridging and new, whereas the previous context has more impact on old and m/worldKnowledge." 
}, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-98", "text": "For self-attention with context I, we also tried to add the previous k sentences (k = 1 and k = 2) into the current local context to see whether the broader local context can help us to capture bridging better." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-99", "text": "However, we found that the overall results on both settings are similar as the current one." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-100", "text": "Overall, we achieve the new state-of-the-art results on bridging anaphora recognition with the local context model (self-attention with context I). And our full model (self-attention with context II) achieves an overall accuracy of 83% on IS classification, obtaining a 4.1% and 4.4% absolute improvements over the two baselines (collective and cascade collective), respectively." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-101", "text": "Our full model beats the two strong baselines on most IS categories except m/function." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-102", "text": "This is because there are only 65 m/function mentions in ISNotes." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-103", "text": "With such a small amount of training data, it is hard for our model to learn patterns for this category." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-104", "text": "----------------------------------" }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-105", "text": "**CONCLUSIONS**" }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-106", "text": "We develop a discourse context-aware selfattention model for IS classification." }, { "sent_id": "4a7fecf3b80c274739e9c83be9a36b-C001-107", "text": "Our model does not contain any complex hand-crafted semantic features and achieves the new state-of-the-art results for IS classification and bridging anaphora recognition on ISNotes." 
} ], "y": { "@USE@": { "gold_contexts": [ [ "4a7fecf3b80c274739e9c83be9a36b-C001-5" ], [ "4a7fecf3b80c274739e9c83be9a36b-C001-13" ], [ "4a7fecf3b80c274739e9c83be9a36b-C001-25" ], [ "4a7fecf3b80c274739e9c83be9a36b-C001-74" ] ], "cite_sentences": [ "4a7fecf3b80c274739e9c83be9a36b-C001-5", "4a7fecf3b80c274739e9c83be9a36b-C001-13", "4a7fecf3b80c274739e9c83be9a36b-C001-25", "4a7fecf3b80c274739e9c83be9a36b-C001-74" ] }, "@BACK@": { "gold_contexts": [ [ "4a7fecf3b80c274739e9c83be9a36b-C001-12" ], [ "4a7fecf3b80c274739e9c83be9a36b-C001-32" ], [ "4a7fecf3b80c274739e9c83be9a36b-C001-47" ] ], "cite_sentences": [ "4a7fecf3b80c274739e9c83be9a36b-C001-12", "4a7fecf3b80c274739e9c83be9a36b-C001-32", "4a7fecf3b80c274739e9c83be9a36b-C001-47" ] } } }, "ABC_89b2b492b4319636ff2f28a4ba0d95_30": { "x": [ { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-8", "text": "----------------------------------" }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-2", "text": "Beam-search and global models have been applied to transition-based dependency parsing, leading to state-of-the-art accuracies that are comparable to the best graph-based parsers." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-3", "text": "In this paper, we analyze the effects of global learning and beam-search on the overall accuracy and error distribution of a transition-based dependency parser." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-4", "text": "First, we show that global learning and beam-search must be jointly applied to give improvements over greedy, locally trained parsing." 
}, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-5", "text": "We then show that in addition to the reduction of error propagation, an important advantage of the combination of global learning and beam-search is that it accommodates more powerful parsing models without overfitting." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-6", "text": "Finally, we characterize the errors of a global, beam-search, transition-based parser, relating it to the classic contrast between \"local, greedy, transition-based parsing\" and \"global, exhaustive, graph-based parsing\"." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-7", "text": "TITLE AND ABSTRACT IN CHINESE" }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-10", "text": "Beam-search has been applied to transition-based dependency parsing in recent studies (Zhang and Clark, 2008; Huang and Sagae, 2010; Hatori et al., 2011) ." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-11", "text": "In addition to reducing search errors compared to greedy search, it also enables the use of global models that accommodate richer non-local features without overfitting, leading to recent state-of-the-art accuracies of transition-based dependency parsing (Zhang and Nivre, 2011; Bohnet and Kuhn, 2012; Bohnet and Nivre, 2012) that are competitive with the best graph-based dependency parsers." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-12", "text": "It has been known that a transition-based parser using global learning, beam-search and rich features gives significantly higher accuracies than one with local learning and greedy search." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-13", "text": "However, the effects of global learning, beam-search and rich features have not been separately studied." 
}, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-14", "text": "Apart from the natural conclusion that beam-search reduces error propagation compared to greedy search, exactly how these techniques help to improve parsing has not been discussed, and many interesting questions remain unanswered." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-15", "text": "For example, the contribution of global learning in improving the accuracies has not been separately studied." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-16", "text": "It has not been shown how global learning affects the accuracies, or whether it is important at all." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-17", "text": "For another example, it would be interesting to know whether a local, greedy, transition-based parser can be equipped with the rich features of Zhang and Nivre (2011) to improve its accuracy, and in particular whether MaltParser (Nivre et al., 2006) can achieve the same level of accuracies as ZPar (Zhang and Nivre, 2011) by using the same range of rich feature definitions." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-18", "text": "In this paper, we answer the above questions empirically." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-19", "text": "First, we separate out global learning and beam-search, and study the effect of each technique by comparison with a local greedy baseline." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-20", "text": "Our results show that significant improvements are achieved only when the two are jointly applied." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-21", "text": "Second, we show that the accuracies of a local, greedy transition-based parser cannot be improved by adding the rich features of Zhang and Nivre (2011) ." 
}, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-22", "text": "Our result suggests that global learning with beam-search accommodates more complex models with richer features than a local model with greedy search and therefore enables higher accuracies." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-23", "text": "One interesting aspect of using a global model with beam-search is that it narrows down the contrast between \"local, greedy, transition-based parsing\" and \"global, exhaustive, graph-based parsing\" as exemplified by McDonald and Nivre (2007) ." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-24", "text": "On the one hand, global beam-search parsing is more similar to global, exhaustive parsing than local, greedy parsing in the use of global models and non-greedy search." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-25", "text": "On the other hand, beam-search does not affect the fundamental transition-based parsing process, which allows the use of rich non-local features, and is very different from graph-based parsing." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-26", "text": "An interesting question is how such differences in models and algorithms affect empirical errors." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-27", "text": "McDonald and Nivre (2007) make a comparative analysis of local greedy transition-based MaltParser and global near-exhaustive graph-based MSTParser (McDonald and Pereira, 2006) using the CoNLL-X Shared Task data (Buchholz and Marsi, 2006) , showing that the parsers give near identical overall accuracies, but have very different error distributions according to various metrics." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-28", "text": "While MaltParser is more accurate on frequently occurring short sentences and dependencies, it performs worse on long sentences and dependencies due to search errors." 
}, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-29", "text": "We present empirical studies of the error distribution of global, beam-search transition-based dependency parsing, using ZPar (Zhang and Nivre, 2011) as a representative system." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-30", "text": "We follow McDonald and Nivre (2007) and perform a comparative error analysis of ZPar, MSTParser and MaltParser using the CoNLL-X shared task data." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-31", "text": "Our results show that beam-search im-proves the precision on long sentences and dependencies compared to greedy search, while the advantage of transition-based parsing on short dependencies is preserved." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-32", "text": "Under particular measures, such as precision for arcs at different levels of the trees, ZPar shows characteristics surprisingly similar to MSTParser." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-33", "text": "----------------------------------" }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-34", "text": "**ANALYZING THE EFFECT OF GLOBAL LEARNING AND BEAM-SEARCH**" }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-35", "text": "In this section we study the effects of global learning and beam-search on the accuracies of transition-based dependency parsing." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-36", "text": "Our experiments are performed using the Penn Treebank (PTB)." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-37", "text": "We follow the standard approach to split PTB3 into training (sections 2-21), development (section 22) and final testing (section 23) sections." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-38", "text": "Bracketed sentences from the treebank are transformed into dependency structures using the Penn2Malt tool." 
}, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-39", "text": "1 POS-tags are assigned using a perceptron tagger (Collins, 2002) , with an accuracy of 97.3% on a standard Penn Treebank test." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-40", "text": "We assign automatic POS-tags to the training data using ten-way jacknifing." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-41", "text": "Accuracies are measured using the unlabeled attached score (UAS) metric, which is defined as the percentage of words (excluding punctuation) that are assigned the correct heads." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-42", "text": "----------------------------------" }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-43", "text": "**THE EFFECTS OF GLOBAL LEARNING AND BEAM-SEARCH**" }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-44", "text": "In this subsection, we study the effects of global learning and beam-search separately." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-45", "text": "Our experiments are performed using ZPar, which uses global learning and beam-search." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-46", "text": "To make comparisons with local learning under different settings, we make configurations and modifications to ZPar where necessary." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-47", "text": "Global learning is implemented in the same way as Zhang and Nivre (2011) , using the averaged perceptron algorithm (Collins, 2002) and early update (Collins and Roark, 2004) ." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-48", "text": "This is a global learning method in the sense that it tries to maximize accuracy over the entire sentence and not on isolated local transitions." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-49", "text": "Unless explicitly specified, the same beam size is applied for training and testing when beam-search is applied." 
}, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-50", "text": "Local learning is implemented as a multi-class classifier that predicts the next transition action given a parser configuration (i.e. a stack and an incoming queue), trained using the averaged perceptron algorithm." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-51", "text": "In local learning, each transition is considered in isolation and there is no global view of the transition sequence needed to parse an entire sentence." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-52", "text": "Figure 1 shows the UAS of ZPar under different settings, where 'global' refers to a global model trained using the same method as Zhang and Nivre (2011) , 'local' refers to a local classifier trained using the averaged perceptron, 'base features' refers to the set of base feature templates in Zhang and Nivre (2011) , and 'all features' refers to the set of base and all extended feature templates in Zhang and Nivre (2011) ." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-53", "text": "When the size of the beam is 1, the decoding algorithm is greedy local search." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-54", "text": "Using base features, a locally trained model gives a UAS of 89.15%, higher than that of a globally trained model (89.04%)." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-55", "text": "Here a global model does not give better accuracies compared to a local model under greedy search." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-56", "text": "As the size of the beam increses, the UAS of the global model increases, but the UAS of the local model decreases." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-57", "text": "Global learning gives significantly better accuracies than local learning under beam-search." 
}, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-58", "text": "There are two ways to explain the reason that beam-search hurts the UAS of a locally trained model." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-59", "text": "First, the perceptron can be viewed as a large-margin training algorithm that finds a separation margin between the scores of positive examples (gold-standard structures) and negative examples (non-gold structures from the decoder)." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-60", "text": "The online learning process runs the decoding algorithm to generate a space of negative examples, which is used together with its corresponding positive example space for parameter udpates." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-61", "text": "If the negative example space during training is different from that during testing, the trained model will not separate the test examples as effectively as when the negative example spaces for training and testing are similar, since there are more unseen negative examples in the model." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-62", "text": "To further illustrate this, we conduct an additional set of development experiments by training two global models with different beam sizes." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-63", "text": "Each of the models is tested using its own training beam size and the training beam size of the other model." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-64", "text": "The results are shown in Table 1 ." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-65", "text": "As can be seen from the table, a global model trained with a size-1 beam gives a higher UAS when tested with a size-1 beam than with a size-64 beam." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-66", "text": "Similarly, a global model trained with a size-64 beam gives a higher UAS when tested using a size-64 beam than using a size-1 beam." 
}, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-67", "text": "Our observations are consistent with those of Daum\u00e9 III and Marcu (2005) , which show that the accuracies of another online large-margin model are lower when the training and testing beam sizes are different than when they are the same." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-68", "text": "These results show the negative effect of a mismatch between training and testing negative example spaces, which also happens when a locally trained model is tested using beam-search." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-69", "text": "To take a second perspective, a local model is trained to disambiguate different transition actions under the same parser configuration, but not different transitions under different parser configurations." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-70", "text": "This means that the scores of two sequences of transition actions may not be comparable with each other when they consist of very different parser configuration sequences." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-71", "text": "This is reminiscent of the label bias problem (Lafferty et al., 2001) , and partly explains the performance degradation of the local model when tested with beam-search." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-72", "text": "To summarize the above discussion, a global model does not improve over a local model for greedy parsing, and beam-search does not improve the performance of a parser trained locally using the perceptron algorithm." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-73", "text": "However, the combination of global learning and beam-search can significantly improve the performance compared to a local, greedy transition-based parser." 
}, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-74", "text": "----------------------------------" }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-75", "text": "**BENEFITS FROM GLOBAL LEARNING AND BEAM-SEARCH**" }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-76", "text": "An additional benefit of global learning and beam-search is the accommodation of rich nonlocal features." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-77", "text": "Again in Figure 1 , the use of rich non-local features improves the UAS of the global models with all beam sizes, while the improvement brought by rich non-local features also increases with increased size of the beam." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-78", "text": "With greedy local search, the accuracy improves from 89.04% with base features to 89.35% with all features; with the size of the beam being 64, the accuracy improves form 92.27% with base features to 93.18% with all features." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-79", "text": "The absolute improvement increased from 0.3% to 0.89%." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-80", "text": "The above fact shows that rich non-local features are more effective on a global model with a large beam-size." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-81", "text": "This is a consequence of the interaction between learning and search: a large beam not only reduces search errors, but also enables a more complex model to be trained without overfitting." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-82", "text": "In contrast to a globally trained model, a local model cannot benefit as much from the power of rich features." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-83", "text": "With greedy local search, the UAS of a local model improves from 89.15% with base features to 89.28% with all features." 
}, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-84", "text": "Beam-search does not bring additional improvements." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-85", "text": "For further evidence, we add rich non-local features in the same increments as Zhang and Nivre (2011) to both ZPar and MaltParser, and evaluate UAS on the same development data set." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-86", "text": "Original settings are applied to both parsers, with ZPar using global learning and beam-search, and MaltParser using local learning and greedy search." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-87", "text": "Table 2 shows that while ZPar's accuracy consistently improves with the addition of each new set of features, there is very little impact on MaltParser's accuracy and in some cases the effect is in fact negative, indicating that the locally trained greedy parser cannot benefit from the rich non-local features." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-88", "text": "Yet another evidence for the support of more complex models by global learning and beamsearch is the work of Bohnet and Nivre (2012) , where non-projective parsing using online reordering (Nivre, 2009 ) and rich features led to significant improvements over greedy search (Nivre, 2009) , achieving state-of-the-art on a range of typologically diverse languages." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-89", "text": "3 Characterizing the errors" }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-90", "text": "----------------------------------" }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-91", "text": "**THE PARSERS AND EVALUATION DATA**" }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-92", "text": "In this section we study the effect of global learning and beam-search on the error distributions of transition-based dependency parsing." 
}, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-93", "text": "We characterize the errors of ZPar and add it to the error comparison between MaltParser and MSTParser (McDonald and Nivre, 2007) ." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-94", "text": "Following McDonald and Nivre (2007) we evaluate the parsers on the CoNLL-X Shared Task data (Buchholz and Marsi, 2006) , which include training and test sentences for 13 different languages." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-95", "text": "For each parser, we conjoin the outputs for all 13 languages in the same way as McDonald and Nivre (2007) , and calculate error distributions over the aggregated output." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-96", "text": "Accuracies are measured using the labeled attached score (LAS) evaluation metric, which is defined as the percentage of words (excluding punctuation) that are assigned both the correct head word and the correct arc label." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-97", "text": "To handle non-projectivity, pseudo-projective parsing (Nivre and Nilsson, 2005 ) is applied to ZPar and MaltParser, transforming non-projective trees into pseudo-projective trees in the training data, and post-processing pseudo-projective outputs by the parser to transform them into non-projective trees." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-98", "text": "MSTParser produces non-projective trees from projective trees by scorebased rearrangements of arcs." 
}, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-99", "text": "----------------------------------" }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-100", "text": "**ERROR DISTRIBUTIONS**" }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-101", "text": "We take a range of different perspectives to characterize the errors of ZPar, comparing them with those of MaltParser and MSTParser by measuring the accuracies against various types of metrics, including the size of the sentences and dependency arcs, the distance to the root of the dependency tree, and the number of siblings." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-102", "text": "The parsers show different empirical performances over these measures, demonstrating the comparative advantages and disadvantages of their design discussed in Section 3.1." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-103", "text": "Figure 2 shows the accuracy of the parsers relative to sentence length (the number of words in a sentence, in bins of size 10)." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-104", "text": "All three parsers perform comparatively better on short sentences." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-105", "text": "The performance of MaltParser and MSTParser is very similar, with MaltParser performing better on very short sentences (\u2264 20) due to richer feature representations, and worse on longer sentences (20 to 50) due to the propagation of search errors." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-106", "text": "Because short sentences are much more frequent in the test data, MaltParser and MSTParser give almost identical overall accuracies." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-107", "text": "The three parsers show larger variance in performance when evaluated against specific properties of the dependency tree." 
}, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-108", "text": "Figure 3 shows the precision and recall for each parser relative to the arc lengths in the predicted and gold-standard dependency trees." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-109", "text": "Here the length of an arc is defined as the absolute difference between the indices of the head and modifier." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-110", "text": "Precision represents the percentage of predicted arcs with a particular length that are correct, and recall represents the percentage of gold arcs of a particular length that are correctly predicted." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-111", "text": "MaltParser gives higher precision than MSTParser for short dependency arcs (\u2264 4), but its precision drops rapidly for arcs with increased lengths." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-112", "text": "These arcs take more shift-reduce actions to build, and are hence more prone to error propagation." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-113", "text": "The precision of ZPar drops much slower compared to MaltParser, demonstrating the effect of beam-search for the reduction of error propagation." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-114", "text": "Another important factor is the use of rich non-local features by ZPar, which is a likely reason for its precision to drop slower even than that of MSTParser when the arc size increases from 1 to 8." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-115", "text": "Interestingly, the precision of ZPar is almost indistinguishable from that of MaltParser for size 1 arcs (arcs between neighbouring words), showing that the wider range of features in ZPar is the most helpful in arcs that take more than one, but not too many shiftreduce actions to build." 
}, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-116", "text": "The recall curves of the three parsers are similar, with ZPar having higher recall than MSTParser and MaltParser, particularly when the dependency size is greater than 2." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-117", "text": "This shows that particular gold-standard dependencies are hard for all parsers to build, but ZPar is better in recovering hard gold dependencies probably due to its rich features." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-118", "text": "To take another perspective, we compare the performance of the three parsers at different levels of a dependency tree by measuring accuracies for arcs relative to their distance to the root." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-119", "text": "Here the distance of an arc to the root is defined as the number of arcs in the path from the root to the modifier in the arc." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-120", "text": "Figure 4 shows the precision and recall of each system for arcs of varying distances to the root." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-121", "text": "Here the precision of MaltParser and MSTParser is very different, with MaltParser being more precise for arcs nearer to the leaves, but less precise for those nearer to the root." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-122", "text": "One possible reason is that arcs near the bottom of the tree require comparatively fewer shift-reduce actions to build, and are therefore less prone to the propagation of search errors." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-123", "text": "Another important reason, as pointed out by McDonald and Nivre (2007) , is the default single-root mechanism by MaltParser: all words that have not been attached as a modifier when the shift-reduce process finishes are attached as modifiers to the pseudo-root." 
}, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-124", "text": "Although the vast majority of sentences have only one root-modifier, there is no global control for the number of root-modifiers in the greedy shift-reduce process, and each action is made locally and independently." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-125", "text": "As a result, MaltParser tends to over-predict root modifiers, leading to the comparatively low precision." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-126", "text": "Surprisingly, the precision curve of ZPar is much more similar to that of MSTParser than that of MaltParser, although ZPar is based on the same shift-reduce parsing process, and even has a similar default single-root mechanism as MaltParser." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-127", "text": "This result is perhaps the most powerful demonstration of the effect of global learning and beam-search compared to local learning and greedy search." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-128", "text": "The model which scores whole sequences of shift-reduce actions, plus the reduction of search error propagation, lead to significantly reduced over-prediction of rootmodifiers." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-129", "text": "In addition, rich features used by ZPar, such as the valency (number of modifiers for a head) and set of modifier labels for a head, can also be useful in reducing over-prediction of modifiers." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-130", "text": "Because of these, ZPar effectively pushes the predictions of difficult arcs down the tree, which is exactly the behavior of MSTParser." }, { "sent_id": "89b2b492b4319636ff2f28a4ba0d95-C001-131", "text": "Interestingly, the recall curve of ZPar is more similar to that of MaltParser than that of MSTParser, showing that arcs at particular levels are harder to recover using the shift-reduce process than a global tree search." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "89b2b492b4319636ff2f28a4ba0d95-C001-23" ], [ "89b2b492b4319636ff2f28a4ba0d95-C001-123" ] ], "cite_sentences": [ "89b2b492b4319636ff2f28a4ba0d95-C001-23", "89b2b492b4319636ff2f28a4ba0d95-C001-123" ] }, "@USE@": { "gold_contexts": [ [ "89b2b492b4319636ff2f28a4ba0d95-C001-30" ], [ "89b2b492b4319636ff2f28a4ba0d95-C001-93" ], [ "89b2b492b4319636ff2f28a4ba0d95-C001-94" ], [ "89b2b492b4319636ff2f28a4ba0d95-C001-95" ] ], "cite_sentences": [ "89b2b492b4319636ff2f28a4ba0d95-C001-30", "89b2b492b4319636ff2f28a4ba0d95-C001-93", "89b2b492b4319636ff2f28a4ba0d95-C001-94", "89b2b492b4319636ff2f28a4ba0d95-C001-95" ] }, "@SIM@": { "gold_contexts": [ [ "89b2b492b4319636ff2f28a4ba0d95-C001-121", "89b2b492b4319636ff2f28a4ba0d95-C001-122", "89b2b492b4319636ff2f28a4ba0d95-C001-123" ] ], "cite_sentences": [ "89b2b492b4319636ff2f28a4ba0d95-C001-123" ] } } }, "ABC_370da04cbb2a6ab807428f7e058110_30": { "x": [ { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-11", "text": "Statistical methods have long been applied to analyze written texts and language patterns [1] , which now include network representations of text to investigate linguistic phenomena [2, 3, 4, 5, 6, 7, 8] ." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-2", "text": "The identification of authorship in disputed documents still requires human expertise, which is now unfeasible for many tasks owing to the large volumes of text and authors in practical applications." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-3", "text": "In this study, we introduce a methodology based on the dynamics of word co-occurrence networks representing written texts to classify a corpus of 80 texts by 8 authors." 
}, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-4", "text": "The texts were divided into sections with equal number of linguistic tokens, from which time series were created for 12 topological metrics." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-5", "text": "The series were proven to be stationary (p-value>0.05), which permits to use distribution moments as learning attributes." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-6", "text": "With an optimized supervised learning procedure using a Radial Basis Function Network, 68 out of 80 texts were correctly classified, i.e. a remarkable 85% author matching success rate." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-7", "text": "Therefore, fluctuations in purely dynamic network metrics were found to characterize authorship, thus opening the way for the description of texts in terms of small evolving networks." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-8", "text": "Moreover, the approach introduced allows for comparison of texts with diverse characteristics in a simple, fast fashion." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-9", "text": "----------------------------------" }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-12", "text": "Networks generated from text share several features with other complex systems, e.g. transportation networks [9] , biological systems [10, 11] , social interactions [12] ." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-13", "text": "Examples of language-related networks include phonological networks with modular or cut-off power-law behaviors [13, 14, 15] , semantic similarity networks with small-world and scale-free properties [16] , syntactic dependency networks with hierarchical and small-world organization [17, 18] and collocation networks, which also display small-world and scale-free properties [3] ." 
}, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-14", "text": "The ubiquity of specific patterns in language networks is believed to account for an easy navigation and acquisition in semantic and syntactic networks [19] ." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-15", "text": "Of particular relevance to this study, word co-occurrence networks are a special case of collocation networks where two words (nodes) are linked if they appear close to each other in a text." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-16", "text": "Co-occurrence networks are convenient because they do not require prior linguistic knowledge, apart from that needed to filter relevant information." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-17", "text": "Since most of the syntactic relations occur between adjacent words, co-occurrence networks can be seen as simplified versions of syntactic networks [18] ." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-18", "text": "Several patterns have been identified in co-occurrence networks formed from large corpora, such as the power-law regimes for degrees distribution [2] and core-periphery structure [20] resulting from the complex organization of the lexicon." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-19", "text": "The overall structure and dynamics of networks representing texts have been modeled to describe their mechanism of growth and attachment [21, 22] , while nuances in the topology of real networks were exploited in practical problems, including natural language processing [23, 24, 25] ." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-20", "text": "In this study, we use the co-occurrence representation to probe how the variation of network topology along a text is able to identify an author's style." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-21", "text": "Writing style is more subjective than other text characteristics (e.g. 
topic), making authorship recognition one of the most challenging text mining tasks [26, 27] ." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-22", "text": "It is crucial for practical applications such as text classification [25] , copyright resolution [28] , identification of terrorist messages [29] and of plagiarism [26] ." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-23", "text": "Early studies using stylometry were conducted by Mosteller and Wallace to identify authorship of the so-called Federalist Papers [30] ." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-24", "text": "A myriad of methods to tackle the problem have been developed since then, typically using statistical properties of words (e.g. mean length, frequency, burstiness and vocabulary richness) and characters (e.g. character counts and long-range correlations) [26] , in addition to syntactic and semantic information [26] ." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-25", "text": "Methods from statistical physics have also been used for authorship recognition [31, 32] , which in recent years included text modeling with co-occurrence networks [33, 34, 35, 36, 37, 38] ." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-26", "text": "The adequacy of co-occurrence networks for the task was confirmed for the first time with the correlation between network topology and authors' styles [25] ." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-27", "text": "Despite this relative success, some issues concerning the applicability of network methods remain unsolved." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-28", "text": "A major issue in network representation is that regular patterns among concepts only emerge when large pieces of texts are available." 
}, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-29", "text": "Furthermore, rigorous network-based similarity estimators usually assume that the networks comprise the same number of nodes and edges, since most measurements are affected by the network size [39] ." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-30", "text": "Unfortunately, such strong assumption often does not hold for written texts ranging from small tales to long novels, which may hinder the applicability of models to real situations." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-31", "text": "As we shall show, the method presented here obviates these issues with a simple approach based on network dynamics." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-32", "text": "----------------------------------" }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-33", "text": "**METHODS**" }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-34", "text": "Written texts were represented as co-occurrence networks, from which a set of network dynamics measurements were obtained." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-35", "text": "These measurements were used as attributes in pattern recognition methods in order to identify the author of a given text." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-36", "text": "The construction and analysis of the measurements are described in detail in the following subsections." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-37", "text": "----------------------------------" }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-38", "text": "**MODELING TEXTS AS CO-OCCURRENCE NETWORKS**" }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-39", "text": "The texts used for classification come from a collection of novels and tales in English whose details are provided in the Supporting Information." 
}, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-40", "text": "The collection comprising 8 authors with 10 texts per author was selected to simulate a real case where the text lengths are varied in a range from 2, 853 to 267, 012 tokens with an average of 53, 532 tokens." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-41", "text": "The approach introduced requires a pre-processing step before transforming texts into networks, which consists in the removal of stopwords and lemmatization." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-42", "text": "Because we are mostly interested in the relationship between content words, stopwords such as function words conveying low semantic information were removed as in many studies of this type [40] ." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-43", "text": "The remaining words were lemmatized so that nouns and verbs were mapped to their singular and infinitive forms, and therefore words related to the same concept were mapped into the same node (also referred to as one single token)." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-44", "text": "Since lemmatization requires part-of-speech (POS) tagging, we used the maximum-entropy approach described in [41] ." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-45", "text": "The co-occurrence networks were constructed with each distinct word becoming a node and two words being linked if they were adjacent in the pre-processed text [25] ." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-46", "text": "The link is directed from the word appearing first to the second word and is weighted by the number of times the pair is found in the text." 
}, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-47", "text": "----------------------------------" }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-48", "text": "**CHARACTERIZATION OF WRITTEN TEXTS VIA NETWORK DYNAMICS**" }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-49", "text": "The proposed method for authorship attribution is based on the evolution of the topology of networks, i.e. we exploit the network dynamics." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-50", "text": "Therefore, unlike previous approaches (see e.g. [3, 4, 42] ), we do not construct one single network from the whole book." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-51", "text": "Instead, a piece of text is divided into shorter parts comprising the same number of tokens." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-52", "text": "Then, a co-occurrence network is constructed for each part, which generates a series of independent networks for each book." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-53", "text": "The last partition is disregarded from the analysis because it is shorter than the previous ones." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-54", "text": "Since distinct books have different numbers of tokens, the series length varies from book to book." 
}, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-55", "text": "Each partition is described by the following topological network measurements: clustering coefficient C, which gives the fraction of possible triangles that exist for a particular node; network diameter D, which is the largest of all shortest paths (max{S ij }); network radius R, which is the smallest of all shortest paths (min{S ij }); number of cliques (complete subgraphs) C q ; load centrality L, similar to betweenness centrality but considering weights on edges; network transitivity T, which measures the fraction of all connected triples which are in fact triangles, T = 3 \u00d7 triangles/triads; betweenness centrality B, which measures how many shortest paths pass through a given node; shortest path length S, which is the shortest number of edges between two nodes; degree K or connectivity (number of edges) of a node; intermittency I, which measures how periodically a word is repeated [43] ; total number of nodes N (i.e. vocabulary size); and total number of edges E. Even though intermittency is not a traditional network measurement, we considered it because of its strong relationship with the concept of cycle length in networks." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-56", "text": "Moreover, this measurement has been proven useful for analyzing text styles [25] ." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-57", "text": "The metrics D, R, C q , T, N and E are scalar values for a network, while the other measurements are computed for each node individually." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-58", "text": "In order to have an overall picture of each partition, we computed the average values of C, L, B, S, K and I. As such, each partition is characterized by a set of twelve global topological measurements." 
}, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-59", "text": "The total number of tokens W (equal to the total weight among links), in each partition, was selected in a simple optimization procedure, with a compromise between having a long but noisy series (many small networks) and a shorter, more stable one (few large networks)." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-60", "text": "We found that with W = 200 tokens one ensures a series length with T = 268 elements on average while keeping the number of nodes over n = 100 for all networks." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-61", "text": "A set of time series is constructed by extracting the twelve global network metrics defined above for each of the networks from a book." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-62", "text": "Figure 1 shows the series for Moby Dick by Herman Melville, from which one may note that the series oscillate steadily around a fixed value along the text, with no significant trend." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-63", "text": "Indeed, the analysis is facilitated if the series are stationary." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-64", "text": "Strong stationarity requires the expected values being constant along time while weak stationarity implies that the mean value (and sometimes the variance) is constant." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-65", "text": "We confirmed that the time series are stationary, i.e. characterized by low values of autocorrelation." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-66", "text": "Correlation of a time series with itself shifted by a certain gap measures how much a value in the series depends on the previous ones, implying that the autocorrelation must be almost null for all but the first few values of the gap." 
}, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-67", "text": "In order to assess series stationarity, we implemented Kwiatkowski-Phillips-Schmidt-Shin (KPSS), augmented Dickey-Fuller, and MacKinnon (finitesample and asymptotic) unit root tests [44, 45, 46] ." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-68", "text": "These tests assume stationarity as the null hypothesis." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-69", "text": "We observed that for ten of the twelve time series, p-values were larger than 0.05, and therefore the stationarity hypothesis cannot be rejected." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-70", "text": "For the two remaining series (clustering coefficient C and transitivity T) two of the four p-values were less than 0.05: those for augmented Dickey-Fuller and finite-sample MacKinnon tests." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-71", "text": "MacKinnon [46] pointed out that z-tests can lead to unreliable finite-sample p-values substantially smaller than asymptotic p-values as we have observed for the clustering coefficient and transitivity." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-72", "text": "Moreover, autocorrelation drops after a small lag as shown in figure 2(a) for the series of clustering coefficient which can be fitted with an ARIMA(1,1,2) model, thus implying that the first derivative of the series is stationary." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-73", "text": "The finding that the series can be considered stationary allows one to compare estimated values from series of different lengths." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-74", "text": "As the distribution in the time series was found to display a bellshaped form (shown in figure 2(b) ), we propose the first four moments of the series distributions as the dynamical measurements, i.e." 
}, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-75", "text": "where 1 < i \u2264 4, x j are the series values and \u00b5 1 is the average of the measurements in the series." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-76", "text": "Since there are twelve time series, we obtain 48 dynamical measurements to characterize a book." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-77", "text": "----------------------------------" }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-78", "text": "**DATA ANALYSIS**" }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-79", "text": "The moments of the network metrics are used to characterize a text." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-80", "text": "In the terminology of machine learning these are the attributes (also called features) for the algorithms, while individual texts are the instances and the author of a text corresponds to the instance class." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-81", "text": "A 80 \u00d7 48 data matrix is constructed where each row corresponds to an instance and each column corresponds to an attribute." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-82", "text": "In order to account for the different scales of the attributes (see figure 1 ), each column in the data matrix is normalized between zero and one." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-83", "text": "From the data matrix, the author of each book is inferred using standard supervised learning (classification) algorithms [47] ." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-84", "text": "There is a dimensionality reduction stage prior to the classification." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-85", "text": "Dimensionality reduction is achieved through either feature selection or feature extraction." 
}, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-86", "text": "Feature selection consists in removing attributes which do not satisfy a given condition, thus leaving a subset or combination of the total number of features." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-87", "text": "On the other hand, feature extraction blends the original attributes together in order to create a set of, usually fewer, new attributes." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-88", "text": "Feature selection was implemented using variance threshold and scoring criteria." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-89", "text": "Variance threshold selection imposes a minimum variance among the realizations of an attribute, for example, if an attribute has the same value for all instances its variance is zero and can be safely removed because it does not contribute to the classification process." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-90", "text": "We also implemented feature selection based on score." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-91", "text": "The huge number of combinations of attributes prohibits an exhaustive search of the combination(s) with the highest score." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-92", "text": "Instead, we start by testing all subsets obtained by removing one attribute from the whole set." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-93", "text": "In the next step we test all subsets obtained by removing one attribute from the subsets with the highest score in the previous step, and the process is iterated." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-94", "text": "Dimensionality reduction through feature extraction was implemented using the well-known Principal Component Analysis (PCA) and Isomap [49, 50] technique." 
}, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-95", "text": "Isomap analyzes data points in a high-dimensional space that are implicitly located in a curved manifold of smaller dimensionality." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-96", "text": "Dimensionality reduction is then achieved by unwrapping the manifold and getting rid of the extra dimensions." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-97", "text": "As will be shown, both feature selection and extraction improve the classification success score." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-98", "text": "Since there are many supervised learning algorithms, we have selected some to cover the most distinct classification paradigms: ZeroR (0R), which arbitrarily labels all instances as belonging to the most prevalent class; OneR (1R), which ranks attributes according to their error rate; Naive Bayes (NB), which assumes independence among attributes; K-nearest neighbors (KNN), where the class of an instance is inferred by a voting process over the nearest neighbors in the training dataset; J48, which organizes the patterns in a tree-like structure; and Radial Basis Function Network (RBFN) where a learning network with an input, a processing, and an output layer is used." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-99", "text": "Due to their simplicity, 0R and 1R are only used for comparison." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-100", "text": "In all methods, the parameters were set with their default configuration [48] and the classification is calculated for a 10-fold stratified cross-validation." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-101", "text": "A detailed description of classification algorithms can be found in [47] ." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-102", "text": "The performance of supervised and unsupervised algorithms can be evaluated with two standard scores: precision and recall." 
}, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-103", "text": "Both are real values ranging from zero to one, being specific for a given class c. Precision (P c ) is defined as" }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-104", "text": "where \u03c4 c is the number of instances belonging to class c that were correctly classified (i.e. the number of true positives), and c is the number of instances of other classes that were wrongly classified as belonging to class c (i.e. number of false positives)." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-105", "text": "The Recall R for class c is computed as" }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-106", "text": "where \u03b3 c is the number of instances of class c that were incorrectly classified (i.e. the number of false negatives)." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-107", "text": "The precision and recall scores defined above refer to a single class." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-108", "text": "To obtain a single value from the dataset, one may use micro-and macro-averaging." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-109", "text": "Micro-averaging weights the scores of each class by the number of instances; therefore, classes with many instances are more relevant to the overall score." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-110", "text": "In contrast, the macro-average score is the arithmetic mean of the scores of all classes." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-111", "text": "Note that the micro-averaged recall is equivalent to the success rate, that is, the proportion between the number of instances correctly classified and the total number of instances." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-112", "text": "For the present collection, having the same number of instances per class, micro-and macro-averaging are equivalent." 
}, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-113", "text": "----------------------------------" }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-114", "text": "**RESULTS AND DISCUSSION**" }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-115", "text": "The authorship signature is captured by the metrics proposed, which reveals the relationship between style and changes in network structure." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-116", "text": "Success scores greatly surpass the threshold imposed by a blind classification obtained with ZeroR algorithm, which for our collection is 1/8 = 12.5%." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-117", "text": "Unmodified data from the 48 original metrics were classified with success rates in the range from 45% to 62.5% as shown in table 1." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-118", "text": "The simple OneR algorithm also performed well, reaching 46.25% score." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-119", "text": "Dimensionality reduction using either feature extraction or feature selection increased the success rates for all algorithms." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-120", "text": "The results of feature selection are shown in figure 3 success scores are presented, with the maximum value for each curve marked with a circle." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-121", "text": "If there is more than one maximum (e.g. J48 and NB for variance threshold and J48 and KNN for score-based selection), we only consider the combinations with the fewest number of attributes, located at the rightmost positions." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-122", "text": "The results of feature selection using a variance threshold are shown in figures 3(a) and 3(c)." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-123", "text": "There is a single subset (combination) of attributes for each variance threshold level." 
}, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-124", "text": "At the lowest threshold in figure 3(c) all attributes are present (even though the threshold is larger than zero) and all cells of the highest row are colored black." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-125", "text": "As the threshold is gradually increased, attributes are successively removed until there are no attributes left and all the cells in the lowest row are colored white." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-126", "text": "Remarkably, the first and the last attributes removed were respectively the fourth and the third moments of the number of cliques C q ." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-127", "text": "Note also that for nine of the twelve network metrics, either the third or the fourth moment had the smallest variance." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-128", "text": "The maximum scores are marked with circles in figure 3 (a) and listed in table 1." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-129", "text": "The thresholds for maximum scores marked in 3(a) are located in a narrow range and are represented in figure 3(c) as dashed lines." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-130", "text": "The results of feature selection based on score are shown in figures 3(b) and 3(d)." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-131", "text": "We start with all the attributes in the left end of figure 3(b) ." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-132", "text": "As we explore the combinations obtained by removing one attribute at a time the scores increase (monotonically for J48 and KNN) until a maximum value is reached, after which the scores rapidly decrease reaching ZeroR score when there are zero attributes." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-133", "text": "The maximum scores are marked with circles in figure 3(b) and listed in table 1." 
}, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-134", "text": "It must be noted that the maximum scores can be reached with a few attributes, at most 16 attributes in the case of KNN." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-135", "text": "The combinations of attributes giving the maximum scores marked are presented in detail in figure 3(d) ." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-136", "text": "For KNN two combinations of attributes reached the highest score." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-137", "text": "Again, the four moments of a given network metric are grouped together." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-138", "text": "It can be seen that the best scoring combinations for some algorithms did not include any of the four moments from some network metrics." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-139", "text": "In particular, load centrality L was not used by any algorithm (having therefore a blank column for L in figure 3(d) )." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-140", "text": "One should highlight the betweenness centrality B, which was extensively used by KNN, NB and RBFN even though its mean value (i.e. first moment, and the leftmost column under the B label on figure 3(d) ) was not used by these algorithms." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-141", "text": "Two last combinations of attributes were constructed." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-142", "text": "The first moments \u00b5 1 represent the static metrics previously studied (see e.g. [25, 51] ) and define a subset of 12 attributes." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-143", "text": "The complementary subset of 36 second, third, and fourth moments represent the dynamical aspects of networks since they describe the extent of variation around the mean value throughout a text." 
}, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-144", "text": "Classification was applied to these two subsets without further dimensionality reduction." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-145", "text": "The results are listed in the fourth and fifth rows of table 1 showing that purely dynamical metrics provide better overall performance when compared to the statical counterparts, while both subsets score similarly to the whole set of 48 attributes." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-146", "text": "Another dimensionality reduction technique implemented was feature extraction, using both linear PCA and nonlinear Isomap." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-147", "text": "The latter uses geodesic distances in an embedded manifold instead of high-dimensional Euclidean distances." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-148", "text": "There is a free parameter in Isomap: the number of neighbors n neighbors ." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-149", "text": "The distance between two instances considered neighbors is the traditional Euclidean distance while the distance between two farthest instances is the geodesic distance for a path inside the manifold [49] ." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-150", "text": "The results for Isomap depend on n neighbors and on the reduced number of dimensions n comps ; we varied both parameters from 2 to 15 and found similar results for most cases (see Supporting Information)." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-151", "text": "The best scores reported below were obtained for n neighbors = 10 and n comps = 13." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-152", "text": "Figure 4(b) shows precision (defined by equation 2) and recall (success score, defined by equation 3) for original (without dimensionality reduction), PCA-, and Isomap-treated attributes." 
}, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-153", "text": "Dimensionality reduction through PCA leads to lower precision and recall, while Isomap enhances the classification efficacy of the algorithms." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-154", "text": "The best performance is reached with RBFN for which the authorship of 68 of the 80 texts in the collection is correctly identified, thus reaching 85% success score (recall) and 0.854 precision." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-155", "text": "This performance is robust among algorithms as both precision and recall surpass 80% using KNN, NB and RBFN." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-156", "text": "For visualization purposes Isomap was also applied to reduce the number of attributes to a two-dimensional space using the Projection Explorer software [50] as shown in figure 4(a) ." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-157", "text": "For some authors the texts are clearly grouped and separated from the rest (e.g. texts from A. C. Doyle and B. Shaw) while for other authors the separation is not as clear." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-158", "text": "A common trend exists nevertheless, with texts by the same author located in preferential regions in the attribute space." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-159", "text": "Even though a direct comparison to related works requires using the same text collection, two examples using collections with similar characteristics which use static network metrics are worth mentioning." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-160", "text": "A similar study for the same task [25] analyzed 40 texts from 8 authors in English reaching a success score of 65%." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-161", "text": "In another work, 36 Persian books from 5 authors were classified with an accuracy rate of 77.8% [51] ." 
}, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-162", "text": "A myriad of other metrics for authorship identification have been proposed." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-163", "text": "Argamon and Juola [52] collected the results of the PAN 2011 competition where 3001 electronic messages from 26 authors were classified using diverse metrics for which the best micro-averaged recall (i.e. success score) was 0.717." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-164", "text": "These collections have characteristics different to ours such as the number of texts, authors, and the sizes of messages compared to books." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-165", "text": "----------------------------------" }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-166", "text": "**CONCLUSION**" }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-167", "text": "Network dynamics could be probed in a straightforward manner owing to the stationarity of the series obtained with the network metrics; as a bonus, some of the problems faced in applying networks to real-life problems are solved." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-168", "text": "Then, texts of different sizes can be compared, and indeed, the smallest book of the collection (from A. C. Doyle) was correctly classified repeatedly." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-169", "text": "Success scores reached 85%, which is outstanding using a collection with such characteristics." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-170", "text": "Dimensionality reduction through non-linear feature extraction helped to raise the success rates in classification." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-171", "text": "Although two of the twelve network metrics did not comply with some stationarity tests, they were successfully used for the task." 
}, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-172", "text": "For instance, transitivity T was largely used in the combinations leading to the highest scores shown in figure 3(d) ." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-173", "text": "The typical sizes of the networks were slightly more than 100 nodes which are usually considered small." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-174", "text": "Our approach succeeds because it collects only global metrics, i.e. averages which are still reliable, in contrast to distributions over all nodes." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-175", "text": "When the typical network sizes are below a few hundreds of nodes, the scores drop." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-176", "text": "On the other hand, when network sizes are too large the scores also diminish (at a slower pace), because the number of elements in the series decreases as the network size increases." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-177", "text": "Considering that small books in the collection are not the source of wrong classification, we conclude that the errors are caused by the variability of style of some authors in their books: while for some authors texts are clearly concentrated in a small region of attribute space, the texts from others are scattered." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-178", "text": "This reflects that some authors use well-defined structures while others change their narrative resources from one text to another." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-179", "text": "Converting networks structure information to time series allows one to use the vast knowledge already applied to other series to analyze evolution of network topologies, and in particular, to the way an author uses the structures offered by the language in his/her narrative." 
}, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-180", "text": "Purely dynamical measures, i.e. higher moments of the time series, revealed an aspect hitherto unknown of the close relation between style and network dynamics." }, { "sent_id": "370da04cbb2a6ab807428f7e058110-C001-181", "text": "Because co-occurrence networks are also author-dependent, with network dynamics texts of different sizes can be compared, which further expands the application of networks to real situations." } ], "y": { "@BACK@": { "gold_contexts": [ [ "370da04cbb2a6ab807428f7e058110-C001-19" ], [ "370da04cbb2a6ab807428f7e058110-C001-22" ], [ "370da04cbb2a6ab807428f7e058110-C001-26" ], [ "370da04cbb2a6ab807428f7e058110-C001-56" ] ], "cite_sentences": [ "370da04cbb2a6ab807428f7e058110-C001-19", "370da04cbb2a6ab807428f7e058110-C001-22", "370da04cbb2a6ab807428f7e058110-C001-26", "370da04cbb2a6ab807428f7e058110-C001-56" ] }, "@MOT@": { "gold_contexts": [ [ "370da04cbb2a6ab807428f7e058110-C001-26", "370da04cbb2a6ab807428f7e058110-C001-27" ] ], "cite_sentences": [ "370da04cbb2a6ab807428f7e058110-C001-26" ] }, "@USE@": { "gold_contexts": [ [ "370da04cbb2a6ab807428f7e058110-C001-45" ], [ "370da04cbb2a6ab807428f7e058110-C001-56" ], [ "370da04cbb2a6ab807428f7e058110-C001-142" ] ], "cite_sentences": [ "370da04cbb2a6ab807428f7e058110-C001-45", "370da04cbb2a6ab807428f7e058110-C001-56", "370da04cbb2a6ab807428f7e058110-C001-142" ] }, "@SIM@": { "gold_contexts": [ [ "370da04cbb2a6ab807428f7e058110-C001-160" ] ], "cite_sentences": [ "370da04cbb2a6ab807428f7e058110-C001-160" ] } } }, "ABC_460a83a07ca3aa4d56deabad4f9831_30": { "x": [ { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-2", "text": "We address the problem of end-to-end visual storytelling." 
}, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-3", "text": "Given a photo album, our model first selects the most representative (summary) photos, and then composes a natural language story for the album." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-4", "text": "For this task, we make use of the Visual Storytelling dataset and a model composed of three hierarchically-attentive Recurrent Neural Nets (RNNs) to: encode the album photos, select representative (summary) photos, and compose the story." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-5", "text": "Automatic and human evaluations show our model achieves better performance on selection, generation, and retrieval than baselines." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-6", "text": "----------------------------------" }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-8", "text": "Since we first developed language, humans have always told stories." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-9", "text": "Fashioning a good story is an act of creativity and developing algorithms to replicate this has been a long running challenge." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-10", "text": "Adding pictures as input can provide information for guiding story construction by offering visual illustrations of the storyline." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-11", "text": "In the related task of image captioning, most methods try to generate descriptions only for individual images or for short videos depicting a single activity." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-12", "text": "Very recently, datasets have been introduced that extend this task to longer temporal sequences such as movies or photo albums (Rohrbach et al., 2016; Pan et al., 2016; Lu and Grauman, 2013; Huang et al., 2016) ." 
}, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-13", "text": "The type of data we consider in this paper provides input illustrations for story generation in the form of photo albums, sampled over a few minutes to a few days of time." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-14", "text": "For this type of data, generating textual descriptions involves telling a temporally consistent story about the depicted visual information, where stories must be coherent and take into account the temporal context of the images." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-15", "text": "Applications of this include constructing visual and textual summaries of albums, or even enabling search through personal photo collections to find photos of life events." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-16", "text": "Previous visual storytelling works can be classified into two types, vision-based and languagebased, where image or language stories are constructed respectively." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-17", "text": "Among the vision-based approaches, unsupervised learning is commonly applied: e.g., (Sigurdsson et al., 2016) learns the latent temporal dynamics given a large amount of albums, and (Kim and Xing, 2014) formulate the photo selection as a sparse time-varying directed graph." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-18", "text": "However, these visual summaries tend to be difficult to evaluate and selected photos may not agree with human selections." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-19", "text": "For languagebased approaches, a sequence of natural language sentences are generated to describe a set of photos." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-20", "text": "To drive this work (Park and collected a dataset mined from Blog Posts." 
}, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-21", "text": "However, this kind of data often contains contextual information or loosely related language." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-91", "text": "Evaluations of each model are shown in Table 1 ." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-92", "text": "The h-attn outperforms both baselines, and h-attnrank achieves the best performance for all metrics." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-22", "text": "A more direct dataset was recently released (Huang et al., 2016) , where multi-sentence stories are collected describing photo albums via Amazon Mechanical Turk." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-23", "text": "In this paper, we make use of the Visual Storytelling Dataset (Huang et al., 2016) ." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-24", "text": "While the authors provide a seq2seq baseline, they only deal with the task of generating stories given 5-representative (summary) photos hand-selected by people from an album." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-25", "text": "Instead, we focus on the more challenging and realistic problem of end-toend generation of stories from entire albums." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-26", "text": "This requires us to either generate a story from all of the album's photos or to learn selection mechanisms to identify representative photos and then generate stories from those summary photos." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-27", "text": "We evaluate each type of approach." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-28", "text": "Ultimately, we propose a model of hierarchically-attentive recurrent neural nets, consisting of three RNN stages." 
}, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-29", "text": "The first RNN encodes the whole album context and each photo's content, the second RNN provides weights for photo selection, and the third RNN takes the weighted representation and decodes to the resulting sentences." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-30", "text": "Note that during training, we are only given the full input albums and the output stories, and our model needs to learn the summary photo selections latently." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-31", "text": "We show that our model achieves better performance over baselines under both automatic metrics and human evaluations." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-32", "text": "As a side product, we show that the latent photo selection also reasonably mimics human selections." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-33", "text": "Additionally, we propose an album retrieval task that can reliably pick the correct photo album given a sequence of sentences, and find that our model also outperforms the baselines on this task." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-34", "text": "----------------------------------" }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-35", "text": "**RELATED WORK**" }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-36", "text": "Recent years have witnessed an explosion of interest in vision and language tasks, reviewed below." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-37", "text": "Visual Captioning: Most recent approaches to image captioning (Vinyals et al., 2015b; Xu et al., 2015) have used CNN-LSTM structures to generate descriptions." 
}, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-38", "text": "For captioning video or movie content (Venugopalan et al., 2015; Pan et al., 2016) , sequence-to-sequence models are widely applied, where the first sequence encodes video frames and the second sequence decodes the description." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-39", "text": "Attention techniques (Xu et al., 2015; Yu et al., 2016; Yao et al., 2015) are commonly incorporated for both tasks to localize salient temporal or spatial information." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-40", "text": "Video Summarization: Similar to documentation summarization (Rush et al., 2015; Cheng and Lapata, 2016; Woodsend and Lapata, 2010) which extracts key sentences and words, video summarization selects key frames or shots." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-41", "text": "While some approaches use unsupervised learning (Lu and Grauman, 2013; Khosla et al., 2013) or intuitive criteria to pick salient frames, recent models learn from human-created summaries (Gygli et al., 2015; Zhang et al., 2016b,a; Gong et al., 2014) ." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-42", "text": "Recently, to better exploit semantics, (Choi et al., 2017) proposed textually customized summaries." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-43", "text": "Visual Storytelling: Visual storytelling tries to tell a coherent visual or textual story about an image set." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-44", "text": "Previous works include storyline graph modeling Xing, 2014), unsupervised mining (Sigurdsson et al., 2016) , blog-photo alignment , and language retelling (Huang et al., 2016; Park and Kim, 2015) ." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-45", "text": "While (Park and collects data by mining Blog Posts, (Huang et al., 2016) collects stories using Mechanical Turk, providing more directly relevant stories." 
}, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-46", "text": "----------------------------------" }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-47", "text": "**MODEL**" }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-48", "text": "Our model (Fig. 1) is composed of three modules: Album Encoder, Photo Selector, and Story Generator, jointly learned during training." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-49", "text": "----------------------------------" }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-50", "text": "**ALBUM ENCODER**" }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-51", "text": "Given an album A = {a 1 , a 2 , ..., a n }, composed of a set of photos, we use a bi-directional RNN to encode the local album context for each photo." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-52", "text": "We first extract the 2048-dimensional visual representation f i \u2208 R k for each photo using ResNet101 , then a bi-directional RNN is applied to encode the full album." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-53", "text": "Following (Huang et al., 2016) , we choose a Gated Recurrent Unit (GRU) as the RNN unit to encode the photo sequence." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-54", "text": "The sequence output at each time step encodes the local album context for each photo (from both directions)." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-55", "text": "Fused with the visual representation followed by ReLU, our final photo representation is (top module in Fig. 1 ):" }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-56", "text": "----------------------------------" }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-57", "text": "**PHOTO SELECTOR**" }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-58", "text": "The Photo Selector (illustrated in the middle yellow part of Fig. 1 ) identifies representative photos to summarize an album's content." 
}, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-59", "text": "As discussed, we do not assume that we are given the ground-truth album summaries during training, instead regarding selection as a latent variable in the end-to-end learning." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-60", "text": "Inspired by Pointer Networks (Vinyals et al., 2015a), we use another GRU-RNN to perform this task 1 ." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-61", "text": "Given the album representation V n\u00d7k , the photo selector outputs probabilities p t \u2208 R n (likelihood of selection as t-th summary image) for all photos using soft attention." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-62", "text": "Figure 1 : Model: the album encoder is a bi-directional GRU-RNN that encodes all album photos; the photo selector computes the probability of each photo being the tth album-summary photo; and finally, the story generator outputs a sequence of sentences that combine to tell a story for the album." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-63", "text": "At each summarization step, t, the GRU takes the previous p t\u22121 and previous hidden state as input, and outputs the next hidden stateh t .h t is fused with each photo representation v i to compute the i th photo's attention p i t = p(y a i (t) = 1)." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-64", "text": "At test time, we simply pick the photo with the highest probability to be the summary photo at step t." 
}, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-65", "text": "----------------------------------" }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-66", "text": "**STORY GENERATOR**" }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-67", "text": "To generate an album's story, given the album representation matrix V and photo summary probabilities p t from the first two modules, we compute the visual summary representation g t \u2208 R k (for the t-th summary step)." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-68", "text": "This is a weighted sum of the album representations, i.e., g t = p T t V ." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-69", "text": "Each of these 5 g t embeddings (for t = 1 to 5) is then used to decode 1 of the 5 story sentences respectively, as shown in the blue part of Fig. 1 ." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-70", "text": "Given a story S = {s t }, where s t is t-th summary sentence." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-71", "text": "Following Donahue et al. (2015) , the l-th word probability of the t-th sentence is:" }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-72", "text": "where W e is the word embedding." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-73", "text": "The GRU takes the joint input of visual summarization g t , the previous word embedding w t,l , and the previous hidden state, then outputs the next hidden state." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-74", "text": "The generation loss is then the sum of the negative log likelihoods of the correct words:" }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-75", "text": "Lt l=1 log p t,l (s t,l )." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-76", "text": "To further exploit the notion of temporal coherence in a story, we add an order-preserving constraint to order the sequence of sentences within a story (related to the story-sorting idea in )." 
}, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-77", "text": "For each story S we randomly shuffle its 5 sentences to generate negative story instances S ." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-78", "text": "We then apply a max-margin ranking loss to encourage correctly-ordered stories: L rank (S, S ) = max(0, m\u2212log p(S )+log p(S))." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-79", "text": "The final loss is then a combination of the generation and ranking losses:" }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-80", "text": "----------------------------------" }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-81", "text": "**EXPERIMENTS**" }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-82", "text": "We use the Visual Storytelling Dataset (Huang et al., 2016) , consisting of 10,000 albums with 200,000 photos." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-83", "text": "Each album contains 10-50 photos taken within a 48-hour span with two annotations: 1) 2 album summarizations, each with 5 selected representative photos, and 2) 5 stories describing the selected photos." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-84", "text": "----------------------------------" }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-85", "text": "**STORY GENERATION**" }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-86", "text": "This task is to generate a 5-sentence story describing an album." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-87", "text": "We compare our model with two sequence-to-sequence baselines: 1) an encoderdecoder model (enc-dec), where the sequence of album photos is encoded and the last hidden state is fed into the decoder for story generation, 2) an encoder-attention-decoder model (Xu et al., 2015) (enc-attn-dec) with weights computed using a soft-attention mechanism." 
}, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-88", "text": "At each decoding time step, a weighted sum of hidden states from the encoder is decoded." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-89", "text": "For fair comparison, we use the same album representation (Sec. 3.1) for the baselines." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-90", "text": "We test two variants of our model trained with and without ranking regularization by controlling \u03bb in our loss function, denoted as h-attn (without ranking), and h-attn-rank (with ranking)." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-93", "text": "Note, we use beam-search with beam size=3 during generation for a reasonable performancespeed trade-off (we observe similar improvement trends with beam size = 1)." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-94", "text": "2 To test performance under optimal image selection, we use one of the two ground-truth human-selected 5-photo-sets as an oracle to hard-code the photo selection, denoted as h-(gd)attn-rank." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-95", "text": "This achieves only a slightly higher Meteor compared to our end-to-end model." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-96", "text": "Additionally, we also run human evaluations in a forced-choice task where people choose between stories generated by different methods." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-97", "text": "For this evaluation, we select 400 albums, each evaluated by 3 Turkers." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-98", "text": "Results are shown in Table 2 ." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-99", "text": "Experiments find significant preference for our model over both baselines." 
}, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-100", "text": "As a simple Turing test, we also compare our results with human written stories (last row of Table 2 ), indicating room for improvement of methods." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-101", "text": "----------------------------------" }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-102", "text": "**ALBUM SUMMARIZATION**" }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-103", "text": "We evaluate the precision and recall of our generated summaries (output by the photo selector) compared to human selections (the combined set Table 4 : 1000 album retrieval evaluation." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-104", "text": "of both human-selected 5-photo stories)." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-105", "text": "For comparison, we evaluate enc-attn-dec on the same task by aggregating predicted attention and selecting the 5 photos with highest accumulated attention." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-106", "text": "Additionally, we also run DPP-based video summarization (Kulesza et al., 2012) using the same album features." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-107", "text": "Our models have higher performance compared to baselines as shown in Table 3 (though DPP also achieves strong results, indicating that there is still room to improve the pointer network)." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-108", "text": "Fig. 2 and Fig. 3 shows several output examples of the joint album summarization and storytelling generation." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-109", "text": "We compare our full model h-attnrank with the baseline enc-attn-dec, as both models are able to do the album summarization and story generation tasks jointly." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-110", "text": "In Fig. 2 and Fig. 
3, we use blue dashed boxes and red boxes to indicate the album summarizations by the two models, respectively." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-111", "text": "As reference, we also show the ground-truth album summaries by randomly selecting 1 out of 2 human album summaries, which are highlighted with green boxes." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-112", "text": "Below each album are their generated stories." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-113", "text": "----------------------------------" }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-114", "text": "**OUTPUT EXAMPLE ANALYSIS**" }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-115", "text": "----------------------------------" }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-116", "text": "**ALBUM RETRIEVAL**" }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-117", "text": "Given a human-written story, we introduce a task to retrieve the album described by that story." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-118", "text": "We randomly select 1000 albums and one ground-truth story from each for evaluation." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-119", "text": "Using the generation loss, we compute the likelihood of each album A m given the query story S and retrieve the album with the highest generation likelihood, A = argmax Am p(S|A m )." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-120", "text": "We use Recall@k and Median Rank for evaluation." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-121", "text": "As shown in Table 4, we find that our models outperform the baselines, but the ranking term in Eqn. 2 does not improve performance significantly." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-122", "text": "Figure 2: Examples of album summarization and storytelling by enc-attn-dec (blue), h-attn-rank (red), and ground-truth (green)."
}, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-123", "text": "We randomly select 1 out of 2 human album summaries as ground-truth here." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-124", "text": "----------------------------------" }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-125", "text": "**CONCLUSION**" }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-126", "text": "Our proposed hierarchically-attentive RNN based models for end-to-end visual storytelling can jointly summarize and generate relevant stories from full input photo albums effectively." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-127", "text": "Automatic and human evaluations show that our Figure 3 : More examples of album summarization and storytelling by enc-attn-dec (blue), h-attn-rank (red), and ground-truth (green)." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-128", "text": "We randomly select 1 out of 2 human album summaries as ground-truth here." }, { "sent_id": "460a83a07ca3aa4d56deabad4f9831-C001-129", "text": "method outperforms strong sequence-to-sequence baselines on selection, generation, and retrieval tasks." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "460a83a07ca3aa4d56deabad4f9831-C001-12" ], [ "460a83a07ca3aa4d56deabad4f9831-C001-22" ], [ "460a83a07ca3aa4d56deabad4f9831-C001-44" ], [ "460a83a07ca3aa4d56deabad4f9831-C001-45" ] ], "cite_sentences": [ "460a83a07ca3aa4d56deabad4f9831-C001-12", "460a83a07ca3aa4d56deabad4f9831-C001-22", "460a83a07ca3aa4d56deabad4f9831-C001-44", "460a83a07ca3aa4d56deabad4f9831-C001-45" ] }, "@USE@": { "gold_contexts": [ [ "460a83a07ca3aa4d56deabad4f9831-C001-23" ], [ "460a83a07ca3aa4d56deabad4f9831-C001-53" ], [ "460a83a07ca3aa4d56deabad4f9831-C001-82" ] ], "cite_sentences": [ "460a83a07ca3aa4d56deabad4f9831-C001-23", "460a83a07ca3aa4d56deabad4f9831-C001-53", "460a83a07ca3aa4d56deabad4f9831-C001-82" ] } } }, "ABC_616e8732490f0fa87d35998f769196_30": { "x": [ { "sent_id": "616e8732490f0fa87d35998f769196-C001-64", "text": "**THE SIMULTANEITY REQUIREMENT**" }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-2", "text": "Recent applications of Tree-Adjoining Grammar (TAG) to the domain of semantics as well as new attention to syntactic phenomena have given rise to increased interested in more expressive and complex multicomponent TAG formalisms (MCTAG)." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-3", "text": "Although many constructions can be modeled using tree-local MCTAG (TL-MCTAG), certain applications require even more flexibility." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-4", "text": "In this paper we suggest a shift in focus from constraining locality and complexity through treeand set-locality to constraining locality and complexity through restrictions on the derivational distance between trees in the same tree set in a valid derivation." 
}, { "sent_id": "616e8732490f0fa87d35998f769196-C001-5", "text": "We examine three formalisms, restricted NS-MCTAG, restricted Vector-TAG and delayed TL-MCTAG, that use notions of derivational distance to constrain locality and demonstrate how they permit additional expressivity beyond TL-MCTAG without increasing complexity to the level of set local MCTAG." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-6", "text": "----------------------------------" }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-8", "text": "Tree-Adjoining Grammar (TAG) has long been popular for natural language applications because of its ability to naturally capture syntactic relationships while also remaining efficient to process." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-9", "text": "More recent applications of TAG to the domain of semantics as well as new attention to syntactic phenomena such as scrambling have given rise to increased interested in multicomponent TAG formalisms (MC-TAG), which extend the flexibility, and in some cases generative capacity of the formalism but also have substantial costs in terms of efficient processing." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-10", "text": "Much work in TAG semantics makes use of tree-local MCTAG (TL-MCTAG) to model phenomena such as quantifier scoping, Wh-question formation, and many other constructions ." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-11", "text": "Certain applications, however, appear to require even more flexibility than is provided by TL-MCTAG." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-12", "text": "Scrambling is one well-known example (Rambow, 1994) ." 
}, { "sent_id": "616e8732490f0fa87d35998f769196-C001-13", "text": "In addition, in the semantics domain, the use of a new TAG operation, flexible composition, is used to perform certain semantic operations that seemingly cannot be modeled with TL-MCTAG alone (Chiang and Scheffler, 2008) and in work in synchronous TAG semantics, constructions such as nested quantifiers require a set-local MCTAG (SL-MCTAG) analysis (Nesson and Shieber, 2006) ." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-14", "text": "In this paper we suggest a shift in focus from constraining locality and complexity through restrictions that all trees in a tree set must adjoin within a single tree or tree set to constraining locality and complexity through restrictions on the derivational distance between trees in the same tree set in a valid derivation." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-15", "text": "We examine three formalisms, two of them introduced in this work for the first time, that use derivational distance to constrain locality and demonstrate by construction of parsers their relationship to TL-MCTAG in both expressivity and complexity." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-16", "text": "In Section 2 we give a very brief introduction to TAG." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-17", "text": "In Section 3 we elaborate further the distinction between these two types of locality restrictions using TAG derivation trees." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-18", "text": "Section 4 briefly addresses the simultaneity requirement present in MCTAG formalisms but not in Vector- TAG formalisms and argues for dropping the requirement." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-19", "text": "In Sections 5 and 6 we introduce two novel formalisms, restricted non-simultaneous MC-TAG and restricted Vector-TAG, respectively, and define CKY-style parsers for them." 
}, { "sent_id": "616e8732490f0fa87d35998f769196-C001-20", "text": "In Section 7 we recall the delayed TL-MCTAG formalism introduced by Chiang and Scheffler (2008) and define a CKY-style parser for it as well." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-21", "text": "In Section 8 we explore the complexity of all three parsers and the relationship between the formalisms." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-22", "text": "In Section 9 we discuss the linguistic applications of these formalisms and show that they permit analyses of some of the hard cases that have led researchers to look beyond TL-MCTAG." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-23", "text": "----------------------------------" }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-24", "text": "**BACKGROUND**" }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-25", "text": "A tree-adjoining grammar consists of a set of elementary tree structures of arbitrary depth, which are combined by operations of adjunction and substitution." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-26", "text": "Auxiliary trees are elementary trees in which the root and a frontier node, called the foot node and distinguished by the diacritic * , are labeled with the same nonterminal A. The adjunction operation entails splicing an auxiliary tree in at an internal node in an elementary tree also labeled with nonterminal A. Trees without a foot node, which serve as a base for derivations and may combine with other trees by substitution, are called initial trees." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-27", "text": "Examples of the adjunction and substitution operations are given in Figure 1 ." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-28", "text": "For further background, refer to the survey by (Joshi and Schabes, 1997) ." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-29", "text": "Shieber et al. 
(1995) and Vijay-Shanker (1987) apply the Cocke-Kasami-Younger (CKY) algorithm first introduced for use with context-free grammars in Chomsky normal form (Kasami, 1965; Younger, 1967) to the TAG parsing problem to generate parsers with a time complexity of O(n 6 |G| 2 )." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-30", "text": "In order to clarify the presentation of our extended TL-MCTAG parsers below, we briefly review the algorithm of Shieber et al. (1995) using the inference rule notation from that paper." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-31", "text": "As shown in Figure 2 , items in CKY-style TAG parsing consist of a node in an elementary tree and the indices that mark the edges of the span dominated by that node." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-32", "text": "Nodes, notated \u03b1@a with a diacritic, are specified by three pieces of information: the identifier \u03b1 of the elementary tree the node is in, the Gorn address a of the node in that tree, and a diacritic, \u22a5, indicating that an adjunction or substitution is still available at that node, or \u22a4, indicating that one has already taken place." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-33", "text": "Each item has four indices, indicating the left and right edges of the span covered by the node as well as any gap in the node that may be the result of a foot node dominated by the node." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-34", "text": "Nodes that do not dominate a foot node will have no gap in them, which we indicate by the use of underscores in place of the indices for the gap."
}, { "sent_id": "616e8732490f0fa87d35998f769196-C001-35", "text": "To limit the number of inference rules needed, we define the following function i \u222a j for combining indices:" }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-36", "text": "The side conditions Init(\u03b1) and Aux(\u03b1) hold if \u03b1 is an initial tree or an auxiliary tree, respectively." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-37", "text": "Label(\u03b1@a) specifies the label of the node in tree \u03b1 at address a. Ft(\u03b1) specifies the address of the foot node of tree \u03b1." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-38", "text": "Adj(\u03b1@a, \u03b2) holds if tree \u03b2 may adjoin into tree \u03b1 at address a. Subst(\u03b1@a, \u03b2) holds if tree \u03b2 may substitute into tree \u03b1 at address a. These conditions fail if the adjunction or substitution is prevented by constraints such as mismatched node labels." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-39", "text": "Multi-component TAG (MCTAG) generalizes TAG by allowing the elementary items to be sets of trees rather than single trees (Joshi and Schabes, 1997) ." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-40", "text": "The basic operations are the same but all trees in a set must adjoin (or substitute) into another tree or tree set in a single step in the derivation." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-41", "text": "An MCTAG is tree-local if tree sets are required to adjoin within a single elementary tree (Weir, Goal Item:" }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-42", "text": "Unary Complete: has the same generative capacity as TAG (Weir, 1988) , the conversion to TAG is exponential and the TL-MCTAG formalism is NP-hard to recognize (S\u00f8gaard et al., 2007 )." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-43", "text": "An MCTAG is set-local if tree sets required to adjoin within a single elementary tree set (Weir, 1988) ." 
}, { "sent_id": "616e8732490f0fa87d35998f769196-C001-44", "text": "Set-local MCTAG (SL-MCTAG) has equivalent expressivity to linear context-free rewriting systems and recognition is provably PSPACE complete (Nesson et al., 2008) ." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-45", "text": "----------------------------------" }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-46", "text": "**DOMAINS OF LOCALITY AND DERIVATION TREES**" }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-47", "text": "The domains of locality of TL-MCTAG and SL-MCTAG (and trivially, TAG) can be thought of as lexically defined." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-48", "text": "That is, all locations at which the adjunction of one tree set into another may occur must be present within a single lexical item." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-49", "text": "However, we can also think of locality derivationally." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-50", "text": "In a derivationally local system the constraint is on the relationships allowed between members of the same tree set in the derivation tree." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-51", "text": "TAG derivation trees provide the information about how the elementary structures of the grammar combine that is necessary to construct the derived tree." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-52", "text": "Nodes in a TAG derivation tree are labeled with identifiers of elementary structures." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-53", "text": "One elementary structure is the child of another in the derivation tree if it adjoins or substitutes into it in the derivation." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-54", "text": "Arcs in the derivation tree are labeled with the address in the target elementary structure at which the operation takes place." 
}, { "sent_id": "616e8732490f0fa87d35998f769196-C001-55", "text": "In MCTAG the derivation trees are often drawn with identifiers of entire tree sets as the nodes of the tree because the lexical locality constraints require that each elementary tree set be the derivational child of only one other tree set." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-56", "text": "However, if we elaborate the derivation tree to include a node for each tree in the grammar rather than only for each tree set we can see a stark contrast in the derivational locality of these two formalisms." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-57", "text": "In TL-MCTAG all trees in a set must adjoin to the same tree." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-58", "text": "This means that they must all be siblings in the derivation tree." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-59", "text": "In SL-MCTAG, on the other hand, it is possible to generate derivations with arbitrarily long distances before the nearest common ancestor of two trees from the same elementary tree set is reached." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-60", "text": "An example SL-MCTAG grammar that can produce an arbitrarily long derivational distance to the nearest common ancestor of the trees in a given tree set is given in Figure 3 ." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-61", "text": "Chiang and Scheffler (2008) recently introduced one variant of MCTAG, delayed Tree-Local MC-TAG (delayed TL-MCTAG) that uses a derivational notion of locality." 
}, { "sent_id": "616e8732490f0fa87d35998f769196-C001-62", "text": "In this paper we introduce two additional derivationally local TAG-based formalisms, restricted non-simultaneous MCTAG (restricted NS-MCTAG) and restricted Vector TAG (restricted V-TAG) and demonstrate by construction of parsers how each gives rise to a hierarchy of derivationally local formalisms with a well-defined efficiency penalty for each step of derivational distance permitted." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-63", "text": "----------------------------------" }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-65", "text": "In addition to lexical locality constraints the definition of MCTAG requires that all trees from a set adjoin simultaneously." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-66", "text": "In terms of well-formed derivation trees, this amounts to disallowing derivations in which a tree from a given set is the ancestor of a tree from the same tree set." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-67", "text": "For most linguistic applications of TAG, this requirement seems natural and is strictly obeyed." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-68", "text": "There are a few applications, including flexible composition and scrambling in free-word order languages that benefit from TAG-based grammars that drop the simultaneity requirement (Chiang and Scheffler, 2008; Rambow, 1994) ." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-69", "text": "From a complexity perspective, however, checking the simultaneity requirement is expensive (Kallmeyer, 2007) ." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-70", "text": "As a result, it can be advantageous to select a base formalism that does not require simultaneity even if the grammars implemented with it do not make use of that additional freedom." 
}, { "sent_id": "616e8732490f0fa87d35998f769196-C001-71", "text": "----------------------------------" }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-72", "text": "**RESTRICTED NON-SIMULTANEOUS MCTAG**" }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-73", "text": "The simplest version of a derivationally local TAGbased formalism is most similar to non-local MC-TAG." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-74", "text": "There is no lexical locality requirement at all." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-75", "text": "In addition, we drop the simultaneity requirement." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-76", "text": "Thus the only constraint on elementary tree sets is the limit, d, on the derivational distance between the trees in a given set and their nearest common ancestor." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-77", "text": "We call this formalism restricted nonsimultaneous MCTAG." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-78", "text": "Note that if we constrain d to be one, this happens to enforce both the derivational delay limit and the lexical locality requirement of TL-MCTAG." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-79", "text": "A CKY-style parser for restricted NS-MCTAG with a restriction of d is given in Figure 4 ." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-80", "text": "The items of this parser contain d lists, \u039b 1 , . ." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-81", "text": ". , \u039b d , called histories that record the identities of the trees that have already adjoined in the derivation in order to enforce the locality constraints." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-82", "text": "The identities of the trees in a tree set that have adjoined in a given derivation are maintained in the histories until all the trees from that set have adjoined." 
}, { "sent_id": "616e8732490f0fa87d35998f769196-C001-83", "text": "Once the locality constraint is checked for a tree set, the Filter side condition expunges those trees from the histories." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-84", "text": "A tree is recorded in this history list with superscript i, where i is the derivational distance between the location where the recorded tree adjoined and the location of the current item." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-85", "text": "The locality constraint is enforced at the point of adjunction or substitution where the Goal Item history at the limit of the permissible delay must be empty for the operation to succeed." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-86", "text": "----------------------------------" }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-87", "text": "**RESTRICTED V-TAG**" }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-88", "text": "A Vector-TAG (V-TAG) (Rambow, 1994 ) is similar to an MCTAG in that the elementary structures are sets (or vectors) of TAG trees." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-89", "text": "A derivation in a V-TAG is defined as in TAG." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-90", "text": "There is no locality requirement or other restriction on adjunction except that if one tree from a vector is used in a derivation, all trees from that vector must be used in the derivation." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-91", "text": "The trees in a vector may be connected by dominance links between the foot nodes of auxiliary trees and any node in other trees in the vector." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-92", "text": "All adjunctions must respect the dominance relations in that a node \u03b7 1 that dominates a node \u03b7 2 must appear on the path from \u03b7 2 to the root of the derived tree." 
}, { "sent_id": "616e8732490f0fa87d35998f769196-C001-93", "text": "The definition of V-TAG is very similar to that of non-local MCTAG as defined by Weir (1988) except that in non-local MCTAG all trees from a tree set are required to adjoin simultaneously." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-94", "text": "Restricted V-TAG constrains V-TAG in several ways." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-95", "text": "First, the dominance chain in each elementary tree vector is required to define a total order over the trees in the vector." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-96", "text": "This means there is a single base tree in each vector." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-97", "text": "Note also that all trees other than the base tree must be auxiliary trees in order to dominate other trees in the vector." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-98", "text": "The base tree may be either an initial tree or an auxiliary tree." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-99", "text": "Second, a restricted V-TAG has a restriction level, d, that determines the largest derivational distance that may exists between the base tree and the highest tree in a tree vector in a derivation." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-100", "text": "Restricted V-TAG differs from restricted NS-MCTAG in one important respect: the dominance requirements of restricted V-TAG require that trees from the same set must appear along a single path in the derived tree, whereas in restricted NS-MCTAG trees from the same set need not adhere to any dominance relationship in the derived tree." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-101", "text": "A CKY-style parser for restricted V-TAG with restriction level d is given in Figure 5 ." 
}, { "sent_id": "616e8732490f0fa87d35998f769196-C001-102", "text": "Parsing is similar to delayed TL-MCTAG in that we have a set of histories for each restriction level." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-103", "text": "However, because of the total order over trees in a vector, the parser only needs to maintain the identity of the highest tree from a vector that has been used in the derivation along with its distance from the base tree from that vector." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-104", "text": "The Filter side condition accordingly expunges trees that are the top tree in the dominance chain of their tree vector." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-105", "text": "The side conditions for the Adjoin non-base rule enforce that the dominance constraints are satisfied and that the derivational distance from the base of a tree vector to its currently highest adjoined tree is maintained accurately." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-106", "text": "We note that in order to allow a non-total ordering of the trees in a vector we would simply have to record all trees in a tree vector in the histories as is done in the delayed TL-MCTAG parser." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-107", "text": "7 Delayed TL-MCTAG Chiang and Scheffler (2008) introduce the delayed TL-MCTAG formalism which makes use of a derivational distance restriction in a somewhat different way." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-108", "text": "Rather than restricting the absolute distance between the trees of a set and their nearest common ancestor, given a node \u03b1 in a derivation tree, delayed TL-MCTAG restricts the number of tree sets that are not fully dominated by \u03b1." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-109", "text": "Borrowing directly from Chiang and Scheffler (2008) , Figure 7 gives two examples." 
}, { "sent_id": "616e8732490f0fa87d35998f769196-C001-110", "text": "Parsing for delayed TL-MCTAG is not discussed by Chiang and Scheffler (2008) but can be accomplished using a similar CKY-style strategy as in the two parsers above." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-111", "text": "We present a parser in Figure 6 ." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-112", "text": "Rather than keeping histories that record derivational distance, we keep an active delay list for each item that records the delays that are active (by recording the identities of the trees that have adjoined) for the tree of which the current node is a part." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-113", "text": "At the root of each tree the active delay list is filtered using the Filter side condition to remove all tree sets that are fully dominated and the resulting list is checked using the Size to ensure that it contains no more than d distinct tree sets where d is the specified delay for the grammar." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-114", "text": "The active delays for a given tree are passed to its derivational parent when it adjoins or substitutes." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-115", "text": "Delayed TL-MCTAG differs from both of the previous formalisms in that it places no constraint on the length of a delay." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-116", "text": "On the other hand while the previous formalisms allow unlimited short delays to be pending at the same time, in delayed TL-MCTAG, only a restricted number of delays may be active at once." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-117", "text": "Similar to restricted V-TAG, there is no simultaneity requirement, so a tree may have another tree from the same set as an ancestor." 
}, { "sent_id": "616e8732490f0fa87d35998f769196-C001-118", "text": "----------------------------------" }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-119", "text": "**COMPLEXITY**" }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-120", "text": "The complexity of the restricted NS-MCTAG and restricted V-TAG parsers presented above depends on the number of possible histories that may appear in an item." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-121", "text": "For each step of derivational distance permitted between trees of the same set, the corresponding history permits many more entries." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-122", "text": "History \u039b 1 may contain trees that have adjoined into the same tree as the node of the current item." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-123", "text": "The number of entries is therefore limited by the number of adjunction sites in that tree, which is in turn limited by the number of nodes in that tree." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-124", "text": "We will call the maximum number of nodes in a tree in the grammar t. Theoretically, any tree in the grammar could adjoin at any of these adjunction sites, meaning that the number of possible values for each entry in the history is bounded by the size of the grammar |G|." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-125", "text": "Thus the size of \u039b 1 is O(|G| t )." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-126", "text": "For \u039b 2 the en-Unary Complete" }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-127", "text": "Adjoin base: tries correspond to tree that have adjoined into a tree that has adjoined into the tree of the current item." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-128", "text": "Thus, for each of the t trees that may have adjoined at a derivational distance of one, there are t more trees that may have adjoined at a derivational distance of two." 
}, { "sent_id": "616e8732490f0fa87d35998f769196-C001-129", "text": "The size of \u039b 2 is therefore |G| t 2 ." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-130", "text": "The combined size of the histories for a grammar with a delay or restriction of d is therefore O(|G|" }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-131", "text": "Replacing the sum with its closed form solution, we" }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-132", "text": ") histories." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-133", "text": "Using the reasoning about the size of the histories given above, the restricted NS-MCTAG parser presented here has a complexity of O(n 6 |G| 1+ t d+1 \u22121 t\u22121 ), where t is as defined above and d is the limit on delay of adjunction." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-134", "text": "For a tree-local MCTAG, the complexity reduces to O(n 6 |G| 2+t )." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-135", "text": "For the linguistic applications that motivate this chapter no delay greater than two is needed, resulting in a complexity of O(n 6 |G| 2+t+t 2 )." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-136", "text": "The same complexity analysis applies for restricted V-TAG." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-137", "text": "However, we can provide a somewhat tighter bound by noting that the rank, r, of the grammar-how many tree sets adjoin in a single tree-and the fan out, f of the grammar-how many trees may be in a single tree set-are limited by t. That is, a complete derivation containing |D| tree sets can contain no more than t |D| individual trees and also no more than rf |D| individual trees." 
}, { "sent_id": "616e8732490f0fa87d35998f769196-C001-138", "text": "In the restricted V-TAG algorithm we maintain only one tree from a tree set in the history at a time, so rather than maintaining O(t) entries in each history, we only need to maintain the smaller O(r) entries." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-139", "text": "The complexity of the delayed TL-MCTAG parser depends on the number of possible active delay lists." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-140", "text": "As above, each delay list may have a maximum of t entries for trees that adjoin directly into it." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-141", "text": "The restriction on the number of active delays means that the active delay lists passed up from these child nodes at the point of adjunction or substitution can have size no more than d. This results in an additional td(f \u2212 1) possible entries in the active de- )." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-142", "text": "Thus the complexity of the parser is O(n 6 |G| 2+t(1+d(f \u22121)) )." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-143", "text": "----------------------------------" }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-144", "text": "**CONCLUSION**" }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-145", "text": "Each of the formalisms presented above extends the flexibility of MCTAG beyond that of TL-MCTAG while maintaining, as we have shown herein, complexity much less than that of SL-MCTAG." }, { "sent_id": "616e8732490f0fa87d35998f769196-C001-146", "text": "All three formalisms permit modeling of flexible composition (because they permit one member of a tree set to be a derivational ancestor of another tree in the same set), at least restricted NS-MCTAG and restricted V-TAG permit analyses of scrambling, and all three permit analyses of the various challenging semantic constructions mentioned in the introduction." 
}, { "sent_id": "616e8732490f0fa87d35998f769196-C001-147", "text": "We conclude that extending locality by constraining derivational distance may be an effective way to add flexibility to MCTAG without losing computational tractability." } ], "y": { "@BACK@": { "gold_contexts": [ [ "616e8732490f0fa87d35998f769196-C001-13" ], [ "616e8732490f0fa87d35998f769196-C001-68" ], [ "616e8732490f0fa87d35998f769196-C001-107" ] ], "cite_sentences": [ "616e8732490f0fa87d35998f769196-C001-13", "616e8732490f0fa87d35998f769196-C001-68", "616e8732490f0fa87d35998f769196-C001-107" ] }, "@USE@": { "gold_contexts": [ [ "616e8732490f0fa87d35998f769196-C001-20" ], [ "616e8732490f0fa87d35998f769196-C001-109" ] ], "cite_sentences": [ "616e8732490f0fa87d35998f769196-C001-20", "616e8732490f0fa87d35998f769196-C001-109" ] }, "@DIF@": { "gold_contexts": [ [ "616e8732490f0fa87d35998f769196-C001-110" ] ], "cite_sentences": [ "616e8732490f0fa87d35998f769196-C001-110" ] } } }, "ABC_dcfd8cb0179ab156a6ffcab3358a45_30": { "x": [ { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-2", "text": "Visual Question Answering (VQA) task has showcased a new stage of interaction between language and vision, two of the most pivotal components of artificial intelligence." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-3", "text": "However, it has mostly focused on generating short and repetitive answers, mostly single words, which fall short of rich linguistic capabilities of humans." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-4", "text": "We introduce Full-Sentence Visual Question Answering (FSVQA) dataset (www.mi.t.u-tokyo.ac. jp/static/projects/fsvqa), consisting of nearly 1 million pairs of questions and full-sentence answers for images, built by applying a number of rule-based natural language processing techniques to original VQA dataset and captions in the MS COCO dataset." 
}, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-5", "text": "This poses many additional complexities to conventional VQA task, and we provide a baseline for approaching and evaluating the task, on top of which we invite the research community to build further improvements." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-6", "text": "----------------------------------" }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-8", "text": "The research community in artificial intelligence (AI) has witnessed a series of dramatic advances in the AI tasks concerning language and vision in recent years, thanks to the successful applications of deep learning techniques, particularly convolutional neural networks (CNN) and recurrent neural networks (RNN)." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-9", "text": "AI has moved on from naming the entities in the image (Mei et al. 2008; Wang et al. 2009) , to describing the image with a natural sentence (Vinyals et al. 2015; Xu et al. 2015; Karpathy and Li 2015) and then to answering specific questions about the image with the advent of visual question answering (VQA) task (Antol et al. 2015) ." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-10", "text": "However, current VQA task is focused on generating a short answer, mostly single words, which does not fully take advantage of the wide range of expressibility inherent in human natural language." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-11", "text": "Just as we moved from merely naming entities in the image to description of the images with natural sentence, it naturally follows that VQA will also move towards full-sentence answers." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-12", "text": "One way to tackle this issue would be to apply appropriate linguistic rules after a single-word answer is generated." 
}, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-13", "text": "However, previous works in natural language processing field have demonstrated that data-driven response generation achieves better performance than rule-based generation (Ritter, Cherry, and Dolan 2011) ." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-14", "text": "In other words, it is more efficient to provide data only once for pre-training than to parse and tag the text every time to apply universal rules." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-15", "text": "In addition, training with full-sentence answers provides an opportunity for the learning of complex morphological transformations along with visual and semantic understanding, which cannot be done with manual application of rules." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-16", "text": "Learning and generating full-sentence answers will inevitably increase the number of distinct answers at an exponential scale; for example, thousands of samples with simple identical answer \"yes\" will be further divided into \"yes, the color of the car is red,\"\"yes, the boy is holding a bat,\" etc." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-17", "text": "Indeed, our FSVQA dataset contains almost 40 times more answers that are unique than the original VQA dataset." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-18", "text": "This poses additional challenges on top of original VQA task, since now it not only has to come up with the correct answer, but also has to form a full sentence considering how the words are conjugated, inflected, and ordered." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-19", "text": "We introduce Full-Sentence Visual Question Answering (FSVQA) dataset, built by applying linguistic rules to original VQA dataset at zero financial cost." 
}, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-20", "text": "We also provide an augmented version of FSVQA by converting image captions to question and answers." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-21", "text": "We examine baseline approaches, and utilize complementary metrics for evaluation, providing a guideline upon which we invite the research community to build further improvements." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-22", "text": "Our primary contributions can be summarized as following: 1) introducing a novel task of full-sentence visual question answering, 2) building a large, publicly available dataset consisting of up to 1 million full-sentence Q&A pairs, and 3) examining baseline approaches along with a novel combination of evaluation metrics." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-23", "text": "----------------------------------" }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-24", "text": "**RELATED WORK**" }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-25", "text": "A number of datasets on visual question answering have been introduced in recent years (Malinowski and Fritz 2014; Ren, Kiros, and Zemel 2015) , among which (Antol et al. 2015) in particular has gained the most attention and helped popularize the task." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-26", "text": "However, these datasets mostly consist of a small set of answers covering most of the questions, and most of the answers being single word." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-27", "text": "Our FSVQA dataset, derived from (Antol et al. 2015) , minimizes such limitation by converting the answers to full-sentences, thus widely expanding the set of answers." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-28", "text": "(Fukui et al. 2016) proposed multimodal compact bilinear pooling (MCB) to combine multimodal features of visual and text representations." 
}, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-29", "text": "This approach won the 1st place in 2016 VQA Challenge in real images category." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-30", "text": "(Saito et al. 2016) proposed DualNet, in which both addition and multiplication of the input features are performed, in order to fully take advantage of the discriminative features in the data." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-31", "text": "This method won the 1st place in 2016 VQA Challenge in abstract scenes category. ) was one of the first to propose attention model for VQA." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-32", "text": "They proposed stacked attention networks (SANs) that utilize question representations to search for most relevant regions in the image." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-33", "text": "(Noh and Han 2016) also built an attention-based model, which optimizes the network by minimizing the joint loss from all answering units." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-34", "text": "They further-proposed an early stopping strategy, in which overfitting units are disregarded in training." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-35", "text": "(Lu et al. 2016) argued that not only visual attention is important, but also question attention is important." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-36", "text": "Coattention model was thus proposed to jointly decide where to attend visually and linguistically." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-37", "text": "introduced multimodal residual network (MRN), which uses element-wise multiplication for joint residual learning of attention models." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-38", "text": "Most of the works above limited the number of possible answers, which was possible due to a small number of answers covering the majority of the dataset." 
}, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-39", "text": "Our FSVQA dataset imposes additional complexity to existing approaches by having a much larger set of possible answers, in which no small set of labels can cover the majority of the dataset." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-40", "text": "----------------------------------" }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-41", "text": "**DATASET**" }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-42", "text": "Collecting full-sentence annotations from crowd-sourcing tools can be highly costly." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-43", "text": "We circumvent this financial cost by converting the answers in the original VQA dataset (Antol et al. 2015) to full-sentence answers by applying a number of linguistic rules using natural language processing techniques." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-44", "text": "Furthermore, we also provide an augmented version of dataset by converting the human-written captions provided in the MS COCO (Lin et al. 2014) ." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-45", "text": "We generated questions with a set of rules, for which the caption itself becomes the answer." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-46", "text": "Both versions of FSVQA dataset along with the features used in our experiment, as will be described in the Experiment Section, are publicly available for download." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-47", "text": "Note that, for both versions, only train and validation splits are provided, since test splits are not publicly available." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-48", "text": "Also, we only provide open-ended version, and do not provide multiple choice version." 
}, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-49", "text": "----------------------------------" }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-50", "text": "**CONVERTING VQA**" }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-51", "text": "VQA dataset comes with 10 annotations per question, and we chose one annotation per question that has the highest frequency as the single answer for corresponding question." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-52", "text": "If par, one annotation was randomly selected." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-53", "text": "VQA dataset mainly consists of three categories of questions; yes/no, number, and others." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-73", "text": "We also generated questions that are category-specific." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-54", "text": "Table 1 summarizes the general conversion rule for generating full-sentence answers for each category, along with examples." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-55", "text": "Part-of-speech tag notation follows that of PennTree I Tags (Marcus et al. 1994) , except NP and VP refer to parse tree instead of than part-of-speech. tense:V \u2192 T returns the tense of the input verb, conjug:V \u00d7 T \u2192 V conjugates the input verb to the input tense, where V is a space of verbs of all forms, and T is a set of tenses such that T={past, present, future, past perfect, present perfect, future perfect}, except it returns the input as is if the input is of JJ tag." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-56", "text": "negate:V \u2192 V negates the input verb of given tense, and replace(A,B) substitutes A by B. Parentheses indicate an optional addition, while A/B indicates an insertion of one of the two sides depending on the question." 
}, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-57", "text": "To briefly illustrate the process, these general conversion rules substitute the question phrase with the answer, and reorder the sentence with appropriate conjugation and negation." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-58", "text": "While these general rules cover the majority of the questions, some types of questions require additional processing." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-59", "text": "For example, conditional statements with \"if,\" or selective statements such as \"who is in the picture, A or B?,\" can be handled by disregarding the sub-clauses and applying the rules to the main clause." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-60", "text": "Also, since VQA dataset consists of human-written natural language, it inevitably contains variations encompassing colloquialism, typos, grammar violations, and abridgements, which make it difficult to apply any type of general conversion rule." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-61", "text": "We either manually modify them, or leave them as they are." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-62", "text": "----------------------------------" }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-63", "text": "**CONVERTING CAPTIONS**" }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-64", "text": "We also provide an augmented version of the dataset by converting the human-written captions for images into questions and answers." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-65", "text": "Apart from yes/no questions, the answers to the generated questions are the captions themselves, eliminating the burden for generating reliable answers." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-66", "text": "Most images in MS COCO come with 5 captions, and we generated distinct question for each caption, whose conversion rule is shown in Table 2 ." 
}, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-67", "text": "We assigned at least two yes/no questions with one \"yes\" and one \"no\" to all images, roughly balancing the number of answers with \"yes,\" and \"no.\" Questions with affirmative answers involving \"yes\" were generated by simply rephrasing the caption such that the question asks to confirm the contents in the caption, for which the answer is an affirmative statement of the question (which is the caption itself), accompanied by \"Yes,\" in the beginning." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-68", "text": "Questions with non-affirmative answers involving \"no\" were generated by substituting parts of captions with random actions or agents." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-69", "text": "For agents, we randomly choose one class from 1,000 object classes of ILSVRC 2014 object detection task, and substitute it for given agent in the caption." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-70", "text": "For actions, we randomly choose one class from 101 action classes of UCF-101 (Soomro, Zamir, and Shah 2012) plus 20 classes of activities of daily living (ADL) from (Ohnishi et al. 2016) , and substitute it for given verb phrase." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-71", "text": "Resulting questions are frequently of interesting non-sensical type, such as \"are the birds doing push-ups on the tree?,\" for which the answer is simply a negation of the question, accompanied by \"No,\" in the beginning." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-72", "text": "We expect that such set of questions and answers can also potentially help the learning of distinction between common sense and nonsense." 
}, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-74", "text": "When an object or a property of a specific category is referred to, we replace it with the category name, preceded by wh-determiner or wh-pronoun, to ask which object or property it belongs to." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-75", "text": "Our manually pre-defined categories included color, animal, room, food, transportation, and sport." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-76", "text": "Other rules are mostly reversed process of the conversion rules in Table 1 , in which we deliberately mask certain content of the caption, replace it with appropriate question forms, and reorder the sentence." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-77", "text": "Table 3 shows statistics for original VQA dataset and two versions of FSVQA dataset." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-78", "text": "Both versions of FSVQA contain much longer answers on average, and the number of unique answers is more than ten times larger in the regular version and about 38 times larger in the augmented version compared to VQA dataset." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-79", "text": "Augmented version is longer since captions in MS COCO tend to be longer than questions in VQA." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-80", "text": "The size of vocabularies is also many times larger in both versions." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-81", "text": "Note that, consistently with the conversion process, only the most frequent answer from 10 answers per question was taken into consideration for VQA dataset's statistics." 
}, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-82", "text": "----------------------------------" }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-83", "text": "**STATISTICS**" }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-84", "text": "Comparing the coverage of 1,000 most frequent answers in the dataset shows even more striking contrast." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-85", "text": "In VQA dataset, 1,000 most frequent answers covered about 86.5% of the entire dataset, so that learning a small set of frequent answers could perform reasonably well." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-86", "text": "In FSVQA, the coverage by 1,000 most frequent answers is merely 12.7% and 4.8% for regular and augmented version respectively, which is clearly much less than VQA dataset." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-87", "text": "Thus, it adds a critical amount of computational complexity, in which less frequent answers cannot easily be disregarded." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-88", "text": "Table 4 shows the number of unique answers for each category." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-89", "text": "While FSVQA has much more answers in all categories, most striking example is in the yes/no category." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-90", "text": "VQA dataset essentially contains only two answers \"yes\" and \"no\" for yes/no questions (infrequent answers such as \"not sure\" were filtered out in the conversion process)." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-91", "text": "In fact, answering with \"yes\" alone for all questions achieved 70.97% accuracy for yes/no category in the original VQA dataset." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-92", "text": "On the contrary, FSVQA datasets contain approximately 49,000 times more answer and 240,000 times more answers for yes/no category in each version." 
}, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-93", "text": "It becomes clear again that FSVQA cannot be taken advantage of by manipulating a small set of frequent answers." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-94", "text": "Figure 2 shows the percentage distribution of answers in the respective datasets for number of words in the answers." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-95", "text": "While over 90% of the answers in VQA dataset are single words, both versions of FSVQA dataset show much smoother distribution over a wide range of number of words." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-96", "text": "----------------------------------" }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-97", "text": "**EXPERIMENT SETTING**" }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-98", "text": "We used 4096-dimensional features from the second fullyconnected layer (fc7) of VGG (Simonyan and Zisserman 2014) with 19 layers, trained on ImageNet (Deng et al. 2009 ), as our image features." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-99", "text": "Words in the question were input to LSTM (Hochreiter and Schmidhuber 1997) one at a time as one-hot vector, where the dictionary contains only the words appearing more than once." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-100", "text": "Image features and question features are then mapped to common embedding space as a 1,024-dimensional vector." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-101", "text": "Batch size was 500 and training was performed for 300 epochs." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-102", "text": "We trained only with the answers that appear twice or more in train split, as using all unique answers in the dataset fails to run, with required memory far beyond the capacity of most of the contemporary GPUs, NVIDIA Tesla 40m in our case." 
}, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-103", "text": "20,130 answers appear more than once in regular version, covering 95,340 questions from 62,292 images, and 23,400 answers appear more than once in the augmented version, which cover 105,563 questions from 64,060 images." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-104", "text": "This is only about 25% and 15.5% of the entire train split in respective version, which again shows a striking contrast with the original VQA dataset, in which only 1,000 answers covered up to 86.5% of the dataset." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-105", "text": "Following (Antol et al. 2015) , we examined the effect of Conversely, we also examined an approach where only image features are concerned." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-106", "text": "This requires a slightly different training procedure, as it does not involve a series of onehot vector inputs." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-107", "text": "We followed the conventional approach used in image captioning task (Vinyals et al. 2015) , where the image features are fixed, and a stack of LSTM units learns to generate the ground truth captions." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-108", "text": "Each LSTM unit generates one word at a time, which in turn enters the next LSTM unit." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-109", "text": "The only difference in our case is that the ground truth captions are replaced by full-sentence answers." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-110", "text": "----------------------------------" }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-111", "text": "**EVALUATION**" }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-112", "text": "Metrics Evaluating the results from the experiments also poses a challenge." 
}, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-113", "text": "Since original VQA dataset consisted mostly of short answers, evaluation was as simple as matching the results with ground truths, and yielding the percentage." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-114", "text": "Yet, since we now have full-sentence answers, simply matching the results with ground truths will not be compatible." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-115", "text": "For example, \"yes, the color of the cat is red\" for the question in which the original ground truth is \"yes,\" will be classified as incorrect using the current evaluation tool for original VQA." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-116", "text": "We thus come up with a set of mutually complementary ways of evaluating full-sentence answers." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-117", "text": "First, we employ the frequently used evaluation metrics for image captioning task, namely BLEU (Papineni et al. 2002 ), METEOR (Denkowski and Lavie 2014 ), and CIDEr (Vedantam, Zitnick, and Parikh 2015 ." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-118", "text": "The goal is to quantify the overall resemblance of the results to the ground truth answers." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-119", "text": "However, higher performance on these evaluation metrics does not necessarily imply that it is a more accurate answer." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-120", "text": "For instance, \"the color of the car is red\" will have higher scores than \"it is blue\" for a ground-truth answer \"the color of the car is blue,\" due to more tokens being identical." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-121", "text": "Yet, the former is clearly an incorrect answer, whereas the latter should be considered correct." 
}, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-122", "text": "In order to overcome this drawback, we also employ a simple complementary evaluation metric, whose procedure is as follows: we examine whether the short answer from the original dataset is present in the generated result, and extract the short answer if present." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-123", "text": "If not, we leave the answer blank." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-124", "text": "Extracted terms in this way are tested with the evaluation tool for original VQA." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-125", "text": "Using the previous example, the rationale is that as long as the original short answer \"blue\" is present in the generated result, it can be assumed that the answer is correct." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-126", "text": "We refer to this metric as VQA accuracy." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-127", "text": "Note that, for augmented version, generated answers for only the original subset are extracted to measure VQA accuracy, since there are no ground truth VQA answers for the augmented segment." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-128", "text": "However, there also exist cases in which VQA accuracy can be misleading, since the rest of the context may not be compatible with the question." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-129", "text": "For example, \"yes, the color is blue.\" will be considered correct if the original answer is \"yes,\" but it should not be considered correct if the question was \"is it raining?\" In fact, this was one of the underlying concerns in the original VQA dataset, since we cannot be sure whether \"yes\" or \"no\" was generated in the right sense or purely by chance." 
}, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-130", "text": "In order to alleviate this issue, we also report FSVQA accuracy, which is the percentage in which the ground truth answer in FSVQA dataset contains the generated answer." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-131", "text": "Since the answers have to be matched at the sentence level, it assures us with high confidence that the answer was correct in the intended context." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-132", "text": "----------------------------------" }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-133", "text": "**RESULTS & DISCUSSION**" }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-134", "text": "Results for all metrics are shown in Table 5 ." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-135", "text": "Note that evaluation is performed on the results for validation split, since ground truths for test split are not publicly available." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-136", "text": "While each metric is concerned with slightly different aspect of the answers, results shows that they gen- erally tend to agree with each other." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-137", "text": "Figure 3 shows examples of generated full-sentence answers for each model, along with the ground truth answer from the original VQA dataset." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-138", "text": "As expected, answers generated from using both question and image features turn out to be most reliable." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-139", "text": "Answers from question features alone result in answers that match the questions but are frequently out of visual context given by the image." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-140", "text": "Likewise, answers generated from image features alone fit the images but are frequently out of textual context given by the question." 
}, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-141", "text": "It is notable that using image features alone performs very poorly, whereas using question features alone results in performances comparable to using both features." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-142", "text": "One plausible explanation is that, since using image features alone always generates the same answer for the same image regardless of the question, it can only get 1 out of k questions correctly at best, where k is the number of questions per image." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-143", "text": "On the contrary, using question features alone essentially reduces the problem to a semantic Q&A task, which can be handled one at a time." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-144", "text": "This tendency is consistent with the results reported in (Antol et al. 2015) ." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-145", "text": "It must nevertheless be reminded that the best performances in both (Antol et al. 2015) and our experiment were achieved with the presence of both visual and textual clues." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-146", "text": "----------------------------------" }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-147", "text": "**CONCLUSION**" }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-148", "text": "We introduced FSVQA, a publicly available dataset consisting of nearly 1 million pairs of questions and full-sentence answers for images, built by applying linguistic rules to existing datasets." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-149", "text": "While pushing forward the VQA task to a more human-like stage, it poses many extra complexities." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-150", "text": "We examined baseline approaches for tackling this novel task." 
}, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-151", "text": "Applying some of the successful approaches from the original VQA task, such as attention mechanism, will be an intriguing and important future work." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-152", "text": "Whether generative approach can play more role in the future, as the number of answers grows larger and classification approach becomes less efficient, is also of interest." }, { "sent_id": "dcfd8cb0179ab156a6ffcab3358a45-C001-153", "text": "We invite the research community to come up with an innovative and efficient way to improve the performance on FSVQA." } ], "y": { "@BACK@": { "gold_contexts": [ [ "dcfd8cb0179ab156a6ffcab3358a45-C001-9" ], [ "dcfd8cb0179ab156a6ffcab3358a45-C001-25" ] ], "cite_sentences": [ "dcfd8cb0179ab156a6ffcab3358a45-C001-9", "dcfd8cb0179ab156a6ffcab3358a45-C001-25" ] }, "@MOT@": { "gold_contexts": [ [ "dcfd8cb0179ab156a6ffcab3358a45-C001-25", "dcfd8cb0179ab156a6ffcab3358a45-C001-26" ] ], "cite_sentences": [ "dcfd8cb0179ab156a6ffcab3358a45-C001-25" ] }, "@EXT@": { "gold_contexts": [ [ "dcfd8cb0179ab156a6ffcab3358a45-C001-27" ], [ "dcfd8cb0179ab156a6ffcab3358a45-C001-43" ] ], "cite_sentences": [ "dcfd8cb0179ab156a6ffcab3358a45-C001-27", "dcfd8cb0179ab156a6ffcab3358a45-C001-43" ] }, "@USE@": { "gold_contexts": [ [ "dcfd8cb0179ab156a6ffcab3358a45-C001-105" ] ], "cite_sentences": [ "dcfd8cb0179ab156a6ffcab3358a45-C001-105" ] }, "@SIM@": { "gold_contexts": [ [ "dcfd8cb0179ab156a6ffcab3358a45-C001-144" ], [ "dcfd8cb0179ab156a6ffcab3358a45-C001-145" ] ], "cite_sentences": [ "dcfd8cb0179ab156a6ffcab3358a45-C001-144", "dcfd8cb0179ab156a6ffcab3358a45-C001-145" ] } } }, "ABC_b4d7e9b7942698ef0678d3b4a0ad7d_30": { "x": [ { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-126", "text": "We use another feature set LEX to capture word ngrams, POS (part of speech) ngrams and mixed ngrams." 
}, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-109", "text": "Findings 3 & 4 are interesting since they reveal special characteristics of threads involving hierarchically related participants." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-150", "text": "**CONCLUSION**" }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-2", "text": "We introduce the problem of predicting who has power over whom in pairs of people based on a single written dialog." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-3", "text": "We propose a new set of structural features." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-4", "text": "We build a supervised learning system to predict the direction of power; our new features significantly improve the results over using previously proposed features." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-5", "text": "----------------------------------" }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-7", "text": "Computationally analyzing the social context in which language is used has gathered great interest within the NLP community recently." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-8", "text": "One of the areas that has generated substantial research is the study of how social power relations between people affect and/or are revealed in their interactions with one another." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-9", "text": "Researchers have proposed systems to detect social power relations between participants of organizational email threads (Bramsen et al., 2011; Gilbert, 2012; , online forums (Danescu-NiculescuMizil et al., 2012; Biran et al., 2012; DanescuNiculescu-Mizil et al., 2013) , chats (Strzalkowski et al., 2012) , and off-line interactions such as presidential debates Nguyen et al., 2013) ." 
}, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-10", "text": "Automatically identifying power and influence from interactions can have many practical applications ranging from law enforcement and intelligence to online marketing." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-11", "text": "A significant number of these studies are performed in the domain of organizational email where there is a well defined notion of power (organizational hierarchy)." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-12", "text": "Bramsen et al. (2011) and Gilbert (2012) predict hierarchical power relations between people in the Enron email corpus using lexical features extracted from all the messages exchanged between them." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-13", "text": "However, their approaches primarily apply to situations where large collections of messages exchanged between pairs of people are available." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-14", "text": "In , we introduced the problem of detecting whether a participant of an email thread has power over someone else in the thread and established the importance of dialog structure in that task." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-15", "text": "However, in that work we did not detect over whom that person has power." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-16", "text": "In this paper, we introduce a new problem formulation." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-17", "text": "We predict the hierarchical power relation between pairs of participants in an email interaction thread based solely on features extracted from that thread." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-18", "text": "As a second major contribution, we introduce a new set of features to capture aspects of participant behavior such as responsiveness, and we show that these features are significantly correlated with the direction of power." 
}, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-19", "text": "We present a fully automatic system for this task obtaining an accuracy of 73.0%, an improvement of 6.9% over 68.3% by a system using only lexical features." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-20", "text": "This best-performing system uses our new feature set." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-21", "text": "----------------------------------" }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-22", "text": "**MOTIVATION**" }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-23", "text": "Early NLP-based approaches such as Bramsen et al. (2011) and Gilbert (2012) built systems to predict hierarchical power relations between people in the Enron email corpus using lexical features from all the messages exchanged between them." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-151", "text": "We introduced the problem of predicting who has power over whom based on a single thread of written interactions." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-110", "text": "In , we had found that persons with hierarchical power rarely initiated threads and contributed less within the threads. But that problem formulation was different -we were identifying whether a person in a given thread had hierarchical power over someone else or not." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-111", "text": "The data points in that formulation included participants from threads that did not have any hierarchically related people, whereas our current formulation do not." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-112", "text": "These findings suggest that if a person starts an email thread, he's likely not to be the one who has power, but if a thread includes a pair of people who are hierarchically related, then it is likely to be initiated by the superior and he/she tends to contribute more in such threads." 
}, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-113", "text": "----------------------------------" }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-114", "text": "**PREDICTING DIRECTION OF POWER**" }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-115", "text": "We build an SVM-based supervised learning system that can predict HP (p 1 , p 2 ) to be either superior or subordinate based on the interaction within a thread t for any pair of participants (p 1 , p 2 ) \u2208 RIPP t ." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-116", "text": "We deterministically fix the order of participants in (p 1 , p 2 ) such that p 1 is the sender of the first message in IM t (p 1 , p 2 )." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-117", "text": "We use the ClearTK (Ogren et al., 2008) wrapper for SVMLight (Joachims, 1999) in our experiments." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-118", "text": "We use the related interacting participant pairs in threads from the train set to train our models and optimize our performance on those from the dev set." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-119", "text": "We report results obtained on dev and test sets." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-120", "text": "In our formulation, values of many features are undefined for some instances (e.g., Inform% is undefined when MsgCount = 0)." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-121", "text": "Handling of undefined values for features in SVM is not straightforward." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-122", "text": "Most SVM implementations assume the value of 0 by default in such cases, conflating them with cases where Inform% is truly 0." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-123", "text": "In order to mitigate this issue, we use an indicator feature for each structural feature to denote whether or not it is valid." 
}, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-124", "text": "Since we use a quadratic kernel, we expect the SVM to pick up the interaction between each feature and its indicator feature." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-125", "text": "Lexical features have already been shown to be valuable in predicting power relations (Bramsen et al., 2011; Gilbert, 2012) ." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-24", "text": "One limitation of this approach is that it relies solely on lexical cues and hence works best when large collections of messages exchanged between the pairs of people are available." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-25", "text": "For example, Bramsen et al. (2011) excluded sender-recipient pairs who exchanged fewer than 500 words from their evaluation set, since they found smaller text samples are harder to classify." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-26", "text": "By taking the message out of the context of the interaction in which it was exchanged, they fail to utilize cues from the structure of interactions, which complements the lexical cues in detecting power relations, as we showed in ." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-27", "text": "We modeled the problem of detecting power relationships differently in : we predicted whether a participant in an email thread has a certain type of power or not." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-28", "text": "However, in that work we did not predict over whom he/she has that power." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-29", "text": "This may result in noisy features; consider a thread in which participant X has power over participant Y, who has power over participant Z. By aggregating features over all messages sent by Y, features salient to a subordinate-superior interaction are incorrectly conflated with those salient to superior-subordinate interaction." 
}, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-30", "text": "Another limitation of ) is that we used manual annotations for many of our features such as dialog acts and overt displays of power." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-31", "text": "Relying on manual annotations for features limited our analysis to a small subset of the Enron corpus, which has only 18 instances of hierarchical power." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-32", "text": "Consequently, our findings with respect to hierarchical power were weak in terms of both correlations of features and system performance." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-33", "text": "In this paper, we introduce the problem of predicting who has power over whom in pairs of interacting participants based on a single thread of interactions." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-34", "text": "From (Bramsen et al., 2011) we retain the idea that we want to predict the power relation between pairs of people. But in contrast to their formulation, we retain the goal from ) that we want to study communication in the context of an interaction, and that we want to be able to make predictions using only the emails exchanged in a single thread." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-35", "text": "Like , we use features to capture the dialog structure, but we use automatic taggers to generate them and assume no manual annotation at all at training or test time." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-36", "text": "This allows us to use the entire Enron email corpus for this study." 
}, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-37", "text": "----------------------------------" }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-38", "text": "**DATA**" }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-39", "text": "In this work, we use the version of Enron email corpus by Yeh and Harnly (2006) which captures the thread structure of email exchanges." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-40", "text": "The corpus contains 36,615 email threads." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-41", "text": "We excluded a small subset of 419 threads that was used for previous manual annotation efforts, part of which was also used to train the DA and ODP taggers (Section 5) that generate features for our system." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-42", "text": "The average number of email messages per thread was around 3." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-43", "text": "We divided the remaining threads into train (50%), dev (25%) and test (25%) sets by random sampling." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-44", "text": "We then applied various basic NLP preprocessing steps such as tokenization, POS tagging and lemmatization to the body of email messages." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-45", "text": "We use the Enron gold organizational hierarchy released by Agarwal et al. (2012) to model hierarchical power." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-46", "text": "Their corpus was manually built using information from Enron organizational charts." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-47", "text": "It includes relations of 1,518 employees and captures dominance relations between 13,724 pairs of them." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-48", "text": "Theirs is the largest such data set available to the best of our knowledge." 
}, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-49", "text": "----------------------------------" }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-50", "text": "**PROBLEM FORMULATION**" }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-51", "text": "Let t denote an email thread and M t denote the set of all messages in t. Also, let P t be the set of all participants in t, i.e., the union of senders and recipients (To and CC) of all messages in M t ." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-52", "text": "We are interested in detecting power relations between pairs of participants who interact within a given email thread." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-53", "text": "Not every pair of participants (p 1 , p 2 ) \u2208 P t \u00d7 P t interact with one another within t. Let IM t (p 1 , p 2 ) denote the set of Interaction Messages -non-empty messages in t in which either p 1 is the sender and p 2 is one of the recipients or vice versa." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-54", "text": "We call the set of" }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-55", "text": "We focus on the manifestations of power in interactions between people across different levels of hierarchy." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-56", "text": "For every (p 1 , p 2 ) \u2208 IPP t , we query the set of dominance relations in the gold hierarchy to determine their hierarchical power relation (HP (p 1 , p 2 ))." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-57", "text": "We exclude pairs that do not exist in the gold hierarchy from our analysis and denote the remaining set of related interacting participant pairs as RIPP t ." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-58", "text": "We assign HP (p 1 , p 2 ) to be superior if p 1 dominates p 2 , and subordinate if p 2 dominates p 1 ." 
}, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-59", "text": "Table 1 shows the total number of pairs in IPP t and RIPP t from all the threads in our corpus and across train, dev and test sets." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-60", "text": "Given a thread t and a pair of participants (p 1 , p 2 ) \u2208 RIPP t , we want to automatically detect HP (p 1 , p 2 )." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-61", "text": "This problem formulation is similar to the ones in (Bramsen et al., 2011) and (Gilbert, 2012) ." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-62", "text": "However, the difference is that for us an instance is a pair of participants in a single thread of interaction (which may or may not include other people), whereas for them an instance constitutes all messages exchanged between a pair of people in the entire corpus." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-63", "text": "Our formulation also differs from in that we detect power relations between pairs of participants, instead of just whether a participant had power over anyone in the thread." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-64", "text": "----------------------------------" }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-65", "text": "**STRUCTURAL ANALYSIS**" }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-66", "text": "In this section we analyze various features that capture the structure of interaction between the pairs of participants in a thread." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-67", "text": "Each feature f is extracted with respect to a person p over a ref-" }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-68", "text": "Mt . The first two capture behavior of the pair among themselves, while the third and fourth capture their overall behavior in the entire thread." 
}, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-69", "text": "We group our features into three categories -THR New , THR PR and DIA PR ." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-70", "text": "THR New is a set of new features we propose, while THR PR and DIA PR incorporate features we proposed in ." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-71", "text": "THR New and THR PR capture the structure of message exchanges without looking at the content of the emails (e.g., how many emails did a person send), while DIA PR captures the pragmatics of the dialog and requires an analysis of the content of the emails (e.g., did they issue any requests)." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-72", "text": "THR New : This is a new set of features we introduce in this paper." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-73", "text": "It includes the average number of recipients (AvgRecipients) and To recipients (AvgToRecipients) in emails sent by p, the percentage of emails p received in which he/she was in the To list (InToList%), boolean features denoting whether p added or removed people when responding to a message (AddPerson and RemovePerson), average number of replies received per message sent by p (ReplyRate) and average number of replies received from the other person of the pair to messages where he/she was a To recipient (ReplyRateWithinPair)." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-74", "text": "ReplyRateWithinPair applies only to IM t (p 1 , p 2 )." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-75", "text": "THR PR : This feature set includes two metadata based feature sets -positional and verbosity." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-76", "text": "Positional features include a boolean feature to denote whether p sent the first message (Initiate), and relative positions of p's first and last messages (FirstMsgPos and LastMsgPos) in M ." 
}, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-77", "text": "Verbosity features include p's message count (MsgCount), message ratio (MsgRatio), token count (TokenCount), token ratio (TokenRato) and tokens per message (TokenPerMsg), all calculated over M ." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-78", "text": "----------------------------------" }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-79", "text": "**DIA PR :**" }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-80", "text": "In (Prabhakaran and Rambow, 2013), we used dialog features derived from manual annotations -dialog acts (DA) and overt displays of power (ODP) -to model the structure of interactions within the message content." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-81", "text": "In this work, we obtain DA and ODP tags on the entire corpus using automatic taggers trained on those manual annotations." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-82", "text": "The DA tagger (Omuya et al., 2013) obtained an accuracy of 92%." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-83", "text": "The ODP tagger (Prabhakaran et al., 2012) obtained an accuracy of 96% and F-measure of 54%." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-84", "text": "The DA tagger labels each sentence to be one of the 4 dialog acts: Request Action, Request Information, Inform, and Conventional." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-85", "text": "The ODP Tagger identifies sentences (mostly requests) that express additional constraints on its response, beyond those introduced by the dialog act." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-86", "text": "We use 5 features: ReqAction%, ReqInform%, Inform%, Conventional%, and ODP% to capture the percentage of sentences in messages sent by p that has each of these labels." 
}, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-87", "text": "We also use a feature to capture the number of p's messages with a request that did not get a reply, i.e., dangling requests (DanglingReq%), over all messages sent by p." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-88", "text": "We perform an unpaired two-sample two-tailed Student's t-Test comparing mean values of each feature for subordinates vs. superiors." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-89", "text": "For our analysis, a data point is a related interacting pair, and not a message." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-90", "text": "Hence, a message with multiple recipients who have a superior/subordinate relation with the sender will contribute to features for multiple data points." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-91", "text": "We limit our analysis to the related interacting pairs from only our train set." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-92", "text": "Table 2 presents mean values of features for subordinates and superiors at the interaction level." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-93", "text": "Thread level versions of these features also obtained similar results overall in terms of direction of difference and significance." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-94", "text": "We denote three significance levels -* (p < .05 ), ** (p < .01 ), and *** (p < .001 )." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-95", "text": "To control false discovery rates in multiple testing, we adjusted the p-values (Benjamini and Hochberg, 1995 the main findings on the significant features below." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-96", "text": "1. Superiors send messages addressed to more people (AvgRecipients and AvgToRecipients)." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-97", "text": "Consequently, they get more replies to their messages (ReplyRate)." 
}, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-98", "text": "However, considering messages where the other person of the pair is addressed in the To list (ReplyRateWithinPair), subordinates get more replies." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-99", "text": "2. Superiors issue more requests (ReqAction% and ReqInform%) and overt displays of power (ODP%)." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-100", "text": "Subordinates issue more informs (Inform%) and, surprisingly, have fewer unanswered requests (DanglingReq%)." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-101", "text": "3. Superiors initiate the interactions more often than subordinates (Initiate)." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-102", "text": "They also leave interactions earlier (LastMsgPos)." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-103", "text": "4. Superiors send shorter messages (TokenPerMsg)." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-104", "text": "They also send more messages (MsgCount & MsgRatio) and even contribute a higher ratio of tokens in the thread (TokenRatio) despite sending shorter messages." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-105", "text": "Finding 1 goes in line with findings from studies analyzing social networks that superiors have higher connectivity in the networks that they are part of (Rowe et al., 2007) ." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-106", "text": "Intuitively, those who have higher connectivity also send emails to larger number of people, and hence our result." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-107", "text": "Since superiors address more people in their emails, they also have a higher chance of getting replies." 
}, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-108", "text": "Finding 2 also aligns with the general intuition about how superiors and subordinates behave within interactions (e.g., superiors exhibit more overt displays of power than subordinates)." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-127", "text": "A mixed ngram (Prabhakaran et al., 2012 ) is a special case of word ngram where words belonging to open classes are replaced with their POS tags." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-128", "text": "We found the best setting to be using both unigrams and bigrams for all three types of ngrams, by tuning in our dev set." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-129", "text": "We then performed experiments using all subsets of {LEX, THR New , THR PR , DIA PR }." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-130", "text": "Table 3 presents the results obtained using various feature subsets." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-131", "text": "We use a majority class baseline assigning HP (p 1 , p 2 ) to be always superior, which obtains 52.5% accuracy." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-132", "text": "We also use a stronger baseline using word unigrams and bigrams as features, which obtained an accuracy of 68.6%." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-133", "text": "The performance of the system using each structural feature class on its own is very low." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-134", "text": "Combining all three of them improves the accuracy to 62.5%." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-135", "text": "The highest performance obtained without using any message content is for THR PR and THR New (61.5%)." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-136", "text": "LEX features by itself obtain a very high accuracy of 70.7%, confirming the importance of lexical patterns in this task." 
}, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-137", "text": "Perplexingly, adding all structural features to LEX reduces the accuracy by around 2.2 percentage points." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-138", "text": "The best performing system (BEST) uses LEX and THR New features and obtains an accuracy of 73.0%, a statistically significant improvement over the LEX-only system (McNemar)." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-139", "text": "We also performed an ablation study to understand the importance of different slices of our feature sets." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-140", "text": "If we remove all feature versions with respect to the second person, the accuracy drops to 72.1%." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-141", "text": "This suggests that features about the other person's behavior also help the prediction task." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-142", "text": "If we remove either the thread level versions of features or interaction level versions of features, the accuracy again drops, suggesting that both the pair's behavior among themselves, and their overall behavior in the thread add value to the prediction task." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-143", "text": "Removing the indicator feature denoting the structural features' validity also reduces the performance of the system." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-144", "text": "We now discuss evaluation on our blind test set." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-145", "text": "The majority baseline (Always Superior) for accuracy is 55.0%." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-146", "text": "The word unigrams and bigrams baseline obtains an accuracy of 68.3%." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-147", "text": "The LEX system (using other forms of ngrams as well) obtains a slightly lower accuracy of 68.1%." 
}, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-148", "text": "Our BEST system using LEX and THR New features obtains an accuracy of 73.0% (coincidentally the same as on the dev set), an improvement of 6.9% over the system using only lexical features." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-149", "text": "----------------------------------" }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-152", "text": "We introduced a new set of features which describe the structure of the dialog." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-153", "text": "Using this feature set, we obtain an accuracy of 73.0% on a blind test." }, { "sent_id": "b4d7e9b7942698ef0678d3b4a0ad7d-C001-154", "text": "In future work, we will tackle the problem of three-way classification of pairs of participants, which will cover cases in which they are not in a power relation at all." } ], "y": { "@BACK@": { "gold_contexts": [ [ "b4d7e9b7942698ef0678d3b4a0ad7d-C001-9" ], [ "b4d7e9b7942698ef0678d3b4a0ad7d-C001-23" ], [ "b4d7e9b7942698ef0678d3b4a0ad7d-C001-125" ] ], "cite_sentences": [ "b4d7e9b7942698ef0678d3b4a0ad7d-C001-9", "b4d7e9b7942698ef0678d3b4a0ad7d-C001-23", "b4d7e9b7942698ef0678d3b4a0ad7d-C001-125" ] }, "@MOT@": { "gold_contexts": [ [ "b4d7e9b7942698ef0678d3b4a0ad7d-C001-23", "b4d7e9b7942698ef0678d3b4a0ad7d-C001-24", "b4d7e9b7942698ef0678d3b4a0ad7d-C001-25" ] ], "cite_sentences": [ "b4d7e9b7942698ef0678d3b4a0ad7d-C001-23", "b4d7e9b7942698ef0678d3b4a0ad7d-C001-25" ] }, "@DIF@": { "gold_contexts": [ [ "b4d7e9b7942698ef0678d3b4a0ad7d-C001-34" ] ], "cite_sentences": [ "b4d7e9b7942698ef0678d3b4a0ad7d-C001-34" ] }, "@SIM@": { "gold_contexts": [ [ "b4d7e9b7942698ef0678d3b4a0ad7d-C001-61" ] ], "cite_sentences": [ "b4d7e9b7942698ef0678d3b4a0ad7d-C001-61" ] } } }, "ABC_280affafa32147a63e7eeda8d5f763_30": { "x": [ { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-71", "text": "We performed the correlation analysis as follows." 
}, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-2", "text": "Image description is a new natural language generation task, where the aim is to generate a human-like description of an image." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-3", "text": "The evaluation of computer-generated text is a notoriously difficult problem, however, the quality of image descriptions has typically been measured using unigram BLEU and human judgements." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-4", "text": "The focus of this paper is to determine the correlation of automatic measures with human judgements for this task." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-5", "text": "We estimate the correlation of unigram and Smoothed BLEU, TER, ROUGE-SU4, and Meteor against human judgements on two data sets." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-6", "text": "The main finding is that unigram BLEU has a weak correlation, and Meteor has the strongest correlation with human judgements." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-7", "text": "----------------------------------" }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-9", "text": "Recent advances in computer vision and natural language processing have led to an upsurge of research on tasks involving both vision and language." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-10", "text": "State of the art visual detectors have made it possible to hypothesise what is in an image (Guillaumin et al., 2009; Felzenszwalb et al., 2010) , paving the way for automatic image description systems." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-11", "text": "The aim of such systems is to extract and reason about visual aspects of images to generate a humanlike description." 
}, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-12", "text": "An example of the type of image and gold-standard descriptions available can be seen in Figure 1 ." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-13", "text": "Recent approaches to this task have been based on slot-filling (Yang et al., 2011; Elliott and Keller, 2013) , combining web-scale ngrams , syntactic tree substitution (Mitchell et al., 2012) , and description-by-retrieval (Farhadi et al., 2010; Ordonez et al., 2011; Hodosh et al., 2013) ." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-14", "text": "Image description has been compared to translating an image into text or summarising an image" }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-15", "text": "1. An older woman with a small dog in the snow." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-16", "text": "2. A woman and a cat are outside in the snow." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-17", "text": "3. A woman in a brown vest is walking on the snow with an animal." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-18", "text": "4. A woman with a red scarf covering her head walks with her cat on snow-covered ground." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-19", "text": "5. Heavy set woman in snow with a cat." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-20", "text": "Figure 1: An image from the Flickr8K data set and five human-written descriptions." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-21", "text": "These descriptions vary in the adjectives or prepositional phrases that describe the woman (1, 3, 4, 5), incorrect or uncertain identification of the cat (1, 3), and include a sentence without a verb (5)." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-22", "text": "(Yang et al., 2011) , resulting in the adoption of the evaluation measures from those communities." 
}, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-23", "text": "In this paper we estimate the correlation of human judgements with five automatic evaluation measures on two image description data sets." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-24", "text": "Our work extends previous studies of evaluation measures for image description (Hodosh et al., 2013) , which focused on unigram-based measures and reported agreement scores such as Cohen's \u03ba rather than correlations." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-70", "text": "**PROTOCOL**" }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-25", "text": "The main finding of our analysis is that TER and unigram BLEU are weakly corre-lated against human judgements, ROUGE-SU4 and Smoothed BLEU are moderately correlated, and the strongest correlation is found with Meteor." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-26", "text": "----------------------------------" }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-27", "text": "**METHODOLOGY**" }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-28", "text": "We estimate Spearman's \u03c1 for five different automatic evaluation measures against human judgements for the automatic image description task." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-29", "text": "Spearman's \u03c1 is a non-parametric correlation coefficient that restricts the ability of outlier data points to skew the co-efficient value." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-30", "text": "The automatic measures are calculated on the sentence level and correlated against human judgements of semantic correctness." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-31", "text": "----------------------------------" }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-32", "text": "**DATA**" }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-33", "text": "We perform the correlation analysis on the Flickr8K data set of Hodosh et al. 
(2013) , and the data set of Elliott and Keller (2013) ." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-34", "text": "The test data of the Flickr8K data set contains 1,000 images paired with five reference descriptions." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-35", "text": "The images were retrieved from Flickr, the reference descriptions were collected from Mechanical Turk, and the human judgements were collected from expert annotators as follows: each image in the test data was paired with the highest scoring sentence(s) retrieved from all possible test sentences by the TRI5SEM model in Hodosh et al. (2013) ." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-36", "text": "Each image-description pairing in the test data was judged for semantic correctness by three expert human judges on a scale of 1-4." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-37", "text": "We calculate automatic measures for each image-retrieved sentence pair against the five reference descriptions for the original image." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-38", "text": "The test data of Elliott and Keller (2013) contains 101 images paired with three reference descriptions." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-39", "text": "The images were taken from the PASCAL VOC Action Recognition Task, the reference descriptions were collected from Mechanical Turk, and the judgements were also collected from Mechanical Turk." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-40", "text": "Elliott and Keller (2013) generated two-sentence descriptions for each of the test images using four variants of a slot-filling model, and collected five human judgements of the semantic correctness and grammatical correctness of the description on a scale of 1-5 for each image-description pair, resulting in a total of 2,042 human judgement-description pairings."
}, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-41", "text": "In this analysis, we use only the first sentence of the description, which describes the event depicted in the image." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-42", "text": "----------------------------------" }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-43", "text": "**AUTOMATIC EVALUATION MEASURES**" }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-44", "text": "BLEU measures the effective overlap between a reference sentence X and a candidate sentence Y ." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-45", "text": "It is defined as the geometric mean of the effective n-gram precision scores, multiplied by the brevity penalty factor BP to penalise short translations." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-46", "text": "p n measures the effective overlap by calculating the proportion of the maximum number of n-grams co-occurring between a candidate and a reference and the total number of n-grams in the candidate text." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-47", "text": "More formally," }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-48", "text": "Unigram BLEU without a brevity penalty has been reported by ), Ordonez et al. (2011 , and Kuznetsova et al. (2012) ; to the best of our knowledge, the only image description work to use higher-order n-grams with BLEU is Elliott and Keller (2013) ." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-49", "text": "In this paper we use the smoothed BLEU implementation of Clark et al. (2011) to perform a sentence-level analysis, setting n = 1 and no brevity penalty to get the unigram BLEU measure, or n = 4 with the brevity penalty to get the Smoothed BLEU measure." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-50", "text": "We note that a higher BLEU score is better." 
}, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-51", "text": "ROUGE measures the longest common subsequence of tokens between a candidate Y and reference X. There is also a variant that measures the cooccurrence of pairs of tokens in both the candidate and reference (a skip-bigram): ROUGE-SU*. The skip-bigram calculation is parameterised with d skip , the maximum number of tokens between the words in the skip-bigram." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-52", "text": "Setting d skip to 0 is equivalent to bigram overlap and setting d skip to \u221e means tokens can be any distance apart." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-53", "text": "If \u03b1 = |SKIP2(X,Y )| is the number of matching skip-bigrams between the reference and the candidate, then skip-bigram ROUGE is formally defined as:" }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-54", "text": "ROUGE has been used by only Yang et al. (2011) to measure the quality of generated descriptions, using a variant they describe as ROUGE-1." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-55", "text": "We set d skip = 4 and award partial credit for unigram only matches, otherwise known as ROUGE-SU4." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-56", "text": "We use ROUGE v.1.5.5 for the analysis, and configure the evaluation script to return the result for the average score for matching between the candidate and the references." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-57", "text": "A higher ROUGE score is better." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-58", "text": "TER measures the number of modifications a human would need to make to transform a candidate Y into a reference X. The modifications available are insertion, deletion, substitute a single word, and shift a word an arbitrary distance." 
}, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-59", "text": "TER is expressed as the percentage of the sentence that needs to be changed, and can be greater than 100 if the candidate is longer than the reference." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-60", "text": "More formally, TER = |edits| |reference tokens| TER has not yet been used to evaluate image description models." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-61", "text": "We use v.0.8.0 of the TER evaluation tool, and a lower TER is better." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-62", "text": "Meteor is the harmonic mean of unigram precision and recall that allows for exact, synonym, and paraphrase matchings between candidates and references." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-63", "text": "It is calculated by generating an alignment between the tokens in the candidate and reference sentences, with the aim of a 1:1 alignment between tokens and minimising the number of chunks ch of contiguous and identically ordered tokens in the sentence pair." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-64", "text": "The alignment is based on exact token matching, followed by Wordnet synonyms, and then stemmed tokens." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-65", "text": "We can calculate precision, recall, and F-measure, where m is the number of aligned unigrams between candidate and reference." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-66", "text": "Meteor is defined as:" }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-67", "text": "We calculated the Meteor scores using release 1.4.0 with the package-provided free parameter settings of 0.85, 0.2, 0.6, and 0.75 for the matching components." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-68", "text": "Meteor has not yet been reported to evaluate the performance of different models on the image description task; a higher Meteor score is better." 
}, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-69", "text": "----------------------------------" }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-72", "text": "The sentence-level evaluation measures were calculated for each image-description-reference tuple." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-73", "text": "We collected the BLEU, TER, and Meteor scores using MultEval (Clark et al., 2011) , and the ROUGE-SU4 scores using the RELEASE-1.5.5.pl script." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-74", "text": "The evaluation measure scores were then compared with the human judgements using Spearman's correlation estimated at the sentence-level." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-75", "text": "Table 1 shows the correlation co-efficients between automatic measures and human judgements and Figures 2(a) and (b) show the distribution of scores for each measure against human judgements." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-76", "text": "To classify the strength of the correlations, we followed the guidance of Dancey and Reidy (2011) , who posit that a co-efficient of 0.0-0.1 is uncorrelated, 0.11-0.4 is weak, 0.41-0.7 is moderate, 0.71-0.90 is strong, and 0.91-1.0 is perfect." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-77", "text": "On the Flickr8k data set, all evaluation measures can be classified as either weakly correlated or moderately correlated with human judgements and all results are significant." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-78", "text": "TER is only weakly correlated with human judgements but could prove useful in comparing the types of differences between models." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-79", "text": "An analysis of the distribution of TER scores in Figure 2 (a) shows that differences in candidate and reference length are prevalent in the image description task." 
}, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-80", "text": "Unigram BLEU is also only weakly correlated against human judgements, even though it has been reported extensively for this task." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-81", "text": "Finally, Meteor is most strongly correlated measure against human judgements." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-82", "text": "A similar pattern is observed in the Elliott and Keller (2013) data set, though the correlations are lower across all measures." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-83", "text": "This could be caused by the smaller sample size or because the descriptions were generated by a computer, and not retrieved from a collection of human-written descriptions containing the goldstandard text, as in the Flickr8K data set." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-84", "text": "Figure 3 shows two images from the test collection of the Flickr8K data set with a low Meteor score and a maximum human judgement of semantic correctness." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-85", "text": "The main difference between the candidates and references are in deciding what to describe (content selection), and how to describe it (realisation)." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-86", "text": "We can hypothesise that in both translation and summarisation, the source text acts as a lexical and semantic framework within which the translation or summarisation process takes place." 
}, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-87", "text": "----------------------------------" }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-88", "text": "**RESULTS**" }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-89", "text": "----------------------------------" }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-90", "text": "**QUALITATIVE ANALYSIS**" }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-91", "text": "In Figure 3(a) , the authors of the descriptions made different decisions on what to describe." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-92", "text": "A decision has been made to describe the role of the officials in the candidate text, and not in the reference text." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-93", "text": "The underlying cause of this is an active area of research in the human vision literature and can be attributed to bottom-up effects, such as saliency (Itti et al., 1998) , top-down contextual effects (Torralba et al., 2006) , or rapidly-obtained scene properties (Oliva and Torralba, 2001) ." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-94", "text": "In (b), we can see the problem of deciding how to describe the selected content." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-95", "text": "The reference uses a more specific noun to describe the person on the bicycle than the candidate." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-96", "text": "----------------------------------" }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-97", "text": "**DISCUSSION**" }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-98", "text": "There are several differences between our analysis and that of Hodosh et al. (2013) ." 
}, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-99", "text": "First, we report Spearman's \u03c1 correlation coefficient of automatic measures against human judgements, whereas they report agreement between judgements and automatic measures in terms of Cohen's \u03ba." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-100", "text": "The use of \u03ba requires the transformation of real-valued scores into categorical values, and thus loses information; we use the judgement and evaluation measure scores in their original forms." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-101", "text": "Second, our use of Spearman's \u03c1 means we can readily use all of the available data for the correlation analysis, whereas Hodosh et al. (2013) report agreement on thresholded subsets of the data." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-102", "text": "Third, we report the correlation coefficients against five evaluation measures, Candidate: Football players gathering to contest something to collaborating officials." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-103", "text": "Reference: A football player in red and white is holding both hands up." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-104", "text": "In contrast to the results presented here, Reiter and Belz (2009) found no significant correlations of automatic evaluation measures against human judgements of the accuracy of machine-generated weather forecasts." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-105", "text": "They did, however, find significant correlations of automatic measures against fluency judgements." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-106", "text": "There are no fluency judgements available for Flickr8K, but Elliott and Keller (2013) report grammaticality judgements for their data, which are comparable to fluency ratings." 
}, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-107", "text": "We failed to find significant correlations between grammatlicality judgements and any of the automatic measures on the Elliott and Keller (2013) data." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-108", "text": "This discrepancy could be explained in terms of the differences between the weather forecast generation and image description tasks, or because the image description data sets contain thousands of texts and a few human judgements per text, whereas the data sets of Reiter and Belz (2009) included hundreds of texts with 30 human judges." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-109", "text": "----------------------------------" }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-110", "text": "**CONCLUSIONS**" }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-111", "text": "In this paper we performed a sentence-level correlation analysis of automatic evaluation measures against expert human judgements for the automatic image description task." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-112", "text": "We found that sentencelevel unigram BLEU is only weakly correlated with human judgements, even though it has extensively reported in the literature for this task." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-113", "text": "Meteor was found to have the highest correlation with human judgements, but it requires Wordnet and paraphrase resources that are not available for all languages." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-114", "text": "Our findings held when judgements were made on human-written or computer-generated descriptions." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-115", "text": "The variability in what and how people describe images will cause problems for all of the measures compared in this paper." 
}, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-116", "text": "Nevertheless, we propose that unigram BLEU should no longer be used as an objective function for automatic image description because it has a weak correlation with human accuracy judgements." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-117", "text": "We recommend adopting either Meteor, Smoothed BLEU, or ROUGE-SU4 because they show stronger correlations with human judgements." }, { "sent_id": "280affafa32147a63e7eeda8d5f763-C001-118", "text": "We believe these suggestions are also applicable to the ranking tasks proposed in Hodosh et al. (2013) , where automatic evaluation scores could act as features to a ranking function." } ], "y": { "@BACK@": { "gold_contexts": [ [ "280affafa32147a63e7eeda8d5f763-C001-13" ], [ "280affafa32147a63e7eeda8d5f763-C001-38", "280affafa32147a63e7eeda8d5f763-C001-40" ], [ "280affafa32147a63e7eeda8d5f763-C001-48" ] ], "cite_sentences": [ "280affafa32147a63e7eeda8d5f763-C001-13", "280affafa32147a63e7eeda8d5f763-C001-38", "280affafa32147a63e7eeda8d5f763-C001-40", "280affafa32147a63e7eeda8d5f763-C001-48" ] }, "@USE@": { "gold_contexts": [ [ "280affafa32147a63e7eeda8d5f763-C001-33" ], [ "280affafa32147a63e7eeda8d5f763-C001-37", "280affafa32147a63e7eeda8d5f763-C001-38", "280affafa32147a63e7eeda8d5f763-C001-39", "280affafa32147a63e7eeda8d5f763-C001-40" ], [ "280affafa32147a63e7eeda8d5f763-C001-106" ], [ "280affafa32147a63e7eeda8d5f763-C001-107" ] ], "cite_sentences": [ "280affafa32147a63e7eeda8d5f763-C001-33", "280affafa32147a63e7eeda8d5f763-C001-38", "280affafa32147a63e7eeda8d5f763-C001-40", "280affafa32147a63e7eeda8d5f763-C001-106", "280affafa32147a63e7eeda8d5f763-C001-107" ] }, "@DIF@": { "gold_contexts": [ [ "280affafa32147a63e7eeda8d5f763-C001-40", "280affafa32147a63e7eeda8d5f763-C001-41" ] ], "cite_sentences": [ "280affafa32147a63e7eeda8d5f763-C001-40" ] }, "@SIM@": { "gold_contexts": [ [ "280affafa32147a63e7eeda8d5f763-C001-82" ] ], "cite_sentences": [ 
"280affafa32147a63e7eeda8d5f763-C001-82" ] } } }, "ABC_3e86788379f2c0074ff16687d68fc9_30": { "x": [ { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-97", "text": "**DATA PREPARATION**" }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-2", "text": "Abstract." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-3", "text": "We present a method of chunking in Korean texts using conditional random fields (CRFs), a recently introduced probabilistic model for labeling and segmenting sequence of data." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-4", "text": "In agglutinative languages such as Korean and Japanese, a rule-based chunking method is predominantly used for its simplicity and efficiency." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-5", "text": "A hybrid of a rule-based and machine learning method was also proposed to handle exceptional cases of the rules." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-6", "text": "In this paper, we present how CRFs can be applied to the task of chunking in Korean texts." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-7", "text": "Experiments using the STEP 2000 dataset show that the proposed method significantly improves the performance as well as outperforms previous systems." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-8", "text": "----------------------------------" }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-10", "text": "Text chunking is a process to identify non-recursive cores of various phrase types without conducting deep parsing of text [3] ." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-11", "text": "Abney first proposed it as an intermediate step toward full parsing [1] ." 
}, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-12", "text": "Since Ramshaw and Marcus approached NP chunking using a machine learning method, many researchers have used various machine learning techniques [2, 4, 5, 6, 10, 11, 13, 14] ." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-13", "text": "The chunking task was extended to the CoNLL-2000 shared task with standard datasets and evaluation metrics, which is now a standard evaluation task for text chunking [3] ." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-14", "text": "Most previous works with relatively high performance in English used machine learning methods for chunking [4, 13] ." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-15", "text": "Machine learning methods are mainly divided into the generative approach and conditional approach." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-16", "text": "The generative approach relies on generative probabilistic models that assign a joint probability p(X,Y) of paired input sequence and label sequence, X and Y respectively." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-17", "text": "It provides straightforward understanding of underlying distribution." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-18", "text": "However, this approach is intractable in most domains without strong independence assumptions that each input element is independent from the other elements in input sequence, and is also difficult to use multiple interacting features and long-range dependencies between input elements." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-19", "text": "The conditional approach views the chunking task as a sequence of classification problems, and defines a conditional probability p(Y|X) over label sequence given input sequence." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-20", "text": "A number of conditional models recently have been developed for use." 
}, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-21", "text": "They showed better performance than generative models as they can handle many arbitrary and overlapping features of input sequence [12] ." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-22", "text": "A number of methods are applied to chunking in Korean texts." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-23", "text": "Unlike English, a rule-based chunking method [7, 8] is predominantly used in Korean because of its well-developed function words, which contain information such as grammatical relation, case, tense, modal, etc." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-24", "text": "Chunking in Korean texts with only simple heuristic rules obtained through observation on the text shows a good performance similar to other machine learning methods [6] ." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-25", "text": "Park et al. proposed a hybrid of rule-based and machine learning method to handle exceptional cases of the rules, to improve the performance of chunking in Korean texts [5, 6] ." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-26", "text": "In this paper, we present how CRFs, a recently introduced probabilistic model for labeling and segmenting sequence of data [12] , can be applied to the task of chunking in Korean texts." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-27", "text": "CRFs are undirected graphical models trained to maximize conditional probabilities of label sequence given input sequence." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-28", "text": "It takes advantage of generative and conditional models." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-29", "text": "CRFs can include many correlated, overlapping features, and they are trained discriminatively like conditional model." 
}, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-30", "text": "Since CRFs have single exponential model for the conditional probability of entire label sequence given input sequence, they also guarantee to obtain globally optimal label sequence." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-31", "text": "CRFs successfully have been applied in many NLP problems such as part-of-speech tagging [12] , text chunking [13, 15] and table extraction from government reports [19] ." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-32", "text": "The rest of this paper is organized as follows." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-33", "text": "Section 2 gives a simple introduction to CRFs." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-34", "text": "Section 3 explains how CRFs is applied to the task of chunking in Korean texts." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-35", "text": "Finally, we present experimental results and draw conclusions." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-36", "text": "----------------------------------" }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-37", "text": "**CONDITIONAL RANDOM FIELDS**" }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-38", "text": "Conditional Random Fields (CRFs) are conditional probabilistic sequence models first introduced by Lefferty et al [12] ." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-39", "text": "CRFs are undirected graphical models, which can be used to define the joint probability distribution over label sequence given the entire input sequence to be labeled, rather than being directed graphical models such as Maximum Entropy Markov Models (MEMMs) [11] ." 
}, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-40", "text": "It relaxes the strong independence assumption of Hidden Markov Models (HMMs), as well as resolves the label bias problem exhibited by MEMMs and other non-generative directed graphical models such as discriminative Markov models [12] ." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-41", "text": "----------------------------------" }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-42", "text": "**FUNDAMENTALS OF CRFS**" }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-43", "text": "CRFs may be viewed as an undirected graphical model globally conditioned on input sequence [14] ." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-44", "text": "Let X=x 1 x 2 x 3 \u2026x n be an input sequence and Y=y 1 y 2 y 3 \u2026y n a label sequence." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-45", "text": "In the chunking task, X is associated with a sequence of words and Y is associated with a sequence of chunk types." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-46", "text": "If we assume that the structure of a graph forms a simple first-order chain, as illustrated in Figure 1 , CRFs define the conditional probability of a label sequence Y given an input sequence X by the Hammersley-Clifford theorem [16] as follows:" }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-47", "text": "where Z(X) is a normalization factor; f k (y i-1 , y i , X, i) is a feature function at positions i and i-1 in the label sequence; k \u03bb is a weight associated with feature k f ." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-48", "text": "where t k (y i-1 , y i , X, i) is a transition feature function of the entire input sequence and the labels at positions i and i-1 in the label sequence; s k (y i , X, i) is a state feature function of the label at position i and the observed input sequence; and k \u03bb and k \u00b5 are parameters to be estimated from training data." 
}, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-49", "text": "The parameters k \u03bb and k \u00b5 play similar roles to the transition and emission probabilities in HMMs [12] ." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-50", "text": "Therefore, the most probable label sequence for input sequence X is Y* which maximizes a posterior probability." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-51", "text": "We can find Y* with dynamic programming using the Viterbi algorithm." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-52", "text": "----------------------------------" }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-53", "text": "**PARAMETER ESTIMATION FOR CRFS**" }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-54", "text": "Assuming the training data {(X (n) , Y (n) )} are independently and identically distributed, the product of Equation 1 over all training sequences is a likelihood function of the parameter \u03bb ." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-55", "text": "Maximum likelihood training chooses parameter values such that the log-likelihood is maximized [10] ." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-56", "text": "For CRFs, the log-likelihood ) (\u03bb L is given by" }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-57", "text": "It is not possible to analytically determine the parameter values that maximize the log-likelihood." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-58", "text": "Instead, maximum likelihood parameters must be identified using an iterative technique such as iterative scaling [12] or gradient-based methods [13, 14] ." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-59", "text": "Lafferty et al. proposed two iterative scaling algorithms to find parameters for CRFs." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-60", "text": "However, these methods converge into a global maximum very slowly." 
}, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-61", "text": "To overcome this problem of slow convergence, several researchers adopted modern optimization algorithms such as the conjugate-gradient method or the limited-memory BFGS(L-BFGS) method [17] ." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-62", "text": "----------------------------------" }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-63", "text": "**CHUNKING USING CONDITIONAL RANDOM FIELDS IN KOREAN TEXTS**" }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-64", "text": "We now describe how CRFs are applied to the task of chunking in Korean texts." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-65", "text": "Firstly, we explore characteristics and chunk types of Korean." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-66", "text": "Then we explain the features for the model of chunking in Korean texts using CRFs." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-67", "text": "The ultimate goal of a chunker is to output appropriate chunk tags of a sequence of words with part-ofspeech tags." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-68", "text": "----------------------------------" }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-69", "text": "**CHARACTERISTICS OF KOREAN**" }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-70", "text": "Korean is an agglutinative language, in which a word unit (called an eojeol) is a composition of a content word and function word(s)." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-71", "text": "Function words -postpositions and endings -give much information such as grammatical relation, case, tense, modal, etc." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-72", "text": "Well-developed function words in Korean help with chunking, especially NP and VP chunking." 
}, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-73", "text": "For example, when the part-of-speech of current word is one of determiner, pronoun and noun, the following seven rules for NP chunking in Table 1 can find most NP chunks in text, with about 89% accuracy [6] ." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-74", "text": "For this reason, boundaries of chunks are easily found in Korean, compared to other languages such as English or Chinese." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-75", "text": "This is why a rule-based chunking method is predominantly used." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-76", "text": "However, with sophisticated rules, the rule-based chunking method has limitations when handling exceptional cases." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-77", "text": "Park et al. proposed a hybrid of the rule-based and the machine learning method to resolve this problem [5, 6] ." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-78", "text": "Many recent machine learning techniques can capture hidden characteristics for classification." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-79", "text": "Despite its simplicity and efficiency, the rule-based method has recently been outdone by the machine learning method in various classification problems." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-80", "text": "----------------------------------" }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-81", "text": "**CHUNK TYPES OF KOREAN**" }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-82", "text": "Abney was the first to use the term 'chunk' to represent a non-recursive core of an intra-clausal constituent, extending from the beginning of constituent to its head." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-83", "text": "In Korean, there are four basic phrases: noun phrase (NP), verb phrase (VP), adverb phrase (ADVP), and independent phrase (IP) [6] ." 
}, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-84", "text": "As function words such as postposition or ending are well-developed, the number of chunk types is small compared to other languages such as English or Chinese." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-85", "text": "Like the CoNLL-2000 dataset, we use three types of chunk border tags, indicating whether a word is outside a chunk (O), starts a chunk (B), or continues a chunk (I)." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-86", "text": "Each chunk type XP has two border tags: B-XP and I-XP." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-87", "text": "XP should be one of NP, VP, ADVP and IP." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-88", "text": "There exist nine chunk tags in Korean." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-89", "text": "----------------------------------" }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-90", "text": "**FEATURE SET OF CRFS**" }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-91", "text": "One advantage of CRFs is that they can use many arbitrary, overlapping features." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-92", "text": "So we take advantage of all context information of a current word." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-93", "text": "We use words, partof-speech tags of context and combinations of part-of-speech tags to determine the chunk tag of the current word,." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-94", "text": "The window size of context is 5; from left two words to right two words." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-95", "text": "Table 3 summarizes the feature set for chunking in Korean texts." 
}, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-96", "text": "----------------------------------" }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-98", "text": "For evaluation of our proposed method, we use the STEP 2000 Korean chunking dataset (STEP 2000 dataset) 1 , which is converted from the parsed KAIST Corpus [9] ." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-99", "text": "The STEP 2000 dataset consists of 12,092 sentences." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-100", "text": "We divide this corpus into training data and test data." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-101", "text": "Training data has 10,883 sentences and test data has 1,209 sentences, 90% and 10% respectively." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-102", "text": "Table 4 summarizes characteristics of the STEP 2000 dataset." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-103", "text": "Figure 2 shows an example sentence of the STEP 2000 dataset and its format is equal to that of CoNLL-2000 dataset." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-104", "text": "Each line is composed of a word, its part-of-speech (POS) tag and a chunk tag." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-105", "text": "----------------------------------" }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-106", "text": "**EVALUATION METRIC**" }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-107", "text": "The standard evaluation metrics for chunking performance are precision, recall and Fscore (F \u03b2 =1 ) [3] ." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-108", "text": "F-score is used for comparisons with other reported results." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-109", "text": "Each equation is defined as follows." 
}, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-110", "text": "----------------------------------" }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-111", "text": "**EXPERIMENTAL RESULTS**" }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-112", "text": "Experiments were performed with C++ implementation of CRFs (FlexCRFs) on Linux with 2.4 GHz Pentium IV dual processors and 2.0Gbyte of main memory [18] ." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-113", "text": "We use L-BFGS to train the parameters and use a Gaussian prior regularization in order to avoid overfitting [20] ." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-114", "text": "Total number of CRF features is 83,264." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-115", "text": "As shown in Table 5 , the performances of most chunk type are 96~100%, very high performance." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-116", "text": "However, the performance of NP chunk type is lowest, 94.27% because the border of NP chunk type is very ambiguous in case of consecutive nouns." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-117", "text": "Using more features such as previous chunk tag should be able to improve the performance of NP chunk type." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-118", "text": "Park et al. reported the performance of various chunking methods [6] ." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-119", "text": "We add the experimental results of the chunking methods using HMMs-bigram and CRFs." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-120", "text": "In Table 6 , F-score of chunking using CRFs in Korean texts is 97.19%, the highest performance of all." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-121", "text": "It significantly outperforms all others, including machine learning methods, rule-based methods and hybrid methods." 
}, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-122", "text": "It is because CRFs have a global optimum solution hence overcoming the label bias problem." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-123", "text": "They also can use many arbitrary, overlapping features." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-124", "text": "Figure 3 shows the performance curve on the same test set in terms of the precision, recall and F-score with respect to the size of training data." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-125", "text": "In this figure, we can see that the performance slowly increases in proportion to the size of training data." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-126", "text": "In the experiment, we can see that CRFs can help improve the performance of chunking in Korean texts." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-127", "text": "CRFs have many promising properties except for the slow convergence speed compared to other models." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-128", "text": "In the next experiment, we have tried to analyze the importance of each feature and to make an additional experiment with various window sizes and any other useful features." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-129", "text": "----------------------------------" }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-130", "text": "**CONCLUSION**" }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-131", "text": "In this paper, we proposed a chunking method for Korean texts using CRFs." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-132", "text": "We observed that the proposed method outperforms other approaches." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-133", "text": "Experiments on the STEP 2000 dataset showed that the proposed method yields an F-score of 95.36%." 
}, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-134", "text": "This performance is 2.82% higher than that of SVMs and 1.15% higher than that of the hybrid method." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-135", "text": "CRFs use a number of correlated features and overcome the label bias problem." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-136", "text": "We obtained a very high performance using only small features." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-137", "text": "Thus, if we use more features such as semantic information or collocation, we can obtain a better performance." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-138", "text": "From the experiment, we know that the proposed method using CRFs can significantly improve the performance of chunking in Korean texts." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-139", "text": "CRFs are a good framework for labeling an input sequence." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-140", "text": "In our future work, we will investigate how CRFs can be applied to other NLP problems: parsing, semantic analysis and spam filtering." }, { "sent_id": "3e86788379f2c0074ff16687d68fc9-C001-141", "text": "Finally, we hope that this work can contribute to the body of research in this field." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "3e86788379f2c0074ff16687d68fc9-C001-12" ], [ "3e86788379f2c0074ff16687d68fc9-C001-24" ], [ "3e86788379f2c0074ff16687d68fc9-C001-73" ], [ "3e86788379f2c0074ff16687d68fc9-C001-77" ], [ "3e86788379f2c0074ff16687d68fc9-C001-83" ], [ "3e86788379f2c0074ff16687d68fc9-C001-118" ] ], "cite_sentences": [ "3e86788379f2c0074ff16687d68fc9-C001-12", "3e86788379f2c0074ff16687d68fc9-C001-24", "3e86788379f2c0074ff16687d68fc9-C001-73", "3e86788379f2c0074ff16687d68fc9-C001-77", "3e86788379f2c0074ff16687d68fc9-C001-83", "3e86788379f2c0074ff16687d68fc9-C001-118" ] }, "@USE@": { "gold_contexts": [ [ "3e86788379f2c0074ff16687d68fc9-C001-24", "3e86788379f2c0074ff16687d68fc9-C001-25", "3e86788379f2c0074ff16687d68fc9-C001-26" ] ], "cite_sentences": [ "3e86788379f2c0074ff16687d68fc9-C001-24", "3e86788379f2c0074ff16687d68fc9-C001-25" ] }, "@SIM@": { "gold_contexts": [ [ "3e86788379f2c0074ff16687d68fc9-C001-118", "3e86788379f2c0074ff16687d68fc9-C001-119" ] ], "cite_sentences": [ "3e86788379f2c0074ff16687d68fc9-C001-118" ] } } }, "ABC_a800862f17f7a8c13ed13fc6e9433f_30": { "x": [ { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-2", "text": "It is generally believed that direct sequence-to-sequence (seq2seq) speech recognition models are competitive with hybrid models only when a large amount of data, at least a thousand hours, is available for training." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-3", "text": "In this paper, we show that state-of-the-art recognition performance can be achieved on the Switchboard-300 database using a single headed attention, LSTM based model." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-4", "text": "Using a cross-utterance language model, our single-pass speaker independent system reaches 6.4% and 12.5% word error rate (WER) on the Switchboard and CallHome subsets of Hub5'00, without a pronunciation lexicon." 
}, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-5", "text": "While careful regularization and data augmentation are crucial in achieving this level of performance, experiments on Switchboard-2000 show that nothing is more useful than more data." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-6", "text": "----------------------------------" }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-8", "text": "Powerful neural networks have enabled the use of \"end-to-end\" speech recognition models that directly map a sequence of acoustic features to a sequence of words without conditional independence assumptions." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-9", "text": "Typical examples are attention based encoder-decoder [1] and recurrent neural network transducer models [2] ." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-10", "text": "Due to training on full sequences, an utterance corresponds to a single observation from the view point of these models; thus, data sparsity is a general challenge for such approaches, and it is believed that these models are effective only when sufficient training data is available." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-11", "text": "Indeed, many end-toend speech recognition papers focus on LibriSpeech, which has 960 hours of training audio." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-12", "text": "Nevertheless, the best performing systems follow the traditional hybrid approach [3] , outperforming attention based encoder-decoder models [4, 5, 6, 7] , and when less training data is used, the gap between \"end-to-end\" and hybrid models is more prominent [4, 8] ." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-13", "text": "Several methods have been proposed to tackle data sparsity and overfitting problems; a detailed list can be found in Sec. 2." 
}, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-14", "text": "Recently, increasingly complex attention mechanisms have been proposed to improve seq2seq model performance, including stacking self and regular attention layers and using multiple attention heads in the encoder and decoder [5, 9] ." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-15", "text": "We show that consistent application of various regularization techniques brings a simple, single-head LSTM attention based encoder-decoder model to state-of-the-art performance on Switchboard-300, a task where data sparsity is more severe than LibriSpeech." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-16", "text": "We also note that remarkable performance has been achieved with single-head LSTM models in a recent study on language modeling [10] ." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-17", "text": "----------------------------------" }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-18", "text": "**METHODS TO IMPROVE SEQ2SEQ MODELS**" }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-19", "text": "In contrast to traditional hybrid models, where even recurrent networks are trained on randomized, aligned chunks of labels and features [11, 12] , whole sequence models are more prone to memorizing the training samples." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-20", "text": "In order to improve generalization, many of the methods we investigate introduce additional noise, either directly or indirectly, to stochastic gradient descent (SGD) training to avoid narrow, local optima." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-21", "text": "The other techniques we study address the highly non-convex nature of training neural networks, ease the optimization process, and speed up convergence." 
}, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-22", "text": "Weight decay adds the l2 norm of the trainable parameters to the loss function, which encourages the weights to stay small unless necessary, and is one of the oldest techniques to improve neural network generalization." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-23", "text": "As shown in [13] , weight decay can improve generalization by suppressing some of the effects of static noise on the targets." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-24", "text": "Dropout randomly deactivates neurons with a predefined probability in every training step [14] to reduce co-adaptation of neurons." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-25", "text": "DropConnect, which is similar in spirit to dropout, randomly deactivates connections between neurons by temporarily zeroing out weights [15] ." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-26", "text": "Zoneout, which is also inspired by dropout and was especially developed for recurrent models [16] , stochastically forces some hidden units to maintain their previous values." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-27", "text": "In LSTMs, the method is applied on the cell state or on the recurrent feedback of the output." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-28", "text": "Label smoothing interpolates the hard label targets with a uniform distribution over targets, and improves generalization in many classification tasks [17] ." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-29", "text": "Batch normalization (BN) accelerates training by standardizing the distribution of each layer's input [18] ." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-30", "text": "In order to reduce the normalization mismatch between training and testing, we modify the original approach by freezing the batch normalization layers in the middle of the training when the magnitude of parameter updates is small." 
}, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-31", "text": "After freezing, the running statistics are not updated, batch statistics are ignored, and BN layers approximately operate as global normalization." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-32", "text": "Scheduled sampling stochastically uses the token produced by a sequence model instead of the true previous token during training to mitigate the effects of exposure bias [19] ." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-33", "text": "Residual networks address the problem of vanishing and exploding gradients by including skip connections [20] in the model that force the neural network to learn a residual mapping function using a stack of layers." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-34", "text": "Optimization of this residual mapping is easier, allowing the use of much deeper structures." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-35", "text": "Curriculum learning simplifies deep neural network training by presenting training examples in a meaningful order, usually by increasing order of difficulty [21] ." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-36", "text": "In seq2seq models, the input acoustic sequences are frequently sorted in order of in-creasing length [22] ." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-37", "text": "Speed and tempo perturbation changes the rate of speech, typically by \u00b110%, with or without altering the pitch and timbre of the speech signal [23, 24] ." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-38", "text": "The goal of these methods is to increase the amount of training data for the model." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-39", "text": "Sequence noise injection adds structured sequence level noise generated from speech utterances to training examples to improve the generalization of seq2seq models [25] ." 
}, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-40", "text": "As previously shown, input noise during neural network training encourages convergence to a local optimum with lower curvature, which indicates better generalization [26] ." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-41", "text": "Weight noise adds noise directly to the network parameters to improve generalization [27] ." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-42", "text": "This form of noise can be interpreted as a simplified form of Bayesian inference that optimizes a minimum description length loss [28] ." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-43", "text": "SpecAugment masks blocks of frequency channels and blocks of time steps [4] and also warps the spectrogram along the time axis to perform data augmentation." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-44", "text": "It is closely related to [29] ." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-45", "text": "----------------------------------" }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-46", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-47", "text": "This study focuses on Switchboard-300, a standard 300-hour English conversational speech recognition task." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-48", "text": "Our acoustic and text data preparation follows the Kaldi [30] s5c recipe." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-49", "text": "Our attention based seq2seq model is similar to [31, 32] and follows the structure of [33] ." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-50", "text": "We extract 80-dimensional log-Mel filterbank features over 25ms frames every 10ms from the input speech signal." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-51", "text": "The input audio is speed and/or tempo perturbed with 5 /6 probability." 
}, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-52", "text": "Following [25] , sequence noise mixed from up to 4 utterances is injected with 40% probability and 0.3 weight." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-53", "text": "The filterbank output is mean-and-variance normalized at the speaker level, and first (\u2206) and second (\u2206\u2206) derivatives are also calculated." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-54", "text": "The final features presented to the network are also processed through a SpecAugment block that uses the SM policy [4] with p = 0.3 and no time warping." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-55", "text": "The encoder network comprises 8 bidirectional LSTM layers with 1536 nodes per direction per layer [34, 35] ." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-56", "text": "As shown in Fig. 1a , each LSTM block in the encoder includes a residual connection with a linear transformation that bypasses the LSTM, a 1024-dimensional linear reduction layer on the LSTM output, and batch-normalization (BN) of the block output." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-57", "text": "A pyramidal structure [32] in the first two LSTM layers reduces the frame rate by a factor of 4." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-58", "text": "The final dimension of the encoder output is 256, enforced by a linear bottleneck." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-59", "text": "We apply 30% dropout to the LSTM outputs and 30% drop-connect to the hidden-to-hidden matrices [15, 36] ." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-60", "text": "As suggested by [37] , the weight dropout is fixed for a batch of sequences." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-61", "text": "The attention based decoder model is illustrated in Fig. 1b ." 
}, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-62", "text": "The decoder models the sequence of 600 BPE units estimated on characters [38] , where the BPE units are embedded in 256 dimensions." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-63", "text": "We use additive, location aware attention, without key/value transformations, and the attention is smoothed by 256, 5-dimensional kernels [39] ." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-64", "text": "The decoder block consists of 2 unidirectional LSTM layers: one is a dedicated languagemodel-like component with 512 nodes that operates only on the embedded predicted symbol sequence, and the other is a 768 unit layer processing acoustic and symbol information." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-65", "text": "The output of both LSTMs is reduced to 256 dimensions by a linear bottleneck [40] ." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-66", "text": "Fixed sequence-level weight dropout of 15% is applied in the decoder LSTMs, a dropout of 5% is applied to the embeddings, and a dropout of 15% is applied to the decoder LSTM outputs." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-67", "text": "The second LSTM in the decoder also uses zoneout, where the cell state update is deactivated with 15% probability and the recurrent feedback from the output maintains its previous value with 5% probability." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-68", "text": "Overall, the model has 280M parameters, of which only 5.4M are in the decoder." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-69", "text": "Aiming at the best word error rate, this design choice is based on our observation that an external language model has significantly larger effect if the decoder is not over-parametrized [33] ." 
}, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-70", "text": "The model is trained for 250 epochs on 32 P100 GPUs in less than 4 days using a PyTorch [41] implementation of distributed synchronous SGD with up to 32 sequences per GPU per batch." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-71", "text": "Training uses a learning rate of 0.03 and Nesterov momentum [42] of 0.9." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-72", "text": "The weight decay parameter is 4e-6, the label smoothing parameter is 0.35, and teacher forcing is fixed to 0.8 throughout training." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-73", "text": "In the first 3 epochs the learning rate is warmed up and batch size is gradually increased from 8 to 32 [43] ." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-74", "text": "In the first 35 epochs, the neural network is trained on sequences sorted in ascending order of length of the input." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-75", "text": "Afterwards, batches are randomized within length buckets, ensuring that a batch always contains sequences with similar length." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-76", "text": "Weight noise from a normal distribution with mean 0.0 and variance 0.015 is switched on after 70 epochs." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-77", "text": "After 110 epochs, the updates of sufficient statistics in the batchnormalization layers are turned off, converting them into fixed affine transformations." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-78", "text": "The learning rate is annealed by 0.9 per epoch after 180 epochs of training, and simultaneously label smoothing is also switched off." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-79", "text": "The external language model (LM) is built on the BPE segmentation of 24M words from the Switchboard and Fisher corpora." 
}, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-80", "text": "It is trained for 40 epochs using label smoothing of 0.15 in the first 20 epochs." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-81", "text": "The baseline LM has 57M parameters and consists of 2 unidirectional LSTM layers with 2048 nodes [44] trained with drop-connect and dropout probabilities of 15%." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-82", "text": "The embedding layer has 512 nodes, and the output of the last LSTM is projected to 128 dimensions." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-83", "text": "When the LM is trained and evaluated across utterances, consecutive segments of a single-channel recording are grouped together up to 40 seconds." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-84", "text": "Perplexities (PPL) are measured at the word level on the concatenation of ground truth transcripts, while the WER is obtained by retaining the LM state of the single-best hypothesis of the preceding utterance." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-85", "text": "Decoding uses simple beam search with a beam width of 60 hypotheses and no lexical prefix tree constraint [45] ." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-86", "text": "The search performs shallow fusion of the encoder-decoder score, the external language model score, a length normalization term, and a coverage term [46, 47, 48] ." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-87", "text": "For more details, please refer to [33] ." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-88", "text": "Hub5'00 is used as a development set to optimize decoding hyperparameters, while Hub5'01 and RT03 are used as final test sets." 
}, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-89", "text": "----------------------------------" }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-90", "text": "**EXPERIMENTAL RESULTS**" }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-91", "text": "Our current setup is the result of incremental development." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-92", "text": "Keeping in mind that several other equally powerful setups probably exist, the focus of the following experiments is to investigate ours around the current optimum." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-93", "text": "----------------------------------" }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-94", "text": "**EFFECT OF DATA PREPARATION**" }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-95", "text": "We first investigate the importance of different data processing steps." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-96", "text": "The s5c Kaldi recipe includes a duplicate filtering step, in which the maximum number of occurrences of utterances with the same content is limited." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-97", "text": "We measure the impact of duplicate filtering and also the effect of filtering out word fragments and noise tokens from the training transcripts." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-98", "text": "Since the LM is trained on identically filtered transcripts from Fisher+Switchboard data, word fragment and noise token filters were applied consistently." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-99", "text": "The results are summarized in Table 1 ." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-100", "text": "Deactivating the duplicate filter is never harmful when an external LM is used, and the gains on CallHome can be substantial." 
}, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-101", "text": "Considering performance on the complete Hub5'00 data, the best systems either explicitly handle both word fragments and noise tokens or filter them all out." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-102", "text": "When an external LM is used, the best results are obtained when word fragment and noise token filters are activated and the duplicate filter is deactivated." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-103", "text": "This setting is also appealing in cases where the external LM may be trained on text data that will not contain word fragments or noise; thus, the remaining experiments are carried out with this system setting." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-104", "text": "----------------------------------" }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-105", "text": "**ABLATION STUDY**" }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-106", "text": "In a second set of experiments, we characterize the importance of each of the regularization methods described in Sec. 2 for our model performance by switching off one training method at a time without re-optimizing the remaining settings." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-107", "text": "In these experiments, decoding is performed without an external language model." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-108", "text": "Curriculum learning is evaluated by either switching to randomized batches after 35 epochs or leaving the sorting on throughout training." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-109", "text": "We also test the importance of \u2206 and \u2206\u2206 features [49] ." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-110", "text": "Sorting the results by decreasing number of absolute errors on Hub5'00, Table 2 indicates that each regularization method contributes to the improved WER." 
}, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-111", "text": "SpecAugment is by far the most important method, while using \u2206 and \u2206\u2206 features or switching off the curriculum learning in the later stage of training have marginal but positive effects." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-112", "text": "Other direct input level perturbation steps (speed/tempo perturbation and sequence noise injection) are also key techniques that can be found in the upper half of the table. If we compare the worst and baseline models, we find that the relative performance difference between them is nearly unchanged by including the external LM in decoding." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-113", "text": "Without the LM, the gap is 18% relative, while with the LM the" }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-114", "text": "----------------------------------" }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-115", "text": "**OPTIMIZING THE LANGUAGE MODEL**" }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-116", "text": "The following experiments summarize our optimization of the LM." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-117", "text": "Compared to our previous LM [25] , we measure better perplexity and WER if no bottleneck is used before the softmax layer (rows 1 and 3 in Table 3 )." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-118", "text": "Increasing the model capacity to 122M parameters results in a significant gain in PPL only after the dropout rates are tuned (rows 3, 5 and 6)." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-119", "text": "Similar to [50, 51] , significant PPL gain is observed if the LM was trained across utterances." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-120", "text": "However, this PPL improvement does not translate into reduced WER with a bigger model when cross utterance modeling is used (rows 4 and 7)." 
}, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-121", "text": "Thus, in all other experiments we use the smaller, 57M-parameter model." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-122", "text": "----------------------------------" }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-123", "text": "**EFFECT OF BEAM SIZE AND NUMBER OF PARAMETERS**" }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-124", "text": "A 280M-parameter model may be larger than is practical in many applications." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-125", "text": "Thus, we also conduct experiments to see if this model size is necessary for reasonable ASR performance." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-126", "text": "Models are trained without changing the training configuration, except that the size or number of LSTM layers is reduced." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-127", "text": "As Table 4 shows, although our smallest attention based model achieves reasonable results on this task, a significant loss is indeed observed with decreasing model size, especially on Call-Home." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-128", "text": "Nevertheless, an external language model reduces the performance gap." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-129", "text": "A small, 57M-parameter model together with a similar size language model is only 5% relative worse than our largest model." 
}, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-130", "text": "We note that this model already outperforms the best published attention based seq2seq model [4] , with roughly" }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-131", "text": "----------------------------------" }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-132", "text": "**66% FEWER PARAMETERS.**" }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-133", "text": "Additional experiments are carried out to characterize the search and modeling errors in decoding." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-134", "text": "The results of tuning the beam size and keeping the other search hyperparameters unchanged are shown in Fig. 2 . \"Small\" denotes the 57M model, while \"large\" denotes the 280M model." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-135", "text": "When greedy search (beam 1) is used, the external language model increases WER, an effect that might be mitigated with re-optimized hyperparameters." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-136", "text": "Nevertheless, if a beam of at least 2 hypotheses is used, the positive effect of the language model is clear." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-137", "text": "We also observe that without the language model the search saturates much earlier, around beam 8, fluctuating within only a few absolute errors afterwards." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-138", "text": "On the contrary, decoding with the language model, we measure consistent but small gains with larger beams." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-139", "text": "The minimum number of word errors was measured with a relatively large beam of 240." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-140", "text": "The figure also shows that the effect of a cross-utterance language model grows with larger beams." 
}, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-141", "text": "Lastly, if the model is trained on 2000 hours of speech data (see next section), the extremely fast greedy decoding gives remarkably good performance." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-142", "text": "Although the importance of beam search decreases with an increased amount of training data, we still measure 10% relative degradation compared to a system with a cross-utterance LM and wide (240) beam search." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-143", "text": "----------------------------------" }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-144", "text": "**EXPERIMENTS ON SWITCHBOARD-2000**" }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-145", "text": "As a contrast to our best results on Switchboard-300, we also train a seq2seq model on the 2000-hour Switchboard+Fisher data." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-146", "text": "This model consists of 10 encoder layers, and is trained for only 50 epochs." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-147", "text": "Our overall results on the Hub5'00 and other evaluation sets are summarized in Table 5 ." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-148", "text": "The results in Fig. 2 and Table 5 show that adding more training data greatly improves the system, by around 30% relative in some cases." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-149", "text": "For comparison with others, the 2000-hour system reaches 8.7% and 7.4% WER on rt02 and rt04." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-150", "text": "We observe that the regularization techniques, which are extremely important on the 300h setup, are still beneficial but have a significantly smaller effect." 
}, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-151", "text": "----------------------------------" }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-152", "text": "**COMPARISON WITH THE LITERATURE**" }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-153", "text": "For comparison with results in the literature we refer to the Switchboard-300 results in [4, 8, 52, 53] and the Switchboard-2000 results in [51, 52, 54, 55, 56, 57] ." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-154", "text": "Our 300hour model not only outperforms the previous best attention based encoder-decoder model [4] by a large margin, it also surpasses the best hybrid systems with multiple LMs [8] ." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-155", "text": "Our result on Switchboard-2000 is also better than any single system results reported to date, and reaches the performance of the best system combinations." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-156", "text": "----------------------------------" }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-157", "text": "**CONCLUSIONS**" }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-158", "text": "We presented an attention based encoder-decoder setup which achieves state-of-the-art performance on Switchboard-300." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-159", "text": "A rather simple model built from LSTM layers and a decoder with a single-headed attention mechanism outperforms the standard hybrid approach." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-160", "text": "This is particularly remarkable given that in our model neither a pronunciation lexicon nor a speech model with explicit hidden state representations is needed." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-161", "text": "We also demonstrated that excellent results are possible with smaller models and with practically search-free, greedy decoding." 
}, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-162", "text": "The best results were achieved with a speaker independent model in a single decoding pass, using a minimalistic search algorithm, and without any attention mechanism in the language model." }, { "sent_id": "a800862f17f7a8c13ed13fc6e9433f-C001-163", "text": "Thus, we believe that further improvements are still possible if we apply a more complicated sequence-level training criterion and speaker adaptation." } ], "y": { "@BACK@": { "gold_contexts": [ [ "a800862f17f7a8c13ed13fc6e9433f-C001-12" ], [ "a800862f17f7a8c13ed13fc6e9433f-C001-43" ] ], "cite_sentences": [ "a800862f17f7a8c13ed13fc6e9433f-C001-12", "a800862f17f7a8c13ed13fc6e9433f-C001-43" ] }, "@USE@": { "gold_contexts": [ [ "a800862f17f7a8c13ed13fc6e9433f-C001-54" ], [ "a800862f17f7a8c13ed13fc6e9433f-C001-153" ] ], "cite_sentences": [ "a800862f17f7a8c13ed13fc6e9433f-C001-54", "a800862f17f7a8c13ed13fc6e9433f-C001-153" ] }, "@DIF@": { "gold_contexts": [ [ "a800862f17f7a8c13ed13fc6e9433f-C001-130" ], [ "a800862f17f7a8c13ed13fc6e9433f-C001-154" ] ], "cite_sentences": [ "a800862f17f7a8c13ed13fc6e9433f-C001-130", "a800862f17f7a8c13ed13fc6e9433f-C001-154" ] } } }, "ABC_472b7c9f53b3130d7e5bba772e5b88_30": { "x": [ { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-17", "text": "----------------------------------" }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-18", "text": "**RELATED WORK**" }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-39", "text": "----------------------------------" }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-40", "text": "**MOCKINGJAY**" }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-2", "text": "We present Mockingjay as a new speech representation learning approach, where bidirectional Transformer encoders are pre-trained on a large amount of unlabeled speech." 
}, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-3", "text": "Previous speech representation methods learn through conditioning on past frames and predicting information about future frames." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-4", "text": "Whereas Mockingjay is designed to predict the current frame through jointly conditioning on both past and future contexts." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-5", "text": "The Mockingjay representation improves performance for a wide range of downstream tasks, including phoneme classification, speaker recognition, and sentiment classification on spoken content, while outperforming other approaches." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-6", "text": "Mockingjay is empirically powerful and can be fine-tuned with downstream models, with only 2 epochs we further improve performance dramatically." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-7", "text": "In a low resource setting with only 0.1% of labeled data, we outperform the result of Mel-features that uses all 100% labeled data." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-8", "text": "----------------------------------" }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-10", "text": "The goal of speech representation learning is to find a transform from speech that makes high-level information more accessible to SLP (Speech and Language Processing) downstream tasks, as speech signal possess a rich set of acoustic and linguistic content, including phonemes, words, semantic meanings, tone, speaker characteristics, and even sentimental information." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-11", "text": "In this paper, we propose Mockingjay to learn speech representations through unsupervised training without the use of any labels." 
}, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-12", "text": "We use multi-layer transformer encoders and multi-head self-attention [1] to achieve bidirectional encoding; this framework allows our model to consider past and future contexts at the same time." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-13", "text": "To achieve unsupervised pre-training for speech representations, Mockingjay learns under the proposed Masked Acoustic Model (MAM) task." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-14", "text": "During training, masked frames are given, and the model learns to reconstruct and predict the original speech." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-15", "text": "Hence we gave the name Mockingjay, a bird that mimics sound." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-16", "text": "The proposed framework is illustrated in Figure 1 ." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-38", "text": "With only 0.36 hours (0.1%) of transcribed speech, the proposed approach outperforms Mel-features with 360 hours (100%) of labels." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-19", "text": "Unsupervised speech representation learning [2, 3, 4, 5, 6, 7, 8, 9, 10] is effective in extracting high-level properties from speech." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-20", "text": "SLP downstream tasks can be improved through speech representations because surface features such as log Mel-spectrograms or waveform can poorly reveal the abundant information within speech." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-21", "text": "Contrastive Predictive Coding (CPC) [5] and wav2vec [7] use a multi-layer CNN to encode past context, representations are learned by predicting the future in latent space under a contrastive binary classification task." 
}, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-22", "text": "Autoregressive Predictive Coding (APC) [6] uses autoregressive models to encode temporal information of past acoustic sequences; the model predicts future frames like an RNN-based language model [11] , optimized with reconstruction loss." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-23", "text": "Unidirectional models are commonly used in the previous approaches [2, 3, 4, 5, 6, 7] ." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-24", "text": "However, this constraint on model architectures limits the potential of speech representation learning." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-25", "text": "The recently proposed vq-wav2vec [8] approach attempts to apply the well-performing Natural Language Processing (NLP) algorithm BERT [12] on continuous speech." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-26", "text": "Input speech is discretized to a K-way quantized embedding space, so continuous speech could act like discrete units similar to word tokens in NLP tasks." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-27", "text": "In vq-wav2vec [8] , an exhaustive two-stage training pipeline with massive computing resources are required to adapt speech to NLP algorithm, as the quantization process is against the continuous nature of speech." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-28", "text": "Unlike [8] that adapts speech to BERT [12] through quantization, the proposed approach can be seen as a modified version of BERT [12] for direct application on continuous speech." 
}, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-29", "text": "----------------------------------" }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-30", "text": "**PROPOSED METHOD**" }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-31", "text": "Unlike previous left-to-right unidirectional approaches that only consider past sequences to predict information about future frames, the proposed method allows us to train a bidirectional speech representation model, alleviating the unidirectionality constraint of previous methods." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-32", "text": "As a result, the Mockingjay model obtains substantial improvements in several SLP tasks." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-33", "text": "Moreover, as previous approaches restrict the power of the pre-trained models to representation extraction only [5, 6, 7, 8] , the proposed method is robust and can be fine-tuned easily on downstream tasks." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-34", "text": "We show that finetuning for 2 epochs easily acquires significant improvement." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-35", "text": "The proposed approach outperforms other representations and features." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-36", "text": "When compared to the commonly used log Mel-features, we outperformed it by 35.2% (absolute improvement) for phoneme classification accuracy, 28.0% (absolute improvement) for speaker recognition accuracy, and 6.4% (absolute improvement) for sentiment discrimination accuracy on a spoken content dataset unseen during pre-train." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-37", "text": "We also experiment in low resource settings to show that Mockingjay is capable of improving supervised training in real-life low-resource scenarios." 
}, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-41", "text": "In this section, we first introduce model architecture and its designs, secondly we explain the proposed unsupervised context prediction task, and finally we explain how the proposed model is used with downstream task models." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-42", "text": "----------------------------------" }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-43", "text": "**MODEL ARCHITECTURE**" }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-44", "text": "We use a multi-layer Transformer encoder with multi-head self-attention for left-and-right bidirectional encoding, this architecture is illustrated in Figure 2 ." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-45", "text": "Each encoder layer has two sub-layers, the first is a multi-head self-attention network, and the second is a feed-forward layer, each sub-layer has a residual connection followed by layer normalization [13] , based on the design described in [1] ." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-46", "text": "All encoder layers in the model, as well as the sub-layers, produce outputs of identical dimensions denoted as H dim ." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-47", "text": "In Figure 2 , we denote the feed-forward size as F dim , the number of self-attention heads as A num , and the total of Transformer layers as L num ." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-48", "text": "The Mockingjay representations can be extracted from the Transformer encoders' hidden state and labeled as Hidden, we explain how they are used as representations in Section 2.3." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-49", "text": "Since Transformer encoders contain no recurrence and convolution, we use positional encoding to make our model aware of the input sequence order [1] ." 
}, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-50", "text": "As direct addition of acoustic features to positional encoding may lead to potential training failure [14] , the input frames are first projected linearly to the hidden dimension of H dim before adding with positional encoding [15] ." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-51", "text": "We use sinusoidal positional encoding instead of learnable positional embeddings [16] because acoustic features can be arbitrarily long with high variance [15] ." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-52", "text": "We apply downsampling on input features to adapt our model to long sequences." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-53", "text": "To reduce the length of frames by a factor of R f actor , we use the reshape technique from [14, 15] by stacking R f actor consecutive frames into one step." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-54", "text": "----------------------------------" }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-55", "text": "**MASKED ACOUSTIC MODELING**" }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-56", "text": "We propose the Masked Acoustic Modeling task, where we randomly select 15% of the input frames, and the model predicts the selected frames based on its left and right context, as illustrated in Figure 1 ." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-57", "text": "During training, we add a prediction head consists of two layers of feed-forward network with layer-normalization, using the last encoder layer as it's input." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-58", "text": "We use L1 Loss to minimize reconstruction error between prediction and ground-truth frames on the selected 15%." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-59", "text": "The prediction head is not required once the model is trained." 
}, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-60", "text": "During training, for the selected 15% of frames, 1) we mask it all to zero 80% of the time, 2) replace all with a random frame 10% of the time, and 3) leave the frames untouched 10% of the time." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-61", "text": "1 We introduce this sub-random process instead of always masking the frames to alleviate the mismatch between training and inference, as masked frames do not appear during inference time." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-62", "text": "Note that in contrast to BERT [12] , where the sub-random process is performed token-wise on the i-th chosen token, our sub-random process is performed utterance-wise." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-63", "text": "In other words, our model may receive inputs as ground-truth frames for 3) 10% of the time, rather with some of the inputs always augmented as in [12] ." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-64", "text": "To avoid the model exploiting local smoothness of acoustic frames, we propose additional consecutive masking where we mask consecutive frames C num to zero." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-65", "text": "The model is required to infer on global structure rather than local information." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-66", "text": "We also use dynamic masking [18] where masking patterns are sampled from an uniform distribution every time we feed a sequence to the model, unlike the static mask employed in [12] where masking is performed during data preprocessing." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-67", "text": "We only use a single context prediction task to train our representation model, as suggested by [18] ." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-68", "text": "Unlike BERT [12] and ALBERT [19] that needs two tasks to train their language model." 
}, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-69", "text": "In our preliminary experiments, we found that the sentence prediction task used in [12, 19] is not helpful, as additional tasks can potentially harm training behavior." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-70", "text": "We do not include details due to space limitations." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-71", "text": "----------------------------------" }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-72", "text": "**INCORPORATING WITH DOWNSTREAM TASKS**" }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-73", "text": "Mockingjay representations are essentially the Transformer encoder hidden states." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-74", "text": "There are many ways to incorporate learned representations to downstream tasks." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-75", "text": "In this work, we mainly extract representations from the last layer." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-76", "text": "However, we also expose the deep internals of Mockingjay to downstream models, where we use a mixture of representations from all layers, similar to the ELMO [20] approach." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-77", "text": "In other words, we use a learnable weighted sum to integrate hidden states from all layers." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-78", "text": "A pre-trained Mockingjay model can be fine-tuned with downstream models to create improving results, we update the pre-trained Mockingjay together with random initialized downstream task models." 
}, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-79", "text": "----------------------------------" }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-80", "text": "**IMPLEMENTATION**" }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-81", "text": "In this work, we use two types of features as our model's output reconstruction target: the Mel-scale spectrogram and the linear-scale spectrogram." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-82", "text": "As Mel-scale spectrogram is a more condensed acoustic representation when compared to linearscale spectrogram, we propose two model settings: BASE and LARGE." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-103", "text": "As reported in [6] , the APC approach outperformed CPC representations [5, 7, 9] in both two tasks, which makes APC suitable as a strong baseline." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-104", "text": "APC uses an unidirectional autoregressive model." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-105", "text": "We compare the proposed approach with APC to show that our bidirectional approach has advantages in speech representation learning." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-106", "text": "For fair comparison, we pre-train APC using their official implementations with the reported ideal parameters and settings, but expand the model's hidden size to H dim =768 to match ours." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-107", "text": "We also report results on 160-dimensional log Mel-features, which helps evaluate the accessibility of speech information from regular acoustic features." 
}, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-108", "text": "----------------------------------" }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-109", "text": "**PHONEME CLASSIFICATION**" }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-110", "text": "To measure the accessibility of phonetic information, we train linear phone classifiers using Mel-features, APC and Mockingjay representations from the LibriSpeech train-clean-360 subset." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-111", "text": "We obtain force-aligned phoneme sequences with the Montreal Forced Aligner [24] , where there are 72 possible phone classes." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-112", "text": "Testing results on the LibriSpeech test-clean subset are presented in Figure 3 ." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-113", "text": "In the case where all 360 hours of labels are used to train the classifier, BASE and LARGE representations increase 11.8% and 15.2% accuracy from Mel-features." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-114", "text": "The BASE-FT2 model outperforms all representations after 2 epochs of fine-tuning, with 10.2% and 35.2% absolute improvement over APC and Mel-features, respectively." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-115", "text": "We observe that fine-tuning for 2 epochs is enough to reveal our method's potential as there is only a small gap (3.9%) between BASE-FT2 and BASE-FT500." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-116", "text": "Furthermore, LARGE-WS improves over LARGE, just as we expected." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-117", "text": "To demonstrate how pre-training on speech can improve supervised training in resource constrained scenarios where human labels are short, we train the classifier with reduced amount of training data." 
}, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-118", "text": "The performance variation of different methods are plotted in Figure 3 , where we measure over various intervals of constrained training data to observe performance drop." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-119", "text": "Both BASE and LARGE increase accuracy over Mel-features across various amount of transcribed data." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-120", "text": "Whereas the APC approach performs well on the full resource but fails to generalize for limited amount of labeled data." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-121", "text": "In the case where there are only 0.36 hours of data available, we improve accuracy by 22.7% (an absolute improvement from Mel-features)." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-122", "text": "Note that with only 0.36 hours (0.1%) of labels available, BASE-FT2 (57.9% acc) even outperformed Mel-features (49.1% acc) that uses all 360 hours (100%) of labeled data." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-123", "text": "We conclude that pre-training Mockingjay on speech substantially improves the performance on supervised task that requires human annotations." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-124", "text": "----------------------------------" }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-125", "text": "**SPEAKER RECOGNITION**" }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-126", "text": "To demonstrate that the proposed approach performs constantly for all SLP downstream tasks, we report speaker recognition results on the LibriSpeech 100 hour selected subset, where train/test split is performed randomly with a 9:1 ratio, and there are 63 possible speakers." 
}, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-127", "text": "We trained a simple one-layer RNN classifier for speaker recognition using different representations, results are listed in Table 2" }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-128", "text": "----------------------------------" }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-129", "text": "**SENTIMENT CLASSIFICATION ON SPOKEN CONTENT**" }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-130", "text": "To demonstrate domain invariant transferability of the proposed representation across different datasets, the Mockingjay model is pre-trained on LibriSpeech and applied on the MOSEI [25] dataset." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-131", "text": "We also use a simple one-layer RNN classifier, where the model is trained to extract linguistic meanings from speech and discriminates between sentiments." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-132", "text": "The results listed in Table 2 lead to an identical conclusion as in the speaker recognition task discussed above." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-133", "text": "Except that in the case of sentiment classification, LARGE-WS achieved the highest score without the need of fine-tuning, demonstrating that a deeper model has great potential for extracting general speech representations." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-134", "text": "To conclude this section, we claim that the proposed representations are general and can be used on datasets with various unseen domains." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-135", "text": "----------------------------------" }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-136", "text": "**CONCLUSION**" }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-137", "text": "The proposed representation contains a variety of knowledge, including but not limited to phonetic, speaker, and sentiment information." 
}, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-138", "text": "We improve performance for a wide range of downstream tasks, and show promising results in low resource settings, as the learned speech representations are robust and can be transferred to different tasks across different datasets." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-139", "text": "In future work, we will investigate and deploy Mockingjay representations on more downstream SLP tasks, including ASR, voice conversion, and speech translation." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-83", "text": "Both of these models take Mel-features as input, and transform input Mel-features into high-level representations." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-84", "text": "They use the same hidden dimension size of H dim =768, feedforward size of F dim =3072, and attention heads of A num =12, with the exception of layer number L num , downsample factor R f actor , and consecutive masking number C num , the differences in model settings are listed in Table 1 ." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-85", "text": "We further analyze their differences in our experiment section." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-86", "text": "The proposed Mockingjay models are pre-trained on the LibriSpeech [21] corpus train-clean-360 subset." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-87", "text": "We use Adam [22] where learning rate is warmed up over the first 7% of 500k total training steps to a peak value of 4e-4 and then linearly decayed." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-88", "text": "A dropout [23] of 0.1 is applied on all layers and attention weights." 
}, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-89", "text": "For downstream task fine-tuning, most of the hyperparameters are the same as in pre-training, with the exception of a learning rate of 4e-3, and the number of training epochs is set to 2 (which is approximately 50k steps)." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-90", "text": "We train with a batch size of 6 using a single 1080Ti GPU." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-91", "text": "We provide pre-trained models with our implementation, they are publicly available for reproducibility 2 ." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-92", "text": "----------------------------------" }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-93", "text": "**EXPERIMENT**" }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-94", "text": "Following previous works [2, 3, 4, 5, 6, 7, 8] , we evaluate different features and representations on downstream tasks, including: phoneme classification, speaker recognition, and sentiment classification on spoken content." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-95", "text": "For a fair comparison, each downstream task uses an identical model architecture and hyperparameters despite different input features." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-96", "text": "We report results from 5 of our models: 1) BASE and 2) LARGE where Mockingjay representations are extracted from the last encoder layer, 3) the BASE-FT2 where we finetune BASE with random initialized downstream models for 2 epochs, and 4) the BASE-FT500 where we fine-tune for 500k steps, and finally 5) the LARGE-WS where we incorporate hidden states from all encoder layers of the LARGE model through a learnable weighted sum." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-97", "text": "We did not fine-tune the LARGE model, as it is meant for extracting representations." 
}, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-98", "text": "Empirically we found that even with supervised training, a random initialized Mockingjay model followed by any downstream model is hard to be trained from scratch." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-99", "text": "This shows that the proposed pre-training is essentially indispensable." }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-100", "text": "----------------------------------" }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-101", "text": "**COMPARING WITH OTHER REPRESENTATIONS**" }, { "sent_id": "472b7c9f53b3130d7e5bba772e5b88-C001-102", "text": "The proposed approaches are mainly compared with APC [6] representations, as they also experiment on phone classification and speaker verification." } ], "y": { "@BACK@": { "gold_contexts": [ [ "472b7c9f53b3130d7e5bba772e5b88-C001-19" ], [ "472b7c9f53b3130d7e5bba772e5b88-C001-22" ], [ "472b7c9f53b3130d7e5bba772e5b88-C001-23" ], [ "472b7c9f53b3130d7e5bba772e5b88-C001-103" ] ], "cite_sentences": [ "472b7c9f53b3130d7e5bba772e5b88-C001-19", "472b7c9f53b3130d7e5bba772e5b88-C001-22", "472b7c9f53b3130d7e5bba772e5b88-C001-23", "472b7c9f53b3130d7e5bba772e5b88-C001-103" ] }, "@DIF@": { "gold_contexts": [ [ "472b7c9f53b3130d7e5bba772e5b88-C001-33" ] ], "cite_sentences": [ "472b7c9f53b3130d7e5bba772e5b88-C001-33" ] }, "@USE@": { "gold_contexts": [ [ "472b7c9f53b3130d7e5bba772e5b88-C001-94" ], [ "472b7c9f53b3130d7e5bba772e5b88-C001-102" ], [ "472b7c9f53b3130d7e5bba772e5b88-C001-103" ] ], "cite_sentences": [ "472b7c9f53b3130d7e5bba772e5b88-C001-94", "472b7c9f53b3130d7e5bba772e5b88-C001-102", "472b7c9f53b3130d7e5bba772e5b88-C001-103" ] } } }, "ABC_4590b1a4a0566915a6f2d6439a4e8a_30": { "x": [ { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-2", "text": "Recent work has shown that recurrent neural networks (RNNs) can implicitly capture and exploit hierarchical 
information when trained to solve common natural language processing tasks (Blevins et al" }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-3", "text": "----------------------------------" }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-4", "text": "**INTRODUCTION**" }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-5", "text": "Recurrent neural networks (RNNs), in particular Long Short-Term Memory networks (LSTMs), have become a dominant tool in natural language processing." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-6", "text": "While LSTMs appear to be a natural choice for modeling sequential data, recently a class of non-recurrent models (Gehring et al., 2017; Vaswani et al., 2017) have shown competitive performance on sequence modeling." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-7", "text": "Gehring et al. (2017) propose a fully convolutional sequence-tosequence model that achieves state-of-the-art performance in machine translation." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-8", "text": "Vaswani et al. (2017) introduce Transformer networks that do not use any convolution or recurrent connections while obtaining the best translation performance." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-9", "text": "These non-recurrent models are appealing due to their highly parallelizable computations on modern GPUs. But do they have the same ability to exploit hierarchical structures implicitly in comparison to RNNs?" }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-10", "text": "In this work, we provide a first answer to this question." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-11", "text": "Our interest here is the ability of capturing hierarchical structure without being equipped with explicit structural representations (Bowman et al., 2015b; Tran et al., 2016; Linzen et al., 2016) ." 
}, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-12", "text": "We choose Transformer as a non-recurrent model to study in this paper." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-13", "text": "We refer to Transformer as Fully Attentional Network (FAN) to emphasize this characteristic." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-14", "text": "Our motivation to favor FANs over convolutional neural networks (CNNs) is that FANs always have full access to the sequence history, making them more suited for modeling long distance dependencies than CNNs." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-15", "text": "Additionally, FANs promise to be more interpretable than LSTMs by visualizing attention weights." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-16", "text": "The rest of the paper is organized as follows: We first highlight the differences between the two architectures ( \u00a72) and introduce the two tasks ( \u00a73)." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-17", "text": "Then we provide setup and results for each task ( \u00a74 and \u00a75) and discuss our findings ( \u00a76)." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-18", "text": "----------------------------------" }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-19", "text": "**FAN VERSUS LSTM**" }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-20", "text": "Conceptually, FANs differ from LSTMs in the way they utilize the previous input to predict the next output." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-21", "text": "Figure 1 depicts the main difference in terms of computation when each model is making predictions." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-22", "text": "At time step t, a FAN can access information from all previous time steps directly with O(1) computational operations." 
}, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-23", "text": "FANs do so by employing a self-attention mechanism to compute the weighted average of all previous input representations." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-24", "text": "In contrast, LSTMs compress at each time step all previous information into a single vector recursively based on the current input and the previous compressed vector." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-25", "text": "By their definition, LSTMs require O(d) computational operations to access the information at time step t \u2212 d." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-26", "text": "For the details of self-attention mechanics in FANs, we refer to the work of Vaswani et al. (2017) ." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-27", "text": "We now proceed to measure both models' ability to learn hierarchical structure with a set of controlled experiments." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-28", "text": "----------------------------------" }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-29", "text": "**TASKS**" }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-30", "text": "We choose two tasks to study in this work: (1) subject-verb agreement, and (2) logical inference." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-31", "text": "The first task was proposed by Linzen et al. (2016) to test the ability of recurrent neural networks to capture syntactic dependencies in natural language." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-32", "text": "The second task was introduced by Bowman et al. (2015b) to compare tree-based recursive neural networks against sequence-based recurrent networks with respect to their ability to exploit hierarchical structures to make accurate inferences." 
}, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-33", "text": "The choice of tasks here is important to ensure that both models have to exploit hierarchical structural features (Jia and Liang, 2017) ." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-34", "text": "Subject-Verb Agreement: Linzen et al. (2016) propose the task of predicting number agreement between subject and verb in naturally occurring English sentences as a proxy for the ability of LSTMs to capture hierarchical structure in natural language." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-35", "text": "We use the dataset provided by Linzen et al. (2016) and follow their experimental protocol of training each model using either (a) a general language model, i.e., next word prediction objective, or (b) an explicit supervision objective, i.e., predicting the number of the verb given its sentence history." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-36", "text": "Table 1 illustrates the training and testing conditions of the task." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-37", "text": "Data: Following the original setting, we take 10% of the data for training, 1% for validation, and the rest for testing." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-38", "text": "The vocabulary consists of the 10k most frequent words, while the remaining words are replaced by their part-of-speech." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-39", "text": "Hyperparameters: To allow for a fair comparison, we find the best configuration for each model by running a grid search over the following hyperparameters: number of layers in {2, 3, 4}, dropout rate in {0.2, 0.3, 0.5}, embedding size and number of hidden units in {128, 256, 512}, number of heads (for FAN) in {2, 4}, and learning rate in {0.00001, 0.0001, 0.001}. The weights of the word embeddings and output layer are shared (Inan et al., 2017; Press and Wolf, 2017) ."
}, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-40", "text": "Models are optimized by Adam (Kingma and Ba, 2015) ." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-41", "text": "We first assess whether the LSTM and FAN models trained with respect to the language model objective assign higher probabilities to the correctly inflected verbs." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-42", "text": "As shown in Figures 2a and 2b, both models achieve high accuracies for this task, but LSTMs consistently outperform FANs." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-43", "text": "Moreover, LSTMs are clearly more robust than FANs with respect to task difficulty, measured both in terms of word distance and number of agreement attractors (intervening nouns with the opposite number) between subject and verb." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-44", "text": "Christiansen and Chater (2016); Cornish et al. (2017) have argued that human memory limitations give rise to important characteristics of natural language, including its hierarchical structure." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-45", "text": "Similarly, our experiments suggest that, by compressing the history into a single vector before making predictions, LSTMs are forced to better learn the input structure." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-46", "text": "On the other hand, despite having direct access to all words in their history, FANs are less capable of detecting the verb's subject." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-47", "text": "We note that the validation perplexities of the LSTM and FAN are 67.06 and 69.14, respectively." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-48", "text": "Secondly, we evaluate FAN and LSTM models explicitly trained to predict the verb number (Figures 2c and 2d) ." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-49", "text": "Again, we observe that LSTMs consistently outperform FANs."
}, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-50", "text": "This is a particularly interesting result since the self-attention mechanism in FANs connects two words in any position with an O(1) number of executed operations, whereas RNNs require more recurrent operations." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-51", "text": "(Figure 3 caption: Proportion of times the subject is the most attended word by different heads at different layers; 3 is the highest layer.)" }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-52", "text": "(Only cases where the model made a correct prediction are shown.)" }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-53", "text": "Despite this apparent advantage of FANs, the performance gap between FANs and LSTMs increases with the distance and number of attractors." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-54", "text": "To gain further insights into our results, we examine the attention weights computed by FANs during verb-number prediction (supervised objective)." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-55", "text": "Specifically, for each attention head at each layer of the FAN, we compute the percentage of times the subject is the most attended word among all words in the history." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-56", "text": "We note that our LSTM results are better than those in Linzen et al. (2016) ." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-57", "text": "Also surprising is that the language model objective yields higher accuracies than the number prediction objective." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-58", "text": "We believe this may be due to better model optimization and to the embedding-output layer weight sharing, but we leave a thorough investigation to future work." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-59", "text": "Figure 3 shows the results for all cases where the model made the correct prediction."
}, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-60", "text": "While it is hard to interpret the exact role of attention for different heads and at different layers, we find that some of the attention heads at the higher layers (layer 2 head 1, layer 3 head 0) frequently point to the subject with an accuracy that decreases linearly with the distance between subject and verb." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-61", "text": "----------------------------------" }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-62", "text": "**LOGICAL INFERENCE**" }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-63", "text": "In this task, we choose the artificial language introduced by Bowman et al. (2015b) ." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-64", "text": "The vocabulary of this language includes six word types {a, b, c, d, e, f} and three logical operators {or, and, not}. The task consists of predicting one of seven mutually exclusive logical relations that describe the relationship between a pair of sentences: entailment (\u228f, \u2290), equivalence (\u2261), exhaustive and non-exhaustive contradiction (\u2227, |), and two types of semantic independence (#, \u2323)." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-65", "text": "We generate 60,000 samples with the number of logical operations ranging from 1 to 12." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-66", "text": "The train/dev/test dataset ratios are set to 0.8/0.1/0.1." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-67", "text": "Here are some samples of the training data:" }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-68", "text": "( d ( or f ) ) ( f ( and a ) ) ( d ( and ( c ( or d ) ) ) ) # ( not f ) ( not ( d ( or ( f ( or c ) ) ) ) ) ( not ( c ( and ( not d ) ) ) )" }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-69", "text": "Why artificial data?"
}, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-70", "text": "Despite the simplicity of the language, this task is not trivial." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-71", "text": "To correctly classify logical relations, the model must learn nested structures as well as the scope of logical operators." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-72", "text": "We verify the difficulty of the task by training three bag-of-words models followed by sum/average/max-pooling." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-73", "text": "The best of the three models achieves less than 59% accuracy on the logical inference task versus 77% on the Stanford Natural Language Inference (SNLI) corpus (Bowman et al., 2015a) ." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-74", "text": "This shows that the SNLI task can be largely solved by exploiting shallow features without understanding the underlying linguistic structures, which has also been pointed out by recent work (Glockner et al., 2018; Gururangan et al., 2018) ." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-75", "text": "Concurrently to our work, Evans et al. (2018) proposed an alternative data set for logical inference and also found that a FAN model underperformed various other architectures including LSTMs." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-76", "text": "----------------------------------" }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-77", "text": "**MODELS**" }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-78", "text": "We follow the general architecture proposed in (Bowman et al., 2015b): Premise and hypothesis sentences are encoded by fixed-size vectors." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-79", "text": "These two vectors are then concatenated and fed to a 3-layer feed-forward neural network with ReLU nonlinearities to perform 7-way classification of the logical relation."
}, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-80", "text": "The LSTM architecture used in this experiment is similar to that of Bowman et al. (2015b) ." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-81", "text": "We simply take the last hidden state of the top LSTM layer as a fixed-size vector representation of the sentence." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-82", "text": "Here, we use a 2-layer LSTM with skip connections." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-83", "text": "The FAN maps a sentence x of length n to H = [h1, . . . , hn] \u2208 R^(d\u00d7n)." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-84", "text": "To obtain a fixed-size representation z, we use a self-attention layer with two trainable queries q1, q2 \u2208 R^(1\u00d7d)." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-85", "text": "We find the best hyperparameters for each model by running a grid search as explained in \u00a74." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-86", "text": "----------------------------------" }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-87", "text": "**RESULTS**" }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-88", "text": "Following the experimental protocol of Bowman et al. (2015b) , the data is divided into 13 bins based on the number of logical operators." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-89", "text": "Both FANs and LSTMs are trained on samples with at most n logical operators and tested on all bins." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-90", "text": "Figure 4 shows the result of the experiments with n \u2264 6 and n \u2264 12." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-91", "text": "We see that FANs and LSTMs perform similarly when trained on the whole dataset (Figure 4a )."
}, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-92", "text": "However when trained on a subset of the data (Figure 4b) , LSTMs obtain better accuracies on similar examples (n \u2264 6) and generalize better on longer examples (6 < n \u2264 12)." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-93", "text": "----------------------------------" }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-94", "text": "**DISCUSSION AND CONCLUSION**" }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-95", "text": "We have compared a recurrent architecture (LSTM) to a non-recurrent one (FAN) with respect to the ability of capturing the underlying hierarchical structure of sequential data." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-96", "text": "Our experiments show that LSTMs slightly but consistently outperform FANs." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-97", "text": "We found that LSTMs are notably more robust with respect to the presence of misleading features in the agreement task, whether trained with explicit supervision or with a general language model objective." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-98", "text": "Secondly, we found that LSTMs generalize better than FANs to longer sequences in a logical inference task." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-99", "text": "These findings suggest that recurrency is a key model property which should not be sacrificed for efficiency when hierarchical structure matters for the task." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-100", "text": "This does not imply that LSTMs should always be preferred over non-recurrent architectures." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-101", "text": "In fact, both FAN- and CNN-based networks have proved to perform comparably or better than LSTM-based ones on a very complex task like machine translation (Gehring et al., 2017; Vaswani et al., 2017) ."
}, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-102", "text": "Nevertheless, we believe that the ability of capturing hierarchical information in sequential data remains a fundamental need for building intelligent systems that can understand and process language." }, { "sent_id": "4590b1a4a0566915a6f2d6439a4e8a-C001-103", "text": "Thus we hope that our insights will be useful towards building the next generation of neural networks." } ], "y": { "@USE@": { "gold_contexts": [ [ "4590b1a4a0566915a6f2d6439a4e8a-C001-11" ], [ "4590b1a4a0566915a6f2d6439a4e8a-C001-32" ], [ "4590b1a4a0566915a6f2d6439a4e8a-C001-63" ], [ "4590b1a4a0566915a6f2d6439a4e8a-C001-78" ], [ "4590b1a4a0566915a6f2d6439a4e8a-C001-88" ] ], "cite_sentences": [ "4590b1a4a0566915a6f2d6439a4e8a-C001-11", "4590b1a4a0566915a6f2d6439a4e8a-C001-32", "4590b1a4a0566915a6f2d6439a4e8a-C001-63", "4590b1a4a0566915a6f2d6439a4e8a-C001-78", "4590b1a4a0566915a6f2d6439a4e8a-C001-88" ] }, "@SIM@": { "gold_contexts": [ [ "4590b1a4a0566915a6f2d6439a4e8a-C001-80" ] ], "cite_sentences": [ "4590b1a4a0566915a6f2d6439a4e8a-C001-80" ] } } }, "ABC_abfc6373c577980154dbb93190d69b_31": { "x": [ { "sent_id": "abfc6373c577980154dbb93190d69b-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-49", "text": "As a result, in our mapping for Korean, we mapped stative verbs to the universal ADJ tag." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-50", "text": "In other cases this was clearer, e.g. the Bulgarian treebank has no category for determiners or articles." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-51", "text": "This is not to say that there are no determiners in the Bulgarian language." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-2", "text": "To facilitate future research in unsupervised induction of syntactic structure and to standardize best-practices, we propose a tagset that consists of twelve universal part-of-speech categories."
}, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-3", "text": "In addition to the tagset, we develop a mapping from 25 different treebank tagsets to this universal set." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-4", "text": "As a result, when combined with the original treebank data, this universal tagset and mapping produce a dataset consisting of common parts-of-speech for 22 different languages." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-5", "text": "We highlight the use of this resource via two experiments, including one that reports competitive accuracies for unsupervised grammar induction without gold standard part-of-speech tags." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-6", "text": "----------------------------------" }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-8", "text": "Part-of-speech (POS) tagging has received a great deal of attention as it is a critical component of most natural language processing systems." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-9", "text": "As supervised POS tagging accuracies for English (measured on the Wall Street Journal portion of the PennTreebank (Marcus et al., 1993) ) have converged to around 97.3% (Toutanova et al., 2003; Shen et al., 2007) , the attention has shifted to unsupervised approaches (Christodoulopoulos et al., 2010) ." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-10", "text": "In particular, there has been growing interest in both multilingual POS induction and cross-lingual POS induction via treebank projection (Yarowsky and Ngai, 2001; Xi and Hwa, 2005; Das and Petrov, 2011) ." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-11", "text": "Underlying these studies is the idea that a set of (coarse) syntactic POS categories exist in similar forms across languages."
}, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-12", "text": "These categories are often called universals to represent their cross-lingual nature (Carnie, 2002; Newmeyer, 2005) ." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-13", "text": "For example, used the Multext-East (Erjavec, 2004) corpus to evaluate their multi-lingual POS induction system, because it uses the same tagset for multiple languages." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-14", "text": "When corpora with common tagsets are unavailable, a standard approach is to manually define a mapping from language and treebank specific fine-grained tagsets to a predefined universal set." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-15", "text": "This was the approach taken by Das and Petrov (2011) to evaluate their cross-lingual POS projection system for six different languages." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-16", "text": "To facilitate future research and to standardize best-practices, we propose a tagset that consists of twelve universal POS categories." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-17", "text": "While there might be some controversy about what the exact tagset should be, we feel that these twelve categories cover the most frequent parts-of-speech that exist in most languages." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-18", "text": "In addition to the tagset, we also develop a mapping from fine-grained POS tags for 25 different treebanks to this universal set." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-19", "text": "As a result, when combined with the original treebank data, this universal tagset and mapping produce a dataset consisting of common parts-of-speech for 22 different languages." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-20", "text": "1 Both the tagset and mappings are made available for download at http://code.google.com/p/universal-pos-tags/."
}, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-21", "text": "This resource serves multiple purposes." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-22", "text": "First, as mentioned previously, it is useful for building and evaluating unsupervised and cross-lingual taggers." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-23", "text": "Second, it also permits for a more reasonable comparison of accuracy across languages for supervised taggers." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-24", "text": "Statements of the form \"POS tagging for language X is harder than for language Y\" are vacuous when the tagsets used for the two languages are incomparable (not to mention of different cardinality)." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-25", "text": "Finally, it also permits language technology practitioners to train POS taggers with common tagsets across multiple languages." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-26", "text": "This in turn facilitates downstream application development as there is no need to maintain language specific rules due to differences in treebank annotation guidelines." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-27", "text": "In this paper, we specifically highlight two use cases of this resource." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-28", "text": "First, using our universal tagset and mapping, we run an experiment comparing POS tag accuracies for 25 different treebanks to evaluate POS tagging accuracy on a single tagset." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-29", "text": "Second, we combine the cross-lingual projection part-of-speech taggers of Das and Petrov (2011) with the grammar induction system of Naseem et al. (2010) -which requires a universal tagset -to produce a completely unsupervised grammar induction system for multiple languages that does not require gold POS tags in the target language."
}, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-30", "text": "----------------------------------" }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-31", "text": "**TAGSET**" }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-32", "text": "While there might be some disagreement about the exact definition of a universal POS tagset (Evans and Levinson, 2009) , it seems fairly indisputable that a set of coarse POS categories (or syntactic universals) exists across all languages in one form or another (Carnie, 2002; Newmeyer, 2005) ." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-33", "text": "Rather than arguing over definitions, we took a pragmatic standpoint during the design of the universal POS tagset and focused our attention on the POS categories that we expect to be most useful (and necessary) for users of POS taggers." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-34", "text": "In our opinion, these are NLP practitioners using POS taggers in downstream applications, and NLP researchers using POS taggers in grammar induction and other experiments." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-35", "text": "A high-level analysis of the tagsets underlying various treebanks shows that the majority of tagsets are very fine-grained and language specific." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-36", "text": "In fact, Smith and Eisner (2005) made a similar observation and defined a collapsed set of 17 English POS tags (instead of the original 45) that has subsequently been adopted by most unsupervised POS induction work." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-37", "text": "Similarly, the organizers of the CoNLL shared tasks on dependency parsing provide coarse (but still language specific) tags in addition to the fine-grained tags used in the original treebanks (Buchholz and Marsi, 2006). Prior work identified eight different coarse POS tags when analyzing the errors of two dependency parsers on the 13 different languages from the CoNLL shared tasks." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-38", "text": "Our universal POS tagset unifies this previous work and defines the following twelve POS tags: NOUN (nouns), VERB (verbs), ADJ (adjectives), ADV (adverbs), PRON (pronouns), DET (determiners and articles), ADP (prepositions and postpositions), NUM (numerals), CONJ (conjunctions), PRT (particles), '.' (punctuation marks) and X (a catch-all for other categories such as abbreviations or foreign words)." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-39", "text": "We did not rely on intrinsic definitions of the above categories." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-40", "text": "Instead, each category is defined operationally." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-41", "text": "For each treebank under consideration, we studied the exact POS tag definitions and annotation guidelines and created a mapping from the original treebank tagset to these universal POS tags." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-42", "text": "Most of the decisions were fairly clear." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-43", "text": "For example, from the PennTreebank, VB, VBD, VBG, VBN, VBP, VBZ and MD (modal) were all mapped to VERB." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-44", "text": "A less clear case was the universal tag for particles, PRT, which was mapped from POS (possessive), RP (particle) and TO (the word 'to')." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-45", "text": "In particular, the TO tag is ambiguous in the PennTreebank between infinitival markers and the preposition 'to'." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-46", "text": "Thus, under this mapping, some prepositions will be marked as particles in the universal tagset."
}, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-47", "text": "Figure 1 gives an example mapping for a sentence from the PennTreebank." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-48", "text": "Another case we had to consider is that some tag (Kim, 2002) ." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-52", "text": "However, since they are not annotated as such in the treebank, we are not able to include them in our mapping." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-53", "text": "The list of treebanks for which we have constructed mappings can be seen in Table 1 ." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-54", "text": "One main objective in publicly releasing this resource is to provide treebank and language specific experts a mechanism for refining these categories and the decisions we have made, as well as adding new treebanks and languages." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-55", "text": "This resource is therefore hosted as an open source project with version control." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-56", "text": "----------------------------------" }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-57", "text": "**EXPERIMENTS**" }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-58", "text": "To demonstrate the efficacy of the proposed universal POS tagset, we performed two sets of experiments." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-59", "text": "First, to provide a language comparison, we trained the same supervised POS tagging model on all of the above treebanks and evaluated the tagging accuracy on the universal POS tagset." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-60", "text": "Second, we used universal POS tags (automatically projected from English) as the starting point for unsupervised grammar induction, producing completely unsupervised parsers for several languages." 
}, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-61", "text": "----------------------------------" }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-62", "text": "**LANGUAGE COMPARISONS**" }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-63", "text": "To compare POS tagging accuracies across different languages, we trained a supervised tagger based on a trigram Markov model (Brants, 2000) on all treebanks." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-64", "text": "We chose this model for its fast speed and (close to) state-of-the-art accuracy without language specific tuning." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-65", "text": "Table 1 shows the results for all 25 treebanks when training/testing on the original (O) and universal (U) tagsets." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-66", "text": "Overall, the variance on the universal tagset has been reduced by half (5.1 instead of 10.4)." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-67", "text": "But of course there are still accuracy differences across the different languages." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-68", "text": "On the one hand, given gold segmentation, tagging Japanese is almost deterministic, resulting in a final accuracy of above 99%." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-69", "text": "On the other hand, tagging Turkish, an agglutinative language with an average sentence length of 11.6 tokens, remains very challenging, resulting in an accuracy of only 90.2%." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-70", "text": "It should be noted that the best results are obtained by training on the original treebank categories and mapping the predictions to the universal POS tags at the end (O/U column)." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-71", "text": "This is because the transition model based on the universal POS tagset is less informative."
}, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-72", "text": "An interesting experiment would be to train the latent variable tagger of Huang et al. (2009) on this tagset." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-73", "text": "Their model automatically discovers refinements of the observed categories and could potentially find a tighter fit to the data than the one provided by the original, linguistically motivated treebank tags." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-74", "text": "----------------------------------" }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-75", "text": "**GRAMMAR INDUCTION**" }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-76", "text": "We further demonstrate the utility of the universal POS tags in a grammar induction experiment." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-77", "text": "To decouple the challenges of POS tagging and parsing, gold POS tags are typically assumed in unsupervised grammar induction experiments (Carroll and Charniak, 1992; Klein and Manning, 2004) ." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-78", "text": "We propose to remove this unrealistic simplification by using POS tags automatically projected from English as the basis of a grammar induction model." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-79", "text": "Das and Petrov (2011) describe a cross-lingual projection framework to learn POS taggers without labeled data for the language of interest." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-80", "text": "We use their automatically induced POS tags to induce syntactic dependencies." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-81", "text": "(Trained on the English PennTreebank, the tagging model achieves 96.7% accuracy when evaluated on the original 45 POS tags.)" }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-82", "text": "(Note that the accuracy on the universal POS tags for the two Japanese treebanks is almost the same.)" }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-83", "text": "(A less benevolent explanation for this practice is that grammar induction from plain text is simply still too difficult.)" }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-84", "text": "To this end, we chose the framework of Naseem et al. (2010), in which a few universal syntactic rules (USR) are used to constrain a probabilistic Bayesian model." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-85", "text": "These rules are specified using a set of universal syntactic categories, and lead to state-of-the-art grammar induction performance superior to previous methods, such as the dependency model with valence (DMV) (Klein and Manning, 2004) and the phylogenetic grammar induction model (PGI) (Berg-Kirkpatrick and Klein, 2010) ." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-86", "text": "----------------------------------" }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-87", "text": "**LANGUAGE DMV PGI USR-G USR-I**" }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-88", "text": "In their experiments, Naseem et al. also used a set of universal categories, albeit with some differences from the tagset presented here." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-89", "text": "Their tagset does not have punctuation and catch-all categories, but includes a category for auxiliaries." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-90", "text": "The auxiliary category helps define a syntactic rule that attaches verbs to an auxiliary head, which is beneficial for certain languages." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-91", "text": "However, since this rule is reversed for other languages, we omit it in our tagset."
}, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-92", "text": "Additionally, they also used refined categories in the form of CoNLL treebank tags." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-93", "text": "In our experiments, we did not make use of refined categories, as the POS tags induced by Das and Petrov (2011) were all coarse." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-94", "text": "We present results on the same eight Indo-European languages as Das and Petrov (2011) , so that we can make use of their automatically projected POS tags." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-95", "text": "For all languages, we used the treebanks released as a part of the CoNLL-X (Buchholz and Marsi, 2006) shared task." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-96", "text": "We only considered sentences of length 10 or less, after the removal of punctuation." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-97", "text": "We performed Bayesian inference on the whole treebank and report dependency attachment accuracy." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-98", "text": "Table 2 shows directed dependency accuracies for the DMV and PGI models using fine-grained gold POS tags." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-99", "text": "For the USR model, it reports results on gold universal POS tags (USR-G) and automatically induced universal POS tags (USR-I)." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-100", "text": "The USR-I model falls short of the USR-G model, but has the advantage that it does not require any labeled data from the target language." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-101", "text": "Quite impressively, it does better than DMV for all languages, and is competitive with PGI, even though those models have access to fine-grained gold POS tags."
}, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-102", "text": "----------------------------------" }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-103", "text": "**CONCLUSIONS**" }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-104", "text": "We proposed a POS tagset consisting of twelve categories that exist across languages and developed a mapping from 25 language specific tagsets to this universal set." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-105", "text": "We demonstrated experimentally that the universal POS categories generalize well across language boundaries on an unsupervised grammar induction task, giving competitive parsing accuracies without relying on gold POS tags." }, { "sent_id": "abfc6373c577980154dbb93190d69b-C001-106", "text": "The tagset and mappings are available for download at http://code.google.com/p/universal-pos-tags/" } ], "y": { "@BACK@": { "gold_contexts": [ [ "abfc6373c577980154dbb93190d69b-C001-10", "abfc6373c577980154dbb93190d69b-C001-8" ], [ "abfc6373c577980154dbb93190d69b-C001-11", "abfc6373c577980154dbb93190d69b-C001-12", "abfc6373c577980154dbb93190d69b-C001-14", "abfc6373c577980154dbb93190d69b-C001-15" ] ], "cite_sentences": [ "abfc6373c577980154dbb93190d69b-C001-10", "abfc6373c577980154dbb93190d69b-C001-15" ] }, "@EXT@": { "gold_contexts": [ [ "abfc6373c577980154dbb93190d69b-C001-29" ] ], "cite_sentences": [ "abfc6373c577980154dbb93190d69b-C001-29" ] }, "@USE@": { "gold_contexts": [ [ "abfc6373c577980154dbb93190d69b-C001-94" ], [ "abfc6373c577980154dbb93190d69b-C001-93" ] ], "cite_sentences": [ "abfc6373c577980154dbb93190d69b-C001-94", "abfc6373c577980154dbb93190d69b-C001-93" ] } } }, "ABC_d883c7bb1e95895cd4f57a8f430e35_31": { "x": [ { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-87", "text": "**LANGUAGE DMV PGI USR-G USR-I**" }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-2", "text": "To facilitate future
research in unsupervised induction of syntactic structure and to standardize best-practices, we propose a tagset that consists of twelve universal part-of-speech categories." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-3", "text": "In addition to the tagset, we develop a mapping from 25 different treebank tagsets to this universal set." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-4", "text": "As a result, when combined with the original treebank data, this universal tagset and mapping produce a dataset consisting of common parts-of-speech for 22 different languages." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-5", "text": "We highlight the use of this resource via two experiments, including one that reports competitive accuracies for unsupervised grammar induction without gold standard part-of-speech tags." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-6", "text": "----------------------------------" }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-8", "text": "Part-of-speech (POS) tagging has received a great deal of attention as it is a critical component of most natural language processing systems." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-9", "text": "As supervised POS tagging accuracies for English (measured on the Wall Street Journal portion of the PennTreebank (Marcus et al., 1993) ) have converged to around 97.3% (Toutanova et al., 2003; Shen et al., 2007) , the attention has shifted to unsupervised approaches (Christodoulopoulos et al., 2010) ." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-10", "text": "In particular, there has been growing interest in both multilingual POS induction and cross-lingual POS induction via treebank projection (Yarowsky and Ngai, 2001; Xi and Hwa, 2005; Das and Petrov, 2011) ."
}, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-11", "text": "Underlying these studies is the idea that a set of (coarse) syntactic POS categories exists in similar forms across languages." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-12", "text": "These categories are often called universals to represent their cross-lingual nature (Carnie, 2002; Newmeyer, 2005) ." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-13", "text": "For example, used the Multext-East (Erjavec, 2004) corpus to evaluate their multi-lingual POS induction system, because it uses the same tagset for multiple languages." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-14", "text": "When corpora with common tagsets are unavailable, a standard approach is to manually define a mapping from language and treebank specific fine-grained tagsets to a predefined universal set." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-15", "text": "This was the approach taken by Das and Petrov (2011) to evaluate their cross-lingual POS projection system for six different languages." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-16", "text": "To facilitate future research and to standardize best-practices, we propose a tagset that consists of twelve universal POS categories." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-17", "text": "While there might be some controversy about what the exact tagset should be, we feel that these twelve categories cover the most frequent parts-of-speech that exist in most languages." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-18", "text": "In addition to the tagset, we also develop a mapping from fine-grained POS tags for 25 different treebanks to this universal set." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-19", "text": "As a result, when combined with the original treebank data, this universal tagset and mapping produce a dataset consisting of common parts-of-speech for 22 different languages."
}, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-20", "text": "1 Both the tagset and mappings are made available for download at http://code.google.com/p/universal-pos-tags/." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-21", "text": "This resource serves multiple purposes." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-22", "text": "First, as mentioned previously, it is useful for building and evaluating unsupervised and cross-lingual taggers." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-23", "text": "Second, it also permits a more reasonable comparison of accuracy across languages for supervised taggers." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-24", "text": "Statements of the form \"POS tagging for language X is harder than for language Y\" are vacuous when the tagsets used for the two languages are incomparable (not to mention of different cardinality)." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-25", "text": "Finally, it also permits language technology practitioners to train POS taggers with common tagsets across multiple languages." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-26", "text": "This in turn facilitates downstream application development as there is no need to maintain language specific rules due to differences in treebank annotation guidelines." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-27", "text": "In this paper, we specifically highlight two use cases of this resource." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-28", "text": "First, using our universal tagset and mapping, we run an experiment comparing POS tag accuracies for 25 different treebanks to evaluate POS tagging accuracy on a single tagset." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-29", "text": "Second, we combine the cross-lingual projection part-of-speech taggers of Das and Petrov (2011) with the grammar induction system of Naseem et al.
(2010), which requires a universal tagset, to produce a completely unsupervised grammar induction system for multiple languages that does not require gold POS tags in the target language." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-30", "text": "----------------------------------" }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-31", "text": "**TAGSET**" }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-32", "text": "While there might be some disagreement about the exact definition of a universal POS tagset (Evans and Levinson, 2009) , it seems fairly indisputable that a set of coarse POS categories (or syntactic universals) exists across all languages in one form or another (Carnie, 2002; Newmeyer, 2005) ." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-33", "text": "Rather than arguing over definitions, we took a pragmatic standpoint during the design of the universal POS tagset and focused our attention on the POS categories that we expect to be most useful (and necessary) for users of POS taggers." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-34", "text": "In our opinion, these are NLP practitioners using POS taggers in downstream applications, and NLP researchers using POS taggers in grammar induction and other experiments." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-35", "text": "A high-level analysis of the tagsets underlying various treebanks shows that the majority of tagsets are very fine-grained and language specific." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-36", "text": "In fact, Smith and Eisner (2005) made a similar observation and defined a collapsed set of 17 English POS tags (instead of the original 45) that has subsequently been adopted by most unsupervised POS induction work."
}, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-37", "text": "Similarly, the organizers of the CoNLL shared tasks on dependency parsing provide coarse (but still language specific) tags in addition to the fine-grained tags used in the original treebanks (Buchholz and Marsi, 2006) . Prior work identified eight different coarse POS tags when analyzing the errors of two dependency parsers on the 13 different languages from the CoNLL shared tasks." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-38", "text": "Our universal POS tagset unifies this previous work and defines the following twelve POS tags: NOUN (nouns), VERB (verbs), ADJ (adjectives), ADV (adverbs), PRON (pronouns), DET (determiners and articles), ADP (prepositions and postpositions), NUM (numerals), CONJ (conjunctions), PRT (particles), '.' (punctuation marks) and X (a catch-all for other categories such as abbreviations or foreign words)." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-39", "text": "We did not rely on intrinsic definitions of the above categories." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-40", "text": "Instead, each category is defined operationally." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-41", "text": "For each treebank under consideration, we studied the exact POS tag definitions and annotation guidelines and created a mapping from the original treebank tagset to these universal POS tags." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-42", "text": "Most of the decisions were fairly clear." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-43", "text": "For example, from the PennTreebank, VB, VBD, VBG, VBN, VBP, VBZ and MD (modal) were all mapped to VERB." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-44", "text": "A less clear case was the universal tag for particles, PRT, which was mapped from POS (possessive), RP (particle) and TO (the word 'to')."
}, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-45", "text": "In particular, the TO tag is ambiguous in the PennTreebank between infinitival markers and the preposition 'to'." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-46", "text": "Thus, under this mapping, some prepositions will be marked as particles in the universal tagset." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-47", "text": "Figure 1 gives an example mapping for a sentence from the PennTreebank." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-48", "text": "Another case we had to consider is that some tags do not align across languages; Korean grammars, for example, commonly treat adjectives as stative verbs (Kim, 2002) ." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-49", "text": "As a result, in our mapping for Korean, we mapped stative verbs to the universal ADJ tag." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-50", "text": "In other cases this was clearer, e.g. the Bulgarian treebank has no category for determiners or articles." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-51", "text": "This is not to say that there are no determiners in the Bulgarian language." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-52", "text": "However, since they are not annotated as such in the treebank, we are not able to include them in our mapping." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-53", "text": "The list of treebanks for which we have constructed mappings can be seen in Table 1 ." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-54", "text": "One main objective in publicly releasing this resource is to provide treebank and language specific experts a mechanism for refining these categories and the decisions we have made, as well as adding new treebanks and languages." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-55", "text": "This resource is therefore hosted as an open source project with version control."
}, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-56", "text": "----------------------------------" }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-57", "text": "**EXPERIMENTS**" }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-58", "text": "To demonstrate the efficacy of the proposed universal POS tagset, we performed two sets of experiments." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-59", "text": "First, to provide a language comparison, we trained the same supervised POS tagging model on all of the above treebanks and evaluated the tagging accuracy on the universal POS tagset." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-60", "text": "Second, we used universal POS tags (automatically projected from English) as the starting point for unsupervised grammar induction, producing completely unsupervised parsers for several languages." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-61", "text": "----------------------------------" }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-62", "text": "**LANGUAGE COMPARISONS**" }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-63", "text": "To compare POS tagging accuracies across different languages we trained a supervised tagger based on a trigram Markov model (Brants, 2000) on all treebanks." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-64", "text": "We chose this model for its fast speed and (close to) state-of-the-art accuracy without language specific tuning." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-65", "text": "2 Table 1 shows the results for all 25 treebanks when training/testing on the original (O) and universal (U) tagsets." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-66", "text": "Overall, the variance on the universal tagset has been reduced by half (5.1 instead of 10.4)." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-67", "text": "But of course there are still accuracy differences across the different languages." 
}, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-68", "text": "On the one hand, given a golden segmentation, tagging Japanese is almost deterministic, resulting in a final accuracy of above 99%." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-69", "text": "3 On the other hand, tagging Turkish, an agglutinative language with an average sentence length of 11.6 tokens, remains very challenging, resulting in an accuracy of only 90.2%." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-70", "text": "It should be noted that the best results are obtained by training on the original treebank categories and mapping the predictions to the universal POS tags at the end (O/U column)." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-71", "text": "This is because the transition model based on the universal POS tagset is less informative." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-72", "text": "An interesting experiment would be to train the latent variable tagger of Huang et al. (2009) on this tagset." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-73", "text": "Their model automatically discovers refinements of the observed categories and could potentially find a tighter fit to the data than the one provided by the original, linguistically motivated treebank tags." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-74", "text": "----------------------------------" }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-75", "text": "**GRAMMAR INDUCTION**" }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-76", "text": "We further demonstrate the utility of the universal POS tags in a grammar induction experiment." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-77", "text": "To decouple the challenges of POS tagging and parsing, golden POS tags are typically assumed in unsupervised grammar induction experiments (Carroll and Charniak, 1992; Klein and Manning, 2004) ."
}, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-78", "text": "4 We propose to remove this unrealistic simplification by using POS tags automatically projected from English as the basis of a grammar induction model." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-79", "text": "Das and Petrov (2011) describe a cross-lingual projection framework to learn POS taggers without labeled data for the language of interest." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-80", "text": "We use their automatically induced POS tags to induce syntactic dependencies. (Footnote 2: Trained on the English PennTreebank, this model achieves 96.7% accuracy when evaluated on the original 45 POS tags.)" }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-81", "text": "Footnote 3: Note that the accuracy on the universal POS tags for the two Japanese treebanks is almost the same." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-82", "text": "Footnote 4: A less benevolent explanation for this practice is that grammar induction from plain text is simply still too difficult." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-84", "text": "To this end, we chose the framework of Naseem et al. (2010) , in which a few universal syntactic rules (USR) are used to constrain a probabilistic Bayesian model." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-85", "text": "These rules are specified using a set of universal syntactic categories, and lead to state-of-the-art grammar induction performance superior to previous methods, such as the dependency model with valence (DMV) (Klein and Manning, 2004 ) and the phylogenetic grammar induction model (PGI) (Berg-Kirkpatrick and Klein, 2010) ." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-86", "text": "----------------------------------" }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-88", "text": "In their experiments, Naseem et al.
also used a set of universal categories, however, with some differences from the tagset presented here." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-89", "text": "Their tagset does not have punctuation and catch-all categories, but includes a category for auxiliaries." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-90", "text": "The auxiliary category helps define a syntactic rule that attaches verbs to an auxiliary head, which is beneficial for certain languages." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-91", "text": "However, since this rule is reversed for other languages, we omit it in our tagset." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-92", "text": "Additionally, they also used refined categories in the form of CoNLL treebank tags." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-93", "text": "In our experiments, we did not make use of refined categories, as the POS tags induced by Das and Petrov (2011) were all coarse." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-94", "text": "We present results on the same eight Indo-European languages as Das and Petrov (2011) , so that we can make use of their automatically projected POS tags." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-95", "text": "For all languages, we used the treebanks released as a part of the CoNLL-X (Buchholz and Marsi, 2006) shared task." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-96", "text": "We only considered sentences of length 10 or less, after the removal of punctuation." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-97", "text": "We performed Bayesian inference on the whole treebank and report dependency attachment accuracy." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-98", "text": "Table 2 shows directed dependency accuracies for the DMV and PGI models using fine-grained gold POS tags."
}, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-99", "text": "For the USR model, it reports results on gold universal POS tags (USR-G) and automatically induced universal POS tags (USR-I)." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-100", "text": "The USR-I model falls short of the USR-G model, but has the advantage that it does not require any labeled data from the target language." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-101", "text": "Quite impressively, it does better than DMV for all languages, and is competitive with PGI, even though those models have access to fine-grained gold POS tags." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-102", "text": "----------------------------------" }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-103", "text": "**CONCLUSIONS**" }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-104", "text": "We proposed a POS tagset consisting of twelve categories that exist across languages and developed a mapping from 25 language specific tagsets to this universal set." }, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-105", "text": "We demonstrated experimentally that the universal POS categories generalize well across language boundaries on an unsupervised grammar induction task, giving competitive parsing accuracies without relying on gold POS tags."
}, { "sent_id": "d883c7bb1e95895cd4f57a8f430e35-C001-106", "text": "The tagset and mappings are available for download at http://code.google.com/p/universal-pos-tags/" } ], "y": { "@BACK@": { "gold_contexts": [ [ "d883c7bb1e95895cd4f57a8f430e35-C001-10", "d883c7bb1e95895cd4f57a8f430e35-C001-8" ], [ "d883c7bb1e95895cd4f57a8f430e35-C001-14", "d883c7bb1e95895cd4f57a8f430e35-C001-15", "d883c7bb1e95895cd4f57a8f430e35-C001-16" ] ], "cite_sentences": [ "d883c7bb1e95895cd4f57a8f430e35-C001-10", "d883c7bb1e95895cd4f57a8f430e35-C001-15" ] }, "@DIF@": { "gold_contexts": [ [ "d883c7bb1e95895cd4f57a8f430e35-C001-14", "d883c7bb1e95895cd4f57a8f430e35-C001-15", "d883c7bb1e95895cd4f57a8f430e35-C001-16" ], [ "d883c7bb1e95895cd4f57a8f430e35-C001-29" ] ], "cite_sentences": [ "d883c7bb1e95895cd4f57a8f430e35-C001-15", "d883c7bb1e95895cd4f57a8f430e35-C001-29" ] }, "@MOT@": { "gold_contexts": [ [ "d883c7bb1e95895cd4f57a8f430e35-C001-29" ] ], "cite_sentences": [ "d883c7bb1e95895cd4f57a8f430e35-C001-29" ] }, "@EXT@": { "gold_contexts": [ [ "d883c7bb1e95895cd4f57a8f430e35-C001-29" ] ], "cite_sentences": [ "d883c7bb1e95895cd4f57a8f430e35-C001-29" ] }, "@USE@": { "gold_contexts": [ [ "d883c7bb1e95895cd4f57a8f430e35-C001-93" ], [ "d883c7bb1e95895cd4f57a8f430e35-C001-94" ] ], "cite_sentences": [ "d883c7bb1e95895cd4f57a8f430e35-C001-93", "d883c7bb1e95895cd4f57a8f430e35-C001-94" ] }, "@SIM@": { "gold_contexts": [ [ "d883c7bb1e95895cd4f57a8f430e35-C001-94" ] ], "cite_sentences": [ "d883c7bb1e95895cd4f57a8f430e35-C001-94" ] } } }, "ABC_1265a336e56a4535f0a904ca89b220_31": { "x": [ { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-2", "text": "We assess the reliability and accuracy of (neural) word embeddings for both modern and historical English and German." 
}, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-3", "text": "Our research provides deeper insights into the empirically justified choice of optimal training methods and parameters." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-4", "text": "The overall low reliability we observe, nevertheless, casts doubt on the suitability of word neighborhoods in embedding spaces as a basis for qualitative conclusions on synchronic and diachronic lexico-semantic matters, an issue currently high up in the agenda of Digital Humanities." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-5", "text": "----------------------------------" }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-7", "text": "Distributional methods applied to large-sized, often temporally stratified corpora have markedly enhanced the methodological repertoire of both synchronic and diachronic computational linguistics and are getting more and more popular in the Digital Humanities (see Section 2.2)." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-8", "text": "However, using such quantitative data as a basis for qualitative, empirically-grounded theories requires that measurements should not only be accurate, but also reliable." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-88", "text": "Reliability at different top-n cut-offs is very similar for all languages and time spans under scrutiny, confirming previous observations in Hellrich and Hahn (2016a) and strengthening the suggestion to use only top-1 reliability for evaluation." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-89", "text": "Figure 2 illustrates this phenomenon with an SGNS trained on 1900-1904 English Fiction data." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-90", "text": "We assume this to be connected with the general decrease in word2vec embedding utility for high values of n already observed by Schnabel et al. 
(2015) ." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-91", "text": "Influence of Word Frequency." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-92", "text": "Figures 3 and 4 depict the influence of word frequency (as percentile ranks) for English, as well as orthographically normalized German." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-93", "text": "Negative sampling is overall more reliable, especially for words with low or medium frequency." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-94", "text": "Word frequency has a less pronounced effect on reliability for German and negative sampling is again preferable, especially for low or medium frequency words." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-95", "text": "The 21 English words reported to have undergone traceable semantic changes in prior work 9 are all frequent with percentiles between 89 and 99; for such high-frequency words hierarchical softmax performs similarly or even slightly better." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-9", "text": "Only under such a guarantee can quantitative data be assembled from different experiments as a foundation for trustworthy theories." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-10", "text": "Measuring word similarity by word neighborhoods in embedding space can be used to detect diachronic shifts or domain specific usage, by training word embeddings on suited corpora and comparing these representations." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-11", "text": "Additionally, lexical items close in the embedding space to the lexical item under scrutiny can be considered as approximating its meaning at a given point in time or in a specific domain."
}, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-12", "text": "These two lines of research converge in prior work to show, e.g., the increasing association of the lexical item 'gay' with the meaning dimension of homosexuality (Kim et al., 2014; Kulkarni et al., 2015) ." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-13", "text": "Neural word embeddings (Mikolov et al., 2013) are probably the most influential among all embedding types (see Section 2.1)." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-14", "text": "Yet, we gathered evidence that the inherent randomness involved in their generation affects the reliability of word neighborhood judgments and demonstrate how this hampers qualitative conclusions based on such models." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-15", "text": "Our investigation was performed on both historical (for the time span of 1900 to 1904) and contemporary texts (for the time span of 2005 to 2009) in two languages, English and German." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-16", "text": "It is thus a continuation of prior work, in which we investigated historical English texts only (Hellrich and Hahn, 2016a) , and also influenced by the design decisions of Kim et al. (2014) and Kulkarni et al. (2015) who were the first to use word embeddings in diachronic studies." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-17", "text": "Our results cast doubt on the reproducibility of such experiments where neighborhoods between words in embedding space are taken as a computationally valid indicator for properly capturing lexical meaning (and, consequently, meaning shifts)." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-18", "text": "linguistics."
}, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-19", "text": "The word2vec family of algorithms, developed from heavily trimmed artificial neural networks, is a widely used and robust way to generate such embeddings (Mikolov et al., 2013; Levy et al., 2015) ." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-20", "text": "Its skip-gram variant predicts plausible contexts for a given word, whereas the alternative continuous bag-of-words variant tries to predict words from contexts; we focus on the former as it is generally reported to be superior (see e.g., Levy et al. (2015) )." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-21", "text": "There are two strategies for managing the huge number of potential contexts a word can appear in." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-22", "text": "Skip-gram hierarchical softmax (SGHS) uses a binary tree to more efficiently represent the vocabulary, whereas skip-gram negative sampling (SGNS) updates only a limited number of word vectors during each training step." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-23", "text": "SGNS is preferred in general, yet SGHS showed slight benefits in some reliability scenarios in our prior investigations (Hellrich and Hahn, 2016a) ." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-24", "text": "There are two sources of randomness involved in the training of neural word embeddings: First, the random initialization of all word vectors before any examples are processed." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-25", "text": "Second, the order in which these examples are processed." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-26", "text": "Both can be replaced by deterministic alternatives, 1 yet this would simply replace a random distortion with a fixed one, thus providing faux reliability only useful for testing purposes." 
}, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-27", "text": "A range of other word embedding algorithms was inspired by word2vec, either trying to avoid the opaqueness stemming from its neural network heritage (GloVe; still using random initialization, see Pennington et al. (2014) ) or adding capabilities, like using syntactic information during training (Levy and Goldberg, 2014) or modeling multiple word senses (Bartunov et al., 2016; Panchenko, 2016) ." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-28", "text": "Levy et al. (2015) created SVD PPMI , a variant of the classical pointwise mutual information co-occurrence metric (see e.g., Manning and Sch\u00fctze (1999, pp.178-183) ), by transferring pre-processing steps and hyper-parameters uncovered by the development of these algorithms, and reported similar or slightly better performance than SGNS on evaluation tasks." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-29", "text": "It is conceptually not affected by reliability problems, as there is no random initialization or relevant processing order." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-30", "text": "Word embeddings capture both syntactic and semantic information (and arguably also social biases, see Bolukbasi et al. (2016) ) in vector form and can thus be evaluated by their ability to calculate the similarity of two words and perform analogy-based reasoning; there exist several other evaluation methods and more test sets than discussed here, see e.g., Baroni et al. (2014) ." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-31", "text": "Mikolov et al. (2013) provide an analogy test set for measuring performance as the percentage of correctly calculated analogies for test cases such as the frequently cited 'king'-'queen' example (see Section 3)." 
}, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-32", "text": "Word similarity is evaluated by calculating Spearman's rank coefficient between embedding-derived predictions and a gold standard of human word similarity judgments." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-33", "text": "Finkelstein et al. (2002) developed a widely used test set with 353 English word pairs, 2 a similar resource for German with 350 word pairs was provided by Zesch and Gurevych (2006) ." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-34", "text": "3 Recent work cautions that performance on such tasks is not always predictive for performance in down-stream applications (Batchkarov et al., 2016) ." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-35", "text": "----------------------------------" }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-36", "text": "**DIACHRONIC APPLICATION**" }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-37", "text": "Word embeddings can be used rather directly for tracking semantic changes, namely by measuring the similarity of word representations generated for one word at different points in time-words which underwent semantic shifts will be dissimilar with themselves." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-38", "text": "These models must either be trained in a continuous manner where the model for each time span is initialized with its predecessor (Kim et al., 2014; Hellrich and Hahn, 2016b) , or a mapping between models for different points in time must be calculated (Kulkarni et al., 2015; Hamilton et al., 2016) ." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-39", "text": "The first approach cannot be performed in parallel and is thus rather time-consuming, if texts are not subsampled." 
}, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-40", "text": "We nevertheless discourage using samples instead of full corpora, as we observed extremely low reliability values between different samples (Hellrich and Hahn, 2016a) ." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-41", "text": "Word embeddings can also be used in diachronic studies without any kind of mapping to track clusters of similar words over time and, thus, model the evolution of topics (Kenter et al., 2015) or compare neighborhoods in embedding space for preselected words (Jo, 2016) ." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-42", "text": "Besides temporal variations, word embeddings can also used to analyze geographic ones, e.g., the distinction between US American and British English variants (Kulkarni et al., 2016) ." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-43", "text": "Most of these studies were performed with algorithms from the word2vec family, respectively GloVe in Jo (2016), and are thus likely to be affected by the same systematic reliability problems on which we focus here." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-44", "text": "Only Hamilton et al. (2016) used SVD PPMI in some of their very recent experiments and showed it to be adequate for exploring historical semantics." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-45", "text": "The Google Books Ngram corpus (GBN; Michel et al. (2011 ), Lin et al. (2012 ) is used in most of the studies we already mentioned, including our current study and its predecessor (Hellrich and Hahn, 2016a) ." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-46", "text": "It contains about 6% of all books published between 1500 and 2009 in the form of n-grams (up to pentagrams), together with their frequency for each year." 
}, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-47", "text": "This corpus has often been criticized for its opaque sampling strategy, as its constituent books remain unknown and can be shown to form an unbalanced collection (Pechenick et al., 2015) ." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-48", "text": "GBN is multilingual, with its English part being subdivided into regional segments (British, US) and topic categories (general language and fiction texts)." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-49", "text": "Diachronic research focuses on the English Fiction part, with the exception of some work relating to German data (Hellrich and Hahn, 2016b) ." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-50", "text": "----------------------------------" }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-51", "text": "**EVALUATION METHODS**" }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-52", "text": "Reliability, in this study, is judged by training three identically parametrized models for each experiment and by comparing the n next neighbors (by cosine distance) for each word modeled by the experiments with a variant of the Jaccard coefficient (Manning and Sch\u00fctze, 1999, p.299) ." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-53", "text": "The 3-dimensional array W i,j,k contains words ordered by closeness (i) for a word in question (j) according to an experiment (k)." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-54", "text": "The reliability r for a specific value of n (r@n) is defined as the magnitude of the intersection of similar words produced by all three experiments with a rank of n or lower, averaged over all t words modeled by these experiments and normalized by n, which is the maximally achievable score for this value of n:" }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-55", "text": "Accuracy, in this study, is measured considering two different approaches-analogy and similarity." 
}, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-56", "text": "The analogy approach uses the English test set developed by Mikolov et al. (2013) by calculating the percentage of correct analogies made by a word2vec model." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-57", "text": "It contains groups of four words connected via the analogy relation '::' and the similarity relation '\u223c', as exemplified by the expression 'king' \u223c 'queen' :: 'man' \u223c 'woman'." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-58", "text": "The similarity approach covers both English and German by calculating Spearman's rank correlation coefficient between the similarity judgments made by a word2vec model for a word pair (e.g., 'bread' and 'butter') and the human judgment thereof (Finkelstein et al., 2002; Zesch and Gurevych, 2006) ." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-59", "text": "Pairs containing words not modeled for the time span in question, such as the at that time non-existent 'FBI' in the early 20th century, are simply ignored." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-60", "text": "All three test sets are based on contemporary language and current world knowledge and might thus not fully match the requirements for historical texts, yet are also used for these due to the lack of a suitable alternative." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-61", "text": "Accuracy values were calculated independently for each of the three identically parametrized models and subsequently averaged, but resulting deviations were negligible." 
}, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-62", "text": "----------------------------------" }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-63", "text": "**EXPERIMENTAL SET-UP 4.1 CORPUS**" }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-64", "text": "Our experiments 4 were performed on the German part and the English Fiction part of the GBN; the latter is known to be less unbalanced than the general English part (Pechenick et al., 2015) ." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-65", "text": "Both corpus splits differ in size and contain mainly contemporary texts (from the past fifty years), as is evident from Figure 1 ; note the logarithmic axis and the negative impact of both World Wars on book production." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-66", "text": "Following Kulkarni et al. (2015) , we trained our models on all 5-grams occurring during five consecutive years for the two time spans, 5 1900-1904 and 2005-2009 ; the number of 5-grams 6 for each time span is listed in Table 1 ." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-67", "text": "The two languages share a similar number of 5-grams for 1900-1904, yet not for [2005] [2006] [2007] [2008] [2009] ." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-68", "text": "5-grams from both corpus parts were lower cased for training." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-69", "text": "The German part was not only taken as is, but also orthographically normalized using the CAB service (Jurish, 2013) ." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-70", "text": "7 We incorporated this step because major changes in German orthography occurred during the 20th century, an issue that could hamper diachronic comparisons, e.g., archaic 'Gem\u00fcth' (in English: \"mind, emotional disposition\") became modern 'Gem\u00fct'." 
}, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-71", "text": "Table 1 shows the resulting reduction in the number of types, bringing the morphologically richer German to levels below English (yet this reduction is in line with the respective corpus sizes)." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-72", "text": "----------------------------------" }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-73", "text": "**TRAINING**" }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-74", "text": "We used the PYTHON-based GENSIM 8 implementation of word2vec to independently train word embeddings for each time span with 200 dimensions, a context window of 4 (limited by the 5-gram size), a minimum frequency of 10, and 10 \u22125 as the threshold for downsampling frequent words." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-75", "text": "We processed the full subcorpora for each time span, due to the extremely low reliability values between samples we observed in previous investigations (Hellrich and Hahn, 2016a) ." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-76", "text": "We tested both SGNS with 5 noise words and SGHS training strategies and trained for 10 iterations, saving the resulting embeddings after each epoch." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-77", "text": "During each epoch the learning rate was decreased from 0.025 to 0.0001." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-78", "text": "The averaged cosine values between word embeddings before and after an epoch are used as a convergence measure c (Kim et al., 2014; Kulkarni et al., 2015) ." 
}, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-79", "text": "It is defined for a vocabulary with n words and a matrix W containing word embedding vectors (normalized to length 1) for words i from training epochs e and e-1:" }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-80", "text": "We also define \u2206c, the change of c during subsequent epochs e-1, as another convergence criterion:" }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-81", "text": "5 Results Table 2 shows the performance of the systems trained according to the settings described in Section 4.2, as measured by similarity accuracy and top-1 reliability (see below for other cut-offs)." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-82", "text": "We make the following observations:" }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-83", "text": "1. Both accuracy and reliability are higher for SGNS than for SGHS for all tested combinations of languages and time spans, if 10 training epochs are used." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-84", "text": "We also measured analogy accuracy for the English Fiction data sets, and observed no negative effect of multiple training epochs, yet a more pronounced gap between training methods, e.g., 36% of all analogies were correct for SGNS and only 27% for SGHS after one epoch on 1900-1904 data." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-85", "text": "In the following, we further explore system performance as influenced, e.g., by word frequency, word ambiguity and the number of training epochs." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-86", "text": "For German, we focus on the normalized version due to the overall similar performance and suitability for further applications." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-87", "text": "Influence of Neighborhood Size." 
}, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-96", "text": "The relatively low reliability for medium-frequency English words, as compared to German ones, could be caused by a peculiar pattern of word co-occurrences, illustrated in Figures 5 and 6 for 1900-1904 English Fiction, respectively normalized German." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-97", "text": "Medium-frequency English words have fewer co-occurrences with low-frequency words than German ones, which might result in a lack of specific contexts for these words during training and thus hamper embedding quality." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-98", "text": "Influence of Word Ambiguity." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-99", "text": "Entries in lexical databases, such as WORDNET 10 (Fellbaum, 1998) and its German counterpart GERMANET 11 (Lemnitzer and Kunze, 2002) , can be employed to approximate the effect of word ambiguity on reliability." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-100", "text": "The number of synsets a word belongs to (i.e., the number of its senses) seems to be positively correlated with top-1 reliability for English, as shown in Figure 7 , whereas orthographically normalized German is less affected by ambiguity as Figure 8 reveals." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-101", "text": "This counter-intuitive effect for English seems to be caused by the low ambiguity of infrequent words-results become more uniform, if analysis is limited to high frequency words (e.g., 90th frequency percentile or higher)." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-102", "text": "Influence of the Number of Training Epochs." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-103", "text": "Table 2 )." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-104", "text": "Figures 11 and 12 show the results for English and orthographically normalized German, respectively." 
}, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-105", "text": "Note that accuracy is assessed on a test set for modern-day language, and can thus not be considered a fully valid yardstick." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-106", "text": "Accuracy behaves similar to reliability, as under the negative sampling condition it clearly profits from multiple training epochs." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-107", "text": "This effect is more pronounced for smaller corpora; the biggest corpus (i.e., English Fiction 2005 -2009 ) shows a slight regression in accuracy after more than 5 training epochs." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-108", "text": "Conclusions." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-109", "text": "Both reliability and accuracy point towards negative sampling with 4 to 6 training epochs (6 being better for smaller and 4 being better for larger corpora) as the optimal training regime for all tested combinations of languages and time spans (implicitly, this is also a test on largely varying corpus sizes, see Table 1 )." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-110", "text": "Such a training scheme yields models with high reliability without losses in accuracy (that would indicate overfitting)." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-111", "text": "Figure 13 shows \u2206c, i.e., the difference of the convergence measure c (Equations (2) and (3) averaged over all three models) between subsequent epochs, for both German and English data from the intervals 1900-1904 and 2005-2009 ." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-112", "text": "Few changes occur after 4-6 epochs, which could be alternatively expressed as a \u2206c of about 0.003." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-113", "text": "The convergence criterion proposed by Kulkarni et al. 
(2015) , i.e., c = 0.9999, was never reached (this observation might be explained by Kulkarni et al.'s decision not to reset the learning rate for each training epoch, as was done by us and Kim et al. (2014) )." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-114", "text": "SVD PPMI , which are conceptually not bothered by the reliability problems we discussed here, were not a good fit for the hyperparameters we adopted from Kulkarni et al. (2015) ." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-115", "text": "Hamilton et al. (2016) reports similarity accuracy superior to SGNS, whereas for our set-up results in pretests were about 10 percent points worse than skip-gram embeddings, e.g., only 0.35 for 1900-1904 English Fiction." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-116", "text": "Finally, to want to illustrate how this reliability problem affects qualitative conclusions." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-117", "text": "In Table 3 we provide some examples in which three negative sampling models for 1900-1904 English Fiction did not agree on the closest neighbor for words in question (mostly drawn from the list in Footnote 9)." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-118", "text": "The most inconsistent word neighborhoods are provided for 'romantic' which is connected to 'lazzaroni', 12 'fanciful' and 'melancholies'." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-119", "text": "This holds despite the high frequency (94th percentile) and moderate ambiguity (5 synsets) of the target item 'romantic'." 
}, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-120", "text": "----------------------------------" }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-121", "text": "**DISCUSSION**" }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-122", "text": "Our investigation into the accuracy and reliability of skip-gram word embeddings shows even the most reliable systems too often provide inconsistent word neighborhoods." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-123", "text": "This carries unwarranted potential for erroneous conclusions on a word's semantic evolution as was shown, e.g., for the lexical item 'romantic' and English Fiction texts from the 1900-1904 time slice." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-124", "text": "We are thus skeptical about using word neighborhoods in skip-gram embedding space to adequately capture natural languages' lexical semantics (for English and German, at least)." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-125", "text": "While we found some mitigation strategies, i.e., training for multiple epochs or using our convergence criterion of \u2206c 0.003, we assume SVD PPMI to be conceptually superior." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-126", "text": "Future work might try to provide general guidelines for proper hyperparameter selection for SVD PPMI , especially regarding complete temporal slices of the GBN (Hamilton et al. (2016) used samples)." }, { "sent_id": "1265a336e56a4535f0a904ca89b220-C001-127", "text": "Alternatively, training several identically parametrized SGNS/SGHS models and combining them into an ensemble might constitute an easy way to reduce the reliability problems we described, yet at the price of exorbitant computational costs." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "1265a336e56a4535f0a904ca89b220-C001-15", "1265a336e56a4535f0a904ca89b220-C001-16" ], [ "1265a336e56a4535f0a904ca89b220-C001-21", "1265a336e56a4535f0a904ca89b220-C001-22", "1265a336e56a4535f0a904ca89b220-C001-23" ], [ "1265a336e56a4535f0a904ca89b220-C001-40" ], [ "1265a336e56a4535f0a904ca89b220-C001-45" ] ], "cite_sentences": [ "1265a336e56a4535f0a904ca89b220-C001-16", "1265a336e56a4535f0a904ca89b220-C001-23", "1265a336e56a4535f0a904ca89b220-C001-40", "1265a336e56a4535f0a904ca89b220-C001-45" ] }, "@MOT@": { "gold_contexts": [ [ "1265a336e56a4535f0a904ca89b220-C001-40" ], [ "1265a336e56a4535f0a904ca89b220-C001-75" ] ], "cite_sentences": [ "1265a336e56a4535f0a904ca89b220-C001-40", "1265a336e56a4535f0a904ca89b220-C001-75" ] }, "@USE@": { "gold_contexts": [ [ "1265a336e56a4535f0a904ca89b220-C001-45" ] ], "cite_sentences": [ "1265a336e56a4535f0a904ca89b220-C001-45" ] }, "@UNSURE@": { "gold_contexts": [ [ "1265a336e56a4535f0a904ca89b220-C001-45" ] ], "cite_sentences": [ "1265a336e56a4535f0a904ca89b220-C001-45" ] }, "@SIM@": { "gold_contexts": [ [ "1265a336e56a4535f0a904ca89b220-C001-88" ] ], "cite_sentences": [ "1265a336e56a4535f0a904ca89b220-C001-88" ] } } }, "ABC_48def208400142f043a07be5d83713_31": { "x": [ { "sent_id": "48def208400142f043a07be5d83713-C001-5", "text": "We develop RTM models for each WMT15 QET (QET15) subtask and obtain improvements over QET14 results." }, { "sent_id": "48def208400142f043a07be5d83713-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "48def208400142f043a07be5d83713-C001-2", "text": "We use referential translation machines (RTMs) for predicting translation performance." }, { "sent_id": "48def208400142f043a07be5d83713-C001-3", "text": "RTMs pioneer a language independent approach to all similarity tasks and remove the need to access any task or domain specific information or resource." 
}, { "sent_id": "48def208400142f043a07be5d83713-C001-4", "text": "We improve our RTM models with the ParFDA instance selection model , with additional features for predicting the translation performance, and with improved learning models." }, { "sent_id": "48def208400142f043a07be5d83713-C001-6", "text": "RTMs achieve top performance in QET15 ranking 1st in document-and sentence-level prediction tasks and 2nd in word-level prediction task." }, { "sent_id": "48def208400142f043a07be5d83713-C001-7", "text": "Referential translation machines are a computational model effectively judging monolingual and bilingual similarity while identifying translation acts between any two data sets with respect to interpretants." }, { "sent_id": "48def208400142f043a07be5d83713-C001-8", "text": "RTMs achieve top performance in automatic, accurate, and language independent prediction of machine translation performance and reduce our dependence on any task dependent resource." }, { "sent_id": "48def208400142f043a07be5d83713-C001-9", "text": "Prediction of translation performance can help in estimating the effort required for correcting the translations during post-editing by human translators." 
}, { "sent_id": "48def208400142f043a07be5d83713-C001-10", "text": "We improve our RTM models (Bi\u00e7ici and Way, 2014):" }, { "sent_id": "48def208400142f043a07be5d83713-C001-11", "text": "\u2022 by using improved ParFDA instance selection model allowing better language models (LM) in which similarity judgments are made to be built with improved optimization and selection of the LM data," }, { "sent_id": "48def208400142f043a07be5d83713-C001-12", "text": "\u2022 by selecting TreeF features over source and translation data jointly instead of taking their intersection," }, { "sent_id": "48def208400142f043a07be5d83713-C001-13", "text": "\u2022 with extended learning models including bayesian ridge regression (Tan et al., 2015) , which did not obtain better performance than support vector regression in training results (Section 2.2)." }, { "sent_id": "48def208400142f043a07be5d83713-C001-14", "text": "We present top results with Referential Translation Machines (Bi\u00e7ici, 2015; Bi\u00e7ici and Way, 2014) at quality estimation task (QET15) in WMT15 (Bojar et al., 2015) ." }, { "sent_id": "48def208400142f043a07be5d83713-C001-15", "text": "RTMs pioneer a computational model for quality and semantic similarity judgments in monolingual and bilingual settings using retrieval of relevant training data (Bi\u00e7ici and Yuret, 2015) as interpretants for reaching shared semantics." }, { "sent_id": "48def208400142f043a07be5d83713-C001-16", "text": "RTMs use Machine Translation Performance Prediction (MTPP) System Bi\u00e7ici, 2015) , which is a state-of-the-art performance predictor of translation even without using the translation by using only the source." }, { "sent_id": "48def208400142f043a07be5d83713-C001-17", "text": "We use ParFDA for selecting the interpretants Bi\u00e7ici and Yuret, 2015) and build an MTPP model." 
}, { "sent_id": "48def208400142f043a07be5d83713-C001-18", "text": "MTPP derives indicators of the closeness of test sentences to the available training data, the difficulty of translating the sentence, and the presence of acts of translation for data transformation." }, { "sent_id": "48def208400142f043a07be5d83713-C001-19", "text": "We view that acts of translation are ubiquitously used during communication:" }, { "sent_id": "48def208400142f043a07be5d83713-C001-20", "text": "Every act of communication is an act of translation (Bliss, 2012)." }, { "sent_id": "48def208400142f043a07be5d83713-C001-21", "text": "Figure 1 depicts RTM." }, { "sent_id": "48def208400142f043a07be5d83713-C001-22", "text": "Our encouraging results in QET provides a greater understanding of the acts of translation we ubiquitously use and how they can be used to predict the performance of translation." }, { "sent_id": "48def208400142f043a07be5d83713-C001-23", "text": "RTMs are powerful enough to be applicable in different domains and tasks while achieving top performance." }, { "sent_id": "48def208400142f043a07be5d83713-C001-24", "text": "304" }, { "sent_id": "48def208400142f043a07be5d83713-C001-25", "text": "----------------------------------" }, { "sent_id": "48def208400142f043a07be5d83713-C001-26", "text": "**REFERENTIAL TRANSLATION MACHINE (RTM)**" }, { "sent_id": "48def208400142f043a07be5d83713-C001-27", "text": "Referential translation machines are a computational model effectively judging monolingual and bilingual similarity while identifying translation acts between any two data sets with respect to interpretants." }, { "sent_id": "48def208400142f043a07be5d83713-C001-28", "text": "RTMs achieve top performance in automatic, accurate, and language independent prediction of machine translation performance and reduce our dependence on any task dependent resource." 
}, { "sent_id": "48def208400142f043a07be5d83713-C001-29", "text": "Prediction of translation performance can help in estimating the effort required for correcting the translations during post-editing by human translators." }, { "sent_id": "48def208400142f043a07be5d83713-C001-30", "text": "We improve our RTM models (Bi\u00e7ici and Way, 2014 ):" }, { "sent_id": "48def208400142f043a07be5d83713-C001-31", "text": "\u2022 by using improved ParFDA instance selection model allowing better language models (LM) in which similarity judgments are made to be built with improved optimization and selection of the LM data," }, { "sent_id": "48def208400142f043a07be5d83713-C001-32", "text": "\u2022 by selecting TreeF features over source and translation data jointly instead of taking their intersection," }, { "sent_id": "48def208400142f043a07be5d83713-C001-33", "text": "\u2022 with extended learning models including bayesian ridge regression (Tan et al., 2015) , which did not obtain better performance than support vector regression in training results (Section 2.2)." }, { "sent_id": "48def208400142f043a07be5d83713-C001-34", "text": "We present top results with Referential Translation Machines (Bi\u00e7ici, 2015; Bi\u00e7ici and Way, 2014) at quality estimation task (QET15) in WMT15 (Bojar et al., 2015) ." }, { "sent_id": "48def208400142f043a07be5d83713-C001-35", "text": "RTMs pioneer a computational model for quality and semantic similarity judgments in monolingual and bilingual settings using retrieval of relevant training data (Bi\u00e7ici and Yuret, 2015) as interpretants for reaching shared semantics." }, { "sent_id": "48def208400142f043a07be5d83713-C001-36", "text": "RTMs use Machine Translation Performance Prediction (MTPP) System Bi\u00e7ici, 2015) , which is a state-of-the-art performance predictor of translation even without using the translation by using only the source." 
}, { "sent_id": "48def208400142f043a07be5d83713-C001-37", "text": "We use ParFDA for selecting the interpretants Bi\u00e7ici and Yuret, 2015) and build an MTPP model." }, { "sent_id": "48def208400142f043a07be5d83713-C001-38", "text": "MTPP derives indicators of the closeness of test sentences to the available training data, the difficulty of translating the sentence, and the presence of acts of translation for data transformation." }, { "sent_id": "48def208400142f043a07be5d83713-C001-39", "text": "We view that acts of translation are ubiquitously used during communication:" }, { "sent_id": "48def208400142f043a07be5d83713-C001-40", "text": "Every act of communication is an act of translation (Bliss, 2012) ." }, { "sent_id": "48def208400142f043a07be5d83713-C001-41", "text": "Figure 1 depicts RTM." }, { "sent_id": "48def208400142f043a07be5d83713-C001-42", "text": "Our encouraging results in QET provides a greater understanding of the acts of translation we ubiquitously use and how they can be used to predict the performance of translation." }, { "sent_id": "48def208400142f043a07be5d83713-C001-43", "text": "RTMs are powerful enough to be applicable in different domains and tasks while achieving top performance." }, { "sent_id": "48def208400142f043a07be5d83713-C001-44", "text": "----------------------------------" }, { "sent_id": "48def208400142f043a07be5d83713-C001-45", "text": "**RTM IN THE QUALITY ESTIMATION TASK**" }, { "sent_id": "48def208400142f043a07be5d83713-C001-46", "text": "We participate in all of the three subtasks of the quality estimation task (QET) (Bojar et al., 2015) , which include English to Spanish (en-es), English to German (en-de), and German to English (deen) translation directions." }, { "sent_id": "48def208400142f043a07be5d83713-C001-47", "text": "There are three subtasks: sentence-level prediction (Task 1), wordlevel prediction (Task 2), and document-level prediction (Task 3)." 
}, { "sent_id": "48def208400142f043a07be5d83713-C001-48", "text": "Task 1 is about predicting HTER (human-targeted translation edit rate) (Snover et al., 2006) scores of sentence translations, Task 2 is about binary classification of word-level quality, and Task 3 is about predicting METEOR (Lavie and Agarwal, 2007) scores of document translations." }, { "sent_id": "48def208400142f043a07be5d83713-C001-49", "text": "Instance selection for the training set and the language model (LM) corpus is handled by ParFDA , whose parameters are optimized for each translation task." }, { "sent_id": "48def208400142f043a07be5d83713-C001-50", "text": "LM are trained using SRILM (Stolcke, 2002) ." }, { "sent_id": "48def208400142f043a07be5d83713-C001-51", "text": "We tokenize and truecase all of the corpora using code released with Moses (Koehn et al., 2007) 1 ." }, { "sent_id": "48def208400142f043a07be5d83713-C001-52", "text": "Table 1 lists the number of sentences in the training and test sets for each task." }, { "sent_id": "48def208400142f043a07be5d83713-C001-53", "text": "1 mosesdecoder/scripts/" }, { "sent_id": "48def208400142f043a07be5d83713-C001-54", "text": "----------------------------------" }, { "sent_id": "48def208400142f043a07be5d83713-C001-55", "text": "**RTM PREDICTION MODELS AND OPTIMIZATION**" }, { "sent_id": "48def208400142f043a07be5d83713-C001-56", "text": "We present results using support vector regression (SVR) with RBF (radial basis functions) kernel (Smola and Sch\u00f6lkopf, 2004) for sentence and document translation prediction tasks and Global Linear Models (GLM) (Collins, 2002) with dynamic learning (GLMd) (Bi\u00e7ici, 2013; Bi\u00e7ici and Way, 2014) for word-level translation performance prediction." 
}, { "sent_id": "48def208400142f043a07be5d83713-C001-57", "text": "We also use these learning models after a feature subset selection (FS) with recursive feature elimination (RFE) (Guyon et al., 2002) or a dimensionality reduction and mapping step using partial least squares (PLS) (Specia et al., 2009 ), or PLS after FS (FS+PLS)." }, { "sent_id": "48def208400142f043a07be5d83713-C001-58", "text": "GLM relies on Viterbi decoding, perceptron learning, and flexible feature definitions." }, { "sent_id": "48def208400142f043a07be5d83713-C001-59", "text": "GLMd extends the GLM framework by parallel perceptron training (McDonald et al., 2010) and dynamic learning with adaptive weight updates in the perceptron learning algorithm:" }, { "sent_id": "48def208400142f043a07be5d83713-C001-60", "text": "where \u03a6 returns a global representation for instance i and the weights are updated by \u03b1, which dynamically decays the amount of the change during weight updates at later stages and prevents large fluctuations with updates." }, { "sent_id": "48def208400142f043a07be5d83713-C001-61", "text": "The learning rate updates the weight values with weights in the range [a, b] using the following function taking error rate as the input:" }, { "sent_id": "48def208400142f043a07be5d83713-C001-62", "text": "Learning rate curve for a = 0.5 and b = 1.0 is provided in Figure 2:" }, { "sent_id": "48def208400142f043a07be5d83713-C001-63", "text": "----------------------------------" }, { "sent_id": "48def208400142f043a07be5d83713-C001-64", "text": "**TRAINING RESULTS**" }, { "sent_id": "48def208400142f043a07be5d83713-C001-65", "text": "We use mean absolute error (MAE), relative absolute error (RAE), root mean squared error (RMSE), and correlation (r) as well as relative MAE (MAER) and relative RAE (MRAER) to evaluate (Bi\u00e7ici, 2015; Bi\u00e7ici, 2013) ." 
}, { "sent_id": "48def208400142f043a07be5d83713-C001-66", "text": "MAER is mean absolute error relative to the magnitude of the target and MRAER is mean absolute error relative to the absolute error of a predictor always predicting the target mean assuming that target mean is known (Bi\u00e7ici, 2015 2012) calculates the average quality difference between the top n\u22121 quartiles and the overall quality for the test set." }, { "sent_id": "48def208400142f043a07be5d83713-C001-67", "text": "Table 2 presents the training results for Task 1 and Task 3." }, { "sent_id": "48def208400142f043a07be5d83713-C001-68", "text": "Table 3 presents Task 2 training results." }, { "sent_id": "48def208400142f043a07be5d83713-C001-69", "text": "We refer to GLMd parallelized over 4 splits as GLMd s4 and GLMd with 5 splits as GLMd s5." }, { "sent_id": "48def208400142f043a07be5d83713-C001-70", "text": "----------------------------------" }, { "sent_id": "48def208400142f043a07be5d83713-C001-71", "text": "**TEST RESULTS**" }, { "sent_id": "48def208400142f043a07be5d83713-C001-72", "text": "Task 1: Predicting the HTER for Sentence Translations The results on the test set are given in Table 4 ." }, { "sent_id": "48def208400142f043a07be5d83713-C001-73", "text": "Rank lists the overall ranking in the task out of about 9 submissions." }, { "sent_id": "48def208400142f043a07be5d83713-C001-74", "text": "We obtain the rankings by sorting according to the predicted scores and randomly assigning ranks in case of ties." }, { "sent_id": "48def208400142f043a07be5d83713-C001-75", "text": "RTMs with FS followed by PLS and learning with SVR is able to achieve the top rank in this task." }, { "sent_id": "48def208400142f043a07be5d83713-C001-76", "text": "Task 2: Prediction of Word-level Translation Quality Task 2 is about binary classification of word-level quality." 
}, { "sent_id": "48def208400142f043a07be5d83713-C001-77", "text": "We develop individual RTM models for each subtask and use GLMd model (Bi\u00e7ici, 2013; Bi\u00e7ici and Way, 2014) , for predicting the quality at the word-level." }, { "sent_id": "48def208400142f043a07be5d83713-C001-78", "text": "The results on the test set are in Table 5 where the ranks are out of about 17 submissions." }, { "sent_id": "48def208400142f043a07be5d83713-C001-79", "text": "RTMs with GLMd becomes the second best system this task." }, { "sent_id": "48def208400142f043a07be5d83713-C001-80", "text": "Task 3: Predicting METEOR of Document Translations Task 3 is about predicting ME-TEOR (Lavie and Agarwal, 2007) and their ranking." }, { "sent_id": "48def208400142f043a07be5d83713-C001-81", "text": "The results on the test set are given in Table 4 where the ranks are out of about 6 submissions using wF 1 ." }, { "sent_id": "48def208400142f043a07be5d83713-C001-82", "text": "RTMs achieve top rankings in this task." }, { "sent_id": "48def208400142f043a07be5d83713-C001-83", "text": "Table 5 : RTM-DCU Task 2 results on the test set." }, { "sent_id": "48def208400142f043a07be5d83713-C001-84", "text": "wF 1 is the average weighted F 1 score." }, { "sent_id": "48def208400142f043a07be5d83713-C001-85", "text": "----------------------------------" }, { "sent_id": "48def208400142f043a07be5d83713-C001-86", "text": "**RTMS ACROSS TASKS AND YEARS**" }, { "sent_id": "48def208400142f043a07be5d83713-C001-87", "text": "We compare the difficulty of tasks according to MRAER levels achieved." }, { "sent_id": "48def208400142f043a07be5d83713-C001-88", "text": "In Table 6 , we list the RTM test results for tasks and subtasks that predict HTER or METEOR from QET15, QET14 (Bi\u00e7ici and Way, 2014) , and QET13 (Bi\u00e7ici, 2013) ." }, { "sent_id": "48def208400142f043a07be5d83713-C001-89", "text": "The best results when predicting HTER are obtained this year." 
}, { "sent_id": "48def208400142f043a07be5d83713-C001-90", "text": "----------------------------------" }, { "sent_id": "48def208400142f043a07be5d83713-C001-91", "text": "**CONCLUSION**" }, { "sent_id": "48def208400142f043a07be5d83713-C001-92", "text": "Referential translation machines achieve top performance in automatic, accurate, and language independent prediction of document-, sentence-, and word-level statistical machine translation (SMT) performance." }, { "sent_id": "48def208400142f043a07be5d83713-C001-93", "text": "RTMs remove the need to access any SMT system specific information or prior knowledge of the training data or models used when generating the translations." }, { "sent_id": "48def208400142f043a07be5d83713-C001-94", "text": "RTMs achieve top performance when predicting translation performance." }, { "sent_id": "48def208400142f043a07be5d83713-C001-95", "text": "Table 6 : Test performance of the top individual RTM results when predicting HTER or METEOR also including results from QET14 (Bi\u00e7ici and Way, 2014) and QET13 (Bi\u00e7ici, 2013) ." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "48def208400142f043a07be5d83713-C001-10", "48def208400142f043a07be5d83713-C001-11", "48def208400142f043a07be5d83713-C001-12", "48def208400142f043a07be5d83713-C001-13" ], [ "48def208400142f043a07be5d83713-C001-14" ], [ "48def208400142f043a07be5d83713-C001-30", "48def208400142f043a07be5d83713-C001-31", "48def208400142f043a07be5d83713-C001-32", "48def208400142f043a07be5d83713-C001-33" ], [ "48def208400142f043a07be5d83713-C001-34" ], [ "48def208400142f043a07be5d83713-C001-56" ] ], "cite_sentences": [ "48def208400142f043a07be5d83713-C001-10", "48def208400142f043a07be5d83713-C001-14", "48def208400142f043a07be5d83713-C001-30", "48def208400142f043a07be5d83713-C001-34", "48def208400142f043a07be5d83713-C001-56" ] }, "@MOT@": { "gold_contexts": [ [ "48def208400142f043a07be5d83713-C001-10", "48def208400142f043a07be5d83713-C001-11", "48def208400142f043a07be5d83713-C001-12", "48def208400142f043a07be5d83713-C001-13" ], [ "48def208400142f043a07be5d83713-C001-30", "48def208400142f043a07be5d83713-C001-31", "48def208400142f043a07be5d83713-C001-32", "48def208400142f043a07be5d83713-C001-33" ] ], "cite_sentences": [ "48def208400142f043a07be5d83713-C001-10", "48def208400142f043a07be5d83713-C001-30" ] }, "@DIF@": { "gold_contexts": [ [ "48def208400142f043a07be5d83713-C001-10", "48def208400142f043a07be5d83713-C001-11", "48def208400142f043a07be5d83713-C001-12", "48def208400142f043a07be5d83713-C001-13" ], [ "48def208400142f043a07be5d83713-C001-30", "48def208400142f043a07be5d83713-C001-31", "48def208400142f043a07be5d83713-C001-32", "48def208400142f043a07be5d83713-C001-33" ] ], "cite_sentences": [ "48def208400142f043a07be5d83713-C001-10", "48def208400142f043a07be5d83713-C001-30" ] }, "@USE@": { "gold_contexts": [ [ "48def208400142f043a07be5d83713-C001-14" ], [ "48def208400142f043a07be5d83713-C001-34" ], [ "48def208400142f043a07be5d83713-C001-56" ], [ "48def208400142f043a07be5d83713-C001-77" ], [ "48def208400142f043a07be5d83713-C001-87", 
"48def208400142f043a07be5d83713-C001-88" ] ], "cite_sentences": [ "48def208400142f043a07be5d83713-C001-14", "48def208400142f043a07be5d83713-C001-34", "48def208400142f043a07be5d83713-C001-56", "48def208400142f043a07be5d83713-C001-77", "48def208400142f043a07be5d83713-C001-88" ] }, "@UNSURE@": { "gold_contexts": [ [ "48def208400142f043a07be5d83713-C001-95" ] ], "cite_sentences": [ "48def208400142f043a07be5d83713-C001-95" ] } } }, "ABC_365171603fb13c6534ef4abab092d6_31": { "x": [ { "sent_id": "365171603fb13c6534ef4abab092d6-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-2", "text": "We propose a simple yet effective textbased user geolocation model based on a neural network with one hidden layer, which achieves state of the art performance over three Twitter benchmark geolocation datasets, in addition to producing word and phrase embeddings in the hidden layer that we show to be useful for detecting dialectal terms." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-3", "text": "As part of our analysis of dialectal terms, we release DAREDS, a dataset for evaluating dialect term detection methods." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-4", "text": "----------------------------------" }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-5", "text": "**INTRODUCTION**" }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-6", "text": "Many services such as web search (Leung et al., 2010) , recommender systems (Ho et al., 2012) , targeted advertising (Lim and Datta, 2013) , and rapid disaster response (Ashktorab et al., 2014) rely on the location of users to personalise information and extract actionable knowledge." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-7", "text": "Explicit user geolocation metadata (e.g. 
GPS tags, WiFi footprint, IP address) is not usually available to third-party consumers, giving rise to the need for geolocation based on profile data, text content, friendship graphs (Jurgens et al., 2015) or some combination of these (Rahimi et al., 2015b,a) ." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-8", "text": "The strong geographical bias, most obviously at the language level (e.g. Finland vs. Japan), and more subtly at the dialect level (e.g. in English used in north-west England vs. north-east USA vs. Texas, USA), clearly reflected in language use in social media services such as Twitter, has been used extensively either for geolocation of users (Eisenstein et al., 2010; Roller et al., 2012; Rout et al., 2013; Wing and Baldridge, 2014) or dialectology Eisenstein, 2015) ." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-9", "text": "In these methods, a user is often represented by the concatenation of their tweets, and the geolocation model is trained on a very small percentage of explicitly geotagged tweets, noting the potential biases implicit in geotagged tweets (Pavalanathan and Eisenstein, 2015) ." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-10", "text": "Lexical dialectology is (in part) the converse of user geolocation (Eisenstein, 2015) : given text associated with a variety of regions, the task is to identify terms that are distinctive of particular regions." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-11", "text": "The complexity of the task is two-fold: (1) localised named entities (e.g. sporting team names) are not of interest; and (2) without semantic knowledge it is difficult to detect terms that are in general use but have a special meaning in a region." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-12", "text": "In this paper we propose a text-based geolocation method based on neural networks." 
}, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-13", "text": "Our contributions are as follows: (1) we achieve state-of-the-art results on benchmark Twitter geolocation datasets; (2) we show that the model is less sensitive to the specific location discretisation method; (3) we release the first broad-coverage dataset for evaluation of lexical dialectology models; (4) we incorporate our text-based model into a network-based model (Rahimi et al., 2015a) and improve the performance utilising both network and text; and (5) we use the model's embeddings for extraction of local terms and show that it outperforms two baselines." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-14", "text": "----------------------------------" }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-15", "text": "**RELATED WORK**" }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-16", "text": "Related work on Twitter user geolocation falls into two categories: text-based and network-based methods." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-17", "text": "Text-based methods make use of the geographical biases of language use, and networkbased methods rely on the geospatial homophily of user-user interactions." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-18", "text": "In both cases, the assumption is that users who live in the same geographic area share similar features (linguistic or interactional)." 
}, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-19", "text": "Three main text-based approaches are: (1) the use of gazetteers Quercini et al., 2010) ; (2) unsupervised text clustering based on topic models or similar (Eisenstein et al., 2010; Hong et al., 2012; Ahmed et al., 2013) ; and (3) supervised classification (Ding et al., 2000; Backstrom et al., 2008; Cheng et al., 2010; Hecht et al., 2011; Kinsella et al., 2011; Wing and Baldridge, 2011; Han et al., 2012; Rout et al., 2013) , which unlike gazetteers can be applied to informal text and compared to topic models, scales better." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-20", "text": "The classification models often rely on less than 1% of geotagged tweets for supervision and discretise real-valued coordinates into equalsized grids (Serdyukov et al., 2009 ), administrative regions (Cheng et al., 2010; Hecht et al., 2011; Kinsella et al., 2011; Han et al., 2012 , or flat (Wing and Baldridge, 2011) or hierarchical k-d tree clusters (Wing and Baldridge, 2014) ." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-21", "text": "Network-based methods also use either real-valued coordinates (Jurgens et al., 2015) or discretised regions (Rahimi et al., 2015a) as labels, and use label propagation over the interaction graph (e.g. @-mentions)." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-22", "text": "More recent methods have focused on representation learning by using sparse coding (Cha et al., 2015) or neural networks (Liu and Inkpen, 2015) , utilising both text and network information (Rahimi et al., 2015a) ." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-23", "text": "Dialect is a variety of language shared by a group of speakers (Wolfram and Schilling, 2015) ." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-24", "text": "Our focus here is on geographical dialects which are spoken (and written in social media) by people from particular areas." 
}, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-25", "text": "The traditional approach to dialectology is to find the geographical distribution of known lexical alternatives (e.g. you, yall and yinz: (Labov et al., 2005; Nerbonne et al., 2008; Gon\u00e7alves and S\u00e1nchez, 2014; Doyle, 2014; Huang et al., 2015; Nguyen and Eisenstein, 2016) ), the shortcoming of which is that the alternative lexical variables must be known beforehand." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-26", "text": "There have also been attempts to automatically identify such words from geotagged documents (Eisenstein et al., 2010; Ahmed et al., 2013; Eisenstein, 2015) ." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-27", "text": "The main idea is to find lexical variables that are disproportionately distributed in different locations either via model-based or statistical methods (Monroe et al., 2008) ." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-28", "text": "There is a research gap in evaluating the geolocation models in terms of their usability in retrieving dialect terms given a geographic region." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-29", "text": "We use a text-based neural approach trained on geotagged Twitter messages that: (a) given a geographical region, identifies the associated lexical terms; and (b) given a text, predicts its location." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-30", "text": "----------------------------------" }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-31", "text": "**DATA**" }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-32", "text": "We use three existing Twitter user geolocation datasets: (1) GEOTEXT (Eisenstein et al., 2010) , (2) TWITTER-US (Roller et al., 2012) , and (3) TWITTER-WORLD (Han et al., 2012) ." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-33", "text": "These datasets have been used widely for training and evaluation of geolocation models." 
}, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-34", "text": "They are all prepartitioned into training, development and test sets." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-35", "text": "Each user is represented by the concatenation of their tweets, and labeled with the latitude/longitude of the first collected geotagged tweet in the case of GEOTEXT and TWITTER-US, and the centre of the closest city in the case of TWITTER-WORLD." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-36", "text": "1 GEOTEXT and TWITTER-US cover the continental US, and TWITTER-WORLD covers the whole world, with 9k, 449k and 1.3m users, respectively as shown in Figure 1 ." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-37", "text": "2 DAREDS is a dialect-term dataset novel to this research, created from the Dictionary of American Regional English (DARE) (Cassidy et al., 1985) ." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-38", "text": "DARE consists of dialect regions, their terms and meaning." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-39", "text": "3 It is based on dialectal surveys from different regions of the U.S., which are then postprocessed to identify dialect regions and terms." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-40", "text": "In order to construct a dataset based on DARE, we downloaded the web version of DARE, cleaned it, and removed multiword expressions and highly-frequent words (any word which occurred in the top 50k most frequent words, based on a word frequency list (Norvig, 2009) ." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-41", "text": "For dialect regions that don't correspond to a single state or set of cities (e.g. South), we mapped it to the most populous cities within each region." 
}, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-42", "text": "For example, within the Pacific Northwest dialect region, we manually extracted the most populous cities (Seattle, Tacoma, Portland, Salem, Eugene) and added those cities to DAREDS as subregions." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-43", "text": "The resulting dataset (DAREDS) consists of around 4.3k dialect terms from 99 U.S. dialect regions." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-44", "text": "DAREDS is the largest standardised dialectology dataset." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-45", "text": "----------------------------------" }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-46", "text": "**METHODS**" }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-47", "text": "We use a multilayer perceptron (MLP) with one hidden layer as our location classifier, where the input is l 2 normalised bag-of-words features for a given user." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-48", "text": "We exclude @-mentions, words with document frequency less than 10, and stop words." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-49", "text": "The output is either a k-d tree leaf node or k-means discretisation of real-valued coordinates of training locations, the output of which is visualised for TWITTER-US in Figure 2 ." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-50", "text": "The hidden layer output provides word (and phrase, as bags of words) embeddings for dialectal analysis." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-51", "text": "The number of regions, regularisation strength, hidden layer and mini-batch size are tuned over development data and set to (32, 10 \u22125 , 896, 100), (256, 10 \u22126 , 2048, 10000) and (930, 10 \u22126 , 3720, 10000) for GEOTEXT, TWITTER-US and TWITTER-WORLD, respectively." 
}, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-52", "text": "The parameters are optimised using Adamx (Kingma and Ba, 2014) using Lasagne/Theano (Theano Development Team, 2016) ." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-53", "text": "Following Cheng (2010) and Eisenstein (2010) , we evaluated the geolocation model using mean and median error in km (\"Mean\" and \"Median\" resp.) and accuracy within 161km of the actual location (\"Acc@161\")." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-54", "text": "Note that lower numbers are better for Mean and Median, and higher numbers better for Acc@161." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-55", "text": "4 The results reported in Rahimi et al. (2015b; 2015a) for TWITTER-WORLD were over a superset of the dataset; the results reported here are based on the actual dataset." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-56", "text": "While the focus of this paper is text-based user geolocation, state-of-the-art results for the three datasets have been achieved with hybrid text+network-based models, where the predictions of the text-based model are fed into a mention network as \"dongle\" nodes to each user node, providing a personalised geolocation prior for each user (Rahimi et al., 2015a) ." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-57", "text": "Note that it would, of course, be possible to combine text and network information in a joint deep learning model (Yang et al., 2016; Kipf and Welling, 2016) , which we leave to future work (noting that scalability will potentially be a major issue for the larger datasets)." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-58", "text": "To test the applicability of the model's embeddings in dialectology, we created DAREDS." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-59", "text": "The output of the hidden layer of the model is used as embeddings for both location names and dialect terms." 
}, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-85", "text": "**DIALECTOLOGY**" }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-60", "text": "Given a dialect region name, we retrieve its nearest neighbours in the embedding space, and compare them to dialect terms associated with that location." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-61", "text": "We also compare the quality of the embeddings with pre-trained word2vec embeddings and the embeddings from the output layer of LR (logistic regression) (Rahimi et al., 2015b) as baselines." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-62", "text": "Regions in DAREDS can be very broad (e.g. SouthWest), meaning that words associated with those locations will be used across a large number Table 1 : Geolocation results over the three Twitter datasets, based on the text-based MLP with k-d tree or k-means discretisation and text+network model MADCEL-W-MLP using MLP with k-d tree for text-based predictions." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-63", "text": "We compare with state-of-the-art results for each dataset." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-64", "text": "4 \"-\" signifies that no results were reported for the given metric or dataset." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-65", "text": "of cities contained within that region." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-66", "text": "We generate a region-level embedding by simply taking the city names associated with the region, and feeding them as BoW input for LR and MLP and averaging their embeddings for word2vec." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-67", "text": "We evaluate the retrieved terms by computing recall of DAREDS terms existing in TWITTER-US (1071 terms) at k \u2208 {0.05%, 0.1%, 0.2%, 0.5%, 1%, 2%, 5%} of vocabulary size." 
}, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-68", "text": "The code and the DAREDS dataset are available at https://github." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-69", "text": "com/afshinrahimi/acl2017." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-70", "text": "----------------------------------" }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-71", "text": "**RESULTS**" }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-72", "text": "----------------------------------" }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-73", "text": "**GEOLOCATION**" }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-74", "text": "The performance of the text-based MLP model with k-d tree and k-means discretisation over the three datasets is shown in Table 1 ." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-75", "text": "The results are also compared with state-of-the-art text-based methods based on a flat (Rahimi et al., 2015b; Cha et al., 2015) or hierarchical (Wing and Baldridge, 2014; Melo and Martins, 2015; Liu and Inkpen, 2015) geospatial representation." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-76", "text": "Our method outperforms both the flat and hierarchical text-based models by a large margin." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-77", "text": "Comparing the two discretisation strategies, k-means outperforms k-d tree by a reasonable margin." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-78", "text": "We also incorporated the MLP predictions into a network-based model based on the method of Rahimi et al. (2015a) , and improved upon their work." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-79", "text": "We analysed the Table 2 : Nearest neighbours of place names." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-80", "text": "in Figure 3 ." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-81", "text": "The error is highest in states with lower training coverage (e.g. 
Maine, Montana, Wisconsin, Iowa and Kansas)." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-82", "text": "We also randomly sampled 50 development samples from the 1000 samples with highest prediction errors to check the biases of the model." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-83", "text": "Most of the errors are the result of geolocating users from Eastern U.S. in Western U.S. particularly in Los Angeles and San Francisco." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-84", "text": "----------------------------------" }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-86", "text": "We quantitatively tested the quality of the geographical embeddings by calculating the micro-average recall of the k-nearest dialect terms (in terms of the proportion of retrieved dialect terms) given a dialect region, as shown in Figure 4 ." }, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-87", "text": "Recall at 0.5% is about 3.6%, meaning that we were able to retrieve 3.6% of the dialect terms given the dialect region name in the geographical embedding space." 
}, { "sent_id": "365171603fb13c6534ef4abab092d6-C001-88", "text": "The embeddings slightly outperform the output layer of logistic regression (LR) (Rahimi et al., 2015b)" } ], "y": { "@BACK@": { "gold_contexts": [ [ "365171603fb13c6534ef4abab092d6-C001-12", "365171603fb13c6534ef4abab092d6-C001-13" ], [ "365171603fb13c6534ef4abab092d6-C001-21" ], [ "365171603fb13c6534ef4abab092d6-C001-22" ], [ "365171603fb13c6534ef4abab092d6-C001-78" ] ], "cite_sentences": [ "365171603fb13c6534ef4abab092d6-C001-13", "365171603fb13c6534ef4abab092d6-C001-21", "365171603fb13c6534ef4abab092d6-C001-22", "365171603fb13c6534ef4abab092d6-C001-78" ] }, "@EXT@": { "gold_contexts": [ [ "365171603fb13c6534ef4abab092d6-C001-12", "365171603fb13c6534ef4abab092d6-C001-13" ], [ "365171603fb13c6534ef4abab092d6-C001-56" ], [ "365171603fb13c6534ef4abab092d6-C001-78" ] ], "cite_sentences": [ "365171603fb13c6534ef4abab092d6-C001-13", "365171603fb13c6534ef4abab092d6-C001-56", "365171603fb13c6534ef4abab092d6-C001-78" ] }, "@DIF@": { "gold_contexts": [ [ "365171603fb13c6534ef4abab092d6-C001-55" ], [ "365171603fb13c6534ef4abab092d6-C001-56" ] ], "cite_sentences": [ "365171603fb13c6534ef4abab092d6-C001-55", "365171603fb13c6534ef4abab092d6-C001-56" ] } } }, "ABC_a7e49bec53a2bfd7795b9c770f5d0c_31": { "x": [ { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-2", "text": "This paper studies the effect of limited precision data representation and computation on word embeddings." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-3", "text": "We present a systematic evaluation of word embeddings with limited memory and discuss methods that directly train the limited precision representation with limited memory." 
}, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-4", "text": "Our results show that it is possible to use and train an 8-bit fixed-point value for word embedding without loss of performance in word/phrase similarity and dependency parsing tasks." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-5", "text": "----------------------------------" }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-7", "text": "There is an accumulation of evidence that the use of dense distributional lexical representations, known as word embeddings, often supports better performance on a range of NLP tasks (Bengio et al., 2003; Turian et al., 2010; Collobert et al., 2011; Mikolov et al., 2013a; Mikolov et al., 2013b; Levy et al., 2015) ." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-8", "text": "Consequently, word embeddings have been commonly used in the last few years for lexical similarity tasks and as features in multiple, syntactic and semantic, NLP applications." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-9", "text": "However, keeping embedding vectors for hundreds of thousands of words for repeated use could take its toll both on storing the word vectors on disk and, even more so, on loading them into memory." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-10", "text": "For example, for 1 million words, loading 200 dimensional vectors takes up to 1.6 GB memory on a 64-bit system." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-11", "text": "Considering applications that make use of billions of tokens and multiple languages, size issues impose significant limitations on the practical use of word embeddings." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-12", "text": "This paper presents the question of whether it is possible to significantly reduce the memory needs for the use and training of word embeddings." 
}, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-13", "text": "Specifically, we ask \"what is the impact of representing each dimension of a dense representation with significantly fewer bits than the standard 64 bits?\" Moreover, we investigate the possibility of directly training dense embedding vectors using significantly fewer bits than typically used." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-14", "text": "The results we present are quite surprising." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-15", "text": "We show that it is possible to reduce the memory consumption by an order of magnitude both when word embeddings are being used and in training." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-16", "text": "In the first case, as we show, simply truncating the resulting representations after training and using a smaller number of bits (as low as 4 bits per dimension) results in comparable performance to the use of 64 bits." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-17", "text": "Moreover, we provide two ways to train existing algorithms (Mikolov et al., 2013a; Mikolov et al., 2013b ) when the memory is limited during training and show that, here, too, an order of magnitude saving in memory is possible without degrading performance." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-18", "text": "We conduct comprehensive experiments on existing word and phrase similarity and relatedness datasets as well as on dependency parsing, to evaluate these results." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-19", "text": "Our experiments show that, in all cases and without loss in performance, 8 bits can be used when the current standard is 64 and, in some cases, only 4 bits per dimension are sufficient, reducing the amount of space required by a factor of 16." 
}, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-20", "text": "The truncated word embeddings are available from the papers web page at https://cogcomp.cs.illinois." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-21", "text": "edu/page/publication_view/790." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-22", "text": "----------------------------------" }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-23", "text": "**RELATED WORK**" }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-24", "text": "If we consider traditional cluster encoded word representation, e.g., Brown clusters (Brown et al., 1992) , it only uses a small number of bits to track the path on a hierarchical tree of word clusters to represent each word." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-25", "text": "In fact, word embedding generalized the idea of discrete clustering representation to continuous vector representation in language models, with the goal of improving the continuous word analogy prediction and generalization ability (Bengio et al., 2003; Mikolov et al., 2013a; Mikolov et al., 2013b) ." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-26", "text": "However, it has been proven that Brown clusters as discrete features are even better than continuous word embedding as features for named entity recognition tasks (Ratinov and Roth, 2009 )." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-27", "text": "Guo et al. (Guo et al., 2014) further tried to binarize embeddings using a threshold tuned for each dimension, and essentially used less than two bits to represent each dimension." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-28", "text": "They have shown that binarization can be comparable to or even better than the original word embeddings when used as features for named entity recognition tasks." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-29", "text": "Moreover, Faruqui et al. 
(Faruqui et al., 2015) showed that imposing sparsity constraints over the embedding vectors can further improve the representation interpretability and performance on several word similarity and text classification benchmark datasets." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-30", "text": "These works indicate that, for some tasks, we do not need all the information encoded in \"standard\" word embeddings." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-31", "text": "Nonetheless, it is clear that binarization loses a lot of information, and this calls for a systematic comparison of how many bits are needed to maintain the expressivity needed from word embeddings for different tasks." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-32", "text": "----------------------------------" }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-33", "text": "**VALUE TRUNCATION**" }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-34", "text": "In this section, we introduce approaches for word embedding when the memory is limited." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-35", "text": "We truncate any value x in the word embedding into an n-bit representation." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-36", "text": "----------------------------------" }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-37", "text": "**POST-PROCESSING ROUNDING**" }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-38", "text": "When the word embedding vectors are given, the simplest and most intuitive approach is to round the numbers to their n-bit precision." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-39", "text": "Then we can use the truncated values as features for any task that word embeddings can be used for." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-40", "text": "For example, if we want to round x to be in the range of [\u2212r, r], a simple function can be applied as follows."
}, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-41", "text": "(1) where = 2 1\u2212n r. For example, if we want to use 8 bits to represent any value in the vectors, then we only have 256 numbers ranging from -128 to 127 for each value." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-42", "text": "In practice, we first scale all the values and then round them to the 256 numbers." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-43", "text": "----------------------------------" }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-44", "text": "**TRAINING WITH LIMITED MEMORY**" }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-45", "text": "When the memory for training word embedding is also limited, we need to modify the training algorithms by introducing new data structures to reduce the bits used to encode the values." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-46", "text": "In practice, we found that in the stochastic gradient descent (SGD) iteration in word2vec algorithms (Mikolov et al., 2013a; Mikolov et al., 2013b) , the updating vector's values are often very small numbers (e.g., < 10 \u22125 )." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-47", "text": "In this case, if we directly apply the rounding method to certain precisions (e.g., 8 bits), the update of word vectors will always be zero." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-48", "text": "For example, the 8-bit precision is 2 \u22127 = 0.0078, so 10 \u22125 is not significant enough to update the vector with 8-bit values." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-49", "text": "Therefore, we consider the following two ways to improve this." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-50", "text": "Stochastic Rounding." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-51", "text": "We first consider using stochastic rounding (Gupta et al., 2015) to train word embedding." 
}, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-52", "text": "Stochastic rounding introduces some randomness into the rounding mechanism, which has been proven to be helpful when there are many parameters in the learning system, such as deep learning systems (Gupta et al., 2015) ." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-53", "text": "Here we also introduce this approach to update word embedding vectors in SGD." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-54", "text": "The probability of rounding x to x is proportional to the proximity of x to x :" }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-82", "text": "**ANALYSIS OF BITS NEEDED**" }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-55", "text": "In this case, even though the update values are not significant enough to update the word embedding vectors, we randomly choose some of the values being updated proportional to the value of how close the update value is to the rounding precision." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-56", "text": "Auxiliary Update Vectors." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-57", "text": "In addition to the method of directly applying rounding to the values, we also provide a method using auxiliary update vectors to trade precision for more space." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-58", "text": "Suppose we know the range of update value in S-GD as [\u2212r , r ], and we use additional m bits to store all the values less than the limited numerical precision ." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-59", "text": "Here r can be easily estimated by running SGD for several examples." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-60", "text": "Then the real precision is = 2 1\u2212m r ." 
}, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-61", "text": "For example, if r = 10 \u22124 and m = 8, then the numerical precision is 7.8 \u00b7 10 \u22127 which can capture much higher precision than the SGD update values have." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-62", "text": "When the cumulated values in the auxiliary update vectors are greater than the original numerical precision , e.g., = 2 \u22127 for 8 bits, we update the original vector and clear the value in the auxiliary vector." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-63", "text": "In this case, we can have final n-bit values in word embedding vectors as good as the method presented in Section 3.1." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-64", "text": "----------------------------------" }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-65", "text": "**EXPERIMENTS ON WORD/PHRASE SIMILARITY**" }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-66", "text": "In this section, we describe a comprehensive study on tasks that have been used for evaluating word embeddings." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-67", "text": "We train the word embedding algorithms, word2vec (Mikolov et al., 2013a; Mikolov et al., 2013b) , based on the Oct. 2013 Wikipedia dump." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-68", "text": "1 We first compare levels of truncation of word2vec embeddings, and then evaluate the stochastic rounding and the auxiliary vectors based methods for training word2vec vectors." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-69", "text": "----------------------------------" }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-70", "text": "**DATASETS**" }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-71", "text": "We use multiple test datasets as follows." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-72", "text": "Word Similarity." 
}, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-73", "text": "Word similarity datasets have been widely used to evaluate word embedding results." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-74", "text": "We use the datasets summarized by Faruqui and Dyer (Faruqui and Dyer, 2014) : wordsim-353, wordsim-sim, wordsim-rel, MC-30, RG-65, MTurk-287, MTurk-771, MEN 3000, YP-130, Rare-Word, Verb-143, and SimLex-999." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-75", "text": "2 We compute the similarities between pairs of words and check the Spearman's rank correlation coefficient (Myers and Well., 1995) between the computer and the human labeled ranks." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-76", "text": "Paraphrases (bigrams)." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-77", "text": "We use the paraphrase (bigram) datasets used in (Wieting et al., 2015) , ppdb all, bigrams vn, bigrams nn, and bigrams jnn, to test whether the truncation affects phrase level embedding." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-78", "text": "Our phrase level embedding is based on the average of the words inside each phrase." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-79", "text": "Note that it is also easy to incorporate our truncation methods into existing phrase embedding algorithms." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-80", "text": "We follow (Wieting et al., 2015) in using cosine similarity to evaluate the correlation between the computed similarity and annotated similarity between paraphrases." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-81", "text": "----------------------------------" }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-83", "text": "We ran both CBOW and skipgram with negative sampling (Mikolov et al., 2013a; Mikolov et al., 2013b) on the Wikipedia dump data, and set the window size of context to be five." 
}, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-84", "text": "Then we performed value truncation with 4 bits, 6 bits, and 8 bits." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-85", "text": "The results are shown in Fig. 1 , and the numbers of the averaged results are shown in Table 1 ." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-86", "text": "We also used the binarization algorithm (Guo et al., 2014) to truncate each dimension to three values; these experiments are is denoted using the suffix \"binary\" in the figure." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-87", "text": "For both CBOW and skipgram models, we train the vectors with 25 and 200 dimensions respectively." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-88", "text": "The representations used in our experiments were trained using the whole Wikipedia dump." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-89", "text": "A first observation is that, in general, CBOW performs better than the skipgram model." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-90", "text": "When using the truncation method, the memory required to store the embedding is significantly reduced, while the performance on the test datasets remains almost the same until we truncate down to 4 bits." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-91", "text": "When comparing CBOW and skipgram models, we again see that the drop in performance with 4-bit values for the skipgram model is greater than the one for the CBOW model." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-92", "text": "For the CBOW model, the drop in performance with 4-bit values is greater when using 200 dimensions than it is when using 25 dimensions." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-93", "text": "However, when using skipgram, this drop is slightly greater when using 25 dimensions than 200." 
}, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-94", "text": "We also evaluated the binarization approach (Guo et al., 2014) ." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-95", "text": "This model uses three values, represented using two bits." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-96", "text": "We observe that, when the dimension is 25, the binarization is worse than truncation." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-97", "text": "One possible explanation has to do merely with the size of the space; while 3 25 is much larger than the size of the word space, it does not provide enough redundancy to exploit similarity as needed in the tasks." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-98", "text": "Consequently, the binarization approach results in worse performance." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-99", "text": "However, when the dimension is 200, this approach works much better, and outperforms the 4-bit truncation." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-100", "text": "In particular, binarization works better for skipgram than for CBOW with 200 dimensions." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-101", "text": "One possible explanation is that the binarization algorithm computes, for each dimension of the word vectors, the positive and negative means of the values and uses it to split the original values in that dimension, thus behaving like a model that clusters the values in each dimension." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-102", "text": "The success of the binarization then indicates that skipgram embeddings might be more discriminative than CBOW embeddings." 
}, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-103", "text": "----------------------------------" }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-104", "text": "**COMPARING TRAINING METHODS**" }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-105", "text": "We compare the training methods for the CBOW model in Table 2 ." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-106", "text": "For stochastic rounding, we scale the probability of rounding up to make sure that small gradient values will still update the values." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-107", "text": "Both stochastic rounding and truncation with auxiliary update vectors (shown in Sec. 3.2) require 16 bits for each value in the training phase." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-108", "text": "Truncation with auxiliary update vectors finally produces 8-bit-value based vectors while stochastic rounding produces 16-bit-value based vectors." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-109", "text": "Even though our auxiliary update algorithm uses smaller memory/disk to store vectors, its performance is still better than that of stochastic rounding." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-110", "text": "This is simply because the update values in SGD are too small to allow the stochastic round-" }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-111", "text": "----------------------------------" }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-112", "text": "**EXPERIMENTS ON DEPENDENCY PARSING**" }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-113", "text": "We also incorporate word embedding results into a downstream task, dependency parsing, to evaluate whether the truncated embedding results are still good features compared to the original features." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-114", "text": "We follow the setup of (Guo et al., 2015) in a monolingual setting 3 ." 
}, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-115", "text": "We train the parser with 5,000 iterations using different truncation settings for word2vec embedding." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-116", "text": "The data used to train and evaluate the parser is the English data in the CoNLL-X shared task (Buchholz and Marsi, 2006) ." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-117", "text": "We follow (Guo et al., 2015) Table 3 for dependency parsing are consistent with word similarity and paraphrasing." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-118", "text": "First, we see that binarization for CBOW and skipgram is again better than the truncation approach." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-119", "text": "Second, for truncation results, more bits leads to better results." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-120", "text": "With 8-bits, we can again obtain results similar to those obtained 3 https://github.com/jiangfeng1124/ acl15-clnndep" }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-121", "text": "----------------------------------" }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-122", "text": "**CONCLUSION**" }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-123", "text": "We systematically evaluated how small can the representation size of dense word embedding be before it starts to impact the performance of NLP tasks that use them." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-124", "text": "We considered both the final size of the size we provide it while learning it." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-125", "text": "Our study considers both the CBOW and the skipgram models at 25 and 200 dimensions and showed that 8 bits per dimension (and sometimes even less) are sufficient to represent each value and maintain performance on a range of lexical tasks." 
}, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-126", "text": "We also provided two ways to train the embeddings with reduced memory use." }, { "sent_id": "a7e49bec53a2bfd7795b9c770f5d0c-C001-127", "text": "The natural future step is to extend these experiments and study the impact of the representation size on more advanced tasks." } ], "y": { "@BACK@": { "gold_contexts": [ [ "a7e49bec53a2bfd7795b9c770f5d0c-C001-7", "a7e49bec53a2bfd7795b9c770f5d0c-C001-8" ], [ "a7e49bec53a2bfd7795b9c770f5d0c-C001-24", "a7e49bec53a2bfd7795b9c770f5d0c-C001-25" ] ], "cite_sentences": [ "a7e49bec53a2bfd7795b9c770f5d0c-C001-7", "a7e49bec53a2bfd7795b9c770f5d0c-C001-25" ] }, "@EXT@": { "gold_contexts": [ [ "a7e49bec53a2bfd7795b9c770f5d0c-C001-17" ] ], "cite_sentences": [ "a7e49bec53a2bfd7795b9c770f5d0c-C001-17" ] }, "@MOT@": { "gold_contexts": [ [ "a7e49bec53a2bfd7795b9c770f5d0c-C001-24", "a7e49bec53a2bfd7795b9c770f5d0c-C001-25" ] ], "cite_sentences": [ "a7e49bec53a2bfd7795b9c770f5d0c-C001-25" ] }, "@UNSURE@": { "gold_contexts": [ [ "a7e49bec53a2bfd7795b9c770f5d0c-C001-46" ] ], "cite_sentences": [ "a7e49bec53a2bfd7795b9c770f5d0c-C001-46" ] }, "@USE@": { "gold_contexts": [ [ "a7e49bec53a2bfd7795b9c770f5d0c-C001-67" ], [ "a7e49bec53a2bfd7795b9c770f5d0c-C001-83" ] ], "cite_sentences": [ "a7e49bec53a2bfd7795b9c770f5d0c-C001-67", "a7e49bec53a2bfd7795b9c770f5d0c-C001-83" ] } } }, "ABC_021e5dbe22bf0f4ebda4d37040d0a6_31": { "x": [ { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-2", "text": "We present a new collection of treebanks with homogeneous syntactic dependency annotation for six languages: German, English, Swedish, Spanish, French and Korean." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-3", "text": "To show the usefulness of such a resource, we present a case study of crosslingual transfer parsing with more reliable evaluation than has been possible before." 
}, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-4", "text": "This 'universal' treebank is made freely available in order to facilitate research on multilingual dependency parsing." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-5", "text": "1" }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-6", "text": "----------------------------------" }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-8", "text": "In recent years, syntactic representations based on head-modifier dependency relations between words have attracted a lot of interest (K\u00fcbler et al., 2009) ." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-9", "text": "Research in dependency parsing -computational methods to predict such representations -has increased dramatically, due in large part to the availability of dependency treebanks in a number of languages." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-10", "text": "In particular, the CoNLL shared tasks on dependency parsing have provided over twenty data sets in a standardized format (Buchholz and Marsi, 2006; ." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-11", "text": "While these data sets are standardized in terms of their formal representation, they are still heterogeneous treebanks." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-12", "text": "That is to say, despite them all being dependency treebanks, which annotate each sentence with a dependency tree, they subscribe to different annotation schemes." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-13", "text": "This can include superficial differences, such as the renaming of common relations, as well as true divergences concerning the analysis of linguistic constructions." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-14", "text": "Common divergences are found in the 1 Downloadable at https://code.google.com/p/uni-dep-tb/." 
}, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-15", "text": "analysis of coordination, verb groups, subordinate clauses, and multi-word expressions K\u00fcbler et al., 2009; Zeman et al., 2012) ." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-16", "text": "These data sets can be sufficient if one's goal is to build monolingual parsers and evaluate their quality without reference to other languages, as in the original CoNLL shared tasks, but there are many cases where heterogenous treebanks are less than adequate." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-17", "text": "First, a homogeneous representation is critical for multilingual language technologies that require consistent cross-lingual analysis for downstream components." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-18", "text": "Second, consistent syntactic representations are desirable in the evaluation of unsupervised (Klein and Manning, 2004) or cross-lingual syntactic parsers (Hwa et al., 2005) ." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-19", "text": "In the cross-lingual study of McDonald et al. (2011) , where delexicalized parsing models from a number of source languages were evaluated on a set of target languages, it was observed that the best target language was frequently not the closest typologically to the source." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-20", "text": "In one stunning example, Danish was the worst source language when parsing Swedish, solely due to greatly divergent annotation schemes." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-21", "text": "In order to overcome these difficulties, some cross-lingual studies have resorted to heuristics to homogenize treebanks (Hwa et al., 2005; Smith and Eisner, 2009; Ganchev et al., 2009 ), but we are only aware of a few systematic attempts to create homogenous syntactic dependency annotation in multiple languages." 
}, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-22", "text": "In terms of automatic construction, Zeman et al. (2012) attempt to harmonize a large number of dependency treebanks by mapping their annotation to a version of the Prague Dependency Treebank scheme (Haji\u010d et al., 2001; B\u00f6hmov\u00e1 et al., 2003) ." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-23", "text": "Additionally, there have been efforts to manually or semimanually construct resources with common syn-tactic analyses across multiple languages using alternate syntactic theories as the basis for the representation (Butt et al., 2002; Helmreich et al., 2004; Hovy et al., 2006; Erjavec, 2012) ." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-24", "text": "In order to facilitate research on multilingual syntactic analysis, we present a collection of data sets with uniformly analyzed sentences for six languages: German, English, French, Korean, Spanish and Swedish." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-25", "text": "This resource is freely available and we plan to extend it to include more data and languages." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-26", "text": "In the context of part-of-speech tagging, universal representations, such as that of Petrov et al. (2012) , have already spurred numerous examples of improved empirical cross-lingual systems (Zhang et al., 2012; Gelling et al., 2012; T\u00e4ckstr\u00f6m et al., 2013) ." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-27", "text": "We aim to do the same for syntactic dependencies and present cross-lingual parsing experiments to highlight some of the benefits of cross-lingually consistent annotation." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-28", "text": "First, results largely conform to our expectations of which target languages should be useful for which source languages, unlike in the study of McDonald et al. (2011) ." 
}, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-29", "text": "Second, the evaluation scores in general are significantly higher than previous cross-lingual studies, suggesting that most of these studies underestimate true accuracy." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-30", "text": "Finally, unlike all previous cross-lingual studies, we can report full labeled accuracies and not just unlabeled structural accuracies." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-31", "text": "----------------------------------" }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-32", "text": "**TOWARDS A UNIVERSAL TREEBANK**" }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-33", "text": "The Stanford typed dependencies for English (De Marneffe et al., 2006; de Marneffe and Manning, 2008) serve as the point of departure for our 'universal' dependency representation, together with the tag set of Petrov et al. (2012) as the underlying part-of-speech representation." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-34", "text": "The Stanford scheme, partly inspired by the LFG framework, has emerged as a de facto standard for dependency annotation in English and has recently been adapted to several languages representing different (and typologically diverse) language groups, such as Chinese (Sino-Tibetan) (Chang et al., 2009 ), Finnish (Finno-Ugric) (Haverinen et al., 2010) , Persian (Indo-Iranian) (Seraji et al., 2012) , and Modern Hebrew (Semitic) (Tsarfaty, 2013) ." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-35", "text": "Its widespread use and proven adaptability makes it a natural choice for our endeavor, even though additional modifications will be needed to capture the full variety of grammatical structures in the world's languages." 
}, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-36", "text": "We use the so-called basic dependencies (with punctuation included), where every dependency structure is a tree spanning all the input tokens, because this is the kind of representation that most available dependency parsers require." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-37", "text": "A sample dependency tree from the French data set is shown in Figure 1 ." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-38", "text": "We take two approaches to generating data." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-39", "text": "The first is traditional manual annotation, as previously used by Helmreich et al. (2004) for multilingual syntactic treebank construction." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-40", "text": "The second, used only for English and Swedish, is to automatically convert existing treebanks, as in Zeman et al. (2012) ." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-41", "text": "----------------------------------" }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-42", "text": "**AUTOMATIC CONVERSION**" }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-43", "text": "Since the Stanford dependencies for English are taken as the starting point for our universal annotation scheme, we begin by describing the data sets produced by automatic conversion." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-44", "text": "For English, we used the Stanford parser (v1.6.8) (Klein and Manning, 2003) to convert the Wall Street Journal section of the Penn Treebank (Marcus et al., 1993) to basic dependency trees, including punctuation and with the copula verb as head in copula constructions." 
}, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-45", "text": "For Swedish, we developed a set of deterministic rules for converting the Talbanken part of the Swedish Treebank (Nivre and Megyesi, 2007) to a representation as close as possible to the Stanford dependencies for English." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-46", "text": "This mainly consisted in relabeling dependency relations and, due to the fine-grained label set used in the Swedish Treebank (Teleman, 1974) , this could be done with high precision." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-47", "text": "In addition, a small number of constructions required structural conversion, notably coordination, which in the Swedish Treebank is given a Prague style analysis ." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-48", "text": "For both English and Swedish, we mapped the language-specific partof-speech tags to universal tags using the mappings of Petrov et al. (2012) ." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-49", "text": "----------------------------------" }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-50", "text": "**MANUAL ANNOTATION**" }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-51", "text": "For the remaining four languages, annotators were given three resources: 1) the English Stanford guidelines; 2) a set of English sentences with Stanford dependencies and universal tags (as above); and 3) a large collection of unlabeled sentences randomly drawn from newswire, weblogs and/or consumer reviews, automatically tokenized with a rule-based system." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-52", "text": "For German, French and Spanish, contractions were split, except in the case of clitics." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-53", "text": "For Korean, tokenization was more coarse and included particles within token units." 
}, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-54", "text": "Annotators could correct this automatic tokenization." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-55", "text": "The annotators were then tasked with producing language-specific annotation guidelines with the expressed goal of keeping the label and construction set as close as possible to the original English set, only adding labels for phenomena that do not exist in English." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-56", "text": "Making fine-grained label distinctions was discouraged." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-57", "text": "Once these guidelines were fixed, annotators selected roughly an equal amount of sentences to be annotated from each domain in the unlabeled data." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-58", "text": "As the sentences were already randomly selected from a larger corpus, annotators were told to view the sentences in order and to discard a sentence only if it was 1) fragmented because of a sentence splitting error; 2) not from the language of interest; 3) incomprehensible to a native speaker; or 4) shorter than three words." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-59", "text": "The selected sentences were pre-processed using cross-lingual taggers (Das and Petrov, 2011) and parsers (McDonald et al., 2011) ." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-60", "text": "The annotators modified the pre-parsed trees using the TrEd 2 tool." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-61", "text": "At the beginning of the annotation process, double-blind annotation, followed by manual arbitration and consensus, was used iteratively for small batches of data until the guidelines were finalized." 
}, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-62", "text": "Most of the data was annotated using single-annotation and full review: one annotator annotating the data and another reviewing it, making changes in close collaboration with the original annotator." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-63", "text": "As a final step, all annotated data was semi-automatically checked for annotation consistency." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-64", "text": "----------------------------------" }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-65", "text": "**HARMONIZATION**" }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-66", "text": "After producing the two converted and four annotated data sets, we performed a harmonization step, where the goal was to maximize consistency of annotation across languages." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-67", "text": "In particular, we wanted to eliminate cases where the same label was used for different linguistic relations in different languages and, conversely, where one and the same relation was annotated with different labels, both of which could happen accidentally because annotators were allowed to add new labels for the language they were working on." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-68", "text": "Moreover, we wanted to avoid, as far as possible, labels that were only used in one or two languages." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-69", "text": "In order to satisfy these requirements, a number of language-specific labels were merged into more general labels." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-70", "text": "For example, in analogy with the nn label for (element of a) noun-noun compound, the annotators of German added aa for compound adjectives, and the annotators of Korean added vv for compound verbs." 
}, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-71", "text": "In the harmonization step, these three labels were merged into a single label compmod for modifier in compound." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-72", "text": "In addition to harmonizing language-specific labels, we also renamed a small number of relations, where the name would be misleading in the universal context (although quite appropriate for English)." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-73", "text": "For example, the label prep (for a modifier headed by a preposition) was renamed adpmod, to make clear the relation to other modifier labels and to allow postpositions as well as prepositions." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-74", "text": "3 We also eliminated a few distinctions in the original Stanford scheme that were not annotated consistently across languages (e.g., merging complm with mark, number with num, and purpcl with advcl)." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-75", "text": "The final set of labels is listed with explanations in Table 1 . Note that relative to the universal partof-speech tagset of Petrov et al. (2012) our final label set is quite rich (40 versus 12)." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-76", "text": "This is due mainly to the fact that the the former is based on deterministic mappings from a large set of annotation schemes and therefore reduced to the granularity of the greatest common denominator." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-77", "text": "Such a reduction may ultimately be necessary also in the case of dependency relations, but since most of our data sets were created through manual annotation, we could afford to retain a fine-grained analysis, knowing that it is always possible to map from finer to coarser distinctions, but not vice versa." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-78", "text": "4 due to the source and tokenization." 
}, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-79", "text": "For example, Korean has 50% more sentences than Spanish, but \u223c40k less tokens due to a more coarse-grained tokenization." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-80", "text": "In addition to the data itself, annotation guidelines and harmonization rules are included so that the data can be regenerated." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-81", "text": "----------------------------------" }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-82", "text": "**FINAL DATA SETS**" }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-83", "text": "----------------------------------" }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-84", "text": "**EXPERIMENTS**" }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-85", "text": "One of the motivating factors in creating such a data set was improved cross-lingual transfer evaluation." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-86", "text": "To test this, we use a cross-lingual transfer parser similar to that of McDonald et al. (2011) ." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-87", "text": "In particular, it is a perceptron-trained shift-reduce parser with a beam of size 8." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-88", "text": "We use the features of Zhang and Nivre (2011) , except that all lexical identities are dropped from the templates during training and testing, hence inducing a 'delexicalized' model that employs only 'universal' properties from source-side treebanks, such as part-ofspeech tags, labels, head-modifier distance, etc." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-89", "text": "We ran a number of experiments, which can be seen in Table 3 ." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-90", "text": "For these experiments we randomly split each data set into training, development and testing sets." 
}, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-91", "text": "5 The one exception is English, where we used the standard splits." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-92", "text": "Each row in Table 3 represents a source training language and each column a target evaluation language." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-93", "text": "We report both unlabeled attachment score (UAS) and labeled attachment score (LAS) (Buchholz and Marsi, 2006) ." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-94", "text": "This is likely the first reliable cross-lingual parsing evaluation." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-95", "text": "In particular, previous studies could not even report LAS due to differences in treebank annotations." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-96", "text": "We can make several interesting observations." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-97", "text": "Most notably, for the Germanic and Romance target languages, the best source language is from the same language group." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-98", "text": "This is in stark contrast to the results of McDonald et al. (2011) , who observe that this is rarely the case with the heterogenous CoNLL treebanks." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-99", "text": "Among the Germanic languages, it is interesting to note that Swedish is the best source language for both German and English, which makes sense from a typological point of view, because Swedish is intermediate between German and English in terms of word order properties." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-100", "text": "For Romance languages, the crosslingual parser is approaching the accuracy of the supervised setting, confirming that for these languages much of the divergence is lexical and not structural, which is not true for the Germanic languages." 
}, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-101", "text": "Finally, Korean emerges as a very clear outlier (both as a source and as a target language), which again is supported by typological considerations as well as by the difference in tokenization." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-102", "text": "With respect to evaluation, it is interesting to compare the absolute numbers to those reported in McDonald et al. (2011)" }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-103", "text": "----------------------------------" }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-104", "text": "**CONCLUSION**" }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-105", "text": "We have released data sets for six languages with consistent dependency annotation." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-106", "text": "After the initial release, we will continue to annotate data in more languages as well as investigate further automatic treebank conversions." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-107", "text": "This may also lead to modifications of the annotation scheme, which should be regarded as preliminary at this point." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-108", "text": "Specifically, with more typologically and morphologically diverse languages being added to the collection, it may be advisable to consistently enforce the principle that content words take function words as dependents, which is currently violated in the analysis of adpositional and copula constructions." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-109", "text": "This will ensure a consistent analysis of functional elements that in some languages are not realized as free words or are not obligatory, such as adpositions which are often absent due to case inflections in languages like Finnish." 
}, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-110", "text": "It will also allow the inclusion of language-specific functional or morphological markers (case markers, topic markers, classifiers, etc.) at the leaves of the tree, where they can easily be ignored in applications that require a uniform cross-lingual representation." }, { "sent_id": "021e5dbe22bf0f4ebda4d37040d0a6-C001-111", "text": "Finally, this data is available on an open source repository in the hope that the community will commit new data and make corrections to existing annotations." } ], "y": { "@BACK@": { "gold_contexts": [ [ "021e5dbe22bf0f4ebda4d37040d0a6-C001-19", "021e5dbe22bf0f4ebda4d37040d0a6-C001-20" ], [ "021e5dbe22bf0f4ebda4d37040d0a6-C001-27", "021e5dbe22bf0f4ebda4d37040d0a6-C001-28" ], [ "021e5dbe22bf0f4ebda4d37040d0a6-C001-85", "021e5dbe22bf0f4ebda4d37040d0a6-C001-86", "021e5dbe22bf0f4ebda4d37040d0a6-C001-87" ], [ "021e5dbe22bf0f4ebda4d37040d0a6-C001-96", "021e5dbe22bf0f4ebda4d37040d0a6-C001-97", "021e5dbe22bf0f4ebda4d37040d0a6-C001-98" ], [ "021e5dbe22bf0f4ebda4d37040d0a6-C001-102" ] ], "cite_sentences": [ "021e5dbe22bf0f4ebda4d37040d0a6-C001-19", "021e5dbe22bf0f4ebda4d37040d0a6-C001-28", "021e5dbe22bf0f4ebda4d37040d0a6-C001-86", "021e5dbe22bf0f4ebda4d37040d0a6-C001-98", "021e5dbe22bf0f4ebda4d37040d0a6-C001-102" ] }, "@MOT@": { "gold_contexts": [ [ "021e5dbe22bf0f4ebda4d37040d0a6-C001-19", "021e5dbe22bf0f4ebda4d37040d0a6-C001-20" ] ], "cite_sentences": [ "021e5dbe22bf0f4ebda4d37040d0a6-C001-19" ] }, "@DIF@": { "gold_contexts": [ [ "021e5dbe22bf0f4ebda4d37040d0a6-C001-27", "021e5dbe22bf0f4ebda4d37040d0a6-C001-28" ], [ "021e5dbe22bf0f4ebda4d37040d0a6-C001-96", "021e5dbe22bf0f4ebda4d37040d0a6-C001-97", "021e5dbe22bf0f4ebda4d37040d0a6-C001-98" ] ], "cite_sentences": [ "021e5dbe22bf0f4ebda4d37040d0a6-C001-28", "021e5dbe22bf0f4ebda4d37040d0a6-C001-98" ] }, "@USE@": { "gold_contexts": [ [ "021e5dbe22bf0f4ebda4d37040d0a6-C001-59" ] ], "cite_sentences": [ 
"021e5dbe22bf0f4ebda4d37040d0a6-C001-59" ] }, "@EXT@": { "gold_contexts": [ [ "021e5dbe22bf0f4ebda4d37040d0a6-C001-85", "021e5dbe22bf0f4ebda4d37040d0a6-C001-86", "021e5dbe22bf0f4ebda4d37040d0a6-C001-87" ] ], "cite_sentences": [ "021e5dbe22bf0f4ebda4d37040d0a6-C001-86" ] } } }, "ABC_4b08797852539f464793dd23cf8999_31": { "x": [ { "sent_id": "4b08797852539f464793dd23cf8999-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-2", "text": "This paper studies the effect of limited precision data representation and computation on word embeddings." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-3", "text": "We present a systematic evaluation of word embeddings with limited memory and discuss methods that directly train the limited precision representation with limited memory." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-4", "text": "Our results show that it is possible to use and train an 8-bit fixed-point value for word embedding without loss of performance in word/phrase similarity and dependency parsing tasks." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-5", "text": "----------------------------------" }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-7", "text": "There is an accumulation of evidence that the use of dense distributional lexical representations, known as word embeddings, often supports better performance on a range of NLP tasks (Bengio et al., 2003; Turian et al., 2010; Collobert et al., 2011; Mikolov et al., 2013a; Mikolov et al., 2013b; Levy et al., 2015) ." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-8", "text": "Consequently, word embeddings have been commonly used in the last few years for lexical similarity tasks and as features in multiple, syntactic and semantic, NLP applications." 
}, { "sent_id": "4b08797852539f464793dd23cf8999-C001-9", "text": "However, keeping embedding vectors for hundreds of thousands of words for repeated use could take its toll both on storing the word vectors on disk and, even more so, on loading them into memory." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-10", "text": "For example, for 1 million words, loading 200 dimensional vectors takes up to 1.6 GB memory on a 64-bit system." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-11", "text": "Considering applications that make use of billions of tokens and multiple languages, size issues impose significant limitations on the practical use of word embeddings." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-12", "text": "This paper presents the question of whether it is possible to significantly reduce the memory needs for the use and training of word embeddings." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-13", "text": "Specifically, we ask \"what is the impact of representing each dimension of a dense representation with significantly fewer bits than the standard 64 bits?\" Moreover, we investigate the possibility of directly training dense embedding vectors using significantly fewer bits than typically used." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-14", "text": "The results we present are quite surprising." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-15", "text": "We show that it is possible to reduce the memory consumption by an order of magnitude both when word embeddings are being used and in training." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-16", "text": "In the first case, as we show, simply truncating the resulting representations after training and using a smaller number of bits (as low as 4 bits per dimension) results in comparable performance to the use of 64 bits." 
}, { "sent_id": "4b08797852539f464793dd23cf8999-C001-17", "text": "Moreover, we provide two ways to train existing algorithms (Mikolov et al., 2013a; Mikolov et al., 2013b ) when the memory is limited during training and show that, here, too, an order of magnitude saving in memory is possible without degrading performance." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-18", "text": "We conduct comprehensive experiments on existing word and phrase similarity and relatedness datasets as well as on dependency parsing, to evaluate these results." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-19", "text": "Our experiments show that, in all cases and without loss in performance, 8 bits can be used when the current standard is 64 and, in some cases, only 4 bits per dimension are sufficient, reducing the amount of space required by a factor of 16." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-20", "text": "The truncated word embeddings are available from the papers web page at https://cogcomp.cs.illinois." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-21", "text": "edu/page/publication_view/790." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-22", "text": "----------------------------------" }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-23", "text": "**RELATED WORK**" }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-24", "text": "If we consider traditional cluster encoded word representation, e.g., Brown clusters (Brown et al., 1992) , it only uses a small number of bits to track the path on a hierarchical tree of word clusters to represent each word." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-25", "text": "In fact, word embedding generalized the idea of discrete clustering representation to continuous vector representation in language models, with the goal of improving the continuous word analogy prediction and generalization ability (Bengio et al., 2003; Mikolov et al., 2013a; Mikolov et al., 2013b) ." 
}, { "sent_id": "4b08797852539f464793dd23cf8999-C001-26", "text": "However, it has been proven that Brown clusters as discrete features are even better than continuous word embedding as features for named entity recognition tasks (Ratinov and Roth, 2009 )." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-27", "text": "Guo et al. (Guo et al., 2014) further tried to binarize embeddings using a threshold tuned for each dimension, and essentially used less than two bits to represent each dimension." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-28", "text": "They have shown that binarization can be comparable to or even better than the original word embeddings when used as features for named entity recognition tasks." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-29", "text": "Moreover, Faruqui et al. (Faruqui et al., 2015) showed that imposing sparsity constraints over the embedding vectors can further improve the representation interpretability and performance on several word similarity and text classification benchmark datasets." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-30", "text": "These works indicate that, for some tasks, we do not need all the information encoded in \"standard\" word embeddings." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-31", "text": "Nonetheless, it is clear that binarization loses a lot of information, and this calls for a systematic comparison of how many bits are needed to maintain the expressivity needed from word embeddings for different tasks." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-32", "text": "----------------------------------" }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-33", "text": "**VALUE TRUNCATION**" }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-34", "text": "In this section, we introduce approaches for word embedding when the memory is limited." 
}, { "sent_id": "4b08797852539f464793dd23cf8999-C001-35", "text": "We truncate any value x in the word embedding into an n bit representation." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-36", "text": "----------------------------------" }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-37", "text": "**POST-PROCESSING ROUNDING**" }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-38", "text": "When the word embedding vectors are given, the most intuitive and simple way is to round the numbers to their n-bit precision." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-39", "text": "Then we can use the truncated values as features for any tasks that word embedding can be used for." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-40", "text": "For example, if we want to round x to be in the range of [\u2212r, r], a simple function can be applied as follows." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-41", "text": "(1) where = 2 1\u2212n r. For example, if we want to use 8 bits to represent any value in the vectors, then we only have 256 numbers ranging from -128 to 127 for each value." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-42", "text": "In practice, we first scale all the values and then round them to the 256 numbers." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-43", "text": "----------------------------------" }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-44", "text": "**TRAINING WITH LIMITED MEMORY**" }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-45", "text": "When the memory for training word embedding is also limited, we need to modify the training algorithms by introducing new data structures to reduce the bits used to encode the values." 
}, { "sent_id": "4b08797852539f464793dd23cf8999-C001-46", "text": "In practice, we found that in the stochastic gradient descent (SGD) iteration in word2vec algorithms (Mikolov et al., 2013a; Mikolov et al., 2013b) , the updating vector's values are often very small numbers (e.g., < 10 \u22125 )." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-47", "text": "In this case, if we directly apply the rounding method to certain precisions (e.g., 8 bits), the update of word vectors will always be zero." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-48", "text": "For example, the 8-bit precision is 2 \u22127 = 0.0078, so 10 \u22125 is not significant enough to update the vector with 8-bit values." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-49", "text": "Therefore, we consider the following two ways to improve this." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-50", "text": "Stochastic Rounding." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-51", "text": "We first consider using stochastic rounding (Gupta et al., 2015) to train word embedding." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-52", "text": "Stochastic rounding introduces some randomness into the rounding mechanism, which has been proven to be helpful when there are many parameters in the learning system, such as deep learning systems (Gupta et al., 2015) ." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-53", "text": "Here we also introduce this approach to update word embedding vectors in SGD." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-54", "text": "The probability of rounding x to x is proportional to the proximity of x to x :" }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-55", "text": "In this case, even though the update values are not significant enough to update the word embedding vectors, we randomly choose some of the values being updated proportional to the value of how close the update value is to the rounding precision." 
}, { "sent_id": "4b08797852539f464793dd23cf8999-C001-56", "text": "Auxiliary Update Vectors." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-57", "text": "In addition to the method of directly applying rounding to the values, we also provide a method using auxiliary update vectors to trade precision for more space." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-58", "text": "Suppose we know the range of update value in S-GD as [\u2212r , r ], and we use additional m bits to store all the values less than the limited numerical precision ." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-59", "text": "Here r can be easily estimated by running SGD for several examples." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-60", "text": "Then the real precision is = 2 1\u2212m r ." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-61", "text": "For example, if r = 10 \u22124 and m = 8, then the numerical precision is 7.8 \u00b7 10 \u22127 which can capture much higher precision than the SGD update values have." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-62", "text": "When the cumulated values in the auxiliary update vectors are greater than the original numerical precision , e.g., = 2 \u22127 for 8 bits, we update the original vector and clear the value in the auxiliary vector." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-63", "text": "In this case, we can have final n-bit values in word embedding vectors as good as the method presented in Section 3.1." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-64", "text": "----------------------------------" }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-65", "text": "**EXPERIMENTS ON WORD/PHRASE SIMILARITY**" }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-66", "text": "In this section, we describe a comprehensive study on tasks that have been used for evaluating word embeddings." 
}, { "sent_id": "4b08797852539f464793dd23cf8999-C001-67", "text": "We train the word embedding algorithms, word2vec (Mikolov et al., 2013a; Mikolov et al., 2013b) , based on the Oct. 2013 Wikipedia dump." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-68", "text": "1 We first compare levels of truncation of word2vec embeddings, and then evaluate the stochastic rounding and the auxiliary vectors based methods for training word2vec vectors." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-69", "text": "----------------------------------" }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-70", "text": "**DATASETS**" }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-71", "text": "We use multiple test datasets as follows." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-72", "text": "Word Similarity." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-73", "text": "Word similarity datasets have been widely used to evaluate word embedding results." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-74", "text": "We use the datasets summarized by Faruqui and Dyer (Faruqui and Dyer, 2014) : wordsim-353, wordsim-sim, wordsim-rel, MC-30, RG-65, MTurk-287, MTurk-771, MEN 3000, YP-130, Rare-Word, Verb-143, and SimLex-999." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-75", "text": "2 We compute the similarities between pairs of words and check the Spearman's rank correlation coefficient (Myers and Well., 1995) between the computer and the human labeled ranks." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-76", "text": "Paraphrases (bigrams)." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-77", "text": "We use the paraphrase (bigram) datasets used in (Wieting et al., 2015) , ppdb all, bigrams vn, bigrams nn, and bigrams jnn, to test whether the truncation affects phrase level embedding." 
}, { "sent_id": "4b08797852539f464793dd23cf8999-C001-78", "text": "Our phrase level embedding is based on the average of the words inside each phrase." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-79", "text": "Note that it is also easy to incorporate our truncation methods into existing phrase embedding algorithms." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-80", "text": "We follow (Wieting et al., 2015) in using cosine similarity to evaluate the correlation between the computed similarity and annotated similarity between paraphrases." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-81", "text": "----------------------------------" }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-82", "text": "**ANALYSIS OF BITS NEEDED**" }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-83", "text": "We ran both CBOW and skipgram with negative sampling (Mikolov et al., 2013a; Mikolov et al., 2013b) on the Wikipedia dump data, and set the window size of context to be five." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-84", "text": "Then we performed value truncation with 4 bits, 6 bits, and 8 bits." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-85", "text": "The results are shown in Fig. 1 , and the numbers of the averaged results are shown in Table 1 ." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-86", "text": "We also used the binarization algorithm (Guo et al., 2014) to truncate each dimension to three values; these experiments are is denoted using the suffix \"binary\" in the figure." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-87", "text": "For both CBOW and skipgram models, we train the vectors with 25 and 200 dimensions respectively." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-88", "text": "The representations used in our experiments were trained using the whole Wikipedia dump." 
}, { "sent_id": "4b08797852539f464793dd23cf8999-C001-89", "text": "A first observation is that, in general, CBOW performs better than the skipgram model." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-90", "text": "When using the truncation method, the memory required to store the embedding is significantly reduced, while the performance on the test datasets remains almost the same until we truncate down to 4 bits." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-91", "text": "When comparing CBOW and skipgram models, we again see that the drop in performance with 4-bit values for the skipgram model is greater than the one for the CBOW model." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-92", "text": "For the CBOW model, the drop in performance with 4-bit values is greater when using 200 dimensions than it is when using 25 dimensions." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-93", "text": "However, when using skipgram, this drop is slightly greater when using 25 dimensions than 200." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-94", "text": "We also evaluated the binarization approach (Guo et al., 2014) ." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-95", "text": "This model uses three values, represented using two bits." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-96", "text": "We observe that, when the dimension is 25, the binarization is worse than truncation." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-97", "text": "One possible explanation has to do merely with the size of the space; while 3 25 is much larger than the size of the word space, it does not provide enough redundancy to exploit similarity as needed in the tasks." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-98", "text": "Consequently, the binarization approach results in worse performance." 
}, { "sent_id": "4b08797852539f464793dd23cf8999-C001-99", "text": "However, when the dimension is 200, this approach works much better, and outperforms the 4-bit truncation." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-100", "text": "In particular, binarization works better for skipgram than for CBOW with 200 dimensions." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-101", "text": "One possible explanation is that the binarization algorithm computes, for each dimension of the word vectors, the positive and negative means of the values and uses it to split the original values in that dimension, thus behaving like a model that clusters the values in each dimension." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-102", "text": "The success of the binarization then indicates that skipgram embeddings might be more discriminative than CBOW embeddings." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-103", "text": "----------------------------------" }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-104", "text": "**COMPARING TRAINING METHODS**" }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-105", "text": "We compare the training methods for the CBOW model in Table 2 ." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-106", "text": "For stochastic rounding, we scale the probability of rounding up to make sure that small gradient values will still update the values." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-107", "text": "Both stochastic rounding and truncation with auxiliary update vectors (shown in Sec. 3.2) require 16 bits for each value in the training phase." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-108", "text": "Truncation with auxiliary update vectors finally produces 8-bit-value based vectors while stochastic rounding produces 16-bit-value based vectors." 
}, { "sent_id": "4b08797852539f464793dd23cf8999-C001-109", "text": "Even though our auxiliary update algorithm uses smaller memory/disk to store vectors, its performance is still better than that of stochastic rounding." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-110", "text": "This is simply because the update values in SGD are too small to allow the stochastic round-" }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-111", "text": "----------------------------------" }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-112", "text": "**EXPERIMENTS ON DEPENDENCY PARSING**" }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-113", "text": "We also incorporate word embedding results into a downstream task, dependency parsing, to evaluate whether the truncated embedding results are still good features compared to the original features." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-114", "text": "We follow the setup of (Guo et al., 2015) in a monolingual setting 3 ." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-115", "text": "We train the parser with 5,000 iterations using different truncation settings for word2vec embedding." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-116", "text": "The data used to train and evaluate the parser is the English data in the CoNLL-X shared task (Buchholz and Marsi, 2006) ." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-117", "text": "We follow (Guo et al., 2015) Table 3 for dependency parsing are consistent with word similarity and paraphrasing." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-118", "text": "First, we see that binarization for CBOW and skipgram is again better than the truncation approach." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-119", "text": "Second, for truncation results, more bits leads to better results." 
}, { "sent_id": "4b08797852539f464793dd23cf8999-C001-120", "text": "With 8-bits, we can again obtain results similar to those obtained 3 https://github.com/jiangfeng1124/ acl15-clnndep" }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-121", "text": "----------------------------------" }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-122", "text": "**CONCLUSION**" }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-123", "text": "We systematically evaluated how small can the representation size of dense word embedding be before it starts to impact the performance of NLP tasks that use them." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-124", "text": "We considered both the final size of the size we provide it while learning it." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-125", "text": "Our study considers both the CBOW and the skipgram models at 25 and 200 dimensions and showed that 8 bits per dimension (and sometimes even less) are sufficient to represent each value and maintain performance on a range of lexical tasks." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-126", "text": "We also provided two ways to train the embeddings with reduced memory use." }, { "sent_id": "4b08797852539f464793dd23cf8999-C001-127", "text": "The natural future step is to extend these experiments and study the impact of the representation size on more advanced tasks." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "4b08797852539f464793dd23cf8999-C001-25" ], [ "4b08797852539f464793dd23cf8999-C001-45", "4b08797852539f464793dd23cf8999-C001-46", "4b08797852539f464793dd23cf8999-C001-47", "4b08797852539f464793dd23cf8999-C001-48" ] ], "cite_sentences": [ "4b08797852539f464793dd23cf8999-C001-25", "4b08797852539f464793dd23cf8999-C001-46" ] }, "@MOT@": { "gold_contexts": [ [ "4b08797852539f464793dd23cf8999-C001-25" ], [ "4b08797852539f464793dd23cf8999-C001-45", "4b08797852539f464793dd23cf8999-C001-46", "4b08797852539f464793dd23cf8999-C001-47", "4b08797852539f464793dd23cf8999-C001-48" ] ], "cite_sentences": [ "4b08797852539f464793dd23cf8999-C001-25", "4b08797852539f464793dd23cf8999-C001-46" ] }, "@USE@": { "gold_contexts": [ [ "4b08797852539f464793dd23cf8999-C001-67" ], [ "4b08797852539f464793dd23cf8999-C001-83" ] ], "cite_sentences": [ "4b08797852539f464793dd23cf8999-C001-67", "4b08797852539f464793dd23cf8999-C001-83" ] } } }, "ABC_062e9348de5fda68e61fff3ca4f186_31": { "x": [ { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-4", "text": "In all tasks normalization improves the results." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-5", "text": "----------------------------------" }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-6", "text": "****" }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-29", "text": "----------------------------------" }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-30", "text": "**NORMALIZED ENTITY GRAPH**" }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-57", "text": "**SENTENCE ORDERING**" }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-58", "text": "This task consists of two subtasks: discrimination and insertion." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-2", "text": "Guinaudeau and Strube (2013) introduce a graph based model to compute local entity coherence." 
}, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-7", "text": "1 Introduction Guinaudeau and Strube (2013) introduce a graph based model (henceforth called entity graph) to compute local entity coherence." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-3", "text": "We propose a computationally efficient normalization method for these graphs and then evaluate it on three tasks: sentence ordering, summary coherence rating and readability assessment." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-8", "text": "Despite being unsupervised, the entity graph performs on par with Barzilay and Lapata's (2005; 2008) supervised entity grid on the tasks of sentence ordering, summary coherence rating and readability assessment." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-9", "text": "The entity graph also overcomes shortcomings of the entity grid with regard to computational complexity, data sparsity and domain dependence." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-10", "text": "The entity graph is a bipartite graph where one set of nodes represents entities and the other set of nodes represents the sentences of a document." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-11", "text": "Guinaudeau and Strube (2013) apply a one mode projection on sentence nodes (Newman, 2010) and then compute the average out-degree of sentence nodes to determine how coherent a document is." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-12", "text": "They describe variants of their entity graph which take the number of shared entities between sentences and their grammatical functions into account thus resulting in weighted bipartite graphs and weighted one mode projections." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-13", "text": "Here, we propose to normalize weights for the entity graph." 
}, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-14", "text": "Normalization allows to include distance between mentions of the same entity, which improves the performance on all three tasks thus confirming research in related areas which states that normalizing weights leads to better performance (Zhou et al., 2008; Zweig and Kaufmann, 2011) ." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-15", "text": "----------------------------------" }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-16", "text": "**THE ENTITY GRAPH**" }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-17", "text": "The entity graph (Guinaudeau and Strube, 2013) , G = (V, E), represents the relations between sentences and entities in a text, where node set V contains all sentences and entities in a text and E is the set of all edges between sentences and entities." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-18", "text": "Let function w(s i , e j ) indicate the weight of an edge which connects sentence s i and entity e j ." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-19", "text": "If w(s i , e j ) = 1, then this edge indicates that there is a mention of e j in sentence s i ." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-20", "text": "In order to realize the insight from Grosz et al. (1995) that certain syntactic roles are more important than others, the syntactic role of e j in s i can be mapped to an integer value (Guinaudeau and Strube, 2013): w(s i , e j ) = 3 if ej is subject in si 2 if ej is object in si 1 otherwise Figure 1 illustrates a weighted entity graph for three sentences." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-21", "text": "Three types of one-mode projections capture relations between sentences, P U , P W and P Acc ." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-22", "text": "P U creates an edge between two sentences if they share at least one entity." 
}, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-23", "text": "P W captures the intuition that the connection between two sentences is stronger the more entities they share by means of weighted edges, where the weights equal the number of entities shared by sentences (Newman, 2004) ." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-24", "text": "The third type of projection, P Acc , integrates syntactic information in the edge weights calculated by the following formula: While the entity grid (Barzilay and Lapata, 2008) uses information about sentences which do not share entities by means of the \"--\" transition, the entity graph cannot employ this negative information." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-25", "text": "Here, we propose a normalization for the entity graph and its corresponding one-mode projections which is based on the relative importance of entities and, in turn, the relative importance of sentences." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-26", "text": "Including negative information allows to normalize the importance of entities according to sentence length (measured in terms of entity mentions), and hence to capture distance information between mentions of the same entity." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-27", "text": "This brings the entity graph closer to Stoddard's (1991, p.30 ) notion of cohesion: \"The relative cohesiveness of a text depends on the number of cohesive ties [...] and on the distance between the nodes and their associated cohesive elements." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-28", "text": "\" By using this information, edge weights are set less arbitrary which leads to the more sound method and higher performance in all tasks." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-31", "text": "The entity graph weighs edges by the number of entities sentences share (P W ) and which syntactic functions the entities occupy (P Acc )." 
}, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-32", "text": "Here we normalize the weights by the number of entities in a sentence." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-33", "text": "This takes negative information into account as entities which do not occur in other sentences also count." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-34", "text": "Hence normalization captures the relative importance of entities as well as the relative importance of sentences." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-35", "text": "We follow Newman (2004) by applying node degree normalization." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-36", "text": "For P W , we divide the weight of each edge by the degree of the corresponding sentence node." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-37", "text": "If a sentence contains many entities, then the amount of information each entity contributes is reduced." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-38", "text": "Assume s i as the number of entities in sentence s i ." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-39", "text": "The importance of entity e j for s i is Imp(s i , e j ) = 1 s i ." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-40", "text": "Figure 3: Normalized entity graph" }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-41", "text": "For P Acc we divide the weight of each edge by the sum of all edges' weights of a sentence." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-42", "text": "This gives the importance of each entity in a sentence relative to the sentence's other entities (see Figure 3) ." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-43", "text": "For also normalizing the one-mode projection we introduce a virtual node T C capturing the textual content of all sentences (inspired by the graph based information retrieval model of Rode (2008))." 
}, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-44", "text": "The virtual node T C is connected to all sentences (see Figure 4 )." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-45", "text": "Rode (2008) uses the following formula to compute weights on the edges between the sentence nodes and T C:" }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-46", "text": "where the function Score(s i |T C) is the number of entities in s i which have overlap with T C. This value is equal to the degree of each sentence." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-47", "text": "Since we are interested in local coherence, we restrict T C to pairs of sentences (See Figure 5) ." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-48", "text": "Subsequently, instead of w(s i , T C), we use the notation lw s j s i (local weight of sentence s i according to sentence s j )." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-49", "text": "We define the normalized one-mode projection as follows: This method can be applied to graphs with edges weighted according to syntactic role (P Acc )." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-50", "text": "To compute the connection's strength of a pair of sentences we follow Yang and Knoke's (2001) approach: The path length in a weighted graph is the sum of the edge weights in the path." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-51", "text": "In our case, each path is defined between a pair of sentences of the entity graph, so the number of edges of all paths are equal to two." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-52", "text": "Figure 6 shows the normalized projections where the weights have been computed by the above formula." 
}, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-53", "text": "----------------------------------" }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-54", "text": "**EXPERIMENTS**" }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-55", "text": "We compare the normalized entity graph with the entity graph on all tasks, Guinaudeau and Strube (2013)" }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-56", "text": "----------------------------------" }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-59", "text": "In both subtasks we evaluate whether our model can distinguish between the correct order of sentences in a document and an incorrect one." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-60", "text": "Experimental setup and data follow Guinaudeau and Strube (2013) (61 documents from the English test part of the CoNLL 2012 shared task (Pradhan et al., 2012) )." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-61", "text": "For discrimination we use 20 permutations of each text." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-62", "text": "Table 1 shows the results." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-63", "text": "Results for Guinaudeau and Strube (2013) , G&S, are reproduced, results for Barzilay and Lapata (2008) , B&L, and Elsner and Charniak (2011) , E&C, were reproduced by Guinaudeau and Strube (2013) ." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-64", "text": "The unweighted graph, P U , does not need normalization." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-65", "text": "Hence the results for the entity graph and the normalized entity graph are identical." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-66", "text": "Normalization improves the results for the weighted graphs P W and P Acc with P Acc outperforming B&L considerably and closely approaching E&L." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-67", "text": "Sentence insertion is more difficult than discrimination." 
}, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-68", "text": "Following Elsner and Charniak (2011), we use two measures for evaluation: Accuracy (Acc.) and the average proportion of correct insertions per document (Ins.)." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-69", "text": "Table 3 : Summary Coherence Rating, B&L and entity graph vs. normalized entity graph Table 2 shows that the normalized entity graph outperforms the entity graph for P W and P Acc (again, no difference for P U )." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-70", "text": "The normalized entity graph outperforms E&C in Acc. and approaches it in Ins." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-71", "text": "The high value for Ins. shows that if the normalized entity graph makes false decisions they are closer to the original ordering than the mistakes of the entity graph." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-72", "text": "----------------------------------" }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-73", "text": "**SUMMARY COHERENCE RATING**" }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-74", "text": "We follow Barzilay and Lapata (2008) for evaluating whether the normalized entity graph can decide whether automatic or human summaries are more coherent (80 pairs of summaries extracted from DUC 2003)." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-75", "text": "Human coherence scores are associated with each pair of summarized documents (Barzilay and Lapata, 2008) ." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-76", "text": "Table 3 displays reported results of B&L and reproduced results of the entity graph and our normalized entity graph." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-77", "text": "Normalizing significantly improves the results for P W and P Acc ." 
}, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-78", "text": "P U is still slightly better than both, but in contrast to the entity graph, this difference is not statistically significant." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-79", "text": "We believe that better weighting schemes based on linguistic insights eventually will outperform P U and B&L (left for future work)." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-80", "text": "Distance information always degrades the results for this task (see Guinaudeau and Strube (2013) )." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-81", "text": "----------------------------------" }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-82", "text": "**READABILITY ASSESSMENT**" }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-83", "text": "Readability assessment aims to distinguish texts which are difficult to read from texts which are easier to read." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-84", "text": "In experiments, Barzilay and Lapata (2008) assume that articles taken from Encyclopedia Britannica are more difficult to read (less coherent) than the corresponding articles from Encyclopedia Britannica Elementary, its version for children." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-85", "text": "We follow them with regard to data (107 article pairs), experimental setup and evaluation." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-86", "text": "Sentences in the Britannica Elementary are simpler and shorter than in the Encyclopedia Britannica." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-87", "text": "The entity graph does not take into account the effect of entities not shared between sentences while the normalized entity graph assigns a lower weight if there are more of these entities." 
}, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-88", "text": "Hence, Britannica Elementary receives a higher cohesion score than Encyclopedia Britannica in our model." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-89", "text": "Adding grammatical information, does not help, because of the influence of the number of entities (shared and not shared) outweighs the influence of syntactic roles." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-90", "text": "The normalized entity graph (P W , Dist) does not only outperform the entity graph (significantly) and B&L but also S&O and the combination B&L + S&O." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-91", "text": "----------------------------------" }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-92", "text": "**CONCLUSION**" }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-93", "text": "We proposed a normalization method for the entity graph (Guinaudeau and Strube, 2013) ." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-94", "text": "We compared our model to the entity graph and to the entity grid (Barzilay and Lapata, 2008) and showed that normalization improves the results significantly in most tasks." }, { "sent_id": "062e9348de5fda68e61fff3ca4f186-C001-95", "text": "Future work will include adding more linguistic information, stronger weighting schemes and application to other readability datasets (Pitler and Nenkova, 2008; De Clercq et al., 2014) ." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "062e9348de5fda68e61fff3ca4f186-C001-24", "062e9348de5fda68e61fff3ca4f186-C001-25" ], [ "062e9348de5fda68e61fff3ca4f186-C001-63" ], [ "062e9348de5fda68e61fff3ca4f186-C001-84", "062e9348de5fda68e61fff3ca4f186-C001-85", "062e9348de5fda68e61fff3ca4f186-C001-86", "062e9348de5fda68e61fff3ca4f186-C001-87", "062e9348de5fda68e61fff3ca4f186-C001-88" ] ], "cite_sentences": [ "062e9348de5fda68e61fff3ca4f186-C001-24", "062e9348de5fda68e61fff3ca4f186-C001-63", "062e9348de5fda68e61fff3ca4f186-C001-84" ] }, "@MOT@": { "gold_contexts": [ [ "062e9348de5fda68e61fff3ca4f186-C001-24", "062e9348de5fda68e61fff3ca4f186-C001-25" ], [ "062e9348de5fda68e61fff3ca4f186-C001-84", "062e9348de5fda68e61fff3ca4f186-C001-85", "062e9348de5fda68e61fff3ca4f186-C001-86", "062e9348de5fda68e61fff3ca4f186-C001-87", "062e9348de5fda68e61fff3ca4f186-C001-88" ] ], "cite_sentences": [ "062e9348de5fda68e61fff3ca4f186-C001-24", "062e9348de5fda68e61fff3ca4f186-C001-84" ] }, "@USE@": { "gold_contexts": [ [ "062e9348de5fda68e61fff3ca4f186-C001-74" ], [ "062e9348de5fda68e61fff3ca4f186-C001-75" ], [ "062e9348de5fda68e61fff3ca4f186-C001-84", "062e9348de5fda68e61fff3ca4f186-C001-85", "062e9348de5fda68e61fff3ca4f186-C001-86", "062e9348de5fda68e61fff3ca4f186-C001-87", "062e9348de5fda68e61fff3ca4f186-C001-88" ] ], "cite_sentences": [ "062e9348de5fda68e61fff3ca4f186-C001-74", "062e9348de5fda68e61fff3ca4f186-C001-75", "062e9348de5fda68e61fff3ca4f186-C001-84" ] }, "@DIF@": { "gold_contexts": [ [ "062e9348de5fda68e61fff3ca4f186-C001-94" ] ], "cite_sentences": [ "062e9348de5fda68e61fff3ca4f186-C001-94" ] } } }, "ABC_42ca932eaa96c174cdfb815bee82cb_31": { "x": [ { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-2", "text": "Here we respond to some comments by Alday concerning headedness in linguistic theory and the validity of the assumptions of a mathematical model for word order." 
}, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-3", "text": "For brevity, we focus only on two assumptions: the unit of measurement of dependency length and the monotonicity of the cost of a dependency as a function of its length." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-4", "text": "We also revise the implicit psychological bias in Alday's comments." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-5", "text": "Notwithstanding, Alday is indicating the path for linguistic research with his unusual concerns about parsimony from multiple dimensions." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-6", "text": "----------------------------------" }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-7", "text": "**HEADEDNESS**" }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-8", "text": "The principle of dependency length minimization makes conditional predictions, which do not imply that placing heads at the center is optimal in general (Ferreri-Cancho, 2015) ." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-9", "text": "In the context of just one head and at least one dependent, the principle predicts that the central placement of the head is optimal when there are at least two dependents but that placement is irrelevant if there is only one dependent." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-10", "text": "In the context of an initial verb followed by two arguments (e.g., subject and object), the principle predicts that the heads of the verbal arguments are placed first with respect to their dependents, e.g., articles or adjectives follow the head noun, whereas, for a final verb, the prediction is that the heads of the arguments are placed last with respect to their dependents, e.g. articles or adjectives precede the head noun (Ferrer-i-Cancho, 2015) ." 
}, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-11", "text": "The principle of dependency length minimization predicts consistent branching, to some extent, for verb-first or verb-last placements." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-12", "text": "1 If there is one head, the principle predicts a rather symmetric head placement but various heads can lead to an anti-symmetric placement (Ferrer-i-Cancho, 2015 , 2008 ." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-13", "text": "Importantly, the anti-symmetric placements do not need to be consistent." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-14", "text": "Thus, in an SOV language the optimal placement of an (inflected) verbal auxiliary is after the verbal head (Ferrer-i-Cancho, 2008) but the optimal placement of adjectives is before their nominal heads." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-15", "text": "----------------------------------" }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-16", "text": "**2**" }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-17", "text": "The inconsistency of anti-symmetry is a challenge for principles and parameters theory, e.g. Baker (2001) ." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-18", "text": "Amongst other parameters, the theory typically defines one for headedness, specifying whether a head should follow or precede its dependents." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-19", "text": "That general parameter is clearly insufficient for the inconsistency above, but also because SVO languages put their verb at the center (head at the center) but then tend to put adjectives after the noun (head first)." 
}, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-20", "text": "A possible solution is defining at least two headedness parameters, one for verbal heads and another for nominal heads, which is not very parsimonious (and still problematic: SOV follows left-branching for S and O but right branching for verbal auxiliaries)." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-21", "text": "3 Instead, certain theoretical linguists may wish to involve interactions with other parameters in the discussion (see Baker (2001) for candidates). But is that parsimonious enough?" }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-22", "text": "An alternative solution is considering the placement of the verb as a parameter." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-23", "text": "From that single parameter we have shown that it is possible to infer head-final placement within verb arguments in SOV, head-first placement within verb arguments in SVO, head-first placement within verb arguments of VSO/VOS (Ferrer-i-Cancho, 2015) and also head first for the placement of inflected auxiliaries in SOV and head-final for their placement in VSO (Ferrer-i-Cancho, 2008 )." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-24", "text": "However, the predictive power of the position of the verb does not imply that this is a fundamental parameter in a universal grammar sense: verb position might be determined by evolutionary time (e.g., verb-last being more likely in early stages of evolution) or sentence length (e.g., SVO being more likely in languages where speakers produce more elaborate sentences or simply longer sentences) as explained by Ferrer-i-Cancho (2014) ." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-25", "text": "2 We are assuming that verbal auxiliaries are dependents of verbs, but it could be argued that it should be the other way around, as in certain approaches to dependency grammar (Mel'\u010duk, 1988) ." 
}, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-26", "text": "There is no straightforward solution to this problem (Heine, 1993, p. 106 )." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-27", "text": "For our dependency length minimization arguments, the direction of the dependency is irrelevant." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-28", "text": "What matters is whether the heads of the subject and the object attach to the auxiliary verb (the auxiliary verb is then the hub) or to the main verb (the main verb is then the hub)." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-29", "text": "A model where verbal auxiliaries are the hubs poses some problems." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-30", "text": "First, the terms \"main verb\" and \"auxiliary verb\" suggest that the main verb has to be the head." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-31", "text": "Second, research on grammaticalization suggests that auxiliaries are unlikely to be the heads at advanced stages of evolution (Heine, 1993, p. 106) ." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-32", "text": "Third, information theory predicts that the dependencies of high frequency words such as auxiliaries are likely to have weaker links than those of the main verb (Ferrer-i-Cancho & Reina, 2002) , thus reducing the chances that auxiliaries win the competition for becoming the hubs." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-33", "text": "Fourth, an explanation for the relative placement of verbal auxiliaries given the placement of the main verb (Ferrer-i-Cancho, 2015 , 2008 would be lost." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-34", "text": "A theory covering all the phenomena reached originally by the principle of online memory minimization would be heavier." 
}, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-35", "text": "2 The monotonic dependency between cognitive cost and distance Ferrer-i-Cancho (2015) assumes that the cognitive cost of a dependency is a strictly monotonic function of its length." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-36", "text": "Alday (2015) considers that this assumption may not be valid." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-37", "text": "For simplicity, let us measure the length of a dependency in words." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-38", "text": "The online memory cost of a sentence of n words can be defined as" }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-39", "text": "where n \u2212 1 is the number of edges of a syntactic dependency tree of n vertices, Alday (2015) reviews the issue of anti-locality effects (Lewis, Vasishth, & Van Dyke, 2006) : in certain circumstances, longer dependencies are processed faster." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-40", "text": "We believe that this could happen for various reasons." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-41", "text": "First, anti-locality might be a consequence of dependency length minimization itself (this is reminiscent of the idea that anti-locality effects derive from artifacts of the memory architecture (Lewis et al., 2006) )." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-42", "text": "In Section 1, we reviewed that dependency length minimization and central head placement are not equivalent: in SOV languages, the placement of adjectives before the nominal heads of S and O, which is antilocal, leads to shorter dependencies than a central placement of those nominal heads within the S and O constituent." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-43", "text": "In other words, anti-locality within a constituent can increase the locality of the whole." 
}, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-44", "text": "This is a very important point for psycholinguistic research when the dependency length of a few dependencies or even just one dependency is considered to be relevant to argue for or against dependency minimization (Konieczny, 2000; Levy, Fedorenko, Breen, & Gibson, 2012) ." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-45", "text": "Standard psycholinguistic research on word order is reductionistic." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-46", "text": "Second, conflicts between word order principles are at the core of Ferrer-iCancho's word order models (Ferrer-i-Cancho, 2014):" }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-47", "text": "\u2022 Dependency length minimization is in conflict with maximization of predictability, e.g., postponing an element maximizes its predictability." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-48", "text": "\u2022 A preference for SOV may simply mean that dependency length minimization is not the strongest principle (in short sequences, online memory is less important)." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-49", "text": "Thus, anti-locality effects do not invalidate a priori the principle of dependency length minimization." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-50", "text": "Faster processing when dependents are farther apart does not need to be caused by their separation." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-51", "text": "6 Other principles may be at play: postponing the head is optimal for maximizing its predictability (Ferreri-Cancho, 2014) ." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-52", "text": "Third, notice that Ferrer-i-Cancho (2015) assumes that the cost of a dependency of length d is g(d)." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-53", "text": "In that setup, the identity of the dependents is irrelevant." 
}, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-54", "text": "However, we may consider a more general definition of the cost of a dependency between two units u and v: g(u, v, d)." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-55", "text": "Intuitively, one would expect that g(u, v, d) is lower when u and v are more strongly correlated, as this should make their dependency more robust against interference or decay." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-56", "text": "Lower values of g(u, v, d) may facilitate the involvement of other word order principles." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-57", "text": "\u00b7 a n." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-58", "text": "1 2 3 4 5 6 7 8 8.5 9 10 11 12 13 13.5 14 15." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-59", "text": "1 2 3 4 5 6 7 7.5 8 9 10 11 11.5 12 13." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-60", "text": "3 The unit of measurement of dependency length A limitation of Ferrer-i-Cancho (2015) is that dependency length is measured in words." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-61", "text": "However, it could be measured in other linguistic units: syllables, morphemes, phonemes, etc." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-62", "text": "A higher level of precision might illuminate inconsistencies between the dominant order according to the criteria in Dryer (2013) and other orders that appear recurrently in particular circumstances." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-63", "text": "Alday (2015) points out the case of Italian, which is classified as SVO (Dryer & Haspelmath, 2013) but allows pronouns taking the role of the object to appear before the verb." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-64", "text": "This is a common phenomenon occurring in other Romance languages (e.g., Catalan or French)." 
}, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-65", "text": "Let us consider the case of French, which is \"SVO when the object is lexical, but SOV when the object is prepositional\" (Newmeyer, 2003) , as in sentences (a) and (b) in Fig. 1 ." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-66", "text": "For simplicity, imagine that we measure dependency length in characters (this may make sense in a reading task, for instance)." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-67", "text": "Imagine that the center of a word of length \u03bb is located at position (\u03bb + 1)/2 and that the space between two words counts as one character." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-68", "text": "For simplicity again, let us assume that dependencies originate from the center of the word and that the length of a dependency is the difference between the positions of the centers of the words involved." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-69", "text": "The sum of dependency lengths of the SOV sentence with a pronominal object ( Fig. 1 (b) ) is in between that of the SVO sentence with nominal object (Fig. 1 (a) ) and that of SOV the sentence with nominal object (Fig. 1 (c) )." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-70", "text": "This is due to the brevity of the pronoun, suggesting that pressure for online memory minimization reduces if short words are involved." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-71", "text": "Another factor that may influence pressure for dependency length minimization is sentence length: dependency length minimization has been argued to be less necessary in short sentences, where the maximization of the predictability of the verb might be the winning principle (Ferrer-i-Cancho, 2014) ." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-72", "text": "Sentence length and word length might explain the tendency to adopt SOV in Romance languages when clitics are involved." 
}, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-73", "text": "----------------------------------" }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-74", "text": "**THE PSYCHOLOGICAL BIAS**" }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-75", "text": "A wide majority of language researchers adopt a psychological perspective when dealing with word order and other linguistic phenomena (e.g., Konieczny, 2000; Levy et al., 2012; Alday, 2015) ." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-76", "text": "This perspective takes for granted that word order phenomena can ultimately be explained by principles (e.g., minimization of computational resources of the human brain) or mental processes that operate during sentence production or understanding within an individual, abstracting away from evolution (evolutionary history or evolutionary processes) and to a great extent from that individual's society." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-77", "text": "However, evidence that language structure or linguistic features are determined by social or political features is challenging this dominant view (Lupyan & Dale, 2010; SantacreuVasut, Shoham, & Gay, 2013 )." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-78", "text": "Ferrer-i-Cancho (2015 also takes a psychological perspective when investigating the predictions of a principle of minimization of dependency length costs, but he departs from it by considering the possibility that certain word order configurations do not have a purely psychological origin but, rather, one that cannot be disentangled from evolution." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-79", "text": "For instance, the relative placement of the dependents within the subject or the object in SVO languages might be an adaptation that prevents regression to SOV (Ferrer-iCancho, 2015) ." 
}, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-80", "text": "In terms of online memory minimization, an SVO language could place those dependents anywhere, but placements that are harder for SOV would be selected." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-81", "text": "Regarding dependency length minimization, SVO has more freedom to place dependents of the nominal heads after or before their heads than SOV does, but SVO has higher chances of survival in languages that place those dependents against the preferences of SOV." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-82", "text": "Under this perspective, word order is regarded as a species that competes with other word orders." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-83", "text": "7 In sum, we may not be able to solve all word order puzzles through the organization of memory or mental processes involved in language production and understanding." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-84", "text": "Theorists may think that they have solved them with fatter theories requiring unnecessary or arbitrary parameters." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-85", "text": "Experimentalists may think that increasing the complexity of the experiments is the way to go." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-86", "text": "----------------------------------" }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-87", "text": "**CONCLUDING REMARKS**" }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-88", "text": "Alday (2015) is very right about the need of parsimony in its multiple forms (simplicity, generality, theory-agnostic approach, unification of synchrony and diachrony, ...)." 
}, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-89", "text": "Theoretical linguistics of the last century has chosen the easy path of modeling language by gratuitously adding parameters and then worrying a posteriori on constraints in number (Hurford, 2012) 8 or evolution." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-90", "text": "Artificial intelligence researchers know very well that modeling is easy without any limit on the number of parameters or lacking any concern about the predictive capacity of the model on new data or new contexts (generality)." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-91", "text": "The illusion of success caused by models that overfit linguistic phenomena and a blind belief in the autonomy of a discipline (against multidisciplinary research) prevent language specialists from incorporating simple and general models by 'outsiders'." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-92", "text": "We, as a community, should be wiser when we are not satisfied by the limited fit of general models of language but do not care about overfitting." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-93", "text": "9 A lack of scientific depth can be hidden by abstract concepts and complex formalisms, but also by complicated psychological experiments or elaborate computer simulations." }, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-94", "text": "Modern model selection, which relies on a quantitative evaluation of models from a compromise between parsimony and goodness of fit (Burnham & Anderson, 2002) , belongs to the future of language research." 
}, { "sent_id": "42ca932eaa96c174cdfb815bee82cb-C001-95", "text": "With his unusual concerns about parsimony, Alday (2015) is reminding us that parsimony is more than an aesthetic requirement: rather, it is the key to a deep understanding of the myriad of issues with which he has cleverly challenged Ferrer-i-Cancho" } ], "y": { "@BACK@": { "gold_contexts": [ [ "42ca932eaa96c174cdfb815bee82cb-C001-10", "42ca932eaa96c174cdfb815bee82cb-C001-8", "42ca932eaa96c174cdfb815bee82cb-C001-9" ], [ "42ca932eaa96c174cdfb815bee82cb-C001-11", "42ca932eaa96c174cdfb815bee82cb-C001-12" ], [ "42ca932eaa96c174cdfb815bee82cb-C001-22", "42ca932eaa96c174cdfb815bee82cb-C001-23", "42ca932eaa96c174cdfb815bee82cb-C001-24" ], [ "42ca932eaa96c174cdfb815bee82cb-C001-33", "42ca932eaa96c174cdfb815bee82cb-C001-34" ], [ "42ca932eaa96c174cdfb815bee82cb-C001-35" ], [ "42ca932eaa96c174cdfb815bee82cb-C001-52", "42ca932eaa96c174cdfb815bee82cb-C001-53" ], [ "42ca932eaa96c174cdfb815bee82cb-C001-60", "42ca932eaa96c174cdfb815bee82cb-C001-61", "42ca932eaa96c174cdfb815bee82cb-C001-62" ] ], "cite_sentences": [ "42ca932eaa96c174cdfb815bee82cb-C001-10", "42ca932eaa96c174cdfb815bee82cb-C001-12", "42ca932eaa96c174cdfb815bee82cb-C001-23", "42ca932eaa96c174cdfb815bee82cb-C001-33", "42ca932eaa96c174cdfb815bee82cb-C001-35", "42ca932eaa96c174cdfb815bee82cb-C001-52", "42ca932eaa96c174cdfb815bee82cb-C001-60" ] }, "@MOT@": { "gold_contexts": [ [ "42ca932eaa96c174cdfb815bee82cb-C001-22", "42ca932eaa96c174cdfb815bee82cb-C001-23", "42ca932eaa96c174cdfb815bee82cb-C001-24" ], [ "42ca932eaa96c174cdfb815bee82cb-C001-60", "42ca932eaa96c174cdfb815bee82cb-C001-61", "42ca932eaa96c174cdfb815bee82cb-C001-62" ] ], "cite_sentences": [ "42ca932eaa96c174cdfb815bee82cb-C001-23", "42ca932eaa96c174cdfb815bee82cb-C001-60" ] } } }, "ABC_c608567abe72c75bbbc8eb917ab5d3_31": { "x": [ { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-2", "text": "Adversarial training 
(AT) is a regularization method that can be used to improve the robustness of neural network methods by adding small perturbations in the training data." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-3", "text": "We show how to use AT for the tasks of entity recognition and relation extraction." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-4", "text": "In particular, we demonstrate that applying AT to a general-purpose baseline model for jointly extracting entities and relations improves the state-of-the-art effectiveness on several datasets in different contexts (i.e., news, biomedical, and real estate data) and for different languages (English and Dutch)." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-5", "text": "----------------------------------" }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-7", "text": "Many neural network methods have recently been exploited in various natural language processing (NLP) tasks, such as parsing, POS tagging (Lample et al., 2016), relation extraction (dos Santos et al., 2015), translation (Bahdanau et al., 2015), and joint tasks (Miwa and Bansal, 2016)." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-8", "text": "However, Szegedy et al. (2014) observed that intentional small scale perturbations (i.e., adversarial examples) to the input of such models may lead to incorrect decisions (with high confidence)." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-9", "text": "Goodfellow et al. (2015) proposed adversarial training (AT) (for image recognition) as a regularization method which uses a mixture of clean and adversarial examples to enhance the robustness of the model." 
}, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-10", "text": "Although AT has recently been applied in NLP tasks (e.g., text classification (Miyato et al., 2017) ), this paper -to the best of our knowledge -is the first attempt investigating regularization effects of AT in a joint setting for two related tasks." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-11", "text": "We start from a baseline joint model that performs the tasks of named entity recognition (NER) and relation extraction at once." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-12", "text": "Previously proposed models (summarized in Section 2) exhibit several issues that the neural network-based baseline approach (detailed in Section 3.1) overcomes: (i) our model uses automatically extracted features without the need of external parsers nor manually extracted features (see Gupta et al. (2016) ; Miwa and Bansal (2016) ; Li et al. (2017) ), (ii) all entities and the corresponding relations within the sentence are extracted at once, instead of examining one pair of entities at a time (see Adel and Sch\u00fctze (2017) ), and (iii) we model relation extraction in a multi-label setting, allowing multiple relations per entity (see Katiyar and Cardie (2017) ; Bekoulis et al. (2018a) )." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-13", "text": "The core contribution of the paper is the use of AT as an extension in the training procedure for the joint extraction task (Section 3.2)." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-14", "text": "To evaluate the proposed AT method, we perform a large scale experimental study in this joint task (see Section 4), using datasets from different contexts (i.e., news, biomedical, real estate) and languages (i.e., English, Dutch)." 
}, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-15", "text": "We use a strong baseline that outperforms all previous models that rely on automatically extracted features, achieving state-of-the-art performance (Section 5)." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-16", "text": "Compared to the baseline model, applying AT during training leads to a consistent additional increase in joint extraction effectiveness." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-17", "text": "----------------------------------" }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-18", "text": "**RELATED WORK**" }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-19", "text": "Joint entity and relation extraction: Joint models (Li and Ji, 2014; Miwa and Sasaki, 2014) that are based on manually extracted features have been proposed for performing both the named entity recognition (NER) and relation extraction subtasks at once." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-20", "text": "These methods rely on the availability of NLP tools (e.g., POS taggers) or manually designed features leading to additional complexity." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-21", "text": "Neural network methods have been exploited to overcome this feature design issue and usually involve RNNs and CNNs (Miwa and Bansal, 2016; Zheng et al., 2017) ." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-22", "text": "Specifically, Miwa and Bansal (2016) as well as Li et al. (2017) apply bidirectional tree-structured RNNs for different contexts (i.e., news, biomedical) to capture syntactic information (using external dependency parsers)." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-23", "text": "Gupta et al. (2016) propose the use of various manually extracted features along with RNNs." 
}, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-24", "text": "Adel and Sch\u00fctze (2017) solve the simpler problem of entity classification (EC, assuming entity boundaries are given), instead of NER, and they replicate the context around the entities, feeding entity pairs to the relation extraction layer." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-25", "text": "Katiyar and Cardie (2017) investigate RNNs with attention without taking into account that relation labels are not mutually exclusive." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-26", "text": "Finally, Bekoulis et al. (2018a) use LSTMs in a joint model for extracting just one relation at a time, but increase the complexity of the NER part." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-27", "text": "Our baseline model enables simultaneous extraction of multiple relations from the same input." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-28", "text": "Then, we further extend this strong baseline using adversarial training." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-29", "text": "Adversarial training (AT) (Goodfellow et al., 2015) has been proposed to make classifiers more robust to input perturbations in the context of image recognition." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-30", "text": "In the context of NLP, several variants have been proposed for different tasks such as text classification (Miyato et al., 2017) , relation extraction (Wu et al., 2017) and POS tagging (Yasunaga et al., 2018) ." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-31", "text": "AT is considered as a regularization method." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-32", "text": "Unlike other regularization methods (i.e., dropout (Srivastava et al., 2014) , word dropout (Iyyer et al., 2015) ) that introduce random noise, AT generates perturbations that are variations of examples easily misclassified by the model." 
}, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-33", "text": "----------------------------------" }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-34", "text": "**MODEL**" }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-35", "text": "----------------------------------" }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-36", "text": "**JOINT LEARNING AS HEAD SELECTION**" }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-37", "text": "The baseline model, described in detail in Bekoulis et al. (2018b) , is illustrated in Fig. 1 ." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-38", "text": "It aims to detect (i) the type and the boundaries of the entities and (ii) the relations between them." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-39", "text": "The input is a sequence of tokens (i.e., sentence) w = w 1 , ..., w n ." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-40", "text": "We use character level embeddings to implicitly capture morphological features (e.g., prefixes and suffixes), representing each character by a vector (embedding)." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-41", "text": "The character embeddings are fed to a bidirectional LSTM (BiLSTM) to obtain the character-based representation of the word." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-42", "text": "We also use pre-trained word embeddings." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-43", "text": "Word and character embeddings are concatenated to form the final token representation, which is then fed to a BiLSTM layer to extract sequential information." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-44", "text": "For the NER task, we adopt the BIO (Beginning, Inside, Outside) encoding scheme." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-45", "text": "In Fig. 1 , the B-PER tag is assigned to the beginning token of a 'person' (PER) entity." 
}, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-46", "text": "For the prediction of the entity tags, we use: (i) a softmax approach for the entity classification (EC) task (assuming entity boundaries given) or (ii) a CRF approach where we identify both the type and the boundaries for each entity." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-47", "text": "During decoding, in the softmax setting, we greedily detect the entity types of the tokens (i.e., independent prediction)." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-48", "text": "Although independent distribution of types is reasonable for EC tasks, this is not the case when there are strong correlations between neighboring tags." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-49", "text": "For instance, the BIO encoding scheme imposes several constraints in the NER task (e.g., the B-PER and I-LOC tags cannot be sequential)." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-50", "text": "Motivated by this intuition, we use a linear-chain CRF for the NER task (Lample et al., 2016) ." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-51", "text": "For decoding, in the CRF setting, we use the Viterbi algorithm." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-52", "text": "During training, for both EC (softmax) and NER tasks (CRF), we minimize the cross-entropy loss L NER ." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-53", "text": "The entity tags are later fed into the relation extraction layer as label embeddings (see Fig. 1 ), assuming that knowledge of the entity types is beneficial in predicting the relations between the involved entities." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-54", "text": "We model the relation extraction task as a multi-label head selection problem (Bekoulis et al., 2018b; )." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-55", "text": "In our model, each word w i can be involved in multiple relations with other words." 
}, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-56", "text": "For instance, in the example illustrated in Fig. 1 , \"Smith\" could be involved not only in a Lives in relation with the token \"California\" (head) but also in other relations simultaneously (e.g., Works for, Born In with some corresponding tokens)." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-57", "text": "The goal of the task is to predict for each word w i , a vector of heads\u0177 i and the vector of corresponding relationsr i ." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-58", "text": "We compute the score s(w j , w i , r k ) of word w j to be the head of w i given a relation label r k using a single layer neural network." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-59", "text": "The corresponding probability is defined as: P(w j , r k | w i ; \u03b8) = \u03c3(s(w j , w i , r k )), where \u03c3(.) is the sigmoid function." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-60", "text": "During training, we minimize the cross-entropy loss L rel as:" }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-61", "text": "where m is the number of associated heads (and thus relations) per word w i ." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-62", "text": "During decoding, the most probable heads and relations are selected using threshold-based prediction." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-63", "text": "The final objective for the joint task is computed as L JOINT (w; \u03b8) = L NER + L rel where \u03b8 is a set of parameters." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-64", "text": "In the case of multi-token entities, only the last token of the entity can serve as head of another token, to eliminate redundant relations." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-65", "text": "If an entity is not involved in any relation, we predict the auxiliary \"N\" relation label and the token itself as head." 
}, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-66", "text": "----------------------------------" }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-67", "text": "**ADVERSARIAL TRAINING (AT)**" }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-68", "text": "We exploit the idea of AT (Goodfellow et al., 2015) as a regularization method to make our model robust to input perturbations." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-69", "text": "Specifically, we generate examples which are variations of the original ones by adding some noise at the level of the concatenated word representation (Miyato et al., 2017) ." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-70", "text": "This is similar to the concept introduced by Goodfellow et al. (2015) to improve the robustness of image recognition classifiers." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-71", "text": "We generate an adversarial example by adding to the original embedding w the worst-case perturbation \u03b7_adv, i.e., the perturbation that maximizes the loss function:" }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-72", "text": "where \u03b8\u0302 is a copy of the current model parameters." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-73", "text": "Since Eq. (2) is intractable in neural networks, we use the approximation proposed by Goodfellow et al. (2015), defined as: \u03b7_adv = \u03f5 g/\u2016g\u2016, with g = \u2207_w L_JOINT(w; \u03b8\u0302), where \u03f5 is a small bounded norm treated as a hyperparameter." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-74", "text": "Similar to Yasunaga et al. (2018) , we set \u03f5 to be \u03b1\u221aD (where D is the dimension of the embeddings)." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-75", "text": "We train on the mixture of original and adversarial examples, so the final loss is computed as: L_JOINT(w; \u03b8) + L_JOINT(w + \u03b7_adv; \u03b8)." 
}, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-76", "text": "----------------------------------" }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-77", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-78", "text": "We evaluate our models on four datasets, using the code available from our GitHub codebase." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-79", "text": "Specifically, we follow the 5-fold cross-validation defined by Miwa and Bansal (2016) for the ACE04 (Doddington et al., 2004) dataset." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-80", "text": "For the CoNLL04 (Roth and Yih, 2004 ) EC task (assuming boundaries are given), we use the same splits as in Gupta et al. (2016) ; Adel and Sch\u00fctze (2017) ." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-81", "text": "We also evaluate our models on the NER task, similar to Miwa and Sasaki (2014) , on the same dataset using 10-fold cross-validation." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-82", "text": "For the Dutch Real Estate Classifieds, DREC (Bekoulis et al., 2017) dataset, we use train-test splits as in Bekoulis et al. (2018a) ." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-83", "text": "For the Adverse Drug Events dataset, ADE (Gurulingappa et al., 2012) , we perform 10-fold cross-validation similar to Li et al. (2017) ." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-84", "text": "To obtain comparable results that are not affected by the input embeddings, we use the embeddings from the previous works." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-85", "text": "We employ early stopping in all of the experiments." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-86", "text": "We use the Adam optimizer (Kingma and Ba, 2015) and we fix the hyperparameters (i.e., \u03b1, dropout values, best epoch, learning rate) on the validation sets." 
}, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-87", "text": "The scaling parameter \u03b1 is selected from {5e\u22122, 1e\u22122, 1e\u22123, 1e\u22124}. Larger values of \u03b1 (i.e., larger perturbations) lead to a consistent performance decrease in our early experiments." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-88", "text": "This can be explained by the fact that adding more noise can change the content of the sentence, as also reported by Wu et al. (2017) ." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-89", "text": "We use three types of evaluation, namely: (i) S(trict): we score an entity as correct if both the entity boundaries and the entity type are correct (ACE04, ADE, CoNLL04, DREC); (ii) B(oundaries): we score an entity as correct if only the entity boundaries are correct, while the entity type is not taken into account (DREC); and (iii) R(elaxed): a multi-token entity is considered correct if at least one correct type is assigned to the tokens comprising the entity, assuming that the boundaries are known (CoNLL04), to compare to previous works." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-90", "text": "Table 1 : Comparison of our method with the state-of-the-art in terms of F 1 score. The proposed models are: (i) baseline, (ii) baseline EC (predicts only entity classes) and (iii) baseline (EC) + AT (regularized by AT)." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-91", "text": "The \u2713 and \u2717 symbols indicate whether the models rely on external NLP tools." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-92", "text": "We include different evaluation types (S, R and B)." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-94", "text": "In all cases, a relation is considered correct when both the relation type and the argument entities are correct." 
}, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-95", "text": "Table 1 shows our experimental results." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-96", "text": "The name of the dataset is presented in the first column while the models are listed in the second column." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-97", "text": "The proposed models are the following: (i) baseline: the baseline model shown in Fig. 1 with the CRF layer and the sigmoid loss, (ii) baseline EC: the proposed model with the softmax layer for EC, (iii) baseline (EC) + AT: the baseline regularized using AT." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-98", "text": "The final three columns present the F 1 results for the two subtasks and their average performance." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-99", "text": "Bold values indicate the best results among models that use only automatically extracted features." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-100", "text": "For ACE04, the baseline outperforms Katiyar and Cardie (2017) by \u223c2% in both tasks." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-101", "text": "This improvement can be explained by the use of: (i) multi-label head selection, (ii) CRF-layer and (iii) character level embeddings." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-102", "text": "Compared to Miwa and Bansal (2016) , who rely on NLP tools, the baseline performs within a reasonable margin (less than 1%) on the joint task." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-103", "text": "On the other hand, Li et al. (2017) use the same model for the ADE biomedical dataset, where we report a 2.5% overall improvement." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-104", "text": "This indicates that NLP tools are not always accurate for various contexts." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-105", "text": "For the CoNLL04 dataset, we use two evaluation settings." 
}, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-106", "text": "We use the relaxed evaluation similar to Gupta et al. (2016) ; Adel and Sch\u00fctze (2017) on the EC task." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-107", "text": "The baseline model outperforms the state-of-the-art models that do not rely on manually extracted features (>4% improvement for both tasks), since we directly model the whole sentence, instead of just considering pairs of entities." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-108", "text": "Moreover, compared to the model of Gupta et al. (2016) that relies on complex features, the baseline model performs within a margin of 1% in terms of overall F 1 score." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-109", "text": "We also report NER results on the same dataset and improve the overall F 1 score by \u223c1% compared to Miwa and Sasaki (2014) , indicating that our automatically extracted features are more informative than the hand-crafted ones." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-110", "text": "The performance improvement of these automatically extracted features is mainly due to the shared LSTM layer, which learns to automatically generate feature representations of entities and their corresponding relations within a single model." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-111", "text": "For the DREC dataset, we use two evaluation methods." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-112", "text": "In the boundaries evaluation, the baseline has an improvement of \u223c3% on both tasks compared to Bekoulis et al. (2018a) , whose quadratic scoring layer complicates NER." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-113", "text": "Table 1 and Fig. 2 show the effectiveness of the adversarial training on top of the baseline model." 
}, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-114", "text": "In all of the experiments, AT improves the predictive performance of the baseline model in the joint setting." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-115", "text": "Moreover, as seen in Fig. 2 , the performance of the models using AT is closer to maximum even from the early training epochs." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-116", "text": "Specifically, for ACE04, there is an improvement in both tasks as well as in the overall F 1 performance (0.4%)." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-117", "text": "For CoNLL04, we note an improvement in the overall F 1 of 0.4% for the EC and 0.8% for the NER tasks, respectively." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-118", "text": "For the DREC dataset, in both settings, there is an overall improvement of \u223c1%." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-119", "text": "Figure 2 shows that from the first epochs, the model obtains its maximum performance on the DREC validation set." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-120", "text": "Finally, for ADE, our AT model beats the baseline F 1 by 0.7%." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-121", "text": "----------------------------------" }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-122", "text": "**RESULTS**" }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-123", "text": "Our results demonstrate that AT outperforms the neural baseline model consistently, considering our experiments across multiple and more diverse datasets than typical related works." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-124", "text": "The improvement of AT over our baseline (depending on the dataset) ranges from \u223c0.4% to \u223c0.9% in terms of overall F 1 score." 
}, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-125", "text": "This seemingly small performance increase is mainly due to the limited performance benefit for the NER component, which is in accordance with the recent advances in NER using neural networks that report similarly small gains (e.g., the performance improvement in Ma and Hovy (2016) and Lample et al. (2016) on the CoNLL-2003 test set is 0.01% and 0.17% F 1 percentage points, while in the work of Yasunaga et al. (2018) , a 0.07% F 1 improvement on CoNLL-2000 using AT for NER is reported)." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-126", "text": "However, the relation extraction performance increases by \u223c1% F 1 scoring points, except for the ACE04 dataset." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-127", "text": "Further, as seen in Fig. 2 , the improvement for CoNLL04 is particularly small on the evaluation set." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-128", "text": "This may indicate a correlation between the dataset size and the benefit of adversarial training in the context of joint models, but this needs further investigation in future work." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-129", "text": "----------------------------------" }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-130", "text": "**CONCLUSION**" }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-131", "text": "We proposed to use adversarial training (AT) for the joint task of entity recognition and relation extraction." }, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-132", "text": "The contribution of this study is twofold: (i) investigation of the consistent effectiveness of AT as a regularization method over a multi-context baseline joint model, with (ii) a large scale experimental evaluation." 
}, { "sent_id": "c608567abe72c75bbbc8eb917ab5d3-C001-133", "text": "Experiments show that AT improves the results for each task separately, as well as the overall performance of the baseline joint model, while reaching high performance already during the first epochs of the training procedure." } ], "y": { "@BACK@": { "gold_contexts": [ [ "c608567abe72c75bbbc8eb917ab5d3-C001-7" ], [ "c608567abe72c75bbbc8eb917ab5d3-C001-19", "c608567abe72c75bbbc8eb917ab5d3-C001-20", "c608567abe72c75bbbc8eb917ab5d3-C001-21", "c608567abe72c75bbbc8eb917ab5d3-C001-22" ] ], "cite_sentences": [ "c608567abe72c75bbbc8eb917ab5d3-C001-7", "c608567abe72c75bbbc8eb917ab5d3-C001-21", "c608567abe72c75bbbc8eb917ab5d3-C001-22" ] }, "@USE@": { "gold_contexts": [ [ "c608567abe72c75bbbc8eb917ab5d3-C001-12" ], [ "c608567abe72c75bbbc8eb917ab5d3-C001-78", "c608567abe72c75bbbc8eb917ab5d3-C001-79" ] ], "cite_sentences": [ "c608567abe72c75bbbc8eb917ab5d3-C001-12", "c608567abe72c75bbbc8eb917ab5d3-C001-79" ] }, "@DIF@": { "gold_contexts": [ [ "c608567abe72c75bbbc8eb917ab5d3-C001-102" ] ], "cite_sentences": [ "c608567abe72c75bbbc8eb917ab5d3-C001-102" ] } } }, "ABC_307c18e2928c4a45f574a9c3a36b76_31": { "x": [ { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-2", "text": "Can language analysis reveal the underlying social power relations that exist between participants of an interaction?" }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-3", "text": "Prior work within NLP has shown promise in this area, but the performance of automatically predicting power relations using NLP analysis of social interactions remains wanting." 
}, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-4", "text": "In this paper, we present a novel neural architecture that captures manifestations of power within individual emails which are then aggregated in an order-preserving way in order to infer the direction of power between pairs of participants in an email thread." }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-5", "text": "We obtain an accuracy of 80.4%, a 10.1% improvement over state-of-the-art methods, in this task." }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-6", "text": "We further apply our model to the task of predicting power relations between individuals based on the entire set of messages exchanged between them; here also, our model significantly outperforms the 70.0% accuracy using prior state-of-the-art techniques, obtaining an accuracy of 83.0%." }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-7", "text": "----------------------------------" }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-9", "text": "With the availability and abundance of linguistic data that captures different avenues of human social interactions, there is an unprecedented opportunity to expand NLP to not only understand language, but also to understand the people who speak it and the social relations between them." }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-10", "text": "Social power structures are ubiquitous in human interactions, and since power is often reflected through language, computational research at the intersection of language and power has gained interest recently." 
}, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-11", "text": "This research has been applied to a wide array of domains such as Wikipedia talk pages (Strzalkowski et al., 2010; Taylor et al., 2012; Danescu-Niculescu-Mizil et al., 2012; Swayamdipta and Rambow, 2012) , blogs (Rosenthal, 2014) as well as workplace interactions (Bramsen et al., 2011; Gilbert, 2012; Prabhakaran, 2015) ." }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-12", "text": "The corporate environment is one social context in which power dynamics have a clearly defined structure and shape the interactions between individuals, making it an interesting case study on how language and power interact." }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-13", "text": "Organizations stand to benefit greatly from being able to detect power dynamics within their internal interactions, in order to address disparities and ensure inclusive and productive workplaces." }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-14", "text": "For instance, (Cortina et al., 2001) reports that women are more likely to experience incivility, often from superiors." }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-15", "text": "It has also been shown that incivility may breed more incivility (Harold and Holtz, 2015) , and that it can lead to increased stress and lack of commitment (Miner et al., 2012) ." }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-16", "text": "Prior work has investigated the use of NLP techniques to study manifestations of different types of power using the Enron email corpus (Diesner et al., 2005; Prabhakaran et al., 2012; Prabhakaran and Rambow, 2013; Prabhakaran and Rambow, 2014) ." }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-17", "text": "While early work (Bramsen et al., 2011; Gilbert, 2012) focused on surface level lexical features aggregated at corpus level, more recent work has looked into the thread structure of emails as well (Prabhakaran and Rambow, 2014) ." 
}, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-18", "text": "However, both (Bramsen et al., 2011; Gilbert, 2012) and (Prabhakaran and Rambow, 2014 ) group all messages sent by an individual to another individual (at the corpus-level and at the thread-level, respectively) and rely on word-ngram * Authors (listed in alphabetical order) contributed equally." }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-19", "text": "----------------------------------" }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-20", "text": "**ARXIV:1807.06557V1 [CS.CL] 17 JUL 2018**" }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-21", "text": "based features extracted from this concatenated text to infer power relations." }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-22", "text": "They ignore the fact that the text comes from separate emails, and that there is a sequential order to them." }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-23", "text": "In this paper, we propose a hierarchical deep learning architecture for power prediction, using a combination of Convolutional Neural Networks (CNN) to capture linguistic manifestations of power in individual emails, and a Long Short-Term Memory (LSTM) that aggregates their outputs in an order-preserving fashion." }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-24", "text": "We obtain significant improvements in accuracy on the corpus-level task (82.4% over 70.0%) and on the thread-level task (80.4% over 73.0%) over prior state-of-the-art techniques." }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-25", "text": "----------------------------------" }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-26", "text": "**DATA AND PROBLEM FORMULATION**" }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-27", "text": "We use the version of the Enron Email corpus released by Agarwal et al. 
(2012) that captures the organizational power relation between 13,724 pairs of Enron employees, in addition to the reconstructed thread structure of email messages added by Yeh and Harnly (2006) ." }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-28", "text": "We mask greetings and signature lines in the email content to prevent our model from being biased by the roles held by specific employees." }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-29", "text": "Prior work on NLP approaches to predict power in organizational email has used two different problem formulations -Per-Thread and Grouped." }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-30", "text": "We investigate both formulations in this paper." }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-31", "text": "Table 1 shows the number of data instances in each problem formulation." }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-32", "text": "Per-Thread: This formulation was introduced by Prabhakaran and Rambow (2014) in which, for a given thread t and a pair of related interacting participant pairs (A, B), the direction of power between A and B is predicted (where the assignment of labels A and B is arbitrary)." }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-33", "text": "The participants in these pairs are 1) interacting: at least one message exists in the thread such that either A is the sender and B is a recipient or vice versa, and 2) related: A and B are related by a dominance relation (either superior or subordinate) based on the organizational hierarchy." }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-34", "text": "As in (Prabhakaran and Rambow, 2014) , we exclude pairs of employees who are peers, and we use the same train-dev-test splits so our results are comparable." 
}, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-35", "text": "Grouped: Here, we group all emails A sent to B across all threads in the corpus, and vice versa, and use these sets of emails to predict the power relation between A and B. This formulation is similar to those in (Bramsen et al., 2011; Gilbert, 2012) , but our results are not directly comparable since, unlike them, we rely on the ground truth of power relations from (Agarwal et al., 2012) ; however, we created an SVM model that uses word-ngram features similar to theirs as a baseline for our proposed neural architectures." }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-36", "text": "----------------------------------" }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-37", "text": "**METHODS**" }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-38", "text": "The inputs to our models take on two forms: Lexical features: We represent each email as a series of tokenized words, each of which is represented by a 100-dimensional GloVe vector pre-trained on Wikipedia and Gigaword (Pennington et al., 2014) ." }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-39", "text": "We cap the email length at a maximum of 200 words." }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-40", "text": "Non-lexical features: We incorporate the structural non-lexical features identified as significant by Prabhakaran and Rambow (2014) for the Grouped problem formulation." }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-41", "text": "We used (1) the average number of recipients and (2) the average number of words in each email for each individual; these features were concatenated into a single input vector." 
}, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-42", "text": "We investigate the following three network architectures, in increasing order of complexity, to train our model: Approach 1: Batched emails (Batched-CNN). In this model (see Figure 1) , all of A's emails to B are batched and fed into a Convolutional Neural Network (CNN), and the same operation is performed for B's emails to A. The format of this input is described earlier in this section." }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-43", "text": "This representation can be thought of as a neural equivalent of the SVM-based approaches in prior work, since they merge together all emails in either direction as a single unit." }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-44", "text": "Then, the output of these two CNNs is merged with the non-lexical features from A's emails and B's emails, passed through a dense layer with rectified linear unit (ReLU) activation, and fed to a sigmoid classifier that predicts the power relation between A and B. Approach 2: Separated emails (Separated-CNN)." }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-45", "text": "In this model (see Figure 2) , we capture the essence of individual emails by separating them in the model input." }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-46", "text": "As in Batched-CNN, we separate A's and B's emails, but here we feed each email as input to a CNN." }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-47", "text": "The motivation here is to first capture local patterns from individual emails." }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-48", "text": "We then merge the output of these CNNs with the non-lexical features from A's and B's emails, pass this to a dense layer with ReLU activation, and pass the result to a sigmoid classifier that predicts the power relation." }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-49", "text": "Approach 3: Sequential emails (Sequential-CNN-LSTM)." 
}, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-50", "text": "Finally, we use a third model where we account for the temporal order of emails, which may be important in the case of the Per-Thread formulation." }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-51", "text": "In this model (see Figure 3) , we separate each individual's emails, feed each email to a CNN, and pass the sequence of CNN outputs for each email to a Long Short-Term Memory network (LSTM) for that individual." }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-52", "text": "We then merge the resulting output of the two LSTMs with the non-lexical features from each individual's emails, pass it on to a dense layer with ReLU activation, and then to a sigmoid classifier for the final prediction." }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-53", "text": "----------------------------------" }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-54", "text": "**EXPERIMENTS AND RESULTS**" }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-55", "text": "We use support vector machine (SVM) based approaches as our baseline, since they are the state-of-the-art in this problem (Prabhakaran and Rambow, 2014; Bramsen et al., 2011; Gilbert, 2012) ." }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-56", "text": "We use the performance reported by (Prabhakaran and Rambow, 2014) using SVM as baseline for the Per-Thread formulation (using the same train-dev-test splits) and implemented an SVM baseline for the Grouped formulation (not directly comparable to performance reported by (Bramsen et al., 2011; Gilbert, 2012) )." }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-57", "text": "For each of our neural net models, we trained for 30-70 epochs until the performance on the development set stopped improving, in order to avoid overfitting." 
}, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-58", "text": "We used Hyperas to tune hyperparameters on our development dataset for the same set of parameter options for each task formulation, varying activation functions, hidden layer size, batch size, dropout, number of filters, and number of words to include per email." }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-59", "text": "Table 2 presents the accuracy obtained using different models." }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-60", "text": "All of our models significantly outperformed the SVM baseline in both task formulations." }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-61", "text": "In the Per-Thread formulation, we obtained the best accuracy of 80.4% using the Sequential-CNN-LSTM approach, compared to the 73.0% reported by (Prabhakaran and Rambow, 2014) ." }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-62", "text": "This is also a marked improvement over the simpler Batched-CNN and Separated-CNN models." }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-63", "text": "This suggests that both temporal and local email features aid in the power prediction task within the Per-Thread formulation." }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-64", "text": "In the Grouped formulation, the Separated-CNN model obtained the best accuracy of 83.0%, outperforming the Sequential-CNN-LSTM accuracy of 82.4%." }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-65", "text": "We hypothesize that this is because the grouped formulation does not inherently have a temporal structure between emails, unlike the thread formulation where Sequential-CNN-LSTM is able to tap into the temporal structure." }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-66", "text": "Table 3 presents a few emails from our corpus, along with the true and predicted labels for the power relation between their sender and recipient(s)." 
}, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-67", "text": "Our model seems to pick up on linguistic signals of lack of power such as relinquishing agency (let me know who you'd like us to work with), and status reports (model is nearly completed), as well as overt displays of power such as I personally would like to see the results and we need to end all payments." }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-68", "text": "On the other hand, the model picks up on the phrasing in don't use the ftp site as displaying superiority while the superiority displayed here may have been derived from the expertise the subordinate has in file-transfer protocols." }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-69", "text": "Similarly, the model may have misunderstood the overtly polite phrasings in the last email sent by a superior to be subordinate-like behavior." }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-70", "text": "This sheds light on an important challenge in this task: superiors don't express their superiority in all emails, and subordinates may sometimes display power derived from other sources such as expertise." }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-71", "text": "In such cases where text features alone are not informative enough, signals from additional non-lexical features may be key to accurate classification." }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-72", "text": "----------------------------------" }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-73", "text": "**CONCLUDING DISCUSSIONS**" }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-74", "text": "In this paper, we investigated the intersection between language and power in the corporate domain via neural architectures grounded in an understanding of how expressions of power unfold in email." 
}, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-75", "text": "Our Sequential-CNN-LSTM model, which utilizes an LSTM to capture the temporal relations underlying per-email features, achieved 80.4% accuracy in predicting the direction of power between participant pairs in individual email threads, which is a 10.1% accuracy improvement over the state-of-the-art approach (Prabhakaran and Rambow, 2014) ." }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-76", "text": "Our Separated-CNN model also obtains an accuracy of 83.0% in predicting power relations between individuals based on the entire set of messages exchanged between them, a significant boost over the 70.0% accuracy obtained using traditional methods." }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-77", "text": "We also present a qualitative error analysis that sheds light on the patterns that confuse the model." }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-78", "text": "To further our work, we plan to granularize the level at which features are learned." }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-79", "text": "We hypothesize that by training a CNN on each sentence rather than on each email, the model will better capture mid-level indicators of power that occur between the word level and email level." }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-80", "text": "We will also investigate ways to better incorporate structural features by accounting for their relevance to a holistic judgment of power; for example, features like gender and temporal position in a thread are more suited to merge with a higher level of the architecture like the per-individual LSTMs, while features like number of email tokens are more suited to merge at the low level of the per-email CNNs." }, { "sent_id": "307c18e2928c4a45f574a9c3a36b76-C001-81", "text": "Lastly, we plan to incorporate additional datasets such as the Avocado Research Email Collection (Oard et al., 2015) to study cross-corpora performance." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "307c18e2928c4a45f574a9c3a36b76-C001-10", "307c18e2928c4a45f574a9c3a36b76-C001-11", "307c18e2928c4a45f574a9c3a36b76-C001-12" ], [ "307c18e2928c4a45f574a9c3a36b76-C001-17" ], [ "307c18e2928c4a45f574a9c3a36b76-C001-18", "307c18e2928c4a45f574a9c3a36b76-C001-21", "307c18e2928c4a45f574a9c3a36b76-C001-22" ], [ "307c18e2928c4a45f574a9c3a36b76-C001-55" ] ], "cite_sentences": [ "307c18e2928c4a45f574a9c3a36b76-C001-11", "307c18e2928c4a45f574a9c3a36b76-C001-17", "307c18e2928c4a45f574a9c3a36b76-C001-18", "307c18e2928c4a45f574a9c3a36b76-C001-55" ] }, "@MOT@": { "gold_contexts": [ [ "307c18e2928c4a45f574a9c3a36b76-C001-10", "307c18e2928c4a45f574a9c3a36b76-C001-11", "307c18e2928c4a45f574a9c3a36b76-C001-12" ] ], "cite_sentences": [ "307c18e2928c4a45f574a9c3a36b76-C001-11" ] }, "@DIF@": { "gold_contexts": [ [ "307c18e2928c4a45f574a9c3a36b76-C001-35" ], [ "307c18e2928c4a45f574a9c3a36b76-C001-56" ] ], "cite_sentences": [ "307c18e2928c4a45f574a9c3a36b76-C001-35", "307c18e2928c4a45f574a9c3a36b76-C001-56" ] }, "@EXT@": { "gold_contexts": [ [ "307c18e2928c4a45f574a9c3a36b76-C001-35" ], [ "307c18e2928c4a45f574a9c3a36b76-C001-56" ] ], "cite_sentences": [ "307c18e2928c4a45f574a9c3a36b76-C001-35", "307c18e2928c4a45f574a9c3a36b76-C001-56" ] }, "@SIM@": { "gold_contexts": [ [ "307c18e2928c4a45f574a9c3a36b76-C001-35" ] ], "cite_sentences": [ "307c18e2928c4a45f574a9c3a36b76-C001-35" ] }, "@USE@": { "gold_contexts": [ [ "307c18e2928c4a45f574a9c3a36b76-C001-55" ] ], "cite_sentences": [ "307c18e2928c4a45f574a9c3a36b76-C001-55" ] } } }, "ABC_a869bebe1744e3a7c71cb0f6fed12c_31": { "x": [ { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-23", "text": "Contribution." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-2", "text": "Deep Learning has drastically reshaped virtually all areas of NLP." 
}, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-3", "text": "Yet on the downside, it is commonly thought to be dependent on vast amounts of training data." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-4", "text": "As such, these techniques appear ill-suited for areas where annotated data is limited, like emotion analysis, with its many nuanced and hard-to-acquire annotation formats, or other low-data scenarios encountered in under-resourced languages." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-5", "text": "In contrast to this popular notion, we provide empirical evidence from three typologically diverse languages that today's favorite neural architectures can be trained on a few hundred observations only." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-6", "text": "Our results suggest that high-quality, pre-trained word embeddings are crucial for achieving high performance despite such strong data limitations." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-7", "text": "----------------------------------" }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-9", "text": "Deep Learning (DL) has radically changed the rules of the game in NLP by boosting performance figures in almost all applications areas." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-10", "text": "Yet in contrast to more conventional techniques, such as ngram based linear models, neural methodologies seem to rely on vast amounts of training dataas is obvious in areas such as machine translation or word representation learning (Vaswani et al., 2017; Mikolov et al., 2013) ." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-11", "text": "With this profile, DL seems ill suited for many prediction tasks in Sentiment and Subjectivity Analysis (Balahur et al., 2014) ." 
}, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-12", "text": "For the widely studied problem of polarity prediction in social media (positive vs. negative emotion or evaluation, only; Rosenthal et al. (2017) ), training data is relatively abundant." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-13", "text": "However, annotating for more complex representations of affective states-such as Basic Emotions (Ekman, 1992) or ValenceArousal-Dominance (Bradley and Lang, 1994 )-seems to be significantly harder in terms of both time consumption and inter-annotator agreement (IAA) (Strapparava and Mihalcea, 2007) ." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-14", "text": "Nevertheless, these more complex models of emotion rapidly gained popularity in recent years due to their increased expressiveness (Wang et al., 2016; Buechel and Hahn, 2017; Sedoc et al., 2017) ." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-15", "text": "For the social media domain, this lack of gold data can be partly countered by (pre-) training with distant supervision which uses signals such as emojis or hashtags as a surrogate for manual annotation (Mohammad and Kiritchenko, 2015; Felbo et al., 2017; Abdul-Mageed and Ungar, 2017) ." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-16", "text": "Yet this procedure is less appropriate for other target domains as well as for predicting other subjective phenomena such as empathy, epistemic modality or personality (Khanpour et al., 2017; Rubin, 2007; Liu et al., 2017) ." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-17", "text": "These problems only intensify for under-resourced languages." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-18", "text": "Besides pre-training the entirety of the model with distant supervision, an alternative strategy is pre-training word representations, only." 
}, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-19", "text": "This approach is feasible for a wide range of languages since raw text is much more readily available than gold data, e.g., through Wikipedia ." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-20", "text": "Unfortunately, it has been frequently argued that pre-trained embeddings are ill-suited for sentiment and emotion analysis since they do not capture sufficient affective information." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-21", "text": "This has been illustrated by word pairs like good and bad which have highly similar vector representations but opposing polarity (Tang et al., 2014; Yu et al., 2017; Khosla et al., 2018) ." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-22", "text": "However, to the best of our knowledge, no experimental data have been provided in support of this claim." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-24", "text": "Both claims, the need for large amounts of gold data and the lack of affective information in pre-trained word embeddings, may largely impede the feasibility of DL in lowresource scenarios." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-25", "text": "Yet, in this paper, we provide strong, first-time evidence that both, in actuality, turn out to be misconceptions." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-26", "text": "(Strapparava and Mihalcea, 2007) ." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-27", "text": "----------------------------------" }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-28", "text": "**DATA**" }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-29", "text": "For our study, we selected corpora of small size (\u22641000) where each instance bears numerical ratings regarding multiple emotion variables." 
}, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-30", "text": "1 According to these criteria, we came up with the following four data sets covering three typologically diverse languages (exemplary entries in Table 1) ." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-31", "text": "SE07: The test set of SemEval 2007 Task 14 (Strapparava and Mihalcea, 2007) comprises 1000 English news headlines which are annotated according to six Basic Emotions, joy, anger, sadness, fear, disgust, and surprise on a [0; 100]-scale (BE6 annotation format)." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-32", "text": "ANET: The Affective Norms for English Text (Bradley and Lang, 2010) are an adaptation of the popular lexical database ANEW (Bradley and Lang, 1999 ) to short texts." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-33", "text": "The corpus comprises 120 situation description which are annotated according to Valence, Arousal, and Dominance on a 9-point scale (VAD annotation format)." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-34", "text": "ANPST and MAS: The Affective Norms of Polish Short Texts (Imbir, 2017) ) and the Minho Affective Sentences (Pinheiro et al., 2017) can be seen as loose adaptations of ANET, very similar in methodology, but different in size and linguistic characteristics (see Table 1 )." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-35", "text": "Both are annotated according to VAD." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-36", "text": "Additionally MAS is also annotated according the the first five Basic Emotions (omitting 'surprise') on a 5-point scale (BE5)." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-37", "text": "To increase both performance and reproducibility we employ pre-trained, publicly available word embeddings." 
}, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-38", "text": "We rely mostly on FastText vectors (Bojanowski et al., 2017) , yet for SE07 we use the word2vec embeddings 2 (Mikolov et al., 2013) trained on similar data than SE07 comprises (newswire material)." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-39", "text": "For ANET, we rely on the FastText embeddings trained on Common Crawl ." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-40", "text": "For ANPST and MAS, we use the FastText embeddings by trained on the respective Wikipedias." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-41", "text": "An overview of our corpora and embedding models is given in Table 2 ." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-42", "text": "----------------------------------" }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-43", "text": "**METHODS**" }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-44", "text": "We provide two distinct linear baseline models which both rely on Ridge regression, an 2 -regularized version of linear regression." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-45", "text": "The first one, Ridge ngram , is based on n-gram features where we use n \u2208 {1, 2, 3}. The second one, Ridge BV uses bag-of-vectors features, i.e., the pointwise mean of the embeddings of the words in a text." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-46", "text": "Regarding the deep learning approaches, we compare Feed-Forward Networks (FFN), Gated Recurrent Unit Networks (GRU), Long Short-Term Memory Networks (LSTM), Convolutional Neural Networks (CNN), as well as a combination of the latter two (CNN-LSTM) (Cho et al., 2014; Hochreiter and Schmidhuber, 1997; Kalchbrenner et al., 2014) ." 
}, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-47", "text": "Since holding out a dev set from the already extremely limited training data is not feasible, we decided to instead use constant hyperparameter settings across all corpora, thus also demonstrating the robustness of our models (see Section 4)." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-48", "text": "Moreover, a large number of hyperparameters will even be held constant across different model architectures." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-49", "text": "These universal settings are as follows:" }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-50", "text": "The input to our DL models is based on pretrained word vectors of 300 dimensions." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-51", "text": "ReLu activation was used everywhere except in recurrent layers." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-52", "text": "Dropout is used for regularization with a probability of .2 for embedding layers and .5 for dense layers following the recommendations by Srivastava et al. (2014) ." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-53", "text": "We use .5 dropout also on other types of layers where it would conventionally be consider too high (e.g. on max pooling layers)." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-54", "text": "Our models are trained for 200 epochs using the Adam optimizer (Kingma and Ba, 2015) with fixed learning rate of .001 and batch size of 32." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-55", "text": "3 Word embeddings were not updated during training." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-56", "text": "Since, in compliance with our gold data, we treat emotion analysis as regression problem (Buechel and Hahn, 2016 ) the output layers of our models consist of an affine transformation, i.e., a dense layer without non-linearity." 
}, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-57", "text": "To reduce the risk of overfitting on such small data sets, we used relatively simple models both in terms of number of layers and units in them (mostly 2 and 128, respectively)." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-58", "text": "Moreover, our models have one distinct output neuron for each variable of the respective annotation format (e.g., 3 for VAD)." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-59", "text": "Yet the weights and biases of all hidden layers are shared across the outputs." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-60", "text": "Arguably, this set-up qualifies as a mild form of multi-3 Training each of the individual models took about a minute on a GeForce GTX 1080 Ti, at most." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-61", "text": "----------------------------------" }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-62", "text": "**S E 0 7**" }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-63", "text": "A N E T A N P S T M A S V A D M A S B E 5 M e a n task learning (Caruana, 1997) , a machine learning techniques which has been shown to greatly decrease the risk of overfitting (Baxter, 1997) and to work well for various NLP tasks (S\u00f8gaard and Goldberg, 2016; Peng et al., 2017 )." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-64", "text": "An overview of our models 4 and details about their individual hyperparameter settings are provided in Table 3 ." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-65", "text": "----------------------------------" }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-66", "text": "**RESULTS**" }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-67", "text": "Performance will be measured as Pearson correlation r between the predicted values and human gold ratings (one r-value per variable of the target representation, often averaged over all of them)." 
}, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-68", "text": "Repeated Cross-Validation." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-69", "text": "Conventional 10-fold cross-validation (CV) would lead to very small test splits (only 12 instances in the case of ANET) thus causing high variance between the individual splits and, ultimately, even regarding the average of all 10 runs." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-70", "text": "Therefore, we repeat 10-fold CV ten times (10\u00d710-CV) with different data splits, then averaging the results (Dietterich, 1998) ." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-71", "text": "To further increase reliability, identical data splits were used for each of the approaches under comparison." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-72", "text": "We treat the VAD and the BE5 ratings of the MAS corpus as two different data sets (MAS VAD and MAS BE5 ), leading to a total of 5 conditions J o y A n g e r S a d n e s s F e a r D i s g u s t S u r p r i s e M e a n (see Table 4 )." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-73", "text": "Overall, the DL approaches yield a satisfying performance of at least r=.6 as average over all corpora, despite the small data size." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-74", "text": "All of them massively outperform Ridge ngram which represents more conventional methodologies popular before the wide adaptation of embedding-and DLbased approaches." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-75", "text": "The results are especially good for GRU, LSTM, CNN-LSTM and FFN, each one with an average performance of r\u2265.64." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-76", "text": "Overall, the GRU performs best-being superior in all but one condition where the FFN comes out on top." 
}, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-77", "text": "Perhaps surprisingly, also Ridge BV performs very competitive." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-78", "text": "Given its low computational cost and its robustness across data sets, our results indicate that this model constitutes an excellent baseline." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-79", "text": "It also suggests that the high quality of the pretrained embedding models may be one of the keyfactors for our generally very strong results because Ridge BV heavily relies on lexical signals." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-80", "text": "In line with that, we found in a supplemental experiment that not using pre-trained embeddings but instead learning them during training significantly reduces performance, e.g., by over 15%-points for the GRU on SE07." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-81", "text": "We now compare our best performing model against previously reported results for the SE07 corpus." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-82", "text": "Table 5 provides the performance of the winning system of the original shared task (WIN-NER; Chaumartin (2007) ), the IAA as reported by the organizers (Strapparava and Mihalcea, 2007) , the performance by Beck (2017) , the highest one reported for this data set so far (BECK), as well as the results of our GRU from the 10\u00d710-CV." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-83", "text": "As can be seen, the GRU established a new state-of-the-art result and even achieves superhuman performance." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-84", "text": "This may sound improbable at first glance." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-85", "text": "However, Strapparava and Mihalcea (2007) employ a rather weak notion of human performance which is-broadly speaking-based on the reliability of a single human rater." 
}, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-86", "text": "5 Interestingly, the GRU shows particularly large improvements over human performance for categories where the IAA is low (anger, disgust, and surprise) which might be an effect of the additional supervision introduced by multi-task learning." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-87", "text": "Training Size vs. Model Performance." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-88", "text": "In our last analysis, again focusing on the SE07 corpus, we examine the behavior of our full set of models when varying the amount of training data." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-89", "text": "For each number N \u2208{1, 10, 20, ..., 100, 200, ..., 900}, we randomly sampled N instances of the entirety of the corpus for training and tested on the held out data." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-90", "text": "This procedure was repeated 100 times for each of the training data sizes before averaging the results." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-91", "text": "Each of the models was evaluated with the identical data splits." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-92", "text": "The outcome of this experiment is depicted in Figure 1 ." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-93", "text": "As can be seen, recurrent models suffer only a moderate loss of performance down to a third of the original training data (about 300 observations)." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-94", "text": "The CNN, FFN and Ridge BV model remain stable even longer-their performance only begins to decline rapidly at about 100 instances." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-95", "text": "Astonishingly, the CNN achieves human-performance even with as little 200 training samples." 
}, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-96", "text": "In contrast, Ridge nGram declines more steadily yet its overall performance on larger training sets is much lower." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-97", "text": "CNN-LSTM on four topologically diverse data sets of sizes ranging between 1000 and only 120 instances." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-98", "text": "Counterintuitively, we found that all DL approaches performed well under every experimental condition." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-99", "text": "Our proposed GRU model even established a novel state-of-the-art result on the SemEval 2007 test set (Strapparava and Mihalcea, 2007) outperforming human reliability." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-100", "text": "Moreover, it has been frequently argued that pre-trained word embeddings do not comprise sufficient affective information to be used verbatim in emotion analysis." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-101", "text": "We here provided evidence that in actuality the opposite holds-high-quality pre-trained word embeddings are instrumental in achieving strong results in low-resource scenarios and largely boost performance independent of model type." }, { "sent_id": "a869bebe1744e3a7c71cb0f6fed12c-C001-102", "text": "Hence, this contribution pointed out two obstructive misconceptions thus opening up DL for applications in low-resource scenarios." 
} ], "y": { "@MOT@": { "gold_contexts": [ [ "a869bebe1744e3a7c71cb0f6fed12c-C001-12", "a869bebe1744e3a7c71cb0f6fed12c-C001-13" ], [ "a869bebe1744e3a7c71cb0f6fed12c-C001-24", "a869bebe1744e3a7c71cb0f6fed12c-C001-25", "a869bebe1744e3a7c71cb0f6fed12c-C001-26" ], [ "a869bebe1744e3a7c71cb0f6fed12c-C001-83", "a869bebe1744e3a7c71cb0f6fed12c-C001-84", "a869bebe1744e3a7c71cb0f6fed12c-C001-85" ] ], "cite_sentences": [ "a869bebe1744e3a7c71cb0f6fed12c-C001-13", "a869bebe1744e3a7c71cb0f6fed12c-C001-85" ] }, "@BACK@": { "gold_contexts": [ [ "a869bebe1744e3a7c71cb0f6fed12c-C001-12", "a869bebe1744e3a7c71cb0f6fed12c-C001-13" ], [ "a869bebe1744e3a7c71cb0f6fed12c-C001-24", "a869bebe1744e3a7c71cb0f6fed12c-C001-25", "a869bebe1744e3a7c71cb0f6fed12c-C001-26" ] ], "cite_sentences": [ "a869bebe1744e3a7c71cb0f6fed12c-C001-13" ] }, "@USE@": { "gold_contexts": [ [ "a869bebe1744e3a7c71cb0f6fed12c-C001-31" ], [ "a869bebe1744e3a7c71cb0f6fed12c-C001-81", "a869bebe1744e3a7c71cb0f6fed12c-C001-82" ], [ "a869bebe1744e3a7c71cb0f6fed12c-C001-99" ] ], "cite_sentences": [ "a869bebe1744e3a7c71cb0f6fed12c-C001-31", "a869bebe1744e3a7c71cb0f6fed12c-C001-82", "a869bebe1744e3a7c71cb0f6fed12c-C001-99" ] }, "@DIF@": { "gold_contexts": [ [ "a869bebe1744e3a7c71cb0f6fed12c-C001-83", "a869bebe1744e3a7c71cb0f6fed12c-C001-84", "a869bebe1744e3a7c71cb0f6fed12c-C001-85" ], [ "a869bebe1744e3a7c71cb0f6fed12c-C001-99" ] ], "cite_sentences": [ "a869bebe1744e3a7c71cb0f6fed12c-C001-85", "a869bebe1744e3a7c71cb0f6fed12c-C001-99" ] } } }, "ABC_43422cf92d8cbb280c0b4c590632f1_31": { "x": [ { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-2", "text": "In recent literature, contextual pretrained Language Models (LMs) demonstrated their potential in generalizing the knowledge to several Natural Language Processing (NLP) tasks including supervised Word Sense Disambiguation (WSD), a challenging problem in the field of Natural Language Understanding (NLU)." 
}, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-3", "text": "However, word representations from these models are still very dense, costly in terms of memory footprint, as well as minimally interpretable." }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-4", "text": "In order to address such issues, we propose a new supervised biologically inspired technique for transferring large pre-trained language model representations into a compressed representation, for the case of WSD." }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-5", "text": "Our produced representation contributes to increase the general interpretability of the framework and to decrease memory footprint, while enhancing performance." }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-6", "text": "Preprint." }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-7", "text": "Under review." }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-8", "text": "----------------------------------" }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-10", "text": "Recently, large pretrained LMs such as ELMo [1] or BERT [2] on large corpora achieved state-of-theart performance on several downstream tasks in NLP." }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-11", "text": "These models are trained to predict the next word given a sequence of words or a masked word given it surrounding words in the same sentence." }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-12", "text": "These architectures are capable to incorporate into latent distributed representations the contextual information from different timescales." 
}, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-13", "text": "From the biological point of view, there are strong evidences that hierarchical cortical structures implement a similar mechanism, where the spatiotemporal pattern completion process is done recursively in order to extract the relevant information [3, 4] ." }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-14", "text": "When it comes to performing WSD, literature indicates several supervised techniques, where different types of classifiers can learn based on annotated datasets." }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-15", "text": "A particularly interesting approach to address this task is the cognitive one, where the phenomenon of WSD is treated by modeling the cognitive process through ACT (Adaptive Control of Thought) [5] combined with RACE/A (Retrieval by Accumulating Evidence in an Architecture) [6] , and such a modeling allows the outperforming of many systems for the task at hand [7] ." }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-16", "text": "The ideas that are presented in this paper with a particular application to contextual word representations are inspired by recent studies about physicochemical properties of the brain." }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-17", "text": "As explained in [8] , the brain is a very noisy medium." }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-18", "text": "More precisely, according to [9] and [10] respectively, neurons are subject to high erasure effects on their inputs (neurotransmitters are temporarily unavailable in a synapse to respond properly to an incoming spike) and are also the source of numerous spurious (non stimulated) spikes." }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-19", "text": "Because of this fluctuating behaviour it is difficult to accept that mental pieces of information could be borne by precise levels of neuronal activity." 
}, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-20", "text": "A way to overcome noise and inconstancies is to make information be expressed by assemblies of neurons instead of individual neurons." }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-21", "text": "Associative memories can be regarded as a good means to store pieces of information with high robustness towards erasures [3] ." }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-22", "text": "A strategy to combat uncertainty about neuronal activity is to materialize information by relative values instead of absolute ones." }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-23", "text": "For instance, this is the way that sparse associative memories based on neural cliques [11] are recovered through the Winner-Takes-All (WTA) mechanism which has shown high biological plausibility [12] ." }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-24", "text": "Furthermore, the WTA principle is applied in several studies [13, 14, 15] for increase activation sparseness in neural networks." }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-25", "text": "The simplest way to implement the WTA mechanism is to organize neurons into groups or clusters, thus making it possible to achieve simple pairwise comparisons between neurons of the same cluster." }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-26", "text": "This rationale was applied in [16, 14] ." }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-27", "text": "With this paper, we use an implementation based on the WTA mechanism in order to produce new contextual word representations,that are overcomplete, compositional, discrete and sparse." 
}, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-28", "text": "----------------------------------" }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-29", "text": "**RELATED WORK**" }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-30", "text": "Starting from the biological perspective, by the comprehension of brain access schemes, it can be infered that the combination of binary synaptic weights, sparsely encoded memory patterns and local learning rules is one way of producing good representation [17, 18] ." }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-31", "text": "And as fast information retrieval remains a critical point to the real world applications, such a method proves again its power as the computation made is basically a vector-matrix product (in the binary case this can still be considered just counting), followed by an operation of threshold [19] ." }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-32", "text": "Recent work has started to address as well the problematic of memory footprint when learning word embeddings [20, 15] ." }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-33", "text": "In order to mitigate such an issue, [15] proposed a method that learns multi-codebook." }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-34", "text": "Therefore, each word is now represented by a hash code of discrete numbers, that have to be defined in a way that similar words will have similar representations as well as a factor of difference, to capture nuances." 
}, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-35", "text": "Presenting a different use for the previous work done by the field of codebooks compression based source coding, known as product quantization (PQ) [21] and additive quantization [22] , they show that by minimizing the squared distance between both distributions (baseline and composed embeddings), and using a direct learning approach for the codes in an end-to-end neural network, with a Gumbel-softmax layer [23] to encourage the discreteness [15] , it is possible to construct the word embeddings radically reducing the number of parameters without hurting performance." }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-36", "text": "These word codebooks are learned using a local WTA rule in the hidden layer of an autoencoder neural network." }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-37", "text": "Other works have also focused in accomplishing the task of modeling complex characteristics of word use, and its variations across different contexts." }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-38", "text": "[1] presented a new type of word representation, where word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pretrained on a large text corpus [1] ." }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-39", "text": "All those aspects allow richer word representations, and also the capture of important context dependant aspects." }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-40", "text": "And as the computation is made based on the internal states of the baseline architecture, such approach can be easily integrated to existing models." }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-41", "text": "Results have shown ELMo's capacity to boost performance in many NLP tasks, including WSD, using the approach of the 1-nearest neighbor, similarly to [24] and accordingly to the framework of evaluation from [25] ." 
}, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-42", "text": "As stated above, when the vocabulary is limited, the problem of compressing word embedding representations by contextual compositional codes is already addressed." }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-43", "text": "However, the problem of compressed context representation remains, as a result of the enormous variability of contexts for a single word." }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-44", "text": "With the current work, our goal is to treat this problem in the particular case of WSD." }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-45", "text": "----------------------------------" }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-46", "text": "**OUR FRAMEWORK**" }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-47", "text": "The proposed method is divided into three different steps." }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-48", "text": "In the first step, the features corresponding to the ambiguous word are extracted from the second layer of the ELMo model as pictured in Fig. 1 ." }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-49", "text": "Experimental results from [1] indicate that performing such extraction on the last layer of the LM (concatenating the representations from both directions) increase performance in comparison with doing it on the first one." }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-50", "text": "For the case of composed words (i.e. Mother-in-law), the average embedding is taken, in order to create a more robust representation." }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-51", "text": "In the second step, a deep autoencoder neural network model is optimized to reduce the mean squared error of reconstructed contextual features extracted from first step." 
}, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-52", "text": "This model implements the Gumbel Softmax reparametrization function [23] to obtain discrete activations in the last intermediate layer." }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-53", "text": "For the sake of simplicity, we adopted the same architecture as [15] , as shown in Fig. 2 ." }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-54", "text": "The third step relies on Sparse Associative Memories (SAM), neural networks able to store and retrieve sparse patterns from incomplete inputs." }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-55", "text": "Our choice to this technique comes from the fact the sparse projections produced by SAM's have been shown to be sufficient to uniquely encode a neural pattern [26] ." }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-56", "text": "The connections in sparse associative memories are completely binary (they either exist or not)." }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-57", "text": "At the beginning of training procedure, the graph that models the SAM is initialized without any connection." }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-58", "text": "Each node in the bottom-layer corresponds to a specific context usage of a word." }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-59", "text": "Thus, the learning procedure consists in adding connections to the network for each pattern to store." }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-60", "text": "The retrieving procedure starts with partial pattern, meaning that some of its active neurons are not activated in the beginning." }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-61", "text": "Then, it tries to find in each group of neurons, the neuron (or the neurons) that has the maximum number of connections with the active neurons (WTA algorithm)." 
}, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-62", "text": "For instance, the node n3 represents a context that fall is assign to the meaning be captured and the node n6 represents a context with the meaning descend in free fall under the influence of gravity of the word fall." }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-63", "text": "Table 1 : Table 1 : Baseline and variations of our approach with their respective compression rate and F1-score in supervised WSD evaluation" }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-64", "text": "----------------------------------" }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-65", "text": "**METHOD**" }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-66", "text": "----------------------------------" }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-67", "text": "**RESULTS**" }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-68", "text": "With the purpose of learning robust discrete representations in first step, we have used all sentences and word contexts from the One billion word dataset [27] to train the autoencoder model." }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-69", "text": "In order to make our framework comparable to the baseline methods, we have maintained the process of evaluation done by [25] ." }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-70", "text": "Compression rate values were computed considering the saved space when storing all WSD training ELMo contextual vectors." }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-71", "text": "As detailed in Table 1 , the framework can achieve high compression rates, while increasing performance." }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-72", "text": "In terms of interpretability, [28] point that sparse and overcomplete word vector representations are more interpretable than dense ones." 
}, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-73", "text": "In Figure 4 , the last two contexts usages have both similar meaning whereas the first context has different meaning from others, and this result is congruent to the hamming distance between different compositional codes: (0) and (1) is 26 ; (1) and (2) is 16; (0) and (2) is 25." }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-74", "text": "Discrete activations of contextual codes are human readable." }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-75", "text": "Therefore, it is possible to affirm that the premise of interpretability from [28] remains true in the case of contextual code representations." }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-76", "text": "----------------------------------" }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-77", "text": "**CONCLUSION**" }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-78", "text": "We have presented a technique of transfer learning of contextual LM representations based on neuroinspired sparse binary associative memories." }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-79", "text": "We have confirmed by our results that such a framework can dramatically mitigate the memory footprint problem and enhance the interpretability while it is still capable of increasing performance in the focused task, WSD." }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-80", "text": "Our framework is an extension of [15] work to deal with latent representations of contextual pretrained LM." }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-81", "text": "For future work, we intend to explore other LM recent architectures such as BERT [2] , and also aggregate information from knowledge-based corpus such as WordNet [29] ." 
}, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-82", "text": "----------------------------------" }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-83", "text": "**ACKNOWLEDGEMENTS**" }, { "sent_id": "43422cf92d8cbb280c0b4c590632f1-C001-84", "text": "The authors would like to thank CAPES, NVIDIA and ERC grant for the financial support." } ], "y": { "@BACK@": { "gold_contexts": [ [ "43422cf92d8cbb280c0b4c590632f1-C001-22", "43422cf92d8cbb280c0b4c590632f1-C001-23", "43422cf92d8cbb280c0b4c590632f1-C001-24" ], [ "43422cf92d8cbb280c0b4c590632f1-C001-32" ], [ "43422cf92d8cbb280c0b4c590632f1-C001-33", "43422cf92d8cbb280c0b4c590632f1-C001-34" ], [ "43422cf92d8cbb280c0b4c590632f1-C001-35", "43422cf92d8cbb280c0b4c590632f1-C001-36" ] ], "cite_sentences": [ "43422cf92d8cbb280c0b4c590632f1-C001-24", "43422cf92d8cbb280c0b4c590632f1-C001-32", "43422cf92d8cbb280c0b4c590632f1-C001-33", "43422cf92d8cbb280c0b4c590632f1-C001-35" ] }, "@MOT@": { "gold_contexts": [ [ "43422cf92d8cbb280c0b4c590632f1-C001-32" ] ], "cite_sentences": [ "43422cf92d8cbb280c0b4c590632f1-C001-32" ] }, "@FUT@": { "gold_contexts": [ [ "43422cf92d8cbb280c0b4c590632f1-C001-33", "43422cf92d8cbb280c0b4c590632f1-C001-34" ] ], "cite_sentences": [ "43422cf92d8cbb280c0b4c590632f1-C001-33" ] }, "@USE@": { "gold_contexts": [ [ "43422cf92d8cbb280c0b4c590632f1-C001-53" ] ], "cite_sentences": [ "43422cf92d8cbb280c0b4c590632f1-C001-53" ] }, "@EXT@": { "gold_contexts": [ [ "43422cf92d8cbb280c0b4c590632f1-C001-80" ] ], "cite_sentences": [ "43422cf92d8cbb280c0b4c590632f1-C001-80" ] } } }, "ABC_61f88b86c451fb6a5e5893c8c42a24_31": { "x": [ { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-2", "text": "In addition to the expression of positive and negative sentiments in the reviews, customers often tend to express wishes and suggestions regarding improvements in a product/service, which could be worth extracting." 
}, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-3", "text": "Subjunctive mood is often present in sentences which speak about a possibility or action that has not yet occurred." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-4", "text": "While this phenomena poses challenges to the identification of positive and negative sentiments hidden in a text, it can be helpful to identify wishes and suggestions." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-5", "text": "In this paper, we extract features from a small dataset of subjunctive mood, and use those features to identify wishes and suggestions in opinionated text." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-6", "text": "Our study validates that subjunctive features can be good features for the detection of wishes." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-7", "text": "However, with the given dataset, such features did not perform well for suggestion detection." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-8", "text": "----------------------------------" }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-10", "text": "In the context of Sentiment Analysis, presence of a variety of linguistic phenomena poses challenges for the identification of underlying sentiment in an opinionated text." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-11", "text": "Subjunctive mood is one such phenomena (Liu et al. (2013) ; Bloom (2011) )." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-12", "text": "It is a commonly occurring language phenomenon in Indo-European languages, which is a verb mood typically used in subordinate clauses to express action that has not yet occurred, in the form of a wish, possibility, necessity etc." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-13", "text": "(Guan, 2012) ." 
}, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-14", "text": "Oxford dictionary defines it as, Relating to or denoting a mood of verbs expressing what is imagined or wished or possible." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-15", "text": "Sentiment terms present in such sentences may not necessarily contribute to the actual sentiment of the sentence, for example 'I wish it tasted as amazing as it looked' is not positive." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-16", "text": "While this is considered as a challenge for sentiment analysis, we adopt a different perspective, and discover benefits of the presence of subjunctive mood in opinionated text." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-17", "text": "Apart from the expression of criticism and satisfaction in customer reviews, reviews might include suggestions for improvements." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-18", "text": "Suggestions can either be expressed explicitly (Brun, 2013) , or by expressing wishes regarding new features and improvements (Ramanand et al., 2010) (Table 1) ." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-19", "text": "Extraction of suggestions goes beyond the scope of sentiment analysis, and also complements it by providing another valuable information that is worth analyzing." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-20", "text": "Table 1 presents some examples of occurrence of subjunctive mood collected from different forums on English grammar 1 . There seems to be a high probability of the occurrence of subjunctive mood in wish and suggestion expressing sentences." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-21", "text": "This observation can be exploited for the tasks of wish detection (Ramanand et al., 2010) , and suggestion extraction (Brun, 2013) ." 
}, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-22", "text": "To the best of our knowledge, subjunctive mood has never been analysed in the context of wish and suggestion detection." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-23", "text": "We collect a sample dataset comprising of example sentences of subjunctive mood, and identify features of subjunctive mood." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-24", "text": "We then employ a state of the art statistical classifier, and use subjunctive features in order to perform two kind of tasks on a given set of sentences: 1. Detect wish expressing sentences, and 2." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-25", "text": "Detect suggestion expressing sentences." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-26", "text": "Description Examples Suggestion bearing wishes in product reviews I wanted a dvd player that had basic features and would be able to play dvd or format discs that I had made myself." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-27", "text": "I wish canon would work out some way for that issue." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-28", "text": "Direct suggestions in product reviews They should improve their user interface." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-29", "text": "Wishes in political discussions I wish someone said that to teddy at the meeting yesterday. Perhaps I should have stopped at 8 or 9 years old." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-30", "text": "I would like to know if you re a purist or a hypocrite." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-31", "text": "Sentences containing subjunctive mood I wish it were summer." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-32", "text": "I suggest that Dawn drive the car. But if it weren't so big, it wouldn't be nearly so fun." 
}, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-33", "text": "Mood and Modality: Modality is a grammatical category that allows the expression of aspects related to the attitude of a speaker towards his statement, in terms of degree of certainty, reliability, subjectivity, sources of information, and perspective (Morante and Sporleder, 2012) ." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-34", "text": "Subjunctive mood originated from the typological studies of modality (Palmer, 1986; Dudman, 1988; Portner, 2009) ." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-35", "text": "Some works equate its presence with 'counterfactuality' (Palmer, 1986) , while some do not (Anderson, 1951) ." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-36", "text": "Other concepts like 'event modality', 'irrealis' (Palmer, 1986) , have definitions similar to that of subjunctive mood." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-37", "text": "Benamara et al. (2012) studied modality and negation for French language, with an objective to examine its effect on sentiment polarity." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-38", "text": "Narayanan et al. (2009) performed sentiment analysis on conditional sentences." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-39", "text": "Our objective however is inclined towards wish and suggestion detection, rather than sentiment analysis." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-40", "text": "Wish Detection: Goldberg et al. (2009) performed wish detection on datasets obtained from political discussion forums and product reviews." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-41", "text": "They automatically extracted sentence templates from a corpus of new year wishes, and used them as features with a statistical classifier." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-42", "text": "Suggestion Detection: Ramanand et al. 
(2010) pointed out that wish is a broader category, which might not bear suggestions every time." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-43", "text": "They performed suggestion detection, where they focussed only on suggestion bearing wishes, and used manually formulated syntactic patterns for their detection." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-44", "text": "Brun (2013) also extracted suggestions from product reviews and used syntactico-semantic patterns for suggestion detection." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-45", "text": "None of these works on suggestion detection used a statistical classifier." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-46", "text": "None of these works aligned the problem of wish and suggestion detection with subjunctive mood, or identified features related to it." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-47", "text": "Wish and suggestion detection remain young problems, and our work contributes towards the same." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-48", "text": "----------------------------------" }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-49", "text": "**DATASETS**" }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-50", "text": "Following are the datasets which we use for our experiments." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-51", "text": "\u2022 Wish Detection Oxford dictionary defines the noun wish as, A desire or hope for something to happen." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-52", "text": "Goldberg et al. (2009) follow this definition of wish and provide manually annotated datasets, where each sentence is labelled as wish or non-wish." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-53", "text": "Following two datasets are made available: a. Political Discussions: 6379 sentences, out of which 34% are annotated wishes." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-54", "text": "b. 
Product Reviews: 1235 sentences, out of which 12% are annotated as wishes." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-55", "text": "Table 1 presents some examples from these datasets." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-56", "text": "Ramanand et al. (2010) worked on product review dataset of the wish corpus, with an objective to extract suggestions for improvements." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-57", "text": "They considered suggestions as a subset of wishes, and thus retained the labels of only suggestion bearing wishes." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-58", "text": "They also annotated additional product reviews, but their data is not available for open research." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-59", "text": "\u2022 Suggestion Detection Product reviews (new): We re-annotated the product review dataset from Goldberg et al. (2009) , for suggestions." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-60", "text": "This also includes wishes for improvements and new features." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-61", "text": "Out of 1235 sentences, 6% are annotated as suggestions." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-62", "text": "Table 1 presents some examples from this dataset." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-63", "text": "Annotation Details: We had 2 annotators annotate each sentence with a suggestion or non-suggestion tag." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-64", "text": "We support the observation of Ramanand et al. (2010) that wishes for improvements and new features are implicit expression of suggestions." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-65", "text": "Therefore, annotators were also asked to annotate suggestions which were expressed as wishes." 
}, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-66", "text": "For inter-annotator agreement, a kappa value of 0.874 was obtained." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-67", "text": "In the final dataset, we only retained the sentences where both the annotators agree." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-68", "text": "----------------------------------" }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-69", "text": "**SUBJUNCTIVE FEATURE EXTRACTION**" }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-70", "text": "Subjunctive Mood Dataset (new): Since we did not come across any corpus of subjunctive mood, we collected example sentences of subjunctive mood from various grammar websites and forums 2 , which resulted in a sample dataset of 229 sentences." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-71", "text": "Table 1 shows examples from this dataset." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-72", "text": "We use this dataset for manual and automatic identification of features of subjunctive mood." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-73", "text": "----------------------------------" }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-74", "text": "**APPROACH**" }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-75", "text": "We use a statistical classifier to detect wishes and suggestions in corresponding datasets." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-76", "text": "We obtain the following set of features from the subjunctive mood dataset." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-77", "text": "----------------------------------" }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-78", "text": "**LEXICAL FEATURES:**" }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-79", "text": "\u2022 Condition indicator 'if': This is a binary feature, whose value depends on the presence and absence of 'if' in a sentence." 
}, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-80", "text": "\u2022 Suggestion and Wish Verbs: We collect some suggestion and wish indicator verbs observed in the subjunctive mood dataset." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-81", "text": "We then expand this set of verbs by using VerbNet 3.2 (Schuler, 2005) ." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-82", "text": "VerbNet is a wide coverage verb lexicon, which places verbs into classes whose members have common syntactic and semantic properties." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-83", "text": "We collect all members of the VerbNet verb classes advice, wish, want, urge, require; 28 different verbs were obtained." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-84", "text": "Ramanand et al. (2010) also used a similar but much smaller subset {love, like, prefer and suggest} in their rules." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-85", "text": "----------------------------------" }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-86", "text": "**SYNTACTIC FEATURES:**" }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-87", "text": "\u2022 Frequent POS sequences: This is a set of 3,4 length sequences of Part Of Speech (POS) tags, which are automatically extracted from the subjunctive mood dataset." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-88", "text": "Words in the sentences are replaced by their corresponding POS tag, and top 200 sequences are extracted based on their weight." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-89", "text": "The weight of each sequence is a product of Term Frequency (TF) and Inverse Document Frequency (IDF)." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-90", "text": "In order to apply the concept of TF and IDF to POS tag sequences, every 3 and 4 length tag sequence occurring in the corpus is treated as a term." 
}, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-91", "text": "We separate tags within a sequence with an underscore." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-92", "text": "An example of a sequence of length 3 would be PRP VB PRP ie." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-93", "text": "Personal Pronoun Base form of Verb Personal pronoun." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-94", "text": "\u2022 Frequent Dependency Relations: These are a set of dependency relations (Marneffe and Manning, 2008) ." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-95", "text": "Using the same method as the part of speech tags, we identify 5 most frequent dependency relations which occur in the subjunctive mood dataset." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-96", "text": "In order to apply the concept of TF/IDF, each dependency relation occurring in the corpus is treated as a term." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-97", "text": "The top 5 relations were: advmod, aux, ccomp, mark and nsubj." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-98", "text": "Goldberg et.al (2009) templates n/a n/a 0.47 unigrams,templates n/a n/a 0.56 We also obtain classification results of the combination of these features with the standard unigram features (Table 2, 3) ." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-99", "text": "To obtain the part of speech and dependency information, we use Stanford Parser 3.3.1 (Klein and Manning, 2003) ." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-100", "text": "Word stemming is not performed." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-101", "text": "We use the LibSVM implementation of SVM classifier (EL-Manzalawy and Honavar, 2005) ." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-102", "text": "The parameter values of SVM classifiers are: SVM type = C-SVC, Kernel Function = Radial Basis Function." 
}, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-103", "text": "Features are ranked using the Info-Gain feature selection algorithm (Mitchell, 1997) ." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-104", "text": "Top 1000 features are used in all the experiments ie." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-105", "text": "the size of feature vector is not more than 1000." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-106", "text": "5 Subjunctive Feature Evaluation Goldberg et al. (2009) evaluated their approach using a 10 fold cross validation on their datasets." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-107", "text": "In order to compare subjunctive features against their wish template features, we also perform 10 fold cross validation on their wish datasets (politics and products)." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-108", "text": "The evaluation metrics include Precision, Recall, and Area Under Curve (AUC) for the positive class." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-109", "text": "AUC was also used by Goldberg et al. (2009) ." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-110", "text": "To the best of our knowledge, statistical classification based approach have not yet been employed to detect suggestions in reviews." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-111", "text": "Our experiment which uses subjunctive features for suggestion detection, is the first in this regard." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-112", "text": "Table 2 compares the AUC values obtained with unigrams, subjunctive features, a combination of both, and the results from Goldberg et al. (2009) for wish detection." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-113", "text": "Table 3 compares the AUC values obtained with unigrams, subjunctive features, and a combination of both for suggestion detection." 
}, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-114", "text": "Table 4 presents some of the top features used by the classifier." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-115", "text": "----------------------------------" }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-116", "text": "**RESULTS AND DISCUSSION**" }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-117", "text": "Wish Detection: Unigrams vs Subjunctive: One probable reason for the better performance of subjunctive features over unigrams in the case of product dataset, could be the small size of the dataset." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-118", "text": "In the case of politics dataset, similar reason (big dataset) can be attributed for the better performance of unigrams over subjunctive features." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-119", "text": "Goldberg et al. (2009) perform better than our subjunctive features for the politics data." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-120", "text": "However, subjunctive features perform much better with product data as compared to the wish templates (Table 3 )." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-121", "text": "This may lead to the conclusion that wish templates need larger training corpus, since they failed for the smaller dataset of product reviews (AUC less than 0.5)." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-122", "text": "One additional benefit of subjunctive features could be that subjunctive mood appears in many languages, and thus such features can be easily extended to multi-lingual wish detection." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-123", "text": "----------------------------------" }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-124", "text": "**SUGGESTION DETECTION:**" }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-125", "text": "Subjunctive features perform better than unigrams in this case too." 
}, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-126", "text": "An overall decrease in classifier performance for the task of suggestion detection can be attributed to the fact that not all wishes are suggestions, and therefore are not tagged in this dataset." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-127", "text": "Some of these untagged wishes would contain subjunctive mood, which reduced the performance of subjunctive features, as compared to the task of wish detection." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-128", "text": "----------------------------------" }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-129", "text": "**CONCLUSION**" }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-130", "text": "From the results of feature evaluation, we conclude that subjunctive features are not effective for suggestion detection, but are considerably effective for the task of wish detection." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-131", "text": "This work contributes towards both, analysis and methodology for wish detection." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-132", "text": "On the analysis part, we validate that a considerable amount of wishes in opinionated text contain subjunctive mood." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-133", "text": "On the methodology part, we use subjunctive mood features as effective features for the detection of wishes." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-134", "text": "We also provide datasets for this kind of study." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-135", "text": "Since we only deal with 2 domains here, further experiments can be performed over data from different domains." }, { "sent_id": "61f88b86c451fb6a5e5893c8c42a24-C001-136", "text": "In the continuation of this work, we intend to extend the datasets and explore more syntactic and semantic features for wish and suggestion detection." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "61f88b86c451fb6a5e5893c8c42a24-C001-18" ], [ "61f88b86c451fb6a5e5893c8c42a24-C001-20", "61f88b86c451fb6a5e5893c8c42a24-C001-21", "61f88b86c451fb6a5e5893c8c42a24-C001-22" ], [ "61f88b86c451fb6a5e5893c8c42a24-C001-42", "61f88b86c451fb6a5e5893c8c42a24-C001-43" ], [ "61f88b86c451fb6a5e5893c8c42a24-C001-56", "61f88b86c451fb6a5e5893c8c42a24-C001-57", "61f88b86c451fb6a5e5893c8c42a24-C001-58" ] ], "cite_sentences": [ "61f88b86c451fb6a5e5893c8c42a24-C001-18", "61f88b86c451fb6a5e5893c8c42a24-C001-21", "61f88b86c451fb6a5e5893c8c42a24-C001-42", "61f88b86c451fb6a5e5893c8c42a24-C001-56" ] }, "@MOT@": { "gold_contexts": [ [ "61f88b86c451fb6a5e5893c8c42a24-C001-20", "61f88b86c451fb6a5e5893c8c42a24-C001-21", "61f88b86c451fb6a5e5893c8c42a24-C001-22" ], [ "61f88b86c451fb6a5e5893c8c42a24-C001-63", "61f88b86c451fb6a5e5893c8c42a24-C001-64", "61f88b86c451fb6a5e5893c8c42a24-C001-65" ] ], "cite_sentences": [ "61f88b86c451fb6a5e5893c8c42a24-C001-21", "61f88b86c451fb6a5e5893c8c42a24-C001-64" ] }, "@SIM@": { "gold_contexts": [ [ "61f88b86c451fb6a5e5893c8c42a24-C001-63", "61f88b86c451fb6a5e5893c8c42a24-C001-64", "61f88b86c451fb6a5e5893c8c42a24-C001-65" ], [ "61f88b86c451fb6a5e5893c8c42a24-C001-83", "61f88b86c451fb6a5e5893c8c42a24-C001-84" ] ], "cite_sentences": [ "61f88b86c451fb6a5e5893c8c42a24-C001-64", "61f88b86c451fb6a5e5893c8c42a24-C001-84" ] } } }, "ABC_ee66681690f2c92fe705a09bf7015d_31": { "x": [ { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-39", "text": "----------------------------------" }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-40", "text": "**SEQUENCE ALIGNMENT MODEL**" }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-41", "text": "Let e = e 1 e 2 . . ." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-2", "text": "In machine transliteration we transcribe a name across languages while maintaining its phonetic information." 
}, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-3", "text": "In this paper, we present a novel sequence transduction algorithm for the problem of machine transliteration." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-4", "text": "Our model is discriminatively trained by the MIRA algorithm, which improves the traditional Perceptron training in three ways: (1) It allows us to consider k-best transliterations instead of the best one." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-5", "text": "(2) It is trained based on the ranking of these transliterations according to user-specified loss function (Levenshtein edit distance)." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-6", "text": "(3) It enables the user to tune a built-in parameter to cope with noisy non-separable data during training." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-7", "text": "On an Arabic-English name transliteration task, our model achieves a relative error reduction of 2.2% over a perceptron-based model with similar features, and an error reduction of 7.2% over a statistical machine translation model with more complex features." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-8", "text": "----------------------------------" }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-9", "text": "**INTRODUCTION AND RELATED WORK**" }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-10", "text": "Proper names and other technical terms are frequently encountered in natural language text." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-11", "text": "Both machine translation (Knight and Graehl, 1997) and cross-language information retrieval (Jeong et al., 1999; Virga and Khudanpur, 2003; Abdul-Jaleel and Larkey, 2003) can benefit by explicitly translating such words from one language into another." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-12", "text": "This approach is decidedly better than treating them uniformly as out-of-vocabulary tokens." 
}, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-13", "text": "The goal of machine transliteration is to translate words between alphabets of different languages such that they are phonetically equivalent." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-14", "text": "Given a source language sequence f = f 1 f 2 . . ." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-15", "text": "f m from an alphabet F, we want to produce a target language sequence e = e 1 e 2 . . ." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-16", "text": "e n in the alphabet E such that it maximizes some score function s(e, f ), e = arg max e \u2032 s(e \u2032 , f )." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-17", "text": "Virga and Khudanpur (2003) model this scoring function using a separate translation and language model, that is, s(e, f ) = P r(f |e)P r(e)." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-42", "text": "e n and f = f 1 f 2 . . ." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-18", "text": "In constrast, Al-Onaizan and Knight (2002) directly model the translation probability P r(e|f ) using a log-linear combination of several individually trained phrase and character-based models." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-19", "text": "Others have treated transliteration as a phrase-based transduction (Sherif and Kondrak, 2007) ." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-20", "text": "All these approaches are adaptations of statistical models for machine translation (Brown et al., 1994) ." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-21", "text": "In general, the parameters of the scoring function in such approaches are trained generatively and do not utilize complex features of the input sequence pairs." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-22", "text": "Recently, there has been interest in applying discriminatively-trained sequence alignment models to many real-world problems." 
}, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-23", "text": "McCallum et al. (2005) train a conditional random field model to discriminate between matching and non-matching string pairs treating alignments as latent." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-24", "text": "Learning accurate alignments in this model requires finding \"close\" non-match pairs which can be a challenge." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-25", "text": "A similar conditional latent-variable model has been applied to the task of lemmatization and generation of morphological forms (Dreyer et al., 2008) ." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-26", "text": "Zelenko and Aone (2006) model transliteration as a structured prediction problem where the letter e i is predicted using local and global features derived from e 1 e 2 . . ." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-27", "text": "e i\u22121 and f ." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-28", "text": "Bergsma and Kondrak (2007) address cognate identification by training a SVM classification model using phrase-based features obtained from a Levenshtein alignment." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-29", "text": "Both these models do not learn alignments that is needed to obtain high performance on transliteration tasks." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-30", "text": "Freitag and Khadivi (2007) describe a discriminatively trained sequence alignment model based on averaged perceptron, which is closely related to the method proposed in this paper." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-31", "text": "Our approach improves over previous directions in two ways." 
}, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-32", "text": "First, our system produces better k-best transliterations than related approaches by training on multiple hypotheses ranked according to a userspecified loss function (Levenshtein edit distance)." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-33", "text": "Hence, our method achieves a 19.2% error reduction in 5-best performance over a baseline only trained with 1-best transliterations." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-34", "text": "This is especially helpful when machine transliteration is part of a larger machine translation or information retrieval pipeline since additional sentence context can be used to choose the best among top-K transliterations." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-35", "text": "Second, our training procedure accounts for noise and non-separability in the data." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-36", "text": "Therefore, our transliteration system would work well in cases where person names were misspelled or in cases in which a single name had many reasonable translations in the foreign language." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-37", "text": "The training algorithm we propose in this paper is based on the K-best MIRA algorithm which has been used earlier in structured prediction problems (McDonald et al., 2005a; McDonald et al., 2005b )." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-38", "text": "Our results demonstrate a significant improvement in accuracy of 7.2% over a statistical machine translation (SMT) system (Zens et al., 2005) and of 2.2% over a perceptron-based edit model (Freitag and Khadivi, 2007) ." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-43", "text": "f m be sequences from the target alphabet E and source alphabet F respectively." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-44", "text": "Let a = a 1 a 2 . . . 
a l be a sequence of alignment operations needed to convert f into e. Each alignment operation either appends a letter to the end of the source sequence, the target sequence, or both sequences." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-45", "text": "Hence, it is a member of the cross-product a k \u2208 E \u222a{\u01eb}\u00d7F \u222a{\u01eb}\\{(\u01eb, \u01eb)}, where \u01eb is the null character symbol." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-46", "text": "Let a k 1 = a 1 a 2 . . ." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-47", "text": "a k denote the sequence of the first k alignment operations." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-48", "text": "Similarly e k 1 and f k 1 are prefixes of e and f of length k." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-49", "text": "We define the scoring function between a word and its transliteration to be the maximum over all possible alignment sequences a," }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-50", "text": "where the score of a specific alignment a between two words is given by a linear relation," }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-51", "text": "for a parameter vector w and a feature vector \u03a6(a, e, f )." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-52", "text": "Furthermore, let \u03a6(a, e, f ) = l k=1 \u03c6(a k , e, i, f , j) be the sum of feature vectors associated with individual alignment operations." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-53", "text": "Here i, j are positions in sequences e, f after performing operations a k 1 ." 
}, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-54", "text": "For fixed sequences e and f the function s(e, f ) can be efficiently computed using a dynamic programming algorithm," }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-55", "text": "Given a source sequence f computing the best scoring target sequence e = arg max e \u2032 s(e \u2032 , f ) among all possible sequences E * requires a beam search procedure (Freitag and Khadivi, 2007) ." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-56", "text": "This procedure can also be used to produce K-best target sequences" }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-57", "text": ". In this paper, we employ the same features as those used by Freitag and Khadivi (2007) ." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-58", "text": "All local feature functions \u03c6(a k , e, i, f , j) are conjunctions of the alignment operation a k and forward or backward-looking character m-grams in sequences e and f at positions i and j respectively." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-59", "text": "For the source sequence f both forward and backwardlooking m-gram features are included." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-60", "text": "We restrict the m-gram features in our target sequence e to only be backward-looking since we do not have access to forward-looking m-grams during beam-search." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-61", "text": "An order M model is one that uses m-gram features where m = 0, 1, . . ." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-62", "text": "M ." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-63", "text": "Our training algorithm takes as input a data set D of source-target transliteration pairs and outputs a parameter vector u. The algorithm pseudo-code appears in Fig. (1) ." 
}, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-64", "text": "In the algorithm, the function L(e \u2032 , e) defines a loss incurred by predicting e \u2032 instead of e. In most structured prediction problems, the targets are of equal length and in such cases the Hamming loss function can be used." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-65", "text": "However, in our case the targets may differ in terms of length and thus we use the Levenshtein edit distance (Levenshtein, 1966 ) with unit costs for insertions, deletions and substitutions." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-66", "text": "Since the targets are both in the same alphabet E this loss function is well-defined." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-67", "text": "The user also supplies three paramters: (1) T -the number of training iterations (2) K -the number of best target hypotheses used (3) C -a complexity parameter." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-68", "text": "A low C is useful if the data is nonseparable and noisy." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-69", "text": "The final parameter vector u returned by the algorithm is the average of the intermediate parameter vectors produced during training." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-70", "text": "We find that averaging helps to improve performance." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-71", "text": "At test time, we use the beam search procedure to produce Kbest hypotheses using the parameter vector u." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-72", "text": "----------------------------------" }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-73", "text": "**EXPERIMENTAL RESULTS**" }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-74", "text": "We apply our model to the real-world ArabicEnglish name transliteration task on a data set of 10,084 Arabic names from the LDC." 
}, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-75", "text": "The data set consists of Arabic names in an ASCII-based alphabet and its English rendering." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-76", "text": "Table 1 shows a few examples of Arabic-English pairs in our data set." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-77", "text": "We use the same training/development/testing (8084/1000/1000) set as the one used in a previous benchmark study (Freitag and Khadivi, 2007) ." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-78", "text": "The development and testing data were obtained by randomly removing entries from the training data." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-79", "text": "The absence of short vowels (e.g. \"a\" in NB\"I, nab'i ), doubled consonants (e.g. \"ww\" in FWAL, fawwal ) and other diacritics in Arabic make the transliteration a hard problem." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-80", "text": "Therefore, it is hard to achieve perfect accuracy on this data set." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-81", "text": "For training, we set K = 20 best hypotheses and" }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-82", "text": "----------------------------------" }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-83", "text": "**INPUT PARAMETERS**" }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-84", "text": "Training Data D Complexity parameter C > 0 Number of epochs T Initialize w 0 = 0 (zero vector) ; \u03c4 = 0 ; u = 0 Repeat T times: For Each (e, f ) \u2208 D :" }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-85", "text": "1. a = arg max\u00e2 w \u03c4 \u00b7 \u03a6(\u00e2, e, f ) (Find best scoring alignment between e and f using dynamic programming) 2. Generate a list of K-best target hypotheses {e" }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-86", "text": "K } given the current parameters w \u03c4 ." 
}, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-87", "text": "Let the corresponding alignments for the targets be {a" }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-88", "text": "3. Set w \u03c4 +1 to be the solution of : C = 1.0 and run the algorithm for T = 10 epochs." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-89", "text": "To evaluate our algorithm, we generate 1-best (or 5-best) hypotheses using the beam search procedure and measure accuracy as the percentage of instances in which the target sequence e is one of the 1-best (or 5-best) targets." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-90", "text": "The input features are based on character m-grams for m = 1, 2, 3." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-91", "text": "Unlike previ-ous generative transliteration models, no additional language model feature is used." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-92", "text": "We compare our model against a state-of-the-art statistical machine translation (SMT) system (Zens et al., 2005) and an averaged perceptron edit model (PTEM) with identical features (Freitag and Khadivi, 2007) ." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-93", "text": "The SMT system directly models the posterior probability P r(e|f ) using a log-linear combination of several sub-models: a characterbased phrase translation model, a character-based lexicon model, a character penalty and a phrase penalty." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-94", "text": "In the PTEM model, the update rule only considers the best target sequence and modifies the parameters w \u03c4 +1 = w \u03c4 + \u03a6(a, e, f ) \u2212 \u03a6(a \u2032 , e \u2032 , f ) if the score s(e \u2032 , f ) \u2265 s(e, f )." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-95", "text": "Table 2 shows the 1-best and 5-best accuracy of each model trained on the combined train+dev data set." 
}, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-96", "text": "All the models are evaluated on the same test set." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-97", "text": "Both MIRA and PTEM algorithms outperform the SMT model in terms of 1-best accuracy." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-98", "text": "The differences in accuracy are significant at 95% confidence level, using the bootstrapping method for hypothesis testing." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-99", "text": "The difference in 1-best performance of MIRA and PTEM is not significant." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-100", "text": "At 5-best, the MIRA model outperforms both SMT and PTEM model." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-101", "text": "We conjecture that using the problem-specific Levenshtein loss function helps filter bad target sequences from the K-best outputs during training." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-102", "text": "In a second experiment we studied the effect of changing C on the performance of the algorithm." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-103", "text": "We ran the algorithm with the above settings, except varying the value of the complexity parameter to one of 7 values in the range C = 0.00001, 0.0001, . . . , 0.1, 1.0, training only using the train set, and evaluating the resulting model on the test set." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-104", "text": "The results are summarized in Table 3 ." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-105", "text": "The entry marked with a star * indicates the model that achieved the best performance on the dev set for a particular choice of evaluation measure (1-best or 5-best)." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-106", "text": "We find that changing C does have an effect on model performance." 
}, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-107", "text": "As the value of C decreases, the performance at lower ranks improves: C = 0.01 is good for 5-best accuracy and C = 0.001 for 20-best accuracy (not in table)." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-108", "text": "As C is further reduced, a greater number of iterations are needed to converge." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-109", "text": "In our model, where the alignments are not observed but inferred during training, we find that making small incremental updates makes our algorithm more robust." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-110", "text": "Indeed, setting C = 0.01 and training on the train+dev set improves 5-best performance of our model from 0.841 to 0.861." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-111", "text": "Hence, the choice of C is important." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-112", "text": "----------------------------------" }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-113", "text": "**CONCLUSIONS AND FUTURE WORK**" }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-114", "text": "We have shown a significant improvement in accuracy over state-of-the-art transliteration models by taking into consideration the ranking of multiple hypotheses (top-K) by Levenshtein distance, and making the training algorithm robust to noisy nonseparable data." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-115", "text": "Our model does consistently well at high (K = 1) and low ranks (K = 5), and can therefore be used in isolation or in a pipelined system (e.g. machine translation or cross-language information retrieval) to achieve better performance." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-116", "text": "In a pipeline system, more features of names around proper nouns and previous mentions of the name can be used to improve scoring of K-best outputs." 
}, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-117", "text": "In our experiments, the Levenshtein loss function uses only unit costs for edit operations and is not specifically tuned towards our application." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-118", "text": "In future work, we may imagine penalizing insertions and deletions higher than substitutions and other non-uniform schemes for better transliteration performance." }, { "sent_id": "ee66681690f2c92fe705a09bf7015d-C001-119", "text": "Our K-best framework can also be easily extended to cases where one name has multiple foreign translations that are equally likely." } ], "y": { "@DIF@": { "gold_contexts": [ [ "ee66681690f2c92fe705a09bf7015d-C001-38" ] ], "cite_sentences": [ "ee66681690f2c92fe705a09bf7015d-C001-38" ] }, "@BACK@": { "gold_contexts": [ [ "ee66681690f2c92fe705a09bf7015d-C001-38" ], [ "ee66681690f2c92fe705a09bf7015d-C001-55" ] ], "cite_sentences": [ "ee66681690f2c92fe705a09bf7015d-C001-38", "ee66681690f2c92fe705a09bf7015d-C001-55" ] }, "@USE@": { "gold_contexts": [ [ "ee66681690f2c92fe705a09bf7015d-C001-55" ], [ "ee66681690f2c92fe705a09bf7015d-C001-57" ], [ "ee66681690f2c92fe705a09bf7015d-C001-77" ], [ "ee66681690f2c92fe705a09bf7015d-C001-92" ] ], "cite_sentences": [ "ee66681690f2c92fe705a09bf7015d-C001-55", "ee66681690f2c92fe705a09bf7015d-C001-57", "ee66681690f2c92fe705a09bf7015d-C001-77", "ee66681690f2c92fe705a09bf7015d-C001-92" ] } } }, "ABC_5b98a80237182b2d506ea4c9d71aa1_31": { "x": [ { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-30", "text": "Our features are also inspired by Wong and Ng (2007) ." 
}, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-74", "text": "**FLOW CHART USING NON-LOCAL FEATURES**" }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-75", "text": "----------------------------------" }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-76", "text": "**FOUR KINDS OF NON-LOCAL FEATURES**" }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-98", "text": "Please note that not all matching token sequences are true candidates." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-2", "text": "Abstract." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-3", "text": "Named Entity Recognition (NER) is always limited by its lower recall resulting from the asymmetric data distribution where the NONE class dominates the entity classes." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-4", "text": "This paper presents an approach that exploits non-local information to improve the NER recall." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-5", "text": "Several kinds of non-local features encoding entity token occurrence, entity boundary and entity class are explored under Conditional Random Fields (CRFs) framework." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-6", "text": "Experiments on SIGHAN 2006 MSRA (CityU) corpus indicate that non-local features can effectively enhance the recall of the state-of-the-art NER systems." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-7", "text": "Incorporating the non-local features into the NER systems using local features alone, our best system achieves a 23.56% (25.26%) relative error reduction on the recall and 17.10% (11.36%) relative error reduction on the F1 score; the improved F1 score 89.38% (90.09%) is significantly superior to the best NER system with F1 of 86.51% (89.03%) participated in the closed track." 
}, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-8", "text": "----------------------------------" }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-10", "text": "Named entity recognition (NER) is a subtask of information extraction that seeks to locate and classify predefined entities, such as names of persons, locations, organizations, etc." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-11", "text": "in unstructured texts." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-12", "text": "It is the fundamental step to many natural language processing applications, like Information Extraction (IE), Information Retrieval (IR) and Question Answering (QA)." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-13", "text": "Most empirical approaches currently employed in NER task make decision only on local context for extract inference, which is based on the data independent assumption (Krishnan and Manning, 2006) ." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-14", "text": "But often this assumption does not hold because non-local dependencies are prevalent in natural language (including the NER task)." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-15", "text": "How to utilize the non-local dependencies effectively is a key issue in NER task." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-16", "text": "Unfortunately, few researches have been devoted to this issue, existing works mainly focus on using the non-local information for further improving NER label consistency." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-17", "text": "There are two methods to use non-local information." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-18", "text": "One is to add additional edges to graphical model structure to represent the distant dependencies and the other is to encode the non-locality with non-local features." 
}, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-19", "text": "However, in the first approach, heuristic rules are used to find the dependencies (Bunescu and Mooney, 2004; Sutton and McCallum, 2004) or penalties for label inconsistency are required to handset ad-hoc (Finkel et al., 2005) ." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-20", "text": "Furthermore, high computational cost is spent for approximate inference." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-21", "text": "In order to establish the long dependencies easily and overcome the disadvantage of the approximate inference, Krishnan and Manning (2006) propose a two-stage approach using Conditional Random Fields (CRFs) with extract inference." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-22", "text": "They represent the non-locality with non-local features, and extract the nonlocal features from the output of the first stage CRF using local context alone; then they incorporate the non-local features into the second CRF. But the features in this approach are only used to improve label consistency." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-23", "text": "To our best knowledge, up to now, non-local information has not been explored to improve NER recall in previous researches; on the other hand, NER is always impaired by its lower recall due to the imbalanced distribution where the NONE class dominates the entity classes." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-24", "text": "Classifiers built on such data typically have a higher precision and a lower recall and tend to overproduce the NONE class (Kambhatla, 2006) ." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-25", "text": "In this paper, we employ non-local information to recall the missed entities." 
}, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-26", "text": "Similar to Krishnan and Manning (2006) , we also encode non-local information with features and apply the simple two-stage architecture." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-27", "text": "Different from their work for improve label consistency, their features are activated on the recognized entities coming from the first CRF, the non-local features we design are used to recall more missed entities which are seen in the training data or unseen entities but some of their occurrences being recognized correctly in the first stage, our features are fired on the raw token sequence directly with forward maximum match." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-28", "text": "Compared to their non-local information extracted from training data with 10-fold cross-validation, our non-local information is extracted from the training date directly; our approach obtaining the non-local features is simpler." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-29", "text": "Moreover, we design different non-local features encoding different useful information for NER two subtasks: entity boundary detection and entity semantic classification." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-31", "text": "They extract entity majority type features from unlabelled data with an initial maximum entropy classifier." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-32", "text": "Our approach is validated on the third International Chinese language processing bakeoff (SIGHAN 2006) MSRA and CityU NER closed track, the experimental results show that non-local features can significantly improve the recall of the state-of-the-art NER system using local context alone." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-33", "text": "The remainder of the paper is structured as follows." 
}, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-34", "text": "In Section 2, we introduce the first stage CRF with local features alone; then we describe the second stage CRF using non-local features we design in Section 3." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-35", "text": "We demonstrate the experiments in Section 4 and we conclude the paper in Section 5." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-36", "text": "----------------------------------" }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-37", "text": "**OUR BASELINE NER SYSTEM**" }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-38", "text": "To validate the effectiveness of our approach of exploiting non-local features, we need to establish a baseline with state-of-the-art performance using local context alone." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-39", "text": "Similar to (Krishnan and Manning, 2006) , we employ two-stage architecture under conditional random fields (CRFs) framework." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-40", "text": "In the first stage, we build the baseline with local features only, and then we build the second NER system with non-local features." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-41", "text": "We will introduce them step by step." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-42", "text": "----------------------------------" }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-43", "text": "**CONDITIONAL RANDOM FIELDS**" }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-44", "text": "We regard the NER task as a sequence labeling problem and apply Conditional Random Fields (Lafferty et al., 2001; Sha and Pereira, 2003) since it represents the state of the art in sequence modeling and has also been very effective at NER task." 
}, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-45", "text": "It is undirected graph established on G = (V, E), where V is the set of random variables Y = {Y i |1\u2264i\u2264 n} for each the n tokens in an input sequence and E = {(Y i\u22121 , Y i ) |1\u2264i\u2264n} is the set of (n \u2212 1) edges forming a linear chain." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-46", "text": "Following Lafferty et al. (2001) , the conditional probability of the state sequence (s1, s2\u2026sn) given the input sequence (o1, o2\u2026on) is computed as follows:" }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-47", "text": "Where f k is an arbitrary feature function; and \u03bb k is the weight for the feature function; it can be optimized through iterative algorithms like GIS (Darroch and Ratcliff, 1972) and IIS (Della Pietra et al., 1997) ." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-48", "text": "However recent research y been shown that quasi-Newton methods, such as L-BFGS, are significantly more efficient (Byrd et al., 1994; Malouf, 2002; Sha and Pereira, 2003) ." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-49", "text": "----------------------------------" }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-50", "text": "**LOCAL FEATURES**" }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-51", "text": "The fist stage CRF labels for token directly depends on the labels corresponding the previous and next token, namely C -2 , C -1 , C 0 , C -1 , C 2 , C -2 C -1 , C -1 C 0 , C 0 C -1 , C 1 C 2 , and C -1 C 1 , where C 0 is the current character, C 1 the next character, C 2 the second character after C 0 , C -1 the character preceding C 0 , and C -2 the second character before C 0 ." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-52", "text": "In addition, the first CRF used the tag bigram feature." 
}, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-123", "text": "Table 3 displays the performance on MSRA and CityU NER closed track." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-53", "text": "Although these local features are simple, they give us state-of-the-art baseline using local information alone as described in Section 4." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-54", "text": "----------------------------------" }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-55", "text": "**LOW RECALL IN NER TASK**" }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-56", "text": "As Kambhatla (2006) points out that NER system typically have a higher precision and a lower recall and tends to overproduce the NONE class because the NONE class dominates all other classes in the task." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-57", "text": "In natural language, different sentences contain different useful contextual information; the missed entities are happened when their context surroundings are not indicative enough for the statistical-based approaches (including the CRFs) to make a correct decision." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-58", "text": "When we analyze these missed occurrences of the missed entities further, we can put them into three groups." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-59", "text": "The first is the seen entities in the training data; the second is the unseen occurrences, but some other occurrences of the entities have been correctly recognized in certain indicative context surroundings." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-60", "text": "The third is the unseen occurrences with no any occurrences recognized correctly." 
}, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-61", "text": "In NER task, considering influences between extractions can be very useful, if the context surrounding one occurrence of a token sequence is very indicative of it being an entity, then this should also influence the tagging of another occurrence of the same token sequence in a different context that is not indicative of entity (Bunescu and Mooney, 2004) ." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-62", "text": "So if we consider the non-local dependencies between the same entities, some of these missed occurrences will be recognized correctly." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-63", "text": "We will describe how to capture the nonlocality to recall more missed entities in Section 3." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-64", "text": "----------------------------------" }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-65", "text": "**RECALLING MISSED ENTITIES WITH NON-LOCAL FEATURES**" }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-66", "text": "In natural language, difference sentences contain different useful context information; the missed entities happen when their context surroundings are not indicative enough for the first stage CRF to make correct decisions." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-67", "text": "If the context surrounding one occurrence of a token sequence is very indicative of it being an entity, then this should also influence the labeling of another occurrence of the same token sequence in a different context that is not indicative of entity (Bunescu and Mooney, 2004) ." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-68", "text": "So considering the non-local dependencies between the same entities can be very useful, if these non-local dependencies are incorporated into the CRF model, some of the missed entities will be recalled correctly." 
}, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-69", "text": "Figure 1 shows the flow using non-local features in two-stage architecture under CRFs framework." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-70", "text": "The first CRF is trained with local features alone as baseline (described in Section 2), and then we test the testing data with the first CRF and get the entities plus their type from the output." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-71", "text": "The second CRF utilizes the non-local features derived from the entity list which is merged by the output of the first CRF from the testing data and the entities extracted directly from the training data." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-72", "text": "To provide flexible and general conclusion, we only use non-local information found in labeled training data and test data rather than external knowledge sources, such as post-of-speech, gazetteers, external lexica and etc." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-73", "text": "----------------------------------" }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-77", "text": "We design four kinds of non-local features which encode different useful information for the two NER subtasks, i.e. entity boundary detection and entity semantic classification, the nonlocal features are fired on the token sequences if they are matched with certain entity in the entity list in forward maximum matching (FMM) way." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-78", "text": "I will describe them one by one as follows." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-79", "text": "Entity-occurrence features (F1): These refer to the occurrence information assigned to the token sequence which is matched with the entity list exactly." 
}, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-80", "text": "These features capture the dependencies between the identical candidate entities; which results in the same candidate entities of different occurrences can be recalled favorably." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-81", "text": "Token-position features (F2): These refer to the position information (start, middle and last) assigned to the token sequence which is matched with the entity list exactly." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-82", "text": "These features enable us to capture the dependencies between the identical candidate entities and their boundaries." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-83", "text": "Entity-majority features (F3): These refer to the majority label assigned to the token sequence which is matched with the entity list exactly." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-84", "text": "These features enable us to capture the dependencies between the identical entities and their classes, so that the same candidate entities of different occurrences can be recalled favorably, and their label consistencies can be considered too." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-85", "text": "Token-position & entity-majority features (F4): These features capture non-local information from F2 and F3 simultaneously. They take into account the entity boundary and semantic class information at the same time." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-86", "text": "These non-local features are applied in English NER in one-step approach (Krishnan and Manning, 2006; Wong and Ng, 2007) , they employ these features to improve entity consistence among their different occurrences." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-87", "text": "These features are assigned to token sequences that are matched exactly with the (entity, majority-type) list in forward maximum matching (FMM) way." 
}, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-88", "text": "During training or testing, when the CRFs tagger encounters a token sequence C 1 ...C n such that (C k ...C s ) (k\u22651, s \u2264n) is the longest token sequence existing in the entity list; the correspondent features will be turned on to each token in C k ....C s ." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-89", "text": "For example, considering the following" }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-90", "text": "----------------------------------" }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-91", "text": "**SENTENCE: \u6211(WO)\u7231 (AI)\u5317(BEI)\u4eac(JING)\u5929(TIAN)\u5b89(AN)\u95e8 (MEN)(I LOVE BEIJING TIANANMEN). IF (\u5317 \u4eac, MAJ-LOC), (\u4eac, MAJ-LOC), AND (\u5929\u5b89\u95e8, MAJ-LOC) ARE PRESENTED IN THE (ENTITY, MAJORITY**" }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-92", "text": "type) list, the features below will be turned on as table 1 shows." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-93", "text": "----------------------------------" }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-94", "text": "**NOTICE THAT THE FEATURE TURNED ON FOR \u4eacIS E E--M MA AJ J--L LO OC C, , NOT B-M MA AJ J--L LO OC C, BECAUSE THE LONGEST**" }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-95", "text": "matching sequence is \u5317\u4eac." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-96", "text": "Different from (Krishnan and Manning, 2006; Wong and Ng, 2007) , they only assign the majority type information, like Maj-Loc, to each token in matched candidates, boundary information like B, I and E is ignored, it is acceptable because they utilize these features only for English corpora, and the boundary information can be captured by the capitalization characteristics." 
}, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-97", "text": "But in Chinese NER, NED is more difficult than NEC, so we assign the boundary information, representing with B, I and E, to each token in the matched candidates." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-99", "text": "The false candidates come from two aspects: the first is the boundaries are correct, but the occurrences are common words 1 ; the second errors come from FMM, so the features are soft constraints." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-100", "text": "T Ta ab bl le e 1 1. ." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-101", "text": "E Ex xa am mp pl le e f fo or r T To ok ke en n--M Ma aj jo or ri it ty y--T Ty yp pe e f fe ea at tu ur re es s" }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-102", "text": "----------------------------------" }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-103", "text": "**EXPERIMENTS**" }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-104", "text": "----------------------------------" }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-105", "text": "**CORPUS ANALYSIS**" }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-106", "text": "Our investigation is based on the MSRA and CityU datasets from the NER closed track of the third International Chinese language processing bakeoff (SIGHAN 2006) (Levow, 2006) ; its goal is to perform NER on three entity classes: PERSON, LOCATION and ORGANIZATION." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-107", "text": "We give up the LDC corpus because it is initially designed for ACE Evaluation and the definition of named entity is different from traditional definition." 
}, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-108", "text": "The named entities in SIGHAN training data sets are labeled in IOB-2 format, we covert the corpus to OBIE as a preprocessing, because some existing work and our experiments show that OBIE scheme outperforms other formats when applying machine learning to NER." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-109", "text": "In OBIE format, tokens outside of entities are tagged with O (NONE class), while the first token in an entity is tagged with B-k to begin class k, the token inside the entity is tagged with I-k and the end token in the entity is tagged with E-k; single-token entity is labeled as B-k." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-110", "text": "General information for each dataset appears in Table 2 ." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-111", "text": "It also summarizes the statistic information of seen and unseen entities in the test sets." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-112", "text": "A seen named entity in test set means that it exists in its correspondent training data set." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-113", "text": "From the table, we can find that the proportion of seen entities is very high." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-114", "text": "71.86% of named entities in MSRA test data can be found in MSRA training data, while 73.53% for CityU corpus." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-115", "text": "In fact, most of named entities may appear frequently in our generally lives." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-116", "text": "To make use of existing named entities in training data is crucial to improving capability to capture seen entities and thereby unseen entities, since many models consider the possibilities of labels in context." 
}, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-117", "text": "We also see an interesting phenomenon in MSRA corpus that many named entities are consecutive without punctuations, especially the person names." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-118", "text": "Particularly, in MSRA testing data, nearly 20% named entities appear consecutively." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-119", "text": "It brings great difficulties for NER system to capture such entities separately." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-120", "text": "----------------------------------" }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-121", "text": "**PROBLEMS OF NER WITH ONLY LOCAL INFORMATION**" }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-122", "text": "1 For the string\u4e24\u5cb8, when it refers to Mainland and Taiwan, it is an entity, when it refers to the bank of rivers, it is a common word." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-124", "text": "The F0 row lists the precision, recall and F-measure (\u03b2=1) got by the first CRF (described in Section 2) using local features alone." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-125", "text": "The score makes the first CRF rank the top position on the MSRA and the second on the CityU in SIGHAN bakeoff (Levow, 2006) 2 ." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-126", "text": "It shows that our baseline has achieved the state-of-the-art performance." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-127", "text": "However, comparing the recall with the precision on each dataset, we find that the performance is impaired by the relatively low recall." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-128", "text": "To investigate the causes of this problem, we analyze the missed entities further." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-129", "text": "We categorize them into two classes, seen and unseen in training data." 
}, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-130", "text": "Five kinds of statistic information are collected and listed in F0 column in table 4. (1) The number of different missed named entities; (2) The times of missing occurrences; (3) The number of different missed named entities which are detected correctly at least once; (4) The times missing occurrences under the case of (3)." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-131", "text": "From (1) and (2) measurements of the seen entities in the F0 column in table 4, we find that there are many seen named enmities are missed." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-132", "text": "Though identifying unseen named entities is more difficult than seen named entities, the boldfaced number indicates that about 10% (24 of 254) for MSRA and 23% (111 of 476) for CityU of unseen and missed named entities have been labeled out correctly for at least once." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-133", "text": "The difficulty to capture unseen named entities in training data is because of the nature of machine learning techniques." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-134", "text": "However, the statistical results in Table 4 show there is a great potential (200+48)/(200+330)=47% for MSRA and (384+396)/ (384+1144)=51% for CityU, to improve recall by enhancing the capture of seen named entities and making use of labeled outputs from test data to capture more unseen named entities." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-135", "text": "What is more, performance can be improved further when more named entities are labeled correctly, because many models, such as CRF, assign labels according to the possibilities of whole sequence." 
}, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-136", "text": "----------------------------------" }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-137", "text": "**T TO OK KE EN N ENTITY-MAJORITY-TYPE FEATURE**" }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-138", "text": "\u6211 \u6211 - - \u7231 \u7231 - - \uf963 \uf963 B B--M Ma aj j--L LO OC C \u4eac \u4eac E E--M Ma aj j--L LO OC C \u5929 \u5929 B B--M Ma aj j--L LO OC C \u5b89 \u5b89 I I--M Ma aj j--L LO OC C \u95e8 \u95e8 E E--M Ma aj j--L LO OC C" }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-139", "text": "----------------------------------" }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-140", "text": "**INFLUENCE OF USING NON-LOCAL FEATURES IN NER**" }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-141", "text": "After we feed the non-local features (described in Section 3) to the second CRF, we test it on the testing data of MSRA and CityU again." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-142", "text": "Table 3 lists the performance got by each kind of feature configurations." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-143", "text": "F0 means the first CRF (baseline) using local features alone, and the F0+Fi (i=1, 2, 3, 4) means the second CRF using local features (F0) as well as the non-local features Fi." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-144", "text": "From the table 3, we can conclude that exploiting non-local information is a good choice to recall more missed entities." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-145", "text": "Comparing with the baseline using only local context, the recalls of NER systems are improved after taking non-local information into account by -0.34%~3.76% on MSRA, 2.92%~3.68% on CityU. And the overall F-measures increase by -0.54%~2.19% and 0.72%~1.27% on MSRA and CityU each." 
}, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-146", "text": "The MSRA performance got by F0+F1 decrease slightly because there are many consecutive entities in the testing data." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-147", "text": "Since F1 does not encode boundary and class information, more entity tokens are recalled, but their boundaries or classes are wrong." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-148", "text": "After we implement a post-processing step with person name list extracted from the MSRA training data to separate the consecutive candidate entities, the performance lists with F0+F1 (PP) increases." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-149", "text": "The performance difference among F1, F2, F3 and F4 are mainly because they encode different useful non-local information as described in Section 3.2." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-150", "text": "For F1, it only encode whether a token sequence is an entity." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-151", "text": "No boundary and class are considered which are represented in F2 and F3 respectively, so F2 and F3 both achieve high performance than F0, and F4 consider both boundary and class simultaneously, so it is the best choice of exploiting non-local information to improve NER recall." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-152", "text": "We can not compare between F2 and F3 directly because boundary detection and semantic class classification are the two different sub-tasks in NER." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-153", "text": "The performance difference between the performance on CityU and MSRA come from two folds." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-154", "text": "One is because CityU testing data contains more seen entities than that of MSRA since the seen entities can be captured easily by the non-local features." 
}, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-155", "text": "The other is because MSRA data sets contain much more consecutive named entities than CityU. Since NER with non-local information prefers to dig out more and thereby longer named entities, it may tend to label more 2 The best F1-score on MSRA and CityU is 86.51% and 89.03% respectively." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-156", "text": "continuous named entities as a single named entities and introduce more errors damaging both in recall and precision." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-157", "text": "Then, we investigate the situation of missed seen and missed unseen named entity in NER with non-local information by filling the Table 4 ." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-158", "text": "F0 is the first CRF (baseline) using local features alone, and the F0+Fi (i=1, 2, 3, 4) means the second CRF using local features (F0) as well as the non-local features Fi." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-159", "text": "The four same measurements are used as described in Section 4.2." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-160", "text": "Compared with the numbers in F0 column, significant reduction of missing of seen entities is achieved by adding non-local features." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-161", "text": "What is more, the hit of unseen entities is also increased as we predicted in previous analysis." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-162", "text": "entities have been recognized correctly with local context alone." }, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-163", "text": "We also compare the different kinds of non-local features which fit to different NER sub-tasks and find that non-local feature considering the boundary and class information simultaneously is the best." 
}, { "sent_id": "5b98a80237182b2d506ea4c9d71aa1-C001-164", "text": "Our approach is language independent, due to lack of annotated corpora of other languages, the experiments have only been conducted on Chinese corpora, and related experiments on other languages can be done in the future." } ], "y": { "@BACK@": { "gold_contexts": [ [ "5b98a80237182b2d506ea4c9d71aa1-C001-13", "5b98a80237182b2d506ea4c9d71aa1-C001-14" ], [ "5b98a80237182b2d506ea4c9d71aa1-C001-21", "5b98a80237182b2d506ea4c9d71aa1-C001-22", "5b98a80237182b2d506ea4c9d71aa1-C001-23" ], [ "5b98a80237182b2d506ea4c9d71aa1-C001-85", "5b98a80237182b2d506ea4c9d71aa1-C001-86" ] ], "cite_sentences": [ "5b98a80237182b2d506ea4c9d71aa1-C001-13", "5b98a80237182b2d506ea4c9d71aa1-C001-21", "5b98a80237182b2d506ea4c9d71aa1-C001-86" ] }, "@MOT@": { "gold_contexts": [ [ "5b98a80237182b2d506ea4c9d71aa1-C001-13", "5b98a80237182b2d506ea4c9d71aa1-C001-14" ], [ "5b98a80237182b2d506ea4c9d71aa1-C001-21", "5b98a80237182b2d506ea4c9d71aa1-C001-22", "5b98a80237182b2d506ea4c9d71aa1-C001-23" ] ], "cite_sentences": [ "5b98a80237182b2d506ea4c9d71aa1-C001-13", "5b98a80237182b2d506ea4c9d71aa1-C001-21" ] }, "@SIM@": { "gold_contexts": [ [ "5b98a80237182b2d506ea4c9d71aa1-C001-26", "5b98a80237182b2d506ea4c9d71aa1-C001-27", "5b98a80237182b2d506ea4c9d71aa1-C001-28", "5b98a80237182b2d506ea4c9d71aa1-C001-29" ], [ "5b98a80237182b2d506ea4c9d71aa1-C001-39" ] ], "cite_sentences": [ "5b98a80237182b2d506ea4c9d71aa1-C001-26", "5b98a80237182b2d506ea4c9d71aa1-C001-39" ] }, "@DIF@": { "gold_contexts": [ [ "5b98a80237182b2d506ea4c9d71aa1-C001-26", "5b98a80237182b2d506ea4c9d71aa1-C001-27", "5b98a80237182b2d506ea4c9d71aa1-C001-28", "5b98a80237182b2d506ea4c9d71aa1-C001-29" ] ], "cite_sentences": [ "5b98a80237182b2d506ea4c9d71aa1-C001-26" ] }, "@UNSURE@": { "gold_contexts": [ [ "5b98a80237182b2d506ea4c9d71aa1-C001-96" ] ], "cite_sentences": [ "5b98a80237182b2d506ea4c9d71aa1-C001-96" ] } } }, "ABC_4a63ef4085639a66d1c7f6344f7548_31": { "x": [ { "sent_id": 
"4a63ef4085639a66d1c7f6344f7548-C001-43", "text": "We guess it is an acronym for the authors of (Galley et al., 2004) stract rule." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-2", "text": "Recently, numerous statistical machine translation models which can utilize various kinds of translation rules are proposed." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-3", "text": "In these models, not only the conventional syntactic rules but also the non-syntactic rules can be applied." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-4", "text": "Even the pure phrase rules are includes in some of these models." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-5", "text": "Although the better performances are reported over the conventional phrase model and syntax model, the mixture of diversified rules still leaves much room for study." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-6", "text": "In this paper, we present a refined rule classification system." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-7", "text": "Based on this classification system, the rules are classified according to different standards, such as lexicalization level and generalization." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-8", "text": "Especially, we refresh the concepts of the structure reordering rules and the discontiguous phrase rules." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-9", "text": "This novel classification system may supports the SMT research community with some helpful references." 
}, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-10", "text": "----------------------------------" }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-11", "text": "**INTRODUCTION**" }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-12", "text": "Phrase-based statistical machine translation models (Marcu and Wong, 2002; Koehn et al., 2003; Och and Ney, 2004; Koehn, 2004; Koehn et al., 2007) have achieved significant improvements in translation accuracy over the original IBM word-based model." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-13", "text": "However, there are still many limitations in phrase based models." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-14", "text": "The most frequently pointed limitation is its inefficacy to modeling the structure reordering and the discontiguous corresponding." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-15", "text": "To overcome these limitations, many syntaxbased SMT models have been proposed (Wu, 1997; Chiang, 2007; Ding et al., 2005; Eisner, 2003; Quirk et al., 2005; Liu et al., 2007; Zhang et al., 2007; Zhang et al., 2008a; Zhang et al., 2008b; Gildea, 2003; Galley et al., 2004; Bod, 2007) ." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-16", "text": "The basic motivation behind syntax-based model is that the syntax information has the potential to model the structure reordering and discontiguous corresponding by the intrinsic structural generalization ability." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-17", "text": "Although remarkable progresses have been reported, the strict syntactic constraint (the both sides of the rules should strictly be a subtree of the whole syntax parse) greatly hinders the utilization of the non-syntactic translation equivalents." 
}, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-18", "text": "To alleviate this constraint, a few works have attempted to make full use of the non-syntactic rules by extending their syntax-based models to more general frameworks." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-19", "text": "For example, forest-to-string transformation rules have been integrated into the tree-to-string translation framework by (Liu et al., 2006; Liu et al., 2007) ." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-20", "text": "Zhang et al. (2008a) made it possible to utilize the non-syntactic rules and even the phrases which are used in phrase based model by advancing a general tree sequence to tree sequence framework based on the tree-to-tree model presented in (Zhang et al., 2007) ." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-21", "text": "In these models, various kinds of rules can be employed." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-22", "text": "For example, as shown in Figure 1 and Figure 2 , Figure 1 shows a Chinese-to-English sentence pair with syntax parses on both sides and the word alignments (dotted lines)." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-23", "text": "Figure 2 lists some of the rules which can be extracted from the sentence pair in Figure 1 by the system used in (Zhang et al., 2008a) ." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-24", "text": "These rules includes not only conventional syntax rules but also the tree sequence rules (the multi-headed syntax rules )." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-25", "text": "Even the phrase rules are adopted by the system." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-26", "text": "Although the better performances are reported over the conventional phrase-based model and syntax-based model, the mixture of diversified rules still leaves much room for study." 
}, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-27", "text": "Given such a hybrid rule set, we must want to know what kinds of rules can make more important contributions to the overall system performance and what kinds of rules are redundant compared with the others." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-28", "text": "From engineering point of view, the developers may concern about which kinds of rules should be preferred and which kinds of rules could be discard without too much decline in translation quality." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-29", "text": "However, one of the precondition for the investigations of these issues is what are the \"rule categories\"?" }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-30", "text": "In other words, some comprehensive rule classifications are necessary to make the rule analyses feasible." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-31", "text": "The motivation of this paper is to present such a rule classification." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-32", "text": "----------------------------------" }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-33", "text": "**RELATED WORKS**" }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-34", "text": "A few researches have made some exploratory investigations towards the effects of different rules by classifying the translation rules into different subcategories (Liu et al., 2007; Zhang et al., 2008a; DeNeefe et al., 2007) ." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-35", "text": "Liu et al. (2007) differentiated the rules in their tree-to-string model which integrated with forest 1 -to-string into fully lexicalized rules, non-lexicalized rules and partial lexicalized rules according to the lexicalization levels." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-36", "text": "As an extension, Zhang et al. 
(2008a) proposed two more categories: Structure Reordering Rules (SRR) and Discontiguous Phrase Rules (DPR)." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-37", "text": "SRR stands for rules that have at least two non-terminal leaf nodes with inverted order on the source and target sides." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-38", "text": "DPR refers to rules having at least one non-terminal leaf node between two terminal leaf nodes." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-39", "text": "DeNeefe et al. (2007) made an illuminating breakdown of the different kinds of rules." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-40", "text": "First, they classify all the GHKM rules (Galley et al., 2004; Galley et al., 2006) into two categories: lexical rules and non-lexical rules." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-41", "text": "Non-lexical rules are those whose source side contains no source words." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-42", "text": "In other words, a non-lexical rule is a purely abstract rule. (A \"forest\" means a sub-tree sequence derived from a given parse tree; GHKM abbreviates the authors of Galley et al. (2004): Galley, Hopkins, Knight, and Marcu.)" }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-44", "text": "Lexical rules are the complementary set. Lexical rules are then classified further into phrasal rules and non-phrasal rules." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-45", "text": "Phrasal rules are rules whose source side and whose target-side yield each contain exactly one contiguous phrase; one or more non-terminals can be placed on either side of the phrase." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-46", "text": "In other words, each phrasal rule can be simulated by the conjunction of two or more phrase rules." 
}, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-47", "text": "(DeNeefe et al., 2007) classifies non-phrasal rules further into structural rules, re-ordering rules, and noncontiguous phrase rules." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-48", "text": "However, these categories are not explicitly defined in (DeNeefe et al., 2007) since out of its focus." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-49", "text": "Our proposed rule classification is inspired by these works." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-50", "text": "----------------------------------" }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-51", "text": "**RULES CLASSIFICATIONS**" }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-52", "text": "Currently, there have been several classifications in SMT research community." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-53", "text": "Generally, the rules can be classified into two main groups according to whether syntax information is involved: bilingual phrases (Phrase) and syntax rules (Syntax)." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-54", "text": "Further, the syntax rules can be divided into three categories according to the lexicalization levels (Liu et al., 2007; Zhang et al., 2008a source and target sides are non-lexicons (nonterminals)" }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-55", "text": "3) Partially lexicalized (PLex): otherwise." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-56", "text": "In Figure 2 , R 1 -R 3 are FLex rules, and R 5 -R 8 are PLex rules." 
}, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-57", "text": "Following (Zhang et al., 2008b) , a syntax rule r can be formalized into a tuple < \u03be s , \u03be t , A T , A N T > , where \u03be s and \u03be t are tree sequences of source side and target side respectively, A T is a many-to-many correspondence set which includes the alignments between the terminal leaf nodes from source and target side, and A N T is a one-to-one correspondence set which includes the synchronizing relations between the non-terminal leaf nodes from source and target side." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-58", "text": "Then, the syntax rules can also fall into two categories according to whether equipping with generalization capability (Chiang, 2007; Zhang et al., 2008a) : 1) Initial rules (Initial): all leaf nodes of this rule are terminals." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-59", "text": "2) Abstract rules (Abstract): otherwise, i.e. at least one leaf node is a non-terminal." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-60", "text": "A non-terminal leaf node in a rule is named an abstract node since it has the generalization capability." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-61", "text": "Comparing these two classifications for syntax rules, we can find that a FLex rule is a initial rule when ULex rules and PLex rules belong to abstract rules." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-62", "text": "These classifications are clear and easy for understanding." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-63", "text": "However, we argue that they need further refinement for in-depth study." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-64", "text": "Specially, more refined differentiations are needed for the abstract rules (ULex rules and PLex rules) since they play important roles for the characteristic capabilities which are deemed to be the advantages over the phrase-based model." 
}, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-65", "text": "For instance, the potentials to model the structure reordering and the discontiguous correspondence." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-66", "text": "The Structure Reordering Rules (SRR) and Discontiguous Phrase Rules (DPR) mentioned by (Zhang et al., 2008a) can be regarded as more in-depth classification of the syntax rules." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-67", "text": "In (Zhang et al., 2008a) , they are described as follows: Definition 1: The Structure Reordering Rule (SRR) refers to the structure reordering rule that has at least two non-terminal leaf nodes with inverted order in the source and target side." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-68", "text": "Definition 2: The Discontiguous Phrase Rule (DPR) refers to the rule having at least one nonterminal leaf node between two lexicalized leaf nodes." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-69", "text": "Based on these descriptions, R 7 , R 8 in Figure 2 belong to the category of SRR and R 6 , R 7 fall into the category of DPR." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-70", "text": "Although these two definitions are easy implemented in practice, we argue that the definition of SRR is not complete." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-71", "text": "The reordering rules involving the reordering between content word terminals and non-terminal (such as R 5 in Figure 2 ) also can model the useful structure reorderings." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-72", "text": "Moreover, it is not uncommon that a rule demonstrates the reorderings between two non-terminals as well as the reorderings between one non-terminal and one content word terminal." 
}, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-73", "text": "The reason for our emphasis of content word terminal is that the reorderings between the non-terminals and function word are less meaningful." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-74", "text": "One of the theoretical problems with phrase based SMT models is that they can not effectively model the discontiguous translations and numerous attempts have been made on this issue (Simard et al., 2005; Quirk and Menezes, 2006; Wellington et al., 2006; Bod, 2007; Zhang et al., 2007) ." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-75", "text": "What seems to be lacking, however, is a explicit definition to the discontiguous translation." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-76", "text": "The definition of DPR in (Zhang et al., 2008a ) is explicit but somewhat rough and not very accurate." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-77", "text": "For example, in Figure 3(a) , non-terminal node pair ([0,' '] , [0,'love'] ) is surrounded by lexical terminals." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-78", "text": "According to Definition 2, it is a DPR." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-79", "text": "However, obviously it is not a discontiguous phrase actually." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-80", "text": "This rule can be simulated by conjunctions of three phrases (' ', 'I'; ' ', 'love'; ' ','you')." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-81", "text": "In contrast, the translation rule in Figure 3(b) is an actual discontiguous phrase rule." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-82", "text": "The English correspondences of the Chinese word ' ' is dispersed in the English side in which the correspondence of Chinese word ' ' is inserted." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-83", "text": "This rule can not be simulated by any conjunctions of the sub phrases." 
}, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-84", "text": "It must be noted that the discontiguous phrase (' '-\"switch . . . off\") can not be abstracted under the existing synchronous grammar frameworks." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-85", "text": "The fundamental reason is that the corresponding parts should be abstracted in the same time and lexicalized in the same time." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-86", "text": "In other words, the discontiguous phrase can not be modeled by the permutation between non-terminals (abstract nodes)." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-87", "text": "Another point to notice is that our focus in this paper is the ability demonstrated by the abstract rules." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-88", "text": "Thus, we do not pay much attentions to the reorderings and discontiguous phrases involved in the phrase rules (e.g. \" \"-\"switch the light off\") since they lack the generalization capability." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-89", "text": "Therefore, the discontiguous phrase is limited to the relation between non-terminals and terminals." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-90", "text": "On the basis of the above analyses, we present a novel classification system for the abstract rules based on the crossings between the leaf node alignment links." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-91", "text": "Given an abstract rule r =< Note that the intersection of SRR NT 2 and SRR NT-T is not necessary an empty set, i.e. a rule can be both SRR NT 2 and SRR NT-T rule." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-92", "text": "The basic characteristic of the discontiguous translation is that the correspondence of one nonterminal N T is inserted among the correspondences of one phrase X. Figure 5 (a) illustrates this situation." 
}, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-93", "text": "However, this characteristic can not support necessary and sufficient condition." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-94", "text": "For example, if the phrase X can be divided like Figure 5 (b), then the rule in Figure 5 (a) is actually a reordering rule rather than a discontiguous phrase rule." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-95", "text": "For sufficient condition, we constrain that the phrase X = w i . . ." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-96", "text": "w j need to satisfy the requirement: w i should be connected with w j through word alignment links (A word is connected with itself)." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-97", "text": "In Figure 5(c) , f 1 is connected with f 2 when N T is inserted between e 1 and e 2 ." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-98", "text": "Thus, the rule in Figure 5 (c) is a discontiguous phrase rule." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-99", "text": "Definition 3: Given an abstract rule r =< \u03be s , \u03be t , A T , A N T >, it is a Discontiguous Phrase iff \u2203 two links l t1 , l t2 from A T and a link l nt from A N T , satisfy: l t1 , l t2 are emitted from the same word and l t1 is crossed with l nt when l t2 is not crossed with l nt ." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-100", "text": "Through Definition 3, we know that the DPR is a sub-set of the SRR NT-T." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-101", "text": "----------------------------------" }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-102", "text": "**CONCLUSIONS AND FUTURE WORKS**" }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-103", "text": "In this paper, we present a refined rule classification system." 
}, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-104", "text": "Based on this classification system, the rules are classified according to different standards, such as lexicalization level and generalization." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-105", "text": "Especially, we refresh the concepts of the structure reordering rules and the discontiguous phrase rules." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-106", "text": "This novel classification system may supports the SMT research community with some helpful references." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-107", "text": "In the future works, aiming to analyze the rule contributions and the redundances issues using the presented rule classification based on some real translation systems, we plan to implement some synchronous grammar based syntax translation models such as the one presented in (Liu et al., 2007) or in (Zhang et al., 2008a) ." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-108", "text": "Taking such a system as the experimental platform, we can perform comprehensive statistics about distributions of different rule categories." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-109", "text": "What is more important, the contribution of each rule category can be evaluated seriatim." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-110", "text": "Furthermore, which kinds of rules are preferentially applied in the 1-best decoding can be studied." }, { "sent_id": "4a63ef4085639a66d1c7f6344f7548-C001-111", "text": "All these investigations could reveal very useful information for the optimization of rule extraction and the improvement of the computational models for synchronous grammar based machine translation." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "4a63ef4085639a66d1c7f6344f7548-C001-13", "4a63ef4085639a66d1c7f6344f7548-C001-14", "4a63ef4085639a66d1c7f6344f7548-C001-15" ], [ "4a63ef4085639a66d1c7f6344f7548-C001-18", "4a63ef4085639a66d1c7f6344f7548-C001-19" ], [ "4a63ef4085639a66d1c7f6344f7548-C001-34" ], [ "4a63ef4085639a66d1c7f6344f7548-C001-35" ], [ "4a63ef4085639a66d1c7f6344f7548-C001-52", "4a63ef4085639a66d1c7f6344f7548-C001-53", "4a63ef4085639a66d1c7f6344f7548-C001-54" ] ], "cite_sentences": [ "4a63ef4085639a66d1c7f6344f7548-C001-15", "4a63ef4085639a66d1c7f6344f7548-C001-19", "4a63ef4085639a66d1c7f6344f7548-C001-34", "4a63ef4085639a66d1c7f6344f7548-C001-35", "4a63ef4085639a66d1c7f6344f7548-C001-54" ] }, "@MOT@": { "gold_contexts": [ [ "4a63ef4085639a66d1c7f6344f7548-C001-13", "4a63ef4085639a66d1c7f6344f7548-C001-14", "4a63ef4085639a66d1c7f6344f7548-C001-15" ], [ "4a63ef4085639a66d1c7f6344f7548-C001-34" ] ], "cite_sentences": [ "4a63ef4085639a66d1c7f6344f7548-C001-15", "4a63ef4085639a66d1c7f6344f7548-C001-34" ] }, "@FUT@": { "gold_contexts": [ [ "4a63ef4085639a66d1c7f6344f7548-C001-107" ] ], "cite_sentences": [ "4a63ef4085639a66d1c7f6344f7548-C001-107" ] } } }, "ABC_45d804ec30d20bd7e484c3bbd8399f_31": { "x": [ { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-112", "text": "**CONCLUSION**" }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-2", "text": "We investigate the extent to which individual attention heads in pretrained transformer language models, such as BERT and RoBERTa, implicitly capture syntactic dependency relations." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-23", "text": "In addition, we also perform a similar analysis on BERT models fine-tuned on natural language understanding tasks as well as RoBERTa." 
}, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-3", "text": "We employ two methods-taking the maximum attention weight and computing the maximum spanning tree-to extract implicit dependency relations from the attention weights of each layer/head, and compare them to the ground-truth Universal Dependency (UD) trees." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-4", "text": "We show that, for some UD relation types, there exist heads that can recover the dependency type significantly better than baselines on parsed English text, suggesting that some self-attention heads act as a proxy for syntactic structure." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-5", "text": "We also analyze BERT fine-tuned on two datasets-the syntaxoriented CoLA and the semantics-oriented MNLI-to investigate whether fine-tuning affects the patterns of their self-attention, but we do not observe substantial differences in the overall dependency relations extracted using our methods." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-6", "text": "Our results suggest that these models have some specialist attention heads that track individual dependency types, but no generalist head that performs holistic parsing significantly better than a trivial baseline, and that analyzing attention weights directly may not reveal much of the syntactic knowledge that BERT-style models are known to learn." 
}, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-7", "text": "----------------------------------" }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-9", "text": "Pretrained Transformer models such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) have shown stellar performance on language understanding tasks, significantly improve the state-of-the-art on many tasks such as dependency parsing (Zhou et al., 2019) , question answering (Rajpurkar et al., 2016) , and have at- * Equal contribution \u2020 Currently working at Verisk Analytics." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-10", "text": "This work was completed when the author was at New York University." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-11", "text": "tained top positions on transfer learning benchmarks such as GLUE and Su-perGLUE ." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-12", "text": "As these models become a staple component of many NLP tasks, it is crucial to understand what kind of knowledge they learn, and why and when they perform well." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-13", "text": "To that end, researchers have investigated the linguistic knowledge that these models learn by analyzing BERT (Goldberg, 2018; Lin et al., 2019) directly or training probing classifiers on the contextualized embeddings or attention heads of BERT (Tenney et al., 2019b,a; Hewitt and Manning, 2019) ." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-14", "text": "BERT and RoBERTa, as Transformer models (Vaswani et al., 2017) , compute the hidden representation of all the attention heads at each layer for each token by attending to all the token representations in the preceding layer." 
}, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-15", "text": "In this work, we investigate the hypothesis that BERTstyle models use at least some of their attention heads to track syntactic dependency relationships between words." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-16", "text": "We use two dependency relation extraction methods to extract dependency relations from each self-attention heads of BERT and RoBERTa." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-17", "text": "The first method-maximum attention weight (MAX)-designates the word with the highest incoming attention weight as the parent, and is meant to identify specialist heads that track specific dependencies like obj (in the style of Clark et al., 2019) ." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-18", "text": "The second-maximum spanning tree (MST)-computes a maximum spanning tree over the attention matrix, and is meant to identify generalist heads that can form complete, syntactically informative dependency trees." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-19", "text": "We analyze the extracted dependency relations and trees to investigate whether the attention heads of these models track syntactic dependencies significantly better than chance or baselines, and what type of dependency relations they learn best." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-20", "text": "In contrast to probing models (Adi et al., 2017; Conneau et al., 2018) , our methods require no further training." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-21", "text": "In prior work, Clark et al. (2019) find that some heads of BERT exhibit the behavior of some dependency relation types, though they do not perform well at all types of relations in general." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-22", "text": "We are able to replicate their results on BERT using our MAX method." 
}, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-24", "text": "Our experiments suggest that there are particular attention heads of BERT and RoBERTa that encode certain dependency relation types such as nsubj, obj with substantially higher accuracy than our baselines-a randomly initialized Transformer and relative positional baselines." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-25", "text": "We find that fine-tuning BERT on the syntax-oriented CoLA does not significantly impact the accuracy of extracted dependency relations." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-26", "text": "However, when fine-tuned on the semantics-oriented MNLI dataset, we see improvements in accuracy for longer-distance clausal relations and a slight loss in accuracy for shorter-distance relations." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-27", "text": "Overall, while BERT and RoBERTa models obtain nontrivial accuracy for some dependency types such as nsubj, obj and conj when we analyze individual heads, their performance still leaves much to be desired." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-28", "text": "On the other hand, when we use the MST method to extract full trees from specific dependency heads, BERT and RoBERTa fail to meaningfully outperform our baselines." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-29", "text": "Although the attention heads of BERT and RoBERTa capture several specific dependency relation types somewhat well, they do not reflect the full extent of the significant amount of syntactic knowledge that these models are known to learn." 
}, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-30", "text": "----------------------------------" }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-31", "text": "**RELATED WORK**" }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-32", "text": "Previous works have proposed methods for extracting dependency relations and trees from the attention heads of the transformer-based neural machine translation (NMT) models." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-33", "text": "In their preliminary work, Mare\u010dek and Rosa (2018) aggregate the attention weights across the self-attention layers and heads to form a single attention weight matrix." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-34", "text": "Using this matrix, they propose a method to extract constituency and (undirected) dependency trees by recursively splitting and constructing the maximum spanning trees respectively." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-35", "text": "In contrast, Raganato and Tiedemann (2018) train a Transformer-based machine translation model on different language pairs and extract the maximum spanning tree algorithm from the attention weights of the encoder for each layer and head individually." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-36", "text": "They find that the best dependency score is not significantly higher than a right-branching tree baseline." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-37", "text": "Voita et al. (2019) find the most confident attention heads of the Transformer NMT encoder based on a heuristic of the concentration of attention weights on a single token, and find that these heads mostly attend to relative positions, syntactic relations, and rare words." 
}, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-38", "text": "Additionally, researchers have investigated the syntactic knowledge that BERT learns by analyzing the contextualized embeddings (Warstadt et al., 2019a) and attention heads of BERT (Clark et al., 2019) ." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-39", "text": "Goldberg (2018) analyzes the contextualized embeddings of BERT by computing language model surprisal on subject-verb agreement and shows that BERT learns significant knowledge of syntax." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-40", "text": "Tenney et al. (2019b) introduce a probing classifier for evaluating syntactic knowledge in BERT and show that BERT encodes syntax more than semantics." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-41", "text": "Hewitt and Manning (2019) train a structural probing model that maps the hidden representations of each token to an inner-product space that corresponds to syntax tree distance." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-42", "text": "They show that the learned spaces of strong models such as BERT and ELMo (Peters et al., 2018) are better for reconstructing dependency trees compared to baselines." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-43", "text": "Clark et al. (2019) train a probing classifier on the attentionheads of BERT and show that BERT's attention heads capture substantial syntactic information." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-111", "text": "----------------------------------" }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-44", "text": "While there has been prior work on analysis of the attention heads of BERT, we believe we are the first to analyze the dependency relations learned by the attention heads of fine-tuned BERT models and RoBERTa." 
}, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-45", "text": "----------------------------------" }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-46", "text": "**METHODS**" }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-47", "text": "----------------------------------" }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-48", "text": "**MODELS**" }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-49", "text": "BERT (Devlin et al., 2019) is a Transformer-based masked language model, pretrained on BooksCorpus (Zhu et al., 2015) and English Wikipedia, that has attained stellar performance on a variety of downstream NLP tasks." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-50", "text": "RoBERTa (Liu et al., 2019) adds several refinements to BERT while using the same model architecture and capacity, including a longer training schedule over more data, and shows significant improvements over BERT on a wide range of NLP tasks." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-51", "text": "We run our ex-periments on the pretrained large versions of both BERT (cased and uncased) and RoBERTa models, which consist of 24 self-attention layers with 16 heads each layer." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-52", "text": "For a given dataset, we feed each input sentence through the respective model and capture the attention weights for each individual head and layer." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-53", "text": "Phang et al. (2018) report the performance gains on the GLUE benchmark by supplementing pretrained BERT with data-rich supervised tasks such as the Multi-Genre Natural Language Inference dataset (MNLI; Williams et al., 2018) ." 
}, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-54", "text": "Although these fine-tuned BERT models may learn different aspects of language and show different performance from BERT on GLUE benchmark, comparatively little previous work has investigated the syntax learned by these fine-tuned models (Warstadt et al., 2019a) ." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-55", "text": "We run experiments on the uncased BERT-large model finetuned on the Corpus of Linguistic Acceptability (CoLA; Warstadt et al., 2019b) and MNLI to investigate the impact of fine-tuning on a syntaxrelated task (CoLA) and a semantic-related task (MNLI) on the structure of attention weights and resultant extracted dependency relations." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-56", "text": "We refer to these fine-tuned models as CoLA-BERT and MNLI-BERT respectively." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-57", "text": "----------------------------------" }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-58", "text": "**ANALYSIS METHODS**" }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-59", "text": "We aim to test the hypothesis that the attention heads of BERT learn syntactic relations implicitly, and that the self-attention between two words corresponds to their dependency relation." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-60", "text": "We use two methods for extracting dependency relations from the attention heads of Transformer-based models." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-61", "text": "Both methods operate on the attention weight matrix W \u2208 (0, 1) T \u00d7T for a given head at a given layer, where T is the number of tokens in the sequence, and the rows and columns correspond to the attending and attended tokens respectively (such that each row sums to 1)." 
}, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-62", "text": "Method 1: Maximum Attention Weights (MAX) Given a token A in a sentence, a selfattention mechanism is designed to assign high attention weights on tokens that have some kind of relationship with token A (Vaswani et al., 2017) ." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-63", "text": "Therefore, for a given token A, a token B that has the highest attention weight with respect to the token A should be related to token A. Our aim is to investigate whether this relation maps to a universal dependency relation." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-64", "text": "We assign a relation (w i , w j ) between word w i and w j if j = argmax W [i] for each row (that corresponds to a word in attention matrix) i in attention matrix W ." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-65", "text": "Based on this simple strategy, we extract relations for all sentences in our evaluation datasets." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-66", "text": "This method is similar to Clark et al. (2019) , and attempts to recover individual arcs between words; the relations extracted using this method need not form a valid tree, or even be fully connected, and the resulting edge directions may or may not match the canonical directions." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-67", "text": "Hence, we evaluate the resulting arcs individually and ignore their direction." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-68", "text": "After extracting dependency relations from all heads at all layers, we take the maximum UUAS over all relations types." 
}, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-69", "text": "----------------------------------" }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-70", "text": "**METHOD 2: MAXIMUM SPANNING TREE (MST)**" }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-71", "text": "We also want to investigate if there are attention heads of BERT that can form complete, syntactically informative parse trees." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-72", "text": "To extract complete valid dependency trees from the attention weights for a given layer and head, we follow the approach of Raganato and Tiedemann (2018) and treat the matrix of attention weight tokens as a complete weighted directed graph, with the edges pointing from the output token to each attended token." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-73", "text": "As in Raganato and Tiedemann, we take the root of the gold dependency tree as the starting node and apply the Chu-Liu-Edmonds algorithm (Chu, 1965; Edmonds, 1967) to compute the maximum spanning tree." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-74", "text": "(Using the gold root as the starting point in MST may artificially improve our results slightly, but this bias is applied evenly across all the models we compare.) The resulting tree is a valid directed dependency tree, though we follow Hewitt and Manning (2019) in evaluating it as undirected, for easier comparison with our MAX method." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-75", "text": "Following Voita et al. (2019) , we exclude the sentence demarcation tokens ([CLS], [SEP], , ) from the attention matrices." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-76", "text": "This allows us to focus on inter-word attention." 
}, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-77", "text": "Where the tokenization of our parsed corpus does not match the model's tokenization, we merge the non-matching tokens until the tokenizations are mutually compatible, and sum the attention weights for the corresponding columns and rows." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-78", "text": "We then apply either of the two extraction methods to the attention matrix." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-79", "text": "During evaluation when we compare the gold dependencies, to handle the subtokens within the merged tokens, we set all subtokens except for the first to depend on the first subtoken." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-80", "text": "This approach is largely similar to that in Hewitt and Manning (2019) ." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-81", "text": "We use the English Parallel Universal Dependencies (PUD) treebank from the CoNLL 2017 shared task (Zeman et al., 2018) as the gold standard for our evaluation." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-82", "text": "Baselines Many dependency relations tend to occur in specific positions relative to the parent word." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-83", "text": "For example, amod (adjectival modifier) mostly occurs before a noun." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-84", "text": "As an example, Figure 1 shows the distribution of relative positions for four major UD relations in our data." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-85", "text": "Following Voita et al. (2019) , we compute the most common positional offset between a parent and child word for a given dependency relation, and formulate a baseline based on that most common relative positional offset to evaluate our methods." 
}, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-86", "text": "For MST, as we also want to evaluate the quality of the entire tree, we use a right-branching dependency tree as baseline." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-87", "text": "Additionally, we use a BERT-large model with randomly initialized weights (which we refer to as random BERT), as previous work has shown that randomly initialized sentence encoders perform surprisingly well on a suite of NLP tasks (Zhang and Bowman, 2018; Wieting and Kiela, 2019) ." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-88", "text": "Figure 2 and Table 1 describe the accuracy for the most frequent relation types in the dataset using relations extracted based on the MAX method." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-89", "text": "We also include results for the rarer long-distance advcl and csubj dependency types, as they show that MNLI-BERT has a tendency to track clausal dependencies better than BERT, CoLA-BERT, and RoBERTa." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-90", "text": "The non-random models outperform random BERT substantially for all dependency types." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-91", "text": "They also outperform the rel- ative position baselines for some relation types." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-92", "text": "They outperform the baselines by a large margin for nsubj and obj, but only slightly better for advmod and amod." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-93", "text": "These results are consistent with the findings of Clark et al. (2019) ." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-94", "text": "Moreover, we do not observe substantial differences in accuracy by fine-tuning on CoLA." 
}, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-95", "text": "Both BERT and CoLA-BERT have similar or slightly better performance than MNLI-BERT, except for clausal dependencies such as advcl (adverbial clause modifier) and csubj (clausal subject) where MNLI-BERT outperforms BERT and CoLA-BERT by more than 5 absolute points in accuracy." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-96", "text": "This suggests that fine-tuning on a semantics-oriented task encourages effective long-distance dependencies, although it slightly degrades the performance in other shorter-distance dependency types." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-97", "text": "Figure 3 shows the accuracy for nsubj, obj, advmod, and amod relations extracted based on the MST method." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-98", "text": "Similar to the MAX method, we choose the best accuracy for each relation type." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-99", "text": "We observe that the models outperform the baselines by a large margin for nsubj and obj." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-100", "text": "However, the models do not outperform the positional baseline for advmod and amod." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-101", "text": "Surprisingly, RoBERTa performs worse than other BERT models in all categories when the MAX method is used to extract the trees, but it outperforms all other models when the MST method is used." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-102", "text": "Figure 4 describes the maximum undirected unlabeled attachment scores (UUAS) across each layer." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-103", "text": "The trained models achieve significantly higher UUAS than random BERT." 
}, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-104", "text": "Although the trained models perform better than the rightbranching baseline in most cases, the performance gap is not substantial." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-105", "text": "Given that the MST method uses the root of the gold trees, whereas the rightbranching baseline does not, this implies that the attention weights in the different layers/heads of the BERT models do not appear to correspond to complete, syntactically informative parse trees." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-106", "text": "----------------------------------" }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-107", "text": "**RESULTS**" }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-108", "text": "Overall, the results of both analysis methods suggest that, although some attention heads of BERT capture specific dependency relation types, they do not reflect the full extent of the significant amount of syntactic knowledge BERT and RoBERTa are known to learn as shown in previous syntactic probing work (Tenney et al., 2019b; Hewitt and Manning, 2019) ." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-109", "text": "Additionally, we find that fine-tuning on the semantics-oriented MNLI dataset improves long term dependencies while slightly degrading the performance for other dependency types." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-110", "text": "The overall performance of BERT and the fine-tuned BERTs over the nonrandom baselines are not substantial, and finetuning on CoLA and MNLI also does not have a large impact on UUAS." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-113", "text": "In this work, we investigate whether the attention heads of BERT and RoBERTa exhibit the implicit syntax dependency by extracting and analyzing the dependency relations from the attention heads at all layers." 
}, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-114", "text": "We use two simple dependency relation extraction methods that require no additional training, and observe that there are certain specialist attention heads of the models that track specific dependency types, but neither of our analysis methods support the existence of generalist heads that can perform holistic parsing." }, { "sent_id": "45d804ec30d20bd7e484c3bbd8399f-C001-115", "text": "Furthermore, we observe that fine-tuning on CoLA and MNLI does not significantly change the overall pattern of selfattention within the frame of our analysis, despite their being tuned for starkly different downstream tasks." } ], "y": { "@BACK@": { "gold_contexts": [ [ "45d804ec30d20bd7e484c3bbd8399f-C001-16", "45d804ec30d20bd7e484c3bbd8399f-C001-17" ], [ "45d804ec30d20bd7e484c3bbd8399f-C001-21", "45d804ec30d20bd7e484c3bbd8399f-C001-22" ], [ "45d804ec30d20bd7e484c3bbd8399f-C001-38" ] ], "cite_sentences": [ "45d804ec30d20bd7e484c3bbd8399f-C001-17", "45d804ec30d20bd7e484c3bbd8399f-C001-21", "45d804ec30d20bd7e484c3bbd8399f-C001-38" ] }, "@USE@": { "gold_contexts": [ [ "45d804ec30d20bd7e484c3bbd8399f-C001-16", "45d804ec30d20bd7e484c3bbd8399f-C001-17" ] ], "cite_sentences": [ "45d804ec30d20bd7e484c3bbd8399f-C001-17" ] }, "@MOT@": { "gold_contexts": [ [ "45d804ec30d20bd7e484c3bbd8399f-C001-21", "45d804ec30d20bd7e484c3bbd8399f-C001-22" ] ], "cite_sentences": [ "45d804ec30d20bd7e484c3bbd8399f-C001-21" ] }, "@EXT@": { "gold_contexts": [ [ "45d804ec30d20bd7e484c3bbd8399f-C001-21", "45d804ec30d20bd7e484c3bbd8399f-C001-22" ], [ "45d804ec30d20bd7e484c3bbd8399f-C001-62", "45d804ec30d20bd7e484c3bbd8399f-C001-63", "45d804ec30d20bd7e484c3bbd8399f-C001-64", "45d804ec30d20bd7e484c3bbd8399f-C001-65", "45d804ec30d20bd7e484c3bbd8399f-C001-66", "45d804ec30d20bd7e484c3bbd8399f-C001-67", "45d804ec30d20bd7e484c3bbd8399f-C001-68" ] ], "cite_sentences": [ "45d804ec30d20bd7e484c3bbd8399f-C001-21", "45d804ec30d20bd7e484c3bbd8399f-C001-66" ] }, 
"@SIM@": { "gold_contexts": [ [ "45d804ec30d20bd7e484c3bbd8399f-C001-62", "45d804ec30d20bd7e484c3bbd8399f-C001-63", "45d804ec30d20bd7e484c3bbd8399f-C001-64", "45d804ec30d20bd7e484c3bbd8399f-C001-65", "45d804ec30d20bd7e484c3bbd8399f-C001-66", "45d804ec30d20bd7e484c3bbd8399f-C001-67", "45d804ec30d20bd7e484c3bbd8399f-C001-68" ], [ "45d804ec30d20bd7e484c3bbd8399f-C001-93" ] ], "cite_sentences": [ "45d804ec30d20bd7e484c3bbd8399f-C001-66", "45d804ec30d20bd7e484c3bbd8399f-C001-93" ] } } }, "ABC_4c8e83eb213879e68285e9cd09be47_31": { "x": [ { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-2", "text": "We highlight several issues in the evaluation of historical text normalization systems that make it hard to tell how well these systems would actually work in practice-i.e., for new datasets or languages; in comparison to more na\u00efve systems; or as a preprocessing step for downstream NLP tools." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-3", "text": "We illustrate these issues and exemplify our proposed evaluation practices by comparing two neural models against a na\u00efve baseline system." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-4", "text": "We show that the neural models generalize well to unseen words in tests on five languages; nevertheless, they provide no clear benefit over the na\u00efve baseline for downstream POS tagging of an English historical collection." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-5", "text": "We conclude that future work should include more rigorous evaluation, including both intrinsic and extrinsic measures where possible." 
}, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-6", "text": "----------------------------------" }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-8", "text": "Historical text normalization systems aim to convert historical wordforms to their modern equivalents, in order to make historical documents more searchable or to improve the performance of downstream NLP tools." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-9", "text": "In historical texts, a single word type may be realized with several different orthographic forms, which may not correspond to the modern form." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-10", "text": "For example, the modern English word said might be realized as sayed, seyd, said, sayd, etc." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-11", "text": "Spellings change over time, but also vary within a single time period and even within a single author, since orthography only became standardized in many languages fairly recently." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-12", "text": "Over the years, researchers have proposed normalization methods based on rules and/or edit distances (Baron and Rayson, 2008; Bollmann, 2012; Hauser and Schulz, 2007; Bollmann et al., 2011; Pettersson et al., 2013a; Mitankin et al., 2014; Pettersson et al., 2014) , statistical machine translation (Pettersson et al., 2013b; Scherrer and Erjavec, 2013) , and most recently neural network models (Bollmann and S\u00f8gaard, 2016; Bollmann et al., 2017; Korchagina, 2017) ." 
}, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-13", "text": "However, most of these systems have been developed and tested on a single language (or even a single corpus), and many have not been compared to the na\u00efve but strong baseline that only changes words seen in the training data, normalizing each to its most frequent modern form observed during training." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-14", "text": "1 These issues make it hard to tell which methods generalize across languages and corpora, and how they compare to each other." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-15", "text": "Moreover, researchers have rarely examined whether their systems actually improve performance on downstream tasks." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-16", "text": "This paper brings together best practices for evaluating historical text normalization systems, highlighting in particular the need to report results on unseen tokens and to consider the na\u00efve baseline." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-17", "text": "We focus our evaluation on two recent neural models: one that has been previously tested only on a German collection that is not widely available (Bollmann et al., 2017) , and one that is adapted from work on morphological re-inflection, but has not been used for historical text normalization (Aharoni et al., 2017) ." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-18", "text": "Both are encoderdecoder models; the former with soft attention, and the latter with hard monotonic attention." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-19", "text": "We present results on five languages, for both seen and unseen words and for various amounts of training data." 
}, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-20", "text": "The soft attention model performs surprisingly poorly on seen words, so that its overall performance is worse than the na\u00efve baseline and several earlier models (Pettersson et al., 2014) ." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-21", "text": "However, on unseen words (which we argue are what matters), both neural models do well." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-22", "text": "Unfortunately, these positive results did not" }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-23", "text": "translate into improvements when we tested the English-trained models on a downstream POS tagging task using a different historical collection spanning a similar time range." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-24", "text": "Normalizing the text gave better tag accuracy than not normalizing, but neither neural model convincingly outperformed the na\u00efve normalizer." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-25", "text": "Although these results are disappointing, the clear evaluation standards laid out here should benefit future work in this area." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-26", "text": "----------------------------------" }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-27", "text": "**TASK SETTING AND ISSUES OF EVALUATION**" }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-28", "text": "We follow previous work in training our systems on pairs (h, m) of historical tokens and their gold standard modern forms." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-29", "text": "2 Note that at test time, most of the h tokens will have been seen before in the training data (due to Zipf's law), and for these tokens it is very difficult to beat a baseline that normalizes each h to the most common m seen for it in training." 
}, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-30", "text": "3 Thus, in practice, normalization systems should typically only be applied to unseen tokens." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-31", "text": "It is therefore critical to report both dataset statistics and experimental results for unseen tokens." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-32", "text": "Unfortunately, some recent papers have only reported accuracy on all tokens, and only in comparison to other (non-baseline) systems (Bollmann and S\u00f8gaard, 2016; Bollmann et al., 2017; Korchagina, 2017) ." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-33", "text": "These figures can be misleading if systems underperform the na\u00efve baseline on seen tokens (which we show does happen in practice)." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-34", "text": "To see why, suppose 80% of test tokens were seen in training, and the baseline gets 90% of them right, while system A gets 80% and system B gets only 70%." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-35", "text": "Meanwhile the baseline gets only 50% of unseen tokens right, whereas systems A and B get 70% and 90%, respectively." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-36", "text": "A's accuracy is higher overall than B's (78% vs 74%), but both systems underperform the baseline (82%)." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-37", "text": "More importantly, the best system (90% accuracy overall) is achieved by applying the baseline to seen tokens, and the system that generalizes best (B) to unseen tokens; it is irrelevant that A scores higher overall than B." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-38", "text": "Stemming from the reasoning above, we argue that a full evaluation of any spelling normalization system requires more complete dataset statistics and experimental results." 
}, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-39", "text": "In describing the training and test sets, researchers should not only report the number of types and tokens, but also the per-centage of unseen tokens in the test (or dev) set and the percentage of training items (h, m) where h = m. This last statistic measures the degree of spelling variation, which varies considerably between corpora." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-40", "text": "As for reporting results, we have argued that accuracy should be reported separately for seen vs unseen tokens, and overall results compared to the na\u00efve memorization baseline." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-41", "text": "Since historical spelling normalization is typically a low-resource task, systems should also ideally be tested with varying amounts of training data to assess how much annotation might be required for a new corpus (Pettersson et al., 2014; Bollmann and S\u00f8gaard, 2016; Korchagina, 2017) ." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-42", "text": "Finally, since these systems may be deployed on corpora other than those they were trained on, and used as preprocessing for other tasks, we advocate reporting performance on a downstream task and/or different corpus." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-43", "text": "To our knowledge the only previous supervised learning system to do so is Pettersson et al. (2013b) ." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-44", "text": "----------------------------------" }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-45", "text": "**MODELS**" }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-46", "text": "We focus on two neural encoder-decoder models for spelling normalization, comparing them against the memorization baseline and to previous results from Pettersson et al. (2014) ." 
}, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-47", "text": "The first model (Bollmann et al., 2017) 4 uses a fairly standard architecture with a bi-directional LSTM encoder and an LSTM decoder with soft attention (Xu et al., 2015) , and is trained using cross-entropy loss." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-48", "text": "The second model is a new approach to spelling normalization, which adapts the morphological reinflection system of Aharoni et al. (2017) ." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-49", "text": "5 The reinflection model generates the characters in an inflected wordform (y 1:n ), given the characters of its lemma (x 1:m ) and a set of corresponding morphological features (f )." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-50", "text": "Rather than using a soft attention mechanism that computes a weight vector over the entire sequence, this model exploits the generally monotonic character alignment between x 1:m and y 1:n and attends to only a single encoded input character at a time during decoding." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-51", "text": "Architecturally, the model uses a standard bidirectional encoder." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-52", "text": "The decoder steps through the characters of the input and considers jointly the output of the previous step, the morphological features, and the currently attended encoded input." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-53", "text": "It outputs either a character or an advance symbol (to advance the focus of attention for the next time step)." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-54", "text": "It is trained on an oracle sequence of write/advance actions s 1:q which are generated from an automatic alignment of the input and output sequences." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-55", "text": "The model maximizes p(s 1:q |x 1:m , f )." 
}, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-56", "text": "For details, see Aharoni et al. (2017) ." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-57", "text": "We adapt the model to our purpose by removing the morphological features f , maximising only p(s 1:q |x 1:m )." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-58", "text": "The monotonic assumption is wellsuited to our task, since fewer than 0.4% of edit operations require non-monotonic alignments (i.e. character transpositions) in any of our datasets." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-59", "text": "Other than removing the need for morphological features from the hard attention model, and increasing the number of training epochs to 50 for both models, we did no further hyperparameter tuning, since our goal was to assess the \"off-the-shelf\" performance of these systems." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-60", "text": "----------------------------------" }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-61", "text": "**EXPERIMENTS**" }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-62", "text": "We use the same datasets as Pettersson et al. (2014) , with data from five languages over a range of historical periods." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-63", "text": "6 We use the same train/dev/test splits as Pettersson; dataset statistics are shown in Table 1 . Because we do no hyperparameter tuning, we do not use the development sets, and all results are reported on the test sets." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-64", "text": "Each system was tested as recommended above, with accuracy reported separately on seen and unseen items, and for different training data sizes." 
}, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-65", "text": "To evaluate the downstream effects of normalization, we applied the models to a collection of unseen documents and then tagged them with the Stan-ford POS tagger, which comes pre-trained on modern English." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-66", "text": "The documents are from the Parsed Corpus of Early English Correspondence (PCEEC) (Taylor et al., 2006) , comprised of 84 letter collections from the 15th-17th centuries." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-67", "text": "(Our English normalization training data is from the 14th-17th centuries.) PCEEC contains roughly 2.2m manually POS-tagged tokens but no spelling annotation." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-68", "text": "Because it uses a large and somewhat idiosyncratic set of POS tags, we converted these to better match the Stanford tags before evaluating (though the match still isn't perfect; accuracy would be higher in all cases if the tag sets were identical)." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-69", "text": "Baselines are provided by tagging the unnormalized text and the output of the na\u00efve normalization baseline." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-70", "text": "Table 2 gives test set results for all models, broken down into seen and unseen items where possible." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-71", "text": "7 The split into seen/unseen highlights the fact that neither of the neural models does as well on seen items as the baseline; indeed the soft attention model is considerably worse in English and Hungarian, the two largest datasets." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-72", "text": "8 The result is that this model actually underperforms the baseline when applied to all tokens, although a hybrid model (baseline for seen, soft attention for unseen) would outperform the baseline." 
}, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-73", "text": "Nevertheless, the hard attention model performs best on unseen tokens in all cases, often by a wide margin, and also yields competitive overall performance." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-74", "text": "----------------------------------" }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-75", "text": "**RESULTS: NORMALIZATION ACCURACY**" }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-76", "text": "We also compared the accuracy of the two neural models at different training data sizes starting from 1k tokens." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-77", "text": "On seen tokens, the baseline was best in all cases except for 1k tokens in Hungarian and Icelandic (where the soft attention model was slightly better) and the largest two data sizes in German (where the hard attention model was slightly better)." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-78", "text": "This supports our claim that learned models should typically only be applied to unseen tokens." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-79", "text": "Accuracy on unseen tokens is shown in Figure 1 . Note that the set of unseen items gets smaller 7 We obtained our datasets from Pettersson et al. but our baseline results are slightly different from what they report." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-80", "text": "The differences (theirs-ours) are -0.1, 0.2, 0.4, 1.2, 0.6 for Eng, Ger, Hun, Ice, Swe respectively." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-81", "text": "This could be due to differences in tie-breaking methods, or to another unknown factor." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-82", "text": "These differences suggest using caution in directly comparing their non-baseline results to ours." 
}, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-83", "text": "8 When we varied the training data sizes, we found that the soft attention model actually gets worse on seen tokens in all languages as the training data increases beyond a relatively small size." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-84", "text": "We have no good explanation for this, and it's possible that tuning the parameters would help." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-85", "text": "Table 2 : Tokens normalized correctly (%) for each dataset." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-86", "text": "Upper half: results on (A)ll tokens reported by Pettersson et al. (2014) for a hybrid model (apply memorization baseline to seen tokens and an edit-distance-based model to unseen tokens) and two SMT models (which align character unigrams and bigrams, respectively)." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-87", "text": "Lower half: results from our experiments, including accuracy reported separately on (S)een and (U)nseen tokens. and presumably more difficult as training data size increases, so the baseline gets worse." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-88", "text": "In contrast, the neural models are able to maintain or increase performance on this set." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-89", "text": "We expected that the bias toward monotonic alignments would help the hard attention model at smaller data sizes, but it is the soft attention model that seems to do better there, while the hard attention model does better in most cases at the larger data sizes." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-90", "text": "Note that Bollmann et al. (2017) trained their model on individual manuscripts, with no training set containing more than 13.2k tokens." 
}, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-91", "text": "The fact that this model struggles with larger data sizes, especially for seen tokens, suggests that the default hyperparameters may be tuned to work well with small training sets at the cost of underfitting the larger datasets." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-92", "text": "Results: POS tagging Based on our results above, we tested the neural models by applying them only to unseen tokens in the PCEEC, and normalizing seen tokens using the na\u00efve baseline in all cases." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-93", "text": "The PCEEC is a heterogeneous collection, so baseline tagger accuracy on the unnormalized text ranges from 52.0% to 82.6%, with an average of 71.0% (\u03c3: 6.8)." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-94", "text": "Figure 2 shows the effects of normalizing using the different methods." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-95", "text": "Although normalizing provides a clear benefit, in most cases the neural models are no better than normalizing using the baseline method." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-96", "text": "The exception is at 5k and 10k training items, where a two-tailed t-test shows that the hard attention model is significantly better than the other methods (p < 0.01)." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-97", "text": "We also tried preprocessing both the normalization and tagging datasets by lowercasing all tokens; this resulted in small improvements in most cases (about 1 point) but any remaining differences were to the benefit of the baseline method." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-98", "text": "Our findings differ from those of Pettersson et al. (2013b) , who reported that their SMT-based system did work better than the baseline normalizer for POS tagging in Icelandic and verb identification in Swedish." 
}, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-99", "text": "Our contrasting findings could derive either from our use of different models or different datasets; nevertheless, they highlight the fact that intrinsic improvements do not always translate into extrinsic ones." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-100", "text": "----------------------------------" }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-101", "text": "**CONCLUSION**" }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-102", "text": "We have highlighted some important issues in the evaluation of historical text normalization systems: in particular, the need to report accuracy on unseen tokens and to compare performance to a na\u00efve memorization baseline." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-103", "text": "Following these recommendations, we evaluated two neural models, one of which is new to this task." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-104", "text": "Across five languages, both models greatly outperformed the baseline on unseen tokens, with the soft attention model doing a bit better for smaller data sizes, and the hard attention model doing a bit better for larger ones." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-105", "text": "However, these improvements did not translate into clearly better POS tagging downstream." }, { "sent_id": "4c8e83eb213879e68285e9cd09be47-C001-106", "text": "Despite these mixed results, we hope that the evaluation guidelines presented here will help promote work in this area, in order to eventually provide better tools for working with historical text collections." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "4c8e83eb213879e68285e9cd09be47-C001-12", "4c8e83eb213879e68285e9cd09be47-C001-13", "4c8e83eb213879e68285e9cd09be47-C001-14" ], [ "4c8e83eb213879e68285e9cd09be47-C001-20" ], [ "4c8e83eb213879e68285e9cd09be47-C001-41" ] ], "cite_sentences": [ "4c8e83eb213879e68285e9cd09be47-C001-12", "4c8e83eb213879e68285e9cd09be47-C001-20", "4c8e83eb213879e68285e9cd09be47-C001-41" ] }, "@MOT@": { "gold_contexts": [ [ "4c8e83eb213879e68285e9cd09be47-C001-12", "4c8e83eb213879e68285e9cd09be47-C001-13", "4c8e83eb213879e68285e9cd09be47-C001-14" ], [ "4c8e83eb213879e68285e9cd09be47-C001-20" ] ], "cite_sentences": [ "4c8e83eb213879e68285e9cd09be47-C001-12", "4c8e83eb213879e68285e9cd09be47-C001-20" ] }, "@UNSURE@": { "gold_contexts": [ [ "4c8e83eb213879e68285e9cd09be47-C001-46" ] ], "cite_sentences": [ "4c8e83eb213879e68285e9cd09be47-C001-46" ] }, "@USE@": { "gold_contexts": [ [ "4c8e83eb213879e68285e9cd09be47-C001-62" ], [ "4c8e83eb213879e68285e9cd09be47-C001-86" ] ], "cite_sentences": [ "4c8e83eb213879e68285e9cd09be47-C001-62", "4c8e83eb213879e68285e9cd09be47-C001-86" ] }, "@EXT@": { "gold_contexts": [ [ "4c8e83eb213879e68285e9cd09be47-C001-63" ] ], "cite_sentences": [] } } }, "ABC_6bf17a793eaee0593596df0c2249b5_32": { "x": [ { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-2", "text": "Multi-hop question answering (QA) requires an information retrieval (IR) system that can find multiple supporting evidence needed to answer the question, making the retrieval process very challenging." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-3", "text": "This paper introduces an IR technique that uses information of entities present in the initially retrieved evidence to learn to 'hop' to other relevant evidence." 
}, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-4", "text": "In a setting, with more than 5 million Wikipedia paragraphs, our approach leads to significant boost in retrieval performance." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-5", "text": "The retrieved evidence also increased the performance of an existing QA model (without any training) on the HOTPOTQA benchmark by 10.59 F1." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-6", "text": "----------------------------------" }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-8", "text": "Multi-hop QA requires finding multiple supporting evidence, and reasoning over them in order to answer a question (Welbl et al., 2018; Talmor and Berant, 2018; Yang et al., 2018) ." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-9", "text": "For example, to answer the question shown in figure 1 , the QA system has to retrieve two different paragraphs and reason over them." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-10", "text": "Moreover, the paragraph containing the answer to the question has very little lexical overlap with the question, making it difficult for search engines to retrieve them from a large corpus." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-11", "text": "For instance, the accuracy of a BM25 retriever for finding all supporting evidence for a question decreases from 53.7% to 25.9% on the 'easy' and 'hard' subsets of the HOTPOTQA training dataset." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-12", "text": "1 We hypothesize that an effective retriever for multi-hop QA should have the \"hopiness\" built into it, by design." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-13", "text": "That is, after retrieving an initial set of documents, the retriever should be able to \"hop\" onto other documents, if required." 
}, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-14", "text": "We note that, many supporting evidence often share common (bridge) entities between them (e.g. \"Rochester Hills\" in figure 1)." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-15", "text": "In this work, we introduce a model that uses information about entities present in an initially retrieved paragraph to jointly find a passage of text describing the entity (entity-linking) and also determining whether that passage would be relevant to answer the multi-hop query." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-16", "text": "A major component of our retriever is a re-ranker model that uses contextualized entity representation obtained from a pre-trained BERT (Devlin et al., 2018) language model." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-17", "text": "Specifically, the entity representation is obtained by feeding the query and a Wikipedia paragraph describing the entity to a BERT model." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-18", "text": "The re-ranker uses representation of both the initial paragraph and the representation of all the entities within it to determine which evidence to gather next." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-19", "text": "Essentially, our method introduces a new way of multi-step retrieval that uses information about intermediate entities." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-20", "text": "A standard way of doing multistep retrieval is via pseudo-relevance feedback (Xu and Croft, 1996; Lavrenko and Croft, 2001 ) in which relevant terms from initial retrieved documents are used to reformulate the initial question." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-21", "text": "A few recent works learn to reformulate the query using task specific reward such as document recall or performance on a QA task (Nogueira and Cho, 2017; Buck et al., 2018; Das et al., 2019) ." 
}, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-22", "text": "However, these methods do not necessarily use the information about entities present in the evidence as they might not be the more frequent/salient terms in it." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-23", "text": "Figure 2 : Overview of our approach." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-24", "text": "We use the entity mentions present in the initially retrieved paragraphs to link to paragraphs describing them." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-25", "text": "Next, the BERT-based re-ranker scores the chain of initial and the entity-describing paragraph." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-26", "text": "Note the presence of self-loop from the initial paragraphs to accommodate for questions that do not require 'hopping' to a new paragraph." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-27", "text": "Empiricially, our method outperforms all of these methods significantly for multi-hop QA." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-28", "text": "Our work is most closely related to the recently proposed BERT re-ranker model of Nogueira and Cho (2019) ." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-29", "text": "However, unlike us, they do not model the chains of evidence paragraphs required for a multi-hop question." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-30", "text": "Secondly, they also do not have a entity linking component to identify the relevant paragraphs." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-31", "text": "Our model out-performs them for multi-hop QA." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-32", "text": "To summarize, this paper presents an entitycentric IR approach that jointly performs entity linking and effectively finds relevant evidence required for questions that need multi-hop reasoning from a large corpus containing millions of paragraphs." 
}, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-33", "text": "When the retrieved paragraphs are supplied to the baseline QA model introduced in Yang et al. (2018) , it improved the QA performance on the hidden test set by 10.59 F1 points." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-34", "text": "2" }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-35", "text": "----------------------------------" }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-36", "text": "**METHODOLOGY**" }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-37", "text": "Our approach is summarized in Figure 2 ." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-38", "text": "The first component of our model is a standard IR system that takes in a natural language query 'Q' and returns an initial set of evidence." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-39", "text": "For our experiments, we use the popular BM25 retriever, but this component can be replaced by any IR model." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-40", "text": "We assume that all spans of entity mentions have been identified in the paragraph text by a one-time preprocessing, with an entity tagger." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-41", "text": "3 2 Code, pre-trained models and retrieved paragraphs are released -https://github.com/ameyagodbole/ entity-centric-ir-for-multihop-qa 3 We plan to explore joint learning of entity tagging with linking and retrieval as future work." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-42", "text": "----------------------------------" }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-43", "text": "**ENTITY LINKING**" }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-44", "text": "The next component of our model is an entity linker that finds an introductory Wikipedia paragraph describing the entity, corresponding to each entity mention." 
}, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-45", "text": "Several IR approaches (Xiong et al., 2016; Raviv et al., 2016) use an off-the-shelf entity linker." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-46", "text": "However, most entity linking systems (Ganea and Hofmann, 2017; Raiman and Raiman, 2018) have been trained on Wikipedia data and hence using an off-the-shelf linker would be unfair, since there exists a possibility of test-time leakage." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-47", "text": "To ensure strictness, we developed our own simple linking strategy." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-48", "text": "Following the standard approach of using mention text and hyper-link information (Cucerzan, 2007; Ji and Grishman, 2011) , we create a mapping (alias table) between them." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-49", "text": "The alias table stores mappings between a mention string (e.g. \"Bill\") and various entities it can refer to (e.g. Bill Clinton, Billy Joel, etc)." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-50", "text": "The top-40 documents returned by the BM25 retriever on the dev and test queries are also ignored while building the alias table." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-51", "text": "At test time, our reranker considers all the candidate entity paragraphs that a mention is linked to via the alias table." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-52", "text": "Although simple, we find this strategy to work well for our task and we plan to use a learned entity linker for future work." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-53", "text": "Re-ranker The next component of our model is a BERT-based re-ranker that ranks the chains of paragraphs obtained from the previous two components of the model." 
}, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-54", "text": "Let Q denote the query, D denote a paragraph in the initial set of paragraphs returned by the BM25 retriever." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-55", "text": "Let e denote an entity mention present in D and E be the linked document returned by the linker for e. If there are multiple linked documents, we consider all of them." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-56", "text": "Although our retriever is designed for multi-hop questions, in a general setting, most questions are not multi-hop in nature." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-57", "text": "Therefore to account for questions that do not need hopping to a new paragraph, we also add a 'self-link' (Figure 2 ) from each of the initial retrieved paragraph, giving the model the ability to stay in the same paragraph." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-58", "text": "To train the re-ranker, we form query-dependent passage representation for both D and E. The query Q and the paragraph E are concatenated and fed as input to a BERT encoder and the corresponding [CLS] token forms the entity representation e." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-59", "text": "Similarly, the document representation d is set to the embedding of the [CLS] token obtained after encoding the concatenation of Q and D. The final score that the entity paragraph E is relevant to Q is computed by concatenating the two query-aware representation d and e and passing it through a 2-layer feed-forward network as before." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-60", "text": "It should be noted, the final score is determined by both the evidence paragraphs D and E and as we show empirically, not considering both leads to decrease in performance." 
}, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-61", "text": "During training, we mark a chain of paragraphs as a positive example, if the last paragraph of the chain is present in the supporting facts, since that is a chain of reasoning that led to a relevant paragraph." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-62", "text": "All other paragraph chains are treated as negative examples." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-63", "text": "In our experiments, we consider chains of length 2, although extending to longer chains is straightforward." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-64", "text": "We use a simple binary crossentropy loss to train the network." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-65", "text": "----------------------------------" }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-66", "text": "**EXPERIMENTS**" }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-67", "text": "For all our experiment, unless specified otherwise, we use the open domain corpus 4 released by Yang et al. (2018) which contains over 5.23 million Wikipedia abstracts (introductory paragraphs)." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-68", "text": "To identify spans of entities, we use the implementation of the state-of-the-art entity tagger presented in Peters et al. (2018) ." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-69", "text": "5 For the BERT encoder, we use the BERT-BASE-UNCASED model." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-70", "text": "6 We use the implementation of widely-used BM25 retrieval available in Lucene." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-71", "text": "Table 1 : Retrieval performance of models on the HOT-POTQA benchmark." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-72", "text": "A successful retrieval is when all the relevant passages for a question are retrieved from more than 5 million paragraphs in the corpus." 
}, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-73", "text": "systems is via pseudo-relevance feedback (PRF)." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-74", "text": "The PRF methods assume that the top retrieved documents in response to a given query are relevant." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-75", "text": "Based on this assumption, they expand the query in a weighted manner." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-76", "text": "PRF has been shown to be effective in various retrieval settings (Xu and Croft, 1996) ." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-77", "text": "We compare with two widely used PRF models -The Rocchio's algorithm on top of the TF-IDF retrieval model (PRF-TFIDF) (Rocchio, 1971 ) and the relevance model (RM3) based on the language modeling framework in information retrieval (PRF-RM) (Lavrenko and Croft, 2001 )." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-78", "text": "Following prior work (Nogueira and Cho, 2017) , we use query likelihood retrieval model with Dirichlet prior smoothing (Zhai and Lafferty, 2001 ) for first retrieval run." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-79", "text": "Nogueira and Cho (2017) proposed a new way of query reformulation -incorporating reward from a document-relevance task (PRF-TASK) and training using reinforcement learning." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-80", "text": "Recently, Nogueira and Cho (2019) proposed a BERT based passage re-ranker (BERT-re-ranker) that has achieved excellent performance in several IR benchmarks." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-81", "text": "But, its performance has not been evaluated on multi-hop queries till now." 
}, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-82", "text": "For a fair comparison with our model which looks at paragraphs corresponding to entities, we use top 200 paragraphs retrieved by the initial IR model for BERT-re-ranker instead of 25 for our model." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-83", "text": "8 Table 1 reports the accuracy(@k) of retrieving all the relevant paragraphs required for answering a question in HOTPOTQA." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-84", "text": "9 We compare the supporting passage annotation present in the dataset with the retrieved paragraph and retrieval is only correct when all the supporting paragraphs are in the top k retrieved passages." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-85", "text": "We also report the mean average precision score (MAP) which is a strict metric that takes into account the relative position of the relevant document in the ranked list (Kadlec et al., 2017) ." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-86", "text": "As we can see, our retrieval technique vastly outperforms other existing retrieval systems with an absolute increase of 26.5% (accuracy@10) and 18.4% (MAP), when compared to BERT-re-ranker." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-87", "text": "The standard PRF techniques do not perform well for this task." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-88", "text": "This is primarily because the PRF methods rely on statistical features like frequency of terms in the document, and fail to explicitly use information about entities, that may not be frequently occurring the paragraph." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-89", "text": "In fact, their performance is a little behind the standard retrieval results of BM25, suggesting that this benchmark dataset needs entity-centric information retrieval." 
}, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-90", "text": "The PRF-TASK does slightly better than other PRF models, showing that incorporating task-specific rewards can be beneficial." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-91", "text": "However, as we find, RL approaches are slow to converge 10 as rewards from a down-stream tasks are sparse and action space in information retrieval is very large." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-92", "text": "Ablations." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-93", "text": "We investigate whether modeling the chain of paragraphs needed to reach the final paragraph is important or not." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-94", "text": "As an ablation, we ignore the representation of the initial retrieved document D 1 and only consider the final document representing the entity (QUERY+E-DOC)." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-95", "text": "Table 1 shows that, indeed modeling the chain of documents is important." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-96", "text": "This makes intuitive sense, since to answer questions such as the county where a person is from (figure 1), modeling context about the person, should be helpful." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-97", "text": "We also evaluate, if our model performs well on single-hop questions as well." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-98", "text": "This evaluation is a bit tricky to do in HOTPOTQA, since the evaluataion set only contains questions from 'hard' subset (Yang et al., 2018) ." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-99", "text": "However, within that hard subset, we find the set of question, that has the answer span present in all the supporting passages (SINGLE-HOP (HARD)) and only in one of the supporting passages (MULTI-HOP (HARD)) 11 ." 
}, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-100", "text": "The intuition is that if there are multiple evidence containing the answer spans then it might be a little easier for a downstream QA model to identify the answer span." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-101", "text": "Figure 3 shows that our model performs equally well on both type of queries and hence can be applied in a practical setting." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-102", "text": "Baseline Reader (Yang et al., 2018) Table 2 shows the performance on the QA task." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-103", "text": "We were able to achieve better scores than reported in the baseline reader model of Yang et al. (2018) by using Adam (Kingma and Ba, 2014) instead of standard SGD (our re-implementation)." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-104", "text": "Next, we use the top-10 paragraphs retrieved by our system from the entire corpus and feed it to the reader model." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-105", "text": "We achieve a 10.59 absolute increase in F1 score than the baseline." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-106", "text": "It should be noted that we use the simple baseline reader model and we are confident that we can achieve better scores by using more sophisticated reader architectures, e.g. using BERT based architectures." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-107", "text": "Our results show that retrieval is an important component of an opendomain system and equal importance should be given to both the retriever and reader component." 
}, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-108", "text": "----------------------------------" }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-109", "text": "**ZERO-SHOT EXPERIMENT ON WIKIHOP**" }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-110", "text": "We experiment if our model trained on HOTPOTQA can generalize to another multi-hop dataset -WIK-IHOP (Welbl et al., 2018) , without any training." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-111", "text": "In the WIKIHOP dataset, a set of candidate introductory Wikipedia paragraphs are given per question." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-112", "text": "Hence, we do not need to use our initial BM25 retriever." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-113", "text": "We assign the first entity mention occurring in a paragraph as the textual description of that entity." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-114", "text": "For instance, if the first entity mention in the paragraph is 'Mumbai', we assign that paragraph as the textual description for the entity 'Mumbai'." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-115", "text": "This assumption is often true for the introductory paragraphs of a Wikipedia article." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-116", "text": "Next, we perform entity linking of mentions by just simple string matching (i.e. linking strings such as 'mumbai' to the previous paragraph)." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-117", "text": "After constructing a small subgraph from the candidate paragraphs, we apply our model trained on HOTPOTQA." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-118", "text": "The baseline models we compare to are a BM25 retriever and a BERT-re-ranker model of (Nogueira and Cho, 2019) that ranks all the candidate supporting paragraphs for the question." 
}, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-119", "text": "Table 3 shows our model outperforms both models in zero-shot setting." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-120", "text": "----------------------------------" }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-121", "text": "**RELATED WORK**" }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-122", "text": "Document retrieval using entities." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-123", "text": "Analysis of web-search query logs has revealed that there is a large portion of entity seeking queries (Liu and Fang, 2015) ." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-124", "text": "There exists substantial work on modeling documents with entities occurring in them." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-125", "text": "For example, Xiong et al. (2016) represents a document with bag-of-entities and Raviv et al. (2016) use entity-based language modeling for document retrieval." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-126", "text": "However, they depend on an off-the-shelf entity tagger, where as we jointly perform linking and retrieval." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-127", "text": "Moreover, we use contextualized entity representations using pre-trained LMs which have been proven to be better than bag-of-words approaches." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-128", "text": "There has been a lot of work which leverages knowledge graphs (KGs) to learn better entity representations (Xiong and Callan, 2015; Xiong et al., 2017; Liu et al., 2018) and for better query reformulation (Cao et al., 2008; Dalton et al., 2014; Dietz and Verga, 2014) ." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-129", "text": "Our work is not tied to any specific KG schema, instead we encode entities using its text description." 
}, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-130", "text": "Neural ranking models have shown great potential and have been widely adopted in the IR community (Dehghani et al., 2017; Guo et al., 2019; Mitra et al., 2017; Zamani et al., 2018, inter-alia) ." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-131", "text": "Bagof-words and contextual embedding models, such as word2vec and BERT, have also been explored extensively for various IR tasks, from document to sentence-level retrieval (Padigela et al., 2019; Croft, 2016, 2017) ." }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-132", "text": "----------------------------------" }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-133", "text": "**CONCLUSION**" }, { "sent_id": "6bf17a793eaee0593596df0c2249b5-C001-134", "text": "We introduce an entity-centric approach to IR that finds relevant evidence required to answer multihop questions from a corpus containing millions of paragraphs leading to significant improvement to an existing QA system." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "6bf17a793eaee0593596df0c2249b5-C001-8" ] ], "cite_sentences": [ "6bf17a793eaee0593596df0c2249b5-C001-8" ] }, "@USE@": { "gold_contexts": [ [ "6bf17a793eaee0593596df0c2249b5-C001-32", "6bf17a793eaee0593596df0c2249b5-C001-33" ], [ "6bf17a793eaee0593596df0c2249b5-C001-67" ], [ "6bf17a793eaee0593596df0c2249b5-C001-102" ] ], "cite_sentences": [ "6bf17a793eaee0593596df0c2249b5-C001-33", "6bf17a793eaee0593596df0c2249b5-C001-67", "6bf17a793eaee0593596df0c2249b5-C001-102" ] }, "@UNSURE@": { "gold_contexts": [ [ "6bf17a793eaee0593596df0c2249b5-C001-97", "6bf17a793eaee0593596df0c2249b5-C001-98" ] ], "cite_sentences": [ "6bf17a793eaee0593596df0c2249b5-C001-98" ] }, "@DIF@": { "gold_contexts": [ [ "6bf17a793eaee0593596df0c2249b5-C001-103" ] ], "cite_sentences": [ "6bf17a793eaee0593596df0c2249b5-C001-103" ] } } }, "ABC_15df1d107fb349f78c313b0c3342b8_32": { "x": [ { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-55", "text": "----------------------------------" }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-2", "text": "This paper describes the methodology followed to build a neural machine translation system in the biomedical domain for the English-Catalan language pair." }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-3", "text": "This task can be considered a low-resourced task from the point of view of the domain and the language pair." }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-4", "text": "To face this task, this paper reports experiments on a cascade pivot strategy through Spanish for the neural machine translation using the English-Spanish SCIELO and Spanish-Catalan El Peri\u00f3dico database." }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-5", "text": "To test the final performance of the system, we have created a new test data set for English-Catalan in the biomedical domain which is freely available on request." 
}, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-6", "text": "----------------------------------" }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-8", "text": "Neural machine translation (Sutskever et al., 2014; has recently emerged as a stronger alternative to standard statistical paradigm (Koehn et al., 2003) ." }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-9", "text": "Among other advantages, neural MT offers an end-to-end paradigm which seems to be able to generalize better from data (Bentivogli et al., 2017) ." }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-10", "text": "However, deep learning techniques face serious difficulties for learning when having limited or low resources and machine translation is not an exception (Koehn and Knowles, 2017) ." }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-11", "text": "English has become the de facto universal language of communication around the world." }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-12", "text": "In Catalonia, out of 7.5 million population only around 30% of people have knowledge of English in all competences 1 ." }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-13", "text": "Therefore, there are many situations where professional or automatic translations are still necessary." }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-14", "text": "One of them is in medical communication patient-physician at the level of primary health care." }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-56", "text": "**ES' CA**" }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-57", "text": "ES -CA The size of the corpora is summarised in Table 1 ." 
}, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-15", "text": "Also in the biomedical domain it is worth mentioning that Catalonia has become a hub of global biomedical research as proven by the nearly 1% of global scientific production, 9,000 innovative companies or the fact that the sector raised a record of 153 million of euros in 2016 2 ." }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-16", "text": "Therefore, English-Catalan translation in the biomedical domain is of interest not only in health communication but to properly disseminate the work in such a relevant area for Catalan economy." }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-17", "text": "English-Catalan in general -and even more in a closed domain as the biomedical one-can be considered to be a limited resourced language pair." }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-18", "text": "However, there are quite large amount of resources for English-Spanish and SpanishCatalan language pairs." }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-19", "text": "Therefore, English-Catalan could take advantage of them by using the popular pivot strategies which consist in using one intermediate language to perform the translation between two other languages." }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-20", "text": "Pivot strategies have been shown to provide a good performance within the phrase-based framework (Costa-juss\u00e0 et al., 2012) and also for the particular case of English-Catalan (de Gispert and no, 2006) ." }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-21", "text": "While in the phrase-based con-1 Data taken from https://www.idescat.cat/ 2 http://cataloniabio.org/ca/publicacions text, pivot strategies have been widely exploited, this is not the case for the neural approaches." 
}, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-22", "text": "Pivot studies are limited to (Cheng et al., 2017) which considers a single direct approach (the cascade) contrasted with a joint trained model from source-to-pivot and pivot-to-source." }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-23", "text": "Other alternatives when having no parallel data for a language pair are the multilingual approximations where several language pairs are trained together and the system is able to learn non-resourced pairs (Wu et al., 2016) ." }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-24", "text": "Another related research area for this study is precisely training translation systems domain-specific tasks, where there are scarce in-domain translation resources." }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-25", "text": "A common approach in these cases consists in training a system with a generic corpus and then, use a small in-domain corpus to adapt the system to that particular domain." }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-26", "text": "In this direction, there is a huge amount of research in the statistical approach (Costa-juss\u00e0, 2015) and also starting in the neural approach (Chu et al., 2017) ." }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-27", "text": "Finally, there is am emerging line of research in the topic of unsupervised neural MT (Lample et al., 2018; Artetxe et al., 2018) ." }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-28", "text": "This study designs and details an experiment for testing the standard cascade pivot architecture which has been employed in standard statistical machine translation (Costajuss\u00e0 et al., 2012) ." }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-29", "text": "The system that we propose builds on top of one of the latest neural MT architectures called the Transformer (Vaswani et al., 2017) ." 
}, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-30", "text": "This architecture is an encoderdecoder structure which uses attention-based mechanisms as an alternative to recurrent neural networks proposed in initial architectures (Sutskever et al., 2014; Cho et al., 2014) ." }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-31", "text": "This new architecture has been proven more efficient and better than all previous proposed so far (Vaswani et al., 2017) ." }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-32", "text": "----------------------------------" }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-33", "text": "**NEURAL MT APPROACH**" }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-34", "text": "This section provides a brief high-level explanation of the neural MT approach that we are using as a baseline system, which is one of the strongest systems presented recently (Vaswani et al., 2017) , as well as a glance of its differences with other popular neural machine translation architectures." }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-35", "text": "Sequence-to-sequence recurrent models (Sutskever et al., 2014; Cho et al., 2014) have been the standard approach for neural machine translation, especially since the incorporation of attention mechanisms Luong et al., 2015) , which enables the system to learn to identify the information which is relevant for producing each word in the translation." }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-36", "text": "Convolutional networks (Gehring et al., 2017) were the second paradigm to effectively approach sequence transduction tasks like machine translation." }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-37", "text": "In this paper we make use of the third paradigm for neural machine translation, proposed in (Vaswani et al., 2017) , namely the Transformer architecture, which is based on a feed-forward encoder-decoder scheme with attention mechanisms." 
}, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-38", "text": "The type of attention mechanism used in the system, referred to as multi-head attention, allows to train several attention modules in parallel, combining also self-attention with standard attention." }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-39", "text": "Self-attention differs from standard attention in the use of the same sentence as input and trains over it allowing to solve issues as coreference resolution." }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-40", "text": "Equations and details about the transformer system can be found in the original paper (Vaswani et al., 2017) and are out of the scope of this paper." }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-41", "text": "For the definition of the vocabulary to be used as input for the neural network, we used the sub-word mechanism from tensor2tensor package, which is similar to BytePair Encoding (BPE) from (Sennrich et al., 2016) ." }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-42", "text": "For the English-Spanish language pair, two separate 32K sub-word vocabularies where extracted, while for Spanish-Catalan we extracted a single shared 32K sub-word vocabulary for both languages." }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-43", "text": "----------------------------------" }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-44", "text": "**PIVOT CASCADE APPROACH**" }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-45", "text": "Standard approaches for making use of pivot language translation in phrase-based systems include the translation cascade." }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-46", "text": "The cascade approach consists in building two translation systems: source-to-pivot and pivot-to-target." }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-47", "text": "In test time, the cascade approach requires two translations." 
}, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-48", "text": "----------------------------------" }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-49", "text": "**EXPERIMENTAL FRAMEWORK: DATA RESOURCES**" }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-50", "text": "This section reports details on data to be employed in the experiments." }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-51", "text": "----------------------------------" }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-52", "text": "**EN ES**" }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-53", "text": "----------------------------------" }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-54", "text": "**EN -ES**" }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-58", "text": "The corpora has been pre-processed with a standard pipeline for Catalan, Spanish and English: tokenizing and keeping parallel sentences between 1 and 50 words." }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-59", "text": "Additionally, for English and Spanish we used Freeling (Padr\u00f3 and Stanilovsky, 2012) to tokenize pronouns from verbs (i.e. pregunt\u00e1ndose to preguntando + se), we also split prepositions and articles, i.e. del to de + el and al to a + el." }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-60", "text": "This there is no statistically significant difference in global and event free survival between the two groups ." }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-61", "text": "no hi ha difer\u00e8ncia estad\u00edsticament significativa en la superviv\u00e8ncia global i lliure d ' esdeveniments entre els dos grups ." }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-62", "text": "was done for similarity to English." }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-63", "text": "For Spanish and Catalan, we used Freeling to tokenize the text but no split with pronouns, prepositions or articles was done." 
}, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-64", "text": "The test sets come from WMT 2016 biomedical shared task in the case of English and Spanish." }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-65", "text": "Since we required a gold standard in English-Catalan, we translated the Spanish test set from WMT 2016 biomedical shared task into Catalan." }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-66", "text": "The translation was performed in two steps: we did a first automatic translation from Spanish to Catalan and then a professional translator postedited the output." }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-67", "text": "This English-Catalan test set on the biomedical domain is freely available on request to authors." }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-68", "text": "Details on the test sets are reported in Table 4 ." }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-69", "text": "----------------------------------" }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-70", "text": "**RESULTS**" }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-71", "text": "The results of each of the pivotal translation systems as well as the combined cascaded translation are summarized in table 4, which shows the high quality of the translations of the attentional architecture from (Vaswani et al., 2017) ." }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-72", "text": "The English-to-Spanish translation obtains a BLEU score of 46.55 in the test set of the WMT Biomedical test set while the Spanish-to-Catalan translation obtains a BLEU score of 86.89 in the El Peri\u00f3dico test set." }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-73", "text": "The cascaded translation achives a BLEU score of 41.38 in the translated WMT Biometical test set." 
}, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-74", "text": "All BLEU scores are case-sensitive and where obtained with script t2t-bleu from the tensor2tensor framework, whose results are equivalent to those from mteval-v14.pl from the Moses package." }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-75", "text": "In order to illustrate the quality of the cascaded translations quality, some sample translations are shown in table 3." }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-76", "text": "----------------------------------" }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-77", "text": "**CONCLUSION**" }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-78", "text": "This paper describes the data resources and architectures to build an English-Catalan neural MT system in the medical domain without English-Catalan parallel resources." }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-79", "text": "This descriptive paper provides details on latest architectures in neural MT based on attention mechanisms and one standard pivot architecture that has been used with the statistical approach." }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-80", "text": "The paper reports results on the baseline system of the cascade approach with the latest neural MT architecture of the Transformer." }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-81", "text": "Further experiments are required to fully characterize the potential of pivoting approaches." }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-82", "text": "One of the future lines of research is to apply the pseudo-corpus approach, which consists in training a third translation system on a synthetic corpus created by means of the pivotal ones." 
}, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-83", "text": "A second future line of research is the use of recently proposed unsupervised machine translation approaches (Lample et al., 2018; Artetxe et al., 2018) , which do not require large amount of parallel data." }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-84", "text": "----------------------------------" }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-85", "text": "**ACKNOWLEDGEMENTS**" }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-86", "text": "This work is supported by the Spanish Ministerio de Econom\u00eda y Competitividad and European Regional Development Fund, through the postdoctoral senior grant Ram\u00f3n y Cajal, the contract TEC2015-69266-P (MINECO/FEDER, UE) and the contract PCIN-2017-079 (AEI/MINECO)." }, { "sent_id": "15df1d107fb349f78c313b0c3342b8-C001-87", "text": "This work is also supported in part by an Industrial PhD Grant from the Catalan Agency for Management of University and Research Grants (AGAUR) and by the United Language Group (ULG)." 
} ], "y": { "@EXT@": { "gold_contexts": [ [ "15df1d107fb349f78c313b0c3342b8-C001-29" ] ], "cite_sentences": [ "15df1d107fb349f78c313b0c3342b8-C001-29" ] }, "@BACK@": { "gold_contexts": [ [ "15df1d107fb349f78c313b0c3342b8-C001-34" ], [ "15df1d107fb349f78c313b0c3342b8-C001-40" ] ], "cite_sentences": [ "15df1d107fb349f78c313b0c3342b8-C001-34", "15df1d107fb349f78c313b0c3342b8-C001-40" ] }, "@MOT@": { "gold_contexts": [ [ "15df1d107fb349f78c313b0c3342b8-C001-34" ] ], "cite_sentences": [ "15df1d107fb349f78c313b0c3342b8-C001-34" ] }, "@USE@": { "gold_contexts": [ [ "15df1d107fb349f78c313b0c3342b8-C001-37" ] ], "cite_sentences": [ "15df1d107fb349f78c313b0c3342b8-C001-37" ] }, "@UNSURE@": { "gold_contexts": [ [ "15df1d107fb349f78c313b0c3342b8-C001-71" ] ], "cite_sentences": [ "15df1d107fb349f78c313b0c3342b8-C001-71" ] } } }, "ABC_8905d5936a5b2a839bfd56783ff55d_32": { "x": [ { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-2", "text": "Natural language generators for task-oriented dialog should be able to vary the style of the output utterance while still effectively realizing the system dialog actions and their associated semantics." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-3", "text": "While the use of neural generation for training the response generation component of conversational agents promises to simplify the process of producing high quality responses in new domains, to our knowledge, there has been very little investigation of neural generators for task-oriented dialog that can vary their response style, and we know of no experiments on models that can generate responses that are different in style from those seen during training, while still maintaining semantic fidelity to the input meaning representation." 
}, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-4", "text": "Here, we show that a model that is trained to achieve a single stylistic personality target can produce outputs that combine stylistic targets." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-5", "text": "We carefully evaluate the multivoice outputs for both semantic fidelity and for similarities to and differences from the linguistic features that characterize the original training style." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-6", "text": "We show that contrary to our predictions, the learned models do not always simply interpolate model parameters, but rather produce styles that are distinct, and novel from the personalities they were trained on." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-7", "text": "----------------------------------" }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-9", "text": "Natural language generators for task-oriented dialog should be able to vary the style of the output while still effectively realizing the system dialog actions and their associated semantics." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-10", "text": "The use of neural natural language generation (NNLG) for training the response generation component of conversational agents promises to simplify the process of producing high quality responses in new domains by relying on the neural architecture to automatically learn how to map an input meaning representation to an output utterance." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-11", "text": "However, there has been little investigation of NNLGs for dialog that can vary their response style, and we know of no experiments on models that can generate responses that are different in style from those seen during training, while still maintaining semantic fidelity to the input meaning representation." 
}, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-12", "text": "Instead, work on stylistic transfer has focused on tasks where only coarse-grained semantic fidelity is needed, such as controlling the sentiment of the utterance (positive or negative), or the topic or entity under discussion [1, 2, 3] ." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-13", "text": "Consider for example a training instance for the restaurant domain consisting of a meaning representation (MR) from the End-to-End (E2E) Generation Challenge 1 and a sample output from one of our neural generation models in Figure 1 [4, 5] ." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-14", "text": "Systems using the training set of 50K crowdsourced utterances from the E2E task achieved high semantic correctness, e.g. the BLEU score for our best system on the dev set was 0.72 [6] ." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-15", "text": "However in the best case these models can only reproduce the 1 http://www.macs.hw.ac.uk/InteractionLab/E2E/ style of the training data, and in actuality the outputs have reduced stylistic variation, because when particular stylistic variations are less frequent, they are treated similarly to noise." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-16", "text": "Browns Cambridge is a pub, also it is a moderately priced italian place near Adriatic, also it is family friendly, you know and it's in the city centre." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-17", "text": "In subsequent work, we showed that we could augment the E2E training data with synthetically generated stylistic variants and train a neural generator to reproduce these variants, however the models can still only generate what they have seen in training [5] ." 
}, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-18", "text": "Here, instead, we explore whether a model that is trained to achieve a single stylistic personality target can produce outputs that combine stylistic targets, to yield a novel style that is significantly different than what was seen in training, while still maintaining high semantic correctness." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-19", "text": "We first train each stylistic model with a single latent variable for supervision, for five different personality models, or voices, based on the Big Five theory of personality, namely the personality trait styles of EXTRAVERT, AGREEABLE, DISAGREEABLE, CONSCI-ENTIOUS, and UNCONSCIENTIOUS." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-20", "text": "Then, at generation time, we provide the model with combinations of the stylistic variables, i.e. we instruct the NNLG to generate multivoice outputs that combine EXTRAVERT with DISAGREEABLE, where such combined outputs never occurred in the training data." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-21", "text": "We first describe how we set up our dataset and neural models in Section 2, and then present our results in Section 3." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-22", "text": "We evaluate the multivoice outputs for both semantic fidelity and for similarities to and differences from the linguistic features that characterize the original training style." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-23", "text": "We hypothesize that controlling multiple stylistic parameters is more difficult and will lead to more semantic errors, so we examine in detail the interaction of stylistic variation and semantic fidelity, as well as quantifying stylistic fidelity." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-24", "text": "We leave a discussion of related work until Section 4 where we also conclude." 
}, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-25", "text": "----------------------------------" }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-26", "text": "**DATA AND MODELS**" }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-27", "text": "There is a long tradition in AI of using slightly synthetic tasks and datasets in order to test the ability of particular models to achieve these tasks [7, 8] ." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-28", "text": "The PERSONAGE corpus [5] provides a controlled environment for testing different models of neural generation and style generation." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-29", "text": "It consists of 88,500 restaurant domain utterances whose style varies according to models of personality, which were generated by an existing statistical NLG engine that has the capability of manipulating 67 different stylistic parameters [9] ." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-30", "text": "Table 2 shows sample utterances that are output for the singlevoice models and for each of our multi- Table 2 : MultiVoice generation output and comparable singlevoice outputs for DISAGREEABLE, EXTRAVERT and CONSCIENTIOUS for the meaning representation in Figure 1 ." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-31", "text": "We count the frequency of periods (Period Agg.) and expletives (Explet. Prag) for multivoice models that utilize DISAGREEABLE)." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-32", "text": "voice models (described below) for the same MR." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-33", "text": "Each output corresponding to each single voice personality is controlled by a set of sentence planning parameters that vary for each personality." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-34", "text": "These parameters are discussed in Section 3 when we evaluate stylistic fidelity." 
}, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-35", "text": "What is important to note here is that each individual voice represents a distinct stylistic distribution in the training data." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-36", "text": "The corpus uses the MRs and training/test splits of the E2E Generation Challenge." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-37", "text": "There are 3,784 unique MRs in the training set, and the corpus contains 17,771 MR/training utterance pairs for each of the existing models for the personality traits of AGREEABLE, DISAGREEABLE, CONSCIENTIOUSNESS, UN-CONSCIENTIOUSNESS, and EXTRAVERT, for a total training set of 88,855 utterances." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-38", "text": "This guarantees a wide range of variation in parameter combinations." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-39", "text": "The test set consists of 278 unique MRs." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-40", "text": "The frequencies of longer utterances (more attribute MRs) vary across train and test with test MRs not seen during training." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-41", "text": "The training data has more smaller MRs, while the test set is more challenging, with more larger MRs." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-42", "text": "Previous work shows that a simple model trained on the whole corpus of 88,855 utterances produces semantically correct outputs, but with reduced stylistic variation [5] , while a model that allocates a variable corresponding to a label for each style learns to reproduce the stylistic variation." 
}, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-43", "text": "This is interesting because each style variable (personality) actually encodes a set of 36 different stylistic parameters and their values: the model learns for example how the DISAGREEABLE personality tends to produce many shorter sentences in the output, as well as learning that it tends to use expletives like damn, e.g. see the outputs based on DISAGREEABLE personality in Table 2 ." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-44", "text": "Model Description." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-45", "text": "Our NNLG model uses a single token to represent personality encoding, following the use of single language labels used in machine translation and other work on neural generation [10, 5] ." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-46", "text": "Figure 1 summarizes the model architecture." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-47", "text": "This model builds on the open-source sequence-tosequence (seq2seq) TGen system [11] , which is implemented in Tensorflow [12] ." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-48", "text": "2 The system is based on the seq2seq generation method with attention [14, 15] , and uses a sequence of LSTMs [16] for the encoder and decoder, combined with beamsearch and an n-best list reranker for output tuning." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-49", "text": "The inputs to the model are dialog acts for each system action (such as inform) and a set of attribute slots (such as rating) and their values (such as high for attribute rating)." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-50", "text": "To prepro- 2 We refer the reader to TGen publications [11, 13] for model details." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-51", "text": "cess the corpus of MR/utterance pairs, attributes that take on proper-noun values are delexicalized during training i.e. name and near." 
}, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-52", "text": "We encode personality as an additional dialog act, of type CONVERT with personality as the key and the target personality as the value (see Figure 1) ." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-53", "text": "For every input MR and a personality, we train the model with the corresponding PER-SONAGE generated sentence." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-54", "text": "Our model differs from the TO-KEN model used in our previous work [5] because it is trained on unsorted inputs to allow us to add multiple CONVERT tags to the MR at generation time." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-55", "text": "Note that we do not train on multiple personalities, instead, we train one model that uses all the data, where each distinct single personality has a corresponding CONVERT(PERSONALITY = X) in the training instance." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-56", "text": "At generation time, we generate singlevoice data for all the test MRs (1,390 total realizations, 278 unique MRs, realized for each of 5 personalities)." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-57", "text": "For the multivoice experiments, we generate 2 references per combination of two personalities for each of the 278 test MRs, since the order of the CONVERT tags matters." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-58", "text": "For a given order, the model produces a single output." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-59", "text": "We do not combine personalities that are exact opposites such as AGREEABLE and DISAGREEABLE, yielding 8 combinations." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-60", "text": "The multivoice test set consists of 4,448 total realizations (278 MRs and 8 \u00d7 2 outputs per MR)." 
}, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-61", "text": "----------------------------------" }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-62", "text": "**RESULTS**" }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-63", "text": "Although it is well known that current automatic metrics do not perform well for evaluating the quality of an NLG [30] , and that they penalize stylistic variation, we report automatic metrics for completeness." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-64", "text": "To address their limitations, we also report the results of our own metrics developed to measure semantic correctness and stylistic fidelity." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-65", "text": "Examples of model outputs for single and multivoice are shown in Table 2 , demonstrating how our models interpolate the stylistic parameters described here." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-66", "text": "Automatic Metrics." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-67", "text": "The automatic evaluation uses the E2E generation challenge script." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-68", "text": "3 Table 3 summarizes the results for each personality combination for the metrics: BLEU (n-gram precision), NIST (weighted n-gram precision), METEOR (ngrams with synonym recall), and ROUGE (n-gram recall)." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-69", "text": "We note that multivoice automatically has a better chance because the evaluation is over 4,448 examples as opposed to 1,390 for singlevoice, and each multivoice output is compared to 2 possible references (one for each single voice), and then averaged." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-70", "text": "Semantic Errors." 
}, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-71", "text": "Table 4 shows ratios for the number of deletions, repeats, and hallucinations for each single and multivoice model for their respective test sets (1,390 total realizations and 4,448 realizations)." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-72", "text": "The error counts are split by personality, and normalized by the number of unique MRs (278)." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-73", "text": "Note that smaller ratios are preferable, indicating fewer errors." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-74", "text": "As we predicted, it is more challenging to preserve semantic fidelity when attempting to hit multiple stylistic targets." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-75", "text": "We see that in most cases the frequency of errors increase for multivoice compared to singlevoice, with particular combinations such as DISAGREEABLE plus EXTRAVERSION making more than one attribute deletion for each output on average." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-76", "text": "In the singlevoice results DISAGREEABLE and EXTRAVERT make the most errors with the smallest total ratio found for CONSCIENTIOUS, but when CONSCIENTIOUS combines with DISAGREEBLE it performs worse than either model alone." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-77", "text": "Stylistic Characterization." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-78", "text": "To characterize the differences in style between the multivoice and singlevoice outputs, we develop scripts that count the aggregation operations and prag-3 https://github.com/tuetschek/e2e-metrics matic markers in Figure 5 in both the singlevoice and multivoice test data." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-79", "text": "We then compare the singlevoice data directly with multivoice results." 
}, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-80", "text": "The aggregation parameters in Table 5 control how the NLG combines attributes into sentences, e.g. whether it tries to create complex sentences and what types of combination operations it uses." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-81", "text": "The pragmatic operators in the bottom part of Table 5 are intended to achieve particular pragmatic effects in the generated outputs: for example the use of a hedge such as sort of softens a claim and affects perceptions of friendliness and politeness [17] , while the exaggeration associated with emphasizers like actually, basically, really influences perceptions of extraversion and enthusiasm [18, 19] ." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-82", "text": "Each parameter value can be set to high, low, or don't care." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-83", "text": "Aggregation." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-84", "text": "To measure the similarity of each multivoice model to its parent single voices for aggregation operations, we first count the average number of times each aggregation operation occurs for each model and personality or personality combination." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-85", "text": "We then compute Pearson correlation across different model outputs to quantify the similarity of these model outputs with respect to the aggregation operations." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-86", "text": "Table 6 provides a summary of these results (higher means more correlated)." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-87", "text": "The final column of Table 6 provides the correlations between the original two single voices that were put together to create the multivoice model." 
}, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-88", "text": "This shows for example (Row 1) that AGREEABLE and CONSCIENTIOUS are similar in their use of aggregation but that DISAGREEABLE and EXTRAVERSION are very dissimilar (Row 6)." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-89", "text": "We would expect that models that are similar to start with would be less novel when they are combined, and indeed Row 1 shows that when the multivoice model is compared with both the original AGREEABLE voice (Column 3) and the CONSCIENTIOUS voice (Column 4) the use of aggregation operations changes little." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-90", "text": "However other combinations seem to produce completely novel models that use aggregation very differently than either of their singlevoice source models." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-91", "text": "For example in Row 7 the combination of DISAGREEABLE and UNCONSCIENTIOUS produces a model whose use of aggregation is distinct from either of its source models." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-92", "text": "All of the correlations in Table 6 are significant (p < 0.05) except for the 0.01 correlation when comparing the single voices of CONSCI-ENTIOUS vs. DISAGREE where the p-value is 0.6." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-93", "text": "Figure 2 provides a closer look at particular aggregation operations associated with CONSCIENTIOUSNESS and DIS-AGREEABLE and plots the differences between the singlevoice models and the use of these operations in the multivoice models." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-94", "text": "Interestingly, these plots also clearly show that the multivoice model is a novel personality, yielding a different distribution for aggregation operations than either of its source voice styles." 
}, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-95", "text": "To measure the models' use of pragmatic markers, we count the number of times each marker in Table 5 occurred in the model outputs, compared to the singlevoice references." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-96", "text": "We again compute the Pearson correlation between the original voices and the multivoice model outputs for personality combination." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-97", "text": "The results are shown in Table 7 (all correlations significant with p \u2264 0.05)." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-98", "text": "The final column of Table 7 provides the correlations between the original two single voices that were put together to create the multivoice model." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-99", "text": "As we can see in Row 1, the only two voices that are similar to start are AGREEABLE and CON-SCIENTIOUS." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-100", "text": "All of the other voices have negative correlations with one another in their use of pragmatic markers." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-101", "text": "Interestingly, the multivoice combination of AGREEABLE and CONSCI-ENTIOUS resembles CONSCIENTIOUS much more (see column 4)." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-102", "text": "All the other multivoice models also appear to resemble one of the parent models more than the other, but none are very similar to their parents: they each appear to demonstrate characteristics of a novel voice." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-103", "text": "For example, in Row 6, the combination of DISAGREEABLE and EXTRAVERSION produces a model that bears very little similarity to either DISAGREEABLE (0.12 correlation) or EXTRAVERSION (-0.03 correlation) ." 
}, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-104", "text": "Figure 3 provides a closer look at particular pragmatic markers associated with CONSCIENTIOUSNESS and DIS-AGREEABLE and plots the differences between the singlevoice models and the multivoice models." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-105", "text": "Again, interestingly, these plots show that the multivoice model is a novel personality that yields a different distribution for pragmatic markers than either of its source voice styles." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-106", "text": "----------------------------------" }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-107", "text": "**RELATED WORK AND CONCLUSION**" }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-108", "text": "The restaurant domain has been a testbed for conversational agents for over 25 years [20, 21, 22, 23, 24] , but there is little previous work examining stylistic variation in this domain [9, 25] ." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-109", "text": "Most of the recent research using neural NLG has focused on semantic fidelity [26, 27, 13, 28] , however there is work on methods for controlling when long utterances should be split into shorter ones, and for attempting to enforce pronominalization [29] ." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-110", "text": "Other work has pointed out how poor evaluation metrics such as BLEU are for evaluating natural language generation quality [30] ." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-111", "text": "Recent work on neural methods for controlling linguistic style has mainly been carried out in the context of machine translation [1] or focused on tasks where semantic fidelity was not required [31] ." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-112", "text": "Previous work in the statistical NLG tradition presents methods for controlling stylistic variation [32, 33, 34] ." 
}, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-113", "text": "Work on the persona of a conversational agent did not actually focus on stylistic variation, or personality, but instead tried to ensure that an open domain conversational agent would answer questions about itself in a semantically consistent way [35] ." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-114", "text": "Here we present the first experiment, to our knowledge, examining stylistic generalization in a domain that requires semantic fidelity." }, { "sent_id": "8905d5936a5b2a839bfd56783ff55d-C001-115", "text": "We show that our neural models produce novel styles that they have not seen in training, and examine how and to what extent stylistic control interacts with semantic fidelity." } ], "y": { "@BACK@": { "gold_contexts": [ [ "8905d5936a5b2a839bfd56783ff55d-C001-17" ], [ "8905d5936a5b2a839bfd56783ff55d-C001-42" ], [ "8905d5936a5b2a839bfd56783ff55d-C001-10", "8905d5936a5b2a839bfd56783ff55d-C001-11", "8905d5936a5b2a839bfd56783ff55d-C001-12", "8905d5936a5b2a839bfd56783ff55d-C001-13", "8905d5936a5b2a839bfd56783ff55d-C001-9" ] ], "cite_sentences": [ "8905d5936a5b2a839bfd56783ff55d-C001-17", "8905d5936a5b2a839bfd56783ff55d-C001-42", "8905d5936a5b2a839bfd56783ff55d-C001-13" ] }, "@USE@": { "gold_contexts": [ [ "8905d5936a5b2a839bfd56783ff55d-C001-28" ], [ "8905d5936a5b2a839bfd56783ff55d-C001-45" ] ], "cite_sentences": [ "8905d5936a5b2a839bfd56783ff55d-C001-28", "8905d5936a5b2a839bfd56783ff55d-C001-45" ] }, "@DIF@": { "gold_contexts": [ [ "8905d5936a5b2a839bfd56783ff55d-C001-42" ], [ "8905d5936a5b2a839bfd56783ff55d-C001-54" ] ], "cite_sentences": [ "8905d5936a5b2a839bfd56783ff55d-C001-42", "8905d5936a5b2a839bfd56783ff55d-C001-54" ] } } }, "ABC_10acbeba830b2f8b3feb30de542c56_32": { "x": [ { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-2", "text": "The Touchdown dataset (Chen et al., 2019) provides instructions 
by human annotators for navigation through New York City streets and for resolving spatial descriptions at a given location." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-3", "text": "To enable the wider research community to work effectively with the Touchdown tasks, we are publicly releasing the 29k raw Street View panoramas needed for Touchdown." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-4", "text": "We follow the process used for the StreetLearn data release (Mirowski et al., 2019) to check panoramas for personally identifiable information and blur them as necessary." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-5", "text": "These have been added to the StreetLearn dataset and can be obtained via the same process as used previously for StreetLearn." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-6", "text": "We also provide a reference implementation for both of the Touchdown tasks: vision and language navigation (VLN) and spatial description resolution (SDR)." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-7", "text": "We compare our model results to those given in Chen et al. (2019) and show that the panoramas we have added to StreetLearn fully support both Touchdown tasks and can be used effectively for further research and comparison." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-8", "text": "----------------------------------" }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-10", "text": "Following natural language navigation instructions in visual environments requires addressing multiple challenges in dynamic, continuously changing environments, including language understanding, object recognition, grounding and spatial reasoning." 
}, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-11", "text": "Until recently, the most commonly studied domains were map-based (Thompson et al., 1993) or game-like (Macmahon et al., 2006; Misra et al., 2017 Misra et al., , 2018 Hermann et al., 2017; Hill et al., 2017) ." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-12", "text": "These environments enabled substantial progress, but the complexity and diversity of the visual input they provide is limited." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-13", "text": "This greatly simplifies both the language and vision challenges." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-33", "text": "The dataset can be easily accessed." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-14", "text": "To address this, recent tasks based on simulated environments include photo-realistic visual input, such as Room-to-Room (R2R; Anderson et al., 2018) , Talk-the-Walk (de Vries et al., 2018) and Touchdown (Chen et al., 2019) , all of which rely on panorama photos." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-15", "text": "A major challenge of creating simulations that use realworld photographs is they at times capture bystanders and their property." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-16", "text": "This raises privacy concerns and requires additional care to check for and ensure personally identifiable information (PII) is removed from research resources that are made publicly available." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-17", "text": "Existing resources adopt different strategies to address this." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-18", "text": "The Matterport3D dataset (Chang et al., 2017) , which underlies the R2R task, is focused on real-estate data that is curated to exclude PII." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-19", "text": "This approach is limited to environments of a specific type: houses that are for sale." 
}, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-20", "text": "Academic resources that focus on urban street scenes opted to manually collect panoramas from scratch and scrub them for PII (de Vries et al., 2018; Weiss et al., 2019) ." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-21", "text": "This is laborious and costly-especially the first stage of collecting the panoramas." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-22", "text": "As a result, such resources cover relatively small areas." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-23", "text": "Google Street View has world-wide scale coverage of street scenes." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-24", "text": "Each panorama in Street View has gone through a process to protect the privacy of bystanders and their property." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-25", "text": "Individuals can also request specific panoramas to be removed." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-26", "text": "As such, it is a resource with the potential to transform the research community's ability to study problems such as street scene understanding and navigation." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-27", "text": "Touchdown relies on 29,641 panoramas from Street View; however, because raw images cannot be distributed according to the Street View terms-of-service, 1 these are not provided with the Touchdown data." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-28", "text": "Instead, only image feature vectors are available for direct download with the data, and access to the raw panoramas is subject to availability through APIs governed by Street View's terms of service." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-29", "text": "Research can be done within a company and shared via publication without releasing data; for example, Cirik et al. (2018) discussed models for instruction-conditioned navigation in Street View." 
}, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-30", "text": "However, the full impact of the data and research about it can be better realized by making at least some portion of such resources available to the broader research community." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-31", "text": "In this context, StreetLearn (Mirowski et al., 2018 (Mirowski et al., , 2019 stands out as a publicly available resource of Street View data that has been approved for dissemination and use for academic research." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-32", "text": "2 StreetLearn contains 114k panoramas from New York City and Pittsburgh that have been manually checked for PII, ensuring, for example, that faces and license plates are blurred." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-34", "text": "Researchers interested in working with the data simply fill a form stating their goals and commit to update the data periodically with newer versions as they are released." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-35", "text": "This process balances the ability of researchers to use the data with preserving the privacy and rights of individuals impacted by the data." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-36", "text": "For example, periodic updates allow Google to respond to user takedown requests." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-37", "text": "To increase the accessibility of Touchdown and providing an example of how important data can be responsibly released, we integrate the Touchdown task and its corresponding Street View data into an updated version of StreetLearn." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-38", "text": "This paper reconciles Touchdown's mode of dissemination with StreetLearn's, which was designed to adhere to the rights of Google and individuals while also simplifying access for researchers and improving reproducibility." 
}, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-39", "text": "We also provide open source implementations of both the visionand-language navigation and spatial description resolution tasks, which we show to have a consistent performance with the results in the original Touchdown paper." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-40", "text": "We hope that this release of data and code will enable the entire research community to make further progress on these problems and to consider new questions and tasks enabled by this limited but significant slice of Street View data." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-41", "text": "----------------------------------" }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-42", "text": "**PROCESS**" }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-43", "text": "Touchdown includes tasks for natural language navigation and spatial reasoning in realistic urban environments." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-44", "text": "Touchdown uses Street View panoramas of New York City to define a large-scale navigation environment." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-45", "text": "It includes 9,326 human-written instructions and 27,575 spatial description resolution tasks." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-46", "text": "Touchdown's instructions were written by people and emphasize attributes of the visual environment as navigational cues." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-47", "text": "This makes Touchdown a valuable resource for research on following natural language instructions in visual environments." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-48", "text": "This contrasts with the template-based navigation instructions used by Hermann et al. (2020) , which were generated by Google Maps API and used with StreetLearn panoramas." 
}, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-49", "text": "Unfortunately, the development and release of Touchdown introduced several challenges that complicate working with the data." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-50", "text": "Even though Touchdown itself does not contain Street View data, it references specific Street View panoramas and depends on access to them via the Street View API." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-51", "text": "This requires any researcher that wishes to work on the data to download large amounts of data using the API, which is inconvenient, error-prone and not aligned with the current Google Maps terms-of-service." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-52", "text": "Also, the panoramas available through the API periodically change, potentially making parts of the data unavailable." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-74", "text": "Spatial Description Resolution." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-53", "text": "This means there is no hope for consistent versioning (which hurts reproducibility) regarding panorama availability because the data collected by each researcher is dependent on the particular time they access it." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-54", "text": "Finally, individual researchers or research groups cannot themselves comply with takedown requests-a responsibility that should stay with Google." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-55", "text": "Therefore, long term storage must be kept within Google, with researchers periodically refreshing the data." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-56", "text": "To address these challenges, we collect, check and release the Touchdown panoramas as part of an update to the 114k existing StreetLearn panoramas, which cover regions of New York City and Pittsburgh." 
}, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-57", "text": "As shown in Figure 1 , StreetLearn encompasses the entire region of New York City contained in Touchdown; however, the StreetLearn panoramas themselves are not sufficient for supporting the Touchdown tasks themselves." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-58", "text": "This is for several reasons." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-59", "text": "\u2022 The granularity of the panorama spacing is different." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-60", "text": "Figure 1 shows that most of the panoramas are different." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-61", "text": "Touchdown has roughly 25% of the panos but covers half of Manhattan compared to StreetLearn." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-62", "text": "\u2022 The language instructions refer to transient objects such as cars, bicycles, and couches, as illustrated in Figures 2 and 3." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-63", "text": "A panorama from a different time period will not contain these objects, so the instructions are not stable across time periods." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-64", "text": "\u2022 Spatial description resolution requires coverage of multiple points-of-view for those specific panoramas." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-65", "text": "Figure 3 shows an example SDR description and the corresponding views from which it can be answered." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-66", "text": "In all, the Touchdown tasks encompass 29,641 panoramas." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-67", "text": "All of these went through extensive manual review by annotators to check for personally identifiable information (PII), such as faces and license plates." 
}, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-68", "text": "Regions containing PII were marked as bounding boxes by annotators, and we blurred all of these regions for the final images." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-69", "text": "----------------------------------" }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-70", "text": "**EXPERIMENTS**" }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-71", "text": "We re-implement the best-reported models on the navigation and spatial description resolution tasks from Chen et al. (2019) to compare performance with our data release to the original Touchdown paper." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-72", "text": "The key difference between the two settings is that our released panoramas contain additional blurred patches (Section 2)." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-73", "text": "Another minor difference is that we use a word-piece tokenizer (Devlin et al., 2019) instead of a full-word tokenizer." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-75", "text": "SDR results are given in Table 1 ." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-76", "text": "Following Chen et al. (2019) , we report mean distance error and accuracy with different thresholds (40px, 80px, and 120px), which measures the proportion of evaluation items where the pixel chosen by the model is within the specified pixel distance." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-77", "text": "Our Retouchdown reimplementation of LINGUNET obtains better performance on the accuracy measures, but worse performance on mean distance error." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-78", "text": "To check whether this is a consequence of the blur-ring, we ran our model with features retrieved from original panoramas and obtained similar results as those listed in Table 1." 
}, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-79", "text": "Given this, the performance difference between our model and the original paper are likely not due to the additional blurring." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-80", "text": "As such, the Touchdown panoramas available through StreetLearn can be reliably used as direct replacement for those used in Chen et al. (2019) ." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-81", "text": "Vision-and-Language Navigation." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-82", "text": "We use the following metrics to evaluate VLN performance:" }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-83", "text": "\u2022 Task Completion (TC): the accuracy of navigating to the correct location." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-84", "text": "The correct location is defined as the exact goal panorama or one of its neighboring panoramas." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-85", "text": "This is the equivalent of the success rate metric (SR) used commonly in VLN for R2R." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-86", "text": "\u2022 Shortest-path distance (SPD): the mean of the distances over all executions of the agent's final panorama position and the goal panorama." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-87", "text": "\u2022 Success weighted by Edit Distance (SED): normalized graph edit distance between the agent path and true path, with points only awarded for successful paths." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-88", "text": "\u2022 Normalized Dynamic Time Warping (nDTW): a minimized cumulative distance between the agent path and true path, normalized by path length." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-89", "text": "\u2022 Success weighted Dynamic Time Warping (SDTW): nDTW, with points awarded only for successful paths." 
}, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-90", "text": "TC, SPD, and SED are defined in Chen et al. (2019) and nDTW and SDTW are defined in Ilharco et al. (2019) ." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-91", "text": "VLN results are given in Table 2 ." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-92", "text": "Our Retouchdown reimplementation of the RCONCAT model improves over the results given in Chen et al. (2019) for all metrics." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-93", "text": "We also establish benchmark scores for nDTW and SDTW." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-94", "text": "As with SDR, the panoramas now available via StreetLearn Panorama before the main SDR panorama." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-95", "text": "Main SDR Panorama." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-96", "text": "Panorama after the main SDR panorama." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-97", "text": "Figure 3 ." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-98", "text": "Actual example taken from the dataset with multiple SDR panorama viewpoints for the same instruction: Two parked bicycles, and a discarded couch, all on the left. Walk just past this couch, and stop before you pass another parked bicycle." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-99", "text": "This bike will be white and red, with a white seat." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-100", "text": "Touchdown is sitting on top of the bike seat. thus do not remove information critical for the VLN task." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-101", "text": "In our implementation, we use imitation learning on top of a scalable framework based on the Actor-Learner architecture (Lansing et al., 2019) , instead of supervised learning using Hogwild! (Recht et al., 2011) ." 
}, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-102", "text": "These differences likely explain the observed differences with the original results." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-103", "text": "Compared to interior navigation in the Room-to-Room (R2R) task, the Touchdown task is much harder: e.g. the current state-of-the-art success rate (equivalent to TC) for R2R on the validation unseen dataset is 55% (Zhu et al., 2019) ." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-104", "text": "The same holds for DTW measures: Ilharco et al. (2019) report a success rate of 44% and corresponding SDTW of 38.3% for a fidelity-oriented version of the Reinforced Cross-modal Matching agent (Wang et al., 2019) ." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-105", "text": "The TC of 12.8% and SDTW of 1.4% obtained by Retouch-RCONCAT amply demonstrates the challenge of the outdoor navigation problem defined by Touchdown." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-106", "text": "The greater diversity of the visual environments and the far greater degrees-of-freedom for navigation thus provide plenty of headroom for future research." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-107", "text": "----------------------------------" }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-108", "text": "**METHOD**" }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-109", "text": "----------------------------------" }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-110", "text": "**CONCLUSION**" }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-111", "text": "The research community is interested in using largescale resources such as Street View for work on computer vision and navigation." 
}, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-112", "text": "In order to comply with Street View's terms-of-service (which allow for only limited use of its data and APIs) and with its data restrictions, we have enriched StreetLearn with panoramas from the Touchdown study." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-113", "text": "That dataset is periodically updated to comply with Google Street View takedown requests to respect in-dividuals' privacy preferences." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-114", "text": "We encourage the research community to use only vetted and approved resources like StreetLearn, including our new release of the Touchdown panoramas, for their Street View oriented work." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-115", "text": "The addition of Touchdown to StreetLearn (a.k.a. Retouchdown) boosts the total panorama count for the StreetLearn dataset 3 from 114k to 144k." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-116", "text": "Furthermore, it contains multiple panoramas from the same neighborhoods, which supports work on learning to navigate in a region and testing in that same region using panoramas from a different time." }, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-117", "text": "Our code for training and evaluating vision-andlanguage navigation agents and spatial description resolution models are publicly available as part of the VALAN framework (Lansing et al., 2019) ." 
}, { "sent_id": "10acbeba830b2f8b3feb30de542c56-C001-118", "text": "4" } ], "y": { "@BACK@": { "gold_contexts": [ [ "10acbeba830b2f8b3feb30de542c56-C001-2" ], [ "10acbeba830b2f8b3feb30de542c56-C001-11", "10acbeba830b2f8b3feb30de542c56-C001-12", "10acbeba830b2f8b3feb30de542c56-C001-13", "10acbeba830b2f8b3feb30de542c56-C001-14" ], [ "10acbeba830b2f8b3feb30de542c56-C001-90" ] ], "cite_sentences": [ "10acbeba830b2f8b3feb30de542c56-C001-2", "10acbeba830b2f8b3feb30de542c56-C001-14", "10acbeba830b2f8b3feb30de542c56-C001-90" ] }, "@USE@": { "gold_contexts": [ [ "10acbeba830b2f8b3feb30de542c56-C001-7" ], [ "10acbeba830b2f8b3feb30de542c56-C001-71" ], [ "10acbeba830b2f8b3feb30de542c56-C001-76" ] ], "cite_sentences": [ "10acbeba830b2f8b3feb30de542c56-C001-7", "10acbeba830b2f8b3feb30de542c56-C001-71", "10acbeba830b2f8b3feb30de542c56-C001-76" ] }, "@UNSURE@": { "gold_contexts": [ [ "10acbeba830b2f8b3feb30de542c56-C001-78", "10acbeba830b2f8b3feb30de542c56-C001-79", "10acbeba830b2f8b3feb30de542c56-C001-80" ], [ "10acbeba830b2f8b3feb30de542c56-C001-92" ] ], "cite_sentences": [ "10acbeba830b2f8b3feb30de542c56-C001-80", "10acbeba830b2f8b3feb30de542c56-C001-92" ] } } }, "ABC_55e429045af4434f9cb27ae8c6db66_32": { "x": [ { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-2", "text": "Essay scoring is a complicated processing requiring analyzing, summarizing and judging expertise." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-3", "text": "Traditional work on essay scoring focused on automatic handcrafted features, which are expensive yet sparse." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-4", "text": "Neural models offer a way to learn syntactic and semantic features automatically, which can potentially improve upon discrete features." 
}, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-5", "text": "In this paper, we employ convolutional neural network (CNN) for the effect of automatically learning features, and compare the result with the state-of-art discrete baselines." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-6", "text": "For in-domain and domain-adaptation essay scoring tasks, our neural model empirically outperforms discrete models." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-7", "text": "----------------------------------" }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-9", "text": "Automatic essay scoring (AES) is the task of building a computer-based grading system, with the aim of reducing the involvement of human raters as far as possible." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-10", "text": "AES is challenging since it relies not only on grammars, but also on semantics, discourse and pragmatics." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-11", "text": "Traditional approaches treat AES as a classification (Larkey, 1998; Rudner and Liang, 2002) , regression (Attali and Burstein, 2004; Phandi et al., 2015) , or ranking classification problem (Yannakoudakis et al., 2011; Chen and He, 2013) , addressing AES by supervised learning." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-12", "text": "Features are typically bag-of-words, spelling errors and lengths, such word length, sentence length and essay length, etc." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-13", "text": "Some grammatical features are considered to assess the quality of essays (Yannakoudakis et al., 2011) ." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-14", "text": "A drawback is feature engineering, which can be time-consuming, since features need to be carefully handcrafted and selected to fit the approriate model." 
}, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-15", "text": "A further drawback of manual feature templates is that they are sparse, instantiated by discrete pattern-matching." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-16", "text": "As a result, parsers and semantic analyzers are necessary as a preprocessing step to offer syntactic and semantic patterns for feature extraction." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-17", "text": "Given variable qualities of student essays, such analyzers can be highly unreliable." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-18", "text": "Neural network approaches have been shown to be capable of inducing dense syntactic and semantic features automatcially, giving competitive results to manually designed features for several tasks (Kalchbrenner et al., 2014; Johnson and Zhang, 2014; dos Santos and Gatti, 2014) ." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-19", "text": "In this paper, we empirically investigate a neural network method to learn features automatically for AES, without the need of external pre-processing." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-20", "text": "In particular, we build a hierarchical CNN model, with one lower layer representing sentence structures and one upper layer representing essay structure based on sentence representations." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-21", "text": "We compare automatically-induced features by the model with state-of-art baseline handcrafted features." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-22", "text": "Empirical results show that neural features learned by CNN are very effective in essay scoring task, covering more high-level and abstract information compared to manual feature templates." 
}, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-23", "text": "----------------------------------" }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-24", "text": "**RELATED WORK**" }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-25", "text": "Following the first AES system Project Essay Grade (PEG) been developed in 1966 (Page, 1994) , a number of commercial systems have come out, such as IntelliMetric 2, Intelligent Essay Assessor (IEA) (Foltz et al., 1999 ) and e-rater system (Attali and Burstein, 2004) ." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-26", "text": "The e-rater system now plays a second human rater's role in the Test of English as a Foreign Language (TOEFL) and Graduate Record Examination (GRE)." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-27", "text": "The e-rater extracts a number of complex features, such as grammatical error and lexical complexity, and uses stepwise linear regression." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-28", "text": "IEA uses Latent Semantic Analysis (LSA) (Landauer et al., 1998) to create semantic vectors for essays and measure the semantic similarity between the vectors." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-29", "text": "In the research literature, Larkey (1998) and Rudner and Liang (2002) treat AES as classification using bag-of-words features." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-30", "text": "Other recent work formulates the task as a preference ranking problem (Yannakoudakis et al., 2011; Chen and He, 2013) ." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-31", "text": "Yannakoudakis et al. (2011) formulated AES as a pairwise ranking problem by ranking the order of pair essays based on their quality." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-32", "text": "Features consist of word, POS n-grams features, complex grammatical features and so on." 
}, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-33", "text": "Chen and He (2013) formulated AES into a listwise ranking problem by considering the order relation among the whole essays and features contain syntactical features, grammar and fluency features as well as content and promptspecific features." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-34", "text": "Phandi et al. (2015) use correlated Bayesian Linear Ridge Regression (cBLRR) focusing on domain-adaptation tasks." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-35", "text": "All these previous methods use discrete handcrafted features." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-36", "text": "Recently, Alikaniotis et al. (2016) also employ a neural model to learn features for essay scoring automatically, which leverages a score-specific word embedding (SSWE) for word representations and a two-layer bidirectional long-short term memory network (LSTM) to learn essay representations." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-37", "text": "Alikaniotis et al. (2016) show that by combining SSWE, LSTM outperforms traditional SVM model." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-38", "text": "On the other hand, using LSTM alone does not give significantly more accuracies compared to SVM." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-39", "text": "This conforms to our preliminary experiments with the LSTM structure." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-40", "text": "Here, we use CNN without any specific embeddings and show that our neural models could outperform baseline discrete models on both in-domain and cross-domain senarios." 
}, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-41", "text": "CNN has been used in many NLP applications, such as sequence labeling (Collobert et al., 2011) , sentences modeling (Kalchbrenner et al., 2014) , sentences classification (Kim, 2014 ), text categorization (Johnson and Zhang, 2014; Zhang et al., 2015) and sentimental analysis (dos Santos and Gatti, 2014), etc." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-42", "text": "In this paper, we explore CNN representation ability for AES tasks on both in-domain and domain-adaptation settings." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-43", "text": "----------------------------------" }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-44", "text": "**BASELINE**" }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-45", "text": "Bayesian Linear Ridge Regression (BLRR) and Support Vector Regression (SVR) (Smola and Vapnik, 1997 ) are chosen as state-of-art baselines." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-46", "text": "Feature templates follow (Phandi et al., 2015) , extracted by EASE 1 , which are briefly listed in Table 1 . \"Useful n-grams\" are determined using the Fisher test to separate the good scoring essays and bad scoring essays." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-47", "text": "Good essays are essays with a score greater than or equal to the average score, and the remainder are considered as bad scoring essays." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-48", "text": "The top 201 ngrams with the highest Fisher values are chosen as the bag of features and these top 201 n-grams constitute useful n-grams." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-49", "text": "Correct POS tags are generated using grammatically correct texts, which is done by EASE." 
}, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-50", "text": "The POS tags that are not included in the correct POS tags are treated as bad POS tags, and these bad POS tags make up the \"bad POS n-grams\" features." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-51", "text": "The features tend to be highly useful for the in-domain task since the discrete features of same prompt data share the similar statistics." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-52", "text": "However, for different prompts, features statistics vary significantly." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-53", "text": "This raises challenges for discrete feature patterns." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-54", "text": "ML-\u03c1 (Phandi et al., 2015) was proposed to address this issue." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-55", "text": "It is based on feature augmentation, incorporating explicit correlation into augmented feature spaces." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-56", "text": "In particular, it expands baseline feature vector x to be \u03a6 s (x) = (\u03c1x, (1 \u2212 \u03c1 2 ) 1/2 x) and \u03a6 t (x) = (x, 0 p ) for source and target domain data 1 https://github.com/edx/ease in R 2p respectively, with \u03c1 being the correlation between source and target domain data." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-57", "text": "Then BLRR and maximum likelihood estimation are used to the optimize correlation." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-58", "text": "All the baseline models require POS-tagging as a pre-processing step, extracting syntactic features based on POS-tags." 
}, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-59", "text": "----------------------------------" }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-60", "text": "**MODEL**" }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-61", "text": "Word Representations We use word embedding with an embedding matrix E w \u2208 R dw\u00d7Vw where d w is the embedding dimension, and V w represents words vocabulary size." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-62", "text": "A word vector z i is represented by z i = E w w i where w i is the i-th word in a sentence." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-63", "text": "In contrast to the baseline models, our CNN model does not rely on POS-tagging or other pre-processing." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-64", "text": "----------------------------------" }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-65", "text": "**CNN MODEL**" }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-66", "text": "We take essay scoring as a regression task and employ a two-layer CNN model, in which one convolutional layer is used to extract sentences representations, and the other is stacked on sentence vectors to learn essays representations." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-67", "text": "The architecture is depicted in Figure 1 ." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-68", "text": "Given an input sentence z 1 , z 2 , ..., z n , a convolution layer with a filter w \u2208 R h\u00d7k is applied to a window of h words to produce n-grams features." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-69", "text": "For instance, a feature c i is generated from a window of words z i:i+h\u22121 by c i = f (w \u00b7 z i:i+h\u22121 + b) , b \u2208 R is the bias term and f is the non-linear activation function rectified linear unit (ReLU)." 
}, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-70", "text": "The filter is applied to the all possible windows in a sentence to produce a feature map c = [c 1 , c 2 , ..., c m\u2212h+1 ]." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-71", "text": "For c j of the j-th sentence in an essay, max-pooling and average pooling function are used to produce the sentence vector s j = max{c j } \u2295 avg{c j }." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-72", "text": "The second convolutional layer takes s 1 , s 2 ,..., s n as inputs, followed by pooling layer (max-pooling and average-pooling) and a fully-connected hidden layer." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-73", "text": "The hidden layer directly connects to output layer which generates a score." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-74", "text": "----------------------------------" }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-75", "text": "**EXPERIMENTS**" }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-76", "text": "----------------------------------" }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-77", "text": "**SETUP**" }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-78", "text": "Data We use the Automated Student Assessment Prize (ASAP) 2 dataset as evaluation data for our task, which contains 8 prompts of different genres as listed in Table 2 ." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-79", "text": "The essay scores are scaled into the range from 0 to 1." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-80", "text": "The settings of data preparation follow (Phandi et al., 2015) ." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-81", "text": "We use quadratic weighted kappa (QWK) as the metric." 
}, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-82", "text": "For domainadaptation (cross-domain) experiments, we follow (Phandi et al., 2015) , picking four pairs of essay prompts, namely, 1\u21922, 3\u21924, 5\u21926 and 7\u21928, where 1\u21922 denotes prompt 1 as source domain and prompt Hyper-parameters We use Adagrad for optimization." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-83", "text": "Word embeddings are randomly initialized and the hyper-parameter settings are listed in Table 3 ." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-84", "text": "----------------------------------" }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-85", "text": "**RESULTS**" }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-86", "text": "In-domain The in-domain results are shown in Figure 2 ." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-87", "text": "The average values of all 8 prompt sets are listed in Table 4 ." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-88", "text": "For the in-domain task, CNN outperforms the baseline model SVR on all prompts of essay sets, and is competitive to BLRR." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-89", "text": "For the statistical significance, neural model is significantly better than baseline models with the p-value less than 10 \u22125 at the confidence level of 95%." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-90", "text": "The average kappa value over 8 prompts is close to that of human raters." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-91", "text": "Cross-domain The domain-adaptation results are shown in Table 5 ." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-92", "text": "It can be seen that our CNN model outperforms ML-\u03c1 on almost all pairs of adaptation experiments." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-93", "text": "ML-\u03c1 domain-adaptation method's performance improves as the size of target domain training data increases." 
}, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-94", "text": "However, compared to ML-\u03c1, target training data size has less impact on our neural model." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-95", "text": "Even if the target training size is small, the neural model still gives strong performance." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-96", "text": "This results from the fact that neural model could learn more high-level and abstract features compared to traditional models with handcrafted discrete features." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-97", "text": "We plot the confusion matrix between truth and model prediction on test data in Figure 4 , which shows that prediction scores of neural model tend to be closer to true values, which is very important in our task." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-98", "text": "----------------------------------" }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-99", "text": "**FEATURE ANALYSIS**" }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-100", "text": "To visualize the features learned by our model, we use t-distributed stochastic neighbor embedding (t-SNE) (Van der Maaten and Hinton, 2008), projecting 50-dimensional features into 2-dimensional space." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-101", "text": "We take two domain pairs 3\u21924 and 5\u21926 as examples on the cross-domain task, extracting fully-connected hidden-layer features for target domain data using model trained on source domain data." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-102", "text": "The results are showed in Figure 3 ." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-103", "text": "The baseline discrete features are more concentrated, which shows that patterns on source prompt are weak in differentiating target prompt essays." 
}, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-104", "text": "By using ML-\u03c1 and leveraging 100 target prompt training examples, the discrete features patterns are more scattered, increasing the differentiating power." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-105", "text": "In contrast, CNN features trained on source prompt are sparse when used directly on the target prompt." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-106", "text": "This shows that neural features learned by the CNN model can better differentiate essays of different qualities." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-107", "text": "Without manual templates, such features automatically capture subtle and complex information that is relevant to the task." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-108", "text": "----------------------------------" }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-109", "text": "**CONCLUSION**" }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-110", "text": "We empirically investigated a hierarchical CNN model for automatic essay scoring, showing automatically learned features competitive to discrete handcrafted features for both in-domain and domain-adaptation tasks." }, { "sent_id": "55e429045af4434f9cb27ae8c6db66-C001-111", "text": "The results demonstrate large potential for deep learning in AES." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "55e429045af4434f9cb27ae8c6db66-C001-11" ] ], "cite_sentences": [ "55e429045af4434f9cb27ae8c6db66-C001-11" ] }, "@USE@": { "gold_contexts": [ [ "55e429045af4434f9cb27ae8c6db66-C001-46" ], [ "55e429045af4434f9cb27ae8c6db66-C001-52", "55e429045af4434f9cb27ae8c6db66-C001-53", "55e429045af4434f9cb27ae8c6db66-C001-54" ], [ "55e429045af4434f9cb27ae8c6db66-C001-80" ], [ "55e429045af4434f9cb27ae8c6db66-C001-82" ] ], "cite_sentences": [ "55e429045af4434f9cb27ae8c6db66-C001-46", "55e429045af4434f9cb27ae8c6db66-C001-54", "55e429045af4434f9cb27ae8c6db66-C001-80", "55e429045af4434f9cb27ae8c6db66-C001-82" ] } } }, "ABC_06276db79ed5aa04bb24a31c10d3a9_32": { "x": [ { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-133", "text": "However, the very long paths still proved to be challenging for the best model." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-28", "text": "----------------------------------" }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-29", "text": "**ABSTRACT MEANING REPRESENTATIONS**" }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-106", "text": "The added AMRs are all CAMR-produced." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-2", "text": "Linguistic phenomena like pronouns, control constructions, or co-reference give rise to co-indexed variables in meaning representations." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-3", "text": "We review three different methods for dealing with co-indexed variables in the output of neural semantic parsing of abstract meaning representations: (a) copying concepts during training and restoring co-indexation in a post-processing step; (b) explicit indexing of co-indexation; and (c) using absolute paths to designate co-indexing." 
}, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-4", "text": "The second method gives the best results and outperforms the baseline by 2.9 F-score points." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-5", "text": "----------------------------------" }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-7", "text": "Semantic parsing is the task of mapping a natural language sentence into a meaning representation (a logical form)." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-8", "text": "One of the problems a semantic parser has to deal with is co-indexed variables, which arise in antecedent-anaphor relations, proper name co-reference, control constructions and other linguistic phenomena." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-9", "text": "Examples of such constructions are given in (1)- (4) In the context of this paper, we represent meanings using the formalism of Abstract Meaning Representation (AMR), as introduced by Banarescu et al. (2013) ." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-10", "text": "AMRs can be seen as graphs connecting concepts by relations." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-11", "text": "Each concept is represented by a named instance." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-12", "text": "Co-reference is established by re-using these instances." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-13", "text": "For example, the AMRs corresponding to examples (1) and (2) above are given in Figure 1 . Note that, due to the bracketing, the variable b encapsulates the whole entity person :name \"Bob\" and not just person, i.e. b stands for a person with the name Bob." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-14", "text": "That there is a lot to gain in this area can be seen by applying the AMR evaluation suite of Damonte et al. (2017) , which calculates nine different metrics to evaluate AMR parsing, reentrancy being one of them." 
}, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-15", "text": "Out of the four systems that made these scores available (all scores reported in van Noord and Bos (2017) ), the reentrancy metric obtained the lowest F-score for three of them." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-16", "text": "Various methods have been proposed to automatically parse AMRs, ranging from syntax-based approaches (e.g. Flanigan et al. (2014) ; Wang et al. (2015) ; Pust et al. (2015) ; Damonte et al. (2017) ) to the more recent neural approaches (Peng et al. (2017) ; Buys and Blunsom (2017) ; Konstas et al. (2017) ; Foland and Martin (2017); van Noord and Bos (2017) )." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-17", "text": "Especially the neural approaches are interesting, since they all use some sort of linearization method and therefore need a predefined way to handle reentrancy." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-18", "text": "Peng et al. (2017) and Buys and Blunsom (2017) use a special character to indicate reentrancy and restore co-referring variables in a post-processing step." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-19", "text": "Konstas et al. (2017) simply replace reentrancy variables by their co-referring concept in the input and never outputs co-referring nodes." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-20", "text": "Foland and Martin (2017) and van Noord and Bos (2017) use the same input transformation as Konstas et al. (2017) , but do try to restore co-referring nodes by merging all equal concepts into a single concept in a post-processing step." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-21", "text": "All these methods have in common that they are not very sophisticated, but more importantly, that it is not clear what the exact impact of these methods is on the final performance of the model, making it unclear what the best implementation is for future neural AMR parsers." 
}, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-22", "text": "In this paper we present three methods to handle reentrancy for AMR parsing." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-23", "text": "The first two methods are based on the previous work described above, while the third is a new, more principled method." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-24", "text": "These methods are applied on the model that reported the best results in the literature, the character-level neural semantic parsing method of van Noord and Bos (2017) ." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-25", "text": "In a nutshell, this method uses a character-based sequence-to-sequence model to translate sentences to AMRs." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-26", "text": "To enable this process, pre-processing and post-processing steps are needed." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-27", "text": "The aim of this paper is to find the best method to handle reentrancy in neural semantic parsing and to show the specific impact that each of the methods have on general performance." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-30", "text": "AMRs, as introduced by Banarescu et al. (2013) , are acyclic, directed graphs that show a representation of the meaning of a sentence." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-31", "text": "There are three ways to display an AMR: as a graph, as a tree, or as a set of triples." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-32", "text": "The AMRs in Figure 1 are shown as trees, which is also the input for our neural semantic parser." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-33", "text": "AMR concepts (e.g. like, person) are relating to each other by the use of two-place predicates (e.g. :ARG0, :ARG1, :name)." 
}, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-34", "text": "For example, in the right AMR in Figure 1 , want and buy are connected by the :ARG3 predicate." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-35", "text": "For our experiments, we use the annotated AMR corpus LDC2016E25, which contains 36,521 training, 1,368 development and 1,371 test AMRs." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-36", "text": "Co-indexed variables occur frequently in this data set, as can be seen in Table 1 ." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-37", "text": "About half of the AMRs contain at least one co-indexed variable, while about 20% of the total number of triples 1 contains a variable that has at least one anaphor in the AMR." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-38", "text": "This is the number of triples that is used in the reentrancy evaluation metric by Damonte et al. (2017) ." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-39", "text": "This number, however, also includes triples that would be present regardless of whether the variable is coindexed (e.g. the instance triple of the antecedent)." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-40", "text": "A more fair number might be the relation triples that contain an anaphor variable, which is exactly equal to the number of total co-indexed variables." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-41", "text": "This is still 3.4 to 4.2% of all triples in the data set." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-42", "text": "The actual values of the variables for AMR instances are unimportant." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-43", "text": "Hence, one can rename variables in an AMR as long as re-occurrences of variables are preserved." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-44", "text": "If variables are used only once in an AMR, they can therefore be completely eliminated from an AMR." 
}, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-45", "text": "This insight was used by Barzdins and Gosko (2016) in the first approach to neural semantic parsing of AMRs, because particular names of variables are very hard to learn for a neural model given the limited amount of data available." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-46", "text": "We present three ways to encode co-reference in variable-free AMRs." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-47", "text": "2" }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-48", "text": "Method 1A: Baseline Note that if there are co-indexed variables, there is always exactly one instance of the variable that carries semantic information." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-49", "text": "In our baseline method, similar to Barzdins and Gosko (2016) and Konstas et al. (2017) , we simply copy this semantic information, while removing the variables, as is shown below." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-50", "text": "This method does never output reentrancy nodes and therefore functions as a baseline." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-51", "text": "An example is shown in Figure 2 . (1) and (2)." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-52", "text": "Method 1B: Reentrancy Restoring This method is created to restore reentrancy nodes in the output of the baseline model." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-53", "text": "It operates on a very ad hoc principle: if two nodes have the same concept, the second one was actually a reference to the first one." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-54", "text": "We therefore replace each node that has already occurred in the AMR by the variable of the antecedent node." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-55", "text": "This approach was applied by van Noord and Bos (2017) and Foland and Martin (2017) ." 
}, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-56", "text": "The model thus never learns to output reentrancy, but the co-indexed variables are restored in a postprocessing step." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-57", "text": "An example is shown in Figure 3 ." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-58", "text": "Note that this process can also erroneously insert reentrancies when two separate entities would be correct." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-59", "text": "For example, if the sentence Bob i likes Bob j refers to two different Bobs, the initial AMR in Figure 3 would have been correct." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-60", "text": "Method 2: Indexing This method comprises of removing all variables from an AMR, except when they are co-indexed." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-61", "text": "The remaining variables are normalised by converting them to numbers, so that each unique co-indexed variable has a unique identifier." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-62", "text": "Similar approaches were applied by Peng et al. (2017) and Buys and Blunsom (2017) , and an example is shown in Figure 4 ." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-63", "text": "In this approach the model actually learns where it should output reentrancy nodes, instead of restoring them in a post-processing step." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-64", "text": "However, we do still need some post-processing, as is shown in Figure 5 (after variables are restored)." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-65", "text": "In this AMR, there is an index that is never instantiated ( * 3 * ) and an instantiated index that is never (1) and (2) when applying the Indexing method." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-66", "text": "referred to ( * 2 * b / buy)." 
}, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-67", "text": "The superfluous * 2 * can simply be removed, but for * 3 * we have multiple options." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-68", "text": "Algorithm 1 shows how we handle these cases." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-69", "text": "For each referent that was never instantiated, we first check if there is also an instantiated index that is never referred to." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-70", "text": "If that it is the case, we replace it by that variable." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-71", "text": "This assumes that the model merely mismatched the index symbols." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-72", "text": "This works for * 3 * in Figure 5 , even though the resulting node is still incorrect." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-73", "text": "If that was not the case, we try to replace the referent by an instantiated index that already did have a referent." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-74", "text": "Since this index already has a referent, we assume it is likely that a mismatched referent actually referred to this index (in Figure 5 , * 3 * would then be replaced by p) ." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-75", "text": "In the case that both previous options failed, we simply pick the variable of the concept that is most often a referent in the training set." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-76", "text": "Algorithm 1 Post-processing algorithm used to replace indexes by variables." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-77", "text": "R: all referent indexes in output (e.g. * 1 * , * 3 * ) X: all instantiated indexes that do not have a referent (e.g. * 2 * b / buy-01 ) Y: all instantiated indexes that have a referent (e.g. * 1 * person :name \"Jack\" ) C: all concept-variable pairs (e.g. 
w / want-01) most freq: function that, given a list of concepts, selects the concept that is most frequently a referent in the training set. The AMRs in Figure 6 show the paths." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-78", "text": "In the first AMR, the path :ARG0 describes that the node refers to the value of the relation :ARG0." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-79", "text": "In the second AMR, the path :ARG1 refers in a similar way to :ARG1." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-80", "text": "Note that although these examples are straightforward, the paths can become quite long for larger AMRs, e.g. :op1 :ARG0 :ARG0-of :mod." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-81", "text": "The longest path in the training set even contains 14 relations." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-82", "text": "Figure 6 shows the AMRs for (1) and (2) when applying the Absolute Paths method." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-83", "text": "This method is perhaps the most attractive from a theoretical point of view." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-84", "text": "Note, however, that not every node in an AMR can be described by a unique path, as ambiguities might occur when there are two edges with the same name (this can occur, for instance, when more than one modifier is present)." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-85", "text": "To solve these ambiguities an index (e.g. |1|, |2|) is added to each relation in the path." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-86", "text": "This was necessary for 632 out of the 40,582 paths in the training set." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-87", "text": "In total, there are 5,760 unique paths, of which 3,447 only occur once." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-88", "text": "The most frequent path is :ARG0|1|, occurring 5,405 times." 
}, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-89", "text": "Similar to the Indexing method, a problem with this approach is the fact that we have no control over what the model will output." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-90", "text": "Especially for larger AMRs it is likely that the model will output impossible paths without destination." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-91", "text": "For example, the model might output :ARG2 instead of :ARG0 in the left example in Figure 6 ." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-92", "text": "These impossible paths still need to be replaced by a variable to get a valid AMR." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-93", "text": "The strategy we opt for here is similar to one used in the Indexing method, namely replacing the path by the variable of the concept in the AMR that most frequently has a referent in the training set." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-94", "text": "----------------------------------" }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-95", "text": "**NEURAL MODEL**" }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-96", "text": "We implement a bidirectional sequence-to-sequence model with general attention that takes characters as input, using the OpenNMT software (Klein et al., 2017) ." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-97", "text": "The parameter settings are the same as in van Noord and Bos (2017) and are shown in Table 2 ." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-98", "text": "It is trained for 20 epochs, after which the model that performs best on the development set is used to decode the test set." 
}, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-99", "text": "----------------------------------" }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-100", "text": "**EXPERIMENTS**" }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-101", "text": "We test the impact of the different methods on two of our earlier models, described in van Noord and Bos (2017) ." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-102", "text": "The first is a simple baseline model that only takes the characters into account without any additional methods to improve performance." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-103", "text": "This model is referred to as the char-only model." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-104", "text": "The second is the approach that produced one of the best results so far in the literature." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-105", "text": "This model uses POS-tagged input, clusters together groups of characters (super characters) and exploits 100,000 \"silver\" AMRs that were obtained by using the off-the-shelf AMR parsers CAMR (Wang et al., 2015) and JAMR (Flanigan et al., 2014) 4 ." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-134", "text": "This might mean that, even when adding more gold data, learning such sophisticated structures is too difficult for end-to-end sequence-tosequence models in general." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-135", "text": "Table 6 shows more detailed results of the Indexing method." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-136", "text": "In the majority of cases, the model does what it is supposed to do: first instantiating an index, then referring to it." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-137", "text": "However, for finding the correct variable replacement for the baseline model, we still have to fall back on heuristics in approximately 30% of the cases." 
}, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-138", "text": "Most of these instances are then solved by referring to the concept that is most frequently a referent in the training set." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-139", "text": "However, for our best model, this reliance is not there anymore." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-140", "text": "The model generally only outputs an index when it was instantiated first, which is exactly what was intended." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-141", "text": "We proposed three methods to handle co-indexed variables for neural semantic (AMR) parsing." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-142", "text": "The best results were obtained by the Indexing method, which explicitly encodes co-indexing nodes in the data set." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-143", "text": "The perhaps theoretically most attractive Absolute Paths performed the worst, although it did still offer an improvement over the baseline." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-144", "text": "Perhaps an interesting direction for future research is to use relative instead of absolute paths, encoding the path relative to the reentrancy node instead of starting at the top of the AMR, because this will make the paths shorter and more local." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-145", "text": "----------------------------------" }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-146", "text": "**DETAILED ANALYSIS**" }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-107", "text": "We must note that CAMR is not particularly keen on outputting coreference, as the 100,000 silver AMRs only produced 18,865 new reentrancy nodes." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-108", "text": "The second approach also employs the postprocessing methods Wikification and pruning, as explained in van Noord and Bos (2017)." 
}, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-109", "text": "The Wikification step simply adds wiki links to :name nodes, since those links were removed in the input." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-110", "text": "Pruning is used to remove erroneously produced duplicate output." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-111", "text": "This is a common problem for sequence-to-sequence models, since the model does not keep track of what it has already output." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-112", "text": "No pre-training or ensemble methods are used for both approaches." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-113", "text": "----------------------------------" }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-114", "text": "**RESULTS AND DISCUSSION**" }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-115", "text": "----------------------------------" }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-116", "text": "**MAIN RESULTS**" }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-117", "text": "The results of applying our three methods on the baseline and best model are shown in Table 3 ." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-118", "text": "All reported numbers are F-scores obtained by using SMATCH ." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-119", "text": "All three methods offer an improvement over the baseline, for both the baseline and the best model." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-120", "text": "Indexing is the highest scoring method, except for the test set of the best model, since Reentrancy Restoring obtains the same F-score there." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-121", "text": "Explicitly encoding the absolute paths resulted in an increase over the baseline, but did not outperform both Reentrancy Restoring and Indexing, although only by a small margin." 
}, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-122", "text": "It is interesting to look at whether we indeed improve on parsing AMRs that have co-indexed variables, in comparison to AMRs that do not contain them." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-123", "text": "This is shown in Table 4 ." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-124", "text": "We see that, in general, we indeed only improve on both baselines for AMRs that have co-indexed variables." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-125", "text": "This is the case for both the dev and the test set." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-126", "text": "This is the desired scenario: we improve on AMRs with reentrancy, while performance does not decrease on AMRs without reentrancy." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-127", "text": "Only applying the Absolute Paths method on the baseline model scores lower on the test set on AMRs without reentrancy nodes." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-128", "text": "Similar to the results in Table 3 , Indexing outperforms Reentrancy Restoring for the baseline model, but has similar scores when applied to the best model." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-129", "text": "Table 5 shows the how often the model outputs possible paths for the Absolute Paths method." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-130", "text": "Unfortunately, about 50% of the time, the model output a path that did not lead to a possible referent, leaving us to rely on the frequency heuristic." }, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-131", "text": "This problem is not as severe for our best model, though, since about 75% of paths were actually possible." 
}, { "sent_id": "06276db79ed5aa04bb24a31c10d3a9-C001-132", "text": "The addition of extra (silver) data thus helped the model in learning the paths, suggesting that the baseline results might merely be an effect of sparse data." } ], "y": { "@BACK@": { "gold_contexts": [ [ "06276db79ed5aa04bb24a31c10d3a9-C001-10", "06276db79ed5aa04bb24a31c10d3a9-C001-11", "06276db79ed5aa04bb24a31c10d3a9-C001-12", "06276db79ed5aa04bb24a31c10d3a9-C001-13", "06276db79ed5aa04bb24a31c10d3a9-C001-14", "06276db79ed5aa04bb24a31c10d3a9-C001-15" ], [ "06276db79ed5aa04bb24a31c10d3a9-C001-16", "06276db79ed5aa04bb24a31c10d3a9-C001-17" ], [ "06276db79ed5aa04bb24a31c10d3a9-C001-20", "06276db79ed5aa04bb24a31c10d3a9-C001-21" ], [ "06276db79ed5aa04bb24a31c10d3a9-C001-22", "06276db79ed5aa04bb24a31c10d3a9-C001-23", "06276db79ed5aa04bb24a31c10d3a9-C001-24", "06276db79ed5aa04bb24a31c10d3a9-C001-25", "06276db79ed5aa04bb24a31c10d3a9-C001-26" ], [ "06276db79ed5aa04bb24a31c10d3a9-C001-108" ] ], "cite_sentences": [ "06276db79ed5aa04bb24a31c10d3a9-C001-15", "06276db79ed5aa04bb24a31c10d3a9-C001-16", "06276db79ed5aa04bb24a31c10d3a9-C001-20", "06276db79ed5aa04bb24a31c10d3a9-C001-24", "06276db79ed5aa04bb24a31c10d3a9-C001-108" ] }, "@MOT@": { "gold_contexts": [ [ "06276db79ed5aa04bb24a31c10d3a9-C001-16", "06276db79ed5aa04bb24a31c10d3a9-C001-17" ], [ "06276db79ed5aa04bb24a31c10d3a9-C001-20", "06276db79ed5aa04bb24a31c10d3a9-C001-21" ] ], "cite_sentences": [ "06276db79ed5aa04bb24a31c10d3a9-C001-16", "06276db79ed5aa04bb24a31c10d3a9-C001-20" ] }, "@USE@": { "gold_contexts": [ [ "06276db79ed5aa04bb24a31c10d3a9-C001-22", "06276db79ed5aa04bb24a31c10d3a9-C001-23", "06276db79ed5aa04bb24a31c10d3a9-C001-24", "06276db79ed5aa04bb24a31c10d3a9-C001-25", "06276db79ed5aa04bb24a31c10d3a9-C001-26" ], [ "06276db79ed5aa04bb24a31c10d3a9-C001-52", "06276db79ed5aa04bb24a31c10d3a9-C001-53", "06276db79ed5aa04bb24a31c10d3a9-C001-54", "06276db79ed5aa04bb24a31c10d3a9-C001-55" ], [ "06276db79ed5aa04bb24a31c10d3a9-C001-97" ], [ 
"06276db79ed5aa04bb24a31c10d3a9-C001-101" ], [ "06276db79ed5aa04bb24a31c10d3a9-C001-108" ] ], "cite_sentences": [ "06276db79ed5aa04bb24a31c10d3a9-C001-24", "06276db79ed5aa04bb24a31c10d3a9-C001-55", "06276db79ed5aa04bb24a31c10d3a9-C001-97", "06276db79ed5aa04bb24a31c10d3a9-C001-101", "06276db79ed5aa04bb24a31c10d3a9-C001-108" ] } } }, "ABC_672d4299e60752e866293d72f97905_32": { "x": [ { "sent_id": "672d4299e60752e866293d72f97905-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "672d4299e60752e866293d72f97905-C001-2", "text": "Abstract." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-3", "text": "Psycholinguistic properties of words have been used in various approaches to Natural Language Processing tasks, such as text simplification and readability assessment." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-4", "text": "Most of these properties are subjective, involving costly and time-consuming surveys to be gathered." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-5", "text": "Recent approaches use the limited datasets of psycholinguistic properties to extend them automatically to large lexicons." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-6", "text": "However, some of the resources used by such approaches are not available to most languages." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-7", "text": "This study presents a method to infer psycholinguistic properties for Brazilian Portuguese (BP) using regressors built with a light set of features usually available for less resourced languages: word length, frequency lists, lexical databases composed of school dictionaries and word embedding models." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-8", "text": "The correlations between the properties inferred are close to those obtained by related works." 
}, { "sent_id": "672d4299e60752e866293d72f97905-C001-9", "text": "The resulting resource contains 26,874 words in BP annotated with concreteness, age of acquisition, imageability and subjective frequency." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-10", "text": "----------------------------------" }, { "sent_id": "672d4299e60752e866293d72f97905-C001-11", "text": "**INTRODUCTION**" }, { "sent_id": "672d4299e60752e866293d72f97905-C001-12", "text": "Besides frequency, form, and meaning, words also have several other less known words properties, such as imageability, concreteness, familiarity, subjective frequency, and age of acquisition (AoA), which are subjective psycholinguistic properties, as they depend on the personal experiences that individuals had using those words." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-13", "text": "According to [15] , word imageability is the ease and speed with which a word evokes a mental image; concreteness is the degree to which words refer to objects, people, places, or things that can be experienced by the senses; experiential familiarity is the degree to which individuals know and use words in their everyday life; subjective frequency is the estimation of the number of times a word is encountered by individuals in its written or spoken form, and AoA is the estimation of the age at which a word was learned." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-14", "text": "Psycholinguistic properties have been used in various approaches, such as for Lexical Simplification [12] , for Text Simplification at the sentence level, with the aim of reducing the difficulty of informative text for language learners [18] , to predict the reading times (RTs) of each word in a sentence to assess sentence complexity [14] and also to create robust text level readability models [17] , which is also one of the purposes of this paper." 
}, { "sent_id": "672d4299e60752e866293d72f97905-C001-15", "text": "Because of its inherent costs, the measurement of subjective psycholinguistic properties is usually used in the creation of datasets of limited size [2, 7, 8, 15] ." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-16", "text": "For the English language, the most well known database of this kind is the MRC Psycholinguistic Database 4 , which contains 27 subjective psycholinguistic properties for 150,837 words." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-17", "text": "For BP for example, there is a psycholinguistic database 5 containing 21 columns of information for 215,175 words, but no subjective psycholinguistic properties." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-18", "text": "In this work we aim to overcome this gap by automatically inferring the psycholinguistic properties of imageability, concreteness, AoA and subjective frequency (similar to familiarity) for a large database of 26,874 BP words using a resource-light regression approach." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-19", "text": "As for the automatic inference, this work is strongly based on the results of [12] which proposed an automatic bootstrapping method for regression to populate the MRC Database." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-20", "text": "We explore here 3 research questions: (1) is it possible to achieve high Pearson and Spearman correlations values and low MSE values with a regression method using only word embedding features to infer the psycholinguistic properties for BP? (2) which size a database with psycholinguistic properties should have to be used in regression models?" }, { "sent_id": "672d4299e60752e866293d72f97905-C001-21", "text": "Does merging databases from different sources yield better correlation and lower MSE scores? 
(3) can the inferred values help in creating features that result in more reliable readability prediction models of BP texts for early school years (from 3rd to 6th grades)?" }, { "sent_id": "672d4299e60752e866293d72f97905-C001-22", "text": "Moreover, we assessed interrater reliability (Cronbach's alpha) between ratings generated by our method and the imageability and concreteness produced for 237 nouns by [9] ." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-23", "text": "Besides that, we analyzed the relations between the inferred ratings and other psycholinguistic variables." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-24", "text": "----------------------------------" }, { "sent_id": "672d4299e60752e866293d72f97905-C001-25", "text": "**RELATED WORKS**" }, { "sent_id": "672d4299e60752e866293d72f97905-C001-26", "text": "To the best of our knowledge there are only two studies that propose regression methods to automatically estimate missing psycholinguistic properties in the MRC Database [4, 12] ." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-27", "text": "In order to solve limitations resulting from using word databases with human ratings, [4] proposes a computational model to predict word concreteness, by using linear regression with word attributes from WordNet [3] , Latent Semantic Analysis (LSA) and the CELEX Database 6 and use these attributes to simulate human ratings in the MRC database." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-28", "text": "Word concreteness is among the most important indices provided by CohMetrix, as comprehension is facilitated by virtue of more concrete words." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-29", "text": "The lexical features used were 19 lexical types from WordNet, 17 LSA dimensions, hypernymy information from WordNet, word frequencies from the CELEX Database, and word length (i.e., number of letters), totalling 39 attributes." 
}, { "sent_id": "672d4299e60752e866293d72f97905-C001-30", "text": "The Pearson correlation between the estimated concreteness score and the concreteness score in the test set was 0.82." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-31", "text": "[12] automatically estimate missing psycholinguistic properties in the MRC Database through a bootstrapping algorithm for regression." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-32", "text": "Their method exploits word embedding models and 15 lexical features, including the number of senses, synonyms, hyper-nyms and hyponyms for word in WordNet and also minimum, maximum and average distance between the word's senses in WordNet and the thesaurus' root sense." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-33", "text": "The Pearson correlation between the estimated score and the inferred score for familiarity was 0.846; 0.862 for AoA; 0.823 for imagenery and 0.869 for concretness, which is better than the results of [4] ." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-34", "text": "----------------------------------" }, { "sent_id": "672d4299e60752e866293d72f97905-C001-35", "text": "**A LIGHTWEIGHT REGRESSION METHOD TO INFER PSYCHOLINGUISTIC PROPERTIES OF WORDS**" }, { "sent_id": "672d4299e60752e866293d72f97905-C001-36", "text": "The fact that the methods developed by [4] and [12] are based on a large, scarce lexical resources as WordNet, led us to raise the question \"Could we have a similar performance with a simpler set of features which are easily obtainable for most languages?\"." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-37", "text": "Therefore we decided to build our regressors using only word length, frequency lists, lexical databases composed of school dictionaries and word embeddings models." 
}, { "sent_id": "672d4299e60752e866293d72f97905-C001-38", "text": "One critical difference between the strategy of [12] and ours is that they concatenate all features to train a regressor, while we take a different approach." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-39", "text": "Although simply combining all features is straightforward, it can lead to noise insertion, given that the features used greatly contrast among them (e.g. word embeddings and word length)." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-40", "text": "Instead, we adopted a more elegant solution, called Multi-View Learning [19] ." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-41", "text": "In a Multi-View Learning, multiple regressors/classifiers are trained over different feature spaces and then combined to produce a single result." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-42", "text": "Here, the fusion stage is made by averaging the values predicted by the regressors [19] ." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-43", "text": "----------------------------------" }, { "sent_id": "672d4299e60752e866293d72f97905-C001-44", "text": "**ADAPTATION OF DATABASES WITH PSYCHOLOGICAL NORMS FOR PORTUGUESE WORDS**" }, { "sent_id": "672d4299e60752e866293d72f97905-C001-45", "text": "We present in Table 1 surveys involving the subjective psycholinguistic properties of words focused in this study (concreteness, age of acquisition, imageability and subjective frequency), both for European Portuguese (EP) and BP." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-46", "text": "If, on the one hand, manually produced resources fulfill the needs for which they were collected, on the other hand, they are very limited for Natural Language Processing purposes, given their limited size." 
}, { "sent_id": "672d4299e60752e866293d72f97905-C001-47", "text": "There is place, however, to automatically infer subjective psycholinguistic properties for several words, using the existing ones." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-48", "text": "To achieve this goal, however, we need, first of all, to rely on a set of words with values for each aimed property." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-49", "text": "In BP, we only have 909 words with concreteness values [7] ." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-50", "text": "Therefore, we decided to incorporate EP resources to our set, as well as to combine different resources containing values for the same property." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-51", "text": "In order to turn EP resources usable for our study in BP, we executed adjustments in the word lists." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-52", "text": "Most of them were in orthography, as for example: ac\u00e7\u00e3o/ac\u00e3o (action), adop\u00e7\u00e3o/ado\u00e7\u00e3o (adoption), amnistia/anistia (amnesty)." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-53", "text": "Other adjustments pertained to concepts that the two variants of Portuguese lexicalize in different ways, such as: ficheiro/arquivo (file), assass\u00ednio /assassinato (murder), apuramento/apura\u00e7\u00e3o (calculation)." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-54", "text": "Finally, some words have been discarded, as they lexicalize concepts related to fauna, flora, and culinary traits native to Portugal, such as: faneca (pout), faia (beech) and rebu\u00e7ado (candy)." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-55", "text": "In theory, there should not be a problem in concatenating two or more lists of words with the same psychological property." 
}, { "sent_id": "672d4299e60752e866293d72f97905-C001-56", "text": "We did this for concreteness, merging the list of [15] , once adapted to BP, with the one of [7] , which was created for BP." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-57", "text": "As both lists rated concreteness using a Likert scale of 7 points, the values were comparable." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-58", "text": "However, regarding AoA, the two lists available, [2] and [8] , rated AoA using Likert scales of 7 and 9 points, respectively." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-59", "text": "It is worth mentioning that both lists obtained AoA ratings through estimates produced by adults (AoA could be, alternatively, gathered using the proficiency of children of different ages in object naming tasks)." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-60", "text": "Therefore, to make them comparable, we had to convert the scale of 9 points into a scale of 7 points." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-61", "text": "After concluding the lexical adjustments, converting the 9-point scale into a 7-point scale for the AoA lists, merging the lists and eliminating duplicated words, we obtained sizeable datasets for all word properties addressed in this study." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-62", "text": "Table 2 shows the number of entries obtained for each property, in parentheses." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-63", "text": "----------------------------------" }, { "sent_id": "672d4299e60752e866293d72f97905-C001-64", "text": "**FEATURES**" }, { "sent_id": "672d4299e60752e866293d72f97905-C001-65", "text": "Our regressors use 10 features from several sources, grouped into: (i) lexical (1-8); (ii) Skip-Gram word embeddings (9) [10] ; and (iii) GloVe word embeddings (10) [13]:" }, { "sent_id": "672d4299e60752e866293d72f97905-C001-66", "text": "1. 
Log of Frequency in SUBTLEX-pt-BR [16] , which is a database of BP word frequencies based on more than 50 million words from film and television subtitles; 2. Log of Contextual diversity in SUBTLEX-pt-BR, which is the number of subtitles that contain the word; 3. Log of Frequency in SubIMDb-PT [11] : this corpus was extracted from subtitles of family, comedy and children's movies and series; 4. Log of Frequency in the Written Language part of Corpus Brasileiro, a corpus with about 1 billion words of Contemporary BP; 5. Log of Frequency in the Spoken Language part of Corpus Brasileiro; 6." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-67", "text": "Log of Frequency in a corpus of 1.4 billion tokens of Mixed Text Genres in BP; 7." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-68", "text": "Word Length; 8." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-69", "text": "Lexical databases from 6 school dictionaries for specific grade-levels; 9." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-70", "text": "The word's raw values from word embedding models created using the Skip-Gram algorithm [10] , with word vector sizes of 300, 600 and 1,000; 10." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-71", "text": "The word's raw values from word embedding models created using the GloVe algorithm [13] , with word vector sizes of 300, 600 and 1,000;" }, { "sent_id": "672d4299e60752e866293d72f97905-C001-72", "text": "Reading time studies provide evidence that more processing time is allocated to rare words than to high-frequency words." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-73", "text": "Moreover, we used the logarithm of word frequency because reading times are linearly related to the logarithm of word frequency, not to raw word frequencies [1] ." 
}, { "sent_id": "672d4299e60752e866293d72f97905-C001-74", "text": "We trained our embedding models using Skip-Gram word2vec and GloVe over a corpus of 1.4 billion tokens and 3,827,725 types, composed of mixed text genres, including subtitles to cover spoken language in addition to written texts." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-75", "text": "----------------------------------" }, { "sent_id": "672d4299e60752e866293d72f97905-C001-76", "text": "**USING REGRESSION IN A MULTI-VIEW LEARNING APPROACH**" }, { "sent_id": "672d4299e60752e866293d72f97905-C001-77", "text": "We used a linear least squares regressor with L2 regularization, which is also known as Ridge Regression or Tikhonov regularization [6] ." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-78", "text": "We chose this regression method due to the promising results reported by [12] ." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-79", "text": "We trained three regressors in different feature spaces: lexical features, Skip-Gram embeddings, and GloVe embeddings." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-80", "text": "----------------------------------" }, { "sent_id": "672d4299e60752e866293d72f97905-C001-81", "text": "**EVALUATION**" }, { "sent_id": "672d4299e60752e866293d72f97905-C001-82", "text": "We experimented with several dimensions of word embeddings, but for space reasons, here we include only the best results: Skip-Gram and GloVe embeddings with 300 word vector dimensions." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-83", "text": "We used 20x5-fold cross-validation in our experiments." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-84", "text": "As evaluation metrics, we used Mean Square Error (MSE), Spearman's (\u03c1), and Pearson's (r) correlation." 
}, { "sent_id": "672d4299e60752e866293d72f97905-C001-85", "text": "For the MSE metric, a repeated-measures ANOVA with a Dunnett post-test was used to compare the best regressors with the others at a significance level of 0.05." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-86", "text": "Table 2 shows the evaluation results of our method." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-87", "text": "For subjective frequency, the best result was given by the combination of Lexical, Skip-gram and GloVe embeddings." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-88", "text": "For AoA, the best result was given by the combination of Lexical and GloVe embeddings." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-89", "text": "For AoA, the three best lexical features responsible for these results were grade-level lexical databases, the log frequency in SubIMDb-PT and word length." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-90", "text": "The regressors in bold presented statistically significant differences when compared with the others." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-91", "text": "However, for the AoA property, the Lexical + GloVe regressor did not differ significantly from the Lexical + Skip-gram + GloVe regressor." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-92", "text": "Table 2. MSE and Pearson and Spearman correlation scores of the regression models." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-93", "text": "Portuguese databases with AoA properties are small in size; therefore, we evaluated three different databases for this property." 
}, { "sent_id": "672d4299e60752e866293d72f97905-C001-94", "text": "The first has 765 words [8] , the second has 1717 words [2] , and the third is a merger of the first and the second converted into a 7-point scale; the merger resulted in a database with 2368 different words, which is still small compared to the other 3 properties evaluated here." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-95", "text": "The resulting correlations and MSE for AoA are shown in Table 3. Table 3. MSE, Pearson, and Spearman correlations of the regression models." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-96", "text": "We also compared the four properties' interdependency using Pearson correlation." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-97", "text": "Table 4 presents the results obtained for our comparisons, as well as the results obtained on similar comparisons from related contributions." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-98", "text": "In Table 4 , dashes represent evaluations which were not performed in a given study." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-99", "text": "Our results are close to those reported in the literature, except for the correlation between age of acquisition and concreteness, which is stronger in other studies." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-100", "text": "This may be related to the fact that a full dictionary has a larger proportion of rare words than the lists used in these studies." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-101", "text": "In order to validate the reliability of our automatically inferred psycholinguistic properties, we conducted internal consistency analyses." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-102", "text": "We calculated alpha scores between our automatically produced imageability and concreteness properties and those present in the psycholinguistic dataset of [9] ." 
}, { "sent_id": "672d4299e60752e866293d72f97905-C001-103", "text": "In total, 237 words were considered." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-104", "text": "The alpha scores for imageability and concreteness are 0.921 and 0.820, which are similar to the values achieved by [15] , and suggest that our features do, in fact, accurately capture the psycholinguistic properties being targeted." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-105", "text": "We built a database of plain words, populated with the inferred values for the four psycholinguistic properties addressed in this study." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-106", "text": "For this, we made use of the Minidicion\u00e1rio Caldas Aulete entries [5] and their respective first grammatical category." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-107", "text": "Next, we selected only nouns, verbs, adjectives and adverbs." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-108", "text": "Finally, we eliminated all loanwords (foreign words from different origins used in BP)." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-109", "text": "Then we searched for the frequency of each word in the large corpus of 1.4 billion words used to train our word embedding models, and after manual analysis, we decided to disregard words with fewer than 8 occurrences, as they are very uncommon." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-110", "text": "The final lexicon is available 7 and contains 26,874 words (15,204 nouns, 4,305 verbs, 7,293 adjectives and 72 adverbs), annotated with the four inferred psycholinguistic properties using the best results with the fewest features (shown in bold in Table 2 )." 
}, { "sent_id": "672d4299e60752e866293d72f97905-C001-111", "text": "----------------------------------" }, { "sent_id": "672d4299e60752e866293d72f97905-C001-112", "text": "**EVALUATING PSYCHOLINGUISTIC FEATURES IN READABILITY PREDICTION**" }, { "sent_id": "672d4299e60752e866293d72f97905-C001-113", "text": "In order to evaluate the use of psycholinguistic properties to predict the readability level of BP informative texts from newspapers and magazines for early school years, we trained a classifier using a corpus of 1,413 texts which were classified as easy to read for 3rd to 6th graders." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-114", "text": "These texts were annotated by 2 linguists (0.914 weighted kappa) and the corpus distribution by grade level is 183 texts for 3rd grade, 361 texts for 4th grade, 537 texts for 5th grade and 332 texts for 6th grade." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-115", "text": "We are still in the annotation process to have a dataset similar in size to the work by [17] ." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-116", "text": "We compared the use of our four psycholinguistic features and six traditional readability formulas: Flesch Reading Ease adapted to BP, Brun\u00e9t, Honor\u00e9, Dale-Chall, Gunning Fog and Moving Average Type-Token Ratio (MATTR)." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-117", "text": "We used the mean and the standard deviation of our psycholinguistic properties as features to train SVM classifiers with an RBF kernel for each psycholinguistic property, and a single-view classifier for all four properties." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-118", "text": "All results presented here were obtained by a 10-fold cross-validation process." 
}, { "sent_id": "672d4299e60752e866293d72f97905-C001-119", "text": "Table 5 shows that subjective frequency provided better results than the other psycholinguistic features in classifying grade levels and achieved the 3rd best result among individual features." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-120", "text": "The single-view classifier of psycholinguistic features outperformed all traditional formulas except MATTR and the Brun\u00e9t Index for grade-level classification in F1-measure." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-121", "text": "Both MATTR and the Brun\u00e9t Index measure the lexical diversity of a text and are independent of text length." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-122", "text": "Their high performance in this evaluation suggests that lexical diversity is a strong proxy for distinguishing grade levels in primary school years." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-123", "text": "Table 5." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-124", "text": "Evaluating Psycholinguistic and Classic readability formulas for readability prediction." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-125", "text": "7 https://www.dropbox.com/s/evieeic673gomvn/PB.zip?dl=0." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-126", "text": "----------------------------------" }, { "sent_id": "672d4299e60752e866293d72f97905-C001-127", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "672d4299e60752e866293d72f97905-C001-128", "text": "In this work, we aimed at finding a light set of features available for most languages to build regressors that infer psycholinguistic properties for BP words." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-129", "text": "We have made publicly available a large database of 26,874 BP words annotated with psycholinguistic properties." 
}, { "sent_id": "672d4299e60752e866293d72f97905-C001-130", "text": "With respect to our research questions (1) and (2), we have shown we can infer psycholinguistic properties for BP using word embeddings as features." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-131", "text": "Nonetheless, our regressors need a reasonable number of training instances (more than two thousand examples), as well as complementary lexical resources, to yield top performance for AoA and subjective frequency." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-132", "text": "As for research question (3), the results show that psycholinguistic properties have the potential to help readability prediction." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-133", "text": "These results corroborate the claims of [14] , which state that (i) words with higher concreteness are easier to imagine, comprehend, and memorize and therefore help the readability of texts, and (ii) age of acquisition has been shown to be helpful in predicting reading difficulty." }, { "sent_id": "672d4299e60752e866293d72f97905-C001-134", "text": "As future work, we propose to extend the evaluation to other tasks, to explore new ways of modelling our psycholinguistic features (beyond the average and standard deviation of the inferred values) and to use a more robust approach to perform the fusion of regressors, e.g. stacking regression." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "672d4299e60752e866293d72f97905-C001-14" ], [ "672d4299e60752e866293d72f97905-C001-26" ], [ "672d4299e60752e866293d72f97905-C001-31", "672d4299e60752e866293d72f97905-C001-32", "672d4299e60752e866293d72f97905-C001-33" ], [ "672d4299e60752e866293d72f97905-C001-36", "672d4299e60752e866293d72f97905-C001-37" ] ], "cite_sentences": [ "672d4299e60752e866293d72f97905-C001-14", "672d4299e60752e866293d72f97905-C001-26", "672d4299e60752e866293d72f97905-C001-31", "672d4299e60752e866293d72f97905-C001-36" ] }, "@SIM@": { "gold_contexts": [ [ "672d4299e60752e866293d72f97905-C001-19" ] ], "cite_sentences": [ "672d4299e60752e866293d72f97905-C001-19" ] }, "@MOT@": { "gold_contexts": [ [ "672d4299e60752e866293d72f97905-C001-36", "672d4299e60752e866293d72f97905-C001-37" ], [ "672d4299e60752e866293d72f97905-C001-38", "672d4299e60752e866293d72f97905-C001-39", "672d4299e60752e866293d72f97905-C001-40" ] ], "cite_sentences": [ "672d4299e60752e866293d72f97905-C001-36", "672d4299e60752e866293d72f97905-C001-38" ] }, "@DIF@": { "gold_contexts": [ [ "672d4299e60752e866293d72f97905-C001-38", "672d4299e60752e866293d72f97905-C001-39", "672d4299e60752e866293d72f97905-C001-40" ] ], "cite_sentences": [ "672d4299e60752e866293d72f97905-C001-38" ] }, "@USE@": { "gold_contexts": [ [ "672d4299e60752e866293d72f97905-C001-77", "672d4299e60752e866293d72f97905-C001-78" ] ], "cite_sentences": [ "672d4299e60752e866293d72f97905-C001-78" ] } } }, "ABC_6da7dcbcb7f52f31ec23c8131d438d_32": { "x": [ { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-2", "text": "Based on massive amounts of data, recent pretrained contextual representation models have made significant strides in advancing a number of different English NLP tasks." 
}, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-3", "text": "However, for other languages, relevant training data may be lacking, while state-of-the-art deep learning methods are known to be data-hungry." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-4", "text": "In this paper, we present a simple and robust self-learning framework to include unlabeled non-English samples in the fine-tuning process of pretrained multilingual representation models." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-5", "text": "We leverage a multilingual model's own predictions on unlabeled non-English data in order to obtain additional information that can be used during further fine-tuning." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-6", "text": "Compared with original multilingual models and other cross-lingual classification models, we observe significant gains in effectiveness on document and sentiment classification for a range of diverse languages." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-7", "text": "----------------------------------" }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-9", "text": "Owing to notable advances in deep learning and representation learning, important progress has been achieved on text classification, reading comprehension, and other NLP tasks." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-10", "text": "Recently, pretrained language representations with self-supervised objectives (Peters et al., 2018; Devlin et al., 2018; Radford et al., 2018) have further pushed forward the state-of-the-art on many English tasks." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-11", "text": "While these sorts of deep models can be trained on different languages, they typically require substantial amounts of labeled data for the specific target domain." 
}, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-12", "text": "Unfortunately, the cost of acquiring new custom-built resources for each combination of language and domain is very high, as it typically requires human annotation." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-13", "text": "Available resources for domain-specific tasks are often imbalanced between different languages." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-14", "text": "The scarcity of non-English annotated corpora may preclude our ability to train language-specific machine learning models." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-15", "text": "In contrast, English-language annotations are often readily available to train deep models." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-16", "text": "Although translation can be an option, human translation is very costly and for many language pairs, any available domain-specific parallel corpora are too small to train high-quality machine translation systems." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-17", "text": "Cross-lingual systems rely on training data from one language to train a model that can be applied to other languages (de Melo and Siersdorfer, 2007) , alleviating the training bottleneck issues for low-resource languages." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-18", "text": "This is facilitated by recent advances in learning joint multilingual representations (Lample and Conneau, 2019; Artetxe and Schwenk, 2018; Devlin et al., 2018) ." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-19", "text": "In our work, we propose a self-learning framework to incorporate the predictions of the multilingual BERT model (Devlin et al., 2018) on non-English data into an English training procedure." 
}, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-20", "text": "The initial multilingual BERT model was simultaneously pretrained on 104 languages, and has been shown to perform well for cross-lingual transfer of natural language tasks (Wu and Dredze, 2019) ." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-21", "text": "Our model begins by learning just from available English samples, but then makes predictions on unlabeled non-English samples, and a subset of those samples with high-confidence prediction scores is repurposed to serve as labeled examples for a next iteration of fine-tuning until the model converges." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-22", "text": "Based on this multilingual self-learning technique, we demonstrate the superiority of our framework on Multilingual Document Classification (MLDoc) (Schwenk and Li, 2018) in comparison with several strong baselines." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-23", "text": "Our study then proceeds to show that our method is better on Chinese sentiment classification than other cross-lingual methods that also consider unlabeled non-English data." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-24", "text": "This shows that our method is more effective at cross-lingual transfer for domain-specific tasks, using a mix of labeled and unlabeled data via a multilingual BERT sentence model." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-25", "text": "----------------------------------" }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-26", "text": "**METHOD**" }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-27", "text": "Our proposed framework consists of three parts as shown in Figure 1 ." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-28", "text": "The first part is the pretrained multilingual encoder, denoted as f n (\u00b7; \u03b8 n )." 
}, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-29", "text": "The encoder is assumed to have been pretrained across different languages with appropriate strategies, such as WordPiece, Byte Pair Encoding modeling, and Cross-lingual Word Alignment, to allow the model to share representations across languages to a certain degree." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-30", "text": "Hence, we obtain a universal sentence representation h \u2208 R d from this encoder, where d is the dimensionality of the sentence representation." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-31", "text": "Subsequently, a task-specific classification module f cl (\u00b7; \u03b8 cl ) is applied for finetuning on top of the pretrained model f n (\u00b7; \u03b8 n )." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-32", "text": "This module consists of a linear function mapping h \u2208 R d into R |Y| and a softmax function, where Y is the set of target classes." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-33", "text": "For the overall process, we first train the whole network f (\u00b7; \u03b8) in K epochs using a set of the labeled data L = {(x i , y i ) | i = 1, ..., n}, where n is the number of labeled instances, x i \u2208 X are instances, and y i \u2208 Y are the corresponding ground truth labels." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-34", "text": "The next step is to make predictions for the unlabeled instances in U = {x u | u = 1, ..., m}. We assume that f (\u00b7; \u03b8) yields a class label as well as a confidence score." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-35", "text": "To better take advantage of the pretrained multilingual model, a selection mechanism is invoked to repurpose unlabeled data with high confidence scores for incorporation into the training data." 
}, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-36", "text": "There are several variations of such selection mechanisms (Abney, 2007) , and we rely on a balancing approach that considers the same number of instances for each class." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-37", "text": "Thus, we select a subset {x s | s = 1, ..., K t } of the unlabeled data for each class, containing the top K t highest confidence items based on the current trained model." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-38", "text": "The union set U s of selected items is merged into the training set L and then we retrain the model and repeat this process iteratively." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-39", "text": "The detailed process is described in Algorithm 1, where for each class y \u2208 Y, D y denotes the set of tuples pairing unlabeled data with corresponding confidence scores c." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-40", "text": "Algorithm 1: Self-learning on cross-lingual tasks." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-41", "text": "Repeat: fine-tune f (\u00b7; \u03b8) for K epochs using L; select the top-K t highest-confidence unlabeled items per class to form U s ;" }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-42", "text": "update L \u2190 L \u222a U s ; until the stopping criterion is true." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-43", "text": "----------------------------------" }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-44", "text": "**EXPERIMENTS**" }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-45", "text": "We evaluate our self-learning framework on two cross-lingual document and sentiment classification tasks to show the effectiveness of self-learning for multilingual BERT-based cross-lingual transfer." 
}, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-46", "text": "----------------------------------" }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-47", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-48", "text": "Datasets." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-49", "text": "For evaluation, we first rely on MLDoc (Schwenk and Li, 2018) , a balanced subset of the Reuters corpus covering 8 languages for document classification, with 1,000 training and validation documents and 4,000 test documents for each language." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-50", "text": "The 4-way topic classification scheme consists of CCAT (Corporate/Industrial), ECAT (Economics), GCAT (Government/Social), and MCAT (Markets) as targets." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-51", "text": "For cross-lingual classification, 1,000 target language training documents are used as unlabeled data for self-learning." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-52", "text": "We further evaluate our method on cross-lingual sentiment classification from English to Chinese." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-53", "text": "For English, we use a balanced dataset of 700k Yelp reviews from Zhang et al. (2015) with their ratings as labels (scale 1-5) and adopting their training-validation split: 650k reviews for training and 50k as a validation set." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-54", "text": "For Chinese, we use the same dataset configuration as Chen et al. (2018) , consisting of 150k unlabeled Chinese hotel reviews and 10k balanced Chinese hotel reviews as a validation set." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-55", "text": "The results are reported on a separate test set of another 10k hotel reviews." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-56", "text": "The data are also annotated with 5 labels (1-5)." 
}, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-57", "text": "Both classification tasks are evaluated in terms of classification accuracy (ACC)." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-58", "text": "Model Details." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-59", "text": "We tune the hyper-parameters for our neural network architecture based on each non-English validation set." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-60", "text": "For the encoder, we invoke the multilingual BERT model (Devlin et al., 2018) , which supports 104 languages 1 ." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-61", "text": "It relies on a shared 110k WordPiece vocabulary across all languages and yields sentence representations in a common multilingual space." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-62", "text": "Most model hyperparameters are the same as in pretraining, with the exception of the batch size, max." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-63", "text": "sequence length, and number of training epochs." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-64", "text": "The batch size, max." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-65", "text": "sequence length and number of training epochs used for the MLDoc task are 128, 32, and 4, respectively, while they are 128, 96, and 3 for sentiment classification." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-66", "text": "Another hyper-parameter involved in self-learning is K t , which is 40 for MLDoc and 100 for sentiment classification." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-67", "text": "We rely on early stopping as a termination criterion." 
}, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-68", "text": "----------------------------------" }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-69", "text": "**RESULTS AND ANALYSIS**" }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-70", "text": "Cross-lingual Document Classification." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-71", "text": "Three recent strong baselines are included in our MLDoc experiments." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-72", "text": "Schwenk and Li (2018) use Multi-CCA, multilingual word embeddings trained with a bilingual dictionary, and convolutional neural networks." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-73", "text": "Artetxe and Schwenk (2018) pretrain a multilingual sentence representation with a massively multilingual sequence-to-sequence NMT model, whose encoder is used for fine-tuning on downstream tasks." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-74", "text": "We also considered multilingual BERT without self-learning as one of our baselines." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-75", "text": "As shown in Table 1 , our framework significantly outperforms all baselines in 7 languages on cross-lingual document classification, including for phylogenetically unrelated languages." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-76", "text": "In Table 3 , we also show the percentages of correctly labeled instances added into the training set by our method on the MLDoc data." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-77", "text": "The high percentages of correctly labeled incorporated instances in 7 languages further show the effectiveness of self-learning in our framework." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-78", "text": "Cross-lingual Sentiment Classification." 
}, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-79", "text": "To evaluate the robustness of our framework on crosslingual sentiment classification, we consider several diverse baselines as listed in Table 2 ." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-80", "text": "mSDA (Chen et al., 2012) is a very effective method for cross-domain sentiment classification on Amazon reviews, which can also be used in cross-lingual tasks, but it has the worst performance." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-81", "text": "Deep Averaging Networks (DANs) by Iyyer et al. (2015) consider an arithmetic mean of word vectors as a sentence representation and pass it to a classification module, while Chen et al. (2018) translate the Chinese test text into English as a machine translation baseline." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-82", "text": "The third category of baselines includes Xu and Yang (2017) , who propose a crosslingual distillation (CLD) method that makes use of soft source predictions on a parallel corpus to train a target model (CLD-KCNN)." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-83", "text": "They further propose an improved variant (CLDFA-KCNN) that utilizes adversarial training for domain adaptation within a single language." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-84", "text": "Adversarial DAN (ADAN) by Chen et al. (2018) is another state of the art baseline that improves cross-lingual generalization by means of adversarial training." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-85", "text": "We also run experiments on multilingual BERT and observe that it does not outperform CLD-based CLTC and ADAN, while our approach achieves the new state-of-the-art result, indicating that our self-learning method for cross-lingual transfer can be more effective than a diverse range of other approaches." 
}, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-86", "text": "In addition, we evaluate the proximity between incorrect prediction and the corresponding correct label in our sentiment task by means of mean squared error." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-87", "text": "The error of our method is 1.37, while for regular multilingual BERT it is 1.42, which also shows the superiority of our method." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-88", "text": "Comparison of Selection Mechanisms." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-89", "text": "We ran experiments on two selection mechanisms." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-90", "text": "One is balancing, as described in Section 2." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-91", "text": "An alternative is throttling by selecting the top n unlabeled examples without considering specific classes (Abney, 2007) ." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-92", "text": "Our experiments on ML-Doc show that the results suffer from a rapid decline during self-learning with throttling." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-93", "text": "This is because selecting from all samples leads to an imbalance between different classes and due to repeated error amplification this means that samples are increasingly likely to be assigned to the majority class in each self-learning iteration." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-94", "text": "----------------------------------" }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-95", "text": "**RELATED WORK**" }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-96", "text": "Semi-supervised Learning." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-97", "text": "There is a long history of research on semi-supervised Learning to exploit unlabeled data." 
}, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-98", "text": "Self-learning (also known as self-training) was successfully applied to NLP tasks in early work such as on word sense disambiguation (Yarowsky, 1995) and parsing (Mc-Closky et al., 2006) ." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-99", "text": "In recent work, show that self-learning can iteratively improve unsupervised cross-lingual word embeddings." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-100", "text": "presents Cross-View Training, a new self-training algorithm that works well for neural sequence modeling." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-101", "text": "Other semisupervised methods, such as co-training (Blum and Mitchell, 1998) and tri-training (Zhou and Li, 2005) , have as well been used for sentiment analysis." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-102", "text": "Ruder and Plank (2018) propose a novel multitask tri-training method that reduces the time and space complexity of classic tri-training for sentiment analysis." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-103", "text": "For cross-lingual sentiment analysis, Wan (2009) uses machine translation to directly convert English training data to Chinese, which provides two views for co-training." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-104", "text": "Xu and Yang (2017) propose to use soft probabilistic predictions for the documents in a label-rich language as the (induced) supervisory labels in a parallel corpus of documents, while there is no need to use parallel corpora in our work." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-105", "text": "Chen et al. (2018) propose an Adversarial Deep Averaging Network to learn invariance across languages, which is another baseline considered in our experiments." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-106", "text": "Cross-lingual Representation Learning." 
}, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-107", "text": "With models such as ELMo (Peters et al., 2018) , GPT-2 (Radford et al., 2018) , and BERT (Devlin et al., 2018) , important progress has been made in learning improved sentence representations with context-specific encodings of words via a language modeling objective." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-108", "text": "The latter two approaches both rely on Transformer encoders, but BERT is trained using masked language modeling instead of right-to-left or left-to-right language modeling." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-109", "text": "Additionally, BERT also optimizes a next sentence classification objective." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-110", "text": "Recent work has also investigated cross-lingual extensions." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-111", "text": "Devlin et al. (2018) themselves published a multilingual version of BERT, following the same model architecture and training procedure, except that the union of 104 different language editions of Wikipedia serves as the training input." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-112", "text": "Lample and Conneau (2019) incorporate parallel text into BERT's architecture by training on a new supervised learning objective." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-113", "text": "Artetxe and Schwenk (2018) also show that the encoder from a pretrained sequence-to-sequence model can be used to produce cross-lingual sentence embeddings." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-114", "text": "All these methods are compatible with our self-learning framework, since they provide a shared sentence meaning representation across languages as needed by our approach." 
}, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-115", "text": "----------------------------------" }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-116", "text": "**CONCLUSION**" }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-117", "text": "In this work, we propose a self-learning framework for cross-lingual text classification." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-118", "text": "Based on the cross-lingual prediction ability of pretrained multilingual model, this elegantly simple framework makes the most of unlabeled text to improve cross-lingual transfer for text classification." }, { "sent_id": "6da7dcbcb7f52f31ec23c8131d438d-C001-119", "text": "We achieve new state-of-the-art results on cross-lingual document and sentiment classification and demonstrate that self-learning is an effective method for improved classification accuracy without target language training data." } ], "y": { "@BACK@": { "gold_contexts": [ [ "6da7dcbcb7f52f31ec23c8131d438d-C001-10" ], [ "6da7dcbcb7f52f31ec23c8131d438d-C001-17", "6da7dcbcb7f52f31ec23c8131d438d-C001-18" ], [ "6da7dcbcb7f52f31ec23c8131d438d-C001-107", "6da7dcbcb7f52f31ec23c8131d438d-C001-108", "6da7dcbcb7f52f31ec23c8131d438d-C001-109" ], [ "6da7dcbcb7f52f31ec23c8131d438d-C001-111" ] ], "cite_sentences": [ "6da7dcbcb7f52f31ec23c8131d438d-C001-10", "6da7dcbcb7f52f31ec23c8131d438d-C001-18", "6da7dcbcb7f52f31ec23c8131d438d-C001-107" ] }, "@EXT@": { "gold_contexts": [ [ "6da7dcbcb7f52f31ec23c8131d438d-C001-19" ] ], "cite_sentences": [ "6da7dcbcb7f52f31ec23c8131d438d-C001-19" ] }, "@USE@": { "gold_contexts": [ [ "6da7dcbcb7f52f31ec23c8131d438d-C001-60" ] ], "cite_sentences": [ "6da7dcbcb7f52f31ec23c8131d438d-C001-60" ] } } }, "ABC_43d670e583caab9b38ddce999b8872_32": { "x": [ { "sent_id": "43d670e583caab9b38ddce999b8872-C001-37", "text": "This list includes the actual next response of the conversation, which is the desired prediction of the model." 
}, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-38", "text": "The other entries, which act as false positives, are sampled from elsewhere in the corpus." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-2", "text": "An open challenge in constructing dialogue systems is developing methods for automatically learning dialogue strategies from large amounts of unlabelled data." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-3", "text": "Recent work has proposed Next-Utterance-Classification (NUC) as a surrogate task for building dialogue systems from text data." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-4", "text": "In this paper we investigate the performance of humans on this task to validate the relevance of NUC as a method of evaluation." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-5", "text": "Our results show three main findings: (1) humans are able to correctly classify responses at a rate much better than chance, thus confirming that the task is feasible, (2) human performance levels vary across task domains (we consider 3 datasets) and expertise levels (novice vs experts), thus showing that a range of performance is possible on this type of task, (3) automated dialogue systems built using state-of-the-art machine learning methods have similar performance to the human novices, but worse than the experts, thus confirming the utility of this class of tasks for driving further research in automated dialogue systems." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-6", "text": "----------------------------------" }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-8", "text": "Significant efforts have been made in recent years to develop computational methods for learning dialogue strategies offline from large amounts of text data." 
}, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-9", "text": "One of the challenges of this line of work is to develop methods to automatically evaluate, either directly or indirectly, models that are trained in this manner Schatzmann et al., 2005) , without requiring human labels or human user experiments, which are time consuming and expensive." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-10", "text": "The use of automatic tasks and metrics is one key issue in scaling the development of dialogue systems from small domainspecific systems, which require significant engineering, to general conversational agents (Pietquin and Hastie, 2013) ." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-11", "text": "In this paper, we consider tasks and evaluation measures for what we call 'unsupervised' dialogue systems, such as chatbots." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-12", "text": "These are in contrast to 'supervised' dialogue systems, which we define as those that explicitly incorporate some supervised signal such as task completion or user satisfaction 1 ." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-13", "text": "Unsupervised systems can be roughly separated into response generation systems that attempt to produce a likely response given a conversational context, and retrieval-based systems that attempt to select a response from a (possibly large) list of utterances in a corpus." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-14", "text": "While there has been significant work on building end-to-end response generation systems (Vinyals and Le, 2015; Shang et al., 2015; , it has recently been shown that many of the automatic evaluation metrics used for such systems correlate poorly or not at all with human judgement of the generated responses (Liu et al., 2016) ." 
}, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-15", "text": "Retrieval-based systems are of interest because they admit a natural evaluation metric, namely the recall and precision measures." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-16", "text": "First introduced for evaluating user simulations by Schatzmann et al. (2005) , such a framework has gained recent prominence for the evaluation of end-to-end dialogue systems (Lowe et al., 2015a; Kadlec et al., 2015; Dodge et al., 2015) ." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-17", "text": "These models are trained on the task of selecting the correct response from a candidate list, which we call Next-Utterance-Classification (NUC, detailed in Section 2), and are evaluated using the metric of recall." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-18", "text": "NUC is useful for several reasons: 1) the performance (i.e. loss or error) is easy to com-pute automatically, 2) it is simple to adjust the difficulty of the task, 3) the task is interpretable and amenable to comparison with human performance, 4) it is an easier task compared to generative dialogue modeling, which is difficult for endto-end systems , and 5) models trained with NUC can be converted to dialogue systems by retrieving from the full corpus (Liu et al., 2016) ." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-19", "text": "In this case, NUC additionally allows for making hard constraints on the allowable outputs of the system (to prevent offensive responses), and guarantees that the responses are fluent (because they were generated by humans)." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-20", "text": "Thus, NUC can be thought of both as an intermediate task that can be used to evaluate the ability of systems to understand natural language conversations, similar to the bAbI tasks for language understanding , and as a useful framework for building chatbots." 
}, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-21", "text": "With the huge size of current dialogue datasets that contain millions of utterances (Lowe et al., 2015a; Banchs, 2012; Ritter et al., 2010) and the increasing amount of natural language data, it is conceivable that retrieval-based systems will be able to have engaging conversations with humans." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-22", "text": "However, despite the current work with NUC, there has been no verification of whether machine and human performance differ on this task." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-23", "text": "This cannot be assumed; it is possible that no significant gap exists between the two, as is the case with many current automatic response generation metrics (Liu et al., 2016) ." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-24", "text": "Further, it is important to benchmark human performance on new tasks such as NUC to determine when research has outgrown their use." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-25", "text": "In this paper, we consider to what extent NUC is achievable by humans, whether human performance varies according to expertise, and whether there is room for machine performance to improve (or has reached human performance already) and we should move to more complex conversational tasks." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-26", "text": "We performed a user study on three different datasets: the SubTle Corpus of movie dialogues (Banchs, 2012) , the Twitter Corpus (Ritter et al., 2010) , and the Ubuntu Dialogue Corpus (Lowe et al., 2015a) ." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-27", "text": "Since conversations in the Ubuntu Dialogue Corpus are highly technical, we recruit 'expert' humans who are adept with the Ubuntu terminology, whom we compare with a state-of-the-art machine learning agent on all datasets." 
}, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-28", "text": "We find that there is indeed a significant separation between machine and expert hu- man performance, suggesting that NUC is a useful intermediate task for measuring progress." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-29", "text": "----------------------------------" }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-30", "text": "**TECHNICAL BACKGROUND ON NUC**" }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-31", "text": "Our long-term goal is the development and deployment of artificial conversational agents." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-32", "text": "Recent deep neural architectures offer perhaps the most promising framework for tackling this problem." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-33", "text": "However training such architectures typically requires large amounts of conversation data from the target domain, and a way to automatically assess prediction errors." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-34", "text": "Next-Utterance-Classification (NUC, see Figure 1 ) is a task, which is straightforward to evaluate, designed for training and validation of dialogue systems." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-35", "text": "They are evaluated using the metric of Recall@k, which we define in this section." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-36", "text": "In NUC, a model or user, when presented with the context of a conversation and a (usually small) pre-defined list of responses, must select the most appropriate response from this list." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-39", "text": "Note that no assumptions are made regarding the number of utterances in the context: these can be fixed or sampled from arbitrary distributions." 
}, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-40", "text": "Performance on this task is easy to assess by measuring the success rate of picking the correct next response; more specifically, we measure Recall@k (R@k), which is the percentage of correct responses (i.e. the actual response of the conversation) that are found in the top k responses with the highest rankings according to the model." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-41", "text": "This task has gained some popularity recently for evaluating dialogue systems (Lowe et al., 2015a; Kadlec et al., 2015) ." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-42", "text": "There are several attractive properties of this approach, as detailed in the introduction: the performance is easy to compute automatically, the task is interpretable and amenable to comparison with human performance, and it is easier than generative dialogue modeling." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-43", "text": "A particularly nice property is that one can adjust the difficulty of NUC by simply changing the number of false responses (from one response to the full corpus), or by altering the selection criteria of false responses (from randomly sampled to intentionally confusing)." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-44", "text": "Indeed, as the number of false responses grows to encompass all natural language responses, the task becomes identical to response generation." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-45", "text": "One potential limitation of the NUC approach is that, since the other candidate answers are sampled from elsewhere in the corpus, these may also represent reasonable responses given the context." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-46", "text": "Part of the contribution of this work is determining the significance of this limitation." 
}, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-47", "text": "----------------------------------" }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-48", "text": "**SURVEY METHODOLOGY**" }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-49", "text": "----------------------------------" }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-50", "text": "**CORPORA**" }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-51", "text": "We conducted our analysis on three corpora that have gained recent popularity for training dialogue systems." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-52", "text": "The SubTle Corpus (Banchs, 2012) consists of movie dialogues as extracted from subtitles, and includes turn-taking information indicating when each user has finished their turn." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-53", "text": "The Twitter Corpus (Ritter et al., 2010) contains a large number of conversations between users on the microblogging platform Twitter." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-54", "text": "Finally, the Ubuntu Dialogue Corpus contains conversations extracted from IRC chat logs." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-55", "text": "We focus our attention on these as they cover a range of popular domains, and are among the largest available dialogue datasets, making them good candidates for building data-driven dialogue systems." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-56", "text": "Note that while the Ubuntu Corpus is most relevant to supervised systems, the NUC task still applies in this domain." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-57", "text": "Models that take semantic information into account (ie. to solve the user's problem) can still be validated with NUC, as in (Lowe et al., 2015b) ." 
}, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-58", "text": "A group of 145 paid participants were recruited through Amazon Mechanical Turk (AMT), a crowdsourcing platform for obtaining human participants for various studies." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-59", "text": "Demographic statistics of the participants are shown in Table 1 ." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-60", "text": "An additional 8 volunteers were recruited from the student population in the computer science department at the author's institution." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-61", "text": "2 This second group, referred to as \"Lab experts\", had significant exposure to technical terms prominent in the Ubuntu dataset; we hypothesized that this was an advantage in selecting responses for that corpus." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-62", "text": "----------------------------------" }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-63", "text": "**TASK DESCRIPTION**" }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-64", "text": "Each participant was asked to answer either 30 or 40 questions (mean=31.9)." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-65", "text": "To ensure a sufficient diversity of questions from each dataset, four versions of the survey with different questions were given to participants." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-66", "text": "For AMT respondents, the questions were approximately evenly distributed across the three datasets, while for the lab experts, half of the questions were related to Ubuntu and the remainder evenly split across Twitter and movies." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-67", "text": "Each question had 1 correct response, and 4 false responses drawn uniformly at random from elsewhere in the (same) corpus." 
}, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-68", "text": "Participants had a time limit of 40 minutes." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-69", "text": "Conversations were extracted to form NUC conversation-response pairs as described in Sec. 2." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-70", "text": "The number of utterances in the context were sampled according to the procedure in (Lowe et al., 2015a) , with a maximum context length of 6 turns -this was done for both the human trials and ANN model." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-71", "text": "All conversations were preprocessed in order to anonymize the utterances." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-72", "text": "Table 2 : Average results on each corpus." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-73", "text": "'Number of Users' indicates the number of respondents for each category." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-74", "text": "'AMT experts' and 'AMT non-experts' are combined for the Movie and Twitter corpora." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-75", "text": "95% confidence intervals are calculated using the normal approximation, which assumes subjects answer each question independently of other examples and subjects." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-76", "text": "Starred (*) results indicate a poor approximation due to high scores with small sample size, according to the rule of thumb by Brown et al. (2001) ." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-77", "text": "For the Twitter conversations, this was extended to replacing all user mentions (words beginning with @) throughout the utterance with a placeholder '@user' symbol, as these are often repeated in a conversation." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-78", "text": "Conversations were edited or pruned to remove offensive language according to ethical guidelines." 
}, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-79", "text": "----------------------------------" }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-80", "text": "**RESULTS**" }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-81", "text": "As we can see from Table 1 , the AMT participants are mostly young adults, fluent in English with some undergraduate education." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-82", "text": "The split across genders is approximately equal, and the majority of respondents had never used Ubuntu before." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-83", "text": "Table 2 shows the NUC results on each corpus." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-84", "text": "The human results are separated into AMT nonexperts, consisting of paid respondents who have 'Beginner' or no knowledge of Ubuntu terminology; AMT experts, who claimed to have 'Intermediate' or 'Advanced' knowledge of Ubuntu; and Lab experts." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-85", "text": "We also presents results on the same task for a state-of-the-art artificial neural network (ANN) dialogue model (see (Lowe et al., 2015a) for implementation details)." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-86", "text": "We first observe that subjects perform above chance level (20% for R@1) on all domains, thus the task is doable for humans." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-87", "text": "Second we observe difference in performances between the three domains." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-88", "text": "The Twitter dataset appears to have the best predictability, with a Recall@1 approximately 8% points higher than for the movie dialogues for AMT workers, and 18% higher for lab experts." 
}, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-89", "text": "Rather than attributing this to greater familiarity with Twitter than movies, it seems more likely that it is because movie utterances are often short, generic (e.g. contain few topic-related words), and lack proper context (e.g., video cues and the movie's story)." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-90", "text": "Conversely, tweets are typically more specific, and successive tweets may have common hashtags." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-91", "text": "As expected, untrained respondents scored lowest on the Ubuntu dataset, as it contains the most difficult language with often unfamiliar terminology." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-92", "text": "Further, since the domain is narrow, randomly drawn false responses could be more likely to resemble the actual next response, especially to someone unfamiliar with Ubuntu terminology." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-93", "text": "We also observe that the ANN model achieves similar performance to the paid human respondents from AMT." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-94", "text": "However, the model is still significantly behind the lab experts for Recall@1." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-95", "text": "An interesting note is that there is very little difference between the paid AMT non-experts and AMT experts on Ubuntu." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-96", "text": "This suggests that the participants do not provide accurate self-rating of expertise, either intentionally or not." 
}, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-97", "text": "We also found that lab experts took on average approximately 50% more time to complete the survey than paid testers; this is reflected in the results, where the lab experts score 30% higher on the Ubuntu Corpus, and even 5-10% higher on the non-technical Movie and Twitter corpora." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-98", "text": "While we included attention check questions to ensure the quality of responses, this reflects poorly on the ability of crowdsourced workers to answer technical questions, even if they self-identify as being adept with the technology." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-99", "text": "----------------------------------" }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-100", "text": "**DISCUSSION**" }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-101", "text": "Our results demonstrate that humans outperform current dialogue models on the task of Next-Utterance-Classification, indicating that there is plenty of room for improvement for these models to better understand the nature of human dialogue." }, { "sent_id": "43d670e583caab9b38ddce999b8872-C001-102", "text": "While our results suggest that NUC is a useful task, it is by no means sufficient; we strongly advocate for automatically evaluating dialogue systems with as many relevant metrics as possible." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "43d670e583caab9b38ddce999b8872-C001-15", "43d670e583caab9b38ddce999b8872-C001-16" ], [ "43d670e583caab9b38ddce999b8872-C001-21" ], [ "43d670e583caab9b38ddce999b8872-C001-34", "43d670e583caab9b38ddce999b8872-C001-35", "43d670e583caab9b38ddce999b8872-C001-36", "43d670e583caab9b38ddce999b8872-C001-37", "43d670e583caab9b38ddce999b8872-C001-38", "43d670e583caab9b38ddce999b8872-C001-39", "43d670e583caab9b38ddce999b8872-C001-40", "43d670e583caab9b38ddce999b8872-C001-41" ] ], "cite_sentences": [ "43d670e583caab9b38ddce999b8872-C001-16", "43d670e583caab9b38ddce999b8872-C001-21", "43d670e583caab9b38ddce999b8872-C001-41" ] }, "@USE@": { "gold_contexts": [ [ "43d670e583caab9b38ddce999b8872-C001-26" ], [ "43d670e583caab9b38ddce999b8872-C001-70" ], [ "43d670e583caab9b38ddce999b8872-C001-85" ] ], "cite_sentences": [ "43d670e583caab9b38ddce999b8872-C001-26", "43d670e583caab9b38ddce999b8872-C001-70", "43d670e583caab9b38ddce999b8872-C001-85" ] } } }, "ABC_3ea1f4acd7e2812e68eca54600fc5c_32": { "x": [ { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-6", "text": "In the shared task evaluation, it scored better than average." }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-7", "text": "----------------------------------" }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-2", "text": "In the paper we describe a dependency parser that uses exact search and global learning (Crammer et al., 2006) to produce labelled dependency trees." }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-3", "text": "Our system integrates the task of learning tree structure and learning labels in one step, using the same set of features for both tasks." 
}, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-4", "text": "During label prediction, the system automatically selects for each feature an appropriate level of smoothing." }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-5", "text": "We report on several experiments that we conducted with our system." }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-9", "text": "Dependency parsing is a topic that has engendered increasing interest in recent years." }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-10", "text": "One promising approach is based on exact search and structural learning (McDonald et al., 2005; McDonald and Pereira, 2006) ." }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-11", "text": "In this work we also pursue this approach." }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-12", "text": "Our system makes no provisions for non-projective edges." }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-13", "text": "In contrast to previous work, we aim to learn labelled dependency trees at one fell swoop." }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-14", "text": "This is done by maintaining several copies of feature vectors that capture the features' impact on predicting different dependency relations (deprels)." }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-15", "text": "In order to preserve the strength of McDonald et al. (2005) 's approach in terms of unlabelled attachment score, we add feature vectors for generalizations over deprels." }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-16", "text": "We also employ various reversible transformations to reach treebank formats that better match our feature representation and that reduce the complexity of the learning task." }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-17", "text": "The paper first presents the methodology used, goes on to describing experiments and results and finally concludes." 
}, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-18", "text": "----------------------------------" }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-19", "text": "**METHODOLOGY**" }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-20", "text": "----------------------------------" }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-21", "text": "**PARSING ALGORITHM**" }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-22", "text": "In our approach, we adopt Eisner (1996) 's bottomup chart-parsing algorithm in McDonald et al. (2005) 's formulation, which finds the best projective dependency tree for an input string" }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-23", "text": "We assume that every possible headdependent pair \u00a8 is described by a feature vector" }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-24", "text": "Eisner's algorithm achieves optimal tree packing by storing partial structures in two matrices and ." }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-25", "text": "First the diagonals of the matrices are initiated with 0; then all other cells are filled according to eqs. (1) and (2) and their symmetric variants." }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-26", "text": "----------------------------------" }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-27", "text": "**FEATURE REPRESENTATION**" }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-28", "text": "In deriving features, we used all information given in the treebanks, i.e. words (w), fine-grained POS tags (fp), combinations of lemmas and coarsegrained POS tags (lcp), and whether two tokens agree 1 (agr = yes, no, don't know)." }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-29", "text": "We essentially employ the same set of features as McDonald et al. 
(2005) :" }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-30", "text": "----------------------------------" }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-31", "text": "**STRUCTURAL LEARNING**" }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-32", "text": "For determining feature weights & , we used online passive-aggressive learning (OPAL) (Crammer et al., 2006) ." }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-33", "text": "OPAL iterates repeatedly over all training instances , adapting weights after each parse." }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-34", "text": "It tries to change weights as little as possible (passiveness), while ensuring that (1)" }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-35", "text": "1 Agreement was computed from morphological features, viz. gender, number and person, and case." }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-36", "text": "In languages with subject-verb agreement, we added a nominative case feature to finite verbs." }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-37", "text": "In Basque, agreement is case-specific (absolutive, dative, ergative, other case)." }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-38", "text": "Having a closed-form solution, OPAL is easier to implement and more efficient than the MIRA algorithm used by McDonald et al. (2005) , although it achieves a performance comparable to MIRA's on many problems (Crammer et al., 2006) ." }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-39", "text": "----------------------------------" }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-40", "text": "**LEARNING LABELS FOR DEPENDENCY RELATIONS**" }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-41", "text": "So far, the presented system, which follows closely the approach of McDonald et al. (2005) , only predicts unlabelled dependency trees." 
}, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-42", "text": "To derive a labeling, we departed from their approach: We split each feature along the deprel label dimension, so that each deprel U is associated with its own feature vector (cf." }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-43", "text": "eq. (4), where V is the tensor product and" }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-44", "text": "In parsing, we only consider the best deprel label." }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-45", "text": "On its own, this simple approach led to a severe degradation of performance, so we took a step back by re-introducing features for unlabelled trees." }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-46", "text": "leads to a massive amount of features, so we pruned the taxonomy on a feature-to-feature basis by merging all nodes on a level that only encompass deprels that never occur with this feature in the training data." }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-47", "text": "----------------------------------" }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-48", "text": "**TREEBANK TRANSFORMATIONS**" }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-49", "text": "Having no explicit feature representation for the information in the morphological features slot (cf. section 2.2), we partially redistributed that information to other slots: Verb form, case 2 to fp, semantic classification to an empty lemma slot (Turkish affixes, e.g. \"Able\", \"Ly\")." }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-50", "text": "The balance between fp and w was not always optimal; we used a fine-grained 3 classification in punctuation tags, distinguished between prepositions (e.g. in) and preposition-article combinations (e.g. nel) in Italian 4 on the basis of number/gender features, and collected definite and indefinite articles under one common fp tag." 
}, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-51", "text": "When distinctions in deprels are recoverable from context, we removed them: The dichotomy between conjunctive and disjunctive coordination in Italian depends in most cases exclusively on the coordinating conjunction." }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-52", "text": "The Greek and Czech treebanks have a generic distinction between ordinary deprels and deprels in a coordination, apposition, and parenthesis construction." }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-53", "text": "In Greek, we got rid of the parenthesis markers on deprels by switching head and dependent, giving the former head (the parenthesis) a unique new deprel." }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-54", "text": "For Czech, we reduced the number of deprels from 46 to 34 by swapping the deprels of conjuncts, appositions, etc. and their heads (coordination or comma)." }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-55", "text": "Sometimes, multiple conjuncts take different deprels." }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-56", "text": "We only provided for the clash between \"ExD\" (ellipsis) and other deprels, in which case we added \"ExD\", see below." }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-57", "text": "In Basque, agreement is usually between arguments and auxiliary verbs, so we re-attached 5 relevant arguments from main verb to auxiliary verb." }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-58", "text": "The training set for Arabic contains some very long sentences (up to 396 tokens)." }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-59", "text": "Since contextfree parsing sentences of this length is tedious, we split up all sentences at final punctuation signs Table 3 : Results on DevTest and Test Sets compared with the Average Performance in CoNLL'07." 
}, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-60", "text": "LAS = Labelled Attachment Score, UAS = Unlabelled Attachment Score, LAcc = Label Accuracy, AV = Average score." }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-61", "text": "(AuxK)." }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-62", "text": "With this trick, we pushed down maximal sentence length to 196." }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-63", "text": "Unfortunately, we overlooked the fact that in Turkish, the ROOT deprel not only designates root nodes but also attaches some punctuation marks." }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-64", "text": "This often leads to non-projective structures, which our parser cannot handle, so our parser scored below average in Turkish." }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-65", "text": "In after-deadline experiments, we took this feature of the Turkish treebank into account and achieved above-average results by re-linking all ROOT-ed punctuation signs to the immediately preceding token." }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-66", "text": "----------------------------------" }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-67", "text": "**EXPERIMENTS AND RESULTS**" }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-68", "text": "All experiments were conducted on the treebanks provided in the shared task (Haji\u010d et al., 2004; Aduriz et al., 2003; Mart\u00ed et al., 2007; Chen et al., 2003; B\u00f6hmov\u00e1 et al., 2003; Marcus et al., 1993; Johansson and Nugues, 2007; Prokopidis et al., 2005; Csendes et al., 2005; Montemagni et al., 2003; Oflazer et al., 2003) ." }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-69", "text": "For our contribution, we used the second-order algorithm; only afterwards did we also apply the first-order model to the data, with quite good results (cf." }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-70", "text": "Table 1 )." 
}, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-71", "text": "For testing our approach, we split the treebanks provided into an actual training and a development set (details are in Table 2 )." }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-72", "text": "From each training set, we extracted at least a million features (not counting the split for deprel labels)." }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-73", "text": "The last column in Table 2 shows the average time needed in a training iteration." }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-74", "text": "For nearly all languages, our approach achieved a performance better than average (see Table 3 )." }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-75", "text": "Only in Turkish and Basque did we score below average." }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-76", "text": "On closer inspection, we saw that this performance was due to our projectivity assumption and to insufficient exploration of these treebanks." }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-77", "text": "In its bottom part, Table 3 gives results of improved versions of our approach." }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-78", "text": "----------------------------------" }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-79", "text": "**CONCLUSION**" }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-80", "text": "We presented an approach to dependency parsing that is based on exact search and global learning." }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-81", "text": "Special emphasis is laid on an integrated derivation of labelled and unlabelled dependency trees." }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-82", "text": "We also employed various transformation techniques to reach treebank formats that are better suited to our approach." 
}, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-83", "text": "The approach scores better than average in (nearly) all languages." }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-84", "text": "Nevertheless, it is still a long way from cutting-edge performance." }, { "sent_id": "3ea1f4acd7e2812e68eca54600fc5c-C001-85", "text": "One direction we would like to explore in the future is the integration of dynamic features on deprel labels." } ], "y": { "@BACK@": { "gold_contexts": [ [ "3ea1f4acd7e2812e68eca54600fc5c-C001-10", "3ea1f4acd7e2812e68eca54600fc5c-C001-9" ] ], "cite_sentences": [ "3ea1f4acd7e2812e68eca54600fc5c-C001-10" ] }, "@UNSURE@": { "gold_contexts": [ [ "3ea1f4acd7e2812e68eca54600fc5c-C001-15" ] ], "cite_sentences": [ "3ea1f4acd7e2812e68eca54600fc5c-C001-15" ] }, "@EXT@": { "gold_contexts": [ [ "3ea1f4acd7e2812e68eca54600fc5c-C001-22" ] ], "cite_sentences": [ "3ea1f4acd7e2812e68eca54600fc5c-C001-22" ] }, "@USE@": { "gold_contexts": [ [ "3ea1f4acd7e2812e68eca54600fc5c-C001-29" ] ], "cite_sentences": [ "3ea1f4acd7e2812e68eca54600fc5c-C001-29" ] }, "@DIF@": { "gold_contexts": [ [ "3ea1f4acd7e2812e68eca54600fc5c-C001-38" ], [ "3ea1f4acd7e2812e68eca54600fc5c-C001-41", "3ea1f4acd7e2812e68eca54600fc5c-C001-42", "3ea1f4acd7e2812e68eca54600fc5c-C001-43", "3ea1f4acd7e2812e68eca54600fc5c-C001-44" ] ], "cite_sentences": [ "3ea1f4acd7e2812e68eca54600fc5c-C001-38", "3ea1f4acd7e2812e68eca54600fc5c-C001-41" ] }, "@MOT@": { "gold_contexts": [ [ "3ea1f4acd7e2812e68eca54600fc5c-C001-41", "3ea1f4acd7e2812e68eca54600fc5c-C001-42", "3ea1f4acd7e2812e68eca54600fc5c-C001-43", "3ea1f4acd7e2812e68eca54600fc5c-C001-44" ] ], "cite_sentences": [ "3ea1f4acd7e2812e68eca54600fc5c-C001-41" ] } } }, "ABC_4924d721b50abe1c8d883a7efd0205_32": { "x": [ { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-4", "text": "We show that nine years later it is the state-of-the-art on knowledge-based WSD." 
}, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-20", "text": "Next we present the results and comparison to the state-ofthe-art." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-2", "text": "UKB is an open source collection of programs for performing, among other tasks, knowledge-based Word Sense Disambiguation (WSD)." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-3", "text": "Since it was released in 2009 it has been often used out-of-thebox in sub-optimal settings." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-5", "text": "This case shows the pitfalls of releasing open source NLP software without optimal default settings and precise instructions for reproducibility." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-6", "text": "----------------------------------" }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-8", "text": "The release of open-source Natural Language Processing (NLP) software has been key to make the field progress, as it facilitates other researchers to build upon previous results and software easily." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-9", "text": "It also allows easier reproducibility, allowing for sound scientific progress." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-10", "text": "Unfortunately, in some cases, it can also allow competing systems to run the open-source software out-of-the-box with suboptimal parameters, specially in fields where there is no standard benchmark and new benchmarks (or new versions of older benchmarks) are created." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-11", "text": "Once a paper reports sub-optimal results for a NLP software, newer papers can start to routinely quote the low results from the previous study." 
}, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-12", "text": "Finding a fix to this situation is not easy." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-13", "text": "The authors of the software can contact the authors of the more recent papers, but it is usually too late for updating the paper." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-14", "text": "Alternatively, the authors of the NLP software can try to publish a new paper with updated results, but there is usually no venue for such a paper, and, even if published, it might not be noticed in the field." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-15", "text": "In this paper we want to report such a case in Word Sense Disambiguation (WSD), where the original software (UKB) was released with suboptimal default parameters." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-16", "text": "Although the accompanying papers did contain the necessary information to obtain state-of-the-art results, the software did not contain step-by-step instructions, or endto-end scripts for optimal performance." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-17", "text": "This case is special, in that we realized that the software is able to attain state-of-the-art results also in newer datasets, using the same settings as in the papers." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-18", "text": "The take-away message for open-source NLP software authors is that they should not rely on other researchers reading the papers with care, and that it is extremely important to include, with the software release, precise instructions and optimal default parameters, or better still, end-toend scripts that download all resources, perform any necessary pre-processing and reproduce the results." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-19", "text": "The first section presents UKB and WSD, followed by the settings and parameters." 
}, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-21", "text": "Section 5 reports some additional results, and finally, we draw the conclusions." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-22", "text": "----------------------------------" }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-23", "text": "**WSD AND UKB**" }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-24", "text": "Word Sense Disambiguation (WSD) is the problem of assigning the correct sense of a word in a context (Agirre and Edmonds, 2007) ." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-25", "text": "Traditionally, supervised approaches have attained the best results in the area, but they are expensive to build because of the need of large amounts of manually annotated examples." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-26", "text": "Alternatively, knowledge based approaches rely on lexical resources such as WordNet, which are nowadays widely available in many languages (Bond and Paik, 2012) 1 ." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-27", "text": "In particular, graph-based approaches represent the knowledge base as a graph, and apply several well-known graph analysis algorithms to perform WSD." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-28", "text": "UKB is a collection of programs which was first released for performing graph-based Word Sense Disambiguation using a preexisting knowledge base such as WordNet, and attained state-of-the-art results among knowledge-based systems when evaluated on standard benchmarks Agirre et al., 2014) ." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-29", "text": "In addition, UKB has been extended to perform disambiguation of medical entities (Agirre et al., 2010) , named-entities (Erbs et al., 2012; , word similarity and to create knowledge-based word embeddings (Goikoetxea et al., 2015) ." 
}, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-30", "text": "All programs are open source 2 ,3 and are accompanied by the resources and instructions necessary to reproduce the results." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-31", "text": "The software is quite popular, with 60 stars and 26 forks in github, as well as more than eight thousand direct downloads from the website since 2011." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-32", "text": "The software is coded in C++ and released under the GPL v3.0 license." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-33", "text": "When UKB was released, the papers specified the optimal parameters for WSD Agirre et al., 2014) , as well as other key issues like the underlying knowledge-base version, specific set of relations to be used, and method to pre-process the input text." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-34", "text": "At the time, we assumed that future researchers would use the optimal parameters and settings specified in the papers, and that they would contact the authors if in doubt." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-35", "text": "The default parameters of the software were not optimal, and the other issues were left under the users responsibility." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-36", "text": "The assumption failed, and several papers reported low results in some new datasets (including updated versions of older datasets), as we will see in the following sections." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-37", "text": "----------------------------------" }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-38", "text": "**UKB PARAMETERS AND SETTING FOR WSD**" }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-39", "text": "When using UKB for WSD, the main parameters and settings can be classified in five main categories." 
}, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-40", "text": "For each of those we mention the best options and the associated UKB parameter when relevant (in italics), as taken from Agirre et al., 2014 ):" }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-41", "text": "\u2022 Pre-processing of input text." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-42", "text": "When running UKB for WSD, one needs to define which 2 http://ixa2.si.ehu.eus/ukb 3 https://github.com/asoroa/ukb window of words is to be used as context to initialize the random walks." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-43", "text": "One option is to take just the sentence, but given that in some cases the sentences are very short, better results are obtained when considering previous and following sentences." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-44", "text": "The procedure in the original paper repeated the extension procedure until the total length of the context is at least 20 words 4 ." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-45", "text": "\u2022 Knowledge base relations." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-46", "text": "When performing WSD for English, UKB uses WordNet (Fellbaum, 1998) as a knowledge base." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-47", "text": "WordNet comes in various versions, and usually UKB performs best when using the same version the dataset was annotated with." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-48", "text": "Besides regular WordNet relations, gloss relations (relations between synsets appearing in the glosses) have been shown to be always helpful." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-49", "text": "\u2022 Graph algorithm." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-50", "text": "UKB implements different graph-based algorithms and variants to perform WSD." 
}, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-51", "text": "These are the main ones: ppr w2w: apply personalized PageRank for each target word, that is, perform a random walk in the graph personalized on the word context." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-52", "text": "It yields the best results overall, at the cost of being more time consuming that the rest." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-53", "text": "ppr: same as above, but apply personalized PageRank to each sentence only once, disambiguating all content words in the sentence in one go." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-54", "text": "It is thus faster that the previous approach, but obtains worse results." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-55", "text": "dfs: unlike the two previous algorithms, which consider the WordNet graph as a whole, this algorithm first creates a subgraph for each context, following the method first presented in Navigli and Lapata (2010) , and then runs the PageRank algorithm over the subgraph." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-56", "text": "This option represents a compromise between ppr w2w and ppr, as it faster than than the former while better than the latter." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-57", "text": "\u2022 The PageRank algorithm has two parameters which were set as follows: number of iterations of power method (prank iter) 30, and damping factor (prank damping) 0.85. (Raganato et al., 2017a) ." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-58", "text": "Best results in bold." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-59", "text": "Note that (Raganato et al., 2017b ) used S07 for development." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-60", "text": "\u2022 Use of sense frequencies (dict weight)." 
}, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-61", "text": "Sense frequencies are a valuable piece of information that describe the frequencies of the associations between a word and its possible senses." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-62", "text": "The frequencies are often derived from manually sense annotated corpora, such as Semcor (Miller et al., 1993) ." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-63", "text": "We use the sense frequency accompanying Wordnet, which, according to the documentation, \"represents the decimal number of times the sense is tagged in various semantic concordance texts\"." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-64", "text": "The frequencies are smoothed adding one to all counts (dict weight smooth)." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-65", "text": "The sense frequency is used when initializing context words, and is also used to produce the final sense weights as a linear combination of sense frequencies and graph-based sense probabilities." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-66", "text": "The use of sense frequencies with UKB was introduced in (Agirre et al., 2014) ." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-67", "text": "----------------------------------" }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-68", "text": "**COMPARISON TO THE STATE-OF-THE-ART**" }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-69", "text": "We evaluate UKB on the recent evaluation dataset described in (Raganato et al., 2017a) ." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-70", "text": "This dataset comprises five standard English all-words datasets, standardized into a unified format with gold keys in WordNet version 3.0 (some of the original datasets used older versions of WordNet)." 
}, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-71", "text": "The dataset contains 7, 253 instances of 2, 659 different content words (nouns, verbs, adjectives and adverbs)." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-72", "text": "The average ambiguity of the words in the dataset is of 5.9 senses per word." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-73", "text": "We report F1, the harmonic mean between precision and recall, as computed by the evaluation code accompanying the dataset." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-74", "text": "The two top rows in Table 1 show conflicting results for UKB." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-75", "text": "The first row corresponds to UKB ran with the settings described above." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-76", "text": "The second row was first reported in (Raganato et al., 2017a) ." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-77", "text": "As the results show, that paper reports a suboptimal use of UKB." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-78", "text": "In more recent work, Chaplot and Sakajhutdinov (2018) take up that result and report it in their paper as well." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-79", "text": "The difference is of nearly 10 absolute F1 points overall." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-80", "text": "5 This decrease could be caused by the fact that Raganato et al. (2017a) did not use sense frequencies." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-101", "text": "The results using the settings described in the paper on newly released datasets show that UKB is the best among knowledgebased WSD algorithms." 
}, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-81", "text": "In addition to UKB, the table also reports the 5 Note that the UKB results for S2, S3 and S07 (62.6, 63.0 and 48.6 respectively) are different from those in (Agirre et al., 2014) , which is to be expected, as the new datasets have been converted to WordNet 3.0 (we confirmed experimentally that this is the sole difference between the two experiments)." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-82", "text": "best performing knowledge-based systems on this dataset." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-83", "text": "Raganato et al. (2017a) run several wellknown algorithms when presenting their datasets." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-84", "text": "We also report (Chaplot and Sakajhutdinov, 2018) , the latest work on this area, as well as the most frequent sense as given by WordNet counts (see Section 3)." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-85", "text": "The table shows that UKB yields the best overall result." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-86", "text": "Note that Banerjee and Pedersen (2003) do not use sense frequency information." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-87", "text": "For completeness, Table 2 reports the results of supervised systems on the same dataset, taken from the two works that use the dataset (Yuan et al., 2016; Raganato et al., 2017b) ." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-88", "text": "As expected, supervised systems outperform knowledge-based systems, by a small margin in some of the cases." 
}, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-89", "text": "----------------------------------" }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-90", "text": "**ADDITIONAL RESULTS**" }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-91", "text": "In addition to the results of UKB using the setting in Agirre et al., 2014) as specified in Section 3, we checked whether some reasonable settings would obtain better results." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-92", "text": "Table 3 shows the results when applying the three algorithms described in Section 3, both with and without sense frequencies, as well as using a single sentence for context or extended context." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-93", "text": "The table shows that the key factor is the use of sense frequencies, and systems that do not use them (those with a nf subscript) suffer a loss between 7 and 8 percentage points in F1." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-94", "text": "This would explain part of the decrease in performance reported in (Raganato et al., 2017a) , as they explicitly mention that they did not activate the use of sense frequencies in UKB." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-95", "text": "The table also shows that extending the context is mildly effective." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-96", "text": "Regarding the algorithm, the table confirms that the best method is ppr w2w, followed by the subgraph approach (dfs) and ppr." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-97", "text": "----------------------------------" }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-98", "text": "**CONCLUSIONS**" }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-99", "text": "This paper presents a case where an open-source NLP software was used with suboptimal parameters by third parties." 
}, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-100", "text": "UKB was released with suboptimal default parameters, and although the accompanying papers did describe the necessary settings for good results on WSD, bad results were not prevented." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-102", "text": "The take-away message for open-source NLP software authors is that they should not rely on other researchers reading the papers with care, and that it is extremely important to include, with the software release, precise instructions and optimal default parameters, or better still, end-toend scripts that download all resources, perform any necessary pre-processing and reproduce the results." }, { "sent_id": "4924d721b50abe1c8d883a7efd0205-C001-103", "text": "UKB now includes in version 3.1 such end-to-end scripts and the appropriate default parameters." } ], "y": { "@BACK@": { "gold_contexts": [ [ "4924d721b50abe1c8d883a7efd0205-C001-28" ], [ "4924d721b50abe1c8d883a7efd0205-C001-33" ], [ "4924d721b50abe1c8d883a7efd0205-C001-66" ] ], "cite_sentences": [ "4924d721b50abe1c8d883a7efd0205-C001-28", "4924d721b50abe1c8d883a7efd0205-C001-33", "4924d721b50abe1c8d883a7efd0205-C001-66" ] }, "@USE@": { "gold_contexts": [ [ "4924d721b50abe1c8d883a7efd0205-C001-39", "4924d721b50abe1c8d883a7efd0205-C001-40" ] ], "cite_sentences": [ "4924d721b50abe1c8d883a7efd0205-C001-40" ] }, "@DIF@": { "gold_contexts": [ [ "4924d721b50abe1c8d883a7efd0205-C001-81" ] ], "cite_sentences": [ "4924d721b50abe1c8d883a7efd0205-C001-81" ] }, "@EXT@": { "gold_contexts": [ [ "4924d721b50abe1c8d883a7efd0205-C001-91" ] ], "cite_sentences": [ "4924d721b50abe1c8d883a7efd0205-C001-91" ] } } }, "ABC_91685660d3d689c50e7436be46f37e_32": { "x": [ { "sent_id": "91685660d3d689c50e7436be46f37e-C001-39", "text": "**LANGUAGE MODEL**" }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-13", "text": "----------------------------------" }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-14", "text": 
"**DATA AND RESOURCES**" }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-38", "text": "----------------------------------" }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-2", "text": "We develop a grammatical error correction (GEC) system for German using a small gold GEC corpus augmented with edits extracted from Wikipedia revision history." }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-3", "text": "We extend the automatic error annotation tool ERRANT (Bryant et al., 2017) for German and use it to analyze both gold GEC corrections and Wikipedia edits (Grundkiewicz and JunczysDowmunt, 2014) in order to select as additional training data Wikipedia edits containing grammatical corrections similar to those in the gold corpus." }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-4", "text": "Using a multilayer convolutional encoder-decoder neural network GEC approach (Chollampatt and Ng, 2018), we evaluate the contribution of Wikipedia edits and find that carefully selected Wikipedia edits increase performance by over 5%." }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-5", "text": "----------------------------------" }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-6", "text": "**INTRODUCTION AND PREVIOUS WORK**" }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-7", "text": "In the past decade, there has been a great deal of research on grammatical error correction for English including a series of shared tasks, Helping Our Own in 2011 and 2012 (Dale and Kilgarriff, 2011; Dale et al., 2012) and the CoNLL 2013 and 2014 shared tasks (Ng et al., 2013 (Ng et al., , 2014 , which have contributed to the development of larger English GEC corpora." 
}, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-8", "text": "On the basis of these resources along with advances in machine translation, the current state-of-the-art English GEC systems use ensembles of neural MT models (Chollampatt and Ng, 2018) and hybrid systems with both statistical and neural MT models (Grundkiewicz and Junczys-Dowmunt, 2018) ." }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-9", "text": "In addition to using gold GEC corpora, which are typically fairly small in the context of MTbased approaches, research in GEC has taken a number of alternate data sources into consideration such as artificially generated errors (e.g., Wagner et al., 2007; Foster and Andersen, 2009; Yuan and Felice, 2013) , crowd-sourced corrections (e.g., Mizumoto et al., 2012) , or errors from native language resources (e.g., Cahill et al., 2013; Grundkiewicz and Junczys-Dowmunt, 2014) ." }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-10", "text": "For English, Grundkiewicz and JunczysDowmunt (2014) extracted pairs of edited sentences from the Wikipedia revision history and filtered them based on a profile of gold GEC data in order to extend the training data for a statistical MT GEC system and found that the addition of filtered edits improved the system's F 0.5 score b\u1ef9 2%." }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-11", "text": "For languages with more limited resources, native language resources such as Wikipedia offer an easily accessible source of additional data." }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-12", "text": "Using a similar approach that extends existing gold GEC data with Wikipedia edits, we develop a neural machine translation grammatical error correction system for a new language, in this instance German, for which there are only small gold GEC corpora but plentiful native language resources." 
}, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-15", "text": "The following sections describe the data and resources used in our experiments on GEC for German." }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-16", "text": "We create a new GEC corpus for German along with the models needed for the neural GEC approach presented in Chollampatt and Ng (2018) ." }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-17", "text": "Throughout this paper we will refer to the source sentence as the original and the target sentence as the correction." }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-18", "text": "----------------------------------" }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-19", "text": "**GOLD GEC CORPUS**" }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-20", "text": "As we are not aware of any standard corpora for German GEC, we create a new grammatical error correction corpus from two German learner corpora that have been manually annotated following similar guidelines." }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-21", "text": "In the Falko project, annotation guidelines were developed for minimal target hypotheses, minimal corrections that transform an original sentence into a grammatical correction, and these guidelines were applied to ad- vanced German learner essays (Reznicek et al., 2012) ." }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-22", "text": "The MERLIN project (Boyd et al., 2014) adapted the Falko guidelines and annotated learner texts from a wide range of proficiency levels." }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-23", "text": "1 We extract pairs of original sentences and corrections from all annotated sentence spans in FalkoEssayL2 v2.4 2 (248 texts), FalkoEssayWhig v2.0 2 (196 texts), and MERLIN v1.1 3 (1,033 texts) to create the new Falko-MERLIN GEC Corpus, which contains 24,077 sentence pairs." 
}, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-24", "text": "The corpus is divided into train (80%), dev (10%), and test (10%) sets, keeping all sentences from a single learner text within the same partition." }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-25", "text": "An overview of the Falko-MERLIN GEC Corpus is shown in Table 1 with the number of errors per sentence and errors per token as analyzed by ERRANT for German (see section 3.1)." }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-26", "text": "On average, the Falko corpus (advanced learners) contains longer sentences with fewer errors per token while the MERLIN corpus (all proficiency levels) contains shorter sentences with more errors per token." }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-27", "text": "A more detailed ERRANT-based analysis is presented in Figure 2 in section 3.2." }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-28", "text": "----------------------------------" }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-29", "text": "**WIKIPEDIA**" }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-30", "text": "In our experiments, we use German Wikipedia dumps of articles and revision history from June 1, 2018." }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-31", "text": "Wikipedia edits are extracted from the revision history using Wiki Edits (Grundkiewicz and Junczys-Dowmunt, 2014 ) with a maximum sentence length of 60 tokens, since 99% of the Falko and MERLIN sentences are shorter than 60 tokens." }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-32", "text": "For training the subword embeddings, plain text is extracted from the German Wikipedia articles using WikiExtractor." 
}, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-33", "text": "4" }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-34", "text": "----------------------------------" }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-35", "text": "**BPE MODEL AND SUBWORD EMBEDDINGS**" }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-36", "text": "We learn a byte pair encoding (BPE) (Sennrich et al., 2016) with 30K symbols using the corrections from the Falko-MERLIN training data plus the complete plain Wikipedia article text." }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-37", "text": "As suggested by Chollampatt and Ng (2018) , we encode the Wikipedia article text using the BPE model and learn fastText embeddings (Bojanowski et al., 2017) with 500 dimensions." }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-40", "text": "For reranking, we train a language model on the first one billion lines (~12 billion tokens) of the deduplicated German Common Crawl corpus (Buck et al., 2014) ." }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-41", "text": "----------------------------------" }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-42", "text": "**METHOD**" }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-43", "text": "We extend the Falko-MERLIN GEC training data with sentence-level Wikipedia edits that include similar types of corrections." }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-44", "text": "In order to automatically analyze German GEC data, we extend ER-RANT from English to German (section 3.1) and use its analyses to select suitable Wikipedia edits (section 3.2)." 
}, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-45", "text": "----------------------------------" }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-46", "text": "**ERRANT**" }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-47", "text": "ERRANT, the ERRor ANnotation Tool (Felice et al., 2016; Bryant et al., 2017) , analyzes pairs of English sentences from a GEC corpus to identify the types of corrections performed." }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-48", "text": "The tokens in a pair of sentences are aligned using Damerau-Levenshtein edit distance with a custom substitution cost that includes linguistic information -lemmas, POS, and characters -to promote alignments between related word forms." }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-49", "text": "After the individual tokens are aligned, neighboring edits are evaluated to determine whether two or more edits should be merged into one longer edit, such as merging wide \u2192 widespread followed by spread \u2192 \u2205 into a single edit wide spread \u2192 widespread." }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-50", "text": "To assign an error type to a correction, ER-RANT uses a rule-based approach that considers information about the POS tags, lemmas, stems, and dependency parses." }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-51", "text": "To extend ERRANT for German, we adapted and simplified the English error types, relying on UD POS tags instead of language-specific tags as much as possible." }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-52", "text": "Our top-level German ERRANT error types are shown with examples in Table 2 ." }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-53", "text": "For substitution errors, The POS tag types include 14 UD POS types plus the German-specific STTS tag TRUNC." 
}, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-54", "text": "The MORPH tag captures errors for related word forms with different POS tags, ORTH is for capitalization and whitespace errors, SPELL errors have an original token that is not in a large word list with >50% overlapping characters compared to the corrected token, ORDER errors cover adjacent reordered tokens, and CONTR errors involve the contraction 's ('it')." }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-55", "text": "All remaining errors are classified as OTHER." }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-56", "text": "In ERRANT for English, all linguistic annotation is performed with spaCy." }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-57", "text": "5 We preserve as much of the spaCy pipeline as possible using spaCy's German models, however the lemmatizer is not sufficient and is replaced with the TreeTagger lemmatizer." }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-58", "text": "6 All our experiments are performed with spaCy 2.0.11 and spaCy's default German model." }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-59", "text": "The word list for detecting spelling errors comes from Hunspell igerman98-20161207 7 and the mapping of STTS to UD tags from TuebaUDConverter (\u00c7\u00f6ltekin et al., 2017) ." }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-60", "text": "An example of a German ERRANT analysis is shown in Figure 1 ." }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-61", "text": "The first token is analyzed as an adjective substitution error where both adjectives have the same lemma (S:ADJ:FORM), the inflected deverbal adjective bestandenen 'passed' is inserted before Pr\u00fcfung 'exam' (I:ADJ), and the past participle bestanden 'passed' is deleted at the end of the sentence (D:VERB)." 
}, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-62", "text": "Note that ERRANT does not analyze Pr\u00fcfung bestanden \u2192 bestandenen Pr\u00fcfung as a word order error because the reordered word forms are not identical." }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-63", "text": "In cases like these and ones with longer distance movement, which is a frequent type of correction in non-native German texts, ERRANT has no way to indicate that these two word forms are related or that this pair of edits is coupled." }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-64", "text": "----------------------------------" }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-65", "text": "**FILTERING EDITS WITH ERRANT**" }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-66", "text": "Even though the Wiki Edits algorithm (Grundkiewicz and Junczys-Dowmunt, 2014) extracts only sentence pairs with small differences, many edits relate to content rather than grammatical errors, such as inserting a person's middle name or updating a date." }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-67", "text": "In order to identify the most relevant Wikipedia edits for GEC, we analyze the gold GEC corpus and Wikipedia edits with ERRANT and then filter the Wikipedia edits based on a profile of the gold GEC data." }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-68", "text": "First, sentences with ERRANT error types that indicate content or punctuation edits are discarded: 1) sentences with only punctuation, proper noun, and/or OTHER error types, 2) sentences with edits modifying only numbers or non-Latin characters, and 3) sentences with OTHER edits longer than two tokens." 
}, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-69", "text": "Second, the ERRANT profile of the gold corpus is used to select edits that: 1) include an original token edited in the gold corpus, 2) include the same list of error types as a sentence in the gold corpus, 3) include the same set of error types as a sentence in the gold corpus for 2+ error types, or 4) for sets of Gold and W iki error types have a Jaccard similarity coefficient to a gold sentence greater than 0.5:" }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-70", "text": "After ERRANT-based filtering, approximately one third of the sentences extracted with Wiki Edits remain." }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-71", "text": "The distribution of selected ERRANT error types for the Falko and MERLIN gold GEC corpora vs. the unfiltered and filtered Wikipedia edit corpora are shown in Figure 2 in order to provide an overview of the similarities and differences between the data." }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-72", "text": "As intended, filtering Wikipedia edits as described above decreases the number of potentially content-related PNOUN and OTHER edits while increasing the proportion of other types of edits." }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-73", "text": "Both in the unfiltered and filtered Wikipedia edits corpora, the overall frequency of errors remains lower than in the Falko-MERLIN GEC corpus: 1.7 vs. 2.8 errors per sentence and 0.08 vs. 0.18 errors per token." 
}, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-74", "text": "----------------------------------" }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-75", "text": "**RESULTS AND DISCUSSION**" }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-76", "text": "We evaluate the effect of extending the Falko-MERLIN GEC Corpus with Wikipedia edits for a German GEC system using the multilayer convolutional encoder-decoder neural network approach from Chollampatt and Ng (2018) , using the same parameters as for English." }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-77", "text": "8 We train a single model for each condition and evaluate on the Falko-MERLIN test set using M 2 scorer (Dahlmeier and Ng, 2012 The results, presented in Table 3 , show that the addition of both unfiltered and filtered Wikipedia edits to the Falko-MERLIN GEC training data lead to improvements in performance, however larger numbers of unfiltered edits (>250K) do not consistently lead to improvements, similar to the results for English in Grundkiewicz and JunczysDowmunt (2014) ." }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-78", "text": "However for filtered edits, increasing the number of additional edits from 100K to 1M continues to lead to improvements, with an overall improvement of 5.2 F 0.5 for 1M edits over the baseline without additional reranking." }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-79", "text": "In contrast to the results for English in Chollampatt and Ng (2018) In order to explore the possibility of developing GEC systems for languages with fewer resources, we trained models solely on Wikipedia edits, which leads to a huge drop in performance (45.22 vs. 24.37 F 0.5 )." 
}, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-80", "text": "However, the genre differences may be too large to draw solid conclusions and this approach may benefit from further work on Wikipedia edit selection, such as using a language model to exclude some Wikipedia edits that introduce (rather than correct) grammatical errors." }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-81", "text": "----------------------------------" }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-82", "text": "**FUTURE WORK**" }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-83", "text": "The combined basis of ERRANT and Wiki Edits make it possible to explore MT-based GEC approaches for languages with limited gold GEC resources." }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-84", "text": "The current German ERRANT error analysis approach can be easily generalized to rely on a pure UD analysis, which would make it possible to apply ERRANT to any language with a UD parser and a lemmatizer." }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-85", "text": "Similarly, the process of filtering Wikipedia edits could use alternate methods in place of a gold reference corpus, such as a list of targeted token or error types, to generate GEC training data for any language with resources similar to a Wikipedia revision history." }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-86", "text": "For the current German GEC system, a detailed error analysis for the output could identify the types of errors where Wikipedia edits make a significant contribution and other areas where additional data could be incorporated, potentially through artificial error generation or crowd-sourcing." 
}, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-87", "text": "----------------------------------" }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-88", "text": "**CONCLUSION**" }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-89", "text": "We provide initial results for grammatical error correction for German using data from the Falko and MERLIN corpora augmented with Wikipedia edits that have been filtered using a new German extension of the automatic error annotation tool ERRANT (Bryant et al., 2017) ." }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-90", "text": "Wikipedia edits are extracted using Wiki Edits (Grundkiewicz and Junczys-Dowmunt, 2014) , profiled with ER-RANT, and filtered with reference to the gold GEC data." }, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-91", "text": "We evaluate our method using the multilayer convolutional encoder-decoder neural network GEC approach from Chollampatt and Ng (2018) and find that augmenting a small gold German GEC corpus with one million filtered Wikipedia edits improves the performance from 39.22 to 44.47 F 0.5 and additional language model reranking increases performance to 45.22." 
}, { "sent_id": "91685660d3d689c50e7436be46f37e-C001-92", "text": "The data and source code for this paper are available at: https://github.com/adrianeboyd/ boyd-wnut2018/" } ], "y": { "@USE@": { "gold_contexts": [ [ "91685660d3d689c50e7436be46f37e-C001-4" ], [ "91685660d3d689c50e7436be46f37e-C001-16" ], [ "91685660d3d689c50e7436be46f37e-C001-37" ], [ "91685660d3d689c50e7436be46f37e-C001-76" ], [ "91685660d3d689c50e7436be46f37e-C001-91" ] ], "cite_sentences": [ "91685660d3d689c50e7436be46f37e-C001-4", "91685660d3d689c50e7436be46f37e-C001-16", "91685660d3d689c50e7436be46f37e-C001-37", "91685660d3d689c50e7436be46f37e-C001-76", "91685660d3d689c50e7436be46f37e-C001-91" ] }, "@BACK@": { "gold_contexts": [ [ "91685660d3d689c50e7436be46f37e-C001-8" ] ], "cite_sentences": [ "91685660d3d689c50e7436be46f37e-C001-8" ] }, "@UNSURE@": { "gold_contexts": [ [ "91685660d3d689c50e7436be46f37e-C001-79" ] ], "cite_sentences": [ "91685660d3d689c50e7436be46f37e-C001-79" ] } } }, "ABC_0fd87fbdbe64e7d002ca31783448fb_32": { "x": [ { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-2", "text": "We present a novel real-time, collaborative, and interactive AI painting system, Mappa Mundi, for artistic Mind Map creation." }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-3", "text": "The system consists of a voice-based input interface, an automatic topic expansion module, and an image projection module." }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-4", "text": "The key innovation is to inject Artificial Imagination into painting creation by considering lexical and phonological similarities of language, learning and inheriting artist's original painting style, and applying the principles of Dadaism and impossibility of improvisation." 
}, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-5", "text": "Our system indicates that AI and artist can collaborate seamlessly to create imaginative artistic painting and Mappa Mundi has been applied in art exhibition in UCCA, Beijing 1 ." }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-6", "text": "----------------------------------" }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-8", "text": "Mind Map is a diagram that connects relevant words, terms or events around a keyword or an idea and it's perceived as an artistic possibility representation of knowledge." }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-9", "text": "Through constant interpretation and abstraction, the artistic creation of Mind Map is a process of subject decision making and information expansion." }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-10", "text": "In this paper, we introduce Mappa Mundi, an interactive Mind Map generator featured with artificial imagination." }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-11", "text": "The creation of Mind Map requires the interaction between AI and artist on the decision making of idea expansion." }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-12", "text": "While previous AI image generators, such as pix2pix [Isola et al., 2017] framework, depends only on the model performance and less on taking artist's interaction into account." }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-13", "text": "Besides, models constructed with neural approaches barely extract features from big data, whereas features, such as imagination and creativity, are far to reach." 
}, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-14", "text": "As Mind Map represents the universal mind of human thinking, the Kantian thinking of free play of imagination, which is the foundation of modern arts [Kant, 1987] , plays an important role in expanding artist's thought and connecting the flourishing information." }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-15", "text": "Different from the traditional framework, Mappa Mundi offers opportunities to enable AI and artist together during art making." }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-16", "text": "Mappa Mundi enables voice input from human and extracts the keywords for vivid expansion on myriad levels of" }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-17", "text": "----------------------------------" }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-18", "text": "**RELATED WORK AND CHALLENGES**" }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-19", "text": "Nowadays, AI demonstrates stronger potential for art creation." }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-20", "text": "Many researches have been conducted to involve AI into poem generation [Zhang and Lapata, 2014; Cheng et al., 2018] , creation of classical or pop music [Manzelli et al., 2018; Hadjeres et al., 2017] and automatic images generation [van den Oord et al., 2016; Yan et al., 2016; Xu et al., 2018] ." }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-21", "text": "Whereas there are few researches exploring the possibility of artificial imagination for artwork creation." }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-22", "text": "To generate an imaginative and creative Mind Map, the key problem is automatic topic expansion." }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-23", "text": "There are several challenges: first of all, language is an informative system." 
}, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-24", "text": "Thus many linguistic features should be considered in words and relation extension." }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-25", "text": "Second, as an artwork, it should reflect artist's mind and individual experience." }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-26", "text": "Third, for Mind Map creation, artists usually pay less attention on defining accurate relation between concepts, instead, they are defined with imagination." }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-27", "text": "So, to build artistic Mind Map, the system should find a way to break the convention rules for the connection between literal representations." }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-28", "text": "----------------------------------" }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-29", "text": "**SYSTEM ARCHITECTURE**" }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-30", "text": "The overall architecture of Mappa Mundi is illustrated in Figure 1." }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-31", "text": "It consists of three modules in total." }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-32", "text": "The first one is a voice-based input interface, which translates user's voice input into text and extracts keyword from it." }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-33", "text": "The second is topic expansion module, which expands the keyword into several word candidates that share various connections with it." 
}, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-34", "text": "The third module is image projection, which takes the keyword" }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-35", "text": "----------------------------------" }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-36", "text": "**VOICE-BASED INPUT INTERFACE**" }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-37", "text": "Automatic Speech Recognition (ASR) engine is designed for interaction." }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-38", "text": "Both English and Chinese languages are supported." }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-39", "text": "We also performed lexicon and language model adaption to enhance the recognition of words in art domain." }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-40", "text": "After that, we developed a keyword extraction function to extract meaningful words from the converted text for further expansion." }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-41", "text": "We adopted open-source software jieba 2 for Chinese and Stanford parser [Toutanova and Manning, 2000] for English POS tagging." }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-42", "text": "Only informative words such as noun, verb and adjective words are kept and TFIDF weights are calculated for further filtering." }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-43", "text": "The output of this module is the keywords." }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-44", "text": "----------------------------------" }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-45", "text": "**TOPIC EXPANSION**" }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-46", "text": "As imagination is the soul for artistic Mind Map, Mappa Mundi employs several features to increase information variety [Liu et al., 2019] during topic expansion." 
}, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-47", "text": "It firstly uses word embeddings to find candidates based on semantic similarity [Mikolov et al., 2013; Pennington et al., 2014; Peters et al., 2018] ." }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-48", "text": "To enrich linguistic information of expansions, it also takes the morphological and phonological features into account." }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-49", "text": "Words sharing similar characters or morphemes or phonetic syllables are selected as candidates." }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-50", "text": "Moreover, we construct a knowledge graph (KG) by extracting concepts and relations from artist's original Mind Map masterpieces." }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-51", "text": "Mappa Mundi will adopt those concepts and relations once a topic word is covered by the KG." }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-52", "text": "By doing so, it can better imitate artists' mind and thoughts." }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-53", "text": "Further, Mappa Mundi breaks the restriction of domains and conventions and explores more possibilities for words' connections by creating rules following the famous Dadaism principle in art movement [Elger, 2004] ." }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-54", "text": "Finally, with all above mentioned methods, Mappa Mundi can create a branch of creative and informative word candidates for a given keyword." }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-55", "text": "----------------------------------" }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-56", "text": "**IMAGE PROJECTION**" }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-57", "text": "To visualize the abstract concepts, Mappa Mundi includes around 3000 painting elements which are extracted from the representative Mind Map paintings of a famous artist 3 ." 
}, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-58", "text": "These elements are sharing traditional Chinese Shan-shui (Mountain-river) painting style and are further analyzed into 5 typical Shan-shui painting categories including architecture, mountain, river, grassland and lake." }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-59", "text": "Mappa Mundi will classify each word into one of the five categories above and then connect it with the elements in the corresponding domain." }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-60", "text": "An additional type of painting element Route is also involved." }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-61", "text": "The distance between the concept and it's candidate is determined by their similarity score and then visualized as Route in image." }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-62", "text": "The more similar these two words are, the closer distance will be." }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-63", "text": "Our Mappa Mundi is not a one-shot completion, it is an interaction of live stimulations, which contains constant vocal inputs via ASR and outputs of generated words and their relations." }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-64", "text": "It is a two-way street." }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-65", "text": "The generated information, after being presented in our system, becomes the inspiration for artist's next vocal input." }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-66", "text": "This artwork reflects both the development of artist's thinking and the AI-enabled imagination." 
}, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-67", "text": "----------------------------------" }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-68", "text": "**CONCLUSION**" }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-69", "text": "In this paper, we propose a Mind Map generator, Mappa Mundi, that can inject Artificial Imagination together with human interaction into an artwork." }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-70", "text": "The way of enriching artificial imagination includes considering semantic similarity, integrating informative linguistic features, inheriting artist's mind, and inserting Dadaism and impossibility of improvisation principles." }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-71", "text": "Mappa Mundi is presented to understand how AI can aid artistic practice." }, { "sent_id": "0fd87fbdbe64e7d002ca31783448fb-C001-72", "text": "Its development also indicates AI has more implications in art than a way of artistic creation, and it enriches the possibility of artistic practice." } ], "y": { "@BACK@": { "gold_contexts": [ [ "0fd87fbdbe64e7d002ca31783448fb-C001-20" ], [ "0fd87fbdbe64e7d002ca31783448fb-C001-46" ] ], "cite_sentences": [ "0fd87fbdbe64e7d002ca31783448fb-C001-20", "0fd87fbdbe64e7d002ca31783448fb-C001-46" ] }, "@USE@": { "gold_contexts": [ [ "0fd87fbdbe64e7d002ca31783448fb-C001-41" ] ], "cite_sentences": [ "0fd87fbdbe64e7d002ca31783448fb-C001-41" ] } } }, "ABC_81dd7a27479f0cec3a01337c57ca95_32": { "x": [ { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-2", "text": "Text compression has diverse applications such as Summarization, Reading Comprehension and Text Editing." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-3", "text": "However, almost all existing approaches require either hand-crafted features, syntactic labels or parallel data." 
}, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-4", "text": "Even for one that achieves this task in an unsupervised setting, its architecture necessitates a task-specific autoencoder." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-5", "text": "Moreover, these models only generate one compressed sentence for each source input, so that adapting to different style requirements (e.g. length) for the final output usually implies retraining the model from scratch." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-6", "text": "In this work, we propose a fully unsupervised model, DELETER, that is able to discover an \"optimal deletion path\" for an arbitrary sentence, where each intermediate sequence along the path is a coherent subsequence of the previous one." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-7", "text": "This approach relies exclusively on a pretrained bidirectional language model (BERT) to score each candidate deletion based on the average Perplexity of the resulting sentence and performs progressive greedy lookahead search to select the best deletion for each step." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-8", "text": "We apply DELETER to the task of extractive Sentence Compression, and found that our model is competitive with state-of-the-art supervised models trained on 1.02 million in-domain examples with similar compression ratio." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-9", "text": "Qualitative analysis, as well as automatic and human evaluations both verify that our model produces high-quality compression." 
}, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-10", "text": "----------------------------------" }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-11", "text": "**INTRODUCTION**" }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-12", "text": "Text compression can be applied to various tasks such as Summarization, Text Editing and even Data Augmentation where the compressed text can be employed as additional training examples." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-13", "text": "However, almost all existing approaches require either parallel data, hand-crafted rules, or extra syntactic information such as dependency labels or part-of-speech tag features trees (McDonald, 2006; Filippova and Strube, 2008; Zhao et al., 2018) ." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-14", "text": "Even for one model that achieves this task in an unsupervised setting, its architecture necessitates a task-specific \"sequence-to-sequenceto-sequence\" autoencoder (Baziotis et al., 2019) ." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-15", "text": "Moreover, these models only generate one compressed sentence for each source input, making adaption to different style requirements difficult." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-16", "text": "In this work, we propose a fully unsupervised model, DELETER, that is able to discover an \"optimal deletion path\" for a sentence, where each intermediate sequence along the path is a coherent subsequence of the previous one." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-17", "text": "DELETER relies exclusively on a pretrained bidirectional language model, BERT (Devlin et al., 2018) , to score each candidate deletion based on the average Perplexity of the resulting sentence and performs progressive greedy look-ahead search to select the best deletion(s) for each step." 
}, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-18", "text": "1 Table 1 shows a real example of how the DELETER gradually turns \"i think america is still a fairly crowded country by the way .\" into \"america is a country .\", where each intermediate sequence between them is a completely grammatical sentence." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-19", "text": "Interestingly, the model can also be used to delete extra words in a sentence, such as turning \"i work work at a company .\" into \"i work at a company .\"" }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-20", "text": "As shown in the table, unlike previous pureblack-box approaches, our model not only provides the final compression, but also exposes the full deletion path, making the compression process more comprehensible and controllable, so that it is easy to freeze certain key words from the original sentence or enforce a certain minimum/maximum compression ratio." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-21", "text": "We apply DELETER to the task of extractive Sentence Compression, and found that our 1 We will release all our code and model outputs soon." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-22", "text": "----------------------------------" }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-23", "text": "**THE DELETER MODEL**" }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-24", "text": "Our DELETER employs progressive lookahead greedy search based on a pretrained BERT, which is used to assign a negative log likelihood for each token in a sentence to derive a score for each intermediate sentence." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-25", "text": "The algorithm aims to minimize the average score along the path to ensure that each intermediate sentence is grammatical." 
}, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-26", "text": "Task Formulation Given a sentence, there are finite possible tokens we could keep deleting until running out of tokens." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-27", "text": "In our setting, we do not go back and add tokens to an intermediate sentence." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-28", "text": "We thus formulate the task of Sentence Compression as finding an optimal deletion path along a Directed Acyclic Graph." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-29", "text": "Each node in this graph is either the original sentence itself (i.e., the root node, which has no incoming edges) or a subsequence of it." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-30", "text": "Each node has outgoing edges pointing to each sentence resulting from a legal deletion of the current node (note that our model allows deletion of multiple tokens in one step)." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-31", "text": "Each edge is also associated with a cost, which is the Average Perplexity Score (AvgPPL) of all tokens in the sentence as assigned by BERT." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-32", "text": "We consider a path optimal if the average score of all nodes along the path is minimized." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-33", "text": "Sentence Scoring The DELETER assigns each sentence in the graph an AvgPPL score as shown in the equation below." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-34", "text": "Lower score indicates more syntactically coherent sentence." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-35", "text": "{T i } are tokens present in the current sentence and {D i } refer to the collection of all deleted tokens." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-36", "text": "Note that each log D i is obtained by scoring the corresponding token in the original sentence." 
}, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-37", "text": "This second term is crucial to the algorithm because otherwise the model would prefer to delete rarer tokens, whose unigram probabilities are lower (i.e., of higher negative log likelihood)." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-38", "text": "Progressive Lookahead Greedy Search DELETER allows deleting multiple tokens at a time, but naively probing for deletions consisting of multiple tokens will result in computation workload scaling with O(m k ) , where m is the the number of tokens in the current intermediate sentence and k is the maximum deletion length for each step." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-39", "text": "We thus adopt a progressive approach to help reduce the search space." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-40", "text": "Figure 1 shows how a real probing step looks like for the sentence \"she sings with me .\"." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-41", "text": "For each step, we first probe for deletions consisting of only one token; if no candidate obtains a score below a threshold, we increase the lookahead length to 2, and so forth." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-42", "text": "A candidate is above this threshold T if:" }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-43", "text": "where \u03b1 (set to 0.04 in our experiments) is a hyperparameter controlling the extent of increase on AvgPPL for each step." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-44", "text": "The lower it is, the better the compression quality (with the price of being slower)." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-45", "text": "L root is the number of tokens in the original sentence." 
}, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-46", "text": "We include this term because shorter sentences tend to have lower percentage of increase on AvgPPL, in which case we want to lower the threshold." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-47", "text": "Note that during inference, this threshold also functions as the termination condition in case we are only allowed to select one sentence from the deletion path (e.g. for Sentence Compression)." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-48", "text": "During probing, we also want to slightly discourage the model from deleting multiple tokens at a time, since we want it to traverse the deletion path as gradually as possible to have more intermediate compressions." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-49", "text": "We thus further multiply AvgPPL by L \u03b2 s when probing for the next deletion, where L s is the length of the current sentence and \u03b2 (set to 0.04 in our experiments) indicates how \"gradual\" we want the DELETER to proceed." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-50", "text": "----------------------------------" }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-51", "text": "**EXPERIMENTAL SETUP DELETER MODEL DETAILS**" }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-52", "text": "We use a pretrained BERT uncased model implemented by HuggingFace." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-53", "text": "2 To score any token in a sentence, we use the special token [mask] to mask out the target token, and then prepend a [CLS] and append a [SEP] , which function as [START] and [END] tokens." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-54", "text": "We use both tokens to help the model evaluate whether the current first or last token can function as a sentence start or end." 
}, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-55", "text": "This is also the main reason that we did not choose other competitive language models such as GPT-2 (Radford et al., 2019) : because these models are not bidirectional, they will be much less sensitive to deletions from the very end of a sentence." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-56", "text": "However, note that since BERT is not intended to be a language model, 3 we mask one token at a time to obtain the negative log probability of each." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-57", "text": "The maximum lookahead length in our work is set to 3 to facilitate deletion of phrases which contain multiple tokens, such as \"by the way\" in Table 1 ." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-58", "text": "Datasets We experiment on two datasets." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-59", "text": "Google sentence compression dataset (Filippova and Altun, 2013) (2015), we used the first 1, 000 sentences of evaluation set as our test set." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-60", "text": "The Gigaword dataset (Napoles et al., 2012) with 1.02 million examples, where the first 200 are labeled for extractive Sentence Compression by two annotators (Zhao et al., 2018) ." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-61", "text": "5 Automatic and Human Evaluation Following Wang et al. (2017) , we use the ground truth compressed sentences to compute F1 scores." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-62", "text": "Although F1 can be indicative of how well a compression model performs, we note that their could be multiple viable compressions for the same sentence, which single-reference ground-truth cannot cover (Handler et al., 2019) ." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-63", "text": "Thus to faithfully evaluate the quality of the compressions generated by our model, we follow Zhao et al. 
(2018) and conducted human studies on the Gigaword dataset with Amazon Mechanical Turk (MTurk)." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-64", "text": "6 Specifically, we sample 150 examples from the test set, and put our model output side-by-side with the two compressions created by the two annotators." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-65", "text": "7 The three compressions were randomly shuffled to anonymize model identity." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-66", "text": "We hire annotators who have an approval rate of at least 98% and 10, 000 approved HITs." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-67", "text": "Following (Zhao et al., 2018) , we employed Readability and Informativeness as criteria on a five-point Likert scale." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-68", "text": "----------------------------------" }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-69", "text": "**RESULTS AND ANALYSIS**" }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-70", "text": "Automatic Evaluation We report F1 scores on both Google and Gigaword dataset." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-71", "text": "Since each of the 200 sentences from Giga test set has two references (by Annotator 1 and 2, respectively), we report two F1's following Zhao et al. (2018) ." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-72", "text": "As shown in Table 2 , our model is competitive (esp." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-73", "text": "for Annotator #2) with state-of-the-art supervised models trained on 1.02 million examples with similar compression ratio." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-74", "text": "We also report F1 results on the Google dataset (Table 4) ." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-75", "text": "We can see that for this dataset our F1 is still far away from state-of-the-art." 
}, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-76", "text": "We reason that this is partly because the ground-truth compressions in Google dataset are based on news headlines (later edited by human) rather than compressions generated directly by human beings." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-77", "text": "8" }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-78", "text": "Human Evaluation As mentioned in Section 3, we conducted human evaluation on Gigaword." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-79", "text": "As shown in Table 4 , the Readability of our model is close to that of Annotator #1, while there is a larger gap on Informativeness between the two." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-80", "text": "This is expected because our model is syntaxaware rather than semantic-aware." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-81", "text": "For example, the DELETER would readily delete negation words (e.g., from \"i do not love NLP\" to \"i do love NLP\"), which usually causes abrupt change in meaning." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-82", "text": "We leave enforcing semantic consistency by leveraging Neural Language Inference datasets (Williams et al., 2017) as future work." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-83", "text": "It is also interesting to note that there is a larger gap between Annotator #1 and #2 than between Annotator #1 and our model, indicating that the quality of compression is a fairly subject matter." 
}, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-84", "text": "----------------------------------" }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-85", "text": "**RELATED WORK**" }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-86", "text": "Sentence Compression task has been investigated by various previous work (Jing, 2000; Knight and Marcu, 2000; Clarke and Lapata, 2006; McDonald, 2006; Clarke and Lapata, 2008; Filippova and Strube, 2008; Berg-Kirkpatrick et al., 2011) , where the more recent work tend to adopt a neural approach (Filippova and Alfonseca, 2015; Kamigaito et al., 2018; C\u00edfka et al., 2018) ." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-87", "text": "Our model is also neural-based since it leverages a neural language model." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-88", "text": "Our work differs from previous work in that it does not require any syntactic information such as POS tags or dependency parse tree, which is an advantage especially for low-resource language where training data for tagger or parser is scarce." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-89", "text": "We note that Baziotis et al. (2019) also built an unsupervised model for abstractive sentence compression." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-90", "text": "They trained a \"sequence-to-sequence-tosequence\" autoencoder to first compress the original sentence, and then reconstruct it." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-91", "text": "Similar to our AvgPPL, Zhao et al. (2018) also employed average Perplexity (though without our length correction terms) as the reward to a policy network trained with reinforcement learning." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-92", "text": "Another characteristic of our model is that we obtain a sequence of sentences which are all valid compressions of the original one, while other models usually generate only one compression." 
}, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-93", "text": "----------------------------------" }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-94", "text": "**CONCLUSION**" }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-95", "text": "We introduced DELETER which finds an optimal deletion path of a sentence where each node along the path is a grammatical subsequence." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-96", "text": "We applied this model to two Sentence Compression datasets and found that they are comparable with state-of-the-art supervised models." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-97", "text": "Note that Sentence Compression has not explored the full power of this model since it only selects one sentence as the output." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-98", "text": "We plan to apply DELETER to tasks such as Data Augmentation, where the training data of each epoch is dynamically assembled by randomly sampling an intermediate sentence along the deletion path of each sentence." }, { "sent_id": "81dd7a27479f0cec3a01337c57ca95-C001-99", "text": "This approach can also be leveraged as a way to generate adversarial examples (Niu and Bansal, 2018) because deleting a few tokens usually preserve semantics of the original sentence." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "81dd7a27479f0cec3a01337c57ca95-C001-12", "81dd7a27479f0cec3a01337c57ca95-C001-13" ] ], "cite_sentences": [ "81dd7a27479f0cec3a01337c57ca95-C001-13" ] }, "@USE@": { "gold_contexts": [ [ "81dd7a27479f0cec3a01337c57ca95-C001-60" ], [ "81dd7a27479f0cec3a01337c57ca95-C001-63" ], [ "81dd7a27479f0cec3a01337c57ca95-C001-67" ], [ "81dd7a27479f0cec3a01337c57ca95-C001-71" ] ], "cite_sentences": [ "81dd7a27479f0cec3a01337c57ca95-C001-60", "81dd7a27479f0cec3a01337c57ca95-C001-63", "81dd7a27479f0cec3a01337c57ca95-C001-67", "81dd7a27479f0cec3a01337c57ca95-C001-71" ] }, "@SIM@": { "gold_contexts": [ [ "81dd7a27479f0cec3a01337c57ca95-C001-91" ] ], "cite_sentences": [ "81dd7a27479f0cec3a01337c57ca95-C001-91" ] } } }, "ABC_1f463f2f87bc2d572299d96481084f_32": { "x": [ { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-2", "text": "Randomized methods of significance testing enable estimation of the probability that an increase in score has occurred simply by chance." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-3", "text": "In this paper, we examine the accuracy of three randomized methods of significance testing in the context of machine translation: paired bootstrap resampling, bootstrap resampling and approximate randomization." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-4", "text": "We carry out a large-scale human evaluation of shared task systems for two language pairs to provide a gold standard for tests." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-5", "text": "Results show very little difference in accuracy across the three methods of significance testing." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-6", "text": "Notably, accuracy of all test/metric combinations for evaluation of English-to-Spanish are so low that there is not enough evidence to conclude they are any better than a random coin toss." 
}, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-7", "text": "----------------------------------" }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-9", "text": "Automatic metrics, such as BLEU (Papineni et al., 2002) , are widely used in machine translation (MT) as a substitute for human evaluation." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-10", "text": "Such metrics commonly take the form of an automatic comparison of MT output text with one or more human reference translations." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-11", "text": "Small differences in automatic metric scores can be difficult to interpret, however, and statistical significance testing provides a way of estimating the likelihood that a score difference has occurred simply by chance." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-12", "text": "For several metrics, such as BLEU, standard significance tests cannot be applied due to scores not comprising the mean of individual sentence scores, justifying the use of randomized methods." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-13", "text": "Bootstrap resampling was one of the early randomized methods proposed for statistical significance testing of MT (Germann, 2003; Och, 2003; Kumar and Byrne, 2004; Koehn, 2004) , to assess for a pair of systems how likely a difference in BLEU scores occurred by chance." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-14", "text": "Empirical tests detailed in Koehn (2004) show that even for test sets as small as 300 translations, BLEU confidence intervals can be computed as accurately as if they had been computed on a test set 100 times as large." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-15", "text": "Approximate randomization was subsequently proposed as an alternate to bootstrap resampling (Riezler and Maxwell, 2005) ." 
}, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-16", "text": "Theoretically speaking, approximate randomization has an advantage over bootstrap resampling, in that it does not make the assumption that samples are representative of the populations from which they are drawn." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-17", "text": "Both methods require some adaptation in order to be used for the purpose of MT evaluation, such as combination with an automatic metric, and therefore it cannot be taken for granted that approximate randomization will be more accurate in practice." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-18", "text": "Within MT, approximate randomization for the purpose of statistical testing is also less common." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-19", "text": "Riezler and Maxwell (2005) provide a comparison of approximate randomization with bootstrap resampling (distinct from paired bootstrap resampling), and conclude that since approximate randomization produces higher p-values for a set of apparently equally-performing systems, it more conservatively concludes statistically significant differences, and recommend preference of approximate randomization over bootstrap resampling for MT evaluation." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-20", "text": "Conclusions drawn from experiments provided in Riezler and Maxwell (2005) are oft-cited, with experiments interpreted as evidence that bootstrap resampling is overly optimistic in reporting significant differences (Riezler and Maxwell, 2006; Koehn and Monz, 2006; Galley and Manning, 2008; Green et al., 2010; Monz, 2011; Clark et al., 2011) ." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-21", "text": "Our contribution in this paper is to revisit statistical significance tests in MT -namely, bootstrap resampling, paired bootstrap resampling and approximate randomization -and find problems with the published formulations." 
}, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-22", "text": "We redress these issues, and apply the tests in statistical testing of two language pairs." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-23", "text": "Using human judgments of translation quality, we find only very minor differences in significance levels across the three tests, challenging claims made in the literature about relative merits of tests." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-24", "text": "----------------------------------" }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-25", "text": "**REVISITING STATISTICAL SIGNIFICANCE TESTS FOR MT EVALUATION**" }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-26", "text": "First, we revisit the formulations of bootstrap resampling and approximate randomization algorithms as presented in Riezler and Maxwell (2005) ." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-27", "text": "At first glance, both methods appear to be two-tailed tests, with the null hypothesis that the two systems perform equally well." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-28", "text": "A better comparison of p-values would first require doubling the values of the one-sided bootstrap, leaving those of the two-sided approximate randomization algorithm as-is." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-29", "text": "The results of the two tests on this basis are extremely close, and in fact, in two out of the five comparisons, those of the bootstrap would have marginally higher pvalues than those of approximate randomization." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-30", "text": "As such, it is conceivable to conclude that the experiments actually show no substantial difference in Type I error between the two tests, which is consistent with results published in other fields of research (Smucker et al., 2007) ." 
}, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-31", "text": "We also note that the pseudo-code contains an unconventional computation of mean pseudo-statistics, \u03c4 B , for shiftto-zero." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-32", "text": "Rather than speculate over whether these issues with the original paper were simply presentational glitches or the actual basis of the experiments reported on in the paper, we present a normalized version of the two-sided bootstrap algorithm in Figure 1 , and report on the results of our own experiments in Section 4." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-33", "text": "We compare this method with approximate randomization and also paired bootstrap resampling (Koehn, 2004) , which is widely used in MT evaluation." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-34", "text": "We carry out evaluation over a range of MT systems, not only including pairs of systems that perform equally well, but also pairs of systems for which one system performs marginally better than the other." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-35", "text": "This enables evaluation of not only Type I error, but the overall accuracy of the tests." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-36", "text": "We carry out a large-scale human evaluation of all WMT 2012 shared task participating systems for two language pairs, and collect sufficient human judgments to facilitate statistical significance tests." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-37", "text": "This human evaluation data then provides a gold-standard against which to compare randomized tests." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-38", "text": "Since all randomized tests only function in combination with an automatic MT evaluation metric, we present results of each randomized test across four different MT metrics." 
}, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-39", "text": "----------------------------------" }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-40", "text": "**RANDOMIZED SIGNIFICANCE TESTS**" }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-41", "text": "----------------------------------" }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-42", "text": "**BOOTSTRAP RESAMPLING**" }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-43", "text": "Bootstrap resampling provides a way of estimating the population distribution by sampling with replacement from a representative sample (Efron and Tibshirani, 1993) ." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-44", "text": "The test statistic is taken as the difference in scores of the two systems, S X \u2212 S Y , which has an expected value of 0 under the null hypothesis that the two systems perform equally well." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-45", "text": "A bootstrap pseudo-sample consists of the translations by the two systems (X b , Y b ) of a bootstrapped test set (Koehn, 2004) , constructed by sampling with replacement from the original test set translations." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-46", "text": "The bootstrap distribution S boot of the test statistic is estimated by calculating the value of the pseudo-statistic The null hypothesis distribution S H 0 can be estimated from S boot by applying the shift method (Noreen, 1989 ), which assumes that S H 0 has the same shape but a different mean than S boot ." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-47", "text": "Thus, S boot is transformed into S H 0 by subtracting the mean bootstrap statistic from every value in S boot ." 
}, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-48", "text": "Once this shift-to-zero has taken place, the null hypothesis is rejected if the probability of observing a more extreme value than the actual statistic is lower than a predetermined p-value \u03b1, which is typically set to 0.05." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-49", "text": "In other words, the score difference is significant at level 1 \u2212 \u03b1." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-50", "text": "Figure 3 provides a one-sided implementation of bootstrap resampling, where H 0 is that the score of System X is less than or equal to the score of Figure 5 includes a typical example of bootstrap resampling applied to BLEU, for a pair of systems for which differences in scores are significant, while Figure 6 shows the same for ME-TEOR but for a pair of systems with no significant difference in scores." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-51", "text": "----------------------------------" }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-52", "text": "**APPROXIMATE RANDOMIZATION**" }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-53", "text": "Unlike bootstrap, approximate randomization does not make any assumptions about the population distribution." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-54", "text": "To simulate a distribution for the null hypothesis that the scores of the two systems are the same, translations are shuffled between the two systems so that 50% of each pseudo-sample is drawn from each system." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-55", "text": "In the context of machine translation, this can be interpreted as each translation being equally likely to have been produced by one system as the other (Riezler and Maxwell, 2005) ." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-56", "text": "The test statistic is taken as the difference in scores of the two systems, S X \u2212 S Y ." 
}, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-57", "text": "If there is a total of S sentences, then a total of 2 S shuffles is possible." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-58", "text": "If S is large, instead of generating all 2 S possible combinations, we instead generate samples by randomly permuting translations between the two systems with equal probability." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-59", "text": "The distribution of the test statistic under the null hypothesis is approximated by calculating the pseudostatistic, S Xr \u2212 S Yr , for each sample." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-60", "text": "As before, the null hypothesis is rejected if the probability of observing a more extreme value than the actual test statistic is lower than \u03b1." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-61", "text": "Figure 2 provides a one-sided implementation of approximate randomization for MT evaluation, where the null hypothesis is that the score of System X is less than or equal to the score of System Y ." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-62", "text": "Figure 5 shows a typical example of pseudostatistic distributions for approximate randomization for a pair of systems with a small but significant score difference according to BLEU, and Figure 6 shows the same for METEOR applied to a pair of systems where no significant difference is concluded." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-63", "text": "----------------------------------" }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-64", "text": "**PAIRED BOOTSTRAP RESAMPLING**" }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-65", "text": "Paired bootstrap resampling (Koehn, 2004 ) is shown in Figure 4 ." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-66", "text": "Unlike the other two randomized tests, this method makes no attempt to simulate the null hypothesis distribution." 
}, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-67", "text": "Instead, bootstrap samples are used to estimate confidence intervals of score differences, with confidence intervals not containing 0 implying a statistically significant difference." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-68", "text": "We compare what takes place with the two other tests, by plotting differences in scores for bootstrapped samples," }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-69", "text": "----------------------------------" }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-70", "text": "**EVALUATION**" }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-71", "text": "In order to evaluate the accuracy of the three randomized significance significance tests, we compare conclusions reached in a human evaluation of shared task participant systems." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-72", "text": "We carry out a large-scale human evaluation of all participating systems from WMT 2012 (Callison-Burch et al., 2012) for the Spanish-to-English and English-toSpanish translation tasks." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-73", "text": "Large numbers of human assessments of translations were collected using Amazon's Mechanical Turk, with strict quality control filtering (Graham et al., 2013) ." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-74", "text": "A total of 82,100 human adequacy assessments and 62,400 human fluency assessments were collected." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-75", "text": "After the removal of quality control items and filtering of judgments from low-quality workers, this resulted in an average of 1,280 adequacy and 1,013 fluency assessments per system for Spanishto-English (12 systems), and 1,483 adequacy and 1,534 fluency assessments per system for Englishto-Spanish (11 systems)." 
}, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-76", "text": "To remove bias with respect to individual human judge preference scoring severity/leniency, scores provided by each human assessor were standardized according to the mean and standard deviation of all scores provided by that individual." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-77", "text": "Significance tests were carried out over the scores for each pair of systems separately for adequacy and fluency assessments using the Wilcoxon rank-sum test." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-78", "text": "Figure 7 shows pairwise significance test results for fluency, adequacy and the combination of the two tests, for all pairs of Spanish-to-English systems." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-79", "text": "Combined fluency and adequacy significance test results are constructed as follows: if a system's adequacy score is significantly greater than that of another, the combined conclusion is that it is significantly better, at that significance level." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-80", "text": "Only when a tie in adequacy scores occurs are fluency judgments used to break the tie." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-81", "text": "In this case, p-values from significance tests applied to fluency scores of that system pair are used." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-82", "text": "For example, in Figure 7 , adequacy scores of System B are not significantly greater than those of Systems C, D and E, while fluency scores for System B are significantly greater than those of the three other systems." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-83", "text": "The combined result for each pair of systems is therefore taken as the p-value from the corresponding fluency significance test." 
}, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-84", "text": "We use the combined human evaluation pairwise significant tests as a gold standard against which to evaluate the randomized methods of statistical significance testing." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-85", "text": "We evaluate paired bootstrap resampling (Koehn, 2004) and bootstrap resampling as shown in Figure 3 and approximate randomization as shown in Figure 2 , each in combination with four automatic MT metrics: BLEU (Papineni et al., 2002) , NIST (NIST, 2002) , METEOR (Banerjee and Lavie, 2005) and TER (Snover et al., 2006) ." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-86", "text": "Figure 8 shows the outcome of pairwise randomized significance tests for each metric for Spanishto-English systems, and all three randomized methods produce p-values so similar that when \u03b1 thresholds are applied, all three tests produce precisely the same set of pairwise conclusions for each metric." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-87", "text": "When tests are combined with METEOR and TER, similar results are observed: at the \u03b1 thresholds of 0.05 and 0.01, precisely the same conclusions are drawn for both metrics combined with each of the three tests, and at most a difference of two conclusions at the lowest \u03b1 level." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-88", "text": "Table 2 shows the accuracy of each test on the English-to-Spanish data, showing much the same set of conclusions at all \u03b1 levels." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-89", "text": "For BLEU and NIST, all three tests again produce precisely the same conclusions, at p < 0.01 there is at most a single different conclusion for METEOR, and only at the lowest p-value level is there a single difference for TER." 
}, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-90", "text": "Finally, we examine which combination of metric and test is most accurate for each language pair at the conventional significance level of p < 0.05." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-91", "text": "For Spanish-to-English evaluation, NIST combined with any of the three randomized tests is most accurate, making 54 out of 66 (82%) correct conclusions." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-92", "text": "For English-to-Spanish, BLEU in combination with any of the three randomized tests, is most accurate at 62%." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-93", "text": "For both language pairs, however, differences in accuracy for metrics are not significant (Chi-square test)." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-94", "text": "----------------------------------" }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-95", "text": "**RESULTS AND DISCUSSION**" }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-96", "text": "For English-to-Spanish evaluation, an accuracy as low as 62% should be a concern." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-97", "text": "This level of accuracy for significance testing -only making the correct conclusion in 6 out of 10 tests -acts as a reminder that no matter how sophisticated the significance test, it will never make up for flaws in an underlying metric." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-98", "text": "When we take into account the fact that lower confidence limits all fall below 50%, significance tests based on these metrics for English-to-Spanish are effectively no better than a random coin toss." 
}, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-99", "text": "----------------------------------" }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-100", "text": "**CONCLUSIONS**" }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-101", "text": "We provided a comparison of bootstrap resampling and approximate randomization significance tests for a range of automatic machine translation evaluation metrics." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-102", "text": "To provide a goldstandard against which to evaluate randomized tests, we carried out a large-scale human evaluation of all shared task participating systems for the Spanish-to-English and English-to-Spanish translation tasks from WMT 2012." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-103", "text": "Results showed for many metrics and significance levels that all three tests produce precisely the same set of conclusions, and when conclusions do differ, it is commonly only by a single contrasting conclusion, which is not significant." }, { "sent_id": "1f463f2f87bc2d572299d96481084f-C001-104", "text": "For English-to-Spanish MT, the results of the different MT evaluation metric/significance test combinations are not significantly higher than a random baseline." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "1f463f2f87bc2d572299d96481084f-C001-13" ], [ "1f463f2f87bc2d572299d96481084f-C001-14" ], [ "1f463f2f87bc2d572299d96481084f-C001-45" ] ], "cite_sentences": [ "1f463f2f87bc2d572299d96481084f-C001-13", "1f463f2f87bc2d572299d96481084f-C001-14", "1f463f2f87bc2d572299d96481084f-C001-45" ] }, "@DIF@": { "gold_contexts": [ [ "1f463f2f87bc2d572299d96481084f-C001-31", "1f463f2f87bc2d572299d96481084f-C001-32", "1f463f2f87bc2d572299d96481084f-C001-33" ] ], "cite_sentences": [ "1f463f2f87bc2d572299d96481084f-C001-33" ] }, "@MOT@": { "gold_contexts": [ [ "1f463f2f87bc2d572299d96481084f-C001-31", "1f463f2f87bc2d572299d96481084f-C001-32", "1f463f2f87bc2d572299d96481084f-C001-33" ] ], "cite_sentences": [ "1f463f2f87bc2d572299d96481084f-C001-33" ] }, "@USE@": { "gold_contexts": [ [ "1f463f2f87bc2d572299d96481084f-C001-45" ] ], "cite_sentences": [ "1f463f2f87bc2d572299d96481084f-C001-45" ] }, "@UNSURE@": { "gold_contexts": [ [ "1f463f2f87bc2d572299d96481084f-C001-65" ], [ "1f463f2f87bc2d572299d96481084f-C001-85" ] ], "cite_sentences": [ "1f463f2f87bc2d572299d96481084f-C001-65", "1f463f2f87bc2d572299d96481084f-C001-85" ] } } }, "ABC_b31acd3535cd740e609d45986fbf33_32": { "x": [ { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-2", "text": "Recent studies have shown that pre-trained contextual word embeddings, which assign the same word different vectors in different contexts, improve performance in many tasks. But while contextual embeddings can also be trained at the character level, the effectiveness of such embeddings has not been studied." }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-3", "text": "We derive character-level contextual embeddings from Flair (Akbik et al., 2018) , and apply them to a time normalization task, yielding major performance improvements over the previous state-of-the-art: 51% error reduction in news and 33% in clinical notes." 
}, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-4", "text": "We analyze the sources of these improvements, and find that pre-trained contextual character embeddings are more robust to term variations, infrequent terms, and cross-domain changes." }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-5", "text": "We also quantify the size of context that pretrained contextual character embeddings take advantage of, and show that such embeddings capture features like part-of-speech and capitalization." }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-6", "text": "----------------------------------" }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-8", "text": "Pre-trained language models (LMs) such as ELMo (Peters et al., 2018) , ULMFiT (Howard and Ruder, 2018) , OpenAI GPT (Radford et al., 2018) , Flair (Akbik et al., 2018) and Bert (Devlin et al., 2018) have shown great improvements in NLP tasks ranging from sentiment analysis to named entity recognition to question answering." }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-9", "text": "These models are trained on huge collections of unlabeled data and produce contextualized word embeddings, i.e., each word receives a different vector representation in each context, rather than a single common vector representation regardless of context as in word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) ." }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-10", "text": "Research is ongoing to study these models and determine where their benefits are coming from (Peters et al., 2018; Radford et al., 2018; Khandelwal et al., 2018; Zhang and Bowman, 2018) ." 
}, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-11", "text": "The analyses have focused on wordlevel models, yet character-level models have been shown to outperform word-level models in some NLP tasks, such as text classification (Zhang et al., 2015) , named entity recognition (Kuru et al., 2016) , and time normalization (Laparra et al., 2018a) ." }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-12", "text": "Thus, there is a need to study pre-trained contextualized character embeddings, to see if they also yield improvements, and if so, to analyze where those benefits are coming from." }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-13", "text": "All of the pre-trained word-level contextual embedding models include some character or subword components in their architecture." }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-14", "text": "For example, Flair is a forward-backward LM trained over characters using recurrent neural networks (RNNs), that generates pre-trained contextual word embeddings by concatenating the forward LM's hidden state for the word's last character and the backward LM's hidden state for the word's first character." }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-15", "text": "Flair achieves state-of-the-art or competitive results on part-of-speech tagging and named entity tagging (Akbik et al., 2018) ." }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-16", "text": "Though they do not pre-train a LM, Bohnet et al. (2018) similarly apply a bidirectional long short term memory network (LSTM) layer on all characters of a sentence and generate contextual word embeddings by concatenating the forward and backward LSTM hidden states of the first and last character in each word." }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-17", "text": "Together with other techniques, they achieve state-of-the-art performance on part-of-speech and morphological tagging." 
}, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-18", "text": "However, both Akbik et al. (2018) and Bohnet et al. (2018) discard all other contextual character embeddings, and no analyses of the models are performed at the character-level." }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-19", "text": "In the current paper, we derive pre-trained contextual character embeddings from Flair's forwardbackward LM trained on a 1-billion word corpus of English (Chelba et al., 2014) , and observe if these embeddings yield the same large improvements for character-level tasks as yielded by pre-trained contextual word embeddings for word-level tasks." }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-20", "text": "We aim to analyze where improvements come from (e.g., term variations, low frequency words) and what they depend on (e.g., embedding size, context size)." }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-21", "text": "We focus on the task of parsing time normalizations (Laparra et al., 2018b) , where large gains of character-level models over word-level models have been observed (Laparra et al., 2018a) ." }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-22", "text": "This task involves finding and composing pieces of a time expression to infer time intervals, so for example, the expression 3 days ago could be normalized to the interval [2019-03-01, 2019-03-02) ." }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-23", "text": "We first take a state-of-the-art neural network for parsing time normalizations (Laparra et al., 2018a) and replace its randomly initialized character embeddings with pre-trained contextual character embeddings." }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-24", "text": "After showing that this yields major performance improvements, we analyze the improvements to understand why pre-trained contextual character embeddings are so useful." 
}, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-25", "text": "Our contributions are:" }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-26", "text": "\u2022 We derive pre-trained contextual character embeddings from Flair (Akbik et al., 2018) , apply them to a state-of-the art time normalizer (Laparra et al., 2018a) , and obtain major performance improvements over the previous state-of-the-art: 51% error reduction in news and 33% error reduction in clinical notes." }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-27", "text": "\u2022 We demonstrate that pre-trained contextual character embeddings are more robust to term variations, infrequent terms, and crossdomain changes." }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-28", "text": "\u2022 We quantify the amount of context leveraged by pre-trained contextual character embeddings." }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-29", "text": "\u2022 We show that pre-trained contextual character embeddings remove the need for features like part-of-speech and capitalization." }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-30", "text": "----------------------------------" }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-31", "text": "**FRAMEWORK**" }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-32", "text": "The parsing time normalizations task is based on the Semantically Compositional Annotation of Time Expressions (SCATE) schema (Bethard and Parker, 2016) , in which times are annotated as compositional time entities." }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-33", "text": "Laparra et al. 
(2018a) decomposes the Parsing Time Normalizations task into two subtasks: a) time entity identification using a character-level sequence tagger which detects" }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-34", "text": "the spans of characters that belong to each time expression and labels them with their corresponding time entity; and b) time entity composition using a simple set of rules that links relevant entities together while respecting the entity type constraints imposed by the SCATE schema." }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-35", "text": "These two tasks are run sequentially using the predicted output of the sequence tagger as input to the rule-based time entity composition system." }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-36", "text": "In this paper, we focus on the character-level time entity identifier that is the foundation of Laparra et al. (2018a) 's model." }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-37", "text": "The sequence tagger is a multi-output RNN with three different input features, shown in Figure 1 ." }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-38", "text": "Features are mapped through an embedding layer, then fed into stacked bidirectional Gated Recurrent Units (bi-GRUs), and followed by a softmax layer." }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-39", "text": "There are three types of outputs per Laparra et al. (2018a) 's encoding of the SCATE schema, so there is a separate stack of bi-GRUs and a softmax for each output type." }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-40", "text": "We keep the original neural architecture and parameter settings in Laparra et al. (2018a) , and experiment with the following embedding layers: Rand(128): the original setting of Laparra et al. (2018a) , where 128-dimensional character embeddings are randomly initialized." 
}, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-41", "text": "Rand (4096) ning Flair forward-backward character-level LM Flair's forward and backward character-level language models over the text, and concatenating the hidden states from forward and backward character-level LMs for each character ." }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-42", "text": "We evaluate in the clinical and news domains, the former being more than 9 times larger and the latter having a more diverse set of labels." }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-43", "text": "Three different evaluation metrics are used for parsing time normalization tasks: identification of time entities, which evaluates the predicted span (offsets) and the SCATE type for each entity; parsing of time entities, which evaluates the span, the SCATE type, and properties for each time entity; interval extraction, which interprets parsed annotations as intervals along the timeline and measures the fraction of the correctly parsed intervals." }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-44", "text": "The SemeEval task description paper (Laparra et al., 2018b ) has more details on dataset statistics and evaluation metrics." }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-45", "text": "Table 3 shows how pre-trained contextual character embeddings improve performance on both term variations and low frequency words." }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-46", "text": "We define term variations as time entities that appear in the training data in the following patterns: both upper-case and lower-case, e.g., DAY, Day, and day; abbreviation with and without punctuation, e.g., AM and A.M.; or same stem, e.g., Month and Months, previously and previous." }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-47", "text": "In the dev and test sets, 30." 
}, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-48", "text": "over Rand(4096) on time entities with (+var) and without (-var) term variations." }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-49", "text": "Cont (4096) is always better than Rand(4096) so all differences are positive, but the improvements in +var are much larger than those of -var in the news domain (+8.4 vs. +1.6 and +15.0 vs. +8.7)." }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-50", "text": "In the clinical domain, where 9 times more training data is available, both +var and -var yield similar improvements." }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-51", "text": "We conclude that pre-trained contextual character embeddings are mostly helpful with term variations in low data scenarios." }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-52", "text": "We define infrequent terms as time entities that occur in the training set 10 or fewer times." }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-53", "text": "In the dev and test sets, 73.9-86.9% of terms are infrequent, with about one third of infrequent terms being numerical 3 ." }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-54", "text": "The bottom two rows of table 3 show the improvements in F 1 of the Cont(4096) over Rand(4096) on frequent (>10) and infrequent (\u226410) terms." }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-55", "text": "Cont(4096) is always better than Rand(4096), and in both domains the improvements on low frequency terms are always greater than those on high frequency terms (+8.1 vs. +2.4 in news dev, +17.6 vs. +5.0 in news test, etc.)." }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-56", "text": "We conclude that pre-trained contextual character embeddings improve the representations of low frequency words in both low and high data settings." 
}, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-57", "text": "----------------------------------" }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-58", "text": "**RESULTS**" }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-59", "text": "----------------------------------" }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-60", "text": "**ROBUSTNESS TO VARIANTS AND FREQUENCY**" }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-61", "text": "----------------------------------" }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-62", "text": "**ROBUSTNESS TO DOMAIN DIFFERENCES**" }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-63", "text": "To illustrate the ability of pre-trained contextual character embeddings to handle unseen data, we train the models in one domain and evaluate in the other, as shown in Table 4 ." }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-64", "text": "We find that Rand(128) and Rand(4096) achieve similar cross-domain performance, e.g., Rand(128) achieves 63.4% of F 1 on news dev and Rand (4096) improvements are significant (p < 0.001)." }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-65", "text": "We conclude that pre-trained contextual character embeddings generalize better across domains." }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-66", "text": "----------------------------------" }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-67", "text": "**GREATER RELIANCE ON NEARBY CONTEXT**" }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-68", "text": "Inspired by Khandelwal et al. (2018) 's analysis of the effective context size of a word-based language model, we present an ablation study to measure performance when contextual information is removed." 
}, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-69", "text": "Specifically, when evaluating models, we retain only the characters in a small window around each time entity in the dev and test sets, and replace all other characters with padding characters." }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-70", "text": "Figures 2a and 2b evaluate the Cont(4096), Rand(4096) and Rand(128) models across different context window sizes on the news dev and test set, respectively." }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-71", "text": "Rand(128) performs similarly across all context sizes, suggesting that it makes little use of context information." }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-72", "text": "Both Rand (4096) and Cont(4096) depend heavily of context: without any context information (context size 0), they perform worse than Rand(128)." }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-73", "text": "Cont (4096) is sensitive to the nearby context, with a \u223c10 point gain on news dev and \u223c15 point gain on news test from just the first 10 characters of context, putting it easily above Rand(128)." }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-74", "text": "Rand(4096) doesn't exceed the performance of Rand(128) until at least 50 characters of context." }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-75", "text": "Figures 2c and 2d shows similar trends in the clinical domain, except that the Rand(128) model now shows a small dependence on context, with a \u223c5 point drop on clinical dev and a \u223c3 drop on clinical test in the no-context setting." }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-76", "text": "Cont(4096) again makes large improvements in just the first 10 characters, and Rand(4096) now takes close to 100 characters of context to reach the performance of Rand(128)." 
}, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-77", "text": "We conclude that pre-trained contextual character embeddings make better use of local context, especially within the first 10 characters." }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-78", "text": "----------------------------------" }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-79", "text": "**ENCODING WORD CATEGORIES**" }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-80", "text": "We perform a feature ablation to see if pre-trained contextual character embeddings capture basic syntax (e.g., part-of-speech) like pre-trained contextual word embeddings do (Peters et al., 2018; Akbik et al., 2018) ." }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-81", "text": "Table 5 shows that removing both part-of-speech and unicode category features from Cont (4096) Table 5 : Effect of features on performance: Performance (F 1 ) with different feature sets, including characters (C), part-of-speech tags (P), and unicode character categories (U)." }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-82", "text": "formance for both Rand(128) and Rand(4096) in all cases." }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-83", "text": "For example, Rand(4096) with all features achieves 82.7 F 1 on news dev, significantly better than the 80.5 F 1 of using only characters (p = 0.0467)." }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-84", "text": "We conclude that pre-trained contextual character embeddings encode a variety of word category information such as part-of-speech, capitalization, and punctuation." 
}, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-85", "text": "----------------------------------" }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-86", "text": "**CONCLUSION**" }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-87", "text": "We derive pre-trained character-level contextual embeddings from Flair (Akbik et al., 2018) , a wordlevel embedding model, inject these into a state-ofthe-art time normalization system, and achieve major performance improvements: 51% error reduction in news and 33% in clinical notes." }, { "sent_id": "b31acd3535cd740e609d45986fbf33-C001-88", "text": "Our detailed analysis concludes that pre-trained contextual character embeddings are more robust to term variations, infrequent terms, and cross-domain changes; that they benefit most from the first 10 characters of context; and that they encode part-of-speech, capitalization, and punctuation information." } ], "y": { "@USE@": { "gold_contexts": [ [ "b31acd3535cd740e609d45986fbf33-C001-3" ] ], "cite_sentences": [ "b31acd3535cd740e609d45986fbf33-C001-3" ] }, "@BACK@": { "gold_contexts": [ [ "b31acd3535cd740e609d45986fbf33-C001-8" ], [ "b31acd3535cd740e609d45986fbf33-C001-13", "b31acd3535cd740e609d45986fbf33-C001-14", "b31acd3535cd740e609d45986fbf33-C001-15" ] ], "cite_sentences": [ "b31acd3535cd740e609d45986fbf33-C001-8", "b31acd3535cd740e609d45986fbf33-C001-15" ] }, "@MOT@": { "gold_contexts": [ [ "b31acd3535cd740e609d45986fbf33-C001-18" ] ], "cite_sentences": [ "b31acd3535cd740e609d45986fbf33-C001-18" ] }, "@EXT@": { "gold_contexts": [ [ "b31acd3535cd740e609d45986fbf33-C001-19" ], [ "b31acd3535cd740e609d45986fbf33-C001-26" ], [ "b31acd3535cd740e609d45986fbf33-C001-87" ] ], "cite_sentences": [ "b31acd3535cd740e609d45986fbf33-C001-26", "b31acd3535cd740e609d45986fbf33-C001-87" ] }, "@UNSURE@": { "gold_contexts": [ [ "b31acd3535cd740e609d45986fbf33-C001-80" ] ], "cite_sentences": [ "b31acd3535cd740e609d45986fbf33-C001-80" ] } } }, "ABC_46cc0df5c6ed25f735cc0afd301ec8_32": { 
"x": [ { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-50", "text": "----------------------------------" }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-51", "text": "**MULTI-GRANULAR TEXT ENCODER**" }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-2", "text": "Self-explaining text categorization requires a classifier to make a prediction along with supporting evidence." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-3", "text": "A popular type of evidence is sub-sequences extracted from the input text which are sufficient for the classifier to make the prediction." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-4", "text": "In this work, we define multigranular ngrams as basic units for explanation, and organize all ngrams into a hierarchical structure, so that shorter ngrams can be reused while computing longer ngrams." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-5", "text": "We leverage a tree-structured LSTM to learn a contextindependent representation for each unit via parameter sharing." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-6", "text": "Experiments on medical disease classification show that our model is more accurate, efficient and compact than BiL-STM and CNN baselines." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-7", "text": "More importantly, our model can extract intuitive multi-granular evidence to support its predictions." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-8", "text": "----------------------------------" }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-10", "text": "Increasingly complex neural networks have achieved highly competitive results for many NLP tasks (Vaswani et al., 2017; Devlin et al., 2018) , but they prevent human experts from understanding how and why a prediction is made." 
}, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-11", "text": "Understanding how a prediction is made can be very important for certain domains, such as the medical domain." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-12", "text": "Recent research has started to investigate models with self-explaining capability, i.e. extracting evidence to support their final predictions (Li et al., 2015; Lei et al., 2016; Lin et al., 2017; Mullenbach et al., 2018) ." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-13", "text": "For example, in order to make diagnoses based on the medical report in Table 1 , the highlighted symptoms may be extracted as evidence." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-14", "text": "Two methods have been proposed on how to jointly provide highlights along with classification." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-15", "text": "(1) an extraction-based method (Lei et al., 2016) , which first extracts evidences from the original text and then makes a prediction solely based on the extracted evidences; (2) an attentionbased method (Lin et al., 2017; Mullenbach et al., 2018) , which leverages the self-attention mechaMedical Report: The patient was admitted to the Neurological Intensive Care Unit for close observation." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-16", "text": "She was begun on heparin anticoagulated carefully secondary to the petechial bleed ." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-17", "text": "She started weaning from the vent the next day." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-18", "text": "She was started on Digoxin to control her rate and her Cardizem was held." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-19", "text": "She was started on antibiotics for possible aspiration pneumonia ." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-20", "text": "Her chest xray showed retrocardiac effusion ." 
}, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-21", "text": "She had some bleeding after nasogastric tube insertion ." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-22", "text": "Diagnoses: Cerebral artery occlusion; Unspecified essential hypertension; Atrial fibrillation; Diabetes mellitus." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-23", "text": "nism to show the importance of basic units (words or ngrams) through their attention weights." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-24", "text": "However, previous work has several limitations." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-49", "text": "Structures for a sentence w 1 w 2 w 3 w 4 , where each node corresponds to a phrase or ngram." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-25", "text": "Lin et al. (2017) , for example, take single words as basic units, while meaningful information is usually carried by multi-word phrases." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-26", "text": "For instance, useful symptoms in Table 1 , such as \"bleeding after nasogastric tube insertion\", are larger than a single word." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-27", "text": "Another issue of Lin et al. (2017) is that their attention model is applied on the representation vectors produced by an LSTM." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-28", "text": "Each LSTM output contains more than just the information of that position, thus the real range for the highlighted position is unclear." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-29", "text": "Mullenbach et al. (2018) defines all 4-grams of the input text as basic units and uses a convolutional layer to learn their representations, which still suffers from fixed-length highlighting." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-30", "text": "Thus the explainability of the model is limited." 
}, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-31", "text": "Lei et al. (2016) introduce a regularizer over the selected (single-word) positions to encourage the model to extract larger phrases." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-32", "text": "However, their method can not tell how much a selected unit contributes to the model's decision through a weight value." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-33", "text": "In this paper, we study what the meaningful units to highlight are." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-34", "text": "We define multi-granular ngrams as basic units, so that all highlighted symptoms in Table 1" }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-35", "text": "----------------------------------" }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-36", "text": "**GENERIC ARCHITECTURE AND BASELINES**" }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-37", "text": "Our work leverages the attention-based selfexplaining method (Lin et al., 2017) , as shown in Figure 1 ." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-38", "text": "First, our text encoder ( \u00a73) formulates an input text into a list of basic units, learning a vector representation for each, where the basic units can be words, phrases, or arbitrary ngrams." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-39", "text": "Then, the attention mechanism is leveraged over all basic units, and sums up all unit representations based on the attention weights {\u03b1 1 , ..., \u03b1 n }." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-40", "text": "Eventually, the attention weight \u03b1 i will be used to reveal how important a basic unit h i is." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-41", "text": "The last prediction layer takes the fixed-length text representation t as input, and makes the final prediction." 
}, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-42", "text": "----------------------------------" }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-43", "text": "**BASELINES:**" }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-44", "text": "We compare two types of baseline text encoders in Figure 1 . (1) Lin et al. (2017) (BiLSTM), which formulates single word positions as basic units, and computes the vector h i for the i-th word position with a BiLSTM; (2) Extension of Mullenbach et al. (2018) (CNN) ." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-45", "text": "The original model in (Mullenbach et al., 2018) only utilizes 4-grams." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-46", "text": "Here we extend this model to take all unigrams, bigrams, and up to n-grams as the basic units." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-47", "text": "For a fair comparison, both our approach and the baselines share the same architecture, and the only difference is the text encoder used." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-48", "text": "volving seven pathogens , discusses pitfalls of routine blood cultures and examines the role of the laboratory in microbiologic diagnosis ." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-52", "text": "We propose the multi-granular text encoder to deal with drawbacks (as mentioned in the third paragraph of Section 1) of our baselines." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-53", "text": "Structural basic units: We define basic units for the input text as multi-granular ngrams, organizing ngrams in four different ways." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-54", "text": "Taking a synthetic sentence w 1 w 2 w 3 w 4 as the running example, we illustrate these structures in Figure 2 (a), (b), (c) and (d), respectively." 
}, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-55", "text": "The first is a tree structure (as shown in Figure 2(a) ) that includes all phrases from a (binarized) constituent tree over the input text, where no cross-boundary phrases exists." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-56", "text": "The second type (as shown in Figure 2 (b,c,d) ) includes all possible ngrams from the input text, which is a superset of the tree structure." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-57", "text": "In order to reuse representations of smaller ngrams while encoding bigger ngrams, all ngrams are organized into hierarchical structures in three different ways." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-58", "text": "First, inspired by Zhao et al. (2015) , a pyramid structure is created for all ngrams as shown in Figure 2(b) , where leaf nodes are unigrams of the input text, and higher level nodes correspond to higher-order ngrams." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-59", "text": "A disadvantage of the pyramid structure is that some lower level nodes may be used repeatedly while encoding higher level nodes, which may improperly augment the influence of the repeated units." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-60", "text": "For example, when encoding the trigram node \"w 1 w 2 w 3 \", the unigram node \"w 2 \" is used twice through two bigram nodes \"w 1 w 2 \" and \"w 2 w 3 \"." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-61", "text": "To tackle this issue, a left-branching forest structure is constructed for all ngrams as shown in Figure 2 (c), where ngrams with the same prefix are grouped together into a left-branching binary tree, and, in this arrangement, multiple trees construct a forest." 
}, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-62", "text": "Similarly, we construct a right-branching forest as shown in Figure 2(" }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-63", "text": "----------------------------------" }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-64", "text": "**D).**" }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-65", "text": "Encoding: We leverage a tree-structured LSTM composition function (Tai et al., 2015; Zhu et al., 2015; Teng and Zhang, 2016) to compute node embeddings for all structures in Figure 2 ." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-66", "text": "Formally, the state of each node is represented as a pair of one hidden vector h and one memory representation c, which are calculated by composing the node's label embedding x and states of its left child h l , c l and right child h r , c r with gated functions:" }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-67", "text": "where \u03c3 is the sigmoid activation function, is the elementwise product, i is the input gate, f l and f r are the forget gates for the left and right child respectively, and o is the output gate." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-68", "text": "We set x as the pre-trained word embedding for leaf nodes, and zero vectors for other nodes." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-69", "text": "The representations for all units (nodes) can be obtained by encoding each basic unit in a bottom-up order." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-70", "text": "Comparison with baselines Our encoder is more efficient than CNN while encoding bigger ngrams, because it reuses representations of smaller ngrams." 
}, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-71", "text": "Furthermore, the same parameters are shared across all ngrams, which makes our encoder more compact, whereas the CNN baseline has to define different filters for different order of ngrams, so it requires much more parameters." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-72", "text": "Experiments show that using basic units up to 7-grams to construct the forest structure is good enough, which makes our encoder more efficient than BiLSTM." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-73", "text": "Since in our encoder, all ngrams with the same order can be computed in parallel, and the model needs at most 7 iterative steps along the depth dimension for representing a given text of arbitrary length." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-74", "text": "tokens, and one label out of five categories indicating which disease this document is about." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-75", "text": "We randomly split the dataset into train/dev/test sets by 8:1:1 for each category, and end up with 11,216/1,442/1,444 instances for each set." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-76", "text": "Hyperparameters We use the 300-dimensional GloVe word vectors pre-trained from the 840B Common Crawl corpus (Pennington et al., 2014) , and set the hidden size as 100 for node embeddings." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-77", "text": "We apply dropout to every layer with a dropout ratio 0.2, and set the batch size as 50." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-78", "text": "We minimize the cross-entropy of the training set with the ADAM optimizer (Kingma and Ba, 2014) , and set the learning rate is to 0.001." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-79", "text": "During training, the pre-trained word embeddings are not updated." 
}, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-80", "text": "----------------------------------" }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-81", "text": "**PROPERTIES OF THE MULTI-GRANULAR ENCODER**" }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-82", "text": "Influence of the n-gram order: For CNN and our LeftForest encoder, we vary the order of ngrams from 1 to 9, and plot results in Figure 3 ." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-83", "text": "For BiLSTM, we draw a horizontal line according to its performance, since the ngram order does not apply." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-84", "text": "When ngram order is less than 3, both CNN and LeftForest underperform BiLSTM." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-85", "text": "When ngram order is over 3, LeftForest outperforms both baselines." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-86", "text": "Therefore, in terms of accuracy, our multi-granular text encoder is more powerful than baselines." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-87", "text": "Efficiency: We set ngram order as 7 for both CNN and our encoder." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-88", "text": "Table 2 shows the time cost (seconds) of one iteration over the training set and evaluation on the development set." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-89", "text": "BiLSTM is the slowest model, because it has to scan over the entire text sequentially." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-90", "text": "LeftForest is almost 2x faster than CNN, because LeftForest reuses lower-order ngrams while computing higher-order ngrams." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-91", "text": "This result reveals that our encoder is more efficient than baselines." 
}, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-92", "text": "Model size: In Table 2 , the last two columns show the accuracy and number of parameters for each model." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-93", "text": "LeftForest contains much less parameters than CNN, and it gives a better accuracy than BiLSTM with only a small amount of extra parameters." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-94", "text": "Therefore, our encoder is more compact." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-95", "text": "Table 3 lists the accuracy on the test set, where BiForest represents each ngram by concatenating representations of this ngram from both the LeftForest and the RightForest encoders." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-96", "text": "We get several interesting observations: (1) Our multigranular text encoder outperforms both the CNN and BiLSTM baselines regardless of the structure used; (2) The LeftForest and RightForest encoders work better than the Tree encoder, which shows that representing texts with more ngrams is helpful than just using the non-overlapping phrases from a parse tree; (3) The LeftForest and RightForest encoders give better performance than the Pyramid encoder, which verifies the advantages of organizing ngrams with forest structures; (4) There is no significant difference between the LeftForest encoder and the RightForest encoder." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-97", "text": "However, by combining them, the BiForest encoder gets the best performance among all models, indicating that the LeftForest encoder and the RightForest encoder complement each other for better accuracy." 
}, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-98", "text": "----------------------------------" }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-99", "text": "**MODEL PERFORMANCE**" }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-100", "text": "----------------------------------" }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-101", "text": "**ANALYSIS OF EXPLAINABILITY**" }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-102", "text": "Qualitative analysis The following text is a snippet of an example from the dev set." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-103", "text": "We leverage our BiForest model to extract ngrams whose attention scores are higher than 0.05, and use the bold font to highlight them." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-104", "text": "We extracted three ngrams as supporting evidence for its predicted category \"nervous system diseases\"." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-105", "text": "Both the spontaneous extradural spinal hematoma and the spinal arteriovenous malformation are diseases related to the spinal cord, therefore they are good evidence to indicate the text is about \"nervous system diseases\"." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-106", "text": "Quantitative analysis For each instance in the training set and the dev set, we utilize the attention scores from BiForest to sort all ngrams, and create different copies of the training set and development set by only keeping the first n important words." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-107", "text": "We then train and evaluate a BiLSTM model with the newly created dataset." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-108", "text": "We vary the number of words n among {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50} , and show the corresponding accuracy with the green triangles in Figure 4 ." 
}, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-109", "text": "We define a Random baseline by randomly selecting a sub-sequence containing n words, and plot its accuracy with blue squares in Figure 4 ." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-110", "text": "We also take a BiLSTM model trained with the entire texts as the upper bound (the horizontal line in Figure 4) ." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-111", "text": "When using only a single word for representing instances, single words extracted from our BiForest model are significantly more effective than randomly picked single words." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-112", "text": "When utilizing up to five extracted words from our BiForest model for representing each instance, we can obtain an accuracy very close to the upper bound." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-113", "text": "Therefore, the extracted evidence from our BiForest model are truly effective for representing the instance and its corresponding category." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-114", "text": "----------------------------------" }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-115", "text": "**CONCLUSION**" }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-116", "text": "We proposed a multi-granular text encoder for self-explaining text categorization." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-117", "text": "Comparing with the existing BiLSTM and CNN baselines, our model is more accurate, efficient and compact." }, { "sent_id": "46cc0df5c6ed25f735cc0afd301ec8-C001-118", "text": "In addition, our model can extract effective and intuitive evidence to support its predictions." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "46cc0df5c6ed25f735cc0afd301ec8-C001-12" ], [ "46cc0df5c6ed25f735cc0afd301ec8-C001-14", "46cc0df5c6ed25f735cc0afd301ec8-C001-15" ], [ "46cc0df5c6ed25f735cc0afd301ec8-C001-24", "46cc0df5c6ed25f735cc0afd301ec8-C001-25" ], [ "46cc0df5c6ed25f735cc0afd301ec8-C001-27", "46cc0df5c6ed25f735cc0afd301ec8-C001-28" ] ], "cite_sentences": [ "46cc0df5c6ed25f735cc0afd301ec8-C001-12", "46cc0df5c6ed25f735cc0afd301ec8-C001-15", "46cc0df5c6ed25f735cc0afd301ec8-C001-27" ] }, "@MOT@": { "gold_contexts": [ [ "46cc0df5c6ed25f735cc0afd301ec8-C001-24", "46cc0df5c6ed25f735cc0afd301ec8-C001-25" ], [ "46cc0df5c6ed25f735cc0afd301ec8-C001-27", "46cc0df5c6ed25f735cc0afd301ec8-C001-28" ] ], "cite_sentences": [ "46cc0df5c6ed25f735cc0afd301ec8-C001-27" ] }, "@USE@": { "gold_contexts": [ [ "46cc0df5c6ed25f735cc0afd301ec8-C001-37", "46cc0df5c6ed25f735cc0afd301ec8-C001-38", "46cc0df5c6ed25f735cc0afd301ec8-C001-39", "46cc0df5c6ed25f735cc0afd301ec8-C001-40", "46cc0df5c6ed25f735cc0afd301ec8-C001-41" ], [ "46cc0df5c6ed25f735cc0afd301ec8-C001-44" ] ], "cite_sentences": [ "46cc0df5c6ed25f735cc0afd301ec8-C001-37", "46cc0df5c6ed25f735cc0afd301ec8-C001-44" ] } } }, "ABC_8a8670fd7cfb8db9ddd3f546ce4534_32": { "x": [ { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-24", "text": "**RELATED WORK**" }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-44", "text": "Our motivation for S+R is two-fold." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-2", "text": "Word sense disambiguation aims to identify which meaning of a word is present in a given usage." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-3", "text": "Gathering word sense annotations is a laborious and difficult task." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-4", "text": "Several methods have been proposed to gather sense annotations using large numbers of untrained annotators, with mixed results." 
}, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-5", "text": "We propose three new annotation methodologies for gathering word senses where untrained annotators are allowed to use multiple labels and weight the senses." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-6", "text": "Our findings show that given the appropriate annotation task, untrained workers can obtain at least as high agreement as annotators in a controlled setting, and in aggregate generate equally as good of a sense labeling." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-7", "text": "----------------------------------" }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-9", "text": "Word sense annotation is regarded as one of the most difficult annotation tasks (Artstein and Poesio, 2008) and building manually-annotated corpora with highquality sense labels is often a time-and resourceconsuming task." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-10", "text": "As a result, nearly all sense-tagged corpora in wide-spread use are created using trained annotators (Hovy et al., 2006; Passonneau et al., 2010) , which results in a knowledge acquisition bottleneck for training systems that require sense labels (Gale et al., 1992) ." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-11", "text": "In other NLP areas, this bottleneck has been addressed through gathering annotations using many untrained workers on platforms such as Amazon Mechanical Turk (MTurk), a task commonly referred to as crowdsourcing." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-12", "text": "Recently, several works have proposed gathering sense annotations using crowdsourcing (Snow et al., 2008; Biemann and Nygaard, 2010; Passonneau et al., 2012b; Rumshisky et al., 2012) ." 
}, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-13", "text": "However, these methods produce sense labels that are different from the commonly used sense inventories such as WordNet (Fellbaum, 1998) or OntoNotes (Hovy et al., 2006) ." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-14", "text": "Furthermore, while Passonneau et al. (2012b) did use WordNet sense labels, they found the quality was well below that of trained experts." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-15", "text": "We revisit the task of crowdsourcing word sense annotations, focusing on two key aspects: (1) the annotation methodology itself, and (2) the restriction to single sense assignment." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-16", "text": "First, the choice in sense inventory plays an important role in gathering high-quality annotations; fine-grained inventories such as WordNet often contain several related senses for polysemous words, which untrained annotators find difficult to correctly apply in a given context (Chugur et al., 2002; McCarthy, 2006; Palmer et al., 2007; Rumshisky and Batiukova, 2008; Brown et al., 2010) ." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-17", "text": "However, many agreement studies have restricted annotators to using a single sense, which can significantly lower inter-annotator agreement (IAA) in the presence of ambiguous or polysemous usages; indeed, multiple studies have shown that when allowed, annotators readily assign multiple senses to a single usage (V\u00e9ronis, 1998; Murray and Green, 2004; Erk et al., 2009; Passonneau et al., 2012b) ." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-18", "text": "Therefore, we focus on annotation methodologies that enable workers to use as many labels as they feel appropriate, asking the question: if allowed to make labeling ambiguity explicit, will annotators agree?" 
}, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-19", "text": "Furthermore, we adopt the goal of Erk et al. (2009) , which enabled annotators to weight each sense by its applicability to the given context, thereby quantifying the ambiguity." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-20", "text": "This paper provides the following contributions." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-21", "text": "First, we demonstrate that the choice in annotation setup can significantly improve IAA and that the labels of untrained workers follow consistent patterns that enable creating high quality labeling from their aggregate." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-22", "text": "Second, we find that the sense labeling from crowdsourcing matches performance with annotators in a controlled setting." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-23", "text": "----------------------------------" }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-25", "text": "Given the potential utility of a sense-labeled corpus, multiple studies have examined how to efficiently gather high quality sense annotations." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-26", "text": "Snow et al. (2008) had MTurk workers, referred to as Turkers, disambiguate uses of \"president.\" While they reported extremely high IAA (0.952), their analysis was only performed on a single word." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-27", "text": "Biemann and Nygaard (2010) and Biemann (2012) construct a sense-labeled corpus by concurrently constructing the sense inventory itself." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-28", "text": "Turkers used a lexical substitution task to identify valid substitutions of a target word." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-29", "text": "The contexts for the resulting substitutions were clustered based on their word overlap and the resulting clusters were labeled as senses." 
}, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-30", "text": "Biemann and Nygaard (2010) showed that the number of sense definitions for a word in their inventory was correlated with the number in WordNet, often with their inventory having fewer senses by combining related meanings and omitting rare meanings." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-31", "text": "Hong and Baker (2011) evaluated multiple annotation strategies for gathering FrameNet sense annotations, ultimately yielding high (>90%) accuracy for most terms after filtering." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-32", "text": "They highlight ambiguous and polysemous usages as a notable source of errors, which the present work directly addresses." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-33", "text": "In the most related work, Passonneau et al. (2012b) had Turkers annotate contexts using one or more senses, with the requirement that a worker labels all contexts." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-34", "text": "While they found that agreement between all workers was low, their annotations could be combined using the GLAD model (Whitehill et al., 2000) to obtain good performance, though not as good as trained annotators." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-35", "text": "----------------------------------" }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-36", "text": "**ANNOTATION METHODOLOGIES**" }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-37", "text": "We consider three methodologies for gathering sense labels: (1) the methodology of Erk et al. (2009) for gathering weighted labels, (2) a multistage strategy that uses both binary and Likert ratings, and (3) MaxDiff, a paired choice format." 
}, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-38", "text": "Likert Ratings Likert rating scales provide the most direct way of gathering weighted sense labels; Turkers are presented with all senses of a word and then asked to rate each on a numeric scale." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-39", "text": "We adopt the annotation guidelines of Erk et al. (2009) which used a five-point scale, ranging from 1 to 5, indicating the sense does not apply or that it matches the contextual usage exactly, respectively." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-40", "text": "Select and Rate Recent efforts in crowdsourcing have proposed multi-stage processes for accomplishing complex tasks, where efforts by one group of workers are used to create new subtasks for other workers to complete (Bernstein et al., 2010; Kittur et al., 2011; Kulkarni et al., 2012) ." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-41", "text": "We propose a two-stage strategy that aims to reduce the complexity of the annotation task, referred to as Select and Rate (S+R)." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-42", "text": "First, Turkers are presented with all the senses and asked to make a binary choice of which senses apply." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-43", "text": "Second, a Likert rating task is created for only those senses whose selection frequency is above a threshold, thereby concentrating worker focus on a potentially smaller set of senses." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-45", "text": "First, the sense definitions of certain words may be unclear or misinterpreted by a minority of the Turkers, who then systematically rate inapplicable senses as applicable." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-46", "text": "The Select task can potentially remove such noise and therefore improve both IAA and rating quality in the subsequent Rate task." 
}, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-47", "text": "Second, while the present study analyzes words with 4-8 senses, we are ultimately interested in annotating highly polysemous words with tens of senses, which could present a significant cognitive burden for an annotator to rate concurrently." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-48", "text": "Here, the Select stage can potentially reduce the number of senses presented, leading to less cognitive burden in the Rate stage." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-49", "text": "Furthermore, as a pragmatic benefit, removing inapplicable senses reduces the visual space required for displaying the questions on the MTurk platform, which can improve annotation throughput." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-50", "text": "MaxDiff MaxDiff is an alternative to scale-based ratings in which Turkers are presented with a only subset of all of a word's senses and then asked to select (1) the sense option that best matches the mean-add.v ask.v win.v argument.n interest.n paper.n different.a important.a Erk et al. (2009) ing in the example context and (2) the sense option that least matches (Louviere, 1991) ." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-51", "text": "In our setting, we presented three options at a time for words with fewer than seven senses, and four options for those with seven senses." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-52", "text": "For a single context, multiple subsets of the senses are presented and then their relative ranking is used to produce the numeric rating." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-53", "text": "The final applicability ratings were produced using a modification of the counting procedure of Orme (2009) ." 
}, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-54", "text": "First, all sense ratings are computed as the number of times the sense was rated best minus the number of times rated least." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-55", "text": "Second, all negativelyrated senses are assigned score of 1, and all positively ratings are normalized to be (1, 5]." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-56", "text": "----------------------------------" }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-57", "text": "**EXPERIMENTS**" }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-58", "text": "For measuring the difference in methodologies, we propose three experiments based on different analyses of comparing Turker and non-Turker annotations on the same dataset, the latter of which we refer to as the reference labeling." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-59", "text": "First, we measure the ability of the Turkers individually by evaluating their IAA with the reference labeling." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-60", "text": "Second, many studies using crowdsourcing combine the results into a single answer, thereby leveraging the wisdom of the crowds (Surowiecki, 2005) to smooth over inconsistencies in the data." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-61", "text": "Therefore, in the second experiment, we evaluate different methods of combining Turker responses into a single sense labeling, referred to as an aggregate labeling, and comparing that with the reference labeling." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-62", "text": "Third, we measure the replicability of the Turker annotations (Kilgarriff, 1999 ) using a sampling methodology." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-63", "text": "Two equally-sized sets of Turker annotations are created by randomly sampling without replacement from the full set of annotations for each item." 
}, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-64", "text": "IAA is calculated between the aggregate labelings computed from each set." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-65", "text": "This sampling is repeated 50 times and we report the mean IAA as a measure of the expected degree of replicability when annotating using different groups of Turkers." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-66", "text": "For the reference sense labeling, we use a subset of the GWS dataset of Erk et al. (2009) , where three annotators rated 50 instances each for eight words." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-67", "text": "For clarity, we refer to these individuals as the GWS annotators." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-68", "text": "Given a word usage in a sentence, GWS annotators rated the applicability of all WordNet 3.0 senses using the same Likert scale as described in Section 3." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-69", "text": "Contexts were drawn evenly from the SemCor (Miller et al., 1993) and SENSEVAL-3 lexical substitution (Mihalcea et al., 2004) corpora." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-70", "text": "GWS annotators were apt to use multiple senses, with nearly all instances having multiple labels." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-71", "text": "For each annotation task, Turkers were presented with an identical set of annotation guidelines, followed by methodology-specific instructions." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-72", "text": "1 To increase the familiarity with the task, four instances were shown per task, with all instances using the same target word." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-73", "text": "Unlike Passonneau et al. (2012b) , we did not require a Turker to annotate all contexts for a single word; however many Turkers did complete the majority of instances." 
}, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-74", "text": "Both the Likert, Select, and Rate tasks used ten Turkers each." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-75", "text": "Senses were passed from Select to Rate if they received at least three votes." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-76", "text": "For MaxDiff, we gathered at least 3n annotations per context where n is the number of senses of the target word, ensuring that each sense appeared at least once." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-77", "text": "Due to resource limitations, we omitted the evaluation of argument.n for MaxDiff." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-78", "text": "Following the recommendation of Kosinski et al. (2012) , Turkers were paid $0.05USD for each Likert, Select, and Rate task." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-79", "text": "For MaxDiff, due to their shorter nature and comparably high volume, Turkers were paid $0.03USD per task." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-80", "text": "To ensure fluency in English as well as reduce the potential for low-quality results, we prefaced each task with a simple test question that asked the Turker to pick out a definition of the target word from a list of four options." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-81", "text": "The incorrect options were selected so that they would be nonsensical for anyone familiar with the target word." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-82", "text": "Additionally, we rejected all Turker responses where more than one option was missing a rating." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-83", "text": "In the case of missing ratings, we infer a rating of 1." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-84", "text": "Approximately 20-30% of the submissions were rejected by these criteria, underscoring the importance of filtering." 
}, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-85", "text": "For measuring IAA, we selected Krippendorff's \u03b1 (Krippendorff, 1980; Artstein and Poesio, 2008) , which is an agreement coefficient that handles missing data, as well as different levels of measurement, e.g., nominal data (Select and MaxDiff) and interval data (Likert and Rate)." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-86", "text": "2 Krippendorff's \u03b1 adjusts for chance, ranging between [\u22121, 1] for nominal data and (\u22121, 1] for interval data, where 1 indicates perfect agreement and -1 indicates systematic disagreement; random labels would have an expected \u03b1 of zero." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-87", "text": "We treat each sense and instance combination as a separate item to rate." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-88", "text": "----------------------------------" }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-89", "text": "**RESULTS**" }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-90", "text": "The results of the first experiment appear in the top of Table 1 ." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-91", "text": "Two important aspects emerge." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-92", "text": "First, the word itself plays a significant role in IAA." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-93", "text": "Though Erk et al. (2009) reported a pair-wise IAA of the GWS annotators between 0.466 and 0.506 using Spearman's \u03c1, the IAA varies considerably between words for both Turkers and GWS annotators when measured using Krippendorff's \u03b1." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-94", "text": "Second, the choice of annotation methodology significantly impacts IAA." 
}, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-95", "text": "While both the Likert and S+R tasks have lower IAA than the GWS annotators do, the MaxDiff annotators achieve higher IAA for almost all words." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-96", "text": "We hypothesize that comparing senses for applicability is an easier task for the untrained worker, rather than having to construct a mental scale of what constitutes the applicability of each sense." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-97", "text": "Surprisingly, the binary Select task has a lower IAA than the more complex the Likert task." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-98", "text": "An analysis of the duration of median task completion times for the Likert and Select tasks showed little difference (with the exception of paper.n, which was on average 50 second faster for Likert ratings), suggesting that both tasks are equally as cognitively demanding." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-99", "text": "In addition, the Rate task has the lowest IAA, despite its similarity to the Likert task." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-100", "text": "An inspection of the annotations shows that the full rating scale was used, so the low value is not due to Turkers always using the same rating, which would yield an IAA near chance." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-101", "text": "In the second experiment, we created a aggregate sense labeling and compared its IAA with the GWS annotators, shown in Table 1 (bottom)." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-102", "text": "For scalebased ratings, we considered three arithmetic operations for selecting the final rating: mode, median, and mean." 
}, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-103", "text": "We found that the mode yielded the highest average IAA for the Likert ratings and median for S+R; however, the differences in IAA using each operation were often small." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-104", "text": "We compare the IAA with GWS annotators against two baselines: one generated by sampling from the GWS annotators' rating distribution, and a second generated by uniformly sampling in [1, 5] ." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-105", "text": "By comparison, the aggregate labelings have a much larger IAA than the baselines, which is often at least as high as the IAA amongst the GWS annotators themselves, indicating that the Turkers in aggregate are capable of producing equivalent ratings." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-106", "text": "Of the three annotation methodologies, MaxDiff provides the highest IAA both within its annotators and with its aggregate key." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-107", "text": "Surprisingly, neither the Likert or S+R aggregate labeling appears better than the other." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-108", "text": "Based on the second experiment, we measured the average IAA across all words between the aggregate Likert and MaxDiff solutions, which was 0.472." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-109", "text": "However, this IAA is significantly affected by the annotations for win.v and different.a, which had the lowest IAA among Turkers (Table 1) For the third experiment, replicability is reported as the average IAA between the sampled aggregate labelings for all annotated words." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-110", "text": "Table 2 shows this IAA for Likert and MaxDiff methodologies in comparison to other sense annotation studies." 
}, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-111", "text": "Krippendorff (2004) recommends that an \u03b1 of 0.8 is necessary to claim high-quality agreement, which is achieved by the MaxDiff methodology." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-112", "text": "In contrast, the average IAA between sampled Likert ratings is significantly lower, though the methodology does achieve an \u03b1 of 0.812 for paper.n." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-113", "text": "However, when the two words with the lowest IAA, win.v and different.a, are excluded, the average \u03b1 increases to 0.880 for MaxDiff and 0.649 for Likert." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-114", "text": "Overall, these results suggest that MaxDiff can generate highly replicable annotations with agreement on par with that of other high-quality sense-labeled corpora." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-115", "text": "Furthermore, the Likert methodology may in aggregate still produce moderately replicable annotations in some cases." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-116", "text": "----------------------------------" }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-117", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-118", "text": "Word sense disambiguation is a difficult task, both for humans and algorithms, with an important bottleneck in acquiring large sense annotated corpora." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-119", "text": "As a potential solution, we proposed three annotation methodologies for crowdsourcing sense labels." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-120", "text": "Importantly, we relax the single sense assignment restriction in order to let annotators explicitly note ambiguity through weighted sense ratings." 
}, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-121", "text": "Our findings reveal that moderate IAA can be obtained using MaxDiff ratings, with IAA surpassing that of annotators in a controlled setting." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-122", "text": "Furthermore, our findings showed marked differences in rating difficulty per word, even in the weighted rating setting." }, { "sent_id": "8a8670fd7cfb8db9ddd3f546ce4534-C001-123", "text": "In future work, we will investigate what factors influence annotation difficulty in order to improve IAA to what is considered expert levels, drawing from existing work analyzing difficulty in the single label setting (Murray and Green, 2004; Passonneau et al., 2009; Cinkov\u00e1 et al., 2012) ." } ], "y": { "@MOT@": { "gold_contexts": [ [ "8a8670fd7cfb8db9ddd3f546ce4534-C001-17" ] ], "cite_sentences": [ "8a8670fd7cfb8db9ddd3f546ce4534-C001-17" ] }, "@BACK@": { "gold_contexts": [ [ "8a8670fd7cfb8db9ddd3f546ce4534-C001-17" ] ], "cite_sentences": [ "8a8670fd7cfb8db9ddd3f546ce4534-C001-17" ] }, "@SIM@": { "gold_contexts": [ [ "8a8670fd7cfb8db9ddd3f546ce4534-C001-19" ] ], "cite_sentences": [ "8a8670fd7cfb8db9ddd3f546ce4534-C001-19" ] }, "@UNSURE@": { "gold_contexts": [ [ "8a8670fd7cfb8db9ddd3f546ce4534-C001-37" ] ], "cite_sentences": [ "8a8670fd7cfb8db9ddd3f546ce4534-C001-37" ] }, "@USE@": { "gold_contexts": [ [ "8a8670fd7cfb8db9ddd3f546ce4534-C001-39" ], [ "8a8670fd7cfb8db9ddd3f546ce4534-C001-50" ], [ "8a8670fd7cfb8db9ddd3f546ce4534-C001-66" ] ], "cite_sentences": [ "8a8670fd7cfb8db9ddd3f546ce4534-C001-39", "8a8670fd7cfb8db9ddd3f546ce4534-C001-50", "8a8670fd7cfb8db9ddd3f546ce4534-C001-66" ] }, "@DIF@": { "gold_contexts": [ [ "8a8670fd7cfb8db9ddd3f546ce4534-C001-93" ] ], "cite_sentences": [ "8a8670fd7cfb8db9ddd3f546ce4534-C001-93" ] } } }, "ABC_d0c12613f09b36e071b9a842a4d844_32": { "x": [ { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-1", "text": "**ABSTRACT**" }, { "sent_id": 
"d0c12613f09b36e071b9a842a4d844-C001-2", "text": "Abstract." }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-3", "text": "The alignment of syntactic trees is the task of aligning the internal and leaf nodes of two sentences in different languages structured as trees." }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-4", "text": "The output of the alignment can be used, for instance, as knowledge resource for learning translation rules (for rule-based machine translation systems) or models (for statistical machine translation systems)." }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-5", "text": "This paper presents some experiments carried out based on two syntactic tree alignment algorithms presented in [Lavie et al. 2008] and [Tinsley et al. 2007 ]." }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-6", "text": "Aiming at improving the performance of internal nodes alignment, some approaches for combining the output of these two algorithms were evaluated in Brazilian Portuguese and English parallel trees." }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-7", "text": "----------------------------------" }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-9", "text": "The alignment of syntactic trees is the task of aligning the internal and leaf nodes of two sentences in different languages structured as trees." }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-10", "text": "More specifically, the parallel sentences are represented by syntactic trees generated separately for the source and target languages." }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-11", "text": "From this pair of syntactic trees, automatic methods determine the correspondences between source and target nodes (internal and leaf ones)." 
}, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-12", "text": "The alignment produced by the automatic methods can be very useful for Machine Translation (MT)." }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-13", "text": "This paper, therefore, proposes the combination of two syntactic tree alignment methods - [Lavie et al. 2008 ] (a bottom-up approach) and [Tinsley et al. 2007 ] (a topdown approach) -aiming at improving their performance evaluated on Brazilian Portuguese (pt) and English (en) pair of languages." }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-14", "text": "Moreover, some lexical alignment filters are proposed to filter out the misaligned leaf nodes." }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-15", "text": "The investigated hypotheses are: (i) it is possible to combine the baseline alignment methods and their features and also (ii) a good lexical alignment of leaf nodes can improve the quality of internal nodes alignment." }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-16", "text": "----------------------------------" }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-17", "text": "**RELATED WORK**" }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-18", "text": "According to related work, it is possible to note that the alignment of syntactic trees is divided in two steps." }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-19", "text": "First, the lexical alignment is applied to align the leaf nodes, then, the other (internal) nodes are aligned." }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-20", "text": "Furthermore, there is a wellformedness criterion for creating internal alignments which states that an ascendant node in the source tree may only be aligned with an ascendant node in the target tree, regarding the previously aligned node." 
}, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-21", "text": "The same is true for descendant nodes: a descendant node in the source tree can only be aligned with a descendant node in the target tree." }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-22", "text": "After the alignment of leaf nodes, the internal nodes are aligned following various approaches and distinct criteria." }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-23", "text": "For instance, the method presented in [Lavie et al. 2008 ] assigns a prime number to each pair of aligned leaf nodes in source and target trees based on the lexical alignment." }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-24", "text": "This alignment is propagated to the highest nodes in a way that the ascendant nodes receive the product of their children, and the internal nodes of both trees with the same resultant value are aligned." }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-25", "text": "Similarly, in [Tinsley et al. 2007 ] the alignment of internal nodes is accomplished using the alignment probabilities of leaf nodes generated by GIZA++ [Och and Ney 2003] ." }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-26", "text": "In this case, the product of the probabilities of lexical alignment (not prime numbers as [Lavie et al. 2008] ) is assigned to parent nodes." }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-27", "text": "In [Menezes and Richardson 2001] and [Groves et al. 2004] , the proposed methods automatically align fragments of the source tree with the equivalent target tree fragment quickly and consistently using a best-first approach and composition rules." }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-28", "text": "Some other methods, such as [Marecek et al. 2008] and [Tiedemann and Kotz\u00e9 2009] , use different resources for the alignment of syntactic trees as: prefix analysis, part-of-speech and organization of words in the sentence (linear position)." 
}, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-29", "text": "For the experiments presented in this paper, the baseline models were implemented based on [Lavie et al. 2008] and [Tinsley et al. 2007 ] mainly because they do not require rich resources such as [Marecek et al. 2008] neither use manually created composition rules as [Menezes and Richardson 2001] and [Groves et al. 2004 ]." }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-30", "text": "----------------------------------" }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-31", "text": "**MODELS FOR ALIGNING PARALLEL SYNTACTIC TREES**" }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-32", "text": "3.1." }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-33", "text": "Model 1 -Based on [Lavie et al. 2008] Following an idea similar to that described in [Lavie et al. 2008] , our implementation (model 1) assigns prime numbers to each pair of aligned terminal nodes 1 ." }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-34", "text": "For those nonaligned terminal nodes, model 1 assigns the value 1 and for those nodes with multiple alignments, it assigns the product of the prime numbers of each alignment." }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-35", "text": "Then, in a second step, the values are propagated to the internal nodes (a bottomup approach): the value assigned to a parent node is the product of the values assigned to its child nodes." }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-36", "text": "Finally, the value of each node in the source tree is compared to the values of each target node and the source and target nodes with the same value are aligned." }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-37", "text": "[Tinsley et al. 2007] As proposed in [Tinsley et al. 2007] , the model 2 uses the probability generated by GIZA++ [Och and Ney 2003] ." 
}, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-38", "text": "For each node in the source tree, model 2 calculates the alignment probability regarding each target node in the parallel tree." }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-39", "text": "These values are organized in a matrix and, in each interaction, the pair of nodes with the highest score is aligned." }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-40", "text": "The restriction of aligning each node only once has also been followed." }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-41", "text": "So, different from model 1, model 2 only generates 1-to-1 alignments." }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-42", "text": "----------------------------------" }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-43", "text": "**MODEL 2 -BASED ON**" }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-44", "text": "----------------------------------" }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-45", "text": "**MODELS 3-5**" }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-46", "text": "Aiming at improving the performance of baseline alignment models, three extended models were implemented as shown in Figure 1 . Note that the input of all extended models is the output of model 1 and model 2." }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-47", "text": "Model 3 is the union between model 1 and model 2, and was implemented in an attempt to improve the recall of the parallel syntactic tree alignment process by joining the output of both baseline models." }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-48", "text": "Model 4, in turn, implements the intersection between the alignment generated by models 1 and 2." 
}, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-49", "text": "By doing so model 4 tries to improve the precision of the parallel syntactic tree alignment process and it is composed by all the pairs of parallel nodes aligned by both models and not only one of them as in model 3." }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-50", "text": "Finally, model 5 is the merge of models 1 and 2." }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-51", "text": "In this case, the merge is the application of model 2 to filter out ambiguous alignments generated by model 1: when a node has more than one alignment (remember that model 1 is able to generate 1-to-many alignments), model 5 outputs only the one aligned by model 2." }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-52", "text": "----------------------------------" }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-53", "text": "**THE LEXICAL ALIGNMENT FILTERS**" }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-54", "text": "The lexical alignment was generated by the union of GIZA++'s output in source-target and target-source alignment directions." }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-55", "text": "This union-alignment was evaluated regarding the aligned terminal nodes in the gold standard and achieved 74.63% precision, 93.42% recall and 82.97% F-measure." }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-56", "text": "In order to improve the lexical alignment automatically generated by GIZA++, two new features were defined to filter the alignments based on their part-of-speech or neighborhood." }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-57", "text": "In the part-of-speech filter, the labels of each leaf node of the source and the target syntactic trees are compared." 
}, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-58", "text": "If they belong to different groups of labels extracted from the lexical alignment of the gold standard, the alignment between them is filtered out." }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-59", "text": "The neighborhood filter, in turn, allows only alignments between source and target nodes that occur in similar positions in source and target trees, respectively." }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-60", "text": "By doing so, we try to solve some ambiguities in the lexical alignment filtering out the less probable ones." }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-61", "text": "----------------------------------" }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-62", "text": "**EXPERIMENTS AND RESULTS**" }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-63", "text": "The experiments were carried out in a Brazilian Portuguese (pt) and English (en) parallel corpus containing 108 pairs of syntactic trees." }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-64", "text": "These trees were generated by syntactic parser for pt [Bick 2000 ] and en [Collins 1999 ], separately." }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-65", "text": "These trees represent sentences derived from articles of the Pesquisa FAPESP 2 Brazilian magazine." }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-66", "text": "From this test corpus, a gold standard manually aligned by a human expert was produced to serve as reference in the automatic comparison." }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-67", "text": "Table 1 shows the precision, recall and F scores for the five models." }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-68", "text": "The evaluation was performed considering the 1-to-many alignments (1:n) of model 1 (lines 1, 3-4) and also restricting this model to provide only the 1-to-1 alignments (1:1, lines 5-8)." 
}, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-69", "text": "In the left part of this table we can see the results using the GIZA++'s union of sourcetarget and target-source lexical alignments; and on the right, GIZA++'s output filtered by part-of-speech and neighborhood filters." }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-70", "text": "From this table it is possible to notice that, as expected, model 4 (intersection) improved the precision of the baseline models (1 and 2) while model 3 (union) improved their recall." }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-71", "text": "These results confirm our first hypothesis since we see that it is possible to combine the baseline alignment methods and improve their performance." }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-72", "text": "Regarding our second hypothesis, applying the filters on GIZA++'s output lead to a better lexical alignment precision but a worser recall." }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-73", "text": "3 The better precision in lexical alignments improved the recall of internal nodes alignment since the 93.48% of recall in model 3 (1:n) was the best recall achieved in our experiments." }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-74", "text": "However, the same improvement was not achieved for the precision of internal nodes alignment." }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-75", "text": "----------------------------------" }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-76", "text": "**CONCLUSIONS AND FUTURE WORK**" }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-77", "text": "In this paper, we evaluated some models to automatically align the internal nodes of two parallel syntactic trees." }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-78", "text": "The best precision (97.36%) was obtained by the intersection (model 4) while the union (model 3) achieved the best recall (93.48%) and F (92.18%) scores." 
}, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-79", "text": "Model 5 was not the best in any measure, but it improved the precision of model 1 mitigating the decline in recall of model 4." }, { "sent_id": "d0c12613f09b36e071b9a842a4d844-C001-80", "text": "The next steps in this research are: (i) to apply the best models in the whole corpus of 16,994 pairs of Brazilian Portuguese-English parallel syntactic trees, (ii) to extract translation rules from the aligned parallel trees and (iii) to apply the extracted rules in a machine translation system." } ], "y": { "@BACK@": { "gold_contexts": [ [ "d0c12613f09b36e071b9a842a4d844-C001-5" ], [ "d0c12613f09b36e071b9a842a4d844-C001-22", "d0c12613f09b36e071b9a842a4d844-C001-23" ], [ "d0c12613f09b36e071b9a842a4d844-C001-25", "d0c12613f09b36e071b9a842a4d844-C001-26" ] ], "cite_sentences": [ "d0c12613f09b36e071b9a842a4d844-C001-5", "d0c12613f09b36e071b9a842a4d844-C001-23", "d0c12613f09b36e071b9a842a4d844-C001-26" ] }, "@EXT@": { "gold_contexts": [ [ "d0c12613f09b36e071b9a842a4d844-C001-13" ] ], "cite_sentences": [ "d0c12613f09b36e071b9a842a4d844-C001-13" ] }, "@MOT@": { "gold_contexts": [ [ "d0c12613f09b36e071b9a842a4d844-C001-13" ] ], "cite_sentences": [ "d0c12613f09b36e071b9a842a4d844-C001-13" ] }, "@USE@": { "gold_contexts": [ [ "d0c12613f09b36e071b9a842a4d844-C001-29" ], [ "d0c12613f09b36e071b9a842a4d844-C001-33" ] ], "cite_sentences": [ "d0c12613f09b36e071b9a842a4d844-C001-29", "d0c12613f09b36e071b9a842a4d844-C001-33" ] }, "@SIM@": { "gold_contexts": [ [ "d0c12613f09b36e071b9a842a4d844-C001-33" ] ], "cite_sentences": [ "d0c12613f09b36e071b9a842a4d844-C001-33" ] } } }, "ABC_5c63296c36cbd95e07f05f2563a2a1_32": { "x": [ { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-68", "text": "----------------------------------" }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-69", "text": "**NER MODELS 4.1 WORD-BASED NEURAL MODEL**" }, { 
"sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-117", "text": "Entity Category Pro. Loc. Org." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-2", "text": "Recently, neural models have shown superior performance over conventional models in NER tasks." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-3", "text": "These models use CNN to extract sub-word information along with RNN to predict a tag for each word." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-4", "text": "However, these models have been tested almost entirely on English texts." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-5", "text": "It remains unclear whether they perform similarly in other languages." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-6", "text": "We worked on Japanese NER using neural models and discovered two obstacles of the state-ofthe-art model." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-7", "text": "First, CNN is unsuitable for extracting Japanese sub-word information." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-8", "text": "Secondly, a model predicting a tag for each word cannot extract an entity when a part of a word composes an entity." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-9", "text": "The contributions of this work are (i) verifying the effectiveness of the state-of-theart NER model for Japanese, (ii) proposing a neural model for predicting a tag for each character using word and character information." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-10", "text": "Experimentally obtained results demonstrate that our model outperforms the state-of-the-art neural English NER model in Japanese." 
}, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-11", "text": "----------------------------------" }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-12", "text": "**INTRODUCTION**" }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-13", "text": "Named Entity Recognition (NER) is designed to extract entities such as location and product from texts." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-14", "text": "The results are used in sophisticated tasks including summarizations and recommendations." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-15", "text": "In the past several years, sequential neural models such as long-short term memory (LSTM) have been applied to NER." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-16", "text": "They have outperformed the conventional models (Huang et al., 2015) ." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-17", "text": "Recently, Convolutional Neural Network (CNN) was introduced into many models for extracting sub-word information from a word (Santos and Guimaraes, 2015; Ma and Hovy, 2016) ." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-18", "text": "The models achieved higher performance because CNN can capture capitalization, suffixes, and prefixes (Chiu and Nichols, 2015) ." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-19", "text": "These models predict a tag for each word assuming that words can be separated clearly by explicit word separators (e.g. blank spaces)." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-20", "text": "We refer to such model as a \"wordbased model\", even if inputs include characters." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-21", "text": "When Japanese NER employs a recent neural model, two obstacles arise." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-22", "text": "First, extracting sub-word information by CNN is unsuitable for Japanese language." 
}, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-23", "text": "The reasons are that Japanese words tend to be shorter than English and Japanese characters have no capitalization." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-24", "text": "Secondly, the word-based model cannot extract entities when a part of a word composes an entity." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-25", "text": "Japanese language has no explicit word separators." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-26", "text": "Word boundaries occasionally become ambiguous." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-27", "text": "Therefore, the possibility exists that entity boundary does not match word boundaries." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-28", "text": "We define such phenomena as \"boundary conflict\"." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-29", "text": "To avoid this obstacle, NER using finer-grained compose units than words are preferred in Japanese NERs (Asahara and Matsumoto, 2003; Sassano and Utsuro, 2000) ." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-30", "text": "We follow these approaches and expand the state-of-the-art neural NER model to predict a tag for each character: a \"characterbased model\"." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-31", "text": "The contributions of our study are: (i) application of a state-of-the-art NER model to Japanese NER and verification of its effectiveness, and (ii) proposition of a \"character-based\" neural model with concatenating words and characters." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-32", "text": "Experimental results show that our model outperforms the state-of-the-art neural NER model in Japanese." 
}, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-33", "text": "----------------------------------" }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-34", "text": "**RELATED WORK**" }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-35", "text": "Conventional Models: Conventional NER systems employ machine learning algorithms that use inputs which are hand-crafted features such as POS tags." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-36", "text": "Support Vector Machine (Isozaki and Kazawa, 2002) , maximum entropy models (Bender et al., 2003) , Hidden Markov Models (Zhou and Su, 2002) and CRF (Klinger, 2011; Chen et al., 2006; Marcinczuk, 2015) were applied." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-37", "text": "Word-based Neural Models: A neural model was applied to sequence labeling tasks also in NER (Collobert et al., 2011) ." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-38", "text": "Modified models using Bi-directional LSTM (BLSTM) or Stacked LSTM were proposed (Huang et al., 2015; Lample et al., 2016) ." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-39", "text": "Recently, new approaches introducing CNN or LSTM for extracting subword information from character inputs have been found to outperform other models (Lample et al., 2016) ." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-40", "text": "Rei et al. (2016) proposed the model using an attention mechanism whose inputs are words and characters." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-41", "text": "Above all, BLSTM-CNNs-CRF (Ma and Hovy, 2016 ) achieved state-of-theart performance on the standard English corpus: CoNLL2003 (Tjong Kim Sang and De Meulder, 2003) ." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-42", "text": "Character-based Neural Models: Kuru et al. (2016) proposed a character-based neural model." 
}, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-43", "text": "This model, which inputs only characters, exhibits good performance on the condition that no external knowledge is used." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-44", "text": "This model predicts a tag for each character and forces that predicted tags in a word are the same." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-45", "text": "Therefore, it is unsuitable for languages in which boundary conflicts occur." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-46", "text": "Japanese NER: For Japanese NER, many models using conventional algorithms have been proposed (Iwakura, 2011; Sasano and Kurohashi, 2008) ." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-47", "text": "Most such models are character-based models to deal with boundary conflicts." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-48", "text": "Tomori et al. (2016) applied a neural model to Japanese NER." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-49", "text": "This study uses non-sequential neural networks with inputs that are hand-crafted features." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-50", "text": "This model uses no recent advanced approaches for NER, such as word embedding or CNN to extract sub-word information." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-51", "text": "Therefore, the effectiveness of recent neural models for Japanese NER has not been evaluated." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-52", "text": "----------------------------------" }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-53", "text": "**JAPANESE NER AND CHARACTERISTICS**" }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-54", "text": "One common definition of entity categories for Japanese NER is Sekine's extended named entity hierarchy (Sekine et al., 2002) ." 
}, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-55", "text": "This definition includes 30 entity categories." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-56", "text": "This study used the corpus annotated in Mainichi newspaper articles (Hashimoto et al., 2008 )." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-57", "text": "Japanese language is written without blank spaces." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-58", "text": "Therefore, word segmentations that are made using morphological analysis are needed to use word information." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-59", "text": "However, some word segmentations cause boundary conflicts." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-60", "text": "As an example, one can consider the extraction of the correct entity \"Tokyoto\" (Tokyo prefecture) from \"Tokyotonai\" (in Tokyo prefecture)." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-61", "text": "Tokyo/tonai (Tokyo / in pref.): word boundary Tokyoto/nai (Tokyo pref. / in): entity boundary These slashes show word and entity boundaries." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-62", "text": "The entity boundary does not match the word boundary." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-63", "text": "Therefore, the entity candidates by word-based models are \"Tokyo\" and \"Tokyotonai.\" It is impossible to extract the entity \"Tokyoto\"." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-64", "text": "Word lengths of Japanese language tend to be shorter than those of English." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-65", "text": "The average word length in entities in CoNLL 2003 (Reuters news service) is 6.43 characters." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-66", "text": "That in the Mainichi newspaper corpus is 1.95." 
}, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-67", "text": "Therefore, it is difficult to extract sub-word information in Japanese in a manner that is suitable for English." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-70", "text": "In this study, we specifically examine BLSTMCNNs-CRF (Ma and Hovy, 2016) because it achieves state-of-the-art performance in the CoNLL 2003 corpus." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-71", "text": "Figure 1 presents the architecture of this model." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-72", "text": "This word-based model combines CNN, BLSTM, and CRF layers." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-73", "text": "We describe each layer of this model as the following." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-74", "text": "CNN Layer: This layer is aimed at extracting subword information." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-75", "text": "The inputs are character embeddings of a word." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-76", "text": "This layer consists of convolution and pooling layers." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-77", "text": "The convolution layer produces a matrix for a word with consideration of the sub-word." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-78", "text": "The pooling layer compresses the matrix for each dimension of character embedding." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-79", "text": "BLSTM Layer: BLSTM (Graves and Schmidhuber, 2005 ) is an approach to treat sequential data." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-80", "text": "The output of CNN and word embedding are concatenated as an input of BLSTM." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-81", "text": "CRF Layer: This layer was designed to select the best tag sequence from all possible tag sequences with consideration of outputs from BLSTM and correlations between adjacent tags." 
}, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-82", "text": "This layer introduces a transition score for each transition pattern between adjacent tags." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-83", "text": "The objective function is calculated using the sum of the outputs from BLSTM and the transition scores for a sequence." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-84", "text": "----------------------------------" }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-85", "text": "**CHARACTER-BASED NEURAL MODEL**" }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-86", "text": "To resolve the obstacles when applying a recent neural model, we propose character-based BLSTM-CRF model (Char-BLSTM-CRF)." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-87", "text": "This model, which consists of BLSTM and CRF layers, predicts a tag for every character independently." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-88", "text": "Figure 2 presents the model structure." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-89", "text": "This model gives an input for each character to predict a tag for a character independently." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-90", "text": "Additionally, we introduce word information with character information as inputs of this model." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-91", "text": "Character information is a character embedding and word information is the embedding of the word containing the character." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-92", "text": "That is, the same word embedding will be used as inputs of characters constructing a word." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-93", "text": "This enables us to utilize pre-training of word embeddings with the effectiveness shown in English (Ma and Hovy, 2016) ." 
}, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-94", "text": "We assume that it is difficult for the CNN layer to extract the Japanese sub-word information." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-95", "text": "Moreover, we assume that sufficient information can be extracted from a simple character input." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-96", "text": "Consequently, the model uses no CNN layer." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-97", "text": "----------------------------------" }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-98", "text": "**EXPERIMENTS**" }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-99", "text": "----------------------------------" }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-100", "text": "**EXPERIMENT CONDITIONS**" }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-101", "text": "We evaluate our models using the Mainichi newspaper corpus." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-102", "text": "We specifically examine the four categories of the highest frequency: Product, Location, Organization, Time. statistics related to this corpus." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-103", "text": "We prepared pretrained word embeddings using skip-gram model (Mikolov et al., 2013) ." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-104", "text": "Seven years (1995-1996 and 1998-2002) of Mainichi newspaper articles which include almost 500 million words are used for pre-training." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-105", "text": "We conduct parameter tuning using the development dataset." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-106", "text": "We choose the unit number of LSTM as 300, the size of word embedding as 500, that of character embedding as 50, the maximum epoch as 20, and the batch size as 60." 
}, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-107", "text": "We use Adam (Kingma and Ba, 2014) , with the learning rate of 0.001 for optimization." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-108", "text": "We use MeCab (Kudo, 2005) for word segmentation." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-109", "text": "Other conditions are the same as those reported for an earlier study (Ma and Hovy, 2016) ." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-110", "text": "Table 2 : F1 score of each models." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-111", "text": "Average is a weighted average." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-112", "text": "\u2020 expresses the best result in the word-based models for each entity category." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-113", "text": "Bold means the best result in all models for each entity category." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-114", "text": "\"output\" means the unit of prediction; \"input\" shows information used as inputs." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-115", "text": "----------------------------------" }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-116", "text": "**RESULTS**" }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-118", "text": "Time Word Length in Entity 1.99 2.07 2.51" }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-119", "text": "1.20 Table 3 : Averaged word length in entity." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-120", "text": "Word-based Neural Models: Among wordbased models, BLSTM-CNNs-CRF is the best model for Organization." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-121", "text": "Also, BLSTM-CRF is the best model for Product, Location, and Time." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-122", "text": "We confirm that cutting-edge neural models are suitable for Japanese language because each model outperforms CRF." 
}, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-123", "text": "When comparing BLSTM-CNNs and BLSTMCNNs-CRF, the CRF layer contributes to improvement by 2.39 pt." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-124", "text": "When comparing BLSTM-CRF and BLSTM-CNNs-CRF, the CNN layer is worse by 0.18 pt." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-125", "text": "Here, the CNN layer and the CRF layer improve by 2.36 pt and 1.75 pt in English (Ma and Hovy, 2016) ." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-126", "text": "Therefore, the CRF layer performs similarly but the CNN layer performs differently." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-127", "text": "The CNN layer enhances the model flexibility." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-128", "text": "Nevertheless, this layer can scarcely extract information from characters because Japanese words are shorter than English, according to Section 3." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-129", "text": "Table 3 shows the averaged word length after splitting an entity into words for each entity category." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-130", "text": "Words composing Time entities is the shortest." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-131", "text": "Therefore, information is scarce, especially in Time." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-132", "text": "In contrast, the words composing Organization is long." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-133", "text": "Therefore, CNN can extract information from characters in a word in Organization." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-134", "text": "This is the reason why BLSTM-CNNs-CRF performs better than BLSTM-CRF in Organization." 
}, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-135", "text": "Character-based Neural Models: The results of averaged F1 scores show that Char-BLSTM-CRF is more suitable for Japanese than word-based models." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-136", "text": "When comparing four configurations of Char-BLSTM-CRF, pre-training is critically important for performance." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-137", "text": "Character input also con- Table 4 : Number of entities with boundary conflicts and that of entities extracted by Char-BLSTM-CRF." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-138", "text": "tributes to the performance improvement in Char-BLSTM-CRF, although the input degrades the performance in a word-based model." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-139", "text": "Total Conflicts in the table 4 is the total number of entities with boundary conflicts in the test data." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-140", "text": "Extracted Entities in the table is the number of entities that Char-BLSTM-CRF extracts among the entities with boundary conflicts." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-141", "text": "Results show that the model extracts entities with boundary conflicts which cannot be extracted by word-based models." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-142", "text": "The number of entities with boundary conflicts extracted by Char-BLSTM-CRF is the largest in Location." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-143", "text": "When comparing the performance of Char-BLSTM-CRF and BLSTM-CRF for each entity category, the largest performance improvement of 2.90 pt is achieved in Location." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-144", "text": "By extracting 68 entities with boundary conflicts in Location, Char-BLSTM-CRF achieves about 2 pt improvement out of total 2.90 pt." 
}, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-145", "text": "It can be said that almost all improvements of Char-BLSTM-CRF are from extracting these entities." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-146", "text": "In contrast, Char-BLSTM-CRF is inappropriate for Organization." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-147", "text": "The averaged word length of entities that are not extracted accurately by BLSTMCNNs-CRF is 4.07; that by Char-BLSTM-CRF is 4.87." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-148", "text": "It can be said that Char-BLSTM-CRF is unsuitable for extracting long words." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-149", "text": "We infer that the inputs of LSTM become redundant and that LSTM does not work efficiently." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-150", "text": "Especially, the averaged word length of Organization is long according to Table 3." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-151", "text": "For that reason, the characterbased model is inappropriate in Organization." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-152", "text": "----------------------------------" }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-153", "text": "**CONCLUSIONS**" }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-154", "text": "As described in this paper, we verified the effectiveness of the state-of-the-art neural NER model for Japanese." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-155", "text": "The experimentally obtained results show that the model outperforms conventional CRF in Japanese." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-156", "text": "Results show that the CNN layer works improperly for Japanese because words of the Japanese language are short." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-157", "text": "We proposed a character-based neural model incorporating words and characters: Char-BLSTM-CRF." 
}, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-158", "text": "This model outperforms a state-of-the-art neural NER model in Japanese, especially for the entity category consisting of short words." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-159", "text": "Our future work will examine reduction of redundancy in character-based model by preparing and combining different LSTMs for word and character inputs." }, { "sent_id": "5c63296c36cbd95e07f05f2563a2a1-C001-160", "text": "Also to examine the effects of pre-training of characters in Char-BLSTM-CRF is our future work." } ], "y": { "@BACK@": { "gold_contexts": [ [ "5c63296c36cbd95e07f05f2563a2a1-C001-17" ], [ "5c63296c36cbd95e07f05f2563a2a1-C001-41" ], [ "5c63296c36cbd95e07f05f2563a2a1-C001-70", "5c63296c36cbd95e07f05f2563a2a1-C001-71", "5c63296c36cbd95e07f05f2563a2a1-C001-72", "5c63296c36cbd95e07f05f2563a2a1-C001-73", "5c63296c36cbd95e07f05f2563a2a1-C001-74", "5c63296c36cbd95e07f05f2563a2a1-C001-75", "5c63296c36cbd95e07f05f2563a2a1-C001-76", "5c63296c36cbd95e07f05f2563a2a1-C001-77", "5c63296c36cbd95e07f05f2563a2a1-C001-78", "5c63296c36cbd95e07f05f2563a2a1-C001-79", "5c63296c36cbd95e07f05f2563a2a1-C001-80", "5c63296c36cbd95e07f05f2563a2a1-C001-81", "5c63296c36cbd95e07f05f2563a2a1-C001-82", "5c63296c36cbd95e07f05f2563a2a1-C001-83" ] ], "cite_sentences": [ "5c63296c36cbd95e07f05f2563a2a1-C001-17", "5c63296c36cbd95e07f05f2563a2a1-C001-41", "5c63296c36cbd95e07f05f2563a2a1-C001-70" ] }, "@MOT@": { "gold_contexts": [ [ "5c63296c36cbd95e07f05f2563a2a1-C001-70" ] ], "cite_sentences": [ "5c63296c36cbd95e07f05f2563a2a1-C001-70" ] }, "@UNSURE@": { "gold_contexts": [ [ "5c63296c36cbd95e07f05f2563a2a1-C001-90", "5c63296c36cbd95e07f05f2563a2a1-C001-93" ], [ "5c63296c36cbd95e07f05f2563a2a1-C001-123", "5c63296c36cbd95e07f05f2563a2a1-C001-124", "5c63296c36cbd95e07f05f2563a2a1-C001-125", "5c63296c36cbd95e07f05f2563a2a1-C001-126", "5c63296c36cbd95e07f05f2563a2a1-C001-127" ] ], "cite_sentences": [ 
"5c63296c36cbd95e07f05f2563a2a1-C001-93", "5c63296c36cbd95e07f05f2563a2a1-C001-125" ] }, "@USE@": { "gold_contexts": [ [ "5c63296c36cbd95e07f05f2563a2a1-C001-109" ] ], "cite_sentences": [ "5c63296c36cbd95e07f05f2563a2a1-C001-109" ] } } }, "ABC_0fed8b9e785426880fa8e5641116a4_33": { "x": [ { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-2", "text": "A recent paper described a new machine translation evaluation metric, AMBER." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-3", "text": "This paper describes two changes to AMBER." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-4", "text": "The first one is incorporation of a new ordering penalty; the second one is the use of the downhill simplex algorithm to tune the weights for the components of AMBER." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-5", "text": "We tested the impact of the two changes, using data from the WMT metrics task." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-6", "text": "Each of the changes by itself improved the performance of AMBER, and the two together yielded even greater improvement, which in some cases was more than additive." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-7", "text": "The new version of AMBER clearly outperforms BLEU in terms of correlation with human judgment." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-8", "text": "----------------------------------" }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-10", "text": "AMBER is a machine translation evaluation metric first described in (Chen and Kuhn, 2011) ." 
}, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-11", "text": "It is designed to have the advantages of BLEU (Papineni et al., 2002) , such as nearly complete language independence and rapid computability, while attaining even higher correlation with human judgment." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-12", "text": "According to the paper just cited: \"It can be thought of as a weighted combination of dozens of computationally cheap features based on word surface forms for evaluating MT quality\"." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-13", "text": "Many recently defined machine translation metrics seek to exploit deeper sources of knowledge than are available to BLEU, such as external lexical and syntactic resources." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-14", "text": "Unlike these and like BLEU, AMBER relies entirely on matching surface forms in tokens in the hypothesis and reference, thus sacrificing depth of knowledge for simplicity and speed." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-15", "text": "In this paper, we describe two improvements to AMBER." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-16", "text": "The first is a new ordering penalty called \"v\" developed in (Chen et al., 2012) ." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-17", "text": "The second remedies a weakness in the 2011 version of AMBER by carrying out automatic, rather than manual, tuning of this metric's free parameters; we now use the simplex algorithm to do the tuning." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-18", "text": "----------------------------------" }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-19", "text": "**AMBER**" }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-20", "text": "AMBER is the product of a score and a penalty, as in Equation (1); in this, it resembles BLEU." 
}, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-21", "text": "However, both the score part and the penalty part are more sophisticated than in BLEU." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-48", "text": "This punishes only once a sequence of words that moves a long distance with the internal word order conserved, rather than on every word." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-22", "text": "The score part (Equation 2) is enriched by incorporating the weighted average of n-gram precisions (AvgP), the F-measure derived from the arithmetic averages of precision and recall (Fmean), and the arithmetic average of F-measure of precision and recall for each n-gram (AvgF)." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-23", "text": "The penalty part is a weighted product of several different penalties (Equation 3)." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-24", "text": "Our original AMBER paper (Chen and Kuhn, 2011) describes the ten penalties used at that time; two of these penalties, the normalized Spearman's correlation penalty and the normalized Kendall's correlation penalty, model word reordering." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-25", "text": "In addition to the more complex score and penalty factors, AMBER differs from BLEU in two other ways:" }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-26", "text": "\u2022 Not only fixed n-grams, but three different kinds of flexible n-grams, are used in computing scores and penalties." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-27", "text": "----------------------------------" }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-28", "text": "**\u2022**" }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-29", "text": "The AMBER score can be computed with different types of text preprocessing, i.e. different combinations of several text preprocessing techniques: lowercasing, tokenization, stemming, word splitting, etc." 
}, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-30", "text": "8 types were tried in (Chen and Kuhn, 2011) ." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-31", "text": "When using more than one type, the final score is computed as an average over runs, one run per type." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-32", "text": "In the experiments reported below, we averaged over two types of preprocessing." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-33", "text": "----------------------------------" }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-34", "text": "**IMPROVEMENTS TO AMBER**" }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-35", "text": "----------------------------------" }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-36", "text": "**ORDERING PENALTY V**" }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-37", "text": "We use a simple matching algorithm (Isozaki et al., 2010) to do 1-1 word alignment between the hypothesis and the reference." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-38", "text": "After word alignment, represent the reference by a list of normalized positions of those of its words that were aligned with words in the hypothesis, and represent the hypothesis by a list of positions for the corresponding words in the reference." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-39", "text": "For both lists, unaligned words are ignored." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-40", "text": "E.g., let P 1 = reference, P 2 = hypothesis: P 1 : Then we obtain P 1 : 1 2 3 4 5 6 (the 2 nd word \"the\", 4 th word \"of\" and 6 th word \",\" in the reference are not aligned to any word in the hypothesis." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-41", "text": "Thus, their positions are not in P 1 , so the positions of the matching words \"in winter 2010 I visited Paris\" are normalized to 1 2 3 4 5 6) P 2 : 4 5 6 1 3 2 (the word \"'s\" was unaligned)." 
}, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-42", "text": "The ordering metric v is computed from two distance measures." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-43", "text": "The first is absolute permutation distance:" }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-44", "text": "v 1 ranges from 0 to 1; a larger value means more similarity between the two permutations." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-45", "text": "This metric is similar to Spearman's \u03c1 (Spearman, 1904) ." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-46", "text": "However, we have found that \u03c1 punishes long-distance reordering too heavily." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-47", "text": "For instance, Inspired by HMM word alignment (Vogel et al., 1996) , our second distance measure is based on jump width." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-49", "text": "In the following, only two groups of words have moved, so the jump width punishment is light:" }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-50", "text": "----------------------------------" }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-51", "text": "**REF: IN THE WINTER OF 2010, I VISITED PARIS HYP: I VISITED PARIS IN THE WINTER OF 2010**" }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-52", "text": "The second distance measure is" }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-53", "text": "where we set" }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-54", "text": "As with v 1 , v 2 is also from 0 to 1, and larger values indicate more similar permutations." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-55", "text": "The ordering measure v s is the harmonic mean of v 1 and v 2 (Chen et al., 2012) :" }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-56", "text": "(8) In (Chen et al., 2012) we found this to be slightly more effective than the geometric mean." 
}, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-57", "text": "v s in (8) (9) where l is the number of segments of the document, and len(R) is the length of the reference after text preprocessing." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-58", "text": "v s is the segment-level ordering penalty." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-59", "text": "Recall that the penalty part of AMBER is the weighted product of several component penalties." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-60", "text": "In the original version of AMBER, there were 10 component penalties." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-61", "text": "In the new version, v is incorporated as an additional, 11th weighted penalty in (3)." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-62", "text": "Thus, the new version of AMBER incorporates three reordering penalties: Spearman's correlation, Kendall's correlation, and v. Note that v is also incorporated in a tuning metric we recently devised (Chen et al., 2012) ." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-63", "text": "----------------------------------" }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-64", "text": "**AUTOMATIC TUNING**" }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-65", "text": "In (Chen and Kuhn, 2011) , we manually set the 17 free parameters of AMBER (see section 3.2 of that paper)." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-66", "text": "In the experiments reported below, we tuned the 18 free parameters -the original 17 plus the ordering metric v described in the previous section -automatically, using the downhill simplex method of (Nelder and Mead, 1965) as described in (Press et al., 2002) ." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-67", "text": "This is a multidimensional optimization technique inspired by geometrical considerations that has shown good performance in a variety of applications." 
}, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-68", "text": "----------------------------------" }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-69", "text": "**EXPERIMENTS**" }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-70", "text": "The experiments are carried out on WMT metric task data: specifically, the WMT 2008, WMT 2009, WMT 2010, WMT 2011 all-to-English, and English-to-all submissions." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-71", "text": "The languages \"all\" (\"xx\" in Table 1) We used 2008 and 2011 data as dev sets, 2009 and 2010 data as test sets." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-72", "text": "Spearman's rank correlation coefficient \u03c1 was employed to measure correlation of the metric with system-level human judgments of translation." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-73", "text": "The human judgment score was based on the \"Rank\" only, i.e., how often the translations of the system were rated as better than those from other systems (CallisonBurch et al., 2008) ." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-74", "text": "Thus, BLEU and the new version of AMBER were evaluated on how well their rankings correlated with the human ones." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-75", "text": "For the segment level, we followed (Callison-Burch et al., 2010) in using Kendall's rank correlation coefficient \u03c4." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-76", "text": "In what follows, \"AMBER1\" will denote a variant of AMBER as described in (Chen and Kuhn, 2011) ." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-77", "text": "Specifically, it is the variant AMBER(1,4) -that is, the variant in which results are averaged over two runs with the following preprocessing:" }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-78", "text": "1. A run with tokenization and lower-casing 2." 
}, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-79", "text": "A run in which tokenization and lowercasing are followed by the word splitting." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-80", "text": "Each word with more than 4 letters is segmented into two sub-words, with one being the first 4 letters and the other the last 2 letters." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-81", "text": "If the word has 5 letters, the 4 th letter appears twice: e.g., \"gangs\" becomes \"gang\" + \"gs\"." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-82", "text": "If the word has more than 6 letters, the middle part is thrown away." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-83", "text": "The second run above requires some explanation." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-84", "text": "Recall that in AMBER, we wish to avoid use of external resources such as stemmers and morphological analyzers, and we aim at maximal language independence." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-85", "text": "Here, we are doing a kind of \"poor man's morphological analysis\"." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-86", "text": "The first four letters of a word are an approximation of its stem, and the last two letters typically carry at least some information about number, gender, case, etc." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-87", "text": "Some information is lost, but on the other hand, when we use the metric for a new language (or at least, a new Indo-European language) we know that it will extract at least some of the information hidden inside morphologically complex words." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-88", "text": "The results shown in Tables 2-4 compare the correlation of variants of AMBER with human judgment; Table 5 compares the best version of AMBER (AMBER2) with BLEU." 
}, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-89", "text": "For instance, to calculate segment-level correlations using" }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-90", "text": "Kendall's \u03c4, we carried out 33,071 paired comparisons for out-of-English and 31,051 paired comparisons for into-English." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-91", "text": "The resulting \u03c4 was calculated per system, then averaged for each condition (out-of-English and into-English) to obtain one out-of-English value and one into-English value." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-92", "text": "First, we compared the performance of AMBER1 with a version of AMBER1 that includes the new reordering penalty v, at the system and segment levels." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-93", "text": "The results are shown in Table 2 ." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-94", "text": "The greatest impact of v is on \"out of English\" at the segment level, but none of the results are particularly impressive." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-95", "text": "Second, we compared the performance of manually tuned AMBER1 with AMBER1 whose parameters were tuned by the simplex method." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-96", "text": "The tuning was run four times on the dev set, once for each possible combination of into/out-of English and system/segment level." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-97", "text": "Table 3 shows the results on the test set." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-98", "text": "This change had a greater impact, especially on the segment level." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-99", "text": "Then, we compared the performance of AMBER1 with AMBER1 that contains v and that has been tuned by the simplex method." 
}, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-100", "text": "We will denote the new version of AMBER containing both changes \"AMBER2\"." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-101", "text": "It will be seen from Table 4 that AMBER2 is a major improvement over AMBER1 at the segment level." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-102", "text": "In the case of \"into English\" at the segment level, the impact of the two changes seems to have been synergistic: adding together the percentage improvements due to v and simplex from Tables 2 and 3, one would have expected an improvement of 4.5% for both changes together, but the actual improvement was 6.2%." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-103", "text": "Furthermore, there was no improvement at the system level for \"out of English\" when either change was tried separately, but there was a small improvement when the two changes were combined." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-104", "text": "Of course, the most important question is: does the new version of AMBER (AMBER2) perform better than BLEU?" }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-105", "text": "Table 5 answers this question (the version of BLEU used here was smoothed BLEU (mteval-v13a))." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-106", "text": "There is a clear advantage for AMBER2 over BLEU at both the system and segment levels, for both \"into English\" and \"out of English\"." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-107", "text": "----------------------------------" }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-108", "text": "**CONCLUSION**" }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-109", "text": "We have made two changes to AMBER, a metric described in (Chen and Kuhn, 2011) ." 
}, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-110", "text": "In our experiments, the new version of AMBER was shown to be an improvement on the original version in terms of correlation with human judgment." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-111", "text": "Furthermore, it outperformed BLEU by about 12% at the system level and about 23% at the segment level." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-112", "text": "A good evaluation metric is not necessarily a good tuning metric, and vice versa." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-113", "text": "In parallel with our work on AMBER for evaluation, we have also been exploring a machine translation tuning metric called PORT (Chen et al., 2012) ." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-114", "text": "AMBER and PORT differ in many details, but they share the same underlying philosophy: to exploit surface similarities between hypothesis and references even more thoroughly than BLEU does, rather than to invoke external resources with richer linguistic knowledge." }, { "sent_id": "0fed8b9e785426880fa8e5641116a4-C001-115", "text": "So far, the results for PORT have been just as encouraging as the ones for AMBER reported here." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "0fed8b9e785426880fa8e5641116a4-C001-10" ], [ "0fed8b9e785426880fa8e5641116a4-C001-24" ], [ "0fed8b9e785426880fa8e5641116a4-C001-29", "0fed8b9e785426880fa8e5641116a4-C001-30" ] ], "cite_sentences": [ "0fed8b9e785426880fa8e5641116a4-C001-10", "0fed8b9e785426880fa8e5641116a4-C001-24", "0fed8b9e785426880fa8e5641116a4-C001-30" ] }, "@DIF@": { "gold_contexts": [ [ "0fed8b9e785426880fa8e5641116a4-C001-65", "0fed8b9e785426880fa8e5641116a4-C001-66" ] ], "cite_sentences": [ "0fed8b9e785426880fa8e5641116a4-C001-65" ] }, "@UNSURE@": { "gold_contexts": [ [ "0fed8b9e785426880fa8e5641116a4-C001-76" ] ], "cite_sentences": [ "0fed8b9e785426880fa8e5641116a4-C001-76" ] }, "@EXT@": { "gold_contexts": [ [ "0fed8b9e785426880fa8e5641116a4-C001-109" ] ], "cite_sentences": [ "0fed8b9e785426880fa8e5641116a4-C001-109" ] } } }, "ABC_c54a1aba5845a52f468cde916c970b_33": { "x": [ { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-2", "text": "Task-oriented dialogue focuses on conversational agents that participate in dialogues with user goals on domain-specific topics." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-3", "text": "In contrast to chatbots, which simply seek to sustain open-ended meaningful discourse, existing task-oriented agents usually explicitly model user intent and belief states." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-4", "text": "This paper examines bypassing such an explicit representation by depending on a latent neural embedding of state and learning selective attention to dialogue history together with copying to incorporate relevant prior context." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-5", "text": "We complement recent work by showing the effectiveness of simple sequence-to-sequence neural architectures with a copy mechanism." 
}, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-6", "text": "Our model outperforms more complex memory-augmented models by 7% in per-response generation and is on par with the current state-of-the-art on DSTC2, a real-world task-oriented dialogue dataset." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-7", "text": "----------------------------------" }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-9", "text": "Effective task-oriented dialogue systems are becoming important as society progresses toward using voice for interacting with devices and performing everyday tasks such as scheduling." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-10", "text": "To that end, research efforts have focused on using machine learning methods to train agents using dialogue corpora." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-11", "text": "One line of work has tackled the problem using partially observable Markov decision processes and reinforcement learning with carefully designed action spaces (Young et al., 2013) ." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-12", "text": "However, the large, hand-designed action and state spaces make this class of models brittle and unscalable, and in practice most deployed dialogue systems remain hand-written, rule-based systems." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-13", "text": "Recently, neural network models have achieved success on a variety of natural language processing tasks (Bahdanau et al., 2015; Sutskever et al., 2014; Vinyals et al., 2015) , due to their ability to implicitly learn powerful distributed representations from data in an end-to-end trainable fashion." 
}, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-14", "text": "This paper extends recent work examining the utility of distributed state representations for taskoriented dialogue agents, without providing rules or manually tuning features." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-15", "text": "One prominent line of recent neural dialogue work has continued to build systems with modularly-connected representation, belief state, and generation components (Wen et al., 2016b) ." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-16", "text": "These models must learn to explicitly represent user intent through intermediate supervision, and hence suffer from not being truly end-to-end trainable." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-17", "text": "Other work stores dialogue context in a memory module and repeatedly queries and reasons about this context to select an adequate system response (Bordes and Weston, 2016) ." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-18", "text": "While reasoning over memory is appealing, these models simply choose among a set of utterances rather than generating text and also must have temporal dialogue features explicitly encoded." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-19", "text": "However, the present literature lacks results for now standard sequence-to-sequence architectures, and we aim to fill this gap by building increasingly complex models of text generation, starting with a vanilla sequence-to-sequence recurrent architecture." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-20", "text": "The result is a simple, intuitive, and highly competitive model, which outperforms the more complex model of Bordes and Weston (2016) by 6.9%." 
}, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-21", "text": "Our contributions are as follows: 1) We perform a systematic, empirical analysis of increasingly complex sequence-to-sequence models for task-oriented dialogue, and 2) we develop a recurrent neural dialogue architecture augmented with an attention-based copy mechanism that is able to significantly outperform more complex models on a variety of metrics on realistic data." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-22", "text": "We use neural encoder-decoder architectures to frame dialogue as a sequence-to-sequence learning problem." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-23", "text": "Given a dialogue between a user (u) and a system (s), we represent the dialogue utterances as {(u 1 , s 1 ), (u 2 , s 2 ), . . . , (u k , s k )} where k denotes the number of turns in the dialogue." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-24", "text": "At the i th turn of the dialogue, we encode the aggregated dialogue context composed of the tokens of (u 1 , s 1 , . . . , s i\u22121 , u i )." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-25", "text": "Letting x 1 , . . . , x m denote these tokens, we first embed these tokens using a trained embedding function \u03c6 emb that maps each token to a fixed-dimensional vector." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-26", "text": "These mappings are fed into the encoder to produce contextsensitive hidden representations h 1 , . . . , h m ." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-27", "text": "The vanilla Seq2Seq decoder predicts the tokens of the i th system response s i by first computing decoder hidden states via the recurrent unit." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-28", "text": "We denoteh 1 , . . . ,h n as the hidden states of the decoder and y 1 , . . . , y n as the output tokens." 
}, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-29", "text": "We extend this decoder with an attention-based model (Bahdanau et al., 2015; Luong et al., 2015a) , where, at every time step t of the decoding, an attention score a t i is computed for each hidden state h i of the encoder, using the attention mechanism of (Vinyals et al., 2015) ." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-30", "text": "Formally this attention can be described by the following equations:" }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-31", "text": "where W 1 , W 2 , U , and v are trainable parameters of the model and o t represents the logits over the tokens of the output vocabulary V ." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-32", "text": "During training, the next token y t is predicted so as to maximize the log-likelihood of the correct output sequence given the input sequence." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-33", "text": "An effective task-oriented dialogue system must have powerful language modelling capabilities and be able to pick up on relevant entities of an underlying knowledge base." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-34", "text": "One source of relevant entities is that they will commonly have been mentioned in the prior discourse context." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-35", "text": "Recent literature has shown that incorporating a copying mechanism into neural architectures improves performance on various sequence-to-sequence tasks including code generation, machine translation, and text summarization (Gu et al., 2016; Ling et al., 2016; Gulcehre et al., 2016) ." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-36", "text": "We therefore augment the attention encoder-decoder model with an attention-based copy mechanism in the style of (Jia and Liang, 2016) ." 
}, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-37", "text": "In this scheme, during decoding we compute our new logits vector" }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-38", "text": "] where a t [1:m] is the concatenated attention scores of the encoder hidden states, and we are now predicting over a vocabulary of size |V | + m. The model, thus, either predicts a token y t from V or copies a token x i from the encoder input context, via the attention score a t i ." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-39", "text": "Rather than copy over any token mentioned in the encoder dialogue context, our model is trained to only copy over entities of the knowledge base mentioned in the dialogue context, as this provides a conceptually intuitive goal for the model's predictive learning: as training progresses it will learn to either predict a token from the standard vocabulary of the language model thereby ensuring well-formed natural language utterances, or to copy over the relevant entities from the input context, thereby learning to extract important dialogue context." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-40", "text": "In our best performing model, we augment the inputs to the encoder by adding entity type features." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-41", "text": "Classes present in the knowledge base of the dataset, namely the 8 distinct entity types referred to in Table 1 , are encoded as one-hot vectors." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-42", "text": "Whenever a token of a certain entity type is seen during encoding, we append the appropriate one-hot vector to the token's word embedding before it is fed into the recurrent cell." 
}, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-43", "text": "These type features improve generalization to novel entities by allowing the model to hone in on positions with particularly relevant bits of dialogue context during its soft attention and copying." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-44", "text": "Other cited work using the DSTC2 dataset (Bordes and Weston, 2016; Liu and Perez, 2016; Seo et al., 2016) implement similar mechanisms whereby they expand the feature representations of candidate system responses based on whether there is lexical entity class matching with provided dialogue context." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-45", "text": "In these works, such features are referred to as match features." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-46", "text": "All of our architectures use an LSTM cell as the recurrent unit (Hochreiter and Schmidhuber, 1997) with a bias of 1 added to the forget gate in the style of (Pham et al., 2014) ." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-47", "text": "----------------------------------" }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-48", "text": "**EXPERIMENTS**" }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-49", "text": "----------------------------------" }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-50", "text": "**DATA**" }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-51", "text": "For our experiments, we used dialogues extracted from the Dialogue State Tracking Challenge 2 (DSTC2) (Henderson et al., 2014) , a restaurant reservation system dataset." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-52", "text": "While the goal of the original challenge was building a system for inferring dialogue state, for our study, we use the version of the data from Bordes and Weston (2016) , which ignores the dialogue state annotations, using only the raw text of the dialogues." 
}, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-53", "text": "The raw text includes user and system utterances as well as the API calls the system would make to the underlying KB in response to the user's queries." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-54", "text": "Our model then aims to predict both these system utterances and API calls, each of which is regarded as a turn of the dialogue." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-55", "text": "We use the train/validation/test splits from this modified version of the dataset." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-56", "text": "The dataset is appealing for a number of reasons: 1) It is derived from a real-world system so it presents the kind of linguistic diversity and conversational abilities we would hope for in an effective dialogue agent." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-57", "text": "2) It is grounded via an underlying knowledge base of restaurant entities and their attributes." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-58", "text": "3) Previous results have been reported on it so we can directly compare our model performance." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-59", "text": "We include statistics of the dataset in Table 1 ." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-60", "text": "----------------------------------" }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-61", "text": "**TRAINING**" }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-62", "text": "We trained using a cross-entropy loss and the Adam optimizer (Kingma and Ba, 2015) , applying dropout (Hinton et al., 2012) as a regularizer to the input and output of the LSTM." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-63", "text": "We identified hyperparameters by random search, evaluating on a held-out validation subset of the data." 
}, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-64", "text": "Dropout keep rates ranged from 0.75 to 0.95." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-65", "text": "We used word embeddings with size 300, and hidden layer and cell sizes were set to 353, identified through our search." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-66", "text": "We applied gradient clipping with a clipvalue of 10 to avoid gradient explosions during training." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-67", "text": "The attention, output parameters, word embeddings, and LSTM weights were randomly initialized from a uniform unit-scaled distribution in the style of (Sussillo and Abbott, 2015) ." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-68", "text": "----------------------------------" }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-69", "text": "**METRICS**" }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-70", "text": "Evaluation of dialogue systems is known to be difficult ." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-71", "text": "We employ several metrics for assessing specific aspects of our model, drawn from previous work: \u2022 Per-Response Accuracy: Bordes and Weston (2016) report a per-turn response accuracy, which tests their model's ability to select the system response at a certain timestep." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-72", "text": "Their system does a multiclass classification over a predefined candidate set of responses, which was created by aggregating all system responses seen in the training, validation, and test sets." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-73", "text": "Our model actually generates each individual token of the response, and we consider a prediction to be correct only if every token of the model output matches the corresponding token in the gold response." 
}, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-74", "text": "Evaluating using this metric on our model is therefore significantly more stringent a test than for the model of Bordes and Weston (2016) ." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-75", "text": "\u2022 Per-Dialogue Accuracy: Bordes and Weston (2016) also report a per-dialogue accuracy, which assesses their model's ability to produce every system response of the dialogue correctly." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-76", "text": "We calculate a similar value of dialogue accuracy, though again our model generates every token of every response." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-77", "text": "\u2022 BLEU: We use the BLEU metric, commonly employed in evaluating machine translation systems (Papineni et al., 2002) , which has also been used in past literature for evaluating dialogue systems (Ritter et al., 2011; ." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-78", "text": "We calculate average BLEU score over all responses generated by the system, and primarily report these scores to gauge our model's ability to accurately generate the language patterns seen in DSTC2." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-79", "text": "\u2022 Entity F 1 : Each system response in the test data defines a gold set of entities." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-80", "text": "To compute an entity F 1 , we micro-average over the entire set of system dialogue responses." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-81", "text": "This metric evaluates the model's ability to generate relevant entities from the underlying knowledge base and to capture the semantics of the user-initiated dialogue flow." 
}, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-82", "text": "Our experiments show that sometimes our model generates a response to a given input that is perfectly reasonable, but is penalized because our evaluation metrics involve direct comparison to the gold system output." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-83", "text": "For example, given a user request for an australian restaurant, the gold system output is you are looking for an australian restaurant right? whereas our system outputs what part of town do you have in mind?, which is a more directed follow-up intended to narrow down the search space of candidate restaurants the system should propose." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-84", "text": "This issue, which recurs with evaluation of dialogue or other generative systems, could be alleviated through more forgiving evaluation procedures based on beam search decoding." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-85", "text": "----------------------------------" }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-86", "text": "**RESULTS**" }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-87", "text": "In Table 2 , we present the results of our models compared to the reported performance of the best performing model of (Bordes and Weston, 2016) , which is a variant of an end-to-end memory network (Sukhbaatar et al., 2015) ." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-88", "text": "Their model is referred to as MemNN." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-89", "text": "We also include the model of (Liu and Perez, 2016) , referred to as GMemNN, and the model of (Seo et al., 2016) , referred to as QRN, which currently is the stateof-the-art." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-90", "text": "In the table, Seq2Seq refers to our vanilla encoder-decoder architecture with (1), (2), and (3) LSTM layers respectively." 
}, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-91", "text": "+Attn refers to a 1-layer Seq2Seq with attention-based decoding." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-92", "text": "+Copy refers to +Attn with our copy-mechanism added." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-93", "text": "+EntType refers to +Copy with entity class features added to encoder inputs." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-94", "text": "We see that a 1-layer vanilla encoder-decoder is already able to significantly outperform MemNN in both per-response and per-dialogue accuracies, despite our more stringent setting." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-95", "text": "Adding layers to Seq2Seq leads to a drop in performance, suggesting an overly powerful model for the small dataset size." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-96", "text": "Adding an attention-based decoding to the vanilla model increases BLEU although per-response and per-dialogue accuracies suffer a bit." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-97", "text": "Adding our attention-based entity copy mechanism achieves substantial increases in perresponse accuracies and entity F 1 ." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-98", "text": "Adding entity class features to +Copy achieves our bestperforming model, in terms of per-response accuracy and entity F 1 ." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-99", "text": "This model achieves a 6.9% increase in per-response accuracy on DSTC2 over MemNN, including +1.5% per-dialogue accuracy, and is on par with the performance of GMemNN, including beating its per-dialogue accuracy." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-100", "text": "It also achieves the highest entity F 1 ." 
}, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-101", "text": "----------------------------------" }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-102", "text": "**DISCUSSION AND CONCLUSION**" }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-103", "text": "We have iteratively built out a class of neural models for task-oriented dialogue that is able to outperform other more intricately designed neural architectures on a number of metrics." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-104", "text": "The model incorporates in a simple way abilities that we believe are essential to building good task-oriented dialogue agents, namely maintaining dialogue state and being able to extract and use relevant entities in its responses, without requiring intermediate supervision of dialogue state or belief tracker modules." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-105", "text": "Other dialogue models tested on DSTC2 that are more performant in per-response accuracy are equipped with sufficiently more complex mechanisms than our model." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-106", "text": "Taking inspiration from (Sukhbaatar et al., 2015) and (Srivastava et al., 2015) , GMemNN uses an explicit memory module as well as an adaptive gating mechanism to learn to attend to relevant memories." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-107", "text": "The QRN model employs a variant of a recurrent unit that is intended to handle local and global interactions in sequential data." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-108", "text": "We contrast with these works by bootstrapping off of more empirically accepted Seq2Seq architectures through intuitive extensions, while still producing highly competitive models." 
}, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-109", "text": "We attribute the large gains in per-response accuracy and entity F 1 demonstrated by our +Ent-Type to its ability to pick out the relevant KB entities from the dialogue context fed into the encoder." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-110", "text": "In Figure 1 , we see the attention-based copy cheap restaurant in east part of town api call r cuisine east cheap the missing sock is a nice place in the east of town and the prices are cheap address sure, the missing sock is on the missing sock address phone number the phone number of the missing sock is the missing sock phone thank you good bye you are welcome Table 3 : Sample dialogue generated." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-111", "text": "System responses are in italics." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-112", "text": "The dataset uses fake addresses and phone numbers." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-113", "text": "Figure 1: Attention-copy weights for a generated natural language response (top) and API call (bottom)." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-114", "text": "The decoder output is displayed vertically and the encoder input is abbreviated for display." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-115", "text": "weights of the model, indicating that the model is able to learn the relevant entities it should focus on in the input context." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-116", "text": "The powerful language modelling abilities of the Seq2Seq backbone allow smooth integration of these extracted entities into both system-generated API calls and natural language responses as shown in the figure." 
}, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-117", "text": "The appeal of our model comes from the simplicity and effectiveness of framing system response generation as a sequence-to-sequence mapping with a soft copy mechanism over relevant context." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-118", "text": "Unlike the task-oriented dialogue agents of Wen et." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-119", "text": "al (2016b) , our architecture does not explicitly model belief states or KB slot-value trackers, and we preserve full end-to-end-trainability." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-120", "text": "Further, in contrast to other referenced work on DSTC2, our model offers more linguistic versatility due to its generative nature while still remaining highly competitive and outperforming other models." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-121", "text": "Of course, this is not to deny the importance of dialogue agents which can more effectively use a knowledge base to answer user requests, and this remains a good avenue for further work." }, { "sent_id": "c54a1aba5845a52f468cde916c970b-C001-122", "text": "Nevertheless, we hope this simple and effective architecture can be a strong baseline for future research efforts on task-oriented dialogue." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "c54a1aba5845a52f468cde916c970b-C001-17" ], [ "c54a1aba5845a52f468cde916c970b-C001-44" ] ], "cite_sentences": [ "c54a1aba5845a52f468cde916c970b-C001-17", "c54a1aba5845a52f468cde916c970b-C001-44" ] }, "@DIF@": { "gold_contexts": [ [ "c54a1aba5845a52f468cde916c970b-C001-19", "c54a1aba5845a52f468cde916c970b-C001-20" ], [ "c54a1aba5845a52f468cde916c970b-C001-74" ], [ "c54a1aba5845a52f468cde916c970b-C001-75", "c54a1aba5845a52f468cde916c970b-C001-76" ] ], "cite_sentences": [ "c54a1aba5845a52f468cde916c970b-C001-20", "c54a1aba5845a52f468cde916c970b-C001-74", "c54a1aba5845a52f468cde916c970b-C001-75" ] }, "@USE@": { "gold_contexts": [ [ "c54a1aba5845a52f468cde916c970b-C001-52" ], [ "c54a1aba5845a52f468cde916c970b-C001-71" ] ], "cite_sentences": [ "c54a1aba5845a52f468cde916c970b-C001-52", "c54a1aba5845a52f468cde916c970b-C001-71" ] }, "@SIM@": { "gold_contexts": [ [ "c54a1aba5845a52f468cde916c970b-C001-75", "c54a1aba5845a52f468cde916c970b-C001-76" ] ], "cite_sentences": [ "c54a1aba5845a52f468cde916c970b-C001-75" ] }, "@MOT@": { "gold_contexts": [ [ "c54a1aba5845a52f468cde916c970b-C001-87" ] ], "cite_sentences": [ "c54a1aba5845a52f468cde916c970b-C001-87" ] }, "@UNSURE@": { "gold_contexts": [ [ "c54a1aba5845a52f468cde916c970b-C001-87" ] ], "cite_sentences": [ "c54a1aba5845a52f468cde916c970b-C001-87" ] } } }, "ABC_45c4e9bc90d28cd1a5a61393625062_33": { "x": [ { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-23", "text": "**THE COMPARABILITY MEASURE**" }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-66", "text": "The bilingual dictionary used in the experiments is constructed from an online dictionary." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-2", "text": "We study in this paper the problem of enhancing the comparability of bilingual corpora in order to improve the quality of bilingual lexicons extracted from comparable corpora." 
}, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-3", "text": "We introduce a clustering-based approach for enhancing corpus comparability which exploits the homogeneity feature of the corpus, and finally preserves most of the vocabulary of the original corpus." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-4", "text": "Our experiments illustrate the well-foundedness of this method and show that the bilingual lexicons obtained from the homogeneous corpus are of better quality than the lexicons obtained with previous approaches." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-5", "text": "----------------------------------" }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-7", "text": "Bilingual lexicons are an important resource in multilingual natural language processing tasks such as statistical machine translation (Och and Ney, 2003) and cross-language information retrieval (Ballesteros and Croft, 1997) ." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-8", "text": "Because it is expensive to manually build bilingual lexicons adapted to different domains, researchers have tried to automatically extract bilingual lexicons from various corpora." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-9", "text": "Compared with parallel corpora, it is much easier to build high-volume comparable corpora, i.e. corpora consisting of documents in different languages covering overlapping information." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-10", "text": "Several studies have focused on the extraction of bilingual lexicons from comparable corpora (Fung and McKeown, 1997; Fung and Yee, 1998; Rapp, 1999; D\u00e9jean et al., 2002; Gaussier et al., 2004; Robitaille et al., 2006; Morin et al., 2007; Garera et al., 2009; Yu and Tsujii, 2009; Shezaf and Rappoport, 2010) ." 
}, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-11", "text": "The basic assumption behind most studies on lexicon extraction from comparable corpora is a distributional hypothesis, stating that words which are translation of each other are likely to appear in similar context across languages." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-12", "text": "On top of this hypothesis, researchers have investigated the use of better representations for word contexts, as well as the use of different methods for matching words across languages." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-13", "text": "These approaches seem to have reached a plateau in terms of performance." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-14", "text": "More recently, and departing from such traditional approaches, we have proposed in (Li and Gaussier, 2010) an approach based on improving the comparability of the corpus under consideration, prior to extracting bilingual lexicons." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-15", "text": "This approach is interesting since there is no point in trying to extract lexicons from a corpus with a low degree of comparability, as the probability of finding translations of any given word is low in such cases." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-16", "text": "We follow here the same general idea and aim, in a first step, at improving the comparability of a given corpus while preserving most of its vocabulary." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-17", "text": "However, unlike the previous work, we show here that it is possible to guarantee a certain degree of homogeneity for the improved corpus, and that this homogeneity translates into a significant improvement of both the quality of the resulting corpora and the bilingual lexicons extracted." 
}, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-18", "text": "----------------------------------" }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-19", "text": "**ENHANCING COMPARABLE CORPORA: A CLUSTERING APPROACH**" }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-20", "text": "We first introduce in this section the comparability measure proposed in former work, prior to describing the clustering-based algorithm to improve the quality of a given comparable corpus." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-21", "text": "For convenience, the following discussion will be made in the context of the English-French comparable corpus." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-22", "text": "----------------------------------" }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-24", "text": "In order to measure the degree of comparability of bilingual corpora, we make use of the measure M developed in (Li and Gaussier, 2010) : Given a comparable corpus P consisting of an English part P e and a French part P f , the degree of comparability of P is defined as the expectation of finding the translation of any given source/target word in the target/source corpus vocabulary." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-25", "text": "Let \u03c3 be a function indicating whether a translation from the translation set T w of the word w is found in the vocabulary P v of a corpus P, i.e.:" }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-26", "text": "and let D be a bilingual dictionary with D v e denoting its English vocabulary and D v f its French vocabulary." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-27", "text": "The comparability measure M can be written as:" }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-28", "text": "where # w (P) denotes the number of different words present in P. 
One can see from Equation 1 that M directly measures the proportion of source/target words translated in the target/source vocabulary of P." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-29", "text": "----------------------------------" }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-30", "text": "**CLUSTERING DOCUMENTS FOR HIGH QUALITY COMPARABLE CORPORA**" }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-31", "text": "If a corpus covers a limited set of topics, it is more likely to contain consistent information on the words used (Morin et al., 2007) , leading to improved bilingual lexicons extracted with existing algorithms relying on the distributional hypothesis." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-32", "text": "The term homogeneity directly refers to this fact, and we will say, in an informal manner, that a corpus is homogeneous if it covers a limited set of topics." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-33", "text": "The rationale for the algorithm we introduce here to enhance corpus comparability is precisely based on the concept of homogeneity." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-34", "text": "In order to find document sets which are similar to each other (i.e. homogeneous), it is natural to resort to clustering techniques." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-35", "text": "Furthermore, since we need homogeneous corpora for bilingual lexicon extraction, it will be convenient to rely on techniques which allow one to easily prune less relevant clusters." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-36", "text": "To perform all this, we use in this work a standard hierarchical agglomerative clustering method." 
}, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-37", "text": "----------------------------------" }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-38", "text": "**BILINGUAL CLUSTERING ALGORITHM**" }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-39", "text": "The overall process retained to build high quality, homogeneous comparable corpora relies on the following steps:" }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-40", "text": "1. Using the bilingual similarity measure defined in Section 2.2.2, cluster English and French documents so as to get bilingual dendrograms from the original corpus P by grouping documents with related content;" }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-41", "text": "2. Pick high quality sub-clusters by thresholding the obtained dendrograms according to the node depth, which retains nodes far from the roots of the clustering trees;" }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-42", "text": "3. Combine all these sub-clusters to form a new comparable corpus P H , which thus contains homogeneous, high-quality subparts;" }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-43", "text": "4. Use again steps (1), (2) and (3) to enrich the remaining subpart of P (denoted as P L , P L = P \\ P H ) with external resources." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-44", "text": "The first three steps aim at extracting the most comparable and homogeneous subpart of P. Once this has been done, one needs to resort to new corpora if one wants to build an homogeneous corpus with a high degree of comparability from P L ." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-67", "text": "It consists of 33k distinct English words and 28k distinct French words, constituting 76k translation pairs." 
}, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-45", "text": "To do so, we simply perform, in step (4), the clustering and thresholding process defined in (1), (2) and (3) on two comparable corpora: The first one consists of the English part of P L and the French part of an external corpus P T ; The second one consists of the French part of P L and the English part of P T ." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-46", "text": "The two high quality subparts obtained from these two new comparable corpora in step (4) are then combined with P H to constitute the final comparable corpus of higher quality." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-47", "text": "----------------------------------" }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-48", "text": "**SIMILARITY MEASURE**" }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-49", "text": "Let us assume that we have two document sets (i.e. clusters) C 1 and C 2 ." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-50", "text": "In the task of bilingual lexicon extraction, two document sets are similar to each other and should be clustered if the combination of the two can complement the content of each single set, which relates to the notion of homogeneity." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-51", "text": "In other words, both the English part C e 1 of C 1 and the French part C f 1 of C 1 should be comparable to their counterparts (respectively the same for the French part C f 2 of C 2 and the English part C e 2 of C 2 )." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-52", "text": "This leads to the following similarity measure for C 1 and C 2 :" }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-53", "text": "where \u03b2 (0 \u2264 \u03b2 \u2264 1) is a weight controlling the importance of the two subparts (C e 1 , C f 2 ) and (C e 2 , C f 1 )." 
}, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-54", "text": "Intuitively, the larger one, containing more information, of the two comparable corpora (C e 1 , C f 2 ) and (C e 2 , C f 1 ) should dominate the overall similarity sim(C 1 , C 2 )." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-55", "text": "Since the content relatedness in the comparable corpus is basically reflected by the relations between all the possible bilingual document pairs, we use here the number of document pairs to represent the scale of the comparable corpus." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-56", "text": "The weight \u03b2 can thus be defined as the proportion of possible document pairs in the current comparable corpus (C e 1 , C f 2 ) to all the possible document pairs, which is:" }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-57", "text": "where # d (C) stands for the number of documents in C. However, this measure does not integrate the relative length of the French and English parts, which actually impacts the performance of bilingual lexicon extraction." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-58", "text": "If a 1-to-1 constraint is too strong (i.e. assuming that all clusters should contain the same number of English and French documents), having completely unbalanced corpora is also not desirable." 
}, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-59", "text": "We thus introduce a penalty function \u03c6 aiming at penalizing unbalanced corpora:" }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-60", "text": "The above penalty function leads us to a new similarity measure sim l which is the one finally used in the above algorithm:" }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-61", "text": "----------------------------------" }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-62", "text": "**EXPERIMENTS AND RESULTS**" }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-63", "text": "The experiments we have designed in this paper aim at assessing (a) whether the clustering-based algorithm we have introduced yields corpora of higher quality in terms of comparability scores, and (b) whether the bilingual lexicons extracted from such corpora are of higher quality." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-64", "text": "Several corpora were used in our experiments: the TREC 1 Associated Press corpus (AP, English) and the corpora used in the CLEF 2 campaign including the Los Angeles Times (LAT94, English), the Glasgow Herald (GH95, English), Le Monde (MON94, French), SDA French 94 (SDA94, French) and SDA French 95 (SDA95, French)." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-65", "text": "In addition, two monolingual corpora Wiki-En and Wiki-Fr were built by respectively retrieving all the articles below the category Society and Soci\u00e9t\u00e9 from the Wikipedia dump files 3 ." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-68", "text": "In our experiments, we use the method described in this paper, as well as the one in (Li and Gaussier, 2010) which is the only alternative method to enhance corpus comparability." 
}, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-69", "text": "----------------------------------" }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-70", "text": "**IMPROVING CORPUS QUALITY**" }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-71", "text": "In this subsection, the clustering algorithm described in Section 2.2.1 is employed to improve the quality of the comparable corpus." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-72", "text": "The corpora GH95 and SDA95 are used as the original corpus P 0 (56k English documents and 42k French documents)." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-73", "text": "We consider two external corpora: P 1 T (109k English documents and 87k French documents) consisting of the corpora LAT94, MON94 and SDA94; P 2 T (368k English documents and 378k French documents) consisting of Wiki-En and Wiki-Fr." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-74", "text": "After the clustering process, we obtain the resulting corpora P 1 (with the external corpus P 1 T ) and P 2 (with P 2 T )." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-75", "text": "As mentioned before, we also used the method described in (Li and Gaussier, 2010) on the same data, producing resulting corpora P 1 (with P 1 T ) and P 2 (with P 2 T ) from P 0 ." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-76", "text": "In terms of lexical coverage, P 1 (resp. P 2 ) covers 97.9% (resp." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-77", "text": "99.0%) of the vocabulary of P 0 ." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-78", "text": "Hence, most of the vocabulary of the original corpus has been preserved." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-79", "text": "The comparability score of P 1 reaches 0.924 and that of P 2 is 0.939." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-80", "text": "Both corpora are more comparable than P 0 of which the comparability is 0.881." 
}, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-81", "text": "Furthermore, both P 1 and P 2 are more comparable than P 1 (comparability 0.912) and P 2 (comparability 0.915), which shows homogeneity is crucial for comparability." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-82", "text": "The intrinsic evaluation shows the efficiency of our approach which can improve the quality of the given corpus while preserving most of its vocabulary." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-83", "text": "----------------------------------" }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-84", "text": "**BILINGUAL LEXICON EXTRACTION EXPERIMENTS**" }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-85", "text": "To extract bilingual lexicons from comparable corpora, we directly use here the method proposed by Fung and Yee (1998) which has been referred to as the standard approach in more recent studies (D\u00e9jean et al., 2002; Gaussier et al., 2004; Yu and Tsujii, 2009) ." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-86", "text": "In this approach, each word w is represented as a context vector consisting of the words co-occurring with w in a certain window in the corpus." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-87", "text": "The context vectors in different languages are then bridged with an existing bilingual dictionary." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-88", "text": "Finally, a similarity score is given to any word pair based on the cosine of their respective context vectors." 
}, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-89", "text": "----------------------------------" }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-90", "text": "**EXPERIMENT SETTINGS**" }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-91", "text": "In order to measure the performance of the lexicons extracted, we follow the common practice by dividing the bilingual dictionary into 2 parts: 10% of the English words (3,338 words) together with their translations are randomly chosen and used as the evaluation set, the remaining words being used to compute the similarity of context vectors." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-92", "text": "English words not present in P e or with no translation in P f are excluded from the evaluation set." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-93", "text": "For each English word in the evaluation set, all the French words in P f are then ranked according to their similarity with the English word." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-94", "text": "Precision and recall are then computed on the first N translation candidate lists." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-95", "text": "The precision amounts in this case to the proportion of lists containing the correct translation (in case of multiple translations, a list is deemed to contain the correct translation as soon as one of the possible translations is present)." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-96", "text": "The recall is the proportion of correct translations found in the lists to all the translations in the corpus." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-97", "text": "This evaluation procedure has been used in previous studies and is now standard." 
}, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-98", "text": "----------------------------------" }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-99", "text": "**RESULTS AND ANALYSIS**" }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-100", "text": "In a first series of experiments, bilingual lexicons were extracted from the corpora obtained by our approach (P 1 and P 2 ), the corpora obtained by the approach described in (Li and Gaussier, 2010 ) (P 1 and P 2 ) and the original corpus P 0 , with the fixed N value set to 20." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-101", "text": "Table 1 displays the results obtained." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-102", "text": "Each of the last two columns \"P 1 > P 0 \" and \"P 2 > P 0 \" contains the absolute and the relative difference (in %) w.r.t." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-103", "text": "P 0 ." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-104", "text": "As one can note, the best results (in bold) are obtained from the corpora P 2 built with the method we have described in this paper." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-105", "text": "The lexicons extracted from the enhanced corpora are of much higher quality than the ones obtained from the original corpus ." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-106", "text": "For instance, the increase of the precision is 6.9% (30.5% relatively) in P 1 and 23.5% (104.0% relatively) in P 2 , compared with P 0 ." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-107", "text": "The difference is more remarkable with P 2 , which is obtained from a large external corpus P 2 T ." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-108", "text": "Intuitively, one can expect to find, in larger corpora, more documents related to a given corpus, an intuition which seems to be confirmed by our results." 
}, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-109", "text": "One can also notice, by comparing P 2 and P 2 as well as P 1 and P 1 , a remarkable improvement of our approach over the earlier methodology." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-110", "text": "Intuitively, the value N plays an important role in the above experiments." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-111", "text": "In a second series of experiments, we let N vary from 1 to 300 and plot the results obtained with different evaluation measures in Figure 1." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-112", "text": "In Figure 1 (a) (resp." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-113", "text": "Figure 1(b) ), the x-axis corresponds to the values taken by N, and the y-axis to the precision (resp. recall) scores for the lexicons extracted on each of the 5 corpora P 0 , P 1 , P 2 , P 1 and P 2 ." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-114", "text": "A clear fact from the figure is that both the precision and the recall scores increase as N increases, which coincides with our intuition." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-115", "text": "As one can note, our method consistently outperforms the previous work and also the original corpus on all the values considered for N." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-116", "text": "Figure 1: Performance of bilingual lexicon extraction from different corpora with varied N values from 1 to 300." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-117", "text": "The five lines from top to bottom in each subfigure correspond to the results for P 2 , P 2 , P 1 , P 1 and P 0 respectively." 
}, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-118", "text": "----------------------------------" }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-119", "text": "**DISCUSSION**" }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-120", "text": "As previous studies on bilingual lexicon extraction from comparable corpora radically differ on resources used and technical choices, it is very difficult to compare them in a unified framework (Laroche and Langlais, 2010) ." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-121", "text": "In this section, we compare our method with others in the same vein (i.e. enhancing bilingual corpora prior to extracting bilingual lexicons from them)." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-122", "text": "Some works like (Munteanu et al., 2004) and (Munteanu and Marcu, 2006) propose methods to extract parallel fragments from comparable corpora." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-123", "text": "However, their approach only focuses on a very small part of the original corpus, whereas our work aims at preserving most of the vocabulary of the original corpus." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-124", "text": "We have followed here the general approach in (Li and Gaussier, 2010) which consists in enhancing the quality of a comparable corpus prior to extracting information from it." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-125", "text": "However, unlike this latter work, our method ensures homogeneity of the obtained corpus, which ultimately leads to comparable corpora of higher quality." }, { "sent_id": "45c4e9bc90d28cd1a5a61393625062-C001-126", "text": "In turn, such corpora yield better extracted bilingual lexicons." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "45c4e9bc90d28cd1a5a61393625062-C001-14" ] ], "cite_sentences": [ "45c4e9bc90d28cd1a5a61393625062-C001-14" ] }, "@USE@": { "gold_contexts": [ [ "45c4e9bc90d28cd1a5a61393625062-C001-24" ], [ "45c4e9bc90d28cd1a5a61393625062-C001-68" ], [ "45c4e9bc90d28cd1a5a61393625062-C001-75" ], [ "45c4e9bc90d28cd1a5a61393625062-C001-100" ], [ "45c4e9bc90d28cd1a5a61393625062-C001-124" ] ], "cite_sentences": [ "45c4e9bc90d28cd1a5a61393625062-C001-24", "45c4e9bc90d28cd1a5a61393625062-C001-68", "45c4e9bc90d28cd1a5a61393625062-C001-75", "45c4e9bc90d28cd1a5a61393625062-C001-100", "45c4e9bc90d28cd1a5a61393625062-C001-124" ] } } }, "ABC_d61f75366022f043d4c3a005b5a73d_33": { "x": [ { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-2", "text": "We explore the ability of word embeddings to capture both semantic and morphological similarity, as affected by the different types of linguistic properties (surface form, lemma, morphological tag) used to compose the representation of each word." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-3", "text": "We train several models, where each uses a different subset of these properties to compose its representations." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-4", "text": "By evaluating the models on semantic and morphological measures, we reveal some useful insights on the relationship between semantics and morphology." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-5", "text": "----------------------------------" }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-7", "text": "Word embedding models learn a space of continuous word representations, in which similar words are expected to be close to each other." 
}, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-8", "text": "Traditionally, the term similar refers to semantic similarity (e.g. walking should be close to hiking, and happiness to joy), hence the model performance is usually evaluated using semantic similarity datasets." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-9", "text": "Recently, several works introduced morphology-driven models motivated by the poor performance of traditional models on morphologically complex words." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-10", "text": "Such words are often rare, and there is not enough evidence to model them correctly." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-11", "text": "The morphology-driven models allow pooling evidence from different words which have the same base form." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-12", "text": "These models work by learning per-morpheme representations rather than just per-word ones, and compose the representing vector of each word from those of its morphemes -as derived from a supervised or unsupervised morphological analysis -and (optionally) its surface form (e.g. walking = f (v walk , v ing , v walking ))." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-13", "text": "The works differ in the way they acquire morphological knowledge (from using linguistically derived morphological analyzers on one end, to approximating morphology using substrings while relying on the concatenative nature of morphology, on the other) and in the model form (cDSMs (Lazaridou et al., 2013) , RNN (Luong et al., 2013) , LBL (Botha and Blunsom, 2014) , CBOW (Qiu et al., 2014) , SkipGram (Soricut and Och, 2015; Bojanowski et al., 2016) , GGM (Cotterell et al., 2016) )." 
}, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-14", "text": "But essentially, they all show that breaking a word into morphological components (base form, affixes and potentially also the complete surface form), learning a vector for each component, and representing a word as a composition of these vectors improves the model's semantic performance, especially on rare words." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-15", "text": "In this work we argue that these models capture two distinct aspects of word similarity, semantic (e.g. sim(walking, hiking) > sim(walking, eating)) and morphological (e.g. sim(walking, hiking) > sim(walking, hiked)), and that these two aspects are at odds with each other (should sim(walking, hiking) be lower or higher than sim(walking, walked)?)." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-37", "text": "Most of the works on morphology-driven models evaluate the semantic performance of the models, while others perform morphological evaluation." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-38", "text": "To the best of our knowledge, this work is the first to evaluate both aspects." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-39", "text": "While our experiments focus on Modern Hebrew due to the availability of a reliable semantic similarity dataset, we believe our conclusions hold more generally." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-40", "text": "----------------------------------" }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-41", "text": "**MODELS**" }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-42", "text": "Our model form is a generalization of the fastText model (Bojanowski et al., 2016), which in turn extends the skip-gram model of Mikolov et al. (2013)." 
}, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-43", "text": "The skip-gram model takes a sequence of words w 1 , ..., w T and a function s assigning scores to (word, context) pairs, and maximizes \u2211 t [ \u2211 wc\u2208Ct \u2113(s(w t , w c )) + \u2211 n\u2208Nt \u2113(\u2212s(w t , n)) ]," }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-44", "text": "where \u2113 is the log-sigmoid loss function, C t is a set of context words, and N t is a set of negative examples sampled from the vocabulary." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-45", "text": "s(w t , w c ) is defined as s(w t , w c ) = v \u22a4 wt u wc (where v wt and u wc are the embeddings of the focus and the context words)." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-46", "text": "Bojanowski et al. (2016) replace the word representation v wt with the set of character n-grams appearing in it: v wt = \u2211 g\u2208G(wt) v g where G(w t ) is the set of n-grams appearing in w t ." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-47", "text": "The n-grams are used to approximate the morphemes in the target word." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-48", "text": "We generalize Bojanowski et al. (2016) by replacing the set of n-grams G(w) with a set P(w) of explicit linguistic properties." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-49", "text": "Each word w t is then composed as the sum of the vectors of its linguistic properties: v wt = \u2211 p\u2208P(wt) v p ." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-50", "text": "The linguistic properties we consider are the surface form of the word (W), its lemma (L) and its morphological tag (M) 1 ." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-51", "text": "The lemma corresponds to the base form, and the morphological tag encodes the grammatical properties of the word, from which its inflectional affixes are derived (a similar approach was taken by Cotterell and Sch\u00fctze (2015) )." 
}, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-52", "text": "Moving from a set of n-grams to a set of explicit linguistic properties allows finer control of the kinds of information in the word representation." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-53", "text": "We train models with different subsets of {W, L, M }." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-54", "text": "----------------------------------" }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-55", "text": "**EXPERIMENTS AND RESULTS**" }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-56", "text": "Our implementation is based on the fastText 2 library (Bojanowski et al., 2016) , which we modify as described above." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-57", "text": "We train the models on the Hebrew Wikipedia (\u223c4M sentences), using a window size of 2 to each side of the focus word, and dimensionality of 200." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-58", "text": "We use the morphological disambiguator of Adler (2007) to assign words with their morphological tags, and the inflection dictionary of MILA (Itai and Wintner, 2008). Semantic Evaluation Measure: The common datasets for semantic similarity have some notable shortcomings, as noted in (Avraham and Goldberg, 2016; Faruqui et al., 2016; Batchkarov et al., 2016; Linzen, 2016) ." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-59", "text": "We use the evaluation method (and corresponding Hebrew similarity dataset) that we have introduced in a previous work (Avraham and Goldberg, 2016) (AG)." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-60", "text": "The AG method defines an annotation task which is more natural for human judges, resulting in datasets with improved annotator-agreement scores." 
}, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-61", "text": "Furthermore, the AG's evaluation metric takes annotator agreement into account, by putting less weight on similarities that have lower annotator agreement." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-62", "text": "An AG dataset is a collection of target-groups, where each group contains a target word (e.g. singer) and three types of candidate words: positives which are words \"similar\" to the target (e.g. musician), distractors which are words \"related but dissimilar\" to the target (e.g. microphone), and randoms which are not related to the target at all (e.g. laptop)." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-63", "text": "The human annotators are asked to rank the positive words by their similarity to the target word (distractor and random words are not annotated by humans and are automatically ranked below the positive words)." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-64", "text": "This results in a set of triples of a target word w and two candidate words c 1 , c 2 , coupled with a value indicating the confidence of ranking sim(w, c 1 ) > sim(w, c 2 ) by the annotators." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-65", "text": "A model is then scored based on its ability to correctly rank each triple, giving more weight to highly-confident triples." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-66", "text": "The scores range from 0 (all wrong answers) to 1 (perfect match with human annotators)." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-67", "text": "We use this method on two datasets: the AG dataset from (Avraham and Goldberg, 2016) (SemanticSim, containing 1819 triples), and a new dataset we created in order to evaluate the models on rare words (similar to RW (Luong et al., 2013) )." 
}, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-68", "text": "The rare-words dataset (SemanticSimRare) follows the structure of SemanticSim, but includes only target words that occur fewer than 100 times in the corpus." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-69", "text": "It contains a total of 163 triples, all of the type positive vs. random (we find that for rare words, distinguishing similar words from random ones is a hard enough task for the models)." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-70", "text": "Cotterell and Sch\u00fctze (2015) introduced the MorphoDist k measure, which quantifies the amount of morphological difference between a target word and a list of its k most similar words." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-71", "text": "We modify the MorphoDist k measure to derive MorphoSim k , a measure that ranges between 0 and 1, where 1 indicates total morphological compatibility." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-72", "text": "The MorphoDist measure is defined as: MorphoDist k (w) = \u2211 w\u2032\u2208Kw min mw,mw\u2032 d h (m w , m w\u2032 ), where K w is the set of top-k similarities of w, m w and m w\u2032 are possible morphological tags of w and w\u2032 respectively (there may be more than one possible morphological interpretation per word), and d h is the Hamming distance between the morphological tags." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-73", "text": "MorphoDist counts the total number of incompatible morphological components." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-74", "text": "----------------------------------" }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-75", "text": "**MORPHOLOGICAL EVALUATION MEASURE**" }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-76", "text": "MorphoSim k (w) is accordingly normalized as 1 \u2212 MorphoDist k (w) / (k \u00b7 |m w |), where |m w | is the number of grammatical components specified in w's morphological tag." 
}, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-77", "text": "We use k=10 and calculate the average MorphoSim score over 100 randomly chosen words." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-78", "text": "To evaluate the morphological performance on rare words, we run another benchmark (MorphoSimRare) in which we calculate the average MorphoSim score over the 35 target words of the SemanticSimRare dataset." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-79", "text": "Qualitative Results: To get an impression of the differences in behavior between the models, we queried each model for the top similarities of several words (calculated by cosine similarity between word vectors), focusing on rare words." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-80", "text": "Table 1 presents the top-3 similarities for the word \u00d0 \u00de\u00d5 ([she] looked [at]), which occurs 17 times in the corpus, under the different models." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-81", "text": "Unsurprisingly, the lemma component has a positive effect on semantics, while the tag component improves the morphological performance." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-82", "text": "It also shows a clear trade-off between the two aspects: models which perform best on semantics are the worst on morphology." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-83", "text": "This behavior is representative of the dozens of words we examined." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-84", "text": "----------------------------------" }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-85", "text": "**QUANTITATIVE RESULTS**" }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-86", "text": "We compare the different models on the different measures, and also compare to the state-of-the-art n-gram based fastText model of Bojanowski et al. (2016) that does not require morphological analysis." 
}, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-87", "text": "The results (Table 2) highlight the following:" }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-88", "text": "1. There is a trade-off between semantic and morphological performance: improving one aspect comes at the expense of the other. The lemma component improves semantics but hurts morphology, while the opposite is true for the tag component." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-89", "text": "The common practice of using both components together is a kind of compromise: the LM, WLM and n-grams models are neither the best nor the worst on any measure." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-90", "text": "2. The impacts of the lemma and the tag components are much larger when dealing with rare words: compared to W, WL is only 1.7% better on SS and 3.8% worse on MS, while it's 16.3% better and 11.9% worse on SSR and MSR (respectively)." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-91", "text": "Similarly, WM is only 2.8% worse than W on SS and 44.9% better on MS, while it's 21.8% worse and 75.7% better on SSR and MSR (respectively)." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-92", "text": "3. Simply lemmatizing the words is very effective for capturing semantic similarity." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-93", "text": "This is especially true for the rare words, on which the L model clearly outperforms all others." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-94", "text": "For the common words, we see a small drop compared to including the surface form as well (WL, WLM)." 
}, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-95", "text": "This is attributed to cases in which some of the semantics lies within the word's morphological template, for example: in W model, most similar words for the masculine verb \u00d0\u00d8\u00d4 (fell) are associated with a soldier (which is a masculine noun): \u00dc \u00d4 (was killed), \u00d6 \u00d8\u00d4 (was injured), while the similarities of the feminine form \u00d0\u00d8\u00d4 are associated with a land or a state (both are feminine nouns): \u00d8 \u00d5 (was annexed), \u00dd \u00d4 (was occupied)." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-96", "text": "In L model -\u00d0\u00d8\u00d4 and \u00d0\u00d8\u00d4 share a single, less accurate representation (somewhat similarly to representations of ambiguous words)." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-97", "text": "This suggests using different compositions for common and rare words." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-98", "text": "----------------------------------" }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-99", "text": "**CONCLUSIONS**" }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-100", "text": "Our key message is that users of morphologydriven models should consider the trade-off between the different components of their representations." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-101", "text": "Since the goal of most works on morphology-driven models was to improve semantic similarity, the configurations they used (which combine both semantic and morphological components) were probably not the best choices: we show that using the lemma component (either alone or together with the surface form) is better." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-102", "text": "Indeed, excluding the morphological component will make the morphological similarity drop, but it's not necessarily a problem for every task." 
}, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-103", "text": "One should include the morphological component in the embeddings only for tasks in which morphological similarity is required and cannot be handled by other means." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-104", "text": "Future work could perform an extrinsic evaluation of the different models in various downstream applications." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-105", "text": "This may reveal which kinds of tasks benefit from morphological information, and which can be done better by a pure semantic model." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-16", "text": "The base form component of the compositional models is mostly responsible for semantic aspects of the similarity, while the affixes are mostly responsible for morphological similarity." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-17", "text": "This analysis raises several natural questions: is the combination of semantic and morphological components used in previous work ideal for every purpose?" }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-18", "text": "For example, if we exclude the morphological component from the representations, wouldn't it improve the semantic performance? What is the contribution of using the surface form? And do the models behave differently on common and rare words?" }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-19", "text": "We explore these questions in order to help the users of morphology-driven models choose the right configuration for their needs: semantic or morphological performance, on common or rare words." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-20", "text": "We compare different configurations of morphology-driven models, while controlling for the components composing the representation." 
}, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-21", "text": "We then separately evaluate the semantic and morphological performance of each model, on rare and on common words." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-22", "text": "We focus on inflectional (rather than derivational) morphology." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-23", "text": "This is due to the fact that derivations (e.g. affected \u2192 unaffected) often drastically change the meaning of the word, and therefore the benefit of having similar representations for words with the same derivational base is questionable, as discussed by Lazaridou et al. (2013) and Luong et al. (2013)." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-24", "text": "Inflections (e.g. walked \u2192 walking), in contrast, preserve the word's lexical meaning, and only change the values of its grammatical categories." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-25", "text": "Our experiments are performed on Modern Hebrew, a language with a rich inflectional morphology." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-26", "text": "We build on a recently introduced evaluation dataset for semantic similarity in Modern Hebrew (Avraham and Goldberg, 2016) , which we further extend with a collection of rare words." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-27", "text": "We also create datasets for morphological similarity, for common and rare words." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-28", "text": "Hebrew's morphology is not concatenative, so unlike most previous work we do not break the words into base and affixes, but instead rely on a morphological analyzer and represent words using their lemmas (corresponding to the base form) and their morphological tags (from which the morphological forms are derived, corresponding to affixes)." 
}, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-29", "text": "This allows us to have finer-grained control over the composition, separating inflectional from derivational processes." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-30", "text": "We also compare to a strong character n-gram based model, which mixes the different components and does not allow finer-grained distinctions." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-31", "text": "We observe a clear trade-off between the morphological and semantic performance: models that excel on one metric perform badly on the other." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-32", "text": "We present the strengths and weaknesses of the different configurations, to help the users choose the one that best fits their needs." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-33", "text": "We believe that this work is the first to make a comprehensive comparison between various configurations of morphology-driven models: among the previous work mentioned above, only a few explored configurations other than (base + affixes) or (surface + base + affixes)." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-34", "text": "Lazaridou et al. (2013) and Luong et al. (2013) trained models which represent a word by its base only, and showed that these models perform worse than the compositional ones (base + affixes)." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-35", "text": "However, the poor results for the base-only models were mainly attributed to undesirable capturing of derivational similarity, e.g. (affected, unaffected)." }, { "sent_id": "d61f75366022f043d4c3a005b5a73d-C001-36", "text": "Working with a more linguistically informed morphological analyzer allows us to tease apart inflectional from derivational processes, leading to different results." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "d61f75366022f043d4c3a005b5a73d-C001-13" ] ], "cite_sentences": [ "d61f75366022f043d4c3a005b5a73d-C001-13" ] }, "@SIM@": { "gold_contexts": [ [ "d61f75366022f043d4c3a005b5a73d-C001-42" ] ], "cite_sentences": [ "d61f75366022f043d4c3a005b5a73d-C001-42" ] }, "@EXT@": { "gold_contexts": [ [ "d61f75366022f043d4c3a005b5a73d-C001-42" ], [ "d61f75366022f043d4c3a005b5a73d-C001-48" ], [ "d61f75366022f043d4c3a005b5a73d-C001-56" ] ], "cite_sentences": [ "d61f75366022f043d4c3a005b5a73d-C001-42", "d61f75366022f043d4c3a005b5a73d-C001-48", "d61f75366022f043d4c3a005b5a73d-C001-56" ] }, "@UNSURE@": { "gold_contexts": [ [ "d61f75366022f043d4c3a005b5a73d-C001-86" ] ], "cite_sentences": [ "d61f75366022f043d4c3a005b5a73d-C001-86" ] } } }, "ABC_b208c7180bc3f973b8616937b2801c_33": { "x": [ { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-2", "text": "Image paragraph captioning models aim to produce detailed descriptions of a source image." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-3", "text": "These models use similar techniques as standard image captioning models, but they have encountered issues in text generation, notably a lack of diversity between sentences, that have limited their effectiveness." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-4", "text": "In this work, we consider applying sequence-level training for this task." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-5", "text": "We find that standard self-critical training produces poor results, but when combined with an integrated penalty on trigram repetition produces much more diverse paragraphs." 
}, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-6", "text": "This simple training approach improves on the best result on the Visual Genome paragraph captioning dataset from 16.9 to 30.6 CIDEr, with gains on METEOR and BLEU as well, without requiring any architectural changes." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-7", "text": "----------------------------------" }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-9", "text": "Image captioning aims to describe the objects, actions, and details present in an image using natural language." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-10", "text": "Most image captioning research has focused on single-sentence captions, but the descriptive capacity of this form is limited; a single sentence can only describe in detail a small aspect of an image." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-11", "text": "Recent work has argued instead for image paragraph captioning with the aim of generating a (usually 5-8 sentence) paragraph describing an image." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-12", "text": "Compared with single-sentence captioning, paragraph captioning is a relatively new task." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-13", "text": "The main paragraph captioning dataset is the Visual Genome corpus, introduced by Krause et al. (2016) ." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-14", "text": "When strong single-sentence captioning models are trained on this dataset, they produce repetitive paragraphs that are unable to describe diverse aspects of images." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-15", "text": "The generated paragraphs repeat a slight variant of the same sentence multiple times, even when beam search is used." 
}, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-16", "text": "Prior work, discussed in the following section, tried to address this repetition with architectural changes, such as hierarchical LSTMs, which separate the generation of sentence topics and words." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-17", "text": "In this work, we consider an approach for training paragraph captioning models that focuses on increasing the diversity of the output paragraph." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-18", "text": "In particular, we note that self-critical sequence training (SCST) (Ranzato et al., 2015; Rennie et al., 2016) , a technique which uses policy gradient methods to directly optimize a target metric, has been successfully employed in standard captioning, but not in paragraph captioning." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-19", "text": "We observe that during SCST training the intermediate results of the system lack diversity, which makes it difficult for the model to improve." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-20", "text": "We address this issue with a simple repetition penalty which downweights trigram overlap." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-21", "text": "Experiments show that this technique greatly improves the baseline model." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-22", "text": "A simple baseline, non-hierarchical model trained with repetitionpenalized SCST outperforms complex hierarchical models trained with both cross-entropy and customized adversarial losses." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-23", "text": "We demonstrate that this strong performance gain comes from the combination of repetition-penalized search and SCST, rather than from either individually, and discuss how this impacts the output paragraphs." 
}, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-24", "text": "----------------------------------" }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-25", "text": "**BACKGROUND AND RELATED WORK**" }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-26", "text": "Nearly all modern image captioning models employ variants of an encoder-decoder architecture." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-27", "text": "As introduced by Vinyals et al. (2014) , the encoder is a CNN pre-trained for classification and the decoder is a LSTM or GRU." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-28", "text": "Following work in machine translation, Xu et al. (2015) added an attention mechanism over the encoder features." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-29", "text": "Recently, Anderson et al. (2017) further improved single-sentence captioning performance by incorporating object detection in the encoder (bottomup attention) and adding an LSTM layer before attending to spatial features in the decoder (top-down attention)." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-30", "text": "Single-sentence and paragraph captioning models are evaluated with a number of metrics, including some designed specifically for captioning (CIDEr) and some adopted from machine translation (BLEU, METEOR)." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-31", "text": "CIDEr and BLEU measure accuracy with n-gram overlaps, with CIDEr weighting n-grams by TF-IDF (termfrequency inverse-document-frequency), and ME-TEOR uses unigram overlap, incorporating synonym and paraphrase matches." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-32", "text": "We discuss these metrics in greater detail when analyzing our experiments." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-33", "text": "Krause et al. 
(2016) introduced the first large-scale paragraph captioning dataset, a subset of the Visual Genome dataset, along with a number of models for paragraph captioning." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-34", "text": "Empirically, they showed that paragraphs contain significantly more pronouns, verbs, coreferences, and greater overall \"diversity\" than single-sentence captions." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-35", "text": "Whereas most single-sentence captions in the MSCOCO dataset describe only the most important object or action in an image, paragraph captions usually touch on multiple objects and actions." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-36", "text": "----------------------------------" }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-37", "text": "**RELATED MODELS**" }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-38", "text": "The paragraph captioning models proposed by Krause et al. (2016) included template-based (non-neural) approaches and two encoder-decoder models." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-39", "text": "In both neural models, the encoder is an object detector pre-trained for dense captioning." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-40", "text": "In the first model, called the flat model, the decoder is a single LSTM which outputs an entire paragraph word-by-word." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-41", "text": "In the second model, called the hierarchical model, the decoder is composed of two LSTMs, where the output of one sentence-level LSTM is used as input to the other word-level LSTM." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-42", "text": "Recently, Liang et al. (2017) extended this model with a third (paragraph-level) LSTM and added adversarial training." 
}, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-43", "text": "In total, their model (RTT-GAN) incorporates three LSTMs, two attention mechanisms, a phrase copy mechanism, and two adversarial discriminators." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-44", "text": "To the best of our knowledge, this model achieves state-ofthe-art performance of 16.9 CIDEr on the Visual Genome dataset (without external data)." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-45", "text": "For our experiments, we use the top-down single-sentence captioning model in Anderson et al. (2017) ." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-46", "text": "This model is similar to the \"flat\" model in Krause et al. (2016) , except that it incorporates attention with a top-down mechanism." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-47", "text": "----------------------------------" }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-48", "text": "**APPROACH**" }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-49", "text": "The primary issue in current paragraph captioning models, especially non-hierarchical ones, is lack of diversity of topic in the output paragraph." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-50", "text": "For example, for the image of a skateboarder in Figure 1 , the flat model outputs \"The man is wearing a black shirt and black pants\" seven times." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-51", "text": "This example is not anomalous: it is a typical failure case of the model." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-52", "text": "Empirically, in validation, ground truth paragraphs contain 0.62 repeated trigrams on average, whereas paragraphs produced by the flat cross-entropy model contain 25.9 repeated trigrams on average." 
}, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-53", "text": "----------------------------------" }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-54", "text": "**SELF-CRITICAL SEQUENCE TRAINING**" }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-55", "text": "Self-critical sequence training (SCST) is a sequence-level optimization procedure proposed by Rennie et al. (2016) , which has been widely adopted in single-sentence captioning but has not yet been applied to paragraph captioning." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-56", "text": "This method provides an alternative approach to wordlevel cross-entropy which can incorporate a task specific metric." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-57", "text": "Sequence-level training employs a policy gradient method to optimize directly for a nondifferentiable metric, such as CIDEr or BLEU." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-58", "text": "This idea was first applied to machine translation by Ranzato et al. (2015) in a procedure called MIXER, which incrementally transitions from cross-entropy to policy gradient training." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-59", "text": "To normalize the policy gradient reward and reduce variance during training, MIXER subtracts a baseline estimate of the reward as calculated by a linear regressor." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-60", "text": "SCST replaces this baseline reward estimate with the reward obtained by the test-time inference algorithm, namely the CIDEr score of greedy search." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-61", "text": "This weights the gradient by the difference in reward given to a sampled paragraph compared to the current greedy output (see Eq. 3-9 in (Rennie et al., 2016) )." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-62", "text": "Additionally, SCST uses a hard transition from cross-entropy to policy gradient training." 
}, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-63", "text": "The final gradient is:" }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-64", "text": "Where w s is a sampled paragraph, w g is a greedy decoded paragraph, r is the reward (e.g CIDEr), p \u03b8 is the captioning model." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-65", "text": "----------------------------------" }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-66", "text": "**REPETITION PENALTY**" }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-67", "text": "In preliminary experiments, we find that directly applying SCST is not effective for paragraph captioning models." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-68", "text": "Table 1 shows that when training with SCST, the model performs only marginally better than cross-entropy." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-69", "text": "In further analysis, we see that the greedy baseline in SCST training has very non-diverse output, which leads to poor policy gradients." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-70", "text": "Unlike in standard image captioning, the cross-entropy model is too weak for SCST to be effective." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-71", "text": "To address this problem, we take inspiration from recent work in abstractive text summarization, which encounters the same challenge when producing paragraph-length summaries of documents (Paulus et al., 2017) ." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-72", "text": "These models target the repetition problem by simply preventing the model from producing the same trigram more than once during inference." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-73", "text": "We therefore introduce an inference constraint that penalizes the logprobabilities of words that would result in repeated trigrams." 
}, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-74", "text": "The penalty is proportional to the number of times the trigram has already been generated." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-75", "text": "Formally, denote the (pre-softmax) output of the LSTM by o, where the length of o is the size of the target vocabulary and o w is the log-probability of word w. We modify o w by o w \u2192 o w \u2212 k w \u00b7 \u03b1, where k w is the number of times the trigram completed by word w has previously been generated in the paragraph, and \u03b1 is a hyperparameter which controls the degree of blocking." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-76", "text": "When \u03b1 = 0, there is no penalty, so we have standard greedy search." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-77", "text": "When \u03b1 \u2192 \u221e, or in practice when \u03b1 exceeds about 5, we have full trigram blocking." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-78", "text": "We incorporate this penalty into the greedy baseline used to compute policy gradients in SCST." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-79", "text": "During inference, we employ the same repetition-penalized greedy search." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-80", "text": "----------------------------------" }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-81", "text": "**METHODS AND RESULTS**" }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-82", "text": "For our paragraph captioning model we use the top-down model from Anderson et al. (2017) ." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-83", "text": "Our encoder is a convolutional network pretrained for object detection (as opposed to dense captioning, as in Krause et al. (2016) and Liang et al. (2017) )." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-84", "text": "METEOR CIDEr BLEU-1 BLEU-2 BLEU-3 BLEU-4 Krause et al. 
(Template) The encoder extracts between 10 and 100 objects per image and applies spatial max-pooling to yield a single feature vector of dimension 2048 per object." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-85", "text": "The decoder is a 1-layer LSTM with hidden dimension 512 and top-down attention." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-86", "text": "Evaluation is done on the Visual Genome dataset with the splits provided by Krause et al. (2016) ." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-87", "text": "We first train for 25 epochs with crossentropy (XE) loss, using Adam with learning rate 5 \u00b7 10 \u22124 ." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-88", "text": "We then train an additional 25 epochs with repetition-penalized SCST targeting a CIDEr-based reward, using Adam with learning rate 5 \u00b7 10 \u22125 ." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-89", "text": "Our PyTorch-based implementation is available at https://github.com/lukemelas/ image-paragraph-captioning." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-90", "text": "Results Table 1 shows the main experimental results." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-91", "text": "Our baseline cross-entropy captioning model gets similar scores to the original flat model." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-92", "text": "When the repetition penalty is applied to a model trained with cross-entropy, we see a large improvement on CIDEr and a minor improvement on other metrics." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-93", "text": "1 When combining the repetition penalty with SCST, we see a dramatic improvement across all metrics, and particularly on CIDEr." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-94", "text": "Interestingly, SCST only works when its baseline reward model is strong; for this reason the combination of the repetition penalty and SCST is particularly effective." 
}, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-95", "text": "Table 3 : Analysis of different model outputs (\u03b1 = 2.0 for models w/ penalty) Finally, Table 3 shows quantitative changes in trigram repetition and ground truth matches." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-96", "text": "The cross-entropy model fails to generate enough unique phrases." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-97", "text": "Blocking these entirely gives some benefit, but the SCST model is able to raise the total number of matched trigrams while reintroducing few repeats." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-98", "text": "----------------------------------" }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-99", "text": "**CONCLUSION**" }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-100", "text": "This work targets increased diversity in image paragraph captioning." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-101", "text": "We show that training with SCST combined with a repetition penalty leads to a substantial improvement in the state-of-theart for this task, without requiring architectural changes or adversarial training." }, { "sent_id": "b208c7180bc3f973b8616937b2801c-C001-102", "text": "In future work, we hope to further address the language issues of paragraph generation as well as extend this simple approach to other tasks requiring long-form text or paragraph generation." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "b208c7180bc3f973b8616937b2801c-C001-13" ], [ "b208c7180bc3f973b8616937b2801c-C001-33" ], [ "b208c7180bc3f973b8616937b2801c-C001-38" ], [ "b208c7180bc3f973b8616937b2801c-C001-45", "b208c7180bc3f973b8616937b2801c-C001-46" ] ], "cite_sentences": [ "b208c7180bc3f973b8616937b2801c-C001-13", "b208c7180bc3f973b8616937b2801c-C001-33", "b208c7180bc3f973b8616937b2801c-C001-38", "b208c7180bc3f973b8616937b2801c-C001-46" ] }, "@SIM@": { "gold_contexts": [ [ "b208c7180bc3f973b8616937b2801c-C001-45", "b208c7180bc3f973b8616937b2801c-C001-46" ] ], "cite_sentences": [ "b208c7180bc3f973b8616937b2801c-C001-46" ] }, "@DIF@": { "gold_contexts": [ [ "b208c7180bc3f973b8616937b2801c-C001-83" ] ], "cite_sentences": [ "b208c7180bc3f973b8616937b2801c-C001-83" ] }, "@USE@": { "gold_contexts": [ [ "b208c7180bc3f973b8616937b2801c-C001-86" ] ], "cite_sentences": [ "b208c7180bc3f973b8616937b2801c-C001-86" ] } } }, "ABC_201aa2a740b5d45f273ee298595f5a_33": { "x": [ { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-66", "text": "**QUALITATIVE ANALYSIS**" }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-49", "text": "We interpret the lower amount of \"real\" extremes (1 and 5) for verbs as an indicator of the difficulty that participants had to clearly norm verbs compared to nouns." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-2", "text": "In recent years, both cognitive and computational research has provided empirical analyses of contextual co-occurrence of concrete and abstract words, partially resulting in inconsistent pictures." 
}, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-3", "text": "In this work we provide a more fine-grained description of the distributional nature in the corpusbased interaction of verbs and nouns within subcategorisation, by investigating the concreteness of verbs and nouns that are in a specific syntactic relationship with each other, i.e., subject, direct object, and prepositional object." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-4", "text": "Overall, our experiments show consistent patterns in the distributional representation of subcategorising and subcategorised concrete and abstract words." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-5", "text": "At the same time, the studies reveal empirical evidence why contextual abstractness represents a valuable indicator for automatic non-literal language identification." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-6", "text": "----------------------------------" }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-8", "text": "The need of providing a clear description of the usage of concrete and abstract words in communication is becoming salient both in cognitive science and in computational linguistics." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-9", "text": "In the cognitive science community, much has been said about concrete concepts, but there is still an open debate about the nature of abstract concepts (Barsalou and Wiemer-Hastings, 2005; McRae and Jones, 2013; Hill et al., 2014; Vigliocco et al., 2014) ." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-10", "text": "Computational linguists have recognised the importance of investigating the concreteness of contexts in empirical models, for example for the automatic identification of non-literal language usage (Turney et al., 2011; K\u00f6per and Schulte im Walde, 2016; Aedmaa et al., 2018) ." 
}, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-50", "text": "For example, while comparing the nouns belief 1.2 and ball 5.0 humans would have a clear agreement on highly abstract and highly concrete scores; on the contrary, the distinction between moralise 1.4 and sit 4.8 might be less clear." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-51", "text": "In our main study, we analysed the concreteness of the nouns that are in a specific and direct syntactic relation with verbs." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-52", "text": "The overall distributions in Figure 2 are extremely consistent across syntactic relations: when looking at the means, the concreteness of nouns subcategorised by concrete verbs is significantly higher than the concreteness of nouns subcategorised by abstract verbs (all p-values < 0.001)." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-53", "text": "This result is perfectly in line with the more general analysis by Naumann et al. (2018) ." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-54", "text": "Table 1 investigates more deeply the interaction between the concreteness of verbs and nouns for different syntactic functions." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-55", "text": "It reports the average concreteness scores of the nouns subcategorised by concrete and abstract verbs (\u00b1 standard deviation), the difference between the concrete and abstract scores (with significance tests) and the overall average concreteness score by function." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-56", "text": "The statistical analyses have been performed using a standard linear regression model." 
}, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-57", "text": "The comparison between the scores in the first two columns (Abstract Verbs and Concrete Verbs) confirms that subject and direct object nouns that are subcategorised by concrete verbs are significantly more concrete than those subcategorised by abstract verbs." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-58", "text": "The \"Difference C-A\" column shows that these differences are all highly significant." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-59", "text": "In addition, the nouns subcategorised by concrete verbs are extremely high on the concreteness scale (mean 1 In this paper the number in subscript indicates the concreteness score from the Brysbaert et al. (2014) norms." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-60", "text": "By zooming in on the specific functions, we see that subjects are significantly more concrete than direct objects for both abstract and concrete verbs." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-61", "text": "The concreteness scores of subjects of passivised sentences are in between in both categories." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-62", "text": "This pattern is confirmed by looking at the \"Overall\" column." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-63", "text": "Prepositional objects that are subcategorised by concrete verbs are significantly more concrete than prepositional objects subcategorised by abstract verbs, across prepositions." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-64", "text": "However, given the extreme variability in the prepositions used, we will analyse the most representative pobjs more specifically in the following section." 
}, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-65", "text": "----------------------------------" }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-11", "text": "Recently, multiple studies have focussed on providing a fine-grained analysis of the nature of concrete vs. abstract words from a corpus-based perspective (Bhaskar et al., 2017; Frassinelli et al., 2017; Naumann et al., 2018) ." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-12", "text": "In these studies, the authors have shown a general but consistent pattern: concrete words have a preference to co-occur with other concrete words, while abstract words co-occur more frequently with abstract words." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-13", "text": "Specifically, Naumann et al. (2018) performed their analyses across parts-of-speech by comparing the behaviour of nouns, verbs and adjectives in large-scale corpora." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-14", "text": "These results are not fully in line with various theories of cognition which suggest that both concrete and abstract words should co-occur more often with concrete words because concrete information links the real-world usage of both concrete and abstract words to their mental representation (Barsalou, 1999; Pecher et al., 2011) ." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-15", "text": "----------------------------------" }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-16", "text": "**THE CURRENT STUDY**" }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-17", "text": "In the current study we build on prior evidence from the literature and perform a more fine-grained corpus-based analysis on the distribution of concrete and abstract words by specifically looking at the types of syntactic relations that connect nouns to verbs in sentences." 
}, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-18", "text": "More specifically, we look at the concreteness of verbs and the corresponding nouns as subjects, direct objects and prepositional objects." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-19", "text": "This study is carried out in a quantitative fashion to identify general trends." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-20", "text": "However, we also look into specific examples to better understand the types of nouns that attach to specific verbs." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-21", "text": "First of all, we expect to replicate the main results from Naumann et al. (2018) : in general, concrete nouns should co-occur more frequently with concrete verbs and abstract nouns with abstract verbs." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-22", "text": "Moreover, we expect to identify the main patterns that characterise semantic effects of an interaction of concreteness in verb-noun subcategorisation, such as collocations and meaning shifts." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-23", "text": "The motivation for this study is twofold: (1) From a cognitive science perspective we seek additional and more fine-grained evidence to better understand the clash between the existing corpus-based studies and the theories of cognition which predict predominantly concrete information in the context of both concrete and abstract words." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-24", "text": "(2) From a computational perspective we expect some variability in the interaction of concreteness in verb-noun subcategorisation, given that abstract contexts are ubiquitous and salient empirical indicators for non-literal language identification, cf. carry a bag vs. carry a risk." 
}, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-25", "text": "----------------------------------" }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-26", "text": "**MATERIALS**" }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-27", "text": "In the following analyses, we used nouns and verbs extracted from the Brysbaert et al. (2014) collection of concreteness ratings." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-28", "text": "In this resource, the concreteness of 40,000 English words was evaluated by human participants on a scale from 1 (abstract) to 5 (concrete)." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-29", "text": "Given that participants did not have any overt information about part-of-speech (henceforth, POS) while performing the norming study, Brysbaert et al. added this information post-hoc from the SUBTLEX-US, a 51-million word subtitle corpus (Brysbaert and New, 2009) ." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-30", "text": "In order to align the POS information to the current study, we disambiguated the POS of the normed words by extracting their most frequent POS from the 10-billion word corpus ENCOW16AX (see below for details)." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-31", "text": "Moreover, as discussed in previous studies by Naumann et al. (2018) and Pollock (2018) , mid-range concreteness scores indicate words that are difficult to categorise unambiguously regarding their concreteness." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-32", "text": "For this reason and in order to obtain a clear picture of the behaviour of concrete vs. abstract words, we selected only words with very high (concrete) or very low (abstract) concreteness scores." 
}, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-33", "text": "We included in our analyses the 1000 most concrete (concreteness range: 4.86 -5.00) and 1000 most abstract (1.04 -1.76) nouns, and the 500 most concrete (3.80 -5.00) and most abstract (1.19 -2.00) verbs." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-34", "text": "We chose to include a smaller selection of verbs compared to the nouns because we considered verbs to be more difficult to evaluate by humans according to their concreteness scores and consequently noisier and more ambiguous for the analyses we are conducting." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-35", "text": "The corpus analyses were performed on the parsed version of the sentence-shuffled English EN-COW16AX corpus (Sch\u00e4fer and Bildhauer, 2012) ." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-36", "text": "For each sentence in the corpus, we extracted the verbs in combination with the nouns when they both occur in our selection of words from Brysbaert et al. (2014) and when the nouns are parsed as subjects (in active and passive sentences: nsubj and nsubjpass), direct objects (dobj) or prepositional objects (pobj) of the verbs." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-37", "text": "In the case of pobj, we considered the 20 most frequent prepositions (e.g., of, in, for, at) in the corpus." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-38", "text": "In total, we extracted 11,716,189 verb-noun token pairs including 3,814,048 abstract verb tokens; 7,902,141 concrete verb tokens; 3,701,669 abstract noun tokens; and 8,014,520 concrete noun tokens." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-39", "text": "In 2,958,308 cases, the noun was parsed as the subject of the verb (with 748,438 of them as subjects in passive constructions), in 5,011,347 cases the noun was the direct object, and in 3,746,534 cases the noun was a prepositional object." 
}, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-40", "text": "Already by looking at these numbers it is possible to identify a strong frequency bias in favour of concrete words; we will discuss later in the paper how this bias affects the results reported." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-41", "text": "All the analyses reported in the following sections are performed at token level." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-42", "text": "----------------------------------" }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-43", "text": "**QUANTITATIVE ANALYSIS**" }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-44", "text": "In a pre-test we analysed the overall distributions of verbs and nouns according to their concreteness scores." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-45", "text": "Figure 1 shows the overall distributions of verbs (left, M=3.4, SD=1.1) and nouns (right, M=3.9, SD=1.6) included in our analyses." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-46", "text": "Overall, nouns have significantly more extreme values than verbs: the majority of concrete nouns have concreteness scores clustering around 5.00 while concrete verbs cluster around 4.0." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-47", "text": "Similarly, abstract nouns have significantly lower scores (i.e., they are more abstract) than abstract verbs." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-48", "text": "The numerical difference in the presence of extreme scores is also highlighted by the much higher standard deviation characterising nouns compared to verbs." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-67", "text": "In order to better understand the patterns of concreteness behind each syntactic function introduced in the previous section, we performed a series of qualitative analyses, by looking at the most frequent verbnoun combinations grouped by syntactic function." 
}, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-68", "text": "For both functions nsubj and dobj we see the same strong pattern as in the general analyses in Section 4: concrete verbs have a strong overall preference for concrete complements (map 4.9 show 4.0 , boil 4.2 water 5.0 )." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-69", "text": "Regarding abstract verbs, we find a preference for subcategorising abstract direct objects (reduce 2.0 risk 1.6 ), but -in contrast-a preference for concrete subjects (student 4.9 need 1.7 )." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-70", "text": "Appropriately, surface subjects in passivised clauses have preferences that are in between those for surface subjects and direct objects in active clauses, presumably because they are semantically comparable to the direct objects of the action encoded by the corresponding verb." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-71", "text": "When looking into exceptions to this predominant pattern, we find collocations and non-literal language, such as metaphors and metonyms." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-72", "text": "For example, metaphorical language usage occurs when concrete verbs attach to abstract direct objects (carry 4.0 risk 1.6 vs. carry 4.0 bag 4.9 , catch 4.1 moment 1.6 vs. catch 4.1 insect 4.9 ); while abstract verbs collocated with concrete direct objects trigger a metonymical use (recommend 1.7 book 4.9 vs. write 4.2 book 4.9 )." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-73", "text": "When looking at prepositional objects it is possible to identify three main behaviours: i) a main preference for concrete verbs and nouns (e.g., \"in\" and \"at\"); ii) a strong interaction with abstract verbs and nouns (e.g., \"for\"); iii) a mixed co-occurrence with both concrete and abstract verbs and nouns (e.g.,\"of\")." 
}, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-74", "text": "The following paragraphs report a qualitative discussion about the predominant verbs and nouns with regard to the four prepositions \"in\", \"at\", \"for\", and \"of\"." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-75", "text": "The preposition in manifests a very strong interaction with concrete verbs and concrete nouns." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-76", "text": "Some examples among the most frequent ones in the corpus are: write 4.2 in book 4.9 and sleep 4.4 in bed 5.0 ." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-77", "text": "The only rare exceptions to this pattern refer to idiomatic structures like: carry 4.0 in accordance 1.5 or carry 4.0 in manner 1.6 ." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-78", "text": "Table 1 confirms that the preposition in triggers very high concreteness scores in general and the highest concreteness scores for nouns that are subcategorised by concrete verbs." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-79", "text": "The preposition at connects mainly concrete verbs with concrete nouns: sit 4.8 at table 4.9 and eat 4.4 at restaurant 4.9 ." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-80", "text": "However, in strong collocations it shows a preference for abstract nouns: jump 4.5 at chance 1.6 or happen 1.8 at moment 1.6 ." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-81", "text": "This pattern is confirmed by Table 1 too, where concrete verbs have high scores while abstract verbs have the lowest scores in the entire table." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-82", "text": "The preposition for, on the other hand, mainly occurs with abstract nouns that are subcategorised by abstract verbs: need 1.7 for purpose 1.5 and imagine 1.5 for moment 1.6 ." 
}, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-83", "text": "Exceptions to this pattern are due to metonymic readings like write 4.2 for magazine 5.0 and run 4.3 for office 4.9 ." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-84", "text": "Correspondingly, we see the lowest overall concreteness score across verbs in Table 1 ." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-85", "text": "Finally, the preposition of shows a mixed interaction in the concreteness of verbs and nouns." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-86", "text": "This preposition co-occurs mainly with very concrete verbs that however subcategorise both highly concrete nouns (run 4.3 of water 5 ) but also highly abstract nouns (run 4.3 of idea 1.6 ) in cases of metaphorical use." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-87", "text": "As expected, the overall concreteness for this function in Table 1 is among the highest both for concrete and abstract verbs." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-88", "text": "----------------------------------" }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-89", "text": "**GENERAL DISCUSSION & CONCLUSION**" }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-90", "text": "The aim of this study was to provide a fine-grained empirical analysis of the concreteness nature in verbnoun subcategorisation." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-91", "text": "The general pattern already described in Naumann et al. (2018) is confirmed by our quantitative analysis: overall, concrete verbs predominantly subcategorise concrete nouns as subjects and direct objects, while abstract verbs predominantly subcategorise abstract nouns as subjects and direct objects." 
}, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-92", "text": "A qualitative analysis revealed that exceptions to the predominant same-class interaction indicate semantic effects in verb-noun interaction: collocation, metaphor and metonymy, which shows the usefulness of detecting abstractness in the contexts of verbs as salient features in automatic non-literal language identification." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-93", "text": "A slightly more variable pattern emerges when looking at prepositional objects." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-94", "text": "We identified three main clusters of prepositions that behave differently according to their preferred nouns and verbs." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-95", "text": "The prepositions in the first cluster (e.g., \"in\" and \"at\") co-occur mostly with concrete verbs and nouns; the prepositions in the second cluster (e.g., \"for\") have a strong preference for abstract verbs and nouns; while the prepositions in the third cluster (e.g, \"of\") show variability in the concreteness of the related nouns." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-96", "text": "Once again, the divergence form the general pattern is often ascribable to cases of non-literal language." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-97", "text": "This study, on the one hand, provided additional and more fine-grained evidence of the clash between the existing corpus-based studies and the theories of cognition which predict predominantly concrete information in the context of both concrete and abstract words." }, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-98", "text": "This was achieved by zooming in on the contexts which stand in a direct syntactic relation to the target word." 
}, { "sent_id": "201aa2a740b5d45f273ee298595f5a-C001-99", "text": "In addition, they provided useful indicators to the implementation of computational models for the automatic identification and classification of non-literal language." } ], "y": { "@BACK@": { "gold_contexts": [ [ "201aa2a740b5d45f273ee298595f5a-C001-11" ], [ "201aa2a740b5d45f273ee298595f5a-C001-13" ], [ "201aa2a740b5d45f273ee298595f5a-C001-31" ] ], "cite_sentences": [ "201aa2a740b5d45f273ee298595f5a-C001-11", "201aa2a740b5d45f273ee298595f5a-C001-13", "201aa2a740b5d45f273ee298595f5a-C001-31" ] }, "@UNSURE@": { "gold_contexts": [ [ "201aa2a740b5d45f273ee298595f5a-C001-21" ] ], "cite_sentences": [ "201aa2a740b5d45f273ee298595f5a-C001-21" ] }, "@SIM@": { "gold_contexts": [ [ "201aa2a740b5d45f273ee298595f5a-C001-53" ], [ "201aa2a740b5d45f273ee298595f5a-C001-91" ] ], "cite_sentences": [ "201aa2a740b5d45f273ee298595f5a-C001-53", "201aa2a740b5d45f273ee298595f5a-C001-91" ] } } }, "ABC_4a19f17c00e904a595c1703ab9318d_33": { "x": [ { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-37", "text": "In the following, we describe how we expand on this intuition in detail." }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-38", "text": "----------------------------------" }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-2", "text": "Recent approaches to the Automatic PostEditing (APE) research have shown that better results are obtained by multi-source models, which jointly encode both source (src) and machine translation output (mt) to produce post-edited sentence (pe)." }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-3", "text": "Along this trend, we present a new multi-source APE model based on the Transformer." }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-4", "text": "To construct effective joint representations, our model internally learns to incorporate src context into mt representation." 
}, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-5", "text": "With this approach, we achieve a significant improvement over baseline systems, as well as the state-of-the-art multi-source APE model." }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-6", "text": "Moreover, to demonstrate the capability of our model to incorporate src context, we show that the word alignment of the unknown MT system is successfully captured in our encoding results." }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-7", "text": "----------------------------------" }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-9", "text": "Thanks to recent advances in deep learning over the last two decades, machine translation (MT) has shown steady improvements (Bahdanau et al., 2014; Gehring et al., 2017; Vaswani et al., 2017) ." }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-10", "text": "Nevertheless, even the state-of-the-art MT systems are still often far from perfect." }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-11", "text": "Translations provided by MT systems may contain incorrect lexical choices or word ordering." }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-12", "text": "Therefore, to get publishable quality translations, correction of MT errors has been regarded as an essential post-processing task referred to as \"post-editing\" (PE) ." }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-13", "text": "From the MT point of view, PE can be defined as the process of editing translations provided by MT systems with a minimal amount of manual effort (TAUS Report, 2010) while the field of \"Automatic post-editing\" (APE) aims to mimic the human post-editing process automatically." 
}, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-14", "text": "Given the underlying assumption of the APE task in which the source text (src), the corresponding output of the unknown MT system (mt), and its human postedited sentences (pe) are the only available resources, many studies have attempted to leverage information from both src and mt by suggesting multi-source architectures (Chatterjee et al., 2016; Libovick\u00fd et al., 2016) ." }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-15", "text": "Furthermore, as the newly proposed architecture called the Transformer network (Vaswani et al., 2017) has shown to be effective in various sequence-to-sequence problems, various multi-source adaptations of the Transformer (Junczys-Dowmunt and Grundkiewicz, 2018; Shin and Lee, 2018; Tebbifakhr et al., 2018) have been applied to APE." }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-16", "text": "In this work, we propose a new multi-source APE model by extending Transformer." }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-39", "text": "**MODEL ARCHITECTURE**" }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-17", "text": "We focus especially on modifying its original encoder portion to construct effective joint representations of two sources (src, mt)." }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-18", "text": "A major advantage of our approach is that by constructing the joint representation such that src context is incorporated into mt representation, the model allows information on mt as well as its src context to be considered together in generating pe." }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-19", "text": "In addition, we utilize a weight sharing method between the embedding and output layers in a manner similar to Press and Wolf (2017), and Junczys-Dowmunt and Grundkiewicz (2018) ." 
}, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-20", "text": "Finally, our model outperforms not only baseline systems but also other multi-source APE models." }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-21", "text": "----------------------------------" }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-22", "text": "**RELATED WORK**" }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-23", "text": "Properly modeling the relations between multiple sources is important for solving multi-source sequence generation problems such as APE." }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-24", "text": "For this purpose, Chatterjee et al. (2016) suggested the Factored APE model, combining mt, its src alignment, and other features into factors." }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-25", "text": "It is based on a statistical model, but the approach to incorporating src context into mt is similar to our work." }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-26", "text": "For neural network-based approaches, Libovick\u00fd et al. (2016) used an extended RNN encoder-decoder architecture (Bahdanau et al., 2014) with two separate encoders for each source so that the decoder could attend to both." }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-27", "text": "Chatterjee et al. (2017) expanded on this model to win the WMT 17 APE shared task." }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-28", "text": "Recently, a variety of multisource architectures based on the Transformer (Junczys-Dowmunt and Grundkiewicz, 2018; Shin and Lee, 2018; Tebbifakhr et al., 2018) have been suggested." }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-29", "text": "They are all architectures employing separate encoders but with slightly different ways of modeling the joint representation: Tebbifakhr et al. 
(2018) simply concatenated the output of two encoders; Shin and Lee (2018) considered multiple relations (mt-pe, src-pe, src-mt); Junczys-Dowmunt and Grundkiewicz (2018) encoded src and mt using shared weight parameters." }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-30", "text": "The key difference between their works and ours is that they encode the two sources independently, while our method feeds src encoding results as inputs when encoding mt, so that src context can be directly integrated into each mt representation." }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-31", "text": "----------------------------------" }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-32", "text": "**MULTI-SOURCE AUTOMATIC POST-EDITING**" }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-33", "text": "We predicted that considering mt with its corresponding src context leads to an improvement in post-editing performance." }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-34", "text": "The advantage of this approach is made clear in cases for which src context must be taken into consideration for accurate post-editing." }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-35", "text": "For example, the German phrase \"mein Haus\" (\"my house\" in English) is a possible mistranslation of the English phrase \"my home\"." }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-36", "text": "A system that recognizes src context would be able to correct it into \"mein Zuhause\"." }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-40", "text": "Given the sequences src x = (x_1, \u2026, x_n), mt y = (y_1, \u2026, y_m), and pe z = (z_1, \u2026, z_l), with sequence lengths n, m, and l, respectively, the model is trained to learn probabilities conditioned on the two sources x, y and the target history z_{<t}:" }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-41", "text": "The model architecture, as shown in Figure 1, is an extension of the Transformer." 
}, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-42", "text": "As described in Vaswani et al. (2017) , each stacked layer is composed of multi-head attention networks and position-wise fully connected feed-forward networks." }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-43", "text": "Unlike the original Transformer, to construct joint representations for two inputs, the encoder consists of two sub networks: src encoder (x) and mt encoder (x , y) , shown on the left and right side of the encoder in Figure 1 ." }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-44", "text": "The workflow of the model is as follows: \uf0b7 (x) accepts src embeddings as input and produces a sequence of encoded vectors" }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-45", "text": "in which each src is encoded with its context via the self-attention layer." }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-46", "text": "\uf0b7 (x , y) accepts both mt embeddings and the output of (x) as inputs and returns the final output of the encoder e = ( , \u2026 , ) in which each mt is jointly encoded with its corresponding src context through the attention layer." }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-47", "text": "\uf0b7 Finally, the decoder, shown on the far right of Figure 1 , generates an output sequence z which maximizes (1) at each step by attending to relevant parts of the output of the encoder in an auto-regressive manner (Graves, 2013) ." }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-48", "text": "In addition to shared embedding described in Vaswani et al. (2017) , we also utilize weight sharing across the embedding and output layers in a manner similar to Junczys-Dowmunt and Grundkiewicz (2018)." 
}, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-49", "text": "----------------------------------" }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-50", "text": "**MULTI-HEAD ATTENTION LAYER**" }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-51", "text": "The multi-head attention layers marked with dashed lines in Figure 1 play an important role in constructing joint representations." }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-52", "text": "As described in Vaswani et al. (2017) , we utilize the same multihead attention with h-heads based on scaled dotproduct attention to get matrix C composed of context vectors as follows:" }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-53", "text": "where" }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-54", "text": "Attn( , , ) = softmax (4) the output is then transferred to the Layer Normalization layer with residual connection." }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-55", "text": "Thus, the output hidden states can cover the context information of V along with the hidden state itself as follows:" }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-56", "text": "For (x) , src embeddings x \u2208 \u211d \u00d7 are assigned to Q, K and V, resulting in hidden states that contain self-context and src itself." }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-57", "text": "(x , y) consists of two multi-head attention layers." }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-58", "text": "As the first layer works similar to (x) except that future mask is added to mimic the decoding process of the MT system in which predictions can depend only on past information, thus mt embeddings y \u2208 \u211d \u00d7 are assigned to Q, K and V. 
In the second multi-head attention layer, mt embeddings are assigned to Q \u2208 \u211d^(m \u00d7 d) and the outputs of the src encoder to K \u2208 \u211d^(n \u00d7 d) and V \u2208 \u211d^(n \u00d7 d), resulting in joint representations that contain information of mt with its corresponding src context." }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-59", "text": "The decoder, which is structurally identical to the mt encoder, therefore predicts a pe word depending on both previously generated pe words and the final output of the encoder that contains contextual information from both src and mt." }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-60", "text": "----------------------------------" }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-61", "text": "**EXPERIMENTS**" }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-62", "text": "----------------------------------" }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-63", "text": "**DATA**" }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-64", "text": "We trained an English-to-German APE system with the WMT dataset (Bojar et al., 2017) from the IT domain, consisting of 23K and 1K triplets (src, mt, pe) for the training and development sets, respectively." }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-65", "text": "In addition, we adopted a large-sized artificial dataset (Junczys-Dowmunt and Grundkiewicz, 2016), which contains ~4M triplets generated from a round-trip translation, to prevent early overfitting from the small size of the WMT training data." }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-66", "text": "Furthermore, we encoded every sentence into subword units (Kudo, 2018) with a 32K shared vocabulary." 
}, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-67", "text": "----------------------------------" }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-68", "text": "**TRAINING DETAILS**" }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-69", "text": "We employed the OpenNMT-py (Klein et al., 2017) framework to modify the original Transformer network." }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-70", "text": "We trained our model for ~14K update steps with the Adam optimizer (Kingma and Ba, 2014), warm up learning rates (Vaswani et al., 2017 ) with a size of 12,000, and batch size of approximately 17,000 tokens for each triplet." }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-71", "text": "Other detailed settings were the same asVaswani et al. (2017) ." }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-72", "text": "To obtain a single trained model, it consumed ~14K update steps until convergence on the development set." }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-73", "text": "----------------------------------" }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-74", "text": "**EVALUATION**" }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-75", "text": "For evaluation, we tested on the WMT16 and WMT17 APE test datasets with case sensitive TER and BLEU scores." }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-76", "text": "For the baselines, we employed 1) the raw MT outputs provided by the test dataset, which serves as the official baseline in WMT, and 2) the original single-source Transformer model trained on src\u2192pe and 3) on mt\u2192pe." }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-77", "text": "For comparison against other APE systems, we selected three models proposed by 1) Chatterjee et al. (2017) , 2) Junczys-Dowmunt and Grundkiewicz (2018), and 3) Shin and Lee (2018) , which are the recent multisource approaches ( \u00a72) experimented on the same amount of training data as ours." 
}, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-78", "text": "As shown in Table 1 , our model performed significantly better than all baselines." }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-79", "text": "Moreover, we exceeded not only the RNN based multi-source APE system (Chatterjee et al., 2017) , which was the winning entry in WMT17, but also other recently proposed transformer based multi-source APE systems which have shown notable results, including the winning entry (Junczys-Dowmunt and Grundkiewicz, 2018) in WMT18." }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-80", "text": "----------------------------------" }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-81", "text": "**ANALYSIS**" }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-82", "text": "As word-alignment for mt sentences can be obtained from the WMT Quality Estimation task (Bojar et al., 2016) , we analyzed the similarity of the attention from our encoder and the word-alignment from the unknown MT system to determine if context information in reasonably integrated into the joint representations." }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-83", "text": "Given the randomly sampled sentence from the WMT 16 development data in Figure 2 , we observe that our attention results indicate similar tendency to word-alignment from the MT system." }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-84", "text": "Consequently, we believe that the output of the encoder may successfully incorporate information of mt and its corresponding src context leading to performance improvement." }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-85", "text": "We have described comparisons for various sentences in the appendix." 
}, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-86", "text": "----------------------------------" }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-87", "text": "**CONCLUSION**" }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-88", "text": "We proposed a new multi-source APE model based on the Transformer network." }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-89", "text": "With the motivation that jointly representing mt with its src context might be useful in post editing, we have constructed joint representations in our modified encoder." }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-90", "text": "Particularly, that we were able to generate similar alignments to the MT system is, according to our analysis, an indication of the effectiveness of our model in capturing the characteristics of unknown MT systems." }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-91", "text": "(a) (b) Figure 2 : src-mt alignment from a random sample." }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-92", "text": "The x and y-axis of each plot correspond to src and mt, respectively." }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-93", "text": "Each cell shows the alignment probability." }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-94", "text": "The lighter the color, the higher the probability." }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-95", "text": "(a) and (b) refer to the unknown MT system and our encoder, respectively." 
}, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-96", "text": "----------------------------------" }, { "sent_id": "4a19f17c00e904a595c1703ab9318d-C001-97", "text": "**REFERENCES**" } ], "y": { "@BACK@": { "gold_contexts": [ [ "4a19f17c00e904a595c1703ab9318d-C001-9" ], [ "4a19f17c00e904a595c1703ab9318d-C001-15" ], [ "4a19f17c00e904a595c1703ab9318d-C001-42" ] ], "cite_sentences": [ "4a19f17c00e904a595c1703ab9318d-C001-9", "4a19f17c00e904a595c1703ab9318d-C001-15", "4a19f17c00e904a595c1703ab9318d-C001-42" ] }, "@USE@": { "gold_contexts": [ [ "4a19f17c00e904a595c1703ab9318d-C001-48" ], [ "4a19f17c00e904a595c1703ab9318d-C001-52" ], [ "4a19f17c00e904a595c1703ab9318d-C001-70" ] ], "cite_sentences": [ "4a19f17c00e904a595c1703ab9318d-C001-48", "4a19f17c00e904a595c1703ab9318d-C001-52", "4a19f17c00e904a595c1703ab9318d-C001-70" ] }, "@SIM@": { "gold_contexts": [ [ "4a19f17c00e904a595c1703ab9318d-C001-52" ] ], "cite_sentences": [ "4a19f17c00e904a595c1703ab9318d-C001-52" ] } } }, "ABC_cbed76ac8086637fe1d2e30f39c585_33": { "x": [ { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-2", "text": "We build a grammatical error correction (GEC) system primarily based on the state-of-the-art statistical machine translation (SMT) approach, using task-specific features and tuning, and further enhance it with the modeling power of neural network joint models." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-3", "text": "The SMT-based system is weak in generalizing beyond patterns seen during training and lacks granularity below the word level." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-4", "text": "To address this issue, we incorporate a character-level SMT component targeting the misspelled words that the original SMT-based system fails to correct." 
}, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-5", "text": "Our final system achieves 53.14% F 0.5 score on the benchmark CoNLL-2014 test set, an improvement of 3.62% F 0.5 over the best previous published score." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-6", "text": "----------------------------------" }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-8", "text": "Grammatical error correction (GEC) is the task of correcting various textual errors including spelling, grammar, and collocation errors." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-9", "text": "The phrase-based statistical machine translation (SMT) approach is able to achieve state-of-theart performance on GEC (Junczys-Dowmunt and Grundkiewicz, 2016) ." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-10", "text": "In this approach, error correction is treated as a machine translation task from the language of \"bad English\" to the language of \"good English\"." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-11", "text": "SMT-based systems do not rely on language-specific tools and hence they can be trained for any language with adequate parallel data (i.e., erroneous and corrected sentence pairs)." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-12", "text": "They are also capable of correcting complex errors which are difficult for classifier systems that target specific error types." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-13", "text": "The generalization of SMT-based GEC systems has been shown to improve further by adding neural network models (Chollampatt et al., 2016b) ." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-14", "text": "Though SMT provides a strong framework for GEC, the traditional word-level SMT is weak in generalizing beyond patterns seen in the training data Rozovskaya and Roth, 2016) ." 
}, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-15", "text": "This effect is particularly evident for spelling errors, since a large number of misspelled words produced by learners are not observed in the training data." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-16", "text": "We propose improving the SMT approach by adding a character-level SMT component to a word-level SMT-based GEC system, with the aim of correcting misspelled words." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-17", "text": "Our word-level SMT-based GEC system utilizes task-specific features described in (JunczysDowmunt and Grundkiewicz, 2016) ." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-18", "text": "We show in this paper that performance continues to improve further after adding neural network joint models (NNJMs), as introduced in (Chollampatt et al., 2016b) ." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-19", "text": "NNJMs can leverage the continuous space representation of words and phrases and can capture a larger context from the source sentence, which enables them to make better predictions than traditional language models (Devlin et al., 2014) ." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-20", "text": "The NNJM is further improved using the regularized adaptive training method described in (Chollampatt et al., 2016a) on a higher quality training dataset, which has a higher errorper-sentence ratio." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-21", "text": "In addition, we add a characterlevel SMT component to generate candidate corrections for misspelled words." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-22", "text": "These candidate corrections are rescored with n-gram language model features to prune away non-word candidates and select the candidate that best fits the context." 
}, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-23", "text": "Our final system outperforms the best prior published system when evaluated on the benchmark CoNLL-2014 test set." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-24", "text": "For better replicability, we release our source code and model files publicly at https://github.com/nusnlp/ smtgec2017." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-25", "text": "----------------------------------" }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-26", "text": "**327**" }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-27", "text": "2 Related Work GEC has gained popularity since the CoNLL-2014 shared task was organized." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-28", "text": "Unlike previous shared tasks (Dale and Kilgarriff, 2011; Dale et al., 2012; ) that focused only on a few error types, the CoNLL-2014 shared task dealt with correction of all kinds of textual errors." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-29", "text": "The SMT approach, which was first used for correcting countability errors of mass nouns (Brockett et al., 2006) , became popular during the CoNLL-2014 shared task." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-30", "text": "Two of the top three teams used this approach in their systems." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-31", "text": "It later became the most widely used approach and was used in state-of-the-art GEC systems Chollampatt et al., 2016b; JunczysDowmunt and Grundkiewicz, 2016; Rozovskaya and Roth, 2016) ." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-32", "text": "Neural machine translation approaches have also showed some promise (Xie et al., 2016; ." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-33", "text": "A number of papers on GEC were published in 2016." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-34", "text": "Chollampatt et al. 
(2016b) showed that using neural network translation models in phrase-based SMT decoding improves performance." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-35", "text": "Other works focused on re-ranking and combination of the n-best hypotheses produced by an SMT system using classifiers to generate better corrections (Mizumoto and Matsumoto, 2016; Hoang et al., 2016) ." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-36", "text": "Rozovskaya and Roth (2016) compared the SMT and classifier approaches by performing error analysis of outputs and described a pipeline system using classifier-based error type-specific components, a context-sensitive spelling correction system (Flor and Futagi, 2012) , punctuation and casing correction systems, and SMT." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-37", "text": "Junczys-Dowmunt and Grundkiewicz (2016) described a state-of-the-art SMT-based GEC system using task-specific features, better language models, and task-specific tuning of the SMT system." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-38", "text": "Their system achieved the best published score to date on the CoNLL-2014 test set." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-39", "text": "We use the features proposed in their work to enhance the SMT component in our system as well." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-40", "text": "Additionally, we use neural network joint models (Devlin et al., 2014) introduced in (Chollampatt et al., 2016b) and a character-level SMT component." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-41", "text": "Character-level SMT systems are used in transliteration and machine translation (Tiedemann, 2009; Nakov and Tiedemann, 2012; Durrani et al., 2014) ." 
}, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-42", "text": "It has been previously used for spelling correction in Arabic (Bougares and Bouamor, 2015) and for pre-processing noisy input to an SMT system (Formiga and Fonollosa, 2012) ." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-43", "text": "----------------------------------" }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-44", "text": "**STATISTICAL MACHINE TRANSLATION**" }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-45", "text": "We use the popular phrase-based SMT toolkit Moses (Koehn et al., 2007) , which employs a loglinear model for combination of features." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-46", "text": "We use the task-specific tuning and features proposed in (Junczys-Dowmunt and Grundkiewicz, 2016) to further improve the system." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-47", "text": "The features include edit operation counts, a word class language model (WCLM), the Operation Sequence Model (OSM) (Durrani et al., 2013) , and sparse edit operations." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-48", "text": "Moreover, Junczys-Dowmunt and Grundkiewicz (2016) trained a web-scale language model (LM) using large corpora from the Common Crawl data (Buck et al., 2014) ." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-49", "text": "We train an LM of similar size from the same corpora and use it to improve our GEC performance." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-50", "text": "----------------------------------" }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-51", "text": "**NEURAL NETWORK JOINT MODELS AND ADAPTATION**" }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-52", "text": "Following Chollampatt et al. (2016b) , we add a neural network joint model (NNJM) feature to further improve the SMT component." 
}, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-53", "text": "We train the neural networks on GPUs using log-likelihood objective function with self-normalization, following (Devlin et al., 2014) ." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-54", "text": "Training of the neural network joint model is done using a Theanobased (Theano Development Team, 2016) implementation, CoreLM 1 ." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-55", "text": "Chollampatt et al. (2016a) proposed adapting SMT-based GEC based on the native language of writers, by adaptive training of a pre-trained NNJM on in-domain data (written by authors sharing the same native language) using a regularized loss function." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-56", "text": "We follow this adaptation method and perform subsequent adaptive training of the NNJM, but on a subset of training data with better annotation quality and a higher error-per-sentence ratio, favoring more corrections and thus increasing recall." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-57", "text": "----------------------------------" }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-58", "text": "**SPELLING ERROR CORRECTION USING SMT**" }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-59", "text": "Due to the inherent weakness of SMT-based GEC systems in correcting unknown words (mainly consisting of misspelled words), we add a character-level SMT component for spelling error correction." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-60", "text": "A character in this character-level SMT component is equivalent to a word in wordlevel SMT, and a sequence of characters (i.e., a word) in the former is equivalent to a sequence of words (i.e., a sentence) in the latter." 
}, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-61", "text": "Input to our character-level SMT component is a sequence of characters that make up the unknown (misspelled) word and output is a list of correction candidates (words)." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-62", "text": "Note that unknown words are words unseen in the source side of the parallel training data used to train the translation model." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-63", "text": "For training the character-level SMT component, alignments are computed based on a Levenshtein matrix, instead of using GIZA++ (Och and Ney, 2003) ." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-64", "text": "Our character-level SMT is tuned using the M 2 metric (Dahlmeier and Ng, 2012) on characters, with character-level edit operation features and a 5-gram character LM." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-65", "text": "For each unknown word, character-level SMT produces 100 candidates that are then rescored to select the best candidate based on the context." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-66", "text": "This rescoring is done following Durrani et al. (2014) and uses word-level n-gram LM features: LM probability and the LM OOV (out-of-vocabulary) count denoting the number of words in the sentence that are not in the LM's vocabulary." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-67", "text": "The architecture of our final system is shown in Figure 1 ." 
}, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-68", "text": "----------------------------------" }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-69", "text": "**EXPERIMENTS**" }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-70", "text": "----------------------------------" }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-71", "text": "**DATA AND EVALUATION**" }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-72", "text": "The parallel data for training our word-level SMT system consist of two corpora: the NUS Corpus of Learner English (NUCLE) (Dahlmeier et al., 2013) and Lang-8 Learner Corpora v2 (Lang-8) (Mizumoto et al., 2011) ." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-73", "text": "From NUCLE, we extract sentences with at least one annotation (edit) in a sentence." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-74", "text": "We use one-fourth of these sentences as our development data (5,458 sentences with 141,978 source tokens)." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-75", "text": "The remainder of NUCLE, including sentences without annotations (i.e., error-free sentences), are used for training." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-76", "text": "We extract the English portion of Lang-8 by selecting sentences written by English learners via filtering using a language identification tool, langid.py (Lui and Baldwin, 2012) ." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-77", "text": "This filtered data set and the training portion of NU-CLE are combined to form the training set, consisting of 2.21M sentences (26.77M source tokens and 30.87M target tokens)." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-78", "text": "We use two corpora to train the LMs: Wikipedia texts (1.78B tokens) and a subset of the Common Crawl corpus (94B tokens)." 
}, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-79", "text": "To train the character-level SMT component, we obtain a corpus of misspelled words and their corrections 2 , of which the misspellingcorrection pairs from Holbrook are used as the development set and the remaining pairs together with the unique words in the NUCLE training data (replicated on the source side to get parallel data) are used for training." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-80", "text": "We evaluate our system on the official CoNLL-2014 test set, using the MaxMatch (Dahlmeier and Ng, 2012) scorer v3.2 which computes the F 0.5 score, as well as on the JFLEG corpus (Napoles et al., 2017) , an error-corrected subset of the GUG corpus (Heilman et al., 2014) , using the F 0.5 and GLEU (Napoles et al., 2015) metrics." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-81", "text": "----------------------------------" }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-82", "text": "**SMT-BASED GEC SYSTEM**" }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-83", "text": "Our SMT-based GEC system uses a phrase table trained on the complete parallel data." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-84", "text": "In our word-level SMT system, we use two 5-gram LMs, one of them trained on the target side of the parallel training data and the other trained on Wikipedia texts (Wiki LM)." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-85", "text": "We add all the dense features proposed in (Junczys-Dowmunt and Grundkiewicz, 2016) and sparse edit features on words (with one word context)." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-86", "text": "We further improve the system by replacing Wiki LM with a 5-gram LM trained on Common Crawl data (94BCC LM)." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-87", "text": "NNJM is trained on the complete parallel data." 
}, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-88", "text": "We further adapt the NNJM following the adaptation method proposed by Chollampatt et al. (2016a) on sentences from the training portion of NUCLE that contain at least one error annotation (edit) in a sentence." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-89", "text": "We use the same hyper-parameters as (Chollampatt et al., 2016a) ." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-90", "text": "The SMT-based GEC system with all the features, 94BCC LM, and adapted NNJM, is referred to as \"Word SMT-GEC\"." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-91", "text": "----------------------------------" }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-92", "text": "**SMT FOR SPELLING ERROR CORRECTION**" }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-93", "text": "The character-level SMT component that generates candidates for misspelled words uses a 5-gram character-level LM trained on the target side of the spelling corpora." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-94", "text": "5-gram Wiki LM is used during rescoring." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-95", "text": "The final system is referred to as \"Word&Char SMT-GEC\"." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-96", "text": "Table 1 shows the results of incrementally adding features and components to the SMT-GEC system, measuring performance on the official CoNLL-2014 test set." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-97", "text": "All SMT systems are tuned five times and the feature weights are averaged in order to account for optimizer instability." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-98", "text": "The improvement obtained for each incremental modification is statistically significant (p < 0.01) over its previous system." 
}, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-99", "text": "----------------------------------" }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-100", "text": "**RESULTS AND DISCUSSIONS**" }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-101", "text": "The addition of NNJM improves by 1.14% F 0.5 on top of a high-performing SMT-based GEC system with task-specific features and a web-scale LM." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-102", "text": "Adaptation of NNJM on a subset of NUCLE improves the results by a notable margin (1.31% F 0.5 )." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-103", "text": "The NUCLE data set is manually annotated by experts and is of higher quality than Lang-8 data." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-104", "text": "Also, choosing sentences with a higher error rate encourages NNJM to favor more corrections." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-105", "text": "Adding the SMT component for spelling error correction (\"Spelling SMT\") further improves F 0.5 to 53.14%." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-106", "text": "We use Wiki LM to rescore the candidates, since using 94BCC LM yielded slightly worse results (53.06% F 0.5 )." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-107", "text": "94BCC LM, trained on noisy web texts, includes many misspellings in its vocabulary and hence misspelled translation candidates are not effectively pruned away by the OOV feature compared to using Wiki LM." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-108", "text": "Table 3 : Results on the JFLEG corpus." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-109", "text": "----------------------------------" }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-110", "text": "**COMPARISON TO THE STATE OF THE ART**" }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-111", "text": "and Grundkiewicz (2016) (J&G) and Rozovskaya and Roth (2016) (R&R) 3 . 
\"Word SMT-GEC\" is better than the previous best system (J&G) by a margin of 2.18% F 0.5 ." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-112", "text": "This improvement is without using any additional datasets compared to J&G. \"Word&Char SMT-GEC\", which additionally uses \"Spelling SMT\" trained using spelling corpora, increases the margin of improvement to 3.62% F 0.5 and becomes the new state of the art." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-113", "text": "We also evaluate using 10 sets of human annotations of the CoNLL-2014 test set released by Bryant and Ng (2015) (\"10 ann.\")." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-114", "text": "We measure a system's performance compared to human using the ratio metric (\"Ratio\"), which is the average system-vs-human score (\"SvH\") divided by average human-vs-human score (F 0.5 of 72.58%)." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-115", "text": "\"SvH\" is computed by removing one set of human annotations at a time and evaluating the system against the remaining 9 sets, and finally averaging over all 10 repetitions." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-116", "text": "The results show that \"Word&Char SMT-GEC\" achieves 94.09% of the human-level performance, substantially closing the gap between system and human performance for this task by 36%." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-117", "text": "To ascertain the generalizability of our results, we also evaluate our system on the JFLEG development and test sets without re-tuning." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-118", "text": "Table 3 compares our systems with top-performing systems 4 ." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-119", "text": "Our systems outperform the previous best systems by large margins." 
}, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-120", "text": "----------------------------------" }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-121", "text": "**ERROR TYPE ANALYSIS**" }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-122", "text": "We analyze the performance of our final system and the top systems on specific error types on the CoNLL-2014 test set." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-123", "text": "To do this, we compare the per-error-type F 0.5 using the ERRANT toolkit (Bryant et al., 2017) ." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-124", "text": "ERRANT uses a rule-based framework primarily relying on partof-speech (POS) tags to classify the error types." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-125", "text": "The error type classification has been shown to achieve 95% acceptance by human raters." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-126", "text": "We analyze the performance on six common error types, namely, noun number (Nn), verb tense (Vt), determiner (Det), punctuation (Punct), subject-verb agreement (SVA), and preposition (Prep) errors." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-127", "text": "The results are shown in Figure 2 ." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-128", "text": "Our system outperforms the other systems on four of these six error types, and achieves comparable performance on the determiner errors." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-129", "text": "It is interesting to note that R&R outperforms our system and J&G on subject-verb agreement errors by a notable margin." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-130", "text": "This is because R&R uses a classification-based system for subject-verb agreement errors that uses rich linguistic features including syntactic and dependency parse information." 
}, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-131", "text": "SMT-based systems are weaker in correcting such errors as they do not explicitly identify and model the relationship between a verb and its subject." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-132", "text": "----------------------------------" }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-133", "text": "**PERFORMANCE ON SPELLING ERRORS**" }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-134", "text": "We perform comparative analysis on spelling error correction on the CoNLL-2014 test set using ERRANT." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-135", "text": "The results are summarized in Table 4 ." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-136", "text": "Our final system with the character-level SMT (Flor and Futagi, 2012) ." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-137", "text": "ConSpel is a proprietary non-word spell checker that has been shown to outperform off-the-shelf spell checkers such as MS Word and Aspell." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-138", "text": "Despite using ConSpel, R&R achieves a lower precision (74.19 vs. 75.40) and recall (85.98 vs. 91.35) compared to our final system." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-139", "text": "We also compare against a baseline where our spelling correction component is replaced by an off-the-shelf spell checker Hunspell (\"Word SMT-GEC + Hunspell\")." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-140", "text": "Using Hunspell causes a drastic drop in precision due to a large number of spurious corrections that it proposes and results in a lower F 0.5 score." 
}, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-141", "text": "----------------------------------" }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-142", "text": "**CONCLUSION**" }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-143", "text": "We have improved a state-of-the-art SMT-based GEC system by incorporating and adapting neural network joint models." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-144", "text": "The weakness of SMTbased GEC in correcting misspellings is addressed by adding a character-level SMT component." }, { "sent_id": "cbed76ac8086637fe1d2e30f39c585-C001-145", "text": "Our final best system achieves 53.14% F 0.5 on the CoNLL-2014 test set, outperforming the previous best system by 3.62%, and achieves 94% of human performance on this task." } ], "y": { "@BACK@": { "gold_contexts": [ [ "cbed76ac8086637fe1d2e30f39c585-C001-13" ], [ "cbed76ac8086637fe1d2e30f39c585-C001-18" ], [ "cbed76ac8086637fe1d2e30f39c585-C001-31" ], [ "cbed76ac8086637fe1d2e30f39c585-C001-40" ] ], "cite_sentences": [ "cbed76ac8086637fe1d2e30f39c585-C001-13", "cbed76ac8086637fe1d2e30f39c585-C001-18", "cbed76ac8086637fe1d2e30f39c585-C001-31", "cbed76ac8086637fe1d2e30f39c585-C001-40" ] }, "@SIM@": { "gold_contexts": [ [ "cbed76ac8086637fe1d2e30f39c585-C001-18" ] ], "cite_sentences": [ "cbed76ac8086637fe1d2e30f39c585-C001-18" ] }, "@USE@": { "gold_contexts": [ [ "cbed76ac8086637fe1d2e30f39c585-C001-52" ] ], "cite_sentences": [ "cbed76ac8086637fe1d2e30f39c585-C001-52" ] } } }, "ABC_cbe9e36f371c072432ca25800c96d3_33": { "x": [ { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-14", "text": "phonemes)." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-15", "text": "Such features can be hand-crafted by leveraging prior knowledge [11, 12, 13, 14] , or they can be learned in a data-driven fashion." 
}, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-2", "text": "Transfer learning aims to reduce the amount of data required to excel at a new task by re-using the knowledge acquired from learning other related tasks." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-3", "text": "This paper proposes a novel transfer learning scenario, which distills robust phonetic features from grounding models that are trained to tell whether a pair of image and speech are semantically correlated, without using any textual transcripts." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-4", "text": "As semantics of speech are largely determined by its lexical content, grounding models learn to preserve phonetic information while disregarding uncorrelated factors, such as speaker and channel." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-5", "text": "To study the properties of features distilled from different layers, we use them as input separately to train multiple speech recognition models." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-6", "text": "Empirical results demonstrate that layers closer to input retain more phonetic information, while following layers exhibit greater invariance to domain shift." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-7", "text": "Moreover, while most previous studies include training data for speech recognition for feature extractor training, our grounding models are not trained on any of those data, indicating more universal applicability to new domains." 
}, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-8", "text": "----------------------------------" }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-10", "text": "Robustness of automatic speech recognition (ASR) systems is essential to generalization of using speech as interfaces for human computer interaction." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-11", "text": "Thanks to the strong modeling capacity of neural networks, recent studies [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] have demonstrated that by providing supervised examples as abundant and diverse as possible, such models can learn to extract domain invariant features and recognize linguistic units jointly." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-12", "text": "However, without additional treatment, good performance and robustness may not be achieved when labeled data are very limited in quantities or not available in all domains [11] ." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-13", "text": "One way to ease the burden of ASR systems is by providing better features, which are more invariant to nuisance factors while containing linguistic information ready for use (e.g., linear separability w.r.t." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-16", "text": "Furthermore, this learning can take place jointly with ASR [15] , or separately with some tasks that have aligned objectives [16, 17, 18] ." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-17", "text": "Learning features from some source tasks that can benefit the target task is a common realization of transfer learning [19] ." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-18", "text": "In this work, we propose a novel inductive transfer learning scenario [19] , which utilizes speech features learned from audiovisual grounding for speech recognition." 
}, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-19", "text": "Audio-visual grounding [20] is a task which aims to distinguish whether a spoken caption is semantically associated with an image or not, and vice versa, without using any textual transcripts." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-20", "text": "Deep audiovisual embedding network (DAVEnet) [21] is a two-branched convolutional neural network model for this task, which learns to encode images and spoken captions into a shared embedding space that reflects semantic similarity." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-21", "text": "To successfully learn a semantic representation for speech, the model has to recognize its lexical content, which in turns requires identifying phonetic content." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-22", "text": "Therefore, one can expect intermediate layers of the speech branch in DAVEnet models to function as lexical or phonetic unit detectors." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-23", "text": "Furthermore, since non-linguistic aspects of speech, such as speaker, are not correlated with semantics, these information may be discarded, resulting in the intermediate outputs from the model being invariant to domain shift." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-24", "text": "We conduct a series of ASR experiments probing properties of the features distilled from DAVEnet models at different layers." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-25", "text": "Results indicate higher in-domain accuracy using features closer to input, and better robustness to domain shift using features from latter layers." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-26", "text": "In addition, we also study how the choice of DAVEnet architectures and grounding performance affects the performance of distilled feature extractors." 
}, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-27", "text": "In summary, our contributions are three-fold: (1) To the best of our knowledge, this is the first work connecting audio-visual grounding with speech recognition." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-28", "text": "(2) Our empirical study verifies that the distilled feature extractors not only contain sufficient information for recognizing phonemes, but better remove nuisance information." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-29", "text": "(3) Moreover, the grounding models are trained on a different dataset from that used for ASR, indicating more general applicability of the distilled features." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-30", "text": "----------------------------------" }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-31", "text": "**LEARNING SPOKEN LANGUAGES THROUGH AUDIO-VISUAL GROUNDING**" }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-32", "text": "In this section, we describe in detail the source task as well as the DAVEnet model, and then review several analysis studies which lay the foundation for our work." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-33", "text": "----------------------------------" }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-34", "text": "**AUDIO-VISUAL GROUNDING**" }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-35", "text": "Inspired by the fact that humans learn to speak before being able to read or write, audio-visual grounding of speech is a proxy task proposed in [20] that aims to examine the capability of computational models to learn a language using only semanticlevel supervision." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-36", "text": "To simulate such a learning scenario, a model has access to images and their spoken captions during training." 
}, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-37", "text": "The goal of the model is to learn a semantic representation for each caption and each image, such that representations of semantically correlated utterances and images are similar to each other, while those from irrelevant pairs are dissimilar." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-38", "text": "Performance is evaluated using a cross-modality retrieval task: given a spoken sentence, a model is asked to rank a list of 1,000 images according to semantic relevance, with only one image being the correct answer, and vice versa." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-39", "text": "Recall@10 averaged over the retrieval tasks in both directions is used for evaluation." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-40", "text": "----------------------------------" }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-41", "text": "**DEEP AUDIO-VISUAL EMBEDDING NETWORK (DAVENET)**" }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-42", "text": "DAVEnet is a convolutional neural network (CNN) for audiovisual grounding proposed in [20, 22, 21] , which consists of two branches: f for speech and g for image, as depicted in Figure 1 ." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-43", "text": "Each branch has a sequence of strided convolutional blocks, followed by a global mean-pooling layer to produce a fixed dimensional representation." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-44", "text": "The model is trained to minimize a triplet loss [23, 24] : given a similarity measure S(\u00b7, \u00b7), paired speech and image, xs and xi, along with one imposter instance from each modality,xs andxi, the loss enforces S(f (xs), g(xi)) to exceed both S(f (xs), g(xi)) and S(f (xs), g(xi)) by a predefined margin." 
}, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-45", "text": "Following [25] , imposter instances are drawn using a mixture of uniform sampling and within-batch semi-hard negative mining [24] ." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-46", "text": "S(z1, z2) = z1, z2 is used here." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-47", "text": "In our experiments, we make use of two DAVEnet model variants." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-48", "text": "The first is identical to the model used in [21] , which uses an audio network comprised of 5 convolutional layers and the VGG16 architecture for the image network." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-49", "text": "The second model, ResDAVEnet, is based upon deep residual networks [26] ." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-50", "text": "The image network makes use of the ResNet50 architecture, while the audio network is based on strided 1-D convolutions with residual connections." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-51", "text": "The first layer of the ResDAVEnet audio model is comprised of 128 convolutional units each spanning all frequency channels but only one temporal frame, with a temporal shift of 1 frame." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-52", "text": "This is followed by a ReLU and a BatchNorm layer." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-53", "text": "The remainder of the network is a sequence of 4 residual stacks with channel dimensions 128, 256, 512, and 1024." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-54", "text": "Each residual stack is comprised of a sequence of two basic residual blocks (as described in [26] ) which share the same overall channel dimension, with 2-D 3x3 kernels replaced with 1-D kernels of length 9." 
}, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-55", "text": "Additionally, the first residual block in each layer in each stack is applied with a stride of two frames, resulting in an effective temporal downsampling ratio of 2 4 for the entire network, as shown in Figure 1 (center)." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-56", "text": "----------------------------------" }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-57", "text": "**EMERGENCE OF MULTI-LEVEL SPEECH UNIT DETECTORS**" }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-58", "text": "Recent work [27, 22, 25] on analyzing DAVEnets have shown that, despite the fact that phoneme and word labels are never explicitly provided, such detectors automatically emerge within these models." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-59", "text": "Phoneme-like detectors reside in layers closer to the input [27, 25] , while semantic word detectors reside in layers closer to the output [22] ." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-60", "text": "Such findings echo with the recent discovery in the computer vision community [28, 29] that in a trained scene classifier, layers closer to sensory input appear to be low-level pattern (e.g., shape, edge, and color) detectors, while object detectors emerge at later layers." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-61", "text": "This behavior can be mainly attributed to the compositionality of the prediction target as well as the inductive bias we impose in the model architecture." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-62", "text": "Just as a scene can often be determined by the objects that are present, the semantics of a spoken sentence is determined by the sequence of words, each of which in turn is determined by phoneme sequences." 
}, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-63", "text": "Prediction of semantic objects from a spoken sentence can therefore be regarded as a bottom-up process, which iteratively composes higher-level concepts from lower-level ones with the hierarchical convolution operations in CNNs." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-64", "text": "----------------------------------" }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-65", "text": "**TRANSFER LEARNING TO SPEECH RECOGNITION**" }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-66", "text": "----------------------------------" }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-67", "text": "**DISTILLING ROBUST FEATURE EXTRACTORS FOR ASR**" }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-68", "text": "Both DAVEnet variants are trained on the Places Audio Caption dataset (PlacesAudCap) [21] , derived from the Places205 scene classification dataset [28] ." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-69", "text": "PlacesAudCap is composed of over 400K image and unscripted spoken caption pairs collected from 2,954 speakers via Amazon Mechanical Turk, which sums up to over 1,000 hours." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-70", "text": "For the audio-visual grounding task, both models use 40-dimensional log Mel filterbank (FBank) features with 10ms shift and 25ms analysis window as input, and achieve R@10 of 0.629 and 0.720, respectively." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-71", "text": "As a natural result of large-scale crowd-sourcing, this dataset exhibits great diversity not only in textual content, but also in speaker, background noise, and microphone channels." 
}, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-72", "text": "For both semantic grounding and speech recognition, these nontextual factors are nuisances to the target, and therefore would eventually be removed from the internal representations learned by the networks trained for these two tasks." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-73", "text": "Having been exposed to a vast amount of nuisance factors, we hypothesize that the audio branch of DAVEnet models would also learn domain invariant phonetic representations at later layers, which can be subsequently utilized for robust speech recognition." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-74", "text": "From now on, we denote features extracted from the k-th layer of model M with M -Lk, for example, ResDAVEnet-L2." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-75", "text": "To account for the different frame rates at different layers in DAVEnet models, when extracting outputs from a layer with a down-sampling rate r compared to the speech inputs, we repeat each step r times for simplicity, as shown in Figure 1 (right)." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-76", "text": "----------------------------------" }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-77", "text": "**EVALUATING TRANSFER LEARNING PERFORMANCE**" }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-78", "text": "To evaluate transfer learning performance, we consider three criteria: (1) inclusion of phonetic content, (2) exclusion of nuisance factors, and (3) transferrability across datasets." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-79", "text": "The first two are evaluated using a protocol similar to [17] , where an ASR model is trained on a set of domains, and evaluated on both in-domain and out-of-domain speech (relative to the training data)." 
}, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-80", "text": "Performance on in-domain data characterizes an upper bound for the amount of phonetic information that can be inferred from the input." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-81", "text": "The performance gap between in-domain and out-of-domain data quantifies the invariance of the features to nuisance factors: the smaller this gap, the more invariant the features are." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-82", "text": "To test the third criteria, instead of training the source task on a dataset that includes speech used for the target task, a separate dataset collected through a different process (i.e., PlacesAudCap) is used." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-83", "text": "We emphasize here that this is a more practical setting to consider than training one feature extractor for each target task." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-84", "text": "----------------------------------" }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-85", "text": "**RELATED WORK**" }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-86", "text": "Transfer learning has a long history in the field of machine learning [19] ." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-87", "text": "More recently, deep neural network models have been shown to be extremely effective for learning representations of data with a high degree of re-usability across many different tasks and domains." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-88", "text": "Perhaps the most well-known example of this is the use of the ImageNet [30] image classification database to pre-train convolutional neural network models for other downstream computer vision tasks [31, 32, 33] ." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-89", "text": "Other sub-fields have also developed similarly techniques." 
}, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-90", "text": "For example, in natural language processing, dense word vector models such as word2vec [34] and GloVe [35] , or more advanced ones like ELMo [36] and BERT [37] have quickly replaced one-hot word representations in many tasks and pushed the state-of-theart forward on a variety of language understanding tasks." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-91", "text": "More recently, there is also an increasing interest in learning from multimodal data [38] and transfer learned representations from such tasks [39] In the field of speech recognition, low-resource speech recognition is a scenario which heavily benefits from transfer learning, for example in the form of training on multilingual datasets [40] ." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-92", "text": "Other models capable of disentangling phonetic and domain information have recently been shown to learn acoustic features with a greater degree of domain invariance than traditional acoustic features [16, 7, 17] ." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-93", "text": "Another line of work has studied the use of the visual modality as a form of weak supervision using semantic information for acoustic modeling [20, 41, 42] , followed up with analysis on representations learned from such models [27, 43, 25] ." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-94", "text": "In this paper, we build upon this prior work and quantify the degree to which these representations can be used to build robust ASR." 
}, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-95", "text": "----------------------------------" }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-96", "text": "**EXPERIMENTS**" }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-97", "text": "----------------------------------" }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-98", "text": "**ASR SETUP AND BASELINES**" }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-99", "text": "We consider TIMIT [44] and Aurora-4 [45] for training ASR systems to study robustness of the proposed method to speaker, channel, and noise." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-100", "text": "TIMIT contains 5.4 hours of 16kHz broadband recordings of read speech from 630 speakers, of which about 70% are male." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-101", "text": "Recordings from male speakers are used for training ASR systems, which are then tested on both genders." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-102", "text": "Aurora-4 is based on the Wall Street Journal (WSJ) corpus [46] , containing recordings with microphone and noise variation." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-103", "text": "The set of conditions are divided into four groups: clean (A), noisy (B), channel (C), and noisy+channel (D)." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-104", "text": "While recordings in A are recorded by one microphone in quiet environments, those in C are recorded with a different set of microphones than A. Recordings in B and D are created from A and C, respectively, with artificially added noises." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-105", "text": "Similar to [17] , we use the clean set (A) for training ASR systems, and test on the four groups separately." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-106", "text": "Kaldi [47] is used for training of initial HMM-GMM models, forced alignment, and decoding." 
}, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-107", "text": "The Microsoft Cognitive Toolkit (CNTK) [48] is used for neural network-based acoustic model training." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-108", "text": "To simplify the pipeline and study only the effect of ASR input features, the same forced alignment derived from a HMM-GMM model trained on Mel-frequency cepstral coefficient (MFCC) features are used for all experiments, following the default recipe in Kaldi." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-109", "text": "A three-layer long short-term memory (LSTM) acoustic model with 1,024 memory cells and a 512-node linear projection at each layer is used [49] ." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-110", "text": "Training of LSTM acoustic models closely follows [50] , which minimizes a frame-level cross-entropy loss using stochastic gradient descent with a momentum of 0.9 starting from the second epoch." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-111", "text": "Initial learning rate is set to 0.2 per minibatch, and L2 regularization with a weight of 1e \u2212 6 is used." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-112", "text": "We consider two types of features to compare with our proposed method." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-113", "text": "The first one is FBank feature, which is the input to DAVEnet models and contains rich phonetic and domain information." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-114", "text": "The second one is the latent segment variable z1 from a model called factorized hierarchical variational autoencoder (FHVAE) [16] ." 
}, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-115", "text": "FHVAE learns to encode sequence-level and segment-level information into separate latent variables without supervision by optimizing an evidence lower bound derived from a factorized graphical model, and has been shown effective for extracting domain invariant ASR features [17] ." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-116", "text": "While previous work investigated usage of FHVAE for ASR by training FHVAE models on all domains of the target task (e.g., Aurora-4 with all four conditions) [17, 8] , we also evaluate FHVAE models trained on PlacesAudCap to test cross-dataset transferability, and on the subset of domains used for ASR training." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-117", "text": "We use FHVAE models with two LSTM layers, each with 256 cells, for both the encoders and decoder." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-118", "text": "A discriminative weight of \u03b1 = 10 is applied for all models, and the scalable training algorithm proposed in [51] is used for training on PlacesAudCap dataset with a sequence batch size K = 5000, because the original algorithm cannot handle large-scale datasets." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-119", "text": "Tables 1 and 2 present the testing word error rates (WERs) on both in-domain and out-of-domain conditions for ASR systems trained with different features." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-120", "text": "FE Train Set denotes the data used for training feature extractors, and A/I following Places represents the audio and image portion of the PlacesAudCap dataset, respectively." 
}, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-121", "text": "----------------------------------" }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-122", "text": "**MAIN RESULTS**" }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-123", "text": "Starting with Table 1 , we observe that FBank suffers from severe degradation in all out-of-domain conditions (B, C, and D), while FHVAE trained on all conditions of the Aurora-4 dataset achieves the best performance." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-124", "text": "However, when trained on Places A, improvement of FHVAE from FBank on out-ofdomain data becomes less significant in the presence of additive noise, compared to the result in the purely channel-mismatched condition (C)." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-125", "text": "Results of the proposed methods are shown in the second and the third section in Table 1 ." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-126", "text": "While features from ResDAVEnet consistently outperforms FBank and FH-VAE (Places A) for all layers, those from DAVEnet do not." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-127", "text": "We hypothesize that the much deeper architecture of ResDAVEnet at each layer (ResStack) enables better removal of nuisance factors and preserving of linguistic information compared to DAVEnet, which also reflects in the comparison of grounding performance as mentioned earlier." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-128", "text": "It is also worth noting that, for both DAVEnet and ResDAVEnet models, performance in matched domain degrades when using latter layers, and except for ResDAVEnet-L1, all features are actually worse than the FBank baseline." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-129", "text": "This could indicate discarding of relevant phonetic information in the process of inferring higher-level semantic representation such as words." 
}, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-130", "text": "Table 2 demonstrates a similar trend as Table 1 , where FHVAE trained on TIMIT dataset of all genders achieves the best outof-domain WER, and ResDAVEnet-L2 is the best comparing to models trained on Places." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-131", "text": "We also present qualitative visualizations in Figure 2 using To conclude, we learn that (1) despite being trained with exactly the same process, inductive bias introduced to model architectures (i.e., DAVEnet versus ResDAVEnet) still affects the properties of learned representations, (2) feature extractors distilled from ResDAVEnet models clearly preserve phonetic information while improving invariance to nuisance factors, and most importantly, (3) it achieves better cross-dataset transferrability compared to FHVAE and FBank features." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-132", "text": "----------------------------------" }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-133", "text": "**CORRELATION WITH SOURCE TASK PERFORMANCE**" }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-134", "text": "Finally, we study how the performance of the grounding task affects the transfer learning performance, conditioning on the same neural network architecture for the source task." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-135", "text": "We create two proper subsets of 200k and 80k paired image/audio captions, and train one ResDAVEnet model on each subset." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-136", "text": "R@10 of the retrieval task for the models trained with 80k, 200k, and 400k (original) are 0.343, 0.582, and 0.720, respectively." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-137", "text": "Results are shown in Table 3 ." 
}, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-138", "text": "Except for the first layer, we can observe that WER decreases as the amount of source task training data increases." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-139", "text": "In fact, except for the out-of-domain conditions of the first layer, all layers improve in all conditions (full results not shown due to space limit)." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-140", "text": "Discovery of such positive correlation affirms the relatedness of the two tasks and encourages collection of a larger dataset for building a general feature extractor based on semantic grounding tasks." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-141", "text": "----------------------------------" }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-142", "text": "**CONCLUDING DISCUSSION AND FUTURE WORK**" }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-143", "text": "In this paper, we present a successful example of transfer learning from a weakly supervised semantic grounding task to robust ASR." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-144", "text": "We achieve cross-dataset transferability, which is an important milestone toward building a generalized feature extractor to be used in many tasks and domains like BERT." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-145", "text": "In addition, along with the analysis in [27, 25] , this work sheds light on using semantic level supervision to learn the compositional structure of a language." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-146", "text": "For future work, we would like to study methods for leveraging target task data, possibly through semi-supervised training or adaptation, in order to bridge the gap to FHVAE trained on those data." 
}, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-147", "text": "Furthermore, unlike FH-VAE, it is unclear at which layer a ResDAVEnet model learns to maximally remove domain information." }, { "sent_id": "cbe9e36f371c072432ca25800c96d3-C001-148", "text": "We would also like to explicitly force such disentanglement to occur at certain layers, which can possibly improve both the grounding performance and the robustness of distilled features." } ], "y": { "@BACK@": { "gold_contexts": [ [ "cbe9e36f371c072432ca25800c96d3-C001-16" ], [ "cbe9e36f371c072432ca25800c96d3-C001-92" ], [ "cbe9e36f371c072432ca25800c96d3-C001-115" ], [ "cbe9e36f371c072432ca25800c96d3-C001-116" ] ], "cite_sentences": [ "cbe9e36f371c072432ca25800c96d3-C001-16", "cbe9e36f371c072432ca25800c96d3-C001-92", "cbe9e36f371c072432ca25800c96d3-C001-115", "cbe9e36f371c072432ca25800c96d3-C001-116" ] }, "@SIM@": { "gold_contexts": [ [ "cbe9e36f371c072432ca25800c96d3-C001-78", "cbe9e36f371c072432ca25800c96d3-C001-79" ], [ "cbe9e36f371c072432ca25800c96d3-C001-105" ] ], "cite_sentences": [ "cbe9e36f371c072432ca25800c96d3-C001-79", "cbe9e36f371c072432ca25800c96d3-C001-105" ] } } }, "ABC_0a538968f0cd121a1ef63b58a0c9f7_33": { "x": [ { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-2", "text": "Dialogue act recognition is an important part of natural language understanding." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-3", "text": "We investigate the way dialogue act corpora are annotated and the learning approaches used so far." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-4", "text": "We find that the dialogue act is context-sensitive within the conversation for most of the classes." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-5", "text": "Nevertheless, previous models of dialogue act classification work on the utterance-level and only very few consider context." 
}, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-6", "text": "We propose a novel context-based learning method to classify dialogue acts using a character-level language model utterance representation, and we notice significant improvement." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-7", "text": "We evaluate this method on the Switchboard Dialogue Act corpus, and our results show that the consideration of the preceding utterances as a a context of the current utterance improves dialogue act detection." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-8", "text": "----------------------------------" }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-10", "text": "In natural language processing research, the dialogue act (DA) concept plays an important role." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-11", "text": "Its recognition, in most cases, is considered a lexical-based or syntax-based classification at utterance-level." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-12", "text": "However, the discourse compositionality is context sensitive, meaning that the DA of an utterance can be elicited from the preceding utterances (Grosz, 1982) ." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-13", "text": "Hence, classifying only utterances is not enough because their DA class arises from their context." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-14", "text": "For example, the utterance containing only the lexical entry 'yeah' might appear in several DA classes such as Backchannel, Yes-Answer, etc." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-15", "text": "For certain DA classes, the utterances are short, and most of them share similar lexical and syntactic cues ." 
}, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-16", "text": "The aim of this article has two subgoals: first, we investigate the annotation process of DA corpora and review the modelling so far used for DA classification, and second, we present a novel model and compare its results with the state of the art." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-17", "text": "We propose to use context-based learning for the identification of the DA classes." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-18", "text": "First, we show the results without context, i.e., classifying only utterances." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-19", "text": "Including context leads to 3% higher accuracy." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-20", "text": "We use a simple recurrent neural network (RNN) for context learning of the discourse compositionality." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-21", "text": "We feed the preceding and current utterances to the RNN model to predict its DA class." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-22", "text": "The main contributions of this work are as follows: -We provide detailed insight on the annotation and modelling of dialogue act corpora." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-23", "text": "We suggest to model discourse within the context of a conversation." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-24", "text": "-We propose a context-based learning approach for DA identification." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-25", "text": "In our approach, we represent utterances by a character-level language model trained on domainindependent data." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-26", "text": "-We evaluate the model on the Switchboard Dialogue Act (SwDA 1 ) corpus and show how using context affects the results." 
}, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-27", "text": "For the SwDA corpus, our model achieved an accu-1 Available at https://github.com/cgpotts/swda racy of 77.3% compared to 73.9% as state of the art, where the context-based learning is used for the DA classification (Kalchbrenner and Blunsom, 2013 )." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-28", "text": "-Benefits of using context arise from using only a few preceding utterances making the model suitable for dialogue system in real time, in contrast to feeding the whole conversation, which can achieve high accuracy, but includes future utterances (Liu et al., 2017; Kumar et al., 2017) ." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-29", "text": "----------------------------------" }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-30", "text": "**RELATED WORK**" }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-31", "text": "----------------------------------" }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-32", "text": "**ANNOTATION OF DIALOGUE ACT CORPORA**" }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-33", "text": "Annotation Process and Standards: Research on dialogue acts became important with the commercial reality of spoken dialogue systems." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-34", "text": "There have been many taxonomies to it: speech act (Austin, 1962) which was later modified into five classes (Assertive, Directive, Commissive, Expressive, Declarative) (Searle, 1979) , and the Dialogue Act Markup in Several Layers (DAMSL) tag set where each DA has a forward-looking function (such as Statement, Info-request, Thanking) and a backwardlooking function (such as Accept, Reject, Answer) (Allen and Core, 1997) ." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-35", "text": "There are many such standard taxonomies and schemes to annotate conversational data, some of them follow the concept of discourse compositionality." 
}, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-36", "text": "These schemes are important for analysing dialogues or building a dialogue system (Skantze, 2007) ." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-37", "text": "However, there can never be a unique scheme that considers all aspects of dialogue." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-38", "text": "Corpus Insight: We have investigated the annotation method for two corpora: Switchboard (SWBD) (Godfrey et al., 1992; Jurafsky et al., 1997) and ICSI Meeting Recorder Dialogue Act (MRDA) (Shriberg et al., 2004) ." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-39", "text": "They are annotated with the DAMSL tag set." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-40", "text": "The annotation includes not only the utterance-level but also the segmentedutterance labelling." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-41", "text": "The DAMSL tag set provides very fine-grained and detailed DA classes and follows the discourse compositionality." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-42", "text": "For example, the SWBD-DAMSL is the variant of DAMSL specific to the Switchboard cor-" }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-43", "text": "----------------------------------" }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-44", "text": "**NATURE OF DISCOURSE IN CONVERSATION:**" }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-45", "text": "The dialogue act is a context-based discourse concept that means the DA class of a current utterance can be derived from its preceding utterance." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-46", "text": "We will elaborate this argument with an example given in Table 1 ." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-47", "text": "Speaker A utters 'Oh, yeah.' twice in the first portion, and each time it is labelled with two different DA labels." 
}, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-48", "text": "This is simply due to the context of the previously conversed utterances." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-49", "text": "If we look at the last four utterances of the example, when speaker A utters the 'Yes-No Question' DA, speaker B answers with 'yeah', which is labelled as the 'Yes-Answer' DA." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-50", "text": "However, after the 'Statement-opinion' from the same speaker, the same utterance 'yeah' is labelled as 'Backchannel' and not 'Yes-Answer'." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-51", "text": "This gives evidence that when we process the text of a conversation, we can see the context of a current utterance in the preceding utterances." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-52", "text": "Prosodic Cues for DA Recognition: It has also been noted that prosodic knowledge plays a major role in DA identification for certain DA types (Stolcke et al., 2000) ." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-53", "text": "The main reason is that the acoustic signal of the same utterance can be very different in a different DA class." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-54", "text": "This indicates that if one wants to classify DA classes from text alone, the context must be an important aspect to consider: simply classifying single utterances might not be enough; considering the preceding utterances as context is important." 
}, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-55", "text": "----------------------------------" }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-56", "text": "**MODELLING APPROACHES**" }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-57", "text": "Lexical, Prosodic, and Syntactic Cues: Many studies have been carried out to find out the lexical, prosodic and syntactic cues (Stolcke et al., 2000; Surendran and Levow, 2006; O'Shea et al., 2012; Yang et al., 2014) ." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-58", "text": "For the SwDA corpus, the state-of-the-art baseline result was 71%" }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-59", "text": "for more than a decade using a standard Hidden Markov Model (HMM) with language features such as words and n-grams (Stolcke et al., 2000) ." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-60", "text": "The inter-annotator agreement accuracy for the same corpus is 84%, and in this particular case, we are still far from achieving human accuracy." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-61", "text": "However, words like 'yeah' appear in many classes such as backchannel, yes-answer, agree/accept etc." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-62", "text": "Here, the prosodic cues play a very important role in identifying the DA classes, as the same utterance can acoustically differ a lot which helps to distinguish the specific DA class (Shriberg et al., 1998) ." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-63", "text": "There are several approaches like traditional Naive Bayes and HMM models, which use minimal information and certainly ignore the dependency of the context within the communication (Grau et al., 2004; Tavafi et al., 2013) ." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-64", "text": "They achieved 66% and 74.32% respectively on the SwDA test set." 
}, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-65", "text": "Utterance-level Classification: Perhaps most research in modelling dialogue act identification is conducted at utterance-level (Stolcke et al., 2000; Grau et al., 2004; Tavafi et al., 2013; Ji et al., 2016; Khanpour et al., 2016; Lee and Dernoncourt, 2016) ." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-66", "text": "The emergence of deep learning also gave a big push to DA classification." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-67", "text": "In a natural language conversation, most utterances are very short; hence the task is also referred to as short text classification." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-68", "text": "Lee and Dernoncourt (2016) achieved 73.1% accuracy on the SwDA corpus by using advanced deep learning frameworks such as RNNs and convolutional neural networks (CNNs) with word-level feature embeddings." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-69", "text": "A Novel Approach: Context-based Learning: Classifying DA classes at the single-utterance level might fail for DA classes whose utterances share similar lexical and syntactic cues (words and phrases), such as the backchannel, yes-answer and accept/agree classes." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-70", "text": "Some researchers proposed an utterance-dependent learning approach (Kalchbrenner and Blunsom, 2013; Ji et al., 2016; Kumar et al., 2017; Tran et al., 2017; Liu et al., 2017; Ortega and Vu, 2017; Meng et al., 2017) ." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-71", "text": "Kalchbrenner and Blunsom (2013) and Ortega and Vu (2017) have proposed context-based learning, where they represent the utterance as a compressed vector of the word embeddings using CNNs and use these utterance representations to model discourse within a conversation using RNNs." 
}, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-72", "text": "In their architecture," }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-73", "text": "Figure 1: (a) Multiplicative LSTM (mLSTM) character-level language model to produce the sentence representation s_t." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-74", "text": "The character-level language model is pre-trained and produces the feature (hidden unit states of mLSTM at the last character) or average (average of all hidden unit states of every character) vector representation of the given utterance." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-75", "text": "(b) Utterance-level classification using a simple MLP layer with a softmax function (our baseline model)." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-76", "text": "they also give importance to turn-taking by providing the speaker identity, but they do not analyse their model in this regard." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-77", "text": "This approach achieves about 73.9% accuracy on the SwDA corpus." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-78", "text": "In another line of research (Ji et al., 2016; Kumar et al., 2017) , authors claim that their models take care of the dependency of the utterances within a conversation." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-79", "text": "Ji et al. (2016) use discourse annotation for the word-level language modelling on the SwDA corpus and also highlight the limitation that this approach is not scalable to large data." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-80", "text": "In other approaches, hierarchical convolutional and recurrent neural encoder models are used to learn utterance representations by feeding in a whole conversation (Kumar et al., 2017; Liu et al., 2017) ." 
}, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-81", "text": "The utterance representations are further used to classify DA classes using the conditional random field (CRF) as a linear classifier." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-82", "text": "The model can see the past and future utterances at the same time within a conversation, which limits usage in a dialogue system where one can only perceive the preceding utterance as a context but does not know the upcoming utterances." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-83", "text": "Hence, we use a context-based learning approach and regard the 73.9% accuracy (Kalchbrenner and Blunsom, 2013) on the SwDA corpus as a current state of the art for this task." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-84", "text": "----------------------------------" }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-85", "text": "**OUR APPROACH**" }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-86", "text": "Our approach takes care of discourse compositionality while recognising dialogue acts." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-87", "text": "The DA class of the current utterance is predicted using the context of the preceding utterances." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-88", "text": "We represent each utterance by the hidden state of the multiplicative recurrent neural network trained on domain-independent data using a character-level language model." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-89", "text": "We use RNNs to feed the sequence of the utterances and eventually predict the DA class of the corresponding utterance." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-90", "text": "Figure 2 : The RNN setup for learning the dialogue act recognition with the previous sentences as context." 
}, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-91", "text": "s_t is an utterance representation derived with a character-level language model and has a dialogue act label da_t." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-92", "text": "s_{t-1} and s_{t-2} are the utterances preceding s_t." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-93", "text": "The RNN is trained to carry the recurrency through the previous utterances s_{t-1} and s_{t-2}, represented as h_{t-1} and h_{t-2}, as context for recognizing the dialogue act of the current utterance s_t, which is represented by h_t and used to detect da_t." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-94", "text": "----------------------------------" }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-95", "text": "**UTTERANCE REPRESENTATION**" }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-96", "text": "Character-level encoding allows processing words and whole sentences based on their smallest units while still capturing punctuation and permutations of words." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-97", "text": "We represent an utterance at character level by encoding the whole sentence with a pre-trained character language model." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-98", "text": "This model consists of a single multiplicative long short-term memory (mLSTM) (Krause et al., 2016) layer with 4,096 hidden units." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-99", "text": "The mLSTM is composed of an LSTM and a multiplicative RNN and considers each possible input in a recurrent transition function." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-100", "text": "It is trained as a character language model on \u223c80 million Amazon product reviews (Radford et al., 2017) ." 
}, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-101", "text": "We sequentially input the characters of an utterance to the mLSTM and take the hidden state values after the last character as shown in Figure 1 (a) ." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-102", "text": "The hidden vector s t obtained after the last character is called the last feature vector, as it stores the information related to the character language model and the sentiment of the utterance." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-103", "text": "However, it was shown that the average vector over all characters in the utterance works better for emotion detection (Lakomkin et al., 2017) ." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-104", "text": "Hence, we extract the last feature vector and also the average feature vector representations for each utterance." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-105", "text": "We classify these representations with a multi-layer perceptron (MLP) as shown in Figure 1 (b) ." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-106", "text": "The results are shown in Table 2 ." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-107", "text": "The standard deviation (SD) is computed over ten runs." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-108", "text": "The average vector seems to carry more information related to the DA; hence we use it for future experiments." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-109", "text": "There is an advantage of using domain-independent data: it is rich regarding features being trained on big data, perhaps surpassing the limitation of scalability as mentioned in Ji et al. (2016) ." 
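The last-vector versus average-vector comparison described above can be sketched as follows; this is a minimal illustration, not the authors' code: the pre-trained mLSTM is replaced by random stand-in hidden states (one 4,096-dimensional vector per character, matching the hidden size stated in the text), and all other names are assumptions.

```python
import numpy as np

def utterance_features(char_states):
    """Given the encoder's hidden state after each character
    (shape [n_chars, n_hidden]), return the two utterance
    representations compared in the paper: the last feature
    vector and the average feature vector."""
    char_states = np.asarray(char_states)
    return char_states[-1], char_states.mean(axis=0)

# Stand-in for the pre-trained mLSTM: random hidden states,
# one 4,096-dimensional vector per character of the utterance.
utterance = "I can imagine."
states = np.random.default_rng(1).normal(size=(len(utterance), 4096))
last_vec, avg_vec = utterance_features(states)
```

Either vector can then be fed to the MLP baseline of Figure 1 (b); the paper's experiments favour the average vector.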
}, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-110", "text": "----------------------------------" }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-111", "text": "**CONTEXT LEARNING WITH RNNS**" }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-112", "text": "We apply context learning with the help of RNNs." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-113", "text": "As shown in Figure 2 , the utterances with their character-level language model representations s_t are fed to the RNN, with the preceding utterances (s_{t-1}, s_{t-2}) being the context." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-114", "text": "We use an RNN which receives the input s_t and stores the hidden vector h_t at time t (Elman, 1990) , calculated as: h_t = f(W_h h_{t-1} + I s_t + b)," }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-115", "text": "where f() is a sigmoid function, W_h and I are the recurrent and input weight matrices, respectively, and b is a bias vector learned during training." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-116", "text": "h_t is computed using the previous hidden vector h_{t-1}, which is computed in the same way for the preceding utterance s_{t-1}." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-117", "text": "The output da_t is the dialogue act label of the current utterance s_t, calculated using h_t as: da_t = g(W_out h_t)," }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-118", "text": "where W_out is the output weight matrix." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-119", "text": "The weight matrices are learned using back-propagation through time." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-120", "text": "The task is to classify several classes; hence we use a softmax function g() on the output." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-121", "text": "The input is the sequence of the current and preceding utterances, e.g., s_t, s_{t-1}, and s_{t-2}." 
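The recurrence just described can be sketched as a plain forward pass. The update and output rules are reconstructed from the paper's textual description (sigmoid f, softmax g, weight matrices W_h, I, W_out, bias b); the toy dimensions other than the 64 hidden units, and all variable names, are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def rnn_da_forward(utterances, W_h, I, b, W_out):
    """Elman-style forward pass over a context window of utterance
    vectors (oldest first, current utterance last); returns the
    dialogue-act distribution for the current utterance."""
    h = np.zeros(W_h.shape[0])
    for s in utterances:                  # s_{t-2}, s_{t-1}, s_t
        h = sigmoid(W_h @ h + I @ s + b)  # h_t = f(W_h h_{t-1} + I s_t + b)
    return softmax(W_out @ h)             # da_t = g(W_out h_t)

# Toy dimensions: 64 hidden units as in the paper, 4 DA classes and
# 8-dimensional utterance vectors (the latter two are illustrative).
rng = np.random.default_rng(0)
W_h = rng.normal(size=(64, 64)) * 0.1
I_mat = rng.normal(size=(64, 8)) * 0.1
b, W_out = np.zeros(64), rng.normal(size=(4, 64)) * 0.1
context = [rng.normal(size=8) for _ in range(3)]
probs = rnn_da_forward(context, W_h, I_mat, b, W_out)
```

Resetting the hidden state before each window corresponds to the paper's "reset the RNN when it sees the current utterance" setup.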
}, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-122", "text": "We reset the RNN when it sees the current utterance s_t." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-123", "text": "We also provide information about the speaker to let the network detect a change in the speaker's turn." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-124", "text": "The speaker id 'A' is represented by [1, 0] and id 'B' by [0, 1], and this vector is concatenated with the corresponding utterance representation s_t." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-125", "text": "The Adam optimiser (Kingma and Ba, 2014) was used with a learning rate of 1e-4, which decays to zero during training, and gradients were clipped at norm 1." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-126", "text": "Early stopping was used to avoid over-fitting of the network; 20% of the training samples were used for validation." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-127", "text": "In all learning cases, we minimise the categorical cross-entropy." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-128", "text": "----------------------------------" }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-129", "text": "**RESULTS**" }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-130", "text": "We follow the same data split of 1115 training and 19 test conversations as in the baseline approach (Stolcke et al., 2000; Kalchbrenner and Blunsom, 2013) ." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-131", "text": "Table 3 shows the results of the proposed model with several setups: first without context, then with one, two, and more preceding utterances as context." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-132", "text": "We examined different values for the number of hidden units of the RNN; empirically, 64 was identified as best and used throughout the experiments." 
}, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-133", "text": "We also experimented with various representations for the speaker id that is concatenated with the respective utterances but found no differences." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-134", "text": "As a result, our proposed model uses minimal information for the context." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-135", "text": "The performance increases from 74% to about 77% with context." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-136", "text": "We run each experiment ten times and take the average." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-137", "text": "The model shows robustness, with minimal variance, and using only a few preceding utterances as context can already produce fair results." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-138", "text": "----------------------------------" }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-139", "text": "**CONCLUSION**" }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-140", "text": "In this article, we detail the annotation and modelling of dialogue act corpora, and we find that there is a difference between the way DAs are annotated and the way they are modelled." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-141", "text": "We argue for generalising discourse modelling for conversation within the context of communication." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-142", "text": "Hence, we propose to use the context-based learning approach for the DA identification task." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-143", "text": "We used a simple RNN to model the context of preceding utterances." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-144", "text": "We used a domain-independent pre-trained character language model to represent the utterances." 
}, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-145", "text": "We evaluated the proposed model on the Switchboard Dialogue Act corpus and show the results with and without context." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-146", "text": "For this corpus, our model achieved an accuracy of 77.34% with context compared to 73.96% without context." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-147", "text": "We also compare our model with Kalchbrenner and Blunsom (2013) , who used a context-based learning approach and achieved 73.9%." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-148", "text": "Our model uses minimal information, namely the context of a few preceding utterances, and can therefore be adapted to an online setting such as a spoken dialogue system, where one can naturally see the preceding utterances but not the future ones." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-149", "text": "This makes our model suitable for human-robot/computer interaction, as it can easily be plugged into any spoken dialogue system." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-150", "text": "----------------------------------" }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-151", "text": "**ACKNOWLEDGEMENTS**" }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-152", "text": "This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement number 642667 (SECURE)." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-153", "text": "----------------------------------" }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-154", "text": "**APPENDIX: ANALYSIS OF THE STATE OF THE RNN**" }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-155", "text": "We also analyse the internal state h_t of the RNN for a two-utterance setup." 
}, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-156", "text": "We plot them on a 2D graph with the t-SNE algorithm for the first 2,000 utterances of the SwDA test set." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-157", "text": "Figure 3 shows the clusters of all the DA classes." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-158", "text": "The classes which do not share any information are grouped without any interference such as Non-verbal, and Abandoned." }, { "sent_id": "0a538968f0cd121a1ef63b58a0c9f7-C001-159", "text": "Figure 4 shows some particular classes with utterances in their vector spaces, the (1) current utterance and (2) a preceding utterance in the context." } ], "y": { "@UNSURE@": { "gold_contexts": [ [ "0a538968f0cd121a1ef63b58a0c9f7-C001-83" ], [ "0a538968f0cd121a1ef63b58a0c9f7-C001-147" ] ], "cite_sentences": [ "0a538968f0cd121a1ef63b58a0c9f7-C001-83", "0a538968f0cd121a1ef63b58a0c9f7-C001-147" ] }, "@SIM@": { "gold_contexts": [ [ "0a538968f0cd121a1ef63b58a0c9f7-C001-130" ] ], "cite_sentences": [ "0a538968f0cd121a1ef63b58a0c9f7-C001-130" ] }, "@USE@": { "gold_contexts": [ [ "0a538968f0cd121a1ef63b58a0c9f7-C001-130" ] ], "cite_sentences": [ "0a538968f0cd121a1ef63b58a0c9f7-C001-130" ] } } }, "ABC_52b9efca757fe5a376ca6b548d77ce_33": { "x": [ { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-2", "text": "Although diathesis alternations have been used as features for manual verb classification, and there is recent work on incorporating such features in computational models of human language acquisition, work on large scale verb classification has yet to examine the potential for using diathesis alternations as input features to the clustering process." 
}, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-3", "text": "This paper proposes a method for approximating diathesis alternation behaviour in corpus data and shows, using a state-of-the-art verb clustering system, that features based on alternation approximation outperform those based on independent subcategorization frames." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-4", "text": "Our alternation-based approach is particularly adept at leveraging information from less frequent data." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-5", "text": "----------------------------------" }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-7", "text": "Diathesis alternations (DAs) are regular alternations of the syntactic expression of verbal arguments, sometimes accompanied by a change in meaning." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-8", "text": "For example, The man broke the window \u2194 The window broke." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-9", "text": "The syntactic phenomena are triggered by the underlying semantics of the participating verbs." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-10", "text": "Levin (1993) 's seminal book provides a manual inventory both of DAs and verb classes where membership is determined according to participation in these alternations." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-11", "text": "For example, most of the COOK verbs (e.g. bake, cook, fry . . . ) can all take various DAs, such as the causative alternation, middle alternation and instrument subject alternation." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-12", "text": "In computational linguistics, work inspired by Levin's classification has exploited the link between syntax and semantics for producing classifications of verbs." 
}, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-13", "text": "Such classifications are useful for a wide variety of purposes such as semantic role labelling (Gildea and Jurafsky, 2002) , predicting unseen syntax (Parisien and Stevenson, 2010) , argument zoning (Guo et al., 2011) and metaphor identification (Shutova et al., 2010) ." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-14", "text": "While Levin's classification can be extended manually (Kipper-Schuler, 2005) , a large body of research has developed methods for automatic verb classification since such methods can be applied easily to other domains and languages." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-15", "text": "Existing work on automatic classification relies largely on syntactic features such as subcategorization frames (SCF)s (Schulte im Walde, 2006; Sun and Korhonen, 2011; Vlachos et al., 2009; Brew and Schulte im Walde, 2002) ." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-16", "text": "There has also been some success incorporating selectional preferences (Sun and Korhonen, 2009) ." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-17", "text": "Few have attempted to use, or approximate, diathesis features directly for verb classification although manual classifications have relied on them heavily, and there has been related work on identifying the DAs themselves automatically using SCF and semantic information (Resnik, 1993; McCarthy and Korhonen, 1998; Lapata, 1999; McCarthy, 2000; Tsang and Stevenson, 2004) ." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-18", "text": "Exceptions to this include Merlo and Stevenson (2001) , Joanis et al. (2008) and Stevenson (2010, 2011) ." 
}, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-19", "text": "Merlo and Stevenson (2001) used cues such as passive voice, animacy and syntactic frames coupled with the overlap of lexical fillers between the alternating slots to predict a 3-way classification (unergative, unaccusative and object-drop)." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-20", "text": "Joanis et al. (2008) used similar features to classify verbs on a much larger scale." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-21", "text": "They classify up to 496 verbs using 11 different classifications, each having between 2 and 14 classes." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-22", "text": "Stevenson (2010, 2011) used hierarchical Bayesian models on slot frequency data obtained from child-directed speech parsed with a dependency parser to model acquisition of SCFs, alternations and ultimately verb classes, which provided predictions for unseen syntactic behaviour of class members." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-23", "text": "We are interested in the improvement that can be achieved in verb clustering using approximations for DAs, rather than the DA per se." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-24", "text": "As such, we make the simple assumption that if a pair of SCFs tends to occur with the same verbs, we have a potential occurrence of a DA." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-25", "text": "Although this approximation can give rise to false positives (pairs of frames that co-occur frequently but are not DAs), we are nevertheless interested in investigating its potential usefulness for verb classification." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-26", "text": "One attractive aspect of this method is that it does not require a pre-defined list of possible alternations." 
}, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-27", "text": "----------------------------------" }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-28", "text": "**DIATHESIS ALTERNATION APPROXIMATION**" }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-29", "text": "A DA can be approximated by a pair of SCFs." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-30", "text": "We parameterize frames involving prepositional phrases with the preposition." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-31", "text": "Example SCFs for the verb \"spray\" are shown in Table 1 ." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-32", "text": "The feature value of a single frame feature is the frequency of the SCF." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-33", "text": "Given two frames f_v(i), f_v(j) of a verb v, they can be transformed into a feature pair (f_v(i), f_v(j)) as an approximation to a DA." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-34", "text": "The feature value of the DA feature (f_v(i), f_v(j)) is approximated by the joint probability of the pair of frames p(f_v(i), f_v(j)|v), obtained by integrating over all possible DAs." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-35", "text": "The key assumption is that the joint probability of two SCFs has a strong correlation with a DA, on the grounds that the DA gives rise to both SCFs in the pair." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-36", "text": "We use the DA feature (f_v(i), f_v(j)) with its value p(f_v(i), f_v(j)|v) as a new feature for verb clustering." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-37", "text": "As a comparison point, we can ignore the DA and make a frame independence assumption." 
}, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-38", "text": "The joint probability is decomposed as: p(f_v(i), f_v(j)|v) = p(f_v(i)|v) p(f_v(j)|v) (1)" }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-39", "text": "We assume that SCFs are dependent as they are generated by the underlying meaning components (Levin and Hovav, 2006) ." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-40", "text": "The frame dependency is represented by a simple graphical model in Figure 1 ." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-41", "text": "Figure 1: Graphical model for the joint probability of pairs of frames." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-42", "text": "v represents a verb, a represents a DA and f represents a specific frame out of M possible frames" }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-43", "text": "In the data, the verb (v) and frames (f ) are observed, and any underlying alternation (a) is hidden." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-44", "text": "The aim is to approximate, not to detect, a DA, so a is summed out: p(f_v(i), f_v(j)|v) = \u2211_a p(f_v(i), f_v(j), a|v) (2)" }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-45", "text": "In order to evaluate this sum, we use a relaxation: the sum in equation 2 is replaced with the maximum (max)." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-46", "text": "This is a reasonable relaxation, as a pair of frames rarely participates in more than one type of DA." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-47", "text": "The second relaxation further relaxes the first one by replacing the max with the least upper bound (sup): if f_v(i) occurs a times, f_v(j) occurs b times and b < a, the number of times that a DA occurs between f_v(i) and f_v(j) must be smaller than or equal to b. 
So we end up with a simple form: p(f_v(i), f_v(j)|v) = min(p(f_v(i)|v), p(f_v(j)|v)) (4)" }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-48", "text": "----------------------------------" }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-49", "text": "**EXPERIMENTS**" }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-50", "text": "We evaluated this model by performing verb clustering experiments using three feature sets: F1: SCFs parameterized with prepositions." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-51", "text": "Examples are shown in Table 1 ." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-52", "text": "F2: The frame pair features built from F1 with the frame independence assumption (equation 1)." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-53", "text": "This is not a DA feature, as it ignores the inter-dependency of the frames." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-54", "text": "F3: The frame pair features (DAs) built from F1 with the frame dependency assumption (equation 4)." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-55", "text": "This is the DA feature, which considers the correlation of the two frames generated by the alternation." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-56", "text": "F3 implicitly includes F1, as a frame can pair with itself." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-57", "text": "In the example in Table 2 , the frame pair \"PP(on) PP(on)\" will always have the same value as the \"PP(on)\" frame in F1." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-58", "text": "We extracted the SCFs using the system of Preiss et al. (2007) which classifies each corpus occurrence of a verb as a member of one of the 168 SCFs on the basis of grammatical relations identified by the RASP (Briscoe et al., 2006) parser." 
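The least-upper-bound computation of the F3 (DA) features can be sketched as follows; this is an illustrative reconstruction of the method described, in which a frame pair's value is the smaller of the two frame probabilities, and the SCF labels and counts below are hypothetical rather than taken from the paper's data.

```python
from itertools import combinations_with_replacement

def da_features(scf_counts):
    """Approximate diathesis-alternation (F3) features for one verb:
    the value of a frame pair (f_i, f_j) is min(p(f_i|v), p(f_j|v)),
    the least upper bound of their joint probability.  A frame may
    pair with itself, so F3 implicitly contains the F1 frame
    probabilities."""
    total = sum(scf_counts.values())
    p = {f: c / total for f, c in scf_counts.items()}
    return {(fi, fj): min(p[fi], p[fj])
            for fi, fj in combinations_with_replacement(sorted(p), 2)}

# Hypothetical SCF counts for "spray" (labels and numbers illustrative).
feats = da_features({"NP": 30, "NP_PP(on)": 10, "PP(on)": 10})
```

Here the self-pair ("PP(on)", "PP(on)") receives the value 0.2, the same as the probability of the "PP(on)" frame itself, which illustrates why F3 implicitly includes F1.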
}, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-59", "text": "We experimented with two datasets that have been used in prior work on verb clustering: the test sets 7-11 (3-14 classes) in Joanis et al. (2008) , and the 17 classes set in Sun et al. (2008) ." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-60", "text": "We used the spectral clustering (SPEC) method and settings as in Sun and Korhonen (2009) but adopted the Bhattacharyya kernel (Jebara and Kondor, 2003) to improve the computational efficiency of the approach given the high dimensionality of the quadratic feature space." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-61", "text": "The mean-filed bound of the Bhattacharyya kernel is very similar to the KL divergence kernel (Jebara et al., 2004) which is frequently used in verb clustering experiments (Korhonen et al., 2003; Sun and Korhonen, 2009) ." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-62", "text": "To further reduce computational complexity, we restricted our scope to the more frequent features." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-63", "text": "In the experiment described in this section we used the 50 most frequent features for the 3-6 way classifications (Joanis et al.'s test set 7-9) and 100 features for the 7-17 way classifications." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-64", "text": "In the next section, we will demonstrate that F3 outperforms F1 regardless of the feature number setting." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-65", "text": "The features are normalized to sum 1." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-66", "text": "The clustering results are evaluated using FMeasure as in Sun and Korhonen (2009) which provides the harmonic mean of precision (P ) and recall (R) P is calculated using modified purity -a global measure which evaluates the mean precision of clusters." 
}, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-67", "text": "Each cluster (k i \u2208 K) is associated with the gold-standard class to which the majority of its members belong." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-68", "text": "The number of verbs in a cluster (k i ) that take this class is denoted by n prevalent (k i )." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-69", "text": "|verbs| R is calculated using weighted class accuracy: the proportion of members of the dominant cluster DOM-CLUST i within each of the gold-standard classes c i \u2208 C." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-70", "text": "The results are shown in Table 3 ." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-71", "text": "The result of F2 is lower than that of F3, and even lower than that of F1 for 3-6 way classification." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-72", "text": "This indicates that the frame independence assumption is a poor assumption." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-73", "text": "F3 yields substantially better result than F2 and F1." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-74", "text": "The result of F3 is 6.4% higher than the result (F=63.28) reported in Sun and Korhonen (2009) using the F1 feature." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-75", "text": "This experiment shows, on two datasets, that DA features are clearly more effective than the frame features for verb clustering, even when relaxations are used." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-76", "text": "----------------------------------" }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-77", "text": "**ANALYSIS OF FEATURE FREQUENCY**" }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-78", "text": "A further experiment was carried out using F1 and F3 on Joanis et al. (2008) 's test sets 10 and 11." 
}, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-79", "text": "The frequency ranked features were added to the clustering one at a time, starting from the most frequent one." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-80", "text": "The results are shown in figure 2." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-81", "text": "F3 outperforms F1 clearly on all the feature number settings." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-82", "text": "After adding some highly frequent frames (22 for test set 10 and 67 for test set 11), the performance for F1 is not further improved." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-83", "text": "The performance of F3, in contrast, is improved for almost all (including the mid-range frequency) frames, although to a lesser degree for low frequency frames." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-84", "text": "5 Related work Parisien and Stevenson (2010) introduced a hierarchical Bayesian model capable of learning verb alternations and constructions from syntactic input." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-85", "text": "The focus was on modelling and explaining the child alternation acquisition rather than on automatic verb classification." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-86", "text": "Therefore, no quantitative evaluation of the clustering is reported, and the number of verbs under the novel verb generalization test is relatively small." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-87", "text": "Parisien and A fundamental difference is that we explicitly use a probability distribution over alternations (pair of frames) to represent a verb, whereas they represent a verb by a distribution over the observed frames similar to Vlachos et al. (2009) 's approach." 
}, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-88", "text": "Also the parameters in their model were inferred by Gibbs sampling whereas we avoided this inference step by using relaxation." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-89", "text": "----------------------------------" }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-90", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-91", "text": "We have demonstrated the merits of using DAs for verb clustering compared to the SCF data from which they are derived on standard verb classification datasets and when integrated in a stateof-the-art verb clustering system." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-92", "text": "We have also demonstrated that the performance of frame features is dominated by the high frequency frames." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-93", "text": "In contrast, the DA features enable the mid-range frequency frames to further improve the performance." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-94", "text": "In the future, we plan to evaluate the performance of DA features in a larger scale experiment." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-95", "text": "Due to the high dimensionality of the transformed feature space (quadratic of the original feature space), we will need to improve the computational efficiency further, e.g. via use of an unsupervised dimensionality reduction technique Zhao and Liu (2007) ." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-96", "text": "Moreover, we plan to use Bayesian inference as in Vlachos et al. (2009); Stevenson (2010, 2011) to infer the actual parameter values and avoid the relaxation." 
}, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-97", "text": "Finally, we plan to supplement the DA feature with evidence from the slot fillers of the alternating slots, in the spirit of earlier work (McCarthy, 2000; Merlo and Stevenson, 2001; Joanis et al., 2008) ." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-98", "text": "Unlike these previous works, we will use selectional preferences to generalize the argument heads but will do so using preferences from distributional data (Sun and Korhonen, 2009 ) rather than WordNet, and use all argument head data in all frames." }, { "sent_id": "52b9efca757fe5a376ca6b548d77ce-C001-99", "text": "We envisage using maximum average distributional similarity of the argument heads in any potentially alternating slots in a pair of cooccurring frames as a feature, just as we currently use the frequency of the less frequent co-occurring frame." } ], "y": { "@BACK@": { "gold_contexts": [ [ "52b9efca757fe5a376ca6b548d77ce-C001-16" ], [ "52b9efca757fe5a376ca6b548d77ce-C001-61" ] ], "cite_sentences": [ "52b9efca757fe5a376ca6b548d77ce-C001-16", "52b9efca757fe5a376ca6b548d77ce-C001-61" ] }, "@USE@": { "gold_contexts": [ [ "52b9efca757fe5a376ca6b548d77ce-C001-60" ], [ "52b9efca757fe5a376ca6b548d77ce-C001-66" ], [ "52b9efca757fe5a376ca6b548d77ce-C001-98" ] ], "cite_sentences": [ "52b9efca757fe5a376ca6b548d77ce-C001-60", "52b9efca757fe5a376ca6b548d77ce-C001-66", "52b9efca757fe5a376ca6b548d77ce-C001-98" ] }, "@DIF@": { "gold_contexts": [ [ "52b9efca757fe5a376ca6b548d77ce-C001-74" ] ], "cite_sentences": [ "52b9efca757fe5a376ca6b548d77ce-C001-74" ] } } }, "ABC_cb57b8886be9ea4f0c50fd2c3a178a_33": { "x": [ { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-2", "text": "Although considerable attention has been given to neural ranking architectures recently, far less attention has been paid to the term representations that are used as input to these models." 
}, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-3", "text": "In this work, we investigate how two pretrained contextualized language models (ELMo and BERT) can be utilized for ad-hoc document ranking." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-4", "text": "Through experiments on Trec benchmarks, we find that several existing neural ranking architectures can benefit from the additional context provided by contextualized language models." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-5", "text": "Furthermore, we propose a joint approach that incorporates BERT's classification vector into existing neural models and show that it outperforms state-of-the-art ad-hoc ranking baselines." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-6", "text": "We call this joint approach CEDR (Contextualized Embeddings for Document Ranking)." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-7", "text": "We also address practical challenges in using these models for ranking, including the maximum input length imposed by BERT and runtime performance impacts of contextualized language models." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-8", "text": "----------------------------------" }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-10", "text": "Recently, there has been much work designing ranking architectures to effectively score query-document pairs, with encouraging results [5, 6, 20] ." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-11", "text": "Meanwhile, pretrained contextualized language models (such as ELMo [16] and BERT [4] ) have shown great promise on various natural language processing tasks [4, 16] ." 
}, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-12", "text": "These models work by pre-training LSTM-based or transformer-based [19] language models on a large corpus, and then by performing minimal task fine-tuning (akin to ImageNet [3, 23] )." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-13", "text": "Prior work has suggested that contextual information can be valuable when ranking." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-14", "text": "ConvKNRM [1] , a recent neural ranking model, uses a convolutional neural network atop word representations, allowing the model to learn representations aware of context in local proximity." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-15", "text": "In a similar vein, McDonald et al. [12] proposes Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-16", "text": "Copyrights for components of this work owned by others than ACM must be honored." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-17", "text": "Abstracting with credit is permitted." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-18", "text": "To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-19", "text": "Request permissions from permissions@acm.org." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-20", "text": "an approach that learns a recurrent neural network for term representations, thus being able to capture context from the entire text [12] ." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-21", "text": "These approaches are inherently limited by the variability found in the training data." 
}, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-22", "text": "Since obtaining massive amounts of highquality relevance information can be difficult [24] , we hypothesize that pretrained contextualized term representations will improve ad-hoc document ranking performance." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-23", "text": "We propose incorporating contextualized language models into existing neural ranking architectures by using multiple similarity matrices -one for each layer of the language model." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-24", "text": "We find that, at the expense of computation costs, this improves ranking performance considerably, achieving state-of-the-art performance on the Robust 2004 and WebTrack 2012-2014 datasets." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-25", "text": "We also show that combining each model with BERT's classification mechanism can further improve ranking performance." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-26", "text": "We call this approach CEDR (Contextualzed Embeddings for Document Ranking)." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-27", "text": "Finally, we show that the computation costs of contextualized language models can be dampened by only partially computing the contextualized language model representations." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-28", "text": "Although others have successfully used BERT for passage ranking [14] and question answering [22] , these approaches only make use of BERT's sentence classification mechanism." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-29", "text": "In contrast, we use BERT's term representations, and show that they can be effectively combined with existing neural ranking architectures." 
}, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-30", "text": "In summary, our contributions are as follows: -We are the first to demonstrate that contextualized word representations can be successfully incorporated into existing neural architectures (PACRR [6] , KNRM [20] , and DRMM [5] ), allowing them to leverage contextual information to improve ad-hoc document ranking." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-31", "text": "-We present a new joint model that combines BERT's classification vector with existing neural ranking architectures (using BERT's token vectors) to get the benefits from both approaches." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-32", "text": "-We demonstrate an approach for addressing the performance impact of computing contextualized language models by only partially computing the language model representations." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-33", "text": "-Our code is available for replication and future work." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-34", "text": "1" }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-35", "text": "----------------------------------" }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-36", "text": "**METHODOLOGY 2.1 NOTATION**" }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-37", "text": "In ad-hoc ranking, documents are ranked for a given query according to a relevance estimate." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-38", "text": "Let Q be a query consisting of query terms {q 1 , q 2 , ..., q |Q | }, and let D be a document consisting of terms {d 1 , d 2 , ..., d |D | }." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-39", "text": "Let ranker (Q, D) \u2208 R be a function that returns a real-valued relevance estimate for the document to the query." 
}, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-40", "text": "Neural relevance ranking architectures generally use a similarity matrix as input S \u2208 R |Q |\u00d7 |D | , where each cell represents a similarity score between the query and document: S i, j = sim(q i , d j )." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-41", "text": "These similarity values are usually the cosine similarity score between the word vectors of each term in the query and document." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-42", "text": "----------------------------------" }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-43", "text": "**CONTEXTUALIZED SIMILARITY TENSORS**" }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-44", "text": "Pretrained contextual language representations (such as those from ELMo [16] and BERT [4] ) are context sensitive; in contrast to more conventional pretrained word vectors (e.g., GloVe [15] ) that generate a single word representation for each word in the vocabulary, these models generate a representation of each word based on its context in the sentence." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-45", "text": "For example, the contextualized representation of word bank would be different in bank deposit and river bank, while a pretrained word embedding model would always result in the same representation for this term." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-46", "text": "Given that these representations capture contextual information in the language, we investigate how these models can also benefit general neural ranking models." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-47", "text": "Although contextualized language models vary in particular architectures, they typically consist of multiple stacked layers of representations (e.g., recurrent or transformer outputs)." 
}, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-48", "text": "The intuition is that the deeper the layer, the more context is incorporated." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-49", "text": "To allow neural ranking models to learn which levels are most important, we choose to incorporate the output of all layers into the model, resulting in a three-dimensional similarity representation." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-50", "text": "Thus, we expand the similarity representation (conditioned on the query and document context) to S Q,D \u2208 R L\u00d7 |Q |\u00d7 |D | where L is the number of layers in the model, akin to the channel in image processing." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-51", "text": "Let context Q,D (t, l) \u2208 R D be the contextualized representation for token t in layer l, given the context of Q and D. Given these definitions, let the contextualized representation be:" }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-52", "text": "for each query term q \u2208 Q, document term d \u2208 D, and layer" }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-53", "text": "Note that when q and d are identical, they will likely not receive a similarity score of 1, as their representation depends on the surrounding context of the query and document." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-54", "text": "The layer dimension can be easily incorporated into existing neural models." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-55", "text": "For instance, soft n-gram based models, like PACRR, can perform convolutions with multiple input channels, and counting-based methods (like KNRM and DRMM) can count each channel individually." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-56", "text": "[22] ." 
}, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-57", "text": "----------------------------------" }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-58", "text": "**JOINT BERT APPROACH**" }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-59", "text": "We explore incorporating the [CLS] token's representation into existing neural ranking models as a joint approach." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-60", "text": "This allows neural rankers to benefit from deep semantic information from BERT in addition to individual contextualized token matches." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-61", "text": "Incorporating the [CLS] token into existing ranking models is straightforward." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-62", "text": "First, the given ranking model produces relevance scores (e.g., n-gram or kernel scores) for each query term based on the similarity matrices." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-63", "text": "Then, for models using dense combination (e.g., PACRR, KNRM), we propose concatenating the [CLS] vector to the model's signals." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-64", "text": "For models that sum query term scores (e.g., DRMM), we include the [CLS] vector in the dense calculation of each term score (i.e., during combination of bin scores)." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-65", "text": "We hypothesize that this approach will work because the BERT classification mechanism and existing rankers have different strengths." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-66", "text": "The BERT classification benefits from deep semantic understanding based on next-sentence prediction, whereas ranking architectures traditionally assume query term repetition indicates higher relevance." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-67", "text": "In reality, both are likely important for relevance ranking." 
}, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-68", "text": "----------------------------------" }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-69", "text": "**EXPERIMENT 3.1 EXPERIMENTAL SETUP**" }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-70", "text": "Datasets." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-71", "text": "We evaluate our approaches using two datasets: Trec Robust 2004 and WebTrack 2012-14." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-72", "text": "For Robust, we use the five folds from [7] with three folds used for training, one fold for testing, and the previous fold for validation." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-73", "text": "We evaluate the results using the nDCG@20 / P@20 metrics for Robust04 and nDCG@20 / ERR@20 for WebTrack." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-74", "text": "Models." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-75", "text": "Rather than building new models, in this work we use existing model architectures to test the effectiveness of various input representations." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-76", "text": "We evaluate our methods on three neural relevance matching methods: PACRR [6] , KNRM [20] , and DRMM [5] ." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-77", "text": "Relevance matching models have generally shown to be more effective than semantic matching models, while not requiring massive amounts of behavioral data (e.g., query logs)." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-78", "text": "For PACRR, we increase k max = 30 to allow for more term matches and better back-propagation to the language model." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-79", "text": "Contextualized language models." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-80", "text": "We use the pretrained ELMo (Original, 5 .5B) and BERT (BERT-Base, Uncased) language models in our experiments." 
}, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-81", "text": "For ELMo, the query and document are encoded separately." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-82", "text": "Since BERT enables encoding multiple texts at the same time using Segment A and Segment B embeddings, we encode the query (Segment A) and document (Segment B) simultaneously." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-83", "text": "Because the pretrained BERT model is limited to 512 tokens, longer documents are split such that document segments are split as evenly as possible, while not exceeding the limit when combined with the query and control tokens." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-84", "text": "(Note that the query is always included in full." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-85", "text": ") BERT allows for simple classification fine-tuning, so we also experiment with a variant that is first fine-tuned on the same data using the Vanilla BERT classifier (see baseline below), and further fine-tuned when training the ranker itself." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-86", "text": "Training and optimization." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-87", "text": "We train all models using pairwise hinge loss [2] ." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-88", "text": "Positive and negative training documents are selected from the query relevance judgments (positive documents limited to only those meeting the re-ranking cutoff threshold k using BM25, others considered negative)." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-89", "text": "We train each model for 100 epochs, each with 32 batches of 16 training pairs." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-90", "text": "Gradient accumulation is employed when the batch size of 16 is too large to fit on a single GPU." 
}, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-91", "text": "We re-rank to top k BM25 results for validation, and use P@20 on Robust and nDCG@20 on WebTrack to select the best-performing model." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-92", "text": "We different re-ranking functions and thresholds at test time for each dataset: BM25 with k = 150 for Robust04, and QL with k = 100 for WebTrack." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-93", "text": "The re-ranking setting is a better evaluation setting than ranking all qrels, as demonstrated by major search engines using a pipeline approach [18] ." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-94", "text": "All models are trained using Adam [8] with a learning rate of 0.001 while BERT layers are trained at a rate of 2e-5." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-95", "text": "5 Following prior work [6] , documents are truncated to 800 tokens." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-96", "text": "Baselines." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-97", "text": "We compare contextualized language model performance to the following strong baselines:" }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-98", "text": "-BM25 and SDM [13] , as implemented by Anserini [21] ." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-99", "text": "Finetuning is conducted on the test set, representing the maximum performance of the model when using static parameters over each dataset." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-100", "text": "6 We do not report SDM performance on WebTrack due to its high cost of retrieval on the large ClueWeb collections." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-101", "text": "-Vanilla BERT ranker." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-102", "text": "We fine-tune a pretrained BERT model (BERT-Base, Uncased) with a linear combination layer stacked atop the classifier [CLS] token." 
}, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-103", "text": "This network is optimized the same way our models are, using pairwise cross-entropy loss and the Adam optimizer." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-104", "text": "We use the approach described above to handle documents longer than the capacity of the network, and average the [CLS] vectors from each split." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-105", "text": "-TREC-best: We also compare to the top-performing topic TREC run for each track in terms of nDCG@20." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-106", "text": "We use uogTrA44xu for WT12 ([10] , a learning-to-rank based run), clustmrfaf for WT13 ( [17] , clustering-based), UDInfoWebAX for WT14 ( [11] , entity expansion), and pircRB04t3 for Robust04 ( [9] , query expansion using Google search results)." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-107", "text": "7 -ConvKNRM [1] , our implementation with the same training pipeline as the evaluation models." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-108", "text": "-Each evaluation model when using GloVe [15] vectors." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-109", "text": "8 Table 1 shows the ranking performance using our approach." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-110", "text": "We first note that the Vanilla BERT method significantly outperforms the tuned BM25 [V] and ConvKNRM [C] baselines on its own." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-111", "text": "This is encouraging, and shows the ranking power of the Vanilla BERT model." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-112", "text": "When using contextualized language term representations without tuning, PACRR and DRMM performance is comparable to that of GloVe [G] , while KNRM sees a modest boost." 
}, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-113", "text": "This might be 5 Pilot experiments showed that a learning rate of 2e-5 was more effective on this task than the other recommended values of 5e-5 and 3e-5 by [4] ." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-114", "text": "6 k 1 in 0.1-4 (by 0.1), b in 0.1-1 (by 0.1), SDM weights in 0-1 (by 0.05)." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-115", "text": "7 We acknowledge limitations of the TREC experimentation environment." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-116", "text": "8 due to KNRM's ability to train its matching kernels, tuning to specific similarity ranges produced by the models." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-117", "text": "(In contrast, DRMM uses fixed buckets, and PACRR uses maximum convolutional filter strength, both of which are less adaptable to new similarity score ranges.) When fine-tuning BERT, all three models see a significant boost in performance compared to the GloVe-trained version." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-118", "text": "PACRR and KNRM see comparable or higher performance than the Vanilla BERT model." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-119", "text": "This indicates that fine-tuning contextualized language models for ranking is important." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-120", "text": "This boost is further enhanced when using the CEDR (joint) approach, with the CEDR models always outperforming Vanilla BERT [V] , and nearly always significantly outperforming the non-CEDR versions [N] ." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-121", "text": "This suggests that term counting methods (such as KNRM and DRMM) are complementary to BERT's classification mechanism." 
}, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-122", "text": "Similar trends for both Robust04 and WebTrack 2012-14 indicate that our approach is generally applicable to ad-hoc document retrieval tasks." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-123", "text": "----------------------------------" }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-124", "text": "**RESULTS & ANALYSIS**" }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-125", "text": "To gain a better understanding of how the contextual language model helps enhance the input representation, we plot example similarity matrices based on GloVe word embeddings, ELMo representations (layer 2), and fine-tuned BERT representations (layer 5)." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-126", "text": "In these examples, two senses of the word curb (restrain, and edge of street) are encountered." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-127", "text": "The first is relevant to the query (it's discussing attempts to restrain population growth)." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-128", "text": "The second is not (it discusses street construction)." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-129", "text": "Both the ELMo and BERT models give a higher similarity score to the correct sense of the term for the query." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-130", "text": "This ability to distinguish different senses of terms is a strength of contextualized language models, and likely can explain some of the performance gains of the non-joint models." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-131", "text": "Although the contextualized language models yield ranking performance improvements, they come with a considerable cost at inference time-a practical issue ignored in previous ranking work [14, 21] ." 
}, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-132", "text": "To demonstrate this, in Figure 2 (a) we plot the processing rate of GloVe, ELMo, and BERT." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-133", "text": "9 Note that the processing 9 Running time measured on single GeForce GTX 1080 Ti GPU, data in memory." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-134", "text": "rate when using static GloVe vectors is orders of magnitude faster than when using the contextualized representations, with BERT outperforming ELMo because it uses the more efficient Transformer instead of an RNN." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-135", "text": "In an attempt to improve the running time of these systems, we propose limiting the number of layers processed by the model." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-136", "text": "The reasoning behind this approach is that the lower the layer, the more abstract the matching becomes, perhaps becoming less useful for ranking." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-137", "text": "We show the runtime and ranking performance of PACRR when only processing only up to a given layer in Figure 2 (b)." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-138", "text": "It shows that most of the performance benefits can be achieved by only running BERT through layer 5; the performance is comparable to running the full BERT model, while running more than twice as fast." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-139", "text": "While we acknowledge that our research code is not completely optimized, we argue that this approach is generally applicable because the processing of these layers are sequential, query-dependent, and dominate the processing time of the entire model." }, { "sent_id": "cb57b8886be9ea4f0c50fd2c3a178a-C001-140", "text": "This approach is a simple time-saving measure." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "cb57b8886be9ea4f0c50fd2c3a178a-C001-10" ] ], "cite_sentences": [ "cb57b8886be9ea4f0c50fd2c3a178a-C001-10" ] }, "@UNSURE@": { "gold_contexts": [ [ "cb57b8886be9ea4f0c50fd2c3a178a-C001-30" ] ], "cite_sentences": [ "cb57b8886be9ea4f0c50fd2c3a178a-C001-30" ] }, "@USE@": { "gold_contexts": [ [ "cb57b8886be9ea4f0c50fd2c3a178a-C001-76" ], [ "cb57b8886be9ea4f0c50fd2c3a178a-C001-95" ] ], "cite_sentences": [ "cb57b8886be9ea4f0c50fd2c3a178a-C001-76", "cb57b8886be9ea4f0c50fd2c3a178a-C001-95" ] } } }, "ABC_8b2bb6753dc72a048ec28958e943fb_33": { "x": [ { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-86", "text": "A softmax layer that is trained on top of the layers is used to predict the likelihood of a window of words." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-157", "text": "The window length, n, that determines the number of words input to the second layer is set to 5." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-2", "text": "Agglutinative languages such as Turkish, Finnish and Hungarian require morphological disambiguation before further processing due to the complex morphology of words." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-3", "text": "A morphological disambiguator is used to select the correct morphological analysis of a word." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-4", "text": "Morphological disambiguation is important because it generally is one of the first steps of natural language processing and its performance affects subsequent analyses." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-5", "text": "In this paper, we propose a system that uses deep learning techniques for morphological disambiguation." 
}, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-6", "text": "Many of the state-of-the-art results in computer vision, speech recognition and natural language processing have been obtained through deep learning models." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-7", "text": "However, applying deep learning techniques to morphologically rich languages is not well studied." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-8", "text": "In this work, while we focus on Turkish morphological disambiguation we also present results for French and German in order to show that the proposed architecture achieves high accuracy with no language-specific feature engineering or additional resource." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-9", "text": "In the experiments, we achieve 84.12 , 88.35 and 93.78 morphological disambiguation accuracy among the ambiguous words for Turkish, German and French respectively." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-10", "text": "----------------------------------" }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-11", "text": "**INTRODUCTION**" }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-12", "text": "Morphological analysis is generally achieved through the use of finite state transducers (FST) (Kaplan and Kay 1981; Koskenniemi 1984; Beesley and Karttunen 2003; Oflazer 1993) ." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-13", "text": "During morphological analysis, the surface form of the word is given as input and an FST is used to output possible morphological analyses of the input word." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-14", "text": "A morphological analysis contains a root and a set of tags so called morphemes, minimal units of meaning in a language (Oflazer 1993) ." 
}, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-15", "text": "A morphological disambiguator is used to select the correct analysis among the possible analyses of a word using the context that the word appears in." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-16", "text": "The output of morphological disambiguation contains syntactic and semantic information about a word such as its POS tag, tense, polarity and it being accusative, possessive or genitive." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-17", "text": "This information is vital for some NLP tasks such as dependency parsing and semantic role labeling whereas it can be utilized in other NLP tasks such as topic modeling, named entity recognition and machine translation." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-18", "text": "While morphological disambiguation is important for natural language processing in any language, it is vital in morphologically rich languages." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-19", "text": "We specifically focus on Turkish which is an important language spoken by over 70 million people and has a complex morphology that allows construction of thousands of word forms from each root through inflectional and derivational suffixation (Hakkani-T\u00fcr, Oflazer, and T\u00fcr 2000) ." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-20", "text": "For instance, y\u00fcr\u00fc (walk), y\u00fcr\u00fcd\u00fcm (I walked), y\u00fcr\u00fcyeceksiniz (you will walk), y\u00fcr\u00fctt\u00fcler (they made somebody walk), y\u00fcr\u00fcy\u00fcnce (When (he/she/it) walks) and y\u00fcr\u00fcyecektiler (They were going to walk) are some of the possible word formations of a Turkish verb root y\u00fcr\u00fc." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-21", "text": "In all the examples \"y\u00fcr\u00fc\" is the root of the words whereas the suffixes are used to change meaning." 
}, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-22", "text": "Morphological analysis of a word might produce more than one analysis since there might be multiple interpretations of a single word." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-23", "text": "Consider the example given in Table 1 where the Turkish word \"dolar\" is analyzed." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-24", "text": "The output of the morphological analyzer for this word contains four possible analyses." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-25", "text": "The reason for that is each of the words \"dolar\", \"dola\", \"dol\" and \"do\" can be used as roots and at the same time \"r\", \"ar\" and \"lar\" are all valid suffixes in Turkish." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-26", "text": "Thus, all four of the analyses are valid that lead to quite different meanings." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-27", "text": "Another reason for multiple morphological analyses is due to the fact that a morpheme might change meanining depending on the context of the word." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-28", "text": "Consider the example given in Table 2 ." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-29", "text": "In the first row \"evi\" is used in the ac-cusative case whereas in the second row it is used as a possessive noun." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-30", "text": "The word \"evi\" has two morphological analyses sharing the same root." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-31", "text": "Thus, the only difference in the analyses is at the suffix of the word." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-32", "text": "The suffix \"-i\" in the first sentence transforms the word into \"accusative marker\", while its interpretation is \"third person possessive\" in the second one." 
}, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-33", "text": "In addition, some root words might have multiple meanings." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-34", "text": "For instance, the Turkish word \"y\u00fcz\" could be interpreted as a noun (face), a verb (swim) or a number (hundred) depending on its context." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-35", "text": "As we noted before, morphological disambiguation is important for NLP in most of the languages." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-36", "text": "For instance, although German and French do not have a morphology as rich as Turkish, NLP in these languages can still benefit from morphological disambiguation." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-37", "text": "Higher accuracies in NLP tasks such as POS tagging and dependency parsing can be obtained if the morphology of the words are taken into account (Sennrich et al. 2009 ), (Candito et al. 2010) ." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-38", "text": "We apply our general purpose morphological disambiguation method to these languages and show that high accuracies can be obtained for POS tagging and lemmatization." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-39", "text": "Possible word formations and morphological analyses of the German word \"Haus\" and the French word \"savoir\" are given in Table 3 and Table 4 respectively." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-40", "text": "There are various approaches proposed for morphological disambiguation based on lexical rules or statistical models." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-41", "text": "Rule based methods apply hand-crafted rules in order to select the correct morphological analyses or eliminate incorrect ones (Oflazer and Kuru\u00f6z 1994; Oflazer and Tur 1996; Daybelge and Cicekli 2007) ." 
}, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-42", "text": "Y\u00fcret and T\u00fcre (2006) proposed a decision list learning algorithm for extraction of Finally, hybrid models which combine statistical and rule based approaches are also proposed (Oflazer and Tur 1996; Kutlu and Cicekli 2013) ." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-43", "text": "We propose a deep neural architecture followed by the Viterbi algorithm for morphological disambiguation of words in a sentence." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-44", "text": "In this paper we focus on Turkish as an example even though the proposed model can be utilized in all morphologically rich languages." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-45", "text": "We test our approach in German and French in order to prove that the proposed method is able to work well for other languages as well." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-46", "text": "The network architecture in this paper is designed to produce a classification score for a sequence of n-words." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-47", "text": "It consists of two layers and a softmax layer." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-48", "text": "The first layer of the model builds a representation for each word using root embeddings and some syntactic and semantic features." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-49", "text": "The second layer takes as input the learned word representations and incorporates contextual information." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-50", "text": "A softmax layer uses the output of the second layer to produce a classification score." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-51", "text": "We use the neural network to produce a score for each n length sequence in a given sentence." 
}, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-52", "text": "We then employ the Viterbi algorithm to produce the morphological disambiguation for each word in the sentence by finding the most probable sequence using the output of the softmax layer." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-53", "text": "----------------------------------" }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-54", "text": "**RELATED WORKS**" }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-55", "text": "In a natural language processing pipeline morphological disambiguation can be considered at the same level as POS tagging." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-56", "text": "In order to perform POS tagging in English, various approaches such as rule-based models (Brill 1992) , statistical models (Brill 1995) , maximum entropy models (Ratnaparkhi 1997), HMMs (Cutting et al. 1992) , CRFs (Lafferty, McCallum, and Pereira 2001) and decision trees (Schmid 1994) are proposed." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-57", "text": "However, morphological disambiguation is a much harder problem in general due to the fact that it requires the classification of both roots, suffixes and the corresponding labels." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-58", "text": "Moreover, compared to an agglutinative language such as Turkish, English words can take on a limited number of word forms and part-of-speech tags." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-59", "text": "Y\u00fcret and T\u00fcre (2006) observe that more than ten thousand tag types exists in a corpus comprised of a million Turkish words." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-60", "text": "Thus, due to the high number of possible tags and the number of possible analyses in languages with productive morphology, morphological disambiguation is quite different from part-of-speech tagging in English." 
}, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-61", "text": "The previous work on morphological disambiguation in morphologically rich languages can be summarized into three categories: rule based, statistical and hybrid approaches." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-62", "text": "In the rule-based approaches a large number of hand crafted rules are used to select the correct morphological analyses or to eliminate incorrect ones (Karlsson et al. 1995; Oflazer and Kuru\u00f6z 1994; Oflazer and Tur 1996; Daybelge and Cicekli 2007) ." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-63", "text": "Statistical approaches generally utilize the statistics of root and tag sequences for selection of the best roots and tags." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-64", "text": "A statistical Turkish morphological disambiguation model which scores the probability of each tag by considering the statistics over the derivational boundaries and roots using trigrams has been proposed by T\u00fcr et al. (2000) ." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-65", "text": "They test their model on a manually disambiguated test data consisting of 2763 words and obtain 93.5% accuracy in morphological disambiguation (including non-ambiguous words)." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-66", "text": "A similar morphology-aware nonparametric Bayesian model is proposed in (Chahuneau, Smith, and Dyer 2013)." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-67", "text": "They integrate their generative model to NLP applications such as language modeling, word alignment and morphological disambiguation and obtain state-of-the-art results for Russian morphological disambiguation." 
}, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-68", "text": "Y\u00fcret and T\u00fcre (2006) extract Turkish morphological disambiguation rules using a decision list learner, Greedy Prepend Algorithm (GPA), and they achieve 95.8% accuracy on manually disambiguated data consisting of around 1K words." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-69", "text": "Megyesi (1999) adapt a transformation based syntactic rule learner (Brill 1995) for Hungarian and Hajic (1998) extend his work for Czech and five other languages." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-70", "text": "Sak et al. (2007) apply a multilayer perceptron algorithm using a set of 23 features including tri-gram and bi-gram statistics of morphological tags and roots." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-71", "text": "They obtain 96.8% accuracy on test data consisting of 2.5K words." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-72", "text": "Ehsani et al. (2012) apply conditional random fields (CRFs) using several features derived from morphological and syntactic properties of words and achieve 96.8% accuracy." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-73", "text": "G\u00f6rg\u00fcn and Yildiz (2011) use a J48 decision tree and achieve 95.6% accuracy." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-74", "text": "There are also several studies that combine statistical and rule based approaches such as (Ezeiza et al. 1998; Oflazer and Tur 1996; Kutlu and Cicekli 2013; Orosz and Nov\u00e1k 2013) ." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-75", "text": "Although deep learning techniques have been successfully used in various NLP tasks in English (Collobert and Weston 2008; Collobert et al. 2011; Le and Mikolov 2014; Pennington, Socher, and Manning 2014; Luong, Socher, and Manning 2013; Socher et al. 
2012 ), this study is unique in that we create a deep learning architecture specifically suited for handling morphologically rich languages." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-76", "text": "One similar work to ours is the recent work of Luong et al. (2013) who introduce morphological RNNs to create word representations through composition of morphemes." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-77", "text": "However, they present their results only for English which is not morphologically as rich as languages such as Turkish or Finnish." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-78", "text": "There are also recent works that suggest integrating morphological knowledge into distributed word representations such as (Cotterell and Sch\u00fctze 2015) and (Cui et al. 2015) ." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-79", "text": "Cotterell and Sch\u00fctze (2015) extend log-bilinear model (an instance of language models that make the Markov assumption as n-gram language models) in order to jointly predict the next morphological tag along with the next word, encouraging the resulting embeddings to encode morphology." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-80", "text": "On the other hand, (Cui et al. 2015) propose a method for learning embeddings which is a modified version of skipgram algorithm (Mikolov et al. 2013 ) that benefits from morphological knowledge when predicting the target word." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-81", "text": "Using morphology-based word representations improves the performance for different NLP tasks such as word similarity and statistical machine translation according to empirical evaluation of Botha and Phil (2014) ." 
}, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-82", "text": "Our work uses a convolutional architecture and handles any number of morphological features in order to build word representations while performing disambiguation at the same time." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-83", "text": "----------------------------------" }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-84", "text": "**METHOD**" }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-85", "text": "In this work we propose an architecture with the ability to represent morphologically rich words and model spatial dependencies among word vectors." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-87", "text": "Finally, the Viterbi algorithm is used on the outputs of the softmax layer in order to find the optimal disambiguation of the words in a sentence." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-88", "text": "We also show how unsupervised pre-trainining can be used to improve the performance of the designed system and achieve the state-of-the-art accuracy for Turkish morphological disambiguation." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-89", "text": "The input to our model is a sentence where each word in the sentence needs to be disambiguated." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-90", "text": "We first tokenize the sentences and then use morphological analyzers to find possible analyses of each word in the sentence." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-91", "text": "HFST tool (Lind\u00e9n, Silfverberg, and Pirinen 2009 ) is used to perform morphological analysis in German and French whereas (Oflazer 1993 ) is used for Turkish morphological analysis." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-92", "text": "NLP systems that use deep learning generally employ word embeddings in order to represent each word in a dictionary." 
}, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-93", "text": "Word embeddings are dense low dimensional real-valued vectors with each dimension corresponding to a latent feature of the word (Turian, Ratinov, and Bengio 2010) ." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-94", "text": "In a morphologically rich language, representing words in surface form might not be a good idea since lots of surface form words can be derived from a single root." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-95", "text": "Thus, in our design, each word in surface form is represented with a root and a set of morphological features where each root and feature has individual embeddings that are learned during training." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-96", "text": "Root and morphological feature embeddings can have varying lengths and through their concatenation surface form words are represented with fixed length embeddings." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-97", "text": "Our architecture is illustrated in Figure 1 where individual layers are marked with (a), (b) and (c)." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-98", "text": "The first layer (a) takes the root and morphological features of a single word as input and propagates to the next layer." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-99", "text": "The second layer, (b), takes a window of n words as input and propagates to the softmax layer, (c)." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-100", "text": "The non-linearity in both the first and the second layers are provided through the use of tanh as the transfer function." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-101", "text": "The softmax layer is responsible for deciding the likelihood of the current morphological analysis of the words, i.e., a binary decision is produced with the expected result of 1 if the analysis is correct, 0 otherwise." 
}, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-102", "text": "We train our network with the possible sequences of morphological analyses in the training data." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-103", "text": "For each sentence, and for each word, we select the n-2 words preceding the word and their groundtruth annotations along with the possible annotations of the last two words." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-104", "text": "We also add n-1 out of sentence tokens at the beginning of each sentence so that all words in the sentence are included in the training data." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-105", "text": "We label the sequences containing the correct morphological analysis as positive whereas the remaining sequences are labeled as negative." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-106", "text": "This way the model is trained to predict the correct annotation for the last two words in a sequence given that the first n-2 words have correct annotations." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-107", "text": "Training is performed with stochastic gradient descent and AdaGrad (Duchi, Hazan, and Singer 2011) as the optimization algorithms." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-108", "text": "At inference time, given a sentence containing words to disambiguate, we use the network to make predictions for window of words in the sentence and then use the Viterbi algorithm to select the best morphological analysis for each word." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-109", "text": "Unsupervised pre-training of word embeddings have been employed in various NLP tasks, and their usage have improved recognition accuracies (Collobert et al. 2011; Turian, Ratinov, and Bengio 2010) ." 
}, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-110", "text": "In order to improve the performance of our disambiguation system we also use unsupervised methods to pre-train root embeddings of words." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-111", "text": "We created a corpus comprised of 1 billion Turkish words that we collected from various sources, such as e-books and web sites." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-112", "text": "Although our corpus is rather small compared to English corpora, it is the largest text corpus in Turkish that we know of." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-113", "text": "After we trained the supervised disambiguation system as described above, we disambiguated each word in the corpus and extracted the roots of words." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-114", "text": "Next, we built representations for root forms of the words using the unsupervised skip-gram algorithm (Mikolov et al. 2013) ." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-115", "text": "After obtaining the pre-trained root vectors, we retrained our disambiguation system with pre-trained root embeddings." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-116", "text": "This technique allowed us to further improve the disambiguation accuracies we obtained." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-117", "text": "As discussed earlier, the first layer takes as input the root and the morphological features of a word." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-118", "text": "The morphological features of words we use are presented in Table 5 ." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-119", "text": "Specifically, the set of morphological features we consider contains the root, main POS tag, minor POS tag, person and possessive agreements, plurality, gender, case marker, polarity and tense." 
}, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-120", "text": "Note that the information contained in a surface word form may differ due to morphological characteristics of a language." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-121", "text": "For instance, German and French have gender feature contrary to Turkish while Turkish words have possesive agreement and polarity." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-122", "text": "Main POS tag, describes the category of a word and can take on values such as noun, verb, adjective and adverb." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-123", "text": "Minor POS tag determines the minor morphological properties of a word such as semantic markers, causative markers and post-position." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-124", "text": "\"Since\", \"While\", \"Propernoun\", \"Without\" can be given as examples to this kind of morphological features in Turkish." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-125", "text": "Person and possessive agreement are used to answer the questions \"who\" and \"whose\" respectively, i.e., they are used to indicate a person or an ownership relationship." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-126", "text": "Case marker relate the nouns to the rest of the sentence as prepositions do in English." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-127", "text": "Nominative (none), dative(to, for), locative (at, in, on), ablative (from, out of) and genitive (of) forms are examples of the forms that can be observed in a sentence." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-128", "text": "Polarity of a word is positive if the word is not negated and negative otherwise." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-129", "text": "Tense indicates the tense of the verbs such as present, past and future tense." 
}, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-130", "text": "Additionally, we consider the moods of the verbs within tense feature." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-131", "text": "Moods express the speaker's attitude such as indicative, imperative or subjunctive moods." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-132", "text": "In languages with grammatical gender such German and French, every noun is associated with a gender." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-133", "text": "The morphological analyzer we use associates each French word with one of the two genders (masculine and feminine) while it associates each German word with one of the four possible genders (masculine, feminine, neuter and no gender)." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-134", "text": "Some of the suffixes in Turkish change word meaning creating derivational boundaries in the morphological analyses." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-135", "text": "The morphological features of a word given in Table 5 are extracted after the final derivational boundary." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-136", "text": "In Turkish, we add one more feature to each word named previous tags in order to account for the previous suffixes that the word might have." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-137", "text": "This way, our model learns the effect of suffixes that change word meaning." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-138", "text": "Some of the described morphological features exist only for certain word categories." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-139", "text": "For instance possessive agreement and case marker features can only exist in nouns, polarity and tense exist in verbs and person agreement exist in nouns and verbs." 
}, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-140", "text": "If a morphological feature cannot be extracted from a word, we label it as having NULL for the feature." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-141", "text": "----------------------------------" }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-142", "text": "**EXPERIMENTS**" }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-143", "text": "For Turkish, we used a semi-automatically disambiguated corpus containing 1M tokens (Y\u00fcret and T\u00fcre 2006) ." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-144", "text": "Since this dataset is annotated semi-automatically, it also contains noise." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-145", "text": "In order to reduce the effect of noise to the recognition accuracies, we created a test set by randomly selecting sentences containing 20K of the tokens and manually annotating them." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-146", "text": "We make this test data publicly available 1 so that Turkish morphological disambiguation algorithms can be compared more accurately in the future." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-147", "text": "We use SPMRL 2014 dataset (Seddah and Tsarfaty 2014) for German and French." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-148", "text": "This data set is created in the Penn tree bank format and used for a shared task on statistical parsing of morphologically rich languages." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-149", "text": "This dataset contains 1M and 500K sentences with POS tag and morphological information for German and French respectiveley." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-150", "text": "It provides 90% of all sentences as training set and %10 of rest of the sentences as test set." 
}, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-151", "text": "We align the features in the tree bank to the HFST outputs in order to determine the correct morphological analyses generated by the HFST tool." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-152", "text": "We use this data set for both training and testing." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-153", "text": "The development sets for each language are randomly separated from the training data and are used to optimize the embedding lengths of morphological features." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-154", "text": "We noticed that similar parameters lead to the best performance." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-155", "text": "Thus, in the experiments, we used embedding lengths 50, 20 and 5 for roots, POS tags and the other morphological features respectively." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-156", "text": "The number of filters in the first and second layers are 30 and 40 respectively." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-158", "text": "The experiment results for POS tagging, lemmatization and morphological disambiguation in Turkish, German and French are presented in Table 6 ." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-159", "text": "Notice that the POS tagging and lemmatization accuracies are refer to the percentages of POS tags and lemmas predicted correctly while morphological disambiguation accuracies are refer to the percentages of the words disambiguated correctly among the ambiguous words According to the results, we observe that even though our initial target was to be able to achieve Turkish morphological disambiguation, our model consistently obtains high accuracies in French and German as well." 
}, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-160", "text": "In Table 7 , we present the results of various models for Turkish morphological disambiguation on our hand-labeled test data." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-161", "text": "The results of the multilayer perceptron developed in (Sak, G\u00fcng\u00f6r, and Sara\u00e7lar 2007) and the decision list learning algorithm developed in (Y\u00fcret and T\u00fcre 2006) are presented in lines 1 and 2 respectively." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-162", "text": "We present Turkish morphological disambiguation results obtained by our model with and without pre-training in lines 3 and 4 respectively." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-163", "text": "As we discussed before, unsupervised pre-training of the embeddings can boost accuracies of neural networks." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-164", "text": "As expected, morphological disambiguation accuracy increases by around 1% (around 6% reduction in error) when root embeddings are pre-trained instead of randomly initialized." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-165", "text": "We see that even without unsupervised pre-training our algorithm outperforms the current state of the art models and we are able to further improve the accuracy by pre-training of the embeddings." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-166", "text": "Although we do not evaluate the effects of unsupervised pre-training for German and French, it is expected that higher accuracies can be achieved using unsupervised pretraining of the embeddings for these languages as well." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-167", "text": "Error analysis for Turkish morphological disambiguation shows that the root is incorrectly decided in 30% of errors." 
}, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-168", "text": "The root is correct but the POS tag is incorrectly decided in 40% of errors while 30% of errors caused by wrong decisions on other inflectional groups." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-169", "text": "When compared with the study of Sak et al. 2007 , there is no significant difference in the distribution of mistakes." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-170", "text": "However our method performs better in root decisions due to unsupervised learning of root embeddings." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-171", "text": "As discussed before, the available data for Turkish morphological disambiguation task contains some systematic errors." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-172", "text": "Y\u00fcret and T\u00fcre (2006) report that the accuracy of the training data is below 95%." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-173", "text": "According to our observation there is a major confusion between noun and adjective POS tags in training data which affects the decision of the morphological disambiguation systems." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-174", "text": "In our experiment, we observe that 18% of the errors are caused by such confusion, whereas the ratio of these errors are reported as 22% in the experiments of Sak et al. (2007) ." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-175", "text": "----------------------------------" }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-176", "text": "**SUMMARY AND FUTURE WORK**" }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-177", "text": "In this paper, we present a model capable of learning word representations for languages with rich morphology." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-178", "text": "We show the utility of our approach in the task of Turkish, German and French morphological disambiguation." 
}, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-179", "text": "We also show the effect of unsupervised pre-training on recognition accuracies and improve the current state-of-the-art in Turkish morphological disambiguation." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-180", "text": "We publicly make available a manually annotated test set containing 20K tokens which we believe will benefit Turkish NLP." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-181", "text": "This paper presents a deep learning architecture specifically aiming to handle morphologically rich languages." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-182", "text": "Nonetheless, NLP systems that work on languages such as English can also benefit from our work." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-183", "text": "Using our model, English words can be separated into morphemes so that they can be better represented." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-184", "text": "This allows creating systems that are less affected from problems such as data sparsity (Luong, Socher, and Manning 2013) ." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-185", "text": "While using pre-training, we only considered the pretrained root embeddings." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-186", "text": "It would be preferred to pre-train all the embeddings using our text corpus which we leave as future work." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-187", "text": "Another point of note is the selected embedding sizes that we used in our experiments." }, { "sent_id": "8b2bb6753dc72a048ec28958e943fb-C001-188", "text": "While we worked on a development set separated from training data for parameter selection, further investigation in parameter selection might improve the obtained accuracies." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "8b2bb6753dc72a048ec28958e943fb-C001-42" ], [ "8b2bb6753dc72a048ec28958e943fb-C001-59" ], [ "8b2bb6753dc72a048ec28958e943fb-C001-68" ], [ "8b2bb6753dc72a048ec28958e943fb-C001-172" ] ], "cite_sentences": [ "8b2bb6753dc72a048ec28958e943fb-C001-42", "8b2bb6753dc72a048ec28958e943fb-C001-59", "8b2bb6753dc72a048ec28958e943fb-C001-68", "8b2bb6753dc72a048ec28958e943fb-C001-172" ] }, "@USE@": { "gold_contexts": [ [ "8b2bb6753dc72a048ec28958e943fb-C001-143" ] ], "cite_sentences": [ "8b2bb6753dc72a048ec28958e943fb-C001-143" ] }, "@UNSURE@": { "gold_contexts": [ [ "8b2bb6753dc72a048ec28958e943fb-C001-161" ] ], "cite_sentences": [ "8b2bb6753dc72a048ec28958e943fb-C001-161" ] } } }, "ABC_41d56f3f962fa3b0e0563544a6de40_33": { "x": [ { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-2", "text": "Left-to-right (LR) decoding (Watanabe et al., 2006 ) is promising decoding algorithm for hierarchical phrase-based translation (Hiero) that visits input spans in arbitrary order producing the output translation in left to right order." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-3", "text": "This leads to far fewer language model calls, but while LR decoding is more efficient than CKY decoding, it is unable to capture some hierarchical phrase alignments reachable using CKY decoding and suffers from lower translation quality as a result." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-4", "text": "This paper introduces two improvements to LR decoding that make it comparable in translation quality to CKY-based Hiero." 
}, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-5", "text": "----------------------------------" }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-7", "text": "Hierarchical phrase-based translation (Hiero) (Chiang, 2007 ) uses a lexicalized synchronous context-free grammar (SCFG) extracted from word and phrase alignments of a bitext." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-8", "text": "Decoding for Hiero is typically done with CKY-style decoding with time complexity O(n 3 ) for source input with n words." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-9", "text": "Computing the language model score for each hypothesis within CKY decoding requires two histories, the left and the right edge of each span, due to the fact that the target side is built inside-out from sub-spans Heafield et al., 2013) ." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-10", "text": "LR-decoding algorithms exist for phrasebased (Koehn, 2004; Galley and Manning, 2010) and syntax-based (Huang and Mi, 2010; Feng et al., 2012 ) models and also for hierarchical phrasebased models (Watanabe et al., 2006; Siahbani et al., 2013) , which is our focus in this paper." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-11", "text": "Watanabe et al. (2006) first proposed left-toright (LR) decoding for Hiero (LR-Hiero henceforth) which uses beam search and runs in O(n 2 b) in practice where n is the length of source sentence and b is the size of beam (Huang and Mi, 2010) ." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-12", "text": "To simplify target generation, SCFG rules are constrained to be prefix-lexicalized on target side aka Griebach Normal Form (GNF)." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-13", "text": "Throughout this paper we abuse the notation for simplicity and use the term GNF grammars for such SCFGs." 
}, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-14", "text": "This constraint drastically reduces the size of grammar for LR-Hiero in comparison to Hiero grammar (Siahbani et al., 2013) ." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-15", "text": "However, the original LR-Hiero decoding algorithm does not perform well in comparison to current state-of-the-art Hiero and phrase-based translation systems." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-16", "text": "Siahbani et al. (2013) propose an augmented version of LR decoding to address some limitations in the original LR-Hiero algorithm in terms of translation quality and time efficiency." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-17", "text": "Although, LR-Hiero performs much faster than Hiero in decoding and obtains BLEU scores comparable to phrase-based translation system on some language pairs, there is still a notable gap between CKY-Hiero and LR-Hiero (Siahbani et al., 2013) ." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-18", "text": "We show in this paper using instructive examples that CKY-Hiero can capture some complex phrasal re-orderings that are observed in language pairs such as Chinese-English that LR-Hiero cannot (c.f. Sec.3)." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-19", "text": "We introduce two improvements to LR decoding of GNF grammars: (1) We add queue diversity to the cube pruning algorithm for LR-Hiero, and (2) We extend the LR-Hiero decoder to capture all the hierarchical phrasal alignments that are reachable in CKY-Hiero (restricted to using GNF grammars)." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-20", "text": "We evaluate our modifications on three language pairs and show that LR-Hiero can reach the translation scores comparable to CKY-Hiero in two language pairs, and reduce the gap between Hiero and LR-Hiero on the third one." 
}, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-21", "text": "----------------------------------" }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-22", "text": "**LR DECODING WITH QUEUE DIVERSITY**" }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-23", "text": "LR-Hiero uses a constrained lexicalized SCFG which we call a GNF grammar: X \u2192 \u03b3,b \u03b2 where \u03b3 is a string of non-terminal and terminal symbols,b is a string of terminal symbols and \u03b2 is a possibly empty sequence of non-terminals." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-24", "text": "This ensures that as each rule is used in a derivation, Add h to hypList 29:" }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-25", "text": "return hypList the target string is generated from left to right." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-26", "text": "The rules are obtained from a word and phrase aligned bitext using the rule extraction algorithm in (Watanabe et al., 2006) ." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-27", "text": "LR-Hiero decoding uses a top-down depth-first search, which strictly grows the hypotheses in target surface ordering." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-28", "text": "Search on the source side follows an Earley-style search (Earley, 1970) , the dot jumps around on the source side of the rules based on the order of nonterminals on the target side." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-29", "text": "This search is integrated with beam search or cube pruning to find the k-best translations." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-30", "text": "Algorithm 1 shows the pseudocode for LRHiero decoding with cube pruning (Chiang, 2007) (CP)." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-31", "text": "LR-Hiero with CP was introduced in (Siahbani et al., 2013) ." 
}, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-32", "text": "In this pseudocode, we have introduced the notion of queue diversity (explained below)." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-33", "text": "However to understand our change we need to understand the algorithm in more detail." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-34", "text": "----------------------------------" }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-35", "text": "**S I**" }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-36", "text": "Figure 1: Cubes (grids) are fed to a priority queue (triangle) and generated hypotheses are iteratively popped from the queue and added to stack Si." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-37", "text": "Lower scores are better." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-38", "text": "Scores of rules and hypotheses appear on the top and left side of the grids respectively." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-39", "text": "Shaded entries are hypotheses in the queue and black ones are popped from the queue and added to Si." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-40", "text": "Each source side non-terminal is instantiated with the legal spans given the input source string, e.g. if there is a Hiero rule aX 1 , a X 1 and if a only occurs at position 3 in the input then this rule can be applied to span [3, i] for all i, 4 < i \u2264 n for input of length n and source side X 1 is instantiated to span [4, i] ." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-41", "text": "A worked out example of how the decoder works is shown in Figure 2 ." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-42", "text": "Each partial hypothesis h is a 4-tuple (h t , h s , h cov , h c ): consisting of a translation prefix h t , a (LIFO-ordered) list h s of uncovered spans, source words coverage set h cov and the hypothesis cost h c ." 
}, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-43", "text": "The initial hypothesis is a null string with just a sentence-initial marker s and the list h s containing a span of the whole sentence, [0, n]." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-44", "text": "The hypotheses are stored in stacks S 0 , . . . , S n , where S p contains hypotheses covering p source words just like in stack decoding for phrase-based SMT (Koehn et al., 2003) ." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-45", "text": "To fill stack S i we consider hypotheses in each stack S p 2 , which are first partitioned into a set of groups {G}, based on their first uncovered span (line 9)." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-46", "text": "Each group g is a 2-tuple (g span , g hyps ), where g hyps is a list of hypotheses which share the same first uncovered span g span ." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-47", "text": "Rules matching the span g span are obtained from routine GetSpanRules." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-48", "text": "Each g hyps and possible R s create a cube which is added to cubeList." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-49", "text": "The Merge routine gets the best hypotheses from all cubes (see Fig.1 )." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-50", "text": "Hypotheses (rows) and columns (rules) are sorted based on their scores." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-51", "text": "GetBestHypotheses((H, R), F, d) uses current hypothesis H and rule R to produce new hypotheses." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-52", "text": "The first best hypothesis, h along with its score h c and corresponding cube (H, R) is placed in a priority queue heapQ (triangle in Figure 1 and line 23 in Algorithm 1)." 
}, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-53", "text": "Iteratively the K best 9) xiang jingji/to the economy s Thailand wants [3, 15] s Thailand wants to utilize [4, 15] s Thailand wants to utilize this money [7, 15] s Thailand wants to utilize this money to inject more [12,15][7,9] s Thailand wants to utilize this money to inject more circulating [13,15][7,9] s Thailand wants to utilize this money to inject more circulating capital [14,15][7,9] s Thailand wants to utilize this money to inject more circulating capital ." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-54", "text": "[7, 9] s Thailand wants to utilize this money to inject more circulating capital ." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-55", "text": "to the economy /s Figure 2 : The process of translating the Chinese sentence in Figure 3 (b) in LR-Hiero." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-77", "text": "In both these cases, CKY-Hiero has no difficulty in reaching the target sentence with the same GNF rules." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-56", "text": "Left side shows the rules used in the derivation (G indicates glue rules as defined in (Watanabe et al., 2006) )." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-57", "text": "The hypotheses column shows the translation prefix and the ordered list of yet-to-be-covered spans." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-58", "text": "hypotheses in the queue are popped (line 26) and for each hypothesis its neighbours in the cube are added to the priority queue (line 27)." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-59", "text": "Decoding finishes when stack S n has been filled." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-60", "text": "The language model (LM) score violates the hypotheses generation assumption of CP and can cause search errors." 
}, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-61", "text": "In Figure 1 , the topmost and leftmost entry of the right cube has a score worse than many hypotheses in the left cube due to the LM score." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-62", "text": "This means the right cube has hypotheses that are ignored." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-63", "text": "This type of search error hurts LR-Hiero more than CKYHiero, due to the fact that hypotheses scores in LR-Hiero rely on a future cost, while CKY-Hiero uses the inside score for each hypothesis." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-64", "text": "To solve this issue for LR-Hiero we introduce the notion of queue diversity which is the parameter d in GetBestHypotheses((H, R), F, d)." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-65", "text": "This parameter guarantees that each cube will produce at least d candidate hypotheses for the priority queue." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-66", "text": "d=1 in standard cube pruning for LR-Hiero (Siahbani et al., 2013) ." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-67", "text": "We apply the idea of diversity at queue level, before generating K best hypothesis, such that the GetBestHypotheses routine generates d best hypotheses from each cube and all these hypotheses are pushed to the priority queue (line 22-23)." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-68", "text": "We fill each stack differently from CKY-Hiero and so queue diversity is different from lazy cube pruning (Pust and Knight, 2009) or cube growing (Huang and Chiang, 2007; Vilar and Ney, 2009; Xu and Koehn, 2012) ." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-69", "text": "Figure 2 where rule #5 is matched to span [7, 15] ." 
}, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-70", "text": "During decoding LR-Hiero maintains a stack (lastin-first-out) of yet-to-be-covered spans and tries to translate the first uncovered span (span [7, 15] in Step 5)." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-71", "text": "LR-Hiero should match rule #5 to span [7, 15] , therefore X 2 is forced to match span [12, 15] which leads to the translation of span [7, 9] (corresponding to X 1 ) being reordered around it In Figure 3 (a) monotonic translations after span [6, 9] are out of reach of the LR-Hiero decoder which has to use the non-terminals to support the reordering within span [6, 9] ." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-72", "text": "In this example the first few phrases are translated monotonically, then for span [6, 18] we have to apply rule muqian X 1 wending, is now in stable X 1 to obtain the correct translation." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-73", "text": "But this rule cannot be matched to span [6, 18] and the decoder fails to generate the correct translation." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-74", "text": "While CKYHiero can apply this rule to span [6, 9] , generate correct translation for this span and monotonically combine it with translation of other spans ([0, 6] , [9, 18] )." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-75", "text": "----------------------------------" }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-76", "text": "**CAPTURING MISSING ALIGNMENTS**" }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-78", "text": "The fact that we have to process spans as they appear in the stack in LR-Hiero means that we cannot combine arbitrary adjacent spans to deal with such cases." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-79", "text": "So purely bottom-up decoders such as CKY-Hiero can capture the alignments in Figure 3 but LR-Hiero cannot." 
}, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-80", "text": "We extend the LR-Hiero decoder to handle such cases by making the GNF grammar more expressive." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-81", "text": "Rules are partitioned to three types based on the right boundary in the source and target side." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-82", "text": "The rhs after the \u21d2 shows the new rules we create within the decoder using a new non-terminal X r to match the right boundary." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-83", "text": "where \u03b3 is a string of terminals and non-terminals, a andb are terminal sequences of source and target respectively, \u03b2 is a possibly empty sequence of non-terminals and X n and X m are different non-terminals distinct from X r 3 ." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-84", "text": "The extra nonterminal X r lets us add a new yet-to-be-covered span to the bottom of the stack at each rule application which lets us match any two adjacent spans just as in CKY-Hiero." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-85", "text": "This captures the missing alignments that could not be previously captured in the LR-Hiero decoder 4 ." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-86", "text": "In Table 4 we translated devset sentences using forced decoding to show that our modifications to LR-Hiero in this section improves the alignment coverage when compared to CKY-Hiero." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-87", "text": "We use a 5-gram LM trained on the Gigaword corpus and use KenLM (Heafield, 2011) ." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-88", "text": "We tune weights by minimizing BLEU loss on the dev set through MERT (Och, 2003) and report BLEU scores on the test set." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-89", "text": "Pop limit for Hiero and LRHiero+CP is 500 and beam size LR-Hiero is 500." 
}, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-90", "text": "Other extraction and decoder settings such as maximum phrase length, etc. were identical across settings." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-91", "text": "To make the results comparable we use the same feature set for all baselines, Hiero as well (including new features proposed by (Siahbani et al., 2013) )." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-92", "text": "We use 3 baselines: (i) our implementation of (Watanabe et al., 2006) : LR-Hiero with beam search (LR-Hiero) and (ii) LR-Hiero with cube pruning (Siahbani et al., 2013) : (LR-Hiero+CP); and (iii) Kriya, an open-source implementation of Hiero in Python, which performs comparably to other open-source Hiero systems (Sankaran et al., 2012) ." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-93", "text": "Table 3 shows model sizes for LR-Hiero (GNF) and Hiero (SCFG)." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-94", "text": "Typical Hiero rule extraction excludes phrase-pairs with unaligned words on boundaries (loose phrases)." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-95", "text": "We use similar rule extraction as Hiero, except that exclude non-GNF rules and include loose phrase-pairs as terminal rules." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-96", "text": "Table 2a shows the translation quality of different systems in terms of BLEU score." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-97", "text": "Row 3 is from (Siahbani et al., 2013) 5 . As we discussed in Section 2, LR-Hiero+CP suffers from severe search errors on Zh-En (1.5 BLEU) but using queue diversity (QD=15) we fill this gap." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-98", "text": "We use the same QD(=15) in next rows for Zh-en." 
}, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-99", "text": "For Cs-En and De-En we use regular cube pruning (QD=1), as it works as well as beam search (compare rows 4 and 2)." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-100", "text": "We measure the benefit of the new modified rules from Section 3: (ab): adding modifications for rules type (a) and (b); (abc): modification of all rules." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-101", "text": "We can see that for all language pairs (ab) constantly improves performance of LRHiero, significantly better than LR-Hiero+CP and LR-Hiero (p-value<0.05) on Cs-En and Zh-En, evaluated by MultEval (Clark et al., 2011) ." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-102", "text": "But modifying rule type (c) does not show any improvement due to spurious ambiguity created by 5 We report results on Cs-En and De-En in (Siahbani et al., 2013) ." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-103", "text": "Row 4 is the same translation system as row 3 (LR-Hiero+CP)." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-104", "text": "We achieve better results than our previous work (Siahbani et al., 2013) type (c) rules." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-105", "text": "Figure 2b shows the results in terms of average number of language model queries on a sample set of 50 sentences from test sets." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-106", "text": "All of the baselines use the same wrapper to KenLM (Heafield, 2011) to query the language model, and we have instrumented the wrapper to count the statistics." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-107", "text": "In (Siahbani et al., 2013) we discuss that LR-Hiero with beam search (Watanabe et al., 2006) does not perform at the same level of state-of-the-art Hiero (more LM calls and less translation quality)." 
}, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-108", "text": "As we can see in this figure, adding new modified rules slightly increases the number of language model queries on Cs-En and De-En so that LR-Hiero+CP still works 2 to 3 times faster than Hiero." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-109", "text": "On Zh-En, LR-Hiero+CP applies queue diversity (QD=15) which reduces search errors and improves translation quality but increases the number of hypothesis generation as well." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-110", "text": "LRHiero+CP with our modifications works substantially faster than LR-Hiero while obtain significantly better translation quality on Zh-En." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-111", "text": "Comparing Table 2a with Figure 2b we can see that overall our modifications to LR-Hiero decoder significantly improves the BLEU scores compared to previous LR decoders for Hiero." }, { "sent_id": "41d56f3f962fa3b0e0563544a6de40-C001-112", "text": "We obtain comparable results to CKY-Hiero for Cs-En and De-En and remarkably improve results on Zh-En, while at the same time making 2 to 3 times less LM calls on Cs-En and De-En compared to CKYHiero." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "41d56f3f962fa3b0e0563544a6de40-C001-2" ], [ "41d56f3f962fa3b0e0563544a6de40-C001-10" ], [ "41d56f3f962fa3b0e0563544a6de40-C001-11" ], [ "41d56f3f962fa3b0e0563544a6de40-C001-107" ] ], "cite_sentences": [ "41d56f3f962fa3b0e0563544a6de40-C001-2", "41d56f3f962fa3b0e0563544a6de40-C001-10", "41d56f3f962fa3b0e0563544a6de40-C001-11", "41d56f3f962fa3b0e0563544a6de40-C001-107" ] }, "@USE@": { "gold_contexts": [ [ "41d56f3f962fa3b0e0563544a6de40-C001-23", "41d56f3f962fa3b0e0563544a6de40-C001-24", "41d56f3f962fa3b0e0563544a6de40-C001-25", "41d56f3f962fa3b0e0563544a6de40-C001-26" ], [ "41d56f3f962fa3b0e0563544a6de40-C001-56" ], [ "41d56f3f962fa3b0e0563544a6de40-C001-92" ] ], "cite_sentences": [ "41d56f3f962fa3b0e0563544a6de40-C001-26", "41d56f3f962fa3b0e0563544a6de40-C001-56", "41d56f3f962fa3b0e0563544a6de40-C001-92" ] } } }, "ABC_fe2f22d3d25358b23d0b75a6edee57_33": { "x": [ { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-14", "text": "(FGET)." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-15", "text": "Table 1 illustrates how this form of world knowledge can be used to improve question generation." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-2", "text": "In this paper, we propose a method for incorporating world knowledge (linked entities and fine-grained entity types) into a neural question generation model." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-3", "text": "This world knowledge helps to encode additional information related to the entities present in the passage required to generate human-like questions." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-4", "text": "We evaluate our models on both SQuAD and MS MARCO to demonstrate the usefulness of the world knowledge features." 
}, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-5", "text": "The proposed world knowledge enriched question generation model is able to outperform the vanilla neural question generation model by 1.37 and 1.59 absolute BLEU 4 score on SQuAD and MS MARCO test dataset respectively." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-6", "text": "----------------------------------" }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-8", "text": "The task of question generation (QG) aims to generate syntactically and semantically sound questions, from a given text, for which a given answer would be a correct response." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-9", "text": "Recently, there has been increased research interest in the QG task due to (1) the wide success of neural network based sequence-to-sequence techniques (Sutskever et al., 2014) for various NLP tasks and (Bahdanau et al., 2014; Srivastava et al., 2015; Xu et al., 2015; Rush et al., 2015; Kumar et al., 2016) , (2) the abundance of large question answering datasets: SQuAD (Rajpurkar et al., 2016) , NewsQA (Trischler et al., 2016) , MS MARCO (Nguyen et al., 2016) )." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-10", "text": "In this paper, we advocate for improving question generation systems using world knowledge, which has not been investigated as of yet." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-11", "text": "We explore world knowledge in the form of entities present in text and exploit the associated entity knowledge to generate human-like questions." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-12", "text": "In our experiments, we use two types of world knowledge: linked entities to the Wikipedia knowledge base and fine grained entity types Table 1 : Sample texts, along with the machine (Q gen ) and human (Q hum ) generated questions for the given answer (Ans)." 
}, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-13", "text": "In the machine generated questions where the corresponding entities could not be resolved are shown in red, the corresponding resolved entities in human generated questions are in blue." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-16", "text": "Here, \"Central Intelligence\" is the name of a movie and 'Dwayne Johnson' and 'Kevin Hart' are actors." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-17", "text": "The world knowledge here is the name of a movie (\"Central Intelligence\"), which helps the model to generate the correct word 'film' instead of the incorrect word 'organization'." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-18", "text": "We adopt the sequence-to-sequence model (Bahdanau et al., 2014) equipped with the copy mechanism (Gulcehre et al., 2016; See et al., 2017) as our base model for question generation." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-19", "text": "The entity linking and fine-grained entity typing information are fed to the network along with the answer of interest." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-20", "text": "We believe this is the first work that explores world-knowledge in the form of linked entities and fine grained entity types as features to improve neural question generation models." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-21", "text": "Recently, works on question generation have drifted towards neural-based approaches." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-22", "text": "These approaches typically involve end-to-end supervised learning to generate questions." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-23", "text": "Du et al. (2017) proposed sequence-to-sequence learning for question generation from text passages." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-24", "text": "Zhou et al. 
(2017) utilized the answer-position, and linguistic features such as named entity recognition (NER) and parts of speech (POS) information to further improve the QG performance as the model is aware that for which answer a question need to be generated." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-25", "text": "In the work of a multi-perspective context matching algorithm is employed." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-26", "text": "Harrison and Walker (2018) use a set of rich linguistic features along with a NQG model." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-27", "text": "(Song et al., 2018) used the matching algorithm proposed by to compute the similarity between the target answer and the passage for collecting relevant contextual information under the different perspective, so that contextual information can be better considered by the encoder." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-28", "text": "More recently, Kim et al. (2018) has claimed to improve the performance of QG model by replacing the target answer in the original passage with a special tokens." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-29", "text": "Other NQG models, include (Zhao et al., 2018; Sun et al., 2018; Gao et al., 2018) which generate questions mainly from SQuAD and MS MARCO dataset." 
}, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-30", "text": "----------------------------------" }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-31", "text": "**PROPOSED APPROACH PROBLEM STATEMENT**" }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-32", "text": "Given a passage P = {w" }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-33", "text": "----------------------------------" }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-34", "text": "**WORLD KNOWLEDGE ENRICH ENCODER**" }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-35", "text": "Our proposed model is based on the sequence-tosequence (Bahdanau et al., 2014) paradigm." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-36", "text": "For the encoder, we utilize a Long Short Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) network." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-37", "text": "In order to capture more contextual information, we use a two-layer bidirectional LSTM (Bi-LSTM)." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-38", "text": "Inspired by the success of using linguistic features in (Zhou et al., 2017; Harrison and Walker, 2018) , we exploit word knowledge in the form of entity linking and fine-grained entity typing in the encoder of the network." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-39", "text": "A Bi-LSTM encoder reads the passage words and their associated world knowledge features (c.f. section 3.1.1, 3.1.2) to produce a sequence of word-and-feature vectors." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-40", "text": "The word vectors, the embedded world knowledge feature vectors and the answer position indicator embedding vectors are concatenated and passed as input to the Bi-LSTM encoder." 
}, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-41", "text": "----------------------------------" }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-42", "text": "**ENTITY LINKING**" }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-43", "text": "In previous works (Zhou et al., 2017; Harrison and Walker, 2018) , named entity type features have been used." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-44", "text": "These features, however, only allow for the encoding of coarse level information such as knowledge of if an entity belongs to a set of predefined categories such as 'PERSON', 'LOCATION' and 'ORGANI-ZATION'." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-45", "text": "To alleviate this, we use the knowledge in the form of linked entities." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-46", "text": "In our experiments, we use Wikipedia as the knowledge base for which to link entities." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-47", "text": "This specific task (also known as Wikification (Cheng and Roth, 2013) ) is the task of identifying concepts and entities in text and disambiguation them into the most specific corresponding Wikipedia pages." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-48", "text": "We followed the approach by Cheng and Roth (2013) for the Wikification." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-49", "text": "The Wikification process is performed on the input passage P having n words {w For multi-word mentions, we assign the same Wikipedia title to each word of the mention." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-50", "text": "In order to project the word and entity in the same vector space, we jointly learn pre-trained word-entity vector embeddings using the method proposed by Yamada et al. (2016) ." 
}, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-51", "text": "----------------------------------" }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-52", "text": "**FINE-GRAINED ENTITY TYPES (FGET)**" }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-53", "text": "Fine-grained entity typing consists of assigning types from a hierarchy to entity mentions in text." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-54", "text": "Similar to the approach in (Xu and Barbosa, 2018) , we build a classification model, to classify the predicted entity mention from the entity linker, discussed in section 3.1.1, into one of the predefined fine-grained entity types (112 entities) (Ling and Weld, 2012) ." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-55", "text": "The inputs to the network are a sub-sequence of passage sentence S = (w 1 , w 2 , . . . , w T ) and the target entity M = (w p , . . . , w t ) (p, t \u2208 [1, T ], p \u2264 t) of length t \u2212 p + 1." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-56", "text": "The sub-sequence S is a context sentence of length T for the given mention M , where M \u2208 S. Using the FGET classification approach discussed in (Xu and Barbosa, 2018) , we obtain the representation R of the passage sentence S. Thereafter, a soft-max layer is employed to obtain the probability distribution over the set of fine-grained entity types Y ." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-57", "text": "Concretely," }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-58", "text": "where weight matrix W is treated as the learned type embedding and b is the bias." 
}, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-59", "text": "Similar to the process we use for the linked entities, we map the passage words to their corresponding fine-grained entity types to get a sequence of FGET E f get = {f" }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-60", "text": "The final embedding of a word at a given time t of the passage P , is computed as:" }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-61", "text": "where,\u00e2 p t\u0175 p t ,\u00ea p t andf p t are the embeddings of the answer position, word, linked entity, and fine-grained entity type of the token t of the passage." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-62", "text": "The final embedding sequence x = (x 1 , x 2 , . . . , x n ), is passed to a Bi-LSTM encoder to produce two sequences of hidden vectors, the forward sequence ( h 1 , h 2 , . . . , h n ) and the backward sequence ( h 1 , h 2 , . . . , h n )." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-63", "text": "Lastly, the output sequence of the encoder is the concatenation of the two sequences," }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-64", "text": "----------------------------------" }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-65", "text": "**DECODING WITH ATTENTION**" }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-66", "text": "We use a two layered LSTM for the decoder." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-67", "text": "Words are generated sequentially conditioned on the encoder output and the previous decoder step." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-68", "text": "Formally, at decoding time step t, the LSTM decoder reads the previous word embedding y t\u22121 and context vector c t\u22121 to compute the new hidden state d t ." 
}, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-69", "text": "The context vector c t at time step t is computed using the attention mechanism in (Luong et al., 2015) , which matches the current decoder state d t with each encoder hidden state h i to get an relevance score." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-70", "text": "A one layer feedforward network takes the decoder state s t and c t and predicts the probability distribution over the decoder vocabulary." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-71", "text": "Similar to (Sun et al., 2018; Song et al., 2018) , we also use the copy mechanism from (Gulcehre et al., 2016) to deal with the rare and unknown words." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-72", "text": "----------------------------------" }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-73", "text": "**EXPERIMENTAL RESULTS AND ANALYSIS**" }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-74", "text": "We evaluated the performance of our approach on SQuAD (Rajpurkar et al., 2016) and MS MARCO v2.1 (Nguyen et al., 2016) ." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-75", "text": "SQuAD is composed of more than 100K questions posed by crowd workers on 536 Wikipedia articles." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-76", "text": "We used the same split as (Zhou et al., 2017) ." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-77", "text": "MS MARCO datasets contains 1 million queries with corresponding answers and passages." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-78", "text": "All questions are sampled from real anonymized user queries and context passages are extracted from real web documents." 
}, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-79", "text": "We picked a subset of MS MARCO data where answers (<= 10 words) are sub-spans within the passages (<= 600 words), and use dev set as test set (7, 849), and split train set with ratio 90%-10% into train (1, 36, 337) and dev (15, 148) sets." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-80", "text": "----------------------------------" }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-81", "text": "**EXPERIMENTAL SETTINGS**" }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-82", "text": "We optimized network hyper-parameters for both dataset via their respective development set." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-83", "text": "The LSTM cell hidden size was 512 for both the dataset." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-84", "text": "We used 500 dimension vector 1 jointly trained for word and Wikipedia entity from (Yamada et al., 2016) for the pretrained word and entity embeddings." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-85", "text": "The dimension of answer tagging and entity type tagging were set to 100." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-86", "text": "The model was optimized via gradient descent using the Adam (Kingma and Ba, 2014) optimiser with learning rate of 0.001 and mini-batch size 64." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-87", "text": "We selected the models with the highest BLEU-4 (Papineni et al., 2002 ) scores as our final models." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-88", "text": "At inference time, we used beam search with a beam size of 12 (also optimized on the dev set) for the SQuAD dataset and greedy search is adopted for MS MARCO dataset as it performed near the best result compared to beam search." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-89", "text": "For, both the datasets, we restrict the target vocabulary to most frequent 20, 000 words." 
}, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-90", "text": "We evaluate the question generation performance in terms of BLEU (Papineni et al., 2002) , METEOR (Banerjee and Lavie, 2005) and and ROUGE-L (Lin, 2004) and using the evaluation package released by Sharma et al. (2017) ." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-91", "text": "(Zhou et al., 2017) 13.29 ----- (Song et al., 2018) 13.91 ----- Table 3 : Performance comparison of the proposed model with the other baselines and state-of-the-art model on the development set of SQuAD dataset" }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-92", "text": "----------------------------------" }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-93", "text": "**EXPERIMENTAL RESULTS**" }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-94", "text": "We conducted series of experiments as follows:" }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-95", "text": "(1) s2s+Att: Baseline encoder-decoder based seq2seq network with attention mechanism." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-96", "text": "(2) NQG: Extension of s2s+Att with answer position feature." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-97", "text": "(3) NQG + EL: Extension of NQG with the entity linking feature (500 dimension) discussed in Section 3.1.1. (4) NQG + EL (pre): NQG + Entity Linking with the pre-trained entity linking feature obtained from the joint training of word and Wikipedia entity using (Yamada et al., 2016) ." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-98", "text": "In order to compare our models with the existing coarse-grained entity features (NER) being used in literature (Zhou et al., 2017; Harrison and Walker, 2018) , we also report the following experiments." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-99", "text": "Table 3 , 4." 
}, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-100", "text": "----------------------------------" }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-101", "text": "**DISCUSSION AND ANALYSIS**" }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-102", "text": "Table 2 clearly demonstrates that the proposed fine-grained word-knowledge features improve the performance of the models over the baseline and the coarse-grained entity (NER) features seem to be not as useful as the entity linking features for both datasets." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-103", "text": "We analyzed the effect of each word-knowledge feature on both the datasets." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-104", "text": "Our findings are as follows:" }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-105", "text": "Passage: though there is no official definition for the northern boundary of southern california , such a division has existed from the time when mexico ruled california , and political disputes raged between the californios of monterey in the upper part and los angeles in the lower part of alta california ." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-106", "text": "Target Answer: mexico Reference: which country used to rule california ?" }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-107", "text": "NQG: who ruled california?" }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-108", "text": "NQG+EL (best)+ FGET (best): which country california ?" }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-109", "text": "Passage: for example , a significant number of students all over wales are educated either wholly or largely through the medium of welsh : in 2008/09 , 22 per cent of classes in maintained primary schools used welsh as the sole or main medium of instruction ." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-110", "text": "Target Answer: welsh Reference: what language is used to educate in wales ?" 
}, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-111", "text": "NQG: a significant number of students over wales are educated through what medium ? NQG+EL (best)+ FGET (best): what language has a significant number of students in wales ?" }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-112", "text": "Table 5 : Samples questions generated by the baseline and proposed model." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-113", "text": "Entity Linking: On both the datasets, the pretrained entity linking features were more effective compared to random initialized features followed by fine-tuning while training." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-114", "text": "We believe this is due to the word and corresponding entity being jointly trained and projected into the same vector space." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-115", "text": "We observe that, entity linking features on SQuAD is less effective as compared to MS MARCO." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-116", "text": "FGET: Similar to the linker based features, the pre-trained FGET features trained on FIGER dataset (Ling and Weld, 2012) are more effective than the randomly initialized vectors." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-117", "text": "The FGET feature is more effective at improving the QG model on SQuAD." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-118", "text": "We believe this is likely because the both SQuAD and FIGER datasets were both derived from the Wikipedia." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-119", "text": "In contrast, MS MARCO was derived from Bing 3 user queries and web passages, which is entirely different in nature." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-120", "text": "It should also be noted that the FGET features were derived using entities detected using the entity linker." 
}, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-121", "text": "In order to evaluate the effect of using the linker as an entity detector we also performed an experiment for which we used entities detected using the NER." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-122", "text": "We found that that the models that use the entities detected with the linker have higher performance in terms of each evaluation metrics on both the datasets." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-123", "text": "A samples of generated questions are given in table 5." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-124", "text": "----------------------------------" }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-125", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-126", "text": "We proposed that features based on general wordknowledge can improve the performance of ques-tion generation." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-127", "text": "Our results on SQuAD and MS MARCO show that entity based world knowledge are effective at improving question generation according to automated metrics." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-128", "text": "In order to fully explore the performance gains of these features, human evaluation is required and we leave this for future work." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-129", "text": "We would also like to explore other sources of world knowledge beyond entity based information." }, { "sent_id": "fe2f22d3d25358b23d0b75a6edee57-C001-130", "text": "In particular, we believe that information based on the relationships between the entities present in the passage would also be useful." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "fe2f22d3d25358b23d0b75a6edee57-C001-24" ] ], "cite_sentences": [ "fe2f22d3d25358b23d0b75a6edee57-C001-24" ] }, "@UNSURE@": { "gold_contexts": [ [ "fe2f22d3d25358b23d0b75a6edee57-C001-38" ], [ "fe2f22d3d25358b23d0b75a6edee57-C001-98" ] ], "cite_sentences": [ "fe2f22d3d25358b23d0b75a6edee57-C001-38", "fe2f22d3d25358b23d0b75a6edee57-C001-98" ] }, "@MOT@": { "gold_contexts": [ [ "fe2f22d3d25358b23d0b75a6edee57-C001-43", "fe2f22d3d25358b23d0b75a6edee57-C001-44" ] ], "cite_sentences": [ "fe2f22d3d25358b23d0b75a6edee57-C001-43" ] }, "@SIM@": { "gold_contexts": [ [ "fe2f22d3d25358b23d0b75a6edee57-C001-76" ] ], "cite_sentences": [ "fe2f22d3d25358b23d0b75a6edee57-C001-76" ] }, "@USE@": { "gold_contexts": [ [ "fe2f22d3d25358b23d0b75a6edee57-C001-76" ] ], "cite_sentences": [ "fe2f22d3d25358b23d0b75a6edee57-C001-76" ] } } }, "ABC_8622616ffd4db96058a9b8aff54212_33": { "x": [ { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-2", "text": "This paper presents experiments on how the performance of automatic keyword extraction can be improved, as measured by keywords previously assigned by professional indexers." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-3", "text": "The keyword extraction algorithm consists of three prediction models that are combined to decide what words or sequences of words in the documents are suitable as keywords." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-4", "text": "The models, in turn, are built using different definitions of what constitutes a term in a written document." 
}, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-5", "text": "----------------------------------" }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-7", "text": "Automatic keyword indexing is the task of finding a small set of terms that describes the content of a specific document." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-8", "text": "If the keywords are chosen from the document at hand, it is referred to as keyword extraction, and this is the approach taken for the work presented in this paper." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-9", "text": "Once a document has a set of keywords, they can be useful for several tasks." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-10", "text": "For example, they can be the entrance to a document collection, similar to a back-ofthe-book index; they can be used to refine a query to a search engine; or they may serve as a dense summary for a specific document." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-11", "text": "In the presented research, the decision of what words or sequences of words in the documents that are suitable as keywords are made by prediction models trained on documents with manually assigned keywords." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-12", "text": "This paper presents a number of modifications to an existing keyword extraction algorithm, as well as results of the empirical verifications." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-13", "text": "----------------------------------" }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-14", "text": "**BACKGROUND**" }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-15", "text": "The approach taken to the keyword extraction task is that of supervised machine learning." 
}, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-16", "text": "This means that a set of documents with known keywords is used to train a model, which in turn is applied to select keywords to and from previously unseen documents." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-17", "text": "The keyword extraction discussed in this paper is based on work presented in Hulth (2003a) and Hulth (2003b) ." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-18", "text": "In Hulth (2003a) an evaluation of three different methods to extract candidate terms from documents is presented." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-19", "text": "The methods are: extracting all uni-, bi, and trigrams that do not begin or end with a stopword." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-20", "text": "extracting all noun phrase (NP) chunks as judged by a partial parser." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-21", "text": "extracting all part-of-speech (PoS) tagged words or sequences of words that match any of a set of empirically defined PoS patterns." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-22", "text": "The best performing models use four attributes." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-23", "text": "These are:" }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-24", "text": "term frequency collection frequency relative position of the first occurrence the POS tag or tags assigned to the term All terms are stemmed using Porter's stemmer (Porter, 1980) , and an automatically selected keyword is considered correct if it is equivalent to a stemmed manually assigned keyword." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-25", "text": "The performance of the classifiers is evaluated by calculating the F-measure for the selected keywords, with equal weight given to the precision and the recall." 
}, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-26", "text": "In Hulth (2003b) , experiments on how the performance of the keyword extraction can be improved by combining the judgement of three classifiers are presented." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-27", "text": "The classifiers differ in how the data are represented, and more specifically in how the candidate terms are selected from the documents." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-28", "text": "By only assigning keywords that are selected by at least two term selection approaches-that is by taking the majority vote-a better performance is achieved." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-29", "text": "In addition, by removing the subsumed keywords (keywords that are substrings of other selected keywords) the performance is yet higher." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-30", "text": "The classifiers are constructed by Rule Discovery System (RDS), a system for rule induction 1 ." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-31", "text": "This means that the models consist of rules." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-32", "text": "The applied strategy is that of recursive partitioning, where the resulting rules are hierarchically organised (i.e., decision trees)." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-33", "text": "The data set on which the models are trained and tested originates from the Inspec database 2 , and consists of abstracts in English from scientific journal papers." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-34", "text": "The set of 2 000 documents is divided into three sets: a training set of 1 000 documents (to train the models), a validation set consisting of 500 documents (to select the best performing model, e.g., for setting the threshold value for the regression runs), and the remaining 500 documents are saved for testing (to get unbiased results)." 
}, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-35", "text": "Each abstract has two sets of keywords-assigned by a professional indexer-associated to them: a set of controlled terms (keywords restricted to the Inspec thesaurus); and a set of uncontrolled terms that can be any suitable terms." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-36", "text": "Both the controlled terms and the uncontrolled terms may or may not be present in the abstracts." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-37", "text": "However, the indexers had access to the full-length documents when assigning the keywords, and not only to the abstracts." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-38", "text": "For the experiments presented in this paper, only the uncontrolled terms are considered, as these to a larger extent are present in the abstracts (76.2% as opposed to 18.1% for the controlled terms)." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-39", "text": "The performance is evaluated using the uncontrolled keywords as the gold standard." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-40", "text": "In the paper, three minor improvements to the keyword extraction algorithm are presented." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-41", "text": "These concern how one of the term selection approaches extract candidate terms; how the collection frequency is calculated; and how the weights are set to the positive examples." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-42", "text": "The major focus of the paper is how the learning task is defined." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-43", "text": "For these experiments, the same machine learning system-RDS-is used as for the experiments presented by Hulth (2003a) ." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-44", "text": "Also the same data are used to train the models and to tune the parameters." 
}, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-45", "text": "The results of the experiments are presented in Tables 1-5 , which show: the average number of keywords assigned per document (Assign.); the average number of correct keywords per document (Corr.); precision (P); recall (R); and F-measure (F)." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-46", "text": "On average, 7.6 manually assigned keywords are present per document." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-47", "text": "The total number of manual keywords present in the abstracts in the test data set is 3 816, and is the number on which the recall is calculated." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-48", "text": "----------------------------------" }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-49", "text": "**REFINEMENTS**" }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-50", "text": "In this section, three minor modifications made to the keyword extraction algorithm are presented." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-51", "text": "The first one concerns how the NP-chunks are extracted from the documents: By removing the initial determiner of the NPchunks, a better performance is achieved." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-52", "text": "The second alteration is to use a general corpus for calculating the collection frequency value." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-53", "text": "Also the weights for the positive examples are set in a more systematic way, to maximise the performance of the combined model." 
}, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-54", "text": "----------------------------------" }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-55", "text": "**REFINING THE NP-CHUNK APPROACH**" }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-56", "text": "It was noted in Hulth (2003b) that when extracting NPchunks, the accompanying determiners are also extracted (per definition), but that determiners are rarely found at the initial position of keywords." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-57", "text": "This means that the automatic evaluation treats such keywords as misclassified, although they might have been correct without the determiner." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-58", "text": "For this reason the determiners a, an, and the are removed when occurring in the beginning of an extracted NP-chunks." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-59", "text": "The results for the runs when extracting NPchunks with and without these determiners are found in Table 1 ." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-60", "text": "As can be seen in this table, the recall increases while the precision decreases." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-61", "text": "However, the high increase in recall leads to an increase in the F-measure from 33.0 to 36.8." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-62", "text": "----------------------------------" }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-63", "text": "**ASSIGN**" }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-64", "text": "----------------------------------" }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-65", "text": "**USING A GENERAL CORPUS**" }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-66", "text": "In the experiments presented in Hulth (2003a) , only the documents present in the training, validation, and test set respectively are used for calculating the collection frequency." 
}, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-67", "text": "This means that the collection is rather homogenous." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-68", "text": "For this reason, the collection frequency is instead calculated on a set of 200 arbitrarily chosen documents from the British National Corpus (BNC)." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-69", "text": "In Table 2 , the performance of two runs when taking the majority vote of the three classifiers removing the subsumed terms is presented." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-70", "text": "The first run ('Abstracts') is when the collection frequency is calculated from the abstracts." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-71", "text": "The second run ('Gen. Corp.') is when the BNC documents are used for this calculation." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-72", "text": "If comparing these two runs, the Fmeasure increases." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-73", "text": "In other words, using a more general corpus for this calculation leads to a better performance of the automatic keyword extraction." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-74", "text": "----------------------------------" }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-75", "text": "**SETTING THE WEIGHTS**" }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-99", "text": "Both regression runs have a higher F-measure than the classification run, due to the fact that recall increases, more than what the precision decreases." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-76", "text": "As the data set is unbalanced-there is a larger number of negative than positive examples-the positive examples are given a higher weight when training the prediction models." 
}, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-77", "text": "In the experiments discussed so far, the weights given to the positive examples are those resulting in the best performance for each individual classifier (as described in Hulth (2003a) )." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-78", "text": "For the results presented further, the weights are instead set according to which individual weight that maximises the F-measure for the combined model on the validation set." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-79", "text": "The weight given to the positive examples for each term selection approach has in a (rather large) number of runs been altered systematically for each classifier, and the combination that results in the best performance is selected." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-80", "text": "The results on the test set are presented in Table 3 ." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-81", "text": "As can be seen in this table, the recall decreases, while the precision and the F-measure increase." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-82", "text": "----------------------------------" }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-83", "text": "**REGRESSION VS. CLASSIFICATION**" }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-84", "text": "In the experiments presented in Hulth (2003a) , the automatic keyword indexing task is treated as a binary classification task, where each candidate term is classified either as a keyword or a non-keyword." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-85", "text": "RDS allows for the prediction to be treated as a regression task (Breiman et al., 1984) ." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-86", "text": "This means that the prediction is given as a numerical value, instead of a category." 
}, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-87", "text": "When training the regression models, the candidate terms being manually assigned keywords are given the value one, and all other candidate terms are assigned the value zero." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-88", "text": "In this fashion, the prediction is a value between zero and one, and the higher the value, the more likely a candidate term is to be a keyword (according to the model)." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-89", "text": "To combine the results from the three models, there are two alternatives." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-90", "text": "Either the prediction value can be added for all candidate terms, or it can be added only if it is over a certain threshold set for each model, depending on the model's individual performance." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-91", "text": "Regardless, a candidate term may be selected as a keyword even if it is extracted by only one method, provided that the value is high enough." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-92", "text": "The threshold values are defined based on the performance of the models on the validation data." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-93", "text": "In Table 4 , results for two regression runs on the test data are presented." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-94", "text": "These two runs are in Table 4 compared to the best performing classification run." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-95", "text": "The first regression run ('Regression') is when all candidate terms having an added value over a certain threshold are selected." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-96", "text": "The second presented regression run (Regression with individual threshold: 'Reg. ind." 
}, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-97", "text": "thresh.') is when a threshold is set for each individual model: If a prediction value is below this threshold it does not contribute to the added value for a candidate term." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-98", "text": "In this case, the threshold for the total score is slightly lower than when no individual threshold is set." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-100", "text": "The run without individual thresholds results in the highest F-measure." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-101", "text": "Table 4 : Using classification and regression." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-102", "text": "'Reg. ind." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-103", "text": "thesh.' refers to a run where the regression value from each model contributes only if it is over a certain threshold." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-104", "text": "----------------------------------" }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-105", "text": "**ASSIGN**" }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-106", "text": "----------------------------------" }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-107", "text": "**DEFINING THE NUMBER OF KEYWORDS**" }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-108", "text": "If closer inspecting the best regression run, this combined model assigns on average 10.8 keywords per document." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-109", "text": "The actual distribution varies from 3 documents with 0 to 1 document with 32 keywords." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-110", "text": "As mentioned, the prediction value from a regression model is numeric, and indicates how likely a candidate term is to be a keyword." 
}, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-111", "text": "It is thus possible to rank the output, and consequently to limit the number of keywords assigned per document." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-112", "text": "A set of experiments has been performed with the aim to find what number of keywords per document that results in the highest F-measure, by varying the number of keywords assigned." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-113", "text": "In these experiments, only terms with an added value over the threshold are considered, and the candidate terms with the highest values are selected first." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-114", "text": "The best performance is when the maximum of twelve keywords is selected for each document." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-115", "text": "(The subsumed terms are removed after that the maximum number of keywords is selected.) As can be seen in Table 5 ('All' compared to 'Max. 12'), the F-measure decreases as does the recall, although the precision increases, when limiting the number of keywords." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-116", "text": "There are, however, still some documents that do not get any selected keywords." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-117", "text": "To overcome this, three terms are assigned to each document even if the added regression value is below the threshold." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-118", "text": "Doing this gives a slightly lower precision, while the recall increases slightly." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-119", "text": "The F-measure is unaffected (see Table 5 : 3-12)." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-120", "text": "Table 5 : Assigning all terms over the threshold (All), and limiting the number of terms assigned per document (Max. 12, and 3-12 respectively)." 
}, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-121", "text": "----------------------------------" }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-122", "text": "**ASSIGN**" }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-123", "text": "----------------------------------" }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-124", "text": "**CONCLUDING REMARKS**" }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-125", "text": "In this paper, a number of experiments leading to a better performance of a keyword extraction algorithm has been presented." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-126", "text": "One improvement concerns how the NPchunks are extracted, where the results are improved by excluding the initial determiners a, an, and the." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-127", "text": "Possibly, this improvement could be yet higher if all initial determiners were removed from the NP." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-128", "text": "Another improvement concerns how the collection frequency is calculated, where the F-measure of the extraction increases when a general corpus is used." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-129", "text": "A third improvement concerns how the weights to the positive examples are set." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-130", "text": "By adjusting the weights to maximise the performance of the combined model, the F-measure increases." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-131", "text": "Also, one major change is made to the algorithm, as the learning task is redefined." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-132", "text": "This is done by using regression instead of classification for the machine learning." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-133", "text": "Apart from an increase in performance by regression, this enables a ranked output of the keywords." 
}, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-134", "text": "This in turn makes it easy to vary the number of keywords selected per document, in case necessary for some types of applications." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-135", "text": "In addition, compared to classification, regression resembles reality in the sense that some words are definitely keywords, some are definitely not, but there are also many candidate terms that are keywords to a certain extent." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-136", "text": "Thus, there is a continuum of the candidate terms' \"keywordness\"." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-137", "text": "Evaluating automatically extracted keywords is not trivial, as different persons may prefer different terms at different occasions." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-138", "text": "This is also true for professional indexers, where the consistency also depends on how experienced an indexer is." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-139", "text": "For example, Bureau van Dijk (1995) has shown that the index consistency between experienced indexers may be up to 60-80 per cent, while it is not unusual that it is as low as 20-30 between inexperienced indexers." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-140", "text": "The approach taken to the evaluation of the experiments presented in this paper is that of using keywords previously assigned by professional indexers as a gold standard for calculating the precision, the recall, and the F-measure." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-141", "text": "If looking at the inter-judgement agreement between the keywords selected by the combined model assigning no more than twelve keywords per document and the manually assigned keywords for the documents in the test set, it is 28.2%." 
}, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-142", "text": "Thus the performance of the keyword extraction algorithm is at least as consistent as that of inexperienced professional indexers." }, { "sent_id": "8622616ffd4db96058a9b8aff54212-C001-143", "text": "This is, however, only true to a certain extent, as some of the keywords selected by the automatic extractor would never have been considered by a human-not even a nonprofessional 3 ." } ], "y": { "@USE@": { "gold_contexts": [ [ "8622616ffd4db96058a9b8aff54212-C001-17" ], [ "8622616ffd4db96058a9b8aff54212-C001-43" ] ], "cite_sentences": [ "8622616ffd4db96058a9b8aff54212-C001-17", "8622616ffd4db96058a9b8aff54212-C001-43" ] }, "@BACK@": { "gold_contexts": [ [ "8622616ffd4db96058a9b8aff54212-C001-18" ], [ "8622616ffd4db96058a9b8aff54212-C001-66" ], [ "8622616ffd4db96058a9b8aff54212-C001-84" ] ], "cite_sentences": [ "8622616ffd4db96058a9b8aff54212-C001-18", "8622616ffd4db96058a9b8aff54212-C001-66", "8622616ffd4db96058a9b8aff54212-C001-84" ] }, "@SIM@": { "gold_contexts": [ [ "8622616ffd4db96058a9b8aff54212-C001-43" ] ], "cite_sentences": [ "8622616ffd4db96058a9b8aff54212-C001-43" ] }, "@UNSURE@": { "gold_contexts": [ [ "8622616ffd4db96058a9b8aff54212-C001-77" ] ], "cite_sentences": [ "8622616ffd4db96058a9b8aff54212-C001-77" ] } } }, "ABC_a8743bb89abd16f75bec9e72e446b3_33": { "x": [ { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-117", "text": "----------------------------------" }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-118", "text": "**CONCLUSION**" }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-2", "text": "Syllabification is sometimes influenced by morphological boundaries." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-3", "text": "We show that incorporating morphological information can improve the accuracy of orthographic syllabification in English and German." 
}, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-4", "text": "Surprisingly, unsupervised segmenters, such as Morfessor, can be more useful for this purpose than the supervised ones." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-5", "text": "----------------------------------" }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-7", "text": "Syllabification is the process of dividing a word into syllables." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-8", "text": "Although in the strict linguistic sense syllables are composed of phonemes rather than letters, due to practical considerations we focus here on orthographic syllabification, which is also referred to as hyphenation." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-9", "text": "Some dictionaries include hyphenation information to indicate where words may be broken for end-of-line divisions, and to assist the reader in recovering the correct pronunciation." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-10", "text": "In many languages the orthographic and phonological representations of a word are closely related." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-11", "text": "Orthographic syllabification has a number of computational applications." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-12", "text": "Incorporation of the syllable boundaries between letters benefits grapheme-to-phoneme conversion (Damper et al., 2005) , and respelling generation (Hauer and Kondrak, 2013) ." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-13", "text": "Hyphenation of out-of-dictionary words is also important in text processing (Trogkanis and Elkan, 2010) ." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-14", "text": "Because of the productive nature of language, a dictionary look-up process for syllabification is inadequate." 
}, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-15", "text": "Rule-based systems are generally outperformed on out-ofdictionary words by data-driven methods, such as those of Daelemans et al. (1997) , Demberg (2006) , Marchand and Damper (2007) , and Trogkanis and Elkan (2010) ." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-16", "text": "Morphological segmentation is the task of dividing words into morphemes, the smallest meaning-bearing units in the word (Goldsmith, 2001) ." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-17", "text": "For example the morpheme over occurs in words like hold+over, lay+over, and skip+over." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-18", "text": "1 Roots combine with derivational (e.g. refut+able) and inflectional affixes (e.g. hold+ing)." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-19", "text": "Computational segmentation approaches can be divided into rule-based (Porter, 1980) , supervised (Ruokolainen et al., 2013) , semi-supervised (Gr\u00f6nroos et al., 2014) , and unsupervised (Creutz and Lagus, 2002) ." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-20", "text": "Bartlett et al. (2008) observe that some of the errors made by their otherwise highly-accurate system, such as hol-dov-er and coad-ju-tors, can be attributed to the lack of awareness of morphological boundaries, which influence syllabification." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-21", "text": "In this paper, we demonstrate that the accuracy of orthographic syllabification can be improved by considering morphology." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-22", "text": "We augment the syllabification approach of Bartlett et al. (2008) , with features encoding morphological segmentation of words." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-23", "text": "We investigate the degree of overlap between the morphological and syllable boundaries." 
}, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-24", "text": "The results of our experiments on English and German show that the incorporation of expert-annotated (gold) morphological boundaries extracted from lexical databases substantially reduces the syllabification error rate, particularly in low-resource settings." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-25", "text": "We find that the accuracy gains tend to be preserved when unsupervised segmentation is used instead." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-26", "text": "On the other hand, relying on a fully-supervised system appears to be much less robust, even though it generates more accurate morphological segmentations than the unsupervised systems." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-27", "text": "We propose an explanation for this surprising result." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-28", "text": "----------------------------------" }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-29", "text": "**METHODS**" }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-30", "text": "In this section, we describe the original syllabification method of Bartlett et al. (2008) , which serves as our baseline system, and discuss various approaches to incorporating morphological information." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-31", "text": "Bartlett et al. (2008) present a discriminative approach to automatic syllabification." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-32", "text": "They formulate syllabification as a tagging problem, and learn a Structured SVM tagger from labeled data (Tsochantaridis et al., 2005) ." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-33", "text": "Under the Markov assumption that each tag is dependent on its previous n tags, the tagger predicts the optimal tag sequence (Altun et al., 2003) ." 
}, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-34", "text": "A large-margin training objective is applied to learn a weight vector to separate the correct tag sequence from other possible sequences for each training instance." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-35", "text": "The test instances are tagged using the Viterbi decoding algorithm on the basis of the weighted features." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-36", "text": "----------------------------------" }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-37", "text": "**BASE SYSTEM**" }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-38", "text": "Each training instance is represented as a sequence of feature vectors, with the tags following the \"Numbered NB\" tagging scheme, which was found to produce the best results." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-39", "text": "In the scheme, the B tags signal that a boundary occurs after the current character, while the N tags indicate the distance from the previous boundary." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-40", "text": "For example, the word syl-lab-i-fy is annotated as: N1 N2 B N1 N2 B B N1 N2." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-41", "text": "The feature vectors consist of all n-grams around the current focus character, up to size 5." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-42", "text": "These n-grams are composed of context letters, and word-boundary markers that are added at the beginning and end of each word." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-43", "text": "----------------------------------" }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-44", "text": "**MORPHOLOGICAL INFORMATION**" }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-45", "text": "We incorporate available morphological information by adding morpheme boundary markers into the input words." 
}, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-46", "text": "The extracted features belong to two categories: orthographic and morphological." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-47", "text": "The orthographic features are identical to the ones described in Section 2.1." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-48", "text": "The morphological features are also contextual n-grams, but may contain morphological breaks, which can potentially help identify the correct syllabification of words." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-49", "text": "Manually-annotated morphological lexicons sometimes distinguish between inflectional, derivational, and compound boundaries." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-50", "text": "We can pass this information to the syllabification system by marking the respective boundaries with different symbols." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-51", "text": "Since morphologically annotated lexicons are expensive to create, and available only for wellstudied languages, we investigate the idea of replacing them with annotations generated by fully-supervised, distantly-supervised, and unsupervised segmentation algorithms." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-52", "text": "----------------------------------" }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-53", "text": "**FULLY-SUPERVISED**" }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-54", "text": "While supervised methods typically require large amounts of annotated training data, they can perform segmentation of unseen (out-of-dictionary) words." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-55", "text": "As our fully-supervised segmenter, we use the discriminative string transducer of Jiampojamarn et al. (2010) ." 
}, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-56", "text": "The transducer is trained on aligned source-target pairs, one pair per word; the target is identical to the source except that it includes characters that represent morphological breaks." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-57", "text": "Using source and target context, the transducer learns to insert these breaks into words." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-58", "text": "----------------------------------" }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-59", "text": "**DISTANTLY-SUPERVISED**" }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-60", "text": "Whereas morphologically-annotated lexicons are rare, websites such as Wiktionary contain crowdgenerated inflection tables for many languages." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-61", "text": "A distantly-supervised segmenter can be trained on semi-structured inflection tables to divide words into stems and affixes without explicit segmentation annotation." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-62", "text": "We adopt the approach of Nicolai and Kondrak (2016), which combines unsupervised alignment with a discriminative string transduction algorithm, An important limitation of this approach is that it can only identify inflectional morpheme boundaries." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-63", "text": "----------------------------------" }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-64", "text": "**UNSUPERVISED**" }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-65", "text": "Unsupervised methods have the advantage of requiring no training data." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-66", "text": "We investigate the applicability of two unsupervised segmenters: Morfessor (Creutz and Lagus, 2005) and Morpheme++ (Dasgupta and Ng, 2007) ." 
}, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-67", "text": "Morfessor uses the minimum description length (MDL) principle to predict a word as a likely sequence of morphemes." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-68", "text": "Since the baseline version of Morfessor tends to over-segment rare words, we instead apply Morfessor FlatCat (Gr\u00f6nroos et al., 2014) , which reduces over-segmentation through the use of a hidden Markov model." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-69", "text": "Morpheme++ is another system that is capable of distinguishing between prefixes, suffixes, and stems by taking advantage of the regularity of affixes." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-70", "text": "----------------------------------" }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-71", "text": "**EXPERIMENTS**" }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-72", "text": "In this section, we introduce our data sets, and discuss the overlap between morphological and syllabic boundaries." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-73", "text": "We investigate the quality of the morphological segmentations of produced by various methods, and replicate the syllabification results of Bartlett et al. (2008) ." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-74", "text": "Finally, we discuss the results of incorporating morphological information into the syllabification system." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-75", "text": "----------------------------------" }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-76", "text": "**DATA**" }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-77", "text": "Our data comes from the English and German sections of the CELEX lexical database (Baayen et al., 1995) ." 
}, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-78", "text": "The English and German training sets contain 43,212 and 41,382 instances, with corresponding development sets of 8,735 and 5,173 instances, and test sets of 8,608 and 5,173 instances." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-79", "text": "The distantly-supervised and fully-supervised segmenters were trained on the union of the training and development sets, while the unsupervised segmenters were applied to the union of the training, development and test sets." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-80", "text": "The distantlysupervised system had no access to the gold morphological segmentations." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-81", "text": "The annotation in CELEX distinguishes between inflectional vs. derivational affixes, as well as derivational vs. compound breaks." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-82", "text": "The latter distinction did not help in our development experiments, so we disregard it." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-83", "text": "We refer to the two subsets of the morpheme boundary annotations as \"Gold Inflectional\" and \"Gold Derivational\"." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-84", "text": "Table 2 : Overlap between syllabic and morphological boundaries on the test set." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-85", "text": "Table 2 shows the percentage of the predicted morphological breaks that match gold syllable boundaries." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-86", "text": "We observe that the inflectional boundaries are far less likely than the derivational ones to correspond to syllable breaks." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-87", "text": "We also note that on German the unsupervised segmenters exhibit much higher syllabification overlap than the gold annotation." 
}, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-88", "text": "We attribute this to the tendency of the unsupervised methods to oversegment." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-89", "text": "----------------------------------" }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-90", "text": "**QUALITY OF MORPHOLOGICAL SEGMENTATION**" }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-91", "text": "----------------------------------" }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-92", "text": "**BASELINE SYLLABIFICATION**" }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-93", "text": "As a baseline, we replicate the experiments of Bartlett et al. (2008) , and extend them to lowresource settings." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-94", "text": "Since the training sets are of slightly different sizes, we label each training size point as specified in Table 3 ." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-95", "text": "We see that correct syllabification of approximately half of the words is achieved with as few as 100 English and 50 German training examples." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-96", "text": "----------------------------------" }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-97", "text": "**MORPHOLOGICALLY-INFORMED SYLLABIFICATION**" }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-98", "text": "Our main set of experiments concerns the incorporation of the morphological information obtained from methods described in Section 2.2 into the baseline syllabification system." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-99", "text": "As seen in Table 3, the accuracy of the baseline syllabification system trained on a large number of instances is already very high, so the gains introduced by mor- For the sake of clarity, we omit some of the methods from the graphs." 
}, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-100", "text": "The unsupervised methods are represented by Morfessor FlatCat." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-101", "text": "The distantly-supervised system is generally successful at predicting the inflectional boundaries, but fails to improve on the baseline, as they are less important for syllabification than the derivational boundaries." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-102", "text": "----------------------------------" }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-103", "text": "**DISCUSSION**" }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-104", "text": "Overall, the results confirm that morphology can help syllabification." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-105", "text": "The incorporation of gold segmentation boundaries consistently leads to the reduction of the syllabification error rate; the only exception occurs on the full English training set." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-106", "text": "While the fully-supervised system provides a benefit at lower training thresholds, it actually hurts the accuracy at larger training sizes." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-107", "text": "Notably, unsupervised segmentation appears to outperform fully-supervised segmentation as the amount of the training data increases; the corresponding error rate reduction approaches 25% on German." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-108", "text": "One explanation for the strong performance of the unsupervised systems is their high accuracy on compound words." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-109", "text": "Consider the German compound Toppflagge \"masthead flag\"." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-110", "text": "An unsupervised system is able to guess that the word is composed of the words Topp and Flagge that exist in the lexicon on their own." 
}, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-111", "text": "To produce the same segmentation, the fully-supervised system must be trained on a number of compound words that include either topp or flagge." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-112", "text": "Since compound boundaries are almost always syllable breaks as well, they have a strong effect on syllabification." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-113", "text": "Sometimes even a linguistically incorrect segmentation proposed by an unsupervised segmenter may work better for the purposes of syllabification." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-114", "text": "Many words of Latin origin contain affixes that are no longer productive in English." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-115", "text": "Thus, an unsupervised system over-segments the word ob+literate, which allows it to produce the correct syllabification ob-lit-er-ate, as opposed to oblit-er-ate predicted by the gold-informed system." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-116", "text": "This phenomenon appears to be particularly frequent in German." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-119", "text": "We have demonstrated that morphological information can improve the accuracy of orthographic syllabification." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-120", "text": "We have found that unsupervised segmentation methods often perform better than supervised methods, and can rival gold human annotation." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-121", "text": "We have proposed two explanations for this counter-intuitive phenomenon." }, { "sent_id": "a8743bb89abd16f75bec9e72e446b3-C001-122", "text": "We hope that this work will contribute a computational perspective on the issue of interaction between syllabification and morphology." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "a8743bb89abd16f75bec9e72e446b3-C001-20" ], [ "a8743bb89abd16f75bec9e72e446b3-C001-31" ] ], "cite_sentences": [ "a8743bb89abd16f75bec9e72e446b3-C001-20", "a8743bb89abd16f75bec9e72e446b3-C001-31" ] }, "@EXT@": { "gold_contexts": [ [ "a8743bb89abd16f75bec9e72e446b3-C001-22" ], [ "a8743bb89abd16f75bec9e72e446b3-C001-93" ] ], "cite_sentences": [ "a8743bb89abd16f75bec9e72e446b3-C001-22", "a8743bb89abd16f75bec9e72e446b3-C001-93" ] }, "@USE@": { "gold_contexts": [ [ "a8743bb89abd16f75bec9e72e446b3-C001-30" ], [ "a8743bb89abd16f75bec9e72e446b3-C001-93" ] ], "cite_sentences": [ "a8743bb89abd16f75bec9e72e446b3-C001-30", "a8743bb89abd16f75bec9e72e446b3-C001-93" ] }, "@UNSURE@": { "gold_contexts": [ [ "a8743bb89abd16f75bec9e72e446b3-C001-73" ] ], "cite_sentences": [ "a8743bb89abd16f75bec9e72e446b3-C001-73" ] } } }, "ABC_4be5a47b5fd900c3578330b352b24c_33": { "x": [ { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-2", "text": "We present a supervised sentiment detection system that classifies the polarity of subjective phrases as positive, negative, or neutral." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-3", "text": "It is tailored towards online genres, specifically Twitter, through the inclusion of dictionaries developed to capture vocabulary used in online conversations (e.g., slang and emoticons) as well as stylistic features common to social media." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-4", "text": "We show how to incorporate these new features within a state of the art system and evaluate it on subtask A in SemEval-2013 Task 2: Sentiment Analysis in Twitter." 
}, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-5", "text": "----------------------------------" }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-7", "text": "People use social media to write openly about their personal experiences, likes and dislikes." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-8", "text": "The following sentence from Twitter is a typical example: \"Tomorrow I'm coming back from Barcelona...I don't want! :(((\"." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-9", "text": "The ability to detect the sentiment expressed in social media can be useful for understanding what people think about the restaurants they visit, the political viewpoints of the day, and the products they buy." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-10", "text": "These sentiments can be used to provided targeted advertising, automatically generate reviews, and make various predictions, such as political outcomes." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-11", "text": "In this paper we develop a sentiment detection algorithm for social media that classifies the polarity of sentence phrases as positive, negative, or neutral and test its performance in Twitter through the participation in the expression level task (subtask A) of the SemEval-2013 Task 2: Sentiment Analysis in Twitter (Wilson et al., 2013) which the authors helped organize." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-12", "text": "To do so, we build on previous work on sentiment detection algorithms for the more formal news genre, notably the work of Agarwal et al (2009) , but adapt it for the language of social media, in particular Twitter." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-13", "text": "We show that exploiting lexical-stylistic features and dictionaries geared toward social media are useful in detecting sentiment." 
}, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-14", "text": "In this rest of this paper, we discuss related work, including the state of the art sentiment system (Agarwal et al., 2009) our method is based on, the lexicons we used, our method, and experiments and results." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-15", "text": "----------------------------------" }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-16", "text": "**RELATED WORK**" }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-17", "text": "Several recent papers have explored sentiment analysis in Twitter." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-18", "text": "Go et al (2009) and Pak and Paroubek (2010) classify the sentiment of tweets containing emoticons using n-grams and POS." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-19", "text": "Barbosa and Feng (2010) detect sentiment using a polarity dictionary that includes web vocabulary and tweet-specific social media features." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-20", "text": "Bermingham and Smeaton (2010) compare polarity detection in twitter to blogs and movie reviews using lexical features." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-21", "text": "Agarwal et al (2011) perform polarity sentiment detection on the entire tweet using features that are somewhat similar to ours: the DAL, lexical features (e.g. POS and n-grams), social media features (e.g. slang and hashtags) and tree kernel features." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-22", "text": "In contrast to this related work, our approach is geared towards predicting sentiment is at the phrase level as opposed to the tweet level." 
}, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-23", "text": "----------------------------------" }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-24", "text": "**LEXICONS**" }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-25", "text": "Several lexicons are used in our system." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-26", "text": "We use the DAL and expand it with WordNet, as it was used in the original work (Agarwal et al., 2009) , and expand it further to use Wiktionary and an emoticon lexicon." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-27", "text": "We consider proper nouns that are not in the DAL to be objective." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-28", "text": "We also shorten words that are lengthened to see if we can find the shortened version in the lexicons (e.g. sweeeet \u2192 sweet)." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-29", "text": "The coverage of the lexicons for each corpus is shown in Table 1 ." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-30", "text": "----------------------------------" }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-31", "text": "**DAL**" }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-32", "text": "The Dictionary of Affect and Language (DAL) (Whissel, 1989 ) is an English language dictionary of 8742 words built to measure the emotional meaning of texts." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-33", "text": "In addition to using newswire, it was also built from individual sources such as interviews on abuse, students' retelling of a story, and adolescent's descriptions of emotions." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-34", "text": "It therefore covers a broad set of words." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-35", "text": "Each word is given three scores (pleasantness -also called evaluation (ee), activeness (aa), and imagery (ii)) on a scale of 1 (low) to 3 (high)." 
}, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-36", "text": "We compute the polarity of a chunk in the same manner as the original work (Agarwal et al., 2009) , using the sum of the AE Space Score's (| \u221a ee 2 + aa 2 |) of each word within the chunk." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-37", "text": "----------------------------------" }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-38", "text": "**WORDNET**" }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-39", "text": "The DAL does cover a broad set of words, but we will still often encounter words that are not included in the dictionary." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-40", "text": "Any word that is not in the DAL and is not a proper noun is accessed in WordNet (Fellbaum, 1998) 1 and, if it exists, the DAL scores of the synonyms of its first sense are used in its place." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-41", "text": "In addition to the original approach, if there are no synonyms we look at the hypernym." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-42", "text": "We then compute the average scores (ee, aa, and ii) of all the words and use that as the score for the word." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-43", "text": "1 We cannot use SentiWordNet because we are interested in the DAL scores" }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-44", "text": "----------------------------------" }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-45", "text": "**WIKTIONARY**" }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-46", "text": "We use Wiktionary, an online dictionary, to supplement the common words that are not found in WordNet and the DAL." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-47", "text": "We first examine all \"form of\" relationships for the word such as \"doesnt\" is a \"misspelling of\" \"doesn't\", and 'tonite\" is an \"alternate form of\" \"tonight\"." 
}, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-48", "text": "If no \"form of\" relationships exist, we take all the words in the definitions that have their own Wiktionary page and look up the scores for each word in the DAL." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-49", "text": "(e.g., the verb definition for LOL (laugh out loud) in Wiktionary is \"To laugh out loud\" with \"laugh\" having its own Wiktionary definition; it is therefore looked up in the DAL and the score for \"laugh\" is used for \"LOL\".) We then compute the average scores (ee, aa, and ii) of all the words and use that as the score for the word." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-50", "text": "We created a simple lexicon to map common emoticons to a definition in the DAL." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-51", "text": "We looked at over 1000 emoticons gathered from several lists on the internet 2 and computed their frequencies within a LiveJournal blog corpus." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-52", "text": "(In the future we would like to use an external Twitter corpus)." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-53", "text": "We kept the 192 emoticons that appeared at least once and mapped each emoticon to a single word definition." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-54", "text": "The top 5 emoticons and their definitions are shown in Table 2 ." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-55", "text": "When an emoticon is found in a tweet we look up its definition in the DAL." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-56", "text": "----------------------------------" }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-57", "text": "**METHODS**" }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-58", "text": "We run our data through several pre-processing steps to preserve emoticons and expand contractions." 
}, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-59", "text": "We then pre-process the sentences to add Part-of-Speech tags (POS) and chunk the sentences using the CRF tagger and chunker (Phan, 2006a; Phan, 2006b )." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-60", "text": "The chunker uses three labels, 'B' (beginning), 'I' (in), and 'O' (out)." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-61", "text": "The 'O' label tends to be applied to punctuation which one typically wants to ignore." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-62", "text": "However, in this context, punctation can be very important (e.g. exclamation points, and emoticons)." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-63", "text": "Therefore, we append words/phrases tagged as O to the prior B-I chunk." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-64", "text": "We apply the dictionaries to the preprocessed sentences to generate lexical, syntactic, and stylistic features." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-65", "text": "All sets of features were reduced using chisquare in Weka (Hall et al., 2009 )." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-66", "text": "----------------------------------" }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-67", "text": "**LEXICAL AND SYNTACTIC FEATURES**" }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-68", "text": "We include POS tags and the top 500 n-gram features (Agarwal et al., 2009) ." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-69", "text": "We experimented with different amounts of n-grams and found that more than 500 n-grams reduced performance." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-70", "text": "The DAL and other dictionaries are used along with a negation state machine (Agarwal et al., 2009) to determine the polarity for each word in the sentence." 
}, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-71", "text": "We include all the features described in the original system (Agarwal et al., 2009 )." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-72", "text": "----------------------------------" }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-73", "text": "**LEXICAL-STYLISTIC FEATURES**" }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-74", "text": "We include several lexical-stylistic features (see Table 3 ) that can occur in all datasets." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-75", "text": "We divide these features into two groups, general: ones that are common across online and traditional genres, and social media: one that are far more common in online genres." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-76", "text": "Examples of general style features are exclamation points and ellipses." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-77", "text": "Examples of social media style features are emoticons and word lengthening." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-78", "text": "Word lengthening is a common phenomenon in social media where letters are repeated to indicate emphasis (e.g. sweeeet)." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-79", "text": "It is particularly common in opinionated words (Brody and Diakopoulos, 2011) ." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-80", "text": "The count values of each feature was normalized by the number of words in the phrase." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-81", "text": "The percentage of lexical-stylistic features that are positive/negative/neutral is shown in Figure 1 ." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-82", "text": "For example, emoticons tend to indicate a positive phrase in Twitter." 
}, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-83", "text": "Each stylistic feature accounts for less than 2% of the sentence but at least one of the stylistic features exists in 61% of the Tweets." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-84", "text": "We also computed the most frequent emoticons (<3, :D), acronyms (lol), and punctuation symbols (#) within a subset of the Twitter training set and included those as additional features." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-85", "text": "----------------------------------" }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-86", "text": "**EXPERIMENTS AND RESULTS**" }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-87", "text": "This task was evaluated on the Twitter dataset provided by Semeval-2013 Task 2, subtask A, which the authors helped organize." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-88", "text": "Therefore, a large portion of time was spent on creating the dataset." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-89", "text": "We ran all of our experiments in Weka (Hall et al., 2009 ) using Logistic Regression." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-90", "text": "We also experimented with other learning methods but found that this worked best." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-91", "text": "All results are shown using the average F-measure of the positive and negative class." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-92", "text": "We tuned our system for Semeval-2013 Task 2, subtask A, using the provided development set and ran it on the provided Twitter and SMS test data." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-93", "text": "Our results are shown in Table 4 with all results being statistically significant over a majority baseline." 
}, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-94", "text": "We also use the DAL as a baseline to indicate how useful lexical-stylistic features (specifically those geared towards social media) and the dictionaries are in improving the performance of sentiment detection of phrases in online genres in contrast to using just the DAL." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-95", "text": "The results that are statistically significant (computed using the Wilcoxon's test, p \u2264 .02) shown in bold." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-96", "text": "Our best results for each dataset include all features with an average Fmeasure of 77.6% and 73.3% for the Twitter and SMS test sets respectively resulting in a significant improvement of more than 5% for each test set over the DAL baseline." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-97", "text": "At the time of submission, we had not experimented with n-grams, and therefore chose the Dictionaries+Style system as our final version for the official run resulting in a rank of 12/22 (75% Fmeasure) for Twitter and 13/19 (70.2% F-measure) for SMS." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-98", "text": "Our rank with the best system, which includes n-grams, would remain the same for Twitter, but bring our rank up to 10/19 for SMS." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-99", "text": "We looked more closely at the impact of our new features and as one would expect, feature selection found the general and social media style features (e.g. emoticons, :(, lol, word lengthening) to be useful in Twitter and SMS data." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-100", "text": "Using additional online dictionaries is useful in Twitter and SMS, which is understandable because they both have poor coverage in the DAL and WordNet." 
}, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-101", "text": "In all cases using n-grams was the most useful which indicates that context is most important." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-102", "text": "Using Dictionaries and Style in addition to n-grams did provide a significant improvement in the Twitter test set, but not in the Twitter Dev and SMS test set." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-103", "text": "----------------------------------" }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-104", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-105", "text": "We have explored whether social media features, Wiktionary, and emoticon dictionaries positively impact the accuracy of polarity detection in Twitter and other online genres." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-106", "text": "We found that social media related features can be used to predict sentiment in Twitter and SMS." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-107", "text": "In addition, Wiktionary helps improve the word coverage and though it does not provide a significant improvement over WordNet, it can be used in place of WordNet." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-108", "text": "On the other hand, we found that using the DAL and n-grams alone does almost as well as the best system." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-109", "text": "This is encouraging as it indicates that content is important and domain independent sentiment systems can do a good job of predicting sentiment in social media." }, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-110", "text": "The results of the SMS messages dataset indicate that even though the online genres are different, the training data in one online genre can indeed be used to predict results with reasonable accuracy in the other online genre." 
}, { "sent_id": "4be5a47b5fd900c3578330b352b24c-C001-111", "text": "These results show promise for further work on domain adaptation across different kinds of social media." } ], "y": { "@EXT@": { "gold_contexts": [ [ "4be5a47b5fd900c3578330b352b24c-C001-12" ], [ "4be5a47b5fd900c3578330b352b24c-C001-14" ], [ "4be5a47b5fd900c3578330b352b24c-C001-26" ] ], "cite_sentences": [ "4be5a47b5fd900c3578330b352b24c-C001-12", "4be5a47b5fd900c3578330b352b24c-C001-14", "4be5a47b5fd900c3578330b352b24c-C001-26" ] }, "@SIM@": { "gold_contexts": [ [ "4be5a47b5fd900c3578330b352b24c-C001-26" ], [ "4be5a47b5fd900c3578330b352b24c-C001-36" ], [ "4be5a47b5fd900c3578330b352b24c-C001-71" ] ], "cite_sentences": [ "4be5a47b5fd900c3578330b352b24c-C001-26", "4be5a47b5fd900c3578330b352b24c-C001-36", "4be5a47b5fd900c3578330b352b24c-C001-71" ] }, "@USE@": { "gold_contexts": [ [ "4be5a47b5fd900c3578330b352b24c-C001-26" ], [ "4be5a47b5fd900c3578330b352b24c-C001-36" ], [ "4be5a47b5fd900c3578330b352b24c-C001-68" ], [ "4be5a47b5fd900c3578330b352b24c-C001-70" ], [ "4be5a47b5fd900c3578330b352b24c-C001-71" ] ], "cite_sentences": [ "4be5a47b5fd900c3578330b352b24c-C001-26", "4be5a47b5fd900c3578330b352b24c-C001-36", "4be5a47b5fd900c3578330b352b24c-C001-68", "4be5a47b5fd900c3578330b352b24c-C001-70", "4be5a47b5fd900c3578330b352b24c-C001-71" ] } } }, "ABC_e9f7d339ccda101000b53d89da4e49_33": { "x": [ { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-2", "text": "We propose a method for semantic structure analysis of noun phrases using Abstract Meaning Representation (AMR)." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-3", "text": "AMR is a graph representation for the meaning of a sentence, in which noun phrases (NPs) are manually annotated with internal structure and semantic relations." 
}, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-4", "text": "We extract NPs from the AMR corpus and construct a data set of NP semantic structures." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-5", "text": "We also propose a transition-based algorithm which jointly identifies both the nodes in a semantic structure tree and semantic relations between them." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-6", "text": "Compared to the baseline, our method improves the performance of NP semantic structure analysis by 2.7 points, while further incorporating external dictionary boosts the performance by 7.1 points." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-7", "text": "----------------------------------" }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-9", "text": "Semantic structure analysis of noun phrases (NPs) is an important research topic, which is beneficial for various NLP tasks, such as machine translation and question answering (Nakov and Hearst, 2013; Nakov, 2013) ." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-10", "text": "Among the previous works on NP analysis are internal NP structure analysis (Vadas and Curran, 2007; Vadas and Curran, 2008) , noun-noun relation analysis of noun compounds (Girju et al., 2005; Tratz and Hovy, 2010; Kim and Baldwin, 2013) , and predicate-argument analysis of noun compounds (Lapata, 2002) ." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-11", "text": "The goal of internal NP structure analysis is to assign bracket information inside an NP (e.g., (lung cancer) deaths indicates that the phrase lung cancer modifies the head deaths)." 
}, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-12", "text": "In noun-noun relation analysis, the goal is to assign one of the predefined semantic relations to a noun compound consisting of two nouns (e.g., assigning a relation Predicate-argument relation ARG1, noun-noun relation topic, and internal structure (disaster prevention) awareness are expressed." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-13", "text": "purpose to a noun compound cooking pot, meaning that pot is used for cooking)." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-14", "text": "On the other hand, in predicate-argument analysis, the goal is to decide whether the modifier noun is the subject or the object of the head noun (e.g., car is the object of love in car lover, while child is the subject of behave in child behavior)." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-15", "text": "Previous NP researches have mainly focused on these different subproblems of NP analysis using different data sets, rather than targeting general NPs simultaneously." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-16", "text": "However, these subproblems of NP analysis tend to be highly intertwined when processing texts." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-17", "text": "For the purpose of tackling the task of combined NP analysis, we make use of the Abstract Meaning Representation (AMR) corpus." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-18", "text": "AMR is a formalism of sentence semantic structure by directed, acyclic, and rooted graphs, in which semantic relations such as predicateargument relations and noun-noun relations are expressed." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-19", "text": "In this paper, we extract substructures corresponding to NPs (shown in Figure 1 ) from the AMR Bank 1 , and create a data set of NP semantic structures." 
}, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-20", "text": "In general, AMR substructures are graphs." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-21", "text": "However, since we found out that NPs mostly form trees rather than graphs in the AMR Bank, we can assume that AMR substructures corresponding to NPs are trees." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-22", "text": "Thus, we define our task as predicting the AMR tree structure, given a sequence of words in an NP." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-23", "text": "The previous method for AMR parsing takes a Train Dev Test 3504 463 398 Table 1 : Statistics of the extracted NP data two-step approach: first identifying distinct concepts (nodes) in the AMR graph, then defining the dependency relations between those concepts (Flanigan et al., 2014) ." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-24", "text": "In the concept identification step, unlike POS tagging, one word is sometimes assigned with more than one concept, and the number of possible concepts is far more than the number of possible parts-of-speech." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-25", "text": "As the concept identification accuracy remains low, such a pipeline method suffers from error propagation, thus resulting in a suboptimal AMR parsing performance." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-26", "text": "To solve this problem, we extend a transitionbased dependency parsing algorithm, and propose a novel algorithm which jointly identifies the concepts and the relations in AMR trees." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-27", "text": "Compared to the baseline, our method improves the performance of AMR analysis of NP semantic structures by 2.7 points, and using an external dictionary further boosts the performance by 7.1 points." 
}, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-28", "text": "----------------------------------" }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-29", "text": "**ABSTRACT MEANING REPRESENTATION**" }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-30", "text": "----------------------------------" }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-31", "text": "**EXTRACTION OF NPS**" }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-32", "text": "We extract substructures (subtrees) corresponding to NPs from the AMR Bank (LDC2014T12)." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-33", "text": "In the AMR Bank, there is no alignment between the words and the concepts (nodes) in the AMR graphs." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-34", "text": "We obtain this alignment by using the rule-based alignment tool by Flanigan et al. (2014) ." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-35", "text": "Then, we use the Stanford Parser (Klein and Manning, 2003) to obtain constituency trees, and extract NPs that contain more than one noun and are not included by another NP." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-36", "text": "We exclude NPs that contain named entities, because they would require various kinds of manually crafted rules for each type of named entity." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-37", "text": "We also exclude NPs that contain possessive pronouns or conjunctions, which prove problematic for the alignment tool." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-38", "text": "Table 1 shows the statistics of the extracted NP data." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-39", "text": "----------------------------------" }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-40", "text": "**PREVIOUS METHOD FOR AMR ANALYSIS**" }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-41", "text": "We adopt the method proposed by Flanigan et al. 
(2014) as our baseline, which is a two-step pipeline method of concept identification step and (Flanigan et al., 2014) for a retired plant worker." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-42", "text": "\u2205 denotes an empty concept." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-43", "text": "relation identification step." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-44", "text": "Their method is designed for parsing sentences into AMR, but here, we use this method for parsing NPs." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-45", "text": "In their method, concept identification is formulated as a sequence labeling problem (Janssen and Limnios, 1999) and solved by the Viterbi algorithm." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-46", "text": "Spans of words in the input sentence are labeled with concept subgraphs." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-47", "text": "Figure 2 illustrates the concept identification step for an NP a retired plant worker." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-48", "text": "After the concepts have been identified, these concepts are fixed, and the dependency relations between them are identified by an algorithm that finds the maximum spanning connected subgraph (Chu and Liu, 1965) , which is similar to the maximum spanning tree (MST) algorithm used for dependency parsing (McDonald et al., 2005) ." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-49", "text": "They report that using gold concepts yields much better performance, implying that joint identification of concepts and relations can be helpful." 
}, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-50", "text": "----------------------------------" }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-51", "text": "**PROPOSED METHOD**" }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-52", "text": "In this paper, we propose a novel approach for mapping the word sequence in an NP to an AMR tree, where the concepts (nodes) corresponding to the words and the dependency relations between those concepts must be identified." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-53", "text": "We extend the arc-standard algorithm by Nivre (2004) for AMR parsing, and propose a transition-based algorithm which jointly identifies concepts and dependency retired! (Hatori et al., 2011) , in which they perform POS tagging and dependency parsing jointly by assigning a POS tag to a word when performing SHIFT, but differs in that, unlike POS tagging, one word is sometimes assigned with more than one concept." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-54", "text": "In our algorithm, the input words are stored in the buffer and the identified concepts are stored in the stack." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-55", "text": "SHIFT identifies a concept subtree associated with the top word in the buffer." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-56", "text": "REDUCE identifies the dependency relation between the top two concept subtrees in the stack." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-57", "text": "Figure 3 illustrates the process of deriving an AMR tree for a retired plant worker, and Figure 4 shows the resulting AMR tree." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-58", "text": "Table 2 shows the definition of each action and state transition." 
}, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-59", "text": "A state is a triple (\u03c3, \u03b2, R), where \u03c3 is a stack of identified concept subtrees, \u03b2 is a buffer of words, and R is a set of identified relations." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-60", "text": "SHIFT(c(w i )) extracts the top word w i in the buffer, generates a concept subtree c(w i ) from w i , and pushes c(w i ) onto the stack." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-61", "text": "The concept subtree c(w i ) is generated from w i by using one of the rules defined in Table 3 ." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-62", "text": "LEFT-REDUCE(r, n) pops the top two subtrees c i , c j from the stack, identifies the relation r between the root n root (c i ) of c i and the node n(c j ) in c j , adds r to R, and pushes c j back onto the stack." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-63", "text": "Here, n denotes a mapping from a subtree to its specific node, which allows for attachment to an arbitrary concept in a subtree." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-64", "text": "Since the sizes of the subtrees were at most two in our data set, n \u2208 {n root , n child }, where n root is a mapping from a subtree to its root, and n child is a mapping from a subtree to the direct child of its root." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-65", "text": "RIGHT-REDUCE(r, n) is defined in the same manner." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-66", "text": "EMPTY-REDUCE removes an empty subtree \u2205 at the top of the stack." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-67", "text": "EMPTY-REDUCE is always performed immediately after SHIFT(\u2205) generates an empty subtree \u2205. In the initial state, the stack \u03c3 is empty, the buffer \u03b2 contains all the words in the NP, and the set of the identified relations R is empty." 
}, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-68", "text": "In the terminal state, the buffer \u03b2 is empty and the stack \u03c3 contains only one subtree." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-69", "text": "The resulting AMR tree is obtained by connecting all the subtrees generated by the SHIFT actions with all the relations in R." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-70", "text": "The previous method could not generate unseen concepts in the training data, leading to low recall in concept identification." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-71", "text": "In contrast, our method defines five rules (" }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-72", "text": "----------------------------------" }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-73", "text": "**FEATURES**" }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-74", "text": "The feature set \u03c6(s, a) for the current state s and the next action a is the direct product (all-vs-all combinations from each set) of the feature set \u03c6 state (s) for the current state and the feature set \u03c6 action (s, a) for the next action." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-75", "text": "is the union of the feature sets defined in Table 4 , where w(c) denotes the word from which the subtree c was generated." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-76", "text": "Table 5 shows the feature set \u03c6 action ((\u03c3, [w i |\u03b2], R), a) for each action a, where rule(w i , c) is a function that returns the rule which generated the subtree c from the top word w i in the buffer." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-77", "text": "In order to allow different SHIFT actions and different LEFT-RIGHT/REDUCE actions to partially share features, Table 5 defines features of different granularities for each action." 
}, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-78", "text": "For example, although SHIFT((run-01)) and SHIFT((sleep-01)) are different actions, they share the features \"S\", \"S\"\u2022\"DICT PRED \" because they share the generation rule DICT PRED ." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-79", "text": "----------------------------------" }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-80", "text": "**EXPERIMENTS**" }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-81", "text": "We conduct an experiment using our NP data set (Table 1) ." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-82", "text": "We use the implementation 2 of (Flanigan et al., 2014) as our baseline." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-83", "text": "For the baseline, we use the features of the default settings." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-84", "text": "The method by Flanigan et al. (2014) can only generate the concepts that appear in the training data." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-85", "text": "On the other hand, our method can generate concepts that do not appear in the training data using the concept generation rules LEMMA, DICT PRED , and DICT NOUN in Table 3 ." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-86", "text": "For a fair comparison, first, we only use the rules EMPTY and KNOWN." 
}, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-87", "text": "Then, we add the rule LEMMA, which can generate a concept of the lemma of the 2 https://github.com/jflanigan/jamr Name Definition LEM {w(\u03c31).lem, w(\u03c30).lem, \u03b20.lem, w(\u03c31).lem \u2022 w(\u03c30).lem, w(\u03c30).lem \u2022 \u03b20.lem} SUF {w(\u03c31).suf, w(\u03c30).suf, \u03b20.suf, w(\u03c31).suf \u2022 w(\u03c30).suf, w(\u03c30).suf \u2022 \u03b20.suf} POS {w(\u03c31).pos, w(\u03c30).pos, \u03b20.pos, w(\u03c31).pos \u2022 w(\u03c30).pos, w(\u03c30).pos" }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-88", "text": "all words between w(\u03c31) and w(\u03c30) \u222a all words between w(\u03c30) and \u03b20 Table 5 : Feature sets for the action word." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-89", "text": "Finally, we add the rules DICT PRED and DICT NOUN ." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-90", "text": "These two rules need conversion from nouns and adjectives to their verb and noun forms, For this conversion, we use CatVar v2.1 (Habash and Dorr, 2003) , which lists categorial variations of words (such as verb run for noun runner)." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-91", "text": "We also use definitions of the predicates from PropBank (Palmer et al., 2005) , which AMR tries to reuse as much as possible, and impose constraints that the defined predicates can only have semantic relations consistent with the definition." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-92", "text": "During the training, we use the max-violation perceptron (Huang et al., 2012) with beam size 8 and average the parameters." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-93", "text": "During the testing, we also perform beam search with beam size 8." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-94", "text": "Table 6 shows the overall performance on NP semantic structure analysis." 
}, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-95", "text": "We evaluate the performance using the Smatch score (Cai and Knight, Method P R F1 (Flanigan et al., 2014) 75.5 61.1 67." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-96", "text": "Table 7 : Performance on concept identification 2013), which reports precision, recall, and F 1 -score for overlaps of nodes, edges, and roots in semantic structure graphs." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-97", "text": "Compared to the baseline, our method improves both the precision and recall, resulting in an increasing of F 1 -score by 2.7 points." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-98", "text": "When we add the LEMMA rule, the recall increases by 11.4 points because the LEMMA rule can generate concepts that do not appear in the training data, resulting in a further increase of F 1 -score by 5.2 points." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-99", "text": "Finally, when we add the DICT rules, the F 1 -score improves further by 1.9 points." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-100", "text": "Table 7 shows the performance on concept identification." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-101", "text": "We report precision, recall, and F 1 -score against the correct set of concepts." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-102", "text": "For each condition, we observe the same tendency in performance increases as Table 6 ." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-103", "text": "Thus, we conclude that our method improves both the concept identification and relation identification performances." 
}, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-104", "text": "----------------------------------" }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-105", "text": "**CONCLUSION**" }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-106", "text": "In this paper, we used Abstract Meaning Representation (AMR) for semantic structure analysis of noun compounds (NPs)." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-107", "text": "We extracted substructures corresponding to NPs from the AMR Bank, and created a data set of NP semantic structures." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-108", "text": "Then, we proposed a novel method which jointly identifies concepts (nodes) and dependency relations in AMR trees." }, { "sent_id": "e9f7d339ccda101000b53d89da4e49-C001-109", "text": "We confirmed that our method improves the performance on NP semantic structure analysis, and that incorporating an external dictionary further boosts the performance." } ], "y": { "@BACK@": { "gold_contexts": [ [ "e9f7d339ccda101000b53d89da4e49-C001-23" ] ], "cite_sentences": [ "e9f7d339ccda101000b53d89da4e49-C001-23" ] }, "@USE@": { "gold_contexts": [ [ "e9f7d339ccda101000b53d89da4e49-C001-34" ], [ "e9f7d339ccda101000b53d89da4e49-C001-41" ], [ "e9f7d339ccda101000b53d89da4e49-C001-82" ] ], "cite_sentences": [ "e9f7d339ccda101000b53d89da4e49-C001-34", "e9f7d339ccda101000b53d89da4e49-C001-41", "e9f7d339ccda101000b53d89da4e49-C001-82" ] }, "@DIF@": { "gold_contexts": [ [ "e9f7d339ccda101000b53d89da4e49-C001-84", "e9f7d339ccda101000b53d89da4e49-C001-85" ] ], "cite_sentences": [ "e9f7d339ccda101000b53d89da4e49-C001-84" ] } } }, "ABC_3452953ac579f1c05870442456a49c_33": { "x": [ { "sent_id": "3452953ac579f1c05870442456a49c-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-2", "text": "The task of implicit discourse relation classification has received increased attention in recent years, including two CoNNL shared tasks on the topic." 
}, { "sent_id": "3452953ac579f1c05870442456a49c-C001-3", "text": "Existing machine learning models for the task train on sections 2-21 of the PDTB and test on section 23, which includes a total of 761 implicit discourse relations." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-4", "text": "In this paper, we'd like to make a methodological point, arguing that the standard test set is too small to draw conclusions about whether the inclusion of certain features constitute a genuine improvement, or whether one got lucky with some properties of the test set, and argue for the adoption of cross validation for the discourse relation classification task by the community." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-5", "text": "----------------------------------" }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-7", "text": "Discourse-level relation analysis is relevant to a variety of NLP tasks such as summarization (Yoshida et al., 2014) , question answering (Jansen et al., 2014) and machine translation (Meyer et al., 2015) ." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-8", "text": "Recent years have seen more and more works on this topic, including two CoNNL shared tasks Xue et al., 2016) ." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-9", "text": "The community most often uses the Penn Discourse Treebank (PDTB) (Prasad et al., 2008) as a resource, and has adopted the usual split into training and test data as used for other tasks such as parsing." 
}, { "sent_id": "3452953ac579f1c05870442456a49c-C001-10", "text": "Because discourse relation annotation is at a higher level than syntactic annotation, this however means that the test set is rather small, and with the amount of alternative features and, more recently, neural network architectures being applied to the problem, we run a serious risk as a community of believing in features that are successful in getting some improvement on the specific test set but don't generalize at all." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-11", "text": "In discourse relation parsing, we usually distinguish between implicit and explicit discourse relations." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-12", "text": "Explicit relations are marked with a discourse connective such as \"because\", \"but\", \"if\", while implicit discourse relations are not marked with any discourse connective." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-13", "text": "The connective serves as a strong cue for the discourse relation, as the example below demonstrates:" }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-14", "text": "\" Typically, money-fund yields beat comparable short-term investments because portfolio managers can vary maturities and go after the highest rates\" (Explicit, Contingency.Cause) \" They desperately needed somebody who showed they cared for them, who loved them." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-15", "text": "(But) The last thing they needed was another drag-down blow.\" (Implicit, Comparison." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-16", "text": "Contrast) Previous studies show that the presence of connectives can greatly help with classification of the relation and can be disambiguated with 0.93 accuracy (4-ways) solely on the discourse relation connectives (Pitler et al., 2008) ." 
}, { "sent_id": "3452953ac579f1c05870442456a49c-C001-17", "text": "In implicit relations, no such strong cue is available and the discourse relation instead needs to be inferred based on the two textual arguments." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-18", "text": "In recent studies, various classes of features are explored to capture lexical and semantic regularities for identifying the sense of implicit relations, including linguistically informed features like polarity tags, Levin verb classes, length of verb phrases, language model based features, contextual features, constituent parse features and dependency parse features (Lin et al., 2009; Pitler et al., 2009; Zhou et al., 2010; Zhang et al., 2015; Chen et al., 2016) ." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-19", "text": "For some of second-level relations (a level of granularity that should be much more meaningful to downstream tasks than the four-way distinction), there are only a dozen in-stances, so that it's important to make maximal use of both the data set for training and testing." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-41", "text": "Additional features proposed later include relation specific word similarity (Biran and McKeown, 2013) , Brown clusters and Coreference Patterns (Rutherford and Xue, 2014) ." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-20", "text": "The test set that is currently most often used for 11 way classification is section 23 (Lin et al., 2009; Ji and Eisenstein, 2015; Rutherford et al., 2017) , which contains only about 761 implicit relations." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-21", "text": "This small size implies that a gain of 1 percentage point in accuracy corresponds to just classifying an additional 7-8 instances correctly." 
}, { "sent_id": "3452953ac579f1c05870442456a49c-C001-22", "text": "This paper therefore aims to demonstrate the degree to which conclusions about the effectiveness of including certain features would depend on whether one evaluates on the standard test section only, or performs cross validation on the whole dataset for second-level discourse relation classification." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-23", "text": "The model that we use is a neural network that takes the words occurring in the relation arguments as input, as well as traditional features mentioned above, to make comparisons with most-used section splits." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-24", "text": "To our knowledge, this is the first paper that systematically evaluates the effect of the train/test split for the implicit discourse relation classification task on PDTB." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-25", "text": "We report the classification performances on random and conventional split sections." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-26", "text": "As a model, we use a neural network that also includes some of the surface features that have been shown to be successful in previous work." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-27", "text": "Our model is competitive with the state of the art." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-28", "text": "The experiments here are exemplary for what kind of conclusions we would draw from the cross validation vs. from the usual train-test split." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-29", "text": "We find that results are quite different in the different splits of dataset, which we think is a strong indication that cross validation is important to adopt as a standard practice for the discourse relation classification community." 
}, { "sent_id": "3452953ac579f1c05870442456a49c-C001-30", "text": "We view cross validation as an important method in case other unseen datasets are not available (note that at least for English, new datasets have recently been made available as part of the shared task ; as well as Rehbein et al., (2016) )." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-31", "text": "----------------------------------" }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-32", "text": "**BACKGROUND ON DISCOURSE RELATION**" }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-33", "text": "Parsing Soricut and Marcu (2003) firstly addressed the task of parsing discourse structure within the same sentence." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-34", "text": "Many of the useful features proposed by them, syntax in particular, revealed that both arguments of the connectives are found in the same sentence." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-35", "text": "The release of PDTB, the largest available annotated corpora of discourse relations, opened the door to machine learning based discourse relation classification." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-36", "text": "Feature-based methods exploit discriminative features for implicit relation classification." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-37", "text": "Pitler et al. (2009) demonstrated that features developed to capture word polarity, verb classes and orientation, as well as some lexical features are strong indicator of the type of discourse relation." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-38", "text": "Lin et al. (2009) further introduced contextual, constituent and dependency parse features." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-39", "text": "They achieved an accuracy of 40.2% for 11-way classification, a 14.1% absolute improvement over the baseline." 
}, { "sent_id": "3452953ac579f1c05870442456a49c-C001-40", "text": "With these features, Park and Cardie (2012) provided a systematic study of previously proposed features and identified feature combinations." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-86", "text": "Implementation All the models are implemented in Keras 2 , which runs on top of Theano." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-42", "text": "Data selection and extension is another main aspect for discourse relation classification, given that the number of training instances is limited and only from a single domain." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-43", "text": "Wang et al. (2012) proposed a novel single centroid clustering algorithm to differentiate typical and atypical examples for each discourse relation." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-44", "text": "Mihil et al. (2014) and Hernault et al. (2010) proposed semi-supervised learning methods to recognise relations." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-45", "text": "Rutherford and collected additional training data from unannotated data, selecting instances based on two criteria (the degree to which a connective can generally be omitted and the degree to which a connective typically changes the interpretation of the relation) improved the inference of implicit discourse relation." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-46", "text": "Hidey and McKeown (2016) , Quirk and Poon (2016) extended training data with weakly labeled data which are cheaply obtained by distant-supervised learning." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-47", "text": "Recently the distributed word representations (Bengio et al., 2003; Mikolov et al., 2013) have shown an advantage in dealing with data sparsity problem (Braud and Denis, 2015) ." 
}, { "sent_id": "3452953ac579f1c05870442456a49c-C001-48", "text": "Many deep learning methods have been proved to be helpful in discourse relation parsing and achieved some significant progresses." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-49", "text": "Zhang et al. (2015) proposed a shallow convolutional neural network for implicit discourse recognition to alleviate the overfitting problem and help preserve the recognition and generalization ability with the model." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-50", "text": "Ji et al. (2015) computed distributed meaning represen-tations for each discourse argument with recursive neural network." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-51", "text": "Ji et al. (2016) introduced a latent variable to recurrent neural network and outperformed in two tasks." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-52", "text": "Chen et al. (2016) adopted a gated relevance network to capture the semantic interaction between word pairs." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-53", "text": "proposed a neural discourse relation recognizer with a semantic memory and attention weights for implicit discourse relation recognition." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-54", "text": "The model we use in this paper is most closely related to the neural network model proposed in Rutherford et al. (2017) ." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-55", "text": "The model also has access to the traditional features, which are concatenated to the neural representations of the arguments in the output layer." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-56", "text": "In order to simulate what conclusions we would be drawing from comparing the contributions of the handcrafted surface features, we calculate accuracy for each of the handcrafted features." 
}, { "sent_id": "3452953ac579f1c05870442456a49c-C001-57", "text": "----------------------------------" }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-58", "text": "**CORPORA**" }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-59", "text": "The Penn Discourse Treebank (PDTB) We use the Penn Discourse Treebank (Prasad et al., 2008) , the largest available manually annotated corpora of discourse on top of one million word tokens from the Wall Street Journal (WSJ)." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-60", "text": "The PDTB provides annotations for explicit and implicit discourse relations." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-61", "text": "By definition, an explicit relation contains an explicit discourse connective while the implicit one does not." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-62", "text": "The PDTB provides a three level hierarchy of relation tags for its annotation." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-85", "text": "With concatenating feature vector and the instance's representation, we classify it with a softmax layer and output its label." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-63", "text": "Previous work in this task has been done over two schemes of evaluation: first-level 4-ways classification (Pitler et al., 2009; Rutherford and Xue, 2014; Chen et al., 2016) , second-level 11-way classification (Lin et al., 2009; Ji and Eisenstein, 2015) ." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-64", "text": "The distribution of second-level relations in PDTB is illustrated in Table 1 ." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-65", "text": "We follow the preprocessing method in (Lin et al., 2009; Rutherford et al., 2017) ." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-66", "text": "If the instance is annotated with two relations, we adopt the first one shown up, and remove those relations with too few instances." 
}, { "sent_id": "3452953ac579f1c05870442456a49c-C001-67", "text": "We treat section 2-21 as training set, section 22 as development set and section 23 as test set for our results reported as \"mostused split\"." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-68", "text": "In order to investigate whether the results for benefit of including a certain feature to the model are stable, we conduct 10-fold crossvalidation on the whole corpus including sections 0-24. Note that we here included also the validation section for our experiments, to have maximal data for our demonstration of variability between folds." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-69", "text": "For best practice when testing new models, we instead recommend to keep the validation set completely separate and do cross-validation for the remaining data." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-70", "text": "Also note that you might want to choose repeated cross-validation (which simply repeats the cross-validation step several times with the data divided up into different folds) as an alternative to simple cross-validation performed here." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-71", "text": "For a more in-detail discussion of cross validation methods, see (Kim, 2009; Bengio and Grandvalet, 2005) ." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-72", "text": "In Table 1 , we can see that the different relations' proportions on the training and test set are quite different in the most-used split." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-73", "text": "For instance, temporal relations are under-represented which may lead to a misestimation of the usefulness of features that are relevant for classifying temporal relations." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-74", "text": "For our cross validation experiments, we evenly divided all the instances in section 0-24 into 10 balanced folds 1 ." 
}, { "sent_id": "3452953ac579f1c05870442456a49c-C001-75", "text": "The proportions of each class in the training and testing set are identical." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-76", "text": "With the same distribution of each class, we here avoid having an unbalanced number of instances per class among training and testing set." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-77", "text": "----------------------------------" }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-78", "text": "**MODEL**" }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-79", "text": "The task is to predict the discourse relation given the two arguments of an implicit instance." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-80", "text": "As a label set, we use 11-way distinction as proposed in Lin et al., (2009); Ji and Eisenstein (2015) ." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-81", "text": "Word Embeddings are trained with the Skip-gram architecture in Word2Vec (Mikolov et al., 2013) , which is able to capture semantic and syntactic patterns with an unsupervised method, on the training sections of WSJ data." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-82", "text": "Our model is illustrated in Figure 1 ." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-83", "text": "Each word is represented as a vector, which is found through a look-up word embedding." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-84", "text": "Then we get the representations of argument 1 and argument 2 separately after transforming semantic word vectors into distributed continuous-value features by LSTM recurrent neural network." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-87", "text": "The architecture of the model we use is illustrated in Figure 1 ." 
}, { "sent_id": "3452953ac579f1c05870442456a49c-C001-88", "text": "Regarding the initialization, regularization and learning algorithm, we follow all the settings in (Rutherford et al., 2017) ." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-89", "text": "We adopt cross-entropy as our cost function, adagrad as the optimization algorithm, initialized all the weights in the model with uniform random and set dropout layers after the embedding and output layer with a drop rate of 0.2 and 0.5 respectively." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-90", "text": "----------------------------------" }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-91", "text": "**FEATURES**" }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-92", "text": "For the sake of our cross-validation argument, we choose five kinds of most popular features in discourse relation classification, namely Inquirer Tags (semantic classification tags), Brown Clusters, Verb features, Levin classes and Modality." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-93", "text": "----------------------------------" }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-94", "text": "**RESULTS**" }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-95", "text": "We tested five frequently-used surface features with our model." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-96", "text": "Results are shown in Table 2 ." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-97", "text": "We can see that our implemented model is comparable with state of the art models." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-98", "text": "Our main point here is however not to argue that we outperform any particular model, but rather we'd like to discuss what conclusions we'd be drawing from adding surface features to our NN model if using the standard test set vs. doing cross validation." 
}, { "sent_id": "3452953ac579f1c05870442456a49c-C001-99", "text": "For each cross validation with different features, the separation into train and test sets are identical." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-100", "text": "We can see that the performances on Most-used Split section is generally 3-7% better than the results for the rest of the corpus." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-101", "text": "While we would also conclude from our model when evaluated on the standard test set that each of these features contribute some useful information, we can also see that we would come to very different conclusions if actually running the cross-validation experiment." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-102", "text": "Cross Validation is primarily a way of measuring the predictive performance of a model." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-103", "text": "With such a small test set, improvements on the classification could be the results of many factors." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-104", "text": "For instance, take a look at the effectiveness of including Inquirer Tags: these lead to an increase in performance by 2.8% in Most-used Split, but actually only helped on two out of 10-fold in the cross-validation set, overall leading to a small decrease in performance of the classifier." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-105", "text": "Similarly, the verb features seem to indicate a substantial improvement in relation classification accuracy on the standard test set, but there is no effect at all across the folds." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-106", "text": "Other works, such as Berg-Kirkpatrick et al. (2012) strongly recommend significance testing to validate metric gains in NLP tasks, even though the relationship between metric gain and statistical significance is complex." 
}, { "sent_id": "3452953ac579f1c05870442456a49c-C001-107", "text": "We observed that recent papers in discourse relation parsing do not always perform significance testing, and if they do report significance, then oftentimes they do not report the test that was used." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-108", "text": "We would here like to argue in favour of significance testing with cross validation, as opposed to boot strapping methods that only use the standard test set." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-109", "text": "Due to the larger amount of data, calculating significance based on the cross validation will give us substantially better estimates about the robustness of our results, because it can quantify more exactly the amount of variation with respect to transferring to a new (in-domain) dataset." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-110", "text": "----------------------------------" }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-111", "text": "**CONCLUSION**" }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-112", "text": "We have argued that the standard test section of the PDTB is too small to draw conclusions about whether a feature is generally useful or not, especially when using a larger label set, as is the case in recent work using second level labels." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-113", "text": "While these ideas are far from new and apply also to other NLP tasks with small evaluation sets, we think it is important to discuss this issue, as recent work in the field of discourse relation analysis has mostly ignored the issue of small test set sizes in the PDTB." 
}, { "sent_id": "3452953ac579f1c05870442456a49c-C001-114", "text": "Our experiments support our claim by showing that features that may look like they improve performance on the 11-way classification on the standard test set, did not always show a consistent improvement when the training / testing was split up differently." }, { "sent_id": "3452953ac579f1c05870442456a49c-C001-115", "text": "This means that we run a large risk of drawing incorrect conclusions about which features are helpful if we only stick out our small standard test set for evaluation." } ], "y": { "@BACK@": { "gold_contexts": [ [ "3452953ac579f1c05870442456a49c-C001-18" ], [ "3452953ac579f1c05870442456a49c-C001-20" ], [ "3452953ac579f1c05870442456a49c-C001-63" ] ], "cite_sentences": [ "3452953ac579f1c05870442456a49c-C001-18", "3452953ac579f1c05870442456a49c-C001-20", "3452953ac579f1c05870442456a49c-C001-63" ] }, "@USE@": { "gold_contexts": [ [ "3452953ac579f1c05870442456a49c-C001-65" ], [ "3452953ac579f1c05870442456a49c-C001-80" ] ], "cite_sentences": [ "3452953ac579f1c05870442456a49c-C001-65", "3452953ac579f1c05870442456a49c-C001-80" ] } } }, "ABC_7a1a1593a9480b6ee246ff4248668e_34": { "x": [ { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-52", "text": "**MULTI-TASK LEARNING**" }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-77", "text": "----------------------------------" }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-78", "text": "**TRAINING DETAILS**" }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-2", "text": "Abstractive summarization, the task of rewriting and compressing a document into a short summary, has achieved considerable success with neural sequence-tosequence models." 
}, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-3", "text": "However, these models can still benefit from stronger natural language inference skills, since a correct summary is logically entailed by the input document, i.e., it should not contain any contradictory or unrelated information." }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-4", "text": "We incorporate such knowledge into an abstractive summarization model via multi-task learning, where we share its decoder parameters with those of an entailment generation model." }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-5", "text": "We achieve promising initial improvements based on multiple metrics and datasets (including a test-only setting)." }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-6", "text": "The domain mismatch between the entailment (captions) and summarization (news) datasets suggests that the model is learning some domain-agnostic inference skills." }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-7", "text": "----------------------------------" }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-9", "text": "Abstractive summarization, the task of rewriting a document into a short summary is a significantly more challenging (and natural) task than extractive summarization, which only involves choosing which sentence from the original document to keep or discard in the output summary." }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-28", "text": "The SNLI corpus Bowman et al. (2015) allows training accurate end-to-end neural networks for this task." 
}, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-10", "text": "Neural sequence-to-sequence models have led to substantial improvements on this task of abstractive summarization, via machine translation inspired encoder-aligner-decoder approaches, further enhanced via convolutional encoders, pointer-copy mechanisms, and hierarchical attention (Rush et al., 2015; Nallapati et al., 2016; See et al., 2017) ." }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-11", "text": "Despite these promising recent improvements, Input Document: may is a pivotal month for moving and storage companies ." }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-12", "text": "Ground-truth Summary: moving companies hit bumps in economic road Baseline Summary: a month to move storage companies Multi-task Summary: pivotal month for storage firms there is still scope in better teaching summarization models about the general natural language inference skill of logical entailment generation." }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-13", "text": "This is because the task of abstractive summarization involves two subtasks: salient (important) event detection as well as logical compression, i.e., the summary should not contain any information that is contradictory or unrelated to the original document." }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-14", "text": "Current methods have to learn both these skills from the same dataset and a single model." }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-15", "text": "Therefore, there is benefit in learning the latter ability of logical compression via external knowledge from a separate entailment generation task, that will specifically teach the model how to rewrite and compress a sentence such that it logically follows from the original input." 
}, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-16", "text": "To achieve this, we employ the recent paradigm of sequence-to-sequence multi-task learning (Luong et al., 2016) ." }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-17", "text": "We share the decoder parameters of the summarization model with those of the entailment-generation model, so as to generate summaries that are good at both extracting important facts from as well as being logically entailed by the input document." }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-18", "text": "Fig. 1 shows such an (actual) output example from our model, where it successfully learns both salient information extraction as well as entailment, unlike the strong baseline model." }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-19", "text": "Empirically, we report promising initial improvements over some solid baselines based on several metrics, and on multiple datasets: Gigaword and also a test-only setting of DUC." }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-20", "text": "Impor-tantly, these improvements are achieved despite the fact that the domain of the entailment dataset (image captions) is substantially different from the domain of the summarization datasets (general news), which suggests that the model is learning certain domain-independent inference skills." }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-21", "text": "Our next steps to this workshop paper include incorporating stronger pointer-based models and employing the new multi-domain entailment corpus (Williams et al., 2017) ." 
}, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-22", "text": "----------------------------------" }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-23", "text": "**RELATED WORK**" }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-24", "text": "Earlier summarization work focused more towards extractive (and compression) based summarization, i.e., selecting which sentences to keep vs discard, and also compressing based on choosing grammatically correct sub-sentences having the most important pieces of information (Jing, 2000; Knight and Marcu, 2002; Clarke and Lapata, 2008; Filippova et al., 2015) ." }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-25", "text": "Bigger datasets and neural models have allowed the addressing of the complex reasoning involved in abstractive summarization, i.e., rewriting and compressing the input document into a new summary." }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-26", "text": "Several advances have been made in this direction using machine translation inspired encoder-aligner-decoder models, convolution-based encoders, switching pointer and copy mechanisms, and hierarchical attention models (Rush et al., 2015; Nallapati et al., 2016; See et al., 2017) ." }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-27", "text": "Recognizing textual entailment (RTE) is the classification task of predicting whether the relationship between a premise and hypothesis sentence is that of entailment (i.e., logically follows), contradiction, or independence (Dagan et al., 2006) ." }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-51", "text": "----------------------------------" }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-29", "text": "Some previous work (Mehdad et al., 2013; Gupta et al., 2014) has explored the use of textual entailment recognition for redundancy detection in summarization." 
}, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-30", "text": "They label relationships between sentences, so as to select the most informative and non-redundant sentences for summarization, via sentence connectivity and graphbased optimization and fusion." }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-31", "text": "Our focus, on the other hand, is entailment generation and not recognition, i.e., to teach summarization models the general natural language inference skill of generating a compressed sentence that logically entails the original longer sentence, so as to produce more effective short summaries." }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-32", "text": "We achieve this via multi-task learning with entailment generation." }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-33", "text": "Multi-task learning involves sharing parameters between related tasks, whereby each task benefits from extra information in the training signals of the related tasks, and also improves its generalization performance." }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-34", "text": "Luong et al. (2016) showed improvements on translation, captioning, and parsing in a shared multi-task setting." }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-35", "text": "Recently, Pasunuru and Bansal (2017) extend this idea to video captioning with two related tasks: video completion and entailment generation." }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-36", "text": "We demonstrate that abstractive text summarization models can also be improved by sharing parameters with an entailment generation task." 
}, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-37", "text": "----------------------------------" }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-38", "text": "**MODELS**" }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-39", "text": "First, we discuss our baseline model which is similar to the machine translation encoder-alignerdecoder model of Luong et al. (2015) , and presented by Chopra et al. (2016) ." }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-40", "text": "Next, we introduce our multi-task learning approach of sharing the parameters between abstractive summarization and entailment generation models." }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-41", "text": "----------------------------------" }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-42", "text": "**BASELINE MODEL**" }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-43", "text": "Our baseline model is a strong, multi-layered encoder-attention-decoder model with bilinear attention, similar to Luong et al. (2015) and following the details in Chopra et al. (2016) ." }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-44", "text": "Here, we encode the source document with a two-layered LSTM-RNN and generate the summary using another two-layered LSTM-RNN decoder." }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-45", "text": "The word probability distribution at time step t of the decoder is defined as follows:" }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-46", "text": "where g is a non-linear function and c t and s t are the context vector and LSTM-RNN decoder hidden state at time step t, respectively." }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-47", "text": "The context vector c t = \u03b1 t,i h i is a weighted combination of encoder hidden states h i , where the attention weights are learned through the bilinear attention mechanism proposed in Luong et al. (2015) ." 
}, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-48", "text": "For the rest of the paper, we use same notations." }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-49", "text": "We also use the same model architecture for the entailment generation task, i.e., a sequence-tosequence model encoding the premise and decoding the entailed hypothesis, via bilinear attention between them." }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-50", "text": "Figure 2: Multi-task learning of the summarization task (left) with the entailment generation task (right)." }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-53", "text": "Multi-task learning helps in sharing knowledge between related tasks across domains (Luong et al., 2015) ." }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-54", "text": "In this work, we show improvements on the task of abstractive summarization by sharing its parameters with the task of entailment generation." }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-55", "text": "Since a summary is entailed by the input document, sharing parameters with the entailment generation task improves the logically-directed aspect of the summarization model, while maintaining the salient information extraction aspect." }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-56", "text": "In our multi-task setup, we share the decoder parameters of both the tasks (along with the word embeddings), as shown in Fig. 2 , and we optimize the two loss functions (one for summarization and another for entailment generation) in alternate mini-batches of training." }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-57", "text": "Let \u03b1 s be the number of mini-batches of training for summarization after which it is switched to train \u03b1 e number of minibatches for entailment generation." 
}, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-58", "text": "Then, the mixing ratio is defined as" }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-59", "text": "----------------------------------" }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-60", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-61", "text": "----------------------------------" }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-62", "text": "**DATASETS**" }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-63", "text": "Gigaword Corpus We use the exact annotated Gigaword corpus provided by Rush et al. (2015) ." }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-64", "text": "The dataset has approximately 3.8 million training pairs." }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-65", "text": "We use 10, 000 pairs as validation set and the exact test sample provided by Rush et al. (2015) as our test set." }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-66", "text": "We use the first sentence of the article as the source with vocabulary size of 119, 505 and article headline as target with vocabulary size of 68, 885." }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-67", "text": "----------------------------------" }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-68", "text": "**DUC TEST CORPUS**" }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-69", "text": "----------------------------------" }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-70", "text": "**SNLI CORPUS**" }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-71", "text": "For the task of entailment generation, we use the Standford Natural Language Inference (SNLI) corpus (Bowman et al., 2015) , where we only use the entailment-labeled pairs and regroup the splits to have a zero overlap traintest split and have a multi-reference test set, as suggested by Pasunuru and Bansal (2017) ." 
}, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-72", "text": "Out of 190, 113 entailments pairs, we use 145, 822 unique premise pairs for training, and the rest of them are equally divided into dev and test sets." }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-73", "text": "----------------------------------" }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-74", "text": "**EVALUATION**" }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-75", "text": "Following previous work (Nallapati et al., 2016; Chopra et al., 2016; Rush et al., 2015) , we use the full-length F1 variant of Rouge (Lin, 2004) for the Gigaword results, and the 75-bytes length limited Recall variant of Rouge for DUC." }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-76", "text": "Additionally, we also report other standard language generation metrics (as motivated recently by See et al. (2017) ): METEOR (Denkowski and Lavie, 2014) , BLEU-4 (Papineni et al., 2002) , and CIDEr-D , based on the MS-COCO evaluation script (Chen et al., 2015) ." }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-79", "text": "We use the following simple settings for all the models, unless otherwise specified." }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-80", "text": "We unroll the encoder RNN's to a maximum of 50 time steps and decoder RNN's to a maximum of 30 time steps." }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-81", "text": "Models ROUGE-1 ROUGE-2 ROUGE-L METEOR BLEU-4 CIDEr-D PREVIOUS WORK ABS+ (Rush et al., 2015) 29.76 11.88 26.96 ---RAS-Elman (Chopra et al., 2016) 33.78 15.97 31.15 ---words-lvt2k-1sent (Nallapati et al., 2016) 32 Table 1 : Summarization results on Gigaword." }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-82", "text": "Rouge scores are full length F-1, following previous work." }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-83", "text": "We use RNN hidden state dimension of 512 and word embedding dimension of 256." 
}, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-84", "text": "We do not initialize our word embeddings with any pre-trained models, i.e., we learn them from scratch." }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-85", "text": "We use the Adam (Kingma and Ba, 2015) optimizer with a learning rate of 0.001." }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-86", "text": "During training, to handle the large vocabulary, we use the sampled loss trick of Jean et al. (2014) ." }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-87", "text": "We always tune hyperparameters on the validation set of the corresponding dataset, where applicable." }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-88", "text": "For multi-task learning, we tried a few mixing ratios and found 1 : 0.05 to work better, i.e., 100 mini-batches of summarization with 5 mini-batches of entailment generation task in alternate training rounds." }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-89", "text": "----------------------------------" }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-90", "text": "**RESULTS AND ANALYSIS**" }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-91", "text": "----------------------------------" }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-92", "text": "**SUMMARIZATION RESULTS: GIGAWORD**" }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-93", "text": "Baseline Results and Previous Work Our baseline is a strong encoder-attention-decoder model based on Luong et al. (2015) and presented by Chopra et al. (2016) ." }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-94", "text": "As shown in Table 1 , it is reasonably close to some of the state-of-theart (comparable) results in previous work, though making this baseline further strong (e.g., based on pointer-copy mechanism) is our next step." 
}, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-95", "text": "----------------------------------" }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-96", "text": "**MULTI-TASK RESULTS**" }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-97", "text": "We show promising initial multi-task improvements on top of our baseline, based on several metrics." }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-98", "text": "This suggests that the entailment generation model is teaching the summarization model some skills about how to choose a logical subset of the events in the full input document." }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-99", "text": "This is especially promising given that the domain of the entailment dataset (image captions) is very different from the domain of the summarization datasets (news), suggesting that the model might be learning some domain-agnostic inference skills." }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-100", "text": "----------------------------------" }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-101", "text": "**SUMMARIZATION RESULTS: DUC**" }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-102", "text": "Here, we directly use the Gigaword-trained model to test on the DUC-2004 dataset (see tuning discussion in Sec. 4.1)." }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-103", "text": "In Table 2 , we again see that et al. (2015) 28.18 8.49 23.81 Chopra et al. (2016) 28.97 8.26 24.06 Nallapati et al. (2016) our Luong et al. (2015) baseline model achieves competitive performance with previous work, esp." }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-104", "text": "on Rouge-2 and Rouge-L. Next, we show promising multi-task improvements over this baseline of around 0.4% across all metrics, despite being a test-only setting and also with the mismatch between the summarization and entailment domains." 
}, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-105", "text": "Figure 3 shows some additional interesting output examples of our multi-task model and how it generates summaries that are better at being logically entailed by the input document, whereas the baseline model contains some crucial contradictory or unrelated information." }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-106", "text": "----------------------------------" }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-107", "text": "**ANALYSIS EXAMPLES**" }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-108", "text": "----------------------------------" }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-109", "text": "**CONCLUSION AND NEXT STEPS**" }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-110", "text": "We presented a multi-task learning approach to incorporate entailment generation knowledge into summarization models." }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-111", "text": "We demonstrated promising initial improvements based on multiple datasets and metrics, even when the entailment knowledge was extracted from a domain different from the summarization domain." }, { "sent_id": "7a1a1593a9480b6ee246ff4248668e-C001-112", "text": "Our next steps to this workshop paper include: (1) stronger summarization baselines, e.g., using pointer copy mechanism (See et al., 2017; Nallapati et al., 2016) , and also adding this capability to the entailment generation model; (2) results on CNN/Daily Mail corpora (Nallapati et al., 2016) ; (3) incorporating entailment knowledge from other news-style domains such as the new Multi-NLI corpus (Williams et al., 2017) , and (4) demonstrating mutual improvements on the entailment generation task." 
} ], "y": { "@SIM@": { "gold_contexts": [ [ "7a1a1593a9480b6ee246ff4248668e-C001-39" ], [ "7a1a1593a9480b6ee246ff4248668e-C001-43" ], [ "7a1a1593a9480b6ee246ff4248668e-C001-103" ] ], "cite_sentences": [ "7a1a1593a9480b6ee246ff4248668e-C001-39", "7a1a1593a9480b6ee246ff4248668e-C001-43", "7a1a1593a9480b6ee246ff4248668e-C001-103" ] }, "@USE@": { "gold_contexts": [ [ "7a1a1593a9480b6ee246ff4248668e-C001-47" ], [ "7a1a1593a9480b6ee246ff4248668e-C001-93" ] ], "cite_sentences": [ "7a1a1593a9480b6ee246ff4248668e-C001-47", "7a1a1593a9480b6ee246ff4248668e-C001-93" ] }, "@BACK@": { "gold_contexts": [ [ "7a1a1593a9480b6ee246ff4248668e-C001-53" ] ], "cite_sentences": [ "7a1a1593a9480b6ee246ff4248668e-C001-53" ] }, "@DIF@": { "gold_contexts": [ [ "7a1a1593a9480b6ee246ff4248668e-C001-53", "7a1a1593a9480b6ee246ff4248668e-C001-54" ] ], "cite_sentences": [ "7a1a1593a9480b6ee246ff4248668e-C001-53" ] } } }, "ABC_6c872be6b2fbe83890e28ddc1098a3_34": { "x": [ { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-2", "text": "Interpreting the performance of deep learning models beyond test set accuracy is challenging." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-3", "text": "Characteristics of individual data points are often not considered during evaluation, and each data point is treated equally." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-4", "text": "We examine the impact of a test set question's difficulty to determine if there is a relationship between difficulty and performance." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-5", "text": "We model difficulty using well-studied psychometric methods on human response patterns." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-6", "text": "Experiments on Natural Language Inference (NLI) and Sentiment Analysis (SA) show that the likelihood of answering a question correctly is impacted by the question's difficulty." 
}, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-7", "text": "As DNNs are trained with more data, easy examples are learned more quickly than hard examples." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-8", "text": "----------------------------------" }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-10", "text": "One method for interpreting deep neural networks (DNNs) is to examine model predictions for specific input examples, e.g. testing for shape bias as in Ritter et al. (2017) ." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-11", "text": "In the traditional classification task, the difficulty of the test set examples is not taken into account." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-12", "text": "The number of correctlylabeled examples is tallied up and reported." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-13", "text": "However, we hypothesize that it may be worthwhile to use difficulty when evaluating DNNs." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-14", "text": "For example, what does it mean if a trained model answers the more difficult examples correctly, but cannot correctly classify what are seemingly simple cases?" }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-15", "text": "Recent work has shown that for NLP tasks such as Natural Language Inference (NLI), models can achieve strong results by simply using the hypothesis of a premise-hypothesis pair and ignoring the premise entirely (Gururangan et al., 2016; Tsuchiya, 2018; Poliak et al., 2018) ." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-16", "text": "In this work we consider understanding DNNs by looking at the difficulty of specific test set examples and comparing DNN performance under different training scenarios." 
}, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-17", "text": "Do DNN models learn examples of varying difficulty at different rates?" }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-18", "text": "If a model does well on hard examples and poor on easy examples, then can we say that it has really learned anything?" }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-19", "text": "In contrast, if a model does well on easy items, because a dataset is all easy, have we really \"solved\" anything?" }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-20", "text": "To model difficulty we use Item Response Theory (IRT) from psychometrics (Baker and Kim, 2004) ." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-21", "text": "IRT models characteristics such as difficulty and discrimination ability of specific examples (called \"items\" 1 ) in order to estimate a latent ability trait of test-takers." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-22", "text": "Here we use IRT to model the difficulty of test items to determine how DNNs learn items of varying difficulty." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-23", "text": "IRT provides a well-studied methodology for modeling item difficulty as opposed to more heuristic-based difficulty estimates such as sentence length." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-24", "text": "IRT was previously used to build a new test set for the NLI task (Lalor et al., 2016) and show that model performance is dependent on test set difficulty." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-25", "text": "In this work we use IRT to probe specific items to try to analyze model performance at a more finegrained level, and expand the analysis to include the task of SA." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-26", "text": "We train three DNNs models with varying training set sizes to compare performance on two NLP tasks: NLI and Sentiment Analysis (SA)." 
}, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-27", "text": "Our experiments show that a DNN model's likelihood of classifying an item correctly is dependent on the item's difficulty." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-28", "text": "In addition, as the models are trained with more data, the odds of answering easy examples correctly increases at a faster rate than the odds of answering a difficult example correctly." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-29", "text": "That is, performance starts to look more human, in the sense that humans learn easy items faster than they learn hard items." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-30", "text": "That the DNNs are better at easy items than hard items seems intuitive but is a surprising and interesting result since the item difficulties are modeled from human data." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-31", "text": "There is no underlying reason that the DNNs would find items that are easy for humans inherently easy." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-32", "text": "To our knowledge this is the first work to use a grounded measure of difficulty learned from human responses to understand DNN performance." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-33", "text": "Our contributions are as follows: (i) we use a well-studied methodology, IRT, to estimate item difficulty in two NLP tasks and show that this human-estimated difficulty is a useful predictor of DNN model performance, (ii) we show that as training size increases DNN performance trends towards expected human performance." 
}, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-34", "text": "2 2 Methods" }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-35", "text": "----------------------------------" }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-36", "text": "**ESTIMATING ITEM DIFFICULTY**" }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-37", "text": "To model item difficulty we use the Three Parameter Logistic (3PL) model from IRT (Baker, 2001; Baker and Kim, 2004; Lalor et al., 2016) ." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-38", "text": "The 3PL model in IRT models an individual's latent ability (\u03b8) on a task as a function of three item characteristics: discrimination ability (a), difficulty (b), and guessing (c)." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-39", "text": "For a particular item i, the probability that an individual j will answer item i correctly is a function of the individual's ability and the three item characteristics:" }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-40", "text": "where a i is the discrimination parameter (the value of the function slope at it's steepest point), b i is the difficulty parameter (the value where p ij (\u03b8 j ) = 0.5), and c i is the guessing parameter (the lower asymptote of the function)." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-41", "text": "For a set of items I and a set of individuals J, the likelihood of each individual in J's responses to the items in I is:" }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-42", "text": "where q ij (\u03b8 j ) = 1 \u2212 p ij (\u03b8 j ) and y ij = 1 if individual j answered item i correctly and y ij = 0 otherwise." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-43", "text": "Item parameters and individual ability are jointly estimated from a set of individuals' response patterns using an Expectation-Maximization algorithm (Bock and Aitkin, 1981) ." 
}, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-44", "text": "In this work we focus on the difficulty parameter b i , which represents the latent ability level at which an individual has a 50% chance of answering item 2 Code and data available at http://jplalor.github.io i correctly." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-45", "text": "Low values of b i are associated with easier items (since an individual with low ability has a 50% chance of answering correctly), and higher values of b i represent more difficult items." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-46", "text": "----------------------------------" }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-47", "text": "**DATA**" }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-48", "text": "To estimate item difficulties for NLI, we used the pre-trained IRT models of Lalor et al. (2016) and extracted the difficulty item parameters." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-49", "text": "The data consists of approximately 1000 human annotator responses from Amazon Mechanical Turk (AMT) for a selection of 180 premise-hypothesis pairs from the SNLI data set (Bowman et al., 2015) ." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-50", "text": "Each AMT worker (Turker) was shown the premisehypothesis pairs and was asked to indicate whether, if the premise was taken to be true, the hypothesis was (a) definitely true (entailment), (b) maybe true (neutral), or (c) definitely not true (contradiction)." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-51", "text": "For SA, we collected a new data set of labels for 134 examples randomly selected from the Stanford Sentiment Treebank (SSTB) (Socher et al., 2013) , using a similar AMT setup as Lalor et al. (2016) ." 
}, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-52", "text": "For each randomly selected example, we had 1000 Turkers label the sentence as very negative, negative, neutral, positive, or very positive." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-53", "text": "We converted these responses to binary positive/negative labels and fit a new IRT 3PL model ( \u00a72.1) using the mirt R package (Chalmers et al., 2015) ." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-54", "text": "Very negative and negative labels were binned together, and neutral, positive, and very positive were binned together." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-55", "text": "Tables 1 and 2 show examples of the items in our data sets, and the difficulty values estimated from the IRT models." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-56", "text": "The first example in Table 1 is a clear case of entailment, where if we assume that the premise is true, we can infer that the hypothesis is also true." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-57", "text": "The label of the second example in SNLI is contradiction, but in this case the result is not as clear." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-58", "text": "There are sports stadiums that offer lawn seating, and therefore this could potentially be a case of entailment (or neutral)." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-59", "text": "Either way, one could argue that the second example here is more difficult than the first." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-60", "text": "Similarly, the first two examples of Table 2 are interesting." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-61", "text": "Both of these items are labeled as negative examples in the data set." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-62", "text": "The first example is clear, but the second one is more ambiguous." 
}, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-63", "text": "It could be considered a mild complement, since the author still endorses renting the movie." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-64", "text": "Therefore you could argue again that the second example is more difficult than the first." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-65", "text": "The learned difficulty parameters reflect this difference Negative -2.46 Still, it gets the job done -a sleepy afternoon rental." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-66", "text": "Negative 1.78 An endlessly fascinating, landmark movie that is as bold as anything the cinema has seen in years." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-67", "text": "----------------------------------" }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-68", "text": "**POSITIVE -2.27**" }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-69", "text": "Perhaps no picture ever made has more literally showed that the road to hell is paved with good intentions." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-70", "text": "Positive 2.05 in difficulty in both cases." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-71", "text": "Inter-rater reliability scores for the collected annotations are showin in Table 3 ." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-72", "text": "Scores for the NLI annotations were calculated when the original dataset was collected and are reproduced here (Lalor et al., 2016) ." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-73", "text": "Human annotations for the SA annotations were converted to binary before calculating the agreement." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-74", "text": "We see that the agreement scores are in the range of 0.4 to 0.6 which is considered moderate agreement (Landis and Koch, 1977) ." 
}, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-75", "text": "With the large number of annotators it is to be expected that there is some disagreement in the labels." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-76", "text": "However this disagreement can be interpreted as varying difficulty of the items, which is what we expect when we fit the IRT models." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-77", "text": "----------------------------------" }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-78", "text": "**EXPERIMENTS**" }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-79", "text": "Our goal in this work is to understand how DNN performance on items of varying difficulty changes under different training scenarios." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-80", "text": "To test this, we trained three DNN models using subsets of the original SNLI and SSTB training data sets: (i) Long Short Term Memory Network (LSTM) (Bow-man et al., 2015) , (ii) Convolutional Neural Network (CNN) (Kim, 2014) , and (iii) Neural Semantic Encoder (NSE), a type of memory-augmented RNN (Munkhdalai and Yu, 2017 )." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-81", "text": "3 For each task (NLI and SA), we randomly sampled subsets of training data, from 100 examples up to and including the full training data sets." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-82", "text": "4 We trained each model on the training data subsets, using the original development sets for early stopping to prevent overfitting." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-83", "text": "The IRT data with difficulty estimates were used as test sets for the trained models." 
}, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-84", "text": "Once the models were trained and had classified the IRT data sets, we fit logistic regression models to predict whether a DNN model would label an item correctly, using the training set size and item difficulty as the dependent parameters." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-85", "text": "Figure 1 plots the contour plots of our learned regression models." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-86", "text": "The top row plots results for the NLI task, and the bottom row plots results for the SA task." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-87", "text": "From left to right in both rows, the plots show results for the LSTM, CNN, and NSE models." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-88", "text": "In each plot, the x-axis is the training set size, the y-axis is the item difficulty, and the contour lines represent the log-odds that the DNN model would classify an item correctly." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-89", "text": "As the plots show, item difficulty has a clear effect on classification." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-90", "text": "Easier items have higher odds of being classified correctly across all of the training set sizes." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-91", "text": "In addition, the slopes of the contour lines are steeper at lower levels of difficulty." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-92", "text": "This indicates that, moving left to right along the x-axis, a model's odds of answering an easy item correctly increase more quickly than the odds of answering a harder item correctly." 
}, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-93", "text": "----------------------------------" }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-94", "text": "**RESULTS**" }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-95", "text": "The contour plots for the CNN and NSE models on the SA task (Figure 1 , second row middle and right plots) show that the easier items have higher likelihood of being classified correctly, but the odds for the most difficult items decrease as training size increases." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-96", "text": "This suggests that these models are learning in such a way that improves performance on easy items but has a negative effect on hard items." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-97", "text": "This result is important for interpretability, as it could inform stakeholder decisions if they need to have difficult examples classified." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-98", "text": "The idea that easy items should be easier than hard items is consistent with learning strategies in humans." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-99", "text": "For example, when teaching new concepts to students, easier concepts are presented first so that the students can learn patterns and core information before moving to more difficult concepts (Collins et al., 1988; Arroyo et al., 2010) ." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-100", "text": "As students do more examples, all questions get easier, but easy questions get easier at a faster rate." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-101", "text": "Our result is also consistent with the key assumptions of curriculum learning (Bengio et al., 2009 )." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-102", "text": "Lalor et al. (2016) introduced the idea of applying IRT evaluation to NLP tasks." 
}, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-103", "text": "They built a set of scales using IRT for NLI and evaluated a single LSTM neural network to demonstrate the effectiveness of the evaluation, but did not evaluate other NLP models or tasks." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-104", "text": "Mart\u00ednez-Plumed et al. (2016) consider IRT in the context of evaluating ML models, but they do not use a human population to calibrate the models, and obtain results that are difficult to interpret under IRT assumptions." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-105", "text": "----------------------------------" }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-106", "text": "**RELATED WORK**" }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-107", "text": "There has been work in the NLP community around modeling latent characteristics of data (Bruce and Wiebe, 1999) and annotators (Hovy et al., 2013) , but none that apply the resulting metrics to interpret DNN models." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-108", "text": "Passonneau and Carpenter (2014) model the probability a label is correct with the probability of an annotator to label an item correctly according to the Dawid and Skene (1979) model, but do not consider difficulty or discriminatory ability of the data points." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-109", "text": "One-shot learning is an attempt to build ML models that can generalize after being trained on one or a few examples of a class as opposed to a large training set (Lake et al., 2013) ." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-110", "text": "One-shot learning attempts to mimic human learning behaviors (i.e., generalization after being exposed to a small number of training examples) (Lake et al., 2016) ." 
}, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-111", "text": "Our work instead looks at comparisons to human performance, where any learning (on the part of models) has been completed beforehand." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-112", "text": "Our goal is to analyze DNN models and training set variations as they affect ability in the context of IRT." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-113", "text": "----------------------------------" }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-114", "text": "**DISCUSSION**" }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-115", "text": "In this work we have shown that DNN model performance is affected by item difficulty as well as training set size." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-116", "text": "This is the first work that has used a well-established method for estimating difficulty to analyze DNN model performance as opposed to heuristics." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-117", "text": "DNN models perform better on easy items, and as more data is introduced in training, easy items are learned more quickly than hard items." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-118", "text": "Learning easy examples faster than harder examples is what would be expected when examining human response patterns as they learn more about a subject." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-119", "text": "However this has not previously been shown to be true in DNN models." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-120", "text": "That the results are consistent across NLI and SA shows that the methods can be applied to a number of NLP tasks." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-121", "text": "The SA results do show that the odds of labeling a difficult item correctly decrease with more training data 1." 
}, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-122", "text": "It could be the case that these difficult items in the SA task are more subjective than the easier items, for example a review that is fairly neutral and is split between positive and negative annotations." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-123", "text": "These cases would be more difficult for a model to label, and are worth examining in more detail." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-124", "text": "By identifying items such as these as difficult makes it easier to see where the model is going wrong and allows for research on better way to represent these cases." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-125", "text": "This result has implications for how machine learning models are evaluated across tasks." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-126", "text": "The traditional assumption that the test data is drawn from the same distribution as the training data, makes it difficult to understand how a model will perform in settings where that assumption does not hold." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-127", "text": "However, if the difficulty of test set data is known, we can better understand what kind of examples a given model performs well on, and specific instances where a model underperforms (e.g. the most difficult examples)." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-128", "text": "In addition, researhers can build test sets that consist of a specific type of data (very easy, very hard, or a mix) to evaluate a trained model under specific assumptions to test generalization ability in a controlled way." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-129", "text": "This could allow for more confidence in model performance in more varied deployment settings, since there would be a set of tests a model would have to pass before being deployed." 
}, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-130", "text": "It is important to note that the difficulty parameters were estimated from a human population, meaning that those items that are difficult for humans are in fact more difficult for the DNN models as well." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-131", "text": "This does not need to be the case given that DNNs learn very different patterns, etc." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-132", "text": "than humans." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-133", "text": "In fact there were exceptions in our results which shows that these models should be carefully examined using techniques like those described here." }, { "sent_id": "6c872be6b2fbe83890e28ddc1098a3-C001-134", "text": "Future work can investigate why this is the case and how we can leverage this information to improve model performance and interpretability." } ], "y": { "@BACK@": { "gold_contexts": [ [ "6c872be6b2fbe83890e28ddc1098a3-C001-24" ] ], "cite_sentences": [ "6c872be6b2fbe83890e28ddc1098a3-C001-24" ] }, "@EXT@": { "gold_contexts": [ [ "6c872be6b2fbe83890e28ddc1098a3-C001-24", "6c872be6b2fbe83890e28ddc1098a3-C001-25" ] ], "cite_sentences": [ "6c872be6b2fbe83890e28ddc1098a3-C001-24" ] }, "@USE@": { "gold_contexts": [ [ "6c872be6b2fbe83890e28ddc1098a3-C001-37" ], [ "6c872be6b2fbe83890e28ddc1098a3-C001-48" ], [ "6c872be6b2fbe83890e28ddc1098a3-C001-72" ] ], "cite_sentences": [ "6c872be6b2fbe83890e28ddc1098a3-C001-37", "6c872be6b2fbe83890e28ddc1098a3-C001-48", "6c872be6b2fbe83890e28ddc1098a3-C001-72" ] }, "@SIM@": { "gold_contexts": [ [ "6c872be6b2fbe83890e28ddc1098a3-C001-51" ] ], "cite_sentences": [ "6c872be6b2fbe83890e28ddc1098a3-C001-51" ] } } }, "ABC_c67297dc1c4376dc715bf5c1c9132f_34": { "x": [ { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-37", "text": "----------------------------------" }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-38", "text": "**WORD SENSE DISAMBIGUATION 
(WSD)**" }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-107", "text": "ELMo performances vary more with different output layers, compared to BERT." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-36", "text": "BERT has an architecture of a multi-layer bidirectional transformer encoder (Devlin et al., 2019) ." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-12", "text": "One could expect considerable performance boosts by simply substituting distributional word embeddings with Flair (Akbik et al., 2018) , ELMo (Peters et al., 2018) , and BERT (Devlin et al., 2019) embeddings." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-13", "text": "The unique thing about contextualized word embeddings is that different representations are generated for the same word type with different topical senses." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-2", "text": "We present a novel online algorithm that learns the essence of each dimension in word embeddings by minimizing the within-group distance of contextualized embedding groups." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-3", "text": "Three state-of-the-art neural-based language models are used, Flair, ELMo, and BERT, to generate contextualized word embeddings such that different embeddings are generated for the same word type, which are grouped by their senses manually annotated in the SemCor dataset." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-4", "text": "We hypothesize that not all dimensions are equally important for downstream tasks so that our algorithm can detect unessential dimensions and discard them without hurting the performance." 
}, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-5", "text": "To verify this hypothesis, we first mask dimensions determined unessential by our algorithm, apply the masked word embeddings to a word sense disambiguation task (WSD), and compare its performance against the one achieved by the original embeddings." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-6", "text": "Several KNN approaches are experimented to establish strong baselines for WSD." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-7", "text": "Our results show that the masked word embeddings do not hurt the performance and can improve it by 3%." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-8", "text": "Our work can be used to conduct future research on the interpretability of contextualized embeddings." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-9", "text": "----------------------------------" }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-11", "text": "Contextualized word embeddings have played an essential role in many NLP tasks." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-14", "text": "This work focuses on interpreting embedding representations for word senses." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-15", "text": "We propose an algorithm (Section 3) that learns the dimension importance in representing sense information and then mask unessential dimensions that are deemed less meaningful in word sense representations to 0." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-16", "text": "The effectiveness of our approach is validated by a word sense disambiguation task (WSD) that aims to distinguish the correct senses of words under different contexts, as well as two intrinsic evaluations of embedding groups on the masked embeddings." 
}, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-17", "text": "In addition to the final outputs of Flair, ELMo and BERT embeddings, hidden layer outputs from ELMo and BERT are also extracted and compared." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-18", "text": "Our results show that masking unessential dimensions of word embeddings does not impair the performance on WSD; moreover, discarding those dimensions can improve the performance up to 3%, which suggests a new method for embedding distillation for more efficient neural network modeling." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-19", "text": "----------------------------------" }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-20", "text": "**RELATED WORK**" }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-21", "text": "----------------------------------" }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-22", "text": "**WORD EMBEDDING INTERPRETIBILITY**" }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-23", "text": "In the earlier work, Murphy et al. (2012) suggest a variant of sparse matrix factorization, which generates highly interpretable word representations." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-24", "text": "Based on that work, Jang and Myaeng (2017) introduce a method analyzing dimensions characterizing categories by linking concepts with types and comparing dimension values within concept groups with the average of dimension values within category groups." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-25", "text": "Works have also investigated ways to enrich embedding interpretability by modifying the training process of word embedding models (Luo et al., 2015; Ko\u00e7 et al., 2018) ." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-26", "text": "Others make use of pre-trained embeddings and apply post-processing techniques to acquire embeddings with more interpretability." 
}, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-27", "text": "Past researches use matrix transformation methods on pre-trained embeddings (Zobnin, 2017; Park et al., 2017; Shin et al., 2018) ." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-28", "text": "Zobnin (2017) utilizes canonical orthogonal transformations to map current embeddings to a new vector space where the vectors are more interpretable." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-29", "text": "Similarly, Park et al. (2017) proposes an approach that rotates pre-trained embedding by minimizing the complexity function, so that the dimensions after rotation become more interpretable." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-30", "text": "Another type of methods applies sparse encoding techniques on word embeddings and map them to sparse vectors (Subramanian et al., 2018; Arora et al., 2018) ." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-31", "text": "----------------------------------" }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-32", "text": "**CONTEXTUALIZED WORD EMBEDDING MODELS**" }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-33", "text": "Three popular word embedding algorithms are used for our experiments with various dimensions: ELMo, Flair, and BERT." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-34", "text": "ELMo is a deep word-level bidirectional LSTM language model with character level convolution networks along with a final linear projection output layer (Peters et al., 2018) ." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-35", "text": "Flair is a character-level bidirectional LSTM language model on sequences of characters (Akbik et al., 2018) ." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-39", "text": "This work uses WSD as the evaluation for the proposed algorithm, which is the task of determining which sense a target word belongs to in a sentence." 
}, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-40", "text": "This work adopts a supervised approach that makes use of sense-annotated training data." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-41", "text": "The Most Frequent Sense (MFS) heuristic is the most common baseline, which selects the most frequent sense in the training data for the target word (Raganato et al., 2017a) ." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-42", "text": "Depending on the evaluation dataset, the state-of-art in WSD varies." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-43", "text": "Raganato et al. (2017b) utilize bi-LSTM networks with attention mechanism and a softmax layer." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-44", "text": "Melamud et al. (2016) and Peters et al. (2018) also adopt bi-LSTM networks with KNN classifiers." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-45", "text": "Later work incorporates word features such as gloss and POS information into memory networks (Luo et al., 2018; Papandrea et al., 2017) ." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-46", "text": "----------------------------------" }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-47", "text": "**SENSE WEIGHT TRAINING (SWT)**" }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-48", "text": "Given a large embedding dimension size, the hypothesis is that not every embedding dimension plays a role in representing a sense." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-49", "text": "Here we propose a new algorithm to determine the importance of dimensions." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-50", "text": "With word embedding groups classified by their senses annotated in the SemCor dataset (Miller et al., 1994) , the objective function in this algorithm is to maximize the average pairwise cosine similarity in all sense groups." 
}, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-51", "text": "A weight matrix with the same size of the word embedding is initialized for each sense." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-52", "text": "Each dimension represents the importance of a specific dimension to that sense." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-53", "text": "----------------------------------" }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-54", "text": "**ALGORITHM 1 ALGORITHM FOR INCREMENTAL SENSE WEIGHT TRAINING**" }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-55", "text": "for each sense group SG do initialize weights w, learning rate \u03b3 0 , Adagrad" }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-56", "text": "generate N numbers based on policy:" }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-57", "text": "During training, a mask matrix is generated and applied to the weight matrix." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-58", "text": "The gradient of the algorithm is defined to be the difference between the current similarity score and the previous similarity score multiplied by the masking matrix subtracted by one." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-59", "text": "The weight matrix is updated during training with the gradients and a learning rate." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-60", "text": "The mask matrix is the size of the weight matrix and has N dimensions being zero and the rest being one." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-61", "text": "The generation of the mask matrix involves two phases." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-62", "text": "In the first phase, SWT randomly generates N positions of zeros to ensure enough dimensions have been covered." 
}, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-63", "text": "After a certain number of epochs, the training enters the second phase where an exploration-exploitation policy is employed." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-64", "text": "The policy states that there is a chance of \u03b1 to randomly generate N numbers." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-65", "text": "For the remaining 1\u2212\u03b1 possibility, the generation of N numbers depends on the weight matrix: the higher the value of dimension in the weight matrix, the lower probability of the number getting selected." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-66", "text": "Furthermore, l 1 regularization is applied for feature selection purpose, and AdaGrad (Duchi et al., 2011 ) is used to encourage convergence." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-67", "text": "Pseudocode for SWT is in Algorithm 1, where n is the number of epochs for exploration, \u03bb the parameter for l 1 regularization and a small number to prevent zero denominators in AdaGrad." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-68", "text": "After the weights are learned, we set the value of embedding dimensions with low importance to zero and test if the rest dimensions are enough to represent the word sense group." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-69", "text": "----------------------------------" }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-70", "text": "**EXPERIMENTS**" }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-71", "text": "Firstly, all the experiments using the original word embeddings are run." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-72", "text": "Then, using the trained weight matrix from Section 3, the same tests are run on masked embeddings again for comparison." 
}, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-73", "text": "----------------------------------" }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-74", "text": "**DATASETS AND WORD EMBEDDINGS**" }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-75", "text": "Our proposed baselines and algorithms are trained on SemCor (Miller et al., 1994) and evaluated on SenEval-2 (Edmonds and Cotton, 2001) , SenEval-3 (Snyder and Palmer, 2004) , SemEval'07 (Pradhan et al., 2007) , SemEval'13 (Navigli et al., 2013) and SemEval'15 (Moro and Navigli, 2015) ." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-76", "text": "Pre-trained contextualized word embeddings are exclusively used and compared." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-77", "text": "Pre-trained ELMo, BERT and Flair models are tested." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-78", "text": "The models include ELMo's three models with dimension sizes of 256, 512 and 1,024 (all with 2-layer bi-LSTM), BERT's 2 models: BERT-base with a dimension size of 768 and 12 output layers; BERT-Large with a dimension size of 1,024 and 24 layers, and Flair's single-layer bi-LSTM models with dimension sizes of 2,048 and 4,096." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-79", "text": "----------------------------------" }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-80", "text": "**KNN METHODS**" }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-81", "text": "K-Nearest Neighbor (KNN) approach is adopted from both ELMo (Peters et al., 2018) and con-text2vec (Melamud et al., 2016) to establish strong baseline approaches." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-82", "text": "Sense-based KNN Adapted from ELMo (Peters et al., 2018) with k = 1, words that have the same senses are clustered together, and the average of that cluster is used as the sense vector, which is then fitted using a one KNN classifier." 
}, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-129", "text": "A table with relevant examples can be found in the Appendix." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-83", "text": "Unseen words from the test corpus fall back using the first sense from WordNet (Fellbaum, 1998) ." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-84", "text": "Word-based KNN Following context2vec (Akbik et al., 2018), a cluster of each lemma occurrences in the training set is formed." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-85", "text": "Each word has a distinct classifier, which will assign labels based on k, where k = min(# of occurrences, 5)." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-86", "text": "Unseen words from test corpus fall back using the first sense from WordNet." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-87", "text": "----------------------------------" }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-88", "text": "**MASKED EMBEDDINGS**" }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-89", "text": "Each sense has a trained weight matrix from Section 3." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-90", "text": "We process the weight matrix by experimenting four percentages (5%, 10%, 15%, 20%) to find the best threshold to mask out dimensions: the embedding dimensions with weight value ranked below such percentage are marked 0." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-91", "text": "For evaluations, each target word tries all the masks of its appeared senses and selects the masking that produces the closest distance d, where d is the sum of the distances from the masked word to its k-nearest neighbors." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-92", "text": "Baselines As shown in Table 1 , BERT-Large, and ELMo tend to achieve higher F1 using all four methods, and word-based KNN with fall back works better in general." 
}, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-93", "text": "Therefore, KNN-WF are used to conduct all subsequent tasks." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-94", "text": "----------------------------------" }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-95", "text": "**WF**" }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-96", "text": "----------------------------------" }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-97", "text": "**MASKED RESULTS**" }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-98", "text": "The embeddings from all output layers of ELMo, BERT and Flair are evaluated." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-99", "text": "Table 2 proves that for ELMo and Flair-2048, masking does not hurt the performance too much and for single layers, it even shows improvements." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-100", "text": "Figure 1 shows the results when 5% of the embeddings are masked." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-101", "text": "Half the embeddings are improved if being masked using the 5% threshold, especially the last 10 layer outputs." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-102", "text": "Surprisingly, the last layer output score is boosted by 3%." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-103", "text": "In Figure 2 , the first 3 layer outputs are improved by the masking with 5% threshold." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-104", "text": "Why the deeper layer outputs are not improved requires further research." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-105", "text": "In Figure 3 , both 5% and 10% masking are reported because for layer 1-1 and 3-1 10% threshold works better than 5%." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-106", "text": "For the last layer of each model, the 5% threshold surpasses the performance of the original ones." 
}, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-108", "text": "BERT-Base output layers exhibit more stable performances compared to the BERT-Large model." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-109", "text": "Furthermore, an interesting pattern for ELMo is that masking out 5% dimensions cause a more considerable performance drop for layers with worse original scores." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-110", "text": "One possible explanation is that embeddings from output layers closer to the input layer contain less insignificant dimensions." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-111", "text": "Experiments have also been done for the Flair models, which show similar results that the performances remain stable after 5% dimensions of embeddings masked to zero, as shown in Table 4 .4." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-112", "text": "In summary, masking 5% of the dimensions does not hurt the performance too much, and for half of them, masking helps improve the score by 3 percent at most." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-113", "text": "10% threshold sometimes outperforms the 5% threshold in ELMo hidden layers." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-114", "text": "----------------------------------" }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-115", "text": "**ANALYSIS**" }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-116", "text": "Further analysis is made to investigate the number of negligible dimensions in word embeddings." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-117", "text": "Figure 4a shows a projected graph of selected sense groups, each with 100 embeddings from one ELMo model." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-118", "text": "Figure 4b demonstrates the same word embeddings with the dimensions masked to 0 if their corresponding weights are smaller than 0.5." 
}, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-119", "text": "The masked groups display a smaller with-in group distance and a greater separation of sense groups." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-120", "text": "The Spearman's Rank-Order Correlation Coefficient \u03c1 between the pair-wise cosine similarity of (a) original (b) masked (threshold of 0.5) Figure 4 : Graphs of 20 selected sense groups with 100 embeddings each for ELMo with a dimension size of 512 (third output layer)." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-121", "text": "The projection of dimensions from 512 to 2 is done by Linear Discriminant Analysis." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-122", "text": "sense vectors (average embedding of embedding groups classified by word senses) and the pair-wise path similarity scores between senses provided by WordNet (Landes et al., 1998) is evaluated for the original word embeddings and the masked embeddings whose group sizes are larger than 100." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-123", "text": "Average pair-wise cosine similarity within sense groups is also calculated before and after." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-124", "text": "The table with all the test results is in the Appendix." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-125", "text": "Overall, the average cosine similarities within sense groups all increase after dimensions are masked out for all models, which proves that the dimension weights learn by our objective function." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-126", "text": "The correlation test shows no significant performance decrease (some even increase), which manifests that the masked dimensions do not contribute to the sense group relations." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-127", "text": "(Peters et al., 2018) ." 
}, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-128", "text": "Another pattern is that the verb sense groups tend to have less number of dimensions getting masked out because verb sense groups have more possible forms of tokens belonging to the same sense group." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-130", "text": "We also attempted to mask out embedding dimensions with higher weights." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-131", "text": "In other words, we kept only the masked dimensions in the evaluations above, to examine what information is in the discarded dimensions." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-132", "text": "We ranked the cosine similarities between masked embedding pairs and picked out the 100 top most similar ones, which fails to output any patterns and points the future research direction in this domain." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-133", "text": "----------------------------------" }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-134", "text": "**MODEL**" }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-135", "text": "----------------------------------" }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-136", "text": "**CONCLUSION**" }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-137", "text": "This paper demonstrates a novel approach to interpret word embeddings." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-138", "text": "Mainly focusing on contextbased word embeddings ability to distinguish and learn relationships in word senses, we propose an algorithm for learning the importance of dimension weights in sense groups." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-139", "text": "After training the weights for word dimensions, the dimensions with less importance are masked out and tested using a word sense disambiguation task and two other evaluations." 
}, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-140", "text": "A conclusion can be drawn from the results that some dimensions do not contribute to the representation of sense groups and our algorithm can distinguish the importance of them." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-141", "text": "Table 5 : Cosine similarity and correlation test results for unmasked and masked word embeddings: the embedding model (Model), dimension size (Dim), output layer (Out where av represents the average embedding of three output layers), the average number of dimensions that are masked to zero in embedding sense groups (Dim masked ), the correlation coefficient of original embedding sense group centers and sense relations (\u03c1 unmasked ), the correlation coefficient of embedding sense group centers with dimensions masked to 0 (\u03c1 masked ), the average within-group cosine similarity for the original embeddings (cos unmasked ) and the average within-group cosine similarity after the dimensions are masked out (cos unmasked )." }, { "sent_id": "c67297dc1c4376dc715bf5c1c9132f-C001-142", "text": "Only sense groups with a group size bigger than 100 are considered in this case." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "c67297dc1c4376dc715bf5c1c9132f-C001-12" ], [ "c67297dc1c4376dc715bf5c1c9132f-C001-34" ], [ "c67297dc1c4376dc715bf5c1c9132f-C001-44" ] ], "cite_sentences": [ "c67297dc1c4376dc715bf5c1c9132f-C001-12", "c67297dc1c4376dc715bf5c1c9132f-C001-34", "c67297dc1c4376dc715bf5c1c9132f-C001-44" ] }, "@USE@": { "gold_contexts": [ [ "c67297dc1c4376dc715bf5c1c9132f-C001-81" ], [ "c67297dc1c4376dc715bf5c1c9132f-C001-82" ] ], "cite_sentences": [ "c67297dc1c4376dc715bf5c1c9132f-C001-81", "c67297dc1c4376dc715bf5c1c9132f-C001-82" ] } } }, "ABC_2e636754342e9bb857068922519dbc_34": { "x": [ { "sent_id": "2e636754342e9bb857068922519dbc-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-2", "text": "In this paper, we investigate a new approach to Population, Intervention and Outcome (PIO) element detection, a common task in Evidence Based Medicine (EBM)." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-3", "text": "The purpose of this study is two-fold: to build a training dataset for PIO element detection with minimum redundancy and ambiguity and to investigate possible options in utilizing state of the art embedding methods for the task of PIO element detection." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-4", "text": "For the former purpose, we build a new and improved dataset by investigating the shortcomings of previously released datasets." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-5", "text": "For the latter purpose, we leverage the state of the art text embedding, Bidirectional Encoder Representations from Transformers (BERT), and build a multi-label classifier." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-6", "text": "We show that choosing a domain specific pre-trained embedding further optimizes the performance of the classifier." 
}, { "sent_id": "2e636754342e9bb857068922519dbc-C001-7", "text": "Furthermore, we show that the model could be enhanced by using ensemble methods and boosting techniques provided that features are adequately chosen." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-8", "text": "----------------------------------" }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-10", "text": "Evidence-based medicine (EBM) is of primary importance in the medical field." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-11", "text": "Its goal is to present statistical analyses of issues of clinical focus based on retrieving and analyzing numerous papers in the medical literature (Haynes et al., 1997) ." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-12", "text": "The PubMed database is one of the most commonly used databases in EBM (Sackett et al., 1996) ." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-34", "text": "**DATASETS**" }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-35", "text": "In this study, we introduce PICONET, a multilabel dataset consisting of sequences with labels Population/Problem (P), Intervention (I), and Outcome (O)." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-36", "text": "This dataset was created by collecting structured abstracts from PubMed and carefully choosing abstract headings representative of the desired categories." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-37", "text": "The present approach is an improvement over a similar approach used in (Jin and Szolovits, 2018) ." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-38", "text": "Our aim was to perform automatic labeling while removing as much ambiguity as possible." 
}, { "sent_id": "2e636754342e9bb857068922519dbc-C001-39", "text": "We performed a search on April 11, 2019 on PubMed for 363,078 structured abstracts with the following filters: Article Types (Clinical Trial), Species (Humans), and Languages (English)." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-40", "text": "Structured abstract sections from PubMed have labels such as introduction, goals, study design, findings, or discussion; however, the majority of these labels are not useful for P, I, and O extraction since most are general (e.g. methods) and do not isolate a specific P, I, O sequence." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-41", "text": "Therefore, in order to narrow down abstract sections that correspond to the P label, for example, we needed to find a subset of labels such as, but not limited to population, patients, and subjects." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-42", "text": "We performed a lemmatization of the abstract section labels in order to cluster similar categories such as subject and subjects." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-43", "text": "Using this approach, we carefully chose candidate labels for each P, I, and O, and manually looked at a small number of samples for each label to determine if text was representative." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-44", "text": "Since our goal was to collect sequences that are uniquely representative of a description of Population, Intervention, and Outcome, we avoided a keyword-based approach such as in (Jin and Szolovits, 2018) ." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-45", "text": "For example, using a keyword- based approach would yield a sequence labeled population and methods with the label P, but such abstract sections were not purely about the population and contained information about the interventions and study design making them poor candidates for a P label." 
}, { "sent_id": "2e636754342e9bb857068922519dbc-C001-46", "text": "Thus, we were able to extract portions of abstracts pertaining to P, I, and O categories while minimizing ambiguity and redundancy." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-47", "text": "Moreover, in the dataset from (Jin and Szolovits, 2018) , a section labeled as P that contained more than one sentence would be split into multiple P sentences to be included in the dataset." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-48", "text": "We avoided this approach and kept the full abstract sections." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-49", "text": "The full abstracts were kept in conjunction with our belief that keeping the full section retains more feature-rich sequences for each sequence, and that individual sentences from long abstract sections can be poor candidates for the corresponding label." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-50", "text": "For sections with labels such as population and intervention, we created a mutli-label." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-51", "text": "We also included negative examples by taking sentences from sections with headings such as aim." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-52", "text": "Furthermore, we cleaned the remaining data with various approaches including, but not limited to, language identification, removal of missing values, cleaning unicode characters, and filtering for sequences between 5 and 200 words, inclusive." 
}, { "sent_id": "2e636754342e9bb857068922519dbc-C001-53", "text": "----------------------------------" }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-54", "text": "**BERT-BASED CLASSIFICATION MODEL**" }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-55", "text": "----------------------------------" }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-56", "text": "**BACKGROUND**" }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-57", "text": "BERT (Bidirectional Encoder Representations from Transformers) is a deep bidirectional attention text embedding model." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-58", "text": "The idea behind this model is to pre-train a bidirectional representation by jointly conditioning on both left and right contexts in all layers using a transformer (Vaswani et al., 2017; Devlin et al., 2018) ." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-59", "text": "Like any other language model, BERT can be pre-trained on different contexts." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-60", "text": "A contextualized representation is generally optimized for downstream NLP tasks." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-61", "text": "Since its release, BERT has been pre-trained on a multitude of corpora." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-62", "text": "In the following, we describe different BERT embedding versions used for our classification problem." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-63", "text": "The first version is based on the original BERT release (Devlin et al., 2018) ." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-64", "text": "This model is pre-trained on the BooksCorpus (800M words) (Zhu et al., 2015) and English Wikipedia (2,500M words)." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-65", "text": "For Wikipedia, text passages were extracted while lists were ignored." 
}, { "sent_id": "2e636754342e9bb857068922519dbc-C001-66", "text": "The second version is BioBERT (Lee et al., 2019) , which was trained on biomedical corpora: PubMed (4.5B words) and PMC (13.5B words)." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-67", "text": "----------------------------------" }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-68", "text": "**THE MODEL**" }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-69", "text": "The classification model is built on top of the BERT representation by adding a dense layer corresponding to the multi-label classifier with three output neurons corresponding to PIO labels." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-70", "text": "In order to insure that independent probabilities are assigned to the labels, as a loss function we have chosen the binary cross entropy with logits (BCEWithLogits) defined by" }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-71", "text": "where t and y are the target and output vectors, respectively; n is the number of independent targets (n=3)." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-72", "text": "The outputs are computed by applying the logistic function to the weighted sums of the last hidden layer activations, s," }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-73", "text": "For the original BERT model, we have chosen the smallest uncased model, Bert-Base." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-74", "text": "The model has 12 attention layers and all texts are converted to lowercase by the tokenizer (Devlin et al., 2018) ." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-75", "text": "The architecture of the model is illustrated in Figure 1 ." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-76", "text": "Using this framework, we trained the model using the two pretrained embedding models described in the previous section." 
}, { "sent_id": "2e636754342e9bb857068922519dbc-C001-77", "text": "It is worth to mention that the embedding is contextualized during the training phase." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-78", "text": "For both models, the pretrained embedding layer is frozen during the first epoch (the embedding vectors are not updated)." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-79", "text": "After the first epoch, the embedding layer is unfrozen and the vectors are fine-tuned for the classification task during training." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-80", "text": "The advantage of this approach is that few parameters need to be learned from scratch (Howard and Ruder, 2018; Radford et al., 2018; Devlin et al., 2018) ." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-81", "text": "----------------------------------" }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-82", "text": "**RESULTS**" }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-83", "text": "----------------------------------" }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-84", "text": "**PERFORMANCE COMPARISON**" }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-85", "text": "In order to quantify the performance of the classification model, we computed the precision and recall scores." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-86", "text": "On average, it was found that the model leads to better results when trained using the BioBERT embedding." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-87", "text": "In addition, the performance of the PIO classifier was measured by averaging the three Area Under Receiver Operating Characteristic Curve (ROC AUC) scores for P, I, and O. The ROC AUC score of 0.9951 was obtained by the model using the general BERT embedding." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-88", "text": "This score was improved to 0.9971 when using the BioBERT model pre-trained on medical context." 
}, { "sent_id": "2e636754342e9bb857068922519dbc-C001-89", "text": "The results are illustrated in Figure 2." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-90", "text": "----------------------------------" }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-91", "text": "**MODEL BOOSTING**" }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-92", "text": "We further applied ensemble methods to enhance the model." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-93", "text": "This approach consists of combin- ing predictions from base classifiers with features of the input data to increase the accuracy of the model (Merz, 1999) ." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-94", "text": "We investigate an important family of ensemble methods known as boosting, and more specifically a Light Gradient Boosting Machine (LGBM) algorithm, which consists of an implementation of fast gradient boosting on decision trees." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-95", "text": "In this study, we use a library implemented by Microsoft (Ke et al., 2017) ." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-96", "text": "In our model, we learn a linear combination of the prediction given by the base classifiers and the input text features to predict the labels." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-97", "text": "As features, we consider the average term frequency-inverse document frequency (TF-IDF) score for each instance and the frequency of occurrence of quantitative information elements (QIEF) (e.g. percentage, population, dose of medicine)." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-98", "text": "Finally, the output of the binary cross entropy with logits layer (predicted probabilities for the three classes) and the feature information are fed to the LGBM." 
}, { "sent_id": "2e636754342e9bb857068922519dbc-C001-99", "text": "We train the base classifier using the original training dataset, using 60% of the whole data as training dataset, and use a five-fold crossvalidation framework to train the LGBM on the remaining 40% of the data to avoid any information leakage." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-100", "text": "We train the LGBM on four folds and test on the excluded one and repeat the process for all five folds." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-101", "text": "The results of the LGBM classifier for the different boosting frameworks and the scores from the base classifiers are illustrated in Table 2 ." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-102", "text": "The highest average ROC AUC score of 0.9998 is obtained in the case of combining the two base learners along with the TF-IDF and QIEF features." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-103", "text": "----------------------------------" }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-104", "text": "**DISCUSSION AND CONCLUSION**" }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-105", "text": "In this paper, we presented an improved methodology to extract PIO elements, with reduced ambiguity, from abstracts of medical papers." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-106", "text": "The proposed technique was used to build a dataset of PIO elements that we call PICONET." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-107", "text": "We further proposed a model of PIO elements classification using state of the art BERT embedding." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-108", "text": "It has been shown that using the contextualized BioBERT embedding improved the accuracy of the classifier." 
}, { "sent_id": "2e636754342e9bb857068922519dbc-C001-109", "text": "This result reinforces the idea of the importance of embedding contextualization in subsequent classification tasks in this specific context." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-110", "text": "In order to enhance the accuracy of the model, we investigated an ensemble method based on the LGBM algorithm." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-111", "text": "We trained the LGBM model, with the above models as base learners, to optimize the classification by learning a linear combination of the predicted probabilities, for the three classes, with the TF-IDF and QIEF scores." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-112", "text": "The results indicate that these text features were adequate for boosting the contextualized classification model." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-113", "text": "We compared the performance of the classifier when using the features with one of the base learners and the case where we combine the base learners along with the features." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-114", "text": "We obtained the best performance in the latter case." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-115", "text": "The present work resulted in the creation of a PIO elements dataset, PICONET, and a classification tool." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-116", "text": "These constitute an important component of our system of automatic mining of medical abstracts." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-117", "text": "We intend to extend the dataset to full medical articles." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-118", "text": "The model will be modified to take into account the higher complexity of full text data and more efficient features for model boosting will be investigated." 
}, { "sent_id": "2e636754342e9bb857068922519dbc-C001-13", "text": "Biomedical papers, describing randomized controlled trials in medical intervention, are published at a high rate every year." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-14", "text": "The volume of these publications makes it very challenging for physicians to find the best medical intervention for a given patient group and condition (Borah et al., 2017) ." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-15", "text": "Computational methods and natural language processing (NLP) could be adopted in order to expedite the process of biomedical evidence synthesis." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-16", "text": "Specifically, NLP tasks applied to well structured documents and queries can help physicians extract appropriate information to identify the best available evidence in the context of medical treatment." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-17", "text": "Clinical questions are formed using the PIO framework, where clinical issues are broken down into four components: Population/Problem (P), Intervention (I), Comparator (C), and Outcome (O)." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-18", "text": "We will refer to these categories as PIO elements, by using the common practice of merging the C and I categories." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-19", "text": "In (Rathbone et al., 2017) a literature screening performed in 10 systematic reviews was studied." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-20", "text": "It was found that using the PIO framework can significantly improve literature screening efficacy." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-21", "text": "Therefore, efficient extraction of PIO elements is a key feature of many EBM applications and could be thought of as a multilabel sentence classification problem." 
}, { "sent_id": "2e636754342e9bb857068922519dbc-C001-22", "text": "Previous works on PIO element extraction focused on classical NLP methods, such as Naive Bayes (NB), Support Vector Machines (SVM) and Conditional Random Fields (CRF) (Chung, 2009; Boudin et al., 2010) ." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-23", "text": "These models are shallow and limited in terms of modeling capacity." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-24", "text": "Furthermore, most of these classifiers are trained to extract PIO elements one by one which is suboptimal since this approach does not allow the use of shared structure among the individual classifiers." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-25", "text": "Deep neural network models have increased in popularity in the field of NLP." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-26", "text": "They have pushed the state of the art of text representation and information retrieval." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-27", "text": "More specifically, these techniques enhanced NLP algorithms through the use of contextualized text embeddings at word, sentence, and paragraph levels (Mikolov et al., 2013; Le and Mikolov, 2014; Peters et al., 2017; Devlin et al., 2018; Logeswaran and Lee, 2018; Radford et al., 2018) ." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-28", "text": "More recently, Jin and Szolovits (2018) components from PubMed abstracts." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-29", "text": "To our knowledge, that study was the first in which a deep learning framework was used to extract PIO elements from PubMed abstracts." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-30", "text": "In the present paper, we build a dataset of PIO elements by improving the methodology found in (Jin and Szolovits, 2018) ." 
}, { "sent_id": "2e636754342e9bb857068922519dbc-C001-31", "text": "Furthermore, we built a multi-label PIO classifier, along with a boosting framework, based on the state of the art text embedding, BERT." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-32", "text": "This embedding model has been proven to offer a better contextualization compared to a bidirectional LSTM model (Devlin et al., 2018) ." }, { "sent_id": "2e636754342e9bb857068922519dbc-C001-33", "text": "----------------------------------" } ], "y": { "@BACK@": { "gold_contexts": [ [ "2e636754342e9bb857068922519dbc-C001-27" ], [ "2e636754342e9bb857068922519dbc-C001-58" ] ], "cite_sentences": [ "2e636754342e9bb857068922519dbc-C001-27", "2e636754342e9bb857068922519dbc-C001-58" ] }, "@DIF@": { "gold_contexts": [ [ "2e636754342e9bb857068922519dbc-C001-31", "2e636754342e9bb857068922519dbc-C001-32" ] ], "cite_sentences": [ "2e636754342e9bb857068922519dbc-C001-32" ] }, "@MOT@": { "gold_contexts": [ [ "2e636754342e9bb857068922519dbc-C001-58" ] ], "cite_sentences": [ "2e636754342e9bb857068922519dbc-C001-58" ] }, "@USE@": { "gold_contexts": [ [ "2e636754342e9bb857068922519dbc-C001-63" ], [ "2e636754342e9bb857068922519dbc-C001-74" ] ], "cite_sentences": [ "2e636754342e9bb857068922519dbc-C001-63", "2e636754342e9bb857068922519dbc-C001-74" ] }, "@UNSURE@": { "gold_contexts": [ [ "2e636754342e9bb857068922519dbc-C001-80" ] ], "cite_sentences": [ "2e636754342e9bb857068922519dbc-C001-80" ] } } }, "ABC_c6ae69051a6d9111dea1a6e8405ac9_34": { "x": [ { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-2", "text": "We present a novel technique for training translation models for statistical machine translation by aligning source sentences to their oracle-BLEU translations." 
}, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-3", "text": "In contrast to previous approaches which are constrained to phrase training, our method also allows the re-estimation of reordering models along with the translation model." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-4", "text": "Experiments show an improvement of up to 0.8 BLEU for our approach over a competitive Arabic-English baseline trained directly on the word-aligned bitext using heuristic extraction." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-5", "text": "As an additional benefit, the phrase table size is reduced dramatically to only 3% of the original size." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-6", "text": "----------------------------------" }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-8", "text": "In phrase-based SMT, the phrase pairs in the translation model are traditionally trained by applying a heuristic extraction method (Och and Ney, 2000) which extracts phrase pairs based on consistency of word alignments from a word-aligned bilingual training data." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-9", "text": "The probabilities of the translation model are then calculated based on the relative frequencies of the extracted phrase pairs." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-10", "text": "A notable shortcoming of this approach is that the translation model probabilities thus calculated from the training bitext can be unintuitive and unreliable (Marcu and Wong, 2002; Foster et al., 2006) as they reflect only the distribution over the phrase pairs observed in the training data." 
}, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-11", "text": "However, from an SMT perspective it is important that the models reflect probability distributions which are preferred by the decoding process, i.e., phrase translations which are likely to be used frequently to achieve better translations should get higher scores and phrases which are less likely to be used should get low scores." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-12", "text": "In addition, the heuristic extraction algorithm generates all possible, consistent phrases including overlapping phrases." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-13", "text": "This means that translation probabilities are distributed over a very large number of phrase translation candidates most of which never lead to the best possible translation of a sentence." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-14", "text": "In this paper, we propose a novel solution which is to re-estimate the models from the best BLEU translation of each source sentence in the bitext." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-15", "text": "An important contribution of our approach is that unlike previous approaches such as forced alignment (Wuebker et al., 2010) , reordering and language models can also be re-estimated." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-16", "text": "----------------------------------" }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-17", "text": "**RELATED WORK**" }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-18", "text": "The forced alignment technique of Wuebker et al. (2010) forms the main motivation for our work." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-19", "text": "In forced alignment, given a sentence pair (F, E), a decoder determines the best phrase segmentation and alignment which will result in a translation of F into E. 
The best segmentation is defined as the one which maximizes the probability of translating the source sentence into the given target sentence." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-20", "text": "At the end, the phrase table is re-estimated using the phrase pair segmentations obtained from forced decoding." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-21", "text": "Thus forced alignment is a reestimation technique where translation probabilities are calculated based on their frequency in best-scoring hypotheses instead of the frequencies of all possible phrase pairs in the bitext." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-22", "text": "However, one limitation of forced alignment is that only the phrase translation model can be re-estimated since it is restricted to align the source sentence to the given target reference, thus fixing the choice of reordering decisions." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-23", "text": "A similar line of work is proposed by and who use a self-enhancing strategy to utilize additional mono- lingual source language data by aligning it to its target language translation obtained by using an SMT system to rank sentence translation probabilities." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-24", "text": "However, the main focus of their work is translation model adaptation by augmenting the bitext with additional training data and not the reestimation of the translation models trained on the parallel data." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-25", "text": "In this work, we propose that aligning source sentences to their oracle BLEU translations provides a more realistic estimate of the models from the decoding perspective instead of aligning them to high quality human translations as in forced decoding." 
}, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-26", "text": "Another relevant line of research relates tuning (weight optimisation), where our work lies between forced decoding (Wuebker et al., 2010) and the bold updating approach of (Liang et al., 2006) ." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-27", "text": "However, our approach specifically proposes a novel method for training models using oracle BLEU translations." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-28", "text": "----------------------------------" }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-29", "text": "**MODEL RE-ESTIMATION**" }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-30", "text": "The idea of our approach is to re-estimate the models with n-best oracle-BLEU translations and sentence alignments resulting from decoding the source sentence." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-31", "text": "Given a source and its reference translation, the oracle-BLEU translation is defined as the translation output with highest BLEU score." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-32", "text": "Oracle BLEU translations have been previously used for different analytical purposes in SMT (Srivastava et al., 2011; Dreyer et al., 2007; Wisniewski et al., 2010) ." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-33", "text": "Figure 1 shows example of word alignment obtained from EM training, segmentations and alignment obtained from forced decoding and oracle-BLEU re-estimation." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-34", "text": "----------------------------------" }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-35", "text": "**ORACLE BLEU**" }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-36", "text": "Ideally, one would like to re-estimate translation models directly from the n-best BLEU translations." 
}, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-37", "text": "However there are two problems in calculating BLEU for individual sentence: First, as discussed in (Chiang et al., 2008) , BLEU is not designed to be used for sentences in isolation where it can exhibit rather volatile behavior." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-38", "text": "Hence, following their work and (Watanabe et al., 2007) , we calculate BLEU for a sentence in the context of a exponentially-weighted moving average of previous translations." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-39", "text": "We briefly discuss the computation from (Chiang et al., 2008) as follows: Given a source sentence f, and its reference translation r, for an n-best translation e * , let c(e) be defined as the vector of target length |e|, source length |f|, reference length |r|, and the number of n-gram matches between e and r, then two pseudo document parameters O and O f are defined as:" }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-40", "text": "O is an exponentially-weighted moving average of the vectors from previous sentences and O f is the correction of source length with respect to the previous sentences." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-41", "text": "Then the BLEU score for a sentence pairs (f,r) and translation e * is defined as:" }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-42", "text": "The second problem as discussed in Chiang et al. (2008) is that due to noise in the training data, a high-BLEU translation may contain certain rules which are unlikely to be used by the model." 
}, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-43", "text": "Hence following them, we use a weighted combination of BLEU and model score to select the n-best list:" }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-44", "text": "where B(e) and h(e) are the BLEU and model scores of the candidate translation and w is the optimised weights for the models, \u00b5 controls the preference between BLEU and model scores to determine oracle translations." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-45", "text": "We set \u00b5=0.5 to balance between BLEU scores almost as high as the max-BLEU translations, while staying close to translations preferred by the model." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-46", "text": "We also conducted a set of experiments with \u00b5=0 (pure or absolute BLEU) in order to verify the necessity for the optimal combination." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-47", "text": "The lower scores for this setting as compared to the baseline verified that using only the best BLEU translation indeed degrades the performance of the re-estimated models." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-48", "text": "This finding for the optimal value of \u00b5 has also been established in (Chiang et al., 2008) through a series of experiments." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-49", "text": "----------------------------------" }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-50", "text": "**TRAINING**" }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-51", "text": "For obtaining the oracle-BLEU translations, we first train the translation models from the bitext using the standard pipeline of word alignment and heuristic extraction." 
}, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-52", "text": "Along with the phrase translation and language models, we also train a bilingual language model (BiLM) (Niehues et al., 2011; Garmash and Monz, 2014) , as well as lexicalized (Tillman, 2004) and hierarchical reordering models (Galley and Manning, 2008) ." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-53", "text": "We use a BiLM specifically as an instance of a reordering model in order to determine the effect of re-estimating re-ordering decisions from oracle-BLEU translations." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-54", "text": "We use the decoder trained on these models to translate the training bitext." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-55", "text": "Along with the 1-best translation (based on model scores), we also store search graphs or lattices generated during the translations process." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-56", "text": "Using the target sentences, we convert the translation lattice to an isomorphic oracle-BLEU lattice which has the same set of nodes but the edges represent BLEU score differences corresponding to each transition." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-57", "text": "Finally, we extract n-best candidate translations from the graphs ranked on BLEU score as defined in Equation (3)." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-58", "text": "Using the word alignments from the initial phrase table, we extract the alignments between each source sentence and each of their n-best oracle-BLEU translations." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-59", "text": "Finally, we re-train the phrase translations, re-ordering and BiLM on these translations and alignments." 
}, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-60", "text": "----------------------------------" }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-61", "text": "**AVOIDING OVER-FITTING**" }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-62", "text": "Re-estimation of the translation models from the n-best translation of the bitext could re-enforce the probabilities of the low frequency phrase pairs in the re-estimated models leading to over-fitting." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-63", "text": "Within forced decoding, Wuebker et al. (2010) address this problem by using a leave-one-out approach where they modify the phrase translation probabilities for each sentence pair by removing the counts of all phrases that were extracted from that particular sentence." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-64", "text": "However, in our approach, we do not impose a constraint to produce the exact translation, instead we use the highest BLEU translations which may be very different from the references." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-65", "text": "Thus it is not strictly necessary to apply leave-one-out in our approach as a solution to over-fitting." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-66", "text": "Instead, we handle the problem by simply removing all the phrase pairs below a threshold count which in our case is 2," }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-67", "text": "therefore removing phrase pairs with high probability but low frequency." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-68", "text": "----------------------------------" }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-69", "text": "**EXPERIMENTAL SET UP**" }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-70", "text": "Our experiments are carried out for an ArabicEnglish parallel corpus of approximately 1 million sentence pairs." 
}, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-71", "text": "We establish a baseline system by training models on this bitext and then compare this to a forced decoding implementation and to oracle-BLEU re-estimation using the same bitext." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-72", "text": "----------------------------------" }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-73", "text": "**BASELINE AND FORCED DECODING**" }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-74", "text": "The initial training corpus we use is a collection of parallel sentences taken from OpenMT data sources released by the LDC." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-75", "text": "Phrase table, distortion models and the lexical BiLM are trained with initial alignments obtained using GIZA++ (Och and Ney, 2003) ." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-76", "text": "The English 5-gram target language model is trained with Kneser-Ney smoothing on news data of nearly 1.6B tokens." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-77", "text": "We use an in-house phrase-based SMT system similar to Moses." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-78", "text": "For all settings in this paper, weights were optimized on NIST's MT04 data set using pairwise ranked optimization (Hopkins and May, 2011) ." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-79", "text": "For forced alignment we use the existing implementation within the Moses SMT toolkit (Koehn et al., 2007) trained on the baseline phrase translation model." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-80", "text": "In order to increase the chances of producing the exact reference, we follow Foster and Kuhn (2012) and relax the standard decoding parameters as follows: distortion limit=\u221e, stack size=2000, beam width=10e-30, and no threshold pruning of the translation model." 
}, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-81", "text": "----------------------------------" }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-82", "text": "**ORACLE BLEU RE-ESTIMATION**" }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-83", "text": "To obtain oracle-BLEU translations, we first train an initial SMT system and use it to decode the bitext." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-84", "text": "This system is identical to the baseline system except for the removal of low-frequency phrase pairs from the baseline phrase table as described in Section 3.3." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-85", "text": "To obtain the n-best oracle-BLUE translations, we experiment with different values of n, where n \u2208 {1, 10, 100}. From these oracle-BLEU translations and alignments all phrases that were used in the derivation of these nbest sentences are extracted and the models are reestimated by re-calculating the translation probabilities." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-86", "text": "Hierarchical and lexicalized re-ordering models as well as the BiLM are re-trained using the source sentences, oracle-BLEU translations and word alignments." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-87", "text": "For testing the performance of the re-estimated models, we tune different systems while replacing the baseline models with the corresponding re-estimated models." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-88", "text": "We also experiment with the interpolation of re-estimated models with the respective baseline models." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-89", "text": "We evaluate against 4 test sets: MT05, MT06, MT08, and MT09." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-90", "text": "Case-insensitive 4-gram BLEU (Papineni et al., 2002 ) is used as evaluation metric." 
}, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-91", "text": "Approximate randomization (Noreen., 1989; Riezler and Maxwell, 2005 ) is used to detect statistically significant differences." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-92", "text": "----------------------------------" }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-93", "text": "**RESULTS**" }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-94", "text": "We discuss the experimental results of our oracle-BLEU re-estimation approach for different models and settings and provide a comparison with the baseline (heuristic training) and forced alignment." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-95", "text": "Re-estimated models with three different values of n \u2208 {1, 10, 100} were evaluated under three settings: phrase table re-estimation, interpolation, and BiLM re-estimation." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-96", "text": "The best improvements over the baseline are obtained by using only 1-best (n= 1) alignments as shown in Table 1 ." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-97", "text": "Surprisingly, this is in contrast with forced decoding as discussed in Wuebker et al. (2010) , where the best improvements are obtained for n = 100." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-98", "text": "Table 2 provides a comparison between BLEU improvements achieved by forced decoding (n = 100 best) and our oracle-BLEU re-estimation approach (n = 1 best) over the baseline for different models." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-99", "text": "One can see in Table 2 that while phrase table re-estimation drops substantially for forced decoding for all test sets (up to -1.4 for MT09), oracle-BLEU phrase table re-estimation shows either slight improvements or negligible drops compared to the baseline." 
}, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-100", "text": "For the linear interpolation of the re-estimated phrase table with the baseline, forced decoding shows only a slight improvement for MT06, MT08 and MT09 and still suffers from a substantial drop for MT05." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-101", "text": "On the other hand, oracle-BLEU re-estimation shows consistent improvements for all test sets with a maximum gain of up to +0.7 for MT06." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-102", "text": "It is important to note here that although linear interpolation extinguishes the advantage of a smaller phrase table size obtained by re-estimation, the improvement achieved by interpolation for oracle-BLEU re-estimation are significantly higher as compared to forced decoding." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-103", "text": "An important novelty of oracle-BLEU reestimation is that it also allows for re-training of other models alongside the phrase table." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-104", "text": "Here we provide the results for the re-estimation of a BiLM." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-105", "text": "For all test sets, BiLM re-estimation provides additional improvements over simple phrase table interpolation, demonstrating that reestimation of re-ordering models can further improve translation performance." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-106", "text": "The last row of Table 2 shows that the re-estimated BiLM on its own adds BLEU improvement of up to +0.5 (for MT09)." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-107", "text": "The highest BLEU improvement of +0.8 is achieved by using a re-estimated BiLM and an interpolated phrase ysis, we experimented with the interpolation of both the re-estimated phrase table (forced decoding and oracle-BLEU) with the baseline." 
}, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-108", "text": "However, improvements achieved with this interpolation did not surpass the best result obtained for the oracle-BLEU re-estimation." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-109", "text": "Additionally, we also compare oracle-BLEU re-estimation to forced decoding with leave-oneout (Wuebker et al., 2010) by evaluating both on a concatenation of 5 test sets (MT03, MT05-MT09)." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-110", "text": "As shown in Table 3 , even with leaveone-out, forced decoding performance drops below the baseline by -0.3 BLEU." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-111", "text": "In contrast, phrase tables re-estimated from oracle-BLEU translation achieves the same performance as the baseline." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-112", "text": "When interpolated with the baseline phrase table, both approaches show significant improvements over the baseline." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-113", "text": "This implies that only in combination with the original phrase table does forced-decoding with leave-one-out outperform the baseline." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-114", "text": "On the other hand, oracle-BLEU re-estimation by its own not only performs better than forced decoding, but also gives a performance equal to forced decoding with leave-oneout when interpolated with baseline phrase table." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-115", "text": "In addition to the BLEU improvements, our approach also results in a re-estimated phrase table with a significantly reduced size as compared to the baseline." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-116", "text": "As shown in Table 4 , out of all the settings, the minimum phrase table size after oracle-BLEU re-estimation is only 3.28% of baseline (i.e., a reduction of 96.72%) while it is 7.6% for forced decoding." 
}, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-117", "text": "----------------------------------" }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-118", "text": "**CONCLUSIONS**" }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-119", "text": "In this paper, we proposed a novel technique for improving the reliability of SMT models by model re-estimation from oracle-BLEU translations of the source sentences in the bitext." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-120", "text": "Our experimental results show BLEU score improvements of up to +0.8 points for oracle-BLEU re-estimation over a strong baseline along with a substantially reduced size of the re-estimated phrase table (3.3% of the baseline)." }, { "sent_id": "c6ae69051a6d9111dea1a6e8405ac9-C001-121", "text": "An important novelty of our approach is that it also allows for the re-estimation of re-ordering models which can yield further improvements in SMT performance as demonstrated by the re-estimation of a BiLM." 
} ], "y": { "@DIF@": { "gold_contexts": [ [ "c6ae69051a6d9111dea1a6e8405ac9-C001-15" ], [ "c6ae69051a6d9111dea1a6e8405ac9-C001-63", "c6ae69051a6d9111dea1a6e8405ac9-C001-64" ], [ "c6ae69051a6d9111dea1a6e8405ac9-C001-96", "c6ae69051a6d9111dea1a6e8405ac9-C001-97" ] ], "cite_sentences": [ "c6ae69051a6d9111dea1a6e8405ac9-C001-15", "c6ae69051a6d9111dea1a6e8405ac9-C001-63", "c6ae69051a6d9111dea1a6e8405ac9-C001-97" ] }, "@MOT@": { "gold_contexts": [ [ "c6ae69051a6d9111dea1a6e8405ac9-C001-18" ] ], "cite_sentences": [ "c6ae69051a6d9111dea1a6e8405ac9-C001-18" ] }, "@SIM@": { "gold_contexts": [ [ "c6ae69051a6d9111dea1a6e8405ac9-C001-26" ] ], "cite_sentences": [ "c6ae69051a6d9111dea1a6e8405ac9-C001-26" ] }, "@BACK@": { "gold_contexts": [ [ "c6ae69051a6d9111dea1a6e8405ac9-C001-63" ] ], "cite_sentences": [ "c6ae69051a6d9111dea1a6e8405ac9-C001-63" ] }, "@USE@": { "gold_contexts": [ [ "c6ae69051a6d9111dea1a6e8405ac9-C001-109" ] ], "cite_sentences": [ "c6ae69051a6d9111dea1a6e8405ac9-C001-109" ] } } }, "ABC_4edb60770ebeafc56446aeca9a3b2e_34": { "x": [ { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-2", "text": "Relation extraction is one of the core challenges in automated knowledge base construction." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-3", "text": "One line of approach for relation extraction is to perform multi-hop reasoning on the paths connecting an entity pair to infer new relations." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-4", "text": "While these methods have been successfully applied for knowledge base completion, they do not utilize the entity or the entity type information to make predictions." 
}, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-5", "text": "In this work, we incorporate selectional preferences, i.e., relations enforce constraints on the allowed entity types for the candidate entities, to multi-hop relation extraction by including entity type information." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-6", "text": "We achieve a 17.67% (relative) improvement in MAP score in a relation extraction task when compared to a method that does not use entity type information." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-7", "text": "----------------------------------" }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-9", "text": "Knowledge Bases (KB's) are structured knowledge sources widely used in applications like question answering (Kwiatkowski et al., 2013; Berant et al., 2013; Bordes et al., 2014) and search engines like Google Search and Microsoft Bing." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-10", "text": "This has led to the creation of large KB's like Freebase (Bollacker et al., 2008) , YAGO (Suchanek et al., 2007) and NELL (Carlson et al., 2010 )." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-11", "text": "KB's contains millions of facts usually in the form of triples (entity1, relation, entity2)." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-12", "text": "However, KB's are woefully incomplete (Min et al., 2013) , missing important facts, and hence limiting their usefulness in downstream tasks." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-13", "text": "Figure 1: The two paths above consist of the same relations (locatedIn \u2192 locatedIn) and, hence, the model of Neelakantan (2015) will assign them the same score for the relation AirportServesPlace without considering the fact that Yankee Stadium is not an airport." 
}, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-14", "text": "To overcome this difficulty, Knowledge Base Completion (KBC) methods aim to complete the KB using existing facts." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-15", "text": "For example, we can infer nationality of a person from their place of birth." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-16", "text": "A common approach in many KBC methods for relation extraction is reasoning on individual relations (single-hop reasoning) to predict new relations (Mintz et al., 2009; Bordes et al., 2013; Socher et al., 2013) ." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-17", "text": "For example, predicting Nationality(X, Y) from BornIn(X, Y)." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-18", "text": "The performance of relation extraction methods have been greatly improved by incorporating selectional preferences, i.e., relations enforce constraints on the allowed entity types for the candidate entities, both in sentence level (Roth and Yih, 2007; Singh et al., 2013) and KB relation extraction (Chang et al., 2014) , and in learning entailment rules (Berant et al., 2011) ." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-19", "text": "Another line of work in relation extraction performs reasoning on the paths (multi-hop reasoning on paths of length \u2265 1) connecting an entity pair (Lao et al., 2011; Lao et al., 2012; Gardner et al., 2013; Gardner et al., 2014; Neelakantan et al., 2015; Guu et al., 2015) ." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-20", "text": "For example, these models can infer the relation PlaysInLeague(Tom Brady, NFL) from the facts PlaysForTeam(Tom Brady, New England Patriots) and PartOf(New England Patriots, NFL)." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-21", "text": "All these methods utilize only the relations in the path and do not include any information about the entities." 
}, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-22", "text": "In this work, we extend the method of Neelakantan (2015) by incorporating entity type information." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-23", "text": "Their method can generalize to paths unseen in training by composing embeddings of relations in the path non-linearly using a Recurrent Neural Network (RNN) (Werbos, 1990) ." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-24", "text": "While entity type information has been successfully incorporated into relation extraction methods that perform single hop reasoning, here, we include them for multi-hop relation extraction." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-25", "text": "For example, Figure 1 illustrates an example where reasoning without type information would score both the paths equally although the latter path should receive a lesser score since there is an entity type mismatch for the first entity." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-26", "text": "Our approach constructs vector representation of paths in the KB graph from representations of relations and entity types occurring in the path." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-27", "text": "We achieve a 17.67% improvement in Mean Average Precision (MAP) scores in a relation extraction task when compared to a method that does not use entity type information." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-28", "text": "Lastly, the SHERLOCK system (Schoenmackers et al., 2010 ) also discovers multihop clauses using typed predicates from web text, but, unlike our RNN approach it employs a Inductive Logic Programming method." 
}, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-29", "text": "----------------------------------" }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-30", "text": "**MODEL**" }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-31", "text": "This paper extends the Recurrent Neural Network model of Neelakantan (2015) by jointly reasoning over the relations and entity types occurring in the paths between an entity pair." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-32", "text": "Paths are represented Figure 2: The encoder network for a path between an entity pair." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-33", "text": "The inputs to the network are embeddings of entities, entity types and relations." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-34", "text": "This architecture corresponds to equation 4 below." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-35", "text": "The network for other equations can be obtained by setting the appropriate input embeddings to zeros." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-36", "text": "Also note we have a dummy relation token end relation for the last entity of the path." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-37", "text": "In the network above, at each time step, the entity embedding is concatenated with the sum of its type embeddings, followed by the embeddings of the relation type and are fed as input to the recurrent network as dense vectors formed by composing embeddings of relations and entities occurring at each step." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-38", "text": "Figure 2 illustrates the encoder architecture for a path between an entity pair." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-39", "text": "The [\u00b7] in figure 2 denotes the concatenate operation." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-40", "text": "As will be described later, we also try representing the entity by its observed types." 
}, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-41", "text": "The relation types considered in our work are either fixed symbolic types defined in the Freebase schema such as /people/person/nationality or a free text relation from Clueweb (Orr et al., 2013) such as born in." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-42", "text": "In Freebase, an entity is associated with several types." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-43", "text": "For example, the entity Barack Obama has types such as President, Author and Award Winner." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-44", "text": "In our work, we consider the top l types (sorted by corpus frequency) for an entity and we obtain a combined representation by summing the embeddings of types." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-45", "text": "Let v r (\u03b4) \u2208 R d denote the vector representation of relation type \u03b4." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-46", "text": "Let v e (e) \u2208 R m denote the vector representation of an entity e and v et (e) \u2208 R n denote the combined representation of the types of e obtained by taking the sum of the representation of its top l types." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-47", "text": "Let \u03c0 be a path between the entity pair (e1, e2) containing the relation types \u03b4 1 , \u03b4 2 , . . . , \u03b4 N ." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-48", "text": "In the following section, we first briefly describe the model proposed by Neelakantan (2015) (RNN model henceforth) followed by our extensions to it." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-49", "text": "----------------------------------" }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-50", "text": "**RNN MODEL**" }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-51", "text": "The RNN model only considers the representations of relation type present in the path." 
}, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-52", "text": "More precisely, the vector representation h t \u2208 R p of path" }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-53", "text": "The vector representation of the entire path is h N where N is the length of the path." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-54", "text": "Here W h,h \u2208 R p\u00d7p and W rh \u2208 R p\u00d7d are composition matrices between the previous step in the path and the relation vector at the current step respectively and f is a nonlinear activation function." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-55", "text": "----------------------------------" }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-56", "text": "**EXTENSION WITH ENTITY (AND TYPES)**" }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-57", "text": "The previous model can be extended to incorporate the embeddings of entities along with relations occurring at each step in the path." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-58", "text": "We consider learning a separate representation for every entity and representing an entity using its entity types." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-59", "text": "\u2022 RNN + Entity: In this model, we add the embedding of the entity." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-60", "text": "\u2022 RNN + Type: In this model, we add the embedding of the entity obtained from its types at each step." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-61", "text": "\u2022 RNN + Entity + Type: In this model, we use both the representations of the entity." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-62", "text": "Here e t denotes the t th entity occurring in the path between an entity pair and W eh \u2208 R p\u00d7m , W th \u2208 R p\u00d7n are new composition matrices due to the entity and its types respectively." 
}, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-63", "text": "In all of our experiments f is the sigmoid activation function." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-64", "text": "----------------------------------" }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-65", "text": "**MODEL TRAINING**" }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-66", "text": "We train a separate RNN model for each target relation 1 ." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-67", "text": "The parameters for each model are the embedding of the relations, entities and types, and the various composition matrices (as applicable) ." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-68", "text": "They are trained to maximize the likelihood of the training data." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-69", "text": "The score of a path \u03c0 w.r.t to the target relation \u03b4 is" }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-70", "text": "We then choose the path which has the highest score similar to Neelakantan et al., 2014) ." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-71", "text": "Selecting just one path (out of typically hundreds to thousands of paths) between entity pairs might lead to our model ignoring informative paths, especially during the initial stages of training." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-72", "text": "To alleviate this issue we also experiment by selecting the top k paths that have the highest score for a given entity pair and relation with the resultant score being the average of the top k scores." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-73", "text": "----------------------------------" }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-74", "text": "**EXPERIMENTS & RESULTS**" }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-75", "text": "In all of our experiments, we set the dimension of the relations, entity and their type embeddings to be 50." 
}, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-76", "text": "For a fair comparison with our model, which has more number of parameters due to the entity and/or type embeddings, we experiment by varying the dimension of the relation embeddings between 50, 100 and 150 for the baseline model." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-77", "text": "We use Adam (Kingma and Ba, 2014) for optimization with the default hyperparameter settings." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-78", "text": "The models are trained for 15 epochs beyond which we observed overfitting on a held-out development set." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-79", "text": "We set l = 7 and k = 5 in our experiments." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-80", "text": "We experiment with 12 target relations." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-81", "text": "----------------------------------" }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-82", "text": "**DATA**" }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-83", "text": "We run our experiments on the dataset released by Neelakantan el al. (2015) which is a subset of Freebase enriched with information from ClueWeb." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-84", "text": "The dataset comprises of entity pairs with a set of paths connecting them in the knowledge graph." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-85", "text": "The negative examples comprise of entity pairs for which the given query relation does not hold." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-86", "text": "However the paths had the entity information missing We augment the dataset with the entities present in them." 
}, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-87", "text": "To gather the entities, we do a depth first traversal starting from the first entity of the entity pair and following the relation types until we reach the last entity of the pair." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-88", "text": "In cases of one-to-many relations we choose the next entity to be traversed at random." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-89", "text": "Due to the combinatorial search space we limit the total number of edges traversed beyond which we ignore the path." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-90", "text": "Therefore the number of paths between an entity pair would be less than in the original dataset." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-91", "text": "However, we are continuously augmenting the dataset and the latest version of the dataset can be downloaded from http://iesl." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-92", "text": "cs.umass.edu/downloads/akbc16/. Table 1 displays some statistics of the dataset gathered till now and also the subset that was used for running the current experiments." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-93", "text": "----------------------------------" }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-94", "text": "**LINK PREDICTION**" }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-95", "text": "We compare our models with the baseline model on predicting whether an entity pair participates in a target relation." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-96", "text": "We rank the entity pairs in the test set based on their scores and calculate the Mean Average Precision (MAP) score for the ranking following previous work Neelakantan et al., 2015) ." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-97", "text": "Table 2 lists the MAP scores of both the models averaged over 12 freebase relation types." 
}, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-98", "text": "Incorporating selectional preferences by adding entity types gives a significant boost in scores (17.67 % over the baseline model.)." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-99", "text": "However, we see a drop in performance on adding just entities." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-100", "text": "This is primarily because during test time we encounter a lot of previously unseen entities and hence we do Table 2 : Mean Average Precision scores averaged over 12 relations." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-101", "text": "The number in the parentheses denotes the dimension of the embedding of the relations type in the baseline model." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-102", "text": ". not have learned embeddings for them." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-103", "text": "We overcome this problem by representing the entity using its observed types in Freebase." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-104", "text": "In future work, we would consider using pre-trained entity embeddings and also by representing the entity additionally using context words (Yaghoobzadeh and Sch\u00fctze, 2015) ." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-105", "text": "Although considering top-k paths improves the performance of the baseline model, we observe that they provide almost similar scores with entity types." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-106", "text": "We run our experiments with k = 5 and we hope that the results would get better if we tune for k. Table 3 shows maximum scoring paths for four entity pair and freebase relation triples chosen by the baseline and our model." 
}, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-107", "text": "We often find that the paths chosen by the baseline model have noisier textual relation, (like 'London' 2 ,'and at the') and have entities belonging to very different types than expected by the query relation." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-108", "text": "For example, in table 3, the path chosen by the baseline model for '/aviation/airport/serves' goes to a music education school, and a water body and for '/education/campus/institution', it goes to a country in which the institution is situated followed by a notable person in the country (unrelated to the query relation)." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-109", "text": "It is quite clear that Table 3 : Predictive paths chosen by the baseline and our model for four entity pair and relation triples." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-110", "text": "The relations are edge labels and the entities occur in between them and at the ends." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-111", "text": "The freebase relations starts with '/', (/location/contains, for e.g.)." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-112", "text": "Inverse relations are denoted by \u22121 i.e. r(x, y) =\u21d2 r \u22121 (y, x), \u2200(x, y) \u2208 r. The scores are given in parentheses (higher is better)." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-113", "text": "Sometimes, both models find the same path (second example in /aviation/airport/serves), but we often find that our model correctly scores it higher." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-114", "text": "----------------------------------" }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-115", "text": "**PREDICTIVE PATHS**" }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-116", "text": ". adding entity types helps us incorporate selectional preference and hence eliminate lot of noisy paths." 
}, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-117", "text": "We also find that sometimes both models finds the same max scoring path but our model assigns more confidence (higher scores) to them leading to better MAP scores." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-118", "text": "3" }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-119", "text": "----------------------------------" }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-120", "text": "**CONCLUSION**" }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-121", "text": "In this work, we incorporate selectional preferences to a multi-hop relation extraction method." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-122", "text": "We have released the dataset we collected for this project." }, { "sent_id": "4edb60770ebeafc56446aeca9a3b2e-C001-123", "text": "We achieve a 17.67% relative improvement in MAP score in a relation extraction task when compared to a method that does not use entity type information." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "4edb60770ebeafc56446aeca9a3b2e-C001-13" ], [ "4edb60770ebeafc56446aeca9a3b2e-C001-19" ], [ "4edb60770ebeafc56446aeca9a3b2e-C001-48" ] ], "cite_sentences": [ "4edb60770ebeafc56446aeca9a3b2e-C001-13", "4edb60770ebeafc56446aeca9a3b2e-C001-19", "4edb60770ebeafc56446aeca9a3b2e-C001-48" ] }, "@EXT@": { "gold_contexts": [ [ "4edb60770ebeafc56446aeca9a3b2e-C001-22" ], [ "4edb60770ebeafc56446aeca9a3b2e-C001-31" ], [ "4edb60770ebeafc56446aeca9a3b2e-C001-48" ] ], "cite_sentences": [ "4edb60770ebeafc56446aeca9a3b2e-C001-22", "4edb60770ebeafc56446aeca9a3b2e-C001-31", "4edb60770ebeafc56446aeca9a3b2e-C001-48" ] }, "@USE@": { "gold_contexts": [ [ "4edb60770ebeafc56446aeca9a3b2e-C001-96" ] ], "cite_sentences": [ "4edb60770ebeafc56446aeca9a3b2e-C001-96" ] } } }, "ABC_19233e4954d7e75ac01112a4c07e64_34": { "x": [ { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-106", "text": "----------------------------------" }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-107", "text": "**ANALYSIS EXAMPLES**" }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-108", "text": "----------------------------------" }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-2", "text": "Abstractive summarization, the task of rewriting and compressing a document into a short summary, has achieved considerable success with neural sequence-tosequence models." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-3", "text": "However, these models can still benefit from stronger natural language inference skills, since a correct summary is logically entailed by the input document, i.e., it should not contain any contradictory or unrelated information." 
}, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-4", "text": "We incorporate such knowledge into an abstractive summarization model via multi-task learning, where we share its decoder parameters with those of an entailment generation model." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-5", "text": "We achieve promising initial improvements based on multiple metrics and datasets (including a test-only setting)." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-6", "text": "The domain mismatch between the entailment (captions) and summarization (news) datasets suggests that the model is learning some domain-agnostic inference skills." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-7", "text": "----------------------------------" }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-9", "text": "Abstractive summarization, the task of rewriting a document into a short summary is a significantly more challenging (and natural) task than extractive summarization, which only involves choosing which sentence from the original document to keep or discard in the output summary." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-10", "text": "Neural sequence-to-sequence models have led to substantial improvements on this task of abstractive summarization, via machine translation inspired encoder-aligner-decoder approaches, further enhanced via convolutional encoders, pointer-copy mechanisms, and hierarchical attention (Rush et al., 2015; Nallapati et al., 2016; See et al., 2017) ." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-11", "text": "Despite these promising recent improvements, Input Document: may is a pivotal month for moving and storage companies ." 
}, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-12", "text": "Ground-truth Summary: moving companies hit bumps in economic road Baseline Summary: a month to move storage companies Multi-task Summary: pivotal month for storage firms there is still scope in better teaching summarization models about the general natural language inference skill of logical entailment generation." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-13", "text": "This is because the task of abstractive summarization involves two subtasks: salient (important) event detection as well as logical compression, i.e., the summary should not contain any information that is contradictory or unrelated to the original document." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-14", "text": "Current methods have to learn both these skills from the same dataset and a single model." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-82", "text": "Rouge scores are full length F-1, following previous work." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-15", "text": "Therefore, there is benefit in learning the latter ability of logical compression via external knowledge from a separate entailment generation task, that will specifically teach the model how to rewrite and compress a sentence such that it logically follows from the original input." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-16", "text": "To achieve this, we employ the recent paradigm of sequence-to-sequence multi-task learning (Luong et al., 2016) ." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-17", "text": "We share the decoder parameters of the summarization model with those of the entailment-generation model, so as to generate summaries that are good at both extracting important facts from as well as being logically entailed by the input document." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-18", "text": "Fig. 
1 shows such an (actual) output example from our model, where it successfully learns both salient information extraction and entailment, unlike the strong baseline model." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-19", "text": "Empirically, we report promising initial improvements over some solid baselines based on several metrics, and on multiple datasets: Gigaword and also a test-only setting of DUC." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-20", "text": "Importantly, these improvements are achieved despite the fact that the domain of the entailment dataset (image captions) is substantially different from the domain of the summarization datasets (general news), which suggests that the model is learning certain domain-independent inference skills." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-21", "text": "Our next steps to this workshop paper include incorporating stronger pointer-based models and employing the new multi-domain entailment corpus (Williams et al., 2017) ." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-22", "text": "----------------------------------" }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-23", "text": "**RELATED WORK**" }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-24", "text": "Earlier summarization work focused more on extractive (and compression-based) summarization, i.e., selecting which sentences to keep vs. discard, and also compressing by choosing grammatically correct sub-sentences with the most important pieces of information (Jing, 2000; Knight and Marcu, 2002; Clarke and Lapata, 2008; Filippova et al., 2015) ." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-25", "text": "Bigger datasets and neural models have made it possible to address the complex reasoning involved in abstractive summarization, i.e., rewriting and compressing the input document into a new summary." 
}, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-26", "text": "Several advances have been made in this direction using machine translation inspired encoder-aligner-decoder models, convolution-based encoders, switching pointer and copy mechanisms, and hierarchical attention models (Rush et al., 2015; Nallapati et al., 2016; See et al., 2017) ." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-27", "text": "Recognizing textual entailment (RTE) is the classification task of predicting whether the relationship between a premise and hypothesis sentence is that of entailment (i.e., logically follows), contradiction, or independence (Dagan et al., 2006) ." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-28", "text": "The SNLI corpus Bowman et al. (2015) allows training accurate end-to-end neural networks for this task." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-29", "text": "Some previous work (Mehdad et al., 2013; Gupta et al., 2014) has explored the use of textual entailment recognition for redundancy detection in summarization." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-30", "text": "They label relationships between sentences, so as to select the most informative and non-redundant sentences for summarization, via sentence connectivity and graphbased optimization and fusion." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-31", "text": "Our focus, on the other hand, is entailment generation and not recognition, i.e., to teach summarization models the general natural language inference skill of generating a compressed sentence that logically entails the original longer sentence, so as to produce more effective short summaries." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-32", "text": "We achieve this via multi-task learning with entailment generation." 
}, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-33", "text": "Multi-task learning involves sharing parameters between related tasks, whereby each task benefits from extra information in the training signals of the related tasks, and also improves its generalization performance." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-34", "text": "Luong et al. (2016) showed improvements on translation, captioning, and parsing in a shared multi-task setting." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-35", "text": "Recently, Pasunuru and Bansal (2017) extend this idea to video captioning with two related tasks: video completion and entailment generation." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-36", "text": "We demonstrate that abstractive text summarization models can also be improved by sharing parameters with an entailment generation task." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-37", "text": "----------------------------------" }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-38", "text": "**MODELS**" }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-39", "text": "First, we discuss our baseline model which is similar to the machine translation encoder-alignerdecoder model of Luong et al. (2015) , and presented by Chopra et al. (2016) ." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-40", "text": "Next, we introduce our multi-task learning approach of sharing the parameters between abstractive summarization and entailment generation models." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-41", "text": "----------------------------------" }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-42", "text": "**BASELINE MODEL**" }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-43", "text": "Our baseline model is a strong, multi-layered encoder-attention-decoder model with bilinear attention, similar to Luong et al. (2015) and following the details in Chopra et al. (2016) ." 
}, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-44", "text": "Here, we encode the source document with a two-layered LSTM-RNN and generate the summary using another two-layered LSTM-RNN decoder." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-45", "text": "The word probability distribution at time step t of the decoder is defined as follows:" }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-46", "text": "where g is a non-linear function and c t and s t are the context vector and LSTM-RNN decoder hidden state at time step t, respectively." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-47", "text": "The context vector c t = \u03b1 t,i h i is a weighted combination of encoder hidden states h i , where the attention weights are learned through the bilinear attention mechanism proposed in Luong et al. (2015) ." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-48", "text": "For the rest of the paper, we use same notations." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-49", "text": "We also use the same model architecture for the entailment generation task, i.e., a sequence-tosequence model encoding the premise and decoding the entailed hypothesis, via bilinear attention between them." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-50", "text": "Figure 2: Multi-task learning of the summarization task (left) with the entailment generation task (right)." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-51", "text": "----------------------------------" }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-52", "text": "**MULTI-TASK LEARNING**" }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-53", "text": "Multi-task learning helps in sharing knowledge between related tasks across domains (Luong et al., 2015) ." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-54", "text": "In this work, we show improvements on the task of abstractive summarization by sharing its parameters with the task of entailment generation." 
}, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-55", "text": "Since a summary is entailed by the input document, sharing parameters with the entailment generation task improves the logically-directed aspect of the summarization model, while maintaining the salient information extraction aspect." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-56", "text": "In our multi-task setup, we share the decoder parameters of both the tasks (along with the word embeddings), as shown in Fig. 2 , and we optimize the two loss functions (one for summarization and another for entailment generation) in alternate mini-batches of training." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-57", "text": "Let \u03b1 s be the number of mini-batches of training for summarization after which it is switched to train \u03b1 e number of minibatches for entailment generation." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-58", "text": "Then, the mixing ratio is defined as" }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-59", "text": "----------------------------------" }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-60", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-61", "text": "----------------------------------" }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-62", "text": "**DATASETS**" }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-63", "text": "Gigaword Corpus We use the exact annotated Gigaword corpus provided by Rush et al. (2015) ." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-64", "text": "The dataset has approximately 3.8 million training pairs." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-65", "text": "We use 10, 000 pairs as validation set and the exact test sample provided by Rush et al. (2015) as our test set." 
}, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-66", "text": "We use the first sentence of the article as the source with vocabulary size of 119, 505 and article headline as target with vocabulary size of 68, 885." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-67", "text": "----------------------------------" }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-68", "text": "**DUC TEST CORPUS**" }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-69", "text": "----------------------------------" }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-70", "text": "**SNLI CORPUS**" }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-71", "text": "For the task of entailment generation, we use the Standford Natural Language Inference (SNLI) corpus (Bowman et al., 2015) , where we only use the entailment-labeled pairs and regroup the splits to have a zero overlap traintest split and have a multi-reference test set, as suggested by Pasunuru and Bansal (2017) ." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-72", "text": "Out of 190, 113 entailments pairs, we use 145, 822 unique premise pairs for training, and the rest of them are equally divided into dev and test sets." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-73", "text": "----------------------------------" }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-74", "text": "**EVALUATION**" }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-75", "text": "Following previous work (Nallapati et al., 2016; Chopra et al., 2016; Rush et al., 2015) , we use the full-length F1 variant of Rouge (Lin, 2004) for the Gigaword results, and the 75-bytes length limited Recall variant of Rouge for DUC." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-76", "text": "Additionally, we also report other standard language generation metrics (as motivated recently by See et al. 
(2017) ): METEOR (Denkowski and Lavie, 2014) , BLEU-4 (Papineni et al., 2002) , and CIDEr-D , based on the MS-COCO evaluation script (Chen et al., 2015) ." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-77", "text": "----------------------------------" }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-78", "text": "**TRAINING DETAILS**" }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-79", "text": "We use the following simple settings for all the models, unless otherwise specified." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-80", "text": "We unroll the encoder RNNs to a maximum of 50 time steps and the decoder RNNs to a maximum of 30 time steps." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-81", "text": "Table 1: Summarization results on Gigaword (ROUGE-1 / ROUGE-2 / ROUGE-L; METEOR, BLEU-4 and CIDEr-D not reported for previous work). Previous work: ABS+ (Rush et al., 2015): 29.76 / 11.88 / 26.96; RAS-Elman (Chopra et al., 2016): 33.78 / 15.97 / 31.15; words-lvt2k-1sent (Nallapati et al., 2016): 32" }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-83", "text": "We use an RNN hidden state dimension of 512 and a word embedding dimension of 256." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-84", "text": "We do not initialize our word embeddings with any pre-trained models, i.e., we learn them from scratch." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-85", "text": "We use the Adam (Kingma and Ba, 2015) optimizer with a learning rate of 0.001." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-86", "text": "During training, to handle the large vocabulary, we use the sampled loss trick of Jean et al. (2014) ." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-87", "text": "We always tune hyperparameters on the validation set of the corresponding dataset, where applicable." 
}, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-88", "text": "For multi-task learning, we tried a few mixing ratios and found 1 : 0.05 to work better, i.e., 100 mini-batches of summarization with 5 mini-batches of entailment generation task in alternate training rounds." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-89", "text": "----------------------------------" }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-90", "text": "**RESULTS AND ANALYSIS**" }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-91", "text": "----------------------------------" }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-92", "text": "**SUMMARIZATION RESULTS: GIGAWORD**" }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-93", "text": "Baseline Results and Previous Work Our baseline is a strong encoder-attention-decoder model based on Luong et al. (2015) and presented by Chopra et al. (2016) ." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-94", "text": "As shown in Table 1 , it is reasonably close to some of the state-of-theart (comparable) results in previous work, though making this baseline further strong (e.g., based on pointer-copy mechanism) is our next step." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-95", "text": "----------------------------------" }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-96", "text": "**MULTI-TASK RESULTS**" }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-97", "text": "We show promising initial multi-task improvements on top of our baseline, based on several metrics." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-98", "text": "This suggests that the entailment generation model is teaching the summarization model some skills about how to choose a logical subset of the events in the full input document." 
}, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-99", "text": "This is especially promising given that the domain of the entailment dataset (image captions) is very different from the domain of the summarization datasets (news), suggesting that the model might be learning some domain-agnostic inference skills." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-100", "text": "----------------------------------" }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-101", "text": "**SUMMARIZATION RESULTS: DUC**" }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-102", "text": "Here, we directly use the Gigaword-trained model to test on the DUC-2004 dataset (see tuning discussion in Sec. 4.1)." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-103", "text": "In Table 2 , we again see that et al. (2015) 28.18 8.49 23.81 Chopra et al. (2016) 28.97 8.26 24.06 Nallapati et al. (2016) our Luong et al. (2015) baseline model achieves competitive performance with previous work, esp." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-104", "text": "on Rouge-2 and Rouge-L. Next, we show promising multi-task improvements over this baseline of around 0.4% across all metrics, despite being a test-only setting and also with the mismatch between the summarization and entailment domains." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-105", "text": "Figure 3 shows some additional interesting output examples of our multi-task model and how it generates summaries that are better at being logically entailed by the input document, whereas the baseline model contains some crucial contradictory or unrelated information." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-109", "text": "**CONCLUSION AND NEXT STEPS**" }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-110", "text": "We presented a multi-task learning approach to incorporate entailment generation knowledge into summarization models." 
}, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-111", "text": "We demonstrated promising initial improvements based on multiple datasets and metrics, even when the entailment knowledge was extracted from a domain different from the summarization domain." }, { "sent_id": "19233e4954d7e75ac01112a4c07e64-C001-112", "text": "Our next steps to this workshop paper include: (1) stronger summarization baselines, e.g., using pointer copy mechanism (See et al., 2017; Nallapati et al., 2016) , and also adding this capability to the entailment generation model; (2) results on CNN/Daily Mail corpora (Nallapati et al., 2016) ; (3) incorporating entailment knowledge from other news-style domains such as the new Multi-NLI corpus (Williams et al., 2017) , and (4) demonstrating mutual improvements on the entailment generation task." } ], "y": { "@USE@": { "gold_contexts": [ [ "19233e4954d7e75ac01112a4c07e64-C001-39" ], [ "19233e4954d7e75ac01112a4c07e64-C001-43" ], [ "19233e4954d7e75ac01112a4c07e64-C001-75" ], [ "19233e4954d7e75ac01112a4c07e64-C001-93" ] ], "cite_sentences": [ "19233e4954d7e75ac01112a4c07e64-C001-39", "19233e4954d7e75ac01112a4c07e64-C001-43", "19233e4954d7e75ac01112a4c07e64-C001-75", "19233e4954d7e75ac01112a4c07e64-C001-93" ] }, "@SIM@": { "gold_contexts": [ [ "19233e4954d7e75ac01112a4c07e64-C001-103" ] ], "cite_sentences": [ "19233e4954d7e75ac01112a4c07e64-C001-103" ] } } }, "ABC_4c615cb5a496c1e225a8457a4f55d4_34": { "x": [ { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-2", "text": "We introduce EASSE, a Python package aiming to facilitate and standardise automatic evaluation and comparison of Sentence Simplification (SS) systems." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-3", "text": "EASSE provides a single access point to a broad range of evaluation resources: standard automatic metrics for assessing SS outputs (e.g. 
SARI), word-level accuracy scores for certain simplification transformations, reference-independent quality estimation features (e.g. compression ratio), and standard test data for SS evaluation (e.g. TurkCorpus)." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-4", "text": "Finally, EASSE generates easy-to-visualise reports on the various metrics and features above and on how a particular SS output fares against reference simplifications." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-5", "text": "Through experiments, we show that these functionalities allow for better comparison and understanding of the performance of SS systems." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-6", "text": "----------------------------------" }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-8", "text": "Sentence Simplification (SS) consists of modifying the content and structure of a sentence to improve its readability while retaining its original meaning." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-9", "text": "For automatic evaluation of a simplification output, it is common practice to use machine translation (MT) metrics (e.g. BLEU (Papineni et al., 2002) ), simplicity metrics (e.g. SARI (Xu et al., 2016) ), and readability metrics (e.g. FKGL (Kincaid et al., 1975) )." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-10", "text": "Most of these metrics are available in individual code repositories, with particular software requirements that sometimes differ even in programming language (e.g. corpus-level SARI is implemented in Java, whilst sentence-level SARI is available in both Java and Python)." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-11", "text": "Other metrics (e.g. SAMSA (Sulem et al., 2018b) ) suffer from insufficient documentation or require executing multiple scripts with hard-coded paths, which prevents researchers from using them." 
}, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-12", "text": "EASSE (Easier Automatic Sentence Simplification Evaluation) is a Python package that provides access to popular automatic metrics in SS evaluation and ready-to-use public datasets through a simple command-line interface." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-13", "text": "With this tool, we make the following contributions: (1) we provide popular automatic metrics in a single software package, (2) we supplement these metrics with word-level transformation analysis and referenceless Quality Estimation (QE) features, (3) we provide straightforward access to commonly used evaluation datasets, and (4) we generate a comprehensive HTML report for quantitative and qualitative evaluation of a SS system." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-14", "text": "We believe this package will facilitate evaluation and improve reproducibility of results in SS." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-15", "text": "EASSE is available in https://github.com/feralvam/ easse." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-16", "text": "----------------------------------" }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-17", "text": "**PACKAGE OVERVIEW**" }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-18", "text": "----------------------------------" }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-19", "text": "**AUTOMATIC CORPUS-LEVEL METRICS**" }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-20", "text": "Although human judgements on grammaticality, meaning preservation and simplicity are considered the most reliable method for evaluating a SS system's output (\u0160tajner et al., 2016) , it is common practice to use automatic metrics." 
}, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-21", "text": "They are useful for either assessing systems at development stage, to compare different architectures, for model selection or as part of a training policy." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-22", "text": "EASSE implements works as a wrapper for the most common evaluation metrics in SS:" }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-23", "text": "BLEU is a precision-oriented metric that relies on the proportion of n-gram matches between a system's output and reference(s)." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-24", "text": "Previous work (Xu et al., 2016) has shown that BLEU correlates fairly well with human judgements of grammaticality and meaning preservation." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-25", "text": "EASSE uses SACREBLEU (Post, 2018) 1 to calculate BLEU." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-26", "text": "This package was designed to standardise the process by which BLEU is calculated: it only expects a detokenised system's output and the name of a test set." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-27", "text": "It ensures that the same pre-processing steps are used for the system output and reference sentences." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-28", "text": "SARI measures how the simplicity of a sentence was improved based on the words added, deleted and kept by a system." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-29", "text": "The metric compares the system's output to multiple simplification references and the original sentence." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-30", "text": "SARI has shown positive correlation with human judgements of simplicity gain." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-31", "text": "We re-implement SARI's corpuslevel version in Python (it was originally available in Java)." 
}, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-32", "text": "In this version, for each operation (ope \u2208 {add, del, keep}) and n-gram order, precision p ope (n), recall r ope (n) and F1 f ope (n) scores are calculated." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-33", "text": "These are then averaged over the n-gram order to get the overall operation F1 score" }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-34", "text": "Although Xu et al. (2016) indicate that only precision should be considered for the deletion operation, we follow the Java implementation that uses F1 score for all operations in corpus-level SARI." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-35", "text": "SAMSA measures structural simplicity (i.e. sentence splitting)." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-36", "text": "This is in contrast to SARI, which is designed to evaluate simplifications involving paraphrasing." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-37", "text": "EASSE re-uses the original SAMSA implementation 2 with some modifications: (1) an internal call to the TUPA parser (Hershcovich et al., 2017) , which generates the semantic annotations for each original sentence; (2) a modified version of the monolingual word aligner (Sultan et al., 2014) that is compatible with Python 3, and uses Stanford CoreNLP (Manning et al., 2014) 3 through their official Python interface, and (3) a single function call to get a SAMSA score instead of running a series of scripts." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-38", "text": "FKGL Readability metrics, such as FleschKincaid Grade Level (FKGL), are commonly reported as measures of simplicity." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-39", "text": "They however only rely on average sentence lengths and number of syllables per word, so short sentences would get good scores even if they are ungrammatical, or do not preserve meaning (Wubben et al., 2012) ." 
}, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-40", "text": "Therefore, these scores should be interpreted with caution." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-41", "text": "EASSE re-implements FKGL by porting publicly available scripts 4 to Python 3 and fixing some edge case inconsistencies (e.g. newlines incorrectly counted as words or bugs with memoization)." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-42", "text": "----------------------------------" }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-43", "text": "**WORD-LEVEL ANALYSIS AND QE FEATURES**" }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-44", "text": "Word-level Transformation Analysis EASSE includes algorithms to determine which specific text transformations a SS system performs more effectively." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-45", "text": "This is done based on word-level alignment and analysis." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-46", "text": "Since there is no available simplification dataset with manual annotations of the transformations performed, we re-use the annotation algorithms from MASSAlign ." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-47", "text": "Given a pair of sentences (e.g. original and system output), the algorithms use word alignments to identify deletions, movements, replacements and copies (see Fig. 1 )." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-48", "text": "This process is prone to some errors: when compared to manual labels produced by four annotators in 100 original-simplified pairs, the automatic algorithms achieved a micro-averaged F1 score of 0.61 ." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-49", "text": "We generate two sets of automatic word-level annotations: (i) between the original sentences and their reference simplifications, and (ii) between the original sentences and their automatic simplifications produced by a SS system." 
}, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-50", "text": "Considering (i) as reference labels, we calculate the F1 score of each transformation in (ii) to estimate their correctness." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-51", "text": "When more than one reference simplification exists, we calculate the per-transformation F1 scores of the output against each reference, and then keep the highest one as the sentence-level score." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-52", "text": "The corpus-level scores are the average of sentence-level scores." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-53", "text": "Quality Estimation Features Traditional automatic metrics used for SS rely on the existence and quality of references, and are often not enough to analyse the complex process of simplification." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-54", "text": "QE leverages both the source sentence and the output simplification to provide additional information on specific behaviours of simplification systems which are not reflected in metrics such as SARI." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-55", "text": "EASSE uses QE features from Martin et al. (2018) 's open-source repository 5 ." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-56", "text": "The QE scores that we currently use include the compression ratio of the simplification with respect to its source sentence, its Levenshtein similarity, the average number of sentence splits performed by the system, the proportion of exact matches (i.e. original sentences left untouched), average proportion of added words, deleted words and lexical complexity score 6 ." 
}, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-57", "text": "----------------------------------" }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-58", "text": "**ACCESS TO TEST DATASETS**" }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-59", "text": "EASSE provides access to three publicly available datasets for automatic SS evaluation (Table 1): PWKP (Zhu et al., 2010) , TurkCorpus (Xu et al., 2016) , and HSplit (Sulem et al., 2018a) ." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-60", "text": "All of them consist of the data from the original datasets, which are sentences extracted from English Wikipedia (EW) articles." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-61", "text": "It is important to highlight that EASSE can also evaluate system's outputs in other datasets provided by the user." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-62", "text": "PWKP Zhu et al. (2010) automatically aligned sentences in 65,133 EW articles to their corresponding versions in Simple EW (SEW)." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-63", "text": "Since the latter is aimed at English learners, its articles are expected to contain fewer words and simpler grammar structures than those in their EW counterpart." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-64", "text": "The test set split of PWKP contains 100 sentences, with 1-to-1 and 1-to-N alignments (resp." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-65", "text": "93 and 7 instances)." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-66", "text": "The latter correspond to instances of sentence splitting." 
}, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-67", "text": "Since this dataset has only one reference for each original sentence, 5 https://github.com/facebookresearch/ text-simplification-evaluation 6 The lexical complexity score of a simplified sentence is computed by taking the log-ranks of each word in the frequency it is not ideal for calculating automatic metrics that rely on multiple references, such as SARI." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-68", "text": "TurkCorpus Xu et al. (2016) asked crowdworkers to simplify 2,359 original sentences extracted from PWKP to collect eight simplification references for each one." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-69", "text": "This dataset was then randomly split into tuning (2,000 instances) and test (359 instances) sets." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-70", "text": "The test set only contains 1-to-1 alignments, mostly with instances of paraphrasing and deletion." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-71", "text": "Each original sentence in TurkCorpus has 8 simplified references produced through crowdsourcing." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-72", "text": "As such, it is better suited for computing SARI and multi-reference BLEU scores." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-73", "text": "HSplit Sulem et al. (2018a) recognised that existing EW-based datasets did not contain sufficient instances of sentence splitting." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-74", "text": "As such, they collected four reference simplifications of this transformation for the first 70 original sentences in the TurkCorpus test set." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-75", "text": "Even though SAMSA's computation does not require access to references, this dataset can be used to compute an upperbound on the expected performance of SS systems that model this type of structural simplification." 
}, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-76", "text": "----------------------------------" }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-77", "text": "**HTML REPORT GENERATION**" }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-78", "text": "EASSE wraps all the aforementioned analyses in a simple comprehensive HTML report that can be generated with a single command." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-79", "text": "This report compares the system output with human" }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-80", "text": "----------------------------------" }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-81", "text": "**EXPERIMENTS**" }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-82", "text": "We collected publicly available outputs of several SS systems (Sec. 3.1) to evaluate their performance using the functionalities available in EASSE." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-83", "text": "In particular, we compare them using automatic metrics, and provide some insights on the reasoning behind their results (Sec. 3.2)." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-84", "text": "----------------------------------" }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-85", "text": "**SENTENCE SIMPLIFICATION SYSTEMS**" }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-86", "text": "EASSE provides access to various SS system outputs that follow different approaches for the task." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-87", "text": "For instance, we included those that rely on phrase-based statistical MT, either by itself (e.g. PBSMT-R (Wubben et al., 2012) ), or coupled with semantic analysis, (e.g. Hybrid (Narayan and Gardent, 2014) )." 
}, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-88", "text": "We also included SBSMT-SARI (Xu et al., 2016) , which relies on syntaxbased statistical MT; DRESS-LS (Zhang and Lapata, 2017 ), a neural model using the standard encoder-decoder architecture with attention combined with reinforcement learning; and DMASS-DCSS (Zhao et al., 2018) , the current state-of-theart in the TurkCorpus, which is based on the Transformer architecture (Vaswani et al., 2017) ." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-89", "text": "----------------------------------" }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-90", "text": "**COMPARISON AND ANALYSIS OF SCORES**" }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-91", "text": "Automatic Metrics For illustration purposes, we compare systems' outputs using BLEU and SARI in TurkCorpus (with 8 manual simplification references), and SAMSA in HSplit." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-92", "text": "For calculating Reference values in Table 2 , we sample one of the 8 human references for each instance as others have done (Zhang and Lapata, 2017) ." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-93", "text": "When reporting SAMSA scores, we only use the 70 sentences of TurkCorpus that also appear in HSplit." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-94", "text": "This allows us to compute Reference scores for instances that contain structural simplifications (i.e. sentence splits)." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-95", "text": "We calculate SAMSA scores for each of the four manual simplifications in HSplit, and choose the highest as an upper-bound Reference value." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-96", "text": "The results for all three metrics are shown in DMASS-DCSS is the state-of-the-art in TurkCorpus according to SARI." 
}, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-97", "text": "However, it gets the lowest SAMSA score, and the third to last BLEU score." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-98", "text": "PBSMT-R is the best in terms of these two metrics." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-99", "text": "Finally, across all metrics, the Reference stills gets the highest values, with significant differences from the top performing systems." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-100", "text": "----------------------------------" }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-101", "text": "**WORD-LEVEL TRANSFORMATIONS**" }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-102", "text": "In order to better understand the previous results, we use the wordlevel annotations of text transformations (Table 3) ." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-103", "text": "Since SARI was design to evaluate mainly paraphrasing transformations, the fact that SBSMT-SARI is the best at performing replacements and second place in copying explains its high SARI score." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-104", "text": "DMASS-DCSS is second best in replacements, while PBSMT-R (which achieved the highest BLEU score) is the best at copying." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-105", "text": "Hybrid is the best at performing deletions, but is the worst at replacements, which SARI mainly measures." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-106", "text": "The origin of the TurkCorpus set itself could explain some of these observations." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-107", "text": "According to Xu et al. (2016) , the annotators in TurkCorpus were instructed to mainly produce paraphrases, i.e. mostly replacements with virtually no deletions." 
}, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-108", "text": "As such, copying words is also a significant transformation, so systems that are good at performing it better mimic the characteristics of the human simplifications in this dataset." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-109", "text": "Table 3 : Transformation-based performance of the sentence simplification systems in the TurkCorpus test set." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-110", "text": "Table 4 : Quality estimation features, which give additional information on the output of different systems." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-111", "text": "----------------------------------" }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-112", "text": "**QUALITY ESTIMATION FEATURES**" }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-113", "text": "Report Figure 2 displays the quantitative part of the HTML report generated for the DMASS-DCSS system." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-114", "text": "The report compares the system to a reference human simplification." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-115", "text": "The \"System vs. Reference\" table and the two plots indicate that DMASS-DCSS closely matches different aspects of human simplifications, according to QE features." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-116", "text": "This contributes to explaining the high SARI score of the DMASS-DCSS system in Table 2." 
}, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-117", "text": "----------------------------------" }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-118", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-119", "text": "EASSE provides easy access to commonly used automatic metrics as well as to more detailed word-level transformation analysis and QE scores which allows us to compare the quality of the generated outputs of different SS systems on public test datsets." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-120", "text": "We reported some experiments on the use of automatic metrics to obtain overall performance scores, followed by measurements of how effective the SS systems are at executing specific simplification transformations using wordlevel analysis and QE features." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-121", "text": "The former analysis provided insights about the simplification capabilities of each system, which help better explain the initial automatic scores." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-122", "text": "In the future, we plan to continue developing the transformation-based analysis algorithms, so that more sophisticated transformations could be identified (e.g. splitting or subject-verb-object reordering)." }, { "sent_id": "4c615cb5a496c1e225a8457a4f55d4-C001-123", "text": "In addition, we expect to integrate more QE features to cover other aspects of the simplification process (e.g. depth of the dependency parse tree to measure syntactic complexity)." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "4c615cb5a496c1e225a8457a4f55d4-C001-9" ], [ "4c615cb5a496c1e225a8457a4f55d4-C001-24" ], [ "4c615cb5a496c1e225a8457a4f55d4-C001-59" ], [ "4c615cb5a496c1e225a8457a4f55d4-C001-68" ], [ "4c615cb5a496c1e225a8457a4f55d4-C001-107" ] ], "cite_sentences": [ "4c615cb5a496c1e225a8457a4f55d4-C001-9", "4c615cb5a496c1e225a8457a4f55d4-C001-24", "4c615cb5a496c1e225a8457a4f55d4-C001-59", "4c615cb5a496c1e225a8457a4f55d4-C001-68", "4c615cb5a496c1e225a8457a4f55d4-C001-107" ] }, "@DIF@": { "gold_contexts": [ [ "4c615cb5a496c1e225a8457a4f55d4-C001-34" ] ], "cite_sentences": [ "4c615cb5a496c1e225a8457a4f55d4-C001-34" ] }, "@USE@": { "gold_contexts": [ [ "4c615cb5a496c1e225a8457a4f55d4-C001-88" ] ], "cite_sentences": [ "4c615cb5a496c1e225a8457a4f55d4-C001-88" ] } } }, "ABC_033ce75c882764e08fb3871656a8d1_34": { "x": [ { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-2", "text": "Multiword expressions (MWEs) can be extracted automatically from large corpora using association measures, and tools like mwetoolkit allow researchers to generate training data for MWE extraction given a tagged corpus and a lexicon." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-3", "text": "We use mwetoolkit on a sample of the French Europarl corpus together with the French lexicon Dela, and use Weka to train classifiers for MWE extraction on the generated training data." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-4", "text": "A manual evaluation shows that the classifiers achieve 60-75% precision and that about half of the MWEs found are novel and not listed in the lexicon." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-5", "text": "We also investigate the impact of the patterns used to generate the training data and find that this can affect the trade-off between precision and novelty." 
}, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-6", "text": "----------------------------------" }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-8", "text": "In alphabetic languages, words are delimited by spaces." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-9", "text": "Some words can combine to create a new unit of meaning that we call a multiword expression (MWE)." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-10", "text": "However, MWEs such as kick the bucket must be distinguished from free combinations of words such as kick the ball." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-11", "text": "A sequence of several words is an MWE if \"at least one of its syntactic, distributional or semantic properties cannot be deduced from the properties of its component\" (Silberztein and L.A.D.L., 1990) ." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-12", "text": "So how can we extract them?" }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-36", "text": "Villaviciencio (2005), for example, uses number of hits on Google for validating the likelihood of particle verbs." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-13", "text": "Statistical association measures have long been used for MWE extraction (Pecina, 2010) , and by training supervised classifiers that use association measures as features we can further improve the quality of the extraction process." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-14", "text": "However, supervised machine learning requires annotated data, which creates a bottleneck in the absence of large corpora annotated for MWEs." 
}, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-15", "text": "In order to circumvent this bottleneck, mwetoolkit (Ramisch et al., 2010b) generates training instances by first extracting candidates that fit a certain part-of-speech pattern, such as Noun-Noun or Noun-Adjective, and then marking the candidates as positive or negative instances depending on whether they can be found in a given lexicon or not." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-16", "text": "Such a training set will presumably not contain any false positives (that is, candidates marked as positive instances that are not real MWEs), but depending on the coverage of the lexicon there will be a smaller or larger proportion of false negatives." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-17", "text": "The question is what quality can be obtained using such a noisy training set." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-18", "text": "To the best of our knowledge, we cannot find the answer for French in literature." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-19", "text": "Indeed, compares the performance of mwetoolkit with another toolkit on English and French corpora, but they never use the data generated by mwetoolkit to train a model." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-20", "text": "In contrast, Zilio et al. (2011) make a study involving training a model but use it only on English and use extra lexical resources to complement the machine learning method, so their study does not focus just on classifier evaluation." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-21", "text": "This paper presents the first evaluation of mwetoolkit on French together with two resources very commonly used by the French NLP community: the tagger TreeTagger (Schmid, 1994) and the dictionary Dela." 
}, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-22", "text": "1 Training and test data are taken from the French Europarl corpus (Koehn, 2005) and classifiers are trained using the Weka machine learning toolkit (Hall et al., 2009) ." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-23", "text": "The primary goal is to evaluate what level of precision can be achieved for nominal MWEs, using a manual evaluation of MWEs extracted, and to what extent the MWEs extracted are novel and can be used to enrich the lexicon." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-24", "text": "In addition, we will investigate what effect the choice of part-of-speech patterns used to generate the training data has on precision and novelty." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-25", "text": "Our results indicate that classifiers achieve precision in the 60-75% range and that about half of the MWEs found are novel ones." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-26", "text": "In addition, it seems that the choice of patterns used to generate the training data can affect the tradeoff between precision and novelty." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-27", "text": "----------------------------------" }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-28", "text": "**RELATED WORK**" }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-29", "text": "----------------------------------" }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-30", "text": "**EXTRACTION TECHNIQUES**" }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-31", "text": "There is no unique definition of MWEs (Ramisch, 2012) ." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-32", "text": "In the literature on the subject, we notice that manual MWE extraction often requires several annotators native of the studied language." 
}, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-33", "text": "Nevertheless, some techniques exist for selecting automatically candidates that are more likely to be the true ones." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-34", "text": "Candidates can be validated against an external resource, such as a lexicon." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-35", "text": "It is possible also to check the frequency of candidates in another corpus like the web." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-37", "text": "However, as Ramisch (2012) states in his introduction, MWE is an institutionalised phenomenon." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-38", "text": "This means that an MWE is frequently used and is part of the vocabulary of a speaker as well as the simple words." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-39", "text": "It means also that MWEs have specific statistical properties that have been studied." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-40", "text": "The results of those studies are statistical measures such as dice score, maximum likelihood estimate, pointwise mutual information, T-score." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-41", "text": "As Islam et al. (2012) remark in a study of Google Ngram, those measures of association are language independent. And it is demonstrated by Pecina (2008) that combining different collocation measures using standard statistical classification methods improves over using a single collocation measure." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-42", "text": "However, nowadays, using only lexical association measures for extraction and validation of MWE is not considered the most effective method." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-43", "text": "The tendency these last years is to combine association measures with linguistic features (Ramisch et al., 2010a; Pecina, 2008; Tsvetkov and Wintner, 2011) ." 
}, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-44", "text": "----------------------------------" }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-45", "text": "**MWETOOLKIT**" }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-46", "text": "Among the tools developed for extracting MWEs, mwetoolkit is one of the most recent." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-47", "text": "Developed by Ramisch et al. (2010b) it aims not only at extracting candidates for potential MWEs, but also at extracting their association measures." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-48", "text": "Provided that a lexicon of MWEs is available and provided a preprocessed corpus, mwetoolkit makes it possible to train a machine learning system with the association measures as features with a minimum of implementation." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-49", "text": "Ramisch et al. (2010b) provide experiments on Portuguese, English and Greek." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-50", "text": "Zilio et al. (2011) provide experiments with this tool as well." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-51", "text": "In the latter study, after having trained a machine on bigram MWEs, they try to extract full n-gram expressions from the Europarl corpus." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-52", "text": "They then reuse the model obtained on bigrams for extraction of full n-gram MWEs." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-53", "text": "Finally, they apply a second filter for getting back the false negatives by checking every MWE annotated as False by the algorithm against a online dictionary." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-54", "text": "This method gets a very good precision (over 87%) and recall (over 84%)." 
}, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-55", "text": "However, we do not really know if this result is mostly due to the coverage of the dictionary online. What is the contribution of machine learning in itself?" }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-56", "text": "Another question raised by this study is the ability of a machine trained on one kind of pattern (e.g., Noun-Adjective) to extract correctly another kind of MWE pattern (e.g., Noun-Noun)." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-57", "text": "That is the reason why we will run three experiments close to the one of Zilio et al. (2011) but were the only changing parameter is the pattern that we train our classifiers on." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-58", "text": "----------------------------------" }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-59", "text": "**GENERATING TRAINING DATA**" }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-60", "text": "----------------------------------" }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-61", "text": "**CHOICE OF PATTERNS**" }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-62", "text": "In contrast to Zilio et al. (2011) we run our experiment on French." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-63", "text": "The choice of a different language requires an adaptation of the patterns." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-64", "text": "French indeed, as a latin language, does not show the same characteristic patterns as English." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-65", "text": "We know that there is a strong recurrence of the pattern Noun-Adjective in bigram MWEs in our lexicon (Silberztein and L.A.D.L., 1990, p.82) , and the next most frequent pattern is NounNoun." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-66", "text": "Therefore we extract only candidates that correspond to these patterns." 
}, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-67", "text": "And, since we have two patterns, we will run two extra experiments where our models will be trained only on one of the patterns." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-68", "text": "In this way, we will discover how sensitive the method is to the choice of pattern." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-69", "text": "----------------------------------" }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-70", "text": "**CORPUS**" }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-71", "text": "As we work on the French Europarl corpus." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-72", "text": "We took the three first million words of Europarl and divided it into three equal parts (one million words each) for running our experiments." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-73", "text": "The first part will be devoted at 80% to training and 20% to development test set, when training classifiers on Noun-Adjective or NounNoun patterns, or both." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-74", "text": "We use the second million as a secondary development set that is not used in this study." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-75", "text": "The third million is used as a final test set and we will present results on this set." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-76", "text": "----------------------------------" }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-77", "text": "**PREPROCESSING**" }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-78", "text": "For preprocessing we used the same processes as described in Zilio et al. (2011) ." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-79", "text": "First we ran the sentence splitter and the tokenizer provided with the Europarl corpus." 
}, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-80", "text": "Then we ran TreeTagger (Schmid, 1994) to obtain the tags and the lemmas." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-81", "text": "----------------------------------" }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-82", "text": "**EXTRACTING DATA AND FEATURES**" }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-83", "text": "The mwetoolkit takes as input a preprocessed corpus plus a lexicon and gives two main outputs: an arff file which is a format adapted to the machine learning framework Weka, and an XML file." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-84", "text": "At the end of the process we obtain, for each candidate, a binary classification as an MWE (True) or not (False) depending on whether it is contained in the lexicon." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-85", "text": "For each candidate, we also obtain the following features: maximum likelihood estimate, pointwise mutual information, T-score, dice coefficient, log-likelihood ratio." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-86", "text": "The machine learning task is then to predict the class (True or False) given the features of a candidate." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-87", "text": "----------------------------------" }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-88", "text": "**CHOICE OF A LEXICON IN FRENCH**" }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-89", "text": "The evaluation part of mwetoolkit is furnished with an internal English lexicon as a gold standard for evaluating bigram MWEs, but for French it is necessary to provide an external resource." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-90", "text": "We used as our gold standard the French dictionary Dela (Silberztein and L.A.D.L., 1990) , the MWE part of which is called Delac." 
}, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-91", "text": "It is a general purpose dictionary for NLP and it includes 100,000 MWE expressions, which is a reasonable size for leading an experiment on the Europarl corpus." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-92", "text": "Also the technical documentation of the Delac (Silberztein and L.A.D.L., 1990, p.72) says that this dictionary has been constructed by linguists with reference to several dictionaries." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-93", "text": "So it is a manually built resource that contains MWEs only referenced in official lexicographical books." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-94", "text": "----------------------------------" }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-95", "text": "**PROCESSING**" }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-96", "text": "Thanks to mwetoolkit we extracted all the bigrams that correspond to the patterns NounAdjective (NA), Noun-Noun (NN) and to both Noun-Adjective and Noun-Noun (NANN) in our three data sets and let mwetoolkit make an automatic annotation by checking the presence of the MWE candidates in the Delac." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-97", "text": "Note that the automatic annotation was used only for training." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-98", "text": "The final evaluation was done manually." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-99", "text": "----------------------------------" }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-100", "text": "**TRAINING CLASSIFIERS**" }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-101", "text": "For finding the best model we think that we have to favour the recall of the positive candidates." 
}, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-102", "text": "Indeed, when an MWE candidate is annotated as True, it means that it is listed in the Dela, which means that it is an officially listed MWE." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-103", "text": "However, if an MWE is not in the Dela, it does not mean that the candidate does not fulfil all the criteria for being an MWE." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-104", "text": "For this reason, obtaining a good recall is much more difficult than getting a good precision, but it is also the most important if we stay on a lexicographical purpose." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-105", "text": "----------------------------------" }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-106", "text": "**TRAINING ON NA**" }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-107", "text": "We tested several algorithms offered by Weka as well as the training options suggested by Zilio et al. (2011) ." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-108", "text": "We also tried to remove some features and to keep only the most informative ones (MLE, T-score and log-likelihood according to information gain ratio) but we noticed each time a loss in the recall." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-109", "text": "At the end with all the features kept and for the purpose of evaluating NA MWE candidates the best classification algorithm was the Bayesian network." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-110", "text": "----------------------------------" }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-111", "text": "**TRAINING ON NN**" }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-112", "text": "When training a model on NN MWEs, our aim was to keep as much as possible the same condition for our three experiments." 
}, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-113", "text": "However, the NN training set has definitely not the same properties as the NA and NANN ones." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-114", "text": "The NN training set is twenty-four times smaller than NA training set." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-115", "text": "Most of the algorithms offered by Weka therefore ended up with a dummy systematic classification to the majority class False." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-116", "text": "The only exceptions were ibk, ib1, hyperpipes, random trees and random forest." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-117", "text": "We kept random forest because it gave the best recall with a very good precision." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-118", "text": "We tried several options and obtained the optimum results with 8 trees each constructed while considering 3 random features, one seed, and unlimited depth of trees." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-119", "text": "As well as for NA we kept all features." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-120", "text": "----------------------------------" }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-121", "text": "**TRAINING ON NA+NN**" }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-122", "text": "For the training on NANN candidates we tried the same models as for NN and for NA candidates." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-123", "text": "The best result was obtained with the same algorithm as for NA: Bayesian network." 
}, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-124", "text": "----------------------------------" }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-125", "text": "**EVALUATION**" }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-126", "text": "The data automatically annotated by mwetoolkit could be used for training, but to properly evaluate the precision of MWE extraction on new data and not penalize the system for 'false positives' that are due to lack of coverage of the lexicon, we needed to perform a manual annotation." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-127", "text": "To do so, we randomly picked 100 candidates annotated as True by each model (regardless if they were in the Delac or not)." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-128", "text": "We then annotated all such candidates as True if they were found in Delac (without further inspection) and otherwise classified them manually following the definition of Silberztein and L.A.D.L. (1990) As we see in Table 1 , the experiment reveals a precision ranging from almost 60% up to 74%." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-129", "text": "The results of our comparative manual annotation indicate that the model trained on NN candidates has the capacity to find more MWEs not listed in our lexicon (41 out of 59) even if it is the least precise model." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-130", "text": "On the other hand, we notice that the model based on Noun-Adjective patterns is more precise but at the same time extracts fewer MWEs that are not already in the lexicon (34 out of 74)." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-131", "text": "Our mixed model confirms these two tendencies with a performance in between (38 new MWEs out of 66)." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-132", "text": "Thus, the method appears to be sensitive to the patterns used for training." 
}, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-133", "text": "We notice during evaluation different kinds of MWEs that are successfully extracted by models but that are not listed in the Delac." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-134", "text": "Most of them are the MWEs specific to Europarl (e.g., 'dimension communautaire', 'l\u00e9gislation europ\u00e9enne' 2 )." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-135", "text": "Another category are those MWEs that became 2 'community scale', 'European legislation' popular in the French language after the years 2000's and therefore could not be included in the Delac, released in 1997." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-136", "text": "Indeed by reading the first paragraph of the French version of Europarl we notice that the texts have been written after 1999." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-137", "text": "Of course, they are not the majority of the successfully extracted MWEs but we still manage to find up to 3 of them in a sample of 100 that we checked ('d\u00e9veloppement durable', 'radiophonie num\u00e9rique', 'site internet' 3 )." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-138", "text": "Furthermore the corpus in itself is already more than ten years old, so in a text of 2014 we can expect to find even more of them." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-139", "text": "Finally, there are MWEs that are not in French (e.g., 'Partido popular'), these, however, did not appear systematically in our samples." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-140", "text": "It is tricky to learn statistical properties of MWEs when, actually, we do not have all the information necessary for extracting the MWEs in the corpus." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-141", "text": "Indeed, for this purpose the corpus should ideally be read and annotated by humans." 
}, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-142", "text": "However, we still managed to train models with decent performance, even if it is likely that a lot of candidates pre-annotated as False in the training data were probably perfect MWEs." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-143", "text": "This means that the Delac has covered enough MWEs for the features to not appear as completely meaningless and arbitrary." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-144", "text": "The final precision would never be as good as it is, if the coverage had been not sufficient enough." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-145", "text": "This shows that the method of automatic annotation offered by mwetoolkit is reliable given a lexicon as large as Delac." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-146", "text": "----------------------------------" }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-147", "text": "**CONCLUSION**" }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-148", "text": "We wanted to know if the method of automatic extraction and evaluation offered by mwetoolkit could have a decent precision in French." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-149", "text": "We annotated automatically part of the Europarl corpus given the lexical resource Dela as a gold standard and generated in this way annotated training sets." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-150", "text": "Classifiers trained on this data using Weka achieved a maximum precision of 74%, with about half of the extracted MWEs being novel compared to the lexicon." }, { "sent_id": "033ce75c882764e08fb3871656a8d1-C001-151", "text": "In addition, we found that the final precision and novelty scores were sensitive to the choice of patterns used to generate the training data." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "033ce75c882764e08fb3871656a8d1-C001-20" ], [ "033ce75c882764e08fb3871656a8d1-C001-49", "033ce75c882764e08fb3871656a8d1-C001-50" ] ], "cite_sentences": [ "033ce75c882764e08fb3871656a8d1-C001-20", "033ce75c882764e08fb3871656a8d1-C001-50" ] }, "@MOT@": { "gold_contexts": [ [ "033ce75c882764e08fb3871656a8d1-C001-20", "033ce75c882764e08fb3871656a8d1-C001-21" ], [ "033ce75c882764e08fb3871656a8d1-C001-57" ] ], "cite_sentences": [ "033ce75c882764e08fb3871656a8d1-C001-20", "033ce75c882764e08fb3871656a8d1-C001-57" ] }, "@USE@": { "gold_contexts": [ [ "033ce75c882764e08fb3871656a8d1-C001-57" ], [ "033ce75c882764e08fb3871656a8d1-C001-78" ], [ "033ce75c882764e08fb3871656a8d1-C001-107" ] ], "cite_sentences": [ "033ce75c882764e08fb3871656a8d1-C001-57", "033ce75c882764e08fb3871656a8d1-C001-78", "033ce75c882764e08fb3871656a8d1-C001-107" ] }, "@DIF@": { "gold_contexts": [ [ "033ce75c882764e08fb3871656a8d1-C001-62" ] ], "cite_sentences": [ "033ce75c882764e08fb3871656a8d1-C001-62" ] } } }, "ABC_6869f08e826aa434471c51c010ef28_34": { "x": [ { "sent_id": "6869f08e826aa434471c51c010ef28-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-2", "text": "This paper presents a novel approach for enhancing the multiple sets of acoustic patterns automatically discovered from a given corpus." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-3", "text": "In a previous work it was proposed that different HMM configurations (number of states per model, number of distinct models) for the acoustic patterns form a two-dimensional space." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-4", "text": "Multiple sets of acoustic patterns automatically discovered with the HMM configurations properly located on different points over this two-dimensional space were shown to be complementary to one another, jointly capturing the characteristics of the given corpus." 
}, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-5", "text": "By representing the given corpus as sequences of acoustic patterns on different HMM sets, the pattern indices in these sequences can be relabeled considering the context consistency across the different sequences." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-6", "text": "Good improvements were observed in preliminary experiments of pattern spoken term detection (STD) performed on both TIMIT and Mandarin Broadcast News with such enhanced patterns." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-7", "text": "----------------------------------" }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-9", "text": "Supervised training of HMMs for large vocabulary continuous speech recognition (LVCSR) relies on not only collecting huge quantities of acoustic data, but also obtaining the corresponding transcriptions." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-10", "text": "Such supervised training methods yield adequate performance in most circumstances but at high cost, and in many situations such annotated data sets are simply not available." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-11", "text": "This is why substantial effort [1] [7] has been made for unsupervised discovery of acoustic patterns from huge quantities of acoustic data without annotation, which may be easily obtained nowadays." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-12", "text": "For some applications such as Spoken Term Detection (STD) [8] [9] [10] [11] [12] in which the goal is simply to match and find some signal segments, the extra effort of building an LVCSR system using corpora with human annotations is very often an unnecessary burden [13] [14] [15] [16] [17] ." 
}, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-13", "text": "M ost effort of unsupervised discovery of acoustic patterns considered only one level of phoneme-like acoustic patterns." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-14", "text": "However, it is well known that speech signals have multilevel structures including at least phonemes and words, and such structures are very helpful in analysing or decoding speech [12] ." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-15", "text": "In a previous work, we proposed to discover the hierarchical structure of two-level acoustic patterns, including subword-like and word-like patterns." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-16", "text": "A similar two-level framework was also developed recently [18] ." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-17", "text": "In a more recent attempt [19] , we further proposed a framework of discovering multi-level acoustic patterns with varying model granularity." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-122", "text": "Row (a) in Table 1 was the frame-based dynamic time warping (DTW) on MFCC sequences." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-18", "text": "The different pattern HMM configurations (number of states per model, number of distinct models) form a two-dimensional model granularity space." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-19", "text": "Different sets of acoustic patterns with HMM model configurations represented by different points properly distributed over this two-dimensional space are complementary to one another, thus jointly capture the characteristics of the corpora considered." 
}, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-20", "text": "Such a multi-level framework was shown to be very helpful in the task of unsupervised spoken term detection (STD) with spoken queries, because token matching can be performed with pattern indices on different levels of signal characteristics, and the information integration across multiple model granularities offered the improved performance." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-21", "text": "In this work, we further propose an enhanced version of the multi-level acoustic patterns with varying model granularity by considering the context consistency for the decoded pattern sequences within each level and across different levels." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-22", "text": "In other words, the acoustic patterns discovered on different levels are no longer trained completely independently." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-23", "text": "We try to \"relabel\" the pattern sequence for each utterance in the training corpora considering the context consistency within and across levels." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-24", "text": "For a certain level, the context consistency may indicate that the realizations of a certain pattern should be split into two different patterns, while the realizations of another two patterns should be merged." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-25", "text": "In this way the multi-level acoustic patterns can be enhanced." 
}, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-26", "text": "----------------------------------" }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-27", "text": "**PROPOSED APPROACH**" }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-28", "text": "----------------------------------" }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-29", "text": "**PATTERN DISCOVERY FOR A GIVEN MODEL CONFIGURATION**" }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-30", "text": "Given an unlabeled speech corpus, it is not difficult for unsupervised discovery of the desired acoustic patterns from the corpus for a chosen hyperparameter set \u03c8 that determines the HMM configuration (number of states per model and number of distinct models) [20] ." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-31", "text": "This can be achieved by first finding an initial label \u03c90 based on a set of assumed patterns for all observations in the corpus \u03c7 as in (1) [6] ." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-32", "text": "Then in each iteration t the HMM parameter set \u03b8 \u03c8 t can be trained with the label \u03c9t\u22121 obtained in the previous iteration as in (2) , and the new label \u03c9t can be obtained by pattern decoding with the obtained parameter set \u03b8 \u03c8 t as in (3) ." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-33", "text": "The training process can be repeated with enough number of iterations until a converged set of pattern HMMs is obtained." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-34", "text": "The above process can be performed with many different HMM configurations, each characterized by two hyperparameters: the number of states m in each acoustic pattern HMM, and the total number of distinct acoustic patterns n during initialization, \u03c8 = (m, n)." 
}, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-35", "text": "The transcription of a signal decoded with these patterns can be considered as a temporal segmentation of the signal, so the HMM length (or number of states in each HMM) m represents the temporal granularity." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-36", "text": "The set of all distinct acoustic patterns can be considered as a segmentation of the phonetic space, so the total number n of distinct acoustic patterns represents the phonetic granularity." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-37", "text": "This gives a two-dimensional representation of the acoustic pattern configurations in terms of temporal and phonetic granularities as in Fig. 1 ." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-38", "text": "Any point in this two-dimensional space in Fig. 1 corresponds to an acoustic pattern configuration." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-39", "text": "Note that in our previous work [19] , the effect of the third dimension, the acoustic granularity which is the number of Gaussians in each state, was shown to be negligible, thus here we simply set the number of Gaussians in each state to be 4 in all cases." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-40", "text": "Although the selection of the hyperparameters can be arbitrary in this two-dimensional space, here we only select M temporal granularities and N phonetic granularities, forming a two-dimensional array of M \u00d7 N hyperparameter sets in the granularity space." 
}, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-41", "text": "----------------------------------" }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-42", "text": "**PATTERN RELABELING CONSIDERING CONTEXT CONSISTENCY**" }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-43", "text": "Context constraints successfully explored in language modeling can be used here for relabeling the acoustic patterns as shown by an example in Fig. 2 ." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-44", "text": "We assume the patterns 'b' and 'B' are similar without context as in Fig. 2 As shown in Fig. 3 , assuming an utterance is decoded into four different pattern sequences using four sets of patterns with neighboring temporal granularity m4 > m3 > m2 > m1, i.e., pattern HMMs with different lengths." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-45", "text": "Considering a realization of pattern 'b' of temporal granularity m3, we find its central frame belongs to the realization of pattern 'a' of temporal granularity m4 and the realization of pattern 'c' of temporal granularity m2." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-46", "text": "So patterns 'a' and 'c' are taken as the context of pattern 'b' in neighboring temporal granularities." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-47", "text": "The same could be done for phonetic granularity." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-48", "text": "----------------------------------" }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-49", "text": "**PATTERN RELABELING METHOD**" }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-50", "text": "Let \u03c9(m k , n k , l) be the index for a decoded acoustic pattern at time l within an utterance in the corpus \u03c7 using the acoustic pattern set with the granularity \u03c8(m k , n k )." 
}, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-51", "text": "The relabeled pattern \u03c9(m k , n k , l) is then as in (4a) i.e., the pattern among all patterns in the set of \u03c8(m k , n k ) which maximizes the product of the three probabilities in (4b)(4c)(4d) evaluated with the context respectively in l, n, and m. The first probability P l (w) in (4b) for context in time l is actually the product of forward bigram and backward bigram well known in language modeling." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-52", "text": "The other two probabilities Pn(w), Pm(w) in (4c)(4d) are exactly the same, except n k\u22121 , n k+1 and m k\u22121 , m k+1 are the neighboring values of n and m." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-53", "text": "Finer patterns and coarser patterns are drastically different in terms of perplexity; shorter patterns and longer patterns produce very different pattern sequences in terms of duration." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-54", "text": "They are complementary to each other, but we only consider the context consistency among the neighboring granularity configurations as in (4)." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-55", "text": "This relabeling is performed on every decoded sequence of the M \u00d7 N pattern sets considered." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-56", "text": "Katz smoothing [21] was applied to deal with unseen pattern bigrams." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-57", "text": "On the boundary of the granularity configurations or time sequences, the bigram probability is taken as 1." 
}, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-58", "text": "----------------------------------" }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-59", "text": "**PATTERN ENHANCEMENT BY RE-ESTIMATION AFTER RELABELING**" }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-60", "text": "The relabeling in (4a) can be inserted into the recursive process of discovering the patterns in each iteration in (2)(3), as shown in (5)(6)." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-61", "text": "= arg max" }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-62", "text": "When an iteration is completed as in (2)(3), a new set of patterns is generated as in (2), with which a new set of labels is obtained as in (3) ." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-63", "text": "The new labels \u03c9t in (3) is then relabeled with (4a) based on the new labels \u03c9t on all different HMM sets to produce a slightly better label \u03c9t as in (5) ." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-64", "text": "This slightly better label \u03c9t is then used in (6) to generate a slightly better model set \u03b8 \u03c8 t+1 ." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-65", "text": "Note that (6) is almost the same as (2), except here based on the slightly better label \u03c9t obtained in (5) ." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-66", "text": "In this way the relabeling process can be repeatedly applied in every iteration, and the patterns can be enhanced by the relabeling process during the model re-estimation." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-67", "text": "Although it is theoretically possible to consider the optimization process in (3) and (5) jointly in a single step, such as maximizing the product of the two probabilities in the right hand sides of (3) and (5), practically such a joint optimization is computationally unfeasible." 
}, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-68", "text": "Therefore this is done in two separate steps here." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-69", "text": "----------------------------------" }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-70", "text": "**SPOKEN TERM DETECTION**" }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-71", "text": "There can be various applications for the acoustic patterns presented here." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-72", "text": "In this section we summarize the way to perform spoken term detection [19] ." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-73", "text": "Let {pr, r = 1, 2, 3, .., n} denote the n acoustic patterns in the set of \u03c8=(m, n)." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-74", "text": "We first construct a similarity matrix S of size n \u00d7 n off-line for every pattern set \u03c8=(m, n), for which the element S(i, j) is the similarity between any two pattern HMMs pi and pj in the set." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-75", "text": "The KL-divergence KL(i, j) between two pattern HMMs in (7) is defined as the symmetric KL-divergence between the states based on the variational approximation [22] summed over the states." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-76", "text": "To transform the KL divergence into a similarity measure between 0 and 1, a negative exponential was applied [23] with a scaling factor \u03b2." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-77", "text": "When \u03b2 is small, similarity between distinct patterns in (7) approaches zero, so (7) approaches the delta function \u03b4(i, j)." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-78", "text": "\u03b2 can be determined with a held out data set, but here we simply set it to 100." 
}, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-79", "text": "In the on-line phase, we perform the following for each entered spoken query q and each document (utterance) d in the archive for each pattern set \u03c8=(m, n)." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-80", "text": "Assume for a given pattern set a document d is decoded into a sequence of D acoustic patterns with indices (d1, d2, ..., dD) and the query q into a sequence of Q patterns with indices (q1, ..., qQ)." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-81", "text": "We thus construct a matching matrix W of size D \u00d7 Q for every document-query pair, in which each entry (i, j) is the similarity between acoustic patterns with indices di and qj as in (8) and shown in Fig. 4(a) for a simple example of Q = 3 and D = 6, where S(i, j) is defined in (7)," }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-82", "text": "It is possible to consider the N-best pattern sequences rather than the one-best sequences here by considering the posteriorgram vectors based on the N-best sequences for d, q and integrate them in the matrix W ." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-83", "text": "However, previous experiments showed that the extra improvements brought in this way is almost negligible, probably because the M \u00d7 N different pattern sequences based on the M \u00d7 N different pattern sets can be considered as a huge lattice including many one-best paths which will be jointly considered here [19] ." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-84", "text": "For matching the sub-sequence of d with q, we sum the elements in the matrix W in (9) along the diagonal direction, generating the accumulated similarities for all sub-sequences starting at all pattern positions in d as shown in Fig. 4(a) ." 
}, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-85", "text": "The maximum is selected to represent the relevance between document d and query q on the pattern set \u03c8=(m, n) as in (9) ." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-86", "text": "It is also possible to consider dynamic time warping (DTW) on the matrix W as shown in Fig. 4(b) ." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-87", "text": "However, previous experiments showed that the extra improvements brought in this way is almost negligible, probably because here we have jointly considered the M \u00d7 N different pattern sequences based on the M \u00d7 N different pattern sets (e.g. including longer /shorter patterns), so the different time-warped matching and insertion/deletion between d and q is already automatically included [19] ." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-88", "text": "The M \u00d7N relevance scores R(d, q) in (9) obtained with M \u00d7N pattern sets \u03c8=(m, n) are then averaged and the average scores are used in ranking all the documents for spoken term detection." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-89", "text": "It is also possible to learn the weights for different pattern sets to produce better results using a development set." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-90", "text": "But here we simply assume the detection is completely unsupervised without any annotation, and all pattern sets are equally weighted [19] ." 
}, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-91", "text": "----------------------------------" }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-92", "text": "**EXPERIMENTS**" }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-93", "text": "----------------------------------" }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-94", "text": "**PURITY IN PATTERN SEQUENCES FOR KNOWN WORDS**" }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-95", "text": "In order to evaluate the quality of the acoustic patterns we discovered with varying temporal and phonetic granularities, we use the Gini impurity for the pattern sequences found for known high frequency words, since this can be evaluated for any given pattern set." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-96", "text": "Assume all the realizations of a high frequency word (e.g. the word \"water\") are decoded into I different pattern sequences, each occupying a percentage fi of the realizations (\u03a3ifi = 1), we can evaluate the Gini impurity [24] for the word using the I percentages f ={fi, i=1,2,...I} as in (10)," }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-97", "text": "Gini impurity falls within the interval [0, 1), reaches zero when all the realizations are decoded into the same pattern sequence, and becomes larger when the distribution is less pure." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-98", "text": "We trained the above different sets of patterns with m=3, 5, 7, 9, 11 and n=50, 100, 200, 300 on the TIMIT training set." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-99", "text": "Fig. 5 shows the average Gini impurity for the top 20 words with the highest occurrence counts in TIMIT training set, based on the original patterns (blue) and those after relabeling (green) for all cases considered." 
}, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-100", "text": "We see the impurity was in general high for such automatically discovered patterns because the realizations of the same phoneme produced different speakers were possibly decoded as different patterns, and the insertion/deletion inevitably increased the impurity." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-101", "text": "Although the impurity was high, the relabeling proposed here generated better patterns." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-102", "text": "We see the difference was more significant for larger m. Because the temporal variation is easily captured by models with short patterns (m=3 or 5 with high impurity) which increases the impurity, much lower impurity was achieved with longer patterns (m=9 or 11)." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-103", "text": "Another set of results for average Gini impurity for the cluster of words with occurrence counts ranging from 16 to 22 in the TIMIT training set is shown in Fig. 6 for m=3 and 11 states per HMM with varying number of distinct patterns (n)." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-104", "text": "It is still quite clear that the relabeling process enhanced the patterns, and it is interesting to note that the trends for m=3 and 11 are quite different ( Fig. 5(a) and (b))." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-105", "text": "As mentioned above, the temporal variation is easily captured by models with short patterns which increases the impurity (e.g. m=3 in Fig. 6(a) ) so increasing the number of patterns (n) helped reduce the impurity." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-106", "text": "However, when the models are long enough (e.g. m=11 in Fig. 6(b) ), larger number of patterns(n) gives more redundant patterns which caused confusion during decoding, so the impurity went up with larger n. 
These results indicate that the different sets of patterns of different model granularities were complementary to each other." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-107", "text": "Note that only high frequency words with enough realizations can be used or the impurity evaluation here to show the quality of the patterns. But how these patterns can be applied to spoken term detection will be shown below, for which the queries are usually low frequency words, whose impurity is difficult to evaluate." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-108", "text": "----------------------------------" }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-109", "text": "**UNSUPERVISED SPOKEN TERM DETECTION**" }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-110", "text": "We conducted two separate query by example spoken term detection experiments on two spoken archives." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-111", "text": "In the first experiment, the TIMIT training set was used as the spoken archive and the spoken query set consisted of 16 words randomly selected from the TIMIT testing set." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-112", "text": "In the second experiment, the spoken archive was 4.5 hours of Mandarin Broadcast News segmented into 5034 spoken documents and the spoken query set was 10 words selected from another development set." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-113", "text": "In either case, a spoken instance of a query word was randomly selected from the data set, and used as the spoken query to search for other instances in the spoken archive." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-114", "text": "The conventional 39 dimensional MFCC features were used for the HMMs." 
}, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-115", "text": "20 sets of acoustic patterns were generated for TIMIT with m = 3, 5, 7, 9, 11 and n = 50, 100, 200, 300; 9 sets for the Mandarin Broadcast News with m = 3, 7, 13 and n = 50, 100, 300; all with 4 Gaussian mixtures per state." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-116", "text": "We compared \u03c9(m, n) with \u03c9(m, n) for each (m, n) pair." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-117", "text": "We used the mean average precision (MAP) [25] [26] as the performance measure, a higher value implies better performance." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-118", "text": "The MAP performance of each of the 20 pattern sets for TIMIT and 9 sets for Mandarin Broadcast News before and after relabeling is in Fig. 7(a)(b) where the performance was clearly boosted for most of the pattern sets." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-119", "text": "A paired sample t-test was used to check the MAP improvement of relabeled pattern sets, t(28)=3.37, p=0.0011, significant improvement was observed." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-120", "text": "Note that different from TIMIT which had many different speakers, the Mandarin Broadcast News was produced by a limited number of anchors, so MAP for each pattern set ranged between 18% to 22%, much higher than TIMIT." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-121", "text": "Although the MAP for each individual pattern set was relatively low on TIMIT (1% to 5%) in general, much better results in MAP can be obtained when all of them are jointly considered as rows (b)(c) in Table 1 ." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-123", "text": "We see the relabeled patterns achieved an MAP of 28.26% and 24.50% which is significantly better than that using the original patterns (26.32% and 23.38%)." 
}, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-124", "text": "Further more, both of them significantly outperformed the baseline (10.16% and 22.19%), which proved the improvement was non-trivial." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-125", "text": "----------------------------------" }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-126", "text": "**CONCLUSION**" }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-127", "text": "In this work, we propose a method for improving the quality of multilevel acoustic patterns discovered from a target corpus." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-128", "text": "By incorporating context consistency in time and model granularity, a more consistent set of patterns can be obtained." }, { "sent_id": "6869f08e826aa434471c51c010ef28-C001-129", "text": "This is verified with improved performance in spoken term detection on TIMIT and Mandarin Broadcast News." } ], "y": { "@BACK@": { "gold_contexts": [ [ "6869f08e826aa434471c51c010ef28-C001-17" ], [ "6869f08e826aa434471c51c010ef28-C001-39" ], [ "6869f08e826aa434471c51c010ef28-C001-72" ], [ "6869f08e826aa434471c51c010ef28-C001-87" ] ], "cite_sentences": [ "6869f08e826aa434471c51c010ef28-C001-17", "6869f08e826aa434471c51c010ef28-C001-39", "6869f08e826aa434471c51c010ef28-C001-72", "6869f08e826aa434471c51c010ef28-C001-87" ] }, "@UNSURE@": { "gold_contexts": [ [ "6869f08e826aa434471c51c010ef28-C001-83" ], [ "6869f08e826aa434471c51c010ef28-C001-90" ] ], "cite_sentences": [ "6869f08e826aa434471c51c010ef28-C001-83", "6869f08e826aa434471c51c010ef28-C001-90" ] } } }, "ABC_b9e9f358ace19da43bfe9e5bc380c5_34": { "x": [ { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-2", "text": "The task of semantic parsing is highly useful for dialogue and question answering systems." 
}, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-3", "text": "Many datasets have been proposed to map natural language text into SQL, among which the recent Spider dataset provides crossdomain samples with multiple tables and complex queries." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-4", "text": "We build a Spider dataset for Chinese, which is currently a low-resource language in this task area." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-5", "text": "Interesting research questions arise from the uniqueness of the language, which requires word segmentation, and also from the fact that SQL keywords and columns of DB tables are typically written in English." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-6", "text": "We compare character-and wordbased encoders for a semantic parser, and different embedding schemes." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-7", "text": "Results show that word-based semantic parser is subject to segmentation errors and cross-lingual word embeddings are useful for text-to-SQL." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-8", "text": "----------------------------------" }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-10", "text": "The task of semantic parsing is highly useful for tasks such as dialogue (Chen et al., 2013; Gupta et al., 2018; Einolghozati et al., 2019) and question answering (Gildea and Jurafsky, 2002; Yih et al., 2015; Reddy et al., 2016) ." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-11", "text": "Among a wide range of possible semantic representations, SQL offers a standardized interface to knowledge bases across tasks (Astrova, 2009; Xu et al., 2017; Dong and Lapata, 2018; Lee et al., 2011) ." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-12", "text": "Recently, Yu et al. 
(2018b) released a manually labelled dataset for parsing natural language questions into complex SQL, which facilitates related research." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-13", "text": "Yu et al. (2018b) 's dataset is exclusive for English questions." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-14", "text": "Intuitively, the same semantic parsing task can be applied cross-lingual, since SQL is a universal semantic representation and database interface." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-15", "text": "However, for languages other than English, there can be added difficulties parsing into SQL." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-16", "text": "Take Chinese for example, the additional challenges can be at least two-fold." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-17", "text": "First, structures of relational databases, in particular names and column names of DB tables, are typically represented in English." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-18", "text": "This adds to the challenges to question-to-DB mapping." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-19", "text": "Second, the basic semantic unit for denoting columns or cells can be words, but word segmentation can be erroneous." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-20", "text": "It is also interesting to study the influence of other linguistic characteristics of Chinese, such as zero-pronoun, on its SQL parsing." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-21", "text": "We investigate parsing Chinese questions to SQL by creating a first dataset, and empirically evaluating a strong baseline model on the dataset." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-22", "text": "In particular, we translate the Spider (Yu et al., 2018b) dataset into Chinese." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-23", "text": "Using the model of Yu et al. 
(2018a) , we compare several key model configurations." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-24", "text": "Results show that our human-translated dataset is significantly more reliable compared to a dataset composed of machine-translated questions." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-25", "text": "In addition, the overall accuracy for Chinese SQL semantic parsing can be comparable to that for English." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-26", "text": "We found that cross-lingual word embeddings are useful for matching Chinese questions with English table columns and keywords and that language characteristics have a significant influence on parsing results." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-27", "text": "We release our dataset named CSpider and code at https://github.com/taolusi/chisp." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-28", "text": "----------------------------------" }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-29", "text": "**RELATED WORK**" }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-30", "text": "Existing datasets for semantic parsing can be classified into two major categories." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-31", "text": "The first uses logic for semantic representation, including ATIS (Price, 1990; Dahl et al., 1994) and GeroQuery (Zelle and Mooney, 1996) ." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-32", "text": "The second and dominant category of datasets uses SQL, which includes Restaurants (Tang and Mooney, 2001; Popescu et al., 2003) , Academic (Iyer et al., 2017) , Yelp and IMDB (Yaghmazadeh et al., 2017) , Ad-vising (Finegan-Dollak et al., 2018) and the recently proposed WikiSQL (Zhong et al., 2017) and Spider (Yu et al., 2018b) ." 
}, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-33", "text": "One salient difference between Spider and prior work is that Spider uses different databases across domains for training and testing, which can verify the generalization power of a semantic parsing model." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-34", "text": "Compared with WikiSQL, Spider further has multiple tables in each database and correspondingly more complex queries." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-35", "text": "We thus consider Spider for sourcing our dataset." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-36", "text": "Existing semantic parsing datasets for Chinese include a small corpus for assigning semantic roles (Sun and Jurafsky, 2004) and SemEval-2016 Task 9 for Chinese semantic dependency parsing (Che et al., 2012) , but these data are not related to SQL." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-37", "text": "To our knowledge, we are the first to release a Chinese SQL semantic parsing dataset." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-38", "text": "There has been a line of work improving the model of Yu et al. (2018a) since the release of the Spider dataset (Guo et al., 2019; Lin et al., 2019) ." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-39", "text": "At the time of our investigation, however, the models are not published." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-40", "text": "We thus chose the model of Yu et al. (2018a) as our baseline." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-41", "text": "The choice of more different neural models is orthogonal to our dataset contribution, but can empirically give more insights about the conclusions." 
}, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-42", "text": "----------------------------------" }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-43", "text": "**DATASET**" }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-44", "text": "We translate all English questions in the Spider dataset into Chinese." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-45", "text": "1 The work is undertaken by 2 NLP researchers and 1 computer science student." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-46", "text": "Each question is first translated by one annotator, and then cross-checked and corrected by a second annotator." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-47", "text": "Finally, a third annotator verifies the original and corrected versions." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-48", "text": "Statistics of the dataset are shown in Table 1 ." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-49", "text": "There are originally 10181 questions from Spider, but only 9691 for the training and development sets are publicly available." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-50", "text": "We thus translated these sentences only." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-51", "text": "Following the database split setting of Yu et al. (2018b) , we make training, development and test sets split in a way that no database overlaps in them as shown in Table 1 ." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-52", "text": "The translation work is performed on a database to database basis." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-53", "text": "For each database, the same translator translates relevant inquiries sentence by # Q # SQL # DB # sentence." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-54", "text": "The translator is asked to read the original question as well as the SQL query before making its Chinese translation." 
}, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-55", "text": "If the literal translation is possible, the translator is asked to stick to the original sentence style as much as feasible." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-56", "text": "For complex questions, the translator is allowed to rephrase the English question, so that the most natural Chinese translation is made." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-57", "text": "In addition, we keep the diversity of style in the English dataset by matching different English expressions to different Chinese expressions." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-58", "text": "A sample of our dataset is shown in Table 2 ." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-59", "text": "Our dataset is named CSpider." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-60", "text": "----------------------------------" }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-61", "text": "**MODEL**" }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-62", "text": "We use the neural semantic parsing method of Yu et al. (2018a) as the baseline model, which can be regarded as a sequence-to-tree model." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-63", "text": "In particular, the input question is encoded using an LSTM sequence encoder, and the output is a SQL query in its syntactic tree form." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-64", "text": "The tree is generated incrementally top-down, in a pre-order traversal sequence." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-65", "text": "Tree nodes include keyword nodes (e.g., SELECT, WHERE, EXCEPT) and table column name nodes (e.g., ID, City, Surname, which are defined in specific tables), which are represented in respective embedding spaces." 
}, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-66", "text": "Each keyword or column is generated by attention to the embedding space using the question representation as a key." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-67", "text": "A stack is used for incremental decoding, where the whole output history is leveraged as a feature for deciding the next term." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-68", "text": "This method gives the current released state-of-the-art results while submitting this paper." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-69", "text": "We provide a visualization of the model in Figure 1 ." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-70", "text": "----------------------------------" }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-71", "text": "**EXPERIMENTS**" }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-72", "text": "We focus on comparing different word segmentation methods and different embedding representations." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-73", "text": "As discussed above, column names are selected by attention over column embeddings using sentence representation as a key." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-74", "text": "Hence there must be a link between the embeddings of columns and those of the questions." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-75", "text": "Since columns are written in English and questions in Chinese, we consider two embedding methods." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-76", "text": "The first method is to use two separate sets of embeddings for Chinese and English, respectively." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-77", "text": "We use Glove (Pennington et al., 2014) 2 for embeddings of English keywords, column names etc., and Tencent embeddings (Song et al., 2018 ) 3 for Chinese." 
}, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-78", "text": "The second method is to directly use the cross-lingual word embeddings." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-79", "text": "To this end, the Tencent multilingual embeddings are chosen, which contain both Chinese and English words in a multi-lingual embedding matrix." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-80", "text": "Evaluation Metrics." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-81", "text": "We follow Yu et al. (2018b) , evaluating the results using two major 2 https://nlp.stanford.edu/projects/glove/ 3 https://ai.tencent.com/ailab/nlp/embedding.html types of metrics." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-82", "text": "The first is exact matching accuracy, namely the percentage of questions that have exactly the same SQL output as its reference." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-83", "text": "The second is component matching F1, namely the F1 scores for SELECT, WHERE, GROUP BY, ORDER BY and all keywords, respectively." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-84", "text": "Hyperparameters." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-85", "text": "Our hyperparameters are mostly taken from Yu et al. (2018a) , but tuned on the Chinese Spider development set." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-86", "text": "We use character and word embeddings from Tencent embedding; both of them are not fine-tuned during model training." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-87", "text": "Embedding sizes are set to 200 for both characters and words." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-88", "text": "For the different choices of keywords and column names embeddings, sizes are set to 200 and 300, respectively." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-89", "text": "Adam (Kingma and Ba, 2014) is used for optimization, with a learning rate of 1e-4." 
}, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-90", "text": "Dropout is used for the output of LSTM with a rate of 0.5." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-91", "text": "For word-based models, segmentation is necessary." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-92", "text": "We take two segmentors with different performances, including the Jieba segmentor and the model of Yang et al. (2017) , which we name Jieba and YZ, respectively." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-93", "text": "To verify their accuracy, we manually segment the first 100 sentences from the test set." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-94", "text": "Jieba and YZ give F1 scores of 89.8% and 91.7%, respectively." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-95", "text": "----------------------------------" }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-96", "text": "**OVERALL RESULTS**" }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-97", "text": "The overall exact matching results are shown in Table 3 ." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-98", "text": "In this table, ENG represents the results of Yu et al. (2018a) 's model on their English dataset but under our split." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-99", "text": "HT and MT denote human translation and machine translation of questions, respectively." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-100", "text": "Both HT and MT results are evaluated on human translated questions." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-101", "text": "C-ML and C-S denote the results of our Chinese models based on characters with multi-lingual embeddings and monolingual embeddings, respectively, while WY-ML, WY-S denote the wordbased models applying YZ segmentor with multilingual embeddings and monolingual embeddings, respectively." 
}, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-102", "text": "Finally, WJ-ML and WJ-S denote the word model with multi-lingual embeddings and monolingual embeddings with the Jieba segmentor, respectively." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-103", "text": "First, compared to the best results of human translation (C-ML and WY-ML), machine translation results show a large disadvantage (e.g. 7.1% vs 12.1% using C-ML)." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-104", "text": "We further did a manual inspection of 100 randomly picked machinetranslated sentences." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-105", "text": "Out of the 100 translated Second, comparisons among C-ML, WY-ML and WJ-ML, and among C-S, WY-S and WJ-S show that multi-lingual embeddings give superior results compared to monolingual embeddings, which is likely because they bring a better connection between natural language questions and database columns." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-106", "text": "Third, comparisons between WY-ML and WJ-ML, and WY-S and WJ-S indicate that better segmentation accuracy has a significant influence on question parsing." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-107", "text": "Word-based methods are subject to segmentation errors." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-108", "text": "Moreover, with the current segmentation accuracy of 92%, a word-based model underperforms a character-based model." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-109", "text": "Intuitively, since words carry more direct semantic information as compared with database columns and keywords, improved segmentation may allow a word-based model to outperform a character-based model." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-110", "text": "Finally, for easy questions, the character-based model shows strong advantages over the wordbased models." 
}, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-111", "text": "However, for medium to extremely hard questions, the trend becomes less obvious, which is likely because the intrinsic semantic complexity overwhelms the encoding differences." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-112", "text": "Our best Chinese system gives an overall accuracy of 12.1%, 4 which is less but comparable to the English results." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-113", "text": "This shows that Chinese semantic parsing may not be significantly more challenging compared to English with text to SQL." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-114", "text": "Component matching." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-115", "text": "Figure 2 shows F1 scores of several typical components, including 4 Note that the results are lower than those reported by Yu et al. (2018a) under their split due to different training/test splits." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-116", "text": "Our split has less training data and more test instances in the \"Hard\" category and less in \"Easy\" and \"Medium\"." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-117", "text": "Table 4 ." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-118", "text": "Specifically, the char-based methods achieve around 41% on SELN and SEL (SELECT), which are about 5% higher compared to the word-based methods." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-119", "text": "This result may be due to the fact that word-based models are sensitive to the OOV words (Zhang and Yang, 2018; Li et al., 2019) ." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-120", "text": "Unlike other components, SEL and SELN are confronted with more severe OOV challenges caused by recognizing the unseen schema during testing." 
}, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-121", "text": "In addition, the models using multi-lingual embedding overperform the models using separate embeddings on both WHEN and OB (OR-DERBY), which further demonstrates that embeddings in the same dimension distribution benefit to strengthen the connection between the question and the schema." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-122", "text": "Contrary to the overall result, the models employing the jieba segmentor perform better than those using the YZ segmentor on OB." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-123", "text": "The reason is that the jieba segmentor has different word segmentation results in terms of the superlative of adjectives." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-124", "text": "For example, the word \"\u6700\u9ad8\" (the highest) is segmented as \"\u6700\"(most) and \"\u9ad8\"(high) by YZ segmentor but \"\u6700\u9ad8\" in jieba segmentor." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-125", "text": "This again demonstrates the influence of word segmentation." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-126", "text": "Finally, for GB (GROUPBY) there is not a regular contrast pattern between different models, which can be likely because of the lack of sufficient training data." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-127", "text": "Figure 3 shows the negative influence of segmentation errors." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-128", "text": "In particular, the incorrect segmentation of the word \"\u5e97\u540d\" (shop name) leads to incorrect SQL for the whole sentence, since the character \"\u5e97\" (shop) can typically be associated with \"\u5e97\u957f\" (shop manager)." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-129", "text": "Figure 4 shows the sensitivity of our model to sentence patterns." 
}, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-130", "text": "In particular, the word-based model frequently gives incorrect predictions for many question sentences." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-131", "text": "As shown in the first row, the word \"where\" confuses the system in choosing between \"ORDER BY\" and \"GROUP BY\"." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-132", "text": "When we manually change the sentence pattern into \"List the most common hometown of teachers\", the parser gives the correct keyword." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-133", "text": "In contrast, the character-based model is less sensitive to question sentences, which is likely because characters are less sparse compared with words." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-134", "text": "More training data or contextualized embeddings may alleviate the issue for the word-based method, which we leave for future work." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-135", "text": "Figure 5 shows the sensitivity of the model to Chinese linguistic patterns." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-136", "text": "In particular, the first sentence has a zero pronoun \"\u5404\u515a\u7684\" (in each party), which is omitted later." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-137", "text": "As a result, a semantic parser cannot tell the correct database columns from the sentence." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-138", "text": "We manually add the correct entity for the zero pronoun, resulting in the second sentence." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-139", "text": "The parser can correctly identify both the column name and the table name for this corrected sentence."
}, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-140", "text": "Since zero pronouns are frequent in Chinese (Chen and Ng, 2016) , they add difficulty to its semantic parsing." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-141", "text": "----------------------------------" }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-142", "text": "**CASE STUDY**" }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-143", "text": "----------------------------------" }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-144", "text": "**CONCLUSION**" }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-145", "text": "We constructed a first resource named CSpider for Chinese text-to-SQL, evaluating the performance of a strong English model on this dataset." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-146", "text": "Results show that the input representation, embedding forms and linguistic factors all influence the Chinese-specific task." }, { "sent_id": "b9e9f358ace19da43bfe9e5bc380c5-C001-147", "text": "Our dataset can serve as a starting point for further research on this task, which can be beneficial to the investigation of Chinese QA and dialogue models."
} ], "y": { "@BACK@": { "gold_contexts": [ [ "b9e9f358ace19da43bfe9e5bc380c5-C001-12", "b9e9f358ace19da43bfe9e5bc380c5-C001-13" ], [ "b9e9f358ace19da43bfe9e5bc380c5-C001-32" ] ], "cite_sentences": [ "b9e9f358ace19da43bfe9e5bc380c5-C001-12", "b9e9f358ace19da43bfe9e5bc380c5-C001-13", "b9e9f358ace19da43bfe9e5bc380c5-C001-32" ] }, "@MOT@": { "gold_contexts": [ [ "b9e9f358ace19da43bfe9e5bc380c5-C001-13", "b9e9f358ace19da43bfe9e5bc380c5-C001-14", "b9e9f358ace19da43bfe9e5bc380c5-C001-15", "b9e9f358ace19da43bfe9e5bc380c5-C001-16", "b9e9f358ace19da43bfe9e5bc380c5-C001-17", "b9e9f358ace19da43bfe9e5bc380c5-C001-18" ] ], "cite_sentences": [ "b9e9f358ace19da43bfe9e5bc380c5-C001-13" ] }, "@EXT@": { "gold_contexts": [ [ "b9e9f358ace19da43bfe9e5bc380c5-C001-22" ] ], "cite_sentences": [ "b9e9f358ace19da43bfe9e5bc380c5-C001-22" ] }, "@DIF@": { "gold_contexts": [ [ "b9e9f358ace19da43bfe9e5bc380c5-C001-22" ] ], "cite_sentences": [ "b9e9f358ace19da43bfe9e5bc380c5-C001-22" ] }, "@USE@": { "gold_contexts": [ [ "b9e9f358ace19da43bfe9e5bc380c5-C001-51" ], [ "b9e9f358ace19da43bfe9e5bc380c5-C001-81" ] ], "cite_sentences": [ "b9e9f358ace19da43bfe9e5bc380c5-C001-51", "b9e9f358ace19da43bfe9e5bc380c5-C001-81" ] } } }, "ABC_8b5e14bdf3f415725333de672be114_34": { "x": [ { "sent_id": "8b5e14bdf3f415725333de672be114-C001-9", "text": "Knowing the reading level of a children's book is an important task in the educational setting." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-2", "text": "Early primary children's literature poses some interesting challenges for automated readability assessment: for example, teachers often use fine-grained reading leveling systems for determining appropriate books for children to read (many current systems approach readability assessment at a coarser whole grade level)." 
}, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-3", "text": "In previous work (Ma et al., 2012), we suggested that the fine-grained assessment task can be approached using a ranking methodology, and incorporating features that correspond to the visual layout of the page improves performance." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-4", "text": "However, the previous methodology for using \"found\" text (e.g., scanning in a book from the library) requires human annotation of the text regions and correction of the OCR text." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-5", "text": "In this work, we ask whether the annotation process can be automated, and also experiment with richer syntactic features found in the literature that can be automatically derived from either the human-corrected or raw OCR text." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-6", "text": "We find that automated visual and text feature extraction work reasonably well and can allow for scaling to larger datasets, but that in our particular experiments the use of syntactic features adds little to the performance of the system, contrary to previous findings." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-7", "text": "----------------------------------" }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-10", "text": "Teachers want to have leveling for books in the school library; parents are trying to select appropriate books for their children; writers need guidance while writing for different literacy needs (e.g. text simplification): reading level assessment is required in a variety of contexts."
}, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-11", "text": "The history of assessing readability using simple arithmetic metrics dates back to the 1920s, when Thorndike (1921) measured the difficulty of texts by tabulating words according to the frequency of their use in general literature." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-12", "text": "Most of the traditional readability formulas were also based on countable features of text, such as syllable counts (Flesch, 1948) ." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-13", "text": "More advanced machine learning techniques such as classification and regression have been applied to the task of reading level prediction (Collins-Thompson and Callan, 2004; Schwarm and Ostendorf, 2005; Petersen and Ostendorf, 2009; Feng et al., 2010) ; such works are described in further detail in Section 2." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-14", "text": "In recent work (Ma et al., 2012) , we approached the problem of fine-grained leveling of books, demonstrating that a ranking approach to predicting reading level outperforms both classification and regression approaches in that domain." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-15", "text": "A further finding was that visually-oriented features that consider the visual layout of the page (e.g. number of text lines per annotated text region, text region area compared to the whole page area, font size, etc.) play an important role in predicting the reading levels of children's books, in which pictures and textual layout dominate the book content over text." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-16", "text": "However, the data preparation process in our previous study involves human intervention: we ask human annotators to draw rectangle markups around text regions over pages."
}, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-17", "text": "Moreover, we only use a very shallow surface level text-based feature set to compare with the visually-oriented features." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-18", "text": "Hence in this paper, we assess the effect of using completely automated annotation processing within the same framework." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-19", "text": "We are interested in exploring how much performance will change by completely eliminating manual intervention." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-20", "text": "At the same time, we have also extended our previous feature set by introducing a richer set of automatically derived text-based features, proposed by Feng et al. (2010) , which capture deeper syntactic complexities of the text." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-21", "text": "Unlike our previous work, the major goal of this paper is not to compare different machine learning techniques used in the readability assessment task, but rather to compare the performance of our previously proposed system framework with and without human labor involved." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-22", "text": "We begin the paper with the description of related work in Section 2, followed by a detailed explanation of data preparation and automatic annotations in Section 3." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-23", "text": "The extended features will be covered in Section 4, followed by experimental analysis in Section 5, in which we will compare the results between human annotations and automatic annotations." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-24", "text": "We will also report the system performance after incorporating the rich text features (structural features)." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-25", "text": "Conclusions follow in Section 6."
}, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-26", "text": "----------------------------------" }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-27", "text": "**RELATED WORK**" }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-28", "text": "Since 1920, approximately 200 readability formulas have been reported in the literature (DuBay, 2004) ; statistical language processing techniques have recently entered into the fray for readability assessment." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-29", "text": "Si and Callan (2001) and Collins-Thompson and Callan (2004) have demonstrated that the use of language models is more robust for web documents and passages." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-30", "text": "Heilman et al. (2007) studied the impact of grammar-based features combined with a language modeling approach for readability assessment of first and second language texts." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-31", "text": "They argued that grammar-based features are more pertinent for second language learners than for first language readers." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-32", "text": "Schwarm and Ostendorf (2005) and Petersen and Ostendorf (2009) both used a support vector machine to classify texts based on the reading level." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-33", "text": "They combined traditional methods of readability assessment and the features from language models and parsers." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-34", "text": "Aluisio et al. (2010) have developed a tool for text simplification for the authoring process which addresses lexical and syntactic phenomena to make text readable, but their assessment takes place at coarser levels of literacy instead of the finer-grained levels used for children's books."
}, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-35", "text": "A detailed analysis of various features for automatic readability assessment has been done by Feng et al. (2010) ." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-36", "text": "Most of the previous work has used web page documents, short passages or articles from educational newspapers as their datasets; typically the task is to assess reading level at a whole-grade level." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-37", "text": "In contrast, early primary children's literature is typically leveled in a more fine-grained manner, and the research question we pursued in our previous study was to investigate appropriate methods of predicting what we suspected was a non-linear reading level scale." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-38", "text": "Automating the process of readability assessment is crucial for eventual widespread acceptance." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-39", "text": "Previous studies have looked at documents that were already found in electronic form, such as web texts." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-40", "text": "While e-books are certainly on the rise (and would help automated processing) it is unlikely that paper books will be completely eliminated from the primary school classroom soon." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-41", "text": "Our previous study required both manual scanning of the books and manual annotation of the books to extract the location and content of text within the book -the necessity of which we evaluate in this study by examining the effects of errors from the digitization process." 
}, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-42", "text": "----------------------------------" }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-43", "text": "**DATA PREPARATION AND BOOK ANNOTATION**" }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-44", "text": "Our previous study was based on a corpus of 36 scanned children's books; in this study we have expanded the set to 97 books which range from levels A to N in the Fountas and Pinnell Benchmark Assessment System 1 (Fountas and Pinnell, 2010 ); the Fountas and Pinnell level serves as our gold standard." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-45", "text": "The distribution of number of books per reading level is shown in Table 1 ." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-46", "text": "Levels A to N, in increasing difficulty, correspond to primary grade books from roughly kindergarten through third grade." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-47", "text": "The collection of children's books covers a large diversity of genres, series and publishers." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-48", "text": "Our agreement with the books' publishers only allows access to physical copies of books rather than electronic versions; we scan each book into a PDF version." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-49", "text": "This situation would be similar to that of a contemporary classroom teacher who is selecting books from the classroom or school library for evaluating a child's literacy progress." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-50", "text": "1 We then use Adobe Acrobat to run OCR (Optical Character Recognition) on the PDF books." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-51", "text": "Following our previous work, we first begin our process of annotating each book using Adobe Acrobat before converting them into corresponding XML files."
}, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-52", "text": "Features for each book are extracted from their corresponding XMLs, which contain all the text information and book layout contents necessary to calculate the features." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-53", "text": "Each book is manually scanned, and then annotated in two different ways: we use human annotators (Section 3.1) and a completely automated process (Section 3.2)." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-54", "text": "The job of human annotators is primarily to eliminate the errors made by OCR software, as well as correctly identifying text regions on each page." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-55", "text": "We encountered three types of typical OCR errors for the children's books in our set: 3. OCR could misread the text." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-56", "text": "These are the most common errors." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-57", "text": "Some examples of this type of error are shown in Table 2 ." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-58", "text": "The two different annotation processes are explained in the following Subsections 3.1 and 3.2." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-59", "text": "----------------------------------" }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-60", "text": "**HUMAN ANNOTATION**" }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-61", "text": "Annotators manually draw a rectangular box over the text region on each page using Adobe Acrobat markup drawing tools." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-62", "text": "The annotators also correct types 2 and 3 of the OCR errors mentioned above."
}, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-63", "text": "In the human annotation process, false alarm (type 1) errors are implicitly prevented, since the annotators only annotate regions where text truly exists on the page (whether or not the OCR recognized it)." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-64", "text": "----------------------------------" }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-65", "text": "**AUTOMATIC ANNOTATION**" }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-66", "text": "For automatic annotation, we make use of the JavaScript API provided by Adobe Acrobat." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-67", "text": "The automatic annotation tool is implemented as a JavaScript plugin menu item within Adobe Acrobat." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-68", "text": "The JavaScript API can return the position of every single recognized word on the page." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-69", "text": "Based on the position cues of each word, we design a simple algorithm to automatically cluster the words into separate groups according to certain spatial distance thresholds." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-70", "text": "2 Intuitively, one could imagine the words as small floating soap bubbles on the page, where smaller bubbles (individual words) which are close enough will merge together to form bigger bubbles (text regions) automatically." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-71", "text": "For each detected text region, a bounding rectangle box annotation is drawn on the page automatically." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-72", "text": "Beyond this point, the rest of the data preparation process is identical to human annotation, in which the corresponding XMLs will be generated from the annotated versions of the PDF books."
}, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-73", "text": "However, unlike human annotation, automating the annotation process can introduce noise into the data due to uncorrected OCR errors." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-74", "text": "In correspondence to the three types of OCR errors, automatic annotation could also draw extra bounding rectangle boxes on non-text regions (where OCR thinks there is text but there is not), fail to draw bounding rectangle boxes on text regions (where OCR should have recognized text but does not) and accept many mis-recognized non-word symbols as text content (where OCR misreads words)." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-75", "text": "----------------------------------" }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-76", "text": "**GENERATING XMLS FROM ANNOTATED PDF BOOKS**" }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-77", "text": "This process is also implemented as another JavaScript plugin menu item within Adobe Acrobat." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-78", "text": "The plugin is run on the annotated PDFs and is designed to be agnostic to the annotation types: it will work on both human-annotated and auto-annotated versions of PDFs." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-79", "text": "Once the XMLs for each children's book are generated, we can proceed to the feature extraction step." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-80", "text": "The set of features we use in the experiments is described in the following Section 4." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-81", "text": "----------------------------------" }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-82", "text": "**SURFACE-LEVEL FEATURES AND VISUALLY-ORIENTED FEATURES**" }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-83", "text": "\u2022 Surface-level Features 1. Number of words 2. Number of letters per word 3. 
Number of sentences 4." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-84", "text": "Average sentence length 5." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-85", "text": "Type-token ratio of the text content." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-86", "text": "\u2022 Visually-oriented Features" }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-87", "text": "----------------------------------" }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-88", "text": "**STRUCTURAL FEATURES**" }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-89", "text": "Since our previous work only uses surface level of text features, we are interested in investigating the contribution of high-level structural features to the current system." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-90", "text": "Feng et al. (2010) found several parsing-based features and part-of-speech based features to be useful." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-91", "text": "We utilize the Stanford Parser (Klein and Manning, 2003) to extract the following features from the XML files based on those used in (Feng et al., 2010) :" }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-92", "text": "----------------------------------" }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-93", "text": "**EXPERIMENTS**" }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-94", "text": "In the experiments, we look at how much the performance dropped by switching to zero human inputs." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-95", "text": "We also investigate the impact of using a richer set of text-based features." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-96", "text": "We apply the ranking-based book leveling algorithm proposed by our previous study (Ma et al., 2012) and use the SVM rank ranker (Joachims, 2006) for our experiments." 
}, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-97", "text": "In this system, the ranker learns to sort the training books into leveled order." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-98", "text": "The unknown test book is inserted into the ordering of the training books by the trained ranking model, and the predicted reading level is calculated by averaging over the levels of the known books above and below the test book." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-99", "text": "Following the previous study, each book is uniformly partitioned into 4 parts, treating each sub-book as an individual entity." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-100", "text": "A leave-n-out procedure is utilized for evaluation: during each iteration of the training, the system leaves out all n partitions (sub-books) corresponding to one book." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-101", "text": "In the testing phase, the trained ranking model tests on all partitions corresponding to the held-out book." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-102", "text": "We obtain a single predicted reading level for the held-out book by averaging the results for all its partitions; averaging produces a more robust result." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-103", "text": "Two separate experiments are carried out on human-annotated and auto-annotated PDF books respectively." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-104", "text": "We use two metrics to determine quality: first, the accuracy of the system is computed by counting a prediction as correct if the predicted book level is within \u00b11 of the true reading level." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-105", "text": "4 The second scoring metric is the absolute error of number of levels away from the key reading level, averaged over all of the books."
}, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-106", "text": "We report the experiment results on different combinations of feature sets: surface level features plus visually-oriented features, surface level features only, visually-oriented features only, structural features only and finally combining all the features together." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-107", "text": "----------------------------------" }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-108", "text": "**HUMAN ANNOTATION VS. AUTOMATIC ANNOTATION**" }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-109", "text": "As we can observe from Table 3 , 5 overall the human annotation gives higher accuracy than automatic annotation across different feature sets." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-110", "text": "The performance difference between human annotation and automatic annotation could be attributed to the OCR errors (described in Section 3.2) which are introduced in the automatic annotation process." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-111", "text": "However, to our surprise, the best performance of human annotation is not significantly better than automatic annotation even at p < 0.1 level (figures in bold)." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-112", "text": "6 Only for the experiment using all features does human annotation outperform the automatic annotation at p < 0.1 level (still not significantly better at p < 0.05 level, figures with asterisks)." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-113", "text": "Therefore, we believe that the extra labor involved in the annotation step could be replaced by the automatic process without leading to a significant performance drop." 
}, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-114", "text": "While the process does still require manual scanning of each book (which can be time consuming depending on the kind of scanner), the automatic processing can reduce the labor from approximately twenty minutes per book to just a few seconds." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-115", "text": "----------------------------------" }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-116", "text": "**INCORPORATING STRUCTURAL FEATURES**" }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-117", "text": "Our previous study demonstrated that combining surface features with visual features produces promising results." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-118", "text": "As mentioned above, the second aim of this study is to see how much benefit we can get from incorporating high-level structural features, such as those used in (Feng et al., 2010) (described in Section 4.2), with the features in our previous study." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-119", "text": "For human annotation under the \u00b11 accuracy metric, the visual features and the structural features have the same performance; their accuracies are both slightly lower than that of the surface level features." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-120", "text": "By combining the surface level features with the visual features, the system obtains the best performance." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-121", "text": "However, by combining all three feature sets, the system performance does not change for human annotation whereas it hurts the performance for automatic annotation; it is likely that the OCR errors existing in the automatic annotations give rise to erroneous structural features (e.g. the parser would produce less robust parses for sentences which have out of vocabulary words)."
}, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-122", "text": "Overall, we did not observe better performance by incorporating structural features." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-123", "text": "Using structural features on their own also did not produce noteworthy results." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-124", "text": "Although, among the three kinds of features (surface, visual and structural), the structural features have the highest computational cost, they yield no significant improvement in system results." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-125", "text": "Under the average leveling error metric, the best performance is again obtained with the combination of surface level features and visual features for human annotation, whereas the performance remains almost the same after incorporating structural features for automatic annotation." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-126", "text": "----------------------------------" }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-127", "text": "**CONCLUSION**" }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-128", "text": "In this paper, we explore the possibility of reducing human involvement in the specific task of predicting reading levels of scanned children's books by eliminating the need for human annotation." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-129", "text": "Clearly there is a trade-off between the amount of human labor involved and the accuracy of the reading level predicted." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-130", "text": "Based on the experimental results, we did not observe a significant performance drop when switching from human annotation to automatic annotation in the task of predicting reading levels for scanned children's books."
}, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-131", "text": "We also study the effect of incorporating structural features into the proposed ranking system." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-132", "text": "The experimental results showed that structural features have no significant effect on the system performance." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-133", "text": "We conclude that for the simply structured, short text that appears in most children's books, a deep-level analysis of the text properties may be overkill, producing unsatisfactory results at a high computational cost for our task." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-134", "text": "In the future, we are interested in investigating the importance of each individual feature as well as applying various feature selection methods to further improve the overall performance of the system, in the hope of making the ranking system more robust to OCR errors introduced by automatic annotation processing." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-135", "text": "Another interesting open question is how many scanned book pages are needed to make a good prediction." }, { "sent_id": "8b5e14bdf3f415725333de672be114-C001-136", "text": "7 Such analysis would be very helpful for practical purposes, since a teacher could just scan a few sample pages instead of a full book for a reliable prediction."
} ], "y": { "@BACK@": { "gold_contexts": [ [ "8b5e14bdf3f415725333de672be114-C001-13" ], [ "8b5e14bdf3f415725333de672be114-C001-35" ] ], "cite_sentences": [ "8b5e14bdf3f415725333de672be114-C001-13", "8b5e14bdf3f415725333de672be114-C001-35" ] }, "@USE@": { "gold_contexts": [ [ "8b5e14bdf3f415725333de672be114-C001-20" ], [ "8b5e14bdf3f415725333de672be114-C001-91" ], [ "8b5e14bdf3f415725333de672be114-C001-118" ] ], "cite_sentences": [ "8b5e14bdf3f415725333de672be114-C001-20", "8b5e14bdf3f415725333de672be114-C001-91", "8b5e14bdf3f415725333de672be114-C001-118" ] } } }, "ABC_d5d81a4c7759f9a4ab81195819c6d9_34": { "x": [ { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-2", "text": "While very deep neural networks have shown effectiveness for computer vision and text classification applications, how to increase the network depth of neural machine translation (NMT) models for better translation quality remains a challenging problem." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-3", "text": "Directly stacking more blocks to the NMT model results in no improvement and even reduces performance." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-4", "text": "In this work, we propose an effective two-stage approach with three specially designed components to construct deeper NMT models, which results in significant improvements over the strong Transformer baselines on WMT14 English\u2192German and English\u2192French translation tasks 1 ." 
}, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-5", "text": "----------------------------------" }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-7", "text": "Neural machine translation (briefly, NMT), which is built upon deep neural networks, has gained rapid progress in recent years (Bahdanau et al., 2015; Sutskever et al., 2014; Sennrich et al., 2016b; He et al., 2016a; Sennrich et al., 2016a; Xia et al., 2017; Wang et al., 2019) and achieved significant improvement in translation quality (Hassan et al., 2018) ." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-8", "text": "Variants of network structures have been applied in NMT such as LSTM (Wu et al., 2016) , CNN (Gehring et al., 2017) and Transformer (Vaswani et al., 2017) ." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-9", "text": "Training deep networks has always been a challenging problem, mainly due to the difficulties in optimization for deep architecture." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-10", "text": "Breakthroughs have been made in computer vision to enable deeper model construction via advanced initialization schemes (He et al., 2015) , multi-stage training strategy (Simonyan and Zisserman, 2015) , and Figure 1 : Performances of Transformer models with different number of encoder/decoder blocks (recorded on x-axis) on WMT14 En\u2192De translation task." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-11", "text": "\u2020 denotes the result reported in (Vaswani et al., 2017) ." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-12", "text": "novel model architectures (Srivastava et al., 2015; He et al., 2016b) ." 
}, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-13", "text": "While constructing very deep neural networks with tens and even more than a hundred blocks have shown effectiveness in image recognition (He et al., 2016b) , question answering and text classification (Devlin et al., 2018; Radford et al., 2019) , scaling up model capacity with very deep network remains challenging for NMT." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-14", "text": "The NMT models are generally constructed with up to 6 encoder and decoder blocks in both state-of-the-art research work and champion systems of machine translation competition." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-38", "text": "Let x and y denote the embedding of source and target sequence." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-39", "text": "Let l y denote the number of words in y, and y (Vaswani et al., 2017; JunczysDowmunt, 2018; Edunov et al., 2018) ." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-16", "text": "Increasing the NMT model depth by directly stacking more blocks results in no improvement or performance drop (Figure 1) , and even leads to optimization failure ." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-17", "text": "There have been a few attempts in previous works on constructing deeper NMT models." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-18", "text": "Zhou et al. (2016) and Wang et al. (2017) propose increasing the depth of LSTM-based models by introducing linear units between internal hidden states to eliminate the problem of gradient vanishing." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-19", "text": "However, their methods are specially designed for the recurrent architecture which has been significantly outperformed by the state-ofthe-art transformer model." 
}, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-20", "text": "propose an enhancement to the attention mechanism to ease the optimization of models with deeper encoders." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-21", "text": "While gains have been reported over different model architectures including LSTM and Transformer, their improvements are not made over the best performed baseline model configuration." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-22", "text": "How to construct and train deep NMT models to push forward the state-ofthe-art translation performance with larger model capacity remains a challenging and open problem." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-23", "text": "In this work, we explore the potential of leveraging deep neural networks for NMT and propose a new approach to construct and train deeper NMT models." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-24", "text": "As aforementioned, constructing deeper models is not as straightforward as directly stacking more blocks, but requires new mechanisms to boost the training and utilize the larger capacity with minimal increase in complexity." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-25", "text": "Our solution is a new two-stage training strategy, which \"grows\" a well-trained NMT model into a deeper network with three components specially designed to overcome the optimization difficulty and best leverage the capability of both shallow and deep architecture." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-26", "text": "Our approach can effectively construct a deeper model with significantly better performance, and is generally applicable to any model architecture." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-27", "text": "We evaluate our approach on two large-scale benchmark datasets, WMT14 English\u2192German and English\u2192French translations." 
}, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-28", "text": "Empirical studies show that our approach can significantly improve in translation quality with an increased model depth." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-29", "text": "Specifically, we achieve 1.0 and 0.6 BLEU score improvement over the strong Transformer baseline in English\u2192German and English\u2192French translations." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-30", "text": "----------------------------------" }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-31", "text": "**APPROACH**" }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-32", "text": "We introduce the details of our proposed approach in this section." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-33", "text": "The overall framework is illustrated in Figure 2 ." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-34", "text": "Our model consists of a bottom module with N blocks of encoder and decoder (the grey components in Figure 2 ), and a top module with M blocks (the blue and green components)." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-35", "text": "We denote the encoder and decoder of the bottom module as enc 1 and dec 1 , and the corresponding two parts of the top module as enc 2 and dec 2 ." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-36", "text": "An encoder-decoder attention mechanism is used in the decoder blocks of the NMT models, and here we use attn 1 and attn 2 to represent such attention in the bottom and top modules respectively." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-37", "text": "The model is constructed via a two-stage training strategy: in Stage 1, the bottom module (i.e., enc 1 and dec 1 ) is trained and subsequently holds constant; in Stage 2, only the top module (i.e., enc 2 and dec 2 ) is optimized." 
}, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-42", "text": "(1), the encoder enc 1 of the bottom module encodes the input x to a hidden representation h 1 , then a cross-module residual connection is introduced to the top module and the representation h 2 is eventually produced." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-43", "text": "The decoders work in a similar way as shown in Eqn. (2) and (3)." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-44", "text": "This enables the top module to have direct access to both the low-level input signals from the word embedding and high-level information generated by the bottom module." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-45", "text": "Similar principles can be found in Wang et al. (2017) ; ." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-46", "text": "(2) Hierarchical encoder-decoder attention: We introduce a hierarchical encoder-decoder attention calculated with different contextual representations as shown in Eqn. (2) and (3), where h 1 is used as key and value for attn 1 in the bottom module, and h 2 for attn 2 in the top module." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-47", "text": "Hidden states from the corresponding previous decoder block are used as queries for both attn 1 and attn 2 (omitted for readability)." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-48", "text": "In this way, the strong capability of the well trained bottom module can be best preserved regardless of the influence from top module, while the newly stacked top module can leverage the higher-level contextual representations." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-49", "text": "More details can be found from source code in the supplementary materials." 
}, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-50", "text": "\u2022 Ensemble learning: What we propose in this paper is a single deeper model with hierarchical contextual information, although the deep-shallow decoding is similar to the ensemble methods in terms of inference complexity (Zhou, 2012) ." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-51", "text": "While training multiple diverse models for good ensemble performance introduces high additional complexity, our approach, as discussed above, \"grows\" a well-trained model into a deeper one with minimal increase in training complexity." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-52", "text": "Detailed empirical analysis is presented in Section 3.3." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-53", "text": "----------------------------------" }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-54", "text": "**EXPERIMENTS**" }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-55", "text": "We evaluate our proposed approach on two largescale benchmark datasets." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-56", "text": "We compare our approach with multiple baseline models, and analyze the effectiveness of our deep training strategy." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-57", "text": "----------------------------------" }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-58", "text": "**EXPERIMENT DESIGN**" }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-59", "text": "Datasets We conduct experiments to evaluate the effectiveness of our proposed method on two widely adopted benchmark datasets: the WMT14 2 English\u2192German translation (En\u2192De) and the WMT14 English\u2192French translation (En\u2192Fr)." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-60", "text": "We use 4.5M parallel sentence pairs for En\u2192De and 36M pairs for En\u2192Fr as our training data 3 ." 
}, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-61", "text": "We use the concatenation of Newstest2012 and Newstest2013 as the validation set, and Newstest2014 as the test set." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-62", "text": "All words are segmented into sub-word units using byte pair encoding (BPE) 4 (Sennrich et al., 2016b) , forming a vocabulary shared by the source and target languages with 32k and 45k tokens for En\u2192De and En\u2192Fr respectively." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-63", "text": "Architecture The basic encoder-decoder framework we use is the strong Transformer model." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-64", "text": "We adopt the big transformer configuration following Vaswani et al. (2017) , with the dimension of word embeddings, hidden states and non-linear layer set as 1024, 1024 and 4096 respectively." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-65", "text": "The dropout rate is 0.3 for En\u2192De and 0.1 for En\u2192Fr." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-66", "text": "We set the number of encoder/decoder blocks for the bottom module as N = 6 following the common practice, and set the number of additionally stacked blocks of the top module as M = 2." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-67", "text": "Our models are implemented based on the PyTorch implementation of Transformer 5 and the code can be found in the supplementary materials." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-68", "text": "Training We use Adam (Kingma and Ba, 2015) optimizer following the optimization settings and default learning rate schedule in Vaswani et al. (2017) for model training." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-69", "text": "All models are trained on 8 M40 GPUs." 
}, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-70", "text": "Evaluation We evaluate the model performances with tokenized case-sensitive BLEU 6 score (Papineni et al., 2002) for the two translation tasks." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-71", "text": "We use beam search with a beam size of 5 and with no length penalty." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-72", "text": "----------------------------------" }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-73", "text": "**OVERALL RESULTS**" }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-74", "text": "We compare our method (Ours) with the Transformer baselines of 6 blocks (6B) and 8 blocks (8B), and a 16-block Transformer with transparent attention (Transparent Attn (16B)) 7 ." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-75", "text": "We also reproduce a 6-block Transformer baseline, which has better performance than what is reported in (Vaswani et al., 2017) and we use it to initialize the bottom module in our model." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-76", "text": "From the results in Table 1 , we see that our proposed approach enables effective training for deeper network and achieves significantly better performances compared to baselines." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-77", "text": "With our method, the performance of a well-optimized 6-block model can be further boosted by adding two additional blocks, while simply using Transformer (8B) will lead to a performance drop." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-78", "text": "Specifically, we achieve a 29.92 BLEU score on En\u2192De translation with 1.0 BLEU improvement over the strong baselines, and achieve a 0.6 BLEU improvement for En\u2192Fr." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-79", "text": "The improvements are statistically significant with p < 0.01 in paired bootstrap sampling (Koehn, 2004) ." 
}, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-80", "text": "We further make an attempt to train a deeper model with additional M = 4 blocks, which has 10 blocks in total for En\u2192De translation." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-81", "text": "The bot-6 https://github.com/moses-smt/ mosesdecoder/blob/master/scripts/ generic/multi-bleu.perl 7 We directly use the performance figure from , which uses the base Transformer configuration." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-82", "text": "We run the method of our own implementation with the widely adopted and state-of-the-art big setting, but no improvement has been observed." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-83", "text": "tom module is also initialized from our reproduced 6-block transformer baseline." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-84", "text": "This model achieves a 30.07 BLEU score on En\u2192De translation and it surpasses the performance of our 8-block model, which further demonstrates that our approach is effective for training deeper NMT models." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-85", "text": "----------------------------------" }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-86", "text": "**ANALYSIS**" }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-87", "text": "To further study the effectiveness of our proposed framework, we present additional comparisons in En\u2192De translation with two groups of baseline approaches in Figure 3: (1) Direct stacking (DS): we extend the 6-block baseline to 8-block by directly stacking 2 additional blocks." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-88", "text": "We can see that both training from scratch (DS scratch) and \"growing\" from a welltrained 6-block model (DS grow) fails to improve performance in spite of larger model capacity." 
}, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-89", "text": "The comparison with this group of models shows that directly stacking more blocks is not a good strategy for increasing network depth, and demonstrates the effectiveness and necessity of our proposed mechanisms for training deep networks." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-90", "text": "(2) Ensemble learning (Ensemble): we present the two-model ensemble results for fair comparison with our approach that involves a two-pass deepshallow decoding." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-91", "text": "Specifically, we present the ensemble performances of two independently trained 6-block models (Ensemble 6B/6B), and ensemble of one 6-block and one 8-block model independently trained from scratch (Ensemble 6B/8B)." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-92", "text": "As expected, the ensemble method improves translation quality over the single model baselines by a large margin (over 0.8 BLEU improvement)." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-93", "text": "Regarding training complexity, it takes 40 GPU days (5 days on 8 GPU) to train a single 6-block model from scratch, 48 GPU days for a 8-block model , and 8 GPU days to \"grow\" a 6-block model into 8-block with our approach." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-94", "text": "Therefore, our model is better than the two-model ensemble in terms of both translation quality (more than 0.3 BLEU improvement over the ensemble baseline) and training complexity." 
}, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-95", "text": "----------------------------------" }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-96", "text": "**CONCLUSION**" }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-97", "text": "In this paper, we proposed a new training strategy with three specially designed components, including cross-module residual connection, hierarchical encoder-decoder attention and deep-shallow decoding, to construct and train deep NMT models." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-98", "text": "We showed that our approach can effectively construct deeper model with significantly better performance over the state-of-the-art transformer baseline." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-99", "text": "Although only empirical studies on the transformer are presented in this paper, our proposed strategy is a general approach that can be universally applicable to other model architectures, including LSTM and CNN." }, { "sent_id": "d5d81a4c7759f9a4ab81195819c6d9-C001-100", "text": "In future work, we will further explore efficient strategies that can jointly train all modules of the deep model with minimal increase in training complexity." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "d5d81a4c7759f9a4ab81195819c6d9-C001-8" ], [ "d5d81a4c7759f9a4ab81195819c6d9-C001-14", "d5d81a4c7759f9a4ab81195819c6d9-C001-15" ] ], "cite_sentences": [ "d5d81a4c7759f9a4ab81195819c6d9-C001-8", "d5d81a4c7759f9a4ab81195819c6d9-C001-15" ] }, "@USE@": { "gold_contexts": [ [ "d5d81a4c7759f9a4ab81195819c6d9-C001-11" ], [ "d5d81a4c7759f9a4ab81195819c6d9-C001-64" ], [ "d5d81a4c7759f9a4ab81195819c6d9-C001-68" ] ], "cite_sentences": [ "d5d81a4c7759f9a4ab81195819c6d9-C001-11", "d5d81a4c7759f9a4ab81195819c6d9-C001-64", "d5d81a4c7759f9a4ab81195819c6d9-C001-68" ] } } }, "ABC_aa87225d7d326adfb4a8b2702b8f25_34": { "x": [ { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-2", "text": "Neural language models learn word representations that capture rich linguistic and conceptual information." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-3", "text": "Here we investigate the embeddings learned by neural machine translation models." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-4", "text": "We show that translation-based embeddings outperform those learned by cutting-edge monolingual models at single-language tasks requiring knowledge of conceptual similarity and/or syntactic role." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-5", "text": "The findings suggest that, while monolingual models learn information about how concepts are related, neural-translation models better capture their true ontological status." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-6", "text": "It is well known that word representations can be learned from the distributional patterns in corpora." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-7", "text": "Originally, such representations were constructed by counting word co-occurrences, so that the features in one word's representation corresponded to other words [11, 17] ." 
}, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-8", "text": "Neural language models, an alternative means to learn word representations, use language data to optimise (latent) features with respect to a language modelling objective." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-9", "text": "The objective can be to predict either the next word given the initial words of a sentence [4, 14, 8] , or simply a nearby word given a single cue word [13, 15] ." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-10", "text": "The representations learned by neural models (sometimes called embeddings) generally outperform those acquired by co-occurrence counting models when applied to NLP tasks [3] ." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-11", "text": "Despite these clear results, it is not well understood how the architecture of neural models affects the information encoded in their embeddings." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-12", "text": "Here, we explore this question by considering the embeddings learned by architectures with a very different objective function to monolingual language models: neural machine translation models." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-13", "text": "We show that translation-based embeddings outperform monolingual embeddings on two types of task: those that require knowledge of conceptual similarity (rather than simply association or relatedness), and those that require knowledge of syntactic role." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-14", "text": "We discuss what the findings indicate about the information content of different embeddings, and suggest how this content might emerge as a consequence of the translation objective." 
}, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-15", "text": "Both neural language models and translation models learn real-valued embeddings (of specified dimension) for words in some pre-specified vocabulary, V , covering many or all words in their training corpus." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-16", "text": "At each training step, a 'score' for the current training example (or batch) is computed based on the embeddings in their current state." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-17", "text": "This score is compared to the model's objective function, and the error is backpropagated to update both the model weights (affecting how the score is computed from the embeddings) and the embedding features." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-18", "text": "At the end of this process, the embeddings should encode information that enables the model to optimally satisfy its objective." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-19", "text": "----------------------------------" }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-20", "text": "**LEARNING EMBEDDINGS FROM LANGUAGE DATA**" }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-21", "text": "Both neural language models and translation models learn real-valued embeddings (of specified dimension) for words in some pre-specified vocabulary, V , covering many or all words in their training corpus." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-22", "text": "At each training step, a 'score' for the current training example (or batch) is computed based on the embeddings in their current state." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-23", "text": "This score is compared to the model's objective function, and the error is backpropagated to update both the model weights (affecting how the score is computed from the embeddings) and the embedding features." 
}, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-24", "text": "At the end of this process, the embeddings should encode information that enables the model to optimally satisfy its objective." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-25", "text": "----------------------------------" }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-26", "text": "**MONOLINGUAL MODELS**" }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-27", "text": "In the original neural language model [4] and subsequent variants [8] , each training example consists of n subsequent words, of which the model is trained to predict the n-th word given the first n \u2212 1 words." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-28", "text": "The model first represents the input as an ordered sequence of embeddings, which it transforms into a single fixed length 'hidden' representation by, e.g., concatenation and non-linear projection." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-29", "text": "Based on this representation, a probability distribution is computed over the vocabulary, from which the model can sample a guess at the next word." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-30", "text": "The model weights and embeddings are updated to maximise the probability of correct guesses for all sentences in the training corpus." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-31", "text": "More recent work has shown that high quality word embeddings can be learned via models with no nonlinear hidden layer [13, 15] ." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-32", "text": "Given a single word in the corpus, these models simply predict which other words will occur nearby." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-33", "text": "For each word w in V , a list of training cases (w, c) : c \u2208 V is extracted from the training corpus." 
}, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-34", "text": "For instance, in the skipgram approach [13] , for each 'cue word' w the 'context words' c are sampled from windows either side of tokens of w in the corpus (with c more likely to be sampled if it occurs closer to w)." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-35", "text": "1 For each w in V , the model initialises both a cue-embedding, representing the w when it occurs as a cue-word, and a context-embedding, used when w occurs as a context-word." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-36", "text": "For a cue word w, the model can use the corresponding cueembedding and all context-embeddings to compute a probability distribution over V that reflects the probability of a word occurring in the context of w. When a training example (w, c) is observed, the model updates both the cue-word embedding of w and the context-word embeddings in order to increase the conditional probability of c." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-37", "text": "----------------------------------" }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-38", "text": "**TRANSLATION-BASED EMBEDDINGS**" }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-39", "text": "Neural translation models generate an appropriate sentence in their target language S t given a sentence S s in their source language [see, e.g., 16, 6] ." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-40", "text": "In doing so, they learn distinct sets of embeddings for the vocabularies V s and V t in the source and target languages respectively." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-41", "text": "Observing a training case (S s , S t ), such a model represents S s as an ordered sequence of embeddings of words from V s ." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-42", "text": "The sequence for S s is then encoded into a single representation R S ." 
}, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-43", "text": "2 Finally, by referencing the embeddings in V t , R S and a representation of what has been generated thus far, the model decodes a sentence in the target language word by word." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-44", "text": "If at any stage the decoded word does not match the corresponding word in the training target S t , the error is recorded." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-45", "text": "The weights and embeddings in the model, which together parameterise the encoding and decoding process, are updated based on the accumulated error once the sentence decoding is complete." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-46", "text": "Although neural translation models can differ in low-level architecture [7, 2] , the translation objective exerts similar pressure on the embeddings in all cases." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-47", "text": "The source language embeddings must be such that the model can combine them to form single representations for ordered sequences of multiple words (which in turn must enable the decoding process)." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-48", "text": "The target language embeddings must facilitate the process of decoding these representations into correct target-language sentences." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-49", "text": "----------------------------------" }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-50", "text": "**COMPARING MONO-LINGUAL AND TRANSLATION-BASED EMBEDDINGS**" }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-51", "text": "To learn translation-based embeddings, we trained both the RNN encoder-decoder [RNNenc, 7] and the RNN Search architectures [2] on a 300m word corpus of English-French sentence pairs." 
}, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-52", "text": "We conducted all experiments with the resulting (English) source embeddings from these models." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-77", "text": "In this task, models must identify the correct answer (girl) when presented with questions such as 'man is to boy as woman is to ...'." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-53", "text": "For comparison, we trained a monolingual skipgram model [13] and its Glove variant [15] for the same number of epochs on the English half of the bilingual corpus." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-54", "text": "We also extracted embeddings from a full-sentence language model [CW, 8] trained for several months on a larger 1bn word corpus." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-55", "text": "As in previous studies [1, 5, 3] , we evaluate embeddings by calculating pairwise (cosine) distances and correlating these distances with (gold-standard) human judgements." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-56", "text": "Table 1 shows the correlations of different model embeddings with three such gold-standard resources, WordSim-353 [1] , MEN [5] and SimLex-999 [10] ." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-57", "text": "Interestingly, translation embeddings perform best on SimLex-999, while the two sets of monolingual embeddings perform better on modelling the MEN and WordSim-353." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-58", "text": "To interpret these results, it should be noted that SimLex-999 evaluation quantifies conceptual similarity (dog -wolf ), whereas MEN and WordSim-353 (despite its name) quantify more general relatedness (dog -collar) [10] ." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-59", "text": "The results seem to indicate that translation-based embeddings better capture similarity, while monolingual embeddings better capture relatedness." 
}, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-60", "text": "Table 1 : Translation-based embeddings outperform alternatives on similarity-focused evaluations." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-61", "text": "To test this hypothesis further, we ran two more evaluations focused specifically on similarity." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-62", "text": "The TOEFL synonym test contains 80 cue words, each with four possible answers, of which one is a correct synonym [11] ." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-63", "text": "We computed the proportion of questions answered correctly by each model, where a model's answer was the nearest (cosine) neighbour to the cue word in its vocabulary." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-64", "text": "3 In addition, we tested how well different embeddings enabled a supervised classifier to distinguish between synonyms and antonyms." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-65", "text": "For 500 hand-labelled pairs we presented a Gaussian SVM with the concatenation of the two word embeddings." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-66", "text": "We evaluated accuracy using 8-fold cross-validation." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-67", "text": "As shown in Table 1 , translation-based embeddings outperform all monolingual embeddings on these two additional similarity-focused tasks." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-68", "text": "Qualitative analysis of nearest neighbours (bottom rows) also supports the conclusion that proximity in the translation embedding space corresponds to similarity while proximity in the monolingual embedding space reflects relatedness." 
}, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-69", "text": "----------------------------------" }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-70", "text": "**QUANTITY OF TRAINING DATA**" }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-71", "text": "In previous work, monolingual models were trained on corpora many times larger than the English half of our parallel translation corpus." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-72", "text": "To check if these models simply need more training data to capture similarity as effectively as translation models, we trained them on increasingly large subsets of Wikipedia." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-73", "text": "4 The results refute this possibility: the performance of monolingual embeddings on similarity tasks converges well below the level of the translation-based embeddings (Fig. 1) ." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-74", "text": "----------------------------------" }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-75", "text": "**ANALOGY QUESTIONS**" }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-76", "text": "Lexical analogy questions are an alternative way of evaluating word representations [13, 15] ." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-78", "text": "For skipgram-style embeddings, it has been shown that if m, b and w are the embeddings for man, boy and woman respectively, the correct answer is often the nearest neighbour in the vocabulary (by cosine distance) to the vector v = w + b \u2212 m [13] ." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-79", "text": "Monolingual skipgram/Glove models are better at semantic analogies (father, man; mother, woman)" }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-80", "text": "We evaluated the embeddings on this task using the same vector-algebra method as [13] ." 
}, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-81", "text": "As before we excluded questions containing a word outside the intersection of all model vocabularies, and restricted all answer searches to this reduced vocabulary, leaving 11,166 analogies." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-82", "text": "Of these, 7219 are classed as 'syntactic', in that they exemplify mappings between parts-of-speech or syntactic roles (fast, fastest; heavy -heaviest), and 3947 are classed as 'semantic' (Ottawa, Canada; ParisFrance), deriving from wider world knowledge." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-83", "text": "As shown in Fig. 2 , the translation-based embeddings seem to yield poor answers to semantic analogy questions, but are very effective for syntactic analogies, outperforming the monolingual embeddings, even those trained on much more data." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-84", "text": "----------------------------------" }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-85", "text": "**CONCLUSIONS**" }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-86", "text": "Neural machine translation models are more effective than monolingual models at learning embeddings that encode information about concept similarity and syntactic role." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-87", "text": "In contrast, monolingual models encode general inter-concept relatedness (as applicable to semantic analogy questions), but struggle to capture similarity, even when training on larger corpora." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-88", "text": "For skipgram-style models, whose objective is to predict linguistically collocated pairs, this limitation is perhaps unsurprising, since co-occurring words are, in general, neither semantically nor syntactically similar." 
}, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-89", "text": "However, the fact that it also applies to the full-sentence model CW suggests that inferring similarity is problematic for monolingual models even with knowledge of the precise (ordered) contexts of words." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-90", "text": "This may be because very dissimilar words (such as antonyms) actually often occur in identical linguistic contexts." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-91", "text": "When considering the strengths of translation embeddings -similarity and syntactic role -it is notable that each item in the three similarity-focused evaluations consists of word groups or pairs of identical syntactic role." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-92", "text": "Thus, the strong performance of translation embeddings on similarity tasks cannot be simply a result of their encoding of richer syntactic information." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-93", "text": "To perform well on SimLex-999, embeddings must encode information approximating what concepts are (their function or ontology), even when this contradicts the signal conferred by co-occurrence (as can be the case for related-but-dissimilar concept pairs) [10] ." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-94", "text": "The translation objective seems particularly effective at inducing models to encode such ontological or functional information in word embeddings." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-95", "text": "While much remains unknown about this process, one cause might be the different ways in which words partition the meaning space of a language." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-96", "text": "In cases where a French word has two possible English translations (e.g. gagner \u2192 win / earn), we note that the (source) embeddings of the two English words are very close." 
}, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-97", "text": "It appears that, since the translation model, which has limited encoding capacity, is trained to map tokens of win and earn to the same place in the target embedding space, it is efficient to move these concepts closer in the source space." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-98", "text": "While clear-cut differences in how languages partition meaning space, such as (gagner = win, earn), may in fact be detrimental to similarity modelling (win and earn are not synonymous to English speakers), in general, languages partition meaning space in less drastically different ways." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-99", "text": "We hypothesize that these small differences are the key to how neural translation models approximate ontological similarity so effectively." }, { "sent_id": "aa87225d7d326adfb4a8b2702b8f25-C001-100", "text": "At the same time, since two dissimilar or even antonymous words in the source language should never correspond to a single word in the target language, these pairs diverge in the embedding space, rendering two antonymous embeddings easily distinguishable from those of two synonyms." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "aa87225d7d326adfb4a8b2702b8f25-C001-9" ], [ "aa87225d7d326adfb4a8b2702b8f25-C001-31" ], [ "aa87225d7d326adfb4a8b2702b8f25-C001-34" ], [ "aa87225d7d326adfb4a8b2702b8f25-C001-76" ], [ "aa87225d7d326adfb4a8b2702b8f25-C001-78" ] ], "cite_sentences": [ "aa87225d7d326adfb4a8b2702b8f25-C001-9", "aa87225d7d326adfb4a8b2702b8f25-C001-31", "aa87225d7d326adfb4a8b2702b8f25-C001-34", "aa87225d7d326adfb4a8b2702b8f25-C001-76", "aa87225d7d326adfb4a8b2702b8f25-C001-78" ] }, "@USE@": { "gold_contexts": [ [ "aa87225d7d326adfb4a8b2702b8f25-C001-53" ], [ "aa87225d7d326adfb4a8b2702b8f25-C001-80" ] ], "cite_sentences": [ "aa87225d7d326adfb4a8b2702b8f25-C001-53", "aa87225d7d326adfb4a8b2702b8f25-C001-80" ] } } }, "ABC_0f0e13e275c4bc4021b1b0d26f3e0c_34": { "x": [ { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-2", "text": "Current techniques for Open Information Extraction (OIE) focus on the extraction of binary facts and suffer significant quality loss for the task of extracting higher order N-ary facts." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-3", "text": "This quality loss may not only affect the correctness, but also the completeness of an extracted fact." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-4", "text": "We present KRAKEN, an OIE system specifically designed to capture N-ary facts, as well as the results of an experimental study on extracting facts from Web text in which we examine the issue of fact completeness." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-5", "text": "Our preliminary experiments indicate that KRAKEN is a high precision OIE approach that captures more facts per sentence at greater completeness than existing OIE approaches, but is vulnerable to noisy and ungrammatical text." 
}, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-6", "text": "----------------------------------" }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-8", "text": "For the task of fact extraction from billions of Web pages the method of Open Information Extraction (OIE) (Fader et al., 2011) trains domainindependent extractors." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-9", "text": "This important characteristic enables a potential application of OIE for even very large corpora, such as the Web." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-10", "text": "Existing approaches for OIE, such as REVERB (Fader et al., 2011) , WOE (Wu and Weld, 2010) or WANDER-LUST (Akbik and Bross, 2009 ) focus on the extraction of binary facts, e.g. facts that consist of only two arguments, as well as a fact phrase which denotes the nature of the relationship between the arguments." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-11", "text": "However, a recent analysis of OIE based on Semantic Role Labeling (Christensen et al., 2011) revealed that N-ary facts (facts that connect more than two arguments) were present in 40% of surveyed English sentences." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-12", "text": "Worse, the analyses performed in (Fader et al., 2011) and (Akbik and Bross, 2009) show that incorrect handling of N-ary facts leads to extraction errors, such as incomplete, uninformative or erroneous facts." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-13", "text": "Our first example illustrates the case of a significant information loss: a) In the 2002 film Bubba Ho-tep, Elvis lives in a nursing home." 
}, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-14", "text": "REVERB: LivesIn(Elvis, nursing home)" }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-15", "text": "In this case, the OIE system ignores the significant contextual information in the argument the 2002 film Bubba Ho-tep, which denotes the domain in which the fact LivesIn(Elvis, nursing home) is true." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-16", "text": "As a result, and by itself, the extracted fact is false." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-17", "text": "The next example shows a binary fact from a sentence that de-facto expresses an N-ary fact." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-18", "text": "----------------------------------" }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-19", "text": "**B) ELVIS MOVED TO MEMPHIS IN 1948.**" }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-20", "text": "REVERB: MovedTo(Elvis, Memphis) WANDERLUST: MovedIn (Elvis, 1948) Contrary to the previous example, the OIE systems extracted two binary facts that are not false, but incomplete, as the interaction between all three entities in this sentence can only be adequately modeled using an ternary fact." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-21", "text": "The fact MovedIn(Elvis, 1948) for example misses an important aspect, namely the location Elvis moved to in 1948." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-22", "text": "Therefore, each of these two facts is an example of important, but not crucial information loss." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-23", "text": "Unfortunately, current OIE systems are not designed to capture the complete set of arguments for 52 each fact phrase within a sentence and to link arguments into an N-ary fact." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-24", "text": "We view intra-sentence fact completeness as a major measure of data quality." 
}, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-25", "text": "Following existing work from (Galhardas et al., 2001 ) complete factual data is a key for advanced data cleansing tasks, such as fact de-duplication, object resolution across N-ary facts, semantic fact interpretation and corpus wide fact aggregation." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-26", "text": "Therefore we argue that complete facts may serve a human reader or an advanced data cleansing approach as additional clue for interpreting and validating the fact." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-27", "text": "In order to investigate the need and feasibility for N-ary OIE we have performed the following, the results of which we present in this paper: 1. We introduce the OIE system KRAKEN, which has been built specifically for capturing complete facts from sentences and is capable of extracing unary, binary and higher order N-ary facts." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-28", "text": "2." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-29", "text": "We examine intra sentence fact correctness (true/false) and fact completeness for KRAKEN and REVERB on the corpus of (Fader et al., 2011) ." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-30", "text": "In the rest of the paper we review earlier work and outline KRAKEN, our method for extracting N-ary facts and contextual information." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-31", "text": "Next, we describe our experiments and end with conclusions." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-32", "text": "----------------------------------" }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-33", "text": "**KRAKEN**" }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-34", "text": "We introduce KRAKEN, an N-ary OIE fact extraction system for facts of arbitrary arity." 
}, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-35", "text": "----------------------------------" }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-36", "text": "**PREVIOUS WORK**" }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-37", "text": "Binary-OIE: Our previous system WANDERLUST ( Akbik and Bross, 2009 ) operates using a typed dependency-style grammar representation called Link Grammar." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-38", "text": "The system traverses paths of typed dependencies (referred to as linkpaths) to find pairs of arguments connected by a valid grammatical relationship." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-39", "text": "We identified a set of 46 common linkpaths that can be used for fact extraction." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-40", "text": "Later, the authors (Wu and Weld, 2010 ) trained extractors in a system called WOE, one using only shallow syntactic features and one (called WOEPARSE) that also uses typed dependencies as features." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-41", "text": "The latter system learned more than 15.000 patterns over typed dependencies." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-42", "text": "In their evaluation they showed that using deep syntactic parsing improves the precision of their system, however at a high cost in extraction speed." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-43", "text": "The OIE system REVERB (Fader et al., 2011) by contrast uses a fast shallow syntax parser for labeling sentences and applies syntactic and a lexical constraints for identifying binary facts." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-44", "text": "However, the shallow syntactic analysis limits the capability of REVERB of extracting higher order N-ary facts." 
}, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-45", "text": "Higher order fact extraction for Wikipedia: In previous work on higher order fact extraction, the focus was placed on specific types of arguments." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-46", "text": "The authors of (Hoffart et al., 2011) for example extract temporal, spatial and category information from Wikipedia info boxes. and (Ling and Weld, 2010) focused on N-ary fact types from English sentences that contain at least one temporal argument." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-47", "text": "In contrast, KRAKEN extracts N-ary facts with arbitrary argument types." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-48", "text": "----------------------------------" }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-49", "text": "**ALGORITHM OUTLINE**" }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-50", "text": "KRAKEN expects as input a Stanford dependency parsed sentence, in which two words are linked if connected via a typed dependency." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-51", "text": "Each typed dependency has a type denoting the grammatical nature of the link, and is directed, either upward (from child to parent) or downward (from parent to child)." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-52", "text": "Given such a parse, KRAKEN executes the following three steps: Figure 1 : Example of a sentence in Stanford typed dependency formalism." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-53", "text": "One fact phrase is was coined." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-54", "text": "Using the type-path rcmod-\u2191-appos-\u2191, the subject the Doublethink is found, the path is highlighted in dotted lines." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-55", "text": "Using prep-\u2193, pobj-\u2193, two arguments are found: Orwell and the novel 1984." 
}, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-56", "text": "One N-ary fact for this sentence is WasCoined(Doublethink, (by) Orwell, (in) the novel 1984)." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-57", "text": "The other is Describes(Doublethink, fictional concept)." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-58", "text": "2. Detection of argument heads: Next, for each word of a fact phrase, KRAKEN attempts to find heads of arguments using type-paths as listed in Table 1." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-59", "text": "Each type-path indicates one or more links, as well as the direction of each link, to follow to find an argument head." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-60", "text": "For example, the type-path subj-\u2193 indicates that if one downward link of type subj exists, then the target of that link is an argument head." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-61", "text": "Figure 1 illustrates an example." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-62", "text": "At the end of this step, KRAKEN returns all found argument heads for the fact phrase." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-63", "text": "3. Detection of full arguments: KRAKEN recursively follows all downward links from the argument head to get the full argument, excluding any links that were part of the type-path to the argument head." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-64", "text": "The combination of the detected fact phrase from step 1 and these full arguments form the fact." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-65", "text": "If a fact phrase has at least one argument, the system extracts it as a fact." 
}, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-66", "text": "The ruleset was generated by joining the linkpaths reported in (Akbik and Bross, 2009 ) that contain at least one overlapping entity and one overlapping verb, and exchanging the underlying grammatical formalism with Stanford typed dependencies 1 , resulting in a verb-centric and human-readable ruleset." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-67", "text": "----------------------------------" }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-68", "text": "**PRELIMINARY EXPERIMENTAL STUDY**" }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-69", "text": "We compare REVERB, the state-of-the-art in binary fact extraction, with KRAKEN, in order to measure the effect of using N-ary fact extraction over purely binary extractors on overall precision and completeness." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-70", "text": "Additionally, we test in how far using an IE approach based on deep syntactic parsing can be used for sentences from the Web, which have a higher chance of being ungrammatical or noisy." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-71", "text": "1 http://nlp.stanford.edu/software/dependencies" }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-72", "text": "----------------------------------" }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-73", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-74", "text": "Data set: We use the data set from (Fader et al., 2011) which consists of 500 sentences sampled from the Web using Yahoo's random link service." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-75", "text": "2 The sentences were labeled both with facts found with KRAKEN and the current version of REVERB." 
}, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-76", "text": "3 We then paired facts for the same sentence that overlap in at least one of the fact phrase words, in order to present to the judges two different versions of the same fact -often one binary (REVERB) and one Nary (KRAKEN)." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-77", "text": "Measurements/Instructions: Given a sentence and a fact (or fact-pair), we asked two human judges to label each fact as either 1) true and complete, 2) true and incomplete, or 3) false." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-78", "text": "True and incomplete facts either lack contextual information in the form of arguments that were present in the sentence, or contain underspecified arguments, but are nevertheless valid statements in themselves (see our examples in Section 1)." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-79", "text": "In previous evaluations, such 2 facts have been counted as true." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-80", "text": "We distinguish them from true and complete facts that capture all relevant arguments as given by the sentence they were extracted from." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-81", "text": "We measured an inter-annotator agreement of 87%, differently evaluated facts were discussed by the judges and resolved." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-82", "text": "Most disagreement was caused by facts with underspecified arguments, labeled as false by one judge and as true and incomplete by the other." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-83", "text": "----------------------------------" }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-84", "text": "**EVALUATION RESULTS AND DISCUSSION**" }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-85", "text": "KRAKEN extracts higher order N-ary facts." 
}, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-86", "text": "Table 2 show results for KRAKEN and REVERB." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-87", "text": "We measured results for REVERB with different confidence thresholds." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-88", "text": "In all measurements, we observe a significantly higher number of true and complete facts for KRAKEN, as well as both a higher overall precision and number of facts extracted per sentence." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-89", "text": "The completeness, measured as the ratio of complete facts over all true facts, is also significantly higher for KRAKEN." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-90", "text": "Figure 2 breaks down the fact arity." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-91", "text": "KRAKEN performs particularly well for binary, ternary and 4-ary facts, which are also most common." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-92", "text": "We conclude that even though our ruleset was generated on a different domain (Wikipedia text), it generalizes well to the Web domain." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-93", "text": "Dependency parsing of Web text." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-94", "text": "One major drawback of the settings we used is our (possibly too crude) heuristic for detecting erroneous dependency parses: We set KRAKEN to extract facts from all sentences in which the dependency parse does not contain the typed dependency dep, which indicates unclear grammatical relationships." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-95", "text": "A total of 155 sentences -31% of the overall evaluation set -were skipped as a consequence." 
}, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-96", "text": "Also, the elapsed time of the fact extraction process was more than one order of magnitude longer than REVERB, possibly limit- ing the ability of the system to scale to very large collections of documents." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-97", "text": "----------------------------------" }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-98", "text": "**MEASUREMENTS OVER DIFFERENT SENTENCE LENGTHS.**" }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-99", "text": "When limiting the maximum number of words allowed per sentence, we note modest gains in precision and losses in complete positives in both systems, see Figure 3 ." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-100", "text": "KRAKEN performs well even on long sentences, extracting more true and complete positives at a high precision." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-101", "text": "Lessons learned." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-102", "text": "Based on these observations, we reach the conclusion that given the 'right portion' of sentences from a collection such as the Web, our method for N-ary OIE can be very effective, extracting more complete facts with a high precision and fact-per-sentence rate." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-103", "text": "Sentences that are well suited for our algorithm must fulfill the following desiderata: 1) They are noise free and grammatically correct, so there is a high chance for a correct parse." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-104", "text": "2) They are fact-rich, so that processing resources are wisely used." 
}, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-105", "text": "----------------------------------" }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-106", "text": "**SUMMARY AND FUTURE WORK**" }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-107", "text": "Current OIE systems do not perform well for the task of higher order N-ary fact extraction." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-108", "text": "We presented KRAKEN, an algorithm that finds these facts with high precision, completeness, and factper-sentence rate." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-109", "text": "However, we also note that relying on a dependency parser comes at the cost ofspeed and recall, as many sentences were skipped due to our heuristic of detecting erroneous parses." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-110", "text": "Future work focuses on scaling the system up for use on a large Web corpus and increasing the system's recall." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-111", "text": "To achieve this, we will work on a first step of identifying grammatical and fact-rich sentences before applying dependency parsing in a second step, filtering out all sentences that do not meet the desiderata stated in Section 3." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-112", "text": "We intend to evaluate using very fast dependency parsers, some more than two orders of magnitude faster than the Stanford parser (Cer et al., 2010) , one prominent example of which is the MALTparser (Nivre et al., 2007) ." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-113", "text": "Additionally, we will examine more data-driven approaches for identifying fact phrases and arguments in order to maximize the system's recall." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-114", "text": "We intend to use such an approach to train KRAKEN for use on other languages such as German." 
}, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-115", "text": "One interesting aspect of future work is the canonicalization of the fact phrases and arguments given very large collections of extracted facts." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-116", "text": "Unsupervised approaches that make use of redundancy such as (Bollegala et al., 2010) or (Yates and Etzioni, 2007) may help cluster similar fact phrases or arguments." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-117", "text": "A related possibility is the integration of facts into an existing knowledge base, using methods such as distant supervision (Mintz et al., 2009) ." }, { "sent_id": "0f0e13e275c4bc4021b1b0d26f3e0c-C001-118", "text": "We believe that combining OIE with a method for fact phrase canonicalization will allow us to better evaluate the system in terms of precision/recall and usefulness in the future." } ], "y": { "@BACK@": { "gold_contexts": [ [ "0f0e13e275c4bc4021b1b0d26f3e0c-C001-8" ], [ "0f0e13e275c4bc4021b1b0d26f3e0c-C001-10" ], [ "0f0e13e275c4bc4021b1b0d26f3e0c-C001-12" ], [ "0f0e13e275c4bc4021b1b0d26f3e0c-C001-43" ] ], "cite_sentences": [ "0f0e13e275c4bc4021b1b0d26f3e0c-C001-8", "0f0e13e275c4bc4021b1b0d26f3e0c-C001-10", "0f0e13e275c4bc4021b1b0d26f3e0c-C001-12", "0f0e13e275c4bc4021b1b0d26f3e0c-C001-43" ] }, "@MOT@": { "gold_contexts": [ [ "0f0e13e275c4bc4021b1b0d26f3e0c-C001-12", "0f0e13e275c4bc4021b1b0d26f3e0c-C001-13" ] ], "cite_sentences": [ "0f0e13e275c4bc4021b1b0d26f3e0c-C001-12" ] }, "@USE@": { "gold_contexts": [ [ "0f0e13e275c4bc4021b1b0d26f3e0c-C001-29" ], [ "0f0e13e275c4bc4021b1b0d26f3e0c-C001-74" ] ], "cite_sentences": [ "0f0e13e275c4bc4021b1b0d26f3e0c-C001-29", "0f0e13e275c4bc4021b1b0d26f3e0c-C001-74" ] } } }, "ABC_5b9a6590d2e7c49f9a9788abe6dc1b_34": { "x": [ { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-28", "text": "**DECODING IN GENERATIVE NEURAL MODELS**" }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-1", "text": "**ABSTRACT**" }, { 
"sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-2", "text": "Recent work has proposed several generative neural models for constituency parsing that achieve state-of-the-art results." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-3", "text": "Since direct search in these generative models is difficult, they have primarily been used to rescore candidate outputs from base parsers in which decoding is more straightforward." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-4", "text": "We first present an algorithm for direct search in these generative models." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-5", "text": "We then demonstrate that the rescoring results are at least partly due to implicit model combination rather than reranking effects." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-6", "text": "Finally, we show that explicit model combination can improve performance even further, resulting in new state-of-the-art numbers on the PTB of 94.25 F1 when training only on gold data and 94.66 F1 when using external data." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-7", "text": "----------------------------------" }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-9", "text": "Recent work on neural constituency parsing (Dyer et al., 2016; Choe and Charniak, 2016) has found multiple cases where generative scoring models for which inference is complex outperform base models for which inference is simpler." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-48", "text": "To deal with this issue, we force partial parse candidates to compete with each other on a word-by-word level, rather than solely on the level of individual actions." 
}, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-10", "text": "Let A be a parser that we want to parse with (here one of the generative models), and let B be a base parser that we use to propose candidate parses which are then scored by the less-tractable parser A. We denote this cross-scoring setup by B \u2192 A. The papers above repeatedly saw that the cross-scoring setup B \u2192 A under which their generative models were applied outperformed the standard single-parser setup B \u2192 B. We term this a cross-scoring gain." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-11", "text": "This paper asks two questions." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-12", "text": "First, why do recent discriminative-to-generative cross-scoring setups" }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-13", "text": "B \u2192 A outperform their base parsers B?" }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-14", "text": "Perhaps generative models A are simply superior to the base models B and direct generative parsing (A \u2192 A) would be better still if it were feasible." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-15", "text": "If so, we would characterize the cross-scoring gain from B \u2192 B to B \u2192 A as a reranking gain." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-16", "text": "However, it's also possible that the hybrid system B \u2192 A shows gains merely from subtle model combination effects." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-17", "text": "If so, scoring candidates using some combined score A + B would be even better, which we would characterize as a model combination gain." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-18", "text": "It might even be the case that B is a better parser overall (i.e. B \u2192 B outperforms A \u2192 A)." 
}, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-19", "text": "Of course, many real hybrids will exhibit both reranking and model combination gains." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-20", "text": "In this paper, we present experiments to isolate the degree to which each gain occurs for each of two state-of-the-art generative neural parsing models: the Recurrent Neural Network Grammar generative parser (RG) of Dyer et al. (2016) , and the LSTM language modeling generative parser (LM) of Choe and Charniak (2016) ." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-21", "text": "In particular, we present and use a beam-based search procedure with an augmented state space that can search directly in the generative models, allowing us to explore A \u2192 A for these generative parsers A independent of any base parsers." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-22", "text": "Our findings suggest the presence of model combination effects in both generative parsers: when parses found by searching directly in the generative parser are added to a list of candidates from a strong base parser (the RNNG discriminative parser, RD (Dyer et al., 2016) ), performance decreases when compared to using just candidates from the base parser, i.e., B \u222a A \u2192 A has lower evaluation performance than B \u2192 A (Section 3.1)." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-23", "text": "This result suggests that both generative models benefit from fortuitous search errors in the rescoring setting -there are trees with higher probability under the generative model than any tree proposed by the base parser, but which would decrease evaluation performance if selected." 
}, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-24", "text": "Because of this, we hypothesize that model combination effects between the base and generative models are partially responsible for the high performance of the generative reranking systems, rather than the generative model being generally superior." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-25", "text": "Here we consider our second question: if cross-scoring gains are at least partly due to implicit model combination, can we gain even more by combining the models explicitly? We find that this is indeed the case: simply taking a weighted average of the scores of both models when selecting a parse from the base parser's candidate list improves over using only the score of the generative model, in many cases substantially (Section 3.2)." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-26", "text": "Using this technique, in combination with ensembling, we obtain new state-of-the-art results on the Penn Treebank: 94.25 F1 when training only on gold parse trees and 94.66 F1 when using external silver data." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-27", "text": "----------------------------------" }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-88", "text": "**SCORE COMBINATION**" }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-29", "text": "All of the parsers we investigate in this work (the discriminative parser RD, and the two generative parsers RG and LM, see Section 1) produce parse trees in a depth-first, left-to-right traversal, using the same basic actions: NT(X), which opens a new constituent with the non-terminal symbol X; SHIFT / GEN(w), which adds a word; and REDUCE, which closes the current constituent." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-30", "text": "We refer to Dyer et al. (2016) for a complete description of these actions, and the constraints on them necessary to ensure valid parse trees." 
}, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-31", "text": "The primary difference between the actions in the discriminative and generative models is that, whereas the discriminative model uses a SHIFT action which is fixed to produce the next word in the sentence, the generative models use GEN(w) to define a distribution over all possible words w in the lexicon." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-32", "text": "This stems from the generative model's definition of a joint probability p(x, y) over all possible sentences x and parses y. To use a generative model as a parser, we are interested in finding the maximum probability parse for a given sentence." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-33", "text": "This is made more complicated by not having an explicit representation for p(y|x), as we do in the discriminative setting." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-34", "text": "However, we can start by applying similar approximate search procedures as are used for the discriminative parser, constraining the set of actions such that it is only possible to produce the observed sentence: i.e. only allow a GEN(w) action when w is the next terminal in the sentence, and prohibit GEN actions if all terminals have been produced." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-35", "text": "----------------------------------" }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-36", "text": "**ACTION-SYNCHRONOUS BEAM SEARCH**" }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-37", "text": "Past work on discriminative neural constituency parsers has shown the effectiveness of beam search with a small beam (Vinyals et al., 2015) or even greedy search, as in the case of RD (Dyer et al., 2016)." 
}, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-38", "text": "The standard beam search procedure, which we refer to as action-synchronous, maintains a beam of K partially-completed parses that all have the same number of actions taken." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-39", "text": "At each stage, a pool of successors is constructed by extending each candidate in the beam with each of its possible next actions." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-40", "text": "The K highest-probability successors are chosen as the next beam." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-41", "text": "Unfortunately, we find that action-synchronous beam search breaks down for both generative models we explore in this work, failing to find parses that are high scoring under the model." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-42", "text": "This stems from the probabilities of the actions NT(X) for all labels X almost always being greater than the probability of GEN(w) for the particular word w which must be produced next in a given sentence." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-43", "text": "Qualitatively, the search procedure prefers to open constituents repeatedly up until the maximum number allowed by the model." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-44", "text": "While these long chains of non-terminals will usually have lower probability than the correct sequence at the point where they finally generate the next word, they often have higher probability up until the word is generated, and so they tend to push the correct sequence off the beam before this point is reached." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-45", "text": "This search failure produces very low evaluation performance: with a beam of size K = 100, action-synchronous beam search achieves 29.1 F1 for RG and 27.4 F1 for LM on the development set." 
}, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-46", "text": "----------------------------------" }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-47", "text": "**WORD-SYNCHRONOUS BEAM SEARCH**" }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-49", "text": "The word-synchronous beam search we apply is very similar to approximate decoding procedures developed for other generative models (Henderson, 2003; Titov and Henderson, 2010; Buys and Blunsom, 2015) and can be viewed as a simplified version of the procedure used in the generative top-down parsers of Roark (2001) and Charniak (2010)." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-50", "text": "In word-synchronous search, we augment the beam state space, identifying beams by tuples (|W|, |A_w|), where |W| is the number of words that have been produced so far in the sentence, and |A_w| is the number of structural actions that have been taken since the last word was produced." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-51", "text": "Intuitively, we want candidates with the same |W| = w to compete against each other." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-52", "text": "For a beam of partial parses in the state (|W| = w, |A_w| = a), we generate a beam of successors by taking all of the next possible actions for each partial parse in the beam." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-53", "text": "If the action is NT(X) or REDUCE, we place the resulting partial parse in the beam for state (|W| = w, |A_w| = a + 1); otherwise, if the action is GEN, we place it in a list for (|W| = w + 1, |A_w| = 0)." 
}, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-54", "text": "After all partial parses in the beam have been processed, we check to see if there are a sufficient number of partial parses that have produced the next word: if the beam (|W| = w + 1, |A_w| = 0) contains at least K_w partial parses (the word beam size), we prune it to this size and continue search using this beam." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-55", "text": "Otherwise, we continue building candidates for this word by pruning the beam (|W| = w, |A_w| = a + 1) to size K_a (the action beam size), and continuing search from there." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-56", "text": "In practice, we found it to be most effective to use a value for K_w that is a fraction of the value for K_a." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-57", "text": "In all the experiments we present here, we fix K_a = 10 \u00d7 K_w, with K_w ranging from 10 to 100." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-58", "text": "Table 1 shows F1 for decoding in both generative models on the development set, using the top-scoring parse found for a sentence when searching with the given beam size." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-59", "text": "RG has comparatively larger gains in performance between the larger beam sizes, while still underperforming LM, suggesting that more search is necessary in this model." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-60", "text": "----------------------------------" }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-61", "text": "**EXPERIMENTS**" }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-62", "text": "Using the above decoding procedures, we attempt to separate reranking effects from model combination effects through a set of reranking experiments." 
}, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-63", "text": "Our base experiments are performed on the Penn Treebank (Marcus et al., 1993), using sections 2-21 for training, section 22 for development, and section 23 for testing." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-64", "text": "For the LSTM generative model (LM), we use the pre-trained model released by Choe and Charniak (2016)." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-65", "text": "We train RNNG discriminative (RD) and generative (RG) models, following Dyer et al. (2016) by using the same hyperparameter settings, and using pretrained word embeddings from Ling et al. (2015) for the discriminative model." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-66", "text": "The automatically predicted part-of-speech tags we use as input for RD are the same as those used by Cross and Huang (2016)." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-67", "text": "In each experiment, we obtain a set of candidate parses for each sentence by performing beam search in one or more parsers." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-68", "text": "We use action-synchronous beam search (Section 2.1) with beam size K = 100 for RD and word-synchronous beam search (Section 2.2) with K_w = 100 and K_a = 1000 for the generative models RG and LM." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-69", "text": "In the case that we are using only the scores from a single generative model to rescore candidates taken from the discriminative parser, this setup is close to the reranking procedures originally proposed for these generative models." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-70", "text": "For RG, the original work also used RD to produce candidates, but drew samples from it, whereas we use a beam search to approximate its k-best list." 
}, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-71", "text": "The LM generative model was originally used to rerank a 50-best list taken from the Charniak parser (Charniak, 2000)." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-72", "text": "In comparison, we found higher performance for the LM model when using a candidate list from the RD parser: 93.66 F1 versus 92.79 F1 on the development data." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-73", "text": "This may be attributable to having a stronger set of candidates: with beam size 100, RD has an oracle F1 of 98.2, compared to 95.9 for the 50-best list from the Charniak parser." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-74", "text": "----------------------------------" }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-75", "text": "**AUGMENTING THE CANDIDATE SET**" }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-76", "text": "We first experiment with combining the candidate lists from multiple models, which allows us to look for potential model errors and model combination effects." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-77", "text": "Consider the standard reranking setup B \u2192 A, where we search in B to get a set of candidate parses for each sentence, and (Table 2: Development F1 scores on section 22 of the PTB when using various models to produce candidates and to score them.)" }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-78", "text": "\u222a denotes taking the union of candidates from each of two models; + denotes using a weighted average of the models' log-probabilities." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-79", "text": "choose the top scoring candidate from these under A. We extend this by also searching directly in A to find high-scoring candidates for each sentence, and combining them with the candidate list proposed by B by taking the union, A \u222a B. We then choose the highest scoring candidate from this list under A. 
If A generally prefers parses outside of the candidate list from B, but these decrease evaluation performance (i.e., if B \u222a A \u2192 A is worse than B \u2192 A), this suggests a model combination effect is occurring: A makes errors which are hidden by having a limited candidate list from B. This does seem to be the case for both generative models, as shown in Table 2, which presents F1 scores on the development set when varying the models used to produce the candidates and to score them." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-80", "text": "Each row is a different candidate set, where the third row in each table presents results for the augmented candidate sets; each column is a different scoring model, where the third column is the score combination setting described below." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-81", "text": "Going from RD \u2192 RG to the augmented candidate setting RD \u222a RG \u2192 RG decreases performance from 93.45 F1 to 92.78 F1 on the development set." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-82", "text": "This difference is statistically significant at the p < 0.05 level under a paired bootstrap test." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-83", "text": "We see a smaller, but still significant, effect in the case of LM: RD \u2192 LM achieves 93.66, compared to 93.47 for RD \u222a LM \u2192 LM." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-84", "text": "We can also consider the performance of RG \u2192 RG and LM \u2192 LM (where we do not use candidates from RD at all, but return the highest-scoring parse from searching directly in one of the generative models) as an indicator of reranking effects: absolute performance is higher for LM (92.20 F1) than for RG (89.55)." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-85", "text": "Taken together, these results suggest that model combination contributes to the success of both models, but to a larger extent for RG." 
}, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-86", "text": "A reranking effect may be a larger contributor to the success of LM, as this model achieves stronger performance on its own for the described search setting." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-87", "text": "----------------------------------" }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-110", "text": "Performance when using only the ensembled RD models (row 10) is lower than rescoring a single RD model with score combinations of single models, either RD + RG (row 3) or RD + LM (row 6)." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-111", "text": "In the PTB setting, ensembling with score combination achieves the best overall result of 94.25 (row 13)." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-112", "text": "In the silver training data setting, while this does improve on the analogous unensembled result (row 8), it is not better than the combination of single models when candidates from the generative models are also included (row 9)." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-113", "text": "----------------------------------" }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-114", "text": "**DISCUSSION**" }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-115", "text": "Searching directly in the generative models yields results that are partly surprising, as it reveals the presence of parses which the generative models prefer, but which lead to lower performance than the candidates proposed by the base model." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-116", "text": "However, the results are also unsurprising in the sense that explicitly combining scores allows the reranking setup to achieve better performance than implicit combination, which uses only the scores of a single model." 
}, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-117", "text": "Additionally, we see support for the hypothesis that the generative models can achieve good results on their own, with the LSTM generative model showing particularly strong and self-contained performance." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-118", "text": "While this search procedure allows us to explore these generative models, disentangling reranking and model combination effects, the increase in performance from augmenting the candidate lists with the results of the search may not be worth the required computational cost in a practical parser." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-119", "text": "However, we do obtain a gain over state-of-the-art results using simple model score combination on only the base candidates, which can be implemented with minimal cost over the basic reranking setup." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-120", "text": "This provides a concrete improvement for these particular generative reranking procedures for parsing." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-121", "text": "More generally, it supports the idea that hybrid systems, which rely on one model to produce a set of candidates and another to determine which candidates are good, should explore combining their scores and candidates when possible." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-89", "text": "If the cross-scoring setup exhibits an implicit model combination effect, where strong performance results from searching in one model and scoring with the other, we might expect substantial further improvements in performance by explicitly combining the scores of both models." 
}, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-90", "text": "To do so, we score each parse by taking a weighted sum of the log-probabilities assigned by both models (Hayashi et al., 2013) , using an interpolation parameter which we tune to maximize F1 on the development set." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-91", "text": "These results are given in columns RD + RG and RD + LM in Table 2 ." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-92", "text": "We find that combining the scores of both models improves on using the score of either model alone, regardless of the source of candidates." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-93", "text": "These improvements are statistically significant in all cases." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-94", "text": "Score combination also more than compensates for the decrease in performance we saw previously when adding in candidates from the generative model: RD \u222a RG \u2192 RD + RG improves upon both RD \u2192 RG and RD \u222a RG \u2192 RG, and the same effect holds for LM." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-95", "text": "----------------------------------" }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-96", "text": "**STRENGTHENING MODEL COMBINATION**" }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-97", "text": "Given the success of model combination between the base model and a single generative model, we also investigate the hypothesis that the generative models are complementary." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-98", "text": "The Model Combination block of Table 3 shows full results on the test set for these experiments, in the PTB column." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-99", "text": "The same trends we observed on the development data, on which the interpolation parameters were tuned, hold here: score combination improves results for all models (row 3 vs. row 2; row 6 vs. 
row 5), with candidate augmentation from the generative models giving a further increase (rows 4 and 7)." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-100", "text": "Combining candidates and scores from all three models (row 9), we obtain 93.94 F1." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-101", "text": "Table 3: Test F1 scores on section 23 of the PTB, by treebank training data conditions: either using only the training sections of the PTB, or using additional silver data (+S)." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-102", "text": "Semi-supervised silver data Choe and Charniak (2016) found a substantial increase in performance by training on external data in addition to trees from the Penn Treebank." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-103", "text": "This silver dataset was obtained by parsing the entire New York Times section of the fifth Gigaword corpus using a product of eight Berkeley parsers (Petrov, 2010) and ZPar (Zhu et al., 2013), then retaining 24 million sentences on which both parsers agreed." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-104", "text": "For our experiments we train RD and RG using the same silver dataset." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-105", "text": "The +S column in Table 3 shows these results, where we observe gains over the PTB models in nearly every case." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-106", "text": "As in the PTB training data setting, using all models for candidates and score combinations is best, achieving 94.66 F1 (row 9)." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-107", "text": "Ensembling Finally, we compare to another commonly used model combination method: ensembling multiple instances of the same model type trained from different random initializations." 
}, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-108", "text": "We train ensembles of 8 copies each of RD and RG in both the PTB and silver data settings, combining scores from models within an ensemble by averaging the models' distributions for each action (in beam search as well as rescoring)." }, { "sent_id": "5b9a6590d2e7c49f9a9788abe6dc1b-C001-109", "text": "These results are shown in the bottom section, Ensembling, of Table 3 ." } ], "y": { "@BACK@": { "gold_contexts": [ [ "5b9a6590d2e7c49f9a9788abe6dc1b-C001-9" ], [ "5b9a6590d2e7c49f9a9788abe6dc1b-C001-37" ] ], "cite_sentences": [ "5b9a6590d2e7c49f9a9788abe6dc1b-C001-9", "5b9a6590d2e7c49f9a9788abe6dc1b-C001-37" ] }, "@USE@": { "gold_contexts": [ [ "5b9a6590d2e7c49f9a9788abe6dc1b-C001-20" ], [ "5b9a6590d2e7c49f9a9788abe6dc1b-C001-22" ], [ "5b9a6590d2e7c49f9a9788abe6dc1b-C001-30" ], [ "5b9a6590d2e7c49f9a9788abe6dc1b-C001-65" ] ], "cite_sentences": [ "5b9a6590d2e7c49f9a9788abe6dc1b-C001-20", "5b9a6590d2e7c49f9a9788abe6dc1b-C001-22", "5b9a6590d2e7c49f9a9788abe6dc1b-C001-30", "5b9a6590d2e7c49f9a9788abe6dc1b-C001-65" ] } } }, "ABC_ada92083e8c012c328d5b6172b76ad_34": { "x": [ { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-2", "text": "Opinion role labeling (ORL) is an important task for fine-grained opinion mining, which identifies important opinion arguments such as holder and target for a given opinion trigger." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-3", "text": "The task is closely related to semantic role labeling (SRL), which identifies important semantic arguments such as agent and patient for a given predicate." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-4", "text": "As predicate agents and patients usually correspond to opinion holders and targets respectively, SRL could be valuable for ORL." 
}, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-5", "text": "In this work, we propose a simple and novel method to enhance ORL by utilizing SRL, presenting semantic-aware word representations which are learned from SRL." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-6", "text": "The representations are then fed into a baseline neural ORL model as basic inputs." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-7", "text": "We verify the proposed method on a benchmark MPQA corpus." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-8", "text": "Experimental results show that the proposed method is highly effective." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-9", "text": "In addition, we compare the method with two representative methods of SRL integration as well, finding that our method can outperform the two methods significantly, achieving 1.47% higher F-scores than the better one." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-10", "text": "----------------------------------" }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-11", "text": "**INTRODUCTION**" }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-12", "text": "Fine-grained opinion mining aims to detect structured user opinions in text, which has drawn much attention in the natural language processing (NLP) community (Kim and Hovy, 2006; Breck et al., 2007; Ruppenhofer et al., 2008; Wilson et al., 2009; Qiu et al., 2011; Cardie, 2013, 2014; Liu et al., 2015; Wiegand et al., 2016)." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-13", "text": "A structured opinion includes the key arguments of one opinion, such as expressions, holders and targets (Breck et al., 2007; Cardie, 2012, 2013; Katiyar and Cardie, 2016)." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-14", "text": "Here we focus on opinion role labeling (ORL) (Marasovi\u0107 and Frank, 2018), which identifies opinion holders and" 
}, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-15", "text": "We want to resolve all issues peacefully targets assuming that the opinion expressions are given." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-16", "text": "Figure 1 shows an example of the task." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-17", "text": "The focused task behaves very similar with semantic role labeling (SRL) which identifies the core semantic roles for given predicates." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-18", "text": "Earlier work attempts to exploit a well-trained SRL model to recognize possible semantic roles for a given opinion expression, and then map the semantic roles into opinion roles (Kim and Hovy, 2006; Ruppenhofer et al., 2008) ." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-19", "text": "The heuristic approach is unable to obtain high performance for ORL because there are large mismatches between SRL and ORL." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-20", "text": "For example, opinion expressions are different from verb/noun predicates in SRL, and meanwhile, opinion holders and targets may not always correspond to semantic agents (ARG0) and patients (ARG1), respectively." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-21", "text": "We can exploit machine learning based method to solve the mismatching problem between ORL and SRL." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-22", "text": "With a small number of annotated ORL corpus, we can feed the SRL outputs as inputs to build a statistical model for ORL." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-23", "text": "By this way, the model can learn the consistencies and inconsistencies between SRL and ORL, arriving at a full exploration of SRL." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-24", "text": "The method is essentially a feature-based method, treating SRL outputs as a source of features for ORL." 
}, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-25", "text": "The main drawback of the method is that direct exploration of SRL outputs may lead to the error propagation problem." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-26", "text": "SRL errors can be further propagated into ORL outputs, resulting in degraded ORL performance." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-27", "text": "In this work, we propose a simple and novel method by using implicit semantic-aware word representations from SRL to enhance ORL." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-28", "text": "The method is referred to as SRL-SAWR for brief." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-29", "text": "Thanks to the recent advances of encoder-decoder neural SRL models (Zhou and Xu, 2015; He et al., 2017) , we can extract implicit vectorized features from the intermediate encoder module instead, avoiding the direct exploration of the final onebest SRL outputs." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-30", "text": "The vectorized features from the encoder part are implicit semantic-aware representations for input sentences." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-31", "text": "By taking the semantic-aware representations from SRL as ORL inputs, we are able to make use of SRL information and meanwhile alleviate the error propagation problem." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-32", "text": "Here we exploit a neural conditional random field (CRF) model with deep bi-directional long short-term memory networks (Bi-LSTMs) as a baseline, most of which is borrowed from Katiyar and Cardie (2016) and Marasovi\u0107 and Frank (2018) ." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-33", "text": "Our preliminary experiments show that the model is able to achieve state-of-the-art performances for both ORL and SRL." 
}, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-34", "text": "Based on this model, we study the proposed implicit semanticaware word representations for ORL." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-35", "text": "In addition, we compare this method with two other representative methods of SRL integration as well: one uses discrete SRL outputs as features directly for ORL and the other one exploits a multi-tasklearning (MTL) framework to benefit ORL by SRL information." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-36", "text": "Experiments are conducted on the MPQA 2.0 dataset, which is a standard benchmark for opinion mining." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-37", "text": "Results show that SRL is highly effective for ORL, which is consistent with previous findings (Kim and Hovy, 2006; Ruppenhofer et al., 2008; Marasovi\u0107 and Frank, 2018) ." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-38", "text": "Meanwhile, our implicit SRL-SAWR method can achieve the best ORL performance, 2.23% higher F-scores than the second best method." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-39", "text": "All the codes and datasets are released publicly available for research purpose under Apache Licence 2.0 at https://github.com/zhangmeishan/SRL4ORL." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-40", "text": "----------------------------------" }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-41", "text": "**METHOD**" }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-42", "text": "----------------------------------" }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-43", "text": "**BASELINE**" }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-44", "text": "ORL aims to identify important opinion arguments for a given opinion expression." 
}, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-45", "text": "The task can be modeled as a sequence labeling problem, similar to SRL (Zhou and Xu, 2015; He et al., 2017) ." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-46", "text": "We adopt the {BMESO} schema to con- vert opinion arguments into a sequence of boundary tags for each word, where B, M and E denote the beginning, middle and ending words of an argument, S denotes a single-word argument, and O denotes the other words." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-47", "text": "Formally, given a sentence w 1 \u00b7 \u00b7 \u00b7 w n and a span of opinion expression w s \u00b7 \u00b7 \u00b7 w e (1 \u2264 s \u2264 e \u2264 n), we aim to assign each word in the sentence by a tag, outputting t 1 \u00b7 \u00b7 \u00b7 t n ." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-48", "text": "Inspired by Katiyar and Cardie (2016) and Marasovi\u0107 and Frank (2018) , we exploit a deep Bi-LSTM CRF model as the baseline." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-49", "text": "Figure 2 shows the overall architecture of the baseline model." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-50", "text": "This model can achieve state-of-the-art performances for both ORL and SRL, which facilitates our study." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-51", "text": "The key components of the baseline model include three parts: word representation, the deep Bi-LSTM encoder and the CRF decoder." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-52", "text": "The word representation takes sequential words and opinion expressions as input, mapping them into dense-valued feature vectors x 1 \u00b7 \u00b7 \u00b7 x n ." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-53", "text": "Following we extract high-level neural features based on the vectors by deep Bi-LSTM, arriving at h 1 \u00b7 \u00b7 \u00b7 h n . 
And finally a CRF decoder is applied to output the ORL results t 1 \u00b7 \u00b7 \u00b7 t n ." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-54", "text": "----------------------------------" }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-55", "text": "**SRL INTEGRATION**" }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-56", "text": "SRL aims to find the core semantic arguments for a given predicate, which is highly correlative with the ORL task." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-57", "text": "The semantic roles agent (ARG0) and patient (ARG1) are often corresponding to the opinion holder and target, respectively." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-58", "text": "Several works even directly transfer semantic roles into opinion roles for ORL (Kim and Hovy, 2006; Ruppenhofer et al., 2008) , treating opinion expressions as the major predicates." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-59", "text": "These systems can achieve good performances, indicating that SRL information can be greatly useful for ORL." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-60", "text": "Here we propose a novel method to encode the SRL information implicitly, enhancing ORL model with semantic-aware word representations from a neural SRL model (SRL-SAWR)." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-61", "text": "Figure 3 shows the overall architectures of our SRL integration method." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-62", "text": "Instead of using the discrete outputs from the SRL model, the SRL-SAWR method exploits the intermediate encoder outputs as inputs for ORL, which can alleviate the problems in the above two methods." 
}, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-86", "text": "----------------------------------" }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-87", "text": "**SETTING**" }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-63", "text": "On the one hand, we do not rely on the discrete outputs of a well-trained SRL, reducing the error prorogation problem. And on the other hand, we handle ORL and SRL separately, avoiding the model structure dependencies between the two tasks." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-64", "text": "We assume that the external SRL system is a neural-based encoder-decoder model." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-65", "text": "For fair comparisons with FS-MTL, here we use the same deep Bi-LSTM CRF model for SRL as well." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-66", "text": "Thus the encoder outputs are the hidden vectors from deep Bi-LSTMs." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-67", "text": "Assuming that the dumped hidden vector sequence from the SRL encoder is h SRL 1 \u00b7 \u00b7 \u00b7 h SRL n , we integrate it into the ORL model by the following equation:" }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-68", "text": "where W SRL is a projection matrix which is a model parameter, x i is the baseline word representation of word w i , and x * i is the new word representation, which will be further fed into the deep Bi-LSTM layer of the ORL model." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-69", "text": "Noticeably, the model parameters of the SRL encoder are also fine tuned according to the ORL objective, as the preliminary results indicate that fine-tuning can bring better performance." 
}, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-70", "text": "----------------------------------" }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-71", "text": "**EXPERIMENTS**" }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-72", "text": "----------------------------------" }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-73", "text": "**ORL DATA**" }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-74", "text": "We exploit the MPQA version 2.0 corpus (Wiebe et al., 2005; Wilson, 2008) to evaluate our models, 1 which has been widely adopted as a benchmark dataset for opinion mining (Yang and Cardie, 2013; Katiyar and Cardie, 2016; Marasovi\u0107 and Frank, 2018) ." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-75", "text": "There are 482 documents in the dataset." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-76", "text": "Following these work, we set aside 132 documents as the development set, and the remaining 350 documents are used as the test set in our experiments." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-77", "text": "We conduct experiments using fivefold cross-validation (CV) on the test set at the document level." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-78", "text": "Following Marasovi\u0107 and Frank (2018) , we focus on opinion holders and targets only." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-79", "text": "The gold standard opinion expressions, holders and targets correspond to the direct subjective annotations, agent annotations and target annotations, respectively." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-80", "text": "----------------------------------" }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-81", "text": "**EVALUATION METRICS**" }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-82", "text": "We use recall (R), precision (P) and their F1-measure value to measure our proposed models." 
}, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-83", "text": "The average values of the five-fold CV results are reported in this work." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-84", "text": "We exploit exact matching as the major metric." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-85", "text": "Following Marasovi\u0107 and Frank (2018) , two kinds of soft evaluation methods are also adopted for evaluation, namely binary and proportional overlapping, Binary overlap treats an entity as correct if it contains an overlapped region with the gold-standard entity, and the proportional overlap assigns a partial score proportional to the ratio of the overlapped region." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-88", "text": "There are several hyper-parameters to define our neural network structures." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-89", "text": "We simply set their values according to previous work (He et al., 2017; Marasovi\u0107 and Frank, 2018) , without much tuning work." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-90", "text": "Concretely, we set the dimension size of all embeddings to 100, the output hidden size of LSTMs to 200 and the layer number of Bi-LSTM to 3." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-91", "text": "For external word embeddings, we use the pretrained 100-dimensional glove embeddings (Pennington et al., 2014) ." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-92", "text": "We exploit online training to learn model parameters, and train on the entire training instances for 40 epochs, choosing the best-epoch model according to the performance on the development corpus." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-93", "text": "We use Adam (Kingma and Ba, 2014) with a learning rate 10 \u22123 to update model parameters, and use gradient clipping by a max norm 1.0 and l 2 -regularization by a parameter 10 \u22128 ." 
}, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-94", "text": "We apply dropout with a ratio of 0.2 over word represen-tations, output layers of Bi-LSTMs to avoid overfitting (Srivastava et al., 2014) ." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-95", "text": "----------------------------------" }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-96", "text": "**SRL**" }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-97", "text": "For SRL, we use the large-scale dataset of CoNLL-2012 shared task, which is extracted from OntoNotes v5.0 corpus." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-98", "text": "The description and separation of train, development and test data set can be found in Pradhan et al. (2013) ." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-99", "text": "The training corpus contains over 250K predicates, which is much larger than the number of opinion expressions in the ORL training corpus (averaged 3.6K)." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-100", "text": "We exploit the same neural network model as the ORL for SRL, in order to make fair comparisons between our proposed model with FS-MTL." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-101", "text": "According to the preliminary experiments, the SRL model can reach an F-measure of 81.8%, which is comparable to the reported result (81.7%) in He et al. (2017) ." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-102", "text": "Table 1 shows the final results on the test dataset." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-103", "text": "We report the overall as well as the fine-grained performance in term of opinion arguments (i.e., holder and target)." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-104", "text": "Compared with the baseline system, our final SRL-SAWR model can bring significantly better results (p < 10 \u22125 under pairwise t-test)." 
}, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-105", "text": "For fine-grained evaluations, the final model outperforms the baseline model consistently on opinion holders and targets." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-106", "text": "The tendencies are similar by exploiting the binary and proportional matching methods." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-107", "text": "The results show that SRL information is very helpful for ORL, which is consistent with previous studies (Kim and Hovy, 2006; Ruppenhofer et al., 2008; Marasovi\u0107 and Frank, 2018) ." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-108", "text": "The implicit SRL-SAWR method is highly effective to integrate SRL information into the ORL model." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-109", "text": "----------------------------------" }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-110", "text": "**RESULTS**" }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-134", "text": "We can see that the simple mapping method is also one feasible alternative as a whole." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-135", "text": "Further, we compare the SRL utilization capabilities of our proposed method and the other SRLenhanced ORL systems, including the above SRL Mapping method." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-136", "text": "We categorize the opinion arguments by whether they can be directly mapped from the SRL outputs." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-137", "text": "The opinion arguments which can be directly mapped from SRL, referred to as consistent arguments, should be more easily identified by SRL enhanced models than the remaining inconsistent arguments." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-138", "text": "enhanced supervised models can achieve better performances for consistent arguments." 
}, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-139", "text": "For the inconsistent arguments, the tendency is similar, except the holder performance of SRL-TE." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-140", "text": "In addition, our method can gain much larger improvements, which indicates that our method can better handle the inconsistencies between SRL and ORL." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-141", "text": "Finally, we show one case study to illustrate the advantage of our SRL-SAWR method." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-142", "text": "Figure 5 shows one example." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-143", "text": "As shown, the SRL argument ARG0, which is more probably mapped onto holder, is annotated by target in the example." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-144", "text": "The SRL argument ARG1 is labeled as opinion holder, which is also one inconsistent case." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-145", "text": "Compared with SRL-TE and FS-MTL, our model can better handle these inconsistent cases." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-146", "text": "The observation further confirms our results in Table 3 ." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-147", "text": "----------------------------------" }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-148", "text": "**CONCLUSION**" }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-149", "text": "We proposed a simple and novel method (SRL-SAWR) to enhance ORL with SRL information by exploiting implicit semantic-aware word representations from SRL." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-150", "text": "The main idea is to export intermediate SRL encoder outputs as inputs to better word representations of an ORL model." 
}, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-151", "text": "This method does not impose any extra requirement for ORL, and meanwhile avoids the error prorogation problem from discrete SRL outputs." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-152", "text": "We conducted experiments to verify our method on a benchmark MPQA dataset." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-153", "text": "The results showed that our method can exploit SRL information effectively." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-154", "text": "We compared the proposed method with SRL-TE and FS-MTL, which are two representative approaches to enhance ORL by SRL." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-155", "text": "The results demonstrated our method can bring the best performance among the three approaches." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-111", "text": "Further, we compare the SRL-SAWR method with two other methods as well, namely SRL-TE and FS-MTL, respectively." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-112", "text": "The SRL-TE approach simply exploits the output SRL tags as inputs for ORL, embedding them as an additional source of word representations." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-113", "text": "The FS-MTL approach is exactly the proposed model by Marasovi\u0107 and Frank (2018) ." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-114", "text": "As shown in Table 1 , all three methods can bring improved performance by integrating SRL, further demonstrating that SRL is indeed valuable for ORL." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-115", "text": "In addition, the SRL-SAWR method can achieve the best performance among the three methods, obtaining further significant improvements by at least 63.74 \u2212 61.51 = 2.23 points on overall F1-measure with exact matching (p < 10 \u22124 )." 
}, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-116", "text": "For fine-grained evaluations, the SRL-SAWR method can also give the best performance." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-117", "text": "The results demonstrate that SRL-SAWR is most effective to integrate the SRL information into a neural ORL model." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-118", "text": "The two methods, SRL-TE and FS-MTL, are comparable by evaluations based on the exact matching." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-119", "text": "----------------------------------" }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-120", "text": "**ANALYSIS**" }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-121", "text": "In this section, we conduct several experimental analysis on the test dataset to deeply understand the effectiveness of SRL information." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-122", "text": "First, we examine the relationship between SRL and ORL." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-123", "text": "SRL identifies the semantic arguments for predicates, and ORL recognizes the opinion arguments for opinion expressions." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-124", "text": "Intuitively, in most cases, the opinion holders are corresponding to semantic agents/ARG0 of opinion triggers/expressions, and similarly, the opinion targets are usually corresponding to patients/ARG1." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-125", "text": "Figure 4 shows the percentages of opinion hold- ers/targets being corresponding to semantic roles, which are calculated according to the word-level mapping over the 1-best SRL outputs and the goldstandard ORL tags." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-126", "text": "We list only the five semantic roles with highest mapping percentages." 
}, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-127", "text": "As shown, the results are consistent with our intuition." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-128", "text": "Thus SRL and ORL are highly correlative." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-129", "text": "Considering the much larger scale of annotated SRL corpora, SRL can benefit ORL potentially." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-130", "text": "According to the above findings, we design a simple system by mapping SRL outputs into ORL directly (Kim and Hovy, 2006; Ruppenhofer et al., 2008) ." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-131", "text": "We simply convert the semantic role ARG0 into holder, and ARG1 into target." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-132", "text": "Table 2 shows the performance." }, { "sent_id": "ada92083e8c012c328d5b6172b76ad-C001-133", "text": "The results of the baseline system are shown for comparison." } ], "y": { "@BACK@": { "gold_contexts": [ [ "ada92083e8c012c328d5b6172b76ad-C001-12" ], [ "ada92083e8c012c328d5b6172b76ad-C001-18" ], [ "ada92083e8c012c328d5b6172b76ad-C001-58" ] ], "cite_sentences": [ "ada92083e8c012c328d5b6172b76ad-C001-12", "ada92083e8c012c328d5b6172b76ad-C001-18", "ada92083e8c012c328d5b6172b76ad-C001-58" ] }, "@SIM@": { "gold_contexts": [ [ "ada92083e8c012c328d5b6172b76ad-C001-37" ], [ "ada92083e8c012c328d5b6172b76ad-C001-107" ] ], "cite_sentences": [ "ada92083e8c012c328d5b6172b76ad-C001-37", "ada92083e8c012c328d5b6172b76ad-C001-107" ] }, "@MOT@": { "gold_contexts": [ [ "ada92083e8c012c328d5b6172b76ad-C001-58", "ada92083e8c012c328d5b6172b76ad-C001-59", "ada92083e8c012c328d5b6172b76ad-C001-60" ] ], "cite_sentences": [ "ada92083e8c012c328d5b6172b76ad-C001-58" ] }, "@USE@": { "gold_contexts": [ [ "ada92083e8c012c328d5b6172b76ad-C001-130" ] ], "cite_sentences": [ "ada92083e8c012c328d5b6172b76ad-C001-130" ] } } }, "ABC_eeada4aedbb43b575365a15d75f2ac_34": { "x": [ { "sent_id": 
"eeada4aedbb43b575365a15d75f2ac-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-2", "text": "Recent advances in large-scale, broad coverage part-of-speech tagging and syntactic parsing have been achieved in no small part due to the availability of large amounts of online, human-annotated corpora." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-3", "text": "In this paper, I argue that a large, human sensetagged corpus is also critical as well as necessary to achieve broad coverage, high accuracy word sense disambiguation, where the sense distinction is at the level of a good desk-top dictionary such as WORD-NET." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-4", "text": "Using the sense-tagged corpus of 192,800 word occurrences reported in (Ng and Lee, 1996) , I examine the effect of the number of training examples on the accuracy of an exemplar-based classifier versus the base-line, most-frequent-sense classitier." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-5", "text": "I also estimate the amount of human sense-tagged corpus and the manual annotation effort needed to build a largescale, broad coverage word sense disambiguation program which can significantly outperform the most-frequent-sense classifier." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-6", "text": "Finally, I suggest that intelligent example selection techniques may significantly reduce the amount of sense-tagged corpus needed and offer this research problem as a fruitful area for word sense disambiguation research." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-7", "text": "----------------------------------" }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-9", "text": "Much recent research in the field of natural language processing (NLP) has focused on an empirical, corpus-based approach (Church and Mercer, 1993) ." 
}, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-10", "text": "The high accuracy achieved by a corpus-based approach to part-of-speech tagging and noun phrase parsing, as demonstrated by (Church, 1988) , has inspired similar approaches to other problems in natural language processing, including syntactic parsing and word sense disambiguation (WSD)." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-11", "text": "The availability of large quantities of part-ofspeech tagged and syntactically parsed sentences like the Penn Treebank corpus (Marcus, Santorini, and Marcinkiewicz, 1993) has contributed greatly to the development of robust, broad coverage partof-speech taggers and syntactic parsers." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-12", "text": "The Penn Treebank corpus contains a sufficient number of partof-speech tagged and syntactically parsed sentences to serve as adequate training material for building broad coverage part-of-speech taggers and parsers." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-13", "text": "Unfortunately, an analogous sense-tagged corpus large enough to achieve broad coverage, high accuracy word sense disambiguation is not available at present." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-14", "text": "In this paper, I argue that, given the current state-of-the-art capability of automated machine learning algorithms, a supervised learning approach using a large sense-tagged corpus is a viable way to build a robust, wide coverage, and high accuracy WSD program." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-15", "text": "In this view, a large sense-tagged corpus is critical as well as necessary to achieve broad coverage, high accuracy WSD." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-16", "text": "The rest of this paper is organized as follows." 
}, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-17", "text": "In Section 2, I briefly discuss the utility of WSD in practical NLP tasks like information retrieval and machine translation." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-18", "text": "I also address some objections to WSD research." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-19", "text": "In Section 3, I examine the size of the training corpus on the accuracy of WSD, using a corpus of 192,800 occurrences of 191 words hand tagged with WORDNET senses (Ng and Lee, 1996) ." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-20", "text": "In Section 4, I estimate the amount of human sense-tagged corpus and the manual annotation effort needed to build a broad coverage, high accuracy WSD program." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-21", "text": "Finally, in Section 5, I suggest that intelligent example selection techniques may significantly reduce the amount of sense-tagged corpus needed and offer this research problem as a fruitful area for WSD research." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-22", "text": "----------------------------------" }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-23", "text": "**THE UTILITY OF WORD SENSE DISAMBIGUATION**" }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-24", "text": "Although there is agreement in general about the utility of WSD within the NLP community, I will briefly address some objections to WSD in this section." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-25", "text": "To justify the investment of manpower and time to gather a large sense-tagged corpus, it is important to examine the benefits brought about by WSD." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-26", "text": "Information retrieval (IR) is a practical NLP task where WSD has brought about improvement in accuracy." 
}, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-27", "text": "When tested on some standard IR test collection, the use of WSD improves precision by about 4.3% (from 29.9% to 34.2%) (Schiitze and Pedersen, 1995) ." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-28", "text": "The work of (Dagan and Itai, 1994) has also successfully used WSD to improve the accuracy of machine translation." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-29", "text": "These examples clearly demonstrate the utility of WSD in practical NLP applications." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-30", "text": "In this paper, by word sense disambiguation, I mean identifying the correct sense of a word in context such that the sense distinction is at the level of a good desk-top dictionary like WORDNET (Miller, 1990) ." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-31", "text": "I only focus on content word disambiguation (i.e., words in the part of speech noun t, verb, adjective and adverb)." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-32", "text": "This is also the task addressed by other WSD research such as (Bruce and Wiebe, 1994; Miller et al., 1994) ." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-33", "text": "When the task is to resolve word senses to the fine-grain distinction of WORD-NET senses, the accuracy figures achieved are generally not very high (Miller et al., 1994; Ng and Lee, 1996) ." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-34", "text": "This indicates that WSD is a challenging task and much improvement is still needed." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-35", "text": "However, if one were to resolve word sense to the level of homograph, or coarse sense distinction, then quite high accuracy can be achieved (in excess of 90%), as reported in (Wilks and Stevenson, 1996) ." 
}, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-36", "text": "Similarly, if the task is to distinguish between binary, coarse sense distinction, then current WSD techniques can achieve very high accuracy (in excess of 96% when tested on a dozen words in (Yarowsky, 1995) )." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-37", "text": "This is to be expected, since homograph contexts are quite distinct and hence it is a much simpler task to disambiguate among a small number of coarse sense classes." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-38", "text": "This is in contrast to disambiguating word senses to the refined senses of WoRDNET, where for instance, the average number of senses per noun is 7.8 and the average number of senses per verb is 12.0 for the set of 191 most ambiguous words investigated in (Ng and Lee, 1996) ." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-39", "text": "We can readily collapse the refined senses of WORDNET into a smaller set if only a coarse (hot I will only focus on common noun in this paper and ignore proper noun." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-40", "text": "2 mographic) sense distinction is needed, say for some NLP applications." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-41", "text": "Indeed, the WORDNET software has an option for grouping noun senses into a smaller number of sense classes." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-42", "text": "WSD techniques that work well for refined sense distinction will apply equally to homograph dlsambiguation." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-43", "text": "That is, if we succeed in working on the harder WSD task of resolution into refined senses, the same techniques will also work on the simpler task of homograph disambiguation." 
}, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-44", "text": "A related objection to WSD research is that the sense distinction made by a good desk-top dictionary like WOI~DNET is simply too refined, to the point that two humans cannot genuinely agree on the most." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-45", "text": "appropriate sense to assign to some word occurrence (Kilgarriff, 1996) ." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-46", "text": "This objection has some merits." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-47", "text": "However, the remedy is not to throw out word senses completely, but rather to work on a level of sense distinction that is somewhere in between homograph distinction and the refined WoRVNET sense distinction." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-48", "text": "The existing lumping of noun senses in WORD-NET into coarser sense groups is perhaps a good compromise." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-49", "text": "However, in the absence of well accepted guidelines for making an appropriate level of sense distinction, using the sense classification given in WOI~I)NET, an on-line, publicly available dictionary, seems a natural choice." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-50", "text": "Hence, I believe that using the current WoRDNET sense distinction to build a sense-tagged corpus is a reasonable approach to go forward." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-51", "text": "In any case, if some aggregation of senses into coarser grouping is done in future, this can be readily incorporated into my proposed sense-tagged corpus which uses the refined sense distinction of WOItDNET." 
}, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-52", "text": "In the rest of this paper, I will assume that broad coverage, high accuracy WSD is indeed useful in practical NLP tasks, and that resolving senses to the refined level of WORDNET is a worthwhile task to pursue." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-53", "text": "----------------------------------" }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-54", "text": "**THE EFFECT OF TRAINING CORPUS SIZE**" }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-55", "text": "A number of past research work on WSD, such as (Leacock et al., 1993; Bruce and Wiebe, 1994; Mooney, 1996) , were tested on a small number of words like \"line\" and \"interest\"." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-56", "text": "Similarly, (Yarowsky, 1995) tested his WSD algorithm on a dozen words." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-57", "text": "The sense-tagged corpus SEMCOI~, prepared by (Miller et al., 1994) , contains a substantial subset of the Brown corpus tagged with the refined senses of WORDNET." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-58", "text": "However, as reported in (Miller et al., 1994) , there are not enough training examples per word in SP.MCOR to yield a broad coverage, high accuracy WSD program, due to the fact that sense tagging is done on every word in a running text in SEMCOR." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-59", "text": "To overcome this data sparseness problem of WSD, I initiated a mini-project in sense tagging and collected a corpus in which 192,800 occurrences of 191 words have been manually tagged with senses of WORDNET (Ng and Lee, 1996) ." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-60", "text": "These 192,800 word occurrences consist of only 121 nouns and 70 verbs which are the most frequently occurring and most ambiguous words of English." 
}, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-61", "text": "2" }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-62", "text": "To investigate the effect of the number of training examples on WSD accuracy, I ran the exemplarbased WSD algorithm L~.XAS on varying number of training examples to obtain learning curves for the 191 words (details of LEXAS are described in (Ng and Lee, 1996) )." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-63", "text": "For each word, 10 random trials were conducted and the accuracy figures were averaged over the I0 trials." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-64", "text": "In each trial, I00 examples were randomly selected to form the test set, while the remaining examples (randomly shuffled) were used for training." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-65", "text": "LEXAS was given training examples in multiple s of i00, starting with I00,200,300, ... training examples, up to the maximum number of training examples (in a multiple of 100) available in the corpus." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-66", "text": "Note that each word w (of the 191 words) can have a different number of sense-tagged occurrences in our corpus." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-67", "text": "From the combination of Brown corpus (1 million words) and Wall Street Journal corpus (2.5 million words), up to 1,500 sentences each containing an occurrence of the word w are extracted from the combined corpus, with each sentence containing a sense-tagged occurrence of w. When the combined corpus has less than 1,500 occurrences of w, the max= imum number of available occurrences of w is used." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-68", "text": "For instance, while 137 words have at least 600 occurrences in the combined corpus, only a subset of 43 words has at least 1400 occurrences." 
}, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-69", "text": "Figure 1 and 2 show the learning curves averaged over these 43 words and 137 words with at least 1300 and 500 training examples, respectively." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-70", "text": "Each figure shows the accuracy of LEXAS versus the base-line, mostfrequent-sense classifier." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-71", "text": "Both figures indicate that WSD accuracy continues to climb as the number of training examples increases." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-72", "text": "They confirm that all the training examples collected in our corpus are effectively utilized by LZXAS to improve its WSD performance." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-73", "text": "In fact, it appears that for this set of most ambiguous words of English, more training data may be beneficial to further improve WSD performance." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-74", "text": "I also report here the evaluation of LP.XAS on two 2This corpus is scheduled for release by the lAnguistic Data Consortium (LDC)." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-75", "text": "Contact the LDC at ldc~unagi.cis.upenn.edu for details." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-76", "text": "The performance figures of LEXAS in Table 1 are higher than those reported in (Ng and Lee, 1996) ." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-77", "text": "The classification accuracy of the nearest neighbor algorithm used by LEXAS (Cost and Salzberg, 1993) is quite sensitive to the number of nearest neighbors used to select the best matching example." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-78", "text": "By using 10-fold cross validation (Kohavi and John, 1995) to automatically pick the best number of nearest neighbors to use, the performance of LSXAS has improved." 
}, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-79", "text": "----------------------------------" }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-80", "text": "**WORD SENSE DISAMBIGUATION IN THE**" }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-81", "text": "----------------------------------" }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-82", "text": "**LARGE**" }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-83", "text": "In (Gale et al., 1992) , it was argued that any wide coverage WSD program must be able to perform significantly better than the most-frequent-sense classifier to be worthy of serious consideration." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-84", "text": "The performance of LEXAS as indicated in Table 1 is significantly better than the most-frequent-sense classifier for the set of 191 words collected in our corpus." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-85", "text": "Figure 1 and 2 also confirm that all the training examples collected in our corpus are effectively utilized by LEXAS to improve its WSD performance." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-86", "text": "This is encouraging as it demonstrates the feasibility of building a wide coverage WSD program using a supervised learning approach." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-87", "text": "Unfortunately, our corpus only contains tagged senses for 191 words, and this set of words does not constitute a sufficiently large fraction of all occurrences of content words in an arbitrarily chosen unrestricted text." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-88", "text": "As such, our sense-tagged corpus is still not large enough to enable the building of a wide coverage, high accuracy WSD program that can significantly outperform the most-frequent-sense classifier over all content words encountered in an arbitrarily chosen unrestricted text." 
}, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-89", "text": "This brings us to the question: how much data do we need to achieve wide coverage, high accuracy WSD? 90% 95% 99% noun 975 1776 2638 4510 verb 242 550 926 1806 adj 374 769 1286 2384 adv 36 76 128 269 sum 1627 3171 4978 8969 To shed light on this question, it is instructive to examine the distribution of words and their occurrence frequency in a large corpus." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-90", "text": "Table 2 lists the number of polysemous words in each part of speech making up the top 80%, ..., top 99% of word occurrences in the Brown corpus, where the polysemous words are ordered in terms of their occurrence frequency from the most frequently occurring word to the least frequently occurring word." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-91", "text": "For example, Table 2 indicates that when the polysemous nouns are ordered from the most frequently occurring noun to the least frequently occurring noun, the top 975 polysemous nouns constitute 80% of all noun occurrences in the Brown corpus." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-92", "text": "This 80% of all noun occurrences include all nouns in the Brown corpus that are monosemous (about 15.4%) and all rare nouns in the Brown corpus that do not appear in WORD-NP.T and hence have no valid sense definition (about 3.3%) (i.e., the remaining 20% noun occurrences are all polysemous)." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-93", "text": "Table 3 lists the analogous statistics for the Wall Street Journal corpus." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-94", "text": "It is also the case that the last 5%-10% of polysemous words in a corpus have only a small number of distinct senses on average." 
}, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-95", "text": "Table 4 lists the average number of senses per polysemous word in the Brown corpus for the top 80%, ..., top 99%, and the bottom 20%, ..., bottom 1% of word occurrences, where the words are again ordered from the most frequently occurring word to the least frequently occurring word." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-96", "text": "For example, the average number of senses per polysemous noun is 5.14 for the nouns which account for the top 80% noun occurrences in 5 the Brown corpus." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-97", "text": "Similarly, the average number of senses per polysemous noun is 2.86 for the polysemous nouns which account for the bottom 20% of noun occurrences in the Brown corpus." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-98", "text": "Table 5 lists the analogous statistics for the Wall Street Journal corpus." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-99", "text": "Table 2 and 3 indicate that a sense-tagged corpus collected for 3,200 words will cover at least 90% of all (content) word occurrences in the Brown corpus, and at least 95% of all (content) word occurrences in the Wall Street Journal corpus." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-100", "text": "From Table 4 , the average number of senses per polysemous word in the Brown corpus for the remaining 10% word occurrences is only 3.15 or less." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-101", "text": "Similarly, from Table 5 , the average number of senses per polysemous word in the Wall Street Journal corpus for the remaining 5% word occurrences is only 3.10 or less." 
}, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-102", "text": "For these remaining polysemous words which account for the last 5%-10% word occurrences with an average of about 3 senses per word, we can always assign the most frequent sense as a first approximation in building our wide coverage WSD program." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-103", "text": "Based on these figures, I estimate that a sensetagged corpus of 3,200 words is sufficient to build a broad coverage, high accuracy WSD program capable of significantly outperforming the mostfrequent-sense classifier on average over all content words appearing in an arbitrary, unrestricted English text." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-104", "text": "Assuming an average of 1,000 sense-tagged occurrences per word, this will mean a corpus of 3.2 million sense-tagged word occurrences." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-105", "text": "Assuming human sense tagging throughput at 200 words, or 200,000 word occurrences, per man-year (which is the approximate human tagging throughput of my completed sense-tagging mini-project), such a corpus will require about 16 man-years to construct." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-106", "text": "Given the benefits of a wide coverage, high accuracy and domain-independent WSD program, I believe it is justifiable to spend the 16 man-years of human annotation effort needed to construct such a sense-tagged corpus." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-107", "text": "----------------------------------" }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-108", "text": "**CAN WE DO BETTER?**" }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-109", "text": "My estimate of the amount of human annotation effort needed can be considered as an upper bound on the manual effort needed to construct the necessary sense-tagged corpus to achieve wide coverage WSD." 
}, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-110", "text": "It may turn out that we can achieve our goal with much less annotation effort." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-111", "text": "Recent work on intelligent example selection techniques suggest that the quality of the examples used for supervised learning can have a large impact on the classification accuracy of the induced classitier." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-112", "text": "For example, in (Engelson and Dagan, 1996) committee-based sample selection is applied to partof-speech tagging to select for annotation only those examples that are the most informative, and this avoids redundantly annotating examples." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-113", "text": "Similarly, in (Lewis and Catlett, 1994) , uncertainty sampling of training examples achieved better accuracy than random sampling of training examples for a text categorization application." }, { "sent_id": "eeada4aedbb43b575365a15d75f2ac-C001-114", "text": "Intelligent example selection for supervised learning is an important issue of machine learning in its own right. I believe it is of particular importance to investigate this issue in the context of word sense disambiguation, as the payoff is high, given that a large sense tagged corpus is currently not available and remains one of the most critical bottlenecks in achieving wide coverage, high accuracy WSD." 
} ], "y": { "@USE@": { "gold_contexts": [ [ "eeada4aedbb43b575365a15d75f2ac-C001-4" ], [ "eeada4aedbb43b575365a15d75f2ac-C001-19" ], [ "eeada4aedbb43b575365a15d75f2ac-C001-59" ], [ "eeada4aedbb43b575365a15d75f2ac-C001-62" ] ], "cite_sentences": [ "eeada4aedbb43b575365a15d75f2ac-C001-4", "eeada4aedbb43b575365a15d75f2ac-C001-19", "eeada4aedbb43b575365a15d75f2ac-C001-59", "eeada4aedbb43b575365a15d75f2ac-C001-62" ] }, "@BACK@": { "gold_contexts": [ [ "eeada4aedbb43b575365a15d75f2ac-C001-33" ], [ "eeada4aedbb43b575365a15d75f2ac-C001-38" ] ], "cite_sentences": [ "eeada4aedbb43b575365a15d75f2ac-C001-33", "eeada4aedbb43b575365a15d75f2ac-C001-38" ] }, "@MOT@": { "gold_contexts": [ [ "eeada4aedbb43b575365a15d75f2ac-C001-33", "eeada4aedbb43b575365a15d75f2ac-C001-34" ] ], "cite_sentences": [ "eeada4aedbb43b575365a15d75f2ac-C001-33" ] }, "@EXT@": { "gold_contexts": [ [ "eeada4aedbb43b575365a15d75f2ac-C001-38", "eeada4aedbb43b575365a15d75f2ac-C001-39" ] ], "cite_sentences": [ "eeada4aedbb43b575365a15d75f2ac-C001-38" ] }, "@DIF@": { "gold_contexts": [ [ "eeada4aedbb43b575365a15d75f2ac-C001-76" ] ], "cite_sentences": [ "eeada4aedbb43b575365a15d75f2ac-C001-76" ] } } }, "ABC_5cf0215cd20c86f329c8debc0daeb8_34": { "x": [ { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-2", "text": "Opinion role labeling (ORL) is an important task for fine-grained opinion mining, which identifies important opinion arguments such as holder and target for a given opinion trigger." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-3", "text": "The task is highly correlative with semantic role labeling (SRL), which identifies important semantic arguments such as agent and patient for a given predicate." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-4", "text": "As predicate agents and patients usually correspond to opinion holders and targets respectively, SRL could be valuable for ORL." 
}, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-5", "text": "In this work, we propose a simple and novel method to enhance ORL by utilizing SRL, presenting semantic-aware word representations which are learned from SRL." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-6", "text": "The representations are then fed into a baseline neural ORL model as basic inputs." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-7", "text": "We verify the proposed method on a benchmark MPQA corpus." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-8", "text": "Experimental results show that the proposed method is highly effective." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-9", "text": "In addition, we compare the method with two representative methods of SRL integration as well, finding that our method can outperform the two methods significantly, achieving 1.47% higher F-scores than the better one." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-10", "text": "----------------------------------" }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-11", "text": "**INTRODUCTION**" }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-12", "text": "Fine-grained opinion mining aims to detect structured user opinions in text, which has drawn much attention in the natural language processing (NLP) community (Kim and Hovy, 2006; Breck et al., 2007; Ruppenhofer et al., 2008; Wilson et al., 2009; Qiu et al., 2011; Cardie, 2013, 2014; Liu et al., 2015; Wiegand et al., 2016) ." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-13", "text": "A structured opinion includes the key arguments of one opinion, such as expressions, holders and targets (Breck et al., 2007; Cardie, 2012, 2013; Katiyar and Cardie, 2016 )." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-14", "text": "Here we focus on opinion role labeling (ORL) (Marasovi\u0107 and Frank, 2018) , which identifies opinion holders and * Corresponding author." 
}, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-15", "text": "We want to resolve all issues peacefully targets assuming that the opinion expressions are given." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-16", "text": "Figure 1 shows an example of the task." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-17", "text": "The focused task behaves very similar with semantic role labeling (SRL) which identifies the core semantic roles for given predicates." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-18", "text": "Earlier work attempts to exploit a well-trained SRL model to recognize possible semantic roles for a given opinion expression, and then map the semantic roles into opinion roles (Kim and Hovy, 2006; Ruppenhofer et al., 2008) ." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-19", "text": "The heuristic approach is unable to obtain high performance for ORL because there are large mismatches between SRL and ORL." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-20", "text": "For example, opinion expressions are different from verb/noun predicates in SRL, and meanwhile, opinion holders and targets may not always correspond to semantic agents (ARG0) and patients (ARG1), respectively." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-21", "text": "We can exploit machine learning based method to solve the mismatching problem between ORL and SRL." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-22", "text": "With a small number of annotated ORL corpus, we can feed the SRL outputs as inputs to build a statistical model for ORL." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-23", "text": "By this way, the model can learn the consistencies and inconsistencies between SRL and ORL, arriving at a full exploration of SRL." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-24", "text": "The method is essentially a feature-based method, treating SRL outputs as a source of features for ORL." 
}, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-25", "text": "The main drawback of the method is that direct exploration of SRL outputs may lead to the error propagation problem." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-26", "text": "SRL errors can be further propagated into ORL outputs, resulting in degraded ORL performance." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-27", "text": "In this work, we propose a simple and novel method by using implicit semantic-aware word representations from SRL to enhance ORL." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-28", "text": "The method is referred to as SRL-SAWR for brief." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-29", "text": "Thanks to the recent advances of encoder-decoder neural SRL models (Zhou and Xu, 2015; He et al., 2017) , we can extract implicit vectorized features from the intermediate encoder module instead, avoiding the direct exploration of the final onebest SRL outputs." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-30", "text": "The vectorized features from the encoder part are implicit semantic-aware representations for input sentences." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-31", "text": "By taking the semantic-aware representations from SRL as ORL inputs, we are able to make use of SRL information and meanwhile alleviate the error propagation problem." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-32", "text": "Here we exploit a neural conditional random field (CRF) model with deep bi-directional long short-term memory networks (Bi-LSTMs) as a baseline, most of which is borrowed from Katiyar and Cardie (2016) and Marasovi\u0107 and Frank (2018) ." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-33", "text": "Our preliminary experiments show that the model is able to achieve state-of-the-art performances for both ORL and SRL." 
}, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-34", "text": "Based on this model, we study the proposed implicit semanticaware word representations for ORL." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-35", "text": "In addition, we compare this method with two other representative methods of SRL integration as well: one uses discrete SRL outputs as features directly for ORL and the other one exploits a multi-tasklearning (MTL) framework to benefit ORL by SRL information." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-36", "text": "Experiments are conducted on the MPQA 2.0 dataset, which is a standard benchmark for opinion mining." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-37", "text": "Results show that SRL is highly effective for ORL, which is consistent with previous findings (Kim and Hovy, 2006; Ruppenhofer et al., 2008; Marasovi\u0107 and Frank, 2018) ." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-38", "text": "Meanwhile, our implicit SRL-SAWR method can achieve the best ORL performance, 2.23% higher F-scores than the second best method." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-39", "text": "All the codes and datasets are released publicly available for research purpose under Apache Licence 2.0 at https://github.com/zhangmeishan/SRL4ORL." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-40", "text": "----------------------------------" }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-41", "text": "**METHOD**" }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-42", "text": "----------------------------------" }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-43", "text": "**BASELINE**" }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-44", "text": "ORL aims to identify important opinion arguments for a given opinion expression." 
}, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-45", "text": "The task can be modeled as a sequence labeling problem, similar to SRL (Zhou and Xu, 2015; He et al., 2017) ." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-46", "text": "We adopt the {BMESO} schema to con- vert opinion arguments into a sequence of boundary tags for each word, where B, M and E denote the beginning, middle and ending words of an argument, S denotes a single-word argument, and O denotes the other words." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-47", "text": "Formally, given a sentence w 1 \u00b7 \u00b7 \u00b7 w n and a span of opinion expression w s \u00b7 \u00b7 \u00b7 w e (1 \u2264 s \u2264 e \u2264 n), we aim to assign each word in the sentence by a tag, outputting t 1 \u00b7 \u00b7 \u00b7 t n ." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-48", "text": "Inspired by Katiyar and Cardie (2016) and Marasovi\u0107 and Frank (2018) , we exploit a deep Bi-LSTM CRF model as the baseline." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-49", "text": "Figure 2 shows the overall architecture of the baseline model." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-50", "text": "This model can achieve state-of-the-art performances for both ORL and SRL, which facilitates our study." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-51", "text": "The key components of the baseline model include three parts: word representation, the deep Bi-LSTM encoder and the CRF decoder." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-52", "text": "The word representation takes sequential words and opinion expressions as input, mapping them into dense-valued feature vectors x 1 \u00b7 \u00b7 \u00b7 x n ." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-53", "text": "Following we extract high-level neural features based on the vectors by deep Bi-LSTM, arriving at h 1 \u00b7 \u00b7 \u00b7 h n . 
And finally a CRF decoder is applied to output the ORL results t_1 ... t_n ." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-54", "text": "----------------------------------" }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-55", "text": "**SRL INTEGRATION**" }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-56", "text": "SRL aims to find the core semantic arguments for a given predicate, which is highly correlative with the ORL task." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-57", "text": "The semantic roles agent (ARG0) and patient (ARG1) often correspond to the opinion holder and target, respectively." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-58", "text": "Several works even directly transfer semantic roles into opinion roles for ORL (Kim and Hovy, 2006; Ruppenhofer et al., 2008) , treating opinion expressions as the major predicates." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-59", "text": "These systems can achieve good performances, indicating that SRL information can be greatly useful for ORL." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-60", "text": "Here we propose a novel method to encode the SRL information implicitly, enhancing ORL model with semantic-aware word representations from a neural SRL model (SRL-SAWR)." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-61", "text": "Figure 3 shows the overall architectures of our SRL integration method." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-62", "text": "Instead of using the discrete outputs from the SRL model, the SRL-SAWR method exploits the intermediate encoder outputs as inputs for ORL, which can alleviate the problems in the above two methods." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-63", "text": "On the one hand, we do not rely on the discrete outputs of a well-trained SRL, reducing the error propagation problem. 
On the other hand, we handle ORL and SRL separately, avoiding the model structure dependencies between the two tasks." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-64", "text": "We assume that the external SRL system is a neural-based encoder-decoder model." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-65", "text": "For fair comparisons with FS-MTL, here we use the same deep Bi-LSTM CRF model for SRL as well." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-66", "text": "Thus the encoder outputs are the hidden vectors from deep Bi-LSTMs." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-113", "text": "The FS-MTL approach is exactly the proposed model by Marasovi\u0107 and Frank (2018) ." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-67", "text": "Assuming that the dumped hidden vector sequence from the SRL encoder is h SRL 1 \u00b7 \u00b7 \u00b7 h SRL n , we integrate it into the ORL model by the following equation:" }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-68", "text": "where W SRL is a projection matrix which is a model parameter, x i is the baseline word representation of word w i , and x * i is the new word representation, which will be further fed into the deep Bi-LSTM layer of the ORL model." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-69", "text": "Notably, the model parameters of the SRL encoder are also fine-tuned according to the ORL objective, as the preliminary results indicate that fine-tuning can bring better performance. 
}, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-70", "text": "----------------------------------" }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-71", "text": "**EXPERIMENTS**" }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-72", "text": "----------------------------------" }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-73", "text": "**ORL DATA**" }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-74", "text": "We exploit the MPQA version 2.0 corpus (Wiebe et al., 2005; Wilson, 2008) to evaluate our models, 1 which has been widely adopted as a benchmark dataset for opinion mining (Yang and Cardie, 2013; Katiyar and Cardie, 2016; Marasovi\u0107 and Frank, 2018) ." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-75", "text": "There are 482 documents in the dataset." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-76", "text": "Following these work, we set aside 132 documents as the development set, and the remaining 350 documents are used as the test set in our experiments." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-77", "text": "We conduct experiments using fivefold cross-validation (CV) on the test set at the document level." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-78", "text": "Following Marasovi\u0107 and Frank (2018) , we focus on opinion holders and targets only." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-79", "text": "The gold standard opinion expressions, holders and targets correspond to the direct subjective annotations, agent annotations and target annotations, respectively." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-80", "text": "----------------------------------" }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-81", "text": "**EVALUATION METRICS**" }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-82", "text": "We use recall (R), precision (P) and their F1-measure value to measure our proposed models." 
}, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-83", "text": "The average values of the five-fold CV results are reported in this work." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-84", "text": "We exploit exact matching as the major metric." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-85", "text": "Following Marasovi\u0107 and Frank (2018) , two kinds of soft evaluation methods are also adopted for evaluation, namely binary and proportional overlapping, Binary overlap treats an entity as correct if it contains an overlapped region with the gold-standard entity, and the proportional overlap assigns a partial score proportional to the ratio of the overlapped region." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-86", "text": "----------------------------------" }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-87", "text": "**SETTING**" }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-88", "text": "There are several hyper-parameters to define our neural network structures." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-89", "text": "We simply set their values according to previous work (He et al., 2017; Marasovi\u0107 and Frank, 2018) , without much tuning work." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-90", "text": "Concretely, we set the dimension size of all embeddings to 100, the output hidden size of LSTMs to 200 and the layer number of Bi-LSTM to 3." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-91", "text": "For external word embeddings, we use the pretrained 100-dimensional glove embeddings (Pennington et al., 2014) ." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-92", "text": "We exploit online training to learn model parameters, and train on the entire training instances for 40 epochs, choosing the best-epoch model according to the performance on the development corpus." 
}, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-93", "text": "We use Adam (Kingma and Ba, 2014) with a learning rate 10 \u22123 to update model parameters, and use gradient clipping by a max norm 1.0 and l 2 -regularization by a parameter 10 \u22128 ." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-94", "text": "We apply dropout with a ratio of 0.2 over word represen-tations, output layers of Bi-LSTMs to avoid overfitting (Srivastava et al., 2014) ." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-95", "text": "----------------------------------" }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-96", "text": "**SRL**" }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-97", "text": "For SRL, we use the large-scale dataset of CoNLL-2012 shared task, which is extracted from OntoNotes v5.0 corpus." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-98", "text": "The description and separation of train, development and test data set can be found in Pradhan et al. (2013) ." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-99", "text": "The training corpus contains over 250K predicates, which is much larger than the number of opinion expressions in the ORL training corpus (averaged 3.6K)." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-100", "text": "We exploit the same neural network model as the ORL for SRL, in order to make fair comparisons between our proposed model with FS-MTL." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-101", "text": "According to the preliminary experiments, the SRL model can reach an F-measure of 81.8%, which is comparable to the reported result (81.7%) in He et al. (2017) ." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-102", "text": "Table 1 shows the final results on the test dataset." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-103", "text": "We report the overall as well as the fine-grained performance in term of opinion arguments (i.e., holder and target)." 
}, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-104", "text": "Compared with the baseline system, our final SRL-SAWR model can bring significantly better results (p < 10 \u22125 under pairwise t-test)." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-105", "text": "For fine-grained evaluations, the final model outperforms the baseline model consistently on opinion holders and targets." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-106", "text": "The tendencies are similar by exploiting the binary and proportional matching methods." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-107", "text": "The results show that SRL information is very helpful for ORL, which is consistent with previous studies (Kim and Hovy, 2006; Ruppenhofer et al., 2008; Marasovi\u0107 and Frank, 2018) ." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-108", "text": "The implicit SRL-SAWR method is highly effective to integrate SRL information into the ORL model." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-109", "text": "----------------------------------" }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-110", "text": "**RESULTS**" }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-111", "text": "Further, we compare the SRL-SAWR method with two other methods as well, namely SRL-TE and FS-MTL, respectively." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-112", "text": "The SRL-TE approach simply exploits the output SRL tags as inputs for ORL, embedding them as an additional source of word representations." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-114", "text": "As shown in Table 1 , all three methods can bring improved performance by integrating SRL, further demonstrating that SRL is indeed valuable for ORL." 
}, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-115", "text": "In addition, the SRL-SAWR method can achieve the best performance among the three methods, obtaining further significant improvements by at least 63.74 \u2212 61.51 = 2.23 points on overall F1-measure with exact matching (p < 10 \u22124 )." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-116", "text": "For fine-grained evaluations, the SRL-SAWR method can also give the best performance." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-117", "text": "The results demonstrate that SRL-SAWR is most effective to integrate the SRL information into a neural ORL model." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-118", "text": "The two methods, SRL-TE and FS-MTL, are comparable by evaluations based on the exact matching." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-119", "text": "----------------------------------" }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-120", "text": "**ANALYSIS**" }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-121", "text": "In this section, we conduct several experimental analysis on the test dataset to deeply understand the effectiveness of SRL information." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-122", "text": "First, we examine the relationship between SRL and ORL." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-123", "text": "SRL identifies the semantic arguments for predicates, and ORL recognizes the opinion arguments for opinion expressions." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-124", "text": "Intuitively, in most cases, the opinion holders are corresponding to semantic agents/ARG0 of opinion triggers/expressions, and similarly, the opinion targets are usually corresponding to patients/ARG1." 
}, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-125", "text": "Figure 4 shows the percentages of opinion hold- ers/targets being corresponding to semantic roles, which are calculated according to the word-level mapping over the 1-best SRL outputs and the goldstandard ORL tags." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-126", "text": "We list only the five semantic roles with highest mapping percentages." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-127", "text": "As shown, the results are consistent with our intuition." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-128", "text": "Thus SRL and ORL are highly correlative." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-129", "text": "Considering the much larger scale of annotated SRL corpora, SRL can benefit ORL potentially." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-130", "text": "According to the above findings, we design a simple system by mapping SRL outputs into ORL directly (Kim and Hovy, 2006; Ruppenhofer et al., 2008) ." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-131", "text": "We simply convert the semantic role ARG0 into holder, and ARG1 into target." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-132", "text": "Table 2 shows the performance." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-133", "text": "The results of the baseline system are shown for comparison." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-134", "text": "We can see that the simple mapping method is also one feasible alternative as a whole." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-135", "text": "Further, we compare the SRL utilization capabilities of our proposed method and the other SRLenhanced ORL systems, including the above SRL Mapping method." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-136", "text": "We categorize the opinion arguments by whether they can be directly mapped from the SRL outputs." 
}, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-137", "text": "The opinion arguments which can be directly mapped from SRL, referred to as consistent arguments, should be more easily identified by SRL enhanced models than the remaining inconsistent arguments." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-138", "text": "enhanced supervised models can achieve better performances for consistent arguments." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-139", "text": "For the inconsistent arguments, the tendency is similar, except the holder performance of SRL-TE." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-140", "text": "In addition, our method can gain much larger improvements, which indicates that our method can better handle the inconsistencies between SRL and ORL." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-141", "text": "Finally, we show one case study to illustrate the advantage of our SRL-SAWR method." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-142", "text": "Figure 5 shows one example." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-143", "text": "As shown, the SRL argument ARG0, which is more probably mapped onto holder, is annotated by target in the example." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-144", "text": "The SRL argument ARG1 is labeled as opinion holder, which is also one inconsistent case." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-145", "text": "Compared with SRL-TE and FS-MTL, our model can better handle these inconsistent cases." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-146", "text": "The observation further confirms our results in Table 3 ." 
}, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-147", "text": "----------------------------------" }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-148", "text": "**CONCLUSION**" }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-149", "text": "We proposed a simple and novel method (SRL-SAWR) to enhance ORL with SRL information by exploiting implicit semantic-aware word representations from SRL." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-150", "text": "The main idea is to export intermediate SRL encoder outputs as inputs to better word representations of an ORL model." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-151", "text": "This method does not impose any extra requirement for ORL, and meanwhile avoids the error prorogation problem from discrete SRL outputs." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-152", "text": "We conducted experiments to verify our method on a benchmark MPQA dataset." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-153", "text": "The results showed that our method can exploit SRL information effectively." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-154", "text": "We compared the proposed method with SRL-TE and FS-MTL, which are two representative approaches to enhance ORL by SRL." }, { "sent_id": "5cf0215cd20c86f329c8debc0daeb8-C001-155", "text": "The results demonstrated our method can bring the best performance among the three approaches." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "5cf0215cd20c86f329c8debc0daeb8-C001-12" ], [ "5cf0215cd20c86f329c8debc0daeb8-C001-18" ], [ "5cf0215cd20c86f329c8debc0daeb8-C001-58" ] ], "cite_sentences": [ "5cf0215cd20c86f329c8debc0daeb8-C001-12", "5cf0215cd20c86f329c8debc0daeb8-C001-18", "5cf0215cd20c86f329c8debc0daeb8-C001-58" ] }, "@MOT@": { "gold_contexts": [ [ "5cf0215cd20c86f329c8debc0daeb8-C001-18", "5cf0215cd20c86f329c8debc0daeb8-C001-19", "5cf0215cd20c86f329c8debc0daeb8-C001-21" ] ], "cite_sentences": [ "5cf0215cd20c86f329c8debc0daeb8-C001-18" ] }, "@SIM@": { "gold_contexts": [ [ "5cf0215cd20c86f329c8debc0daeb8-C001-37" ], [ "5cf0215cd20c86f329c8debc0daeb8-C001-107" ] ], "cite_sentences": [ "5cf0215cd20c86f329c8debc0daeb8-C001-37", "5cf0215cd20c86f329c8debc0daeb8-C001-107" ] }, "@USE@": { "gold_contexts": [ [ "5cf0215cd20c86f329c8debc0daeb8-C001-37" ], [ "5cf0215cd20c86f329c8debc0daeb8-C001-130" ] ], "cite_sentences": [ "5cf0215cd20c86f329c8debc0daeb8-C001-37", "5cf0215cd20c86f329c8debc0daeb8-C001-130" ] }, "@DIF@": { "gold_contexts": [ [ "5cf0215cd20c86f329c8debc0daeb8-C001-37", "5cf0215cd20c86f329c8debc0daeb8-C001-38" ] ], "cite_sentences": [ "5cf0215cd20c86f329c8debc0daeb8-C001-37" ] }, "@EXT@": { "gold_contexts": [ [ "5cf0215cd20c86f329c8debc0daeb8-C001-58", "5cf0215cd20c86f329c8debc0daeb8-C001-59", "5cf0215cd20c86f329c8debc0daeb8-C001-60" ] ], "cite_sentences": [ "5cf0215cd20c86f329c8debc0daeb8-C001-58" ] } } }, "ABC_8e59c2c48e27b2abd5f63d6b4ce23d_34": { "x": [ { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-2", "text": "Neural Machine Translation (NMT) models generally perform translation using a fixedsize lexical vocabulary, which is an important bottleneck on their generalization capability and overall translation quality." 
}, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-3", "text": "The standard approach to overcome this limitation is to segment words into subword units, typically using some external tools with arbitrary heuristics, resulting in vocabulary units not optimized for the translation task." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-4", "text": "Recent studies have shown that the same approach can be extended to perform NMT directly at the level of characters, which can deliver translation accuracy on-par with subword-based models, on the other hand, this requires relatively deeper networks." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-5", "text": "In this paper, we propose a more computationally-efficient solution for character-level NMT which implements a hierarchical decoding architecture where translations are subsequently generated at the level of words and characters." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-6", "text": "We evaluate different methods for open-vocabulary NMT in the machine translation task from English into five languages with distinct morphological typology, and show that the hierarchical decoding model can reach higher translation accuracy than the subword-level NMT model using significantly fewer parameters, while demonstrating better capacity in learning longer-distance contextual and grammatical dependencies than the standard character-level NMT model." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-7", "text": "----------------------------------" }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-9", "text": "Neural Machine Translation (NMT) models are typically trained using a fixed-size lexical vocabulary." 
}, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-10", "text": "In addition to controlling the computational load, this limitation also serves to maintain better distributed representations for the most frequent set of words included in the vocabulary." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-11", "text": "On the other hand, rare words in the long tail of the lexical distribution are often discarded during translation since they are not found in the vocabulary." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-12", "text": "The prominent approach to overcome this limitation is to segment words into subword units (Sennrich et al., 2016) and perform translation based on a vocabulary composed of these units." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-13", "text": "However, subword segmentation methods generally rely on statistical heuristics that lack any linguistic notion." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-67", "text": "ppy|xq \" n \u017a j\"1 l \u017a k\"1 ppy j,k |y j,\u0103k , y\u0103j, x\u0103mq" }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-14", "text": "Moreover, they are typically deployed as a pre-processing step before training the NMT model, hence, the predicted set of subword units are essentially not optimized for the translation task." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-15", "text": "Recently, (Cherry et al., 2018) extended the approach of NMT based on subword units to implement the translation model directly at the level of characters, which could reach comparable performance to the subword-based model, although this would require much larger networks which may be more difficult to train." 
}, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-16", "text": "The major reason to this requirement may lie behind the fact that treating the characters as individual tokens at the same level and processing the input sequences in linear time increases the difficulty of the learning task, where translation would then be modeled as a mapping between the characters in two languages." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-17", "text": "The increased sequence lengths due to processing sentences as sequences of characters also augments the computational cost, and a possible limitation, since sequence models typically have limited capacity in remembering longdistance context." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-18", "text": "In many languages, words are the core atomic units of semantic and syntactic structure, and their explicit modeling should be beneficial in learning distributed representations for translation." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-19", "text": "There have been early studies in NMT which proposed to perform translation at the level of characters while also regarding the word boundaries in the translation model through a hierarchical decoding procedure, although these approaches were generally deployed through hybrid systems, either as a back-off solution to translate unknown words (Luong and Manning, 2016) , or as pre-trained components (Ling et al., 2015) ." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-20", "text": "In this paper, we explore the benefit of achieving character-level NMT by processing sentences at multi-level dynamic time steps defined by the word boundaries, integrating a notion of explicit hierarchy into the decoder." 
}, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-21", "text": "In our model, all word representations are learned compositionally from character embeddings using bi-directional recurrent neural networks (bi-RNNs) (Schuster and Paliwal, 1997) , and decoding is performed by generating each word character by character based on the predicted word representation through a hierarchical beam search algorithm which takes advantage of the hierarchical architecture while generating translations." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-22", "text": "We present the results of an extensive evaluation comparing conventional approaches for open-vocabulary NMT in the machine translation task from English into five morphologically-rich languages, where each language belongs to a different language family and has a distinct morphological typology." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-23", "text": "Our findings show that using the hierarchical decoding approach, the NMT models are able to obtain higher translation accuracy than the subword-based NMT models in many languages while using significantly fewer parameters, where the character-based models implemented with the same computational complexity may still struggle to reach comparable performance." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-24", "text": "Our analysis also shows that explicit modeling of word boundaries in character-level NMT is advantageous for capturing longer-term contextual dependencies and generalizing to morphological variations in the target language." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-25", "text": "----------------------------------" }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-26", "text": "**NEURAL MACHINE TRANSLATION**" }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-27", "text": "In this paper, we use recurrent NMT architectures based on the model developed by ." 
}, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-28", "text": "The model essentially estimates the conditional probability of translating a source sequence x \" px1, x2, . . . xmq into a target sequence y \" py1, y2, . . ." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-29", "text": "ynq, using the decomposition ppy|xq \"" }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-30", "text": "where y\u0103j is the target sentence history defined by the sequence ty1...yj\u00b41u." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-87", "text": "All models are implemented using gated recurrent units (GRU) with the same number of parameters." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-31", "text": "The inputs of the network are one-hot vectors representing the tokens in the source sentence, which are binary vectors with a single bit set to 1 to identify a specific token in the vocabulary." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-32", "text": "Each one-hot vector is then mapped to a dense continuous representation, i.e. an embedding, of the source tokens via a look-up table." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-33", "text": "The representation of the source sequence is computed using a multi-layer bi-RNN, also referred as the encoder, which maps x into m dense vectors corresponding to the hidden states of the last bi-RNN layer updated in response to the input token embeddings." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-34", "text": "The generation of the translation of the source sentence is called decoding, and it is conventionally implemented in an auto-regressive mode, where each token in the target sentence is generated based on an sequential classification procedure defined over the target token vocabulary." 
}, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-35", "text": "In this decoding architecture, a unidirectional recurrent neural network (RNN) predicts the most likely output token yi in the target sequence using an approximate search algorithm based on the previous target token yi\u00b41, represented with the embedding of the previous token in the target sequence, the previous decoder hidden state, representing the sequence history, and the current attention context in the source sequence, represented by the context vector ct." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-36", "text": "The latter is a linear combination of the encoder hidden states, whose weights are dynamically computed by a dot product based similarity metric called the attention model (Luong et al., 2015) ." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-37", "text": "The probability of generating each target word yi is estimated via a softmax function" }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-38", "text": "where zj is the j th one-hot vector of the target vocabulary of size K, and oi is the decoder output vector for the i th target word yi." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-39", "text": "The model is trained by maximizing the loglikelihood of a parallel training set via stochastic gradientdescent (Bottou, 2010) , where the gradients are computed with the back propagation through time (Werbos, 1990) algorithm." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-40", "text": "Due to the softmax function in Equation 2, the size of the target vocabulary plays an important role in defining the computational complexity of the model." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-41", "text": "In the standard architecture, the embedding matrices account for the vast majority of the network parameters, thus, the amount of embeddings that could be learned and stored efficiently needs to be limited." 
}, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-42", "text": "Moreover, for many words corresponding to the long tail of the lexical distribution, the model fails in learning accurate embeddings, as they are rarely observed in varying context, leading the model vocabulary to typically include the most frequent set of words in the target language." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-43", "text": "This creates an important bottleneck over the vocabulary coverage of the model, which is especially crucial when translating into low-resource and morphologically-rich languages, which often have a high level of sparsity in the lexical distribution." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-44", "text": "The standard approach to overcome this limitation has now become applying a statistical segmentation algorithm on the training corpus which splits words into smaller and more frequent subword units, and building the model vocabulary composed of these units." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-45", "text": "The translation problem is then modeled as a mapping between sequences of subword units in the source and target languages (Sennrich et al., 2016; Wu et al., 2016; Ataman et al., 2017) ." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-46", "text": "The most popular statistical segmentation method is Byte-Pair Encoding (BPE) (Sennrich et al., 2016) , which finds the optimal description of a corpus vocabulary by iteratively merging the most frequent character sequences." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-47", "text": "One problem related to the subwordbased NMT approach is that segmentation methods are typically implemented as pre-processing steps to NMT, thus, they are not optimized simultaneously with the translation task in an end-to-end fashion." 
}, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-48", "text": "This can lead to morphological errors at different levels, and cause loss of semantic or syntactic information (Ataman et al., 2017) , due to the ambiguity in subword embeddings." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-49", "text": "In fact, recent studies have shown that the same approach can be extended to implement the NMT model directly at the level of characters, which could alleviate potential morphological errors due to subword segmentation." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-50", "text": "Although character-level NMT models have shown the potential to obtain comparable performance with subwordbased NMT models, this would require increasing the computational cost of the model, defined by the network parameters (Kreutzer and Sokolov, 2018; Cherry et al., 2018) ." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-51", "text": "As given in Figure 1a implementing the NMT decoder directly at the level of characters leads to repetitive passes over the attention mechanism and the RNNs modeling the target language for each character in the sentence." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-52", "text": "Since the distributed representations of characters are shared among different word and sentence-level context, the translation task requires a network with high capacity to learn this vastly dynamic context." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-53", "text": "----------------------------------" }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-54", "text": "**HIERARCHICAL DECODING**" }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-55", "text": "In this paper, we explore the benefit of integrating a notion of hierarchy into the decoding architecture which could increase the computational efficiency in character-level NMT, following the work of (Luong and Manning, 2016) ." 
}, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-56", "text": "In this architecture, the input embedding layer of the decoder is augmented with a character-level bi-RNN, which estimates a composition function over the embeddings of the characters in each word in order to compute the distributed representations of target words." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-57", "text": "Given a bi-RNN with a forward (f ) and backward (b) layer, the word representation w of a token of t characters (a) (b) Figure 1 : (a) Hierarchical NMT decoder: input words are encoded as character sequences and the translation is predicted at the level of words." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-58", "text": "The output words are generated as character sequences." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-59", "text": "(b) Character-level NMT decoder: the next token in the sentence is predicted by computing the attention weights and the target context repetitively for each character in the sentence." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-60", "text": "is computed from the hidden states h t f and h 0 b , i.e. the final outputs of the forward and backward RNNs, as follows:" }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-61", "text": "where W f and W b are weight matrices associated to each RNN and b is a bias vector." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-62", "text": "The embeddings of characters and the parameters of the word composition layer are jointly learned while training the NMT model." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-63", "text": "Since all target word representations are computed compositionally, the hierarchical decoding approach eliminates the necessity of storing word embeddings, significantly reducing the number of parameters." 
}, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-64", "text": "Each word in the target sentence is predicted by an RNN operating at the level of words, using the compositional target word representations, target sentence history and the context vector computed by the attention mechanism only in the beginning of a new word generation." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-65", "text": "Instead of classifying the predicted target word in the vocabulary, its distributed representation is fed to a character-level RNN to generate the surface form of the word one character at a time by modeling the probability of observing the k th character of the j th word with length l, ppy j,k |y\u0103j, y j,\u0103k q, given the previous words in the sequence and the previous characters in the word." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-66", "text": "The translation probability is then decomposed as:" }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-68", "text": "Similar to (Luong and Manning, 2016) , the information necessary to generate the surface form is encoded into the attentional vector\u0125t:" }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-69", "text": "where ht is the hidden state of the word-level RNN representing the current target context." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-70", "text": "The attentional vector is used to initialize the character RNN, and after the generation of the first character in the word, character decoding continues in an auto-regressive mode, where the embedding of the each character is fed to the RNN to predict the next character in the word." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-71", "text": "The decoder consecutively iterates over the words and characters in the target sentence, where each RNN is updated at dynamic time steps based on the word boundaries." 
}, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-72", "text": "Table 1 : Results of the evaluation of models in translating languages with different morphological typology using the IWSLT data sets." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-73", "text": "The average number of parameters are calculated only for the decoders of the NMT models at a resolution of millions (M)." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-74", "text": "The best scores for each translation direction are in bold font." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-75", "text": "All improvements over the baselines are statistically significant (p-value \u0103 0.01)." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-76", "text": "NMT decoder, we implement a hierarchical beam search algorithm." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-77", "text": "Similar to the standard algorithm, the beam search starts by predicting the B most likely characters and storing them in a character beam along with their probabilities." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-78", "text": "The beams are reset each time the generation of a word is complete and the B most likely words are used to update the hidden states of the word-level RNN, which are fed to the character RNN to continue the beam search." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-79", "text": "When the beam search is complete, the most likely character sequence is generated as the best hypothesis." 
}, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-80", "text": "----------------------------------" }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-81", "text": "**EXPERIMENTS**" }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-82", "text": "We evaluate decoding architectures using different levels of granularity in the vocabulary units and the attention mechanism, including the standard decoding architecture implemented either with subword (Sennrich et al., 2016) or fully character-level (Cherry et al., 2018) units, which constitute the baseline approaches, and the hierarchical decoding architecture, by implementing all in Pytorch (Paszke et al., 2017) within the OpenNMT-py framework (Klein et al., 2017) ." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-83", "text": "In order to evaluate how each generative method performs in languages with different morphological typology, we model the machine translation task from English into five languages from different language families and exhibiting distinct morphological typology: Arabic (templatic), Czech (mostly fusional, partially agglutinative), German (fusional), Italian (fusional) and Turkish (agglutinative)." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-84", "text": "We use the TED Talks corpora (Cettolo, 2012) for training the NMT models, which range from 110K to 240K sentences, and the official development and test sets from IWSLT 1 (Cettolo et al., 2017) ." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-85", "text": "The low-resource settings for the training data allows us to examine the quality of the internal representations learned by each decoder under high data sparseness." 
}, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-86", "text": "In order to evaluate how the performance of each method scales with increasing data size, we evaluate the models also by training with a multi-domain training data using the public data sets from WMT 2 (Bojar et al., 2016) in the English-to-German direction, followed by an analysis on each model's capability in generalizing to morphological variations in the target language, using the Morpheval (Burlot et al., 2018) evaluation sets." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-88", "text": "The hierarchical decoding model implements a 3-layer GRU architecture, which is compared with a character-level decoder which also uses a 3-layer stacked GRU architecture." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-89", "text": "The subword-level decoder has a 2-layer stacked GRU architecture, to account also for the larger number of embedding parameters." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-90", "text": "The models using the standard architecture have the attention mechanism after the first GRU layer, and have residual connections after the second layer (Barone et al., 2017) ." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-91", "text": "The hierarchical decoder implements the attention mechanism after the second layer and has a residual connection between the first and second layers." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-92", "text": "The source sides of the data used for training characterlevel NMT models are segmented using BPE with 16,000 merge rules on the IWSLT data, and 32,000 on WMT." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-93", "text": "For subword-based models we learn shared merging rules for BPE for 16,000 (in IWSLT) and 32,000 (in WMT) units." 
}, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-94", "text": "The models use an embedding and hidden unit size of 512 under low-resource (IWSLT) and 1024 under high-resource (WMT) settings, and are trained using the Adam (Kinga and Ba, 2015) optimizer with a learning rate of 0.0003 and decay of 0.5, batch size of 100 and a dropout of 0.2." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-95", "text": "Decoding in all models is performed with a beam size of 5." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-96", "text": "The accuracy of each output is measured in terms of the BLEU metric (Papineni et al., 2002) and the significance of the improvements are measured using bootstrap hypothesis testing (Clark et al., 2011) ." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-97", "text": "----------------------------------" }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-98", "text": "**RESULTS**" }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-99", "text": "The results of the experiments given in Table 1 show that the hierarchical decoder can reach performance comparable to or better than the NMT model based on subword units in all languages while using almost three times less number of parameters." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-100", "text": "The improvements are especially evident in Arabic and Turkish, languages with the most complex morphology, where the accuracy with the hierarchical decoder is 1.28 and 1.22 BLEU points higher, respectively, and comparable in Czech, Italian and German, which represent the fusional languages." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-101", "text": "In Czech, the hierarchical model outperforms the subword-based model by 0.19 BLEU and in Italian by 0.41 BLEU points." 
}, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-102", "text": "The subword-based NMT model achieves the best performance in German, a language that is rich in compounding, where explicit subword segmentation might allow learning better representations for translation units." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-103", "text": "The fully character-level NMT model, on the other hand, obtains higher translation accuracy than the hierarchical model in Turkish, with an improvement of 0.91 BLEU, and in Czech with 0.15 BLEU points." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-104", "text": "As can be seen in the statistical characteristics of the training sets illustrated by plotting the token-to-type ratios in each language (Figure 2) , these two directions constitute the most sparse settings, where Turkish has the highest amount of sparsity in the benchmark, followed by Czech, and the improvements seem to be proportional to the amount of sparsity in the language." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-105", "text": "This suggests that in case of high lexical sparsity, learning to translate based on representations of characters might aid in reducing contextual sparsity, allowing to learn better distributed representations." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-106", "text": "As the training data size increases, one would expect the likelihood of observing rare words to decrease, especially in languages with low morphological complexity, along with the significance of representing rare and unseen words (Cherry et al., 2018) ." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-107", "text": "Our results support this hypothesis, where decreasing lexical sparsity, either in the form of the training data size, or the morphological complexity of the target language, eliminates the advantage of character-level translation." 
}, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-108", "text": "In Arabic and Italian, where the training data is almost twice as large as the other languages, using the hierarchical model provides improvements of 2.83 and 2.31 BLEU points over the character-level NMT model." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-109", "text": "In German, the fully characterlevel NMT model still achieves the lowest accuracy, with 2.06 BLEU points below the subword-based model." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-110", "text": "This might be due to the increased level of contextual ambiguity leading to difficulty in learning reliable character embeddings when the model is trained over larger corpora." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-111", "text": "Another factor which might affect the lower performance of character-level models is the average sentence lengths, which are much longer compared to the sentence lengths resulting from with subword segmentation (Figure 2 )." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-112", "text": "In the experiments conducted in the English-to-German translation direction, the results of which are given in Table 3 , accuracy obtained with the hierarchical and subword-based NMT decoders significantly increase with the extension of the training data, where the subword-based model obtains the best accuracy, followed by the hierarchical model, and the character-level NMT model obtains significantly lower accuracy compared to both approaches." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-113", "text": "Studies have shown that character-level NMT models could potentially reach the same performance with the subword-based NMT models (Cherry et al., 2018) , although this might require increasing the capacity of the network." 
}, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-114", "text": "On the other hand, the consistency in Since solely relying on BLEU scores may not be sufficient in understanding the generative properties of different NMT models, we perform an additional evaluation in order to assess the capacity of models in learning syntactic or morphological dependencies using the Morpheval test suites, which consist of sentence pairs that differ by one morphological contrast, and each output accuracy is measured in terms of the percentage of translations that could convey the morphological contrast in the target language." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-115", "text": "Table 2 lists the performance of different NMT models implementing decoding at the level of subwords, characters, or hierarchical wordcharacter units in capturing variances in each individual morphological paradigm and preserving the agreement between inflected words and their dependent lexical items." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-116", "text": "The results of our analysis support the benefit of using BPE in German as an open-vocabulary NMT solution, where the subwordbased model obtains the highest accuracy in most of the morphological paradigm generation tasks." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-117", "text": "The character-level model shows to be promising in capturing few morphological features better than the former, such as negation or comparative adjectives, and in capturing agreement features, the hierarchical decoding model generally performs better than the subword-based model." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-118", "text": "The dominance of subword-based models could be due to the high level of compounding in German where segmentation is possibly beneficial in splitting compound words and aiding better syntactic modeling in some cases." 
}, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-119", "text": "These results generally suggest the importance of processing the sentence context at the word level in order to induce a better notion of syntax during generation." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-120", "text": "In order to better illustrate the differences in the outputs of each NMT model, we also present some sample translations in Table 4 , obtained by translating English into Turkish using the NMT models trained on the TED Talks corpus." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-121", "text": "The input sentences are selected such that they are sufficiently long so that one can see the ability of each model in capturing long-distance dependencies in context." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-122", "text": "The input sentence is from a typical conversation, which requires remembering a long context with many references." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-123", "text": "We highlight the words in each output that is generated for the first time." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-124", "text": "Most of the models fail to generate a complete translation, starting to forget the sentence history after the generation of a few words, indicated by the start of generation of repetitions of the previously generated words." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-125", "text": "The character-level decoder seems to have the shortest memory span, followed by the subwordbased decoder, which completely omits the second half of the sentence." 
}, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-126", "text": "Despite omitting the translations of the last four words in the input and some lexical errors, the hierarchical decoder is the only model which can generate a meaningful and grammatically-correct sentence, suggesting that modeling translation based on a context defined at the lexical level might help to learn better grammatical and contextual dependencies, and remembering longer history." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-127", "text": "Although current methodology in NMT allows more efficient processing by implementing feed-forward architectures (Vaswani et al., 2017) , our approach can conceptually be applied within these frameworks." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-128", "text": "In this paper, we limit the evaluation to recurrent architectures for comparison to previous work, including (Luong and Manning, 2016) , (Sennrich et al., 2016) and (Cherry et al., 2018) , and leave implementation of hierarchical decoding with feed-forward architectures to future work." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-129", "text": "----------------------------------" }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-130", "text": "**CONCLUSION**" }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-131", "text": "In this paper, we explored the idea of performing the decoding procedure in NMT in a multi-dimensional search space defined by word and character level units via a hierarchical decoding structure and beam search algorithm." }, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-132", "text": "Our model obtained comparable to better performance than conventional open-vocabulary NMT solutions, including subword and character-level NMT methods, in many languages while using a significantly smaller number of parameters, showing promising application under high-resource settings." 
}, { "sent_id": "8e59c2c48e27b2abd5f63d6b4ce23d-C001-133", "text": "Our software is available for public usage 3 ." } ], "y": { "@BACK@": { "gold_contexts": [ [ "8e59c2c48e27b2abd5f63d6b4ce23d-C001-15" ], [ "8e59c2c48e27b2abd5f63d6b4ce23d-C001-50" ], [ "8e59c2c48e27b2abd5f63d6b4ce23d-C001-106" ], [ "8e59c2c48e27b2abd5f63d6b4ce23d-C001-113" ] ], "cite_sentences": [ "8e59c2c48e27b2abd5f63d6b4ce23d-C001-15", "8e59c2c48e27b2abd5f63d6b4ce23d-C001-50", "8e59c2c48e27b2abd5f63d6b4ce23d-C001-106", "8e59c2c48e27b2abd5f63d6b4ce23d-C001-113" ] }, "@USE@": { "gold_contexts": [ [ "8e59c2c48e27b2abd5f63d6b4ce23d-C001-82" ], [ "8e59c2c48e27b2abd5f63d6b4ce23d-C001-128" ] ], "cite_sentences": [ "8e59c2c48e27b2abd5f63d6b4ce23d-C001-82", "8e59c2c48e27b2abd5f63d6b4ce23d-C001-128" ] }, "@SIM@": { "gold_contexts": [ [ "8e59c2c48e27b2abd5f63d6b4ce23d-C001-106", "8e59c2c48e27b2abd5f63d6b4ce23d-C001-107" ] ], "cite_sentences": [ "8e59c2c48e27b2abd5f63d6b4ce23d-C001-106" ] } } }, "ABC_d1bff202991116a6a957aa61c05770_35": { "x": [ { "sent_id": "d1bff202991116a6a957aa61c05770-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-2", "text": "Natural language inference has been shown to be an effective supervised task for learning generic sentence embeddings." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-3", "text": "In order to better understand the components that lead to effective representations, we propose a lightweight version of InferSent (Conneau et al., 2017) , called InferLite, that does not use any recurrent layers and operates on a collection of pre-trained word embeddings." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-4", "text": "We show that a simple instance of our model that makes no use of context, word ordering or position can still obtain competitive performance on the majority of downstream prediction tasks, with most performance gaps being filled by adding local contextual information through temporal convolutions." 
}, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-5", "text": "Our models can be trained in under 1 hour on a single GPU and allows for fast inference of new representations." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-6", "text": "Finally we describe a semantic hashing layer that allows our model to learn generic binary codes for sentences." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-7", "text": "----------------------------------" }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-9", "text": "Distributed representations of words have become immensely successful as the building blocks for deep neural networks applied to a wide range of natural language processing tasks (Pennington et al., 2014) ." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-10", "text": "Learning representations of sentences, however, has largely been done in a taskdependent way." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-11", "text": "In recent years, a growing body of research has emerged for learning general purpose sentence embeddings." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-12", "text": "These methods aim to learn a universal encoding function that can map arbitrary sentences into vectors which can then be applied to downstream prediction tasks without finetuning." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-13", "text": "Much of the motivation behind this work is to mimic the successful use of feature transfer in computer vision." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-14", "text": "Recently, Conneau et al. (2017) showed that a bidirectional LSTM with max pooling trained to perform Natural Language Inference (NLI), called InferSent, outperforms several other encoding functions on a suite of downstream prediction tasks." 
}, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-15", "text": "This method could match or outperform existing models that learns generic embeddings in an unsupervised setting, often requiring several days or weeks to train (Kiros et al., 2015) ." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-16", "text": "However, a better understanding of what properties induce a useful generic embedding remains illusive." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-17", "text": "In this work we propose a lightweight version of InferSent, called InferLite." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-18", "text": "InferLite deviates from InferSent in that it does not use any recurrent connections and can generalize to multiple pre-trained word embeddings." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-19", "text": "Our method uses a controller to dynamically weight embeddings for each word followed by max pooling over components to obtain the final sentence representation." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-20", "text": "Despite its simplicity, our method obtains performances on par with InferSent (Conneau et al., 2017) when using Glove representations (Pennington et al., 2014) as the source of pre-trained word vectors." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-21", "text": "To our surprise, the majority of evaluations can be done competitively without any notion of context, word ordering or position." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-22", "text": "For tasks where this is useful, much of the performance gap can be made up through a stack of convolutional layers to incorporate local context." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-23", "text": "Finally, we describe a semantic hashing layer that allows our model to be extended to learning generic binary vectors." 
}, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-24", "text": "The final result is a method that is both fast at training and inference and offers a strong baseline for future research on general purpose embeddings." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-25", "text": "----------------------------------" }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-26", "text": "**WHY LEARN LIGHTWEIGHT ENCODERS?**" }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-27", "text": "Our proposed model naturally raises a question: why consider lightweight sentence encoders? If a generic encoder only needs to be trained once, why would training times be relevant?" }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-28", "text": "We argue our direction is important for two reasons." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-29", "text": "One is inference speed." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-30", "text": "With a lightweight encoder, we can encode millions of sentences efficiently without requiring extensive computational resources." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-31", "text": "The appendix includes inference speeds of our models." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-32", "text": "The second, perhaps more importantly, is to gain a better understanding of what prop-erties lead to high quality generic embeddings." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-33", "text": "When models take several days or weeks to train, an ablation analysis becomes prohibitively costly." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-34", "text": "Since our models can be trained quickly, it allows for a more extensive analysis of architectural and data necessities." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-35", "text": "Moreover, we include an ablation study in the appendix that shows even innocent or seemingly irrelevant model decisions can have a drastic effect on performance." 
}, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-36", "text": "Such observations could not be observed when models take orders of magnitude longer to train." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-37", "text": "----------------------------------" }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-38", "text": "**RELATED WORK**" }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-39", "text": "A large body of work on distributional semantics have considered encoding phrase and sentence meaning into vectors e.g. (Mitchell and Lapata, 2008; Grefenstette et al., 2013; Paperno et al., 2014) ." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-40", "text": "The first attempt at using neural networks for learning generic sentence embeddings was Kiros et al. (2015) , who proposed a sequenceto-sequence extension of the skip-gram model but applied at the sentence level." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-41", "text": "This method was taught to encode a sentence and predict its neighbours, harnessing a large collection of books for training (Zhu et al., 2015) ." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-42", "text": "A similar approach, FastSent, was proposed by (Hill et al., 2016) which replaced the RNN encoder of skip-thoughts with word embedding summation." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-43", "text": "Methods using RNN encoders tend to perform poorly on STS evaluations, as shown by Wieting et al. (2015) ." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-44", "text": "Arora et al. (2017) showed a simple weighted bag of words with the first principal component subtracted, can be competitive on many sentencing encoding tasks." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-45", "text": "Attempts to learn generic encoders with discriminative objectives were considered by Nie et al. 
(2017) and Logeswaran and Lee (2018) , who replaced the decoder of skip-thoughts with classification tasks based on discourse relations and prediction of target sentences from an encoded candidate." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-46", "text": "All of the above methods relied on a large corpus of unlabelled data." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-47", "text": "Conneau et al. (2017) showed that similar or improved performance can be obtained using NLI datasets as a source of supervisory information." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-48", "text": "State-of-the-art sentence encoders utilize multi-task learning (Subramanian et al., 2018) by training an encoder to simultaneously do well on a collection of tasks such as NLI, next sentence prediction and translation." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-49", "text": "The use of gating for selecting word representations has been considered in previous work." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-50", "text": "Yang et al. (2017) introduced a method for choosing between word and character embeddings, while later work introduced a method for word embedding selection." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-51", "text": "Gating has also been widely applied to multimodal fusion (Arevalo et al., 2017; Wang et al., 2018b; Kiros et al., 2018) ." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-75", "text": "The controller first computes a shared layer G_0^c along with M heads G_1^k, . . . , G_M^k for k = 1, . . . , K embedding types." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-52", "text": "Our work is also related to recent methods that induce contextualized word representations (McCann et al., 2017; Peters et al., 2018) as well as pre-training language models for task-dependent fine-tuning (Dai and Le, 2015; Howard and Ruder, 2018; Radford et al., 2018) ." 
}, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-53", "text": "We differ from these approaches in that we aim to infer a transferable sentence vector without any additional fine-tuning." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-54", "text": "----------------------------------" }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-55", "text": "**METHOD**" }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-56", "text": "Our method operates on a collection of pre-trained word representations and is then trained on the concatenation of SNLI (Bowman et al., 2015) and MultiNLI (Williams et al., 2018) datasets as in Conneau et al. (2017) ." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-57", "text": "Table 1 summarizes the properties of the embeddings we consider." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-58", "text": "At a high level, our method takes as input a collection of embeddings for each word and learns a gated controller to decide how to weight each representation." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-59", "text": "After encoding each word in a sentence, the sentence embedding is obtained by max pooling the transformed word representations." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-60", "text": "Unlike Subramanian et al. (2018) , which learn a shared encoder in a multi-task setting, we instead fix the prediction task to NLI but use embeddings obtained from alternative tasks." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-61", "text": "Figure 1 illustrates our model." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-62", "text": "We begin by defining notation." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-63", "text": "Suppose we are given a sentence of words S = w 1 , . . . , w T which we would like to encode into a vector." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-64", "text": "Let K be the number of embedding types (e.g. 
Glove, News, Query) and let E_k denote the word embedding matrix for type k. Define E_c = [E_1; . . . ; E_K] to be the concatenation of word embedding matrices of all K types." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-65", "text": "We break our model description into four modules: Encoder, Controller, Fusion and Reduction." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-66", "text": "In the appendix we include an ablation study that analyzes the effect of our design choices." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-67", "text": "Encoder." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-68", "text": "The encoder computes M + 1 layers. The first layer is computed as:" }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-69", "text": "where W^k_{0,g} E_k is a time-distributed matrix multiply and φ_{0,h} is the activation function." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-70", "text": "Each subsequent layer is given by:" }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-71", "text": "where * denotes the 1-D convolution operator that preserves dimensions." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-72", "text": "Note that if the convolutional filter length is 1, the model reduces to a bag-of-words encoder." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-73", "text": "We use ReLU activation functions for φ_{i,h} where i = 0, . . . , M − 1 and a tanh activation for the last layer φ_{M,h}." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-74", "text": "Controller." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-76", "text": "The first layer is computed as:" }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-77", "text": "where W^c_{0,g} E_c is a time-distributed matrix multiply and φ_{0,g} is the activation function."
}, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-78", "text": "Define" }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-79", "text": "Each subsequent layer is given by:" }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-80", "text": "We use ReLU activation functions for φ_{i,g} where i = 0, . . . , M − 1 and a sigmoid activation for the last layer φ_{M,g}." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-81", "text": "----------------------------------" }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-82", "text": "**FUSION.**" }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-83", "text": "The fusion layer combines the encoder and controller layers as follows:" }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-84", "text": "where ⊙ denotes a component-wise product, W^f F_0 is a time-distributed matrix multiply, φ_f is a ReLU activation function and G^c_0 is added as a skip connection." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-85", "text": "In the appendix we demonstrate that the added skip connection is crucial to the success of the model." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-86", "text": "Reduction." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-87", "text": "The final reduction operation simply applies max pooling across tokens:" }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-88", "text": "resulting in a sentence vector s. This vector is the embedding on which we evaluate all downstream tasks." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-89", "text": "For training on NLI, we follow existing work and compute the concatenation of the embeddings of premise and hypothesis sentences along with their component-wise product and absolute difference (Conneau et al., 2017) ."
}, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-90", "text": "This joint vector is fed into a feedforward network with 2 hidden layers and ReLU activations, followed by a softmax layer to predict whether the sentence pairs are neutral, entailed or contradictory." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-91", "text": "After training on NLI, the weights of the model are frozen and used for encoding new sentences." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-92", "text": "Our encoder is related to the gated convolutional networks of van den Oord et al. (2016)." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-93", "text": "There are three main differences: 1) we generalize to multiple embedding types, 2) we only apply gating at the end of the last layer as a way of weighting all embedding types (instead of each layer) and 3) we use a skip connection from the controller's transformed input to the fusion layer." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-94", "text": "We note that our encoder module can be reduced to the gated convolutional encoder in van den Oord et al. (2016) if we use one embedding type, remove the time-distributed layers and only use a single convolutional layer." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-95", "text": "----------------------------------" }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-96", "text": "**RELATIONSHIP TO OTHER WORK**" }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-97", "text": "----------------------------------" }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-98", "text": "**SEMANTIC HASHING**" }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-99", "text": "We can augment InferLite with a semantic hashing (Salakhutdinov and Hinton, 2009) layer as a way of learning binary codes for sentences." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-100", "text": "Binary codes allow for efficient storage and retrieval over massive corpora."
}, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-101", "text": "To do this, we append the following layer:" }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-102", "text": "where LN is Layer Normalization (Ba et al., 2016) , σ is the sigmoid activation and τ is a temperature hyperparameter." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-103", "text": "We initialize τ = 1 at the beginning of training and exponentially decay τ towards 0 over the course of training." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-104", "text": "At inference time, we threshold at 0.5 to obtain codes." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-105", "text": "We found Layer Normalization was important for obtaining good codes as otherwise many dead units would form." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-106", "text": "In the appendix we include downstream performance results for 256, 1024 and 4096-bit codes." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-107", "text": "The combination of fast inference and efficient storage allows InferLite to be an effective generic encoder for large-scale retrieval and similarity search." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-108", "text": "----------------------------------" }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-109", "text": "**EXPERIMENTS**" }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-110", "text": "We use the SentEval toolkit (Conneau and Kiela, 2018) for evaluating our sentence embeddings." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-111", "text": "All of our models are trained to optimize performance on the concatenation of SNLI and MultiNLI, using the concatenated development sets for early stopping." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-112", "text": "We use 4096-dimensional embeddings as in Conneau et al. (2017) ."
}, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-113", "text": "We consider encoders that use convolutional filters of length 1 (no context) or length 3 (local context), with a stack of M = 3 convolutional layers." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-114", "text": "All word embeddings are pre-trained, normalized to unit length and held fixed during training." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-115", "text": "Full hyperparameter details are included in the appendix, including an ablation study comparing the effect of the choice of M." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-116", "text": "We first analyze performance of our model on NLI prior to evaluating our models on downstream tasks." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-117", "text": "Figure 2 shows development set accuracy on NLI for models with and without context, using various feature combinations." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-118", "text": "Here we observe that a) using local context improves NLI performance and b) adding additional embedding types leads to improved performance." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-119", "text": "Tables 2 and 3 show results on downstream evaluation tasks." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-120", "text": "Here several observations can be made." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-121", "text": "First note the effectiveness of the basic (glove,1) model, which is essentially a deep bag-of-unigrams encoder." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-122", "text": "We also observe our models outperform all previous bag-of-words baselines." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-123", "text": "Next we observe that adding local context helps significantly on MR, CR, SST2 and TREC tasks."
}, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-124", "text": "Furthermore, fusing embeddings from query and news models matches or improves performance over a glove-only model on 12 out of 15 tasks." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-125", "text": "Our (glove+news+query,3) model is best on 5 tasks and is a generally strong performer across all evaluations." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-126", "text": "Finally, observe that our models significantly improve over previous work on STS tasks." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-127", "text": "Next we compare training times of our models to previous work." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-128", "text": "All of our models can be trained in one GPU hour or less." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-129", "text": "QuickThoughts and InferSent can be trained on the order of a day while Multitask requires 1 week of training." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-130", "text": "This demonstrates the trade-off of these approaches." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-131", "text": "In the appendix we include results from several other experiments including COCO image-sentence retrieval, downstream performance of InferLite with semantic hashing and results on 10 probing tasks introduced in ." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-132", "text": "We also do an extensive ablation study of model components and illustrate gate activation values qualitatively for sentences from the (glove+news+query,3) model."
}, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-133", "text": "----------------------------------" }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-134", "text": "**LIMITATIONS**" }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-135", "text": "We also experimented with additional embedding types, including Picturebook (Kiros et al., 2018) , knowledge graph and neural machine translation based embeddings." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-136", "text": "While adding these embeddings improved performance on NLI, they did not lead to any performance gains on downstream tasks." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-137", "text": "This is in contrast to Subramanian et al. (2018) who showed adding additional tasks in a multi-task objective led to better downstream performance." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-138", "text": "This demonstrates the limitations of solely using NLI as an objective, even if we transfer embeddings from additional tasks." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-139", "text": "----------------------------------" }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-140", "text": "**FUTURE WORK**" }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-141", "text": "In future work, we would like to explore using contextualized word embeddings, such as CoVe (McCann et al., 2017) and ELMo (Peters et al., 2018) , as input to our models as opposed to non-contextualized representations." }, { "sent_id": "d1bff202991116a6a957aa61c05770-C001-142", "text": "We also intend to evaluate on additional benchmark tasks such as GLUE (Wang et al., 2018a) , explore using the learned word representations as contextualized embeddings and perform downstream fine-tuning."
} ], "y": { "@EXT@": { "gold_contexts": [ [ "d1bff202991116a6a957aa61c05770-C001-3" ] ], "cite_sentences": [ "d1bff202991116a6a957aa61c05770-C001-3" ] }, "@BACK@": { "gold_contexts": [ [ "d1bff202991116a6a957aa61c05770-C001-14" ], [ "d1bff202991116a6a957aa61c05770-C001-47" ] ], "cite_sentences": [ "d1bff202991116a6a957aa61c05770-C001-14", "d1bff202991116a6a957aa61c05770-C001-47" ] }, "@SIM@": { "gold_contexts": [ [ "d1bff202991116a6a957aa61c05770-C001-20" ] ], "cite_sentences": [ "d1bff202991116a6a957aa61c05770-C001-20" ] }, "@USE@": { "gold_contexts": [ [ "d1bff202991116a6a957aa61c05770-C001-56" ], [ "d1bff202991116a6a957aa61c05770-C001-89" ], [ "d1bff202991116a6a957aa61c05770-C001-112" ] ], "cite_sentences": [ "d1bff202991116a6a957aa61c05770-C001-56", "d1bff202991116a6a957aa61c05770-C001-89", "d1bff202991116a6a957aa61c05770-C001-112" ] } } }, "ABC_7b69c68a602e90c7ee7b9fa8a8facf_35": { "x": [ { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-2", "text": "We examine the problem of overcoming noisy word-level alignments when learning tree-to-string translation rules." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-3", "text": "Our approach introduces new rules, and reestimates rule probabilities using EM." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-4", "text": "The major obstacles to this approach are the very reasons that word-alignments are used for rule extraction: the huge space of possible rules, as well as controlling overfitting." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-5", "text": "By carefully controlling which portions of the original alignments are reanalyzed, and by using Bayesian inference during re-analysis, we show significant improvement over the baseline rules extracted from word-level alignments." 
}, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-6", "text": "----------------------------------" }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-8", "text": "Non-parametric Bayesian methods have been successfully applied to directly learn phrase pairs from a bilingual corpus with little or no dependence on word alignments (Blunsom et al., 2008; DeNero et al., 2008) ." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-9", "text": "Because such approaches directly learn a generative model over phrase pairs, they are theoretically preferable to the standard heuristics for extracting the phrase pairs from the many-to-one word-level alignments produced by the IBM series models (Brown et al., 1993) or the Hidden Markov Model (HMM) (Vogel et al., 1996) ." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-10", "text": "We wish to apply this direct, Bayesian approach to learn better translation rules for syntax-based statistical MT (SSMT), by which we specifically refer to MT systems using Tree-to-String (TTS) translation templates derived from syntax trees (Liu et al., 2006; Huang et al., 2006; Galley et al., 2006; May and Knight, 2007) , as opposed to formally syntactic systems such as Hiero (Chiang, 2007) ." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-11", "text": "The stumbling block preventing us from taking this approach is the extremely large space of possible TTS templates when no word alignments are given." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-12", "text": "Given a sentence pair and syntax tree over one side, there are an exponential number of potential TTS templates and a polynomial number of phrase pairs."
}, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-13", "text": "In this paper, we explore methods for restricting the space of possible TTS templates under consideration, while still allowing good templates to emerge directly from the data as much as possible." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-14", "text": "We find an improvement in translation accuracy through, first, using constraints to limit the number of new templates, second, using Bayesian methods to limit which of these new templates are favored when re-analyzing the training data with EM, and, third, experimenting with different renormalization techniques for the EM re-analysis." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-15", "text": "We introduce two constraints to limit the number of TTS templates that we extract directly from tree/string pairs without using word alignments." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-16", "text": "The first constraint is to limit direct TTS template extraction to the part of the corpus where word alignment tools such as GIZA++ do poorly." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-17", "text": "There is no reason not to re-use the good alignments from GIZA++, which yield a very competitive baseline performance." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-18", "text": "As already mentioned, the noisy alignments from GIZA++ are likely to cross the boundaries of the tree constituents, which leads to comparatively big TTS templates." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-19", "text": "We use this fact as a heuristic to roughly distinguish noisy from good word alignments." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-20", "text": "Here we define big templates as those with more than 8 symbols in their right hand sides (RHSs)."
}, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-21", "text": "The word alignments in big templates are considered to be noisy and will be recomposed by extracting smaller TTS templates." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-22", "text": "Another reason to do extraction on big templates is that the applicability of big templates to new sentences is very limited due to their size, and the portion of the training data from which they are extracted is effectively wasted." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-23", "text": "The second constraint, after choosing the extraction site, is to extract the TTS templates all the way down to the leaves of the hosting templates." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-24", "text": "This constraint limits the number of possible left hand sides (LHSs) to be equal to the number of tree nodes in the hosting templates." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-25", "text": "The entire extraction process can be summarized in 3 steps:" }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-26", "text": "1. Compute word alignments using GIZA++, and generate the basic TTS templates." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-27", "text": "2. Select big templates from the basic TTS templates in step 1, and extract smaller TTS templates all the way down to the bottom from big templates, without considering the precomputed word alignments." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-28", "text": "3." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-29", "text": "Combine TTS templates from step 1 and step 2 and estimate their probabilities using Variational Bayes with a Dirichlet Process prior." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-30", "text": "In step 2, since there are no constraints from the pre-computed word alignments, we have complete freedom in generating all possible TTS templates to overcome noisy word alignments." 
}, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-31", "text": "We use variational EM to approximate the inference of our Bayesian model and explore different normalization methods for the TTS templates." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-32", "text": "A two-stage normalization is proposed by combining LHS-based normalization with normalization based on the root of the LHS, and is shown to be the best model when used with variational EM." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-33", "text": "Galley et al. (2006) recompose the TTS templates by inserting unaligned target words and combining small templates into bigger ones." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-34", "text": "The recomposed templates are then re-estimated using the EM algorithm described in Graehl and Knight (2004) ." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-35", "text": "This approach also generates TTS templates beyond the precomputed word alignments, but the freedom is only granted over unaligned target words, and most of the pre-computed word alignments remain unchanged." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-36", "text": "Other prior approaches towards improving TTS templates focus on improving the word alignment performance over the classic models such as IBM series models and Hidden Markov Model (HMM), which do not consider the syntactic structure of the aligning languages and produce syntax-violating alignments." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-37", "text": "DeNero and Klein (2007) use a syntax-based distance in an HMM word alignment model to favor syntax-friendly alignments." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-38", "text": "Fossum et al. (2008) start from the GIZA++ alignment and incrementally delete bad links based on a discriminative model with syntactic features."
}, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-39", "text": "This approach can only find a better subset of the GIZA++ alignment and requires a parallel corpus with gold-standard word alignment for training the discriminative model." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-40", "text": "May and Knight (2007) factorize the word alignment into a set of re-orderings represented by the TTS templates and build a hierarchical syntax-based word alignment model." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-41", "text": "The problem is that the TTS templates are generated by the word alignments from GIZA++, which limits the potential of the syntactic re-alignment." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-42", "text": "As shown by these prior approaches, directly improving the word alignment either falls into the framework of many-to-one alignment, or is substantially confined by the word alignment it builds upon." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-43", "text": "The remainder of the paper focuses on the Bayesian approach to learning TTS templates and is organized as follows: Section 2 describes the procedure for generating the candidate TTS templates; Section 3 describes the inference methods used to learn the TTS templates; Section 4 gives the empirical results; Section 5 discusses the characteristics of the learned TTS templates; and Section 6 presents the conclusion."
}, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-44", "text": "----------------------------------" }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-45", "text": "**EXTRACTING PHRASAL TTS TEMPLATES**" }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-46", "text": "The Tree-to-String (TTS) template, the most important component of a SSMT system, usually contains three parts: a fragment of a syntax tree in its left hand side (LHS), a sequence of words and variables in its right hand side (RHS), and a probability indicating how likely the template is to be used in translation." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-47", "text": "The RHS of a TTS template shows one possible translation and reordering of its LHS." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-48", "text": "The variables in a TTS template are further transformed using other TTS templates, and the recursive process continues until there are no variables left." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-49", "text": "There are two ways that TTS templates are commonly used in machine translation." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-50", "text": "The first is synchronous parsing (Galley et al., 2006; May and Knight, 2007) , where TTS templates are used to construct synchronous parse trees for an input sentence, and the translations will be generated once the synchronous trees are built up." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-51", "text": "The other way is the TTS transducer (Liu et al., 2006; Huang et al., 2006) , where TTS templates are used just as their name indicates: to transform a source parse tree (or forest) into the proper target string." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-52", "text": "Since synchronous parsing considers all possible synchronous parse trees of the source sentence, it is less constrained than TTS transducers and hence requires more computational power." 
}, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-53", "text": "In this paper, we use a TTS transducer to test the performance of different TTS templates, but our techniques could also be applied to SSMT systems based on synchronous parsing." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-54", "text": "----------------------------------" }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-55", "text": "**BASELINE APPROACH: TTS TEMPLATES OBEYING WORD ALIGNMENT**" }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-56", "text": "TTS templates are commonly generated by decomposing a pair of aligned source syntax tree and target string into smaller pairs of tree fragments and target string (i.e., the TTS templates)." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-57", "text": "To keep the number of TTS templates to a manageable scale, only the non-decomposable TTS templates are generated." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-58", "text": "This algorithm is referred to as GHKM (Galley et al., 2004) and is widely used in SSMT systems (Galley et al., 2006; Liu et al., 2006; Huang et al., 2006) ." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-59", "text": "The word alignment used in GHKM is usually computed independent of the syntactic structure, and as DeNero and Klein (2007) and May and Knight (2007) have noted," }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-60", "text": "is not the best for SSMT systems. (Table 1: portion of the corpus consumed by big templates under each alignment type: Ch-En 28.6%, En-Ch 33.0%, Union 45.9%, Heuristic 20.1%. Table 2: in the selected big templates, the distribution of words in templates of different sizes, measured by the number of symbols in their RHSs.)" }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-61", "text": "In fact, noisy word alignments cause more damage to a SSMT system than to a phrase based SMT system, because the TTS templates can only be derived from tree constituents."
}, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-62", "text": "If some noisy alignments happen to cross over the boundaries of two constituents, as shown in Figure 2 , a much bigger tree fragment will be extracted as a TTS template." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-63", "text": "Even though the big TTS templates still carry the original alignment information, they have much less chance of getting matched beyond the syntax tree where they were extracted, as we show in Section 4." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-64", "text": "In other words, a few cross-boundary noisy alignments could disable a big portion of a training syntax tree, while for a phrase-based SMT system, their effect is limited to the phrases they align." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-65", "text": "As a rough measure of how the training corpus is affected by the big templates, we calculated the distribution of target words in big and non-big TTS templates." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-66", "text": "The word alignment is computed using GIZA++ for the selected 73,597 sentence pairs in the FBIS corpus in both directions and then combined using union and heuristic diagonal growing (Koehn et al., 2003) ." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-67", "text": "Table 1 shows that big templates consume 20.1% to 45.9% of the training corpus depending on different types of word alignments." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-68", "text": "The statistics indicate that a significant portion of the training corpus is simply wasted, if the TTS templates are extracted based on word alignments from GIZA++." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-69", "text": "On the other hand, it shows the potential for improving an SSMT system if we can efficiently re-use the wasted training corpus."
}, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-70", "text": "By further examining the selected big templates, we find that the most common form of big templates is a big skeleton template starting from the root of the source syntax tree, and having many terminals (words) misaligned in the bottom." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-71", "text": "Table 2 shows, in the selected big templates, the distribution of words in the templates of different sizes (measured based on the number of symbols in their RHS)." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-72", "text": "We can see that based on either type of word alignment, the most common big templates are the TTS templates with more than 20 symbols in their RHSs, which are generally the big skeleton templates." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-73", "text": "The advantage of such big skeleton templates is that they usually have good marginal accuracy and allow accurate smaller TTS templates to emerge." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-74", "text": "----------------------------------" }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-75", "text": "**LIBERATING PHRASAL TTS TEMPLATES FROM NOISY WORD ALIGNMENTS**" }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-76", "text": "To generate better TTS templates, we use a more direct way than modifying the underlying word alignment: extract smaller phrasal TTS templates from the big templates without looking at their pre-computed word alignments." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-77", "text": "We define phrasal TTS templates as those with more than one symbol (word or non-terminal) in their LHS."
}, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-78", "text": "The reason to consider only phrasal TTS templates is that they are more robust than the word-level TTS templates in addressing the complicated word alignments involved in big templates, which are usually not the simple type of one-to-many or many-to-one." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-79", "text": "Abandoning the pre-computed word alignments in big templates, an extracted smaller TTS template can have many possible RHSs, as long as the two sides have the same set of variables." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-80", "text": "Note that the freedom is only given to the alignments of the words; for the variables in the big templates, we respect the pre-computed word alignments." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-81", "text": "To keep the extracted smaller TTS templates to a manageable scale, the following two constraints are applied:" }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-82", "text": "1. denotes the sub-tree in y which is rooted at x and goes all the way down to y's bottom." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-83", "text": "Figure 2.1 shows valid and invalid TTS templates which can be extracted from an example hosting TTS template." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-84", "text": "Note that, in order to keep the example simple, the hosting TTS template only has 4 symbols in its RHS, which does not qualify as a big template according to our definition." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-85", "text": "Figure 2.2 shows the complete set of valid TTS templates which can be extracted from the example TTS template." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-86", "text": "The subscripts of the non-terminals are used to differentiate identical non-terminals in different positions."
}, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-87", "text": "The extraction process blindly releases smaller TTS templates from the big templates, among which only a small fraction are correct TTS templates." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-88", "text": "Therefore, we need an inference method to raise the weight of the correct templates and decrease the weight of the noisy templates." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-89", "text": "----------------------------------" }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-90", "text": "**ESTIMATING TTS TEMPLATE PROBABILITY**" }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-91", "text": "The Expectation-Maximization (EM) algorithm (Dempster et al., 1977) can be used to estimate the TTS templates' probabilities, given a generative model addressing how a pair of source syntax tree and target string is generated." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-92", "text": "There are two commonly used generative models for syntaxbased MT systems, each of which corresponds to a normalization method for the TTS templates." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-93", "text": "The LHS-based normalization (LHSN) (Liu et al., 2006; Huang et al., 2006) , corresponds to the generative process where the source syntax subtree is first generated, and then the target string is generated given the source syntax subtree." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-94", "text": "The other one is normalization based on the root of the LHS (ROOTN) (Galley et al., 2006) , corresponding to the generative process where, given the root of the syntax subtree, the LHS syntax subtree and the RHS string are generated simultaneously." 
}, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-95", "text": "By omitting the decomposition probability in the LHS-based generative model, the two generative models share the same formula for computing the probability of a training instance:" }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-96", "text": "where T and S denote the source syntax tree and target string respectively, R denotes the decomposition of (T, S), and t denotes the TTS template." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-97", "text": "The expected counts of the TTS templates can then be efficiently computed using an inside-outsidelike dynamic programming algorithm (May and Knight, 2007) ." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-98", "text": "LHSN, as shown by Galley et al. (2006) , cannot accurately restore the true conditional probabilities of the target sentences given the source sentences in the training corpus." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-99", "text": "This indicates that LHSN is not good at predicting unseen sentences or at translating new sentences." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-100", "text": "But this deficiency does not affect its ability to estimate the expected counts of the TTS templates, because the posteriors of the TTS templates only depend on the comparative probabilities of the different derivations of a training instance (a pair of tree and string)." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-101", "text": "In fact, as we show in Section 4, LHSN is better than ROOTN in liberating smaller TTS templates out of the big templates, since it is less biased to the big templates in the EM training." 
}, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-102", "text": "4 Because the two normalization methods have their 4 Based on LHSN, the difference between the probability of a big Template and the product of the probabilities of E-step: for all pair of syntax tree T and target string S do for all TTS Template t do EC(t)+ = (Rose et al., 1992 ) is is used in our system to speed up the training process, similar to Goldwater et al. (2006) ." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-103", "text": "We start from a high temperature and gradually decrease the temperature to 1; we find that the initial high temperature can also help small templates to survive the initial iterations." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-104", "text": "The complete EM framework is sketched in Figure 3 , where \u03b2 is the inverse of the specified temperature, and EC denotes the expected count." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-105", "text": "----------------------------------" }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-106", "text": "**BAYESIAN INFERENCE WITH THE DIRICHLET PROCESS PRIOR**" }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-107", "text": "Bayesian inference plus the Dirichlet Process (DP) have been shown to effectively prevent MT models from overfitting the training data (DeNero et al., 2008; Blunsom et al., 2008) ." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-108", "text": "A similar approach can be applied here for SSMT by considering each TTS template as a cluster, and using DP to adjust the number of TTS templates according to the training data." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-109", "text": "Note that even though there is a size limitation on the liberated phrasal TTS templates, standard EM will still tend to overfit the training data by pushing up the probabilities of the big templates from the noisy word alignments." 
}, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-110", "text": "The complete generative process, integrating the DP prior and the generative models described in its decomposing TTS templates is much less than the one based on ROOTN, thus LHSN gives comparably more expected counts to the smaller TTS templates than ROOTN." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-111", "text": "for all TTS Template t do if it is the last iteration then P r(t) = exp(\u03c8(EC(t)+\u03b1G 0 (t))) exp(\u03c8(( P t :t .root=t.root EC(t ))+\u03b1))" }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-112", "text": "; else P r(t) = exp(\u03c8(EC(t)+\u03b1G 0 (t))) exp(\u03c8(( P t :t .lhs=t.lhs EC(t ))+\u03b1))" }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-113", "text": "; Figure 6 : M-step of the Variational EM Section 3.1, is given below:" }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-114", "text": "where G 0 is a base distribution of the TTS templates, t denotes a TTS template, \u03b8 t.root denotes the multinomial distribution over TTS templates with the same root as t, SG denotes the generative model for a pair of tree and string in Section 3.1, and \u03b1 is a free parameter which adjusts the rate at which new TTS templates are generated." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-115", "text": "It is intractable to do exact inference under the Bayesian framework, even with a conjugate prior such as DP." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-116", "text": "Two methods are commonly used for approximate inference: Markov chain Monte Carlo (MCMC) (DeNero et al., 2008) , and Variational Bayesian (VB) inference (Blunsom et al., 2008) ." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-117", "text": "In this paper, the latter approach is used because it requires less running time." 
}, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-118", "text": "The E-step of VB is exactly the same as standard EM, and in the M-step the digamma function \u03c8 and the base distribution G 0 are used to increase the uncertainty of the model." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-119", "text": "Similar to standard EM, both LHS-and root-based normalizations are used in the M-step, as shown in Figure 3 .1." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-120", "text": "For the TTS templates, which are also pairs of subtrees and strings, a natural choice of G 0 is the generative models described in Section 3.1." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-121", "text": "Because G 0 estimates the probability of the new TTS templates, the root-based generative model is superior to the LHS-based generative model and used in our approach." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-122", "text": "----------------------------------" }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-123", "text": "**INITIALIZATION**" }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-124", "text": "Since the EM algorithm only converges to a local minimum, proper initializations are needed to achieve good performance for both standard EM and variational EM." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-125", "text": "For the baseline templates derived from word alignments, the initial counts are set to the raw counts in the training corpus." 
}, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-126", "text": "For the templates blindly extracted from big templates, the raw count of a LHS tree fragment is distributed among their RHSs based on the likelihood of the template, computed by combining for all big template t do for all template g extracted from t do g.count = g.lhs.count = 0; for all template g extracted from t do g.count += w in(g)\u00d7w out(g, t); g.lhs.count += w in(g)\u00d7w out(g, t); for all template g extracted from t do g.init += g.count g.lhs.count ; Figure 7 : Compute the initial counts of the liberated TTS templates the word-based inside/outside scores." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-127", "text": "The algorithm is sketched in Figure 3 .2, where the inside score w in(g) is the product of the IBM Model 1 scores in both directions, computed based on the words in g's LHS and RHS." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-128", "text": "The outside score w out(g, t) is computed similarly, except that the IBM Model 1 scores are computed based on the words in the hosting template t's LHS/RHS excluding the words in g's LHS/RHS." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-129", "text": "The initial probabilities of the TTS templates are then computed by normalizing their initial counts using LHSN or ROOTN." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-130", "text": "----------------------------------" }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-131", "text": "**EXPERIMENTS**" }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-132", "text": "We train an English-to-Chinese translation system using the FBIS corpus, where 73,597 sentence pairs are selected as the training data, and 500 sentence pairs with no more than 25 words on the Chinese side are selected for both the development and test data." 
}, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-133", "text": "5 Charniak (2000)'s parser, trained on the Penn Treebank, is used to generate the English syntax trees." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-134", "text": "Modified Kneser-Ney trigram models are trained using SRILM (Stolcke, 2002) upon the Chinese portion of the training data." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-135", "text": "The trigram language model, as well as the TTS templates generated based on different methods, are used in the TTS transducer." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-136", "text": "The model weights of the transducer are tuned based on the development set using a grid-based line search, and the translation results are evaluated based on a single Chinese reference 6 using BLEU-4 (Papineni et al., 2002) ." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-137", "text": "Huang et al. (2006) used character-based BLEU as a way of normalizing inconsistent Chinese word segmentation, but we avoid this problem as the training, development, and test data are from the same source." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-138", "text": "Table 4 : BLEU-4 scores (test set) of the union alignment, using TTS templates up to a certain size, in terms of the number of leaves in their LHSs" }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-139", "text": "----------------------------------" }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-140", "text": "**BASELINE SYSTEMS**" }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-141", "text": "GHKM (Galley et al., 2004 ) is used to generate the baseline TTS templates based on the word alignments computed using GIZA++ and different combination methods, including union and the diagonal growing heuristic (Koehn et al., 2003) ." 
}, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-142", "text": "We also tried combining alignments from GIZA++ based on intersection, but it is worse than both single-direction alignments, due to its low coverage of training corpus and the incomplete translations it generates." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-143", "text": "The baseline translation results based on ROOTN are shown in Table 4 .1." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-144", "text": "The first two columns in the table show the results of the two single direction alignments." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-145", "text": "e2c and c2e denote the many English words to one Chinese word alignment and the many Chinese words to one English word alignment, respectively." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-146", "text": "The two rows show the results with and without the big templates, from which we can see that removing the big templates does not affect performance much; this verifies our postulate that the big templates have very little chance of being used in the translation." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-147", "text": "Table 4 .1, using the union alignments as the representative and measuring a template's size by the number of leaves in its LHS, also demonstrates that using big TTS templates brings very limited performance gain." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-148", "text": "The result that the union-based combination outperforms either single direction alignments and even the heuristic-based combination, combined with the statistics of the disabled corpus in Section 2.2, shows that more disabled training corpus actually leads to better performance." 
}, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-149", "text": "This can be explained by the fact that the union alignments have the largest number of noisy alignments gathered together in the big templates, and thus have the least amount of noisy alignments which lead to small and low-quality TTS templates." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-150", "text": "----------------------------------" }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-151", "text": "**LEARNING PHRASAL TTS TEMPLATES**" }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-152", "text": "To test our learning methods, we start with the TTS templates generated based on e2c, c2e, and union alignments using GHKM." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-153", "text": "This gives us 0.98M baseline templates." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-154", "text": "We use the big templates from the union alignments as the basis and extract 10.92M new phrasal TTS templates, which, for convenience, are denoted by NEW-PHR." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-155", "text": "Because based on Table 1 and Table 2 the union alignment has the greatest number of alignment links and therefore produces the largest rules, this gives us the greatest flexibility in realigning the input sentences." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-156", "text": "The baseline TTS templates as well as NEW-PHR are initialized using the method in Section 3.3 for both annealing EM and annealing VB." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-157", "text": "To simplify the experiments, the same Dirichlet Process prior is used for all multinomial distributions of the TTS templates with different roots." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-158", "text": "G 0 in the Dirichlet prior is computed based on the 1-level TTS templates selected from the baseline TTS templates, so that the big templates are efficiently penalized." 
}, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-159", "text": "The training algorithms follow the same annealing schedule, where the temperature parameter \u03b2 is initialized to 0.1, and gradually increased to 1." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-160", "text": "We experiment with the two training algorithms, annealing EM and annealing VB, with different normalization methods." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-161", "text": "The experimental results based on the development data are shown in Figure 4 .2, where the free parameter \u03b1 of annealing VB is set to 1, 100, and 100 respectively for ROOTN, LHSN, and MIXN." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-162", "text": "The results verify that LHSN is worse than ROOTN in predicting the translations, since MIXN outperforms LHSN with both annealing EM and VB." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-163", "text": "ROOTN is on par with MIXN and much better Figure 4 .2 shows the optimized results of the development set based on annealing VB with different \u03b1." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-164", "text": "The best performance is achieved as \u03b1 approaches 1, 100, and 100 for ROOTN, LHSN and MIXN respectively." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-165", "text": "The \u03b1 parameter can be viewed as a weight used to balance the expected counts and the probabilities from G 0 ." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-166", "text": "Thus it is reasonable for LHSN and MIXN to have bigger optimal \u03b1 than ROOTN, since ROOTN gives lower expected counts to NEW-PHR than LHSN and MIXN do." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-167", "text": "To see the contribution of the phrasal template extraction in the performance gain, MT experiments are conducted by turning this component on and off." 
}, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-168", "text": "Results on the test set, obtained by using parameters optimized on the development set, are shown in Table 4 .2." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-169", "text": "The template counts used in the Max-Likelihood training are the same as the ones used in the initialization of annealing EM and VB." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-170", "text": "Results show that for annealing EM and VB, use of NEW-PHR greatly improves performance, while for the Max-Likelihood training, use of NEW-PHR hurts performance." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-171", "text": "This is not surprising, because Max-Likelihood training cannot efficiently filter out the noisy phrasal templates introduced in the initial NEW-PHR." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-172", "text": "Another observation is that annealing VB does not always outperform annealing EM." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-173", "text": "With NEW-PHR turned on, annealing VB shows consistent superiority over annealing EM; while without NEW-PHR, it only outperforms annealing EM based on LHSN and MIXN, and the improvement is not as big as when NEW-PHR is turned on." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-174", "text": "This indicates that without NEW-PHR, there is less need to use VB to shrink down the size of the template set." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-175", "text": "Table 4 .2 shows the statistics of the initial template set including NEW-PHR and the final TTS template set after annealing VB is conducted, where we can see annealing VB efficiently reduces NEW-PHR to a relatively small size and results in much more compact systems than the system based on the baseline templates from GIZA++ alignments." 
}, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-176", "text": "Comparing with the best GIZA++-based system union, our best system, utilizing NEW-PHR and the two-stage template normalization, demonstrates the strength of annealing VB by an absolute improvement of 2.29% in BLEU-4 score, from 14.55 to 16.84." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-177", "text": "This improvement is significant at p < 0.005 based on 2000 iterations of paired bootstrap re-sampling of the test set (Koehn, 2004) ." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-178", "text": "----------------------------------" }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-179", "text": "**DISCUSSION**" }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-180", "text": "Our experimental results are obtained based on a relatively small training corpus, the improved performance may be questionable when a larger training corpus is used." }, { "sent_id": "7b69c68a602e90c7ee7b9fa8a8facf-C001-181", "text": "Someone may wonder if the performance gain primarily comes from the" } ], "y": { "@USE@": { "gold_contexts": [ [ "7b69c68a602e90c7ee7b9fa8a8facf-C001-10" ], [ "7b69c68a602e90c7ee7b9fa8a8facf-C001-102", "7b69c68a602e90c7ee7b9fa8a8facf-C001-98" ] ], "cite_sentences": [ "7b69c68a602e90c7ee7b9fa8a8facf-C001-10", "7b69c68a602e90c7ee7b9fa8a8facf-C001-98" ] }, "@BACK@": { "gold_contexts": [ [ "7b69c68a602e90c7ee7b9fa8a8facf-C001-50" ], [ "7b69c68a602e90c7ee7b9fa8a8facf-C001-94" ], [ "7b69c68a602e90c7ee7b9fa8a8facf-C001-98" ] ], "cite_sentences": [ "7b69c68a602e90c7ee7b9fa8a8facf-C001-50", "7b69c68a602e90c7ee7b9fa8a8facf-C001-94", "7b69c68a602e90c7ee7b9fa8a8facf-C001-98" ] }, "@SIM@": { "gold_contexts": [ [ "7b69c68a602e90c7ee7b9fa8a8facf-C001-50", "7b69c68a602e90c7ee7b9fa8a8facf-C001-53" ], [ "7b69c68a602e90c7ee7b9fa8a8facf-C001-58" ] ], "cite_sentences": [ "7b69c68a602e90c7ee7b9fa8a8facf-C001-50", "7b69c68a602e90c7ee7b9fa8a8facf-C001-58" ] } } }, 
"ABC_cc3d38692097020ee7f4f17cf9247d_35": { "x": [ { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-2", "text": "Extracting semantic relations between entities from natural language text is an important step towards automatic knowledge extraction from large text collections and the Web." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-3", "text": "The state-of-the-art approach to relation extraction employs Support Vector Machines (SVM) and kernel methods for classification." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-4", "text": "Despite the diversity of kernels and the near exhaustive trial-and-error on kernel combination, there lacks a clear understanding of how these kernels relate to each other and why some are superior than others." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-5", "text": "In this paper, we provide an analysis of the relative strength and weakness of several kernels through systematic experimentation." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-6", "text": "We show that relation extraction can benefit from increasing the feature space through convolution kernel and introducing bias towards more syntactically meaningful feature space." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-7", "text": "Based on our analysis, we propose a new convolution dependency path kernel that combines the above two benefits." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-8", "text": "Our experimental results on the standard ACE 2003 datasets demonstrate that our new kernel gives consistent and significantly better performance than baseline methods, obtaining very competitive results to the state-ofthe-art performance." 
}, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-9", "text": "----------------------------------" }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-99", "text": "**SHORTEST PATH DEPENDENCY KERNEL**" }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-11", "text": "There exists a large body of knowledge embedded in unstructured natural language text on the Web." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-12", "text": "The sheer volume and heterogeneity of such knowledge renders traditional rule-based and manually-crafted knowledge extraction systems unsuitable." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-13", "text": "Thus it calls for methods that automatically extract knowledge from natural language text." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-14", "text": "An important step towards automatic knowledge discovery is to extract semantic relations between entities." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-15", "text": "Two types of collections are commonly studied for relation extraction." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-16", "text": "The first type is annotated newswire text made available by programs such as Message Understanding Conferences (MUC) and Automatic Content Extraction (ACE)." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-17", "text": "The types of entities that are of interest to these programs include person, organization, facilities, location and GPE (Geo-political entities) ." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-18", "text": "Given entities in a document, the relation extraction task is to identify explicit semantic relationship such as Located-In and Citizen-Of between pairs of entities." 
}, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-19", "text": "For example, in the sentence \"The funeral was scheduled for Thursday in Paris at the Saint-Germain-des-Pres Church\", the organization Saint-Germain-des-Pres Church is \"Located-In\" GPE Paris." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-20", "text": "The second type of collection that has been widely studied is biomedical literature (Bunescu and Mooney, 2005b; Giuliano et al., 2006; McDonald et al., 2005b) , promoted by evaluation programs such as BioCreAtIvE and JNLPBA 2004 ." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-21", "text": "In this particular domain, studies often focus on specific entities such as genes and proteins." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-22", "text": "And the kinds of relations to extract are usually gene-toprotein interactions." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-23", "text": "The predominant approach to relation extraction treats the task as a multi-class classification problem, in which different relation types form different output classes." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-24", "text": "Early work employed a diverse range of features in a linear classifier (commonly referred to as \"feature-based\" approaches), including lexical features, syntactic parse features, dependency features and semantic features (Jiang and Zhai, 2007; Kambhatla, 2004; Zhou et al., 2005) ." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-25", "text": "These approaches were hindered by drawbacks such as limited feature space and excessive feature engineering." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-26", "text": "Kernel methods (Cortes and Vapnik, 1995; Cristianini and Shawe-Taylor, 2000) on the other hand can explore a much larger feature space very efficiently." 
}, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-27", "text": "Recent studies on relation extraction have shown that by combining kernels with Support-vector Machines (SVM), one can obtain results superior to feature-based methods (Bunescu and Mooney, 2005b; Bunescu and Mooney, 2005a; Culotta and Sorensen, 2004; Cumby and Roth, 2003; Zelenko et al., 2003; Zhang et al., 2006a; Zhang et al., 2006b; Zhao and Grishman, 2005) ." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-28", "text": "Despite the large number of recently proposed kernels and their reported success, there lacks a clear understanding of their relative strength and weakness." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-29", "text": "In this study, we provide a systematic comparison and analysis of three such kernels -subsequence kernel (Bunescu and Mooney, 2005b) , dependency tree kernel (Culotta and Sorensen, 2004) and dependency path kernel (Bunescu and Mooney, 2005a) ." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-30", "text": "We replicated these kernels and conducted experiments on the standard ACE 2003 newswire text evaluation set." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-31", "text": "We show that whereas some kernels are less effective than others, they exhibit properties that are complementary to each other." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-32", "text": "In particular, We found that relation extraction can benefit from increasing the feature space through convolution kernel and introducing bias towards more syntactically meaningful feature space." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-33", "text": "Drawn from our analysis, we further propose a new convolution dependency path kernel which combines the benefits of the subsequence kernel and shortest path dependency kernel." 
}, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-34", "text": "Comparing to the previous kernels, our new kernel gives consistent and significantly better performance than all three previous kernels that we look at." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-35", "text": "----------------------------------" }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-36", "text": "**RELATED WORK**" }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-37", "text": "Statistical methods for relation extraction can be roughly categorized into two categories: featurebased and kernel-based." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-38", "text": "Feature-based methods (Jiang and Zhai, 2007; Kambhatla, 2004; Zhou et al., 2005) use pre-defined feature sets to extract features to train classification models." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-39", "text": "Zhou et al. (2005) manually crafted a wide range of features drawn from sources such as lexical, syntactic and semantic analyses." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-40", "text": "Combined with SVM, they reported the best results at the time on ACE corpus." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-41", "text": "Kambhatla (2004) took a similar approach but used multivariate logistic regression (Kambhatla, 2004) ." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-42", "text": "Jiang & Zhai (2007) gave a systematic examination of the efficacy of unigram, bigram and trigram features drawn from different representations -surface text, constituency parse tree and dependency parse tree." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-43", "text": "One drawback of these feature-based methods is that the feature space that can be explored is often limited." 
}, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-44", "text": "On the other hand, kernel-based methods offer efficient solutions that allow us to explore a much larger (often exponential, or in some cases, infinite) feature space in polynomial time, without the need to explicitly represent the features." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-45", "text": "Lodhi et al. (2002) described a convolution string kernel, which measures the similarity between two strings by recursively computing matching of all possible subsequences of the two strings." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-46", "text": "Bunescu & Mooney (2005b) generalized the string kernel to work with vectors of objects occurred in relation extraction." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-47", "text": "In a later work also done by Bunescu & Mooney (2005a) , they proposed a kernel that computes similarities between nodes on the shortest dependency paths that connect the entities." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-48", "text": "Their kernel assigns no-match to paths that are of different length." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-49", "text": "And for paths that are of the same length, it simply computes the product of the similarity score of node pairs at each index." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-50", "text": "The dependency tree kernel proposed by Zelenko et al. (2003) was also inspired by the string kernel of Lodhi et al. (2002) ." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-51", "text": "Their kernel walks down the parse trees from the root and computes a similarity score for children nodes at each depth level using the same subsequence algorithm as the string kernel." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-52", "text": "Culotta & Sorensen (2004) worked on the same idea but applied it to dependency parse trees." 
}, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-53", "text": "Prior to these two tree kernels, Collins & Duffy (2001) proposed a convolution tree kernel for natural language tasks." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-54", "text": "Their kernel has since been applied to relation extraction by Zhang et al. (2006a) ." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-55", "text": "The tree kernel considers matching of all subtrees that share the same production rule at the root of the subtree." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-56", "text": "Zhang et al. (2006a) showed results that are significantly better than the previous two dependency tree kernels." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-57", "text": "They obtained further improvements in their later paper (2006b) by composing the tree kernel with a simple entity kernel and raising the composite kernel to polynomial degree 2." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-58", "text": "Another study on kernel composition is the work by Zhao & Grishman (2005) ." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-59", "text": "It is worth noting that although there exist standard evaluation datasets such as ACE 2003 and 2004, many of the aforementioned work report results on non-standard datasets or splits, making it difficult to directly compare the performance." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-60", "text": "We feel that there is a sense of increasing confusion down this line of research." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-61", "text": "Although partly due to the lack of compatibility in evaluation results, we believe it is more due to the lack of understanding in the relative strength and weakness of these kernels." 
}, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-62", "text": "Therefore we focus on analyzing and understanding the pros and cons of different kernels, through systematic comparison and experimentation." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-63", "text": "----------------------------------" }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-64", "text": "**KERNEL METHODS FOR RELATION EXTRACTION**" }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-65", "text": "In this Section we first give a very brief introduction to kernel methods." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-66", "text": "We then present the algorithms behind three kernels that we are particularly interested in: subsequence kernel (Bunescu and Mooney, 2005b) , dependency tree kernel (Culotta and Sorensen, 2004) and shortest path dependency kernel (Bunescu and Mooney, 2005a) ." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-67", "text": "----------------------------------" }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-68", "text": "**SVM AND KERNELS**" }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-69", "text": "Support-Vector Machines (Cortes and Vapnik, 1995; Cristianini and Shawe-Taylor, 2000) learn to find hyperplanes that separate the positive and negative data points so that the margin between the supportvector points and the hyperplane is maximized." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-70", "text": "The dual formulation of the optimization problem involves only computing the dot product of feature vectors." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-71", "text": "This is equivalent to mapping the data points into a high dimensional space. And the separating plane learnt in the high dimensional space can give non-linear decision boundaries." 
}, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-72", "text": "The dot product of data points can be computed using a kernel function K(X, Y ) = \u03c6(X), \u03c6(Y ) for any mapping function." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-73", "text": "A valid kernel function satisfies certain properties: it is symmetric and the Gram matrix G formed by K(X, Y ) is positive semi-definite." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-74", "text": "----------------------------------" }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-75", "text": "**SUBSEQUENCE KERNEL**" }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-76", "text": "The subsequence kernel introduced in (Bunescu and Mooney, 2005b ) is a generalization of the string kernel first introduced by Lodhi et al. (2002) ." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-77", "text": "The feature space of the original string kernel \u03a3 string k ernel is defined as \u03a3 string k ernel = \u03a3 char , where \u03a3 char is simply a set of characters." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-78", "text": "Bunescu & Mooney (2005a) re-defined the feature space to be \u03a3 x = \u03a3 1 \u00d7 \u03a3 2 \u00d7 \u00b7 \u00b7 \u00b7 \u00d7 \u03a3 k , where \u03a3 1 , \u03a3 2 , \u00b7 \u00b7 \u00b7 , \u03a3 k can be some arbitray disjoint feature spaces, such as the set of words, part-of-speech (POS) tags, etc." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-79", "text": "We can measure the number of common features shared by two feature vectors x, y \u2208 \u03a3 x using function c(x, y)." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-80", "text": "Let s, t be two sequences over the feature set \u03a3 x , we use |s| to denote the length of s. Thus s can be written out as s 1 \u00b7 \u00b7 \u00b7 s |s| ." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-81", "text": "We use s[i : j] to denote a continuous subsequence s i \u00b7 \u00b7 \u00b7 s j of s. 
Let i = (i_1, \u00b7\u00b7\u00b7, i_|i|) be a sequence of |i| indices in s; we define the length of the index sequence i to be l(i) = i_|i| \u2212 i_1 + 1." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-82", "text": "Similarly we have an index sequence j in t of length l(j)." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-83", "text": "Let \u03a3_\u222a = \u03a3_1 \u222a \u03a3_2 \u222a \u00b7\u00b7\u00b7 \u222a \u03a3_k be the set of all possible features." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-84", "text": "A sequence u \u2208 \u03a3_\u222a* is a subsequence of the feature vector sequence s if there exists a sequence of |u| indices i such that u_k \u2208 s_{i_k}, \u2200k \u2208 {1, \u00b7\u00b7\u00b7, |u|}. Following the notation in (Bunescu and Mooney, 2005b; Cumby and Roth, 2003) , we use u \u227a s[i] as a shorthand for the above componentwise '\u2208' relationship." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-85", "text": "Now we can define the kernel function K_n(s, t) to be the total number of weighted common subsequences of length n between the two sequences s and t: K_n(s, t) = \u2211_{i, j : |i| = |j| = n} \u03bb^{l(i) + l(j)} \u220f_{k=1}^{n} c(s_{i_k}, t_{j_k})," }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-86", "text": "where \u03bb \u2264 1 is a decaying factor penalizing long, sparse subsequences." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-87", "text": "This kernel function can be re-written recursively; Bunescu and Mooney (2005b) showed that, using the recursive dynamic programming algorithm from (Cumby and Roth, 2003) , the kernel K_n(s, t) can be computed in O(kn|s||t|) time." 
}, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-88", "text": "----------------------------------" }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-89", "text": "**FROM SUBSEQUENCE TO TREE KERNELS**" }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-90", "text": "We will use an example to illustrate the relation between the dependency tree kernels proposed by (Culotta and Sorensen, 2004; Zelenko et al., 2003) and the subsequence kernel we introduced above." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-91", "text": "Consider two instances of the \"Located-In\" relations \"his actions in Brcko\" and \"his recent arrival in Beijing\"." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-92", "text": "The dependency parse trees of these two sentences are shown below." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-93", "text": "The entities in these two relations are the pronoun mentions of \"his\", and two locations \"Brcko\" and \"Beijing\", all shown in italic." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-94", "text": "The dependency tree kernel visits nodes in the two trees starting from the root. And at each depth level, it takes nodes that are at that level and form two sequences of nodes." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-95", "text": "For example, in the example instances, nodes at one level below the root forms vectors s= {his, PRP, PERSON},{in, IN} and t= {his,PRP,PERSON},{recent, ADJ},{in, IN} ." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-96", "text": "It then makes use of the subsequence kernel in the previous section to compute the total number of weighted subsequences between these two vectors." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-97", "text": "The kernel returns the sum of subsequence matching scores at each depth level as the final score." 
}, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-98", "text": "----------------------------------" }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-100", "text": "The shortest path dependency kernel proposed by Bunescu & Mooney (2005a) also works with dependency parse trees." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-101", "text": "Reuse our example in the previous section, the shortest dependency path between entity his and Brcko in the first sentence is s= {his, PRP, PERSON}, {actions, NNS, NOUN}, {in, IN}, {Brcko, NNP, NOUN, LOCATION} ; and the path between his and Beijing in the second sentence is t= {his, PRP, PERSON}, {arrival, NN, NOUN}, {in, IN}, {Beijing, NNP, NOUN, LOCATION} ." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-102", "text": "Since most dependency parser output connected trees, finding the shortest path between two nodes is trivial." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-103", "text": "Once the two paths are found, the kernel simply computes the product of the number of common features between a pair of nodes at each index along the path." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-104", "text": "If the two paths have different number of nodes, the kernel assigns 0 (no-match) to the pair." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-105", "text": "Formally, the kernel is defined as:" }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-106", "text": "----------------------------------" }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-107", "text": "**EXPERIMENTS AND ANALYSIS**" }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-108", "text": "We implemented the above three kernels and conducted a set of experiments to compare these kernels." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-109", "text": "By minimizing divergence in our experiment setup and implementation for these kernels, we hope to reveal intrinsic properties of different kernels." 
}, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-110", "text": "----------------------------------" }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-111", "text": "**EXPERIMENT SETUP**" }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-112", "text": "We conducted experiments using the ACE 2003 standard evaluation set." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-113", "text": "Training set of this collection contains 674 doc and 9683 relations." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-114", "text": "The test set contains 97 doc and 1386 relations." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-115", "text": "5 entity types (Person, Organization, Location, Facilities and Geopolitical Entities) and 5 top-level relation types (At, Near, Part-of, Role and Social) are manually annotated in this collection." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-116", "text": "Since no development set is given, we report results in this section only on the training set, using 5-fold cross-validation, and defer the comparison of results on the test set till Section 6." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-117", "text": "Corpus preprocessing is done as the following: sentence segmentation was performed using the tool from CCG group at UIUC 1 ; words are then tokenized and tagged with part-of-speech using MX-POST (Ratnaparkhi, 1996) and dependency parsing is performed using MSTParser (McDonald et al., 2005a) ." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-118", "text": "We used the SVM-light (Joachims, 2002) toolkit and augmented it with our custom kernels." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-119", "text": "SVM parameters are chosen using cross-validation (C=2.4), and the decaying factor in all kernels are uniformally set to be 0.75." 
}, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-120", "text": "We report precision (P), recall (R) and F-measure (F) on the training (5-fold cross-validation) and test set." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-121", "text": "----------------------------------" }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-122", "text": "**COMPARISON OF KERNELS**" }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-123", "text": "In table 1 we listed results of the above three kernels on the training set using 5-fold cross-validation." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-124", "text": "A first glimpse of the results tells us that the shortest path kernel performs the best in terms of F-measure, while the dependency tree kernel did the worst." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-125", "text": "The performance of subsequence kernel is not as good as the dependency path kernel, but the difference is small." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-126", "text": "In particular, the subsequence kernel gave the best recall, whereas the dependency path kernel gave the highest precision." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-127", "text": "To understand why shortest path kernel performs better than the subsequence kernel, let us review the definition of these two kernels." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-128", "text": "The subsequence kernel considers all subsequences of feature vector sequences that are formed by all words occurred inbetween two entities in a sentence; while the shortest path kernel only considers feature vector sequences formed by words that are connected through a dependency path." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-129", "text": "In general, the sequences considered in the dependency path kernel are more compact than the sequences used in the subsequence kernel." 
}, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-130", "text": "Actually, in most cases the dependency path sequence is indeed one particular subsequence of the entire subsequence used in subsequence kernel." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-131", "text": "Arguably, this particular subsequence is the one that captures the most important syntactic information." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-132", "text": "Although the feature spaces of the dependency path kernels are not subsets of the subsequence kernel, we can clearly see that we get higher precisions by introducing bias towards the syntactically more meaningful feature space." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-133", "text": "However, the dependency path kernel is fairly rigid and imposes many hard constraints such as requiring the two paths to have exactly the same number of nodes." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-134", "text": "This restriction is counter-intuitive." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-135", "text": "To illustrate this, let us reconsider the example given in Section 3." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-136", "text": "In that example, it is obviously the case that the two instances of relations have very similar dependency path connecting the entities." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-137", "text": "However, the second path is one node longer than the first path, and therefore the dependency path kernel will declare no match for them." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-138", "text": "The subsequence kernel, on the other hand, considers subsequence matching and therefore inherently incorporates a notion of fuzzy matching." 
}, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-139", "text": "Furthermore, we have observed from the training data that many short word sequences carry strong relational information; hence only part of the entire dependency path is truly meaningful in most cases." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-140", "text": "It also helps to understand why subsequence kernel has better recall than dependency path kernel." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-141", "text": "(Zhang et al., 2006b) 0.773 0.656 0.709 (Zhang et al., 2006b) The disappointing performance of the dependency tree kernel can also be explained by our analysis." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-142", "text": "Although the dependency tree kernel performs subsequence matching for nodes at each depth level, it is unclear what the relative syntactic or semantic relation is among sibling nodes in the dependency tree." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-143", "text": "The sequence formed by sibling nodes is far less intuitive from a linguistic point of view than the sequence formed by nodes on a dependency path." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-144", "text": "To summarize the above results, we found that dependency path kernel benefits from a reduction in feature space by using syntactic dependency information." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-145", "text": "But the subsequence kernel has an edge in recall by allowing fuzzy matching and expanding the feature space into convolution space." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-146", "text": "We will show in the following section that these two benefits are complementary and can be combined to give better performance." 
}, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-147", "text": "----------------------------------" }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-148", "text": "**COMBINING THE BENEFITS -A NEW KERNEL**" }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-149", "text": "It is a natural extension to combine the two benefits that we have identified in the previous section." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-150", "text": "The idea is simple: we want to allow subsequence matching in order to gain more flexibility and therefore higher recall, but constrain the sequence from which to deduce subsequences to be the dependency path sequence." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-151", "text": "We call the combined kernel a \"convolution dependency path kernel\"." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-152", "text": "----------------------------------" }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-153", "text": "**FINAL TEST RESULTS**" }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-154", "text": "We obtained the final results on the test set of the ACE 2003 collection, using the same experimental setting as above." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-155", "text": "The results are listed in Table 2 ." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-156", "text": "From the table we can see that the performances of the previous three kernels hold up qualitatively on the test set as cross-validation on training set." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-157", "text": "There is one exception that the shortest path kernel's Fmeasure score is no longer better than the subsequence kernel on the test set, but the difference is small. And our new convolution dependency path kernel beats all above three kernels in precision, recall and F-measure, suggesting that our analysis is accurate and the benefits we outlined are truly complementary." 
}, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-158", "text": "Comparing to the best reported results on the same test set from (Zhang et al., 2006b) , our scores are not as high, but the results are quite competitive, given our minimum efforts on tuning kernel parameters and trying out kernel combinations." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-159", "text": "----------------------------------" }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-160", "text": "**CONCLUSION**" }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-161", "text": "We re-examined three existing kernel methods for relation extraction." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-162", "text": "We conducted experiments on the standard ACE 2003 evaluation set and showed that whereas some kernels are less effective than others, they exhibit properties that are complementary to each other." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-163", "text": "In particular, we found that relation extraction can benefit from increasing the feature space through convolution kernel and introducing bias towards more syntactically meaningful feature space." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-164", "text": "Drawn from our analysis, we proposed a new convolution dependency path kernel which combines the benefits of the subsequence kernel and shortest path dependency kernel." }, { "sent_id": "cc3d38692097020ee7f4f17cf9247d-C001-165", "text": "Comparing with previous kernels, our new kernel consistently and significantly outperforms all three previous kernels, suggesting that our analyses of the previously proposed kernels are correct." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "cc3d38692097020ee7f4f17cf9247d-C001-27" ], [ "cc3d38692097020ee7f4f17cf9247d-C001-47" ], [ "cc3d38692097020ee7f4f17cf9247d-C001-100" ] ], "cite_sentences": [ "cc3d38692097020ee7f4f17cf9247d-C001-27", "cc3d38692097020ee7f4f17cf9247d-C001-47", "cc3d38692097020ee7f4f17cf9247d-C001-100" ] }, "@MOT@": { "gold_contexts": [ [ "cc3d38692097020ee7f4f17cf9247d-C001-27", "cc3d38692097020ee7f4f17cf9247d-C001-28" ] ], "cite_sentences": [ "cc3d38692097020ee7f4f17cf9247d-C001-27" ] }, "@USE@": { "gold_contexts": [ [ "cc3d38692097020ee7f4f17cf9247d-C001-29" ], [ "cc3d38692097020ee7f4f17cf9247d-C001-66" ] ], "cite_sentences": [ "cc3d38692097020ee7f4f17cf9247d-C001-29", "cc3d38692097020ee7f4f17cf9247d-C001-66" ] } } }, "ABC_a789aea59eebfefb990dfc6367d323_35": { "x": [ { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-2", "text": "We present a system for extracting the dates of illness events (year and month of the event occurrence) from posting histories in the context of an online medical support community." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-3", "text": "A temporal tagger retrieves and normalizes dates mentioned informally in social media to actual month and year referents." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-4", "text": "Building on this, an event date extraction system learns to integrate the likelihood of candidate dates extracted from time-rich sentences with temporal constraints extracted from eventrelated sentences." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-5", "text": "Our integrated model achieves 89.7% of the maximum performance given the performance of the temporal expression retrieval step." 
}, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-6", "text": "----------------------------------" }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-116", "text": "This task is similar to the sentence event-DCT ordering task in TempEval-2 (UzZaman and Allen, 2010)." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-8", "text": "In this paper we present a challenging new event date extraction task." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-9", "text": "Our technical contribution is a temporal tagger that outperforms previously published baseline approaches in its ability to identify informal temporal expressions (TE) and that normalizes each of them to an actual month and year (Chang and Manning, 2012; Strotgen and Gertz, 2010) ." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-10", "text": "This temporal tagger then contributes towards high performance at matching event mentions with the month and year in which they occurred based on the complete posting history of users." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-11", "text": "It does so with high accuracy on informal event mentions in social media by learning to integrate the likelihood of multiple candidate dates extracted from event mentions in timerich sentences with temporal constraints extracted from event-related sentences." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-12", "text": "Despite considerable prior work in temporal information extraction, to date state-of-the-art resources are designed for extracting temporally scoped facts about public figures/organizations from newswire or Wikipedia articles McClosky and Manning, 2012; Garrido et [11/15/2008] I have noticed some pulling recently and I won't start rads until March." 
}, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-13", "text": "[11/20/2008] It is sloowwwly healing, so slowly, in fact, that she said she HOPES it will be healed by March, when I am supposed to start rads." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-14", "text": "al., 2012)." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-15", "text": "When people are instead communicating informally about their lives, they refer to time more informally and frequently from their personal frame of reference rather than from an impersonal third person frame of reference." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-16", "text": "For example, they may use their own birthday as a time reference." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-17", "text": "The proportion of relative (e.g., \"last week\", \"two days from now\"), or personal time references in our data is more than one and a half times as high as in newswire and Wikipedia." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-18", "text": "Therefore, it is not surprising that there would be difficulty in applying a temporal tagger designed for newswire to social media data (Strotgen and Gertz, 2012; Kolomiyets et al., 2011) ." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-19", "text": "Recent behavioral studies (Choudhury et al., 2013; Park and Choi, 2012; Wen et al., 2012) demonstrate that user-focused event mentions extracted from social media data can provide a useful timeline-like tool for studying how behavior patterns change over time in response to mentioned events." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-20", "text": "Our research contributes towards automating this work." 
}, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-21", "text": "----------------------------------" }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-22", "text": "**TASK**" }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-23", "text": "Our task is to extract personal illness events mentioned in the posting histories of online community participants." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-24", "text": "The input to our system is a candidate event and a posting history." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-25", "text": "The output is the event date (month and year) for the event if it occurred, or \"unknown\" if it did not occur." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-26", "text": "The process iterates through a list of 10 cancer events (CEs)." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-27", "text": "This list includes breast cancer Diagnosis, Metastasis, Recurrence, Mastectomy, Lumpectomy, Reconstruction, Chemotherapy-Start, Chemotherapy-End, Radiation-Start and Radiation-End." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-28", "text": "For each of these target CEs, we manually designed an event keyword set that includes the name of the event, abbreviations, slang, aliases and related words." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-29", "text": "For each of the 10 events, all sentences that mention a related event keyword are extracted from the user's posting history." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-30", "text": "Figure 1 shows sevaral sentences that were extracted for one user for the start date of Radiation." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-31", "text": "The task is to determine that the beginning of this user's Radiation therapy was 2/2009." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-32", "text": "Note that the user began to post about Radiation before she started it." 
}, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-33", "text": "She first reported planning to start Radiation in March, but then rescheduled for February." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-34", "text": "Most of the TEs are non-standard and need to be resolved to calendar dates (year and month)." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-35", "text": "Once the full set of event mention sentences has been extracted for a user, all the temporal expressions (TEs) that appear in the same sentence with an event mention are resolved to a set of candidate dates." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-36", "text": "Besides a standard event-time classifier for within-sentence event-time anchoring, we leverage a new source of temporal information to train a constraint-based event-time classifier." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-37", "text": "Previous work only retrieves time-rich sentences that include both the query and some TEs McClosky and Manning, 2012; Garrido et al., 2012) ." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-38", "text": "However, sentences that contain only the event mention but no explicit TE can also be informative." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-39", "text": "For example, the post time (usually referred to as document creation time or DCT) of the sentence \"metastasis was found in my bone\" might be labeled as being after the \"metastasis\" event date." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-40", "text": "These DCTs impose constraints on the possible event dates, which can be integrated with the event-time classifier, as a variant on related work (Chambers, 2012) ." 
}, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-41", "text": "----------------------------------" }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-42", "text": "**RELATED WORK**" }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-43", "text": "Previous work on TE extraction has focused mainly on newswire text (Strotgen and Gertz, 2010; Chang and Manning, 2012) ." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-44", "text": "This paper presents a rule-based TE extractor that identifies and resolves a higher percentage of nonstandard TEs than earlier state-of-art temporal taggers." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-45", "text": "Our task is closest to the temporal slot filling track in the TAC-KBP 2011 shared task and timelining task (McClosky and Manning, 2012) ." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-46", "text": "Their goal was to extract the temporal bounds of event relations." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-47", "text": "Our task has two key differences." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-48", "text": "First, they used newswire, Wikipedia and blogs as data sources from which they extract temporal bounds of facts found in Wikipedia infoboxes." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-163", "text": "----------------------------------" }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-49", "text": "Second, in the KBP task, the set of gold event relations are provided as input, so that the task is only to identify a date for an event that is guaranteed to have been mentioned." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-50", "text": "In our task, we provide a set of potential events." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-51", "text": "However, most of the candidate events won't have ever been reported within a user's posting history." 
}, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-52", "text": "Temporal constraints have proven to be useful for producing a globally consistent timeline." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-53", "text": "In most temporal relation bound extraction systems, the constraints are included as input rather than learned by the system (Talukdar et al., 2012; Wang et al., 2011) ." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-54", "text": "A notable exception is McCloskyet al. (2012) who developed an approach to learning constraints such as that people cannot attend school if they have not been born yet." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-55", "text": "A notable characteristic of our task is that constraints are softer." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-56", "text": "Diseases may occur in very different ways across patients." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-57", "text": "Recurring illnesses falsely appear to have an unpredictable order." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-58", "text": "Thus, there can be no universal logical constraints on the order of cancer events." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-59", "text": "Our approach to using temporal constraints is a variant on previously published approaches." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-60", "text": "Garrido et al. (2012) made use of DCT (document creation time) as well, however, they have assumed the DCT is within the time-range of the event stated in the document, which is often not true in our data." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-61", "text": "Chambers (2012) utilized the withinsentence time-DCT relation to learn constrains for predicting DCT." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-62", "text": "We learn the event-DCT relations to produce constrains for the event date." 
}, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-63", "text": "domly picked one post from each of 1,000 randomly selected users." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-64", "text": "We used this sampling technique because each user tends to use a narrow range of date expression forms." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-65", "text": "From these posts, we manually extracted 601 TEs and resolved them to a specific month and year or just year if the month was not mentioned." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-66", "text": "Events not reported to have occurred were annotated as \"unknown\"." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-67", "text": "Our corpus for event date extraction consists of the complete posting history of 300 users that were randomly drawn from our dataset." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-68", "text": "Three annotators were provided with guidelines for how to infer the date of the events (Wen et al., 2013) ." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-69", "text": "We achieved .94 Kappa on identification of whether an event has a reported event date in a user's history or not." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-70", "text": "In evaluation of agreement on extracted dates, we achieved a .99 Cronbach's alpha." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-71", "text": "From this corpus, 509 events were annotated with occurrence dates (year and month)." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-72", "text": "In our evaluation, we use data from 250 users for training, and 50 for testing." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-73", "text": "----------------------------------" }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-74", "text": "**METHOD**" }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-75", "text": "Now we explain on a more technical level how our system works on our task." 
}, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-76", "text": "Given an event and a user's post history, the system searches for all of the sentences that contain an event keyword (keyword sentence) and all the sentences that contain both a keyword and a TE (date sentence)." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-77", "text": "The TEs in the date sentences are resolved and then used as candidate dates for the event." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-78", "text": "For selecting among candidate dates, our model integrates two main components." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-79", "text": "First, the Date Classifier is trained from date sentences to predict how likely its candidate TE and the gold event date are to overlap." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-80", "text": "Then, because constraints over event dates can be informed by temporal relations between the event date and the DCT, the Constraint-based Classifier provides an indication of the plausibility of candidate dates." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-81", "text": "The integrated system combines the predictions from both classifiers." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-82", "text": "----------------------------------" }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-83", "text": "**TEMPORAL TAGGER**" }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-84", "text": "We design a rule-based temporal tagger that is built using regular expression patterns to recognize informal TEs." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-85", "text": "Similar to SUTime (Chang and Manning, 2012), we identify and resolve a wide range of non-standard TE types such as \"Feb '07 (2/2007)\"." 
}, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-86", "text": "The additional types of TE we handle include: 1)user-specific TEs: A user's age, cancer anniversary and survivorship can provide temporal information about the user's CEs." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-87", "text": "We obtain the birth date of users from their personal profile to resolve age date expressions such as \"at the age of 57\"." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-88", "text": "2)non-whole numbers such as \"a year and half\" and \"1/2 weeks\"." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-89", "text": "3)abbreviations of time units : e.g. \"wk\" as the abbreviation of \"week\"." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-90", "text": "4)underspecified month mentions, we resolve the year information according to the DCT month, the mentioned month and the verb tense." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-91", "text": "----------------------------------" }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-92", "text": "**DATE CLASSIFIER**" }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-93", "text": "We train a MaxEnt classifier to predict the temporal relationship between the retrieved TE and the event date as overlap or no-overlap, similar to the within-sentence event-time anchoring task in TempEval-2 (UzZaman and Allen, 2010)." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-94", "text": "Features for the classifier include many of those in (McClosky and Manning, 2012; Yoshikawa et al., 2009 ): namely, event keyword and its dominant verb, verb and preposition that dominate TE, dependency path between TE and keyword and its length, unigram and bigram word and POS features." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-95", "text": "New features include the Event-Subject, Negative and Modality features." 
}, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-96", "text": "In online support groups, users not only tell stories about themselves, they also share other patients' stories (as shown in Figure 1 )." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-97", "text": "So we add subject features to remove this kind of noise, which includes the governing subject of the event keyword and its POS tag." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-98", "text": "Modality features include the appearance of modals before the event keyword (e.g., may, might)." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-99", "text": "Negative features include the presence/absence of negative words (e.g., no, never)." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-100", "text": "These two features indicate a hypothetical or counter-factual expression of the event." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-101", "text": "To calculate the likelihood of a candidate date for an event, we need to aggregate the hard decisions from the classifier." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-102", "text": "Let DS u be the set of the user's date sentences, let D u be the set of dates resolved from each TE." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-103", "text": "We represent a MaxEnt classifier by P relation (R|t, ds) for a candidate date t in date sentence ds and possible relation R = {overlap, no-overlap}. We map the distribution over relations to a distribution over dates by defining P DateSentence (t|DS u ):" }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-104", "text": "We refer to this model as the Date Classifier." 
}, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-105", "text": "----------------------------------" }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-106", "text": "**CONSTRAINT-BASED CLASSIFIER**" }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-107", "text": "Previous work only retrieves time-rich sentences (i.e., date sentences) (Ling and Weld, 2010; McClosky and Manning, 2012; Garrido et al., 2012) ." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-108", "text": "However, keyword sentences can inform temporal constraints for events and therefore should not be ignored." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-109", "text": "For example, \"Well, I'm officially a Radiation grad!\" indicates the user has done radiation by the time of the post (DCT)." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-110", "text": "\"Radiation is not a choice for me.\" indicates the user probably never had radiation." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-111", "text": "The topic of the sentence can also indicate the temporal relation." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-112", "text": "For example, before chemotherapy, the users tend to talk about choices of drug combinations." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-113", "text": "After chemotherapy, they talk about side-effects." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-114", "text": "This section departs from the above Date Classifier and instead predicts whether each keyword sentence is posted before or overlap-or-after the user's event date." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-115", "text": "The goal is to automatically learn time constraints for the event." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-117", "text": "We create training examples by computing the temporal relation between the DCT and the user's gold event date." 
}, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-118", "text": "If the user has not reported an event date, the label should be unknown." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-119", "text": "We train a MaxEnt classifier on each event mention paired with its corresponding DCT." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-120", "text": "All the features used in the classifier component that are not related to the TEs are included." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-121", "text": "Let KS u be the set of the user's keyword sentences, let D u be the set of dates resolved from each date sentence." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-122", "text": "We define a MaxEnt classifier by P relation (R|ks) for a keyword sentence ks and possible relation R = {before, overlap-or-after, unknown}. DCT is the post time of the keyword sentence ks." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-123", "text": "The rel(DCT, t) function simply determines if the DCT is before or overlap-or-after the candidate date t. We map this distribution over relations to a distribution over dates by defining P KeywordSentence (t, KS u ):" }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-124", "text": "rel(dct, t) = before if dct < t overlap-or-after if dct \u2265 t" }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-125", "text": "----------------------------------" }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-126", "text": "**INTEGRATED MODEL**" }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-127", "text": "Given the Date Classifier of Section 5.2 and the Constraint-based Classifier of Section 5.3, we create a Integrated Model combining the two with the following linear interpolation as follows:" }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-128", "text": "where t is a candidate event date." 
}, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-129", "text": "The system will output t that maximizes P (t|posts u ) and unknown if DS u is empty." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-130", "text": "\u03bb was set to 0.7 by maximizing accuracy using five-fold cross-validation over the training set." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-131", "text": "6 Evaluation Metric and Results" }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-132", "text": "----------------------------------" }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-133", "text": "**TEMPORAL EXPRESSION RETRIEVAL**" }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-134", "text": "We compare our temporal tagger's performance with SUTime (Chang and Manning, 2012) on the 601 manually extracted TEs." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-135", "text": "We exclude userspecific TEs such as birthday references since SUTime cannot handle those." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-136", "text": "We first evaluate identification of the extent of a TE and then production of the correctly resolved date for each recognized expression." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-137", "text": "Table 1 shows that our tagger has significantly higher precision and recall for both." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-138", "text": "----------------------------------" }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-139", "text": "**.1 EVALUATION METRIC**" }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-140", "text": "The extracted date is only considered correct if it completely matches the gold date." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-141", "text": "For less than 4% of users, we have multiple dates for the same event (e.g., a user had a mastectomy twice)." 
}, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-142", "text": "Similar to the evaluation metric in a previous study , in these cases, we give the system the benefit of the doubt and the extracted date is considered correct if it matches one of the gold dates." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-143", "text": "In previous work (McClosky and Manning, 2012; , the evaluation metric score is defined as 1/((1 + |d|)) where d is the difference between the values in years." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-144", "text": "We choose a much stricter evaluation metric because we need a precise event date to study user behavior changes." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-145", "text": "----------------------------------" }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-146", "text": "**BASELINES AND ORACLE**" }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-147", "text": "Based on our temporal tagger, we provide two baselines to describe heuristic methods of aggregating the hard decisions from the classifier To set an upper bound on performance given our TE retrieval system, we calculate the oracle score by considering an extraction as correct if the gold date is one of the retrieved candidate dates." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-148", "text": "The oracle score can differ from a perfect score since we can only use candidate temporal expressions if (a)the relation is known and (b)mentions of the event are retrievable, (c)the TE and event keyword appear in the same sentence, and (d)our temporal tagger is able to recognize and resolve it correctly." 
}, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-149", "text": "----------------------------------" }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-150", "text": "**RESULTS**" }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-151", "text": "We present the performance of our models, baselines and the oracle in Table 3 : Performance of systems on the test set." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-152", "text": "Table 3 shows the performance of our systems and baselines on individual event types." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-153", "text": "The Joint Model derives most of its improvement from performance related to the Chemotherapy/Radiationstart date." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-154", "text": "This is mainly because Chemotherapy and Radiation last for a period of time and there are more event-related discussions containing the event keyword." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-155", "text": "None of our systems improves on cancer Metastasis and Recurrence." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-156", "text": "This is likely due to the sparsity of these events." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-157", "text": "----------------------------------" }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-158", "text": "**CONCLUSION**" }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-159", "text": "We presented a novel event date extraction task that requires extraction and resolution of nonstandard TEs, namely personal illness event dates, from the posting histories of online community participants." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-160", "text": "We constructed an evaluation corpus and designed a temporal tagger for non-standard TEs in social media." 
}, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-161", "text": "Using a much stricter standard correctness measure than in previous work, our method achieves promising results that are significantly better than two types of baseline." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-162", "text": "By creating an analogous keyword set, our event date extraction method could be easily adapted to other datasets." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-164", "text": "**ACKNOWLEDGMENTS**" }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-165", "text": "We want to thank Dong Nguyen and Yi-chia Wang, who helped provide the data for this project." }, { "sent_id": "a789aea59eebfefb990dfc6367d323-C001-166", "text": "The research reported here was supported by National Science Foundation grant IIS-0968485." } ], "y": { "@BACK@": { "gold_contexts": [ [ "a789aea59eebfefb990dfc6367d323-C001-12" ], [ "a789aea59eebfefb990dfc6367d323-C001-37" ], [ "a789aea59eebfefb990dfc6367d323-C001-107" ], [ "a789aea59eebfefb990dfc6367d323-C001-143" ] ], "cite_sentences": [ "a789aea59eebfefb990dfc6367d323-C001-12", "a789aea59eebfefb990dfc6367d323-C001-37", "a789aea59eebfefb990dfc6367d323-C001-107", "a789aea59eebfefb990dfc6367d323-C001-143" ] }, "@MOT@": { "gold_contexts": [ [ "a789aea59eebfefb990dfc6367d323-C001-37" ], [ "a789aea59eebfefb990dfc6367d323-C001-107" ] ], "cite_sentences": [ "a789aea59eebfefb990dfc6367d323-C001-37", "a789aea59eebfefb990dfc6367d323-C001-107" ] }, "@SIM@": { "gold_contexts": [ [ "a789aea59eebfefb990dfc6367d323-C001-45" ], [ "a789aea59eebfefb990dfc6367d323-C001-94" ] ], "cite_sentences": [ "a789aea59eebfefb990dfc6367d323-C001-45", "a789aea59eebfefb990dfc6367d323-C001-94" ] }, "@DIF@": { "gold_contexts": [ [ "a789aea59eebfefb990dfc6367d323-C001-143", "a789aea59eebfefb990dfc6367d323-C001-144" ] ], "cite_sentences": [ "a789aea59eebfefb990dfc6367d323-C001-143" ] } } }, "ABC_289ea9be270f68e23ca1809f997be9_35": { "x": [ { "sent_id": 
"289ea9be270f68e23ca1809f997be9-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-2", "text": "Abstract." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-3", "text": "Cyberbullying is a disturbing online misbehaviour with troubling consequences." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-4", "text": "It appears in different forms, and in most of the social networks, it is in textual format." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-5", "text": "Automatic detection of such incidents requires intelligent systems." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-6", "text": "Most of the existing studies have approached this problem with conventional machine learning models and the majority of the developed models in these studies are adaptable to a single social network at a time." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-7", "text": "In recent studies, deep learning based models have found their way in the detection of cyberbullying incidents, claiming that they can overcome the limitations of the conventional models, and improve the detection performance." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-8", "text": "In this paper, we investigate the findings of a recent literature in this regard." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-9", "text": "We successfully reproduced the findings of this literature and validated their findings using the same datasets, namely Wikipedia, Twitter, and Formspring, used by the authors." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-10", "text": "Then we expanded our work by applying the developed methods on a new YouTube dataset (~54k posts by ~4k users) and investigated the performance of the models in new social media platforms." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-11", "text": "We also transferred and evaluated the performance of the models trained on one platform to another platform." 
}, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-12", "text": "Our findings show that the deep learning based models outperform the machine learning models previously applied to the same YouTube dataset." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-13", "text": "We believe that the deep learning based models can also benefit from integrating other sources of information and looking into the impact of profile information of the users in social networks." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-14", "text": "----------------------------------" }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-15", "text": "**INTRODUCTION**" }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-16", "text": "With the emergence of Web 2.0 there has been a substantial impact on social communication, and relationships and friendships have been redefined all over again." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-17", "text": "Adolescents spend a considerable amount of time online and on different social platforms, besides all the benefits that it might bring them, their online presence also make them vulnerable to threats and social misbehaviours such as cyberbullying." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-18", "text": "Studies show that about 18% of the children in Europe have been involved in cyberbullying 1 ." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-19", "text": "In the 2014 EU Kids Online Report [1] it is stated that 20% of children from 11 to 16 years old have been exposed to online bullying." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-20", "text": "The quantitative research of [2] shows that cyber-victimization rates among teenagers between 20% and 40%." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-21", "text": "These statistics go on and on 2,3 [3] ." 
}, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-22", "text": "All these demonstrate the importance of finding a robust and comprehensive solution to this widespread problem." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-23", "text": "Cyberbullying needs to be understood and addressed from different perspectives." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-24", "text": "Automatic detection and prevention of these incidents can substantially help to tackle this problem." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-25", "text": "There are already tools developed which can flag a bullying incident [4] and programs which try to provide support to the victims [5] ." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-26", "text": "Moreover, most of the online platforms which are commonly used by teenagers have safely centres, for example, YouTube Safety Centre 4 and Twitter Safety and Security" }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-27", "text": "----------------------------------" }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-28", "text": "**5**" }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-29", "text": ", which provide support to users and monitor the communications." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-49", "text": "**DATASETS**" }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-30", "text": "There are also many research conducted on automatic detection and prevention of cyberbullying, which we will address in more details in the next section, but this problem is still far from resolved and there is the need for further improvements towards having a concrete solution." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-31", "text": "Most of the existing studies [6] - [9] have used conventional Machine Learning (ML) models to detect cyberbullying incidents." 
}, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-32", "text": "Recently Deep Neural Network Based (DNN) models have also been applied for detection of cyberbullying [10] , [11] ." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-33", "text": "In [11] , authors have used DNN models for detection of cyberbullying and have expanded their models across multiple social media platforms." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-34", "text": "Based on their reported results, their models outperform traditional ML models, and most importantly authors have stated that they have applied transfer learning which means their developed models for detection of cyberbullying can be adapted and used on other datasets." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-35", "text": "In this contribution, we begin by reproducing and validating the [11] proposed models and their results on the three datasets, Formspring [12] , Twitter [13] and Wikipedia [14] , which have been used by the authors." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-36", "text": "Cyberbullying takes place in almost all of the online social networks; therefore, developing a detection model which is adaptable and transferable to different social networks is of great value." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-37", "text": "We expand our work by re-implementing the models on a new dataset." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-38", "text": "For this purpose, we have used a YouTube dataset which has been extensively used in cyberbullying studies [6] , [15] , [16] ." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-39", "text": "The ultimate aim was to investigate the interoperability and the performance of the reproduced models on new datasets, to see how adaptable they are to different social media platforms and to what extent models trained on a dataset (i.e., social network) can be transferred to another one." 
}, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-40", "text": "This provides a base to compare the outcome of DNN models with the conventional ML models." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-41", "text": "In the remainder of this paper, we reported the reproduced experimental setup, datasets and findings (Section 2), we investigated the adaptability of the methods on the YouTube dataset (Section 3), and in Section 4 we discussed our findings, compared our results with previous attempts on the same dataset, and pointed out potential future works." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-42", "text": "----------------------------------" }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-43", "text": "**REPRODUCED EXPERIMENTAL SETUP**" }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-44", "text": "In this study, we have first reproduced the experiments conducted in [11] on the datasets used by the authors namely, Formspring [12] , Wikipedia [14] , and Twitter [13] ." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-45", "text": "We have used the same models and experimental setup for our implementations." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-46", "text": "In this section, we have briefly introduced the datasets and explained the models and other experiment components." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-47", "text": "For further details please see the reference literature." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-48", "text": "----------------------------------" }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-50", "text": "In this study three datasets are used; Formspring (a Q&A forum), Wikipedia talk pages (collaborative knowledge repository) and Twitter (microblogging platform)." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-51", "text": "All these datasets are manually labelled and publicly available." 
}, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-52", "text": "One problem that all these datasets share, is that the datasets are skewed, and there is a class imbalance, i.e., the number of posts labelled as bullying is significantly less than the neutral ones." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-53", "text": "This causes the classification results to be biased towards non-bullying posts." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-54", "text": "Therefore as it will be explained in more details in the next section, we oversampled the bullying class in the datasets." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-55", "text": "Furthermore, the size of the posts in terms of number of words differs across datasets." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-56", "text": "This can affect the number of distinct words encountered in each dataset." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-57", "text": "Therefore, long posts (measured based on the number of words) are truncated to the size of post ranked at 95 percentile in that dataset." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-58", "text": "----------------------------------" }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-59", "text": "**WIKIPEDIA.**" }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-60", "text": "A talk page in Wikipedia is where the discussions among those who have contributed to the editing of a Wikipedia page are maintained." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-61", "text": "This dataset includes more than 10,000 discussion comments from English Wikipedia' talk pages." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-62", "text": "All comments were manually annotated by ten persons." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-63", "text": "In total 13,590 comments were labelled as a personal attack." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-64", "text": "Formspring." 
}, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-65", "text": "This is a question-answering platform." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-66", "text": "The dataset includes 12,000 pairs of questions-answers which were manually annotated by three persons." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-67", "text": "In total 825 pairs was annotated as cyberbullying by at least two persons." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-68", "text": "Twitter." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-69", "text": "From this microblogging platform, 16,000 tweets were collected and manually annotated." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-70", "text": "The tweets were collected by search of terms which refer to religious, sexual, gender, and ethnic minorities." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-71", "text": "In total 3117 were labelled as sexist and 1937 were labelled as racist." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-72", "text": "The remaining tweets were labelled as neither." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-73", "text": "----------------------------------" }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-74", "text": "**MODELS AND METHODS**" }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-75", "text": "In this section the four DDN models that we experimented for cyberbullying detection are described." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-76", "text": "We also have experimented with three methods for initializing word embeddings, which will be briefly explained in follow." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-77", "text": "Deep Neural Network Based Models." 
}, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-78", "text": "In this study four different DNN models were used for detection of cyberbullying: Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), Bidirectional LSTM (BLSTM) and BLSTM with attention." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-79", "text": "These models respectively vary in complexity in their neural architecture." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-80", "text": "CNNs are mostly used for image and text classification [17] , [18] as well as sentiment classification [19] ." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-81", "text": "LSTM networks are used for learning long-term dependencies." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-82", "text": "Their internal memory makes these networks useful for text classification [20] , [21] ." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-83", "text": "Bidirectional LSTMs, increase the input information to the network by encoding information in both forward and backward direction [22] ." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-84", "text": "BLSTM with attention, gives more direct dependence between the state of the model at different points in time [23] ." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-85", "text": "All the models have identical layers except for the neural architecture layer which is unique to each model." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-86", "text": "The embedding layer, which will be explained in more details in following, processes a fixed length sequence of words." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-87", "text": "There are two dropout layers which are used to avoid overfitting, once before (with dropout rates of 0.25) and one after (with dropout rates of 0.5) the neural architecture layer." 
}, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-88", "text": "Then there is the fully connected layer which is a dense output layer with the number of neurons equal to the number of classes." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-89", "text": "The last layer is the softmax layer that provides softmax activation." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-90", "text": "All the models are trained using backpropagation with Adam optimizer and categorical cross-entropy loss function." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-91", "text": "All the reference codes can be found in the author's GitHub repository 6 ." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-92", "text": "----------------------------------" }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-93", "text": "**INITIAL WORD EMBEDDING.**" }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-94", "text": "Word embedding is the process of representing each word as real value vector." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-95", "text": "The embedding layer of our models processes a fixed length sequence of words." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-96", "text": "In this study three methods are used for initializing word embeddings: random, GloVe [24] and SSWE [25] ." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-97", "text": "Using initial words embeddings during the training can improve the model to learn task specific word embeddings." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-98", "text": "Task specific word embeddings can differentiate the style of cyberbullying among different online platform as well as topics." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-99", "text": "The GloVe vectors mostly improve the performance of the models over the random vector initialization." 
}, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-100", "text": "However, in GloVe method only the syntactic context of the word in considered and the sentiment conveyed by the text is ignored." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-101", "text": "This problem is overcome in the SSWE method by incorporating the text sentiment as one of the parameters for word embedding generation." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-102", "text": "In this study different dimension size for word embeddings are experimented." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-103", "text": "Since there was no significant difference in the result of dimensions from 30 to 200, the results of the dimension size 50 are reported." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-104", "text": "----------------------------------" }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-105", "text": "**WORKFLOW AND RESULTS**" }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-106", "text": "We started our experiments by implementing the DDN based models using Keras Python package 7." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-107", "text": "The datasets went through the pre-processing steps such as stopwords and punctuations removal." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-108", "text": "The performance of the models were evaluated using five-fold cross-validation and precision, recall and F1-score evaluation metrics were used to report the performance." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-109", "text": "In order to overcome the problem caused by the imbalance in the datasets, the number of bullying class in the datasets were oversampled and the bullying posts were tripled in the training data." 
}, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-110", "text": "To illustrate the effect of oversampling, the BLSTM with attention model was used with three word embedding methods, once on the original datasets and once on the oversampled datasets (see Table 1a )." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-111", "text": "The oversampling has significantly (Mann-Whitney U test, P<0.001) improved the performance of the models in all the datasets, especially those with a smaller number of bullying posts in their training data such as Formspring." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-112", "text": "Initial word embeddings affect the data representation for the models which use them during the training to learn task specific word embeddings." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-113", "text": "Comparing the performance of the reference and the reproduced models shows that the majority of the reproduced results were within the standard deviation of the reference results (Table 1b) ." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-114", "text": "The highest inconsistencies were observed in the recall values of original Twitter dataset with GloVe and SSWE word embeddings." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-115", "text": "In Table 2a we have reported the F1-scores of different initial word embeddings on two DDN models; CNN as the simplest model, and BLSTM with attention as the most complex model." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-116", "text": "As results show, the performance of the models was influenced by different initial word embeddings." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-117", "text": "Using Random initial word embeddings outperformed the SSWE and GloVe in original datasets." 
}, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-118", "text": "However, in oversampled datasets, the initial word embeddings did not show a significant effect on cyberbullying detection." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-119", "text": "We noticed that the inconsistencies among the reference and reproduced F1-scores mostly occurred in CNN model on Twitter dataset (Table 2b) ." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-120", "text": "The performance of all four DNN models while using SSWE as the initial word embeddings is summarized in Table 3a ." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-121", "text": "Same as the reference work, we also noticed that the LSTM were outperformed by other models, and the performance gaps in the other three models were quite small." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-122", "text": "Following the above mentioned inconsistencies, we further observed that the main difference between the reproduced and reference results are due to the differences in recall values (Table 3b) ." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-123", "text": "----------------------------------" }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-124", "text": "**APPLICATION AND EVALUATION OF THE METHODS ON A NEW DATASET**" }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-125", "text": "In this section we investigated the adaptability of the reproduced methods on a new dataset and to evaluate the performance of the methods on a new social media platform; YouTube." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-126", "text": "We investigate how the DNN models would perform in this dataset in comparison to the previous ML models used on this dataset for detection of cyberbullying." 
}, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-127", "text": "----------------------------------" }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-128", "text": "**YOUTUBE DATASET**" }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-129", "text": "YouTube is one of the most popular user-generated content video platforms." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-130", "text": "The wide range of audience and the content of the videos make it prone to misbehaviours such as cyberbullying." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-131", "text": "In this study we use a YouTube dataset which has been created by [6] ." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-132", "text": "This dataset was developed by searching the YouTube for topics sensitive to cyberbullying, such as, gender and race." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-133", "text": "From the retrieved videos, the comments of the users as well as their publicly available profile information were extracted." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-134", "text": "After some trimming, the final dataset consists of about 54,000 comments from 3,858 distinct users." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-135", "text": "The comments were manually annotated by 2 persons." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-136", "text": "In total about 6,500 of the comments were labelled as bullying." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-137", "text": "----------------------------------" }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-138", "text": "**WORKFLOW AND RESULTS**" }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-139", "text": "We used the same experimental settings as explained in Section 2 for the new dataset." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-140", "text": "We continued our experiments by implementing the DDN based models." 
}, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-141", "text": "The YouTube dataset also suffers from the class imbalance and the number of bullying posts is significantly smaller than the neutral ones." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-142", "text": "Therefore were oversampled the bullying posts of the dataset and their number was tripled." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-143", "text": "Table 4 shows the performance of the BLSTM with attention model using the three initial word embeddings in both the original dataset and the oversampled dataset." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-144", "text": "We also used different dimension sizes for word embeddings, from 25 to 200." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-145", "text": "Here we have reported the average of all the experimented demission sizes for each word embedding." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-146", "text": "Table 4 ." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-147", "text": "Performance evaluation of the BLSTM with attention models, with three word embedding methods on the YouTube original dataset and the oversamples dataset." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-148", "text": "As the results show, oversampling of the dataset significantly (Mann-Whitney U test, P<0.001) improved the performance of the models in all three word embeddings." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-149", "text": "Overall, the SSWE has the highest F1-score and precision while Random embeddings resulted in the highest recall." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-150", "text": "The performance measures of all the DNN based models using SSWE initial word embedding is presented in Table 5 ." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-151", "text": "The LSTM model had the lowest performance in comparison to other models, mainly due to near zero recall." 
}, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-152", "text": "While BLSTM has the highest F1-score and recall, BLSTM with attention also performed quite similar with slightly higher precision in the cost of a lower recall." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-153", "text": "----------------------------------" }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-154", "text": "**DATASET**" }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-155", "text": "----------------------------------" }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-156", "text": "**TRANSFER LEARNING**" }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-157", "text": "Transfer learning is the process of using a model which has been trained on one task for another related task." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-158", "text": "Following [11] we also implemented the transfer learning procedure to evaluate to what extent the DNN models trained on a social network, here Twitter, Formspring, and Wiki, can successfully detect cyberbullying posts in another social network, i.e., YouTube." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-159", "text": "For this purpose we used the BLSTM with attention model and experimented with three different approaches." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-160", "text": "\uf0b7 Complete Transfer Learning." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-161", "text": "In this approach, a model trained on one dataset is directly used in other datasets without any extra training." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-162", "text": "As the results in Table 6 show, the recalls are quite low but varying in all three datasets." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-163", "text": "This can indicate that the nature of cyberbullying is different in these different datasets." 
}, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-164", "text": "However, the complete transfer learning approach shows that bullying nature in YouTube is more similar to Formspring (F1-score = 0.30) and then to Wikipedia (F1-score = 0.23) in comparison to Twitter (F1-score = 0.15)." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-165", "text": "This might be due to the similarity of the nature of these social networks." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-166", "text": "YouTube, Formspring and (Table 6) was not significant compared to the feature level learning approach." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-167", "text": "This indicates that the transfer of network weights is not as essential to cyberbullying detection as the learned word embeddings." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-168", "text": "----------------------------------" }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-169", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-170", "text": "In this study, we successfully reproduced the reference literature [11] for detection of cyberbullying incidents in social media platforms using DNN based models." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-171", "text": "The source codes and materials were mostly well organized and accessible." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-172", "text": "However, there were some details and settings that were not clearly stated." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-173", "text": "These might have been the reason for some the inconsistencies in our results." 
}, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-174", "text": "We further expanded our work by using a new social media dataset, YouTube, to investigate the adaptability and transferability of the models to the new dataset and also to compare the performance of the DNN models against the conventional ML models which were used in previous studies on the YouTube dataset for cyberbullying detection." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-175", "text": "Often, the datasets for cyberbullying detections contains very few posts marked as bullying." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-176", "text": "This imbalance problem can be partly covered by oversampling the bullying posts." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-177", "text": "However, the effects of such prevalence on the performance of models need to be further assessed." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-178", "text": "Our study shows that the DNN models were adaptable and transferable to the new dataset." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-179", "text": "DNN based models coupled with transfer learning outperformed all the previous results for the detection of cyberbullying in this YouTube dataset using ML models." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-180", "text": "In [26] , [27] authors have used context-based features such as the users' profile information and personal demographics to train the ML models which has resulted in F1-score=0.64." }, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-181", "text": "In [6] the discrimination capacity of the detection methods were improved to 0.76 by incorporating expert knowledge." 
}, { "sent_id": "289ea9be270f68e23ca1809f997be9-C001-182", "text": "We believe that the DNN models can also benefit from integrating other sources of information and as the future work we recommend to look into the impact of profile information of the social media users and to investigate the improvement of the models by considering the above mentioned sources of information." } ], "y": { "@BACK@": { "gold_contexts": [ [ "289ea9be270f68e23ca1809f997be9-C001-32", "289ea9be270f68e23ca1809f997be9-C001-33" ] ], "cite_sentences": [ "289ea9be270f68e23ca1809f997be9-C001-32", "289ea9be270f68e23ca1809f997be9-C001-33" ] }, "@EXT@": { "gold_contexts": [ [ "289ea9be270f68e23ca1809f997be9-C001-32", "289ea9be270f68e23ca1809f997be9-C001-33", "289ea9be270f68e23ca1809f997be9-C001-37" ] ], "cite_sentences": [ "289ea9be270f68e23ca1809f997be9-C001-32", "289ea9be270f68e23ca1809f997be9-C001-33" ] }, "@USE@": { "gold_contexts": [ [ "289ea9be270f68e23ca1809f997be9-C001-35" ], [ "289ea9be270f68e23ca1809f997be9-C001-44" ], [ "289ea9be270f68e23ca1809f997be9-C001-158" ], [ "289ea9be270f68e23ca1809f997be9-C001-170" ] ], "cite_sentences": [ "289ea9be270f68e23ca1809f997be9-C001-35", "289ea9be270f68e23ca1809f997be9-C001-44", "289ea9be270f68e23ca1809f997be9-C001-158", "289ea9be270f68e23ca1809f997be9-C001-170" ] } } }, "ABC_b6afd492c60af7ab1ba0be3cd654b2_35": { "x": [ { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-2", "text": "We present a natural language generator based on the sequence-to-sequence approach that can be trained to produce natural language strings as well as deep syntax dependency trees from input dialogue acts, and we use it to directly compare two-step generation with separate sentence planning and surface realization stages to a joint, one-step approach." 
}, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-3", "text": "We were able to train both setups successfully using very little training data." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-4", "text": "The joint setup offers better performance, surpassing state-of-the-art with regards to ngram-based scores while providing more relevant outputs." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-5", "text": "----------------------------------" }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-7", "text": "In spoken dialogue systems (SDS), the task of natural language generation (NLG) is to convert a meaning representation (MR) produced by the dialogue manager into one or more sentences in a natural language." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-8", "text": "It is traditionally divided into two subtasks: sentence planning, which decides on the overall sentence structure, and surface realization, determining the exact word forms and linearizing the structure into a string (Reiter and Dale, 2000) ." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-9", "text": "While some generators keep this division and use a two-step pipeline (Walker et al., 2001; Rieser et al., 2010; Dethlefs et al., 2013) , others apply a joint model for both tasks (Wong and Mooney, 2007; Konstas and Lapata, 2013) ." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-10", "text": "We present a new, conceptually simple NLG system for SDS that is able to operate in both modes: it either produces natural language strings or generates deep syntax dependency trees, which are subsequently processed by an external surface realizer ." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-11", "text": "This allows us to show a direct comparison of two-step generation, where sentence planning and surface realization are separated, with a joint, one-step approach." 
}, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-12", "text": "Our generator is based on the sequence-tosequence (seq2seq) generation technique (Cho et al., 2014; Sutskever et al., 2014) , combined with beam search and an n-best list reranker to suppress irrelevant information in the outputs." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-13", "text": "Unlike most previous NLG systems for SDS (e.g., (Stent et al., 2004; Raux et al., 2005; ), it is trainable from unaligned pairs of MR and sentences alone." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-14", "text": "We experiment with using much less training data than recent systems based on recurrent neural networks (RNN) (Wen et al., 2015b; Mei et al., 2015) , and we find that our generator learns successfully to produce both strings and deep syntax trees on the BAGEL restaurant information dataset ." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-15", "text": "It is able to surpass n-gram-based scores achieved previously by Du\u0161ek and Jur\u010d\u00ed\u010dek (2015) , offering a simpler setup and more relevant outputs." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-16", "text": "We introduce the generation setting in Section 2 and describe our generator architecture in Section 3." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-17", "text": "Section 4 details our experiments, Section 5 analyzes the results." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-18", "text": "We summarize related work in Section 6 and offer conclusions in Section 7." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-19", "text": "----------------------------------" }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-20", "text": "**GENERATOR SETTING**" }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-21", "text": "The input to our generator are dialogue acts (DA) representing an action, such as inform or request, along with one or more attributes (slots) and their values." 
}, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-22", "text": "Our generator operates in two modes, producing either deep syntax trees (Du\u0161ek et al., 2012) or natural language strings (see Fig. 1 )." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-23", "text": "The first mode corresponds to the sentence planning NLG stage as it decides the syntactic shape of the output sentence; the resulting deep syntax tree involves content words (lemmas) and their syntactic form (formemes, purple in Fig. 1 )." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-24", "text": "The trees are linearized to strings using a surface realizer from the TectoMT translation system ." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-25", "text": "The second generator mode joins sentence planning and surface realization into one step, producing natural language sentences directly." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-26", "text": "Both modes offer their advantages: The twostep mode simplifies generation by abstracting away from complex surface syntax and morphology, which can be handled by a handcrafted, domain-independent module to ensure grammatical correctness at all times (Du\u0161ek and Jur\u010d\u00ed\u010dek, 2015) , and the joint mode does not need to model structure explicitly and avoids accumulating errors along the pipeline (Konstas and Lapata, 2013) ." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-27", "text": "----------------------------------" }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-28", "text": "**THE SEQ2SEQ GENERATION MODEL**" }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-29", "text": "Our generator is based on the seq2seq approach (Cho et al., 2014; Sutskever et al., 2014) , a type of an encoder-decoder RNN architecture operating on variable-length sequences of tokens." 
}, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-30", "text": "We address the necessary conversion of input DA and output trees/sentences into sequences in Section 3.1 and then describe the main seq2seq component in Section 3.2." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-31", "text": "It is supplemented by a reranker, as explained in Section 3.3." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-32", "text": "----------------------------------" }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-33", "text": "**SEQUENCE REPRESENTATION OF DA, TREES,**" }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-34", "text": "and Sentences We represent DA, deep syntax trees, and sentences as sequences of tokens to enable their usage in the sequence-based RNN components of our generator (see Sections 3.2 and 3.3)." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-35", "text": "Each token is represented by its embedding -a vector of floatingpoint numbers (Bengio et al., 2003) ." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-36", "text": "To form a sequence representation of a DA, we create a triple of the structure \"DA type, slot, value\" for each slot in the DA and concatenate the triples (see Fig. 3 )." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-37", "text": "The deep syntax tree output from the seq2seq generator is represented in a bracketed notation similar to the one used by Vinyals et al. (2015, see Fig. 2) ." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-38", "text": "The inputs to the reranker are always a sequence of tokens; structure is disregarded in trees, resulting in a list of lemma-formeme pairs (see Fig. 2 )." 
}, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-39", "text": "----------------------------------" }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-40", "text": "**SEQ2SEQ GENERATOR**" }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-41", "text": "Our seq2seq generator with attention (Bahdanau et al., 2015, see Fig. 3) 1 starts with the encoder stage, which uses an RNN to encode an input sequence x = {x 1 , . . . , x n } into a sequence of encoder outputs and hidden states h = {h 1 , . . . , h n }, where h t = lstm(x t , h t\u22121 ), a non-linear function represented by the long-short-term memory (LSTM) cell (Graves, 2013) ." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-42", "text": "The decoder stage then uses the hidden states to generate a sequence y = {y 1 , . . . , y m } with a second LSTM-based RNN." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-43", "text": "The probability of each output token is defined as:" }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-44", "text": "Here, s t is the decoder state where s 0 = h n and s t = lstm((y t\u22121 \u2022 c t )W S , s t\u22121 ), i.e., the decoder is initialized by the last hidden state and uses the previous output token at each step." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-45", "text": "W Y and W S are learned linear projection matrices and \"\u2022\" denotes concatenation." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-46", "text": "c t is the context vector -a weighted sum of the encoder hidden states c t = n i=1 \u03b1 ti h i , where \u03b1 ti corresponds to an alignment model, represented by a feed-forward network with a single tanh hidden layer." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-47", "text": "On top of this basic seq2seq model, we implemented a simple beam search for decoding (Sutskever et al., 2014; Bahdanau et al., 2015) ." 
}, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-48", "text": "It proceeds left-to-right and keeps track of log probabilities of top n possible output sequences, expanding them one token at a time." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-49", "text": "----------------------------------" }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-50", "text": "**RERANKER**" }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-51", "text": "To ensure that the output trees/strings correspond semantically to the input DA, we implemented a classifier to rerank the n-best beam search outputs and penalize those missing required information and/or adding irrelevant one." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-52", "text": "Similarly to Wen et al. (2015a) , the classifier provides a binary decision for an output tree/string on the presence of all dialogue act types and slot-value combinations seen in the training data, producing a 1-hot vector." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-53", "text": "Figure 4 : The reranker The input DA is converted to a similar 1-hot vector and the reranking penalty of the sentence is the Hamming distance between the two vectors (see Fig. 4 )." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-54", "text": "Weighted penalties for all sentences are subtracted from their n-best list log probabilities." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-55", "text": "We employ a similar architecture for the classifier as in our seq2seq generator encoder (see Section 3.2), with an RNN encoder operating on the output trees/strings and a single logistic layer for classification over the last encoder hidden state." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-56", "text": "Given an output sequence representing a string or a tree y = {y 1 , . . . , y n } (cf. Section 3.1), the encoder again produces a sequence of hidden states h = {h 1 , . . . , h n } where h t = lstm(y t , h t\u22121 )." 
}, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-57", "text": "The output binary vector o is computed as: o i = sigmoid((h n \u00b7 W R + b) i ) Here, W R is a learned projection matrix and b is a corresponding bias term." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-58", "text": "----------------------------------" }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-59", "text": "**EXPERIMENTS**" }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-60", "text": "We perform our experiments on the BAGEL data set of , which contains 202 DA from the restaurant information domain with two natural language paraphrases each, describing restaurant locations, price ranges, food types etc." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-61", "text": "Some properties such as restaurant names or phone numbers are delexicalized (replaced with \"X\" symbols) to avoid data sparsity." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-62", "text": "2 Unlike , we do not use manually annotated alignment of slots and values in the input DA to target words and phrases and let the generator learn it from data, which simplifies training data preparation but makes our task harder." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-63", "text": "We lowercase the data and treat plural -s as separate tokens for generating into strings, and we apply automatic analysis from the Treex NLP toolkit (Popel and\u017dabokrtsk\u00fd, 2010) to obtain deep syntax trees for training tree-based generator setups." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-64", "text": "3 Same as , we apply 10-fold cross-validation, with 181 training DA and 21 testing DA." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-65", "text": "In addition, we reserve 10 DA from the training set for validation." 
}, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-66", "text": "4 To train our seq2seq generator, we use the Adam optimizer (Kingma and Ba, 2015) to minimize unweighted sequence cross-entropy." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-67", "text": "5 We perform 10 runs with different random initialization of the network and up to 1,000 passes over the training data, 6 validating after each pass and selecting the parameters that yield the highest BLEU score on the validation set." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-68", "text": "Neither beam search nor the reranker are used for validation." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-69", "text": "We use the Adam optimizer minimizing crossentropy to train the reranker as well." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-70", "text": "7 We perform a single run of up to 100 passes over the data, and we also validate after each pass and select the parameters giving minimal Hamming distance on both validation and training set." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-71", "text": "8 3 The input vocabulary size is around 45 (DA types, slots, and values added up) and output vocabulary sizes are around 170 for string generation and 180 for tree generation (45 formemes and 135 lemmas)." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-72", "text": "4 We treat the two paraphrases for the same DA as separate instances in the training set but use them together as two references to measure BLEU and NIST scores (Papineni et al., 2002; Doddington, 2002) on the validation and test sets." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-73", "text": "5 Based on a few preliminary experiments, the learning rate is set to 0.001, embedding size 50, LSTM cell size 128, and batch size 20." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-74", "text": "Reranking penalty for decoding is 100. 
6 Training is terminated early if the top 10 so far achieved validation BLEU scores do not change for 100 passes." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-75", "text": "7 We use the same settings as with the seq2seq generator." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-76", "text": "8 The validation set is given 10 times more importance." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-77", "text": "----------------------------------" }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-78", "text": "**RESULTS**" }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-79", "text": "The results of our experiments and a comparison to previous works on this dataset are shown in Table 1." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-80", "text": "We include BLEU and NIST scores and the number of semantic errors (incorrect, missing, and repeated information), which we assessed manually on a sample of 42 output sentences (outputs of two randomly selected cross-validation runs)." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-81", "text": "The outputs of direct string generation show that the models learn to produce fluent sentences in the domain style; 9 incoherent sentences are rare, but semantic errors are very frequent in the greedy search." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-82", "text": "Most errors involve confusion of semantically close items, e.g., Italian instead of French or riverside area instead of city centre (see Table 2); items occurring more frequently are preferred regardless of their relevance." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-83", "text": "The beam search brings a BLEU improvement but keeps most semantic errors in place." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-84", "text": "The reranker is able to reduce the number of semantic errors while increasing automatic scores considerably." 
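The checkpoint-selection and early-stopping rule described in the training setup (validate after each pass, keep the parameters with the highest validation BLEU, and terminate early once the top-10 validation BLEU scores stay unchanged for 100 passes) can be sketched as a small loop. This is an illustrative reconstruction of the rule, not the TGen training code; `validation_bleus` stands in for per-pass validation scores.

```python
def select_best_pass(validation_bleus, patience=100, top_k=10):
    """Return the index of the pass with the highest validation BLEU,
    stopping early once the top-`top_k` scores seen so far have not
    changed for `patience` consecutive passes."""
    top = []  # best `top_k` validation scores seen so far
    stale = 0
    best_idx, best = 0, float("-inf")
    for i, bleu in enumerate(validation_bleus):
        if bleu > best:
            best_idx, best = i, bleu
        new_top = sorted(top + [bleu], reverse=True)[:top_k]
        # reset the patience counter only when the leaderboard changes
        stale = 0 if new_top != top else stale + 1
        top = new_top
        if stale >= patience:
            break
    return best_idx
```

The "top-k leaderboard" formulation is more tolerant of noisy validation scores than stopping on the single best score, which suits the very small validation set used here.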
}, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-85", "text": "Using a larger beam increases the effect of the reranker as expected, resulting in slightly improved outputs." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-86", "text": "Models generating deep syntax trees are also able to learn the domain style, and they have virtually no problems producing valid trees." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-87", "text": "10 The surface realizer works almost flawlessly on this lim-ited domain (Du\u0161ek and Jur\u010d\u00ed\u010dek, 2015) , leaving the seq2seq generator as the major error source." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-88", "text": "The syntax-generating models tend to make different kinds of errors than the string-based models: Some outputs are valid trees but not entirely syntactically fluent; missing, incorrect, or repeated information is more frequent than a confusion of semantically similar items (see Table 2 )." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-89", "text": "Semantic error rates of greedy and beam-search decoding are lower than for string-based models, partly because confusion of two similar items counts as two errors." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-90", "text": "The beam search brings an increase in BLEU but also in the number of semantic errors." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-91", "text": "The reranker is able to reduce the number of errors and improve automatic scores slightly." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-92", "text": "A larger beam leads to a small BLEU decrease even though the sentences contain less errors; here, NIST reflects the situation more accurately." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-93", "text": "A comparison of the two approaches goes in favor of the joint setup: Without the reranker, models generating trees produce less semantic errors and gain higher BLEU/NIST scores." 
}, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-94", "text": "However, with the reranker, the string-based model is able to reduce the number of semantic errors while producing outputs significantly better in terms of BLEU/NIST." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-95", "text": "11 In addition, the joint setup does not need an external surface realizer." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-96", "text": "The best results of both setups surpass the best results on this dataset using training data without manual alignments (Du\u0161ek and Jur\u010d\u00ed\u010dek, 2015) in both automatic metrics 12 and the number of semantic errors." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-97", "text": "----------------------------------" }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-98", "text": "**RELATED WORK**" }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-99", "text": "While most recent NLG systems attempt to learn generation from data, the choice of a particular approach -pipeline or joint -is often arbitrary and depends on system architecture or particular generation domain." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-100", "text": "Works using the pipeline approach in SDS tend to focus on sentence planning, improving a handcrafted generator (Walker et al., 2001; Stent et al., 2004; Paiva and Evans, 2005) or using perceptron-guided A* search (Du\u0161ek and Jur\u010d\u00ed\u010dek, 2015) ." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-101", "text": "Generators taking the joint approach employ various methods, e.g., factored language models , inverted parsing (Wong and Mooney, 2007; Konstas and Lapata, 2013) , or a pipeline of discriminative classifiers (Angeli et al., 2010) ." 
}, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-102", "text": "Unlike most previous Input DA inform(name=X-name, type=placetoeat, eattype=restaurant, area=citycentre, near=X-near, food=\"Chinese takeaway\", food=Japanese) Reference X is a Chinese takeaway and Japanese restaurant in the city centre near X. Greedy with trees X is a restaurant offering chinese takeaway in the centre of town near X." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-103", "text": "[Japanese] + Beam search X is a restaurant and japanese food and chinese takeaway. + Reranker X is a restaurant serving japanese food in the centre of the city that offers chinese takeaway." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-104", "text": "Greedy into strings X is a restaurant offering italian and indian takeaway in the city centre area near X. [ Mei et al. (2015) present the only seq2seq-based NLG system known to us." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-105", "text": "We extend the previous works by generating deep syntax trees as well as strings and directly comparing pipeline and joint generation." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-106", "text": "In addition, we experiment with an order-of-magnitude smaller dataset than other RNN-based systems." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-107", "text": "----------------------------------" }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-108", "text": "**CONCLUSIONS AND FUTURE WORK**" }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-109", "text": "We have presented a direct comparison of two-step generation via deep syntax trees with a direct generation into strings, both using the same NLG system based on the seq2seq approach." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-110", "text": "While both approaches offer decent performance, their outputs are quite different." 
}, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-111", "text": "The results show the direct approach as more favorable, with significantly higher n-gram based scores and a similar number of semantic errors in the output." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-112", "text": "We also showed that our generator can learn to produce meaningful utterances using a much smaller amount of training data than what is typically used for RNN-based approaches." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-113", "text": "The resulting models had virtually no problems with produc-ing fluent, coherent sentences or with generating valid structure of bracketed deep syntax trees." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-114", "text": "Our generator was able to surpass the best BLEU/NIST scores on the same dataset previously achieved by a perceptron-based generator of Du\u0161ek and Jur\u010d\u00ed\u010dek (2015) while reducing the amount of irrelevant information on the output." }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-115", "text": "Our generator is released on GitHub at the following URL:" }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-116", "text": "https://github.com/UFAL-DSG/tgen" }, { "sent_id": "b6afd492c60af7ab1ba0be3cd654b2-C001-117", "text": "We intend to apply it to other datasets for a broader comparison, and we plan further improvements, such as enhancing the reranker or including a bidirectional encoder (Bahdanau et al., 2015; Mei et al., 2015; Jean et al., 2015) and sequence level training (Ranzato et al., 2015) ." 
} ], "y": { "@DIF@": { "gold_contexts": [ [ "b6afd492c60af7ab1ba0be3cd654b2-C001-15" ], [ "b6afd492c60af7ab1ba0be3cd654b2-C001-96" ], [ "b6afd492c60af7ab1ba0be3cd654b2-C001-114" ] ], "cite_sentences": [ "b6afd492c60af7ab1ba0be3cd654b2-C001-15", "b6afd492c60af7ab1ba0be3cd654b2-C001-96", "b6afd492c60af7ab1ba0be3cd654b2-C001-114" ] }, "@BACK@": { "gold_contexts": [ [ "b6afd492c60af7ab1ba0be3cd654b2-C001-26" ], [ "b6afd492c60af7ab1ba0be3cd654b2-C001-100" ] ], "cite_sentences": [ "b6afd492c60af7ab1ba0be3cd654b2-C001-26", "b6afd492c60af7ab1ba0be3cd654b2-C001-100" ] }, "@USE@": { "gold_contexts": [ [ "b6afd492c60af7ab1ba0be3cd654b2-C001-22", "b6afd492c60af7ab1ba0be3cd654b2-C001-26" ], [ "b6afd492c60af7ab1ba0be3cd654b2-C001-87" ] ], "cite_sentences": [ "b6afd492c60af7ab1ba0be3cd654b2-C001-26", "b6afd492c60af7ab1ba0be3cd654b2-C001-87" ] }, "@EXT@": { "gold_contexts": [ [ "b6afd492c60af7ab1ba0be3cd654b2-C001-100", "b6afd492c60af7ab1ba0be3cd654b2-C001-105" ] ], "cite_sentences": [ "b6afd492c60af7ab1ba0be3cd654b2-C001-100" ] } } }, "ABC_c7e304499654516cce43c550256eae_35": { "x": [ { "sent_id": "c7e304499654516cce43c550256eae-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "c7e304499654516cce43c550256eae-C001-2", "text": "Cross-lingual model transfer is a compelling and popular method for predicting annotations in a low-resource language, whereby parallel corpora provide a bridge to a high-resource language and its associated annotated corpora." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-3", "text": "However, parallel data is not readily available for many languages, limiting the applicability of these approaches." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-4", "text": "We address these drawbacks in our framework which takes advantage of cross-lingual word embeddings trained solely on a high coverage bilingual dictionary." 
}, { "sent_id": "c7e304499654516cce43c550256eae-C001-5", "text": "We propose a novel neural network model for joint training from both sources of data based on cross-lingual word embeddings, and show substantial empirical improvements over baseline techniques." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-6", "text": "We also propose several active learning heuristics, which result in improvements over competitive benchmark methods." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-7", "text": "----------------------------------" }, { "sent_id": "c7e304499654516cce43c550256eae-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "c7e304499654516cce43c550256eae-C001-9", "text": "Part-of-speech (POS) tagging is an important first step in most natural language processing (NLP) applications." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-10", "text": "Typically this is modelled using sequence labelling methods to predict the conditional probability of taggings given word sequences, using linear graphical models (Lafferty et al., 2001) , or neural network models, such as recurrent neural networks (RNN) (Mikolov et al., 2010; Huang et al., 2015) ." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-11", "text": "These supervised learning algorithms rely on large labelled corpora; this is particularly true for state-of-the-art neural network models." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-12", "text": "Due to the expense of annotating sufficient data, such techniques are not well suited to applications in low-resource languages." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-13", "text": "Prior work on low-resource NLP has primarily focused on exploiting parallel corpora to project information between a high-and low-resource language (Yarowsky and Ngai, 2001; T\u00e4ckstr\u00f6m et al., 2013; Guo et al., 2015; Agi\u0107 et al., 2016; Buys and Botha, 2016) ." 
}, { "sent_id": "c7e304499654516cce43c550256eae-C001-14", "text": "For example, POS tags can be projected via word alignments, and the projected POS is then used to train a model in the lowresource language Zhang et al., 2016; Fang and Cohn, 2016) ." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-15", "text": "These methods overall have limited effectiveness due to errors in the alignment and fundamental differences between the languages." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-16", "text": "They also assume a large parallel corpus, which may not be available for many low-resource languages." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-17", "text": "To address these limitations, we propose a new technique for low resource tagging, with more modest resource requirements: 1) a bilingual dictionary; 2) monolingual corpora in the high and low resource languages; and 3) a small annotated corpus of around 1, 000 tokens in the low-resource language." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-18", "text": "The first two resources are used as a form of distant supervision through learning crosslingual word embeddings over the monolingual corpora and bilingual dictionary (Ammar et al., 2016) ." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-19", "text": "Additionally, our model jointly incorporates the language-dependent information from the small set of gold annotations." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-20", "text": "Our approach combines these two sources of supervision using multi-task learning, such that the kinds of errors that occur in cross-lingual transfer can be accounted for, and corrected automatically." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-21", "text": "We empirically demonstrate the validity of our observation by using distant supervision to improve POS tagging performance with little supervision." 
}, { "sent_id": "c7e304499654516cce43c550256eae-C001-22", "text": "Experimental results show the effectiveness of our approach across several low-resource languages, including both simulated and true lowresource settings." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-23", "text": "Furthermore, given the clear superiority of training with manual annotations, we compare several active learning heuristics." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-24", "text": "Active learning using uncertainty sampling with a word- type bias leads to substantial gains over benchmark methods such as token or sentence level uncertainty sampling." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-25", "text": "----------------------------------" }, { "sent_id": "c7e304499654516cce43c550256eae-C001-26", "text": "**RELATED WORK**" }, { "sent_id": "c7e304499654516cce43c550256eae-C001-27", "text": "POS tagging has been studied for many years." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-28", "text": "Traditionally, probabilistic models are a popular choice, such as Hidden Markov Models (HMM) and Conditional Random Fields (CRF) (Lafferty et al., 2001) ." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-29", "text": "Recently, neural network models have been developed for POS tagging and achieved good performance, such as RNN and bidirectional long short-term memory (BiLSTM) and CRF-BiLSTM models (Mikolov et al., 2010; Huang et al., 2015) ." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-30", "text": "For example, the CRFBiLSTM POS tagger obtained the state-of-theart performance on Penn Treebank WSJ corpus (Huang et al., 2015) ." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-31", "text": "However, in low-resource languages, these models are seldom used because of limited labelled data." 
}, { "sent_id": "c7e304499654516cce43c550256eae-C001-32", "text": "Parallel data therefore appears to be the most realistic additional source of information for developing NLP systems in low-resource languages (Yarowsky and Ngai, 2001; T\u00e4ckstr\u00f6m et al., 2013; Fang and Cohn, 2016; Zhang et al., 2016) ." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-33", "text": "Yarowsky and Ngai (2001) pioneered the use of parallel data for projecting POS tag information from one language to another language." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-34", "text": "used parallel data and exploited graph-based label propagation to expand the coverage of labelled tokens." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-35", "text": "T\u00e4ckstr\u00f6m et al. (2013) constructed tag dictionaries by projecting tag information from a highresource language to a low-resource language via alignments in the parallel text." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-36", "text": "Fang and Cohn (2016) used parallel data to obtain projected tags as distant labels and proposed a joint BiLSTM model trained on both the distant data and 1, 000 tagged tokens." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-37", "text": "Zhang et al. (2016) used a few word translations pairs to find a linear transformation between two language embeddings." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-38", "text": "Then they used unsupervised learning to refine embedding transformations and model parameters." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-39", "text": "Instead we use minimal supervision to refine 'distant' labels through modelling the tag transformation, based on a small set of annotations." 
}, { "sent_id": "c7e304499654516cce43c550256eae-C001-40", "text": "----------------------------------" }, { "sent_id": "c7e304499654516cce43c550256eae-C001-41", "text": "**MODEL**" }, { "sent_id": "c7e304499654516cce43c550256eae-C001-42", "text": "We now describe the modelling framework for POS tagging in a low-resource language, based on very limited linguistic resources." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-43", "text": "Our approach extends the work of Fang and Cohn (2016) , who present a model based on distant supervision in the form of cross-lingual projection and use projected tags generated from parallel corpora as distant annotations." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-44", "text": "There are three main differences between their work and ours: 1) We do not use parallel corpora, but instead use a bilingual dictionary for knowledge transfer." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-45", "text": "2) Our model uses a more expressive multi-layer perceptron when generating the gold standard tags." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-46", "text": "The multi-layer perceptron can capture both language-specific infor- mation and consistent tagging errors arising from this method of supervision." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-47", "text": "3) We propose a number of active learning methods to further reduces the annotation requirements." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-48", "text": "Our method is illustrated in Figure 1 , and we now elaborate on the model components." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-49", "text": "Distant cross-lingual supervision In order to transfer tag information between the high-and low-resource languages, we start by learning cross-lingual word embeddings, which operate by learning vector valued embeddings such that words and their translations tend to be close together in the vector space." 
}, { "sent_id": "c7e304499654516cce43c550256eae-C001-50", "text": "We use the embeddings from Ammar et al. (2016) which trains monolingual word2vec distributional representations, which are then projected into a common space, learned from bilingual dictionaries." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-51", "text": "We then train a POS tagger on the high-resource language, using the cross-lingual word embeddings as the first, fixed, layer of a bidirectional LSTM tagger." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-52", "text": "The tagger is a language-universal model based on cross-lingual word embeddings, for processing an arbitrary language, given a monolingual corpus and a bilingual dictionary, as shown in Figure 2 ." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-53", "text": "Next we apply this tagger to unannotated text in the low-resource language; this application is made possible through the use of cross-lingual word embeddings." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-54", "text": "We refer to text tagged this way as distantly supervised data, and emphasize that although much better than chance, the outputs are often incorrect and are of limited utility on their own." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-55", "text": "As illustrated in Figure 1 , the distant components are generated directly as softmax outputs, y t \u223c Categorial(o t ), with parameters o t = Softmax(W h t + b) as a linear classifier over a sentence encoding, h t , which is the output of a bidirectional LSTM encoder over the words." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-56", "text": "Ground truth supervision The second component of the model is manually labelled text in the low-resource language." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-57", "text": "To model this data we employ the same model structure as above but augmented with a second perceptron output layer, as illustrated in Figure 1 (right) ." 
}, { "sent_id": "c7e304499654516cce43c550256eae-C001-58", "text": "Formally,\u1ef9 t \u223c Categorial(\u00f5 t ) where\u00f5 t = MLP(o t ) is a single hidden layer perceptron with tanh activation and softmax output transformation." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-59", "text": "This component allows for a more expressive label mapping than Fang and Cohn (2016)'s linear matrix translation." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-60", "text": "Joint multi-task learning To combine the two sources of information, we use a joint objective," }, { "sent_id": "c7e304499654516cce43c550256eae-C001-61", "text": "where N and M index the token positions in the distant and ground truth corpora, respectively, and \u03b3 is a constant balancing the two components which we set for uniform weighting, \u03b3 = |M| |N | ." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-62", "text": "Consider the training effect of the true POS tags: when performing error backpropagation, the cross-entropy error signal must pass through the transformation linking\u00f5 with o, which can be seen as a language-specific step, after which the generalised error signal can be further backpropagated to the rest of the model." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-84", "text": "4 The BiLSTM layer uses 128 hidden units, and 32 hidden units for the transformation step." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-104", "text": "**CONCLUSION**" }, { "sent_id": "c7e304499654516cce43c550256eae-C001-63", "text": "Active learning Given the scarcity of ground truth labels and the high cost of annotation, a natural question is whether we can optimise which text to be annotated in order achieve the high accuracy for the lowest cost." 
}, { "sent_id": "c7e304499654516cce43c550256eae-C001-64", "text": "We now outline a range of active learning approaches based on the following heuristics, which are used to select the instances for annotation from a pool of candidates:" }, { "sent_id": "c7e304499654516cce43c550256eae-C001-65", "text": "TOKEN Select the token x t with the highest uncertainty, H(x, t) = \u2212 y P (y|x, t) log P (y|x, t);" }, { "sent_id": "c7e304499654516cce43c550256eae-C001-66", "text": "SENT Select the sentence x with the highest aggregate uncertainty, H(x) = t H(x, t);" }, { "sent_id": "c7e304499654516cce43c550256eae-C001-67", "text": "FREQTYPE Select the most frequent unannotated word type (Garrette and Baldridge, 2013) , in which case all token instances are annotated with the most frequent label for the type in the training corpus; 1 SUMTYPE Select a word type, z, for annotation with the highest aggregate uncertainty over token occurrences, H(z) = i\u2208D x i,t =z H(x i , t), which effectively combines uncertainty sampling with a bias towards high frequency types; and RANDOM Select word types randomly." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-68", "text": "----------------------------------" }, { "sent_id": "c7e304499654516cce43c550256eae-C001-69", "text": "**EXPERIMENTS**" }, { "sent_id": "c7e304499654516cce43c550256eae-C001-70", "text": "We evaluate the effectiveness of the proposed model for several different languages, including both simulated low-resource and true low-resource settings." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-71", "text": "The first evaluation set uses the CoNLL-X datasets of European languages (Buchholz and Marsi, 2006) , comprising Danish (da), Dutch (nl), German (de), Greek (el), Italian (it), Portuguese (pt), Spanish (es) and Swedish (sv)." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-72", "text": "We use the standard corpus splits." 
}, { "sent_id": "c7e304499654516cce43c550256eae-C001-73", "text": "The first 20 sentences of training set are used for training as the tiny labelled (gold) data and the last 20 sentences are used for development (early stopping)." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-74", "text": "We report accuracy on the held-out test set." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-75", "text": "The second evaluation set includes two highly challenging languages, Turkish (tk) and Malagasy (mg), both having high morphological complexity and the latter has truly scant resources." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-76", "text": "Turkish data was drawn from CoNLL 2003 2 and Malagasy data was collected from , in both cases using the same training configuration as above." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-77", "text": "In all cases English is used as the source 'high resource' language, on which we train a tagger using the Penn Treebank, and we evaluate on each of the remaining languages as an independent target." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-78", "text": "For cross-lingual word embeddings, we evaluate two techniques from Ammar et al. (2016) : CCA-based word embeddings and clusterbased word embeddings." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-79", "text": "Both types of word embedding techniques are based on bilingual dictionaries." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-80", "text": "The dictionaries were formed by translating the 20k most common words in the En-glish monolingual corpus with Google Translate." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-81", "text": "3 The monolingual corpora were constructed from a combination of text from the Leipzig Corpora Collection and Europarl." 
}, { "sent_id": "c7e304499654516cce43c550256eae-C001-82", "text": "We trained the languageuniversal POS tagger based on the cross-lingual word embeddings with the universal POS tagset , and then applied to the target language using the embedding lookup table for the corresponding language embeddings." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-83", "text": "We implement our learning procedure with the DyNet toolkit (Neubig et al., 2017) ." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-85", "text": "We used SGD with momentum to train models, with early stopping based on development performance." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-86", "text": "For benchmarks, we compare the proposed model against various state-of-the-art supervised learning methods, namely: a BILSTM tagger, BILSTM-CRF tagger (Huang et al., 2015) , and a state-of-the-art semi-supervised POS tagging algorithm, MINITAGGER (Stratos and Collins, 2015) , which is also focusing on minimising the amount of labelled data." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-87", "text": "Note these methods do not use cross-lingual supervision." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-88", "text": "For a more direct comparison, we include BILSTM-DEBIAS (Fang and Cohn, 2016) , applied using our proposed cross-lingual supervision based on dictionaries, instead of parallel corpora; accordingly the key difference is their linear transformation for the distant data, versus our non-linear transformation to the gold data." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-89", "text": "Results Table 1 reports the tagging accuracy, showing that our models consistently outperform the baseline techniques." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-90", "text": "The poor performance of the supervised methods suggests they are overfitting the small training set, however this is much less of a problem for our approach (labelled Joint)." 
}, { "sent_id": "c7e304499654516cce43c550256eae-C001-91", "text": "Note that distant supervision alone gives reasonable performance (labelled DISTANT); however, the joint modelling of the ground truth and distant data yields significant improvements in almost all cases." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-92", "text": "BILSTM-DEBIAS (Fang and Cohn, 2016) performs worse than our proposed method, indicating that a linear transformation is insufficient for modelling distant supervision." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-93", "text": "The accuracies are higher overall for the European languages than for the Turkic languages, presumably because the former are closer to English, have higher-quality dictionaries and in most cases are morphologically simpler." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-94", "text": "Table 1: POS tagging accuracy over the ten target languages, showing first approaches using only the gold data; next, methods using only distant cross-lingual supervision; and lastly, joint multi-task learning." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-95", "text": "English is used as the source language and columns correspond to a specific target language." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-97", "text": "Finally, note the difference between CCA and Cluster methods for learning word embeddings, which arises from the differing quality of distant supervision between the languages." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-98", "text": "Figure 3 compares various active learning heuristics (see \u00a73) based on different taggers, either a supervised BILSTM (labelled Trad) or our multi-task model which also includes cross-lingual supervision (JOINT)." 
}, { "sent_id": "c7e304499654516cce43c550256eae-C001-99", "text": "Traditional uncertainty-based sampling strategies (TOKEN(Trad) and SENT(Trad)) do not work well because models based on limited supervision do not provide accurate uncertainty information, 5 and moreover, annotating at the type rather than token level provides a significantly stronger supervision signal." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-100", "text": "The difference is apparent from the decent performance of Random sampling over word types." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-101", "text": "Overall, SUMTYPE(Joint) outperforms the other heuristics consistently, underlining the importance of cross-lingual distant supervision, as well as combining the benefits of uncertainty sampling, type selection and a frequency bias." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-102", "text": "Comparing the amount of annotation required between the best traditional active learning method SUMTYPE(Trad) and our best method SUMTYPE(Joint), we achieve the same performance with an order of magnitude less annotated data (100 vs. 1,000 labelled words)." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-103", "text": "----------------------------------" }, { "sent_id": "c7e304499654516cce43c550256eae-C001-105", "text": "In this paper, we proposed a means of tagging a low-resource language without the need for bilingual parallel corpora." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-106", "text": "We introduced a new cross-lingual distant supervision method based on a bilingual dictionary." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-107", "text": "Furthermore, deep neural network models can be effective with limited supervision by incorporating distant supervision, in the form of model transfer with cross-lingual word embeddings." 
}, { "sent_id": "c7e304499654516cce43c550256eae-C001-108", "text": "We show that traditional uncertainty sampling strategies do not work well on low-resource settings, and introduce new methods based around labelling word types." }, { "sent_id": "c7e304499654516cce43c550256eae-C001-109", "text": "Overall our approach leads to consistent and substantial improvements over benchmark methods." } ], "y": { "@BACK@": { "gold_contexts": [ [ "c7e304499654516cce43c550256eae-C001-14" ], [ "c7e304499654516cce43c550256eae-C001-32" ] ], "cite_sentences": [ "c7e304499654516cce43c550256eae-C001-14", "c7e304499654516cce43c550256eae-C001-32" ] }, "@DIF@": { "gold_contexts": [ [ "c7e304499654516cce43c550256eae-C001-14", "c7e304499654516cce43c550256eae-C001-15", "c7e304499654516cce43c550256eae-C001-16", "c7e304499654516cce43c550256eae-C001-17" ], [ "c7e304499654516cce43c550256eae-C001-59" ], [ "c7e304499654516cce43c550256eae-C001-88" ], [ "c7e304499654516cce43c550256eae-C001-92" ] ], "cite_sentences": [ "c7e304499654516cce43c550256eae-C001-14", "c7e304499654516cce43c550256eae-C001-59", "c7e304499654516cce43c550256eae-C001-88", "c7e304499654516cce43c550256eae-C001-92" ] }, "@MOT@": { "gold_contexts": [ [ "c7e304499654516cce43c550256eae-C001-14", "c7e304499654516cce43c550256eae-C001-15", "c7e304499654516cce43c550256eae-C001-16", "c7e304499654516cce43c550256eae-C001-17" ] ], "cite_sentences": [ "c7e304499654516cce43c550256eae-C001-14" ] }, "@EXT@": { "gold_contexts": [ [ "c7e304499654516cce43c550256eae-C001-43" ] ], "cite_sentences": [ "c7e304499654516cce43c550256eae-C001-43" ] }, "@USE@": { "gold_contexts": [ [ "c7e304499654516cce43c550256eae-C001-88" ] ], "cite_sentences": [ "c7e304499654516cce43c550256eae-C001-88" ] } } }, "ABC_b624e8d3ad3d351d4cb27ea4b5c616_35": { "x": [ { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-2", "text": "Many of the existing Named Entity Recognition (NER) solutions are built 
based on news corpus data with proper syntax." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-3", "text": "These solutions might not lead to highly accurate results when applied to noisy, user generated data, e.g., tweets, which can feature sloppy spelling, concept drift, and limited contextualization of terms and concepts due to length constraints." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-4", "text": "The models described in this paper are based on linear chain conditional random fields (CRFs), use the BIEOU encoding scheme, and leverage random feature dropout for up-sampling the training data." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-5", "text": "The considered features include word clusters and pre-trained distributed word representations, updated gazetteer features, and global context predictions." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-6", "text": "The latter feature allows for ingesting the meaning of new or rare tokens into the system via unsupervised learning and for alleviating the need to learn lexicon based features, which usually tend to be high dimensional." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-7", "text": "In this paper, we report on the solution [ST] we submitted to the WNUT 2016 NER shared task." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-8", "text": "We also present an improvement over our original submission [SI], which we built by using semi-supervised learning on labelled training data and pre-trained resources constructed from unlabelled tweet data." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-9", "text": "Our ST solution achieved an F1 score 1.2% higher than the baseline (35.1% F1) for the task of extracting 10 entity types." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-10", "text": "The SI resulted in an increase of 8.2% in F1 score over the baseline (7.08% over ST)." 
}, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-11", "text": "Finally, the SI model's evaluation on the test data achieved an F1 score of 47.3% (~1.15% increase over the 2nd best submitted solution)." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-12", "text": "Our experimental setup and results are available as a standalone Twitter NER tool at https://github.com/napsternxg/TwitterNER." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-13", "text": "----------------------------------" }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-14", "text": "**INTRODUCTION**" }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-15", "text": "A common task in information extraction is the identification of named entities from free text, also referred to as Named Entity Recognition (NER) (Sarawagi, 2008)." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-16", "text": "In the machine learning and data mining literature, NER is typically formulated as a sequence prediction problem, where for a given sequence of tokens, an algorithm or model needs to predict the correct sequence of labels." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-17", "text": "Additionally, most NER systems are designed or trained based on monolingual newswire corpora, which are written with proper linguistic syntax." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-18", "text": "However, noisy and user generated text data, which are common on social media, pose several challenges for generic NER systems, such as shorter and multilingual texts, ever evolving word forms and vocabulary, improper grammar, and shortened or incorrectly spelled words." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-19", "text": "Let us consider a fictional tweet: \"r u guyz goin to c da #coldplay show @madisonsqrgrdn?\"." 
}, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-20", "text": "This tweet contains two named entities, namely: \"Coldplay\", a music band, and \"Madison Square Garden, NYC, USA\", a geolocation, which references the place at which the band is playing." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-21", "text": "Many of the terms present in the exemplary tweet would be considered as out of vocabulary (OOV) terms by traditional NER systems." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-22", "text": "Furthermore, using a large set of such OOV tokens for training a classifier is likely to result in a sparse and high dimensional feature space, thereby increasing computing time." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-23", "text": "The phenomenon of concept-drift, i.e., the meaning of terms shifting over time, has also been found to affect the accuracy of NER systems over time, resulting in poor performance of a classifier trained on older data (Cherry & Guo, 2015; Derczynski, Maynard, et al., 2015; Fromreide et al., 2014; Hulten et al., 2001; Masud et al., 2010) ." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-24", "text": "The Workshop on \"Noisy User-generated Text\" (WNUT) continued its 2015 shared task on NER on tweets (Baldwin et al., 2015) in 2016." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-25", "text": "In 2016, the task was divided into two parts: (1) identification of named entities in tweets, and (2) NER on 10 types of entities, namely person, geo-location, other, company, sports-team, facility, product, music-artist, movie, and tv-show." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-26", "text": "In this paper we introduce two solutions to perform NER on tweets." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-27", "text": "The first system, which we will refer to as the submitted solution [ST] , was submitted as an entry to the WNUT 2016 NER shared task." 
}, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-28", "text": "It uses random feature [RF] dropout for up-sampling the dataset." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-29", "text": "This system was improved into a semi-supervised solution (our 2nd solution [SI]), which uses additional, unsupervised features." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-30", "text": "These features have been found to be useful in prior information extraction and NER tasks." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-31", "text": "The semi-supervised approach circumvents the need to include word n-gram features from any tweets, and builds upon the successful usage of word representations (Collobert et al., 2011) and word clusters (Lin & Wu, 2009; Miller et al., 2004; Ratinov & Roth, 2009; Turian et al., 2010) for NER by utilizing large amounts of unlabelled data or models pre-trained on a large vocabulary." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-32", "text": "The SI system was designed to mitigate the various issues mentioned above, and utilizes the unlabelled tokens from all the available datasets (including unlabelled test data) to improve the prediction quality on the evaluation datasets, a form of transductive learning (Joachims, 2003)." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-33", "text": "The SI system outperforms ST by ~7% (F1 score) when using the development set for evaluation, and by ~11% when using the test set (1% higher than the 2nd best team in the task)." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-34", "text": "The SI model does not utilize any word n-gram lexical features." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-35", "text": "We believe that the approach taken for SI is useful for situations that require refinement or adaptation of an existing classifier to perform well on a new test set." 
}, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-36", "text": "We have released our experimental setup and code at https://github.com/napsternxg/TwitterNER." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-37", "text": "----------------------------------" }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-38", "text": "**DATA**" }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-39", "text": "The training, development, and test datasets were provided by the task organizers." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-40", "text": "The training set consists of 2,394 tweets with a total of 1,499 named entities." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-41", "text": "The organizers provided two separate development datasets, which we merged to create a dataset of 1,420 tweets with 937 named entities." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-42", "text": "This merged dataset was used as the development dataset for all of our experiments." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-43", "text": "The test dataset comprises 3,856 tweets with 3,473 named entities." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-44", "text": "Most of the tweets in the provided data lack any entity mentions (42% in training, 59% in development, and 47% in test data), resulting in sparse training samples." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-45", "text": "Furthermore, certain types of entities, such as movie and tv-show, have only a few instances." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-46", "text": "The frequency distribution of the different types of named entities in the training, development, and test data is shown in Figure 1." 
}, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-47", "text": "Additionally, we found that the training, development, and test data have an average of 19.4 (\u00b17.6), 16.2 (\u00b16.8), and 16.1 (\u00b16.6) tokens per sequence, respectively, and mostly contain fewer than 3 entities per tweet." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-48", "text": "This implies that the presence of certain entity types might be reflective of the category of the tweet, e.g. movie entities will be found in tweets about movies, and sports-team entities will be found in tweets about sports." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-49", "text": "Additionally, some types of entities are more likely to co-occur with each other than others." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-50", "text": "Using the provided data, we found that both person and geo-location entities were most likely to co-occur with entities of the other 8 types, compared to the co-occurrence of the rest of the entities." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-51", "text": "Although the original dataset was tagged using the Begin-Inside-Outside (BIO) encoding, we converted that into the Begin-Inside-End-Outside-Unigram (BIEOU) encoding, which has been found to be more efficient for sequence classification tasks (Ratinov & Roth, 2009)." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-52", "text": "However, the predicted tags were converted back to the BIO encoding to make our submission compatible with the evaluation system." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-53", "text": "----------------------------------" }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-54", "text": "**FEATURE ENGINEERING**" }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-55", "text": "We trained our system using multiple combinations of features." 
}, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-56", "text": "Features were chosen with the intent to increase the generalizability and scalability of our classifier." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-57", "text": "Some of the considered features can be updated with the availability of new unlabelled data, while other features capture the general token patterns in tweets." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-58", "text": "All features are described in detail in the following subsections." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-59", "text": "----------------------------------" }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-60", "text": "**REGEX FEATURES [RF]**" }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-61", "text": "Regular expressions are rules describing regularities in data, and are typically empirically derived." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-62", "text": "For example, in regular English corpora, named entities usually begin with capital letters." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-63", "text": "Although regex based approaches can be effective, they are likely to result in retrieving large amounts of false positives." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-64", "text": "Most NER systems use token level regex features (Baldwin et al., 2015; Ratinov & Roth, 2009)." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-65", "text": "We extended these regex features by including features which detect syntax patterns of tokens commonly present in tweets." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-66", "text": "Our patterns return \"true\" if the regex pattern matches the token." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-67", "text": "A detailed list of our regex features is provided in Table 1." 
}, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-68", "text": "These features were extracted per token, and every pair of neighbouring tokens' regex features was multiplied to create pairwise features." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-69", "text": "----------------------------------" }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-70", "text": "Table 1 (excerpt): ISHASHTAG identifies whether a token is a hashtag; ISMENTION identifies whether a token is a user mention." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-71", "text": "----------------------------------" }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-72", "text": "**GAZETTEERS [GZ]**" }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-73", "text": "The task organizers provided a set of gazetteer lists." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-74", "text": "Although helpful, these lists include some irregularities, such as words composed of or containing non-ascii characters, garbled strings, and missing names of important named entities in many categories." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-75", "text": "Furthermore, the provided gazetteers did not include names of movies or music artists." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-76", "text": "We increased the given set of gazetteers by including an additional 41K person names, 63K music artist names, 8K TV show titles, 2K sports team names, and 110K movie titles from WikiData (https://www.wikidata.org), an additional 8.3M locations from GeoNames (http://www.geonames.org/), and 4.5M music artist names and their 1.4M name variants from the Discogs' public data dump (http://data.discogs.com/)." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-77", "text": "Improved gazetteer features were also used as features in last year's shared task (Derczynski, Augenstein, et al., 2015)." 
}, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-78", "text": "The gazetteer features were implemented on a per token level, where we look up a gazetteer phrase in a range of window sizes W (min=1 and max=6) both left and right of the current token." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-79", "text": "Additionally, we encode the window size and the identified gazetteer name." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-80", "text": "Finally, we include interaction terms computed as the product of all pairs of gazetteer features for each token." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-81", "text": "----------------------------------" }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-82", "text": "**WORD REPRESENTATIONS [WR]**" }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-83", "text": "Distributed word representations have been shown to improve the accuracy of NER systems (Collobert et al., 2011; Turian et al., 2010)." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-84", "text": "We used 200 dimensional GloVe word representations [WRG] (Pennington et al., 2014), which were pre-trained on 6 billion tweets." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-85", "text": "Furthermore, we built a set of word clusters by performing an agglomerative clustering of word representations [WRFTC] and fine tuning them on the training plus development dataset by running the word2vec model (Mikolov, Sutskever, et al., 2013)." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-86", "text": "----------------------------------" }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-87", "text": "**WORD CLUSTERS [WC]**" }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-88", "text": "Word clusters are word groupings that get generated in an unsupervised fashion, and they have been successfully used as features for NER tasks (Lin & Wu, 2009; Miller et al., 2004; Ratinov & Roth, 2009; Turian et al., 2010)." 
}, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-89", "text": "One algorithm for creating such sets is Brown clustering (Brown et al., 1992), which produces a hierarchical cluster of words in the corpus while optimizing the likelihood of a language model based on a Hidden Markov Model (HMM)." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-90", "text": "We used 1,000 pre-trained Brown clusters [WCBPT] that were prepared by using a large corpus of tweets (Gimpel et al., 2011; Owoputi et al., 2013)." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-91", "text": "Additionally, we built another set of Brown clusters [WCBD] with a cluster size of 100 based on all of the available data by using the code provided by Liang (2005) 1 ." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-92", "text": "Furthermore, we also used an implementation 2 of the algorithm proposed by Clark (2003) to create 32 (default option) additional word clusters from our training plus development data based on the regex and sequential features of the words." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-93", "text": "We chose to call these Clark clusters [WCCC]." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-94", "text": "Additionally, for each token, we also included all word cluster features for their immediate neighbours along with interaction terms, with the latter capturing the product of the token cluster with the neighbouring cluster." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-95", "text": "----------------------------------" }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-96", "text": "**ADDITIONAL FEATURES**" }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-97", "text": "Even though the strength of our system lies in its semi-supervised nature and its non-reliance on data specific features such as lexical tokens [LT], we still included lexical tokens for comparison." 
}, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-98", "text": "Additionally, we used certain global features [GF] for helping with the prediction." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-99", "text": "Global features capture the overall composition of the sequence." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-100", "text": "We constructed the GF using the average values of the word representations and the binary presence of cluster and dictionary features." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-101", "text": "Additionally, another feature was constructed, which approximates the probability of the sequence being of a certain type." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-102", "text": "This feature adds additional context to the token level prediction task, e.g. a tweet about sports is more likely to mention a sports team, and similarly, a tweet about a company is more likely to mention a product and vice-versa." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-103", "text": "To use this global feature, we first trained a logistic regression classifier to predict whether a tweet is about each of the 10 types of entities." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-104", "text": "The predicted probability per type is used as a feature for each of the tokens in the sequence." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-105", "text": "----------------------------------" }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-106", "text": "**RANDOM UP-SAMPLING WITH FEATURE DROPOUT [RSFD]**" }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-107", "text": "Since the training dataset is comparatively small and its features are sparse, we create synthetic examples by dropping interaction and lexical features with probability p. These features were chosen for random dropout because our earlier experiments had shown that the classifier identifies large weights for these features." 
}, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-108", "text": "We further scaled the training data size by a factor of k. This technique is inspired by the success of the dropout technique (Srivastava et al., 2014), which serves as a regularization function for deep neural networks." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-109", "text": "However, our technique is slightly different in that we use dropout to create a larger number of noisy samples from our data." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-110", "text": "Also, in contrast to the basic dropout technique, we did not re-weight the feature weights using the dropout probability (Srivastava et al., 2014) during evaluation." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-111", "text": "----------------------------------" }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-112", "text": "**NER CLASSIFICATION ALGORITHM**" }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-113", "text": "We used a linear chain CRF (Lafferty et al., 2001; McCallum & Li, 2003) as implemented in the CRFSuite package (Okazaki, 2007) for training all our models." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-114", "text": "The models were trained using stochastic gradient descent (SGD) with an L2 norm (C=1e-3)." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-115", "text": "We also tested some of the recently popular deep learning based approaches, such as word embedding based and character based recurrent neural networks, for our prediction task." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-116", "text": "However, these techniques did not yield competitive results and were too slow to converge on CPU." 
}, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-117", "text": "Furthermore, training the CRF model was faster (average training time of the CRF algorithms was ~3 minutes on CPU, compared to >15 minutes for the character/word based 3-layer deep recurrent neural network solution), and gave interpretable results while beating the baseline model provided by the task organizers." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-118", "text": "In the following sections, we will first describe the model we used in our submission to the shared task, and then our improvement over the initial model and results." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-119", "text": "----------------------------------" }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-120", "text": "**SHARED TASK SUBMISSION SOLUTION [ST] BASED ON RANDOM FEATURE DROPOUT UP-SAMPLING**" }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-121", "text": "Our original submission to the shared task [ST] was based on a system that uses the lexical, regex, and dictionary based features with random feature dropout based up-sampling." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-122", "text": "All the interaction terms were randomly dropped out with p=0.5, and the scaling factor k was chosen to be 5." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-123", "text": "The dictionary based features were created using a context window of size 2 to the left and right of the token." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-124", "text": "Additional interaction features were included by calculating the product of the dictionary features of the token and the neighbouring tokens." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-125", "text": "Finally, ST was based on a classifier trained only on the training dataset, and was corpus specific in that it used the vocabulary created from the training data." 
}, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-126", "text": "----------------------------------" }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-127", "text": "**SEMI-SUPERVISED WORD CLUSTERS AND REPRESENTATION BASED SOLUTION [SI]**" }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-128", "text": "The described lexicon based solution [ST] had one major drawback: The most highly weighted features were mainly tokens descriptive of entity types as occurring in the training data." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-129", "text": "For example, the highest weighted feature for the label U-person was word_normed:pope." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-130", "text": "Similarly, for many of the other entity types, the highest weighted features were the names or labels of popular entities." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-131", "text": "Although these features help to achieve a decent evaluation score on the development dataset, they can lead to overfitting of the classifier to the vocabulary of the training corpus." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-132", "text": "In order to circumvent this issue, our semi-supervised solution (Blum, 1998; Blum & Mitchell, 1998) builds on the recent success of using word representations and word clusters in NER tasks, while disregarding lexical vocabulary based features." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-133", "text": "The intuition behind our approach to the 2nd solution [SI] was to ensure that the classifier learns higher level representations of the observed tokens." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-134", "text": "All the features used for our second solution augment the tokens present in the given tweets." 
}, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-135", "text": "This allows us to scale up the underlying resources, such as gazetteers, and improve word representations and clusters using the new unlabelled test data, while still being able to update the classifier from the initially provided, limited training data." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-136", "text": "We replicate this behaviour in our classifiers by training our clusters on all of the unlabelled data generated by merging tweet texts from the training, development, and test data (only unlabelled) [TDTE] (Blum & Mitchell, 1998) , and comparing the resulting performance to that obtained with unsupervised training that does not consider the test data [TD] ." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-137", "text": "Although it might appear that our classifier has access to the unlabelled test data sequences while learning, we are in fact emulating an online setting where we continuously update our unsupervised features using the new batch of unlabelled test data, and then retrain our model on the original training data (Blum, 1998; Blum & Mitchell, 1998; Carlson et al., 2010; Chapelle et al., 2009; Liang, 2005; Turian et al., 2010; Zhu & Goldberg, 2009) ." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-138", "text": "In this case, the unlabelled data prevent the classifier from overfitting to the training data by acting as a regularization factor." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-139", "text": "An alternative approach would be to train these clusters on a large number of unlabelled tweets that match the time range and search domain of the test tweets."
}, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-140", "text": "----------------------------------" }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-141", "text": "**RESULTS**" }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-142", "text": "In the following sections, we describe the evaluation of the accuracy of both the ST and SI systems in comparison to BL and against each other." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-143", "text": "All evaluations were done by using the evaluation script provided by the organizers." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-144", "text": "We use the classifier provided by the organizers as the baseline (BL) system." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-145", "text": "The baseline system uses lexical, gazetteer, and regex features." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-146", "text": "----------------------------------" }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-147", "text": "**PERFORMANCE IN WNUT NER SHARED TASK**" }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-148", "text": "Using BL as a point of comparison, ST scored 1.1% (F1 score) higher for the 10-types task (based on the development set), and 1.2% (F1) lower for the no-types task." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-149", "text": "Our ST is based on random feature dropout based sampling." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-150", "text": "Among the 10 participating teams, our solution placed 7th in the 10-types category with an overall F1 score of 36.95%, and 6th in the no-types category with an overall F1 score of 51.38%." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-151", "text": "The top team on both tasks (same team in both cases) achieved F1 scores of 52.41% and 65.89%, respectively."
}, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-152", "text": "Overall, we found that ST performed best on the geo-location type (F1 score of 64.72%), though still behind the top two teams (scores of 72.61% and 68.36%, respectively) for this category." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-153", "text": "We placed 3rd in terms of F1 (37%) in the facility category, as shown in Table 2 (Results of the WNUT NER 2016 shared task)." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-154", "text": "Rank denotes the rank of the winning team, which we use as an ID to identify the evaluation performance of each of the participating teams in the shared task." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-155", "text": "Our solution was ranked 7th (shown in bold) and 6th (not shown) in the 10-types and no-types categories, respectively." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-156", "text": "Columns with TD and TDTE show the performance of the improved model on the test data, and their ranks denote the best competition rank that they beat." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-157", "text": "----------------------------------" }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-158", "text": "**IMPROVED MODEL PERFORMANCE [SI]**" }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-159", "text": "In this section we describe the evaluation of our improved system SI, which was developed after the release of the shared task results." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-160", "text": "Since we received the gold standard labels for the test set late in the process, we evaluated most of the improved models based on the development set." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-161", "text": "We present the additive effect of a series of features to the model in Table 3 ." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-162", "text": "Additionally, that table also shows the performance of ST and BL."
}, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-163", "text": "We do not include any lexical features in SI; however, lexical features were part of the ST and BL models." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-164", "text": "We found that the addition of the gazetteer [GZ] features improved the classification accuracy considerably." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-165", "text": "The next two big jumps in accuracy in SI came from using Brown clusters [WCBTP] and fine-tuned word representation based clusters [WCFTC] ." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-166", "text": "From all of the improved models that we trained, we selected the 10-types category model with the highest overall F1 score, namely the RF+GZ+WRG+WCBPT+WCCC+WRFTC model, also referred to as SI herein." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-167", "text": "Only the SI model was also evaluated on the test data, both with [TDTE] and without [TD] using the test data to enrich the unsupervised features." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-168", "text": "Although the model with the global features [See +GF in Table 3] is not the top one in terms of the F1 score, it achieved considerably high scores for the movie and tvshow classes, which have very few training instances." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-169", "text": "Similarly, the random dropout up-sampling based solution showed improvements of 15% and 6% F1 score in terms of predicting named entities of the types movie and music-artist, respectively." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-170", "text": "Finally, these models were trained in almost half the time of the ST models."
}, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-171", "text": "----------------------------------" }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-173", "text": "We extracted the learned features from the top performing model on the 10-types category (the RF+GZ+WRG+WCBPT+WCCC+WRFTC model)." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-174", "text": "The features with the highest positive and negative weights for each of the category labels are presented as well. We also investigated the transition features of the linear-chain CRF model." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-175", "text": "The transition matrix (based on transition weights) is presented in Figure 2 , and coloured red for negative weights and black for positive weights." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-176", "text": "Some trends become obvious from the transition matrix: for most entity types, the model is able to find high transition weights for going from B to I to E, while penalizing transitions between the other states." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-177", "text": "The choice of using BIEOU tagging is supported by the results shown in the transition matrix since, for most entity types, there is a high negative weight for going from the B or I tag to the O tag." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-178", "text": "However, a transition from the U tag to the O tag is usually supported." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-179", "text": "Our earliest experiments (not reported here) revealed that there was a considerable improvement from using the BIEOU tagging scheme." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-180", "text": "This finding aligns with the existing research which argues for the usage of this tagging scheme for NER tasks."
}, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-181", "text": "----------------------------------" }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-182", "text": "**DISCUSSION AND CONCLUSION**" }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-183", "text": "Prior work has shown that semi-supervised algorithms can perform decently for NER tasks with sparse labelled data (Blum, 1998; Carlson et al., 2010; Chapelle et al., 2009; Liang, 2005; Turian et al., 2010; Zhu & Goldberg, 2009)." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-184", "text": "We leverage this fact in our SI model via the use of unsupervised word clusters, word representations, and refined gazetteers; all of which contributed to a cumulative increase in accuracy over our initial submission [ST] by ~11% when using the test data for evaluation." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-185", "text": "Furthermore, the transition features learned by our model are reflective of correct learning of NER sequences and demonstrate the strength of using the BIEOU encoding scheme." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-186", "text": "Additionally, the supervised training of our classifier on features extracted from the unlabelled data, as opposed to lexical token features, reduces the dimensionality of the training data for the classifier and results in increased performance in terms of both accuracy and training time." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-187", "text": "Furthermore, our model can be adjusted on the arrival of new unlabelled data by updating the underlying learned word clusters and representations, and retraining the model on the existing labelled data." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-188", "text": "As identified by Turian et al. (2010) , the importance of word representations and word clusters increases as the availability of unlabelled data increases."
}, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-189", "text": "We can add additional entity names to the gazetteers." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-190", "text": "Retraining the model on the same training data would then allow the model to accommodate the new feature representations." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-191", "text": "Finally, the random feature dropout based up-sampling can help to increase the amount of training data available, and can also be improved by random swapping of entity types in the training data with their nearest neighbours in the word representations and clusters, or by choosing entities from the most correlated gazetteers." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-192", "text": "We believe that our described models can help in improving NER on noisy text, and our open source implementation can be further extended." }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-193", "text": "----------------------------------" }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-194", "text": "**ACKNOWLEDGEMENTS**" }, { "sent_id": "b624e8d3ad3d351d4cb27ea4b5c616-C001-195", "text": "We would like to acknowledge the three anonymous reviewers for their useful feedback."
} ], "y": { "@BACK@": { "gold_contexts": [ [ "b624e8d3ad3d351d4cb27ea4b5c616-C001-31" ], [ "b624e8d3ad3d351d4cb27ea4b5c616-C001-83" ], [ "b624e8d3ad3d351d4cb27ea4b5c616-C001-88" ], [ "b624e8d3ad3d351d4cb27ea4b5c616-C001-183" ], [ "b624e8d3ad3d351d4cb27ea4b5c616-C001-188" ] ], "cite_sentences": [ "b624e8d3ad3d351d4cb27ea4b5c616-C001-31", "b624e8d3ad3d351d4cb27ea4b5c616-C001-83", "b624e8d3ad3d351d4cb27ea4b5c616-C001-88", "b624e8d3ad3d351d4cb27ea4b5c616-C001-183", "b624e8d3ad3d351d4cb27ea4b5c616-C001-188" ] }, "@SIM@": { "gold_contexts": [ [ "b624e8d3ad3d351d4cb27ea4b5c616-C001-137" ] ], "cite_sentences": [ "b624e8d3ad3d351d4cb27ea4b5c616-C001-137" ] }, "@USE@": { "gold_contexts": [ [ "b624e8d3ad3d351d4cb27ea4b5c616-C001-183", "b624e8d3ad3d351d4cb27ea4b5c616-C001-184" ] ], "cite_sentences": [ "b624e8d3ad3d351d4cb27ea4b5c616-C001-183" ] } } }, "ABC_7f1723c42fc577fdfd7144b7991db1_35": { "x": [ { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-55", "text": "**MASKED ACOUSTIC MODELING**" }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-2", "text": "We present Mockingjay as a new speech representation learning approach, where bidirectional Transformer encoders are pre-trained on a large amount of unlabeled speech." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-3", "text": "Previous speech representation methods learn through conditioning on past frames and predicting information about future frames." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-4", "text": "Mockingjay, in contrast, is designed to predict the current frame by jointly conditioning on both past and future contexts." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-5", "text": "The Mockingjay representation improves performance for a wide range of downstream tasks, including phoneme classification, speaker recognition, and sentiment classification on spoken content, while outperforming other approaches."
}, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-6", "text": "Mockingjay is empirically powerful and can be fine-tuned with downstream models; with only 2 epochs of fine-tuning we further improve performance dramatically." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-7", "text": "In a low resource setting with only 0.1% of labeled data, we outperform the result obtained with Mel-features using all 100% of the labeled data." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-8", "text": "----------------------------------" }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-10", "text": "The goal of speech representation learning is to find a transform from speech that makes high-level information more accessible to SLP (Speech and Language Processing) downstream tasks, as speech signals possess a rich set of acoustic and linguistic content, including phonemes, words, semantic meanings, tone, speaker characteristics, and even sentiment information." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-11", "text": "In this paper, we propose Mockingjay to learn speech representations through unsupervised training without the use of any labels." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-12", "text": "We use multi-layer Transformer encoders and multi-head self-attention [1] to achieve bidirectional encoding; this framework allows our model to consider past and future contexts at the same time." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-13", "text": "To achieve unsupervised pre-training for speech representations, Mockingjay learns under the proposed Masked Acoustic Model (MAM) task." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-14", "text": "During training, masked frames are given, and the model learns to reconstruct and predict the original speech."
}, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-15", "text": "Hence the name Mockingjay, after a bird that mimics sound." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-16", "text": "The proposed framework is illustrated in Figure 1 ." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-17", "text": "----------------------------------" }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-18", "text": "**RELATED WORK**" }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-19", "text": "Unsupervised speech representation learning [2, 3, 4, 5, 6, 7, 8, 9, 10] is effective in extracting high-level properties from speech." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-20", "text": "SLP downstream tasks can be improved through speech representations because surface features such as log Mel-spectrograms or waveforms only poorly reveal the abundant information within speech." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-21", "text": "Contrastive Predictive Coding (CPC) [5] and wav2vec [7] use a multi-layer CNN to encode past context; representations are learned by predicting the future in latent space under a contrastive binary classification task." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-22", "text": "Autoregressive Predictive Coding (APC) [6] uses autoregressive models to encode temporal information of past acoustic sequences; the model predicts future frames like an RNN-based language model [11] , optimized with reconstruction loss." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-23", "text": "Unidirectional models are commonly used in the previous approaches [2, 3, 4, 5, 6, 7] ." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-24", "text": "However, this constraint on model architectures limits the potential of speech representation learning."
}, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-25", "text": "The recently proposed vq-wav2vec [8] approach attempts to apply the well-performing Natural Language Processing (NLP) algorithm BERT [12] on continuous speech." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-26", "text": "Input speech is discretized to a K-way quantized embedding space, so continuous speech can act like discrete units similar to word tokens in NLP tasks." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-27", "text": "In vq-wav2vec [8] , an exhaustive two-stage training pipeline with massive computing resources is required to adapt speech to the NLP algorithm, as the quantization process works against the continuous nature of speech." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-28", "text": "Unlike [8] , which adapts speech to BERT [12] through quantization, the proposed approach can be seen as a modified version of BERT [12] for direct application on continuous speech." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-29", "text": "----------------------------------" }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-30", "text": "**PROPOSED METHOD**" }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-31", "text": "Unlike previous left-to-right unidirectional approaches that only consider past sequences to predict information about future frames, the proposed method allows us to train a bidirectional speech representation model, alleviating the unidirectionality constraint of previous methods." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-32", "text": "As a result, the Mockingjay model obtains substantial improvements in several SLP tasks." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-33", "text": "Moreover, whereas previous approaches restrict the power of the pre-trained models to representation extraction only [5, 6, 7, 8] , the proposed method is robust and can be fine-tuned easily on downstream tasks."
}, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-34", "text": "We show that fine-tuning for 2 epochs easily yields a significant improvement." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-35", "text": "The proposed approach outperforms other representations and features." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-36", "text": "When compared to the commonly used log Mel-features, we outperformed them by 35.2% (absolute improvement) for phoneme classification accuracy, 28.0% (absolute improvement) for speaker recognition accuracy, and 6.4% (absolute improvement) for sentiment discrimination accuracy on a spoken content dataset unseen during pre-training." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-37", "text": "We also experiment in low resource settings to show that Mockingjay is capable of improving supervised training in real-life low-resource scenarios." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-38", "text": "With only 0.36 hours (0.1%) of transcribed speech, the proposed approach outperforms Mel-features with 360 hours (100%) of labels." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-39", "text": "----------------------------------" }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-40", "text": "**MOCKINGJAY**" }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-41", "text": "In this section, we first introduce the model architecture and its design, then explain the proposed unsupervised context prediction task, and finally explain how the proposed model is used with downstream task models." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-42", "text": "----------------------------------" }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-43", "text": "**MODEL ARCHITECTURE**" }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-44", "text": "We use a multi-layer Transformer encoder with multi-head self-attention for left-and-right bidirectional encoding; this architecture is illustrated in Figure 2 ."
}, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-45", "text": "Each encoder layer has two sub-layers: the first is a multi-head self-attention network, and the second is a feed-forward layer; each sub-layer has a residual connection followed by layer normalization [13] , based on the design described in [1] ." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-46", "text": "All encoder layers in the model, as well as the sub-layers, produce outputs of identical dimension, denoted as H_dim." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-47", "text": "In Figure 2 , we denote the feed-forward size as F_dim, the number of self-attention heads as A_num, and the total number of Transformer layers as L_num." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-48", "text": "The Mockingjay representations can be extracted from the Transformer encoders' hidden states and are labeled as Hidden; we explain how they are used as representations in Section 2.3." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-49", "text": "Since Transformer encoders contain no recurrence and convolution, we use positional encoding to make our model aware of the input sequence order [1] ." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-50", "text": "As direct addition of acoustic features to positional encoding may lead to potential training failure [14] , the input frames are first projected linearly to the hidden dimension H_dim before being added to the positional encoding [15] ." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-51", "text": "We use sinusoidal positional encoding instead of learnable positional embeddings [16] because acoustic features can be arbitrarily long with high variance [15] ." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-52", "text": "We apply downsampling on input features to adapt our model to long sequences."
}, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-53", "text": "To reduce the length of frames by a factor of R_factor, we use the reshape technique from [14, 15] by stacking R_factor consecutive frames into one step." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-54", "text": "----------------------------------" }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-56", "text": "We propose the Masked Acoustic Modeling task, where we randomly select 15% of the input frames, and the model predicts the selected frames based on their left and right context, as illustrated in Figure 1 ." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-57", "text": "During training, we add a prediction head consisting of two layers of feed-forward network with layer normalization, using the last encoder layer as its input." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-58", "text": "We use L1 loss to minimize the reconstruction error between prediction and ground-truth frames on the selected 15%." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-59", "text": "The prediction head is not required once the model is trained." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-60", "text": "During training, for the selected 15% of frames, 1) we mask all of them to zero 80% of the time, 2) replace them all with a random frame 10% of the time, and 3) leave the frames untouched 10% of the time." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-61", "text": "1 We introduce this sub-random process instead of always masking the frames to alleviate the mismatch between training and inference, as masked frames do not appear during inference time." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-62", "text": "Note that in contrast to BERT [12] , where the sub-random process is performed token-wise on the i-th chosen token, our sub-random process is performed utterance-wise." 
}, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-63", "text": "In other words, our model may receive ground-truth frames as input for case 3) 10% of the time, rather than having some of the inputs always augmented as in [12] ." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-64", "text": "To avoid the model exploiting local smoothness of acoustic frames, we propose additional consecutive masking, where we mask C_num consecutive frames to zero." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-65", "text": "The model is then required to infer on global structure rather than local information." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-66", "text": "We also use dynamic masking [18] , where masking patterns are sampled from a uniform distribution every time we feed a sequence to the model, unlike the static mask employed in [12] , where masking is performed during data preprocessing." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-67", "text": "We only use a single context prediction task to train our representation model, as suggested by [18] ." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-68", "text": "This is unlike BERT [12] and ALBERT [19] , which need two tasks to train their language models." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-69", "text": "In our preliminary experiments, we found that the sentence prediction task used in [12, 19] is not helpful, as additional tasks can potentially harm training behavior." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-70", "text": "We do not include details due to space limitations." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-71", "text": "----------------------------------" }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-72", "text": "**INCORPORATING WITH DOWNSTREAM TASKS**" }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-73", "text": "Mockingjay representations are essentially the Transformer encoder hidden states."
}, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-74", "text": "There are many ways to incorporate learned representations into downstream tasks." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-75", "text": "In this work, we mainly extract representations from the last layer." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-76", "text": "However, we also expose the deep internals of Mockingjay to downstream models, where we use a mixture of representations from all layers, similar to the ELMO [20] approach." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-77", "text": "In other words, we use a learnable weighted sum to integrate hidden states from all layers." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-78", "text": "A pre-trained Mockingjay model can be fine-tuned with downstream models to achieve improved results; we update the pre-trained Mockingjay together with randomly initialized downstream task models." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-79", "text": "----------------------------------" }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-80", "text": "**IMPLEMENTATION**" }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-81", "text": "In this work, we use two types of features as our model's output reconstruction target: the Mel-scale spectrogram and the linear-scale spectrogram." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-82", "text": "As the Mel-scale spectrogram is a more condensed acoustic representation than the linear-scale spectrogram, we propose two model settings: BASE and LARGE." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-83", "text": "Both of these models take Mel-features as input, and transform input Mel-features into high-level representations."
}, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-84", "text": "They use the same hidden dimension size of H_dim=768, feed-forward size of F_dim=3072, and number of attention heads A_num=12, with the exception of the layer number L_num, downsample factor R_factor, and consecutive masking number C_num; the differences in model settings are listed in Table 1 ." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-85", "text": "We further analyze their differences in our experiment section." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-86", "text": "The proposed Mockingjay models are pre-trained on the LibriSpeech [21] corpus train-clean-360 subset." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-87", "text": "We use Adam [22] , where the learning rate is warmed up over the first 7% of the 500k total training steps to a peak value of 4e-4 and then linearly decayed." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-88", "text": "A dropout [23] of 0.1 is applied on all layers and attention weights." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-89", "text": "For downstream task fine-tuning, most of the hyperparameters are the same as in pre-training, with the exception of a learning rate of 4e-3, and the number of training epochs is set to 2 (which is approximately 50k steps)." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-90", "text": "We train with a batch size of 6 using a single 1080Ti GPU." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-91", "text": "We provide pre-trained models with our implementation; they are publicly available for reproducibility 2 ."
}, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-92", "text": "----------------------------------" }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-93", "text": "**EXPERIMENT**" }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-94", "text": "Following previous works [2, 3, 4, 5, 6, 7, 8] , we evaluate different features and representations on downstream tasks, including: phoneme classification, speaker recognition, and sentiment classification on spoken content." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-95", "text": "For a fair comparison, each downstream task uses an identical model architecture and hyperparameters despite different input features." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-96", "text": "We report results from 5 of our models: 1) BASE and 2) LARGE, where Mockingjay representations are extracted from the last encoder layer; 3) BASE-FT2, where we fine-tune BASE with randomly initialized downstream models for 2 epochs; 4) BASE-FT500, where we fine-tune for 500k steps; and finally 5) LARGE-WS, where we incorporate hidden states from all encoder layers of the LARGE model through a learnable weighted sum." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-97", "text": "We did not fine-tune the LARGE model, as it is meant for extracting representations." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-98", "text": "Empirically we found that, even with supervised training, a randomly initialized Mockingjay model followed by any downstream model is hard to train from scratch." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-99", "text": "This shows that the proposed pre-training is essentially indispensable."
}, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-100", "text": "----------------------------------" }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-101", "text": "**COMPARING WITH OTHER REPRESENTATIONS**" }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-102", "text": "The proposed approaches are mainly compared with APC [6] representations, as they also experiment on phone classification and speaker verification." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-103", "text": "As reported in [6] , the APC approach outperformed CPC representations [5, 7, 9] in both two tasks, which makes APC suitable as a strong baseline." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-104", "text": "APC uses an unidirectional autoregressive model." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-105", "text": "We compare the proposed approach with APC to show that our bidirectional approach has advantages in speech representation learning." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-106", "text": "For fair comparison, we pre-train APC using their official implementations with the reported ideal parameters and settings, but expand the model's hidden size to H dim =768 to match ours." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-107", "text": "We also report results on 160-dimensional log Mel-features, which helps evaluate the accessibility of speech information from regular acoustic features." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-108", "text": "----------------------------------" }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-109", "text": "**PHONEME CLASSIFICATION**" }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-110", "text": "To measure the accessibility of phonetic information, we train linear phone classifiers using Mel-features, APC and Mockingjay representations from the LibriSpeech train-clean-360 subset." 
}, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-111", "text": "We obtain force-aligned phoneme sequences with the Montreal Forced Aligner [24] , where there are 72 possible phone classes." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-112", "text": "Testing results on the LibriSpeech test-clean subset are presented in Figure 3 ." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-113", "text": "In the case where all 360 hours of labels are used to train the classifier, BASE and LARGE representations increase 11.8% and 15.2% accuracy from Mel-features." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-114", "text": "The BASE-FT2 model outperforms all representations after 2 epochs of fine-tuning, with 10.2% and 35.2% absolute improvement over APC and Mel-features, respectively." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-115", "text": "We observe that fine-tuning for 2 epochs is enough to reveal our method's potential as there is only a small gap (3.9%) between BASE-FT2 and BASE-FT500." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-116", "text": "Furthermore, LARGE-WS improves over LARGE, just as we expected." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-117", "text": "To demonstrate how pre-training on speech can improve supervised training in resource constrained scenarios where human labels are short, we train the classifier with reduced amount of training data." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-118", "text": "The performance variation of different methods are plotted in Figure 3 , where we measure over various intervals of constrained training data to observe performance drop." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-119", "text": "Both BASE and LARGE increase accuracy over Mel-features across various amount of transcribed data." 
}, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-120", "text": "Whereas the APC approach performs well on the full resource but fails to generalize for limited amount of labeled data." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-121", "text": "In the case where there are only 0.36 hours of data available, we improve accuracy by 22.7% (an absolute improvement from Mel-features)." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-122", "text": "Note that with only 0.36 hours (0.1%) of labels available, BASE-FT2 (57.9% acc) even outperformed Mel-features (49.1% acc) that uses all 360 hours (100%) of labeled data." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-123", "text": "We conclude that pre-training Mockingjay on speech substantially improves the performance on supervised task that requires human annotations." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-124", "text": "----------------------------------" }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-125", "text": "**SPEAKER RECOGNITION**" }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-126", "text": "To demonstrate that the proposed approach performs constantly for all SLP downstream tasks, we report speaker recognition results on the LibriSpeech 100 hour selected subset, where train/test split is performed randomly with a 9:1 ratio, and there are 63 possible speakers." 
}, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-127", "text": "We trained a simple one-layer RNN classifier for speaker recognition using different representations, results are listed in Table 2" }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-128", "text": "----------------------------------" }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-129", "text": "**SENTIMENT CLASSIFICATION ON SPOKEN CONTENT**" }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-130", "text": "To demonstrate domain invariant transferability of the proposed representation across different datasets, the Mockingjay model is pre-trained on LibriSpeech and applied on the MOSEI [25] dataset." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-131", "text": "We also use a simple one-layer RNN classifier, where the model is trained to extract linguistic meanings from speech and discriminates between sentiments." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-132", "text": "The results listed in Table 2 lead to an identical conclusion as in the speaker recognition task discussed above." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-133", "text": "Except that in the case of sentiment classification, LARGE-WS achieved the highest score without the need of fine-tuning, demonstrating that a deeper model has great potential for extracting general speech representations." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-134", "text": "To conclude this section, we claim that the proposed representations are general and can be used on datasets with various unseen domains." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-135", "text": "----------------------------------" }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-136", "text": "**CONCLUSION**" }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-137", "text": "The proposed representation contains a variety of knowledge, including but not limited to phonetic, speaker, and sentiment information." 
}, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-138", "text": "We improve performance for a wide range of downstream tasks, and show promising results in low resource settings, as the learned speech representations are robust and can be transferred to different tasks across different datasets." }, { "sent_id": "7f1723c42fc577fdfd7144b7991db1-C001-139", "text": "In future work, we will investigate and deploy Mockingjay representations on more downstream SLP tasks, including ASR, voice conversion, and speech translation." } ], "y": { "@BACK@": { "gold_contexts": [ [ "7f1723c42fc577fdfd7144b7991db1-C001-19" ], [ "7f1723c42fc577fdfd7144b7991db1-C001-21" ], [ "7f1723c42fc577fdfd7144b7991db1-C001-23" ] ], "cite_sentences": [ "7f1723c42fc577fdfd7144b7991db1-C001-19", "7f1723c42fc577fdfd7144b7991db1-C001-21", "7f1723c42fc577fdfd7144b7991db1-C001-23" ] }, "@MOT@": { "gold_contexts": [ [ "7f1723c42fc577fdfd7144b7991db1-C001-23", "7f1723c42fc577fdfd7144b7991db1-C001-24" ] ], "cite_sentences": [ "7f1723c42fc577fdfd7144b7991db1-C001-23" ] }, "@DIF@": { "gold_contexts": [ [ "7f1723c42fc577fdfd7144b7991db1-C001-33" ], [ "7f1723c42fc577fdfd7144b7991db1-C001-103" ] ], "cite_sentences": [ "7f1723c42fc577fdfd7144b7991db1-C001-33", "7f1723c42fc577fdfd7144b7991db1-C001-103" ] }, "@USE@": { "gold_contexts": [ [ "7f1723c42fc577fdfd7144b7991db1-C001-94" ] ], "cite_sentences": [ "7f1723c42fc577fdfd7144b7991db1-C001-94" ] } } }, "ABC_3e344c590b4d5270a29054ac15efa5_35": { "x": [ { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-2", "text": "Characters are the smallest unit of text that can extract stylometric signals to determine the author of a text." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-3", "text": "In this paper, we investigate the effectiveness of character-level signals in Authorship Attribution of Bangla Literature and show that the results are promising but improvable." 
}, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-4", "text": "The time and memory efficiency of the proposed model is much higher than the word level counterparts but accuracy is 2-5% less than the best performing word-level models." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-5", "text": "Comparison of various word-based models is performed and shown that the proposed model performs increasingly better with larger datasets." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-6", "text": "We also analyze the effect of pre-training character embedding of diverse Bangla character set in authorship attribution." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-7", "text": "It is seen that the performance is improved by up to 10% on pre-training." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-8", "text": "We used 2 datasets from 6 to 14 authors, balancing them before training and compare the results." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-9", "text": "----------------------------------" }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-10", "text": "**I. INTRODUCTION**" }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-11", "text": "Authorship attribution is generally concerned with the identification of the original author of a given text from a set of given authors." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-12", "text": "It has a wide range of applications including plagiarism detection, forensic linguistics, etc." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-13", "text": "Each author has a distinctive writing style that is exploited by statistical analysis to detect the author." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-14", "text": "However, in Bangla language, the amount of work done in this area is not very rich despite being one of the most spoken languages." 
}, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-15", "text": "In traditional methods, texts are represented using independent features such as lexical n-gram or frequency-based representation." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-16", "text": "In this approach, words of similar context are likely to be represented in different vector space as the features are independent." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-17", "text": "So, the semantic values of the words might be lost, which is problematic." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-18", "text": "Word embedding, also generally known as distributed term representations, offers a solution to this problem by encoding semantic similarity from their co-occurrences." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-19", "text": "Chowdhury [1] experimented with the effectiveness of word embedding in authorship attribution for Bangla language for various architectures." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-20", "text": "Another type of embedding, which we tried to analyze in this paper is character embedding." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-21", "text": "Character CNN was first introduced by Zhang [2] for the text classification task." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-22", "text": "Through the empirical experiment of Sebastian [3] and Jozefowicz [4] , character level NLP has been proven to be very promising in various ways." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-23", "text": "Although it may seem that character on its own does not have any semantic value, Radford [5] illustrates that character-level models can capture the semantic properties of text." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-24", "text": "Character level models are also better at handling out-of-vocabulary words, misspelling, etc and provide an open vocabulary." 
}, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-25", "text": "Another major advantage is that it reduces the dimension to as low as 16, unlike word embedding where the dimension can increase up to 300 while the vocabulary is also huge." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-26", "text": "So, character embedding removes the bottleneck in training tasks and gives huge advantages on computational complexity." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-27", "text": "Our approach in this paper was to investigate how character embedding performs in the task of Authorship Attribution in Bangla language." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-28", "text": "Bangla Language has numerous words with joint letters which can be written in a few different forms." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-29", "text": "Moreover, there are some words with the same meaning but slightly different spelling." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-30", "text": "These inconsistencies are not recognized by word-level models but character-level models can capture and relate words of this kind, making such models more appropriate for Bangla language." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-31", "text": "Comparison of character embedding with word embedding is discussed according to the findings." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-32", "text": "Experiments with and without pretrained embedding layers have also been done to show the effectiveness of information captured in the embeddings." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-33", "text": "No previous work, analysis or investigation has yet published on the effect of character embedding in Authorship Attribution of Bangla Literature as of our knowledge to date." 
}, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-34", "text": "This paper follows the structure provided below:" }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-35", "text": "\u2022 Related Works -Extensive background study on some works relevant to this paper are provided in this section." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-36", "text": "\u2022 Corpus -The dataset used in our experiment is described in this section." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-37", "text": "\u2022 Methodology -The proposed architecture for our char-978-1-7281-5842-6/19/$31.00 \u00a92019 IEEE acter embedding model along with the strategies used during the training phase of the neural networks are described in depth." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-38", "text": "\u2022 Experiments -Describes the evaluation process and the model setup for comparison." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-39", "text": "\u2022 Results and Discussion -Our findings along with results and possible reasons are presented in this part." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-40", "text": "\u2022 Conclusion -In the last section, some recommendations and scope for future research on this field are mentioned." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-41", "text": "----------------------------------" }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-42", "text": "**II. RELATED WORKS**" }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-43", "text": "----------------------------------" }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-44", "text": "**A. ON AUTHORSHIP ATTRIBUTION**" }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-45", "text": "Authorship Attribution has been a topic of important research for a long time." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-46", "text": "With increased anonymity on the internet and easy fraud, authorship attribution of writings has become crucial." 
}, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-47", "text": "For authorship attribution, work on varying degrees of feature selection [6] , including advanced features such as local histograms [7] ." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-48", "text": "Naive similarity-based models [8] , SVMs [9] have been explored." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-49", "text": "Semi-supervised approach to authorship attribution was also taken [10] ." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-50", "text": "SOTA was achieved by Ruder [3] using character-level and multi-channel CNN." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-51", "text": "Compared to other works, very few works have been done in Bangla language, lacking any sort of high benchmarks until very recently." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-52", "text": "Das and Mitra [11] worked with a really small dataset of 36 documents and 3 authors to perform unigram and bi-gram feature-based classification." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-53", "text": "Chakraborty [12] worked with SVMs on 3 authors to achieve up to 84% accuracy." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-54", "text": "Shanta Phani also attempted to attribute 3 authors with machine learning [13] ." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-55", "text": "P. Das, R. Tasmim, and S. Ismail used 4 authors of current times and hand-drawn features such as word frequency, type-token ratio, number of various POS, word/sentence lengths etc [14] ." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-56", "text": "90.67% was achieved by Hossain and Rahman by using multiple features along with cosine similarity [15] ." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-57", "text": "Pal, Siddika, and Ismail achieved 90.74% accuracy with 6 authors using SVM on one feature [16] ." 
}, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-58", "text": "Multi-layered perceptrons were employed by Phani, Lahiri, and Biswas [17] ." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-59", "text": "Impressive results were achieved very recently by [1] using various word embeddings on a 6 author dataset." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-60", "text": "They demonstrated the effects of various architectures and word embeddings on authorship attribution and concluded that fastTexts skip-gram used with CNN tends to beat all other models in terms of accuracy." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-61", "text": "No work has been done on the character level classification task as of knowledge in Bangla literature." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-62", "text": "The effects of Bangla alphabet complexity and language formulation on architectural design and character embedding learning remains largely untouched." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-63", "text": "----------------------------------" }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-64", "text": "**B. ON EMBEDDING**" }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-65", "text": "Embeddings are effectively mappings from various entities (character, word, sentence, etc) to continuous vector spaces in high dimensions." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-66", "text": "The relation among the numerical representations gives a semantic, syntactic and morphological meaning of the entities." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-67", "text": "These meanings are leveraged by machine learning techniques to find patterns in texts and thus perform various tasks such as classification." 
}, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-68", "text": "----------------------------------" }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-69", "text": "**1) WORD EMBEDDING:**" }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-70", "text": "Representing words in continuous vector spaces is considered as one of the breakthroughs of NLP." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-71", "text": "Word embeddings are learned in the form of an embedding layer or separately in an unsupervised manner." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-72", "text": "Among the unsupervised techniques include Continuous Bag-of-Words(CBOW) and Skip-Gram models famously implemented by Word2Vec and fastText." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-73", "text": "Also, there are co-occurrence statistical methods such as Glove." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-74", "text": "Santos [18] used word embeddings with convolutional models showing significant improvements over baseline methods." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-75", "text": "Word embeddings have been used to improve the performance of sentiment analysis [19] ." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-76", "text": "Often pre-trained embeddings are used or are learned for specific tasks such as tree-structured long short-term memory networks [20] and Multi-perspective sentence similarity modeling [21] ." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-77", "text": "Although words started to be used as units of text, various works have started to break down words and work at subword and character levels." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-78", "text": "Wieting [22] creates subword embedding from counts of character n-grams." 
}, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-79", "text": "2) Character Embedding: Character Level embeddings are used in various ways, either by themselves or to produce embeddings of higher levels e.g for words." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-80", "text": "Character embeddings have been employed in POS tagging [23] , language modelling [23] and dependency parsing [24] ." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-81", "text": "Character-RNN were used for machine translation, for representing words [25] or to generate character level translations [26] ." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-82", "text": "Pure Character level classification was first explored using CNN architecture [2] ." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-83", "text": "Jozefowicz [4] shows that a character-level language model can significantly outperform state of the art models." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-84", "text": "Their best performing model combines an LSTM with CNN input over the characters." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-85", "text": "Besides using either just word or character embeddings, ideas of combining them also have been introduced [27] ." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-86", "text": "Attempts to learn character embedding and serve as pre-trained have also been explored [28] ." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-87", "text": "----------------------------------" }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-88", "text": "**III. CORPUS**" }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-89", "text": "Because of the scarcity for the standard dataset in authorship attribution, we made a custom web crawler to parse the data on our own." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-90", "text": "We collected writings from an online Bangla e-library containing writings(e.g., novels, story, series, etc.) 
of different authors." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-91", "text": "Table I shows the details of our dataset." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-92", "text": "Our dataset is larger compared to the previously worked on datasets for Bangla as mentioned in section II with 13.4+ million words." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-93", "text": "The dataset was equally partitioned with each document having the same length of 750 words." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-94", "text": "Various subsets of authors were chosen and the dataset was truncated to each author having the same number of samples." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-95", "text": "The dataset from the paper [1] was also used." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-96", "text": "This dataset consists of 6 authors with 350 sample texts per author and total word count of 2.3+ million." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-97", "text": "For pre-training our model, we used another large corpus of Bangla Newspaper articles based on 6 topics." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-98", "text": "The topics were accident, crime, education, entertainment, environment, and sports." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-99", "text": "The dataset consists of 10564543 tokens." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-100", "text": "----------------------------------" }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-101", "text": "**IV. METHODOLOGY A. PROPOSED ARCHITECTURE**" }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-102", "text": "Character-level CNN can sufficiently replace words for classifications [2] ." 
}, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-103", "text": "This means CNN does not require the syntactic or semantic structure of a language, which makes such approaches effectively independent of language as the number of characters is limited." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-104", "text": "To this end, CNN was used in this paper to perform the task of author attribution." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-105", "text": "An elaborate set of experiments were performed on 3 different datasets to conclude with an architecture that successfully extracts the character level features of any sample text." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-106", "text": "The same architecture was used to prepare the pre-trained character embeddings for classification tasks." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-107", "text": "The model is a deep neural network starting with 4 convolutional layers, each followed with a maxpool layer of kernel size 3." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-108", "text": "As standardized in computer vision, for the convolutional layers, the number of filters increases while decreasing the kernel size at each layer." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-109", "text": "The kernel sizes are respectively 7,3,1 and 1." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-110", "text": "The number of filters is 64,128,256 and 256." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-111", "text": "Beneath all is an embedding layer where each character is represented as a vector of length V , i.e, the alphabet size." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-112", "text": "The convolutional layers are stacked with a fully connected layer of 512 activation nodes, activation function ReLU and dropout." 
}, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-113", "text": "Finally, an output layer with softmax is used to provide the classification probabilities." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-114", "text": "For optimization Adam optimizer is used along with categorical cross-entropy as the loss function." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-115", "text": "----------------------------------" }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-116", "text": "**B. CHARACTER EMBEDDING**" }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-117", "text": "Character embedding aims to turn characters into meaningful numerical representations in the form of vectors." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-118", "text": "These vectors may represent the correlation of different characters, or even correlation of groups of characters together i.e. words, sentences, documents, etc." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-119", "text": "This concept can be leveraged to use character embeddings to fit misspelled words, rare or new words, slangs or emoticons." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-120", "text": "They can also easily represent words with variations such as drive, driving, drives, etc." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-121", "text": "There is no more bottleneck for out of vocabulary words." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-122", "text": "The character set can be used to make any word, even if it is out of vocabulary, in contrast to word embeddings which simply ignored them, or had weak representations for rare words." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-123", "text": "This way character embeddings increase generalization compared to words." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-124", "text": "Another significant improvement is the vocabulary size." 
}, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-125", "text": "Instead of having a very large vocabulary of words, character embeddings have a fixed number of characters which is significantly smaller, therefore reduces model complexity and the number of parameters by a significant amount." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-126", "text": "Furthermore, they can be represented with a small vector size (e.g 16) and still be significantly informative as opposed to word embeddings which require at least 100-300 size vectors for a decent model." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-127", "text": "The simplest way to represent a character is to use a one-hot encoding." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-128", "text": "This requires the vector size to be the size of the alphabet." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-129", "text": "We used a one-hot encoding as a baseline for comparison of pre-trained embeddings." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-130", "text": "Otherwise, one can randomly initialize the vectors, where the vectors can be of any size as small as 16 to as big as 300." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-131", "text": "This becomes a hyperparameter for tuning." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-132", "text": "----------------------------------" }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-133", "text": "**C. TRAINING THE MODEL**" }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-134", "text": "The alphabet size, and therefore the embedding vector size is 253." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-135", "text": "Among the 253 different characters are the English letters(capital and small) and digits, Bangla letters and digits, Bangla vowel symbols, and various other punctuation and symbols." 
}, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-136", "text": "For comparative training, two sets of embeddings were created for the character set." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-137", "text": "First is one-hot encoding, and the other is pre-trained embeddings." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-138", "text": "The training was done in two phases as stated below:" }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-139", "text": "----------------------------------" }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-140", "text": "**1) PRE-TRAINING**" }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-141", "text": "Embedding:" }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-142", "text": "To learn character embeddings, the architecture mentioned above was used for classification of the news dataset as mentioned in section III." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-143", "text": "This is in contrast to the usual ways of learning embeddings." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-144", "text": "No separate model was used [28] to learn the embeddings." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-145", "text": "Instead, already available classification task on a marginally large dataset learns character embeddings for its purposes." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-170", "text": "The model is initialized with pre-trained word embeddings from word2vec and fastText, both CBOW and skip-gram versions." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-146", "text": "These embeddings can be used as initialization for the author attribution task, which has a smaller dataset compared to the former, giving it an initial boost." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-147", "text": "The model was trained with a learning rate of 0.001 and decay of 0.0001." 
}, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-148", "text": "The maximum length of each text sample was set as 1000 and batch size as 80." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-149", "text": "A dropout rate of 0.5 was used in the fully connected layer to prevent over-fitting." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-150", "text": "The embeddings then learned to have an understanding of how the Bangla language works and provide a meaningful initialization for any classification tasks." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-151", "text": "They were then extracted and used for the task of authorship attribution." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-152", "text": "2) Performing Classification: To perform the main task of author attribution and comparison, this training phase was performed twice with each type of embeddings mentioned above, i.e one-hot and pre-trained." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-153", "text": "The fully connected layer was given a dropout probability of 0.7 and trained with batch size 128 and the maximum length of each text was set to be 3000 characters." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-154", "text": "Everything else was kept similar." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-155", "text": "The classification was carried out with 2 author attribution datasets, one with 6 authors [1] and our dataset with maximum 14 authors." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-156", "text": "The larger dataset was trained with 6,8,10,12 and 14 authors to analyze the effects of increasing classes on the proposed model." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-157", "text": "----------------------------------" }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-158", "text": "**V. 
EXPERIMENTS**" }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-159", "text": "We evaluate the performance of the proposed architecture in terms of accuracy, with and without pre-training character level embedding and comparing them on the held-out dataset." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-160", "text": "We also try to infer how the character-level model compares with the word level models." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-161", "text": "All models are compared for the increasing number of authors(classes) on the corpus mentioned to assess the quality of the models." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-162", "text": "To keep the dataset balanced, the number of samples per class were truncated to the minimum among the classes." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-163", "text": "We propose a model for word-level classification mostly similar to our Char-CNN model." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-164", "text": "The model used for performance analysis is as follows:" }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-165", "text": "----------------------------------" }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-166", "text": "**A. WORD EMBEDDING MODEL**" }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-167", "text": "This model has a close resemblance to the proposed Char-CNN model except for a few differences to tune with the word level version of the classification." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-168", "text": "The model has 2 convolutional layers with the kernel sizes 7,3 and number of filters are 128,256 respectively for each layer." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-169", "text": "Each layer followed by a maxpool layer." 
}, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-171", "text": "The convolutional layers are stacked with an LSTM layer of 100 neurons and a fully connected layer of 512 activation nodes both with dropout to prevent overfitting." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-172", "text": "Finally, a softmax layer is used to provide the classification probabilities." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-173", "text": "It is trained for 10 epochs with a learning rate of 0.001 with Adam optimizer, the batch size is 32 and 750 words per sample are used as input to models." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-174", "text": "All the word level models have a vocabulary size of 60000 and word embedding vector of size 300." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-175", "text": "----------------------------------" }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-176", "text": "**VI. RESULTS AND DISCUSSION**" }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-177", "text": "The accuracies achieved(in percents) on the test set of the datasets, with pre-trained embeddings for both word and character levels are summarized in Table II ." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-178", "text": "Because the datasets were balanced, the comparison of accuracies is sufficient." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-179", "text": "Accuracy comparison(in percents) of the proposed model with and without pre-trained character embeddings are summarized in Table III." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-180", "text": "From the accuracy comparisons shown in Table II we see that Skip-gram implemented by fastText performs well in the given datasets." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-181", "text": "So we can infer that subword level classification tends to extract a good amount of meaning information and styles from the text." 
}, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-182", "text": "On the other hand, the word2vec models, which use entire words have worse performance." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-183", "text": "Character level model performs reasonably well in competition with subword level as long as the dataset is big enough." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-184", "text": "When the number of authors increased, the number of samples per author decreased making it difficult for the character-level model to collect enough information." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-185", "text": "With larger datasets, this model will be able to perform significantly better [2] ." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-186", "text": "This can be illustrated from Figure 1 that with a larger number of samples, the Char-CNN model raises steeply and performs competitively with the other models." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-187", "text": "In terms of the number of parameters, character level model is much superior to its word-level counterparts." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-188", "text": "The embedding vectors for the word level models is of size embedding vector * vocabulary size." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-189", "text": "i.e. 300 * 60000." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-190", "text": "On the other hand, the character embedding matrix is of size 253*253 given that we initially used one-hot vectors." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-191", "text": "This size can also be reduced to as low as 253*16 as were done in some research [4] ." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-192", "text": "Another thing to consider is the time it takes to train the models." 
}, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-193", "text": "For the word embedding models, a pure CNN does not work satisfactorily, so an LSTM layer had to be added to add sequential information in the model." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-194", "text": "This improves accuracy with the cost of taking more time to train, around 15-20 minutes." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-195", "text": "On the other hand, the character-level model works significantly well with only using convolutional layers taking less than 2 minutes to train." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-196", "text": "This effect of training time become largely magnified on largescale cases, making the word-level model unfit for light-weight devices." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-197", "text": "As stated in the paper [2] , ConvNets with character embedding can completely replace words and work even without any semantic meanings." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-198", "text": "Which means that convolutional layers can extract whatever information necessary for author attribution, given enough data." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-199", "text": "To illustrate the need of pre-trained character embeddings, we see from III that using a pre-trained embedding increases the accuracy across datasets and the different number of authors, regardless of the amount of data for each author." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-200", "text": "Which shows that these naively learned embeddings contain valuable information that can be easily applied to various tasks of the language, including author attribution, and increase the performance a few degrees." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-201", "text": "These numerical representations of character contain information about morphology and the syntax of the language among other things." 
}, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-202", "text": "Therefore such embedding can be learned from any task and applied to other tasks as a form of transfer learning, given the alphabet remains the same." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-203", "text": "----------------------------------" }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-204", "text": "**VII. CONCLUSION**" }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-205", "text": "So far no work has been done to evaluate the usefulness of character embeddings for classification task in Bangla language." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-206", "text": "We attempt to fill this gap and compare character embeddings with word embeddings showing that character embeddings perform almost as good as the best word embedding model." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-207", "text": "But besides accuracy, character level classification has a greater hand in terms of memory, time and number of parameters." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-208", "text": "Considering the small size of our datasets, we hope to have improved performance with larger datasets, as is the case for character level ConvNets [2] ." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-209", "text": "Besides such network also work better with non-curated texts, which are hard for wordlevel embeddings to capture, thus more applicable to real-life scenarios." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-210", "text": "Furthermore, we analyzed the importance of pretrained character embedding for author attribution and showed that pre-training can result in better performances." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-211", "text": "Since very large corpus is not available in Bangla language yet, we must come up with solutions that tackle attribution tasks sufficiently well even with little data." 
}, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-212", "text": "Therefore our future works include the combination of both character and word level embeddings to perform attribution task, in an attempt to combine the power of both types of embeddings." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-213", "text": "More advanced levels of transfer learning can also be performed by using language models in place of embeddings before classification." }, { "sent_id": "3e344c590b4d5270a29054ac15efa5-C001-214", "text": "Language models and embeddings can also be combined to give greater generalization for Bangla language." } ], "y": { "@BACK@": { "gold_contexts": [ [ "3e344c590b4d5270a29054ac15efa5-C001-21" ], [ "3e344c590b4d5270a29054ac15efa5-C001-82" ], [ "3e344c590b4d5270a29054ac15efa5-C001-102" ], [ "3e344c590b4d5270a29054ac15efa5-C001-185" ], [ "3e344c590b4d5270a29054ac15efa5-C001-197" ] ], "cite_sentences": [ "3e344c590b4d5270a29054ac15efa5-C001-21", "3e344c590b4d5270a29054ac15efa5-C001-82", "3e344c590b4d5270a29054ac15efa5-C001-102", "3e344c590b4d5270a29054ac15efa5-C001-185", "3e344c590b4d5270a29054ac15efa5-C001-197" ] }, "@USE@": { "gold_contexts": [ [ "3e344c590b4d5270a29054ac15efa5-C001-102", "3e344c590b4d5270a29054ac15efa5-C001-103", "3e344c590b4d5270a29054ac15efa5-C001-104" ] ], "cite_sentences": [ "3e344c590b4d5270a29054ac15efa5-C001-102" ] }, "@SIM@": { "gold_contexts": [ [ "3e344c590b4d5270a29054ac15efa5-C001-185", "3e344c590b4d5270a29054ac15efa5-C001-186" ], [ "3e344c590b4d5270a29054ac15efa5-C001-208" ] ], "cite_sentences": [ "3e344c590b4d5270a29054ac15efa5-C001-185", "3e344c590b4d5270a29054ac15efa5-C001-208" ] } } }, "ABC_3ebfa05038431571701a7199163832_35": { "x": [ { "sent_id": "3ebfa05038431571701a7199163832-C001-43", "text": "Time-Domain filterbanks are neural network layers that take the raw waveform as input." 
}, { "sent_id": "3ebfa05038431571701a7199163832-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "3ebfa05038431571701a7199163832-C001-2", "text": "Speech classifiers of paralinguistic traits traditionally learn from diverse hand-crafted low-level features, by selecting the relevant information for the task at hand." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-3", "text": "We explore an alternative to this selection, by learning jointly the classifier, and the feature extraction." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-4", "text": "Recent work on speech recognition has shown improved performance over speech features by learning from the waveform." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-5", "text": "We extend this approach to paralinguistic classification and propose a neural network that can learn a filterbank, a normalization factor and a compression power from the raw speech, jointly with the rest of the architecture." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-6", "text": "We apply this model to dysarthria detection from sentence-level audio recordings." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-7", "text": "Starting from a strong attention-based baseline on which mel-filterbanks outperform standard low-level descriptors, we show that learning the filters or the normalization and compression improves over fixed features by 10% absolute accuracy." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-8", "text": "We also observe a gain over OpenSmile features by learning jointly the feature extraction, the normalization, and the compression factor with the architecture." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-9", "text": "This constitutes a first attempt at learning jointly all these operations from raw audio for a speech classification task." 
}, { "sent_id": "3ebfa05038431571701a7199163832-C001-10", "text": "----------------------------------" }, { "sent_id": "3ebfa05038431571701a7199163832-C001-11", "text": "**INTRODUCTION**" }, { "sent_id": "3ebfa05038431571701a7199163832-C001-12", "text": "Learning from speech still relies on handcrafted, fixed features on which a classifier can be trained." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-13", "text": "This differs from a field like computer vision which now widely uses end-toend models trained on raw pixels, that are typically processed by learnable convolutional operations [1, 2, 3] ." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-14", "text": "Speech features typically contain spectral representations, such as melfilterbanks or MFCCs, and/or low-level informations [4] , such as zero-crossing rate or harmonics-to-noise ratio." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-15", "text": "They are chosen to model a broad range of linguistic and paralinguistic information." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-16", "text": "Training a classifier from these fixed coefficients requires performing a feature selection step, which has the limitation that it cannot retrieve useful information that would have been lost in the feature computation." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-17", "text": "Recent research has shown improvement when replacing fixed speech features by Fig. 1 ." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-18", "text": "Proposed pipeline that learns jointly the feature extraction, the compression, the normalization and the classifier." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-19", "text": "a learnable frontend, for tasks such as speech recognition [5] , speaker identification [6] or emotion recognition [7] ." 
}, { "sent_id": "3ebfa05038431571701a7199163832-C001-20", "text": "In this work, we propose to apply such end-to-end systems to another paralinguistic task: the detection of dysarthria from speech recordings." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-21", "text": "There is a growing interest in automatically extracting information from speech for health care [8, 9, 10] , and unlike a feature-driven approach that would require testing various combinations of fixed features, we implement a system that can directly process raw speech and learn relevant features jointly with the dysarthria classifier, such that they will be optimal for the task." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-22", "text": "The TORGO database [11] is a collection of annotated speech recordings and articulatory measurements from speakers with cerebral palsy (CP) or amyotrophic lateral sclerosis (ALS), as well as control patients." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-23", "text": "[12, 13, 14] have used this database to provide speech recognition systems with robustness to dysarthria." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-24", "text": "[15] trains various linear classifiers on TORGO and the NKI CCRT corpus [16] to detect dysarthria." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-25", "text": "More recently, [17] has trained fully connected neural networks to classify the severity of the disease, using TORGO and the UASPEECH [18] database." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-26", "text": "All these models are trained on standard low-level features." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-27", "text": "In this work we show that dysarthria detection benefits significantly from learning directly from the raw waveform." 
}, { "sent_id": "3ebfa05038431571701a7199163832-C001-28", "text": "Previous work has explored learnable alternatives to speech features that rely on a similar computation to spectral representations [19, 20, 21, 22, 5] ." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-29", "text": "These approaches learn convolutions that are then passed through a non-linearity, eventually a pooling operator and then a log compression to replicate the dynamic range compression typically performed on spectrograms or mel-filterbanks." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-30", "text": "This compression function remains fixed and is chosen beforehand, which could impact the final performance, as various compression functions including logarithm, cubic root, or 10th root have been previously showed to perform better depending on the task (see Table 2 of [23] )." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-31", "text": "A second fixed component is the meanvariance normalization of speech features." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-32", "text": "[5] integrates this normalization into the neural architecture, but keeps it fixed during training." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-33", "text": "[24] introduces a computational block, the Per Channel Energy Normalization (PCEN) that can learn a compression and a normalization factor per channel, and can be integrated into a neural network on top of speech features." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-34", "text": "It has since then been used in production speech recognition systems [25] ." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-35", "text": "In this work, we start from an attention-based model on mel-filterbanks, which already outperforms an equivalent model trained on low-level descriptors (LLDs)." 
}, { "sent_id": "3ebfa05038431571701a7199163832-C001-36", "text": "Our experiments show that by training a PCEN block on top of mel-filterbanks or replacing them by learnable time-domain filterbanks from [22] , we get a gain in accuracy around 10% in absolute when training an identical neural network for dysarthria detection." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-37", "text": "Finally, by combining time-domain filterbanks and PCEN we propose the first audio frontend that can learn features, compression and normalization jointly with a neural network using backpropagation." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-38", "text": "----------------------------------" }, { "sent_id": "3ebfa05038431571701a7199163832-C001-39", "text": "**MODEL**" }, { "sent_id": "3ebfa05038431571701a7199163832-C001-40", "text": "----------------------------------" }, { "sent_id": "3ebfa05038431571701a7199163832-C001-41", "text": "**TIME-DOMAIN FILTERBANKS**" }, { "sent_id": "3ebfa05038431571701a7199163832-C001-42", "text": "As the first step of our computational pipeline, we use TimeDomain filterbanks from [22] ." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-44", "text": "They can be initialized to replicate mel-filterbanks, and then learnt for the task at hand." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-45", "text": "The standard computation of melfilterbanks relies on passing a spectrogram through a bank of frequency domain filters." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-46", "text": "More formally, the n th melfilterbank of a signal in t is: where" }, { "sent_id": "3ebfa05038431571701a7199163832-C001-47", "text": "is the waveform windowed with an Hanning function \u03c6 centered in t, (\u03c8 n ) n=1...N the N melfilters andf denotes the Fourier transform of f ." 
}, { "sent_id": "3ebfa05038431571701a7199163832-C001-48", "text": "[26] shows that these coefficient can be approximated in the time domain by the following computation, referred as the first order scattering transform:" }, { "sent_id": "3ebfa05038431571701a7199163832-C001-49", "text": "where (\u03d5 n ) n=1...N are Gabor wavelets defined in [22] such that |\u03c6 n | 2 \u2248 |\u03c8 n | 2 . [22] shows that this computation can be implemented as neural network layers, referred as TimeDomain filterbanks (TD-filterbanks)." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-50", "text": "The waveform goes through a complex-valued convolution, a modulus operator and the a convolution with a lowpass-filter (the squared hanning window) that performs the decimation." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-51", "text": "When not combined with PCEN, a log-compression is added on top of TD-filterbanks after adding 1 to their absolute value to avoid numerical issues." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-52", "text": "Table 1 shows the detailed layers." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-53", "text": "Following [22] , the first 1D convolution filters are initialized with Gabor wavelets, to replicate mel-filterbanks, and are then learnt at the same time as the rest of the model." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-54", "text": "The second convolution layer is kept fixed as a squared hanning window to perform lowpass filtering." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-55", "text": "----------------------------------" }, { "sent_id": "3ebfa05038431571701a7199163832-C001-56", "text": "**PER CHANNEL ENERGY NORMALIZATION**" }, { "sent_id": "3ebfa05038431571701a7199163832-C001-57", "text": "Per Channel Energy Normalization (PCEN) is a learnable component introduced in [24] which computes parametrized normalization and compression." 
}, { "sent_id": "3ebfa05038431571701a7199163832-C001-58", "text": "It replaces the log-compression and the mean-variance normalization." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-59", "text": "With E(t, f ) the value of the feature f at time t, the computation of PCEN is:" }, { "sent_id": "3ebfa05038431571701a7199163832-C001-60", "text": "M (t, f ) is a moving average of the feature f along the time axis, defined as:" }, { "sent_id": "3ebfa05038431571701a7199163832-C001-61", "text": "\u03b1 controls the strength of the normalization, the exponent r (typically in [0, 1]) defines the slope of the compression curve, s sets the spread of the moving average, and is a small scalar used to avoid division by zero." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-62", "text": "By backpropagation, we learn \u03b1, r, and \u03b4 with the rest of the model to obtain a compression and a normalization that fit the task at hand." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-63", "text": "----------------------------------" }, { "sent_id": "3ebfa05038431571701a7199163832-C001-64", "text": "**LSTM AND ATTENTION MODEL**" }, { "sent_id": "3ebfa05038431571701a7199163832-C001-65", "text": "The output of the learnable frontend is fed to an attentionbased model [27] , that contains one LSTM layer of hidden size 60 followed by an attention mechanism, inspired by [28] ." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-66", "text": "The attention mechanism consists of two fully connected layers, of 50 and 1 unit respectively, and a softmax layer, that are applied to each output of the LSTM." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-67", "text": "The vector obtained is used to weight a linear combination of the LSTM outputs, that goes throught another fully connected layer of size the number of labels considered." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-68", "text": "The detailed architecture is shown in Figure 1 ." 
}, { "sent_id": "3ebfa05038431571701a7199163832-C001-69", "text": "In [28] , this model reaches state-of-the-art performance when trained for emotion recognition on melfilterbanks, which motivated using it for the paralinguistic task of dysarthria detection." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-70", "text": "----------------------------------" }, { "sent_id": "3ebfa05038431571701a7199163832-C001-71", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "3ebfa05038431571701a7199163832-C001-72", "text": "We carry experiments on the TORGO database [29] ." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-73", "text": "It consists of sound recordings, sampled at 16kHz, from speakers with either cerebral palsy or amyotrophic lateral sclerosis, which are two of the prevalent causes of speech disability or dysarthria." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-74", "text": "Similar data for a control set of subjects is also available." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-75", "text": "Along with sound recordings, TORGO contains 3D articulatory features that we did not use." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-76", "text": "There are five groups of people: the control group not affected by the disease, and 4 other groups of affected people, classified by the severity of the disease." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-77", "text": "Each person recorded has a code name, F is for female, M is for male, while C is for control, followed by an identification number." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-78", "text": "A random split of the database would result in similar speakers in training, validation, and test sets, that could reduce the task to a speaker identification task." 
}, { "sent_id": "3ebfa05038431571701a7199163832-C001-79", "text": "To avoid this confounding factor, we split the database to have a good repartition of the different severities among the training, validation and test set, while having no common speakers between the different sets (see Table 2 for the detailed split)." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-80", "text": "After studying the database we decided to pad the recordings so they all last 2.5s." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-81", "text": "We extracted some typical low level descriptors (LLDs) from it to have a first baseline." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-82", "text": "We use the OpenSmile toolkit [4] , with the configuration of the Interspeech 2009 Emotion Challenge [30] ." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-83", "text": "For each 25ms window of the recordings (strided by 10ms), 32 features are extracted (12 MFCCs, root mean square energy, zero-crossing rate, harmonics-to-noise ratio, F 0 and their \u2206)." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-84", "text": "Our second baseline takes as input mel-filterbanks." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-85", "text": "We pre-emphase the sound signals with a factor of 0.97." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-86", "text": "64 mel- Table 3 ." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-87", "text": "UAR (%) of the attention-based model trained over different features or learnable frontends." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-88", "text": "The UAR is averaged over 3 runs and standard deviations are reported." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-89", "text": "filterbanks are computed every 25ms with a stride of 10ms and passed through a log-compression." 
}, { "sent_id": "3ebfa05038431571701a7199163832-C001-90", "text": "To evaluate our learnable frontend in a comparable setting, we design them with the same number of filters, window size and stride (see Table 1 )." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-91", "text": "For the PCEN layer, we take = 10 \u22126 and s = 0.5, both fixed, and we only consider the absolute value of r. We initialize r, \u03b1 and \u03b4 at 0.5, 0.98 and 2.0 respectively." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-92", "text": "All models are trained with a stochastic gradient descent with momentum (0.98) and batch size 1, with a learning rate of 0.001." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-93", "text": "We use the Unweighted Average Recall (UAR) to evaluate our results." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-94", "text": "The UAR of a model is the mean of its accuracy for each label." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-95", "text": "It is a better metric when dealing with unbalanced datasets than the accuracy, since it is reweighting the results depending on the size of each class." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-96", "text": "It has been widely used in unbalanced settings such as the Emotion Recognition challenge [30] ." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-97", "text": "We use the validation set for hyperparameter selection and early stopping." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-98", "text": "Table 3 shows the UAR on the validation and test sets." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-99", "text": "All the results are the mean UAR obtained over three runs with different random initialization." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-100", "text": "We do not compare them to previously published results [15, 17] as they use additional data and/or perform a different task." 
}, { "sent_id": "3ebfa05038431571701a7199163832-C001-101", "text": "The attention based-model trained on LLDs features reaches an accuracy of 66% and is our baseline system." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-102", "text": "Replacing LLDs by mel-filterbanks improves the performance by 6% in absolute." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-103", "text": "Adding a fixed mean-variance normalization step (mvn) brings the models to Table 4 ." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-104", "text": "UAR (%) of the attention-based model trained over different fully learnable frontends." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-105", "text": "The UAR is averaged over 3 runs and standard deviations are reported." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-106", "text": "----------------------------------" }, { "sent_id": "3ebfa05038431571701a7199163832-C001-107", "text": "**RESULTS**" }, { "sent_id": "3ebfa05038431571701a7199163832-C001-108", "text": "over-fitting, and thus the UAR decreases of 2%." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-109", "text": "However, we observe that replacing the fixed log-compression and meanvariance normalization step by a learnable PCEN layer improves the UAR of the models of 7% compared to the unnormalized mel-filterbanks." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-110", "text": "Moreover, an even bigger increase is noticed when replacing mel-filterbanks by equivalent TDfilterbanks (10% in absolute)." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-111", "text": "We can emphasize the fact that using the TD-filterbanks also leads to a more stable learning process, as the standard deviation along different runs is considerably lower." 
}, { "sent_id": "3ebfa05038431571701a7199163832-C001-112", "text": "When studying the new scale learned by the TD-filterbanks (see Figure 2) we notice that the filters tend to focus around 2000 Hz and 6500 Hz, which suggests that either those frequencies are crucial to identify dysarthria, or the model might exploit a bias in the dataset." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-113", "text": "In Figure 3 we observe that the parameters learned by the PCEN layer follow similar patterns from one model to another, and that the learned compression varies between filters, unlike log-compression, which is applied identically to all channels." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-114", "text": "----------------------------------" }, { "sent_id": "3ebfa05038431571701a7199163832-C001-115", "text": "**FULLY LEARNABLE FRONTEND**" }, { "sent_id": "3ebfa05038431571701a7199163832-C001-116", "text": "As we observe independent gains from learning either the features or the compression-normalization, in our final experiments we explore learning all of these operations jointly." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-117", "text": "We remove the log-compression step of the Time-Domain filterbanks and replace it by a PCEN layer." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-118", "text": "We use three settings: one in which r, \u03b1 and \u03b4 are all learned, a second in which only r is learned, and a last one in which only \u03b1 is learned." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-119", "text": "If a parameter is not learned, it is fixed to its initial value (specified in Section 3)." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-120", "text": "Table 4 shows that learning only the normalization exponent gives worse results than the models trained on LLDs."
}, { "sent_id": "3ebfa05038431571701a7199163832-C001-121", "text": "However, we notice that the model learning r, \u03b1 and \u03b4, and the one learning only r, match the models using mel-filterbanks." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-122", "text": "----------------------------------" }, { "sent_id": "3ebfa05038431571701a7199163832-C001-123", "text": "**CONCLUSION**" }, { "sent_id": "3ebfa05038431571701a7199163832-C001-124", "text": "This paper presents a fully learnable audio frontend, combining Time-Domain filterbanks and Per-Channel Energy Normalization." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-125", "text": "It is the first time that a model has been developed with the ability to learn the extraction, compression and normalization of features from the raw waveform, jointly with a classifier." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-126", "text": "We apply it to dysarthria detection." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-127", "text": "Fig. 2: New scales obtained by three independent models using TD-filterbanks, compared to the mel scale." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-128", "text": "The center frequency is the frequency for which a filter is maximum." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-129", "text": "Fig. 3:" }, { "sent_id": "3ebfa05038431571701a7199163832-C001-130", "text": "Approximation of the compression exponent obtained for the PCEN layer learned on mel-filterbanks." }, { "sent_id": "3ebfa05038431571701a7199163832-C001-131", "text": "We show that replacing fixed features by learnable frontends leads to an increase in the performance of the models on this task, consistent with previous results on other linguistic and paralinguistic tasks."
}, { "sent_id": "3ebfa05038431571701a7199163832-C001-132", "text": "Learning only the Time-Domain filterbanks or only the PCEN parameters gives better results than learning them jointly, but learning both still gives similar or better performance than using fixed features, which constitutes a proof of concept for fully learnable audio frontends." } ], "y": { "@BACK@": { "gold_contexts": [ [ "3ebfa05038431571701a7199163832-C001-28" ], [ "3ebfa05038431571701a7199163832-C001-49" ] ], "cite_sentences": [ "3ebfa05038431571701a7199163832-C001-28", "3ebfa05038431571701a7199163832-C001-49" ] }, "@USE@": { "gold_contexts": [ [ "3ebfa05038431571701a7199163832-C001-36" ], [ "3ebfa05038431571701a7199163832-C001-42" ], [ "3ebfa05038431571701a7199163832-C001-49" ], [ "3ebfa05038431571701a7199163832-C001-53" ] ], "cite_sentences": [ "3ebfa05038431571701a7199163832-C001-36", "3ebfa05038431571701a7199163832-C001-42", "3ebfa05038431571701a7199163832-C001-49", "3ebfa05038431571701a7199163832-C001-53" ] } } }, "ABC_fd9122d20c390ea115c27092170739_35": { "x": [ { "sent_id": "fd9122d20c390ea115c27092170739-C001-45", "text": "----------------------------------" }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-46", "text": "**PROPOSED APPROACH**" }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-2", "text": "This paper describes a hybrid model that combines a rule-based model with two statistical models for the task of POS guessing of Chinese unknown words." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-3", "text": "The rule-based model is sensitive to the type, length, and internal structure of unknown words, and the two statistical models utilize contextual information and the likelihood for a character to appear in a particular position of words of a particular length and POS category."
}, { "sent_id": "fd9122d20c390ea115c27092170739-C001-4", "text": "By combining models that use different sources of information, the hybrid model achieves a precision of 89%, a significant improvement over the best result reported in previous studies, which was 69%." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-5", "text": "----------------------------------" }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-7", "text": "Unknown words constitute a major source of difficulty for Chinese part-of-speech (POS) tagging, yet relatively little work has been done on POS guessing of Chinese unknown words." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-8", "text": "The few existing studies all attempted to develop a unified statistical model to compute the probability of a word having a particular POS category for all Chinese unknown words (Chen et al., 1997; Wu and Jiang, 2000; Goh, 2003) ." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-9", "text": "This approach tends to miss one or more pieces of information contributed by the type, length, internal structure, or context of individual unknown words, and fails to combine the strengths of different models." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-10", "text": "The rule-based approach was rejected with the claim that rules are bound to overgenerate (Wu and Jiang, 2000) ." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-11", "text": "In this paper, we present a hybrid model that combines the strengths of a rule-based model with those of two statistical models for this task." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-12", "text": "The three models make use of different sources of information." 
}, { "sent_id": "fd9122d20c390ea115c27092170739-C001-13", "text": "The rule-based model is sensitive to the type, length, and internal structure of unknown words, with overgeneration controlled by additional constraints." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-14", "text": "The two statistical models make use of contextual information and the likelihood for a character to appear in a particular position of words of a particular length and POS category respectively." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-15", "text": "The hybrid model achieves a precision of 89%, a significant improvement over the best result reported in previous studies, which was 69%." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-16", "text": "----------------------------------" }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-17", "text": "**CHINESE UNKNOWN WORDS**" }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-18", "text": "The definition of what constitutes a word is problematic for Chinese, as Chinese does not have word delimiters and the boundary between compounds and phrases or collocations is fuzzy." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-19", "text": "Consequently, different NLP tasks adopt different segmentation schemes (Sproat, 2002) ." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-20", "text": "With respect to any Chinese corpus or NLP system, therefore, unknown words can be defined as character strings that are not in the lexicon but should be identified as segmentation units based on the segmentation scheme." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-43", "text": "Second, Wu and Jiang (2000) argued that assigning POS to Chinese unknown words on the basis of the internal structure of those words will \"result in massive overgeneration\" (p. 48)." 
}, { "sent_id": "fd9122d20c390ea115c27092170739-C001-21", "text": "Chen and Bai (1998) categorized Chinese unknown words into the following five types: 1) acronyms, i.e., shortened forms of long names, e.g., b\u011bi-d\u00e0 for b\u011bij\u012bng-d\u00e0xu\u00e9 'Beijing University'; 2) proper names, including person, place, and organization names, e.g., M\u00e1o-Z\u00e9d\u014dng; 3) derived words, which are created through affixation, e.g., xi\u00e0nd\u00e0i-hu\u00e0 'modernize'; 4) compounds, which are created through compounding, e.g., zh\u01d0-l\u01ceoh\u01d4 'paper tiger'; and 5) numeric type compounds, including numbers, dates, time, etc., e.g., li\u01ceng-di\u01cen 'two o'clock'." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-22", "text": "Other types of unknown words exist, such as loan words and reduplicated words." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-23", "text": "A monosyllabic or disyllabic Chinese word can reduplicate in various patterns, e.g., z\u01d2u-z\u01d2u 'take a walk' and pi\u00e0o-pi\u00e0o-li\u00e0ng-li\u00e0ng 'very pretty' are formed by reduplicating z\u01d2u 'walk' and pi\u00e0o-li\u00e0ng 'pretty' respectively." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-24", "text": "The identification of acronyms, proper names, and numeric type compounds is a separate task that has received substantial attention." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-25", "text": "Once a character string is identified as one of these, its POS category also becomes known." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-26", "text": "We will therefore focus on reduplicated and derived words and compounds only." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-27", "text": "We will consider unknown words of the categories of noun, verb, and adjective, as most unknown words fall under these categories (Chen and Bai, 1998) ."
}, { "sent_id": "fd9122d20c390ea115c27092170739-C001-28", "text": "Finally, monosyllabic words will not be considered as they are well covered by the lexicon." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-29", "text": "----------------------------------" }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-30", "text": "**PREVIOUS APPROACHES**" }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-31", "text": "Previous studies all attempted to develop a unified statistical model for this task." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-32", "text": "Chen et al. (1997) examined all unknown nouns, verbs, and adjectives and reported a 69.13% precision, using Dice metrics to measure the affix-category association strength and an affix-dependent entropy weighting scheme for determining the weightings between prefix-category and suffix-category associations." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-33", "text": "This approach is blind to the type, length, and context of unknown words." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-34", "text": "Wu and Jiang (2000) calculated P(Cat,Pos,Len) for each character, where Cat is the POS of a word containing the character, Pos is the position of the character in that word, and Len is the length of that word." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-35", "text": "They then calculated the POS probabilities for each unknown word as the joint probabilities of the P(Cat,Pos,Len) of its component characters." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-36", "text": "This approach was applied to unknown nouns, verbs, and adjectives that are two to four characters long." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-37", "text": "They did not report results on unknown word tagging, but reported that the new word identification and tagging mechanism increased parser coverage."
}, { "sent_id": "fd9122d20c390ea115c27092170739-C001-38", "text": "We will show that this approach suffers reduced recall for multisyllabic words if the training corpus is small." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-39", "text": "Goh (2003) reported a precision of 59.58% on all unknown words using Support Vector Machines." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-40", "text": "Several reasons were suggested for rejecting the rule-based approach." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-41", "text": "First, Chen et al. (1997) claimed that it does not work because the syntactic and semantic information for each character or morpheme is unavailable." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-42", "text": "This claim does not fully hold, as the POS information about the component words or morphemes of many unknown words is available in the training lexicon." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-44", "text": "We will show that overgeneration can be controlled by additional constraints." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-47", "text": "We propose a hybrid model that combines the strengths of different models to arrive at better results for this task." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-48", "text": "The models we will consider are a rule-based model, the trigram model, and the statistical model developed by Wu and Jiang (2000) ." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-49", "text": "Combination of the three models will be based on the evaluation of their individual performances on the training data." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-50", "text": "----------------------------------" }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-51", "text": "**THE RULE-BASED MODEL**" }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-52", "text": "The motivations for developing a set of rules for this task are twofold." 
}, { "sent_id": "fd9122d20c390ea115c27092170739-C001-53", "text": "First, the rule-based approach was dismissed without testing in previous studies." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-54", "text": "However, hybrid models that combine rule-based and statistical models outperform purely statistical models in many NLP tasks." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-55", "text": "Second, the rule-based model can incorporate information about the length, type, and internal structure of unknown words at the same time." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-56", "text": "Rule development involves knowledge of Chinese morphology and generalizations of the training data." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-57", "text": "Disyllabic words are harder to generalize than longer words, probably because their monosyllabic component morphemes are more fluid than the longer component morphemes of longer words." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-58", "text": "It is interesting to see if reduction in the degree of fluidity of its components makes a word more predictable." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-59", "text": "We therefore develop a separate set of rules for words that are two, three, four, and five or more characters long. Table 1 (rule distribution by word length and rule type): two-character words: T1=1, T2=2, T3=1, T4=2 (6 rules); three-character: T1=2, T2=6, T3=2, T4=5 (15); four-character: T1=2, T2=2, T3=0, T4=8 (12); five or more characters: T1=0, T2=1, T3=0, T4=1 (2); totals: T1=5, T2=11, T3=3, T4=16, i.e., 35 rules overall." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-60", "text": "The rules developed fall into the following four types: 1) reduplication rules (T1), which tag reduplicated unknown words based on knowledge about the reduplication process; 2) derivation rules (T2), which tag derived unknown words based on knowledge about the affixation process; 3) compounding rules (T3), which tag unknown compounds based on the POS information of their component words; and 4) rules based on generalizations about the training data (T4)."
}, { "sent_id": "fd9122d20c390ea115c27092170739-C001-61", "text": "Rules may come with additional constraints to avoid overgeneration." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-62", "text": "The number of rules in each set is listed in Table 1 ." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-63", "text": "The complete set of rules was developed over a period of two weeks." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-64", "text": "As will be shown below, the order in which the rules in each set are applied is crucial for dealing with ambiguous cases." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-65", "text": "To illustrate how rules work, we discuss the complete set of rules for disyllabic words here." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-66", "text": "These are given in Figure 1 , where A and B refer to the component morphemes of an unknown word AB." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-67", "text": "As rules for disyllabic words tend to overgenerate and as we prefer precision over recall for the rule-based model, most rules in this set are accompanied by additional constraints." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-68", "text": "In the first reduplication rule, the order of the three cases is crucial in that if A can be both a verb and a noun, AA is almost always a verb." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-69", "text": "The second rule tags a disyllabic unknown word formed by attaching the diminutive suffix er to a monosyllabic root as a noun." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-70", "text": "This may appear a hasty generalization, but examination of the data shows that er rarely attaches to monosyllabic verbs except for the few well-known cases."
}, { "sent_id": "fd9122d20c390ea115c27092170739-C001-71", "text": "In the third rule, a categorizing suffix is one that attaches to other words to form a noun that refers to a category of people or objects, e.g., ji\u0101 '-ist'." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-72", "text": "The constraint \"A is not a verb morpheme\" excludes cases where B is polysemous and functions not as a categorizing suffix but as a noun morpheme. (Multisyllabic words can have various internal structures, e.g., a disyllabic noun can have a N-N, Adj-N, or V-N structure.)" }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-73", "text": "Figure 1 : Rules for disyllabic words." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-74", "text": "Thus, this rule tags b\u00e8ng-y\u00e8 'water-pump industry' as a noun, but not l\u00ed-y\u00e8 leave-job 'resign'." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-75", "text": "The fourth rule tags words such as sh\u0101-xi\u0101ng 'sand-box' as nouns, but the constraints prevent verbs such as s\u014dng-k\u00f2u 'loosen-button' from being tagged as nouns." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-76", "text": "S\u014dng can be both a noun and a verb, but it is used as a verb in this word." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-77", "text": "The last two rules make use of two lists of characters extracted from the list of disyllabic words in the training data, i.e., those that have only appeared in the verb-initial and noun-final positions respectively." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-78", "text": "This is done because in Chinese, disyllabic compound verbs tend to be head-initial, whereas disyllabic compound nouns tend to be head-final."
}, { "sent_id": "fd9122d20c390ea115c27092170739-C001-79", "text": "The fifth rule tags words such as d\u012bng-y\u01ceo 'sting-bite' as verbs, and the additional constraints prevent nouns such as f\u00fa-xi\u00e0ng 'lying-elephant' from being tagged as verbs." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-80", "text": "The last rule tags words such as xu\u011b-b\u00e8i 'snow-quilt' as nouns, but not zh\u0101i-sh\u0101o pick-tip 'pick the tips'." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-81", "text": "One derivation rule for trisyllabic words has a special status." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-82", "text": "Following the tagging guidelines of our training corpus, it tags a word ABC as verb/deverbal noun (v/vn) if C is the suffix hu\u00e0 '-ize'." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-83", "text": "Disambiguation is left to the statistical models." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-84", "text": "----------------------------------" }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-85", "text": "**THE TRIGRAM MODEL**" }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-86", "text": "The trigram model is used because it captures the information about the POS context of unknown words and returns a tag for each unknown word." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-87", "text": "We assume that the unknown POS depends on the previous two POS tags, and calculate the trigram probability P(t\u2083|t\u2081, t\u2082), where t\u2083 stands for the unknown POS, and t\u2081 and t\u2082 stand for the two previous POS tags." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-88", "text": "The POS tags for known words are taken from the tagged training corpus." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-89", "text": "Following Brants (2000), we first calculate the maximum likelihood probabilities P\u0302 for unigrams, bigrams, and trigrams as in (1)-(3)."
}, { "sent_id": "fd9122d20c390ea115c27092170739-C001-90", "text": "To handle the sparse-data problem, we use the smoothing paradigm that Brants reported as delivering the best result for the TnT tagger, i.e., the context-independent variant of linear interpolation of unigrams, bigrams, and trigrams." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-91", "text": "A trigram probability is then calculated as in (4)." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-92", "text": "As in Brants (2000) , \u03bb\u2081 + \u03bb\u2082 + \u03bb\u2083 = 1, and the values of \u03bb\u2081, \u03bb\u2082, and \u03bb\u2083 are estimated by deleted interpolation, following Brants' algorithm for calculating the weights for context-independent linear interpolation when the n-gram frequencies are known." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-94", "text": "**WU AND JIANG'S (2000) STATISTICAL MODEL**" }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-95", "text": "There are several reasons for integrating another statistical model into the hybrid model." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-96", "text": "The rule-based model is expected to yield high precision, as over-generation is minimized, but it is bound to suffer low recall for disyllabic words." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-97", "text": "The trigram model covers all unknown words, but its precision needs to be boosted." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-98", "text": "Wu and Jiang's (2000) model provides a good complement for the two, because it achieves a higher recall than the rule-based model and a higher precision than the trigram model for disyllabic words." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-99", "text": "As our training corpus is relatively small, this model will suffer a low recall for longer words, but those are handled effectively by the rule-based model."
}, { "sent_id": "fd9122d20c390ea115c27092170739-C001-100", "text": "In principle, other statistical models can also be used, but Wu and Jiang's model appears more appealing because of its relative simplicity and higher or comparable precision." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-101", "text": "It is used to handle disyllabic and trisyllabic unknown words only, as recall drops significantly for longer words." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-102", "text": "----------------------------------" }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-103", "text": "**COMBINING MODELS**" }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-104", "text": "To determine the best way to combine the three models, their individual performances are evaluated on the training data." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-105", "text": "----------------------------------" }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-106", "text": "**RESULTS**" }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-107", "text": "----------------------------------" }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-108", "text": "**EXPERIMENT SETUP**" }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-109", "text": "The different models are trained and tested on a portion of the Contemporary Chinese Corpus of Peking University (Yu et al., 2002) , which is segmented and POS tagged." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-110", "text": "This corpus uses a tagset consisting of 40 tags." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-111", "text": "We consider unknown words that are 1) two or more characters long, 2) formed through reduplication, derivation, or compounding, and 3) in one of the eight categories listed in Table 2 ." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-112", "text": "The corpus consists of all the news articles from People's Daily in January, 1998."
}, { "sent_id": "fd9122d20c390ea115c27092170739-C001-113", "text": "It has a total of 1,121,016 tokens, including 947,959 word tokens and 173,057 punctuation marks." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-114", "text": "90% of the data are used for training, and the other 10% are reserved for testing." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-115", "text": "We downloaded a reference lexicon containing 119,791 entries." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-116", "text": "A word is considered unknown if it is in the wordlist extracted from the training or test data but is not in the reference lexicon." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-117", "text": "Given this definition, we first train and evaluate the individual models on the training data and then evaluate the final combined model on the test data." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-118", "text": "The distribution of unknown words is summarized in Table 3 ." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-119", "text": "Table 3 (unknown word distribution in the data, as types/tokens in the training data vs. types/tokens in the test data): two-character words: 2611/4789 vs. 387/464; three-character: 3818/7378 vs. 520/764; four-character: 490/1229 vs. 74/125; five or more: 188/698 vs. 20/56; total: 7107/14094 vs. 1001/1509." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-120", "text": "----------------------------------" }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-121", "text": "**RESULTS FOR THE INDIVIDUAL MODELS**" }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-122", "text": "The results for the rule-based model are listed in Table 4 ." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-123", "text": "Recall (R) is defined as the number of correctly tagged unknown words divided by the total number of unknown words." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-124", "text": "Precision (P) is defined as the number of correctly tagged unknown words divided by the number of tagged unknown words."
}, { "sent_id": "fd9122d20c390ea115c27092170739-C001-125", "text": "The small number of words tagged \"v/vn\" are excluded from the count of tagged unknown words for calculating precision, as this tag is not a final guess but is returned to reduce the search space for the statistical models." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-126", "text": "F-measure (F) is computed as 2RP/(R + P)." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-127", "text": "The rule-based model achieves very high precision, but recall for disyllabic words is low." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-128", "text": "The results for the trigram model are listed in Table 5." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-129", "text": "Candidates are restricted to the eight POS categories listed in Table 2 for this model." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-130", "text": "Precision for the best guess in both datasets is about 62%." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-131", "text": "The results for Wu and Jiang's model are listed in Table 6 ." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-132", "text": "Recall for disyllabic words is much higher than that of the rule-based model." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-133", "text": "Precision for disyllabic words reaches the mid-70% range, higher than that of the trigram model." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-134", "text": "Precision for trisyllabic words is very high, but recall is low."
}, { "sent_id": "fd9122d20c390ea115c27092170739-C001-135", "text": "----------------------------------" }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-136", "text": "**RESULTS FOR THE COMBINED MODEL**" }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-137", "text": "To evaluate the combined model, we first define the upper bound of the precision for the model as the number of unknown words tagged correctly by at least one of the three models divided by the total number of unknown words." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-138", "text": "The upper bound is 91.10% for the training data and 91.39% for the test data." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-139", "text": "Table 7 reports the results for the combined model." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-140", "text": "The overall precision of the model reaches 89.32% in the training data and 89.00% in the test data, close to the upper bounds." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-141", "text": "----------------------------------" }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-142", "text": "**DISCUSSION AND CONCLUSION**" }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-143", "text": "The results indicate that the three models have different strengths and weaknesses." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-144", "text": "Using rules that do not overgenerate and that are sensitive to the type, length, and internal structure of unknown words, the rule-based model achieves very high precision; the results thus challenge the reasons given in previous studies for rejecting the rule-based approach." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-145", "text": "Overgeneration is a problem only if one attempts to write rules to cover the complete set of unknown words." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-146", "text": "It can be controlled if one prefers precision over recall."
}, { "sent_id": "fd9122d20c390ea115c27092170739-C001-147", "text": "To this end, the internal structure of the unknown words provides very useful information." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-148", "text": "Results for the rule-based model also suggest that as unknown words become longer and the fluidity of their component words/morphemes reduces, they become more predictable and generalizable by rules." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-149", "text": "The results achieved in this study represent a significant improvement over those reported in previous studies." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-150", "text": "To our knowledge, the best result on this task was reported by Chen et al. (1997) , which was 69.13%." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-151", "text": "However, they considered fourteen POS categories, whereas we examined only eight." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-152", "text": "This difference is brought about by the different tagsets used in the different corpora and the decision to include or exclude proper names and numeric type compounds." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-153", "text": "To make the results more comparable, we replicated their model, and the results we found were consistent with what they reported, i.e., 69.12% for our training data and 68.79% for our test data, as opposed to our 89.32% and 89% respectively." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-154", "text": "Several avenues can be taken for future research." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-155", "text": "First, it will be useful to identify a statistical model that achieves higher precision for disyllabic words, as this seems to be the bottleneck."
}, { "sent_id": "fd9122d20c390ea115c27092170739-C001-156", "text": "It will also be relevant to apply advanced statistical models that can incorporate various useful information to this task, e.g., the maximum entropy model (Ratnaparkhi, 1996) ." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-157", "text": "Second, for better evaluation, it would be helpful to use a larger corpus and evaluate the individual models on a held-out dataset, to compare our model with other models on more comparable datasets, and to test the model on other logographic languages." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-158", "text": "Third, some grammatical constraints may be used for the detection and correction of tagging errors in a post-processing step." }, { "sent_id": "fd9122d20c390ea115c27092170739-C001-159", "text": "Finally, as part of a bigger project on Chinese unknown word resolution, we would like to see how well the general methodology used and the specifics acquired in this task can benefit the identification and sense-tagging of unknown words." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "fd9122d20c390ea115c27092170739-C001-8" ], [ "fd9122d20c390ea115c27092170739-C001-10" ], [ "fd9122d20c390ea115c27092170739-C001-43" ] ], "cite_sentences": [ "fd9122d20c390ea115c27092170739-C001-8", "fd9122d20c390ea115c27092170739-C001-10", "fd9122d20c390ea115c27092170739-C001-43" ] }, "@MOT@": { "gold_contexts": [ [ "fd9122d20c390ea115c27092170739-C001-8", "fd9122d20c390ea115c27092170739-C001-9" ] ], "cite_sentences": [ "fd9122d20c390ea115c27092170739-C001-8" ] }, "@DIF@": { "gold_contexts": [ [ "fd9122d20c390ea115c27092170739-C001-43", "fd9122d20c390ea115c27092170739-C001-44" ] ], "cite_sentences": [ "fd9122d20c390ea115c27092170739-C001-43" ] }, "@USE@": { "gold_contexts": [ [ "fd9122d20c390ea115c27092170739-C001-48" ] ], "cite_sentences": [ "fd9122d20c390ea115c27092170739-C001-48" ] } } }, "ABC_30718e751f18432c2478442530267e_35": { "x": [ { "sent_id": "30718e751f18432c2478442530267e-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "30718e751f18432c2478442530267e-C001-2", "text": "Question answering systems deteriorate dramatically in the presence of adversarial sentences in articles." }, { "sent_id": "30718e751f18432c2478442530267e-C001-3", "text": "According to Jia and Liang (2017) , the single BiDAF system (Seo et al., 2016) only achieves an F1 score of 4.8 on the ADDANY adversarial dataset." }, { "sent_id": "30718e751f18432c2478442530267e-C001-4", "text": "In this paper, we present a method to tackle this problem via answer sentence selection." }, { "sent_id": "30718e751f18432c2478442530267e-C001-5", "text": "Given a paragraph of an article and a corresponding query, instead of directly feeding the whole paragraph to the single BiDAF system, a sentence that most likely contains the answer to the query is first selected, which is done via a deep neural network based on Tree-LSTM (Tai et al., 2015) ." 
}, { "sent_id": "30718e751f18432c2478442530267e-C001-6", "text": "Experiments on ADDANY adversarial dataset validate the effectiveness of our method." }, { "sent_id": "30718e751f18432c2478442530267e-C001-7", "text": "The F1 score has been improved to 52.3." }, { "sent_id": "30718e751f18432c2478442530267e-C001-8", "text": "----------------------------------" }, { "sent_id": "30718e751f18432c2478442530267e-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "30718e751f18432c2478442530267e-C001-10", "text": "Question answering is an important task in evaluating the ability of language understanding of machines." }, { "sent_id": "30718e751f18432c2478442530267e-C001-11", "text": "Usually, given a paragraph and a corresponding question, a question answering system is supposed to generate the answer of this question from the paragraph." }, { "sent_id": "30718e751f18432c2478442530267e-C001-12", "text": "By comparing the predicted answer with human-approved answers, the performance of the system can be assessed." }, { "sent_id": "30718e751f18432c2478442530267e-C001-13", "text": "Recently, many systems have achieved great results on this task (Shen et al., 2017b; Wang and Jiang, 2016; Hu et al., 2017) ." }, { "sent_id": "30718e751f18432c2478442530267e-C001-14", "text": "However, Jia and Liang (2017) show that these systems are very vulnerable to paragraphs with adversarial sentences." }, { "sent_id": "30718e751f18432c2478442530267e-C001-15", "text": "For instance, the single BiDAF system (Seo et al., 2016) , which achieves an F1 of 75.5 on Standford Question Answering Dataset (SQuAD), deteriorates significantly to an F1 of 4.8 on the ADDANY adversarial dataset." }, { "sent_id": "30718e751f18432c2478442530267e-C001-60", "text": "Finally, two different training sets are generated by pair-level sampling and paragraphlevel sampling." }, { "sent_id": "30718e751f18432c2478442530267e-C001-61", "text": "Each set has 90,000 instances." 
}, { "sent_id": "30718e751f18432c2478442530267e-C001-16", "text": "Besides the single BiDAF, the single Match LSTM, the ensemble Match LSTM, and the ensemble BiDAF achieve an F1 of 7.6, 11.7, and 2.7 respectively in question answering on ADDANY adversarial dataset (Jia and Liang, 2017) ." }, { "sent_id": "30718e751f18432c2478442530267e-C001-17", "text": "Therefore, question answering with adversarial sentences in paragraphs is a prominent issue and is the focus of this study." }, { "sent_id": "30718e751f18432c2478442530267e-C001-18", "text": "In this paper, we propose a method to improve the performance of the single BiDAF system 1 on ADDANY adversarial dataset." }, { "sent_id": "30718e751f18432c2478442530267e-C001-19", "text": "Given a paragraph and a corresponding question, our method works in two steps to generate an answer." }, { "sent_id": "30718e751f18432c2478442530267e-C001-20", "text": "In the first step, a deep neural network named the QA Likelihood neural network is deployed to predict the likelihood of each sentence in the paragraph to be an answer sentence, i.e., the sentence that contains the answer." }, { "sent_id": "30718e751f18432c2478442530267e-C001-21", "text": "The architecture and the loss of the QA Likelihood neural network follow the neural network for semantic relatedness proposed by Tai et al. (2015) ." }, { "sent_id": "30718e751f18432c2478442530267e-C001-22", "text": "Its main ingredient is the Tree-LSTM model." }, { "sent_id": "30718e751f18432c2478442530267e-C001-23", "text": "While the neural network for semantic relatedness is used to predict the similarity between sentence A and B, the QA Likelihood neural network is used to predict if sentence A contains the answer to query B. In the second step, only the sentence with the highest likelihood is paired with the question and passed to the single BiDAF to further output an answer." 
}, { "sent_id": "30718e751f18432c2478442530267e-C001-24", "text": "In summary, compared to the original BiDAF that is an end-to-end question answering system, our method first selects a sentence that is most likely to be an answer sentence." }, { "sent_id": "30718e751f18432c2478442530267e-C001-25", "text": "Since adversarial sentences are not supposed to contain the answer, they can be screened out." }, { "sent_id": "30718e751f18432c2478442530267e-C001-26", "text": "Therefore, the distractions of adversarial sentences are reduced." }, { "sent_id": "30718e751f18432c2478442530267e-C001-27", "text": "Experiments on ADDANY adversarial dataset demonstrates the effectiveness The contributions of this study are in three folds." }, { "sent_id": "30718e751f18432c2478442530267e-C001-28", "text": "First, to the best of our knowledge, it's the first work that tries to address the problem of Question Answering with Adversarial Examples." }, { "sent_id": "30718e751f18432c2478442530267e-C001-29", "text": "Our results show the effectiveness of answer sentence selection to tackle adverserial sentences in ADDANY dataset." }, { "sent_id": "30718e751f18432c2478442530267e-C001-30", "text": "Second, the power of sentence representation of Tree-LSTM has been demonstrated in different NLP tasks, such as semantic relatedness computation, machine translation evaluation and natural language inference (Tai et al., 2015; Gupta et al., 2015; Chen et al., 2017) ; meanwhile, multiple methods have been proposed for answer sentence selection (Wang and Nyberg, 2015; Rao et al., 2016; Wang et al., 2017; Shen et al., 2017a; Choi et al., 2017) ." }, { "sent_id": "30718e751f18432c2478442530267e-C001-31", "text": "We are the first to design a framework that illustrates the effectiveness of Tree-LSTM in answer sentence selection." }, { "sent_id": "30718e751f18432c2478442530267e-C001-32", "text": "Third, two sampling methods are implemented to build the training set for the QA Likelihood neural network." 
}, { "sent_id": "30718e751f18432c2478442530267e-C001-33", "text": "We show that different sampling methods do influence the performance of question answering in this scenario." }, { "sent_id": "30718e751f18432c2478442530267e-C001-34", "text": "----------------------------------" }, { "sent_id": "30718e751f18432c2478442530267e-C001-35", "text": "**METHODS**" }, { "sent_id": "30718e751f18432c2478442530267e-C001-36", "text": "Given a paragraph C and a corresponding query Q, the paragraph is split into a bunch of sentences C = {S i |i = 1, 2, . . . , |C|}. By combining each sentence S i with the query Q, a set of sentence pairs P C,Q = {(S i , Q)|i = 1, 2, . . . , |C|} is obtained." }, { "sent_id": "30718e751f18432c2478442530267e-C001-62", "text": "The validation set with 3,000 instances are sampled through these two methods as well." }, { "sent_id": "30718e751f18432c2478442530267e-C001-37", "text": "Then, the dependency parsing ) is used to get the tree representation T S i for S i and T Q for Q. Based on T S i and T Q , two Tree-LSTMs, Tree-LSTM S i and Tree-LSTM Q , are built respectively (Tai et al., 2015) ." }, { "sent_id": "30718e751f18432c2478442530267e-C001-38", "text": "The inputs to the leafs of both Tree-LSTMs are GloVe word vectors generated by Pennington et al. (2014) ." }, { "sent_id": "30718e751f18432c2478442530267e-C001-39", "text": "The output hidden vectors of the Tree-LSTM for S i and Q are h S i and h Q respectively." }, { "sent_id": "30718e751f18432c2478442530267e-C001-40", "text": "Then, h S i and h Q are concatenated and passed to a feed forward neural network to output the likelihood that S i contains the answer to Q. The architecture and the loss of the feed forward neural network follows the neural network for semantic relatedness (Tai et al., 2015) ." }, { "sent_id": "30718e751f18432c2478442530267e-C001-41", "text": "During training, the likelihood is supervised by 1 if S i contains the answer and 0 otherwise." 
}, { "sent_id": "30718e751f18432c2478442530267e-C001-42", "text": "The procedure above is summarized as the QA Likelihood neural network that is illustrated in Part 1 of Figure 1." }, { "sent_id": "30718e751f18432c2478442530267e-C001-43", "text": "Following that, the sentence that is most likely to be an answer sentence," }, { "sent_id": "30718e751f18432c2478442530267e-C001-44", "text": "is selected, where L stands for the likelihood predicted by the QA Likelihood neural network." }, { "sent_id": "30718e751f18432c2478442530267e-C001-45", "text": "After that, a pair of sentences S * and Q are passed to the pre-trained single BiDAF (Seo et al., 2016) to generate an answer\u00e2 to Q. This process is illustrated in Part 2 of Figure 1 ." }, { "sent_id": "30718e751f18432c2478442530267e-C001-46", "text": "----------------------------------" }, { "sent_id": "30718e751f18432c2478442530267e-C001-47", "text": "**EXPERIMENTS**" }, { "sent_id": "30718e751f18432c2478442530267e-C001-48", "text": "Dataset for Training." }, { "sent_id": "30718e751f18432c2478442530267e-C001-49", "text": "As Figure 1 shows, the input of our system is a pair of sentences." }, { "sent_id": "30718e751f18432c2478442530267e-C001-50", "text": "Thus, the training instances for the QA Likelihood neural network are in the form of sentence pairs." }, { "sent_id": "30718e751f18432c2478442530267e-C001-51", "text": "They are sampled from the training set of SQuAD v1.1 (Rajpurkar et al., 2016) that contains no adversarial sentences." }, { "sent_id": "30718e751f18432c2478442530267e-C001-52", "text": "Specifically, there are 87,599 queries of 18,896 paragraphs in the training set of SQuAD v1.1." }, { "sent_id": "30718e751f18432c2478442530267e-C001-53", "text": "While each query refers to one paragraph, a paragraph may refer to multiple queries." 
}, { "sent_id": "30718e751f18432c2478442530267e-C001-54", "text": "For the k-th query Q k , by splitting its corresponding paragraph C k into separate sentences and combining them with the query, a set of sentence pairs is obtained," }, { "sent_id": "30718e751f18432c2478442530267e-C001-55", "text": "where D k represents the set of sentence pairs for the k-th query, m k is the number of sentences in the paragraph C k , S k i is the i-th sentence in In order to train our model properly and efficiently, both downsampling of D and undersampling of negative instances must be done." }, { "sent_id": "30718e751f18432c2478442530267e-C001-56", "text": "In this paper, we implement two different sampling methods: pair-level sampling and paragraph-level sampling." }, { "sent_id": "30718e751f18432c2478442530267e-C001-57", "text": "In pair-level sampling, 45,000 positive instances and 45,000 negative instances are randomly selected from D as the training set." }, { "sent_id": "30718e751f18432c2478442530267e-C001-58", "text": "By contrast, in paragraph-level sampling, we first randomly select a query Q k without replacement, then one positive instance and one negative instance are randomly sampled from the set of sentence pairs D k ." }, { "sent_id": "30718e751f18432c2478442530267e-C001-59", "text": "This operation is repeated until we get 45,000 positive instances and 45,000 negative instances." }, { "sent_id": "30718e751f18432c2478442530267e-C001-63", "text": "Dataset for Testing." }, { "sent_id": "30718e751f18432c2478442530267e-C001-64", "text": "Our test set is Jia and Liang (2017)'s ADDANY adversarial dataset." }, { "sent_id": "30718e751f18432c2478442530267e-C001-65", "text": "It includes 1,000 paragraphs and each paragraph refers to only one query, i.e., 1,000 (C, Q) pairs." }, { "sent_id": "30718e751f18432c2478442530267e-C001-66", "text": "By splitting and combining, 6,154 sentence pairs are obtained." 
}, { "sent_id": "30718e751f18432c2478442530267e-C001-67", "text": "Experimental Settings." }, { "sent_id": "30718e751f18432c2478442530267e-C001-68", "text": "The dimension of GloVe word vectors (Pennington et al., 2014 ) is set as 300." }, { "sent_id": "30718e751f18432c2478442530267e-C001-69", "text": "The sentence scoring neural network is trained by Adagrad (Duchi et al., 2011 ) with a learning rate of 0.01 and a batch size of 25." }, { "sent_id": "30718e751f18432c2478442530267e-C001-70", "text": "Model parameters are regularized by a 10 \u22124 strength of per-minibatch L 2 regularization." }, { "sent_id": "30718e751f18432c2478442530267e-C001-71", "text": "----------------------------------" }, { "sent_id": "30718e751f18432c2478442530267e-C001-72", "text": "**RESULTS**" }, { "sent_id": "30718e751f18432c2478442530267e-C001-73", "text": "The performance of question answering is evaluated by the Macro-averaged F1 score (Rajpurkar Jia and Liang, 2017) ." }, { "sent_id": "30718e751f18432c2478442530267e-C001-74", "text": "It measures the average overlap between the predicted answer\u00e2 and real answers on token-level." }, { "sent_id": "30718e751f18432c2478442530267e-C001-75", "text": "We also compute the Macro-averaged Precision and Recall following the same procedure." }, { "sent_id": "30718e751f18432c2478442530267e-C001-76", "text": "The results are in Table 1 . As it shows, both the systems based on pair-level sampling and paragraph-level sampling significantly outperform the single BiDAF system 2 ." }, { "sent_id": "30718e751f18432c2478442530267e-C001-77", "text": "The Macro-averaged F1 has been improved from 4.8 to 52.3." }, { "sent_id": "30718e751f18432c2478442530267e-C001-78", "text": "Besides, the paragraphlevel sampling achieves better results than the pairlevel sampling." 
}, { "sent_id": "30718e751f18432c2478442530267e-C001-79", "text": "In order to analyze the source of performance improvements, we further evaluate the performance of the QA Likelihood neural network and the single BiDAF system on answer sentence selection 3 ." }, { "sent_id": "30718e751f18432c2478442530267e-C001-80", "text": "Here, we consider the problem as a binary classification problem." }, { "sent_id": "30718e751f18432c2478442530267e-C001-81", "text": "In the test set, positive instances are labeled with 1 and negative ones are labeled with 0." }, { "sent_id": "30718e751f18432c2478442530267e-C001-82", "text": "A sentence pair selected by a QA system (QA Likelihood neural network or the single BiDAF) has a predicted label 1, while the others have a predicted label 0." }, { "sent_id": "30718e751f18432c2478442530267e-C001-83", "text": "The results are shown in Table 2 ." }, { "sent_id": "30718e751f18432c2478442530267e-C001-84", "text": "It shows that both of our systems outperform the single BiDAF on all of the four metrics in the table." }, { "sent_id": "30718e751f18432c2478442530267e-C001-85", "text": "We further evaluate the performance of the QA Likelihood neural network and the single BiDAF system on answer sentence selection from another perspective." }, { "sent_id": "30718e751f18432c2478442530267e-C001-86", "text": "Here, we consider three types of sentences: adversarial sentences, answer sentences, and the sentences that include the answers returned by the single BiDAF system." }, { "sent_id": "30718e751f18432c2478442530267e-C001-87", "text": "Given a QA" }, { "sent_id": "30718e751f18432c2478442530267e-C001-88", "text": "----------------------------------" }, { "sent_id": "30718e751f18432c2478442530267e-C001-89", "text": "**RELATED WORKS**" }, { "sent_id": "30718e751f18432c2478442530267e-C001-90", "text": "With the help of deep learning, many techniques have been investigated to achieve exciting results on answer sentence selection and QA." 
}, { "sent_id": "30718e751f18432c2478442530267e-C001-91", "text": "Wang and Nyberg (2015) measure the relevance between sentences through a stacked bidirectional LSTM network." }, { "sent_id": "30718e751f18432c2478442530267e-C001-92", "text": "They show that these scores are effective in answer sentence selection." }, { "sent_id": "30718e751f18432c2478442530267e-C001-93", "text": "He et al. (2015) embed sentences with CNN at multiple levels of granularity to model the similarity between sentences." }, { "sent_id": "30718e751f18432c2478442530267e-C001-94", "text": "Rao et al. (2016) extend the method of Noise-Contrastive Estimation to questions paired with positive and negative sentences." }, { "sent_id": "30718e751f18432c2478442530267e-C001-95", "text": "Based on that, they present a pairwise ranking approach to select an answer from multiple candidate sentences." }, { "sent_id": "30718e751f18432c2478442530267e-C001-96", "text": "Wang et al. (2017) propose a bilateral multi-perspective matching model which achieves rivaling results in the task of answer sentence selection." }, { "sent_id": "30718e751f18432c2478442530267e-C001-97", "text": "Shen et al. (2017a) measure the similarity between sentences by utilizing the word level 4 The x-axis is truncated to save the space." }, { "sent_id": "30718e751f18432c2478442530267e-C001-98", "text": "similarity matrix." }, { "sent_id": "30718e751f18432c2478442530267e-C001-99", "text": "This approach is validated in answer selection." }, { "sent_id": "30718e751f18432c2478442530267e-C001-100", "text": "To efficiently tackle question answering for long documents, Choi et al. (2017) propose a method based on answer sentence selection to first narrow down a document and then use RNN to generate an answer." 
}, { "sent_id": "30718e751f18432c2478442530267e-C001-101", "text": "However, following the idea of adversarial examples in image recognition (Goodfellow et al., 2014; Kurakin et al., 2016; Papernot et al., 2016) , Jia and Liang (2017) point out the unreliability of existing question answering models in the presence of adversarial sentences." }, { "sent_id": "30718e751f18432c2478442530267e-C001-102", "text": "In this study, we propose a method to tackle this problem through answer sentence selection." }, { "sent_id": "30718e751f18432c2478442530267e-C001-103", "text": "The main component of our system is Tree-LSTM which is a powerful variant of Tree-RNN." }, { "sent_id": "30718e751f18432c2478442530267e-C001-104", "text": "Therefore, studies about Tree-RNN (Pollack, 1990; Goller and Kchler, 1996; Socher et al., 2011 Socher et al., , 2012 Socher et al., , 2013 are also related." }, { "sent_id": "30718e751f18432c2478442530267e-C001-105", "text": "----------------------------------" }, { "sent_id": "30718e751f18432c2478442530267e-C001-106", "text": "**CONCLUSIONS**" }, { "sent_id": "30718e751f18432c2478442530267e-C001-107", "text": "In this paper, we propose a method to address the problem of question answering with adversarial sentences in paragraphs." }, { "sent_id": "30718e751f18432c2478442530267e-C001-108", "text": "Specifically, our system via the QA Likelihood neural network based on Tree-LSTMs successfully boost the performance of the single BiDAF on ADDANY adversarial dataset." }, { "sent_id": "30718e751f18432c2478442530267e-C001-109", "text": "Experiments show the F1 score has been largely improved from 4.8 to 52.3." }, { "sent_id": "30718e751f18432c2478442530267e-C001-110", "text": "To the best of our knowledge, we are the first to apply TreeLSTMs in answer sentence selection and the first to tackle question answering with adversarial examples on ADDANY adversarial dataset." 
}, { "sent_id": "30718e751f18432c2478442530267e-C001-111", "text": "However, Jia and Liang (2017) also present the deterioration of QA systems on another dataset, ADDSENT adversarial dataset." }, { "sent_id": "30718e751f18432c2478442530267e-C001-112", "text": "Question answer-ing on this dataset remains unsolved." }, { "sent_id": "30718e751f18432c2478442530267e-C001-113", "text": "We leave it as a future work." } ], "y": { "@BACK@": { "gold_contexts": [ [ "30718e751f18432c2478442530267e-C001-3" ], [ "30718e751f18432c2478442530267e-C001-14" ], [ "30718e751f18432c2478442530267e-C001-16" ], [ "30718e751f18432c2478442530267e-C001-101" ] ], "cite_sentences": [ "30718e751f18432c2478442530267e-C001-3", "30718e751f18432c2478442530267e-C001-14", "30718e751f18432c2478442530267e-C001-16", "30718e751f18432c2478442530267e-C001-101" ] }, "@MOT@": { "gold_contexts": [ [ "30718e751f18432c2478442530267e-C001-3", "30718e751f18432c2478442530267e-C001-4" ], [ "30718e751f18432c2478442530267e-C001-16", "30718e751f18432c2478442530267e-C001-17" ], [ "30718e751f18432c2478442530267e-C001-101", "30718e751f18432c2478442530267e-C001-102" ] ], "cite_sentences": [ "30718e751f18432c2478442530267e-C001-3", "30718e751f18432c2478442530267e-C001-16", "30718e751f18432c2478442530267e-C001-101" ] }, "@USE@": { "gold_contexts": [ [ "30718e751f18432c2478442530267e-C001-64" ], [ "30718e751f18432c2478442530267e-C001-73" ] ], "cite_sentences": [ "30718e751f18432c2478442530267e-C001-64", "30718e751f18432c2478442530267e-C001-73" ] }, "@SIM@": { "gold_contexts": [ [ "30718e751f18432c2478442530267e-C001-111" ] ], "cite_sentences": [ "30718e751f18432c2478442530267e-C001-111" ] } } }, "ABC_05eecafea7684dc8de13c29a76b767_35": { "x": [ { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-2", "text": "We propose a simple yet effective textbased user geolocation model based on a neural network with one hidden layer, which achieves state of 
the art performance over three Twitter benchmark geolocation datasets, in addition to producing word and phrase embeddings in the hidden layer that we show to be useful for detecting dialectal terms." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-3", "text": "As part of our analysis of dialectal terms, we release DAREDS, a dataset for evaluating dialect term detection methods." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-4", "text": "----------------------------------" }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-5", "text": "**INTRODUCTION**" }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-6", "text": "Many services such as web search (Leung et al., 2010) , recommender systems (Ho et al., 2012) , targeted advertising (Lim and Datta, 2013) , and rapid disaster response (Ashktorab et al., 2014) rely on the location of users to personalise information and extract actionable knowledge." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-7", "text": "Explicit user geolocation metadata (e.g. GPS tags, WiFi footprint, IP address) is not usually available to third-party consumers, giving rise to the need for geolocation based on profile data, text content, friendship graphs (Jurgens et al., 2015) or some combination of these (Rahimi et al., 2015b,a) ." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-8", "text": "The strong geographical bias, most obviously at the language level (e.g. Finland vs. Japan), and more subtly at the dialect level (e.g. in English used in north-west England vs. north-east USA vs. Texas, USA), clearly reflected in language use in social media services such as Twitter, has been used extensively either for geolocation of users (Eisenstein et al., 2010; Roller et al., 2012; Rout et al., 2013; Wing and Baldridge, 2014) or dialectology Eisenstein, 2015) ." 
}, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-9", "text": "In these methods, a user is often represented by the concatenation of their tweets, and the geolocation model is trained on a very small percentage of explicitly geotagged tweets, noting the potential biases implicit in geotagged tweets (Pavalanathan and Eisenstein, 2015) ." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-10", "text": "Lexical dialectology is (in part) the converse of user geolocation (Eisenstein, 2015) : given text associated with a variety of regions, the task is to identify terms that are distinctive of particular regions." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-11", "text": "The complexity of the task is two-fold: (1) localised named entities (e.g. sporting team names) are not of interest; and (2) without semantic knowledge it is difficult to detect terms that are in general use but have a special meaning in a region." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-12", "text": "In this paper we propose a text-based geolocation method based on neural networks." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-13", "text": "Our contributions are as follows: (1) we achieve state-of-the-art results on benchmark Twitter geolocation datasets; (2) we show that the model is less sensitive to the specific location discretisation method; (3) we release the first broad-coverage dataset for evaluation of lexical dialectology models; (4) we incorporate our text-based model into a network-based model (Rahimi et al., 2015a) and improve the performance utilising both network and text; and (5) we use the model's embeddings for extraction of local terms and show that it outperforms two baselines." 
}, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-14", "text": "----------------------------------" }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-15", "text": "**RELATED WORK**" }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-16", "text": "Related work on Twitter user geolocation falls into two categories: text-based and network-based methods." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-17", "text": "Text-based methods make use of the geographical biases of language use, and networkbased methods rely on the geospatial homophily of user-user interactions." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-18", "text": "In both cases, the assumption is that users who live in the same geographic area share similar features (linguistic or interactional)." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-19", "text": "Three main text-based approaches are: (1) the use of gazetteers Quercini et al., 2010) ; (2) unsupervised text clustering based on topic models or similar (Eisenstein et al., 2010; Hong et al., 2012; Ahmed et al., 2013) ; and (3) supervised classification (Ding et al., 2000; Backstrom et al., 2008; Cheng et al., 2010; Hecht et al., 2011; Kinsella et al., 2011; Wing and Baldridge, 2011; Han et al., 2012; Rout et al., 2013) , which unlike gazetteers can be applied to informal text and compared to topic models, scales better." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-20", "text": "The classification models often rely on less than 1% of geotagged tweets for supervision and discretise real-valued coordinates into equalsized grids (Serdyukov et al., 2009 ), administrative regions (Cheng et al., 2010; Hecht et al., 2011; Kinsella et al., 2011; Han et al., 2012 , or flat (Wing and Baldridge, 2011) or hierarchical k-d tree clusters (Wing and Baldridge, 2014) ." 
}, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-21", "text": "Network-based methods also use either real-valued coordinates (Jurgens et al., 2015) or discretised regions (Rahimi et al., 2015a) as labels, and use label propagation over the interaction graph (e.g. @-mentions)." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-22", "text": "More recent methods have focused on representation learning by using sparse coding (Cha et al., 2015) or neural networks (Liu and Inkpen, 2015) , utilising both text and network information (Rahimi et al., 2015a) ." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-23", "text": "Dialect is a variety of language shared by a group of speakers (Wolfram and Schilling, 2015) ." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-24", "text": "Our focus here is on geographical dialects which are spoken (and written in social media) by people from particular areas." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-25", "text": "The traditional approach to dialectology is to find the geographical distribution of known lexical alternatives (e.g. you, yall and yinz: (Labov et al., 2005; Nerbonne et al., 2008; Gon\u00e7alves and S\u00e1nchez, 2014; Doyle, 2014; Huang et al., 2015; Nguyen and Eisenstein, 2016) ), the shortcoming of which is that the alternative lexical variables must be known beforehand." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-26", "text": "There have also been attempts to automatically identify such words from geotagged documents (Eisenstein et al., 2010; Ahmed et al., 2013; Eisenstein, 2015) ." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-27", "text": "The main idea is to find lexical variables that are disproportionately distributed in different locations either via model-based or statistical methods (Monroe et al., 2008) ." 
}, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-28", "text": "There is a research gap in evaluating the geolocation models in terms of their usability in retrieving dialect terms given a geographic region." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-29", "text": "We use a text-based neural approach trained on geotagged Twitter messages that: (a) given a geographical region, identifies the associated lexical terms; and (b) given a text, predicts its location." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-30", "text": "----------------------------------" }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-31", "text": "**DATA**" }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-32", "text": "We use three existing Twitter user geolocation datasets: (1) GEOTEXT (Eisenstein et al., 2010) , (2) TWITTER-US (Roller et al., 2012) , and (3) TWITTER-WORLD (Han et al., 2012) ." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-33", "text": "These datasets have been used widely for training and evaluation of geolocation models." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-34", "text": "They are all prepartitioned into training, development and test sets." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-35", "text": "Each user is represented by the concatenation of their tweets, and labeled with the latitude/longitude of the first collected geotagged tweet in the case of GEOTEXT and TWITTER-US, and the centre of the closest city in the case of TWITTER-WORLD." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-36", "text": "1 GEOTEXT and TWITTER-US cover the continental US, and TWITTER-WORLD covers the whole world, with 9k, 449k and 1.3m users, respectively as shown in Figure 1 ." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-37", "text": "2 DAREDS is a dialect-term dataset novel to this research, created from the Dictionary of American Regional English (DARE) (Cassidy et al., 1985) ." 
}, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-38", "text": "DARE consists of dialect regions, their terms and meaning." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-39", "text": "3 It is based on dialectal surveys from different regions of the U.S., which are then postprocessed to identify dialect regions and terms." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-40", "text": "In order to construct a dataset based on DARE, we downloaded the web version of DARE, cleaned it, and removed multiword expressions and highly-frequent words (any word which occurred in the top 50k most frequent words, based on a word frequency list (Norvig, 2009) ." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-41", "text": "For dialect regions that don't correspond to a single state or set of cities (e.g. South), we mapped it to the most populous cities within each region." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-42", "text": "For example, within the Pacific Northwest dialect region, we manually extracted the most populous cities (Seattle, Tacoma, Portland, Salem, Eugene) and added those cities to DAREDS as subregions." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-43", "text": "The resulting dataset (DAREDS) consists of around 4.3k dialect terms from 99 U.S. dialect regions." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-44", "text": "DAREDS is the largest standardised dialectology dataset." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-45", "text": "----------------------------------" }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-46", "text": "**METHODS**" }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-47", "text": "We use a multilayer perceptron (MLP) with one hidden layer as our location classifier, where the input is l 2 normalised bag-of-words features for a given user." 
}, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-48", "text": "We exclude @-mentions, words with document frequency less than 10, and stop words." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-49", "text": "The output is either a k-d tree leaf node or k-means discretisation of real-valued coordinates of training locations, the output of which is visualised for TWITTER-US in Figure 2 ." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-50", "text": "The hidden layer output provides word (and phrase, as bags of words) embeddings for dialectal analysis." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-51", "text": "The number of regions, regularisation strength, hidden layer and mini-batch size are tuned over development data and set to (32, 10 \u22125 , 896, 100), (256, 10 \u22126 , 2048, 10000) and (930, 10 \u22126 , 3720, 10000) for GEOTEXT, TWITTER-US and TWITTER-WORLD, respectively." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-52", "text": "The parameters are optimised using Adamx (Kingma and Ba, 2014) using Lasagne/Theano (Theano Development Team, 2016) ." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-53", "text": "Following Cheng (2010) and Eisenstein (2010) , we evaluated the geolocation model using mean and median error in km (\"Mean\" and \"Median\" resp.) and accuracy within 161km of the actual location (\"Acc@161\")." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-54", "text": "Note that lower numbers are better for Mean and Median, and higher numbers better for Acc@161." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-55", "text": "4 The results reported in Rahimi et al. (2015b; 2015a) for TWITTER-WORLD were over a superset of the dataset; the results reported here are based on the actual dataset." 
}, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-56", "text": "While the focus of this paper is text-based user geolocation, state-of-the-art results for the three datasets have been achieved with hybrid text+network-based models, where the predictions of the text-based model are fed into a mention network as \"dongle\" nodes to each user node, providing a personalised geolocation prior for each user (Rahimi et al., 2015a) ." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-57", "text": "Note that it would, of course, be possible to combine text and network information in a joint deep learning model (Yang et al., 2016; Kipf and Welling, 2016) , which we leave to future work (noting that scalability will potentially be a major issue for the larger datasets)." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-58", "text": "To test the applicability of the model's embeddings in dialectology, we created DAREDS." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-59", "text": "The output of the hidden layer of the model is used as embeddings for both location names and dialect terms." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-60", "text": "Given a dialect region name, we retrieve its nearest neighbours in the embedding space, and compare them to dialect terms associated with that location." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-61", "text": "We also compare the quality of the embeddings with pre-trained word2vec embeddings and the embeddings from the output layer of LR (logistic regression) (Rahimi et al., 2015b) as baselines." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-62", "text": "Regions in DAREDS can be very broad (e.g. 
SouthWest), meaning that words associated with those locations will be used across a large number Table 1 : Geolocation results over the three Twitter datasets, based on the text-based MLP with k-d tree or k-means discretisation and text+network model MADCEL-W-MLP using MLP with k-d tree for text-based predictions." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-63", "text": "We compare with state-of-the-art results for each dataset." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-64", "text": "4 \"-\" signifies that no results were reported for the given metric or dataset." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-65", "text": "of cities contained within that region." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-66", "text": "We generate a region-level embedding by simply taking the city names associated with the region, and feeding them as BoW input for LR and MLP and averaging their embeddings for word2vec." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-67", "text": "We evaluate the retrieved terms by computing recall of DAREDS terms existing in TWITTER-US (1071 terms) at k \u2208 {0.05%, 0.1%, 0.2%, 0.5%, 1%, 2%, 5%} of vocabulary size." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-68", "text": "The code and the DAREDS dataset are available at https://github." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-69", "text": "com/afshinrahimi/acl2017." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-70", "text": "----------------------------------" }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-71", "text": "**RESULTS**" }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-72", "text": "----------------------------------" }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-73", "text": "**GEOLOCATION**" }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-74", "text": "The performance of the text-based MLP model with k-d tree and k-means discretisation over the three datasets is shown in Table 1 ." 
}, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-75", "text": "The results are also compared with state-of-the-art text-based methods based on a flat (Rahimi et al., 2015b; Cha et al., 2015) or hierarchical (Wing and Baldridge, 2014; Melo and Martins, 2015; Liu and Inkpen, 2015) geospatial representation." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-76", "text": "Our method outperforms both the flat and hierarchical text-based models by a large margin." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-77", "text": "Comparing the two discretisation strategies, k-means outperforms k-d tree by a reasonable margin." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-78", "text": "We also incorporated the MLP predictions into a network-based model based on the method of Rahimi et al. (2015a) , and improved upon their work." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-79", "text": "We analysed the Table 2 : Nearest neighbours of place names." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-80", "text": "in Figure 3 ." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-81", "text": "The error is highest in states with lower training coverage (e.g. Maine, Montana, Wisconsin, Iowa and Kansas)." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-82", "text": "We also randomly sampled 50 development samples from the 1000 samples with highest prediction errors to check the biases of the model." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-83", "text": "Most of the errors are the result of geolocating users from Eastern U.S. in Western U.S. particularly in Los Angeles and San Francisco." 
}, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-84", "text": "----------------------------------" }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-85", "text": "**DIALECTOLOGY**" }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-86", "text": "We quantitatively tested the quality of the geographical embeddings by calculating the micro-average recall of the k-nearest dialect terms (in terms of the proportion of retrieved dialect terms) given a dialect region, as shown in Figure 4 ." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-87", "text": "Recall at 0.5% is about 3.6%, meaning that we were able to retrieve 3.6% of the dialect terms given the dialect region name in the geographical embedding space." }, { "sent_id": "05eecafea7684dc8de13c29a76b767-C001-88", "text": "The embeddings slightly outperform the output layer of logistic regression (LR) (Rahimi et al., 2015b)" } ], "y": { "@BACK@": { "gold_contexts": [ [ "05eecafea7684dc8de13c29a76b767-C001-8" ], [ "05eecafea7684dc8de13c29a76b767-C001-19" ], [ "05eecafea7684dc8de13c29a76b767-C001-26" ] ], "cite_sentences": [ "05eecafea7684dc8de13c29a76b767-C001-8", "05eecafea7684dc8de13c29a76b767-C001-19", "05eecafea7684dc8de13c29a76b767-C001-26" ] }, "@USE@": { "gold_contexts": [ [ "05eecafea7684dc8de13c29a76b767-C001-32" ], [ "05eecafea7684dc8de13c29a76b767-C001-53" ] ], "cite_sentences": [ "05eecafea7684dc8de13c29a76b767-C001-32", "05eecafea7684dc8de13c29a76b767-C001-53" ] } } }, "ABC_6781a5d131e68f7a7ac2fe239cc3d0_35": { "x": [ { "sent_id": "6781a5d131e68f7a7ac2fe239cc3d0-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "6781a5d131e68f7a7ac2fe239cc3d0-C001-2", "text": "We present a follow-up of our previous frequency-based greedy attribute selection strategy." 
}, { "sent_id": "6781a5d131e68f7a7ac2fe239cc3d0-C001-4", "text": "----------------------------------" }, { "sent_id": "6781a5d131e68f7a7ac2fe239cc3d0-C001-5", "text": "**INTRODUCTION**" }, { "sent_id": "6781a5d131e68f7a7ac2fe239cc3d0-C001-3", "text": "The current version takes into account also the instructions given to the participants of TUNA trials regarding the use of location information, showing an overall improvement on string-edit distance values driven by the results on the Furniture domain." }, { "sent_id": "6781a5d131e68f7a7ac2fe239cc3d0-C001-6", "text": "In previous work (Lucena & Paraboni, 2008) we presented a frequency-based greedy attribute selection strategy submitted to the TUNA Challenge 2008." }, { "sent_id": "6781a5d131e68f7a7ac2fe239cc3d0-C001-7", "text": "Presently we further the issue by taking additional information into accountnamely, the trial condition information available from the TUNA data -and report improved results for string-edit distance as required for the 2009 competition." }, { "sent_id": "6781a5d131e68f7a7ac2fe239cc3d0-C001-8", "text": "----------------------------------" }, { "sent_id": "6781a5d131e68f7a7ac2fe239cc3d0-C001-9", "text": "**BACKGROUND**" }, { "sent_id": "6781a5d131e68f7a7ac2fe239cc3d0-C001-10", "text": "In Lucena & Paraboni (2008) we presented a combined strategy based on attribute frequency and certain aspects of a greedy attribute selection strategy for referring expressions generation." }, { "sent_id": "6781a5d131e68f7a7ac2fe239cc3d0-C001-11", "text": "A list P of attributes sorted by frequency is the centre piece of the following selection strategy:" }, { "sent_id": "6781a5d131e68f7a7ac2fe239cc3d0-C001-12", "text": "\u2022 select all attributes whose relative frequency falls above a threshold value t (t was estimated to be 0.8 for both Furniture and People domains.) \u2022 if the resulting description uniquely describes the target object, then finalizes." 
}, { "sent_id": "6781a5d131e68f7a7ac2fe239cc3d0-C001-13", "text": "\u2022 if not, starting from the most frequent attribute in P, search exhaustively for an attribute g such that g, if selected, would rule out all remaining distractors in the context." }, { "sent_id": "6781a5d131e68f7a7ac2fe239cc3d0-C001-14", "text": "The overall effect obtained is twofold: on the one hand, in a complex situation of reference (in which many attributes may rule out many distractors, but more than one will be required to achieve uniqueness) the algorithm simply selects frequent attributes." }, { "sent_id": "6781a5d131e68f7a7ac2fe239cc3d0-C001-15", "text": "This may be comparable to a human speaker who has to single out the target object but who does not have the means to come up with the 'right' attribute straight away." }, { "sent_id": "6781a5d131e68f7a7ac2fe239cc3d0-C001-16", "text": "On the other hand, as the number of distractors decreases, a single attribute capable of ruling out all distractors will eventually emerge, forcing the algorithm to switch to a greedy strategy and finalize." }, { "sent_id": "6781a5d131e68f7a7ac2fe239cc3d0-C001-17", "text": "Once again, this may be comparable to what a human speaker may do when an appropriate attribute becomes sufficiently salient and all distractors in the context can be ruled out at once." }, { "sent_id": "6781a5d131e68f7a7ac2fe239cc3d0-C001-18", "text": "The above approach performed fairly well (at least considering its simplicity) as reported in Lucena & Paraboni (2008) ." }, { "sent_id": "6781a5d131e68f7a7ac2fe239cc3d0-C001-19", "text": "However, there is one major source of information available from the TUNA data that was not taken into account in the above strategy: the trial condition represented by the +/-LOC feature." 
}, { "sent_id": "6781a5d131e68f7a7ac2fe239cc3d0-C001-20", "text": "Because this feature distinguishes the very kinds of instruction given to each participant to complete the TUNA task, the information provided by -/+ LOC is likely to have a significant impact on the overall results." }, { "sent_id": "6781a5d131e68f7a7ac2fe239cc3d0-C001-21", "text": "This clear gap in our previous work represents an opportunity for improvement discussed in the next section." }, { "sent_id": "6781a5d131e68f7a7ac2fe239cc3d0-C001-22", "text": "----------------------------------" }, { "sent_id": "6781a5d131e68f7a7ac2fe239cc3d0-C001-23", "text": "**ALGORITHM**" }, { "sent_id": "6781a5d131e68f7a7ac2fe239cc3d0-C001-24", "text": "The present work is a refined version of the original frequency-based greedy attribute selection strategy submitted to the TUNA Challenge 2008 (Lucena & Paraboni, 2008) , now taking also the trial condition (+/-LOC) into account." }, { "sent_id": "6781a5d131e68f7a7ac2fe239cc3d0-C001-25", "text": "In the TUNA data, +LOC indicates the instances of the experiment in which participants were told that they were allowed to refer to the X,Y coordinates of the screen (i.e., selecting the X-and/or Y-DIMENSION attributes), whereas -LOC indicates the trials in which they were discouraged (but not prevented) to do so." }, { "sent_id": "6781a5d131e68f7a7ac2fe239cc3d0-C001-26", "text": "In practice, references in +LOC trials are more likely to convey the X-and Y-DIMENSION attributes than those in which the -LOC condition was applied." }, { "sent_id": "6781a5d131e68f7a7ac2fe239cc3d0-C001-27", "text": "Our modified algorithm simply consists of computing separated frequency lists for +LOC and -LOC trial conditions, and then using the original frequency-based greedy approach with each list accordingly." 
}, { "sent_id": "6781a5d131e68f7a7ac2fe239cc3d0-C001-28", "text": "In practice, descriptions are now generated in two different ways, depending on the trial condition, which may promote the Xand Y-DIMENSION attributes to higher positions in the list P when +LOC applies." }, { "sent_id": "6781a5d131e68f7a7ac2fe239cc3d0-C001-29", "text": "Using the TUNA Challenge 2009 development data set, the attribute selection task was performed as above." }, { "sent_id": "6781a5d131e68f7a7ac2fe239cc3d0-C001-30", "text": "For the surface realisation task, we have reused the English language surface realisation module provided by Irene Langkilde-Geary for the TUNA Challenge 2008." }, { "sent_id": "6781a5d131e68f7a7ac2fe239cc3d0-C001-31", "text": "----------------------------------" }, { "sent_id": "6781a5d131e68f7a7ac2fe239cc3d0-C001-32", "text": "**RESULTS**" }, { "sent_id": "6781a5d131e68f7a7ac2fe239cc3d0-C001-33", "text": "The following Figure 1 shows mean sting-edit distance and BLEU-3 scores computed using the evaluation tool provided by the TUNA Challenge team." }, { "sent_id": "6781a5d131e68f7a7ac2fe239cc3d0-C001-34", "text": "For ease of comparison with our previous work, we also present Dice and MASI scores computed as in the previous TUNA Challenge, although these scores were not required for the current competition." }, { "sent_id": "6781a5d131e68f7a7ac2fe239cc3d0-C001-35", "text": "The most relevant comparison with our previous work is observed in the overall string-edit distance values in Figure 1 : considering that in Lucena & Paraboni (2008) we reported 6.12 editdistance for Furniture and 7.38 for People, the overall improvement (driven by the descriptions in the Furniture domain) may be explained by the fact that the current version makes more accurate decisions as to when to use these attributes according to the instructions given to the participants of the TUNA trials (the trial condition +/-LOC. 
)" }, { "sent_id": "6781a5d131e68f7a7ac2fe239cc3d0-C001-36", "text": "On the other hand, the divide between +LOC and -LOC strategies does not have a significant effect on the results based on the semantics of the description (i.e., Dice and MASI scores), which remain the same as those obtained previously." }, { "sent_id": "6781a5d131e68f7a7ac2fe239cc3d0-C001-37", "text": "This may be explained by the fact that using location information inappropriately counts as one single error in Dice/MASI calculations, but it may have a much greater impact on the wording of the surface string (e.g., one single use of the X-DIMENSION attribute may be realized as \"on the far left\", adding four words to the descriptions.)" }, { "sent_id": "6781a5d131e68f7a7ac2fe239cc3d0-C001-38", "text": "----------------------------------" }, { "sent_id": "6781a5d131e68f7a7ac2fe239cc3d0-C001-39", "text": "**CONCLUSION**" }, { "sent_id": "6781a5d131e68f7a7ac2fe239cc3d0-C001-40", "text": "We have presented a refined version of our previous frequency-based greedy attribute selection strategy." }, { "sent_id": "6781a5d131e68f7a7ac2fe239cc3d0-C001-41", "text": "The current version takes into account the instructions given to the participants of TUNA trials regarding the use of location information (the trial condition +/-LOC.)" }, { "sent_id": "6781a5d131e68f7a7ac2fe239cc3d0-C001-42", "text": "Results obtained using the TUNA Challenge 2009 development data set show improvements on string-edit distance, suggesting that the generated descriptions resemble more closely those seen in the TUNA corpus." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "6781a5d131e68f7a7ac2fe239cc3d0-C001-6" ], [ "6781a5d131e68f7a7ac2fe239cc3d0-C001-10" ], [ "6781a5d131e68f7a7ac2fe239cc3d0-C001-18" ] ], "cite_sentences": [ "6781a5d131e68f7a7ac2fe239cc3d0-C001-6", "6781a5d131e68f7a7ac2fe239cc3d0-C001-10", "6781a5d131e68f7a7ac2fe239cc3d0-C001-18" ] }, "@EXT@": { "gold_contexts": [ [ "6781a5d131e68f7a7ac2fe239cc3d0-C001-6", "6781a5d131e68f7a7ac2fe239cc3d0-C001-7" ], [ "6781a5d131e68f7a7ac2fe239cc3d0-C001-24" ] ], "cite_sentences": [ "6781a5d131e68f7a7ac2fe239cc3d0-C001-6", "6781a5d131e68f7a7ac2fe239cc3d0-C001-24" ] }, "@MOT@": { "gold_contexts": [ [ "6781a5d131e68f7a7ac2fe239cc3d0-C001-18", "6781a5d131e68f7a7ac2fe239cc3d0-C001-19", "6781a5d131e68f7a7ac2fe239cc3d0-C001-21" ] ], "cite_sentences": [ "6781a5d131e68f7a7ac2fe239cc3d0-C001-18" ] }, "@DIF@": { "gold_contexts": [ [ "6781a5d131e68f7a7ac2fe239cc3d0-C001-35" ] ], "cite_sentences": [ "6781a5d131e68f7a7ac2fe239cc3d0-C001-35" ] } } }, "ABC_b7e0879c4cac85054870146e61aa6f_35": { "x": [ { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-2", "text": "We explore blindfold (question-only) baselines for Embodied Question Answering." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-3", "text": "The EmbodiedQA task requires an agent to answer a question by intelligently navigating in a simulated environment, gathering necessary visual information only through first-person vision before finally answering." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-4", "text": "Consequently, a blindfold baseline which ignores the environment and visual information is a degenerate solution, yet we show through our experiments on the EQAv1 dataset that a simple question-only baseline achieves state-of-the-art results on the EmbodiedQA task in all cases except when the agent is spawned extremely close to the object." 
}, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-5", "text": "----------------------------------" }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-7", "text": "Recent breakthroughs in static, unimodal tasks such as image classification [16] and language processing [18] has prompted research towards multimodal tasks [1, 8] and virtual environments [4, 15, 25] ." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-8", "text": "This is substantiated by embodiment theories in cognitive science that have argued for agent learning to be interactive and multimodal, mimicking key aspects of human learning [9, 17] ." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-9", "text": "To foster and measure progress in such virtual environments, new tasks have been introduced, one of them being Embodied Question Answering (EmbodiedQA) [5] ." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-10", "text": "The EmbodiedQA task requires an agent to intelligently navigate in a simulated household environment [25] and answer questions through egocentric vision." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-11", "text": "Concretely, an agent is spawned at a random location in an environment (a house or building) and asked a question (e.g. 'What color is the car?')." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-12", "text": "The agent perceives its environment through first-person egocentric vision and can perform a few atomic actions (move-forward, turn, strafe, etc.)." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-13", "text": "The goal of the agent is to intelligently navigate the environment and gather visual information necessary for answering the question." 
}, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-14", "text": "Subsequent to the introduction of the task, several methods have been introduced to solve the EmbodiedQA task [5, 6] , using some combination of reinforcement learning, behavior cloning and hierarchical control." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-15", "text": "Apart from using the question and images from the environment, these methods also rely on varying degrees of expert supervision such as shortest path demonstrations and subgoal policy sketches." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-16", "text": "In this work, we evaluate simple question-only baselines that never see the environment and receive no form of expert supervision." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-17", "text": "We examine whether existing methods outperform baselines designed to solely capture dataset bias, in order to better understand the performance of these existing methods." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-18", "text": "To our surprise, blindfold baselines achieve state-of-the-art performance on the EmbodiedQA task, except in the case when the agent is spawned extremely close to the object." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-19", "text": "Even in the latter case, blindfold baselines perform surprisingly close to existing state-of-the-art methods." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-20", "text": "We note that this finding is reminiscent of several recent works in both Computer Vision and Natural Language Processing, where researchers have found that statistical irregularities in the dataset can enable degenerate methods to perform surprisingly well [11, 12, 14, 21] ." 
}, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-21", "text": "Our findings suggest that current EmbodiedQA models are ineffective at leveraging the context from the environment, in fact this context or embodiment in the environment can negatively hamper them." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-22", "text": "We hope comparison with our baseline results can more effectively demonstrate how well a method is able to leverage embodiment in the environment." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-23", "text": "Upon further error analysis of our models and qualitative inspection of the dataset, we find that there exist biases in the EQAv1 dataset that allow blindfold models to perform so well." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-24", "text": "We acknowledge the active effort of Das et al. [5] in removing some biases via entropy-pruning but note that further efforts might be necessary to fully correct these biases." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-25", "text": "----------------------------------" }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-26", "text": "**RELATED WORK**" }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-27", "text": "EmbodiedQA Methods: Das et al. [5] introduced the PACMAN-RL+Q model which is bootstrapped with expert shortest-path demonstrations and later fine-tuned with REINFORCE [24] ." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-28", "text": "This model consists of a hierarchical navigation module: a planner and a controller, and a question answering module that acts when the navigation module has given up control." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-29", "text": "In a later work, Das et al. [6] introduce Neural Modular Control (NMC) which is a hierarchical policy network that operates over expert sub-policy sketches." 
}, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-30", "text": "The master and sub-policies are initialized with Behavior Cloning (BC), and later fine-tuned with Asynchronous Advantage Actor-Critic (A3C) [19] ." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-78", "text": "We reproduce the settings for training the VQA model 2 ." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-31", "text": "Dataset Biases and Trivial Baselines: Many recent studies in language and vision show how biases in a dataset allow models to perform well on a task without leveraging the meaning of the text or image in the underlying dataset." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-32", "text": "A simple CNN-BoW model was shown to achieve state-of-the-art results [12] on the Visual7W [26] task while also performing surprisingly well compared to the most complex systems proposed for the VQA dataset [1] and other joint vision and language tasks [2, 10] ." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-33", "text": "Simple nearest neighbor approaches have been shown to perform well on image captioning datasets [7] ." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-34", "text": "This phenomenon has also been observed in language processing tasks." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-35", "text": "On the Story-cloze task which was presented to evaluate common-sense reasoning, Schwartz et al. [23] achieved state-ofthe-art performance by ignoring the narrative and training a linear classifier with features related to the writing style of the two potential endings, rather than their content." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-36", "text": "Similar observations were found on the Natural Language Inference (NLI) datasets, where methods ignoring the context and relying only on the hypothesis perform remarkably well [11, 21] ." 
}, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-37", "text": "Most recently, question-only and passage-only baselines on several QA datasets highlighted similar issues [14] ." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-38", "text": "3 Methods grey (1946) black (1386) bedroom (978) living room (871) bathroom (451) balcony (10) passenger elevator (1) Figure 2 : Frequency of each answer in the entire EQAv1 dataset." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-39", "text": "We observe that answers do not appear equally in the dataset, and are biased toward a select few." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-40", "text": "Average BOW Embedding We use a simple linear classifier as described in [3, 13, 22] , which takes word level embeddings and averages them to construct the question representation." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-41", "text": "We first perform a look-up over an embedding matrix for each word to get individual word representations." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-42", "text": "These word representations are then averaged into a text representation, which is in turn fed to a linear classifier." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-43", "text": "This architecture is similar to the fastText model of [13] ." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-44", "text": "It is also a common and strong baseline in language and vision and language tasks [13, 22] ." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-45", "text": "We use the softmax function f to compute the probability distribution over the predefined classes." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-46", "text": "The training criterion minimizes the negative log-likelihood over the classes." 
}, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-47", "text": "Nearest Neighbor Answer Distribution (NN-AnswerDist) This method attempts to answer purely based on the per-question answer distribution of the training set." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-48", "text": "For an input question we find either the identical question in the training set or if one doesn't exist the nearest matching question (based on number of shared words)." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-49", "text": "We then select the most likely answer for the training set." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-50", "text": "Performance on this baseline is directly indicative of the bias in answer distributions in the dataset." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-51", "text": "We note that for EQAv1 almost all questions in the validation and test sets are present in the training set." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-52", "text": "The answers span across 72 different categories of color, location and objects." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-53", "text": "We note that there are only 2 questions in the validation set, and 6 questions in the test set that are not in the training set." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-54", "text": "This limits the ability to test how well an agent generalizes across unseen combinations of rooms/objects/colors." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-55", "text": "To get rid of peaky answers, an entropy pruning method was applied by [5] where questions with normalized entropy below 0.5 were excluded." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-56", "text": "However this still leaves an uneven answer distribution that can be exploited." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-57", "text": "Training Details 1 We evaluate the efficacy of our proposed baselines on the EQAv1 dataset." 
}, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-58", "text": "For the BoW model, we initialize the embeddings with Glove vectors [20] of size 100, which are allowed to be fine-tuned during the training procedure." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-59", "text": "We use the Adam optimizer (batch-size of 64) with a learning rate of 5e \u22123 which is annealed via a scheduling mechanism based on plateaus in the validation loss." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-60", "text": "The training procedure is run for 200 epochs and we use the checkpoint with minimum validation loss to compute accuracy on the test set." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-61", "text": "The NN-AnswerDist and the Majority baselines are self-descriptive and there are no specific training details that we apply." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-62", "text": "We also train the [5] text embedding model (an LSTM) with the optimization settings described in [5] for 200 epochs." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-63", "text": "Results Detailed results are reported in Table 4 ." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-64", "text": "Following Das et al. [6] , we report the agent's top-1 accuracy on the test set when spawned 10, 20 and 50 steps away from the goal, denoted as T 10 , T 20 and T 50 respectively." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-65", "text": "Since the performance of blindfold baselines are not affected based on where the agent is spawned, their accuracy is same across T 10 , T 20 and T 50 ." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-66", "text": "We observe that the BoW model outperforms all existing methods except NMC(BC+A3C) in the case where agent is spawned very close to the target." 
}, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-67", "text": "The Nearest Neighbour method also does pretty well, and only falls behind to PACMAN (BC+REINFORCE) and NMC(BC+A3C) in the T 10 case." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-68", "text": "The difference in performance b/w the Nearest Neighbour method and BoW is primarily due to the fact that the BoW method leverages validation metrics more effectively, uses distributed word representations and differs in optimization." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-69", "text": "We also observe that the majority baseline achieves an accuracy of only 17.15%, suggesting that the other question-only baselines leverage dataset biases separate from class modes." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-70", "text": "For completeness, we also include a question only baseline derived directly from the EmbodiedQA codebase, which uses only the Question LSTM in the PACMAN model, termed as PACMAN Q-only (LSTM)." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-71", "text": "Note that we only compare the top-1 accuracy of different methods here, and not the navigation performance since it's not directly applicable to these blindfold baselines." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-72", "text": "To better understand the exact bias exploited by the text only models we observe that (a) The questions from training set are largely repeated in the validation and test set, with only 2 and 6 questions being unique to them respectively." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-73", "text": "As noted earlier, this means that models don't need to generalize across unseen combinations of rooms/objects/colors to perform well on this task (b) Despite entropy-pruning, there is a noticeable bias in the answer distribution of EQAv1 questions (see [5, Appendix A])." 
}, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-74", "text": "Our results on the Nearest Neighbour baseline confirm this source of bias and explain largely the text model performance." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-75", "text": "Viewing these results holistically, we conclude that current methods for the EmbodiedQA task are not effective at using context from the environment, and in fact this negatively hampers them." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-76", "text": "This shows that there is room for building new models that leverage the context and embodiment in the environment." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-77", "text": "Oracles: We now examine whether the EQAv1 dataset and the proposed oracle navigation can improve over pure text baselines, to leverage visual information in the most ideal case." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-79", "text": "Specifically we train the VQA model described in [6] on the last 5 frames of oracle navigation for 50 epochs with ADAM and a learning rate of 3e \u2212 4 using batch size 20." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-80", "text": "We observe the accuracy is improved over text baselines in this unrealistic setting, but the use of this model with navigation in PACMAN reduces performance to below the text baselines." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-81", "text": "For completeness we benchmark an oracle with our BoW embedding model in place of the LSTM with all other settings kept constant." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-82", "text": "As noted in [5] , we re-iterate that these oracles are far from perfect, as they may not contain the best vantage or context to answer the question." 
}, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-83", "text": "----------------------------------" }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-84", "text": "**T 10**" }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-85", "text": "T 20 T 50 T any Navigation + VQA PACMAN (BC) [5] 48 BOW-CNN VQA-Only 56.5 Table 1 : We compare to the published results from [6] for agent spawned at various steps away from the target: 10, 30, 50, and anywhere in the environment." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-86", "text": "Question-only baselines outperform Navigation+VQA methods except when spawned 10 steps from the target object." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-87", "text": "A VQA-only system with oracle navigation can improve on a pure text baseline but isn't effective when combined with navigation." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-88", "text": "(*) indicates our reproduction of the model described in [5] Error Analysis: To better understand the shortcomings and limitations, we perform an error analysis of the one of the runs of the BoW model on different question types: Here, the color category Preposition Location Color 9.09 51.72 53.31 Table 2 : Accuracy of the BoW model on different question types subsumes color and color_room both." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-89", "text": "The particularly low accuracy on preposition questions is due to the fact that there exist very few questions of this type in the training set (2.44%), and the entropy of answer distribution in this class is much higher compared to color and location question types." 
}, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-90", "text": "----------------------------------" }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-91", "text": "**CONCLUSION**" }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-92", "text": "We show that simple question only baselines largely outperform or closely compete with existing methods on the EmbodiedQA task." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-93", "text": "Our results indicate existing models are not able to convincingly use sensory inputs from the environment to perform question answering, although they have been demonstrated some ability navigate toward the object of interest." }, { "sent_id": "b7e0879c4cac85054870146e61aa6f-C001-94", "text": "Besides providing a benchmark score for future researchers working on this task, our results suggest considerations for future dataset and task construction in EQA and related tasks." } ], "y": { "@BACK@": { "gold_contexts": [ [ "b7e0879c4cac85054870146e61aa6f-C001-14" ], [ "b7e0879c4cac85054870146e61aa6f-C001-29" ] ], "cite_sentences": [ "b7e0879c4cac85054870146e61aa6f-C001-14", "b7e0879c4cac85054870146e61aa6f-C001-29" ] }, "@USE@": { "gold_contexts": [ [ "b7e0879c4cac85054870146e61aa6f-C001-64" ], [ "b7e0879c4cac85054870146e61aa6f-C001-79" ], [ "b7e0879c4cac85054870146e61aa6f-C001-85" ] ], "cite_sentences": [ "b7e0879c4cac85054870146e61aa6f-C001-64", "b7e0879c4cac85054870146e61aa6f-C001-79", "b7e0879c4cac85054870146e61aa6f-C001-85" ] } } }, "ABC_0751f2ced4f7ced37cf206fea051fa_35": { "x": [ { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-2", "text": "An important problem when using Stochastic Inversion Transduction Grammars is their computational cost." 
}, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-3", "text": "More specifically, when dealing with corpora such as Europarl only one iteration of the estimation algorithm becomes prohibitive." }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-4", "text": "In this work, we apply a reduction of the cost by taking profit of the bracketing information in parsed corpora and show machine translation results obtained with a bracketed Europarl corpus, yielding interresting improvements when increasing the number of non-terminal symbols." }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-5", "text": "----------------------------------" }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-7", "text": "Statistical Machine Translation (SMT) systems have proved in the last years to be an important alternative to rule-based MT systems, being even able of outperforming commercial machine translation systems in the tasks they have been trained on." }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-8", "text": "Phrase-based (PB) models (Tomas and Casacuberta, 2001; Zens et al., 2002) have proved to provide a very efficient framework for SMT." }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-9", "text": "An important issue when training PB models is the algorithm by means of which the bilingual phrases are extracted." }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-10", "text": "Hence, a wide variety of methods have been proposed for this purpose, spanning through statistically motivated procedures (Tomas and Casacuberta, 2001 ), heuristic algorithms (Zens et al., 2002) , and linguistically motivated methods (S\u00e1nchez and Bened\u00ed, 2006a) ." }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-11", "text": "In this work, we will be following this last approach, which relies on Stochastic Inverse Transduction Grammars (SITGs) (Wu, 1997) for phrase extraction." 
}, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-12", "text": "SITGs constitute a restricted subset of syntax directed stochastic grammars for translation, and are very related to context-free grammars." }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-13", "text": "These can be used to analyse two strings simultaneously, which makes them specially useful for extracting bilingual segments from a parallel corpus in a syntax-oriented manner." }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-14", "text": "In (S\u00e1nchez and Bened\u00ed, 2006b ), SITGs were used for obtaining word phrases, reporting preliminary results on the EuroParl corpus." }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-15", "text": "In this work, we extend that work by using bracketed corpora for estimating the STIGs." }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-16", "text": "In section 2, we will briefly review the phrase-based SMT approach." }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-17", "text": "Next, in section 3, we will sum up the grounds of SITGs and the modifications proposed in (S\u00e1nchez and Bened\u00ed, 2006a) ." }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-18", "text": "In section 4, we present the translation results on the Europarl corpus, obtained when applying one learning iteration on SITGs with several number of nonterminals." }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-19", "text": "----------------------------------" }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-20", "text": "**PHRASE-BASED MODELS**" }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-21", "text": "The derivation of the PB models stems from the concept of bilingual segmentation, i.e. sequences of source words and sequences of target words." 
}, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-22", "text": "It is assumed that only segments of contiguous words are considered, the number of source segments being equal to the number of target segments and each source segment being aligned with only one target segment and vice versa." }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-23", "text": "Ultimately, when learning a PB model, the purpose is to compute a phrase translation table, where each input phrase is assigned to one or more output phrases with a given probability." }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-24", "text": "In this work, we use SITGs to build this phrase translation table." }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-25", "text": "----------------------------------" }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-26", "text": "**STOCHASTIC INVERSION TRANSDUCTION GRAMMARS**" }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-27", "text": "Being closely related to context-free grammars, Stochastic Inverse Transduction Grammars (Wu, 1997 ) specify a subset of syntax directed stochastic grammars for translation." }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-28", "text": "Analysing two strings simultaneously, SITGs may be used to extract bilingual segments from a parallel corpus while taking into account syntax-motivated restrictions." }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-29", "text": "The internal nodes of the parse tree define a span over each pair of strings." }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-30", "text": "These spans can be considered as paired segments of words." }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-31", "text": "In (Wu, 1997) , an algorithm similar to the CYK of context free grammars is proposed in order to parse a sentence pair with a SITG." 
}, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-32", "text": "This algorithm has a time complexity of O(|x| 3 |y| 3 |R|), being |x| the length of the source sentence, |y| the length of the target target sentence, and |R| the number of rules in the SITG." }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-33", "text": "However, if the corpus has been previously parsed with a syntactical parser and is given in a bracketed form, (S\u00e1nchez and Bened\u00ed, 2006a) suggest the use of a version of the algorithm by (Wu, 1997) which is more efficient while performing the analysis, achieving a time complexity of O(|x||y||R|) when x and y are fully bracketed." }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-34", "text": "In this work, we will be taking profit of bracketing information provided by freely available parsing toolkits in order to achieve an important increase of speed within the estimation algorithm." }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-35", "text": "----------------------------------" }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-36", "text": "**EXPERIMENTS**" }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-37", "text": "----------------------------------" }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-38", "text": "**SITGS FOR PHRASE EXTRACTION**" }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-39", "text": "First, we built an initial SITG by following the method described in (S\u00e1nchez and Bened\u00ed, 2006b )." }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-40", "text": "Then, both source and target languages in the training corpus were bracketed by using FreeLing (Asterias et al., 2006) , which is an opensource suite of language analysers." }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-41", "text": "This being done, we then used the bracketed corpus to perform one estimation iteration on the initial SITG and obtain improved SITGs." 
}, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-42", "text": "Finally, the SITG obtained after the estimation iteration was used to parse the bracketed training corpus and extract segment pairs to setup a phrase-based translation model." }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-43", "text": "It is important to stress the importance of the bracketing information, without which it would have been practically impossible to perform any learning iterations at all because of the severe temporal issues." }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-44", "text": "Following common knowledge in SMT, we computed both the inverse and direct translation probabilities of each segment pair according to the formulae" }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-45", "text": "where C(s, t) is the number of times segments s and t were extracted throughout the whole corpus." }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-46", "text": "This phrase-table was fed to Moses (Philipp Koehn, 2007) for producing the final translation." }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-47", "text": "Initial SITGs with increasing number of non-terminal symbols were built and then estimated." }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-48", "text": "The purpose of building SITGs with several non-terminal symbols was to analyse whether augmenting the number of non-terminals would improve word reorderings between both input and output languages." }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-49", "text": "Adding non-terminal symbols may provide more complexity to the grammar built, and hence increases its expressive power." }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-50", "text": "(S\u00e1nchez and Bened\u00ed, 2006b ) Translation results of this setup can be seen in Table 1 ." 
}, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-51", "text": "Here, all the weights of the log-linear model were adjusted my MERT training, and the language model used was a 5-gram interpolated with Knesser-Ney discount." }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-52", "text": "It is interresting to point out that one estimation iteration for any number of non-terminal symbols has deffinitely an improving effect." }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-53", "text": "(S\u00e1nchez and Bened\u00ed, 2006b) shows an experiment in which segments were extracted from training corpora without any bracketing information." }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-54", "text": "Since this was not computationally feasible with the training algorithm, we decided to use the SITG obtained after one estimation iteration (estimated using the bracketed corpus) to parse a non-bracketed 27.9/61.5 28.9/60.0 version of the corpus for the purpose of obtaining segments." }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-55", "text": "Interestingly however, the BLEU score did not vary." }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-56", "text": "The same conclusion was achieved when mixing the segments obtained when using the bracketed and the non-bracketed corpus: again, the BLEU score did not differ significantly." }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-57", "text": "The fact that the translation quality is not lessened by introducing the bracketing, and hence constraining the SITG, has a lot of importance, since a bracketed corpus can be analysed by the SITG much faster." 
}, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-58", "text": "----------------------------------" }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-59", "text": "**ADDING SYNTACTIC TRANSLATION PROBABILITIES**" }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-60", "text": "In the process of obtaining the best parse tree t x,y for each pair of strings (x, y) (see Section 3), a joint probability p(s, t) (s and t are, respectively, word segments from x and y) for several overlapping spans is obtained." }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-61", "text": "It is important to note that a given pair of word segments (s, t) can have different probabilities depending on the tree it comes from." }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-62", "text": "We have defined a new translation model that is based in this information as follows." }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-63", "text": "Let \u2126 the multiset of spans (word segments) obtained from the training sample, and \u2126 s,t \u2286 \u2126 the multiset of (s, t) spans." }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-64", "text": "We define the expected value of p(s, t) according to the empirical distribution as:" }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-65", "text": "If we marginalise for the input side of the word segments and for the output side of the segments, then we get:" }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-66", "text": "and" }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-67", "text": "In this way we obtain these two new syntax-based models:" }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-68", "text": "As can be seen in Table 2 , adding these new syntax based models produces a consistent improvement of approximately one point of BLEU." 
}, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-69", "text": "----------------------------------" }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-70", "text": "**DISCUSSION**" }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-71", "text": "Comparatively, the best result that (S\u00e1nchez and Bened\u00ed, 2006b) reported in the Spanish-English task was a BLEU score of 23.0, which they obtained by combining segments extracted from both the bracketed and the non-bracketed corpus." }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-72", "text": "We have widely exceeded this baseline." }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-73", "text": "On the other hand, the Moses toolkit (Philipp Koehn, 2007) , which is a state of the art statistical machine translation system, obtains in this task a score of 31.0 BLEU." }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-74", "text": "However, when constrained to use only the inverse and direct translation models as we did, the score drops to 29.6 BLEU, which is only 1.7 points away from our best score, with only the direct and the inverse translation models, and 0.7 points away from our overall best score." }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-75", "text": "It can be argued that we should be comparing this last score with Moses with all four models, adding the lexical alignment models." }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-76", "text": "However, these lexical alignment models can also be added to our system, and we actually plan to do so as future work, whereas the syntactic models we introduced cannot be added to the segments obtained in the Moses Toolkit." }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-77", "text": "Although Moses obtains a slightly better score, it must be taken into consideration that this toolkit achieves this by using 19M different segment pairs, whereas our translation models only use half that amount." 
}, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-78", "text": "This fact has important implications: being our model smaller, less computational resources are used in decoding time, but also the final translation is produced faster." }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-79", "text": "Moreover, adding non-terminal symbols seems to have beneficial effects on the final BLEU score." }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-80", "text": "Hence, it seems there is still room for improvement, whereas regular phrasebased models (such as Moses) do not have this ability." }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-81", "text": "----------------------------------" }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-82", "text": "**CONCLUSIONS AND FUTURE WORK**" }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-83", "text": "We have presented an alternative method for phrase extraction, which is competitive in terms of quality and produces smaller phrase-based models when compared to the traditional phrase-based extraction algorithms used." }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-84", "text": "Moreover, we have shown that freely available natural language processing toolkits can be successfully used to obtain bracketed corpora and reduce time complexity in SITG estimation, without trading off translation quality." }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-85", "text": "In the future, we plan to compute more complex SITGs and introduce further models to improve our translation table, such as the lexical alignment models or other models obtained by combining the various probabilities that SITG estimation entails." }, { "sent_id": "0751f2ced4f7ced37cf206fea051fa-C001-86", "text": "In this line, we also plan to investigate which effect has the combination of our phrase table with the phrase table produced by Moses." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "0751f2ced4f7ced37cf206fea051fa-C001-14" ], [ "0751f2ced4f7ced37cf206fea051fa-C001-71" ] ], "cite_sentences": [ "0751f2ced4f7ced37cf206fea051fa-C001-14", "0751f2ced4f7ced37cf206fea051fa-C001-71" ] }, "@EXT@": { "gold_contexts": [ [ "0751f2ced4f7ced37cf206fea051fa-C001-14", "0751f2ced4f7ced37cf206fea051fa-C001-15" ] ], "cite_sentences": [ "0751f2ced4f7ced37cf206fea051fa-C001-14" ] }, "@USE@": { "gold_contexts": [ [ "0751f2ced4f7ced37cf206fea051fa-C001-39" ], [ "0751f2ced4f7ced37cf206fea051fa-C001-50" ] ], "cite_sentences": [ "0751f2ced4f7ced37cf206fea051fa-C001-39", "0751f2ced4f7ced37cf206fea051fa-C001-50" ] }, "@DIF@": { "gold_contexts": [ [ "0751f2ced4f7ced37cf206fea051fa-C001-71", "0751f2ced4f7ced37cf206fea051fa-C001-72" ] ], "cite_sentences": [ "0751f2ced4f7ced37cf206fea051fa-C001-71" ] } } }, "ABC_2b7ba7f7aa2a03ad0de84e007c1f64_35": { "x": [ { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-44", "text": "**APPROACH**" }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-2", "text": "The task of building automatic agents that can negotiate with humans in free-form natural language has gained recent interest in the literature." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-3", "text": "Although there have been initial attempts, combining linguistic understanding with strategy effectively still remains a challenge." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-4", "text": "Towards this end, we aim to understand the role of natural language in negotiations from a datadriven perspective by attempting to predict a negotiation's outcome, well before the negotiation is complete." 
}, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-5", "text": "Building on the recent advancements in pre-trained language encoders, our model is able to predict correctly within 10% for more than 70% of the cases, by looking at just 60% of the negotiation." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-6", "text": "These results suggest that rather than just being a way to realize a negotiation, natural language should be incorporated in the negotiation planning as well." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-7", "text": "Such a framework can be directly used to get feedback for training an automatically negotiating agent." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-8", "text": "----------------------------------" }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-10", "text": "Negotiations, either between individuals or entities, are ubiquitous in everyday human interactions ranging from sales to legal proceedings." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-11", "text": "Being a good negotiator is a complex skill, requiring the ability to understand the partner's motives, ability to reason and to communicate effectively, making it a challenging task for an automated system." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-12", "text": "While research in building automatically negotiating agents has primarily focused on agent-agent negotiations (Williams et al., 2012; Lin et al., 2014) , there is a recent interest in agent-human negotiations (Gratch et al., 2015) as well." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-13", "text": "Such agents may act as mediators or can be helpful for pedagogical purposes (Johnson et al., 2019) ." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-14", "text": "Efforts in agent-human negotiations involving free-form natural language as a means of communication are rather sparse." 
}, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-15", "text": "Researchers (He et al., 2018) recently studied natural language negotiations in buyer-seller bargaining setup, which is comparatively less restricted than previously studied game environments (Asher et al., 2016; Lewis et al., 2017) ." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-16", "text": "Lack of a well-defined structure in such negotiations allows humans or agents to express themselves more freely, which better emulates a realistic scenario." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-17", "text": "Interestingly, this also provides an exciting research opportunity: how can an agent leverage the behavioral cues in natural language to direct its negotiation strategies? Understanding the impact of natural language on negotiation outcomes through a data-driven neural framework is the primary objective of this work." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-18", "text": "We focus on buyer-seller negotiations (He et al., 2018) where two individuals negotiate the price of a given product." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-19", "text": "Leveraging the recent advancements (Vaswani et al., 2017; Devlin et al., 2019) in pre-trained language encoders, we attempt to predict negotiation outcomes early on in the conversation, in a completely data-driven manner ( Figure 1 )." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-20", "text": "Early prediction of outcomes is essential for effective planning of an automatically negotiating agent." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-21", "text": "Although there have been attempts to gain insights into negotiations (Adair et al., 2001; Koit, 2018) , to the best of our knowledge, we are the first to study early natural language cues through a datadriven neural system (Section 3)." 
}, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-22", "text": "Our evaluations show that natural language allows the models to make better predictions by looking at only a fraction of the negotiation." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-23", "text": "Rather than just realizing the strategy in natural language, our empirical results suggest that language can be crucial in the planning as well." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-24", "text": "We provide a sample negotiation from the test set (He et al., 2018 ) along with our model predictions in Table 1 ." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-25", "text": "----------------------------------" }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-26", "text": "**PROBLEM SETUP**" }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-27", "text": "We study human-human negotiations in the buyerseller bargaining scenario, which has been a key research area in the literature (Williams et al., 2012) ." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-28", "text": "In this section, we first describe our problem setup and key terminologies by discussing the dataset used." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-29", "text": "Later, we formalize our problem definition." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-30", "text": "Dataset: For our explorations, we use the Craigslist Bargaining dataset (CB) introduced by He et al. (2018) ." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-31", "text": "Instead of focusing on the previously studied game environments (Asher et al., 2016; Lewis et al., 2017) , the dataset considers a more realistic setup: negotiating the price of products listed on Craigslist 1 ." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-32", "text": "The dataset consists of 6682 dialogues between a buyer and a seller who converse in natural language to negotiate the price of a given product (sample in Table 1 )." 
}, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-33", "text": "In total, 1402 product ad postings were scraped from Craigslist, belonging to six categories: phones, bikes, housing, furniture, car and electronics." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-34", "text": "Each ad posting contains details such as Product Title, Category Type and a Listing Price." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-35", "text": "Moreover, a secret target price is also pre-decided for the buyer." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-36", "text": "The final price after the agreement is called the Agreed Price, which we aim to predict." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-37", "text": "Defining the problem: Say we are provided with a product scenario S, a tuple: (Category, Title, Listing Price, Target Price) 2 . Define the interactions between a buyer and seller using a sequence of n events E n :< e 1 , e 2 , ..., e n >, where e i occurs before e j iff i < j. Event e i is also a tuple: (Initiator, Type, Data)." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-38", "text": "Initiator is either the Buyer or Seller, Type can be one of (message, offer, accept, reject or quit) and Data consists of either the corresponding natural language dialogue, offer price or can be empty." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-39", "text": "Nearly 80% of events in CB dataset are of type 'message', each consisting a textual message as Data." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-40", "text": "An offer is usually made and accepted at the end of each negotiation." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-41", "text": "Since the offers directly contain the agreed price (which we want to predict), we only consider 'message' events in our models." 
}, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-42", "text": "Given the scenario S and first n events E n , our problem is then to learn the function f n : A = f n (S, E n ) where A refers to the final agreed price between the two negotiating parties." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-43", "text": "----------------------------------" }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-45", "text": "Pre-trained language models, such as BERT (Vaswani et al., 2017; Devlin et al., 2019 ) have recently gained huge success on a wide range of NLP tasks." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-46", "text": "However, since our framework deals with various auxiliary pieces (category, price, etc.), we cannot directly leverage these language models, which have only been trained on natural language inputs." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-47", "text": "Instead of relying on additional representations along with BERT outputs, we propose a simple, yet effective way to incorporate the auxiliary information into the same embedding space." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-48", "text": "Our model hierarchically builds a representation for the given negotiation to finally predict the agreed price." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-49", "text": "We present our complete architecture in Figure 1 ." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-50", "text": "Encoding the input: In order to effectively capture the natural language dialogue and the associated auxiliary information, we make use of predefined sentence templates." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-51", "text": "Table 2 shows how we represent the category, target price and the product title in natural language sentences." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-52", "text": "These sentences are concatenated to form our Scenario S. 
Moving ahead in a similar manner, we define templates to capture the negotiator identity (buyer/seller) and any message which is conveyed." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-53", "text": "As shown in Figure 1, the scenario S and the events are separated using [SEP] tokens." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-54", "text": "Following Liu and Lapata (2019), who use BERT for extractive text summarization, we add a [CLS] token at the beginning of each segment." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-55", "text": "We also alternate between a sequence of 0s and 1s for segment embeddings to differentiate between the scenario and the events." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-56", "text": "Architecture and Learning: The BERT representation for each [CLS] token is a contextualized encoding of the word sequence that follows it." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-57", "text": "In order to further capture the sequential nature of negotiation events, we pass these [CLS] representations through Gated Recurrent Units (GRUs)." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-58", "text": "Recurrent networks have been shown to be useful in combination with Transformer architectures." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-59", "text": "Finally, a feed-forward network is applied to predict the agreed price for the negotiation." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-60", "text": "The model is trained end-to-end and fine-tuned using the Mean Squared Error (MSE) loss between the predicted price and the ground-truth." 
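The input layout described above, with a [CLS] token opening each segment, [SEP] tokens separating segments, and alternating 0/1 segment ids, can be sketched as follows (the whitespace tokenization and template wording are simplifying assumptions; the real model uses BERT's subword tokenizer):

```python
from typing import List, Tuple

def build_input(scenario_sents: List[str], event_msgs: List[str]) -> Tuple[List[str], List[int]]:
    """Concatenate the scenario and event segments, each opened by [CLS]
    and closed by [SEP], with segment ids alternating between 0 and 1
    (a sketch of the layout described above, not the exact pipeline)."""
    segments = [" ".join(scenario_sents)] + event_msgs
    tokens, seg_ids = [], []
    for i, seg in enumerate(segments):
        words = ["[CLS]"] + seg.split() + ["[SEP]"]
        tokens.extend(words)
        seg_ids.extend([i % 2] * len(words))  # alternate 0s and 1s per segment
    return tokens, seg_ids
```

Each [CLS] position then yields one contextualized vector per segment, which the GRU consumes in order.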
}, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-61", "text": "----------------------------------" }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-62", "text": "**EXPERIMENTAL DETAILS**" }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-63", "text": "We perform experiments on the CB dataset to primarily answer two questions: 1) Is it feasible to predict negotiation outcomes without observing the complete conversation between the buyer and seller? 2) To what extent does the natural language incorporation help in the prediction?" }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-64", "text": "In order to answer these questions, we compare our model empirically with a number of baseline methods." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-65", "text": "This section presents the methods we compare to, the training setup and the evaluation metrics." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-66", "text": "Methods: The first baseline is the Listing Price (LP) where the model ignores the negotiation and returns the listing price of the product." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-67", "text": "Similarly, we use Target Price (TP), where the model just returns the target price for the buyer." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-68", "text": "We also consider the mean of Listing and Target price (TP+LP/2) as another baseline." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-69", "text": "Although trivial, these baselines help in benchmarking our results and also show good performance in some cases." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-70", "text": "Next, we build another baseline which completely ignores the natural language incorporation." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-71", "text": "In this case, the model only sees a sequence of prices shared across the messages in the negotiation." 
}, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-72", "text": "We keep the input format the same as our model and all the parameters are randomly initialized to remove learning from natural language." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-73", "text": "We refer to this model as Prices-only." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-74", "text": "We compare two variants for BERT-based models." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-75", "text": "First, for the BERT method, we keep only the first [CLS] token in the input and then train the model with fine-tuning using a single feedforward network on top of the [CLS] representation." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-76", "text": "Secondly, we call our complete approach as BERT+GRU, where we use a recurrent network with BERT fine-tuning, as depicted in Figure 1 ." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-77", "text": "Training Details: Given the multiple segments in our model input and small data size, we use BERTbase (Devlin et al., 2019) , having output dimension of 768." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-78", "text": "To tackle the variance in product prices across different categories, all prices in the inputs and outputs were normalized by the listing price." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-79", "text": "The predictions were unnormalized before final evaluations." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-80", "text": "Further, we only considered the negotiations where an agreement was reached." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-81", "text": "These were the instances for which ground truth was available (\u223c 75% of the data)." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-82", "text": "We use a two-layer GRU with a dropout of 0.1 and 50 hidden units." 
}, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-83", "text": "The models were trained for a maximum of 5000 iterations, with AdamW optimizer (Loshchilov and Hutter, 2018) , a learning rate of 2x10 \u2212 5 and a batch size of 4." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-84", "text": "We used a linear warmup schedule for the first 0.1 fraction of the steps." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-85", "text": "All the hyper-parameters were optimized on the provided development set." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-86", "text": "Evaluation Metrics: We study the variants of the same model by training with different proportions of the negotiation seen, namely, f \u2208 {0.0, 0.2, 0.4, 0.6, 0.8, 1.0}. We compare the models on two evaluation metrics: MAE: Mean Absolute Error between the predicted and ground-truth agreed prices along with Accuracy\u00b1k: the percentage of cases where the predicted price lies Target (TP) Listing (LP) (TP+LP)/2 Prices-o Figure 2 : Performance for various approaches by varying the fraction of events seen by the model." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-87", "text": "We present scores on MAE, Accuracy\u00b15 and Accuracy\u00b110 from left to right for the complete test set." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-88", "text": "Overall, BERT-based approaches incorporating natural language cues almost always beat the baselines not using these cues." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-89", "text": "By looking at 60% of the negotiation, the model can predict correctly within 10% for more than 70% of the cases." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-90", "text": "within k percent of the ground-truth." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-91", "text": "We use k = 5 and k = 10 in our experiments." 
}, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-92", "text": "----------------------------------" }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-93", "text": "**RESULTS AND DISCUSSION**" }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-94", "text": "We present our results in Figure 2 ." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-95", "text": "We also show Accuracy\u00b110 for different product categories in the Appendix." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-96", "text": "First, Target Price (TP) and (TP+LP)/2 prove to be strong baselines, with the latter achieving 61.07% Accuracy\u00b110." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-97", "text": "This performance is also attested by relatively strong numbers on the other metrics as well." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-98", "text": "Prices-only, which does not incorporate any knowledge from natural language, fails to beat the average baseline even with 60% of the negotiation history." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-99", "text": "This can be attributed to the observation that in many negotiations, before discussing the price, buyers tend to get more information about the product by exchanging messages: what is the condition of the product, how old it is, is there an urgency for any of the buyer/seller and so on." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-100", "text": "Incorporating natural language in both the scenario and event messages paves the way to leverage such cues and make better predictions early on in the conversation, as depicted in the plots." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-101", "text": "Both BERT and BERT-GRU consistently perform well on the complete test set." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-102", "text": "There is no clear winner, although using a recurrent network proves to be more helpful in the early stages of the negotiation." 
}, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-103", "text": "Note that BERT method still employs multiple [SEP] tokens along with alternating segment embeddings (Section 3)." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-104", "text": "Without this usage, the fine-tuning pipeline proves to be inadequate." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-105", "text": "Overall, BERT-GRU achieves 67.08% Accuracy\u00b110 with just the product scenario, reaching to 71.16% with 60% of the messages and crosses 90% as more information about the final price is revealed." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-106", "text": "Paired Bootstrap Resampling (Koehn, 2004) with 10, 000 bootstraps shows that for a given f , BERT-GRU is better than its Prices-only counterpart with 95% statistical significance." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-107", "text": "The prices discussed during the negotiation still play a crucial role in making the predictions." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-108", "text": "In fact, in only 65% of the negotiations, the first price is quoted within the first 0.4 fraction of the events." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-109", "text": "This is visible in higher performance as more events are seen after this point." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-110", "text": "This number is lower than average for Housing, Bike and Car, resulting in relative better performance of Priceonly model for these categories over others." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-111", "text": "The models also show evidence of capturing buyer interest." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-112", "text": "By constructing artificial negotiations, we observe that the model predictions at f =0.2 increase when the buyer shows more interest in the product, indicating more willingness to pay." 
}, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-113", "text": "With the capability to incorporate cues from natural language, such a framework can be used in the future to get negotiation feedback, in order to guide the planning of a negotiating agent." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-114", "text": "This can be a viable middleground between following the average human behavior through supervised learning or exploring the wild by optimizing on rewards using reinforcement learning (Lewis et al., 2017; He et al., 2018) ." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-115", "text": "----------------------------------" }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-116", "text": "**CONCLUSION**" }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-117", "text": "We presented a framework to attempt early predictions of the agreed product prices in buyer-seller negotiations." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-118", "text": "We construct sentence templates to encode the product scenario, exchanged messages and associated auxiliary information into the same hidden space." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-119", "text": "By combining a recurrent network and the pre-trained BERT encoder, our model leverages natural language cues in the exchanged messages to predict the negotiation outcomes early on in the conversation." }, { "sent_id": "2b7ba7f7aa2a03ad0de84e007c1f64-C001-120", "text": "With this capability, such a framework can be used in a feedback mechanism to guide the planning of a negotiating agent." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "2b7ba7f7aa2a03ad0de84e007c1f64-C001-15" ], [ "2b7ba7f7aa2a03ad0de84e007c1f64-C001-114" ] ], "cite_sentences": [ "2b7ba7f7aa2a03ad0de84e007c1f64-C001-15", "2b7ba7f7aa2a03ad0de84e007c1f64-C001-114" ] }, "@MOT@": { "gold_contexts": [ [ "2b7ba7f7aa2a03ad0de84e007c1f64-C001-15", "2b7ba7f7aa2a03ad0de84e007c1f64-C001-16", "2b7ba7f7aa2a03ad0de84e007c1f64-C001-17" ] ], "cite_sentences": [ "2b7ba7f7aa2a03ad0de84e007c1f64-C001-15" ] }, "@USE@": { "gold_contexts": [ [ "2b7ba7f7aa2a03ad0de84e007c1f64-C001-18" ], [ "2b7ba7f7aa2a03ad0de84e007c1f64-C001-24" ], [ "2b7ba7f7aa2a03ad0de84e007c1f64-C001-30" ] ], "cite_sentences": [ "2b7ba7f7aa2a03ad0de84e007c1f64-C001-18", "2b7ba7f7aa2a03ad0de84e007c1f64-C001-24", "2b7ba7f7aa2a03ad0de84e007c1f64-C001-30" ] }, "@FUT@": { "gold_contexts": [ [ "2b7ba7f7aa2a03ad0de84e007c1f64-C001-113", "2b7ba7f7aa2a03ad0de84e007c1f64-C001-114" ] ], "cite_sentences": [ "2b7ba7f7aa2a03ad0de84e007c1f64-C001-114" ] } } }, "ABC_5eb321d3c63642a4b148e1276eab20_35": { "x": [ { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-34", "text": "Finally, we obtain predicted types through a classification layer." }, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-2", "text": "Fine-grained entity typing aims to assign entity mentions in the free text with types arranged in a hierarchical structure." }, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-3", "text": "Traditional distant supervision based methods employ a structured data source as a weak supervision and do not need hand-labeled data, but they neglect the label noise in the automatically labeled training corpus." }, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-4", "text": "Although recent studies use many features to prune wrong data ahead of training, they suffer from error propagation and bring much complexity." 
}, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-5", "text": "In this paper, we propose an end-to-end typing model, called the path-based attention neural model (PAN), to learn a noiserobust performance by leveraging the hierarchical structure of types." }, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-6", "text": "Experiments demonstrate its effectiveness." }, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-7", "text": "----------------------------------" }, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-9", "text": "Fine-grained entity typing aims to assign types (e.g., \"person\", \"politician\", etc.) to entity mentions in the local context (a single sentence), and the type set constitutes a treestructured hierarchy (i.e., type hierarchy)." }, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-10", "text": "Recent years witness the boost of neural models in this task, e.g., (Shimaoka et al. 2016) employs an attention based LSTM to attain sentence representations and achieves state-of-the-art performance." }, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-11", "text": "However, it still suffers from noise in training data, which is a main challenge in this task." }, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-12", "text": "The training data is generated by distant supervision, which assumes that if an entity has a type in knowledge bases (KBs), then all sentences containing this entity will express this type." }, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-13", "text": "This method inevitably introduces irrelevant types to the context." }, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-14", "text": "For example, the entity \"Donald Trump\" has types \"person\", \"businessman\" and \"politician\" in KBs, thus all three types are annotated for its mentions in the training corpora." 
}, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-15", "text": "But in sentence \"Donald Trump announced his candidacy for President of US.\", only \"person\" and \"politician\" are correct types, while \"businessman\" can not be deduced from the sentence, thus serves as noise." }, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-16", "text": "To alleviate this issue, a few systems try to denoise training data by filtering irrelevant types ahead of training." }, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-17", "text": "For instance, (Ren et al. 2016) proposes PLE to identify correct types by jointly embedding mentions, context and type hierarchy, and then use clean data to train classifiers." }, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-18", "text": "However, the denoising and training process are not unified, which may cause error propagation and bring much additional complexity." }, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-19", "text": "Motivated by this, we propose an end-to-end typing model, called the path-based attention neural model (PAN), to select relevant sentences to each type, which can dynamically reduce the weights of wrong labeled sentences for each type during training." }, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-20", "text": "This idea is inspired by some successful attempts to reduce noise in relation extraction, e.g., (Lin et al. 2016) ." }, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-21", "text": "However, these methods fail to formulate type hierarchy, which is distinct in fine-grained entity typing." }, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-22", "text": "Specifically, if a sentence indicates a type, its parent type can be also deduced from the sentence." }, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-23", "text": "Like the example above, \"politician\" is the subtype of \"person\"." 
}, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-24", "text": "Since the sentence indicates that \"Donald Trump\" is \"politician\", \"person\" should also be assigned." }, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-25", "text": "Thus, we build path-based attention for each type by utilizing its path to its coarsest parent type (e.g., person, businessman) in the type hierarchy." }, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-26", "text": "Compared to the simple attention in relation extraction, it enables parameter sharing for types in the same path." }, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-27", "text": "With the support of hierarchical information of types, it can reduce noise effectively and yields a better typing classifier." }, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-28", "text": "Experiments on two data sets validate effectiveness of PAN." }, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-29", "text": "----------------------------------" }, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-30", "text": "**PATH-BASED ATTENTION NEURAL MODEL**" }, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-31", "text": "The architecture of PAN is illustrated in Figure1." }, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-32", "text": "Supposing that there are n sentences containing entity e, i.e., S e = {s 1 , s 2 , ..., s n }, and T e is the automatically labeled types based on KBs." }, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-33", "text": "Firstly PAN employs LSTM to generate representations of sentences s i following (Shimaoka et al. 2016) , where s i \u2208 R d is the semantic representation of s i , i \u2208 {1, 2, ..., n}. Afterwards, we build path-based attention \u03b1 i,t over sentences s i for each type t \u2208 T e , which is expected to focus on relevant sentences to type t. 
Then, the representation of the sentence set S_e for type t, denoted by s_{e,t} \u2208 R^d, is calculated through a weighted sum of the sentence vectors." }, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-35", "text": "[Figure 1: The architecture of PAN for a given entity e and type t.] More precisely, given e, an attention \u03b1_{i,t} is learned to score how well sentence s_i matches type t, i.e., \u03b1_{i,t} = exp(s_i^T A p_t) / \u03a3_j exp(s_j^T A p_t)" }, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-36", "text": ", where A \u2208 R^{d\u00d7d} is a weighted diagonal matrix." }, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-37", "text": "p_t \u2208 R^d is the representation of the path p_t for type t. Specifically, for each type, we define one path as a sequence of types starting from its coarsest parent type and ending with the type itself." }, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-38", "text": "More formally, for type t_l, p_{t_l} = t_1 \u2192 t_2 \u2192 ... \u2192 t_l, where t_1 is its coarsest parent type, and t_{i+1} is the subtype of t_i." }, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-39", "text": "For example, for type t_l = politician, its path is p_{t_l} = person \u2192 politician." }, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-40", "text": "We represent the path p_{t_l} as a semantic composition of all the types on the path, i.e., p_{t_l} = t_1 \u2022 t_2 \u2022 ... \u2022 t_l, where t_i \u2208 R^d is the representation of type t_i, which is a parameter to learn." }, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-41", "text": "\u2022 is a composition operator." }, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-42", "text": "In this paper, we consider two types of operators: (1) Addition (PAN-A), where p_{t_l} equals the sum of the type vectors." }, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-43", "text": "(2) Multiplication (PAN-M), where p_{t_l} equals the cumulative product of the type vectors." 
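The path composition (PAN-A vs. PAN-M) and the attention over sentences can be sketched with plain Python lists (a toy illustration; the diagonal matrix A is passed as the vector of its diagonal entries, and a softmax form for the attention is assumed):

```python
import math
from typing import List

def compose_path(type_vecs: List[List[float]], op: str = "add") -> List[float]:
    """Compose type vectors along a path t_1 -> ... -> t_l into p_{t_l}:
    elementwise sum (PAN-A) or elementwise product (PAN-M)."""
    path = list(type_vecs[0])
    for vec in type_vecs[1:]:
        for j, v in enumerate(vec):
            path[j] = path[j] + v if op == "add" else path[j] * v
    return path

def path_attention(sent_vecs: List[List[float]], path_vec: List[float],
                   a_diag: List[float]) -> List[float]:
    """Softmax attention alpha_{i,t} over sentences for one type,
    scoring s_i^T diag(A) p_t (a sketch under the assumptions above)."""
    scores = [sum(s[j] * a_diag[j] * path_vec[j] for j in range(len(s)))
              for s in sent_vecs]
    m = max(scores)                            # shift for numerical stability
    exps = [math.exp(x - m) for x in scores]
    z = sum(exps)
    return [e / z for e in exps]
```

Because "person" contributes to the path vectors of all its subtypes, the attention for an infrequent subtype like "politician" shares parameters with its parent, as described above.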
}, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-44", "text": "In this way, path-based attention enables the model to share parameters between types in the same path." }, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-45", "text": "For example, the attention learned for \"person\" could assist the learning of the attention for \"politician\"." }, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-46", "text": "It makes learning easier especially for infrequent subtypes, which suffer from dearth of training data, since the attentions for these subtypes can get support from the attention for parent type." }, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-47", "text": "Then, the representation of sentence set S e for type t, i.e., s e,t , is calculated through weighted sum of sentence vectors," }, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-48", "text": "Since one mention can have multiple types, we employ a classification layer consisting of N logistic classifiers, where N is the total number of types." }, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-49", "text": "Each classifier outputs the probability of respective type, i.e., P (t|s e,t ) = exp(w T t s e,t + b t ) 1 + exp(w T t s e,t + b t )" }, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-50", "text": ", where w t , b t \u2208 R d are the logistic regression parameters." }, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-51", "text": "To optimize the model, a multi-type loss is defined according to the cross entropy as follows, J = \u2212 e t [I t ln P (t|s e,t ) + (1 \u2212 I t ) ln(1 \u2212 P (t|s e,t ))], where I t is indicator function to indicate whether t is the annotated type of entity e, i.e., t \u2208 T e ." 
}, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-52", "text": "----------------------------------" }, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-53", "text": "**EXPERIMENTS AND CONCLUSION**" }, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-54", "text": "Experiments are carried on two widely used datasets OntoNotes and FIGER(GOLD), and the training dataset of OntoNotes is noisy compared to FIGER(GOLD) (Shimaoka et al. 2016) ." }, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-55", "text": "The statistics of the datasets are listed in Table1." }, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-56", "text": "We employ Strict Accuracy (Acc), Loose Macro F1 (Ma-F1), and Loose Micro F1 (Mi-F1) as evaluation measures following (Shimaoka et al. 2016 )." }, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-57", "text": "Specifically, \"Strict\" evaluates on the type set of each entity mention, while \"Loose\" on each type. \"Marco\" is the geometric average over all mentions, while \"Micro\" is the arithmetic average." }, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-58", "text": "The baselines are chosen from two aspects: (1) Predicting types in a unified process using raw noisy data, i.e., TLSTM (Shimaoka et al. 2016) , and other methods shown in Table2." }, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-59", "text": "(2) Predicting types using clean data by denoising ahead, i.e., H PLE and F PLE (Ren et al. 2016) ." }, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-60", "text": "To prove the superiority of path-based attention, we also directly apply the attention neural model in relation extraction (Lin et al. 2016) without using type hierarchy (AN)." }, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-61", "text": "The results of baselines are the best results reported in their papers." 
}, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-62", "text": "We can observed that: (1) When using the same raw noisy data, PAN outperforms all methods on both data sets, which proves the anti-noise ability of PAN." }, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-63", "text": "(2) PAN performs better than AN, since the attention learned in PAN utilizes the hierarchical structure to enable parameter sharing." }, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-64", "text": "(3) The improvements on OntoNotes are higher than FIGER(GOLD), because OntoNotes is more noisy, and the hierarchical structure in OntoNotes is more complex with more layers, which further demonstrates that path-based attention does well with type hierarchy, and proves the superiority of PAN in reducing noise." }, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-65", "text": "(4) PAN-A achieves better performance than PAN-M, which shows that addition operator can better capture type hierarchy." }, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-66", "text": "As shown in Table3, PAN using raw noisy data outperforms H PLE and F PLE using denoised data on Ma-F1 and Mi-F1." }, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-67", "text": "It makes sense that F PLE has higher Acc on OntoNotes since the noise is reduced before training, but it needs to learn additional parameters about mentions, context and types, while PAN only needs to learn parameters of attention." }, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-68", "text": "Thus, PAN is more efficient to reduce noise." }, { "sent_id": "5eb321d3c63642a4b148e1276eab20-C001-69", "text": "In conclusion, PAN can reduce noise effectively through an end-to-end process, and achieves better typing performance on datasets with more noise." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "5eb321d3c63642a4b148e1276eab20-C001-10" ] ], "cite_sentences": [ "5eb321d3c63642a4b148e1276eab20-C001-10" ] }, "@MOT@": { "gold_contexts": [ [ "5eb321d3c63642a4b148e1276eab20-C001-10", "5eb321d3c63642a4b148e1276eab20-C001-11" ] ], "cite_sentences": [ "5eb321d3c63642a4b148e1276eab20-C001-10" ] }, "@USE@": { "gold_contexts": [ [ "5eb321d3c63642a4b148e1276eab20-C001-33" ], [ "5eb321d3c63642a4b148e1276eab20-C001-56" ], [ "5eb321d3c63642a4b148e1276eab20-C001-58" ] ], "cite_sentences": [ "5eb321d3c63642a4b148e1276eab20-C001-33", "5eb321d3c63642a4b148e1276eab20-C001-56", "5eb321d3c63642a4b148e1276eab20-C001-58" ] }, "@DIF@": { "gold_contexts": [ [ "5eb321d3c63642a4b148e1276eab20-C001-54" ] ], "cite_sentences": [ "5eb321d3c63642a4b148e1276eab20-C001-54" ] } } }, "ABC_6f3d6ad1f09c55a1a006abbeddb4ab_35": { "x": [ { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-2", "text": "We introduce a set of 1,000 gold standard parse trees for the British National Corpus (BNC) and perform a series of self-training experiments with Charniak and Johnson's reranking parser and BNC sentences." }, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-3", "text": "We show that retraining this parser with a combination of one million BNC parse trees (produced by the same parser) and the original WSJ training data yields improvements of 0.4% on WSJ Section 23 and 1.7% on the new BNC gold standard set." 
}, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-4", "text": "----------------------------------" }, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-5", "text": "**INTRODUCTION**" }, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-6", "text": "Given the success of statistical parsing models on the Wall Street Journal (WSJ) section of the Penn Treebank (PTB) (Charniak, 2000; Collins, 2003, for example) , there has been a change in focus in recent years towards the problem of replicating this success on genres other than American financial news stories." }, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-7", "text": "The main challenge in solving the parser adaptation problem is the resources required to construct reliable annotated training examples." }, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-8", "text": "A breakthrough has come in the form of research by McClosky et al. (2006a; 2006b) who show that self-training can be used to improve parser performance when combined with a two-stage reranking parser model (Charniak and Johnson, 2005) ." }, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-9", "text": "Self-training is the process of training a parser on its own output, and earlier self-training experiments using generative statistical parsers did not yield encouraging results (Steedman et al., 2003) ." }, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-10", "text": "McClosky et al. (2006a; 2006b) proceed as follows: sentences (* Now affiliated to Lalic, Universit\u00e9 Paris 4 La Sorbonne.)" }, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-11", "text": "from the LA Times newspaper are parsed by a first-stage generative statistical parser trained on some seed training data (WSJ Sections 2-21) and the n-best parse trees produced by this parser are reranked by a discriminative reranker." 
}, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-12", "text": "The highest ranked parse trees are added to the training set of the parser and the parser is retrained." }, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-13", "text": "This self-training method gives improved performance, not only on Section 23 of the WSJ (an absolute f-score improvement of 0.8%), but also on test sentences from the Brown corpus (Francis and Ku\u010dera, 1979 ) (an absolute f-score improvement of 2.6%)." }, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-14", "text": "In the experiments of McClosky et al. (2006a; 2006b) , the parse trees used for self-training come from the same domain (American newspaper text) as the parser's original seed training material." }, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-15", "text": "Bacchiani et al. (2006) find that self-training is effective when the parse trees used for self-training (WSJ parse trees) come from a different domain to the seed training data and from the same domain as the test data (WSJ sentences)." }, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-16", "text": "They report a performance boost of 4.2% on WSJ Section 23 for a generative statistical parser trained on Brown seed data when it is self-trained using 200,000 WSJ parse trees." }, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-17", "text": "However, McClosky et al. (2006b) report a drop in performance for their reranking parser when the experiment is repeated in the opposite direction, i.e. with Brown data for self-training and testing, and WSJ data for seed training." }, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-18", "text": "In contrast, we report successful in-domain self-training experiments with the BNC data as self-training and test material, and with the WSJ-trained reranking parser used by McClosky et al. (2006a; 2006b) ." 
}, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-19", "text": "We parse the BNC (Burnard, 2000) in its entirety using the reranking parser of Charniak and Johnson (2005) ." }, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-20", "text": "1,000 BNC sentences are manually annotated for constituent structure, resulting in the first gold standard set for this corpus." }, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-21", "text": "The gold standard set is split into a development set of 500 parse trees and a test set of 500 parse trees and used in a series of self-training experiments: Charniak and Johnson's parser is retrained on combinations of WSJ treebank data and its own parses of BNC sentences." }, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-22", "text": "These combinations are tested on the BNC development set and Section 00 of the WSJ." }, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-23", "text": "An optimal combination is chosen which achieves a Parseval labelled bracketing f-score of 91.7% on Section 23 and 85.6% on the BNC gold standard test set." }, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-24", "text": "For Section 23 this is an absolute improvement of 0.4% on the baseline results of this parser, and for the BNC data this is a statistically significant improvement of 1.7%." }, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-25", "text": "----------------------------------" }, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-26", "text": "**THE BNC DATA**" }, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-27", "text": "The BNC is a 100-million-word balanced part-of-speech-tagged corpus of written and transcribed spoken English." }, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-28", "text": "Written text comprises 90% of the BNC: 75% non-fictional and 25% fictional." 
}, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-29", "text": "To facilitate parsing with a WSJ-trained parser, some reversible transformations were applied to the BNC data, e.g. British English spellings were converted to American English and neutral quotes disambiguated." }, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-30", "text": "The reranking parser of Charniak and Johnson (2005) was used to parse the BNC." }, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-31", "text": "99.8% of the 6 million BNC sentences obtained a parse, with an average parsing speed of 1.4s per sentence." }, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-32", "text": "A gold standard set of 1,000 BNC sentences was constructed by one annotator by correcting the output of the first stage of Charniak and Johnson's reranking parser." }, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-33", "text": "The sentences included in the gold standard were chosen at random from the BNC, subject to the condition that they contain a verb which does not occur in the training sections of the WSJ section of the PTB (Marcus et al., 1993) ." }, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-34", "text": "A decision was made to select sentences for the gold standard set which differ from the sentences in the WSJ training sections, and one way of finding different sentences is to focus on verbs which are not attested in the WSJ Sections 2-21." }, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-35", "text": "It is expected that these gold standard parse trees can be used as training data although they are used only as test and development data in this work." }, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-36", "text": "Because they contain verbs which do not occur in the parser's training set, they are likely to represent a hard test for WSJ-trained parsers." 
}, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-37", "text": "The PTB bracketing guidelines (Bies et al., 1995) and the PTB itself were used as references by the BNC annotator." }, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-38", "text": "Functional tags and traces were not annotated." }, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-39", "text": "The annotator noticed that the PTB parse trees sometimes violate the PTB bracketing guidelines, and in these cases, the annotator chose the analysis set out in the guidelines." }, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-40", "text": "It took approximately 60 hours to build the gold standard set." }, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-41", "text": "----------------------------------" }, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-42", "text": "**SELF-TRAINING EXPERIMENTS**" }, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-43", "text": "Charniak and Johnson's reranking parser (June 2006 version) is evaluated against the BNC gold standard development set." }, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-44", "text": "Labelled precision (LP), recall (LR) and f-score measures for this parser are shown in the first row of Table 1." }, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-45", "text": "The f-score of 83.7% is lower than the f-score of 85.2% reported by McClosky et al. (2006b) for the same parser on Brown corpus data." }, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-46", "text": "This difference is reasonable since there is greater domain variation between the WSJ and the BNC than between the WSJ and the Brown corpus, and all BNC gold standard sentences contain verbs not attested in WSJ Sections 2-21." }, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-47", "text": "We retrain the first-stage generative statistical parser of Charniak and Johnson using combinations of BNC trees (parsed using the reranking parser) and WSJ treebank trees." 
}, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-48", "text": "We test the combinations on the BNC gold standard development set and on WSJ Section 00." }, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-49", "text": "Table 1 shows that parser accuracy increases with the size of the in-domain self-training material." }, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-50", "text": "The figures confirm the claim of McClosky et al. (2006a) that self-training with a reranking parsing model is effective for improving parser accuracy in general, and the claim of Gildea (2001) that training on in-domain data is effective for parser adaptation." }, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-51", "text": "They confirm that self-training on in-domain data is effective for parser adaptation." }, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-52", "text": "The WSJ Section 00 results suggest that, in order to maintain performance on the seed training domain, it is necessary to combine BNC parse trees with WSJ training data. Of the self-training combinations with above-baseline improvements for both development sets, the combination of 1,000K BNC parse trees and Sections 2-21 of the WSJ (multiplied by ten) yields the highest improvement for the BNC data, and we present final results with this combination for the BNC gold standard test set and WSJ Section 23." }, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-53", "text": "There is an absolute improvement on the original reranking parser of 1.7% on the BNC gold standard test set and 0.4% on WSJ Section 23." }, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-54", "text": "The improvement on BNC data is statistically significant for both precision and recall (p < 0.0002, p < 0.0002)." }, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-55", "text": "The improvement on WSJ Section 23 is statistically significant for precision only (p < 0.003)." 
}, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-56", "text": "----------------------------------" }, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-57", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-58", "text": "We have introduced a set of 1,000 gold standard parse trees for the BNC." }, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-59", "text": "We have performed self-training experiments with Charniak and Johnson's reranking parser and sentences from the BNC." }, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-60", "text": "We have shown that retraining this parser with a combination of one million BNC parse trees (produced by the same parser) and the original WSJ training data yields improvements of 0.4% on WSJ Section 23 and 1.7% on the BNC gold standard sentences." }, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-61", "text": "These results indicate that self-training on in-domain data can be used for parser adaptation." }, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-62", "text": "Our BNC gold standard set consists of sentences containing verbs which are not in the WSJ training sections." }, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-63", "text": "We suspect that this makes the gold standard set a hard test for WSJ-trained parsers, and our results are likely to represent a lower bound for WSJ-trained parsers on BNC data." }, { "sent_id": "6f3d6ad1f09c55a1a006abbeddb4ab-C001-64", "text": "When used as training data, we predict that the novel verbs in the BNC gold standard set add to the variety of training material, and will further help parser adaptation from the WSJ domain -a matter for further research." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "6f3d6ad1f09c55a1a006abbeddb4ab-C001-8" ], [ "6f3d6ad1f09c55a1a006abbeddb4ab-C001-10", "6f3d6ad1f09c55a1a006abbeddb4ab-C001-11" ], [ "6f3d6ad1f09c55a1a006abbeddb4ab-C001-14" ], [ "6f3d6ad1f09c55a1a006abbeddb4ab-C001-17" ] ], "cite_sentences": [ "6f3d6ad1f09c55a1a006abbeddb4ab-C001-8", "6f3d6ad1f09c55a1a006abbeddb4ab-C001-10", "6f3d6ad1f09c55a1a006abbeddb4ab-C001-14", "6f3d6ad1f09c55a1a006abbeddb4ab-C001-17" ] }, "@USE@": { "gold_contexts": [ [ "6f3d6ad1f09c55a1a006abbeddb4ab-C001-18" ] ], "cite_sentences": [ "6f3d6ad1f09c55a1a006abbeddb4ab-C001-18" ] }, "@DIF@": { "gold_contexts": [ [ "6f3d6ad1f09c55a1a006abbeddb4ab-C001-18" ], [ "6f3d6ad1f09c55a1a006abbeddb4ab-C001-45" ] ], "cite_sentences": [ "6f3d6ad1f09c55a1a006abbeddb4ab-C001-18", "6f3d6ad1f09c55a1a006abbeddb4ab-C001-45" ] } } }, "ABC_0c2f7cea9f27b4799736fbcba48192_36": { "x": [ { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-2", "text": "Human language has evolved towards newer forms of communication such as social media, where emojis (i.e., ideograms bearing a visual meaning) play a key role." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-3", "text": "While there is an increasing body of work aimed at the computational modeling of emoji semantics, there is currently little understanding about what makes a computational model represent or predict a given emoji in a certain way." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-4", "text": "In this paper we propose a label-wise attention mechanism with which we attempt to better understand the nuances underlying emoji prediction." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-5", "text": "In addition to advantages in terms of interpretability, we show that our proposed architecture improves over standard baselines in emoji prediction, and does particularly well when predicting infrequent emojis." 
}, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-6", "text": "----------------------------------" }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-8", "text": "Communication in social media differs from more standard linguistic interactions across a wide range of dimensions." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-9", "text": "Immediacy, short text length, the use of pseudowords like #hashtags or @mentions, and even metadata such as user information or geolocalization are essential components of social media messages." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-10", "text": "In addition, the use of emojis, small ideograms depicting objects, people and scenes (Cappallo et al., 2015) , is becoming increasingly important for fully modeling the underlying semantics of a social media message, be it a product review, a tweet or an Instagram post." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-11", "text": "Emojis are the evolution of character-based emoticons (Pavalanathan and Eisenstein, 2015) , and are extensively used, not only as sentiment carriers or boosters, but more importantly, to express ideas about a myriad of topics, e.g., mood ( ), food ( ), sports ( ) or scenery ( )." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-12", "text": "Emoji modeling and prediction is, therefore, an important problem towards the end goal of properly capturing the intended meaning of a social media message." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-13", "text": "In fact, emoji prediction, i.e., given a (usually short) message, predict its most likely associated emoji(s), may help to improve different NLP tasks (Novak et al., 2015) , such as information retrieval, generation of emoji-enriched social media content or suggestion of emojis when writing text messages or sharing pictures online." 
}, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-14", "text": "It has furthermore proven to be useful for sentiment analysis, emotion recognition and irony detection (Felbo et al., 2017) ." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-15", "text": "The problem of emoji prediction, albeit recent, has already seen important developments." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-16", "text": "For example, Barbieri et al. (2017) describe an LSTM model which outperforms a logistic regression baseline based on word vector averaging, and even human judgement in some scenarios." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-17", "text": "The above contributions, in addition to emoji similarity datasets (Barbieri et al., 2016; Wijeratne et al., 2017) or emoji sentiment lexicons (Novak et al., 2015; Wijeratne et al., 2016; Kimura and Katsurai, 2017; Rodrigues et al., 2018) , have paved the way for better understanding the semantics of emojis." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-18", "text": "However, our understanding of what exactly the neural models for emoji prediction are capturing is currently very limited." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-19", "text": "What is a model prioritizing when associating a message with, for example, positive ( ), negative ( ) or patriotic ( ) intents?" }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-20", "text": "A natural way of assessing this would be to implement an attention mechanism over the hidden states of LSTM layers." 
}, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-21", "text": "Attentive architectures in NLP, in fact, have recently received substantial interest, mostly for sequence-to-sequence models (which are useful for machine translation, summarization or language modeling), and a myriad of modifications have been proposed, including additive (Bahdanau et al., 2015) , multiplicative (Luong et al., 2015) or self (Lin et al., 2017) attention. (Figure 1: A classic attention network (top), and our attentive label-wise network (bottom), with a specific attention module for each label.)" }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-22", "text": "In these mechanisms, attention weights indicate which elements of the input are important for the overall prediction distribution." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-23", "text": "While emoji prediction has predominantly been treated as a multi-class classification problem in the literature, it would be more informative to analyze which text fragments are considered important for each individual emoji." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-24", "text": "With this motivation in mind, in this paper we put forward a label-wise mechanism that operates over each label during training." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-25", "text": "The resulting architecture intuitively behaves like a batch of binary mini-classifiers, which make decisions over one single emoji at a time, but without the computational burden and risk of overfitting associated with learning separate LSTM-based classifiers for each emoji." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-26", "text": "Our contribution in this paper is twofold." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-27", "text": "First, we use the proposed label-wise mechanism to analyze the behavior of neural emoji classifiers, exploiting the attention weights to uncover and interpret emoji usages." 
}, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-28", "text": "Second, we experimentally compare the effect of the label-wise mechanism on the performance of an emoji classifier." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-29", "text": "We observed a performance improvement over competitive baselines such as FastText (FT) (Joulin et al., 2017) and Deepmoji (Felbo et al., 2017) , which is most noticeable in the case of infrequent emojis." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-30", "text": "This suggests that an attentive mechanism can be leveraged to make neural architectures more sensitive to instances of underrepresented classes." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-31", "text": "----------------------------------" }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-32", "text": "**METHODOLOGY**" }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-33", "text": "Our base architecture is the Deepmoji model (Felbo et al., 2017) , which is based on two stacked word-based bi-directional LSTM recurrent neural networks with skip connections between the first and the second LSTM." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-34", "text": "The model also includes an attention module to increase its sensitivity to individual words during prediction." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-35", "text": "In general, attention mechanisms allow the model to focus on specific words of the input (Yang et al., 2016) , instead of having to memorize all the important features in a fixed-length vector." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-36", "text": "The main architectural difference with respect to the typical attention is illustrated in Figure 1 ." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-37", "text": "In Felbo et al. 
(2017), attention is computed as follows: z_i = w_a \u00b7 h_i + b_a, \u03b1_i = exp(z_i) / \u2211_j exp(z_j), s = \u2211_i \u03b1_i h_i." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-38", "text": "Here h_i \u2208 R^d is the hidden representation of the LSTM corresponding to the i-th word, with N the total number of words in the sentence." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-39", "text": "The weight vector w_a \u2208 R^d and bias term b_a \u2208 R map this hidden representation to a value that reflects the importance of this state for the considered classification problem." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-40", "text": "The values z_1, ..., z_N are then normalized using a softmax function, yielding the attention weights \u03b1_i." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-41", "text": "The sentence representation s is defined as a weighted average of the vectors h_i." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-42", "text": "The final prediction distribution is then defined as follows: \u03b2_l = w_{f,l} \u00b7 s + b_{f,l}." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-43", "text": "where w_{f,l} \u2208 R^d and b_{f,l} define a label-specific linear transformation, with \u03b2_l reflecting our confidence in the l-th label, and L the total number of labels." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-44", "text": "The confidence scores \u03b2_l are then normalized to probabilities using another softmax operation." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-45", "text": "However, while the above design has contributed to better emoji prediction, in our case we are interested in understanding the contribution of the words of a sentence for each label (i.e., emoji), and not in the whole distribution of the target labels." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-46", "text": "To this end, we propose a label-wise attention mechanism." 
}, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-47", "text": "Specifically, we apply the same type of attention, but repeating it |L| (number of labels) times, where each attention module is reserved for a specific label l:" }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-48", "text": "----------------------------------" }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-49", "text": "**EVALUATION**" }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-50", "text": "This section describes the main experiment w.r.t. the performance of our proposed attention mechanism, in comparison with existing emoji prediction systems." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-51", "text": "We use the data made available in the context of the SemEval 2018 Shared Task on Emoji Prediction (Barbieri et al., 2018) ." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-52", "text": "Given a tweet, the task consists of predicting an associated emoji from a predefined set of 20 emoji labels." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-53", "text": "We evaluate our model on the English split of the official task dataset." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-54", "text": "We also show results from additional experiments in which the label space ranged from 20 to 200 emojis." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-55", "text": "These extended experiments are performed on a corpus of around 100M tweets geolocalized in the United States and posted between October 2015 and May 2018." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-56", "text": "Models." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-57", "text": "In order to put our proposed label-wise attention mechanism in context, we compare its performance with a set of baselines, including FastText (Joulin et al., 2017) and Deepmoji (Felbo et al., 2017) ." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-58", "text": "Finally, we denote as 2-BiLSTMs_l our proposed label-wise attentive Bi-LSTM architecture." 
}, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-59", "text": "Results." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-60", "text": "Table 1 shows the results of our model and the baselines in the emoji prediction task for the different evaluation splits." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-61", "text": "The evaluation metrics used are: F1, Accuracy@k (A@k, where k \u2208 {1, 5}), and Coverage Error (CE) (Tsoumakas et al., 2009) ." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-62", "text": "We note that the latter metric is not normally used in emoji prediction settings." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-63", "text": "However, with many emojis being \"near synonyms\" (in the sense of being often used almost interchangeably), it seems natural to evaluate the performance of an emoji prediction system in terms of how far we would need to go through the predicted emojis to recover the true label." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-64", "text": "The results show that our proposed 2-BiLSTMs_l method outperforms all baselines for F1 in three out of four settings, and for CE in all of them." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-65", "text": "In the following section we shed light on the reasons behind this performance, and we try to understand how these predictions were made." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-66", "text": "----------------------------------" }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-67", "text": "**ANALYSIS**" }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-68", "text": "By inspecting the predictions of our model, we found that the label-wise attention mechanism tends to be less heavily biased towards the most frequent emojis." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-69", "text": "This is reflected in the lower coverage error results in all settings, and becomes more noticeable as the number of labels grows." 
}, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-70", "text": "We verified this by computing the average difference between ranked predictions of the two attentive models in the 200-label setting (Figure 2) ." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-71", "text": "We can observe a sudden switch at more or less the median emoji, after which the label-wise attention model becomes increasingly accurate (relative to the standard attention model)." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-72", "text": "This can be explained by the fact that infrequent emojis tend to be more situational (used in specific contexts and leaving less room for ambiguity or interchangeability), which the label-wise attention mechanism can take advantage of, as it explicitly links emojis with highly informative words." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-73", "text": "Let us illustrate this claim with a case in which the label-wise attention model predicts the correct emoji, unlike its single-attention counterpart: a friendship is built over time, but sisterhood is given automatically." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-74", "text": "Gold:" }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-75", "text": "For the above example, the predictions of the single attention model were all linked to the general meaning of the message, that is love and friendship, leading it to predict associated emojis ( , and ), failing to capture the most relevant bit of information." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-76", "text": "On the other hand, our proposed model \"picks up on\" the word sisterhood. (The highlights show the \u03b1_l attention weights of .)" }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-77", "text": "Let us explore what we argue are interesting cases of emoji usage (ranging from highly explicit to figurative or situational intent)." 
}, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-78", "text": "Figure 3 shows how the word (praying) and emojis such as and are strongly correlated." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-79", "text": "In addition, the bond between the word snow and the emoji is also indisputable." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-80", "text": "However, a perhaps more surprising example is displayed in Figure 4 , which is a negative example." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-81", "text": "Here, the emoji was predicted with rank 1, and we see it being strongly associated with the ordinal second, suggesting that the model assumed this was some kind of \"ticked enumeration\" of completed tasks, which is indeed regular practice in Twitter." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-82", "text": "Finally, we found it remarkable that the ambiguous nature of the word boarding is also reflected in two different emojis being predicted with high probability ( and ), each of them showcasing one of the word's senses." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-83", "text": "As an additional exploratory analysis, we computed statistics on those words with the highest average attention weights associated with one single emoji." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-84", "text": "One interesting example is the emoji, which shows two clear usage patterns: one literal (a tree) and one figurative (christmas and holidays)." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-85", "text": "Finally, as a final (and perhaps thoughtprovoking) finding, the highest attention weights associated to the emoji were given to the words game, boys and football, in that order." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-86", "text": "In other words, the model relies more on the word boys than on the actual description of the emoji." 
}, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-87", "text": "This is in line with a previous study that showed how the current usage of emojis on Twitter is in some cases associated with gender stereotypes (Barbieri and Camacho-Collados, 2018) ." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-88", "text": "----------------------------------" }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-89", "text": "**CONCLUSION**" }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-90", "text": "In this paper we have presented a neural architecture for emoji prediction based on a label-wise attention mechanism, which, in addition to improving performance, provides a degree of interpretability about how different features are used for predictions, a topic of increasing interest in NLP (Linzen et al., 2016; Palangi et al., 2017) ." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-91", "text": "As we experimented with sets of emoji labels of different sizes, our proposed label-wise attention architecture proved especially well-suited for emojis which were infrequent in the training data, making the system less biased towards the most frequent." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-92", "text": "We see this as a first step to improve the robustness of recurrent neural networks in datasets with unbalanced distributions, as they were shown not to perform better than well-tuned SVMs on the emoji prediction task (\u00c7\u00f6ltekin and Rama, 2018) ." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-93", "text": "As for future work, we plan to apply our label-wise attention mechanism to understand other interesting linguistic properties of human-generated text in social media, and other multi-class or multi-label classification problems." 
}, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-94", "text": "Finally, code to reproduce our experiments and additional examples of label-wise attention weights from input tweets can be downloaded at https://fvancesco.github." }, { "sent_id": "0c2f7cea9f27b4799736fbcba48192-C001-95", "text": "io/label_wise_attention/." } ], "y": { "@BACK@": { "gold_contexts": [ [ "0c2f7cea9f27b4799736fbcba48192-C001-13", "0c2f7cea9f27b4799736fbcba48192-C001-14" ], [ "0c2f7cea9f27b4799736fbcba48192-C001-37", "0c2f7cea9f27b4799736fbcba48192-C001-38" ] ], "cite_sentences": [ "0c2f7cea9f27b4799736fbcba48192-C001-14", "0c2f7cea9f27b4799736fbcba48192-C001-37" ] }, "@DIF@": { "gold_contexts": [ [ "0c2f7cea9f27b4799736fbcba48192-C001-29" ] ], "cite_sentences": [ "0c2f7cea9f27b4799736fbcba48192-C001-29" ] }, "@USE@": { "gold_contexts": [ [ "0c2f7cea9f27b4799736fbcba48192-C001-33" ], [ "0c2f7cea9f27b4799736fbcba48192-C001-36", "0c2f7cea9f27b4799736fbcba48192-C001-37", "0c2f7cea9f27b4799736fbcba48192-C001-38" ], [ "0c2f7cea9f27b4799736fbcba48192-C001-57" ] ], "cite_sentences": [ "0c2f7cea9f27b4799736fbcba48192-C001-33", "0c2f7cea9f27b4799736fbcba48192-C001-37", "0c2f7cea9f27b4799736fbcba48192-C001-57" ] } } }, "ABC_4625b603beb2308b8b306dd4c01823_36": { "x": [ { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-2", "text": "We describe UBC-NLP contribution to IEST-2018, focused at learning implicit emotion in Twitter data." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-3", "text": "Among the 30 participating teams, our system ranked the 4th (with 69.3% F-score)." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-4", "text": "Post competition, we were able to score slightly higher than the 3rd ranking system (reaching 70.7%)." 
}, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-5", "text": "Our system is trained on top of a pre-trained language model (LM), fine-tuned on the data provided by the task organizers." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-6", "text": "Our best results are acquired by an average of an ensemble of language models." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-7", "text": "We also offer an analysis of system performance and the impact of training data size on the task." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-8", "text": "For example, we show that training our best model for only one epoch with < 40% of the data enables better performance than the baseline reported by Klinger et al. (2018) for the task." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-9", "text": "----------------------------------" }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-11", "text": "Emotion is essential in human experience and communication, lending special significance to natural language processing systems aimed at learning it." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-12", "text": "Emotion detection systems can be applied in a host of domains, including health and well-being, user profiling, education, and marketing." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-13", "text": "There is a small, yet growing, body of NLP literature on emotion." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-14", "text": "Early works focused on creating and manually labeling datasets." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-15", "text": "The SemEval 2007 Affective Text task Strapparava and Mihalcea (2007) and Aman and Szpakowicz (2007) are two examples that target the news and blog domains respectively." 
}, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-16", "text": "In these works, data were labeled for the 6 basic emotions of Ekman (Ekman, 1972) ." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-17", "text": "More recent works exploit distant supervision (Mintz et al., 2009 ) to automatically acquire emotion data for training systems." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-18", "text": "More specifically, a number of works use hashtags like #happy and #sad, especially occurring finally in Twitter data, as a proxy of emotion (Wang et al., 2012; Mohammad and Kiritchenko, 2015; Volkova and Bachrach, 2016) ." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-19", "text": "Abdul-Mageed and Ungar (2017) report state-of-the-art results using a large dataset acquired with hashtags." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-20", "text": "Other works exploit emojis to capture emotion carrying data (Felbo et al., 2017) ." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-21", "text": "Alhuzali et al. (2018) introduce a third effective approach that leverages firstperson seed phrases like \"I'm happy that\" to collect emotion data." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-22", "text": "Klinger et al. (2018) propose yet a fourth method for collecting emotion data that depends on the existence of the expression \"emotion-word + one of the following words (when, that or because)\" in a tweet, regardless of the position of the emotion word." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-23", "text": "In the \"Implicit Emotion\" shared task 1 , participants were provided data representing the 6 emotions in the set (anger, disgust, fear, joy, sad, surprise)." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-24", "text": "The trigger word was removed from each tweet." 
}, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-25", "text": "To illustrate, the task is to predict the emotion in a tweet like \"Boys who like Starbucks make me [#TRIGGERWORD#] because we can go on cute coffee dates\" (with the triggered word labeled as joy)." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-26", "text": "In this paper, we describe our system submitted as part of the competition." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-27", "text": "Overall, our submission ranked the 4th out of the 30 participating teams." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-28", "text": "With further experiments, we were able to acquire better results, which would rank our model at top 3 (70.7% F-score)." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-29", "text": "The rest of the paper is organized as follows: Section 2 describes the data." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-30", "text": "Section 3 offers a description of the methods employed in our work." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-31", "text": "Section 3 is where we present our results, and we perform an analysis of these results in Section 5." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-32", "text": "We list negative experiments in Section 6 and conclude in Section 7." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-33", "text": "----------------------------------" }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-34", "text": "**DATA**" }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-35", "text": "We use the Twitter dataset released by the organizers of the \"Implicit Emotion\" task, as described in the previous section." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-36", "text": "The data are partitioned into 153, 383 tweets for training, 9591 tweets for validation, and 28, 757 data points for testing." 
}, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-37", "text": "The training and validation sets were provided early for system development, while the test set was released one week before the deadline of system submission." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-38", "text": "The full details of the dataset can be found in Klinger et al. (2018) ." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-39", "text": "We now describe our methods in the nesxt section." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-40", "text": "----------------------------------" }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-41", "text": "**METHODS**" }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-42", "text": "----------------------------------" }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-43", "text": "**PRE-PROCESSING**" }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-44", "text": "We adopt a simple pre-processing scheme, similar to most of the pre-trained models we employ." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-45", "text": "This involves lowercasing all text and filtering out urls and user mentions." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-46", "text": "We also split clusters of emojis into individual emojis, following Duppada et al. (2018) ." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-47", "text": "For our vocabulary V, we retain the top 100k words and then remove all words occurring < 2 times, which leaves |V | = 23, 656." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-48", "text": "----------------------------------" }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-49", "text": "**MODELS**" }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-50", "text": "We develop a host of models based on deep neural networks, using some of these models as our baseline models." 
}, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-51", "text": "As an additional baseline, we compare to Klinger et al. (2018) who propose a model based on Logistic Regression with a bag of word unigrams (BOW)." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-52", "text": "All our deep learning models are based on variations of recurrent neural networks (RNNs), which have proved useful for several NLP tasks." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-53", "text": "RNNs are able to capture sequential dependencies especially in time series data, of which language can be seen as an example." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-54", "text": "One weakness of RNNs, however, lies in gradient either vanishing or exploding during training." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-55", "text": "Longshort term memory (LSTM) networks were developed to target this problem, and hence we employ these in our work." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-56", "text": "We also use a bidirectional version (BiLSTM) where the vector of representation is built as a concatenation of two vectors, one that runs from left-to-right and another running from right-to-left." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-57", "text": "Ultimately, we generate a fixed-size representation for a given tweet using the last hidden state for the Fwd and Bwd LSTM." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-58", "text": "Our systems can be categorized as follows: (1) Systems tuning simple pre-trained embeddings, (2) Systems tuning embeddings from language models, and (3) Systems directly tuning language models." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-59", "text": "We treat #1 and #2 as baseline systems, while our best models are based on #3." 
}, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-60", "text": "----------------------------------" }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-61", "text": "**SYSTEMS WITH SIMPLE EMBEDDINGS**" }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-62", "text": "Character and/or Word embeddings (Mikolov et al., 2013; Pennington et al., 2014; Bojanowski et al., 2016) have boosted performance on a host of NLP tasks." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-63", "text": "Most state of the art systems now finetune these embeddings as a simple transfer learning technique targeting the first layer of a network (McCann et al., 2017) ." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-64", "text": "We make use of one such pre-trained embeddings (fastText) to identify the utility of tuning its learned weights on the task." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-65", "text": "FastText: The first embedding model is fastText 2 (Bojanowski et al., 2016) , which builds representations based on characters, rather than only words, thus alleviating issues of complex morphology characetrestic of many languages like Arabic, Hebrew, and Swedish, but also enhancing representations for languages of simpler morphology like English." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-66", "text": "Additionally, fastText partially solves issues with out-of-vocabulary words since it exploits character sequences." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-67", "text": "FastText is trained on the Common Crawl dataset, consisting of 600B tokens." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-68", "text": "For this and the next set of experiments (i.e., experiments in 3.2.2), we train both an LSTM and BiLSTM." 
}, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-69", "text": "Since we treat these as baseline systems, especially with our goal to report our experiments in available space for the competition, we try a small set of hyper-parameters, identifying best settings on our validation set." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-70", "text": "We train each network for 4 epochs each." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-71", "text": "For optimization, we use Adam (Kingma and Ba, 2014) ." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-72", "text": "The model's weights W are initialized from a normal distribution W \u223c N with a small standard deviation of \u03c3 = 0.05." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-73", "text": "We apply two sources of regularization: dropout: we apply a dropout rate of 0.5 on the input embeddings to prevent co-adaptation of hidden units' activation, and L2 \u2212 norm: we also apply an L2-norm regularization with a small value (0.0001) on the hidden units layer to prevent the network from over-fitting on training set." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-74", "text": "Each of the networks has a single hidden layer." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-75", "text": "Network architectures and hyper-parameters are listed in Table 1 ." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-76", "text": "----------------------------------" }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-77", "text": "**EMBEDDING FROM LMS**" }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-78", "text": "Peters et al. (2018) language models, which they refer to as ELMo." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-79", "text": "ELMo is shown to capture both complex characteristics of words (as syntax and semantics) as well as the usage of these words across various linguistic contexts, thanks to its language modeling component." 
}, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-80", "text": "ELMo is trained on a dataset of Wikipedia and is publicly available 3 , which we use as our input layer." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-81", "text": "More specifically, we extract the weighted sum of the 3 layers (word embedding, Bi-lstm-outputs1, and Bi-lstm-outputs2) and follow the same network architectures and hyper-parameters employed with fastText as we explain before." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-82", "text": "----------------------------------" }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-83", "text": "**FINE-TUNING LMS: ULMFIT**" }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-84", "text": "Another recent improvement in training NLP systems is related to the way these systems are finetuned, especially vis-a-vis how different layers in the network operate during training time." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-85", "text": "Howard and Ruder (2018) present ULMFiT 4 , an example such systems that is pre-trained on a language model exploiting Wikitext-103." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-86", "text": "Ultimately, ULMFiT employs a number of techniques for training." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-87", "text": "These include \"gradual unfreezing\", which aims at fine-tuning each layer of the network independently and then fine-tuning all layers together." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-88", "text": "Gradual unfreezing proves useful for reducing the risk of overfitting as also found in Felbo et al. (2017) ." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-89", "text": "ULMFiT also uses \"discriminative fine-tuning\", which tunes each layer with different learning rates, the idea being different layers capture different types of information (Howard and Ruder, 2018; Peters et al., 2018) ." 
}, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-90", "text": "Howard and Ruder (2018) also use different learning rates, which they refer to as \"slanted triangular learning rates\", at different times of the training process." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-91", "text": "With ULMFiT, we experiment with different variations of LMs 5 : forward (Fwd), backward (Bwd), and an average of these (BiLM (Fwd+Bwd))." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-92", "text": "We follow Howard and Ruder (2018) in training each of the Fwd and Bwd models independently on the training data provided by the task organizers, and then combining their predictions using an ensemble averaging." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-93", "text": "This is the setting we refer to as BiLM." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-94", "text": "As we show in Section 3, this is a beneficial measure (similar to what Howard and Ruder (2018) also found)." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-95", "text": "For our hyper-parameters for this iteration of experiments, we identify them on our validation set." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-96", "text": "We list the network architectures and hyper-parameters for this set of experiments in Table 2 ." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-97", "text": "Table 2 : Hyper-parameters for our submitted system exploiting fine-tuned language models from Howard and Ruder (2018) ." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-98", "text": "Table 3 shows results of all our models in F-score." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-99", "text": "As the Table shows, all our models achieve sizable gains over the logistic regression model introduced by (Klinger et al., 2018) as a baseline for the competition (F-score = 60%)." 
}, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-100", "text": "Even though our models trained based on fastText and ELMo each has a single hidden layer, which is not that deep, these at least 1.5% higher than the logistic regression model." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-101", "text": "We also observe that ELMo embeddings, which are acquired from language models rather than optimized from sequences of tokens, achieves higher performance than FastText embeddings." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-102", "text": "This is not surprising, and aligns with the results reported by Peters et al. (2018) ." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-103", "text": "----------------------------------" }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-104", "text": "**RESULTS**" }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-105", "text": "For results with ULMFiT, as Table 3 shows, it acquires gains over all the other models." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-106", "text": "As mentioned earlier, we experiment with different variations of LMs (Fwd, Bwd, and BiLM)." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-107", "text": "Results in our submitted system are based on the Fwd model, and are at 69.4%." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-108", "text": "After system submission, we also experimented with the Bwd and BiLM models and were able to acquire even higher gains, putting our best performance at 70.7% (which moves us to the top 3 position)." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-109", "text": "Using predictions from our best model (as described in Table 2 ), we investigate the extent with which each emotion is mislabeled and the categories with which it is confused." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-110", "text": "Figure 1 shows the confusion matrix of this analysis." 
}, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-111", "text": "As the Figure shows, anger is predicted with least F-score (% = 63), followed by sadness (% = 66)." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-112", "text": "Figure 1 also shows that anger is most confused for surprise and sadness is most confused for anger." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-113", "text": "Additionally, disgust is the third most confused category (% = 66), and is mislabeled as surprise 9% of the time." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-114", "text": "These results suggest overlap in the ways each of the emotions is expressed in the training data." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-115", "text": "To further investigate these observations, we measure the shared vocabulary between the different classes." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-116", "text": "Figure 2 shows percentages of lexical overlap in the data, and does confirm that some categories share unigram tokens to varying degrees." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-117", "text": "Lexical overlap between classes seem to align with the error matrix in Figure 1 ." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-118", "text": "For example, anger overlaps most with surprise (% = 9) and sadness overlaps most with anger (% = 10)." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-119", "text": "These findings are not surprising, since our models are based on lexical input and do not involve other types of information (e.g., POS tags)." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-120", "text": "Table 4 offers examples of overlap in the form of lexical sequences between test data and training data across a number of classes." 
}, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-121", "text": "----------------------------------" }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-122", "text": "**SIZE OF TRAINING DATA**" }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-123", "text": "We also investigate the impact of training data size on model accuracy." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-124", "text": "For this purpose, we train models with different data sizes with the best parameter settings shown in Table 2 6 ." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-125", "text": "Figure 3 shows the impact of different percentages of training examples on model performance." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-126", "text": "We test model performance for this analysis on our validation data." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-127", "text": "Interestingly, as Figure 3 shows, the model exceeds the baseline model reported by the task organizers (Klinger et al., 2018) when trained on only 10% of the training data." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-128", "text": "Additionally, the model outperforms the fastText and ELMo models by only seeing 40% of the training data." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-129", "text": "Once the model has access to 80% of the training data, its gains start to increase relatively slowly." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-130", "text": "In addition to the positive, yet unsurprising, impact that training data size has on performance, the results also reflect the utility of employing the pre-trained language model." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-131", "text": "In order to further inspect this observation regarding the impact of language modeling, we use the same architecture reported in Table 2 to train a classifier that does not have access to the pretrained LM." 
}, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-132", "text": "We find the classifier to achieve only 63.8 F-score." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-133", "text": "Again, this demonstrates the advantage of using the pra-trained LM." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-134", "text": "----------------------------------" }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-135", "text": "**NEGATIVE EXPERIMENTS**" }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-136", "text": "We performed a number of negative experiments that we report briefly here." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-137", "text": "Our intuition is that training our models with Twitter-specific data should help classification." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-138", "text": "For this reason, we trained ULMFiT with 4.5 million tweets with the same settings reported in Table 2 ." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-139", "text": "We did not find this set of experiments to yield gains over the results reported in Table 3 , however." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-140", "text": "For example, an Fwd LM trained on Twitter domain data yields 67.9% F-score, which is 1.4% less than the Fscore obtained by the Wikipedia-trained Fwd LM in 3." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-141", "text": "The loss might be due to the smaller size of the Twitter data we train on, as compared to the Wikipedia data the ULMFiT is originally trained on (i.e., > 103 million tokens)." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-142", "text": "----------------------------------" }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-143", "text": "**CONCLUSION**" }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-144", "text": "In this paper, we described our system submitted to IEST-2018 task, focused on learning implicit emotion from Twitter data." 
}, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-145", "text": "We explored the utility of tuning different word-and character-level pre-trained representations and language modeling methods to minimize training loss." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-146", "text": "We found that the method introduced by Howard and Ruder (2018) yields best performance on the task." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-147", "text": "We note that our baselines employing sub-word embeddings (fastText) and embeddings from language models (ELMo) can be improved by using deeper neural architectures with larger model capacity, which we cast for future work." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-148", "text": "We have also shown that the classifier confuses certain emotion classes with one another, possible due to overlap of lexical sequences between training and test data." }, { "sent_id": "4625b603beb2308b8b306dd4c01823-C001-149", "text": "This reflects the difficulty of the task." } ], "y": { "@DIF@": { "gold_contexts": [ [ "4625b603beb2308b8b306dd4c01823-C001-8" ], [ "4625b603beb2308b8b306dd4c01823-C001-99" ], [ "4625b603beb2308b8b306dd4c01823-C001-127" ] ], "cite_sentences": [ "4625b603beb2308b8b306dd4c01823-C001-8", "4625b603beb2308b8b306dd4c01823-C001-99", "4625b603beb2308b8b306dd4c01823-C001-127" ] }, "@BACK@": { "gold_contexts": [ [ "4625b603beb2308b8b306dd4c01823-C001-22" ], [ "4625b603beb2308b8b306dd4c01823-C001-38" ] ], "cite_sentences": [ "4625b603beb2308b8b306dd4c01823-C001-22", "4625b603beb2308b8b306dd4c01823-C001-38" ] }, "@USE@": { "gold_contexts": [ [ "4625b603beb2308b8b306dd4c01823-C001-51" ] ], "cite_sentences": [ "4625b603beb2308b8b306dd4c01823-C001-51" ] } } }, "ABC_b2a6ec11403fe73b9bae7742c1c5a2_36": { "x": [ { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-2", "text": "Abstract." 
}, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-3", "text": "We present a method for identifying editor roles from students' revision behaviors during argumentative writing." }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-4", "text": "We first develop a method for applying a topic modeling algorithm to identify a set of editor roles from a vocabulary capturing three aspects of student revision behaviors: operation, purpose, and position." }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-5", "text": "We validate the identified roles by showing that modeling the editor roles that students take when revising a paper not only accounts for the variance in revision purposes in our data, but also relates to writing improvement." }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-6", "text": "----------------------------------" }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-8", "text": "Knowing that experienced and successful writers revise differently than inexperienced writers [4] , various intelligent writing tools have been developed that provide localized feedback on text characteristics [5, 3, 6, 9] ." }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-9", "text": "These tools typically suggest edits to guide revision, rather than model the editing process after observing revisions." }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-10", "text": "With the long term goal of developing an intelligent revision assistant, this paper presents an approach to modeling student editor roles." }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-11", "text": "Prior natural language processing (NLP) approaches to student revision analysis have focused on identifying revisions during argumentative writing and classifying their purposes and other properties [12, 11, 7, 1] ." 
}, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-12", "text": "In contrast, editor roles have generally been studied in NLP using online collaborative writing applications such as Wikipedia [10] ." }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-13", "text": "Inspired by the use of Wikipedia revision histories [10] , in this paper we similarly use topic modeling applied to revision histories to identify editor roles in the domain of student argumentative writing." }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-14", "text": "To model student revision histories, between-draft essay revisions are extracted at a sentence-level and represented in terms of the following three aspects: operation (add, delete, or modify a sentence), purpose (e.g., correct grammar versus improve fluency), and position (revise at the beginning, middle or the end of an essay)." }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-15", "text": "To identify editor roles, a Latent Dirichlet Allocation (LDA) [2] graphical model is then applied to these revision histories." }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-16", "text": "Finally, we show that the identified roles capture the variability in our data as well as correlate with writing improvement." }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-17", "text": "----------------------------------" }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-18", "text": "**ORIGINAL DRAFT**" }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-19", "text": "Revised Draft Operation Purpose Position Self-driving vehicles pose many advantages and disadvantages." }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-20", "text": "While self-driving vehicles pose many advantages and disadvantages, I am not on the bandwagon for them at this time." 
}, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-21", "text": "----------------------------------" }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-22", "text": "**MODIFY**" }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-23", "text": "Claim Beg." }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-24", "text": "----------------------------------" }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-25", "text": "**CORPORA**" }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-46", "text": "The final topics are shown in Table 2 , labeled by us with the best-matching editor role from the anticipated set of potential roles, based on the vocabulary items in each topic." }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-26", "text": "Our work takes advantage of several corpora of multiple drafts of argumentative essays written by both high-school and college students [12, 11] , where all data has been annotated for revision using the framework of [12] ." }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-27", "text": "We divide our data into a Modeling Corpus (185 paired drafts, 3245 revisions) and an Evaluation Corpus (107 paired draft, 2045 revisions), based on whether expert grades are available before (Score1) and after (Score2) essay revision." }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-28", "text": "Although the grading rubrics for the college and high-school essays in the Evaluation Corpus are different, both are based upon common criteria of argumentative writing, e.g., clear thesis, convincing evidence, clear wording without grammatical errors, etc." }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-29", "text": "We apply linear scaling 1 to bring the scores within the same range of [0,100]." }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-30", "text": "After scaling, the average Score1 and Score2 are 64.41 and 73.59, respectively." 
}, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-31", "text": "For all essays and prior to this study, subsequent drafts were manually aligned at the sentence-level based on semantic similarity." }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-32", "text": "Nonidentical aligned sentences were extracted as the revisions, resulting in three types of revision operations -Add, Delete, M odif y. Each extracted revision was manually annotated with a purpose following the revision schema shown in Figure 1 (modified compared to [12] by adding the Precision category)." }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-33", "text": "For this study, each revision's position was in addition automatically tagged using its paragraph position in the revised essay." }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-34", "text": "To maintain consistency across essays, instead of using paragraph number, we identify whether a revision is in the first (beg), last (end), or a middle (mid) paragraph." }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-35", "text": "Table 1 shows a modified claim at the beginning of an essay from the Modeling Corpus." }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-36", "text": "----------------------------------" }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-37", "text": "**IDENTIFYING EDITOR ROLES**" }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-38", "text": "To create a vocabulary for topic modeling and to understand the repeating patterns of student editors, we represent each revision utilizing the three aspects" }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-39", "text": "-General mid +Claims beg Organization mid +General beg described earlier: operation, purpose, and position." }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-40", "text": "This yields a rich and informative vocabulary for modeling our data, consisting of 63 revision \"words\" (54 content, 9 surface)." 
}, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-41", "text": "This is in contrast to the 24 word revision vocabulary used in the prior Wikipedia editor role extraction method [10] , formed using a Wiki-specific revision taxonomy of operation and purpose." }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-42", "text": "When describing our revision \"words\", add and delete revisions are represented with '+' and '\u2212' sign, and no sign for modification, e.g., Claim beg in Table 1 . Editors are then represented by their history of revisions in terms of this revision vocabulary." }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-43", "text": "We trained the LDA model on the Modeling Corpus and experimented with 2 to 10 topics." }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-44", "text": "After an extensive evaluation for topic interpretation based on top 10 revisions under each topic, we ended up with 5 topics where the revisions under each topic intuitively correspond to one of a set of potentially relevant editor roles for academic writing." }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-45", "text": "We drew upon roles previously identified for writing domains such as newspaper editing (e.g., proofreader, copy editor), Wikipedia (e.g., technical editor, substantive expert), and academic writing 2 (i.e., descriptive, analytical, persuasive, and critical)." }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-47", "text": "The defining characteristic of a Proofreader are surfacelevel error corrections." }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-48", "text": "Copy editors ensure that the article is clear and concise as they revise for word-usage, clarity, and organization." }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-49", "text": "Descriptive editors provide details and enhance clarity, with widespread development of general content." 
}, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-50", "text": "Analytical editors revise by adding information and better organizing thoughts, with top revision purposes being word-usage, content, reasoning, and rebuttal." }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-51", "text": "Persuasive editors discuss ideas and facts with relevant examples and develop arguments with added information." }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-52", "text": "----------------------------------" }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-53", "text": "**VALIDATING EDITOR ROLES**" }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-54", "text": "Using the trained topic model, we first calculate the probability of an editor belonging to each of the 5 roles, for each editor in the Evaluation Corpus." }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-55", "text": "These probabilities represent each role's contribution to the essay revision." }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-56", "text": "Motivated Table 4 ." }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-57", "text": "Partial correlations between role probabilities and Score2 controlling Score1." }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-58", "text": "by Wikipedia role validation [10] , we first validate our editor roles by similarly using editor roles to explain the variance in revision purposes." }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-59", "text": "We create 8 linear regression models, one for each revision purpose 3 ." }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-60", "text": "The models take as input a five dimensional vector indicating an editor's contribution to each role and the output is the editor's edit frequency for each revision purpose." 
}, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-61", "text": "The R-squared values in Table 3 show that our topic model can best explain the variance of Grammar, Word-Usage, General content, Claim, Reasoning, and Evidence edits." }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-62", "text": "----------------------------------" }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-63", "text": "**EDITOR ROLES**" }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-64", "text": "A corpus study in [12] showed that content changes are correlated with argumentative writing improvement, reaffirming the statement of [4] ." }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-65", "text": "Using a similar method, we investigate if our editor roles are related to writing improvement." }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-66", "text": "We calculate partial Pearson correlations between editor roles and Score2 while controlling for Score1 to regress out the effect of the correlation between Score1 and Score2 (Corr.= 0.692, p < 0.001)." }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-67", "text": "Table 4 shows that the roles consisting of only surface edits or a mixture of edits are not correlated to writing improvement." }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-68", "text": "However, Persuasive editor, which consists of content revisions, shows a positive significant correlation to writing improvement." }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-69", "text": "Our results suggest that the Persuasive editor is the role of an experienced writer." 
}, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-70", "text": "----------------------------------" }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-71", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-72", "text": "Although editor roles have been studied for online collaborative writing [10, 8] , our research investigates student revisions of argumentative essays." }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-73", "text": "While our model follows previous methods [10] , we introduce a unique vocabulary to model each editor's revision history, with evaluation results suggesting that our identified roles capture salient features of writing." }, { "sent_id": "b2a6ec11403fe73b9bae7742c1c5a2-C001-74", "text": "Future plans include using a Markov model to consider revision order, expanding the revision vocabulary, and using the predictions to provide feedback in an intelligent revision assistant." } ], "y": { "@BACK@": { "gold_contexts": [ [ "b2a6ec11403fe73b9bae7742c1c5a2-C001-11" ], [ "b2a6ec11403fe73b9bae7742c1c5a2-C001-64" ] ], "cite_sentences": [ "b2a6ec11403fe73b9bae7742c1c5a2-C001-11", "b2a6ec11403fe73b9bae7742c1c5a2-C001-64" ] }, "@USE@": { "gold_contexts": [ [ "b2a6ec11403fe73b9bae7742c1c5a2-C001-26" ] ], "cite_sentences": [ "b2a6ec11403fe73b9bae7742c1c5a2-C001-26" ] }, "@EXT@": { "gold_contexts": [ [ "b2a6ec11403fe73b9bae7742c1c5a2-C001-32" ] ], "cite_sentences": [ "b2a6ec11403fe73b9bae7742c1c5a2-C001-32" ] }, "@SIM@": { "gold_contexts": [ [ "b2a6ec11403fe73b9bae7742c1c5a2-C001-64", "b2a6ec11403fe73b9bae7742c1c5a2-C001-65" ] ], "cite_sentences": [ "b2a6ec11403fe73b9bae7742c1c5a2-C001-64" ] } } }, "ABC_fc58a9813b80afc9811b8ee27679b7_36": { "x": [ { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-2", "text": "The efficiency of Information Extraction systems is known to be heavily influenced by domain-specific knowledge but 
the cost of developing such systems is considerably high." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-3", "text": "In this article, we consider the problem of event extraction and show that learning word representations from unlabeled domain-specific data and using them for representing event roles enable to outperform previous state-of-the-art event extraction models on the MUC-4 data set." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-4", "text": "----------------------------------" }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-5", "text": "**INTRODUCTION**" }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-6", "text": "In the Information Extraction (IE) field, event extraction constitutes a challenging task." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-7", "text": "An event is described by a set of participants (i.e. attributes or roles) whose values are text excerpts." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-8", "text": "The event extraction task is related to several subtasks: event mention detection, candidate rolefiller extraction, relation extraction and event template filling." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-9", "text": "The problem we address here is the detection of role-filler candidates and their association with specific roles in event templates." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-10", "text": "For this task, IE systems adopt various ways of extracting patterns or generating rules based on the surrounding context, local context and global context (Patwardhan and Riloff, 2009 )." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-29", "text": "Furthermore, we propose two additional contributions to the construction of the word representations." 
}, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-11", "text": "Current approaches for learning such patterns include bootstrapping techniques (Huang and Riloff, 2012a; Yangarber et al., 2000) , weakly supervised learning algorithms (Huang and Riloff, 2011; Sudo et al., 2003; Surdeanu et al., 2006) , fully supervised learning approaches (Chieu et al., 2003; Freitag, 1998; Bunescu and Mooney, 2004; Patwardhan and Riloff, 2009 ) and other variations." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-12", "text": "All these methods rely on substantial amounts of manually annotated corpora and use a large body of linguistic knowledge." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-13", "text": "The performance of these approaches is related to the amount of knowledge engineering deployed and a good choice of features and classifiers." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-14", "text": "Furthermore, the efficiency of the system relies on the a priori knowledge of the applicative domain (the nature of the events) and it is generally difficult to apply a system on a different domain with less annotated data without reconsidering the design of the features used." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-15", "text": "An important step forwards is TIER light (Huang and Riloff, 2012a ) that targeted the minimization of human supervision with a bootstrapping technique for event roles detection." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-16", "text": "Also, PIPER (Patwardhan and Riloff, 2007; Patwardhan, 2010) distinguishes between relevant and irrelevant regions and learns domain-relevant extraction patterns using a semantic affinity measure." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-17", "text": "Another possible approach for dealing with this problem is to combine the use a restricted set of manually annotated data with a much larger set of data extracted in an unsupervised way from a corpus." 
}, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-18", "text": "This approach was experimented for relations in the context of Open Information Extraction (Soderland et al., 2010) but not for extracting events and their participants to our knowledge." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-19", "text": "In this paper, we propose to approach the task of labeling text spans with event roles by automatically learning relevant features that requires limited prior knowledge, using a neural model to induce semantic word representations (commonly referred as word embeddings) in an unsupervised fashion, as in (Bengio et al., 2006; Collobert and Weston, 2008) ." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-20", "text": "We exploit these word embeddings as features for a supervised event role (multiclass) classifier." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-21", "text": "This type of approach has been proved efficient for numerous tasks in natural language processing, including named entity recognition (Turian et al., 2010) , semantic role labeling , machine translation (Schwenk and Koehn, 2008; Lambert et al., 2012) , word sense disambiguation (Bordes et al., 2012) or sentiment analysis (Glorot et al., 2011; Socher et al., 2011) but has never been used, to our knowl-edge, for an event extraction task." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-22", "text": "Our goal is twofold: (1) to prove that using as only features word vector representations makes the approach competitive in the event extraction task; (2) to show that these word representations are scalable and robust when varying the size of the training data." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-23", "text": "Focusing on the data provided in MUC-4 (Lehnert et al., 1992) , we prove the relevance of our approach by outperforming state-of-the-art methods, in the same evaluation environment as in previous works." 
}, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-24", "text": "----------------------------------" }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-25", "text": "**APPROACH**" }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-26", "text": "In this work, we approach the event extraction task by learning word representations from a domainspecific data set and by using these representations to identify the event roles." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-27", "text": "This idea relies on the assumption that the different words used for a given event role in the text share some semantic properties, related to their context of use and that these similarities can be captured by specific representations that can be automatically induced from the text, in an unsupervised way." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-28", "text": "We then propose to rely only on these word representations to detect the event roles whereas, in most works (Riloff, 1996; Patwardhan and Riloff, 2007; Huang and Riloff, 2012a; Huang and Riloff, 2012b) , the role fillers are represented by a set of different features (raw words, their parts-ofspeech, syntactic or semantic roles in the sentence)." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-30", "text": "The first one is to exploit limited knowledge about the event types (seed words) to improve the learning procedure by better selecting the dictionary." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-31", "text": "The second one is to use a max operation 1 on the word vector representations in order to build noun phrase representations (since slot fillers are generally noun phrases), which represents a better way of aggregating the semantic information born by the word representations." 
}, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-32", "text": "----------------------------------" }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-33", "text": "**INDUCING DOMAIN-RELEVANT WORD REPRESENTATIONS**" }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-34", "text": "In order to induce the domain-specific word representations, we project the words into a 50-dimensional word space." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-35", "text": "We chose a single layer neural network (NN) architecture that avoids strongly engineered features, assumes little prior knowledge about the task, but is powerful enough to capture relevant domain information." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-36", "text": "Following , we use an NN which learns to predict whether a given text sequence (short word window) exists naturally in the considered domain." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-37", "text": "We represent an input sequence of n words as w i = w i\u2212(n/2) . . . , w i , . . ." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-38", "text": "w i+(n/2) ." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-39", "text": "The main idea is that each sequence of words in the training set should receive a higher score than a sequence in which one word is replaced with a random one." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-40", "text": "We call the sequence with a random word corrupted (\u00af w i ) and denote as correct ( w i ) all the sequences of words from the data set." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-41", "text": "The goal of the training step is then to minimize the following loss function for a word w i in the dictionary D:" }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-42", "text": "is the scoring function given by the neural network." 
}, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-43", "text": "Further details and evaluations of these embeddings can be found in (Bengio et al., 2003; Bengio et al., 2006; Collobert and Weston, 2008; Turian et al., 2010) ." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-44", "text": "For efficiency, words are fed to our architecture as indices taken from a finite dictionary." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-45", "text": "Obviously, a simple index does not carry much useful information about the word." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-46", "text": "So, the first layer of our network maps each of these word indices into a feature vector, by a lookup table operation." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-47", "text": "Our first contribution intervenes in the process of the choosing the proper dictionary." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-48", "text": "(Bengio, 2009) has shown that the order of the words in the dictionary of the neural network is not indifferent to the quality of the achieved representations: he proposed to order the dictionary by frequency and select the words for the corrupted sequence according to this order." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-49", "text": "In our case, the most frequent words are not always the most relevant for the task of event role detection." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-50", "text": "Since we want to have a training more focused to the domain specific task, we chose to order the dictionary by word relevance to the domain." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-51", "text": "We accomplish this by considering a limited number of seed words for each event type that needs to be discovered in text (e.g. attack, bombing, kidnapping, arson)." 
}, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-52", "text": "We then rate with higher values the words that are more similar to the event types words, according to a given semantic similarity, and we rank them accordingly." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-53", "text": "We use the \"Leacock Chodorow\" similarity from Wordnet 3.0 (Leacock and Chodorow, 1998) ." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-54", "text": "Initial experimental results proved that using this domain-oriented order leads to better performance for the task than the order by frequency." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-55", "text": "----------------------------------" }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-56", "text": "**USING WORD REPRESENTATIONS TO IDENTIFY EVENT ROLES**" }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-57", "text": "After having generated for each word their vector representation, we use them as features for the annotated data to classify event roles." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-58", "text": "However, event role fillers are not generally single words but noun phrases that can be, in some cases, identified as named entities." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-59", "text": "For identifying the event roles, we therefore apply a two-step strategy." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-60", "text": "First, we extract the noun chunks using SENNA 2 parser Collobert, 2011 ) and we build a representation for these chunks defined as the maximum, per column, of the vector representations of the words it contains." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-61", "text": "Second, we use a statistical classifier to recognize the slot fillers, using this representation as features." 
}, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-62", "text": "We chose the extra-trees ensemble classifier (Geurts et al., 2006) , which is a meta estimator that fits a number of randomized decision trees (extra-trees) on various sub-samples of the data set and use averaging to improve the predictive accuracy and control over-fitting." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-63", "text": "----------------------------------" }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-64", "text": "**EXPERIMENTS AND RESULTS**" }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-65", "text": "----------------------------------" }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-66", "text": "**TASK DESCRIPTION**" }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-67", "text": "We conducted the experiments on the official MUC-4 training corpus that consists of 1,700 documents and instantiated templates for each document." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-68", "text": "The task consists in extracting information about terrorist events in Latin America from news articles." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-69", "text": "We classically considered the following 4 types of events: attack, bombing, kidnapping and arson." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-70", "text": "These are represented by templates containing various slots for each piece of information that should be extracted from the document (perpetrators, human targets, physical targets, etc)." 
}, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-71", "text": "Following previous works (Huang and Riloff, 2011; Huang and Riloff, 2012a) , we only consider the \"String Slots\" in this work (other slots need different treatments) and we group certain slots to finally consider the five slot types PerpInd (individual perpetrator), PerpOrg (organizational perpetrator), Target (physical target), Victim (human target name or description) and Weapon (instrument id or type)." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-72", "text": "We used 1,300 documents (DEV) for training, 200 documents (TST1+TST2) for tuning, and 200 documents (TST3+TST4) as the blind test set." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-73", "text": "To compare with similar works, we do not evaluate the template construction and only focus on the identification of the slot fillers: for each answer key in a reference template, we check if we find it correctly with our extraction method, using head noun matching (e.g., the victim her mother Martha Lopez Orozco de Lopez is considered to match Matha Lopez), and merging duplicate extractions (so that different extracted slot fillers sharing the same head noun are counted only once)." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-74", "text": "We also took into account the answer keys with multiple values in the reference, dealing with conjunctions (when several victims are named, we need to find all of them) and disjunctions (when several names for the same organization are possible, we need to find any of them)." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-75", "text": "Our results are reported as Precision/Recall/F1-score for each event role separately and averaged on all roles." 
}, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-76", "text": "----------------------------------" }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-77", "text": "**EXPERIMENTS**" }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-78", "text": "In all the experiments involving our model, we established the following stable choices of parameters: 50-dimensional vectors obtained by training on sequences of 5 words, which is consistent with previous studies (Turian et al., 2010; Collobert and Weston, 2008) ." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-79", "text": "All the hyper-parameters of our model (e.g. learning rate, size of the hidden layer, size of the word vectors) have been chosen by finetuning our event extraction system on the TST1+TST2 data set." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-80", "text": "For DRVR-50 and W2V-50, the embeddings were built from the whole training corpus (1,300 documents) and the dictionary was made of all the words of this corpus under their inflected form." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-81", "text": "We used the extra-trees ensemble classifier implemented in (Pedregosa et al., 2011) , with hyperparameters optimized on the validation data: forest of 500 trees and the maximum number of features to consider when looking for the best split is \u221a number f eatures." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-82", "text": "We present a 3-fold evaluation: first, we compare our system with state-of-the-art systems on the same task, then we compare our domain-relevant vector representations (DRVR-50) to more generic word embeddings (C&W50, HLBL-50) 3 and finally to another State-of-the-art systems PerpInd PerpOrg Target Victim Weapon Average (Riloff, 1996) 33 word representation construction on the domainspecific data (W2V-50) 4 ." 
}, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-83", "text": "Figure 1: F1-score results for event role labeling on MUC-4 data, for different size of training data, of \"String Slots\" on the TST3+TST4 with different parameters, compared to the learning curve of TIER (Huang and Riloff, 2012a) ." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-84", "text": "The grey points represent the performances of other IE systems." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-85", "text": "Figure 1 presents the average F1-score results, computed over the slots PerpInd, PerpOrg, Target, Victim and Weapon." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-86", "text": "We observe that models relying on word embeddings globally outperform the state-of-the-art results, which demonstrates that the word embeddings capture enough semantic information to perform the task of event newswire corpus 4 W2V-50 are the embeddings induced from the MUC4 data set using the negative sampling training algorithm (Mikolov et al., 2013a; Mikolov et al., 2013b; Mikolov et al., 2013c) , available at https://code.google.com/ p/word2vec/ role labeling on \"String Slots\" without using any additional hand-engineered features." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-87", "text": "Moreover, our representations (DRVR-50) clearly surpass the models based on generic embeddings (C&W-50 and HLBL-50) and obtain better results than W2V-50, based the competitive model of (Mikolov et al., 2013a) , even if the difference is small." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-88", "text": "We can also note that the performance of our model is good even with a small amount of training data, which makes it a good candidate to easily develop an event extraction system on a new domain." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-89", "text": "Table 1 provides a more detailed analysis of the comparative results." 
}, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-90", "text": "We can see in this table that our results surpass those of previous systems (0.73 vs. 0.59) with, particularly, a consistently higher precision on all roles, whereas recall is smaller for certain roles (Target and Weapon)." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-91", "text": "To further explore the impact of these representations, we compared our word embeddings with other word embeddings (C&W-50, HLBL-50) and report the results in Figure 1 and Table 1 ." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-92", "text": "The results show that our model also outperforms the models using others word embeddings (F1-score of 0.73 against 0.65, 0.66)." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-93", "text": "This proves that a model learned on a domain-specific data set does indeed provide better results, even if its size is much smaller (whereas it is usually considered that neural models require often important training data)." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-94", "text": "Finally, we also achieve slightly better results than W2V-50 with other word representations built on the same corpus, which shows that the choices made for the word representation construction, such as the use of domain information for word ordering, tend to have a positive impact." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-95", "text": "----------------------------------" }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-96", "text": "**CONCLUSIONS AND PERSPECTIVES**" }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-97", "text": "We presented in this paper a new approach for event extraction by reducing the features to only use unsupervised word representations and a small set of seed words." 
}, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-98", "text": "The word embeddings induced from a domain-specific corpus bring improvement over state-of-art models on the standard MUC-4 corpus and demonstrate a good scalability on different sizes of training data sets." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-99", "text": "Therefore, our proposal offers a promising path towards easier and faster domain adaptation." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-100", "text": "We also prove that using a domain-specific corpus leads to better word vector representations for this task than using other publicly-available word embeddings (even if they are induced from a larger corpus)." }, { "sent_id": "fc58a9813b80afc9811b8ee27679b7-C001-101", "text": "As future work, we will reconsider the architecture of the neural network and we will refocus on creating a deep learning model while taking advantage of a larger set of types of information such as syntactic information, following (Levy and Goldberg, 2014) , or semantic information, following (Yu and Dredze, 2014) ." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "fc58a9813b80afc9811b8ee27679b7-C001-11" ], [ "fc58a9813b80afc9811b8ee27679b7-C001-15" ] ], "cite_sentences": [ "fc58a9813b80afc9811b8ee27679b7-C001-11", "fc58a9813b80afc9811b8ee27679b7-C001-15" ] }, "@DIF@": { "gold_contexts": [ [ "fc58a9813b80afc9811b8ee27679b7-C001-28" ], [ "fc58a9813b80afc9811b8ee27679b7-C001-83" ] ], "cite_sentences": [ "fc58a9813b80afc9811b8ee27679b7-C001-28", "fc58a9813b80afc9811b8ee27679b7-C001-83" ] }, "@USE@": { "gold_contexts": [ [ "fc58a9813b80afc9811b8ee27679b7-C001-71" ] ], "cite_sentences": [ "fc58a9813b80afc9811b8ee27679b7-C001-71" ] }, "@SIM@": { "gold_contexts": [ [ "fc58a9813b80afc9811b8ee27679b7-C001-71" ] ], "cite_sentences": [ "fc58a9813b80afc9811b8ee27679b7-C001-71" ] } } }, "ABC_196e7ca5ccd6754ac986137ec55cd3_36": { "x": [ { "sent_id": "196e7ca5ccd6754ac986137ec55cd3-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "196e7ca5ccd6754ac986137ec55cd3-C001-2", "text": "We present VOILA: an optimised, multimodal dialogue agent for interactive learning of visually grounded word meanings from a human user." }, { "sent_id": "196e7ca5ccd6754ac986137ec55cd3-C001-3", "text": "VOILA is: (1) able to learn new visual categories interactively from users from scratch; (2) trained on real human-human dialogues in the same domain, and so is able to conduct natural spontaneous dialogue; (3) optimised to find the most effective trade-off between the accuracy of the visual categories it learns and the cost it incurs to users." }, { "sent_id": "196e7ca5ccd6754ac986137ec55cd3-C001-4", "text": "VOILA is deployed on Furhat 1 , a humanlike, multi-modal robot head with backprojection of the face, and a graphical virtual character." 
}, { "sent_id": "196e7ca5ccd6754ac986137ec55cd3-C001-5", "text": "----------------------------------" }, { "sent_id": "196e7ca5ccd6754ac986137ec55cd3-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "196e7ca5ccd6754ac986137ec55cd3-C001-7", "text": "As intelligent systems/robots are brought out of the laboratory and into the physical world, they must become capable of natural everyday conversation with their human users about their physical surroundings." }, { "sent_id": "196e7ca5ccd6754ac986137ec55cd3-C001-8", "text": "Among other competencies, this involves the ability to learn and adapt mappings between words, phrases, and sentences in Natural Language (NL) and perceptual aspects of the external environment -this is widely known as the grounding problem." }, { "sent_id": "196e7ca5ccd6754ac986137ec55cd3-C001-9", "text": "Our work is similar in spirit to e.g. (Roy, 2002; Skocaj et al., 2011) but advances it in several aspects (Yu et al., 2016) ." }, { "sent_id": "196e7ca5ccd6754ac986137ec55cd3-C001-10", "text": "In this demo paper, we present a dialogue agent that learns visually grounded word meanings interactively from a human tutor, which we call: VOILA (Visually Optimised Interactive Learning Agent)." }, { "sent_id": "196e7ca5ccd6754ac986137ec55cd3-C001-11", "text": "Our goal is to enable this agent to learn to identify and describe objects/attributes (colour 1 http://www.furhatrobotics.com/ and shape in this case) in its immediate visual environment through interaction with human users, incrementally, over time." }, { "sent_id": "196e7ca5ccd6754ac986137ec55cd3-C001-12", "text": "Unlike a lot of past work (Silberer and Lapata, 2014; Thomason et al., 2016; Matuszek et al., 2014) , here we assume that the agent is in the position of a child, who does not have any prior knowledge of perceptual categories." 
}, { "sent_id": "196e7ca5ccd6754ac986137ec55cd3-C001-13", "text": "Hence, the agent must learn from scratch: (1) the perceptual/visual categories themselves; and (2) how NL expressions map to these; and in addition, (3) as a standard conversational agent, the agent much also learn to conduct natural, spontaneous conversations with real humans." }, { "sent_id": "196e7ca5ccd6754ac986137ec55cd3-C001-14", "text": "In this demonstration, VOILA plays the role of an interactive, concept learning agent that takes initiative in the dialogues and actively learns novel visual knowledge from the feedback from the human tutor." }, { "sent_id": "196e7ca5ccd6754ac986137ec55cd3-C001-15", "text": "What sets VOILA apart from other work in this area is:" }, { "sent_id": "196e7ca5ccd6754ac986137ec55cd3-C001-16", "text": "\u2022 VOILA's dialogue strategy is optimised via Reinforcement Learning to achieve an optimal trade-off between the accuracy of the concepts it learns/has learnt from users, and the effort that the dialogues incur on the users: this is a form of active learning where the agent only asks about something if it doesn't already know the answer with some appropriate confidence (see (Yu et al., 2016) for more detail)." 
}, { "sent_id": "196e7ca5ccd6754ac986137ec55cd3-C001-17", "text": "\u2022 VOILA is trained on a corpus of real HumanHuman conversations (Yu et al., 2017) , and is thus able to process natural human dialogue, which contains phenomena such as self-corrections, repetitions and restarts, pauses, fillers, and continuations VOILA is deployed onto Furhat, a humanlike robot head with a custom back-projected face, built-in stereo microphones, and a Microsoft" }, { "sent_id": "196e7ca5ccd6754ac986137ec55cd3-C001-18", "text": "----------------------------------" }, { "sent_id": "196e7ca5ccd6754ac986137ec55cd3-C001-19", "text": "**INTERACTIVE MULTIMODAL FRAMEWORK**" }, { "sent_id": "196e7ca5ccd6754ac986137ec55cd3-C001-20", "text": "We developed a multimodal framework in support of building an interactive learning system, which loosely follows that of Yu et al. (2016) ." }, { "sent_id": "196e7ca5ccd6754ac986137ec55cd3-C001-21", "text": "The framework consists of two core modules:" }, { "sent_id": "196e7ca5ccd6754ac986137ec55cd3-C001-22", "text": "Vision Module The vision module produces visual attribute predictions, using two base feature categories: the HSV colour space for colour attributes, and a 'bag of visual words' (i.e. PHOW descriptors) for the object shapes/class." }, { "sent_id": "196e7ca5ccd6754ac986137ec55cd3-C001-23", "text": "It consists of a set of binary classifiers -Logistic Regression SVM classifiers with Stochastic Gradient Descent (SGD) (Zhang, 2004) -to incrementally learn attribute predictions." }, { "sent_id": "196e7ca5ccd6754ac986137ec55cd3-C001-24", "text": "The visual classifiers ground visual attribute words such as 'red', 'circle' etc." }, { "sent_id": "196e7ca5ccd6754ac986137ec55cd3-C001-25", "text": "that appear as parameters of the Dialogue Acts used in the system." 
}, { "sent_id": "196e7ca5ccd6754ac986137ec55cd3-C001-26", "text": "Dialogue Module This module relies on a classical architecture for dialogue systems, composed of Dialogue Management (DM) and Natural Language Understanding (NLU), as well as Generation (NLG) components." }, { "sent_id": "196e7ca5ccd6754ac986137ec55cd3-C001-27", "text": "These components interact via Dialogue Act representations (Stolcke et al., 2000) , e.g. inform(color=red), ask(shape)." }, { "sent_id": "196e7ca5ccd6754ac986137ec55cd3-C001-28", "text": "The Natural Language Understanding component processes user utterances by extracting a sequence of key patterns, slots and values, and then transforming them into dialogue-act representations, following a list of hand-crafted rules." }, { "sent_id": "196e7ca5ccd6754ac986137ec55cd3-C001-29", "text": "The NLG component makes use of a template-based approach that chooses a suitable learner utterance for a specific dialogue act, according to the statistical distribution of utterance templates from dialogue examples." }, { "sent_id": "196e7ca5ccd6754ac986137ec55cd3-C001-30", "text": "Finally, the DM component is implemented with an optimised learning policy using Reinforcement Learning (see Section 3)." }, { "sent_id": "196e7ca5ccd6754ac986137ec55cd3-C001-31", "text": "This optimised policy is trained to: (1) conduct interaction with human partners, and (2) achieve an optimum balance between classification performance and the cost of the dialogue to the tutor in the interactive learning process." 
}, { "sent_id": "196e7ca5ccd6754ac986137ec55cd3-C001-32", "text": "----------------------------------" }, { "sent_id": "196e7ca5ccd6754ac986137ec55cd3-C001-33", "text": "**LEARNING HOW TO LEARN**" }, { "sent_id": "196e7ca5ccd6754ac986137ec55cd3-C001-34", "text": "In this section, we briefly describe our method for optimising the dialogue agent with Reinforcement Learning and in interaction with a simulated tutor, itself built from the BURCHAK human-human dialogue corpus 2 (Yu et al., 2017 ) within a simulated learning environment (see Fig. 2 )." }, { "sent_id": "196e7ca5ccd6754ac986137ec55cd3-C001-35", "text": "Given the visual attribute learning task, a smart agent must learn novel visual objects/attributes as accurately as possible through natural interactions with real humans, but meanwhile it should attempt to minimise the human involvement as much as possible in this life-long learning process." }, { "sent_id": "196e7ca5ccd6754ac986137ec55cd3-C001-36", "text": "Here, we divide this interactive learning task into two sub-tasks, modeled as a hierarchical Markov Decision Process, consisting of two interdependent MDPs in charge of decisions about: \"when to learn\" and \"how to learn\"." }, { "sent_id": "196e7ca5ccd6754ac986137ec55cd3-C001-37", "text": "----------------------------------" }, { "sent_id": "196e7ca5ccd6754ac986137ec55cd3-C001-38", "text": "**WHEN TO LEARN: ADAPTIVE CONFIDENCE THRESHOLD**" }, { "sent_id": "196e7ca5ccd6754ac986137ec55cd3-C001-39", "text": "The first MDP performs a kind of active learning: the learner/agent only acquires the feedback from humans about a visual attribute if it is not confident enough already about its own predictions." }, { "sent_id": "196e7ca5ccd6754ac986137ec55cd3-C001-40", "text": "Following previous work (Yu et al., 2016) , here we use a positive confidence threshold, which determines when the agent believes its own predictions." 
}, { "sent_id": "196e7ca5ccd6754ac986137ec55cd3-C001-41", "text": "For instance, the learner can ask either polar or WH-questions about an attribute if its confidence score is higher than a certain threshold; otherwise, there should be no interaction about that attribute. But as Yu et al. (2016) point out the confidence score from a classifier is not reliable enough at the early stages of learning, so in order to find an optimum dialogue policy, a threshold should be able to dynamically adjust according to the previous learning performance of the agent." }, { "sent_id": "196e7ca5ccd6754ac986137ec55cd3-C001-42", "text": "We therefore assign a separate but dependent component MDP for adjusting the threshold dynamically in order to optimise the trade-off between accuracy and cost." }, { "sent_id": "196e7ca5ccd6754ac986137ec55cd3-C001-43", "text": "Note now that the adjusted confidence threshold will affect the agent's dialogue behaviour, modeled in the other MDP presented in the next section (natural interaction with humans)." }, { "sent_id": "196e7ca5ccd6754ac986137ec55cd3-C001-44", "text": "----------------------------------" }, { "sent_id": "196e7ca5ccd6754ac986137ec55cd3-C001-45", "text": "**HOW TO LEARN: NATURAL INTERACTION WITH HUMANS**" }, { "sent_id": "196e7ca5ccd6754ac986137ec55cd3-C001-46", "text": "The second MDP, as a purely conversational agent, aims at managing natural, spontaneous conversation with human partners or other agents to achieve the final goal, i.e. gain useful information about visual attributes." }, { "sent_id": "196e7ca5ccd6754ac986137ec55cd3-C001-47", "text": "The initial state in this MDP is determined by a combination of the adjusted threshold from the former MDP and the visual predictions from the color and shape classifiers that ground NL attributes terms ('red', 'square', etc): either the color or shape status can be assigned to: 0, if the learner has a low confidence on its predictions (i.e. 
the confidence score is lower than 0.5); or, 1 if the confidence score is higher than 0.5, but lower than the positive threshold; or else, 2." }, { "sent_id": "196e7ca5ccd6754ac986137ec55cd3-C001-48", "text": "This together with the previous dialogue act constitutes the state space of this MDP." }, { "sent_id": "196e7ca5ccd6754ac986137ec55cd3-C001-49", "text": "The agent is then trained to choose the correct dialogue action to achieve a state in which both shape and color of the current object are known with certainty (with status = 2), either through feedback from the user, or through the agent's own existing visual knowledge." }, { "sent_id": "196e7ca5ccd6754ac986137ec55cd3-C001-50", "text": "Of course, the agent must also learn to produce coherent dialogues by responding to questions at the right time, giving feedback at the right time, asking for feedback at the right time, etc." }, { "sent_id": "196e7ca5ccd6754ac986137ec55cd3-C001-51", "text": "----------------------------------" }, { "sent_id": "196e7ca5ccd6754ac986137ec55cd3-C001-52", "text": "**DEMONSTRATION**" }, { "sent_id": "196e7ca5ccd6754ac986137ec55cd3-C001-53", "text": "As noted, VOILA has been deployed onto Furhat: a human-like Robot Head (Moubayed et al., 2011) , which provides an interaction framework for the management of multi-party, multi-modal interactions, and which employs a Microsoft Kinect for skeletal tracking." }, { "sent_id": "196e7ca5ccd6754ac986137ec55cd3-C001-54", "text": "In the demonstration, the VOILA agent will randomly choose 20 visual objects, and then learn to describe them using their low-level visual attributes (e.g. color and shape) image-by-image through interaction with users." }, { "sent_id": "196e7ca5ccd6754ac986137ec55cd3-C001-55", "text": "As mentioned above, we assume that VOILA is in the position of a child learning from scratch, but instead of complex real objects with noisy backgrounds, we use a set of simple toy objects (see dialogue example in Fig. 
1 ), but without annotations or labels." }, { "sent_id": "196e7ca5ccd6754ac986137ec55cd3-C001-56", "text": "It is essential to highlight that the VOILA agent would only start a conversation with a human partner when it isn't confident about its own attribute predictions." }, { "sent_id": "196e7ca5ccd6754ac986137ec55cd3-C001-57", "text": "----------------------------------" }, { "sent_id": "196e7ca5ccd6754ac986137ec55cd3-C001-58", "text": "**CONCLUSION**" }, { "sent_id": "196e7ca5ccd6754ac986137ec55cd3-C001-59", "text": "We have presented a multi-modal learning agent -VOILA -that can learn grounded visual-concept meanings through interaction with human tutors incrementally, over time." }, { "sent_id": "196e7ca5ccd6754ac986137ec55cd3-C001-60", "text": "The agent is deployed with an adaptive dialogue policy (optimised using Reinforcement Learning), which has learned to (1) process natural, coherent conversations with humans and (2) achieve comparable learning performance to a hand-crafted system, but with less tutoring effort needed from humans." }, { "sent_id": "196e7ca5ccd6754ac986137ec55cd3-C001-61", "text": "Recently, we also extended the VOILA agent to learn real visual object classes instead of toy objects by integrating with a Load-Balancing Self-Organizing Incremental Neural Network (LB-SOINN) (Zhang et al., 2014) for object classification." 
} ], "y": { "@USE@": { "gold_contexts": [ [ "196e7ca5ccd6754ac986137ec55cd3-C001-9" ], [ "196e7ca5ccd6754ac986137ec55cd3-C001-20" ], [ "196e7ca5ccd6754ac986137ec55cd3-C001-40" ] ], "cite_sentences": [ "196e7ca5ccd6754ac986137ec55cd3-C001-9", "196e7ca5ccd6754ac986137ec55cd3-C001-20", "196e7ca5ccd6754ac986137ec55cd3-C001-40" ] }, "@BACK@": { "gold_contexts": [ [ "196e7ca5ccd6754ac986137ec55cd3-C001-16" ] ], "cite_sentences": [ "196e7ca5ccd6754ac986137ec55cd3-C001-16" ] }, "@DIF@": { "gold_contexts": [ [ "196e7ca5ccd6754ac986137ec55cd3-C001-15", "196e7ca5ccd6754ac986137ec55cd3-C001-16" ], [ "196e7ca5ccd6754ac986137ec55cd3-C001-40", "196e7ca5ccd6754ac986137ec55cd3-C001-41", "196e7ca5ccd6754ac986137ec55cd3-C001-42" ] ], "cite_sentences": [ "196e7ca5ccd6754ac986137ec55cd3-C001-16", "196e7ca5ccd6754ac986137ec55cd3-C001-40", "196e7ca5ccd6754ac986137ec55cd3-C001-41" ] }, "@SIM@": { "gold_contexts": [ [ "196e7ca5ccd6754ac986137ec55cd3-C001-20" ] ], "cite_sentences": [ "196e7ca5ccd6754ac986137ec55cd3-C001-20" ] }, "@MOT@": { "gold_contexts": [ [ "196e7ca5ccd6754ac986137ec55cd3-C001-41" ] ], "cite_sentences": [ "196e7ca5ccd6754ac986137ec55cd3-C001-41" ] }, "@EXT@": { "gold_contexts": [ [ "196e7ca5ccd6754ac986137ec55cd3-C001-40", "196e7ca5ccd6754ac986137ec55cd3-C001-41", "196e7ca5ccd6754ac986137ec55cd3-C001-42" ] ], "cite_sentences": [ "196e7ca5ccd6754ac986137ec55cd3-C001-40", "196e7ca5ccd6754ac986137ec55cd3-C001-41" ] } } }, "ABC_a67f40381e82afaf249f097a208555_36": { "x": [ { "sent_id": "a67f40381e82afaf249f097a208555-C001-7", "text": "----------------------------------" }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-29", "text": "**RELATED WORK**" }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-52", "text": "**RELATIONAL SIMILARITY**" }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-2", "text": 
"Abstract." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-3", "text": "In this work, we study the problem of relational similarity by combining different word embeddings learned from different types of contexts." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-4", "text": "The word2vec model with linear bag-ofwords contexts can capture more topical and less functional similarity, while the dependency-based word embeddings with syntactic contexts can capture more functional and less topical similarity." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-5", "text": "We explore topical space and functional space simultaneously by considering these two word embeddings and different metrics." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-6", "text": "We evaluate our model on relational similarity framework, and report state-of-the-art performance on standard test collections." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-9", "text": "Measuring relational similarity between two word pairs plays important roles in natural language processing (NLP)." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-10", "text": "The techniques for solving this problem can be applied to a variety of NLP tasks, such as query expansion, word sense disambiguation, machine translation, information extraction and question answering." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-11", "text": "Previous work addressing the problem can be roughly classified into three categories: (1) learning word embeddings from large collections of text using variants of neural networks (Mikolov et al. (2013a) ; Mikolov et al. (2013b) ; Mikolov et al. (2013c) ; Levy and Goldberg (2014) ) or global matrix factorization (Deerwester et al. 
(1990) ; Turney (2012) ); (2) extracting knowledge from existing semantic networks, such as WordNet (Yang and Powers (2005) ; Alvarez and Lim (2007) ; Hughes and Ramage (2007) ) and ConceptNet (Boteanu and Chernova (2015) ); (3) combining the above two models by various ways (Agirre et al. (2009) ; Zhila et al. (2013) ; Iacobacci et al. (2015) ; Summers-Stay et al. (2016) )." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-12", "text": "The empirical evidence shows that the word representations learned from neural network models do an especially good job in capturing not only attributional similarities between words but also similarities between pairs of words (Mikolov et al. (2013c) )." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-13", "text": "Levy and Goldberg (2014) generalize the skip-gram model with negative sampling to include arbitrary word contexts and present the dependency-based word embeddings, which are learned from syntactic contexts derived from dependency parse-trees." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-14", "text": "Qualitative and quantitative analysis demonstrates that the word2vec model with linear bag-of-words contexts can yield broad topical similarity while the dependency-based word embeddings with syntactic contexts can capture more functional similarity." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-15", "text": "Turney (2012) is the first, to the best of our knowledge, to raise the word vector representations in a dual space and unify semantic relations and compositions by the dualspace model." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-16", "text": "The dual-space model consists of a domain space and a function space, where the domain or topic of a word is characterized by the nouns that occur near it and the function or role of a word is characterized by the syntactic context that relates it to the verbs that occur near it." 
}, { "sent_id": "a67f40381e82afaf249f097a208555-C001-17", "text": "We detail our main contributions as follows." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-18", "text": "(1) In this paper, we use the word2vec model with linear bag-of-words contexts to capture the domain of a word, and the dependency-based word embeddings with syntactic contexts to characterize the function of a word." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-19", "text": "The broad contexts used in our model can provide richer information for measuring domain similarity (i.e., topic, subject, or field similarity) and function similarity (i.e, role, relationship, or usage similarity) than noun or verb-based patterns for contexts in Turney's (2012) model." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-20", "text": "(2) The two existing models for measuring relational similarity are: the directional similarity model (Zhila et al. (2013) ) and the dual-space model consisting of domain and function space (Turney (2012) )." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-21", "text": "Both models suffer some drawbacks." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-22", "text": "The directional similarity model explores the difference of two relationships in multiple topicality dimensions in the vector space." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-23", "text": "However, it ignores the spatial distances between word vectors, which can reveal the function similarity of words in function space." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-24", "text": "The dual-space model can measure the domain similarity and function similarity between words." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-25", "text": "However, it only computes the domain similarity between two single words and places less emphasis on the domain similarity between two relations." 
}, { "sent_id": "a67f40381e82afaf249f097a208555-C001-26", "text": "In this work, we propose a new dual-space model for measuring relational similarity, which combines the advantages of the two existing models." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-27", "text": "(3) We evaluate our model on relational similarity framework and report state-of-the-art performance on SAT analogy questions." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-28", "text": "----------------------------------" }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-30", "text": "Vector space models have a long, rich history in the field of natural language processing, where each word is represented as a real-valued vector in a continuous vector space." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-31", "text": "All these models depend in some way or another on the distributional hypothesis which states that words that occur in similar contexts tend to have similar meanings (Harris, 1954; Firth, 1957) ." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-32", "text": "There are the two main model families for learning word vectors: (1) global matrix factorization methods, such as latent semantic analysis, which generates embeddings from term-document matrices by singular value decomposition (Deerwester et al. (1990) ); (2) neural network models, such as the skip-gram and continuous bag of words models of () Mikolov et al. (2013a) ; Mikolov et al. (2013b) ; Mikolov et al. (2013c) ), referred to as word2vec, which learn embeddings by training a network to predict neighboring words." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-33", "text": "Levy and Goldberg (2014) generalize the skip-gram model with negative sampling to include arbitrary word contexts, learn embeddings from syntactic contexts and demonstrate that the dependency-based embeddings can capture more functional similarity than the original skip-gram embeddings (word2vec)." 
}, { "sent_id": "a67f40381e82afaf249f097a208555-C001-34", "text": "Several algorithms have been proposed for solving SAT-style analogy questions." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-35", "text": "A good algorithm for recognizing analogies can also help to solve the classification problem of semantic relations, which has potential applications in machine translation, information extraction and word sense disambiguation." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-36", "text": "Quesada et al. (2004) propose LSA as a method to represent the relations between words and use a prediction to represent relation comparisons in the LSA semantic space." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-37", "text": "Veale (2004) considers the utility of the taxonomic structure of WordNet to the solution of SAT analogies." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-38", "text": "Turney and Littman (2005) show that the cosine metric in the Vector Space Model of information retrieval can be used to solve analogy questions and to classify semantic relations." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-39", "text": "Turney (2012) introduces a dual-space model, which consists of a space for measuring domain similarity and a space for measuring function similarity." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-40", "text": "The dual-space model has been applied to measuring relation similarity and compositional similarity." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-41", "text": "----------------------------------" }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-42", "text": "**WORD EMBEDDINGS WITH DIFFERENT CONTEXT TYPES**" }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-43", "text": "Word2vec (W2V) is one of the most popular word embedding methods that learn word vectors from raw text." 
}, { "sent_id": "a67f40381e82afaf249f097a208555-C001-44", "text": "It has two models for generating dense embeddings: the skip-gram model and the continuous bagof-words (CBOW) model." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-45", "text": "The training objective of the skip-gram model is to predict each surrounding word in a context window of 2k words from the target word." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-46", "text": "So for k =2, the contexts of the target word w t are w t\u22122 , w t\u22121 , w t+1, w t+2 and we are predicting each of these from the word w t ." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-47", "text": "However, a context window with a smaller size k may miss some important contexts while including some accidental ones." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-48", "text": "Recently, Levy et al. propose the dependency-based word embeddings (DEP), which generalize the skip-gram model with negative sampling, and move from linear bag-of-words contexts to syntactic contexts that are derived from automatically produced dependency parse-trees." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-49", "text": "Embeddings produced from different kinds of contexts can induce different word similarities." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-50", "text": "The original skip-gram embeddings can yield broad topical similarities, while the dependency-based word embeddings can capture more functional similarities (Levy et al., 2014) ." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-51", "text": "----------------------------------" }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-53", "text": "Relational similarity measures the degree of correspondence between two relations (Jurgens et al. (2012) )." 
}, { "sent_id": "a67f40381e82afaf249f097a208555-C001-54", "text": "The task can be modeled as an analogy problem, where given two pairs of words, A:B and C:D, the goal is to determine the degree to which the semantic relation between A and B is similar to that between C and D. We first introduce two types of existing models for relational similarity." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-55", "text": "Zhila et al. (2013) propose a directional similarity model to evaluate the correspondence between relations." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-56", "text": "Given two pairs of words A:B and C:D, suppose (v A, v B ) and (v C, v D ) are the corresponding vectors of these words." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-57", "text": "Relational similarity of these two word pairs is defined as the cosine function between the two directional vectors of (v A \u2212 v B ) and (v C \u2212 v D ):" }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-58", "text": "----------------------------------" }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-59", "text": "**DIRECTIONAL SIMILARITY MODEL**" }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-60", "text": "In this model, the relationship between two words is represented by the difference of corresponding two word vectors, which reveals the change from one word to the other in terms of multiple topicality dimensions in the vector space." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-61", "text": "From the geometric point of view, if two relationship vectors are relatively parallel (i.e., share the same direction), then the two word pairs can be considered as they have similar relations." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-62", "text": "They apply the model to the problem of relation classification which determines whether two pairs share the same relation." 
}, { "sent_id": "a67f40381e82afaf249f097a208555-C001-63", "text": "The goal here is to design a model for answering the more difficult SAT analogy questions." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-64", "text": "However, sometimes this directional similarity model can not measure the analogy problem correctly." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-65", "text": "----------------------------------" }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-66", "text": "**A DUAL-SPACE MODEL**" }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-67", "text": "Why does the directional similarity model fail on some analogy questions?" }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-68", "text": "One possible reason is that the model depends on direction alone and ignores spatial distance between word vectors." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-69", "text": "Turney (2012) presents a dual-space model to unify semantic relations and compositions." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-70", "text": "The dual-space model consists of a domain space for measuring domain similarity and a function space for measuring function similarity." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-71", "text": "He gives a good example to explain this model." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-72", "text": "Consider the analogy example, traffic:street :: water:riverbed." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-73", "text": "Traffic and street share the same domain, the domain of transportation." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-74", "text": "Water and riverbed share the same domain, the domain of hydrology." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-75", "text": "On the other hand, traffic and water share the same function, the function of flow, and street and riverbed share the same function, the function of carry." 
}, { "sent_id": "a67f40381e82afaf249f097a208555-C001-76", "text": "The model can recognize that the semantic relation between traffic and street is analogous to the relation between water and riverbed by considering the combination of domain and function similarity." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-77", "text": "----------------------------------" }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-78", "text": "**A NEW DUAL-SPACE MODEL: REFINING THE DIRECTIONAL SIMILARITY MODEL AND THE DUAL-SPACE MODEL**" }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-79", "text": "The directional similarity model (Zhila et al., 2013) explores the difference of two relationships in multiple topicality dimensions in the vector space." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-80", "text": "However, it ignores the spatial distance between word vectors, which can reveal the function similarity of words in function space." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-81", "text": "The dual-space model (Turney, 2012) can measure both domain similarity and function similarity between words." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-82", "text": "However, it only computes the domain similarity between two single word vectors and places less emphasis on the domain similarity between two relations." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-83", "text": "Moreover, Turney's (2012) model uses only noun- or verb-based patterns for contexts to model domain or function space." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-84", "text": "In this paper, we propose a novel dual-space model, which combines the advantages of the above two existing models." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-85", "text": "We use the word2vec-based directional similarity model to measure the similarity of two relations in domain space, and the dependency-based word embeddings to represent the word vector in function space."
}, { "sent_id": "a67f40381e82afaf249f097a208555-C001-86", "text": "The two embeddings used in our model consider broader contexts than those in the original dual-space model, which can provide richer information for measuring domain similarity and function similarity." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-87", "text": "A mathematical description of our dual-space model is given as follows: Given two pairs of words A:B and C:D, suppose V_W2V(A), V_W2V(B), V_W2V(C), V_W2V(D) are the vectors of these words in domain space and V_DEP(A), V_DEP(B), V_DEP(C), V_DEP(D) are the vectors of these words in function space." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-88", "text": "Based on the directional similarity model (Zhila et al., 2013), we define the domain similarity of two pairs of words as follows: Similarity_D(A:B, C:D) = cos(V_W2V(A) \u2212 V_W2V(B), V_W2V(C) \u2212 V_W2V(D))." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-89", "text": "The function similarity of A and C and the function similarity of B and D are defined respectively as follows: Similarity_F1 = cos(V_DEP(A), V_DEP(C)) and Similarity_F2 = cos(V_DEP(B), V_DEP(D))." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-90", "text": "We design the method Similarity_ADD to combine the above similarities for relational similarity, which satisfies that the combined similarity is high when the component similarities are high: Similarity_ADD = (Similarity_D + Similarity_F1 + Similarity_F2) / 3." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-91", "text": "We give an SAT question in Table 1 to demonstrate how our compositional similarity works." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-92", "text": "An SAT analogy question consists of a target pair of words and five option pairs of words." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-93", "text": "The task is to select the option pair that \"best expresses a relationship similar to that expressed in the original pair\", as stated in the test's directions." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-94", "text": "We use the 300-dimensional W2V and DEP vectors available for download 1 ."
}, { "sent_id": "a67f40381e82afaf249f097a208555-C001-95", "text": "For the target pair bruise:skin in Table 1, our method Similarity_ADD (= 0.406) can recognize the correct answer stain:fabric although Similarity_D (= 0.189), Similarity_F1 (= 0.539) and Similarity_F2 (= 0.491) for the correct answer are all ranked second among five options." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-96", "text": "----------------------------------" }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-97", "text": "**RELATIONAL SIMILARITY EXPERIMENT**" }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-98", "text": "In the following experiments, we evaluate our approaches to solving analogies on a set of 374 SAT analogy questions, the same set of questions as was used for Turney's dual-space model (Turney, 2012)." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-99", "text": "Precision and Recall are two standard performance measurements used for evaluation." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-100", "text": "The definitions of precision and recall are specified by Turney and Littman (2005)." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-101", "text": "In all experiments, we use the 300-dimensional W2V and DEP vectors pretrained on a concatenation of three large, diverse English corpora, and those vectors are available for download 1 ." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-102", "text": "Table 2 shows the experimental results of four approaches presented in Section 4.3 on the set of 374 analogy questions." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-103", "text": "Two questions are skipped because the vector for the target pair is not available in the collection." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-104", "text": "Since there are five options for each target pair of an SAT analogy question, random guessing would yield a recall of 20%."
}, { "sent_id": "a67f40381e82afaf249f097a208555-C001-105", "text": "Domain similarity Similarity_D and function similarities Similarity_F1 and Similarity_F2 all perform much better than random guessing." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-106", "text": "Our compositional similarity model Similarity_ADD in the dual space clearly outperforms Similarity_D, Similarity_F1 and Similarity_F2 in a single domain or function space." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-107", "text": "In this work, we explore domain space and function space simultaneously by considering two kinds of word embeddings and different metrics." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-108", "text": "Word embeddings can capture topical and functional information of a word by using different types of contexts; however, they are unable to model words with multiple meanings accurately, because a word is represented as just a single vector that carries a weighted average of its different meanings." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-109", "text": "Existing lexical and knowledge databases, such as WordNet, ConceptNet and Cyc, can be modeled as graphs in which words are represented as the nodes and the relations between words are signified by the edges." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-110", "text": "These databases have more accurate information although their coverage of words and types of relations is usually limited." }, { "sent_id": "a67f40381e82afaf249f097a208555-C001-111", "text": "We plan to look at ways to combine our dual-space model with existing databases to improve the performance of our current system."
} ], "y": { "@BACK@": { "gold_contexts": [ [ "a67f40381e82afaf249f097a208555-C001-11" ], [ "a67f40381e82afaf249f097a208555-C001-20" ], [ "a67f40381e82afaf249f097a208555-C001-79" ] ], "cite_sentences": [ "a67f40381e82afaf249f097a208555-C001-11", "a67f40381e82afaf249f097a208555-C001-20", "a67f40381e82afaf249f097a208555-C001-79" ] }, "@MOT@": { "gold_contexts": [ [ "a67f40381e82afaf249f097a208555-C001-20", "a67f40381e82afaf249f097a208555-C001-21" ], [ "a67f40381e82afaf249f097a208555-C001-79", "a67f40381e82afaf249f097a208555-C001-80" ] ], "cite_sentences": [ "a67f40381e82afaf249f097a208555-C001-20", "a67f40381e82afaf249f097a208555-C001-79" ] }, "@USE@": { "gold_contexts": [ [ "a67f40381e82afaf249f097a208555-C001-88" ] ], "cite_sentences": [ "a67f40381e82afaf249f097a208555-C001-88" ] } } }, "ABC_07cee5aa02b518d48e41b1d6010c2f_36": { "x": [ { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-2", "text": "This paper presents a higher-order model for constituent parsing aimed at utilizing more local structural context to decide the score of a grammar rule instance in a parse tree." }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-3", "text": "Experiments on English and Chinese treebanks confirm its advantage over its first-order version." }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-4", "text": "It achieves its best F1 scores of 91.86% and 85.58% on the two languages, respectively, and further pushes them to 92.80% and 85.60% via combination with other high-performance parsers." }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-5", "text": "----------------------------------" }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-7", "text": "Factorization is crucial to discriminative parsing."
}, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-8", "text": "Previous discriminative parsing models usually factor a parse tree into a set of parts." }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-9", "text": "Each part is scored separately to ensure tractability." }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-10", "text": "In dependency parsing (DP), the number of dependencies in a part is called the order of a DP model (Koo and Collins, 2010)." }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-11", "text": "Accordingly, existing graph-based DP models can be categorized into three groups, namely, the first-order (Eisner, 1996; McDonald et al., 2005a; McDonald et al., 2005b), second-order (McDonald and Pereira, 2006; Carreras, 2007) and third-order (Koo and Collins, 2010) models." }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-12", "text": "Similarly, we can define the order of constituent parsing in terms of the number of grammar rules in a part." }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-13", "text": "Then, the previous discriminative constituent parsing models (Johnson, 2001; Henderson, 2004; Taskar et al., 2004; Petrov and Klein, 2008a; [* The research reported in this paper was partially supported by the Research Grants Council of HKSAR, China, through the GRF Grant 9041597 (CityU 144410).]" }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-14", "text": "Petrov and Klein, 2008b; Finkel et al., 2008) are the first-order ones, because there is only one grammar rule in a part." }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-15", "text": "The discriminative re-scoring models (Collins, 2000; Collins and Duffy, 2002; Charniak and Johnson, 2005; Huang, 2008) can be viewed as previous attempts at higher-order constituent parsing, using some parts containing more than one grammar rule as non-local features."
}, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-16", "text": "In this paper, we present a higher-order constituent parsing model 1 based on these previous works." }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-17", "text": "It allows multiple adjacent grammar rules in each part of a parse tree, so as to utilize more local structural context to decide the plausibility of a grammar rule instance." }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-18", "text": "Evaluated on the PTB WSJ and Chinese Treebank, it achieves its best F1 scores of 91.86% and 85.58%, respectively." }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-19", "text": "Combined with other high-performance parsers under the framework of constituent recombination (Sagae and Lavie, 2006; Fossum and Knight, 2009), this model further enhances the F1 scores to 92.80% and 85.60%, the highest ones achieved so far on these two data sets." }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-20", "text": "----------------------------------" }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-21", "text": "**HIGHER-ORDER CONSTITUENT PARSING**" }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-22", "text": "Discriminative parsing aims to learn a function f : S \u2192 T from a set of sentences S to a set of valid parses T according to a given CFG, which maps an input sentence s \u2208 S to a set of candidate parses T(s)." }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-23", "text": "The function takes the following discriminative form: f(s) = argmax_{t \u2208 T(s)} g(t, s), where g(t, s) is a scoring function to evaluate the event that t is the parse of s. Following Collins (2002), this scoring function is formulated in the linear form g(t, s) = \u03b8 \u00b7 \u03a8(t, s)," }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-24", "text": "where \u03a8(t, s) is a vector of features and \u03b8 the vector of their associated weights."
}, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-25", "text": "To ensure tractability, this model is factorized as g(t, s) = \u2211_{r \u2208 t} g(Q(r), s) = \u2211_{r \u2208 t} \u03b8 \u00b7 \u03a6(Q(r), s)," }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-26", "text": "where g(Q(r), s) scores Q(r), a part centered at grammar rule instance r in t, and \u03a6(Q(r), s) is the vector of features for Q(r)." }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-27", "text": "Each Q(r) makes its own contribution to g(t, s)." }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-28", "text": "A part in a parse tree is illustrated in Figure 1." }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-29", "text": "It consists of the center grammar rule instance NP \u2192 NP VP and a set of immediate neighbors, i.e., its parent PP \u2192 IN NP, its children NP \u2192 DT QP and VP \u2192 VBN PP, and its sibling IN \u2192 of." }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-30", "text": "This set of neighboring rule instances forms a local structural context to provide useful information to determine the plausibility of the center rule instance." }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-31", "text": "----------------------------------" }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-32", "text": "**FEATURE**" }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-33", "text": "The feature vector \u03a6(Q(r), s) consists of a series of features {\u03c6_i(Q(r), s) | i \u2265 0}. The first feature \u03c6_0(Q(r), s) is calculated with a PCFG-based generative parsing model (Petrov and Klein, 2007), as defined in (4) below, where r is the grammar rule instance A \u2192 B C that covers the span from the b-th" }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-34", "text": "to the e-th word, splitting at the m-th word, x, y and z are latent variables in the PCFG-based model, and I(\u00b7) and O(\u00b7) are the inside and outside probabilities, respectively."
}, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-35", "text": "All other features \u03c6_i(Q(r), s) are binary functions that indicate whether a configuration exists in Q(r) and s. These features fall naturally into two categories, namely, lexical and structural." }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-36", "text": "All features extracted from the part in Figure 1 are demonstrated in Table 1." }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-37", "text": "Some back-off structural features are used for smoothing, which cannot be presented due to limited space." }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-38", "text": "With only lexical features in a part, this parsing model backs off to a first-order one similar to those in the previous works." }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-39", "text": "Adding structural features, each involving at least one neighboring rule instance, makes it a higher-order parsing model." }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-40", "text": "----------------------------------" }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-41", "text": "**DECODING**" }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-42", "text": "The factorization of the parsing model allows us to develop an exact decoding algorithm for it." }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-43", "text": "Following Huang (2008) , this algorithm traverses a parse forest in a bottom-up manner." }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-44", "text": "However, it determines and keeps the best derivation for every grammar rule instance instead of for each node." }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-45", "text": "Because the structures above the current rule instance are not yet determined, the computation of its non-local structural features, e.g., parent and sibling features, has to be delayed until it joins an upper-level structure."
}, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-46", "text": "For example, when computing the score of a derivation under the center rule NP \u2192 NP VP in Figure 1 , the algorithm will extract child features from its children NP \u2192 DT QP and VP \u2192 VBN PP." }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-47", "text": "The parent and sibling features of the two child rules can also be extracted from the current derivation and used to calculate the score of this derivation." }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-48", "text": "But parent and sibling features for the center rule will not be computed until the decoding process reaches the rule above, i.e., PP \u2192 IN NP." }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-49", "text": "This algorithm is more complex than the approximate decoding algorithm of Huang (2008) ." }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-50", "text": "However, its efficiency heavily depends on the size of the parse forest it has to handle." }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-51", "text": "Forest pruning (Char-" }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-52", "text": "----------------------------------" }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-53", "text": "**CONSTITUENT RECOMBINATION**" }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-54", "text": "Following Fossum and Knight (2009), our constituent weighting scheme for parser combination uses multiple outputs of independent parsers."
}, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-55", "text": "Suppose each parser generates a k-best parse list for an input sentence; the weight of a candidate constituent c is then defined as w(c) = \u2211_i \u03bb_i \u2211_k \u03b4(c, t_{i,k}) f(t_{i,k})," }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-56", "text": "where i is the index of an individual parser, \u03bb_i the weight indicating the confidence of a parser, \u03b4(c, t_{i,k}) a binary function indicating whether c is contained in t_{i,k}, the k-th parse output from the i-th parser, and f(t_{i,k}) the score of the k-th parse assigned by the i-th parser, as defined in Fossum and Knight (2009)." }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-57", "text": "The weight of a recombined parse is defined as the sum of weights of all constituents in the parse." }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-58", "text": "However, this definition has a systematic bias towards selecting a parse with as many constituents as possible for the highest weight." }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-59", "text": "A pruning threshold \u03c1, similar to the one in Sagae and Lavie (2006), is therefore needed to restrain the number of constituents in a recombined parse." }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-60", "text": "The parameters \u03bb_i and \u03c1 are tuned by Powell's method (Powell, 1964) on a development set, using the F1 score of PARSEVAL (Black et al., 1991) as the objective." }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-61", "text": "----------------------------------" }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-62", "text": "**EXPERIMENT**" }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-63", "text": "Our parsing models are evaluated on both English and Chinese treebanks, i.e., the WSJ section of Penn Treebank 3.0 (LDC99T42) and the Chinese Treebank 5.1 (LDC2005T01U01)."
}, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-64", "text": "In order to compare with previous works, we opt for the same split as in Petrov and Klein (2007), as listed in Table 2." }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-65", "text": "For parser combination, we follow the setting of Fossum and Knight (2009), using Section 24 instead of Section 22 of the WSJ treebank as development set." }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-66", "text": "In this work, the lexical model of Chen and Kit (2011) is combined with our syntactic model under the framework of product-of-experts (Hinton, 2002)." }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-67", "text": "A factor \u03bb is introduced to balance the two models." }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-68", "text": "It is tuned on a development set using the golden section search (Kiefer, 1953)." }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-69", "text": "[Table 4 residue: F1 scores of previous parsers. Single: Carreras et al. (2008) 91.1. Re-scoring: Collins (2000) 89.70; Charniak and Johnson (2005) 91.02; the parser of Charniak and Johnson 91.40; Huang (2008) 91.69. Combination: Fossum and Knight (2009) 92.4; Zhang et al. (2009) 92.3; Petrov (2010) 91.85. Self-training: Zhang et al. (2009).]" }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-70", "text": "The parameters \u03b8 of each parsing model are estimated from a training set using an averaged perceptron algorithm, following Collins (2002) and Huang (2008) ." }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-71", "text": "The performance of our first- and higher-order parsing models on all sentences of the two test sets is presented in Table 3, where \u03bb indicates a tuned balance factor." }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-72", "text": "This parser is also combined with the parser of Charniak and Johnson (2005) and the Stanford"
}, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-73", "text": "parser. The best combination results in Table 3 are achieved with k=70 for English and k=100 for Chinese when selecting the k-best parses." }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-74", "text": "Our results are compared with the best previous ones on the same test sets in Tables 4 and 5." }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-75", "text": "----------------------------------" }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-76", "text": "**CONCLUSION**" }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-77", "text": "This paper has presented a higher-order model for constituent parsing that factorizes a parse tree into larger parts than before, in hopes of increasing its power of discriminating the true parse from the others without losing tractability." }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-78", "text": "A performance gain of 0.3%-0.4% demonstrates its advantage over its first-order version." }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-79", "text": "Including a PCFG-based model as its basic feature, this model achieves a better performance than previous single and re-scoring parsers, and its combination with other parsers performs even better (by about 1%)." }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-80", "text": "More importantly, it extends the existing works into a more general framework of constituent parsing to utilize more lexical and structural context and incorporate more strength of various parsing techniques." }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-81", "text": "However, higher-order constituent parsing inevitably leads to a high computational complexity." }, { "sent_id": "07cee5aa02b518d48e41b1d6010c2f-C001-82", "text": "We intend to deal with the efficiency problem of our model with some advanced parallel computing technologies in our future work."
} ], "y": { "@BACK@": { "gold_contexts": [ [ "07cee5aa02b518d48e41b1d6010c2f-C001-15" ], [ "07cee5aa02b518d48e41b1d6010c2f-C001-43" ] ], "cite_sentences": [ "07cee5aa02b518d48e41b1d6010c2f-C001-15", "07cee5aa02b518d48e41b1d6010c2f-C001-43" ] }, "@USE@": { "gold_contexts": [ [ "07cee5aa02b518d48e41b1d6010c2f-C001-43" ], [ "07cee5aa02b518d48e41b1d6010c2f-C001-70" ] ], "cite_sentences": [ "07cee5aa02b518d48e41b1d6010c2f-C001-43", "07cee5aa02b518d48e41b1d6010c2f-C001-70" ] }, "@DIF@": { "gold_contexts": [ [ "07cee5aa02b518d48e41b1d6010c2f-C001-42", "07cee5aa02b518d48e41b1d6010c2f-C001-49" ] ], "cite_sentences": [ "07cee5aa02b518d48e41b1d6010c2f-C001-49" ] } } }, "ABC_b9e1a6bb7b6f2d55c6ea3bb2d3897d_36": { "x": [ { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-2", "text": "We present a model for automatically predicting information status labels for German referring expressions." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-3", "text": "We train a CRF on manually annotated phrases, and predict a fine-grained set of labels." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-4", "text": "We achieve an accuracy score of 69.56% on our most detailed label set, 76.62% when gold standard coreference is available." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-5", "text": "----------------------------------" }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-27", "text": "Such items have also been called bridging anaphors (Poesio and Vieira, 1998) ." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-7", "text": "The automatic identification of information status (Prince, 1981; 1992) , i.e. categorizing discourse entities into different classes on the given-new scale, has recently been identified as an important issue in natural language processing (Nissim, 2006; Rahman and Ng, 2011; ." 
}, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-8", "text": "It is widely acknowledged that information status and, more generally, information structure is reflected in word order, in the form of referring expressions as well as in prosody." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-9", "text": "In computational linguistics, the ability to automatically label text with information status, therefore, could be of great benefit to many applications, including surface realization, text-to-speech synthesis, anaphora resolution, summarization, etc." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-10", "text": "The task of automatically labeling text with information status, however, is a difficult one." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-11", "text": "Part of the difficulty arises from the fact that, to a certain degree, such labeling requires world knowledge and semantic comprehension of the text, but another obstacle is simply that theoretical notions of information status are not used consistently in the literature." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-12", "text": "In this paper we outline a system, trained on a small amount of data, that achieves encouraging results on the task of automatically labeling transcribed German radio news data with fine-grained information status labels." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-13", "text": "----------------------------------" }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-14", "text": "**LEARNING INFORMATION STATUS**" }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-15", "text": "A simpler variant of the task is anaphoricity detection (discourse-new detection) (Bean and Riloff, 1999; Ng and Cardie, 2002; Uryupina, 2003; Denis and Baldridge, 2007; Zhou and Kong, 2011), which divides discourse entities into anaphoric (given) and new."
}, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-16", "text": "Identifying discourse-new expressions in texts is helpful as a precursor to coreference resolution, since, by definition, there is no need to identify antecedents for new entities." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-17", "text": "In the linguistic literature, referring expressions have been distinguished in much more detail, and there is reason to believe that this could also provide useful information for NLP applications." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-18", "text": "Nissim (2006) and Rahman and Ng (2011) developed methods to automatically identify three different classes: OLD, MEDIATED and NEW expressions." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-19", "text": "This classification, which is described in Nissim et al. (2004) , has been used for annotating the Switchboard dialog corpus (Calhoun et al., 2010) , on which both studies are based." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-20", "text": "Most recently, Rahman and Ng (2012) extend their automatic prediction system to a more fine-grained set of 16 subtypes." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-21", "text": "Old." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-22", "text": "The class of OLD entities in Nissim et al. (2004) is not limited to full-fledged anaphors like in Example (1a) but also includes cases of generic and first/second person pronouns like in (1b), which may or may not possess a previous mention." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-23", "text": "(1) a. Shares in General Electric rose as investors bet that the US company would take more lucrative engine orders for the A380." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-24", "text": "b. I wonder where this comes from." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-25", "text": "Mediated." 
}, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-26", "text": "The group of MEDIATED entities mainly has two subtypes: (2a) shows an expression which has not been mentioned before but which is dependent on previous context." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-28", "text": "(2b) contains a phrase which is generally known but does not depend on the discourse context." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-29", "text": "(2) a. Tomorrow, the Shenzhou 8 spacecraft will be in a position to attempt the docking." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-30", "text": "b. They hope that he will be given the right to remain in the Netherlands." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-31", "text": "New." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-32", "text": "The label NEW, following Nissim et al. (2004: 1024), applies \"to entities that have not yet been introduced in the dialog and that the hearer cannot infer from previously mentioned entities." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-33", "text": "\" Two kinds of expressions which fall into this category are unfamiliar definites (3a) and (specific) indefinites (3b)." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-34", "text": "(3) a. The man who shot a policeman yesterday has not been caught yet." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-35", "text": "b. Klose scored a penalty in the 80th minute." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-36", "text": "Based on work described in Nissim (2006), Rahman and Ng (2011) develop a machine learning approach to information-status determination." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-37", "text": "They develop a support vector machine (SVM) model from the annotated Switchboard dialogs in order to predict the three possible classes."
}, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-38", "text": "In an extension of this work, Rahman and Ng (2012) compare a rule-based system to a classifier with features based on the rules to predict 16 subtypes of the three basic types." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-39", "text": "On this extended label set on the dialog data, they achieve accuracy of 86.4% with gold standard coreference and 78.7% with automatically detected coreference." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-40", "text": "----------------------------------" }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-41", "text": "**EXTENDING INFORMATION STATUS PREDICTION**" }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-42", "text": "The work we present here is most similar to that of Rahman and Ng (2012) , however, our work dif-fers from theirs in a number of important respects." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-43", "text": "We (i) experiment with a different information status classification, derived from , (ii) use (morpho-)syntactic and functional features automatically extracted from a deep linguistic parser in our CRF sequence model, (iii) test our approach on a different language (German), (iv) show that high accuracy can be achieved with a limited number of training examples, and (v) that the approach works on a different genre (transcribed radio news bulletins which contain complex embedded phrases like an offer to the minority Tamil population of Sri Lanka, not typically found in spoken dialog)." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-44", "text": "The annotation scheme by divides referring items differently to Nissim et al. (2004) ." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-45", "text": "Arguments are provided in the former paper and in Baumann and Riester (to appear)." 
}, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-46", "text": "As it stands, the scheme provides too many labels for our purpose." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-47", "text": "As a compromise, we group them in seven classes: GIVEN, SITUATIVE, BRIDGING, UNUSED, NEW, GENERIC and EXPLETIVE." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-48", "text": "Given." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-49", "text": "Givenness is a central notion in information structure theory." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-50", "text": "Schwarzschild (1999) defines givenness of individual-type entities in terms of coreference." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-51", "text": "If desired, GIVEN items can be subclassified, e.g. whether they are pronouns or full noun phrases, and whether the latter are repetitions or short forms of earlier material, or whether they consist of lexically new material (epithets)." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-52", "text": "Situative." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-53", "text": "1 st and 2 nd person pronouns, locative and temporal adverbials, usually count as deictic expressions since they refer to elements in the utterance situation." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-54", "text": "We therefore count them as a separate class." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-55", "text": "SITUATIVE entities may, but need not, corefer." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-56", "text": "Bridging." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-57", "text": "Bridging anaphors, as in (2a) above, have received much attention, see e.g. Asher and Lascarides (1998) or Poesio and Vieira (1998) ." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-58", "text": "Although they are discourse-new, they share properties with coreference anaphors since they depend on the discourse context." 
}, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-59", "text": "They represent a class which can be easily identified by human annotators but are difficult to capture by automatic techniques." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-60", "text": "Unused." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-61", "text": "In manual annotation practice, it is very often impossible to decide whether an entity is hearerknown, since this depends on who we assume the hearer to be; and even if we agree on a recipient, we may still be mistaken about their knowledge." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-62", "text": "For example, Wolfgang Bosbach, deputy chairman of the CDU parliamentary group may be known to parts of a German audience but not to other people." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-63", "text": "We address this by collecting both hearer-known and hearer-unknown definite expressions into one class UNUSED." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-64", "text": "This does not rule out further subclassification (known/unknown) or the possibility of using machine learning techniques to identify this distinction, see Nenkova et al. (2005) ." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-65", "text": "The fact that Rahman and Ng (2011) report the highest confusion rate between NEW and MEDIATED entities may have its roots in this issue." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-66", "text": "New." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-67", "text": "Only (specific) indefinites are labeled NEW." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-68", "text": "Generic." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-69", "text": "An issue which is not dealt with in Nissim et al. (2004) are GENERIC expressions as in Lions have manes." 
}, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-70", "text": "Reiter and Frank (2010) discuss the task of identifying generic items in a manner similar to the learning tasks presented above, using a Bayesian network." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-71", "text": "We believe it makes sense to integrate genericity detection into information-status prediction." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-72", "text": "3" }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-73", "text": "----------------------------------" }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-74", "text": "**GERMAN DATA**" }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-75", "text": "Our work is based on the DIRNDL radio news corpus of Eckart et al. (2012) which has been handannotated with information status labels." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-76", "text": "We choose a selection of 6668 annotated phrases (1420 sentences)." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-77", "text": "This is an order of magnitude smaller than the annotated Switchboard corpus of Calhoun et al. (2010) ." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-78", "text": "We parse each sentence with the German Lexical Functional Grammar of Rohrer and Forst (2006) using the XLE parser in order to automati- Cahill and Riester (2009) show that there are asymmetries between pairs of information status labels contained in sentences, i.e. certain classes of expressions tend to precede certain other classes." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-79", "text": "We therefore treat the prediction of IS labels as a sequence labeling task." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-80", "text": "4 We train a CRF using wapiti (Lavergne et al., 2010) , with the features outlined in Table 1 ." 
}, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-81", "text": "We also include a basic \"coreference\" feature, similar to the lexical features of Rahman and Ng (2011) , that fires if there is some lexical overlap of nouns (or compound nouns) in the preceding 10 sentences." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-82", "text": "The original label set described in contains 21 labels." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-83", "text": "Here we work with a subset of maximally 12 labels, but also consider smaller subsets of labels and carry out a mapping to the Nissim (2006) label set (Table 2) ." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-84", "text": "5 We run a 10-fold cross-validation experiment and report average prediction accuracy." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-85", "text": "The results are given in Table 3a ." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-86", "text": "As an informed baseline, we run the same crossvalidation experiment with a subset of features that roughly correspond to the features of Nissim (2006) ." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-87", "text": "Our models perform statistically significantly better than the baseline (p < 0.001, using the approximate randomization test) for all label sets." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-88", "text": "Table 2 : Varying the granularity of the label sets" }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-89", "text": "----------------------------------" }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-90", "text": "**PREDICTION MODEL FOR INFORMATION STATUS**" }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-91", "text": "As expected, the less fine-grained a label set, the easier it is to predict the labels." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-92", "text": "It remains for future work to show the effect of different label set granularities in practical applications." 
}, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-93", "text": "We approximate gold standard coreference information from the manually annotated labels (e.g. all GIVEN label types are by their nature coreferent), and carry out an experiment with gold-standard approximation of coreference marking." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-94", "text": "These results are also reported in Table 3a ." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-95", "text": "Here we see a clear performance difference in the effect of gold-standard coreference on the Riester label set (increasing around 6-10%), compared to the Nissim label set (decreasing slightly)." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-96", "text": "This is an artifact of the way the mapping was carried out, deriving the gold standard coreference information from the Riester label set." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-97", "text": "There is not a one-to-one mapping between OLD and GIVEN, and, in the Riester label set, coreferential entities that are labeled as SITUATIVE (deictic terms) are not recognized as such." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-98", "text": "The feature set in Table 1 reflects the morphosyntactic properties of the phrases to be labeled." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-99", "text": "Sometimes world knowledge is required in order to be able to accurately predict a label; for example, to know that the pope can be categorized as UNUSED-KNOWN, because it can occur discourseinitially, whereas the priest must usually be categorized as GIVEN." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-100", "text": "The BRIDGING relationship is also difficult to capture without some world knowledge." 
}, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-101", "text": "For example, to infer that the waitress can be categorized as BRIDGING in the context of the restaurant requires information that links the two concepts." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-102", "text": "Rahman and Ng (2012) also note this and include features based on FrameNet, WordNet and the ReVerb corpus for English." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-103", "text": "For German, we address this issue by introducing two further types of features into our model based on the GermaNet resource (Hamp and Feldweg, 1997) ." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-104", "text": "The first type is based on the GermaNet synset of the head noun in the phrase and its distance from the root node (the assumption is that entities closer to root are more generic than those further away)." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-105", "text": "The second include the sum and maximum of the Lin semantic relatedness measures (Lin, 1998) of how similar the head noun of the phrase is to the other nouns in current and immediately preceding sentence surrounding the phrase (calculated with GermaNet Pathfinder; Finthammer and Cramer, 2008) ." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-106", "text": "The results are given in Table 3b ." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-107", "text": "Here we see a consistent increase in performance of around 4% for each label set over the model that does not include the GermaNet features." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-108", "text": "Again, we see the same decrease in performance on the Nissim label set when using gold standard coreference information." 
}, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-109", "text": "----------------------------------" }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-110", "text": "**CONCLUSION**" }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-111", "text": "In this paper we presented a model for automatically labeling German text with fine-grained information status labels." }, { "sent_id": "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-112", "text": "The results reported here show that we can achieve high accuracy prediction on a complex text type (transcribed radio news), even with a limited amount of data." } ], "y": { "@MOT@": { "gold_contexts": [ [ "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-7" ], [ "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-65" ] ], "cite_sentences": [ "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-7", "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-65" ] }, "@BACK@": { "gold_contexts": [ [ "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-18" ], [ "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-36" ], [ "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-65" ] ], "cite_sentences": [ "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-18", "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-36", "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-65" ] }, "@SIM@": { "gold_contexts": [ [ "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-81" ] ], "cite_sentences": [ "b9e1a6bb7b6f2d55c6ea3bb2d3897d-C001-81" ] } } }, "ABC_ff5122ce817d506fbcb269b7ae41fe_36": { "x": [ { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-137", "text": "**CONCLUSION**" }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-2", "text": "We present a cross-lingual method for determining NP structures." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-3", "text": "More specifically, we try to determine whether the semantics of tripartite noun compounds in context requires a left or right branching interpretation." 
}, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-4", "text": "The system exploits the difference in word position between languages as found in parallel corpora." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-5", "text": "We achieve a bracketing accuracy of 94.6%, significantly outperforming all systems in comparison and comparable to human performance." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-6", "text": "Our system generates large amounts of high-quality bracketed NPs in a multilingual context that can be used to train supervised learners." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-7", "text": "----------------------------------" }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-9", "text": "k-partite noun compounds, i.e., compositions of k bare common nouns that function as one unit, (kNCs), such as air traffic control system, usually have an implicit structure that reflects semantics." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-10", "text": "While a LEFTbranching [world banana] market is very unlikely, for luxury cattle truck, both structures make sense and context is necessary for disambiguation: [luxury cattle] truck is a truck for luxury cattle whereas luxury [cattle truck] is a luxury truck for (any) cattle." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-11", "text": "Therefore, a proper structural analysis is a crucial part of noun compound interpretation and of fundamental importance for many tasks in natural language processing such as machine translation." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-12", "text": "The correct French translation of luxury cattle truck depends on the internal structure." 
}, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-13", "text": "While [luxury cattle] truck is translated as camion pour b\u00e9tail de luxe, the preferred translation for luxury [cattle truck] is camion de luxe pour b\u00e9tail." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-14", "text": "Previous work on noun compound bracketing has shown that supervised beats unsupervised." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-15", "text": "The latter approaches use N-gram statistics or lexical patterns (Lauer, 1995; Nakov and Hearst, 2005; Barri\u00e8re and M\u00e9nard, 2014) , web counts (Lapata and Keller, 2004) or semantic relations (Kim and Baldwin, 2013) and evaluate on carefully selected evaluation data from encyclopedia (Lauer, 1995; Barri\u00e8re and M\u00e9nard, 2014) or from general newspaper text (Kim and Baldwin, 2013) ." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-16", "text": "Vadas and Curran (2007a,b) manually annotated the Penn Treebank and showed that they improve over unsupervised results by a large margin." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-17", "text": "Pitler et al. (2010) used the data from Vadas and Curran (2007a) for a parser applicable on base noun phrases (NPs) of any length including coordinations." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-18", "text": "Barker (1998) presents a bracketing method for k-partite NPs that reduces the task to three-word bracketings within a sliding window." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-19", "text": "One advantage of supervised approaches for this task is that kNCs are labeled in context so contextual features can be used in the learning framework." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-20", "text": "These are especially useful when dealing with ambiguous kNCs." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-21", "text": "The need for annotated data is a drawback of supervised approaches." 
}, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-22", "text": "Manual annotations are costly and time-consuming." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-23", "text": "To circumvent this need for annotated data, previous work has used cross-lingual supervision based on parallel corpora." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-24", "text": "Bergsma et al. (2011) made use of small amounts of annotated data on the target side and complement this with bilingual features from unlabeled bitext in a co-trained classifier for coordination disambiguation in complex NPs." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-25", "text": "Previous work on using cross-lingual data for the analysis of multi-word expressions (MWEs) of different types include Busa and Johnston (1996) ; Girju (2007) ; Sinha (2009); Tsvetkov and Wintner (2010) ; Ziering et al. (2013) ." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-26", "text": "Ziering and Van der Plas (2014) propose an approach that refrains from using any human annotation." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-27", "text": "They use the fact, that languages differ in their preference for open or closed compounding (i.e., multiword vs. one-word compounds), for inducing the English bracketing of 3NCs." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-28", "text": "English open 3NCs like human rights abuses can be translated to partially closed phrases as in German Verletzungen der Menschenrechte, (abuses of human rights), from which we can induce the LEFT-branching structure." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-29", "text": "Although this approach achieves a solid accuracy, a crucial limitation is coverage, because restricting to six paraphrasing patterns ignores many other predictive cases." 
}, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-30", "text": "Moreover, the system needs part of speech (PoS) tags and splitting information for determining 2NCs and is therefore rather language-dependent." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-31", "text": "In this paper, we present a precise, high-coverage and knowledge-lean method for bracketing kNCs (for k \u2265 3) occurring in parallel data." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-32", "text": "Our method uses the distances of words that are aligned to kNC components in parallel languages." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-33", "text": "For example, the 3NC human rights violations can be bracketed using the positions of aligned words in the Italian fragment . . ." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-34", "text": "che le violazioni gravi e sistematiche dei diritti umani . . . ." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-35", "text": "The fact, that the alignment of the third noun, violations (violazioni), is separated from the rest, points us in the direction of LEFT-branching." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-36", "text": "Using less restricted forms of cross-lingual supervision, we achieve a much higher coverage than Ziering and Van der Plas (2014) ." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-37", "text": "Furthermore, our results are more accurate." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-38", "text": "In contrast to previous unsupervised methods, our system is applicable in both token-and type-based modes." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-39", "text": "Token-based bracketing is context-dependent and allows for a better treatment of structural ambiguity (as in luxury cattle truck)." 
}, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-40", "text": "We generate large amounts of high-quality bracketed kNCs in a multilingual context that can be used to train supervised learners." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-41", "text": "----------------------------------" }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-42", "text": "**ALIGNED WORD DISTANCE BRACKETING**" }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-43", "text": "The aligned word distance bracketing (AWDB) is inspired by Behaghel's First Law saying that elements which belong close together intellectually will also be placed close together (Behaghel, 1909) ." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-44", "text": "For each language l, we apply the AWDB algorithm on a kNC as shown in Figure 1 : we start bottomup with one constituent per noun." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-45", "text": "For each constituent c i , we create a set of content words 1 c i aligns to, AW i ." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-46", "text": "We merge the two consecutive constituents c m and c m+1 with the smallest aligned word distance (AWD) based on the minimum distance from all words in AW m to all words in AW m+1 :" }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-47", "text": "where pos(\u03b1) is the position of a word \u03b1 in a sentence." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-48", "text": "In the case of (c m , c m+1 ) being aligned to a common closed compound, AWD(c m , c m+1 ) is zero." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-49", "text": "If the smallest AWD is not unique but the related constituents do not overlap (e.g., (c 1 ,c 2 ) and (c 3 ,c 4 ) aligning to two different closed compounds) we merge both constituent pairs in one iteration." 
}, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-50", "text": "If they overlap (e.g., (c 1 ,c 2 ) and (c 2 ,c 3 ) aligning to a common closed compound), no bracketing structure can be derived from the word positions in l. Similarly, if there is an empty set AW e , i.e., there is no alignment from c e to a content word in l, AWDB cannot bracket the kNC using the translation to l. If no structure can be derived from any aligned language, AWDB refuses to answer." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-51", "text": "For example, the 4NC air transport safety organization is aligned to four words in the French fragment Nous devons mettre en place cette organisation 7 europ\u00e9enne charg\u00e9e de la s\u00e9curit\u00e9 12 du transport 14 a\u00e9rien 15 qui . . . (We need to establish this European organization responsible for the safety of air transport that . . . )." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-52", "text": "The aligned word sets are: AW 1 = {a\u00e9rien}, AW 2 = {transport}, AW 3 = {s\u00e9curit\u00e9} and AW 4 = {organisation}. c 1 and c 2 have the smallest AWD and thus are merged." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-53", "text": "In the next iteration, the smallest AWD is between c [1, 2] and c 3 ." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-54", "text": "As last step, we merge c [[1, 2] To determine the final bracketing for a given kNC, we use the majority vote of all structures derived from all aligned languages." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-55", "text": "In the case of a tie, AWDB does not produce a final bracketing." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-56", "text": "Although this decision leads to lower coverage, it enables us to measure the pure impact of the cross-lingual word distance feature." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-57", "text": "For practical purposes, an additional back-off model is put in place." 
}, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-58", "text": "In order to mitigate word alignment problems and data sparseness, we additionaly bracket kNCs in a type-based fashion, i.e., we collect all kNC structures of a kNC type from various contexts." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-59", "text": "----------------------------------" }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-60", "text": "**EXPERIMENTS**" }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-61", "text": "Tools and data." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-62", "text": "While AWDB is designed for bracketing NPs of any length, we first experiment with bracketing 3NCs, the largest class of 3 + NCs (93.8% on the basic dataset of Ziering and Van der Plas (2014)), for which bracketing is a binary classification (i.e., LEFT or RIGHT)." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-63", "text": "For bracketing longer NCs we often have to make do with partial information from a language, instead of a full structure." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-64", "text": "In future work, we plan to investigate methods to combine these partial results." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-65", "text": "Moreover, in contrast to previous work (e.g., Vadas and Curran (2007b) ), we take only common nouns as components into account rather than named entities." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-66", "text": "We consider the task of bracketing 3NCs composed of common nouns more ambitious, because named entities often form a single concept that is easy to spot, e.g., Apple II owners." 
}, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-67", "text": "Although AWDB can also process compounds including adjectives (e.g., active inclusion policy aligned to the Dutch beleid voor actieve insluiting (policy for active inclusion)), for a direct comparison with the system of Ziering and Van der Plas (2014) , that analyses 3NCs, we restrict ourselves to noun sequences." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-68", "text": "We use the Europarl 2 compound database 3 developed by Ziering and Van der Plas (2014) ." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-69", "text": "This database has been compiled from the OPUS 4 corpus (Tiedemann, 2012) and comprises ten languages: Danish, Dutch, English, French, German, Greek, Italian, Portuguese, Spanish and Swedish." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-70", "text": "We use the initial version (basic dataset), that contains English word sequences that conform PoS chunks and their alignments." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-71", "text": "We select English word sequences whose PoS pattern conforms three nouns." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-72", "text": "Extraction errors are a problem, since many adjectives have been tagged as nouns and some 3NCs occur as incomplete fragments." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-73", "text": "For increasing the effectiveness of human annotation, we developed a high-confidence noun filter P noun (word) = P (noun | word)." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-74", "text": "It is trained on the English Wikipedia 5 tagged by TreeTagger (Schmid, 1995) ." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-75", "text": "We inspect all 3NCs in the context of one token to the left and right, w 0 {N 1 N 2 N 3 }w 4 ." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-76", "text": "If P noun (N i ) < \u03b8 or P noun (w j ) \u2265 \u03b8, we remove the 3NC from our dataset." 
}, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-77", "text": "We inspected a subset of all 3NCs in the database and estimated the best filter quality to be with \u03b8 = 0.04." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-78", "text": "This threshold discards increasing land abandonment but keeps human rights abuse." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-79", "text": "Our final dataset contains 14,941 tokens and 8824 types." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-80", "text": "Systems in comparison." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-81", "text": "We compare AWDB with the bracketing approach of Ziering and Van der Plas (2014)." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-82", "text": "For both systems, we use the majority vote across all nine aligned languages, in a token-and type-based version." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-83", "text": "We implemented an unsupervised method based on statistics on bi-grams extracted from the English part of the Europarl corpus." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-84", "text": "6 As scorer, we use the Chi squared (\u03c7 2 ) measure, which worked best in previous work (Nakov and Hearst, 2005) ." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-85", "text": "We consider both the adjacency (i.e., (N 1 , N 2 ) vs. (N 2 , N 3 ) , (Marcus, 1980) ) and the dependency (i.e., (N 1 , N 2 ) vs. (N 1 , N 3 ) , (Lauer, 1995) ) model." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-86", "text": "We created a back-off model for the bracketing system of Ziering and Van der Plas (2014) and for AWDB that falls back to using \u03c7 2 if no bracketing structure can be derived (system \u2192 \u03c7 2 )." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-87", "text": "Finally, we compare with a baseline, that always predicts the majority class: LEFT." 
}, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-88", "text": "Human annotation." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-89", "text": "We observed that there is only a very small overlap between test sets of previous work on NP bracketing and the Europarl database." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-90", "text": "The test set used by Ziering and Van der Plas (2014) is very small and the labeling is less fine-grained." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-91", "text": "Thus, we decided to create our own test set." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-92", "text": "2 statmt.org/europarl 3 ims.uni-stuttgart.de/data/NCDatabase.html 4 opus.lingfil.uu.se 5 en.wikipedia.org 6 For a fair comparison, we leave systems that have access to external knowledge, such as web search engines, aside." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-93", "text": "A trained independent annotator classified a set of 1100 tokens in context with one of the following labels: LEFT, RIGHT, EXTRACTION (for extraction errors that survived the high-confidence noun filter P noun (word)), UNDECIDED (for 3NCs that cannot be disambiguated within the one-sentence context) and SEMANTIC INDETERMINATE (for 3NCs for which LEFT and RIGHT have the same meaning such as book price fixing (i.e., price fixing for books is equivalent to fixing of the book price))." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-94", "text": "We consider the full dataset to compare the coverage of the systems in comparison." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-95", "text": "For the accuracy figures, in order to keep annotation efforts small, we asked evaluators to annotate just those tokens that our system provides an answer to, because tokens that our system has no answer to will not be evaluated in the comparative evaluation on accuracy anyhow." 
}, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-96", "text": "Two additional trained independent annotators each classified one half of the dataset for checking inter-annotator agreement." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-97", "text": "For the classes LEFT/RIGHT (308 tokens), we achieve an agreement rate of 90.3% and \u03ba = 0.717 (Cohen, 1960) , which means good agreement (Landis and Koch, 1977) ." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-98", "text": "We use the LEFT/RIGHT consensus of the 3NC tokens as final test set (278 tokens)." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-99", "text": "Evaluation Measure." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-100", "text": "We measure accuracy (Acc \u2126 ) for a set of 3NC tokens, \u2126, as correct LEFT/RIGHT labels divided by all LEFT/RIGHT labels." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-101", "text": "Coverage is measured as LEFT/RIGHT labels divided by all 3NC tokens in the full dataset." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-102", "text": "We refer to the harmonic mean of Acc \u2126 and Coverage as harmonic(\u2126)." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-103", "text": "As it turned out that the adjacency model outperforms the dependency model, we only report results for the first." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-104", "text": "Table 1 presents the coverage of each system, based on the full dataset." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-105", "text": "Our first result is that type-based cross-lingual bracketing outperforms token-based and achieves up to 91.2% in coverage." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-106", "text": "As expected, the system of Ziering and Van der Plas (2014) does not cover more than 48.1%." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-107", "text": "The \u03c7 2 method and the back-off models cover all 3NCs in our dataset." 
}, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-108", "text": "The fact that AWDB type misses 8.8% of the dataset is mainly due to equal distances between aligned words (e.g., crisis resolution mechanism is only aligned to closed compounds, such as the Swedish krisl\u00f6sningsmekanism or to nouns separated by one preposition, such as the Spanish mecanismo de resoluci\u00f3n de crisis)." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-109", "text": "In future work, we will add more languages in the hope to find more variation and thus get an even higher coverage." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-110", "text": "----------------------------------" }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-111", "text": "**RESULTS AND DISCUSSION**" }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-112", "text": "----------------------------------" }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-113", "text": "**SYSTEM**" }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-114", "text": "Acc Table 2 : Direct comparison on common test sets; \u2020 = significantly better than the systems in comparison Table 2 directly compares the systems on common subsets (com), i.e., on 3NCs for which all systems in the set provide a result." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-115", "text": "The main reason why cross-lingual systems make bracketing errors is the quality of automatic word alignment." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-116", "text": "AWDB outperforms Ziering and Van der Plas (2014) significantly 7 ." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-117", "text": "This can be explained with the flexible structure of AWDB, which can exploit more data and is thus more robust to word alignment errors." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-118", "text": "AWDB significantly outperforms \u03c7 2 in accuracy but is inferior in harmonic(com)." 
}, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-119", "text": "The last four lines of Table 2 show all systems with full coverage." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-120", "text": "AWDB's back-off model achieves the best harmonic(com) with 96.6% and an accuracy comparable to human performance." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-121", "text": "For AWDB, types and tokens show the same accuracy." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-122", "text": "The harmonic mean numbers for the system of Ziering and Van der Plas (2014) illustrate that coverage gain of types outweighs a higher accuracy of tokens." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-123", "text": "Our intuition that token-based approaches are superior in accuracy is hardly reflected in the present results." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-124", "text": "We believe that this is due to the domain-specificity of Europarl." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-125", "text": "There are only few instances, where the bracketing of a 3NC type differs from token to token." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-126", "text": "We expect to see a larger difference for general domain parallel corpora." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-127", "text": "Table 3 shows the contribution of the Romance (i.e., French, Italian, Portuguese and Spanish) and Germanic (i.e., Danish, Dutch, German and Swedish) language families for AWDB type ." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-128", "text": "Romance languages have a higher coverage than Germanic languages." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-129", "text": "This is because many 3NCs are aligned to a closed Germanic compound, which gives no information on the internal structure." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-130", "text": "Since Romance languages use open compounds, coverage is higher." 
}, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-131", "text": "On the other hand, Romance languages are worse in accuracy." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-132", "text": "One reason for this is that they can also produce constructions that violate Behaghel (1909) LEFT labels than the total test set." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-133", "text": "Furthermore, many instances of these cases can be disambiguated using morphosyntactic information such as number, e.g., world fishing quotas aligned to the French quotas de p\u00eache mondiaux (quotas pl of fishing sg world adj,pl )." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-134", "text": "As a result, we have 13,622 3NC tokens in context annotated with bracketing structures that are comparable to human annotation." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-135", "text": "The manual annotation by Vadas and Curran (2007a) resulted in 5582 three-word NPs, that were successfully used to train supervised learners." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-136", "text": "----------------------------------" }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-138", "text": "In this paper, we presented a method for the automatic bracketing of k-partite noun compounds by using the surface structure (i.e., various word positions) in parallel translations as supervision." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-139", "text": "In an experiment, we extracted 3NCs from a noun compound database comprising ten languages derived from a parallel corpus." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-140", "text": "Our bracketing system outperforms all systems in comparison with an accuracy of 94.6% and is comparable with human performance." 
}, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-141", "text": "In future work, we will investigate how to combine partial bracketing results from different languages and how to make the approach independent from parallel data." }, { "sent_id": "ff5122ce817d506fbcb269b7ae41fe-C001-142", "text": "Along with this paper, we publish 9 the processed dataset and our test set, which can be used as training and test data for supervised learners." } ], "y": { "@BACK@": { "gold_contexts": [ [ "ff5122ce817d506fbcb269b7ae41fe-C001-26" ], [ "ff5122ce817d506fbcb269b7ae41fe-C001-62" ], [ "ff5122ce817d506fbcb269b7ae41fe-C001-122" ] ], "cite_sentences": [ "ff5122ce817d506fbcb269b7ae41fe-C001-26", "ff5122ce817d506fbcb269b7ae41fe-C001-62", "ff5122ce817d506fbcb269b7ae41fe-C001-122" ] }, "@MOT@": { "gold_contexts": [ [ "ff5122ce817d506fbcb269b7ae41fe-C001-26", "ff5122ce817d506fbcb269b7ae41fe-C001-27", "ff5122ce817d506fbcb269b7ae41fe-C001-29", "ff5122ce817d506fbcb269b7ae41fe-C001-30" ], [ "ff5122ce817d506fbcb269b7ae41fe-C001-90", "ff5122ce817d506fbcb269b7ae41fe-C001-91" ] ], "cite_sentences": [ "ff5122ce817d506fbcb269b7ae41fe-C001-26", "ff5122ce817d506fbcb269b7ae41fe-C001-90" ] }, "@DIF@": { "gold_contexts": [ [ "ff5122ce817d506fbcb269b7ae41fe-C001-36" ], [ "ff5122ce817d506fbcb269b7ae41fe-C001-105", "ff5122ce817d506fbcb269b7ae41fe-C001-106" ], [ "ff5122ce817d506fbcb269b7ae41fe-C001-116" ] ], "cite_sentences": [ "ff5122ce817d506fbcb269b7ae41fe-C001-36", "ff5122ce817d506fbcb269b7ae41fe-C001-106", "ff5122ce817d506fbcb269b7ae41fe-C001-116" ] }, "@USE@": { "gold_contexts": [ [ "ff5122ce817d506fbcb269b7ae41fe-C001-67" ], [ "ff5122ce817d506fbcb269b7ae41fe-C001-68" ], [ "ff5122ce817d506fbcb269b7ae41fe-C001-81" ], [ "ff5122ce817d506fbcb269b7ae41fe-C001-86" ] ], "cite_sentences": [ "ff5122ce817d506fbcb269b7ae41fe-C001-67", "ff5122ce817d506fbcb269b7ae41fe-C001-68", "ff5122ce817d506fbcb269b7ae41fe-C001-81", "ff5122ce817d506fbcb269b7ae41fe-C001-86" ] } } }, 
"ABC_9770f647c0b406462f4b941f136748_36": { "x": [ { "sent_id": "9770f647c0b406462f4b941f136748-C001-30", "text": "----------------------------------" }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-2", "text": "Automatic sentiment analysis in texts has attracted considerable attention in recent years." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-3", "text": "Most of the approaches developed to classify texts or sentences as positive or negative rest on a very specific kind of language resource: emotional lexicons." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-4", "text": "To build these resources, several automatic techniques have been proposed." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-5", "text": "Some of them are based on dictionaries while others use corpora." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-6", "text": "One of the main advantages of the corpora techniques is that they can build lexicons that are tailored for a specific application simply by using a specific corpus." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-7", "text": "Currently, only anecdotal observations and data from other areas of language processing plead in favour of the utility of specific corpora." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-8", "text": "This research aims to test this hypothesis." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-29", "text": "Two techniques that fulfil this condition are described below." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-31", "text": "**SO-LSA**" }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-9", "text": "An experiment based on 702 sentences evaluated by judges shows that automatic techniques developed for estimating the valence from relatively small corpora are more efficient if the corpora used contain texts similar to the one that must be evaluated." 
}, { "sent_id": "9770f647c0b406462f4b941f136748-C001-10", "text": "----------------------------------" }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-11", "text": "**INTRODUCTION**" }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-12", "text": "Automatic sentiment analysis in texts, also called opinion mining, has attracted considerable attention in recent years, primarily because of its potential use in marketing study." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-13", "text": "It aims to answer questions such as 'Is the customer who sent a mail to an after-sales service particularly dissatisfied?' 'Are the opinions about a product posted on blogs positive or negative?' 'What is the image of a political party or leader in the press?'." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-14", "text": "All these questions, which relate to the way something is presented and evaluated in a text, are particularly difficult for traditional information extraction techniques (Das & Chen, 2001; Strapparava & Mihalcea, 2007; Wilks, 1997) ." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-15", "text": "They present, however, many applications like transmitting to the senior members of an after-sales service the mails of which the emotional tone is the most intense." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-16", "text": "Most of the approaches developed to classify texts or sentences as positive or negative rest on unsupervised knowledge-based methods and on a very specific kind of language resource: emotional lexicons (Andreevskaia & Bergler, 2007) ." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-17", "text": "These lexicons contain words tagged with their affective valence (also called affective polarity or semantic orientation) that indicates whether a word conveys a positive or a negative content." 
}, { "sent_id": "9770f647c0b406462f4b941f136748-C001-18", "text": "To build these resources, several automatic techniques have been proposed." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-19", "text": "Some of them are based on dictionaries and lexical databases (e.g. Esuli & Sebastiani, 2006; Kamps et al., 2004; Kim & Hovy, 2004) , while others use corpora (e.g. Bestgen, 2002; Hatzivassiloglou & McKeown, 1997; Sahlgren et al., 2007; Turney & Littman, 2002 ." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-20", "text": "One of the main advantages of the corpora techniques is that they can build lexicons that are tailored to a specific application simply by using a specific corpus." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-21", "text": "Currently, only anecdotal observations and data from other areas of language processing plead in favour of the utility of specific corpora (Bestgen, 2002 (Bestgen, , 2006 ." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-22", "text": "This research aims at testing this hypothesis explicitly." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-23", "text": "----------------------------------" }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-24", "text": "**DETERMINING THE AFFECTIVE VALENCE OF WORDS FROM SMALL CORPORA**" }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-25", "text": "To my knowledge, the first researchers to propose an automatic procedure to determine the valence of words on the basis of corpora are Hatzivassiloglou and McKeown (1997) ." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-26", "text": "Their algorithm aims to infer the semantic orientation of adjectives on the basis of an analysis of their co-occurrences with conjunctions." 
}, { "sent_id": "9770f647c0b406462f4b941f136748-C001-27", "text": "The main limitation of their algorithm is that it was developed specifically for adjectives and that the question of its application to other grammatical categories has not been solved (Turney & Littman, 2003) ." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-28", "text": "If several other techniques have been proposed to determine affective valence from corpora, only a few of them have been designed to work with relatively small corpora (ten million words or fewer), a necessary property for building specific affective lexicons." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-32", "text": "The technique proposed by Turney and Littman (2003) tries to infer semantic orientation from semantic association in a corpus." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-33", "text": "It is based on the semantic proximity between a target word and fourteen benchmarks: seven with positive valence and seven with negative valence (see Table 1 )." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-34", "text": "A word is considered as positive if it is closer to the positive benchmarks and further away from the negative benchmarks." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-35", "text": "Turney and Littman proposed two techniques for estimating the strength of the semantic association between words on the basis of corpora." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-36", "text": "The first technique estimates the semantic proximity between a word and a benchmark on the basis of the frequency with which they co-occur." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-37", "text": "Its main limitation is that it requires a very large corpus to be effective." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-38", "text": "Turney and Littman (2003) Turney and Littman (2003) ." 
}, { "sent_id": "9770f647c0b406462f4b941f136748-C001-39", "text": "For relatively small corpora (i.e. ten million words), they recommend the use of Latent Semantic Analysis (LSA), a mathematical technique for extracting a very large 'semantic space' from large text corpora on the basis of the statistical analysis of the set of co-occurrences in a text corpus (Deerwester et al., 1990; Landauer, Foltz & Laham, 1998) ." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-40", "text": "The point of departure of the analysis is a lexical table (Lebart & Salem, 1992) containing the frequencies of every word in each of the documents included in the text material, a document being a text, a paragraph or a sentence." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-41", "text": "To derive semantic relations between words from the lexical table the analysis of mere co-occurrences will not do, the major problem being that even in a large corpus most words are relatively rare." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-42", "text": "Consequently, the co-occurrences of words are even rarer." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-43", "text": "This fact makes such co-occurrences very sensitive to arbitrary variations (Burgess, Livesay & Lund, 1998; Kintsch, 2001; Rajman & Besan\u00e7on, 1997) ." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-44", "text": "LSA resolves this problem by replacing the original frequency table with an approximation producing a kind of smoothing effect on the associations." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-45", "text": "To this end, the frequency table undergoes singular value decomposition and it is then recomposed on the basis of only a fraction of the information it contains." 
}, { "sent_id": "9770f647c0b406462f4b941f136748-C001-46", "text": "Thus, the thousands of words from the documents have been substituted by linear combinations or 'semantic dimensions' with respect to which the original words can be situated again." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-47", "text": "Contrary to a classical factor analysis the extracted dimensions are very numerous and non-interpretable." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-48", "text": "One could, however, compare them to semantic features describing the meaning of words (Landauer et al., 1998) ." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-49", "text": "In this semantic space, the meaning of a word is represented by a vector." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-50", "text": "To determine the semantic proximity between two words, the cosine between their corresponding vectors is calculated." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-51", "text": "The more two words (or one word and a benchmark) are semantically similar, the more their vectors point in the same direction, and consequently the closer their cosine will be to one (which corresponds to coinciding vectors)." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-96", "text": "The Soir1995 corpus is most similar to the test materials." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-52", "text": "A cosine of zero shows an absence of similarity, since the corresponding vectors point in orthogonal directions." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-53", "text": "The emotional valence of a word corresponds to the sum of the cosine between this word and the positive benchmarks minus the sum of the cosine between it and the negative benchmarks." 
}, { "sent_id": "9770f647c0b406462f4b941f136748-C001-54", "text": "Turney and Littman evaluated the effectiveness of their technique by comparing the predicted orientation of words with that defined in the General Inquirer Lexicon (Stone et al., 1966) , which contains a list of 3596 English words labelled as positive or negative." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-55", "text": "Calculated on the basis of a corpus of ten million words, PR-LSA labels 65% of the words correctly." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-56", "text": "----------------------------------" }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-57", "text": "**DI-LSA**" }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-58", "text": "The technique proposed independently by Bestgen (2002) , DI-LSA, is very similar to the one proposed by Turney and Littman." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-59", "text": "The main difference is at the level of the benchmarks used to evaluate a word." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-60", "text": "While SO-LSA uses a few benchmarks selected a priori, DI-LSA is based on lexicons that contain several hundred words rated by judges on the pleasant-unpleasant scale This kind of lexicon was initially developed in the field of content analysis." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-61", "text": "As early as 1965, Heise proposed to constitute a valence dictionary by asking judges to rate a sample of the most frequent English words on the pleasant-unpleasant scale." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-62", "text": "Since then, lexicons for various languages have been made up (Hogenraad et al., 1995; Whissell et al., 1986) ." 
}, { "sent_id": "9770f647c0b406462f4b941f136748-C001-63", "text": "As an example, Table 2 shows the evaluation scores of several words randomly extracted from the dictionary and used in the present study (Hogenraad, Bestgen & Nysten, 1995 Table 2 ." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-64", "text": "Emotional valences of several words on a scale from very unpleasant (1.0) to very pleasant (7.0)." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-65", "text": "To determine the emotional valence of a word on the basis of the words with which it co-occurs in a corpus, a specific whole set of benchmarks is selected from the lexicon." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-66", "text": "More precisely, the unknown valence of a word corresponds to the average valence of its thirty closer neighbours, neighbourhood being identified on the basis of the cosine in the semantic space extracted by LSA." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-67", "text": "To evaluate this index, Bestgen (2002) compared the predicted values for words with their actual values according to the dictionary and obtained correlations ranging from 0.56 to 0.70." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-68", "text": "He also showed that taking into account the thirty closer neighbours yields a better estimate than taking into account only five neighbours." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-69", "text": "----------------------------------" }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-70", "text": "**EXPERIMENT**" }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-71", "text": "This experiment aims to determine the effect on an automatic sentiment analysis task of the similarity between test materials and a corpus from which an affective lexicon is extracted." 
}, { "sent_id": "9770f647c0b406462f4b941f136748-C001-72", "text": "The materials for this sentence-level classification experiment (Riloff, Wiebe, & Phillips, 2005) Participants read, individually and in a different random order, the 702 sentences of the corpus on a computer screen." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-73", "text": "The sentences were successively displayed just above the rating scale." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-74", "text": "Participants gave their ratings by clicking on the button corresponding to the level of pleasantness they felt." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-75", "text": "A 'validate' button enabled them to confirm their choice and to start processing the next sentence." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-76", "text": "Participants could pause at any time." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-77", "text": "The instructions specified that a break of at least one hour was to be taken around the middle of the task." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-78", "text": "On average, the participants took fifteen seconds to rate each sentence." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-79", "text": "Table 3 provides some examples of the sentences and their emotional valence." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-80", "text": "The inter-rater agreement, computed by means of Cronbach's alpha, was very high (0.93)." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-81", "text": "The average correlation between the ratings of one participant and the average ratings of all the other participants was 0.75 (the leave-one-out technique)." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-82", "text": "The average correlation between the ratings provided by two participants was 0.60." 
}, { "sent_id": "9770f647c0b406462f4b941f136748-C001-83", "text": "A detailed presentation of the procedure used to build the materials is given in Bestgen, Fairon and Kevers (2004) ." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-84", "text": "----------------------------------" }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-85", "text": "**METHOD**" }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-86", "text": "The two techniques described above were used in this experiment." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-87", "text": "The fourteen SO-LSA benchmarks chosen by Turney and Littman (2003) were translated into 2 Each sentence was automatically modified so as to replace the name and the description of the function of every individual by a generic first name of adequate sex (Mary, John, etc.) in order to prevent the judges being influenced by their prior positive or negative opinion about these people." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-88", "text": "French (bon, gentil, excellent, positif, heureux, correct et sup\u00e9rieur: mauvais, m\u00e9chant, m\u00e9diocre, n\u00e9gatif, malheureux, faux et inf\u00e9rieur) ." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-89", "text": "For DI-LSA, a French lexicon made up of 3000 words evaluated on the pleasant-unpleasant scale was used (Hogenraad et al., 1995) ." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-90", "text": "A minimum of thirty judges rated the words on a seven-point scale from 'very unpleasant' (1) Table 3 : Emotional valences of several sentences." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-91", "text": "Three corpora of five million words each, varying in similarity to the test materials, were used to estimate the proximity between the words and the benchmarks: -Soir1995 Corpus." 
}, { "sent_id": "9770f647c0b406462f4b941f136748-C001-92", "text": "This includes newspaper articles published in Le Soir during the early months of 1995: that is, the period from which the target sentences were extracted. -Soir1997 Corpus." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-93", "text": "A comparable corpus was built from the articles published in Le Soir during the early months of 1997." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-94", "text": "-Literary Corpus." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-95", "text": "A literary corpus of texts was built from novels and short stories available on the Web (mainly in the literary Web databases ABU and Frantext)." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-97", "text": "The Soir1997 corpus includes texts from the same source as the test materials, but from a later period." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-98", "text": "The Literary corpus contains texts from a very different genre: it is the least similar to the test materials." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-99", "text": "To be able to compare these three corpora in a fair way, the three semantic spaces were extracted, one from each corpus, according to an identical procedure adapted from Bestgen (2006) ." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-100", "text": "These corpora were subdivided into segments of 125 words." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-101", "text": "All the words of a segment had to come from the same text." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-102", "text": "All the segments of fewer than 125 words (articles of small size and the last incomplete segment of a text) were removed." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-103", "text": "These rules produced 40635 segments for the Literary corpus and more than 50000 for the other two corpora." 
}, { "sent_id": "9770f647c0b406462f4b941f136748-C001-104", "text": "In order to be able to compare corpora of different types, but of same sizes, only the first 40635 segments of the Soir1995 and Soir1997 corpora were taken into account." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-105", "text": "Singular value decomposition was realised with the program SVDPACKC (Berry, 1992) , and the first 300 singular vectors were retained." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-106", "text": "----------------------------------" }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-107", "text": "**RESULTS**" }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-108", "text": "The predicted valences, corresponding to the average valence of the words belonging to the sentence were compared with the judges' ratings by means of Pearson's correlation coefficients (see Table 4 )." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-109", "text": "Two levels of reference to measure the effectiveness of the techniques are given by previous analyses of the test materials (Bestgen et al., 2004) ." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-110", "text": "First, a correlation of 0.39 was obtained between the judges' ratings of the sentences and those based on the original lexicon of 3000 words, a value statistically significant (p<0.0001)." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-111", "text": "In order to determine the effectiveness of a lexicon which takes into account all the words included in the sentences, two judges were asked to decide if each word present in the sentences, but absent from the original lexicon, was positive, neutral or negative." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-112", "text": "The correlation between the sentence ratings and that obtained on the basis of this exhaustive dictionary was 0.56." 
}, { "sent_id": "9770f647c0b406462f4b941f136748-C001-113", "text": "The most important result has a bearing on the large difference in efficiency between the three corpora used to compute the word-benchmark similarities." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-114", "text": "Both techniques are far more efficient when the word's affective valence is estimated from the semantic proximities in corpora that contain texts very similar to the one from which the test materials have been extracted." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-115", "text": "Interestingly, there is little difference between the Soir1995 and the Soir1997 corpora, leading to the conclusion that it is the genre or the source of the texts that matters and not the fact that the test materials and the semantic space were extracted from the very same texts." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-116", "text": "As regards the difference between the techniques, DI-LSA outperforms SO-LSA as well as the original lexicon." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-117", "text": "It even almost reaches the level of efficiency of the manually-expanded lexicon." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-118", "text": "If SO-LSA only weakly outperforms the lexicon approach, this performance is notable because SO-LSA is based on only fourteen benchmarks while the original lexicon includes 3000 words evaluated by numerous judges." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-119", "text": "Table 4 : Correlations between the sentence valence as estimated by the judges' ratings and by the two techniques on the basis of the three corpora." 
}, { "sent_id": "9770f647c0b406462f4b941f136748-C001-120", "text": "----------------------------------" }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-121", "text": "**CONCLUSION**" }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-122", "text": "The experiment reported above shows that automatic techniques developed for estimating the valence of words from relatively small corpora (five million words) are more efficient if the corpora used contain texts similar to the one that must be evaluated." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-123", "text": "Obviously, the beneficial effect of using a corpus similar to the test materials would have been more strongly supported if the opposite demonstration could have been carried out: the Literary corpus should outperform the newspaper ones when the test materials are made up of sentences extracted from literary texts." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-124", "text": "More generally, it seems that in the present experiment we are close to the maximum effectiveness of the lexical approach to evaluating sentences, since the automatic technique is nearly as effective as the traditional approach based on an exhaustive dictionary." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-125", "text": "The correlation between the predicted valences and the valences obtained from judges, however, is just higher than 0.50." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-126", "text": "If one wishes to go beyond this level of efficiency, it is probably essential to combine lexical information and more complex linguistic analyses." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-127", "text": "The 'simplistic' character of an approach based solely on the words considered individually has been strongly criticised (Bestgen, 1994; Pang et al., 2002; Polanyi & Zaenen, 2003) ." 
}, { "sent_id": "9770f647c0b406462f4b941f136748-C001-128", "text": "For example, Polanyi and Zaenen (2003) underline the need to take into account negations but also some connectors ('Although Boris is brilliant at maths, he is a horrific teacher') and the modal operators ('If Mary were a terrible person, she would be mean to her dogs')." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-129", "text": "It is, however, noteworthy that these criticisms of the lexical approach do not reject it but underline the need to supplement it." }, { "sent_id": "9770f647c0b406462f4b941f136748-C001-130", "text": "Having the most powerful lexical indices possible is a prerequisite for following this new avenue of research." } ], "y": { "@MOT@": { "gold_contexts": [ [ "9770f647c0b406462f4b941f136748-C001-27" ] ], "cite_sentences": [ "9770f647c0b406462f4b941f136748-C001-27" ] }, "@BACK@": { "gold_contexts": [ [ "9770f647c0b406462f4b941f136748-C001-32" ] ], "cite_sentences": [ "9770f647c0b406462f4b941f136748-C001-32" ] }, "@SIM@": { "gold_contexts": [ [ "9770f647c0b406462f4b941f136748-C001-86", "9770f647c0b406462f4b941f136748-C001-87" ] ], "cite_sentences": [ "9770f647c0b406462f4b941f136748-C001-87" ] } } }, "ABC_3a7625a0f38424fe922ad095e07e68_36": { "x": [ { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-2", "text": "Data augmentation is widely used to train deep neural networks for image classification tasks." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-3", "text": "Simply flipping images can help learning by increasing the number of training images by a factor of two." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-4", "text": "However, data augmentation in natural language processing is much less studied." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-5", "text": "Here, we describe two methods for data augmentation for Visual Question Answering (VQA)." 
}, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-6", "text": "The first uses existing semantic annotations to generate new questions." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-7", "text": "The second method is a generative approach using recurrent neural networks." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-8", "text": "Experiments show the proposed schemes improve performance of baseline and state-of-the-art VQA algorithms." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-9", "text": "----------------------------------" }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-11", "text": "In recent years, both computer vision and natural language processing (NLP) have made enormous progress on many problems using deep learning." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-12", "text": "Visual question answering (VQA) is a problem that fuses computer vision and NLP to build upon these successes." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-13", "text": "In VQA, an algorithm is given an image and a question about the image, and it predicts the answer to the question (Malinowski and Fritz, 2014; Antol et al., 2015) ." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-14", "text": "Although progress has been rapid, there is still a significant gap between the performance of the best VQA systems and humans." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-15", "text": "For example, on the open-ended 'The VQA Dataset' that uses real images, the best systems in 2016 are at around 65% accuracy (e.g., Fukui et al. (2016) ) compared to 83% for humans (Antol et al., 2015) ." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-16", "text": "Analysis of VQA algorithm performance as a function of the amount of training data show that existing algorithms would benefit greatly from more training data (Kafle and * Corresponding author." 
}, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-17", "text": "Kanan, 2017)." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-18", "text": "One way to address this would be to annotate additional questions about images, but this is time-consuming and expensive." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-19", "text": "Data augmentation is a much cheaper alternative." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-20", "text": "Data augmentation is generating new training data from existing examples." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-21", "text": "In this paper, we explore two data augmentation methods for generating new question-answer (QA) pairs for images." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-22", "text": "The first method uses existing semantic annotations and templates to generate QA pairs, similar to the method in Kafle and Kanan (2017) ." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-23", "text": "The second method is a generative approach using a recurrent neural network (RNN)." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-24", "text": "Fig. 1 shows an example image from 'The VQA Dataset' along with the original questions and the questions generated using our methods." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-25", "text": "Our methods improve the variety and the number of questions for the image." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-26", "text": "We evaluate how well each augmentation method performs on two VQA datasets." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-27", "text": "Our results show that augmentation increases performance for both datasets." 
}, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-28", "text": "----------------------------------" }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-29", "text": "**RELATED WORK**" }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-30", "text": "For supervised computer vision problems, e.g., image recognition, labels are scarcer than images." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-31", "text": "This is especially a problem with deep convolutional neural networks (CNNs) that have millions of parameters." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-32", "text": "Although more human labeled data would be ideal, it is easier to exploit the training dataset to generate new examples." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-33", "text": "For image classification, common ways to exploit training images to create more labeled examples include mirror reflection, random crops etc." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-34", "text": "Many of these methods were used in training the seminal AlexNet (Krizhevsky et al., 2012) , which increased the training data by more than ten folds and produced relative improvement of over 4% for image classification." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-35", "text": "Compared to vision, where augmentation is common, little work has been done on augmenting text for classification problems." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-36", "text": "A notable exception is Zhang et al. (2015) , where a thesaurus was used to replace synonymous words to create more training data for text classification." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-37", "text": "However, this augmentation produced little improvement and sometimes even hurt performance." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-38", "text": "The authors' argued that because large quantities of real data are available, models generalize properly without augmentation." 
}, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-39", "text": "Although training using augmented text data is rare, generating new questions about images has been studied." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-40", "text": "The COCO-QA dataset (Ren et al., 2015) for VQA was created by parsing COCO captions with a syntactic parser, and then used this to create QA pairs for four kinds of questions using hand-crafted rules." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-67", "text": "The template data augmentation method uses the semantic segmentation annotations in COCO to generate new QA pairs." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-41", "text": "However, due to inability of the algorithm to cope with complex sentence structures, a significant portion of COCO-QA questions have grammatical errors or are oddly phrased." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-42", "text": "Visual question generation was also studied in (Mostafazadeh et al., 2016) , with an emphasis on generating questions about images that are beyond the literal visual content of the image." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-43", "text": "They endeavored to avoid simple questions such as counting and color, which were emphasized in COCO-QA." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-44", "text": "Unlike our work, their objective was not data augmentation and they did not try to answer the generated questions." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-45", "text": "----------------------------------" }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-46", "text": "**DATASETS AND ALGORITHMS FOR VQA**" }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-47", "text": "We conduct experiments on two of the most popular VQA datasets: 'The VQA Dataset' (Antol et al., 2015) and COCO-QA (Ren et al., 2015) ." 
}, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-48", "text": "'The VQA Dataset' is currently the most popular VQA dataset and it contains both synthetic and real-world images." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-49", "text": "The real-world images are from the COCO dataset (Lin et al., 2014) ." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-50", "text": "All questions were generated by human annotators." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-51", "text": "We refer to this portion as COCO-VQA, and use it for our experiments." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-52", "text": "COCO-QA (Ren et al., 2015) also uses images from COCO, with the questions generated by an NLP algorithm that uses COCO's captions." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-53", "text": "All questions belong to four categories: object, number, color, and location." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-54", "text": "Many algorithms have been proposed for VQA." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-55", "text": "Some notable formulations include attention based methods Xiong et al., 2016; Lu et al., 2016; Fukui et al., 2016) , Bayesian frameworks (Kafle and Kanan, 2016; Malinowski and Fritz, 2014) , and compositional approaches (Andreas et al., 2016a,b) ." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-56", "text": "Detailed reviews of existing methods can be found in Kafle and Kanan (2017) and Wu et al. (2016) ." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-57", "text": "However, simpler models such as linear classifiers and multilayer perceptrons (MLPs) perform only slightly worse on many VQA datasets." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-58", "text": "These baseline methods predict the answer using a vector of image features concatenated to a vector of question features (Ren et al., 2015; Zhou et al., 2015; Kafle and Kanan, 2016) ." 
}, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-59", "text": "We use the MLP model to conduct the bulk of the experiments, but we show that the proposed method is also effective on more sophisticated VQA systems like multimodal compact bilinear pooling (MCB) (Fukui et al., 2016) ." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-60", "text": "----------------------------------" }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-61", "text": "**METHODS FOR DATA AUGMENTATION**" }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-62", "text": "The impact of using data augmentation to improve VQA has not been studied." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-63", "text": "We propose two methods for generating QA pairs about images: 1) a template based generation method that uses image annotations and 2) a long short term memory (LSTM) based language model." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-64", "text": "The number of questions generated using both methods are shown in Table 1 ." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-65", "text": "----------------------------------" }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-66", "text": "**TEMPLATE AUGMENTATION METHOD**" }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-68", "text": "COCO contains detailed segmentation annotations with labels for 80 objects typically found in the images." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-69", "text": "We synthesize four kinds of questions from the COCO annotations: yes/no, counting, object recognition, scene, activity and sport recognition." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-70", "text": "Yes/No Questions: First, we make a list of the Counting Questions: To make counting questions, we count the number of separate annotations of all the objects of a particular category that have an area greater than 2000 pixels, and ask 12 variations of a counting question template." 
}, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-71", "text": "Object Recognition Questions: Object recognition questions such as 'What is in the picture?' can be ambiguous because multiple objects may be present." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-72", "text": "So, we ask questions about COCO 'super-categories' (e.g., 'food,' 'furniture,' 'vehicle,' etc.) to specify the type of object in the question." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-73", "text": "However, ambiguity may persist if there are multiple objects belonging to same supercategory." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-74", "text": "For example, 'What vehicles are shown in the photo?' becomes ambiguous if both 'cars' and 'trucks' are present." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-75", "text": "So, we ensure only a single object of a supercategory is present before asking a recognition question." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-76", "text": "We use 12 variations of 'What SUPERCATEGORY is in the image?' Scene and Activity Questions: If a object in an image belongs to the COCO supercategory indoor or outdoor, we generate questions such as 'Is this indoor or outdoors?' Similarly, we ask about different rooms in the house by identifying the common objects in the room." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-77", "text": "For example, if there are least two common kitchen appliances in the picture(e.g., toaster, microwave, etc.), then we infer the room is a kitchen and ask 'What room is this?' with 'kitchen' as the answer." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-78", "text": "We employ similar strategies for 'living room' and 'bathroom.' We used six variations for 'indoor/outdoor' questions and four variations for room classification questions." 
}, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-79", "text": "For sports, we check if any sports equipment is present in the image and generate a question about the type of sport being depicted in the image." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-80", "text": "We use four variations of questions to ask about each of the six common sports activities." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-81", "text": "----------------------------------" }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-82", "text": "**LSTM AUGMENTATION METHOD**" }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-83", "text": "One major issues with our template-based augmentation method is that the questions are rigid and may not closely resemble the way questions are typically posed in the VQA dataset." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-84", "text": "To address this, we train a stacked LSTM that generates questions about images." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-85", "text": "The network consists of two LSTM layers each with 1000 hidden units followed by two fully connected layers, with 7000 units each, which is the size of our vocabulary constructed by tokening training questions into individual words." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-86", "text": "The first fully connected layer has a ReLU activation function, while the second layer has the 7000-way softmax." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-87", "text": "The output question is produced one word at a time until the \u00a1end-ofquestion\u00bf token." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-88", "text": "The network is trained using the COCO-VQA training data." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-89", "text": "During the generation process, we start by passing the \u00a1start-question\u00bf token concatenated with the image features." 
}, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-90", "text": "To predict the next word, we sample from a multinomial distribution characterized by the prediction probabilities." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-91", "text": "Sometimes such sampling generates questions unrelated to image content." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-92", "text": "To compensate for this, we repeat the sampling for every word multiple times and pick the word occurring most frequently." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-93", "text": "We then generate 30 initial questions per image, and only retain the 3 most frequent questions." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-94", "text": "Any generated question that already exists in the original dataset is removed." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-95", "text": "We use the MLP VQA method described in Sec. 3 to create answers for the generated questions, but it is trained without augmented data." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-96", "text": "Used alone, this can produce many incorrect answers." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-97", "text": "To mitigate this problem, we tried to identify the kinds of questions the MLP VQA algo- rithm tends to get correct." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-98", "text": "To do this, we use kmeans to cluster the training question features concatenated to a one-hot vector with the answer for each question type (k = 25)." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-99", "text": "We assign each validation QA pair to one of these clusters and compute each cluster's accuracy." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-100", "text": "QA pairs assigned to clusters that have a validation accuracy of less than 70% are removed from the dataset." 
}, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-101", "text": "----------------------------------" }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-102", "text": "**EXPERIMENTS AND RESULTS**" }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-103", "text": "First, we use the simple MLP baseline model used in Kafle and Kanan (2016) to assess the two data augmentation methods." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-104", "text": "Kafle and Kanan (2016) showed that MLP worked well across multiple datasets despite its simplicity." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-105", "text": "The MLP model treats VQA as a classification problem with concatenated image and question features given to the model as features and answers as categories." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-106", "text": "CNN features from ResNet-152 and the skip-thought vectors (Kiros et al., 2015) are used as image and question features respectively." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-107", "text": "We evaluate the MLP model on COCO-VQA and COCO-QA datasets." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-108", "text": "For COCO-QA, we excluded all the augmented QA pairs derived from COCO's validation images during training, as the test portion of COCO-QA contains questions for these images." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-109", "text": "Table 2 shows the results for the MLP model when trained with and without augmentation." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-110", "text": "Some examples for the model trained with augmentation are are shown in Fig. 2 ." 
}, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-111", "text": "Next, to demonstrate that the data augmentation scheme also helps improve more complex models, we train the state-of-the-art MCB model with attention (MCB+Att.+GloVe) (Fukui et al., 2016) with the template augmentation and compare the accuracy when the same model trained only on the COCO-VQA dataset (Table 3) ." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-112", "text": "Referring to Table 2 , we can see that both forms of augmentation improved accuracy on COCO-QA compared to the baseline, and the templatebased approach worked better than LSTM." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-113", "text": "For COCO-VQA, the template-based augmentation helped considerably, producing a relative increase of 1.6% compared to when it was not used." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-114", "text": "We did not observe an improvement from using the LSTM method, perhaps due to label noise." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-115", "text": "While we tried to mitigate label noise by rejecting QA pairs that were likely to be wrong, this was not sufficient." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-116", "text": "We are exploring alternative training methods that are robust to label noise (e.g., Reed et al. (2014) ) to help improve results using LSTM." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-117", "text": "Additionally, we also evaluated which types of questions benefit the most from data augmentation." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-118", "text": "For the MLP model trained on COCO-VQA with the template augmentation, counting category answer is improved the most (1.74%), followed by others (1.01%), and yes/no (0.7%)." 
}, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-119", "text": "The results are promising and demonstrate that VQA algorithms can benefit from data augmentation, even for hard question types like counting." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-120", "text": "Furthermore, there is a lot of room for expansion in both the LSTM and the template based methods to produce a larger number and variety of questions." }, { "sent_id": "3a7625a0f38424fe922ad095e07e68-C001-121", "text": "Template augmentation worked best in our experiments, but if we can control for label noise, the LSTM method can be more flexible than the template method, and could be used to generate virtually unlimited amount of training data using images from the Internet." } ], "y": { "@BACK@": { "gold_contexts": [ [ "3a7625a0f38424fe922ad095e07e68-C001-40" ], [ "3a7625a0f38424fe922ad095e07e68-C001-52" ], [ "3a7625a0f38424fe922ad095e07e68-C001-58" ] ], "cite_sentences": [ "3a7625a0f38424fe922ad095e07e68-C001-40", "3a7625a0f38424fe922ad095e07e68-C001-52", "3a7625a0f38424fe922ad095e07e68-C001-58" ] }, "@MOT@": { "gold_contexts": [ [ "3a7625a0f38424fe922ad095e07e68-C001-40", "3a7625a0f38424fe922ad095e07e68-C001-41" ] ], "cite_sentences": [ "3a7625a0f38424fe922ad095e07e68-C001-40" ] }, "@USE@": { "gold_contexts": [ [ "3a7625a0f38424fe922ad095e07e68-C001-47" ] ], "cite_sentences": [ "3a7625a0f38424fe922ad095e07e68-C001-47" ] } } }, "ABC_8ace0627a085efd0cf0ccc211c556f_36": { "x": [ { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-2", "text": "We propose a way to use a transformer-based language model in conversational speech recognition." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-3", "text": "Specifically, we focus on decoding efficiently in a weighted finite-state transducer framework." 
}, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-4", "text": "We showcase an approach to lattice re-scoring that allows for longer range history captured by a transfomerbased language model and takes advantage of a transformer's ability to avoid computing sequentially." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-5", "text": "----------------------------------" }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-7", "text": "In conversational speech, individual utterances may reference context from previous utterances." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-8", "text": "If certain topics or words have been mentioned in the past, related words are likely to be used." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-9", "text": "In automatic speech recognition, language models are responsible for capturing the probabilities of words likely to be uttered given past words." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-10", "text": "Traditionally, these probabilities are captured in an n-gram language model." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-11", "text": "For example, in a trigram language model, we would store a mapping of all 3-word combinations found in our training corpus, along with the probabilities of the third word following the previous two words." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-12", "text": "The past \"history\" that a trigram language model would capture is limited to two words." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-13", "text": "It is also limited in its ability to capture semantics." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-14", "text": "With the advent of more computational power, neural-based methods for language modeling, like the Recurrent Neural Network (RNN) architecture have become possible." 
}, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-15", "text": "In neural architectures, words are typically represented as word embeddings, n-dimensional vectors that attempt to capture the semantics in the latent space." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-16", "text": "Due to its recurrent set-up, the RNN-based language model is able to encapsulate previous word embeddings in its hidden state (see Figure 1 )." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-17", "text": "The hidden state at each step represents a single value computed from a series of matrix operations on current word embedding and the hidden state from the previous step." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-18", "text": "However, with a long sequence, RNNs on their own suffer from vanishing gradients, Long Short-Term Memory (LSTM) units remedy this by introducing additional operations for computing each hidden state [1] ." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-19", "text": "Despite their name, they fall short of capturing word dependencies past 200 words Fig. 1 ." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-20", "text": "A recurrent neural network predicting the probability of the next word using hidden states and focus on past 50 words more heavily [2] ." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-21", "text": "The transformer architecture, however, does not suffer from this problem." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-22", "text": "It does away with recurrence in favor of an attention mechanism [3] ." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-23", "text": "The attention mechanism allows the network to learn a weighted average to determine which words are important at at each position." 
}, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-24", "text": "The network can capture much longer dependencies as all previous words are encoded and passed into the multi-head attention layers in parallel (see Figure 2 )." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-25", "text": "The multi-head attention layer learns to reference words from the beginning of a very long context." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-26", "text": "This is a desirable attribute that we'd like to apply to conversational speech recognition, however, there is one problem: many transformer-based architectures take in fixed size input [4] ." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-27", "text": "Any change to the input, whether it be adding another word, or changing the last word in the input would require recomputing all values of the network." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-28", "text": "This can be prohibitive in the case of speech recognition as it would require the re-computation of all word inputs from the beginning of our speech context for every new utterance that we attempt to re-score." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-29", "text": "The transformer-XL architecture solves this by introducing a segment-level recurrence mechanism [5] ." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-30", "text": "In this work we use the Kaldi framework [6] and a transfomer-XL architecture to do efficient lattice re-scoring." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-31", "text": "For each new utterance, we cache the segment-level embeddings for use in future utterance lattice re-scoring." 
}, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-32", "text": "arXiv:2001.01140v1 [cs.CL] 4 Jan 2020" }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-33", "text": "----------------------------------" }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-34", "text": "**PAST WORK**" }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-35", "text": "Typically, due to Kaldi\u00e2\u0202\u0179s use of weighted finite state transducers, the fast decoding techniques require that a language model is expanded as a finite state transducer [7] ." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-36", "text": "This is difficult to do with an RNNLM as it requires approximating the RNN's probability distribution and converting into the a deterministic FST form." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-37", "text": "There have been attempts to sample an RNNLM to produce an equivalent FST, however the accuracy performance was equivalent to that of a bigram language model [8] ." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-38", "text": "Because of this, first-pass decoding is done using an n-gram language model." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-39", "text": "After first-pass decoding with an n-gram language model, a second-pass re-scoring is done using an RNNLM." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-40", "text": "However, this can be slow, so n-gram approximation is used, but this limits the history that an RNN has available to it." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-41", "text": "[9] proposes a pruned approach which allows increasing the n-gram approximation to more n-grams but without degrading performance." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-42", "text": "In other works to handle long range dependencies, [10] use an RNNLM with a \"conversation cache\" and DNN-based adaptation technique." 
}, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-43", "text": "The \"conversations cache\" is a count of seen unigrams." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-44", "text": "The cache is used to modify unigram priors before RNNLM re-scoring." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-45", "text": "----------------------------------" }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-46", "text": "**APPROACH**" }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-47", "text": "We use XLNet [11] to attempt to capture long range language dependencies." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-48", "text": "At the time of this writing, XLNet provides the best accuracy for many downstream tasks that require language modeling pre-training, including question-answering, text classification, and other natural language understanding tasks." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-49", "text": "We also attempt to take advantage of a transformer's parallel properties to make some performance optimizations when re-scoring our lattices." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-50", "text": "----------------------------------" }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-51", "text": "**WHY XLNET OVER BERT?**" }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-52", "text": "XLNet is a generalized auto-regressive model that can be used for language modeling based on the transformer-XL architecture [11] ." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-53", "text": "This means that the outputs of XLNet depend strictly on the previous outputs." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-54", "text": "This is different from other state-of-the-art language models like BERT (Bidirectional Encoder Representations from Transformers) which rely on conditioning the probabilities given surrounding words." 
}, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-55", "text": "In BERT, the model tries to predict a masked word by looking at all surrounding unmasked words ( figure 3 )." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-56", "text": "The concept of Fig. 3 ." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-57", "text": "BERT attempts to predict the masked word use both left and right contexts." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-58", "text": "During training, a certain percentage of words are masked for use in prediction." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-59", "text": "If both \"San\" and \"Francisco\" were masked, BERT would not be able to use information when decoding one of the words to help in decoding the other." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-60", "text": "masking the input introduces a few disadvantages mentioned in [11] ." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-61", "text": "Firstly, a masked token is rarely seen for most subsequent language modeling tasks, so there tends to be a discrepancy between the \"pre-taining\" step and the \"fine-tuning\" step." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-62", "text": "Typically, in the \"fine-tuning\" step, BERT is adapted to attempt tasks like question-answering." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-63", "text": "Secondly, BERT does not use information from one decoded masked token to help in decoding another masked token." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-64", "text": "In other words, all masked tokens are assumed to be independent." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-65", "text": "This is necessary in BERT, because there is a strict separation between unmasked and masked tokens, as the masked tokens will be predicted." 
}, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-66", "text": "In XLNet, however, the separation is a directional one: anything to the left of the word that is attempting to be predicted is fair game (during training, words are reordered to get the benefit of surrounding context, but conceptually, orders are seen from left to right for the factorization order)." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-67", "text": "This lends itself well to decoding in speech recognition as we typically re-score a lattice from left to right (assuming you are visualizing a lattice in english), while we prune low scoring results." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-112", "text": "We run a transfer learning step using PyTorch on the TED-LIUM dataset." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-68", "text": "This does not mean, however, that the encodings for the words do not capture context from surrounding words." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-69", "text": "With permutation language modeling, during training, the ordering of previous tokens can be modified (figure 4)." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-70", "text": "With permutation language modeling, the network is trained on a regiment of random order word sequences." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-71", "text": "For example, the phrase: \"I am going to San Francisco to watch the Warriors play basketball\" could be used to train XLNet by selecting a random factorization order, which is the order in which we will decode the network \"see\" the tokens." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-72", "text": "Since this is accomplished by passing an attention mask, which controls which positions are allowed to attend to which other positions, we don't give up any parallelism." 
}, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-73", "text": "Then we select a pivot index, say 6 corresponding to the word \"to.\" All words preceding the pivot would be used to predict words succeeding the pivot." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-74", "text": "So we may try to predict the p(\"to\"| \"I\", \"am\", \"going\", \"to\", \"San\", \"Francisco\") and p(\"to\" | \"San\", \"am\", \"I\", \"to\", \"going\", \"Francisco\"), among all other permutations of the preceding words (see figure 4 )." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-75", "text": "Although this random ordering seems jarring, the original word sequence orderings are still preserved through \"relative positional embeddings." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-76", "text": "\" XLNet maintains as an input, the relative word position distance from the word that is being predicted from its inputs." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-77", "text": "This means that a random permutation of previous words can help the network learn an embedding from the surrounding words while it preserves information related to the ordering of those words." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-78", "text": "Another feature of XLNet, is that it allows the exporting of its internal hidden states, which can be passed in on subsequent calls." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-79", "text": "This provides a recurrence mechanism that can help capture long range dependencies (see figure 5) ." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-80", "text": "The segment-level memory units (mems) allow us to capture longer range dependencies in language without recomputing all previous hidden states." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-81", "text": "There are large performance improvements for longer attention lengths as well [5] ." 
}, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-82", "text": "----------------------------------" }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-83", "text": "**DECODING AND RE-SCORING IN KALDI**" }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-84", "text": "Weighted finite state transducers (WFSTs or FSTs) provide a structured way to perform decoding." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-85", "text": "They represent a directed graph where each arc can represent probabilities of various state transitions." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-86", "text": "FSTs can represent hidden Markov models (HMMs), context dependency models, lexicons (pronunciation dictionaries), and grammars (n-gram language models)." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-87", "text": "Kaldi uses FSTs heavily, and combines all of the previously mentioned components into a single composed FST (HCLG.fst) [7] ." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-88", "text": "During decoding, an acoustic model is used to produce log probabilities of various phonemes, and the HCLG.fst will be used to map the outputs of the acoustic model to HMM state transitions to context-dependent phones (H.fst) to context-independent phones (C.fst) to pronunciations of a single word (L.fst) to subsquent words (G.fst)." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-89", "text": "The resulting output is a lattice which can also be represented as an FST (see figure 6 )." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-90", "text": "The lattice contains the combined acoustic model and language model probabilities for each arc (see figure 6 )." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-91", "text": "The language modeling probabilities represent the log probability of the next state given the previous n-grams from the path leading up to that state." 
}, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-92", "text": "Typically, for larger language models, a separate re-scoring step occurs where the arcs in a lattice are re-scored by subtracting the ini-tial language modeling score that was given to a particular arc and inserting the new score from the larger language model." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-93", "text": "In a pruned approach, only a subset of the arcs are re-scored." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-94", "text": "Using a set of heuristics [9] , arcs that require re-scoring are added to a priority queue and only the most promising paths are explored." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-95", "text": "In deterministic approaches, all possible paths are explored and re-scored." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-96", "text": "The re-scoring lattice operation Fig. 6 ." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-97", "text": "This is an example of a lattice." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-98", "text": "\"So does\" or \"sodas\"? Each arc can represent a word and a log probability that combines the acoustic and language modeling scores." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-99", "text": "The best path for the lattice represents the best scoring sentence decoded for this utterance." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-100", "text": "takes the form of an FST composition." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-101", "text": "In FST composition [12] two FSTs are combined to form an FST with the combined weights for the arcs that \"exist\" in both." 
}, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-102", "text": "----------------------------------" }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-103", "text": "**EXPERIMENTS AND RESULTS**" }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-104", "text": "All experiments were conducted using the TED-LIUM3 dataset [13] ." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-105", "text": "TED-LIUM is an English speech recognition training corpus from TED talks." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-106", "text": "This data-set was chosen due to its topical nature, usually in the form of 15 minutes or more worth of speech where the speaker is discussing a particular topic." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-107", "text": "Our TED-LIUM dataset contains a training set of 248 hours of speech with aligned transcription, with approximately 2 hours of development and 3 hours of test." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-108", "text": "An acoustic model and n-gram language model are trained to provide a baseline word-error rate." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-109", "text": "We use a library that contains a pre-trained version of XLNet, an implementation of the transformer-XL architecture [14] ." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-110", "text": "The model is fairly large with 110M parameters." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-111", "text": "It was previously trained on BooksCorpus [15] and English Wikipedia which have 13GB of plain text combined [11] ." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-113", "text": "We implement a gRPC server that can run inference on our model over the local network." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-114", "text": "In Kaldi, we implement a DeterministicOn-DemandFst that calls into our exported model." 
}, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-115", "text": "Our Deter-ministicOnDemandFst maps kaldi word symbol identifiers to tokens that our model understands and vice-versa." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-116", "text": "When re-scoring a lattice, we remove the first-pass FST values and compose our DeterministicOnDemandFst (see figure 7) ." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-117", "text": "The segment embedding for the best path after lattice re-scoring is cached and passed as inputs for re-scoring future lattices from the same speech context." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-118", "text": "We compare this technique to first-pass decoding lattices (no-rescoring) and to re-scoring with an RNNLM trained directly on the TED-LIUM data-set." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-119", "text": "----------------------------------" }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-120", "text": "**XLNET GRPC INFERENCE SERVER**" }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-121", "text": "For ease of integration, we implement a gRPC server for access over the local network." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-122", "text": "gRPC provides a highperformance, cross-language way to send data across processes and over the network." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-123", "text": "Our implementation of a gRPC server takes in a sequence of words and returns the log probability for the next word." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-124", "text": "1 syntax = \"proto3\"; There are two methods that can be invoked on our gRPC server: GetLogProb and GetLogProbBatch." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-125", "text": "GetLogProb can take a single sequence of words and returns the log probability of the next word." 
}, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-126", "text": "It will also return a \"mems_id\" which can be used on subsequent requests to reference the cached hidden memory states on the gRPC server." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-127", "text": "GetLogProbBatch can take a batch of requests and execute them in a single batch on the model." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-128", "text": "Words are tokenized to map to the word IDs that the XLNet model understand." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-129", "text": "Also, all sequence lengths in the batch are expected to be the same, so a padding token is added to pad all sequences to the maximum length sequence of the batch." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-130", "text": "A \"common_mems_id\" can be be passed with a batch request to load the cached hidden memory states as part of the batch inference call into the model." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-131", "text": "The memory states are repeated to match the dimensions of the batch." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-132", "text": "After each sequential lattice is re-scored, its best path is passed as sequence of words to GetLogProb and predicting the probability of an end of sentence token." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-133", "text": "The returned \"mems_id\" is saved." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-134", "text": "Fig. 7 ." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-135", "text": "Our architecture for lattice re-scoring: TransformerL-mOnDemandFst is composed with the lattice to re-score each lattice arc." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-136", "text": "Each inference request is sent to our gRPC server." 
}, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-137", "text": "After re-scoring the lattice, the best path is passed to the transformer to get its memory states, those are cached in the Trans-formerLmOnDemandFst for rescoring future sequential utterances." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-138", "text": "----------------------------------" }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-139", "text": "**FINE-TUNING LANGUAGE MODEL**" }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-140", "text": "We fine-tune the XLNet base-cased model for 20 epochs on the TED-LIUM training set." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-141", "text": "First we concatenate all of the TED-LIUM training transcripts, preserving the order of the utterances." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-142", "text": "This step is important because we want to be able to condition our language models on very long histories of words (more than 100) so we need to be sure that contiguous training text belongs to the same TED talk." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-143", "text": "The training text is then tokenized using XLNet's word dictionary." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-144", "text": "Our training examples consist of blocks of 512 XLNet word IDs." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-145", "text": "We use an Adam optimizer with a learning rate of 5e-6." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-146", "text": "We did not attempt to different factorization orders ( figure 4) , only the natural factorization order was used." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-147", "text": "The cross entropy losses against all of the word positions were back-propagated for experiments where we had no target mapping." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-148", "text": "Other training runs for 10 and 4 word blocks were also ran." 
}, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-149", "text": "We changed the random sampling to sequential sampling during training to allow for memory blocks to be cached from previous examples." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-150", "text": "We also experimented with a single permutation mask for the next token in order to mimic inference time where we are at-tempted to predict a last word that all other word positions and hidden states cannot attend to." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-151", "text": "There are a few discrepancies between the pre-trained XLNet and our use-case." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-152", "text": "All of our words are lower cased and our corpus contains no punctuation." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-153", "text": "Since the pre-trained XLNet differentiates between upper and lower cased words and has punctuation tokens, it is likely that it relies on them for deriving semantics." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-154", "text": "XLNet's pre-trained vocabulary is also much larger than TED-LIUM's." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-155", "text": "Fig. 8 ." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-156", "text": "An improvement to our lattice rescoring flow involves the NgramCacherOnDemandFst to retrieves all n-grams in the lattice." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-157", "text": "All n-grams are passed to the TransformerLmOnDe-mandFst for a single batch request to the transformer server." 
}, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-158", "text": "----------------------------------" }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-159", "text": "**PERFORMANCE OPTIMIZATIONS**" }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-160", "text": "Typically with an RNNLM approach, we must have cached the hidden states of previous time steps before running inference for the next FST state." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-161", "text": "This is improved upon with our transformer LM approach, as we can compute the log probability of the next unigram with a single inference pass." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-162", "text": "There is no need to compute the hidden state for each additional word in sequence." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-163", "text": "The problem, however, is that the transformer based language model (110M parameters for the one used on our experiments) is much larger than the RNNLM (15M parameters) that is referenced in the TED-LIUM recipe in Kaldi." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-164", "text": "This makes running the model a much more expensive prospect." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-165", "text": "For that reason, we must take a different approach to running our model." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-166", "text": "Because we are not constrained by running our models in sequence, we can batch all operations for execution in a single-pass, if the GPU memory permits it (see figure 8 )." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-167", "text": "See the graph in figure 9 which shows that the incremental cost of adding an extra call to the batch operation is negligible (see figure 9 )." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-168", "text": "This allows us to experiment with different techniques, for one, we can Fig. 9 ." 
}, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-169", "text": "This is a graph showing the incremental cost of adding a request to the batch." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-170", "text": "----------------------------------" }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-171", "text": "**MODEL**" }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-172", "text": "Dev Table 1 ." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-173", "text": "The pre-trained and fine-tuned transformer models make improvements on the lattice, but they're still not as good as the pre-trained RNNLM for TED-LIUM." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-174", "text": "This is most likely due to using a very large XLNet, limited GPU resources, and only 25MB worth of TED-LIUM text." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-175", "text": "relax our pruning parameters as there is very little additional cost to run a deterministic approach where we score every arc in the lattice." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-176", "text": "----------------------------------" }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-177", "text": "**RESULTS**" }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-178", "text": "In order to motivate the problem, we measure the oracle worderror rate which gives us the path with the minimum word error rate found within each lattice." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-179", "text": "The oracle word error rate for the test set was found to be 1.70%." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-180", "text": "If we were to flawlessly re-score a lattice, we could, in theory, achieve this word error rate." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-181", "text": "In the lattice some very good answers exist." 
}, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-182", "text": "However, our results in table 1 show how difficult it is to make a dent in the WER with such a large XLNet model." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-183", "text": "The RNNLM still gives a much better score." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-184", "text": "We suspect that this is due to a few things: firstly, the XLNet is 110M parameters and was trained on approximately 13GB of text compared to 25MB worth of text for TED-LIUM." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-185", "text": "Given the size of the model, and the fact that it was pre-trained on 512 TPUs [11] , we expect that training for 20 epochs on TED-LIUM's text is not enough to overcome the differences between written text and conversational speech." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-186", "text": "Without fine-tuning, adding memory seems to have an adverse effect on the test set." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-187", "text": "----------------------------------" }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-188", "text": "**CONCLUSION**" }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-189", "text": "We proposed a way to use a transformer-based language model in conversational speech recognition." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-190", "text": "We showed that XLNet-based language model with 110M parameters is slower in lattice re-scoring than the RNNLM, but there is promise when it comes to the ability of the transformer architecture to run inference in parallel." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-191", "text": "For a fairer comparison, a much smaller derivative of XLNet should be used." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-192", "text": "The transformer also showed a 4% relative improvement on the TED-LIUM when transfer learning and running a 10-gram approximation on the lattice." 
}, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-193", "text": "Given a smaller model and more training time, we expect to see small XLNets to be used for lattice re-scoring soon." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-194", "text": "----------------------------------" }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-195", "text": "**FUTURE IMPROVEMENTS**" }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-196", "text": "Future improvements include reducing the batch sizes that are sent to our inference server by caching common n-gram prefixes." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-197", "text": "We also experimented fine-tuning the language models with memory hidden states but did not have time to compile the results." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-198", "text": "A much smaller network based on the XLNet archicture would allow for easier fine-tuning and inference." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-199", "text": "It also may be easier to test using the pre-computed word embedding matrix from TED-LIUM's RNNLM." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-200", "text": "That would help with reducing the size of the model overall and training time since we could avoid using the model's internal embedding lookup matrix." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-201", "text": "We would also like to explore scoring lattices in both directions and re-scoring the n-best outputs directly." }, { "sent_id": "8ace0627a085efd0cf0ccc211c556f-C001-202", "text": "We also wish to experiment with expanding the lattice with unique n-gram paths first before re-scoring to reduce the effect of multiple states affecting the path." 
} ], "y": { "@USE@": { "gold_contexts": [ [ "8ace0627a085efd0cf0ccc211c556f-C001-47" ], [ "8ace0627a085efd0cf0ccc211c556f-C001-109", "8ace0627a085efd0cf0ccc211c556f-C001-111" ], [ "8ace0627a085efd0cf0ccc211c556f-C001-185" ] ], "cite_sentences": [ "8ace0627a085efd0cf0ccc211c556f-C001-47", "8ace0627a085efd0cf0ccc211c556f-C001-111", "8ace0627a085efd0cf0ccc211c556f-C001-185" ] }, "@BACK@": { "gold_contexts": [ [ "8ace0627a085efd0cf0ccc211c556f-C001-52" ] ], "cite_sentences": [ "8ace0627a085efd0cf0ccc211c556f-C001-52" ] }, "@MOT@": { "gold_contexts": [ [ "8ace0627a085efd0cf0ccc211c556f-C001-58", "8ace0627a085efd0cf0ccc211c556f-C001-60" ] ], "cite_sentences": [ "8ace0627a085efd0cf0ccc211c556f-C001-60" ] } } }, "ABC_0a226accf1fa8b471176916a76f1c6_36": { "x": [ { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-71", "text": "----------------------------------" }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-72", "text": "**BASELINE MODEL**" }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-2", "text": "Syntactic analysis of search queries is important for a variety of information-retrieval tasks; however, the lack of annotated data makes training query analysis models difficult." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-3", "text": "We propose a simple, efficient procedure in which part-of-speech tags are transferred from retrieval-result snippets to queries at training time." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-4", "text": "Unlike previous work, our final model does not require any additional resources at run-time." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-5", "text": "Compared to a state-ofthe-art approach, we achieve more than 20% relative error reduction." 
}, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-6", "text": "Additionally, we annotate a corpus of search queries with partof-speech tags, providing a resource for future work on syntactic query analysis." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-7", "text": "----------------------------------" }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-9", "text": "Syntactic analysis of search queries is important for a variety of tasks including better query refinement, improved matching and better ad targeting (Barr et al., 2008) ." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-10", "text": "However, search queries differ substantially from traditional forms of written language (e.g., no capitalization, few function words, fairly free word order, etc.), and are therefore difficult to process with natural language processing tools trained on standard corpora (Barr et al., 2008) ." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-11", "text": "In this paper we focus on part-of-speech (POS) tagging queries entered into commercial search engines and compare different strategies for learning from search logs." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-12", "text": "The search logs consist of user queries and relevant search results retrieved by a search engine." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-13", "text": "We use a supervised POS tagger to label the result snippets and then transfer the tags to the queries, producing a set of noisy labeled queries." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-14", "text": "These labeled queries are then added to the training data and the tagger is retrained." 
}, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-15", "text": "We evaluate different strategies for selecting which annotation to transfer and find that using the result that was clicked by the user gives comparable performance to using just the top result or to aggregating over the top-k results." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-16", "text": "The most closely related previous work is that of Bendersky et al. (2010 Bendersky et al. ( , 2011 ." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-17", "text": "In their work, unigram POS tag priors generated from a large corpus are blended with information from the top-50 results from a search engine at prediction time." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-18", "text": "Such an approach has the disadvantage that it necessitates access to a search engine at run-time and is computationally very expensive." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-19", "text": "We re-implement their method and show that our direct transfer approach is more effective, while being simpler to instrument: since we use information from the search engine only during training, we can train a stand-alone POS tagger that can be run without access to additional resources." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-20", "text": "We also perform an error analysis and find that most of the remaining errors are due to errors in POS tagging of the snippets." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-21", "text": "----------------------------------" }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-22", "text": "**DIRECT TRANSFER**" }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-23", "text": "The main intuition behind our work, Bendersky et al. (2010) and R\u00fcd et al. (2011) , is that standard NLP annotation tools work better on snippets returned by a search engine than on user supplied queries." 
}, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-24", "text": "This is because snippets are typically well-formed English sentences, while queries are not." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-25", "text": "Our goal is to leverage this observation and use a supervised POS tagger trained on regular English sentences to generate annotations for a large set of queries that can be used for training a query-specific model." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-26", "text": "Perhaps the simplest approach - but also a surprisingly powerful one - is to POS tag some relevant snippets for a given query, and then to transfer the tags from the snippet tokens to matching query tokens." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-27", "text": "This \"direct\" transfer idea is at the core of all our experiments." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-28", "text": "In this work, we provide a comparison of techniques for selecting snippets associated with the query, as well as an evaluation of methods for aligning the matching words in the query to those in the selected snippets." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-29", "text": "Specifically, for each query with a corresponding set of \"relevant snippets,\" we first apply the baseline tagger to the query and all the snippets." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-30", "text": "We match any query terms in these snippets, and copy over the POS tag to the matching query term." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-31", "text": "Note that this can produce multiple labelings, as the relevant snippet set can be very diverse and varies even for the same query." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-32", "text": "We choose the most frequent tagging as the canonical one and add it to our training set." 
}, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-33", "text": "We then train a query tagger on all our training data: the original human-annotated English sentences and also the automatically generated query training set." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-34", "text": "The simplest way to match query tokens to snippet tokens is to allow a query token to match any snippet token." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-35", "text": "This can be problematic when we have queries that have a token repeated with different parts-of-speech, such as in \"tie a tie.\" To make the matching more precise, we try a sequence of matching rules: first, an exact match of the query n-gram." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-36", "text": "Then we match the terms in order, so the query \"tie_a a tie_b\" matched against the snippet \"to tie_1 a neck tie_2\" would match tie_a:tie_1 and tie_b:tie_2." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-37", "text": "Finally, we match as many query terms as possible." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-38", "text": "An early observation showed that when a query term occurs in the result URL, e.g., searching for \"irs mileage rate\" results in the page irs.gov, the query term matching the URL domain name is usually a proper noun." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-39", "text": "Consequently, we add this rule." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-40", "text": "In the context of search logs, a relevant snippet set can refer to the top k snippets (including the case where k = 1) or the snippet(s) associated with results clicked by users that issued the query." 
}, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-41", "text": "In our experiments we found that different strategies for selecting relevant snippets, such as selecting the snippets of the clicked results, using the top-10 results or using only the top result, perform similarly (see Table 1)." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-42", "text": "By contrast, Bendersky et al. (2010) use a linear interpolation between a prior probability and the snippet tagging." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-43", "text": "They define \u03c0(t|w) as the relative frequency of tag t given by the baseline tagger to word w in some corpus, and \u03c8(t|w, s) as the indicator function for whether word w has tag t in the context of snippet s. They define the tagging of a word as arg max_t [\u03bb \u03c0(t|w) + (1 \u2212 \u03bb) (1/|S|) \u03a3_{s \u2208 S} \u03c8(t|w, s)]. We illustrate the difference between the two approaches in Figure 1." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-44", "text": "The numbered rows of the table correspond to three snippets (with non-query terms elided)." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-45", "text": "The strategy that uses the clicks to select the tagging would count two examples of \"Budget/NNP Rent/NNP A/NNP Car/NNP\" and one for each of two other taggings." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-46", "text": "Note that snippet 1 and the query get different taggings primarily due to orthographic variations." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-47", "text": "It would then add \"budget/NNP rent/NNP a/NNP car/NNP\" to its training set." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-48", "text": "The interpolation approach of Bendersky et al. (2010) would tag the query as \"budget/NNP rent/VB a/DET car/NN\"." 
}, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-49", "text": "To see why this is the case, consider the probability for rent/VB vs. rent/NNP." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-50", "text": "For rent/VB we have 0.2 + 0.8 \u00d7" }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-51", "text": "----------------------------------" }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-52", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-53", "text": "We assume that we have access to labeled English sentences from the Penn Treebank (Marcus et al., 1993) and the QuestionBank (Judge et al., 2006), as well as large amounts of unlabeled search queries." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-54", "text": "Each query is paired with a set of relevant results represented by snippets (sentence fragments containing the search terms), as well as information about the order in which the results were shown to the user and possibly the result the user clicked on." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-55", "text": "Note that different sets of results are possible for the same query, because of personalization and ranking changes over time." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-56", "text": "----------------------------------" }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-57", "text": "**EVALUATION DATA**" }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-58", "text": "We use two data sets for evaluation." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-59", "text": "The first is the set of 251 queries from Microsoft search logs (MS-251) used in Bendersky et al. (2010, 2011)." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-60", "text": "The queries are annotated with three POS tags representing nouns, verbs and \"other\" tags (MS-251 NVX)." 
}, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-61", "text": "We additionally refine the annotation to cover 14 POS tags comprising the 12 universal tags of Petrov et al. (2012), as well as proper nouns and a special tag for search operator symbols such as \"-\" (for excluding the subsequent word)." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-62", "text": "We refer to this evaluation set as MS-251 in our experiments." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-63", "text": "We had two annotators annotate the whole of the MS-251 data set." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-64", "text": "Before arbitration, the inter-annotator agreement was 90.2%." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-65", "text": "As a reference, Barr et al. (2008) report 79.3% when annotating queries with 19 POS tags." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-66", "text": "We then examined all the instances where the annotators disagreed, and corrected the discrepancies." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-67", "text": "Our annotations are available at http://code.google.com/p/query-syntax/." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-68", "text": "The second evaluation set consists of 500 so-called \"long-tail\" queries." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-69", "text": "These are queries that occurred rarely in the search logs, and are typically difficult to tag because they are searching for less-frequent information." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-70", "text": "They do not contain navigational queries." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-73", "text": "We use a linear chain tagger trained with the averaged perceptron (Collins, 2002)." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-74", "text": "We use the following features for our tagger: current word, suffixes and prefixes of length 1 to 3; additionally, we use word cluster features (Uszkoreit and Brants, 2008) for the current word, and transition features of the clusters of the current and previous words." 
}, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-75", "text": "When training on Sections 1-18 of the Penn Treebank and testing on Sections 22-24, our tagger achieves 97.22% accuracy with the Penn Treebank tag set, which is state-of-the-art for this data set." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-76", "text": "When we evaluate only on the 14 tags used in our experiments, the accuracy increases to 97.88%." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-77", "text": "We experimented with 4 baseline taggers (see Table 2): WSJ and WSJ+QTB use the Penn Treebank, and additionally the QuestionBank, as training data." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-78", "text": "WSJ NOCASE and WSJ+QTB NOCASE use a case-insensitive version of the tagger (conceptually lowercasing the text before training and before applying the tagger)." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-79", "text": "As we will see, all our baseline models are better than the baseline reported in Bendersky et al. (2010); our lowercased baseline model significantly outperforms even their best model." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-80", "text": "----------------------------------" }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-81", "text": "**EXPERIMENTS**" }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-82", "text": "First, we compared different strategies for selecting relevant snippets from which to transfer the tags." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-83", "text": "These systems are: DIRECT-CLICK, which uses snippets clicked on by users; DIRECT-ALL, which uses all the returned snippets seen by the user; and DIRECT-TOP-1, which uses just the snippet in the top result." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-84", "text": "Table 1 compares these systems on our three evaluation sets." 
}, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-85", "text": "While DIRECT-ALL and DIRECT-TOP-1 perform best on the MS-251 data sets, DIRECT-CLICK has an advantage on the long-tail queries." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-86", "text": "However, these differences are small (<0.6%), suggesting that any strategy for selecting relevant snippet sets will return comparable results when aggregated over large amounts of data." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-87", "text": "We then compared our method to the baseline models and a re-implementation of Bendersky et al. (2010), which we denote BSC." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-88", "text": "We use the same matching scheme for both BSC and our system, including the URL matching described in Section 2." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-89", "text": "The URL matching improves performance by 0.4-3.0% across all models and evaluation settings." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-90", "text": "Table 2 summarizes our final results." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-91", "text": "For comparison, Bendersky et al. (2010) report 91.6% for their final system, which is comparable to our implementation of their system when the baseline tagger is trained on just the WSJ corpus." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-92", "text": "Our best system achieves a 21.2% relative reduction in error on their annotations." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-93", "text": "Some other trends become apparent in Table 2." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-94", "text": "Firstly, a large part of the benefit of transfer has to do with case information that is available in the snippets but is missing in the query." 
}, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-95", "text": "The uncased tagger is insensitive to this mismatch and achieves significantly better results than the cased taggers." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-96", "text": "However, transferring information from the snippets provides additional benefits, significantly improving even the uncased baseline taggers." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-97", "text": "This is consistent with the analysis in Barr et al. (2008) ." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-98", "text": "Finally, we see that the direct transfer method from Section 2 significantly outperforms the method described in Bendersky et al. (2010) ." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-99", "text": "Table 3 confirms this trend when focusing on proper nouns, which are particularly difficult to identify in queries." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-100", "text": "We also manually examined a set of 40 queries with their associated snippets, for which our best DIRECT-CLICK system made mistakes." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-101", "text": "In 32 cases, the errors in the query tagging could be traced back to errors in the snippet tagging." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-102", "text": "A better snippet tagger could alleviate that problem." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-103", "text": "In the remaining 8 cases there were problems with the matching -either the mis-tagged word was not found at all, or it was matched incorrectly." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-104", "text": "For example one of the results for the query \"bell helmet\" had a snippet containing \"Bell cycling helmets\" and we failed to match helmet to helmets." 
}, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-105", "text": "Table 3: Precision and recall of the NNP tag on the long-tail data for the best baseline method and the three transfer methods using that baseline." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-106", "text": "**RELATED WORK** Barr et al. (2008) manually annotate a corpus of 2722 queries with 19 POS tags and use it to train and evaluate POS taggers; they also describe the linguistic structures they find." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-107", "text": "Unfortunately, their data is not available, so we cannot compare to their results." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-108", "text": "R\u00fcd et al. (2011) create features based on search engine results, which they use in an NER system applied to queries." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-109", "text": "They report significant improvements when incorporating features from the snippets." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-110", "text": "In particular, they exploit capitalization and query terms matching URL components; both of which we have used in this work." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-111", "text": "use clicks in a product database to train a tagger for product queries, but they do not use snippets and do not annotate syntax." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-112", "text": "Li (2010) and Manshadi and Li (2009) also work on adding tags to queries, but do not use snippets or search logs as a source of information." 
}, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-113", "text": "----------------------------------" }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-114", "text": "**CONCLUSIONS**" }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-115", "text": "We described a simple method for training a search-query POS tagger from search logs by transferring context from relevant snippet sets to query terms." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-116", "text": "We compared our approach to previous work, achieving an error reduction of 20%." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-117", "text": "In contrast to the approach proposed by Bendersky et al. (2010), our approach does not require access to the search engine or index when tagging a new query." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-118", "text": "Because we explicitly re-train our final model, it can pool knowledge from several related queries and incorporate that information into the model parameters." }, { "sent_id": "0a226accf1fa8b471176916a76f1c6-C001-119", "text": "An area for future work is to transfer other syntactic information, such as parse structures or supertags, using a similar transfer approach." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "0a226accf1fa8b471176916a76f1c6-C001-9" ], [ "0a226accf1fa8b471176916a76f1c6-C001-106" ] ], "cite_sentences": [ "0a226accf1fa8b471176916a76f1c6-C001-9", "0a226accf1fa8b471176916a76f1c6-C001-106" ] }, "@MOT@": { "gold_contexts": [ [ "0a226accf1fa8b471176916a76f1c6-C001-10" ] ], "cite_sentences": [ "0a226accf1fa8b471176916a76f1c6-C001-10" ] }, "@USE@": { "gold_contexts": [ [ "0a226accf1fa8b471176916a76f1c6-C001-65" ] ], "cite_sentences": [ "0a226accf1fa8b471176916a76f1c6-C001-65" ] }, "@DIF@": { "gold_contexts": [ [ "0a226accf1fa8b471176916a76f1c6-C001-63", "0a226accf1fa8b471176916a76f1c6-C001-64", "0a226accf1fa8b471176916a76f1c6-C001-65", "0a226accf1fa8b471176916a76f1c6-C001-66" ] ], "cite_sentences": [ "0a226accf1fa8b471176916a76f1c6-C001-65" ] }, "@SIM@": { "gold_contexts": [ [ "0a226accf1fa8b471176916a76f1c6-C001-97" ] ], "cite_sentences": [ "0a226accf1fa8b471176916a76f1c6-C001-97" ] } } }, "ABC_7ce7cfb0f918a46a1000e25b2e9eea_36": { "x": [ { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-2", "text": "Data augmentation is widely used to train deep neural networks for image classification tasks." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-3", "text": "Simply flipping images can help learning by increasing the number of training images by a factor of two." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-4", "text": "However, data augmentation in natural language processing is much less studied." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-5", "text": "Here, we describe two methods for data augmentation for Visual Question Answering (VQA)." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-6", "text": "The first uses existing semantic annotations to generate new questions." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-7", "text": "The second method is a generative approach using recurrent neural networks." 
}, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-8", "text": "Experiments show the proposed schemes improve performance of baseline and state-of-the-art VQA algorithms." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-9", "text": "----------------------------------" }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-11", "text": "In recent years, both computer vision and natural language processing (NLP) have made enormous progress on many problems using deep learning." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-12", "text": "Visual question answering (VQA) is a problem that fuses computer vision and NLP to build upon these successes." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-13", "text": "In VQA, an algorithm is given an image and a question about the image, and it predicts the answer to the question (Malinowski and Fritz, 2014; Antol et al., 2015)." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-14", "text": "Although progress has been rapid, there is still a significant gap between the performance of the best VQA systems and humans." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-15", "text": "For example, on the open-ended 'The VQA Dataset' that uses real images, the best systems in 2016 are at around 65% accuracy (e.g., Fukui et al. (2016)) compared to 83% for humans (Antol et al., 2015)." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-16", "text": "Analysis of VQA algorithm performance as a function of the amount of training data shows that existing algorithms would benefit greatly from more training data (Kafle and" }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-17", "text": "Kanan, 2017)." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-18", "text": "One way to address this would be to annotate additional questions about images, but this is time-consuming and expensive." 
}, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-19", "text": "Data augmentation is a much cheaper alternative." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-20", "text": "Data augmentation is generating new training data from existing examples." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-21", "text": "In this paper, we explore two data augmentation methods for generating new question-answer (QA) pairs for images." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-22", "text": "The first method uses existing semantic annotations and templates to generate QA pairs, similar to the method in Kafle and Kanan (2017) ." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-23", "text": "The second method is a generative approach using a recurrent neural network (RNN)." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-24", "text": "Fig. 1 shows an example image from 'The VQA Dataset' along with the original questions and the questions generated using our methods." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-25", "text": "Our methods improve the variety and the number of questions for the image." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-26", "text": "We evaluate how well each augmentation method performs on two VQA datasets." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-27", "text": "Our results show that augmentation increases performance for both datasets." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-28", "text": "----------------------------------" }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-29", "text": "**RELATED WORK**" }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-30", "text": "For supervised computer vision problems, e.g., image recognition, labels are scarcer than images." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-31", "text": "This is especially a problem with deep convolutional neural networks (CNNs) that have millions of parameters." 
}, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-32", "text": "Although more human labeled data would be ideal, it is easier to exploit the training dataset to generate new examples." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-33", "text": "For image classification, common ways to exploit training images to create more labeled examples include mirror reflections, random crops, etc." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-34", "text": "Many of these methods were used in training the seminal AlexNet (Krizhevsky et al., 2012), which increased the training data more than tenfold and produced a relative improvement of over 4% for image classification." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-35", "text": "Compared to vision, where augmentation is common, little work has been done on augmenting text for classification problems." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-36", "text": "A notable exception is Zhang et al. (2015), where a thesaurus was used to replace words with synonyms to create more training data for text classification." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-37", "text": "However, this augmentation produced little improvement and sometimes even hurt performance." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-38", "text": "The authors argued that because large quantities of real data are available, models generalize properly without augmentation." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-39", "text": "Although training using augmented text data is rare, generating new questions about images has been studied." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-40", "text": "The COCO-QA dataset (Ren et al., 2015) for VQA was created by parsing COCO captions with a syntactic parser and then applying hand-crafted rules to create QA pairs for four kinds of questions." 
}, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-41", "text": "However, due to the inability of the algorithm to cope with complex sentence structures, a significant portion of COCO-QA questions have grammatical errors or are oddly phrased." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-42", "text": "Visual question generation was also studied in (Mostafazadeh et al., 2016), with an emphasis on generating questions about images that are beyond the literal visual content of the image." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-43", "text": "They endeavored to avoid simple questions such as counting and color, which were emphasized in COCO-QA." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-44", "text": "Unlike our work, their objective was not data augmentation and they did not try to answer the generated questions." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-45", "text": "----------------------------------" }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-46", "text": "**DATASETS AND ALGORITHMS FOR VQA**" }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-47", "text": "We conduct experiments on two of the most popular VQA datasets: 'The VQA Dataset' (Antol et al., 2015) and COCO-QA (Ren et al., 2015)." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-48", "text": "'The VQA Dataset' is currently the most popular VQA dataset and it contains both synthetic and real-world images." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-49", "text": "The real-world images are from the COCO dataset (Lin et al., 2014)." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-50", "text": "All questions were generated by human annotators." 
}, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-51", "text": "We refer to this portion as COCO-VQA, and use it for our experiments." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-52", "text": "COCO-QA (Ren et al., 2015) also uses images from COCO, with the questions generated by an NLP algorithm that uses COCO's captions." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-53", "text": "All questions belong to four categories: object, number, color, and location." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-54", "text": "Many algorithms have been proposed for VQA." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-55", "text": "Some notable formulations include attention-based methods (Xiong et al., 2016; Lu et al., 2016; Fukui et al., 2016), Bayesian frameworks (Kafle and Kanan, 2016; Malinowski and Fritz, 2014), and compositional approaches (Andreas et al., 2016a,b)." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-56", "text": "Detailed reviews of existing methods can be found in Kafle and Kanan (2017) and Wu et al. (2016)." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-57", "text": "However, simpler models such as linear classifiers and multilayer perceptrons (MLPs) perform only slightly worse on many VQA datasets." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-58", "text": "These baseline methods predict the answer using a vector of image features concatenated to a vector of question features (Ren et al., 2015; Zhou et al., 2015; Kafle and Kanan, 2016)." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-59", "text": "We use the MLP model to conduct the bulk of the experiments, but we show that the proposed method is also effective on more sophisticated VQA systems like multimodal compact bilinear pooling (MCB) (Fukui et al., 2016)." 
}, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-60", "text": "----------------------------------" }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-61", "text": "**METHODS FOR DATA AUGMENTATION**" }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-62", "text": "The impact of using data augmentation to improve VQA has not been studied." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-63", "text": "We propose two methods for generating QA pairs about images: 1) a template-based generation method that uses image annotations and 2) a long short-term memory (LSTM) based language model." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-64", "text": "The number of questions generated using each method is shown in Table 1." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-65", "text": "----------------------------------" }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-66", "text": "**TEMPLATE AUGMENTATION METHOD**" }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-67", "text": "The template data augmentation method uses the semantic segmentation annotations in COCO to generate new QA pairs." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-68", "text": "COCO contains detailed segmentation annotations with labels for 80 objects typically found in the images." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-69", "text": "We synthesize four kinds of questions from the COCO annotations: yes/no, counting, object recognition, and scene/activity/sport recognition." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-70", "text": "Yes/No Questions: First, we make a list of the Counting Questions: To make counting questions, we count the number of separate annotations of all the objects of a particular category that have an area greater than 2000 pixels, and ask 12 variations of a counting question template." 
}, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-71", "text": "Object Recognition Questions: Object recognition questions such as 'What is in the picture?' can be ambiguous because multiple objects may be present." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-72", "text": "So, we ask questions about COCO 'super-categories' (e.g., 'food,' 'furniture,' 'vehicle,' etc.) to specify the type of object in the question." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-73", "text": "However, ambiguity may persist if there are multiple objects belonging to the same supercategory." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-74", "text": "For example, 'What vehicles are shown in the photo?' becomes ambiguous if both 'cars' and 'trucks' are present." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-75", "text": "So, we ensure only a single object of a supercategory is present before asking a recognition question." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-76", "text": "We use 12 variations of 'What SUPERCATEGORY is in the image?' Scene and Activity Questions: If an object in an image belongs to the COCO supercategory indoor or outdoor, we generate questions such as 'Is this indoors or outdoors?' Similarly, we ask about different rooms in the house by identifying the common objects in the room." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-77", "text": "For example, if there are at least two common kitchen appliances in the picture (e.g., toaster, microwave, etc.), then we infer the room is a kitchen and ask 'What room is this?' with 'kitchen' as the answer." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-78", "text": "We employ similar strategies for 'living room' and 'bathroom.' We used six variations for 'indoor/outdoor' questions and four variations for room classification questions." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-79", "text": "For sports, we check if any sports equipment is present in the image and generate a question about the type of sport being depicted in the image." 
}, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-80", "text": "We use four variations of questions to ask about each of the six common sports activities." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-81", "text": "----------------------------------" }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-82", "text": "**LSTM AUGMENTATION METHOD**" }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-83", "text": "One major issues with our template-based augmentation method is that the questions are rigid and may not closely resemble the way questions are typically posed in the VQA dataset." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-84", "text": "To address this, we train a stacked LSTM that generates questions about images." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-85", "text": "The network consists of two LSTM layers each with 1000 hidden units followed by two fully connected layers, with 7000 units each, which is the size of our vocabulary constructed by tokening training questions into individual words." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-86", "text": "The first fully connected layer has a ReLU activation function, while the second layer has the 7000-way softmax." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-87", "text": "The output question is produced one word at a time until the \u00a1end-ofquestion\u00bf token." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-88", "text": "The network is trained using the COCO-VQA training data." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-89", "text": "During the generation process, we start by passing the \u00a1start-question\u00bf token concatenated with the image features." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-90", "text": "To predict the next word, we sample from a multinomial distribution characterized by the prediction probabilities." 
}, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-91", "text": "Sometimes such sampling generates questions unrelated to image content." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-92", "text": "To compensate for this, we repeat the sampling for every word multiple times and pick the word occurring most frequently." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-93", "text": "We then generate 30 initial questions per image, and only retain the 3 most frequent questions." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-94", "text": "Any generated question that already exists in the original dataset is removed." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-95", "text": "We use the MLP VQA method described in Sec. 3 to create answers for the generated questions, but it is trained without augmented data." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-96", "text": "Used alone, this can produce many incorrect answers." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-97", "text": "To mitigate this problem, we tried to identify the kinds of questions the MLP VQA algo- rithm tends to get correct." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-98", "text": "To do this, we use kmeans to cluster the training question features concatenated to a one-hot vector with the answer for each question type (k = 25)." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-99", "text": "We assign each validation QA pair to one of these clusters and compute each cluster's accuracy." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-100", "text": "QA pairs assigned to clusters that have a validation accuracy of less than 70% are removed from the dataset." 
}, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-101", "text": "----------------------------------" }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-102", "text": "**EXPERIMENTS AND RESULTS**" }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-103", "text": "First, we use the simple MLP baseline model used in Kafle and Kanan (2016) to assess the two data augmentation methods." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-104", "text": "Kafle and Kanan (2016) showed that MLP worked well across multiple datasets despite its simplicity." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-105", "text": "The MLP model treats VQA as a classification problem with concatenated image and question features given to the model as features and answers as categories." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-106", "text": "CNN features from ResNet-152 and the skip-thought vectors (Kiros et al., 2015) are used as image and question features respectively." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-107", "text": "We evaluate the MLP model on COCO-VQA and COCO-QA datasets." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-108", "text": "For COCO-QA, we excluded all the augmented QA pairs derived from COCO's validation images during training, as the test portion of COCO-QA contains questions for these images." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-109", "text": "Table 2 shows the results for the MLP model when trained with and without augmentation." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-110", "text": "Some examples for the model trained with augmentation are are shown in Fig. 2 ." 
}, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-111", "text": "Next, to demonstrate that the data augmentation scheme also helps improve more complex models, we train the state-of-the-art MCB model with attention (MCB+Att.+GloVe) (Fukui et al., 2016) with the template augmentation and compare the accuracy when the same model trained only on the COCO-VQA dataset (Table 3) ." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-112", "text": "Referring to Table 2 , we can see that both forms of augmentation improved accuracy on COCO-QA compared to the baseline, and the templatebased approach worked better than LSTM." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-113", "text": "For COCO-VQA, the template-based augmentation helped considerably, producing a relative increase of 1.6% compared to when it was not used." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-114", "text": "We did not observe an improvement from using the LSTM method, perhaps due to label noise." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-115", "text": "While we tried to mitigate label noise by rejecting QA pairs that were likely to be wrong, this was not sufficient." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-116", "text": "We are exploring alternative training methods that are robust to label noise (e.g., Reed et al. (2014) ) to help improve results using LSTM." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-117", "text": "Additionally, we also evaluated which types of questions benefit the most from data augmentation." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-118", "text": "For the MLP model trained on COCO-VQA with the template augmentation, counting category answer is improved the most (1.74%), followed by others (1.01%), and yes/no (0.7%)." 
}, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-119", "text": "The results are promising and demonstrate that VQA algorithms can benefit from data augmentation, even for hard question types like counting." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-120", "text": "Furthermore, there is a lot of room for expansion in both the LSTM and the template based methods to produce a larger number and variety of questions." }, { "sent_id": "7ce7cfb0f918a46a1000e25b2e9eea-C001-121", "text": "Template augmentation worked best in our experiments, but if we can control for label noise, the LSTM method can be more flexible than the template method, and could be used to generate virtually unlimited amount of training data using images from the Internet." } ], "y": { "@MOT@": { "gold_contexts": [ [ "7ce7cfb0f918a46a1000e25b2e9eea-C001-40", "7ce7cfb0f918a46a1000e25b2e9eea-C001-41" ] ], "cite_sentences": [ "7ce7cfb0f918a46a1000e25b2e9eea-C001-40" ] }, "@USE@": { "gold_contexts": [ [ "7ce7cfb0f918a46a1000e25b2e9eea-C001-47" ] ], "cite_sentences": [ "7ce7cfb0f918a46a1000e25b2e9eea-C001-47" ] }, "@BACK@": { "gold_contexts": [ [ "7ce7cfb0f918a46a1000e25b2e9eea-C001-52" ], [ "7ce7cfb0f918a46a1000e25b2e9eea-C001-58" ] ], "cite_sentences": [ "7ce7cfb0f918a46a1000e25b2e9eea-C001-52", "7ce7cfb0f918a46a1000e25b2e9eea-C001-58" ] } } }, "ABC_8e0dcaec15a3b9c4947946a4e885c8_36": { "x": [ { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-2", "text": "Ensembling word embeddings to improve distributed word representations has shown good success for natural language processing tasks in recent years." }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-3", "text": "These approaches either carry out straightforward mathematical operations over a set of vectors or use unsupervised learning to find a lower-dimensional representation." 
}, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-4", "text": "This work compares meta-embeddings trained for different losses, namely loss functions that account for angular distance between the reconstructed embedding and the target and those that account normalized distances based on the vector length." }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-5", "text": "We argue that meta-embeddings are better to treat the ensemble set equally in unsupervised learning as the respective quality of each embedding is unknown for upstream tasks prior to meta-embedding." }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-6", "text": "We show that normalization methods that account for this such as cosine and KL-divergence objectives outperform meta-embedding trained on standard 1 and 2 loss on defacto word similarity and relatedness datasets and find it outperforms existing metalearning strategies." }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-7", "text": "----------------------------------" }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-9", "text": "Meta-embeddings are a quick and useful prior step for improving word representations in natural language learning tasks." }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-10", "text": "This involves combining several learned embeddings in a way that improve the overall input representation." }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-11", "text": "This approach is a less computationally expensive compared to if a practitioner were to train a set of word embeddings from scratch, particularly when considering non-sliding window methods." }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-12", "text": "The most straightforward approaches to meta-embeddings are: concatenation (CONC) and averaging (AV)." 
}, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-13", "text": "The former is limited since the dimensionality grows large for multiple embeddings as more vectors are concatenated and the latter, while fast, does not preserve most of the information encoded in each embedding when taking the arithmetic mean." }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-14", "text": "Although, it would seem surprising concatenating vectors from different embedding spaces is valid, it has been shown that (Coates and Bollegala, 2018) AV approximates CONC even though the embedding spaces are different." }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-15", "text": "Although, to address the loss of information when using AV, Singular Value Decomposition has been used as a dimensionality reduction technique to factorize the embeddings into a lower-rank approximation of the concatenated meta-embedding set." }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-16", "text": "Linear methods include the use of a projection layer for meta-embedding (known as 1TON) Yin and Sch\u00fctze (2015) , which is simply trained using an 2 -based loss." }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-17", "text": "Similarly, Bollegala et al. (Bollegala et al., 2017) has focused on finding a linear transformation between count-based and prediction-based embeddings, showing that linearly transformed count-based embeddings can be used for predictions in the localized neighborhoods in the target space." 
}, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-18", "text": "Most recent work (Bao and Bollegala, 2018 ) has focused on the use of an autoencoder (AE) to encode a set of N pretrained embeddings using 3 different variants: (1) Decoupled Autoencoded Meta Embeddings (DAEME) that keep activations separated for each respective embedding input during encoding and uses a reconstruction loss for both predicted embeddings while minimizing the loss for each respective decoded output, (2) Coupled Autoencoded Meta Embeddings (CAEME) which instead learn to predict from a shared encoding and (3) Averaged Autoencoded Meta-Embedding (AAME) is simply an averaging of the embedding set as input instead of using a concatenation." }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-19", "text": "This is the most relevant work to our paper, hence, we include these 3 autoencoding schemes along with aforementioned methods for experiments, described in Section 3." }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-20", "text": "We also include two subtle variations of the aforementioned arXiv:1808.04334v1 [cs.CL] 13 Aug 2018" }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-21", "text": "AEs." }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-22", "text": "The first predicts a target embedding from an embedding set using the remaining embedding set, post-learning the single hidden layer is used as the word meta-embedding." }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-23", "text": "The second method is similar except an AE is used for each input embedding to predict the designated target embedding, followed by an averaging over the resulting hidden layers." }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-24", "text": "This alternative is described in more detail in Section 2." 
}, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-25", "text": "The aforementioned previously proposed unsupervised learning have a common limitation, that is minimising the Euclidean ( 2 ) distance between source word embeddings and their metaembedding." }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-26", "text": "Arora et al. (2016) have shown that the 2 norm of a word embedding vector is proportional to the frequency of the corresponding word in the corpus used to learn the word embeddings." }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-27", "text": "Considering that in meta-embedding learning we use source embeddings trained on different resources, we argue that it is more important to preserve the semantic orientation of words, which is captured by the angle between two word embeddings, not their length." }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-28", "text": "Indeed, cosine similarity, a popularly used measure for computing the semantic relatedness among words, ignores the length related information." }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-29", "text": "Additionally, we note the relationship between KL-divergence and cosine similarity in the sense both methods perform a normalization that is proportional to the semantic information." }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-30", "text": "Hence, we compare several popular measures such as MSE and MAE that use 2 and 1 respectively, against KL-divergence and cosine similarity for the purpose of learning meta-embeddings and show that the loss which accounts for this orientation consistently outperforms the former objectives that only consider length." }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-31", "text": "We demonstrated this across multiple benchmark datasets." 
}, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-32", "text": "----------------------------------" }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-33", "text": "**METHODOLOGY**" }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-34", "text": "Before describing the loss functions used, we explain the aforementioned variation on the autoencoding method and how it slightly differs from 1TON/1TON + (Yin and Sch\u00fctze, 2015) and standard AEs (Bao and Bollegala, 2018) presented in previous work." }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-35", "text": "Target Autoencoders (TAE) are defined as learning an ensemble of nonlinear transformations between sets of bases X s in sets of vector spaces X S = {X 1 , .., X s , .., X N } s.t X s \u2208 R |vs|\u00d7ds to a target space X t \u2208 R |vt|\u00d7dt , where f" }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-36", "text": "\u2192 X t \u2200i is the nonlinear transformation function used to make the mapping." }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-37", "text": "Once a set of M number of parameteric models f w = {f" }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-38", "text": "w } are trained with various objective functions to learn the mappings between the word vectors we obtain a set of lowerdimensional target latent representation that represents different combinations of mappings from one vector space to another." }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-39", "text": "After training, all H set of latent variables Z s = {z 1 , .., z H } are concatenated with an autoencoded target vector." }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-40", "text": "This means thats all vector spaces have been mapped to a target space and there hidden meta-word representations have been averaged, as illustrated in Figure 1 ." 
}, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-41", "text": "Figure 2 shows a comparison of the previous autoencoder approaches (Bao and Bollegala, 2018) (left) and the alternative AE (right), where dashed lines indicate connections during training and bold lines indicate prediction." }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-42", "text": "The ConcatAutoEncoder (CAEME) simply concatenates the embedding set into a single vector and trains the autoencoder so to produce a lower-dimensional representation (shown in red), while the decoupled autoencoder (DAEME) keeps the embedding vectors separate in the encoding." }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-43", "text": "In contrast the target encoder (TAE) is similar to that of CAEME only the label is a single embedding from the embedding set and the input are remaining embeddings from the set." }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-44", "text": "After training, TAE then concatenates the hidden layer encoding with the original target vector." }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-45", "text": "The Mean Target AutoEncoder (MTE) instead performs an averaging over separate autoencoded representation." }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-46", "text": "Weights are initialized with a normal distribution, mean \u00b5 = 0 and standard deviation \u03c3 = 1." }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-47", "text": "Dropout is used with a dropout rate p = 0.2 for all datasets." }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-48", "text": "The model takes all unique vocabulary terms pertaining to all tested word association and word similarity datasets (n = 4819) and performs Stochastic Gradient Descent (SGD) with batch sizex = 32 trained between 50 epochs for each dataset \u2200d \u2208 D. This results in a set of vectors X j \u2208 R |v j |\u00d7200 \u2200j that are then used for finding the similarity between word pairs." 
}, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-49", "text": "The above parameters were chosen (h d ,x and number of epochs) over a small grid search." }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-50", "text": "As stated, we compare against previous methods (Yin and Sch\u00fctze, 2015; Bao and Bollegala, 2018 ) that use 2 distance, as shown in Equation 1)." }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-51", "text": "Similarly, the Mean Absolute Error ( 1 norm of difference) loss 1 N N i=1 |y \u2212\u0177| is tested." }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-52", "text": "We also compare against a KL divergence objective, as shown in Equation 2,\u0177 is the last activation output from the log-softmax that represents q(x) and the KL-divergence is given as" }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-53", "text": "Since tanh functions are used and input vectors are 2 normalized we propose a Squared Cosine" }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-54", "text": "----------------------------------" }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-55", "text": "**PROXIMITY (SCP) LOSS, SHOWN IN EQUATION 3**" }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-56", "text": "where m is the output dimensionality." }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-57", "text": "This forces the optimization to tune weights such that the rotational difference between the embedding spaces is minimized, thus preserving semantic information in the reconstruction." }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-58", "text": "In the context of its utility for the TAE, we too want to minimize the angular difference between corresponding vectors in different vector spaces." }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-59", "text": "It is also a suitable fit since it is a proper distance measure (i.e symmetric), unlike KL-divergence." 
}, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-60", "text": "(3) The model is kept relatively simple so that the comparisons against previous methods are directly comparable and that the performance comparison between the proposed SCP loss and KL divergence loss against MSE and MAE is controlled." }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-61", "text": "Additionally, all comparison are that of models which are only trained from co-occurrence statistics that are not leveraging any external knowledge (e.g AutoExtend (Rothe and Sch\u00fctze, 2015) )." }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-62", "text": "----------------------------------" }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-63", "text": "**EXPERIMENTS**" }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-64", "text": "The following word association and word similarity datasets are used throughout experimentation: Simlex (Hill et al., 2015) , WordSim-353 (Finkelstein et al., 2001) , RG (Rubenstein and Goodenough, 1965) , MTurk (MechanicalTurk-771) (Halawi et al., 2012) , RareWord (RW) (Luong et al., 2014) and MEN (Bruni et al., 2012) ." }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-65", "text": "The word vectors considered in the embeddings set are skipgram and cbow (Mikolov et al., 2013) , FastText (Bojanowski et al., 2016) , LexVec (Salle et al., 2016) , Hellinger PCA (HPCA) (Lebret and Collobert, 2013) and Hierarchical Document Context (HDC) (Sun et al., 2015) ." }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-66", "text": "We now report results on the performance of meta-embedding autoencodings with various loss functions, while also presenting target autoencoders for combinations of word embeddings and compare against existing current SoTA meta-embeddings." 
}, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-67", "text": "Table 1 shows the scaled Spearman correlation test scores, where (1) shows the original single embeddings, (2) results for standard metaembedding approaches that either apply a single mathematical operation or employ a linear projection as an encoding, (3) presents the results using autoencoder schemes by (Bao and Bollegala, 2018 ) that we have used to test the various losses, (4) introduces TAE without concatenating the target Y embedding post-training with MSE loss and (5) shows the results of concatenating Y with the lower-dimensional (200-dimensions) vector that encodes all embeddings apart from the target vector." }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-68", "text": "Therefore reported results from (4) are of a 200d vector, while (5) concatenates the vector leading to a vector between 300-500 dimensions dependent on the target vector." }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-69", "text": "All trained encodings from sections 3-5 are 200-dimensional vectors." }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-70", "text": "Results in red shading indicate the best performing meta-embedding for all presented approaches, while black shading indicates the best performing meta-embedding for the respective section." }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-71", "text": "Best performing word meta-embeddings are held between concatenated autoencoders that use the proposed Cosine-Embedding loss, while a KL-divergence also performs well on Simlex and RareWord." }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-72", "text": "Interestingly, both of these dataset are distinct in that Simlex is the only dataset providing scores on true similarity instead of free association, which has shown to be more difficult for word embeddings to account for (Hill et al., 2016) , while Rareword provides morphologically complex words to find similarity between." 
}, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-73", "text": "Concretely, it would seem KL-divergence is well suited for encoding when the word relations exhibits of a more complex or rare nature." }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-74", "text": "Similarly, we find SCP loss to achieve best results on RG and MEN, both the smallest and largest datasets of the set." }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-75", "text": "Furthermore, the TAE variant has lead to competitive performance overall against other metaembedding approaches and even produces best results on WS353." }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-76", "text": "Lastly, we find that HPCA embeddings are relatively weak for word similarity." }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-77", "text": "----------------------------------" }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-78", "text": "**CONCLUSION**" }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-79", "text": "We find the meta-embeddings trained using Autoencoders with a Squared Cosine loss and a KLdivergence loss improves performance in the majority of cases, reinforcing the argument that accounting for angles explicitly through normalization (log-softmax for KL) is an important criterion for encoding." }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-80", "text": "It is particularly useful for distributed word representations, since embeddings are learned from large documents of varying length and semantics." }, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-81", "text": "Lastly, we have shown its use in the context of improving meta-embeddings, Table 1 : Meta-Embedding Results although this suggests cosine loss is also suitable for minimizing angular differences for word embeddings, not only for meta-embeddings." 
}, { "sent_id": "8e0dcaec15a3b9c4947946a4e885c8-C001-82", "text": "Concretely, this paper has carried out a comprehensive study of methods to embed a lowerdimensional representation from embedding sets, while proposing losses that explicitly keep angular information intact for meta-embeddings." } ], "y": { "@BACK@": { "gold_contexts": [ [ "8e0dcaec15a3b9c4947946a4e885c8-C001-18" ], [ "8e0dcaec15a3b9c4947946a4e885c8-C001-41" ] ], "cite_sentences": [ "8e0dcaec15a3b9c4947946a4e885c8-C001-18", "8e0dcaec15a3b9c4947946a4e885c8-C001-41" ] }, "@USE@": { "gold_contexts": [ [ "8e0dcaec15a3b9c4947946a4e885c8-C001-34" ], [ "8e0dcaec15a3b9c4947946a4e885c8-C001-41" ], [ "8e0dcaec15a3b9c4947946a4e885c8-C001-50" ], [ "8e0dcaec15a3b9c4947946a4e885c8-C001-67" ] ], "cite_sentences": [ "8e0dcaec15a3b9c4947946a4e885c8-C001-34", "8e0dcaec15a3b9c4947946a4e885c8-C001-41", "8e0dcaec15a3b9c4947946a4e885c8-C001-50", "8e0dcaec15a3b9c4947946a4e885c8-C001-67" ] }, "@DIF@": { "gold_contexts": [ [ "8e0dcaec15a3b9c4947946a4e885c8-C001-34" ] ], "cite_sentences": [ "8e0dcaec15a3b9c4947946a4e885c8-C001-34" ] } } }, "ABC_56fa13128027f7c37d504d97cfcc45_36": { "x": [ { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-2", "text": "To alleviate the shortage of labeled data, we propose to use bilingually-constrained synthetic implicit data for implicit discourse relation recognition." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-3", "text": "These data are extracted from a bilingual sentence-aligned corpus according to the implicit/explicit mismatch between different languages." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-4", "text": "Incorporating these data via a multi-task neural network model achieves significant improvements over baselines, on both the English PDTB and Chinese CDTB data sets." 
}, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-5", "text": "----------------------------------" }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-7", "text": "Discovering the discourse relation between two sentences is crucial to understanding the meaning of a coherent text, and also beneficial to many downstream NLP applications, such as question answering and machine translation." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-8", "text": "Implicit discourse relation recognition (DRR imp ) remains a challenging task due to the absence of strong surface clues like discourse connectives (e.g. but)." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-9", "text": "Most work resorts to large amounts of manually designed features (Soricut and Marcu, 2003; Pitler et al., 2009; Lin et al., 2009; Louis et al., 2010; Rutherford and Xue, 2014) , or distributed features learned via neural network models (Braud and Denis, 2015; ." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-10", "text": "The above methods usually suffer from limited labeled data." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-11", "text": "Marcu and Echihabi (2002) attempt to create labeled implicit data automatically by removing connectives from explicit instances, as additional training data." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-12", "text": "These data are usually called as syn- * Corresponding author." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-13", "text": "thetic implicit data (hereafter SynData)." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-14", "text": "However, Sporleder and Lascarides (2008) argue that SynData has two drawbacks: 1) meaning shifts in some cases when removing connectives, and 2) a different word distribution with the real implicit data." 
}, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-15", "text": "They also show that using SynData directly degrades the performance." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-16", "text": "Recent work seeks to derive valuable information from SynData while filtering noise, via domain adaptation (Braud and Denis, 2014; , classifying connectives (Rutherford and Xue, 2015) or multi-task learning (Lan et al., 2013; Liu et al., 2016) , and shows promising results." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-17", "text": "society reckon existence youth problems, but many young people think themselves no problems." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-18", "text": "Figure 1 : An example illustrating the implicit/explicit mismatch between Chinese (ch) and English (en)." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-19", "text": "A Chinese implicit instance is translated into an English explicit one." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-20", "text": "In the PDTB, a discourse instance is defined as a connective (e.g. but)" }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-21", "text": "taking two arguments (Arg1 and Arg2)." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-22", "text": "Different from previous work, we propose to construct bilingually-constrained synthetic implicit data (called BiSynData) for DRR imp , which can alleviate the drawbacks of SynData." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-23", "text": "Our method is inspired by the findings that a discourse instance expressed implicitly in one language may be expressed explicitly in another." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-24", "text": "For example, Zhou and Xue (2012) show that the connectives in Chinese omit much more frequently than those in English with about 82.0% vs. 54.5%." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-25", "text": "Li et al. 
(2014a) further argue that there are about 23.3% implicit/explicit mismatches between Chinese/English instances." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-26", "text": "As illustrated in Figure 1, a Chinese implicit instance, where the connective is absent, is translated into an English explicit one with the connective but." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-27", "text": "Intuitively, the Chinese instance is a real implicit one that can be signaled by but." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-28", "text": "Hence, it could potentially serve as additional training data for the Chinese DRR_imp, avoiding the different-word-distribution problem of SynData." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-29", "text": "Meanwhile, for the English explicit instance, it is very likely that removing but would not lose any information, since its Chinese counterpart can be omitted." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-30", "text": "Therefore it could be used for the English DRR_imp, alleviating the meaning-shift problem of SynData." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-31", "text": "We extract our BiSynData from a Chinese-English sentence-aligned corpus (Section 2)." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-32", "text": "Then we design a multi-task neural network model to incorporate the BiSynData (Section 3)." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-33", "text": "Experimental results, on both the English PDTB (Prasad et al., 2008) and the Chinese CDTB (Li et al., 2014b), show that BiSynData is more effective than the SynData used in previous work (Section 4)." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-34", "text": "Finally, we review the related work (Section 5) and draw conclusions (Section 6)."
}, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-35", "text": "----------------------------------" }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-36", "text": "**BISYNDATA**" }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-37", "text": "Formally, given a Chinese-English sentence pair (S ch , S en ), we try to find an English explicit instance (Arg1 en , Arg2 en , Conn en ) in S en 1 , and a Chinese implicit instance (Arg1 ch , Arg2 ch ) in S ch , where (Arg1 en , Arg2 en , Conn en ) is the translation of (Arg1 ch , Arg2 ch )." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-38", "text": "In most cases, discourse relations should be preserved during translating, so the connective Conn en is potentially a strong indicator of the discourse relation between not only Arg1 en and Arg2 en , but also Arg1 ch and Arg2 ch ." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-39", "text": "Therefore, we can construct two synthetic implicit instances labeled by Conn en , denoted as (Arg1 en , Arg2 en ), Conn en and (Arg1 ch , Arg2 ch ), Conn en , respectively." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-40", "text": "We refer to these synthetic instances as BiSynData be-cause they are constructed according to the bilingual implicit/explicit mismatch." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-41", "text": "----------------------------------" }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-42", "text": "**CONN.**" }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-43", "text": "Freq In our experiments, we extract our BiSynData from a combined corpus (FBIS and HongKong Law), with about 2.38 million Chinese-English sentence pairs." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-44", "text": "We generate 30,032 synthetic English instances and the same number of Chinese instances, with 80 connectives, as our BiSynData." 
}, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-45", "text": "Table 1 lists the top 10 most frequent connectives in our BiSynData, which are roughly consistent with the statistics of Chinese/English implicit/explicit mismatches in (Li et al., 2014a) ." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-46", "text": "According to connectives and their related relations in the PDTB, in most cases, and and also indicate the Expansion relation, if and because the Contigency relation, bef ore the T emporal relation, and but the Comparison relation." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-47", "text": "Connectives as, when, while and since are ambiguous." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-48", "text": "For example, while can indicate the Comparison or T emporal relation." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-49", "text": "Overall, our constructed BiSynData covers all four main discourse relations defined in the PDTB." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-50", "text": "With our BiSynData, we define two connective classification tasks: 1) given (Arg1 en , Arg2 en ) to predict the connective Conn en , and 2) given (Arg1 ch , Arg2 ch ) to predict Conn en ." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-51", "text": "We incorporate the first task to help the English DRR imp , and the second for the Chinese DRR imp ." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-52", "text": "It is worthy to note that we use English connectives themselves as classification labels rather than mapping them to relations in both tasks." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-53", "text": "two tasks are essentially the same, just with different output labels." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-54", "text": "Therefore, as illustrated in Figure 2 , M T N shares parameters in all feature layers (L 1 -L 3 ) and uses two separate classifiers in the classifier layer (L 4 )." 
}, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-55", "text": "For each task, given an instance (Arg 1 , Arg 2 ), M T N simply averages embeddings of words to represent arguments, as v Arg 1 and v Arg 2 ." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-56", "text": "These two vectors are then concatenated and transformed through two non-linear hidden layers." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-57", "text": "Finally, the corresponding sof tmax layer is used to perform classification." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-58", "text": "M T N ignores the word order in arguments and uses two hidden layers to capture the interactions between two arguments." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-59", "text": "The idea behind M T N is borrowed from (Iyyer et al., 2015) , where a deep averaging network achieves close to the state-ofthe-art performance on text classification." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-60", "text": "Though M T N is simple, it is easy to train and efficient on both memory and computational cost." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-61", "text": "In addition, the simplicity of M T N allows us to focus on measuring the quality of BiSynData." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-62", "text": "We use the cross-entropy loss function and minibatch AdaGrad (Duchi et al., 2011) to optimize parameters." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-63", "text": "Pre-trained word embeddings are fixed." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-64", "text": "We find that fine-tuning word embeddings during training leads to severe overfitting in our experiments." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-65", "text": "Following Liu et al. (2016) , we alternately use two tasks to train the model, one task per epoch." 
}, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-66", "text": "For tasks on both the PDTB and CDTB, we use the same hyper-parameters." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-67", "text": "The dimension of word embedding is 100." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-68", "text": "We set the size of L 2 to 200, and L 3 to 100." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-69", "text": "ReLU is used as the non-linear function." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-70", "text": "Different learning rates 0.005 and 0.001 are used in the main and auxiliary tasks, respectively." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-71", "text": "To avoid overfitting, we randomly drop out 20% words in each argument following Iyyer et al. (2015) ." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-72", "text": "All hyper-parameters are tuned on the development set." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-73", "text": "----------------------------------" }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-74", "text": "**EXPERIMENTS**" }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-75", "text": "We evaluate our method on both the English PDTB and Chinese CDTB data sets." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-76", "text": "We tokenize English data and segment Chinese data using the Stanford CoreNLP toolkit (Manning et al., 2014) ." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-77", "text": "The English/Chinese Gigaword corpus (3rd edition) is used to train the English/Chinese word embeddings via word2vec (Mikolov et al., 2013) , respectively." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-78", "text": "Due to the skewed class distribution of test data (see Section 4.1), we use the macro-averaged F 1 for performance evaluation." 
}, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-79", "text": "Table 2 shows the results of M T N combining our BiSynData (denoted as M T N bi ) on the PDTB." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-80", "text": "----------------------------------" }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-81", "text": "**ON THE PDTB**" }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-82", "text": "ST N means we train M T N with only the main task." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-83", "text": "On the macro F 1 , M T N bi gains an improvement of 4.17% over ST N ." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-84", "text": "The improvement is significant under one-tailed t-test (p<0.05)." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-85", "text": "A closer look into the results shows that M T N bi performs better across all relations, on the precision, recall and F 1 score, except a little drop on the recall of Cont." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-86", "text": "The reason for the recall drop of Cont is not clear." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-87", "text": "The greatest improvement is observed on Comp, up to 6.36% F 1 score." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-88", "text": "The possible reason is that only while is ambiguous about Comp and T emp, while as, when and since are all ambiguous about T emp and Cont, among top 10 connectives in our BiSynData." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-89", "text": "Meanwhile the amount of labeled data for Comp is relatively small." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-90", "text": "Overall, using BiSynData under our multi-task model achieves significant improvements on the English DRR imp ." 
}, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-91", "text": "We believe the reasons for the improvements are twofold: 1) the added synthetic English instances from our BiSynData can alleviate the meaning shift problem, and 2) a multi-task learning method is helpful for addressing the different word distribution problem between implicit and explicit data." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-92", "text": "Considering some of the English connectives (e.g., while) are highly ambiguous, we compare our method with ones that uses only unambiguous connectives." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-93", "text": "Specifically, we first discard as, when, while and since in top 20 connectives, and get 22,999 synthetic instances." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-94", "text": "Then, we leverage these instances in two different ways: 1) using them in our multi-task model as above, and 2) using them as additional training data directly after mapping unambiguous connectives into relations." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-95", "text": "Both methods using only unambiguous connectives do not achieve better performance." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-96", "text": "One possible reason is that these synthetic instances become more unbalanced after discarding ones with ambiguous connectives." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-97", "text": "We also compare M T N bi with recent systems using additional training data." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-98", "text": "Rutherford and Xue (2015) select explicit instances that are similar to the implicit ones via connective classification, to enrich the training data." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-99", "text": "Liu et al. 
(2016) use a multi-task model with three auxiliary tasks: 1) conn: connective classification on explicit instances, 2) exp: relation classification on the labeled explicit instances in the PDTB, and 3) rst: relation classification on the labeled RST corpus (Mann and Thompson, 1988), which defines discourse relations different from those in the PDTB." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-100", "text": "The results are shown in Table 3." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-101", "text": "Although Liu et al. (2016) achieve the state-of-the-art performance (Line 5), they use two additional labeled corpora." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-102", "text": "We can see that MTN_bi (Line 6) yields better results than the systems incorporating SynData (Lines 1, 2 and 3), or even the labeled RST corpus (Line 4)." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-103", "text": "These results confirm that BiSynData can indeed alleviate the disadvantages of SynData effectively." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-104", "text": "Table 5 shows that MTN incorporating BiSynData (Line 3) performs better than using SynData (Lines 1 and 2) for the task on the CDTB." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-105", "text": "----------------------------------" }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-106", "text": "**ON THE CDTB**" }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-107", "text": "----------------------------------" }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-108", "text": "**RELATED WORK**" }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-109", "text": "One line of research related to DRR_imp tries to take advantage of explicit discourse data." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-110", "text": "Zhou et al. (2010) predict the absent connectives based on a language model."
}, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-111", "text": "Using these predicted connectives as features is proven to be helpful." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-112", "text": "Biran and McKeown (2013) aggregate word-pair features that are collected around the same connectives, which can effectively alleviate the feature sparsity problem." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-113", "text": "More recently, Braud and Denis (2014) and consider explicit data from a different domain, and use domain adaptation methods to explore the effect of them." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-114", "text": "Rutherford and Xue (2015) propose to gather weakly labeled data from explicit instances via connective classification, which are used as additional training data directly." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-115", "text": "Lan et al. (2013) and Liu et al. (2016) combine explicit and implicit data using multi-task learning models and gain improvements." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-116", "text": "Different from all the above work, we construct additional training data from a bilingual corpus." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-117", "text": "Multi-task neural networks have been successfully used for many NLP tasks." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-118", "text": "For example, Collobert et al. (2011) jointly train models for the Partof-Speech tagging, chunking, named entity recognition and semantic role labeling using convolutional network." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-119", "text": "Liu et al. (2015) successfully combine the tasks of query classification and ranking for web search using a deep multi-task neural network." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-120", "text": "Luong et al. 
(2016) explore multi-task sequence-to-sequence learning for constituency parsing, image caption generation and machine translation." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-121", "text": "----------------------------------" }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-122", "text": "**CONCLUSION**" }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-123", "text": "In this paper, we introduce bilingually-constrained synthetic implicit data (BiSynData), generated based on the bilingual implicit/explicit mismatch, into implicit discourse relation recognition for the first time." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-124", "text": "On both the PDTB and CDTB, using BiSynData as the auxiliary task significantly improves the performance of the main task." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-125", "text": "We also show that BiSynData is more beneficial than the synthetic implicit data typically used in previous work." }, { "sent_id": "56fa13128027f7c37d504d97cfcc45-C001-126", "text": "Since the lack of labeled data is a major challenge for implicit discourse relation classification, our proposed BiSynData can enrich the training data and thus benefit future work."
} ], "y": { "@BACK@": { "gold_contexts": [ [ "56fa13128027f7c37d504d97cfcc45-C001-16" ], [ "56fa13128027f7c37d504d97cfcc45-C001-99" ], [ "56fa13128027f7c37d504d97cfcc45-C001-101" ], [ "56fa13128027f7c37d504d97cfcc45-C001-115" ] ], "cite_sentences": [ "56fa13128027f7c37d504d97cfcc45-C001-16", "56fa13128027f7c37d504d97cfcc45-C001-99", "56fa13128027f7c37d504d97cfcc45-C001-101", "56fa13128027f7c37d504d97cfcc45-C001-115" ] }, "@USE@": { "gold_contexts": [ [ "56fa13128027f7c37d504d97cfcc45-C001-65" ] ], "cite_sentences": [ "56fa13128027f7c37d504d97cfcc45-C001-65" ] }, "@MOT@": { "gold_contexts": [ [ "56fa13128027f7c37d504d97cfcc45-C001-101" ] ], "cite_sentences": [ "56fa13128027f7c37d504d97cfcc45-C001-101" ] }, "@DIF@": { "gold_contexts": [ [ "56fa13128027f7c37d504d97cfcc45-C001-101", "56fa13128027f7c37d504d97cfcc45-C001-102" ], [ "56fa13128027f7c37d504d97cfcc45-C001-115", "56fa13128027f7c37d504d97cfcc45-C001-116" ] ], "cite_sentences": [ "56fa13128027f7c37d504d97cfcc45-C001-101", "56fa13128027f7c37d504d97cfcc45-C001-115" ] } } }, "ABC_291a6ac3f0c2d27ca69ee8f5f266f5_36": { "x": [ { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-2", "text": "This paper proposes an expansion of set of primitive constraints available within the Primitive Optimality Theory framework (Eisner, 1997a) ." }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-3", "text": "This expansion consists of the addition of a new family of constraints--existential implicational constraints, which allow the specification of faithfulness constraints that can be satisfied at a distance--and the definition of two ways to combine simple constraints into com: plex constraints, that is, constraint disjunction (Crowhurst and Hewitt, 1995) and local constraint conjunction (Smolensky, 1995) ." 
}, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-4", "text": "----------------------------------" }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-5", "text": "**INTRODUCTION**" }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-6", "text": "Primitive Optimality Theory (OTP) (Eisner, 1997a) , and extensions to it (e.g., Albro (1998) ), can be useful as a formal system in which phonological analyses can be implemented and evaluated." }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-7", "text": "However, for certain types of constraints, translation into the primitives of OTP (Eisner (1997b) ) can only be accomplished by adding to the grammar a number of ad hoc phonological tiers." }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-8", "text": "Because these tiers serve no phonological purpose other than to allow calculation of the constraints without adding new primitives, and because the addition of phonological tiers to an OTP grammar can have a dramatic negative impact on the efficiency of OTP implementations 1, it is preferable to avoid the addition of ad hoc tiers by adding new primitives to the system." }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-9", "text": "This paper looks at three types of constraints employed throughout the Optimality Theoretic literature that cannot be translated in to the 1The computation time for an Optimality Theoretic derivation within the implementation of Albro (1998) increases exponentially with the number of tiers." }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-10", "text": "The same is true for the implementation described in Eisner (1997a) , although a proposal is given there for a method that might improve the situation." }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-11", "text": "primitives of OTP without reference to ad hoc tiers, and proposes a formalization of these constraints that is compatible with the finite state model described in Eisner (1997a) and Albro (1998) ." 
}, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-12", "text": "These are constraints of existential implication (that is, of faithfulness without the requirement of alignment), constraint disjunction, and local constraint conjunction." }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-13", "text": "2 Existential Implication 2.1 Motivation OWP as described in Eisner (1997a) provides some support for correspondence constraints (input-output only)." }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-14", "text": "These may be defined by means of implication constraints of the form P --4 P or P --+ P, which can be interpreted as requiring, in the first case, that each surface constituent representing property P be aligned with an underlying constituent representing that property, and in the second case that every underlying constituent representing property P be aligned with a surface constituent representing that property." }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-15", "text": "Constraints of this type may be employed to require correspondence between the underlying representation and the surface representation where corresponding constituents must be aligned with one another." }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-16", "text": "However, natural languages also seem to follow weaker constraints requiring only that for each underlying constituent there be a corresponding surface constituent, regardless of the position of that constituent relative to its position in the underlying representation." 
}, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-17", "text": "For example, in Sanskrit roots with at least two voiced stops, where the root ends in a voiced aspirated stop, the underlying aspiration of the root-final stop can be realized upon that stop in the surface representation only when the root is followed by a suffix beginning with a vocoid or a nasal (data from Whitney (1889) In these forms it is clear that aspiration is being preserved, but that it is surfacing in a position that cannot overlap with the underlying form." }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-18", "text": "Another example is the Bantu language Chizigula (Kenstowicz and Kisseberth, 1988) , in which roots with underlying high vowels appear on the surface with a single high tone in the penultimate syllable of the word, where this syllable could belong to a suffix." }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-19", "text": "Additionally, if a prefix with an underlying high tone is prefixed to a root with no underlying high tone, the high tone of the prefix appears in the penultimate syllable of the resulting word." }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-20", "text": "The existence of a high tone in the underlying form implies the existence of a high tone in the surface form, but the position where that high tone occurs in the underlying form has nothing to do with where the tone appears on the surface." }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-21", "text": "----------------------------------" }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-22", "text": "**FORMALIZATION**" }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-23", "text": "Existential implication constraints can be used to drive correspondence effects such as the above." 
}, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-24", "text": "These constraints take represented by this notation outputs a violation for each domain 9,, where 9' represents the intersection of the domains 9,k, in which the time slice represented by the oq occurs, but no/3j occurs." }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-25", "text": "Using the FST notation of Eisner (1997a) , the implementation for this constraint would be the following FST:" }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-26", "text": "[ represents \"((in or begin all 9,k) -(in all 9,k)),\" and ] represents \"((in or end all 9,k) -(in all 9,k)).\" That is, the machine moves from state S to state 1 if the domain 9, is entered." }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-27", "text": "It moves from there back to state S if the end of the domain appears before cv does, or if any/3 appears." }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-28", "text": "If a appears, the machine moves from state 1 to state 2." }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-29", "text": "From state 2, if/3 appears, the machine returns to the start state without outputting a violation, but if the end of the domain appears without any/3 having appeared, the machine outputs a violation." }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-30", "text": "----------------------------------" }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-31", "text": "**CONSTRAINT DISJUNCTION**" }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-32", "text": "Crowhurst and Hewitt (1995) cite a number of instances in which it appears that multiple simple constraints must be combined via disjunction (there called conjunction) into complex constraints." 
}, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-33", "text": "Here a simple constraint is a function that takes an input, surface pair as its input and returns true if a particular disallowed phonological structure or lack of correspondence is present in the pair, otherwise false." }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-34", "text": "A constraint disjunction would thus be a function that returns the disjunction of the outputs of its component constraints." }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-35", "text": "Thus a constraint defined by disjunction of component constraints outputs a violation whenever any one of its components does." }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-36", "text": "Formalization of constraint disjunction requires reference only to intersection of weighted finite state machines." }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-37", "text": "Specifically, if constraint Cx is defined as a weighted finite state machine T1 = (~E1, ~2,Qi, F1,81, Ex), where E1 is 22 the alphabet of labels, E2 is the alphabet of weights, drawn from the natural numbers, Q1 is the set of states of the machine, F1 C Q1 is the final states, Sl is the start state, and E1 C Q1 \u00d7 Y.q \u00d7 Z,2 \u00d7 Q1 is the set of edges, and constraint C2 is another weighted deterministic finite state machine T2 --(~1, ~2, Q2, F2, s2, E2), then the disjunction of the two constraints may be defined as follows: T = (~1, ~2, Q1 \u00d7 Q2, F1 \u00d7 F2, (81,82), E>, ((q1,1, q2,1>, a, n, (ql,2, q2,2)> 6 E iff (ql,1, al, nl,ql,2> E EiA (q2,1, a2, n2, q2,2) E E2A" }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-38", "text": "----------------------------------" }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-39", "text": "**A ----A I N A2A N = (NL V N2)**" }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-40", "text": "A possible notation for the disjunction of two constraints C1 and C2 is C1 v C2, for example \"(yce --+ vce) V 
(cont --+ cont)\"." }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-41", "text": "A similar concept is that of mutually unranked primitive constraints." }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-42", "text": "For any given input, a complex constraint defined as a group of mutually unranked primitive constraints returns the sum of the violations that the primitive constraints returned." }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-43", "text": "Although it has been argued that the formal power provided by allowing new constraints to be defined by grouping mutually unranked primitive constraints is too great, constraints so defined are fairly prevalent in the literature." }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-44", "text": "For example, Steriade (1996) makes use of a constraint Paradigm Uniformity (PU) Stress which requires that all features within stressed syllables in-one member of a paradigm must be preserved in the corresponding syllable of other members of that paradigm." }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-45", "text": "PU Stress is equivalent to a set of mutually unranked paradigm uniformity constraints for all phonological features." }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-46", "text": "The empirical prediction of PU Stress is that changes in any one feature are as important as changes in any other." }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-47", "text": "If PU Stress were instead to be considered a block of ranked constraints for the individual features, the prediction would be that in the comparison between one candidate in which the top-ranked feature is identical between stressed syllables of the paradigm members, but all other features are different, and another candidate in which only a lower-ranked feature is different, the first candidate would prevail." 
}, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-48", "text": "The data seems to bear out the prediction of the definition using mutually unranked constraints." }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-49", "text": "Another possible definition of PU Stress would be to make use of constraint disjunction." }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-50", "text": "In this definition, all features would be equally important, but the number of nonidentical features would not matter--candidates differing in three features would be equal to candidates differing in one feature." }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-51", "text": "Once again, the definition using mutually unranked constraints seems better borne out by the data." }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-52", "text": "Leaving aside constraints such as PU Stress, we will see that complex constraints defined as combinations of mutually unranked constraints are useful as inputs to local constraint conjunctions." }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-53", "text": "The formal definition of a complex constraint in terms of mutually unranked subconstraints is identical to the definition of a constraint disjunction, except that the weight n of a new edge is defined as the sum of the weights of the input edges nl and n2 rather than the disjunction: T = (El, E:, Q1 \u00d7 Q2, F1 \u00d7 F2, (sl, s2), E), ((q1, 1, q2,1) , a, n, (ql,2, q2.2)) E E iff (ql,1,al, nl, ql,2) E E1A (q2,1, a2, n2, q:,2) E E2A a 1 N a2 ----aA A possible notation for a complex constraint C combining mutually unranked constraints C1 and C2 is C1 + C2, for example \"(vce ~ vce) + (cont ~ cont)\"." }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-54", "text": "4 Local Constraint Conjunction Smolensky (1995) and Kirchner (1996) propose a different method for combining constraints: local conjunction." 
}, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-55", "text": "A local conjunction of constraints is defined as a constraint that outputs a violation for each domain of a specified type in which all of the component constraints are violated." }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-56", "text": "A constraint may be locally conjoined with itself, in which case the resulting conjunction outputs a violation whenever there are two violations of the component constraint within the specified domain." }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-57", "text": "The conjunction of a constraint C1 with itself within a domain \u03b3 may be notated \"A(C1)/\u03b3.\"" }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-58", "text": "The following algorithm computes the local conjunction of constraint C1, where C1 is represented by the weighted finite state machine T1 = (\u03a31, \u03a32, Q1, s1, F1, E1). The machine M will be used to limit the evaluation of constraint C1 to the domain \u03b3." }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-59", "text": "To accomplish this, we need to define the behavior at the edges of the \u03b3 domain." }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-60", "text": "Outside the \u03b3 domain, violations of C1 will have no effect." }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-61", "text": "At the left edge of the \u03b3 domain, violations that do not involve the left" }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-62", "text": "edge of constituents will have no effect." }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-63", "text": "At the right edge of the \u03b3 domain, violations that do not involve the right edge of constituents will have no effect." }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-64", "text": "The final weighted finite state machine L representing the local conjunction of C1 with itself is produced by intersecting M with T, with the following modifications made to the intersection algorithm."
}, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-65", "text": "Edges from T that are intersected with the edge G[, or edges from T that are intersected with the edge G[ and contain no reference to a left edge, or edges from T that are intersected with the edge ]G and contain no reference to a right edge, are assigned a weight of 0, and if their destination within T was state s2, their destination in T is treated as having been s1." }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-66", "text": "This has the effect of limiting the constraint violations of C1 to the domain \u03b3." }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-67", "text": "Edges from T that are intersected with edge IG keep their original weight, but are treated as though their destination within T was s1." }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-68", "text": "This has the effect of resetting C1 to zero violations at the beginning of a \u03b3 domain immediately following another." }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-69", "text": "The constraint A(C1)/\u03b3 produced by the above algorithm outputs a violation for every violation of C1 after the first within domain \u03b3." }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-70", "text": "Thus A(C1)/\u03b3 penalizes two or more violations of C1 within \u03b3, but does not penalize single violations of C1." }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-71", "text": "For example, the constraint A_kA is represented as the following weighted finite state machine:" }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-72", "text": "[weighted finite state machine diagram]" }, { "sent_id": "291a6ac3f0c2d27ca69ee8f5f266f5-C001-73", "text": "The result of the above algorithm is the following machine: [weighted finite state machine diagram] While this algorithm does not allow definition of local conjunction of different constraints, it can be given nearly equivalent power by applying it to the output of complex constraints formed from mutually unranked subconstraints."
} ], "y": { "@USE@": { "gold_contexts": [ [ "291a6ac3f0c2d27ca69ee8f5f266f5-C001-2" ], [ "291a6ac3f0c2d27ca69ee8f5f266f5-C001-10", "291a6ac3f0c2d27ca69ee8f5f266f5-C001-9" ], [ "291a6ac3f0c2d27ca69ee8f5f266f5-C001-25" ] ], "cite_sentences": [ "291a6ac3f0c2d27ca69ee8f5f266f5-C001-2", "291a6ac3f0c2d27ca69ee8f5f266f5-C001-10", "291a6ac3f0c2d27ca69ee8f5f266f5-C001-25" ] }, "@EXT@": { "gold_contexts": [ [ "291a6ac3f0c2d27ca69ee8f5f266f5-C001-2", "291a6ac3f0c2d27ca69ee8f5f266f5-C001-3" ] ], "cite_sentences": [ "291a6ac3f0c2d27ca69ee8f5f266f5-C001-2" ] }, "@BACK@": { "gold_contexts": [ [ "291a6ac3f0c2d27ca69ee8f5f266f5-C001-6" ], [ "291a6ac3f0c2d27ca69ee8f5f266f5-C001-11" ], [ "291a6ac3f0c2d27ca69ee8f5f266f5-C001-13" ] ], "cite_sentences": [ "291a6ac3f0c2d27ca69ee8f5f266f5-C001-6", "291a6ac3f0c2d27ca69ee8f5f266f5-C001-11", "291a6ac3f0c2d27ca69ee8f5f266f5-C001-13" ] }, "@MOT@": { "gold_contexts": [ [ "291a6ac3f0c2d27ca69ee8f5f266f5-C001-6", "291a6ac3f0c2d27ca69ee8f5f266f5-C001-7" ], [ "291a6ac3f0c2d27ca69ee8f5f266f5-C001-10", "291a6ac3f0c2d27ca69ee8f5f266f5-C001-9" ] ], "cite_sentences": [ "291a6ac3f0c2d27ca69ee8f5f266f5-C001-6", "291a6ac3f0c2d27ca69ee8f5f266f5-C001-10" ] } } }, "ABC_0ab60c5c9ace058a5fbe3bc2643cba_36": { "x": [ { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-2", "text": "We extend sequence-to-sequence models with the possibility to control the characteristics or style of the generated output, via attention that is generated a priori (before decoding) from a latent code vector." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-3", "text": "After training an initial attentionbased sequence-to-sequence model, we use a variational auto-encoder conditioned on representations of input sequences and a latent code vector space to generate attention matrices." 
}, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-4", "text": "By sampling the code vector from specific regions of this latent space during decoding and imposing prior attention generated from it in the seq2seq model, output can be steered towards having certain attributes." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-5", "text": "This is demonstrated for the task of sentence simplification, where the latent code vector allows control over output length and lexical simplification, and enables fine-tuning to optimize for different evaluation metrics." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-6", "text": "----------------------------------" }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-8", "text": "Apart from its application to machine translation, the encoder-decoder or sequence-to-sequence (seq2seq) paradigm has been successfully applied to monolingual text-to-text tasks including simplification (Nisioi et al., 2017) , paraphrasing (Mallinson et al., 2017) , style transfer (Jhamtani et al., 2017) , sarcasm interpretation (Peled and Reichart, 2017) , automated lyric annotation (Sterckx et al., 2017) and dialogue systems (Serban et al., 2016) ." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-9", "text": "A sequence of input tokens is encoded to a series of hidden states using an encoder network and decoded to a target domain by a decoder network." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-10", "text": "During decoding, an attention mechanism is used to indicate which are the relevant input tokens at each step." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-11", "text": "This attention component is computed as an intermediate part of the model, and is trained jointly with the rest of the model." 
}, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-12", "text": "Alongside being crucial for effective translation, attention -while not necessarily correlated with human attention -brings interpretability to seq2seq models by visualizing how individual input elements contribute to the model's decisions." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-13", "text": "Attention values typically match up well with word alignments used in traditional statistical machine translation, obtained with tools such as GIZA++ (Och and Ney, 2000) or fast-align (Dyer et al., 2013) ." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-14", "text": "Therefore, several works have included prior alignments from dedicated alignment software such as GIZA++ or fast-align (Alkhouli et al., 2016; Mi et al., 2016; Liu et al., 2016) ." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-15", "text": "In particular, Mi et al. (2016) showed that the distance between the attention-infused alignments and the ones learned by an independent alignment model can be added to the networks' training objective, resulting in improved translation and alignment quality." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-16", "text": "Further, Gulcehre et al. (2017) demonstrated that this alignment between given input sentence and generated output can be planned ahead as part of a seq2seq model: their model makes a plan of future alignments using an alignment-plan matrix and decides when to follow this plan by learning a separate commitment vector." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-17", "text": "In the standard seq2seq model, where attention is calculated at each time step, such overall alignment or focus is only apparent after decoding and is thus not carefully planned nor controlled." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-18", "text": "We hypothesize that many text-to-text operations have varying levels of alignment and focus." 
}, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-19", "text": "To enable control over these aspects, we propose to pre-compute alignments and use this prior attention to determine the structure or focus before decoding in order to steer output towards having specific attributes, such as length or level of compression." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-20", "text": "We facilitate this control through an input represented in a latent vector space (rather than, e.g., explicit 'style' attributes)." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-21", "text": "After training of the initial seq2seq model (with standard attention) on a parallel text corpus, a conditional variational autoencoder (Sohn et al., 2015) learns to reconstruct matrices of alignment scores or attention matrices from a latent vector space and the input sentence encoding." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-22", "text": "At translation time, we are able to efficiently generate specific attention by sampling from regions in the latent vector space, resulting [Figure 1 caption: Training of a conditional variational autoencoder applied to attention matrices.]" }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-23", "text": "The seq2seq model translates training sentences from the source to a target domain while generating attention matrices." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-24", "text": "These matrices are concatenated with a representation of the source sentence and encoded to a low dimensional latent vector space." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-25", "text": "in output having specific stylistic attributes." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-26", "text": "We apply this method on a sentence simplification corpus, showing that we can control length and compression of output while producing realistic output and allowing fine-tuning for optimal evaluation scores."
}, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-27", "text": "----------------------------------" }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-28", "text": "**GENERATION OF PRIOR ATTENTION**" }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-29", "text": "This section describes our proposed method, sketched in Figure 1 , with emphasis on the generation of prior attention matrices." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-30", "text": "An encoder recurrent neural network computes a sequence of representations over the source sequence, i.e., its hidden states h^s_i (with i = 1, ..., n and n the length of the source sequence)." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-31", "text": "In attention-based models, an alignment vector a_j = [\u03b1_{j,1}, ..., \u03b1_{j,n}] is obtained by comparing the current target hidden state h^t_j with each source hidden state h^s_i." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-32", "text": "A global context vector c_j is then computed as the weighted average, according to the alignment weights of a_j, over all the source states h^s_i at time step j (for j = 1, ..., m over m decoding steps)." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-33", "text": "After decoding, these alignment vectors form a matrix A of attention vectors, A = [a_1; a_2; ...; a_m], capturing the alignment between source and target sequence." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-34", "text": "Inspired by the field of image generation, we treat alignment matrices as grayscale images and use generative models to create previously unseen attention." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-35", "text": "Generative models have been applied to a variety of problems giving state-of-the-art results in image generation, text-to-speech synthesis, and image captioning."
}, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-36", "text": "One of the most prominent models is the variational autoencoder (VAE) proposed by Kingma and Welling (2013) ." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-37", "text": "Given an observed variable x, the VAE introduces a continuous latent variable z, and assumes x to be generated from z, i.e., p(x,z) = p(x|z) p(z), with p(z) being a prior over the latent variables." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-38", "text": "p_D(x|z) is the conditional distribution that models the generation procedure parameterized by a decoder network D. For a given x, an encoder network E outputs a variational approximation q_E(z|x) of the true posterior over the latent values p(z|x) \u221d p_D(x|z) p_Z(z)." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-39", "text": "The parameters of E, D are learned using stochastic variational inference to maximize a lower bound for the marginal likelihood of each observation in the training data." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-40", "text": "In our setting, x represents the attention matrix." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-41", "text": "Next to control over stylistic features, we want attention matrices to be relevant for a specific source sentence." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-42", "text": "In the Conditional Variational Autoencoder (CVAE) (Yan et al., 2016; Sohn et al., 2015) , the standard VAE is conditioned on additional variables which can be used to generate diverse images conditioned on certain attributes, e.g., generating different human faces given a sentiment." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-43", "text": "We view the source contexts as the added conditional attributes and use the CVAE to generate diverse attention matrices instead of images."
}, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-44", "text": "This context vector is represented by the source sentence encoding h^s." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-45", "text": "The CVAE encoder is conditioned on two variables, the attention matrix A and the sentence encoding: q_E(z|A, h^s)." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-46", "text": "Analogously, for the decoder, the likelihood is now conditioned on two variables, a latent code z and again the source sentence encoding: p_D(A|z, h^s)." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-47", "text": "This training procedure of the CVAE is visualized in Figure 1 ." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-48", "text": "At test time, the attention scores from the attention matrix, pre-generated from a latent code sample and the source sentence encoding, are used instead of the standard seq2seq model's attention mechanism." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-49", "text": "----------------------------------" }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-50", "text": "**EXPERIMENTS**" }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-51", "text": "----------------------------------" }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-52", "text": "**PRIOR ATTENTION FOR TEXT SIMPLIFICATION**" }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-53", "text": "While our model is essentially task-agnostic, we demonstrate prior attention for the task of sentence simplification." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-54", "text": "The goal of sentence simplification is to reduce the linguistic complexity of text, while still retaining its original information and meaning." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-55", "text": "It has been suggested that sentence simplification can be defined by three major types of operations: splitting, deletion, and paraphrasing (Shardlow, 2014) ."
}, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-56", "text": "We hypothesize that these operations occur at varying frequencies in the training data." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-57", "text": "We adopt our model in an attempt to capture these operations into attention matrices and the latent vector space, and thus control the form and degree of simplification through sampling from that space." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-58", "text": "We train on the Wikilarge collection used by Zhu (2010)." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-59", "text": "Wikilarge is a collection of 296,402 automatically aligned complex and simple sentences from the ordinary and simple English Wikipedia corpora, used extensively in previous work (Wubben et al., 2012; Woodsend and Lapata, 2011; Zhang and Lapata, 2017; Nisioi et al., 2017) ." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-60", "text": "The training data includes 2,000 development and 359 test instances created by Xu et al. (2016)." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-61", "text": "These are complex sentences paired with simplifications provided by Amazon Mechanical Turk workers and provide a more reliable evaluation of the task." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-62", "text": "----------------------------------" }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-63", "text": "**HYPERPARAMETERS AND OPTIMIZATION**" }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-64", "text": "We extend the OpenNMT (Klein et al., 2017) framework with functions for attention generation and release our code as a submodule." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-65", "text": "[Table 2 caption: Quantitative evaluation of existing baselines from previous work and seq2seq with prior attention from the CVAE when choosing an optimal z sample for BLEU scores.] We use a similar archi-"
}, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-66", "text": "tecture as Zhu et al. (2010) and Nisioi et al. (2017) : 2 layers of stacked unidirectional LSTMs with bi-linear global attention as proposed by Luong et al. (2015) , with hidden states of 512 dimensions." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-67", "text": "The vocabulary is reduced to the 50,000 most frequent tokens and embedded in a shared 500-dimensional space." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-68", "text": "We train using SGD with batches of 64 samples for 13 epochs after which the autoencoder is trained by translating sequences from training data." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-69", "text": "Both the encoder and decoder of the CVAE comprise 2 fully connected layers of 128 nodes." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-70", "text": "Weights are optimized using ADAM (Kingma and Ba, 2014) ." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-71", "text": "We visualize and evaluate using a two-dimensional latent vector space." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-72", "text": "Source and target sequences are both padded or reduced to 50 tokens." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-73", "text": "The integration of the CVAE is analogous across the family of attention-based seq2seq models, i.e., our approach can be applied more generally with different models or training data." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-74", "text": "----------------------------------" }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-75", "text": "**DISCUSSION**" }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-76", "text": "To study the influence of sampling from different regions in the latent vector space, we visualize the resulting attention matrices and measure simplification quality using automated metrics." 
}, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-77", "text": "Figure 2a shows the two-dimensional latent space for a single source sentence encoding using 64 samples ranging from values \u22122 to 2." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-78", "text": "Next to the target-to-source length ratio, we apply automated measures commonly used to evaluate simplification systems (Woodsend and Lapata, 2011; Zhang and Lapata, 2017) : BLEU, SARI (Xu et al., 2016) , and FKGL (Kincaid, 1975) ." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-79", "text": "Automated evaluation metrics for matrices originating from samples from different regions of latent codes are shown in Figure 2b ." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-80", "text": "Inclusion of an attention mechanism was instrumental to match existing baselines." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-81", "text": "Our standard seq2seq model with attention, without prior attention, obtains a score of 89.92 BLEU points, which is close to scores obtained by similar models used in existing [Footnote 1: Flesch-Kincaid Grade Level index.]" }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-82", "text": "work on neural text simplification (Zhang and Lapata, 2017; Nisioi et al., 2017) ." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-83", "text": "In Table 2 , we compare our seq2seq model with attention and without prior attention." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-84", "text": "A value for BLEU of 90.14 is found for z =[\u22122,0] which was tuned on a development set." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-85", "text": "For the same z value, a SARI value of 38.30 was reached." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-86", "text": "For comparison, we include the SMT-based model of Wubben et al. (2012), the NTS model of Nisioi et al. (2017), and the EncDecA of Zhang and Lapata (2017)."
}, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-87", "text": "For decreasing values of the first hidden dimension z 1 , we observe that attention becomes situated at the diagonal, thus keeping closer to the structure of the source sentence and having one-to-one word alignments." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-88", "text": "For increasing values of z 1 , attention becomes more vertical and focused on single encoder states." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-89", "text": "This type of attention gives more control to the language model, as exemplified by output samples shown in Table 1 ." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-90", "text": "Output from this region is far longer and less related to the source sentence." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-91", "text": "Influence of the second latent variable z 2 is less apparent from the attention matrices." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-92", "text": "However, sampling across this dimension shows large effects on evaluation metrics." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-93", "text": "For decreasing values, output becomes more similar to the source, with higher BLEU as a result." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-94", "text": "Sampling these values along the zero-axis results in the overall highest BLEU and SARI scores, trading similarity for simplification and readability." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-95", "text": "----------------------------------" }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-96", "text": "**CONCLUSION**" }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-97", "text": "We introduced a method to control the decoding process in sequence-to-sequence models using attention, in terms of stylistic characteristics of the output." 
}, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-98", "text": "This means that the trained model is able to produce output with custom stylistic properties, given a well-chosen style input vector by the user at prediction time." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-99", "text": "Given the input sequence and an additional code vector to influence decoding characteristics, a variational autoencoder generates an attention matrix, which is used by the decoder to generate the output sequence according to the alignment style directed by the code vector." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-100", "text": "We demonstrated the resulting variations in output for the task of text simplification." }, { "sent_id": "0ab60c5c9ace058a5fbe3bc2643cba-C001-101", "text": "Yet, our method can be applied to any form of parallel text: we expect different types of training collections, such as translation or style transfer, to give rise to different characteristics or mappings in the latent space." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "0ab60c5c9ace058a5fbe3bc2643cba-C001-8" ], [ "0ab60c5c9ace058a5fbe3bc2643cba-C001-59" ] ], "cite_sentences": [ "0ab60c5c9ace058a5fbe3bc2643cba-C001-8", "0ab60c5c9ace058a5fbe3bc2643cba-C001-59" ] }, "@USE@": { "gold_contexts": [ [ "0ab60c5c9ace058a5fbe3bc2643cba-C001-65", "0ab60c5c9ace058a5fbe3bc2643cba-C001-66" ], [ "0ab60c5c9ace058a5fbe3bc2643cba-C001-86" ] ], "cite_sentences": [ "0ab60c5c9ace058a5fbe3bc2643cba-C001-66", "0ab60c5c9ace058a5fbe3bc2643cba-C001-86" ] }, "@SIM@": { "gold_contexts": [ [ "0ab60c5c9ace058a5fbe3bc2643cba-C001-81", "0ab60c5c9ace058a5fbe3bc2643cba-C001-82" ] ], "cite_sentences": [ "0ab60c5c9ace058a5fbe3bc2643cba-C001-82" ] } } }, "ABC_2d2ec7230a651d1d6786d0f8a71f7e_36": { "x": [ { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-2", "text": "We assess the reliability and accuracy of (neural) word embeddings for both modern and historical English and German." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-3", "text": "Our research provides deeper insights into the empirically justified choice of optimal training methods and parameters." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-4", "text": "The overall low reliability we observe, nevertheless, casts doubt on the suitability of word neighborhoods in embedding spaces as a basis for qualitative conclusions on synchronic and diachronic lexico-semantic matters, an issue currently high up in the agenda of Digital Humanities." 
}, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-5", "text": "----------------------------------" }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-7", "text": "Distributional methods applied to large-sized, often temporally stratified corpora have markedly enhanced the methodological repertoire of both synchronic and diachronic computational linguistics and are getting more and more popular in the Digital Humanities (see Section 2.2)." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-8", "text": "However, using such quantitative data as a basis for qualitative, empirically-grounded theories requires that measurements should not only be accurate, but also reliable." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-9", "text": "Only under such a guarantee can quantitative data be assembled from different experiments as a foundation for trustworthy theories." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-10", "text": "Measuring word similarity by word neighborhoods in embedding space can be used to detect diachronic shifts or domain-specific usage, by training word embeddings on suited corpora and comparing these representations." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-50", "text": "----------------------------------" }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-11", "text": "Additionally, lexical items close in the embedding space to the lexical item under scrutiny can be considered as approximating its meaning at a given point in time or in a specific domain." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-12", "text": "These two lines of research converge in prior work to show, e.g., the increasing association of the lexical item 'gay' with the meaning dimension of homosexuality (Kim et al., 2014; Kulkarni et al., 2015) ."
}, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-13", "text": "Neural word embeddings (Mikolov et al., 2013) are probably the most influential among all embedding types (see Section 2.1)." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-14", "text": "Yet, we gathered evidence that the inherent randomness involved in their generation affects the reliability of word neighborhood judgments and demonstrate how this hampers qualitative conclusions based on such models." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-15", "text": "Our investigation was performed on both historical (for the time span of 1900 to 1904) and contemporary texts (for the time span of 2005 to 2009) in two languages, English and German." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-16", "text": "It is thus a continuation of prior work, in which we investigated historical English texts only (Hellrich and Hahn, 2016a) , and also influenced by the design decisions of Kim et al. (2014) and Kulkarni et al. (2015) which were the first to use word embeddings in diachronic studies." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-17", "text": "Our results cast doubt on the reproducibility of such experiments where neighborhoods between words in embedding space are taken as a computationally valid indicator for properly capturing lexical meaning (and, consequently, meaning shifts)." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-18", "text": "linguistics." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-19", "text": "The word2vec family of algorithms, developed from heavily trimmed artificial neural networks, is a widely used and robust way to generate such embeddings (Mikolov et al., 2013; Levy et al., 2015) ." 
}, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-20", "text": "Its skip-gram variant predicts plausible contexts for a given word, whereas the alternative continuous bag-of-words variant tries to predict words from contexts; we focus on the former as it is generally reported to be superior (see e.g., Levy et al. (2015) )." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-21", "text": "There are two strategies for managing the huge number of potential contexts a word can appear in." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-22", "text": "Skip-gram hierarchical softmax (SGHS) uses a binary tree to more efficiently represent the vocabulary, whereas skip-gram negative sampling (SGNS) updates only a limited number of word vectors during each training step." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-23", "text": "SGNS is preferred in general, yet SGHS showed slight benefits in some reliability scenarios in our prior investigations (Hellrich and Hahn, 2016a) ." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-24", "text": "There are two sources of randomness involved in the training of neural word embeddings: First, the random initialization of all word vectors before any examples are processed." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-25", "text": "Second, the order in which these examples are processed." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-26", "text": "Both can be replaced by deterministic alternatives, yet this would simply replace a random distortion with a fixed one, thus providing faux reliability only useful for testing purposes." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-27", "text": "A range of other word embedding algorithms was inspired by word2vec, either trying to avoid the opaqueness stemming from its neural network heritage (GloVe; still using random initialization, see Pennington et al.
(2014) ) or adding capabilities, like using syntactic information during training (Levy and Goldberg, 2014) or modeling multiple word senses (Bartunov et al., 2016; Panchenko, 2016) ." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-28", "text": "Levy et al. (2015) created SVD PPMI , a variant of the classical pointwise mutual information co-occurrence metric (see e.g., Manning and Sch\u00fctze (1999, pp.178-183) ), by transferring pre-processing steps and hyper-parameters uncovered by the development of these algorithms, and reported similar or slightly better performance than SGNS on evaluation tasks." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-29", "text": "It is conceptually not affected by reliability problems, as there is no random initialization or relevant processing order." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-49", "text": "Diachronic research focuses on the English Fiction part, with the exception of some work relating to German data (Hellrich and Hahn, 2016b) ." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-30", "text": "Word embeddings capture both syntactic and semantic information (and arguably also social biases, see Bolukbasi et al. (2016) ) in vector form and can thus be evaluated by their ability to calculate the similarity of two words and perform analogy-based reasoning; there exist several other evaluation methods and more test sets than discussed here, see e.g., Baroni et al. (2014) ." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-31", "text": "Mikolov et al. (2013) provide an analogy test set for measuring performance as the percentage of correctly calculated analogies for test cases such as the frequently cited 'king'-'queen' example (see Section 3)." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-32", "text": "Word similarity is evaluated by calculating Spearman's rank coefficient between embedding-derived predictions and a gold standard of human word similarity judgments." 
}, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-33", "text": "Finkelstein et al. (2002) developed a widely used test set with 353 English word pairs, 2 a similar resource for German with 350 word pairs was provided by Zesch and Gurevych (2006) ." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-34", "text": "3 Recent work cautions that performance on such tasks is not always predictive for performance in down-stream applications (Batchkarov et al., 2016) ." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-35", "text": "----------------------------------" }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-36", "text": "**DIACHRONIC APPLICATION**" }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-37", "text": "Word embeddings can be used rather directly for tracking semantic changes, namely by measuring the similarity of word representations generated for one word at different points in time-words which underwent semantic shifts will be dissimilar with themselves." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-38", "text": "These models must either be trained in a continuous manner where the model for each time span is initialized with its predecessor (Kim et al., 2014; Hellrich and Hahn, 2016b) , or a mapping between models for different points in time must be calculated (Kulkarni et al., 2015; Hamilton et al., 2016) ." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-39", "text": "The first approach cannot be performed in parallel and is thus rather time-consuming, if texts are not subsampled." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-40", "text": "We nevertheless discourage using samples instead of full corpora, as we observed extremely low reliability values between different samples (Hellrich and Hahn, 2016a) ." 
}, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-41", "text": "Word embeddings can also be used in diachronic studies without any kind of mapping to track clusters of similar words over time and, thus, model the evolution of topics (Kenter et al., 2015) or compare neighborhoods in embedding space for preselected words (Jo, 2016) ." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-42", "text": "Besides temporal variations, word embeddings can also used to analyze geographic ones, e.g., the distinction between US American and British English variants (Kulkarni et al., 2016) ." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-43", "text": "Most of these studies were performed with algorithms from the word2vec family, respectively GloVe in Jo (2016), and are thus likely to be affected by the same systematic reliability problems on which we focus here." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-44", "text": "Only Hamilton et al. (2016) used SVD PPMI in some of their very recent experiments and showed it to be adequate for exploring historical semantics." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-45", "text": "The Google Books Ngram corpus (GBN; Michel et al. (2011 ), Lin et al. (2012 ) is used in most of the studies we already mentioned, including our current study and its predecessor (Hellrich and Hahn, 2016a) ." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-46", "text": "It contains about 6% of all books published between 1500 and 2009 in the form of n-grams (up to pentagrams), together with their frequency for each year." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-47", "text": "This corpus has often been criticized for its opaque sampling strategy, as its constituent books remain unknown and can be shown to form an unbalanced collection (Pechenick et al., 2015) ." 
}, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-48", "text": "GBN is multilingual, with its English part being subdivided into regional segments (British, US) and topic categories (general language and fiction texts)." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-51", "text": "**EVALUATION METHODS**" }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-52", "text": "Reliability, in this study, is judged by training three identically parametrized models for each experiment and by comparing the n next neighbors (by cosine distance) for each word modeled by the experiments with a variant of the Jaccard coefficient (Manning and Sch\u00fctze, 1999, p.299) ." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-53", "text": "The 3-dimensional array W i,j,k contains words ordered by closeness (i) for a word in question (j) according to an experiment (k)." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-54", "text": "The reliability r for a specific value of n (r@n) is defined as the magnitude of the intersection of similar words produced by all three experiments with a rank of n or lower, averaged over all t words modeled by these experiments and normalized by n, which is the maximally achievable score for this value of n:" }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-55", "text": "Accuracy, in this study, is measured considering two different approaches-analogy and similarity." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-56", "text": "The analogy approach uses the English test set developed by Mikolov et al. (2013) by calculating the percentage of correct analogies made by a word2vec model." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-57", "text": "It contains groups of four words connected via the analogy relation '::' and the similarity relation '\u223c', as exemplified by the expression 'king' \u223c 'queen' :: 'man' \u223c 'woman'." 
}, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-58", "text": "The similarity approach covers both English and German by calculating Spearman's rank correlation coefficient between the similarity judgments made by a word2vec model for a word pair (e.g., 'bread' and 'butter') and the human judgment thereof (Finkelstein et al., 2002; Zesch and Gurevych, 2006) ." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-59", "text": "Pairs containing words not modeled for the time span in question, such as the at that time non-existent 'FBI' in the early 20th century, are simply ignored." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-60", "text": "All three test sets are based on contemporary language and current world knowledge and might thus not fully match the requirements for historical texts, yet are also used for these due to the lack of a suitable alternative." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-61", "text": "Accuracy values were calculated independently for each of the three identically parametrized models and subsequently averaged, but resulting deviations were negligible." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-62", "text": "----------------------------------" }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-63", "text": "**EXPERIMENTAL SET-UP 4.1 CORPUS**" }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-64", "text": "Our experiments 4 were performed on the German part and the English Fiction part of the GBN; the latter is known to be less unbalanced than the general English part (Pechenick et al., 2015) ." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-65", "text": "Both corpus splits differ in size and contain mainly contemporary texts (from the past fifty years), as is evident from Figure 1 ; note the logarithmic axis and the negative impact of both World Wars on book production." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-66", "text": "Following Kulkarni et al. 
(2015) , we trained our models on all 5-grams occurring during five consecutive years for the two time spans, 5 1900-1904 and 2005-2009 ; the number of 5-grams 6 for each time span is listed in Table 1 ." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-67", "text": "The two languages share a similar number of 5-grams for 1900-1904, yet not for [2005] [2006] [2007] [2008] [2009] ." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-68", "text": "5-grams from both corpus parts were lower cased for training." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-69", "text": "The German part was not only taken as is, but also orthographically normalized using the CAB service (Jurish, 2013) ." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-70", "text": "7 We incorporated this step because major changes in German orthography occurred during the 20th century, an issue that could hamper diachronic comparisons, e.g., archaic 'Gem\u00fcth' (in English: \"mind, emotional disposition\") became modern 'Gem\u00fct'." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-71", "text": "Table 1 shows the resulting reduction in the number of types, bringing the morphologically richer German to levels below English (yet this reduction is in line with the respective corpus sizes)." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-72", "text": "----------------------------------" }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-73", "text": "**TRAINING**" }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-74", "text": "We used the PYTHON-based GENSIM 8 implementation of word2vec to independently train word embeddings for each time span with 200 dimensions, a context window of 4 (limited by the 5-gram size), a minimum frequency of 10, and 10 \u22125 as the threshold for downsampling frequent words." 
}, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-75", "text": "We processed the full subcorpora for each time span, due to the extremely low reliability values between samples we observed in previous investigations (Hellrich and Hahn, 2016a) ." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-76", "text": "We tested both SGNS with 5 noise words and SGHS training strategies and trained for 10 iterations, saving the resulting embeddings after each epoch." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-77", "text": "During each epoch the learning rate was decreased from 0.025 to 0.0001." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-78", "text": "The averaged cosine values between word embeddings before and after an epoch are used as a convergence measure c (Kim et al., 2014; Kulkarni et al., 2015) ." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-79", "text": "It is defined for a vocabulary with n words and a matrix W containing word embedding vectors (normalized to length 1) for words i from training epochs e and e-1:" }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-80", "text": "We also define \u2206c, the change of c during subsequent epochs e-1, as another convergence criterion:" }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-81", "text": "5 Results Table 2 shows the performance of the systems trained according to the settings described in Section 4.2, as measured by similarity accuracy and top-1 reliability (see below for other cut-offs)." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-82", "text": "We make the following observations:" }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-83", "text": "1. Both accuracy and reliability are higher for SGNS than for SGHS for all tested combinations of languages and time spans, if 10 training epochs are used." 
}, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-84", "text": "We also measured analogy accuracy for the English Fiction data sets, and observed no negative effect of multiple training epochs, yet a more pronounced gap between training methods, e.g., 36% of all analogies were correct for SGNS and only 27% for SGHS after one epoch on 1900-1904 data." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-85", "text": "In the following, we further explore system performance as influenced, e.g., by word frequency, word ambiguity and the number of training epochs." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-86", "text": "For German, we focus on the normalized version due to the overall similar performance and suitability for further applications." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-87", "text": "Influence of Neighborhood Size." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-88", "text": "Reliability at different top-n cut-offs is very similar for all languages and time spans under scrutiny, confirming previous observations in Hellrich and Hahn (2016a) and strengthening the suggestion to use only top-1 reliability for evaluation." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-89", "text": "Figure 2 illustrates this phenomenon with an SGNS trained on 1900-1904 English Fiction data." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-90", "text": "We assume this to be connected with the general decrease in word2vec embedding utility for high values of n already observed by Schnabel et al. (2015) ." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-91", "text": "Influence of Word Frequency." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-92", "text": "Figures 3 and 4 depict the influence of word frequency (as percentile ranks) for English, as well as orthographically normalized German." 
}, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-93", "text": "Negative sampling is overall more reliable, especially for words with low or medium frequency." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-94", "text": "Word frequency has a less pronounced effect on reliability for German and negative sampling is again preferable, especially for low or medium frequency words." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-95", "text": "The 21 English words reported to have undergone traceable semantic changes in prior work 9 are all frequent with percentiles between 89 and 99-for such high-frequency words hierarchical softmax performs similarly or even slightly better." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-96", "text": "The relatively low reliability for medium-frequency English words, as compared to German ones, could be caused by a peculiar pattern of word co-occurrences, illustrated in Figures 5 and 6 for 1900-1904 English Fiction, respectively normalized German." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-97", "text": "Medium-frequency English words have fewer co-occurrences with low-frequency words than German ones, which might result in a lack of specific contexts for these words during training and thus hamper embedding quality." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-98", "text": "Influence of Word Ambiguity." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-99", "text": "Entries in lexical databases, such as WORDNET 10 (Fellbaum, 1998) and its German counterpart GERMANET 11 (Lemnitzer and Kunze, 2002) , can be employed to approximate the effect of word ambiguity on reliability." 
}, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-100", "text": "The number of synsets a word belongs to (i.e., the number of its senses) seems to be positively correlated with top-1 reliability for English, as shown in Figure 7 , whereas orthographically normalized German is less affected by ambiguity as Figure 8 reveals." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-101", "text": "This counter-intuitive effect for English seems to be caused by the low ambiguity of infrequent words-results become more uniform, if analysis is limited to high frequency words (e.g., 90th frequency percentile or higher)." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-102", "text": "Influence of the Number of Training Epochs." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-103", "text": "Table 2 )." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-104", "text": "Figures 11 and 12 show the results for English and orthographically normalized German, respectively." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-105", "text": "Note that accuracy is assessed on a test set for modern-day language, and can thus not be considered a fully valid yardstick." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-106", "text": "Accuracy behaves similar to reliability, as under the negative sampling condition it clearly profits from multiple training epochs." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-107", "text": "This effect is more pronounced for smaller corpora; the biggest corpus (i.e., English Fiction 2005 -2009 ) shows a slight regression in accuracy after more than 5 training epochs." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-108", "text": "Conclusions." 
}, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-109", "text": "Both reliability and accuracy point towards negative sampling with 4 to 6 training epochs (6 being better for smaller and 4 being better for larger corpora) as the optimal training regime for all tested combinations of languages and time spans (implicitly, this is also a test on largely varying corpus sizes, see Table 1 )." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-110", "text": "Such a training scheme yields models with high reliability without losses in accuracy (that would indicate overfitting)." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-111", "text": "Figure 13 shows \u2206c, i.e., the difference of the convergence measure c (Equations (2) and (3) averaged over all three models) between subsequent epochs, for both German and English data from the intervals 1900-1904 and 2005-2009 ." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-112", "text": "Few changes occur after 4-6 epochs, which could be alternatively expressed as a \u2206c of about 0.003." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-113", "text": "The convergence criterion proposed by Kulkarni et al. (2015) , i.e., c = 0.9999, was never reached (this observation might be explained by Kulkarni et al.'s decision not to reset the learning rate for each training epoch, as was done by us and Kim et al. (2014) )." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-114", "text": "SVD PPMI , which are conceptually not bothered by the reliability problems we discussed here, were not a good fit for the hyperparameters we adopted from Kulkarni et al. (2015) ." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-115", "text": "Hamilton et al. (2016) reports similarity accuracy superior to SGNS, whereas for our set-up results in pretests were about 10 percent points worse than skip-gram embeddings, e.g., only 0.35 for 1900-1904 English Fiction." 
}, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-116", "text": "Finally, to want to illustrate how this reliability problem affects qualitative conclusions." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-117", "text": "In Table 3 we provide some examples in which three negative sampling models for 1900-1904 English Fiction did not agree on the closest neighbor for words in question (mostly drawn from the list in Footnote 9)." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-118", "text": "The most inconsistent word neighborhoods are provided for 'romantic' which is connected to 'lazzaroni', 12 'fanciful' and 'melancholies'." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-119", "text": "This holds despite the high frequency (94th percentile) and moderate ambiguity (5 synsets) of the target item 'romantic'." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-120", "text": "----------------------------------" }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-121", "text": "**DISCUSSION**" }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-122", "text": "Our investigation into the accuracy and reliability of skip-gram word embeddings shows even the most reliable systems too often provide inconsistent word neighborhoods." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-123", "text": "This carries unwarranted potential for erroneous conclusions on a word's semantic evolution as was shown, e.g., for the lexical item 'romantic' and English Fiction texts from the 1900-1904 time slice." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-124", "text": "We are thus skeptical about using word neighborhoods in skip-gram embedding space to adequately capture natural languages' lexical semantics (for English and German, at least)." 
}, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-125", "text": "While we found some mitigation strategies, i.e., training for multiple epochs or using our convergence criterion of \u2206c 0.003, we assume SVD PPMI to be conceptually superior." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-126", "text": "Future work might try to provide general guidelines for proper hyperparameter selection for SVD PPMI , especially regarding complete temporal slices of the GBN (Hamilton et al. (2016) used samples)." }, { "sent_id": "2d2ec7230a651d1d6786d0f8a71f7e-C001-127", "text": "Alternatively, training several identically parametrized SGNS/SGHS models and combining them into an ensemble might constitute an easy way to reduce the reliability problems we described, yet at the price of exorbitant computational costs." } ], "y": { "@BACK@": { "gold_contexts": [ [ "2d2ec7230a651d1d6786d0f8a71f7e-C001-12" ], [ "2d2ec7230a651d1d6786d0f8a71f7e-C001-38" ] ], "cite_sentences": [ "2d2ec7230a651d1d6786d0f8a71f7e-C001-12", "2d2ec7230a651d1d6786d0f8a71f7e-C001-38" ] }, "@USE@": { "gold_contexts": [ [ "2d2ec7230a651d1d6786d0f8a71f7e-C001-16" ], [ "2d2ec7230a651d1d6786d0f8a71f7e-C001-78" ] ], "cite_sentences": [ "2d2ec7230a651d1d6786d0f8a71f7e-C001-16", "2d2ec7230a651d1d6786d0f8a71f7e-C001-78" ] }, "@MOT@": { "gold_contexts": [ [ "2d2ec7230a651d1d6786d0f8a71f7e-C001-37", "2d2ec7230a651d1d6786d0f8a71f7e-C001-38", "2d2ec7230a651d1d6786d0f8a71f7e-C001-39" ] ], "cite_sentences": [ "2d2ec7230a651d1d6786d0f8a71f7e-C001-38" ] }, "@SIM@": { "gold_contexts": [ [ "2d2ec7230a651d1d6786d0f8a71f7e-C001-113" ] ], "cite_sentences": [ "2d2ec7230a651d1d6786d0f8a71f7e-C001-113" ] } } }, "ABC_59a7c1fffdd45f8e152d060a4b9f50_36": { "x": [ { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-2", "text": "We use referential translation machines (RTMs) for predicting translation performance." 
}, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-3", "text": "RTMs pioneer a language independent approach to all similarity tasks and remove the need to access any task or domain specific information or resource." }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-4", "text": "We improve our RTM models with the ParFDA instance selection model , with additional features for predicting the translation performance, and with improved learning models." }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-5", "text": "We develop RTM models for each WMT15 QET (QET15) subtask and obtain improvements over QET14 results." }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-6", "text": "RTMs achieve top performance in QET15 ranking 1st in document-and sentence-level prediction tasks and 2nd in word-level prediction task." }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-7", "text": "Referential translation machines are a computational model effectively judging monolingual and bilingual similarity while identifying translation acts between any two data sets with respect to interpretants." }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-8", "text": "RTMs achieve top performance in automatic, accurate, and language independent prediction of machine translation performance and reduce our dependence on any task dependent resource." }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-9", "text": "Prediction of translation performance can help in estimating the effort required for correcting the translations during post-editing by human translators." 
}, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-10", "text": "We improve our RTM models (Bi\u00e7ici and Way, 2014):" }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-11", "text": "\u2022 by using improved ParFDA instance selection model allowing better language models (LM) in which similarity judgments are made to be built with improved optimization and selection of the LM data," }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-12", "text": "\u2022 by selecting TreeF features over source and translation data jointly instead of taking their intersection," }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-13", "text": "\u2022 with extended learning models including bayesian ridge regression (Tan et al., 2015) , which did not obtain better performance than support vector regression in training results (Section 2.2)." }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-14", "text": "We present top results with Referential Translation Machines (Bi\u00e7ici, 2015; Bi\u00e7ici and Way, 2014) at quality estimation task (QET15) in WMT15 (Bojar et al., 2015) ." }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-15", "text": "RTMs pioneer a computational model for quality and semantic similarity judgments in monolingual and bilingual settings using retrieval of relevant training data (Bi\u00e7ici and Yuret, 2015) as interpretants for reaching shared semantics." }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-16", "text": "RTMs use Machine Translation Performance Prediction (MTPP) System Bi\u00e7ici, 2015) , which is a state-of-the-art performance predictor of translation even without using the translation by using only the source." }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-17", "text": "We use ParFDA for selecting the interpretants Bi\u00e7ici and Yuret, 2015) and build an MTPP model." 
}, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-18", "text": "MTPP derives indicators of the closeness of test sentences to the available training data, the difficulty of translating the sentence, and the presence of acts of translation for data transformation." }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-19", "text": "We view that acts of translation are ubiquitously used during communication:" }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-20", "text": "Every act of communication is an act of translation (Bliss, 2012)." }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-21", "text": "Figure 1 depicts RTM." }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-22", "text": "Our encouraging results in QET provides a greater understanding of the acts of translation we ubiquitously use and how they can be used to predict the performance of translation." }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-23", "text": "RTMs are powerful enough to be applicable in different domains and tasks while achieving top performance." }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-24", "text": "304" }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-25", "text": "----------------------------------" }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-26", "text": "**REFERENTIAL TRANSLATION MACHINE (RTM)**" }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-27", "text": "Referential translation machines are a computational model effectively judging monolingual and bilingual similarity while identifying translation acts between any two data sets with respect to interpretants." }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-28", "text": "RTMs achieve top performance in automatic, accurate, and language independent prediction of machine translation performance and reduce our dependence on any task dependent resource." 
}, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-29", "text": "Prediction of translation performance can help in estimating the effort required for correcting the translations during post-editing by human translators." }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-30", "text": "We improve our RTM models (Bi\u00e7ici and Way, 2014 ):" }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-31", "text": "\u2022 by using improved ParFDA instance selection model allowing better language models (LM) in which similarity judgments are made to be built with improved optimization and selection of the LM data," }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-32", "text": "\u2022 by selecting TreeF features over source and translation data jointly instead of taking their intersection," }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-33", "text": "\u2022 with extended learning models including bayesian ridge regression (Tan et al., 2015) , which did not obtain better performance than support vector regression in training results (Section 2.2)." }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-34", "text": "We present top results with Referential Translation Machines (Bi\u00e7ici, 2015; Bi\u00e7ici and Way, 2014) at quality estimation task (QET15) in WMT15 (Bojar et al., 2015) ." }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-35", "text": "RTMs pioneer a computational model for quality and semantic similarity judgments in monolingual and bilingual settings using retrieval of relevant training data (Bi\u00e7ici and Yuret, 2015) as interpretants for reaching shared semantics." }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-36", "text": "RTMs use Machine Translation Performance Prediction (MTPP) System Bi\u00e7ici, 2015) , which is a state-of-the-art performance predictor of translation even without using the translation by using only the source." 
}, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-37", "text": "We use ParFDA for selecting the interpretants Bi\u00e7ici and Yuret, 2015) and build an MTPP model." }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-38", "text": "MTPP derives indicators of the closeness of test sentences to the available training data, the difficulty of translating the sentence, and the presence of acts of translation for data transformation." }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-39", "text": "We view that acts of translation are ubiquitously used during communication:" }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-40", "text": "Every act of communication is an act of translation (Bliss, 2012) ." }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-41", "text": "Figure 1 depicts RTM." }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-42", "text": "Our encouraging results in QET provides a greater understanding of the acts of translation we ubiquitously use and how they can be used to predict the performance of translation." }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-43", "text": "RTMs are powerful enough to be applicable in different domains and tasks while achieving top performance." }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-44", "text": "----------------------------------" }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-45", "text": "**RTM IN THE QUALITY ESTIMATION TASK**" }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-46", "text": "We participate in all of the three subtasks of the quality estimation task (QET) (Bojar et al., 2015) , which include English to Spanish (en-es), English to German (en-de), and German to English (deen) translation directions." }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-47", "text": "There are three subtasks: sentence-level prediction (Task 1), wordlevel prediction (Task 2), and document-level prediction (Task 3)." 
}, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-48", "text": "Task 1 is about predicting HTER (human-targeted translation edit rate) (Snover et al., 2006) scores of sentence translations, Task 2 is about binary classification of word-level quality, and Task 3 is about predicting METEOR (Lavie and Agarwal, 2007) scores of document translations." }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-49", "text": "Instance selection for the training set and the language model (LM) corpus is handled by ParFDA , whose parameters are optimized for each translation task." }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-50", "text": "LM are trained using SRILM (Stolcke, 2002) ." }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-51", "text": "We tokenize and truecase all of the corpora using code released with Moses (Koehn et al., 2007) 1 ." }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-52", "text": "Table 1 lists the number of sentences in the training and test sets for each task." }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-53", "text": "1 mosesdecoder/scripts/" }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-54", "text": "----------------------------------" }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-55", "text": "**RTM PREDICTION MODELS AND OPTIMIZATION**" }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-56", "text": "We present results using support vector regression (SVR) with RBF (radial basis functions) kernel (Smola and Sch\u00f6lkopf, 2004) for sentence and document translation prediction tasks and Global Linear Models (GLM) (Collins, 2002) with dynamic learning (GLMd) (Bi\u00e7ici, 2013; Bi\u00e7ici and Way, 2014) for word-level translation performance prediction." 
}, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-57", "text": "We also use these learning models after a feature subset selection (FS) with recursive feature elimination (RFE) (Guyon et al., 2002) or a dimensionality reduction and mapping step using partial least squares (PLS) (Specia et al., 2009 ), or PLS after FS (FS+PLS)." }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-58", "text": "GLM relies on Viterbi decoding, perceptron learning, and flexible feature definitions." }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-59", "text": "GLMd extends the GLM framework by parallel perceptron training (McDonald et al., 2010) and dynamic learning with adaptive weight updates in the perceptron learning algorithm:" }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-60", "text": "where \u03a6 returns a global representation for instance i and the weights are updated by \u03b1, which dynamically decays the amount of the change during weight updates at later stages and prevents large fluctuations with updates." }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-61", "text": "The learning rate updates the weight values with weights in the range [a, b] using the following function taking error rate as the input:" }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-62", "text": "Learning rate curve for a = 0.5 and b = 1.0 is provided in Figure 2:" }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-63", "text": "----------------------------------" }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-64", "text": "**TRAINING RESULTS**" }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-65", "text": "We use mean absolute error (MAE), relative absolute error (RAE), root mean squared error (RMSE), and correlation (r) as well as relative MAE (MAER) and relative RAE (MRAER) to evaluate (Bi\u00e7ici, 2015; Bi\u00e7ici, 2013) ." 
}, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-66", "text": "MAER is mean absolute error relative to the magnitude of the target and MRAER is mean absolute error relative to the absolute error of a predictor always predicting the target mean assuming that target mean is known (Bi\u00e7ici, 2015 2012) calculates the average quality difference between the top n\u22121 quartiles and the overall quality for the test set." }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-67", "text": "Table 2 presents the training results for Task 1 and Task 3." }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-68", "text": "Table 3 presents Task 2 training results." }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-69", "text": "We refer to GLMd parallelized over 4 splits as GLMd s4 and GLMd with 5 splits as GLMd s5." }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-70", "text": "----------------------------------" }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-71", "text": "**TEST RESULTS**" }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-72", "text": "Task 1: Predicting the HTER for Sentence Translations The results on the test set are given in Table 4 ." }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-73", "text": "Rank lists the overall ranking in the task out of about 9 submissions." }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-74", "text": "We obtain the rankings by sorting according to the predicted scores and randomly assigning ranks in case of ties." }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-75", "text": "RTMs with FS followed by PLS and learning with SVR is able to achieve the top rank in this task." }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-76", "text": "Task 2: Prediction of Word-level Translation Quality Task 2 is about binary classification of word-level quality." 
}, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-77", "text": "We develop individual RTM models for each subtask and use GLMd model (Bi\u00e7ici, 2013; Bi\u00e7ici and Way, 2014) , for predicting the quality at the word-level." }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-78", "text": "The results on the test set are in Table 5 where the ranks are out of about 17 submissions." }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-79", "text": "RTMs with GLMd becomes the second best system this task." }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-80", "text": "Task 3: Predicting METEOR of Document Translations Task 3 is about predicting ME-TEOR (Lavie and Agarwal, 2007) and their ranking." }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-81", "text": "The results on the test set are given in Table 4 where the ranks are out of about 6 submissions using wF 1 ." }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-82", "text": "RTMs achieve top rankings in this task." }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-83", "text": "Table 5 : RTM-DCU Task 2 results on the test set." }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-84", "text": "wF 1 is the average weighted F 1 score." }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-85", "text": "----------------------------------" }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-86", "text": "**RTMS ACROSS TASKS AND YEARS**" }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-87", "text": "We compare the difficulty of tasks according to MRAER levels achieved." }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-88", "text": "In Table 6 , we list the RTM test results for tasks and subtasks that predict HTER or METEOR from QET15, QET14 (Bi\u00e7ici and Way, 2014) , and QET13 (Bi\u00e7ici, 2013) ." }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-89", "text": "The best results when predicting HTER are obtained this year." 
}, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-90", "text": "----------------------------------" }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-91", "text": "**CONCLUSION**" }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-92", "text": "Referential translation machines achieve top performance in automatic, accurate, and language independent prediction of document-, sentence-, and word-level statistical machine translation (SMT) performance." }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-93", "text": "RTMs remove the need to access any SMT system specific information or prior knowledge of the training data or models used when generating the translations." }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-94", "text": "RTMs achieve top performance when predicting translation performance." }, { "sent_id": "59a7c1fffdd45f8e152d060a4b9f50-C001-95", "text": "Table 6 : Test performance of the top individual RTM results when predicting HTER or METEOR also including results from QET14 (Bi\u00e7ici and Way, 2014) and QET13 (Bi\u00e7ici, 2013) ." } ], "y": { "@USE@": { "gold_contexts": [ [ "59a7c1fffdd45f8e152d060a4b9f50-C001-56" ], [ "59a7c1fffdd45f8e152d060a4b9f50-C001-65" ], [ "59a7c1fffdd45f8e152d060a4b9f50-C001-77" ], [ "59a7c1fffdd45f8e152d060a4b9f50-C001-88" ], [ "59a7c1fffdd45f8e152d060a4b9f50-C001-95" ] ], "cite_sentences": [ "59a7c1fffdd45f8e152d060a4b9f50-C001-56", "59a7c1fffdd45f8e152d060a4b9f50-C001-65", "59a7c1fffdd45f8e152d060a4b9f50-C001-77", "59a7c1fffdd45f8e152d060a4b9f50-C001-88", "59a7c1fffdd45f8e152d060a4b9f50-C001-95" ] } } }, "ABC_97fd0f1ce3d4f510c1566d642e9d2c_37": { "x": [ { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-2", "text": "This paper describes the Neural Machine Translation system of IIIT-Hyderabad for the Gujarati\u2192English news translation shared task of WMT19." 
}, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-3", "text": "Our system is based on encoder-decoder framework with attention mechanism." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-4", "text": "We experimented with Multilingual Neural MT models." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-5", "text": "Our experiments show that Multilingual Neural Machine Translation leveraging parallel data from related language pairs helps in significant BLEU improvements upto 11.5, for low resource language pairs like Gujarati-English." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-6", "text": "----------------------------------" }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-8", "text": "Neural Machine Translation (Luong et al., 2015; Bahdanau et al., 2014; Johnson et al., 2017; Vaswani et al., 2017) has been receiving considerable attention in the recent years, given its superior performance without the demand of heavily hand crafted engineering efforts." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-9", "text": "NMT often outperforms Statistical Machine Translation (SMT) techniques but it still struggles if the parallel data is insufficient like in the case of Indian languages." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-10", "text": "The bulk of research on low resource NMT has focused on exploiting monolingual data or parallel data from other language pairs." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-11", "text": "Some recent methods to improve NMT models that exploit monolingual data ranges from back-translation (Sennrich et al., 2015a) , dual NMT (He et al., 2016) to Unsupervised MT models (Lample et al., 2017; Artetxe et al., 2017; Lample et al., 2018) ." 
}, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-12", "text": "Transfer Learning is also a promising approach for low resource NMT which exploits parallel data from other language pairs Nguyen and Chiang, 2017; Kocmi and Bojar, 2018) ." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-13", "text": "Typically it is achieved by training a parent model in a high resource language pair, then using some of the trained weights as the initialization for a child model and further train it on the low-resource language pair." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-14", "text": "Other promising approach for improving translation performance for low resource languages is Multilingual Neural Machine Translation." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-15", "text": "It has been shown that exploiting data from other language pairs & joint training helps in improving the translation performance of NMT models." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-16", "text": "(Ha et al., 2016; Firat et al., 2016; Johnson et al., 2017) ." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-17", "text": "This paper describes the NMT system of IIIT-H for WMT19 evaluation." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-18", "text": "We participated in the Gujarati\u2192English news translation task." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-19", "text": "We used an attention-based encoder-decoder model as our baseline system and used Byte Pair Encoding (BPE) to enable open vocabulary translation." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-20", "text": "We then leverage Hindi-English parallel corpus in a multilingual setting so as to improve our baseline system." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-21", "text": "We basically combined Hindi-English and Gujarati-English parallel corpus and use it as our training corpus." 
}, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-22", "text": "Our multilingual system is similiar to Johnson et al. (2017) but we don't use any artificial token at the start of source sentences to indicate the target language." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-23", "text": "The reason is trivial, that is we have only English as our target language." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-24", "text": "We also provide results of our experiments conducted post WMT19 shared task involving Transformer models." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-25", "text": "----------------------------------" }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-26", "text": "**NEURAL MT ARCHITECTURE**" }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-27", "text": "Our NMT model consists of an encoder and a decoder, each of which is a Recurrent Neural Network (RNN) as described in (Luong et al., 2015) ." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-28", "text": "The model directly estimates the posterior distribution P \u03b8 (y|x) of translating a source sentence x = (x 1 , .., x n ) to a target sentence y = (y 1 , .., y m ) as:" }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-29", "text": "Each of the local posterior distribution P (y t |y 1 , 2 , .., y t\u22121 , x) is modeled as a multinomial distribution over the target language vocabulary which is represented as a linear transformation followed by a softmax function on the decoder's output vectorh dec t :" }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-30", "text": "where c t is the context vector, h enc and h dec are the hidden vectors generated by the encoder and decoder respectively, AttentionFunction(. , .) is the attention mechanism as shown in (Luong et al., 2015) and [. ; .] is the concatenation of two vectors." 
}, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-31", "text": "An RNN encoder first encodes x to a continuous vector, which serves as the initial hidden vector for the decoder and then the decoder performs recursive updates to produce a sequence of hidden vectors by applying the transition function f as:" }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-32", "text": "where e(.) is the word embedding operation." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-33", "text": "Popular choices for mapping f are Long-Short-Term Memory (LSTM) units and Gated Recurrent Units (GRU), the former of which we use in our models." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-34", "text": "An NMT model is typically trained under the maximum log-likelihood objective:" }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-35", "text": "where D is the training set." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-36", "text": "Our NMT model uses a bi-directional RNN as an encoder and a unidirectional RNN as a decoder with global attention (Luong et al., 2015) ." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-37", "text": "----------------------------------" }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-38", "text": "**MULTILINGUAL NEURAL MACHINE TRANSLATION**" }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-39", "text": "Most of the practical applications in Machine Translation have focused on individual language pairs because it was simply too difficult to build a single system that translates to and from many language pairs." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-40", "text": "But Neural Machine Translation was shown to be an end-to-end learning approach and was quickly extended to multilingual machine translation in several ways." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-41", "text": "In Dong et al. 
(2015) , the authors modify the attention-based encoderdecoder approach by introducing separate decoder and attention mechanism for each target language." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-42", "text": "In , multi-source translation was proposed where the model has different encoders and different attention mechanisms for different source languages." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-43", "text": "In Firat et al. (2016) , the authors proposed a multi-way multilingual NMT model using a single shared attention mechanism but with multiple encoders/decoders for each source/target language." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-44", "text": "In this paper, we adopted the approach proposed in Johnson et al. (2017) , where a single NMT model is used for multilingual machine translation." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-45", "text": "We used Hindi-English as our assisting language pair and combined it with Gujarati-English parallel data to form a multi source translation system." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-46", "text": "----------------------------------" }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-47", "text": "**DATA PROCESSING**" }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-48", "text": "We used Moses (Koehn et al., 2007) toolkit for tokenization and cleaning the English side of the data." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-49", "text": "Gujarati and Hindi sides of the data is first normalized with Indic NLP library 1 followed by tokenization with the same library." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-50", "text": "As our preprocessing step, we removed all the sentences of length greater than 80 from our training corpus." 
}, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-51", "text": "----------------------------------" }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-52", "text": "**SUBWORD SEGMENTATION FOR NMT**" }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-53", "text": "Neural Machine Translation relies on first mapping each word into the vector space, and tradi-tionally we have a word vector corresponding to each word in a fixed vocabulary." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-54", "text": "Addressing the problem of data scarcity and the hardness of the system to learn high quality representations for rare words, (Sennrich et al., 2015b) proposed to learn subword units and perform translation at a subword level." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-55", "text": "With the goal of open vocabulary NMT, we incorporate this approach in our system as a preprocessing step." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-56", "text": "In our early experiments, we note that Byte Pair Encoding (BPE) works better than UNK replacement techniques." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-57", "text": "For our baseline system, we learn separate vocabularies for Hindi and English each with 32k merge operations." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-58", "text": "For our multilingual model, we learn a joint vocabulary for Hindi and Gujarati & a separate vocabulary for English." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-59", "text": "With the help of BPE, the vocabulary size is reduced drastically and we no longer need to prune the vocabularies." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-60", "text": "After the translation, we do an extra post processing step to convert the target language subword units back to normal words." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-61", "text": "We found this approach to be very helpful in handling rare word representations." 
}, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-62", "text": "----------------------------------" }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-63", "text": "**SCRIPT CONVERSION**" }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-64", "text": "India is a linguistically rich country having 22 constitutional languages, written in different scripts." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-65", "text": "Indian languages are highly inflectional with a rich morphology, default sentence structure as subject object verb (SOV) and relatively free word order." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-66", "text": "Many of them are structurally similar, also called as sibling languages." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-67", "text": "Hindi & Gujarati languages are such siblings." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-68", "text": "That is why, we have chosen Hindi as an assisting language for our multilingual model." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-69", "text": "Although, there are many linguistic similarities between Gujarati & Hindi, both of these languages are written in different scripts." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-70", "text": "So, to make a strong multilingual NMT model, we converted the script of the Gujarati side of the parallel corpus to Hindi (Devanagari script)." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-71", "text": "We used Indic NLP Library's transliteration script for this purpose." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-72", "text": "We found this approach to be very helpful in enabling better sharing between languages on the encoder side." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-73", "text": "BPE also enhances the usage of script conversion technique." 
}, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-74", "text": "We used script conversion only with our additional Multilingual NMT experiments based on Transformer architecture." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-75", "text": "----------------------------------" }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-76", "text": "**TRAINING DETAILS**" }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-77", "text": "The structure of our NMT model is same as in Luong et al. (2015) , an RNN based encoder-decoder model with Global Attention mechanism." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-78", "text": "We used an LSTM based Bi-directional encoder and a unidirectional decoder." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-79", "text": "We kept 4 layers in both the encoder & decoder with embedding size set to 512." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-80", "text": "The batch size was set to 64 and a dropout rate of 0.3." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-81", "text": "We used Adam optimizer (Kingma and Ba, 2014) for our experiments." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-82", "text": "Our multilingual model is trained with all the same hyperparameters as our baseline model except that the training data is a combination of Hindi-English & Gujarati-English parallel data." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-83", "text": "----------------------------------" }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-84", "text": "**RESULTS**" }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-85", "text": "In this section, we report the BLEU (Papineni et al., 2002) scores on the test sets provided in WMT19." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-86", "text": "Our simple NMT model which is an attention-based LSTM encoder-decoder model achieves a BLEU score of 6.2 on the test set." 
}, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-87", "text": "Our multilingual model which is trained with the help of Hindi-English parallel corpus attains a BLEU score of 9.8, showing a gain of +3.6 BLEU points on the same test set." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-88", "text": "----------------------------------" }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-89", "text": "**ADDITIONAL TRANSFORMER EXPERIMENTS**" }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-90", "text": "In this section, we present a set of experiments and results post WMT19 shared task involving the Transformer (Vaswani et al., 2017) architecture." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-91", "text": "We used the Transformer-Base architecture in this set of experiments with the rest of the pipeline being kept same as described before." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-92", "text": "We used 6 layers in both the encoder decoder with embedding size set to 512." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-93", "text": "The batch size was 2048 tokens & a dropout of 0.3." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-94", "text": "We used Adam optimizer for our experiments." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-95", "text": "During inference time, we averaged the checkpoints of the model at different epochs to obtain better results than a single checkpoint." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-96", "text": "In the multilingual Transformer experiments, we employ script conversion technique for its merits described before." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-97", "text": "In table 3, we provide the results of our Transformer experiments and also compare it to other systems submitted to WMT19." 
}, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-98", "text": "----------------------------------" }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-99", "text": "**CONCLUSION & FUTURE WORK**" }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-100", "text": "We believe that NMT is a promising approach for Machine Translation for low resource languages. But we need various techniques to handle the data scarcity problem." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-101", "text": "Transfer Learning and Multilingual Machine Translation are two important areas of research that tackles this problem." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-102", "text": "In this paper, we showed that how Multilingual MT models are more effective than the individually trained MT models for a low resource language pair." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-103", "text": "We presented our results on the Gujarati\u2192English language pair and achieved significant BLEU improvements." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-104", "text": "The Multilingual NMT model we presented in this paper is a many-to-one model." }, { "sent_id": "97fd0f1ce3d4f510c1566d642e9d2c-C001-105", "text": "In future, we will work on building effective one-tomany Multilingual NMT systems." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "97fd0f1ce3d4f510c1566d642e9d2c-C001-8" ] ], "cite_sentences": [ "97fd0f1ce3d4f510c1566d642e9d2c-C001-8" ] }, "@USE@": { "gold_contexts": [ [ "97fd0f1ce3d4f510c1566d642e9d2c-C001-27" ], [ "97fd0f1ce3d4f510c1566d642e9d2c-C001-30" ], [ "97fd0f1ce3d4f510c1566d642e9d2c-C001-36" ], [ "97fd0f1ce3d4f510c1566d642e9d2c-C001-77" ] ], "cite_sentences": [ "97fd0f1ce3d4f510c1566d642e9d2c-C001-27", "97fd0f1ce3d4f510c1566d642e9d2c-C001-30", "97fd0f1ce3d4f510c1566d642e9d2c-C001-36", "97fd0f1ce3d4f510c1566d642e9d2c-C001-77" ] } } }, "ABC_877a0b5b5d25b3849ca44ed42b8d6d_37": { "x": [ { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-2", "text": "Recently, various synchronous grammars are proposed for syntax-based machine translation, e.g. synchronous context-free grammar and synchronous tree (sequence) substitution grammar, either purely formal or linguistically motivated." }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-3", "text": "Aiming at combining the strengths of different grammars, we describes a synthetic synchronous grammar (SSG), which tentatively in this paper, integrates a synchronous context-free grammar (SCFG) and a synchronous tree sequence substitution grammar (STSSG) for statistical machine translation." }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-4", "text": "The experimental results on NIST MT05 Chinese-to-English test set show that the SSG based translation system achieves significant improvement over three baseline systems." 
}, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-5", "text": "----------------------------------" }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-7", "text": "The use of various synchronous grammar based formalisms has been a trend for statistical machine translation (SMT) (Wu, 1997; Eisner, 2003; Galley et al., 2006; Chiang, 2007; Zhang et al., 2008) ." }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-8", "text": "The grammar formalism determines the intrinsic capacities and computational efficiency of the SMT systems." }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-9", "text": "To evaluate the capacity of a grammar formalism, two factors, i.e. generative power and expressive power are usually considered (Su and Chang, 1990) ." }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-10", "text": "The generative power refers to the ability to generate the strings of the language, and the expressive power to the ability to describe the same language with fewer or no extra ambiguities." }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-11", "text": "For the current synchronous grammars based SMT, to some extent, the generalization ability of the grammar rules (the usability of the rules for the new sentences) can be considered as a kind of the generative power of the grammar and the disambiguition ability to the rule candidates can be considered as an embodiment of expressive power." }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-12", "text": "However, the generalization ability and the disambiguition ability often contradict each other in practice such that various grammar formalisms in SMT are actually different trade-off between them." 
}, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-13", "text": "For instance, in our investigations for SMT (Section 3.1), the Formally SCFG based hierarchical phrase-based model (hereinafter FSCFG) (Chiang, 2007) has a better generalization capability than a Linguistically motivated STSSG based model (hereinafter LSTSSG) (Zhang et al., 2008) , with 5% rules of the former matched by NIST05 test set while only 3.5% rules of the latter matched by the same test set." }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-14", "text": "However, from expressiveness point of view, the former usually results in more ambiguities than the latter." }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-15", "text": "To combine the strengths of different synchronous grammars, this paper proposes a statistical machine translation model based on a synthetic synchronous grammar (SSG) which syncretizes FSCFG and LSTSSG." }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-16", "text": "Moreover, it is noteworthy that, from the combination point of view, our proposed scheme can be considered as a novel system combination method which goes beyond the existing post-decoding style combination of N -best hypotheses from different systems." 
}, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-17", "text": "2 The Translation Model Based on the Synthetic Synchronous Grammar" }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-18", "text": "----------------------------------" }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-19", "text": "**THE SYNTHETIC SYNCHRONOUS GRAMMAR**" }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-20", "text": "Formally, the proposed Synthetic Synchronous Grammar (SSG) is a tuple" }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-21", "text": "where \u03a3 s (\u03a3 t ) is the alphabet set of source (target) terminals, namely the vocabulary; N s (N t ) is the alphabet set of source (target) non-terminals, such as the POS tags and the syntax labels; X represents the special nonterminal label in FSCFG; and P is the grammar rule set which is the core part of a grammar." }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-22", "text": "Every rule r in P is as:" }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-23", "text": "where \u03b1 \u2208 [{X}, N s , \u03a3 s ] + is a sequence of one or more source words in \u03a3 s and nonterminals symbols in [{X}, N s ];\u03b3 \u2208 [{X}, N t , \u03a3 t ] + is a sequence of one or more target words in \u03a3 t and nonterminals symbols in [{X}, N t ]; A T is a many-tomany corresponding set which includes the alignments between the terminal leaf nodes from source and target side, and A N T is a one-to-one corresponding set which includes the synchronizing relations between the non-terminal leaf nodes from source and target side;\u03c9 contains feature values associated with each rule." }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-24", "text": "Through this formalization, we can see that FSCFG rules and LSTSSG rules are both included." 
}, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-25", "text": "However, we should point out that the rules with mixture of X non-terminals and syntactic non-terminals are not included in our current implementation despite that they are legal under the proposed formalism." }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-26", "text": "The rule extraction in current implementation can be considered as a combination of the ones in (Chiang, 2007) and (Zhang et al., 2008) ." }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-27", "text": "Given the sentence pair in Figure 1 , some SSG rules can be extracted as illustrated in Figure 2 ." }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-28", "text": "----------------------------------" }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-29", "text": "**THE SSG-BASED TRANSLATION MODEL**" }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-30", "text": "The translation in our SSG-based translation model can be treated as a SSG derivation." }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-31", "text": "A derivation consists of a sequence of grammar rule applications." }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-32", "text": "To model the derivations as a latent variable, we define the conditional probability distribution over the target translation e and the cor-" }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-33", "text": "where H k is a feature function ,\u03bb k is the corresponding feature weight and \u2126 \u039b (f) is a normalization factor for each derivation of f. The main challenge of SSG-based model is how to distinguish and weight the different kinds of derivations ." 
}, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-34", "text": "For a simple illustration, using the rules listed in Figure 2 , three derivations can be produced for the sentence pair in Figure 1 by the proposed model:" }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-35", "text": "All of them are SSG derivations while d 1 is also a FSCFG derivation, d 2 is also a LSTSSG derivation." }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-36", "text": "Ideally, the model is supposed to be able to weight them differently and to prefer the better derivation, which deserves intensive study." }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-37", "text": "Some sophisticated features can be designed for this issue." }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-38", "text": "For example, some features related with structure richness and grammar consistency 1 of a derivation should be designed to distinguish the derivations involved various heterogeneous rule applications." }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-39", "text": "For the page limit and the fair comparison, we only adopt the conventional features as in (Zhang et al., 2008) in our current implementation." }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-40", "text": "Figure 2 : Some synthetic synchronous grammar rules can be extracted from the sentence pair in Figure 1 ." }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-41", "text": "R 1 -R 3 are bilingual phrase rules, R 4 -R 5 are FSCFG rules and R 6 -R 8 are LSTSSG rules." 
}, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-42", "text": "----------------------------------" }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-43", "text": "**DECODING**" }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-44", "text": "For efficiency, our model approximately search for the single 'best' derivation using beam search as" }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-45", "text": "The major challenge for such a SSG-based decoder is how to apply the heterogeneous rules in a derivation." }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-46", "text": "For example, (Chiang, 2007) adopts a CKY style span-based decoding while (Liu et al., 2006 ) applies a linguistically syntax node based bottom-up decoding, which are difficult to integrate." }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-47", "text": "Fortunately, our current SSG syncretizes FSCFG and LSTSSG. And the conventional decodings of both FSCFG and LSTSSG are spanbased expansion." }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-48", "text": "Thus, it would be a natural way for our SSG-based decoder to conduct a spanbased beam search." }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-49", "text": "The search procedure is given by the pseudocode in Figure 3 ." }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-50", "text": "A hypotheses stack H[i, j] (similar to the \"chart cell\" in CKY parsing) is arranged for each span [i, j] for storing the translation hypotheses." }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-51", "text": "The hypotheses stacks are ordered such that every span is translated after its possible antecedents: smaller spans before larger spans." }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-52", "text": "For translating each span [i, j] , the decoder traverses each usable rule r = \u03b1, \u03b3, A N T , A T ,\u03c9 ." 
}, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-53", "text": "If there is no nonterminal leaf node in r, the target side \u03b3 will be added into H[i, j] as the candidate hypothesis." }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-54", "text": "Otherwise, the nonterminal leaf nodes in r should be substituted iteratively by the corresponding hypotheses until all nonterminal leaf nodes are processed." }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-55", "text": "The key feature of our decoder is that the derivations are based on synthetic grammar, so that one derivation may consist of applications of heterogeneous rules (Please see d 3 in Section 2.2 as a simple demonstration)." }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-56", "text": "----------------------------------" }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-57", "text": "**EXPERIMENTS AND DISCUSSIONS**" }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-58", "text": "Our system, named HITREE, is implemented in standard C++ and STL." }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-59", "text": "In this section we report on experiments with Chinese-to-English translation base on it." }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-60", "text": "We used FBIS Chinese-to-English parallel corpora (7.2M+9.2M words) as the training data." }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-61", "text": "We also used SRI Language Modeling Toolkit to train a 4-gram language model on the Xinhua portion of the English Gigaword corpus(181M words)." }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-62", "text": "NIST MT2002 test set is used as the development set." }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-63", "text": "The NIST MT2005 test set is used as the test set." }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-64", "text": "The evaluation metric is case-sensitive BLEU4." 
}, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-65", "text": "For significant test, we used Zhang's implementation (Zhang et al., 2004 )(confidence level of 95%)." }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-66", "text": "For comparisons, we used the following three baseline systems: LSTSSG An in-house implementation of linguistically motivated STSSG based model similar to (Zhang et al., 2008) ." }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-67", "text": "FSCFG An in-house implementation of purely formally SCFG based model similar to (Chiang, 2007) ." }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-68", "text": "MBR We use an in-house combination system which is an implementation of a classic sentence level combination method based on the Minimum Bayes Risk (MBR) decoding (Kumar and Byrne, 2004) ." }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-69", "text": "----------------------------------" }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-70", "text": "**STATISTICS OF RULE NUMBERS IN DIFFERENT PHASES**" }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-71", "text": "----------------------------------" }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-72", "text": "**OVERALL PERFORMANCES**" }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-73", "text": "The performance comparison results are presented in Table 2 ." }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-74", "text": "The experimental results show that the SSG-based model (HITREE) achieves significant improvements over the models based on the two isolated grammars: FSCFG and LSTSSG (both p < 0.001)." }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-75", "text": "From combination point of view, the newly proposed model can be considered as a novel method going beyond the conventional post-decoding style combination methods." 
}, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-76", "text": "The baseline Minimum Bayes Risk combination of LSTSSG based model and FSCFG based model (M BR(1, 2)) obtains significant improvements over both candidate models (both p < 0.001)." }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-77", "text": "Meanwhile, the experimental results show that the proposed model outperforms M BR(1, 2) significantly (p < 0.001)." }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-78", "text": "These preliminary results indicate that the proposed SSG-based model is rather promising and it may serve as an alternative, if not superior, to current combination methods." }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-79", "text": "----------------------------------" }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-80", "text": "**CONCLUSIONS**" }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-81", "text": "To combine the strengths of different grammars, this paper proposes a statistical machine translation model based on a synthetic synchronous grammar (SSG) which syncretizes a purely formal synchronous context-free grammar (FSCFG) and a linguistically motivated synchronous tree sequence substitution grammar (LSTSSG)." }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-82", "text": "Experimental results show that SSGbased model achieves significant improvements over the FSCFG-based model and LSTSSG-based model." }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-83", "text": "In the future work, we would like to verify the effectiveness of the proposed model on various datasets and to design more sophisticated features." }, { "sent_id": "877a0b5b5d25b3849ca44ed42b8d6d-C001-84", "text": "Furthermore, the integrations of more different kinds of synchronous grammars for statistical machine translation will be investigated." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "877a0b5b5d25b3849ca44ed42b8d6d-C001-7" ] ], "cite_sentences": [ "877a0b5b5d25b3849ca44ed42b8d6d-C001-7" ] }, "@DIF@": { "gold_contexts": [ [ "877a0b5b5d25b3849ca44ed42b8d6d-C001-13" ] ], "cite_sentences": [ "877a0b5b5d25b3849ca44ed42b8d6d-C001-13" ] }, "@USE@": { "gold_contexts": [ [ "877a0b5b5d25b3849ca44ed42b8d6d-C001-26" ], [ "877a0b5b5d25b3849ca44ed42b8d6d-C001-39" ], [ "877a0b5b5d25b3849ca44ed42b8d6d-C001-66" ] ], "cite_sentences": [ "877a0b5b5d25b3849ca44ed42b8d6d-C001-26", "877a0b5b5d25b3849ca44ed42b8d6d-C001-39", "877a0b5b5d25b3849ca44ed42b8d6d-C001-66" ] } } }, "ABC_9340338e7cf8ff8de4db84b462dfe5_37": { "x": [ { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-2", "text": "In this paper we present our automated fact checking system demonstration which we developed in order to participate in the Fast and Furious Fact Check challenge." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-3", "text": "We focused on simple numerical claims such as \"population of Germany in 2015 was 80 million\" which comprised a quarter of the test instances in the challenge, achieving 68% accuracy." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-4", "text": "Our system extends previous work on semantic parsing and claim identification to handle temporal expressions and knowledge bases consisting of multiple tables, while relying solely on automatically generated training data." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-5", "text": "We demonstrate the extensible nature of our system by evaluating it on relations used in previous work." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-6", "text": "We make our system publicly available so that it can be used and extended by the community." 
}, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-7", "text": "1" }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-8", "text": "----------------------------------" }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-10", "text": "Fact checking is the task of assessing the truthfulness in spoken or written language." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-11", "text": "We are motivated by calls to provide tools to support journalists with resources to verify content at source (Cohen et al., 2011) or upon distribution." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-12", "text": "Manual verification can be too slow to verify information given the speed at which claims travel on social networks (Hassan et al., 2015a) ." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-13", "text": "In the context of natural language processing research, the task of automated fact checking was discussed by Vlachos and Riedel (2014) ." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-14", "text": "Given a claim, a system for this task must determine what information is needed to support or refute the claim, retrieve the information from a knowledge base (KB) and then compute a deduction to assign 1 https://github.com/sheffieldnlp/ numerical-fact-checking-eacl2017 Figure 1 a system needs to recognize the named entity (Germany), the statistical property (population) and the year, link them to appropriate elements in a KB, and deduce the truthfulness of the claim using the absolute percentage error." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-15", "text": "We contrast this task against rumour detection (Qazvinian et al., 2011) -a similar prediction task based on language subjectivity and growth of readership through a social network." 
}, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-16", "text": "While these are important factors to consider, a sentence can be true or false regardless of whether it is a rumour (Lukasik et al., 2016) ." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-17", "text": "Existing fact checking systems are capable of detecting fact-check-worthy claims in text (Hassan et al., 2015b) , returning semantically similar textual claims (Walenz et al., 2014) ; and scoring the truth of triples on a knowledge graph through semantic distance (Ciampaglia et al., 2015) ." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-18", "text": "However, neither of these are suitable for fact checking a claim made in natural language against a database." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-19", "text": "Previous works appropriate for this task operate on a limited domain and are not able to incorporate temporal information when checking time-dependent claims (Vlachos and Riedel, 2015) ." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-20", "text": "In this paper we introduce our fact checking tool, describe its architecture and design decisions, evaluate its accuracy and discuss future work." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-21", "text": "We highlight the ease of incorporating new information sources to fact check, which may be unavailable during training." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-22", "text": "To validate the extensibility of the system, we complete an additional evaluation of the system using claims taken from Vlachos and Riedel (2015) ." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-23", "text": "We make the source code publicly available to the community." 
}, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-24", "text": "----------------------------------" }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-25", "text": "**DESIGN CONSIDERATIONS**" }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-26", "text": "We developed our fact-checking approach in the context of the HeroX challenge 2 -a competition organised by the fact checking organization FullFact 3 ." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-27", "text": "The types of claims the system presented can fact check was restricted to those which require looking up a value in a KB, similar to the one in Figure 1 ." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-28", "text": "To learn a model to perform the KB look up (essentially a semantic parsing task), we extend the work of Vlachos and Riedel (2015) who used distant supervision (Mintz et al., 2009 ) to generate training data, obviating the need for manual labeling." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-29", "text": "In particular, we extend it to handle simple temporal expressions in order to fact check time-dependent claims appropriately, i. e. population in 2015." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-30", "text": "While the recently proposed semantic parser of Pasupat and Liang (2015) is also able to handle temporal expressions, it makes the assumption that the table against which the claim needs to be interpreted is known, which is unrealistic in the context of fact checking." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-31", "text": "Furthermore, the system we propose can predict relations from the KB on which the semantic parser has not been trained, a paradigm referred to as zero-shot learning (Larochelle et al., 2008) ." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-32", "text": "We achieve this by learning a binary classifier that assesses how well the claim \"matches\" each relation in the KB." 
}, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-33", "text": "Finally, another consideration in our design is algorithmic accountability (Diakopoulos, 2016) so that the predictions and the decision process used by the system are interpretable by a human." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-34", "text": "----------------------------------" }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-35", "text": "**SYSTEM OVERVIEW**" }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-36", "text": "Given an unverified statement, the objective of this system is to identify a KB entry to support or refute the claim." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-37", "text": "Our KB consists of a set of un-normalised tables that have been translated into simple Entity-Predicate-Value triples through a simple set of rules." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-38", "text": "In what follows we first describe the fact checking process used during test-2 http://herox.com/factcheck 3 https://fullfact.org Figure 2 : Relation matching step and filtering ing (Section 3.1) and then how the relation matching module is trained and the features used for this purpose (Section 3.2)." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-39", "text": "----------------------------------" }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-40", "text": "**FACT CHECKING**" }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-41", "text": "In our implementation, fact checking is a three step process, illustrated in Figure 2 ." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-42", "text": "Firstly, we link named entities in the claim to entities in our KB and retrieve a set of tuples involving the entities found." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-43", "text": "Secondly, these entries are filtered in a relation matching step." 
}, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-44", "text": "Using the text in the claim and the predicate as features, we classify whether this tuple is relevant." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-45", "text": "And finally, the values in the matched triples from the KB are compared to the value in the statement to deduce the verdict as follows: if there is at least one value with absolute percentage error lower than a threshold defined by the user, the claim is labeled true, otherwise false." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-46", "text": "We model the relation matching as a binary classification task using logistic regression implemented in scikit-learn, predicting whether a predicate is a match to the given input claim." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-47", "text": "The aim of this step is to retain tuples for which the predicate in the Entity-Predicate-Value tuple can be described by the surface forms present in the claim (positive-class) and discard the remainder (negative-class)." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-48", "text": "We chose not to model this as a multi-class classification task (one class per predicate) to improve the extensibility of the system." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-49", "text": "A multi-class classifier requires training instances for every class, thus would not be applicable to predicates not seen during training." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-50", "text": "Instead, our aim is to predict the compatibility of a predicate w. r. t. the input claim." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-51", "text": "For each of the candidate tuples r i \u2208 R, a feature vector is generated using lexical and syntactic features present in the claim and relation: \u03c6(r i , c)." 
}, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-52", "text": "This feature vector is inputted to a logistic regression binary classifier and we retain all tuples where \u03b8 T \u03c6(r i , c) \u2265 0." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-53", "text": "----------------------------------" }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-54", "text": "**TRAINING DATA GENERATION**" }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-55", "text": "The training data for relation matching is generated using distant supervision and the Bing search engine." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-56", "text": "We first read the table and apply a set of simple rules to extract subject, predicate, object tuples, and for each named entity and numeric value we generate a query containing the entity name and the predicate." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-57", "text": "For example, the entry (Germany,Population:2015,81413145) is converted to the query \"Germany\" Population 2015." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-58", "text": "The queries are then executed on the Bing search engine and the top 50 web-page results are retained." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-59", "text": "We extract the text from the webpages using a script built around the BeautifulSoup package." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-60", "text": "4 This text is parsed and annotated with co-reference chains using the Stanford CoreNLP pipeline (Manning et al., 2014) ." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-61", "text": "Each sentence containing a mention of an entity and a number is used to generate a training example." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-62", "text": "The examples are labeled as follows." 
}, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-63", "text": "If the absolute percentage error between the value in the KB and the number extracted from text is below a threshold (an adjustable hyperparameter), the training instance is marked as a positive instance." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-64", "text": "Sentences which contain a number outside of this threshold are marked as negative instances." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-65", "text": "We make an exception for numbers tagged as dates where an exact match is required." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-66", "text": "For each claim, the feature generation function, \u03c6(r, c), outputs lexical and syntactic features from the claim and a set of custom indicator variables." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-67", "text": "Additionally, we include a bias feature for every unique predicate in the KB." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-68", "text": "For our lexical and syntactic features, we consider the words in the span between the entity and number as well as the dependency path between the entity and the number." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-69", "text": "To generalise to unseen relations, we include the ability to add custom indicator functions; for our demonstration we include a simple boolean feature indicating whether the intersection of the words in the predicate name and the sentence is not empty." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-70", "text": "----------------------------------" }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-71", "text": "**EVALUATION**" }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-72", "text": "The system was field tested in the HeroX fact checking challenge -40 general-domain claims chosen by journalists." 
}, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-73", "text": "Given the design of our system, we were restricted to claims that can only be answered by a KB look up (11 claims in total), returning the correct truth assessment for 7.5 of them according to the human judges." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-74", "text": "The half-correct one was due to not providing a fully correct explanation." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-75", "text": "To fact check a statement, the user only needs to enter the claim into a text box." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-76", "text": "The only requirement is to provide the system with an appropriate KB." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-77", "text": "Thus we ensured that the system can readily incorporate new tables taken from encyclopedic sources such as Wikipedia and the World Bank." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-78", "text": "In our system, this step is achieved by simply importing a CSV file and running a script to generate the new instances to train the relation matching classifier." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-79", "text": "Analysis of our entry to this competition showed that two errors were caused by incorrect initial source data and one partial error caused by recalling a correct property but making an incorrect deduction." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-80", "text": "Of numerical claims that we did not attempt, we observed that many required looking up multiple entries and performing a more complex deduction step which was beyond the scope of this project." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-81", "text": "We further validate the system by evaluating the ability of this fact checking system to make veracity assessments on simple numerical claims from the data set collected by (Vlachos and Riedel, 2015) ." 
}, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-82", "text": "Of the 4,255 claims about numerical properties about countries and geographical areas in this data set, our KB contained information to fact check 3,418." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-83", "text": "The system presented recalled KB entries for 3,045 claims (89.1%)." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-84", "text": "We observed that the system was consistently unable to fact check two properties (undernourishment and renewable freshwater per capita)." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-85", "text": "Analysis of these failure cases revealed too great a lexical difference between the test claims and the training data our system generated; the claims in the test cases were comparative in nature (e. g. country X has higher rate of undernourishment than country Y) whereas the training data generated using the method described in Section 3.2 are absolute claims." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-86", "text": "A high number of false positive matches were generated, e. g. for a claim about population, other irrelevant properties were also recalled in addition." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-87", "text": "For the 3,045 matched claims, 17,770 properties were matched from the KB that had a score greater than or equal to the logistic score of the correct property." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-88", "text": "This means that for every claim, there were, on average, 5.85 incorrect properties also extracted from the KB." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-89", "text": "In our case, this did not yield false positive assignment of truth labels to the claims." 
}, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-90", "text": "This was because the absolute percentage error between the incorrectly retrieved properties and the claimed value was outside of the threshold we defined; thus these incorrectly retrieved properties never resulted in a true verdict, allowing the correct one to determine the verdict." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-91", "text": "----------------------------------" }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-92", "text": "**CONCLUSIONS AND FUTURE WORK**" }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-93", "text": "The core capability of the system demonstration we presented is to fact check natural language claims against relations stored in a KB." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-94", "text": "Although the range of claims is limited, the system is a fieldtested prototype and has been evaluated on a published data set (Vlachos and Riedel, 2015) and on real-world claims presented as part of the HeroX fact checking challenge." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-95", "text": "In future work, we will extend the semantic parsing technique used and apply our system to more complex claim types." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-96", "text": "Additionally, further work is required to reduce the number of candidate relations recalled from the KB." }, { "sent_id": "9340338e7cf8ff8de4db84b462dfe5-C001-97", "text": "While this was not an issue in our case, we believe that ameliorating this issue will enhance the ability of the system to assign a correct truth label where there exist properties with similar numerical values." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "9340338e7cf8ff8de4db84b462dfe5-C001-19" ] ], "cite_sentences": [ "9340338e7cf8ff8de4db84b462dfe5-C001-19" ] }, "@USE@": { "gold_contexts": [ [ "9340338e7cf8ff8de4db84b462dfe5-C001-22" ], [ "9340338e7cf8ff8de4db84b462dfe5-C001-81" ], [ "9340338e7cf8ff8de4db84b462dfe5-C001-94" ] ], "cite_sentences": [ "9340338e7cf8ff8de4db84b462dfe5-C001-22", "9340338e7cf8ff8de4db84b462dfe5-C001-81", "9340338e7cf8ff8de4db84b462dfe5-C001-94" ] }, "@EXT@": { "gold_contexts": [ [ "9340338e7cf8ff8de4db84b462dfe5-C001-28" ] ], "cite_sentences": [ "9340338e7cf8ff8de4db84b462dfe5-C001-28" ] } } }, "ABC_24b38363d53468175e0274ac0b4fd3_37": { "x": [ { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-2", "text": "We present and evaluate several hybrid systems for sentiment identification for Twitter, both at the phrase and document (tweet) level." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-3", "text": "Our approach has been to use a novel combination of lexica, traditional NLP and deep learning features." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-4", "text": "We also analyse techniques based on syntactic parsing and tokenbased association to handle topic specific sentiment in subtask C. Our strategy has been to identify subphrases relevant to the designated topic/target and assign sentiment according to our subtask A classifier." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-5", "text": "Our submitted subtask A classifier ranked fourth in the SemEval official results while our BASELINE and \u00b5PARSE classifiers for subtask C would have ranked second." 
}, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-6", "text": "----------------------------------" }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-8", "text": "Twitter holds great potential for analyses in the social sciences both due to its explosive popularity, increasing accessibility to large amounts of data and its dynamic nature." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-111", "text": "**CLASSIFICATION WITHOUT DEPENDENCY RELATIONS**" }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-9", "text": "For sentiment analysis on twitter the best performing approaches (Mohammad et al., 2013; Zhu et al., 2014) have used a set of rich lexical features." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-10", "text": "However, the development of lexica can be time consuming and is not always suitable when shifting between domains, which examine new topics and user populations (Thelwall and Buckley, 2013) ." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-11", "text": "Excitingly, the state of the art has recently shifted toward novel semi-supervised techniques such as the incorporation of word embeddings to represent the context of words and concepts (Tang et al., 2014b) ." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-12", "text": "Moreover, it is important to be able to identify sentiment in relation to particular entities, topics or events (aspect-based sentiment)." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-13", "text": "We have followed a hybrid approach which incorporates traditional lexica, unigrams and bigrams as well as word embeddings using word2vec (Mikolov et al., 2013) to train classifiers for subtasks A and B. 
For subtask C, sentiment targeted towards a particular topic, we have developed a set of different strategies which use either syntactic dependencies or token-level associations with the topic word in combination with our A classifier to produce sentiment annotations." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-14", "text": "----------------------------------" }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-15", "text": "**PHRASE-BASED SENTIMENT ANALYSIS (SUBTASK A) AS A MEANS TO AN END (SUBTASK C)**" }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-16", "text": "Phrase-based sentiment analysis (subtask A) in tweets is a long-standing task where the goal is to classify the sentiment of a designated expression within the tweet as either positive, negative or neutral." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-17", "text": "The state of the art for subtask A achieves high performance usually based on methodologies employing features obtained from either manually or automatically generated lexica (Mohammad et al., 2013; Zhu et al., 2014) ." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-18", "text": "However, lexica by definition lack contextual information and are often domain-dependent." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-19", "text": "Recent work (Tang et al., 2014a) has successfully used sentiment-specific word embeddings, vector representations of the n-gram context of positive, negative and neutral sentiment in tweets, to obtain performance which approaches that of lexicon-based approaches." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-20", "text": "Here we employ a combination of lexical features and word embeddings to maximise our performance in task A. 
We build phrase-based classifiers both with an emphasis on the distinction between positive and negative sentiment, which conforms to the distribution of training data in task A, and phrase-based classifiers trained on a balanced set of positive, negative and neutral tweets." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-21", "text": "We use the latter to identify sentiment in the vicinity of topic words in task C, for targeted sentiment assignment." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-22", "text": "In previous work (Tang et al., 2014a; Tang et al., 2014b) sentiment-specific word embeddings have been used as features for identification of tweet-level sentiment but not phrase-level sentiment." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-23", "text": "Other work which considered word embeddings for phrase-level sentiment (dos Santos, 2014) did not focus on producing sentiment-specific representations and the embeddings learnt were a combination of character and word embeddings, where the relative contribution of the word embeddings is not clear." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-24", "text": "In this work we present two different strategies for learning phrase-level sentiment-specific word embeddings." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-25", "text": "----------------------------------" }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-26", "text": "**FEATURE EXTRACTION FOR TASK A**" }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-27", "text": "Here we provide a detailed description of data preprocessing and feature extraction for phrase-level sentiment." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-112", "text": "The simplest classification method (BASELINE) identifies the topic and then only considers those tokens around it." 
}, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-28", "text": "Working on the training set (7,643 tweets), we replaced URLs with \"URLINK\", converted everything to lower case, removed special characters and tokenised on whitespace, as in (Brody and Diakopoulos, 2011) ." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-29", "text": "We decided to keep user mentions, as potentially sentiment-revealing features." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-30", "text": "We then extracted features both for the target (the designated highlighted phrase) and its context (the whole tweet):" }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-31", "text": "Ngrams: For a target at the position n in a tweet, we created binary unigram and bigram features of the sequence between {n 4, n + 4}, as suggested by Saif et al. (Mohammad et al., 2013) ." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-32", "text": "Lexicons: We used four different lexica: Bing Liu's lexicon (Hu and Liu, 2004 ) (about 6,800 polarised terms), NRC's Emotion Lexicon (Mohammad and Turney, 2010 ) (about 14,000 words annotated based on 10 emotional dimensions), the Sentiment140 Lexicon (62,468 unigrams, 677,968 bigrams and 480,010 non-contiguous pairs) and NRC's Hashtag Sentiment Lexicon (Mohammad et al., 2013) (54,129 unigrams, 316,531 bigrams and 308,808 non-contiguous pairs)." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-33", "text": "We extracted the number of words in the text that appear in every dimension of the Bing Liu and NRC Emotion Lexica." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-34", "text": "For every lexicon, we extracted features indicating the number of positive unigrams, bigrams and pairs, their maximum sentimental value as indicated by each lexicon, the sum of their sentiment values and the value of the last non-zero (non-neutral) token." 
}, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-35", "text": "All features were extracted both from the tweet as well as the target." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-36", "text": "Word Embeddings: We used the tweets collected by (Purver and Battersby, 2012) as training data for sentiment-specific word embeddings." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-37", "text": "These tweets contain emoticons and hashtags for six different emotions, which we group together to compile positive and negative subsets." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-38", "text": "To create phrase-level word embeddings, we applied two strategies: (i) we searched for positive and negative words (as defined in Bing Liu's lexicon) in the corpus; (ii) we performed chi-squared feature selection and extracted the 5,000 most important tokens to be used as our index; for both strategies, we extracted the phrase included in the 2-token-length, two-sided window." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-39", "text": "The embeddings were learnt by using Gensim (\u0158eh\u016f\u0159ek and Sojka, 2010), a Python package that integrates word2vec 1 ." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-40", "text": "In both cases, we created representations of length equal to 100 2 ." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-41", "text": "For each strategy, class and dimension, we used the functions suggested by (Tang et al., 2014b ) (average, maximum and minimum), resulting in 2,400 features." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-42", "text": "Extra Features: We used several features, potentially indicative of sentiment, a subset of those in (Mohammad et al., 2013) ." 
}, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-43", "text": "These include: the total number of words of the target phrase, its position within the tweet (\"start\", \"end\", or \"other\"), the average word length of the target/context and the presence of elongated words, URLs and user mentions." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-44", "text": "We manually labelled various emoticons as positive (strong/weak), negative (strong/weak) and \"other\" and counted how many times each label appeared in the target and its context." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-45", "text": "----------------------------------" }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-46", "text": "**EXPERIMENTS AND RESULTS**" }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-47", "text": "We experimented with Random Forests and Lib-SVM with a linear kernel on the training set (4,769 positive, 2,493 negative and 381 neutral tweets) using 10-fold cross-validation and selected LibSVM as the algorithm which achieved the best average F1 score on the positive and negative classes." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-48", "text": "We then used the development set (387 positive, 229 negative and 25 neutral tweets) to fine-tune the value of parameter C, achieving an F1 score of 86.40." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-49", "text": "The final model was applied on the two test sets provided to us; the \"Official 2015 Test\" (\"OT\") included 3,092 instances and the \"Progress Test\" (\"PT\"), including 10,681." 
}, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-50", "text": "Our results are summarised in 3 Tweet-Level Sentiment Analysis (Subtask B) Using Multiple Word Embeddings" }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-51", "text": "Our approach to subtask B follows the same logic as for subtask A, feeding a combination of hybrid features (lexical features, n-grams and word embeddings) to an SVM classifier to determine tweet-level polarity." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-52", "text": "Our approach integrates rich lexicon-based resources and semantic features of tweets, which enables us to achieve an average F1-score of 65.78 on positive tweets and negative tweets in the development dataset." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-53", "text": "In the final evaluation for subtask B, we got a rank of 27 out of 40 teams for the test dataset and a rank of 24 out of 40 teams for the test progress dataset." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-54", "text": "The results are discussed in more detail in subsection 3.2." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-55", "text": "The features we used are presented below: Lexicon-based features: For lexica that only provided the polarities of sentiment bearing words, we used the numbers of matched positive words and negative words in a tweet as features; for lexica that provided sentiment scores for words or ngrams, we included the sum of positive scores of matched words and the sum of negative scores of matched words as two separate features." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-56", "text": "The lexica we utilised fell into two categories: manually generated sentiment lexica like the AFINN (Nielsen, 2011) , MPQA (Wilson et al., 2005) , and Bing Liu's lexica (Liu, 2010) ; and automatically generated sentiment lexica like the Sentiment140 (Mohammad et al., 2013) and NRC Hashtag Sentiment lexica (Mohammad et al., 2013) ." 
}, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-57", "text": "----------------------------------" }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-58", "text": "**WORD EMBEDDINGS REPRESENTATIONS FEATURES:**" }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-59", "text": "We learned positive and negative word embeddings separately by training on the HAPPY and NON-HAPPY tweets from Purver & Battersby's multi-class Twitter emoticon and hashtag corpus (Purver and Battersby, 2012) , as with subtask A. The difference with subtask A is that here we used the whole tweet as our input (compared to the two-sided window around a polarised word in subtask A) in order to create tweet-level representations." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-60", "text": "We set the word embeddings dimension to 100 in order to gain enough semantic information whilst reducing training time." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-61", "text": "We also employed the word embeddings encoding sentiment information generated through the unified models in (Tang et al., 2014b) ." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-62", "text": "Similar to Tang, we represent each tweet by the min, average, max and sum on each dimension of the word embeddings of all the words in the tweet." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-63", "text": "In the end, the number of our word embeddings features is 4 \u21e5 100 = 400." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-64", "text": "A tweet's representations of word embeddings generated from the HAPPY and non-HAPPY subset of tweets and the embeddings generated by Tang et al. were incorporated into the feature set." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-65", "text": "Their word embeddings have 50 dimensions, so another 4 \u21e5 50 = 200 features are added to our feature set." 
}, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-66", "text": "----------------------------------" }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-67", "text": "**EXPERIMENTS**" }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-68", "text": "For our SemEval submission we trained an SVM classifier on 9684 tweets (37.59% positive, 47.36% neutral, 15.05% negative) from the training data set, and used the classifier to classify the 2390 tweets (43.43% positive, 41.30% neutral, 15.27% negative) in the test data set." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-69", "text": "After the training process, we tested the performance of classifiers with different feature sets (shown in the first column in Table 2 ) on the development data set (1654 tweets with 34.76% positive, 44.68% neutral, 20.56% negative), and used the average F1 scores of positive and negative tweets as performance measurement." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-70", "text": "The classifier had the best performance on the development data set, achieving a score of 65.78, compared with 57.32 and 65.47 on the test and test progress datasets." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-71", "text": "We hypothesize that these differences are caused by differences in the proportions of positive and negative tweets in these datasets." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-72", "text": "In Table 2 we list the average F1 scores of positive and negative tweets in the test data set when removing certain features." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-73", "text": "The results we submitted were generated by the second classifier." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-74", "text": "Table 2 demonstrates that representing the tweet with positive and negative word embeddings is the most effective feature (performance is affected the most when we remove these) followed by the manually generated lexicon-based features." 
}, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-75", "text": "This combined with a 2% reduction in F1 score when the embeddings are removed, indicates that the embeddings improve sentiment analysis performance." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-76", "text": "Contrary to the approach by (Tang et al., 2014b) , we didn't integrate the sentiment information in the word embeddings training process, but rather the sentiment-specific nature of the embeddings was reflected in the choice of different training datasets, yielding different word embedding features for positive and negative tweets." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-77", "text": "To measure the contributions of our word embeddings and Tang's sentiment-specific word embeddings separately in the F1 score, we performed a further test." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-78", "text": "When we only removed Tang's word embeddings features, the F1 score dropped by 0.15%; when we only removed our word embedding features, the F1 score dropped by 1.21%." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-79", "text": "This illustrates that for our approach, our word embedding features contribute more." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-80", "text": "However, it is the combination of the two types of word embeddings that boosts our classifier's performance." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-81", "text": "In subtask C the goal is to identify the sentiment targeted towards a particular topic or entity." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-82", "text": "This is closely linked to aspect-based sentiment (Pontiki et al., 2014) and is very important for understanding the reasons behind the manifestation of different reactions." 
}, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-83", "text": "We develop several strategies for selecting a topic-relevant portion of a tweet and use it to produce a sentiment annotation." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-84", "text": "A driving force of our approach has been to use phrase-based sentiment identification from subtask A to annotate the topicrelevant selections." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-85", "text": "----------------------------------" }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-86", "text": "**TOPIC RELEVANCE THROUGH SYNTACTIC RELATIONS**" }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-87", "text": "A syntactic parser generates possible grammatical relations between words in unstructured text, which are potentially useful for capturing the context around a target topic." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-88", "text": "We experimented with the Stanford parser (Klein and Manning, 2003) and the recently released TweeboParser (Kong et al., 2014) ." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-89", "text": "TweeboParser is explicitly designed to parse tweets -supporting multi-word annotations and multiple roots -but instead of the popular Penn Treebank annotation it uses a simpler annotation scheme and outputs much less dependency type information and was therefore not deemed suitable for our purpose." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-90", "text": "We used the Stanford parser with a caseless parsing model, expected to work better for short documents." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-91", "text": "We define the topic-relevant portion of a tweet as the weakly connected components of the dependency graph containing a given topic word." 
}, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-92", "text": "----------------------------------" }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-93", "text": "**GENERATING PER-TOKEN ANNOTATIONS**" }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-94", "text": "C O N L L -PROPAGATION and \u00b5PARSE all use per-token sentiment annotations generated in advance by the linear SVM-and random forest-based classifiers discussed in subtask A, using balanced and imbalanced versions of subtask A's training data." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-95", "text": "Because the classifier can perform better with additional context, we generated two versions of each annotation set -one token at a time (1-WINDOW), and three at a time (3-WINDOW)." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-96", "text": "3-WINDOW annotations undergo a further majority pre-processing operation to generate a per-token annotation, since adjacent windows overlap." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-97", "text": "We found again that the SVM classifier outperformed the random forest classifer, with SUBMISSION-RETOKENIZED and CONLL-PROPAGATION performing best with the balanced version, and \u00b5PARSE and BASELINE performing best using the imbalanced training data." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-98", "text": "In the following we explain each of the above mentioned Task C strategies." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-99", "text": "----------------------------------" }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-100", "text": "**USING DEPENDENCY RELATIONS**" }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-101", "text": "CONLL-PROPAGATION builds a dependency graph from a supplied parse, trims some of the relations 3 , attaches a 1-WINDOW sentiment to each node using our subtask A classifier, and then propagates those values along variably weighted edges to the target." 
}, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-102", "text": "To help the algorithm propagate successfully, the graph is undirected." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-103", "text": "We opted to train the edge weights using a simple genetic algorithm." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-104", "text": "Whilst its performance is modestly better than our submission, the approach is constrained by its inefficiency." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-105", "text": "SUBMISSION builds a directed co-dependency graph from the supplied parse, and then attempts to match it against parse trees seen previously, to capture syntactic features that may be relevant to the topic's sentiment." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-106", "text": "Because subgraph isomorphism is a computationally difficult problem, we use a diffusion kernel (as in (Fouss et al., 2006) ) to normalise the adjacency matrix for SVM classification." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-107", "text": "We also add unigrams within the same window used for BASELINE as an additional feature." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-108", "text": "SUBMISSION-RETOKENIZED updates the result and replaces whitespace tokenization with that used by (Gimpel et al., 2011) , more aggressively trims the adjacency matrix, and improves the preprocessing pipeline, improving performance a little." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-109", "text": "SUBMISSION-SENTIMENT changes the structure of the dependency graph by connecting tokens to their 1-WINDOW sentiment derived from task A, improving performance further still." 
}, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-110", "text": "----------------------------------" }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-113", "text": "Despite being rudimentary, we found BASELINE difficult to beat when teamed with the sentiment analyser developed for part A, producing an F1-score of 46.59 with a window of 8 tokens." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-114", "text": "BASELINE is also useful because it doesn't require the use of the training data for task C, leaving it free for validation." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-115", "text": "\u00b5PARSE is an approach offering a compromise between potentially noisy dependency parsing and the model-free baseline." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-116", "text": "It feeds a buffer of word2vec-derived word representations into min-max-average feature map (similar to (Tang et al., 2014a) ), which is then classified with a linear SVM to decide whether to segment the tweet at the position of an incoming token, or to add the current token to the existing segment." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-117", "text": "The aim is to extract the neighbourhood around the root words that would have been identified by a perfect syntactic parser, making it conceptually similar to chunking." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-118", "text": "\u00b5PARSE then seeks out a cluster containing the target concept and then uses 1-WINDOW or 3-WINDOW to obtain a consensus annotation." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-119", "text": "When trained and evaluated on TweeboParser's dataset, it incorrectly groups root words together at a rate of 16%, but this is sufficient to slightly outperform BASELINE." 
}, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-120", "text": "----------------------------------" }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-121", "text": "**DISCUSSION**" }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-122", "text": "We found it surprising that our task C submission did not need to be very complex, since we could determine phrase-level sentiment accurately." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-123", "text": "Whilst we decided not to submit BASELINE as our official entry -owing to uncertainty about the best subtask A parameters and its lack of technical sophistication -our results in Table 3 clearly demonstrate that we should have done so." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-124", "text": "Syntactic information does seem effective on its own in combination with phrase-level sentiment data, but its real utility might be to guide a more advanced approach that detects syntactically complex structures, and cedes the rest to BASELINE." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-125", "text": "----------------------------------" }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-126", "text": "**CONCLUSIONS**" }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-127", "text": "We have presented our system's components for phrase-, tweet-and topic-based sentiment classification." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-128", "text": "While lexica remain a critical aspect of our system, we have found that word embeddings are highly important and have great potential for future research in this domain." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-129", "text": "For both subtasks A and B we generated sentiment-specific word embeddings which yield a performance comparable to that of our lexicon-based approach and further enhance performance." 
}, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-130", "text": "Furthermore, we have found that syntactic features can be useful for topic-based sentiment classification, achieving good results when combined with phrase-based sentiment labels." }, { "sent_id": "24b38363d53468175e0274ac0b4fd3-C001-131", "text": "However, our findings also indicate that simpler approaches can perform better (perhaps due to the need for improvements in dependency parsing for Twitter), and further investigation will be required to determine how to exploit the relationship between topicspecific and phrase-level sentiment." } ], "y": { "@BACK@": { "gold_contexts": [ [ "24b38363d53468175e0274ac0b4fd3-C001-11" ], [ "24b38363d53468175e0274ac0b4fd3-C001-22" ] ], "cite_sentences": [ "24b38363d53468175e0274ac0b4fd3-C001-11", "24b38363d53468175e0274ac0b4fd3-C001-22" ] }, "@MOT@": { "gold_contexts": [ [ "24b38363d53468175e0274ac0b4fd3-C001-22", "24b38363d53468175e0274ac0b4fd3-C001-24" ] ], "cite_sentences": [ "24b38363d53468175e0274ac0b4fd3-C001-22" ] }, "@USE@": { "gold_contexts": [ [ "24b38363d53468175e0274ac0b4fd3-C001-41" ], [ "24b38363d53468175e0274ac0b4fd3-C001-61" ] ], "cite_sentences": [ "24b38363d53468175e0274ac0b4fd3-C001-41", "24b38363d53468175e0274ac0b4fd3-C001-61" ] }, "@DIF@": { "gold_contexts": [ [ "24b38363d53468175e0274ac0b4fd3-C001-76" ] ], "cite_sentences": [ "24b38363d53468175e0274ac0b4fd3-C001-76" ] } } }, "ABC_dcc866dcfb5f9233170d633d052e8b_37": { "x": [ { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-42", "text": "----------------------------------" }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-43", "text": "**DECODING**" }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-2", "text": "Recently, various synchronous grammars are proposed for syntax-based machine translation, e.g. 
synchronous context-free grammar and synchronous tree (sequence) substitution grammar, either purely formal or linguistically motivated." }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-3", "text": "Aiming at combining the strengths of different grammars, we describe a synthetic synchronous grammar (SSG), which, tentatively in this paper, integrates a synchronous context-free grammar (SCFG) and a synchronous tree sequence substitution grammar (STSSG) for statistical machine translation." }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-4", "text": "The experimental results on the NIST MT05 Chinese-to-English test set show that the SSG based translation system achieves significant improvement over three baseline systems." }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-5", "text": "----------------------------------" }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-7", "text": "The use of various synchronous grammar based formalisms has been a trend for statistical machine translation (SMT) (Wu, 1997; Eisner, 2003; Galley et al., 2006; Chiang, 2007; Zhang et al., 2008) ." }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-8", "text": "The grammar formalism determines the intrinsic capacities and computational efficiency of the SMT systems." }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-9", "text": "To evaluate the capacity of a grammar formalism, two factors, i.e. generative power and expressive power, are usually considered (Su and Chang, 1990) ." }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-10", "text": "The generative power refers to the ability to generate the strings of the language, and the expressive power to the ability to describe the same language with fewer or no extra ambiguities." 
}, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-11", "text": "For the current synchronous grammars based SMT, to some extent, the generalization ability of the grammar rules (the usability of the rules for the new sentences) can be considered as a kind of the generative power of the grammar and the disambiguition ability to the rule candidates can be considered as an embodiment of expressive power." }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-12", "text": "However, the generalization ability and the disambiguition ability often contradict each other in practice such that various grammar formalisms in SMT are actually different trade-off between them." }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-13", "text": "For instance, in our investigations for SMT (Section 3.1), the Formally SCFG based hierarchical phrase-based model (hereinafter FSCFG) (Chiang, 2007) has a better generalization capability than a Linguistically motivated STSSG based model (hereinafter LSTSSG) (Zhang et al., 2008) , with 5% rules of the former matched by NIST05 test set while only 3.5% rules of the latter matched by the same test set." }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-14", "text": "However, from expressiveness point of view, the former usually results in more ambiguities than the latter." }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-15", "text": "To combine the strengths of different synchronous grammars, this paper proposes a statistical machine translation model based on a synthetic synchronous grammar (SSG) which syncretizes FSCFG and LSTSSG." }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-16", "text": "Moreover, it is noteworthy that, from the combination point of view, our proposed scheme can be considered as a novel system combination method which goes beyond the existing post-decoding style combination of N -best hypotheses from different systems." 
}, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-17", "text": "2 The Translation Model Based on the Synthetic Synchronous Grammar" }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-18", "text": "----------------------------------" }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-19", "text": "**THE SYNTHETIC SYNCHRONOUS GRAMMAR**" }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-20", "text": "Formally, the proposed Synthetic Synchronous Grammar (SSG) is a tuple" }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-44", "text": "For efficiency, our model approximately search for the single 'best' derivation using beam search as" }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-21", "text": "where \u03a3 s (\u03a3 t ) is the alphabet set of source (target) terminals, namely the vocabulary; N s (N t ) is the alphabet set of source (target) non-terminals, such as the POS tags and the syntax labels; X represents the special nonterminal label in FSCFG; and P is the grammar rule set which is the core part of a grammar." }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-22", "text": "Every rule r in P is as:" }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-23", "text": "where \u03b1 \u2208 [{X}, N s , \u03a3 s ] + is a sequence of one or more source words in \u03a3 s and nonterminals symbols in [{X}, N s ];\u03b3 \u2208 [{X}, N t , \u03a3 t ] + is a sequence of one or more target words in \u03a3 t and nonterminals symbols in [{X}, N t ]; A T is a many-tomany corresponding set which includes the alignments between the terminal leaf nodes from source and target side, and A N T is a one-to-one corresponding set which includes the synchronizing relations between the non-terminal leaf nodes from source and target side;\u03c9 contains feature values associated with each rule." }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-24", "text": "Through this formalization, we can see that FSCFG rules and LSTSSG rules are both included." 
}, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-25", "text": "However, we should point out that the rules with mixture of X non-terminals and syntactic non-terminals are not included in our current implementation despite that they are legal under the proposed formalism." }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-26", "text": "The rule extraction in current implementation can be considered as a combination of the ones in (Chiang, 2007) and (Zhang et al., 2008) ." }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-27", "text": "Given the sentence pair in Figure 1 , some SSG rules can be extracted as illustrated in Figure 2 ." }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-28", "text": "----------------------------------" }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-29", "text": "**THE SSG-BASED TRANSLATION MODEL**" }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-30", "text": "The translation in our SSG-based translation model can be treated as a SSG derivation." }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-31", "text": "A derivation consists of a sequence of grammar rule applications." }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-32", "text": "To model the derivations as a latent variable, we define the conditional probability distribution over the target translation e and the cor-" }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-33", "text": "where H k is a feature function ,\u03bb k is the corresponding feature weight and \u2126 \u039b (f) is a normalization factor for each derivation of f. The main challenge of SSG-based model is how to distinguish and weight the different kinds of derivations ." 
}, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-34", "text": "For a simple illustration, using the rules listed in Figure 2 , three derivations can be produced for the sentence pair in Figure 1 by the proposed model:" }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-35", "text": "All of them are SSG derivations while d 1 is also a FSCFG derivation, d 2 is also a LSTSSG derivation." }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-36", "text": "Ideally, the model is supposed to be able to weight them differently and to prefer the better derivation, which deserves intensive study." }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-37", "text": "Some sophisticated features can be designed for this issue." }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-38", "text": "For example, some features related with structure richness and grammar consistency 1 of a derivation should be designed to distinguish the derivations involved various heterogeneous rule applications." }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-39", "text": "For the page limit and the fair comparison, we only adopt the conventional features as in (Zhang et al., 2008) in our current implementation." }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-40", "text": "Figure 2 : Some synthetic synchronous grammar rules can be extracted from the sentence pair in Figure 1 ." }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-41", "text": "R 1 -R 3 are bilingual phrase rules, R 4 -R 5 are FSCFG rules and R 6 -R 8 are LSTSSG rules." }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-45", "text": "The major challenge for such a SSG-based decoder is how to apply the heterogeneous rules in a derivation." }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-46", "text": "For example, (Chiang, 2007) adopts a CKY style span-based decoding while (Liu et al., 2006 ) applies a linguistically syntax node based bottom-up decoding, which are difficult to integrate." 
}, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-47", "text": "Fortunately, our current SSG syncretizes FSCFG and LSTSSG. And the conventional decodings of both FSCFG and LSTSSG are spanbased expansion." }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-48", "text": "Thus, it would be a natural way for our SSG-based decoder to conduct a spanbased beam search." }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-49", "text": "The search procedure is given by the pseudocode in Figure 3 ." }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-50", "text": "A hypotheses stack H[i, j] (similar to the \"chart cell\" in CKY parsing) is arranged for each span [i, j] for storing the translation hypotheses." }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-51", "text": "The hypotheses stacks are ordered such that every span is translated after its possible antecedents: smaller spans before larger spans." }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-52", "text": "For translating each span [i, j] , the decoder traverses each usable rule r = \u03b1, \u03b3, A N T , A T ,\u03c9 ." }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-53", "text": "If there is no nonterminal leaf node in r, the target side \u03b3 will be added into H[i, j] as the candidate hypothesis." }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-54", "text": "Otherwise, the nonterminal leaf nodes in r should be substituted iteratively by the corresponding hypotheses until all nonterminal leaf nodes are processed." }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-55", "text": "The key feature of our decoder is that the derivations are based on synthetic grammar, so that one derivation may consist of applications of heterogeneous rules (Please see d 3 in Section 2.2 as a simple demonstration)." 
}, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-56", "text": "----------------------------------" }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-57", "text": "**EXPERIMENTS AND DISCUSSIONS**" }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-58", "text": "Our system, named HITREE, is implemented in standard C++ and STL." }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-59", "text": "In this section we report on experiments with Chinese-to-English translation base on it." }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-60", "text": "We used FBIS Chinese-to-English parallel corpora (7.2M+9.2M words) as the training data." }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-61", "text": "We also used SRI Language Modeling Toolkit to train a 4-gram language model on the Xinhua portion of the English Gigaword corpus(181M words)." }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-62", "text": "NIST MT2002 test set is used as the development set." }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-63", "text": "The NIST MT2005 test set is used as the test set." }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-64", "text": "The evaluation metric is case-sensitive BLEU4." }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-65", "text": "For significant test, we used Zhang's implementation (Zhang et al., 2004 )(confidence level of 95%)." }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-66", "text": "For comparisons, we used the following three baseline systems: LSTSSG An in-house implementation of linguistically motivated STSSG based model similar to (Zhang et al., 2008) ." }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-67", "text": "FSCFG An in-house implementation of purely formally SCFG based model similar to (Chiang, 2007) ." 
}, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-68", "text": "MBR We use an in-house combination system which is an implementation of a classic sentence level combination method based on the Minimum Bayes Risk (MBR) decoding (Kumar and Byrne, 2004) ." }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-69", "text": "----------------------------------" }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-70", "text": "**STATISTICS OF RULE NUMBERS IN DIFFERENT PHASES**" }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-71", "text": "----------------------------------" }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-72", "text": "**OVERALL PERFORMANCES**" }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-73", "text": "The performance comparison results are presented in Table 2 ." }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-74", "text": "The experimental results show that the SSG-based model (HITREE) achieves significant improvements over the models based on the two isolated grammars: FSCFG and LSTSSG (both p < 0.001)." }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-75", "text": "From combination point of view, the newly proposed model can be considered as a novel method going beyond the conventional post-decoding style combination methods." }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-76", "text": "The baseline Minimum Bayes Risk combination of LSTSSG based model and FSCFG based model (M BR(1, 2)) obtains significant improvements over both candidate models (both p < 0.001)." }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-77", "text": "Meanwhile, the experimental results show that the proposed model outperforms M BR(1, 2) significantly (p < 0.001)." }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-78", "text": "These preliminary results indicate that the proposed SSG-based model is rather promising and it may serve as an alternative, if not superior, to current combination methods." 
}, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-79", "text": "----------------------------------" }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-80", "text": "**CONCLUSIONS**" }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-81", "text": "To combine the strengths of different grammars, this paper proposes a statistical machine translation model based on a synthetic synchronous grammar (SSG) which syncretizes a purely formal synchronous context-free grammar (FSCFG) and a linguistically motivated synchronous tree sequence substitution grammar (LSTSSG)." }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-82", "text": "Experimental results show that SSGbased model achieves significant improvements over the FSCFG-based model and LSTSSG-based model." }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-83", "text": "In the future work, we would like to verify the effectiveness of the proposed model on various datasets and to design more sophisticated features." }, { "sent_id": "dcc866dcfb5f9233170d633d052e8b-C001-84", "text": "Furthermore, the integrations of more different kinds of synchronous grammars for statistical machine translation will be investigated." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "dcc866dcfb5f9233170d633d052e8b-C001-7" ], [ "dcc866dcfb5f9233170d633d052e8b-C001-13" ], [ "dcc866dcfb5f9233170d633d052e8b-C001-46" ] ], "cite_sentences": [ "dcc866dcfb5f9233170d633d052e8b-C001-7", "dcc866dcfb5f9233170d633d052e8b-C001-13", "dcc866dcfb5f9233170d633d052e8b-C001-46" ] }, "@USE@": { "gold_contexts": [ [ "dcc866dcfb5f9233170d633d052e8b-C001-13" ], [ "dcc866dcfb5f9233170d633d052e8b-C001-26" ] ], "cite_sentences": [ "dcc866dcfb5f9233170d633d052e8b-C001-13", "dcc866dcfb5f9233170d633d052e8b-C001-26" ] }, "@SIM@": { "gold_contexts": [ [ "dcc866dcfb5f9233170d633d052e8b-C001-67" ] ], "cite_sentences": [ "dcc866dcfb5f9233170d633d052e8b-C001-67" ] } } }, "ABC_e29c7551ea78cb425054963489e1b9_37": { "x": [ { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-2", "text": "Spoken language translation applications for speech suffer due to conversational speech phenomena, particularly the presence of disfluencies." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-3", "text": "With the rise of end-to-end speech translation models, processing steps such as disfluency removal that were previously an intermediate step between speech recognition and machine translation need to be incorporated into model architectures." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-4", "text": "We use a sequence-to-sequence model to translate from noisy, disfluent speech to fluent text with disfluencies removed using the recently collected 'copy-edited' references for the Fisher SpanishEnglish dataset." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-5", "text": "We are able to directly generate fluent translations and introduce considerations about how to evaluate success on this task." 
}, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-6", "text": "This work provides a baseline for a new task, the translation of conversational speech with joint removal of disfluencies." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-7", "text": "----------------------------------" }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-8", "text": "**INTRODUCTION & RELATED WORK**" }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-9", "text": "Spoken language translation (SLT) applications suffer due to conversational speech phenomena, particularly the presence of disfluencies." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-10", "text": "In conversational speech, speakers often use disfluencies such as filler words, repetitions, false starts, and corrections which do not naturally occur in text and may not be desired in translation outputs." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-11", "text": "Disfluency recognition and removal has previously been performed as an intermediate step between speech recognition (ASR) and machine translation (MT), to make disfluent ASR output better-matched to typically clean machine translation training data (Cho et al., 2013 (Cho et al., , 2014 Wang et al., 2010; Honal and Schultz, 2005; Zayats et al., 2016) ." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-12", "text": "With the rise of end-to-end sequence-to-sequence speech translation systems (Weiss et al., 2017; Bansal et al., 2018) , disfluency removal can no longer be handled as an intermediate step between ASR and MT but needs to be incorporated into the model or handled as a post-processing step." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-13", "text": "Generating fluent translations from disfluent speech may be desired for simultaneous SLT applications where removing disfluencies will improve the application's clarity and usability." 
}, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-14", "text": "To train endto-end speech translation requires parallel speech and text translations." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-15", "text": "This introduces data considerations not previously relevant with chained ASR+MT models, as different datasets could be used to train ASR and MT components." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-16", "text": "Where aligned speech and translations exist, data is typically clean speech clean text, as in news or TED talks, or disfluent speech disfluent translations, as in Fisher or meeting data, where disfluencies were faithfully included in the references for completeness." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-17", "text": "While some corpora with labeled disfluencies exist (Cho et al., 2014; Burger et al., 2002) , only subsets have been translated and/or released." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-18", "text": "Salesky et al. (2018) introduced a set of fluent references 1 for Fisher Spanish-English, enabling a new task: end-to-end training and evaluation against fluent references." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-19", "text": "Previous work on disfluency removal has treated it as a sequence labeling task using word or spanlevel labels." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-20", "text": "However, in some cases, simply removing disfluencies from an utterance can create ill-formed output." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-21", "text": "Further, corpora can have different translation and annotation schemes: for example for Fisher Spanish-English, translated using Mechanical Turk, Salesky et al. (2018) found 268 unique filler words due to spelling and casing." 
}, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-22", "text": "Disfluencies can also be context-specific, such as false starts or corrections where a phrase may be 'disfluent' due to its surroundings." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-23", "text": "To remove disfluencies as a post-processing step would require a separate model trained with appropriate data and disfluency labels, and may lead to ill-formed output." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-24", "text": "By translating directly to fluent target data instead, we aim to handle these concerns implicitly." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-25", "text": "We present the first results translating directly from disfluent source speech to fluent target text." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-26", "text": "For our experiments, we use Fisher Spanish speech (Graff et al.) and with two sets of English translations (Salesky et al., 2018; Post et al., 2013) ." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-27", "text": "The speech dataset comprises telephone conversations between mostly native Spanish speakers recorded in realistic noise conditions." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-28", "text": "The original English translations were collected through crowdsourcing, as described in Post et al. (2013) ." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-29", "text": "Four references were collected for each of the development and test sets, and one for training." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-30", "text": "The training data consists of 819 conversations yielding \u223c160 hours of speech and 150k utterances; the development and test sets are \u223c4k utterances each." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-31", "text": "We use only the first of the two development sets (dev, not dev2)." 
}, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-32", "text": "This data is conversational and disfluent." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-33", "text": "The original translations faithfully maintain and translate phenomena in the Spanish transcripts such as filler words and hesitations, discourse markers (you know, well, mm), repetitions, corrections and false starts, among others." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-34", "text": "Salesky et al. (2018) introduced a new set of fluent reference translations collected on Mechanical Turk." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-35", "text": "They collected two references for each of the development and test sets, and one for the training set." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-36", "text": "Rather than labeling the disfluencies in the original target data, Turkers were asked to rewrite the utterance in a 'copy-edited' manner without disfluent phenomena." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-37", "text": "In some cases, simply removing disfluencies would created ill-formed structure in the resulting utterance." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-38", "text": "This scheme instead creates a sentencelevel edit allowing for reordering and insertions as necessary to create fluent content, akin instead to monolingual translation or paraphrasing." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-39", "text": "Examples of source transcripts and original translations with the fluent counterparts are shown below in Table 1. SRC eh, eh, eh, um, yo pienso que es as\u00ed ORG uh, uh, uh, um, i think it's like that FLT i think it's like that SRC tambi\u00e9n tengo um eh estoy tomando una clase .. ORG i also have um eh i'm taking a marketing class .. FLT i'm also taking a marketing class SRC porque qu\u00e9 va, mja ya te acuerda que .. ORG because what is, mhm do you recall now that .. FLT do you recall now that .. 
SRC y entonces am es entonces la universidad donde yo estoy es university of pennsylvania ORG and so am and so the university where i am it's the university of pennsylvania FLT i am at the university of pennsylvania Initial work on the Fisher-Spanish dataset used HMM-GMM ASR models linked with phrasebased MT using lattices (Post et al., 2013; Kumar et al., 2014) ." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-40", "text": "More recently, it was shown in Weiss et al. (2017) and Bansal et al. (2018) that end-toend SLT models perform competitively on this task." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-41", "text": "As in Bansal et al. (2018) , we use a sequence-tosequence architecture inspired by Weiss et al. but modified to train within available resources; specifically, all models may be trained in less than 5 days on one GPU." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-42", "text": "We build an encoder-decoder model with attention in xnmt with 512 hidden units throughout." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-43", "text": "We use a 3-layer BiLSTM encoder." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-44", "text": "We do not use the additional convolutional layers from Weiss et al. and Bansal et al. to reduce temporal resolution, but rather use network-in-network (NiN) projections from previous work in sequence-to-sequence ASR (Zhang et al., 2017; to get the same total 4\u00d7 downsampling in time." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-45", "text": "This gives the benefit of added depth with fewer parameters." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-46", "text": "We closely follow the LSTM/NiN encoder used in for ASR and use the same training procedure, detailed in Appendix A." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-47", "text": "We extract 40-dimensional mel filterbank features with per-speaker mean and variance normalization with Kaldi (Povey et al., 2011) ." 
}, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-48", "text": "We did not see significant difference between 40, 40+deltas and 80-dimensional features in initial experiments, similar to Bansal et al. (2018) , who chose 80-dim." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-49", "text": "Weiss et al. (2017) used 240-dim features comprising 80-dim filterbanks stacked with deltas and delta-deltas." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-50", "text": "We exclude utterances longer than 1500 frames to manage memory requirements." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-51", "text": "Like Weiss et al. (2017) , we translate to target characters as opposed to words (Bansal et al., 2018) ." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-52", "text": "We also use an MLP-based attention with 1 hidden layer with 128 units and 64-dimensional target embeddings, though we use only 1 decoder hidden layer as opposed to 3 or 4 in these works." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-53", "text": "We use input feeding (Luong et al., 2015) ." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-54", "text": "All models use the same preprocessing as previous work on this dataset: lowercasing and removing punctuation aside from apostrophes." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-55", "text": "----------------------------------" }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-56", "text": "**EXPERIMENTS**" }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-57", "text": "----------------------------------" }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-58", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-59", "text": "We focus on the problem of translating directly from noisy speech to clean references without a separate disfluency removal step." 
}, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-60", "text": "We first demonstrate the efficacy of our models on the original disfluent Fisher Spanish-English task, comparing to the previously reported numbers on the SLT task (Weiss et al., 2017; Bansal et al., 2018) ." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-61", "text": "We then compare these results with models trained using the collected 'clean' target data with disfluencies removed." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-62", "text": "Finally, we look at the mismatched case where we train on disfluent data and evaluate on a cleaned test set; this is a more realistic scenario, as clean training data is difficult to collect, and we cannot expect to have it for each language and use case we encounter." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-63", "text": "We evaluate using both BLEU (Papineni et al., 2002) and METEOR (Denkowski and Lavie, 2014) to compare different aspects of model behavior on our two tasks." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-64", "text": "2 BLEU assesses how well predicted translations match a set of reference translations using modified n-gram precision, weighted by a brevity penalty in place of recall to penalize short hypothesis translations without full coverage." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-65", "text": "The brevity penalty is computed as e (1\u2212r/c) , where r is the length of the reference and c the candidate translation." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-66", "text": "For our task of implicitly removing disfluencies during translation, our generated translations should contain much of the same content but with certain tokens removed, creating shorter translations." 
}, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-67", "text": "When scoring fluent output against the original disfluent references, then, differences in BLEU score will come from two sources: shorter n-gram matches, and the brevity penalty." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-68", "text": "METEOR, on the other hand, can be considered a more 'semantic' evaluation metric." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-69", "text": "It uses a harmonic mean of precision and recall, with greater weight given to recall." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-70", "text": "Further, while BLEU uses exact n-gram matches, METEOR also takes into account stem, synonym, and paraphrase matches." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-71", "text": "For our fluent task, we aim to maintain semantic meaning while removing disfluent tokens." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-72", "text": "Accordingly, when scored against the fluent target references, we hope to see similar METEOR scores between the disfluent models and fluent models." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-73", "text": "Both metrics are used for a holistic view of the problem: METEOR will indicate if meaning is maintained, but not assess disfluency removal, while BLEU changes will indicate whether disfluencies have been removed." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-74", "text": "We provide both multi-reference and singlereference BLEU and METEOR scores: the original Fisher target data has four reference translations for the dev and test sets, which boosts scores considerably as hypothesis n-grams can match in any of the references." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-75", "text": "The fluent target data has two references, so the single reference scores better enable comparison between the two tasks." 
}, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-76", "text": "Table 2 : Single task end-to-end speech translation using original disfluent references to train and evaluate." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-77", "text": "Comparing multi-reference scores using all four references (4Ref) vs average single reference score (1Ref)." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-78", "text": "Table 3 compares performance of speech translation models trained with the fluent target translations to models trained with the original disfluent translations, as scored on the fluent references." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-79", "text": "Comparing the disfluent and fluent models, we see that METEOR scores are almost the same while BLEU scores are lower with the disfluent model." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-80", "text": "This is as we would hope: with our fluent model, we want to generate translations that are semantically the same but with disfluencies removed." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-81", "text": "Therefore similar METEOR scores with similar recall (52) on the fluent references are encouraging." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-82", "text": "For BLEU, however, the disfluencies generated by the disfluent model break up n-grams in the fluent references, thereby lowering scores." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-83", "text": "Comparing single-reference scores with Table 2 , we see that they are distinctly lower." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-84", "text": "This is to be expected with the shorter fluent references; a difference of a single token carries greater weight." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-85", "text": "Translating directly to the fluent references is a more challenging task." 
}, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-86", "text": "As shown in Table 1 , the original English translations and Spanish speech are very one-to-one while the edited translations introduce deletions and reorderings." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-87", "text": "In learning to generate fluent translations, the model needs to learn to handle these more inconsistent behaviors." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-88", "text": "Figure 1 shows a visual comparison between outputs generated by the two models." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-89", "text": "Using the fluent target data to train constrains the model output vocabulary, so filler words such as 'um', 'ah', 'mhm' are not generated." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-90", "text": "We also see significant reductions in repetitions of both words and phrases from the model trained with fluent reference translations." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-91", "text": "Further, we also see instances where the fluent model generates a shorter paraphrase of a disfluent phrase, as in the 2nd example." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-92", "text": "Disfluency removal for speech translation has traditionally been done as an intermediate step between ASR and MT to better match additional clean corpora used for MT training; we do not compare to a pipeline approach here." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-93", "text": "However, to contextualize these results, we compare disfluency removal as a post-processing step after end-to-end speech translation with the original disfluent parallel data." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-94", "text": "Model (dev 1Ref, 2Ref; test 1Ref, 2Ref): Postproc. Filter: 13.6, 16.5; 13.5, 16.8." 
}, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-95", "text": "Postproc. MonoMT: 14.4, 17.8; 14.4, 18.0. Table 4 : End-to-end disfluent model with different postprocessing steps." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-96", "text": "Performance evaluated with new fluent references." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-97", "text": "----------------------------------" }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-98", "text": "**RESULTS & DISCUSSION**" }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-100", "text": "Simply filtering filler words and repetitions from the disfluent model (Filter) outputs as a post-processing step, the dev scores improve slightly, but test stays the same or decreases." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-101", "text": "In some cases, treating disfluency removal as a filtering task can reduce the fluency of an utterance:" }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-102", "text": "Disfluent: mm well and from and the email is a scandal the spam." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-103", "text": "Fluent: the email is a scandal it's spam." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-104", "text": "A filtering or tagging system may not capture all false starts or corrections, leading to lower fluency, and requires labeled spans." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-105", "text": "Treating the post-processing step as a monolingual translation task (MonoMT) rather than a filtering task allows for reordering and insertions, which we saw boost fluency." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-106", "text": "We trained a 4-layer BiLSTM encoder-decoder model to translate between the disfluent and fluent English references and applied this to the output of the end-to-end disfluent model." 
}, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-107", "text": "BLEU scores approach the results with the end-to-end fluent target model (Table 3), but we note that this requires the same resources as the direct task." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-108", "text": "Showing the importance of fluent references for evaluation, Table 5 shows the performance of fluent models as evaluated on the original disfluent references." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-109", "text": "Disfluent target scores are the same as in Table 2 for comparison. Table 5 : Performance evaluated with original disfluent references." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-110", "text": "Comparing average single reference scores (1Ref) vs multi-reference scores using all references (4Ref)." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-111", "text": "As we would expect, here there is a greater difference in scores." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-112", "text": "The fluent references have fewer long n-gram matches with disfluencies removed, lowering BLEU." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-113", "text": "The fluent model's METEOR scores suffer more than BLEU due to the recall calculation; recall on the disfluent references is lower because the fluent model does not produce many of the disfluencies (indeed filler words are not in the vocabulary when trained with the fluent references)." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-114", "text": "Recall is reduced by \u223c14% with the fluent model, reflecting the approximate distribution of disfluencies in the original data." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-115", "text": "The differences in scores with these two metrics do not show the full picture." 
}, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-116", "text": "Outputs generated by the fluent model are on average 13% shorter and contain 1.5 fewer tokens per utterance than the disfluent model, which is significant with average utterance lengths of 10-11 tokens." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-117", "text": "When scoring the fluent output against the original disfluent references, the shorter length significantly contributes to the lower scores, with the BLEU brevity penalty calculated as 0.86 as opposed to 0.96-1.0 for all other conditions." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-118", "text": "Removing the length penalty from the BLEU score calculation, single-reference scores are boosted to 19.3 and 19.8 from 16.6 and 17.0 for dev and test, respectively (Table 5 )." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-119", "text": "This is a somewhat fairer comparison of the disfluent and fluent models, as we do not want the fluent output to match the disfluent sequence length, and the disfluent models are not penalized due to length." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-120", "text": "These BLEU scores are now very similar to those of the disfluent model on the disfluent references, though the outputs are very different (Figure 1 )." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-121", "text": "The changes here and the difference in trends between the two metrics with respect to the two types of references show that evaluating this task cannot be simply accomplished with one existing metric: depending on the combination of metric and references, it's possible to mask the difference between disfluent and fluent systems, unless you have word-level disfluency annotations, which are more difficult to obtain." 
}, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-122", "text": "----------------------------------" }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-123", "text": "**CONCLUSION**" }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-124", "text": "Machine translation applications for speech can suffer due to conversational speech phenomena, particularly the presence of disfluencies." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-125", "text": "Previous work to remove disfluencies in speech translation did so as a separate step between speech recognition and machine translation, which is not possible using end-to-end models." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-126", "text": "Using clean references for disfluent data collected by Salesky et al. (2018) , we extend their text baseline to speech input and provide first results for direct generation of fluent text from noisy disfluent speech." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-127", "text": "While fluent training data enables research on this task with end-to-end models, such a resource is unlikely to be available for every corpus and domain, and it is expensive to collect." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-128", "text": "In future work, we hope to reduce the dependence on fluent target data during training through decoder pretraining on external non-conversational corpora or multitask learning." }, { "sent_id": "e29c7551ea78cb425054963489e1b9-C001-129", "text": "Further, standard metrics alone do not tell the full story for this task; additional work on evaluation metrics may better demonstrate the differences between such systems." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "e29c7551ea78cb425054963489e1b9-C001-18" ], [ "e29c7551ea78cb425054963489e1b9-C001-21" ], [ "e29c7551ea78cb425054963489e1b9-C001-34" ] ], "cite_sentences": [ "e29c7551ea78cb425054963489e1b9-C001-18", "e29c7551ea78cb425054963489e1b9-C001-21", "e29c7551ea78cb425054963489e1b9-C001-34" ] }, "@USE@": { "gold_contexts": [ [ "e29c7551ea78cb425054963489e1b9-C001-26" ] ], "cite_sentences": [ "e29c7551ea78cb425054963489e1b9-C001-26" ] }, "@EXT@": { "gold_contexts": [ [ "e29c7551ea78cb425054963489e1b9-C001-126" ] ], "cite_sentences": [ "e29c7551ea78cb425054963489e1b9-C001-126" ] } } }, "ABC_32ca78f6e01b7732a4bf573b91fbfe_37": { "x": [ { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-2", "text": "Abstract." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-59", "text": "----------------------------------" }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-60", "text": "**OVERALL SENT2VEC TRAINING**" }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-3", "text": "Citation sentiment analysis is an important task in scientific paper analysis." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-57", "text": "For example, there are incomplete sentences mistakenly detected (e.g. Publication Year.)." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-58", "text": "To address this issue, I eliminated sentences with fewer than three words." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-4", "text": "Existing machine learning techniques for citation sentiment analysis focus on labor-intensive feature engineering, which requires a large annotated corpus." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-5", "text": "As an automatic feature extraction tool, word2vec has been successfully applied to sentiment analysis of short texts." 
}, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-6", "text": "In this work, I conducted empirical research with the question: how well does word2vec work on the sentiment analysis of citations?" }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-7", "text": "The proposed method constructed sentence vectors (sent2vec) by averaging the word embeddings, which were learned from Anthology Collections (ACL-Embeddings)." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-8", "text": "I also investigated polarity-specific word embeddings (PS-Embeddings) for classifying positive and negative citations." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-9", "text": "The sentence vectors formed a feature space, to which the examined citation sentence was mapped." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-10", "text": "Those features were input into classifiers (support vector machines) for supervised classification." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-11", "text": "Using a 10-fold cross-validation scheme, evaluation was conducted on a set of annotated citations." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-12", "text": "The results showed that word embeddings are effective on classifying positive and negative citations." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-13", "text": "However, hand-crafted features performed better for the overall classification." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-14", "text": "----------------------------------" }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-15", "text": "**INTRODUCTION**" }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-16", "text": "The evolution of scientific ideas happens when old ideas are replaced by new ones." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-17", "text": "Researchers usually conduct scientific experiments based on the previous publications." 
}, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-18", "text": "They either make use of others' work as a solution to solve their specific problem, or they improve the results documented in the previous publications by introducing new solutions." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-19", "text": "I refer to the former as positive citation and the latter as negative citation." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-20", "text": "Citation sentence examples 1 with different sentiment polarity are shown in Table 1 ." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-21", "text": "Sentiment analysis of citations plays an important role in plotting scientific idea flow." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-22", "text": "The idea in paper A0 is Hidden Markov Model (HMM) based part-of-speech (POS) tagging, which has been referenced positively in paper A1." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-23", "text": "In paper A2, however, a better approach was brought up, making the idea (HMM-based POS) in paper A0 negative." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-24", "text": "This citation sentiment analysis could lead to future work in such a way that new approaches (mentioned in paper A2) are recommended to other papers which cited A0 positively 2 ." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-25", "text": "Analyzing citation sentences during literature review is time consuming." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-26", "text": "Recently, researchers developed algorithms to automatically analyze citation sentiment." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-27", "text": "For example, [1] extracted several features for citation purpose and polarity classification, such as reference count, contrary expression and dependency relations." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-28", "text": "Jochim et al. 
tried to improve the result by using unigram and bigram features [2] ." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-29", "text": "[3] used word level features, contextual polarity features, and sentence structure based features to detect sentiment citations." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-30", "text": "Although they generated good results using the combination of features, it required a lot of engineering work and a large amount of annotated data to obtain the features." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-31", "text": "Furthermore, capturing accurate features relies on other NLP techniques, such as part-of-speech (POS) tagging and sentence parsing." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-32", "text": "Therefore, it is necessary to explore other techniques that are free from hand-crafted features." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-33", "text": "With the development of neural networks and deep learning, it is possible to learn the representations of concepts from unlabeled text corpus automatically." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-34", "text": "These representations can be treated as concept features for classification." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-35", "text": "An important advance in this area is the development of the word2vec technique [4] , which has proved to be an effective approach in Twitter sentiment classification [5] ." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-36", "text": "In this work, the word2vec technique was explored for sentiment analysis of citations." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-37", "text": "Word embeddings trained from different corpora were compared." 
}, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-38", "text": "----------------------------------" }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-39", "text": "**RELATED WORK**" }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-40", "text": "Mikolov et al. introduced the word2vec technique [4] that can obtain word vectors by training on a text corpus." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-41", "text": "The idea of word2vec (word embeddings) originated from the concept of distributed representation of words [6] ." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-42", "text": "The common method to derive the vectors is using neural probabilistic language model [7] ." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-43", "text": "Word embeddings proved to be effective representations in the tasks of sentiment analysis [5, 8, 9 ] and text classification [10] ." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-44", "text": "Sadeghian and Sharafat [11] extended word embeddings to sentence embeddings by averaging the word vectors in a sentiment review statement." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-45", "text": "Their results showed that word embeddings outperformed the bag-of-words model in sentiment classification." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-46", "text": "In this work, I am aiming at evaluating word embeddings for sentiment analysis of citations." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-47", "text": "The research questions are:" }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-48", "text": "1. How well does word2vec work on classifying positive and negative citations?" }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-49", "text": "2. Can sentiment-specific word embeddings improve the classification result? 3. How well does word2vec work on classifying implicit citations? 4." 
}, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-50", "text": "In general, how well does word2vec work on classifying positive, negative and objective citations in comparison with hand-crafted features?" }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-51", "text": "----------------------------------" }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-52", "text": "**METHODOLOGY**" }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-53", "text": "----------------------------------" }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-54", "text": "**PRE-PROCESSING**" }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-55", "text": "The SentenceModel provided by LingPipe was used to segment raw text into its constituent sentences 3 ." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-56", "text": "The data I used to train the vectors has noise." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-61", "text": "In the work, I constructed sentence embeddings based on word embeddings." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-62", "text": "I simply averaged the vectors of the words in one sentence to obtain sentence embeddings (sent2vec)." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-63", "text": "The main process in this step is to learn the word embedding matrix W_w:" }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-64", "text": "where W_w is the word embedding for word w_i, with the sentence represented as <w_1, w_2, ..., w_n>; the embeddings could be learned by the classical word2vec algorithm [4] ." 
}, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-65", "text": "The parameters that I used to train the word embeddings are the same as in the work of Sadeghian and Sharafat [11] ." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-66", "text": "----------------------------------" }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-67", "text": "**POLARITY-SPECIFIC WORD REPRESENTATION TRAINING**" }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-68", "text": "To improve sentiment citation classification results, I trained polarity specific word embeddings (PS-Embeddings), which were inspired by the Sentiment-Specific Word Embedding [5] ." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-69", "text": "After obtaining the PS-Embeddings, I used the same scheme to average the vectors in one sentence according to the sent2vec model." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-70", "text": "----------------------------------" }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-71", "text": "**EXPERIMENT**" }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-72", "text": "----------------------------------" }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-73", "text": "**TRAINING DATASET**" }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-74", "text": "The ACL-Embeddings (300 and 100 dimensions) were trained from the ACL collection." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-75", "text": "ACL Anthology Reference Corpus 4 contains the canonical 10,921 computational linguistics papers, from which I have generated 622,144 sentences after filtering out sentences with lower quality." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-76", "text": "For training polarity specific word embeddings (PS-Embeddings, 100 dimensions), I selected 17,538 sentences (8,769 positive and 8,769 negative) from ACL collection, by comparing sentences with the polar phrases 5 ." 
}, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-77", "text": "The pre-trained Brown-Embeddings (100 dimensions) learned from the Brown corpus were also used 6 as a comparison." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-78", "text": "----------------------------------" }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-79", "text": "**TEST DATASET**" }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-80", "text": "To evaluate the sent2vec performance on citation sentiment detection, I conducted experiments on three datasets." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-81", "text": "The first one (dataset-basic) was originally taken from ACL Anthology [12] ." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-82", "text": "Athar and Awais [3] manually annotated 8,736 citations from 310 publications in the ACL Anthology." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-83", "text": "I used all of the labeled sentences (830 positive, 280 negative and 7,626 objective) for testing." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-84", "text": "7 The second dataset (dataset-implicit) was used for evaluating implicit citation classification, containing 200,222 excluded (x), 282 positive (p), 419 negative (n) and 2,880 objective (o) annotated sentences." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-85", "text": "Every sentence which does not contain any direct or indirect mention of the citation is labeled as being excluded (x) 8 ." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-86", "text": "The third dataset (dataset-pn) is a subset of dataset-basic, containing 828 positive and 280 negative citations." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-87", "text": "Dataset-pn was used for the purposes of (1) evaluating binary classification (positive versus negative) performance using sent2vec; (2) comparing the sentiment classification ability of PS-Embeddings with other embeddings." 
}, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-88", "text": "----------------------------------" }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-89", "text": "**EVALUATION STRATEGY**" }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-90", "text": "One-Vs-The-Rest strategy was adopted 9 for the task of multi-class classification and I reported F-score, micro-F, macro-F and weighted-F scores 10 using 10-fold cross-validation." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-91", "text": "The F1 score is a weighted average of the precision and recall." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-92", "text": "In the multi-class case, this is the weighted average of the F1 score of each class." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-93", "text": "There are several types of averaging performed on the data: Micro-F calculates metrics globally by counting the total true positives, false negatives and false positives." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-94", "text": "Macro-F calculates metrics for each label, and finds their unweighted mean." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-95", "text": "Macro-F does not take label imbalance into account." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-96", "text": "Weighted-F calculates metrics for each label, and finds their average, weighted by support (the number of true instances for each label)." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-97", "text": "Weighted-F alters macro-F to account for label imbalance." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-98", "text": "----------------------------------" }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-99", "text": "**RESULTS**" }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-100", "text": "The performances of citation sentiment classification on dataset-basic and dataset-implicit were shown in Table 2 and Table 3 respectively." 
}, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-101", "text": "The result of classifying positive and negative citations was shown in Table 4 ." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-102", "text": "To compare with the outcomes in the work of [3] 11 , I selected two records from their results: the best one (based on features n-gram + dependencies + negation) and the baseline (based on 1-3 grams)." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-103", "text": "From Table 2 I can see that the features extracted by [3] performed far better than word embeddings, in terms of macro-F (their best macro-F is 0.90, while the one in this work is 0.33)." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-104", "text": "However, the higher micro-F score (the highest micro-F in this work is 0.88, theirs is 0.78) and the weighted-F scores indicated that this method may achieve better performances if the evaluations are conducted on a balanced dataset." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-105", "text": "Among the embeddings, ACL-Embeddings performed better than Brown corpus in terms of macro-F and weighted-F measurements 12 ." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-106", "text": "To compare the dimensionality of word embeddings, ACL300 gave a higher micro-F score than ACL100, but there is no difference between the 300- and 100-dimensional ACL-Embeddings when looking at the macro-F and weighted-F scores." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-107", "text": "Table 3 showed the sent2vec performance on classifying implicit citations with four categories: objective, negative, positive and excluded." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-108", "text": "The method in this experiment had a poor performance on detecting positive citations, but it was comparable with both the baseline and sentence structure method [13] for the category of objective citations." 
}, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-109", "text": "With respect to classifying negative citations, this method was not as good as sentence structure features but it outperformed the baseline." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-110", "text": "The results of classifying category X from the rest showed that the performances of this method and the sentence structure method are fairly equal." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-111", "text": "Table 4 showed the results of classifying positive and negative citations using different word embeddings." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-112", "text": "The macro-F score 0.85 and the weighted-F score 0.86 proved that word2vec is effective on classifying positive and negative citations." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-113", "text": "However, unlike the outcomes in the paper of [5] , where they concluded that sentiment specific word embeddings performed best, integrating polarity information did not improve the result in this experiment." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-114", "text": "Table 4 ." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-115", "text": "Performance of classifying positive and negative citations." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-116", "text": "----------------------------------" }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-117", "text": "**DISCUSSION AND CONCLUSION**" }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-118", "text": "In this paper, I reported the citation sentiment classification results based on word embeddings." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-119", "text": "The binary classification results in Table 4 showed that word2vec is a promising tool for distinguishing positive and negative citations." 
}, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-120", "text": "From Table 4 I can see that there are no big differences between the scores generated by ACL100 and Brown100, even though they have different vocabulary sizes (ACL100 has 14,325 words, Brown100 has 56,057 words)." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-121", "text": "The polarity specific word embeddings did not show its strength in the task of binary classification." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-122", "text": "For the task of classifying implicit citations (Table 3) , in general, sent2vec (macro-F 0.44) was comparable with the baseline (macro-F 0.47) and it was effective for detecting objective sentences (F-score 0.84) as well as separating X sentences from the rest (F-score 0.997), but it did not work well on distinguishing positive citations from the rest." }, { "sent_id": "32ca78f6e01b7732a4bf573b91fbfe-C001-123", "text": "For the overall classification (Table 2) , however," } ], "y": { "@BACK@": { "gold_contexts": [ [ "32ca78f6e01b7732a4bf573b91fbfe-C001-35" ], [ "32ca78f6e01b7732a4bf573b91fbfe-C001-43" ] ], "cite_sentences": [ "32ca78f6e01b7732a4bf573b91fbfe-C001-35", "32ca78f6e01b7732a4bf573b91fbfe-C001-43" ] }, "@USE@": { "gold_contexts": [ [ "32ca78f6e01b7732a4bf573b91fbfe-C001-43", "32ca78f6e01b7732a4bf573b91fbfe-C001-46" ], [ "32ca78f6e01b7732a4bf573b91fbfe-C001-68" ] ], "cite_sentences": [ "32ca78f6e01b7732a4bf573b91fbfe-C001-43", "32ca78f6e01b7732a4bf573b91fbfe-C001-68" ] }, "@DIF@": { "gold_contexts": [ [ "32ca78f6e01b7732a4bf573b91fbfe-C001-113" ] ], "cite_sentences": [ "32ca78f6e01b7732a4bf573b91fbfe-C001-113" ] } } }, "ABC_6bb7d5f16861470214626c1cc497bb_37": { "x": [ { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-2", "text": "When a speaker, Mary, asks Do you know that Florence is packed with visitors?, we take her to believe that Florence is packed with visitors, but 
not if she asks Do you think that Florence is packed with visitors?" }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-3", "text": "Inferring speaker commitment (aka event factuality) is crucial for information extraction and question answering." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-4", "text": "Here we explore the hypothesis that linguistic deficits drive the error patterns of speaker commitment models by analyzing the linguistic correlates of model errors on a challenging naturalistic dataset." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-5", "text": "We evaluate two state-of-the-art speaker commitment models on the CommitmentBank, an English dataset of naturally occurring discourses." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-6", "text": "The CommitmentBank is annotated with speaker commitment towards the content of the complement (Florence is packed with visitors in our example) of clause-embedding verbs (know, think) under four entailment-canceling environments." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-7", "text": "We found that a linguistically-informed model outperforms an LSTM-based one, suggesting that linguistic knowledge is needed to capture such challenging naturalistic data." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-8", "text": "A breakdown of items by linguistic features reveals asymmetrical error patterns: while the models achieve good performance on some classes (e.g., negation), they fail to generalize to the diverse linguistic constructions (e.g., conditionals) in natural language, highlighting directions for improvement." 
}, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-9", "text": "----------------------------------" }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-11", "text": "Prediction of speaker commitment 1 is the task of determining to what extent the speaker is committed to an event in a sentence as actual, nonactual, or uncertain." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-12", "text": "This matters for downstream NLP applications, such as information extraction or question answering: for instance, we should extract from example (1) in Table 1 that the speaker could wish someone dead, but from (3) that people should not be allowed to carry guns in their vehicles, even though both events are embedded under believe and negation." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-13", "text": "There has been work on factors leading to speaker commitment in theoretical linguistics (i.a., Karttunen (1971) ; Simons et al. (2010) ) and computational linguistics (i.a., Diab et al. (2009) ; Saur\u00ed and Pustejovsky (2012) ; Prabhakaran et al. (2015) ), but mostly on constructed or newswire examples, which may simplify the task by failing to reflect the lexical and syntactic diversity of naturally occurring utterances." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-14", "text": "de Marneffe et al. (2019) introduced the CommitmentBank, a dataset of naturally occurring sentences annotated with speaker commitment towards the content of complements of clause-embedding verbs under entailment-canceling environments (negation, modal, question and conditional), to study the linguistic correlates of speaker commitment." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-15", "text": "In this paper, we use it to evaluate two state-of-the-art (SoA) models of speaker commitment: Stanovsky et al. (2017) and ."
}, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-16", "text": "The CommitmentBank, restricted to specific linguistic constructions, is a good test case." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-17", "text": "It allows us to evaluate whether current speaker commitment models achieve robust language understanding, by analyzing their performance on specific challenging linguistic constructions." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-18", "text": "----------------------------------" }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-19", "text": "**THE COMMITMENTBANK CORPUS**" }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-20", "text": "The CommitmentBank 2 consists of 1,200 naturally occurring items involving clause-embedding verbs under four entailment-canceling environments (negations, modals, questions, conditionals); (1) in Table 1 gives an example. Context: The answer is no, no no. Not now, not ever." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-21", "text": "Target: I never believed I could wish anyone dead (Table 1) ." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-22", "text": "For each item, speaker commitment judgments were gathered on Mechanical Turk from at least eight native English speakers." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-23", "text": "Participants judged whether or not the speaker is certain that the content of the complement in the target sentence is true, using a Likert scale labeled at 3 points (+3/speaker is certain that the complement is true, 0/speaker is not certain whether it is true or false, -3/speaker is certain that it is false)." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-24", "text": "We took the mean annotations of each item as gold score of speaker commitment." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-25", "text": "Figure 1 shows the mean annotations per embedding verb."
}, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-26", "text": "Restricted set We identified a subset of the CommitmentBank that displays high agreement among annotators." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-27", "text": "We divided the range of integer ratings [\u22123, 3] into three sub-ranges: [1, 3] where the speaker is committed to the complement p, 0 where the speaker is uncommitted towards p, [\u22123, \u22121] where the speaker is committed to \u00acp." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-28", "text": "We selected the items for which at least 80% of the annotations fall into the same sub-range." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-29", "text": "This gives 556 items, with 37 clause-embedding verbs." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-30", "text": "Figure 2 shows that the proportion of items with different linguistic features in the restricted set is similar to the proportion in the full set, suggesting that the restricted set is representative of the original data." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-31", "text": "The full CommitmentBank has a Krippendorff's \u03b1 of 0.53, while \u03b1 is 0.74 on the restricted set." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-32", "text": "----------------------------------" }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-33", "text": "**MODELS OF SPEAKER COMMITMENT**" }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-34", "text": "We evaluate the performance of two speaker commitment models on the CommitmentBank: a rule-based model (Stanovsky et al., 2017) and a neural-based one." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-35", "text": "Rule-based model Stanovsky et al.
(2017) proposed a rule-based model, a deterministic algorithm based on TruthTeller (Lotan et al., 2013) , which uses a top-down approach on a dependency tree and predicts speaker commitment score in [\u22123, 3] according to the implicative signatures (Karttunen, 2012) of the predicates, and whether the predicates are under the scope of negation and uncertainty modifiers." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-36", "text": "For example, refuse p entails \u00acp, so the factuality of its complement p gets flipped if encountered." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-37", "text": "Neural-based model Three neural models for speaker commitment were introduced: a linear biLSTM, a dependency tree biLSTM, and a hybrid model that ensembles the two. The same work also proposed a multitask training scheme in which a model is trained on four factuality datasets: FactBank (Saur\u00ed and Pustejovsky, 2009) , UW (Lee et al., 2015) , MEANTIME (Minard et al., 2016) and UDS , all with annotations on a [\u22123, 3] scale." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-38", "text": "Each dataset has shared biLSTM weights but specific regression parameters." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-39", "text": "Reference datasets The FactBank, UW, and MEANTIME datasets all consist of sentences from news articles." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-40", "text": "Each event in FactBank was annotated by 2 annotators, with 0.81 Cohen's \u03ba." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-41", "text": "UW has 5 annotations for each event, and MEANTIME has 6." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-42", "text": "UDS contains sentences from the English Web Treebank (Bies et al., 2012) , which contains weblogs, newsgroups, emails, reviews, and question-answers." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-43", "text": "It has 2 annotations for each predicate, with 0.66 Cohen's \u03ba."
}, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-44", "text": "All four datasets have annotations biased towards +3, because (1) they are newswire-heavy with sentences describing known factual events, and (2) most annotations are for main-clause predicates instead of predicates in an embedded clause." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-45", "text": "Table 2 : The number of annotated predicates in each dataset, and previous state-of-the-art performance." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-46", "text": "The score on UW with MAE was obtained by Stanovsky et al. (2017) , while the other scores were obtained by ." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-47", "text": "Models are evaluated with mean absolute error (MAE) and Pearson's r correlation, measuring how well the model captures variability in the data." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-48", "text": "Pearson's r is considered more informative than MAE because the reference sets are biased towards +3." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-49", "text": "----------------------------------" }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-50", "text": "**EVALUATION**" }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-51", "text": "We evaluated both models on the CommitmentBank: we trained the linear, tree, and hybrid biLSTM models using the multi-task training scheme on the four factuality datasets they used, which produced four predictions." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-52", "text": "Following White et al. (2018), we used cross-validated ridge regression to predict a final score using the four predictions." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-53", "text": "We include a majority baseline \"All -2.0\" (always predicting -2.0, since -2.0 is the most frequent answer in the full and restricted CommitmentBank)." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-54", "text": "The results are shown in Figure 3 ."
}, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-55", "text": "The rule-based model outperforms the biLSTM models on the full set, but overall neither SoA model performs very well on the CommitmentBank." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-56", "text": "As shown in Figure 3 , the CommitmentBank is substantially more challenging for these models than the reference datasets, with lower correlation and higher absolute error rates than were obtained for any of these other datasets." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-57", "text": "----------------------------------" }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-58", "text": "**ANALYSIS**" }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-59", "text": "Focusing on the restricted set, we perform detailed error analysis of the outputs of the rule-based and hybrid biLSTM models, which achieved the best correlation (Figure 3: Pearson r correlation and Mean Absolute Error (MAE) for the All -2.0 baseline, the rule-based annotator (Stanovsky et al., 2017) , and the three biLSTM models)." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-60", "text": "Pearson r is undefined for All -2.0." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-61", "text": "All correlations are statistically significant (p < 0.05)." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-63", "text": "Table 3 shows performance for the following linguistic features, and Figure 4 shows scatterplots of gold judgments vs. predictions."
}, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-64", "text": "----------------------------------" }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-65", "text": "**EMBEDDING ENVIRONMENT**" }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-66", "text": "The rule-based model can only capture inferences involving negation (r = 0.45), while the hybrid model performs more consistently across negation, modal, and question (r \u223c 0.25)." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-67", "text": "Both models cannot handle inferences with conditionals." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-68", "text": "The models' performance on the negation items also varies with genre: the rule-based model has significant correlations for fiction (r = 0.45) and dialog (r = 0.32), while the hybrid model has correlations between 0.05 and 0.2 for all three genres, none reaching significance." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-69", "text": "About 40% of the modal and question items involve factive verbs, therefore performance in these environments also correlates with the models' performance on factive verbs (elaborated on below)." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-70", "text": "Genre Both models achieve the best correlation on dialog (Switchboard), and the worst on newswire (WSJ)." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-71", "text": "The poor performance on WSJ might be due to its scores in CommitmentBank being more widespread (reflected in Figure 4 ) than annotations in the reference datasets (e.g., MEANTIME), which tend to be biased towards +3." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-72", "text": "The good performance of the rule-based model on dialog could be due to the fact that 70% of the items in dialog are in a negation environment with a nonfactive verb."
}, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-73", "text": "Factive embedding verb Lexicalist theories (i.a., Karttunen 1973; Heim 1983 ) predict that complements of factive verbs are commitments of the speaker." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-74", "text": "This tendency is reflected in Figures 1 and 4 where most sentences with factives have higher mean commitment scores." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-75", "text": "Both models get better MAE on factives, but better correlation on nonfactives." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-76", "text": "The improved MAE of the rule-based model might be due to its use of factive/implicative signatures." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-77", "text": "However, the poor correlations suggest that neither model can robustly capture the variability in inference which exists in sentences involving factive/nonfactive verbs (see i.a." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-78", "text": "Beaver 2010; de Marneffe et al. 2019 )." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-79", "text": "Neg-raising Within sentences with negation, we examine the models' performance on sentences with \"neg-raising\" reading, where a negation in the matrix clause (not {think/believe} p) is interpreted as negating the complement clause (think/believe not p), as in example (3) in Table 1 where we understand the speaker to be committed to people should not be allowed to carry guns in their vehicles." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-80", "text": "We identify \"neg-raising\" items as items with a negation embedding environment, think or believe verb, and a negative commitment score." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-81", "text": "There is almost no correlation between both models' predictions and gold judgments ( Table 3 ), suggesting that the models are not able to capture neg-raising inferences." 
}, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-82", "text": "Figure 4 shows that the hybrid model predictions are mostly positive, whereas the rule-based model predictions are clustered at \u22123 and +3." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-83", "text": "This suggests that the rule-based model cannot capture the gradience present in commitment judgments, while the hybrid model struggles to recognize negative commitments." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-84", "text": "----------------------------------" }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-85", "text": "**MODEL BEHAVIOR**" }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-86", "text": "To better interpret the models' outputs, we evaluate them in a classification setting." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-87", "text": "We use Gaussian mixture models to obtain three clusters for the mean gold scores and the predictions of both models." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-88", "text": "We assign the cluster with the highest mean to +: speaker is certain that the complement is true, the one with the lowest mean to -: speaker is certain that it is false, and the remaining one to o: speaker is not certain about its truth." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-89", "text": "We report precision, recall and F1 in Table 4 ." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-90", "text": "The rule-based model predicts + by default unless it has clear evidence (e.g., negation) for negative commitment." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-91", "text": "This behavior is reflected in the high precision for -." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-92", "text": "Both models perform well on + and -, but neither is able to identify no commitment (o)." 
}, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-93", "text": "----------------------------------" }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-94", "text": "**CONCLUSION**" }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-95", "text": "Our evaluation of two SoA models for speaker commitment on the CommitmentBank shows that the models perform better on sentences with negation, and with nonfactive embedding verbs." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-96", "text": "However, they are not able to generalize to other linguistic environments such as conditional, modal, and neg-raising, which display inference patterns that are important for information extraction." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-97", "text": "Both models are able to identify the polarity of commitment, but cannot capture its gradience." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-98", "text": "The rule-based model, outperforming the biLSTM models on the full CommitmentBank, shows that a linguistically-informed model scales more successfully to challenging naturalistic data." }, { "sent_id": "6bb7d5f16861470214626c1cc497bb-C001-99", "text": "In the long run, to perform robust language understanding, models will need to incorporate more prior linguistic knowledge and be able to generalize to a wider range of linguistic constructions."
} ], "y": { "@USE@": { "gold_contexts": [ [ "6bb7d5f16861470214626c1cc497bb-C001-15" ], [ "6bb7d5f16861470214626c1cc497bb-C001-34" ], [ "6bb7d5f16861470214626c1cc497bb-C001-46" ], [ "6bb7d5f16861470214626c1cc497bb-C001-59" ] ], "cite_sentences": [ "6bb7d5f16861470214626c1cc497bb-C001-15", "6bb7d5f16861470214626c1cc497bb-C001-34", "6bb7d5f16861470214626c1cc497bb-C001-46", "6bb7d5f16861470214626c1cc497bb-C001-59" ] }, "@BACK@": { "gold_contexts": [ [ "6bb7d5f16861470214626c1cc497bb-C001-35" ] ], "cite_sentences": [ "6bb7d5f16861470214626c1cc497bb-C001-35" ] } } }, "ABC_db1fd6f10a3ee22e22093d50395217_37": { "x": [ { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-2", "text": "In this paper, we propose an automatic sentence classification model that can map sentences of a given biomedical abstract into a set of pre-defined categories which are used for Evidence Based Medicine (EBM)." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-3", "text": "In our model we explored the use of various lexical, structural and sequential features and worked with Conditional Random Fields (CRF) for classification." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-4", "text": "Results obtained with our proposed method show improvement with respect to current state-of-the-art systems." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-5", "text": "We have participated in the ALTA shared task 2012 and our best-performing model is ranked among the top 5 systems."
}, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-6", "text": "----------------------------------" }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-8", "text": "Evidence Based Medicine (EBM) or Evidence based practice is \"systematically locating, appraising, and using contemporaneous research findings as the basis for clinical decisions\" (Rosenberg and Donald, 1995) ." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-9", "text": "Considering the huge amounts of literature and millions of clinical articles currently available and continuously being added to databases like PubMed 1 (http://en.wikipedia.org/wiki/PubMed), automating information access and the search for scientific evidence for EBM is a crucial task." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-10", "text": "Currently evidence based practitioners use the PICO criterion, which was proposed by Armstrong (1999) , to construct queries and search information in EBM tasks." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-11", "text": "The PICO concepts or tag-sets are: Population (P), Intervention (I), Comparison (C) and Outcome (O)." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-12", "text": "In this paper, we present a method that classifies sentences in the abstract of a clinical article according to the PIBOSO criteria, which is an extension of PICO." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-13", "text": "PIBOSO has six tags: Population (P), Intervention (I), Background (B), Outcome (O), Study Design (SD) and Other (Oth)." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-14", "text": "This information could be used in constructing queries or searching relevant articles in the EBM task." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-15", "text": "A clear description of the PIBOSO tag-set is available in Kim et al. (2011) , who proposed the tag-set."
}, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-16", "text": "Our system is based on the CRF algorithm, which was earlier used by Kim et al. (2011) for a similar task and proved useful." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-17", "text": "The major contribution of this paper is that we use a simple and large set of features such as lexical, structural and sequential features and show major improvements on the task of sentence classification over earlier attempts." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-18", "text": "Our classification techniques have shown clear improvement over existing state-of-the-art systems, especially for unstructured abstracts." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-19", "text": "The paper is organised as follows: We discuss related work in Section 2, describe the dataset for training and evaluation in Section 3, and present our method and experimental setup in Section 4." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-20", "text": "We present the analysis of our results in Section 5 and conclude in Section 6." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-21", "text": "----------------------------------" }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-22", "text": "**RELATED WORK**" }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-23", "text": "The first attempt to classify abstract sentences based on the PIBOSO schema is made by Kim et al. (2011) ." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-24", "text": "They used the Conditional Random Field (CRF) classifier for learning, and their feature set included lexical features (unigram and bigram with part-of-speech), semantic features (using metathesaurus), structural features (sentence positional features) and sequential features (features from previous sentences)."
}, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-25", "text": "They found that the best features are unigrams, sentence positional attributes, and sequential information." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-26", "text": "Using this best configuration of features and the same data set as in our experiment, they did 10-fold cross-validation." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-27", "text": "The best microaverage F-score for each class or label for both Structured (S) and Unstructured (U) data are summarised in Table 3 ." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-28", "text": "The other attempt at the same 6-way PIBOSO classification on the same dataset is presented by Verbeke et al. (2012) ." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-29", "text": "In this method, the input sentences are pre-processed with a named-entity tagger and dependency parser." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-30", "text": "They used a statistical relational learning approach in which features are constructed declaratively using intensional relations." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-31", "text": "Unlike us and Kim et al. (2011) , they used SVM-HMM 2 for learning." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-32", "text": "Similar to Kim et al. (2011) , they did 10-fold cross-validation and the best microaverage F-score of their system is also summarised in Table 3 ." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-33", "text": "----------------------------------" }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-34", "text": "**DATASET**" }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-35", "text": "To build the EBM classifier we used the 800 expert-annotated training abstracts and 200 test abstracts, which were given as part of the shared task."
}, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-36", "text": "Kim et al. (2011) annotated this data using abstracts retrieved from MEDLINE." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-37", "text": "Both the training and test abstracts have two types of abstracts, structured (S) and unstructured (U)." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-38", "text": "In structured abstracts sentences are organised (and labelled) in an orderly fashion such as Aim, Method, Results, Conclusions and Other, whereas these labels are absent in unstructured abstracts." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-39", "text": "Please note that the way we categorised an abstract as structured or unstructured might be a bit different from previous approaches by Kim et al. (2011) and Verbeke et al. (2012) ." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-40", "text": "If the first sentence in an abstract is a sentence ordering label then we considered the abstract as structured, or else unstructured." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-41", "text": "There are 1000 abstracts containing 11616 sentences in total." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-42", "text": "Statistics of the dataset used are presented in Table 1 and Table 2 (Abstracts: 1000 in total, 37.4% S, 62.6% U; Sentences: 11616 in total, 54.4% S, 45.6% U). In this section we present the details of our feature set, the training (classification) algorithm, the tools used and assumptions made in executing the experiments." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-43", "text": "----------------------------------" }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-44", "text": "**FEATURES**" }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-45", "text": "We have trained our classifier with different sets of features, which include lexical features, structural features, sequential features and dependency features 3 ."
}, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-46", "text": "\u2022 Lexical features include lemmatized bag-ofwords, their part-of-speech, collocational information, the number of content words, verbs and nouns in the sentence (we have used the Med-Post (Smith et al., 2004) part-of-speech tagger)." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-47", "text": "\u2022 Structural features include position of the sentence in the abstract, normalised sentence position, reverse sentence position (Kim et al., 2011) ." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-48", "text": "\u2022 Sequential features include previous sentence label, similar to Kim et al. (2011)." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-49", "text": "Additionally, for structured abstracts, we use the sentence ordering labels as features: Heading, Aim, Method, Results, Conclusions." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-50", "text": "These are provided in the data." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-51", "text": "Since unstructured abstracts do not have these ordering labels, we automatically annotate the training and testing data with ordering labels using simple heuristics." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-52", "text": "In the unstructured training data, sentences are classified into an ordering label based on its PIBOSO label: Background -> Aim, (Population or Intervention or Study Design) -> Method, Outcome -> Results and Other -> Other." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-53", "text": "In the unstructured testing data, we have divided sentences into four equal groups based on their position and mapped them to Aim, Method, Results and Conclusions in this order." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-54", "text": "Using sentence ordering labels for unstructured abstracts is the main difference compared to earlier methods (Kim et al., 2011; Verbeke et al., 2012) ." 
}, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-55", "text": "We tried 6 combinations of features which will be discussed in the Results section." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-57", "text": "----------------------------------" }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-58", "text": "**ALGORITHM**" }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-59", "text": "Our sentence classifier uses the CRF learning algorithm 4 ." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-60", "text": "We also executed a few experiments using SVM and observed that CRF performed better on this dataset with our choice of features." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-61", "text": "Due to space constraints in this paper we are not comparing CRF versus SVM results." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-62", "text": "For feature selection, we used the FSelector 5 package from the R-system 6 ." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-63", "text": "From the pool of features, we select \"meaningful\" features based on selection criteria." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-64", "text": "We have tested several criteria including (1) information gain, (2) OneR, (3) the chi-square test and (4) the Spearman test." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-65", "text": "Among them, information gain outperformed the others." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-66", "text": "We select the 700 best features from our pool of features based on information gain score." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-67", "text": "Another technique we used for this shared task is \"bootstrapping\"." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-68", "text": "Our system performed very well on training data but did not perform well on test data; perhaps it suffered from over-fitting."
}, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-69", "text": "To overcome this, we ran our current best model on the test data (without using gold-standard labels) and then merged the result with the training data to obtain a new training set." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-70", "text": "In that way, under ROC evaluation, we improved our final scores by 3%." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-71", "text": "In addition, we also pre-processed the data." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-72", "text": "Since headings such as \"AIM\", \"OUTCOME\", \"INTRODUCTION\", etc. are always classified as \"other\" in the training data, when we find a sentence which has fewer than 20 characters and is all in upper case (our notion of a heading), we directly classify it as \"other\" in the test data." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-73", "text": "In this section we present the analysis of results on structured and unstructured abstracts separately." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-74", "text": "In all our experiments, we performed 10-fold cross-validation on the given dataset." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-75", "text": "The shared task organisers used the Receiver Operating Characteristic (ROC) to evaluate the scores." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-76", "text": "According to ROC our best system scored 93.78% (public board) and 92.16% (private board)." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-77", "text": "However, we compare our results with Kim et al. (2011) and Verbeke et al. (2012) using the microaveraged F-scores as in Table 3 ." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-78", "text": "Our system outperformed previous works in unstructured abstracts (22% higher than the state of the art)."
}, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-79", "text": "Our system performed well in classifying Background, Outcome and Other for both structured and un-structured data." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-80", "text": "However, our system performed poor in classifying study design as very few instances of it is available in both test and train." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-81", "text": "----------------------------------" }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-82", "text": "**RESULTS**" }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-83", "text": "----------------------------------" }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-84", "text": "**FEATURES**" }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-85", "text": "We present the results of 6 systems learned using different feature sets: Table 4 for structured abstracts and Table 5 for unstructured abstracts." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-86", "text": "We choose bag-of-words (BOW) as the base features, +lexical includes BOW and lexical features, +struct include BOW and structural features, +ordering includes BOW and sentence ordering labels, All includes BOW, lexical, struct and ordering features." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-87", "text": "All+seq includes all these features and sequential features." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-88", "text": "In previous works, F-scores for unstructured data are low (compared to structured data)." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-89", "text": "However, adding the automatic sentence ordering label to the unstructured data improved the performance drastically." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-90", "text": "This is the main difference compared to earlier models." 
}, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-91", "text": "Overall, our system outperformed existing systems in both structured and unstructured in many labels, which are highlighted in Table 3 under our system section." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-92", "text": "Finally, combining BOW, lexical, structure and sentence ordering features showed the highest performance for both structured and unstructured data." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-93", "text": "It also showed that adding the sequential feature (i.e. the PIBOSO label of the previous sentence) do not help in our system, in fact the result slightly reduced." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-94", "text": "(81.7 -> 81.4 for structured and 89.2 -> 89.0 for unstructured)." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-95", "text": "----------------------------------" }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-96", "text": "**CONCLUSIONS**" }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-97", "text": "In this paper, we have presented a brief overview of our method to classify sentences to support EBM." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-98", "text": "We showed that structural and lexical features coupled with a CRF classifier is an effective method for dealing with sentence classification tasks." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-99", "text": "The best features in our setting are found to be words, lexical features such as part-of-speech information, sentence positional features, collocations and sentence ordering labels." }, { "sent_id": "db1fd6f10a3ee22e22093d50395217-C001-100", "text": "Our system outperformed earlier existing state-of-art systems (Kim et al., 2011; Verbeke et al., 2012) ." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "db1fd6f10a3ee22e22093d50395217-C001-28" ] ], "cite_sentences": [ "db1fd6f10a3ee22e22093d50395217-C001-28" ] }, "@DIF@": { "gold_contexts": [ [ "db1fd6f10a3ee22e22093d50395217-C001-28", "db1fd6f10a3ee22e22093d50395217-C001-31" ], [ "db1fd6f10a3ee22e22093d50395217-C001-39" ], [ "db1fd6f10a3ee22e22093d50395217-C001-54" ], [ "db1fd6f10a3ee22e22093d50395217-C001-77", "db1fd6f10a3ee22e22093d50395217-C001-78" ], [ "db1fd6f10a3ee22e22093d50395217-C001-100" ] ], "cite_sentences": [ "db1fd6f10a3ee22e22093d50395217-C001-28", "db1fd6f10a3ee22e22093d50395217-C001-39", "db1fd6f10a3ee22e22093d50395217-C001-54", "db1fd6f10a3ee22e22093d50395217-C001-77", "db1fd6f10a3ee22e22093d50395217-C001-100" ] }, "@USE@": { "gold_contexts": [ [ "db1fd6f10a3ee22e22093d50395217-C001-77" ] ], "cite_sentences": [ "db1fd6f10a3ee22e22093d50395217-C001-77" ] } } }, "ABC_6fd0c2fbbe0c7fb669f2618f4d01f7_37": { "x": [ { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-96", "text": "**EVALUATION OF HI-EN-WP CORPUS**" }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-2", "text": "Abstract." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-3", "text": "The text generated on social media platforms is essentially a mixed lingual text." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-4", "text": "The mixing of language in any form produces considerable amount of difficulty in language processing systems." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-5", "text": "Moreover, the advancements in language processing research depends upon the availability of standard corpora." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-6", "text": "The development of mixed lingual Indian Named Entity Recognition (NER) systems are facing obstacles due to unavailability of the standard evaluation corpora." 
}, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-7", "text": "Such corpora may be of mixed lingual nature in which text is written using multiple languages predominantly using a single script only." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-8", "text": "The motivation of our work is to emphasize the automatic generation such kind of corpora in order to encourage mixed lingual Indian NER." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-9", "text": "The paper presents the preparation of a Cross Script Hindi-English Corpora from Wikipedia category pages." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-10", "text": "The corpora is successfully annotated using standard CoNLL-2003 categories of PER, LOC, ORG, and MISC." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-11", "text": "Its evaluation is carried out on a variety of machine learning algorithms and favorable results are achieved." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-12", "text": "----------------------------------" }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-13", "text": "**INTRODUCTION**" }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-14", "text": "Named Entity Recognition (NER) is a significant task in Information Extraction which identifies and classifies information elements called Named Entities (NEs) in a given sample of text." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-15", "text": "The term Named Entity was coined at the 6 th Message Understanding Conference (MUC) to encourage the importance of semantic identification of persons, locations and organizations in natural language text [1] ." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-16", "text": "NER is far from being a solved task due to large dissimilarity around its concept which shows impact on its applications and evaluations [2, 7, 9] ." 
}, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-17", "text": "The efforts in the progress of NER systems for Indian languages, primarily Hindi faced adversities due to unavailability web resources, as most of them are in English." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-18", "text": "Moreover, language mixing is common phenomenon on social media which increases considerable amount of difficulty in its corpus creation." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-19", "text": "The data analysis of Indian social networking sites, blogs and chatting applications show great deal of anglicism in its text [3] ." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-20", "text": "The key challenge in the construction and evaluation of Indian NER systems is the unavailability of standard evaluation corpora for mixed lingual text." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-21", "text": "As an example with binary language set, a mixed bilingual text is one that contains Hindi, English and Romanized Hindi." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-22", "text": "In motivation to boost the advancements in research and studies on Indian mixed lingual information extraction, we attempt to build a cross script Hindi-English NER corpus from Wikipedia." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-97", "text": "Machine Leaning algorithms are quite successful for NER training and prediction." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-23", "text": "This work will enable to develop informatics, probably to build systems for extraction of named entities from Indian social media content." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-24", "text": "It will empower the cyber media regulatory authorities to tackle the severe problem of using social media platform for intimidation." 
}, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-25", "text": "Moreover, it may be helpful in detection of fake and offensive messages spread against a person or an organization." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-26", "text": "To the best of our knowledge, our work is the first attempt to use Wikipedia for building such corpora." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-27", "text": "To build the cross script Hindi English NER corpora (Hi-En-WP), we explored Wikipedia and subsequently identified specific category pages having substantial amount of links to the Wikipedia pages pertaining to Hindi belt of India [17] ." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-28", "text": "We considered the entities targeted in such pages as Hindi-English cross script due to its origin from Hindi linguistic sections, but written in Roman script." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-29", "text": "The automatically extracted collection of multi-token expressions from the links of those Wikipedia content pages are considered as probable NEs." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-30", "text": "Each of these probable NEs is assigned a confidence score based on its attributes." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-31", "text": "The probable NEs with certain level of confidence score are selected for inclusion in the corpora." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-32", "text": "Further, the constructed corpora is evaluated on various machine learning algorithms." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-33", "text": "----------------------------------" }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-34", "text": "**NER AND WIKIPEDIA**" }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-35", "text": "Wikipedia is an open and collaborative multilingual encyclopedia contributed by several collaborators currently having 5.6 million English articles [21] ." 
}, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-36", "text": "Constantly, the articles are updated and new articles are added by its collaborators." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-37", "text": "About 74% of Wikipedia articles fall under the category of named entity classes [4] , therefore, we consider Wikipedia links among articles to construct the Hi-En-WP NER annotated corpus." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-38", "text": "Wikipedia includes content pages which contain concepts and facts about the article, category pages provides a list of content pages into several classes based on specific criteria and disambiguation pages help to locate different content pages with same title." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-39", "text": "Wikipedia is an abundant resource for generation of different types of multilingual, cross lingual, cross script and language independent corpus, etc., its markup has been used extensively to automatically generate NER annotated corpus for training machine learning models [4] [5] [6] [11] [12] [13] [14] 19] ." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-40", "text": "The latest involvement using Wikipedia is the portable cross lingual NER for low resource languages using translation of an annotated NER corpus from English [12, 19] ." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-41", "text": "Another approach to cross lingual and language independent corpora is to learn a model on language independent features of a source language and test the model on other languages using same features [13] ." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-42", "text": "Nothman, et al. (2008) [5] constructed a massive English NER corpus by the classification of Wikipedia articles to its category types by mapping them to CoNLL-2003 NER tagset." 
}, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-43", "text": "A similar approach to massive multilingual NER corpus is found in [14] ." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-44", "text": "A hybrid approach to generate parallel sentences with NE notations reveal strong results on Wikipedia dataset [19] ." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-68", "text": "We consider these expressions as probable NEs after the removal of duplicates from the set of tokens obtained." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-45", "text": "Kazama and Torisawa (2007) [15] use the external knowledge along with the analysis of first sentence in Wikipedia articles to infer the entity category." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-46", "text": "Cross script NER corpus from social media in Bengali written in Roman script is also evaluated on various models [10] ." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-47", "text": "Domain specific gazetteers for Indian languages are developed using transliteration and context pattern extraction." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-48", "text": "Saha et al. (2008) [8] proposed an approach to develop NE corpora for Indian languages from various web resources using transliteration and context pattern extraction." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-49", "text": "Sequence labelling was employed on English-Hindi and English-Tamil to obtain suitable results." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-50", "text": "A corpora for Bengali-English was derived from social media and was evaluated using contextual features on Tourism and Sports domain." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-51", "text": "Our approach to classify the articles is based on computation of the entity confidence score for each multi-token extracted as a probable NE and tag the probable NE having certain high confidence score to CoNLL-2003 [21] NE classes." 
}, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-52", "text": "Eventually, there is no standard cross script Hindi-English NER corpus available, we are first to utilize Wikipedia for such corpus preparation." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-53", "text": "Except for Bengali-English Code Mixed Cross Script Named Entity corpora which is extracted from social media content [10] , to the best of our knowledge, no Hindi-English cross script automatically built corpora is available which is extracted from Wikipedia." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-54", "text": "----------------------------------" }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-55", "text": "**HINDI ENGLISH WIKIPEDIA CORPUS GENERATION**" }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-56", "text": "The automatic generation of Hi-En-WP corpora involves three tasks to be performed: extracting entities from a webpage, generating a confidence score for each entity for selection into named entity set, and finally annotating them with the appropriate tag." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-57", "text": "To achieve this, we parsed 13 Wikipedia category pages and collected 7285 hyperlinks." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-58", "text": "The category pages that we selected to initiate the extraction process were those class of Wikipedia pages that contain surplus information that is closely related to NEs and positively about Hindi speaking regions of India [17] ." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-59", "text": "We strongly take into account the assumption that the entities present on these pages are prominently Hindi NEs." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-60", "text": "This assumption is based on human assessment that the information on such pages is based on Indian background especially from Hindi linguistic majority sections of India." 
}, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-61", "text": "----------------------------------" }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-62", "text": "**CORPUS ACQUISITION**" }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-63", "text": "Wikipedia being a huge source of information, its articles comprise of: topic and its comprehensive summary in paragraphs and images; reference to reliable resources; and hyperlinks, also called wikilinks to other articles." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-64", "text": "Our method takes the advantage of wikilinks between the articles from which linktext is extracted." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-65", "text": "Since wikilinks are links to articles, it may be considered as a named entity." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-66", "text": "This approach is similar to Nothman et al (2008) [4] to generate the NER data from wikilinks." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-67", "text": "A total number of 7285 tokens and multi-tokens expressions were extracted from the links by parsing the 13 identified Wikipedia webpages." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-69", "text": "In order to distinguish among the tokens an intuitive approach is to calculate the distribution of different type of tokens." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-70", "text": "To make this achievable, we grouped them according to POS tags." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-71", "text": "We consider the supposition that NEs are often NNP, moreover, the common nouns, verbs and rest of the POS tags generated help to filter out the non NEs." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-72", "text": "Wordtypes." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-73", "text": "We adopt a subset of wordtypes proposed by Collins (2002) [16] , which maps uppercase letters to A, lowercase to a, and, digits to 0." 
}, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-74", "text": "The brief statics of the generated HI-En-WP corpus is presented in Table 1 ." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-75", "text": "----------------------------------" }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-76", "text": "**CONFIDENCE SCORE**" }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-77", "text": "According to the POS tags and wordtypes assigned to each expression, the confidence score is calculated for the selection of the NEs to be annotated." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-78", "text": "Against each probable NE two scores are generated: (1) POS tag score is assigned a value of 1 if the tag is NNP, NNS or NN, and 0 for rest of the tags." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-79", "text": "(2) wordtypes score is assigned a value of 1 if the type starts with A, or contains all A, and 0 for rest of the cases." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-80", "text": "In this way, a confidence score is constituted by the sum of both the scores for all the probable NEs." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-81", "text": "A positive confidence score achieved by a probable NE is marked selected for the corpus." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-82", "text": "The probable NEs which are allocated a zero confidence score as discarded from selection." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-83", "text": "In this way, we obtained 2916 entities ready for annotation into NER tagset." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-84", "text": "More complex score can be taken into account by the different attributes relevant for named entity classification." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-85", "text": "Also, the usage of different cases of POS tags, wordtypes can be explored." 
}, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-86", "text": "The task is easily scalable to generate massive corpus by the consideration of more number of source Wikipedia pages." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-87", "text": "----------------------------------" }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-88", "text": "**NER TAGSETS**" }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-89", "text": "Two different annotators manually tagged almost 2916 NEs to course grained labels based on CoNLL-2003 tagset, i.e., PER, LOC, ORG and MISC [21] ." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-90", "text": "Both the annotators were proficient in Hindi as well as in English." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-91", "text": "The accurately labelled corpus achieved the inter annotator agreement over 98%." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-92", "text": "A little disagreement was over MISC tags which were mutually resolved." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-93", "text": "The predominantly occupied class over the whole set is PER which is 65%, LOC and ORG tagged NEs are almost distributed similarly, MISC is only 5 percent, see Table 2 ." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-94", "text": "We experimented with this course-grained corpus in various configurations of train and test set over different machine learning approaches" }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-95", "text": "----------------------------------" }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-98", "text": "The algorithms make use of patterns, contextual information, orthographic and linguistic features in order to train the models." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-99", "text": "The annotated training corpora common towards NER task is Newswire text, CoNLL-2003 shared task data, the MUC7 dataset, BBN, etc." 
}, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-100", "text": "We evaluated the performance of word level Hi-En-WP corpus on general classification algorithms using the collection of all character n-grams for n=1-5 as feature set." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-101", "text": "Amongst Logistic Regression, SVM, Na\u00efve Bayes, Random Forest and Stochastic Gradient Classifier, Logistic Regression seemed most effective, in all the cases, as shown in Table 3 ." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-102", "text": "The analysis of class wise precision, recall and F-score on Logistic Regression classifier is shown in Table 4 ." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-103", "text": "When trained with MISC as a separate class, high values for PER suggests that they comparatively easy to identify." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-104", "text": "It may be due to sufficiently large amount of training data as compared to rest of the tags." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-105", "text": "Whereas, low precision value for LOC tag suggests that other entities are wrongly classified as location." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-106", "text": "The MISC F-score is expectedly low, in agreement with the results of Nothman et al (2008) [4] ." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-107", "text": "The variation reflected in F-score among all may be the effect of diversity in linguistic attributes." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-108", "text": "An increase in accuracy from 89% to 92% is observed when the model is trained without MISC tag which reflects that the confusion is created in data by the inclusion of training examples that belong to MISC tag." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-109", "text": "The Fig.1 illustrates the effect of varying size of the Hi-En-WP training data." 
}, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-110", "text": "The improving trend of accuracy score on increasing training data sufficiently produces scope to scale the size of corpus in future." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-111", "text": "----------------------------------" }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-112", "text": "**5**" }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-113", "text": "----------------------------------" }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-114", "text": "**CONCLUSION AND DISCUSSION**" }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-115", "text": "We prepared a cross script Hindi English NER annotated corpora automatically from Wikipedia for mixed lingual NER tasks." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-116", "text": "We exploit the wikilinks of Indian context Wikipedia pages to extract the entities." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-117", "text": "The manually annotated corpora is examined against POS tags and Wordtypes in order to remove non-named entities." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-118", "text": "We retrieve fair number of named entities which is largely occupied by persons." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-119", "text": "We evaluated the corpus against itself across several machine learning algorithms and obtained promising results." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-120", "text": "There is much scope for improvement of the corpus, firstly, more number of attributes of wikilinks can be explored to formulate a strong confidence score." }, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-121", "text": "Secondly, the corpus can be distributed according to the domain to which each named entity belongs." 
}, { "sent_id": "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-122", "text": "Finally, the cross corpus evaluation of Hi-En-WP corpus can be carried out against CoNLL-2003, MUC-7 and other named entity corpus in order to perform a comparative analysis." } ], "y": { "@BACK@": { "gold_contexts": [ [ "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-37" ], [ "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-39" ] ], "cite_sentences": [ "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-37", "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-39" ] }, "@USE@": { "gold_contexts": [ [ "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-37" ] ], "cite_sentences": [ "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-37" ] }, "@SIM@": { "gold_contexts": [ [ "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-66" ], [ "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-106" ] ], "cite_sentences": [ "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-66", "6fd0c2fbbe0c7fb669f2618f4d01f7-C001-106" ] } } }, "ABC_782517ae7688cf18b4bca37a8087dd_37": { "x": [ { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-2", "text": "This paper explores the problem of ranking short social media posts with respect to user queries using neural networks." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-3", "text": "Instead of starting with a complex architecture, we proceed from the bottom up and examine the effectiveness of a simple, word-level Siamese architecture augmented with attention-based mechanisms for capturing semantic \"soft\" matches between query and post terms." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-4", "text": "Extensive experiments on datasets from the TREC Microblog Tracks show that our simple models not only demonstrate better effectiveness than existing approaches that are far more complex or exploit a more diverse set of relevance signals, but also achieve 4\u00d7 speedup in model training and inference." 
}, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-5", "text": "----------------------------------" }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-7", "text": "Despite a large body of work on neural ranking models for \"traditional\" ad hoc retrieval over web pages and newswire documents (Huang et al., 2013; Shen et al., 2014; Pang et al., 2016; Xiong et al., 2017; Mitra et al., 2017; Pang et al., 2017; Dai et al., 2018; McDonald et al., 2018) , there has been surprisingly little work on applying neural networks to searching short social media posts such as tweets on Twitter." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-8", "text": "Rao et al. (2018) identified short document length, informality of language, and heterogeneous relevance signals as main challenges in relevance modeling, and proposed a model specifically designed to handle these characteristics." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-9", "text": "Evaluation on a number of datasets from the TREC Microblog Tracks demonstrates state-of-the-art effectiveness as well as the necessity of different model components to capture a multitude of relevance signals." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-10", "text": "In this paper, we also examine the problem of modeling relevance for ranking short social media posts, but from a complementary perspective." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-11", "text": "As Weissenborn et al. (2017) argues, most Figure 1: Our model architecture: a general sentence encoder is applied on query and post embeddings to generate g q and g p ; an attention encoder is applied on post embeddings to generate variable-length queryaware features h i ." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-12", "text": "These features are further aggregated to yield v, which feeds into the final prediction." 
}, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-13", "text": "systems are built in a top-down process: authors proposing a complex architecture and validating design decisions with ablation experiments." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-14", "text": "However, such experiments often lack comparisons to strong baselines, which raises the question as to whether model complexity is empirically justified." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-15", "text": "As an alternative, they advocate a bottom-up approach where architectural complexity is gradually increased." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-16", "text": "We adopt exactly such an approach, focused exclusively on word-level modeling." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-17", "text": "As shown in Figure 1 , we examine variants of a simple, generic architecture that has emerged as \"best practice\" of the NLP community for tackling modeling problems on two input sequences: a Siamese CNN architecture for learning representations over both inputs (a query and a social media post in our case), followed by fully-connected layers that produces a final relevance prediction (Severyn and Moschitti, 2015; He et al., 2015; Rao et al., 2016) , which we refer to as a General Sentence Encoder in Section 2.1." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-18", "text": "Further adopting best practices, we incorporate query-aware convolutions with an aggregation layer in the representation learning process." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-19", "text": "Recently, a number of researchers (Petrochuk and Zettlemoyer, 2018; Conneau et al., 2017) have started to reexamine simple baselines and found them to be highly competitive with the state of the art, especially with proper tuning." 
}, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-20", "text": "For example, the InferSent approach (Conneau et al., 2017) uses a simple BiLSTM with max pooling that achieves quite impressive accuracy on several classification benchmarks." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-21", "text": "Our contribution is exactly along this line by exploring simple yet strong baselines for ranking social media posts, to gain more insights into query-post matching using standard neural architectures." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-22", "text": "Experiments with data from the TREC Microblog Tracks show that our simple models not only demonstrate better effectiveness than existing approaches that are far more complex or exploit a more diverse set of relevance signals, but also achieve 4\u00d7 speedup in model training and inference." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-23", "text": "----------------------------------" }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-24", "text": "**MODEL**" }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-25", "text": "Our model comprises a representation learning layer with convolutional encoders (Section 2.1) and another simple aggregation layer(Section 2.2)." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-26", "text": "----------------------------------" }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-27", "text": "**REPRESENTATION LEARNING LAYER**" }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-28", "text": "General Sentence Encoder: The general sentence encoder uses a standard convolutional layer with randomly initialized kernels to learn semantic representations for text." 
}, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-29", "text": "More formally, given query q and post p as sentence inputs, we first convert them to embedding matrixes Q and P through an embedding lookup layer, where Q \u2208 R n\u00d7d and P \u2208 R m\u00d7d , d is the dimension of embeddings, and n and m are the number of tokens in q and p, respectively." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-30", "text": "Then we apply a standard convolution operation with kernel window size k over the embedding matrix Q and P. The convolution operation is parameterized by a weight term W \u2208 R F \u00d7k\u00d7d and a bias term b w \u2208 R F , where F is the number of convolutional kernels." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-31", "text": "This generates semantic representation O q \u2208 R n\u00d7F and O p \u2208 R m\u00d7F , on which max pooling and an MLP are applied to obtain query representation g q \u2208 R d and post representation g p \u2208 R d ." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-32", "text": "The weakness of the kernels in the general sentence encoder is that they do not have a priori knowledge from the query when they capture feature patterns from the post." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-33", "text": "Inspired by the attention mechanism (Bahdanau et al., 2014) , we propose two novel approaches to incorporate query Query-aware Attention Encoder (QAtt): In QAtt, for each query token, we construct a tokenspecific convolutional kernel to inject the query information." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-34", "text": "Unlike methods that apply attention mechanisms after the sentence representations are generated (Bahdanau et al., 2014; Seo et al., 2016) , our approach aims to model the representation learning process jointly with attention mechanism." 
}, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-35", "text": "Formally, for each query token t q , the QAtt kernel W tq QAtt is composed as follows:" }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-36", "text": "where U \u2208 R F \u00d7k\u00d7d is trainable parameter, Q tq is the embedding of token t q with size R d ." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-37", "text": "The element-wise product \u2297 is applied between the token embedding Q tq and the last dimension of kernel weight U. Intuitively, when the QAtt tokenspecific kernel is applied, it moves through the post embeddings P as a sliding window and automatically learns soft-matchings to each query token to generate query-aware post representations." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-38", "text": "On the top of the QAtt kernel, we apply maxpooling and an MLP to produce a set of post representations {h i }, with each h i standing for the representation learned from query token t q i ." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-39", "text": "Position-aware Attention Encoder (PAtt): In the QAtt encoder, the token-specific kernel learns soft-matchings to the query." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-40", "text": "However, it still ignores position information when encoding the post semantics, which has been shown to be effective on sequence modeling (Gehring et al., 2017) ." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-41", "text": "Therefore, we propose an alternative attention encoder that captures the positional information through interactions between query embeddings and post embeddings." 
}, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-42", "text": "Given a query token t q and the j-th position in post p, we compute the interaction scores by taking the cosine similarity between the word embeddings of token t q and post tokens t p j:j+k\u22121 from position j to j + k \u2212 1:" }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-43", "text": "where S j \u2208 R k\u00d71 . Then we populate the similarity vector S j to a matrix form as below:" }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-44", "text": "where 1 \u2208 R 1\u00d7d with each element set to 1." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-45", "text": "Finally, the PAtt convolution kernel for query token t q at j-th position is constructed as below:" }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-46", "text": "where V \u2208 R F \u00d7k\u00d7d is system trainable parameter." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-47", "text": "The element-wise product \u2297 is applied between the attention weights\u015c j and the last two dimensions of kernel weight V (see Figure 2 )." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-48", "text": "This operation can be thought as adding a soft attention term (with value range in [\u22121, 1]) to regularize the learning of F convolutional filters." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-49", "text": "Same as QAtt, the PAtt encoder with max-pooling and an MLP generates a set of post representations {h i }, with each h i standing for the representation learned from query token t q i ." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-50", "text": "It's worth noting that both the QAtt and PAtt encoder have no extra parameters over a general sentence encoder." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-51", "text": "However, incorporating the query-aware and position-aware information enables more effective representation learning, as our experiments show later." 
}, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-52", "text": "The QAtt and PAtt encoder can also be used as plug-in modules in standard convolutional architectures to enhance the sequence learning ability." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-53", "text": "----------------------------------" }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-54", "text": "**AGGREGATION LAYER**" }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-55", "text": "After the representation layer, a set of vectors {g q , g p , {h i }} is obtained." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-56", "text": "Because our model yields different number of h i regarding queries of different lengths, a further aggregation step is needed to output a global feature v. For simplicity, we directly average all vectors v = 1 Nq h i as the aggregated feature, where N q is the length of the query." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-57", "text": "----------------------------------" }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-58", "text": "**TRAINING**" }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-59", "text": "To obtain the final score, the feature vectors g q , g p and v are concatenated and fed into an MLP with ReLU activate function for dimension reduction and obtain o, followed by batch normalization and fully-connected layer and softmax to output the final prediction." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-60", "text": "The model is trained endto-end with Stochastic Gradient Decent optimizer, and negative log-likelihood loss function is used." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-61", "text": "3 Experiment Experimental Setup." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-62", "text": "Our models are evaluated on four tweets test collections from the TREC 2011-2014 Microblog (MB) Tracks (Ounis et al., 2011; Soboroff et al., 2012; Lin and Efron, 2013; Lin et al., 2014) ." 
}, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-63", "text": "Each dataset contains around 50 queries and the more detailed statistics are shown in Table 1 ." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-64", "text": "Following Rao et al. (2018) , we evaluate our models in a reranking task, where the inputs are up to the top 1000 tweets retrieved from the classical query likelihood (QL) language model (Ponte and Croft, 1998) ." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-65", "text": "We run four-fold cross-validation test split by year (i.e., train on three year's data, test on one year's data), and we randomly sample 10 queries from each year in the training sets (in total 30 queries) as our validation set." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-66", "text": "The mean average precision (MAP) and precision at top 30 (P30) are adopted as our evaluation metrics." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-67", "text": "We also conducted Fisher's two-sided, paired randomization test (Smucker et al., 2007) to test for statistical significance at p < 0.05." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-68", "text": "We randomly tried ten different seeds with the same hyperparameters and obtain an average score over each query-post pair for final ranking, to eliminate the effects of random parameter initialization (Crane, 2018) ." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-69", "text": "Our code is released for further reproduction." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-70", "text": "The best model hyper-parameters are shown in Table 4 in Appendix section." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-71", "text": "Baselines: QL is a competitive language modeling baseline." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-72", "text": "RM3 (Lavrenko and Croft, 2001 ) is an interpolation model combining the QL score with a relevance model using pseudo-relevance feedback." 
}, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-73", "text": "MP-HCNN (Rao et al., 2018) is the first neural model that captures the characteristics of social media domain." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-74", "text": "Their method improves current neural IR methods, e.g., K-NRM (Xiong et al., 2017) , DUET (Mitra et al., 2017) , by a signifi- 4 BiCNN+PAtt+QL 0.4728 1-3 5-7 0.4293 1-3 5-7 0.4147 1,2 5,6 0.2621 1-3 5-7 0.5367 1-3 5,6 0.2990 1-3 5 0.6806 1,2 5-8 0.4563 1-3" }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-75", "text": "----------------------------------" }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-76", "text": "**5,7**" }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-77", "text": "Existing Models 5 QL 0.4000 1 0.3576 1 0.3311 1 0.2091 1 0.4450 1 0.2532 1 0.6182 1 0.3924 1 6 RM3 0.4211 1 0.3824 1 0.3452 1 0.2342 1 0.4733 1 0.2766 1 0.6339 1 0.4480 1 7 MP-HCNN(+URL) 0.4306 1 0.3940 1,2 0.3757 1,5 0.2313 1,5 0.5211 1,5 0.2856 1,5 0.6279 1 0.4178 1 8 MP-HCNN(+URL)+QL 0.4435 1-2 5,6 0.4040 1,2 5,6 0.3915 1,5 6 0.2482 1,2 5 0.5250 1,5 6 0.2937 1,2 5 0.6455 1 0.4403 1,5 Table 2 : Results of non-neural and neural models on the TREC Microblog Tracks datasets." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-78", "text": "Results from 5 -8 are adopted from Rao et al. (2018) ." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-79", "text": "Models denoted with (+URL) represents utilizing the URL information." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-80", "text": "Models denoted with +QL are interpolated with QL baseline." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-81", "text": "Bi-CNN denotes general sentence encoder architecture." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-82", "text": "Both superscripts and subscripts indicate the row indexes for which a metric difference is statistically significant at p < 0.05." 
}, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-83", "text": "cant margin." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-84", "text": "To the best of our knowledge, Rao et al. (2018) is the best neural model to date, and there are no neural models from TREC evaluations for further comparison." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-85", "text": "We also compared to MP-HCNN+QL, which is a linear interpolation to combine the raw MP-HCNN and QL scores." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-86", "text": "Table 2 shows our experiment results of all settings and the results of existing models." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-87", "text": "Model 1 is the effectiveness of BiCNN model with kernel window size 2." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-88", "text": "1 Comparing Models 1 and 2, we observe that the models with query-aware kernels significantly improve the BiCNN baselines, achieving competitive effectiveness as QL baseline, demonstrating the effectiveness of the queryaware encoder." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-89", "text": "Further capturing position information with the position-aware encoder, as shown with Models 3, we obtain competitive effectiveness against MP-HCNN, even interpolated with QL." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-90", "text": "Noted that the MP-HCNN leverages extra URL information, external term feature such as tf-idf and character-level modeling." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-91", "text": "With simple interpolation (Model 4), we can obtain significant effectiveness gains against all methods." 
}, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-92", "text": "----------------------------------" }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-93", "text": "**RESULTS**" }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-94", "text": "In the view of representation learning, PAtt model produces better hidden states in most of the case compared with QAtt and BiCNN model." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-95", "text": "We project the hidden states o into low-dimension vector with t-SNE (Laurens van der and Geoffrey, 2008) for visualization." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-96", "text": "Figure 3 is an example of query id 158 hush puppies meal." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-97", "text": "We can ob- Another interesting question is that, when does neural model fail, compared against QL baseline." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-98", "text": "Figure 4 shows query semantically emphasizes the snub." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-99", "text": "Furthermore, after observing relevant posts, Oscars information is often expressed implicitly." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-100", "text": "For example argo wins retributions for the snub of ben affleck." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-101", "text": "However, QL baseline can give more weights on snub because of the inverse-document-frequency feature." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-102", "text": "In terms of training and inference speed, under the same setting, 2 we measure the average training time and inference time per query-post pair with GPU setting." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-103", "text": "Our PAtt-ave model only has 2-4 times on training and inference time, which is about 1 4 compared to MP-HCNN." 
}, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-104", "text": "----------------------------------" }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-105", "text": "**CONCLUSION**" }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-106", "text": "In this work, we propose two novel attentionbased convolutional encoders to incorporate query-aware and position-aware information in learning document representations, without introducing any new parameter over a standard convolutional operation." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-107", "text": "The approaches are kept simple but demonstrate competitive effectiveness and 4\u00d7 speedup in model training and inference against state-of-the-art models." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-108", "text": "Table 4 : Hyper-Parameters for our models." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-109", "text": "GloVe (Pennington et al., 2014) Embedding is used and fine-tuned during training." }, { "sent_id": "782517ae7688cf18b4bca37a8087dd-C001-110", "text": "Unknown word is initialized from uniform distribution [\u2212k, k]." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "782517ae7688cf18b4bca37a8087dd-C001-8" ], [ "782517ae7688cf18b4bca37a8087dd-C001-84" ] ], "cite_sentences": [ "782517ae7688cf18b4bca37a8087dd-C001-8", "782517ae7688cf18b4bca37a8087dd-C001-84" ] }, "@SIM@": { "gold_contexts": [ [ "782517ae7688cf18b4bca37a8087dd-C001-10", "782517ae7688cf18b4bca37a8087dd-C001-8" ] ], "cite_sentences": [ "782517ae7688cf18b4bca37a8087dd-C001-8" ] }, "@USE@": { "gold_contexts": [ [ "782517ae7688cf18b4bca37a8087dd-C001-64" ], [ "782517ae7688cf18b4bca37a8087dd-C001-73" ], [ "782517ae7688cf18b4bca37a8087dd-C001-78" ] ], "cite_sentences": [ "782517ae7688cf18b4bca37a8087dd-C001-64", "782517ae7688cf18b4bca37a8087dd-C001-73", "782517ae7688cf18b4bca37a8087dd-C001-78" ] } } }, "ABC_28805bfa8f8b847110664d7e05b1b3_37": { "x": [ { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-2", "text": "We learn graph-based similarity measures for the task of extracting word synonyms from a corpus of parsed text." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-3", "text": "A constrained graph walk variant that has been successfully applied in the past in similar settings is shown to outperform a state-of-the-art syntactic vectorbased approach on this task." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-4", "text": "Further, we show that learning specialized similarity measures for different word types is advantageous." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-5", "text": "----------------------------------" }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-7", "text": "Many applications of natural language processing require measures of lexico-semantic similarity." 
}, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-8", "text": "Examples include summarization (Barzilay and Elhadad, 1999) , question answering (Lin and Pantel, 2001) , and textual entailment (Mirkin et al., 2006) ." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-9", "text": "Graph-based methods have been successfully applied to evaluate word similarity using available ontologies, where the underlying graph included word senses and semantic relationships between them (Hughes and Ramage, 2007) ." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-10", "text": "Another line of research aims at eliciting semantic similarity measures directly from freely available corpora, based on the distributional similarity assumption (Harria, 1968) ." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-11", "text": "In this domain, vector-space methods give state-ofthe-art performance (Pad\u00f3 and Lapata, 2007) ." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-12", "text": "Previously, a graph based framework has been proposed that models word semantic similarity from parsed text (Minkov and Cohen, 2008) ." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-13", "text": "The underlying graph in this case describes a text corpus as connected dependency structures, according to the schema shown in Figure 1 ." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-14", "text": "The toy graph shown includes the dependency analysis of two sentences: \"a major environmental disaster is under way\", and \"combat the environmental catastrophe\"." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-15", "text": "In the graph, word mentions (in circles) and word types (in squares) are both represented as nodes." 
}, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-16", "text": "Each word mention is linked to its corresponding word type; for example, the nodes \"environmental 3 \" and \"environmental 204 \" represent distinct word mentions and both nodes are linked to the word type \"environmental\"." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-17", "text": "1 For every edge in the graph, there exists an edge in the opposite direction (not shown in the figure)." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-18", "text": "In this graph, the terms disaster and catastrophe are related due to the connecting path disaster \u2212\u2192 disaster" }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-19", "text": "Given a query, which consists of a word of interest (e.g., 'disaster'), various graph-based similarity metrics can be used to assess inter-node relatedness, so that a list of nodes ranked by their similarity to the query is returned to the user." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-20", "text": "An advantage of graph-based similarity approaches is that they produce similarity scores that reflect structural infor-mation in the graph (Liben-Nowell and Kleinberg, 2003) ." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-21", "text": "Semantically similar terms are expected to share connectivity patterns with the query term in the graph, and thus appear at the top of the list." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-22", "text": "Notably, different edge types, as well as the paths traversed, may have varying importance for different types of similarity sought." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-23", "text": "For example, in the parsed text domain, noun similarity and verb similarity are associated with different syntactic phenomena (Resnik and Diab, 2000) ." 
}, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-24", "text": "To this end, we consider a path constrained graph walk (PCW) algorithm, which allows one to learn meaningful paths given a small number of labeled examples and incorporates this information in assessing node relatedness in the graph (Minkov and Cohen, 2008) ." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-25", "text": "PCW have been successfully applied to the extraction of named entity coordinate terms, including city and person names, from graphs representing newswire text (Minkov and Cohen, 2008) , where the specialized measures learned outperformed the state-ofthe-art dependency vectors method (Pad\u00f3 and Lapata, 2007) for small-and medium-sized corpora." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-26", "text": "In this work, we apply the path constrained graph walk method to the task of eliciting general word relatedness from parsed text, conducting a set of experiments on the task of synonym extraction." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-27", "text": "While the tasks of named entity extraction and synonym extraction from text have been treated separately in the literature, this work shows that both tasks can be addressed using the same general framework." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-28", "text": "Our results are encouraging: the PCW model yields superior results to the dependency vectors approach." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-29", "text": "Further, we show that learning specialized similarity measures per word type (nouns, verbs and adjectives) is preferable to applying a uniform model for all word types." 
}, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-30", "text": "----------------------------------" }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-31", "text": "**PATH CONSTRAINED GRAPH WALKS**" }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-32", "text": "PCW is a graph walk variant proposed recently that is intended to bias the random walk process to follow meaningful edge sequences (paths) (Minkov and Cohen, 2008) ." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-33", "text": "In this approach, rather than assume fixed (possibly, uniform) edge weight parameters \u0398 for the various edge types in the graph, the probability of following an edge of type \u2113 from node x is evaluated dynamically, based on the history of the walk up to x." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-34", "text": "The PCW algorithm includes two components." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-35", "text": "First, it should provide estimates of edge weights conditioned on the history of a walk, based on training examples." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-36", "text": "Second, the random walk algorithm has to be modified to maintain historical information about the walk compactly." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-37", "text": "In learning, a dataset of N labelled example queries is provided." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-38", "text": "The labeling schema is binary, where a set of nodes considered as relevant answers to an example query e i , denoted as R i , is specified, and graph nodes that are not explicitly included in R i are assumed irrelevant to e i ." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-39", "text": "As a starting point, an initial graph walk is applied to generate a ranked list of graph nodes l i for every example query e i ." 
}, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-40", "text": "A path-tree T is then constructed that includes all of the acyclic paths up to length k leading to the top M + correct and M \u2212 incorrect nodes in each of the retrieved lists l i ." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-41", "text": "Every path p is associated with a maximum likelihood probability estimate P r(p) of reaching a correct node based on the number of times the path was observed in the set of correct and incorrect target nodes." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-42", "text": "These path probabilities are propagated backwards in the path tree to reflect the probability of reaching a correct node, given an outgoing edge type and partial history of the walk." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-43", "text": "Given a new query, a constrained graph walk variant is applied that adheres both to the topology of the graph G and the path tree T ." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-44", "text": "In addition to tracking the graph node that the random walker is at, PCW maintains pointers to the nodes of the path tree that represent the walk histories in reaching that graph node." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-45", "text": "In order to reduce working memory requirements, one may prune paths that are associated with low probability of reaching a correct node." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-46", "text": "This often leads to gains in accuracy." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-47", "text": "----------------------------------" }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-48", "text": "**SYNONYM EXTRACTION**" }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-49", "text": "We learn general word semantic similarity measures from a graph that represents a corpus of parsed text (Figure 1 )." 
}, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-50", "text": "In particular, we will focus on evaluating word synonymy, learning specialized models for different word types." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-51", "text": "In the experiments, we mainly compare PCW against the dependency vectors model (DV), due to Pad\u00f3 and Lapata (2007) ." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-52", "text": "In the latter approach, a word w i is represented as a vector of weighted scores, which reflect cooccurrence frequency with words w j , as well as properties of the dependency paths that connect the word w i to word w j ." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-53", "text": "In particular, higher weight is assigned to connecting paths that include grammatically salient relations, based on the obliqueness weighting hierarchy (Keenan and Comrie, 1977) ." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-54", "text": "For example, co-occurrence of word w i with word w j over a path that includes the salient subject relation receives higher credit than co-occurrences over a non-salient relation such as preposition." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-55", "text": "In addition, Pad\u00f3 and Lapata suggest to consider only a subset of the paths observed that are linguistically meaningful." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-56", "text": "While the two methods incorporate similar intuitions, PCW learns meaningful paths that connect the query and target terms from examples, whereas DV involves manual choices that are taskindependent." 
}, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-57", "text": "----------------------------------" }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-58", "text": "**DATASET**" }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-59", "text": "To allow effective learning, we constructed a dataset that represents strict word synonymy relations for multiple word types." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-60", "text": "The dataset consists of 68 examples, where each example query consists of a single term of interest, with its synonym defined as a single correct answer." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-61", "text": "The dataset includes noun synonym pairs (22 examples), adjectives (24) and verbs (22)." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-62", "text": "Example synonym pairs are shown in Table 1 ." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-63", "text": "A corpus of parsed text was constructed using the British National Corpus (Burnard, 1995) ." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-64", "text": "The full BNC corpus is a 100-million word collection of samples of written and spoken contemporary British English texts." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-65", "text": "We extracted relevant sentences, which contained the synonymous words, from the BNC corpus." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-66", "text": "(The number of extracted sentences was limited to 2,000 per word.) For infrequent words, we extracted additional example sentences from Associated Press (AP) articles included in the AQUAINT corpus (Bilotti et al., 2007) ." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-67", "text": "(Sentence count was complemented to 300 per word, where applicable.) The constructed corpus, BNC+AP, includes 1.3 million words overall." 
}, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-68", "text": "This corpus was parsed using the Stanford dependency parser (de Marneffe et al., 2006" }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-69", "text": "----------------------------------" }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-70", "text": "**EXPERIMENTS**" }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-71", "text": "Given a query like {term=\"movie\"}, we would like to get synonymous words, such as film, to appear at the top of the retrieved list." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-72", "text": "In our experimental setting, we assume that the word type of the query term is known." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-73", "text": "Rather than rank all words (terms) in response to a query, we use available (noisy) part of speech information to narrow down the search to the terms of the same type as the query term, e.g. for the query \"film\" we retrieve nodes of type \u03c4 =noun." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-74", "text": "We applied the PCW method to learn separate models for noun, verb and adjective queries." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-75", "text": "The path trees were constructed using the paths leading to the node known to be a correct answer, as well as to the otherwise irrelevant top-ranked 10 terms." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-76", "text": "We required the paths considered by PCW to include exactly 6 segments (edges)." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-77", "text": "Such paths represent distributional similarity phenomena, allowing a direct comparison against the DV method." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-78", "text": "In conducting the constrained walk, we applied a threshold of 0.5 to truncate paths associated with lower probability of reaching a relevant response, following on previous work (Minkov and Cohen, 2008) ." 
}, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-79", "text": "We implemented DV using code made available by its authors, 3 where we converted the syntactic patterns specified to Stanford dependency parser conventions." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-80", "text": "The parameters of the DV method were set to medium context and oblique edge weighting scheme, which were found to perform best (Pad\u00f3 and Lapata, 2007) ." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-81", "text": "In applying a vector-space based method, a similarity score needs to be computed between every candidate from the corpus and the query term to construct a ranked list." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-82", "text": "In practice, we used the union of the top 300 words retrieved by PCW as candidate terms for DV." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-83", "text": "We evaluate the following variants of DV: hav- Table 2 : 5-fold cross validation results: MAP ing inter-word similarity computed using Lin's measure (Lin, 1998 ) (DV-Lin), or using cosine similarity (DV-Cos)." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-84", "text": "In addition, we consider a non-syntactic variant, where a word's vector consists of its cooccurrence counts with other terms (using a window of two words); that is, ignoring the dependency structure (CO-Lin)." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-85", "text": "Finally, in addition to the PCW model described above (PCW), we evaluate the PCW approach in settings where random, noisy, edges have been eliminated from the underlying graph." 
}, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-86", "text": "Specifically, dependency links in the graph may be associated with pointwise mutual information (PMI) scores of the linked word mention pairs (Manning and Sch\u00fctze, 1999) ; edges with low scores are assumed to represent word co-occurrences of low significance, and so are removed." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-87", "text": "We empirically set the PMI score threshold to 2.0, using cross validation (PCW-P)." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-88", "text": "4 In addition to the specialized PCW models, we also learned a uniform model over all word types in these settings; that is, this model is trained using the union of all training examples, being learned and tested using a mixture of queries of all types (PCW-P-U)." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-89", "text": "Table 2 gives the results of 5-fold cross-validation experiments in terms of mean average precision (MAP)." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-90", "text": "Since there is a single correct answer per query, these results correspond to the mean reciprocal rank (MRR)." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-91", "text": "5 As shown, the dependency vectors model applied using Lin similarity (DV-Lin) performs best among the vector-based models." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-92", "text": "The improvement achieved due to edge weighting com-pared with the co-occurrence model (CO-Lin) is large, demonstrating that syntactic structure is very informative for modeling word semantics (Pad\u00f3 and Lapata, 2007) ." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-93", "text": "Interestingly, the impact of applying the Lin similarity measure versus cosine (DV-Cos) is even more profound." 
}, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-94", "text": "Unlike the cosine measure, Lin's metric was designed for the task of evaluating word similarity from corpus statistics; it is based on the mutual information measure, and allows one to downweight random word co-occurrences." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-95", "text": "----------------------------------" }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-96", "text": "**RESULTS**" }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-97", "text": "Among the PCW variants, the specialized PCW models achieve performance that is comparable to the state-of-the-art DV measure (DV-Lin)." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-98", "text": "Further, removing noisy word co-occurrences from the graph (PCW-P) leads to further improvements, yielding the best results over all word types." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-99", "text": "Finally, the graph walk model that was trained uniformly for all word types (PCW-P-U) outperforms DV-Lin, showing the advantage of learning meaningful paths." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-100", "text": "Notably, the uniformly trained model is inferior to PCW trained separately per word type in the same settings (PCW-P)." }, { "sent_id": "28805bfa8f8b847110664d7e05b1b3-C001-101", "text": "This suggests that learning specialized word similarity metrics is beneficial." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "28805bfa8f8b847110664d7e05b1b3-C001-12" ], [ "28805bfa8f8b847110664d7e05b1b3-C001-25" ], [ "28805bfa8f8b847110664d7e05b1b3-C001-32" ] ], "cite_sentences": [ "28805bfa8f8b847110664d7e05b1b3-C001-12", "28805bfa8f8b847110664d7e05b1b3-C001-25", "28805bfa8f8b847110664d7e05b1b3-C001-32" ] }, "@USE@": { "gold_contexts": [ [ "28805bfa8f8b847110664d7e05b1b3-C001-24" ], [ "28805bfa8f8b847110664d7e05b1b3-C001-25", "28805bfa8f8b847110664d7e05b1b3-C001-26" ], [ "28805bfa8f8b847110664d7e05b1b3-C001-78" ] ], "cite_sentences": [ "28805bfa8f8b847110664d7e05b1b3-C001-24", "28805bfa8f8b847110664d7e05b1b3-C001-25", "28805bfa8f8b847110664d7e05b1b3-C001-78" ] } } }, "ABC_f326a3e2a5e349ce84b0a759f8e0b2_37": { "x": [ { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-26", "text": "While some systems generate sentence fragments in certain situations (e.g., not this one when the user is moving towards the wrong button), instructions are generally produced as complete sentences and replaced with a new full sentence when the context changes (a strategy which would not work for spoken instructions)." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-27", "text": "Nevertheless, timing issues are a cause for errors that is cited by several teams who developed systems for the GIVE challenge, and generating appropriate feedback has been an important concern for almost all teams (see the system descriptions in (Belz et al., 2011) )." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-28", "text": "Unfortunately, no systematic error analysis has been done for the interactions from the GIVE challenges." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-29", "text": "Anecdotally, however, not reacting to signs of confusion in the user's actions at all or reacting too late seem to be common causes for problems." 
}, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-30", "text": "Furthermore, we have found that the strategy of replacing instructions with complete sentences to account for a change in context can lead to confusion because it seems unclear to the user whether this new instruction is a correction or an elaboration." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-31", "text": "In this paper we report on a study of the communicative behavior of human dyads in the GIVE environment where instead of written text instruction givers use unrestricted spoken language to direct instruction followers through the world." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-32", "text": "We find that often multiple installments are used to identify a referent and that the instruction givers are highly responsive to context changes and the instruction followers' actions." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-33", "text": "Our goal is to inform the development of a generation system that generates object descriptions in installments while taking into account the actions of its interaction partner." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-34", "text": "----------------------------------" }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-35", "text": "**A CORPUS OF SPOKEN INSTRUCTIONS IN A VIRTUAL ENVIRONMENT**" }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-36", "text": "Data collection method The setup of this study was similar to the one used to collect the GIVE-2 corpus of typed instructions (Gargett et al., 2010) ." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-2", "text": "Commonly, the result of referring expression generation algorithms is a single noun phrase." 
}, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-3", "text": "In interactive settings with a shared workspace, however, human dialog partners often split referring expressions into installments that adapt to changes in the context and to actions of their partners." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-4", "text": "We present a corpus of human-human interactions in the GIVE-2 setting in which instructions are spoken." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-5", "text": "A first study of object descriptions in this corpus shows that references in installments are quite common in this scenario and suggests that contextual factors partly determine their use." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-6", "text": "We discuss what new challenges this creates for NLG systems." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-7", "text": "----------------------------------" }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-9", "text": "Referring expression generation is classically considered to be the problem of producing a single noun phrase that uniquely identifies a referent (Krahmer and van Deemter, 2012) ." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-10", "text": "This approach is well suited for non-interactive, static contexts, but recently, there has been increased interest in generation for situated dialog (Stoia, 2007; ." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-11", "text": "Most human language use takes place in dynamic situations, and psycholinguistic research on humanhuman dialog has proposed that the production of referring expressions should rather be seen as a process that not only depends on the context and the choices of the speaker, but also on the reactions of the addressee." 
}, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-12", "text": "Thus the result is often not a single noun phrase but a sequence of installments (Clark and Wilkes-Gibbs, 1986) , consisting of multiple utterances which may be interleaved with feedback from the addressee." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-13", "text": "In a setting where the dialog partners have access to a common workspace, they, furthermore, carefully monitor each other's non-linugistic actions, which often replace verbal feedback (Clark and Krych, 2004; Gergle et al., 2004) ." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-14", "text": "The following example from our data illustrates this." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-15", "text": "A is instructing B to press a particular button." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-16", "text": "While computational models of this behavior are still scarce, some first steps have been taken." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-17", "text": "Stoia (2007) studies instruction giving in a virtual environment and finds that references to target objects are often not made when they first become visible." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-18", "text": "Instead interaction partners are navigated to a spot from where an easier description is possible. develop a planning-based approach of this behavior. But once their system decides to generate a referring expression, it is delivered in one unit." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-19", "text": "Thompson (2009) , on the other hand, proposes a game-theoretic model to predict how noun phrases are split up into installments." 
}, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-20", "text": "While Thompson did not specify how the necessary parameters to calculate the utility of an utterance are derived from the context and did not implement the model, it provides a good theoretical basis for an implementation." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-21", "text": "The GIVE Challenge is a recent shared task on situated generation ." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-22", "text": "In the GIVE scenario a human user goes on a treasure hunt in a virtual environment." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-23", "text": "He or she has to press a series of buttons that unlock doors and open a safe." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-24", "text": "The challenge for the NLG systems is to generate instructions in real-time to guide the user to the goal." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-25", "text": "The instructions are presented to the user as written text, which means that there is less opportunity for interleaving language and actions than with spoken instructions." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-37", "text": "Instruction followers (IFs) used the standard GIVE-2 client to interact with the virtual environment." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-38", "text": "Instruction givers (IGs) could observe the followers' position and actions in the world using an interactive map, and they were also provided with the same 3D view into the scene that the IFs saw on their screen." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-39", "text": "Differently from the normal GIVE-2 scenario, the IGs did not type their instructions but gave spoken instructions, which were audio recorded as well as streamed to the IFs over the network." 
}, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-40", "text": "A log of the IFs' position, orientation and actions that was updated every 200ms was recorded in a database." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-41", "text": "Participants were recruited in pairs on Bielefeld University's campus and received a compensation of six euros each." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-42", "text": "They were randomly assigned to the roles of IG and IF and were seated and instructed separately." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-43", "text": "To become familiar with the task, they switched roles in a first, shorter training world." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-44", "text": "These interactions were later used to devise and test the annotation schemes." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-45", "text": "They then played two different worlds in their assigned roles." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-46", "text": "After the first round, they received a questionnaire assessing the quality of the interaction; after the second round, they completed the Santa Barbara sense of direction test (Hegarty et al., 2006) and answered some questions about themselves." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-47", "text": "Annotations The recorded instructions of the IGs were transcribed and segmented into utterances (by identifying speech pauses longer than 300ms) using Praat (Boersma and Weenink, 2011) ." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-48", "text": "We then created videos showing the IGs' map view as well as the IFs' scene view and aligned the audio and transcriptions with them." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-49", "text": "The data was further annotated by the first two authors using ELAN (Wittenburg et al., 2006) ." 
}, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-50", "text": "Most importantly for this paper, we classified utterances into the following types:" }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-51", "text": "(i) move (MV) -instruction to turn or to move (ii) manipulate (MNP) -instruction to manipulate an object (e.g., press a button) (iii) reference (REF) -utterance referring to an object (iv) stop -instruction to stop moving (v) warning -telling the user to not do something (vi) acknowledgment (ACK) -affirmative feedback (vii) communication management (CM) -indicating that the IG is planning (e.g., uhmm, just a moment, sooo etc.) (viii) negative acknowledgment -indicating a mistake on the player's part (ix) other -anything else A few utterances which contained both move and press instructions were further split, but in general we picked the label that fit best (using the above list as a precedence order to make a decision if two labels fit equally well)." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-52", "text": "The inter-annotator agreement for utterance types was \u03ba = 0.89 (Cohen's kappa), which is considered to be very good." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-53", "text": "Since the categories were of quite different sizes (cf." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-54", "text": "We reviewed all cases with differing annotations and reached a consensus, which is the basis for all results presented in this paper." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-55", "text": "Furthermore, we collapsed the labels warning, negative acknowledgment and other which only occurred rarely." 
}, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-56", "text": "To support a later more in depth analysis, we also annotated what types of properties are used in object descriptions, the givenness status of information in instructions, and whether an utterance is giving positive or negative feedback on a user action (even if not explicitly labeled as (negative) acknowledgment)." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-57", "text": "Finally, information about the IF's movements and actions in the world as well as the visible context was automatically calculated from the GIVE log files and integrated into the annotation." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-58", "text": "Collected data We collected interactions between eight pairs." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-59", "text": "Due to failures of the network connection and some initial problems with the GIVE software, only four pairs were recorded completely, so that we currently have data from eight interactions with four different IGs." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-60", "text": "We are in the process of collecting additional data in order to achieve a corpus size that will allow for a more detailed statistical analysis." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-61", "text": "Furthermore, we are collecting data in English to be able to make comparisons with the existing corpus of written instructions in the GIVE world and to make the data more easily accessible to a wider audience." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-62", "text": "The corpus will be made freely available at http://purl.org/net/sgive-corpus." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-63", "text": "Participants were between 20 and 30 years old and all of them are native German speakers." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-64", "text": "Two of the IGs are male and two female; three of the IFs are female." 
}, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-65", "text": "The mean length of the interactions is 5.24 minutes (SD = 1.86), and the IGs on average use 325 words (SD = 91)." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-66", "text": "Table 1 gives an overview of the kinds of utterances used by the IGs." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-67", "text": "While the general picture is similar for all speakers, there are statistically significant differences between the frequencies with which different IGs use the utterance types (\u03c7 2 = 78.82, p \u2264 0.001)." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-68", "text": "We did not find a significant differences (in terms of the utterance types used) between the two worlds that we used or between the two rounds that each pair played." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-69", "text": "----------------------------------" }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-70", "text": "**HOW INSTRUCTION GIVERS DESCRIBE OBJECTS**" }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-71", "text": "We now examine how interaction partners establish what the next target button is." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-72", "text": "Overall, there are 76 utterance sequences in the data that identify a target button and lead to the IF pressing that button." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-73", "text": "We discuss a selection of seven representative examples." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-74", "text": "In (2) the IG generates a referring expression identifying the target and integrates it into an object manipulation instruction." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-75", "text": "In our data, 55% of the target buttons (42 out of 76) get identified in this way (which fits into the traditional view of referring expression generation)." 
}, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-76", "text": "In all other cases a sequence of at least two, and in 14% of the cases more than two, utterances is used." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-77", "text": "The transitional probabilities between utterance types shown in Table 2 suggest what some common patterns may be." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-78", "text": "For example, even though move instructions are so prevalent in our data, they are uncommon after reference or manipulate utterances." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-79", "text": "Instead, two thirds of the reference utterances are followed by object manipulation instruction, another reference or an acknowledgement." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-80", "text": "In the remaining cases, IFs press a button in response to the reference." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-81", "text": "In (3) and (4) a first reference utterance is followed by a separate object manipulation utterance." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-82", "text": "While in (3) the first reference uniquely identifies the target, in (4) the first utterance simply directs the player's attention to a group of buttons." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-83", "text": "The second utterance then picks out the target." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-84", "text": "Stoia (2007) observed that IGs use move instructions to focus the IF's attention on a particular area." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-85", "text": "This is also common in our data." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-86", "text": "For instance in (5), the IF is asked to turn to directly face the group of buttons containing the target." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-87", "text": "(5) also shows how IGs monitor their partners' actions and respond to them." 
}, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-88", "text": "The IF is moving towards the wrong button causing the IG to repeat part of the previous description." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-89", "text": "Similarly, in (6) the IG produces an elaboration when the IF stops moving towards the target, indicating her confusion." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-90", "text": "In (7) the IG inserts affirmative feedback when the IF reacts correctly to a portion of his utterance." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-91", "text": "As can be seen in Table 2 , reference utterances are relatively often followed by affirmative feedback." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-92", "text": "IGs can also take advantage of IF actions that are not in direct response to an utterance." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-93", "text": "This happens in (8)." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-94", "text": "The IF enters a new room and looks around." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-95", "text": "When she looks towards the target, the IG seizes the opportunity and produces affirmative feedback." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-96", "text": "----------------------------------" }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-97", "text": "**CONCLUSIONS AND FUTURE WORK**" }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-98", "text": "We have described a corpus of spoken instructions in the GIVE scenario which we are currently building and which we will make available once it is completed." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-99", "text": "This corpus differs from other corpora of taskoriented dialog (specifically, the MapTask corpus (Anderson et al., 1991) , the TRAINS corpus (Heeman and Allen, 1995), the Monroe corpus (Stent, 2000) ) in that the IG could observe the IF's actions in real-time." 
}, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-100", "text": "This led to interactions in which instructions are given in installments and linguistic and non-linguistic actions are interleaved." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-101", "text": "This poses interesting new questions for NLG systems, which we have illustrated by discussing the patterns of utterance sequences that IGs and IFs use in our corpus to agree on the objects that need to be manipulated." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-102", "text": "In line with results from psycholinguistics, we found that the information necessary to establish a reference is often expressed in multiple installments and that IGs carefully monitor how their partners react to their instructions and quickly respond by giving feedback, repeating information or elaborating on previous utterance when necessary." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-103", "text": "The NLG system thus needs to be able to decide when a complete identifying description can be given in one utterance and when a description in installments is more effective." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-104", "text": "Stoia (2007) as well as have addressed this question, but their approaches only make a choice between generating an instruction to move or a uniquely identifying referring expression." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-105", "text": "They do not consider cases in which another type of utterance, for instance, one that refers to a group of objects or gives an initial ambiguous description, is used to draw the attention of the IF to a particular area and they do not generate referring expressions in installments." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-106", "text": "The system, furthermore, needs to be able to interpret the IF's actions and decide when to insert an acknowledgment, elaboration or correction." 
}, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-107", "text": "It then has to decide how to formulate this feedback." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-108", "text": "The addressee, e.g., needs to be able to distinguish elaborations from corrections." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-109", "text": "If the feedback was inserted in the middle of a sentence, if finally has to decide whether this sentence should be completed and how the remainder may have to be adapted." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-110", "text": "Once we have finished the corpus collection, we plan to use it to study and address the questions discussed above." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-111", "text": "We are planning on building on the work by Stoia (2007) on using machine learning techniques to develop a model that takes into account various contextual factors and on the work by Thompson (2009) on generating references in installments." }, { "sent_id": "f326a3e2a5e349ce84b0a759f8e0b2-C001-112", "text": "The set-up under which the corpus was collected, furthermore, lends itself well to Wizard-of-Oz studies to test the effectiveness of different interactive strategies for describing objects." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "f326a3e2a5e349ce84b0a759f8e0b2-C001-10" ], [ "f326a3e2a5e349ce84b0a759f8e0b2-C001-17" ], [ "f326a3e2a5e349ce84b0a759f8e0b2-C001-84" ] ], "cite_sentences": [ "f326a3e2a5e349ce84b0a759f8e0b2-C001-10", "f326a3e2a5e349ce84b0a759f8e0b2-C001-17", "f326a3e2a5e349ce84b0a759f8e0b2-C001-84" ] }, "@SIM@": { "gold_contexts": [ [ "f326a3e2a5e349ce84b0a759f8e0b2-C001-84", "f326a3e2a5e349ce84b0a759f8e0b2-C001-85" ] ], "cite_sentences": [ "f326a3e2a5e349ce84b0a759f8e0b2-C001-84" ] }, "@DIF@": { "gold_contexts": [ [ "f326a3e2a5e349ce84b0a759f8e0b2-C001-104" ] ], "cite_sentences": [ "f326a3e2a5e349ce84b0a759f8e0b2-C001-104" ] }, "@FUT@": { "gold_contexts": [ [ "f326a3e2a5e349ce84b0a759f8e0b2-C001-111" ] ], "cite_sentences": [ "f326a3e2a5e349ce84b0a759f8e0b2-C001-111" ] }, "@EXT@": { "gold_contexts": [ [ "f326a3e2a5e349ce84b0a759f8e0b2-C001-111" ] ], "cite_sentences": [ "f326a3e2a5e349ce84b0a759f8e0b2-C001-111" ] } } }, "ABC_493f353942929c1a015d5e0acbf564_37": { "x": [ { "sent_id": "493f353942929c1a015d5e0acbf564-C001-60", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-61", "text": "----------------------------------" }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-62", "text": "**DATASETS**" }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-87", "text": "A similar pattern can be observed for the E2E dataset." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-2", "text": "Variational Autoencoder (VAE) is a powerful method for learning representations of highdimensional data." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-3", "text": "However, VAEs can suffer from an issue known as latent variable collapse (or KL loss vanishing), where the posterior collapses to the prior and the model will ignore the latent codes in generative tasks." 
}, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-4", "text": "Such an issue is particularly prevalent when employing VAE-RNN architectures for text modelling (Bowman et al., 2016) ." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-5", "text": "In this paper, we present a simple architecture called holistic regularisation VAE (HR-VAE), which can effectively avoid latent variable collapse." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-6", "text": "Compared to existing VAE-RNN architectures, we show that our model can achieve much more stable training process and can generate text with significantly better quality." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-7", "text": "----------------------------------" }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-9", "text": "Variational Autoencoder (VAE) (Kingma and Welling, 2013 ) is a powerful method for learning representations of high-dimensional data." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-10", "text": "However, recent attempts of applying VAEs to text modelling are still far less successful compared to its application to image and speech (Bachman, 2016; Fraccaro et al., 2016; Semeniuta et al., 2017) ." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-11", "text": "When applying VAEs for text modelling, recurrent neural networks (RNNs) 1 are commonly used as the architecture for both encoder and decoder (Bowman et al., 2016; Xu and Durrett, 2018; Dieng et al., 2019) ." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-12", "text": "While such a VAE-RNN based architecture allows encoding and generating sentences (in the decoding phase) with variablelength effectively, it is also vulnerable to an issue known as latent variable collapse (or KL loss vanishing) , where the posterior collapses to the prior and the model will ignore the latent codes in generative tasks." 
}, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-13", "text": "Various efforts have been made to alleviate the latent variable collapse issue." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-58", "text": "The weight between these two terms of our model is simply 1 : 1." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-14", "text": "Bowman et al. (2016) uses KL annealing, where a variable weight is added to the KL term in the cost function at training time." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-15", "text": "Yang et al. (2017) discovered that there is a trade-off between the contextual capacity of the decoder and effective use of encoding information, and developed a dilated CNN as decoder which can vary the amount of conditioning context." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-16", "text": "They also introduced a loss clipping strategy in order to make the model more robust." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-17", "text": "Xu and Durrett (2018) addressed the problem by replacing the standard normal distribution for the prior with the von Mises-Fisher (vMF) distribution." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-18", "text": "With vMF, the KL loss only depends on the concentration parameter which is fixed during training and testing, and hence results in a constant KL loss." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-19", "text": "In a more recent work, Dieng et al. (2019) avoided latent variable collapse by including skip connections in the generative model, where the skip connections enforce strong links between the latent variables and the likelihood function." 
}, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-20", "text": "Although the aforementioned works show effectiveness in addressing the latent variable collapse issue to some extent, they either require carefully engineering to balance the weight between the reconstruction loss and KL loss (Bowman et al., 2016; , or resort to designing more sophisticated model structures (Yang et al., 2017; Xu and Durrett, 2018; Dieng et al., 2019) ." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-21", "text": "In this paper, we present a simple architecture called holistic regularisation VAE (HR-VAE), which can effectively avoid latent variable collapse." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-22", "text": "In contrast to existing VAE-RNN models for text modelling which merely impose a standard normal distribution prior on the last hidden state of the RNN encoder, our HR-VAE model imposes regularisation for all hidden states of the RNN encoder." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-23", "text": "Another advantage of our model is that it is generic and can be applied to any existing VAE-RNN-based architectures." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-24", "text": "We evaluate our model against several strong baselines which apply VAE for text modelling (Bowman et al., 2016; Yang et al., 2017; Xu and Durrett, 2018) ." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-25", "text": "We conducted experiments based on two public benchmark datasets, namely, the Penn Treebank dataset (Marcus and Marcinkiewicz, 1993) and the end-to-end (E2E) text generation dataset (Novikova et al., 2017) ." 
}, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-26", "text": "Experimental results show that our HR-VAE model not only can effectively mitigate the latent variable collapse issue with a stable training process, but also can give better predictive performance than the baselines, as evidenced by both quantitative (e.g., negative log likelihood and perplexity) and qualitative evaluation." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-27", "text": "The code for our model is available online 2 ." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-28", "text": "----------------------------------" }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-29", "text": "**METHODOLOGY**" }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-30", "text": "----------------------------------" }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-31", "text": "**BACKGROUND OF VAE**" }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-32", "text": "A variational autoencoder (VAE) is a deep generative model, which combines variational inference with deep learning." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-33", "text": "The VAE modifies the conventional autoencoder architecture by replacing the deterministic latent representation z of an input x with a posterior distribution P (z|x), and imposing a prior distribution on the posterior, such that the model allows sampling from any point of the latent space and yet able to generate novel and plausible output." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-34", "text": "The prior is typically chosen to be standard normal distributions, i.e., P (z) = N (0, 1), such that the KL divergence between posterior and prior can be computed in closed form (Kingma and Welling, 2013) ." 
}, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-59", "text": "----------------------------------" }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-35", "text": "To train a VAE, we need to optimise the marginal likelihood P \u03b8 (x) = P (z)P \u03b8 (x|z)dz, 2 https://github.com/ruizheliUOA/HR-VAE where the log likelihood can take following form:" }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-36", "text": "Here Q \u03c6 (z|x) is the variational approximation for the true posterior P \u03b8 (z|x)." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-37", "text": "Specifically, Q \u03c6 (z|x) can be regarded as an encoder (a.k.a." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-38", "text": "the recognition model) and P \u03b8 (x|z) the decoder (a.k.a. the generative model)." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-39", "text": "Both encoder and decoder are implemented via neural networks." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-40", "text": "As proved in (Kingma and Welling, 2013) , optimising the marginal log likelihood is essentially equivalent to maximising L(\u03b8, \u03c6; x), i.e., the evidence lower bound (ELBO), which consists of two terms." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-41", "text": "The first term is the expected reconstruction error indicating how well the model can reconstruct data given a latent variable." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-42", "text": "The the second term is the KL divergence of the approximate posterior from prior, i.e., a regularisation pushing the learned posterior to be as close to the prior as possible." 
}, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-43", "text": "----------------------------------" }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-44", "text": "**VARIATIONAL AUTOENDODER WITH HOLISTIC REGULARISATION**" }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-45", "text": "In this section, we discuss the technical details of the proposed holistic regularisation VAE (HR-VAE) model, a general architecture which can effectively mitigate the KL vanishing phenomenon." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-46", "text": "Our model design is motivated by one noticeable defect shared by the VAE-RNN based models in previous works (Bowman et al., 2016; Yang et al., 2017; Xu and Durrett, 2018; Dieng et al., 2019) ." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-47", "text": "That is, all these models, as shown in Figure 1a , only impose a standard normal distribution prior on the last hidden state of the RNN encoder, which potentially leads to learning a suboptimal representation of the latent variable and results in model vulnerable to KL loss vanishing." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-48", "text": "Our hypothesis is that to learn a good representation of data and a good generative model, it is crucial to impose the standard normal prior on all the hidden states of the RNN-based encoder (see Figure 1b ), which allows a better regularisation of the model learning process." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-49", "text": "We implement the HR-VAE model using a twolayer LSTM for both the encoder and decoder." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-50", "text": "However, one should note that our architecture can be readily applied to other types of RNN such as GRU." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-51", "text": "For each time stamp t (see Figure 1b) , we concatenate the hidden state h t and the cell state c t of the encoder." 
}, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-52", "text": "The concatenation (i.e., [h t ; c t ]) is then fed into two linear transformation layers for estimating \u00b5 t and \u03c3 2 t , which are parameters of a normal distribution corresponding to the concatenation of h t and c t ." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-53", "text": "Let Q \u03c6t (z t |x) = N (z t |\u00b5 t , \u03c3 2 t ), we wish Q \u03c6t (z t |x) to be close to a prior P (z t ), which is a standard Gaussian." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-54", "text": "Finally, the KL divergence between these two multivariate Gaussian distributions (i.e., Q \u03c6t and P (z t )) will contribute to the overall KL loss of the ELBO." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-55", "text": "By taking the average of the KL loss at each time stamp t, the resulting ELBO takes the following form" }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-56", "text": "KL(Q \u03c6t (z t |x) P (z t ))." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-57", "text": "( 3) As can be seen in Eq. 3, our solution to the KL collapse issue does not require any engineering for balancing the weight between the reconstruction term and KL loss as commonly the case in existing works (Bowman et al., 2016; ." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-63", "text": "We evaluate our model on two public datasets, namely, Penn Treebank (PTB) (Marcus and Marcinkiewicz, 1993) and the end-to-end (E2E) text generation corpus (Novikova et al., 2017) , which have been used in a number of previous works for text generation (Bowman et al., 2016; Xu and Durrett, 2018; Wiseman et al., 2018; Su et al., 2018) ." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-64", "text": "PTB consists of more than 40,000 sentences from Wall Street Journal articles whereas the E2E dataset contains over 50,000 sen-tences of restaurant reviews." 
}, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-65", "text": "The statistics of these two datasets are summarised in Table 1 ." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-66", "text": "----------------------------------" }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-67", "text": "**IMPLEMENTATION DETAILS**" }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-68", "text": "For the PTB dataset, we used the train-test split following (Bowman et al., 2016; Xu and Durrett, 2018) ." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-69", "text": "For the E2E dataset, we used the train-test split from the original dataset (Novikova et al., 2017) and indexed the words with a frequency higher than 3." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-70", "text": "We represent input data with 512-dimensional word2vec embeddings (Mikolov et al., 2013) ." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-71", "text": "We set the dimension of the hidden layers of both encoder and decoder to 256." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-72", "text": "The Adam optimiser (Kingma and Ba, 2014) was used for training with an initial learning rate of 0.0001." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-73", "text": "Each utterance in a mini-batch was padded to the maximum length for that batch, and the maximum batch-size allowed is 128." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-74", "text": "----------------------------------" }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-75", "text": "**BASELINES**" }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-76", "text": "We compare our HR-VAE model with three strong baselines using VAE for text modelling: VAE-LSTM-base 3 : A variational autoencoder model which uses LSTM for both encoder and decoder." 
}, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-77", "text": "KL annealing is used to tackled the latent variable collapse issue (Bowman et al., 2016) ; VAE-CNN 4 : A variational autoencoder model with a LSTM encoder and a dilated CNN decoder (Yang et al., 2017) ; vMF-VAE 5 : A variational autoencoder model using LSTM for both encoder and decoder where the prior distribution is the von Mises-Fisher (vMF) distribution rather than a Gaussian distribution (Xu and Durrett, 2018) ." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-78", "text": "the decoder needs to predict the entire sequence with only the help of the given latent variable z." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-79", "text": "In this way, a high-quality representation abstracting the information of the input sentence is much needed for the decoder, and hence enforcing z to learn the required information." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-80", "text": "Overall performance." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-81", "text": "Table 2 shows the language modelling results of our approach and the baselines." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-82", "text": "We report negative log likelihood (NLL), KL loss, and perplexity (PPL) on the test set." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-83", "text": "As expected, all the models have a higher KL loss in the inputless setting than the standard setting, as z is required to encode more information about the input data for reconstruction." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-84", "text": "In terms of overall performance, our model outperforms all the baselines in both datasets (i.e., PTB and E2E)." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-85", "text": "For instance, when comparing with the strongest baseline vMF-VAE in the standard setting, our model reduces NLL from 96 to 79 and PPL from 98 to 43 in PTB, respectively." 
}, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-86", "text": "In the inputless setting, our performance gain is even higher, i.e., NLL reduced from 117 to 85 and PPL from 262 to 54." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-88", "text": "These observations suggest that our approach can learn a better generative model for data." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-89", "text": "Loss analysis." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-90", "text": "To conduct a more thorough evaluation, we further investigate model behaviours in terms of both reconstruction loss and KL loss, as shown in Figure 2 ." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-91", "text": "These plots were obtained based on the E2E training set using the inputless setting." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-92", "text": "We can see that the KL loss of VAE-LSTMbase, which uses Sigmoid annealing (Bowman et al., 2016) , collapses to zero, leading to a poor generative performance as indicated by the high reconstruction loss." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-93", "text": "The KL loss for both VAE-CNN and vMF-VAE are nonzero, where the former mitigates the KL collapse issue with a KL loss clipping strategy and the latter by replacing the standard normal distribution for the prior with the vMF distribution (i.e., with the vMF distribution, the KL loss only depends on a fixed concentration parameter, and hence results in a constant KL loss)." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-94", "text": "Although both VAE-CNN and vMF-VAE outperform VAE-LSTM-base by a large margin in terms of reconstruction loss as shown in Figure 2 , one should also notice that these two models actually overfit the training data, as their performance on the test set is much worse (cf." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-95", "text": "Table 2 )." 
}, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-96", "text": "In contrast to the baselines which mitigate the KL collapse issue by carefully engineering the weight between the reconstruction loss and KL loss or choosing a different choice of prior, we provide a simple and elegant solution through holistic KL regularisation, which can effectively mitigate the KL collapse issue and achieve a better reconstruction error in both training and testing." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-97", "text": "Sentence reconstruction." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-98", "text": "Lastly, we show some sentence examples reconstructed by vMF-VAE (i.e., the best baseline) and our model in the inputless setting using sentences from the E2E test set as input." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-99", "text": "As shown in Table 3 , the sentences generated by vMF-VAE contain repeated words in quite a few cases, such as 'city city area' and 'blue spice spice'." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-100", "text": "In addition, vMF-VAE also tends to generate unnecessary or unrelated words at the end of sentences, making the generated sentences ungrammatical." }, { "sent_id": "493f353942929c1a015d5e0acbf564-C001-101", "text": "The sentences reconstructed by our model, in contrast, are more grammatical and more similar to the corresponding ground truth sentences than vMF-VAE." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "493f353942929c1a015d5e0acbf564-C001-20" ], [ "493f353942929c1a015d5e0acbf564-C001-77" ] ], "cite_sentences": [ "493f353942929c1a015d5e0acbf564-C001-20", "493f353942929c1a015d5e0acbf564-C001-77" ] }, "@USE@": { "gold_contexts": [ [ "493f353942929c1a015d5e0acbf564-C001-24" ], [ "493f353942929c1a015d5e0acbf564-C001-76", "493f353942929c1a015d5e0acbf564-C001-77" ] ], "cite_sentences": [ "493f353942929c1a015d5e0acbf564-C001-24", "493f353942929c1a015d5e0acbf564-C001-77" ] }, "@DIF@": { "gold_contexts": [ [ "493f353942929c1a015d5e0acbf564-C001-24", "493f353942929c1a015d5e0acbf564-C001-26" ] ], "cite_sentences": [ "493f353942929c1a015d5e0acbf564-C001-24" ] }, "@MOT@": { "gold_contexts": [ [ "493f353942929c1a015d5e0acbf564-C001-46" ] ], "cite_sentences": [ "493f353942929c1a015d5e0acbf564-C001-46" ] } } }, "ABC_193d388c3f4c346cb62711f3f04c0f_37": { "x": [ { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-2", "text": "We propose a multi-view network for text classification." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-3", "text": "Our method automatically creates various views of its input text, each taking the form of soft attention weights that distribute the classifier's focus among a set of base features." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-4", "text": "For a bag-of-words representation, each view focuses on a different subset of the text's words." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-5", "text": "Aggregating many such views results in a more discriminative and robust representation." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-6", "text": "Through a novel architecture that both stacks and concatenates views, we produce a network that emphasizes both depth and width, allowing training to converge quickly." 
}, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-7", "text": "Using our multi-view architecture, we establish new state-of-theart accuracies on two benchmark tasks." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-8", "text": "----------------------------------" }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-10", "text": "State-of-the-art deep neural networks leverage task-specific architectures to develop hierarchical representations of their input, with each layer building a refined abstraction of the layer that came before it (Conneau et al., 2016) ." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-11", "text": "For text classification, one can think of this as a single reader building up an increasingly refined understanding of the content." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-12", "text": "In a departure from this philosophy, we propose a divide-and-conquer approach, where a team of readers each focus on different aspects of the text, and then combine their representations to make a joint decision." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-13", "text": "More precisely, the proposed Multi-View Network (MVN) for text classification learns to generate several views of its input text." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-14", "text": "Each view is formed by focusing on different sets of words through a view-specific attention mechanism." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-15", "text": "These views are arranged sequentially, so each subsequent view can build upon or deviate from previous views as appropriate." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-16", "text": "The final representation that concatenates these diverse views should be more robust to noise than any one of its components." 
}, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-17", "text": "Furthermore, different sentences may look similar under one view but different under another, allowing the network to devote particular views to distinguishing between subtle differences in sentences, resulting in more discriminative representations." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-18", "text": "Unlike existing multi-view neural network approaches for image processing (Zhu et al., 2014; Su et al., 2015) , where multiple views are provided as part of the input, our MVN learns to automatically create views from its input text by focusing on different sets of words." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-19", "text": "Compared to deep Convolutional Networks (CNN) for text (Zhang et al., 2015; Conneau et al., 2016) , the MVN strategy emphasizes network width over depth." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-20", "text": "Shorter connections between each view and the loss function enable better gradient flow in the networks, which makes the system easier to train." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-21", "text": "Our use of multiple views is similar in spirit to the weak learners used in ensemble methods (Breiman, 1996; Friedman et al.," }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-22", "text": "----------------------------------" }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-23", "text": "**MULTI-VIEW NETWORKS FOR TEXT**" }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-24", "text": "The MVN architecture is depicted in Figure 1 ." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-25", "text": "First, individual selection vectors s + are created, each formed by a distinct softmax weighted sum over the word vectors of the input text." 
}, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-26", "text": "Next, these selections are sequentially transformed into views v, with each view influencing the views that come after it." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-27", "text": "Finally, all views are concatenated and fed into a two-layer perceptron for classification." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-28", "text": "----------------------------------" }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-29", "text": "**MULTIPLE ATTENTIONS FOR SELECTION**" }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-30", "text": "Each selection s + is constructed by focusing on a different subset of words from the original text, as determined by a softmax weighted sum (Bahdanau et al., 2014) ." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-31", "text": "Given a piece of text with H words, we represent it as a bag-of-words feature matrix B \u2208 IR H\u00d7d ." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-32", "text": "Each row of the matrix corresponds to one word, which is represented by a d-dimensional vector, as provided by a learned word embedding table." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-33", "text": "The selection s + i for the i th view is the softmax weighted sum of features:" }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-34", "text": "where the weight d i,h is computed by:" }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-35", "text": "here, w s i (a vector) and W s i (a matrix) are learned selection parameters." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-36", "text": "By varying the weights d i,h , the selection for each view can focus on different words from B, as illustrated by different color curves connecting to s + in Figure 1 ." 
}, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-37", "text": "----------------------------------" }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-38", "text": "**AGGREGATING SELECTIONS INTO VIEWS**" }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-39", "text": "Having built one s + for each of our V views, the actual views are then created as follows:" }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-40", "text": "where W v i are learned parameter matrices, and [. . . ; . . .] represents concatenation." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-41", "text": "The first and last views are formed by solely s + ; however, they play very different roles in our network." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-42", "text": "v V is completely disconnected from the others, an independent attempt at good feature selection, intended to increase view diversity (Muslea et al., 2002; Viktor, 2006, 2008; Wang et al., 2015) ." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-43", "text": "Conversely, v 1 forms the base of a structure similar to a multi-layer perceptron with shortcutting, as defined by the recurrence in Equation 5." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-44", "text": "Here, the concatenation of all previous views implements short-cutting, while the recursive definition of each view implements stacking, forming a deep network depicted by horizontal arrows in Figure 1 ." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-45", "text": "This structure makes each view aware of the information in those previous to it, allowing them to build upon each other." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-46", "text": "Note that the W v matrices are view-specific and grow with each view, making the overall parameter count quadratic in the number of views." 
}, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-47", "text": "----------------------------------" }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-48", "text": "**CLASSIFICATION WITH VIEWS**" }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-49", "text": "The final step is to transform our views into a classification of the input text." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-50", "text": "The MVN does so by concatenating its view vectors, which are then fed into a fully connected projection followed by a softmax function to produce a distribution over the possible classes." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-51", "text": "Dropout regularization (Hinton et al., 2012) can be applied at this softmax layer, as in (Kim, 2014) ." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-52", "text": "----------------------------------" }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-53", "text": "**BEYOND BAGS OF WORDS**" }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-54", "text": "The MVN's selection layer operates on a matrix of feature vectors B, which has thus far corresponded to a bag of word vectors." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-55", "text": "Each view's selection makes intuitive sense when features correspond to words, as it is easy to imagine different readers of a text focusing on different words, with each reader arriving at a useful interpretation." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-56", "text": "However, there is a wealth of knowledge on how to construct powerful feature representations for text, such as those used by convolutional neural networks (CNNs)." 
}, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-57", "text": "To demonstrate the utility of having views that weight arbitrary feature vectors, we augment our bag-of-words representation with vectors built by n-gram filters max-pooled over the entire text (Kim, 2014) , with one feature vector for each n-gram order, n = 2 . . . 5." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-58", "text": "The augmented B matrix has H +4 rows." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-59", "text": "Unlike our word vectors, the 4 CNN vectors each provide representations of the entire text." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-60", "text": "Returning to our reader analogy, one could imagine these to correspond to quick (n = 2) or careful (n = 5) skims of the text." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-61", "text": "Regardless of whether a feature vector is built by embedding table or by max-pooled n-gram filters, we always back-propagate through all feature construction layers, so they become specialized to our end task." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-62", "text": "----------------------------------" }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-63", "text": "**EXPERIMENTS**" }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-64", "text": "----------------------------------" }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-65", "text": "**STANFORD SENTIMENT TREEBANK**" }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-66", "text": "The Stanford Sentiment Treebank contains 11,855 sentences from movie reviews." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-67", "text": "We use the same splits for training, dev, and test data as in (Socher et al., 2013) to predict the fine-grained 5-class sentiment categories of the sentences." 
}, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-68", "text": "For comparison purposes, following (Kim, 2014; Kalchbrenner et al., 2014; Lei et al., 2015) , we train the models using both phrases and sentences, but only evaluate sentences at test time." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-69", "text": "We initialized all of the word embeddings using the publicly available 300 dimensional pre-trained vectors from GloVe (Pennington et al., 2014) ." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-70", "text": "We learned 8 views with 200 dimensions each, which requires us to project the 300 dimensional word vectors, which we implemented using a linear transformation, whose weight matrix and bias term are shared across all words, followed by a tanh activation." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-71", "text": "For optimization, we used Adadelta (Zeiler, 2012), with a starting learning rate of 0.0005 and a mini-batch of size 50." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-72", "text": "Also, we used dropout (with a rate of 0.2) to avoid overfitting." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-73", "text": "All of these MVN hyperparameters were determined through experiments measuring validation-set accuracy." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-74", "text": "The test-set accuracies obtained by different learning methods, including the current state-ofthe-art results, are presented in Table 1 ." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-75", "text": "The results indicate that the bag-of-words MVN outperforms most methods, but obtains lower accuracy than the state-of-the-art results achieved by the tree-LSTM (Tai et al., 2015; Zhu et al., 2015) and the high-order CNN (Lei et al., 2015 (Lei et al., 2015) ." 
}, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-76", "text": "when augmented with 4 convolutional features as described in Section 2.4, the MVN strategy surpasses both of these, establishing a new state-ofthe-art on this benchmark." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-77", "text": "In Figure 2 , we present the test-set accuracies obtained while varying the number of views in our MVN with convolutional features." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-78", "text": "These results indicate that better predictive accuracy can be achieved while increasing the number of views up to eight." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-79", "text": "After eight, the accuracy starts to drop." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-80", "text": "The number of MVN views should be tuned for each new application, but it is good to see that not too many views are required to achieve optimal performance on this task." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-81", "text": "To better understand the benefits of the MVN method, we further analyzed the eight views constructed by our best model." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-82", "text": "After training, we obtained the view representation vectors for both the training and testing data, and then independently trained a very simple, but fast and stable Na\u00efve Bayes classifier (McCallum and Nigam, 1998) for each view." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-83", "text": "We report class-specific F-measures for Figure 3 ." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-84", "text": "From this figure, we can observe that different views focus on different target classes." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-85", "text": "For example, the first two views perform poorly on the 0 (very negative) and 1 (negative) classes, but achieve the highest F-measures on the 2 (neutral) class." 
}, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-86", "text": "Meanwhile, the non-neutral classes each have a different view that achieves the highest F-measure." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-87", "text": "This suggests that some views have specialized in order to better separate subsets of the training data." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-88", "text": "We provide an ablation study in Table 2 ." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-89", "text": "First, we construct a traditional ensemble model." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-90", "text": "We independently train eight MVN models, each with a single view, to serve as weak learners." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-91", "text": "We have them vote with equal weight for the final classification, obtaining a test-set accuracy of 50.2." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-92", "text": "Next, we restrict the views in the MVN to be unaware of each other." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-93", "text": "That is, we replace Equation 5 with v i = s (Conneau et al., 2016)" }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-94", "text": "----------------------------------" }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-95", "text": "**AG'S ENGLISH NEWS CATEGORIZATION**" }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-96", "text": "The AG corpus (Zhang et al., 2015; Conneau et al., 2016) contains categorized news articles from more than 2,000 news outlets on the web." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-97", "text": "The task has four classes, and for each class there are 30,000 training documents and 1,900 test documents." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-98", "text": "A random sample of the training set was used for hyper-parameter tuning." 
}, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-99", "text": "The training and testing settings of this task are exactly the same as those presented for the Stanford Sentiment Treebank task in Section 3.1, except that the mini-batch size is reduced to 23, and each view has a dimension of 100." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-100", "text": "The test errors obtained by various methods are presented in Table 3 ." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-101", "text": "These results show that the bag-of-words MVN outperforms the state-of-theart accuracy obtained by the non-neural n-gram TFIDF approach (Zhang et al., 2015) , as well as several very deep CNNs (Conneau et al., 2016) ." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-102", "text": "Accuracy was further improved when the MVN was augmented with 4 convolutional features." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-103", "text": "In Figure 4 , we show how accuracy and loss evolve on the validation set during MVN training." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-104", "text": "These curves show that training is quite stable." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-105", "text": "The MVN achieves its best results in just a few thousand iterations." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-106", "text": "----------------------------------" }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-107", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-108", "text": "We have presented a novel multi-view neural network for text classification, which creates multiple views of the input text, each represented as a weighted sum of a base set of feature vectors." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-109", "text": "These views work together to produce a discriminative feature representation for text classification." 
}, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-110", "text": "Unlike many neural approaches to classification, our architecture emphasizes network width in addition to depth, enhancing gradient flow during training." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-111", "text": "We have used the multi-view network architecture to establish new state-of-the-art results on two benchmark text classification tasks." }, { "sent_id": "193d388c3f4c346cb62711f3f04c0f-C001-112", "text": "In the future, we wish to better understand the benefits of generating multiple views, explore new sources of base features, and apply this technique to other NLP problems such as translation or tagging." } ], "y": { "@BACK@": { "gold_contexts": [ [ "193d388c3f4c346cb62711f3f04c0f-C001-10" ], [ "193d388c3f4c346cb62711f3f04c0f-C001-96" ] ], "cite_sentences": [ "193d388c3f4c346cb62711f3f04c0f-C001-10", "193d388c3f4c346cb62711f3f04c0f-C001-96" ] }, "@DIF@": { "gold_contexts": [ [ "193d388c3f4c346cb62711f3f04c0f-C001-10", "193d388c3f4c346cb62711f3f04c0f-C001-12" ], [ "193d388c3f4c346cb62711f3f04c0f-C001-19" ], [ "193d388c3f4c346cb62711f3f04c0f-C001-101" ] ], "cite_sentences": [ "193d388c3f4c346cb62711f3f04c0f-C001-10", "193d388c3f4c346cb62711f3f04c0f-C001-19", "193d388c3f4c346cb62711f3f04c0f-C001-101" ] }, "@USE@": { "gold_contexts": [ [ "193d388c3f4c346cb62711f3f04c0f-C001-93" ], [ "193d388c3f4c346cb62711f3f04c0f-C001-96", "193d388c3f4c346cb62711f3f04c0f-C001-98" ] ], "cite_sentences": [ "193d388c3f4c346cb62711f3f04c0f-C001-93", "193d388c3f4c346cb62711f3f04c0f-C001-96" ] } } }, "ABC_c86271049ebbcf8eafe781a0af6a98_37": { "x": [ { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-2", "text": "This paper tackles the problem of the semantic gap between a document and a query within an ad-hoc information retrieval task." 
}, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-3", "text": "In this context, knowledge bases (KBs) have already been acknowledged as valuable means since they allow the representation of explicit relations between entities." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-4", "text": "However, they do not necessarily represent implicit relations that could be hidden in a corpora." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-5", "text": "This latter issue is tackled by recent works dealing with deep representation learning of texts." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-6", "text": "With this in mind, we argue that embedding KBs within deep neural architectures supporting documentquery matching would give rise to fine-grained latent representations of both words and their semantic relations." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-7", "text": "In this paper, we review the main approaches of neural-based document ranking as well as those approaches for latent representation of entities and relations via KBs." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-115", "text": "**CONCLUSIONS**" }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-8", "text": "We then propose some avenues to incorporate KBs in deep neural approaches for document ranking." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-9", "text": "More particularly, this paper advocates that KBs can be used either to support enhanced latent representations of queries and documents based on both distributional and relational semantics or to serve as a semantic translator between their latent distributional representations." 
}, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-10", "text": "----------------------------------" }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-11", "text": "**INTRODUCTION**" }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-12", "text": "Knowledge resources such as ontologies and Knowledge bases (KBs) provide data which are critical to associate words with their senses." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-13", "text": "Even if word senses cannot be easily discretized to a finite set of entries, numerous works have shown that such resources can successfully bridge the semantic gap between the document and the query within Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-14", "text": "Copyrights for third-party components of this work must be honored." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-15", "text": "For all other uses, contact the owner/author(s)." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-16", "text": "an information retrieval (IR) task [4] ." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-17", "text": "More specifically, in this line of work, those resources allow to enrich word-based representations by mapping words to concepts or entities and exploiting symbolic formalized semantic relations (e.g., \"is-a\" or \"part-of\") between words." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-18", "text": "Another way to deal with word senses is to learn from corpora their representations based on the premise of distributional semantics [10, 15] , also called word embeddings." 
}, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-19", "text": "Numerous recent works in this other line of works learn deep word representations by exploiting the context window surrounding the word." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-20", "text": "Furthermore, based on the general approach of latent representations of texts, several works attempt to model the relevance scoring of latent representations using deep neural architectures [8, 16] ." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-21", "text": "----------------------------------" }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-22", "text": "**NEU-IR '16 SIGIR WORKSHOP ON NEURAL INFORMATION RETRIEVAL**" }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-23", "text": "In this paper, we argue that combining (1) distributional semantics learned through deep architectures from the text corpora, and (2) symbolic semantics held by extracted concepts or entities from texts based on digital knowledge, would enhance the learning algorithm of latent representations of queries and documents with respect to the IR task." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-24", "text": "Thus, we propose two general deep architectures that incorporate a knowledge-based source of evidence in the input layer." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-25", "text": "The aim of the first approach is to combine word-based semantics and relational-based semantics in the query/document representations." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-26", "text": "The learning model attempts to map a term-concept-relation vector of the query/document to an abstracted representation." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-27", "text": "Unlikely, the main objective in the second approach is to jointly learn latent representations of the document and the query as well as a semantic translator between these latent entities." 
}, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-28", "text": "Therefore, the objective of the learning algorithm is to map a term-based representation of the query/document and joint concept-relation representation to a low-dimensional semantic representation." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-29", "text": "The rest of this paper is organized as follows." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-30", "text": "Section 2 discusses previous works related to neural approaches of ad-hoc IR and for latent representation of KB entities and relations or using KBs for improving latent text representations." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-31", "text": "Section 3 presents our approaches for using KBs as part of a deep neural architecture for performing ad-hoc IR." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-32", "text": "Section 4 concludes the paper and outlines relevant future work in the line of the proposed approaches." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-33", "text": "----------------------------------" }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-34", "text": "**RELATED WORK**" }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-35", "text": "Deep learning techniques have shown strong performance in many natural language processing and IR tasks." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-36", "text": "According to the motivation of this paper, we review here how deep neural networks have been leveraged for both documentquery matching tasks as well as the representation of KBs." 
}, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-37", "text": "----------------------------------" }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-38", "text": "**ON USING DEEP NEURAL NETWORKS IN IR**" }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-39", "text": "Recently, many works have shown that deep learning approaches are highly efficient in several IR tasks (e.g., text matching [1, 8] , query reformulation [12] , or questionanswering [2] )." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-40", "text": "More close to our work, we consider in this paper the specific task of text matching and the use of deep neural networks for document ranking." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-41", "text": "Indeed, deep architectures have been highlighted as effective in the discovery of hidden structures underlying plain text modeled through latent semantic features." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-42", "text": "We distinguish between two types of neural IR models according to the learning and leveraging approaches of distributed representations of text." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-43", "text": "The first category of work uses distributed representations to exploit text dependence within a well-known IR model, such as language models [1, 9] ." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-44", "text": "Also, Mitra [13] has recently proposed a model that leverages the dual word embeddings to better measure the document-query relevance." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-45", "text": "By keeping both input and output projections of word2vec [10] , this Dual Embedding Space Model allows to leverage both the embedding spaces to acquire richer distributional relationships." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-46", "text": "The author has demonstrated that this model is able to better gauge the document aboutness with respect to the query." 
}, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-47", "text": "The second category of works, which knows a keen interest in the recent years, consists in end-to-end scoring models that learn the relevance of document-query pairs via latent semantic features [8, 17] by taking into consideration the retrieval task objective." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-48", "text": "These models, also called Deep Semantic Structured Model (DSSM), have been introduced by Huang et al. [8] and are reported to be strong ones in web search task." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-49", "text": "In this approach, the query and the document are first modeled as two high dimensional term vectors (e.g., bag-of-words representation)." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-50", "text": "Through a feed-forward neural network, as shown in Figure 1 , the DSSM learns a representation of these entities (namely document and query) so as to obtain a low-dimensional vector projected within a latent semantic space." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-51", "text": "Then, the document ranking is trained, always within the DSSM architecture, by the maximization of the conditional likelihood of the query given the document." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-52", "text": "More particularly, the authors estimate this conditional likelihood by a softmax function applied on the cosine similarity between the corresponding semantic vector of documents and queries." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-53", "text": "Moreover, to tackle the issue of large vocabularies surrounding long texts and to enable largescale training, the authors have proposed the word hashing method which transforms the high-dimensional term vector of the query/document to a low-dimensional letter-trigram vector." 
}, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-54", "text": "This lower dimensional vector is then considered as input of the feed-forward neural network." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-55", "text": "As an extension of the DSSM proposed in [8] , Shen et al. [17] propose to consider word-trigram vectors enhanced by a word hashing layer (instead of word hashing on the basis of bag-of-words) to capture the fine-grained contextual structures in the query/document." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-56", "text": "Accordingly, the end-toend scoring model is impacted, leading to a convolutionalpooling structure, called Convolutional Latent Semantic Model (CLSM)." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-57", "text": "In the same mind, Severyn and Moschitti [16] present another convolutional neural network architecture to learn the optimal representation of short text pairs as well as the similarity function." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-58", "text": "Given a pair of sentences modeled as a matrix of pre-trained word embeddings, this model first learns their intermediate feature representation by applying convolution-pooling layers on each sentence." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-59", "text": "A similarity score of this intermediate representation of the document and the query is computed and enhanced then by additional features (e.g., query-document word/IDF overlap)." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-60", "text": "This richer representation is plugged into a fully connected layer that classifies whether or not the document is similar to the query." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-61", "text": "Another convolutional architecture model for matching two sentences is proposed in [7] ." 
}, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-62", "text": "Instead of relying on theirs semantic vectors, the authors use a deep architecture with multiple convolutional layers to model an interaction between plain texts (i.e. the co-occurence pattern of words across two texts)." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-63", "text": "The proposed model allows to represent the hierarchical structures of sentences and to capture the rich matching patterns at different levels." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-64", "text": "----------------------------------" }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-65", "text": "**LEVERAGING KNOWLEDGE GRAPH FOR DISTRIBUTED REPRESENTATIONS**" }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-66", "text": "The potential of semantic representations of words learned through a neural approach has been introduced in [10, 15] , opening several perspectives in natural language processing and IR tasks." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-67", "text": "Beyond words, several works focused on the representation of sentences [11] , documents [9] , and also knowledge bases (KBs) [3, 18] ." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-68", "text": "Within the latter work focusing on KBs, the goal is to exploit concepts and their relationships to obtain a latent representation of the KB." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-69", "text": "While some work focused on the representation of relations on the basis of triplets belonging to the KB [3] , other work proposed to enhance the distributed representation of words for representing their underlying concepts by taking into consideration the structure of the KB graph (e.g., concepts in the same category or their relationships with other concepts) [6, 18, 19] ." 
}, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-70", "text": "A first work [6] proposes a \"retrofitting\" technique consisting in a leveraging of lexicon-derived relational information, namely adjacent words of concepts, to refine their associated word embeddings." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-71", "text": "The underlying intuition is that adjacent concepts in the KB should have similar embeddings while maintaining most of the semantic information in their prelearned distributed word representations." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-72", "text": "For each word, the retrofitting approach learns its new representation by minimizing both (1) its distance with the representation of all connected words in the semantic graph and (2) its distance with the pre-learned word embedding, namely its initial distributed representation." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-73", "text": "In contrast to [6] , other work [18, 19] proposes an endto-end oriented approach that rather adjusts the objective function of the neural language model." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-74", "text": "For instance, Xu et al. [18] propose the RC-NET model that leverages the relational and categorical knowledge to learn a higher quality word embeddings." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-75", "text": "This model extends the objective function of the skip-gram model [10] with two regularization functions based on relational and categorical knowledge from the external resource, respectively." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-76", "text": "While the relationalbased regularization function characterizes the word relationships which are interpreted as translations in latent semantic space of word embeddings, the categorical-based one aims at minimizing the weighted distance between words with same attributes." 
}, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-77", "text": "With experiments on text mining and NLP tasks, the authors have reported that combining these two regularization functions allows to significantly improve the quality of word representations." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-78", "text": "In the same mind, Yu et al. [19] propose a relation constrained model (RCM) that extends the CBOW model [10] with a function based on prior relational knowledge issued from an external resource." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-79", "text": "Thus, the final objective of the model is to learn the pure distributed representation in the text corpus and also to capture the semantic relationship between words from external resources." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-80", "text": "In addition to word similarity tasks, the literature review shows that KBs are also exploited in question-answering tasks." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-81", "text": "For instance, Bordes et al. [2] exploit a KB to learn the latent representations of questions and candidate answers." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-82", "text": "The latter is modeled as a subgraph built by a sophisticated inference procedure that captures the relationship of the question object with the candidate answer as well as its related entities." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-83", "text": "We have described in this section two branches of work." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-84", "text": "The first one investigates the use of a deep neural network within a document-ranking matching process, often performed without external features, while the second one exploits KBs to learn a better distributed representation of words or concepts." 
}, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-85", "text": "In the next section, we will show how KBs could be leveraged within the deep neural network architecture for the document-ranking task." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-86", "text": "----------------------------------" }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-87", "text": "**TOWARD LEVERAGING KB FOR NEU-RAL AD-HOC IR**" }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-88", "text": "The reported literature review clearly highlights the potential of neural networks in one hand and the benefit of KBs, in the other hand, for ad-hoc search tasks." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-89", "text": "We believe that the integration of an external resource within a document-query neural matching process would allow benefiting from the symbolic semantics surrounding concepts and their relationships." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-90", "text": "Accordingly, such approach would impact the representation learning that could be performed at different levels." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-91", "text": "As illustrated in Figure 2 , we suggest using a deep neural approach to achieve two levels of representations: 1) an enhanced knowledge-based representation of the document and the query and 2) a distinct representation of the document and the query surrounding by a third KB-based representation aiming at improving the semantic closeness of document and query representations." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-92", "text": "While in the first approach, a KB is used as a mean of document and query representation enhancement, the KB is exploited in the latter approach as a mean for document-query translation." 
}, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-93", "text": "----------------------------------" }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-94", "text": "**LEVERAGING ENHANCED REPRESENTATIONS OF TEXT USING KB FOR IR**" }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-95", "text": "The first approach that we suggest for integrating KB within a deep neural network focuses on an enhanced representation of documents and queries as illustrated in Figure 2a ." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-96", "text": "While a naive approach would be to exploit the concept embeddings learned from the KB distributed representation [6, 18] as input of the deep neural network, we believe that a hybrid representation of the distributional semantic (namely, word embeddings) and the symbolic semantics (namely, concept embeddings taking into account the graph structure) would allow enhancing the document-query matching." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-97", "text": "Indeed, simply considering concepts belonging to the KB may lead to a partial mismatch with the text of queries and/or documents [4] ." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-98", "text": "With this in mind, the document and query representations could be enhanced with a symbolic semantic layer expressing the projection of the plain text on the KB with the consideration of concepts and their relationships within the KB." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-99", "text": "On one hand, the representation of the plain text might be, as used in several previous work, a high-dimensional vector of terms [8, 17] or of their corresponding word embeddings [16] ." 
}, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-100", "text": "On the other hand, the semantic layer could be built by the representation of concepts (and their relationships) extracted from the plain text through a concept embedding [6] or a richer embedding representation of a KB sub-graph, as suggested in [2] ." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-101", "text": "The latter presents the advantage to model the compositionality of concepts within the document." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-102", "text": "Similarly to previous approaches [8, 16, 17] , the enhanced representations of both document and query would be transformed into low-dimensional semantic feature vectors used within a similarity function." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-103", "text": "----------------------------------" }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-104", "text": "**USING KB TRANSLATION MODEL FOR IR**" }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-105", "text": "While the first model exploits knowledge bases to enhance the representation of a document-query pair and their similarity score, an alternative approach consists in a ranking model based on the translation role of the knowledge resource." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-106", "text": "As illustrated in Figure 2b , this second approach aims to take external knowledge resources as a third component of the deep neural network architecture." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-107", "text": "Intuitively, this third branch could be considered as a pivotal component bridging the semantic gap between the document and the query vocabulary." 
}, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-108", "text": "Indeed, the knowledge resource is here seen as a mediate component that helps to translate the deep representation of the query towards the deep representation of the document with respect to the ad-hoc IR task." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-109", "text": "More practically, the model would consider three initial entities (namely the document, the query, and the knowledge resource) as inputs." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-110", "text": "Whether modeled as plain text vector or word embedding matrices, the translation input should be an extraction from the KB characterizing the semantic relationship between the document and the query through their symbolic semantics in the KB (e.g., the embedding of concepts extracted in common in both entities)." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-111", "text": "Then, with a deep architecture, the model will learn the raw representation as a latent semantic feature vector for each entity (document, query, and knowledge-based bridge)." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-112", "text": "Note that in (a) Enhanced representation using KB for IR (b) The KB translation model Figure 2 : Overview of approaches aiming at leveraging KB in DSSM architectures this approach, the representations of a document-query pair and the representation of the knowledge-based translation vector are learned in the same continuous embedding space." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-113", "text": "Then, with the intuition that the KB plays the role of a mediation component, the model will learn the similarity of a document-query pair with a scoring function that takes into account the translation role of the knowledge-based bridge (e.g., vector or matrix translation as done in [5] )." 
}, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-114", "text": "----------------------------------" }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-116", "text": "In this paper, we addressed the emergence of deep learning in ad-hoc IR tasks as well as the representation learning approach of words surrounded by external KB." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-117", "text": "Following previous work in IR highlighting the benefit of the consideration of the semantic in IR, we have suggested two approaches that leverage external semantic resources to improve a text retrieval task within deep structure neural networks." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-118", "text": "More particularly, we explained how KB could be integrated within the representation learning, either through an enhanced knowledge-based representation of the document and the query or as a translation representation bridging the semantic gap between the document and the query vocabulary." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-119", "text": "We outline that we particularly focused on the DSSM architecture but that our positions could fit with other deep neural network architectures, e.g. recurrent or memory networks [14] ." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-120", "text": "We hope that this proposal would support researchers in their future work related to ad-hoc IR as well as other search tasks such as question-answering or entity retrieval." }, { "sent_id": "c86271049ebbcf8eafe781a0af6a98-C001-121", "text": "All of these tasks would benefit from combining both distributional and knowledge-based latent representations of texts within the relevance scoring process." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "c86271049ebbcf8eafe781a0af6a98-C001-69" ], [ "c86271049ebbcf8eafe781a0af6a98-C001-70" ], [ "c86271049ebbcf8eafe781a0af6a98-C001-73" ], [ "c86271049ebbcf8eafe781a0af6a98-C001-100" ] ], "cite_sentences": [ "c86271049ebbcf8eafe781a0af6a98-C001-69", "c86271049ebbcf8eafe781a0af6a98-C001-70", "c86271049ebbcf8eafe781a0af6a98-C001-73", "c86271049ebbcf8eafe781a0af6a98-C001-100" ] }, "@DIF@": { "gold_contexts": [ [ "c86271049ebbcf8eafe781a0af6a98-C001-96" ] ], "cite_sentences": [ "c86271049ebbcf8eafe781a0af6a98-C001-96" ] } } }, "ABC_1a17ae4e5c8ea9e605f129aa96a6ee_37": { "x": [ { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-26", "text": "2. Only one adjunction can occur at a given node." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-75", "text": "**EVALUATION RESULTS**" }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-25", "text": "1. The foot node in an auxiliary tree must be the immediate child of the root node." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-76", "text": "We use the standard Penn treebank methodology of training on sections 2-21 and testing on section 23." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-2", "text": "We present a Bayesian nonparametric model for estimating tree insertion grammars (TIG), building upon recent work in Bayesian inference of tree substitution grammars (TSG) via Dirichlet processes." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-3", "text": "Under our general variant of TIG, grammars are estimated via the Metropolis-Hastings algorithm that uses a context free grammar transformation as a proposal, which allows for cubic-time string parsing as well as tree-wide joint sampling of derivations in the spirit of Cohn and Blunsom (2010)." 
}, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-4", "text": "We use the Penn treebank for our experiments and find that our proposal Bayesian TIG model not only has competitive parsing performance but also finds compact yet linguistically rich TIG representations of the data." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-5", "text": "----------------------------------" }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-7", "text": "There is a deep tension in statistical modeling of grammatical structure between providing good expressivity -to allow accurate modeling of the data with sparse grammars -and low complexitymaking induction of the grammars and parsing of novel sentences computationally practical." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-8", "text": "Recent work that incorporated Dirichlet process (DP) nonparametric models into TSGs has provided an efficient solution to the problem of segmenting training data trees into elementary parse tree fragments to form the grammar (Cohn et al., 2009; Cohn and Blunsom, 2010; Post and Gildea, 2009) ." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-9", "text": "DP inference tackles this problem by exploring the space of all possible segmentations of the data, in search for fragments that are on the one hand large enough so that they incorporate the useful dependencies, and on the other small enough so that they recur and have a chance to be useful in analyzing unseen data." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-10", "text": "The elementary trees combined in a TSG are, intuitively, primitives of the language, yet certain linguistic phenomena (notably various forms of modification) \"split them up\", preventing their reuse, leading to less sparse grammars than might be ideal." 
}, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-11", "text": "For instance, imagine modeling the following set of structures:" }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-12", "text": "\u2022 A natural recurring structure here would be the structure \"[ N P the [ N N president]]\", yet it occurs not at all in the data." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-13", "text": "TSGs are a special case of the more flexible grammar formalism of tree adjoining grammar (TAG) (Joshi et al., 1975) ." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-14", "text": "TAG augments TSG with an adjunction operator and a set of auxiliary trees in addition to the substitution operator and initial trees of TSG, allowing for \"splicing in\" of syntactic fragments within trees." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-15", "text": "In the example, by augmenting a TSG with an operation of adjunction, a grammar that hypothesizes auxiliary trees corresponding to adjoining \"[ N N Unfortunately, TAG's expressivity comes at the cost of greatly increased complexity." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-16", "text": "Parsing complexity for unconstrained TAG scales as O(n 6 ), im- practical as compared to CFG and TSG's O(n 3 )." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-17", "text": "In addition, the model selection problem for TAG is significantly more complicated than for TSG since one must reason about many more combinatorial options with two types of derivation operators." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-18", "text": "1 This has led researchers to resort to heuristic grammar extraction techniques (Chiang, 2000; Carreras et al., 2008) or using a very small number of grammar categories (Hwa, 1998) ." 
}, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-19", "text": "Hwa (1998) first proposed to use tree-insertion grammars (TIG), a kind of expressive compromise between TSG and TAG, as a substrate on which to build grammatical inference." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-20", "text": "TIG constrains the adjunction operation so that spliced-in material falls completely to the left or completely to the right of the splice point." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-21", "text": "By restricting the form of possible auxiliary trees to only left or right auxiliary trees in this way, TIG remains within the realm of contextfree formalisms (with cubic complexity) while still modeling rich linguistic phenomena (Schabes and Waters, 1995) ." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-22", "text": "Figure 1 depicts some examples of TIG derivations." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-23", "text": "Sharing the same intuitions, Shindo et al. (2011) have provided a previous attempt at combining TIG and Bayesian nonparametric principles, albeit with severe limitations." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-24", "text": "Their TIG variant (which we will refer to as TIG 0 ) is highly constrained in the following ways." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-27", "text": "1 This can be seen by the fact that tree-path languages under TAG are context free, whereas they are regular for TSG." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-28", "text": "(Schabes and Waters, 1995) 3." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-29", "text": "Even modeling multiple adjunction with root adjunction is disallowed." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-30", "text": "There is thus no recursion possibility with adjunction, no stacking of auxiliary trees." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-31", "text": "4. 
As a consequence of the prior two constraints, no adjunction along the spines of auxiliary trees is allowed." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-32", "text": "5. As a consequence of the first constraint, all nonterminals along the spine of an auxiliary tree are identical." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-33", "text": "In this paper we explore a Bayesian nonparametric model for estimating a far more expressive version of TIG, and compare its performance against TSG and the restricted TIG 0 variant." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-34", "text": "Our more general formulation avoids these limitations by supporting the following features and thus relaxing four of the five restrictions of TIG 0 ." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-35", "text": "1. Auxiliary trees may have the foot node at depth greater than one." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-36", "text": "The increased expressivity of our TIG variant is motivated both linguistically and practically." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-37", "text": "From a linguistic point of view: Deeper auxiliary trees can help model large patterns of insertion and potential correlations between lexical items that extend over multiple levels of tree." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-38", "text": "Combining left and right auxiliary trees can help model modifiers of the same node from left and right (combination of adjectives and relative clauses for instance)." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-39", "text": "Simultaneous insertion allows us to deal with multiple independent modifiers for the same constituent (for example, a series of adjectives)." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-40", "text": "From a practical point of view, we show that an induced TIG provides modeling performance superior to TSG and comparable with TIG 0 ." 
}, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-41", "text": "However we show that the grammars we induce are compact yet rich, in that they succinctly represent complex linguistic structures." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-42", "text": "----------------------------------" }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-43", "text": "**PROBABILISTIC MODEL**" }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-44", "text": "In the basic nonparametric TSG model, there is an independent DP for every grammar category (such as c = N P ), each of which uses a base distribution P 0 that generates an initial tree by making stepwise decisions." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-45", "text": "The canonical P 0 uses a probabilistic CFGP that is fixed a priori to sample CFG rules top-down and Bernoulli variables for determining where substitutions should occur (Cohn et al., 2009; Cohn and Blunsom, 2010) ." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-46", "text": "We extend this model by adding specialized DPs for left and right auxiliary trees." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-47", "text": "3" }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-48", "text": "Therefore, we have an exchangeable process for generating right auxiliary trees" }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-49", "text": "as for initial trees in TSG." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-50", "text": "We must define three distinct base distributions for initial trees, left auxiliary trees, and right auxiliary trees." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-51", "text": "P init 0 generates an initial tree with root label c by sampling CFG rules fromP and making a binary decision at every node generated whether to leave it as a frontier node or further expand (with probability \u03b2 c ) (Cohn et al., 2009) ." 
}, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-52", "text": "Similarly, our P right 0 generates a right auxiliary tree with root label c by first making a binary decision whether to generate an immediate foot or not (with probability \u03b3 right c ), and then sampling an appropriate CFG rule 3 We use right insertions for illustration; the symmetric analog applies to left insertions." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-53", "text": "----------------------------------" }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-54", "text": "**(VP (, ,) (VP PP (VP (, ,) VP*))) (VP (SBAR (WHADVP (WRB (WRB WHEN) ) ) S) (VP (, ,) VP*)) (VP (PP (IN FOR) (NP NN )) (VP (, ,) VP*)) (VP (CC BUT) (VP PP (VP (, ,) VP*))) (VP ADVP (VP (, ,) VP*)) (IN (ADVP (RB (RB PARTICULARLY) ) ) IN*) (NP PP (NP (CC AND) (NP PP NP*)))**" }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-55", "text": "Figure 3: Example left auxiliary trees that occur in the top derivations for Section 23." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-56", "text": "Simultaneous insertions occur most frequently for the labels VP (85 times), NNS (21 times), NNP (14 times)." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-57", "text": "fromP ." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-58", "text": "For the right child, we sample an initial tree from P init 0 ." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-59", "text": "For the left child, if decision to generate an immediate foot was made, we generate a foot node, and stop." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-60", "text": "Otherwise we recur into P right 0 which generates a right auxiliary tree that becomes the left child." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-61", "text": "We bring together these three sets of processes via a set of insertion parameters \u00b5 left c , \u00b5 right c ." 
}, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-62", "text": "In any derivation, for every initial tree node labelled c (except for frontier nodes) we determine whether or not there are insertions at this node by sampling a Bernoulli(\u00b5 left c ) distributed left insertion variable and a Bernoulli(\u00b5 right c ) distributed right insertion variable." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-63", "text": "For left auxiliary trees, we treat the nodes that are not along the spine of the auxiliary tree the same way we treat initial tree nodes, however for nodes that are along the spine (including root nodes, excluding foot nodes) we consider only left insertions by sampling the left insertion variable (symmetrically for right insertions)." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-64", "text": "----------------------------------" }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-65", "text": "**INFERENCE**" }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-66", "text": "Given this model, our inference task is to explore optimal derivations underlying the data." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-67", "text": "Since TIG derivations are highly structured objects, a basic sampling strategy based on local node-level moves such as Gibbs sampling (Geman and Geman, 1984) would not hold much promise." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-68", "text": "Following previous work, we design a blocked Metropolis-Hastings sampler that samples derivations per entire parse trees all at once in a joint fashion (Cohn and Blunsom, 2010; Shindo et al., 2011) ." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-69", "text": "This is achieved by proposing derivations from an approximating distribution and stochastically correcting via accept/reject to achieve convergence into the correct posterior (Johnson et al., 2007) ." 
}, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-70", "text": "Since our base distributions factorize over levels of tree, CFG is the most convenient choice for a CFG rule CFG probability proposal distribution." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-71", "text": "Fortunately, Schabes and Waters (1995) provide an (exact) transformation from a fully general TIG into a TSG that generates the same string languages." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-72", "text": "It is then straightforward to represent this TSG as a CFG using the Goodman transform (Goodman, 2002; Cohn and Blunsom, 2010) ." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-73", "text": "Figure 4 lists the additional CFG productions we have designed, as well as the rules used that trigger them." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-74", "text": "----------------------------------" }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-77", "text": "All our data is head-binarized and words occurring only once are mapped into unknown categories of the Berkeley parser." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-78", "text": "As has become standard, we carried out a small treebank experiment where we train on Section 2, and a large one where we train on the full training set." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-79", "text": "All hyperparameters are resampled under appropriate vague gamma and beta priors." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-80", "text": "All reported numbers are averages over three runs." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-81", "text": "Parsing results are based on the maximum probability parse which was obtained by sampling derivations under the transform CFG." 
}, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-82", "text": "We compare our system (referred to as TIG) to our implementation of the TSG system of (Cohn and Blunsom, 2010 ) (referred to as TSG) and the constrained TIG variant of (Shindo et al., 2011 ) (referred to as TIG 0 )." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-83", "text": "The upshot of our experiments is that, while on the large training set all models have similar performance (85.6, 85.3, 85 .4 for TSG, TIG 0 and TIG respectively), on the small dataset insertion helps nonparametric model to find more compact and generalizable representations for the data, which affects parsing performance (Figure 4 )." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-84", "text": "Although TIG 0 has performance close to TIG, note that TIG achieves this performance using a more succinct representation and extracting a rich set of auxiliary trees." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-85", "text": "As a result, TIG finds many chances to apply insertions to test sentences, whereas TIG 0 depends mostly on TSG rules." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-86", "text": "If we look at the most likely derivations for the test data, TIG 0 assigns 663 insertions (351 left insertions) in the parsing of entire Section 23, meanwhile TIG assigns 3924 (2100 left insertions)." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-87", "text": "Some of these linguistically sophisticated auxiliary trees that apply to test data are listed in Figure 3 ." 
}, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-88", "text": "----------------------------------" }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-89", "text": "**CONCLUSION**" }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-90", "text": "We described a nonparametric Bayesian inference scheme for estimating TIG grammars and showed the power of TIG formalism over TSG for returning rich, generalizable, yet compact representations of data." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-91", "text": "The nonparametric inference scheme presents a principled way of addressing the difficult model selection problem with TIG which has been prohibitive in this area of research." }, { "sent_id": "1a17ae4e5c8ea9e605f129aa96a6ee-C001-92", "text": "TIG still remains within context free and both our sampling and parsing techniques are highly scalable." } ], "y": { "@BACK@": { "gold_contexts": [ [ "1a17ae4e5c8ea9e605f129aa96a6ee-C001-8" ], [ "1a17ae4e5c8ea9e605f129aa96a6ee-C001-45" ] ], "cite_sentences": [ "1a17ae4e5c8ea9e605f129aa96a6ee-C001-8", "1a17ae4e5c8ea9e605f129aa96a6ee-C001-45" ] }, "@EXT@": { "gold_contexts": [ [ "1a17ae4e5c8ea9e605f129aa96a6ee-C001-45", "1a17ae4e5c8ea9e605f129aa96a6ee-C001-46" ] ], "cite_sentences": [ "1a17ae4e5c8ea9e605f129aa96a6ee-C001-45" ] }, "@USE@": { "gold_contexts": [ [ "1a17ae4e5c8ea9e605f129aa96a6ee-C001-68" ], [ "1a17ae4e5c8ea9e605f129aa96a6ee-C001-72" ], [ "1a17ae4e5c8ea9e605f129aa96a6ee-C001-82" ] ], "cite_sentences": [ "1a17ae4e5c8ea9e605f129aa96a6ee-C001-68", "1a17ae4e5c8ea9e605f129aa96a6ee-C001-72", "1a17ae4e5c8ea9e605f129aa96a6ee-C001-82" ] }, "@SIM@": { "gold_contexts": [ [ "1a17ae4e5c8ea9e605f129aa96a6ee-C001-82", "1a17ae4e5c8ea9e605f129aa96a6ee-C001-83" ] ], "cite_sentences": [ "1a17ae4e5c8ea9e605f129aa96a6ee-C001-82" ] } } }, "ABC_5f2f4087b80aa8dc3a5ccdb686983d_37": { "x": [ { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-2", 
"text": "This paper presents an analysis of how the level of performance achievable by an NLU module can affect the optimal modular design of a dialogue system." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-3", "text": "We present an evaluation that shows how NLU accuracy levels impact the overall performance of a system that includes an NLU module and a rule-based dialogue policy." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-4", "text": "We contrast these performance levels with the performance of a direct classification design that omits a separate NLU module." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-5", "text": "We conclude with a discussion of the potential for a hybrid architecture incorporating the strengths of both approaches." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-6", "text": "----------------------------------" }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-8", "text": "Recently computer-driven conversational characters or virtual humans have started finding real-life applications ranging from education to health services and museums (Traum et al., 2005; Swartout et al., 2006; Kenny et al., 2009; Jan et al., 2009; Swartout et al., 2010) ." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-9", "text": "As proliferation of these systems increases, there is a growing demand for the design and construction of virtual humans to be made more efficient and accessible to people without extensive linguistics and computer science backgrounds, such as writers, designers, and educators." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-10", "text": "We are specifically interested in making the language processing and dialogue management components in a virtual human easier for such potential authors to develop." 
}, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-11", "text": "Some system building steps that can be challenging for such authors include annotating the meaning of user and system utterances in a semantic formalism, developing a formal representation of information state, and writing detailed rules that govern dialogue management." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-12", "text": "We are generally interested in the extent to which these various authoring steps are necessary in order to achieve specific levels of system performance." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-13", "text": "In this paper, we present a case study analysis of the performance of two alternative architectures for a specific virtual human." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-14", "text": "The two architectures, which have been developed and evaluated in prior work (DeVault et al., 2011b; DeVault et al., 2011a) , differ substantially in their semantic annotation and policy authoring requirements." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-15", "text": "We describe these architectures and our evaluation corpus in Section 2." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-16", "text": "We focus our new analysis specifically on how the overall performance of one of the architectures, which uses a natural language understanding (NLU) module and hand-authored rules for the dialogue policy, depends on the performance of the NLU module." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-17", "text": "In Section 3, we describe our finding that, depending on the attainable level of NLU accuracy, this modular approach may or may not perform better than a simpler direct classification design that omits a separate NLU module and has a lower annotation and rule authoring burden." 
}, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-18", "text": "In Section 4, we present an initial exploration of whether a hybrid architecture may be able to combine these approaches' strengths." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-19", "text": "----------------------------------" }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-20", "text": "**SUMMARY OF DATA SET AND PRIOR RESULTS**" }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-21", "text": "This work is part of an ongoing research effort into techniques for developing high quality dialogue policies using a relatively small number of sample dialogues and low annotation requirements (DeVault et al., 2011b; DeVault et al., 2011a) ." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-22", "text": "This section briefly summarizes our prior work and data set." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-23", "text": "----------------------------------" }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-24", "text": "**DATA SET**" }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-25", "text": "For our experiments we use the dataset described in (DeVault et al., 2011b) ." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-26", "text": "It contains 19 Wizard of Oz dialogues with a virtual human called Amani ." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-27", "text": "The user plays the role of an Army commander whose unit has been attacked by a sniper." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-28", "text": "The user interviews Amani, who was a witness to the incident and has some information about the sniper." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-29", "text": "Amani is willing to tell the interviewer what she knows, but she will only reveal certain information in exchange for promises of safety, secrecy, and money )." 
}, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-30", "text": "Each dialogue turn in the data set includes a single user utterance followed by the response chosen by a human Amani role player." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-31", "text": "There are a total of 296 turns, for an average of 15.6 turns/dialogue." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-32", "text": "User utterances are modeled using 46 distinct speech act (SA) labels." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-33", "text": "The dataset also defines a different set of 96 unique SAs (responses) for Amani." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-34", "text": "Six external referees analyzed each user utterance and selected a single character response out of the 96 SAs." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-35", "text": "Thus the dataset defines a one-to-many mapping between user utterances and alternative system SAs." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-36", "text": "----------------------------------" }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-37", "text": "**EVALUATION METRIC**" }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-38", "text": "We evaluate the dialogue policies in our experiments through 19-fold cross-validation of our 19 dialogues." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-39", "text": "In each fold, we hold out one dialogue and use the remaining 18 as training data." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-40", "text": "To measure policy performance, we count an automatically produced system SA as correct if that SA was chosen by the original wizard or at least one external referee for that dialogue turn." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-41", "text": "We then count the proportion of the correct SAs among all the SAs produced across all 19 dialogues, and use this measure of weak accuracy to score dialogue policies." 
}, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-42", "text": "We can use the weak accuracy of one referee, measured against all the others, to establish a performance ceiling for this metric." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-43", "text": "This score is .79; see DeVault et al. (2011b) ." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-44", "text": "----------------------------------" }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-45", "text": "**BASELINE SYSTEMS**" }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-46", "text": "We consider two existing baseline systems in our experiments here." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-47", "text": "The first system (Rules-NLU-SA) consists of a statistical NLU module that maps a user utterance to a single user SA label, and a rule-based dialogue policy hand-crafted by one of the authors." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-48", "text": "The NLU uses a maximum-entropy model (Berger et al., 1996) to classify utterances as one of the user SAs using shallow text features." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-49", "text": "Training this model requires a corpus of user utterances that have been semantically annotated with the appropriate SA." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-50", "text": "We developed our rule-based policy by manually writing the simple rules needed to implement Amani's dialogue policy." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-51", "text": "Given a user SA label A t for turn t, the rules for determining Amani's response R t take one of three forms:" }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-52", "text": "The first rule form specifies that a given user SA should always lead to a given system response." 
}, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-53", "text": "The second and third rule forms enable the system's response to depend on the user having previously performed (or not performed) a specific SA." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-54", "text": "One the system developers, who is also a computational linguist, created the current set of 42 rules in about 2 hours." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-55", "text": "There are 30 rules of form (a), 6 rules of form (b), and 6 rules of form (c)." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-56", "text": "The second baseline system (RM-Text) is a statistical classifier that selects system SAs by analyzing shallow features of the user utterances and system responses." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-57", "text": "We use the Relevance Model (RM) approach pioneered by Lavrenko et al. (2002) for cross-lingual information retrieval and adapted to question-answering by Leuski et al. (2006) ." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-58", "text": "This method does not require semantic annotation or rule authoring; instead, the necessary training data is defined by linking user utterances directly to the appropriate system responses ." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-59", "text": "Table 1 summarizes the performance for the baseline systems (DeVault et al., 2011a) ." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-60", "text": "The NLU module accuracy is approximately 53%, and the weak accuracy of .58 for the corresponding system (Rules-NLU-SA) is relatively low when compared to the RM system at .71." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-61", "text": "For comparison we provide a third data point: for Rules-G-SA, we assume that our NLU is 100% accurate and always returns the correct (\"gold\") SA label." 
}, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-62", "text": "We then run the rulebased dialogue policy on those labels." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-63", "text": "The third column (Rules-G-SA) shows the resulting weak accuracy value, .79, which is comparable to the weak accuracy score achieved by the human referees (DeVault et al., 2011b" }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-64", "text": "----------------------------------" }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-65", "text": "**NLU ACCURACY AND SYSTEM PERFORMANCE**" }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-66", "text": "We conducted two experiments." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-67", "text": "In the first, we studied the effect of NLU accuracy on the performance of the Rules-NLU-SA system." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-68", "text": "One of our goals was to find how accurate the NLU would have to be for the Rules-NLU-SA system to outperform RM-Text." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-69", "text": "To investigate this, we simulated NLU performance at different accuracy levels by repeatedly sampling to create a mixture of the SAs from the trained NLU classifier and from the correct (gold) set of SAs." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-70", "text": "Specifically, we set a fixed value p ranging from 0 to 1 and then iterate over all dialogue turns in the held out dialogue, selecting the the correct SA label with probability p or the trained NLU module's output with probability 1 \u2212 p. Using the sampled set of SA labels, we compute the resulting simulated NLU accuracy, run the Rules dialogue policy, and record the weak accuracy result." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-71", "text": "We repeat the process 25 times for each value of p. We let p range from 0 to 1 in increments of .05 to explore a range of simulated accuracy levels." 
}, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-72", "text": "Figure 1 shows simulated NLU accuracy and the corresponding dialogue policy weak accuracy as a point in two dimensions." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-73", "text": "The points form a cloud with a clear linear trend that starts at approximately 53% NLU accuracy where it intersects with the Rules-NLU-SA system performance and then goes up to the Rules-G performance at 100% NLU accuracy." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-74", "text": "The correlation is strong with R 2 = 0.97." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-75", "text": "1 The existence of a mostly linear relationship comports with the fact that most of the policy rules (30 of 42), as described in Section 2.3, are of form (a)." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-76", "text": "For such rules, each individual correct NLU speech act translates directly into a single correct system response, with no dependence on the system having understood previous user utterances correctly." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-77", "text": "In contrast, selecting system responses that comply with rules in forms (b) and (c) generally requires correct understanding of multiple user utterances." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-78", "text": "Such rules create a nonlinear relationship between policy performance and NLU accuracy, but these rules are relatively few in number for Amani." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-79", "text": "The estimated linear trend line (in purple) crosses the RM-Text system performance at approximately 82% NLU accuracy." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-80", "text": "This result suggests that our NLU component would need to improve from its current accuracy of 53% to approximately 82% accuracy for the Rules-NLU-SA system to outperform the RM-Text classifier." 
}, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-81", "text": "This represents a very substantial increase in NLU accuracy that, in practice, could be expected to require a significant effort involving utterance data collection, semantic annotation, and optimization of machine learning for NLU." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-82", "text": "----------------------------------" }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-83", "text": "**HYBRID SYSTEM**" }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-84", "text": "In our second experiment we investigated the potential to integrate the Rules-NLU-SA and RM-Text systems together for better performance." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-85", "text": "Our approach draws on a confidence score \u03b8 from the NLU maximum-entropy classifier; specifically, \u03b8 is the probability assigned to the most probable user SA." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-86", "text": "Figure 2 shows an analysis of NLU accuracy, Rules-NLU-SA, and RM-Text that is restricted to those subsets of utterances for which NLU confidence \u03b8 is greater than or equal to some threshold \u03c4 ." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-87", "text": "Two important aspects of this figure are (1) that raising the minimum confidence threshold also raises the NLU accuracy on the selected subset of utterances; and (2) that there is a threshold NLU confidence level beyond which Rules-NLU-SA seems to outperform RM-Text." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-88", "text": "This confidence level is approximately 0.95, and it identifies a subset of user utterances for which NLU accuracy is 83.3%." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-89", "text": "These results therefore suggest that NLU confidence can be useful in identifying utterances for which NLU speech acts are more likely to be accurate and Rules-NLU-SA is more likely to perform well." 
}, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-90", "text": "To explore this further, we implemented a hybrid system that chooses between Rules-NLU-SA or RM-Text as follows." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-91", "text": "If the confidence score is high enough (\u03b8 \u2265 \u03c4 , for some fixed threshold \u03c4 ), the Hybrid system uses the NLU output to run the Rules dialogue policy to select the system SA; otherwise, it discards the NLU SA, and applies the RM classifier to select the system response directly." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-92", "text": "Figure 3 shows the plot of the Hybrid system performance as a function of the threshold value \u03c4 ." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-93", "text": "We see that with sufficiently high threshold value (\u03c4 \u2265 0.95) the Hybrid system outperforms both the Rules-NLU-SA and the RM-Text systems." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-94", "text": "The second line, labeled \"Mix\" and plotted against the secondary (right) axis, shows the proportion of the NLU SAs with the confidence score that exceed the threshold (\u03b8 \u2265 \u03c4 )." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-95", "text": "It indicates how often the Hybrid system prefers the Rules-NLU-SA output over the RM-Text system output." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-96", "text": "We observe that approximately 42 of the NLU outputs over all 296 dialogue turns (15%) have confidence values \u03b8 \u2265 0.95." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-97", "text": "However, for most of these dialogue turns the outputs for the Rules-NLU-SA and RM-Text dialogue policies are the same." 
}, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-98", "text": "While we observe a small improvement in the Hybrid system weak accuracy values over the RM-Text system at thresholds of 0.95 and higher, the difference is not statistically significant." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-99", "text": "Despite the lack of statistical significance in the initial Hybrid results in this small data set, we interpret the complementary evidence from both experiments, which support the potential for Rules-NLU-SA to perform well when NLU accuracy is high, and the potential for a hybrid system to identify a subset of utterances that are likely to be understood accurately at run-time, as indicating that a hybrid design is a promising avenue for future work." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-100", "text": "----------------------------------" }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-101", "text": "**CONCLUSIONS AND FUTURE WORK**" }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-102", "text": "We presented a case study analysis of how the level of performance that is achievable in an NLU module can provide perspective on the design choices for a modular dialogue system." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-103", "text": "We found that NLU accuracy must be substantially higher than it currently is in order for the Rules-NLU-SA design, which carries a greater annotation and rule authoring burden, to deliver better performance than the simpler RMText design." }, { "sent_id": "5f2f4087b80aa8dc3a5ccdb686983d-C001-104", "text": "We also presented evidence that a hybrid architecture could be a promising direction." 
} ], "y": { "@USE@": { "gold_contexts": [ [ "5f2f4087b80aa8dc3a5ccdb686983d-C001-25" ], [ "5f2f4087b80aa8dc3a5ccdb686983d-C001-43" ] ], "cite_sentences": [ "5f2f4087b80aa8dc3a5ccdb686983d-C001-25", "5f2f4087b80aa8dc3a5ccdb686983d-C001-43" ] } } }, "ABC_9af4a895dd4b45bb3827c74bdc7f05_37": { "x": [ { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-2", "text": "This study aims at determining whether collocational features automatically extracted from EFL (English as a foreign language) texts are useful for quality scoring, and allow the improvement of a competitive baseline based on, amongst other factors, bigram frequencies." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-3", "text": "The collocational features were gathered by assigning to each bigram in an EFL text eight association scores computed on the basis of a native reference corpus." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-4", "text": "The distribution of the association scores were then summarized by a few global statistical features and by a discretizing procedure." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-5", "text": "An experiment conducted on a publicly available dataset confirmed the effectiveness of these features and the benefit brought by using several discretized association scores." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-6", "text": "----------------------------------" }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-8", "text": "The importance of preformed units in language use is well established (Pawley and Syder, 1983; Schmitt, 2004; Sinclair, 1991) ." 
}, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-9", "text": "If some of these sequences belong to the traditional phraseological approach, signalled by their syntactic fixedness and semantic non-compositionality, the vast majority of them are conventional word combinations that display statistical idiomaticity (Baldwin and Kim, 2010; Smiskova et al., 2012) ." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-10", "text": "This phraseological dimension of language has important implications for learning a foreign language, as shown by many studies in applied linguistics." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-11", "text": "It not only distinguishes native speakers from nonnative ones, but the number of phraseological units in a learner text is related to the overall level of proficiency in the learned language (e.g., Forsberg, 2010; Levitzky-Aviad and Laufer, 2013; Santos et al., 2012; ." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-12", "text": "In these studies, a limited number of expressions were analysed in a small number of texts, giving a very detailed, but also very punctual, view of the phenomenon." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-13", "text": "In addition, the phraseological nature of a lexical sequence was determined manually using dictionaries or by asking native speakers, making the analysis of numerous texts difficult." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-14", "text": "These limitations were overcome by Durrant and Schmitt (2009) , who proposed 1 assigning to the bigrams present in an EFL text two association scores (ASs), computed on the basis of a large native reference corpus: (pointwise) Mutual Information (MI), which favours bigrams made up of low-frequency words, and the t-score, which highlights those composed of high-frequency words." 
}, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-15", "text": "They observed that, compared to native speakers, EFL learners tend to underuse collocations with high MI scores while overusing those with high t-scores." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-16", "text": "More recently, Bestgen and Granger (2014, 2015) and showed that these ASs distinguish advanced learners from intermediate learners, and that the average MI score and the proportion of bigrams in the text that are absent from the reference corpus were good predictors of text quality, but that the average t-score was much less successful." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-17", "text": "These studies have a major drawback: the effectiveness of phraseological indices was not compared to that of other features known to be effective predictors." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-18", "text": "It is therefore impossible to determine whether the phraseological indices are really effective and if they can improve the prediction when combined with other indices." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-19", "text": "This limitation is probably partly due to the fact that these analyses were not conducted in the field of automatic scoring, but in applied linguistics." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-20", "text": "In automatic scoring, phraseological expres-sions have long been used almost exclusively for detecting errors, a task for which they have been very useful (e.g., Chodorow and Leacock, 2000; Futagi et al., 2008; Wu et al., 2010) ." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-21", "text": "It is noteworthy that a feature tracking the correct use of collocations was considered for inclusion in eRater, but its usefulness for predicting text quality seems rather limited (Higgins et al., 2015) ." 
}, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-22", "text": "Very recently, however, Somasundaran and Chodorow (2014) and Somasundaran et al. (2015) Even if these results were extremely promising, they leave a number of questions unanswered." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-23", "text": "First, they were obtained by studying short oral responses." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-24", "text": "Can they be generalized to longer written texts, a situation that allows the learner to spend much more time on its production? Then one can wonder whether the use of MI is sufficient, or if additional benefits can be obtained by taking into account other associational measures for collocations." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-25", "text": "In this context, extracting richer features than the mean scores, as done by Somasundaran and Chodorow (2014) , seems particularly promising, because found that the best learner texts contain more middlelevel t-score bigrams and fewer low and high-level t-score bigrams." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-26", "text": "This observation may be related to the fact that the low t-score bigrams are often erroneous combinations of words, while high scores indicate extremely common bigrams in the language, which are easy to learn." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-27", "text": "It is therefore far from obvious that there is a simple linear or monotonic relationship between the distribution of the association scores (ASs) in a text and its quality." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-28", "text": "Finally, it would be interesting to determine whether using ASs extracted from a corpus of native texts enables a better prediction than that obtained by using the simple frequency of the unigrams and bigrams (Yannakoudakis et al., 2011) ." 
}, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-29", "text": "This study attempts to answer these questions by extracting from the bigrams in EFL texts richer features from several association measures as described in Section 2, and by comparing the effectiveness of these collocational features to that of lexical features (Section 3)." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-30", "text": "The conclusion proposes several paths for further research." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-31", "text": "----------------------------------" }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-32", "text": "**EXTRACTING COLLOCATION FEATURES**" }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-33", "text": "Somasundaran and Chodorow (2014) used only one AS, while Durrant and Schmitt (2009) used two, but there are many other ASs (Pecina, 2010) ." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-34", "text": "Evert (2009) recommends a heuristic approach by testing a series of ASs to keep the one that is most appropriate for the task at hand, while Pecina recommends using several ASs simultaneously." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-35", "text": "These recommendations were followed here by comparing the performance of eight ASs and by combining them (i.e., using simultaneously all of them in the feature set)." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-36", "text": "In addition to MI and tscore (Church et al., 1991) , the six following ASs were evaluated:" }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-37", "text": "1. MI3 (Daille, 1994), a heuristic modification of MI, proposed to reduce its tendency to assign inflated scores to rare words that occur together, 2." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-38", "text": "z (Berry-Rogghe, 1973), the signed squareroot of the cell contribution to the Pearson Chi-square for a 2x2 contingency table, 3. 
simple-ll (Evert, 2009), the signed cell contribution to the log-likelihood Chi-square test recommended by Dunning (1993) , 4." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-39", "text": "Fisher's exact test (Pedersen et al., 1996) , which corresponds to the probability of observing, under the null hypothesis of independence, at least as many collocations as the number actually observed, 5." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-40", "text": "Mutual rank ratio (mrr, Dean, 2005), a nonparametric measure that has been successful in detecting collocation errors in EFL texts (Futagi et al., 2008), 6 . logDice (Rychly, 2008), a logarithmic transformation of the Dice coefficient used in the Sketch Engine (Kilgarriff et al., 2014) ." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-41", "text": "In order to extract more information from the distribution of the ASs in each text than the mean or the median, Durrant and Schmitt (2009) and Somasundaran et al. (2015) used a standard procedure in descriptive statistics and automatic information processing known as discretization, binning or quantization (Garcia et al., 2013) ." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-42", "text": "It divides a continuous variable into bins and counts the proportion of scores that fall into each bin." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-43", "text": "In their analyses, the boundaries of the bins were manually and arbitrarily defined." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-44", "text": "This approach can be used for any AS, but it makes the comparison of the effectiveness of them difficult because a weaker performance may come from a less effective AS or from poorly chosen bin boundaries." 
}, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-45", "text": "To reduce the potential impact of the choice of boundaries, a very simple and completely automatic discretization procedure was used: the Equal Frequency Discretizer, which divides the sorted values into k intervals so that each interval contains approximately the same number of values (Dougherty et al., 1995) ." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-46", "text": "It is unsupervised and depends on only one parameter (i.e., the number of bins)." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-47", "text": "In the present study, it was applied separately for each AS, to every bigram present in the learners' texts and consists of two steps:" }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-48", "text": "1. Partitioning the distribution of scores in bins containing the same number of bigrams, 2. Computing for each text the proportion of bigrams whose AS falls into each bin, using as a denominator the total number of bigrams in the text." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-49", "text": "----------------------------------" }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-50", "text": "**EXPERIMENT**" }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-51", "text": "To assess the benefits of relying on collocational features to predict an EFL text's quality, an experiment was conducted." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-52", "text": "This section describes the corpus used, as well as the procedures for extracting the collocational and baseline features and for scoring the texts." 
}, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-53", "text": "----------------------------------" }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-54", "text": "**EXPERIMENT SETUP**" }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-55", "text": "Dataset: The analyses were conducted on the First Certificate in English (FCE) ESOL examination scripts described in Yannakoudakis et al. (2011 Yannakoudakis et al. ( , 2012 ." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-56", "text": "Extracted from the Cambridge Learner Corpus, this dataset consists of 1238 texts of between 200 and 400 words, to which an overall mark has been assigned." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-57", "text": "As in Yannakoudakis et al. (2011) , the 1141 texts from the year 2000 were used for training, while the 97 texts from the year 2001 were used for testing." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-58", "text": "Collocational Features: The global statistical features in Somasundaran et al. (2015) and were used: the mean, the median, the maximum and the minimum of the ASs, and the proportion of bigrams that are present in the learner text but absent from the reference corpus." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-59", "text": "Because the best number of bins for discretizing the distributions was not known, the following ones were compared: 3, 5, 8, 10, 15, 20, 25, 33, 50, 75 and 100. To get all these features, each learner text was tokenized and POS-tagged by means of CLAWS7 2 and all bigrams were extracted." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-60", "text": "Punctuation marks and any sequence of characters that did not correspond to a word interrupt the bigram extraction." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-61", "text": "Each bigram was then looked up in the 100 million word British National Corpus (BNC 3 ) and, if found, assigned its ASs." 
}, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-62", "text": "The collocational features were then computed on the basis of all the different bigrams present in each text (types) to give more weight to their diversity (Durrant and Schmitt, 2009 )." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-63", "text": "Lexical Features: As a benchmark for comparison, the lexical features that were showed to be good predictors of the quality of the texts in this dataset (Yannakoudakis et al., 2011) were chosen." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-64", "text": "They consist of the frequency of the word unigrams and bigrams." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-65", "text": "This baseline is particularly relevant because it includes the lexical bigrams that are the basis of the collocational features." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-66", "text": "These features were extracted as described in Yannakoudakis et al. (2011) ; the only difference is that they used the RASP tagger and not the CLAWS tagger." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-67", "text": "Supervised Learning Approach and Evaluation: As in Yannakoudakis et al. (2011) , the automated scoring task was treated as a rankpreference learning problem by means of the SVM-Rank package (Joachims, 2006) , which is a much faster version of the SVM-Light package used by Yannakoudakis et al. (2011) ." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-68", "text": "The procedure was identical to that described in their study." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-69", "text": "Since the quality ratings are distributed on a zero to 40 scale, I chose Pearson's correlation coefficient, also used by Yannakoudakis et al. (2011) , as the measure of performance." 
}, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-70", "text": "----------------------------------" }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-71", "text": "**RESULTS**" }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-72", "text": "Initial analyses focused on the interest of discretizing the ASs by assessing the benefits obtained when these features were added to the global statistical features." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-73", "text": "Collocational features were then compared to the lexical features and added to them to determine the maximum level of performance that could be achieved." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-74", "text": "----------------------------------" }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-75", "text": "**COLLOCATIONAL FEATURES**" }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-76", "text": "When no discretization procedure was used (the 0 row), Fisher was far more effective than the other ASs, followed by MI." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-77", "text": "Adding the discretized features led to far better performances (except for logDice), as shown by the Mean row." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-78", "text": "For a small number of bins, Fisher remained the best, but for an intermediate number, the best were t and simple-ll, and for a large number, z became competitive." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-79", "text": "Still, the differences between the best ASs were quite small." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-80", "text": "From eight bins and beyond, using all the ASs gave the best result, but the gain was relatively small." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-81", "text": "Regarding the number of bins, at least five seems necessary, but using many more did not harm performance." 
}, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-82", "text": "It is noteworthy that all the correlations reported in table 1 are much larger that the correlation of a baseline system based purely on length (r = 0.27)." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-83", "text": "To determine if the automatic procedure for discretizing the ASs is at least as effective as the bin boundaries manually set by Somasundaran et al. (2015) , I used them instead of the automatic bins for the model with eight bins based on MI." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-84", "text": "The correlation obtained was 0.60, a value slightly lower than that reported in Table 1 (0.61)." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-85", "text": "----------------------------------" }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-86", "text": "**COLLOCATIONAL AND BASELINE FEATURES**" }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-87", "text": "The lexical features used alone allowed a 0.68 correlation 4 ." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-88", "text": "These features are thus more effective than the best combinations of collocational features reported in Table 1 , but, as shown in Table 2 , adding the collocational features to the lexical ones produces far better performances." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-89", "text": "Steiner's ttest (Howell, 2008, p. 269-271) for comparing two non-independent correlations showed that collocational features significantly improve the prediction when compared to the baseline (all ps <0.005)." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-90", "text": "If MI is always one of the best performing ASs, the differences between the ASs are quite low." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-91", "text": "For all numbers of bins, using all the ASs allows the best performance." 
}, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-92", "text": "To get an idea of how well the collocational and lexical features perform, the correlations in Table 2 can be compared to the average correlation between the Examiners' scores reported by Yannakoudakis et al. (2011) , which give an upper bound of 0.80 while the All models with more than three bins obtain a correlation of at least 0.75." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-93", "text": "Adding collocational features to lexical ones thus reduces by 58% the difference between the lexical features alone and the upper bound." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-94", "text": "However, the most difficult part of the work is still to be done." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-95", "text": "----------------------------------" }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-96", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-97", "text": "Following on from Durrant and Schmitt (2009), Somasundaran and Chodorow (2014) and , this study confirms the benefits conferred by collocational features for the automated scoring of EFL texts." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-98", "text": "It also shows that these features improve a competitive baseline, based among other factors on the bigram frequenprocedure, the difference probably comes from the SVMRank/SVM-Light parameters." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-99", "text": "The SVM-Rank default settings were used except for the squared slacks for the L-norm (i.e., -p 2) because it provided a high performance without having to optimize other parameters such as C. cies in the texts." 
}, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-100", "text": "As proposed by Somasundaran and Chodorow (2014) , binning the AS distributions improves the efficiency and, as proposed by Durrant and Schmitt (2009) , considering several ASs also gives extra efficiency." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-101", "text": "Compared to , the binning allows t to be as effective as the MI." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-102", "text": "This result suggests that it might be interesting to analyse more thoroughly the complex relationship between the AS distributions in a text and its quality." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-103", "text": "It must be kept in mind that these observations result from the analysis of a single dataset and replications are more than desirable." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-104", "text": "It is also necessary to determine whether the collocational features can improve not only the baseline used here, but also a predictive model that includes many other features known for their effectiveness." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-105", "text": "Further developments are worth mentioning." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-106", "text": "Unlike Somasundaran et al. (2015) , I only used bigrams' collocational features." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-107", "text": "Whether adding trigrams would further improve the performance is an open question." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-108", "text": "Trying to answer it requires a thorough study of the association measures for ngrams longer than two words since they have received much less attention (Bestgen, 2014; Gries, 2010) ." 
}, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-109", "text": "It might also be interesting to evaluate other techniques to discretize the AS distributions, since this study rests on one of the simplest techniques." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-110", "text": "Further studies are also needed to better understand the impact of the combination of ASs." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-111", "text": "On the one hand, it is likely that some ASs are partially redundant and that keeping only one might be enough." }, { "sent_id": "9af4a895dd4b45bb3827c74bdc7f05-C001-112", "text": "On the other hand, it would be interesting to determine whether, rather than combining the AS bin proportions independently, it would be better to create the bins on the simultaneous basis of two or more ASs, such as one bin for the bigrams with high MI scores and medium t-scores." } ], "y": { "@MOT@": { "gold_contexts": [ [ "9af4a895dd4b45bb3827c74bdc7f05-C001-22" ] ], "cite_sentences": [ "9af4a895dd4b45bb3827c74bdc7f05-C001-22" ] }, "@BACK@": { "gold_contexts": [ [ "9af4a895dd4b45bb3827c74bdc7f05-C001-41" ] ], "cite_sentences": [ "9af4a895dd4b45bb3827c74bdc7f05-C001-41" ] }, "@USE@": { "gold_contexts": [ [ "9af4a895dd4b45bb3827c74bdc7f05-C001-58" ], [ "9af4a895dd4b45bb3827c74bdc7f05-C001-83" ] ], "cite_sentences": [ "9af4a895dd4b45bb3827c74bdc7f05-C001-58", "9af4a895dd4b45bb3827c74bdc7f05-C001-83" ] }, "@DIF@": { "gold_contexts": [ [ "9af4a895dd4b45bb3827c74bdc7f05-C001-106" ] ], "cite_sentences": [ "9af4a895dd4b45bb3827c74bdc7f05-C001-106" ] } } }, "ABC_dcfad33f4322738906e2fdffe2e721_37": { "x": [ { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-2", "text": "Various methods have been proposed for aligning texts in two or more languages such as the Canadian Parliamentary Debates (Hansards)." 
}, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-3", "text": "Some of these methods generate a bilingual lexicon as a by-product." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-4", "text": "We present an alternative alignment strategy which we call K-vec, that starts by estimating the lexicon." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-5", "text": "For example, it discovers that the English word fisheries is similar to the French p~ches by noting that the distribution of fisheries in the English text is similar to the distribution of p~ches in the French." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-6", "text": "K-vec does not depend on sentence boundaries." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-7", "text": "There have been quite a number of recent papers on parallel text: Brown et al" }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-8", "text": "----------------------------------" }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-9", "text": "**MOTIVATION**" }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-10", "text": "There have been quite a number of recent papers on parallel text: Brown et al (1990 Brown et al ( , 1991 Brown et al ( , 1993 , Chen (1993) , Church (1993) , , Dagan et al (1993) , Church (1991, 1993) , Isabelle (1992) , Kay and Rgsenschein (1993) , Klavans and Tzoukermann (1990) , Kupiec (1993) , Matsumoto (1991) , Ogden and Gonzales (1993) , Shemtov (1993) , Simard et al (1992) , WarwickArmstrong and Russell (1990) , Wu (to appear)." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-11", "text": "Most of this work has been focused on European language pairs, especially English-French." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-12", "text": "It remains an open question how well these methods might generalize to other language pairs, especially pairs such as English-Japanese and EnglishChinese." 
}, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-13", "text": "In previous work , we have reported some preliminary success in aligning the English and Japanese versions of the AWK manual (Aho, Kernighan, Weinberger (1980) ), using charalign (Church, 1993) , a method that looks for character sequences that are the same in both the source and target." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-14", "text": "The charalign method was designed for European language pairs, where cognates often share character sequences, e.g., government and gouvernement." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-15", "text": "In general, this approach doesn't work between languages such as English and Japanese which are written in different alphabets." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-16", "text": "The AWK manual happens to contain a large number of examples and technical words that are the same in the English source and target Japanese." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-17", "text": "It remains an open question how we might be able to align a broader class of texts, especially those that are written in different character sets and share relatively few character sequences." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-18", "text": "The K-vec method attempts to address this question." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-19", "text": "----------------------------------" }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-20", "text": "**THE K-VEC ALGORITHM**" }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-21", "text": "K-vec starts by estimating the lexicon." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-22", "text": "Consider the example: fisheries --~ p~ches." 
}, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-23", "text": "The K-vec algorithm will discover this fact by noting that the distribution of fisheries in the English text is similar to the distribution of p~ches in the French." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-24", "text": "The concordances for fisheries and p~ches are shown in Tables 1 and 2 (at the end of this paper)." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-25", "text": "1 1." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-26", "text": "These tables were computed from a small fragment of the Canadian Hansards that has been used in a number of other studies: Church (1993) and Simard et al (1992 show where the concordances were found in the texts." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-27", "text": "We want to know whether the distribution of numbers in Table 1 is similar to those in Table 2, and if so, we will suspect that fisheries and p~ches As can be seen in the concordances in Table 3 , for K=10, the vector is <1, 1, 0, 1, 1,0, 1, 0, 0, 0>. By almost any measure of similarity one could imagine, this vector will be found to be quite different from the one for fisheries, and therefore, we will correctly discover that fisheries is not the translation of lections." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-28", "text": "To make this argument a little more precise, it might help to compare the contingency matrices in Tables 5 and 6 ." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-29", "text": "The contingency matrices show: (a) the number of pieces where both the English and French word were found, (b) the number of pieces where just the English word was found, (c) the number of pieces where just the French word was found, and (d) the number of peices where neither word was found." 
}, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-30", "text": "In general, if the English and French words are good translations of one another, as in Table 5 , then a should be large, and b and c should be small." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-31", "text": "In contrast, if the two words are not good translations of one another, as in Table 6 , then a should be small, and b and c should be large." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-32", "text": "----------------------------------" }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-33", "text": "**MUTUAL INFORMATION**" }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-34", "text": "Intuitively, these statements seem to be true, but we need to make them more precise." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-35", "text": "One could have chosen quite a number of similarity metrics for this purpose." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-36", "text": "We use mutual information:" }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-37", "text": "That is, we want to compare the probability of seeing fisheries and p~ches in the same piece to chance." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-38", "text": "The probability of seeing the two words in the same piece is simply:" }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-39", "text": "The marginal probabilities are:" }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-40", "text": "Thus, the mutual information is log25 or 2.32 bits, meaning that the joint probability is 5 times more likely than chance." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-41", "text": "In contrast, for fisheries ~ lections, prob ( V f, V p ) = O, prob(Vf) =0.5 and prob(Vp) = 0.4." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-42", "text": "Thus, the mutual information is log 2 0, meaning that the joint is infinitely less likely than chance." 
}, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-43", "text": "We conclude that it is quite likely that fisheries and p~ches are translations of one another, much more so than fisheries and lections." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-44", "text": "----------------------------------" }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-45", "text": "**SIGNIFICANCE**" }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-46", "text": "Unfortunately, mutual information is often unreliable when the counts are small." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-47", "text": "For example, there are lots of infrequent words." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-73", "text": "The alignment program tracks this line with as much precision as possible." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-74", "text": "3. The low frequency words (frequency less then 3) would have been rejected anyways as insignificant." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-75", "text": "----------------------------------" }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-76", "text": "**CONCLUSIONS**" }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-77", "text": "The K-vec algorithm generates a quick-and-dirty estimate of a bilingual lexicon." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-78", "text": "This estimate could be used as a starting point for a more detailed alignment algorithm such as word_align ." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-79", "text": "In this way, we might be able to apply word_align to a broader class of language combinations including possibly English-Japanese and English-Chinese." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-80", "text": "Currently, word_align depends on charalign (Church, 1993) to generate a starting point, which limits its applicability to European languages since char_align was designed for language pairs that share a common alphabet." 
}, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-81", "text": "----------------------------------" }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-82", "text": "**REFERENCES**" }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-83", "text": "Aho, Kernighan, Weinberger (1980) \"The AWK Programming" }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-84", "text": "Language,\" Addison-Wesley, Reading, Massachusetts, USA. private sector is quite weak." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-85", "text": "1,ct us turn now to fisheries, an industry which as most important 1o The fishermen would like to see the l)epartment of Fisheries and Oceans put more effort towards the p s in particular." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-86", "text": "The budget of the Department of Fisheries and Oceans has been reduced to such ate ' habitation ' ' trom which to base his trade in fisheries and filrs." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-87", "text": "He brought wilh him the first ase .just outside of my riding." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-88", "text": "The Department of Fisheries and Oceans provides employmeut for many and all indications are that the riclmess ot' its fisheries resource will enable it to maintain its taxpayer." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-89", "text": "The role of file federal Department of Fisheries and Oceans is central to the concerns of is the new Chainnan of the Standing Committee on Fisheries and Oceans." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-90", "text": "I am sure he will bring a w ortunity to discuss it with me as a member of the Fisheries Committee." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-91", "text": "The Hon. Member asked what he proposal has been submitted to the Minister of Fisheries and Oceans ( Mr. Siddon ) which I hope ch as well as on his selection as Chairman of the Fisheries Committee." 
}, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-92", "text": "I have workexl with Mr. Come his intense interest and expertise in the area of fisheries." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-93", "text": "It seems most appropriate, given that r from Eastern Canada and the new Chairman of the Fisheries and Oceans Committee." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-94", "text": "We know that the d Oceans Committee." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-95", "text": "We know that the Minister of Fisheries and Oceans ( Mr. Siddon ), should we s ows the importance of research and development to fisheries and oceans." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-96", "text": "Is he now ready to tell the research and development component in the area of fisheries and oceans at Bedford, in order that th" }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-48", "text": "If we pick a pair of these words at random, there is a very large chance that they would receive a large mutual information value by chance." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-49", "text": "For example, let e be an English word that appeared just once and letfbe a French word that appeared just once." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-50", "text": "Then, there a non-trivial chance (-~) that e andf will appear is in the same piece, as shown in Table 7 . If this should happen, the mutual information estimate would be very large, i.e., logK, and probably misleading." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-51", "text": "(Vf,gp) Using the numbers in Table 7 , t=l, which is not significant." 
}, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-52", "text": "(A t of 1.65 or more would be significant at the p > 0.95 confidence level.)" }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-53", "text": "Similarly, if e and f appeared in just two pieces 1 each, then there is approximately a ~ chance that they would both appear in the same two pieces, and then the mutual information score would be quite log,, ~--, but we probably wouldn't believe it high, Z." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-54", "text": "because the t-score would be only \"~-." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-55", "text": "By this definition of significance, we need to see the two words in at least 3 different pieces before the result would be considered significant." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-56", "text": "This means, unfortunately, that we would reject fisheries --+ p~ches because we found them in only two pieces." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-57", "text": "The problem, of course, is that we don't have enough pieces." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-58", "text": "When K=10, there simply isn't enough resolution to see what's going on." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-59", "text": "At K=100, we obtain the contingency matrix shown in Table 8 , and the t-score is significant (t=2.1)." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-60", "text": "Ideally, we would like to apply the K-vec algorithm to all pairs of English and French words, but unfortunately, there are too many such pairs to consider." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-61", "text": "We therefore limited the search to pairs of words in the frequency range: 3-10." 
}, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-62", "text": "This heuristic makes the search practical, and catches many interesting pairs)" }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-63", "text": "----------------------------------" }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-64", "text": "**RESULTS**" }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-65", "text": "This algorithm was applied to a fragment of the Canadian Hansards that has been used in a number of other studies: Church (1993) and Simard et al (1992) ." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-66", "text": "The 30 significant pairs with the largest mutual information values are shown in Table 9 ." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-67", "text": "As can be seen, the results provide a quick-anddirty estimate of a bilingual lexicon." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-68", "text": "When the pair is not a direct translation, it is often the translation of a collocate, as illustrated by acheteur ~ Limited and Santd -~ Welfare." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-69", "text": "(Note that some words in Table 9 are spelled with same way in English and French; this information is not used by the K-vec algorithm)." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-70", "text": "The equality constraint is relaxed in Figure 2 ." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-71", "text": "A dot is placed in position i,j whenever the input token at position i is highly associated with the input token at position j as determined by the mutual information score of their respective Kvecs." }, { "sent_id": "dcfad33f4322738906e2fdffe2e721-C001-72", "text": "In addition, it shows a detailed, magnified and rotated view of the diagonal line." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "dcfad33f4322738906e2fdffe2e721-C001-10" ], [ "dcfad33f4322738906e2fdffe2e721-C001-13" ], [ "dcfad33f4322738906e2fdffe2e721-C001-80" ] ], "cite_sentences": [ "dcfad33f4322738906e2fdffe2e721-C001-10", "dcfad33f4322738906e2fdffe2e721-C001-13", "dcfad33f4322738906e2fdffe2e721-C001-80" ] }, "@MOT@": { "gold_contexts": [ [ "dcfad33f4322738906e2fdffe2e721-C001-10", "dcfad33f4322738906e2fdffe2e721-C001-11", "dcfad33f4322738906e2fdffe2e721-C001-12" ], [ "dcfad33f4322738906e2fdffe2e721-C001-13", "dcfad33f4322738906e2fdffe2e721-C001-15" ] ], "cite_sentences": [ "dcfad33f4322738906e2fdffe2e721-C001-10", "dcfad33f4322738906e2fdffe2e721-C001-13" ] }, "@USE@": { "gold_contexts": [ [ "dcfad33f4322738906e2fdffe2e721-C001-26" ], [ "dcfad33f4322738906e2fdffe2e721-C001-65" ], [ "dcfad33f4322738906e2fdffe2e721-C001-80" ] ], "cite_sentences": [ "dcfad33f4322738906e2fdffe2e721-C001-26", "dcfad33f4322738906e2fdffe2e721-C001-65", "dcfad33f4322738906e2fdffe2e721-C001-80" ] } } }, "ABC_2d7e98487698b0b6ae85f052402f7c_38": { "x": [ { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-2", "text": "Dialogue act recognition is an important part of natural language understanding." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-3", "text": "We investigate the way dialogue act corpora are annotated and the learning approaches used so far." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-4", "text": "We find that the dialogue act is context-sensitive within the conversation for most of the classes." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-5", "text": "Nevertheless, previous models of dialogue act classification work on the utterance-level and only very few consider context." 
}, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-6", "text": "We propose a novel context-based learning method to classify dialogue acts using a character-level language model utterance representation, and we notice significant improvement." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-7", "text": "We evaluate this method on the Switchboard Dialogue Act corpus, and our results show that the consideration of the preceding utterances as a a context of the current utterance improves dialogue act detection." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-8", "text": "----------------------------------" }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-10", "text": "In natural language processing research, the dialogue act (DA) concept plays an important role." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-11", "text": "Its recognition, in most cases, is considered a lexical-based or syntax-based classification at utterance-level." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-12", "text": "However, the discourse compositionality is context sensitive, meaning that the DA of an utterance can be elicited from the preceding utterances (Grosz, 1982) ." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-13", "text": "Hence, classifying only utterances is not enough because their DA class arises from their context." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-14", "text": "For example, the utterance containing only the lexical entry 'yeah' might appear in several DA classes such as Backchannel, Yes-Answer, etc." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-15", "text": "For certain DA classes, the utterances are short, and most of them share similar lexical and syntactic cues ." 
}, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-16", "text": "The aim of this article has two subgoals: first, we investigate the annotation process of DA corpora and review the modelling so far used for DA classification, and second, we present a novel model and compare its results with the state of the art." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-17", "text": "We propose to use context-based learning for the identification of the DA classes." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-18", "text": "First, we show the results without context, i.e., classifying only utterances." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-19", "text": "Including context leads to 3% higher accuracy." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-20", "text": "We use a simple recurrent neural network (RNN) for context learning of the discourse compositionality." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-21", "text": "We feed the preceding and current utterances to the RNN model to predict its DA class." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-22", "text": "The main contributions of this work are as follows: -We provide detailed insight on the annotation and modelling of dialogue act corpora." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-23", "text": "We suggest to model discourse within the context of a conversation." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-24", "text": "-We propose a context-based learning approach for DA identification." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-25", "text": "In our approach, we represent utterances by a character-level language model trained on domainindependent data." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-26", "text": "-We evaluate the model on the Switchboard Dialogue Act (SwDA 1 ) corpus and show how using context affects the results." 
}, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-27", "text": "For the SwDA corpus, our model achieved an accu-1 Available at https://github.com/cgpotts/swda racy of 77.3% compared to 73.9% as state of the art, where the context-based learning is used for the DA classification (Kalchbrenner and Blunsom, 2013 )." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-28", "text": "-Benefits of using context arise from using only a few preceding utterances making the model suitable for dialogue system in real time, in contrast to feeding the whole conversation, which can achieve high accuracy, but includes future utterances (Liu et al., 2017; Kumar et al., 2017) ." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-29", "text": "----------------------------------" }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-30", "text": "**RELATED WORK**" }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-31", "text": "----------------------------------" }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-32", "text": "**ANNOTATION OF DIALOGUE ACT CORPORA**" }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-33", "text": "Annotation Process and Standards: Research on dialogue acts became important with the commercial reality of spoken dialogue systems." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-34", "text": "There have been many taxonomies to it: speech act (Austin, 1962) which was later modified into five classes (Assertive, Directive, Commissive, Expressive, Declarative) (Searle, 1979) , and the Dialogue Act Markup in Several Layers (DAMSL) tag set where each DA has a forward-looking function (such as Statement, Info-request, Thanking) and a backwardlooking function (such as Accept, Reject, Answer) (Allen and Core, 1997) ." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-35", "text": "There are many such standard taxonomies and schemes to annotate conversational data, some of them follow the concept of discourse compositionality." 
}, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-36", "text": "These schemes are important for analysing dialogues or building a dialogue system (Skantze, 2007) ." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-37", "text": "However, there can never be a unique scheme that considers all aspects of dialogue." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-38", "text": "Corpus Insight: We have investigated the annotation method for two corpora: Switchboard (SWBD) (Godfrey et al., 1992; Jurafsky et al., 1997) and ICSI Meeting Recorder Dialogue Act (MRDA) (Shriberg et al., 2004) ." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-39", "text": "They are annotated with the DAMSL tag set." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-40", "text": "The annotation includes not only the utterance-level but also the segmentedutterance labelling." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-41", "text": "The DAMSL tag set provides very fine-grained and detailed DA classes and follows the discourse compositionality." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-42", "text": "For example, the SWBD-DAMSL is the variant of DAMSL specific to the Switchboard cor-" }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-43", "text": "----------------------------------" }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-44", "text": "**NATURE OF DISCOURSE IN CONVERSATION:**" }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-45", "text": "The dialogue act is a context-based discourse concept that means the DA class of a current utterance can be derived from its preceding utterance." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-46", "text": "We will elaborate this argument with an example given in Table 1 ." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-47", "text": "Speaker A utters 'Oh, yeah.' twice in the first portion, and each time it is labelled with two different DA labels." 
}, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-48", "text": "This is simply due to the context of the previously conversed utterances." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-49", "text": "If we see the last four utterances of the example, when speaker A utters the 'Yes-No Question' DA, speaker B answers with 'yeah' which is labelled as 'Yes-Answer' DA." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-50", "text": "However, after the 'Statementopinion' from the same speaker, the same utterance 'yeah' is labelled as 'Backchannel' and not 'Yes-Answer'." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-51", "text": "This gives evidence that when we process the text of a conversation, we can see the context of a current utterance in the preceding utterances." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-52", "text": "Prosodic Cues for DA Recognition: It has also been noted that prosodic knowledge plays a major role in DA identification for certain DA types Stolcke et al., 2000) ." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-53", "text": "The main reason is that the acoustic signal of the same utterance can be very different in a different DA class." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-54", "text": "This indicates that if one wants to classify DA classes only from the text, the context must be an important aspect to consider: simply classifying single utterances might not be enough, but considering the preceding utterances as a context is important." 
}, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-55", "text": "----------------------------------" }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-56", "text": "**MODELLING APPROACHES**" }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-57", "text": "Lexical, Prosodic, and Syntactic Cues: Many studies have been carried out to find out the lexical, prosodic and syntactic cues (Stolcke et al., 2000; Surendran and Levow, 2006; O'Shea et al., 2012; Yang et al., 2014) ." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-58", "text": "For the SwDA corpus, the state-of-the-art baseline result was 71%" }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-59", "text": "for more than a decade using a standard Hidden Markov Model (HMM) with language features such as words and n-grams (Stolcke et al., 2000) ." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-60", "text": "The inter-annotator agreement accuracy for the same corpus is 84%, and in this particular case, we are still far from achieving human accuracy." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-61", "text": "However, words like 'yeah' appear in many classes such as backchannel, yes-answer, agree/accept etc." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-62", "text": "Here, the prosodic cues play a very important role in identifying the DA classes, as the same utterance can acoustically differ a lot which helps to distinguish the specific DA class (Shriberg et al., 1998) ." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-63", "text": "There are several approaches like traditional Naive Bayes and HMM models, which use minimal information and certainly ignore the dependency of the context within the communication (Grau et al., 2004; Tavafi et al., 2013) ." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-64", "text": "They achieved 66% and 74.32% respectively on the SwDA test set." 
}, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-65", "text": "Utterance-level Classification: Perhaps most research in modelling dialogue act identification is conducted at utterance-level (Stolcke et al., 2000; Grau et al., 2004; Tavafi et al., 2013; Ji et al., ; Khanpour et al., 2016; Lee and Dernoncourt, 2016) ." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-66", "text": "The emergence of deep learning also gave a big push to DA classification." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-67", "text": "In a natural language conversation, most utterances are very short; hence it is also referred to as short text classification." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-68", "text": "Lee and Dernoncourt (2016) achieved 73.1% accuracy on the SwDA corpus by using advanced deep learning frameworks such as RNNs and convolutional neural networks (CNN) with word-level feature embeddings." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-69", "text": "A Novel Approach: Context-based Learning: Classifying the DA classes at single utterance-level might fail when it comes to DA classes where the utterances share similar lexical and syntactic cues (words and phrases) like the backchannel, yes-answer and accept/agree classes." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-70", "text": "Some researchers proposed an utterance-dependent learning approach (Kalchbrenner and Blunsom, 2013; Ji et al., ; Kumar et al., 2017; Tran et al., 2017; Liu et al., 2017; Ortega and Vu, 2017; Meng et al., 2017) ." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-71", "text": "Kalchbrenner and Blunsom (2013) and Ortega and Vu (2017) have proposed contextbased learning, where they represent the utterance as a compressed vector of the word embeddings using CNNs and use these utterance representations to model discourse within a conversation using RNNs." 
}, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-72", "text": "In their architecture, I c a n i m a g i n e ." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-73", "text": "Figure 1: (a) Multiplicative LSTM (mLSTM) characterlevel language model to produce the sentence representation s t ." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-74", "text": "The character-level language model is pre-trained and produces the feature (hidden unit states of mLSTM at the last character) or average (average of all hidden unit states of every character) vector representation of the given utterance." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-75", "text": "(b) Utterance-level classification using a simple MLP layer with a softmax function (our baseline model)." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-76", "text": "they also give importance to turn-taking by providing the speaker identity but do not analyse their model in this regard." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-77", "text": "This approach achieves about 73.9% accuracy on the SwDA corpus." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-78", "text": "In another line of research (Ji et al., ; Kumar et al., 2017) , authors claim that their models take care of the dependency of the utterances within a conversation." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-79", "text": "Ji et al. (2016) use discourse annotation for the word-level language modelling on the SwDA corpus and also highlight a limitation that this approach is not scalable to large data." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-80", "text": "In other approaches a hierarchical convolutional and recurrent neural encoder model are used to learn utterance representation by feeding a whole conversation (Kumar et al., 2017; Liu et al., 2017) ." 
}, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-81", "text": "The utterance representations are further used to classify DA classes using the conditional random field (CRF) as a linear classifier." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-82", "text": "The model can see the past and future utterances at the same time within a conversation, which limits usage in a dialogue system where one can only perceive the preceding utterance as a context but does not know the upcoming utterances." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-83", "text": "Hence, we use a context-based learning approach and regard the 73.9% accuracy (Kalchbrenner and Blunsom, 2013) on the SwDA corpus as a current state of the art for this task." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-84", "text": "----------------------------------" }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-85", "text": "**OUR APPROACH**" }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-86", "text": "Our approach takes care of discourse compositionality while recognising dialogue acts." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-87", "text": "The DA class of the current utterance is predicted using the context of the preceding utterances." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-88", "text": "We represent each utterance by the hidden state of the multiplicative recurrent neural network trained on domain-independent data using a character-level language model." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-89", "text": "We use RNNs to feed the sequence of the utterances and eventually predict the DA class of the corresponding utterance." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-90", "text": "Figure 2 : The RNN setup for learning the dialogue act recognition with the previous sentences as context." 
}, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-91", "text": "s t is an utterance representation derived with a character-level language model and has a dialogue act label da t ." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-92", "text": "s t\u22121 and s t\u22122 are the preceding utterances of s t ." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-93", "text": "The RNN is trained to learn the recurrency through previous utterances s t\u22121 and s t\u22122 derived as h t\u22121 and h t\u22122 as a context to recognize the dialogue act of current utterance s t which is represented by h t used to detect da t ." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-94", "text": "----------------------------------" }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-95", "text": "**UTTERANCE REPRESENTATION**" }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-96", "text": "Character-level encoding allows processing words and whole sentences based on their smallest units and still capturing punctuation and permutation of words." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-97", "text": "We represent a character-level utterance by encoding the whole sentence with a pre-trained character language model 2 ." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-98", "text": "This model consists of a single multiplicative long-short-term memory (mLSTM) network (Krause et al., 2016) layer with 4,096 hidden units." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-99", "text": "The mLSTM is composed of an LSTM and a multiplicative RNN and considers each possible input in a recurrent transition function." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-100", "text": "It is trained as a character language model on \u223c80 million Amazon product reviews (Radford et al., 2017) ." 
}, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-101", "text": "We sequentially input the characters of an utterance to the mLSTM and take the hidden state values after the last character as shown in Figure 1 (a) ." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-102", "text": "The hidden vector s t obtained after the last character is called the last feature vector, as it stores the information related to the character language model and the sentiment of the utterance." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-103", "text": "However, it was shown that the average vector over all characters in the utterance works better for emotion detection (Lakomkin et al., 2017) ." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-104", "text": "Hence, we extract the last feature vector and also the average feature vector representations for each utterance." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-105", "text": "We classify these representations with a multi-layer perceptron (MLP) as shown in Figure 1 (b) ." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-106", "text": "The results are shown in Table 2 ." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-107", "text": "The standard deviation (SD) is computed over ten runs." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-108", "text": "The average vector seems to carry more information related to the DA; hence we use it for future experiments." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-109", "text": "There is an advantage of using domain-independent data: it is rich regarding features being trained on big data, perhaps surpassing the limitation of scalability as mentioned in Ji et al. (2016) ." 
}, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-110", "text": "----------------------------------" }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-111", "text": "**CONTEXT LEARNING WITH RNNS**" }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-112", "text": "We apply context learning with the help of RNNs." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-113", "text": "As shown in Figure 2 , the utterances with their character-level language model representation s t are fed to the RNN with the preceding utterances (s t\u22121 , s t\u22122 ) being the context." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-114", "text": "We use the RNN, which gets the input s t , and stores the hidden vector h t at time t (Elman, 1990) , which is calculated as:" }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-115", "text": "where f () is a sigmoid function, W h and I are recurrent and input weight matrices respectively and b is a bias vector learned during training." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-116", "text": "h t is computed using the previous hidden vector h t\u22121 which is computed in a same way for preceding utterance s t\u22121 ." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-117", "text": "The output da t is the dialogue act label of the current utterance s t calculated using h t , as:" }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-118", "text": "where W out is the output weight matrix." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-119", "text": "The weight matrices are learned using back-propagation through time." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-120", "text": "The task is to classify several classes; hence we use a sof tmax function g() on the output." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-121", "text": "The input is the sequence of the current and preceding utterances, e.g., s t , s t\u22121 , and s t\u22122 ." 
}, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-122", "text": "We reset the RNN when it sees the current utterance s t ." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-123", "text": "We also give the information related to a speaker to let the network find the change in the speaker's turn." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-124", "text": "The speaker id 'A' is represented by [1, 0] and id 'B' by [0, 1] and it is concatenated with the corresponding utterances s t ." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-125", "text": "The Adam optimiser (Kingma and Ba, 2014) was used with a learning rate 1e \u2212 4, which decays to zero during training, and clipping gradients at norm 1." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-126", "text": "Early stopping was used to avoid over-fitting of the network, 20% of training samples were used for validation." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-127", "text": "In all learning cases, we minimise the categorical cross-entropy." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-128", "text": "----------------------------------" }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-129", "text": "**RESULTS**" }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-130", "text": "We follow the same data split of 1115 training and 19 test conversations as in the baseline approach (Stolcke et al., 2000; Kalchbrenner and Blunsom, 2013) ." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-131", "text": "Table 3 shows the results of the proposed model with several setups, first without the context, then with one, two, and so on preceding utterances in the context." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-132", "text": "We examined different values for the number of the hidden units of the RNN, empirically 64 was identified as best and used throughout the experiments." 
}, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-133", "text": "We also experimented with the various representations for the speaker id that is concatenated with the respective utterances but could find no differences." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-134", "text": "As a result, our proposed model uses minimal information for the context." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-135", "text": "The performance increases from 74% to about 77% with context." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-136", "text": "We run each experiment for ten times and take the average." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-137", "text": "The model shows robustness providing minimal variance, and using a minimum number of preceding utterances as a context can produce fair results." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-138", "text": "----------------------------------" }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-139", "text": "**CONCLUSION**" }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-140", "text": "In this article, we detail the annotation and modelling of dialogue act corpora, and we find that there is a difference in the way DAs are annotated and the way they are modelled." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-141", "text": "We argue to generalise the discourse modelling for conversation within the context of communication." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-142", "text": "Hence, we propose to use the context-based learning approach for the DA identification task." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-143", "text": "We used simple RNN to model the context of preceding utterances." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-144", "text": "We used the domainindependent pre-trained character language model to represent the utterances." 
}, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-145", "text": "We evaluated the proposed model on the Switchboard Dialogue Act corpus and show the results with and without context." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-146", "text": "For this corpus, our model achieved an accuracy of 77.34% with context compared to 73.96% without context." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-147", "text": "We also compare our model with Kalchbrenner and Blunsom (2013) who used the contextbased learning approach achieving 73.9%." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-148", "text": "Our model uses minimal information, such as the context of a few preceding utterances which can be adapted to an online learning tool such as a spoken dialogue system where one can naturally see the preceding utterances but not the future ones." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-149", "text": "This makes our model suitable for human-robot/computer interaction which can be easily plugged into any spoken dialogue system." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-150", "text": "----------------------------------" }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-151", "text": "**ACKNOWLEDGEMENTS**" }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-152", "text": "This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement number 642667 (SECURE)." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-153", "text": "----------------------------------" }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-154", "text": "**BIBLIOGRAPHICAL REFERENCES APPENDIX: ANALYSIS OF THE STATE OF THE RNN**" }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-155", "text": "We also analyze the internal state h t of the RNNs for a two-utterance setup." 
}, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-156", "text": "We plot them on a 2D graph with the t-SNE algorithm for the first 2,000 utterances of the SwDA test set." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-157", "text": "Figure 3 shows the clusters of all the DA classes." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-158", "text": "The classes which do not share any information are grouped without any interference such as Non-verbal, and Abandoned." }, { "sent_id": "2d7e98487698b0b6ae85f052402f7c-C001-159", "text": "Figure 4 shows some particular classes with utterances in their vector spaces, the (1) current utterance and (2) a preceding utterance in the context." } ], "y": { "@BACK@": { "gold_contexts": [ [ "2d7e98487698b0b6ae85f052402f7c-C001-52", "2d7e98487698b0b6ae85f052402f7c-C001-53", "2d7e98487698b0b6ae85f052402f7c-C001-54" ], [ "2d7e98487698b0b6ae85f052402f7c-C001-57" ], [ "2d7e98487698b0b6ae85f052402f7c-C001-58", "2d7e98487698b0b6ae85f052402f7c-C001-59", "2d7e98487698b0b6ae85f052402f7c-C001-60", "2d7e98487698b0b6ae85f052402f7c-C001-61" ] ], "cite_sentences": [ "2d7e98487698b0b6ae85f052402f7c-C001-52", "2d7e98487698b0b6ae85f052402f7c-C001-57", "2d7e98487698b0b6ae85f052402f7c-C001-59" ] }, "@MOT@": { "gold_contexts": [ [ "2d7e98487698b0b6ae85f052402f7c-C001-58", "2d7e98487698b0b6ae85f052402f7c-C001-59", "2d7e98487698b0b6ae85f052402f7c-C001-60", "2d7e98487698b0b6ae85f052402f7c-C001-61" ] ], "cite_sentences": [ "2d7e98487698b0b6ae85f052402f7c-C001-59" ] }, "@USE@": { "gold_contexts": [ [ "2d7e98487698b0b6ae85f052402f7c-C001-130" ] ], "cite_sentences": [ "2d7e98487698b0b6ae85f052402f7c-C001-130" ] } } }, "ABC_772cdd4263cf8979a54cc5196e5853_38": { "x": [ { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-76", "text": "**EXPERIMENTS**" }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-121", "text": "**CONCLUSIONS AND FUTURE WORK**" }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-1", "text": "**ABSTRACT**" }, { "sent_id": 
"772cdd4263cf8979a54cc5196e5853-C001-2", "text": "Radically different approaches have been proved to be effective for phrase-structure and dependency parsers in the last decade." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-3", "text": "Here, we aim to exploit the divergence in these approaches and show the utility of features extracted from the automatic dependency parses of sentences for a discriminative phrase-structure parser." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-4", "text": "Our experiments show a significant improvement over the state-of-the-art German discriminative constituent parser." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-5", "text": "----------------------------------" }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-7", "text": "Both phrase-structure and dependency parsers have developed a lot in the last decade (Nivre et al., 2004; McDonald et al., 2005; Charniak and Johnson, 2005; Huang, 2008) ." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-8", "text": "Different approaches have been proved to be effective for these two parsing tasks which has implicated a divergence between techniques used (and a growing gap between researcher communities)." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-9", "text": "In this work, we exploit this divergence and show the added value of features extracted from automatic dependency parses of sentences for a discriminative phrase-structure parser." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-10", "text": "We report results on German phrase-structure parsing, however, we note that the reverse direction of our approach -i.e." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-11", "text": "defining features from automatic phrase-structure parses for discriminative dependency parsers -is also manifest which we will address as future work." 
}, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-12", "text": "Some generative parsing approaches exploited the difference between phrase-structure and dependency parsers." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-13", "text": "For instance, Klein and Manning (2003) introduced an approach where the objective function is the product of the probabilities of a generative phrase-structure and a dependency parsers." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-14", "text": "Model 1 of Collins (2003) is based on the dependencies between pairs of head words." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-15", "text": "On the other hand, the related work on this topic for discriminative parsing is sparse, we are only aware of the following works." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-16", "text": "Carreras et al. (2008) and Koo et al. (2010) introduced frameworks for joint learning of phrase-structure and dependency parsers and showed improvements on both tasks for English." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-17", "text": "These frameworks require special formulation of -one or both -parsing approaches while our simple approach allows the usage of arbitrary dependency parsers and any feature-based phrase-structure parser." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-18", "text": "Wang and Zong (2010) used automatic dependency parses for pruning the chart of a phrase-structure parser and reported a significant improvement." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-19", "text": "One of our feature templates can be regarded as the generalization of this approach." 
}, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-20", "text": "----------------------------------" }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-21", "text": "**FEATURE-RICH PARSE RERANKING**" }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-22", "text": "The most successful supervised phrase-structure parsers are feature-rich discriminative parsers which heavily depend on an underlying PCFG (Charniak and Johnson, 2005; Huang, 2008) ." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-23", "text": "These approaches consists of two stages." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-24", "text": "At the first stage they apply a PCFG to extract possible parses." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-25", "text": "The full set of possible parses cannot be iterated through in practice, and is usually pruned as a consequence." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-26", "text": "The n-best list parsers keep just the 50-100 best parses according to the PCFG." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-27", "text": "Other methods remove nodes and hyperedges whose posterior probability is under a predefined threshold from the forest (chart)." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-28", "text": "The task of the second stage is to select the best parse from the set of possible parses (i.e. rerank this set)." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-29", "text": "These methods employ a large feature set (usually a few millions features) (Collins, 2000; Charniak and Johnson, 2005) ." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-30", "text": "The n-best list approaches can straightforwardly employ local and non-local features as well because they decide at the sentence-level (Charniak and Johnson, 2005) ." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-31", "text": "Involving non-local features is more complicated in the forest-based approaches." 
}, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-32", "text": "The conditional random field methods usually use only local features (Miyao and Tsujii, 2002; Finkel et al., 2008) ." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-33", "text": "Huang (2008) introduced a beam-search and average perceptron-based procedure for incorporating them, however his empirical results show only minor improvement from incorporating non-local features." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-34", "text": "In this study, we experiment with n-best list reranking and a packed-forest based model as well along with local features exclusively." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-35", "text": "Our goal is to investigate the extension of the standard feature set of these models by features extracted from the automatic dependency parse of the sentence in question." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-36", "text": "----------------------------------" }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-37", "text": "**DEPENDENCY PARSE-BASED FEATURES FOR**" }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-38", "text": "Phrase-Structure Parsing" }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-39", "text": "Given the automatic (1-best) dependency parse of the sentence in question, we defined three feature templates for representing hyperedges (i.e. a CFG rule applied over a certain span of words)." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-40", "text": "We illustrate them on two hyperedges E 1 = (NP die Inseln (PP von Ru\u00dfland)) and E 2 = (VP fordern (NP die Inseln) (PP von Ru\u00dfland))." 
}, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-41", "text": "Let's assume that the corresponding dependency subtree consists of the following arcs: ROOT\u2192fordern, Inseln" }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-42", "text": "outArc features are counting the dependency arcs which \"go out\" from the constituent in question." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-43", "text": "More precisely we count the words within the span whose parent in the dependency tree lays outside the span of words in question." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-44", "text": "We use the absolute count and the ratio of outArcs among the words of the span." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-45", "text": "The more arcs go out, the further away is the dependency subtree over the words of the constituent from a dominating subtree." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-46", "text": "Hence, these features try to capture the \"phraseness\" of the span of words in question based on the dependency tree." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-47", "text": "For E 1 we have outArc=2 and outArcRatio=2/4 as the parent of Inseln and von lay outside the constituent." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-48", "text": "For E 2 we have outArc=1 and outArcRatio=1/5." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-49", "text": "POSRel features intend to tune daughter attachments to the dependency parse based on the POS tags of the lexical heads." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-50", "text": "For this we gather the daughter constituents whose lexical head is linked in the (undirected) dependency tree to the head of the parent constituent." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-51", "text": "We define features from them using the pair of the two head's POS tag and a triplet using the POS tags and the corresponding dependency label." 
}, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-52", "text": "For E 1 we cannot extract features as the lexical head of the parent (Inseln) and the lexical head of the daughter (von) are not linked in the dependency tree." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-53", "text": "For E 2 we have the following binary valued features: VVFIN-NN, VVFIN-NN-OBJA, VVFIN-APPR, VVFIN-APPR-PP as both daughter attachments have the corresponding arcs in the dependency tree." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-54", "text": "ConstRel features are similar to POSRel but use the constituent labels rather than the POS tags of the heads." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-55", "text": "Thus, once again we do not have any positive feature for E 1 , but for E 2 we extract: VP-NP, VP-NP-OBJA, VP-PP, VP-PP-PP." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-56", "text": "We also investigated the role of case and grammatical functions and extended the POSRel and ConstRel feature sets by adding this information to the labels." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-57", "text": "For instance besides VVFIN-NN-OBJA and VP-NP-OBJA from our example E 2 we also used VVFIN-NN-Acc-OBJA and VP-NP-OA-OBJA." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-58", "text": "Note that the value of outArc is 1 iff the word span in question has a dominating dependency subtree in the automatic parse." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-59", "text": "Wang and Zong (2010) prune hyperedges with outArc = 1 thus this feature can be regarded as a generalization of their approach." 
}, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-60", "text": "----------------------------------" }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-61", "text": "**TWO-STAGE PARSING OF GERMAN**" }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-62", "text": "As a first-stage parser, we used BitPar (Schmid, 2004) , a fast unlexicalized PCFG parser based on a first pass where non-probabilistic bottom-up parsing and top-down pruning is efficiently carried out by storing the chart in bit vectors." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-63", "text": "Bitpar constructs the probabilistic forest only after top-down pruning, i.e. after computing the posterior probability of each hyperedge given the input sentence." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-64", "text": "The forest is pruned by deleting hyperedges whose posterior probability is below some threshold." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-65", "text": "We used a treebank grammar enriched with case information, lexicalization of selected prepositions, conjunctions, and punctuation symbols, coarse parent category features for adverbs, adverbial phrases, prepositions, PPs and special markers for non-verbal phrases containing a wh expression, phrases without a head and clauses without a subject." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-66", "text": "We applied a second-order markovization of rules below a frequency threshold 1 , but infrequent second-order Markov symbols are replaced by first-order Markov symbols if the frequency is below threshold 2 ." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-67", "text": "We used simple regular expressions for unknown word clustering and estimated POS probabilities for unknown words of each cluster based on the word suffix." 
}, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-68", "text": "The relative frequency estimates of the POS probabilities of known words were interpolated with the respective unknown word POS probabilities using Witten-Bell smoothing." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-69", "text": "To the best of our knowledge Bitpar with this grammar is the state-of-theart German generative parser." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-70", "text": "At the second stage, we used n-best list and forest-based rerankers as well." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-71", "text": "The feature values of a full possible parse is the sum of the local feature vectors (for the hyperedges) (Charniak and Johnson, 2005) ." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-72", "text": "Learning is guided by the so-called oracle parse which is the full parse in the set of possible parses most similar to the gold standard tree." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-73", "text": "Our oracle extraction method is an extension of Huang (2008)'s dynamic programing procedure which takes into consideration POS tag and grammatical function matches as well and selects hyperedges with higher posterior probability for tie-breaking." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-74", "text": "For a detailed description of the training and supporting algorithms please refer to Charniak and Johnson (2005) and Huang (2008) ." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-75", "text": "----------------------------------" }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-77", "text": "We evaluate our approach on the Tiger corpora of the Parsing German Shared Task (PaGe) (K\u00fcbler, 2008) ." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-78", "text": "Its training, development, and test datasets consist of 20894, 2611 and 2611 sentences respectively." 
}, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-79", "text": "We decided to use these corpora to be able to compare our results with other results." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-80", "text": "We used the dependency parser of Bohnet (2010) to generate the parses for the feature extraction." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-81", "text": "We selected the parser since it had top scores for German in the CoNLL Shared Task 2009." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-82", "text": "The parser is a second order dependency parser that models the interaction between siblings as well as grandchildren." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-83", "text": "The parser was after the Shared Task enhanced by a Hash Kernel, which leads to significantly higher accuracy." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-84", "text": "We generated the dependency structures by 10-fold cross-validation training of the training corpus." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-85", "text": "The model for the annotation of the test set and development set was trained on the entire training corpus." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-86", "text": "We evaluated the dependency parses themselves in line with PaGe." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-87", "text": "Table 1 shows the labeled (LAS) and unlabeled attachment scores (UAS) of the dependency parser and compares it with the Malt parser (Nivre et al., 2004; Hall and Nivre, 2008) , which was the only and therefore best dependency parser that participated in the PaGe's dependency parsing track." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-88", "text": "Bohnet's parser reaches higher labeled and unlabeled scores." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-89", "text": "The last row shows the parsing accuracy with predicted Part-ofSpeech." 
}, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-90", "text": "We used the parses with predicted pos tags for our reranking experiments." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-91", "text": "Regarding the phrase-structure parser, our grammar extractor used markovization threshold 1 = 20 and threshold 2 = 10 resulting in a grammar with over fifty thousand of rules." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-92", "text": "Our prior experiments found the forest pruning threshold to be optimal at the order of 10 \u22122 which resulted in packed forests with average node number of 108." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-93", "text": "The oracle scores were 87.1 and 91.4 for the 100-best lists and packed forests, respectively." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-94", "text": "At the second stage, we filtered out rare features (which occurred in less than 5 sentences)." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-95", "text": "The new dependency parse-based feature set consists of 9240 and 5359 features before and after filtering." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-96", "text": "We employed the ranking MaxEnt implementation of the MALLET package (McCallum, 2002 ) and the average perceptron training of the Joshua package (Li et al., 2009 )." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-97", "text": "The update mechanism of the latter one was extended by using the F-score of the candidate full parse against the oracle parse as a loss function (see MIRA (Crammer and Singer, 2003) for the motivation)." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-98", "text": "We used the state-of-the-art feature set of the German phrase-structure parse reranker of Versley and Rehbein (2009) as a baseline feature set." 
}, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-99", "text": "This feature set is rich and consists of features constructed from the lexicalized parse tree and its typed dependencies along with features based on external statistical information (like the clustering of unknown words according to their context of occurrences and PP attachment statistics gathered from the automatic POS tagged DE-WaC corpus, a 1.7G words sample of the German-language WWW)." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-100", "text": "This feature set consists of 1.7 and 0.2 million of features before and after filtering and enables the direct comparison of our results with state-of-theart discriminative results on German." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-101", "text": "We use the evalb implementation of PARSEVAL as evaluation metric hereafter on basic constituent labels (noGF) and on the conflation of these labels and grammatical functions (GF)." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-102", "text": "We have to mention that our F-values are not comparable to the official results of PaGe -which was our original goal -because the evaluation metric there was a special im- plementation for calculating F-value (which differs from evalb for example in handling punctuation marks) and it used gold-standard POS tags in the input (which we thought to be unrealistic)." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-103", "text": "On the other hand, our results are comparable with results of Rafferty and Manning (2008) and Versley and Rehbein (2009) ." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-104", "text": "Table 2 shows the results achieved by the MaxEnt 100-best list reranker using one out of the three feature templates alone and their union (All) on the development set." 
}, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-105", "text": "All+Case refers to the enriched feature set incorporating case information for POS tag and grammatical functions for labels." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-106", "text": "Baseline here refers to the top parse of Bitpar (the first stage parser)." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-107", "text": "We note that the inside probability estimation of Bitpar for an edge is always in our feature set." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-108", "text": "Each of the three feature templates achieved significant improvements over a strong baselinenote that our first-stage parser is competitive with Versley and Rehbein (2009)'s two-stage parser -." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-109", "text": "On the other hand, as the All results are just slightly better than POSRel (the best individual feature template), the three templates seem to capture similar patterns." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-110", "text": "The introduction of case information also improved the results, thus we incorporate them into our final feature set." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-111", "text": "Table 3 illustrates the added value of the dependency features (Dep=All+Case) over the reranking feature set of Versley and Rehbein (2009) (RR) ." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-112", "text": "We also cite here previously published results on the same dataset by Rafferty and Manning (2008) (a generative parser) and Versley and Rehbein (2009) (a conditional random field-based discriminative parser)." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-113", "text": "The rows RR, Dep and RR+Dep show the results achieved by the MaxEnt 100-best list parser while the AvgPer row show the results of the forest-based average perceptron approach using the RR+Dep feature set." 
}, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-114", "text": "We report numbers only at this feature configuration due to the lack of space and because the difference between this and n-best list approaches is similarly moderate at other configurations as well." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-115", "text": "The results of Table 3 show that our simple features constructed from the automatic dependency parse of the sentence are as useful as the stateof-the-art rich feature set for German." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-116", "text": "Moreover these two features sets have a certain level of diversity as their union could achieve significantly better results than any of them alone." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-117", "text": "This is probably due to fact that most of the RR features are lexicalized while Dep features are unlexicalized." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-118", "text": "Regarding the two discriminative approaches, our findings are similar to Huang (2008) , i.e. the packed forest-based and n-best list procedures achieved similar results by using only local features." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-119", "text": "We found that the improvements by applying the dependency features are similar at the two evaluation metrics (with and without grammatical functions)." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-120", "text": "----------------------------------" }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-122", "text": "We presented experimental results on exploiting automatic dependency parses in a discriminative phrase-structure parser." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-123", "text": "Our simple feature templates achieved around 1.8 points of improvement in terms of F-score over Bitpar, the state-of-the-art generative parser for German and 0.8 when we extended a rich feature set." 
}, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-124", "text": "Although these results are promising, we consider them as the first step on a long road." }, { "sent_id": "772cdd4263cf8979a54cc5196e5853-C001-125", "text": "In the future, we will implement more sophisticated features derived from dependency parses (like dependency paths rather than single edges and non-local ones) and investigate the reverse direction, i.e. whether automatic constituent parses can help dependency parsers." } ], "y": { "@BACK@": { "gold_contexts": [ [ "772cdd4263cf8979a54cc5196e5853-C001-7", "772cdd4263cf8979a54cc5196e5853-C001-8", "772cdd4263cf8979a54cc5196e5853-C001-9" ], [ "772cdd4263cf8979a54cc5196e5853-C001-22", "772cdd4263cf8979a54cc5196e5853-C001-23", "772cdd4263cf8979a54cc5196e5853-C001-24", "772cdd4263cf8979a54cc5196e5853-C001-25", "772cdd4263cf8979a54cc5196e5853-C001-26", "772cdd4263cf8979a54cc5196e5853-C001-27", "772cdd4263cf8979a54cc5196e5853-C001-28" ], [ "772cdd4263cf8979a54cc5196e5853-C001-74" ] ], "cite_sentences": [ "772cdd4263cf8979a54cc5196e5853-C001-7", "772cdd4263cf8979a54cc5196e5853-C001-22", "772cdd4263cf8979a54cc5196e5853-C001-74" ] }, "@MOT@": { "gold_contexts": [ [ "772cdd4263cf8979a54cc5196e5853-C001-7", "772cdd4263cf8979a54cc5196e5853-C001-8", "772cdd4263cf8979a54cc5196e5853-C001-9" ] ], "cite_sentences": [ "772cdd4263cf8979a54cc5196e5853-C001-7" ] }, "@EXT@": { "gold_contexts": [ [ "772cdd4263cf8979a54cc5196e5853-C001-73" ] ], "cite_sentences": [ "772cdd4263cf8979a54cc5196e5853-C001-73" ] }, "@SIM@": { "gold_contexts": [ [ "772cdd4263cf8979a54cc5196e5853-C001-118" ] ], "cite_sentences": [ "772cdd4263cf8979a54cc5196e5853-C001-118" ] } } }, "ABC_73d4518e44f14a725a28e86de96963_38": { "x": [ { "sent_id": "73d4518e44f14a725a28e86de96963-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-2", "text": "In this paper we describe the system used by the ValenTO team in the shared task on Irony Detection in English Tweets at 
SemEval 2018." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-3", "text": "The system takes as starting point emotIDM, an irony detection model that explores the use of affective features based on a wide range of lexical resources available for English, reflecting different facets of affect." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-4", "text": "We experimented with different settings, by exploiting different classifiers and features, and participated both to the binary irony detection task and to the task devoted to distinguish among different types of irony." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-5", "text": "We report on the results obtained by our system both in a constrained setting and unconstrained setting, where we explored the impact of using additional data in the training phase, such as corpora annotated for the presence of irony or sarcasm from the state of the art." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-6", "text": "Overall, the performance of our system seems to validate the important role that affective information has for identifying ironic content in Twitter." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-7", "text": "----------------------------------" }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-9", "text": "People use social media platforms as a forum to share and express themselves by using the language in creative ways and employing figurative language devices such as irony to achieve different communication purposes." 
}, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-10", "text": "Irony is closely associated with the indirect expression of feelings, emotions and evaluations, and detecting the presence of irony in social media texts is considered a challenge for research in computational linguistics, also for the impact on sentiment analysis, where irony detection is important to avoid misinterpreting the polarity of ironic statements." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-11", "text": "Broadly speaking, under the umbrella term of irony two main concepts are covered: verbal irony and situational irony." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-12", "text": "Verbal irony is commonly defined as a figure of speech where the speaker intends to communicate the opposite of what is literally said (Sperber and Wilson, 1986) ." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-13", "text": "Situational irony, instead refers to a contradictory or unexpected outcome of events (Lucariello, 2014) ." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-14", "text": "In Twitter we can find many examples both of verbal irony and of posts where users describe aspects of an ironic situation." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-34", "text": "They are organized in three sub-groups representing different facets of affect:" }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-35", "text": "Sentiment-Related Features (Sent)." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-58", "text": "By analyzing the training data, an interesting characteristic was found: 857 out of 3,834 tweets contain an URL." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-15", "text": "Most of the proposed approaches to the automatic detection of irony in social media (Riloff et al., 2013; Buschmeier et al., 2014; Pt\u00e1\u010dek et al., 2014 )take advantage of lexical factors such as n-grams, punctuation marks, among others." 
}, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-16", "text": "Information related to affect has been also exploited (Reyes et al., 2013; Barbieri et al., 2014; Hern\u00e1ndez Far\u00edas et al., 2015) ." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-17", "text": "Other scholars proposed methods exploiting the context surrounding an ironic utterance (Wallace et al., 2015; Karoui et al., 2015) ." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-18", "text": "Recently, also deep learning techniques have been applied (Nozza et al., 2016; Poria et al., 2016) ." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-19", "text": "This paper describes our participation in the SemEval-2018 Task 3." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-20", "text": "The aim of this task is to identify ironic tweets." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-21", "text": "ValenTO exploited an extended version of emotIDM , an irony detection model based mainly on affective information." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-22", "text": "In particular, we experimented the use of a wide range of affectrelated features for characterizing the presence of ironic content, covering different facets of affect, from sentiment to finer-grained emotions." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-23", "text": "Most theorist (Grice, 1975; Wilson and Sperber, 1992; Alba-Juez and Attardo, 2014) recognized, indeed, the important role of affective information for irony communication-comprehension." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-24", "text": "----------------------------------" }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-25", "text": "**THE EMOTIDM MODEL**" }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-26", "text": "Irony is a very subjective language device that involves the expression of affective contents such as emotions, attitudes, or evaluations towards a particular target." 
}, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-27", "text": "Attempting to take advantage of the emotionally-laden characteristics of ironic expressions, we relied on emotIDM, an irony detection model that, taking advantage of several affective resources available for English (Nissim and Patti, 2016) , exploits various facets of affective information from sentiment to finer-grained emotions for characterizing the presence of irony in Twitter ." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-28", "text": "In ) the robustness of emotIDM was assessed over different Twitter state-of-the-art corpora for irony detection (Reyes et al., 2013; Barbieri et al., 2014; Mohammad et al., 2015; Pt\u00e1\u010dek et al., 2014; Riloff et al., 2013) ." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-29", "text": "The obtained results outperform those in the previous works confirming the significance of affective features for irony detection." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-30", "text": "An additional aspect to be mentioned about emotIDM is that it was designed to identify ironic content in a general sense, i.e. considering irony as a broad term covering different types of irony in tweets." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-31", "text": "emotIDM comprises two main groups of features that are described below: Structural Features (Str)." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-32", "text": "This group includes different markers that could help to identify ironic intention in tweets: punctuation marks (colon, exclamation, question marks), Part-Of-Speech labels (verbs, adverbs, nouns, adjectives), emoticons, uppercase characters, among others." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-33", "text": "Affective Features." 
}, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-36", "text": "Hu&Liu (Hu and Liu, 2004), General Inquirer (Stone and Hunt, 1963) , EffectWordNet (Choi and Wiebe, 2014) , Subjectivity lexicon (Wilson et al., 2005) , and EmoLex (Mohammad and Turney, 2013), AFINN, SWN, Semantic Orientation lexicon (Taboada and Grieve, 2004) , and SenticNet (SN) (Cambria et al., 2014) ." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-37", "text": "Emotional Categories (eCat)." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-38", "text": "EmoLex, EmoSenticNet (Poria et al., 2013) , SentiSense (Carrillo de Albornoz et al., 2012) , and the Linguistic Inquiry and Word Count dictionary (Pennebaker et al., 2001) ." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-39", "text": "Dimensional Models of Emotions (eDim)." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-40", "text": "ANEW (Bradley and Lang, 1999) , DAL (Whissell, 2009), and SN." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-41", "text": "3 emotIDM at SemEval-2018 Task 3:" }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-42", "text": "Irony Detection in English Tweets" }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-43", "text": "----------------------------------" }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-44", "text": "**TASK DESCRIPTION AND DATASETS**" }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-45", "text": "In the framework of SemEval-2018 was organized the Task 3 on Irony detection in English tweets (Van Hee et al., 2018) ." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-46", "text": "The main objective of this task is to identify the presence of irony in Twitter." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-47", "text": "It was divided in two different subtasks:" }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-48", "text": "1. Task A: Ironic vs. non-ironic: to determine whether a tweet is ironic or not." 
}, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-49", "text": "2. Task B: Different types of irony: to predict one out of four labels: 0) non-irony (nI), 1) verbal irony by polarity contrast (vI), 2) other verbal irony (oI), 3) situational irony (sI)." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-50", "text": "Organizers provided datasets for training and test labeled according the objectives of each subtask." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-51", "text": "The whole dataset was collected by exploiting a set of hashtags (#irony, #sarcasm and #not)." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-52", "text": "Therefore, a manual annotation process was applied in order to minimize the noise in the data." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-53", "text": "For Task A, 1,911 ironic and 1,923 non-ironic tweets where provided." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-54", "text": "While for Task B, the distribution was: 1923 for nI, 1393 for vI, 213 for oI and 328 for sI. Participants were allowed to submit systems trained under two settings: constrained (C), where only the training data provided for the task should be used; unconstrained (U), where the use of additional data was permitted." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-55", "text": "----------------------------------" }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-56", "text": "**OUR PROPOSAL**" }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-57", "text": "We decided to participate to the shared task by using emotIDM." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-59", "text": "From these tweets, 265 were belonging to the ironic class, while 592 were labeled as non-ironic." 
}, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-60", "text": "Notice that, in (Hern\u00e1ndez-Farias et al., 2014) , the authors found a similar behavior regarding URL information in the dataset provided by the organizers of SentiPOLC-2014 (Basile et al., 2014) ." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-61", "text": "Furthermore, Barbieri et al. (2014) exploited a feature for alerting the existence of an URL in a tweet; such feature was ranked among the most discriminative ones according to an information gain analysis." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-62", "text": "Since, information regarding to the presence of URL in a tweet has proven to be useful for detecting irony in Twitter, we decided to enrich emotIDM by adding a binary feature for reflecting the presence of URL in a tweet." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-63", "text": "Below, we describe our participation in the task." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-64", "text": "----------------------------------" }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-65", "text": "**TASK A: IRONIC VS. NON-IRONIC**" }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-66", "text": "We addressed this task as a binary classification by taking advantage of two of the most widely applied classifiers in irony detection: Decision Tree (DT) and Support Vector Machine (SVM) 1 ." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-67", "text": "Moreover, we also included Random Forest (RF) as a classifier in our experiments 2 ." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-68", "text": "We carried out a set of experiments for assessing the performance of the original version of emotIDM and the one including information concerning URL (emotIDM+URL)." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-69", "text": "Besides, to investigate the contribution of the different sets of features in emotIDM further experiments were performed." 
}, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-70", "text": "Several classifiers were used in order to identify the most promising setting." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-71", "text": "As mentioned before, exploiting external data was allowed in the unconstrained setting." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-72", "text": "We took advantage of a set of corpora previously used in the state of the art in irony detection." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-73", "text": "We exploited data from a set of corpora collected exploiting different approaches: self-tagging or manual annotation or crowd-sourcing 3 ." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-74", "text": "We exploited the corpora developed by (Reyes et al., 2013) , (Barbieri et al., 2014) , (Mohammad et al., 2015) , (Pt\u00e1\u010dek et al., 2014) , (Riloff et al., 2013) , (Ghosh et al., 2015) , (Karoui et al., 2017) , and (Sulis et al., 2016) ." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-75", "text": "Besides, we also take advantage of an in-house collection of tweets containing the hashtags #irony and #sarcasm 4 ." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-76", "text": "Table 1 shows the obtained results during the developing phase for Task A. We experimented with different sets of features and classifiers considering a five fold-cross validation setting." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-77", "text": "SVM emerges as the classifier with the best performance in both C and U scenarios." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-78", "text": "We noticed that, when using SVM, adding the URL feature to emotIDM helps to improve the overall performance of our system." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-79", "text": "When we experimented by removing a set of features in emotIDM, a drop in the performance (in most of the cases) is observed." 
}, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-80", "text": "The results of the experiments with external data are higher than those using only the training data." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-81", "text": "The last row in Table 1 shows the obtained results when only affect-related features were used; even though there is a drop in the performance respect to the experiments using structural features, it seems that affective features on their own provide useful information for irony detection." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-82", "text": "We participated in the subtask A by submitting two runs (constrained and unconstrained) exploiting the experimental setting with the best performance: emotIDM+URL with a SVM as classifier." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-83", "text": "----------------------------------" }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-84", "text": "**TASK B: DIFFERENT TYPES OF IRONY**" }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-85", "text": "Distinguishing between different kinds of ironic devices is still a controversial issue." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-86", "text": "In computational linguistics, only few research works have attempted to address such a difficult task (Wang, 2013; Barbieri et al., 2014; Sulis et al., 2016; Van Hee et al., 2016) ." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-87", "text": "We are interested in assessing the performance of emotIDM when it deals with different types of irony, in order to test if a wide variety of affective features can help in discriminating also in the finer-grained classification task here proposed." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-88", "text": "This could give some insights on the role of affective content among ironic devices having different communication purposes." 
}, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-89", "text": "emotIDM+URL was trained with the dataset provided for Task B (constrained setting) to test the effectiveness of affective features in such finergrained task." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-90", "text": "We exploited the same classifiers than in Task A attempting to evaluate their performance when different classes of irony should be classified." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-91", "text": "Overall, the best performance was achieved by SVM (see Table 2 )." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-92", "text": "However, when the performance of each single class was considered, the best results were those obtained with DT." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-93", "text": "For this reason, we decided to combine two classifiers with the following criterion: the sI and oI classes are assigned by the DT; while irony and non-irony are assigned by SVM or RF." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-94", "text": "Table 2 shows the obtained results of the experiments carried out over the dataset for Task B. A five fold cross-validation was applied." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-95", "text": "From the results in Table 2 , it can be noticed that when two classifiers are combined the performance of our model improves." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-96", "text": "The DT + SVM was selected as the system for participating in the Task B." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-97", "text": "----------------------------------" }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-98", "text": "**RESULTS**" }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-99", "text": "The results of ValenTO participation in the shared task are summarized in Table 3 ." 
}, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-100", "text": "In Task A, on the official CodaLab ranking, we ranked in the 16 th position with the unconstrained version of our submission." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-101", "text": "When comparing our official result with the one obtained by the best-ranked system (0.7054), it can be noticed that the difference is lower than 0.1 in F-score terms." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-102", "text": "It is an interesting result considering that our system relies mainly on features covering different facets of affect in ironic tweets, and confirms the key role that such kind of affective information plays for detecting irony in Twitter." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-103", "text": "In addition, the organizers also provided separate rankings for constrained and unconstrained submissions." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-104", "text": "Our system ranked in the 17 th position in the constrained setting, while in the unconstrained one we ranked as 4 th ." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-105", "text": "Moreover, the performance of our system seems to be stable in the two C and U settings." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-106", "text": "Concerning Task B, our system performed relatively well, considering that we did not apply further tuning to capture different ironic devices." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-107", "text": "We ranked in the 17 th position of 31 submissions in the Official ranking at CodaLab." 
}, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-108", "text": "----------------------------------" }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-109", "text": "**DISCUSSION AND ERROR ANALYSIS**" }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-110", "text": "Data provided for the task were retrieved by exploiting hashtags #irony, #sarcasm and #not, which according to (Sulis et al., 2016) seems to label different kinds of ironic phenomena." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-111", "text": "We analyzed the gold standard labels provided by the organizers (where the ironic hashtags were also included in the tweets) in order to see the performance of our model for recognizing tweets labeled with distinct hashtags." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-112", "text": "Considering the results in Task A, we noticed that our system was able to identify all the three kinds of tweets without any kind of skew towards a particular hashtag." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-113", "text": "It somehow confirms the robustness of emotIDM for recognizing irony in a broad sense." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-114", "text": "Our system was able to correctly identify instances expressing an apparent positive emotional distress with an ironic intention, such as: Sunday is such a fun day to study #ew #saywhat and Yay I just love this time of the month...!. A special mention is for tweets labeled with #not." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-115", "text": "This hashtag is not always used for highlighting a figurative meaning." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-116", "text": "Our system was able to correctly identify instances containing #not when it was used for figurative meaning such as: Yay for Fire Alarms at 3AM #not, and also when it was used as part of the text in a tweet: #Myanmar #men #plead #not #guilty to #murder of #British #tourists." 
}, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-117", "text": "http://t.co/flrKr3H6Kl via @reuters." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-118", "text": "For what concerns the performance of emotIDM in Task B, Table 4 5 shows that our model performed better in identifying tweets where verbal irony was expressed by means of a polarity contrast." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-119", "text": "Moreover, it was recognizing better \"situational irony\" than \"other irony\"." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-120", "text": "Since our model relies mainly on affective information, ironic instances lacking of subjectiverelated content are hard to recognize, as in: Being a hipster now is so mainstream." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-121", "text": "Oh, the irony." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-122", "text": "#hipster #irony." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-123", "text": "Moreover, we found some tweets where context information is crucial for capturing the ironic sense, like in: So there used to be a crossfit place here.... #irony #pizzawins http://t.co/9BDkxT9GFJ; or where the hashtag is the only signal for ironic intention." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-124", "text": "----------------------------------" }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-125", "text": "**CONCLUSIONS**" }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-126", "text": "In this paper, we described our participation at SemEval-2018 Task 3." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-127", "text": "We exploited an enhanced version of emotIDM." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-128", "text": "In our experiments, SVM emerges as the classifier with the best performance." 
}, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-129", "text": "The obtained results serve to validate the usefulness of affect-related features for distinguishing ironic tweets." }, { "sent_id": "73d4518e44f14a725a28e86de96963-C001-130", "text": "As future work, it could be interesting to enrich emotIDM with features for capturing other kinds of information such as common-knowledge and semantic incongruity." } ], "y": { "@BACK@": { "gold_contexts": [ [ "73d4518e44f14a725a28e86de96963-C001-16" ], [ "73d4518e44f14a725a28e86de96963-C001-28", "73d4518e44f14a725a28e86de96963-C001-29" ], [ "73d4518e44f14a725a28e86de96963-C001-61" ], [ "73d4518e44f14a725a28e86de96963-C001-85", "73d4518e44f14a725a28e86de96963-C001-86" ] ], "cite_sentences": [ "73d4518e44f14a725a28e86de96963-C001-16", "73d4518e44f14a725a28e86de96963-C001-28", "73d4518e44f14a725a28e86de96963-C001-61", "73d4518e44f14a725a28e86de96963-C001-86" ] }, "@USE@": { "gold_contexts": [ [ "73d4518e44f14a725a28e86de96963-C001-74" ] ], "cite_sentences": [ "73d4518e44f14a725a28e86de96963-C001-74" ] }, "@MOT@": { "gold_contexts": [ [ "73d4518e44f14a725a28e86de96963-C001-85", "73d4518e44f14a725a28e86de96963-C001-86" ] ], "cite_sentences": [ "73d4518e44f14a725a28e86de96963-C001-86" ] } } }, "ABC_c5e1debe3fcab509737e092505a29e_38": { "x": [ { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-27", "text": "**SUFFIX SEPARATION (SS)**" }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-56", "text": "**PRE-PROCESSING**" }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-28", "text": "In this step, the source language words are preprocessed for suffix separation." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-2", "text": "This paper discusses Centre for Development of Advanced Computing Mumbai's (CDACM) submission to the NLP Tools Contest on Statistical Machine Translation in Indian Languages (ILSMT) 2014 (collocated with ICON 2014)." 
}, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-29", "text": "We have considered only suffix from source language which corresponds to post-positions in Hindi." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-3", "text": "The objective of the contest was to explore the effectiveness of Statistical Machine Translation (SMT) for Indian language to Indian language and English-Hindi machine translation." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-4", "text": "In this paper, we have proposed that suffix separation and word splitting for SMT from agglutinative languages to Hindi significantly improves over the baseline (BL)." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-5", "text": "We have also shown that the factored model with reordering outperforms the phrase-based SMT for EnglishHindi (en-hi)." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-6", "text": "We report our work on all five pairs of languages, namely Bengali-Hindi (be-hi), Marathi-Hindi (mrhi), Tamil-Hindi (ta-hi), Telugu-Hindi (tehi), and en-hi for Health, Tourism, and General domains." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-7", "text": "----------------------------------" }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-9", "text": "In this paper, we present our experiments on SMT from Bengali, Marathi, Tamil, Telugu and English to Hindi." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-10", "text": "From the set of languages involved in the shared task, Bengali, Hindi and Marathi belong to IndoAryan family and Tamil and Telugu are from Dravidian language family." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-11", "text": "All languages except English, have the same flexibility towards word order, canonically following the SOV structure." 
}, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-12", "text": "With reference to the morphology, Bengali, Marathi, Tamil, and Telugu are more agglutinative compared to Hindi." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-13", "text": "It is known that SMT produces more unknown words resulting in the bad translation quality if the morphological divergence between the source and target language is high." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-14", "text": "Koehn and Knight (2003) , Popovic and Ney (2004) and Popovi\u0107 et al. (2006) have demonstrated ways to handle this issue with morphological segmentation of words before training the SMT system." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-15", "text": "To tackle the morphological divergence of Hindi with these languages we have explored Suffix Separation (SS) and Compound word Splitting (CS) as a pre-processing step." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-16", "text": "For English to Hindi SMT, better alignment is achieved through the use of preordering developed by Patel et al. (2013) and stem as an alignment factor ." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-17", "text": "The rest of the paper is organized as follows." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-18", "text": "In Section 2, we discuss our methodology, followed by data-set and experimental setup in section 3." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-19", "text": "Section 4 discusses experiments and results." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-20", "text": "Submitted systems to the shared task and error analysis are displayed in section 5 and 6 respectively, followed by conclusion and future work in section 7." 
}, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-21", "text": "----------------------------------" }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-22", "text": "**METHODOLOGY**" }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-23", "text": "Our methodology to tackle morphological and structural divergence involves the use of the suffix separation, compound splitting, and reordering for the language pairs under study." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-24", "text": "These methods are briefly described below." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-25", "text": "Pseudocode for the suffix separation and compound word splitting is detailed in Algorithms 1 and 2 respectively." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-26", "text": "----------------------------------" }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-30", "text": "For example, in Marathi, 1 'mahinyaaMnii' is translated as 'mahiine meM' in Hindi." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-31", "text": "In this case, we split end for 10:" }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-32", "text": "end for 11: end procedure the word into 'mahiny + aaMnii'." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-33", "text": "Here, the suffix 'aaMnii' corresponds to the word 'meM' in Hindi." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-34", "text": "For this task, the list of suffixes is manually created with the linguistic expertise." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-35", "text": "When a word is subjected to SS, longest matching suffix from the list is considered for suffix separation." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-36", "text": "Suffix separation takes place only once for a word." 
}, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-37", "text": "----------------------------------" }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-38", "text": "**COMPOUND SPLITTING (CS)**" }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-39", "text": "In this step source language compound words are split into constituents, recursively." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-40", "text": "For example, in Marathi, a compound word 'daMtatajGYaaMkaDuuna' is translated as 'danta visheShaGYa se' in Hindi." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-41", "text": "In this case we split the source word into constituents, 'daMta' , 'tajGYaaM' and 'kaDuuna'." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-42", "text": "The list of the constituent suffixes for splitting is empirically prepared from a monolingual data." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-43", "text": "The compound suffix creation algorithm is very basic and simple, the pseudo code is detailed in Algoritm 2." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-44", "text": "A word is considered for compound suffix list, if it appears as a suffix in another word of the monolingual corpus." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-45", "text": "----------------------------------" }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-46", "text": "**REORDERING (RO)**" }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-47", "text": "It is based on the syntactic transformation of the English sentence parse tree according to the target language (Hindi) structure." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-48", "text": "We have used source side reordering developed by Patel et al. (2013), and Ramanathan et al. 
(2008). We now discuss the training and testing corpora from the Health, Tourism and General domains for the be-hi, mr-hi, ta-hi, te-hi, and en-hi language pairs, followed by the preprocessing, SMT system setup and evaluation metrics for the experiments." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-49", "text": "----------------------------------" }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-50", "text": "**CORPUS FOR SMT TRAINING AND TESTING**" }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-51", "text": "For the experiments, we have used the corpus shared by the ILSMT organizers, detailed in Table 1." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-52", "text": "An additional monolingual corpus of approximately 23K sentences (Khapra et al., 2010) was used to train the language model." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-53", "text": "The evaluation of the systems was done using the Test1 data, which was the development set." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-54", "text": "The quality of the submitted systems was estimated by the organizers against the Test2 corpus." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-55", "text": "----------------------------------" }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-57", "text": "To tackle the morphological divergence between the source and target languages (Bengali/Marathi/Tamil/Telugu to Hindi), we used suffix separation and compound splitting, as explained in Section 2." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-58", "text": "To handle the structural divergence for English-Hindi SMT, we exploited source side preordering (Patel et al., 2013; Ramanathan et al., 2008)."
}, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-59", "text": "----------------------------------" }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-60", "text": "**SMT SYSTEM SET UP**" }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-61", "text": "The baseline system was setup using the phrasebased model ( (Brown et al., 1990; Marcu and Wong, 2002; Och and Ney, 2003; and was used for factored model." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-62", "text": "The language model was trained using KenLM (Heafield, 2011) toolkit with modified Kneser-Ney smoothing (Chen and Goodman, 1996) ." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-63", "text": "For factored SMT training, we used source and target side stem as an alignment factor." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-64", "text": "Stemming was done using lightweight stemmer (Ramanathan and Rao, 2003) for Hindi." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-65", "text": "For English, we used porter stemmer (Minnen et al., 2001 )." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-66", "text": "----------------------------------" }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-67", "text": "**EVALUATION METRICS**" }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-68", "text": "We compared different experimental systems using BLEU (Papineni et al., 2002) , NIST (Doddington, 2002) and translation edit rate (TER) (Snover et al., 2006) ." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-69", "text": "For a MT system to be better, higher BLEU and NIST scores with lower TER are desired." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-70", "text": "----------------------------------" }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-71", "text": "**EXPERIMENTS AND RESULTS**" }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-72", "text": "In the following subsections, we present different experiments carried out for the shared task." 
}, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-73", "text": "We also study the impact of suffix separation, compound splitting and preordering on the SMT accuracy." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-74", "text": "----------------------------------" }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-75", "text": "**IMPACT OF SUFFIX SEPARATION AND COMPOUND SPLITTING**" }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-76", "text": "Pre-processing of source words for suffix separation and compound word splitting results in better alignment and hence the better translation." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-77", "text": "The alignment improvements can be seen in the Table 2." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-78", "text": "Improvement in the translation quality can be observed in the various evaluation scores detailed in Table 3 ." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-79", "text": "From the table, we can infer that for be-hi and mr-hi, suffix separation have shown significant improvements over the baseline, whereas, compound word splitting has caused slight improvement." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-80", "text": "However, compound splitting has found to be more effective than suffix separation for ta-hi and te-hi." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-81", "text": "We have also tried a combination of compound word splitting and suffix separation, serially." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-82", "text": "We observed improvements over the BL+SS and BL+CS for ta-hi and te-hi, across the evaluation scores." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-83", "text": "BL+SS is better than all other systems for be-hi, across the evaluation metrics, whereas, for mr-hi BL+CS+SS has highest BLEU and BL+SS has the best NIST and TER scores." 
}, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-84", "text": "Table 3 : Effect of suffix separation and compound splitting" }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-85", "text": "----------------------------------" }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-86", "text": "**IMPACT OF REORDERING**" }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-87", "text": "Preordering of the source language sentence helps in the better alignment and decoding for English to Indian language (Ramanathan et al., 2008; Patel et al., 2013; Kunchukuttan et al., 2014) SMT." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-88", "text": "Table 4 details the results for the systems under study." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-89", "text": "We can see that BL+RO shows significant improvement over BL." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-90", "text": "Further, the factored SMT system with stem as alignment factor shows slight improvement in BLEU over the BL+RO, but other metrics show BL+RO is better compared to the factored system." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-91", "text": "----------------------------------" }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-92", "text": "**SUBMISSION**" }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-93", "text": "In this section, we discuss evaluation scores of the submitted systems." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-94", "text": "We submitted BL+SS for behi and BL+CS+SS for all other language pairs except en-hi." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-95", "text": "For en-hi BL+RO+FACT is submitted as the final system." 
}, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-96", "text": "----------------------------------" }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-97", "text": "**ERROR ANALYSIS**" }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-98", "text": "A closer look at the performance of these systems to understand the utility of SS and CS has been done." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-99", "text": "We report a few early observations." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-100", "text": "----------------------------------" }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-101", "text": "**SUPERFLUOUS SPLITTING**" }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-102", "text": "With the suffix separation, the Marathi word like 'dilaavara' is getting split into 'dilaa' + 'vara' which is a wrong split." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-103", "text": "As, 'dilaavara' is a proper noun and hence should not have been split." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-104", "text": "We tried to overcome this error by avoiding suffix separation and compound splitting of NNP POS tagged words, but that was stopping many other valid candidates from pre-processing." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-105", "text": "----------------------------------" }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-106", "text": "**BAD SPLIT**" }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-107", "text": "The words like 'jarmaniitiila' is getting split into 'jarmaniit + 'iila which actually should have been split into 'jarmanii +'tiila'." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-108", "text": "Similarly, many words on splitting have not given any valid Marathi word which causesed sparsity in training data up to some extent." 
}, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-109", "text": "----------------------------------" }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-110", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-111", "text": "In this paper, we presented various systems for translation from Bengali, English, Marathi, Tamil and Telugu to Hindi." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-112", "text": "These SMT systems with the use of source side suffix separation, compound splitting and preordering showed significantly higher accuracy over the baseline." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-113", "text": "In future, we could investigate the formulation of more effective solutions for SS, CS and RO." }, { "sent_id": "c5e1debe3fcab509737e092505a29e-C001-114", "text": "Reasons for lower BLEU due to the combination of suffix separation and compound word splitting for Bengali and Marathi is another interesting case to study further." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "c5e1debe3fcab509737e092505a29e-C001-13", "c5e1debe3fcab509737e092505a29e-C001-14" ], [ "c5e1debe3fcab509737e092505a29e-C001-87" ] ], "cite_sentences": [ "c5e1debe3fcab509737e092505a29e-C001-14", "c5e1debe3fcab509737e092505a29e-C001-87" ] }, "@USE@": { "gold_contexts": [ [ "c5e1debe3fcab509737e092505a29e-C001-48" ], [ "c5e1debe3fcab509737e092505a29e-C001-58" ], [ "c5e1debe3fcab509737e092505a29e-C001-64" ] ], "cite_sentences": [ "c5e1debe3fcab509737e092505a29e-C001-48", "c5e1debe3fcab509737e092505a29e-C001-58", "c5e1debe3fcab509737e092505a29e-C001-64" ] } } }, "ABC_a62f376adefad10c5fb8b6c08ebb63_38": { "x": [ { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-2", "text": "Empirical lower bounds studies in which the frequency of alignment configurations that cannot be induced by a particular formalism is estimated, have been important for the development of syntax-based machine translation formalisms." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-3", "text": "The formalism that has received most attention has been inversion transduction grammars (ITGs) (Wu, 1997)." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-4", "text": "All previous work on the coverage of ITGs, however, concerns parse failure rates (PFRs) or sentence level coverage, which is not directly related to any of the evaluation measures used in machine translation." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-5", "text": "S\u00f8gaard and Kuhn (2009) induce lower bounds on translation unit error rates (TUERs) for a number of formalisms, incl." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-6", "text": "normal form ITGs, but not for the full class of ITGs." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-7", "text": "Many of the alignment configurations that cannot be induced by normal form ITGs can be induced by unrestricted ITGs, however." 
}, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-8", "text": "This paper estimates the difference and shows that the average reduction in lower bounds on TUER is 2.48 in absolute difference (16.01 in average parse failure rate)." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-9", "text": "----------------------------------" }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-11", "text": "The first stage in training a machine translation system is typically that of aligning bilingual text." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-30", "text": "The result extends to decoding in conjunction with a bigram language model (Huang et al., 2005) ." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-12", "text": "The quality of alignments is in that case of vital importance to the quality of the induced translation rules used by the system in subsequent stages." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-13", "text": "In string-based statistical machine translation, the alignment space is typically restricted by the n-grams considered in the underlying language model, but in syntax-based machine translation the alignment space is restricted by very different and less transparent structural contraints." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-14", "text": "While it is easy to estimate the consequences of restrictions to n-grams of limited size, it is less trivial to estimate the consequences of the structural constraints imposed by syntax-based machine translation formalisms." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-15", "text": "Consequently, much work has been devoted to this task (Wu, 1997; Zens and Ney, 2003; Wellington et al., 2006; Macken, 2007; S\u00f8gaard and Kuhn, 2009) ." 
}, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-16", "text": "The task of estimating the consequences of the structural constraints imposed by a particular syntax-based formalism consists in finding what is often called \"empirical lower bounds\" on the coverage of the formalism (Wellington et al., 2006; S\u00f8gaard and Kuhn, 2009 )." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-17", "text": "Gold standard alignments are constructed and queried in some way as to identify complex alignment configurations, or they are parsed by an all-accepting grammar such that a parse failure indicates that no alignment could be induced by the formalism." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-18", "text": "The assumption in this and related work that enables us to introduce a meaningful notion of alignment capacity is that simultaneously recognized words are aligned (Wu, 1997; Zhang and Gildea, 2004; Wellington et al., 2006; S\u00f8gaard and Kuhn, 2009) ." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-19", "text": "As noted by S\u00f8gaard (2009) , this definition of alignment has the advantageous consequence that candidate alignments can be singled out by mere inspection of the grammar rules." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-20", "text": "It also has the consequence that alignments are transitive (Goutte et al., 2004) , since simultaneity is transitive." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-21", "text": "While previous work (S\u00f8gaard and Kuhn, 2009 ) has estimated empirical lower bounds for normal form ITGs at the level of translation units (TUER), or cepts (Goutte et al., 2004) , defined as maximally connected subgraphs in alignments, nobody has done this for the full class of ITGs." 
}, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-22", "text": "What is important to understand is that while normal form ITGs can induce the same class of translations as the full class of ITGs, they do not induce the same class of alignments." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-23", "text": "They do not, for ex-ample, induce discontinuous translation units (see Sect." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-24", "text": "3)." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-25", "text": "Sect." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-26", "text": "2 briefly presents some related results in the literature." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-27", "text": "Some knowledge about formalisms used in machine translation is assumed." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-28", "text": "Aho and Ullman (1972) showed that 4-ary synchronous context-free grammars (SCFGs) could not be binarized, and Satta and Peserico (2005) showed that the hiearchy of SCFGs beyond ternary ones does not collapse; they also showed that the complexity of the universal recognition problem for SCFGs is NP-complete." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-29", "text": "ITGs on the other hand has a O(|G|n 6 ) solvable universal recognition problem, which coincides with the unrestricted alignment problem (S\u00f8gaard, 2009 )." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-31", "text": "----------------------------------" }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-32", "text": "**RELATED WORK**" }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-33", "text": "Wu (1997) introduced ITGs and normal form ITGs." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-34", "text": "ITGs are a notational variant of the subclass of SCFGs such that all indexed nonterminals in the source side of the RHS occur in the same order or exactly in the inverse order in the target side of the RHS." 
}, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-35", "text": "It turns out that this subclass of SCFGs defines the same set of translations that can be defined by binary SCFGs." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-36", "text": "The different forms of production rules are listed below with the more restricted normal form production rules in the right column, with \u03c6 \u2208 (N \u222a {e/f | e \u2208 T * , f \u2208 T * }) * (N nonterminals and T terminals, as usual)." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-37", "text": "The RHS operator [ ] preserves source language constituent order in the target language, while reverses it." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-38", "text": "1" }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-39", "text": "Several studies have adressed the alignment capacity of ITGs and normal form ITGs." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-40", "text": "Zens and Ney (2003) induce lower bounds on PRFs for normal form ITGs." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-41", "text": "Wellington et al. (2006) induce lower bounds on PRFs for ITGs." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-42", "text": "S\u00f8gaard and Kuhn (2009) induce lower bounds on TUER for normal form ITGs and more expressive formalisms for syntax-based machine translation." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-43", "text": "No one has, however, to the best our knowledge induced lower bounds on TUER for ITGs." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-44", "text": "----------------------------------" }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-45", "text": "**EXPERIMENTS**" }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-46", "text": "As already mentioned empirical lower bounds studies differ in four important respects, namely wrt." 
}, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-47", "text": ": (i) whether they use hand-aligned or automatically aligned gold standards, (ii) the level at which they count failures, e.g. sentence, alignment or translation unit level, (iii) whether they interpret translation units disjunctively or conjunctively, and (iv) whether they induce the lower bounds (a) by running an all-accepting grammar on the gold standard data, (b) by logical characterization of the structures that can be induced by a formalism, or (c) by counting the frequency of complex alignment configurations." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-48", "text": "The advantage of (a) and (b) is that they are guaranteed to find the highest possible lower bound on the gold standard data, whereas (c) is more modular (formalismindependent) and actually tells us what configurations cause trouble." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-49", "text": "(i) In this study we use hand-aligned gold standard data." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-50", "text": "It should be obvious why this is preferable to automatically aligned data." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-51", "text": "The only reason that some previous studies used automatically aligned data is that hand-aligned data are hard to come by." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-52", "text": "This study uses the data also used by S\u00f8gaard and Kuhn (2009) , which to the best of our knowledge uses the largest collection of handaligned parallel corpora used in any of these studies." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-53", "text": "(ii) Failures are counted at the level of translation units as argued for in the above, but supplemented by parse failure rates for completeness." 
}, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-54", "text": "(iii) Since we count failures at the level of translation units, it is natural to interpret them conjunctively." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-55", "text": "Otherwise we would in reality count failures at the level of alignments." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-56", "text": "(iv) We use (c)." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-57", "text": "The conjunctive interpretation of translation units was also adopted by Fox (2002) and is motivated by the importance of translation units and discontinuous ones in particular to machine translation in general (Simard and colleagues, 2005; Ayan and Dorr, 2006; Macken, 2007; Shieber, 2007) ." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-58", "text": "In brief," }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-59", "text": "where G U are the translation units in the gold standard, and S U the translation units produced by the system." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-60", "text": "This evaluation measure is related to consistent phrase error rate (CPER) introduced in Ayan and Dorr (2006) , except that it does not only consider contiguous phrases." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-61", "text": "----------------------------------" }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-62", "text": "**DATA**" }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-63", "text": "The characteristics of the hand-aligned gold standard parallel corpora used are presented in Fig" }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-64", "text": "----------------------------------" }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-65", "text": "**ALIGNMENT CONFIGURATIONS**" }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-66", "text": "The full class of ITGs induces many alignment configurations that normal form ITGs do not induce, incl." 
}, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-67", "text": "discontinuous translation units (DTUs), i.e. translation units with at least one gap, double-sided DTUs, i.e. DTUs with both a gap in the source side and a gap in the target side, and multigap DTUs with arbitrarily many gaps (as long as the contents of the gaps respect either the linear order of the source side or the inverted order)." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-68", "text": "ITGs do not induce (i) inside-out alignments, (ii) cross-serial DTUs, (iii) what is called the \"bonbon\" configuration below, and (iv) multigap DTUs with mixed order in the target side." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-69", "text": "The reader is referred to Wu (1997) for discussion of inside-out alignments." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-70", "text": "(ii) and (iii) are explained below." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-71", "text": "Multigap DTUs with up to three gaps are frequent (S\u00f8gaard and Kuhn, 2009 ) and have been shown to be important for translation quality (Simard and colleagues, 2005) ." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-72", "text": "While normal form ITGs do not induce multigap DTUs, ITGs induce a particular subclass of multigap DTUs, namely those that are constructed by linear or inverse interpolation."
}, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-73", "text": "----------------------------------" }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-74", "text": "**INDUCED CONFIGURATIONS**" }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-75", "text": "----------------------------------" }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-76", "text": "**NON-INDUCED CONFIGURATIONS**" }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-77", "text": "Inside-out alignments were first described by Wu (1997) , and their frequency has been a matter of some debate (Lepage and Denoual, 2005; Wellington et al., 2006; S\u00f8gaard and Kuhn, 2009) ." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-78", "text": "Cross-serial DTUs are made of two DTUs non-contiguous to the same side such that both have material in the gap of each other." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-79", "text": "Bonbons are similar, except the DTUs are non-contiguous to different sides, i.e. D has a gap in the source side that contains at least one token in E, and E has a gap in the target side that contains at least one token in D. An example of a bonbon configuration can be found in Simard et al. (2005). Multigap DTUs with mixed transfer are, as already mentioned, multigap DTUs with crossing alignments between material in two distinct gaps." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-80", "text": "----------------------------------" }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-81", "text": "**RESULTS**" }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-82", "text": "The lower bounds on TUER for the full class of ITGs are obtained by summing the ratios of inside-out alignments, cross-serial DTUs, bonbons and mixed order multigap DTUs, subtracting any overlap between these classes of configurations."
}, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-83", "text": "The lower bounds on TUER for normal form ITGs sum ratios of inside-out alignments and DTUs, subtracting any overlap." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-84", "text": "Figure 1 presents the ratio (\u00d7100), and Figure 2 presents the induced lower bounds on the full class of ITGs and normal form ITGs." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-85", "text": "Any two configurations must differ on all translation units in order to count as two distinct configurations in these statistics." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-86", "text": "Otherwise a single translation unit could be removed to simplify two or more configurations." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-87", "text": "----------------------------------" }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-88", "text": "**DISCUSSION**" }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-89", "text": "The usefulness of alignment error rate (AER) (Och and Ney, 2000) has been questioned lately (Fraser and Marcu, 2007) ; most importantly, AER does not always seem to correlate with translation quality." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-90", "text": "TUER is likely to correlate better with translation quality, since it by definition correlates with CPER (Ayan and Dorr, 2006) ." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-91", "text": "No large-scale experiment has been done yet to estimate the strength of this correlation." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-92", "text": "Our study also relies on the assumption that simultaneously recognized words are aligned in bilingual parsing."
}, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-93", "text": "The relationship between parsing and alignment can of course be complicated in ways that will alter the alignment capacity of ITG and its normal form; on some definitions the two formalisms may even become equally expressive." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-94", "text": "----------------------------------" }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-95", "text": "**CONCLUSION**" }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-96", "text": "It was shown that the absolute reduction in average lower bound on TUER is 2.48 for the full class of ITGs over its canonical normal form." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-97", "text": "For PRF, it is 16.01." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-98", "text": "LB-PFR lists the lower bounds on parse failure rates." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-99", "text": "Finally, the third and fourth columns list configuration overlaps at the level of translation units, resp." }, { "sent_id": "a62f376adefad10c5fb8b6c08ebb63-C001-100", "text": "sentences." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "a62f376adefad10c5fb8b6c08ebb63-C001-14", "a62f376adefad10c5fb8b6c08ebb63-C001-15" ], [ "a62f376adefad10c5fb8b6c08ebb63-C001-16", "a62f376adefad10c5fb8b6c08ebb63-C001-17" ], [ "a62f376adefad10c5fb8b6c08ebb63-C001-18" ], [ "a62f376adefad10c5fb8b6c08ebb63-C001-77" ] ], "cite_sentences": [ "a62f376adefad10c5fb8b6c08ebb63-C001-15", "a62f376adefad10c5fb8b6c08ebb63-C001-16", "a62f376adefad10c5fb8b6c08ebb63-C001-18", "a62f376adefad10c5fb8b6c08ebb63-C001-77" ] }, "@MOT@": { "gold_contexts": [ [ "a62f376adefad10c5fb8b6c08ebb63-C001-18" ] ], "cite_sentences": [ "a62f376adefad10c5fb8b6c08ebb63-C001-18" ] } } }, "ABC_fd7bae08fd3e69744a3980daa1a649_38": { "x": [ { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-2", "text": "Word2vec (Mikolov et al., 2013b ) has proven to be successful in natural language processing by capturing the semantic relationships between different words." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-3", "text": "Built on top of single-word embeddings, paragraph vectors (Le and Mikolov, 2014) find fixed-length representations for pieces of text with arbitrary lengths, such as documents, paragraphs, and sentences." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-4", "text": "In this work, we propose a novel interpretation for neural-network-based paragraph vectors by developing an unsupervised generative model whose maximum likelihood solution corresponds to traditional paragraph vectors." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-5", "text": "This probabilistic formulation allows us to go beyond point estimates of parameters and to perform Bayesian posterior inference." 
}, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-6", "text": "We find that the entropy of paragraph vectors decreases with the length of documents, and that information about posterior uncertainty improves performance in supervised learning tasks such as sentiment analysis and paraphrase detection." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-7", "text": "----------------------------------" }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-9", "text": "Paragraph vectors (Le and Mikolov, 2014 ) are a recent method for embedding pieces of natural language text as fixed-length, real-valued vectors." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-10", "text": "Extending the word2vec framework (Mikolov et al., 2013b) , paragraph vectors are typically presented as neural language models, and compute a single vector representation for each paragraph." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-11", "text": "Unlike word embeddings, paragraph vectors are not shared across the entire corpus, but are instead local to each paragraph." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-12", "text": "When interpreted as a latent variable, we expect them to have higher uncertainty when the paragraphs are short." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-13", "text": "Recently, Barkan (2017) proposed a probabilistic view of word2vec that has motivated research on combining word2vec with other priors (Bamler and Mandt, 2017) ." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-14", "text": "Inspired by this progress, we extend paragraph vectors to a probabilistic model." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-15", "text": "Our model may be specified via modern inference tools like Edward (Tran et al., 2016) , which makes it easy to experiment with different inference algorithms." 
}, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-16", "text": "The experiments in Sec. 4 confirm the intuition that paragraph vectors have higher posterior uncertainty when paragraphs are short, and we show that explicitly modeling this uncertainty improves performance in supervised prediction tasks." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-17", "text": "----------------------------------" }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-18", "text": "**RELATED WORK**" }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-19", "text": "Paragraph embeddings are built on top of word embeddings, a set of dimensionality reduction tools that map words from a large vocabulary to a dense vector representation." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-20", "text": "Most word embedding methods learn a point estimate for each embedding vector (Mikolov et al., 2013a,b; Mnih and Kavukcuoglu, 2013; Goldberg and Levy, 2014; Pennington et al., 2014) ." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-21", "text": "Barkan (2017) pointed out that the skip-gram model with negative sampling, also known as word2vec (Mikolov et al., 2013b) , admits a Bayesian interpretation." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-22", "text": "The Bayesian skip-gram model allows uncertainty to be taken into account in a principled way, and lays the basis for our proposed Bayesian paragraph vector model." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-23", "text": "Many tasks in natural language processing require fixed-length features for text passages of variable length, such as sentences, paragraphs, or documents (in this paper, we treat these three terms interchangeably)." 
}, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-24", "text": "Generalizing embeddings of single words, several methods have been proposed to find dense vector representations of paragraphs (Le and Mikolov, 2014; Kiros et al., 2015; Wieting et al., 2015; Palangi et al., 2016; Pagliardini et al., 2017) ." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-25", "text": "Since paragraph embeddings are local to short pieces of text, we expect them to have high posterior uncertainty if the paragraphs are short." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-26", "text": "In this work, we incorporate the idea of paragraph vectors (Le and Mikolov, 2014) into the Bayesian skip-gram model in order to coherently infer the uncertainty associated with paragraph vectors." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-27", "text": "----------------------------------" }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-28", "text": "**METHOD**" }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-29", "text": "In Sec. 3.1, we summarize the Bayesian skip-gram model on which our model is based." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-30", "text": "We then present our Bayesian paragraph model in Sec. 3.2, and discuss two inference methods in Sec. 3.3." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-31", "text": "----------------------------------" }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-32", "text": "**BAYESIAN SKIP-GRAM MODEL**" }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-33", "text": "The Bayesian skip-gram model (Barkan, 2017 ) is a probabilistic interpretation of word2vec (Mikolov et al., 2013b) ." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-34", "text": "The left part of Figure 1 shows the generative process." 
}, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-35", "text": "For each word i in the vocabulary, the model draws a latent word embedding vector U i \u2208 R E and a latent context embedding vector V i \u2208 R E from a Gaussian prior N (0, \u03bb 2 I)." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-36", "text": "Here, E is the embedding dimension and \u03bb is a hyperparameter." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-37", "text": "The model then constructs N labeled pairs of words following a two-step process." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-38", "text": "First, a proposal pair of words (i, j) is drawn from a uniform distribution over the vocabulary." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-39", "text": "Then, the model assigns to the proposal pair a binary label z_{ij} \u223c Bernoulli(\u03c3(U_i \u00b7 V_j)), where" }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-40", "text": "\u03c3(x) = 1/(1 + e^{\u2212x}) is the sigmoid function." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-41", "text": "The pairs with label z ij = 1 form the so-called positive examples, and are assumed to correspond to occurrences of the word i in the context of word j somewhere in the corpus." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-42", "text": "The so-called negative examples with label z ij = 0 do not correspond to any observation in the corpus." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-43", "text": "When training the model, we resort to the heuristics proposed in (Mikolov et al., 2013b) to create artificial evidence for the negative examples (see Section 3.2 below)." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-44", "text": "----------------------------------" }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-45", "text": "**BAYESIAN PARAGRAPH VECTORS**" }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-46", "text": "Bayesian paragraph vectors (BPV) are a direct extension of the Bayesian skip-gram model."
}, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-47", "text": "The right part of Figure 1 shows the generative process." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-48", "text": "In addition to global word and context embeddings U and V , the model draws a paragraph vector d n \u223c N (0, \u03c6 2 I) for each of the N_docs documents in the corpus." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-49", "text": "Following Le and Mikolov (2014) , we add d n to the context vector V j when we classify a given pair of words (i, j) as a positive or a negative example." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-50", "text": "Thus, the likelihood of a word pair (i, j) in document n to have label z n,ij \u2208 {0, 1} is p(z_{n,ij} | U_i, V_j, d_n) = \u03c3(U_i \u00b7 (V_j + d_n))^{z_{n,ij}} \u03c3(\u2212U_i \u00b7 (V_j + d_n))^{1 \u2212 z_{n,ij}}." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-51", "text": "We collect evidence for the positive examples X + n in each document n by forming pairs of words (w n,t , w n,t+\u03b4 )." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-52", "text": "Here, w n,t is the word class of the t th token, t runs over all tokens in document n, \u03b4 runs from \u2212c to c where c is a small context window size, and we exclude \u03b4 = 0." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-53", "text": "Negative examples are not observed in the corpus." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-54", "text": "Following Mikolov et al. (2013b) , we construct artificial evidence X \u2212 n for negative pairs by sampling from the noise distribution P_n(w) \u221d f(w)^{3/4}" }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-55", "text": ", where f is the empirical unigram frequency across the training corpus." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-56", "text": "The log-likelihood of the entire data is thus log p(X^+, X^\u2212 | U, V, d) = \u2211_n [ \u2211_{(i,j) \u2208 X^+_n} log \u03c3(U_i \u00b7 (V_j + d_n)) + \u2211_{(i,j) \u2208 X^\u2212_n} log \u03c3(\u2212U_i \u00b7 (V_j + d_n)) ]." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-57", "text": "In the limit d n \u2192 0, Eq. (2) reduces to the negative of the word2vec loss function."
}, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-58", "text": "BPV can be easily specified in Edward, a Python library for probabilistic modeling and inference (Tran et al., 2016) :" }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-59", "text": "from edward.models import Bernoulli, Normal\nU = Normal(loc=tf.zeros((W, E), dtype=tf.float32), scale=lam)\nV = Normal(loc=tf.zeros((W, E), dtype=tf.float32), scale=lam)\nd_n = Normal(loc=tf.zeros(E, dtype=tf.float32), scale=phi)\nu_n = tf.nn.embedding_lookup(U, indices_n_I)\nv_n = tf.nn.embedding_lookup(V, indices_n_J)\nz_n = Bernoulli(logits=tf.reduce_sum(u_n * (v_n + d_n), axis=1))" }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-60", "text": "----------------------------------" }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-61", "text": "**MAP AND BLACK BOX VARIATIONAL INFERENCE**" }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-62", "text": "The BPV model has global and local latent variables." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-63", "text": "We expect the posterior of the global variables to be peaked, and therefore approximate the global word embedding matrices U and V via point estimates." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-64", "text": "We expect a broader posterior distribution for the local paragraph vectors d n ." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-65", "text": "Thus we use variational inference (VI) to fit the posterior over d n with a fully factorized Gaussian distribution." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-66", "text": "We split inference into two stages." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-67", "text": "In the first stage, we point estimate all parameters." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-68", "text": "In the second stage, we fix U and V and only perform VI for the paragraph vectors."
}, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-69", "text": "In the first stage, our goal is to train the global variables via stochastic gradient descent, where every minibatch contains a single document n and a fixed set of negative examples X \u2212 n ." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-70", "text": "We first maximize the joint probability p(X + n , X \u2212 n , U, V, d n ) w.r.t. the paragraph vector d n ." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-71", "text": "As this local optimization is noise-free, it converges quickly under a constant learning rate." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-72", "text": "Then, we perform a single gradient step for the global variables U and V ." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-73", "text": "This gradient is noisy due to the minibatch sampling and the stochastic generation of negative examples." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-74", "text": "For this reason, a decreasing learning rate is used." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-75", "text": "Finally, we reinitialize d n and proceed to the next document." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-76", "text": "Optimizing d n in a nested loop before each update step saves memory since we only need to keep track of one document vector at a time." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-77", "text": "In the second stage, we fit a variational distribution for the paragraph vectors while holding U and V fixed." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-78", "text": "We use black box VI (Ranganath et al., 2014) with reparameterization gradients (Kingma and Welling, 2014; Rezende et al., 2014) , which is provided by the Edward library." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-79", "text": "This time, we generate new negative examples in each update step to avoid overfitting."
}, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-80", "text": "The stochastic optimization is again performed with a decreasing learning rate." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-81", "text": "We also perform a separate maximum a posteriori (MAP) estimate of the paragraph vectors to serve as the baseline for downstream classification tasks." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-82", "text": "----------------------------------" }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-83", "text": "**EXPERIMENTS**" }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-84", "text": "Paragraph vectors are often used as input features for supervised learning in natural language processing (Le and Mikolov, 2014; Kiros et al., 2015; Palangi et al., 2016) ." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-85", "text": "In this section, we apply BPV to two binary classification tasks: sentiment analysis and paraphrase detection." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-86", "text": "We find that the posterior uncertainty of BPV decreases as the length of paragraphs grows." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-87", "text": "We also find that by concatenating the variational mean and standard deviation features inferred by VI, we improve classification accuracy compared to MAP point estimates of paragraph embeddings." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-88", "text": "----------------------------------" }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-89", "text": "**SENTIMENT ANALYSIS**" }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-90", "text": "We use the IMDB dataset (Maas et al., 2011) for sentiment analysis." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-91", "text": "It contains 100k movie reviews, split into 75k training points (25k labeled, 50k unlabeled) and 25k labeled test points." 
}, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-92", "text": "Positive and negative labels are balanced in both labeled subsets, and typical reviews consist of several sentences." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-93", "text": "As our algorithm is unsupervised, we run the inference algorithms described in Sec. 3.3 using all the training data, and then train a logistic regression classifier using the paragraph vectors of the labeled training data only." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-94", "text": "We use the most frequent 10k words as the vocabulary, and set the context window size c = 4, the embedding dimension E = 100, and the hyperparameters for the prior \u03bb = \u03c6 = 1. Table 1 shows the test accuracy of the two inference methods." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-95", "text": "VI outperforms MAP since it takes into account posterior uncertainty in paragraph embeddings." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-96", "text": "Fig. 2 (left) shows the entropy of the paragraph vectors, computed using the posterior variance obtained from VI." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-97", "text": "As the document length grows, the entropy decreases, which makes intuitive sense since longer reviews can be more specific." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-98", "text": "----------------------------------" }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-99", "text": "**PARAPHRASE DETECTION**" }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-100", "text": "We also test the discriminative power of BPV on the Microsoft Research Paraphrase Corpus (Dolan et al., 2004) ." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-101", "text": "Each data point consists of two sentences extracted from news sources on the web, and the goal is to predict whether they are paraphrases of each other."
}, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-102", "text": "The training set contains 4076 sentence pairs, of which 2753 are paraphrases, and the test set contains 1725 pairs, of which 1147 are paraphrases." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-103", "text": "We use the same hyperparameters as in the sentiment analysis task, except that we include in the vocabulary all words appearing more than once, because this dataset is much smaller." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-104", "text": "After finding the paragraph vectors, we train the classifier by following Kiros et al. (2015) , where features are constructed by concatenating the component-wise product x_n \u00b7 x'_n and the absolute difference |x_n \u2212 x'_n| between each pair of features x_n and x'_n ." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-105", "text": "The classification results in Table 1 show that VI again outperforms MAP." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-106", "text": "The relationship between entropy and document length shown in Fig. 2 (right) is also similar to that of the IMDB dataset." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-107", "text": "----------------------------------" }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-108", "text": "**DISCUSSION**" }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-109", "text": "We proposed Bayesian paragraph vectors, a generative model of paragraph embeddings." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-110", "text": "We treated the local latent variables of paragraph vectors in a Bayesian way because we expected high uncertainty, especially for short documents." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-111", "text": "Our experiments confirmed this intuition, and showed that knowledge of the posterior uncertainty improves the performance of downstream supervised tasks."
}, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-112", "text": "In addition to MAP and VI, we experimented with Hamiltonian Monte Carlo (HMC) inference, but our preliminary results showed worse performance; we plan to investigate further." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-113", "text": "A possible reason might be that we had to use a fixed set of negative examples for each document when generating HMC samples, which may result in overfitting to the noise." }, { "sent_id": "fd7bae08fd3e69744a3980daa1a649-C001-114", "text": "Finally, we believe that more sophisticated models of document embeddings would also benefit from a Bayesian treatment of the local variables." } ], "y": { "@BACK@": { "gold_contexts": [ [ "fd7bae08fd3e69744a3980daa1a649-C001-10", "fd7bae08fd3e69744a3980daa1a649-C001-11", "fd7bae08fd3e69744a3980daa1a649-C001-12", "fd7bae08fd3e69744a3980daa1a649-C001-9" ], [ "fd7bae08fd3e69744a3980daa1a649-C001-21", "fd7bae08fd3e69744a3980daa1a649-C001-22" ], [ "fd7bae08fd3e69744a3980daa1a649-C001-33" ], [ "fd7bae08fd3e69744a3980daa1a649-C001-2" ] ], "cite_sentences": [ "fd7bae08fd3e69744a3980daa1a649-C001-10", "fd7bae08fd3e69744a3980daa1a649-C001-21", "fd7bae08fd3e69744a3980daa1a649-C001-33" ] }, "@MOT@": { "gold_contexts": [ [ "fd7bae08fd3e69744a3980daa1a649-C001-10", "fd7bae08fd3e69744a3980daa1a649-C001-11", "fd7bae08fd3e69744a3980daa1a649-C001-12", "fd7bae08fd3e69744a3980daa1a649-C001-9" ] ], "cite_sentences": [ "fd7bae08fd3e69744a3980daa1a649-C001-10" ] }, "@USE@": { "gold_contexts": [ [ "fd7bae08fd3e69744a3980daa1a649-C001-43" ], [ "fd7bae08fd3e69744a3980daa1a649-C001-54", "fd7bae08fd3e69744a3980daa1a649-C001-55" ] ], "cite_sentences": [ "fd7bae08fd3e69744a3980daa1a649-C001-43", "fd7bae08fd3e69744a3980daa1a649-C001-54" ] } } }, "ABC_2db25254f275303c41f1e7ab15a5e0_38": { "x": [ { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-2", "text": "Many discourse 
relations are explicitly marked with discourse connectives, and these examples could potentially serve as a plentiful source of training data for recognizing implicit discourse relations." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-3", "text": "However, there are important linguistic differences between explicit and implicit discourse relations, which limit the accuracy of such an approach." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-4", "text": "We account for these differences by applying techniques from domain adaptation, treating implicitly and explicitly-marked discourse relations as separate domains." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-5", "text": "The distribution of surface features varies across these two domains, so we apply a marginalized denoising autoencoder to induce a dense, domain-general representation." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-6", "text": "The label distribution is also domain-specific, so we apply a resampling technique that is similar to instance weighting." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-7", "text": "In combination with a set of automatically-labeled data, these improvements eliminate more than 80% of the transfer loss incurred by training an implicit discourse relation classifier on explicitly-marked discourse relations." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-8", "text": "----------------------------------" }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-10", "text": "Discourse relations reveal the structural organization of text, potentially supporting applications such as summarization (Louis et al., 2010; Yoshida et al., 2014) , sentiment analysis (Somasundaran et al., 2009) , and coherence evaluation (Lin et al., 2011) ." 
}, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-11", "text": "While some relations are signaled explicitly with connectives such as however (Pitler et al., 2008) , many more are implicit." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-12", "text": "Expert-annotated datasets of implicit discourse relations are expensive to produce, so it would be preferable to use weak supervision, by automatically labeling instances with explicit connectives (Marcu and Echihabi, 2003) ." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-13", "text": "However, Sporleder and Lascarides (2008) show that models trained on explicitly marked examples generalize poorly to implicit relation identification." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-14", "text": "They argued that explicit and implicit examples may be linguistically dissimilar, as writers tend to avoid discourse connectives if the discourse relation could be inferred from context (Grice, 1975) ." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-15", "text": "Similar observations are made by Rutherford and Xue (2015) , who attempt to add automatically-labeled instances to improve supervised classification of implicit discourse relations." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-16", "text": "In this paper, we approach this problem from the perspective of domain adaptation." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-17", "text": "Specifically, we argue that the reason that automatically-labeled examples generalize poorly is due to domain mismatch from the explicit relations (source domain) to the implicit relations (target domain)." 
}, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-18", "text": "We propose to close the gap by using two simple methods from domain adaptation: (1) feature representation learning: mapping the source domain and target domain to a shared latent feature space; (2) resampling: modifying the relation distribution in the explicit relations to match the distribution over implicit relations." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-19", "text": "Our results on the Penn Discourse Treebank (Prasad et al., 2008) show that these two methods improve the performance on unsupervised discourse relation identification by more than 8.4% on average F 1 score across all relation types, an 82% reduction on the transfer loss incurred by training on explicitly-marked discourse relations." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-20", "text": "2 Related Work Marcu and Echihabi (2003) train a classifier for implicit intra-sentence discourse relations from explicitly-marked examples in the rhetorical structure theory (RST) treebank, where the relations are automatically labeled by their discourse connectives: for example, labeling the relation as CONTRAST if the connective is but." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-21", "text": "However, Sporleder and Lascarides (2008) argue that explicitly marked relations are too different from implicit relations to serve as an adequate supervision signal, obtaining negative results in segmented discourse representation theory (SDRT) relations." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-22", "text": "More recent work has focused on the Penn Discourse Treebank (PDTB), using explicitly-marked relations to supplement, rather than replace, a labeled corpus of implicit relations." 
}, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-23", "text": "For example, Biran and McKeown (2013) collect word pairs from arguments of explicit examples to help the supervised learning on implicit relation identification." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-24", "text": "Lan et al. (2013) present a multi-task learning framework, using explicit relation identification as auxiliary tasks to help main task on implicit relation identification." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-25", "text": "Rutherford and Xue (2015) explore several selection heuristics for adding automatically-labeled examples from Gigaword to their system for implicit relation detection, obtaining a 2% improvement in Macro-F 1 ." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-26", "text": "Our work differs from these previous efforts in that we focus exclusively on training from automaticallylabeled explicit instances, rather than supplementing a training set of manually-labeled implicit examples." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-27", "text": "Learning good feature representations (BenDavid et al., 2007) and reducing mismatched label distributions (Joshi et al., 2012) are two main ways to make a domain adaptation task successful." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-28", "text": "Structural correspondence learning is an early example of representation learning for domain adaptation (Blitzer et al., 2006) ; we build on the more computationally tractable approach of marginalized denoising autoencoders (Chen et al., 2012) ." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-29", "text": "Instance weighting is an approach for correcting label distribution mismatch (Jiang and Zhai, 2007) ; we apply a simpler approach of resampling the source domain according to an estimate of the target domain label distribution." 
}, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-30", "text": "----------------------------------" }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-31", "text": "**DOMAIN ADAPTATION FOR IMPLICIT RELATION IDENTIFICATION**" }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-32", "text": "We employ two domain adaptation techniques: learning feature representations, and resampling to match the target label distribution." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-33", "text": "----------------------------------" }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-34", "text": "**LEARNING FEATURE REPRESENTATION: MARGINALIZED DENOISING AUTOENCODERS**" }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-35", "text": "The goal of feature representation learning is to obtain dense features that capture feature correlations between the source and target domains." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-36", "text": "Denoising autoencoders (Glorot et al., 2011) do this by first \"corrupting\" the original data, x 1 , . . . , x n intox 1 , . . . ,x n , either by adding Gaussian noise (in the case of real-valued data) or by randomly zeroing out features (in the case of binary data)." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-37", "text": "We can then learn a function to reconstruct the original data, thereby capturing feature correlations and improving resilience to domain shift." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-38", "text": "Chen et al. (2012) propose a particularly simple and elegant form of denoising autencoder, by marginalizing over the noising process." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-39", "text": "Their single-layer marginalized denoising autoencoder (mDA) solves the following problem:" }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-40", "text": "where the parameter W \u2208 R d\u00d7d is a projection matrix." 
}, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-41", "text": "After learning the projection matrix, we use tanh(Wx) as the representation for our relation identification task." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-42", "text": "Usually, x i \u2208 R d is a sparse vector with more than 10 5 dimensions." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-43", "text": "Solving the optimization problem defined in equation 1 will produce a d \u00d7 d dense matrix W, and is prohibitively expensive." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-44", "text": "We employ the trick proposed by Blitzer et al. (2006) to select \u03ba pivot features to be reconstructed." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-45", "text": "We then split all features into nonoverlapping subsets of size \u2264 K. Then, a set of projection matrices are learned, so as to transform each feature subset to the pivot feature set." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-46", "text": "The final projection matrix W is the stack of all projection matrices learned from the feature subsets." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-47", "text": "----------------------------------" }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-48", "text": "**HANDLING MISMATCHED LABEL DISTRIBUTIONS:**" }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-49", "text": "Resampling with minimal supervision" }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-50", "text": "There is a notable mismatch between the relation distributions for implicit and explicitly-marked discourse relations in the Penn Discourse Treebank: as shown in Figure 1 , the EXPANSION and CONTINGENCY relations comprise a greater share of the implicit relations, while the TEMPORAL and COMPARISON relations comprise a greater share of the explicitly-marked discourse relations." 
}, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-51", "text": "Such label distribution mismatches can be a major source of transfer loss across domains, and therefore, reducing this mismatch can be an easy way to obtain performance gains in domain adaptation (Joshi et al., 2012) ." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-52", "text": "Specifically, our goal is to modify the relation distribution in the source domain (explicitly-marked relations) and make it as similar as possible to the target domain (implicit relations)." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-53", "text": "Given the label distribution from the target domain, we resample training examples from the source domain with replacement, in order to match the label distribution in the target domain." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-54", "text": "As this requires the label distribution from the target domain, it is no longer purely unsupervised domain adaptation; instead, we call it resampling with minimal supervision." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-55", "text": "It may also be desirable to ensure that the source and target training instances are similar in terms of their observed features; this is the idea behind the instance weighting approach to domain adaptation (Jiang and Zhai, 2007) ." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-56", "text": "Motivated by this idea, we require that sampled instances from the source domain have a cosine similarity of at least \u03c4 with at least one target domain instance (Rutherford and Xue, 2015) ." 
}, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-57", "text": "----------------------------------" }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-58", "text": "**EXPERIMENTS**" }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-59", "text": "Our experiments test the utility of the two domain adaptation methods, using the Penn Discourse Treebank (Prasad et al., 2008) and some extra-training data collected from a external resource." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-60", "text": "----------------------------------" }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-61", "text": "**EXPERIMENTAL SETUP DATASETS**" }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-62", "text": "The test examples are implicit relation instances from section 21-22 in the PDTB." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-63", "text": "For the domain adaptation setting, the training set consists of the explicitly-marked examples extracted from sections 02-20 and 23-24, and the development set consists of the explicit relations from sections 21-22." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-64", "text": "All relations in the explicit examples are automatically labeled by using the connective-to-relation mapping from Table 2 in (Prasad et al., 2007) , where we only keep the majority relation type for every connective." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-65", "text": "For each identified connective, we use its annotated arguments in the PDTB." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-66", "text": "As an upper bound, we also train a supervised discourse relation classifier, using the implicit examples in sections 02-20 and 23-24 as the training set, and using sections 00-01 as the development set." 
}, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-67", "text": "Following prior work (Pitler et al., 2009; Park and Cardie, 2012; Biran and McKeown, 2013) , we consider the firstlevel discourse relations in the PDTB -Temporal (TEMP.), Comparison (COMP.), Expansion (EXP.) and Contingency (CONT.)." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-68", "text": "We train binary classifiers and report F 1 score on each binary classification task." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-69", "text": "Extension of this approach to multi-class classification is important, but since this is not the setting considered in most of the prior research, we leave it for future work." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-70", "text": "The true power of learning from automatically labeled examples is that we could leverage much larger datasets than hand-annotated corpora such as the Penn Discourse Treebank." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-71", "text": "To test this idea, we collected 1,000 news articles from CNN.com as extra training data." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-72", "text": "Explicitly-marked discourse relations from this data are automatically extracted by matching the PDTB discourse connectives (Prasad et al., 2007) ." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-73", "text": "For this data, we also need to extract the arguments of the identified connectives: for every identified connective, the sentence following this connective is labeled as Arg2 and the preceding sentence is labeled as Arg1, as suggested by Biran and McKeown (2013) ." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-74", "text": "In a pilot study we found that larger amounts of additional training data yielded no further improvements, which is consistent with the recent results of Rutherford and Xue (2015) ." 
}, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-75", "text": "Model selection We use a linear support vector machine (Fan et al., 2008) as the classification model." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-76", "text": "Our model includes five tunable parameters: the number of pivot features \u03ba, the size of the feature subset K, the noise level for the denoising autoencoder q, the cosine similarity threshold for resampling \u03c4 , and the penalty parameter C for the SVM classifier." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-77", "text": "We consider \u03ba \u2208 {1000, 2000, 3000} for pivot features and C \u2208 {0.001, 0.01, 0.1, 1.0, 10.0} for penalty parameters, q \u2208 {0.90, 0.95, 0.99} for noise levels." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-78", "text": "To reduce the free parameters, we set K = 5\u03ba and simply fix the cosine similarity threshold \u03c4 = Features All features are motivated by prior work on implicit discourse relation classification: from each training example with two arguments, we extract (1) Lexical features, including word pairs, the first and last words from both arguments (Pitler et al., 2009 ); (2) Syntactic features, including production rules from each argument, and the shared production rules between two arguments (Lin et al., 2009 ); (3) Other features, including modality, Inquirer tags, Levin verb classes, and argument polarity (Park and Cardie, 2012) ." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-79", "text": "We re-implement these features as closely as possible to the cited works, using the Stanford CoreNLP Toolkit to obtain syntactic annotations (Manning et al., 2014) ." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-80", "text": "The FULL feature set for domain adaptation is constructed by collecting all features from the training set, and then removing features that occur fewer than ten times." 
}, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-81", "text": "The PIVOT feature set includes \u03ba high-frequency features from the FULL feature set." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-82", "text": "To focus on testing the domain adaptation techniques, we use the same FULL and PIVOT set for all four relations, and leave feature set optimization for each relation as a future work (Park and Cardie, 2012) ." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-83", "text": "To obtain the upper bound, we employ the same feature categories and frequency threshold to extract features from the in-domain data, hand-annotated implicit discourse relations." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-84", "text": "To include the representations from the marginalized denoising autoencoder for relation identification, we concatenate them with the original surface feature representations of the same examples." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-85", "text": "----------------------------------" }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-86", "text": "**EXPERIMENTAL RESULTS**" }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-87", "text": "In experiments, we start with surface feature representations as baselines, then incorporate the two domain adaptation techniques incrementally." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-88", "text": "As shown in line 2 of Table 1 , the performance is poor if directly applying a model trained on the explicit examples with the FULL feature set, which is consistent with the observations of Sporleder and Lascarides (2008) : there is a 10.28% absolute reduction on average F 1 score from the upper bound obtained with in-domain supervision (line 1)." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-89", "text": "With mDA, the overall performance increases by 0.86% (line 4); resampling gives a further 4.16% improvement mainly because of the performance gain on the EXP." 
}, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-90", "text": "relation (line 5)." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-91", "text": "The resampling method itself (line 3) also gives a better overall performance then mDA (line 4)." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-92", "text": "However, the F 1 scores on the TEMP. and CONT." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-93", "text": "are even worse than the baseline (line 2)." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-94", "text": "Surface representations with the FULL feature set were found to cause serious overfitting in the experiments." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-95", "text": "To deal with this problem, we propose to use only \u03ba pivot features, which gives a stronger baseline of the cross-domain relation identification, as shown in line 6." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-96", "text": "Then, by incorporating resampling and feature representation learning individually, the average F 1 increases from 32.74% to 35.44% (line 7) and 36.69% (line 8) respectively." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-97", "text": "The combination of these two domain adaptation techniques boosts the average F 1 further to 38.62% (line 9)." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-98", "text": "The additional CNN training data further improves performance to 39.46% (line 10)." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-99", "text": "This represents an 8.42% improvement of average F 1 from the original result (line 2), for more than 80% reduction on the transfer loss incurred by training on explicit discourse relations." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-100", "text": "An additional experiment is to use automatic argument extraction in both the PDTB and the CNN data, which would correspond to more truly unsupervised domain adaptation." 
}, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-101", "text": "(Recall that in the CNN data, we used adjacent sentences as argument spans, while in the PDTB data, we use expert annotations.) When using adjacent sentences as argument spans in both datasets, the average F 1 is 38.52% for the combination of representation learning and resampling." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-102", "text": "Compared to line 10, this is a 0.94% performance drop, indicating the importance of argument identification in the PDTB data." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-103", "text": "In future work we may consider better heuristics for argument extraction, such as obtaining automatically-labeled examples only from those connectors for whom the arguments usually are the adjacent sentences; for example, the connector nonetheless usually connects adjacent spans (e.g., Bob was hungry." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-104", "text": "Nonetheless he gave Tina the burger.), while the connector even though may connect two spans that follow the connector in the same sentence (e.g., Even though Bob was hungry, he gave Tina the burger.)." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-105", "text": "----------------------------------" }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-106", "text": "**CONCLUSION**" }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-107", "text": "We have presented two methods -feature representation learning and resampling -from domain adaptation to close the gap of using explicit examples for unsupervised implicit discourse relation identification." }, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-108", "text": "Experiments on the PDTB show the combination of these two methods eliminates more than 80% of the transfer loss caused by training on explicit examples, increasing average F 1 from 31% to 39.5%, against a supervised upper bound of 41.3%." 
}, { "sent_id": "2db25254f275303c41f1e7ab15a5e0-C001-109", "text": "Future work will explore the combination of this approach with more sophisticated techniques for instance selection (Rutherford and Xue, 2015) and feature selection (Park and Cardie, 2012; Biran and McKeown, 2013) , while also tackling the more difficult problems of multi-class relation classification and fine-grained level-2 discourse relations." } ], "y": { "@BACK@": { "gold_contexts": [ [ "2db25254f275303c41f1e7ab15a5e0-C001-13", "2db25254f275303c41f1e7ab15a5e0-C001-14", "2db25254f275303c41f1e7ab15a5e0-C001-15", "2db25254f275303c41f1e7ab15a5e0-C001-16" ], [ "2db25254f275303c41f1e7ab15a5e0-C001-25", "2db25254f275303c41f1e7ab15a5e0-C001-26" ], [ "2db25254f275303c41f1e7ab15a5e0-C001-55", "2db25254f275303c41f1e7ab15a5e0-C001-56" ] ], "cite_sentences": [ "2db25254f275303c41f1e7ab15a5e0-C001-15", "2db25254f275303c41f1e7ab15a5e0-C001-25", "2db25254f275303c41f1e7ab15a5e0-C001-56" ] }, "@MOT@": { "gold_contexts": [ [ "2db25254f275303c41f1e7ab15a5e0-C001-13", "2db25254f275303c41f1e7ab15a5e0-C001-14", "2db25254f275303c41f1e7ab15a5e0-C001-15", "2db25254f275303c41f1e7ab15a5e0-C001-16" ] ], "cite_sentences": [ "2db25254f275303c41f1e7ab15a5e0-C001-15" ] }, "@DIF@": { "gold_contexts": [ [ "2db25254f275303c41f1e7ab15a5e0-C001-25", "2db25254f275303c41f1e7ab15a5e0-C001-26" ] ], "cite_sentences": [ "2db25254f275303c41f1e7ab15a5e0-C001-25" ] }, "@SIM@": { "gold_contexts": [ [ "2db25254f275303c41f1e7ab15a5e0-C001-55", "2db25254f275303c41f1e7ab15a5e0-C001-56" ], [ "2db25254f275303c41f1e7ab15a5e0-C001-74" ] ], "cite_sentences": [ "2db25254f275303c41f1e7ab15a5e0-C001-56", "2db25254f275303c41f1e7ab15a5e0-C001-74" ] }, "@FUT@": { "gold_contexts": [ [ "2db25254f275303c41f1e7ab15a5e0-C001-107", "2db25254f275303c41f1e7ab15a5e0-C001-109" ] ], "cite_sentences": [ "2db25254f275303c41f1e7ab15a5e0-C001-109" ] } } }, "ABC_ad9b663ac88667c1b88767ca4b2f8f_38": { "x": [ { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-1", "text": 
"**ABSTRACT**" }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-2", "text": "This paper presents a comparative study of target dependency structures yielded by several state-of-the-art linguistic parsers." }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-3", "text": "Our approach is to measure the impact of these nonisomorphic dependency structures to be used for string-to-dependency translation." }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-4", "text": "Besides using traditional dependency parsers, we also use the dependency structures transformed from PCFG trees and predicate-argument structures (PASs) which are generated by an HPSG parser and a CCG parser." }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-5", "text": "The experiments on Chinese-to-English translation show that the HPSG parser's PASs achieved the best dependency and translation accuracies." }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-6", "text": "----------------------------------" }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-8", "text": "Target language side dependency structures have been successfully used in statistical machine translation (SMT) by Shen et al. (2008) and achieved state-of-the-art results as reported in the NIST 2008 Open MT Evaluation workshop and the NTCIR-9 Chinese-to-English patent translation task (Goto et al., 2011; Ma and Matsoukas, 2011) ." }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-9", "text": "A primary advantage of dependency representations is that they have a natural mechanism for representing discontinuous constructions, which arise due to longdistance dependencies or in languages where grammatical relations are often signaled by morphology instead of word order (McDonald and Nivre, 2011) ." 
}, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-10", "text": "It is known that dependency-style structures can be transformed from a number of linguistic struc- * Now at Baidu Inc." }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-11", "text": "\u2020 Now at Nara Institute of Science & Technology (NAIST) tures." }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-12", "text": "For example, using the constituent-todependency conversion approach proposed by Johansson and Nugues (2007) , we can easily yield dependency trees from PCFG style trees." }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-13", "text": "A semantic dependency representation of a whole sentence, predicate-argument structures (PASs), are also included in the output trees of (1) a state-of-the-art head-driven phrase structure grammar (HPSG) (Pollard and Sag, 1994; Sag et al., 2003) parser, Enju 1 (Miyao and Tsujii, 2008) and (2) a state-of-the-art CCG parser 2 (Clark and Curran, 2007) ." }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-14", "text": "The motivation of this paper is to investigate the impact of these non-isomorphic dependency structures to be used for SMT." }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-15", "text": "That is, we would like to provide a comparative evaluation of these dependencies in a string-to-dependency decoder (Shen et al., 2008) ." }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-16", "text": "----------------------------------" }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-17", "text": "**GAINING DEPENDENCY STRUCTURES**" }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-18", "text": "----------------------------------" }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-19", "text": "**DEPENDENCY TREE**" }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-20", "text": "We follow the definition of dependency graph and dependency tree as given in (McDonald and Nivre, 2011) ." 
}, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-21", "text": "A dependency graph G for sentence s is called a dependency tree when it satisfies, (1) the nodes cover all the words in s besides the ROOT; (2) one node can have one and only one head (word) with a determined syntactic role; and (3) the ROOT of the graph is reachable from all other nodes." }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-22", "text": "For extracting string-to-dependency transfer rules, we use well-formed dependency structures, either fixed or floating, as defined in (Shen et al., 2008) ." }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-23", "text": "Similarly, we ignore the syntactic roles both during rule extracting and target dependency language model (LM) training." }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-24", "text": "----------------------------------" }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-25", "text": "**DEPENDENCY PARSING**" }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-26", "text": "Graph-based and transition-based are two predominant paradigms for data-driven dependency parsing." }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-27", "text": "The MST parser (McDonald et al., 2005) and the Malt parser (Nivre, 2003) stand for two typical parsers, respectively." }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-28", "text": "Parsing accuracy comparison and error analysis under the CoNLL-X dependency shared task data (Buchholz and Marsi, 2006) have been performed by McDonald and Nivre (2011) ." }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-29", "text": "Here, we compare them on the SMT tasks through parsing the real-world SMT data." 
}, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-30", "text": "----------------------------------" }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-31", "text": "**PCFG PARSING**" }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-32", "text": "For PCFG parsing, we select the Berkeley parser (Petrov and Klein, 2007) ." }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-33", "text": "In order to generate wordlevel dependency trees from the PCFG tree, we use the LTH constituent-to-dependency conversion tool 3 written by Johansson and Nugues (2007) ." }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-34", "text": "The head finding rules 4 are according to Magerman (1995) and Collins (1997) ." }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-35", "text": "Similar approach has been originally used by Shen et al. (2008) ." }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-36", "text": "----------------------------------" }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-37", "text": "**HPSG PARSING**" }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-38", "text": "In the Enju English HPSG grammar (Miyao et al., 2003) used in this paper, the semantic content of 1. finding, i.e., find the syntactic/semantic head word of each argument node through a bottomup traversal of the tree;" }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-39", "text": "2. mapping, i.e., determine the arc directions (among a predicate word and the syntactic/semantic head words of the argument nodes) for each predicate type according to Table 1 ." }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-40", "text": "Then, a dependency graph will be generated; 3. checking, i.e., post modifying the dependency graph according to the definition of dependency tree (Section 2.1)." }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-41", "text": "Table 1 lists the mapping from HPSG's PAS types to word-level dependency arcs." 
}, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-42", "text": "Since a non-terminal node in an HPSG tree has two kinds of heads, syntactic or semantic, we will generate two dependency graphs after mapping." }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-43", "text": "We use \"PAS+syn\" to represent the dependency trees generated from the HPSG PASs guided by the syntactic heads." }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-44", "text": "For semantic heads, we use \"PAS+sem\"." }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-45", "text": "For example, refer to t0 = when in Figure 1 ." }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-46", "text": "Its arg1 = c16 (with syntactic head t10), arg2 = c3 (with syntactic head t6), and PAS type = conj arg12." }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-47", "text": "In Table 1 , this PAS type corresponds to arg2\u2192pred\u2192arg1, then the result word-level dependency is t6(is)\u2192t0(when)\u2192t10(is)." }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-48", "text": "We need to post modify the dependency graph after applying the mapping, since it is not guaranteed to be a dependency tree." }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-49", "text": "Referring to the definition of dependency tree (Section 2.1), we need the strategy for (1) selecting only one head from multiple" }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-50", "text": "arg2/pred \u2192 arg1 det arg1,it arg1,punct arg1 pred \u2192 arg1 dtv arg2 pred \u2192 arg2 lgs arg2" }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-51", "text": "arg2 \u2192 pred heads and (2) appending dependency relations for those words/punctuation that do not have any head." }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-52", "text": "When one word has multiple heads, we only keep one." 
}, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-53", "text": "The selection strategy is that, if this arc was deleted, it will cause the biggest number of words that can not reach to the root word anymore." }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-54", "text": "In case of a tie, we greedily pack the arc that connect two words w i and w j where |i \u2212 j| is the biggest." }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-55", "text": "For all the words and punctuation that do not have a head, we greedily take the root word of the sentence as their heads." }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-56", "text": "In order to fully use the training data, if there are directed cycles in the result dependency graph, we still use the graph in our experiments, where only partial dependency arcs, i.e., those target flat/hierarchical phrases attached with well-formed dependency structures, can be used during translation rule extraction." }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-57", "text": "----------------------------------" }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-58", "text": "**CCG PARSING**" }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-59", "text": "We also use the predicate-argument dependencies generated by the CCG parser developed by Clark and Curran (2007) ." }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-60", "text": "The algorithm for generating word-level dependency tree is easier than processing the PASs included in the HPSG trees, since the word level predicate-argument relations have already been included in the output of CCG parser." }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-61", "text": "The mapping from predicate types to the gold-standard grammatical relations can be found in Table 13 in (Clark and Curran, 2007) ." 
}, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-62", "text": "The post-processing is like that described for HPSG parsing, except we greedily use the MST's sentence root when we can not determine it based on the CCG parser's PASs." }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-63", "text": "----------------------------------" }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-64", "text": "**EXPERIMENTS**" }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-65", "text": "----------------------------------" }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-66", "text": "**SETUP**" }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-67", "text": "We re-implemented the string-to-dependency decoder described in (Shen et al., 2008 (Pauls and Klein, 2011) , was employed to train (1) a five-gram LM on the Xinhua portion of LDC English Gigaword corpus v3 (LDC2007T07) and (2) a tri-gram dependency LM on the English dependency structures of the training data." }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-68", "text": "We report the translation quality using the case-insensitive BLEU-4 metric (Papineni et al., 2002) ." }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-69", "text": "----------------------------------" }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-70", "text": "**STATISTICS OF DEPENDENCIES**" }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-71", "text": "We compare the similarity of the dependencies with each other, as shown in Table 2 ." }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-72", "text": "Basically, we investigate (1) if two dependency graphs of one sentence share the same root word and (2) if the head of one word in one sentence are identical in two dependency graphs." }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-73", "text": "In terms of root word comparison, we observe that MST and CCG share 87.3% of identical root words, caused by borrowing roots from MST to CCG." 
}, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-74", "text": "Then, it is interesting that Berkeley and PAS+syn share 74.8% of identical root words." }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-75", "text": "Note that the Berkeley parser is trained on the Penn treebank (Marcus et al., 1994) 2008)." }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-76", "text": "In terms of head word comparison, PAS+syn and PAS+sem share 79.1% of identical head words." }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-77", "text": "This is basically due to that we used the similar PASs of the HPSG trees." }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-78", "text": "Interestingly, there are only 59.3% identical root words shared by PAS+syn and PAS+sem." }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-79", "text": "This reflects the significant difference between syntactic and semantic heads." }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-80", "text": "We also manually created the golden dependency trees for the first 200 English sentences in the training data." }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-81", "text": "The precision/recall (P/R) are shown in Table 3 ." }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-82", "text": "We observe that (1) the translation accuracies approximately follow the P/R scores yet are not that sensitive to their large variances, and (2) it is still tough for domain-adapting from the treebanktrained parsers to parse the real-world SMT data." }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-83", "text": "PAS+syn performed the best by avoiding the errors of missing of arguments for a predicate, wrongly identified head words for a linguistic phrase, and inconsistency dependencies inside relatively long coordinate structures." 
}, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-84", "text": "These errors significantly influence the number of extractable translation rules and the final translation accuracies." }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-85", "text": "Note that, these P/R scores on the first 200 sentences (all from less than 20 newswire documents) shall only be taken as an approximation of the total training data and not necessarily exactly follow the tendency of the final BLEU scores." }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-86", "text": "For example, CCG is worse than Malt in terms of P/R yet with a higher BLEU score." }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-87", "text": "We argue this is mainly due to that the number of illegal dependency trees generated by Malt is the highest." }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-88", "text": "Consequently, the number of flat/hierarchical rules generated by using Malt trees is the lowest." }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-89", "text": "Also, PAS+sem has a lower P/R than Berkeley, yet their final BLEU scores are not statistically different." }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-90", "text": "----------------------------------" }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-91", "text": "**RESULTS**" }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-92", "text": "Table 3 also shows the BLEU scores, the number of flat phrases and hierarchical rules (both integrated with target dependency structures), and the number of illegal dependency trees generated by each parser." 
}, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-93", "text": "From the table, we have the following observations: (1) all the dependency structures (except Malt) achieved a significant better BLEU score than the phrasal Moses; (2) PAS+syn performed the best in the test set (0.3376), and it is significantly better than phrasal/hierarchical Moses (p < 0.01), MST (p < 0.05), Malt (p < 0.01), Berkeley (p < 0.05), and CCG (p < 0.05); and (3) CCG performed as well as MST and Berkeley." }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-94", "text": "These results lead us to argue that the robustness of deep syntactic parsers can be advantageous in SMT compared with traditional dependency parsers." }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-95", "text": "----------------------------------" }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-96", "text": "**CONCLUSION**" }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-97", "text": "We have constructed a string-to-dependency translation platform for comparing non-isomorphic target dependency structures." }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-98", "text": "Specially, we proposed an algorithm for generating word-based dependency trees from PASs which are generated by a state-ofthe-art HPSG parser." }, { "sent_id": "ad9b663ac88667c1b88767ca4b2f8f-C001-99", "text": "We found that dependency trees transformed from these HPSG PASs achieved the best dependency/translation accuracies." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "ad9b663ac88667c1b88767ca4b2f8f-C001-8" ], [ "ad9b663ac88667c1b88767ca4b2f8f-C001-14", "ad9b663ac88667c1b88767ca4b2f8f-C001-15" ], [ "ad9b663ac88667c1b88767ca4b2f8f-C001-32", "ad9b663ac88667c1b88767ca4b2f8f-C001-33", "ad9b663ac88667c1b88767ca4b2f8f-C001-34", "ad9b663ac88667c1b88767ca4b2f8f-C001-35" ] ], "cite_sentences": [ "ad9b663ac88667c1b88767ca4b2f8f-C001-8", "ad9b663ac88667c1b88767ca4b2f8f-C001-15", "ad9b663ac88667c1b88767ca4b2f8f-C001-35" ] }, "@MOT@": { "gold_contexts": [ [ "ad9b663ac88667c1b88767ca4b2f8f-C001-14", "ad9b663ac88667c1b88767ca4b2f8f-C001-15" ] ], "cite_sentences": [ "ad9b663ac88667c1b88767ca4b2f8f-C001-15" ] }, "@USE@": { "gold_contexts": [ [ "ad9b663ac88667c1b88767ca4b2f8f-C001-22" ], [ "ad9b663ac88667c1b88767ca4b2f8f-C001-67" ] ], "cite_sentences": [ "ad9b663ac88667c1b88767ca4b2f8f-C001-22", "ad9b663ac88667c1b88767ca4b2f8f-C001-67" ] }, "@SIM@": { "gold_contexts": [ [ "ad9b663ac88667c1b88767ca4b2f8f-C001-32", "ad9b663ac88667c1b88767ca4b2f8f-C001-33", "ad9b663ac88667c1b88767ca4b2f8f-C001-34", "ad9b663ac88667c1b88767ca4b2f8f-C001-35" ] ], "cite_sentences": [ "ad9b663ac88667c1b88767ca4b2f8f-C001-35" ] } } }, "ABC_a5d7f5c5fed218149818463427d6a1_38": { "x": [ { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-2", "text": "In this paper, we quantify, analyze and mitigate gender bias exhibited in ELMo's contextualized word vectors." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-3", "text": "First, we conduct several intrinsic analyses and find that (1) training data for ELMo contains significantly more male than female entities, (2) the trained ELMo embeddings systematically encode gender information and (3) ELMo unequally encodes gender information about male and female entities." 
}, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-4", "text": "Then, we show that a state-of-the-art coreference system that depends on ELMo inherits its bias and demonstrates significant bias on the WinoBias probing corpus." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-5", "text": "Finally, we explore two methods to mitigate such gender bias and show that the bias demonstrated on WinoBias can be eliminated." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-6", "text": "----------------------------------" }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-8", "text": "Distributed representations of words in the form of word embeddings Pennington et al., 2014) and contextualized word embeddings (Peters et al., 2018; Devlin et al., 2018; Radford et al., 2018; McCann et al., 2017; Radford et al., 2019) have led to huge performance improvement on many NLP tasks." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-30", "text": "To mitigate bias from word embeddings, Bolukbasi et al. (2016) propose a post-processing method to project out the bias subspace from the pre-trained embeddings." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-9", "text": "However, several recent studies show that training word embeddings in large corpora could lead to encoding societal biases present in these human-produced data (Bolukbasi et al., 2016; Caliskan et al., 2017) ." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-10", "text": "In this work, we extend these analyses to the ELMo contextualized word embeddings." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-11", "text": "Our work provides a new intrinsic analysis of how ELMo represents gender in biased ways." 
}, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-12", "text": "First, the corpus used for training ELMo has a significant gender skew: male entities are nearly three times more common than female entities, which leads to gender bias in the downloadable pre-trained contextualized embeddings." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-13", "text": "Then, we apply principal component analysis (PCA) to show that after training on such biased corpora, there exists a lowdimensional subspace that captures much of the gender information in the contextualized embeddings." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-14", "text": "Finally, we evaluate how faithfully ELMo preserves gender information in sentences by measuring how predictable gender is from ELMo representations of occupation words that co-occur with gender revealing pronouns." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-15", "text": "Our results show that ELMo embeddings perform unequally on male and female pronouns: male entities can be predicted from occupation words 14% more accurately than female entities." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-16", "text": "In addition, we examine how gender bias in ELMo propagates to the downstream applications." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-17", "text": "Specifically, we evaluate a state-of-the-art coreference resolution system ) that makes use of ELMo's contextual embeddings on WinoBias (Zhao et al., 2018a) , a coreference diagnostic dataset that evaluates whether systems behave differently on decisions involving male and female entities of stereotyped or anti-stereotyped occupations." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-18", "text": "We find that in the most challenging setting, the ELMo-based system has a disparity in accuracy between pro-and anti-stereotypical predictions, which is nearly 30% higher than a similar system based on GloVe (Lee et al., 2017) ." 
}, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-19", "text": "Finally, we investigate approaches for mitigating the bias which propagates from the contextualized word embeddings to a coreference resolution system." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-20", "text": "We explore two different strategies: (1) a training-time data augmentation technique (Zhao et al., 2018a) , where we augment the corpus for training the coreference system with its genderswapped variant (female entities are swapped to male entities and vice versa) and, afterwards, retrain the coreference system; and (2)" }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-21", "text": "----------------------------------" }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-22", "text": "**RELATED WORK**" }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-23", "text": "Gender bias has been shown to affect several realworld applications relying on automatic language analysis, including online news (Ross and Carter, 2011) , advertisements (Sweeney, 2013) , abusive language detection (Park et al., 2018) , machine translation (Font and Costa-juss\u00e0, 2019; Vanmassenhove et al., 2018) , and web search (Kay et al., 2015) ." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-24", "text": "In many cases, a model not only replicates bias in the training data but also amplifies it (Zhao et al., 2017) ." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-25", "text": "For word representations, Bolukbasi et al. (2016) and Caliskan et al. (2017) show that word embeddings encode societal biases about gender roles and occupations, e.g. engineers are stereotypically men, and nurses are stereotypically women." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-26", "text": "As a consequence, downstream applications that use these pretrained word embeddings also reflect this bias." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-27", "text": "For example, Zhao et al. 
(2018a) and Rudinger et al. (2018) show that coreference resolution systems relying on word embeddings encode such occupational stereotypes." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-28", "text": "In concurrent work, May et al. (2019) measure gender bias in sentence embeddings, but their evaluation is on the aggregation of word representations." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-29", "text": "In contrast, we analyze bias in contextualized word representations and its effect on a downstream task." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-31", "text": "Their method is shown to reduce the gender information from the embeddings of gender-neutral words, and, remarkably, maintains the same level of performance on different downstream NLP tasks." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-32", "text": "Zhao et al. (2018b) further propose a training mechanism to separate gender information from other factors." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-33", "text": "However, Gonen and Goldberg (2019) argue that entirely removing bias is difficult, if not impossible, and the gender bias information can often be recovered." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-34", "text": "This paper investigates a natural follow-up question: What are effective bias mitigation techniques for contextualized embeddings?" }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-35", "text": "----------------------------------" }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-36", "text": "**GENDER BIAS IN ELMO**" }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-37", "text": "In this section we describe three intrinsic analyses highlighting gender bias in trained ELMo contextual word embeddings (Peters et al., 2018)." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-38", "text": "We show that (1) training data for ELMo contains significantly more male entities compared to female entities, leading to gender bias in the pre-trained contextual word embeddings, (2) the geometry of trained ELMo embeddings systematically encodes gender information, and (3) ELMo propagates gender information about male and female entities unequally." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-39", "text": "[Table 1 residue: columns are #occurrence, #M-biased occs., #F-biased occs.]" }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-40", "text": "M: 5,300,000 / 170,000 / 81,000; F: 1,600,000 / 33,000 / 36,000." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-41", "text": "Table 1 lists the data analysis on the One Billion Word Benchmark (Chelba et al., 2013) corpus, the training corpus for ELMo." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-42", "text": "We show counts for the number of occurrences of male pronouns (he, his and him) and female pronouns (she and her) in the corpus as well as the co-occurrence of occupation words with those pronouns." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-43", "text": "We use the set of occupation words defined in the WinoBias corpus and their assignments as prototypically male or female (Zhao et al., 2018a)." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-44", "text": "The analysis shows that the Billion Word corpus contains a significant skew with respect to gender: (1) male pronouns occur three times more often than female pronouns and (2) male pronouns co-occur more frequently with occupation words, irrespective of whether they are prototypically male or female."
}, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-45", "text": "----------------------------------" }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-46", "text": "**TRAINING DATA BIAS**" }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-47", "text": "----------------------------------" }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-48", "text": "**GEOMETRY OF GENDER**" }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-49", "text": "Next, we analyze the gender subspace in ELMo." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-50", "text": "We first sample 400 sentences with at least one gendered word (e.g., he or she from the OntoNotes 5.0 dataset (Weischedel et al., 2012) and generate the corresponding gender-swapped variants (changing he to she and vice-versa)." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-51", "text": "We then calculate the difference of ELMo embeddings between occupation words in corresponding sentences and conduct principal component analysis for all pairs of sentences." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-52", "text": "Figure 1 shows there are two principal components for gender in ELMo, in contrast to GloVe which only has one (Bolukbasi et al., 2016) ." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-53", "text": "The two principal components in ELMo seem to represent the gender from the contextual information (Contextual Gender) as well as the gender embedded in the word itself (Occupational Gender)." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-54", "text": "To visualize the gender subspace, we pick a few sentence pairs from WinoBias (Zhao et al., 2018a) ." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-55", "text": "Each sentence in the corpus contains one gendered pronoun and two occupation words, such as \"The developer corrected the secretary because she made a mistake\" and also the same sentence with the opposite pronoun (he)." 
}, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-56", "text": "In Figure 1 on the right, we project the ELMo embeddings of occupation words that are co-referent with the pronoun (e.g. secretary in the above example) for when the pronoun is male (blue dots) and female (orange dots) on the two principal components from the PCA analysis." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-57", "text": "Qualitatively, we can see the first component separates male and female contexts while the second component groups male related words such as lawyer and developer and female related words such as cashier and nurse." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-58", "text": "----------------------------------" }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-59", "text": "**UNEQUAL TREATMENT OF GENDER**" }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-60", "text": "To test how ELMo embeds gender information in contextualized word embeddings, we train a classifier to predict the gender of entities from occupation words in the same sentence." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-61", "text": "We collect sentences containing gendered words (e.g., he-she, father-mother) and occupation words (e.g., doctor) 1 from the OntoNotes 5.0 corpus (Weischedel et al., 2012) , where we treat occupation words as a mention to an entity, and the gender of that entity is taken to the gender of a co-referring gendered word, if one exists." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-62", "text": "For example, in the sentence \"the engineer went back to her home,\" we take engineer to be a female mention." 
}, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-63", "text": "Then we split all such instances into training and test, with 539 and 62 instances, respectively and augment these sentences by swapping all the gendered words with words of the opposite gender such that the numbers of male 1 We use the list collected in (Zhao et al., 2018a) and female entities are balanced." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-64", "text": "We first test if ELMo embedding vectors carry gender information." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-65", "text": "We train an SVM classifier with an RBF kernel 2 to predict the gender of a mention (i.e., an occupation word) based on its ELMo embedding." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-66", "text": "On development data, this classifier achieves 95.1% and 80.6% accuracy on sentences where the true gender was male and female respectively." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-67", "text": "For both male and female contexts, the accuracy is much larger than 50%, demonstrating that ELMo does propagate gender information to other words." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-68", "text": "However, male information is more than 14% more accurately represented in ELMo than female information, showing that ELMo propagates the information unequally for male and female entities." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-69", "text": "----------------------------------" }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-70", "text": "**BIAS IN COREFERENCE RESOLUTION**" }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-71", "text": "In this section, we establish that coreference systems that depend on ELMo embeddings exhibit significant gender bias." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-72", "text": "Then we evaluate two simple methods for removing the bias from the systems and show that the bias can largely be reduced." 
}, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-73", "text": "----------------------------------" }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-74", "text": "**SETUP**" }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-75", "text": "We evaluate bias with respect to the WinoBias dataset (Zhao et al., 2018a) , a benchmark of paired male and female coreference resolution examples following the Winograd format (Hirst, 1981; Rahman and Ng, 2012; Peng et al., 2015) ." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-76", "text": "It contains two different subsets, pro-stereotype, where pronouns are associated with occupations predominately associated with the gender of the pronoun, or anti-stereotype, when the opposite relation is true." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-77", "text": "Table 2 : F1 on OntoNotes and WinoBias development sets." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-78", "text": "WinoBias dataset is split Semantics Only and w/ Syntactic Cues subsets." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-79", "text": "ELMo improves the performance on the OntoNotes dataset by 5% but shows stronger bias on the WinoBias dataset." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-80", "text": "Avg. stands for averaged F1 score on the pro-and anti-stereotype subsets while \"Diff.\" is the absolute difference between these two subsets." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-81", "text": "* indicates the difference between pro/anti stereotypical conditions is significant (p < .05) under an approximate randomized test (Graham et al., 2014) ." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-82", "text": "Mitigating bias by data augmentation reduces all the bias from the coreference model to a neglect level." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-83", "text": "However, the neutralizing ELMo approach only mitigates bias when there are other strong learning signals for the task." 
}, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-84", "text": "Each subset consists of two types of sentences: one that requires semantic understanding of the sentence to make coreference resolution (Semantics Only) and another that relies on syntactic cues (w/ Syntactic Cues)." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-85", "text": "Gender bias is measured by taking the difference of the performance in pro-and antistereotypical subsets." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-86", "text": "Previous work (Zhao et al., 2018a) evaluated the systems based on GloVe embeddings but here we evaluate a state-of-the-art system that trained on the OntoNotes corpus with ELMo embeddings ." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-87", "text": "----------------------------------" }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-88", "text": "**BIAS MITIGATION METHODS**" }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-89", "text": "Next, we describe two methods for mitigating bias in ELMo for the purpose of coreference resolution:" }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-90", "text": "(1) a train-time data augmentation approach and (2) a test-time neutralization approach." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-91", "text": "Zhao et al. (2018a) propose a method to reduce gender bias in coreference resolution by augmenting the training corpus for this task." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-92", "text": "Data augmentation is performed by replacing gender revealing entities in the OntoNotes dataset with words indicating the opposite gender and then training on the union of the original data and this swapped data." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-93", "text": "In addition, they find it useful to also mitigate bias in supporting resources and therefore replace standard GloVe embeddings with bias mitigated word embeddings from Bolukbasi et al. (2016) ." 
}, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-94", "text": "We evaluate the performance of both aspects of this approach." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-95", "text": "----------------------------------" }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-96", "text": "**DATA AUGMENTATION**" }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-97", "text": "Neutralization We also investigate an approach to mitigate bias induced by ELMo embeddings without retraining the coreference model." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-98", "text": "Instead of augmenting training corpus by swapping gender words, we generate a gender-swapped version of the test instances." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-99", "text": "We then apply ELMo to obtain contextualized word representations of the original and the gender-swapped sentences and use their average as the final representations." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-100", "text": "Table 2 summarizes our results on WinoBias." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-101", "text": "----------------------------------" }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-102", "text": "**RESULTS**" }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-103", "text": "ELMo Bias Transfers to Coreference Row 3 in Table 2 summarizes performance of the ELMo based coreference system on WinoBias." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-104", "text": "While ELMo helps to boost the coreference resolution F1 score (OntoNotes) it also propagates bias to the task." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-105", "text": "It exhibits large differences between pro-and anti-stereotyped sets (|Diff|) on both semantic and syntactic examples in WinoBias." 
}, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-106", "text": "Bias Mitigation Rows 4-6 in Table 2 summarize the effectiveness of the two bias mitigation approaches we consider." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-107", "text": "Data augmentation is largely effective at mitigating bias in the coreference resolution system with ELMo (reducing |Diff | to insignificant levels) but requires retraining the system." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-108", "text": "Neutralization is less effective than augmentation and cannot fully remove gender bias on the Semantics Only portion of WinoBias, indicating it is effective only for simpler cases." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-109", "text": "This observation is consistent with Gonen and Goldberg (2019) , where they show that entirely removing bias from an embedding is difficult and depends on the manner, by which one measures the bias." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-110", "text": "----------------------------------" }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-111", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-112", "text": "Like word embedding models, contextualized word embeddings inherit implicit gender bias." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-113", "text": "We analyzed gender bias in ELMo, showing that the corpus it is trained on has significant gender skew and that ELMo is sensitive to gender, but unequally so for male and female entities." }, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-114", "text": "We also showed this bias transfers to downstream tasks, such as coreference resolution, and explored two bias mitigation strategies: 1) data augmentation and 2) neutralizing embeddings, effectively eliminating the bias from ELMo in a state-of-the-art system." 
}, { "sent_id": "a5d7f5c5fed218149818463427d6a1-C001-115", "text": "With increasing adoption of contextualized embeddings to get better results on core NLP tasks, e.g. BERT (Devlin et al., 2018) , we must be careful how such unsupervised methods perpetuate bias to downstream applications and our work forms the basis of evaluating and mitigating such bias." } ], "y": { "@BACK@": { "gold_contexts": [ [ "a5d7f5c5fed218149818463427d6a1-C001-10", "a5d7f5c5fed218149818463427d6a1-C001-11", "a5d7f5c5fed218149818463427d6a1-C001-8", "a5d7f5c5fed218149818463427d6a1-C001-9" ], [ "a5d7f5c5fed218149818463427d6a1-C001-25", "a5d7f5c5fed218149818463427d6a1-C001-26" ], [ "a5d7f5c5fed218149818463427d6a1-C001-30", "a5d7f5c5fed218149818463427d6a1-C001-31" ] ], "cite_sentences": [ "a5d7f5c5fed218149818463427d6a1-C001-9", "a5d7f5c5fed218149818463427d6a1-C001-25", "a5d7f5c5fed218149818463427d6a1-C001-30" ] }, "@MOT@": { "gold_contexts": [ [ "a5d7f5c5fed218149818463427d6a1-C001-10", "a5d7f5c5fed218149818463427d6a1-C001-11", "a5d7f5c5fed218149818463427d6a1-C001-8", "a5d7f5c5fed218149818463427d6a1-C001-9" ], [ "a5d7f5c5fed218149818463427d6a1-C001-25", "a5d7f5c5fed218149818463427d6a1-C001-26" ] ], "cite_sentences": [ "a5d7f5c5fed218149818463427d6a1-C001-9", "a5d7f5c5fed218149818463427d6a1-C001-25" ] }, "@DIF@": { "gold_contexts": [ [ "a5d7f5c5fed218149818463427d6a1-C001-52", "a5d7f5c5fed218149818463427d6a1-C001-53" ] ], "cite_sentences": [ "a5d7f5c5fed218149818463427d6a1-C001-52" ] }, "@USE@": { "gold_contexts": [ [ "a5d7f5c5fed218149818463427d6a1-C001-92", "a5d7f5c5fed218149818463427d6a1-C001-93" ] ], "cite_sentences": [ "a5d7f5c5fed218149818463427d6a1-C001-93" ] } } }, "ABC_c9d9997b61974a537915a2c90af3cf_38": { "x": [ { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-2", "text": "Current multimodal sentiment analysis frames sentiment score prediction as a general Machine Learning task." 
}, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-3", "text": "However, what the sentiment score actually represents has often been overlooked." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-4", "text": "As a measurement of opinions and affective states, a sentiment score generally consists of two aspects: polarity and intensity." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-5", "text": "We decompose sentiment scores into these two aspects and study how they are conveyed through individual modalities and combined multimodal models in a naturalistic monologue setting." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-6", "text": "In particular, we build unimodal and multimodal multitask learning models with sentiment score prediction as the main task and polarity and/or intensity classification as the auxiliary tasks." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-7", "text": "Our experiments show that sentiment analysis benefits from multi-task learning, and individual modalities differ when conveying the polarity and intensity aspects of sentiment." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-8", "text": "----------------------------------" }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-10", "text": "Computational analysis of human multimodal language is a growing research area in Natural Language Processing (NLP)." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-11", "text": "One important type of information communicated through human multimodal language is sentiment." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-12", "text": "Current NLP studies often define sentiments using scores on a scale, e.g., a 5-point Likert scale representing sentiments from strongly negative to strongly positive." 
}, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-13", "text": "Previous work on multimodal sentiment analysis has focused on identifying effective approaches for sentiment score prediction (e.g., Zadeh et al. (2018b) )." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-14", "text": "However, in these studies sentiment score prediction is typically represented as a regression or classification task, without taking into account what the sentiment score means." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-15", "text": "As a measurement of human opinions and affective states, a sentiment score can often be decomposed into two aspects: the polarity and intensity of the sentiment." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-16", "text": "In this work, we study how individual modalities and multimodal information convey these two aspects of sentiment." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-17", "text": "More specifically, we conduct experiments on the Carnegie Mellon University Multimodal Opinion Sentiment Intensity (CMU-MOSI) database (Zadeh et al., 2016) ." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-18", "text": "The CMU-MOSI database is a widely used benchmark database for multimodal sentiment analysis." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-19", "text": "It contains naturalistic monologues expressing opinions on various subjects." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-20", "text": "Sentiments are annotated as continuous scores for each opinion segment in the CMU-MOSI database, and data were collected over the vocal, visual, and verbal modalities." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-21", "text": "We build unimodal and multimodal multi-task learning models with sentiment score regression as the main task, and polarity and/or intensity classification as the auxiliary tasks." 
}, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-22", "text": "Our main research questions are: 1. Does sentiment score prediction benefit from multi-task learning? 2." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-23", "text": "Do individual modalities convey the polarity and intensity of sentiment differently? 3. Does multi-task learning influence unimodal and multimodal sentiment analysis models in different ways?" }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-24", "text": "Our work contributes to our current understanding of the intra-modal and inter-modal dynamics of how sentiments are communicated in human multimodal language." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-25", "text": "Moreover, our study provides a detailed analysis of how multi-task learning and modality fusion influence sentiment analysis." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-26", "text": "----------------------------------" }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-27", "text": "**BACKGROUND**" }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-28", "text": "Sentiment is an important type of information conveyed in human language." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-29", "text": "Previous sentiment analysis studies in the field of NLP have mostly been focused on the verbal modality (i.e., text)." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-30", "text": "For example, predicting the sentiment of Twitter texts (Kouloumpis et al., 2011) or news articles (Balahur et al., 2013) ." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-31", "text": "However, human language is multimodal in, for instance, face-to-face communication and online multimedia opinion sharing." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-32", "text": "Understanding natural language used in such scenarios is especially important for NLP applications in Human-Computer/Robot Interaction." 
}, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-33", "text": "Thus, in recent years there has been growing interest in multimodal sentiment analysis." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-34", "text": "The three most widely studied modalities in current multimodal sentiment analysis research are: vocal (e.g., speech acoustics), visual (e.g., facial expressions), and verbal (e.g., lexical content)." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-35", "text": "These are sometimes referred to as \"the three Vs\" of communication (Mehrabian et al., 1971) ." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-36", "text": "Multimodal sentiment analysis research focuses on understanding how an individual modality conveys sentiment information (intra-modal dynamics), and how they interact with each other (intermodal dynamics)." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-37", "text": "It is a challenging research area and state-of-the-art performance of automatic sentiment prediction has room for improvement compared to human performance (Zadeh et al., 2018a) ." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-38", "text": "While multimodal approaches to sentiment analysis are relatively new in NLP, multimodal emotion recognition has long been a focus of Affective Computing." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-39", "text": "For example, De Silva and Ng (2000) combined facial expressions and speech acoustics to predict the Big-6 emotion categories (Ekman, 1992) ." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-40", "text": "Emotions and sentiments are closely related concepts in Psychology and Cognitive Science research, and are often used interchangeably." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-41", "text": "Munezero et al. 
(2014) identified the main differences between sentiments and emotions to be that sentiments are more stable and dispositional than emotions, and sentiments are formed and directed toward a specific object." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-42", "text": "However, when adopting the cognitive definition of emotions which connects emotions to stimuli in the environment (Ortony et al., 1990) , the boundary between emotions and sentiments blurs." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-43", "text": "In particular, the circumplex model of emotions proposed by Russell (1980) describes emotions with two dimensions: Arousal which represents the level of excitement (active/inactive), and Valence which represents the level of liking (positive/negative)." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-44", "text": "In many sentiment analysis studies, sentiments are defined using Likert scales with varying numbers of steps." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-45", "text": "For example, the Stanford Sentiment Treebank (Socher et al., 2013). In order to decompose sentiment scores into polarity and intensity and study how they are conveyed through different modalities, we include polarity and/or intensity classification as auxiliary tasks to sentiment score prediction with multi-task learning." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-46", "text": "One problem with Machine Learning approaches for Affective Computing is model robustness." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-47", "text": "In multi-task learning, the model shares representations between the main task and auxiliary tasks related to the main task, often enabling the model to generalize better on the main task (Ruder, 2017) ." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-48", "text": "Multiple auxiliary tasks have been used in previous sentiment analysis and emotion recognition studies." 
}, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-49", "text": "For example, Xia and Liu (2017) used dimensional emotion regression as an auxiliary task for categorical emotion classification, while used sentence type classification (number of opinion targets expressed in a sentence) as an auxiliary task for verbal sentiment analysis." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-50", "text": "To the best of our knowledge, there has been no previous work applying multi-task learning to the CMU-MOSI database." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-51", "text": "In addition to how individual modalities convey sentiment, another interesting topic in multimodal sentiment analysis is how to combine information from multiple modalities." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-52", "text": "There are three main types of modality fusion strategies in current multimodal Machine Learning research (Baltru\u0161aitis et al., 2018) : early fusion which combines features from different modalities, late fusion which combines outputs of unimodal models, and hybrid fusion which exploits the advantages of both early and late fusion." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-53", "text": "We will study the performance of these different modality fusion strategies for multimodal sentiment analysis." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-54", "text": "----------------------------------" }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-55", "text": "**METHODOLOGY**" }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-56", "text": "----------------------------------" }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-57", "text": "**THE CMU-MOSI DATABASE**" }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-58", "text": "The CMU-MOSI database contains 93 YouTube opinion videos from 89 distinct speakers (Zadeh et al., 2016) ." 
}, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-59", "text": "The videos are monologues on various topics recorded with various setups, lasting from 2 to 5 minutes." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-60", "text": "2199 opinion segments were manually identified from the videos with an average length of 4.2 seconds (approximately 154 minutes in total)." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-61", "text": "An opinion segment is the expression of opinion on a distinct subject, and can be part of a spoken utterance or consist of several consecutive utterances." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-62", "text": "Zadeh et al. (2016) collected sentiment score annotations of the opinion segments using Amazon Mechanical Turk and each video clip was annotated by five workers." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-63", "text": "For each opinion segment the sentiment scores are annotated on a 7-point Likert scale, i.e., strongly negative (-3), negative (-2), weakly negative (-1), neutral (0), weakly positive (+1), positive (+2), strongly positive (+3)." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-64", "text": "The gold-standard sentiment score annotations provided are the average of all five workers." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-65", "text": "Previous work on the CMU-MOSI database explored various approaches to improving performance of sentiment score prediction (e.g., Zadeh et al. (2018b) )." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-66", "text": "The target sentiment annotations can be continuous sentiment scores or discrete sentiment classes (binary, 5-class, or 7-class sentiment classes)." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-67", "text": "The Tensor Fusion Network model of achieved the best performance for continuous sentiment score regression on the CMU-MOSI database using features from all three modalities." 
}, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-68", "text": "The Pearson's correlation coefficient between the automatic predictions of their model and the gold-standard sentiment score annotations reached 0.70." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-69", "text": "In this work, we follow the parameter settings and features used by when predicting the sentiment scores." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-70", "text": "----------------------------------" }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-71", "text": "**MULTIMODAL SENTIMENT ANALYSIS WITH MULTI-TASK LEARNING**" }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-72", "text": "In this study, we apply multi-task learning to sentiment analysis using the CMU-MOSI database." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-73", "text": "We consider predicting the gold-standard sentiment scores as the main task." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-74", "text": "Thus, the single-task learning model is a regression model predicting the sentiment score S o of an opinion segment o, which has a value within range [-3,+3] ." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-75", "text": "To perform multi-task learning, for each opinion segment, we transform the gold-standard sentiment score S o into binary polarity class P o and intensity class I o :" }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-76", "text": "Unlike previous studies performing a 5-class or 7-class classification experiment for sentiment analysis, our definition of intensity classes uses the absolute sentiment scores, thus separating the polarity and intensity information." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-77", "text": "For example, an opinion segment o 1 with S o 1 = +3.0 will have P o 1 = Positive and I o 1 = Strong, while an opinion segment o 2 with S o 2 = -2.75 will have P o 2 = Negative and I o 2 = Strong." 
}, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-78", "text": "Note that here we group the sentiment scores into discrete intensity classes." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-79", "text": "In the future we plan to study the gain of preserving the ordinal information between the intensity classes." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-80", "text": "For each modality or fusion strategy we build four models: single-task sentiment regression model, bi-task sentiment regression model with polarity classification as the auxiliary task, bi-task sentiment regression model with intensity classification as the auxiliary task, and tri-task sentiment regression model with both polarity and intensity classification as the auxiliary tasks." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-81", "text": "In the bi-task and tri-task models, the main task loss is assigned a weight of 1.0, while the auxiliary task losses are assigned a weight of 0.5." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-82", "text": "Structures of the single-task and multi-task learning models only differ at the output layer: for sentiment score regression the output is a single node with tanh activation; for polarity classification the output is a single node with sigmoid activation; for intensity classification the output is 4 nodes with softmax activation." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-83", "text": "The main task uses mean absolute error as the loss function, while polarity classification uses binary cross-entropy as the loss function, and intensity classification uses categorical cross-entropy as the loss function." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-84", "text": "Following the state-of-the-art on the CMU-MOSI database , during training we used Adam as the optimization function with a learning rate of 0.0005." 
}, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-85", "text": "We use the CMU Multimodal Data Software Development Kit (SDK) (Zadeh et al., 2018a) to load and pre-process the CMU-MOSI database, which splits the 2199 opinion segments into training (1283 segments), validation (229 segments), and test (686 segments) sets." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-86", "text": "1 We implement the sentiment analysis models using the Keras deep learning library (Chollet et al., 2015) ." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-87", "text": "----------------------------------" }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-88", "text": "**MULTIMODAL FEATURES**" }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-89", "text": "For the vocal modality, we use the COVAREP feature set provided by the SDK." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-90", "text": "These are 74 vocal features including 12 Mel-frequency cepstral coefficients, pitch tracking and voiced/unvoiced segmenting features, glottal source parameters, peak slope parameters, and maxima dispersion quotients." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-91", "text": "The vocal features are extracted from the audio recordings at a sampling rate of 100Hz." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-92", "text": "For the visual modality, we use the FACET feature set provided by the SDK." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-93", "text": "These are 46 visual features including facial indicators of 9 types of emotion (anger, contempt, disgust, fear, joy, sadness, surprise, frustration, and confusion) and movements of 20 facial action units." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-94", "text": "The visual features are extracted from the speaker's facial region in the video recordings at a sampling rate of 30Hz." 
}, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-95", "text": "Following , for the vocal and visual unimodal models, we apply a drop-out rate of 0.2 to the features and build a neural network with three hidden layers of 32 ReLU activation units, as shown in Figure 1 ." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-96", "text": "For the verbal modality, we use the word embedding features provided by the SDK, which are 300-dimensional GloVe word vectors." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-97", "text": "There are 26,295 words in total (3,107 unique words) in the opinion segments of the CMU-MOSI database." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-98", "text": "Following , for the verbal unimodal model we build a neural network with one layer of 128 Long Short-Term Memory (LSTM) units and one layer of 64 ReLU activation units, as shown in Figure 2 ." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-99", "text": "Previous work has found that context information is important for multimodal sentiment analysis, and the use of LSTM allows us to include history ." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-100", "text": "Note that the visual and vocal features are extracted at the frame level, while the verbal features are extracted at the word level." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-101", "text": "Before conducting all unimodal and multimodal experiments, we aligned all the features to the word level using the SDK." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-102", "text": "This down-samples the visual and vocal features to the word level by computing the averaged feature vectors for all frames within a word." 
}, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-103", "text": "----------------------------------" }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-104", "text": "**MODALITY FUSION STRATEGIES**" }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-105", "text": "We test four fusion strategies here: Early Fusion (EF), Tensor Fusion Network (TFN), Late Fusion (LF), and Hierarchical Fusion (HF)." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-106", "text": "EF and LF are the most widely used fusion strategies in multimodal recognition studies and were shown to be effective for multimodal sentiment analysis (Poria et al., 2015) ." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-107", "text": "TFN achieved state-of-the-art performance on the CMU-MOSI database ." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-108", "text": "HF is a form of hybrid fusion strategy shown to be effective for multimodal emotion recognition (Tian et al., 2016) ." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-109", "text": "The structure of the EF model is shown in Figure 3 ." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-110", "text": "The feature vectors are simply concatenated in the EF model." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-111", "text": "A drop-out rate of 0.2 is applied to the combined feature vector." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-112", "text": "We then stack one layer of 128 LSTM units and three layers of 32 ReLU units with an L2 regularizer weight of 0.01 on top of the multimodal inputs." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-113", "text": "To compare performance of the fusion strategies, this same structure is applied to the multimodal inputs in all multimodal models." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-114", "text": "In the TFN model, we compute the Cartesian products (shown in Figure 4 ) of the unimodal model top layers as the multimodal inputs." 
}, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-115", "text": "Unlike , we did not add the extra constant dimension with value 1 when computing the 3-fold Cartesian space in order to reduce the dimensionality of the multimodal input." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-116", "text": "In the LF model, as shown in Figure 5 , we concatenate the unimodal model top layers as the multimodal inputs." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-117", "text": "In the HF model, unimodal information is used in a hierarchy where the top layer of the lower unimodal model is concatenated with the input layer of the higher unimodal model, as shown in Figure 6 ." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-118", "text": "We use the vocal modality at the bottom of the hierarchy while using the verbal modality at the top in HF fusion." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-119", "text": "This is because in previous studies (e.g., Zadeh et al. (2018a) ) the verbal modality was shown to be the most effective for unimodal sentiment analysis, while the vocal modality was shown to be the least effective." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-120", "text": "4" }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-121", "text": "----------------------------------" }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-122", "text": "**EXPERIMENTS AND RESULTS**" }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-123", "text": "Here we report our sentiment score prediction experiments." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-124", "text": "2 In Tables 2 and 3 , \"S\" is the single-task learning model; \"S+P\" is the bi-task learning model with polarity classification as the auxiliary task; \"S+I\" is the bi-task learning model with intensity classification as the auxiliary task; \"S+P+I\" is the tri-task learning model." 
}, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-125", "text": "To evaluate the performance of sentiment score prediction, following previous work (Zadeh et al., 2018a) , we report results in Tables 2 and 3 , where the numbers in bold are the best performance for each modality or fusion strategy." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-126", "text": "To identify the significant differences in results, we perform a two-sample Wilcoxon test on the sentiment score predictions given by each pair of models being compared and consider p < 0.05 as significant." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-127", "text": "We also include random prediction as a baseline and the human performance reported by ." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-128", "text": "----------------------------------" }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-129", "text": "**UNIMODAL EXPERIMENTS**" }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-130", "text": "The results of unimodal sentiment prediction experiments are shown in Table 2 ." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-131", "text": "3 The verbal models have the best performance here, which is consistent with previous sentiment analysis studies on multiple databases (e.g., Zadeh et al. (2018a) )." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-132", "text": "This suggests that lexical information remains the most effective for sentiment analysis." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-133", "text": "On each modality, the best performance is achieved by a multi-task learning model." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-134", "text": "This answers our first research question and suggests that sentiment analysis can benefit from multi-task learning." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-135", "text": "In multi-task learning, the main task gains additional information from the auxiliary tasks." 
}, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-136", "text": "Compared to the S model, the S+P model has increased focus on the polarity of sentiment, while the S+I model has increased focus on the intensity of sentiment." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-137", "text": "On the verbal modality, the S+P model achieved the best performance, while on the visual modality the S+I model achieved the best performance." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-138", "text": "This suggests that the verbal modality is weaker at communicating the polarity of sentiment." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-139", "text": "Thus, verbal sentiment analysis benefits more from including additional information on polarity." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-140", "text": "On the contrary, the visual modality is weaker at communicating the intensity of sentiment." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-141", "text": "Thus, visual sentiment analysis benefits more from including additional information on intensity." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-142", "text": "For the vocal modality, the S+P+I model achieved the best performance, and the S+P model yielded improved performance over that of the S model." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-143", "text": "This suggests that the vocal modality is weaker at communicating the polarity of sentiment." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-144", "text": "Thus, addressing our second research question, the results suggest that individual modalities differ when conveying each aspect of sentiment." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-145", "text": "Table 2 : Unimodal sentiment analysis results on the CMU-MOSI test set." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-146", "text": "Numbers in bold are the best results on each modality." 
}, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-147", "text": "----------------------------------" }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-148", "text": "**MULTIMODAL EXPERIMENTS**" }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-149", "text": "The results of the multimodal experiments are shown in Table 3 ." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-150", "text": "We find that EF>HF>TFN>LF." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-151", "text": "4 The reason that the EF model yields the best performance may be that it is the least complex." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-152", "text": "This is shown to be beneficial for the small CMU-MOSI database (Poria et al., 2015) ." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-153", "text": "Unlike , here the EF model outperforms the TFN model." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-154", "text": "However, the TFN model achieved the best performance on the training and validation sets." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-155", "text": "This indicates that performance of the TFN model may be limited by over-fitting." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-156", "text": "Compared to the feature concatenation used in EF, the Cartesian product used in TFN results in higher dimensionality of the multimodal input vector, 5 which in turn increases the complexity of the model." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-157", "text": "Similarly, the HF model has worse performance than the EF model here, unlike in Tian et al. (2016) ." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-158", "text": "This may be due to the HF model having the deepest structure with the most hidden layers, which increases its complexity." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-159", "text": "The performance of unimodal and multimodal models is significantly different." 
}, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-160", "text": "In general, the multimodal models have better performance than the unimodal models." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-161", "text": "6 Unlike unimodal models, multimodal models benefit less from multi-task learning." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-162", "text": "In fact, the HF and LF models have better performance using single-task learning." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-163", "text": "For the TFN models, only the S+P model outperforms the S model, although the improvement is not significant." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-164", "text": "7 For the EF models, multi-task learning results in better performance." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-165", "text": "8 The reason that EF benefits from multi-task learning may be that it combines modalities without bias and individual features have more influence on the EF model." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-166", "text": "Thus, the benefit of multi-task learning is preserved in EF." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-167", "text": "However, the other fusion strategies (TFN, LF, HF) attempt to compensate one modality with information from other modalities, i.e., relying more on other modalities when one modality is weaker at predicting an aspect of sentiment." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-168", "text": "In Section 4.1 we showed that each modality has different weaknesses when conveying the polarity or intensity aspect of sentiment." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-169", "text": "The multimodal models are able to overcome such weaknesses by modality fusion." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-170", "text": "Thus, multi-task learning does not yield additional improvement in these models." 
}, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-171", "text": "Our observations answer our third research question: multi-task learning influences unimodal and multimodal sentiment analysis differently." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-172", "text": "----------------------------------" }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-173", "text": "**DISCUSSION**" }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-174", "text": "Our unimodal experiments in Section 4.1 show that unimodal sentiment analysis benefits significantly from multi-task learning." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-175", "text": "As suggested by Wilson (2008) , polarity and intensity can be conveyed through different units of language." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-176", "text": "We can use one word such as extremely to express intensity, while the polarity of a word and the polarity of the opinion segment the word is in may be opposite." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-177", "text": "Our work supports a fine-grained sentiment analysis." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-178", "text": "By including polarity and intensity classification as the auxiliary tasks, we illustrate that individual modalities differ when conveying sentiment." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-179", "text": "In particular, the visual modality is weaker at conveying the intensity aspect of sentiment, while the vocal and verbal modalities are weaker at conveying the polarity aspect of sentiment." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-180", "text": "In previous emotion recognition studies under the circumplex model of emotions (Russell, 1980) , it was found that the visual modality is typically weaker at conveying the Arousal dimension of emotion, while the vocal modality is typically weaker at conveying the Valence dimension of emotion (e.g., Nicolaou et al. (2011) )." 
}, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-181", "text": "The similarities between the performance of different communication modalities on conveying emotion dimensions and on conveying different aspects of sentiment indicate a connection between emotion dimensions and sentiment." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-182", "text": "The different behaviors of unimodal models in conveying the polarity and intensity aspects of sentiment also explain the improved performance achieved by modality fusion in Section 4.2 and in various previous studies." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-183", "text": "By decomposing sentiment scores into polarity and intensity, our work provides detailed understanding on how individual modalities and multimodal information convey these two aspects of sentiment." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-184", "text": "We are aware that performance of our sentiment analysis models leaves room for improvement compared to state-of-the-art on the CMU-MOSI database." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-185", "text": "One reason may be that we did not perform pre-training in this study." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-186", "text": "In the future, we plan to explore more advanced learning techniques and models, such as a Dynamic Fusion Graph (Zadeh et al., 2018b) , to improve performance." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-187", "text": "We also plan to perform case studies to provide detailed analysis on how the unimodal models benefit from multi-task learning, and how individual modalities compensate each other in the multimodal models." 
}, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-188", "text": "----------------------------------" }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-189", "text": "**CONCLUSIONS**" }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-190", "text": "In this work, we decouple Likert scale sentiment scores into two aspects: polarity and intensity, and study the influence of including polarity and/or intensity classification as auxiliary tasks to sentiment score regression." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-191", "text": "Our experiments showed that all unimodal models and some multimodal models benefit from multi-task learning." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-192", "text": "Our unimodal experiments indicated that each modality conveys different aspects of sentiment differently." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-193", "text": "In addition, we observed similar behaviors between how individual modalities convey the polarity and intensity aspects of sentiments and how they convey the Valence and Arousal emotion dimensions." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-194", "text": "Such connections between sentiments and emotions encourage researchers to obtain an integrated view of sentiment analysis and emotion recognition." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-195", "text": "Our multimodal experiments showed that unlike unimodal models, multimodal models benefit less from multi-task learning." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-196", "text": "This suggests that one reason that modality fusion yields improved performance in sentiment analysis is its ability to combine the different strengths of individual modalities on conveying sentiments." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-197", "text": "Note that we only conducted experiments on the CMU-MOSI database." 
}, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-198", "text": "In the future, we plan to expand our study to multiple databases." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-199", "text": "Moreover, we are interested in including databases collected on modalities beyond the three Vs." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-200", "text": "For example, gestures or physiological signals." }, { "sent_id": "c9d9997b61974a537915a2c90af3cf-C001-201", "text": "We also plan to perform sentiment analysis and emotion recognition in a multi-task learning setting to further explore the relationship between sentiments and emotions." } ], "y": { "@BACK@": { "gold_contexts": [ [ "c9d9997b61974a537915a2c90af3cf-C001-36", "c9d9997b61974a537915a2c90af3cf-C001-37" ], [ "c9d9997b61974a537915a2c90af3cf-C001-118", "c9d9997b61974a537915a2c90af3cf-C001-119" ] ], "cite_sentences": [ "c9d9997b61974a537915a2c90af3cf-C001-37", "c9d9997b61974a537915a2c90af3cf-C001-119" ] }, "@USE@": { "gold_contexts": [ [ "c9d9997b61974a537915a2c90af3cf-C001-85" ], [ "c9d9997b61974a537915a2c90af3cf-C001-125" ] ], "cite_sentences": [ "c9d9997b61974a537915a2c90af3cf-C001-85", "c9d9997b61974a537915a2c90af3cf-C001-125" ] }, "@SIM@": { "gold_contexts": [ [ "c9d9997b61974a537915a2c90af3cf-C001-118", "c9d9997b61974a537915a2c90af3cf-C001-119" ], [ "c9d9997b61974a537915a2c90af3cf-C001-131", "c9d9997b61974a537915a2c90af3cf-C001-132" ] ], "cite_sentences": [ "c9d9997b61974a537915a2c90af3cf-C001-119", "c9d9997b61974a537915a2c90af3cf-C001-131" ] } } }, "ABC_e608068f472e7045b682f979fd5295_38": { "x": [ { "sent_id": "e608068f472e7045b682f979fd5295-C001-36", "text": "**SEED SET**" }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-37", "text": "We use a seed set of 60 seeds, evenhandedly sampled from verbs, nouns, adjectives and adverbs." 
}, { "sent_id": "e608068f472e7045b682f979fd5295-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-2", "text": "This paper introduces a method for creating a subjectivity lexicon for languages with scarce resources." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-3", "text": "The method is able to build a subjectivity lexicon by using a small seed set of subjective words, an online dictionary, and a small raw corpus, coupled with a bootstrapping process that ranks new candidate words based on a similarity measure." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-4", "text": "Experiments performed with a rule-based sentence level subjectivity classifier show an 18% absolute improvement in F-measure as compared to previously proposed semi-supervised methods." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-5", "text": "----------------------------------" }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-7", "text": "There is growing interest in the automatic extraction of opinions, emotions, and sentiments in text (subjectivity), to provide tools and support for various natural language processing applications." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-8", "text": "Most of the research to date has focused on English, which is mainly explained by the availability of resources for subjectivity analysis, such as lexicons and manually labeled corpora." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-9", "text": "In this paper, we describe a bootstrapping method for the automatic generation of a large subjectivity lexicon starting with a few seeds." 
}, { "sent_id": "e608068f472e7045b682f979fd5295-C001-10", "text": "Unlike previously proposed methods for building subjectivity lexicons, which typically rely on advanced language processing tools such as syntactic parsers or information extraction tools, our method specifically targets the construction of lexicons for languages with scarce resources." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-11", "text": "The method requires only a small set of seeds, a basic dictionary, and a small raw corpus, which makes it appealing for the large number of languages that have only limited text processing resources developed to date." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-12", "text": "We focus our experiments on Romanian, but the method is applicable to any other language." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-13", "text": "----------------------------------" }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-14", "text": "**RELATED WORK**" }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-15", "text": "Many subjectivity and sentiment analysis tools rely on manually or semi-automatically constructed lexicons (Yu and Hatzivassiloglou, 2003; Riloff and Wiebe, 2003; Kim and Hovy, 2006) ." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-16", "text": "The availability of such lexicons enables the construction of efficient rule-based subjectivity and sentiment classifiers that rely on the presence of lexicon entries in the text." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-17", "text": "Most of the work to date on subjectivity lexicon construction has assumed advanced natural language processing tools such as syntactic parsers (Wiebe, 2000) or tools for information extraction (Riloff and Wiebe, 2003) , or the availability of broad-coverage rich lexical resources such as WordNet (Esuli and Sebastiani, 2006a) ." 
}, { "sent_id": "e608068f472e7045b682f979fd5295-C001-18", "text": "However, such tools and resources are available only for a handful of languages, which limits the applicability of these approaches." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-19", "text": "Instead, in the method introduced in this paper, we try to minimize the resources required to build a subjectivity lexicon." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-20", "text": "Thus, the method is potentially applicable to a large number of the languages spoken worldwide." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-21", "text": "Our approach relates most closely to the method proposed by (Turney, 2002) for the construction of lexicons annotated for polarity." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-22", "text": "His algorithm starts with a few positive and negative seeds, and then uses data from the Web together with a similarity method (pointwise mutual information) to automatically grow this seed list." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-23", "text": "Our approach differs from (Turney, 2002) in two important ways: first, we do not address the task of polarity lexicon construction, but instead we focus on the acquisition of subjectivity lexicons." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-24", "text": "Second, Turney assumes a very large corpus such as the terabyte corpus of English documents available on the Web, whereas we rely on fewer, smaller-scale resources, namely a basic dictionary and a small raw corpus." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-25", "text": "The problem of distinguishing subjective versus objective instances has often proven to be more difficult than subsequent polarity classification, so improvements in subjectivity classification promise to positively impact sentiment classification." 
}, { "sent_id": "e608068f472e7045b682f979fd5295-C001-26", "text": "This is reported in studies of manual annotation of phrases (Takamura et al., 2006) , recognizing contextual polarity of expressions (Wilson et al., 2005) , and sentiment tagging of words and word senses (Andreevskaia and Bergler, 2006; Esuli and Sebastiani, 2006b )." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-27", "text": "Another closely related work is our own previously proposed method for leveraging on resources available for English to construct resources for a second language (Mihalcea et al., 2007) ." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-28", "text": "That method assumed the availability of a bridge between languages, such as a bilingual lexicon or a parallel corpus." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-29", "text": "Instead, in the method proposed here, we rely exclusively on language-specific resources, and do not make use of any such bilingual resources which may not always be available." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-30", "text": "----------------------------------" }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-31", "text": "**BOOTSTRAPPING**" }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-32", "text": "Our method is able to quickly acquire a large subjectivity lexicon by bootstrapping from a few manually selected seeds." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-33", "text": "At each iteration, the seed set is expanded with related words found in an online dictionary, which are filtered by using a measure of word similarity." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-34", "text": "The bootstrapping process is illustrated in Figure 1 ." 
}, { "sent_id": "e608068f472e7045b682f979fd5295-C001-35", "text": "----------------------------------" }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-38", "text": "The seeds were manually selected from two resources: the XI-th grade curriculum for Romanian Language and Literature developed by the Romanian Ministry of Education, which exemplifies students exerting proper use of subjective language, and translations of instances appearing in the OpinionFinder strong subjective lexicon (Wiebe and Riloff, 2005) ." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-39", "text": "Table 1 shows a sample of the entries in the initial seed set." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-40", "text": "A similar seed set can be easily obtained for any other language, either by finding a short listing of subjective words in the language of interest or by manually translating a small set of subjective entries from English." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-41", "text": "----------------------------------" }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-42", "text": "**DICTIONARY**" }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-43", "text": "Starting with the seed set, new related words are added based on the entries found in a dictionary." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-44", "text": "For each seed word, we collect all the open-class words appearing in its definition, as well as synonyms and antonyms if available." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-45", "text": "Note that word ambiguity is not an issue, as the expansion is done with all the possible meanings for each candidate word, which are subsequently filtered for incorrect meanings by using the measure of word similarity." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-46", "text": "In our experiments, we use an online Romanian dictionary http://www.dexonline.ro." 
}, { "sent_id": "e608068f472e7045b682f979fd5295-C001-47", "text": "Similar dictionaries are available for many languages; when online dictionaries are not available, they can be obtained with relatively low cost through OCR recognition performed on a hardcopy dictionary." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-48", "text": "----------------------------------" }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-49", "text": "**BOOTSTRAPPING ITERATIONS**" }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-50", "text": "For each seed word, a query is formulated against the online Romanian dictionary." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-51", "text": "From the definitions obtained in this way, a list of related words is extracted, and added to the list of candidates if their own definition is part of the dictionary and if they do not appear in a list of stopwords." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-52", "text": "We then filter the candidate words based on their similarity with the original seed (see the following section), and continue to the next iteration until a maximum number of iterations is reached." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-53", "text": "Note that the part-of-speech information is not maintained throughout the bootstrapping process, as words in the definitions belong to different parts-of-speech." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-54", "text": "Although the initial seed set is balanced with respect to syntactic categories, depending on the usage of words in definitions, this balance may be skewed toward one of the categories by the end of the bootstrapping process." 
}, { "sent_id": "e608068f472e7045b682f979fd5295-C001-55", "text": "----------------------------------" }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-56", "text": "**FILTERING**" }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-57", "text": "In order to remove noise from the lexicon, we implemented a filtering step which is performed by calculating a measure of similarity between the original seeds and each of the possible candidates." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-58", "text": "We experimented with two corpusbased measures of similarity, namely the Pointwise Mutual Information (Turney, 2001) and Latent Semantic Analysis (LSA) (Dumais et al., 1988) ." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-59", "text": "We ultimately decided to use only LSA, as both methods provided similar results, but the LSA method was significantly faster and required less training data." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-60", "text": "After each iteration, only candidates with an LSA score higher than 0.4 (deduced empirically) between the original seed set and the candidates are considered to be expanded in the next iteration." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-61", "text": "Upon bootstrapping termination, the subjectivity lexicons constructed incrementally after each iteration consist of a ranked list of candidates in decreasing order of similarity to the original seed set." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-62", "text": "A variable filtering threshold can be used to enforce the selection of only the most closely related candidates, resulting in more restrictive and pure subjectivity lexicons." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-63", "text": "In our experiments, we used the following thresholds: 0.40 (i.e. the lexicon resulting after the bootstrapping process without additional filtering), 0.50, 0.55, and 0.60." 
}, { "sent_id": "e608068f472e7045b682f979fd5295-C001-64", "text": "The LSA module was trained on a half-million word Romanian corpus, consisting of a manually translated version of the SemCor balanced corpus (Miller et al., 1993) ." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-65", "text": "Corpora of similar size can be easily obtained for many lowresource languages by using semi-automatic methods for corpus construction (Ghani et al., 2001 )." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-66", "text": "(Figure 2 )." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-67", "text": "These settings resulted in a lexicon of 3,913 entries, which is used in a rule-based sentence-level subjectivity classifier." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-68", "text": "The classifier labels as subjective a sentence that contains three or more entries that appear in the subjective lexicon, and as objective a sentence that has two or fewer entries, respectively." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-69", "text": "This rule is derived based on the OpinionFinder rules (Wiebe and Riloff, 2005) , which were modified to account for the fact that no strong/weak confidence labels are available." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-70", "text": "We evaluate our results against a gold-standard corpus consisting of 504 Romanian sentences manually annotated for subjectivity." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-71", "text": "Two Romanian native speakers annotated the sentences individually, and the differences were adjudicated through discussions." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-72", "text": "The agreement of the two annotators is 0.83% (\u03ba = 0.67); when the uncertain annotations are removed, the agreement rises to 0.89 (\u03ba = 0.77)." 
}, { "sent_id": "e608068f472e7045b682f979fd5295-C001-73", "text": "The two annotators reached consensus on all sentences for which they disagreed, resulting in a gold standard dataset with 272 (54%) subjective sentences and 232 (46%) objective sentences." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-74", "text": "More details about this data set are available in (Mihalcea et al., 2007) ." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-75", "text": "The sentence-level subjectivity classification results are shown in Table 2 ." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-76", "text": "By using the extracted lexicon alone, we were able to obtain a rule-based subjectivity classifier with an overall F-measure of 61.69%." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-77", "text": "To examine the effect of the number of bootstrapping iterations and the value of the LSA similarity threshold over the classifier, Table 2 displays the measures obtained through five bootstrapping iterations at an LSA threshold of 0.50, while Table 3 focuses on the fifth iteration tested over an LSA similarity of 0.40, 0.45, 0.50, 0.55, and 0.60." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-78", "text": "As expected, the overall F-measure is directly proportional to the LSA similarity score until the threshold becomes too restrictive, explicitly limiting the number of entries in the subjectivity lexicon." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-79", "text": "We compare our results with those obtained by a previously proposed method that was based on a similar rule-based classifier." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-80", "text": "In (Mihalcea et al., 2007) , a subjectivity lexicon was automatically obtained through the translation of the English subjectivity lexicon available in OpinionFinder." 
}, { "sent_id": "e608068f472e7045b682f979fd5295-C001-81", "text": "That lexicon consisted of 2,282 entries with a confidence label of strong, neutral or weak as flagged by the OpinionFinder lexicon." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-82", "text": "Table 4 shows the results obtained when using the translated lexicon to classify the subjectivity of Table 3 : Precision (P), Recall (R) and F-measure (F) for the 5th bootstrapping iteration for varying LSA scores sentences in the same data set as used in our experiments." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-83", "text": "By comparing the results in Tables 2 and 4 , we observe an absolute significant improvement of 18.03% in the overall F-measure when using the bootstrapping method introduced in this paper, as compared to the translated lexicon." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-84", "text": "Note that (Mihalcea et al., 2007 ) also proposed a corpusbased method for subjectivity classification; however that method is supervised and thus not directly comparable with the approach introduced in this paper." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-85", "text": "Interestingly, the Fmeasure obtained for the classification of subjective sentences is more than double in the case of the bootstrapping method, reflecting the ability of our approach to identify reliable subjective clues." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-86", "text": "----------------------------------" }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-87", "text": "**CONCLUSION**" }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-88", "text": "In this paper, we introduced a bootstrapping method able to quickly generate a large subjectivity lexicon that can be used to build rule-based sentence-level subjectivity classifiers for languages with scarce resources." 
}, { "sent_id": "e608068f472e7045b682f979fd5295-C001-89", "text": "The process starts with a small seed set of hand-picked subjective words, and with the help of an online dictionary, produces a lexicon of potential subjective candidates." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-90", "text": "Table 4 : Precision (P), Recall (R) and F-measure (F) for the automatic translation subjectivity lexicon (2282 entries, cf." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-91", "text": "(Mihalcea et al., 2007)) ranked based on the LSA similarity measure, and the top approximately 4,000 entries are used to build a rule-based subjectivity classifier." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-92", "text": "Testing is performed between a human sentence-level annotated gold-standard and a heuristic providing sentence level automatic annotations." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-93", "text": "Even if unsupervised, our system is able to achieve a subjectivity F-measure of 66.20% and an overall F-measure of 61.69%." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-94", "text": "This system proposes a possible path towards identifying subjectivity in low-resource languages." }, { "sent_id": "e608068f472e7045b682f979fd5295-C001-95", "text": "In the future, we plan to experiment with variations of the bootstrapping mechanism, as well as with other similarity measures." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "e608068f472e7045b682f979fd5295-C001-27", "e608068f472e7045b682f979fd5295-C001-28", "e608068f472e7045b682f979fd5295-C001-29" ], [ "e608068f472e7045b682f979fd5295-C001-74" ] ], "cite_sentences": [ "e608068f472e7045b682f979fd5295-C001-27", "e608068f472e7045b682f979fd5295-C001-74" ] }, "@MOT@": { "gold_contexts": [ [ "e608068f472e7045b682f979fd5295-C001-27", "e608068f472e7045b682f979fd5295-C001-28", "e608068f472e7045b682f979fd5295-C001-29" ] ], "cite_sentences": [ "e608068f472e7045b682f979fd5295-C001-27" ] }, "@DIF@": { "gold_contexts": [ [ "e608068f472e7045b682f979fd5295-C001-27", "e608068f472e7045b682f979fd5295-C001-28", "e608068f472e7045b682f979fd5295-C001-29" ], [ "e608068f472e7045b682f979fd5295-C001-84" ] ], "cite_sentences": [ "e608068f472e7045b682f979fd5295-C001-27", "e608068f472e7045b682f979fd5295-C001-84" ] }, "@USE@": { "gold_contexts": [ [ "e608068f472e7045b682f979fd5295-C001-74" ] ], "cite_sentences": [ "e608068f472e7045b682f979fd5295-C001-74" ] }, "@UNSURE@": { "gold_contexts": [ [ "e608068f472e7045b682f979fd5295-C001-79", "e608068f472e7045b682f979fd5295-C001-80", "e608068f472e7045b682f979fd5295-C001-81", "e608068f472e7045b682f979fd5295-C001-82", "e608068f472e7045b682f979fd5295-C001-83" ] ], "cite_sentences": [ "e608068f472e7045b682f979fd5295-C001-80" ] } } }, "ABC_78a7ca27c5ca032116db12205af939_38": { "x": [ { "sent_id": "78a7ca27c5ca032116db12205af939-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-2", "text": "Abstract." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-3", "text": "Medical imaging contains the essential information for rendering diagnostic and treatment decisions." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-4", "text": "Inspecting (visual perception) and interpreting image to generate a report are tedious clinical routines for a radiologist where automation is expected to greatly reduce the workload." 
}, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-5", "text": "Despite rapid development of natural image captioning, computeraided medical image visual perception and interpretation remain a challenging task, largely due to the lack of high-quality annotated imagereport pairs and tailor-made generative models for sufficient extraction and exploitation of localized semantic features, particularly those associated with abnormalities." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-6", "text": "To tackle these challenges, we present Vispi, an automatic medical image interpretation system, which first annotates an image via classifying and localizing common thoracic diseases with visual support and then followed by report generation from an attentive LSTM model." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-7", "text": "Analyzing an open IU X-ray dataset, we demonstrate a superior performance of Vispi in disease classification, localization and report generation using automatic performance evaluation metrics ROUGE and CIDEr." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-8", "text": "----------------------------------" }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-10", "text": "X-ray is a widely used medical imaging technique in clinics for diagnosis and treatment of thoracic diseases." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-11", "text": "Medical image interpretation, including both disease annotation and report writing, is a laborious routine for radiologists." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-12", "text": "Moreoever, the quality of interpretation is often quite diverse due to the differential levels of experience, expertise and workload of the radiologists." 
}, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-13", "text": "To release radiologists from their excessive workload and to better control quality of the written reports, it is desirable to implement a medical image interpretation system that automates the visual perception and cognition process and generates draft reports for radiologists to review, revise and finalize." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-14", "text": "Despite the rapid and significant development, the existing natural image captioning models, e.g. [7, 18] , fail to perform satisfactorily on medical report generation." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-15", "text": "The major challenge lies in the limited number of image-report pairs and relative scarcity of abnormal pairs for model training, which are essential for Fig. 1 ." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-16", "text": "Illustration of an existing medical report generation system (e.g. [6, 19] ) (a) and the proposed medical image interpretation system (b)." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-17", "text": "The former uses a coarse grid of image regions as visual features to generate report directly whereas the latter first predicts and localizes disease as semantic features then followed by report generation." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-18", "text": "quality radiology report generation." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-19", "text": "Additional challenge is the lack of appropriate performance evaluation metrics; the n-gram based BLEU scores widely used in natural language processing (NLP) are not suitable for assessing the quality of generated reports." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-20", "text": "Nevertheless several approaches have been developed to generate reports automatically for chest X-rays using the CNN-RNN architecture developed in natural image captioning research [6, 8, 17, 19] (Fig. 1a) ." 
}, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-21", "text": "Since the medical report typically consists of a sequences of sentences, Jing et al. [6] use a hierarchical LSTM [7] to generate paragraphs and achieve impressive results on Indiana University (IU) X-ray dataset [2] ." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-22", "text": "Instead of only using visual features extracted from image, they first predict the Medical Text Indexer (MTI) annotated tags, and then combine semantic features from the tags with visual features from the images for report generation." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-23", "text": "Similarly [19] use both visual and semantic features but generate 'impression' and 'findings' of the report separately." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-24", "text": "The former one-sentence summary is generated from a CNN encoder whereas the latter paragraph is generated using visual and semantic features." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-25", "text": "Different from [6] , the semantic feature is extracted by embedding the last generated sentence as opposed to the annotated tags." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-26", "text": "Li et al. [8] use a hierarchical decision-making procedure to determine whether to retrieve a template sentence from an existing template corpus or to invoke the lower-level decision to generate a new sentence from scratch." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-27", "text": "The decision priority is updated via reinforcement learning based on sentence-level and word-level rewards or punishments." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-28", "text": "However, none of these methods demonstrate a satisfactory performance in disease localization and classification, which is a central issue in medical image interpretation." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-29", "text": "Wang et al. 
[17] address both disease classification and medical image report generation problems in the same model." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-30", "text": "They introduce a novel Text-Image Embedding network (TieNet), which integrates self-attention LSTM using tex- tual report data and visual attention CNN using image data." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-31", "text": "TieNet is capable of extracting an informative embedding to represent the paired medical image and report, which significantly improves the disease classification performance compared to [16] ." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-32", "text": "However, TieNet's performance on medical report generation improves only marginally over the baseline approach [18] , trading the medical report generation performance for the disease classification performance." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-33", "text": "Moreover, TieNet does not provide a visual support for radiologists to review and revise the automatically generated report." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-34", "text": "We present an automatic medical image interpretation system with in situ visual support striving for a better performance in both image annotation and report generation (Fig. 1b) ." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-35", "text": "To our knowledge this is among the first attempts to exploit disease localization for X-ray image report generation with visual supports." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-103", "text": "Evaluation of Disease Classification." 
}, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-36", "text": "Our contributions are in four-fold: (1) we describe an integrated image interpretation framework for disease annotation and medical report generation, (2) we transfer knowledge from large image data sets (ChestX-ray8 [16] and ImageNet) to enhance medical image interpretation using a small number of reports for training (IU X-ray [2] ), (3) we evaluate suitability of the NLP evaluation metrics for medical report generation, and (4) we demonstrate the functionality of localizing the key finding in an X-ray with a heatmap." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-37", "text": "----------------------------------" }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-38", "text": "**METHOD**" }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-39", "text": "Our workflow (Fig. 2 ) first annotates an X-ray image by classifying and localizing thoracic diseases (Fig. 2a) and then generates the corresponding sentences to build up the entire report (Fig. 2b) ." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-40", "text": "Fig. 2c displays the structure of attentive LSTM used to generate reports." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-41", "text": "----------------------------------" }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-42", "text": "**DISEASE CLASSIFICATION AND LOCALIZATION**" }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-43", "text": "Fig. 2a shows our classification module built on a 121-layer Dense Convolutional Network (DenseNet) [5] ." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-44", "text": "Similar to [12] , we replace the last fully-connected layer with a new layer of dimension M , where M is the number of diseases." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-45", "text": "This is a multiple binary classification problem that input is a frontal view X-ray image X and output is a binary vector y = [y 1 , . . . , y m , . . . 
, y M ], i.e., y m \u2208 {0, 1}, indicating absence or presence of a disease m. The binary cross-entropy loss function is defined as:" }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-46", "text": ", where g m (X) is the probability for a target disease m. If g m (X) > 0.8, an X-ray is annotated with disease m for the next level modeling." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-47", "text": "Otherwise, it is considered as \"Normal\"." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-48", "text": "It is worth mentioning that a vast majority of X-rays are considered as \"Normal\", therefore, other choices of thresholds also work well with our system." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-49", "text": "We apply Grad-CAMs [14] to localize disease with a heatmap." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-50", "text": "Grad-CAMs uses the gradient information and flows it back to the final convolutional layer to decipher the importance of each neuron in classifying an image to disease m. Formally, let A k be the kth feature map and let the weight w mk represent the importance of feature map k for the disease m. We first calculate the gradient of the score for class m, z m (before the sigmoid), with respect to a feature map A k , i.e., \u2202z m /\u2202A k ." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-51", "text": "Thus w mk are calculated by:" }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-52", "text": "represents the coordinates of a pixel, and N is the total number of pixels." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-53", "text": "We then generate a heatmap for disease m by applying a weighted average of A k , followed by a ReLU activation: H m = ReLU(\u2211 k w mk A k )." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-54", "text": "The localized semantic features to predict disease m are identified and visualized with the heatmap H m ." 
}, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-55", "text": "Similar to [16] , we apply a thresholding based bounding box (B-Box) generation method." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-56", "text": "The B-Box bounds pixels whose heatmap intensity is above 90% of the maximum intensity." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-57", "text": "The resulting region of interest is then cropped for next level modeling." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-58", "text": "Fig. 2b illustrates the process of report generation." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-59", "text": "If there is no active thoracic disease found in an X-ray, a report will be directly generated by an attentive LSTM based on the original X-ray as shown in the green dashed box." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-60", "text": "Otherwise (as shown in the red dashed box), the cropped subimage with localized disease from the classification module (Fig. 2a) is used to generate description of abnormalities whereas the original X-ray is used to generate description of normalities in the report." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-61", "text": "----------------------------------" }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-62", "text": "**ATTENTION-BASED REPORT GENERATION**" }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-63", "text": "As shown in the Fig. 2c , the attentive LSTM is based on an encoder-decoder structure [18] , which takes either the original X-ray image or the cropped subimage corresponding to abnormal region as the input and generates a sequence of sentences for the entire report." 
}, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-64", "text": "Our encoder is built on a pre-trained ResNet-101 [4] , which extracts the visual features matrix F \u2208 R 2048\u00d7196 (reshaped from 2048 \u00d7 14 \u00d7 14) from the last convolutional layer followed by an adaptive average pooling layer." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-65", "text": "Each vector F k \u2208 R 2048 of F represents one regional feature vector, where k = {1, ..., 196}." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-66", "text": "The LSTM decoder takes F as input and generates sentences by producing a word w t at each time t. To utilize the spatial visual attention information, we define the weights \u03b1 tk , which can be interpreted as the relative importance of region feature F k at time t. The weights \u03b1 tk is computed by a multilayer perceptron f : e tk = f (F k , h t\u22121 ) and \u03b1 tk = Softmax(e tk ), and hence the attentive visual feature vector V t is computed by V t = 196 k=1 \u03b1 tk F k ." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-67", "text": "In addition to the weighted visual feature V t and last hidden layer h t\u22121 , the RNN also accepts the last output word w t at each time step as an input." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-68", "text": "We concatenate the embedding of last output word and visual feature as context vector c t ." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-69", "text": "Thus the transition to the current hidden layer h t can be calculated as: h t = LSTM(c t , h t\u22121 )." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-70", "text": "After model training, a report is generated by sampling words w t \u223c p(w t |h t ) and updating the hidden layer until hitting the stop token." 
}, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-71", "text": "----------------------------------" }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-72", "text": "**EXPERIMENTS AND RESULTS**" }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-73", "text": "Datasets." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-74", "text": "We use the IU Chest X-ray Collection [2] , an open image dataset with 3955 radiology reports paired with chest X-rays for our experimental evaluation." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-75", "text": "Each report contains three sections: impression, findings and Medical Subject Headings (MeSH) terms." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-76", "text": "Similar to [6, 19] , we generate sentences in 'impression' and 'findings' together." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-77", "text": "The MeSH terms are used as labels for disease classification [17] as well as the follow-up report generation with abnormality and normality descriptions." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-78", "text": "We convert all the words to lower-case, remove all non-alphanumeric tokens, replace single-occurrence tokens with a special token and use another special token to separate sentences." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-79", "text": "We filter out images and reports that are non-relevant to the eight common thoracic diseases included in both ChestX-ray8 [16] and IU X-ray datasets [2] , resulting in a dataset with 2225 pairs of X-ray image and report." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-80", "text": "Finally, we split all the image-report pairs into training, validation and testing dataset by ratio 7 : 1 : 2." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-81", "text": "Implementation Details." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-82", "text": "We implement our model on a GeForce GTX 1080ti GPU platform using PyTorch." 
}, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-83", "text": "The dimension of all hidden layers and word embeddings are set to 512." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-84", "text": "The network is trained with Adam optimizer with a mini-batch size of 16." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-85", "text": "The training stops when the performance on validation dataset does not increase for 20 epochs." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-86", "text": "We do not fine-tune the DenseNet pretrained with ChestX-ray8 [16] and ResNet pretrained with ImageNet due to the small sample size of IU X-ray dataset [2] ." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-87", "text": "For each disease class, a specific pair of LSTMs are trained to ensure consistency between the predicted disease annotation(s) and the generated report." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-88", "text": "For the disease classes with less than 50 samples, we train a shared attentive LSTM across the classes to generate normality description of the report." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-89", "text": "Evaluation of Automatic Medical Image Reports." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-90", "text": "We use the metrics for NLP tasks such as BLEU [11] , ROUGE [9] , and CIDEr [1] for automatic performance evaluation." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-91", "text": "Our model outperforms all baseline models [3, 10, 13, 15] and demonstrates the best CIDEr and ROUGE scores among all the advanced methods specifically designed for medical report generation [6, 8, 19] , despite the fact that we only use a single frontal view X-ray." 
}, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-92", "text": "While BLEU scores measure the percentage of consistency between the automatic report and the manual report in light of the automatic report (precision), it is not illuminative in assessing the amount of information captured in the automatic report in light of the manual Table 1 ." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-93", "text": "Automatic evaluations on IU dataset." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-94", "text": "* results from [8] . + results from [19] ." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-95", "text": "report (recall)." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-96", "text": "In real-world clinical applications, both recall and precision are critical in evaluating the quality of an automatic report." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-97", "text": "For example, automatic reports often miss description of abnormalities that contained in manual reports written by human radiologists [8, 19] , which may decreases recall but does not affect precision." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-98", "text": "Thus, the automatic report missing the key disease information can still achieve high BLEU scores nevertheless it provides limited insight for medical image interpretation." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-99", "text": "Therefore, ROUGE is more suitable than BLEU for evaluating the quality of automatic reports since it measures both precision and recall." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-100", "text": "Further, CIDEr is more suitable for our purpose than ROUGE and BLEU since it captures the notions of grammaticality, saliency, importance and accuracy [1] ." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-101", "text": "Additionally, CIDEr uses TF-IDF to filter out unimportant common words and weight more on disease keywords." 
}, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-102", "text": "As a result, higher ROUGE and CIDEr scores demonstrate a superior performance of our medical image interpretation system." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-104", "text": "Although ROUGE and CIDEr scores are effective in evaluating the consistency of an automatic report to a manual report, none of them, however, are designed for assessing the correctness of medical report annotation in terms of common thoracic diseases." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-105", "text": "The latter is another key output of a useful image interpretation system." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-106", "text": "For example, the automatically generated sentence: \"no focal airspace consolidation, pleural effusion or pneumothorax\" is considered as similar to the manually written sentence: \"persistent pneumothorax with small amount of pleural effusion\" using both ROUGE and CIDEr scores despite the completely opposite annotations." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-107", "text": "Therefore, we assess the accuracy in medical report annotation by comparing with TieNet [17] in disease classification using Area Under the ROC (AUROC) as the metric." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-108", "text": "Our result outperforms TieNet's classification module in 7 out of 8 diseases (Table 2, Fig. 3 )." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-109", "text": "As we discussed before, the inferior performance of TieNet may due to the fact that it trades the image classification performance for report generation performance." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-110", "text": "On the contrary, our model exploits the former to enhance the latter via a bi-level attention." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-111", "text": "Example System Outputs." 
}, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-112", "text": "Fig. 4 shows two example outputs each with a generated report and image annotation." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-113", "text": "The first row presents an annotated Table 2 ." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-114", "text": "Comparison using AUROCs." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-115", "text": "* results from [17] . \"Normal\" case whereas the second row presents an annotated \"Cardiomegaly\" case with the disease localized in a red bounding box on the heatmap generated from our classification and localization module." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-116", "text": "The results show that our medical interpretation system is capable of diagnosing thoracic diseases, highlighting the key findings in X-rays with heatmaps and generating well-structured reports." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-117", "text": "----------------------------------" }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-118", "text": "**CONCLUSIONS**" }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-119", "text": "In summary, we propose a bi-level attention mechanism for automatic X-ray image interpretation." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-120", "text": "Using only a single frontal view chest X-ray, our system is capable of accurately annotating X-ray images and generating quality reports." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-121", "text": "Our system also provides visual supports to assist radiologists in rendering diagnostic decisions." 
}, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-122", "text": "With more quality training data becomes available in the near future, our medical image interpretation system can be improved by: (1) incorporating both frontal and lateral view of X-rays, (2) predicting more disease classes, and (3) using hand labeled bounding boxes as the target of localization." }, { "sent_id": "78a7ca27c5ca032116db12205af939-C001-123", "text": "We will also generalize our system by extracting informative features from Elec-tronic Health Record (EHR) data and repeated longitudinal radiology reports to further enhance the performance of our system." } ], "y": { "@BACK@": { "gold_contexts": [ [ "78a7ca27c5ca032116db12205af939-C001-30", "78a7ca27c5ca032116db12205af939-C001-31" ], [ "78a7ca27c5ca032116db12205af939-C001-36" ] ], "cite_sentences": [ "78a7ca27c5ca032116db12205af939-C001-31", "78a7ca27c5ca032116db12205af939-C001-36" ] }, "@MOT@": { "gold_contexts": [ [ "78a7ca27c5ca032116db12205af939-C001-30", "78a7ca27c5ca032116db12205af939-C001-31" ] ], "cite_sentences": [ "78a7ca27c5ca032116db12205af939-C001-31" ] }, "@USE@": { "gold_contexts": [ [ "78a7ca27c5ca032116db12205af939-C001-36" ], [ "78a7ca27c5ca032116db12205af939-C001-55", "78a7ca27c5ca032116db12205af939-C001-56", "78a7ca27c5ca032116db12205af939-C001-57", "78a7ca27c5ca032116db12205af939-C001-58", "78a7ca27c5ca032116db12205af939-C001-59", "78a7ca27c5ca032116db12205af939-C001-60" ], [ "78a7ca27c5ca032116db12205af939-C001-79" ] ], "cite_sentences": [ "78a7ca27c5ca032116db12205af939-C001-36", "78a7ca27c5ca032116db12205af939-C001-55", "78a7ca27c5ca032116db12205af939-C001-79" ] }, "@UNSURE@": { "gold_contexts": [ [ "78a7ca27c5ca032116db12205af939-C001-86" ] ], "cite_sentences": [ "78a7ca27c5ca032116db12205af939-C001-86" ] } } }, "ABC_484bc7c9c66bf4028eef4103beec7f_38": { "x": [ { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-2", "text": "T.S. 
Eliot's modernist poem The Waste Land is often interpreted as a collection of voices which appear multiple times throughout the text." }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-3", "text": "Here, we investigate whether we can automatically cluster existing segmentations of the text into coherent, expert-identified characters." }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-4", "text": "We show that clustering The Waste Land is a fairly difficult task, though we can do much better than random baselines, particularly if we begin with a good initial segmentation." }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-5", "text": "----------------------------------" }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-7", "text": "Although literary texts are typically written by a single author, the style of a work of literature is not necessarily uniform." }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-8", "text": "When a certain character speaks, for instance, an author may shift styles to give the character a distinct voice." }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-9", "text": "Typically, voice switches in literature are explicitly marked, either by the use of quotation marks with or without a said quotative, or, in cases of narrator switches, by a major textual boundary (e.g. the novel Ulysses by James Joyce)." }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-10", "text": "However, implicit marking is the norm in some modernist literature: a well-known example is the poem The Waste Land by T.S. Eliot, which is usually analyzed in terms of voices that each appear multiple times throughout the text." }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-11", "text": "Our interest is distinguishing these voices automatically." 
}, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-12", "text": "One of the poem's most distinctive voices is that of the woman who speaks at the end of its second section:" }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-13", "text": "I can't help it, she said, pulling a long face, It's them pills I took, to bring it off, she said [158] [159] Her chatty tone and colloquial grammar and lexis distinguish her voice from many others in the poem, such as the formal and traditionally poetic voice of a narrator that recurs many times in the poem:" }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-14", "text": "Above the antique mantel was displayed As though a window gave upon the sylvan scene The change of Philomel [97] [98] [99] Although the stylistic contrasts between these and other voices are clear to many readers, Eliot does not explicitly mark the transitions, nor is it obvious when a voice has reappeared." }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-15", "text": "Our previous work focused on only the segmentation part of the voice identification task (Brooke et al., 2012) ." }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-16", "text": "Here, we instead assume an initial segmentation and then try to create clusters corresponding to segments of the The Waste Land which are spoken by the same voice." }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-17", "text": "Of particular interest is the influence of the initial segmentation on the success of this downstream task." 
}, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-18", "text": "----------------------------------" }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-19", "text": "**RELATED WORK**" }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-20", "text": "There is a small body of work applying quantitative methods to poetry: Simonton (1990) looked at lexical and semantic diversity in Shakespearean sonnets and correlated this with aesthetic success, whereas Dugan (1973) developed statistics of formulaic style and applied them to the Chanson de Roland to determine whether it represents an oral or written style." }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-21", "text": "Kao and Jurafsky (2012) quantify various aspects of poety, including style and sentiment, and use these features to distinguish professional and amateur writers of contemporary poetry." }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-22", "text": "With respect to novels, the work of McKenna and Antonia (2001) is very relevant; they used principal components analysis of lexical frequency to discriminate different voices and narrative styles in sections of Ulysses by James Joyce." }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-23", "text": "Clustering techniques have been applied to literature in general; for instance, Luyckx (2006) clustered novels according to style, and recent work in distinguishing two authors of sections of the Bible (Koppel et al., 2011) relies crucially on an initial clustering which is bootstrapped into a supervised classifier which is applied to segments." 
}, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-24", "text": "Beyond literature, the tasks of stylistic inconsistency detection (Graham et al., 2005; Guthrie, 2008) and intrinsic (unsupervised) plagiarism detection (Stein et al., 2011) are very closely related to our interests here, though in such tasks usually only two authors are posited; more general kinds of authorship identification (Stamatatos, 2009 ) may include many more authors, though some form of supervision (i.e. training data) is usually assumed." }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-25", "text": "Our work here is built on our earlier work (Brooke et al., 2012) ." }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-26", "text": "Our segmentation model for The Waste Land was based on a stylistic change curve whose values are the distance between stylistic feature vectors derived from 50 token spans on either side of each point (spaces between tokens) in the text; the local maxima of this curve represent likely voice switches." }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-27", "text": "Performance on The Waste Land was far from perfect, but evaluation using standard text segmentation metrics (Pevzner and Hearst, 2002) indicated that it was well above various baselines." }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-28", "text": "----------------------------------" }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-29", "text": "**METHOD**" }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-30", "text": "Our approach to voice identification in The Waste Land consists first of identifying the boundaries of voice spans (Brooke et al., 2012) ." }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-31", "text": "Given a segmentation of the text, we consider each span as a data point in a clustering problem." 
}, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-32", "text": "The elements of the vector correspond to the best feature set from the segmentation task, with the rationale that features which were useful for detecting changes in style should also be useful for identifying stylistic similarities." }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-33", "text": "Our features therefore include: a collection of readability metrics (including word length), frequency of punctuation, line breaks, and various parts-ofspeech, lexical density, average frequency in a large external corpus (Brants and Franz, 2006) , lexiconbased sentiment metrics using SentiWordNet (Baccianella et al., 2010) , formality score (Brooke et al., 2010) , and, perhaps most notably, the centroid of 20-dimensional distributional vectors built using latent semantic analysis (Landauer and Dumais, 1997), reflecting the use of words in a large web corpus (Burton et al., 2009) ; in previous work (Brooke et al., 2010) , we established that such vectors contain useful stylistic information about the English lexicon (including rare words that appear only occasionally in such a corpus), and indeed LSA vectors were the single most promising feature type for segmentation." }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-34", "text": "For a more detailed discussion of the feature set, see Brooke et al. (2012) ." }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-35", "text": "All the features are normalized to a mean of zero and a standard deviation of 1." }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-36", "text": "For clustering, we use a slightly modified version of the popular k-means algorithm (MacQueen, 1967) ." 
}, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-37", "text": "Briefly, k-means assigns points to a cluster based on their proximity to the k cluster centroids, which are initialized to randomly chosen points from the data and then iteratively refined until convergence, which in our case was defined as a change of less than 0.0001 in the position of each centroid during one iteration." }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-38", "text": "Our version of k-means is distinct in two ways: first, it uses a weighted centroid where the influence of each point is based on the token length of the underlying span, i.e. short (unreliable) spans which fall into the range of some centroid will have less effect on the location of the centroid than larger spans." }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-39", "text": "Second, we use a city-block (L1) distance function rather than the standard Euclidean (L2) distance function; in the segmentation task, Brooke et al. found that city-block (L1) distance was preferred, a result which is in line with other work in stylistic inconsistency detection (Guthrie, 2008) ." }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-40", "text": "Though it would be interesting to see if a good k could be estimated independently, for our purposes here we set k to be the known number of speakers in our gold standard." }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-41", "text": "----------------------------------" }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-42", "text": "**EVALUATION**" }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-43", "text": "We evaluate our clusters by comparing them to a gold standard annotation." }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-44", "text": "There are various metrics for extrinsic cluster evaluation; Amig\u00f3 et al. (2009) single out the BCubed metrics (Bagga and Baldwin, 1998) as having all of a set of key desirable properties." 
}, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-45", "text": "BCubed precision is a calculation of the fraction of item pairs in the same cluster which are also in the same category, whereas BCubed recall is the fraction of item pairs in the same category which are also in the same cluster." }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-46", "text": "The harmonic mean of these two metrics is BCubed F-score." }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-47", "text": "Typically, the 'items' are exactly what has been clustered, but this is problematic in our case, because we wish to compare methods which have different segmentations and thus the vectors that are being clustered are not directly comparable." }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-48", "text": "Instead, we calculate the BCubed measures at the level of the token; that is, for the purposes of measuring performance we act as if we had clustered each token individually, instead of the spans of tokens actually used." }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-49", "text": "Our first evaluation is against a set of 20 artificially-generated 'poems' which are actually randomly generated combinations of parts of 12 poems which were chosen (by an English literature expert, one of the authors) to represent the time period and influences of The Waste Land." }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-50", "text": "The longest of these poems is 1291 tokens and the shortest is just 90 tokens (though 10 of the 12 have at least 300 tokens); the average length is 501 tokens." }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-51", "text": "Our method for creating these poems is similar to that of Koppel et al. (2011) , though generalized for multiple authors." 
}, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-52", "text": "For each of the artificial poems, we randomly selected 6 poems from the 12 source poems, and then we concatenated 100-200 tokens (or all the remaining tokens, if less than the number selected) from each of these 6 poems to the new combined poem until all the poems were exhausted or below our minimum span length (20 tokens)." }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-53", "text": "This allows us to evaluate our method in ideal circumstances, i.e. when there are very distinct voices corresponding to different poets, and the voice spans tend to be fairly long." }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-54", "text": "Our gold standard annotation of The Waste Land speakers is far more tentative." }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-55", "text": "It is based on a number of sources: our own English literature expert, relevant literary analysis (Cooper, 1987) , and also The Waste Land app (Touch Press LLP, 2011), which includes readings of the poem by various experts, including T.S. Eliot himself." }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-56", "text": "However, there is inherently a great deal of subjectivity involved in literary annotation and, indeed, one of the potential benefits of our work is to find independent justification for a particular voice annotation." }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-57", "text": "Our gold standard thus represents just one potential interpretation of the poem, rather than a true, unique gold standard." }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-58", "text": "The average size of the 69 segments in the gold standard is 50 tokens; the range, however, is fairly wide: the longest is 373 tokens, while the shortest consists of a single token." }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-59", "text": "Our annotation has 13 voices altogether." 
}, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-60", "text": "We consider three segmentations: the segmentation of our gold standard (Gold), the segmentation predicted by our segmentation model (Automatic), and a segmentation which consists of equal-length spans (Even), with the same number of spans as in the gold standard." }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-61", "text": "The Even segmentation should be viewed as the baseline for segmentation, and the Gold segmentation an \"oracle\" representing an upper bound on segmentation performance." }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-62", "text": "For the automatic segmentation model, we use the settings from Brooke et al. (2012) ." }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-63", "text": "We also compare three possible clusterings for each segmentation: no clustering at all (Initial), that is, we assume that each segment is a new voice; k-means clustering (k-means), as outlined above; and random clustering (Random), in which we randomly assign each voice to a cluster." }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-64", "text": "For these latter two methods, which both have a random component, we averaged our metrics over 50 runs." }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-65", "text": "Random and Initial are here, of course, to provide baselines for judging the effectiveness of the k-means clustering model." }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-66", "text": "Finally, when using the gold standard segmentation and k-means clustering, we included another oracle option (Seeded): instead of randomly choosing centroids from the available datapoints as in standard k-means, each centroid is initialized to the longest instance of a different voice, essentially seeding each cluster." 
}, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-67", "text": "Table 1 contains the results for our first evaluation of voice clustering, the automatically-generated poems." }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-68", "text": "In all the conditions, using the gold segmentation far outstrips the other two options." }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-69", "text": "The automatic segmentation is consistently better than the evenly-spaced baseline, but the performance is actually worse than expected given the segmentation metrics we used in our earlier work. The results for The Waste Land are in Table 2 . Many of the basic patterns are the same, including the consistent ranking of the methods; overall, however, the clustering is far less effective." }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-70", "text": "This is particularly true for the gold-standard condition, which only increases modestly between the initial and clustered state; the marked increase in recall is balanced by a major loss of precision." }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-71", "text": "In fact, unlike with the artificial text, the most promising aspect of the clustering seems to be the fairly sizable boost it gives to the quality of clusters built on the automatic segmentation." }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-72", "text": "The effect of seeding is also very consistent, nearly as effective as in the automatic case." 
}, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-73", "text": "----------------------------------" }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-74", "text": "**RESULTS**" }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-75", "text": "We also looked at the results for individual speakers in The Waste Land; many of the speakers (some of which appear only in a few lines) are very poorly distinguished, even with the gold-standard segmentation and seeding, but there are a few that cluster quite well; the best two are in fact our examples from Section 1, that is, the narrator (F-score 0.869), and the chatty woman (F-score 0.605)." }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-76", "text": "The former result is particularly important, from the perspective of literary analysis, since there are several passages which seem to be the main narrator (and our expert annotated them as such) but which are definitely open to interpretation." }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-77", "text": "----------------------------------" }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-78", "text": "**CONCLUSION**" }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-79", "text": "Literature, by its very nature, involves combining existing means of expression in surprising new ways, resisting supervised analysis methods that depend on assumptions of conformity." }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-80", "text": "Our unsupervised approach to distinguishing voices in poetry offers this necessary flexibility, and indeed seems to work reasonably well in cases when the stylistic differences are clear." }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-81", "text": "The Waste Land, however, is a very subtle text, and our results suggest that we are a long way from something that would be considered a possible human interpretation." 
}, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-82", "text": "Nevertheless, applying quantitative methods to these kinds of texts can, for literary scholars, bridge the gap between abstract interpretations and the details of form and function (McKenna and Antonia, 2001 )." }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-83", "text": "In our own case, this computational work is just one aspect of a larger project in literary analysis where the ultimate goal is not to mimic human behavior per se, but rather to better understand literary phenomena by annotation and modelling of these phenomena (Hammond, 2013)." }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-84", "text": "With respect to future enhancements, improving segmentation is obviously important; the best automated efforts so far provide only a small boost over a baseline approach to segmentation." }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-85", "text": "However, independently of this, our experiments with gold-standard seeding suggest that refining our approach to clustering, e.g. a method that identifies good initial points for our centroids, may also pay dividends in the long run." }, { "sent_id": "484bc7c9c66bf4028eef4103beec7f-C001-86", "text": "A more radical idea for future work would be to remove the somewhat artificial delimitation of the task into segmentation and clustering phases, building a model which works iteratively to produce segments that are sensitive to points of stylistic change but that, at a higher level, also form good clusters (as measured by intrinsic measures of cluster quality)." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "484bc7c9c66bf4028eef4103beec7f-C001-15", "484bc7c9c66bf4028eef4103beec7f-C001-16", "484bc7c9c66bf4028eef4103beec7f-C001-17" ], [ "484bc7c9c66bf4028eef4103beec7f-C001-25", "484bc7c9c66bf4028eef4103beec7f-C001-26" ], [ "484bc7c9c66bf4028eef4103beec7f-C001-30", "484bc7c9c66bf4028eef4103beec7f-C001-31", "484bc7c9c66bf4028eef4103beec7f-C001-32", "484bc7c9c66bf4028eef4103beec7f-C001-33" ], [ "484bc7c9c66bf4028eef4103beec7f-C001-34" ] ], "cite_sentences": [ "484bc7c9c66bf4028eef4103beec7f-C001-15", "484bc7c9c66bf4028eef4103beec7f-C001-25", "484bc7c9c66bf4028eef4103beec7f-C001-30", "484bc7c9c66bf4028eef4103beec7f-C001-34" ] }, "@MOT@": { "gold_contexts": [ [ "484bc7c9c66bf4028eef4103beec7f-C001-15", "484bc7c9c66bf4028eef4103beec7f-C001-16", "484bc7c9c66bf4028eef4103beec7f-C001-17" ] ], "cite_sentences": [ "484bc7c9c66bf4028eef4103beec7f-C001-15" ] }, "@USE@": { "gold_contexts": [ [ "484bc7c9c66bf4028eef4103beec7f-C001-62" ] ], "cite_sentences": [ "484bc7c9c66bf4028eef4103beec7f-C001-62" ] } } }, "ABC_f861e6590ff57225395e7d480c66e8_38": { "x": [ { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-92", "text": "We include different evaluation types (S, R and B)." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-2", "text": "Adversarial training (AT) is a regularization method that can be used to improve the robustness of neural network methods by adding small perturbations in the training data." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-3", "text": "We show how to use AT for the tasks of entity recognition and relation extraction." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-23", "text": "Gupta et al. (2016) propose the use of various manually extracted features along with RNNs." 
}, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-4", "text": "In particular, we demonstrate that applying AT to a general-purpose baseline model for jointly extracting entities and relations improves state-of-the-art effectiveness on several datasets in different contexts (i.e., news, biomedical, and real estate data) and for different languages (English and Dutch)." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-5", "text": "----------------------------------" }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-7", "text": "Many neural network methods have recently been exploited in various natural language processing (NLP) tasks, such as parsing , POS tagging (Lample et al., 2016) , relation extraction (dos Santos et al., 2015) , translation (Bahdanau et al., 2015) , and joint tasks (Miwa and Bansal, 2016) ." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-8", "text": "However, Szegedy et al. (2014) observed that intentional small scale perturbations (i.e., adversarial examples) to the input of such models may lead to incorrect decisions (with high confidence)." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-9", "text": "Goodfellow et al. (2015) proposed adversarial training (AT) (for image recognition) as a regularization method which uses a mixture of clean and adversarial examples to enhance the robustness of the model." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-10", "text": "Although AT has recently been applied in NLP tasks (e.g., text classification (Miyato et al., 2017) ), this paper is, to the best of our knowledge, the first attempt to investigate the regularization effects of AT in a joint setting for two related tasks." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-11", "text": "We start from a baseline joint model that performs the tasks of named entity recognition (NER) and relation extraction at once." 
}, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-12", "text": "Previously proposed models (summarized in Section 2) exhibit several issues that the neural network-based baseline approach (detailed in Section 3.1) overcomes: (i) our model uses automatically extracted features without the need of external parsers nor manually extracted features (see Gupta et al. (2016) ; Miwa and Bansal (2016) ; Li et al. (2017) ), (ii) all entities and the corresponding relations within the sentence are extracted at once, instead of examining one pair of entities at a time (see Adel and Sch\u00fctze (2017) ), and (iii) we model relation extraction in a multi-label setting, allowing multiple relations per entity (see Katiyar and Cardie (2017) ; Bekoulis et al. (2018a) )." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-13", "text": "The core contribution of the paper is the use of AT as an extension in the training procedure for the joint extraction task (Section 3.2)." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-14", "text": "To evaluate the proposed AT method, we perform a large scale experimental study in this joint task (see Section 4), using datasets from different contexts (i.e., news, biomedical, real estate) and languages (i.e., English, Dutch)." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-15", "text": "We use a strong baseline that outperforms all previous models that rely on automatically extracted features, achieving state-of-the-art performance (Section 5)." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-16", "text": "Compared to the baseline model, applying AT during training leads to a consistent additional increase in joint extraction effectiveness." 
}, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-17", "text": "----------------------------------" }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-18", "text": "**RELATED WORK**" }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-19", "text": "Joint entity and relation extraction: Joint models (Li and Ji, 2014; Miwa and Sasaki, 2014) that are based on manually extracted features have been proposed for performing both the named entity recognition (NER) and relation extraction subtasks at once." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-20", "text": "These methods rely on the availability of NLP tools (e.g., POS taggers) or manually designed features leading to additional complexity." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-21", "text": "Neural network methods have been exploited to overcome this feature design issue and usually involve RNNs and CNNs (Miwa and Bansal, 2016; Zheng et al., 2017) ." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-22", "text": "Specifically, Miwa and Bansal (2016) as well as Li et al. (2017) apply bidirectional tree-structured RNNs for different contexts (i.e., news, biomedical) to capture syntactic information (using external dependency parsers)." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-24", "text": "Adel and Sch\u00fctze (2017) solve the simpler problem of entity classification (EC, assuming entity boundaries are given), instead of NER, and they replicate the context around the entities, feeding entity pairs to the relation extraction layer." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-25", "text": "Katiyar and Cardie (2017) investigate RNNs with attention without taking into account that relation labels are not mutually exclusive." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-26", "text": "Finally, Bekoulis et al. (2018a) use LSTMs in a joint model for extracting just one relation at a time, but increase the complexity of the NER part." 
}, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-27", "text": "Our baseline model enables simultaneous extraction of multiple relations from the same input." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-28", "text": "Then, we further extend this strong baseline using adversarial training." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-29", "text": "Adversarial training (AT) (Goodfellow et al., 2015) has been proposed to make classifiers more robust to input perturbations in the context of image recognition." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-30", "text": "In the context of NLP, several variants have been proposed for different tasks such as text classification (Miyato et al., 2017) , relation extraction (Wu et al., 2017) and POS tagging (Yasunaga et al., 2018) ." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-31", "text": "AT is considered a regularization method." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-32", "text": "Unlike other regularization methods (e.g., dropout (Srivastava et al., 2014) , word dropout (Iyyer et al., 2015) ) that introduce random noise, AT generates perturbations that are variations of examples easily misclassified by the model." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-33", "text": "----------------------------------" }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-34", "text": "**MODEL**" }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-35", "text": "----------------------------------" }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-36", "text": "**JOINT LEARNING AS HEAD SELECTION**" }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-37", "text": "The baseline model, described in detail in Bekoulis et al. (2018b) , is illustrated in Fig. 1 ." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-38", "text": "It aims to detect (i) the type and the boundaries of the entities and (ii) the relations between them." 
}, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-39", "text": "The input is a sequence of tokens (i.e., sentence) w = w 1 , ..., w n ." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-40", "text": "We use character level embeddings to implicitly capture morphological features (e.g., prefixes and suffixes), representing each character by a vector (embedding)." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-41", "text": "The character embeddings are fed to a bidirectional LSTM (BiLSTM) to obtain the character-based representation of the word." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-42", "text": "We also use pre-trained word embeddings." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-43", "text": "Word and character embeddings are concatenated to form the final token representation, which is then fed to a BiLSTM layer to extract sequential information." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-44", "text": "For the NER task, we adopt the BIO (Beginning, Inside, Outside) encoding scheme." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-45", "text": "In Fig. 1 , the B-PER tag is assigned to the beginning token of a 'person' (PER) entity." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-46", "text": "For the prediction of the entity tags, we use: (i) a softmax approach for the entity classification (EC) task (assuming entity boundaries given) or (ii) a CRF approach where we identify both the type and the boundaries for each entity." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-47", "text": "During decoding, in the softmax setting, we greedily detect the entity types of the tokens (i.e., independent prediction)." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-48", "text": "Although independent distribution of types is reasonable for EC tasks, this is not the case when there are strong correlations between neighboring tags." 
}, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-49", "text": "For instance, the BIO encoding scheme imposes several constraints in the NER task (e.g., the B-PER and I-LOC tags cannot be sequential)." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-50", "text": "Motivated by this intuition, we use a linear-chain CRF for the NER task (Lample et al., 2016) ." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-51", "text": "For decoding, in the CRF setting, we use the Viterbi algorithm." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-52", "text": "During training, for both EC (softmax) and NER tasks (CRF), we minimize the cross-entropy loss L NER ." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-53", "text": "The entity tags are later fed into the relation extraction layer as label embeddings (see Fig. 1 ), assuming that knowledge of the entity types is beneficial in predicting the relations between the involved entities." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-54", "text": "We model the relation extraction task as a multi-label head selection problem (Bekoulis et al., 2018b) ." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-55", "text": "In our model, each word w i can be involved in multiple relations with other words." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-56", "text": "For instance, in the example illustrated in Fig. 1 , \"Smith\" could be involved not only in a Lives in relation with the token \"California\" (head) but also in other relations simultaneously (e.g., Works for, Born In with some corresponding tokens)." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-57", "text": "The goal of the task is to predict, for each word w i , a vector of heads \u0177 i and a vector of corresponding relations r\u0302 i ." 
}, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-58", "text": "We compute the score s(w j , w i , r k ) of word w j to be the head of w i given a relation label r k using a single layer neural network." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-59", "text": "The corresponding probability is defined as: P(w j , r k | w i ; \u03b8) = \u03c3(s(w j , w i , r k )), where \u03c3(.) is the sigmoid function." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-60", "text": "During training, we minimize the cross-entropy loss L rel as:" }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-61", "text": "where m is the number of associated heads (and thus relations) per word w i ." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-62", "text": "During decoding, the most probable heads and relations are selected using threshold-based prediction." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-63", "text": "The final objective for the joint task is computed as L JOINT (w; \u03b8) = L NER + L rel where \u03b8 is a set of parameters." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-64", "text": "In the case of multi-token entities, only the last token of the entity can serve as head of another token, to eliminate redundant relations." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-65", "text": "If an entity is not involved in any relation, we predict the auxiliary \"N\" relation label and the token itself as head." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-66", "text": "----------------------------------" }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-67", "text": "**ADVERSARIAL TRAINING (AT)**" }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-68", "text": "We exploit the idea of AT (Goodfellow et al., 2015) as a regularization method to make our model robust to input perturbations." 
}, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-69", "text": "Specifically, we generate examples which are variations of the original ones by adding some noise at the level of the concatenated word representation (Miyato et al., 2017) ." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-70", "text": "This is similar to the concept introduced by Goodfellow et al. (2015) to improve the robustness of image recognition classifiers." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-71", "text": "We generate an adversarial example by adding to the original embedding w the worst-case perturbation \u03b7 adv that maximizes the loss function:" }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-72", "text": "where \u03b8 is a copy of the current model parameters." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-73", "text": "Since Eq. (2) is intractable in neural networks, we use the approximation proposed in Goodfellow et al. (2015) defined as: \u03b7 adv = \u03b5 g/||g||, with g = \u2207 w L JOINT (w;\u03b8), where \u03b5 is a small bounded norm treated as a hyperparameter." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-74", "text": "Similar to Yasunaga et al. (2018) , we set \u03b5 to be \u03b1 \u221a D (where D is the dimension of the embeddings)." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-75", "text": "We train on the mixture of original and adversarial examples, so the final loss is computed as: L JOINT (w;\u03b8) + L JOINT (w + \u03b7 adv ;\u03b8)." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-76", "text": "----------------------------------" }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-77", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-78", "text": "We evaluate our models on four datasets, using the code as available from our github codebase." 
}, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-79", "text": "Specifically, we follow the 5-fold cross-validation defined by Miwa and Bansal (2016) for the ACE04 (Doddington et al., 2004) dataset." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-80", "text": "For the CoNLL04 (Roth and Yih, 2004 ) EC task (assuming boundaries are given), we use the same splits as in Gupta et al. (2016) ; Adel and Sch\u00fctze (2017) ." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-81", "text": "We also evaluate our models on the NER task similar to Miwa and Sasaki (2014) in the same dataset using 10-fold cross-validation." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-82", "text": "For the Dutch Real Estate Classifieds, DREC (Bekoulis et al., 2017) dataset, we use train-test splits as in Bekoulis et al. (2018a) ." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-83", "text": "For the Adverse Drug Events, ADE (Gurulingappa et al., 2012) , we perform 10-fold cross-validation similar to Li et al. (2017) ." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-84", "text": "To obtain comparable results that are not affected by the input embeddings, we use the embeddings of the previous works." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-85", "text": "We employ early stopping in all of the experiments." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-86", "text": "We use the Adam optimizer (Kingma and Ba, 2015) and we fix the hyperparameters (i.e., \u03b1, dropout values, best epoch, learning rate) on the validation sets." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-87", "text": "The scaling parameter \u03b1 is selected from {5e\u22122, 1e\u22122, 1e\u22123, 1e\u22124}. Larger values of \u03b1 (i.e., larger perturbations) lead to consistent performance decrease in our early experiments." 
}, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-88", "text": "This can be explained by the fact that adding more noise can change the content of the sentence, as also reported by Wu et al. (2017) ." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-89", "text": "We use three types of evaluation, namely: (i) S(trict): we score an entity as correct if both the entity boundaries and the entity type are correct (ACE04, ADE, CoNLL04, DREC), (ii) B(oundaries): we score an entity as correct if only the entity boundaries are correct while the entity type is not taken into account (DREC) and (iii) R(elaxed): a multi-token entity is considered Table 1 : Comparison of our method with the state-of-the-art in terms of F 1 score." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-90", "text": "The proposed models are: (i) baseline, (ii) baseline EC (predicts only entity classes) and (iii) baseline (EC) + AT (regularized by AT)." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-91", "text": "The and symbols indicate whether the models rely on external NLP tools." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-93", "text": "correct if at least one correct type is assigned to the tokens comprising the entity, assuming that the boundaries are known (CoNLL04), to compare to previous works." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-94", "text": "In all cases, a relation is considered as correct when both the relation type and the argument entities are correct." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-95", "text": "Table 1 shows our experimental results." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-96", "text": "The name of the dataset is presented in the first column while the models are listed in the second column." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-97", "text": "The proposed models are the following: (i) baseline: the baseline model shown in Fig. 
1 with the CRF layer and the sigmoid loss, (ii) baseline EC: the proposed model with the softmax layer for EC, (iii) baseline (EC) + AT: the baseline regularized using AT." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-98", "text": "The final three columns present the F 1 results for the two subtasks and their average performance." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-99", "text": "Bold values indicate the best results among models that use only automatically extracted features." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-100", "text": "For ACE04, the baseline outperforms Katiyar and Cardie (2017) by \u223c2% in both tasks." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-101", "text": "This improvement can be explained by the use of: (i) multi-label head selection, (ii) CRF-layer and (iii) character level embeddings." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-102", "text": "Compared to Miwa and Bansal (2016) , who rely on NLP tools, the baseline performs within a reasonable margin (less than 1%) on the joint task." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-103", "text": "On the other hand, Li et al. (2017) use the same model for the ADE biomedical dataset, where we report a 2.5% overall improvement." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-104", "text": "This indicates that NLP tools are not always accurate for various contexts." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-105", "text": "For the CoNLL04 dataset, we use two evaluation settings." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-106", "text": "We use the relaxed evaluation similar to Gupta et al. (2016) ; Adel and Sch\u00fctze (2017) on the EC task." 
}, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-107", "text": "The baseline model outperforms the state-of-the-art models that do not rely on manually extracted features (>4% improvement for both tasks), since we directly model the whole sentence, instead of just considering pairs of entities." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-108", "text": "Moreover, compared to the model of Gupta et al. (2016) that relies on complex features, the baseline model performs within a margin of 1% in terms of overall F 1 score." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-109", "text": "We also report NER results on the same dataset and improve overall F 1 score with \u223c1% compared to Miwa and Sasaki (2014) , indicating that our automatically extracted features are more informative than the hand-crafted ones." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-110", "text": "These automatically extracted features exhibit their performance improvement mainly due to the shared LSTM layer that learns to automatically generate feature representations of entities and their corresponding relations within a single model." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-111", "text": "For the DREC dataset, we use two evaluation methods." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-112", "text": "In the boundaries evaluation, the baseline has an improvement of \u223c3% on both tasks compared to Bekoulis et al. (2018a) , whose quadratic scoring layer complicates NER." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-113", "text": "Table 1 and Fig. 2 show the effectiveness of the adversarial training on top of the baseline model." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-114", "text": "In all of the experiments, AT improves the predictive performance of the baseline model in the joint setting." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-115", "text": "Moreover, as seen in Fig. 
2 , the performance of the models using AT is closer to maximum even from the early training epochs." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-116", "text": "Specifically, for ACE04, there is an improvement in both tasks as well as in the overall F 1 performance (0.4%)." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-117", "text": "For CoNLL04, we note an improvement in the overall F 1 of 0.4% for the EC and 0.8% for the NER tasks, respectively." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-118", "text": "For the DREC dataset, in both settings, there is an overall improvement of \u223c1%." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-119", "text": "Figure 2 shows that from the first epochs, the model obtains its maximum performance on the DREC validation set." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-120", "text": "Finally, for ADE, our AT model beats the baseline F 1 by 0.7%." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-121", "text": "----------------------------------" }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-122", "text": "**RESULTS**" }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-123", "text": "Our results demonstrate that AT outperforms the neural baseline model consistently, considering our experiments across multiple and more diverse datasets than typical related works." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-124", "text": "The improvement of AT over our baseline (depending on the dataset) ranges from \u223c0.4% to \u223c0.9% in terms of overall F 1 score." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-125", "text": "This seemingly small performance increase is mainly due to the limited performance benefit for the NER component, which is in accordance with the recent advances in NER using neural networks that report similarly small gains (e.g., the performance improvement in Ma and Hovy (2016) and Lample et al.
(2016) on the CoNLL-2003 test set is 0.01% and 0.17% F 1 percentage points, while in the work of Yasunaga et al. (2018) , a 0.07% F 1 improvement on CoNLL-2000 using AT for NER is reported)." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-126", "text": "However, the relation extraction performance increases by \u223c1% F 1 scoring points, except for the ACE04 dataset." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-127", "text": "Further, as seen in Fig. 2 , the improvement for CoNLL04 is particularly small on the evaluation set." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-128", "text": "This may indicate a correlation between the dataset size and the benefit of adversarial training in the context of joint models, but this needs further investigation in future work." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-129", "text": "----------------------------------" }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-130", "text": "**CONCLUSION**" }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-131", "text": "We proposed to use adversarial training (AT) for the joint task of entity recognition and relation extraction." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-132", "text": "The contribution of this study is twofold: (i) investigation of the consistent effectiveness of AT as a regularization method over a multi-context baseline joint model, with (ii) a large scale experimental evaluation." }, { "sent_id": "f861e6590ff57225395e7d480c66e8-C001-133", "text": "Experiments show that AT improves the results for each task separately, as well as the overall performance of the baseline joint model, while reaching high performance already during the first epochs of the training procedure." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "f861e6590ff57225395e7d480c66e8-C001-11", "f861e6590ff57225395e7d480c66e8-C001-12" ], [ "f861e6590ff57225395e7d480c66e8-C001-19", "f861e6590ff57225395e7d480c66e8-C001-20", "f861e6590ff57225395e7d480c66e8-C001-23" ] ], "cite_sentences": [ "f861e6590ff57225395e7d480c66e8-C001-12", "f861e6590ff57225395e7d480c66e8-C001-23" ] }, "@MOT@": { "gold_contexts": [ [ "f861e6590ff57225395e7d480c66e8-C001-11", "f861e6590ff57225395e7d480c66e8-C001-12" ], [ "f861e6590ff57225395e7d480c66e8-C001-19", "f861e6590ff57225395e7d480c66e8-C001-20", "f861e6590ff57225395e7d480c66e8-C001-23" ] ], "cite_sentences": [ "f861e6590ff57225395e7d480c66e8-C001-12", "f861e6590ff57225395e7d480c66e8-C001-23" ] }, "@USE@": { "gold_contexts": [ [ "f861e6590ff57225395e7d480c66e8-C001-80" ], [ "f861e6590ff57225395e7d480c66e8-C001-105", "f861e6590ff57225395e7d480c66e8-C001-106", "f861e6590ff57225395e7d480c66e8-C001-107" ] ], "cite_sentences": [ "f861e6590ff57225395e7d480c66e8-C001-80", "f861e6590ff57225395e7d480c66e8-C001-106" ] }, "@DIF@": { "gold_contexts": [ [ "f861e6590ff57225395e7d480c66e8-C001-105", "f861e6590ff57225395e7d480c66e8-C001-106", "f861e6590ff57225395e7d480c66e8-C001-107" ] ], "cite_sentences": [ "f861e6590ff57225395e7d480c66e8-C001-106" ] }, "@UNSURE@": { "gold_contexts": [ [ "f861e6590ff57225395e7d480c66e8-C001-108" ] ], "cite_sentences": [ "f861e6590ff57225395e7d480c66e8-C001-108" ] } } }, "ABC_023a954d97b5d761b01f09bb242d19_38": { "x": [ { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-26", "text": "This score measures the level of structural overlapping between two structures." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-2", "text": "Convolutional neural networks (CNN) have recently achieved remarkable performance in a wide range of applications." 
}, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-3", "text": "In this research, we equip convolutional sequence-to-sequence (seq2seq) model with an efficient graph linearization technique for abstract meaning representation parsing." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-4", "text": "Our linearization method is better than the prior method at signaling the turn of graph traveling." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-5", "text": "Additionally, convolutional seq2seq model is more appropriate and considerably faster than the recurrent neural network models in this task." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-6", "text": "Our method outperforms previous methods by a large margin on both the standard dataset LDC2014T12." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-7", "text": "Our result indicates that future works still have a room for improving parsing model using graph linearization approach." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-8", "text": "----------------------------------" }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-10", "text": "Abstract Meaning Representation (AMR) forms a rooted acyclic directed graph that represents the content of a sentence." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-11", "text": "All nodes and edges of the AMR graph are labeled according to the sense of the words in a sentence." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-12", "text": "AMR parsing is the task of converting a given sentence to a corresponding graph." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-13", "text": "AMRs have been applied to several applications such as event extraction [13, 7] , text summarization [6, 11] and text generation [15, 14] ." 
}, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-14", "text": "However, AMR annotation which requires a lot of human effort limits the outcome of data-driven approaches, one of which being neural network based methods [10, 3] ." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-15", "text": "Therefore, a highly accurate parser is necessary in order to intensify other applications which are based on AMR." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-16", "text": "Three different ways are widely utilized to demonstrate AMR graphs." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-17", "text": "First, conjunction form represents AMR to measure the similarity between two AMR graphs and some logic applications." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-18", "text": "Secondly, the PENMAN notation is used on several occasions that are related to human reading and writing such as annotation and data observation." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-19", "text": "Thirdly, computer programs commonly store AMRs as graph structure in memory." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-20", "text": "Figure 1 illustrates three typical representation approaches." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-21", "text": "In an AMR graph, each node is managed using an unique ID called variables." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-22", "text": "The content of a node is expressed by a semantic concept, which can be an English word (e.g. dog) or a PropBank frameset (e.g. want-01) or a special keyword (e.g. the \"-\" sign)." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-23", "text": "The edge between two vertices is labeled using more than 100 relations including semantic relations (e.g. :location, :name), and frameset argument index (e.g.:ARG0, :ARG1)." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-24", "text": "AMR also provides the inverse form of relations (e.g. 
:location vs. :location-of)." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-25", "text": "To compare two semantic graphs, Cai et al. [4] introduced the SMATCH score." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-27", "text": "The SMATCH score has been widely applied in measuring the accuracy of AMR parsers." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-28", "text": "Transition-based parsers have made notable achievements in graph parsing, such as dependency tree parsing [5] ." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-29", "text": "Currently, AMR parsers are benefiting from the power of this approach." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-30", "text": "Motivated by the analogy between dependency trees and AMR graphs, Wang et al. [18] proposed the first transition system for parsing AMR graphs." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-31", "text": "Figure 2 illustrates the dependency tree and the AMR graph of the sentence: \"The domicile of a juridical person shall be at the location of its principal office\"." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-32", "text": "These two structures share some nodes (e.g. domicile, person, juridical), and their node interrelations (e.g. person - juridical)." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-33", "text": "Wang et al. defined a two-stage process for their system: (1) parsing a sentence into a dependency tree using existing parsers such as the Stanford parser and the Charniak parser; (2) converting the obtained tree into an AMR graph by an eight-action transition system." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-34", "text": "Their later works have investigated a richer feature set including co-reference, semantic role labeling, word clusters [17] ; rich named entity tags, and the ISI verbalization list [16] ."
}, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-35", "text": "NeuralAMR [10] has succeeded at both AMR parsing and sentence generation as the result of a bootstrapping training strategy on a 20-million-sentence unsupervised dataset." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-36", "text": "An efficient adaptation of machine translation to AMR parsing by Barzdins et al [2] indicates that character-based features are better than [12] ." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-37", "text": "The work of Ballesteros et al has combined recurrent neural network and transition system into a deep transition model [1] ." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-38", "text": "Among those methods, the information is encoded in LSTM hidden state using embedding vector and syntactic features instead of gathering a large number of features which are introduced in the conventional transition method." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-39", "text": "Although recent studies have utilized Long Short-Term Memory (LSTM) in AMR parsing [10, 1] , there are several disadvantages of employing LSTM compared to CNN." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-40", "text": "First, LSTM models long dependency, which might be noise to generate a linearized graph, whereas CNN provides a shorter dependency which is advantageous to generate graph traversal." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-41", "text": "Secondly, LSTM requires a chronologically computing process that restrains the ability of parallelization; on the contrary, CNN enables simultaneous parsing." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-42", "text": "In this paper, we present the first success of applying convolutional seq2seq in AMR parsing." 
}, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-43", "text": "The main contributions of this research are:" }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-44", "text": "\u2022 An outstanding performance with 5 points SMATCH score improvement resulted from the proposed AMR parsing model using depth-first-search graph linearization and convolutional seq2seq network." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-45", "text": "\u2022 A new public AMR test 1 set of legal document." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-46", "text": "\u2022 The first study of AMR parsing in the legal domain." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-47", "text": "----------------------------------" }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-48", "text": "**METHOD**" }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-49", "text": "In this section, we first present the formalization of the AMR parsing task." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-50", "text": "We then demonstrate in detail two main parts of our model: the graph conversion including linearization and de-linearization presented in section 2.1, and convolutional seq2seq model presented in section 2.2." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-51", "text": "Given the training dataset S, G where S and G stand for the set of sentences and the set of corresponding AMR graphs, our supervised learning model with parameter set \u03b8 maximizes the following problem:" }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-52", "text": "----------------------------------" }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-53", "text": "**GRAPH LINEARIZATION AND DE-LINEARIZATION**" }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-54", "text": "Seq2seq model requires sequential representation of features and labels, therefore, the AMR graph must be presented as a sequence." 
}, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-55", "text": "However, the raw AMR text cannot be an appropriate format due to its imbalance of tokens." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-56", "text": "Raw AMR text contains too many round brackets and variables which present less semantic information than other components such as concepts, constants, and English words." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-57", "text": "Unlike the prior work [10] , in our model, the graphs pass through a much simpler pre-processing series which consists of variable removal, graph linearization, and infrequent word replacement." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-58", "text": "For stripping the AMR text, we modified the depth-first-search traversal from the work of Kontas et al [10] in the way of marking the end of a path." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-59", "text": "The left parentheses are ignored and the right parentheses are replaced by doubling the concept of the terminal node." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-60", "text": "The process of recovering the stripped text from the graph is called de-linearization." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-61", "text": "The graph which contains multiple nodes of a single concept might not be perfectly reversed because those nodes have been collapsed into one." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-62", "text": "We show the level of information loss corresponding to each dataset in section 2." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-63", "text": "Table 1 demonstrates the converting process." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-64", "text": "We conducted the measurement of information loss to prove the efficiency of this graph conversion method." 
}, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-65", "text": "All graphs in the official AMR corpus G were passed through a full linearization process to get the linearized versions." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-66", "text": "These sequences were then the input of recovering process to obtain the AMR graph set G ." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-67", "text": "The information loss L(G) is calculated by equation 1 from the SMATCH score." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-68", "text": "The result of the test is presented in table 2." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-69", "text": "----------------------------------" }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-70", "text": "**CONVOLUTIONAL SEQUENCE TO SEQUENCE MODEL**" }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-71", "text": "Our proposal is to utilize three different seq2seq models which have showed their strengths in machine translation." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-72", "text": "They are the combination of a convolutional encoder and an LSTM decoder; and the fully convolutional seq2seq model." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-73", "text": "The first model uses a multilayer bi-directional LSTM encoder to produce hidden states from the input." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-74", "text": "The decoder gathers the hidden states and then generates output with attention mechanism." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-75", "text": "We made a further modification by supplementing a dropout layer locating between two consecutive LSTM layers." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-76", "text": "The second model bases on the work of Gehring et al [8] where the bi-directional LSTM is alternated by a convolutional encoder." 
}, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-77", "text": "The final model fully applies convolutional neural network with attention mechanism [9] ." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-78", "text": "----------------------------------" }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-79", "text": "**DATA ANNOTATION**" }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-80", "text": "The Semeval competitions allowed participants to access multiple AMR corpus annotated manually but no large corpus has been made accessible to the public." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-81", "text": "Especially, there is no open AMR resource for any specific domain such as the juristic document or scientific document." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-82", "text": "Therefore, we manually annotated a corpus for the English version of the Japan Civil Code." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-83", "text": "The code is organized in multiple levels including chapter, part, article, paragraph, and sentence." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-84", "text": "The pre-processing consists of the following steps: gathering articles, removing all article prefixes and article IDs, then splitting the article into sentences." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-85", "text": "We labeled each sentence with an ID containing the article name, the paragraph index, and the sentence index." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-86", "text": "To annotate the sentences, we used the webbased editor 2 provided by ISI group." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-87", "text": "This editor provides a combination of command line and graphical interface." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-88", "text": "The Propbank corpus is integrated into the search engine to minimize the time it takes to choose a proper meaning of the words." 
}, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-89", "text": "Two annotators are given a list of article sentences and annotate corpus independently." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-90", "text": "After finishing their own works, the annotators are invited to discuss and aggregate their outcomes into a single result." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-91", "text": "We call this dataset JCivilCode-1.0." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-92", "text": "The statistics of this corpus is presented in table 2." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-93", "text": "----------------------------------" }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-94", "text": "**EXPERIMENT & RESULT**" }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-95", "text": "We conducted the experiment on two datasets in different domains." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-96", "text": "The first one is the official dataset LDC2014T12, which we designed the first experiment configuration with." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-97", "text": "The second configuration was made on our self-annotated dataset by mixing the training set and the validation set of LDC2014T12 together with JCivilCode-1.0 as the test set." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-98", "text": "We decided to train and test on two different domains because the number of pair of sentences and graphs are not too large." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-99", "text": "The performance of the proposed approaches is assessed using SMATCH score." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-100", "text": "To compare our model with other ones, we collected the performance results of other works on LDC2014T12 from the original paper." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-101", "text": "We also run their best public pre-trained model on JCivilCode-1.0." 
}, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-102", "text": "Table 3 shows that our proposed model outperformed both the transition-based methods and the neuralbased methods on LDC2014T12 whereas our methods are much simpler than prior works." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-103", "text": "The NeuralAMR lies on an intensive preprocessing with graph simplification and strong name-entity anonymization." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-104", "text": "The stack-LSTM model gathers many types of syntactic features including name entity, part-of-speech, dependency tree." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-105", "text": "Moreover, CAMR relies on rich features of a single node, node pair, path, distance, action [18] and semantic role labeling [17] ." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-106", "text": "On the other hand, our models employ only word embedding as feature after a three-step preprocessing as described in section 2.1." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-107", "text": "Linearization method might create two foreseeable issues though it significantly increased the accuracy of neural network method." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-108", "text": "First, entity redundancy occurs if the graph contains multiple nodes who share an identical concept." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-109", "text": "The second issue is the syntax error of the output because the neural network does not guarantee that the output follows the PENMAN notation." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-110", "text": "Table 4 shows some sample of JCivilCode-1.0 and output of our model." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-111", "text": "The bold words in the table show the error that our model generated." 
}, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-112", "text": "----------------------------------" }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-113", "text": "**CONCLUSION**" }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-114", "text": "We published the first release of a testing set of Japan Civil Code for AMR parsing." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-115", "text": "We presented the efficiency of the convolutional seq2seq model on Abstract Meaning Representation parsing." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-116", "text": "By using a simple but effective graph linearization methods, our model achieved a competitive accuracy." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-117", "text": "The result indicates a certain possibility of higher performance on many application basing on AMR." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-118", "text": "However, this method revealed two technical issues that we plan to investigate more in future research." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-119", "text": "----------------------------------" }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-120", "text": "**ACKNOWLEDGEMENT**" }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-121", "text": "This work was supported by JST CREST Grant Number JPMJCR1513, Japan." }, { "sent_id": "023a954d97b5d761b01f09bb242d19-C001-122", "text": "The authors would like to thank our colleagues and reviewers for their intensive comments and suggestions." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "023a954d97b5d761b01f09bb242d19-C001-10", "023a954d97b5d761b01f09bb242d19-C001-11", "023a954d97b5d761b01f09bb242d19-C001-12", "023a954d97b5d761b01f09bb242d19-C001-13", "023a954d97b5d761b01f09bb242d19-C001-14", "023a954d97b5d761b01f09bb242d19-C001-15" ], [ "023a954d97b5d761b01f09bb242d19-C001-35" ], [ "023a954d97b5d761b01f09bb242d19-C001-39", "023a954d97b5d761b01f09bb242d19-C001-40", "023a954d97b5d761b01f09bb242d19-C001-41" ] ], "cite_sentences": [ "023a954d97b5d761b01f09bb242d19-C001-14", "023a954d97b5d761b01f09bb242d19-C001-35", "023a954d97b5d761b01f09bb242d19-C001-39" ] }, "@MOT@": { "gold_contexts": [ [ "023a954d97b5d761b01f09bb242d19-C001-10", "023a954d97b5d761b01f09bb242d19-C001-11", "023a954d97b5d761b01f09bb242d19-C001-12", "023a954d97b5d761b01f09bb242d19-C001-13", "023a954d97b5d761b01f09bb242d19-C001-14", "023a954d97b5d761b01f09bb242d19-C001-15" ], [ "023a954d97b5d761b01f09bb242d19-C001-39", "023a954d97b5d761b01f09bb242d19-C001-40", "023a954d97b5d761b01f09bb242d19-C001-41" ] ], "cite_sentences": [ "023a954d97b5d761b01f09bb242d19-C001-14", "023a954d97b5d761b01f09bb242d19-C001-39" ] }, "@EXT@": { "gold_contexts": [ [ "023a954d97b5d761b01f09bb242d19-C001-57", "023a954d97b5d761b01f09bb242d19-C001-58", "023a954d97b5d761b01f09bb242d19-C001-59", "023a954d97b5d761b01f09bb242d19-C001-60", "023a954d97b5d761b01f09bb242d19-C001-61" ] ], "cite_sentences": [ "023a954d97b5d761b01f09bb242d19-C001-57", "023a954d97b5d761b01f09bb242d19-C001-58" ] } } }, "ABC_b9a748ac201b2d8f5d52abd60aa018_38": { "x": [ { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-2", "text": "In learning Asian languages, learners encounter the problem of character types that are different from those in their first language, for instance, between Chinese characters and the Latin alphabet." 
}, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-3", "text": "This problem also affects listening because learners reconstruct letters from speech sounds." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-4", "text": "Hence, special attention should be paid to listening practice for learners of Asian languages." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-5", "text": "However, to our knowledge, few studies have evaluated the ease of listening comprehension (listenability) in Asian languages." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-6", "text": "Therefore, as a pilot study of listenability in Asian languages, we developed a measurement method for learners of English in order to examine the discriminability of linguistic and learner features." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-7", "text": "The results showed that the accuracy of our method outperformed a simple majority vote, which suggests that a combination of linguistic and learner features should be used to measure listenability in Asian languages as well as in English." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-8", "text": "----------------------------------" }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-10", "text": "An important task of language teachers is to choose reading/listening materials appropriate for the proficiency of their learners so as to prevent decreases in learning motivation." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-11", "text": "However, this task can be a heavy burden for language teachers when they introduce computer-assisted language learning/teaching (CALL/T) techniques." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-12", "text": "Although CALL/T allows language teachers to use different reading/listening materials for each learner, it also increases the number of materials that they must evaluate for appropriateness." 
}, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-13", "text": "To address this issue, alternative methods that automatically measure the ease of reading comprehension (readability) have been developed." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-14", "text": "However, although the majority of previous studies have addressed the measurement of readability: Japanese by Sato et al. (2008) ; Chinese by Sung et al. (2015) , among others, they have not addressed the ease of listening comprehension (henceforth, listenability)." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-15", "text": "Several studies have examined listenability for English learners (Kiyokawa 1990; Kotani et al. 2014; Kotani & Yoshimi 2016; Yoon et al. 2016) ; however, to the best of our knowledge, no previous studies on listenability for learners of Asian languages such as Chinese, Korean, and Japanese have been conducted." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-16", "text": "The method of Kiyokawa (1990) measured listenability based on the length of sentences and the difficulty of words." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-17", "text": "It was hypothesized that the listenability of a sentence decreases as it becomes longer and contains more advanced vocabulary words." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-18", "text": "Kotani et al. (2014) suggested the possibility of using different linguistic elements such as phonological features, and addressed this question by measuring listenability based on various linguistic features, including speech rate and the frequency of phonological modification patterns such as linking." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-19", "text": "In addition, their method used listening test scores as a learner feature to measure listenability relative to proficiency." 
}, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-20", "text": "This is because sentences with less listenability for learners at the beginner level might be easy for those at the advanced level." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-21", "text": "However, because that study focused on the accuracy of measurement, the question of discriminability of This work is licensed under a Creative Commons Attribution 4.0 International Licence." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-22", "text": "Licence details: http://creativecommons.org/licenses/by/4.0/ linguistic and learner features for the measurement of listenability remained." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-23", "text": "The discriminability of linguistic features was examined by Yoon et al. (2016) , who used multiple regression analysis to measure listenability; however, they did not examine the discriminability of learner features." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-24", "text": "Hence, the discriminability of both linguistic and learner features still has yet to be examined." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-25", "text": "Given this background, the purpose of this study was to attempt to answer the following two research questions by measuring listenability on the basis of linguistic and learner features:" }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-26", "text": "(1) How accurately can listenability be measured using linguistic and learner features?" }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-27", "text": "(2) Which of linguistic and learner features are discriminative for the measurement of listenability?" 
}, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-28", "text": "To answer these questions, in this study, we developed a listenability measurement method using a decision tree classification algorithm (Quinlan 1992 ) that classifies sentences into five levels of listenability in order to determine the accuracy of listenability measurement and the discriminability of linguistic and learner features to this classification." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-29", "text": "Although the linguistic and learner features examined were effective for listenability measurement in English, they were not English-specific, which suggests that they may also be useful for the measurement of listenability in Asian languages." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-30", "text": "----------------------------------" }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-31", "text": "**LINGUISTIC AND LEARNER FEATURES**" }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-32", "text": "Listenability is measured based on linguistic and learner features." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-33", "text": "Linguistic features explain the difficulty of a sentence, and learner features explain the proficiency of a learner." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-34", "text": "The linguistic (Chall 1948; Fang 1966; Kiyokawa 1990; Messerklinger 2006; Kotani et al. 2014; Kotani & Yoshimi 2016; Yoon et al. 2016) , and learner features (Kotani et al. 2014; Kotani & Yoshimi 2016) used in this study were originally described elsewhere." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-35", "text": "Linguistic features consist of sentence length, mean word length, multiple syllable words, word difficulty, speech rate, and phonological modification patterns." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-36", "text": "Sentence length is calculated based on the number of words in a sentence." 
}, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-37", "text": "Mean word length is derived from the mean number of syllables per word." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-38", "text": "Multiple syllable words refer to the number of multiple syllable words in a sentence." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-39", "text": "Word difficulty is derived from the rate of words absent from Kiyokawa's basic vocabulary list for words in a sentence." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-40", "text": "Speech rate is calculated in terms of spoken words per minute." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-41", "text": "Phonological modification patterns are derived from the rate of phonologically modified words in a sentence." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-42", "text": "The types of phonological modification patterns are: elision (elimination of phonemes), in which vowel sounds immediately follow a stressed syllable, such as the second \"o\" sound in \"chocolate\"; reduction (weakening a sound by changing a vowel to a schwa), such as vowel sounds in personal/interrogative pronouns, auxiliaries, modals, prepositions, articles, and conjunctions; contraction (combining word pairs), such as a modal with a subject noun; linkage (connecting final and initial word sounds), such as connected a word ending with an \"n\" or \"r\" sound with a word starting with a vowel sound, for example, \"in an hour\" and \"after all\"; and deduction (elimination of sounds between words), in which words share the same sound, for example, \"good day\"." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-43", "text": "Learner features consist of listening test scores, learning experience, visiting experience, and listening frequency." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-44", "text": "Listening test score refers to scores on the Test of English for International Communication (TOEIC)." 
}, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-45", "text": "Learning experience refers to the number of months for which learners have been studying English." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-46", "text": "Visiting experience refers to the number of months learners have spent in English-speaking countries." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-47", "text": "Listening frequency refers to scores on a five-point Likert scale for the frequency of English use (1: infrequently, 2: somewhat infrequently, 3: moderate, 4: somewhat frequently, and 5: frequently)." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-48", "text": "----------------------------------" }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-49", "text": "**TRAINING/TEST DATA**" }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-50", "text": "Training/test data for a decision tree classification algorithm were constructed using the learner corpus of Kotani et al. (2014) , which includes learners' judgment of listenability." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-51", "text": "Listenability was judged by learners of English as a foreign language using scores on a five-point Likert scale (1: easy, 2: somewhat easy, 3: average, 4: somewhat difficult, or 5: difficult)." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-52", "text": "Scores were judged on a sentenceby-sentence basis where each learner listened to and assigned scores for 80 sentences from four news clips selected from the editorial and special sections for English learners on the Voice of America (VOA) website (http://www.voanews.com)." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-53", "text": "News clips in the special section were intended for learners, while news clips in the editorial section were intended for native speakers of English." 
}, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-54", "text": "The news clips in the special section consisted of short, simple sentences using the VOA's basic vocabulary of 1,500 words; idiomatic expressions were avoided." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-55", "text": "By contrast, the news clips in the editorial section were made without any restrictions on vocabulary and sentence construction, as long as they were appropriate as news clips for native speakers of English." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-56", "text": "The speech rate of the news clips in the special section were two-thirds slower than those in the editorial section, which were read aloud at a natural speech rate of approximately 250 syllables per minute (Robb & Gillon 2007) ." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-57", "text": "The learners were 90 university students (48 males, 42 females; mean age \u00b1 SD, 21.5 \u00b1 2.6 years) who were compensated for their participation." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-58", "text": "All learners were asked to submit valid scores from TOEIC tests taken in the current or previous year." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-59", "text": "The mean TOEIC listening score was 334.78 \u00b1 98.14." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-60", "text": "The minimum sore was 130 (n = 1), and the maximum score was 495 (n = 8)." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-61", "text": "Although the training/test data should have consisted of 7,200 instances (90 learners \u00d7 80 sentences) for valid listenability measurement, only 6,804 instances were actually observed." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-62", "text": "Assuming that the missing 396 instances resulted from listening difficulties, these instances were scored as having the lowest listenability." 
}, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-63", "text": "Most instances (25.2%) were scored in the middle range (3) of listenability, and the fewest instances (15.8%) were scored in the high range (2)." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-64", "text": "Listenability scores of 1, 4, and 5 were given by 21.7%, 20.8%, and 16.5% of the learners, respectively." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-65", "text": "Table 1 shows the means and SDs of the linguistic and learner features in the training/test data." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-66", "text": "----------------------------------" }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-67", "text": "**EXPERIMENT**" }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-68", "text": "Listenability was measured on the basis of linguistic and learner features using a decision tree classification algorithm implemented on C4.5 software (Quinlan 1992) ." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-69", "text": "All settings were taken as defaults, and classification was evaluated using five-fold cross validation." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-70", "text": "The results of the five-fold cross validation tests, as well as the confusion matrix for the test data, where the rows indicate the correct classification and the columns indicate the selected classes, are shown in Table 2 ." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-71", "text": "The accuracy of classification rate was 47.0% ((1116+293+740+574+661)/7200) in the test data." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-72", "text": "Although this might be insufficient for validating our listenability measurement method, we believe that the method can still be judged as valid through a comparison with the accuracy attained by a simple majority vote (25.2%) as a baseline." 
}, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-73", "text": "We calculated the accuracy for each listenability score from 1 to 5, which is shown as bracketed numbers in Table 2 ." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-74", "text": "The accuracy varied from 25.8% (293/(299+293+348+125+70)) to 71.4% (1116/(1116+190+169+46+42))." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-75", "text": "As this examination was not conclusive, it remains for the future study to examine why the method showed the different accuracies in more detail." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-76", "text": "Using the five-fold cross validation test, five decision trees (I-V) were generated." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-77", "text": "In four of the five decision trees, the same type of linguistic and learner features were allocated at the root nodes, the first-level child nodes (child nodes originating from the root nodes), and the second-level child nodes (child nodes originating from the first-level child nodes)." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-78", "text": "Parts of the decision tree (I-IV) can be seen in Figure 1 ; the different roots (V) are shown in bold." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-79", "text": "Part V of the decision tree is shown in Figure 2 ." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-80", "text": "As the listening test score was allocated at the root node of the five decision trees, this feature was regarded as the most discriminative." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-81", "text": "Visiting experience was allocated at the first-level child node of the decision trees, and was therefore judged as the second most discriminative feature." 
}, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-82", "text": "The third most discriminative feature was regarded as the speech rate, because it was allocated at either the first-or second-level child node in each tree." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-83", "text": "----------------------------------" }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-84", "text": "**CONCLUSION**" }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-85", "text": "In this study, we examined the measurement of listenability for learners of English as a foreign language." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-86", "text": "We found that learner features were discriminative for the measurement accuracy." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-87", "text": "This finding suggests that learner features should be taken into account when measuring listenability for learners of Asian languages." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-88", "text": "Although the accuracy was not high, our method outperformed a simple majority vote." }, { "sent_id": "b9a748ac201b2d8f5d52abd60aa018-C001-89", "text": "In the future, using this method as a baseline, we plan on developing a listenability measurement method for Asian languages that would outperform that for English." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "b9a748ac201b2d8f5d52abd60aa018-C001-15" ], [ "b9a748ac201b2d8f5d52abd60aa018-C001-18", "b9a748ac201b2d8f5d52abd60aa018-C001-19", "b9a748ac201b2d8f5d52abd60aa018-C001-20", "b9a748ac201b2d8f5d52abd60aa018-C001-21", "b9a748ac201b2d8f5d52abd60aa018-C001-22" ] ], "cite_sentences": [ "b9a748ac201b2d8f5d52abd60aa018-C001-15", "b9a748ac201b2d8f5d52abd60aa018-C001-18" ] }, "@MOT@": { "gold_contexts": [ [ "b9a748ac201b2d8f5d52abd60aa018-C001-18", "b9a748ac201b2d8f5d52abd60aa018-C001-19", "b9a748ac201b2d8f5d52abd60aa018-C001-20", "b9a748ac201b2d8f5d52abd60aa018-C001-21", "b9a748ac201b2d8f5d52abd60aa018-C001-22" ] ], "cite_sentences": [ "b9a748ac201b2d8f5d52abd60aa018-C001-18" ] }, "@USE@": { "gold_contexts": [ [ "b9a748ac201b2d8f5d52abd60aa018-C001-34" ], [ "b9a748ac201b2d8f5d52abd60aa018-C001-50", "b9a748ac201b2d8f5d52abd60aa018-C001-51" ] ], "cite_sentences": [ "b9a748ac201b2d8f5d52abd60aa018-C001-34", "b9a748ac201b2d8f5d52abd60aa018-C001-50" ] } } }, "ABC_4d25b647a261a415d769532386265a_38": { "x": [ { "sent_id": "4d25b647a261a415d769532386265a-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "4d25b647a261a415d769532386265a-C001-2", "text": "The recent success of transformer networks for neural machine translation and other NLP tasks has led to a surge in research work trying to apply it for speech recognition." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-3", "text": "Recent efforts studied key research questions around ways of combining positional embedding with speech features, and stability of optimization for large scale learning of transformer networks." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-4", "text": "In this paper, we propose replacing the sinusoidal positional embedding for transformers with convolutionally learned input representations." 
}, { "sent_id": "4d25b647a261a415d769532386265a-C001-5", "text": "These contextual representations provide subsequent transformer blocks with relative positional information needed for discovering long-range relationships between local concepts." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-6", "text": "The proposed system has favorable optimization characteristics where our reported results are produced with fixed learning rate of 1.0 and no warmup steps." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-7", "text": "The proposed model reduces the word error rate (WER) by 12% and 16% relative to previously published work on Librispeech \"dev other\" and \"test other\" subsets respectively, when no extra LM text is provided." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-8", "text": "Full code to reproduce our results will be available online at the time of publication." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-9", "text": "----------------------------------" }, { "sent_id": "4d25b647a261a415d769532386265a-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "4d25b647a261a415d769532386265a-C001-11", "text": "Speech Recognition systems have experienced many advances over the past decade, with neural acoustic and language models leading to impressive new levels of performance across many challenging tasks [1, 2] ." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-12", "text": "Advances in alignment-free sequence-level loss functions like CTC and ASG [3, 4] enabled easier training with letters as output units [5, 6] ." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-13", "text": "The success of sequence-to-sequence models in neural machine translation systems [7, 8] offered further simplification to ASR systems by integrating the acoustic and the language models into a single encoder-decoder architecture that is jointly optimized [9, 10] ." 
}, { "sent_id": "4d25b647a261a415d769532386265a-C001-14", "text": "The encoder focuses on building acoustic representations that the decoder, through different attention mechanisms, can use to generate the target units." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-15", "text": "Recently, transformer networks have been shown to perform well for neural machine translation [11] and many other NLP tasks [12] ." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-16", "text": "A Transformer layer distinguishes itself from a regular recurrent network by entirely relying on a key-value \"self\"-attention mechanism for learning relationships between distant concepts, rather than relying on recurrent connections and memory cells to preserve information, as in LSTMs, that can fade over time steps." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-17", "text": "Transformer layers can be seen as bagof-concept layers because they don't preserve location information in the weighted sum self-attention operation." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-18", "text": "To model word order, sinusoidal positional embeddings are used [11] ." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-19", "text": "There has been recent research interest in using transformer networks for end-to-end ASR both with CTC loss [13] and in an encoder-decoder framework [14, 15] with modest performance compared to baseline systems." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-20", "text": "For a standard hybrid ASR system, [16] introduced a time-constrained key-value self-attention layer to be used in tandem with other TDNN and recurrent layers." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-21", "text": "Using time-restricted self-attention context enabled the authors to model input positions as 1-hot vectors, however, they didn't show a conclusive evidence for the impact of the self-attention context size." 
}, { "sent_id": "4d25b647a261a415d769532386265a-C001-22", "text": "One interesting research question in all previous work was: how to best introduce positional information for input speech features." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-23", "text": "Answers range from dropping it altogether, adding it to input features/embedding, and concatenating it with input features leaving it to the neural network to decide how to combine them." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-24", "text": "In this paper, we take an alternative approach." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-25", "text": "We propose replacing sinusoidal positional embedding with contextually augmented inputs learned by 2-D convolutional layers over input speech features in the encoder, and by 1-D convolutional layers over previously generated outputs in the decoder." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-26", "text": "Lower layers build atomic concepts, both in encoders and decoders, by learning local relationships between time steps." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-27", "text": "Long-range sequential structure modeling is left to subsequent layers." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-28", "text": "Although the transformer's flexible inductive bias is able to mimic convolution filters in its lower layers, we argue that this comes at the expense of brittle optimization." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-29", "text": "We believe that adding early convolutional layers allows the model to learn implicit relative positional encodings which enable subsequent transformer layers to recover the right order of the output sequence." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-30", "text": "Using convolutional layers as input processors before recurrent layers in acoustic encoders has been previously proposed for computational reasons with minimal impact on performance [17] ." 
}, { "sent_id": "4d25b647a261a415d769532386265a-C001-31", "text": "So, we focus our experiments on understanding the impact of the convolutional context size consumed by the decoder 1-D convolutional layers." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-32", "text": "Our best model configuration, with a fixed learning rate of 1.0, no hyperparameter or decoder optimization, achieves 12% and 16% relative reduction in WER compared to previously published results on the acoustically challenging Librispeech [18] \"dev other\" and \"test other\" subsets, when no extra LM text data is used during decoding." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-33", "text": "----------------------------------" }, { "sent_id": "4d25b647a261a415d769532386265a-C001-34", "text": "**TRANSFORMERS WITH CONVOLUTIONAL CONTEXT**" }, { "sent_id": "4d25b647a261a415d769532386265a-C001-35", "text": "We propose dividing the modeling task into two subcomponents: learning local relationships within a small context with convolutional layers, and learning global sequential structure of the input with transformer layers." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-36", "text": "This division simplifies transformer optimization leading to more stable training and better results because we don't need to force lower transformer layers to learn local dependencies." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-37", "text": "----------------------------------" }, { "sent_id": "4d25b647a261a415d769532386265a-C001-38", "text": "**TRANSFORMER LAYER**" }, { "sent_id": "4d25b647a261a415d769532386265a-C001-39", "text": "Transformer layers [11] have the ability to learn long range relationships for many sequential classification tasks [12] ." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-40", "text": "Multi-" }, { "sent_id": "4d25b647a261a415d769532386265a-C001-41", "text": "The dot product between keys and queries is scaled by the inverse square root of the key dimension." 
}, { "sent_id": "4d25b647a261a415d769532386265a-C001-42", "text": "This self-attention operation is done h times in parallel, for the case of h attention heads, with different projection matrices from dinput to d k , d k , and dv." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-43", "text": "The final output is a concatenation of h vectors each with dimension dv which is in turn linearly projected to the desired output dimension of the self-attention layer." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-44", "text": "On top of the self-attention component, transformer layers have multiple operations applied on each time step; dropout, residual connection, layer norm, two fully connected layers with a ReLU layer in between, another residual and Layer norm operations." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-45", "text": "Figure(1) -left show the details of one transformer layer as proposed by [11] ." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-46", "text": "----------------------------------" }, { "sent_id": "4d25b647a261a415d769532386265a-C001-47", "text": "**ADDING CONTEXT TO TRANSFORMER**" }, { "sent_id": "4d25b647a261a415d769532386265a-C001-48", "text": "Our convolutional layers are added below the Transformer layers, and we do not make any use of positional encodings." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-49", "text": "The model learns an acoustic language model over the bag of discovered acoustic units as it goes deeper in the encoder." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-50", "text": "The experimental results show that using a relatively deep encoder is critical for getting good performance." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-51", "text": "For the encoder, we used 2-D convolutional blocks with layer norms and ReLU after each convolutional layer." 
}, { "sent_id": "4d25b647a261a415d769532386265a-C001-52", "text": "Each convolutional block contains K convolutional layers followed by a 2-D max pooling layer, as shown in figure(2)-right." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-53", "text": "For the decoder, we follow a similar approach using 1-D convolutions over embeddings of previously predicted words (shown in figure(2)-left with N 1-D convolutional layers in each decoder convolutional block)." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-54", "text": "----------------------------------" }, { "sent_id": "4d25b647a261a415d769532386265a-C001-55", "text": "**FULL END-TO-END MODEL ARCHITECTURE**" }, { "sent_id": "4d25b647a261a415d769532386265a-C001-56", "text": "Figure (1)-right shows the full end-to-end system architecture." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-57", "text": "Each block in the model is repeated multiple times (shown on the top right corner of each block)." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-58", "text": "On the decoder side, we use a separate multi-head attention layer to aggregate encoder context for each decoder transformer block." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-59", "text": "We found that hav- ing more than one attention layer improves the overall system recognition performance." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-60", "text": "The decoder 1-D convolution only looks at historical predictions with its end point at the current time step." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-61", "text": "Similarly, the transformer layers have future target steps masked, so that decoder self-attention is only running over current and previous time steps to respect left-to-right output generation." 
}, { "sent_id": "4d25b647a261a415d769532386265a-C001-62", "text": "We are not investigating online/streaming decoding conditions in this paper, so the encoder self-attention is allowed to operate over the entire input utterance." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-63", "text": "----------------------------------" }, { "sent_id": "4d25b647a261a415d769532386265a-C001-64", "text": "**EXPERIMENTAL RESULTS**" }, { "sent_id": "4d25b647a261a415d769532386265a-C001-65", "text": "----------------------------------" }, { "sent_id": "4d25b647a261a415d769532386265a-C001-66", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "4d25b647a261a415d769532386265a-C001-67", "text": "We evaluate performance on the Librispeech dataset [18] containing 1000h of training data with development and test sets split into simple (\"clean\") and harder (\"other\") subsets 1 ." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-68", "text": "We use 5k \"unigram\" subword target units learned by the sentence piece package [20] with full coverage of all training text data." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-69", "text": "Input speech is represented as 80-D log mel-filterbank coefficients plus three fundamental frequency features computed every 10ms with a 25ms window." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-70", "text": "All experiments were not tuned to best possible performance using training hyperparameter or decoder optimization." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-71", "text": "We don't use scheduled sampling or label smoothing." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-72", "text": "For regularization, we use a single dropout rate of 0.15 across all blocks as part of our default configurations." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-73", "text": "For model optimization, we use the AdaDelta algorithm [21] with fixed learning rate=1.0 and gradient clipping at 10.0." 
}, { "sent_id": "4d25b647a261a415d769532386265a-C001-74", "text": "We run all configurations for 80 epochs, we then report results on an average model computed over the last 30 checkpoints." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-75", "text": "Averaging the last few checkpoints brings the model weights closer to the nearest local minimum." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-76", "text": "We could have stopped training models much earlier than 80 epochs, with a different early stopping point for different runs, but decided to stick by a generic training recipe to simplify reproducing our results." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-77", "text": "It is important to mention that we aren't using a learning rate warmup schedule and yet the model converges to the reported WER results in a stable way." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-78", "text": "This general fixed training recipe wasn't optimized on any part of Librispeech." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-79", "text": "The standard convolutional tranformer model used in most experiments has the following configuration: (1) Two 2-D convolutional blocks, each with two conv." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-80", "text": "layers with kernel size=3, max-pooling kernel=2." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-81", "text": "The first block has 64 feature maps while the second has 128, (2) 10 encoder transformer blocks all with transformer dim=1024, 16 heads, intermediate ReLU layer size=2048, (3) decoder input word embedding dim=512, (4) three 1-D conv." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-82", "text": "layers each with kernel size=3, no max pooling is used for the decoder side 1-D convolution, and (5) 10 decoder transformer blocks each with encoder-side multihead attention, otherwise the configuration is identical to the encoder transformer block." 
}, { "sent_id": "4d25b647a261a415d769532386265a-C001-83", "text": "This canonical model has about 223M parameters, and it takes about 24 hours to perform all 80 epochs on 2 machines each with 8GPUs with 16GB of memory." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-84", "text": "All results are reported without any external language model trained on extra text data." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-85", "text": "Our focus is to study the contextual transformer decoder's ability to model the statistical properties of the spoken training data." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-86", "text": "We use beam size of 5 during inference for all experiments except mentioned otherwise." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-87", "text": "----------------------------------" }, { "sent_id": "4d25b647a261a415d769532386265a-C001-88", "text": "**MODEL COMPARISIONS**" }, { "sent_id": "4d25b647a261a415d769532386265a-C001-89", "text": "We first studied performance of our approach to alternative architectures and positional encoding schemes." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-90", "text": "In table (1) we show the WER of the proposed transformer encoder-decoder model with convolutional context using the canonical configuration in the first row." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-91", "text": "Replacing the 1-D convolutional context in the decoder with sinusoidal positional embedding, as proposed in the baseline machine translation transformers [11] and adopted in [13, 15] , shows inferior WER performance." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-92", "text": "By combining sinusoidal and convolutional position embedding (rows 1+2), we don't observe any gains." 
}, { "sent_id": "4d25b647a261a415d769532386265a-C001-93", "text": "This supports our intuition that the relative convolutional positional information provides sufficient signal for the transformer layers to recreate more global word order." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-94", "text": "We also found that having multiple encoder-side attention layers is critical for achieving the best WER." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-95", "text": "Increasing the intermediate ReLU layer in each encoder and decoder layer was found to greatly improve the overall WER across different sets, however increasing the number of attention heads, while keeping the attention dimension the same, deteriorates the performance." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-96", "text": "To better understand these results, we also studied the effects of different hyperparameter settings." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-97", "text": "Table( 2) shows the effect of different decoder convolutional context sizes spread over different depths." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-98", "text": "All configurations in table(2) share the same canonical configuration and the number of 1-D conv." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-99", "text": "feature maps were chosen to ensure the total number of parameters are fixed between all configurations." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-100", "text": "The best performance comes from using the same parameter budget over wider context that is built over multiple convolutional layers." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-101", "text": "However, the decoder is able to get reasonable WER even with a context of just 3 words as input to the transformer layers." 
}, { "sent_id": "4d25b647a261a415d769532386265a-C001-102", "text": "Using a deep transformer encoder capture long range structure of the data as an acoustic LM built on top of learned concepts from the" }, { "sent_id": "4d25b647a261a415d769532386265a-C001-103", "text": "----------------------------------" }, { "sent_id": "4d25b647a261a415d769532386265a-C001-104", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "4d25b647a261a415d769532386265a-C001-105", "text": "We presented a transformer seq2seq ASR system with learned convolutional context, both for the encoder and the decoder." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-106", "text": "Input convolutional layers capture relative positional information which enables subsequent transformer blocks to learn long range relationships between local concepts in the encoder, and recover the target sequence in the decoder." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-107", "text": "Using a deep transformer encoder was important to reach best performance, as we demonstrated empirically." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-108", "text": "Our best configuration achieves 12% and 16% relative reduction in WER compared to previously published systems on Librispeech \"dev other\" and \"test other\" subsets respectively, when no extra LM text is provided." }, { "sent_id": "4d25b647a261a415d769532386265a-C001-109", "text": "Future work includes extending our approach to other NLP tasks, and testing its impact on larger scale benchmarks." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "4d25b647a261a415d769532386265a-C001-15", "4d25b647a261a415d769532386265a-C001-16", "4d25b647a261a415d769532386265a-C001-17" ], [ "4d25b647a261a415d769532386265a-C001-39", "4d25b647a261a415d769532386265a-C001-40", "4d25b647a261a415d769532386265a-C001-41", "4d25b647a261a415d769532386265a-C001-42", "4d25b647a261a415d769532386265a-C001-43", "4d25b647a261a415d769532386265a-C001-44" ], [ "4d25b647a261a415d769532386265a-C001-45" ], [ "4d25b647a261a415d769532386265a-C001-90", "4d25b647a261a415d769532386265a-C001-91", "4d25b647a261a415d769532386265a-C001-92", "4d25b647a261a415d769532386265a-C001-93" ] ], "cite_sentences": [ "4d25b647a261a415d769532386265a-C001-15", "4d25b647a261a415d769532386265a-C001-39", "4d25b647a261a415d769532386265a-C001-45", "4d25b647a261a415d769532386265a-C001-91" ] }, "@USE@": { "gold_contexts": [ [ "4d25b647a261a415d769532386265a-C001-18" ], [ "4d25b647a261a415d769532386265a-C001-45" ], [ "4d25b647a261a415d769532386265a-C001-90", "4d25b647a261a415d769532386265a-C001-91", "4d25b647a261a415d769532386265a-C001-92", "4d25b647a261a415d769532386265a-C001-93" ] ], "cite_sentences": [ "4d25b647a261a415d769532386265a-C001-18", "4d25b647a261a415d769532386265a-C001-45", "4d25b647a261a415d769532386265a-C001-91" ] } } }, "ABC_93a1f611592ce6aa5cde7538486f97_38": { "x": [ { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-2", "text": "Our research aims at tracking the semantic evolution of the lexicon over time." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-3", "text": "For this purpose, we investigated two wellknown training protocols for neural language models in a synchronic experiment and encountered several problems relating to accuracy and reliability." 
}, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-4", "text": "We were able to identify critical parameters for improving the underlying protocols in order to generate more adequate diachronic language models." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-5", "text": "----------------------------------" }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-7", "text": "The lexicon can be considered the most dynamic part of all linguistic knowledge sources over time." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-8", "text": "There are two innovative change strategies typical for lexical systems: the creation of entirely new lexical items, commonly reflecting the emergence of novel ideas, technologies or artifacts, on the one hand, and, on the other hand, shifts in the meaning of already existing lexical items, a process which usually takes place over larger periods of time." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-9", "text": "Tracing semantic changes of the latter type is the main focus of our research." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-10", "text": "Meaning shift has recently been investigated with emphasis on neural language models (Kim et al., 2014; Kulkarni et al., 2015) ." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-11", "text": "This work is based on the assumption that the measurement of semantic change patterns can be reduced to the measurement of lexical similarity between lexical items." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-12", "text": "Neural language models, originating from the word2vec algorithm (Mikolov et al., 2013a; Mikolov et al., 2013b; Mikolov et al., 2013c) , are currently considered as state-of-the-art solutions for implementing this assumption (Schnabel et al., 2015) ." 
}, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-13", "text": "Within this approach, changes in similarity relations between lexical items at two different points of time are interpreted as a signal for meaning shift." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-14", "text": "Accordingly, lexical items which are very similar to the lexical item under scrutiny can be considered as approximating its meaning at a given point in time." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-15", "text": "Both techniques were already combined in prior work to show, e.g., the increasing association of the lexical item \"gay\" with the meaning dimension of \"homosexuality\" (Kim et al., 2014; Kulkarni et al., 2015) ." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-16", "text": "We here investigate the accuracy and reliability of such similarity judgments derived from different training protocols dependent on word frequency, word ambiguity and the number of training epochs (i.e., iterations over all training material)." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-17", "text": "Accuracy renders a judgment of the overall model quality, whereas reliability between repeated experiments ensures that qualitative judgments can indeed be transferred between experiments." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-18", "text": "Based on the identification of critical conditions in the experimental set-up of previously employed protocols, we recommend improved training strategies for more adequate neural language models dealing with diachronic lexical change patterns." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-19", "text": "Our results concerning reliability also cast doubt on the reproducibility of experiments where semantic similarity between lexical items is taken as a computationally valid indicator for properly capturing lexical meaning (and, consequently, meaning shifts) under a diachronic perspective." 
}, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-20", "text": "----------------------------------" }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-21", "text": "**RELATED WORK**" }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-22", "text": "Neural language models for tracking semantic changes over time typically distinguish between two different training protocols-continuous training of models (Kim et al., 2014) where the model for each time span is initialized with the embeddings of its predecessor, and, alternatively, independent training with a mapping between models for different points in time (Kulkarni et al., 2015) ." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-23", "text": "A comparison between these two protocols, such as the one proposed in this paper, has not been carried out before." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-24", "text": "Also, the application of such protocols to non-English corpora is lacking, with the exception of our own work relating to German data (Hellrich and Hahn, 2016b; Hellrich and Hahn, 2016a) ." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-25", "text": "The word2vec algorithm is a heavily trimmed version of an artificial neural network used to generate low-dimensional vector space representations of a lexicon." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-26", "text": "We focus on its skip-gram variant, trained to predict plausible contexts for a given word that was shown to be superior over other settings for modeling semantic information (Mikolov et al., 2013a) ." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-27", "text": "There are several parameters to choose for training-learning rate, downsampling factor for frequent words, number of training epochs and choice between two strategies for managing the huge number of potential contexts." 
}, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-28", "text": "One strategy, hierarchical softmax, uses a binary tree to efficiently represent the vocabulary, while the other, negative sampling, works by updating only a limited number of word vectors during each training step." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-29", "text": "Furthermore, artificial neural networks, in general, are known for a large number of local optima encountered during optimization." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-30", "text": "While these commonly lead to very similar performance (LeCun et al., 2015) , they cause different representations in the course of repeated experiments." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-31", "text": "Approaches to modelling changes of lexical semantics not using neural language models, e.g., Wijaya and Yeniterzi (2011), Gulordava and Baroni (2011), Mihalcea and Nastase (2012) , Riedl et al. (2014) or Jatowt and Duh (2014) are, intentionally, out of the scope of this paper." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-32", "text": "In the same way, we here refrain from comparison with computational studies dealing with literary discussions related to the Romantic period (e.g., Aggarwal et al. (2014) )." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-33", "text": "----------------------------------" }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-34", "text": "**EXPERIMENTAL SET-UP**" }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-35", "text": "For comparability with earlier studies (Kim et al., 2014; Kulkarni et al., 2015) , we use the fiction part of the GOOGLE BOOKS NGRAM corpus (Michel et al., 2011; Lin et al., 2012) ." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-36", "text": "This part of the corpus is also less affected by sampling irregularities than other parts (Pechenick et al., 2015) ." 
}, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-37", "text": "Due to the opaque nature of GOOGLE's corpus acquisition strategy, the influence of OCR errors on our results cannot be reasonably estimated, yet we assume that they will affect all experiments in an equal manner." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-38", "text": "The wide range of experimental parameters described in Section 2 makes it virtually impossible to test all their possible combinations, especially as repeated experiments are necessary to probe a method's reliability." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-39", "text": "We thus concentrate on two experimental protocols-the one described by Kim et al. (2014) (referred to as Kim protocol) and the one from Kulkarni et al. (2015) (referred to as Kulkarni protocol), including close variations thereof." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-40", "text": "Kulkarni's protocol operates on all 5-grams occurring during five consecutive years (e.g., [1900] [1901] [1902] [1903] [1904] and trains models independently of each other." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-41", "text": "Kim's protocol operates on uniformly sized samples of 10M 5-grams for each year from 1850 onwards in a continuous fashion (years before 1900 are used for initialization only)." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-42", "text": "Its constant sampling sizes result in both oversampling and undersampling as is evident from Figure 1 ." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-43", "text": "We use the PYTHON-based GENSIM 1 implementation of word2vec for our experiments; the relevant code is made available via GITHUB." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-44", "text": "2 Due to the 5-gram nature of the corpus, a context window covering four neighboring words is used for all experiments." 
}, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-45", "text": "Only words with at least 10 occurrences in a sample are modeled." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-46", "text": "Training for each sample is repeated until convergence 3 is achieved or 10 epochs have passed." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-47", "text": "Following both protocols, we use word vectors with 200 Table 1 : Accuracy and reliability among top n words for threefold application of different training protocols." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-48", "text": "Reliability is given as fraction of the maximum for n. Standard deviation for accuracy \u00b10, if not noted otherwise; reliability is based on the evaluation of all lexical items, thus no standard deviation." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-49", "text": "dimensions for all experiments, as well as an initial learning rate of 0.01 for experiments based on 10M samples, and one of 0.025 for systems trained on unsampled texts; the threshold for downsampling frequent words was 10 \u22123 for sample-based experiments and 10 \u22125 for unsampled ones." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-50", "text": "We tested both negative sampling and hierarchical softmax training strategies, the latter being canonical for Kulkarni's protocol, whereas Kim's protocol is underspecified in this regard." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-51", "text": "We evaluate accuracy by using the test set developed by Mikolov et al. (2013a) ." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-52", "text": "This test set is based on present-day English language and world knowledge, yet we assume it to be a viable proxy for overall model quality." 
}, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-53", "text": "It contains groups of four words connected via the analogy relation '::' and the similarity relation '\u223c', as exemplified by the expression king \u223c queen :: man \u223c woman." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-54", "text": "We evaluate reliability by training three identically parametrized models for each experiment." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-55", "text": "We then compare the top n similar words (by cosine distance) for each word modeled by the experiments with a variant of the Jaccard coefficient (Manning et al., 2008, p.61) ." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-56", "text": "We limit our analysis to values of n between 1 and 5, in accordance with data on word2vec accuracy (Schnabel et al., 2015) ." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-57", "text": "The 3-dimensional array W i,j,k contains words ordered by similarity (i) for a word in question (j) according to an experiment (k)." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-58", "text": "If a word in question is not modeled by an experiment, as can be the case for comparisons over different samples, \u2205 is the corresponding entry." 
}, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-59", "text": "The reliability r for a specific value of n (r@n) is defined as the magnitude of the intersection of similar words produced by all three experiments with a rank of n or lower, averaged over all t words modeled by any of these experiments and normalized by n, the maximally achievable score for this value of n:" }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-60", "text": "----------------------------------" }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-61", "text": "**RESULTS**" }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-62", "text": "We focus our analysis on the representations generated for the initial period, i.e., 1900 for samplebased experiments and 1900-1904 for unsampled ones." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-63", "text": "This choice was made since researchers can be assumed to be aware of current word meanings, thus making correct judgments on initial word semantics more important." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-64", "text": "As a beneficial side effect, we get a marked reduction of computational demands, saving several CPU years compared to an evaluation based on the most recent period." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-65", "text": "Table 1 depicts the assessments for different training protocols." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-66", "text": "Four results seem relevant for future experiments." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-67", "text": "First, reliability at different top-n cut-offs is rather uniform, so that evaluations could be performed on top-1 reliability only without real losses." 
}, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-68", "text": "Second, both accuracy and reliability are often far higher for negative sampling than for hierarchical softmax under direct comparison of the evaluated conditions; under no condition hierarchical softmax outperforms negative sampling." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-69", "text": "Third, continuous training improves reliability, yet not accuracy, for systems trained on samples." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-70", "text": "Fourth, reliability for experiments between samples heavi-ly degrades compared to reliability for repeated experiments on the same sample." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-71", "text": "----------------------------------" }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-72", "text": "**TRAINING PROTOCOLS**" }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-73", "text": "----------------------------------" }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-74", "text": "**DETAILED INVESTIGATION**" }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-75", "text": "As variations of Kulkarni's protocol yield more consistent results, we further explore its performance considering word frequency, word ambiguity and the number of training epochs." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-76", "text": "All experiments described in this section are based on the complete 1900-1904 corpus." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-77", "text": "Figure 2 shows the influence of word frequency, negative sampling being overall more reliable, especially for words with low or medium frequency." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-78", "text": "The 21 words reported to have undergone traceable semantic changes 4 are all frequent with percentiles between 89 and 99." 
}, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-79", "text": "For such high-frequency words hierarchical softmax performs similar or slightly better." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-80", "text": "Entries in the lexical database WORDNET (Fellbaum, 1998) can be employed to measure the effect of word ambiguity on reliability." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-81", "text": "5 The number of WORDNET synsets a word belongs to (i.e., the number of its senses) seems to have little effect on top-1 reliability for negative sampling, while hierarchical softmax underperforms for words with a low number of senses, as shown in Figure 3 ." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-82", "text": "Model reliability and accuracy depend on the number of training epochs, as shown in Figure 4 ." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-83", "text": "There are diminishing returns for hierarchical softmax, reliability staying constant after 5 epochs, while negative sampling increases in reliability with each epoch." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-84", "text": "Yet, both methods achieve maximal accuracy after only 2 epochs; additional epochs lead to a small decrease from 0.4 down to 0.38 for negative sampling." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-85", "text": "This could indicate overfitting, but accuracy is based on a test set for modern-day language, and can thus not be considered a fully valid yardstick." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-86", "text": "----------------------------------" }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-87", "text": "**DISCUSSION**" }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-88", "text": "Our investigation in the performance of two common protocols for training neural language models on historical text data led to several hitherto unknown results." 
}, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-89", "text": "We could show that negative sampling outperforms hierarchical softmax both in terms of accuracy and reliability, especially 4 Kulkarni et al. (2015) compiled the following list based on prior work (Wijaya and Yeniterzi, 2011; Gulordava and Baroni, 2011; Jatowt and Duh, 2014; Kim et al., 2014): card, sleep, parent, address, gay, mouse, king, checked, check, actually, supposed, guess, cell, headed, ass, mail, toilet, cock, bloody, nice and guy." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-90", "text": "5 We used WORDNET 3.0 and the API provided by the Natural Language Toolkit (NLTK): www.nltk.org Even the most reliable system often identifies widely different words as most similar." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-91", "text": "This carries unwarranted potential for erroneous conclusions on a words' semantic evolution, e.g., \"romantic\" happens to be identified as most similar to \"lazzaroni\" 7 , \"fanciful\" and \"melancholies\" by three systems trained with negative sampling on 1900-1904 texts." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-92", "text": "We are thus skeptical about using such similarity clouds to describe or visualize lexical semantics at a point in time." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-93", "text": "In future work, we will explore the effects of continuous training based on complete corpora." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-94", "text": "The selection of a convergence criterion remains another open issue due to the threefold trade-off between training time, reliability and accuracy." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-95", "text": "It would also be interesting to replicate our experiments for other languages or points in time." 
}, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-96", "text": "Yet, the enormous corpus size for more recent years might require a reduced number of maximum epochs for these experiments." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-97", "text": "In order to improve the semantic modeling itself one could lemmatize the training material or utilize the part of speech annotations provided in the latest version of the GOOGLE corpus (Lin et al., 2012) ." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-98", "text": "Also, recently available neural language models with support for multiple word senses (Bartunov et al., 2016; Panchenko, 2016) could be helpful, since semantic changes can often be described as changes in the usage frequency of different word senses (Rissanen, 2008, pp.58-59) ." }, { "sent_id": "93a1f611592ce6aa5cde7538486f97-C001-99", "text": "Finally, it is clearly important to test the effect of our proposed changes, based on synchronic experiments, on a system for tracking diachronic changes in word semantics." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "93a1f611592ce6aa5cde7538486f97-C001-10", "93a1f611592ce6aa5cde7538486f97-C001-7", "93a1f611592ce6aa5cde7538486f97-C001-8", "93a1f611592ce6aa5cde7538486f97-C001-9" ], [ "93a1f611592ce6aa5cde7538486f97-C001-12", "93a1f611592ce6aa5cde7538486f97-C001-13", "93a1f611592ce6aa5cde7538486f97-C001-14", "93a1f611592ce6aa5cde7538486f97-C001-15" ], [ "93a1f611592ce6aa5cde7538486f97-C001-22", "93a1f611592ce6aa5cde7538486f97-C001-23", "93a1f611592ce6aa5cde7538486f97-C001-24" ], [ "93a1f611592ce6aa5cde7538486f97-C001-39", "93a1f611592ce6aa5cde7538486f97-C001-41", "93a1f611592ce6aa5cde7538486f97-C001-42" ] ], "cite_sentences": [ "93a1f611592ce6aa5cde7538486f97-C001-10", "93a1f611592ce6aa5cde7538486f97-C001-15", "93a1f611592ce6aa5cde7538486f97-C001-22", "93a1f611592ce6aa5cde7538486f97-C001-39" ] }, "@MOT@": { "gold_contexts": [ [ "93a1f611592ce6aa5cde7538486f97-C001-12", "93a1f611592ce6aa5cde7538486f97-C001-13", "93a1f611592ce6aa5cde7538486f97-C001-14", "93a1f611592ce6aa5cde7538486f97-C001-15" ], [ "93a1f611592ce6aa5cde7538486f97-C001-22", "93a1f611592ce6aa5cde7538486f97-C001-23", "93a1f611592ce6aa5cde7538486f97-C001-24" ] ], "cite_sentences": [ "93a1f611592ce6aa5cde7538486f97-C001-15", "93a1f611592ce6aa5cde7538486f97-C001-22" ] }, "@USE@": { "gold_contexts": [ [ "93a1f611592ce6aa5cde7538486f97-C001-35" ], [ "93a1f611592ce6aa5cde7538486f97-C001-39", "93a1f611592ce6aa5cde7538486f97-C001-41", "93a1f611592ce6aa5cde7538486f97-C001-42" ] ], "cite_sentences": [ "93a1f611592ce6aa5cde7538486f97-C001-35", "93a1f611592ce6aa5cde7538486f97-C001-39" ] } } }, "ABC_7a5ebe06eebebabf4340f1cf583d86_38": { "x": [ { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-2", "text": "Our research aims at tracking the semantic evolution of the lexicon over time." 
}, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-3", "text": "For this purpose, we investigated two wellknown training protocols for neural language models in a synchronic experiment and encountered several problems relating to accuracy and reliability." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-4", "text": "We were able to identify critical parameters for improving the underlying protocols in order to generate more adequate diachronic language models." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-5", "text": "----------------------------------" }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-7", "text": "The lexicon can be considered the most dynamic part of all linguistic knowledge sources over time." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-8", "text": "There are two innovative change strategies typical for lexical systems: the creation of entirely new lexical items, commonly reflecting the emergence of novel ideas, technologies or artifacts, on the one hand, and, on the other hand, shifts in the meaning of already existing lexical items, a process which usually takes place over larger periods of time." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-9", "text": "Tracing semantic changes of the latter type is the main focus of our research." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-10", "text": "Meaning shift has recently been investigated with emphasis on neural language models (Kim et al., 2014; Kulkarni et al., 2015) ." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-11", "text": "This work is based on the assumption that the measurement of semantic change patterns can be reduced to the measurement of lexical similarity between lexical items." 
}, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-12", "text": "Neural language models, originating from the word2vec algorithm (Mikolov et al., 2013a; Mikolov et al., 2013b; Mikolov et al., 2013c) , are currently considered as state-of-the-art solutions for implementing this assumption (Schnabel et al., 2015) ." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-13", "text": "Within this approach, changes in similarity relations between lexical items at two different points of time are interpreted as a signal for meaning shift." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-14", "text": "Accordingly, lexical items which are very similar to the lexical item under scrutiny can be considered as approximating its meaning at a given point in time." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-15", "text": "Both techniques were already combined in prior work to show, e.g., the increasing association of the lexical item \"gay\" with the meaning dimension of \"homosexuality\" (Kim et al., 2014; Kulkarni et al., 2015) ." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-16", "text": "We here investigate the accuracy and reliability of such similarity judgments derived from different training protocols dependent on word frequency, word ambiguity and the number of training epochs (i.e., iterations over all training material)." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-17", "text": "Accuracy renders a judgment of the overall model quality, whereas reliability between repeated experiments ensures that qualitative judgments can indeed be transferred between experiments." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-18", "text": "Based on the identification of critical conditions in the experimental set-up of previously employed protocols, we recommend improved training strategies for more adequate neural language models dealing with diachronic lexical change patterns." 
}, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-19", "text": "Our results concerning reliability also cast doubt on the reproducibility of experiments where semantic similarity between lexical items is taken as a computationally valid indicator for properly capturing lexical meaning (and, consequently, meaning shifts) under a diachronic perspective." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-20", "text": "----------------------------------" }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-21", "text": "**RELATED WORK**" }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-22", "text": "Neural language models for tracking semantic changes over time typically distinguish between two different training protocols-continuous training of models (Kim et al., 2014) where the model for each time span is initialized with the embeddings of its predecessor, and, alternatively, independent training with a mapping between models for different points in time (Kulkarni et al., 2015) ." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-23", "text": "A comparison between these two protocols, such as the one proposed in this paper, has not been carried out before." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-24", "text": "Also, the application of such protocols to non-English corpora is lacking, with the exception of our own work relating to German data (Hellrich and Hahn, 2016b; Hellrich and Hahn, 2016a) ." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-25", "text": "The word2vec algorithm is a heavily trimmed version of an artificial neural network used to generate low-dimensional vector space representations of a lexicon." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-26", "text": "We focus on its skip-gram variant, trained to predict plausible contexts for a given word that was shown to be superior over other settings for modeling semantic information (Mikolov et al., 2013a) ." 
}, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-27", "text": "There are several parameters to choose for training-learning rate, downsampling factor for frequent words, number of training epochs and choice between two strategies for managing the huge number of potential contexts." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-28", "text": "One strategy, hierarchical softmax, uses a binary tree to efficiently represent the vocabulary, while the other, negative sampling, works by updating only a limited number of word vectors during each training step." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-29", "text": "Furthermore, artificial neural networks, in general, are known for a large number of local optima encountered during optimization." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-30", "text": "While these commonly lead to very similar performance (LeCun et al., 2015) , they cause different representations in the course of repeated experiments." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-31", "text": "Approaches to modelling changes of lexical semantics not using neural language models, e.g., Wijaya and Yeniterzi (2011), Gulordava and Baroni (2011), Mihalcea and Nastase (2012) , Riedl et al. (2014) or Jatowt and Duh (2014) are, intentionally, out of the scope of this paper." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-32", "text": "In the same way, we here refrain from comparison with computational studies dealing with literary discussions related to the Romantic period (e.g., Aggarwal et al. (2014) )." 
}, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-33", "text": "----------------------------------" }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-34", "text": "**EXPERIMENTAL SET-UP**" }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-35", "text": "For comparability with earlier studies (Kim et al., 2014; Kulkarni et al., 2015) , we use the fiction part of the GOOGLE BOOKS NGRAM corpus (Michel et al., 2011; Lin et al., 2012) ." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-36", "text": "This part of the corpus is also less affected by sampling irregularities than other parts (Pechenick et al., 2015) ." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-37", "text": "Due to the opaque nature of GOOGLE's corpus acquisition strategy, the influence of OCR errors on our results cannot be reasonably estimated, yet we assume that they will affect all experiments in an equal manner." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-38", "text": "The wide range of experimental parameters described in Section 2 makes it virtually impossible to test all their possible combinations, especially as repeated experiments are necessary to probe a method's reliability." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-39", "text": "We thus concentrate on two experimental protocols-the one described by Kim et al. (2014) (referred to as Kim protocol) and the one from Kulkarni et al. (2015) (referred to as Kulkarni protocol), including close variations thereof." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-40", "text": "Kulkarni's protocol operates on all 5-grams occurring during five consecutive years (e.g., [1900] [1901] [1902] [1903] [1904] and trains models independently of each other." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-41", "text": "Kim's protocol operates on uniformly sized samples of 10M 5-grams for each year from 1850 onwards in a continuous fashion (years before 1900 are used for initialization only)." 
}, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-42", "text": "Its constant sampling sizes result in both oversampling and undersampling as is evident from Figure 1 ." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-43", "text": "We use the PYTHON-based GENSIM 1 implementation of word2vec for our experiments; the relevant code is made available via GITHUB." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-44", "text": "2 Due to the 5-gram nature of the corpus, a context window covering four neighboring words is used for all experiments." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-45", "text": "Only words with at least 10 occurrences in a sample are modeled." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-46", "text": "Training for each sample is repeated until convergence 3 is achieved or 10 epochs have passed." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-47", "text": "Following both protocols, we use word vectors with 200 Table 1 : Accuracy and reliability among top n words for threefold application of different training protocols." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-48", "text": "Reliability is given as fraction of the maximum for n. Standard deviation for accuracy \u00b10, if not noted otherwise; reliability is based on the evaluation of all lexical items, thus no standard deviation." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-49", "text": "dimensions for all experiments, as well as an initial learning rate of 0.01 for experiments based on 10M samples, and one of 0.025 for systems trained on unsampled texts; the threshold for downsampling frequent words was 10 \u22123 for sample-based experiments and 10 \u22125 for unsampled ones." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-50", "text": "We tested both negative sampling and hierarchical softmax training strategies, the latter being canonical for Kulkarni's protocol, whereas Kim's protocol is underspecified in this regard." 
}, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-51", "text": "We evaluate accuracy by using the test set developed by Mikolov et al. (2013a) ." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-52", "text": "This test set is based on present-day English language and world knowledge, yet we assume it to be a viable proxy for overall model quality." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-53", "text": "It contains groups of four words connected via the analogy relation '::' and the similarity relation '\u223c', as exemplified by the expression king \u223c queen :: man \u223c woman." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-54", "text": "We evaluate reliability by training three identically parametrized models for each experiment." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-55", "text": "We then compare the top n similar words (by cosine distance) for each word modeled by the experiments with a variant of the Jaccard coefficient (Manning et al., 2008, p.61) ." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-56", "text": "We limit our analysis to values of n between 1 and 5, in accordance with data on word2vec accuracy (Schnabel et al., 2015) ." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-57", "text": "The 3-dimensional array W i,j,k contains words ordered by similarity (i) for a word in question (j) according to an experiment (k)." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-58", "text": "If a word in question is not modeled by an experiment, as can be the case for comparisons over different samples, \u2205 is the corresponding entry." 
}, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-59", "text": "The reliability r for a specific value of n (r@n) is defined as the magnitude of the intersection of similar words produced by all three experiments with a rank of n or lower, averaged over all t words modeled by any of these experiments and normalized by n, the maximally achievable score for this value of n:" }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-60", "text": "----------------------------------" }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-61", "text": "**RESULTS**" }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-62", "text": "We focus our analysis on the representations generated for the initial period, i.e., 1900 for samplebased experiments and 1900-1904 for unsampled ones." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-63", "text": "This choice was made since researchers can be assumed to be aware of current word meanings, thus making correct judgments on initial word semantics more important." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-64", "text": "As a beneficial side effect, we get a marked reduction of computational demands, saving several CPU years compared to an evaluation based on the most recent period." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-65", "text": "Table 1 depicts the assessments for different training protocols." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-66", "text": "Four results seem relevant for future experiments." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-67", "text": "First, reliability at different top-n cut-offs is rather uniform, so that evaluations could be performed on top-1 reliability only without real losses." 
}, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-68", "text": "Second, both accuracy and reliability are often far higher for negative sampling than for hierarchical softmax under direct comparison of the evaluated conditions; under no condition hierarchical softmax outperforms negative sampling." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-69", "text": "Third, continuous training improves reliability, yet not accuracy, for systems trained on samples." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-70", "text": "Fourth, reliability for experiments between samples heavi-ly degrades compared to reliability for repeated experiments on the same sample." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-71", "text": "----------------------------------" }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-72", "text": "**TRAINING PROTOCOLS**" }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-73", "text": "----------------------------------" }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-74", "text": "**DETAILED INVESTIGATION**" }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-75", "text": "As variations of Kulkarni's protocol yield more consistent results, we further explore its performance considering word frequency, word ambiguity and the number of training epochs." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-76", "text": "All experiments described in this section are based on the complete 1900-1904 corpus." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-77", "text": "Figure 2 shows the influence of word frequency, negative sampling being overall more reliable, especially for words with low or medium frequency." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-78", "text": "The 21 words reported to have undergone traceable semantic changes 4 are all frequent with percentiles between 89 and 99." 
}, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-79", "text": "For such high-frequency words hierarchical softmax performs similar or slightly better." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-80", "text": "Entries in the lexical database WORDNET (Fellbaum, 1998) can be employed to measure the effect of word ambiguity on reliability." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-81", "text": "5 The number of WORDNET synsets a word belongs to (i.e., the number of its senses) seems to have little effect on top-1 reliability for negative sampling, while hierarchical softmax underperforms for words with a low number of senses, as shown in Figure 3 ." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-82", "text": "Model reliability and accuracy depend on the number of training epochs, as shown in Figure 4 ." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-83", "text": "There are diminishing returns for hierarchical softmax, reliability staying constant after 5 epochs, while negative sampling increases in reliability with each epoch." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-84", "text": "Yet, both methods achieve maximal accuracy after only 2 epochs; additional epochs lead to a small decrease from 0.4 down to 0.38 for negative sampling." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-85", "text": "This could indicate overfitting, but accuracy is based on a test set for modern-day language, and can thus not be considered a fully valid yardstick." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-86", "text": "----------------------------------" }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-87", "text": "**DISCUSSION**" }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-88", "text": "Our investigation in the performance of two common protocols for training neural language models on historical text data led to several hitherto unknown results." 
}, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-89", "text": "We could show that negative sampling outperforms hierarchical softmax both in terms of accuracy and reliability, especially 4 Kulkarni et al. (2015) compiled the following list based on prior work (Wijaya and Yeniterzi, 2011; Gulordava and Baroni, 2011; Jatowt and Duh, 2014; Kim et al., 2014): card, sleep, parent, address, gay, mouse, king, checked, check, actually, supposed, guess, cell, headed, ass, mail, toilet, cock, bloody, nice and guy." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-90", "text": "5 We used WORDNET 3.0 and the API provided by the Natural Language Toolkit (NLTK): www.nltk.org Even the most reliable system often identifies widely different words as most similar." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-91", "text": "This carries unwarranted potential for erroneous conclusions on a words' semantic evolution, e.g., \"romantic\" happens to be identified as most similar to \"lazzaroni\" 7 , \"fanciful\" and \"melancholies\" by three systems trained with negative sampling on 1900-1904 texts." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-92", "text": "We are thus skeptical about using such similarity clouds to describe or visualize lexical semantics at a point in time." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-93", "text": "In future work, we will explore the effects of continuous training based on complete corpora." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-94", "text": "The selection of a convergence criterion remains another open issue due to the threefold trade-off between training time, reliability and accuracy." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-95", "text": "It would also be interesting to replicate our experiments for other languages or points in time." 
}, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-96", "text": "Yet, the enormous corpus size for more recent years might require a reduced number of maximum epochs for these experiments." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-97", "text": "In order to improve the semantic modeling itself one could lemmatize the training material or utilize the part of speech annotations provided in the latest version of the GOOGLE corpus (Lin et al., 2012) ." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-98", "text": "Also, recently available neural language models with support for multiple word senses (Bartunov et al., 2016; Panchenko, 2016) could be helpful, since semantic changes can often be described as changes in the usage frequency of different word senses (Rissanen, 2008, pp.58-59) ." }, { "sent_id": "7a5ebe06eebebabf4340f1cf583d86-C001-99", "text": "Finally, it is clearly important to test the effect of our proposed changes, based on synchronic experiments, on a system for tracking diachronic changes in word semantics." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "7a5ebe06eebebabf4340f1cf583d86-C001-10", "7a5ebe06eebebabf4340f1cf583d86-C001-7", "7a5ebe06eebebabf4340f1cf583d86-C001-8", "7a5ebe06eebebabf4340f1cf583d86-C001-9" ], [ "7a5ebe06eebebabf4340f1cf583d86-C001-12", "7a5ebe06eebebabf4340f1cf583d86-C001-13", "7a5ebe06eebebabf4340f1cf583d86-C001-14", "7a5ebe06eebebabf4340f1cf583d86-C001-15", "7a5ebe06eebebabf4340f1cf583d86-C001-16" ], [ "7a5ebe06eebebabf4340f1cf583d86-C001-22", "7a5ebe06eebebabf4340f1cf583d86-C001-23" ], [ "7a5ebe06eebebabf4340f1cf583d86-C001-39", "7a5ebe06eebebabf4340f1cf583d86-C001-40" ] ], "cite_sentences": [ "7a5ebe06eebebabf4340f1cf583d86-C001-10", "7a5ebe06eebebabf4340f1cf583d86-C001-15", "7a5ebe06eebebabf4340f1cf583d86-C001-22", "7a5ebe06eebebabf4340f1cf583d86-C001-39" ] }, "@MOT@": { "gold_contexts": [ [ "7a5ebe06eebebabf4340f1cf583d86-C001-12", "7a5ebe06eebebabf4340f1cf583d86-C001-13", "7a5ebe06eebebabf4340f1cf583d86-C001-14", "7a5ebe06eebebabf4340f1cf583d86-C001-15", "7a5ebe06eebebabf4340f1cf583d86-C001-16" ], [ "7a5ebe06eebebabf4340f1cf583d86-C001-22", "7a5ebe06eebebabf4340f1cf583d86-C001-23" ] ], "cite_sentences": [ "7a5ebe06eebebabf4340f1cf583d86-C001-15", "7a5ebe06eebebabf4340f1cf583d86-C001-22" ] }, "@USE@": { "gold_contexts": [ [ "7a5ebe06eebebabf4340f1cf583d86-C001-35" ], [ "7a5ebe06eebebabf4340f1cf583d86-C001-39", "7a5ebe06eebebabf4340f1cf583d86-C001-40" ] ], "cite_sentences": [ "7a5ebe06eebebabf4340f1cf583d86-C001-35", "7a5ebe06eebebabf4340f1cf583d86-C001-39" ] }, "@DIF@": { "gold_contexts": [ [ "7a5ebe06eebebabf4340f1cf583d86-C001-88", "7a5ebe06eebebabf4340f1cf583d86-C001-89" ] ], "cite_sentences": [ "7a5ebe06eebebabf4340f1cf583d86-C001-89" ] } } }, "ABC_a01a6ab7cf13c7916b1b3823a4b4de_38": { "x": [ { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-2", "text": "While past research on machine transliteration has focused on a single transliteration task, there exist 
a variety of supplemental transliterations available in other languages." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-3", "text": "Given an input for English-to-Hindi transliteration, for example, transliterations from other languages such as Japanese or Hebrew may be helpful in the transliteration process." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-27", "text": "Here, we apply this method verbatim to machine transliteration." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-4", "text": "In this paper, we propose the application of such supplemental transliterations to English-to-Hindi machine transliteration via an SVM re-ranking method with features based on n-gram alignments as well as system and alignment scores." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-5", "text": "This method achieves a relative improvement of over 10% over the base system used on its own." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-6", "text": "We further apply this method to system combination, demonstrating just under 5% relative improvement." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-7", "text": "----------------------------------" }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-9", "text": "The focus of significant previous work in machine transliteration, including that presented at past NEWS Shared Tasks (Li et al., 2009; Kumaran et al., 2010b), has been on single transliteration tasks in isolation from other languages." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-10", "text": "This is despite the fact that the various languages provided represent a significant quantity of potentially useful data that is being ignored."
}, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-11", "text": "In this NEWS 2011 Shared Task submission, we present a method which beneficially applies supplemental transliterations from other languages to English-to-Hindi transliteration." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-12", "text": "In practice, this is a realistic situation in which transliterations from other languages can help." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-13", "text": "For example, Wikipedia contains articles on guitarist John Petrucci in English and Japanese, but not in Hindi." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-14", "text": "If we wanted to automatically generate a stub (skeleton) article in Hindi, we would need to transliterate his name into Hindi." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-15", "text": "Since a Japanese version already exists, we could extract from it additional information to help with the transliteration process." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-16", "text": "Importantly, since our article is about an American guitarist, we would explicitly want to start with the English (original) version of the name, and treat other languages as extra data, rather than vice versa." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-17", "text": "In order to effectively incorporate the otherlanguage data, we apply SVM re-ranking in a manner that has previously been shown to provide significant improvement for grapheme-to-phoneme conversion (Bhargava and Kondrak, 2011) ." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-18", "text": "This method is flexible enough to incorporate multiple languages; it employs features based on character alignments between potential outputs and existing transliterations from other languages, as well as scores of these alignments, which serve as a measure of similarity." 
}, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-19", "text": "We apply this approach on top of the same DIRECTL+ system as submitted last year (Jiampojamarn et al., 2010b) for English-to-Hindi machine transliteration." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-20", "text": "Compared to the base DI-RECTL+ performance, we are able to achieve significantly better results, with a relative performance increase of over 10%." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-21", "text": "We also achieve improvements without supplemental transliterations by simply apply the same approach with another system's output as extra data." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-22", "text": "We furthermore experiment with romanization for Hindi data as well as different alignment length settings for English-toChinese transliteration." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-23", "text": "This paper presents methods, methodology, and results for the above experiments." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-24", "text": "----------------------------------" }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-25", "text": "**LEVERAGING MULTIPLE TRANSLITERATIONS**" }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-26", "text": "Bhargava and Kondrak (2011) present a method for applying transliterations to grapheme-to-phoneme conversion." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-28", "text": "The method is based on SVM re-ranking applied over n-best output lists generated by a base system." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-29", "text": "Intuitively, we have an existing base transliteration system that, for a given input, provides a set of n scored outputs, with the correct output not always appearing in the top position." 
}, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-30", "text": "In order to help bring the correct output to the top, we turn to existing transliterations of the input from other languages." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-31", "text": "In order to leverage a variety of features and transliterations from all available languages, SVM re-ranking is applied to this task." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-32", "text": "For each output, a feature vector is constructed." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-33", "text": "Given alignments between the input and output, for example, binary indicator features based on grouping input and output n-grams in the style of DIRECTL+ (Jiampojamarn et al., 2010a) are constructed." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-34", "text": "The base system's score for the output would be included as well, along with differences between the given output's score and the scores for the other outputs in the list." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-35", "text": "This feature construction process is then repeated, replacing the input with an available transliteration, for each available transliteration language." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-36", "text": "The score in this latter case is used as a measure of how \"similar\" a candidate output is to a \"reference\" transliteration from another language." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-37", "text": "We refer to these other transliterations as supplemental transliterations." 
}, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-38", "text": "While the score features provide a global measure of similarity, the n-gram features allow weights to be learned for character combinations between the candidate output and supplemental transliterations; this provides very fine-grained features that can explicitly use certain characters in supplemental transliterations to help determine the quality of a candidate output." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-39", "text": "There are, however, some practicalities that must be considered." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-40", "text": "Bhargava and Kondrak (2011) note the importance of applying multiple languages; they found it difficult to achieve significant improvements using transliterations from one language only." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-41", "text": "This is due in part to noise in the data (which has been observed in some of the NEWS Shared Task data (Jiampojamarn et al., 2009) ) as well as differing conventions for various transliteration \"schemes\"." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-42", "text": "These issues are handled implicitly in two ways: (1) the granularity of the n-gram features allows certain character combinations in the transliteration to be learned as being positive or negative indicators of a candidate output's quality, or that they should be ignored altogether; and (2) the use of multiple transliterations helps smooth out some of the noise." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-43", "text": "While we do not examine these methods here for brevity's sake, Bhargava and Kondrak (2011) show the effectiveness of the granular n-gram features vs. the score features as well as the importance of applying multiple transliteration languages." 
}, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-44", "text": "----------------------------------" }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-45", "text": "**ALIGNMENT OF TRAINING DATA**" }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-46", "text": "Practically, we must consider how to generate the alignments between the candidate output transliterations and the supplemental transliterations for the n-gram features, as well as how to generate the similarity scores." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-47", "text": "M2M-ALIGNER (Jiampojamarn et al., 2007) addresses both of these." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-48", "text": "M2M-ALIGNER is an unsupervised character alignment system, meaning that it can learn to align data given sufficient training data consisting of unaligned inputoutput pairs." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-49", "text": "Once trained, M2M-ALIGNER will then produce an alignment for a new pair as well as an alignment score." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-50", "text": "Because the algorithm is a many-to-many extension of the unsupervised edit distance algorithm, we can see that the alignment score should represent some notion of scriptagnostic similarity." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-51", "text": "Since we will be applying M2M-ALIGNER between candidate output transliterations and supplemental transliterations for a variety of supplemental languages, we will need to build several alignment models, each being built from separate training data." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-52", "text": "The majority of the task data are Englishsource, so for any entry in one language corpus we can easily find corresponding transliterations in other language corpora." 
}, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-53", "text": "In other words, to generate training data for M2M-ALIGNER between the target transliteration language and a supplemental language, we need only intersect the two corpora on the basis of the common English input." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-54", "text": "Table 1 shows the amount of overlap between the test data for the different English-source languages and the combined training and development data for the other English-source languages." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-55", "text": "Note that the Chinese-and Korean-target corpora show very high coverage; however, we focus on Englishto-Hindi transliteration as it enables us to more closely examine the outputs based on our own linguistic familiarities." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-56", "text": "The use of other corpora here requires that these results be submitted as a nonstandard run." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-57", "text": "Note that, because there is not complete coverage for the English-to-Hindi test data, we simply submit the base system's results as-is in cases where there is no transliteration available from other languages." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-58", "text": "----------------------------------" }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-59", "text": "**BASE SYSTEMS**" }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-60", "text": "Our principal base system that generates the n-best output lists is DIRECTL+, which has produced excellent results in the NEWS 2010 Shared Task on Transliteration (Jiampojamarn et al., 2010b) ." 
}, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-61", "text": "For re-ranking, note that training a re-ranker requires training data where the base system scores are representative of unseen data so that the re-ranker does not simply learn to follow the base system; we therefore split the training data into ten folds and perform a form of cross-validation with DIRECTL+." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-62", "text": "This provides us with usable training data for re-ranking." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-63", "text": "We tune the SVM's hyperparameter based on performance on the provided development data, and use the best DIRECTL+ settings established in the NEWS 2010 Shared Task (Jiampojamarn et al., 2010b)." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-64", "text": "Armed with optimal parameter settings, we combine the training and development data into a single set used to train our final DIRECTL+ system." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-65", "text": "We also repeat the cross-validation process for training the re-ranker." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-66", "text": "We also apply the SVM re-ranking approach to system combination." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-67", "text": "In this case, we additionally train another system-here we use SEQUITUR (Bisani and Ney, 2008)-for English-to-Hindi transliteration." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-68", "text": "During test time, we feed the input into both DIRECTL+ and SEQUITUR, and use the top SEQUITUR output as supplemental data." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-69", "text": "We expect that sometimes SEQUITUR will provide a correct answer where DIRECTL+ does not; the hope is that the SVM re-ranking approach will be able to learn when this is the case based on the n-gram and score features." 
}, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-70", "text": "----------------------------------" }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-71", "text": "**HINDI ROMANIZATION**" }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-72", "text": "In addition to the above re-ranking approach, we experimented with a romanization method for the Hindi data." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-73", "text": "Since consonant characters in the Devanagari alphabet have vowels included by default, we romanize the text in order to provide DIRECTL+ with direct individual control over the consonant and vowel components of the Hindi characters." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-74", "text": "The default vowel is changed by means of diacritic-like characters that delete the default vowel; this requires a context-sensitive (but still rule-based) romanization method, which we construct manually." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-75", "text": "We then train DIRECTL+ on the romanized data; during testing, we take the romanized output and convert it back into Devanagari Unicode characters, again using a manually-constructed context-sensitive rule-based converter." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-76", "text": "Table 2 shows that SVM re-ranking significantly improves the English-to-Hindi transliteration accuracy in comparison with the base system." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-77", "text": "Leveraging all of the English-source transliteration corpora as supplemental data yields an increase of over 10%." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-78", "text": "When applied using SEQUITUR's output as \"supplemental\" data, we see almost a 5% (relative) increase in word accuracy." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-79", "text": "In contrast, our Hindi romanization approach decreases the accuracy." 
}, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-80", "text": "This differs from the results of the successful application of romanization to Japanese (Jiampojamarn et al., 2010b), demonstrating that it is not always possible to transfer an idea from one language to another." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-81", "text": "----------------------------------" }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-82", "text": "**RESULTS**" }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-83", "text": "The English-to-Chinese results, which use only the base DIRECTL+ system, demonstrate the importance of the alignment length parameter setting." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-84", "text": "DIRECTL+ requires aligned data for input, and the maximum length of the alignments will have an effect on what DIRECTL+ learns to produce." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-85", "text": "We submitted both 3-to-1 and 7-to-1 alignments because they gave similar results during development, and both were better than other tested possibilities." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-86", "text": "In the final results, we see a substantial difference between the two alignment settings." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-87", "text": "We hypothesize that the complexity of English-to-Chinese mappings is better captured by the alignments that map longer sequences of English letters to single Chinese characters." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-88", "text": "The shorter alignments may split these complex mappings, making it difficult to generalize to new data." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-89", "text": "Finally, we observe very good overall accuracy in the English-to-Japanese results (which also only use base DIRECTL+), which further confirm the effectiveness of DIRECTL+ when applied to machine transliteration." 
}, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-90", "text": "----------------------------------" }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-91", "text": "**PREVIOUS WORK**" }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-92", "text": "There are three lines of research that are relevant to the work we have presented in this paper: (1) DIRECTL+ and SEQUITUR for machine transliteration; (2) applying multiple languages; and (3) system combination." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-93", "text": "For the NEWS 2009 and 2010 Shared Tasks, the discriminative DIRECTL+ system that incorporates many-to-many alignments, online max-margin training and a phrasal decoder was shown to function well as a general string transduction tool; while originally designed for grapheme-to-phoneme conversion, it produced excellent results for machine transliteration (Jiampojamarn et al., 2009; Jiampojamarn et al., 2010b), leading us to re-use it here." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-94", "text": "Finch and Sumita (2010) also submitted a top-performing system that was based in part on SEQUITUR, which is a generative system based on joint n-gram modelling (Bisani and Ney, 2008)." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-95", "text": "In this paper, we applied multiple transliteration languages to a single transliteration task." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-96", "text": "While our method is based on SVM re-ranking with features similar to those used in the base system (Bhargava and Kondrak, 2011), there have been other explorations into incorporating other language data, particularly when data are scarce." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-97", "text": "Zhang et al. (2010), for example, apply a pivoting approach to machine transliteration, and similarly Khapra et al. (2010) propose to transliterate through \"bridge\" languages." 
}, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-98", "text": "Along similar lines, Kumaran et al. (2010a) find increases in accuracy using a linear-combination-of-scores system that combined the outputs of a direct transliteration system with a system that transliterated through a third language." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-99", "text": "For statistical machine translation, Cohn and Lapata (2007) also explore the use of a third language." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-100", "text": "Finally, we also touched briefly on system combination: we applied the SVM re-ranking method to combining the outputs of both DIRECTL+ and SEQUITUR, in particular treating DIRECTL+ as the base system and using SEQUITUR's best outputs to re-rank DIRECTL+'s output lists." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-101", "text": "Finch and Sumita (2010), in contrast, combine SEQUITUR's output with that of a phrase-based statistical machine translation system, achieving excellent results." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-102", "text": "While our approach is based on SVM re-ranking, theirs merged the outputs of the two systems and then used a linear combination of the system scores to re-rank the combined list." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-103", "text": "----------------------------------" }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-104", "text": "**CONCLUSION**" }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-105", "text": "In this paper, we described our submission to the NEWS 2011 Shared Task on machine transliteration." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-106", "text": "Our focus was on incorporating supplemental data, using a method based on SVM re-ranking, with features derived from n-gram alignments and alignment scores." 
}, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-107", "text": "We demonstrated improvements of over 10% when applying other transliteration data to English-to-Hindi machine transliteration, and just under 5% when applying another system's outputs in a similar manner." }, { "sent_id": "a01a6ab7cf13c7916b1b3823a4b4de-C001-108", "text": "We also found that the romanization of Hindi characters brings about a decrease in performance, and that the alignment length parameter in the DIRECTL+ system has a critical effect on the results." } ], "y": { "@BACK@": { "gold_contexts": [ [ "a01a6ab7cf13c7916b1b3823a4b4de-C001-17", "a01a6ab7cf13c7916b1b3823a4b4de-C001-18", "a01a6ab7cf13c7916b1b3823a4b4de-C001-19", "a01a6ab7cf13c7916b1b3823a4b4de-C001-20" ], [ "a01a6ab7cf13c7916b1b3823a4b4de-C001-60" ], [ "a01a6ab7cf13c7916b1b3823a4b4de-C001-93" ] ], "cite_sentences": [ "a01a6ab7cf13c7916b1b3823a4b4de-C001-19", "a01a6ab7cf13c7916b1b3823a4b4de-C001-60", "a01a6ab7cf13c7916b1b3823a4b4de-C001-93" ] }, "@USE@": { "gold_contexts": [ [ "a01a6ab7cf13c7916b1b3823a4b4de-C001-60" ], [ "a01a6ab7cf13c7916b1b3823a4b4de-C001-93" ] ], "cite_sentences": [ "a01a6ab7cf13c7916b1b3823a4b4de-C001-60", "a01a6ab7cf13c7916b1b3823a4b4de-C001-93" ] }, "@EXT@": { "gold_contexts": [ [ "a01a6ab7cf13c7916b1b3823a4b4de-C001-63", "a01a6ab7cf13c7916b1b3823a4b4de-C001-64" ] ], "cite_sentences": [ "a01a6ab7cf13c7916b1b3823a4b4de-C001-63" ] }, "@DIF@": { "gold_contexts": [ [ "a01a6ab7cf13c7916b1b3823a4b4de-C001-78", "a01a6ab7cf13c7916b1b3823a4b4de-C001-79", "a01a6ab7cf13c7916b1b3823a4b4de-C001-80" ] ], "cite_sentences": [ "a01a6ab7cf13c7916b1b3823a4b4de-C001-80" ] } } }, "ABC_604807137ee5d9a6775821496c6af5_39": { "x": [ { "sent_id": "604807137ee5d9a6775821496c6af5-C001-66", "text": "For a new target collection, it identifies relevant adaptable source collections based on their similarity." 
}, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-67", "text": "One can select any of the candidate source collections (selected collection highlighted in Figure 2(a)) and adapt." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-2", "text": "We demonstrate SODA (Service Oriented Domain Adaptation) for efficient and scalable cross-domain microblog categorization, which works on the principle of transfer learning." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-3", "text": "It is built on a novel similarity-based iterative domain adaptation algorithm and extended with features such as active learning and an interactive GUI for use by business professionals." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-4", "text": "SODA achieves strong classification accuracy on new collections while minimizing, and sometimes eliminating, the need for expensive data labeling efforts." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-5", "text": "SODA also implements an active learning (AL) technique to select informative instances from the new collection to seek annotations, if a small amount of labeled data is required by the adaptation algorithm." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-6", "text": "----------------------------------" }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-8", "text": "Online social media, such as Twitter.com, have become the de facto standard for sharing information, thoughts, ideas, personal feelings, daily happenings, etc." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-9", "text": "which essentially led research and development in the field of social media analytics to flourish." 
}, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-10", "text": "Social media analytics provide actionable insights to business by analyzing huge amounts of user-generated content (UGC) (Sriram et al., 2010; Jo and Oh, 2011; He et al., 2012; Si et al., 2013; Nakov et al., 2013)." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-11", "text": "Sentiment categorization, one of the common social media analytics tasks, segregates a collection of UGC into different buckets with positive, negative or neutral orientation (Liu and Zhang, 2012; Thelwall et al., 2011; Bollen et al., 2009). (* Work done at Xerox Research Centre India.)" }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-12", "text": "This information is used to aggregate statistics and identify trends which are helpful for many applications, viz. Customer Care, Product Marketing, and User Studies." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-13", "text": "Supervised machine learning (ML) techniques such as text categorization have played a key enabler role to classify microblogs into sentiment categories (Pang and Lee, 2008; Tan et al., 2009; Go et al., 2009; Fern\u00e1ndez et al., 2014)." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-14", "text": "These are trained on a fraction of annotated data as per a client-provided label set, e.g. {positive, negative, and neutral}, for a product/service/domain." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-15", "text": "One of the obstacles to rapid adoption of such systems is the requirement of labeled tweets for developing ML-based models, as it requires extensive human labeling effort." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-16", "text": "Additionally, the need for manual labeling slows down the process of categorization on high-velocity social media, which requires fast analytic insights." 
}, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-17", "text": "From our conversations with business professionals, we identified the need for a practical solution that would help them scale up across hundreds of collections and domains without the overhead of annotating data and building models from scratch every time for a new collection." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-18", "text": "In this paper, we demonstrate Service Oriented Domain Adaptation (SODA), which offers social media analytics as-a-service to the users." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-19", "text": "Specifically, it provides sentiment categorization as-a-service that allows users to efficiently analyze comments from any new collection without the overhead of manual annotations or re-training models." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-20", "text": "It thus enables faster wide-scale analysis within and across different domains/industries such as telecom, healthcare, finance etc." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-43", "text": "C_S is then used to predict labels for the pool of unlabeled instances from the target collection, referred to as P_u, using the shared representations." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-21", "text": "SODA is based on an iterative ensemble based adaptation technique (Bhatt et al., 2015) which gradually transfers knowledge from the source to the new target collection while being cognizant of similarity between the two collections." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-22", "text": "It has been extensively evaluated by business professionals in a user-trial and on a benchmark dataset." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-23", "text": "Figure 1 illustrates the architecture of SODA comprising three primary modules, 1) similarity, 2) domain adaptation, and 3) active learning." 
}, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-24", "text": "The first two modules use unlabeled data from the new collection while the optional third module helps in creating labeled data for enhanced classification performance." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-25", "text": "These modules are explained below." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-26", "text": "----------------------------------" }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-27", "text": "**SODA FEATURES**" }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-28", "text": "----------------------------------" }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-29", "text": "**SIMILARITY**" }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-30", "text": "In social media analytics, especially for sentiment categorization, there exist numerous collections about different products or services where labeled data is available and thus can be used to adapt to a new unlabeled collection." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-31", "text": "Given a target collection, the key question is to identify the best possible source collection to adapt from." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-32", "text": "The similarity module in SODA identifies the best adaptable source collection based on the similarity between the source and target collections." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-33", "text": "This is based on the observations from existing literature (Bhatt et al., 2015; Blitzer et al., 2007) which suggest that if the source and target collections are similar, the adaptation performance tends to be better than if the two collections are dissimilar." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-34", "text": "The similarity module in SODA is capable of computing different kinds of lexical, syntactic, and semantic similarities between unlabeled target and labeled source collections." 
}, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-35", "text": "For this demonstration on sentiment categorization from social media data, it measures cosine similarity between the comments in each collection and computes sim as the similarity score." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-36", "text": "----------------------------------" }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-37", "text": "**DOMAIN ADAPTATION**" }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-38", "text": "The heart of SODA is the adaptation module that works on two principles, generalization and adaptation." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-39", "text": "During generalization, it learns a shared common representation (Blitzer et al., 2007; Ji et al., 2011; Pan et al., 2010) which minimizes the divergence between two collections." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-40", "text": "We leverage the widely used structural correspondence learning (SCL) approach (Blitzer et al., 2007) to compute shared representations." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-41", "text": "The idea here is that a model learned on the shared feature representation using labeled data from the source collection will also generalize well on the target collection." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-42", "text": "Towards this, we learn a model (C_S) on the shared feature representation from the source collection, referred to as \"source classifier\"." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-44", "text": "All instances in P_u which are predicted with a confidence (\u03b1_1) higher than a predefined threshold (\u03b8_1) are moved to the pool of pseudo-labeled target instances, referred to as P_s." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-45", "text": "We now learn a target domain model C_T on P_s using the target-specific representation, referred to as \"target classifier\"." 
}, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-46", "text": "C_T captures a separate view of the target instances from the shared representation and hence brings in discriminating target-specific information which is useful for categorization in the target collection." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-47", "text": "For further adaptation, the source (C_S) and target (C_T) classifiers are combined in a weighted ensemble (E) with w_s and w_t as the corresponding weights, and we iterate over the remaining unlabeled instances in P_u." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-48", "text": "In each iteration, the ensemble processes the remaining instances and iteratively adds confidently predicted instances to P_s which are used to re-train/update C_T." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-49", "text": "This iterative process continues till all instances in P_u are confidently labeled or a maximum number of iterations is reached." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-50", "text": "Transfer occurs within the ensemble, where the source classifier progressively facilitates the learning of the target classifier by providing pseudo-labeled instances." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-51", "text": "The output of the algorithm is the final C_T and the updated weights w_s and w_t." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-52", "text": "The weights of the individual classifiers are updated as a function of the error (I(\u00b7)) and the similarity (sim) between the collections, which gradually shifts the emphasis from the source to the target classifier." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-53", "text": "Finally, the ensemble is used to predict labels for future unseen instances in the target collection." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-54", "text": "Algorithm 1 summarizes our approach (refer to (Bhatt et al., 2015) for more details)." 
}, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-55", "text": "----------------------------------" }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-56", "text": "**ACTIVE LEARNING**" }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-57", "text": "SODA also implements an active learning module to allow users to annotate a few selected informative comments from the target collection." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-58", "text": "These comments are selected using cross entropy difference (CED) (Axelrod et al., 2011) such that the difference from the source collection and the similarity to the target collection are maximized." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-59", "text": "It selects comment(s) from the target collection that have a low CED score, i.e. comments that have high entropy with respect to the source collection H_S(\u00b7) and low entropy with respect to the target collection H_T(\u00b7), as in Equation (1)." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-60", "text": "Note that this active learning module is optional and should be used when the adaptation performance with unlabeled instances is not satisfactory." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-61", "text": "More instances can be annotated in multiple rounds until satisfactory performance is achieved." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-62", "text": "These annotated instances are used to build a stronger target classifier for the ensemble based adaptation algorithm." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-63", "text": "----------------------------------" }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-64", "text": "**DESIGN AND INTERNALS**" }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-65", "text": "Figure 2(a) illustrates the interactive user interface (UI) of SODA where one can select a new target collection for the analysis task (i.e. sentiment categorization)." 
}, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-68", "text": "Figure 2(b) shows the performance report along with the predicted comments from the target collection." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-69", "text": "The user evaluates the adaptation performance on the unlabeled target collection by analyzing the predicted comments and decides whether to annotate additional comments in the target collection. (Figure 3: The effect of labeled comments on the performance while adapting from Coll-1 \u2192 Coll-6.)" }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-70", "text": "If yes, Figure 2(c) lists a few informative comments selected using the active learning module to seek annotations." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-71", "text": "One can mark these comments as positive, negative or neutral and subsequently adapt using these labeled instances from the target collection." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-72", "text": "Figure 2(b) also shows the adaptation performance with a few labeled instances in the target collection." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-73", "text": "One can continue annotating more instances in the target collection until satisfactory performance is achieved." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-74", "text": "For a more detailed demonstration, please refer to the video." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-75", "text": "The interactive UI of SODA is developed using the Ruby on Rails framework." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-76", "text": "All collections are managed in MySQL server." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-77", "text": "All three modules in SODA fetch data from the server and write the output back to the server." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-78", "text": "All modules work in real time enabling the system to be highly responsive to the user." 
}, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-79", "text": "The application is hosted on Amazon AWS as RESTful web services using Java Jersey (Tomcat server) that act as a bridge between the UI and the back end." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-80", "text": "----------------------------------" }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-81", "text": "**USER TRIAL & EXPERIMENTAL RESULTS**" }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-82", "text": "To evaluate the overall experience, a user trial was conducted where several business professionals provided feedback on SODA." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-83", "text": "The objective was to evaluate the overall usability, reduction in required effort, and the performance on the new target collections." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-84", "text": "The overall evaluation rated SODA 5 on usability and 4 for reduction in effort (1 being the worst & 5 being the best)." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-85", "text": "Table 1 reports the classification accuracy of SODA with a few labeled comments from the target collection (ranging from 0 to 100)." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-86", "text": "It also reports the performance of the in-domain classifier which is trained and tested on data from the same collection." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-87", "text": "Coll-1 to Coll-8 refer to collections pertaining to domains such as marketing & sales and Comcast support (demonstration video: https://www.youtube.com/watch?v=zKnP5QEHVAE). Figure 3 compares the effect of adding labeled comments in batches of 25 comments at a time." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-88", "text": "When there is no labeled data in the target collection, the in-domain classifier cannot be applied, while SODA still yields good classification accuracy." 
}, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-89", "text": "Moreover, SODA consistently performs better than the in-domain classifier with the same amount of labeled data." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-90", "text": "We also evaluated the performance of the domain adaptation (DA) module of SODA on the Amazon review dataset (Blitzer et al., 2007), which is a benchmark dataset for sentiment categorization." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-91", "text": "It has 4 domains, namely, books(B), dvds(D), electronics(E), and kitchen(K), each with 2000 reviews divided equally into positive and negative reviews." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-92", "text": "Table 2 shows that the DA module of SODA outperforms 1) a widely used domain adaptation technique, namely, structural correspondence learning (SCL) (Blitzer et al., 2007; Blitzer et al., 2006), 2) the baseline (BL) where a classifier trained on one domain is applied on another domain, and 3) the in-domain classifier." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-93", "text": "Note that in Table 2, the performance of the DA module of SODA is reported when it does not use any labeled instances from the target domain." }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-94", "text": "----------------------------------" }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-95", "text": "**CONCLUSION**" }, { "sent_id": "604807137ee5d9a6775821496c6af5-C001-96", "text": "We demonstrated SODA for efficient microblog categorization on new social media collections with minimal (or sometimes no) need for manual annotations; thus, enabling faster and efficient analytics." 
} ], "y": { "@USE@": { "gold_contexts": [ [ "604807137ee5d9a6775821496c6af5-C001-33" ], [ "604807137ee5d9a6775821496c6af5-C001-39" ], [ "604807137ee5d9a6775821496c6af5-C001-40" ], [ "604807137ee5d9a6775821496c6af5-C001-90" ] ], "cite_sentences": [ "604807137ee5d9a6775821496c6af5-C001-33", "604807137ee5d9a6775821496c6af5-C001-39", "604807137ee5d9a6775821496c6af5-C001-40", "604807137ee5d9a6775821496c6af5-C001-90" ] }, "@DIF@": { "gold_contexts": [ [ "604807137ee5d9a6775821496c6af5-C001-92" ] ], "cite_sentences": [ "604807137ee5d9a6775821496c6af5-C001-92" ] } } }, "ABC_f6a35ed1ec0c01d3e9faa1ec3d8478_39": { "x": [ { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-2", "text": "Chatbots are a rapidly expanding application of dialogue systems, with companies switching to bot services for customer support, and new applications for users interested in casual conversation." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-3", "text": "One style of casual conversation is argument; many people love nothing more than a good argument." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-4", "text": "Moreover, there are a number of existing corpora of argumentative dialogues, annotated for agreement and disagreement, stance, sarcasm and argument quality." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-5", "text": "This paper introduces Debbie, a novel arguing bot that selects arguments from conversational corpora, and aims to use them appropriately in context." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-6", "text": "We present an initial working prototype of Debbie, with some preliminary evaluation, and describe future work." 
}, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-7", "text": "----------------------------------" }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-9", "text": "A chatbot or a conversational agent is a computer program that can converse with humans, via speech or text, with the goal of conducting a natural conversation, hopefully indistinguishable from a real human-to-human interaction." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-10", "text": "Chatbots are gaining momentum as more companies are switching to bot services for customer care, but there is also an opportunity for different types of casual conversation." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-11", "text": "Conversational agents can be broadly classified into retrieval-based and generative models." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-12", "text": "Retrieval-based methods have a repository of predefined responses and a mechanism to pick an appropriate response based on user input." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-13", "text": "They, therefore, can't generate a completely new response." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-14", "text": "Such methods are commonly used for \"help system\" chatbots, that target a predefined set of FAQs and responses." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-15", "text": "Another strategy is to use rule-based expression matching, and templated response generation, as in ELIZA or JULIA [19, 7, 10] ." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-16", "text": "Most existing chatbots are task-oriented and their evaluation is based on the successful accomplishment of the task." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-17", "text": "There are many websites dedicated to debating controversial topics, and data from them have been structured and labeled in previous work." 
}, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-18", "text": "For example, we have access to trained models for labeling these argumentative conversations with attributes such as agreement, disagreement, stance, sarcasm, factual vs. feeling arguments, argument quality and argument facets." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-19", "text": "This data provides us with a rich source of conversational data for mining argumentative responses." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-20", "text": "We build on previous work in our lab on disagreement detection, classifying stance, identifying high quality arguments, measuring the properties and the persuasive effects of factual vs. emotional arguments, and clustering arguments into their facets or frames related to a particular topic [9, 1, 13, 18, 16, 12, 15] ." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-21", "text": "In this work, we present Debbie, a novel arguing bot, that uses retrieval from existing conversations in order to argue with users." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-22", "text": "Debbie's main aim is to keep the conversation going, by successfully producing arguments and counter-arguments that will keep the user talking about the topic." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-23", "text": "Our initial prototype of Debbie works with three topics: death penalty, gun control, gay marriage." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-24", "text": "This paper focuses on our basic investigations on the initial prototype." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-25", "text": "While we are aware of other retrieval based chatbot systems [6, 14, 3, 2] , Debbie is novel in that it is the first to deal with argument retrieval." 
}, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-26", "text": "----------------------------------" }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-27", "text": "**DATA**" }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-28", "text": "Social media conversations are a good source of argumentative data but many sentences either do not express an argument or cannot be understood out of context and hence cannot be used to build Debbie's response pool." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-29", "text": "Swanson et al.(2015) created a large corpus consisting of 109,074 posts on the topics gay marriage (GM, 22425 posts), gun control (GC, 38102 posts), death penalty (DP, 5283 posts) by combining the Internet Argument Corpus(IAC) [17] , with dialogues from online debate forums 1 [16] ." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-30", "text": "It includes topic annotations, response characterizations (4forums), and stance." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-31", "text": "They build an argument quality regressor to rate the argument quality (AQ) using a continuous slider ranging from hard (0.0) to easy to interpret (1.0)." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-32", "text": "The AQ score is intended to reflect how easily the speaker's argument can be understood from the sentence without any context." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-33", "text": "Easily understandable sentences are assumed to be prime candidates for Debbie's response pool." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-34", "text": "note that a threshold of predicted AQ >0.55 maintained both diversity and quality in the arguments [12] ." 
}, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-35", "text": "For example, the sentence The death penalty is also discriminatory in its application what i mean is that through out the world the death penalty is disproportionately used against disadvantaged people was given a score of 0.98." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-36", "text": "We started with the Argument Quality (AQ) regressor from [16] , which predicts a quality score for each sentence." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-37", "text": "The stance for these argument segments is obtained from IAC [1] ." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-38", "text": "We keep only stance bearing statements from the above dataset." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-39", "text": "had improved upon the AQ predictor from [16] , giving a much larger and diverse corpus [12] ." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-40", "text": "Since generating a cohesive dialogue is a challenging task, we first evaluated our prototype with hand labeled 2000 argument quality sentence pairs for the topic of death penalty obtained from [12] ." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-41", "text": "We tested our model for both appropriateness of responses and response times." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-42", "text": "Once we had a working system for death penalty, we added the best quality 250 arguments for gay marriage and gun control, each, from the corpus of [12] (This had 174405 arguments from gay marriage and 258763 for gun control)." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-43", "text": "----------------------------------" }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-44", "text": "**METHODOLOGY**" }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-45", "text": "Debbie has domain knowledge of three hot button topics -death penalty, gay marriage and gun control." 
}, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-46", "text": "We created a database of statements for and against each topic, which serves as the source for Debbie's views." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-47", "text": "The user picks a topic from a pool of topics and specifies his/her stance (for or against)." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-48", "text": "As the user provides an argument, Debbie uses a similarity algorithm to retrieve a ranked list of the most appropriate counter-arguments, i.e., arguments opposing the user's stance." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-49", "text": "To speed up this retrieval process, we pre-create clusters of the arguments present in our database (described in section 3)." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-50", "text": "The most appropriate counter-arguments are calculated based on a similarity score, which, in this case, was the UMBC STS score [8] ." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-51", "text": "The similarity algorithm takes as input two sentences and returns a real-valued score, which we use directly as the similarity between two argument statements." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-52", "text": "It uses a lexical similarity feature that combines LSA (Latent Semantic Similarity) word similarity, and WordNet knowledge, and can be run from a web API provided by the authors [8] ." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-53", "text": "Debbie's responses are stored for the duration of the chat." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-54", "text": "From the ranked list, the most appropriate response (having the highest similarity score), that has not already been used in the chat, is selected." 
}, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-55", "text": "Since we only have high quality arguments in our database, we expect Debbie's responses to be good in terms of grammatical correctness, on topic, etc." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-56", "text": "Another way of looking at appropriateness of a response is basically how adequate a retort it is to the user's view." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-57", "text": "Debbie sustains the debate until the user explicitly ends the chat." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-58", "text": "While there is a limited number of unique counter-arguments which Debbie can utilize, it would take the user substantial time to exhaust all possible responses." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-59", "text": "----------------------------------" }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-60", "text": "**GENERATING CLUSTERS**" }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-61", "text": "To create the clusters, we first generated a distance matrix of similarity scores for each topic and stance." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-62", "text": "A similarity score falls between 0 and 1." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-63", "text": "Using agglomerative clustering from scikit-learn, we created 15 clusters." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-64", "text": "We then identified the head of a cluster; the argument within each cluster, that best represents all of the statements within that cluster." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-65", "text": "We calculated this by finding the average distance for each statement in a cluster to all the statements in the cluster, and chose the one with the minimum average as the head." 
}, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-66", "text": "Hierarchical agglomerative clustering has been previously used for argument clustering by [4] ." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-67", "text": "----------------------------------" }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-68", "text": "**USING THE CLUSTERS**" }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-69", "text": "To speed up the response times by clustering, we compare the user's input to the head of each cluster." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-70", "text": "We find the cluster whose head has the highest similarity score and calculate the similarity score of each response within that cluster in order to return the most similar response." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-71", "text": "----------------------------------" }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-72", "text": "**FIG. 1 CHAT WHERE DEBBIE IS AGAINST THE DEATH PENALTY**" }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-73", "text": "We further optimized our algorithm by implementing a graph-based comparison method to find an acceptable cluster faster." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-74", "text": "We create a graph with the cluster heads as nodes." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-75", "text": "We start at a random head and find how similar it is to the input." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-76", "text": "If the similarity passes a high threshold of 0.9, we use the related cluster automatically." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-77", "text": "Otherwise, if the similarity is very high, say, above a threshold of 0.8, we eliminate connected edges where the similarity is very low, below a threshold of 0.5." 
}, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-78", "text": "Conversely, if the similarity is very low, say, below 0.5, we eliminate connected edges where the similarity is very high, above 0.8." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-79", "text": "In the case where we don't find a head which surpasses our high threshold, we continue to explore our graph until all the clusters have either been visited or eliminated from consideration." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-80", "text": "We then select the head with the highest similarity score." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-81", "text": "A sample conversation with Debbie is shown in Fig 1, where the user supports the death penalty and Debbie opposes it." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-82", "text": "For a start, we looked at methods for faster response retrieval and the quality of the responses." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-83", "text": "The most basic (baseline) method just finds the similarity score between the input and each possible response in our database, and returns the response with the highest similarity score." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-84", "text": "The second method is the simple clustering method and the third method is clustering with the graph method described in Section 3." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-85", "text": "Table 1 shows the average response times for each retrieval method." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-86", "text": "We arrived at these by testing for three sentences per stance per topic, each deliberately chosen such that they would access different clusters." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-87", "text": "As expected, both the cluster and the cluster graph methods perform faster than the baseline." 
}, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-88", "text": "The cluster graph method is faster than using just clusters in most cases." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-89", "text": "The exception was the \"for\" case of death penalty, which, we believe, can be attributed to the significantly larger size of the cluster that is accessed by the cluster graph, triggering a greater number of computations." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-90", "text": "Given that our database only has high quality arguments, the appropriateness of Debbie's responses are primarily dependent on the performance of the similarity algorithm." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-91", "text": "However, exchange of only high quality arguments between the system and the user, with minimal repetition (none from Debbie) hampers the natural flow of the conversation." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-92", "text": "----------------------------------" }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-93", "text": "**FUTURE WORK**" }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-94", "text": "The prototype we are proposing represents our pilot work with Debbie, an argumentative chatbot." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-95", "text": "Our work in this domain is still in progress and we have a lot of future work planned based on our preliminary observations." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-96", "text": "We acknowledge the fact that we must migrate Debbie away from the assumption that the argument as a whole will consist of argumentatively sound statements." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-97", "text": "In our evaluation, we observed a high usage of utterances such as You're just wrong. and I don't think so. from the user." 
}, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-98", "text": "These statements have low argumentative quality, or, are completely off-topic." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-99", "text": "Hence, responding to them with a high quality argument tended to sound unnatural and inappropriate." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-100", "text": "Therefore, in order to make the conversation sound less robotic and more natural, we must detect user utterances which are not argumentatively sound and respond accordingly." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-101", "text": "Lukin et al.(2017) talk about the role of personality in persuasion [9] and Bowden et al. (2016) have shown that adding a layer of stylistic variation to a dialogue is sufficient for representing personality between speakers [5] ." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-102", "text": "We intend to investigate how we can enhance the user's experience by entraining Debbie's personality with respect to the user's personality." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-103", "text": "While our initial results are promising, we need to improve Debbie's performance with regards to retrieval time." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-104", "text": "We can potentially do this by recursively employing the graph method within the clusterssimilar to hierarchical clusters." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-105", "text": "We also intend to explore alternative information retrieval methods such as indexing, to create a more balanced trade-off between appropriateness and response time." }, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-106", "text": "Debbie currently uses the UMBC STS for calculating similarity scores, which is known to not work very well for argumentative sentences [11] ." 
}, { "sent_id": "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-107", "text": "We intend to use a better argument similarity algorithm in the future." } ], "y": { "@USE@": { "gold_contexts": [ [ "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-20" ], [ "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-36" ] ], "cite_sentences": [ "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-20", "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-36" ] }, "@BACK@": { "gold_contexts": [ [ "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-29" ] ], "cite_sentences": [ "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-29" ] }, "@EXT@": { "gold_contexts": [ [ "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-39" ] ], "cite_sentences": [ "f6a35ed1ec0c01d3e9faa1ec3d8478-C001-39" ] } } }, "ABC_aebac57baf260be18945feba38d6a1_39": { "x": [ { "sent_id": "aebac57baf260be18945feba38d6a1-C001-75", "text": "----------------------------------" }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-76", "text": "**CONCLUSIONS**" }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-2", "text": "Splitting compound words has proved to be useful in areas such as Machine Translation, Speech Recognition or Information Retrieval (IR)." }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-3", "text": "Furthermore, real-time IR systems (such as search engines) need to cope with noisy data, as user queries are sometimes written quickly and submitted without review." }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-4", "text": "In this paper we apply a state-of-the-art procedure for German decompounding to other compounding languages, and we show that it is possible to have a single decompounding model that is applicable across languages." 
}, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-5", "text": "----------------------------------" }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-7", "text": "Compounding languages (Krott, 1999) , such as German, Dutch, Danish, Norwegian, Swedish, Greek or Finnish, allow the generation of complex words by merging together simpler ones." }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-8", "text": "So, for instance, the flower bouquet can be expressed in German as Blumenstr\u00e4u\u00dfe, made up of Blumen (flower) and str\u00e4u\u00dfe (bouquet), and in Finnish as kukkakimppu, from kukka (flower) and kimppu (bunch, collection)." }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-9", "text": "For many language processing tools that rely on lexicons or language models it is very useful to be able to decompose compounds to increase their coverage and reduce out-of-vocabulary terms." }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-31", "text": "Each compound was annotated by two human judges who had received the previous instructions on when to consider that a keyword is a compound." }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-52", "text": "A combination of all these metrics into a learning model is able to attain a high recall." }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-10", "text": "Decompounders have been used successfully in Information Retrieval (Braschler and Ripplinger, 2004) , Machine Translation (Brown, 2002; Koehn and Knight, 2003) and Speech Recognition (Adda-Decker et al., 2000) ." }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-11", "text": "The Cross Language Evaluation Forum (CLEF) competitions have shown that very simple approaches can produce big gains in Cross Language Information Retrieval (CLIR) for German and Dutch (Monz and de Rijke, 2001 ) and for Finnish (Adafre et al., 2004) ." 
}, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-12", "text": "When working with web data, which has not necessarily been reviewed for correctness, many of the words are more difficult to analyze than when working with standard texts." }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-13", "text": "There are more words with spelling mistakes, and many texts mix words from different languages." }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-14", "text": "This problem exists to a larger degree when handling user queries: they are written quickly, not paying attention to mistakes." }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-15", "text": "However, being able to identify that achzigerjahre should be decompounded as achzig+jahre (where achzig is a misspelled variation of achtzig) is still useful in obtaining some meaning from the user query and in helping the spelling correction system." }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-16", "text": "This paper evaluates a state-of-the-art procedure for German splitting (Alfonseca et al., 2008) , robust enough to handle query data, on different languages, and shows that it is possible to have a single decompounding model that can be applied to all the languages under study." }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-17", "text": "----------------------------------" }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-18", "text": "**PROBLEM DEFINITION AND EVALUATION SETTINGS**" }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-19", "text": "Any set of query keywords contains a large amount of noisy data, such as words in foreign languages or misspelled words." }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-20", "text": "In order to be robust enough to handle this kind of corpus, we require the following for a decompounder: first, obviously, compounds should be split, and non-compounds should be left untouched." 
}, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-21", "text": "This also applies if they are misspelled." }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-22", "text": "Unknown words or words involving a part in a foreign language are split if there is a plausible interpretation of them being a compound word." }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-23", "text": "An example is Turingmaschine (Turing machine) in German, where Turing is an English word." }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-24", "text": "Finally, words that are not really grammatical compounds, but due to the user forgetting to input the blankspace between the words (like desktopcomputer) are split." }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-25", "text": "For the evaluation, we have built and manually annotated gold standard sets for German, Dutch, Danish, Norwegian, Swedish and Finnish from fully anonymized search query logs." }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-26", "text": "Because people do not use capitalization consistently when writing queries, all the query logs are lowercased." }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-27", "text": "By randomly sampling keywords we would get few compounds (as their frequency is small compared to that of non-compounds), so we have proceeded in the following way to ensure that the gold-standards contain a substantial amount of compounds: we started by building a very naive decompounder that splits a word in several parts using a frequency-based compound splitting method (Koehn and Knight, 2003) ." }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-28", "text": "Using this procedure, we obtain two random samples with possibly repeated words: one with words that are considered non-compounds, and the other with words that are considered compounds by this naive approach." 
}, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-29", "text": "Next, we removed all the duplicates from the previous list, and we had them annotated manually as compounds or non-compounds, including the correct splittings." }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-30", "text": "The sizes of the final training sets vary between 2,000 and 3,600 words depending on the language." }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-32", "text": "For all the languages considered, exactly one of the two judges was a native speaker living in a country where it is the official language 1 . Table 1 shows the percentage of agreement in classifying words as compounds or non-compounds (Compound Classification Agreement, CCA) for each language and the Kappa score (Carletta, 1996) obtained from it, and the percentage of words for which also the decomposition provided was identical (Decompounding Agreement, DA)." }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-33", "text": "The most common source of disagreement were long words that could be split into two or more parts." }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-34", "text": "The evaluation is done using the metrics precision, recall and accuracy, defined in the following way (Koehn and Knight, 2003 ):" }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-35", "text": "\u2022 Correct splits: no. of compounds that are split correctly." }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-36", "text": "\u2022 Correct non-splits: no. non-compounds that are not split." }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-37", "text": "\u2022 Wrong non-splits: no. of compounds and are not split." }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-38", "text": "\u2022 Wrong faulty splits: no. of compounds that are incorrectly split." }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-39", "text": "\u2022 Wrong splits: no. of non-compounds that are split." 
}, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-40", "text": "P recision = correct splits correct splits + wrong faulty splits + wrong splits Recall = correct splits correct splits + wrong faulty splits + wrong non-splits Accuracy = correct splits correct splits + wrong splits" }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-41", "text": "----------------------------------" }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-42", "text": "**COMBINING CORPUS-BASED FEATURES**" }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-43", "text": "Most approaches for decompounding can be considered as having this general structure: given a word w, calculate every possible way of splitting w in one or more parts, and score those parts according to some weighting function." }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-44", "text": "If the highest scoring splitting contains just one part, it means that w is not a compound." }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-45", "text": "For the first step (calculating every possible splitting), it is common to take into account that modifiers inside compound words sometimes need linking morphemes." }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-46", "text": "Table 2 lists the ones used in our system (Langer, 1998; Marek, 2006; Krott, 1999 Concerning the second step, there is some work that uses, for scoring, additional information such as rules for cognate recognition (Brown, 2002) or sentence-aligned parallel corpora and a translation model, as in the full system described by Koehn and Knight (2003) ." 
}, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-47", "text": "When those resources are not available, the most common methods used for compound splitting are using features such as the geometric mean of the frequencies of compound parts in a corpus, as in Koehn and Knight (2003) 's back-off method, or learning a language model from a corpus and estimating the probability of each sequence of possible compound parts (Schiller, 2005; Marek, 2006) ." }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-48", "text": "While these methods are useful for several applications, such as CLIR and MT, they have known weaknesses, such as preferring a decomposition if a compound part happens to be very frequent by chance, in the case of the frequency-based method, or the preference of decompositions with the least possible number of parts, in the case of the probability-based method." }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-49", "text": "Alfonseca et al. (2008) describe an integration of the previous methods, together with the Mutual Information and additional features obtained from web anchor texts to train a supervised German decompounder that outperforms the previous methods used as standalone." }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-50", "text": "The geometric mean of the frequencies of compound parts and the probability estimated from the language model usually attain a high recall, given they are based on unigram features which are easy to collect, but they have some weaknesses, as mentioned above." }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-51", "text": "On the other hand, while Mutual Information is a much more precise metric, it is less likely to have evidence about every single possible pair of compound parts from a corpus, so it suffers from low recall." 
}, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-53", "text": "An ablation study, reported in that paper, indicated that the contribution of the web anchor texts is minimal, so in this study we have just kept the other three metrics." }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-54", "text": "Table 3 shows the results reported for Ger- man, training (i.e. counting frequencies and learning the language model) on the query keywords, and running a 10-fold cross validation of a SVM with a polynomial kernel using the German gold-standard." }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-55", "text": "The supervised system improves over the single unsupervised metrics, attaining simultaneously good recall and precision metrics." }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-56", "text": "----------------------------------" }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-57", "text": "**EXPERIMENTS AND EVALUATION**" }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-58", "text": "The first motivation of this work is to test whether the results reported for German are easy to reproduce in other languages." }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-59", "text": "The results, shown in Table 4 , are very similar across languages, having precision and recall values over 80% for most languages." }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-60", "text": "A notable exception is Dutch, for which the inter-judge agreement was the highest, so we expected the set of words to be easier to classify." }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-61", "text": "An analysis of the errors reported in the 10-fold crossvalidation indicates that most errors in Dutch were wrong non-splits (in 147 cases) and wrong splits (in 139 cases), with wrong faulty splits happening only in 20 occasions." 
}, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-62", "text": "Many of the wrong splits are location names and trademarks, like youtube, piratebay or smallville." }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-63", "text": "While the supervised model gives much better results than the unsupervised ones, it still requires the construction of a gold-standard to train from, which is usually costly." }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-64", "text": "Therefore, we ran another experiment to check whether the models trained on some languages are applicable to other languages." }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-65", "text": "Table 5 shows the results obtained in this case, the last column indicating the results when the model is trained on the training instances from all the other languages together." }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-66", "text": "For each row, the highest value and those which are inside its 95% confidence interval are highlighted." }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-67", "text": "Interestingly, apart from a few exceptions, the results are rather good for all pairs of training and test languages." }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-68", "text": "Thus, the use of features like frequencies, probabilities or mutual information of compound parts is truly language-independent, and the models learned from one language can safely be applied for decompounding a different language without the need to annotate a gold-standard for it." }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-69", "text": "Still, some trends in the results can be observed: training with the Danish corpus produced the best results in terms of recall for all the languages, but recall for Danish still improved when we trained on data from all languages." 
}, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-70", "text": "We believe that this indicates that the Danish dataset contains items with a more varied set of feature combinations, so that the models trained on it have good coverage of different kinds of compounds, but models trained on other languages are not able to identify many of the compounds in the Danish dataset." }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-71", "text": "Concerning precision, training with either the Norwegian or the Finnish data produced very good results for most languages." }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-72", "text": "This is consistent with the monolingual experiments (see Table 4) in which these languages had the best results." }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-73", "text": "We believe these trends are probably due to the quality of the training data." }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-74", "text": "Interestingly, the size of the training data is not so relevant, as most of the best results are not located in the last column of the table." }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-77", "text": "This paper shows that a combination of several corpus-based metrics for decompounding, previously applied to German with large improvements over other state-of-the-art systems, is also useful for other compounding languages." }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-78", "text": "More interestingly, models learned from a gold-standard created for one language can be applied to other languages, sometimes producing better results than when a model is trained and tested on the same language." 
}, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-79", "text": "This should alleviate the fact that the proposed system is supervised, as one would only need to create a gold-standard in a single language in order to train a generic decompounder, thus facilitating the availability of decompounders for smaller languages like Faroese." }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-80", "text": "For future work, we plan to investigate more deeply how the quality of the data affects the results, with a more detailed error analysis." }, { "sent_id": "aebac57baf260be18945feba38d6a1-C001-81", "text": "Other open lines include exploring the addition of new features to the trained models." } ], "y": { "@BACK@": { "gold_contexts": [ [ "aebac57baf260be18945feba38d6a1-C001-10" ] ], "cite_sentences": [ "aebac57baf260be18945feba38d6a1-C001-10" ] }, "@USE@": { "gold_contexts": [ [ "aebac57baf260be18945feba38d6a1-C001-27" ], [ "aebac57baf260be18945feba38d6a1-C001-34" ], [ "aebac57baf260be18945feba38d6a1-C001-46" ], [ "aebac57baf260be18945feba38d6a1-C001-47" ] ], "cite_sentences": [ "aebac57baf260be18945feba38d6a1-C001-27", "aebac57baf260be18945feba38d6a1-C001-34", "aebac57baf260be18945feba38d6a1-C001-46", "aebac57baf260be18945feba38d6a1-C001-47" ] } } }, "ABC_9bab4741e6b9f132c2851bae3a3cf4_39": { "x": [ { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-20", "text": "For example, \"" }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-2", "text": "The selection of features is critical in providing discriminative information for classifiers in Word Sense Disambiguation (WSD)." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-3", "text": "Uninformative features will degrade the performance of classifiers." 
}, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-4", "text": "Based on the strong evidence that an ambiguous word expresses a unique sense in a given collocation, this paper reports our experiments on automatic WSD using collocations as local features, based on the corpus extracted from People's Daily News (PDN) as well as the standard SENSEVAL-3 data set." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-5", "text": "Using the Na\u00efve Bayes classifier as our core algorithm, we have implemented a classifier using a feature set combining both local collocation features and topical features." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-6", "text": "The average precision on the PDN corpus shows a 3.2% improvement over the 81.5% of the baseline system, in which collocation features are not considered." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-7", "text": "For the SENSEVAL-3 data, we have reached a precision rate of 37.6% by integrating collocation features with contextual features, achieving a 37% improvement over the 26.7% precision of the baseline system." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-8", "text": "Our experiments have shown that collocation features can be used to reduce the size of the human-tagged corpus." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-9", "text": "----------------------------------" }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-11", "text": "WSD tries to resolve lexical ambiguity, which refers to the fact that a word may have multiple meanings, such as the word \"walk\" in \"Walk or Bike to school\" and \"BBC Education Walk Through Time\", or the Chinese word \" \" in \" \" (\"local government\") and \" \" (\"He is also partly right\")." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-12", "text": "WSD tries to automatically assign an appropriate sense to an occurrence of a word in a given context." 
}, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-13", "text": "Various approaches have been proposed to deal with the word sense disambiguation problem, including rule-based approaches, knowledge- or dictionary-based approaches, corpus-based approaches, and hybrid approaches." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-14", "text": "Among these approaches, the supervised corpus-based approach has been applied and discussed by many researchers ([2][3][4][5][6][7][8])." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-15", "text": "According to [1], corpus-based supervised machine learning methods are the most successful approaches to WSD; in these methods, contextual features are mainly used to distinguish ambiguous words." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-16", "text": "However, word occurrences in the context are too diverse to capture the right pattern, which means that the dimensionality of contextual features will be very large when all words in the training samples are used for WSD [14]." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-17", "text": "Certain uninformative features will weaken the discriminative power of a classifier, resulting in a lower precision rate." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-18", "text": "To narrow down the context, we propose to use collocations as contextual information, as defined in Section 3.1.2." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-19", "text": "It is generally understood that the sense of an ambiguous word is unique in a given collocation [19]." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-21", "text": "\" means \"burden\" but not \"baggage\" when it appears in the collocation \" \" (\"burden of thought\")." 
}, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-22", "text": "In this paper, we apply a classifier that combines the local features of collocations containing the target word with other contextual features to discriminate the ambiguous words." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-23", "text": "The intuition is that when the target context captures a collocation, the influence of other dimensions of contextual words can be reduced or even ignored." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-24", "text": "For example, in the expression \" \" (\"terrorists burned down the gene laboratory\"), the influence of the contextual word \" \" (\"gene\") on the target word \" \" should be reduced, because \" \" is a collocation whereas \" \" and \" \" are not collocations even though they do co-occur." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-25", "text": "Our intention is not to replace contextual information by collocations entirely." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-26", "text": "Rather, we would like to use collocations as an additional feature in WSD." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-27", "text": "We still make use of other contextual features for the following reasons." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-28", "text": "Firstly, contextual information has proven to be effective for WSD in previous research." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-29", "text": "Secondly, collocations may be independent of the training corpus, and a sentence under consideration may not contain any collocation." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-30", "text": "Thirdly, they help to resolve tie cases such as \" \" (\"terrorists' gene checking\"), where \" \" means \"human\" when presented in the collocation \" \" but is a particle in the collocation \" \"." 
}, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-31", "text": "The primary purpose of using collocations in WSD is to improve the precision rate without any sacrifice in recall." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-32", "text": "We also want to investigate whether the use of collocations as an additional feature can reduce the size of the hand-tagged sense corpus." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-33", "text": "The rest of this paper is organized as follows." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-34", "text": "Section 2 summarizes the existing Word Sense Disambiguation techniques based on annotated corpora." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-35", "text": "Section 3 describes the classifier and the features in our proposed WSD approach." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-36", "text": "Section 4 describes the experiments and the analysis of our results." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-37", "text": "Section 5 is the conclusion." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-38", "text": "----------------------------------" }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-39", "text": "**RELATED WORK**" }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-40", "text": "Approaches to automating word sense disambiguation based on annotated corpora have been proposed." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-41", "text": "Examples of supervised learning methods for WSD appear in [2][3][4], [7][8]." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-42", "text": "The learning algorithms applied include decision trees, decision lists [15], neural networks [7], na\u00efve Bayesian learning ([5], [11]) and maximum entropy [10]." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-43", "text": "Among these learning methods, the most important issue is what features will be used to construct the classifier." 
}, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-44", "text": "It is common in WSD to use contextual information that can be found in the neighborhood of the ambiguous word in training data ([6], [16][17][18])." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-45", "text": "It is generally true that when words are used in the same sense, they have similar context and co-occurrence information [13]." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-46", "text": "It is also generally true that the nearby context words of an ambiguous word give more effective patterns and feature values than those far from it [12]." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-47", "text": "The existing methods consider feature selection for context representation including both local and topical features, where local features refer to the information pertaining only to the given context and topical features are statistically obtained from a training corpus." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-48", "text": "Most of the recent works for English, including [7] and [8], combine both local and topical information in order to improve their performance." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-49", "text": "An interesting study on feature selection for Chinese [10] has considered topical features as well as local collocational, syntactic, and semantic features using the maximum entropy model." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-50", "text": "In Dang's [10] work, collocational features refer to the local PoS information and bi-gram co-occurrences of words within 2 positions of the ambiguous word." 
}, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-51", "text": "A useful result from this work, based on the tagged People's Daily News (about one million words), shows that adding more features from richer levels of linguistic information, such as PoS tagging, yielded no significant improvement (less than 1%) over using only the bi-gram co-occurrence information." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-52", "text": "Another similar study for Chinese [11] is based on the Na\u00efve Bayes classifier model, which takes into consideration PoS with position information and bi-gram templates in the local context." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-53", "text": "The system has a reported 60.40% in both precision and recall based on the SENSEVAL-3 Chinese training data." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-54", "text": "Even though statistically significant bi-gram co-occurrence information is used in both approaches, these bi-grams are not necessarily true collocations." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-55", "text": "For example, in the expression \" \", the bi-grams in their system are ( , , ," }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-56", "text": "Some bi-grams, such as \" \", may have a higher frequency but may introduce noise when considered as features in disambiguating the senses \"human| \" and \"symbol| \", as in the example case of \" \"." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-57", "text": "In our system, we do not rely on co-occurrence information." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-58", "text": "Instead, we utilize true collocation information ( , ) which falls within the window size of (-5, +5) as features, and the sense of \"human| \" can be decided clearly using these features." 
}, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-59", "text": "The collocation information is a pre-prepared collocation list obtained from a collocation extraction system and verified with syntactic and semantic methods ([21], [24])." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-60", "text": "Yarowsky [9] used the one sense per collocation property as an essential ingredient for an unsupervised Word-Sense Disambiguation algorithm, performing bootstrapping toward a more general, high-recall disambiguation." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-61", "text": "A few recent research works have begun to pay attention to collocation features in WSD." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-62", "text": "Domminic [19] used three different methods, called the bilingual method, the collocation method and the UMLS (Unified Medical Language System) relation-based method, to disambiguate English and German medical documents in an unsupervised manner." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-63", "text": "As expected, the collocation method achieved a good precision of around 79% in English and 82% in German, but a very low recall of 3% in English and 1% in German." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-64", "text": "The low recall is due to the nature of UMLS where many collocations would almost never occur in natural text." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-65", "text": "To avoid this problem, we combine the contextual features in the target context with the pre-prepared collocation list to build our classifier." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-66", "text": "As stated earlier, an important issue is what features will be used to construct the classifier in WSD." 
}, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-67", "text": "Earlier research has shown that using lexical statistical information, such as bi-gram co-occurrences, was sufficient to produce close to the best results [10] for Chinese WSD." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-68", "text": "Instead of including bi-gram features as part of the discrimination features, in our system we consider both topical contextual features and local collocation features." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-69", "text": "These features are extracted from the 60 MB human sense-tagged People's Daily News with segmentation information." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-70", "text": "----------------------------------" }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-71", "text": "**TOPICAL CONTEXTUAL FEATURES**" }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-72", "text": "Niu [11] showed in his experiments that the Na\u00efve Bayes classifier achieved the best disambiguation accuracy with a small topical context window size (< 10 words)." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-73", "text": "We follow their method and set the contextual window size to 10 in our system." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-74", "text": "Each of the Chinese words, except the stop words, inside the window range will be considered as one topical feature." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-75", "text": "Their frequencies are calculated over the entire corpus with respect to each sense of an ambiguous word w. The sense definitions are obtained from HowNet." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-76", "text": "----------------------------------" }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-77", "text": "**LOCAL COLLOCATION FEATURES**" }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-78", "text": "We chose collocations as the local features." 
}, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-79", "text": "A collocation is a recurrent and conventional fixed expression of words which holds syntactic and semantic relations [21]." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-80", "text": "Collocations can be classified as fully fixed collocations, fixed collocations, strong collocations and loose collocations." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-81", "text": "Fixed collocations mean that the appearance of one word implies the co-occurrence of another, such as \" \" (\"burden of history\"), while strong collocations allow very limited substitution of the components, for example, \"" }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-82", "text": "\" (\"local college\"), or \" \" (\"local university\")." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-83", "text": "The sense of ambiguous words can be uniquely determined in these two types of collocations; these are therefore the collocation types applied in our system." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-84", "text": "The sources of the collocations will be explained in Section 4.1." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-85", "text": "In both Niu's [11] and Dang's [10] work, topical features as well as so-called collocational features were used." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-86", "text": "However, as discussed in Section 2, they both used bi-gram co-occurrences as the additional local features." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-87", "text": "Bi-gram co-occurrences, however, only indicate statistical significance, which may not actually satisfy the conceptual definition of collocations." 
}, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-88", "text": "Thus, instead of using co-occurrences of bi-grams, we take the true bi-gram collocations extracted from our system and use these data to compare with bi-gram co-occurrences to test the usefulness of collocations for WSD." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-89", "text": "The local features in our system make use of the collocations using the template (w_i, w) within a window size of ten (where i = \u00b1 5)." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-90", "text": "For example, \" \" (\"Government departments and local government commanded that\") fits the bi-gram collocation template (w, w_1) with the value of ( )." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-91", "text": "During the training and testing processes, the frequency value of a collocation feature is increased by 1 if a collocation containing the ambiguous word occurs in a sentence." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-92", "text": "For a good analysis of collocation features, we have also developed one system using only adjacent bi-grams as local features (named System A) and another using collocations as local features (named System B)." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-93", "text": "----------------------------------" }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-94", "text": "**THE COLLOCATION CLASSIFIER**" }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-95", "text": "We consider all the features in the feature set F = F_t \u222a F_l = {f_1, f_2, \u2026, f_m} as independent, where F_t stands for the topical contextual feature set, and F_l stands for the local collocation feature set." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-96", "text": "For an ambiguous word w with n senses, let S_w = {w_s1, w_s2, \u2026, w_sn} be the sense set." 
}, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-97", "text": "For the contextual features, we directly apply the Na\u00efve Bayes algorithm using Add-Lambda Smoothing to handle unknown words:" }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-98", "text": "----------------------------------" }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-99", "text": "**EXPERIMENTAL RESULTS**" }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-100", "text": "We have designed a set of experiments to compare the classifier with and without the collocation features." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-101", "text": "In System A, the classifier is built with local bi-gram features and topical contextual features." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-102", "text": "The classifier in System B is constructed by combining the local collocation features with topical features." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-103", "text": "----------------------------------" }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-104", "text": "**PREPARATION OF THE DATA SET**" }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-105", "text": "We have selected 20 ambiguous words from nouns and verbs, with 4 senses on average." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-106", "text": "The sense definition is taken from HowNet [22]." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-107", "text": "To show the effect of the algorithm, we try to choose words with a high degree of ambiguity, high frequency of use [23], and high frequency of forming collocations." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-108", "text": "The selection of these 20 words is not completely random, although within each criterion class we do try to pick words randomly." 
}, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-109", "text": "Based on the 20 words, we extracted 28,000 sentences from the 60 MB People's Daily News with segmentation information as our training/test set, which was then manually sense-tagged." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-110", "text": "The collocation list is constructed from a combination of a digital collocation dictionary, the results returned by an automatic collocation extraction system [21], and a hand-collected set from the People's Daily News." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-111", "text": "As we stated earlier, the sense of ambiguous words in fixed collocations and strong collocations can be decided uniquely, although it is not unique in loose collocations." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-112", "text": "For example, the ambiguous word \" \" in the collocation \" \" may have either the sense of \"appearance| \" or \"reputation| \"." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-113", "text": "Therefore, when labeling the sense of collocations, we filter out the ones which cannot uniquely determine the sense of the ambiguous words inside." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-114", "text": "However, this does not mean that loose collocations have no contribution to WSD classification." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-115", "text": "We simply reduce their weight, compared with the fixed and strong collocations, when combining them with the contextual features." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-116", "text": "The sense and collocation distribution over the 20 words in the training examples can be found in Table 1." 
}, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-117", "text": "----------------------------------" }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-118", "text": "**THE EFFECT OF COLLOCATION FEATURES**" }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-119", "text": "For each word, we recorded 6 trials with average precision over six-fold cross-validation." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-120", "text": "The average precision for the six trials in Systems A and B can be found in Table 2 and Table 3." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-121", "text": "From Table 3, regarding precision, 16 words improved and 4 words remained the same in System B. The results from both systems confirm that collocation features do improve precision." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-122", "text": "Note that 4 words have the same precision in the two systems; these fall into two cases." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-123", "text": "In the first case, it can be seen that these words already have very high precision in System A (over 93%), which means that one sense dominates all the other senses." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-124", "text": "In this case, the additional collocation information is not necessary." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-125", "text": "In fact, when we checked the intermediate outputs, the scores of the candidate senses of the ambiguous words contained in the collocations did improve." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-126", "text": "Even so, this did not change the result." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-127", "text": "Secondly, no collocation appeared in the sentences that were tagged incorrectly in System A. This was confirmed when we checked the error files." 
}, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-128", "text": "For example, the word \" \" with the sense \" \" (\"closeness\") appeared in 4,492 of the total 4,885 examples (91.9%)." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-129", "text": "Meanwhile, 99% of the collocations in its collocation list have the same sense of \" \" (\"closeness\")." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-130", "text": "Only one collocation, \" \", has the sense of \" \" (\"power\")." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-131", "text": "Therefore, the collocation features improved the score of the sense \" \", which was already the highest one based on the contextual features." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-132", "text": "As can be seen from Table 3, the collocation features work well for sparse data." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-133", "text": "For example, the word \" \" in the training corpus has only one example with the sense \" \" (\"human\"), while the other 30 examples all have the sense \" \" (\"management\")." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-134", "text": "In this situation, the topical contextual features failed to identify the right sense for the only appearance of the sense \" \" (\"human\") in the training instance \" \u2026\"." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-135", "text": "However, it was correctly identified in System B because of the appearance of the collocation \" \"." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-136", "text": "To better show the effect of collocations on classifier accuracy for the WSD task, we also tested both systems on the SENSEVAL-3 data set; the results are recorded in Table 4." 
}, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-137", "text": "From the difference in the relative improvement on the two data sets, we can see that collocation features work well when the statistical model is not sufficiently built up, such as with a small corpus like SENSEVAL-3." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-138", "text": "Actually, in this case, the training examples appear in the corpus only once or twice, so the parameters estimated from such sparse training examples may not be accurate enough to predict the test examples; this convinces us that collocation features are effective in handling sparse training data, even for unknown words." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-139", "text": "Fig. 1 shows the precision comparison between Systems A and B on SENSEVAL-3." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-140", "text": "----------------------------------" }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-141", "text": "**THE EFFECT OF COLLOCATIONS ON THE SIZE OF TRAINING CORPUS NEEDED**" }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-142", "text": "Hwee [21] stated that a large-scale, human sense-tagged corpus is critical for a supervised learning approach to achieve broad coverage and high accuracy WSD." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-143", "text": "He conducted a thorough study on the effect of training examples on the accuracy of supervised corpus-based WSD." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-144", "text": "As the results showed, WSD accuracy continues to climb as the number of training examples increases." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-145", "text": "Similarly, we have tested Systems A and B with different sizes of training corpus based on the PDN corpus we prepared." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-146", "text": "Our experimental results, shown in Fig. 2, follow the same trend." 
}, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-147", "text": "The purpose we did the testing is that we hope to disclose the effect of collocations on the size of training corpus needed." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-148", "text": "From Fig 2, we can see by using the collocation features, the precision of the system B has increased slower along with the growth of training examples than the precision of the system A. The result is reasonable because with collocation feature, the statistical contextual information over the entire corpus becomes side effect." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-149", "text": "Actually, as can be seen from Fig. 2 , after using collocation features in the system B, even we use 1/6 corpus as training, the precision is still higher than we use 5/6 train corpus in the system A." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-150", "text": "----------------------------------" }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-151", "text": "**INVESTIGATION OF SENSE DISTRIBUTION ON THE EFFECT OF COLLOCATION FEATURES**" }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-152", "text": "To investigate the sense distribution on the effect of collocation features, we selected the ambiguous words with the number of sense varied from 2 to 6." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-153", "text": "In each level of the sense number, the words are selected randomly." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-154", "text": "Table 5 shows the effect of sense distribution on the effect of collocation features." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-155", "text": "From the table, we can see that the collocation features work well when the sense distribution is even for a particular ambiguous word under which case the classifier may get confused." 
}, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-156", "text": "We have conducted a set of experiments based on both the PDN corpus and SENSEVLA-3 data to set the best value of \u03b1 for the formula (4) described in Section 3.2." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-157", "text": "The best start value of \u03b1 is tested based on the precision rate which is shown in Fig. 3 ." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-158", "text": "It is shown from the experiment that \u03b1 takes the start value of 0.5 for both corpuses." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-159", "text": "This paper reports a corpus-based Word Sense Disambiguation approach for Chinese word using local collocation features and topical contextual features." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-160", "text": "Compared with the base-line systems in which a Na\u00efve Bayes classifier is constructed by combining the contextual features with the bi-gram features, the new system achieves 3% precision improvement in average in Peoples' Daily News corpus and 10% improvement in SENSEVAL-3 data set." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-161", "text": "Actually, it works very well when disambiguating the sense with sparse distribution over the entire corpus under which the statistic calculation prone to identify it incorrectly." }, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-162", "text": "In the same time, because disambiguating using collocation fea-tures does not need statistical calculation, it makes contribution to reduce the size of human tagged corpus needed which is critical and time consuming in corpus based approach." 
}, { "sent_id": "9bab4741e6b9f132c2851bae3a3cf4-C001-163", "text": "Because different types of collocations may play different roles in classifying the sense of an ambiguous word, we hope to extend this work by integrating collocations with different weight based on their types in the future, which may need a pre-processing job to categorize the collocations automatically." } ], "y": { "@BACK@": { "gold_contexts": [ [ "9bab4741e6b9f132c2851bae3a3cf4-C001-42" ], [ "9bab4741e6b9f132c2851bae3a3cf4-C001-49" ], [ "9bab4741e6b9f132c2851bae3a3cf4-C001-50" ], [ "9bab4741e6b9f132c2851bae3a3cf4-C001-67" ], [ "9bab4741e6b9f132c2851bae3a3cf4-C001-85" ] ], "cite_sentences": [ "9bab4741e6b9f132c2851bae3a3cf4-C001-42", "9bab4741e6b9f132c2851bae3a3cf4-C001-49", "9bab4741e6b9f132c2851bae3a3cf4-C001-50", "9bab4741e6b9f132c2851bae3a3cf4-C001-67", "9bab4741e6b9f132c2851bae3a3cf4-C001-85" ] }, "@DIF@": { "gold_contexts": [ [ "9bab4741e6b9f132c2851bae3a3cf4-C001-67", "9bab4741e6b9f132c2851bae3a3cf4-C001-68" ], [ "9bab4741e6b9f132c2851bae3a3cf4-C001-85", "9bab4741e6b9f132c2851bae3a3cf4-C001-86", "9bab4741e6b9f132c2851bae3a3cf4-C001-88" ] ], "cite_sentences": [ "9bab4741e6b9f132c2851bae3a3cf4-C001-67", "9bab4741e6b9f132c2851bae3a3cf4-C001-85" ] } } }, "ABC_4a093e0ca9a499c19e4721366d88f6_39": { "x": [ { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-33", "text": "**PRIOR WORK**" }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-2", "text": "Word embedding models have gained a lot of traction in the Natural Language Processing community, however, they suffer from unintended demographic biases." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-3", "text": "Most approaches to evaluate these biases rely on vector space based metrics like the Word Embedding Association Test (WEAT)." 
}, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-4", "text": "While these approaches offer great geometric insights into unintended biases in the embedding vector space, they fail to offer an interpretable meaning for how the embeddings could cause discrimination in downstream NLP applications." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-5", "text": "In this work, we present a transparent framework and metric for evaluating discrimination across protected groups with respect to their word embedding bias." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-6", "text": "Our metric (Relative Negative Sentiment Bias, RNSB) measures fairness in word embeddings via the relative negative sentiment associated with demographic identity terms from various protected groups." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-7", "text": "We show that our framework and metric enable useful analysis into the bias in word embeddings." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-8", "text": "----------------------------------" }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-10", "text": "Word embeddings have established themselves as an integral part of Natural Language Processing (NLP) applications." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-11", "text": "Unfortunately word embeddings have also introduced unintended biases that could cause downstream NLP systems to be unfair." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-12", "text": "Recent studies have shown that word embeddings exhibit unintended gender and stereotype biases inherent in the training corpus." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-13", "text": "Bias can be defined as an unfair expression of prejudice for or against a person, a group, or an idea." 
}, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-14", "text": "Bias is a broad term, which covers a range of problems particularly relevant in natural language systems such as, discriminatory gender bias (Bolukbasi et al., 2016a; Zhao et al., 2017) , bias against regionally accented speech (Najafian et al., 2016 (Najafian et al., , 2017 , personal or political view bias (Iyyer et al., 2014; Recasens et al., 2013) , and many other examples." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-15", "text": "In Figure 1 : 2-D PCA embeddings for positive/negative sentiment words and a set of national origin identity terms." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-16", "text": "Geometrically, it is difficult to parse how these embeddings can lead to discrimination." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-17", "text": "our work, we restrict our definition of bias to unequal distributions of negative sentiment among demographic identity terms in word embeddings." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-18", "text": "One could also look at unequal distributions of positive sentiment, but for this work we restrict ourselves to the negative case." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-19", "text": "Sentiment analysis makes up a large portion of current NLP systems." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-20", "text": "Therefore, preventing negative sentiment from mixing with sensitive attributes (i.e. race, gender, religion) in word embeddings is needed to prevent discrimination in ML models using the embeddings." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-21", "text": "As studied in (Packer et al., 2018) , unintentionally biased word embeddings can have adverse consequences when deployed in applications, such as movie sentiment analyzers or messaging apps." 
}, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-22", "text": "Negative sentiment can be unfairly entangled in the word embeddings, and detecting this unintended bias is a difficult problem." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-23", "text": "We need clear signals to evaluate which groups are discriminated against due to the bias in an embedding model." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-24", "text": "That way we can pinpoint where to mitigate those biases." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-25", "text": "To demonstrate this need for clear signals of bias in word embeddings, we look at Figure 1 ." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-26", "text": "Figure 1 shows a 2D word embedding projection of positive sentiment (green) and negative sentiment (red) words." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-27", "text": "It would be unfair for any given demographic identity word vector (blue) to be more semantically related to negative terms than the other identities." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-28", "text": "However, many identity terms exist closer to negative words than other identity terms in the vector space." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-29", "text": "This bias may affect a downstream ML model, but the vector space has no absolute interpretable meaning, especially when it comes to whether this embedding model will lead to a unfairly discriminative algorithm." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-30", "text": "Our framework enables transparent insights into word embedding bias by instead viewing the output of a simple logistic regression algorithm trained on an unbiased positive/negative word sentiment dataset initialized with biased word vectors." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-31", "text": "We use this framework to create a clear metric for unintended demographic bias in word embeddings." 
}, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-32", "text": "----------------------------------" }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-34", "text": "Researchers have found a variety of ways in which dangerous unintended bias can show up in NLP applications (Blodgett and O'Connor, 2017; Hovy and Spruit, 2016; Tatman, 2017) ." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-35", "text": "Mitigating such biases is a difficult problem, and researchers have created many ways to make fairer NLP applications." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-36", "text": "Much of the focus for mitigating unintended bias in NLP is either targeted at reducing gender stereotypes in text (Bolukbasi et al., 2016b,a; Zhao et al., 2017; Zhang et al., 2018) , or inequality of sentiment or toxicity for various protected groups (Caliskan-Islam et al., 2016; Bakarov, 2018; Dixon et al.; Garg et al., 2018; Kiritchenko and Mohammad, 2018) ." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-37", "text": "More specifically, word embeddings has been an area of focus for evaluating unintended bias." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-38", "text": "(Bolukbasi et al., 2016b ) defines a useful metric for identifying gender bias and (Caliskan-Islam et al., 2016) defines a metric called the WEAT score for evaluating unfair correlations with sentiment for various demographics in text." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-39", "text": "Unfortunately metrics like these leverage vector space arguments between only two identities at a time like man vs woman (Bolukbasi et al., 2016a) , or European American names vs. African American names (Caliskan-Islam et al., 2016) ." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-40", "text": "Though geometrically intuitive, these tests do not have a direct relation to discrimination in general." 
}, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-41", "text": "Our framework and RNSB metric enable a clear evaluation of discrimination with respect to word embedding bias for a whole class of demographics." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-42", "text": "----------------------------------" }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-43", "text": "**METHODS**" }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-44", "text": "We present our framework for understanding and evaluating unintentional demographic bias in word embeddings." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-45", "text": "We first describe the flow of our framework." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-46", "text": "Then, we address which datasets/models were chosen for our approach." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-47", "text": "Finally, we show how our framework can enable analysis and new metrics like RNSB." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-48", "text": "----------------------------------" }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-49", "text": "**FRAMEWORK**" }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-50", "text": "Figure 2: We isolate unintended bias to the word embeddings by training a logistic regression classifier on a unbiased positive/negative word sentiment dataset (initialized with the biased word embeddings)." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-51", "text": "We measure word embedding bias by analyzing the predicted probability of negative sentiment for identity terms." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-52", "text": "Our framework enables the evaluation of unintended bias in word embeddings through the results of negative sentiment predictions." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-53", "text": "Our framework has a simple layout." 
}, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-54", "text": "Figure 2 shows the flow of our system." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-55", "text": "We first use the embedding model we are trying to evaluate to initialize vectors for an unbiased positive/negative word sentiment dataset." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-56", "text": "Using this dataset, we train a logistic classification algorithm to predict the probability of any word being a negative sentiment word." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-57", "text": "After training, we take a set of neutral identity terms from a protected group (i.e. national origin) and predict the probability of negative sentiment for each word in the set." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-58", "text": "Neutral identity terms that are unfairly entangled with negative sentiment in the word embeddings will be classified like their neighboring sentiment words from the sentiment dataset." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-59", "text": "We leverage this set of negative sentiment probabilities to summarize unintended demographic bias using RNSB." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-60", "text": "----------------------------------" }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-61", "text": "**MODELS AND DATA**" }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-62", "text": "We evaluate three pretrained embedding models: GloVe (Pennington et al., 2014) , Word2vec (Mikolov et al., 2013 ) (trained on the large Google News corpus), and ConceptNet ." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-63", "text": "GloVe and Word2vec embeddings have been shown to contain unintended bias in (Bolukbasi et al., 2016a; Caliskan-Islam et al., 2016) ." 
}, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-64", "text": "ConceptNet has been shown to be less biased than these models (Speer, 2017) due to the mixture of curated corpora used for training." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-65", "text": "As part of our pipeline, we also use a labeled positive/negative sentiment training set (Hu and Liu, 2004) ." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-66", "text": "This dataset has been shown to be a trustworthy lexicon for negative and positive sentiment words (Pang et al., 2008; Liu, 2012; Wilson et al., 2005) ." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-67", "text": "We trust these labels to be unbiased so that we may isolate the unintended biases entering our system to the word embeddings." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-68", "text": "Finally, we use a simple logistic regression algorithm to predict negative sentiment." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-69", "text": "Although the choice of ML model can have an impact on fairness for sentiment applications as shown in (Kiritchenko and Mohammad, 2018) , we choose a simple ML model to limit the possible unintended biases introduced downstream from our word embeddings." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-70", "text": "----------------------------------" }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-71", "text": "**BIAS ANALYSIS: RNSB**" }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-72", "text": "We now present our metric for unintended demographic bias, RNSB." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-73", "text": "For gold standard labeled positive/negative sentiment words, (x i , y i ), in training set, S, where x i is a word vector from a possibly biased word embedding model, we find the minimizer, f * (x i ) = \u03c3(w T x i ), for the logistic loss, l, and learned weights, w." 
}, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-74", "text": "Then for a set, K = {k 1 , ..., k t }, of t demographic identity word vectors from a particular protected group (i.e. national origin, religion, etc.), we define a set, P , containing the predicted negative sentiment probability via minimizer, f * , normalized to be one probability mass." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-75", "text": "Thus, our metric, RN SB(P ), is defined as the KL divergence of P from U , where U is the uniform distribution for t elements." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-76", "text": "----------------------------------" }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-77", "text": "**RN SB(P ) = D KL (P U )**" }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-78", "text": "We choose our set of neutral identity terms based on the most populous demographics for each protected group." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-79", "text": "However, due to the simplicity of this method, one can easily adapt it to include identity terms that suit the application in need of analysis." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-80", "text": "Since neutral identity terms are inherently not associated with sentiment, it is unfair to have identity term with differing levels of negative sentiment." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-81", "text": "This type of discrimination can show up in many downstream sentiment analysis applications." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-82", "text": "Thus, we want no differences between negative sentiment predictions of various identity terms." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-83", "text": "Mathematically, this can be represented as a uniform distribution of negative sentiment probability for identity terms from a protected group." 
}, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-84", "text": "Our RNSB metric captures the distance, via KL divergence, between the current distribution of negative sentiment and the fair uniform distribution." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-85", "text": "So the more fair a word embedding model with respect to sentiment bias, the lower the RNSB metric." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-86", "text": "----------------------------------" }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-87", "text": "**RESULTS AND DISCUSSION**" }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-88", "text": "We evaluate our framework and metric on two cases studies: National Origin Discrimination and Religious Discrimination." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-89", "text": "For each case study, we create a set of the most frequent identity terms from the protected groups in the Wikipedia word corpus and analyze bias with respect to these terms via our framework." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-90", "text": "First, we compare the RNSB metric for 3 pretrained word embeddings, showing that our metric is consistent with other word embedding analysis like WEAT (Caliskan-Islam et al., 2016) ." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-91", "text": "We then show that our framework enables an insightful view into word embedding bias." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-92", "text": "----------------------------------" }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-93", "text": "**RNSB METRIC ON WORD EMBEDDINGS**" }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-94", "text": "We vary the word embeddings used in our framework and calculate the RNSB metric for each embedding." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-95", "text": "The results are displayed in Table 1 ." 
}, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-96", "text": "For both case studies, the bias is largest in GloVe, as shown by the largest RNSB metric." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-97", "text": "As mentioned earlier, ConceptNet is a state of the art model that mixes models like GloVe and Word2vec, creating fairer word embeddings." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-98", "text": "Through the RNSB metric, one can see that the unintended demographic bias of these word embeddings is an order of magnitude lower than GloVe or Word2vec." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-99", "text": "Although the RNSB metric is not directly comparable to WEAT scores, these results are still consistent with some of the bias predicted by (Caliskan-Islam et al., 2016) ." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-100", "text": "The WEAT score shows that word embeddings like Word2vec and GloVe are biased with respect to national origin because European-American names are more correlated with positive sentiment than AfricanAmerican names." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-101", "text": "RNSB captures the same types of biases, but has a clear and larger scope, measuring discrimination with respect to more than two demographics within a protected group." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-102", "text": "Table 1: Table showing our RNSB metric for various word embeddings on two case studies." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-103", "text": "Our metric effectively predicts the unintended demographic bias in the presented word embeddings with respect to negative sentiment." 
}, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-104", "text": "----------------------------------" }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-105", "text": "**ANALYZING UNINTENDED DEMOGRAPHIC BIAS IN WORD EMBEDDINGS**" }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-106", "text": "Using the probability distribution of negative sentiment for the identity terms in a protected group, we can gain insights into the relative risks for discrimination between various demographics." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-107", "text": "Figure 3 shows three histograms." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-108", "text": "The bottom histogram is the uniform distribution." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-109", "text": "As described earlier, zero unintended demographic bias with respect to our definition is achieved when all the identity terms within a protected group have equal negative sentiment." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-110", "text": "The top two histograms show the negative sentiment probability for each identity normalized across all terms to be a probability distribution." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-111", "text": "The left histogram is computed using the GloVe word embeddings, and the right histogram is computed using the fairer ConceptNet embeddings." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-112", "text": "One can see that certain demographics have very high negative sentiment predictions, while others have very low predictions." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-113", "text": "The ConceptNet distribution seems to equalize much of this disparity." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-114", "text": "This type of analysis is very insightful as it enables one to see which identities are more at risk for discrimination." 
}, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-115", "text": "A more direct way to measure how certain groups receive similar unfair treatment is to compute a correlation matrix between the vectors containing negative sentiment predictions for each identity term." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-116", "text": "We compute this matrix for the same two cases: GloVe word embeddings (top) and ConceptNet word embeddings (bottom) shown in Figure 4 ." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-117", "text": "The GloVe word embedding correlation matrix contains a lot of dark low correlations between identities, as a lot of identities contain small amounts of negative sentiment. But this visual brings out that certain groups like Indian, Mexican, and Russian have a high correlation, indicating that they could be treated similarly unfairly in a downstream ML algorithm." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-118", "text": "This is a useful insight that could allow a practitioner to change to embedding training corpora to create fairer models." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-119", "text": "For the ConceptNet word embeddings, we see a much more colorful heat map, indicating there are higher correlations between more identity terms." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-120", "text": "This hints that ConceptNet contains less targeted discrimination via negative sentiment." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-121", "text": "This visual also brings out slight differences in negative sentiment prediction." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-122", "text": "Identity terms like Scottish have lower correlations across the board, manifesting that this identity has slightly less negative sentiment than the rest of the identities." 
}, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-123", "text": "This is important to analyze to get a broader context for how various identities could receive different amounts of discrimination stemming from the word embedding bias." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-124", "text": "----------------------------------" }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-125", "text": "**DISCUSSION**" }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-126", "text": "We showed how our framework can be used in the religious and national origin case studies." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-127", "text": "In practice, our framework should be used to measure bias among demographics of interest for the NLP application in question." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-128", "text": "Our RNSB metric is a useful signal a practitioner can use to choose the embedding model with the least amount of risk for discrimination in their application, or even to evaluate what types of unintended biases exists in their training corpora." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-129", "text": "We used our framework to evaluate unintended bias with respect to sentiment, but there exists many other types of unintended demographic bias to create clear signals for in word embeddings." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-130", "text": "----------------------------------" }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-131", "text": "**CONCLUSION**" }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-132", "text": "We presented a transparent framework for evaluating unintended demographic bias in word embeddings." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-133", "text": "For this work our scope was limited to unfair biases with respect to negative sentiment." 
}, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-134", "text": "In our framework, we train a classifier on an unbiased positive/negative word sentiment dataset initialized with biased word embeddings." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-135", "text": "This way, we can observe the unfairness in the word embeddings at the ML prediction level." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-136", "text": "This allows us to observe clearer signals of bias in our metric, Relative Negative Sentiment Bias (RNSB)." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-137", "text": "Previous metrics and analysis into unintended bias in word embeddings rely on vector space arguments for only two demographics at a time, which does not lend itself well to evaluating real world discrimination." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-138", "text": "Our metric has a direct connection to discrimination and can evaluate any number of demographics in a protected group." }, { "sent_id": "4a093e0ca9a499c19e4721366d88f6-C001-139", "text": "Finally, our framework and metric reveal transparent analysis of the unintended bias hidden in word embeddings." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "4a093e0ca9a499c19e4721366d88f6-C001-36" ], [ "4a093e0ca9a499c19e4721366d88f6-C001-38" ], [ "4a093e0ca9a499c19e4721366d88f6-C001-39" ], [ "4a093e0ca9a499c19e4721366d88f6-C001-63" ] ], "cite_sentences": [ "4a093e0ca9a499c19e4721366d88f6-C001-36", "4a093e0ca9a499c19e4721366d88f6-C001-38", "4a093e0ca9a499c19e4721366d88f6-C001-39", "4a093e0ca9a499c19e4721366d88f6-C001-63" ] }, "@DIF@": { "gold_contexts": [ [ "4a093e0ca9a499c19e4721366d88f6-C001-39", "4a093e0ca9a499c19e4721366d88f6-C001-41" ] ], "cite_sentences": [ "4a093e0ca9a499c19e4721366d88f6-C001-39" ] }, "@SIM@": { "gold_contexts": [ [ "4a093e0ca9a499c19e4721366d88f6-C001-90" ], [ "4a093e0ca9a499c19e4721366d88f6-C001-99" ] ], "cite_sentences": [ "4a093e0ca9a499c19e4721366d88f6-C001-90", "4a093e0ca9a499c19e4721366d88f6-C001-99" ] } } }, "ABC_4498072885df2a126e2db553cf3aca_39": { "x": [ { "sent_id": "4498072885df2a126e2db553cf3aca-C001-71", "text": "**TASKS**" }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-2", "text": "Methods for unsupervised hypernym detection may broadly be categorized according to two paradigms: pattern-based and distributional methods." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-3", "text": "In this paper, we study the performance of both approaches on several hypernymy tasks and find that simple pattern-based methods consistently outperform distributional methods on common benchmark datasets." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-4", "text": "Our results show that pattern-based models provide important contextual constraints which are not yet captured in distributional methods." 
}, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-5", "text": "----------------------------------" }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-7", "text": "Hierarchical relationships play a central role in knowledge representation and reasoning." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-8", "text": "Hypernym detection, i.e., the modeling of word-level hierarchies, has long been an important task in natural language processing." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-9", "text": "Starting with Hearst (1992) , pattern-based methods have been one of the most influential approaches to this problem." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-10", "text": "Their key idea is to exploit certain lexico-syntactic patterns to detect is-a relations in text." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-11", "text": "For instance, patterns like \"NP y such as NP x \", or \"NP x and other NP y \" often indicate hypernymy relations of the form x is-a y. Such patterns may be predefined, or they may be learned automatically (Snow et al., 2004; Shwartz et al., 2016) ." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-12", "text": "However, a well-known problem of Hearst-like patterns is their extreme sparsity: words must co-occur in exactly the right configuration, or else no relation can be detected." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-13", "text": "To alleviate the sparsity issue, the focus in hypernymy detection has recently shifted to distributional representations, wherein words are represented as vectors based on their distribution across large corpora." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-14", "text": "Such methods offer rich representations of lexical meaning, alleviating the sparsity problem, but require specialized similarity measures to distinguish different lexical relationships." 
}, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-15", "text": "The most successful measures to date are generally inspired by the Distributional Inclusion Hypothesis (DIH) (Zhitomirsky-Geffet and Dagan, 2005) , which states roughly that contexts in which a narrow term x may appear (\"cat\") should be a subset of the contexts in which a broader term y (\"animal\") may appear." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-16", "text": "Intuitively, the DIH states that we should be able to replace any occurrence of \"cat\" with \"animal\" and still have a valid utterance." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-17", "text": "An important insight from work on distributional methods is that the definition of context is often critical to the success of a system (Shwartz et al., 2017) ." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-18", "text": "Some distributional representations, like positional or dependency-based contexts, may even capture crude Hearst pattern-like features (Levy et al., 2015; Roller and Erk, 2016) ." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-19", "text": "While both approaches for hypernym detection rely on co-occurrences within certain contexts, they differ in their context selection strategy: pattern-based methods use predefined manuallycurated patterns to generate high-precision extractions while DIH methods rely on unconstrained word co-occurrences in large corpora." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-20", "text": "Here, we revisit the idea of using pattern-based methods for hypernym detection." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-21", "text": "We evaluate several pattern-based models on modern, large corpora and compare them to methods based on the DIH." 
}, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-22", "text": "We find that simple pattern-based methods consistently outperform specialized DIH methods on several difficult hypernymy tasks, including detection, direction prediction, and graded entailment ranking." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-23", "text": "Moreover, we find that taking low-rank embeddings of pattern-based models substantially improves performance by remedying the sparsity issue." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-24", "text": "Overall, our results show that Hearst patterns provide high-quality and robust predictions on large corpora by capturing important contextual constraints, which are not yet modeled in distributional methods." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-25", "text": "----------------------------------" }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-26", "text": "**MODELS**" }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-27", "text": "In the following, we discuss pattern-based and distributional methods to detect hypernymy relations." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-28", "text": "We explicitly consider only relatively simple pattern-based approaches that allow us to directly compare their performance to DIH-based methods." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-29", "text": "----------------------------------" }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-30", "text": "**PATTERN-BASED HYPERNYM DETECTION**" }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-31", "text": "First, let P = {(x, y)} n i=1 denote the set of hypernymy relations that have been extracted via Hearst patterns from a text corpus T ." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-32", "text": "Furthermore let w(x, y) denote the count of how often (x, y) has been extracted and let W = (x,y)\u2208P w(x, y) denote the total number extractions." 
}, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-33", "text": "In the first, most direct application of Hearst patterns, we then simply use the counts w(x, y) or, equivalently, the extraction probability" }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-34", "text": "to predict hypernymy relations from T ." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-35", "text": "However, simple extraction probabilities as in Equation (1) are skewed by the occurrence probabilities of their constituent words." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-36", "text": "For instance, it is more likely that we extract (France, country) over (France, republic), just because the word country is more likely to occur than republic." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-37", "text": "This skew in word distributions is well-known for natural language and also translates to Hearst patterns (see also Figure 1 )." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-38", "text": "For this reason, we also consider predicting hypernymy relations based on the Pointwise Mutual Information of Hearst patterns: First, let p \u2212 (x) = (x,y)\u2208P w(x, y)/W and p + (x) = (y,x)\u2208P w(y, x)/W denote the probability that x occurs as a hyponym and hypernym, respectively." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-39", "text": "We then define the Positive Pointwise Mutual Information for (x, y) as" }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-40", "text": "While Equation (2) can correct for different word occurrence probabilities, it cannot handle missing data." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-41", "text": "However, sparsity is one of the main issues when using Hearst patterns, as a necessarily incomplete set of extraction rules will lead inevitably to missing extractions." 
}, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-42", "text": "For this purpose, we also study low-rank embeddings of the PPMI matrix, which allow us to make predictions for unseen pairs." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-43", "text": "In particular, let m = |{x : (x, y) \u2208 P \u2228 (y, x) \u2208 P}| denote the number of unique terms in P. Furthermore, let X \u2208 R m\u00d7m be the PPMI matrix with q q q q q qqqq q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q 
q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q 
q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q 
q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q 
q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q 
q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q entries M xy = ppmi(x, y) and let M = U \u03a3V be its Singular Value Decomposition (SVD)." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-44", "text": "We can then predict hypernymy relations based on the truncated SVD of M via" }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-45", "text": "where u x , v y denote the x-th and y-th row of U and V , respectively, and where \u03a3 r is the diagonal matrix of truncated singular values (in which all but the r largest singular values are set to zero)." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-46", "text": "Equation (3) can be interpreted as a smoothed version of the observed PPMI matrix." 
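A minimal numpy sketch of this truncated-SVD smoothing. The 4x4 PPMI matrix is invented toy data, and row/column indices stand in for terms:

```python
import numpy as np

# Toy PPMI matrix M over m = 4 terms; M[x, y] = ppmi score of pair (x, y).
M = np.array([[0.0, 1.2, 0.8, 0.0],
              [0.0, 0.0, 1.0, 0.5],
              [0.0, 0.0, 0.0, 0.9],
              [0.0, 0.0, 0.0, 0.0]])

U, s, Vt = np.linalg.svd(M)  # M = U @ diag(s) @ Vt

def smoothed(x, y, r):
    """Score u_x^T Sigma_r v_y: keep only the r largest singular values."""
    return float((U[x, :r] * s[:r]) @ Vt[:r, y])

# With full rank the original matrix is recovered exactly; with r < m we get
# a smoothed version that can assign nonzero scores to unseen pairs like (0, 3).
print(round(smoothed(0, 1, r=4), 3))  # 1.2 (exact reconstruction)
print(smoothed(0, 3, r=2))            # generally nonzero although M[0, 3] == 0
```

The matrix-completion effect of the truncation is what lets the model score pairs that no Hearst pattern ever extracted.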
}, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-47", "text": "Due to the truncation of singular values, Equation (3) computes a low-rank embedding of M where similar words (in terms of their Hearst patterns) have similar representations." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-94", "text": "Second, the relative comparison of scores determines which direction is predicted." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-48", "text": "Since Equation (3) is defined for all pairs (x, y), it allows us to make hypernymy predictions based on the similarity of words." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-49", "text": "We also consider factorizing a matrix that is constructed from occurrence probabilities as in Equation (1), denoted by sp(x, y)." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-50", "text": "This approach is then closely related to the method of Cederberg and Widdows (2003) , which has been proposed to improve precision and recall for hypernymy detection from Hearst patterns." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-51", "text": "----------------------------------" }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-52", "text": "**DISTRIBUTIONAL HYPERNYM DETECTION**" }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-53", "text": "Most unsupervised distributional approaches for hypernymy detection are based on variants of the Distributional Inclusion Hypothesis (Weeds et al., 2004; Kotlerman et al., 2010; Santus et al., 2014; Lenci and Benotto, 2012; Shwartz et al., 2017) ." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-54", "text": "Here, we compare to two methods with strong empirical results." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-55", "text": "As with most DIH measures, they are only defined for large, sparse, positively-valued distributional spaces." 
}, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-56", "text": "First, we consider WeedsPrec (Weeds et al., 2004) which captures the features of x which are included in the set of a broader term's features, y:" }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-57", "text": "Second, we consider invCL (Lenci and Benotto, 2012) which introduces a notion of distributional exclusion by also measuring the degree to which the broader term contains contexts not used by the narrower term." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-58", "text": "In particular, let" }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-59", "text": "denote the degree of inclusion of x in y as proposed by Clarke (2009) ." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-60", "text": "To measure both the inclusion of x in y and the non-inclusion of y in x, invCL is then defined as" }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-61", "text": "Although most unsupervised distributional approaches are based on the DIH, we also consider the distributional SLQS model based on on an alternative informativeness hypothesis (Santus et al., 2014; Shwartz et al., 2017) ." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-62", "text": "Intuitively, the SLQS model presupposes that general words appear mostly in uninformative contexts, as measured by entropy." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-63", "text": "Specifically, SLQS depends on the median entropy of a term's top N contexts, defined as" }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-64", "text": ", where H(c i ) is the Shannon entropy of context c i across all terms, and N is chosen in hyperparameter selection." 
}, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-65", "text": "Finally, SLQS is defined using the ratio between the two terms:" }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-66", "text": "Since the SLQS model only compares the relative generality of two terms, but does not make judgment about the terms' relatedness, we report SLQS-cos, which multiplies the SLQS measure by cosine similarity of x and y (Santus et al., 2014) ." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-67", "text": "For completeness, we also include cosine similarity as a baseline in our evaluation." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-68", "text": "Table 1 : Hearst patterns used in this study." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-69", "text": "Patterns are lemmatized, but listed as inflected for clarity." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-70", "text": "----------------------------------" }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-72", "text": "Detection: In hypernymy detection, the task is to classify whether pairs of words are in a hypernymy relation." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-73", "text": "For this task, we evaluate all models on five benchmark datasets: First, we employ the noun-noun subset of BLESS, which contains hypernymy annotations for 200 concrete, mostly unambiguous nouns." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-74", "text": "Negative pairs contain a mixture of co-hyponymy, meronymy, and random pairs." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-75", "text": "This version contains 14,542 total pairs with 1,337 positive examples." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-76", "text": "Second, we evaluate on LEDS (Baroni et al., 2012) , which consists of 2,770 noun pairs balanced between positive hypernymy examples, and randomly shuffled negative pairs." 
}, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-77", "text": "We also consider EVAL (Santus et al., 2015) , containing 7,378 pairs in a mixture of hypernymy, synonymy, antonymy, meronymy, and adjectival relations." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-78", "text": "EVAL is notable for its absence of random pairs." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-79", "text": "The largest dataset is SHWARTZ (Shwartz et al., 2016) , which was collected from a mixture of WordNet, DBPedia, and other resources." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-80", "text": "We limit ourselves to a 52,578 pair subset excluding multiword expressions." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-81", "text": "Finally, we evaluate on WBLESS (Weeds et al., 2014) , a 1,668 pair subset of BLESS, with negative pairs being selected from co-hyponymy, random, and hyponymy relations." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-82", "text": "Previous work has used different metrics for evaluating on BLESS (Lenci and Benotto, 2012; Levy et al., 2015; Roller and Erk, 2016) ." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-83", "text": "We chose to evaluate the global ranking using Average Precision." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-84", "text": "This allowed us to use the same metric on all detection benchmarks, and is consistent with evaluations in Shwartz et al. (2017) ." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-85", "text": "Direction: In direction prediction, the task is to identify which term is broader in a given pair of words." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-86", "text": "For this task, we evaluate all models on three datasets described by Kiela et al. (2015) : On BLESS, the task is to predict the direction for all 1337 positive pairs in the dataset." 
}, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-87", "text": "Pairs are only counted correct if the hypernymy direction scores higher than the reverse direction, i.e. score(x, y) > score(y, x)." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-88", "text": "We reserve 10% of the data for validation, and test on the remaining 90%." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-89", "text": "On WBLESS, we follow prior work (Nguyen et al., 2017; Vuli\u0107 and Mrk\u0161i\u0107, 2017) and perform 1000 random iterations in which 2% of the data is used as a validation set to learn a classification threshold, and test on the remainder of the data." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-90", "text": "We report average accuracy across all iterations." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-91", "text": "Finally, we evaluate on BIBLESS (Kiela et al., 2015) , a variant of WBLESS with hypernymy and hyponymy pairs explicitly annotated for their direction." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-92", "text": "Since this task requires three-way classification (hypernymy, hyponymy, and other), we perform two-stage classification." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-93", "text": "First, a threshold is tuned using 2% of the data, identifying whether a pair exhibits hypernymy in either direction." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-95", "text": "As with WBLESS, we report the average accuracy over 1000 iterations." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-96", "text": "Graded Entailment: In graded entailment, the task is to quantify the degree to which a hypernymy relation holds." 
}, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-97", "text": "For this task, we follow prior work (Nickel and Kiela, 2017; Vuli\u0107 and Mrk\u0161i\u0107, 2017) and use the noun part of HYPER-LEX , consisting of 2,163 noun pairs which are annotated to what degree x is-a y holds on a scale of [0, 6] ." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-98", "text": "For all models, we report Spearman's rank correlation \u03c1." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-99", "text": "We handle out-ofvocabulary (OOV) words by assigning the median of the scores (computed across the training set) to pairs with OOV words." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-100", "text": "----------------------------------" }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-101", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-102", "text": "Pattern-based models: We extract Hearst patterns from the concatenation of Gigaword and Wikipedia, and prepare our corpus by tokenizing, lemmatizing, and POS tagging using CoreNLP 3.8.0." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-103", "text": "The full set of Hearst patterns is provided in Table 1 ." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-104", "text": "Our selected patterns match prototypical Hearst patterns, like \"animals such as cats,\" but also include broader patterns like \"New Year is the most important holiday.\" Leading and following noun phrases are allowed to match limited modifiers (compound nouns, adjectives, etc.) , in which case we also generate a hit for the head of the noun phrase." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-105", "text": "During postprocessing, we remove pairs which were not extracted by at least two distinct patterns." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-106", "text": "We also remove any pair (y, x) if p(y, x) < p(x, y)." 
}, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-107", "text": "The final corpus contains roughly 4.5M matched pairs, 431K unique pairs, and 243K unique terms." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-108", "text": "For SVD-based models, we select the rank from r \u2208 {5, 10, 15, 20, 25, 50, 100, 150, 200, 250 , 300, 500, 1000} on the validation set." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-109", "text": "The other pattern-based models do not have any hyperparameters." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-110", "text": "Distributional models: For the distributional baselines, we employ the large, sparse distributional space of Shwartz et al. (2017) , which is computed from UkWaC and Wikipedia, and is known to have strong performance on several of the detection tasks." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-111", "text": "The corpus was POS tagged and dependency parsed." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-112", "text": "Distributional contexts were constructed from adjacent words in dependency parses (Pad\u00f3 and Lapata, 2007; Levy and Goldberg, 2014) ." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-113", "text": "Targets and contexts which appeared fewer than 100 times in the corpus were filtered, and the resulting co-occurrence matrix was PPMI transformed." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-114", "text": "1 The resulting space contains representations for 218K words over 732K context dimensions." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-115", "text": "For the SLQS model, we selected the number of contexts N from the same set of options as the SVD rank in pattern-based models." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-116", "text": "Table 2 shows the results from all three experimental settings." 
}, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-117", "text": "In nearly all cases, we find that pattern-based approaches substantially outperform all three distributional models." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-118", "text": "Particularly strong improvements can be observed on BLESS (0.76 average precision vs. 0.19) and WBLESS (0.96 vs. 0.69) for the detection tasks and on all directionality tasks." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-119", "text": "For directionality prediction on BLESS, the SVD models surpass even the state-of-the-art supervised model of Vuli\u0107 and Mrk\u0161i\u0107 (2017) ." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-120", "text": "Moreover, both SVD models perform generally better than their sparse counterparts on all tasks and datasets except on HYPERLEX." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-121", "text": "We performed a post-hoc analysis of the validation sets comparing the ppmi and spmi models, and found that the truncated SVD improved recall via its matrix completion properties." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-122", "text": "We also found that the spmi model downweighted many high-scoring outlier pairs composed of rare terms." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-123", "text": "When comparing the p(x, y) and ppmi models to distributional models, we observe mixed results." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-124", "text": "The SHWARTZ dataset is difficult for sparse models due to its very long tail of low-frequency words that are hard to cover using Hearst patterns." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-125", "text": "On EVAL, Hearst-pattern based methods get penalized by OOV words, due to the large number of verbs and adjectives in the dataset, which are not captured by our patterns."
}, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-126", "text": "However, in 7 of the 9 datasets, at least one of the sparse models outperforms all distributional measures, showing that Hearst patterns can provide strong performance on large corpora." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-127", "text": "----------------------------------" }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-128", "text": "**RESULTS**" }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-129", "text": "----------------------------------" }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-130", "text": "**CONCLUSION**" }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-131", "text": "We studied the relative performance of Hearst pattern-based methods and DIH-based methods for hypernym detection." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-132", "text": "Our results show that the pattern-based methods substantially outperform DIH-based methods on several challenging benchmarks." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-133", "text": "We find that embedding methods alleviate sparsity concerns of pattern-based approaches and substantially improve coverage." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-134", "text": "We conclude that Hearst patterns provide important contexts for the detection of hypernymy relations that are not yet captured in DIH models." }, { "sent_id": "4498072885df2a126e2db553cf3aca-C001-135", "text": "Our code is available at https://github.com/ facebookresearch/hypernymysuite." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "4498072885df2a126e2db553cf3aca-C001-17" ], [ "4498072885df2a126e2db553cf3aca-C001-53" ] ], "cite_sentences": [ "4498072885df2a126e2db553cf3aca-C001-17", "4498072885df2a126e2db553cf3aca-C001-53" ] }, "@USE@": { "gold_contexts": [ [ "4498072885df2a126e2db553cf3aca-C001-61" ], [ "4498072885df2a126e2db553cf3aca-C001-84" ], [ "4498072885df2a126e2db553cf3aca-C001-110" ] ], "cite_sentences": [ "4498072885df2a126e2db553cf3aca-C001-61", "4498072885df2a126e2db553cf3aca-C001-84", "4498072885df2a126e2db553cf3aca-C001-110" ] } } }, "ABC_e59bd02bb560d80ce08dfcd6b35317_39": { "x": [ { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-114", "text": "All rare terms may be removed by deletion, but they have a low chance of appearing in the original topic model anyway." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-115", "text": "Therefore LDA model has high stability over various levels of deletion errors." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-2", "text": "Topic modelling techniques such as LDA have recently been applied to speech transcripts and OCR output." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-3", "text": "These corpora may contain noisy or erroneous texts which may undermine topic stability." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-4", "text": "Therefore, it is important to know how well a topic modelling algorithm will perform when applied to noisy data." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-5", "text": "In this paper we show that different types of textual noise will have diverse effects on the stability of different topic models." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-6", "text": "From these observations, we propose guidelines for text corpus generation, with a focus on automatic speech transcription." 
}, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-116", "text": "Insertion and Metaphone replacement introduce systematic noise, which changes the distribution of original texts with respect to frequency, thus having more impact on the LDA model." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-117", "text": "A high proportion of general frequent terms may dilute the frequency of characteristic terms and add noisy terms to a topic model." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-7", "text": "We also suggest topic model selection methods for noisy corpora." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-8", "text": "----------------------------------" }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-10", "text": "Topic modelling techniques are widely applied in text retrieval tasks." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-11", "text": "Such techniques have been previously applied to news sources (Newman et al., 2006) , OCR (Tamura et al., 2013) , blogs (Yokomoto et al., 2012) etc." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-12", "text": "in which the quality of the source text is high with low error rates (missing, misspelled, or incorrect terms or phrases)." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-13", "text": "However, with the improvements in accuracy and the reduction in the cost of automatic speech transcription and optical character recognition (OCR) technologies, the range of sources that topic modelling can now be applied to is growing." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-14", "text": "One artefact of such new text sources is their inherent noise." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-15", "text": "In speech to text transcriptions, humans in general manage a WER of 2% to 4% (Fiscus et al., 2007) ."
}, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-16", "text": "When transcribing with a vocabulary size of 200, 5000 and 100000, the word error rates are 3%, 7% and 45% respectively." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-17", "text": "The best word error rate for broadcast news transcription is 13% (Pallet, 2003) , but this rises to 25.7% in conference transcription and gets worse in casual conversation (Fiscus et al., 2007) ." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-18", "text": "These figures show that the difficulty of automatic speech recognition rises with vocabulary size, speaker dependency and level of crosstalk." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-19", "text": "Noise aside, many of these newly available sources contain rich and valuable information that can be analysed through topic modelling." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-20", "text": "For example, automatic speech transcription applied to call centre audio recordings is able to capture a level of detail that is otherwise unavailable unless the call audio is manually reviewed, which is infeasible for large call volumes." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-21", "text": "In this case topic modelling can be applied to transcribed text to extract the key issues and emerging topics of discussion." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-22", "text": "In this study we propose a method for simulating various types of transcription errors." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-23", "text": "We then test the robustness of a popular topic modelling algorithm, Latent Dirichlet Allocation (LDA), using a topic stability measure introduced by Greene et al. (2014) over a variety of corpora."
}, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-24", "text": "----------------------------------" }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-25", "text": "**TOPIC MODELLING AND METRICS**" }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-26", "text": "Blei et al. (2003) introduced Latent Dirichlet Allocation (LDA) as a generative probabilistic model for text corpora." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-27", "text": "LDA is an unsupervised learning model that regulates the probabilistic distributions between document, topic and word." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-28", "text": "For the evaluation of topic models, we follow the approach by Greene et al. (2014) for measuring topic model agreement." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-29", "text": "We can denote a topic list as S = {R 1 , ..., R k }, where R i is a topic with rank i. An individual topic can be described as R = {T 1 , ..., T m }, where T l is a term with rank l belonging to the topic." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-30", "text": "The Jaccard index (Jaccard, 1912) compares the number of identical items in two sets, but it neglects ranking order." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-31", "text": "Average Jaccard (AJ) similarity is a top-weighted version of the Jaccard index used to accommodate ranking information." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-32", "text": "AJ calculates the average of the Jaccard scores between every pair of subsets in two lists." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-33", "text": "Based on AJ, we can evaluate the agreement of two sets of ranked lists (topic models)." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-34", "text": "The topic model agreement score between S 1 and S 2 is a mean value of the top similarity scores between each cross pair of R. 
The agreement score is computed using the Hungarian method (Kuhn, 1955) and is constrained to the range [0, 1] , where a perfect match between two identical k-way ranked sets results in a score of 1 and non-overlapping sets score 0." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-35", "text": "(Greene et al., 2014)" }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-36", "text": "----------------------------------" }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-37", "text": "**DATASETS**" }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-38", "text": "In this paper, we explore two datasets, bbc and wikilow (Greene et al., 2014) , with different document and corpus sizes." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-39", "text": "The bbc corpus includes general BBC news articles." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-40", "text": "This corpus contains 2225 documents in 5 topics and it uses 3121 terms." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-41", "text": "The wikilow corpus is a subset of Wikipedia and articles are labeled with fine-grained WikiProject sub-groups." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-42", "text": "In total there are 4986 documents in 10 topics and it uses 15411 terms." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-43", "text": "In both datasets the topics consist of distinct vocabularies which we expect LDA to detect." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-44", "text": "For example, the topics in the bbc dataset are business, entertainment, politics, sport and technology." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-45", "text": "----------------------------------" }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-46", "text": "**TEXTUAL NOISE**" }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-47", "text": "We artificially introduce noise into text to investigate the performance of topic modelling over naturally noisy sources."
}, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-48", "text": "We measure noise using word error rate (WER), a common metric for measuring speech recognition accuracy." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-49", "text": "Moreover, WER has been used as a salient metric in speech quality analytics (Saon et al., 2006) and spoken dialogue systems (Cavazza, 2001) ." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-50", "text": "In Equation 1, WER is defined as the sum of the number of substitutions S, the number of deletions D and the number of insertions, divided by the number of terms in the reference N. The experiments investigate the robustness of topic models against each type of noise, and at which noise levels the output of a topic model is consistent with that of the original corpus." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-51", "text": "Deletion noise is introduced by randomly removing a portion of text in the corpus." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-52", "text": "The proportion of deletion ranges from 0% to 50% and the term selection is based on a uniform distribution." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-53", "text": "Insertion is introduced by adding a portion (0% to 50%) of frequent terms from a list of frequent English words with 7726 entries 1 ." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-54", "text": "The probability of sampling a certain term from the list is based on the term frequency." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-55", "text": "----------------------------------" }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-56", "text": "**METAPHONE REPLACEMENT**" }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-57", "text": "We simulate speech recognition errors using Metaphone, a phonetic algorithm for indexing English words by their pronunciation (Philips, 1990)."
}, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-58", "text": "Here we use the Double Metaphone (Black, 2014) algorithm for replacement, and the replacement is on a one-to-one basis." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-59", "text": "This may not simulate the full range of errors produced by ASR systems, in which the substitution may be a one-to-many or many-to-one 2 mapping, but it was deemed sufficient for the current experiments." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-60", "text": "In this study we map Metaphone codes to frequent English words (examples in Table 2 )." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-61", "text": "Then in a given text document, we randomly select X percent of terms and replace each with a term in the Metaphone map." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-62", "text": "The candidate terms sharing the same Metaphone symbol are selected based on term frequencies." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-63", "text": "A frequent term has a higher probability of being selected than a rare term (see Table 1 )." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-64", "text": "----------------------------------" }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-65", "text": "**EXPERIMENTS**" }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-66", "text": "In our experiments with LDA, we aim to test the topic stability over different levels of noise and different numbers of topics." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-67", "text": "In order to produce consistent and repeatable results, since each noise generation method relies on a degree of randomness in word deletion, insertion or substitution, we generate multiple copies of each modified corpus using 5 random seeds."
}, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-68", "text": "Similarly, we perform 5 runs of each Mallet LDA (McCallum, 2002) topic model as the algorithm's initial state is determined by a random seed." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-69", "text": "LDA hyperparameters are defined with default values, and each topic is represented by the top 25 terms." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-70", "text": "The final stability score on each level is a mean value of a number of runs with fixed seeds." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-71", "text": "Figure 1 and Figure 2 show the topic stability of the bbc and wikilow corpora with reference numbers of topics 5 and 10 each." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-72", "text": "For each level of topic model complexity, a downward slope indicates decreasing stability of topic models against increasing noise." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-73", "text": "----------------------------------" }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-74", "text": "**LDA OUTPUT**" }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-75", "text": "In the bbc corpus, topic model stability shows clear differences across noise types." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-76", "text": "The model is especially robust against Deletion errors." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-77", "text": "When noise increases from 5% to 50%, the Hungarian agreement score of output topics only drops about 1% (for the fitted model with K = 5 in Figure 1(a) )." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-78", "text": "Checking each model in Figure 1(a) , we can say that in the bbc corpus the topic models are robust against random Deletion noise."
}, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-79", "text": "In Figure 1 (b), the model with 5 topics achieves the highest Hungarian agreement score at noise levels of 5% and 10%, but it drops significantly afterwards." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-80", "text": "The best and most stable topic model with noise higher than 15% of Insertion errors is the model with 15 topics, which is three times the reference." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-81", "text": "A similar trend is observed with Metaphone replacement errors in Figure 1(c) ." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-82", "text": "The topic model with the reference number of topics achieves the highest stability when the noise level is low." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-83", "text": "However, there are differences between Insertion and Metaphone errors in the bbc corpus tests." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-84", "text": "With 50% of Insertion errors, the model with 15 topics achieves 56.4% agreement with the original model, but the agreement is only 32.4% with Metaphone errors." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-85", "text": "In the bbc corpus, Metaphone errors are the most challenging case." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-86", "text": "In the wikilow corpus we observe similar trends in Figure 1 and Figure 2 on specific types of noise." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-118", "text": "However, a topic model with many more topics than the reference can deal with the effect of systematic errors."
}, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-119", "text": "----------------------------------" }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-120", "text": "**CONCLUSIONS**" }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-121", "text": "We investigated how transcription errors affect the quality and robustness of topic models produced over a range of corpora, using a previously introduced topic stability measure." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-122", "text": "We simulated transcription errors from the perspective of word error rate and generated noisy corpora with deletion, insertion and Metaphone replacement." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-123", "text": "Topic models produced by LDA show high tolerance to deletion noise up to 50% but low tolerance to insertion and Metaphone replacement errors." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-124", "text": "We find the robustness of topic models is mainly determined by the extent to which the distribution of original texts is modified." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-125", "text": "Deletion noise is introduced randomly and its effect on topic models is minor." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-126", "text": "Insertion and Metaphone replacement noise is systematic and undermines topic model stability to a large extent." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-127", "text": "Moreover, the number of topics selected also affects topic agreement." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-128", "text": "With random noise or low-level systematic errors (below 20%), a correct or approximately correct number of topics brings the highest topic agreement scores." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-129", "text": "With high-level systematic errors, topic models with 3 times the correct number of topics are most robust."
}, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-130", "text": "In some corpora, a redundant number of topics helps the LDA model through severe systematic errors (Figure 1(b) )." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-131", "text": "This complements previous work by Greene et al. (2014) who investigated how topic stability is influenced by the number of topics over noise-free corpora." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-132", "text": "This suggests that transcribers should perhaps consider omitting words when the uncertainty is high." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-133", "text": "The topic model is less influenced by a randomly missing term than by an erroneous replacement." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-134", "text": "For human consumption this may not be optimal, but in the case of output specifically intended for topic extraction this approach makes sense." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-87", "text": "With Deletion errors, the topic model with the reference number of topics is most stable across noise levels." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-88", "text": "The difference in topic agreement scores is below 2% across noise levels." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-89", "text": "With Insertion and Metaphone errors, the topic model with the reference number of topics is almost the best when noise is low but it drops below others when noise is higher than 15%." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-90", "text": "Although there are many similarities between Figure 1 and 2, we would like to mention two major differences across corpora." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-91", "text": "In Figure 2 (b) and 2(c), Hungarian scores of different topic models (numbers of topics) share a similar descending slope."
}, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-92", "text": "However, a few models from the bbc corpus (K = 15, 20, 30) keep roughly stable Hungarian scores in Figure 1(b) ." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-93", "text": "Another difference is that the most stable topic models against noise levels higher than 20% in Figure 1 (b) and 1(c) both have 15 topics, whereas the most stable models in wikilow have 30 topics in the same settings." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-94", "text": "However, if we compare them with corresponding reference topic numbers K, the most stable topic models with high systematic errors all have K * 3 topics." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-95", "text": "Models with topic numbers higher than K * 3 are not optimal in Figure 1 (b) and 1(c)." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-96", "text": "----------------------------------" }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-97", "text": "**DISCUSSION**" }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-98", "text": "In Section 4.1 we observe topic model stability in two corpora and three types of noise." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-99", "text": "Here we can define a single measurement of topic stability across different settings." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-100", "text": "If the level of agreement is set at 70%, LDA is robust against Deletion noise up to 50% in both the bbc and wikilow corpora." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-101", "text": "However, the LDA model reaches this agreement level only up to 10% Insertion noise and 5% Metaphone replacement noise." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-102", "text": "We see that Metaphone replacement and insertion are more severe challenges to topic models than deletion."
}, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-103", "text": "Regarding deletion errors, we observe that the robustness of a topic model is mostly determined by the number of topics." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-104", "text": "When this matches the reference, the topic model is the most stable." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-105", "text": "However, this does not emerge with insertion and Metaphone errors." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-106", "text": "Reference topic models with 5 (bbc) and 10 (wikilow) topics achieve high stability only when noise is \u2264 10%." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-107", "text": "With higher levels of noise, a more complex topic model exhibits higher stability." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-108", "text": "Models with 15 (bbc) and 30 (wikilow) topics are the most robust at a noise level of 50%." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-109", "text": "A tentative explanation of the high stability of topic models against Deletion errors concerns the LDA model." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-110", "text": "LDA takes term frequency into account." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-111", "text": "The probability of a word belonging to a topic is high if it appears frequently in one topic and seldom in other topics." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-112", "text": "Such a word is very likely to be an entry in a topic model." }, { "sent_id": "e59bd02bb560d80ce08dfcd6b35317-C001-113", "text": "If we randomly delete corpus terms, the counts of frequent terms are only trivially affected and these frequent terms still have a high probability of selection."
} ], "y": { "@USE@": { "gold_contexts": [ [ "e59bd02bb560d80ce08dfcd6b35317-C001-23" ], [ "e59bd02bb560d80ce08dfcd6b35317-C001-28" ], [ "e59bd02bb560d80ce08dfcd6b35317-C001-38" ] ], "cite_sentences": [ "e59bd02bb560d80ce08dfcd6b35317-C001-23", "e59bd02bb560d80ce08dfcd6b35317-C001-28", "e59bd02bb560d80ce08dfcd6b35317-C001-38" ] }, "@SIM@": { "gold_contexts": [ [ "e59bd02bb560d80ce08dfcd6b35317-C001-131" ] ], "cite_sentences": [ "e59bd02bb560d80ce08dfcd6b35317-C001-131" ] } } }, "ABC_0d06c8509ebbdc61985bebcdb26e6c_39": { "x": [ { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-60", "text": "Let" }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-2", "text": "Training large vocabulary Neural Network Language Models (NNLMs) is a difficult task due to the explicit requirement of the output layer normalization, which typically involves the evaluation of the full softmax function over the complete vocabulary." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-3", "text": "This paper proposes a Batch Noise Contrastive Estimation (B-NCE) approach to alleviate this problem." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-4", "text": "This is achieved by reducing the vocabulary, at each time step, to the target words in the batch and then replacing the softmax by the noise contrastive estimation approach, where these words play the role of targets and noise samples at the same time." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-5", "text": "In doing so, the proposed approach can be fully formulated and implemented using optimal dense matrix operations." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-6", "text": "Applying B-NCE to train different NNLMs on the Large Text Compression Benchmark (LTCB) and the One Billion Word Benchmark (OBWB) shows a significant reduction of the training time with no noticeable degradation of the models performance." 
}, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-7", "text": "This paper also presents a new baseline comparative study of different standard NNLMs on the large OBWB on a single Titan-X GPU." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-8", "text": "----------------------------------" }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-10", "text": "Neural Network Language Models (NNLM) [1, 2] have been shown to significantly outperform standard N-gram LMs [3, 4] on many speech and language technology applications, such as machine translation [5] and speech recognition [6] ." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-11", "text": "The training and evaluation of these models, however, becomes significantly slow and challenging when considering large vocabulary language models [7] ." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-12", "text": "This is mainly due to the explicit normalization of the output layer, which typically requires the evaluation of the full softmax function over the complete vocabulary." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-13", "text": "In order to overcome this problem, Schwenk et al. [8] proposed to use a short list of frequent words in combination with N-gram models." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-14", "text": "The performance of this approach, however, significantly depends on the short list size." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-15", "text": "In a different attempt, Morin et al. [9] proposed to factorize the output probabilities using a binary tree, which results in an exponential speed-up of the training and evaluation, whereas Mikolov et al. [10] proposed to use an additional class layer as an alternative to the factorization." 
}, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-16", "text": "The performance of these two approaches significantly depends on the design of the binary tree and the class layer size, respectively." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-17", "text": "As an alternative to modifying the architecture design, the authors of [11] used importance sampling to approximate the output gradient." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-18", "text": "Unfortunately, this approach requires control of the samples' variance, which can otherwise lead to unstable learning [12] ." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-19", "text": "In a similar work, Mnih et al. [13] proposed to use Noise Contrastive Estimation (NCE) [14] to speed up the training." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-20", "text": "NCE treats the learning as a binary classification problem between a target word and noise samples, which are drawn from a noise distribution." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-21", "text": "Moreover, NCE considers the normalization term as an additional parameter that can be learned during training or fixed beforehand." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-22", "text": "In this case, the network learns to self-normalize." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-23", "text": "This property makes NCE more attractive compared to other sampling methods, such as importance sampling, which would still require the use of the softmax function during evaluation." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-24", "text": "In batch mode training, however, the implementation of NCE cannot be directly formulated using dense matrix operations, which compromises its speed-up gains." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-25", "text": "This paper proposes a new solution to train large vocabulary LMs using NCE in batch mode."
}, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-26", "text": "The main idea here is to restrict the vocabulary, at each iteration, to the words in the batch and replace the standard softmax function by NCE." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-27", "text": "In particular, the target words (to predict) in the batch play the role of targets and noise samples at the same time." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-28", "text": "In doing so, the proposed approach does not require any sampling and can be fully formulated using dense matrix operations, leading to significant speed-up improvements with no noticeable degradation of the performance." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-29", "text": "Moreover, we can show that this approach optimally approximates the unigram noise distribution, which is widely used in NCE-based LMs (Section 3)." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-30", "text": "While applying the proposed batch NCE approach, this paper also presents a new baseline comparative study of different NNLMs on the Large Text Compression Benchmark (LTCB) [15] and the One Billion Word Benchmark (OBWB) [7] on a single Titan-X GPU (Section 4)." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-31", "text": "NCE training of a neural network follows the standard backpropagation algorithm applied to the objective function (5) and its gradient (6) ." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-32", "text": "More details about NCE and its gradient derivation can be found in [14] ." 
}, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-33", "text": "----------------------------------" }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-34", "text": "**NCE VS IMPORTANCE SAMPLING**" }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-35", "text": "The authors of [18] have shown that NCE and Importance Sampling (IS) are closely related, the main difference being that NCE is defined as a binary classifier between samples drawn from the data or noise distributions with a logistic loss, whereas IS is a multi-class classifier, which uses softmax and a cross-entropy loss." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-36", "text": "Hence, the authors concluded that IS is theoretically a better choice than NCE." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-37", "text": "The results reported, however, showed a minor difference in performance (2.4 points in perplexity)." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-38", "text": "Moreover, training using IS can be very difficult and requires careful control of the sample variance, which can otherwise lead to unstable learning, as was reported in [12] ." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-39", "text": "Hence, an adaptive IS may use a large number of samples to solve this problem, whereas NCE is more stable and requires a small, fixed number of noise samples (e.g., 100) to achieve good performance [13, 16] ." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-40", "text": "Furthermore, the network learns to self-normalize during training using NCE." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-41", "text": "As a result, and in contrast to IS, the softmax is no longer required during evaluation, which makes NCE an attractive choice to train large vocabulary NNLMs." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-42", "text": "The next section will show how NCE can be efficiently implemented in batch mode training." 
}, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-43", "text": "----------------------------------" }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-44", "text": "**BATCH NOISE CONTRASTIVE ESTIMATION**" }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-45", "text": "Although NCE is a good alternative to train large vocabulary LMs, it is not well-suited for batch mode training on GPUs." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-46", "text": "More particularly, each target word in the batch uses a different set of noise samples, which makes it difficult to formulate the learning using dense matrix operations." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-47", "text": "As a result, the training time significantly increases." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-48", "text": "To alleviate this problem, noise samples can be shared across the batch [16] ." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-49", "text": "This paper proposes an extension of NCE to batch mode (B-NCE) training." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-50", "text": "This approach does not require any sampling and can be formulated using dense matrix operations." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-51", "text": "Furthermore, we can show that this solution optimally approximates the sampling from a unigram distribution, which has been shown to be a good noise distribution choice [13, 16] ." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-52", "text": "The main idea here is to restrict the vocabulary, at each forward-backward pass, to the target words in the batch (words to predict) and then replace the softmax function by NCE." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-53", "text": "In particular, these words alternately play the role of targets and noise samples." 
}, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-54", "text": "That is, for a target word w_i at batch index i, the rest of the target batch (the remaining target words at the other batch indices j, j \u2260 i) are considered to be the noise samples." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-55", "text": "The rest of this section introduces the mathematical formulation of B-NCE to efficiently calculate the error with respect to the output layer weights and biases, as well as the error at the previous layer in batch training, using the objective function (5) and its gradient (6)." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-56", "text": "----------------------------------" }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-57", "text": "**LM TRAINING USING B-NCE**" }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-58", "text": "Let B, H and V be the sizes of the batch, the last hidden layer and the vocabulary, respectively." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-59", "text": "The matrix L t (size B \u00d7 H) will denote the evaluation of the last hidden layer at time t on the current batch." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-61", "text": "be the target words in the batch at time t and let W (size H \u00d7 V ) and C (1 \u00d7 V ) denote the hidden-to-output weight matrix and bias vector, respectively." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-62", "text": "Our goal here is to calculate the error (delta) of the output weights W and biases C, as well as the error at the previous layer L t ." 
}, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-63", "text": "The output layer evaluation in a feed-forward pass of B-NCE, at time t, is calculated by restricting the output layer to V" }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-64", "text": "Once the error E(L t ) is propagated to the last hidden layer L t using (11), the learning of the rest of the network follows the standard back-propagation algorithm." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-65", "text": "After processing the complete training data, each word w in the vocabulary will be used exactly (B \u2212 1) \u00d7 count(w) times as a noise sample." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-66", "text": "This is strictly equivalent to sampling from a unigram noise distribution, which shows that B-NCE is an optimal implementation of NCE using unigram as noise distribution." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-67", "text": "We should also mention that some words may occur more than once in the batch." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-68", "text": "This observation should be taken into consideration before updating the weights and biases." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-69", "text": "----------------------------------" }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-70", "text": "**ADAPTIVE B-NCE**" }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-71", "text": "The proposed B-NCE approach as defined above uses a fixed number of noise samples (B \u2212 1), which is dependent on the batch size." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-72", "text": "In cases where the latter is small (e.g., B \u2264 100), B-NCE can be extended to use an additional K noise samples." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-73", "text": "This can be done by simply drawing an additional K samples from the noise distribution pn and sharing them across the batch, as was done in [16] ." 
}, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-74", "text": "The adaptive B-NCE follows the exact same steps described above using the extended output weight sub-matrix W t B+K (size H \u00d7 (B+K)) and the extended bias sub-vector C t B+K (size 1 \u00d7 (B+K)) to evaluate (7), whereas (8) becomes" }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-75", "text": "are the probabilities of the additional K noise samples using the noise distribution pn." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-76", "text": "----------------------------------" }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-77", "text": "**EXPERIMENTS AND RESULTS**" }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-78", "text": "To evaluate the proposed B-NCE approach, we conducted a set of LM experiments on two different corpora." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-79", "text": "Namely, the Large Text Compression Benchmark (LTCB) [15] , which is an extract of the enwik9 dataset, and the very large One Billion Word Benchmark (OBWB) [7] ." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-80", "text": "The LTCB data split and processing is the same as the one used in [19, 20] ." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-81", "text": "In particular, the LTCB vocabulary is limited to the 80K most frequent words with all remaining words replaced by . Similarly to the RNNLM toolkit [10] , we add a single tag at the end of each sentence whereas the begin tag is not used." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-82", "text": "The resulting corpus size is 133M with an rate of 1.43% for the training set and 2.30% for the test set." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-83", "text": "The second corpus is the OBWB, which contains \u2248 0.8B tokens with a vocabulary size of \u2248 0.8M words." 
}, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-84", "text": "The data processing follows the description provided in [7] leading to an rate of \u2248 0.3%." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-85", "text": "Similarly to LTCB, a single tag is added at the end of each sentence." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-86", "text": "In the experiments described below, the first 5 held-out sets are used for validation whereas the remaining 45 sets are used for testing." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-87", "text": "The obtained results, however, showed that the models' perplexity on these two sets is comparable, with an average difference of less than 0.5 points in perplexity." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-88", "text": "The primary motivation for using LTCB, with its medium vocabulary size (80K), is to be able to compare the performance of LMs trained using NCE to their counterparts that are trained using the full softmax." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-89", "text": "When using NCE to train the models, the evaluation is performed either using the NCE constant Z for normalization (PPL n ), in which case the target word probabilities are given by (2), or using the softmax function (PPL f ), which calculates these probabilities using (1) ." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-90", "text": "The difference in performance between these metrics will evaluate the ability of the models to learn to self-normalize after training." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-91", "text": "For a comprehensive comparison of the different models, we also report the Number of Parameters (NoP) required by each model as well as its Training Speed (TS), which is calculated as the number of words processed per second (w/s) during training." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-92", "text": "All experiments were conducted on a single Titan-X GPU." 
}, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-93", "text": "----------------------------------" }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-94", "text": "**BASELINE MODELS**" }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-95", "text": "In order to assess the gap among established NNLMs, this paper also presents a comparative study of different standard architectures with comparable NoPs." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-96", "text": "That is, we report results for the standard Feed-forward network (FFNN) [1] , the Recurrent Neural Network (RNN) [10] as well as the Long-Short Term Memory network (LSTM) [21] ." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-97", "text": "Our RNN implementation uses a projection weight matrix to decouple the word embedding and the hidden layer sizes." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-98", "text": "We also report results after adding a bottleneck fully-connected ReLu layer right before the output layer in the recurrent models." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-99", "text": "These models are marked with the prefix ReLu in the tables below." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-100", "text": "Each of the models is trained using the proposed B-NCE approach and the shared noise NCE (S-NCE) [16] ." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-101", "text": "For the LTCB corpus, we also report results of the models trained with the full softmax function." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-102", "text": "This is the primary motivation for using this corpus." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-103", "text": "We would also like to highlight that the goal of this paper is not to improve LM performance but rather to show how a significant training speed-up can be achieved for large vocabulary LMs without compromising the models' performance." 
}, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-104", "text": "Hence, we solely focus our experiments on NCE as a major approach to achieve this goal [17, 13, 16] in comparison to the reference full softmax function." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-105", "text": "Comparison to other training approaches such as importance sampling will be conducted in future work." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-106", "text": "----------------------------------" }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-107", "text": "**LTCB EXPERIMENTS**" }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-108", "text": "For the LTCB experiments, the embedding size is fixed at 200, the 5-gram FFNN has two hidden layers, whereas RNN and LSTM use a single recurrent layer." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-109", "text": "All non-recurrent layers use ReLu as activation function." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-110", "text": "More details about the models architectures are shown in Table 1 , where \"(R)\" stands for recurrent and \"(B)\" for bottleneck." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-111", "text": "The batch size is fixed at 400 and the initial learning rate is set to 0.4." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-112", "text": "The latter is halved when no improvement on the validation data is observed for an additional 7 epochs." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-113", "text": "We also use a norm-based gradient clipping with a threshold of 5 but we do not use dropout." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-114", "text": "Moreover, B-NCE and S-NCE use the unigram as noise distribution pn." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-115", "text": "Following the setup proposed in [13, 16] , S-NCE uses K = 100 noise samples, whereas B-NCE uses only the target words in the batch (K=0)." 
}, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-116", "text": "Note that S-NCE will process and update B + K words at its output layer during each forward-backward pass, whereas B-NCE updates only B words." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-117", "text": "Similarly to [17] , the NCE normalization constant is set to Z = exp(9), which approximates the mean of the normalization term using softmax." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-118", "text": "The results in Table 2 clearly show that B-NCE reduces the training time by a factor of 4 to 8 with a slight degradation in the models' performance compared to softmax." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-119", "text": "Moreover, we can also see that B-NCE slightly outperforms S-NCE while being faster and simpler to implement." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-120", "text": "In particular, B-NCE does not require the sampling step since it uses the rest of the output words in the batch itself as noise samples to train the model on each target word." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-121", "text": "This can be efficiently implemented using dense matrix operations (see Section 3)." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-122", "text": "Table 2 also shows that PPL n is close to PPL f , which typically reflects that the models trained using NCE are able to self-normalize, i.e., the normalization term using softmax is, on average, very close to the NCE constant Z. We have also observed in our experiments that the models' degradation and the gap between PPL f and PPL n strongly depend on the amount of training data, the vocabulary size, as well as the size of the last hidden layer." 
}, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-123", "text": "More particularly, increasing the training data leads to a more stable learning and therefore to a smaller gap between these two metrics and a much lower degradation of the models performance (see OBWB experiments below)." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-124", "text": "We can also conclude from Table 2 that the additional ReLu layer improves the performance while significantly decreasing the number of parameters (NoP)." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-125", "text": "This conclusion is valid for both RNN and LSTM." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-126", "text": "These results confirm that adding a fully-connected bottleneck layer can significantly boost the performance of recurrent models." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-127", "text": "This idea has been previously used in computer vision tasks in combination with Convolutional Neural Networks (CNN) [22] , as well as in speech recognition [23] , where the fully-connected layer is used as part of the LSTM recurrent module." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-128", "text": "----------------------------------" }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-129", "text": "**ONE BILLION WORD BENCHMARK EXPERIMENTS**" }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-130", "text": "The OBWB experiments are similar to LTCB with minor differences." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-131", "text": "Namely, the embedding size is set to 500 for all models, the batch size is fixed at 500, S-NCE uses K = 200 noise samples and the initial learning rate is set to 1.0." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-132", "text": "Given that the vocabulary size is \u2248 0.8M , it was not possible to train the language models using the full softmax function." 
}, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-133", "text": "Therefore, we only report results for B-NCE and S-NCE." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-134", "text": "More details about the models configuration are shown in Table 3 ." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-135", "text": "The results in Table 4 generally confirm the LTCB conclusions." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-136", "text": "That is, B-NCE slightly outperforms S-NCE while being faster and simpler to train (training speed in the 3rd column)." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-137", "text": "Moreover, these results also show a much smaller difference between PPL f and PPL n compared to LTCB, which suggests that the models learned to better self-normalize due to the larger amount of training data." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-138", "text": "Similarly to LTCB, we can see that the additional ReLu helps reduce the NoP while improving or maintaining the models' performance for RNN and LSTM." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-139", "text": "We also compare our models to other results reported on the OBWB." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-140", "text": "We can see that the small ReLu-RNN achieves performance close to that of the very large RNN model (PPL = 51.3 and NoP = 20B) proposed in [7] ." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-141", "text": "Moreover, the performance of the small ReLu-LSTM is comparable to the LSTM models proposed in [16] and [18] which use large hidden layers." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-142", "text": "In particular, the first paper trains a large 4-layer LSTM model using S-NCE on 4 GPUs (PPL = 43.2 and NoP = 3.4B), whereas the second uses a recurrent bottleneck layer [23] and a total of K = 8192 noise samples with importance sampling on 32 Tesla K40 GPUs." 
}, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-143", "text": "----------------------------------" }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-144", "text": "**CONCLUSIONS AND FUTURE WORK**" }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-145", "text": "We have presented a batch-NCE approach which allows fast and simple training of large vocabulary LMs." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-146", "text": "This approach eliminates the sampling step required in standard NCE and can be fully formulated using dense matrix operations." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-147", "text": "Experiments on LTCB and OBWB have shown that this approach achieves performance comparable to the full softmax function while significantly speeding up the training." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-148", "text": "While the evaluation focused on NCE performance, future experiments will be conducted to evaluate B-NCE in comparison to other alternatives to the softmax." }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-149", "text": "----------------------------------" }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-150", "text": "**ACKNOWLEDGMENT**" }, { "sent_id": "0d06c8509ebbdc61985bebcdb26e6c-C001-151", "text": "This research was funded by the German Research Foundation (DFG) as part of SFB 1102." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "0d06c8509ebbdc61985bebcdb26e6c-C001-19" ], [ "0d06c8509ebbdc61985bebcdb26e6c-C001-39" ], [ "0d06c8509ebbdc61985bebcdb26e6c-C001-51" ] ], "cite_sentences": [ "0d06c8509ebbdc61985bebcdb26e6c-C001-19", "0d06c8509ebbdc61985bebcdb26e6c-C001-39", "0d06c8509ebbdc61985bebcdb26e6c-C001-51" ] }, "@USE@": { "gold_contexts": [ [ "0d06c8509ebbdc61985bebcdb26e6c-C001-104" ], [ "0d06c8509ebbdc61985bebcdb26e6c-C001-115" ] ], "cite_sentences": [ "0d06c8509ebbdc61985bebcdb26e6c-C001-104", "0d06c8509ebbdc61985bebcdb26e6c-C001-115" ] } } }, "ABC_653327ecbc925624d509c679fbe0ba_39": { "x": [ { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-49", "text": "----------------------------------" }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-50", "text": "**KNOWLEDGE-ENRICHED MRC MODEL 3.1 TASK DEFINITION**" }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-2", "text": "This paper focuses on how to take advantage of external relational knowledge to improve machine reading comprehension (MRC) with multi-task learning." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-3", "text": "Most of the traditional methods in MRC assume that the knowledge used to get the correct answer generally exists in the given documents." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-4", "text": "However, in real-world task, part of knowledge may not be mentioned and machines should be equipped with the ability to leverage external knowledge." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-5", "text": "In this paper, we integrate relational knowledge into MRC model for commonsense reasoning." 
}, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-6", "text": "Specifically, based on a pre-trained language model (LM), we design two auxiliary relation-aware tasks to predict if there exists any commonsense relation and what is the relation type between two words, in order to better model the interactions between document and candidate answer option." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-7", "text": "We conduct experiments on two multi-choice benchmark datasets: SemEval-2018 Task 11 and the Story Cloze Test." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-8", "text": "The experimental results demonstrate the effectiveness of the proposed method, which achieves superior performance compared with the comparable baselines on both datasets." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-9", "text": "----------------------------------" }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-11", "text": "Machine reading comprehension (MRC) enables machines with the ability to answer questions with given corresponding documents." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-12", "text": "Recent years have witnessed the bloom of various well-designed MRC models [13, 14, 16] , which achieve promising performance when provided with adequate manually labeled instances [9] ." 
}, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-17", "text": "CIKM '19, November 3-7, 2019, Beijing, China \u00a9 2019 Association for Computing Machinery. https://doi.org/10.1145/3357384.3358165" }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-19", "text": "However, these models generally assume that the knowledge required to answer the questions already exists in the documents, which does not always hold." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-20", "text": "How to leverage the commonsense knowledge for better reading comprehension remains largely unexplored." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-21", "text": "Recently, some preliminary studies have begun to incorporate certain side information (e.g., triplets from external knowledge base) into the model design of various NLP tasks, such as question answering [1] and conversation generation [18] ." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-22", "text": "Generally, there are two lines of this work." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-23", "text": "The first line focuses on designing task-specific model structures [1, 15] , which exploit the retrieved concepts from external knowledge base for enhancing the representation." 
}, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-24", "text": "Recently, the other line of work pre-trains a language model over a large corpus to learn the inherent word-level knowledge in an unsupervised way [4, 8] , which achieves very promising performance." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-25", "text": "The first line of work is usually carefully designed for the target task and is thus not widely applicable." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-26", "text": "The second line can only learn the co-occurrence of words or entities in the context, while it may not be that robust for some complex scenarios such as reasoning tasks." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-27", "text": "For example, to answer the question \"Was the light bulb still hot?\" when the document is given as \"I went into my bedroom and flipped the light switch." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-28", "text": "Oh, I see that the ceiling lamp is not turning on... \", machines should have the commonsense knowledge that \"the bulb is not hot when turning off\" to correctly answer the question." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-29", "text": "The explicit relation information can act as a bridge to connect the scattered context, which may not be easily captured." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-30", "text": "Therefore, the aim of this paper is to take advantage of both the pre-trained language model and the explicit relation knowledge from the external knowledge base for commonsense reading comprehension." 
}, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-31", "text": "Specifically, we first extract the triplets from the popular ConceptNet knowledge base [11] and design two auxiliary relation-aware tasks to predict if there exists any relation and what is the relation type between two concepts." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-32", "text": "To make the model aware of the commonsense relations between concepts, we propose a multi-task learning framework to jointly learn the prediction of the target MRC task and the two relation-aware tasks in a unified model." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-33", "text": "We conduct experiments on two multi-choice commonsense reading comprehension datasets: Story Cloze Test [6] and SemEval-2018 Task 11 [7] ." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-34", "text": "Experimental results demonstrate the effectiveness of our method, which achieves superior performance compared with the comparable baselines on both datasets." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-35", "text": "----------------------------------" }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-36", "text": "**RELATED WORK**" }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-37", "text": "Previous studies mainly focused on developing effective model structures to improve the reading ability of the systems [14, 16] , which have achieved promising performance." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-38", "text": "However, the success on these tasks is not adequate considering the model's ability of commonsense reasoning." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-39", "text": "Recently, a number of efforts have been invested in developing datasets for commonsense reading comprehension such as Story Cloze Test and SemEval-2018 Task 11 [6, 7] ." 
}, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-40", "text": "In these datasets, part of required knowledge may not be mentioned in the document and machines should be equipped with commonsense knowledge to make correct prediction." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-41", "text": "There exists increasing interest in incorporating commonsense knowledge into commonsense reading comprehension tasks." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-42", "text": "Most previous studies focused on developing special model structures to introduce external knowledge into neural network models [1, 15] , which have achieved promising results." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-43", "text": "For example, Yang and Mitchell [15] use concepts from WordNet and weighted average vectors of the retrieved concepts to calculate a new LSTM state." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-44", "text": "These methods rely on task-specific model structures that are difficult to adapt to other tasks." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-45", "text": "Pre-trained language models such as BERT and GPT [4, 8] are also used as a source of commonsense knowledge." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-46", "text": "However, the LM method mainly captures the co-occurrence of words and phrases and cannot address some more complex problems which may require reasoning ability." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-47", "text": "Unlike previous work, we incorporate external knowledge by jointly training the MRC model with two auxiliary tasks which are relevant to commonsense knowledge." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-48", "text": "The model can learn to fill in the knowledge gap without changing the original model structure." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-51", "text": "Here we formally define the task of multi-choice commonsense reading comprehension." 
}, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-52", "text": "Given a reference document D (a question q if possible), a set of N answer options {O 1 , O 2 , ..., O N } and an external knowledge base F = { f 1 , \u00b7 \u00b7 \u00b7 , f M }, the goal is to choose the correct answer option according to their probabilities {p 1 , p 2 , ..., p N } given by MRC model, where N is the total number of options." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-53", "text": "In this paper, we use ConceptNet knowledge base [11] , which is a large semantic network of commonsense knowledge with a total of about 630k facts." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-54", "text": "Each fact f i is represented as a triplet f i = (subject, relation, object), where subject and object can be a word or phrase and relation is a relation type." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-55", "text": "An example is:" }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-56", "text": "----------------------------------" }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-57", "text": "**OVERALL FRAMEWORK**" }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-58", "text": "The proposed method can be roughly divided into three parts: a pre-trained LM encoder, a task-specific prediction layer for multichoice MRC and two relation-aware auxiliary tasks." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-59", "text": "The overall framework is shown in Figure 1 ." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-60", "text": "The pre-trained LM encoder acts as the foundation of the model, which is used to capture the relationship between question, document and answer options." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-61", "text": "Here we utilize BERT [4] as the pretrained encoder for its superior performance in a range of natural language understanding tasks." 
}, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-62", "text": "Specially, we concatenate the given document, question (as sentence A) and each option (as sentence B) Next, on top of BERT encoder, we add a task-specific output layer and view the multi-choice MRC as a multi-class classification task." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-63", "text": "Specifically, we apply a linear head layer plus a softmax layer on the final contextualized word representation of [CLS] token h L 0 ." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-64", "text": "We minimize the Negative Log Likelihood (NLL) loss with respect to the correct class label, as:" }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-65", "text": "where\u0125 0 is the final hidden state of the correct option\u00d4, N is the number of options and v T \u2208 R H is a learnable vector." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-66", "text": "Finally, to make the model be aware of certain implicit commonsense relations between concepts, we further introduce two auxiliary relation-aware prediction tasks for joint learning." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-67", "text": "Since it may be difficult to directly predict the actual relation type between two concepts without adequate training data, we split the relation prediction problem into two related tasks: i.e., relation-existence task and relation-type task." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-68", "text": "In relation-existence, we basically add an auxiliary task to predict if there exists any relation between two concepts, which is a relatively easy task." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-69", "text": "Then, we take one step further to decide what is the right type of the relation in relationexistence." 
}, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-70", "text": "The basic premise is that by guiding the MRC model training with extra relation information, the proposed model can be equipped with the ability to capture some underlying commonsense relationships." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-71", "text": "The two auxiliary tasks are jointly trained with the multi-choice answer prediction task." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-72", "text": "In the following, we will describe the two auxiliary tasks in detail." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-73", "text": "----------------------------------" }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-74", "text": "**INCORPORATING RELATION KNOWLEDGE**" }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-75", "text": "Task 1 is the relation-existence task." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-76", "text": "Following [4] , we first convert the concept to a set of BPE tokens tokens A and tokens B, with beginning index i and j in the input sequence respectively." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-77", "text": "The probability of whether there is a relation in each pair (tokens A," }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-78", "text": "where W 1 \u2208 R H \u00d7H is a trainable matrix." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-79", "text": "We define the pair (tokens A, tokens B) that has relation in ConceptNet as a positive example and others as negative examples." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-80", "text": "We down-sample the negative examples and keep ratio of positive vs negative is 1 : \u03b3 ." 
}, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-81", "text": "We define the relation-existence loss as the average binary cross-entropy (BCE) loss:" }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-82", "text": "where |A|, |B| are the number of sampled concepts in sentence A and sentence B respectively, y RE is the label of whether there is a relation between concepts." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-83", "text": "Task 2 is the relation-type task." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-84", "text": "We predict the relation type between tokens A and tokens B. The relation-type probabilities are computed as:" }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-85", "text": "where W 2 \u2208 R H \u00d72H and W 3 \u2208 R R\u00d7H are new trainable matrices, R is the number of selected relation types 2 ." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-86", "text": "The relation-type loss is computed as:" }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-87", "text": "We define s i j as the label whether there is a relation from sentence A to sentence B. k is the index of ground-truth relation in ConceptNet, |S | is the number of relations among the tokens in two sentences." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-88", "text": "As the three tasks share the same BERT architecture with only different linear head layers, we propose to train them together." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-89", "text": "The joint objective function is formulated as follows:" }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-90", "text": "where \u03bb 1 and \u03bb 2 are two hyper-parameters that control the weight of the tasks, N is number of options." 
}, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-91", "text": "----------------------------------" }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-92", "text": "**EXPERIMENTS 4.1 DATASET**" }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-93", "text": "We conduct experiments on two commonsense reading comprehension tasks: SemEval-2018 shared task11 [7] and Story Cloze Test [6] ." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-94", "text": "The statistics of the datasets are shown in Table 1 3 ." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-95", "text": "----------------------------------" }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-96", "text": "**IMPLEMENTATION DETAILS**" }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-97", "text": "We use the uncased BERT(base) [4] as pre-trained language model." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-98", "text": "We set the batch size to 24, learning rate to 2e-5." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-99", "text": "The maximum sequence length is 384 for SemEval-2018 Task 11 and 512 for Story Cloze Test." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-100", "text": "We fine-tune for 3 epochs on each dataset." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-101", "text": "The taskspecific hyper-parameters \u03bb 1 and \u03bb 2 are set to 0.5 and the ratio \u03b3 is set to 4.0." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-102", "text": "----------------------------------" }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-103", "text": "**EXPERIMENTAL RESULTS AND ANALYSIS**" }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-104", "text": "Model Comparison The performances of our model on two datasets are shown in Table 2 and Table 3 ." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-105", "text": "We compare our single model with other existing systems (single model)." 
}, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-106", "text": "We also adopt relation-aware tasks on the co-attention layer of the TriAN model [12] ." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-107", "text": "From the result, we can observe that: (1) Our method achieves better performance on both datasets compared with previous methods and Bert(base) model." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-108", "text": "(2) By adopt relation-aware tasks on the attention layer of the TriAN model [12] on SemEval, the model performance can also be improved." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-109", "text": "The results show that the relation-aware tasks can help to better align sentences due to knowledge gap." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-110", "text": "Effectiveness of Relation-aware Tasks To get better insight into our model, we analyze the benefit brought by using relation-aware tasks on the development set of SemEval-2018 Task 11." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-111", "text": "The performance of jointly training the basic answer prediction model with different tasks is shown in table 4." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-112", "text": "From the result we can see that by incorporating the auxiliary relation-existence task (L RE ) or relation-type task (L RT ) in the joint learning framework, the performance can always improved." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-113", "text": "The result shows the advantage of incorporating auxiliary tasks." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-114", "text": "Besides, the performance gain by adding relation-existence task is larger, which shows relationexistence task can incorporate more knowledge into the model." 
}, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-115", "text": "We also attempt to merge two relation-aware tasks into one task by simply taking \"No-Relation\" as a special type of relation." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-116", "text": "The model performance is just slightly higher than using relation-type task and lower than using two tasks separately." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-117", "text": "The result is due to the number of \"No-Relation\" labels is much more than other relation types, which makes the task hard to train." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-118", "text": "Analysis Table 5 shows the examples that are incorrectly predicted by BERT(base), while correctly solved by incorporating relationaware tasks." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-119", "text": "The first two examples can benefit from relationexistence knowledge." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-120", "text": "From the first example we can see that the retrieved relation between concepts from ConceptNet provide useful evidence to connect the question to the correct option (A)." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-121", "text": "The second example is from Cloze Story Test dataset, we can see that the retrieved relation is also helpful in making correct prediction." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-122", "text": "The third example from SemEval requires relation-type knowledge." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-123", "text": "but the relation type (kettle, U sedFor , boilwater ) in option (A), is more relevant to the question, which shows that relation type knowledge can be used as side information to do the prediction." 
}, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-124", "text": "----------------------------------" }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-125", "text": "**CONCLUSION**" }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-126", "text": "In this paper, we aim to enrich the neural model with external knowledge to improve commonsense reading comprehension." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-127", "text": "We use two auxiliary relation-aware tasks to incorporate ConceptNet knowledge into the MRC model." }, { "sent_id": "653327ecbc925624d509c679fbe0ba-C001-128", "text": "Experimental results demonstrate the effectiveness of our method which achieves improvements compared with the pre-trained language model baselines on both datasets." } ], "y": { "@BACK@": { "gold_contexts": [ [ "653327ecbc925624d509c679fbe0ba-C001-24" ], [ "653327ecbc925624d509c679fbe0ba-C001-45" ] ], "cite_sentences": [ "653327ecbc925624d509c679fbe0ba-C001-24", "653327ecbc925624d509c679fbe0ba-C001-45" ] }, "@USE@": { "gold_contexts": [ [ "653327ecbc925624d509c679fbe0ba-C001-61" ], [ "653327ecbc925624d509c679fbe0ba-C001-76" ], [ "653327ecbc925624d509c679fbe0ba-C001-97" ] ], "cite_sentences": [ "653327ecbc925624d509c679fbe0ba-C001-61", "653327ecbc925624d509c679fbe0ba-C001-76", "653327ecbc925624d509c679fbe0ba-C001-97" ] } } }, "ABC_7202fd7fe7e776362b126f7cbf0bf3_39": { "x": [ { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-144", "text": "Word error rate comparison with other end-to-end methods on LibriSpeech dataset." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-145", "text": "with the data released with the corpus." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-2", "text": "Regularization is important for end-to-end speech models, since the models are highly flexible and easy to overfit." 
}, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-3", "text": "Data augmentation and dropout has been important for improving end-to-end models in other domains." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-4", "text": "However, they are relatively under explored for end-to-end speech models." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-5", "text": "Therefore, we investigate the effectiveness of both methods for end-to-end trainable, deep speech recognition models." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-6", "text": "We augment audio data through random perturbations of tempo, pitch, volume, temporal alignment, and adding random noise." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-7", "text": "We further investigate the effect of dropout when applied to the inputs of all layers of the network." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-8", "text": "We show that the combination of data augmentation and dropout give a relative performance improvement on both Wall Street Journal (WSJ) and LibriSpeech dataset of over 20%." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-9", "text": "Our model performance is also competitive with other end-to-end speech models on both datasets." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-10", "text": "----------------------------------" }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-11", "text": "**INTRODUCTION**" }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-12", "text": "Regularization has proven crucial to improving the generalization performance of many machine learning models." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-13", "text": "In particular, regularization is crucial when the model is highly flexible (e.g. deep neural networks) and likely to overfit on the training data." 
}, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-14", "text": "Data augmentation is an efficient and effective way of doing regularization that introduces very small (or no) overhead during training." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-15", "text": "It has shown to consistently improve performance in various pattern recognition tasks [1, 2, 3, 4, 5] ." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-16", "text": "Dropout [6] is another powerful way of doing regularization for training deep neural networks, it intends to reduce the co-adaptation amongst hidden units by randomly zero-ing out inputs to the hidden layer during training." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-17", "text": "End-to-end speech models often have millions of parameters [7, 8, 9, 10, 11] ." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-18", "text": "However, data augmentation and dropout have not been extensively studied or applied to them." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-146", "text": "Our results on both WSJ and LibriSpeech are competitive to existing methods." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-147", "text": "We would like to note that our model achieved comparable results with Amodei et al. [9] on LibriSpeech dataset, although our model is only trained only on the provided training set." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-148", "text": "This demonstrates the effectiveness of the proposed regularization methods for training end-to-end speech models." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-149", "text": "----------------------------------" }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-150", "text": "**CONCLUSION**" }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-151", "text": "In this paper, we investigate the effectiveness of data augmentation and dropout for deep neural network based, end-to-end speech recognition models." 
}, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-152", "text": "For data augmentation, we independently vary the tempo and pitch of the audio so that it is able to generate a large variety of additional data." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-153", "text": "In addition, we also add noisy versions of the data by changing the gain, shifting the audio, and add random white noise." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-154", "text": "We show that, with tempo and noise based augmentation, we are able to achieve 15-20% relative performance improvement on WSJ and LibriSpeech dataset." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-155", "text": "We further investigate the regularization of dropout by applying it to inputs of all layers of the network." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-156", "text": "Similar to data augmentation, we obtained significant performance improvements." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-157", "text": "When both regularization techniques are combined, we achieved new state-of-the-art results on both dataset, with 6.26% on WSJ, and 5.67% and 15.18% on test-clean and test-other set from LibriSpeech." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-158", "text": "----------------------------------" }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-159", "text": "**REFERENCES**" }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-160", "text": "[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, \"Imagenet classification with deep convolutional neural networks,\" in Advances in Neural Information Processing Systems, 2012, pp." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-161", "text": "1106-1114." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-19", "text": "We investigate the effectiveness of data augmentation and dropout for regularizing end-to-end speech models." 
}, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-20", "text": "In particular, we augment the raw audio data by changing the tempo and pitch independently." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-21", "text": "The volume and temporal alignment of the audio signals are also randomly perturbed, with additional random noises added." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-22", "text": "To further regularize the model, we employ dropout to each input layer of the network." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-23", "text": "With these regularization techniques, we obtained over 20% relative performance on the Wall Street Journal (WSJ) dataset and LibriSpeech dataset." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-24", "text": "----------------------------------" }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-25", "text": "**RELATED WORK**" }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-26", "text": "Data augmentation in speech recognition has been applied before." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-27", "text": "Gales et al. [12] use hidden Markov models to generate synthetic data to enable 1-vs-1 training of SVMs." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-28", "text": "Feature level augmentation has also demonstrated effectiveness [5, 3, 4] ." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-29", "text": "Ko et al. [2] performed audio level speed perturbation that also lead to performance improvements." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-30", "text": "Background noise is used for augmentation in [8, 9] to improve performance on noisy speech." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-31", "text": "Apart from adding noise, our data augmentation also modifies the tempo, pitch, volume and temporal alignment of the audio." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-32", "text": "Dropout has been applied to speech recognition before." 
}, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-33", "text": "It has been applied to acoustic models in [13, 14, 15] and demonstrated promising performance." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-34", "text": "For end-to-end speech models, Hannun et al. [8] applied dropout to the output layer of the network." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-35", "text": "However, to the best of our knowledge, our work is the first to apply it to other layers in end-to-end speech models." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-36", "text": "----------------------------------" }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-37", "text": "**MODEL ARCHITECTURE**" }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-38", "text": "The end-to-end model structure used in this work is very similar to the model architecture of Deep Speech 2 (DS2) [9] ." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-39", "text": "While we have made some modifications at the front-end time-frequency convolution (i.e. 2-D convolution) layers, the core structure of recurrent layers is the same." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-40", "text": "The full end-to-end model structure is illustrated in Fig. 1 ." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-41", "text": "First, we use depth-wise separable convolution [16, 17] for all the convolution layers." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-42", "text": "The performance advantage of depth-wise separable convolution has been demonstrated in computer vision tasks [17] and is also more computationally efficient." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-43", "text": "The depth-wise separable convolution is implemented by first convolving over the input channel-wise." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-44", "text": "It then convolves with 1 \u00d7 1 filters with the desired number of channels." 
}, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-45", "text": "Stride size only influence the channel-wise convolution; the following 1 \u00d7 1 convolutions always have stride one." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-46", "text": "Second, we substitute normal convolution layers with ResNet [18] blocks." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-47", "text": "The residual connections help the gradient flow during training." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-48", "text": "They have been employed in speech recognition [11] and achieved promising results." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-49", "text": "For example, a w \u00d7 h depth-wise separable convolution with n input and m output channels is implemented by first convolving the input channel-wise with its corresponding w \u00d7 h filters, followed by standard 1 \u00d7 1 convolution with m filters." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-50", "text": "Our model is composed of one standard convolution layer that has larger filter size, followed by five residual convolution blocks." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-51", "text": "Convolutional features are then given as input to a 4-layer bidirectional recurrent neural network with gated recurrent units (GRU) [19] ." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-52", "text": "Finally, two fully connected layers take the last hidden RNN layer as input and output the final per-character prediction." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-53", "text": "Batch normalization [20] is applied to all layers to facilitate training." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-54", "text": "----------------------------------" }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-55", "text": "**PREVENTING OVERFITTING**" }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-56", "text": "Our model has over five million parameters (see sec. 5)." 
}, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-57", "text": "This makes regularization important for it to generalize well." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-58", "text": "In this section we describe our primary methods for preventing overfitting: data augmentation and dropout." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-59", "text": "----------------------------------" }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-60", "text": "**DATA AUGMENTATION**" }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-61", "text": "Vocal tract length perturbation (VTLP [5] ) is a popular method for doing feature level data augmentation in speech." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-62", "text": "We choose to do data level augmentation (i.e. augment raw audio) instead of feature level augmentation, because the absence of feature-level dependencies makes it more flexible." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-63", "text": "Ko et al. [2] used data level augmentation and showed that modifying the speed of raw audio approximates the effect of VTLP and works better than VTLP." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-64", "text": "However, in speed perturbation since the pitch is positively correlated with the speed, it is not possible to generate audio with higher pitch but slower speed and vice versa." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-65", "text": "This may not be ideal, since it reduces the variation in the augmented data, which in turn may hurt performance." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-66", "text": "Therefore, To get increased variation in our data, we separate the speed perturbation into two independent components -tempo and pitch." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-67", "text": "By keeping the pitch and tempo separate, we can cover a wider range of variations." 
}, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-68", "text": "We use the tempo and pitch functions from the SoX audio manipulation tool [21] ." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-69", "text": "Generating noisy versions of the data is also a common way to do data augmentation." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-70", "text": "To generate such data, we add random white noise to the audio signal." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-71", "text": "Volume of the audio is also randomly modified to simulate the effect of different recording volumes." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-72", "text": "To further distort the audio, it is also randomly shifted by a small amount (i.e. less than 10ms)." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-73", "text": "With a combination of the above approaches, we can synthetically generate a large amount of data that captures different variations." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-74", "text": "----------------------------------" }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-75", "text": "**DROPOUT**" }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-76", "text": "Dropout [6] is a powerful regularizer." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-77", "text": "It prevents the coadaptation of hidden units by randomly zero-ing out a subset of inputs for that layer during training." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-78", "text": "In more detail, let x lows:" }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-79", "text": "where z i = [z i1 , z i2 , . . . , z id ] is the dropout mask." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-80", "text": "Since we are not interested in the Bayesian view of the model, we choose the same rescaling approximation as standard dropout (i.e. rescale input by 1 \u2212 p) instead of doing Monte Carlo approximation at test time." 
}, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-81", "text": "We apply the dropout variant described in eq. 3 and 4 to inputs of all convolutional and recurrent layers." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-82", "text": "Standard dropout is applied on the fully connected layers." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-83", "text": "----------------------------------" }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-84", "text": "**EXPERIMENTS**" }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-85", "text": "To investigate the effectiveness of the proposed techniques, we perform experiments on the Wall Street Journal (WSJ) and LibriSpeech [23] datasets." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-86", "text": "The input to the model is a spectrogram computed with a 20ms window and 10ms step size." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-87", "text": "We normalize each spectrogram to have zero mean and unit variance." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-88", "text": "In addition, we also normalize each feature to have zero mean and unit variance based on the training set statistics." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-89", "text": "No further preprocessing is done after these two steps of normalization." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-90", "text": "We denote the size of the convolution layer by tuple (C, F, T, SF, ST), where C, F, T, SF, and ST denotes number of channels, filter size in frequency dimension, filter size in time dimension, stride in frequency dimension and stride in time dimension respectively." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-91", "text": "We have one convolutional layer with size (32,41,11,2,2), and five residual convolution blocks of size (32,7,3,1,1), (32,5,3,1,1), (32,3,3,1,1), (64,3,3,2,1), (64,3,3,1,1) respectively." 
}, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-92", "text": "Following the convolutional layers we have 4-layers of bidirectional GRU RNNs with 1024 hidden units per direction per layer." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-93", "text": "Finally we have one fully connected hidden layer of size 1024 followed by the output layer." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-94", "text": "The convolutional and fully connected layers are initialized uniformly [24] ." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-95", "text": "The recurrent layer weights are initialized with a uniform distribution U(\u22121/32, 1/32)." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-96", "text": "The model is trained in an end-to-end fashion to maximize the log-likelihood using connectionist temporal classification [25] ." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-97", "text": "We use mini-batch stochastic gradient descent with batch size 64, learning rate 0.1, and with Nesterov momentum 0.95." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-98", "text": "The learning rate is reduced by half whenever the validation loss has plateaued, and the model is trained until the validation loss stops improving." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-99", "text": "The norm of the gradient is clipped [26] to have a maximum value of 1." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-100", "text": "In addition, for all experiments we use l-2 weight decay of 1e \u22125 for all parameters." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-101", "text": "1 In [22] , the authors also have dropout applied on recurrent connections, we did not employ the recurrent dropout because dropout on the input is an easy drop-in substitution for cuDNN RNN implementation, whereas the recurrent one is not." 
}, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-102", "text": "Switch from cuDNN based RNN implementation will increase the computation time of RNNs by a significant amount, and thus we choose to avoid it." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-103", "text": "----------------------------------" }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-104", "text": "**EFFECT OF INDIVIDUAL REGULARIZER**" }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-105", "text": "To study the effectiveness of data augmentation and dropout, we perform experiments on both datasets with various settings." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-106", "text": "The first set of experiments were carried out on the WSJ corpus." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-107", "text": "We use the standard si284 set for training, dev93 for validation and eval92 for test evaluation." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-108", "text": "We use the provided language model and report the result in the 20K closed vocabulary setting with beam search." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-109", "text": "The beam width is set to 100." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-110", "text": "Since the training set is relatively small (\u223c 80 hours), we performed a more detailed ablation study on this dataset by separating the tempo based augmentation from the one that generates noisy versions of the data." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-111", "text": "For tempo based data augmentation, the tempo parameter is selected following a uniform distribution U(0.7, 1.3), and U(\u2212500, 500) for pitch." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-112", "text": "Since WSJ has relatively clean recordings, we keep the signal to noise ratio between 10 and 15db when adding white noise." 
}, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-113", "text": "The gain is selected from U(\u221220, 10) and the audio is shifted randomly by 0 to 10ms." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-114", "text": "Results are shown in Table 1 ." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-115", "text": "Both approaches improve the performance over the baseline, where none of the additional regularization is applied." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-116", "text": "Noise augmentation has demonstrated its effectiveness for making the model more robust against noisy inputs." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-117", "text": "We show here that adding a small amount of noise also benefits the model on relatively clean speech samples." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-118", "text": "To compare with existing augmentation methods,we trained a model by using speed perturbation." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-119", "text": "We use 0.9, 1.0, and 1.1 as the perturb coefficient for speed as suggested in [2] ." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-120", "text": "This results in a WER of 7.21%, which brings 13.96% relative performance improvement." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-121", "text": "Our tempo based augmentation is slightly better than the speed augmentation, which may attribute to more variations in the augmented data." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-122", "text": "When the techniques for data augmentation are combined, we have a significant relative improvement of 20% over the baseline (see table 1 )." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-123", "text": "Dropout also significantly improved the performance (22% relative improvement, see table 1)." 
}, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-124", "text": "The dropout probabilities are set as follows, 0.1 for data, 0.2 for all convolution layers, 0.3 for all recurrent and fully connected layers." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-125", "text": "By combining all regularization, we achieve a final WER of 6.42%." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-126", "text": "Fig. 2 shows the training curve of baseline and regularized models." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-127", "text": "It is clear that with regularization, the gap between the validation and training loss is narrowed." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-128", "text": "In addition, the regularized training also results in a lower validation loss." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-129", "text": "We also performed experiments on the LibriSpeech dataset." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-130", "text": "The model is trained using all 960 hours of training data." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-131", "text": "We use both dev-clean and dev-other for validation and report results on test-clean and test-other." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-132", "text": "The provided 4-gram language model is used for final beam search decoding." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-133", "text": "The beam width used in this experiment is also set to 100." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-134", "text": "The results follow a similar trend as the previous experiments." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-135", "text": "We achieved relative performance improvement of over 22% on test-clean and over 32% on test-other set (see table 2)." 
}, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-136", "text": "----------------------------------" }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-137", "text": "**COMPARISON TO OTHER METHODS**" }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-138", "text": "In this section, we compare our end-to-end model with other end-to-end speech models." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-139", "text": "The results from WSJ and LibriSpeech (see table 3 and 4) are obtained through beam search decoding with the language model provided with the dataset with beam size 100." }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-140", "text": "To make a fair comparison on the WSJ corpus, we additionally trained an extended trigram model" }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-141", "text": "----------------------------------" }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-142", "text": "**METHOD**" }, { "sent_id": "7202fd7fe7e776362b126f7cbf0bf3-C001-143", "text": "Eval 92 Bahdanau et al. [10] 9.30% Graves and Jaitly [27] 8.20% Miao et al. [7] 7.34% Ours 6.42% Ours (extended 3-gram) 6.26% Amodei et al. [9] 3.60% 5.33% 13.25% Table 4 ." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "7202fd7fe7e776362b126f7cbf0bf3-C001-17" ], [ "7202fd7fe7e776362b126f7cbf0bf3-C001-30" ] ], "cite_sentences": [ "7202fd7fe7e776362b126f7cbf0bf3-C001-17", "7202fd7fe7e776362b126f7cbf0bf3-C001-30" ] }, "@MOT@": { "gold_contexts": [ [ "7202fd7fe7e776362b126f7cbf0bf3-C001-17", "7202fd7fe7e776362b126f7cbf0bf3-C001-18", "7202fd7fe7e776362b126f7cbf0bf3-C001-19" ] ], "cite_sentences": [ "7202fd7fe7e776362b126f7cbf0bf3-C001-17" ] }, "@SIM@": { "gold_contexts": [ [ "7202fd7fe7e776362b126f7cbf0bf3-C001-30", "7202fd7fe7e776362b126f7cbf0bf3-C001-31" ], [ "7202fd7fe7e776362b126f7cbf0bf3-C001-38" ], [ "7202fd7fe7e776362b126f7cbf0bf3-C001-147" ] ], "cite_sentences": [ "7202fd7fe7e776362b126f7cbf0bf3-C001-30", "7202fd7fe7e776362b126f7cbf0bf3-C001-38", "7202fd7fe7e776362b126f7cbf0bf3-C001-147" ] } } }, "ABC_a6b450d1113e0e6d3d2813c09d12a8_39": { "x": [ { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-2", "text": "We evaluate semantic relatedness measures on different German datasets showing that their performance depends on: (i) the definition of relatedness that was underlying the construction of the evaluation dataset, and (ii) the knowledge source used for computing semantic relatedness." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-3", "text": "We analyze how the underlying knowledge source influences the performance of a measure." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-4", "text": "Finally, we investigate the combination of wordnets and Wikipedia to improve the performance of semantic relatedness measures." 
}, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-5", "text": "----------------------------------" }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-7", "text": "Semantic similarity (SS) is typically defined via the lexical relations of synonymy (automobile -car) and hypernymy (vehicle -car), while semantic relatedness (SR) is defined to cover any kind of lexical or functional association that may exist between two words." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-8", "text": "Many NLP applications, like sense tagging or spelling correction, require knowledge about semantic relatedness rather than just similarity (Budanitsky and Hirst, 2006) ." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-9", "text": "For these tasks, it is not necessary to know the exact type of semantic relation between two words, but rather if they are closely semantically related or not." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-10", "text": "This is also true for the work presented herein, which is part of a project on electronic career guidance." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-11", "text": "In this domain, it is important to conclude that the words \"baker\" and \"bagel\" are closely related, while the exact type of a semantic relation does not need to be determined." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-12", "text": "As we work on German documents, we evaluate a number of SR measures on different German datasets." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-13", "text": "We show that the performance of measures strongly depends on the underlying knowledge source." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-14", "text": "While WordNet (Fellbaum, 1998) models SR, wordnets for other languages, such as the German wordnet GermaNet (Kunze, 2004) , contain only few links expressing SR." 
}, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-15", "text": "Thus, they are not well suited for estimating SR." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-16", "text": "Therefore, we apply the Wikipedia category graph as a knowledge source for SR measures." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-17", "text": "We show that Wikipedia based SR measures yield better correlation with human judgments on SR datasets than GermaNet measures." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-18", "text": "However, using Wikipedia also leads to a performance drop on SS datasets, as knowledge about classical taxonomic relations is not explicitly modeled." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-19", "text": "Therefore, we combine GermaNet with Wikipedia, and yield substantial improvements over measures operating on a single knowledge source." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-20", "text": "----------------------------------" }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-21", "text": "**DATASETS**" }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-22", "text": "Several German datasets for evaluation of SS or SR have been created so far (see Table 1 )." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-23", "text": "Gurevych (2005) conducted experiments with a German translation of an English dataset (Rubenstein and Goodenough, 1965) , but argued that the dataset (Gur65) is too small (it contains only 65 noun pairs), and does not model SR." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-24", "text": "Thus, she created a German dataset containing 350 word pairs (Gur350) containing nouns, verbs and adjectives that are connected by classical and non-classical relations (Morris and Hirst, 2004) ." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-25", "text": "However, the dataset is biased towards strong classical relations, as word pairs were manually selected." 
}, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-26", "text": "Thus, Zesch and Gurevych (2006) semi-automatically created word pairs from domain-specific corpora." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-27", "text": "The resulting ZG222 dataset contains 222 word pairs that are connected by all kinds of lexical semantic relations." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-28", "text": "Hence, it is particularly suited for analyzing the capability of a measure to estimate SR." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-29", "text": "----------------------------------" }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-30", "text": "**SEMANTIC RELATEDNESS MEASURES**" }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-31", "text": "Semantic wordnet based measures Lesk (1986) introduced a measure (Les) based on the number of word overlaps in the textual definitions (or glosses) of two terms, where higher overlap means higher similarity." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-32", "text": "As GermaNet does not contain glosses, this measure cannot be employed." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-33", "text": "Gurevych (2005) proposed an alternative algorithm (PG) generating surrogate glosses by using a concept's relations within the hierarchy." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-34", "text": "Following the description in Budanitsky and Hirst (2006) , we further define several measures using the taxonomy structure." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-35", "text": "PL is the taxonomic path length l(c 1 , c 2 ) between two concepts c1 and c2." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-36", "text": "LC normalizes the path length with the depth of the taxonomy." 
}, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-37", "text": "Res computes SS as the information content (IC) of the lowest common subsumer (lcs) of two concepts, while JC combines path based and IC features." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-38", "text": "1 Lin is derived from information theory." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-39", "text": "Wikipedia based measures For computing the SR of two words w 1 and w 2 using Wikipedia, we first retrieve the articles or disambiguation pages with titles that equal w 1 and w 2 (see Figure 1) ." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-40", "text": "If we hit a redirect page, we retrieve the corresponding article or disambiguation page instead." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-41", "text": "In case of an article, we insert it into the candidate article set (A 1 for w1, A 2 for w 2 )." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-42", "text": "In case of a disambiguation page, the page contains links to all encoded word senses, but it may also contain other 1 Note that JC returns a distance value instead of a similarity value resulting in negative correlation with human judgments." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-43", "text": "links." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-44", "text": "Therefore, we only consider links conforming to the pattern Title_(DisambiguationText) 2 (e.g. \"Train_(roller coaster)\")." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-45", "text": "Following all such links gives the candidate article set." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-46", "text": "If no disambiguation links are found, we take the first link on the page, as most important links tend to come first." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-47", "text": "We add the corresponding articles to the candidate set." 
}, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-48", "text": "We form pairs from each candidate article a i \u2208 A 1 and each article a j \u2208 A 2 ." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-49", "text": "We then compute SR(a i , a j ) for each pair." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-50", "text": "The output of the algorithm is the maximum SR value max a i \u2208A 1 ,a j \u2208A 2 (SR(a i , a j ))." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-51", "text": "3 As most SR measures have been developed for taxonomic wordnets, porting them to Wikipedia requires some modifications (see Figure 2) ." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-52", "text": "Text overlap measures can be computed based on the article text, while path based measures operate on the category graph." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-53", "text": "We compute the overlap between article texts based on (i) the first paragraph, as it usually contains a short gloss, and (ii) the full article text." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-54", "text": "As Wikipedia articles do not form a taxonomy, path based measures have to be adapted to the Wikipedia category graph (see the right part of Figure 2 )." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-55", "text": "We define C 1 and C 2 as the set of categories assigned to article a i and a j , respectively." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-56", "text": "We compute the SR value for each category pair (c k , c l ) with c k \u2208 C 1 and c l \u2208 C 2 ." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-57", "text": "We use two different strategies to combine the resulting SR values: First, we choose the best value among all pairs (c k , c l ), i.e., the minimum for path based, and the maximum for information content based measures." 
}, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-58", "text": "As a second strategy, we average over all category pairs." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-59", "text": "Table 2 gives an overview of our experimental results on three German datasets." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-60", "text": "Best values for each dataset and knowledge source are in bold." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-61", "text": "We use the P G measure in optimal configuration as reported by Gurevych (2005) ." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-62", "text": "For the Les measure, we give the results for considering: (i) only the first paragraph (+First) and (ii) the full text (+Full)." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-63", "text": "For the path length based measures, we give the values for averaging over all category pairs (+Avg), or taking the best SR value computed among the pairs (+Best)." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-64", "text": "For each dataset, we report Pearson's correlation r with human judgments on pairs that are found in both resources (BOTH)." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-65", "text": "Otherwise, the results would not be comparable." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-66", "text": "We additionally use a subset containing only noun-noun pairs (BOTH NN)." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-67", "text": "This comparison is fairer, because article titles in Wikipedia are usually nouns." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-68", "text": "Table 2 also gives the inter annotator agreement for each subset." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-69", "text": "It constitutes an upper bound of a measure's performance." 
}, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-70", "text": "----------------------------------" }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-71", "text": "**EXPERIMENTS & RESULTS**" }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-72", "text": "Our results on Gur65 using GermaNet are very close to those published by Gurevych (2005) , ranging from 0.69-0.75." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-73", "text": "For Gur350, the performance drops to 0.38-0.50, due to the lower upper bound, and because GermaNet does not model SR well." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-74", "text": "These findings are endorsed by an even more significant performance drop on ZG222." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-75", "text": "The measures based on Wikipedia behave less uniformly." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-76", "text": "Les yields acceptable results on Gur350, but is generally not among the best performing measures." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-77", "text": "LC +Avg yields the best performance on Gur65, but is outperformed on the other datasets by P L +Best, which performs equally good for all datasets." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-78", "text": "If we compare GermaNet based and Wikipedia based measures, we find that the knowledge source has a major influence on performance." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-79", "text": "When evaluated on Gur65, that contains pairs connected by SS, GermaNet based measures perform near the upper bound and outperform Wikipedia based measures by a wide margin." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-80", "text": "On Gur350 containing a mix of SS and SR pairs, most measures perform comparably." 
}, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-81", "text": "Finally, on ZG222, that contains pairs connected by SR, the best Wikipedia based measure outperforms all GermaNet based measures." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-82", "text": "The impressive performance of P L on the SR datasets cannot be explained with the structural properties of the category graph (Zesch and Gurevych, 2007) ." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-83", "text": "Semantically related terms, that would not be closely related in a taxonomic wordnet structure, are very likely to be categorized under the same Wikipedia category, resulting in short path lengths leading to high SR." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-84", "text": "These findings are contrary to that of (Strube and Ponzetto, 2006) , where LC outperformed path length." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-85", "text": "They limited the search depth using a manually defined threshold, and did not compute SR between all candidate article pairs." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-86", "text": "Our results show that judgments on the performance of a measure must always be made with respect to the task at hand: computing SS or SR." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-87", "text": "Depending on this decision, we can choose the best underlying knowledge source." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-88", "text": "GermaNet is better for GermaNet is used for all other part-of-speech (POS) combinations." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-89", "text": "For most datasets, we find a combination strategy that outperforms all single measures." 
}, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-90", "text": "----------------------------------" }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-91", "text": "**CONCLUSION**" }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-92", "text": "We have shown that in deciding for a specific measure and knowledge source it is important to consider (i) whether the task at hand requires SS or SR, and (ii) which POS are involved." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-93", "text": "We pointed out that the underlying knowledge source has a major influence on these points." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-94", "text": "GermaNet is better used for SS, and contains nouns, verbs, and adjectives, while Wikipedia is better used for SR between nouns." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-95", "text": "Thus, GermaNet and Wikipedia can be regarded as complementary." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-96", "text": "We have shown that combining them significantly improves the performance of SR measures up to the level of human performance." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-97", "text": "Future research should focus on improving the strategies for combining complementary knowledge sources." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-98", "text": "We also need to evaluate a wider range of measures to validate our findings." }, { "sent_id": "a6b450d1113e0e6d3d2813c09d12a8-C001-99", "text": "As the simple P L measure performs remarkably well, we should also consider computing SR based on the Wikipedia article graph instead of the category graph." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "a6b450d1113e0e6d3d2813c09d12a8-C001-23" ], [ "a6b450d1113e0e6d3d2813c09d12a8-C001-33" ] ], "cite_sentences": [ "a6b450d1113e0e6d3d2813c09d12a8-C001-23", "a6b450d1113e0e6d3d2813c09d12a8-C001-33" ] }, "@USE@": { "gold_contexts": [ [ "a6b450d1113e0e6d3d2813c09d12a8-C001-61" ] ], "cite_sentences": [ "a6b450d1113e0e6d3d2813c09d12a8-C001-61" ] }, "@SIM@": { "gold_contexts": [ [ "a6b450d1113e0e6d3d2813c09d12a8-C001-72" ] ], "cite_sentences": [ "a6b450d1113e0e6d3d2813c09d12a8-C001-72" ] } } }, "ABC_65fbdd0397473763bca35376d581be_39": { "x": [ { "sent_id": "65fbdd0397473763bca35376d581be-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-2", "text": "We present a browser-based editor for simplifying English text." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-3", "text": "Given an input sentence, the editor performs both syntactic and lexical simplification." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-4", "text": "It splits a complex sentence into shorter ones, and suggests word substitutions in drop-down lists." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-5", "text": "The user can choose the best substitution from the list, undo any inappropriate splitting, and further edit the sentence as necessary." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-6", "text": "A significant novelty is that the system accepts a customized vocabulary list for a target reader population." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-7", "text": "It identifies all words in the text that do not belong to the list, and attempts to substitute them with words from the list, thus producing a text tailored for the targeted readers." 
}, { "sent_id": "65fbdd0397473763bca35376d581be-C001-8", "text": "----------------------------------" }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-10", "text": "The task of text simplification aims to rewrite a sentence so as to reduce its lexical and syntactic complexity, while preserving its meaning and grammaticality." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-11", "text": "Consider the complex sentence \"The professor, carrying numerous books, entered the room.\" It can be rewritten into two simple sentences, \"The teacher entered the room.\" and \"He was carrying many books.\" The rewriting process involves both syntactic and lexical simplification." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-12", "text": "The former decomposes the complex sentence, extracting the participial phrase \"carrying numerous books\" and turning it into a separate sentence." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-13", "text": "The latter replaces the word \"professor\" with the simpler word \"teacher\", and \"numerous\" with \"many\"." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-14", "text": "It is well known that sentences with difficult vocabulary, passive voice or complex structures, such as relative and subordinated clauses, can be challenging to understand." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-15", "text": "Text simplification has been found to be beneficial for language learners (Shirzadi, 2014) , children (Kajiwara et al., 2013) , and adults with low literacy skills (Arnaldo Candido Jr. and Erick Maziero and Caroline Gasperin and Thiago A. S. Pardo and Lucia Specia and Sandra M. Aluisio, 2009) or language disabilities (John Carroll and Guido Minnen and Darren Pearce and Yvonne Canning and Siobhan Devlin and John Tait, 1999; Luz Rello and Ricardo Baeza-Yates, 2014) ." 
}, { "sent_id": "65fbdd0397473763bca35376d581be-C001-16", "text": "To cater to these target reader populations, language teachers, linguists and other editors are often called upon to manually adapt a text." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-17", "text": "To automate this time-consuming task, there has been much effort in developing systems for lexical simplification (Zhu et al., 2010; Biran et al., 2011) and syntactic simplification (Siddharthan, 2002; Siddharthan and Angrosh, 2014) ." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-18", "text": "The performance of the state-of-the-art systems has improved significantly (Horn et al., 2014; Siddharthan and Angrosh, 2014) ." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-19", "text": "Nonetheless, one cannot expect any single system, trained on a particular dataset, to simplify arbitrary texts in a way that would suit all readers -for example, the kinds of English words and structures suitable for a native speaker in Grade 6 are unlikely to be suitable for a non-native speaker in Grade 4." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-20", "text": "Hence, human effort is generally needed for modifying the system output." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-21", "text": "To support human post-editing, a number of researchers have developed specialized editors for text simplification." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-22", "text": "While the editor described in Max (2006) shares similar goals as ours, it requires human intervention in much of the simplification process." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-23", "text": "The Automatic Text Adaptation tool suggests synonyms (Burstein et al., 2007) , but does not perform syntactic simplification." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-24", "text": "Conversely, the Simpli- fica tool, developed for Brazilian Portuguese, does not perform lexical simplification." 
}, { "sent_id": "65fbdd0397473763bca35376d581be-C001-25", "text": "Other packages for lexical simplification, such as LEXenstein (Paetzold and Specia, 2015) , are not designed for post-editing." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-26", "text": "To fill this gap, we developed a customizable, browser-based editor for simplifying English text." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-27", "text": "Besides performing automatic lexical and syntactic simplification, it facilitates user post-editing, for example in choosing candidate substitutions or undoing sentence splits." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-28", "text": "Importantly, the user can supply a vocabulary list tailored for a target reader population." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-29", "text": "This list serves to specify which words are considered \"simple,\" thus guiding the system in tailoring lexical substitution for the target readers." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-30", "text": "----------------------------------" }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-31", "text": "**LEXICAL SIMPLIFICATION**" }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-32", "text": "The lexical simplification task generally consists of three steps (Paetzold and Specia, 2015) ." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-33", "text": "The first step, substitution generation, produces a list of candidate words to substitute for the target word w. Typically, the context of w in the input sentence is not considered in this step." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-34", "text": "In the second step, substitution selection, the system selects the best candidates to replace w in the input sentence." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-35", "text": "Finally, the substitution ranking step re-ranks the candidates in terms of their simplicity." 
}, { "sent_id": "65fbdd0397473763bca35376d581be-C001-36", "text": "Often, the expected vocabulary level of a target reader population is explicitly prescribed." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-37", "text": "For example, many governments have drawn up graded vocabulary lists to guide students of English as a foreign language; likewise, developers of machine translation systems have specified controlled languages with restricted vocabulary." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-38", "text": "In this context, lexical simplification can be defined as follows: to rewrite a sentence by replacing all words that are not in the given vocabulary list (and hence presumed to be difficult for the reader) with those from the list (and hence presumed to be simple)." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-39", "text": "For example, Kajiwara et al. (2013) performed lexical simplification based on 5,404 words that elementary school children are expected to know." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-40", "text": "----------------------------------" }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-41", "text": "**ALGORITHM**" }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-42", "text": "By default, the editor uses a list of approximately 4,000 words that all students in Hong Kong are expected to know upon graduation from primary school (EDB, 2012)." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-43", "text": "However, the user can also upload his or her own vocabulary list." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-44", "text": "Given an input sentence, we first identify the target words, namely those words that do not appear in the vocabulary list." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-45", "text": "Following Horn et al. 
(2014) , our system simplifies neither proper nouns, as identified by the Natural Language Toolkit (Bird et al., 2009) , nor words in our stoplist, which are already simple." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-46", "text": "In terms of the three-step framework described above, we use the word2vec model 1 to retrieve candidates for substitution in the first step." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-47", "text": "We trained the model with all sentences from Wikipedia." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-48", "text": "For each target word, the model returns a list of the most similar words; we extract the top 20 in this list that are included in the user-supplied vocabulary list." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-49", "text": "In the next step, substitution selection, we re-rank these 20 words with a language model." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-50", "text": "We trained a trigram model with KenLM (Heafield, 2011) , again using all sentences from Wikipedia." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-51", "text": "We then place the 10 words with the highest probabilities in a drop-down list in our editor 2 ; for example, Figure 1 shows the ten candidates offered for the word \"municipal\"." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-52", "text": "If none of the candidates are appropriate, the user can easily revert to the original word, which is also included in the drop-down list; alternatively, the user can click on the text to directly edit it." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-53", "text": "----------------------------------" }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-54", "text": "**EVALUATION**" }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-55", "text": "We evaluated the performance of our algorithm on the Mechanical Turk Lexical Simplification Data Set (Horn et al., 2014) ." 
}, { "sent_id": "65fbdd0397473763bca35376d581be-C001-56", "text": "This dataset contains 500 manually annotated sentences; the target word in each sentence was annotated by 50 independent annotators." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-57", "text": "To simulate a teacher adapting an English text for Hong Kong pupils, we used the vocabulary list from the Hong Kong Education Bureau (EDB, 2012) ." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-58", "text": "To enable automatic evaluation, we considered only the 249 sentences in the dataset whose target word is not in our vocabulary list, but whose human annotations contain at least one word in the list." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-59", "text": "Precision is at 31% for the top candidate; it is at 57% for the top ten candidates." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-60", "text": "In other words, for 57% of the target words, a valid substitution can be found in the drop-down list in the editor." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-61", "text": "The input sentence is \"City of Faizabad, the headquarters of Faizabad District, is a municipal board in the state of Uttar Pradesh, India, and situated on the banks of river Ghaghra.\" For syntactic simplification (Section 3), the system first splits its coordinated clauses into two sentences, S 1 =\"City of Faizabad ... state of Uttar Pradesh, India.\"; and S 2 =\"City of Faizabad is situated on the banks of river Ghaghra\"." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-62", "text": "It then further extracts the appositive phrase \"the headquarters of Faizabad District\" from S 1 , and turns into a separate sentence." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-63", "text": "For lexical simplification (Section 2), the system offers eight substitution candidates for the word \"municipal\" in a drop-down list." 
}, { "sent_id": "65fbdd0397473763bca35376d581be-C001-64", "text": "----------------------------------" }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-65", "text": "**SYNTACTIC SIMPLIFICATION**" }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-66", "text": "The editor performs automatic syntactic simplification for seven grammatical constructs." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-67", "text": "In a complex sentence, it identifies relative clauses, adverbial clauses, coordinated clauses, subordinated clauses, participial phrases and appositive phrases; it then splits the sentence into two simpler ones." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-68", "text": "Further, it transforms passive voice into active voice when the agent is explicitly mentioned." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-69", "text": "Examples of these constructs and their simplifications are listed in Table 1 ." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-70", "text": "----------------------------------" }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-71", "text": "**ALGORITHM**" }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-72", "text": "The system follows the three-step framework of analysis, transformation and regeneration, as laid out in Siddharthan (2002) ." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-73", "text": "In the analysis step, it parses the input sentence with the Stanford dependency parser (Manning et al., 2014) ." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-74", "text": "In the transformation step, it scans the parse tree of the input sentence to match subtree patterns that have been manually crafted for each of the seven constructs in Table 1 ." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-75", "text": "In Figure 1 , the input sentence matches the subtree pattern for coordination; it is therefore split into two shorter sentences, S 1 =\"City of Faizabad ... 
India.\" and S 2 =\"and situated ... river Ghaghra\"." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-76", "text": "Since S 1 then matches the pattern for appositive phrase, the phrase \"the headquarters of Faizabad District\" is taken out to form its own sentence." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-77", "text": "If the user finds a sentence split to be inappropriate, he or she can click on the \"Merge\" button to undo the split." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-78", "text": "Finally, in the regeneration step, the editor restores the subject (e.g., \"City of Faizabad\") to newly formed sentences." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-79", "text": "Often, this step also requires generation of referring expressions, determiners, conjunctions and sentence re-ordering." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-80", "text": "Since most of these tasks require real-world knowledge, the editor currently leaves it to the user for post-editing." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-81", "text": "----------------------------------" }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-82", "text": "**EVALUATION**" }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-83", "text": "We evaluated the quality of syntactic simplification on the first 300 sentences in the Mechanical Turk Lexical Simplification Data Set (Horn et al., 2014) ." 
}, { "sent_id": "65fbdd0397473763bca35376d581be-C001-84", "text": "For each sentence, we asked a professor of linguistics to mark the types of syntactic simplification (" }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-85", "text": "----------------------------------" }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-86", "text": "**CONCLUSIONS AND FUTURE WORK**" }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-87", "text": "We have presented a browser-based editor that performs lexical and syntactic simplification and supports human post-editing." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-88", "text": "The editor takes a customized vocabulary list as input, such that its lexical substitutions are tailored to the needs of the target reader population." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-89", "text": "Evaluation shows that, for a majority of sentences in a test set, the editor is able to propose appropriate word substitutions and to split up complex syntactic structures." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-90", "text": "In future work, we aim to further improve the quality of simplification, and to offer annotations for difficult words that cannot be simplified." }, { "sent_id": "65fbdd0397473763bca35376d581be-C001-91", "text": "3 We also intend to perform empirical studies, to measure the editor's effectiveness in assisting teachers in language lesson planning." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "65fbdd0397473763bca35376d581be-C001-18" ] ], "cite_sentences": [ "65fbdd0397473763bca35376d581be-C001-18" ] }, "@USE@": { "gold_contexts": [ [ "65fbdd0397473763bca35376d581be-C001-45" ], [ "65fbdd0397473763bca35376d581be-C001-55" ], [ "65fbdd0397473763bca35376d581be-C001-83" ] ], "cite_sentences": [ "65fbdd0397473763bca35376d581be-C001-45", "65fbdd0397473763bca35376d581be-C001-55", "65fbdd0397473763bca35376d581be-C001-83" ] } } }, "ABC_9ea99bf57e9113b2f03f2285741397_39": { "x": [ { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-117", "text": "Does the pointer-generator learn how to switch?" }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-118", "text": "We found that our pointer-generator model generates sentences that have not been seen before." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-2", "text": "Building large-scale datasets for training code-switching language models is challenging and very expensive." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-3", "text": "To alleviate this problem using parallel corpus has been a major workaround." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-4", "text": "However, existing solutions use linguistic constraints which may not capture the real data distribution." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-5", "text": "In this work, we propose a novel method for learning how to generate code-switching sentences from parallel corpora." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-6", "text": "Our model uses a Seq2Seq model in combination with pointer networks to align and choose words from the monolingual sentences and form a grammatical code-switching sentence." 
}, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-7", "text": "In our experiment, we show that by training a language model using the augmented sentences we improve the perplexity score by 10% compared to the LSTM baseline." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-8", "text": "----------------------------------" }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-10", "text": "Language mixing has been a common phenomenon in multilingual communities." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-11", "text": "It is motivated in response to social factors as a way of communication in a multicultural society." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-12", "text": "From a sociolinguistic perspective, individuals do code-switching in order to construct an optimal interaction by accomplishing the conceptual, relational-interpersonal, and discourse-presentational meaning of conversation [1] ." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-13", "text": "In its practice, the variation of code-switching will vary due to the traditions, beliefs, and normative values in the respective communities." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-14", "text": "A number of studies [2, 3, 4, 5] found that code-switching is not produced indiscriminately, but follows syntactic constraints." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-15", "text": "Many linguists formulated various constraints to define a general rule for code-switching [2, 4, 5] ." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-16", "text": "However, the constraints are not enough to make a good generalization of real code-switching constraints, and they have not been tested in large-scale corpora for many language pairs." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-17", "text": "One of the biggest problem in code-switching is collecting large scale corpora." 
}, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-18", "text": "Speech data have to be collected from a spontaneous speech by bilingual speakers and the codeswitching has to be triggered naturally during the conversation." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-19", "text": "In order to solve the data scarcity issue, code-switching data generation is useful to increase the volume and variance." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-20", "text": "A linguistics constraint-driven generation approach such as equivalent constraint [6, 7] is not restrictive to languages with distinctive grammar structure." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-21", "text": "In this paper, we propose a novel language-agnostic method to learn how to generate code-switching sentences by using a pointer-generator network [8] ." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-22", "text": "The model is trained from concatenated sequences of parallel sentences to generate code-switching sentences, constrained by codeswitching texts." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-23", "text": "The pointer network copies words from both languages and pastes them into the output, generating code switching sentences in matrix language to embedded language and vice versa." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-24", "text": "The attention mechanism helps the decoder to generate meaningful and grammatical sentences without needing any sequence alignment." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-25", "text": "This idea is also in line with code-mixing by borrowing words from the embedded language [9] and intuitively, the copying mechanism can be seen as an end-to-end approach to translate, align, and reorder the given words into a grammatical code-switching sentence." 
}, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-26", "text": "This approach is the unification of all components in the work of [6] into a single computational model." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-27", "text": "A code-switching language model learned in this way is able to capture the patterns and constraints of the switches and mitigate the out-of-vocabulary (OOV) issue during sequence generation." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-28", "text": "By adding the generated sentences and incorporating syntactic information to the training data, we achieve better performance by 10% compared to an LSTM baseline [10] and 5% to the equivalent constraint." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-29", "text": "----------------------------------" }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-30", "text": "**RELATED WORK**" }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-31", "text": "The synthetic code-switching generation approach was introduced by adapting equivalence constraint on monolingual sentence pairs during the decoding step on an automatic speech recognition (ASR) model [6] ." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-32", "text": "[11] explored Functional Head Constraint, which was found to be more restrictive than the Equivalence Constraint, but complex to be implemented, by using a lattice parser with a weighted finitestate transducer." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-33", "text": "[12] extended the RNN by adding POS information to the input layer and factorized output layer with a language identifier." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-34", "text": "Then, Factorized RNN networks were combined with an n-gram backoff model using linear interpolation [13] ." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-35", "text": "[14] added syntactic and semantic features to the Factorized RNN networks." 
}, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-36", "text": "[15] adapted an effective curriculum learning by training a network with monolingual corpora of both languages, and subsequently train on codeswitched data." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-37", "text": "A further investigation of Equivalence Constraint and Curriculum Learning showed an improvement in language modeling [7] ." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-38", "text": "A multi-task learning approach was introduced to train the syntax representation of languages by constraining the language generator [10] ." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-39", "text": "A copy mechanism was proposed to copy words directly from the input to the output using an attention mechanism [16] ." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-40", "text": "This mechanism has proven to be effective in several NLP tasks including text summarization [8] , and dialog systems [17] ." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-41", "text": "The common characteristic of these tasks is parts of the output are exactly the same as the input source." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-42", "text": "For example, in dialog systems the responses most of the time have appeared in the previous dialog steps." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-43", "text": "----------------------------------" }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-44", "text": "**METHODOLOGY**" }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-45", "text": "We use a sequence to sequence (Seq2Seq) model in combination with pointer and copy networks [8] to align and choose words from the monolingual sentences and generate a code-switching sentence." 
}, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-46", "text": "The models' input is the concatenation of the two monolingual sentences, denoted as [w sumption is that almost all, the token present in the codeswitching sentence are also present in the source monolingual sentences." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-47", "text": "Our model leverages this property by copying input tokens, instead of generating vocabulary words." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-48", "text": "This approach has two major advantages: (1) the learning complexity decreases since it relies on copying instead of generating; (2) improvement in generalization, the copy mechanism could produce words from the input that are not present in the vocabulary." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-49", "text": "----------------------------------" }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-50", "text": "**POINTER-GENERATOR NETWORK**" }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-51", "text": "Instead of generating words from a large vocabulary space using a Seq2Seq model with attention [18] , pointer-generator network [8] is proposed to copy words from the input to the output using an attention mechanism and generate the output sequence using decoders." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-52", "text": "The network is depicted in Figure 1 ." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-53", "text": "For each decoder step, a generation probability p gen \u2208 [0,1] is calculated, which weights the probability of generating words from the vocabulary, and copying words from the source text." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-54", "text": "p gen is a soft gating probability to decide whether generating the next token from the decoder or copying the word from the input instead." 
}, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-55", "text": "The attention distribution a t is a standard attention with general scoring [18] ." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-56", "text": "It considers all encoder hidden states to derive the context vector." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-57", "text": "The vocabulary distribution P vocab (w) is calculated by concatenating the decoder state s t and the context vector h * t ." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-58", "text": "where w h * , w s , w x are trainable parameters and b ptr is the scalar bias." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-59", "text": "The vocabulary distribution P vocab (w) and the attention distribution a t are weighted and summed to obtain the final distribution P (w)." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-60", "text": "The final distribution is calculated as follows: We use a beam search to select N -best code-switching sentences and concatenate the generated sentence with the training set to form a larger dataset." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-61", "text": "The result of the generated code-switching sentences is showed in Table 1 ." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-62", "text": "As our baseline, we compare our proposed method with three other models: (1) We use Seq2Seq with attention; (2) We generate sequences that satisfy Equivalence Constraint [6] ." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-63", "text": "The constraint doesn't allow any switch within a crossing of two word alignments." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-64", "text": "We use FastAlign [19] as the word aligner 1 ; (3) We also form sentences using the alignments without any constraint." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-65", "text": "The number of the generated sentences are equivalent to 3-best data from the pointer-generator model." 
}, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-66", "text": "To increase the generation variance, we randomly permute each alignment to form a new sequence." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-67", "text": "----------------------------------" }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-68", "text": "**LANGUAGE MODELING**" }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-69", "text": "The quality of the generated code-switching sentences is evaluated using a language modeling task." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-70", "text": "Indeed, if the perplexity in this task drops consistently we can assume that the generated sentences are well-formed." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-71", "text": "Hence, we use an LSTM language model with weight tying [20] that can capture an unbounded number of context words to approximate the probability of the next word." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-72", "text": "Syntactic information such as Partof-speech (POS) [p 1 , ..., p T ] is added to further improve the performance." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-73", "text": "The POS tags are generated phrase-wise using pretrained English and Chinese Stanford POS Tagger [21] by adding a word at a time in a unidirectional way to avoid any intervention from future information." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-74", "text": "The word and syntax unit are represented as a vector x w and x p respectively." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-75", "text": "Next, we concatenate both vectors and use it as an input [x w |x p ] to an LSTM layer similar to [10] . 4." 
}, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-76", "text": "EXPERIMENT" }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-77", "text": "----------------------------------" }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-78", "text": "**CORPUS**" }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-79", "text": "In our experiment, we use a conversational Mandarin-English code-switching speech corpus called SEAME Phase II (South East Asia Mandarin-English)." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-80", "text": "The data are collected from spontaneously spoken interviews and conversations in Singapore and Malaysia by bilinguals [22] ." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-81", "text": "As the data preprocessing, words are tokenized using Stanford NLP toolkit [23] and all hesitations and punctuations were removed except apostrophe." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-82", "text": "The split of the dataset is identical to [10] and it is showed in Table 1 ." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-83", "text": "----------------------------------" }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-84", "text": "**TRAINING SETUP**" }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-85", "text": "In this section, we present the experimental settings for pointer-generator network and language model." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-86", "text": "Our experiment, our pointer-generator model has 500-dimensional hidden states and word embeddings." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-87", "text": "We use 50k words as our vocabulary for source and target." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-88", "text": "We evaluate our pointer-generator performance using BLEU 2 score." 
}, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-89", "text": "We take the best model as our generator and during the decoding stage, we generate 1-best and 3-best using beam search with a beam size of 5." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-90", "text": "For the input, we build a parallel monolingual corpus by translating the mixed language sequence using Google NMT 3 to English (w 1 ) and Mandarin (w 2 ) sequences." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-91", "text": "Then, we concatenate the translated English and Mandarin sequences and assign code-switching sequences as the labels (y cs )." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-92", "text": "The baseline language model is trained using RNNLM [24] 4 ." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-93", "text": "Then, we train our 2-layer LSTM models with a hidden size of 500 and unrolled for 35 steps." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-94", "text": "The embedding size is equal to the LSTM hidden size for weight tying." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-95", "text": "We optimize our model using SGD with initial learning rates of {10, 20}. If there is no improvement during the evaluation, we reduce the learning rate by a factor of 0.75." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-96", "text": "In each time step, we apply dropout to both embedding layer and recurrent network." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-97", "text": "The gradient is clipped to a maximum of 0.25." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-98", "text": "Perplexity measure is used in the evaluation." 
}, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-99", "text": "2 BLEU is computed using multi bleu.perl from MOSES package 3 Google NMT Translate API 4 RNNLM toolkit from http://www.fit.vutbr.cz/\u223cimikolov/rnnlm/" }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-100", "text": "----------------------------------" }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-101", "text": "**RESULTS**" }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-102", "text": "The pointer-generator significantly outperforms the Seq2Seq with attention model by 3.58 BLEU points on the test set as shown in Table 2 ." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-103", "text": "Our language modeling result is given in Table 3 ." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-104", "text": "Based on the empirical result, adding generated samples consistently improve the performance of all models with a moderate margin around 10% in perplexity." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-105", "text": "After all, our proposed method still slightly outperforms the heuristic from linguistic constraint." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-106", "text": "In addition, we get a crucial gain on performance by adding syntax representation of the sequences." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-107", "text": "----------------------------------" }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-108", "text": "**CHANGE IN DATA DISTRIBUTION:**" }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-109", "text": "To further analyze the generated result, we observed the distribution of real codeswitching data and the generated code-switching data." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-110", "text": "From Figure 2 , we can see that 1-best and real code-switching data have almost identical distributions." 
}, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-111", "text": "The distributions are left-skewed where the overall mean is less than the median." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-112", "text": "Interestingly, the distribution of the 3-best data is less skewed and generates a new set of n-grams such as \"\u90a3 \u4e2a (that) proposal\" which was learned from other code-switching sequences." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-113", "text": "As a result, generating more samples effects the performance positively." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-114", "text": "Importance of Linguistic Constraint: The result in Table 3 emphasizes that linguistic constraints have some significance in replicating the real code-switching patterns, specifically the equivalence constraint." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-115", "text": "There is a slight reduction in perplexity around 6 points on the test set." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-116", "text": "In addition, when we ignore the constraint, we lose performance because it still allows switches in the inversion grammar cases." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-119", "text": "The example in Figure 1 shows that our model is able to construct a new well-formed sentence such as \"\u6211 \u4eec \u8981 \u53bb (We want to) check\"." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-120", "text": "It is also shown that the pointer-generator model has the capability to learn the characteristics of the linguistic constraints from data without any word alignment between the matrix and embedded languages." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-121", "text": "On the other hand, training using 3-best data obtains better performance compared to 1-best data." 
}, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-122", "text": "We found a positive correlation from Table 1 , where 3-best data is more similar to the test set in terms of segment length and number of switches compared to 1-best data." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-123", "text": "Adding more samples N may improve the performance, but it will be saturated at a certain point." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-124", "text": "One way to solve this is by using more parallel samples." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-125", "text": "----------------------------------" }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-126", "text": "**CONCLUSION**" }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-127", "text": "We introduce a new learning method for code-switching sentence generation using a parallel monolingual corpus that is applicable to any language pair." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-128", "text": "Our experimental result shows that adding generated sentences to the training data, effectively improves our model performance." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-129", "text": "Combining the generated samples with code-switching dataset reduces perplexity." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-130", "text": "We get further performance gain after using syntactic information of the input." }, { "sent_id": "9ea99bf57e9113b2f03f2285741397-C001-131", "text": "In future work, we plan to explore reinforcement learning for sequence generation and employ more parallel corpora." 
} ], "y": { "@DIF@": { "gold_contexts": [ [ "9ea99bf57e9113b2f03f2285741397-C001-28" ] ], "cite_sentences": [ "9ea99bf57e9113b2f03f2285741397-C001-28" ] }, "@BACK@": { "gold_contexts": [ [ "9ea99bf57e9113b2f03f2285741397-C001-38" ] ], "cite_sentences": [ "9ea99bf57e9113b2f03f2285741397-C001-38" ] }, "@SIM@": { "gold_contexts": [ [ "9ea99bf57e9113b2f03f2285741397-C001-75" ], [ "9ea99bf57e9113b2f03f2285741397-C001-82" ] ], "cite_sentences": [ "9ea99bf57e9113b2f03f2285741397-C001-75", "9ea99bf57e9113b2f03f2285741397-C001-82" ] } } }, "ABC_6cc36fef99fb1f25370175452f30b0_39": { "x": [ { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-2", "text": "We compare the CCG parser of Clark and Curran (2007) with a state-of-the-art Penn Treebank (PTB) parser." }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-3", "text": "An accuracy comparison is performed by converting the CCG derivations into PTB trees." }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-4", "text": "We show that the conversion is extremely difficult to perform, but are able to fairly compare the parsers on a representative subset of the PTB test section, obtaining results for the CCG parser that are statistically no different to those for the Berkeley parser." }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-5", "text": "----------------------------------" }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-7", "text": "There are a number of approaches emerging in statistical parsing." }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-8", "text": "The first approach, which began in the mid-90s and now has an extensive literature, is based on the Penn Treebank (PTB) parsing task: inferring skeletal phrase-structure trees for unseen sentences of the WSJ, and evaluating accuracy according to the Parseval metrics." 
}, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-9", "text": "Collins (1999) is a seminal example." }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-10", "text": "The second approach is to apply statistical methods to parsers based on linguistic formalisms, such as HPSG, LFG, TAG, and CCG, with the grammar being defined manually or extracted from a formalism-specific treebank." }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-11", "text": "Evaluation is typically performed by comparing against predicate-argument structures extracted from the treebank, or against a test set of manually annotated grammatical relations (GRs)." }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-12", "text": "Examples of this approach include Riezler et al. (2002) , Miyao and Tsujii (2005) , Briscoe and Carroll (2006) , and Clark and Curran (2007) ." }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-13", "text": "1 Despite the many examples from both approaches, there has been little comparison across the two groups, which we refer to as PTB parsing and formalism-based parsing, respectively." }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-14", "text": "The PTB parser we use for comparison is the publicly available Berkeley parser (Petrov and Klein, 2007) ." }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-15", "text": "The formalism-based parser we use is the CCG parser of Clark and Curran (2007) , which is based on CCGbank (Hockenmaier and Steedman, 2007) , a CCG version of the Penn Treebank." }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-16", "text": "We compare this parser with a PTB parser because both are derived from the same original source, and both produce phrase-structure in some form or another; the interesting question is whether anything is gained by converting the PTB into CCG." 
}, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-17", "text": "2 The comparison focuses on accuracy and is performed by converting CCG derivations into PTB phrase-structure trees." }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-18", "text": "A contribution of this paper is to demonstrate the difficulty of mapping from a grammatical resource based on the PTB back to the PTB, and we also comment on the (non-)suitability of the PTB as a general formalism-independent evaluation resource." }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-19", "text": "A second contribution is to provide the first accuracy comparison of the CCG parser with a PTB parser, obtaining competitive scores for the CCG parser on a representative subset of the PTB test sections." }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-20", "text": "It is important to note that the purpose of this evaluation is comparison with a PTB parser, rather than evaluation of the CCG parser per se." }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-21", "text": "The CCG parser has been extensively evaluated elsewhere (Clark and Curran, 2007) , and arguably GRs or predicate-argument structures provide a more suitable test set for the CCG parser than PTB phrase-structure trees." }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-22", "text": "----------------------------------" }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-23", "text": "**THE CCG TO PTB CONVERSION**" }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-24", "text": "There has been much recent work in attempting to convert native parser output into alternative representations for evaluation purposes, e.g. (Clark and Curran, 2007; Matsuzaki and Tsujii, 2008 )." }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-25", "text": "The conclusion is that such conversions are surprisingly difficult." 
}, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-26", "text": "Clark and Curran (2007) shows that converting gold-standard CCG derivations into the GRs in DepBank resulted in an Fscore of only 85%; hence the upper bound on the performance of the CCG parser, using this evaluation scheme, was only 85%." }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-27", "text": "Given that the current best scores for the PTB parsing task are over 90%, any loss from the conversion process needs to be considered carefully if a fair comparison with PTB parsers is to be achieved." }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-28", "text": "CCGbank was derived from the PTB, and so it might be considered that converting back to the PTB would be a relatively easy task, by essentially reversing the mapping Hockenmaier and Steedman (2007) used to create CCGbank." }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-29", "text": "However, there are a number of differences between the two treebanks which make the conversion back far from trivial." }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-30", "text": "First, the corresponding derivations in the treebanks are not isomorphic: a CCG derivation is not simply a relabelling of the nodes in the PTB tree; there are many constructions, such as coordination and control structures, where the trees are a different shape, as well as having different labels." }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-31", "text": "It is important to realise that Hockenmaier and Steedman (2007) invested a significant amount of time and effort in creating the mapping." }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-32", "text": "Second, some of the labels in the PTB do not appear in CCGbank, for example the QP label, and these must be added back in; however, developing rules to insert these labels in the right places is a far from trivial task." 
}, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-33", "text": "There were two approaches we considered for the conversion." }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-34", "text": "One possibility is to associate PTB tree structures with CCG lexical categories, and combine the trees together in step with the category combinations in a CCG derivation -in much the same way that an LTAG has elementary trees in the lexicon which are combined using the substitution and adjunction rules of TAG." }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-35", "text": "The second approach is to associate conversion rules with each local tree -i.e." }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-36", "text": "a parent and one or two child nodes -which appears in the CCGbank data." }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-37", "text": "3 In this paper we took the second approach." }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-38", "text": "----------------------------------" }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-39", "text": "**CONVERSION SCHEMAS**" }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-40", "text": "There are three types of conversion schema: schemas which introduce nodes for lexical items; schemas which insert or elide PTB nodes for unary 3 Another possible approach has been taken by Matsuzaki and Tsujii (2008) , who convert HPSG analyses from a grammar automatically extracted from the PTB back into the PTB." }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-41", "text": "They treat the problem as one of translation, learning a synchronous grammar to perform the mapping." }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-42", "text": "A PTB tree is built from a CCG derivation by running over the derivation in a bottom-up fashion and applying these schemas to the local trees in the derivation." 
}, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-43", "text": "----------------------------------" }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-44", "text": "**SCHEMA DEVELOPMENT**" }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-45", "text": "The schemas were developed by manual inspection using section 00 of CCGbank and the PTB as a development set, following the oracle methodology of Clark and Curran (2007) , in which goldstandard derivations from CCGbank are converted to the new representation and compared with the gold standard for that representation." }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-46", "text": "As well as giving an idea of the difficulty, and success, of the conversion, the resulting numbers provide an up- In total, we annotated 32 unary and 776 binary rule instances (of the possible 2853 instances) with conversion schemas, and 162 of the 425 lexical categories." }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-47", "text": "We also implemented a small number of default catch-all cases for the general CCG combinatory rules and for the rules dealing with punctuation, which allowed most of the 2853 rule instances to be covered." }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-48", "text": "Considerable time and effort was invested in the creation of these schemas." }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-49", "text": "The oracle conversion results from the gold standard CCGbank to the PTB for section 00 and 23 are shown in Table 2 ." }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-50", "text": "The numbers are bracketing precision, recall, F-score and complete sentence matches, using the EVALB evaluation script." }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-51", "text": "Note that these figures provide an upper bound on the performance of the CCG parser using EVALB, given the current conversion process." 
}, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-52", "text": "The importance of this upper bound should not be underestimated, when the evaluation framework is such that incremental improvements of a few tenths of a percent are routinely presented as improving the state-of-the-art, as is the case with the Parseval metrics." }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-53", "text": "The fact that the upper bound here is less than 95% shows that it is not possible to fairly evaluate the CCG parser on the complete test set." }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-54", "text": "Even an upper bound of around 98%, which is achieved by Matsuzaki and Tsujii (2008) , is not sufficient, since this guarantees a loss of at least 2%." }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-55", "text": "4" }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-56", "text": "----------------------------------" }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-57", "text": "**EVALUATION**" }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-58", "text": "The Berkeley parser (Petrov and Klein, 2007) provides performance close to the state-of-the-art for the PTB parsing task, with reported F-scores of around 90%." }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-59", "text": "Since the oracle score for CCGbank is less than 95%, it would not be a fair comparison to use the complete test set." }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-60", "text": "However, there are a number of sentences which are correct, or almost correct, according to EVALB after the conversion, and we are able to use those for a fair comparison." }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-61", "text": "Table 3 gives the EVALB results for the CCG parser on various subsets of section 00 of the PTB." 
}, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-62", "text": "The first row shows the results on only those sentences which the conversion process can convert sucessfully (as measured by converting gold-standard CCGbank derivations and comparing with PTB trees; although, to be clear, the scores are for the CCG parser on those sentences)." }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-63", "text": "As can be seen from the scores, these sentences form a slightly easier subset than the full section 00, but this is a subset which can be used for a fair comparison against the Berkeley parser, since the conversion process is not lossy for this subset." }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-64", "text": "The second row shows the scores on those sentences for which the conversion process was somewhat lossy, but when the gold-standard CCGbank derivations are converted, the oracle F-measure is greater than 95%." }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-65", "text": "The third row is similar, but for sentences for which the oracle F-score is geater than 92%." }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-66", "text": "The final row is for the whole of section 00." }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-67", "text": "The UB column gives the upper bound on the accuracy of the CCG parser." }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-68", "text": "Results are calculated using both gold standard and automatically assigned POS tags; # is the number of sentences in the sample, and the % column gives the sample size as a percentage of the whole section." }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-69", "text": "We compare the CCG parser to the Berkeley parser using the accurate mode of the Berkeley parser, together with the model supplied with the publicly available version." 
}, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-70", "text": "Table 3 gives the results for Section 23, comparing the CCG and Berkeley parsers." }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-71", "text": "The projected columns give the projected scores for the CCG parser, if it performed at the same accuracy level for those sentences which could not be converted successfully." }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-72", "text": "The purpose of this column is to obtain an approximation of the CCG parser score for a perfect conversion process." }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-73", "text": "5 The results in bold are those which we consider to be a fair comparison against the Berkeley parser." }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-74", "text": "The difference in scores is not statistically significant at p=0.05 (using Dan Bikel's stratified shuffling test)." }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-75", "text": "One possible objection to this comparison is that the subset for which we have a fair compar- ison is likely to be an easy subset consisting of shorter sentences, and so the most that can be said is that the CCG parser performs as well as the Berkeley parser on short sentences." }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-76", "text": "In fact, the subset for which we perform a perfect conversion contains sentences with an average length of 18.1 words, compared to 21.4 for sentences with 40 words or less (a standard test set for reporting Parseval figures)." }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-77", "text": "Hence we do consider the comparison to be highly informative." 
}, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-78", "text": "----------------------------------" }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-79", "text": "**CONCLUSION**" }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-80", "text": "One question that is often asked of the CCG parsing work is \"Why not convert back into the PTB representation and perform a Parseval evaluation?\" By showing how difficult the conversion is, we believe that we have finally answered this question, as well as demonstrating comparable performance with the Berkeley parser." }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-81", "text": "In addition, we have thrown further doubt on the possible use of the PTB for cross-framework parser evaluation, as recently suggested by Matsuzaki and Tsujii (2008) ." }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-82", "text": "Even the smallest loss due to mapping across representations is significant when a few tenths of a percentage point matter." }, { "sent_id": "6cc36fef99fb1f25370175452f30b0-C001-83", "text": "Whether PTB parsers could be competitive on alternative parser evaluations, such as those using GR schemes, for which the CCG parser performs very well, is an open question." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "6cc36fef99fb1f25370175452f30b0-C001-24" ], [ "6cc36fef99fb1f25370175452f30b0-C001-40" ], [ "6cc36fef99fb1f25370175452f30b0-C001-54" ] ], "cite_sentences": [ "6cc36fef99fb1f25370175452f30b0-C001-24", "6cc36fef99fb1f25370175452f30b0-C001-40", "6cc36fef99fb1f25370175452f30b0-C001-54" ] }, "@MOT@": { "gold_contexts": [ [ "6cc36fef99fb1f25370175452f30b0-C001-24", "6cc36fef99fb1f25370175452f30b0-C001-25" ] ], "cite_sentences": [ "6cc36fef99fb1f25370175452f30b0-C001-24" ] }, "@DIF@": { "gold_contexts": [ [ "6cc36fef99fb1f25370175452f30b0-C001-53", "6cc36fef99fb1f25370175452f30b0-C001-54" ] ], "cite_sentences": [ "6cc36fef99fb1f25370175452f30b0-C001-54" ] }, "@SIM@": { "gold_contexts": [ [ "6cc36fef99fb1f25370175452f30b0-C001-81" ] ], "cite_sentences": [ "6cc36fef99fb1f25370175452f30b0-C001-81" ] } } }, "ABC_65f7546e2abfd74c0daa43c25ca63f_39": { "x": [ { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-20", "text": "A linguistics constraint-driven generation approach such as equivalent constraint [6, 7] is not restrictive to languages with distinctive grammar structure." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-21", "text": "In this paper, we propose a novel language-agnostic method to learn how to generate code-switching sentences by using a pointer-generator network [8] ." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-22", "text": "The model is trained from concatenated sequences of parallel sentences to generate code-switching sentences, constrained by codeswitching texts." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-23", "text": "The pointer network copies words from both languages and pastes them into the output, generating code switching sentences in matrix language to embedded language and vice versa." 
}, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-24", "text": "The attention mechanism helps the decoder to generate meaningful and grammatical sentences without needing any sequence alignment." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-25", "text": "This idea is also in line with code-mixing by borrowing words from the embedded language [9] and intuitively, the copying mechanism can be seen as an end-to-end approach to translate, align, and reorder the given words into a grammatical code-switching sentence." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-26", "text": "This approach is the unification of all components in the work of [6] into a single computational model." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-27", "text": "A code-switching language model learned in this way is able to capture the patterns and constraints of the switches and mitigate the out-of-vocabulary (OOV) issue during sequence generation." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-28", "text": "By adding the generated sentences and incorporating syntactic information to the training data, we achieve better performance by 10% compared to an LSTM baseline [10] and 5% to the equivalent constraint." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-29", "text": "----------------------------------" }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-30", "text": "**RELATED WORK**" }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-31", "text": "The synthetic code-switching generation approach was introduced by adapting equivalence constraint on monolingual sentence pairs during the decoding step on an automatic speech recognition (ASR) model [6] ." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-32", "text": "[11] explored Functional Head Constraint, which was found to be more restrictive than the Equivalence Constraint, but complex to be implemented, by using a lattice parser with a weighted finitestate transducer." 
}, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-33", "text": "[12] extended the RNN by adding POS information to the input layer and factorized output layer with a language identifier." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-34", "text": "Then, Factorized RNN networks were combined with an n-gram backoff model using linear interpolation [13] ." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-35", "text": "[14] added syntactic and semantic features to the Factorized RNN networks." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-36", "text": "[15] adapted an effective curriculum learning by training a network with monolingual corpora of both languages, and subsequently train on codeswitched data." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-37", "text": "A further investigation of Equivalence Constraint and Curriculum Learning showed an improvement in language modeling [7] ." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-38", "text": "A multi-task learning approach was introduced to train the syntax representation of languages by constraining the language generator [10] ." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-39", "text": "A copy mechanism was proposed to copy words directly from the input to the output using an attention mechanism [16] ." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-40", "text": "This mechanism has proven to be effective in several NLP tasks including text summarization [8] , and dialog systems [17] ." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-41", "text": "The common characteristic of these tasks is parts of the output are exactly the same as the input source." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-42", "text": "For example, in dialog systems the responses most of the time have appeared in the previous dialog steps." 
}, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-43", "text": "----------------------------------" }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-44", "text": "**METHODOLOGY**" }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-45", "text": "We use a sequence to sequence (Seq2Seq) model in combination with pointer and copy networks [8] to align and choose words from the monolingual sentences and generate a code-switching sentence." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-46", "text": "The models' input is the concatenation of the two monolingual sentences, denoted as [w sumption is that almost all, the token present in the codeswitching sentence are also present in the source monolingual sentences." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-47", "text": "Our model leverages this property by copying input tokens, instead of generating vocabulary words." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-48", "text": "This approach has two major advantages: (1) the learning complexity decreases since it relies on copying instead of generating; (2) improvement in generalization, the copy mechanism could produce words from the input that are not present in the vocabulary." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-49", "text": "----------------------------------" }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-50", "text": "**POINTER-GENERATOR NETWORK**" }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-51", "text": "Instead of generating words from a large vocabulary space using a Seq2Seq model with attention [18] , pointer-generator network [8] is proposed to copy words from the input to the output using an attention mechanism and generate the output sequence using decoders." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-52", "text": "The network is depicted in Figure 1 ." 
}, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-53", "text": "For each decoder step, a generation probability p gen \u2208 [0,1] is calculated, which weights the probability of generating words from the vocabulary, and copying words from the source text." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-54", "text": "p gen is a soft gating probability to decide whether generating the next token from the decoder or copying the word from the input instead." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-55", "text": "The attention distribution a t is a standard attention with general scoring [18] ." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-56", "text": "It considers all encoder hidden states to derive the context vector." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-57", "text": "The vocabulary distribution P vocab (w) is calculated by concatenating the decoder state s t and the context vector h * t ." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-58", "text": "where w h * , w s , w x are trainable parameters and b ptr is the scalar bias." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-59", "text": "The vocabulary distribution P vocab (w) and the attention distribution a t are weighted and summed to obtain the final distribution P (w)." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-60", "text": "The final distribution is calculated as follows: We use a beam search to select N -best code-switching sentences and concatenate the generated sentence with the training set to form a larger dataset." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-61", "text": "The result of the generated code-switching sentences is showed in Table 1 ." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-62", "text": "As our baseline, we compare our proposed method with three other models: (1) We use Seq2Seq with attention; (2) We generate sequences that satisfy Equivalence Constraint [6] ." 
}, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-63", "text": "The constraint doesn't allow any switch within a crossing of two word alignments." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-64", "text": "We use FastAlign [19] as the word aligner 1 ; (3) We also form sentences using the alignments without any constraint." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-65", "text": "The number of the generated sentences are equivalent to 3-best data from the pointer-generator model." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-66", "text": "To increase the generation variance, we randomly permute each alignment to form a new sequence." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-67", "text": "----------------------------------" }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-68", "text": "**LANGUAGE MODELING**" }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-69", "text": "The quality of the generated code-switching sentences is evaluated using a language modeling task." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-70", "text": "Indeed, if the perplexity in this task drops consistently we can assume that the generated sentences are well-formed." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-71", "text": "Hence, we use an LSTM language model with weight tying [20] that can capture an unbounded number of context words to approximate the probability of the next word." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-72", "text": "Syntactic information such as Partof-speech (POS) [p 1 , ..., p T ] is added to further improve the performance." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-73", "text": "The POS tags are generated phrase-wise using pretrained English and Chinese Stanford POS Tagger [21] by adding a word at a time in a unidirectional way to avoid any intervention from future information." 
}, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-74", "text": "The word and syntax unit are represented as a vector x w and x p respectively." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-75", "text": "Next, we concatenate both vectors and use it as an input [x w |x p ] to an LSTM layer similar to [10] . 4." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-76", "text": "EXPERIMENT" }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-77", "text": "----------------------------------" }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-78", "text": "**CORPUS**" }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-79", "text": "In our experiment, we use a conversational Mandarin-English code-switching speech corpus called SEAME Phase II (South East Asia Mandarin-English)." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-80", "text": "The data are collected from spontaneously spoken interviews and conversations in Singapore and Malaysia by bilinguals [22] ." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-81", "text": "As the data preprocessing, words are tokenized using Stanford NLP toolkit [23] and all hesitations and punctuations were removed except apostrophe." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-82", "text": "The split of the dataset is identical to [10] and it is showed in Table 1 ." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-83", "text": "----------------------------------" }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-84", "text": "**TRAINING SETUP**" }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-85", "text": "In this section, we present the experimental settings for pointer-generator network and language model." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-86", "text": "Our experiment, our pointer-generator model has 500-dimensional hidden states and word embeddings." 
}, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-87", "text": "We use 50k words as our vocabulary for source and target." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-88", "text": "We evaluate our pointer-generator performance using BLEU 2 score." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-89", "text": "We take the best model as our generator and during the decoding stage, we generate 1-best and 3-best using beam search with a beam size of 5." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-90", "text": "For the input, we build a parallel monolingual corpus by translating the mixed language sequence using Google NMT 3 to English (w 1 ) and Mandarin (w 2 ) sequences." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-91", "text": "Then, we concatenate the translated English and Mandarin sequences and assign code-switching sequences as the labels (y cs )." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-92", "text": "The baseline language model is trained using RNNLM [24] 4 ." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-93", "text": "Then, we train our 2-layer LSTM models with a hidden size of 500 and unrolled for 35 steps." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-94", "text": "The embedding size is equal to the LSTM hidden size for weight tying." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-95", "text": "We optimize our model using SGD with initial learning rates of {10, 20}. If there is no improvement during the evaluation, we reduce the learning rate by a factor of 0.75." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-96", "text": "In each time step, we apply dropout to both embedding layer and recurrent network." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-97", "text": "The gradient is clipped to a maximum of 0.25." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-98", "text": "Perplexity measure is used in the evaluation." 
}, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-99", "text": "2 BLEU is computed using multi bleu.perl from MOSES package 3 Google NMT Translate API 4 RNNLM toolkit from http://www.fit.vutbr.cz/\u223cimikolov/rnnlm/" }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-100", "text": "----------------------------------" }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-101", "text": "**RESULTS**" }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-102", "text": "The pointer-generator significantly outperforms the Seq2Seq with attention model by 3.58 BLEU points on the test set as shown in Table 2 ." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-103", "text": "Our language modeling result is given in Table 3 ." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-104", "text": "Based on the empirical result, adding generated samples consistently improve the performance of all models with a moderate margin around 10% in perplexity." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-105", "text": "After all, our proposed method still slightly outperforms the heuristic from linguistic constraint." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-106", "text": "In addition, we get a crucial gain on performance by adding syntax representation of the sequences." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-107", "text": "----------------------------------" }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-108", "text": "**CHANGE IN DATA DISTRIBUTION:**" }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-109", "text": "To further analyze the generated result, we observed the distribution of real codeswitching data and the generated code-switching data." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-110", "text": "From Figure 2 , we can see that 1-best and real code-switching data have almost identical distributions." 
}, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-111", "text": "The distributions are left-skewed where the overall mean is less than the median." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-112", "text": "Interestingly, the distribution of the 3-best data is less skewed and generates a new set of n-grams such as \"\u90a3 \u4e2a (that) proposal\" which was learned from other code-switching sequences." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-113", "text": "As a result, generating more samples effects the performance positively." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-114", "text": "Importance of Linguistic Constraint: The result in Table 3 emphasizes that linguistic constraints have some significance in replicating the real code-switching patterns, specifically the equivalence constraint." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-115", "text": "There is a slight reduction in perplexity around 6 points on the test set." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-116", "text": "In addition, when we ignore the constraint, we lose performance because it still allows switches in the inversion grammar cases." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-117", "text": "Does the pointer-generator learn how to switch?" }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-118", "text": "We found that our pointer-generator model generates sentences that have not been seen before." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-119", "text": "The example in Figure 1 shows that our model is able to construct a new well-formed sentence such as \"\u6211 \u4eec \u8981 \u53bb (We want to) check\"." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-120", "text": "It is also shown that the pointer-generator model has the capability to learn the characteristics of the linguistic constraints from data without any word alignment between the matrix and embedded languages." 
}, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-121", "text": "On the other hand, training using 3-best data obtains better performance compared to 1-best data." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-122", "text": "We found a positive correlation from Table 1 , where 3-best data is more similar to the test set in terms of segment length and number of switches compared to 1-best data." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-123", "text": "Adding more samples N may improve the performance, but it will be saturated at a certain point." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-124", "text": "One way to solve this is by using more parallel samples." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-125", "text": "----------------------------------" }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-126", "text": "**CONCLUSION**" }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-127", "text": "We introduce a new learning method for code-switching sentence generation using a parallel monolingual corpus that is applicable to any language pair." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-128", "text": "Our experimental result shows that adding generated sentences to the training data, effectively improves our model performance." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-129", "text": "Combining the generated samples with code-switching dataset reduces perplexity." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-130", "text": "We get further performance gain after using syntactic information of the input." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-131", "text": "In future work, we plan to explore reinforcement learning for sequence generation and employ more parallel corpora." 
}, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-2", "text": "Building large-scale datasets for training code-switching language models is challenging and very expensive." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-3", "text": "To alleviate this problem using parallel corpus has been a major workaround." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-4", "text": "However, existing solutions use linguistic constraints which may not capture the real data distribution." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-5", "text": "In this work, we propose a novel method for learning how to generate code-switching sentences from parallel corpora." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-6", "text": "Our model uses a Seq2Seq model in combination with pointer networks to align and choose words from the monolingual sentences and form a grammatical code-switching sentence." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-7", "text": "In our experiment, we show that by training a language model using the augmented sentences we improve the perplexity score by 10% compared to the LSTM baseline." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-8", "text": "----------------------------------" }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-10", "text": "Language mixing has been a common phenomenon in multilingual communities." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-11", "text": "It is motivated in response to social factors as a way of communication in a multicultural society." 
}, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-12", "text": "From a sociolinguistic perspective, individuals do code-switching in order to construct an optimal interaction by accomplishing the conceptual, relational-interpersonal, and discourse-presentational meaning of conversation [1] ." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-13", "text": "In its practice, the variation of code-switching will vary due to the traditions, beliefs, and normative values in the respective communities." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-14", "text": "A number of studies [2, 3, 4, 5] found that code-switching is not produced indiscriminately, but follows syntactic constraints." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-15", "text": "Many linguists formulated various constraints to define a general rule for code-switching [2, 4, 5] ." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-16", "text": "However, the constraints are not enough to make a good generalization of real code-switching constraints, and they have not been tested in large-scale corpora for many language pairs." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-17", "text": "One of the biggest problem in code-switching is collecting large scale corpora." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-18", "text": "Speech data have to be collected from a spontaneous speech by bilingual speakers and the codeswitching has to be triggered naturally during the conversation." }, { "sent_id": "65f7546e2abfd74c0daa43c25ca63f-C001-19", "text": "In order to solve the data scarcity issue, code-switching data generation is useful to increase the volume and variance." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "65f7546e2abfd74c0daa43c25ca63f-C001-20" ], [ "65f7546e2abfd74c0daa43c25ca63f-C001-31" ] ], "cite_sentences": [ "65f7546e2abfd74c0daa43c25ca63f-C001-20", "65f7546e2abfd74c0daa43c25ca63f-C001-31" ] }, "@USE@": { "gold_contexts": [ [ "65f7546e2abfd74c0daa43c25ca63f-C001-26" ], [ "65f7546e2abfd74c0daa43c25ca63f-C001-62" ] ], "cite_sentences": [ "65f7546e2abfd74c0daa43c25ca63f-C001-26", "65f7546e2abfd74c0daa43c25ca63f-C001-62" ] } } }, "ABC_4235dbd05a848d934f17f35894c051_39": { "x": [ { "sent_id": "4235dbd05a848d934f17f35894c051-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-2", "text": "PredPatt is a pattern-based framework for predicate-argument extraction." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-3", "text": "While it works across languages and provides a well-formed syntax-semantics interface for NLP tasks, a large-scale and reproducible evaluation has been lacking, which prevents comparisons between PredPatt and other related systems, and inhibits the updates of the patterns in PredPatt." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-4", "text": "In this work, we improve and evaluate PredPatt by introducing a large set of high-quality annotations converted from PropBank, which can also be used as a benchmark for other predicate-argument extraction systems." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-5", "text": "We compare PredPatt with other prominent systems and shows that PredPatt achieves the best precision and recall." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-6", "text": "----------------------------------" }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-8", "text": "PredPatt 1 (White et al., 2016 ) is a pattern-based framework for predicate-argument extraction." 
}, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-9", "text": "It defines a set of interpretable, extensible and non-lexicalized patterns based on Universal Dependencies (UD) (de Marneffe et al., 2014) , and extracts predicates and arguments through these manual patterns." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-10", "text": "Figure 1 shows the predicates and arguments extracted by PredPatt from the sentence: \"Chris, the designer, wants to launch a new brand.\" The underlying predicate-argument structure constructed by PredPatt is a directed graph, where a special dependency ARG is built between a predicate head token and its arguments' head tokens, and the original UD relations are retained within predicate phrases and argument phrases." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-11", "text": "For example, Figure 2 shows the directed graph for the predicate-argument extraction (1) and (2) in Figure 1 ." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-12", "text": "Compared to other existing systems for predicate-argument extraction (Banko et al., 2007; Fader et al., 2011; Angeli et al., 2015) , the use of manual language-agnostic patterns on UD makes PredPatt a well-founded component across languages." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-13", "text": "Additionally, the underlying structure constructed by PredPatt has been shown to be a well-formed syntax-semantics interface for NLP tasks: Zhang et al. (2016) utilizes PredPatt to extract possibilistic propositions in automatic common-sense inference generation." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-14", "text": "White et al. (2016) uses PredPatt to help augmenting data with Universal Decompositional Semantics." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-15", "text": "Zhang et al. (2017) adapts PredPatt to data generation for cross-lingual open information extraction." 
}, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-16", "text": "However, the evaluation of PredPatt has been restricted to manually-checked extractions over a small set of sentences (White et al., 2016) , which lacks gold annotations to conduct an objective and reproducible evaluation, and inhibits the updates of patterns in PredPatt." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-17", "text": "Chris , the designer , wants to launch a new brand ." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-18", "text": "In this work, we aim to conduct a large-scale and reproducible evaluation of PredPatt by introducing a large set of gold annotations gathered from PropBank (Palmer et al., 2005) ." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-19", "text": "We leverage these gold annotations to improve PredPatt and compare it with other prominent systems." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-20", "text": "The evaluation results demonstrate that we make a promising improvement on PredPatt, and it significantly outperforms other comparing systems." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-21", "text": "The scripts for creating gold annotations and evaluation are available at: https: //github.com/hltcoe/PredPatt/tree/master/eval" }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-22", "text": "----------------------------------" }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-23", "text": "**CREATING GOLD ANNOTATIONS**" }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-24", "text": "Open Information Extraction (Open IE) and Semantic Role Labeling (SRL) (Carreras and M\u00e0rquez, 2005) are quite related: semantically labeled arguments correspond to the arguments in Open IE extractions, and verbs often match up with Open IE relations (Christensen et al., 2011) ." 
}, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-25", "text": "Lang and Lapata (2010) has acknowledged that the SRL task can be viewed as a two stage process of (1) recognizing predicates and arguments then (2) assigning semantics." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-26", "text": "Therefore, predicate-argument extraction (i.e., Open IE) should primarily be considered the same as the first of two stages of SRL, and expert annotated SRL data would be an ideal resource for evaluating Open IE systems." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-27", "text": "This makes PropBank (Palmer et al., 2005) a natural choice from which we can create gold annotations for Open IE, Here, we choose to use expert annotations from PropBank, as compared to the recent suggestion to employ non-expert annotations as a means of benchmarking systems Stanovsky and Dagan (2016) ." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-28", "text": "Another advantage of choosing PropBank is that PropBank has gold annotations for UD which lays the important groundwork for evaluating UD-based patterns in PredPatt." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-29", "text": "In this work, we create gold annotations for predicate-argument extraction by converting PropBank annotations on English Web Treebank (EWT) (LDC2012T13) and the Penn Treebank II Wall Street Journal Corpus (WSJ) (Marcus et al., 1994) ." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-30", "text": "3 These two corpora have all verbal predicates annotated, and are used to evaluate PredPatt in different perspectives: EWT is the corpus where the gold standard English UD Treebank is built over, which enables an evaluation and analysis of PredPatt patterns; WSJ is used to evaluate PredPatt in a real-world scenario where we run SyntaxNet Parser 4 (Andor et al., 2016) on the corpus to generate automated UD parses as input of PredPatt." 
}, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-31", "text": "Table 1 shows the statistics of the auto-converted gold annotations for predicate-argument extraction on EWT and WSJ." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-53", "text": "Prep-separation: By default, PredPatt considers prepositions to belong to the predicate, while PropBank places preopositions within the span of their corresponding argument." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-32", "text": "We convert the PropBank annotations for all verbal predicates in these two corpora, and ignore roles of directional (DIR), manner (MNR), modals (MOD), negation (NEG) and adverbials (ADV), as they aren't extracted as distinct argument but instead are folded into the complex predicate by PredPatt and other systems for predicate-argument extraction (Banko et al., 2007; Fader et al., 2011; Angeli et al., 2015) ." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-33", "text": "For EWT, we select 13,583 sentences that have the version 2.0 of the gold UD annotations." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-34", "text": "5 The resulting annotations on these two corpora contain over 94K extractions." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-35", "text": "----------------------------------" }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-36", "text": "**IMPROVING PREDPATT**" }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-37", "text": "PredPatt is a pattern-based system, comprising an extensible set of clean, interpretable linguistic patterns over UD parses." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-38", "text": "By analyzing PredPatt extractions in comparison with gold annotations (Sec. 2), we are able to refine and improve PredPatt's pattern set." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-39", "text": "From the auto-converted gold annotations, we create a held-out set by randomly sampling 10% sentences from EWT." 
}, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-40", "text": "We then update the existing PredPatt patterns and introduce new patterns by analyzing PredPatt annotations on the held-out set." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-41", "text": "PredPatt extracts predicates and arguments in four stages (White et al., 2016) : (1) predicate and argument root identification, (2) argument resolution, (3) predicate and argument phrase extraction, and (4) optional post-processing." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-42", "text": "We analyze PredPatt extraction in each of these stages on the held-out set, and make 19 improvements to PredPatt patterns." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-43", "text": "Due to lack of space, we only highlight one improvement for each stage below." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-44", "text": "Fixed-MWE-pred: The UD version 2.0 introduces a new dependency relation fixed for identifying fixed function-word \"multiword expressions\" (MWEs)." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-45", "text": "To accommodate this new feature, we add patterns to identify the MWE predicate and its argument." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-46", "text": "As shown in Figure 3 , the predicate root in this case is the dependent of fixed that is tagged as a verb (i.e., \"opposed\"); the root of its argument is the token which indirectly governs the predicate root via the case and fixed relation (i.e., \"one\")." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-47", "text": "Please use this new file as opposed to the one I sent earlier ." 
}, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-48", "text": "Cut-complex-pred: The existing patterns take clausal complements (ccomp and xcomp) as predicatives of complex predicates in the argument resolution stage, where the arguments of the clausal complement will be merged into the argument set of their head predicate." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-49", "text": "For example, in the sentence \"Chris, the designer, wants to launch a new brand\", PredPatt merges the argument \"a new brand\" of the predicate \"to launch\" into the argument set of the complex predicate \"wants to launch\"." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-50", "text": "As a result, only the complex predicate, \"[Chris, the designer] wants to launch [a new brand]\", will be extracted." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-51", "text": "It ignores the possibility of the clausal complement itself being a predicate." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-52", "text": "Here, we add a cutting option; when turned on, it will cut the complex predicate into simple predicates as shown in Figure 1 ." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-76", "text": "----------------------------------" }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-54", "text": "Either behavior may be preferable under different circumstances, so we make preposition placement a new configurable option of PredPatt." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-55", "text": "Borrow-subj-for-conj-of-xcomp: PredPatt contains a post-processing option for distributing a single nsubj argument over multiple predicates joined by a conj relation." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-56", "text": "PredPatt also contains a pattern assigning subject arguments to predicates introduced by open clausal complement (xcomp) relations, according to the theory of obligatory control (Farkas, 1988) ." 
}, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-57", "text": "We introduce a new post-processing option that combines these two patterns, allowing an argument in subject position to be distributed over multiple xcomp predicates that are joined by a conj relation, as illustrated in Figure 4 ." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-58", "text": "----------------------------------" }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-59", "text": "**EVALUATION**" }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-60", "text": "In this section, we evaluate the original PredPatt (PredPatt v1) and the improved PredPatt (PredPatt v2) on the English Web Treebank (EWT) and the Wall Street Journal corpus (WSJ), and compare their performance with four prominent Open IE systems: OpenIE 4, 6 OLLIE (Mausam et al., 2012) , ClausIE (Del Corro and Gemulla, 2013) , and Stanford Open IE (Angeli et al., 2015) ." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-61", "text": "----------------------------------" }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-62", "text": "**PRECISION-RECALL CURVE**" }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-63", "text": "We compare PredPatt with four prominent Open IE systems which are also built for predicate-argument extraction." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-64", "text": "To allow some flexibility, we compute the precision and recall of different systems by running the scripts used in Stanovsky and Dagan (2016), 7 where an automated extraction is matched with a gold extraction based on their token-level overlap." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-65", "text": "Figure 5 and Figure 6 show the Precision-Recall Curves for different systems on EWT and WSJ." 
}, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-66", "text": "8 When tested on EWT which has gold UD parses ( Figure 5 ), PredPatt v1 and v2 outperforms the other systems by a significant margin in both precision and recall." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-67", "text": "When tested on WSJ where only automated UD parses are available (Figure 6 ), ClausIE achieves a recall that is slightly better than PredPatt v1, but PredPatt v2 still shows the best performance across all systems." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-68", "text": "----------------------------------" }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-69", "text": "**EXTRACTION HEAD AGREEMENT**" }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-70", "text": "The rich underlying structure in PredPatt (see Figure 2 ) contains head information for predicates and arguments, which enables a precision-recall metric based on the agreement of head information." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-71", "text": "Similar to He et al. (2015) , we first match an automated predicate with a gold predicate if they both agree on their head." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-72", "text": "9 With two matched predicates, we then match an automated argument with a gold argument if the automated argument head is within the gold argument span." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-73", "text": "We evaluate the precision and recall by a loose macro measure: For the i-th extractions that have two matched predicates, let the argument set of the gold predicate be A i , and the argument set of the automated predicate be\u00c2 i ." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-74", "text": "The number of matched arguments is represented by |A i \u2229\u00c2 i |." 
}, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-75", "text": "Then the precision is computed by Precision = 1 N N i=1 |A i \u2229\u00c2 i |/|\u00c2 i | , and the recall is computed by Recall" }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-77", "text": "**STATISTICS OF ARGUMENT SPAN RELATIONS**" }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-78", "text": "Besides the precion-recall oriented metrics, we impose another metric to further measure the argument span relations." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-79", "text": "Following in same notations in \u00a7 4.2, for the i-th extractions that have an automated predicate and a gold predicate matched with each other, let an argument in the gold argument set be \u03b1 \u2208 A i , and an argument in the automated argument set \u03b2 \u2208\u00c2 i ." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-80", "text": "We categorize the automated extractions into four sets according to their arguments relation to the gold arguments." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-81", "text": "Table 3 shows the proportion of PredPatt extractions in different sets." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-82", "text": "As we expected, compared to WSJ, more extractions on EWT fall into S same , which shows that PredPatt works better on gold UD parses." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-83", "text": "In contrast to PredPatt v1, PredPatt v2 on EWT increases extractions in S same by 12.97%, which contributes to the most increase of S subset ; on WSJ, PredPatt v2 decreases extractions in S subset by 13.89%, which leads the major increases of S same and S superset ." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-84", "text": "There are still over 10% extractions not belonging to any of these four sets." 
}, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-85", "text": "Case analysis shows that the inconsistent extractions are mainly caused by incorrect borrowing of arguments for compound predicates or predicates under obligatory control, missing arguments for passive/active verbs that act as adjectival modifiers, etc." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-86", "text": "These cases are not easily reachable via UD analysis, but leave room for further improvement on PredPatt." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-87", "text": "In the current settings, the head of a gold predicate is the verb token in the predicate." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-88", "text": "----------------------------------" }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-89", "text": "**CONCLUSIONS**" }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-90", "text": "We introduce a large-scale benchmark for predicate-argument extraction by converting manual annotations from PropBank." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-91", "text": "Based on the benchmark, we improve PredPatt patterns, and compare PredPatt with four prominent Open IE systems." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-92", "text": "The comparison shows that PredPatt significantly outperforms the other systems." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-93", "text": "The evaluation results demonstrate that we improve the performance of PredPatt in both precion-recall and the argument span relation with the gold annotations." }, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-94", "text": "As for further work, we see the confidence score estimater for PredPatt extractions as a desirable target, so that the quality of extractions can be controlled." 
}, { "sent_id": "4235dbd05a848d934f17f35894c051-C001-95", "text": "Additionally, we would like to further improve the PredPatt patterns by analyzing more PredPatt extractions in comparison with gold annotations." } ], "y": { "@BACK@": { "gold_contexts": [ [ "4235dbd05a848d934f17f35894c051-C001-8" ], [ "4235dbd05a848d934f17f35894c051-C001-16" ], [ "4235dbd05a848d934f17f35894c051-C001-41" ] ], "cite_sentences": [ "4235dbd05a848d934f17f35894c051-C001-8", "4235dbd05a848d934f17f35894c051-C001-16", "4235dbd05a848d934f17f35894c051-C001-41" ] }, "@MOT@": { "gold_contexts": [ [ "4235dbd05a848d934f17f35894c051-C001-16" ] ], "cite_sentences": [ "4235dbd05a848d934f17f35894c051-C001-16" ] }, "@EXT@": { "gold_contexts": [ [ "4235dbd05a848d934f17f35894c051-C001-16", "4235dbd05a848d934f17f35894c051-C001-18" ], [ "4235dbd05a848d934f17f35894c051-C001-41", "4235dbd05a848d934f17f35894c051-C001-42" ] ], "cite_sentences": [ "4235dbd05a848d934f17f35894c051-C001-16", "4235dbd05a848d934f17f35894c051-C001-41" ] } } }, "ABC_fa641aca676761c79c0469c195f336_39": { "x": [ { "sent_id": "fa641aca676761c79c0469c195f336-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-2", "text": "ABSTRACT" }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-3", "text": "----------------------------------" }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-4", "text": "**INTRODUCTION**" }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-5", "text": "Readability refers to the ease with which a given piece of natural language text can be read and understood." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-6", "text": "Intuitively, readability emerges from an interaction between the reader and the text, and depends on the prior knowledge of the reader, his/her reading skills, interest, and motivation [23] ." 
}, { "sent_id": "fa641aca676761c79c0469c195f336-C001-7", "text": "Although it may seem that automatic assessment of readability would be a very complicated process, as it turns out, fairly effective readability scoring can be achieved by means of several lowlevel features." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-8", "text": "Readability has many important applications, such as assessing the quality of student essays (one of the original applications of readability scoring), designing educational materials for schoolchildren and second-language learners, moderating newspaper content to convey information more clearly and effectively, and standardizing the language-learning experience of different age groups." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-9", "text": "Readability (\"reading ease\") and its converse -reading difficulty -are associated with different grade levels in school." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-10", "text": "It is generally observed that students from higher grade levels can write and comprehend texts with greater reading difficulty than students from lower grade levels." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-11", "text": "A lot of studies in readability therefore focused on correlating readability scores with grade levels, or even predicting grade levels from readability-oriented features." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-12", "text": "Existing methods of readability assessment look into a handful of low-level signals such as average sentence length (ASL), average word length in syllables (AWL), percentage of difficult words, and number of polysyllabic words." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-13", "text": "Early studies used word frequency lists to identify difficult words." 
}, { "sent_id": "fa641aca676761c79c0469c195f336-C001-14", "text": "Recently, readability evaluation has been tackled as a supervised machine learning problem [12] , [17] , [22] ." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-15", "text": "There have been many different studies on readability assessment in English (cf. Section 2)." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-16", "text": "Bengali has received much less attention owing to inadequate resources and a lack of robust natural language processing tools." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-17", "text": "It is only very recently that some groups of researchers looked into readability assessment in Bengali." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-18", "text": "They observed that English readability formulas did not work well on Bengali texts [11] , [21] ." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-19", "text": "This observation is not surprising, because Bengali is very different than English." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-20", "text": "Bengali is a highly inflected language, follows subject-object-verb ordering in sentences, and has a rich morphology." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-21", "text": "Further, Bengali shows word compounding and diglossia, i.e. formal and informal language variants (sadhu bhasha and cholit bhasha)." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-22", "text": "All these factors complicate readability scoring in Bengali." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-23", "text": "Since the concept of readability is highly subjective and reader-dependent, it is necessary to find out how much two native Bengali speakers agree on the readability level of a piece of text." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-24", "text": "Generalizing from there, we performed an inter-rater agreement study on readability assessment in Bengali." 
}, { "sent_id": "fa641aca676761c79c0469c195f336-C001-25", "text": "This study not only enables us to see how much human annotators agree on readability assessment, but also shows how difficult it is for humans to assign consistent readability scores." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-26", "text": "Since Bengali is very different than English, we want to see if (and how) readability is affected by the peculiarities of the language." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-27", "text": "As a by-product of this study, we obtained a human-annotated gold standard dataset for readability evaluation in Bengali." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-28", "text": "The rest of this paper is organized as follows." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-29", "text": "We briefly discuss related studies in Section 2, followed by a discussion of our dataset and annotation scheme in Section 3." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-30", "text": "Experimental results are described in Section 4, along with their explanation and observations." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-31", "text": "Section 5 concludes the paper with contributions, limitations, and further research directions." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-32", "text": "----------------------------------" }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-33", "text": "**RELATED WORK**" }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-34", "text": "Readability scoring in English has a long and rich history, starting with the work of L. A. Sherman in the late nineteenth century [20] ." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-35", "text": "Among the early readability formulas were Flesch Reading Ease [7] , Dale-Chall Formula [5] , Automated Readability Index [19] , Gunning Fog Index [9] , SMOG score [16] , and Coleman-Liau Index [2] ." 
}, { "sent_id": "fa641aca676761c79c0469c195f336-C001-36", "text": "These early indices were based on simple features like average number of characters, words, syllables and sentences, number of difficult and polysyllabic words, etc." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-37", "text": "Albeit simple, these readability indices were surprisingly good predictors of a reader's grade level." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-38", "text": "Two different lines of work focused on children and adult readability formulas." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-39", "text": "Recently Lahiri et al. showed moderate correlation between readability indices and formality score ( [10] ) in four different domains [14] ." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-40", "text": "Sinha et al. classified English readability formulas into three broad categories -traditional methods, cognitively motivated methods, and machine learning methods [21] ." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-41", "text": "Traditional methods assess readability using surface features and shallow linguistic features such as the ones mentioned in the preceding paragraph." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-42", "text": "Cognitively motivated methods take into account the cohesion and coherence of text, its latent topic structure, Kintsch's propositions, etc [1] , [8] , [13] ." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-43", "text": "Finally, machine learning methods utilize sophisticated structures such as language models [3] , [4] , [18] , query logs [15] , and several other features to predict the readability of open-domain text data." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-44", "text": "There are very few studies on readability assessment in Bengali texts." 
}, { "sent_id": "fa641aca676761c79c0469c195f336-C001-45", "text": "We found only three lines of work that specifically looked into Bengali readability [6] , [11] , [21] ." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-46", "text": "Das and Roychoudhury worked with a miniature model of two parameters in their pioneering study [6] ." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-47", "text": "They found that the two-parameter model was a better predictor of readability than the one-parameter model." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-48", "text": "Note, however, that Das and Roychoudhury's corpus was small (only seven documents), thereby calling into question the validity of their results." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-49", "text": "Sinha et al. alleviated these problems by considering six parameters instead of just two [21] ." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-50", "text": "They further showed that English readability indices were inadequate for Bengali, and built their own readability model on 16 texts." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-51", "text": "Around the same time, Islam et al. independently reached the same conclusion [11] ." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-52", "text": "They designed a Bengali readability classifier on lexical and information-theoretic features, resulting in an F-score 50% higher than that from traditional scoring approaches." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-53", "text": "While all the above studies are very important and insightful, none of them explicitly performed an inter-rater agreement study." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-54", "text": "For reasons mentioned in Section 1, an inter-rater agreement study is very important when we talk about readability assessment." 
}, { "sent_id": "fa641aca676761c79c0469c195f336-C001-55", "text": "Further, none of these studies made available their readability-annotated gold standard datasets, thereby stymieing further research." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-56", "text": "We attempt to bridge these gaps in our work." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-57", "text": "----------------------------------" }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-58", "text": "**METHODOLOGY**" }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-59", "text": "We collected a corpus of 30 Bengali text passages." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-60", "text": "The passages were randomly selected from the writings of four eminent Bengali authors -Rabindranath Tagore (1861-1941), Sarat Chandra Chattopadhyay (1876-1938), Bankim Chandra Chattopadhyay (1838-1894), and Bibhutibhushan Bandyopadhyay (1894-1950)." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-61", "text": "We ensured that samples from both sadhu bhasha as well as cholit bhasha were incorporated in our corpus." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-62", "text": "We also ensured that we had both adult text as well as children's text in the mix." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-63", "text": "The number of passages from different authors is shown in Table 1 ." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-64", "text": "We assigned the 30 text passages to seven independent annotators." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-65", "text": "The annotators were 30 to 35 years of age; they were from a similar educational background and socio-economic milieu; there were four female and three male annotators; and they all were native speakers of Bengali." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-66", "text": "Annotators were asked to assign a readability rating to each of the 30 passages." 
}, { "sent_id": "fa641aca676761c79c0469c195f336-C001-67", "text": "The rating scale was as follows: 1) Very easy to read 2) Easy to read 3) Somewhat easy to read 4) In-between 5) Somewhat difficult to read 6) Difficult to read 7) Very difficult to read This rating scale reflects the fact that readability is not a binary/ternary variable; it is an ordinal variable." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-68", "text": "We further collected the data on whether the annotators were avid readers of Bengali or not." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-69", "text": "Each annotator rated every passage." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-70", "text": "Note that readability annotation in Bengali is challenging because passages written in sadhu bhasha tend to be harder to read than those written in cholit bhasha." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-71", "text": "Since our dataset contains both sadhu bhasha and cholit bhasha, maintaining consistency in readability rating becomes a big issue." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-72", "text": "Table 2 gives the mean readability rating of the 30 text passages, along with their standard deviations." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-73", "text": "These ratings are averages over seven independent annotations." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-74", "text": "Note from Table 2 that none of the mean ratings is 1 or 7." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-75", "text": "In other words, mean ratings never reach the extreme readability values." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-76", "text": "This phenomenon is known as the central tendency bias." 
}, { "sent_id": "fa641aca676761c79c0469c195f336-C001-77", "text": "Note also that the standard deviations are not very high, which should be intuitive because the rating scale varies between 1 and 7.Agreement among the annotators was measured by Cohen's kappa (\u03ba) and Spearman's rank correlation coefficient (\u03c1)." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-78", "text": "Table 3 shows the pairwise \u03ba values among different annotators, and Table 4 gives the pairwise \u03c1 values." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-79", "text": "Both tables are symmetric around the main diagonal." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-80", "text": "Note from" }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-81", "text": "----------------------------------" }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-82", "text": "**RESULTS**" }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-83", "text": "----------------------------------" }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-84", "text": "**CONCLUSION**" }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-85", "text": "We performed an inter-rater agreement study for readability assessment in Bengali." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-86", "text": "This is the first time such an agreement study has been performed." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-87", "text": "We obtained moderate to fair agreement among seven independent annotators on 30 text passages written by four eminent Bengali authors." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-88", "text": "As a byproduct of this study, we obtained a gold standard human annotated readability dataset for Bengali." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-89", "text": "We plan to release this dataset for future research." 
}, { "sent_id": "fa641aca676761c79c0469c195f336-C001-90", "text": "We are working on readability modeling in Bengali, and this dataset will be very helpful." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-91", "text": "An important limitation of our study is the small corpus size." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-92", "text": "We only have 30 annotated passages at our disposal, whereas Islam et al. [11] had around 300. But Islam et al.'s dataset is not annotated in as fine-grained a fashion as ours." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-93", "text": "Note also that our dataset is larger than both Sinha et al.'s 16document dataset [21] , and Das and Roychoudhury's seven document dataset [6] ." }, { "sent_id": "fa641aca676761c79c0469c195f336-C001-94", "text": "We plan to increase the size of our dataset in future." } ], "y": { "@BACK@": { "gold_contexts": [ [ "fa641aca676761c79c0469c195f336-C001-18" ], [ "fa641aca676761c79c0469c195f336-C001-45", "fa641aca676761c79c0469c195f336-C001-51" ] ], "cite_sentences": [ "fa641aca676761c79c0469c195f336-C001-18", "fa641aca676761c79c0469c195f336-C001-45", "fa641aca676761c79c0469c195f336-C001-51" ] }, "@MOT@": { "gold_contexts": [ [ "fa641aca676761c79c0469c195f336-C001-45", "fa641aca676761c79c0469c195f336-C001-51", "fa641aca676761c79c0469c195f336-C001-53", "fa641aca676761c79c0469c195f336-C001-55", "fa641aca676761c79c0469c195f336-C001-56" ] ], "cite_sentences": [ "fa641aca676761c79c0469c195f336-C001-45", "fa641aca676761c79c0469c195f336-C001-51" ] }, "@DIF@": { "gold_contexts": [ [ "fa641aca676761c79c0469c195f336-C001-92" ] ], "cite_sentences": [ "fa641aca676761c79c0469c195f336-C001-92" ] } } }, "ABC_f7255360eacc4e2a4e8bea2f6ab1b0_39": { "x": [ { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-24", "text": "First, a computation of the cohesion of the different parts of a text is done by using a collocation network." 
}, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-25", "text": "Second, we locate the major breaks in this cohesion to detect the thematic shifts and build segments." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-26", "text": "----------------------------------" }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-27", "text": "**THE COLLOCATION NETWORK**" }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-2", "text": "This article outlines a quantitative method for segmenting texts into thematically coherent units." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-3", "text": "This method relies on a network of lexical collocations to compute the thematic coherence of the different parts of a text from the lexical cohesiveness of their words." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-4", "text": "We also present the results of an experiment about locating boundaries between a series of concatened texts." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-5", "text": "----------------------------------" }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-6", "text": "****" }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-7", "text": "1 Introduction Several quantitative methods exist for thematically segmenting texts." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-8", "text": "Most of them are based on the following assumption: the thematic coherence of a text segment finds expression at the lexical level." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-9", "text": "Hearst (1997) and Nomoto and Nitta (1994) detect this coherence through patterns of lexical cooccurrence." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-10", "text": "Morris and Hirst (1991) and Kozima (1993) find topic boundaries in the texts by using lexical cohesion." 
}, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-11", "text": "The first methods are applied to texts, such as expository texts, whose vocabulary is often very specific." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-12", "text": "As a concept is always expressed by the same word, word repetitions are thematically significant in these texts." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-13", "text": "The use of lexical cohesion allows to bypass the problem set by texts, such as narratives, in which a concept is often expressed by different means." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-14", "text": "However, this second approach requires knowledge about the cohesion between words." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-15", "text": "Morris and Hirst (1991) extract this knowledge from a thesaurus." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-16", "text": "Kozima (1993) exploits a lexical network built from a machine readable dictionary (MRD)." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-17", "text": "This article presents a method for thematically segmenting texts by using knowledge about lexical cohesion that has been automatically built." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-18", "text": "This knowledge takes the form of a network of lexical collocations." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-19", "text": "We claim that this network is as suitable as a thesaurus or a MRD for segmenting texts." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-20", "text": "Moreover, building it for a specific domain or for another language is quick." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-21", "text": "----------------------------------" }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-22", "text": "**METHOD**" }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-23", "text": "The segmentation algorithm we propose includes two steps." 
}, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-28", "text": "Our collocation network has been built from 24 months of the French Le Monde newspaper." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-29", "text": "The size of this corpus is around 39 million words." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-30", "text": "The cohesion between words has been evaluated with the mutual information measure, as in (Church and Hanks, 1990) ." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-31", "text": "A large window, 20 words wide, was used to take into account the thematic links." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-32", "text": "The texts were pre-processed with the probabilistic POS tagger TreeTagger (Schmid, 1994) in order to keep only the lemmatized form of their content words, i.e. nouns, adjectives and verbs." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-33", "text": "The resulting network is composed of approximatively 31 thousand words and 14 million relations." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-34", "text": "----------------------------------" }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-35", "text": "**COMPUTATION OF TEXT COHESION**" }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-36", "text": "As in Kozima's work, a cohesion value is computed at each position of a window in a text (after pre-processing) from the words in this window." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-37", "text": "The collocation network is used for determining how close together these words are." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-38", "text": "We suppose that if the words of the window are strongly connected in the network, they belong to the same domain and so, the cohesion in this part of text is high." 
}, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-39", "text": "On the contrary, if they are not very much linked together, we assume that the words of the window belong to two different domains." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-40", "text": "It means that the window is located across the transition from one topic to another." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-41", "text": "In practice, the cohesion inside the window is evaluated by the sum of the weights of the words in this window and the words selected from the collocation network common to at least two words of the window." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-42", "text": "Selecting words from the network linked to those of the texts makes explicit words related to the same topic as the topic referred by the words in the window and produces a more stable description of this topic when the window moves." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-43", "text": "As shown in Figure 1 , each word w (from the window or from the network) is weighted by the sum of the contributions of all the words of the window it is linked to." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-44", "text": "The contribution of such a word is equal to its number of occurrences in the window modulated by the cohesion measure associated to its link with w. Thus, the more the words belong to a same topic, the more they are linked together and the higher their weights are." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-45", "text": "Finally, the value of the cohesion for one position of the window is the result of the following weighted sum:" }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-46", "text": "coh(p) = Y~i sign(wi) ." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-47", "text": "wght(wi), with wght(wi), the resulting weight of the word wi, sign(wi), the significance of wi, i.e. the normalized information of wi in the Le Monde corpus." 
}, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-48", "text": "Figure 2 shows the smoothed cohesion graph for ten texts of the experiment." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-49", "text": "Dotted lines are text boundaries (see 3.1)." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-50", "text": "----------------------------------" }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-51", "text": "**SEGMENTING THE COHESION GRAPH**" }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-52", "text": "First, the graph is smoothed to more easily detect the main minima and maxima." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-53", "text": "This operation is done again by moving a window on the text." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-54", "text": "At each position, the cohesion associ- After this smoothing, the derivative of the graph is calculated to locate the maxima and the minima." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-55", "text": "We consider that a minimum marks a thematic shift." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-56", "text": "So, a segment is characterized by the following sequence: minimum -maximum -minimum." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-57", "text": "For making the delimitation of the segments more precise, they are stopped before the next (or the previous) minimum if there is a brutal break of the graph and after this, a very slow descent." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-58", "text": "This is done by detecting that the cohesion values fall under a given percentage of the maximum value." 
}, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-59", "text": "----------------------------------" }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-60", "text": "**RESULTS**" }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-61", "text": "A first qualitative evaluation of the method has been done with about 20 texts but without a formal protocol as in (Hearst, 1997) ." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-62", "text": "The results of these tests are rather stable when parameters such as the size of the cohesion computing window or the size of the smoothing window are changed (from 9 to 21 words)." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-63", "text": "Generally, the best results are obtained with a size of 19 words for the first window and 11 for the second one." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-64", "text": "----------------------------------" }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-65", "text": "**DISCOVERING DOCUMENT BREAKS**" }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-66", "text": "In order to have a more objective evaluation, the method has been applied to the \"classical\" task of discovering boundaries between concatened texts." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-67", "text": "Results are shown in Table 1 ." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-68", "text": "As in (Hearst, 1997) , boundaries found by the method are weighted and sorted in decreasing order." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-69", "text": "Document breaks are supposed to be the boundaries that have the highest weights." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-70", "text": "For the first from the corpus used for building the collocation network." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-71", "text": "Each text was 80 words long on average." 
}, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-72", "text": "Each boundary, which is a minimum of the cohesion graph, was weighted by the sum of the differences between its value and the values of the two maxima around it, as in (Hearst, 1997) ." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-73", "text": "The match between a boundary and a document break was accepted if the boundary was no further than 9 words (after pre-processing)." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-74", "text": "Globally, our results are not as good as Hearst's (with 44 texts; Nb: 10, P: 0.8, R: 0.19; Nb: 70, P: 0.59, R: 0.95)." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-75", "text": "The first explanation for such a difference is the fact that the two methods do not apply to the same kind of texts." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-76", "text": "Hearst does not consider texts smaller than 10 sentences long." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-77", "text": "All the texts of this evaluation are under this limit." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-78", "text": "In fact, our method, as Kozima's, is more convenient for closely tracking thematic evolutions than for detecting the major thematic shifts." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-79", "text": "The second explanation for this difference is related to the way the document breaks are found, as shown by the precision values." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-80", "text": "When Nb increases, precision decreases as it generally does, but very slowly." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-81", "text": "The decrease actually becomes significant only when Nb becomes larger than N. It means that the weights associated to the boundaries are not very significant." 
}, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-82", "text": "We have validated this hypothesis by changing the weighting policy of the boundaries without having significant changes in the results." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-83", "text": "One way for increasing the performance would be to take as text boundary not the position of a minimum in the cohesion graph but the nearest sentence boundary from this position." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-84", "text": "----------------------------------" }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-85", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-86", "text": "We have presented a method for segmenting texts into thematically coherent units that relies on a collocation network." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-87", "text": "This collocation network is used to compute a cohesion value for the different parts of a text." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-88", "text": "Segmentation is then done by analyzing the resulting cohesion graph." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-89", "text": "But such a numerical value is a rough characterization of the current topic." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-90", "text": "For future work we will build a more precise representation of the current topic based on the words selected from the network." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-91", "text": "By computing a similarity measure between the representation of the current topic at one position of the window and this representation at a further one, it will be possible to determine how thematically far two parts of a text are." }, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-92", "text": "The minima of the measure will be used to detect the thematic shifts." 
}, { "sent_id": "f7255360eacc4e2a4e8bea2f6ab1b0-C001-93", "text": "This new method is closer to Hearst's than the one presented above but it relies on a collocation network for finding relations between two parts of a text instead of using the word recurrence." } ], "y": { "@BACK@": { "gold_contexts": [ [ "f7255360eacc4e2a4e8bea2f6ab1b0-C001-9" ] ], "cite_sentences": [ "f7255360eacc4e2a4e8bea2f6ab1b0-C001-9" ] }, "@DIF@": { "gold_contexts": [ [ "f7255360eacc4e2a4e8bea2f6ab1b0-C001-61" ] ], "cite_sentences": [ "f7255360eacc4e2a4e8bea2f6ab1b0-C001-61" ] }, "@USE@": { "gold_contexts": [ [ "f7255360eacc4e2a4e8bea2f6ab1b0-C001-68" ], [ "f7255360eacc4e2a4e8bea2f6ab1b0-C001-72" ] ], "cite_sentences": [ "f7255360eacc4e2a4e8bea2f6ab1b0-C001-68", "f7255360eacc4e2a4e8bea2f6ab1b0-C001-72" ] } } }, "ABC_932a13e179da50c9189bd0c612cb9c_39": { "x": [ { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-2", "text": "The paper presents a novel sentence pair extraction algorithm for comparable data, where a large set of candidate sentence pairs is scored directly at the sentence-level." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-3", "text": "The sentencelevel extraction relies on a very efficient implementation of a simple symmetric scoring function: a computation speed-up by a factor of 30 is reported." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-4", "text": "On Spanish-English data, the extraction algorithm finds the highest scoring sentence pairs from close to 1 trillion candidate pairs without search errors." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-5", "text": "Significant improvements in BLEU are reported by including the extracted sentence pairs into the training of a phrase-based SMT (Statistical Machine Translation) system." 
}, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-6", "text": "----------------------------------" }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-8", "text": "The paper presents a simple sentence-level translation pair extraction algorithm from comparable monolingual news data." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-9", "text": "It differs from similar algorithms that select translation correspondences explicitly at the document level (Fung and Cheung, 2004; Resnik and Smith, 2003; Snover et al., 2008; Munteanu and Marcu, 2005; Quirk et al., 2007; Utiyama and Isahara, 2003) ." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-10", "text": "In these publications, the authors use Information-Retrieval (IR) techniques to match document pairs that are likely translations of each other." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-11", "text": "More complex sentence-level models are then used to extract parallel sentence pairs (or fragments)." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-12", "text": "From a computational perspective, the document-level filtering steps are needed to reduce the number of candidate sentence pairs." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-13", "text": "While IR techniques might be useful to improve the selection accuracy, the current paper demonstrates that they are not necessary to obtain parallel sentence pairs." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-14", "text": "For some data, e.g. the Portuguese-English Reuters data used in the experiments in Section 3, document-level information may not even be available." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-15", "text": "In this paper, sentence pairs are extracted by a simple model that is based on the so-called IBM Model-1 (Brown et al., 1993) ." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-16", "text": "The Model-1 is trained on some parallel data available for a language pair, i.e. 
the data used to train the baseline systems in Section 3." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-17", "text": "The scoring function used in this paper is inspired by phrase-based SMT." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-18", "text": "Typically, a phrase-based SMT system includes a feature that scores phrase pairs using lexical weights (Koehn et al., 2003) which are computed for two directions: source to target and target to source." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-19", "text": "Here, a sentence pair is scored as a phrase pair that covers all the source and target words." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-20", "text": "The scoring function \u033a(S, T) is defined as follows: \u033a(S, T) = (1/I) \u00b7 \u03a3_i log p(t_i|S) + (1/J) \u00b7 \u03a3_j log p(s_j|T)." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-21", "text": "Here, S = s_1^J is the source sentence of length J and T = t_1^I is the target sentence of length I. p(s|T) is the Model-1 probability assigned to the source word s given the target sentence T; p(t|S) is defined accordingly." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-22", "text": "p(s|t) and p(t|s) are word translation probabilities obtained by two parallel Model-1 training steps on the same data, but swapping the role of source and target language." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-23", "text": "They are smoothed to avoid 0.0 entries; there is no special NULL-word model and stop words are kept." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-24", "text": "The log(\u00b7) is applied to turn the sentence-level probabilities into scores." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-25", "text": "These log-probabilities are normalized with respect to the source and target sentence length: this way the score \u033a(S, T) can be used across all sentence pairs considered, and a single manually set threshold \u03b8 is used to select all those sentence pairs whose score is above it."
}, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-26", "text": "For computational reasons, the sum \u033a(S, T ) is computed over the following terms: \u03c4 (t i , S) where 1 \u2264 i \u2264 I and \u03c3(s j , T ), where 1\u2264 j \u2264 J. The \u03c4 's and \u03c3's represent partial score contributions for a given source or target position." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-27", "text": "Note that \u033a(S, T ) \u2264 0 since the terms \u03c4 (\u00b7, S) \u2264 0 and \u03c3(\u00b7, T ) \u2264 0." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-28", "text": "Section 2 presents an efficient implementation of the scoring function in Eq. 1." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-29", "text": "Its effectiveness is demonstrated in Section 3." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-30", "text": "Finally, Section 4 discusses future work and extensions of the current algorithm." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-31", "text": "----------------------------------" }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-32", "text": "**SENTENCE-LEVEL PROCESSING**" }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-33", "text": "We process the comparable data at the sentencelevel: for each language and all the documents in the comparable data, we distribute sentences over a list of files : one file for each news feed f (for the Spanish Gigaword data, there are 3 news feeds) and publication date d ." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-34", "text": "The Gigaword data comes annotated with sentence-level boundaries, and all document boundaries are discarded." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-35", "text": "This way, the Spanish data consists of about 24 thousand files and the English data consists of about 53 thousand files (for details, see Table 2 )." 
}, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-36", "text": "For a given source sentence S, the search algorithm computes the highest scoring sentence pair \u033a(S, T ) over a set of candidate translations T \u2208 \u0398, where |\u0398| can be in the hundreds of thousands of sentences ." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-37", "text": "\u0398 consists of all target sentences that have been published from the same news feed f within a 7 day window from the publication date of the current source sentence S. The extraction algorithm is guaranteed to find the highest scoring sentence pairs (S, T ) among all T \u2208 \u0398. In order to make this processing pipeline feasible, the scoring function in Eq. 1 needs to be computed very efficiently." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-38", "text": "That efficiency is based on the decomposition of the scoring functions into I + J terms ( \u03c4 's and \u03c3's) where source and target terms are treated differently." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-39", "text": "While the scoring function computation is symmetric, the processing is organized according the source language files: all the source sentences are processed one-by-one with respect to their individual candidate sets \u0398:" }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-40", "text": "\u2022 Caching for target term \u03c4 (t, S): For each target word t that occurs in a candidate translation T , the Model-1 based probability p(t|S) can be cached: its value is independent of the other words in T ." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-41", "text": "The same word t in different target sentences is processed with respect to the same source sentence S and p(t|S) has to be computed only once." 
}, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-42", "text": "\u2022 Array access for source terms \u03c3(s, T ): For a given source sentence S, we compute the scoring function \u033a(S, T ) over a set of target sentences T \u2208 \u0398. The computation of the source term \u03c3(s, T ) is based on translation probabilities p(s|t) ." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-43", "text": "For each source word s, we can retrieve all target words t for which p(s|t) > 0 just once." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-44", "text": "We store those words t along with their probabilities in an array the size of the target vocabulary." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-45", "text": "Words t that do not have an entry in the lexicon have a 0 entry in that array." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-46", "text": "We keep a separate array for each source position." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-47", "text": "This way, we reduce the probability access to a simple array look-up." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-48", "text": "Generating the full array presentation requires less than 50 milliseconds per source sentence on average." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-49", "text": "\u2022 Early-Stopping: Two loops compute the scoring function \u033a(S, T ) exhaustively for each sentence pair (S, T ): 1) a loop over all the target position terms \u03c4 (t i , S), and 2) a loop over all source position terms \u03c3(s j , T ) ." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-50", "text": "Once the current partial sum is lower than the best score \u033a(S, T best ) computed so far, the computation can be safely discarded as \u03c4 (t i , S), \u03c3(s j , T ) \u2264 \u2022 Frequency-Sorting: Here, we aim at making the early pruning step more efficient." 
}, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-51", "text": "Source and target words are sorted according to the source and target vocabulary frequency: less frequent words occur at the beginning of a sentence." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-52", "text": "These words are likely to contribute terms with high partial scores." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-53", "text": "As a result, the early-stopping step fires earlier and becomes more effective." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-54", "text": "\u2022 Sentence-level filter: The word-overlap filter in (Munteanu and Marcu, 2005) has been implemented: for a sentence pair (S, T ) to be considered parallel the ratio of the lengths of the two sentences has to be smaller than two." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-55", "text": "Additionally, at least half of the words in each sentence have to have a translation in the other sentence based on the word-based lexicon." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-56", "text": "Here, the implementation of the coverage restriction is tightly integrated into the above implementation: the decision whether a target word is covered can be cached." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-57", "text": "Likewise, source word coverage can be decided by a simple array look-up." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-58", "text": "----------------------------------" }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-59", "text": "**EXPERIMENTS**" }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-60", "text": "The parallel sentence extraction algorithm presented in this paper is tested in detail on the largescale Spanish-English Gigaword data (Graff, 2006; Graff, 2007) ." 
}, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-61", "text": "The Spanish data comes from 3 news feeds: Agence France-Presse (AFP), Associated Press Worldstream (APW), and Xinhua News sentences." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-62", "text": "Here, the size of the target candidate set \u0398 is 61 736 sentences." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-63", "text": "All the techniques presented result in some improvement." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-64", "text": "The baseline uses only the length-based filtering and the coverage filtering without caching the coverage decisions (Munteanu and Marcu, 2005) ." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-65", "text": "Caching the target word probabilities results in the biggest reduction." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-66", "text": "The results are representative: finding the highest scoring target sentence T for a given source sentence S takes about 1 second on average." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-67", "text": "Since 20 million source sentences are processed, and the workload is distributed over roughly 120 processors, overall processing time sums to less than 3 days." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-68", "text": "Here, the total number of translation pairs considered is close to 1 trillion." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-69", "text": "The effect of including additional sentence pairs along with selection statistics is presented in Table 3." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-70", "text": "Translation results are presented for a standard phrase-based SMT system." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-71", "text": "Here, both languages use a test set with a single reference." 
}, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-72", "text": "Including about 1.4 million sentence pairs extracted from the Gigaword data, we obtain a statistically significant improvement from 42.3 to 45.6 in BLEU (Papineni et al., 2002) ." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-73", "text": "The baseline system has been trained on about 1.8 million sentence pairs from Europarl and FBIS parallel data." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-74", "text": "We also present results for a Portuguese-English system: the baseline has been trained on Europarl and JRC data." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-75", "text": "Parallel sentence pairs are extracted from comparable Reuters news data published in 2006." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-76", "text": "The corpus statistics for the Portuguese-English data are given in Table 2 ." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-77", "text": "The selection threshold \u03b8 is determined with the help of bilingual annotators (it typically takes a few hours)." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-78", "text": "Sentence pairs are selected with a conservative threshold \u03b8 \u2032 first." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-79", "text": "Then, all the sentence pairs are sorted by descending score." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-80", "text": "The annotator descends this list to determine a score threshold cut-off." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-81", "text": "Here, translation pairs are considered to be parallel if 75 % of source and target words have a corresponding translation in the other sentence." 
}, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-82", "text": "Using a threshold \u03b8 = \u22124.1 for the Spanish-English data, results in a selection precision of around 80 % (most of the misqualified pairs are partial translations with less than 75 % coverage or short sequences of high frequency words)." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-83", "text": "This simple selection criterion proved sufficient to obtain the results presented in this paper." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-84", "text": "As can be seen from Table 3 , the optimal threshold is language specific." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-85", "text": "----------------------------------" }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-86", "text": "**FUTURE WORK AND DISCUSSION**" }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-87", "text": "In this paper, we have presented a novel sentencelevel pair extraction algorithm for comparable data." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-88", "text": "We use a simple symmetrized scoring function based on the Model-1 translation probability." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-89", "text": "With the help of an efficient implementation, it avoids any translation candidate selection at the document level (Resnik and Smith, 2003; Smith, 2002; Snover et al., 2008; Utiyama and Isahara, 2003; Munteanu and Marcu, 2005; Fung and Cheung, 2004) ." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-90", "text": "In particular, the extraction algorithm works when no document-level information is available." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-91", "text": "Its usefulness for extracting parallel sentences is demonstrated on news data for two language pairs." 
}, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-92", "text": "Currently, we are working on a feature-rich approach (Munteanu and Marcu, 2005) to improve the sentence-pair selection accuracy." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-93", "text": "Feature functions will be 'light-weight' such that they can be computed efficiently in an incremental way at the sentence-level." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-94", "text": "This way, we will be able to maintain our search-driven extraction approach." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-95", "text": "We are also re-implementing IR-based techniques to preselect translation pairs at the document-level, to gauge the effect of this additional filtering step." }, { "sent_id": "932a13e179da50c9189bd0c612cb9c-C001-96", "text": "We hope that a purely sentence-level processing might result in a more productive pair extraction in future." } ], "y": { "@DIF@": { "gold_contexts": [ [ "932a13e179da50c9189bd0c612cb9c-C001-9" ] ], "cite_sentences": [ "932a13e179da50c9189bd0c612cb9c-C001-9" ] }, "@USE@": { "gold_contexts": [ [ "932a13e179da50c9189bd0c612cb9c-C001-54" ], [ "932a13e179da50c9189bd0c612cb9c-C001-64" ], [ "932a13e179da50c9189bd0c612cb9c-C001-89" ], [ "932a13e179da50c9189bd0c612cb9c-C001-92" ] ], "cite_sentences": [ "932a13e179da50c9189bd0c612cb9c-C001-54", "932a13e179da50c9189bd0c612cb9c-C001-64", "932a13e179da50c9189bd0c612cb9c-C001-89", "932a13e179da50c9189bd0c612cb9c-C001-92" ] } } }, "ABC_7c2586a172dc6f061817c3a8b3ebf0_39": { "x": [ { "sent_id": "7c2586a172dc6f061817c3a8b3ebf0-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "7c2586a172dc6f061817c3a8b3ebf0-C001-2", "text": "West African Pidgin English is a language that is significantly spoken in West Africa, consisting of at least 75 million speakers." 
}, { "sent_id": "7c2586a172dc6f061817c3a8b3ebf0-C001-3", "text": "Nevertheless, proper machine translation systems and relevant NLP datasets for pidgin English are virtually absent." }, { "sent_id": "7c2586a172dc6f061817c3a8b3ebf0-C001-4", "text": "In this work, we develop techniques targeted at bridging the gap between Pidgin English and English in the context of natural language generation." }, { "sent_id": "7c2586a172dc6f061817c3a8b3ebf0-C001-5", "text": "By building upon the previously released monolingual Pidgin English text and parallel English data-to-text corpus, we hope to build a system that can automatically generate Pidgin English descriptions from structured data." }, { "sent_id": "7c2586a172dc6f061817c3a8b3ebf0-C001-6", "text": "We first train a data-to-English text generation system, before employing techniques in unsupervised neural machine translation and self-training to establish the Pidgin-to-English cross-lingual alignment." }, { "sent_id": "7c2586a172dc6f061817c3a8b3ebf0-C001-7", "text": "The human evaluation performed on the generated Pidgin texts shows that, though still far from being practically usable, the pivoting + self-training technique improves both Pidgin text fluency and relevance." }, { "sent_id": "7c2586a172dc6f061817c3a8b3ebf0-C001-8", "text": "----------------------------------" }, { "sent_id": "7c2586a172dc6f061817c3a8b3ebf0-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "7c2586a172dc6f061817c3a8b3ebf0-C001-10", "text": "Pidgin English is one of the the most widely spoken languages in West Africa with roughly 75 million speakers estimated in Nigeria; and over 5 million speakers estimated in Ghana (Ogueji & Ahia, 2019) ." }, { "sent_id": "7c2586a172dc6f061817c3a8b3ebf0-C001-11", "text": "1 While there have been recent efforts in popularizing the monolingual Pidgin English as seen in the BBC Pidgin 2 , it remains under-resourced in terms of the available parallel corpus for machine translation." 
}, { "sent_id": "7c2586a172dc6f061817c3a8b3ebf0-C001-12", "text": "Similarly, this low-resource scenario extends to other domains in natural language generation (NLG) such as summarization, data-to-text and so on (Lebret et al., 2016; Su et al., 2018; Shen et al., 2019a; b; Zhao et al., 2019; Hong et al., 2019; de Souza et al., 2018) \u2212 where Pidgin English generation is largely under-explored." }, { "sent_id": "7c2586a172dc6f061817c3a8b3ebf0-C001-13", "text": "The scarcity is further aggravated when the pipeline language generation system includes other sub-modules that computes semantic textual similarity (Shen et al., 2017; Zhuang & Chang, 2017) , which exists solely in English." }, { "sent_id": "7c2586a172dc6f061817c3a8b3ebf0-C001-14", "text": "Previous works on unsupervised neural machine translation for Pidgin English constructed a monolingual corpus (Ogueji & Ahia, 2019) , and achieved a BLEU score of 5.18 from English to Pidgin." }, { "sent_id": "7c2586a172dc6f061817c3a8b3ebf0-C001-15", "text": "However, there is an issue of domain mismatch between down-stream NLG tasks and the trained machine translation system." }, { "sent_id": "7c2586a172dc6f061817c3a8b3ebf0-C001-16", "text": "This creates a caveat where the resulting English-to-Pidgin MT systems (trained on the domain of news and the Bible) cannot be directly used to translate out-domain English texts to Pidgin." }, { "sent_id": "7c2586a172dc6f061817c3a8b3ebf0-C001-17", "text": "An example of the English/pidgin text in the restaurant domain (Novikova et al., 2017) is displayed in Table 1 ." }, { "sent_id": "7c2586a172dc6f061817c3a8b3ebf0-C001-18", "text": "Nevertheless, we argue that this domain-mismatch problem can be alleviated by using English text in the target-domain as a pivot language (Guo et al., 2019) ." 
}, { "sent_id": "7c2586a172dc6f061817c3a8b3ebf0-C001-19", "text": "To this end, we explore this idea on the task of neural data-to-text generation which has been the subject of much recent research." }, { "sent_id": "7c2586a172dc6f061817c3a8b3ebf0-C001-20", "text": "Neural data-to-Pidgin generation is essential in the African continent especially given the fact that many English There is a pub Blue Spice located in the centre of the city that provides Chinese food." }, { "sent_id": "7c2586a172dc6f061817c3a8b3ebf0-C001-21", "text": "Pidgin" }, { "sent_id": "7c2586a172dc6f061817c3a8b3ebf0-C001-22", "text": "Wan pub blue spice dey for centre of city wey dey give Chinese food." }, { "sent_id": "7c2586a172dc6f061817c3a8b3ebf0-C001-23", "text": "existing data-to-text systems are English-based e.g. Weather reporting systems (Sripada et al., 2002; Belz, 2008) ." }, { "sent_id": "7c2586a172dc6f061817c3a8b3ebf0-C001-24", "text": "This work aims at bridging the gap between many of these English-based systems and Pidgin by training an in-domain English-to-pidgin MT system in an unsupervised way." }, { "sent_id": "7c2586a172dc6f061817c3a8b3ebf0-C001-25", "text": "By this means, English-based NLG systems can be locally adapted by translating the output English text into pidgin English." }, { "sent_id": "7c2586a172dc6f061817c3a8b3ebf0-C001-26", "text": "We employ the publicly available parallel data-to-text corpus E2E (Novikova et al., 2017) consisting of tabulated data and English descriptions in the restaurant domain." }, { "sent_id": "7c2586a172dc6f061817c3a8b3ebf0-C001-27", "text": "The training of the in-domain MT system is done with a two-step process: (1) We use the target-side English texts as the pivot, and train an unsupervised NMT (model unsup ) directly between in-domain English text and the available monolingual Pidgin corpus." 
}, { "sent_id": "7c2586a172dc6f061817c3a8b3ebf0-C001-28", "text": "(2) Next, we employ self-training (He et al., 2019) to create augmented parallel pairs to continue updating the system (model self )." }, { "sent_id": "7c2586a172dc6f061817c3a8b3ebf0-C001-29", "text": "----------------------------------" }, { "sent_id": "7c2586a172dc6f061817c3a8b3ebf0-C001-30", "text": "**APPROACH**" }, { "sent_id": "7c2586a172dc6f061817c3a8b3ebf0-C001-31", "text": "First phase of the approach requires training of an unsupervised NMT system similar to Ogueji & Ahia (2019) (PidginUNMT)." }, { "sent_id": "7c2586a172dc6f061817c3a8b3ebf0-C001-32", "text": "Similar to Ogueji & Ahia (2019) , we train the cross-lingual model using FastText Bojanowski et al. (2017) on the combined Pidgin-English corpus." }, { "sent_id": "7c2586a172dc6f061817c3a8b3ebf0-C001-33", "text": "Next, we train an unsupervised NMT similar to Lample et al. (2017) ; Artetxe et al. (2017) ; Ogueji & Ahia (2019) between them to obtain model unsup ." }, { "sent_id": "7c2586a172dc6f061817c3a8b3ebf0-C001-34", "text": "Then we further utilize model unsup to construct pseudo parallel corpus by predicting target Pidgin text given the English input." }, { "sent_id": "7c2586a172dc6f061817c3a8b3ebf0-C001-35", "text": "We augment this dataset to the existing monolingual corpus." }, { "sent_id": "7c2586a172dc6f061817c3a8b3ebf0-C001-36", "text": "The self-training step involves further updating model unsup on the pseudo parallel corpus and non-parallel monolingual corpus to yield model self ." }, { "sent_id": "7c2586a172dc6f061817c3a8b3ebf0-C001-37", "text": "----------------------------------" }, { "sent_id": "7c2586a172dc6f061817c3a8b3ebf0-C001-38", "text": "**EXPERIMENTS AND RESULTS**" }, { "sent_id": "7c2586a172dc6f061817c3a8b3ebf0-C001-39", "text": "We conduct experiments on the E2E corpus (Novikova et al., 2017) which amounts to roughly 42k samples in the training set." 
}, { "sent_id": "7c2586a172dc6f061817c3a8b3ebf0-C001-40", "text": "The monolingual Pidgin corpus contains 56,695 sentences and 32,925 unique words." }, { "sent_id": "7c2586a172dc6f061817c3a8b3ebf0-C001-41", "text": "The human evaluation was performed on the test set (630 data instances for E2E) by averaging over scores by 2 native Pidgin speakers on both Relevance (0 or 1 to indicate relevant or not) and Fluency (0, 1, or 2 to indicate readability)." }, { "sent_id": "7c2586a172dc6f061817c3a8b3ebf0-C001-42", "text": "Table 2 shows that model self outperforms direct translation (PidginUNMT) and unsupervisedly-trained model model unsup on relevance and performing on par with PidginUNMT on fluency." }, { "sent_id": "7c2586a172dc6f061817c3a8b3ebf0-C001-43", "text": "We also display relevant sample outputs in Table 3 at all levels of fluency." }, { "sent_id": "7c2586a172dc6f061817c3a8b3ebf0-C001-44", "text": "Pidgin text Fluency Every money of money on food and at least 1 of 1 points." }, { "sent_id": "7c2586a172dc6f061817c3a8b3ebf0-C001-45", "text": "0 and na one na di best food for di world." }, { "sent_id": "7c2586a172dc6f061817c3a8b3ebf0-C001-46", "text": "1 People dey feel the good food but all of us no dey available." }, { "sent_id": "7c2586a172dc6f061817c3a8b3ebf0-C001-47", "text": "2" }, { "sent_id": "7c2586a172dc6f061817c3a8b3ebf0-C001-48", "text": "----------------------------------" }, { "sent_id": "7c2586a172dc6f061817c3a8b3ebf0-C001-49", "text": "**CONCLUSION**" }, { "sent_id": "7c2586a172dc6f061817c3a8b3ebf0-C001-50", "text": "In this paper, we have shown that it is possible to improve upon low-resource Pidgin text generation in a demonstrated low-resource scenario." }, { "sent_id": "7c2586a172dc6f061817c3a8b3ebf0-C001-51", "text": "By using non-parallel in-domain English and out-domain Pidgin text along with self-training algorithm, we show that both fluency and relevance can be further improved." 
}, { "sent_id": "7c2586a172dc6f061817c3a8b3ebf0-C001-52", "text": "This work serves as the starting point for future works on Pidgin NLG in the absence of annotated data." }, { "sent_id": "7c2586a172dc6f061817c3a8b3ebf0-C001-53", "text": "For future works, we will also further utilize phrase-based statistical machine translation to further improve upon current work." } ], "y": { "@BACK@": { "gold_contexts": [ [ "7c2586a172dc6f061817c3a8b3ebf0-C001-10" ], [ "7c2586a172dc6f061817c3a8b3ebf0-C001-14" ] ], "cite_sentences": [ "7c2586a172dc6f061817c3a8b3ebf0-C001-10", "7c2586a172dc6f061817c3a8b3ebf0-C001-14" ] }, "@MOT@": { "gold_contexts": [ [ "7c2586a172dc6f061817c3a8b3ebf0-C001-14", "7c2586a172dc6f061817c3a8b3ebf0-C001-15" ] ], "cite_sentences": [ "7c2586a172dc6f061817c3a8b3ebf0-C001-14" ] }, "@SIM@": { "gold_contexts": [ [ "7c2586a172dc6f061817c3a8b3ebf0-C001-31" ], [ "7c2586a172dc6f061817c3a8b3ebf0-C001-32" ], [ "7c2586a172dc6f061817c3a8b3ebf0-C001-33" ] ], "cite_sentences": [ "7c2586a172dc6f061817c3a8b3ebf0-C001-31", "7c2586a172dc6f061817c3a8b3ebf0-C001-32", "7c2586a172dc6f061817c3a8b3ebf0-C001-33" ] } } }, "ABC_a9897f66e05a0354c36daba0db9afe_40": { "x": [ { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-2", "text": "We introduce a set of 1,000 gold standard parse trees for the British National Corpus (BNC) and perform a series of self-training experiments with Charniak and Johnson's reranking parser and BNC sentences." }, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-3", "text": "We show that retraining this parser with a combination of one million BNC parse trees (produced by the same parser) and the original WSJ training data yields improvements of 0.4% on WSJ Section 23 and 1.7% on the new BNC gold standard set." 
}, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-4", "text": "----------------------------------" }, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-5", "text": "**INTRODUCTION**" }, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-6", "text": "Given the success of statistical parsing models on the Wall Street Journal (WSJ) section of the Penn Treebank (PTB) (Charniak, 2000; Collins, 2003, for example) , there has been a change in focus in recent years towards the problem of replicating this success on genres other than American financial news stories." }, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-7", "text": "The main challenge in solving the parser adaptation problem are the resources required to construct reliable annotated training examples." }, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-8", "text": "A breakthrough has come in the form of research by McClosky et al. (2006a; 2006b ) who show that self-training can be used to improve parser performance when combined with a two-stage reranking parser model (Charniak and Johnson, 2005) ." }, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-9", "text": "Selftraining is the process of training a parser on its own output, and earlier self-training experiments using generative statistical parsers did not yield encouraging results (Steedman et al., 2003) ." }, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-10", "text": "McClosky et al. (2006a; 2006b ) proceed as follows: sentences * Now affiliated to Lalic, Universit\u00e9 Paris 4 La Sorbonne." }, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-11", "text": "from the LA Times newspaper are parsed by a firststage generative statistical parser trained on some seed training data (WSJ Sections 2-21) and the nbest parse trees produced by this parser are reranked by a discriminative reranker." 
}, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-12", "text": "The highest ranked parse trees are added to the training set of the parser and the parser is retrained." }, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-13", "text": "This self-training method gives improved performance, not only on Section 23 of the WSJ (an absolute f-score improvement of 0.8%), but also on test sentences from the Brown corpus (Francis and Ku\u010dera, 1979 ) (an absolute fscore improvement of 2.6%)." }, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-14", "text": "In the experiments of McClosky et al. (2006a; 2006b) , the parse trees used for self-training come from the same domain (American newspaper text) as the parser's original seed training material." }, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-15", "text": "Bacchiani et al. (2006) find that self-training is effective when the parse trees used for self-training (WSJ parse trees) come from a different domain to the seed training data and from the same domain as the test data (WSJ sentences)." }, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-16", "text": "They report a performance boost of 4.2% on WSJ Section 23 for a generative statistical parser trained on Brown seed data when it is self-trained using 200,000 WSJ parse trees." }, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-17", "text": "However, McCloskey et al. (2006b) report a drop in performance for their reranking parser when the experiment is repeated in the opposite direction, i.e. with Brown data for self-training and testing, and WSJ data for seed training." }, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-18", "text": "In contrast, we report successful in-domain 1 self-training experiments with the BNC data as self-training and test material, and with the WSJ-trained reranking parser used by McCloskey et al. (2006a; 2006b) ." 
}, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-19", "text": "We parse the BNC (Burnard, 2000) in its entirety using the reranking parser of Charniak and Johnson (2005) ." }, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-20", "text": "1,000 BNC sentences are manually annotated for constituent structure, resulting in the first gold standard set for this corpus." }, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-21", "text": "The gold standard set is split into a development set of 500 parse trees and a test set of 500 parse trees and used in a series of self-training experiments: Charniak and Johnson's parser is retrained on combinations of WSJ treebank data and its own parses of BNC sentences." }, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-22", "text": "These combinations are tested on the BNC development set and Section 00 of the WSJ." }, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-23", "text": "An optimal combination is chosen which achieves a Parseval labelled bracketing f-score of 91.7% on Section 23 and 85.6% on the BNC gold standard test set." }, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-24", "text": "For Section 23 this is an absolute improvement of 0.4% on the baseline results of this parser, and for the BNC data this is a statistically significant improvement of 1.7%." }, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-25", "text": "----------------------------------" }, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-26", "text": "**THE BNC DATA**" }, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-27", "text": "The BNC is a 100-million-word balanced part-ofspeech-tagged corpus of written and transcribed spoken English." }, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-28", "text": "Written text comprises 90% of the BNC: 75% non-fictional and 25% fictional." 
}, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-29", "text": "To facilitate parsing with a WSJ-trained parser, some reversible transformations were applied to the BNC data, e.g. British English spellings were converted to American English and neutral quotes disambiguated." }, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-30", "text": "The reranking parser of Charniak and Johnson (2005) was used to parse the BNC." }, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-31", "text": "99.8% of the 6 million BNC sentences obtained a parse, with an average parsing speed of 1.4s per sentence." }, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-32", "text": "A gold standard set of 1,000 BNC sentences was constructed by one annotator by correcting the output of the first stage of Charniak and Johnson's reranking parser." }, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-33", "text": "The sentences included in the gold standard were chosen at random from the BNC, subject to the condition that they contain a verb which does not occur in the training sections of the WSJ section of the PTB (Marcus et al., 1993) ." }, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-34", "text": "A decision was made to select sentences for the gold standard set which differ from the sentences in the WSJ training sections, and one way of finding different sentences is to focus on verbs which are not attested in the WSJ Sections 2-21." }, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-35", "text": "It is expected that these gold standard parse trees can be used as training data although they are used only as test and development data in this work." }, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-36", "text": "Because they contain verbs which do not occur in the parser's training set, they are likely to represent a hard test for WSJ-trained parsers." 
}, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-37", "text": "The PTB bracketing guidelines (Bies et al., 1995) and the PTB itself were used as references by the BNC annotator." }, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-38", "text": "Functional tags and traces were not annotated." }, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-39", "text": "The annotator noticed that the PTB parse trees sometimes violate the PTB bracketing guidelines, and in these cases, the annotator chose the analysis set out in the guidelines." }, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-40", "text": "It took approximately 60 hours to build the gold standard set." }, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-41", "text": "----------------------------------" }, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-42", "text": "**SELF-TRAINING EXPERIMENTS**" }, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-43", "text": "Charniak and Johnson's reranking parser (June 2006 version) is evaluated against the BNC gold standard development set." }, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-44", "text": "Labelled precision (LP), recall (LR) and f-score measures 2 for this parser are shown in the first row of Table 1 ." }, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-45", "text": "The f-score of 83.7% is lower than the f-score of 85.2% reported by McClosky et al. (2006b) for the same parser on Brown corpus data." }, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-46", "text": "This difference is reasonable since there is greater domain variation between the WSJ and the BNC than between the WSJ and the Brown corpus, and all BNC gold standard sentences contain verbs not attested in WSJ Sections 2-21." }, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-47", "text": "We retrain the first-stage generative statistical parser of Charniak and Johnson using combinations of BNC trees (parsed using the reranking parser) and WSJ treebank trees." 
}, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-48", "text": "We test the combinations on the BNC gold standard development set and on WSJ Section 00." }, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-49", "text": "Table 1 shows that parser accuracy increases with the size of the in-domain selftraining material." }, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-50", "text": "3 The figures confirm the claim of McClosky et al. (2006a) that self-training with a reranking parsing model is effective for improving parser accuracy in general, and the claim of Gildea (2001) that training on in-domain data is effective for parser adaption." }, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-51", "text": "They confirm that self-training on in-domain data is effective for parser adaptation." }, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-52", "text": "The WSJ Section 00 results suggest that, in order to maintain performance on the seed training domain, it is necessary to combine BNC parse trees Of the self-training combinations with abovebaseline improvements for both development sets, the combination of 1,000K BNC parse trees and Section 2-21 of the WSJ (multiplied by ten) yields the highest improvement for the BNC data, and we present final results with this combination for the BNC gold standard test set and WSJ Section 23." }, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-53", "text": "There is an absolute improvement on the original reranking parser of 1.7% on the BNC gold standard test set and 0.4% on WSJ Section 23." }, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-54", "text": "The improvement on BNC data is statistically significant for both precision and recall (p < 0.0002, p < 0.0002)." }, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-55", "text": "The improvement on WSJ Section 23 is statistically significant for precision only (p < 0.003)." 
}, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-56", "text": "----------------------------------" }, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-57", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-58", "text": "We have introduced a set of 1,000 gold standard parse trees for the BNC." }, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-59", "text": "We have performed selftraining experiments with Charniak and Johnson's reranking parser and sentences from the BNC." }, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-60", "text": "We have shown that retraining this parser with a combination of one million BNC parse trees (produced by the same parser) and the original WSJ training data yields improvements of 0.4% on WSJ Section 23 and 1.7% on the BNC gold standard sentences." }, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-61", "text": "These results indicate that self-training on in-domain data can be used for parser adaptation." }, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-62", "text": "Our BNC gold standard set consists of sentences containing verbs which are not in the WSJ training sections." }, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-63", "text": "We suspect that this makes the gold standard set a hard test for WSJ-trained parsers, and our results are likely to represent a lower bound for WSJ-trained parsers on BNC data." }, { "sent_id": "a9897f66e05a0354c36daba0db9afe-C001-64", "text": "When used as training data, we predict that the novel verbs in the BNC gold standard set add to the variety of training material, and will further help parser adaptation from the WSJ domain -a matter for further research." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "a9897f66e05a0354c36daba0db9afe-C001-8" ], [ "a9897f66e05a0354c36daba0db9afe-C001-14" ] ], "cite_sentences": [ "a9897f66e05a0354c36daba0db9afe-C001-8", "a9897f66e05a0354c36daba0db9afe-C001-14" ] }, "@UNSURE@": { "gold_contexts": [ [ "a9897f66e05a0354c36daba0db9afe-C001-50" ] ], "cite_sentences": [ "a9897f66e05a0354c36daba0db9afe-C001-50" ] } } }, "ABC_5f25b6a3bcaca2e4beb59ce0f3eb5f_40": { "x": [ { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-2", "text": "SUDOKU's submissions to SemEval Task 13 treats Word Sense Disambiguation and Entity Linking as a deterministic problem that exploits two key attributes of open-class words as constraints -their degree of polysemy and their part of speech." }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-3", "text": "This is an extension and further validation of the results achieved by Manion and Sainudiin (2014)." }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-4", "text": "SUDOKU's three submissions are incremental in the use of the two aforementioned constraints." }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-5", "text": "Run1 has no constraints and disambiguates all lemmas in one pass." }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-6", "text": "Run2 disambiguates lemmas at increasing degrees of polysemy, leaving the most polysemous until last." }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-61", "text": "This underscores the importance of named entities being included in disambiguation tasks." }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-7", "text": "Run3 is identical to Run2, with the additional constraint of disambiguating all named entities and nouns first before other types of open-class words (verbs, adjectives, and adverbs)." 
}, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-8", "text": "Over all-domains, for English Run2 and Run3 were placed second and third." }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-9", "text": "For Spanish Run2, Run3, and Run1 were placed first, second, and third respectively." }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-10", "text": "For Italian Run1 was placed first with Run2 and Run3 placed second equal." }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-11", "text": "----------------------------------" }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-12", "text": "**INTRODUCTION & RELATED WORK**" }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-13", "text": "Almost a decade ago, Agirre and Edmonds (2007) suggested the promising potential for WSD that could exploit the interdependencies between senses in an interactive manner." }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-14", "text": "In other words, this would be a WSD system which allows the disambiguation of word a to directly influence the consecutive disambiguation of word b. This is analogous to treating WSD as a deterministic problem, much like the Sudoku puzzle in which the final solution is reached by adhering to a set of pre-determined constraints." }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-15", "text": "Conventional approaches to WSD often overlook the potential to exploit sense interdependencies, and simply disambiguate all senses in one pass based on a context window (e.g. a sentence or document)." }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-16", "text": "For this task the author proposes an iterative approach which makes several passes based on a set of constraints." 
}, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-17", "text": "For a more formal distinction between the conventional and iterative approach to WSD, please refer to this paper (Manion and Sainudiin, 2014 Table 1 : Parts of Speech disambiguated (as percentages) for each SemEval Task (denoted by its year)." }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-18", "text": "In-Degree Centrality as implemented in (Manion and Sainudiin, 2014) observes F-Score improvement (F + \u2206F) by applying the iterative approach." }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-19", "text": "The author found in the investigations of his thesis (Manion, 2014) that the iterative approach performed best on the SemEval 2013 Multilingual WSD Task (Navigli et al., 2013) , as opposed to earlier tasks such as SensEval 2004 English All Words WSD Task (Snyder and Palmer, 2004) and the SemEval 2010 All Words WSD task on a Specific Domain (Agirre et al., 2010) ." }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-20", "text": "While these earlier tasks also experienced improvement, F-Scores remained lower overall." }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-21", "text": "Table 1 Depicted above are distributions for each domain and language, detailing the probability (y-axis) of specific parts of speech at increasing degrees of polysemy (x-axis)." }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-22", "text": "These distributions were produced from the gold keys (or synsets) of the test documents by querying BabelNet for the polysemy of each word." }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-23", "text": "Each distribution was normalised with one sense per discourse assumed, therefore duplicate synsets were ignored." }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-24", "text": "Lastly the difference in F-Score between the conventional Run1 and the iterative Run2 and Run3 is listed beside each distribution." 
}, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-25", "text": "Firstly WSD tasks before 2013 generally relied on only a lexicon, such as WordNet (Fellbaum, 1998) or an alternative equivalent, whereas SemEval 2013 Task 12 WSD and this task (Moro and Navigli, 2015) included Entity Linking (EL) using the encyclopaedia Wikipedia via BabelNet (Navigli and Ponzetto, 2012) ." }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-26", "text": "Secondly, as shown by Manion and Sainudiin (2014) with a simple linear regression, the iterative approach increases WSD performance for documents that have a higher degree of document monosemy -the percentage of unique monosemous lemmas in a document." }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-27", "text": "As seen in Figures 1(a) to (i) on the previous page, named entities (or unique rather than common nouns) are more monosemous compared to other parts of speech, especially for more technical domains." }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-28", "text": "Lastly, the SemEval 2013 WSD task differs in that only nouns and named entities required disambiguation." }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-29", "text": "This simplifies the WSD task, as shown in the experiments on local context by Yarowsky (1993) , nouns are best disambiguated by directly adjacent nouns (or modifying adjectives)." }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-30", "text": "Based on these observations, the author hypothesized the following implementations of the iterative approach should perform well." }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-31", "text": "----------------------------------" }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-32", "text": "**SYSTEM DESCRIPTION & IMPLEMENTATION**" }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-33", "text": "Run1 (SUDOKU-1) is the conventional approachno constraints are applied." 
}, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-34", "text": "Formalised in (Manion and Sainudiin, 2014) , this run can act as a baseline to gauge any improvement for Run2 and Run3 that apply the iterative approach." }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-35", "text": "Run2 (SUDOKU-2) has the constraint of words being disambiguated in order of increasing polysemy, leaving the most polysemous to last." }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-36", "text": "Run3 (SUDOKU-3) is an untested and unpublished version of the iterative approach." }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-37", "text": "It includes Run2's constraint plus a second constraint -that all nouns and named entities must be disambiguated before other parts of speech." }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-38", "text": "For each run, a semantic subgraph is constructed from BabelNet (version 2.5.1)." }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-39", "text": "Then for disambiguation the graph centrality measure PageRank (Brin and Page, 1998 ) is used in conjunction with a surfing vector that biases probability mass to certain sense nodes in the semantic subgraph." }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-40", "text": "This idea is taken from Personalised PageRank (PPR) (Agirre and Soroa, 2009) , which applies the method put forward by Haveliwala (2003) to the field of WSD." }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-41", "text": "In the previous SemEval WSD task (Navigli et al., 2013) team UMCC DLSI (Gutierrez et al., 2013) implemented this method and achieved the best performance by biasing probability mass based on SemCor (Miller et al., 1993 ) sense frequencies." }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-42", "text": "As the winning method for this task, PPR was selected to test the iterative approach on." 
}, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-43", "text": "For SUDOKU's implementation to be unsupervised, all runs biased probability mass towards senses from monosemous lemmas." }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-44", "text": "Additionally for Run2 and Run3, once a lemma is disambiguated it is considered to be monosemous." }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-45", "text": "Therefore with each iteration of Run2 and Run3, probability mass is redistributed across the surfing vector to acknowledge these newly appointed monosemous lemmas." }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-46", "text": "All system runs are applied at the document level, across all languages and domains, for all named entities, nouns, verbs, adverbs, and adjectives." }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-47", "text": "Semantic subgraphs are constructed from BabelNet via a Depth First Search (DFS) up to 2 hops in path length." }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-48", "text": "PageRank's damping factor is set to 0.85, with a maximum of 30 iterations 1 ." }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-49", "text": "In order to avoid masking the effect of using the iterative approach, a back-off strategy (see (McCarthy et al., 2004) ) was not used." }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-50", "text": "Multiword units were found by finding lemma sequences that contained at least one noun and at the same time could return a result from BabelNet." }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-51", "text": "Lemma sequences beginning with definite/indefinite articles (e.g. the, a, il, la, and el) were removed as they induced too much noise, given they almost always returned a result from BabelNet (such as a book or movie title)." 
}, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-52", "text": "----------------------------------" }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-53", "text": "**RESULTS, DISCUSSIONS, & CONCLUSIONS**" }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-54", "text": "As seen in Figures 1(a) to (i) on the previous page, the Biomedical and Math & Computers domains include a substantial degree of monosemy, no doubt increased by the monosemous technical terms and named entities present." }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-55", "text": "Given the importance of document monosemy for the iterative approach, it is of no surprise that Run2 and Run3 in most cases performed much better than Run1 for these technical domains." }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-56", "text": "Equally so, Run2 and Run3 were outperformed by Run1 for the less technical Social Issues All Domains domain in which many of the named entities are polysemous rather than monosemous." }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-57", "text": "While the iterative approach achieved reasonably competitive results in English, this success did not translate as well to Spanish and Italian." }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-58", "text": "The Italian Biomedical domain had the highest document monosemy, observable in Figure 1 (g ), yet this did not help the iterative Run2 and Run3." }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-59", "text": "Yet it is worth noting the results of the task paper (Moro and Navigli, 2015) report that SUDOKU Run2 and Run3 achieved very low F-Scores for named entity disambiguation (<28.6) in Spanish and Italian." }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-60", "text": "Given that more than half of the named entities were monosemous in Figure 1 (d) and (g), the WSD system either did not capture them in text or filtered them out during subgraph construction (see BabelNet API)." 
}, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-62", "text": "To further support this evidence, while the iterative approach is suited to domain based WSD, recall that the 2010 domain based WSD task in Table 1 also had no tagged named entities (and thus scores were lower than for successive named entity inclusive WSD tasks)." }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-63", "text": "As seen in Table 2 , the iterative approach has a varied effect on different parts of speech." }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-64", "text": "Always improved is the disambiguation of named entities and adverbs." }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-65", "text": "This is also the case for nouns in technical domains (e.g. Biomedical as opposed to Social Issues)." }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-66", "text": "On the other hand the disambiguation of verbs and adjectives suffers under the iterative approach." }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-67", "text": "In hindsight, the iterative approach could be restricted to the parts of speech it is known to improve, while remaining with the conventional approach on others." }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-68", "text": "To the right in Table 3 the author's SUDOKU runs are compared against the team with the most competitive results -LIMSI." }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-69", "text": "The author could not improve on their superior results achieved in English, however for Spanish and Italian the BabelNet First Sense (BFS) baseline was much lower since it often resorted to lexicographic sorting in the absence of WordNet synsets -see (Navigli et al., 2013) ." }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-70", "text": "The author's baseline-independent submissions were unaffected by this, which on reviewing results in (Moro and Navigli, 2015) appears to have helped SUDOKU do best for these languages." 
}, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-71", "text": "Table 3 : F1 scores for each domain/language for SUDOKU and LIMSI." }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-72", "text": "In summary, the inclusion of named entities in disambiguation tasks certainly improves results, as well as the effectiveness of the iterative approach." }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-73", "text": "Furthermore in Table 3 above, the iterative Run3 for the English Biomedical domain is 0.1 short of achieving the best result of 71.3." }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-74", "text": "Investigating exactly which factors contributed to the success of this unsupervised result is a top priority for future work." }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-75", "text": "----------------------------------" }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-76", "text": "**RESOURCES**" }, { "sent_id": "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-77", "text": "Codebase and resources are at the author's homepage: http://www.stevemanion.com." 
} ], "y": { "@EXT@": { "gold_contexts": [ [ "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-3" ] ], "cite_sentences": [ "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-3" ] }, "@UNSURE@": { "gold_contexts": [ [ "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-17" ] ], "cite_sentences": [ "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-17" ] }, "@BACK@": { "gold_contexts": [ [ "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-18" ], [ "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-26" ], [ "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-34" ] ], "cite_sentences": [ "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-18", "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-26", "5f25b6a3bcaca2e4beb59ce0f3eb5f-C001-34" ] } } }, "ABC_0e4ca87c0e2b899bfd1f36dc5974b9_40": { "x": [ { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-2", "text": "We present PhoBERT with two versions of \"base\" and \"large\"-the first public large-scale monolingual language models pre-trained for Vietnamese." }, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-3", "text": "We show that PhoBERT improves the state-ofthe-art in multiple Vietnamese-specific NLP tasks including Part-of-speech tagging, Named-entity recognition and Natural language inference." }, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-4", "text": "We release PhoBERT to facilitate future research and downstream applications for Vietnamese NLP." }, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-5", "text": "Our PhoBERT is released at: https://github. com/VinAIResearch/PhoBERT." 
}, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-6", "text": "----------------------------------" }, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-8", "text": "Pre-trained language models, especially BERT-the Bidirectional Encoder Representations from Transformers [Devlin et al., 2019] , have recently become extremely popular and helped to produce significant improvement gains for various NLP tasks." }, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-9", "text": "The success of pre-trained BERT and its variants has largely been limited to the English language." }, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-10", "text": "For other languages, one could retrain a language-specific model using the BERT architecture Martin et al., 2019; de Vries et al., 2019] or employ existing pre-trained multilingual BERT-based models [Devlin et al., 2019; Conneau et al., 2019; Conneau and Lample, 2019] ." }, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-11", "text": "In terms of Vietnamese language modeling, to the best of our knowledge, there are two main concerns: (i) The Vietnamese Wikipedia corpus is the only data used to train all monolingual language models , and it also is the only Vietnamese dataset included in the pre-training data used by all multilingual language models except XLM-R [Conneau et al., 2019] ." }, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-12", "text": "It is worth noting that Wikipedia data is not representative of a general language use, and the Vietnamese Wikipedia data is relatively small (1GB in size uncompressed), while pre-trained language models can be significantly improved by using more data [Liu et al., 2019] ." 
}, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-13", "text": "(ii) All monolingual and multilingual models, except ETNLP , are not aware of the difference between Vietnamese syllables and word tokens (this ambiguity comes from the fact that the white space is also used to separate syllables that constitute words when written in Vietnamese)." }, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-14", "text": "Without doing a pre-process step of Vietnamese word segmentation, those models directly apply Bype-Pair encoding (BPE) methods [Sennrich et al., 2016] to the syllable-level pre-training Vietnamese data." }, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-15", "text": "Also, although performing word segmentation before applying BPE on the Vietnamese Wikipedia corpus, ETNLP in fact does not publicly release any pre-trained BERT-based model." }, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-16", "text": "1 As a result, we find difficulties in applying existing pre-trained language models for word-level Vietnamese NLP tasks." }, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-17", "text": "To handle the two concerns above, we train the first largescale monolingual BERT-based \"base\" and \"large\" models using a 20GB word-level Vietnamese corpus." }, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-18", "text": "We evaluate our models on three downstream Vietnamese NLP tasks: the two most common ones of Part-of-speech (POS) tagging and Named-entity recognition (NER), and a language understanding task of Natural language inference (NLI)." }, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-19", "text": "Experimental results show that our models obtain state-of-the-art (SOTA) performances for all three tasks." 
}, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-20", "text": "We release our models under the name PhoBERT in popular open-source libraries, hoping that PhoBERT can serve as a strong baseline for future Vietnamese NLP research and applications." }, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-21", "text": "----------------------------------" }, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-22", "text": "**PHOBERT**" }, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-23", "text": "This section outlines the architecture and describes the pretraining data and optimization setup we use for PhoBERT." }, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-24", "text": "Architecture: PhoBERT has two versions PhoBERT base and PhoBERT large , using the same configuration as BERT base and BERT large , respectively." }, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-25", "text": "PhoBERT pre-training approach is based on RoBERTa [Liu et al., 2019] which optimizes the BERT pre-training method for more robust performance." }, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-26", "text": "Data: We use a pre-training dataset of 20GB of uncompressed texts after cleaning." }, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-27", "text": "This dataset is a combination of two corpora: (i) the first one is the Vietnamese Wikipedia corpus (\u223c1GB), and (ii) the second corpus (\u223c19GB) is a subset of a 40GB Vietnamese news corpus after filtering out similar news and duplications." }, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-28", "text": "2 We employ RDRSegmenter from VnCoreNLP to perform word and sentence segmentation on the pre-training dataset, resulting in \u223c145M word-segmented sentences (\u223c3B word tokens)." }, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-29", "text": "Different from RoBERTa, we then apply fastBPE [Sennrich et al., 2016] to segment these sentences with subword units, using a vocabulary size of 64K subword types." 
}, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-30", "text": "Optimization: We employ the RoBERTa implementation in fairseq ." }, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-31", "text": "Each sentence contains at most 256 subword tokens (here, 5K/145M sentences with more , 2014] ." }, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-32", "text": "We use a batch size of 1024 and a peak learning rate of 0.0004 for PhoBERT base , and a batch size of 512 and a peak learning rate of 0.0002 for PhoBERT large ." }, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-33", "text": "We run for 40 epochs (here, the learning rate is warmed up for 2 epochs)." }, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-34", "text": "We use 4 Nvidia V100 GPUs (16GB each), resulting in about 540K training steps for PhoBERT base and 1.08M steps for PhoBERT large ." }, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-35", "text": "We pretrain PhoBERT base during 3 weeks, and then PhoBERT large during 5 weeks." }, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-36", "text": "----------------------------------" }, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-37", "text": "**EXPERIMENTS**" }, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-38", "text": "We evaluate the performance of PhoBERT on three downstream Vietnamese NLP tasks: POS tagging, NER and NLI." }, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-39", "text": "Experimental setup: For the two most common Vietnamese POS tagging and NER tasks, we follow the VnCoreNLP setup , using standard benchmarks of the VLSP 2013 POS tagging dataset and the VLSP 2016 NER dataset [Nguyen et al., 2019a] ." }, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-40", "text": "For NLI, we use the Vietnamese validation and test sets from the XNLI corpus v1.0 [Conneau et al., 2018] where the Vietnamese training data is machinetranslated from English." 
}, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-41", "text": "Unlike the 2013 POS tagging and 2016 NER datasets which provide the gold word segmentation, for NLI, we use RDRSegmenter to segment the text into words before applying fastBPE to produce subwords from word tokens." }, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-42", "text": "Following Devlin et al. [2019] , for POS tagging and NER, we append a linear prediction layer on top of the PhoBERT architecture w.r.t." }, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-43", "text": "the first subword token of each word." }, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-44", "text": "We fine-tune PhoBERT for each task and each dataset independently, employing the Hugging Face transformers for POS tagging and NER and the RoBERTa implementation in fairseq for NLI." }, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-45", "text": "We use AdamW [Loshchilov and Hutter, 2019] with a fixed learning rate of 1.e-5 and a batch size of 32." }, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-46", "text": "We fine-tune in 30 training epochs, evaluate the task performance after each epoch on the validation set (here, early stopping is applied when there is no improvement after 5 continuous epochs), and then select the best model to report the final result on the test set." }, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-47", "text": "Main results: Table 1 compares our PhoBERT scores with the previous highest reported results, using the same experimental setup." }, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-48", "text": "PhoBERT helps produce new SOTA results for all the three tasks, where unsurprisingly PhoBERT large obtains higher performances than PhoBERT base ." }, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-49", "text": "For POS tagging, PhoBERT obtains about 0.8% absolute higher accuracy than the feature-and neural network-based models VnCoreNLP-POS (i.e. VnMarMoT) and join-tWPD." 
}, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-50", "text": "For NER, PhoBERT large is 1.1 points higher F 1 than PhoBERT base which is 2+ points higher than the featureand neural network-based models VnCoreNLP-NER and BiLSTM-CNN-CRF trained with the BERT-based ETNLP word embeddings ." }, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-51", "text": "For NLI, PhoBERT outperforms the multilingual BERT and the BERT-based crosslingual model with a new translation language modeling objective XLM MLM+TLM by large margins." }, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-52", "text": "PhoBERT also performs slightly better than the cross-lingual model XLM-R, but using far fewer parameters than XLM-R (base: 135M vs. 250M; large: 370M vs. 560M)." }, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-53", "text": "Discussion: Using more pre-training data can help significantly improve the quality of the pre-trained language models [Liu et al., 2019] ." }, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-54", "text": "Thus it is not surprising that PhoBERT helps produce better performance than ETNLP on NER, and the multilingual BERT and XLM MLM+TLM on NLI (here, PhoBERT employs 20GB of Vietnamese texts while those models employ the 1GB Vietnamese Wikipedia data)." }, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-55", "text": "Our PhoBERT also does better than XLM-R which uses a 2.5TB pre-training corpus containing 137GB of Vietnamese texts (i.e. about 137/20 \u2248 7 times bigger than our pretraining corpus)." }, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-56", "text": "Recall that PhoBERT performs segmentation into subword units after performing a Vietnamese word segmentation, while XLM-R directly applies a BPE method to the syllable-level pre-training Vietnamese data." }, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-57", "text": "Clearly, word-level information plays a crucial role for the Vietnamese language understanding task of NLI, i.e. 
word segmentation is necessary to improve the NLI performance." }, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-58", "text": "This reconfirms that dedicated language-specific models still outperform multilingual ones [Martin et al., 2019] ." }, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-59", "text": "Experiments also show that a straightforward fine-tuning approach, as used here, can lead to SOTA results." }, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-60", "text": "Note that we might boost our downstream task performances even further with more careful hyper-parameter tuning." }, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-61", "text": "----------------------------------" }, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-62", "text": "**CONCLUSION**" }, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-63", "text": "In this paper, we have presented the first public large-scale PhoBERT language models for Vietnamese." }, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-64", "text": "We demonstrate the usefulness of PhoBERT by producing new state-of-the-art performances for three Vietnamese NLP tasks of POS tagging, NER and NLI." }, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-65", "text": "By publicly releasing PhoBERT, we hope that it can foster future research and applications in Vietnamese NLP." }, { "sent_id": "0e4ca87c0e2b899bfd1f36dc5974b9-C001-66", "text": "Our PhoBERT and its usage are available at: https://github.com/VinAIResearch/PhoBERT." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "0e4ca87c0e2b899bfd1f36dc5974b9-C001-10" ], [ "0e4ca87c0e2b899bfd1f36dc5974b9-C001-11" ] ], "cite_sentences": [ "0e4ca87c0e2b899bfd1f36dc5974b9-C001-10", "0e4ca87c0e2b899bfd1f36dc5974b9-C001-11" ] }, "@MOT@": { "gold_contexts": [ [ "0e4ca87c0e2b899bfd1f36dc5974b9-C001-11" ] ], "cite_sentences": [ "0e4ca87c0e2b899bfd1f36dc5974b9-C001-11" ] }, "@USE@": { "gold_contexts": [ [ "0e4ca87c0e2b899bfd1f36dc5974b9-C001-40" ] ], "cite_sentences": [ "0e4ca87c0e2b899bfd1f36dc5974b9-C001-40" ] } } }, "ABC_f60796ff05156e81c4b183cdcb05ae_40": { "x": [ { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-71", "text": "----------------------------------" }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-72", "text": "**MULTI-TASK MODEL**" }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-44", "text": "State-of-the-art slot filling methods usually rely on bidirectional LSTM models [6, 5, 7, 17, 18 , among others]." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-46", "text": "For this study we crowd sourced natural language text data for 10 domains." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-2", "text": "State-of-the-art slot filling models for goal-oriented human/machine conversational language understanding systems rely on deep learning methods." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-3", "text": "While multi-task training of such models alleviates the need for large in-domain annotated datasets, bootstrapping a semantic parsing model for a new domain using only the semantic frame, such as the back-end API or knowledge graph schema, is still one of the holy grail tasks of language understanding for dialogue systems." 
}, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-4", "text": "This paper proposes a deep learning based approach that can utilize only the slot description in context without the need for any labeled or unlabeled in-domain examples, to quickly bootstrap a new domain." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-5", "text": "The main idea of this paper is to leverage the encoding of the slot names and descriptions within a multi-task deep learned slot filling model, to implicitly align slots across domains." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-6", "text": "The proposed approach is promising for solving the domain scaling problem and eliminating the need for any manually annotated data or explicit schema alignment." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-7", "text": "Furthermore, our experiments on multiple domains show that this approach results in significantly better slot-filling performance when compared to using only in-domain data, especially in the low data regime." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-8", "text": "----------------------------------" }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-10", "text": "In traditional goal-oriented dialogue systems, user utterances are typically understood in terms of hand-designed semantic frames comprised of domains, intents and slots [1] ." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-11", "text": "Understanding the user utterance involves (i) detecting the domain of the utterance, (ii) classifying the intent of the utterance based on the semantic frame corresponding to the detected domain and (iii) identifying the values or sequence of tokens corresponding to each slot in the semantic frame." 
}, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-12", "text": "An example semantic frame is shown in Figure 1 for a flight related query: find flights to new york tomorrow." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-13", "text": "Most modern approaches for conversational language understanding involve training machine learning models on annotated training data [2, 3, 4, among others] ." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-14", "text": "Deep learning models typically outperform most other approaches in the domain of large scale supervised learning and this has been shown to be the case for spoken language understanding [5, 6, 7, 8, among others] ." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-15", "text": "However, despite recent advancements and tremendous research activity in semi-supervised and unsupervised learning, these models still require massive amounts of labeled data to train." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-16", "text": "In recent years, motivated by commercial applications like Apple Siri, Microsoft Cortana, Amazon Alexa, or Google Assistant, there is significant interest in enabling users to add more functionality and power to their respective assistants." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-17", "text": "However, although these assistants can handle queries corresponding to certain narrow domains with high reliability, the ability to understand user queries across a wide range of domains, in a ro- bust manner, is still missing." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-18", "text": "This is a significant bottleneck that restricts the ability to crowd-source the addition of new actions or skills to these assistants." 
}, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-19", "text": "With recent advances in deep learning, there is renewed excitement around latent semantic representations which can be trained for a multitude of covered domains via transfer learning." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-20", "text": "An earlier work proposed training a single multi-task deep learning model covering all domains, providing implicit shared feature learning across domains [6] ." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-21", "text": "The approach showed significantly better overall semantic template level performance." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-45", "text": "Extensions include encoder-decoder models [19, 20, among others] or memory networks [21] ." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-22", "text": "Similar experiments on using shared feature extraction layers for slot-filling across several domains have demonstrated significant performance improvements relative to single-domain baselines, especially in low data regimes [9] ." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-23", "text": "In this study we explore semi-supervised slot-filling based on deep learning based approaches that can utilize natural language slot label descriptions." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-24", "text": "This alleviates the need for large amounts of labeled or unlabeled in-domain examples or explicit schema alignment, enabling developers to quickly bootstrap new domains." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-25", "text": "Similar ideas have previously been shown to work for domain classification, where domain names were leveraged to generate representations in a shared space with query representations [10] ." 
}, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-26", "text": "Applying this idea to a full semantic frame is much more complex and requires building a general slot or concept tagger over a large set of domains." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-27", "text": "The architecture for the general concept tagger is similar to those proposed in recent Question Answering (QA) and Machine Reading literature." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-28", "text": "We build upon the idea of using a learned question encoding and using it for extractive question answering from input passages [11, 12, 13] ." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-29", "text": "However, our proposed use case involves using slot encodings learned from their natural language descriptions and training the model on a multitude of small in-domain datasets to generalize to new domains, instead of training and evaluating on larger open-domain QA datasets." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-30", "text": "Through our experiments we demonstrate that our model learns to identify slots across domains, from small amounts of training data, without the need for any explicit schema alignments." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-31", "text": "Such an approach can significantly alleviate the domain scaling problem and reduce the need for additional manually annotated data when bringing up a new domain." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-32", "text": "In Section 2 we describe the task and the dataset, followed by descriptions of the baseline model, the multi-task model and the concept tagger in Section 3." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-33", "text": "This is followed by experimental results in Section 4 and discussion in Section 5." 
}, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-34", "text": "----------------------------------" }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-35", "text": "**SLOT FILLING**" }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-36", "text": "In most spoken dialogue systems, the semantic structure of an application domain is defined in terms of semantic frames." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-37", "text": "Each semantic frame contains several typed components called \"slots.\" For the example in Figure 1 , the domain Flights may contain slots like Departure City, Arrival City, Departure Date, Airline Name, etc." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-38", "text": "The task of slot filling is then to instantiate slots in semantic frames from a given user query or utterance." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-39", "text": "More formally, the task is to estimate the sequence of tags Y = y1, ..., yn in the form of IOB labels as in [14] (with 3 outputs corresponding to 'B', 'I' and 'O'), and as shown in Figure 1 corresponding to an input sequence of tokens X = x1, ..., xn." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-40", "text": "In literature, researchers usually employ known sequence labeling methods for filling frame slots of an application domain using a labeled training data set." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-41", "text": "With advances in deep learning, the best performing academic slot filling systems rely on recurrent neural network (RNN) or Long Short Term Memory (LSTM) based models." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-42", "text": "RNNs were first used for slot filling by Yao et al. [15] and Mesnil et al. [16] ." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-43", "text": "A comprehensive compilation of RNN based slot filling approaches was described by Mesnil et al. [5] ." 
}, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-47", "text": "The schema corresponding to each of these domains is described in Table 2 ." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-48", "text": "One noticeable feature of our datasets is a lack of fine-grained slot types as compared to the popular ATIS dataset [22] which contains over a hundred distinct slots." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-49", "text": "For the collection, a list of possible slot-value combinations was generated from the Google knowledge graph manually, and used to prompt crowd-workers with slot-value pairs." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-50", "text": "The crowd-workers were instructed to ask their digital assistant to complete certain tasks with given slot-value based arguments." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-51", "text": "The collected utterances were then either automatically labeled if a verbatim match was found for the slot-values, or sent out to a second set of raters for labeling." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-52", "text": "For the labeling job the crowd workers were instructed to label spans corresponding to slot values in the instantiated samples." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-53", "text": "Table 1 shows the list of these domains with representative example queries and the total number of training samples available." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-54", "text": "Test sets were constructed using the same framework." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-55", "text": "All the collected data was tokenized using a standard tokenizer and lower-cased before use since capitalization was seen to be indicative of slot values." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-56", "text": "All digits were replaced with special \"#\" tokens following [9] ." 
}, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-57", "text": "----------------------------------" }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-58", "text": "**MODELS**" }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-59", "text": "In this study we explore the idea of zero-shot slot-filling, by implicitly linking slot representations across domains by using the label descriptions of the slots." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-60", "text": "We compare the performance of three model architectures on varying amounts of training data:" }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-61", "text": "\u2022 Single task bi-directional LSTM \u2022 Multi-task bi-directional stacked LSTM model [6, 9] \u2022 Concept tagging model using slot label descriptions For all our experiments we use 200 dimensional word2vec embeddings trained on the GNews corpus [23] ." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-62", "text": "Tokens not present in the pre-trained embeddings were replaced by a OOV token." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-63", "text": "Each model was trained for 50000 steps using the RMSProp optimizer and tuned on the dev set performance before evaluation on the test set." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-64", "text": "For evaluation, we compute the token F1 for each slot independently and report the weighted average over all slots for the target domain." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-65", "text": "We use token F1 instead of the traditional slot F1 since token level evaluation results in softer penalization for mistakes, to mitigate span inconsistencies in the crowd sourced labels." 
}, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-66", "text": "----------------------------------" }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-67", "text": "**BASELINE SINGLE TASK MODEL**" }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-68", "text": "We use a single domain bidirectional LSTM as our baseline model." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-69", "text": "The model consists of the embedding layer followed by a 128 dimensional (64 dimensions in each direction) bidirectional LSTM." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-70", "text": "This is followed by a softmax layer that acts on the LSTM state for every token to predict the IOB label corresponding to the token." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-73", "text": "The multi-task model consists of 2 stacked bidirectional LSTM layers with 256 dimensions each (128 dimensions in each direction)." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-74", "text": "Both LSTM layers are shared across all domains, fol- lowed by domain specific softmax layers, following [9] ." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-75", "text": "The model was trained using a batch size of 100 with alternating batches from different domains." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-76", "text": "To avoid over-training the model on the larger domains, the number of batches chosen from each domain was proportional to the logarithm of the number of training samples from the domain." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-77", "text": "The conceptual model architecture is depicted in Figure 2 ." 
}, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-78", "text": "----------------------------------" }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-79", "text": "**ZERO-SHOT CONCEPT TAGGING MODEL**" }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-80", "text": "The main idea behind the zero-shot concept tagger is to leverage the slot names or descriptions in a domain-agnostic slot tagging model." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-81", "text": "Assuming that the slot description is semantically accurate, if one of the already covered domains contains a similar slot, a continuous representation of the slot obtained from shared pre-trained embeddings can be leveraged in a domainagnostic model." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-82", "text": "An obvious example would be adding United Airlines when the multi-task model can already parse queries for American Airlines and Turkish Airlines." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-83", "text": "While the slot names may be different, the concept of departure city or arrival city should persist and can be transferred to the new task of United Airlines using their natural language descriptions." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-84", "text": "The very same idea can hold when the new domain is related but not flights, but, say, ground transportation." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-85", "text": "Similarly, domain independent slots such as location or date/time expressions can be implicitly shared for two domains like hotel reservation or restaurant reservation." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-86", "text": "In order to incorporate this slot description knowledge into model training, we first generate an encoding of the slot by combining the token embeddings for the slot description." 
}, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-87", "text": "In principle this encoding can be obtained by passing description token embeddings through a RNN, but for the current experiments we just use their average." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-88", "text": "These slot representations are then combined within the multi-task architecture to obtain a domainagnostic slot tagging model." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-89", "text": "To elaborate, the zero-shot concept tagger consists of a single 256 dimensional bidirectional LSTM layer that acts on a sequence of tokens to produce contextual representations for each token in the utterance." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-90", "text": "This is followed by a feed forward layer where the contextual token representations are combined with the slot encoding to produce vectors of 128 dimensions." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-91", "text": "This feeds into another 128 dimensional bi-directional LSTM layer followed by a softmax layer that outputs the prediction for that slot." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-92", "text": "In our experiments these predictions are made independently for each slot by feeding a single slot description, but it is possible to slightly alter the architecture to make predictions for all slots." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-93", "text": "The input samples during training and evaluation for each slot included both positive (where the slot was present) and negative samples (where it was absent)." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-94", "text": "The ratio of training samples from a particular domain in a batch was proportional to the logarithm of the number of training samples." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-95", "text": "The conceptual model architecture is depicted in Figure 3 ." 
}, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-96", "text": "----------------------------------" }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-97", "text": "**EXPERIMENTS AND RESULTS**" }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-98", "text": "We compare the performances of the models on varying amounts of training data from each domain." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-99", "text": "This involves using all available out of domain data and varying the amount of training data for the target domain." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-100", "text": "To avoid performance variations due to the small sample sizes, the performance was averaged over 10 runs with training samples drawn from different parts of the domain dataset." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-101", "text": "For every domain, 20% of the training examples were set aside for the dev set." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-102", "text": "Since we were evaluating with varying amounts of training data for each domain, the dev set was different for each data-point, corresponding to the bottom 20% of the training samples." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-103", "text": "For example, if the training set consisted of 100 samples from Domain A and 20 samples from Domain B, dev set A would consist of 20 samples from Domain A and dev set B would be comprised of 4 samples from Domain B. The performance on these dev sets was evaluated separately and averaged weighted by the log of the number of training samples." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-104", "text": "This weighted average was used to tune the model hyper-parameters." 
}, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-105", "text": "We used a logarithmic combination since it struck a good balance between noisy evaluations on domains with small dev sets and over-tuning to the domains with larger dev sets." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-106", "text": "Table 3: Weighted token F1 scores at various points on the learning curve for the compared models. ST corresponds to the single-task baseline, MT corresponds to the multi-task baseline and CT corresponds to the general concept tagging model." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-108", "text": "The performances along the learning curve for all the models on the 10 domains are described in Table 3 ." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-109", "text": "When no in-domain data is available, the concept tagging model is able to achieve reasonable bootstrap performance for most domains." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-110", "text": "Even when more data becomes available, the model beats the single-task model by significant margins and performs better than or on par with the multi-task baseline for most points on the learning curves." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-113", "text": "----------------------------------" }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-114", "text": "**DISCUSSION**" }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-115", "text": "To better understand the strengths and weaknesses of the concept tagger we analyze its performance on individual slots." 
}, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-116", "text": "The model performs better than or on par with the multi-task model for most slots, with significant performance gains on slots that have shared semantics with slots in other domains." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-117", "text": "For slots that are specific to particular domains, like discount type from bus tickets, the concept tagger usually needs a larger number of training samples to reach the same level of performance." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-118", "text": "This can be explained by a lack of slot-specific parameters within the model." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-119", "text": "Table 4 compares the performance of the concept tagger and the multi-task model on three slots across different domains that illustrate the strengths and weaknesses of our approach." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-120", "text": "By leveraging shared features and semantics with departure time, reservation time and other time-related slots from other domains, the concept tagger is able to reach within 10% of the peak performance on appointment time without the need for any in-domain training data." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-121", "text": "Similarly, the model is able to ramp up performance on # seniors with a small amount of in-domain data, despite the presence of a competing slot with similar semantics (# adults) within the same domain." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-122", "text": "This highlights the model's ability to generalize from slots with similar descriptions and semantics across domains." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-123", "text": "On the other hand, the concept tagger's performance is worse than our multi-task baseline on pickup location." 
}, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-124", "text": "The lack of a good contextual representation for the description pickup location and the presence of a competing slot, dropoff location, might be responsible for the performance degradation observed for this slot." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-125", "text": "This highlights the concept tagger's susceptibility to descriptions that fail to produce a compatible slot representation, either due to an incomplete or misleading description of the slot semantics or a lack of good embedding representations for these descriptions." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-126", "text": "It might be possible to alleviate poor slot representations by fine-tuning the slot representations on small amounts of in-domain training data after starting with representations derived from pre-trained word embeddings or using contextual word embeddings [24] ." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-127", "text": "Enhancing utterance token representations with an entity linker or a knowledge base is a possible extension of this work that might enable better generalization to new entities." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-128", "text": "Exploring the use of unlabeled training data from target domains with a domain adversarial loss [25] might be another interesting avenue." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-129", "text": "The appeal of our approach derives from its simplicity and the minimal amount of supervision required to bring up slot-filling on a new domain." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-130", "text": "The proposed solution makes it possible to design a plug-and-play system with a reduced need for expensive labeled data for every additional domain." 
}, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-131", "text": "----------------------------------" }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-132", "text": "**CONCLUSIONS**" }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-133", "text": "In this paper we propose a novel approach to slot filling that leverages shared feature extraction and slot representations across domains by using the natural language descriptions of slots." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-134", "text": "We crowd-sourced slot filling datasets for ten domains to explore approaches that can easily scale across domains and demonstrate that our proposed concept tagging model performs significantly better than a strong multi-task baseline, especially in the low data regime." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-135", "text": "To further evaluate the strengths and weaknesses of the proposed approach, we analyze its performance on individual slots and demonstrate that our model is able to leverage shared features and semantic descriptions of slots defined in other domains, and shows potential for reasonable performance on slots in a new domain without the need for any in-domain training data or explicit schema alignment." }, { "sent_id": "f60796ff05156e81c4b183cdcb05ae-C001-136", "text": "We hope that our proposed solution can provide a baseline approach for future research into scalable frame semantic parsing systems." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "f60796ff05156e81c4b183cdcb05ae-C001-22" ] ], "cite_sentences": [ "f60796ff05156e81c4b183cdcb05ae-C001-22" ] }, "@USE@": { "gold_contexts": [ [ "f60796ff05156e81c4b183cdcb05ae-C001-56" ], [ "f60796ff05156e81c4b183cdcb05ae-C001-74" ] ], "cite_sentences": [ "f60796ff05156e81c4b183cdcb05ae-C001-56", "f60796ff05156e81c4b183cdcb05ae-C001-74" ] }, "@UNSURE@": { "gold_contexts": [ [ "f60796ff05156e81c4b183cdcb05ae-C001-60", "f60796ff05156e81c4b183cdcb05ae-C001-61" ] ], "cite_sentences": [ "f60796ff05156e81c4b183cdcb05ae-C001-61" ] } } }, "ABC_5f86a4791bee14e0b1053e9b9a6fff_40": { "x": [ { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-2", "text": "A model of co-occurrence in bitext is a boolean predicate that indicates whether a given pair of word tokens co-occur in corresponding regions of the bitext space." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-3", "text": "Co-occurrence is a precondition for the possibility that two tokens might be mutual translations." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-4", "text": "Models of cooccurrence are the glue that binds methods for mapping bitext correspondence with methods for estimating translation models into an integrated system for exploiting parallel texts." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-5", "text": "Different models of co-occurrence are possible, depending on the kind of bitext map that is available, the language-specific information that is available, and the assumptions made about the nature of translational equivalence." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-6", "text": "Although most statistical translation models are based on models of co-occurrence, modeling co-occurrence correctly is more difficult than it may at first appear." 
}, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-7", "text": "----------------------------------" }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-9", "text": "Most methods for estimating translation models from parallel texts (bitexts) start with the following intuition: Words that are translations of each other are more likely to appear in corresponding bitext regions than other pairs of words." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-10", "text": "The intuition is simple, but its correct exploitation turns out to be rather subtle." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-11", "text": "Most of the literature on translation model estimation presumes that corresponding regions of the input bitexts are represented by neatly aligned segments." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-12", "text": "As discovered by Church (1993) , most of the bitexts available today are not easy to align." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-13", "text": "Moreover, imposing an alignment relation on such bitexts is inefficient, because alignments cannot capture crossing correspondences among text segments." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-14", "text": "Melamed (1996) proposed methods for producing general bitext maps for arbitrary bitexts." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-15", "text": "The present report shows how to use bitext maps and other information to construct a model of co-occurrence." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-16", "text": "A model of co-occurrence is a boolean predicate, which indicates whether a given pair of word tokens co-occur in corresponding regions of the bitext space." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-17", "text": "Co-occurrence is a precondition for the possibility that two tokens might be mutual translations." 
}, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-18", "text": "Models of cooccurrence are the glue that binds methods for mapping bitext correspondence with methods for estimating translation models into an integrated system for exploiting parallel texts." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-19", "text": "When the model of co-occurrence is modularized away from the translation model, it also becomes easier to study translation model estimation methods per se." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-20", "text": "Different models of co-occurrence are possible, depending on the kind of bitext map that is available, the language-specific information that is available, and the assumptions made about the nature of translational equivalence." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-21", "text": "The following three sections explore these three variables." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-22", "text": "----------------------------------" }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-23", "text": "**RELEVANT REGIONS OF THE BITEXT SPACE**" }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-24", "text": "By definition of \"mutual translations,\" corresponding regions of a text and its translation will contain word token pairs that are mutual translations." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-25", "text": "Therefore, a general representation of bitext correspondence is the natural concept on which to build a model of where mutual translations cooccur." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-26", "text": "The most general representation of bitext correspondence is a bitext map (Melamed, 1996) ." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-27", "text": "Token pairs whose co-ordinates are part of the true bitext map (TBM) are mutual translations, by definition of the TBM." 
}, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-28", "text": "The likelihood that two tokens are mutual translations is inversely correlated with the distance between the tokens' co-ordinate in the bitext space and the interpolated TBM." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-29", "text": "It may be possible to develop translation model estimation methods that take into account a probabilistic model of co-occurrence." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-30", "text": "However, all the models in the literature are based on a boolean co-occurrence model -they want to know either that two tokens co-occur or that they do not." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-31", "text": "A boolean co-occurrence predicate can be defined by setting a threshold \u03b4 on the distance from the interpolated bitext map." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-32", "text": "Any token pair whose co-ordinate is closer than \u03b4 to the bitext map would be considered to co-occur by this predicate." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-33", "text": "The optimal value of \u03b4 varies with the language pair, the bitext genre and the application." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-34", "text": "Figure 1 illustrates what I will call the distance-based model of co-occurrence." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-35", "text": "Dagan et al. (1993) were the first to use a distance-based model of co-occurrence, although they measured the distance in words rather than in characters." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-36", "text": "General bitext mapping algorithms are a recent invention." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-37", "text": "So far, most researchers interested in co-occurrence of mutual translations have relied on bitexts where sentence boundaries (or other text unit boundaries) were easy to find (e.g. 
Gale & Church, 1991; Kumano & Hirakawa, 1994; Fung, 1995; Melamed, 1995) ." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-38", "text": "Aligned text segments suggest a boundary-based model of cooccurrence, illustrated in Figure 2 ." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-39", "text": "For bitexts involving languages with similar word order, a more accurate combined model of co-occurrence can be built using both segment boundary information and the map-distance threshold." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-40", "text": "As shown in Figure 3 , each of these constraints eliminates the noise from a characteristic region of the bitext space." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-42", "text": "----------------------------------" }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-43", "text": "**CO-OCCURRENCE COUNTING METHODS**" }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-44", "text": "Both the boundary-based and distance-based constraints restrict the region of the bitext space where tokens may be 
considered to co-occur." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-45", "text": "Yet, these constraints do not answer the question of how to count co-occurrences within the restricted regions." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-46", "text": "It is somewhat surprising that this is a question at all, and most authors ignore it." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-47", "text": "However, when authors specify their algorithms in sufficient detail to answer this question, the most common answer (given, e.g., by Brown et al., 1993; Dagan et al., 1993; Kupiec, 1993; Melamed, 1995) turns out to be unsound." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-48", "text": "The problem is easiest to illustrate under the boundary-based model of co-occurrence." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-49", "text": "Given two aligned text segments, the naive way to count co-occurrences is" }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-50", "text": "cooc(u, v) = e(u)f(v) (1), where e(u) and f(v) are the frequencies of occurrence of u and v in their respective segments." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-51", "text": "For many u and v, e(u) and f(v) are either 0 or 1, and Equation 1 returns 1 just in case both words occur." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-52", "text": "The problem arises when e(u) > 1 and f(v) > 1." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-53", "text": "For example, if e(u) = f(v) = 3, then according to Equation 1, cooc(u, v) = 9! If the two aligned segments are really translations of each other, then it is most likely that each of the occurrences of u is a translation of just one of the occurrences of v. Although it may not be known which of the 3 v's each u corresponds to, the number of times that u and v co-occur as possible translations of each other in that segment pair must be 3." 
}, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-54", "text": "There are various ways to arrive at cooc(u, v) = 3." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-55", "text": "Two of the simplest ways are cooc(u, v) = min(e(u), f(v)) (2)" }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-56", "text": "and cooc(u, v) = max(e(u), f(v)) (3)." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-57", "text": "Equation 2 is based on the simplifying assumption that each word is translated to at most one other word." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-58", "text": "Equation 3 is based on the simplifying assumption that each word is translated to at least one other word." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-59", "text": "Either simplifying assumption results in more plausible co-occurrence counts than the naive method in Equation 1." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-60", "text": "Counting co-occurrences is more difficult under a distance-based co-occurrence model, because there are no aligned segments and consequently no useful definition for e() and f()." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-61", "text": "Furthermore, under a distance-based co-occurrence model, the co-occurrence relation is not transitive." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-62", "text": "E.g., it is possible that s1 co-occurs with t1, t1 co-occurs with s2, s2 co-occurs with t2, but s1 does not co-occur with t2." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-63", "text": "The correct counting method becomes clearer if the problem is recast in graph-theoretic terms." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-64", "text": "Let the words in each half of the bitext represent the vertices on one side of a bipartite graph." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-65", "text": "Let there be edges between each pair of words whose co-ordinates are closer than \u03b4 to the bitext map." 
}, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-66", "text": "Now, under the \"at most one\" assumption of Equation 2, each co-occurrence is represented by an edge in the graph's maximum matching." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-67", "text": "Under the \"at least one\" assumption of Equation 3, each co-occurrence is represented by an edge in the graph's smallest vertex cover." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-68", "text": "Maximum matching can be computed in polynomial time for any graph (Ahuja et al., 1993) ." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-69", "text": "Vertex cover can be solved in polynomial time for bipartite graphs." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-70", "text": "It is of no importance that maximum matchings and minimum vertex covers may be non-unique -by definition, all solutions have the same number of edges, and this number is the correct co-occurrence count." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-71", "text": "----------------------------------" }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-72", "text": "**LANGUAGE-SPECIFIC FILTERS**" }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-73", "text": "Co-occurrence is a universal precondition for translational equivalence among word tokens in bitexts." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-74", "text": "Other preconditions may be imposed if certain language-specific resources are available (Melamed, 1995) ." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-75", "text": "For example, parts of speech tend to be preserved in translation (Papageorgiou et al., 1994) ." 
}, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-76", "text": "If part-of-speech taggers are available for both languages in a bitext, and if cases where one part of speech is translated to another are not important for the intended application, then we can rule out the possibility of translational equivalence for all token pairs involving different parts of speech." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-77", "text": "A more obvious source of language-specific information is a machine-readable bilingual dictionary (MRBD)." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-78", "text": "If token a in one half of the bitext is found to co-occur with token b in the other half, and (a, b) is an entry in the MRBD, then it is highly likely that the tokens a and b are indeed mutual translations." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-79", "text": "In this case, there is no point considering the co-occurrence of a or b with any other token." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-80", "text": "Similarly exclusive candidacy can be granted to cognate token pairs (Simard et al., 1992) ." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-81", "text": "Most published translation models treat co-occurrence counts as counts of potential link tokens (Melamed, 1998) ." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-82", "text": "More accurate models may result if the co-occurrence counts are biased with language-specific knowledge." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-83", "text": "Without loss of generality, whenever translation models refer to cooccurrence counts, they can refer to co-occurrence counts that have been filtered using whatever language-specific resources happen to be available." 
}, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-84", "text": "It does not matter if there are dependencies among the different knowledge sources, as long as each is used as a simple filter on the co-occurrence relation (Melamed, 1995) ." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-85", "text": "----------------------------------" }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-86", "text": "**CONCLUSION**" }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-87", "text": "In this short report, I have investigated methods for modeling word token co-occurrence in parallel texts (bitexts)." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-88", "text": "Models of co-occurrence are a precursor to all the most accurate translation models in the literature." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-89", "text": "So far, most researchers have relied on only a restricted form of co-occurrence, based on a restricted kind of bitext map, applicable to only a limited class of bitexts." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-90", "text": "A more general co-occurrence model can be based on any bitext map, and thus on any bitext." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-91", "text": "The correct method for counting the number of times that two words co-occur turns out to be rather subtle, especially for more general co-occurrence models." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-92", "text": "As noted in Section 3, many published translation models have been based on flawed models of co-occurrence." }, { "sent_id": "5f86a4791bee14e0b1053e9b9a6fff-C001-93", "text": "This report has exposed the flaw and has shown how to fix it." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "5f86a4791bee14e0b1053e9b9a6fff-C001-37" ], [ "5f86a4791bee14e0b1053e9b9a6fff-C001-47" ], [ "5f86a4791bee14e0b1053e9b9a6fff-C001-74" ], [ "5f86a4791bee14e0b1053e9b9a6fff-C001-84" ] ], "cite_sentences": [ "5f86a4791bee14e0b1053e9b9a6fff-C001-37", "5f86a4791bee14e0b1053e9b9a6fff-C001-47", "5f86a4791bee14e0b1053e9b9a6fff-C001-74", "5f86a4791bee14e0b1053e9b9a6fff-C001-84" ] } } }, "ABC_9fdeb20207af1e8ee0c6e5374e3731_40": { "x": [ { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-2", "text": "In this paper we present a demo of our system: Social Interaction Network Extractor from Text (SINNET)." }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-3", "text": "SINNET is able to extract a social network from unstructured text." }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-4", "text": "Nodes in the network are people and links are social events." }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-5", "text": "----------------------------------" }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-7", "text": "Language is the primary tool that people use for establishing, maintaining and expressing social relations." }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-8", "text": "This makes language the real carrier of social networks." }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-9", "text": "In this paper, we present a demo of our system that automatically extracts a social network from raw texts such as literary texts, emails, blog comments and news articles." }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-10", "text": "1 We take a \"social network\" to be a network consisting of individual human beings and groups of human beings who are connected to each other through various relationships by the virtue of participating in social events." 
}, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-11", "text": "We define social events to be events that occur between people where at least one person is aware of the other and of the event taking place." }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-12", "text": "For example, in the sentence John talks to Mary, entities John and Mary are aware of each other and of the talking event." }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-13", "text": "In the sentence John thinks Mary is great, only John is aware of Mary and the event is the thinking event." }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-14", "text": "There has been recent work on extracting social networks from literary text (Elson et al., 2010; He et al., 2013) ." }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-15", "text": "However, both these works focus on extracting only conversational links between people, signaled in text by quotation marks." }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-16", "text": "They do not extract social event links from other parts of text, such as reported speech and other non-dialogue text. (A web demo is available at http://nlp.ldeo.columbia.edu/sinnet/.)" }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-17", "text": "Our system overcomes this limitation." }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-18", "text": "The rest of the paper is structured as follows: In section 2, we briefly describe the research that has gone into building the system." }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-19", "text": "In section 3, we present the technical details of SINNET and describe our web demo." 
}, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-20", "text": "----------------------------------" }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-21", "text": "**RESEARCH**" }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-22", "text": "The SINNET system is the result of several years of research (Agarwal et al., 2012; Agarwal et al., 2013) ." }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-23", "text": "In earlier work, we introduced the notion of social events." }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-24", "text": "A social event is a happening between two people, at least one of whom is cognizant of the other and of the event taking place." }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-25", "text": "At a broad level, there are two types of social events: interaction (INR) and observation (OBS)." }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-26", "text": "INR is a bi-directional event in which both parties are mutually aware of each other." }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-27", "text": "Examples of INR are a meeting or a dinner." }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-28", "text": "OBS is a one-directional event in which only one party is aware of the other." }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-29", "text": "Examples of OBS are thinking about someone, or missing someone." }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-30", "text": "We also presented a preliminary system that uses tree kernels and Support Vector Machines (SVMs) to extract social events from news articles." }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-31", "text": "In Agarwal et al. (2012) , we presented a case study on a manually extracted network from Alice in Wonderland, showing that analyzing networks based on these social events gives us insight into the roles of characters in the story." 
}, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-32", "text": "Also, static network analysis has limitations which become apparent from our analysis." }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-33", "text": "We propose the use of dynamic network analysis to overcome these limitations." }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-34", "text": "In Agarwal et al., we revisit the social event extraction task and show that our system trained on a news corpus using tree kernels and support vector machines beats the baseline systems by a statistically significant margin." }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-35", "text": "We also show that while the performance of our system on detecting social events in Alice in Wonderland achieves an F-measure of 61%, the unweighted network built using these detected social events is not statistically distinguishable from the unweighted gold network according to popularly used network measures." }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-36", "text": "Figure 1 shows two figures exemplifying the meaning of social events and social networks." }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-37", "text": "In the first figure, there are three entity mentions: Alice, Rabbit and her (co-referential with Alice)." }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-38", "text": "There is an OBS event between Alice and Rabbit triggered by the word in bold -saw." }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-39", "text": "The direction of the event is from the observer to the one being observed." }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-40", "text": "In the second figure there are two entity mentions: Alice and Mouse." }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-41", "text": "There is a bidirectional interaction link between Alice and the Mouse triggered by the word asked." 
}, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-42", "text": "Figure 2 shows the network extracted from an abridged version of Alice in Wonderland (Agarwal et al., 2012)." }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-43", "text": "Figure 3 shows the output of running the Hubs and Authorities algorithm (Kleinberg, 1998) on the network." }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-44", "text": "In information retrieval, an authority is a webpage that many hubs point to and a hub is a webpage that points to many authorities." }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-45", "text": "In our network, webpages correspond to characters." }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-46", "text": "Figure 3a shows the hubs in decreasing order of hub weights." }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-47", "text": "Figure 3b shows the authorities in decreasing order of authority weights." }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-48", "text": "We see that the main character of the story, Alice, is the main authority but not the main hub." }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-49", "text": "This network may be used for other analyses as well. In Agarwal et al. (2012), we argued that a static network does not bring out the true nature of a network." }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-50", "text": "For example, even though the centrality of the Mouse in a static network is high, a dynamic network analysis shows that the mouse is central only in one chapter of the novel (Chapter 3 -The drying ceremony)." }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-51", "text": "Figure 4 shows the network at the end of chapter 1 and chapter 3."
}, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-52", "text": "----------------------------------" }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-53", "text": "**SYSTEM DETAILS AND WEB DEMO**" }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-54", "text": "SINNET is fully implemented in Java." }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-55", "text": "Following is a list of external off-the-shelf tools used by our current pipeline: Jet sentence splitter, Jet NER (Grishman et al., 2005), Stanford parser (Klein and Manning, 2003), and SVM-Light-TK (Moschitti, 2006). Input to SINNET may be provided in two formats: as raw text or as text with entity annotations." }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-56", "text": "If the text is input as raw text without any entity annotations, SINNET first runs an off-the-shelf named entity recognition and co-reference resolution (NER) tool." }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-57", "text": "Currently, we run Jet (Grishman et al., 2005), but an interface makes it easy to plug in any other NER tool." }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-58", "text": "Once the text is annotated with entity mentions, for each sentence, for each entity mention pair per sentence, we create test examples in the format that our models accept." }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-59", "text": "We use tree kernels with Support Vector Machines (SVM) for our models." }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-60", "text": "Details of our system may be found in ." }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-61", "text": "Any sentence splitter may be plugged in." }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-62", "text": "Currently, we are using Jet's sentence splitter." }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-63", "text": "Finally, the examples are fed to the models for prediction."
}, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-64", "text": "The output is stored as a list of entities and their relations in a standard graph format." }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-65", "text": "Currently, the output formats include graph modeling language (gml) and Pajek's .net format (Batagelj and Mrvar, 1998) ." }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-66", "text": "In many situations, the input text may already have entity mentions annotated and co-referenced." }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-67", "text": "In these situations, SINNET will accept these gold entity mention annotations instead of running the NER tool." }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-68", "text": "The rest of the processing remains the same as above." }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-69", "text": "Figure 5 shows an image of our web demo." }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-70", "text": "2 The demo has a text box for entering text." }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-71", "text": "We have various models that use features from three levels of natural language abstractions: lexical, syntactic and semantic." }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-72", "text": "Users of the web demo are given the option of selecting the type of model used for making predictions." }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-73", "text": "We have seven models in place: lexical, syntactic, semantic and all combinations of these three types." }, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-74", "text": "Once the user inputs a text and selects the type of model, we display the extracted network and make the file with the extracted network (which is in a standard graph format such as .gml/.net) available for download." 
}, { "sent_id": "9fdeb20207af1e8ee0c6e5374e3731-C001-75", "text": "Our web demo has two other tabs: one listing the publications relevant to SINNET and the other mentioning technical details and capabilities of our web demo." } ], "y": { "@BACK@": { "gold_contexts": [ [ "9fdeb20207af1e8ee0c6e5374e3731-C001-22" ], [ "9fdeb20207af1e8ee0c6e5374e3731-C001-31" ], [ "9fdeb20207af1e8ee0c6e5374e3731-C001-42" ], [ "9fdeb20207af1e8ee0c6e5374e3731-C001-49" ] ], "cite_sentences": [ "9fdeb20207af1e8ee0c6e5374e3731-C001-22", "9fdeb20207af1e8ee0c6e5374e3731-C001-31", "9fdeb20207af1e8ee0c6e5374e3731-C001-42", "9fdeb20207af1e8ee0c6e5374e3731-C001-49" ] } } }, "ABC_27dbdd4827554df0f53013966242dc_40": { "x": [ { "sent_id": "27dbdd4827554df0f53013966242dc-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "27dbdd4827554df0f53013966242dc-C001-2", "text": "This article briefly explains our submitted approach to the DocEng'19 competition on extractive summarization [3] ." }, { "sent_id": "27dbdd4827554df0f53013966242dc-C001-3", "text": "We implemented a recurrent neural network based model that learns to classify whether an article's sentence belongs to the corresponding extractive summary or not." }, { "sent_id": "27dbdd4827554df0f53013966242dc-C001-4", "text": "We bypass the lack of large annotated news corpora for extractive summarization by generating extractive summaries from abstractive ones, which are available from the CNN corpus." }, { "sent_id": "27dbdd4827554df0f53013966242dc-C001-5", "text": "----------------------------------" }, { "sent_id": "27dbdd4827554df0f53013966242dc-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "27dbdd4827554df0f53013966242dc-C001-7", "text": "The DocEng '19 competition focused on automatic extractive text summarization." }, { "sent_id": "27dbdd4827554df0f53013966242dc-C001-8", "text": "Participants were provided with a corpus of 50 news articles from the CNN-corpus [4] ." 
}, { "sent_id": "27dbdd4827554df0f53013966242dc-C001-9", "text": "These articles contained corresponding extractive and abstractive summaries intended to train and test a system to perform the summarization task." }, { "sent_id": "27dbdd4827554df0f53013966242dc-C001-10", "text": "The gold standard summaries contained around 10% of the original text, with a minimum of 3 sentences." }, { "sent_id": "27dbdd4827554df0f53013966242dc-C001-11", "text": "After submission, the methods were tested on a larger test set consisting of 1000 articles randomly chosen from the CNN-corpus." }, { "sent_id": "27dbdd4827554df0f53013966242dc-C001-12", "text": "The limited available training data was one of the major challenges of this competition, which prevented any deep learning approach from being successful if no external corpus was incorporated into the training set." }, { "sent_id": "27dbdd4827554df0f53013966242dc-C001-13", "text": "----------------------------------" }, { "sent_id": "27dbdd4827554df0f53013966242dc-C001-14", "text": "**APPROACH**" }, { "sent_id": "27dbdd4827554df0f53013966242dc-C001-15", "text": "Our work is based on the SummaRuNNer model [5] ." }, { "sent_id": "27dbdd4827554df0f53013966242dc-C001-16", "text": "It consists of a two-layer bi-directional Gated Recurrent Unit (GRU) Recurrent Neural Network (RNN) which treats the summarization problem as a binary sequence classification problem, where each sentence is classified sequentially as sentence to be included or not in the summary." }, { "sent_id": "27dbdd4827554df0f53013966242dc-C001-17", "text": "However, we introduced two modifications to the original SummaRuNNer architecture, leading to better results while reducing complexity." }, { "sent_id": "27dbdd4827554df0f53013966242dc-C001-18", "text": "Fig. 1. Our RNN-based sequence classifier (based on [5])."
}, { "sent_id": "27dbdd4827554df0f53013966242dc-C001-19", "text": "All word embeddings from each sentence are averaged to generate a sentence embedding." }, { "sent_id": "27dbdd4827554df0f53013966242dc-C001-20", "text": "Sentence embeddings are then used for the bidirectional RNN at sentence level." }, { "sent_id": "27dbdd4827554df0f53013966242dc-C001-21", "text": "At the top, the sigmoid activation based classification layer decides whether a sentence is included in the summary based on the content richness of the sentence, its salience with respect to the document and its novelty with respect to the accumulated summary representation." }, { "sent_id": "27dbdd4827554df0f53013966242dc-C001-22", "text": "1." }, { "sent_id": "27dbdd4827554df0f53013966242dc-C001-23", "text": "Our model operates directly on a sentence level (instead of at word level within each sentence)." }, { "sent_id": "27dbdd4827554df0f53013966242dc-C001-24", "text": "We compute sentence vector representations by means of the Flair library" }, { "sent_id": "27dbdd4827554df0f53013966242dc-C001-25", "text": "[1]." }, { "sent_id": "27dbdd4827554df0f53013966242dc-C001-26", "text": "These sentence embeddings replace the bottom layer of the SummaRuNNer architecture." }, { "sent_id": "27dbdd4827554df0f53013966242dc-C001-27", "text": "2. We do not consider the position of each sentence (absolute or relative) for the logistic layer." }, { "sent_id": "27dbdd4827554df0f53013966242dc-C001-28", "text": "The resulting architecture is displayed in Figure 1." }, { "sent_id": "27dbdd4827554df0f53013966242dc-C001-29", "text": "Our code to generate extractive summaries according to the instructions established for the competition is publicly available."
}, { "sent_id": "27dbdd4827554df0f53013966242dc-C001-30", "text": "----------------------------------" }, { "sent_id": "27dbdd4827554df0f53013966242dc-C001-31", "text": "**DATA**" }, { "sent_id": "27dbdd4827554df0f53013966242dc-C001-32", "text": "In contrast to [5], we trained our model only on CNN articles from the CNN/Daily Mail corpus [2] ." }, { "sent_id": "27dbdd4827554df0f53013966242dc-C001-33", "text": "Due to the limited number of provided news articles, we automatically annotated a large corpus of CNN articles for which an abstractive summary was available." }, { "sent_id": "27dbdd4827554df0f53013966242dc-C001-34", "text": "In a similar approach to [5], we calculated the ROUGE-1 F1 score between each sentence and its article's abstractive summary." }, { "sent_id": "27dbdd4827554df0f53013966242dc-C001-35", "text": "Finally, for each article, we sorted the sentences by their ROUGE-1 F1 score and picked the top N = max(0.1 * ||sentences||, 3) sentences." }, { "sent_id": "27dbdd4827554df0f53013966242dc-C001-36", "text": "----------------------------------" }, { "sent_id": "27dbdd4827554df0f53013966242dc-C001-37", "text": "**EVALUATION**" }, { "sent_id": "27dbdd4827554df0f53013966242dc-C001-38", "text": "We evaluated our model on the provided labeled CNN news articles with three different metrics: sentences from the generated summary matching the gold standard summary, ROUGE-1 and ROUGE-2." }, { "sent_id": "27dbdd4827554df0f53013966242dc-C001-39", "text": "The scores achieved with our trained model after 20 epochs are displayed in Table 1."
}, { "sent_id": "27dbdd4827554df0f53013966242dc-C001-40", "text": "----------------------------------" }, { "sent_id": "27dbdd4827554df0f53013966242dc-C001-41", "text": "**CONCLUSION**" }, { "sent_id": "27dbdd4827554df0f53013966242dc-C001-42", "text": "Our approach achieved the second best performance among the compared methods in the competition, although the F1-score difference between the two approaches is not statistically significant [3] ." }, { "sent_id": "27dbdd4827554df0f53013966242dc-C001-43", "text": "Additionally, the performance of these approaches is hardly better than some of the \"traditional algorithms\" that were presented as baselines, which are much simpler than ours." }, { "sent_id": "27dbdd4827554df0f53013966242dc-C001-44", "text": "Moreover, the real value of the different approaches on the various use cases of automatic text summarization cannot be captured by the current evaluation, since the valuable properties of the summaries vary depending on the use case." }, { "sent_id": "27dbdd4827554df0f53013966242dc-C001-45", "text": "For instance, coherence is important if the summary will be read by a final user while it is not if the summary is \"just\" a preprocessing step within an indexing pipeline." }, { "sent_id": "27dbdd4827554df0f53013966242dc-C001-46", "text": "Therefore, it would be interesting to assess the different techniques on several downstream tasks to obtain a better overview of which algorithms are most suitable."
} ], "y": { "@EXT@": { "gold_contexts": [ [ "27dbdd4827554df0f53013966242dc-C001-15", "27dbdd4827554df0f53013966242dc-C001-16", "27dbdd4827554df0f53013966242dc-C001-17" ] ], "cite_sentences": [ "27dbdd4827554df0f53013966242dc-C001-15" ] }, "@DIF@": { "gold_contexts": [ [ "27dbdd4827554df0f53013966242dc-C001-32" ] ], "cite_sentences": [ "27dbdd4827554df0f53013966242dc-C001-32" ] }, "@SIM@": { "gold_contexts": [ [ "27dbdd4827554df0f53013966242dc-C001-34" ] ], "cite_sentences": [ "27dbdd4827554df0f53013966242dc-C001-34" ] } } }, "ABC_b4093db328fd6839777a6d34507b34_40": { "x": [ { "sent_id": "b4093db328fd6839777a6d34507b34-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-35", "text": "**LEARNING EXPERIMENTS**" }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-83", "text": "**RELATED WORK**" }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-32", "text": "The features are inspired by Saloj\u00e4rvi et al. (2003) who used a similar exploratory approach." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-33", "text": "For a full list of features, see Appendix." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-34", "text": "----------------------------------" }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-30", "text": "Features It is not yet established which eye movement reading features are fit for the task of distinguishing grammatical functions of the words." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-2", "text": "This paper investigates to what extent grammatical functions of a word can be predicted from gaze features obtained using eye-tracking." 
}, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-3", "text": "A recent study showed that reading behavior can be used to predict coarse-grained part of speech, but we go beyond this, and show that gaze features can also be used to make more fine-grained distinctions between grammatical functions, e.g., subjects and objects." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-4", "text": "In addition, we show that gaze features can be used to improve a discriminative transition-based dependency parser." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-5", "text": "----------------------------------" }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-7", "text": "Readers fixate more and longer on open syntactic categories (verbs, nouns, adjectives) than on closed class items like prepositions and conjunctions (Rayner and Duffy, 1988; Nilsson and Nivre, 2009)." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-8", "text": "Recently, Barrett and S\u00f8gaard (2015) presented evidence that gaze features can be used to discriminate between most pairs of parts of speech (POS)." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-9", "text": "Their study uses all the coarse-grained POS labels proposed by Petrov et al. (2011)." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-10", "text": "This paper investigates to what extent gaze data can also be used to predict grammatical functions such as subjects and objects." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-11", "text": "We first show that a simple logistic regression classifier trained on a very small seed of data using gaze features discriminates between some pairs of grammatical functions." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-12", "text": "We show that the same kind of classifier distinguishes well between the four main grammatical functions of nouns, POBJ, DOBJ, NN and NSUBJ."
}, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-13", "text": "In \u00a73, we also show how gaze features can be used to improve dependency parsing." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-14", "text": "Many gaze features correlate with word length and word frequency (Rayner, 1998), and these could be as good as gaze features, while being easier to obtain. Figure 1: A dependency structure with average fixation duration per word." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-15", "text": "We use frequencies from the unlabelled portions of the English Web Treebank and word length as a baseline in all types of experiments, and find gaze features to be better predictors for the noun experiment as well as for improving parsers." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-16", "text": "This work is of psycholinguistic interest, but we show that gaze features may have practical relevance, by demonstrating that they can be used to improve a dependency parser." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-17", "text": "Eye-tracking data becomes more readily available with the emergence of eye trackers in mainstream consumer products (San Agustin et al., 2010)." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-18", "text": "With the development of robust eye-tracking in laptops, it is easy to imagine digital text providers storing gaze data, which could then be used as partial annotation of their publications." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-19", "text": "Contributions We demonstrate that we can discriminate between some grammatical functions using gaze features, and we identify which features are fit for the task." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-20", "text": "We show a practical use for data reflecting human cognitive processing."
}, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-21", "text": "Finally, we use gaze features to improve a transition-based dependency parser, comparing also to dependency parsers augmented with word embeddings." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-22", "text": "----------------------------------" }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-23", "text": "**EYE TRACKING DATA**" }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-24", "text": "The data comes from (Barrett and S\u00f8gaard, 2015) and is publicly available 1 ." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-25", "text": "In this experiment 10 native English speakers read 250 syntactically annotated sentences in English (min." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-26", "text": "3 tokens, max." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-27", "text": "120 characters)." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-28", "text": "The sentences were randomly sampled from one of five different, manually annotated corpora from different domains: Wall Street Journal articles (WSJ), Wall Street Journal headlines (HDL), emails (MAI), weblogs (WBL), and Twitter (TWI) 2 ." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-29", "text": "See Figure 1 for an example." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-31", "text": "To explore this, we extracted a broad selection of word-and sentence-based features." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-36", "text": "In our binary experiments, we use L2-regularized logistic regression classifiers with the default parameter setting in SciKit Learn 3 and a publicly available transition-based dependency parser 4 trained using structured perceptron (Collins, 2002; Zhang and Nivre, 2011) ." 
}, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-37", "text": "Binary classification We trained logistic regression models to discriminate between pairs of the 11 most frequent dependency relations where the sample size is above 100 (AMOD, NN, AUX, PREP, NSUBJ, ADVMOD, DEP, DET, DOBJ, POBJ, ROOT), using only gaze features." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-38", "text": "E.g., we selected all words annotated as PREP or NSUBJ and trained a logistic regression model to discriminate between the two in a five-fold cross validation setup." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-39", "text": "Our baseline uses the following features: word length, position in sentence and word frequency." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-40", "text": "Some dependency relations are almost uniquely associated with one POS, e.g. determiners. 1 https://bitbucket.org/lowlands/release/src 2 Wall Street Journal sentences are from the OntoNotes 4.0 release of the English Penn Treebank." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-41", "text": "catalog.ldc." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-42", "text": "upenn.edu/LDC2011T03." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-43", "text": "Mail and weblog sentences come from the English Web Treebank." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-44", "text": "catalog.ldc.upenn.edu/LDC2012T13." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-45", "text": "Twitter sentences are from the work of Foster et al. (2011). 3 http://scikit-learn.org/stable/modules/generated/sklearn.linear_model." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-46", "text": "LogisticRegression.html 4 https://github.com/andersjo/hanstholm. Parsing In all experiments we trained our parsing models on four domains and evaluated on the fifth to avoid over-fitting to the characteristics of a specific domain."
}, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-47", "text": "All parameters were tuned on the WSJ dataset." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-48", "text": "We did 30 passes over the data and used the feature model in Zhang and Nivre (2011) -concatenated with gaze vectors for the first token on the buffer, the first token in the stack, and the left sibling of the first token in the stack." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-49", "text": "We extend the feature representation of each parser configuration by 3 \u00d7 26 features." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-50", "text": "Our gaze vectors were normalized using the technique in Turian et al. (2010) (\u03c3 \u00b7 E/SD(E)) using a scaling factor of \u03c3 = 0.001." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-51", "text": "Gaze features such as fixation duration are known to correlate with word frequency and word length." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-52", "text": "To investigate whether word length and frequency are stronger features than gaze, we perform an experiment, +FREQ+LEN, where our baseline and system also use frequencies and word length as features." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-53", "text": "----------------------------------" }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-54", "text": "**RESULTS**" }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-55", "text": "Predictive features To investigate which gaze features were more predictive of grammatical function, we used stability selection (Meinshausen and B\u00fchlmann, 2010) with logistic regression classification on binary dependency relation classifications on the most frequent dependency relations." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-56", "text": "For each pair of dependencies, we perform a five-fold cross validation and record the informative features from each run." 
}, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-57", "text": "Table 1 shows the 15 most used features in ranked order with their proportion of all votes." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-58", "text": "The features predictive of grammatical functions are similar to the features that were found to be predictive of POS (Barrett and S\u00f8gaard, 2015); however, the probabilities that a word gets first and second fixation were not important features for POS classification, whereas they do contribute to dependency classification." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-59", "text": "This could suggest that words with certain grammatical functions are consistently more likely or less likely to get first and second fixation, but could also be due to a frequent syntactic order in the sample." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-60", "text": "Binary discrimination Error reduction over the baseline can be seen in Figure 2." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-61", "text": "The mean accuracy using logistic regression on all binary classification problems between grammatical functions is 0.722." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-62", "text": "The frequency-position-word length baseline is 0.706." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-63", "text": "In other words, using gaze features leads to a 5.6% error reduction over the baseline." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-64", "text": "The worst performance (where our baseline outperforms gaze features) is seen where one relation is associated with closed class words" }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-65", "text": "(DET, PREP, AUX), and where discrimination is easier. Table 2: Most predictive features for the binary classification of the four most frequent dependency relations for nouns using five-fold cross validation."
}, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-66", "text": "Noun experiment Error reductions for pairwise classification of nouns are between -4% and 41%." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-67", "text": "See Figure 2 ." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-68", "text": "The average accuracy for binary noun experiments is 0.721." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-69", "text": "Baseline accuracy is 0.647." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-70", "text": "For POBJ and DOBJ the baseline was better than using gaze, but for the other pairs, gaze was better." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-71", "text": "When doing stability selection for nouns with only the four most frequent grammatical functions, the most important features can be seen from Figure 2 ." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-72", "text": "The most informative feature is the fixation probability of the next word." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-73", "text": "Kernel density of this feature can be seen in Figure 3a , and it shows two types of behavior: POBJ and DOBJ, where the next word is less frequently fixated, and NN and NSUBJ, where the next word is more frequently fixated." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-74", "text": "Whether the next word is fixated or not, can be influenced by the word length, as well as the fixation probability of the current word: If the word is very short, the next word can be processed from a fixation of the current word, and if the current word is not fixated, the eyes need to land somewhere in order for the visual span to cover a satisfactory part of the text." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-75", "text": "Word length and fixation probabilities for the nouns are reported in Figure 3c and Figure 3b to show that the dependency labels have similar densities." 
}, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-76", "text": "Dependency parsing We also evaluate our gaze features directly in a supervised dependency parser." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-77", "text": "Our baseline performance is relatively low because of the small training set, but comparable to performance often seen with low-resource languages." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-78", "text": "Evaluation metrics are labeled attachment scores (LAS) and unlabeled attachment scores (UAS), i.e. the number of words that get assigned the correct syntactic head with and without the correct dependency label." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-79", "text": "Gaze features lead to consistent improvements across all five domains." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-80", "text": "The average error reduction in LAS is 5.0%. For comparison, we also ran our parser with SENNA embeddings and EIGENWORDS embeddings." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-81", "text": "The gaze vectors proved overall more informative." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-82", "text": "----------------------------------" }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-84", "text": "In addition to Barrett and S\u00f8gaard (2015), our work relates to Matthies and S\u00f8gaard (2013), who study the robustness of a fixation prediction model across readers, not domains, but our work also relates in spirit to research on using weak supervision in NLP, e.g., work on using HTML markup to improve dependency parsers (Spitkovsky, 2013) or using click-through data to improve POS taggers (Ganchev et al., 2012)." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-85", "text": "There have been few studies correlating reading behavior and general dependency syntax in the literature."
}, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-86", "text": "Demberg and Keller (2008), having parsed the Dundee corpus using MINIPAR, show that dependency integration cost, roughly the distance between a word and its head, is predictive of reading times for nouns." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-87", "text": "Our finding could be a side-effect of this, since NSUBJ, NN and DOBJ/POBJ typically have very different dependency integration costs, while DOBJ and POBJ have about the same." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-88", "text": "Their study thus seems to support our finding that gaze features can be used to discriminate between the grammatical functions of nouns." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-89", "text": "Most other work of this kind focuses on specific phenomena, e.g., Traxler et al. (2002), who show that subjects find it harder to process object relative clauses than subject relative clauses." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-90", "text": "This paper is related to such work, but our interest is a broader model of syntactic influences on reading patterns." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-91", "text": "----------------------------------" }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-92", "text": "**CONCLUSIONS**" }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-93", "text": "We have shown that gaze features can be used to discriminate between a subset of grammatical functions, even across domains, using only a small dataset and explored which features are more useful." }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-94", "text": "Furthermore, we have shown that gaze features can be used to improve a state-of-the-art dependency parsing model, even when trained on small seeds of data, which suggests that parsers can benefit from data from human processing."
}, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-95", "text": "----------------------------------" }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-96", "text": "**APPENDIX: GAZE FEATURES**" }, { "sent_id": "b4093db328fd6839777a6d34507b34-C001-97", "text": "First fixation duration on every word, fixation probability, mean fixation duration per sentence, mean fixation duration per word, next fixation duration, next word fixation probability, probability to get 1 st fixation, probability to get 2 nd fixation, previous fixation duration, previous word fixation probability, re-read probability, reading time per sentence normalized by word count, share of fixated words per sentence, time percentage spent on this word out of total sentence reading time, total fixation duration per word, total regression from word duration, total duration of regressions to word, n fixations on word, n fixations per sent normalized by token count, n long regressions from word, n long regressions per sentence normalized by token count, n long regressions to word, n refixations on word, n re-fixations per sentence normalized by token count, n regressions from word, n regressions per sentence normalized by token count, n regressions to word." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "b4093db328fd6839777a6d34507b34-C001-8" ] ], "cite_sentences": [ "b4093db328fd6839777a6d34507b34-C001-8" ] }, "@USE@": { "gold_contexts": [ [ "b4093db328fd6839777a6d34507b34-C001-24" ] ], "cite_sentences": [ "b4093db328fd6839777a6d34507b34-C001-24" ] }, "@SIM@": { "gold_contexts": [ [ "b4093db328fd6839777a6d34507b34-C001-58" ] ], "cite_sentences": [ "b4093db328fd6839777a6d34507b34-C001-58" ] }, "@UNSURE@": { "gold_contexts": [ [ "b4093db328fd6839777a6d34507b34-C001-84" ] ], "cite_sentences": [ "b4093db328fd6839777a6d34507b34-C001-84" ] } } }, "ABC_60f4e3a8c8cae1b1ba2620b11ae6b0_40": { "x": [ { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-2", "text": "We investigate the extent to which individual attention heads in pretrained transformer language models, such as BERT and RoBERTa, implicitly capture syntactic dependency relations." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-3", "text": "We employ two methods-taking the maximum attention weight and computing the maximum spanning tree-to extract implicit dependency relations from the attention weights of each layer/head, and compare them to the ground-truth Universal Dependency (UD) trees." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-4", "text": "We show that, for some UD relation types, there exist heads that can recover the dependency type significantly better than baselines on parsed English text, suggesting that some self-attention heads act as a proxy for syntactic structure." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-5", "text": "We also analyze BERT fine-tuned on two datasets-the syntaxoriented CoLA and the semantics-oriented MNLI-to investigate whether fine-tuning affects the patterns of their self-attention, but we do not observe substantial differences in the overall dependency relations extracted using our methods." 
}, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-6", "text": "Our results suggest that these models have some specialist attention heads that track individual dependency types, but no generalist head that performs holistic parsing significantly better than a trivial baseline, and that analyzing attention weights directly may not reveal much of the syntactic knowledge that BERT-style models are known to learn." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-7", "text": "----------------------------------" }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-9", "text": "Pretrained Transformer models such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) have shown stellar performance on language understanding tasks, significantly improve the state-of-the-art on many tasks such as dependency parsing (Zhou et al., 2019) , question answering (Rajpurkar et al., 2016) , and have at- * Equal contribution \u2020 Currently working at Verisk Analytics." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-10", "text": "This work was completed when the author was at New York University." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-11", "text": "tained top positions on transfer learning benchmarks such as GLUE and Su-perGLUE ." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-12", "text": "As these models become a staple component of many NLP tasks, it is crucial to understand what kind of knowledge they learn, and why and when they perform well." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-13", "text": "To that end, researchers have investigated the linguistic knowledge that these models learn by analyzing BERT (Goldberg, 2018; Lin et al., 2019) directly or training probing classifiers on the contextualized embeddings or attention heads of BERT (Tenney et al., 2019b,a; Hewitt and Manning, 2019) ." 
}, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-14", "text": "BERT and RoBERTa, as Transformer models (Vaswani et al., 2017) , compute the hidden representation of all the attention heads at each layer for each token by attending to all the token representations in the preceding layer." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-15", "text": "In this work, we investigate the hypothesis that BERTstyle models use at least some of their attention heads to track syntactic dependency relationships between words." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-16", "text": "We use two dependency relation extraction methods to extract dependency relations from each self-attention heads of BERT and RoBERTa." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-17", "text": "The first method-maximum attention weight (MAX)-designates the word with the highest incoming attention weight as the parent, and is meant to identify specialist heads that track specific dependencies like obj (in the style of Clark et al., 2019) ." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-18", "text": "The second-maximum spanning tree (MST)-computes a maximum spanning tree over the attention matrix, and is meant to identify generalist heads that can form complete, syntactically informative dependency trees." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-19", "text": "We analyze the extracted dependency relations and trees to investigate whether the attention heads of these models track syntactic dependencies significantly better than chance or baselines, and what type of dependency relations they learn best." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-20", "text": "In contrast to probing models (Adi et al., 2017; Conneau et al., 2018) , our methods require no further training." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-21", "text": "In prior work, Clark et al. 
(2019) find that some heads of BERT exhibit the behavior of some dependency relation types, though they do not perform well at all types of relations in general." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-22", "text": "We are able to replicate their results on BERT using our MAX method." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-23", "text": "In addition, we also perform a similar analysis on BERT models fine-tuned on natural language understanding tasks as well as RoBERTa." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-24", "text": "Our experiments suggest that there are particular attention heads of BERT and RoBERTa that encode certain dependency relation types such as nsubj, obj with substantially higher accuracy than our baselines-a randomly initialized Transformer and relative positional baselines." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-25", "text": "We find that fine-tuning BERT on the syntax-oriented CoLA does not significantly impact the accuracy of extracted dependency relations." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-26", "text": "However, when fine-tuned on the semantics-oriented MNLI dataset, we see improvements in accuracy for longer-distance clausal relations and a slight loss in accuracy for shorter-distance relations." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-27", "text": "Overall, while BERT and RoBERTa models obtain nontrivial accuracy for some dependency types such as nsubj, obj and conj when we analyze individual heads, their performance still leaves much to be desired." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-28", "text": "On the other hand, when we use the MST method to extract full trees from specific dependency heads, BERT and RoBERTa fail to meaningfully outperform our baselines." 
}, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-29", "text": "Although the attention heads of BERT and RoBERTa capture several specific dependency relation types somewhat well, they do not reflect the full extent of the significant amount of syntactic knowledge that these models are known to learn." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-30", "text": "----------------------------------" }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-31", "text": "**RELATED WORK**" }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-32", "text": "Previous works have proposed methods for extracting dependency relations and trees from the attention heads of the transformer-based neural machine translation (NMT) models." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-33", "text": "In their preliminary work, Mare\u010dek and Rosa (2018) aggregate the attention weights across the self-attention layers and heads to form a single attention weight matrix." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-34", "text": "Using this matrix, they propose a method to extract constituency and (undirected) dependency trees by recursively splitting and constructing the maximum spanning trees respectively." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-35", "text": "In contrast, Raganato and Tiedemann (2018) train a Transformer-based machine translation model on different language pairs and extract the maximum spanning tree algorithm from the attention weights of the encoder for each layer and head individually." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-36", "text": "They find that the best dependency score is not significantly higher than a right-branching tree baseline." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-37", "text": "Voita et al. 
(2019) find the most confident attention heads of the Transformer NMT encoder based on a heuristic of the concentration of attention weights on a single token, and find that these heads mostly attend to relative positions, syntactic relations, and rare words." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-38", "text": "Additionally, researchers have investigated the syntactic knowledge that BERT learns by analyzing the contextualized embeddings (Warstadt et al., 2019a) and attention heads of BERT (Clark et al., 2019) ." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-39", "text": "Goldberg (2018) analyzes the contextualized embeddings of BERT by computing language model surprisal on subject-verb agreement and shows that BERT learns significant knowledge of syntax." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-40", "text": "Tenney et al. (2019b) introduce a probing classifier for evaluating syntactic knowledge in BERT and show that BERT encodes syntax more than semantics." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-41", "text": "Hewitt and Manning (2019) train a structural probing model that maps the hidden representations of each token to an inner-product space that corresponds to syntax tree distance." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-42", "text": "They show that the learned spaces of strong models such as BERT and ELMo (Peters et al., 2018) are better for reconstructing dependency trees compared to baselines." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-43", "text": "Clark et al. (2019) train a probing classifier on the attentionheads of BERT and show that BERT's attention heads capture substantial syntactic information." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-44", "text": "While there has been prior work on analysis of the attention heads of BERT, we believe we are the first to analyze the dependency relations learned by the attention heads of fine-tuned BERT models and RoBERTa." 
}, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-45", "text": "----------------------------------" }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-46", "text": "**METHODS**" }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-47", "text": "----------------------------------" }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-48", "text": "**MODELS**" }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-49", "text": "BERT (Devlin et al., 2019) is a Transformer-based masked language model, pretrained on BooksCorpus (Zhu et al., 2015) and English Wikipedia, that has attained stellar performance on a variety of downstream NLP tasks." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-50", "text": "RoBERTa (Liu et al., 2019) adds several refinements to BERT while using the same model architecture and capacity, including a longer training schedule over more data, and shows significant improvements over BERT on a wide range of NLP tasks." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-51", "text": "We run our ex-periments on the pretrained large versions of both BERT (cased and uncased) and RoBERTa models, which consist of 24 self-attention layers with 16 heads each layer." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-52", "text": "For a given dataset, we feed each input sentence through the respective model and capture the attention weights for each individual head and layer." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-53", "text": "Phang et al. (2018) report the performance gains on the GLUE benchmark by supplementing pretrained BERT with data-rich supervised tasks such as the Multi-Genre Natural Language Inference dataset (MNLI; Williams et al., 2018) ." 
}, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-54", "text": "Although these fine-tuned BERT models may learn different aspects of language and show different performance from BERT on GLUE benchmark, comparatively little previous work has investigated the syntax learned by these fine-tuned models (Warstadt et al., 2019a) ." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-55", "text": "We run experiments on the uncased BERT-large model finetuned on the Corpus of Linguistic Acceptability (CoLA; Warstadt et al., 2019b) and MNLI to investigate the impact of fine-tuning on a syntaxrelated task (CoLA) and a semantic-related task (MNLI) on the structure of attention weights and resultant extracted dependency relations." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-56", "text": "We refer to these fine-tuned models as CoLA-BERT and MNLI-BERT respectively." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-57", "text": "----------------------------------" }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-58", "text": "**ANALYSIS METHODS**" }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-59", "text": "We aim to test the hypothesis that the attention heads of BERT learn syntactic relations implicitly, and that the self-attention between two words corresponds to their dependency relation." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-60", "text": "We use two methods for extracting dependency relations from the attention heads of Transformer-based models." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-61", "text": "Both methods operate on the attention weight matrix W \u2208 (0, 1) T \u00d7T for a given head at a given layer, where T is the number of tokens in the sequence, and the rows and columns correspond to the attending and attended tokens respectively (such that each row sums to 1)." 
}, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-62", "text": "Method 1: Maximum Attention Weights (MAX) Given a token A in a sentence, a selfattention mechanism is designed to assign high attention weights on tokens that have some kind of relationship with token A (Vaswani et al., 2017) ." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-63", "text": "Therefore, for a given token A, a token B that has the highest attention weight with respect to the token A should be related to token A. Our aim is to investigate whether this relation maps to a universal dependency relation." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-64", "text": "We assign a relation (w i , w j ) between word w i and w j if j = argmax W [i] for each row (that corresponds to a word in attention matrix) i in attention matrix W ." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-65", "text": "Based on this simple strategy, we extract relations for all sentences in our evaluation datasets." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-66", "text": "This method is similar to Clark et al. (2019) , and attempts to recover individual arcs between words; the relations extracted using this method need not form a valid tree, or even be fully connected, and the resulting edge directions may or may not match the canonical directions." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-67", "text": "Hence, we evaluate the resulting arcs individually and ignore their direction." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-68", "text": "After extracting dependency relations from all heads at all layers, we take the maximum UUAS over all relations types." 
}, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-69", "text": "----------------------------------" }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-70", "text": "**METHOD 2: MAXIMUM SPANNING TREE (MST)**" }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-71", "text": "We also want to investigate if there are attention heads of BERT that can form complete, syntactically informative parse trees." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-72", "text": "To extract complete valid dependency trees from the attention weights for a given layer and head, we follow the approach of Raganato and Tiedemann (2018) and treat the matrix of attention weight tokens as a complete weighted directed graph, with the edges pointing from the output token to each attended token." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-73", "text": "As in Raganato and Tiedemann, we take the root of the gold dependency tree as the starting node and apply the Chu-Liu-Edmonds algorithm (Chu, 1965; Edmonds, 1967) to compute the maximum spanning tree." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-74", "text": "(Using the gold root as the starting point in MST may artificially improve our results slightly, but this bias is applied evenly across all the models we compare.) The resulting tree is a valid directed dependency tree, though we follow Hewitt and Manning (2019) in evaluating it as undirected, for easier comparison with our MAX method." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-75", "text": "Following Voita et al. (2019) , we exclude the sentence demarcation tokens ([CLS], [SEP], , ) from the attention matrices." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-76", "text": "This allows us to focus on inter-word attention." 
}, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-77", "text": "Where the tokenization of our parsed corpus does not match the model's tokenization, we merge the non-matching tokens until the tokenizations are mutually compatible, and sum the attention weights for the corresponding columns and rows." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-78", "text": "We then apply either of the two extraction methods to the attention matrix." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-79", "text": "During evaluation when we compare the gold dependencies, to handle the subtokens within the merged tokens, we set all subtokens except for the first to depend on the first subtoken." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-80", "text": "This approach is largely similar to that in Hewitt and Manning (2019) ." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-81", "text": "We use the English Parallel Universal Dependencies (PUD) treebank from the CoNLL 2017 shared task (Zeman et al., 2018) as the gold standard for our evaluation." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-82", "text": "Baselines Many dependency relations tend to occur in specific positions relative to the parent word." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-83", "text": "For example, amod (adjectival modifier) mostly occurs before a noun." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-84", "text": "As an example, Figure 1 shows the distribution of relative positions for four major UD relations in our data." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-85", "text": "Following Voita et al. (2019) , we compute the most common positional offset between a parent and child word for a given dependency relation, and formulate a baseline based on that most common relative positional offset to evaluate our methods." 
}, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-86", "text": "For MST, as we also want to evaluate the quality of the entire tree, we use a right-branching dependency tree as baseline." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-87", "text": "Additionally, we use a BERT-large model with randomly initialized weights (which we refer to as random BERT), as previous work has shown that randomly initialized sentence encoders perform surprisingly well on a suite of NLP tasks (Zhang and Bowman, 2018; Wieting and Kiela, 2019) ." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-88", "text": "Figure 2 and Table 1 describe the accuracy for the most frequent relation types in the dataset using relations extracted based on the MAX method." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-89", "text": "We also include results for the rarer long-distance advcl and csubj dependency types, as they show that MNLI-BERT has a tendency to track clausal dependencies better than BERT, CoLA-BERT, and RoBERTa." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-90", "text": "The non-random models outperform random BERT substantially for all dependency types." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-91", "text": "They also outperform the rel- ative position baselines for some relation types." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-92", "text": "They outperform the baselines by a large margin for nsubj and obj, but only slightly better for advmod and amod." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-93", "text": "These results are consistent with the findings of Clark et al. (2019) ." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-94", "text": "Moreover, we do not observe substantial differences in accuracy by fine-tuning on CoLA." 
}, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-95", "text": "Both BERT and CoLA-BERT have similar or slightly better performance than MNLI-BERT, except for clausal dependencies such as advcl (adverbial clause modifier) and csubj (clausal subject) where MNLI-BERT outperforms BERT and CoLA-BERT by more than 5 absolute points in accuracy." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-96", "text": "This suggests that fine-tuning on a semantics-oriented task encourages effective long-distance dependencies, although it slightly degrades the performance in other shorter-distance dependency types." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-97", "text": "Figure 3 shows the accuracy for nsubj, obj, advmod, and amod relations extracted based on the MST method." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-98", "text": "Similar to the MAX method, we choose the best accuracy for each relation type." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-99", "text": "We observe that the models outperform the baselines by a large margin for nsubj and obj." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-100", "text": "However, the models do not outperform the positional baseline for advmod and amod." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-101", "text": "Surprisingly, RoBERTa performs worse than other BERT models in all categories when the MAX method is used to extract the trees, but it outperforms all other models when the MST method is used." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-102", "text": "Figure 4 describes the maximum undirected unlabeled attachment scores (UUAS) across each layer." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-103", "text": "The trained models achieve significantly higher UUAS than random BERT." 
}, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-104", "text": "Although the trained models perform better than the rightbranching baseline in most cases, the performance gap is not substantial." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-105", "text": "Given that the MST method uses the root of the gold trees, whereas the rightbranching baseline does not, this implies that the attention weights in the different layers/heads of the BERT models do not appear to correspond to complete, syntactically informative parse trees." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-106", "text": "----------------------------------" }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-107", "text": "**RESULTS**" }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-108", "text": "Overall, the results of both analysis methods suggest that, although some attention heads of BERT capture specific dependency relation types, they do not reflect the full extent of the significant amount of syntactic knowledge BERT and RoBERTa are known to learn as shown in previous syntactic probing work (Tenney et al., 2019b; Hewitt and Manning, 2019) ." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-109", "text": "Additionally, we find that fine-tuning on the semantics-oriented MNLI dataset improves long term dependencies while slightly degrading the performance for other dependency types." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-110", "text": "The overall performance of BERT and the fine-tuned BERTs over the nonrandom baselines are not substantial, and finetuning on CoLA and MNLI also does not have a large impact on UUAS." 
}, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-111", "text": "----------------------------------" }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-112", "text": "**CONCLUSION**" }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-113", "text": "In this work, we investigate whether the attention heads of BERT and RoBERTa exhibit the implicit syntax dependency by extracting and analyzing the dependency relations from the attention heads at all layers." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-114", "text": "We use two simple dependency relation extraction methods that require no additional training, and observe that there are certain specialist attention heads of the models that track specific dependency types, but neither of our analysis methods support the existence of generalist heads that can perform holistic parsing." }, { "sent_id": "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-115", "text": "Furthermore, we observe that fine-tuning on CoLA and MNLI does not significantly change the overall pattern of selfattention within the frame of our analysis, despite their being tuned for starkly different downstream tasks." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-13" ], [ "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-41" ] ], "cite_sentences": [ "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-13", "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-41" ] }, "@USE@": { "gold_contexts": [ [ "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-74" ] ], "cite_sentences": [ "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-74" ] }, "@SIM@": { "gold_contexts": [ [ "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-79", "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-80" ] ], "cite_sentences": [ "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-80" ] }, "@DIF@": { "gold_contexts": [ [ "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-108" ] ], "cite_sentences": [ "60f4e3a8c8cae1b1ba2620b11ae6b0-C001-108" ] } } }, "ABC_7516b533aafa8b41e7e554fa54e39c_40": { "x": [ { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-2", "text": "Multiple-choice Machine Reading Comprehension (MRC) is an important and challenging Natural Language Understanding (NLU) task, in which a machine must choose the answer to a question from a set of choices, with the question placed in context of text passages or dialog." }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-3", "text": "In the last a couple of years the NLU field has been revolutionized with the advent of models based on the Transformer architecture, which are pretrained on massive amounts of unsupervised data and then fine-tuned for various supervised learning NLU tasks." }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-4", "text": "Transformer models have come to dominate a wide variety of leader-boards in the NLU field; in the area of MRC, the current state-of-the-art model on the DREAM dataset (see [Sun et al., 2019] ) fine tunes Albert, a large pretrained Transformer-based model, and additionally combines it with an extra layer of multi-head attention between context and question-answer [Zhu et al., 2020] ." 
}, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-5", "text": "The purpose of this note is to document a new state-of-the-art result in the DREAM task, which is accomplished by, additionally, performing multi-task learning on two MRC multi-choice reading comprehension tasks (RACE and DREAM)." }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-6", "text": "----------------------------------" }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-8", "text": "Making a computer system understand text within context and answer questions is challenging but has attracted a lot of interest of the Artificial Intelligence community and general audience for a long time." }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-9", "text": "In the recent years, many Machine Reading Comprehension (MRC) datasets have been published, with different genres and formats of the context and questions." }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-10", "text": "The context could be in the form of text passages, or in the form of dialogues." }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-11", "text": "The questions could be open-formed (e.g. HotPotQA [Yang et al., 2018] ), asking the system to either extract the answers as spans from the context or external knowledge, or abstract and summarize the answers; the questions could also be in the form of asking the system to choose the best answer from multiple choices." }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-12", "text": "In this note we will focus on the multi-choice MRC tasks, more specifically, the DREAM task [Sun et al., 2019 ]." 
}, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-13", "text": "----------------------------------" }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-14", "text": "**DATASETS**" }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-15", "text": "RACE [Lai et al., 2017] is a large-scale reading comprehension dataset with more than 28,000 passages and nearly 100,000 questions." }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-16", "text": "The average passage length is 322 words." }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-17", "text": "Each question provides 4 answer options to choose from." }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-18", "text": "The human ceiling performance is 94.5." }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-19", "text": "DREAM [Sun et al., 2019] is a much smaller reading comprehension dataset with more than 6,000 dialogues and over 10,000 questions." }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-20", "text": "The average dialogue length is 86 words." }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-21", "text": "Each question provides 3 answer options to choose from." }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-22", "text": "The human ceiling performance is 98.6." }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-23", "text": "----------------------------------" }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-24", "text": "**RELATED WORK**" }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-25", "text": "Early works on the DREAM task include feature-based GBDT [Sun et al., 2019] , and FTLM [Radford et al., 2018] which is based on the Transformer [Vaswani et al., 2017] architecture." 
}, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-26", "text": "The top system accuracy on the DREAM leaderboard has been advanced gradually to above 90 percent, since the break-through of the text encoder in the form of large pretrained Transformerbased models (BERT [Devlin et al., 2019] , XLNet [Yang et al., 2019a] , RoBERTa [Liu et al., 2019] , Albert [Lan et al., 2020] )." }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-27", "text": "Transfer learning is a widely used practice in machine learning (ML) that focuses on utilizing knowledge gained while solving one problem and applying it to a different but related problem." }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-28", "text": "Using pretrained language models (LMs) such as ELMO [Peters et al., 2018] and BERT [Devlin et al., 2019] in down-stream tasks is an example of sequential transfer learning." }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-29", "text": "On the other hand, multi-task learning, which involves learning several similar tasks simultaneously, is able to share the knowledge learned among the tasks." }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-30", "text": "On the DREAM leaderboard 1 , the recent top systems include RoBERTA large +MMM in [Jin et al., 2020] and Albert xxlarge +DUMA in [Zhu et al., 2020] ." }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-31", "text": "Both systems employ model architectures composed of a Transformer-based encoder and some matching/attention mechanism between the context and the question-answer pair." }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-32", "text": "RoBERTA large +MMM in [Jin et al., 2020] additionally employed two stages of transfer learning: coarse-tuning with natural language inference (NLI) tasks and multi-task learning with multi-choice reading comprehension tasks." 
}, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-33", "text": "We used Albert xxlarge +DUMA as our model architecture and did multi-task learning on top of that, as through experiments we felt it efficiently boosts the performance on top of the powerful Albert xxlarge model." }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-34", "text": "----------------------------------" }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-35", "text": "**METHODS**" }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-36", "text": "The model architecture is composed of a Transformer-based encoder, a linear layer classifier, and an extra attention layer(s) in between to model reasoning between the context and the question-answer, as in [Zhu et al., 2020] ." }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-37", "text": "We use the pretrained Albert xxlarge as the encoder, and fine tune it during the training process." }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-38", "text": "Since the DREAM dataset is small, joint training on both the DREAM task and the RACE task helps get a good boost on the DREAM task." }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-39", "text": "----------------------------------" }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-40", "text": "**MODEL ARCHITECTURE**" }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-41", "text": "Encoder When encoding an answer option for a question, we concatenate it with not only the question but also the context (passage for the RACE task, dialogue for the DREAM task) to form one single sequence, the parts separated by the token." }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-42", "text": "The sequence is fed through a Transformer-based encoder (Albert in our case)." }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-43", "text": "The sequence output of the encoder is then sent to the next part of the model." 
}, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-44", "text": "Extra Attention layer In the next part of the model, we use Dual Multi-head Co-Attention (DUMA) module as described in Section 4.2 of [Zhu et al., 2020] ." }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-45", "text": "Basically, it involves 1) splitting the output sequence from the encoder into two parts, one for the context and one for the question-answer; 2) from the two sequences, computing two attention representations, one from the context attending the question-answer, the other vice versa; 3) the two attention representations are individually mean-pooled and then concatenated together and sent to the next part of the model: the classifier." }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-46", "text": "Classifier For each question, over all the answer options, the classifier part takes the outputs from the Extra Attention layer, and feeds them through a linear layer." }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-47", "text": "The answer option with the highest probability is picked as the predicted answer." }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-48", "text": "The Cross Entropy function between the ground truth and the predicted probabilities is applied to compute the loss." }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-49", "text": "----------------------------------" }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-50", "text": "**MULTI-TASK LEARNING**" }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-51", "text": "The question/answer pairs in the DREAM task have syntactic and semantic characteristics that are generally different from the text sequences that are used to pre-train Albert." }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-52", "text": "Because the DREAM dataset is relatively small, it is reasonable to hypothesize that adding a larger amount of similar multi-choice MRC data in the training will be beneficial for the DREAM task." 
}, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-53", "text": "Inspired by [Jin et al., 2020] , we did multi-task learning on the DREAM task and the RACE task." }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-54", "text": "Although the number of choices are different in both tasks, we are still able to share all the parts of the model between them." }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-55", "text": "We sampled mini-batches from the DREAM and RACE datasets in proportion of the relative sizes of the two datasets." }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-56", "text": "----------------------------------" }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-57", "text": "**EXPERIMENT SETTINGS**" }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-58", "text": "We used the pretrained Albert-xxlarge-v2 model as our encoder, and one layer of DUMA as in [Zhu et al., 2020] ." }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-59", "text": "Since we do not have the implementation details on the number of DUMA attention heads and the head size in [Zhu et al., 2020] , for reimplementation we used the setting as in Alberta-xxlarge self attention layers: 64 attention heads and each head has a dimension of 64." }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-60", "text": "We used the same setting in our multi-task learning." }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-61", "text": "Our codes are written based on Transformers 2 ." }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-62", "text": "The maximum sequence length was set to 512." }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-63", "text": "We used a mini-batch size of 24, and a learning rate of 1e-05." }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-64", "text": "The gradient norm was clipped to 1." }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-65", "text": "We adopted a linear scheduler with a weight decay of 0.01, trained the model for 5 epochs." 
}, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-66", "text": "For the multi-task learning, we used 10% of the total steps as warming up, evaluated on the dev set at every 1000 steps and saved the best model on the dev set." }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-67", "text": "For the single-task training on the DREAM dataset (the second last line in Table 1 ), we evaluated on the dev set at every 100 steps and saved the best model on the dev set." }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-68", "text": "We did not do hyper-parameter searching and only had one run of multi-task learning at the moment." }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-69", "text": "For the single task training we had three runs and picked the model with the best accuracy on the dev set." }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-70", "text": "All the experiments was run on four v100 GPUs in a single machine." }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-71", "text": "Table 1 summarizes the experiment results." }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-72", "text": "The numbers marked with \u22c6 are from [Jin et al., 2020] , the numbers marked with \u2020 are from [Zhu et al., 2020] ." }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-73", "text": "Note that the second last line is our implementation of Albert xxlarge +DUMA, with 64 DUMA heads of dimension 64." }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-74", "text": "The multi-task learning in the last line was run with similar settings and parameters." }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-75", "text": "----------------------------------" }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-76", "text": "**RESULTS**" }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-77", "text": "Compared to [Zhu et al., 2020] , our implementation of Albert xxlarge +DUMA obtained a higher accuracy on the dev set but a lower accuracy on the test set." 
}, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-78", "text": "The possible reasons could be that we did not have the exact setting such as attention head model dev test FTLM++ [Sun et al., 2019] 58.1 * 58.2 * BERT large [Devlin et al., 2019] 66.0 * 66.8 * XLNet [Yang et al., 2019b] * 72.0 * RoBERTa large [Liu et al., 2019] 85.4 * 85.0 * RoBERTa large +MMM [Jin et al., 2020] 88.0 * 88.9 * Albert xxlarge [Lan et al., 2020] 89.2 \u2020 88.5 \u2020 Albert xxlarge +DUMA [Zhu et al., 2020] 89.3 \u2020 90.4 \u2020 Albert xxlarge +DUMA(our implementation) 90.7 88.6 Our model (above model with multi-task learning) 91.9 91.8 Table 1 : Results on DREAM dataset, the numbers marked with \u22c6 are from [Jin et al., 2020] , the numbers marked with \u2020 are from [Zhu et al., 2020] number and size, and/or randomness." }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-79", "text": "Nonetheless, the model from multi-task learning had a good boost over both the numbers from [Zhu et al., 2020] and the numbers from our re-implementation of the single-task learning." }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-80", "text": "This shows that although the context part in the DREAM task is in dialogue style instead of passage style in the RACE task, the DREAM task could still benefit a lot from learning together with the RACE task, because of similar domain and similar question-answer style." }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-81", "text": "----------------------------------" }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-82", "text": "**ACKNOWLEDGEMENT**" }, { "sent_id": "7516b533aafa8b41e7e554fa54e39c-C001-83", "text": "The author would like to thank Luis Lastras, Sachindra Joshi, and Chulaka Gunasekara for helpful discussions." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "7516b533aafa8b41e7e554fa54e39c-C001-4" ], [ "7516b533aafa8b41e7e554fa54e39c-C001-19" ], [ "7516b533aafa8b41e7e554fa54e39c-C001-25" ] ], "cite_sentences": [ "7516b533aafa8b41e7e554fa54e39c-C001-4", "7516b533aafa8b41e7e554fa54e39c-C001-19", "7516b533aafa8b41e7e554fa54e39c-C001-25" ] }, "@USE@": { "gold_contexts": [ [ "7516b533aafa8b41e7e554fa54e39c-C001-12" ] ], "cite_sentences": [ "7516b533aafa8b41e7e554fa54e39c-C001-12" ] } } }, "ABC_373795850c8f182051214a8ee09461_40": { "x": [ { "sent_id": "373795850c8f182051214a8ee09461-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "373795850c8f182051214a8ee09461-C001-2", "text": "This paper compares a neural network DSM relying on textual co-occurrences with a multi-modal model integrating visual information." }, { "sent_id": "373795850c8f182051214a8ee09461-C001-3", "text": "We focus on nominal vs. verbal compounds, and zoom into lexical, empirical and perceptual target properties to explore the contribution of the visual modality." }, { "sent_id": "373795850c8f182051214a8ee09461-C001-4", "text": "Our experiments show that (i) visual features contribute differently for verbs than for nouns, and (ii) images complement textual information, if (a) the textual modality by itself is poor and appropriate image subsets are used, or (b) the textual modality by itself is rich and large (potentially noisy) images are added." }, { "sent_id": "373795850c8f182051214a8ee09461-C001-5", "text": "----------------------------------" }, { "sent_id": "373795850c8f182051214a8ee09461-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "373795850c8f182051214a8ee09461-C001-7", "text": "Distributional semantic models (DSMs) rely on the distributional hypothesis (Harris, 1954) , that words with similar distributions have related meanings." 
}, { "sent_id": "373795850c8f182051214a8ee09461-C001-8", "text": "They represent a well-established tool for modelling semantic relatedness between words and phrases (Bullinaria and Levy, 2007; Turney and Pantel, 2010) ." }, { "sent_id": "373795850c8f182051214a8ee09461-C001-9", "text": "In the last decade, standard DSMs using bag-of-words or syntactic cooccurrence counts have been enhanced by integration into neural networks Levy et al., 2015; Nguyen et al., 2016) , or by integrating perceptual information (Silberer and Lapata, 2014; Bruni et al., 2014; Kiela et al., 2014; Lazaridou et al., 2015) ." }, { "sent_id": "373795850c8f182051214a8ee09461-C001-10", "text": "While standard DSMs have been applied to a variety of semantic relatedness tasks such as word sense discrimination, selectional preferences, relation distinction (among others), multi-modal models have predominantly been evaluated on their general ability to model semantic similarity as captured by SimLex (Hill et al., 2015) , WordSim (Finkelstein et al., 2002) , etc." }, { "sent_id": "373795850c8f182051214a8ee09461-C001-11", "text": "In this paper, we compare a neural network DSM relying on textual co-occurrences with a multi-modal model extension integrating visual information." }, { "sent_id": "373795850c8f182051214a8ee09461-C001-12", "text": "We focus on the prediction of compositionality for two types of German multi-word expressions: noun-noun compounds and particle verbs." }, { "sent_id": "373795850c8f182051214a8ee09461-C001-13", "text": "Differently to most previous multimodal approaches, we thus address a semantically specific task that was traditionally addressed by standard DSMs, mainly for English and German (Baldwin, 2005; Bannard, 2005; Reddy et al., 2011; Salehi and Cook, 2013; Schulte im Walde et al., 2013; Salehi et al., 2014; Bott and Schulte im Walde, 2014; Bott and Schulte im Walde, 2015; Schulte im Walde et al., 2016a) ." 
}, { "sent_id": "373795850c8f182051214a8ee09461-C001-14", "text": "Furthermore, we zoom into factors that might influence the quality of predictions, such as lexical and empirical target properties (e.g., ambiguity, frequency, compositionality); and filters to optimise the visual space, such as dispersion and imageability filters (Kiela et al., 2014) , and a novel clustering filter." }, { "sent_id": "373795850c8f182051214a8ee09461-C001-15", "text": "Our experiments demonstrate that the contributions of the textual and the visual models differ for predictions across the nominal vs. verbal compositions." }, { "sent_id": "373795850c8f182051214a8ee09461-C001-16", "text": "The visual modality adds complementary features in cases where (a) the textual modality performs poorly, and images of the most imaginable targets are added, or (b) the textual modality performs well, and all available -potentially noisy-images are added." }, { "sent_id": "373795850c8f182051214a8ee09461-C001-17", "text": "In addition, we demonstrate that perceptual features of verbs, such as abstractness and imageability, have a different influence on multi-modality than for nouns, presumably because they are more difficult to grasp." }, { "sent_id": "373795850c8f182051214a8ee09461-C001-18", "text": "----------------------------------" }, { "sent_id": "373795850c8f182051214a8ee09461-C001-19", "text": "**DATA**" }, { "sent_id": "373795850c8f182051214a8ee09461-C001-20", "text": "Target Multi-Word Expressions (MWEs) German noun-noun compounds represent two-part multi-word expressions where both con- stituents are nouns, e.g., Feuerwerk 'fire works' is composed of the nominal constituents Feuer 'fire' and Werk 'opus'." }, { "sent_id": "373795850c8f182051214a8ee09461-C001-21", "text": "German particle verbs are complex verbs such as anstrahlen 'beam/smile at' which are composed of a separable prefix particle (such as an) and a base verb (such as strahlen 'beam/smile')." 
}, { "sent_id": "373795850c8f182051214a8ee09461-C001-22", "text": "Both types of German MWEs are highly frequent and highly productive in the lexicon." }, { "sent_id": "373795850c8f182051214a8ee09461-C001-23", "text": "In addition, the particles are notoriously ambiguous, e.g., an has a partitive meaning in anbei\u00dfen 'take a bite', a cumulative meaning in anh\u00e4ufen 'pile up', and a topological meaning in anbinden 'tie to' (Springorum, 2011) ." }, { "sent_id": "373795850c8f182051214a8ee09461-C001-24", "text": "We rely on two existing gold standards annotated with compositionality ratings: GS-NN, a set of 868 German noun-noun compounds (Schulte im Walde et al., 2016b) , and GS-PV, a set of 400 particle verbs across 11 particle types (Bott et al., 2016) ." }, { "sent_id": "373795850c8f182051214a8ee09461-C001-25", "text": "----------------------------------" }, { "sent_id": "373795850c8f182051214a8ee09461-C001-26", "text": "**MULTI-MODAL VECTOR SPACE MODELS**" }, { "sent_id": "373795850c8f182051214a8ee09461-C001-27", "text": "For the textual representation we used two sets of embeddings." }, { "sent_id": "373795850c8f182051214a8ee09461-C001-28", "text": "Based on word2vec (Mikolov et al., 2013) , we obtained both representations using the skipgram architecture with negative sampling." }, { "sent_id": "373795850c8f182051214a8ee09461-C001-29", "text": "The sets differ with respect to window size (5 vs. 10) and dimensionality (400 vs. 500)." }, { "sent_id": "373795850c8f182051214a8ee09461-C001-30", "text": "As corpus resource we relied on the lemmatized version of the DECOW14AX, a German web corpus containing 12 billion tokens (Sch\u00e4fer and Bildhauer, 2012) ." }, { "sent_id": "373795850c8f182051214a8ee09461-C001-31", "text": "The visual features rely on images downloaded from the bing search engine, following Kiela et al. (2016) ." 
}, { "sent_id": "373795850c8f182051214a8ee09461-C001-32", "text": "We queried 25 images per word, and converted all images into high-dimensional numerical representations by using the caffe toolkit (Jia et al., 2014) and pre-trained models." }, { "sent_id": "373795850c8f182051214a8ee09461-C001-33", "text": "In the default setting, a word is represented in the visual space by the mean vector of its 25 image representations." }, { "sent_id": "373795850c8f182051214a8ee09461-C001-34", "text": "As image-recognition neural network models, we used: (i) GoogLeNet (Szegedy et al., 2015) , a 22-layer deep network; we obtained vectors by using the value of the last layer before the final softmax, containing 1024 elements (= dimensionality)." }, { "sent_id": "373795850c8f182051214a8ee09461-C001-35", "text": "(ii) AlexNet (Krizhevsky et al., 2012) , a neural network with five convolutional layers (4,096-dim)." }, { "sent_id": "373795850c8f182051214a8ee09461-C001-36", "text": "The multi-modal representations were combined by applying mid-fusion between textual and visual representation, i.e., concatenation of the L2-normalized representations (Bruni et al., 2014) 1" }, { "sent_id": "373795850c8f182051214a8ee09461-C001-37", "text": "----------------------------------" }, { "sent_id": "373795850c8f182051214a8ee09461-C001-38", "text": "**EXPERIMENTS**" }, { "sent_id": "373795850c8f182051214a8ee09461-C001-39", "text": "Predicting Compositionality For the prediction of compositionality, we represented the meanings of the multi-word expressions and their constituent words by textual, visual and textual+visual (i.e., multi-modal) vectors." 
}, { "sent_id": "373795850c8f182051214a8ee09461-C001-40", "text": "The similarity of a compound-constituent vector pair as measured by the cosine was taken as the predicted degree of compound-constituent compositionality, and the overall ranking of pair similarities was compared to the gold standard compositionality ratings using Spearman's Rank-Order Correlation Coefficient \u03c1 (Siegel and Castellan, 1988) ." }, { "sent_id": "373795850c8f182051214a8ee09461-C001-41", "text": "----------------------------------" }, { "sent_id": "373795850c8f182051214a8ee09461-C001-42", "text": "**LEXICAL, EMPIRICAL AND VISUAL FILTERS**" }, { "sent_id": "373795850c8f182051214a8ee09461-C001-43", "text": "The experiments compare the predictions of compositionality across all targets in the gold standards." }, { "sent_id": "373795850c8f182051214a8ee09461-C001-44", "text": "2 Furthermore, we zoom into factors that might influence the quality of predictions: (A) the impact of lexical and empirical target properties, i.e., ambiguity (relying on the DUDEN dictionary 3 , frequency (as provided by the gold standards), abstractness and imageability (as taken from K\u00f6per and Schulte im Walde (2016)); (B) optimisation of the visual space: (i) In accordance with human concept processing (Paivio, 1990) , including image representations should be more useful for words which are visual." }, { "sent_id": "373795850c8f182051214a8ee09461-C001-45", "text": "We therefore apply the dispersion-based filter suggested by Kiela et al. (2014) ." }, { "sent_id": "373795850c8f182051214a8ee09461-C001-46", "text": "The filter decides whether to include perceptual information for a specific word or not, relying on a pairwise similarity between all images of a concept." }, { "sent_id": "373795850c8f182051214a8ee09461-C001-47", "text": "The underlying idea is that highly visual concepts are visualised by similar pictures and thus trigger a high average similarity between the word's images." 
}, { "sent_id": "373795850c8f182051214a8ee09461-C001-48", "text": "Abstract concepts, on the other hand, are expected to provide a lower dispersion." }, { "sent_id": "373795850c8f182051214a8ee09461-C001-49", "text": "For a given word, the filter decides about using only the textual representation, or both the textual and visual representations, depending on the dispersion value and a predefined threshold (set to the median of all the dispersion values)." }, { "sent_id": "373795850c8f182051214a8ee09461-C001-50", "text": "(ii) We apply an imageability filter based on external imageability norms (K\u00f6per and Schulte im Walde, 2016) , to successively include only images for the most imaginable target words." }, { "sent_id": "373795850c8f182051214a8ee09461-C001-51", "text": "This filter is applied in the same way as dispersion." }, { "sent_id": "373795850c8f182051214a8ee09461-C001-52", "text": "(iii) We suggest a novel clustering filter, that performs a clustering of the 25 images for a given concept, using the algo- Figure 2 present the prediction results for the two gold standards, GS-NN and GS-PV." }, { "sent_id": "373795850c8f182051214a8ee09461-C001-53", "text": "For GS-NN, we focus on predicting the compositionality for compound-head pairs (ignoring compound-modifier pairs), in order to have a more parallel setup to GS-PV, where the particle verb compositionality focuses on the contribution of the base verb." }, { "sent_id": "373795850c8f182051214a8ee09461-C001-54", "text": "The figures show the results across all targets." }, { "sent_id": "373795850c8f182051214a8ee09461-C001-55", "text": "Note that the vertical axis, showing the range of Spearman's \u03c1 are different for both results." 
}, { "sent_id": "373795850c8f182051214a8ee09461-C001-56", "text": "----------------------------------" }, { "sent_id": "373795850c8f182051214a8ee09461-C001-57", "text": "**RESULTS AND DISCUSSION**" }, { "sent_id": "373795850c8f182051214a8ee09461-C001-58", "text": "Figures 3 and 4 zoom into target subsets regarding target ambiguity (one sense vs. multiple senses), frequency, abstractness vs. concreteness, imageability, and compositionality." }, { "sent_id": "373795850c8f182051214a8ee09461-C001-59", "text": "The bars refer to the textual model, the multi-modal model (including all images for all targets), and the best results obtained when using the dispersion/imageability/clustering 4 filters." }, { "sent_id": "373795850c8f182051214a8ee09461-C001-60", "text": "The plots demonstrate that overall the multimodal model provides only a tiny gain for GS-NN in comparison to the text-only model, which is however significant using Steiger's test (p < 0.001) (Steiger, 1980) ." }, { "sent_id": "373795850c8f182051214a8ee09461-C001-61", "text": "All filters worsen the results." }, { "sent_id": "373795850c8f182051214a8ee09461-C001-62", "text": "For GS-PV, we also obtain a significant improvement by the multi-modal model, but only when applying the imageability or the clustering filter to the visual information." }, { "sent_id": "373795850c8f182051214a8ee09461-C001-63", "text": "The main differences in the overall noun and verb results are emphasised in Figure 5 , comparing the successive increase of images to the multi-modal model in comparison to the textual model, based on the dispersion and imageability filters." }, { "sent_id": "373795850c8f182051214a8ee09461-C001-64", "text": "Note that the textual model baselines are very different for the two gold standards, \u03c1 = .65 for GS-NN and \u03c1 = .22 for GS-PV." 
}, { "sent_id": "373795850c8f182051214a8ee09461-C001-65", "text": "Regarding the nouns, the multi-modality improves the textual modality when adding the images for the \u224835% most imaginable words, and when adding all images." }, { "sent_id": "373795850c8f182051214a8ee09461-C001-66", "text": "Regarding the verbs, the multi-modality improves the textual modality in most proportions, reaching its maximum when adding images for \u224880% of the most imaginable verbs; when adding the \u224810% of the least imaginable verbs, the model strongly drops in its performance." }, { "sent_id": "373795850c8f182051214a8ee09461-C001-67", "text": "For the dispersion filter, the tendencies are less clear." }, { "sent_id": "373795850c8f182051214a8ee09461-C001-68", "text": "We conclude that the visual information adds to the textual information either by adding all (potentially noisy) images because the textual information is rich by itself; or by adding a selection of images (unless they are overly dissimilar to each other, or for non-imaginable targets), because the textual information by itself is poor." }, { "sent_id": "373795850c8f182051214a8ee09461-C001-69", "text": "Zooming into target subsets, the predictions for monosemous targets are better than those for ambiguous targets (significant for GS-NN), see Figure 3 ; ditto for low-frequency vs. high-frequency targets." }, { "sent_id": "373795850c8f182051214a8ee09461-C001-70", "text": "Taking frequency as an indicator of ambiguity, these differences are presumably due to the difficulty of distinguishing between multiple senses in vector spaces that subsume the features of all word senses within one vector, which applies to our textual and multi-modal models." }, { "sent_id": "373795850c8f182051214a8ee09461-C001-71", "text": "The gold standard predictions strongly differ regarding the influence of target abstractness, imageability and compositionality." 
}, { "sent_id": "373795850c8f182051214a8ee09461-C001-72", "text": "For GS-NN, the compositionality of concrete and imaginable targets is predicted better than for abstract and less imaginable targets, as one would expect and has been shown by Kiela et al. (2014) ; for GS-PV, the opposite is the case." }, { "sent_id": "373795850c8f182051214a8ee09461-C001-73", "text": "Similarly, while for GS-NN highly compositional targets are predicted worse than low-and mid-compositional targets, for GS-PV mid-compositional targets are predicted much worse than low-and high-compositional targets." }, { "sent_id": "373795850c8f182051214a8ee09461-C001-74", "text": "These differences in results point to questions that have still been unsolved across research fields: while humans can easily grasp intuitions about the abstractness, imageability and compositionality of nouns, the categorisations are difficult to define for verbs (Glenberg and Kaschak, 2002; Brysbaert et al., 2014) ." }, { "sent_id": "373795850c8f182051214a8ee09461-C001-75", "text": "Particle verbs add to this complexity, especially since compositionality (rating) is typically reduced to the semantic relatedness between the complex verb and the base verb, ignoring the particle that however contributes a considerable portion of meaning to the complex verb." }, { "sent_id": "373795850c8f182051214a8ee09461-C001-76", "text": "----------------------------------" }, { "sent_id": "373795850c8f182051214a8ee09461-C001-77", "text": "**CONCLUSION**" }, { "sent_id": "373795850c8f182051214a8ee09461-C001-78", "text": "The paper demonstrated strong differences in the effect of adding visual information to a textual neural network model, when predicting the compositionality for nominal vs. verbal MWE targets." 
}, { "sent_id": "373795850c8f182051214a8ee09461-C001-79", "text": "The visual modality adds complementary features in cases where (a) the textual modality performs poorly, and images of the most imaginable targets are added, or (b) the textual modality performs well, and all available -potentially noisyimages are added." }, { "sent_id": "373795850c8f182051214a8ee09461-C001-80", "text": "Image filters relying on imageability and a novel clustering filter positively affect the verbal but not the nominal perceptual feature spaces." } ], "y": { "@BACK@": { "gold_contexts": [ [ "373795850c8f182051214a8ee09461-C001-9" ], [ "373795850c8f182051214a8ee09461-C001-14" ] ], "cite_sentences": [ "373795850c8f182051214a8ee09461-C001-9", "373795850c8f182051214a8ee09461-C001-14" ] }, "@USE@": { "gold_contexts": [ [ "373795850c8f182051214a8ee09461-C001-45" ] ], "cite_sentences": [ "373795850c8f182051214a8ee09461-C001-45" ] }, "@SIM@": { "gold_contexts": [ [ "373795850c8f182051214a8ee09461-C001-72" ] ], "cite_sentences": [ "373795850c8f182051214a8ee09461-C001-72" ] } } }, "ABC_0888b30ae5dcc880761a92ffbdcd1b_40": { "x": [ { "sent_id": "0888b30ae5dcc880761a92ffbdcd1b-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "0888b30ae5dcc880761a92ffbdcd1b-C001-2", "text": "Here we sketch a new derivation of Zipf's law for word frequencies based on optimal coding." }, { "sent_id": "0888b30ae5dcc880761a92ffbdcd1b-C001-3", "text": "The structure of the derivation is reminiscent of Mandelbrot's random typing model but it has multiple advantages over random typing: (1) it departs from realistic cognitive pressures (2) it does not require fine tuning of parameters and (3) it sheds light on the origins of other statistical laws of language and thus can lead to a compact theory of linguistic laws." }, { "sent_id": "0888b30ae5dcc880761a92ffbdcd1b-C001-4", "text": "Our findings suggest that the recurrence of Zipf's law in human languages could originate from pressure for easy and fast communication." 
}, { "sent_id": "0888b30ae5dcc880761a92ffbdcd1b-C001-5", "text": "----------------------------------" }, { "sent_id": "0888b30ae5dcc880761a92ffbdcd1b-C001-6", "text": "****" }, { "sent_id": "0888b30ae5dcc880761a92ffbdcd1b-C001-7", "text": "Zipf's law for word frequencies states that the probability of the i-th most frequent word obeys p_i \u221d i^{-\u03b1} (Eq. 1)" }, { "sent_id": "0888b30ae5dcc880761a92ffbdcd1b-C001-8", "text": "with \u03b1 \u2248 1 [1, 2] ." }, { "sent_id": "0888b30ae5dcc880761a92ffbdcd1b-C001-9", "text": "Here we explore the possibility that Zipf's law is a consequence of compression, the minimization of the mean length of the words of a vocabulary [3] ." }, { "sent_id": "0888b30ae5dcc880761a92ffbdcd1b-C001-10", "text": "This principle has already been used to explain the origins of other linguistic laws: Zipf's law of abbreviation, namely, the tendency of more frequent words to be shorter [3, 4] , and Menzerath's law, the tendency of a larger linguistic construct to be made of smaller components [5] ." }, { "sent_id": "0888b30ae5dcc880761a92ffbdcd1b-C001-11", "text": "Our argument combines two constraints for compression: (1) non-singular coding, i.e. any two different words should not be represented by the same string of letters or phonemes, and (2) unique decipherability, i.e. given a continuous sequence of letters or phonemes, there should be only one way of segmenting it into words [6] ." }, { "sent_id": "0888b30ae5dcc880761a92ffbdcd1b-C001-12", "text": "The former is needed to reduce the cost of retrieving the original meaning." }, { "sent_id": "0888b30ae5dcc880761a92ffbdcd1b-C001-13", "text": "The latter is required to reduce the cost of determining word boundaries." }, { "sent_id": "0888b30ae5dcc880761a92ffbdcd1b-C001-14", "text": "Thus both constraints on compression, and compression itself, are realistic cognitive pressures that are vital to fight against the now-or-never bottleneck of linguistic processing [7] . Suppose that words are coded using an alphabet of N letters (or phonemes) and that p_i and l_i are the probability and the length of the i-th most probable word." }, { "sent_id": "0888b30ae5dcc880761a92ffbdcd1b-C001-15", "text": "On the one hand, optimal uniquely decipherable coding gives [6] l_i = \u2308log_N (1/p_i)\u2309 (Eq. 2)," }, { "sent_id": "0888b30ae5dcc880761a92ffbdcd1b-C001-16", "text": "where \u2308..\u2309 is the ceiling function." }, { "sent_id": "0888b30ae5dcc880761a92ffbdcd1b-C001-17", "text": "Thus l_i \u2248 log_N (1/p_i) (Eq. 3)." }, { "sent_id": "0888b30ae5dcc880761a92ffbdcd1b-C001-18", "text": "On the other hand, optimal non-singular coding gives [4] l_i = \u2308log_N (1 + i(N-1)/N)\u2309 (Eq. 4)" }, { "sent_id": "0888b30ae5dcc880761a92ffbdcd1b-C001-19", "text": "for N > 1. When N and i are sufficiently large, we have l_i \u2248 log_N i (Eq. 5)." }, { "sent_id": "0888b30ae5dcc880761a92ffbdcd1b-C001-20", "text": "Combining Eqs. 3 and 5, one obtains log_N (1/p_i) \u2248 log_N i, i.e. p_i \u2248 i^{-1} (Eq. 6)," }, { "sent_id": "0888b30ae5dcc880761a92ffbdcd1b-C001-21", "text": "and finally Zipf's law (Eq. 1) with \u03b1 = 1." }, { "sent_id": "0888b30ae5dcc880761a92ffbdcd1b-C001-22", "text": "By presenting this derivation we are not taking for granted that real language use is fully optimal with regard to any of the coding schemes mentioned above." }, { "sent_id": "0888b30ae5dcc880761a92ffbdcd1b-C001-23", "text": "Instead, our point is that it is not surprising that languages tend toward Zipf's law given the convenience of both kinds of compression for easy and fast communication [7] ." }, { "sent_id": "0888b30ae5dcc880761a92ffbdcd1b-C001-24", "text": "Our derivation of Zipf's law is reminiscent of Mandelbrot's derivation based on random typing [8] , which is defined by three parameters, N (the alphabet size), p_s (the probability of hitting the space) and l_0 (the minimum word length)." }, { "sent_id": "0888b30ae5dcc880761a92ffbdcd1b-C001-25", "text": "The last parameter has been introduced to accommodate other variants of Mandelbrot's model [9] ."
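The two optimal coding schemes in the excerpt above can be checked numerically. The sketch below is illustrative and not code from the cited work: `ud_length` implements the standard ceiling-of-log code length of optimal uniquely decipherable coding, and `ns_length` gives the i-th most probable word the i-th shortest string over an N-letter alphabet (optimal non-singular coding); both function names are our own.

```python
import math

def ud_length(p, N):
    # Optimal uniquely decipherable coding: a word of probability p gets a
    # code of length ceil(log_N(1/p)) (standard source-coding result).
    return math.ceil(math.log(1.0 / p, N))

def ns_length(i, N):
    # Optimal non-singular coding: assign the i-th most probable word the
    # i-th shortest string; l is the smallest length such that the number of
    # nonempty strings of length <= l (N + N^2 + ... + N^l) reaches i.
    l, total = 1, N
    while total < i:
        l += 1
        total += N ** l
    return l

N = 26
for i in (10, 1000, 100000):
    # For large i the non-singular length grows like log_N(i), which is the
    # approximation used to combine the two schemes into Zipf's law.
    print(i, ns_length(i, N), round(math.log(i, N), 2))
```

Equating the uniquely decipherable length log_N(1/p_i) with the non-singular length log_N(i) for large N and i yields p_i ≈ 1/i, i.e. Zipf's law with exponent 1.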
}, { "sent_id": "0888b30ae5dcc880761a92ffbdcd1b-C001-26", "text": "Mandelbrot's derivation assumes that typing at random determines the probability of a word, which has two key implications." }, { "sent_id": "0888b30ae5dcc880761a92ffbdcd1b-C001-27", "text": "First, a relationship between the length of a word and its probability [4] : l = a log p + b (Eq. 7)," }, { "sent_id": "0888b30ae5dcc880761a92ffbdcd1b-C001-28", "text": "where a and b are constants (a < 0) defined on the parameters of the model as a = 1/log((1 - p_s)/N)" }, { "sent_id": "0888b30ae5dcc880761a92ffbdcd1b-C001-29", "text": "and b = l_0 - log(p_s)/log((1 - p_s)/N)." }, { "sent_id": "0888b30ae5dcc880761a92ffbdcd1b-C001-30", "text": "Eq. 7 can be interpreted, approximately, as a linear generalization of the relationship between l and p of optimal uniquely decipherable codes in Eq." }, { "sent_id": "0888b30ae5dcc880761a92ffbdcd1b-C001-31", "text": "3." }, { "sent_id": "0888b30ae5dcc880761a92ffbdcd1b-C001-32", "text": "Second, a relationship between the length of a word and its rank that matches exactly that of optimal non-singular codes in Eq. 5 (while Eq. 5 is exact, Mandelbrot derived an approximate equation)." }, { "sent_id": "0888b30ae5dcc880761a92ffbdcd1b-C001-33", "text": "Combining these two implications of random typing, Mandelbrot derived Zipf's law for word frequency with \u03b1 > 1." }, { "sent_id": "0888b30ae5dcc880761a92ffbdcd1b-C001-34", "text": "The exact (Eq. 5) and approximate (Eq. 7) connections between random typing and optimal coding challenge the view of random typing as totally detached from cost-cutting considerations [10, 11] ." }, { "sent_id": "0888b30ae5dcc880761a92ffbdcd1b-C001-35", "text": "Our derivation of Zipf's law presents various advantages over random typing." }, { "sent_id": "0888b30ae5dcc880761a92ffbdcd1b-C001-36", "text": "First, it departs from realistic cognitive pressures [7] ."
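Mandelbrot's random typing model discussed above is easy to simulate. The sketch below is our own illustration with arbitrary parameter values (not taken from the cited work): each keystroke is a space with probability p_space or one of N letters chosen uniformly, and the resulting "word" frequencies fall off as a power of the rank.

```python
import random
from collections import Counter

def random_typing(n_keystrokes, N=4, p_space=0.2, seed=0):
    # Mandelbrot's monkey-typing model: hit the space bar with probability
    # p_space, otherwise one of N letters chosen uniformly at random.
    rng = random.Random(seed)
    letters = "abcdefghijklmnopqrstuvwxyz"[:N]
    counts, word = Counter(), []
    for _ in range(n_keystrokes):
        if rng.random() < p_space:
            if word:  # a space ends the current word, if any
                counts["".join(word)] += 1
                word = []
        else:
            word.append(rng.choice(letters))
    return counts

counts = random_typing(200_000)
ranked = [c for _, c in counts.most_common()]
# Frequencies decay roughly as a power of the rank (alpha > 1 in general).
print(ranked[:5], len(ranked))
```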
}, { "sent_id": "0888b30ae5dcc880761a92ffbdcd1b-C001-37", "text": "Second, random typing is based exclusively on random choices but its parameters cannot be set at random: indeed, a precise tuning of the parameters is needed to mimic Zipf's law with \u03b1 = 1 [8] ." }, { "sent_id": "0888b30ae5dcc880761a92ffbdcd1b-C001-38", "text": "In contrast, our argument only requires N to be large enough." }, { "sent_id": "0888b30ae5dcc880761a92ffbdcd1b-C001-39", "text": "Third, its assumptions are far-reaching: the unifying principle of compression sheds light on the origins of three linguistic laws at the same time: Zipf's law for word frequencies, Zipf's law of abbreviation and Menzerath's law [3, 4, 5] ." }, { "sent_id": "0888b30ae5dcc880761a92ffbdcd1b-C001-40", "text": "There are many ways of explaining the origins of power-law-like distributions such as Zipf's law for word frequencies [12] but compression appears to be the only one that can lead to a compact theory of statistical laws of language." }, { "sent_id": "0888b30ae5dcc880761a92ffbdcd1b-C001-41", "text": "Although uniquely decipherable codes are a subset of non-singular codes, it is tempting to think that both optimal non-singular coding and optimal uniquely decipherable coding cannot be satisfied to a large extent simultaneously." }, { "sent_id": "0888b30ae5dcc880761a92ffbdcd1b-C001-42", "text": "However, random typing is an example of how both constraints can be met approximately." }, { "sent_id": "0888b30ae5dcc880761a92ffbdcd1b-C001-43", "text": "We suggest that human languages are additional examples of a different nature." }, { "sent_id": "0888b30ae5dcc880761a92ffbdcd1b-C001-44", "text": "The two forms of optimality can coexist to some degree because the need for unique decipherability is alleviated by statistical cues that humans use to segment the linguistic input [13] ."
}, { "sent_id": "0888b30ae5dcc880761a92ffbdcd1b-C001-45", "text": "Here we have only sketched a new path to derive power-law-like distributions of ranks from efficiency considerations." }, { "sent_id": "0888b30ae5dcc880761a92ffbdcd1b-C001-46", "text": "We hope that our simple derivation stimulates further research." } ], "y": { "@BACK@": { "gold_contexts": [ [ "0888b30ae5dcc880761a92ffbdcd1b-C001-10" ], [ "0888b30ae5dcc880761a92ffbdcd1b-C001-27" ], [ "0888b30ae5dcc880761a92ffbdcd1b-C001-39" ] ], "cite_sentences": [ "0888b30ae5dcc880761a92ffbdcd1b-C001-10", "0888b30ae5dcc880761a92ffbdcd1b-C001-27", "0888b30ae5dcc880761a92ffbdcd1b-C001-39" ] } } }, "ABC_119d473a0a5a4c42de193e51564f1f_40": { "x": [ { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-2", "text": "The automatic identification of discourse relations is still a challenging task in natural language processing." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-3", "text": "Discourse connectives, such as since or but, are the most informative cues to identify explicit relations; however discourse parsers typically use a closed inventory of such connectives." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-4", "text": "As a result, discourse relations signaled by markers outside these inventories (i.e. AltLexes) are not detected as effectively." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-5", "text": "In this paper, we propose a novel method to leverage parallel corpora in text simplification and lexical resources to automatically identify alternative lexicalizations that signal discourse relation." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-6", "text": "When applied to the Simple Wikipedia and Newsela corpora along with WordNet and the PPDB, the method allowed the automatic discovery of 91 AltLexes." 
}, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-7", "text": "----------------------------------" }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-9", "text": "Understanding a text goes beyond understanding its textual units in isolation; the relation between these units must also be understood." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-10", "text": "Discourse connectives such as since, but, etc. are often used to explicitly connect textual units and signal the presence of specific explicit discourse relations such as CONTRAST, CAUSE, etc." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-11", "text": "The Penn Discourse Tree Bank (PDTB) framework (Prasad et al., 2008 ) defined a closed list of discourse connectives." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-12", "text": "Implicit, AltLex and EntRel are other realizations of discourse relations in the PDTB." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-13", "text": "AltLex (or alternative lexicalization relations), which are understudied in computational discourse processing, are signalled using an open list of lexical markers that are not part of the PDTB inventory of discourse connectives." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-14", "text": "Figure 1 shows a pair of sentences that convey the same information; however only one sentence contains a discourse connective from the PDTB inventory." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-15", "text": "Hence, a discourse parser using the PDTB inventory of connectives would easily identify the explicit CONTRAST relation in the first sentence but will likely not tag the second sentence because whilst is not part of the PDTB inventory of discourse connectives." 
}, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-16", "text": "However, the writer's intention can be understood using other means such as the use of an alternative lexical marker (i.e. AltLex), a change of tense, a structural signal, etc." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-17", "text": "Thus, discourse parsers can benefit from the automatic identification of AltLexes that can signal discourse relations." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-18", "text": "According to Pitler et al. (2009) , discourse connectives constitute strong clues to detect explicit relations, hence discourse parsers have relied on them as valuable features in order to identify explicit discourse relations automatically (Lin et al., 2014) ." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-19", "text": "Similarly, the presence of alternative lexical markers is a strong indicator of an AltLex relation; however since the list of such markers is open, identifying them is a challenge." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-20", "text": "----------------------------------" }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-21", "text": "**BACKGROUND**" }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-22", "text": "Discourse connectives (Blakemore, 1987; Knott and Dale, 1994; Schourup, 1985) are the most informative signals of explicit discourse relations (Pitler et al., 2009 )." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-23", "text": "However, they are not well-defined in linguistics." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-24", "text": "Levinson (1983) defined discourse connectives as words and phrases such as after all, actually, etc." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-25", "text": "that connect an utterance to the prior discourse." 
}, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-26", "text": "Zwicky (1985) considered discourse connectives as a class of particles, but did not specify what particles are considered as discourse connectives." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-27", "text": "Schiffrin (1988) also defined discourse connectives as words that connect dependent textual units in a discourse." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-28", "text": "According to Schiffrin, discourse connectives do not belong to any linguistic class and, except for a few discourse connectives such as oh and well, most carry meaning." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-29", "text": "[Figure 1, complex sentence] These works he produced and published himself, whilst his much larger woodcuts were mostly commissioned work. [Non-Explicit]" }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-30", "text": "[Figure 1, simple sentence] He created and published his works himself, but his larger works were mostly commissioned work to be sold. [Explicit CONTRAST]" }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-31", "text": "Figure 1: No explicit relation is detected in the complex sentence (left), but an explicit CONTRAST relation is identified in the simple sentence (right)." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-32", "text": "The example is taken from the Simple English Wikipedia corpus (Coster and Kauchak, 2011) ." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-33", "text": "Redeker (1991) revised this definition; even though she agreed that discourse connectives have meaning by themselves, she argued that they should contribute to the semantic interpretations of the discourse."
}, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-34", "text": "Apart from research efforts aiming at defining discourse connectives, another line of research has focused on providing a list of discourse connectives in English (Prasad et al., 2008; Andersen, 2001; Blakemore, 2002; Fischer, 2000; Sanders et al., 1992; Knott, 1996) and other languages (Pasch, 2003; Travis, 2005) ." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-35", "text": "While most of these inventories have been built by hand, some work has attempted to identify them automatically." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-36", "text": "Laali and Kosseim (2014) used the Europal parallel corpus and collocation techniques to induce French connectives from their English counterparts." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-37", "text": "Following this work, Hidey and McKeown (2016) built a parallel corpus of causal and non-causal AltLexes using word alignment with PDTB discourse connectives as initial seeds." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-38", "text": "Our work is different from these as we use already existing parallel corpora in text simplification and extract discourse information automatically using a state-of-the-art discourse parser." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-39", "text": "In addition, instead of focusing on only one relation, we generalize the problem to all PDTB relations." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-40", "text": "We also use external resources which are shown to have advantages over word alignment (Versley, 2010) in similar tasks." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-41", "text": "Lastly, the PDTB AltLexes only capture inter-sentence relations." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-42", "text": "Our contribution overcomes this limitation by identifying intra-sentence relations."
}, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-43", "text": "----------------------------------" }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-44", "text": "**IDENTIFICATION OF DISCOURSE CONNECTIVES**" }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-45", "text": "In order to automatically identify AltLexes, we used the notion of text simplification (Siddharthan, 2014; Kauchak, 2013) ." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-46", "text": "While two texts may convey the same meaning, they may exhibit different complexity levels (Pitler and Nenkova, 2008; Davoodi and Kosseim, 2016) due to different linguistic choices; at the lexical level (e.g. using frequent vs. abandoned words), the syntactic level (e.g. using active vs. passive voice) or even the discourse level (e.g. using an implicit vs. an explicit discourse relation)." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-47", "text": "The main assumption in text simplification is that it is possible to reduce a text's complexity while preserving its meaning as much as possible." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-48", "text": "Because discourse relations are semantic in nature, we can therefore assume that they are also preserved as much as possible during text simplification." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-49", "text": "However, the lexical realization of discourse relations (i.e. explicit versus nonexplicit) or the choice of a discourse connective (e.g. but vs. however) may change." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-50", "text": "The removal of a discourse relation may happen if the discourse argument is considered non-essential."
}, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-51", "text": "For example, Figure 2 shows a pair of aligned sentences where the complex version contains an explicit SYNCHRONY relation signalled by when; while the discourse argument, and consequently the explicit discourse connective, have been removed in the simple version." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-52", "text": "Hence, given a sentence and its simplified version, three phenomena can occur:" }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-53", "text": "1. a discourse connective is replaced by another (e.g. although \u21d2 though), 2. a discourse connective is replaced by another lexical choice (i.e. a word or phrase) which is not considered a discourse connective in the inventory used (e.g. although \u21d2 despite), or 3. a discourse connective is removed completely." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-54", "text": "In cases (1) and (2) above, the discourse relation is preserved, while in case (3) the discourse relation is either removed or changed to an implicit relation." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-55", "text": "The focus of this paper is to study case (2) and use such a phenomenon to automatically identify AltLexes." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-56", "text": "----------------------------------" }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-57", "text": "**DATA SETS**" }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-58", "text": "To discover AltLexes automatically, we created two sentence-aligned data sets using standard corpora in text simplification." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-59", "text": "The first data set was created from the Simple English Wikipedia corpus (Coster and Kauchak, 2011) ; the other was created from the Newsela corpus (Xu et al., 2015) ."
}, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-60", "text": "The Simple English Wikipedia (SEW) corpus (Coster and Kauchak, 2011) contains two sections: 1) article-aligned and 2) sentence-aligned." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-61", "text": "Here, we used the sentence-aligned section, which contains 167,686 pairs of aligned sentences." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-62", "text": "In order not to overfit to a specific corpus, we also used the Newsela (News) corpus (Xu et al., 2015) ." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-63", "text": "This corpus contains 1,911 English news articles which have been manually re-written at most 5 times, each time with decreasing complexity level." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-64", "text": "We aligned this article-aligned corpus at the sentence level using an approach similar to (Coster and Kauchak, 2011) ." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-65", "text": "Then, two native English speakers evaluated the alignments." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-66", "text": "The kappa inter-annotator agreement was 0.898, computed on 100 randomly chosen alignments." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-67", "text": "----------------------------------" }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-68", "text": "**METHODOLOGY**" }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-69", "text": "According to the PDTB framework, each Altlex can be substituted with at least one discourse connective (Prasad et al., 2008) ." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-70", "text": "Based on this, to discover AltLexes automatically, we first parsed both sides of the aligned sentences of both data sets to extract discourse information." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-71", "text": "This was done using the PDTB-style End-to-End parser (Lin et al., 2014) ."
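The agreement figure quoted above is a standard Cohen's kappa computation. As a hedged illustration (the actual annotation data is not available here, and the toy labels below are our own), kappa can be computed from two annotators' label sequences as:

```python
def cohens_kappa(labels_a, labels_b):
    # Cohen's kappa: observed agreement corrected for the agreement expected
    # by chance given each annotator's marginal label distribution.
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    p_e = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
              for c in categories)
    return (p_o - p_e) / (1 - p_e)

# Toy judgments ("y" = good alignment, "n" = bad) from two annotators.
ann1 = ["y", "y", "y", "n"]
ann2 = ["y", "y", "n", "n"]
print(cohens_kappa(ann1, ann2))  # -> 0.5
```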
}, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-72", "text": "This parser was selected as it is currently the best performing parser to identify explicit relations, with an F-measure between 80.61% and 86.77% depending on the evaluation criteria." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-73", "text": "Because it uses the PDTB framework, the parser uses the inventory of 100 discourse connectives from the PDTB." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-74", "text": "The result of this tagging was categorized into one of the following cases:" }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-75", "text": "1. NonExp-NonExp: a non-explicit 1 discourse relation occurs in both sentences." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-76", "text": "2. Exp-Exp: the same discourse relation and discourse connective occur in both sentences." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-77", "text": "3. NonExp-Exp: a non-explicit relation occurs in the complex sentence, but an explicit one is used in the simple sentence." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-78", "text": "4. Exp-NonExp: an explicit relation occurs in the complex sentence, but no relation is used in the simple sentence." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-79", "text": "Table 1 shows the frequency of these transformations in the two data sets." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-80", "text": "To discover AltLexes, we only considered cases (3) and (4), where only one side of the alignment includes a PDTB discourse connective." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-81", "text": "This gave rise to a total of 20,220 aligned sentences." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-82", "text": "We then used two external resources: 1) the paraphrase database (PPDB) (Ganitkevitch et al., 2013) and 2) WordNet (Miller, 1995) ." 
}, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-83", "text": "The PPDB comes in six sizes from S to XXXL." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-84", "text": "The smaller versions of the PPDB contain more precise paraphrases with higher confidence scores; while the larger versions have more coverage." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-85", "text": "We chose the PPDB version L as a good compromise between the precision and coverage of paraphrases." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-86", "text": "We took the discourse connective from the explicit side and looked for an alternative lexicalization (a synonym or paraphrase) in the external resources." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-87", "text": "If any of its alternative lexicalizations appeared on the non-explicit side, we considered it as an AltLex to signal the relation." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-88", "text": "We then replaced the AltLex with the discourse connective from the explicit side and parsed the new sentence with the PDTB-style End-to-End parser again." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-89", "text": "This process is shown in Figure 5 ." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-90", "text": "On average, each discourse connective was replaced by 23.2 alternative lexicalizations taken from the PPDB and 12.3 from WordNet." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-91", "text": "If the parser detected the same relation (see Figure 5) , then the potential marker was considered as an AltLex."
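The replace-and-reparse step described above can be sketched as follows. This is an illustrative sketch only: the paraphrase table and `toy_parser` are hypothetical stand-ins for the PPDB/WordNet lookups and the PDTB-style End-to-End parser, and all names are our own.

```python
# Hypothetical paraphrase table standing in for PPDB/WordNet lookups.
PARAPHRASES = {"but": ["whilst", "despite"], "before": ["used to"]}

def candidate_altlexes(nonexplicit_sent, connective):
    # Alternative lexicalizations of the connective that actually occur
    # on the non-explicit side of the alignment.
    return [alt for alt in PARAPHRASES.get(connective, [])
            if alt in nonexplicit_sent]

def confirm_altlex(nonexplicit_sent, altlex, connective, relation, parser):
    # Replace the candidate AltLex with the known discourse connective and
    # re-parse; keep the candidate only if the same relation is detected.
    return parser(nonexplicit_sent.replace(altlex, connective)) == relation

def toy_parser(sentence):
    # Toy parser: detects CONTRAST whenever the token "but" appears.
    return "CONTRAST" if "but" in sentence.split() else None

sent = "these he made himself, whilst larger works were commissioned"
cands = candidate_altlexes(sent, "but")
confirmed = [a for a in cands
             if confirm_altlex(sent, a, "but", "CONTRAST", toy_parser)]
print(confirmed)  # -> ['whilst']
```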
}, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-92", "text": "On the other hand, because the End-to-End discourse parser uses both the discourse connective and syntactic features, if it was not capable of detecting the relation in the replaced sentences, we concluded that either (1) the relation existed, but the parser could not detect it, (2) the AltLex does not signal the discourse relation or (3) the relation does not exist (see Figure 4) ." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-93", "text": "Because we did not use any syntactic filter, the replacement of the discourse connective may alter the syntax of the sentence such that the parser is unable to detect the relation." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-94", "text": "This is why, regardless of the reason, if the parser was not able to detect the relation, we discarded the AltLex." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-95", "text": "Table 2 shows the number of sentence alignments mined and the number of potential AltLexes (i.e. type count) identified in each data set for each level 2 PDTB relation." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-96", "text": "Overall, by mining 17,358 NonExp-Exp and Exp-NonExp alignments, the SEW-based data set allowed the discovery of 79 AltLexes from the PPDB and 8 from WordNet; whereas, the News-based data set, providing only 2,862 alignments, allowed the discovery of 28 AltLexes from PPDB and 11 from WordNet." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-97", "text": "Using both corpora and both lexical resources, the method found 91 AltLex types, which account for 533 tokens." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-98", "text": "It is interesting to note that, overall, the approach did not find any alternate lexicalizations for some relations such as LIST or EXCEPTION and only one for CONDITION."
}, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-99", "text": "It is not clear if this is because these relations are typically signalled using a rather fixed inventory of discourse markers or because of the low number of such alignments." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-100", "text": "Indeed, in the PDTB, out of On the other hand, relations such as CONJUNCTION, ASYNCHRONOUS and CAUSE provided a large number of alignments from which we identified a variety of AltLexes." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-101", "text": "For example, the PPDB identified \"caused by\", \"resulting\", \"causing\", \"this being so\", etc." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-102", "text": "as AltLexes to signal a CAUSE relation." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-103", "text": "The full inventory of the automatically identified AltLexes can be found at: http://Anynomous." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-104", "text": "----------------------------------" }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-105", "text": "**RESULTS AND DISCUSSION**" }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-106", "text": "In addition, as can be seen in Table 2 , the number of potential AltLexes coming from the PPDB is higher than the number of AltLexes coming from WordNet." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-107", "text": "This may be due to the difference of the coverage of these two resources as WordNet is smaller than the PPDB." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-108", "text": "Another possible reason is that each word in PPDB has a list of paraphrases with various syntactic classes." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-109", "text": "Thus, if the syntactic class of a discourse marker is changed in the simplification process, it is more probable that the PPDB covers more syntactic variations of the discourse marker compared to WordNet."
}, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-110", "text": "Figure 3 shows an example taken from the Newsela corpus." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-111", "text": "In this example, the discourse marker before in the complex version signals an ASYNCHRONOUS relation, but is tagged as a subordinating conjunction." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-112", "text": "In the paraphrase database, used to is one of the paraphrases of the discourse marker before." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-113", "text": "In the simple version of this example, the verb used to signals the same relation (i.e. an ASYNCHRONOUS relation), which is captured as an AltLex." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-114", "text": "----------------------------------" }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-115", "text": "**FUTURE WORK**" }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-116", "text": "As future research, we plan to assign a confidence level to the automatically identified AltLexes by using a syntactic filter to replace potential markers only if they lead to syntactically correct sentences and by using their frequency of appearance in the parallel corpora." }, { "sent_id": "119d473a0a5a4c42de193e51564f1f-C001-117", "text": "Another interesting line of research would be to evaluate if discourse parsers can increase their performance using such a new list."
} ], "y": { "@BACK@": { "gold_contexts": [ [ "119d473a0a5a4c42de193e51564f1f-C001-32" ] ], "cite_sentences": [ "119d473a0a5a4c42de193e51564f1f-C001-32" ] }, "@EXT@": { "gold_contexts": [ [ "119d473a0a5a4c42de193e51564f1f-C001-59" ], [ "119d473a0a5a4c42de193e51564f1f-C001-60", "119d473a0a5a4c42de193e51564f1f-C001-61" ] ], "cite_sentences": [ "119d473a0a5a4c42de193e51564f1f-C001-59", "119d473a0a5a4c42de193e51564f1f-C001-60" ] }, "@USE@": { "gold_contexts": [ [ "119d473a0a5a4c42de193e51564f1f-C001-60", "119d473a0a5a4c42de193e51564f1f-C001-61" ] ], "cite_sentences": [ "119d473a0a5a4c42de193e51564f1f-C001-60" ] }, "@SIM@": { "gold_contexts": [ [ "119d473a0a5a4c42de193e51564f1f-C001-64" ] ], "cite_sentences": [ "119d473a0a5a4c42de193e51564f1f-C001-64" ] } } }, "ABC_76fd2709a325366be6154d2a84b29b_40": { "x": [ { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-72", "text": "Consider the analogy example, traffic:street :: water:riverbed." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-96", "text": "----------------------------------" }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-97", "text": "**RELATIONAL SIMILARITY EXPERIMENT**" }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-2", "text": "Abstract." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-3", "text": "In this work, we study the problem of relational similarity by combining different word embeddings learned from different types of contexts." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-4", "text": "The word2vec model with linear bag-ofwords contexts can capture more topical and less functional similarity, while the dependency-based word embeddings with syntactic contexts can capture more functional and less topical similarity." 
}, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-5", "text": "We explore topical space and functional space simultaneously by considering these two word embeddings and different metrics." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-6", "text": "We evaluate our model on relational similarity framework, and report state-of-the-art performance on standard test collections." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-7", "text": "----------------------------------" }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-9", "text": "Measuring relational similarity between two word pairs plays important roles in natural language processing (NLP)." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-10", "text": "The techniques for solving this problem can be applied to a variety of NLP tasks, such as query expansion, word sense disambiguation, machine translation, information extraction and question answering." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-11", "text": "Previous work addressing the problem can be roughly classified into three categories: (1) learning word embeddings from large collections of text using variants of neural networks (Mikolov et al. (2013a) ; Mikolov et al. (2013b) ; Mikolov et al. (2013c) ; Levy and Goldberg (2014) ) or global matrix factorization (Deerwester et al. (1990) ; Turney (2012) ); (2) extracting knowledge from existing semantic networks, such as WordNet (Yang and Powers (2005) ; Alvarez and Lim (2007) ; Hughes and Ramage (2007) ) and ConceptNet (Boteanu and Chernova (2015) ); (3) combining the above two models by various ways (Agirre et al. (2009) ; Zhila et al. (2013) ; Iacobacci et al. (2015) ; Summers-Stay et al. (2016) )." 
}, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-12", "text": "The empirical evidence shows that the word representations learned from neural network models do an especially good job in capturing not only attributional similarities between words but also similarities between pairs of words (Mikolov et al. (2013c) )." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-13", "text": "Levy and Goldberg (2014) generalize the skip-gram model with negative sampling to include arbitrary word contexts and present the dependency-based word embeddings, which are learned from syntactic contexts derived from dependency parse-trees." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-14", "text": "Qualitative and quantitative analysis demonstrates that the word2vec model with linear bag-of-words contexts can yield broad topical similarity while the dependency-based word embeddings with syntactic contexts can capture more functional similarity." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-15", "text": "Turney (2012) is the first, to the best of our knowledge, to raise the word vector representations in a dual space and unify semantic relations and compositions by the dualspace model." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-16", "text": "The dual-space model consists of a domain space and a function space, where the domain or topic of a word is characterized by the nouns that occur near it and the function or role of a word is characterized by the syntactic context that relates it to the verbs that occur near it." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-17", "text": "We detail our main contributions as follows." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-18", "text": "(1) In this paper, we use the word2vec model with linear bag-of-words contexts to capture the domain of a word, and the dependency-based word embeddings with syntactic contexts to characterize the function of a word." 
}, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-19", "text": "The broad contexts used in our model can provide richer information for measuring domain similarity (i.e., topic, subject, or field similarity) and function similarity (i.e, role, relationship, or usage similarity) than noun or verb-based patterns for contexts in Turney's (2012) model." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-20", "text": "(2) The two existing models for measuring relational similarity are: the directional similarity model (Zhila et al. (2013) ) and the dual-space model consisting of domain and function space (Turney (2012) )." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-21", "text": "Both models suffer some drawbacks." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-22", "text": "The directional similarity model explores the difference of two relationships in multiple topicality dimensions in the vector space." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-23", "text": "However, it ignores the spatial distances between word vectors, which can reveal the function similarity of words in function space." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-24", "text": "The dual-space model can measure the domain similarity and function similarity between words." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-25", "text": "However, it only computes the domain similarity between two single words and places less emphasis on the domain similarity between two relations." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-26", "text": "In this work, we propose a new dual-space model for measuring relational similarity, which combines the advantages of the two existing models." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-27", "text": "(3) We evaluate our model on relational similarity framework and report state-of-the-art performance on SAT analogy questions." 
}, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-28", "text": "----------------------------------" }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-29", "text": "**RELATED WORK**" }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-30", "text": "Vector space models have a long, rich history in the field of natural language processing, where each word is represented as a real-valued vector in a continuous vector space." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-31", "text": "All these models depend in some way or another on the distributional hypothesis which states that words that occur in similar contexts tend to have similar meanings (Harris, 1954; Firth, 1957) ." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-32", "text": "There are the two main model families for learning word vectors: (1) global matrix factorization methods, such as latent semantic analysis, which generates embeddings from term-document matrices by singular value decomposition (Deerwester et al. (1990) ); (2) neural network models, such as the skip-gram and continuous bag of words models of () Mikolov et al. (2013a) ; Mikolov et al. (2013b) ; Mikolov et al. (2013c) ), referred to as word2vec, which learn embeddings by training a network to predict neighboring words." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-33", "text": "Levy and Goldberg (2014) generalize the skip-gram model with negative sampling to include arbitrary word contexts, learn embeddings from syntactic contexts and demonstrate that the dependency-based embeddings can capture more functional similarity than the original skip-gram embeddings (word2vec)." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-34", "text": "Several algorithms have been proposed for solving SAT-style analogy questions." 
}, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-35", "text": "A good algorithm for recognizing analogies can also help to solve the classification problem of semantic relations, which has potential applications in machine translation, information extraction and word sense disambiguation." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-36", "text": "Quesada et al. (2004) propose LSA as a method to represent the relations between words and use a prediction to represent relation comparisons in the LSA semantic space." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-37", "text": "Veale (2004) considers the utility of the taxonomic structure of WordNet to the solution of SAT analogies." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-38", "text": "Turney and Littman (2005) show that the cosine metric in the Vector Space Model of information retrieval can be used to solve analogy questions and to classify semantic relations." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-39", "text": "Turney (2012) introduces a dual-space model, which consists of a space for measuring domain similarity and a space for measuring function similarity." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-40", "text": "The dual-space model has been applied to measuring relation similarity and compositional similarity." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-41", "text": "----------------------------------" }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-42", "text": "**WORD EMBEDDINGS WITH DIFFERENT CONTEXT TYPES**" }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-43", "text": "Word2vec (W2V) is one of the most popular word embedding methods that learn word vectors from raw text." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-44", "text": "It has two models for generating dense embeddings: the skip-gram model and the continuous bagof-words (CBOW) model." 
}, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-45", "text": "The training objective of the skip-gram model is to predict each surrounding word in a context window of 2k words from the target word." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-46", "text": "So for k =2, the contexts of the target word w t are w t\u22122 , w t\u22121 , w t+1, w t+2 and we are predicting each of these from the word w t ." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-47", "text": "However, a context window with a smaller size k may miss some important contexts while including some accidental ones." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-48", "text": "Recently, Levy et al. propose the dependency-based word embeddings (DEP), which generalize the skip-gram model with negative sampling, and move from linear bag-of-words contexts to syntactic contexts that are derived from automatically produced dependency parse-trees." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-49", "text": "Embeddings produced from different kinds of contexts can induce different word similarities." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-50", "text": "The original skip-gram embeddings can yield broad topical similarities, while the dependency-based word embeddings can capture more functional similarities (Levy et al., 2014) ." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-51", "text": "----------------------------------" }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-52", "text": "**RELATIONAL SIMILARITY**" }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-53", "text": "Relational similarity measures the degree of correspondence between two relations (Jurgens et al. (2012) )." 
}, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-54", "text": "The task can be modeled as an analogy problem, where given two pairs of words, A:B and C:D, the goal is to determine the degree to which the semantic relation between A and B is similar to that between C and D. We first introduce two types of existing models for relational similarity." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-55", "text": "Zhila et al. (2013) propose a directional similarity model to evaluate the correspondence between relations." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-56", "text": "Given two pairs of words A:B and C:D, suppose (v A, v B ) and (v C, v D ) are the corresponding vectors of these words." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-57", "text": "Relational similarity of these two word pairs is defined as the cosine function between the two directional vectors of (v A \u2212 v B ) and (v C \u2212 v D ):" }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-58", "text": "----------------------------------" }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-59", "text": "**DIRECTIONAL SIMILARITY MODEL**" }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-60", "text": "In this model, the relationship between two words is represented by the difference of corresponding two word vectors, which reveals the change from one word to the other in terms of multiple topicality dimensions in the vector space." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-61", "text": "From the geometric point of view, if two relationship vectors are relatively parallel (i.e., share the same direction), then the two word pairs can be considered as they have similar relations." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-62", "text": "They apply the model to the problem of relation classification which determines whether two pairs share the same relation." 
}, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-63", "text": "The goal here is to design a model for answering the more difficult SAT analogy questions." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-64", "text": "However, sometimes this directional similarity model can not measure the analogy problem correctly." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-65", "text": "----------------------------------" }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-66", "text": "**A DUAL-SPACE MODEL**" }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-67", "text": "Why does the directional similarity model fail on some analogy questions?" }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-68", "text": "One possible reason is that the model depends on direction alone and ignores spatial distance between word vectors." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-69", "text": "Turney (2012) presents a dual-space model to unify semantic relations and compositions." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-70", "text": "The dual-space model consists of a domain space for measuring domain similarity and a function space for measuring function similarity." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-71", "text": "He gives a good example to explain this model." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-73", "text": "Traffic and street share the same domain, the domain of transportation." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-74", "text": "Water and riverbed share the same domain, the domain of hydrology." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-75", "text": "On the other hand, traffic and water share the same function, the function of flow, and street and riverbed share the same function, the function of carry." 
}, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-76", "text": "The model can recognize that the semantic relation between traffic and street is analogous to the relation between water and riverbed by considering the combination of domain and function similarity." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-77", "text": "----------------------------------" }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-78", "text": "**A NEW DUAL-SPACE MODEL: REFINING THE DIRECTIONAL SIMILARITY MODEL AND THE DUALSPACE MODEL**" }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-79", "text": "The directional similarity model (Zhila et al. (2013) ) explores the difference of two relationships in multiple topicality dimensions in the vector space." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-80", "text": "However, it ignores the spatial distance between word vectors, which can reveal the function similarity of words in function space." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-81", "text": "The dual-space model (Turney (2012) ) can measure both domain similarity and function similarity between words." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-82", "text": "However, it only computes the domain similarity between two single word vectors and places less emphasis on the domain similarity between two relations." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-83", "text": "Moreover, Turney's (2012) model uses only noun or verbbased patterns for contexts to model domain or function space." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-84", "text": "In this paper, we propose a novel dualspace model, which combines the advantages of the above two existing models." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-85", "text": "We use the word2vec-based directional similarity model to measure the similarity of two relations in domain space, and the dependency-based word embeddings to represent the word vector in function space." 
}, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-86", "text": "The two embeddings used in our model consider broader contexts than those in the original dual-space model, which can provide richer information for measuring domain similarity and function similarity." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-87", "text": "A mathematical description of our dual-space model is given as follows: Given two pairs of words A:B and C:D, suppose D) are the vectors of these words in domain space and V DEP (A) , V DEP (B) , V DEP (C) , V DEP (D) are the vectors of these words in function space." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-88", "text": "Based on the directional similarity model (Zhila et al., 2013) , we define the domain similarity of two pairs of words as follows:" }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-89", "text": "The function similarity of A and C and the function similarity of B and D are defined respectively as follows:" }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-90", "text": "We design the method Similarity ADD to combine the above similarities for relational similarity as follows, which satisfies that the combined similarity is high when the component similarities are high:" }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-91", "text": "We give an SAT question in Table 1 to demonstrate how our compositional similarity works." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-92", "text": "An SAT analogy question consists of a target pair of words and five option pairs of words." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-93", "text": "The task is to select the option pair that \"best expresses a relationship similar to that expressed in the original pair\", as stated in the test's directions." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-94", "text": "We use the 300-dimensional W2V and DEP vectors available for downloading 1 ." 
}, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-95", "text": "For the target pair bruise:skin in Table 1 , our method Similarity ADD (= 0.406) can recognize the correct answer stain:fabric although Similarity D (= 0.189), Similarity F 1 (= 0.539) and Similarity F 2 (= 0.491) for the correct answer are all ranked second among five options." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-98", "text": "In the following experiments, we evaluate our approaches to solving analogies by a set of 374 SAT analogy questions, which is the same set of questions as was used in Turney's Dual-Space mode (Turney (2012) )." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-99", "text": "Precision and Recall are two standard performance measurements used for evaluation." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-100", "text": "The definitions of precision and recall are specified by (Turney and Littman (2005) )." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-101", "text": "In all experiments, we use the 300-dimensional W2V and DEP vectors pretrained on a concatenation of three large, diverse English corpora, and those vectors are available for downloading 1 ." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-102", "text": "Table 2 shows the experimental results of four approaches presented in Section 4.3 on the set of 374 analogy questions." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-103", "text": "Two questions are skipped because the vector for the target pair is not available in the collection." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-104", "text": "Since there are five options for each target pair of an SAT analogy question, random guessing would yield a recall of 20%." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-105", "text": "Domain similarity Similarity D , function similarity Similarity F 1 and Similarity F 2 all perform much better than random guessing." 
}, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-106", "text": "Our compositional similarity model Similarity ADD in the dual space clearly outperforms Similarity D , Similarity F 1 and Similarity F 2 in single domain or function space." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-107", "text": "In this work, we explore domain space and function space simultaneously by considering two kinds of word embeddings and different metrics." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-108", "text": "Word embeddings can capture topical and functional information of a word by using different types of contexts, however they are unable to model the words with multiple meanings accurately because a word is represented as just a single vector which carries a weighted average of different meanings." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-109", "text": "Existing lexical and knowledge databases, such as WordNet, ConceptNet and Cyc, can be modeled as graphs in which words are represented as the nodes and the relations between words are signified by the edges." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-110", "text": "These databases have more accurate information although their coverage of words and types of relations is usually limited." }, { "sent_id": "76fd2709a325366be6154d2a84b29b-C001-111", "text": "We plan to look at ways to combine our dual-space model with existing databases to improve the performance of our current system." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "76fd2709a325366be6154d2a84b29b-C001-11" ], [ "76fd2709a325366be6154d2a84b29b-C001-50" ] ], "cite_sentences": [ "76fd2709a325366be6154d2a84b29b-C001-11", "76fd2709a325366be6154d2a84b29b-C001-50" ] } } }, "ABC_5503b8571748ea900340aead22743b_40": { "x": [ { "sent_id": "5503b8571748ea900340aead22743b-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "5503b8571748ea900340aead22743b-C001-2", "text": "Previous work on dependency parsing used various kinds of combination models but a systematic analysis and comparison of these approaches is lacking." }, { "sent_id": "5503b8571748ea900340aead22743b-C001-3", "text": "In this paper we implemented such a study for English dependency parsing and find several non-obvious facts: (a) the diversity of base parsers is more important than complex models for learning (e.g., stacking, supervised meta-classification), (b) approximate, linear-time re-parsing algorithms guarantee well-formed dependency trees without significant performance loss, and (c) the simplest scoring model for re-parsing (unweighted voting) performs essentially as well as other more complex models." }, { "sent_id": "5503b8571748ea900340aead22743b-C001-4", "text": "This study proves that fast and accurate ensemble parsers can be built with minimal effort." }, { "sent_id": "5503b8571748ea900340aead22743b-C001-5", "text": "----------------------------------" }, { "sent_id": "5503b8571748ea900340aead22743b-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "5503b8571748ea900340aead22743b-C001-7", "text": "Several ensemble models have been proposed for the parsing of syntactic dependencies." 
}, { "sent_id": "5503b8571748ea900340aead22743b-C001-8", "text": "These approaches can generally be classified in two categories: models that integrate base parsers at learning time, e.g., using stacking (Nivre and McDonald, 2008; Attardi and Dell'Orletta, 2009) , and approaches that combine independently-trained models only at parsing time (Sagae and Lavie, 2006; Hall et al., 2007; Attardi and Dell'Orletta, 2009 )." }, { "sent_id": "5503b8571748ea900340aead22743b-C001-9", "text": "In the latter case, the correctness of the final dependency tree is ensured by: (a) selecting entire trees proposed by the base parsers (Henderson and Brill, 1999) ; or (b) re-parsing the pool of dependencies proposed by the base models (Sagae and Lavie, 2006) ." }, { "sent_id": "5503b8571748ea900340aead22743b-C001-10", "text": "The latter approach was shown to perform better for constituent parsing (Henderson and Brill, 1999) ." }, { "sent_id": "5503b8571748ea900340aead22743b-C001-11", "text": "While all these models achieved good performance, the previous work has left several questions Table 2 : Scores of unsupervised combination models using different voting strategies." }, { "sent_id": "5503b8571748ea900340aead22743b-C001-12", "text": "The combined trees are assembled using a word-by-word voting scheme." }, { "sent_id": "5503b8571748ea900340aead22743b-C001-13", "text": "parser variants are built by varying the parsing algorithm (we used three parsing models: Nivre's arceager (AE), Nivre's arc-standard (AS), and Covington's non-projective model (CN)), and the parsing direction (left to right (\u2192) or right to left (\u2190)), similar to (Hall et al., 2007) ." }, { "sent_id": "5503b8571748ea900340aead22743b-C001-14", "text": "The parameters of the Malt models were set to the values reported in (Hall et al., 2007) ." }, { "sent_id": "5503b8571748ea900340aead22743b-C001-15", "text": "The MST parser was used with the default configuration." 
}, { "sent_id": "5503b8571748ea900340aead22743b-C001-16", "text": "Table 1 shows the performance of these models in the development and test partitions." }, { "sent_id": "5503b8571748ea900340aead22743b-C001-17", "text": "----------------------------------" }, { "sent_id": "5503b8571748ea900340aead22743b-C001-18", "text": "**EXPERIMENTS**" }, { "sent_id": "5503b8571748ea900340aead22743b-C001-19", "text": "----------------------------------" }, { "sent_id": "5503b8571748ea900340aead22743b-C001-20", "text": "**ON SCORING MODELS FOR PARSER COMBINATION**" }, { "sent_id": "5503b8571748ea900340aead22743b-C001-21", "text": "The most common approach for combining independently-trained models at parsing time is to assign each candidate dependency a score based on the number of votes it received from the base parsers." }, { "sent_id": "5503b8571748ea900340aead22743b-C001-22", "text": "Considering that parsers specialize in different phenomena, these votes can be weighted by different criteria." }, { "sent_id": "5503b8571748ea900340aead22743b-C001-23", "text": "To understand the importance of such weighting strategies we compare several voting approaches in Table 2 : in the \"unweighted\" strategy all votes have the same weight; in all other strategies each vote is assigned a value equal to the accuracy of the given parser in the particular instance of the context considered, e.g., in the \"weighted by POS of modifier\" model we use the accuracies of the base models for each possible part-of-speech (POS) tag of a modifier token." }, { "sent_id": "5503b8571748ea900340aead22743b-C001-24", "text": "In the table we show results as more base parsers are added to the ensemble (we add parsers in the order given by Table 1 )." }, { "sent_id": "5503b8571748ea900340aead22743b-C001-25", "text": "The results in Table 2 hand, the number of base parsers in the ensemble pool is crucial: performance generally continues to improve as more base parsers are considered." 
}, { "sent_id": "5503b8571748ea900340aead22743b-C001-26", "text": "The best ensemble uses 6 out of the 7 base parsers." }, { "sent_id": "5503b8571748ea900340aead22743b-C001-27", "text": "3 It is often argued that the best way to re-score candidate dependencies is not through voting but rather through a meta-classifier that selects candidate dependencies based on their likelihood of belonging to the correct tree." }, { "sent_id": "5503b8571748ea900340aead22743b-C001-28", "text": "Unlike voting, a metaclassifier can combine evidence from multiple contexts (such as the ones listed in Table 2 )." }, { "sent_id": "5503b8571748ea900340aead22743b-C001-29", "text": "However, in our experiments such a meta-classifier 4 did not offer any gains over the much simpler unweighted voting strategy." }, { "sent_id": "5503b8571748ea900340aead22743b-C001-30", "text": "We explain these results as follows: the meta-classifier can potentially help only when it proposes dependencies that disagree with the majority vote." }, { "sent_id": "5503b8571748ea900340aead22743b-C001-31", "text": "We call such dependencies minority dependencies." }, { "sent_id": "5503b8571748ea900340aead22743b-C001-32", "text": "5 For a given parser and context instance (e.g., a modifier POS), we define precision of minority dependencies as the ratio of minority dependencies in this group that are correct." }, { "sent_id": "5503b8571748ea900340aead22743b-C001-33", "text": "Obviously, a group of minority dependencies provides beneficial signal only if its precision is larger than 50%." }, { "sent_id": "5503b8571748ea900340aead22743b-C001-34", "text": "Table 3 lists the total number of minority dependencies in groups with precision larger than 50% for all our base parsers and the most representative features." }, { "sent_id": "5503b8571748ea900340aead22743b-C001-35", "text": "The table shows that the number of minority dependencies with useful signal is extremely low." 
}, { "sent_id": "5503b8571748ea900340aead22743b-C001-36", "text": "All in all, it accounts for less than 0.7% of all dependencies in the development corpus." }, { "sent_id": "5503b8571748ea900340aead22743b-C001-37", "text": "----------------------------------" }, { "sent_id": "5503b8571748ea900340aead22743b-C001-38", "text": "**ON RE-PARSING ALGORITHMS**" }, { "sent_id": "5503b8571748ea900340aead22743b-C001-39", "text": "To guarantee that the resulting dependency tree is well-formed, most previous work used the dynamic programming algorithm of Eisner (1996) for reparsing (Sagae and Lavie, 2006; Hall et al., 2007) ." }, { "sent_id": "5503b8571748ea900340aead22743b-C001-40", "text": "6 However, it is not clear that this step is necessary." }, { "sent_id": "5503b8571748ea900340aead22743b-C001-41", "text": "In other words, how many sentences are not wellformed if one uses a simple word-by-word voting scheme?" }, { "sent_id": "5503b8571748ea900340aead22743b-C001-42", "text": "To answer this, we analyzed the output of our best word-by-word voting scheme (six base parsers weighted by the POS of the modifier)." }, { "sent_id": "5503b8571748ea900340aead22743b-C001-43", "text": "The results for both in-domain and out-of-domain testing corpora are listed in Table 4 ." }, { "sent_id": "5503b8571748ea900340aead22743b-C001-44", "text": "The table shows that the percentage of badly-formed trees is relatively large: almost 10% out of domain." }, { "sent_id": "5503b8571748ea900340aead22743b-C001-45", "text": "This indicates that the focus on algorithms that guarantee well-formed trees is justified." }, { "sent_id": "5503b8571748ea900340aead22743b-C001-46", "text": "However, it is not clear how the Eisner algorithm, which has runtime complexity of O(n 3 ) (n -number of tokens per sentence), compares against approximate re-parsing algorithms that have lower runtime complexity." 
}, { "sent_id": "5503b8571748ea900340aead22743b-C001-47", "text": "One such algorithm was proposed by Attardi and Dell'Orletta (2009) ." }, { "sent_id": "5503b8571748ea900340aead22743b-C001-48", "text": "The algorithm, which has a runtime complexity of O(n), builds dependency trees using a greedy top-down strategy, i.e., it starts by selecting the highest-scoring root node, then the highest-scoring children, etc." }, { "sent_id": "5503b8571748ea900340aead22743b-C001-49", "text": "We compare these algorithms against the word-by-word voting scheme in Table 5 ." }, { "sent_id": "5503b8571748ea900340aead22743b-C001-50", "text": "7 The results show that both algorithms pay a small penalty for guaranteeing well-formed trees." }, { "sent_id": "5503b8571748ea900340aead22743b-C001-51", "text": "This performance drop is statistically significant out of domain." }, { "sent_id": "5503b8571748ea900340aead22743b-C001-52", "text": "On the other hand, the difference between the Eisner and Attardi algorithms is not statistically significant out of domain." }, { "sent_id": "5503b8571748ea900340aead22743b-C001-53", "text": "6 We focus on projective parsing algorithms because 99.6% of dependencies in our data are projective (Surdeanu et al., 2008) ." }, { "sent_id": "5503b8571748ea900340aead22743b-C001-54", "text": "7 Statistical significance was performed using Dan Bikel randomized parsing evaluation comparator at 95% confidence." }, { "sent_id": "5503b8571748ea900340aead22743b-C001-55", "text": "This experiment proves that approximate re-parsing algorithms are a better choice for practical purposes, i.e., ensemble parsing in domains different from the training material of the base models." 
}, { "sent_id": "5503b8571748ea900340aead22743b-C001-56", "text": "----------------------------------" }, { "sent_id": "5503b8571748ea900340aead22743b-C001-57", "text": "**ON PARSER INTEGRATION AT LEARNING TIME**" }, { "sent_id": "5503b8571748ea900340aead22743b-C001-58", "text": "Recent work has shown that the combination of base parsers at learning time, e.g., through stacking, yields considerable benefits (Nivre and McDonald, 2008; Attardi and Dell'Orletta, 2009 )." }, { "sent_id": "5503b8571748ea900340aead22743b-C001-59", "text": "However, it is unclear how these approaches compare against the simpler ensemble models, which combine parsers only at runtime." }, { "sent_id": "5503b8571748ea900340aead22743b-C001-60", "text": "To enable such a comparison, we reimplemented the best stacking model from (Nivre and McDonald, 2008 ) -MST M alt -which trains a variant of the MSTParser that uses additional features extracted from the output of a Malt parser." }, { "sent_id": "5503b8571748ea900340aead22743b-C001-61", "text": "In Table 6 , we compare this stacking approach against four variants of our ensemble models." }, { "sent_id": "5503b8571748ea900340aead22743b-C001-62", "text": "The superscript in the ensemble name indicates the runtime complexity of the model (O(n 3 ) or O(n))." }, { "sent_id": "5503b8571748ea900340aead22743b-C001-63", "text": "The cubic-time models use all base parsers from Table 1 and the Eisner algorithm for re-parsing." }, { "sent_id": "5503b8571748ea900340aead22743b-C001-64", "text": "The lineartime models use only Malt-based parsers and the Attardi algorithm for re-parsing." }, { "sent_id": "5503b8571748ea900340aead22743b-C001-65", "text": "The subscript in the model names indicates the percentage of available base parsers used, e.g., ensemble 3 50% uses only the first three parsers from Table 1 ." 
}, { "sent_id": "5503b8571748ea900340aead22743b-C001-66", "text": "These results show that MST M alt is statistically equivalent to an ensemble that uses MST and two Malt variants, and both our ensemble 100% models are significantly better than MST M alt ." }, { "sent_id": "5503b8571748ea900340aead22743b-C001-67", "text": "While this comparison is somewhat unfair (MST M alt uses two base models, whereas our ensemble models use at least three) it does illustrate that the advantages gained from combining parsers at learning time can be easily surpassed by runtime combination models that have access to more base parsers." }, { "sent_id": "5503b8571748ea900340aead22743b-C001-68", "text": "Considering that variants of shift-reduce parsers can be generated with minimal effort (e.g., by varying the parsing direction, learning algorithms, etc.) and combining models at runtime is simpler than combining them at learning time, we argue that runtime parser combination is a more attractive approach." }, { "sent_id": "5503b8571748ea900340aead22743b-C001-69", "text": "----------------------------------" }, { "sent_id": "5503b8571748ea900340aead22743b-C001-70", "text": "**COMPARISON WITH THE STATE OF THE ART**" }, { "sent_id": "5503b8571748ea900340aead22743b-C001-71", "text": "In Table 7 we compare our best ensemble models against the top two systems of the CoNLL-2008 shared task evaluation." }, { "sent_id": "5503b8571748ea900340aead22743b-C001-72", "text": "The table indicates that our best ensemble model ranks second, outperforming significantly 19 other systems." }, { "sent_id": "5503b8571748ea900340aead22743b-C001-73", "text": "The only model performing better than our ensemble is a parser that uses higher-order features and has a higher runtime complexity (O(n 4 )) (Johansson and Nugues, 2008) ." 
}, { "sent_id": "5503b8571748ea900340aead22743b-C001-74", "text": "While this is certainly proof of the importance of higher-order features, it also highlights a pragmatic conclusion: in out-of-domain corpora, an ensemble of models that use only first-order features achieves performance that is within 1% LAS of much more complex models." }, { "sent_id": "5503b8571748ea900340aead22743b-C001-75", "text": "----------------------------------" }, { "sent_id": "5503b8571748ea900340aead22743b-C001-76", "text": "**CONCLUSIONS**" }, { "sent_id": "5503b8571748ea900340aead22743b-C001-77", "text": "This study unearthed several non-intuitive yet important observations about ensemble models for dependency parsing." }, { "sent_id": "5503b8571748ea900340aead22743b-C001-78", "text": "First, we showed that the diversity of base parsers is more important than complex learning models for parser combination, i.e., (a) ensemble models that combine several base parsers at runtime performs significantly better than a state-ofthe-art model that combines two parsers at learning time, and (b) meta-classification does not outperform unsupervised voting schemes for the re-parsing of candidate dependencies when six base models are available." }, { "sent_id": "5503b8571748ea900340aead22743b-C001-79", "text": "Second, we showed that well-formed dependency trees can be guaranteed without significant performance loss by linear-time approximate re-parsing algorithms. And lastly, our analysis indicates that unweighted voting performs as well as weighted voting for the re-parsing of candidate dependencies." }, { "sent_id": "5503b8571748ea900340aead22743b-C001-80", "text": "Considering that different base models are easy to generate, this work proves that ensemble parsers that are both accurate and fast can be rapidly developed with minimal effort." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "5503b8571748ea900340aead22743b-C001-8" ], [ "5503b8571748ea900340aead22743b-C001-47" ], [ "5503b8571748ea900340aead22743b-C001-58" ] ], "cite_sentences": [ "5503b8571748ea900340aead22743b-C001-8", "5503b8571748ea900340aead22743b-C001-47", "5503b8571748ea900340aead22743b-C001-58" ] } } }, "ABC_5f5a59f8fbf999b9eecfe7c1897b2c_40": { "x": [ { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-2", "text": "Previous work on dependency parsing used various kinds of combination models but a systematic analysis and comparison of these approaches is lacking." }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-3", "text": "In this paper we implemented such a study for English dependency parsing and find several non-obvious facts: (a) the diversity of base parsers is more important than complex models for learning (e.g., stacking, supervised meta-classification), (b) approximate, linear-time re-parsing algorithms guarantee well-formed dependency trees without significant performance loss, and (c) the simplest scoring model for re-parsing (unweighted voting) performs essentially as well as other more complex models." }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-4", "text": "This study proves that fast and accurate ensemble parsers can be built with minimal effort." }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-5", "text": "----------------------------------" }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-7", "text": "Several ensemble models have been proposed for the parsing of syntactic dependencies." 
}, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-8", "text": "These approaches can generally be classified in two categories: models that integrate base parsers at learning time, e.g., using stacking (Nivre and McDonald, 2008; Attardi and Dell'Orletta, 2009) , and approaches that combine independently-trained models only at parsing time (Sagae and Lavie, 2006; Hall et al., 2007; Attardi and Dell'Orletta, 2009 )." }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-9", "text": "In the latter case, the correctness of the final dependency tree is ensured by: (a) selecting entire trees proposed by the base parsers (Henderson and Brill, 1999) ; or (b) re-parsing the pool of dependencies proposed by the base models (Sagae and Lavie, 2006) ." }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-10", "text": "The latter approach was shown to perform better for constituent parsing (Henderson and Brill, 1999) ." }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-11", "text": "While all these models achieved good performance, the previous work has left several questions Table 2 : Scores of unsupervised combination models using different voting strategies." }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-12", "text": "The combined trees are assembled using a word-by-word voting scheme." }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-13", "text": "parser variants are built by varying the parsing algorithm (we used three parsing models: Nivre's arceager (AE), Nivre's arc-standard (AS), and Covington's non-projective model (CN)), and the parsing direction (left to right (\u2192) or right to left (\u2190)), similar to (Hall et al., 2007) ." }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-14", "text": "The parameters of the Malt models were set to the values reported in (Hall et al., 2007) ." }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-15", "text": "The MST parser was used with the default configuration." 
}, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-16", "text": "Table 1 shows the performance of these models in the development and test partitions." }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-17", "text": "----------------------------------" }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-18", "text": "**EXPERIMENTS**" }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-19", "text": "----------------------------------" }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-20", "text": "**ON SCORING MODELS FOR PARSER COMBINATION**" }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-21", "text": "The most common approach for combining independently-trained models at parsing time is to assign each candidate dependency a score based on the number of votes it received from the base parsers." }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-22", "text": "Considering that parsers specialize in different phenomena, these votes can be weighted by different criteria." }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-23", "text": "To understand the importance of such weighting strategies we compare several voting approaches in Table 2 : in the \"unweighted\" strategy all votes have the same weight; in all other strategies each vote is assigned a value equal to the accuracy of the given parser in the particular instance of the context considered, e.g., in the \"weighted by POS of modifier\" model we use the accuracies of the base models for each possible part-of-speech (POS) tag of a modifier token." }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-24", "text": "In the table we show results as more base parsers are added to the ensemble (we add parsers in the order given by Table 1 )." }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-25", "text": "The results in Table 2 hand, the number of base parsers in the ensemble pool is crucial: performance generally continues to improve as more base parsers are considered." 
}, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-26", "text": "The best ensemble uses 6 out of the 7 base parsers." }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-27", "text": "3 It is often argued that the best way to re-score candidate dependencies is not through voting but rather through a meta-classifier that selects candidate dependencies based on their likelihood of belonging to the correct tree." }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-28", "text": "Unlike voting, a metaclassifier can combine evidence from multiple contexts (such as the ones listed in Table 2 )." }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-29", "text": "However, in our experiments such a meta-classifier 4 did not offer any gains over the much simpler unweighted voting strategy." }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-30", "text": "We explain these results as follows: the meta-classifier can potentially help only when it proposes dependencies that disagree with the majority vote." }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-31", "text": "We call such dependencies minority dependencies." }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-32", "text": "5 For a given parser and context instance (e.g., a modifier POS), we define precision of minority dependencies as the ratio of minority dependencies in this group that are correct." }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-33", "text": "Obviously, a group of minority dependencies provides beneficial signal only if its precision is larger than 50%." }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-34", "text": "Table 3 lists the total number of minority dependencies in groups with precision larger than 50% for all our base parsers and the most representative features." }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-35", "text": "The table shows that the number of minority dependencies with useful signal is extremely low." 
}, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-36", "text": "All in all, it accounts for less than 0.7% of all dependencies in the development corpus." }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-37", "text": "----------------------------------" }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-38", "text": "**ON RE-PARSING ALGORITHMS**" }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-39", "text": "To guarantee that the resulting dependency tree is well-formed, most previous work used the dynamic programming algorithm of Eisner (1996) for reparsing (Sagae and Lavie, 2006; Hall et al., 2007) ." }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-40", "text": "6 However, it is not clear that this step is necessary." }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-41", "text": "In other words, how many sentences are not wellformed if one uses a simple word-by-word voting scheme?" }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-42", "text": "To answer this, we analyzed the output of our best word-by-word voting scheme (six base parsers weighted by the POS of the modifier)." }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-43", "text": "The results for both in-domain and out-of-domain testing corpora are listed in Table 4 ." }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-44", "text": "The table shows that the percentage of badly-formed trees is relatively large: almost 10% out of domain." }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-45", "text": "This indicates that the focus on algorithms that guarantee well-formed trees is justified." }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-46", "text": "However, it is not clear how the Eisner algorithm, which has runtime complexity of O(n 3 ) (n -number of tokens per sentence), compares against approximate re-parsing algorithms that have lower runtime complexity." 
}, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-47", "text": "One such algorithm was proposed by Attardi and Dell'Orletta (2009) ." }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-48", "text": "The algorithm, which has a runtime complexity of O(n), builds dependency trees using a greedy top-down strategy, i.e., it starts by selecting the highest-scoring root node, then the highest-scoring children, etc." }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-49", "text": "We compare these algorithms against the word-by-word voting scheme in Table 5 ." }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-50", "text": "7 The results show that both algorithms pay a small penalty for guaranteeing well-formed trees." }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-51", "text": "This performance drop is statistically significant out of domain." }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-52", "text": "On the other hand, the difference between the Eisner and Attardi algorithms is not statistically significant out of domain." }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-53", "text": "6 We focus on projective parsing algorithms because 99.6% of dependencies in our data are projective (Surdeanu et al., 2008) ." }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-54", "text": "7 Statistical significance was performed using Dan Bikel randomized parsing evaluation comparator at 95% confidence." }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-55", "text": "This experiment proves that approximate re-parsing algorithms are a better choice for practical purposes, i.e., ensemble parsing in domains different from the training material of the base models." 
}, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-56", "text": "----------------------------------" }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-57", "text": "**ON PARSER INTEGRATION AT LEARNING TIME**" }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-58", "text": "Recent work has shown that the combination of base parsers at learning time, e.g., through stacking, yields considerable benefits (Nivre and McDonald, 2008; Attardi and Dell'Orletta, 2009 )." }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-59", "text": "However, it is unclear how these approaches compare against the simpler ensemble models, which combine parsers only at runtime." }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-60", "text": "To enable such a comparison, we reimplemented the best stacking model from (Nivre and McDonald, 2008 ) -MST M alt -which trains a variant of the MSTParser that uses additional features extracted from the output of a Malt parser." }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-61", "text": "In Table 6 , we compare this stacking approach against four variants of our ensemble models." }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-62", "text": "The superscript in the ensemble name indicates the runtime complexity of the model (O(n 3 ) or O(n))." }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-63", "text": "The cubic-time models use all base parsers from Table 1 and the Eisner algorithm for re-parsing." }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-64", "text": "The lineartime models use only Malt-based parsers and the Attardi algorithm for re-parsing." }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-65", "text": "The subscript in the model names indicates the percentage of available base parsers used, e.g., ensemble 3 50% uses only the first three parsers from Table 1 ." 
}, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-66", "text": "These results show that MST M alt is statistically equivalent to an ensemble that uses MST and two Malt variants, and both our ensemble 100% models are significantly better than MST M alt ." }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-67", "text": "While this comparison is somewhat unfair (MST M alt uses two base models, whereas our ensemble models use at least three) it does illustrate that the advantages gained from combining parsers at learning time can be easily surpassed by runtime combination models that have access to more base parsers." }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-68", "text": "Considering that variants of shift-reduce parsers can be generated with minimal effort (e.g., by varying the parsing direction, learning algorithms, etc.) and combining models at runtime is simpler than combining them at learning time, we argue that runtime parser combination is a more attractive approach." }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-69", "text": "----------------------------------" }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-70", "text": "**COMPARISON WITH THE STATE OF THE ART**" }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-71", "text": "In Table 7 we compare our best ensemble models against the top two systems of the CoNLL-2008 shared task evaluation." }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-72", "text": "The table indicates that our best ensemble model ranks second, outperforming significantly 19 other systems." }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-73", "text": "The only model performing better than our ensemble is a parser that uses higher-order features and has a higher runtime complexity (O(n 4 )) (Johansson and Nugues, 2008) ." 
}, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-74", "text": "While this is certainly proof of the importance of higher-order features, it also highlights a pragmatic conclusion: in out-of-domain corpora, an ensemble of models that use only first-order features achieves performance that is within 1% LAS of much more complex models." }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-75", "text": "----------------------------------" }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-76", "text": "**CONCLUSIONS**" }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-77", "text": "This study unearthed several non-intuitive yet important observations about ensemble models for dependency parsing." }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-78", "text": "First, we showed that the diversity of base parsers is more important than complex learning models for parser combination, i.e., (a) ensemble models that combine several base parsers at runtime performs significantly better than a state-ofthe-art model that combines two parsers at learning time, and (b) meta-classification does not outperform unsupervised voting schemes for the re-parsing of candidate dependencies when six base models are available." }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-79", "text": "Second, we showed that well-formed dependency trees can be guaranteed without significant performance loss by linear-time approximate re-parsing algorithms. And lastly, our analysis indicates that unweighted voting performs as well as weighted voting for the re-parsing of candidate dependencies." }, { "sent_id": "5f5a59f8fbf999b9eecfe7c1897b2c-C001-80", "text": "Considering that different base models are easy to generate, this work proves that ensemble parsers that are both accurate and fast can be rapidly developed with minimal effort." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "5f5a59f8fbf999b9eecfe7c1897b2c-C001-8" ], [ "5f5a59f8fbf999b9eecfe7c1897b2c-C001-39" ] ], "cite_sentences": [ "5f5a59f8fbf999b9eecfe7c1897b2c-C001-8", "5f5a59f8fbf999b9eecfe7c1897b2c-C001-39" ] }, "@SIM@": { "gold_contexts": [ [ "5f5a59f8fbf999b9eecfe7c1897b2c-C001-13" ] ], "cite_sentences": [ "5f5a59f8fbf999b9eecfe7c1897b2c-C001-13" ] }, "@USE@": { "gold_contexts": [ [ "5f5a59f8fbf999b9eecfe7c1897b2c-C001-14" ] ], "cite_sentences": [ "5f5a59f8fbf999b9eecfe7c1897b2c-C001-14" ] } } }, "ABC_5aeb64701a6b7d9878ea5e14a87b4e_40": { "x": [ { "sent_id": "5aeb64701a6b7d9878ea5e14a87b4e-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "5aeb64701a6b7d9878ea5e14a87b4e-C001-2", "text": "Lexical substitution is an annotation task in which annotators provide one-word paraphrases (lexical substitutes) for individual target words in a sentence context." }, { "sent_id": "5aeb64701a6b7d9878ea5e14a87b4e-C001-3", "text": "Lexical substitution yields a fine-grained characterization of word meaning that can be done by non-expert annotators." }, { "sent_id": "5aeb64701a6b7d9878ea5e14a87b4e-C001-4", "text": "We discuss results of a recent lexical substitution annotation effort, where we found strong contextual modulation effects: Many substitutes were not synonyms, hyponyms or hypernyms of the targets, but were highly specific to the situation at hand." }, { "sent_id": "5aeb64701a6b7d9878ea5e14a87b4e-C001-5", "text": "This data provides some food for thought for framesemantic analysis." }, { "sent_id": "5aeb64701a6b7d9878ea5e14a87b4e-C001-6", "text": "----------------------------------" }, { "sent_id": "5aeb64701a6b7d9878ea5e14a87b4e-C001-7", "text": "****" }, { "sent_id": "5aeb64701a6b7d9878ea5e14a87b4e-C001-8", "text": "1 Introduction Fillmore (1985) introduces the term \"semantics of understanding\", or U-semantics." 
}, { "sent_id": "5aeb64701a6b7d9878ea5e14a87b4e-C001-9", "text": "In contrast to the semantics of truth (T-semantics), the goal of Usemantics is to \"uncover the nature of the relationship between linguistic texts and the interpreter's full understanding of the texts in their contexts\"." }, { "sent_id": "5aeb64701a6b7d9878ea5e14a87b4e-C001-10", "text": "A central concept of the semantics of understanding is that of the interpretive frames that are necessary for understanding a sentence." }, { "sent_id": "5aeb64701a6b7d9878ea5e14a87b4e-C001-11", "text": "Frames are the \"coherent schematizations of experience\" underlying the words in a given sentence." }, { "sent_id": "5aeb64701a6b7d9878ea5e14a87b4e-C001-12", "text": "This idea of a semantics of understanding, or a frame semantics, has been made concrete in FrameNet (Fillmore et al., 2003) , a large lexical database that lists frames for English words and constructions." }, { "sent_id": "5aeb64701a6b7d9878ea5e14a87b4e-C001-13", "text": "At this point, it comprises more than 1,100 frames covering more than 12,000 lexical units (LUs), which are pairs of a term and its frame." }, { "sent_id": "5aeb64701a6b7d9878ea5e14a87b4e-C001-14", "text": "Researchers working on other languages have adopted the FrameNet idea." }, { "sent_id": "5aeb64701a6b7d9878ea5e14a87b4e-C001-15", "text": "Among others, there are now FrameNet resources for Spanish (Subirats and Petruck, 2003) , Japanese (Ohara et al., 2004) , Italian (Tonelli and Pianta, 2008; Lenci et al., 2010) , as well as frame-semantic annotation for German (Erk et al., 2003) ." }, { "sent_id": "5aeb64701a6b7d9878ea5e14a87b4e-C001-16", "text": "The definition of frames proceeds in a corpusbased fashion, driven by the data (Ellsworth et al., 2004) ." 
}, { "sent_id": "5aeb64701a6b7d9878ea5e14a87b4e-C001-17", "text": "We stand in this tradition by reporting on a recent annotation effort (Kremer et al., 2014 ) that collected lexical substitutes for content words in part of the MASC corpus (Ide et al., 2008) ." }, { "sent_id": "5aeb64701a6b7d9878ea5e14a87b4e-C001-40", "text": "The results are shown in Table 2 , for substitutes that are synonyms (syn), hypernyms (direct-hyper, trans-hyper) and hyponyms (direct-hypo, trans-hypo) of the target." }, { "sent_id": "5aeb64701a6b7d9878ea5e14a87b4e-C001-41", "text": "The \"wn-other\" line shows the percentage for substitutes that are in WordNet but not a synonym, hypo-or hypernym of the target, and \"not-in-wn\" are substitutes not covered by WordNet." }, { "sent_id": "5aeb64701a6b7d9878ea5e14a87b4e-C001-42", "text": "For substitutes that are synonyms, hypernyms, and hyponyms, we see percentages between 8% and 15% for both verbs and nouns." }, { "sent_id": "5aeb64701a6b7d9878ea5e14a87b4e-C001-43", "text": "We also see that there are few substitutes that are not in WordNet, only 1-2%." }, { "sent_id": "5aeb64701a6b7d9878ea5e14a87b4e-C001-44", "text": "Strikingly, 60-66% of all substitutes are in WordNet, but are \"wn-other\": neither synonyms nor (transitive) hyponyms or hypernyms of the target." }, { "sent_id": "5aeb64701a6b7d9878ea5e14a87b4e-C001-45", "text": "Some of these items can be viewed as missing links in the taxonomy." }, { "sent_id": "5aeb64701a6b7d9878ea5e14a87b4e-C001-46", "text": "For example, in the second sentence of Table 2 , two of the \"wn-other\" substitutes of keep are own and possess. But while own and possess are not linked to keep in WordNet, the FrameNet frame RETAINING, which has keep as a lexical unit, inherits from POSSESSION, which has both own and possess as lexical units. But this does not apply to all the \"wn-other\" substitutes." 
}, { "sent_id": "5aeb64701a6b7d9878ea5e14a87b4e-C001-47", "text": "Some are better explained as effects of contextual modulation, fine-grained meaning distinctions that the sentence context brings about." }, { "sent_id": "5aeb64701a6b7d9878ea5e14a87b4e-C001-48", "text": "In the first example in Table 1 , there is the possibility that the speaker could be laughing at the other person, and the shoulder-clapping clarifies that this possibility does not correspond to the facts." }, { "sent_id": "5aeb64701a6b7d9878ea5e14a87b4e-C001-49", "text": "In the second example in the table, the words possess, enshrine and stage are more specific than the substitutes that are in WordNet, and maybe more appropriate too." }, { "sent_id": "5aeb64701a6b7d9878ea5e14a87b4e-C001-50", "text": "In the third example, the word charge has the meaning of dependent, but the situation that the sentence describes suggests that the dependents in questions may be something like underlings or prisoners." }, { "sent_id": "5aeb64701a6b7d9878ea5e14a87b4e-C001-51", "text": "When we look at this data from a framesemantic analysis point of view, the first question that arises is: How specific should the frames be that are listed in FrameNet?" }, { "sent_id": "5aeb64701a6b7d9878ea5e14a87b4e-C001-52", "text": "For the second example, would we want a very specific \"person as precious jewel\" frame to be associated with the lexical unit \"keep\"? From a U-semantics point of view, one could argue that we would in fact want to have this frame, after all: It describes a recognizable abstract situation that is important for the understanding of this sentence. But it does not seem that all \"wn-other\" cases need to correspond to particular frames of the target word." }, { "sent_id": "5aeb64701a6b7d9878ea5e14a87b4e-C001-53", "text": "For example, in the first sentence on Table 1 , it does not seem that clarify should be an actual frame involving the word show." 
}, { "sent_id": "5aeb64701a6b7d9878ea5e14a87b4e-C001-54", "text": "From a computational linguistics point of view, a fine-grained analysis would be necessary in order to correctly predict lexical substitutes like sentence substitutes I clapped her shoulder to show I was not laughing at her." }, { "sent_id": "5aeb64701a6b7d9878ea5e14a87b4e-C001-55", "text": "demonstrate, express, establish, indicate, prove, convey, imply, display, disclose, clarify My fear is that she would live, and I would learn that I had lost her long before Emil Malaquez translated her into a thing that can be kept, admired, and loved. preserve, retain, hold, fix, store, own, possess, enshrine, stage" }, { "sent_id": "5aeb64701a6b7d9878ea5e14a87b4e-C001-56", "text": "The distinctive whuffle of pleasure rippled through the betas on the bridge, and Rakal let loose a small growl, as if to caution his charges against false hope. dependent, command, accusation, private, companion, follower, subordinate, prisoner, teammate, ward, junior, underling, enemy, group, crew, squad, troop, team, kid Table 1 : Example from lexical substitution data: Target words underlined, and WordNet-unrelated substitutes shown in italics." }, { "sent_id": "5aeb64701a6b7d9878ea5e14a87b4e-C001-57", "text": "this -but on the other hand, experience with word sense disambiguation has shown that finegrained senses are hard to assign with good accuracy (Palmer et al., 2007) ." }, { "sent_id": "5aeb64701a6b7d9878ea5e14a87b4e-C001-58", "text": "Another question that this data poses is: What are the items that evoke a frame? That is, what words or phrases in a sentence are responsible that a particular frame becomes important for understanding the sentence?" }, { "sent_id": "5aeb64701a6b7d9878ea5e14a87b4e-C001-59", "text": "In FrameNet it is a single lemma, multi-word expression or construction that evokes a frame. 
But one way of looking at the contextual modulation effects in the lexical substitution data is to say that multiple terms or constructions in the context \"conspire\" to make a frame relevant." }, { "sent_id": "5aeb64701a6b7d9878ea5e14a87b4e-C001-60", "text": "In the second sentence of Table 1, we can point to multiple factors that lead to substitutes like possess and enshrine." }, { "sent_id": "5aeb64701a6b7d9878ea5e14a87b4e-C001-61", "text": "There is fact that the THEME of keep is thing, along with the fact that the same thing is being admired and loved, and maybe also the fact that some woman had been translated to said thing." }, { "sent_id": "5aeb64701a6b7d9878ea5e14a87b4e-C001-62", "text": "This thought is reminiscent of McRae and colleagues, who study general event knowledge and argue that it is not just verbs that introduce the events, but also arguments (McRae et al., 2005) and combinations of verbs and their arguments (Bicknell et al., 2010) ." }, { "sent_id": "5aeb64701a6b7d9878ea5e14a87b4e-C001-18", "text": "If we view substitute sets as indications of the relevant frame, then this data can give us interesting indicators on perceived frames in a naturally occurring text." }, { "sent_id": "5aeb64701a6b7d9878ea5e14a87b4e-C001-19", "text": "----------------------------------" }, { "sent_id": "5aeb64701a6b7d9878ea5e14a87b4e-C001-20", "text": "**LEXICAL SUBSTITUTION**" }, { "sent_id": "5aeb64701a6b7d9878ea5e14a87b4e-C001-21", "text": "The Lexical Substitution task was first introduced in the context of SemEval 2007 (McCarthy and Navigli, 2009) ." }, { "sent_id": "5aeb64701a6b7d9878ea5e14a87b4e-C001-22", "text": "For this dataset, annotators are asked to provide substitutes for a selected word (the target word) in its sentence context -at least one substitute, but possible more, and ideally a single word, though all the datasets contain some multi-word substitutes." 
}, { "sent_id": "5aeb64701a6b7d9878ea5e14a87b4e-C001-23", "text": "Multiple annotators provide substitutes for each target word occurrence." }, { "sent_id": "5aeb64701a6b7d9878ea5e14a87b4e-C001-24", "text": "Table 1 shows some examples." }, { "sent_id": "5aeb64701a6b7d9878ea5e14a87b4e-C001-25", "text": "By now, several lexical substitution datasets exist." }, { "sent_id": "5aeb64701a6b7d9878ea5e14a87b4e-C001-26", "text": "Some are \"lexical sample\" datasets, that is, only occurrences of some selected lemmas are annotated (McCarthy and Navigli, 2009; Biemann, 2013) , and some are \"all-words\", providing substitutes for all content words in the given sentences (Sinha and Mihalcea, 2014; Kremer et al., 2014) ." }, { "sent_id": "5aeb64701a6b7d9878ea5e14a87b4e-C001-27", "text": "In addition, there is a cross-lingual lexical substitution dataset (McCarthy et al., 2013) , where annotators provided Spanish substitutes for English target words in English sentence context." }, { "sent_id": "5aeb64701a6b7d9878ea5e14a87b4e-C001-28", "text": "Lexical substitution is a method for characterizing word meaning in context that has several attractive properties." }, { "sent_id": "5aeb64701a6b7d9878ea5e14a87b4e-C001-29", "text": "Lexical substitution makes it possible to describe word meaning without having to rely on any particular dictionary." }, { "sent_id": "5aeb64701a6b7d9878ea5e14a87b4e-C001-30", "text": "In addition, providing substitutes is a task that seems to be well doable by untrained annotators: Both Biemann (2013) and our recent annotation (Kremer et al., 2014) used crowdsourcing to collect the substitutes. (Table 2: Analysis of lexical substitution data: Relation of the substitute to the target, in percentages by part of speech; from Kremer et al. (2014).)"
}, { "sent_id": "5aeb64701a6b7d9878ea5e14a87b4e-C001-31", "text": "1" }, { "sent_id": "5aeb64701a6b7d9878ea5e14a87b4e-C001-32", "text": "----------------------------------" }, { "sent_id": "5aeb64701a6b7d9878ea5e14a87b4e-C001-33", "text": "**ANALYZING LEXICAL SUBSTITUTES**" }, { "sent_id": "5aeb64701a6b7d9878ea5e14a87b4e-C001-34", "text": "In a recent lexical substitution annotation effort (Kremer et al., 2014) , we collected lexical substitution annotation for all nouns, verbs, and adjectives in a mixed news and fiction corpus, using untrained annotators via crowdsourcing." }, { "sent_id": "5aeb64701a6b7d9878ea5e14a87b4e-C001-35", "text": "The data came from MASC, a freely available part of the American National Corpus that has already been annotated for a number of linguistic phenomena (Ide et al., 2008) ." }, { "sent_id": "5aeb64701a6b7d9878ea5e14a87b4e-C001-36", "text": "All in all, more than 15,000 target tokens were annotated." }, { "sent_id": "5aeb64701a6b7d9878ea5e14a87b4e-C001-37", "text": "After the annotation, we performed a number of analyses in order to better understand the nature of lexical substitutes, by linking substitutes to information in WordNet (Fellbaum, 1998) ." }, { "sent_id": "5aeb64701a6b7d9878ea5e14a87b4e-C001-38", "text": "Among other things, we analyzed the relation between targets and substitutes: Did substitutes tend to be synonyms, hypernyms, or hyponyms of the targets?" }, { "sent_id": "5aeb64701a6b7d9878ea5e14a87b4e-C001-39", "text": "To classify substitutes, the shortest route from any synset of the target to any synset of the substitute was used."
} ], "y": { "@UNSURE@": { "gold_contexts": [ [ "5aeb64701a6b7d9878ea5e14a87b4e-C001-17" ] ], "cite_sentences": [ "5aeb64701a6b7d9878ea5e14a87b4e-C001-17" ] }, "@BACK@": { "gold_contexts": [ [ "5aeb64701a6b7d9878ea5e14a87b4e-C001-26" ], [ "5aeb64701a6b7d9878ea5e14a87b4e-C001-30" ], [ "5aeb64701a6b7d9878ea5e14a87b4e-C001-34" ] ], "cite_sentences": [ "5aeb64701a6b7d9878ea5e14a87b4e-C001-26", "5aeb64701a6b7d9878ea5e14a87b4e-C001-30", "5aeb64701a6b7d9878ea5e14a87b4e-C001-34" ] } } }, "ABC_3c74f66c209335ea33dda38c203199_40": { "x": [ { "sent_id": "3c74f66c209335ea33dda38c203199-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "3c74f66c209335ea33dda38c203199-C001-2", "text": "Recent work has shown that neural models can be successfully trained on multiple languages simultaneously." }, { "sent_id": "3c74f66c209335ea33dda38c203199-C001-3", "text": "We investigate whether such models learn to share and exploit common syntactic knowledge among the languages on which they are trained." }, { "sent_id": "3c74f66c209335ea33dda38c203199-C001-4", "text": "This extended abstract presents our preliminary results." }, { "sent_id": "3c74f66c209335ea33dda38c203199-C001-5", "text": "----------------------------------" }, { "sent_id": "3c74f66c209335ea33dda38c203199-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "3c74f66c209335ea33dda38c203199-C001-7", "text": "Recent work has shown that state-of-the-art neural models of language and translation can be successfully trained on multiple languages simultaneously without changing the model architecture (\u00d6stling and Tiedemann, 2017; Johnson et al., 2017) ." }, { "sent_id": "3c74f66c209335ea33dda38c203199-C001-8", "text": "In some cases this leads to improved performance compared to models only trained on a specific language, suggesting that multilingual models learn to share useful knowledge crosslingually through their learned representations." 
}, { "sent_id": "3c74f66c209335ea33dda38c203199-C001-9", "text": "While a large body of research exists on the multilingual mind, the mechanisms explaining knowledge sharing in computational multilingual models remain largely unknown: What kind of knowledge is shared among languages?" }, { "sent_id": "3c74f66c209335ea33dda38c203199-C001-10", "text": "Do multilingual models mostly benefit from a better modeling of lexical entries or do they also learn to share more abstract linguistic categories?" }, { "sent_id": "3c74f66c209335ea33dda38c203199-C001-11", "text": "We focus on the case of language models (LM) trained on two languages, one of which (L1) is over-resourced with respect to the other (L2), and investigate whether the syntactic knowledge learned for L1 is transferred to L2." }, { "sent_id": "3c74f66c209335ea33dda38c203199-C001-12", "text": "To this end we use the long-distance agreement benchmark recently introduced by Gulordava et al. (2018) ." }, { "sent_id": "3c74f66c209335ea33dda38c203199-C001-13", "text": "----------------------------------" }, { "sent_id": "3c74f66c209335ea33dda38c203199-C001-14", "text": "**BACKGROUND**" }, { "sent_id": "3c74f66c209335ea33dda38c203199-C001-15", "text": "The recent advances in neural networks have opened the way to the design of architecturally simple multilingual models for various NLP tasks, such as language modeling or next word prediction (Tsvetkov et al., 2016; \u00d6stling and Tiedemann, 2017; Malaviya et al., 2017; Tiedemann, 2018) , translation (Dong et al., 2015; Zoph et al., 2016; Firat et al., 2016; Johnson et al., 2017) , morphological reinflection (Kann et al., 2017) and more (Bjerva, 2017) ." }, { "sent_id": "3c74f66c209335ea33dda38c203199-C001-16", "text": "A practical benefit of training models multilingually is to transfer knowledge from high-resource languages to lowresource ones and improve task performance in the latter." 
}, { "sent_id": "3c74f66c209335ea33dda38c203199-C001-17", "text": "Here we aim at understanding how linguistic knowledge is transferred among languages, specifically at the syntactic level, which to our knowledge has not been studied so far." }, { "sent_id": "3c74f66c209335ea33dda38c203199-C001-18", "text": "Assessing the syntactic abilities of monolingual neural LMs trained without explicit supervision has been the focus of several recent studies: Linzen et al. (2016) analyzed the performance of LSTM LMs at an English subject-verb agreement task, while Gulordava et al. (2018) extended the analysis to various long-range agreement patterns in different languages." }, { "sent_id": "3c74f66c209335ea33dda38c203199-C001-19", "text": "The latter study found that state-of-the-art LMs trained on a standard log-likelihood objective capture non-trivial patterns of syntactic agreement and can approach the performance levels of humans, even when tested on syntactically well-formed but meaningless (nonce) sentences." }, { "sent_id": "3c74f66c209335ea33dda38c203199-C001-20", "text": "Cross-language interaction during language production and comprehension by human subjects has been widely studied in the fields of bilingualism and second language acquisition (Kellerman and Sharwood Smith; Odlin, 1989; Jarvis and Pavlenko, 2008) under the terms of language transfer or cross-linguistic influence." }, { "sent_id": "3c74f66c209335ea33dda38c203199-C001-21", "text": "Numerous studies have shown that both the lexicons and the grammars of different languages are not stored independently but together in the mind of bilinguals and second-language learners, leading to observable lexical and syntactic transfer effects (Kootstra et al., 2012) ." }, { "sent_id": "3c74f66c209335ea33dda38c203199-C001-22", "text": "For instance, through a cross-lingual syntactic priming experiment, Hartsuiker et al. (2004) showed that bilinguals recently exposed to a given syntactic construction (passive voice) in their L1 tend to reuse the same construction in their L2." }, { "sent_id": "3c74f66c209335ea33dda38c203199-C001-23", "text": "While the neural networks in this study are not designed to be plausible models of the human mind learning and processing multiple languages, we believe there is interesting potential at the intersection of these research fields." }, { "sent_id": "3c74f66c209335ea33dda38c203199-C001-24", "text": "----------------------------------" }, { "sent_id": "3c74f66c209335ea33dda38c203199-C001-25", "text": "**EXPERIMENT**" }, { "sent_id": "3c74f66c209335ea33dda38c203199-C001-26", "text": "We consider the scenario where L1 is over-resourced compared to L2 and train our bilingual models by joint training on a mixed L1/L2 corpus so that supervision is provided simultaneously in the two languages (\u00d6stling and Tiedemann, 2017; Johnson et al., 2017) ." }, { "sent_id": "3c74f66c209335ea33dda38c203199-C001-27", "text": "We leave the evaluation of pre-training (or transfer learning) methods (Zoph et al., 2016; Nguyen and Chiang, 2017) to future work." }, { "sent_id": "3c74f66c209335ea33dda38c203199-C001-28", "text": "The monolingual LM is trained on a small L2 corpus (LM L2 )." }, { "sent_id": "3c74f66c209335ea33dda38c203199-C001-29", "text": "The bilingual LM is trained on a shuffled mix of the same small L2 corpus and a large L1 corpus, where L2 is oversampled to approximately match the amount of L1 sentences (LM L1+L2 )." }, { "sent_id": "3c74f66c209335ea33dda38c203199-C001-30", "text": "See Table 1 for the actual training sizes." }, { "sent_id": "3c74f66c209335ea33dda38c203199-C001-31", "text": "For our preliminary experiments we have chosen French as the helper language (L1) and Italian as the target language (L2)."
}, { "sent_id": "3c74f66c209335ea33dda38c203199-C001-32", "text": "Since French and Italian share many morphosyntactic patterns, accuracy on the Italian agreement tasks is expected to benefit from adding French sentences to the training data if syntactic transfer occurs." }, { "sent_id": "3c74f66c209335ea33dda38c203199-C001-33", "text": "Data and training details: We train our LMs on French and Italian Wikipedia articles extracted using the WikiExtractor tool." }, { "sent_id": "3c74f66c209335ea33dda38c203199-C001-34", "text": "1 For each language, we maintain a vocabulary of the 50k most frequent tokens, and replace the remaining tokens by . For the bilingual LM, all words are prepended with a language tag so that vocabularies are completely disjoint." }, { "sent_id": "3c74f66c209335ea33dda38c203199-C001-35", "text": "Their union (100K types) is used to train the model." }, { "sent_id": "3c74f66c209335ea33dda38c203199-C001-36", "text": "This is the least optimistic scenario for linguistic transfer but also the most controlled one." }, { "sent_id": "3c74f66c209335ea33dda38c203199-C001-37", "text": "In future experiments we plan to study how transfer is affected by varying degrees of vocabulary overlap." }, { "sent_id": "3c74f66c209335ea33dda38c203199-C001-38", "text": "Following the setup of Gulordava et al. (2018) , we train 2-layer LSTM models with embedding and hidden layers of 650 dimensions for 40 epochs." }, { "sent_id": "3c74f66c209335ea33dda38c203199-C001-39", "text": "The trained models are evaluated on the Italian section of the syntactic benchmark provided by Gulordava et al. (2018) , which includes various non-trivial number agreement constructions." }, { "sent_id": "3c74f66c209335ea33dda38c203199-C001-40", "text": "2 Note that all models are trained on a regular corpus likelihood objective and do not receive any specific supervision for the syntactic tasks." 
}, { "sent_id": "3c74f66c209335ea33dda38c203199-C001-41", "text": "Table 1 shows the results of our preliminary experiments." }, { "sent_id": "3c74f66c209335ea33dda38c203199-C001-42", "text": "The unigram baseline simply picks, for each sentence, the most frequent word form between singular and plural." }, { "sent_id": "3c74f66c209335ea33dda38c203199-C001-43", "text": "As an upper-bound we report the agreement accuracy obtained by a monolingual model trained on a large L2 corpus." }, { "sent_id": "3c74f66c209335ea33dda38c203199-C001-44", "text": "The effect of mixing the small Italian corpus with the large French one does not appear to be major." }, { "sent_id": "3c74f66c209335ea33dda38c203199-C001-45", "text": "Agreement accuracy increases slightly in the original sentences, where the model is free to rely on collocational cues, but decreases slightly in the nonce sentences, where the model must rely on pure grammatical knowledge." }, { "sent_id": "3c74f66c209335ea33dda38c203199-C001-46", "text": "Thus there is currently no evidence that syntactic transfer occurs in our setup." }, { "sent_id": "3c74f66c209335ea33dda38c203199-C001-47", "text": "A possible explanation is that the bilingual model has to fit the knowledge from two language systems into the same number of hidden layer parameters and this may cancel out the benefits of being exposed to a more diverse set of sentences." }, { "sent_id": "3c74f66c209335ea33dda38c203199-C001-48", "text": "In fact, the bilingual model achieves a considerably worse perplexity than the monolingual one (69.9 vs 55.62) on an Italian-only held-out set." }, { "sent_id": "3c74f66c209335ea33dda38c203199-C001-49", "text": "For comparison, \u00d6stling and Tiedemann (2017) observed slightly better perplexities when mixing a small number of related languages; however, their setup was considerably different (character-level LSTM with highly overlapping vocabulary)."
}, { "sent_id": "3c74f66c209335ea33dda38c203199-C001-50", "text": "This is work in progress." }, { "sent_id": "3c74f66c209335ea33dda38c203199-C001-51", "text": "We are currently looking for a bilingual LM configuration that will result in better target language perplexity and, possibly, better agreement accuracy." }, { "sent_id": "3c74f66c209335ea33dda38c203199-C001-52", "text": "We also plan to extend the evaluation to other, less related, language pairs and different multilingual training techniques." }, { "sent_id": "3c74f66c209335ea33dda38c203199-C001-53", "text": "Finally, we plan to examine whether lexical syntactic categories (POS) are represented in a shared space among the two languages." }, { "sent_id": "3c74f66c209335ea33dda38c203199-C001-54", "text": "----------------------------------" }, { "sent_id": "3c74f66c209335ea33dda38c203199-C001-55", "text": "**RESULTS AND CONCLUSIONS**" } ], "y": { "@USE@": { "gold_contexts": [ [ "3c74f66c209335ea33dda38c203199-C001-12" ], [ "3c74f66c209335ea33dda38c203199-C001-38" ], [ "3c74f66c209335ea33dda38c203199-C001-39" ] ], "cite_sentences": [ "3c74f66c209335ea33dda38c203199-C001-12", "3c74f66c209335ea33dda38c203199-C001-38", "3c74f66c209335ea33dda38c203199-C001-39" ] }, "@BACK@": { "gold_contexts": [ [ "3c74f66c209335ea33dda38c203199-C001-18" ] ], "cite_sentences": [ "3c74f66c209335ea33dda38c203199-C001-18" ] } } }, "ABC_5f97682d8a1b78f08fd3623ee81703_40": { "x": [ { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-2", "text": "Extracting open relational tuples that are mediated by nouns (instead of verbs) is important since titles and entity attributes are often expressed nominally." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-3", "text": "While appositives and possessives are easy to handle, a difficult and important class of nominal extractions requires interpreting compound noun phrases (e.g., \"Google CEO Larry Page\")." 
}, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-4", "text": "We substantially improve the quality of Open IE from compound noun phrases by focusing on phenomena like demonyms and compound relational nouns." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-5", "text": "We release RELNOUN 2.2, which obtains 3.5 times the yield with over 15 point improvement in precision compared to RELNOUN 1.1, a publicly available nominal Open IE system." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-6", "text": "----------------------------------" }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-8", "text": "Open Information Extraction (Etzioni et al., 2008) systems output relational tuples from text without a pre-specified relational vocabulary by identifying relation phrases present in text." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-9", "text": "Early work on Open IE focused on verb-mediated relations that could be expressed using a handful of patterns and still covered substantial information in text." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-10", "text": "Subsequent research has focused on increasing recall -a noteworthy approach (OLLIE) uses bootstrapping for learning general language patterns (Mausam et al., 2012) ." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-11", "text": "Various extensions improve on the amount of linguistic knowledge in the systems -EXEMPLAR (de S\u00e1 Mesquita et al., 2013) improves the set of rules on top of dependency parses; Open IE 4.0 (https://github.com/knowitall/openie) uses carefully designed rules over semantic role labeling systems; several works attempt clause identification or sentence restructuring, thus identifying sentence components and applying extraction rules on top of these components (Schmidek and Barbosa, 2014; Corro and Gemulla, 2013; Bast and Haussmann, 2013) ."
}, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-57", "text": "Demonym replacements are also needed for compound NPs without capitalization, for example, \"German chancellor Angela Merkel\"." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-12", "text": "Other approaches include use of lexicosyntactic qualia-based patterns (Xavier et al., 2015) , simple sentence-specific inference (Bast and Haussmann, 2014) , and a supervised approach using tree kernels (Xu et al., 2013) ." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-13", "text": "While the focus on verbs continues to be common in these Open IE systems, some works have directed attention to noun-mediated relations such as OLLIE (Mausam et al., 2012) , RENOUN (Yahya et al., 2014) , and RELNOUN." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-14", "text": "2 A common observation is that many relations (e.g., capital of, economist at) are more frequently expressed using nouns, instead of verbs." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-15", "text": "Common noun-mediated patterns include appositive constructions, possessive constructions, and compound noun phrases (see Table 1 for examples)." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-16", "text": "While most patterns give some syntactic cues for the existence of a relation (such as a comma or a possessive 's), interpreting and extracting tuples from compound NPs is specifically challenging, since they are just a continuous sequence of nouns and adjectives (e.g., \"Google CEO Larry Page\")." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-17", "text": "This paper substantially improves the quality of extraction from compound noun phrases."
}, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-18", "text": "Our work builds on the publicly available RELNOUN system (ver 1.1) and extends it to RELNOUN 2.2, which incorporates three additional sources of recall from compound noun phrases: (1) capitalized relational nouns, (2) demonyms, the adjectives used to identify residents of a location (e.g., 'Japanese' for 'Japan'), and (3) compound relational nouns (see Table 2 for examples)." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-19", "text": "Compared to its predecessor, RELNOUN 2.2 triples the yield with over 15 point improvement in precision." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-20", "text": "Our code is freely downloadable." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-21", "text": "2" }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-22", "text": "----------------------------------" }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-23", "text": "**BACKGROUND ON NOMINAL OPEN IE**" }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-24", "text": "Probably the earliest work on Nominal Open IE is OLLIE, which is a pattern learning approach based on bootstrapped training data using high precision verb-based extractions (Mausam et al., 2012) ." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-25", "text": "It identified that nominal IE can't be completely syntactic, and, at the least, a list of relational nouns (e.g., mother, director, CEO, capital) is needed for high precision extraction." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-26", "text": "OLLIE is superseded by RELNOUN, which is a rule-based extractor incorporating most of the high precision learnings of OLLIE." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-27", "text": "A third work on nominal Open IE is RENOUN (Yahya et al., 2014) ."
}, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-28", "text": "RENOUN builds a comprehensive list of relational nouns using bootstrapping over query logs and text." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-29", "text": "It then uses seed patterns to extract data and then uses these as a source of distant supervision for additional pattern learning." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-30", "text": "Unfortunately, neither their list of relational nouns, nor their final extractor are available." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-31", "text": "Moreover, it is hard to reproduce their list of nouns since most researchers don't have access to query logs." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-32", "text": "Hence, we build upon the publicly available RELNOUN system." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-33", "text": "The RELNOUN 1.x series is a set of POS and NP-chunk patterns defined to extract a high precision subset of noun-mediated extractions (see Table 1 )." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-58", "text": "This requires another small extension to RELNOUN -allowing arg2 to be a JJ when it is in the demonym list (typically arg2s can only be an NNP)." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-34", "text": "The input to RELNOUN is a set of relational nouns, which are extracted using bootstrapping -these include words which are common headnouns for X in \"is a X of\" patterns, as well as words which are within the 'Person' subclass in the WordNet hierarchy (Miller, 1995) ." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-35", "text": "3 We added a couple of missing patterns and made small modifications to the previous RELNOUN release (version 1.0.9) to increase its coverage and precision." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-36", "text": "The resulting system, RELNOUN 1.1, acts as the baseline for our work."
}, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-37", "text": "Our analysis of RELNOUN 1.1 revealed significant missed recall when extracting from compound noun phrases such as \"Mechelen Mayor Bart Somers\", \"Chinese president Hu Jintao\", or \"United States health minister Levitt\"." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-38", "text": "We attribute this to three important missing phenomena in RELNOUN 1.1 when extracting from compound noun phrases -capitalized relational nouns ('Mayor'), demonyms ('Chinese'), and compound relational nouns ('health minister')." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-39", "text": "Note that here a compound relational noun occurs within a larger compound noun phrase." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-40", "text": "RELNOUN 2.2 improves the analysis for all these three categories." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-41", "text": "3 RELNOUN 2.2: RELNOUN 1.1 does not extract from phrases containing capitalized (NNP) relational nouns (e.g., \"Mechelen Mayor Bart Somers\") even though, at times, that is grammatically correct whereas uncapitalized nouns are not." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-42", "text": "The first set of errors is easy to fix." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-43", "text": "We create a list of 160 organization words and filter out any extractions where arg1 has an organization word." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-44", "text": "The list is created by extracting the most frequent last words from the list of organizations on Wikipedia." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-45", "text": "They include words like 'Committee', 'Limited', 'Group', and 'Association'." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-46", "text": "This ORG filtering improves the precision slightly, with almost no impact on recall."
}, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-47", "text": "The next two subsections detail our approaches for incorporating knowledge about demonyms and compound relational nouns in RELNOUN." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-48", "text": "----------------------------------" }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-49", "text": "**DEMONYMS**" }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-50", "text": "Demonyms are words derived from the name of a location and are used to identify residents or natives of that location." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-51", "text": "Typical examples include 'Israeli', 'Japanese', and 'South African' for residents from Israel, Japan, and South Africa, respectively." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-52", "text": "We first parse a list of demonyms from Wikipedia to populate a table of (Location, Demonym) pairs." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-53", "text": "We expand this table (https://en.wikipedia.org/wiki/Demonym) with information from additional geographical websites, leading to a total of 2,143 base entries." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-54", "text": "The demonyms (and locations) can frequently take region-related pre-modifiers." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-55", "text": "When checking for demonyms, we allow them to be preceded by 'North', 'South', 'East', 'West', 'Northern', 'Southern', 'Eastern', 'Western', and 'Central'." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-56", "text": "To extract relationships expressed via demonyms appropriately (e.g. \"German Chancellor Angela Merkel\"), we simply check whether arg2's headnoun is in our demonym table, and if it is then we replace it with its corresponding location from the table."
}, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-59", "text": "In addition to compound NPs, the demonym replacement can also be useful for other patterns from Table 1 ." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-60", "text": "For example, AppositivePossessive and PossessiveAppositive both benefit from this (for example, \"Angela Merkel, German Chancellor\" and \"German Chancellor, Angela Merkel\")." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-61", "text": "Domicile vs. Title Classification: Demonyms are used to denote two common relationships." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-62", "text": "First, arg1 may be directly related to the location or the government of the location through the relational noun." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-63", "text": "For example, \"United States President Obama\" -Obama is the President of the country of (or govt." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-64", "text": "of) United States." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-65", "text": "A second usage simply suggests that the location is arg1's domicile, i.e., arg1 is a native of, lives in, or has another substantial connection with the location." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-66", "text": "For example, \"Canadian pitcher Andrew Albers\" only denotes a domicile -Albers is not a player of Canada! Ideally, we would like to extract (Andrew Albers, [is] player [from], Canada), instead of [of] ."
}, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-67", "text": "We manually create a small list of ten relational nouns, which represent heads and other high-posts" }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-68", "text": "----------------------------------" }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-69", "text": "**COMPOUND RELATIONAL NOUNS**" }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-70", "text": "We now extend RELNOUN to handle compound relational nouns such as 'health minister', 'foreign secretary', and 'vice president'." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-71", "text": "We first observe that for all extractors in Table 1 , except CompoundNoun there are lexical indicators ('of', ',', or possessive marker) to segment a relational noun." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-72", "text": "However, because CompoundNoun pattern is simply a sequence of nouns, segmenting relational nouns is harder and it is a common source of errors." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-73", "text": "Segmenting compound relational nouns is relatively easy if it is followed by a demonym, since the demonym can help segment the left boundary (e.g., \"Indian Prime Minister Modi\")." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-74", "text": "However, the real challenge is in non-demonym cases -disambiguating whether \"GM Vice Chairman Bob Lutz\" should result in (Bob Lutz, [is]" }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-75", "text": "We bootstrap a list of common relational noun prefixes over a large text corpus." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-76", "text": "We collect all words (from 1.1 million sentences of ClueWeb12 corpus) that precede relational nouns and are in the same NP chunk, are of appropriate POS (JJ, NN, NNP, etc), are not a known demonym and don't end in a possessive." 
}, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-77", "text": "We keep all such prefixes of frequency greater than 20, resulting in a list of 5,606 prefixes." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-78", "text": "We use these prefixes to segment the final relational nouns in compound NPs." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-79", "text": "Even though noisy, this list serves our purpose since common prefixes are already present, and it is used only for the non-demonym CompoundNoun extractor." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-80", "text": "----------------------------------" }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-81", "text": "**EXPERIMENTS**" }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-82", "text": "Our goal is to compare RELNOUN 1.1 with RELNOUN 2.2." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-83", "text": "We randomly sample 2,000 sentences from Newswire and run both (and other intermediate systems) on them." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-84", "text": "We ask two annotators (students) to tag whether the sentence asserted or implied an extraction." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-85", "text": "Our inter-annotator agreement is 0.97 and we retain the subset of extractions on which the annotators agree for further analysis." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-86", "text": "Note that while precision and yield (number of correct extractions) can be naturally computed by tagging extractions, estimating recall is challenging, as it requires annotators to tag all possible extractions from these sentences." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-87", "text": "Following previous work (Mausam et al., 2012) , we report yield, since recall is proportional to yield and suffices for system comparisons." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-88", "text": "Table 3 reports the results."
}, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-89", "text": "We find that OLLIE's noun patterns have a good yield but poor precision, whereas RELNOUN 1.1 has a decent precision of 0.53, but the yield is much lower." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-90", "text": "Allowing NNP relational nouns in RELNOUN 1.1 has a 66% increase in yield, but precision takes a severe hit." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-91", "text": "ORG filtering only helps a little bit in improving precision." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-92", "text": "Handling of demonyms has a huge impact since not only does yield increase by another 58%, precision goes up 13 points too." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-93", "text": "Finally, incorporating compound relational nouns adds another 51 extractions with a significant (17 point) improvement in precision." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-94", "text": "Overall, RELNOUN 2.2 has a 3.5x increase in yield with a 16 point increase in precision, making it a substantial improvement on the existing REL-NOUN 1.1 system." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-95", "text": "----------------------------------" }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-96", "text": "**CONCLUSIONS**" }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-97", "text": "An important subtask of Open IE is Nominal Open IE, within which dealing with compound NPs is particularly challenging owing to complications arising due to organization names, demonyms and compound relational nouns." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-98", "text": "We release RELNOUN 2.2, a significant improvement over the publicly available RELNOUN 1.1 system, which deals with each of these challenges." 
}, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-99", "text": "The main approach for improvement uses a combination of rule-based systems and semantic lists, bootstrapped automatically from text corpus and also compiled manually over the Web." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-100", "text": "RELNOUN 2.2 has 3.5x more yield with a substantial increase of 16 precision points." }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-101", "text": "----------------------------------" }, { "sent_id": "5f97682d8a1b78f08fd3623ee81703-C001-102", "text": "**38**" } ], "y": { "@BACK@": { "gold_contexts": [ [ "5f97682d8a1b78f08fd3623ee81703-C001-10" ], [ "5f97682d8a1b78f08fd3623ee81703-C001-13" ], [ "5f97682d8a1b78f08fd3623ee81703-C001-24" ] ], "cite_sentences": [ "5f97682d8a1b78f08fd3623ee81703-C001-10", "5f97682d8a1b78f08fd3623ee81703-C001-13", "5f97682d8a1b78f08fd3623ee81703-C001-24" ] }, "@USE@": { "gold_contexts": [ [ "5f97682d8a1b78f08fd3623ee81703-C001-87" ] ], "cite_sentences": [ "5f97682d8a1b78f08fd3623ee81703-C001-87" ] } } }, "ABC_d6c8b712c8fe3dd87d23886d575098_40": { "x": [ { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-2", "text": "This paper describes the DFKI-NMT submission to the WMT19 News translation task." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-3", "text": "We participated in both English-to-German and German-to-English directions." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-4", "text": "We trained standard Transformer models and adopted various techniques for effectively training our models, including data selection, backtranslation, in-domain fine-tuning and model ensemble." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-5", "text": "We show that these training techniques improved the performance of our Transformer models up to 5 BLEU points." 
}, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-6", "text": "We give a detailed analysis of the performance of our system." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-7", "text": "----------------------------------" }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-9", "text": "This paper describes the DFKI-NMT submission to the WMT19 News translation task." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-10", "text": "We participated in both English-to-German and German-toEnglish directions." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-11", "text": "We trained Transformer models (Vaswani et al., 2017) using Sockeye 1 (Hieber et al., 2017) ." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-12", "text": "Compared to RNN-based translation models (Bahdanau et al., 2014) , Transformer models can be trained very fast due to parallelizable self-attention networks." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-13", "text": "We applied several very useful techniques for effectively training our models." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-14", "text": "----------------------------------" }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-15", "text": "**DATA SELECTION**" }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-16", "text": "The parallel training data provided for German-English is quite large (38M sentence pairs)." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-17", "text": "Most of the parallel data is crawled from the Internet and is not in News domain." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-18", "text": "Out-ofdomain training data can hurt the translation performance on News test sets (Wang et al., 2017) and also significantly increase training time." 
}, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-19", "text": "Therefore, we trained neural language models on a large monolingual News corpus to perform data selection (Schamper et al., 2018) ." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-20", "text": "Back-translation Large monolingual data in the News domain is provided for both German and 1 https://github.com/awslabs/sockeye English, which can be back-translated as additional parallel training data for our system (Sennrich et al., 2016a; Fadaee and Monz, 2018) ." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-21", "text": "The back-translated parallel data is in the News domain, which is a big advantage compared to outof-domain parallel training data provided for the News task." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-22", "text": "In-domain Fine-tuning The Transformer models were finally fine-tuned using the small in-domain parallel data provided for the News task (Luong and Manning, 2015; Schamper et al., 2018) ." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-23", "text": "Note that the large back-translated parallel data is also in-domain, but it has relatively low quality due to translation errors." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-24", "text": "----------------------------------" }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-25", "text": "**MODEL ENSEMBLE**" }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-26", "text": "We trained two Transformer models with different sizes, Transformer-base and Transformer-big." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-27", "text": "Our final submission is an ensemble of both models (Schamper et al., 2018) ." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-28", "text": "The ensemble of both models outperformed a single base or big model most likely because the two models can capture somewhat different features for the translation task." 
}, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-29", "text": "----------------------------------" }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-30", "text": "**SYSTEM DETAILS**" }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-31", "text": "----------------------------------" }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-32", "text": "**DATA SELECTION**" }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-33", "text": "The parallel data provided for the German-toEnglish and English-to-German tasks includes Europarl v9, ParaCrawl v3, Common Crawl corpus, News Commentary v14, Wiki Titles v1 and Document-split Rapid corpus." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-34", "text": "We also used old test sets (newstest2008 to newstest2017) for training our systems." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-35", "text": "We consider News Commentary v14 and old test sets as in-domain data and the rest as out-of-domain data." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-36", "text": "Compared to the in-domain data (356k sentence pairs), the size of the out-of-domain data (38M sentence pairs) is quite large, which makes the training process relatively slow and may also hurt the translation per- formance due to domain dismatch." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-37", "text": "Therefore, we performed data selection on out-of-domain data." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-38", "text": "Inspired by Schamper et al. (2018) 's work which used KenLM (Heafield, 2011) for data selection, we trained two neural language models based on self-attention networks using the 2018 part of the large monolingual News crawl corpus for English and German, respectively." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-39", "text": "Because these neural language models are trained on the News domain, we can use them to score out-ofdomain data." 
}, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-40", "text": "Sentences with higher probabilities are more likely to be in News domain." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-41", "text": "Equation 1 is used to score each sentence pair in the out-ofdomain corpus." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-42", "text": "In Equation 1, P s is the language model probability of the source sentence; N s is the length of the source sentence; P t is the language model probability of the target sentence; N t is the length of the target sentence." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-43", "text": "We selected the top 15M scored sentence pairs from out-ofdomain data for training our systems." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-44", "text": "The neural language models trained for data selection in our experiments are based on selfattention networks which can be trained very fast." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-45", "text": "Figure 1 (a) shows the structure of the standard Transformer translation model (Vaswani et al., 2017) and we removed the encoder and the attention layer in the decoder from the Transformer translation model to create our Transformer language model as shown in Figure 1 (b) ." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-46", "text": "For training efficiency, we used byte pair encoding (Sennrich et al., 2016b) to learn a vocabulary of 50k for English and German respectively." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-47", "text": "----------------------------------" }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-48", "text": "**BACK-TRANSLATION**" }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-49", "text": "We back-translated the 2018 part of the large monolingual in-domain News crawl data as additional training data for our translation systems." 
}, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-50", "text": "Fadaee and Monz (2018) showed that it is more beneficial to back-translate sentences that contain difficult words." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-51", "text": "In our experiments, we consider words which occur less than 1000 times in the bilingual training data as difficult words." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-52", "text": "Then we randomly selected 10M sentences which contain difficult words for back-translation." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-53", "text": "The mod- els used for back-translating monolingual data are baseline Transformers (Vaswani et al., 2017) trained on the bilingual data after data selection as described before." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-54", "text": "During back-translation, we used greedy search instead of beam search for efficiency." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-55", "text": "----------------------------------" }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-56", "text": "**MODEL AND TRAINING**" }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-57", "text": "We trained two Transformer models for each translation task as Transformer-base and Transformer-big." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-58", "text": "The settings of Transformerbase is the same as the baseline Transformer in Vaswani et al. (2017) 's work." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-59", "text": "For Transformerbig, we changed word embedding size into 1024 and kept other parameters unchanged." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-60", "text": "A joint vocabulary of 50k for German and English is learned by byte pair encoding (BPE) (Sennrich et al., 2016b) ." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-61", "text": "2 We set dropout to 0.1 for both Transformer-base and Transformer-big." 
}, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-62", "text": "We used adam (Kingma and Ba, 2014) for optimization." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-63", "text": "We used newstest2018 as the validation set for model training." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-64", "text": "The training processes for both Transformer-base and Transformer-big consist of three stages." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-65", "text": "Stage 1 We first trained the Transformers using bilingual training data, including all in-domain data and selected out-of-domain data as described in section 2.1." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-66", "text": "Note that the back-translated data was not used in this stage." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-67", "text": "Each training batch contains 8192 words and the validation frequency is 2000 batches." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-68", "text": "We set the initial learning rate to be 2.00e-04." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-69", "text": "We reduced the learning rate by a factor of 0.70 whenever the validation score does not improve 8 times." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-70", "text": "We stopped the training process after 5 times of learning rate reduction." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-71", "text": "Stage 2 We used all bilingual training data used in the first training stage together with the backtranslated monolingual data to continue training the models which had converged in the first training stage." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-72", "text": "We kept the batch size to be 8192 words and changed the validation frequency to 1000 batches." 
}, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-73", "text": "We set the initial learning rate to be 1.00e-05 and stopped the training process when the validation score does not improve 8 times." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-74", "text": "Stage 3 For fine-tuning, we used the small parallel in-domain data as described in section 2.1 to continue training the models which had converged in the second training stage." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-75", "text": "We changed batch size to be 1024 words and validation frequency to be 100 batches." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-76", "text": "We set the initial learning rate to be 1.00e-06 and stopped the training process when the validation score does not improve 8 times." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-77", "text": "Table 1 shows training data used in different training stages." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-78", "text": "The models trained in the first training stage were used to back-translate monolingual data as described in section 2.2." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-79", "text": "In Stage 2, we continued training the models which had converged in Stage 1 instead of training models with random initialization in order to reduce the training time of Stage 2." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-80", "text": "Table 2 shows the numbers of training epochs for different training stages and Table 3 shows the performance of our systems after different training stages." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-81", "text": "As we can see, back-translation (Stage 2) and in-domain fine-tuning (Stage 3) both improved the translation quality on a significant level." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-82", "text": "An ensemble of Stage 3 Transformer-base and Transformer-big achieved further improvements." 
}, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-83", "text": "We also tried to ensemble different checkpoints of Transformer-big, but achieved little improvement, likely because different checkpoints of Table 4 to analyze when and why our translation system makes mistakes." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-84", "text": "The translations in Table 4 are produced by our best system, i.e., ensemble of Transformer-base and Transformer-big after training stage 3." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-85", "text": "In Example 1, \"wei@@ dez@@ aun@@ projekt\" (pasture fence project) is wrongly translated into \"electric sound project\", likely because \"weidezaunprojekt\" is a unknown word and does not occur in the training data." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-86", "text": "Although BPE can help to relieve data sparsity by using smaller and more frequent sub-word units, the automatic BPE segmentation \"wei@@ dez@@ aun@@ projekt\" is a bad segmentation with linguistically meaningless sub-word pieces." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-87", "text": "A better segmentation \"weide@@(pasture) zaun@@(fence) projekt\" may help to reduce data sparsity and get better translation." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-88", "text": "Example 2 does not contain rare words, but \"nimmt vor\" is still wrongly translated into \"takes\"." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-89", "text": "This is likely because \"nimmt vor\" has different translations in the training data and the correct translation here \"targeting\" is relatively uncommon." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-90", "text": "We find many translation mistakes of our system are caused by rare words or uncommon usages of words as shown in Table 4 , which we will work on in the future." 
}, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-91", "text": "----------------------------------" }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-92", "text": "**RESULTS AND ANALYSIS**" }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-93", "text": "----------------------------------" }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-94", "text": "**CONCLUSION**" }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-95", "text": "This paper describes the DFKI-NMT submission to the WMT19 English-to-German and Germanto-English News translation tasks." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-96", "text": "We trained standard Transformer models and adopted various techniques for effectively training our models, including data selection, back-translation, indomain fine-tuning and model ensemble." }, { "sent_id": "d6c8b712c8fe3dd87d23886d575098-C001-97", "text": "We show that effective training techniques can improve the performance of standard Transformer models up to 5 BLEU points." } ], "y": { "@MOT@": { "gold_contexts": [ [ "d6c8b712c8fe3dd87d23886d575098-C001-18", "d6c8b712c8fe3dd87d23886d575098-C001-19" ] ], "cite_sentences": [ "d6c8b712c8fe3dd87d23886d575098-C001-19" ] }, "@USE@": { "gold_contexts": [ [ "d6c8b712c8fe3dd87d23886d575098-C001-22" ] ], "cite_sentences": [ "d6c8b712c8fe3dd87d23886d575098-C001-22" ] }, "@EXT@": { "gold_contexts": [ [ "d6c8b712c8fe3dd87d23886d575098-C001-27" ] ], "cite_sentences": [ "d6c8b712c8fe3dd87d23886d575098-C001-27" ] }, "@UNSURE@": { "gold_contexts": [ [ "d6c8b712c8fe3dd87d23886d575098-C001-38" ] ], "cite_sentences": [ "d6c8b712c8fe3dd87d23886d575098-C001-38" ] } } }, "ABC_9e8c386b7ea3b5e4e2843fb8382fd8_41": { "x": [ { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-2", "text": "This study demonstrates a weakness in how n-gram and PCFG surprisal are used to predict reading times in eye-tracking data." 
}, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-3", "text": "In particular, the information conveyed by words skipped during saccades is not usually included in the surprisal measures." }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-4", "text": "This study shows that correcting the surprisal calculation improves n-gram surprisal and that upcoming n-grams affect reading times, replicating previous findings of how lexical frequencies affect reading times." }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-5", "text": "In contrast, the predictivity of PCFG surprisal does not benefit from the surprisal correction despite the fact that lexical sequences skipped by saccades are processed by readers, as demonstrated by the corrected n-gram measure." }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-6", "text": "These results raise questions about the formulation of information-theoretic measures of syntactic processing such as PCFG surprisal and entropy reduction when applied to reading times." }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-7", "text": "----------------------------------" }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-9", "text": "Rare words and constructions produce longer reading times than their more frequent counterparts." }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-10", "text": "Such effects can be captured by n-grams and by probabilistic context-free grammar (PCFG) surprisal." }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-11", "text": "Surprisal theory predicts reading times will be directly proportional to the amount of information which must be processed, as calculated by a generative model, but the surprisal measures commonly used in eye-tracking studies omit probability estimates for words skipped in saccades." 
}, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-12", "text": "Therefore, the generative model assumed by those studies does not account for the information contributed by the skipped words even though those words must be processed by readers." }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-13", "text": "1 This deficiency can be addressed by summing surprisal measures over the saccade region (see Figure 1) , and the resulting cumulative n-grams have been shown to be more predictive of reading times than the usual non-cumulative n-grams (van Schijndel and Schuler, 2015) ." }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-14", "text": "However PCFG surprisal, which has a similar deficiency when non-cumulatively modeling reading times, has not previously been found to be predictive when accumulated over saccade regions." }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-15", "text": "This paper uses a reading time corpus to investigate two accumulation techniques (pre-and post-saccade) and finds that both forms of accumulation improve the fit of n-gram surprisal to reading times." }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-16", "text": "However, even though accumulated n-grams demonstrate that the lexical sequence of the saccade region is processed, PCFG surprisal does not seem to be improved by either accumulation technique." }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-17", "text": "The results of this work call into question the usual formulation of PCFG surprisal as a reading time predictor and suggest future directions for investigation of the influence of upcoming material on reading times." }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-18", "text": "Figure 1 : Eye movements jump between non-adjacent fixation regions (1, 2), while traditional n-gram measures are conditioned on the preceding adjacent context, which is never generated by the typical surprisal models used in eye-tracking studies." 
}, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-19", "text": "Cumulative n-grams sum the n-gram measures over the entire skipped region in order to better capture the information that readers need to process." }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-20", "text": "----------------------------------" }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-21", "text": "**DATA**" }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-22", "text": "This work makes use of the University College London (UCL) eye-tracking corpus (Frank et al., 2013) ." }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-23", "text": "Previous reading time studies have often used the Dundee corpus (Kennedy et al., 2003) , which only has data from 10 subjects." }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-24", "text": "In contrast, the UCL corpus has reading time data from 43 subjects who read sentences drawn from a series of self-published online novels." }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-25", "text": "The sentences in the corpus were presented as isolated sentences and in a random order." }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-26", "text": "The present work uses half of the corpus (every other sentence) for exploratory analyses, while the rest of the corpus is set aside for significance testing." }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-27", "text": "The corpus was parsed using the van Schijndel et al. (2013) left-corner parser, which outputs a wide variety of incremental complexity metrics computed during parsing (such as PCFG surprisal)." }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-28", "text": "5-gram back-off ngram probabilities were computed for each word using the KenLM toolkit (Heafield et al., 2013) trained on Gigaword 4.0 (Graff and Cieri, 2003) ." 
}, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-29", "text": "Models were fit to Box-Cox transformed firstpass reading times for all experiments in this paper (\u03bb \u2248 0.02; Box and Cox, 1964) ." }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-30", "text": "2 Fixation data was excluded from analysis if the fixation occurred on the first or last word of a sentence or line or if it followed an unusually long saccade, defined here and in previous work (Demberg and Keller, 2008) as a saccade over more than 4 words (2.5% of the UCL corpus)." }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-31", "text": "----------------------------------" }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-32", "text": "**EXPERIMENTS**" }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-33", "text": "3.1 Cumulative n-gram surprisal N -gram surprisal is conditioned on the preceding context (see Equation 1)." }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-34", "text": "As stated in the introduction, however, direct use of this factor in a reading time model ignores the fact that some or all of the preceding context may not be generated if the associated lexical targets were not previously fixated by readers (see Figure 1) ." }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-35", "text": "The lack of a generated condition results in a probability model that does not reflect the influence of words skipped during saccades." }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-36", "text": "This deficiency can be corrected by accumulating n-gram surprisal over the entire saccade region (see" }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-37", "text": "where w is a vector of input tokens, f t\u22121 is the index of the previous fixation, f t is the index of the current fixation." 
}, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-38", "text": "The linear mixed model 3 that was used in this experiment included item, subject, and sentence ID-crossed-with-subject random intercepts 4 as well as by-subject random slopes and fixed effects for the following predictors: sentence position (sentpos), word length (wlen), region length (rlen), 5 whether the previous word was fixated (prevfix), 5-grams and cumulative 5-grams." }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-39", "text": "Likelihood ratio tests were used to compare the mixed model with and without fixed effects for the 5-gram measures (see Table 1 )." }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-40", "text": "In line with previous findings on the Dundee corpus (van Schijndel and Schuler, 2015) , cumulative 5-grams provide a significant improvement over basic n-grams (p < 0.001), but unlike previous work, basic n-grams do not improve over cumulative n-grams on this corpus (p > 0.05)." }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-41", "text": "The benefit of cumulative n-grams suggests that the lexical processing of words skipped during a saccade has a time cost similar to directly fixated words." }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-42", "text": "----------------------------------" }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-43", "text": "**CUMULATIVE PCFG SURPRISAL**" }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-44", "text": "Probabilistic context-free grammar (PCFG) surprisal is similar to n-gram surprisal in that it is also conditioned on preceding context, but PCFG surprisal is conditioned on hierarchic structure rather than on linear lexical sequences (see Equation 3 )." }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-45", "text": "PCFG surprisal, therefore, suffers from the same deficiency as non-cumulative n-gram surprisal when modeling reading times: the condition context is never generated by the model." 
}, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-46", "text": "where w is a vector of input tokens, f t\u22121 is the index of the previous fixation, f t is the index of the current fixation, T is a random variable over syntactic trees and T i is a terminal symbol in a tree." }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-47", "text": "This experiment tested both PCFG surprisal predictors as fixed effects over the baseline from the previous section (now including cumulative n-gram surprisal as a fixed and by-subject random effect)." }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-48", "text": "Accumulated PCFG surprisal (see Equation 4) did not improve reading time fit (p > 0.05), unlike n-gram surprisal, which replicates a previous result using the Dundee corpus (van Schijndel and Schuler, 2015) ." }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-49", "text": "In fact, not even basic PCFG surprisal was predictive (p > 0.05) over this baseline model in the UCL corpus, whereas it was predictive over this baseline in the Dundee corpus." }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-50", "text": "Posthoc testing on the exploratory data partition revealed that PCFG surprisal becomes predictive on the UCL corpus when the n-gram predictors are removed from the baseline (p < 0.001), which could indicate that PCFG surprisal may simply help predict reading times when the n-gram model is too weak." }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-51", "text": "Alternatively, since UCL sentences were chosen for their brevity during corpus construction, there just may not be enough syntactic complexity in the corpus to provide an advantage to PCFG surprisal over the n-gram measures, which would explain why PCFG surprisal is still predictive for Dundee reading times where there is greater syntactic complexity." 
}, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-52", "text": "However, since cumulative n-gram surprisal is a better predictor of reading times than basic n-gram surprisal, it is conceivable that some other cumulative PCFG surprisal feature could still show up as predictive of UCL reading times even when basic PCFG surprisal fails to be predictive on this corpus." }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-53", "text": "The next experiment formulates a new calculation of cumulative surprisal to explore this possibility." }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-54", "text": "----------------------------------" }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-55", "text": "**CUMULATIVE SUCCESSOR SURPRISAL**" }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-56", "text": "In addition to past context, reading times can be influenced by the upcoming words that follow a fixation." }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-57", "text": "Such effects have been observed for orthographic and lexical influences and are called successor effects (Kliegl et al., 2006 Table 2 : Goodness of fit of future n-grams and future surprisal to reading times." }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-58", "text": "Significance testing was performed between each model and the models in the section above it." }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-59", "text": "Significance for the Base+Both model applies to improvement over the Base+Future-PCFG model." }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-60", "text": "* p < 0.001 will generalize to something as latent as the syntactic structure underlying upcoming lexical material." 
}, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-61", "text": "That is, instead of accumulating the surprisal condition over the region prior to and including each fixated target, this section attempts to accumulate upcoming syntactic structure over the region following each fixated target." }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-62", "text": "Using the example in Figure 1 , part of the time spent at fixation 1 might be caused by the complexity of the upcoming material: 'apple', 'that', etc." }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-63", "text": "Therefore, this work compares the predictivity of future cumulative n-gram surprisal (see Equation 5 ) and future cumulative PCFG surprisal (see Equation 6) over the n-gram baseline from Section 3.2 on the UCL corpus (see Table 2 )." }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-64", "text": "future-PCFG(w," }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-65", "text": "where again w is a vector of input tokens, f t is the index of the current fixation, f t+1 is the index of the next fixation, T is a random variable over syntactic trees and T i is a terminal symbol in a tree." }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-66", "text": "Future cumulative PCFG surprisal ceases to be predictive when future-n-grams are in the model, though future-n-grams are predictive over future PCFG surprisal (p < 0.001)." }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-67", "text": "Therefore, while future PCFG surprisal appears to be a significant predictor of reading times on its own (p < 0.001), it seems largely eclipsed by the upcoming lexical information." }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-68", "text": "Further, the present study replicated this result on the Dundee corpus (Kennedy et al., 2003) where, although noncumulative PCFG surprisal is predictive over the n-gram baseline on that corpus, future-PCFG surprisal is still not predictive (p > 0.05)." 
}, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-69", "text": "Together, these findings suggest that PCFG surprisal does not accumulate, despite evidence that skipped lexical items are processed with some time cost." }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-70", "text": "3.4 Limitations of successor n-grams Angele et al. (2015) demonstrated that the predictivity of successor effects cannot be exclusively driven by parafoveal preview; instead, the influence of successor effects may arise from sequence prediction, which could happen, for example, if the parser operates over super-lexical chunks (Hale, 2014) ." }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-71", "text": "This section investigates the extent of n-gram successor predictivity on the UCL corpus." }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-72", "text": "On the exploration partition, four cumulative 5-gram successor predictors are tested which utilize look-ahead for 1-word, 2-words, 3-words, or 4-words." }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-73", "text": "6 Each future n-gram variant is evaluated based on how it improves over the baseline in Section 3.2." }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-74", "text": "Although there are 3-and 4-word saccades in the data, 2-word future n-grams provide the best fit to the data even on the held-out data partition (p < 0.001)." }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-75", "text": "In contrast, Angele et al. (2015) previously found that successor effects were mainly driven by the word following the target fixation, which suggests that the successor effect observed by Angele et al. may only account for a subset of the successor influences on reading times." }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-76", "text": "It's possible that parafoveal preview, which was not possible in the masked condition of the Angele et al. 
(2015) study, accounts for the additional look-ahead observed in this work (e.g., parafoveal look-ahead could help with the word following the target, and the predictive effect observed by Angele et al. could help with the next word), but additional investigation of this hypothesis is left for future work." }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-77", "text": "----------------------------------" }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-78", "text": "**DISCUSSION**" }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-79", "text": "This work has confirmed previous findings that cumulative n-grams provide a better model of reading times than their typical non-cumulative counterparts (van Schijndel and Schuler, 2015) ." }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-80", "text": "In addition, this work has confirmed previous findings that upcoming lexical items can affect reading times in an n-gram successor effect (Kliegl et al., 2007; Angele et al., 2015) , presumably ruling out incompatible expectations before directly fixating on that material or so that such material can be skipped via saccade." }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-81", "text": "The fact that cumulative n-gram models strongly predict reading times suggests PCFG surprisal should be similarly affected, but this work has failed to find such an effect either before or at each given target word." }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-82", "text": "The improved reading time fit for accumulated n-gram surprisal suggests that the material skipped during a saccade is processed with a reading time cost." 
}, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-83", "text": "Therefore, although PCFG surprisal has previously been found to predict reading times over an n-gram baseline (Boston et al., 2008; Demberg and Keller, 2008) , the lack of accumulation raises questions about PCFG surprisal as a predictor of the reading time influence of syntactic processing." }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-84", "text": "Finally, the existence of n-gram successor effects raises questions about other informationtheoretic measures such as entropy reduction (Hale, 2006) ." }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-85", "text": "Entropy reduction measures the change in uncertainty at each new word." }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-86", "text": "In practice, the entropy of an observation is often approximated by estimating uncertainty about the next word in a sequence given the preceding observations, but this measurement does not make much sense if the following two words are already being integrated along with the target observation (i.e. there is very little to no uncertainty about the next word in the sequence)." }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-87", "text": "Thus, the frontier of processing must be determined for a well-motivated measure of entropy reduction." }, { "sent_id": "9e8c386b7ea3b5e4e2843fb8382fd8-C001-88", "text": "In conclusion, the results of this study provide greater insight into how lexical sequence information is processed during reading, providing stronger baseline measures against which to test higher level theories of sentence processing in the future." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "9e8c386b7ea3b5e4e2843fb8382fd8-C001-13" ] ], "cite_sentences": [ "9e8c386b7ea3b5e4e2843fb8382fd8-C001-13" ] }, "@SIM@": { "gold_contexts": [ [ "9e8c386b7ea3b5e4e2843fb8382fd8-C001-40" ], [ "9e8c386b7ea3b5e4e2843fb8382fd8-C001-48" ], [ "9e8c386b7ea3b5e4e2843fb8382fd8-C001-79" ] ], "cite_sentences": [ "9e8c386b7ea3b5e4e2843fb8382fd8-C001-40", "9e8c386b7ea3b5e4e2843fb8382fd8-C001-48", "9e8c386b7ea3b5e4e2843fb8382fd8-C001-79" ] }, "@DIF@": { "gold_contexts": [ [ "9e8c386b7ea3b5e4e2843fb8382fd8-C001-40" ] ], "cite_sentences": [ "9e8c386b7ea3b5e4e2843fb8382fd8-C001-40" ] } } }, "ABC_f59a8c650583343fe372db42fc109a_41": { "x": [ { "sent_id": "f59a8c650583343fe372db42fc109a-C001-48", "text": "**BACK-TRANSLATION**" }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-2", "text": "This paper describes the DFKI-NMT submission to the WMT19 News translation task." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-3", "text": "We participated in both English-to-German and German-to-English directions." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-4", "text": "We trained standard Transformer models and adopted various techniques for effectively training our models, including data selection, backtranslation, in-domain fine-tuning and model ensemble." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-5", "text": "We show that these training techniques improved the performance of our Transformer models up to 5 BLEU points." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-6", "text": "We give a detailed analysis of the performance of our system." 
}, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-7", "text": "----------------------------------" }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-9", "text": "This paper describes the DFKI-NMT submission to the WMT19 News translation task." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-10", "text": "We participated in both English-to-German and German-toEnglish directions." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-11", "text": "We trained Transformer models (Vaswani et al., 2017) using Sockeye 1 (Hieber et al., 2017) ." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-12", "text": "Compared to RNN-based translation models (Bahdanau et al., 2014) , Transformer models can be trained very fast due to parallelizable self-attention networks." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-13", "text": "We applied several very useful techniques for effectively training our models." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-14", "text": "----------------------------------" }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-15", "text": "**DATA SELECTION**" }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-16", "text": "The parallel training data provided for German-English is quite large (38M sentence pairs)." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-17", "text": "Most of the parallel data is crawled from the Internet and is not in News domain." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-18", "text": "Out-ofdomain training data can hurt the translation performance on News test sets (Wang et al., 2017) and also significantly increase training time." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-19", "text": "Therefore, we trained neural language models on a large monolingual News corpus to perform data selection (Schamper et al., 2018) ." 
}, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-20", "text": "Back-translation Large monolingual data in the News domain is provided for both German and 1 https://github.com/awslabs/sockeye English, which can be back-translated as additional parallel training data for our system (Sennrich et al., 2016a; Fadaee and Monz, 2018) ." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-21", "text": "The back-translated parallel data is in the News domain, which is a big advantage compared to outof-domain parallel training data provided for the News task." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-22", "text": "In-domain Fine-tuning The Transformer models were finally fine-tuned using the small in-domain parallel data provided for the News task (Luong and Manning, 2015; Schamper et al., 2018) ." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-23", "text": "Note that the large back-translated parallel data is also in-domain, but it has relatively low quality due to translation errors." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-24", "text": "----------------------------------" }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-25", "text": "**MODEL ENSEMBLE**" }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-26", "text": "We trained two Transformer models with different sizes, Transformer-base and Transformer-big." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-27", "text": "Our final submission is an ensemble of both models (Schamper et al., 2018) ." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-28", "text": "The ensemble of both models outperformed a single base or big model most likely because the two models can capture somewhat different features for the translation task." 
}, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-29", "text": "----------------------------------" }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-30", "text": "**SYSTEM DETAILS**" }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-31", "text": "----------------------------------" }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-32", "text": "**DATA SELECTION**" }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-33", "text": "The parallel data provided for the German-toEnglish and English-to-German tasks includes Europarl v9, ParaCrawl v3, Common Crawl corpus, News Commentary v14, Wiki Titles v1 and Document-split Rapid corpus." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-34", "text": "We also used old test sets (newstest2008 to newstest2017) for training our systems." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-35", "text": "We consider News Commentary v14 and old test sets as in-domain data and the rest as out-of-domain data." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-36", "text": "Compared to the in-domain data (356k sentence pairs), the size of the out-of-domain data (38M sentence pairs) is quite large, which makes the training process relatively slow and may also hurt the translation per- formance due to domain dismatch." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-37", "text": "Therefore, we performed data selection on out-of-domain data." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-38", "text": "Inspired by Schamper et al. (2018) 's work which used KenLM (Heafield, 2011) for data selection, we trained two neural language models based on self-attention networks using the 2018 part of the large monolingual News crawl corpus for English and German, respectively." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-39", "text": "Because these neural language models are trained on the News domain, we can use them to score out-ofdomain data." 
}, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-40", "text": "Sentences with higher probabilities are more likely to be in News domain." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-41", "text": "Equation 1 is used to score each sentence pair in the out-ofdomain corpus." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-42", "text": "In Equation 1, P s is the language model probability of the source sentence; N s is the length of the source sentence; P t is the language model probability of the target sentence; N t is the length of the target sentence." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-43", "text": "We selected the top 15M scored sentence pairs from out-ofdomain data for training our systems." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-44", "text": "The neural language models trained for data selection in our experiments are based on selfattention networks which can be trained very fast." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-45", "text": "Figure 1 (a) shows the structure of the standard Transformer translation model (Vaswani et al., 2017) and we removed the encoder and the attention layer in the decoder from the Transformer translation model to create our Transformer language model as shown in Figure 1 (b) ." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-46", "text": "For training efficiency, we used byte pair encoding (Sennrich et al., 2016b) to learn a vocabulary of 50k for English and German respectively." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-47", "text": "----------------------------------" }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-49", "text": "We back-translated the 2018 part of the large monolingual in-domain News crawl data as additional training data for our translation systems." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-50", "text": "Fadaee and Monz (2018) showed that it is more beneficial to back-translate sentences that contain difficult words." 
}, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-51", "text": "In our experiments, we consider words which occur less than 1000 times in the bilingual training data as difficult words." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-52", "text": "Then we randomly selected 10M sentences which contain difficult words for back-translation." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-53", "text": "The mod- els used for back-translating monolingual data are baseline Transformers (Vaswani et al., 2017) trained on the bilingual data after data selection as described before." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-54", "text": "During back-translation, we used greedy search instead of beam search for efficiency." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-55", "text": "----------------------------------" }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-56", "text": "**MODEL AND TRAINING**" }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-57", "text": "We trained two Transformer models for each translation task as Transformer-base and Transformer-big." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-58", "text": "The settings of Transformerbase is the same as the baseline Transformer in Vaswani et al. (2017) 's work." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-59", "text": "For Transformerbig, we changed word embedding size into 1024 and kept other parameters unchanged." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-60", "text": "A joint vocabulary of 50k for German and English is learned by byte pair encoding (BPE) (Sennrich et al., 2016b) ." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-61", "text": "2 We set dropout to 0.1 for both Transformer-base and Transformer-big." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-62", "text": "We used adam (Kingma and Ba, 2014) for optimization." 
}, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-63", "text": "We used newstest2018 as the validation set for model training." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-64", "text": "The training processes for both Transformer-base and Transformer-big consist of three stages." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-65", "text": "Stage 1 We first trained the Transformers using bilingual training data, including all in-domain data and selected out-of-domain data as described in section 2.1." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-66", "text": "Note that the back-translated data was not used in this stage." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-67", "text": "Each training batch contains 8192 words and the validation frequency is 2000 batches." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-68", "text": "We set the initial learning rate to be 2.00e-04." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-69", "text": "We reduced the learning rate by a factor of 0.70 whenever the validation score does not improve 8 times." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-70", "text": "We stopped the training process after 5 times of learning rate reduction." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-71", "text": "Stage 2 We used all bilingual training data used in the first training stage together with the backtranslated monolingual data to continue training the models which had converged in the first training stage." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-72", "text": "We kept the batch size to be 8192 words and changed the validation frequency to 1000 batches." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-73", "text": "We set the initial learning rate to be 1.00e-05 and stopped the training process when the validation score does not improve 8 times." 
}, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-74", "text": "Stage 3 For fine-tuning, we used the small parallel in-domain data as described in section 2.1 to continue training the models which had converged in the second training stage." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-75", "text": "We changed batch size to be 1024 words and validation frequency to be 100 batches." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-76", "text": "We set the initial learning rate to be 1.00e-06 and stopped the training process when the validation score does not improve 8 times." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-77", "text": "Table 1 shows training data used in different training stages." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-78", "text": "The models trained in the first training stage were used to back-translate monolingual data as described in section 2.2." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-79", "text": "In Stage 2, we continued training the models which had converged in Stage 1 instead of training models with random initialization in order to reduce the training time of Stage 2." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-80", "text": "Table 2 shows the numbers of training epochs for different training stages and Table 3 shows the performance of our systems after different training stages." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-81", "text": "As we can see, back-translation (Stage 2) and in-domain fine-tuning (Stage 3) both improved the translation quality on a significant level." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-82", "text": "An ensemble of Stage 3 Transformer-base and Transformer-big achieved further improvements." 
}, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-83", "text": "We also tried to ensemble different checkpoints of Transformer-big, but achieved little improvement, likely because different checkpoints of Table 4 to analyze when and why our translation system makes mistakes." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-84", "text": "The translations in Table 4 are produced by our best system, i.e., ensemble of Transformer-base and Transformer-big after training stage 3." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-85", "text": "In Example 1, \"wei@@ dez@@ aun@@ projekt\" (pasture fence project) is wrongly translated into \"electric sound project\", likely because \"weidezaunprojekt\" is a unknown word and does not occur in the training data." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-86", "text": "Although BPE can help to relieve data sparsity by using smaller and more frequent sub-word units, the automatic BPE segmentation \"wei@@ dez@@ aun@@ projekt\" is a bad segmentation with linguistically meaningless sub-word pieces." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-87", "text": "A better segmentation \"weide@@(pasture) zaun@@(fence) projekt\" may help to reduce data sparsity and get better translation." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-88", "text": "Example 2 does not contain rare words, but \"nimmt vor\" is still wrongly translated into \"takes\"." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-89", "text": "This is likely because \"nimmt vor\" has different translations in the training data and the correct translation here \"targeting\" is relatively uncommon." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-90", "text": "We find many translation mistakes of our system are caused by rare words or uncommon usages of words as shown in Table 4 , which we will work on in the future." 
}, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-91", "text": "----------------------------------" }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-92", "text": "**RESULTS AND ANALYSIS**" }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-93", "text": "----------------------------------" }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-94", "text": "**CONCLUSION**" }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-95", "text": "This paper describes the DFKI-NMT submission to the WMT19 English-to-German and Germanto-English News translation tasks." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-96", "text": "We trained standard Transformer models and adopted various techniques for effectively training our models, including data selection, back-translation, indomain fine-tuning and model ensemble." }, { "sent_id": "f59a8c650583343fe372db42fc109a-C001-97", "text": "We show that effective training techniques can improve the performance of standard Transformer models up to 5 BLEU points." 
} ], "y": { "@USE@": { "gold_contexts": [ [ "f59a8c650583343fe372db42fc109a-C001-11" ], [ "f59a8c650583343fe372db42fc109a-C001-52", "f59a8c650583343fe372db42fc109a-C001-53" ], [ "f59a8c650583343fe372db42fc109a-C001-58" ] ], "cite_sentences": [ "f59a8c650583343fe372db42fc109a-C001-11", "f59a8c650583343fe372db42fc109a-C001-53", "f59a8c650583343fe372db42fc109a-C001-58" ] }, "@EXT@": { "gold_contexts": [ [ "f59a8c650583343fe372db42fc109a-C001-45" ] ], "cite_sentences": [ "f59a8c650583343fe372db42fc109a-C001-45" ] }, "@SIM@": { "gold_contexts": [ [ "f59a8c650583343fe372db42fc109a-C001-58" ] ], "cite_sentences": [ "f59a8c650583343fe372db42fc109a-C001-58" ] } } }, "ABC_9e491cad55265802275e9aaea9faae_41": { "x": [ { "sent_id": "9e491cad55265802275e9aaea9faae-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-2", "text": "We show that the task of question answering (QA) can significantly benefit from the transfer learning of models trained on a different large, fine-grained QA dataset." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-3", "text": "We achieve the state of the art in two well-studied QA datasets, WikiQA and SemEval-2016 (Task 3A), through a basic transfer learning technique from SQuAD." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-4", "text": "For WikiQA, our model outperforms the previous best model by more than 8%." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-5", "text": "We demonstrate that finer supervision provides better guidance for learning lexical and syntactic information than coarser supervision, through quantitative results and visual analysis." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-6", "text": "We also show that a similar transfer learning procedure achieves the state of the art on an entailment task." 
}, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-7", "text": "----------------------------------" }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-9", "text": "Question answering (QA) is a long-standing challenge in NLP, and the community has introduced several paradigms and datasets for the task over the past few years." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-10", "text": "These paradigms differ from each other in the type of questions and answers and the size of the training data, from a few hundreds to millions of examples." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-11", "text": "We are particularly interested in the contextaware QA paradigm, where the answer to each question can be obtained by referring to its accompanying context (paragraph or a list of sentences)." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-12", "text": "Under this setting, the two most notable types of supervisions are coarse sentence-level and finegrained span-level." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-13", "text": "In sentence-level QA, the task is to pick sentences that are most relevant to the question among a list of candidates (Yang et al., 2015) ." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-14", "text": "In span-level QA, the task is to locate the smallest span in the given paragraph that answers the question (Rajpurkar et al., 2016) ." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-15", "text": "In this paper, we address coarser, sentencelevel QA through a standard transfer learning 1 technique of a model trained on a large, spansupervised QA dataset." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-16", "text": "We demonstrate that the target task not only benefits from the scale of the source dataset but also the capability of the finegrained span supervision to better learn syntactic and lexical information." 
}, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-17", "text": "For the source dataset, we pretrain on SQuAD (Rajpurkar et al., 2016) , a recentlyreleased, span-supervised QA dataset." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-18", "text": "For the source and target models, we adopt BiDAF (Seo et al., 2016) , one of the top-performing models in the dataset's leaderboard." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-19", "text": "For the target datasets, we evaluate on two recent QA datasets, WikiQA (Yang et al., 2015) and Se-mEval 2016 (Task 3A) , which possess sufficiently different characteristics from that of SQuAD." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-20", "text": "Our results show 8% improvement in WikiQA and 1% improevement in Se-mEval." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-21", "text": "In addition, we report state-of-the-art results on recognizing textual entailment (RTE) in SICK (Marelli et al., 2014) with a similar transfer learning procedure." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-22", "text": "----------------------------------" }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-23", "text": "**BACKGROUND AND DATA**" }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-24", "text": "Modern machine learning models, especially deep neural networks, often significantly benefit from transfer learning." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-25", "text": "In computer vision, deep convolutional neural networks trained on a large image classification dataset such as ImageNet (Deng et al., 2009 ) have proved to be useful for initializing models on other vision tasks, such as object detection (Zeiler and Fergus, 2014) ." 
}, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-26", "text": "In natural language processing, domain adaptation has traditionally been an important topic for syntactic parsing (McClosky et al., 2010) and named entity recognition (Chiticariu et al., 2010) , among others." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-27", "text": "With the popularity of distributed representation, pre-trained word embedding models such as word2vec (Mikolov et al., 2013) are also widely used for natural language tasks (Karpathy and Fei-Fei, 2015; Kumar et al., 2016) ." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-28", "text": "Instead of these, we initialize our models from a QA dataset and show how standard transfer learning can achieve state-of-the-art in target QA datasets." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-29", "text": "There have been several QA paradigms in NLP, which can be categorized by the context and supervision used to answer questions." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-30", "text": "This context can range from structured and confined knowledge bases (Berant et al., 2013) to unstructured and unbounded natural language form (e.g., documents on the web (Voorhees and Tice, 2000)) and unstructured, but restricted in size (e.g., a paragraph or multiple sentences (Hermann et al., 2015) )." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-31", "text": "The recent advances in neural question answering lead to numerous datasets and successful models in these paradigms (Rajpurkar et al., 2016; Yang et al., 2015; Nguyen et al., 2016; Trischler et al., 2016 )." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-32", "text": "The answer types in these datasets are largely divided into three categories: sentencelevel, in-context span, and generation." 
}, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-33", "text": "In this paper, we specifically focus on the former two and show that span-supervised models can better learn syntactic and lexical features." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-34", "text": "Among these datasets, we briefly describe three QA datasets to be used for the experiments in this paper." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-35", "text": "We also give the description of an RTE dataset for an example of a non-QA task." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-36", "text": "Refer to Table 1 to see the examples of the datasets." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-37", "text": "SQuAD (Rajpurkar et al., 2016 ) is a recent spanbased QA dataset, containing 100k/10k train/dev examples." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-38", "text": "Each example is a pair of context paragraph from Wikipedia and a question created by a human, and the answer is a span in the context." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-39", "text": "SQUAD-T is our modification of SQuAD dataset to allow for sentence selection QA." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-40", "text": "('T' for senTence)." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-41", "text": "We split the context paragraph into sentences and formulate the task as classifying whether each sentence contains the answer." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-42", "text": "This enables us to make a fair comparison between pretraining with span-supervised and sentencesupervised QA datasets." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-43", "text": "WikiQA (Yang et al., 2015) is a sentence-level QA dataset, containing 1.9k/0.3k train/dev answerable examples." 
}, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-44", "text": "Each example consists of a real user's Bing query and a snippet of a Wikipedia article retrieved by Bing, containing 18.6 sentences on average." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-45", "text": "The task is to classify whether each sentence provides the answer to the query." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-46", "text": "SemEval 2016 (Task 3A) is a sentence-level QA dataset, containing 1.8k/0.2k/0.3k train/dev/test examples." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-47", "text": "Each example consists of a community question by a user and 10 comments." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-48", "text": "The task is to classify whether each comment is relevant to the question." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-49", "text": "SICK (Marelli et al., 2014 ) is a dataset for recognizing textual entailment (RTE), containing 4.5K/0.5K/5.0K train/dev/test examples." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-50", "text": "Each example consists of a hypothesis and a premise, and the goal is to determine if the premise is entailed by, contradicts, or is neutral to the hypothesis (hence classification problem)." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-51", "text": "We also report results on SICK to show that span-supervised QA dataset can be also useful for non-QA datasets." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-52", "text": "----------------------------------" }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-53", "text": "**MODEL**" }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-54", "text": "Among numerous models proposed for spanlevel QA tasks (Xiong et al., 2016; Wang and Jiang, 2016b) , we adopt an open-sourced model, BiDAF 2 (Seo et al., 2016) ." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-55", "text": "BiDAF." 
}, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-56", "text": "The inputs to the model are a question q, and a context paragraph x. BiDAF uses recurrent neural networks to model sequential dependencies within each question and context paragraph and use attention mechanism to model the interaction between them." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-57", "text": "The last layer of BiDAF is the answer module, which produces the psuedo-probability distributions of the start and the end positions of the answer span, y start , y end \u2208 [0, 1] N , where N is the length of the context words." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-58", "text": "Then the best answer span is arg max (i,j) y start i y end j , where i <= j. Here, we briefly describe the answer module which is important for transfer learning to sentence-level QA." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-59", "text": "The input to the answer module is a sequence of vectors {h i } each of which encodes enough information about the i-th context word and its relationship with its surrounding words and the question words." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-60", "text": "Then the role of the answer module is to map each vector h i to its start and end position probabilities, y start i and y end i ." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-61", "text": "BiDAF-T refers to the modified version of BiDAF to make it compatible with sentence-level QA." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-62", "text": "In this task, the inputs are a question q and a list of sentences, x 1 , . . . , x T , where T is the number of the sentences." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-63", "text": "Note that, unlike BiDAF, which outputs single answer per example, Here we need to output a C-way classification for each k-th sentence." 
}, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-64", "text": "Since BiDAF is a span-selection model, it cannot be directly used for sentence-level classification." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-65", "text": "Hence we replace the original answer module of BiDAF with a different answer module, and keep the other modules identical to those of BiDAF." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-66", "text": "Given the input to the new answer module, {h k 1 , . . . , h k N }, where the superscript is the sentence index (1 \u2264 k \u2264 T ), we obtain the C-way classification scores for the k-th sentence, y k \u2208 [0, 1] C via max-pooling method:" }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-67", "text": "where W \u2208 R C\u00d7d , b \u2208 R C are trainable weight matrix and bias, respectively, and max() function is applied elementwise." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-68", "text": "For WikiQA and SemEval 2016, the number of classes (C) is 2, i.e. each sentence (or comment) is either relevant or not relevant." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-69", "text": "Since some of the metrics used for these datasets require full ranking, we use the predicted probability for \"relevant\" label to rank the sentences." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-70", "text": "Note that BiDAF-T can be also used for the RTE dataset, where we can consider the hypothesis as a question and the premise as a context sentence (T = 1), and classify each example into 'entailment', 'neutral', or 'contradiction' (C = 3)." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-71", "text": "Transfer Learning." 
}, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-72", "text": "Transfer learning between the same model architectures 3 is straightforward; we first initialize the weights of the target model with the weights of the source model pretrained on the source dataset, and then we further train (finetune) on the target model with the target dataset." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-73", "text": "To transfer from BiDAF (on SQuAD) to BiDAF-T, we transfer all the weights of the identical modules, and initialize the new answer module in BiDAF-T with random values." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-74", "text": "For more training details, refer to Appendix ??." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-75", "text": "----------------------------------" }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-76", "text": "**EXPERIMENTS**" }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-77", "text": "Pretrained Question Answering Results." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-78", "text": "Table 2 reports the state-of-the-art results of our transfer learning on WikiQA and SemEval-2016 and the performance of previous models as well as several ablations that use no pretraining or no finetuning." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-79", "text": "There are multiple interesting observations from Table 2 ." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-80", "text": "First, if we only train the BiDAF-T model on the target datasets (first row of Table 2), the results are poor." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-81", "text": "This shows the effect of both pretraining and finetuning." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-82", "text": "Second, pretraining on SQuAD and SQuAD-T with no finetuning (second and third row) achieves results close to the state-of-the-art in the WikiQA dataset, but not in SemEval-2016." 
}, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-83", "text": "Interestingly, our result on SemEval-2016 is not better than only training without transfer learning." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-84", "text": "We conjecture that this is due to the significant difference between the domain of SemEval-2016 and that of SQuAD." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-85", "text": "Third, pretraining on SQuAD and SQuAD-T with finetuning (fourth and fifth row) significantly outper- forms (by more than 5%) the highest-rank systems on WikiQA." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-86", "text": "It also outperforms the second ranking system in SemEval-2016 and is only 1% behind the first ranking system." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-87", "text": "Fourth, transfer learning models achieve better results with pretraining on span-supervision (SQuAD) than coarser sentence supervision (SQuAD-T)." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-88", "text": "Finally, we also use the ensemble of 12 different training runs on the same BiDAF architecture, which obtains the state of the art in both datasets." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-89", "text": "This system outperforms the highest-ranking system in WikiQA by more than 8% and the best system in SemEval-2016 by 1% in every metric." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-90", "text": "It is important to note that, while we definitely benefit from the scale of SQuAD for transfer learning to smaller WikiQA, given the gap between SQuAD-T and SQuAD (> 3%), we see a clear sign that span-supervision plays a significant role well." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-91", "text": "Analysis." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-92", "text": "Figure 1 shows the latently-learned attention maps between the question and one of the context sentences from a WikiQA example in Table 1 ." 
}, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-93", "text": "The top map is pretrained on SQuAD-T (corresponding to SQuAD-T&Y in Table 2 ) and the bottom map is pretrained on SQuAD (SQuAD&Y)." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-94", "text": "The more red the color, the higher the relevance between the words." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-95", "text": "There are two interesting observations here." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-96", "text": "First, in SQuAD-pretrained model (bottom), we see a high correspondence between question's airbus and context's aircraft and aerospace, but the SQuAD-T-pretrained model fails to learn such correspondence." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-97", "text": "Second, we see that the attention map of the SQuAD-pretrained model is more sparse, indicating that it is able to more precisely localize correspondence between question and context words." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-98", "text": "In fact, the average sparsity of all WikiQA test examples in SQuAD&Y is 0.84 while that in SQuAD-T&Y is 0.56." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-99", "text": "For more analyses and details, refer to Appendix ??." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-100", "text": "Entailment Result." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-101", "text": "In addition to QA experiments, we also show that the models trained on span-supervised QA can be useful for textual entailment task (RTE)." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-102", "text": "Table 3 shows the transfer learning results of BiDAF-T on SICK dataset (Marelli et al., 2014) , with various pretraining routines." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-103", "text": "Note that SNLI (Bowman et al., 2015) is a similar task to SICK and is significantly larger (150K/10K/10K train/dev/test examples)." 
}, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-104", "text": "Here we highlight three observations." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-105", "text": "First, BiDAF-T pretrained on SQuAD outperforms that without any pretraining by 6% and that pretrained on SQuAD-T by 2%, which demonstrates that the transfer learning from large span-based QA gives a clear improvement." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-106", "text": "Second, pretraining on SQuAD+SNLI outperforms pretraining on SNLI only." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-107", "text": "Given that SNLI is larger than SQuAD, the difference in their performance is a strong indicator that we are benefiting from not only the scale of SQuAD, but also the fine-grained supervision that it provides." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-108", "text": "Lastly, we outperform the previous state of the art by 2% with the ensemble of SQuAD+SNLI pretraining routine." }, { "sent_id": "9e491cad55265802275e9aaea9faae-C001-109", "text": "It is worth noting that Mou et al. (2016) also shows improvement on SICK by pretraining on SNLI." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "9e491cad55265802275e9aaea9faae-C001-13" ], [ "9e491cad55265802275e9aaea9faae-C001-31" ], [ "9e491cad55265802275e9aaea9faae-C001-43" ] ], "cite_sentences": [ "9e491cad55265802275e9aaea9faae-C001-13", "9e491cad55265802275e9aaea9faae-C001-31", "9e491cad55265802275e9aaea9faae-C001-43" ] }, "@USE@": { "gold_contexts": [ [ "9e491cad55265802275e9aaea9faae-C001-19" ] ], "cite_sentences": [ "9e491cad55265802275e9aaea9faae-C001-19" ] } } }, "ABC_086619bef9b4e7851bf42bf36eb14a_41": { "x": [ { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-35", "text": "----------------------------------" }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-36", "text": "**DATABASE**" }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-2", "text": "Hidden structural patterns in written texts have been subject of considerable research in the last decades." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-3", "text": "In particular, mapping a text into a time series of sentence lengths is a natural way to investigate text structure." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-4", "text": "Typically, sentence length have been quantified by using measures based on the number of words and the number of characters, but other variations are possible." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-5", "text": "To quantify the robustness of different sentence length measures, we analyzed a database containing about five hundred books in English." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-6", "text": "For each book, we extracted six distinct measures of sentence length, including number of words and number of characters (taking into account lemmatization and stop words removal)." 
}, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-7", "text": "We compared these six measures for each book by using i ) Pearson's coefficient to investigate linear correlations; ii ) Kolmogorov-Smirnov test to compare distributions; and iii ) detrended fluctuation analysis (DFA) to quantify auto-correlations." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-8", "text": "We have found that all six measures exhibit very similar behavior, suggesting that sentence length is a robust measure related to text structure." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-9", "text": "----------------------------------" }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-11", "text": "Several quantitative patterns have been identified in the structure of written texts." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-12", "text": "For example, the Zipf's law states that the frequency of a word in a text is inversely proportional to its rank [1, 2] ." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-13", "text": "Moreover, it was also observed that punctuation marks also obey this law [3] ." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-14", "text": "Heaps' or Herdan's law states that the vocabulary increases as a power law of a finite sample size of the text [4, 5] ." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-15", "text": "These laws have been generalized to account for more general patterns [6, 7, 8, 9, 10, 11, 12, 13] ." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-16", "text": "Mapping a text into a time series can also reveal hidden structural patterns." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-17", "text": "Methods applied to investigate text structure at word level include investigations on probability distributions [14] , correlations [15, 16, 17, 18, 19] , and networks properties [20, 21, 22, 23] ." 
}, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-18", "text": "Another way to investigate text structure is by the analysis of sentence lengths." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-19", "text": "Typically, each sentence carries a full message and transmits an idea in contrast with an isolated word." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-20", "text": "Therefore, mapping a text into a time series of sentence lengths is a natural way to investigate text structures." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-21", "text": "Recent methods applied to study texts at sentence level include probability distributions [24, 25, 26, 27] and correlations [26, 27, 17, 28] ." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-22", "text": "In recent studies, sentence length analysis have been related to style and authorship [29, 26, 27] ." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-23", "text": "In general, sentence lengths have been quantified by the number of words [24, 29, 25, 28] or characters [30, 31, 26, 27] ." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-24", "text": "Also, note that the recurrence time of full stops also quantifies the sentence length [3] ." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-25", "text": "However, there are other possible variations, such as word length or word frequency mappings, where all words are brought to lower case and punctuation marks are removed." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-26", "text": "Other possible variations are the removal of stop words and the lemmatization [21] ." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-27", "text": "In a similar way, the number of characters and variations related to lemmatization and stop words removal could also be considered at sentence level." 
}, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-28", "text": "However, to choose between words or characters to measure a sentence may lead to doubts." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-29", "text": "For instance, the Japanese language has three alphabets and the number of characters may differ for the same word [25] ." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-30", "text": "In the present work, we investigated the robustness of distinct measures of sentence lengths." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-31", "text": "By using a database containing about five hundred books in English, we extracted six distinct measures of sentence length." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-32", "text": "We employed three widely known tests to compare these six measures for each book: i ) Pearson's coefficient to investigate linear correlations; ii ) Kolmogorov-Smirnov test to compare distributions; and iii ) detrended fluctuation analysis (DFA) to quantify auto-correlations." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-33", "text": "We have found that all measures considered are very similar, indicating that sentence length is a robust measure related to text structure." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-34", "text": "This article is structured as follows: in section 2 we describe the data set in which the analysis were executed; in section 3 we performed our analyses comparing the six types of sentence length time series delineated above; and we summarized and discussed our results in section 4." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-37", "text": "Our database consists of all 501 fiction books written in English, from The Literature Network [32] ." 
}, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-38", "text": "This database includes famous titles such as Alice's Adventures in Wonderland For each book, we extracted six time series of sentence length measuring the: i ) number of words, N w ; ii ) number of characters, N c ; iii ) number of characters in lemmatized words, N l ; iv ) number of non-stop words, N Sw ; v ) number of characters in non-stop words, N Sc ; and vi ) number of characters in non-stop words lemmatized, N Sl ." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-39", "text": "Differences between the original text and their variants are exemplified in Table 1 ." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-40", "text": "Figures 1b and 1c illustrate two distinct time series of sentence length for a given book, counting words and characters, respectively." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-41", "text": "----------------------------------" }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-42", "text": "**RESULTS**" }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-43", "text": "----------------------------------" }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-44", "text": "**CORRELATIONS BETWEEN MAPPED TIME SERIES**" }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-45", "text": "To investigate similarities between the six mapped time series, we first analyzed linear correlations via the Pearson's coefficient [33] Figure 2a illustrates the correlation between two mapped series, Y = N w and X = N c , of a book where a monotonic behavior and a strong linear correlation is observed." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-46", "text": "The Pearson's coefficients for all comparisons for this book is shown in Fig. 2b ." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-47", "text": "According to our analysis, r > 0.91." 
}, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-48", "text": "Since we have six time series, there are 6 * (6 \u2212 1)/2 = 15 comparisons for each book, resulting in 7,515 correlations for all 501 books." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-49", "text": "The results above are quite similar to those found for all books for each linear correlation test, as can be seen in the cumulated distributions in Fig. 2c ." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-50", "text": "Among all these coefficients, the weakest and the strongest linear correlations were, respectively r Nw\u00d7N Sc = 0.91 (The Flood, Emile Zola, 1880) and r N Sc \u00d7N Sl = 0.99 (A Daughter of Eve, Honor\u00e9 de Balzac, 1838)." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-51", "text": "We also notice that the values of r are greater for comparisons among series that count characters." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-52", "text": "The mean value of all r's was 0.98." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-53", "text": "Since r is very close to one, the fluctuation of information due to the transformations has little implications." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-54", "text": "Due to these results, we estimate that, with good approximation for each book, the terms of one series can be written in terms of the other by a transformation such as" }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-55", "text": "where n c (n w ) is a term of the time series N c (N w ) and \u03b1 c and \u03b2 c are constants." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-56", "text": "Other tests such as the Goodman and Kruskal's \u03b3 test [34] , the Kendall's \u03c4 test [35] , and the Spearman's test [36] were also performed to study the correlations." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-57", "text": "These tests are based on the null hypothesis that the two distributions are independent." 
}, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-58", "text": "All comparisons for all books rejected the null hypothesis for these tests." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-59", "text": "Before concluding this subsection, we discuss Fig. 2a in connection with the MenzerathAltmann law." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-60", "text": "Basically, this law states that the bigger the whole, the smaller its parts and vice-versa." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-61", "text": "However, an issue about this relationship was addressed in [37] : \"Somewhat more problematic is the relation of sentence length to the word length.\" This comment is consistent with the one in [28] asserting that the Menzerath-Altmann law does not hold if the sentence length is measured in terms of characters instead of the number of words." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-62", "text": "In a similar way, the data of Fig. 2a indicates, in the context of this law, that the relationship between the number of words and characters is also problematic." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-63", "text": "----------------------------------" }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-64", "text": "**SIMILARITIES BETWEEN DISTRIBUTIONS**" }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-65", "text": "Next, we investigated the distribution of sentence lengths within each book, for each measure of sentence length considered here." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-66", "text": "This test exploits the distance \u03ba between two empirical cumulated distributions C 1 and C 2 :" }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-67", "text": "where sup is the supremum function; the smaller the \u03ba the greater the similarity between the distributions." 
}, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-68", "text": "Typically, the number of characters is greater than the number of words (Figs. 1b and 1c) ." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-69", "text": "Thus, to make a fair comparison, the all time series were normalized by their mean values prior to the KS test." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-70", "text": "Figure 3a and 3b illustrate some results for the book The Brothers Karamazov." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-71", "text": "The former shows the comparison between the cumulated distributions for N w and N c after normalization and the latter the Kolmogorov-Smirnov distance (\u03ba) for all comparisons." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-72", "text": "Note that the distances found are small (\u03ba \u2264 0.12), implying in the fact that these distributions can be drawn from the same distribution." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-73", "text": "We verified that 41.77% of these comparisons accepted the null hypothesis that the distributions are drawn from the same distribution, with a p-value \u2265 0.01." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-74", "text": "When considering the transformation 2, the results for this test (Fig. 3c) were of 89.74% success rate, which implies that the dependence proposed between the series are meaningful." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-75", "text": "This type of percentage for each type of comparison is exhibited in Table 2 ." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-76", "text": "As an extra remark, we notice that the complementary cumulative distribution (CCDF) as a function of the number of words, N w , of the data shown in Fig. 3a is consistent with unity, decreasing as the sentence length increases, and \u00b5 is a positive parameter." 
}, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-77", "text": "Because the maximum number of words per sentence of the The Brothers Karamazov (the biggest book in our database) goes roughly until N w \u223c 150, an additional comparison cannot be performed." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-78", "text": "----------------------------------" }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-79", "text": "**AUTO-CORRELATIONS**" }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-80", "text": "Detrended fluctuation analysis (DFA) have been largely used to study long-range correlations in time series [38, 39] ." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-81", "text": "Texts, for instance, were mapped using the sentence length measured as the number of characters and investigated via DFA [40, 26, 27] ." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-82", "text": "To exemplify the method, consider a generic sentence length series W = {w 1 , w 2 , \u00b7 \u00b7 \u00b7 , w N }." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-83", "text": "The first step is to calculate the integrated series Z = {z 1 , z 2 , \u00b7 \u00b7 \u00b7 , z N }, where z j = j i=1 w i , and subdivide it into s = N/m non overlapping partitions with m terms in each window." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-84", "text": "Each partition is fitted by a polynomial function of degree l and has its fluctuation computed." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-85", "text": "Averaging the fluctuations over the s segments, we obtain the fluctuation function, F (m), which is usually compared with a power law:" }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-86", "text": "where h is the scale exponent." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-87", "text": "If the second moment of Z converges, h is the Hurst exponent." 
}, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-88", "text": "For h > 0.5 (h < 0.5) the time series has persistent (anti-persistent) long-range correlation and for h = 0.5 there is no correlation or it is of short range." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-89", "text": "It is common to calculate the Hurst h * for the shuffled version of the time series." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-90", "text": "It is expected that h * = 0.5." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-91", "text": "Using the The Brothers Karamazov as example again, Fig. 4a and 4b illustrate, respectively, F (m) and the differences between the Hurst exponents for all six series." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-92", "text": "We can infer that h differs very little from one series to another and that their values are close to 0.8, with h * \u223c 0.5, implying in long-range correlations." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-93", "text": "All the series from the other books reflects this behavior (h \u223c 0.75)." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-94", "text": "This result is consistent with the multifractal analysis performed in [28] ." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-95", "text": "In addition, the values for the difference between Hurst exponents (\u2206h) of a given book are shown in Fig. 4c , where we can observe that the variation of the Hurst exponents is small from one series to another." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-96", "text": "Also, the standard deviation for h within each book was close to 0.001." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-97", "text": "Lastly, no correlations were found between the number of sentences in a text and the Hurst exponent." 
}, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-98", "text": "----------------------------------" }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-99", "text": "**CONCLUSIONS**" }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-100", "text": "By considering six ways to map a text into a time series, we have performed tests to gauge the robustness of sentence length measures in written texts." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-101", "text": "The length units used were characters and words for regular, lemmatized, without stop-words sentences and a combination of the latter two." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-102", "text": "For each one of the 501 fiction books in English from our database, strong linear correlations were found when the Pearson's coefficients were calculated obtaining r \u223c 0.98 (Fig. 2c) ." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-103", "text": "By using this feature, we employed Eq. 2 to map the series into one another in order to compare their distributions using the Kolmogorov-Smirnov test." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-104", "text": "Among the 7, 515 possible comparisons, 89.74% accepted the null hypothesis the they are drawn from the same distribution (Fig. 3c) ." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-105", "text": "Due to the interest into long-range correlations in time series mapped from texts, we performed detrended fluctuation analysis to quantify auto-correlations and to study their variations among the constructed series." }, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-106", "text": "We have found that the Hurst exponents were very close to each other, |\u2206h| \u223c 0.001 (Fig. 4c) , presenting positive long-range correlations (h \u223c 0.75 with h * \u223c 0.5)." 
}, { "sent_id": "086619bef9b4e7851bf42bf36eb14a-C001-107", "text": "Our results have not pointed into the direction of significant differences among the series considered, indicating a robustness of sentence length measures in written texts in English." } ], "y": { "@BACK@": { "gold_contexts": [ [ "086619bef9b4e7851bf42bf36eb14a-C001-21" ], [ "086619bef9b4e7851bf42bf36eb14a-C001-23" ], [ "086619bef9b4e7851bf42bf36eb14a-C001-61" ] ], "cite_sentences": [ "086619bef9b4e7851bf42bf36eb14a-C001-21", "086619bef9b4e7851bf42bf36eb14a-C001-23", "086619bef9b4e7851bf42bf36eb14a-C001-61" ] }, "@SIM@": { "gold_contexts": [ [ "086619bef9b4e7851bf42bf36eb14a-C001-94" ] ], "cite_sentences": [ "086619bef9b4e7851bf42bf36eb14a-C001-94" ] } } }, "ABC_bd64b244756259f18f7c3cb60989a2_41": { "x": [ { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-33", "text": "**DATA**" }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-34", "text": "In this section, we give an overview of the data used for training our NMT systems." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-2", "text": "In this paper we develop a neural machine translation (NMT) system for translating from English into Irish and vice versa." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-3", "text": "We evaluate the performance of NMT on the resource-poor English-Irish (EN-GA) language pair, show that we can achieve good translation quality in comparison to previously reported systems, and outperform Google Translate TM with several BLEU points on a domain-specific test set related to the legal domain." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-4", "text": "We show that back-translation of monolingual data closely related to the domain of the test set can further increase the model's performance." 
}, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-5", "text": "Finally, we present a lightweight method for filtering synthetic sentence pairs obtained via back-translation using a tool for misalignment detection." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-6", "text": "We show that our approach results in a slightly higher BLEU score while requiring less training data." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-7", "text": "1 Reported BLEU score of 40.1, compared to a BLEU score of 46.4 for the SMT system." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-8", "text": "----------------------------------" }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-10", "text": "In recent years the performance of machine translation systems has been improving significantly thanks to the shift from statistical machine translation (SMT) to NMT." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-35", "text": "Both bilingual and monolingual data are used." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-11", "text": "Replacing the recurrent neural network (RNN) architecture with a Transformer architecture that relies entirely on self-attention to compute representations of its input and output has set a new state of the art in the field of machine translation (Vaswani et al. 2017 )." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-12", "text": "\u00a9 2019 The authors." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-13", "text": "This article is licensed under a Creative Commons 4.0 licence, no derivative works, attribution, CCBY-ND." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-14", "text": "However, for low-resource languages, the performance of (neural) machine translation systems can still be disappointing, as pointed out for instance by Koehn et al. (2017) ." 
}, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-15", "text": "Many approaches have been suggested to improve the quality of NMT in such a low-resource setting, among which multilingual models (Johnson et al. 2016; Thanh-Le et al. 2016) , unsupervised approaches (Lample et al. 2019 ) and systems relying on back-translation (Sennrich et al. 2016) have been the most successful." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-16", "text": "In this paper we focus on the translation of the English-Irish language pair using NMT." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-17", "text": "The Irish language has been categorized as a 'weak or not supported language' by the META-NET report (Judge et al. 2012) due to the lack of good translation resources." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-18", "text": "Despite this relatively low availability of resources, both in terms of monolingual and bilingual content, it has been shown that an SMT system can achieve promising translation quality both in a domain-specific setting (Dowling et al. 2015) and in a more broaddomain context (Arcan et al. 2016) ." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-19", "text": "First steps were taken by Dowling et al. (2018) to apply NMT methods to EN-GA MT, although the resulting NMT system performed significantly worse than SMT, scoring more than 6 BLEU lower on an in-domain test set." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-20", "text": "1 In this work we will further explore the potential of NMT for the EN-GA language pair." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-21", "text": "We add web-crawled parallel data to the publicly available resources used in previous studies and show relatively good translation quality both on a domain-specific test set and on a more generic test set." 
}, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-22", "text": "Next, our experiments confirm that NMT translation quality for GA\u2192EN can be significantly improved using back-translation." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-23", "text": "Due to a lack of Irish monolingual data, backtranslation was less useful for EN\u2192GA NMT." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-24", "text": "Finally, a set of experiments was performed in which the synthetic parallel corpus, obtained via back-translation, was filtered with Bicleaner 2 (S\u00e1nchez-Cartagena et al. 2018 ), a tool for misalignment detection." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-25", "text": "We show that applying misalignment detection on a synthetic corpus before adding it to the parallel training data results in small increases in BLEU score and could be a useful strategy in terms of data selection." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-26", "text": "Filtering of parallel data has been the subject of various studies (Axelrod et al. 2011; van der Wees et al. 2017 ), but such data selection methods have only been scarcely investigated in the context of back-translation." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-27", "text": "Fadaee et al. (2018) suggest several sampling strategies for synthetic data obtained via back-translation, targeting difficult to predict words." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-28", "text": "More closely related to our filtering technique is the method proposed by Imankulova et al. (2017) ." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-29", "text": "They present a method in which a synthetic corpus was filtered by calculating the BLEU score between the target monolingual sentence and the translation of the synthetic source sentence in the target language and report small increases in translation quality in a lowresource setting." 
}, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-30", "text": "----------------------------------" }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-31", "text": "**MATERIALS AND METHODS**" }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-32", "text": "----------------------------------" }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-36", "text": "In Table 1 an overview of the parallel data is shown." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-37", "text": "Three types of data were collected: 1) Baseline data, i.e. a collection of publicly available resources; 2) Web-crawled data, i.e. data scraped from two bilingual websites, and 3) ParaCrawl data." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-38", "text": "The baseline data has been described in detail in previous publications (Dowling et al. 2015; Arcan et al. 2016) ." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-39", "text": "We note that there are some other parallel corpora available for the EN-GA language pair, the largest of which are the KDE 3 and GNOME 4 corpora." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-40", "text": "However, due to the very specific nature of these corpora, they were not included in the training data." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-41", "text": "The web-crawled dataset consists of sentence pairs we scraped and aligned ourselves from two bilingual websites." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-42", "text": "This data was scraped using Scrapy 5 and then document-aligned using Malign, 6 a tool for document alignment that makes use of MT." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-43", "text": "Sentence alignment of these document pairs was subsequently performed using Hunalign 7 (Varga et al. 2005) ." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-44", "text": "Finally, the misalignment detection tool Bicleaner (S\u00e1nchez-Cartagena et al. 
2018 ) was applied to these aligned sentences (the Bicleaner threshold was set to 0.5 8 )." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-45", "text": "We also used the ParaCrawl corpus as a bilingual resource." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-46", "text": "We used the Raw EN-GA ParaCrawl corpus v4.0 15 consisting of 156M sentence pairs." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-47", "text": "ParaCrawl is known to contain a diversity of noise such as misalignments, untranslated sentences, non-linguistic characters, wrong encoding, language errors, short segments etc." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-48", "text": "that may harm NMT performance (Khayrallah et al. 2018" }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-49", "text": "----------------------------------" }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-50", "text": "**MACHINE TRANSLATION**" }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-51", "text": "Neural MT engines were trained with OpenNMTtensorflow 19 using the Transformer architecture during 20 epochs and default training settings." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-52", "text": "20 Preprocessing was done with aggressive tokenization, and joint subword (BPE) and vocabulary sizes set to 32k." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-53", "text": "NMT systems were trained in both translation directions, EN\u2192GA and GA\u2192EN." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-54", "text": "The translation quality of the MT models is measured by calculating BLEU scores on two held out test sets (see Section 2.1)." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-55", "text": "The reported BLEU score is the maximal BLEU reached in the last 10 epochs of training." 
}, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-56", "text": "----------------------------------" }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-57", "text": "**BICLEANER**" }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-58", "text": "Bicleaner detects noisy sentence pairs in a parallel corpus by estimating the likelihood of a pair of sentences being mutual translations (value near 1) or not (value near 0)." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-59", "text": "Very noisy sentences are given the score 0 and detected by means of hand-crafted hard rules." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-60", "text": "This set of hand-crafted rules tries to detect evident flaws such as language errors, encoding errors, short segments and very different lengths in pairs of parallel sentences." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-61", "text": "In a second step, misalignments are detected by means of an automatic classifier." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-62", "text": "Finally, sentences are scored based on fluency and diversity." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-63", "text": "More details are provided by S\u00e1nchez-Cartagena et al. (2018) ." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-64", "text": "In order to clean the EN-GA web-crawled corpus and the synthetic data obtained via backtranslation we used a pre-trained classifier provided by the authors." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-65", "text": "21" }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-66", "text": "----------------------------------" }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-67", "text": "**BACK-TRANSLATION**" }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-68", "text": "Following the methodology described by Sennrich et al. 
(2016) , we paired EN|GA monolingual data (see Table 2 ) with EN\u2192GA and GA\u2192EN back-translated data, respectively, and used it as additional synthetic parallel training data." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-69", "text": "Our MT engines for back-translation were trained using the RNN (Recurrent Neural Network) architecture in OpenNMT (Klein et al. 2017 ) on the parallel data described in Table 1 ." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-70", "text": "The RNN architecture was chosen because of higher inference speed compared to the Transformer architecture (Vaswani et al. 2017; Zhang et al. 2018) , which speeds up the process of backtranslation." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-71", "text": "We applied Bicleaner to the resulting synthetic parallel corpus in an effort to filter out data that may harm the performance of an NMT engine." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-72", "text": "----------------------------------" }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-73", "text": "**RESULTS**" }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-74", "text": "Neural MT engines (see Section 2.2) were trained on the different types of training data described in Section 2.1." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-75", "text": "We evaluated our MT systems on two test sets: a generic test set and a domain-specific test set related to the legal domain." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-76", "text": "In Table 3 we give an overview of the results." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-77", "text": "It shows the results of the generic test sets in the third and fifth column and the results for the domain-specific test sets in the fourth and sixth column." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-78", "text": "In the left column, the types of data are indicated, such as synthetic data obtained after back-translation." 
}, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-79", "text": "Our NMT engines already perform reasonably well in both language directions using the baseline data only." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-80", "text": "An increase in BLEU score is observed when adding the web-crawled data and the Table 3" }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-81", "text": "----------------------------------" }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-82", "text": "**BLEU SCORES OF OUR NMT SYSTEMS FOR DIFFERENT TEST SETS AND TYPES OF TRAINING DATA**" }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-83", "text": "As mentioned in Section 2.4, we applied Bicleaner to the resulting synthetic parallel corpus obtained after back-translation." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-84", "text": "In order to investigate the effect of this filtering strategy, another set of experiments was performed, for two translation directions." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-85", "text": "Table 4 shows the results of two experiments for EN\u2192GA." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-109", "text": "Such data is, to the best of our knowledge, not publicly available." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-86", "text": "In the first experiment (second row), an engine was trained on the concatenation of the baseline data, the web-crawled data, the ParaCrawl data and the synthetic parallel corpus; no Bicleaner filtering was applied." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-87", "text": "In the second experiment, the Bicleaner threshold was set at 0.7." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-88", "text": "We observe that adding filtered synthetic data results in slightly higher BLEU scores on the domain-specific test set, compared to the scenario in which no filtering was applied." 
}, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-89", "text": "On the generic test set, filtering of the synthetic data did not impact the translation quality in terms of BLEU score." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-90", "text": "DGT, EAC, ECDC) for back-translation." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-91", "text": "In all our experiments, we observed an increase in BLEU score for the generic test set when adding synthetic data to the parallel training corpus." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-92", "text": "The performance on the domain-specific test set only slightly increases, but only when domain-specific data is used for back-translation, in all other cases a slight decrease is observed." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-93", "text": "On the generic test set, we found that adding a larger amount of synthetic data results in better performance." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-94", "text": "However, doubling the amount of training data through back-translation seems sufficient: we only notice small improvements in terms of BLEU score when synthetic data is added beyond the 1:1 ratio between synthetic and real data." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-95", "text": "This is in line with and Fadaee et al. (2018) , who show that a ratio around 1:1 between synthetic and real data is optimal." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-96", "text": "Filtering the synthetic data using a misalignment detection tool seems to be a useful strategy in terms of data selection, as slightly higher BLEU scores could be obtained with less data." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-97", "text": "We refer to the last three rows of Table 5 : when using approximately 500k less synthetic sentence pairs, we observe an increase in BLEU of 0.4 on the generic test set (59.0 vs. 58.6)." 
}, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-98", "text": "However, we note that one must be careful when setting the Bicleaner threshold: we observe a decrease in BLEU score when increasing the threshold to 0.8." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-99", "text": "----------------------------------" }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-100", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-101", "text": "In this paper we present a well-performing NMT system for the EN-GA language pair." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-102", "text": "While EN\u2192GA NMT systems presented in previous work (Dowling et al. 2018) were still performing sub-par, our NMT system outperforms Google Translate TM by several BLEU points on a domainspecific test set in both translation directions." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-103", "text": "By carefully adding web-crawled data, we were able to increase the training corpus from 316k sentence pairs to more than 1M parallel sentences, leading to better translation performance in terms of BLEU score." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-104", "text": "In previous studies (Dowling et al. 2015; Arcan et al. 2016) , EN\u2194GA MT systems were trained on significantly smaller corpora." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-105", "text": "Next, we showed that back-translation can increase the performance of EN\u2194GA NMT systems." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-106", "text": "For the GA\u2192EN translation direction, back-translation proved very useful, especially when EN monolingual data closely related to the domain of the test set was used for back-translation, in line with Sennrich et al. (2016) ." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-107", "text": "For the EN\u2192GA translation direction, back-translation proved less effective." 
}, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-108", "text": "We think this might be solved by using Irish monolingual data that is more closely related to the domain of interest." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-110", "text": "The Corpus of Contemporary Irish, a monolingual collection of Irish-language texts in digital format, 22 containing around 24.7M words, may be a possible candidate." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-111", "text": "However, this corpus is only searchable and we could therefore not use it in the present study." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-112", "text": "Finally, we presented a lightweight method for filtering synthetic sentence pairs obtained via back-translation, using a tool for misalignment detection, Bicleaner (S\u00e1nchez-Cartagena et al. 2018) ." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-113", "text": "We show that our approach results in small increases in BLEU score, while requiring less training data." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-114", "text": "In future work we will investigate to what extent our proposed methodology can be applied to other languages with a similar amount of data available." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-115", "text": "Another interesting research direction would be the development of a multilingual MT system which includes not only Irish but also other Gaelic languages, and which is based on methods such as the one described by Johnson et al. (2016) ." }, { "sent_id": "bd64b244756259f18f7c3cb60989a2-C001-116", "text": "It should also be investigated whether unsupervised MT approaches like the one of Lample et al. 22 https://www.gaois.ie/g3m/en (2019) can be used to increase the translation quality of EN\u2194GA MT systems." 
} ], "y": { "@USE@": { "gold_contexts": [ [ "bd64b244756259f18f7c3cb60989a2-C001-24" ], [ "bd64b244756259f18f7c3cb60989a2-C001-44" ], [ "bd64b244756259f18f7c3cb60989a2-C001-112" ] ], "cite_sentences": [ "bd64b244756259f18f7c3cb60989a2-C001-24", "bd64b244756259f18f7c3cb60989a2-C001-44", "bd64b244756259f18f7c3cb60989a2-C001-112" ] }, "@UNSURE@": { "gold_contexts": [ [ "bd64b244756259f18f7c3cb60989a2-C001-63" ] ], "cite_sentences": [ "bd64b244756259f18f7c3cb60989a2-C001-63" ] } } }, "ABC_e177758a227506bbf9de48f8f35715_41": { "x": [ { "sent_id": "e177758a227506bbf9de48f8f35715-C001-17", "text": "Notice that these text corpora need not be aligned." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-70", "text": "C bantu = C bantu1 \u222a\u0108 bantu2 ." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-68", "text": "Concatenating the Bantu corpora gives: 6 In the experiments we use \u03a6 = 1." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-69", "text": "7 In the experiments we use \u03a0 = 0.5." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-2", "text": "We present a method for learning bilingual translation dictionaries between English and Bantu languages." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-3", "text": "We show that exploiting the grammatical structure common to Bantu languages enables bilingual dictionary induction for languages where training data is unavailable." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-4", "text": "----------------------------------" }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-5", "text": "**INTRODUCTION**" }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-6", "text": "Bilingual dictionaries mostly exist for resourcerich language pairs, for example, English-German and English-Chinese (Koehn and Knight, 2002; Haghighi et al., 2008; Ammar et al., 2016; Faruqui and Dyer, 2014) ." 
}, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-7", "text": "Such dictionaries are useful for many natural language processing (NLP) tasks including statistical machine translation, cross-lingual information retrieval, and cross-lingual transfer of NLP models such as those for part-of-speech tagging and dependency parsing (T\u00e4ckstr\u00f6m et al., 2012; Guo et al., 2015; Gouws and S\u00f8gaard, 2015) ." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-8", "text": "In this paper, we consider the task of bilingual dictionary induction for English and Bantu languages." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-9", "text": "Bantu languages are a family of over 300 1 mutually intelligible languages spoken over much of central and southern Africa, see the map 2 in Figure 1 (Guthrie, 1948; Nurse and Philippson, 2003) ." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-10", "text": "About a third of Africans speak a Bantu language as their native language (Nurse and Tucker, 2001 )." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-11", "text": "The most widely spoken, with over 10 million speakers 3 each, include Swahili (Kenya, Uganda), Shona (Zimbabwe), Zulu (South Africa), and Kinyarwanda (Rwanda, Uganda)." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-12", "text": "As with other low resource languages, labeled data for Bantu languages is scare." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-13", "text": "We seek to exploit the Bantu grammar structure to mitigate lack of labeled data." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-14", "text": "More specifically, we ask the following question: given a small bilingual dictionary between English and one Bantu language, L bantu1 , can we 1) infer missing entries in the English \u2212 L bantu1 dictionary 2) generate a new bilingual dictionary English \u2212 L bantu2 for another Bantu language for which labeled data is unavailable." 
}, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-15", "text": "To answer this question we propose an approach based on distributed representations of words (Turney and Pantel, 2010; Mikolov et al., 2013a) ." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-16", "text": "The first step is to create a vector space for each language, derived from a text corpus for the language." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-18", "text": "The second step is to perform dictionary induction by learning a linear projection, in the form of a matrix, between language vector spaces (Mikolov et al., 2013b; Lazaridou et al., 2015) ." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-19", "text": "Our key insight for Bantu languages is that one can create a single vector space for them, obviating the need for learning a projection matrix for each Bantu language." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-20", "text": "This means we only need to learn a single projection matrix, for inducing multiple English to Bantu bilingual dictionaries, using the small bilingual dictionary English\u2212L bantu1 ." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-21", "text": "Additionally, we modify the corpus corresponding to L bantu2 to have a greater vocabulary intersection with L bantu1 ." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-22", "text": "This step is inspired by the extensive use of bases and affixes, common to Bantu languages." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-23", "text": "Words with the same meaning often differ only in the affixes with the base being similar or the same." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-24", "text": "We therefore use edit distance to replace some fraction of the words of L bantu2 with similar words in L bantu1 ." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-25", "text": "Contribution." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-26", "text": "1)." 
}, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-27", "text": "Unsupervised Bantu language dictionary induction: To the best of our knowledge, our work is the first effort to create bilingual dictionaries for Bantu languages using unsupervised machine learning methods." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-28", "text": "2) Data: We collect corpora for seven Bantu languages." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-29", "text": "Having had access to a first language speaker of two Bantu languages, we obtained labeled data, which we make available along with the corpora, for further research into NLP for Bantu languages." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-30", "text": "3) Dictionary induction almost from scratch: We propose a method for dictionary induction that only requires training data in one of the Bantu languages." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-31", "text": "Our experiments show the potential of our approach." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-32", "text": "----------------------------------" }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-33", "text": "**APPROACH**" }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-34", "text": "----------------------------------" }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-35", "text": "**DISTRIBUTED REPRESENTATION**" }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-36", "text": "Distributed representations of words, in the form of real-valued vectors, encode word semantics based on collocation of words in text (Turney and Pantel, 2010; Mikolov et al., 2013a; Ammar et al., 2016) ." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-37", "text": "Such vector representations have been shown to improve performance of various NLP tasks including semantic role labeling, part-of-speech tagging, and named entity recognition (Collobert et al., 2011) ." 
}, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-38", "text": "In this work we use the skip-gram model with negative sampling to generate word vectors (Mikolov et al., 2013a) ." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-39", "text": "It is one of the most competitive methods for generating word vector representations, as demonstrated by results on various semantic tasks (Baroni et al., 2014; Mikolov et al., 2013b) ." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-40", "text": "----------------------------------" }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-41", "text": "**BILINGUAL DICTIONARY INDUCTION**" }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-42", "text": "To induce a bilingual dictionary for a pair of languages, we use the projection matrix approach (Mikolov et al., 2013b; Lazaridou et al., 2015) ." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-43", "text": "It takes as input a small bilingual dictionary containing pairs of translations from the source language to the target language." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-44", "text": "Training data comprises vector representations of word pairs D tr = {(x i , y i )} n i=1" }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-45", "text": ", where x i \u2208 R s is the vector for word i in the source language, and y i \u2208 R t is the vector for its translation in the target language." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-46", "text": "At test time, we predict the target word translations for new source language words, D te = {x j } n j=1 , where x j \u2208 R s ." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-47", "text": "In our case, the source language is a Bantu language and the target language is English." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-48", "text": "This approach assumes that there is a linear relationship between the two vector spaces." 
}, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-49", "text": "Thus, the learning problem is to find a matrix W that maps a source language word vector x i to the vector of its translation y i in the target language." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-50", "text": "As in prior work, we use an l2-regularized least squares error to learn the projection matrix W." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-51", "text": "That is, we solve min W ||W X \u2212 Y|| 2 F + \u03bb ||W|| 2 F (Equation 1), where X and Y are matrices representing the source and target vectors in the training data, respectively." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-52", "text": "For a new Bantu word whose vector representation is x j \u2208 R s , we map it to English by computing \u0177 j = \u0174 x j , where \u0177 j \u2208 R t , and then finding the English word whose vector representation is closest to \u0177 j , as measured by the cosine similarity distance metric." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-53", "text": "----------------------------------" }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-54", "text": "**BANTU LANGUAGE STRUCTURE**" }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-55", "text": "The word \"Bantu\" is derived from the word for \"people\", which has striking similarities in many Bantu languages (Guthrie, 1948; Nurse and Philippson, 2003) ." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-56", "text": "In Zulu (South Africa) \"abantu\" 4 means people; in Swahili (Kenya, Uganda) \"watu\"; in Ndonga (Namibia) \"aantu\"; in Sesotho (Lesotho) \"batho\"; in Herero (Namibia) \"ovandu\"; and in Kwanyama (Namibia, Angola) \"ovanhu\". It is often used in the philosophical sense 5 ." 
}, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-57", "text": "While Bantu languages may differ in vocabulary, in some cases quite substantially, they share the same grammatical structure." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-58", "text": "----------------------------------" }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-59", "text": "**SINGLE PROJECTION MATRIX**" }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-60", "text": "We hypothesize that we only need to learn one projection matrix, W in Equation 1." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-61", "text": "Our labeled data is a small bilingual dictionary English \u2212 L bantu1 , between English and a Bantu language L bantu1 ." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-62", "text": "We would like to be able to infer missing entries in the English \u2212 L bantu1 dictionary, and to generate a new dictionary, English \u2212 L bantu2 , a language pair for which labeled data is unavailable." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-63", "text": "The core idea is to create only one vector space for the two Bantu languages." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-64", "text": "First we generate a lexicon, lex b2 , containing words that appear in the corpus of L bantu2 ." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-65", "text": "Next, for each w \u2208 lex b2 we find all words in lex b1 , the lexicon of language L bantu1 , whose edit distance to w is at most a small threshold \u03a6 6 ." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-66", "text": "Thus, each word w \u2208 lex b2 has a list of words from lex b1 , S w = {w1 \u2208 lex b1 : editdistance(w, w1) <= \u03a6}. We then go through corpus C bantu2 and, with probability \u03a0 7 , replace word w \u2208 lex b2 with one of its cross-lingually similar words in S w ; random selection is used to pick the replacement word from S w ." 
}, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-67", "text": "This process creates an updated corpus \u0108 bantu2 ." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-71", "text": "Applying the skip-gram model to C bantu generates word vectors: every w i \u2208 C bantu has a vector x i \u2208 R s ." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-72", "text": "For English, we use the 300-dimensional pre-trained vectors trained on the Google News dataset 8 , so that every English word has a vector y i \u2208 R t ." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-73", "text": "Finally, we learn W using the training data," }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-74", "text": ". At test time, we predict the target word translations for unseen Bantu words, D te = {x j } n j=1 , which can either be in language L bantu1 or L bantu2 ." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-75", "text": "----------------------------------" }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-76", "text": "**EXPERIMENTAL EVALUATION**" }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-77", "text": "Data." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-78", "text": "We crawled Bible.com for bibles of 9 languages, 7 Bantu and 2 Indo-European, Italian and English." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-79", "text": "The latter were used for comparison." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-80", "text": "Corpora statistics are shown in Table 1 ." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-81", "text": "In our experiments, we focused on Kwanyama, spoken in Namibia and Angola, as we had access to a first language speaker who could annotate data." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-82", "text": "The last column of Table 1 shows the vocabulary intersection between Kwanyama and other languages." 
}, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-83", "text": "The language with the most words in common with Kwanyama is Ndonga, spoken in Namibia, with an 11% vocabulary overlap." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-84", "text": "As expected, the Indo-European languages overlap the least with Kwanyama and the few words they have" }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-85", "text": "in common have different meanings. (Table 4: Precision at Top-K for various language pairs.)" }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-86", "text": "For example: The word \"male\" in English refers to gender; in Kwanyama, \"male\" means \"tall\" or \"deep\"." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-87", "text": "Our Kwanyama first language speaker also has five years of formal training in Ndonga, which is a dialect of the same language as Kwanyama." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-88", "text": "We therefore focus on these two Bantu languages in our experiments." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-89", "text": "Table 2 shows some examples of English, Kwanyama and Ndonga translations." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-90", "text": "Details of the training and test data are shown in Table 3 ." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-91", "text": "For all languages, we used 300-dimensional word vectors." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-92", "text": "Results." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-93", "text": "Table 4 shows the main results in terms of precision at top-k; the last column, RD = 0.10, shows precision at the value of k which yields random chance of 10% precision." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-94", "text": "The top two rows show the results of bilingual dictionary induction between English and Kwanyama." 
}, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-95", "text": "We compare the projection matrix approach, EN-KW, to random chance, EN-KW (RD)." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-96", "text": "We can see that EN-KW far outperforms chance." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-97", "text": "This result is promising given that our annotator only generated about 2,142 labeled examples." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-98", "text": "In particular, English-Italian (EN-IT), with a larger dictionary of 5,000 word pairs produced in prior work, achieves similar numbers; however, it is worth noting that the EN-IT test data set is also much larger." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-99", "text": "For the English-Ndonga, EN-ND, language pair, we have no labeled data." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-100", "text": "We consider three cases: 1) EN-ND (J-KW), for this case, we concatenate the Kwanyama and Ndonga corpora and use the EN-KW training data to induce the EN-ND dictionary." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-101", "text": "2) EN-ND (J-IT), we concatenate the Italian and Ndonga corpora and use the EN-IT training data to induce the EN-ND dictionary." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-102", "text": "3) EN-ND (J-KW-R), our approach, where we first modify the Ndonga corpus to look more like Kwanyama. This variant performs best, which is to be expected because Ndonga (ND), a Bantu language, is much more similar to Kwanyama (KW) than to Italian (IT), an Indo-European language." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-103", "text": "Figure 2 shows the top-k precision trends for various values of k. For the EN-KW pair, left of Figure 2 , there is a bigger gap between EN-KW and random chance EN-KW (RD)." 
}, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-104", "text": "On the other hand, for the EN-ND pair, the right of Figure 2 , the gap between our approach EN-ND (J-KW-R) and random choice, EN-ND (RD) is smaller." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-105", "text": "However, it is also clear that the precision at top-k trend is much better when we make use of training data from Kwanyama EN-ND (J-KW-R), instead of training data from Italian EN-ND (J-IT)." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-106", "text": "This result is encouraging for future work towards inducing accurate bilingual dictionaries for Bantu languages without labeled data." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-107", "text": "Future directions include collecting more training data from popular Bantu languages such as Swahili and Zulu; proposing alternative methods to dictionary induction; and inducing dictionaries for more Bantu languages." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-108", "text": "----------------------------------" }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-109", "text": "**CONCLUSION**" }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-110", "text": "In prior work, bilingual dictionary induction has been studied mostly for resource rich languages." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-111", "text": "(Lazaridou et al., 2015; Upadhyay et al., 2016; Faruqui and Dyer, 2014; Mikolov et al., 2013b; Ammar et al., 2016; Haghighi et al., 2008) ." }, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-112", "text": "We have introduced an approach where we create one vector space for Bantu languages in order to exploit labeled data available for one language but not for another." 
}, { "sent_id": "e177758a227506bbf9de48f8f35715-C001-113", "text": "Given that there are over 300 Bantu languages, and not all of them have training data, we believe approaches that rely on their shared grammar will be important for bringing NLP methods to this family of languages." } ], "y": { "@BACK@": { "gold_contexts": [ [ "e177758a227506bbf9de48f8f35715-C001-18" ], [ "e177758a227506bbf9de48f8f35715-C001-39" ] ], "cite_sentences": [ "e177758a227506bbf9de48f8f35715-C001-18", "e177758a227506bbf9de48f8f35715-C001-39" ] }, "@USE@": { "gold_contexts": [ [ "e177758a227506bbf9de48f8f35715-C001-42" ] ], "cite_sentences": [ "e177758a227506bbf9de48f8f35715-C001-42" ] } } }, "ABC_aeb6a815732b36d7602a9c43c47cfa_41": { "x": [ { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-73", "text": "**GEOLOCATION**" }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-70", "text": "----------------------------------" }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-71", "text": "**RESULTS**" }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-2", "text": "We propose a simple yet effective textbased user geolocation model based on a neural network with one hidden layer, which achieves state of the art performance over three Twitter benchmark geolocation datasets, in addition to producing word and phrase embeddings in the hidden layer that we show to be useful for detecting dialectal terms." }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-3", "text": "As part of our analysis of dialectal terms, we release DAREDS, a dataset for evaluating dialect term detection methods." 
}, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-4", "text": "----------------------------------" }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-5", "text": "**INTRODUCTION**" }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-6", "text": "Many services such as web search (Leung et al., 2010) , recommender systems (Ho et al., 2012) , targeted advertising (Lim and Datta, 2013) , and rapid disaster response (Ashktorab et al., 2014) rely on the location of users to personalise information and extract actionable knowledge." }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-7", "text": "Explicit user geolocation metadata (e.g. GPS tags, WiFi footprint, IP address) is not usually available to third-party consumers, giving rise to the need for geolocation based on profile data, text content, friendship graphs (Jurgens et al., 2015) or some combination of these (Rahimi et al., 2015b,a) ." }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-8", "text": "The strong geographical bias, most obviously at the language level (e.g. Finland vs. Japan), and more subtly at the dialect level (e.g. in English used in north-west England vs. north-east USA vs. Texas, USA), clearly reflected in language use in social media services such as Twitter, has been used extensively either for geolocation of users (Eisenstein et al., 2010; Roller et al., 2012; Rout et al., 2013; Wing and Baldridge, 2014) or dialectology (Eisenstein, 2015) ." }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-9", "text": "In these methods, a user is often represented by the concatenation of their tweets, and the geolocation model is trained on a very small percentage of explicitly geotagged tweets, noting the potential biases implicit in geotagged tweets (Pavalanathan and Eisenstein, 2015) ." 
}, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-10", "text": "Lexical dialectology is (in part) the converse of user geolocation (Eisenstein, 2015) : given text associated with a variety of regions, the task is to identify terms that are distinctive of particular regions." }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-11", "text": "The complexity of the task is two-fold: (1) localised named entities (e.g. sporting team names) are not of interest; and (2) without semantic knowledge it is difficult to detect terms that are in general use but have a special meaning in a region." }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-12", "text": "In this paper we propose a text-based geolocation method based on neural networks." }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-13", "text": "Our contributions are as follows: (1) we achieve state-of-the-art results on benchmark Twitter geolocation datasets; (2) we show that the model is less sensitive to the specific location discretisation method; (3) we release the first broad-coverage dataset for evaluation of lexical dialectology models; (4) we incorporate our text-based model into a network-based model (Rahimi et al., 2015a) and improve the performance utilising both network and text; and (5) we use the model's embeddings for extraction of local terms and show that it outperforms two baselines." }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-14", "text": "----------------------------------" }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-15", "text": "**RELATED WORK**" }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-16", "text": "Related work on Twitter user geolocation falls into two categories: text-based and network-based methods." }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-17", "text": "Text-based methods make use of the geographical biases of language use, and networkbased methods rely on the geospatial homophily of user-user interactions." 
}, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-18", "text": "In both cases, the assumption is that users who live in the same geographic area share similar features (linguistic or interactional)." }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-19", "text": "Three main text-based approaches are: (1) the use of gazetteers (Quercini et al., 2010) ; (2) unsupervised text clustering based on topic models or similar (Eisenstein et al., 2010; Hong et al., 2012; Ahmed et al., 2013) ; and (3) supervised classification (Ding et al., 2000; Backstrom et al., 2008; Cheng et al., 2010; Hecht et al., 2011; Kinsella et al., 2011; Wing and Baldridge, 2011; Han et al., 2012; Rout et al., 2013) , which unlike gazetteers can be applied to informal text and, compared to topic models, scales better." }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-20", "text": "The classification models often rely on less than 1% of geotagged tweets for supervision and discretise real-valued coordinates into equal-sized grids (Serdyukov et al., 2009) , administrative regions (Cheng et al., 2010; Hecht et al., 2011; Kinsella et al., 2011; Han et al., 2012) , or flat (Wing and Baldridge, 2011) or hierarchical k-d tree clusters (Wing and Baldridge, 2014) ." }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-21", "text": "Network-based methods also use either real-valued coordinates (Jurgens et al., 2015) or discretised regions (Rahimi et al., 2015a) as labels, and use label propagation over the interaction graph (e.g. @-mentions)." }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-22", "text": "More recent methods have focused on representation learning by using sparse coding (Cha et al., 2015) or neural networks (Liu and Inkpen, 2015) , utilising both text and network information (Rahimi et al., 2015a) ." }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-23", "text": "Dialect is a variety of language shared by a group of speakers (Wolfram and Schilling, 2015) ." 
}, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-24", "text": "Our focus here is on geographical dialects which are spoken (and written in social media) by people from particular areas." }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-25", "text": "The traditional approach to dialectology is to find the geographical distribution of known lexical alternatives (e.g. you, yall and yinz: (Labov et al., 2005; Nerbonne et al., 2008; Gon\u00e7alves and S\u00e1nchez, 2014; Doyle, 2014; Huang et al., 2015; Nguyen and Eisenstein, 2016) ), the shortcoming of which is that the alternative lexical variables must be known beforehand." }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-26", "text": "There have also been attempts to automatically identify such words from geotagged documents (Eisenstein et al., 2010; Ahmed et al., 2013; Eisenstein, 2015) ." }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-27", "text": "The main idea is to find lexical variables that are disproportionately distributed in different locations either via model-based or statistical methods (Monroe et al., 2008) ." }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-28", "text": "There is a research gap in evaluating the geolocation models in terms of their usability in retrieving dialect terms given a geographic region." }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-29", "text": "We use a text-based neural approach trained on geotagged Twitter messages that: (a) given a geographical region, identifies the associated lexical terms; and (b) given a text, predicts its location." 
}, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-30", "text": "----------------------------------" }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-31", "text": "**DATA**" }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-32", "text": "We use three existing Twitter user geolocation datasets: (1) GEOTEXT (Eisenstein et al., 2010) , (2) TWITTER-US (Roller et al., 2012) , and (3) TWITTER-WORLD (Han et al., 2012) ." }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-33", "text": "These datasets have been used widely for training and evaluation of geolocation models." }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-34", "text": "They are all prepartitioned into training, development and test sets." }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-35", "text": "Each user is represented by the concatenation of their tweets, and labeled with the latitude/longitude of the first collected geotagged tweet in the case of GEOTEXT and TWITTER-US, and the centre of the closest city in the case of TWITTER-WORLD." }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-36", "text": "1 GEOTEXT and TWITTER-US cover the continental US, and TWITTER-WORLD covers the whole world, with 9k, 449k and 1.3m users, respectively as shown in Figure 1 ." }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-37", "text": "2 DAREDS is a dialect-term dataset novel to this research, created from the Dictionary of American Regional English (DARE) (Cassidy et al., 1985) ." }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-38", "text": "DARE consists of dialect regions, their terms and meaning." }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-39", "text": "3 It is based on dialectal surveys from different regions of the U.S., which are then postprocessed to identify dialect regions and terms." 
}, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-40", "text": "In order to construct a dataset based on DARE, we downloaded the web version of DARE, cleaned it, and removed multiword expressions and highly-frequent words (any word which occurred in the top 50k most frequent words, based on a word frequency list (Norvig, 2009))." }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-41", "text": "For dialect regions that don't correspond to a single state or set of cities (e.g. South), we mapped them to the most populous cities within each region." }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-42", "text": "For example, within the Pacific Northwest dialect region, we manually extracted the most populous cities (Seattle, Tacoma, Portland, Salem, Eugene) and added those cities to DAREDS as subregions." }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-43", "text": "The resulting dataset (DAREDS) consists of around 4.3k dialect terms from 99 U.S. dialect regions." }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-44", "text": "DAREDS is the largest standardised dialectology dataset." }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-45", "text": "----------------------------------" }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-46", "text": "**METHODS**" }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-47", "text": "We use a multilayer perceptron (MLP) with one hidden layer as our location classifier, where the input is l 2 normalised bag-of-words features for a given user." }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-48", "text": "We exclude @-mentions, words with document frequency less than 10, and stop words." 
}, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-72", "text": "----------------------------------" }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-49", "text": "The output is either a k-d tree leaf node or k-means discretisation of real-valued coordinates of training locations, the output of which is visualised for TWITTER-US in Figure 2 ." }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-50", "text": "The hidden layer output provides word (and phrase, as bags of words) embeddings for dialectal analysis." }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-51", "text": "The number of regions, regularisation strength, hidden layer and mini-batch size are tuned over development data and set to (32, 10 \u22125 , 896, 100), (256, 10 \u22126 , 2048, 10000) and (930, 10 \u22126 , 3720, 10000) for GEOTEXT, TWITTER-US and TWITTER-WORLD, respectively." }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-52", "text": "The parameters are optimised using Adamax (Kingma and Ba, 2014) , implemented in Lasagne/Theano (Theano Development Team, 2016) ." }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-53", "text": "Following Cheng (2010) and Eisenstein (2010) , we evaluated the geolocation model using mean and median error in km (\"Mean\" and \"Median\" resp.) and accuracy within 161km of the actual location (\"Acc@161\")." }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-54", "text": "Note that lower numbers are better for Mean and Median, and higher numbers better for Acc@161." }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-55", "text": "4 The results reported in Rahimi et al. (2015b; 2015a) for TWITTER-WORLD were over a superset of the dataset; the results reported here are based on the actual dataset." 
}, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-56", "text": "While the focus of this paper is text-based user geolocation, state-of-the-art results for the three datasets have been achieved with hybrid text+network-based models, where the predictions of the text-based model are fed into a mention network as \"dongle\" nodes to each user node, providing a personalised geolocation prior for each user (Rahimi et al., 2015a) ." }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-57", "text": "Note that it would, of course, be possible to combine text and network information in a joint deep learning model (Yang et al., 2016; Kipf and Welling, 2016) , which we leave to future work (noting that scalability will potentially be a major issue for the larger datasets)." }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-58", "text": "To test the applicability of the model's embeddings in dialectology, we created DAREDS." }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-59", "text": "The output of the hidden layer of the model is used as embeddings for both location names and dialect terms." }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-60", "text": "Given a dialect region name, we retrieve its nearest neighbours in the embedding space, and compare them to dialect terms associated with that location." }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-61", "text": "We also compare the quality of the embeddings with pre-trained word2vec embeddings and the embeddings from the output layer of LR (logistic regression) (Rahimi et al., 2015b) as baselines." }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-62", "text": "Regions in DAREDS can be very broad (e.g. 
SouthWest), meaning that words associated with those locations will be used across a large number (Table 1: Geolocation results over the three Twitter datasets, based on the text-based MLP with k-d tree or k-means discretisation, and the text+network model MADCEL-W-MLP using MLP with k-d tree for text-based predictions.)" }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-63", "text": "We compare with state-of-the-art results for each dataset." }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-64", "text": "4 \"-\" signifies that no results were reported for the given metric or dataset." }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-65", "text": "of cities contained within that region." }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-66", "text": "We generate a region-level embedding by simply taking the city names associated with the region, feeding them as BoW input for LR and MLP, and averaging their embeddings for word2vec." }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-67", "text": "We evaluate the retrieved terms by computing recall of DAREDS terms existing in TWITTER-US (1071 terms) at k \u2208 {0.05%, 0.1%, 0.2%, 0.5%, 1%, 2%, 5%} of vocabulary size." }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-68", "text": "The code and the DAREDS dataset are available at https://github.com/afshinrahimi/acl2017." }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-74", "text": "The performance of the text-based MLP model with k-d tree and k-means discretisation over the three datasets is shown in Table 1 ." }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-75", "text": "The results are also compared with state-of-the-art text-based methods based on a flat (Rahimi et al., 2015b; Cha et al., 2015) or hierarchical (Wing and Baldridge, 2014; Melo and Martins, 2015; Liu and Inkpen, 2015) geospatial representation." 
}, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-76", "text": "Our method outperforms both the flat and hierarchical text-based models by a large margin." }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-77", "text": "Comparing the two discretisation strategies, k-means outperforms k-d tree by a reasonable margin." }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-78", "text": "We also incorporated the MLP predictions into a network-based model based on the method of Rahimi et al. (2015a) , and improved upon their work." }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-79", "text": "We analysed the Table 2 : Nearest neighbours of place names." }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-80", "text": "in Figure 3 ." }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-81", "text": "The error is highest in states with lower training coverage (e.g. Maine, Montana, Wisconsin, Iowa and Kansas)." }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-82", "text": "We also randomly sampled 50 development samples from the 1000 samples with highest prediction errors to check the biases of the model." }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-83", "text": "Most of the errors are the result of geolocating users from Eastern U.S. in Western U.S. particularly in Los Angeles and San Francisco." }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-84", "text": "----------------------------------" }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-85", "text": "**DIALECTOLOGY**" }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-86", "text": "We quantitatively tested the quality of the geographical embeddings by calculating the micro-average recall of the k-nearest dialect terms (in terms of the proportion of retrieved dialect terms) given a dialect region, as shown in Figure 4 ." 
}, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-87", "text": "Recall at 0.5% is about 3.6%, meaning that we were able to retrieve 3.6% of the dialect terms given the dialect region name in the geographical embedding space." }, { "sent_id": "aeb6a815732b36d7602a9c43c47cfa-C001-88", "text": "The embeddings slightly outperform the output layer of logistic regression (LR) (Rahimi et al., 2015b)" } ], "y": { "@BACK@": { "gold_contexts": [ [ "aeb6a815732b36d7602a9c43c47cfa-C001-7" ] ], "cite_sentences": [ "aeb6a815732b36d7602a9c43c47cfa-C001-7" ] }, "@DIF@": { "gold_contexts": [ [ "aeb6a815732b36d7602a9c43c47cfa-C001-55" ], [ "aeb6a815732b36d7602a9c43c47cfa-C001-75", "aeb6a815732b36d7602a9c43c47cfa-C001-76" ], [ "aeb6a815732b36d7602a9c43c47cfa-C001-88" ] ], "cite_sentences": [ "aeb6a815732b36d7602a9c43c47cfa-C001-55", "aeb6a815732b36d7602a9c43c47cfa-C001-75", "aeb6a815732b36d7602a9c43c47cfa-C001-88" ] }, "@UNSURE@": { "gold_contexts": [ [ "aeb6a815732b36d7602a9c43c47cfa-C001-61" ] ], "cite_sentences": [ "aeb6a815732b36d7602a9c43c47cfa-C001-61" ] } } }, "ABC_71bcd87091c4f5e2b7bc564b7bd9dd_41": { "x": [ { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-2", "text": "Machine transliteration is the process of automatically transforming the script of a word from a source language to a target language, while preserving pronunciation." }, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-3", "text": "Sequence to sequence learning has recently emerged as a new paradigm in supervised learning." }, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-4", "text": "In this paper a character-based encoder-decoder model has been proposed that consists of two Recurrent Neural Networks." 
}, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-5", "text": "The encoder is a Bidirectional recurrent neural network that encodes a sequence of symbols into a fixed-length vector representation, and the decoder generates the target sequence using an attention-based recurrent neural network." }, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-6", "text": "The encoder, the decoder and the attention mechanism are jointly trained to maximize the conditional probability of a target sequence given a source sequence." }, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-7", "text": "Our experiments on different datasets show that the proposed encoderdecoder model is able to achieve significantly higher transliteration quality over traditional statistical models." }, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-8", "text": "----------------------------------" }, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-10", "text": "Machine Transliteration is defined as phonetic transformation of names across languages Karimi et al., 2011) ." }, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-11", "text": "Transliteration of named entities is the essential part of many multilingual applications, such as machine translation (Koehn, 2010) and cross-language information retrieval (Jadidinejad and Mahmoudi, 2010) ." }, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-12", "text": "Recent studies pay a great attention to the task of Neural Machine Translation (Cho et al., 2014a; Sutskever et al., 2014) ." }, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-13", "text": "In neural machine translation, a single neural network is responsible for reading a source sentence and generates its translation." 
}, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-14", "text": "From a probabilistic perspective, translation is equivalent to finding a target sentence y that maximizes the conditional probability of y given a source sentence x, i.e., arg max y p(y | x)." }, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-15", "text": "The whole neural network is jointly trained to maximize the conditional probability of a correct translation given a source sentence, using the bilingual corpus." }, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-16", "text": "Transforming a name from spelling to phonetic and then use the constructed phonetic to generate the spelling on the target language is a very complex task (Oh et al., 2006; Finch et al., 2015) ." }, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-17", "text": "Based on successful studies on Neural Machine Translation (Cho et al., 2014a; Sutskever et al., 2014; Hirschberg and Manning, 2015) , in this paper, we proposed a character-based encoderdecoder model which learn to transliterate endto-end." }, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-18", "text": "In the opposite side of classical models which contains different components, the proposed model is trained end-to-end, so it able to apply to any language pairs without tuning for a spacific one." }, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-19", "text": "----------------------------------" }, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-20", "text": "**PROPOSED MODEL**" }, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-21", "text": "Here, we describe briefly the underlying framework, called RNN Encoder-Decoder, proposed by (Cho et al., 2014b) and (Sutskever et al., 2014) upon which we build a machine transliteration model that learns to transliterate end-to-end." 
}, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-22", "text": "The enoder is a character-based recurrent neural network that learns a highly nonlinear mapping from a spelling to the phonetic of the input sequence." }, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-23", "text": "This network reads the source name x = (x 1 , . . . , x T ) and encodes it into a sequence of hidden states h = (h 1 , \u00b7 \u00b7 \u00b7 , h T ):" }, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-24", "text": "Each hidden state h i is a bidirectional recurrent representation with forward and backward sequence information around the ith character." }, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-25", "text": "The representation of a forward sequence and a backward sequence of the input character sequence is estimated and concatenated to form a context set C = {h 1 , h 2 , ..., h T } (Dong et al., 2015; Chung et al., 2016) ." }, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-26", "text": "Then, the decoder, another recurrent neural network, computes the conditional distribution over all possible transliteration based on this context set and generates the corresponding transliteration y = (y 1 , \u00b7 \u00b7 \u00b7 , y T ) based on the encoded sequence of hidden states h. The whole model is jointly trained to maximize the conditional log-probability of the correct transliteration given a source sequence with respect to the parameters \u03b8 of the model:" }, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-27", "text": "where (x n , y n ) is the n-th training pair of character sequences, and T n is the length of the n-th target sequence (y n )." 
}, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-28", "text": "For each conditional term in Equation 2, the decoder updates its hidden state by:" }, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-29", "text": "where c t is a context vector computed by a soft attention mechanism:" }, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-30", "text": "The soft attention mechanism f a weights each vector in the context set C according to its relevance given what has been transliterated." }, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-31", "text": "Finally, the hidden state h t , together with the previous target symbol y t \u22121 and the context vector c t , is fed into a feedforward neural network to result in the conditional distribution described in Equation 2." }, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-32", "text": "The whole model, consisting of the encoder, decoder and soft attention mechanism, is trained end-to-end to minimize the negative loglikelihood using stochastic gradient descent." }, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-33", "text": "----------------------------------" }, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-34", "text": "**EXPERIMENTS**" }, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-35", "text": "We conducted a set of experiments to show the effectiveness of RNN Encoder-Decoder model (Cho et al., 2014b; Sutskever et al., 2014) in the task of machine transliteration using standard benchmark datasets provided by NEWS 2015-16 shared task ." }, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-36", "text": "Table 1 shows different datasets in our experiments." }, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-37", "text": "Each dataset covers different levels of difficulty and training set size." }, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-38", "text": "The proposed model has been applied on ." 
}, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-39", "text": "each dataset without tuning the algorithm for each specific language pairs." }, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-40", "text": "Also, we don't apply any preprocessing on the source or target language in order to evaluate the effectiveness of the proposed model in a fair situation." }, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-41", "text": "'TaskID' is a unique identifier in the following experiments." }, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-42", "text": "We leveraged a character-based encoderdecoder model (Bojanowski et al., 2015; Chung et al., 2016) with soft attention mechanism (Cho et al., 2014b) ." }, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-43", "text": "In this model, input sequences in both source and target languages have been represented as characters." }, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-44", "text": "Using characters instead of words leads to longer sequences, so Gated Recurrent Units (Cho et al., 2014a) have been used for the encoder network to model long term dependencies." }, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-45", "text": "The encoder has 128 hidden units for each direction (forward and backward), and the decoder has 128 hidden units with soft attention mechanism (Cho et al., 2014b) ." }, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-46", "text": "We train the model using stochastic gradient descent with Adam (Kingma and Ba, 2014)." }, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-47", "text": "Each update is computed using a minibatch of 128 sequence pairs." }, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-48", "text": "The norm of the gradient is clipped with a threshold 1 (Pascanu et al., 2013) ." }, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-49", "text": "Also, beamsearch has been used to approximately find the most likely transliteration given a source sequence (Koehn, 2010) ." 
}, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-50", "text": "Table 2 shows the effectiveness of the proposed model on different datasets using standard measures ." }, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-51", "text": "The proposed neural machine transliteration model has been compared to the baseline method provided by NEWS 2016 organizers ." }, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-52", "text": "Baseline results are based on a machine translation implementation at the character level using MOSES (Koehn et al., 2007) ." }, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-53", "text": "Experimental results shows that the proposed model is significantly better than the robust baseline using different metrics." }, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-54", "text": "Figure 1 shows the learning curve of the pro- Table 2 : The effectiveness of neural machine transliteration is compared with the robust baseline (Koehn et al., 2007) provided by NEWS 2016 shared task on transliteration of named entities." }, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-55", "text": "posed model on different datasets." }, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-56", "text": "It is clear that in most datasets, the trained model is capable of robust transliteration after a few number of iterations." }, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-57", "text": "As shown in Table 1 , each dataset has different number of training set and also different number of characters in the source and target language." }, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-58", "text": "For example, when transliterating from English to Chinese (TaskID='En-Ch') and English to Hebrew, the target names contains 548 and 37 different tokens respectively." 
}, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-59", "text": "Since we leverage a same model for different datasets without tuning the model for each dataset, differences in the learning curves are expectable." }, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-60", "text": "For some datasets (such as 'En-Ch'), it takes more time to fit the model to the training data while for some others (such as 'En-He'), the model fit to the training data after a few iterations." }, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-61", "text": "----------------------------------" }, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-62", "text": "**CONCLUSION**" }, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-63", "text": "In this paper we proposed Neural Machine Transliteration based on successful studies in sequence to sequence learning (Sutskever et al., 2014) and Neural Machine Translation (Ling et al., 2015; Costa-Juss\u00e0 and Fonollosa, 2016; Bahdanau et al., 2015; Cho et al., 2014a) ." }, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-64", "text": "Neural Machine Transliteration typically consists of two components, the first of which encodes a source name sequence x and the second decodes to a target name sequence y. Different parts of the proposed model jointly trained using stochastic gradient descent to minimize the log-likelihood." }, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-65", "text": "Experiments on different datasets using benchmark measures revealed that the proposed model is able to achieve significantly higher transliteration quality over traditional statistical models (Koehn, 2010) ." 
}, { "sent_id": "71bcd87091c4f5e2b7bc564b7bd9dd-C001-66", "text": "In this paper we did not concentrate on improving the model for achieving state-of-the-art results, so applying hyperparameter optimization (Bergstra and Bengio, 2012) , multi-task sequence to sequence learning (Luong et al., 2015) and multiway transliteration (Firat et al., 2016; Dong et al., 2015) are quite promising for future works." } ], "y": { "@EXT@": { "gold_contexts": [ [ "71bcd87091c4f5e2b7bc564b7bd9dd-C001-21" ] ], "cite_sentences": [ "71bcd87091c4f5e2b7bc564b7bd9dd-C001-21" ] }, "@BACK@": { "gold_contexts": [ [ "71bcd87091c4f5e2b7bc564b7bd9dd-C001-21" ], [ "71bcd87091c4f5e2b7bc564b7bd9dd-C001-45" ] ], "cite_sentences": [ "71bcd87091c4f5e2b7bc564b7bd9dd-C001-21", "71bcd87091c4f5e2b7bc564b7bd9dd-C001-45" ] }, "@USE@": { "gold_contexts": [ [ "71bcd87091c4f5e2b7bc564b7bd9dd-C001-35" ], [ "71bcd87091c4f5e2b7bc564b7bd9dd-C001-42" ] ], "cite_sentences": [ "71bcd87091c4f5e2b7bc564b7bd9dd-C001-35", "71bcd87091c4f5e2b7bc564b7bd9dd-C001-42" ] } } }, "ABC_a6296d02f21ca5887c7686a2cbe56c_41": { "x": [ { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-2", "text": "Confusion network decoding has been the most successful approach in combining outputs from multiple machine translation (MT) systems in the recent DARPA GALE and NIST Open MT evaluations." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-3", "text": "Due to the varying word order between outputs from different MT systems, the hypothesis alignment presents the biggest challenge in confusion network decoding." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-4", "text": "This paper describes an incremental alignment method to build confusion networks based on the translation edit rate (TER) algorithm." 
}, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-5", "text": "This new algorithm yields significant BLEU score improvements over other recent alignment methods on the GALE test sets and was used in BBN's submission to the WMT08 shared translation task." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-6", "text": "----------------------------------" }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-8", "text": "Confusion network decoding has been applied in combining outputs from multiple machine translation systems." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-9", "text": "The earliest approach in (Bangalore et al., 2001 ) used edit distance based multiple string alignment (MSA) (Durbin et al., 1988) to build the confusion networks." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-10", "text": "The recent approaches used pair-wise alignment algorithms based on symmetric alignments from a HMM alignment model (Matusov et al., 2006) or edit distance alignments allowing shifts (Rosti et al., 2007) ." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-11", "text": "The alignment method described in this paper extends the latter by incrementally aligning the hypotheses as in MSA but also allowing shifts as in the TER alignment." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-12", "text": "The confusion networks are built around a \"skeleton\" hypothesis." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-13", "text": "The skeleton hypothesis defines the word order of the decoding output." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-14", "text": "Usually, the 1-best hypotheses from each system are considered as possible skeletons." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-15", "text": "Using the pair-wise hypothesis alignment, the confusion networks are built in two steps." 
}, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-16", "text": "First, all hypotheses are aligned against the skeleton independently." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-17", "text": "Second, the confusion networks are created from the union of these alignments." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-18", "text": "The incremental hypothesis alignment algorithm combines these two steps." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-19", "text": "All words from the previously aligned hypotheses are available, even if not present in the skeleton hypothesis, when aligning the following hypotheses." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-20", "text": "As in (Rosti et al., 2007) , confusion networks built around all skeletons are joined into a lattice which is expanded and rescored with language models." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-21", "text": "System weights and language model weights are tuned to optimize the quality of the decoding output on a development set." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-22", "text": "This paper is organized as follows." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-23", "text": "The incremental TER alignment algorithm is described in Section 2." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-24", "text": "Experimental evaluation comparing the incremental and pair-wise alignment methods are presented in Section 3 along with results on the WMT08 Europarl test sets." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-25", "text": "Conclusions and future work are presented in Section 4." 
}, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-26", "text": "----------------------------------" }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-27", "text": "**INCREMENTAL TER ALIGNMENT**" }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-28", "text": "The incremental hypothesis alignment is based on an extension of the TER algorithm (Snover et al., 2006) ." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-29", "text": "The extension allows using a confusion network as the reference." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-30", "text": "First, the algorithm finds the minimum edit distance between the hypothesis and the reference network by considering all word arcs between two consecutive nodes in the reference network as possible matches for a hypothesis word at NULL (2) like (3) NULL (2) big blue (1) balloons (2) blue (1) kites (1) Figure 1: Network after pair-wise TER alignment." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-31", "text": "that position." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-32", "text": "Second, shifts of blocks of words that have an exact match somewhere else in the network are tried in order to find a new hypothesis word order with a lower TER." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-33", "text": "Each shifted block is considered a single edit." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-34", "text": "These two steps are executed iteratively as a greedy search." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-35", "text": "The final alignment between the re-ordered hypothesis and the reference network may include matches, substitutions, deletions, and insertions." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-36", "text": "The confusion networks are built by creating a simple confusion network from the skeleton hypothesis." 
}, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-37", "text": "If the skeleton hypothesis has words, the initial network has arcs and \u00a2 \u00a1 \u00a4 \u00a3 nodes." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-38", "text": "Each arc has a set of system specific confidence scores." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-39", "text": "The score for the skeleton system is set to \u00a3 \u00a6 \u00a5 \u00a7 and the confidences for other systems are set to zeros." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-40", "text": "For each non-skeleton hypothesis, a TER alignment against the current network is executed as described above." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-41", "text": "Each match found will increase the system specific word arc confidence by" }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-42", "text": "is the rank of the hypothesis in that system's -best list." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-43", "text": "Each substitution will generate a new word arc at the corresponding position in the network." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-44", "text": "The word arc confidence for the system is set to" }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-45", "text": "and the confidences for other systems are set to zeros." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-46", "text": "Each deletion will generate a new NULL word arc unless one exists at the corresponding position in the network." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-47", "text": "The NULL word arc confidences are adjusted as in the case of a match or a substitution depending on whether the NULL word arc exists or not." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-48", "text": "Finally, each insertion will generate a new node and two word arcs at the corresponding position in the network." 
}, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-49", "text": "The first word arc will have the inserted word with the confidence set as in the case of a substitution and the second word arc will have a NULL word with confidences set by assuming all previously aligned hypotheses and the skeleton generated the NULL word arc." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-50", "text": "After all hypotheses have been added into the confusion network, the system specific word arc confidences are scaled to sum to one over all arcs between 1 2 3 4 5 6 I (3) like (3) kites (1) NULL (2) NULL (1) big (1) blue (2) balloons (2) Figure 2: Network after incremental TER alignment." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-51", "text": "each set of two consecutive nodes." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-52", "text": "Other scores for the word arc are set as in (Rosti et al., 2007) ." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-53", "text": "----------------------------------" }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-54", "text": "**BENEFITS OVER PAIR-WISE TER ALIGNMENT**" }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-55", "text": "The incremental hypothesis alignment guarantees that insertions between a hypothesis and the current confusion network are always considered when aligning the following hypotheses." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-56", "text": "This is not the case in any pair-wise hypothesis alignment algorithm." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-57", "text": "During the pair-wise hypothesis alignment, an identical word in two hypotheses may be aligned as an insertion or a substitution in a different position with respect to the skeleton." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-58", "text": "This will result in undesirable repetition and lower confidence for that word in the final confusion network." 
}, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-59", "text": "Also, multiple insertions are not handled implicitly." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-60", "text": "For example, three hypotheses \"I like balloons\", \"I like big blue balloons\", and \"I like blue kites\" might be aligned by the pair-wise alignment, assuming the first as the skeleton, as follows:" }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-61", "text": "I like NULL balloons NULL I like big blue balloons NULL I like NULL balloons NULL I like NULL blue kites which results in the confusion network shown in Figure 1 ." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-62", "text": "The number of hypotheses proposing each word is shown in parentheses." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-63", "text": "The alignment between the skeleton and the second hypothesis has two consecutive insertions \"big blue\" which are not available for matching when the third hypothesis is aligned against the skeleton." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-64", "text": "Therefore, the word \"blue\" appears twice in the confusion network." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-65", "text": "If many hypotheses have multiple insertions at the same location with respect to the skeleton, they have to be treated as phrases or a secondary alignment process has to be applied." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-66", "text": "Assuming the same hypotheses as above, the incremental hypothesis alignment may yield the following alignment: Figure 2 ." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-67", "text": "In this case the word \"blue\" is available for matching when the third hypothesis is aligned." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-68", "text": "It should be noted that the final confusion network depends on the order in which the hypotheses are added." 
}, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-69", "text": "The experiments so far have indicated that different alignment order does not have a significant influence on the final combination results as measured by the automatic evaluation metrics." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-70", "text": "Usually, aligning the system outputs in the decreasing order of their TER scores on the development set yields the best scores." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-71", "text": "----------------------------------" }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-72", "text": "**CONFUSION NETWORK ORACLE**" }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-73", "text": "The extended TER algorithm can also be used to estimate an oracle TER in a confusion network by aligning the reference translations against the confusion network." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-74", "text": "The oracle hypotheses can be extracted by finding a path with the maximum number of matches." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-75", "text": "These hypotheses give a lower bound on the TER score for the hypotheses which can be generated from the confusion networks." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-76", "text": "----------------------------------" }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-77", "text": "**EXPERIMENTAL EVALUATION**" }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-78", "text": "The quality of the final combination output depends on many factors." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-79", "text": "Combining very similar outputs does not yield as good gains as combining outputs from diverse systems." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-80", "text": "It is also important that the development set used to tune the combination weights is as similar to the evaluation set as possible." 
}, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-81", "text": "This development set should be different from the one used to tune the individual systems to avoid bias toward any system that may be over-tuned." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-82", "text": "Due to the tight schedule for the WMT08, there was no time to experiment with many configurations." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-83", "text": "As more extensive experiments have been conducted in the context of the DARPA GALE program, results on the Arabic GALE Phase 2 evaluation setup are first presented." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-84", "text": "The translation quality is measured by three MT evaluation metrics: TER (Snover et al., 2006) , BLEU (Papineni et al., 2002) , and METEOR (Lavie and Agarwal, 2007) ." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-85", "text": "----------------------------------" }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-86", "text": "**RESULTS ON ARABIC GALE OUTPUTS**" }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-87", "text": "For the Arabic GALE Phase 2 evaluation, nine systems were combined." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-88", "text": "Five systems were phrasebased, two hierarchical, one syntax-based, and one rule-based." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-89", "text": "All statistical systems were trained on common parallel data, tuned on a common genre specific development set, and a common English tokenization was used." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-90", "text": "The English bi-gram and 5-gram language models used in the system combination were trained on about 7 billion words of English text." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-91", "text": "Three iterations of bi-gram decoding weight tuning were performed followed by one iteration of 5-gram re-scoring weight tuning." 
}, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-92", "text": "All weights were tuned to minimize the sum of TER and 1-BLEU." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-93", "text": "The final 1-best outputs were true-cased and detokenized before scoring." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-94", "text": "The results on the newswire system combination development set and the GALE Phase 2 evaluation set are shown in Tables 1 and 2 ." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-95", "text": "The first two rows show the worst and best scores from the individual systems." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-96", "text": "The scores may be from different systems as the best performing system in terms of TER was not necessarily the best performing system in terms of the other metrics." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-97", "text": "The following three rows show the scores of three combination outputs where the only difference was the hypothesis alignment method." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-98", "text": "The first, syscomb pw, corresponds Table 3 : NIST BLEU scores on the German-English (deen) and French-English (fr-en) Europarl test2008 set." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-99", "text": "to the pair-wise TER alignment described in (Rosti et al., 2007) ." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-100", "text": "The second, syscomb giza, corresponds to the pair-wise symmetric HMM alignments from GIZA++ described in (Matusov et al., 2006) ." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-101", "text": "The third, syscomb inc, corresponds to the incremental TER alignment presented in this paper." 
}, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-102", "text": "Finally, oracle corresponds to an estimate of the lower bound on the translation quality obtained by extracting the TER oracle output from the confusion networks generated by the incremental TER alignment." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-103", "text": "It is unlikely that there exists a set of weights that would yield the oracle output after decoding, though." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-104", "text": "The incremental TER alignment yields significant improvements over all individual systems and the combination outputs using the pairwise alignment methods." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-105", "text": "----------------------------------" }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-106", "text": "**RESULTS ON WMT08 EUROPARL OUTPUTS**" }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-107", "text": "On the WMT08 shared translation task, translations for two language pairs and two tasks were provided for the system combination experiments." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-108", "text": "Twelve systems participated in the German-English and fourteen in the French-English translation tasks." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-109", "text": "The translations of the Europarl test (test2008) were provided as the development set outputs and the translations of the News test (newstest2008) were provided as the evaluation set outputs." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-110", "text": "An English bi-gram, 4-gram, and true-caser language models were trained by using all English text available for the WMT08 shared task, including Europarl monolingual and news commentary parallel training sets." 
}, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-111", "text": "The outputs were tokenized and lower-cased before combination, and the final combination output was true-cased and detokenized." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-112", "text": "The results on the Europarl test set for both language pairs are shown in table 3." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-113", "text": "The first two rows have the NIST BLEU scores of the worst and the best individual systems." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-114", "text": "The last row, syscomb, corresponds to the system combination using the incremental TER alignment." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-115", "text": "The improvements in the NIST BLEU scores are fairly modest which is probably due to low diversity of the system outputs." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-116", "text": "It is also unlikely that these weights are optimal for the out-of-domain News test set outputs." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-117", "text": "----------------------------------" }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-118", "text": "**CONCLUSIONS**" }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-119", "text": "This paper describes a novel hypothesis alignment algorithm for building confusion networks from multiple machine translation system outputs." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-120", "text": "The algorithm yields significant improvements on the Arabic GALE evaluation set outputs and was used in BBN's submission to the WMT08 shared translation task." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-121", "text": "The hypothesis alignment may benefit from using stemming and synonymy in matching words." }, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-122", "text": "Also, special handling of punctuation may improve the alignment further." 
}, { "sent_id": "a6296d02f21ca5887c7686a2cbe56c-C001-123", "text": "The future work will investigate the influence of better alignment to the final combination outputs." } ], "y": { "@BACK@": { "gold_contexts": [ [ "a6296d02f21ca5887c7686a2cbe56c-C001-10" ] ], "cite_sentences": [ "a6296d02f21ca5887c7686a2cbe56c-C001-10" ] }, "@EXT@": { "gold_contexts": [ [ "a6296d02f21ca5887c7686a2cbe56c-C001-10", "a6296d02f21ca5887c7686a2cbe56c-C001-11" ] ], "cite_sentences": [ "a6296d02f21ca5887c7686a2cbe56c-C001-10" ] }, "@SIM@": { "gold_contexts": [ [ "a6296d02f21ca5887c7686a2cbe56c-C001-20" ], [ "a6296d02f21ca5887c7686a2cbe56c-C001-52" ] ], "cite_sentences": [ "a6296d02f21ca5887c7686a2cbe56c-C001-20", "a6296d02f21ca5887c7686a2cbe56c-C001-52" ] }, "@USE@": { "gold_contexts": [ [ "a6296d02f21ca5887c7686a2cbe56c-C001-20" ], [ "a6296d02f21ca5887c7686a2cbe56c-C001-52" ] ], "cite_sentences": [ "a6296d02f21ca5887c7686a2cbe56c-C001-20", "a6296d02f21ca5887c7686a2cbe56c-C001-52" ] } } }, "ABC_47e109fd12ddbeebba894cead282d2_41": { "x": [ { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-2", "text": "Public debate functions as a forum for both expressing and forming opinions, an important aspect of public life." }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-3", "text": "We present results for automatically classifying posts in online debate as to the position, or STANCE that the speaker takes on an issue, such as Pro or Con." }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-4", "text": "We show that representing the dialogic structure of the debates in terms of agreement relations between speakers, greatly improves performance for stance classification, over models that operate on post content and parentpost context alone." 
}, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-5", "text": "----------------------------------" }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-7", "text": "Public debate functions as a forum for both expressing and forming opinions." }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-8", "text": "Three factors affect opinion formation, e.g. the perlocutionary uptake of debate arguments (Cialdini, 2000; Petty and Cacioppo, 1988; Petty et al., 1981) ." }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-9", "text": "First, there is the ARGU-MENT itself, i.e. the propositions discussed along with the logical relations between them." }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-10", "text": "Second is the SOURCE of the argument (Chaiken, 1980) , e.g. the speaker's expertise, or agreement relations between speakers." }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-11", "text": "The third factor consists of properties of the AUDIENCE such as prior beliefs, social identity, personality, and cognitive style (Davies, 1998) ." }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-12", "text": "Perlocutionary uptake in debates primarily occurs in the audience, who may be undecided, while debaters typically express a particular position or STANCE on an issue, e.g. Pro or Con, as in the online debate dialogues in Figs. 1, 2, and 3." 
}, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-13", "text": "Previous computational work on debate covers three different debate settings: (1) congressional de-" }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-14", "text": "----------------------------------" }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-15", "text": "**POST**" }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-16", "text": "Stance Utterance P1 PRO I feel badly for your ignorance because although there maybe a sliver of doubt that mankind may have evolved from previous animals, there is no doubt that the Earth and the cosmos have gone through evolution and are continuing to do so P2 CON As long as there are people who doubt evolution, both lay and acedamia, then evolution is in doubt. And please don't feel bad for me. I am perfectly secure in my \"ignorance\"." }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-17", "text": "P3" }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-18", "text": "PRO By that measure, as long as organic chemistry, physics and gravity are in doubt by both lay and acedamia, then organic chemistry, physics and gravity are in doubt." }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-19", "text": "Gravity is a theory." }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-20", "text": "Why aren't you giving it the same treatment you do to evolution? Or is it because you are ignorant?" }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-21", "text": "Angelic Falling anyone?" }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-22", "text": "P4 CON I'm obviously ignorant. Look how many times i've been given the title. \"Gravity is a theory." }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-23", "text": "Why aren't you giving it the same treatment you do to evolution?\" Because it doesn't carry the same weight. 
;P bates (Thomas et al., 2006; Bansal et al., 2008; Yessenalina et al., 2010; Balahur et al., 2009; Burfoot et al., 2011) ; (2) company-internal discussion sites (Murakami and Raymond, 2010; Agrawal et al., 2003) ; and (3) online social and political public forums (Somasundaran and Wiebe, 2009; Somasundaran and Wiebe, 2010; Wang and Ros\u00e9, 2010; Biran and Rambow, 2011) ." }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-24", "text": "Debates in online public forums (e.g. Fig. 1 ) differ from debates in congress and on company discussion sites in two ways." }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-25", "text": "First, the language is different." }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-26", "text": "Online debaters are highly involved, often using emotional and colorful language to make their points." }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-27", "text": "These debates are also personal, giving a strong sense of the individual making the argument, and whether s/he favors emotive or factual modes of expression, e.g. Let me answer.... NO! (P2 in Fig. 3 )." }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-28", "text": "Other common features are sarcasm, e.g. I'm obviously ignorant." }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-29", "text": "Look how many times i've been given the title (P4 in Fig. 1 ), questioning another's evidence or assumptions: Yes there is always room for human error, but is one accident that hasn't happened yet enough cause to get rid of a capital punishment? (P2 in Fig. 3 ), and insults: Or is it because you are ignorant? (P3 in Fig. 1 )." 
}, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-30", "text": "These properties may function to engage the audience and persuade them to form a particular opinion, but they make computational analysis of such debates challenging, with the best performance to date averaging 64% over several topics (Somasundaran and Wiebe, 2010 Second, the affordances of different online debate sites provide differential support for dialogic relations between forum participants." }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-31", "text": "For example, the research of Somasundaran and Wiebe (2010) , does not explicitly model dialogue or author relations." }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-32", "text": "However debates in our corpus vary greatly by topic on two dialogic factors: (1) the percent of posts that are rebuttals to prior posts, and (2) the number of posts per author." }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-33", "text": "The first 5 columns of Table 2 shows the variation in these dimensions by topic." }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-34", "text": "In this paper we show that information about dialogic relations between authors (SOURCE factors) improves performance for STANCE classification, when compared to models that only have access to properties of the ARGUMENT." }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-35", "text": "We model SOURCE relations with a graph, and add this information to classifiers operating on the text of a post." }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-36", "text": "Sec. 2 describes the corpus and our approach." }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-37", "text": "Our corpus is publicly available, see (Walker et al., 2012) ." }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-38", "text": "We show in Sec. 3 that modeling source properties improves performance when the debates are highly dialogic." 
}, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-39", "text": "We leave a more detailed comparison to previous work to Sec. 3 so that we can contrast previous work with our approach." }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-40", "text": "----------------------------------" }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-41", "text": "**EXPERIMENTAL METHOD AND APPROACH**" }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-42", "text": "Our corpus consists of two-sided debates from Convinceme.net for 14 topics that range from playful debates such as Superman vs. Batman (Fig. 2 to more heated political topics such as the Death Penalty (Fig. 3 ." }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-43", "text": "In total the corpus consists of 2902 two-sided debates (36,307 posts), totaling 3,080,874 words; the topic labelled debates which we use in our experiments contain 575,818 words." }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-44", "text": "On Convinceme, a person starts a debate by posting a topic or a question and providing sides such as for vs. against." }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-45", "text": "Debate participants can then post arguments for one side or the other, essentially self- We construct features from the posts, along with a representation of the parent post as context, and use those features in several base classifiers." 
}, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-46", "text": "As shown in Table 1 , we distinguish between basic features, such as length of the post and the words and bigrams in the post, and features capturing sentiment and subjectivity, including using the LIWC tool for emotion labelling (Pennebaker et al., 2001 ) and deriving generalized dependency features using LIWC categories, as well as some limited aspects of the argument structure, such as cue words signalling rhetorical relations between posts, POS generalized dependencies, and a representation of the parent post (context)." }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-47", "text": "Only rebuttal posts have a parent post, and thus values for the context features." }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-48", "text": "----------------------------------" }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-49", "text": "**-10**" }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-50", "text": "Figure 4: Sample maxcut to ConvinceMe siding." }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-51", "text": "Symbols (circle, cross, square, triangles) indicate authors and fill colors (white,black) indicate true side." }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-52", "text": "Rebuttal links are marked by black edges, same-author links by red; weights are 50 and -10, respectively." }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-53", "text": "Edges in the maxcut are highlighted in yellow, and the nodes in each cut set are bounded by the green dotted line." }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-54", "text": "We then construct a graph (V,E) representing the dialogue structure, using the rebuttal links and author identifiers from the forums site." }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-55", "text": "Each node V of the graph is a post, and edges E indicate dialogic relations of agreement and disagreement between posts." 
}, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-56", "text": "We assume only that authors always agree with themselves, and that rebuttal links indicate disagreement." }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-57", "text": "Agreement links based on the inference that if A, B disagree with C they agree with each other were not added to the graph." }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-58", "text": "Maxcut attempts to partition a graph into two sides." }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-59", "text": "Fig. 4 illustrates a sample result of applying MaxCut." }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-60", "text": "Edges connecting the partitions are said to be cut, while those within partitions are not." }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-61", "text": "The goal is to maximize the sum of cut edge weights." }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-62", "text": "By making edge weights high we reward the algorithm for cutting the edge, by making edge weights negative we penalize the algorithm for cutting the edge." }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-63", "text": "Rebuttal links were assigned a weight +100/(number of rebuttals)." }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-64", "text": "Same author links were assigned a weight -60/(number of posts by author)." }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-65", "text": "If author A rebutted author B at some point, then a weight of 50 was assigned to all edges connecting posts by author A and posts by author B. If author B rebutted author A as well, that 50 was increased to 100." }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-66", "text": "We applied the MaxCut partitioning algorithm to this graph, and then we orient each of the components automatically using a traditional supervised classifier." 
}, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-67", "text": "We consider each component separately where components are defined using the original (pre-MaxCut) graph." }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-68", "text": "For each pair of partition side p \u2208 {P 0 , P 1 } and classifier label l \u2208 {L 0 , L 1 }, we compute a score S p,l by summing the margins of all nodes assigned to that partition and label." }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-69", "text": "We then compute and compare the score differences for each partition." }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-70", "text": ", then nodes in partition P 0 should be assigned label L 0 and nodes in P 1 should be assigned label L 1 ." }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-71", "text": "Likewise, if D P 0 > D P 1 , then nodes in partition P 0 should be assigned label L 1 and nodes in P 1 should be assigned label L 0 ." }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-72", "text": "If D P 0 = D P 1 , then we orient the component with a coin flip." }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-73", "text": "Table 2 summarizes our results for the base classifier (JRIP) compared to using MaxCut over the social network defined by author and rebuttal links." }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-74", "text": "We report results for experiments using all the fea- (Somasundaran and Wiebe, 2010) present an unsupervised approach using ICA to stance classification, showing that identifying argumentation structure improves performance, with a best performance averaging 64% accuracy over all topics, but as high as 70% for some topics." }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-75", "text": "Other research classifies the speaker's side in a corpus of congressional floor debates (Thomas et al., 2006; Bansal et al., 2008; Balahur et al., 2009; Burfoot et al., 2011) ." 
}, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-76", "text": "Thomas et al (2006) achieved accuracies of 71.3% by using speaker agreement information in the graph-based MinCut/Maxflow algorithm, as compared to accuracies around 70% via an an SVM classifier operating on content alone." }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-77", "text": "The best performance to date on this corpus achieves accuracies around 82% for different graph-based approaches as compared to 76% accuracy for content only classification (Burfoot et al., 2011) ." }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-78", "text": "Other work applies MaxCut to the reply structure of company discussion forums, showing that rules for identifying agreement (Murakami and Raymond, 2010) , defined on the textual content of the post yield performance improvements over using reply structures alone (Malouf and Mullen, 2008; Agrawal et al., 2003) Our results are not strictly comparable since we use a different corpus with different properties, but to our knowledge this is the first application of MaxCut to stance classification that shows large performance improvements from modeling dialogic relations." }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-79", "text": "In future work, we plan to explore whether deeper linguistic features can yield large improvements in both the base classifier and in MaxCut results, and to explore better ways of automatically orienting the MaxCut graph to stance side." }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-80", "text": "We also hope to develop much better context features and to make even more use of dialogue structure." 
}, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-81", "text": "----------------------------------" }, { "sent_id": "47e109fd12ddbeebba894cead282d2-C001-82", "text": "**RESULTS AND DISCUSSION**" } ], "y": { "@BACK@": { "gold_contexts": [ [ "47e109fd12ddbeebba894cead282d2-C001-23" ], [ "47e109fd12ddbeebba894cead282d2-C001-30" ], [ "47e109fd12ddbeebba894cead282d2-C001-31" ] ], "cite_sentences": [ "47e109fd12ddbeebba894cead282d2-C001-23", "47e109fd12ddbeebba894cead282d2-C001-30", "47e109fd12ddbeebba894cead282d2-C001-31" ] } } }, "ABC_12ab280d48ef6bfae0ff27a400e2ab_41": { "x": [ { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-52", "text": "This evaluation, in turn, has enabled interesting automated teaming approaches to parsing." }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-76", "text": "The Treebank annotation paper [3] discusses the new predicate-argument annotation work under Treebank." }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-2", "text": "This session focused on experimental or planned approaches to human language technology evaluation and included an overview and five papers: two papers on experimental evaluation approaches [l, 2], and three about the ongoing work in new annotation and evaluation approaches for human language technology [3, 4, 5] ." }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-3", "text": "This was followed by fifteen minutes of general discussion." }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-4", "text": "When considering evaluation, it is important to consider the basic issues involved in evaluation:" }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-5", "text": "\u2022 . , Within-system progress: This is perhaps the most important role because it supports incremental system development, debugging and even hill climbing and automated learning approaches, if fast evaluation methods are available." 
}, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-6", "text": "\u2022 Understanding design trade-offs: It is well-known that there are trade-offs in system design, e.g., between speed and error rate for speech recognition systems; similarly, there may be trade-offs in error recovery and types of feedback in dialogue-based systems." }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-7", "text": "Appropriate evaluation methods make it possible to design controlled experiments to investigate these trade-offs." }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-8", "text": "Directing research focus: Evaluation (especially when associated with research funding) brings increased attention to the technology being evaluated." }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-9", "text": "It also fosters increased infrastructure to support evaluation, and in turn, infrastructure supports evaluation." }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-10", "text": "1 The success of the ARPA human language technology program can be attributed in part to the judicious use of common evaluation to focus attention on particular research issues, resulting in rapid improvement in the technology, increased sharing of technical information, and broader participation in the research activities." }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-11", "text": "Once we decide to evaluate, the first question is what to evaluate? Where do we put probes to inspect the input and output, in order to perform an evaluation?" }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-12", "text": "This issue is discussed in the Sparck Jones paper [ 1 ] ." }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-13", "text": "In some cases, we can evaluate the language technology in isolation from any front-end or back-end application, as shown in Figure 1 , where probes are inserted on either side of the language interface itself." 
}, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-14", "text": "This gives us the kind of evaluation used for word error rate in speech (speech in, transcription out) or for machine translation, as proposed in the Brew/Thompson paper (source text in, target text out) [2] ." }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-15", "text": "This kind of evaluation computes output as a simple function of input to the language system." }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-16", "text": "Unfortunately, it is not always possible to measure a meaningful output-for example, researchers have struggled long and hard with measurements for understanding -how can a system demonstrate that it has understood? If we had a general semantic representation, then we could insert a probe on the output side of the semantic component, independent of any specific application." }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-17", "text": "The last three papers ([3, 4, 5]) take various approaches to the issue of predicate-argument 1The Penn Treebank parse annotations provide an interesting case where annotation supported evaluation." }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-18", "text": "By creating a theory-neutral description of a correct parse, the Treebank annotation enabled researchers to take the next step in agreeing to use the parse annotations (bracketings) as a \"gold standard\" against which to compare system-derived bracketings [9] ." }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-19", "text": "This evaluation, in turn, has enabled interesting automated teaming approaches to parsing." 
}, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-20", "text": "99" }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-21", "text": "----------------------------------" }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-22", "text": "****" }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-23", "text": "This session focused on experimental or planned approaches to human language technology evaluation and included an overview and five papers: two papers on experimental evaluation approaches [l, 2] , and three about the ongoing work in new annotation and evaluation approaches for human language technology [3, 4, 5] ." }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-24", "text": "This was followed by fifteen minutes of general discussion." }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-25", "text": "When considering evaluation, it is important to consider the basic issues involved in evaluation:" }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-26", "text": "\u2022 .Why evaluate: what are the goals of evaluation?" }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-27", "text": "\u2022 What to evaluate: what function(s) of the system should be evaluated, e.g., what input/outputpairs are compared?" }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-28", "text": "\u2022 How to evaluate: what procedures can be used to evalute specific system functions (or to grade goodness of input/output pairs)?" }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-29", "text": "\u2022 Where to go from here: what additional evaluations are needed and what can be developed to support future research?" 
}, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-30", "text": "----------------------------------" }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-31", "text": "**WHY EVALUATE?**" }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-32", "text": "Evaluation serves a number of purposes:" }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-33", "text": "* Cross-system evaluation: This is a mainstay of the periodic ARPA evaluations on competing systems." }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-34", "text": "Multiple sites agree to run their respective systems on a single application, so that results across systems are comparable." }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-35", "text": "This includes evaluations such as message understanding (MUC) [6] , information retrieval (TREC) [7] , spoken language systems (ATIS) [8] , and automated speech recognition (CSR) [8] ." }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-36", "text": ", Within-system progress: This is perhaps the most important role because it supports incremental system development, debugging and even hill climbing and automated learning approaches, if fast evaluation methods are available." }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-37", "text": "\u2022 Understanding design trade-offs: It is well-known that there are trade-offs in system design, e.g., between speed and error rate for speech recognition systems; similarly, there may be trade-offs in error recovery and types of feedback in dialogue-based systems." }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-38", "text": "Appropriate evaluation methods make it possible to design controlled experiments to investigate these trade-offs." }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-39", "text": "Directing research focus: Evaluation (especially when associated with research funding) brings increased attention to the technology being evaluated." 
}, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-40", "text": "It also fosters increased infrastructure to support evaluation, and in turn, infrastructure supports evaluation." }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-41", "text": "1 The success of the ARPA human language technology program can be attributed in part to the judicious use of common evaluation to focus attention on particular research issues, resulting in rapid improvement in the technology, increased sharing of technical information, and broader participation in the research activities." }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-42", "text": "----------------------------------" }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-43", "text": "**WHAT TO EVALUATE?**" }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-44", "text": "Once we decide to evaluate, the first question is what to evaluate? Where do we put probes to inspect the input and output, in order to perform an evaluation?" }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-45", "text": "This issue is discussed in the Sparck Jones paper [ 1 ] ." }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-46", "text": "In some cases, we can evaluate the language technology in isolation from any front-end or back-end application, as shown in Figure 1 , where probes are inserted on either side of the language interface itself." }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-47", "text": "This gives us the kind of evaluation used for word error rate in speech (speech in, transcription out) or for machine translation, as proposed in the Brew/Thompson paper (source text in, target text out) [2] ." }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-48", "text": "This kind of evaluation computes output as a simple function of input to the language system." 
}, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-49", "text": "Unfortunately, it is not always possible to measure a meaningful output-for example, researchers have struggled long and hard with measurements for understanding -how can a system demonstrate that it has understood? If we had a general semantic representation, then we could insert a probe on the output side of the semantic component, independent of any specific application." }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-50", "text": "The last three papers ( [3, 4, 5] ) take various approaches to the issue of predicate-argument 1The Penn Treebank parse annotations provide an interesting case where annotation supported evaluation." }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-51", "text": "By creating a theory-neutral description of a correct parse, the Treebank annotation enabled researchers to take the next step in agreeing to use the parse annotations (bracketings) as a \"gold standard\" against which to compare system-derived bracketings [9] ." }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-53", "text": "Right now, we can only measure understanding by evaluating an interface coupled to an application - Figure 2 shows the application back-end included inside the evaluation." }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-54", "text": "This allows us to evaluate understanding in terms of getting the right answer for a specific task, as is done in the Air Travel In formation (ATIS) system, which evaluates language input/database answer output pairs." }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-55", "text": "However, this means that to evaluate spoken language understanding, it is necessary to build an entire air travel information system." 
}, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-56", "text": "Finally, for certain kinds of applications, particularly interactive applications, it is appropriate to enlarge the scope of evaluation still further to include the users." }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-57", "text": "For interactive systems, this is particularly important because the user response determines what the system does next, so that it is not possible to use pre-recorded data." }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-58", "text": "2 Increasingly complex human-computer interfaces, as well as complex collaborative tools, demand that a system be evaluated in its overall context of use (see Figure 3 )." }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-59", "text": "2pre-recorded data allows the same dam to be used by all participating sites, effectively removing human variability as a factor in the evaluation." }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-60", "text": "pairs as well." }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-61", "text": "Evaluation seems relatively easy when there is an intuitive pairing between input and output, for example, between speech signal and transcription at the word or sentence level." }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-62", "text": "The task is much more complex when there is either no representation for the output (how to represent understanding?) or in situations where the result is not unique: what is the correct translation of a particular text? What is the best response to a particular query?" }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-63", "text": "For such cases, it is often expedient to rely on human judgements, provided that these judgements (or relative judgements) are reproducible, given a sufficient number of judges." 
}, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-64", "text": "Evaluation of machine translation systems [lO] has used human judges to evaluate systems with differing degrees of interactivity and across different language pairs." }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-65", "text": "The Brew and Thompson paper [2] also describes reliability of human judges in evaluating machine translation systems." }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-66", "text": "Human judges have also been used in end-to-end evaluation of spoken language interfaces [11] ." }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-67", "text": "----------------------------------" }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-68", "text": "**WHERE TO GO FROM HERE?**" }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-69", "text": "Because evaluation plays such an important role in driving research, we must weigh carefully what and how we evaluate." }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-70", "text": "Evaluation should be theory neutral, to avoid bias against novel approaches; it should also push the frontiers of what we know how to do; and finally, it should support a broad range of research interests because evaluation is expensive." }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-71", "text": "It requires significant community investment in infrastructure, not to mention time devoted to running evaluations and participating in them." }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-72", "text": "For example, we estimate that the ATIS evaluation required several person-years to prepare annotated data, a staffof two to three people at NIST over several months to run the evaluation, time spent agreeing on standards, and months of staff effort at participating sites." 
}, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-73", "text": "Altogether, the annual cost of an evaluation certainly exceeds five person-years, or conservatively at least $500,000 per evaluation." }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-74", "text": "Given this level of investment, it is critical to co-ordinate effort and obtain maximum leverage." }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-75", "text": "The last three papers [3, 4, 5] all reflect a concern to develop better evaluation methods for semantics, with a shared focus on predicate-argument evaluation." }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-77", "text": "The paper by Grishman discusses a range of new evaluation efforts for MUC, which are aimed at providing finer grained component evaluations." }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-78", "text": "The last paper, by Moore, describes a similar, but distinct, effort towards developing more semantic evaluation methods for the spoken language community." }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-79", "text": "----------------------------------" }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-80", "text": "**DISCUSSION**" }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-81", "text": "The discussion began with the question: can we afford three somewhat similar but distinct predicate-argument evaluations?" }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-82", "text": "The resulting interchange helped to clarify the relationship between these three proposals." }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-83", "text": "Both Marcus and Grishman argued that the Treebank annotation should directly support the MUC-style predicate-argument evaluation outlined in [4] , although the Treebank annotations may be a sub-set of what is used for MUC predicate-argument evaluation." 
}, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-84", "text": "The relation of the spoken language \"predicate-argument\" evaluation to the other two was less clear." }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-85", "text": "Moore explicitly stated during the discussion (and Marcus agreed) that the Treebank annotation is quite different (more syntactic and more \"surface\") than the predicate-argument notation planned for spoken language." }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-86", "text": "Moore believed that a deeper level (less syntactic and more semantic) was needed to meet the needs of (some parts of) the spoken language community." }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-87", "text": "Thus, although the spoken and written language communities have an opportunity to converge on some common annotation and evaluation metrics, this may well not happen." }, { "sent_id": "12ab280d48ef6bfae0ff27a400e2ab-C001-88", "text": "These annotation and evaluation approaches are, however, \"work-inprogress\" and economic and time considerations may cause some convergence, even while theories and research agendas remain distinct." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "12ab280d48ef6bfae0ff27a400e2ab-C001-2" ], [ "12ab280d48ef6bfae0ff27a400e2ab-C001-17" ], [ "12ab280d48ef6bfae0ff27a400e2ab-C001-23" ], [ "12ab280d48ef6bfae0ff27a400e2ab-C001-50" ], [ "12ab280d48ef6bfae0ff27a400e2ab-C001-75" ], [ "12ab280d48ef6bfae0ff27a400e2ab-C001-83" ] ], "cite_sentences": [ "12ab280d48ef6bfae0ff27a400e2ab-C001-2", "12ab280d48ef6bfae0ff27a400e2ab-C001-17", "12ab280d48ef6bfae0ff27a400e2ab-C001-23", "12ab280d48ef6bfae0ff27a400e2ab-C001-50", "12ab280d48ef6bfae0ff27a400e2ab-C001-75", "12ab280d48ef6bfae0ff27a400e2ab-C001-83" ] } } }, "ABC_2915e49791d14f5b802225d10f33fb_41": { "x": [ { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-2", "text": "Sentiment analysis means to extract opinion of users from review documents." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-3", "text": "Sentiment classification using Machine Learning (ML) methods faces the problem of high dimensionality of feature vector." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-4", "text": "Therefore, a feature selection method is required to eliminate the irrelevant and noisy features from the feature vector for efficient working of ML algorithms." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-5", "text": "Rough Set Theory based feature selection method finds the optimal feature subset by eliminating the redundant features." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-6", "text": "In this paper, Rough Set Theory (RST) based feature selection method is applied for sentiment classification." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-7", "text": "A Hybrid feature selection method based on RST and Information Gain (IG) is proposed for sentiment classification." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-8", "text": "Proposed methods are evaluated on four standard datasets viz." 
}, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-9", "text": "Movie review, product (book, DVD and electronics) review dataset." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-35", "text": "Dataset, Experimental setup and results are discussed in Section 4." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-36", "text": "Finally, Section 5 describes conclusions." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-37", "text": "----------------------------------" }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-38", "text": "**RELATED WORK**" }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-39", "text": "Machine Learning methods have been widely applied for sentiment analysis (Pang et al. 2008; Pang et al. 2002; Tan et al. 2008 )." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-40", "text": "Pang and Lee (2004) experimented with various features like unigrams, bi-grams and adjectives for sentiment classification of movie reviews using different machine learning algorithms namely Na\u00efve Bayes (NB), Support Vector Machines (SVM), and Maximum-Entropy (ME)." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-41", "text": "Feature selection methods improve the performance of sentiment classification by eliminating the noisy and irrelevant features from feature vector." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-42", "text": "Tan et al. (2008) investigated with various feature selection methods with different machine learning algorithm for sentiment classification." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-43", "text": "Their experimental results show that IG performs better as compared to other feature selection methods and SVM is best machine learning algorithms." 
}, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-44", "text": "Categorical Probability Proportion Difference (CPPD) feature selection method is proposed which computes the importance of a feature based on its class discriminating ability for sentiment classification (Agarwal et al. 2012) ." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-45", "text": "Various features are extracted from the text for sentiment classification." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-46", "text": "Further, Minimum Redundancy Maximum Relevancy (mRMR) and IG feature selection methods are used to select prominent features for better sentiment classification by machine learning algorithms (Agarwal et al. 2013) ." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-47", "text": "Rough set based dimensionality reduction method is applied for data reduction to characterize bookmarks and it is compared with conventional entropy based reduction method (Jensen et al. 2009 )." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-48", "text": "Dimension reduction method based on fuzzy-rough sets and Ant Colony Optimization (ACO) method is proposed (Jensen et al. 2006) , which is applied to the web categorisation problem." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-49", "text": "Experimental result show significant reduction in the data redundancy." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-50", "text": "Rough set theory is applied to select relevant features for web-page classification." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-51", "text": "Their experimental results show that the rough set based feature selection method with SVM gives better accuracy (Wakaki et al. 2004 )." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-52", "text": "Applicability of RS theory for various existing text classification techniques are discussed in detail with e-mail categorization as an example application (Chouchoulas et al. 2001) ." 
}, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-53", "text": "----------------------------------" }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-54", "text": "**METHODOLOGY USED**" }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-55", "text": "----------------------------------" }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-56", "text": "**ROUGH SET ATTRIBUTE REDUCTION (RSAR)**" }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-57", "text": "Rough Sets Theory (RST) (Jensen et al. 2007 ) is a mathematical tool to make attribute reduction by eliminating redundant condition attributes (features)." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-58", "text": "The rough set is the approximation of a vague concept (set) by a pair of precise concepts, called lower and upper approximations." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-59", "text": "Rough Set Attribute Reduction (RSAR) (Jensen et al. 2007 ) is a filter based method by which redundant features are eliminated by keeping the amount of knowledge intact in the System." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-60", "text": "Basic intuition behind RSAR is that objects belonging to the same category (same attributes) are not distinguishable (Jensen et al. 2009 )." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-61", "text": "RSAR algorithm finds the vague attributes which do not have important role in the classification." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-62", "text": "Therefore, it is needed to remove redundant features without changing the knowledge embedded in the information system." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-63", "text": "An important issue in data analysis is to discover dependencies between the attributes." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-64", "text": "QUICKREDUCT method (Jensen et al. 2007; Jensen et al. 
2009 ) calculate a minimal reduct without exhaustively generating all possible subsets, it is used in our experiments for obtaining optimal feature subset." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-65", "text": "Main advantage of RSAR is that it does not require any additional parameter to operate like threshold is required in case of IG." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-66", "text": "----------------------------------" }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-67", "text": "**INFORMATION GAIN (IG)**" }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-68", "text": "Information gain (IG) is one of the important feature selection techniques for sentiment classification." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-69", "text": "IG is used to select important features with respect to class attribute." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-70", "text": "It is measured by the reduction in the uncertainty in identifying the class attribute when the value of the feature is known." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-71", "text": "The top ranked (important) features are selected for reducing the feature vector size in turn better classification results." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-72", "text": "----------------------------------" }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-73", "text": "**PROPOSED HYBRID APPROACH TO FEATURE SELECTION**" }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-74", "text": "The usefulness of an attribute is determined by both its relevancy and redundancy." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-75", "text": "An attribute is relevant if it is predictive to the class attribute, otherwise it is irrelevant." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-76", "text": "An attribute is consid-ered to be redundant if it is correlated with other attributes." 
}, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-77", "text": "Hence, The Aim is to find the attributes that are highly correlated with the class attribute, but not with other attributes for a good attribute subset (Jensen et al. 2007 )." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-78", "text": "Information Gain based feature selection methods determine the importance of a feature in the documents." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-79", "text": "But, it has disadvantage that threshold value is required initially which is not known generally." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-80", "text": "This method does not consider the redundancy among the attributes." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-81", "text": "In addition, it will return large number of features when massive amount of documents are to be considered." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-82", "text": "RSAR can reduce most of the irrelevant and noisy features." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-83", "text": "It reduces the redundancy among the features." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-84", "text": "It has advantage that it considers the dependency of combination of features on decision attribute in contrast to other conventional feature selection methods (Jensen et al. 2007 )." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-85", "text": "However, it has some disadvantages." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-86", "text": "Firstly, to get an optimal reduct is a NP-hard problem, some heuristic algorithms are used to get approximate reduction (Jensen et al. 2004; Jensen et al. 2009 )." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-87", "text": "Secondly, it is very time consuming." 
}, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-88", "text": "Therefore, an integrated method is developed which can reduce most of the redundant features and get the minimal feature set with reduced time complexity for sentiment classification." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-89", "text": "Proposed Algorithm works in two steps." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-90", "text": "Firstly, Information Gain (IG) of each feature is computed and all the features are taken which has information gain value to be greater than 0." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-91", "text": "So that initially irrelevant and noisy features are removed from the feature vector, by this a lot computational efforts are reduced." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-92", "text": "Main assumption and motivation behind this step is that IG would eliminate the features which are likely to be noisy and irrelevant features." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-93", "text": "Further, Reduced feature set is sent to the RSAR feature selection method to get optimal feature subset." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-94", "text": "So, by combining both the methods a feature selection is proposed which is more efficient in terms of computational and time complexity." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-95", "text": "----------------------------------" }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-96", "text": "**DATASET USED AND EXPERIMENTAL SETUP**" }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-97", "text": "For the evaluation of the proposed method, one of the most popular publically available movie review dataset (Pang et al. 2004 ) is used." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-98", "text": "This standard dataset contains 2000 reviews comprising 1000 positive and 1000 negative reviews." 
}, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-99", "text": "Product review dataset consisting amazon products reviews is also used provided by Blitzer et al. (2007) ." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-100", "text": "We used product reviews of books, DVD and electronics for experiments." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-101", "text": "Each domain has 1000 positive and 1000 negative labelled reviews." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-102", "text": "Documents are initially pre-processed as follows: (i) Negation handling is performed as Pang et al. (2002) , \"NOT_\" is added to every words occurring after the negation word (no, not, isn't, can't, never, couldn't, didn't, wouldn't, don't) and first punctuation mark in the sentence." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-103", "text": "(ii) Words occurring in less than 3 documents are removed from the feature set." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-104", "text": "Binary weighting scheme has been identified as a better weighting scheme as compared to frequency based schemes for sentiment classification (Pang et al. 2002) ; therefore we also used binary weighting method for representing text." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-105", "text": "In addition, there is no need of using separate discretisation method in case of binary weighting scheme as required by RSAR feature selection algorithm." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-106", "text": "Noisy and irrelevant features are eliminated from the feature vector generated after pre-processing using various feature selection methods discussed before." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-107", "text": "Further, prominent feature vector is used by machine learning algorithms." 
}, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-108", "text": "Support Vector Machine (SVM) and Na\u00efve Bayes (NB) classifiers are the mostly used for sentiment classification (Pang et al. 2002; Tan et al. 2008) ." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-109", "text": "Therefore, we report the classification results of SVM and NB classifier for classifying review documents into positive or negative sentiment polarity." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-110", "text": "For the evaluation of proposed methods 10 fold cross validation method is used." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-111", "text": "Fmeasure value is reported as a performance measure of various classifiers (Agarwal et al. 2013)" }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-112", "text": "----------------------------------" }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-113", "text": "**EXPERIMENTAL RESULTS AND DISCUSSIONS**" }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-114", "text": "Initially, unigram features are extracted from the review documents." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-115", "text": "Feature set without using any feature selection method is taken as a baseline." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-116", "text": "Further, various feature selection algorithms are used for selecting optimal feature subset." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-117", "text": "IG is used for comparison with the proposed feature selection method as it has been considered as one of the best feature selection method for sentiment classification (Pang et al. 2008; Tan et al. 2008 Experimental results show that both feature selection methods (RSAR and IG) are able to improve the performance from baseline (as shown in Table 2 )." 
}, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-118", "text": "For example from" }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-119", "text": "----------------------------------" }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-120", "text": "**CONCLUSION**" }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-121", "text": "Rough set based dimension reduction method is applied for sentiment analysis." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-122", "text": "It is capable of reducing the redundancy among the attributes." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-123", "text": "Rough set based methods computes the best feature subset based on minimized redundancy in contrast to information gain which computes the importance of the attribute based on the entropy." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-124", "text": "Hybrid feature selection method is proposed which is based on RSAR and IG." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-125", "text": "Experimental results show that Hybrid feature selection method with very less number of features produces better results as compared to other feature selection methods." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-126", "text": "All the methods are experimented using four standard datasets." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-127", "text": "In future, more methods can be explored for making rough set based feature selection method computationally more efficient by incorporating evolutionary approaches in selecting feature subsets." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-10", "text": "Experimental results show that Hybrid feature selection method outperforms than other feature selection methods for sentiment classification." 
}, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-11", "text": "----------------------------------" }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-12", "text": "**INTRODUCTION**" }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-13", "text": "Sentiment analysis is to extract the users' opinion by analysing the text documents (Pang et al. 2008) ." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-14", "text": "Nowadays people are using web for writing their opinion on blogs, social networking websites, discussion forums etc." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-15", "text": "Hence, it is very much needed to analyse these web contents." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-16", "text": "Thus, it increases the demand of sentiment analysis research." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-17", "text": "Sentiment analysis has been very important for the users as well as for business with the drastic increase of online content." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-18", "text": "For users, it is important to know past experiences about some product or services for taking decision in purchasing products." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-19", "text": "Companies can use sentiment analysis in improving their products based on the users' feedback written about their products on blogs." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-20", "text": "E-commerce based companies know the online trends about the products." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-21", "text": "Example of sentiment analysis is -knowing which model of a camera is liked by most of the users." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-22", "text": "Sentiment classification can be considered as a text classification problem." 
}, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-23", "text": "Bag-of-Words (BOW) representation is commonly used for sentiment classification using machine learning approaches." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-24", "text": "The words present in all the documents create the feature vector." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-25", "text": "Generally, this feature vector is huge in dimension that is used by machine learning methods for classification." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-26", "text": "This high dimensional feature vector deteriorates the performance of machine learning algorithm." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-27", "text": "Rough set theory has been used for reducing the feature vector size for text classification (Jensen et al. 2001; Jensen et al. 2009; Wakaki et al. 2004 )." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-28", "text": "However, it has not been investigated for sentiment analysis yet." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-29", "text": "Contribution of this paper:-1." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-30", "text": "Rough Set theory based feature selection method is applied for sentiment classification." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-31", "text": "2. Hybrid Feature selection method is proposed based on Rough Set and Information Gain which performs better than other feature selection methods." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-32", "text": "3. Proposed methods are experimented with four different standard datasets." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-33", "text": "The paper is organized as follows: A brief discussion of the earlier research work is given in Section 2." }, { "sent_id": "2915e49791d14f5b802225d10f33fb-C001-34", "text": "Section 3 describes the feature selections method used for sentiment classification." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "2915e49791d14f5b802225d10f33fb-C001-39" ], [ "2915e49791d14f5b802225d10f33fb-C001-104" ], [ "2915e49791d14f5b802225d10f33fb-C001-108" ] ], "cite_sentences": [ "2915e49791d14f5b802225d10f33fb-C001-39", "2915e49791d14f5b802225d10f33fb-C001-104", "2915e49791d14f5b802225d10f33fb-C001-108" ] }, "@USE@": { "gold_contexts": [ [ "2915e49791d14f5b802225d10f33fb-C001-102" ] ], "cite_sentences": [ "2915e49791d14f5b802225d10f33fb-C001-102" ] }, "@SIM@": { "gold_contexts": [ [ "2915e49791d14f5b802225d10f33fb-C001-102" ] ], "cite_sentences": [ "2915e49791d14f5b802225d10f33fb-C001-102" ] }, "@MOT@": { "gold_contexts": [ [ "2915e49791d14f5b802225d10f33fb-C001-104" ] ], "cite_sentences": [ "2915e49791d14f5b802225d10f33fb-C001-104" ] } } }, "ABC_4bb290aba7ee7843280a8b0e88e5a0_41": { "x": [ { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-6", "text": "----------------------------------" }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-2", "text": "We contrast two seemingly distinct approaches to the task of question answering (QA) using Freebase: one based on information extraction techniques, the other on semantic parsing." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-3", "text": "Results over the same test-set were collected from two state-of-the-art, open-source systems, then analyzed in consultation with those systems' creators." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-4", "text": "We conclude that the differences between these technologies, both in task performance and in how they get there, are not significant." 
}, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-5", "text": "This suggests that the semantic parsing community should target answering more compositional open-domain questions that are beyond the reach of more direct information extraction methods." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-8", "text": "Question Answering (QA) from structured data, such as DBPedia (Auer et al., 2007) , Freebase (Bollacker et al., 2008) and Yago2 (Hoffart et al., 2011) , has drawn significant interest from both knowledge base (KB) and semantic parsing (SP) researchers." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-9", "text": "The majority of such work treats the KB as a database, to which standard database queries (SPARQL, MySQL, etc.) are issued to retrieve answers." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-10", "text": "Language understanding is modeled as the task of converting natural language questions into queries through intermediate logical forms, with two popular approaches being: CCG parsing (Zettlemoyer and Collins, 2005; Zettlemoyer and Collins, 2007; Zettlemoyer and Collins, 2009; Kwiatkowski et al., 2010; Kwiatkowski et al., 2011; Krishnamurthy and Mitchell, 2012; Kwiatkowski et al., 2013; Cai and Yates, 2013a) , and dependency-based compositional semantics (Liang et al., 2011; Berant et al., 2013; Berant and Liang, 2014) ." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-11", "text": "We characterize semantic parsing as the task of deriving a representation of meaning from language, sufficient for a given task." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-12", "text": "Traditional information extraction (IE) from text may be coarsely characterized as representing a certain level of semantic parsing, where the goal is to derive enough meaning in order to populate a database with factoids of a form matching a given schema." 
}, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-13", "text": "1 Given the ease with which reasonably accurate, deep syntactic structure can be automatically derived over (English) text, it is not surprising that IE researchers would start including such \"features\" in their models." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-14", "text": "Our question is then: what is the difference between an IE system with access to syntax and a semantic parser, when both are targeting a factoid-extraction style task?" }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-15", "text": "While our conclusions should hold generally for similar KBs, we will focus on Freebase, as explored by Krishnamurthy and Mitchell (2012) , and then by others such as Cai and Yates (2013a) and Berant et al. (2013) ." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-16", "text": "We compare two open-source, state-of-the-art systems on the task of Freebase QA: the semantic parsing system SEMPRE (Berant et al., 2013) , and the IE system jacana-freebase (Yao and Van Durme, 2014) ." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-17", "text": "We find that these two systems are on par with each other, with no significant differences in terms of accuracy between them." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-18", "text": "A major distinction between the work of Berant et al. (2013) and Yao and Van Durme (2014) is the ability of the former to represent, and compose, aggregation operators (such as argmax, or count), as well as integrate disparate pieces of information." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-19", "text": "This representational capability was important in previous, closed-domain tasks such as GeoQuery." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-20", "text": "The move to Freebase by the SP community was meant to" }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-21", "text": "provide richer, open-domain challenges." 
}, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-22", "text": "While the vocabulary increased, our analysis suggests that compositionality and complexity decreased." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-23", "text": "We therefore conclude that the semantic parsing community should target more challenging open-domain datasets, ones that \"standard IE\" methods are less capable of attacking." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-24", "text": "----------------------------------" }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-25", "text": "**IE AND SP SYSTEMS**" }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-26", "text": "jacana-freebase 2 (Yao and Van Durme, 2014) treats QA from a KB as a binary classification problem." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-27", "text": "Freebase is a gigantic graph with millions of nodes (topics) and billions of edges (relations)." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-28", "text": "For each question, jacana-freebase first selects a \"view\" of Freebase concerning only involved topics and their close neighbors (this \"view\" is called a topic graph)." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-29", "text": "For instance, for the question \"who is the brother of justin bieber?\", the topic graph of Justin Bieber, containing all nodes related to the topic (think of the \"Justin Bieber\" page displayed by the browser), is selected and retrieved by the Freebase Topic API." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-30", "text": "Usually such a topic graph contains hundreds to thousands of nodes in close relation to the central topic." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-31", "text": "Then each of the nodes is judged as an answer or not by a logistic regression learner." 
}, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-32", "text": "Features for the logistic regression learner are first extracted from both the question and the topic graph." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-33", "text": "An analysis of the dependency parse of the question characterizes the question word, topic, verb, and named entities of the main subject as the question features, such as qword=who." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-34", "text": "Features on each node include the types of relations and properties the node possesses, such as type=person." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-35", "text": "Finally, features from both the question and each node are combined as the final features used by the learner, such as qword=who|type=person." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-36", "text": "In this way the association between the question and answer type is enforced." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-37", "text": "Thus during decoding, for instance, if there is a who question, the nodes with a person property would be ranked higher as answer candidates." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-38", "text": "SEMPRE 3 is an open-source system for training semantic parsers, which has been used to train a semantic parser against Freebase by Berant et al. (2013) ." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-39", "text": "SEMPRE maps NL utterances to logical forms by performing bottom-up parsing." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-40", "text": "First, a lexicon is used to map NL phrases to KB predicates, and then predicates are combined to form a full logical form by a context-free grammar." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-41", "text": "Since logical forms can be derived in multiple ways from the grammar, a log-linear model is used to rank possible derivations." 
}, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-42", "text": "The parameters of the model are trained from question-answer pairs." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-43", "text": "----------------------------------" }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-44", "text": "**ANALYSIS**" }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-45", "text": "----------------------------------" }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-46", "text": "**EVALUATION METRICS**" }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-47", "text": "Both Berant et al. (2013) and Yao and Van Durme (2014) tested their systems on the WEBQUESTIONS dataset, which contains 3778 training questions and 2032 test questions collected from the Google Suggest API." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-48", "text": "Each question came with a standard answer from Freebase annotated by Amazon Mechanical Turk." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-49", "text": "Berant et al. (2013) reported a score of 31.4% in terms of accuracy (with partial credit if inexact match) on the test set and later in Berant and Liang (2014) revised it to 35.7%." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-50", "text": "Berant et al. focused on accuracy -how many questions were correctly answered by the system." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-51", "text": "Since their system answered almost all questions, accuracy is roughly identical to F 1 ." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-52", "text": "Yao and Van Durme (2014)'s system on the other hand only answered 80% of all test questions." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-53", "text": "Thus they report a score of 42% in terms of F 1 on this dataset." 
}, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-54", "text": "For the purpose of comparing among all test questions, we lowered the logistic regression prediction threshold (usually 0.5) on jacana-freebase for the other 20% of questions to which jacana-freebase had not proposed an answer, and selected the prediction with the highest score as the answer." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-55", "text": "In this way jacana-freebase was able to answer all questions with a lower accuracy of 35.4%." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-56", "text": "In the following we present analysis results based on the test questions where the two systems had very similar performance (35.7% vs. 35.4%)." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-57", "text": "4 The difference is not significant according to the paired permutation test (Smucker et al., 2007) ." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-58", "text": "----------------------------------" }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-59", "text": "**ACCURACY VS. COVERAGE**" }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-60", "text": "First, we were interested to see the proportions of questions SEMPRE and jacana-freebase jointly and separately answered correctly." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-61", "text": "Since turkers did not exhaustively pick out all possible answers, evaluation is performed by computing the F 1 between the set of answers given by the system and the answers provided by turkers." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-62", "text": "With a strict threshold of F 1 = 1 and a permissive threshold of F 1 \u2265 0.5 to judge the correctness, we list the pair-wise correctness matrix in Table 1 . Not surprisingly, both systems had most questions wrong given that the averaged F 1 's were only around 35%." 
}, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-63", "text": "With the threshold F 1 = 1, SEMPRE answered more questions exactly correctly compared to jacana-freebase, while when F 1 \u2265 0.5, it was the other way around." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-64", "text": "This shows that SEMPRE is more accurate in certain questions." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-65", "text": "The reason behind this is that SEMPRE always fires queries that return exactly one set of answers from Freebase, while jacana-freebase could potentially tag multiple nodes as the answer, which may lower the accuracy." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-66", "text": "We have shown that both systems can be more accurate in certain questions, but when? Is there a correlation between the system confidence and accuracy?" }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-67", "text": "Thus we took the logistic decoding score (between 0 and 1) from jacana-freebase and the probability from the log-linear model used by SEMPRE as confidence, and plotted an \"accuracy vs. coverage\" curve, which shows the accuracy of a QA engine with respect to its coverage of all questions." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-68", "text": "The curve basically answers one question: at a fixed accuracy, what is the proportion of questions that can be answered?" }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-69", "text": "A better system should be able to answer more questions correctly with the same accuracy." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-70", "text": "The curve was drawn in the following way." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-71", "text": "For each question, we select the best answer candidate with the highest confidence score." 
}, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-72", "text": "Then for the whole test set, we have a list of (question, highest ranked answer, confidence score) tuples." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-73", "text": "The two curves generally follow a similar trend, but while jacana-freebase has higher accuracy when coverage is low, SEMPRE obtains slightly better accuracy when more questions are answered." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-74", "text": "----------------------------------" }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-75", "text": "**ACCURACY BY QUESTION LENGTH AND TYPE**" }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-76", "text": "Do accuracies of the two systems differ with respect to the complexity of questions?" }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-77", "text": "Since there is no clear way to measure question complexity, we use question length as a surrogate and report accuracies by question length in Figure 2 ." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-78", "text": "Most of the questions were 5 to 8 words long and there was no substantial difference in terms of accuracies." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-79", "text": "The major difference lies in questions of length 3, 12 and 13." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-80", "text": "However, the number of such questions was not high enough to show any statistical significance." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-81", "text": "Figure 3 further shows the accuracies with respect to the question types (as reflected by the WH-word)." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-82", "text": "Again, there is no significant difference between the two systems." 
}, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-83", "text": "----------------------------------" }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-84", "text": "**LEARNED FEATURES**" }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-85", "text": "What did the systems learn during training?" }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-86", "text": "We compare them by presenting the top features by weight, as listed in Table 2 ." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-87", "text": "Clearly, the type of knowledge learned by the systems in these features is similar: both systems learn to associate certain phrases with predicates from the KB." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-88", "text": "We note, however, that SEMPRE also obtains information from the fully constructed logical form." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-89", "text": "For instance, SEMPRE learns that logical forms that return an empty set when executed against the KB are usually incorrect (the weight for this feature is -8.88)." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-90", "text": "In this respect the SP approach \"understands\" more than the IE approach." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-91", "text": "We did not further compare on other datasets such as GeoQuery (Tang and Mooney, 2001 ) and FREE917 (Cai and Yates, 2013b) ." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-92", "text": "The first one involves geographic inference and multiple constraints in queries, directly fitting the compositional nature of semantic parsing." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-93", "text": "The second one was manually generated by looking at Freebase topics." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-94", "text": "Both datasets were less realistic than the WEBQUESTIONS dataset." 
}, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-95", "text": "Both datasets were also less challenging (accuracy/F 1 were between 80% and 90%) compared to WEBQUESTIONS (around 40%)." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-96", "text": "Note the similarity between features used in both systems shown in Table 2 : the systems learned the same \"knowledge\" from data, with the distinction that the IE approach acquired this through a direct association between dependency parses and answer properties, while the SP approach acquired this through optimizing on intermediate logical forms." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-97", "text": "----------------------------------" }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-98", "text": "**DISCUSSION AND CONCLUSION**" }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-99", "text": "That a direct information extraction technology easily gets on par with the more sophisticated semantic parsing method suggests that SP-based approaches to QA with Freebase have not yet shown the power of a \"deeper\" understanding of the questions, across questions of various lengths." }, { "sent_id": "4bb290aba7ee7843280a8b0e88e5a0-C001-100", "text": "We suggest that more compositional open-domain datasets should be created, and that SP researchers should focus on utterances in existing datasets that are beyond the reach of direct IE methods." 
} ], "y": { "@UNSURE@": { "gold_contexts": [ [ "4bb290aba7ee7843280a8b0e88e5a0-C001-16" ] ], "cite_sentences": [ "4bb290aba7ee7843280a8b0e88e5a0-C001-16" ] }, "@BACK@": { "gold_contexts": [ [ "4bb290aba7ee7843280a8b0e88e5a0-C001-18" ], [ "4bb290aba7ee7843280a8b0e88e5a0-C001-26" ], [ "4bb290aba7ee7843280a8b0e88e5a0-C001-47" ] ], "cite_sentences": [ "4bb290aba7ee7843280a8b0e88e5a0-C001-18", "4bb290aba7ee7843280a8b0e88e5a0-C001-26", "4bb290aba7ee7843280a8b0e88e5a0-C001-47" ] } } }, "ABC_a229630a81020951ec0be27f54885a_41": { "x": [ { "sent_id": "a229630a81020951ec0be27f54885a-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-2", "text": "Spam filtering is a text categorization task that shows especial features that make it interesting and difficult." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-3", "text": "First, the task has been performed traditionally using heuristics from the domain." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-4", "text": "Second, a cost model is required to avoid misclassification of legitimate messages." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-5", "text": "We present a comparative evaluation of several machine learning algorithms applied to spam filtering, considering the text of the messages and a set of heuristics for the task." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-6", "text": "Cost-oriented biasing and evaluation is performed." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-7", "text": "----------------------------------" }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-9", "text": "Spam, or more properly Unsolicited Commercial E-mail (UCE), is an increasing threat to the viability of Internet E-mail and a danger to Internet commerce." 
}, { "sent_id": "a229630a81020951ec0be27f54885a-C001-10", "text": "UCE senders take away resources from users and service suppliers without compensation and without authorization." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-11", "text": "A variety of counter-measures to UCE have been proposed, from technical to regulatory (Cranor and LaMacchia, 1998) ." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-12", "text": "Among the technical ones, the use of filtering methods is popular and effective." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-13", "text": "UCE filtering is a text categorization task." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-14", "text": "Text categorization (TC) is the classification of documents with respect to a set of one or more pre-existing categories." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-15", "text": "In the case of UCE, the task is to classify e-mail messages or newsgroup articles as UCE or not (that is, legitimate)." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-16", "text": "The general model of TC makes use of a set of preclassified documents to classify new ones, according to the text content (i.e. words) of the documents (Sebastiani, 1999) ." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-17", "text": "Although UCE filtering seems to be a simple instance of the more general TC task, it shows two special characteristics (* Partially supported by the CICYT, project no. TEL99-0335-C04-03): * First, UCE filtering has been developed using very simple heuristics for many years." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-18", "text": "For example, one individual could manually build a filter that classifies as \"spam\" messages containing the phrase \"win big money\", or with an unusual (big) number of capital letters or non-alphanumeric characters." 
}, { "sent_id": "a229630a81020951ec0be27f54885a-C001-19", "text": "These rules are examples of simple but powerful heuristics that could be used to complement a word-based automatic TC system for UCE filtering." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-20", "text": "\u2022 Second, all UCE filtering errors are not of equal importance." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-21", "text": "Individuals usually prefer conservative filters that tend to classify UCE as legitimate, because missing a legitimate message is more harmful than the opposite." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-22", "text": "A cost model is imperative to avoid the risk of missing legitimate e-mail." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-23", "text": "Many learning algorithms have been applied to the problem of TC (Yang, 1999) , but far fewer with the problem of UCE filtering in mind." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-24", "text": "Sahami and others (1998) propose the use of a Naive Bayes classifier based on the words and a set of manually derived heuristics for UCE filtering, showing that the heuristics improve the effectiveness of the classifier." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-25", "text": "Drucker and others (1999) compare boosting, Support Vector Machines, Ripper and Rocchio classifiers for UCE filtering." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-26", "text": "Androutsopoulos and others (2000) present a cost-oriented evaluation of the Naive Bayes and k-nearest neighbor (kNN) algorithms for UCE filtering." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-27", "text": "Finally, Provost (1999) compares Naive Bayes and RIPPER for the task." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-28", "text": "These last three works do not consider any set of heuristics for UCE filtering." 
}, { "sent_id": "a229630a81020951ec0be27f54885a-C001-29", "text": "So, an extensive evaluation of learning algorithms combining words and heuristics remains to be done." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-30", "text": "Also, although the evaluations performed in these works have taken into account the importance of misclassifying legitimate e-mail, they have not considered that many learning algorithms (especially those that are error-driven) can be biased to prefer some kinds of errors to others." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-31", "text": "In this paper, we present a comparative evaluation of a representative selection of Machine Learning algorithms for UCE filtering." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-32", "text": "The algorithms take advantage of two kinds of information: the words in the messages and a set of heuristics." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-33", "text": "Also, the algorithms are biased by a cost weighting schema to avoid, if possible, misclassifying legitimate e-mail." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-34", "text": "Finally, algorithms are evaluated according to cost-sensitive measures." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-35", "text": "----------------------------------" }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-36", "text": "**HEURISTICS FOR UCE CLASSIFICATION**" }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-37", "text": "Sahami and others (Sahami et al., 1998) have proposed a set of heuristic features to complement the word Bayesian model in their work, including: a set of around 35 hand-crafted key phrases (like \"free money\"); some non-text features (like the domain of the sender, or whether the message comes from a distribution list or not); and features concerning the non-alphanumeric characters in the messages." 
}, { "sent_id": "a229630a81020951ec0be27f54885a-C001-38", "text": "For this work, we have focused on this last set of features." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-39", "text": "The test collection used in our experiments, Spambase, already contained a set of nine heuristic features." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-40", "text": "Spambase 1 is an e-mail message collection containing 4601 messages, of which 1813 (39%) are marked as UCE. (1 This collection can be obtained from http://www.ics.uci.edu/~mlearn/MLRepository.html.)" }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-41", "text": "The collection comes in preprocessed (not raw) form, and its instances have been represented as 58-dimensional vectors." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-42", "text": "The first 48 features are words extracted from the original messages, without a stop list or stemming, and selected as the most unbalanced words for the UCE class." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-43", "text": "The next 6 features are the percentage of occurrences of the special characters \";\", \"(\", \"[\", \"!\", \"$\" and \"#\"." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-44", "text": "The following 3 features represent different measures of occurrences of capital letters in the text of the messages." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-45", "text": "Finally, the last feature is the class label." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-46", "text": "So, features 49 to 57 represent heuristic attributes of the messages." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-47", "text": "In our experiments, we have tested several learning algorithms on three feature sets: only" }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-48", "text": "words, only heuristic attributes, and both." 
}, { "sent_id": "a229630a81020951ec0be27f54885a-C001-49", "text": "We have divided the Spambase collection in two parts: 90% of the instances are used for training, and 10% of the messages are retained for testing." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-50", "text": "This split has been performed preserving the percentages of legitimate and UCE messages in the whole collection." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-51", "text": "3 Cost-sensitive UCE classification: the problem of UCE filtering requires a cost-sensitive classification." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-52", "text": "Each learning algorithm can be biased to prefer some kinds of misclassification errors to others." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-53", "text": "A popular technique for doing this is resampling the training collection by multiplying the number of instances of the preferred class by the cost ratio." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-54", "text": "Also, the unpreferred class can be downsampled by eliminating some instances." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-55", "text": "The software package we use for our experiments applies both methods depending on the algorithm tested." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-56", "text": "We have tested four learning algorithms: Naive Bayes (NB), C4.5, PART and k-nearest neighbor (kNN), all implemented in the Weka package (Witten and Frank, 1999) ." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-57", "text": "The version of Weka used in this work is Weka 3.0.1." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-58", "text": "The algorithms used can be biased to prefer the mistake of classifying a UCE message as not UCE over the opposite, by assigning a penalty to the second kind of error." 
}, { "sent_id": "a229630a81020951ec0be27f54885a-C001-59", "text": "Following (Androutsopoulos et al., 2000) , we have assigned penalties of 9 and 999 (9 and 999 times more important) to the misclassification of legitimate messages as UCE." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-60", "text": "This means that every instance of a legitimate message has been replaced by 9 and 999 instances of the same message respectively for NB, C4.5 and PART." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-61", "text": "However, for kNN the data have been downsampled." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-62", "text": "----------------------------------" }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-63", "text": "**EVALUATION AND RESULTS**" }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-64", "text": "The experiment results are summarized in Tables 1, 2 and 3." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-65", "text": "The learning algorithms Naive Bayes (NB), 5-Nearest Neighbor (5NN), C4.5 and PART were tested on words (-W), heuristic features (-H), and both (-WH)." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-66", "text": "The kNN algorithm was tested with values of k equal to 1, 2, 5 and 8, with 5 being the optimal number of neighbors." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-67", "text": "We present the weighted accuracy (wacc), and also the recall (rec) and precision (pre) for the class UCE." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-68", "text": "Weighted accuracy is a measure that weights hits and misses for the preferred class more highly." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-69", "text": "Recall and precision for the UCE class show how effective the filter is at blocking UCE, and how effective it is at letting legitimate messages pass the filter, respectively (Androutsopoulos et al., 2000) ." 
}, { "sent_id": "a229630a81020951ec0be27f54885a-C001-70", "text": "In Table 1 , no costs were used." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-71", "text": "Tables 2 and 3 show the results of our experiments for cost ratios of 9 and 999." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-72", "text": "For these last cases, there were not enough training instances for the kNN algorithm to perform classification, due to the downsampling method applied by Weka." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-73", "text": "----------------------------------" }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-74", "text": "**DISCUSSION AND CONCLUSIONS**" }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-75", "text": "The results of our experiments show that the best performing algorithms are C4.5 and PART." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-76", "text": "However, for the cost value of 999, both algorithms degrade to the trivial rejector: they prefer to classify every message as legitimate in order to avoid highly penalized errors." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-77", "text": "With these results, neither of these algorithms seems useful for autonomous classification of UCE as stated by Androutsopoulos, since this cost ratio represents a scenario in which UCE messages are deleted without notifying the user of the UCE filter." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-78", "text": "Nevertheless, PART-WH shows competitive performance for a cost ratio of 9." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-79", "text": "Its numbers are comparable to those shown in a commercial study by the top performing Brightmail filtering system (Mariano, 2000) , which reaches a UCE recall of 0.73, and a precision close to 1.0, and it is manually updated." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-80", "text": "Naive Bayes has not shown high variability with respect to costs." 
}, { "sent_id": "a229630a81020951ec0be27f54885a-C001-81", "text": "This is probably due to the sampling method, which only slightly affects to the estimation of probabilities (done by approximation to a normal distribution)." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-82", "text": "In (Sahami et al., 1998; Androutsopoulos et al., 2000) , the method followed is the variation of the probability threshold, which leads to a high variation of results." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-83", "text": "In future experiments, we plan to apply the uniform method MetaCost (Domingos, 1999) to the algorithms tested in this work, for getting more comparable results." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-84", "text": "With respect to the use of heuristics, we can see that this information alone is not competitive, but it can improve classification based on words." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-85", "text": "The improvement shown in our experiments is modest, due to the heuristics used." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-86", "text": "We are not able to add other heuristics in this case because the Spambase collection comes in a preprocessed fashion." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-87", "text": "For future experiments, we will use the collection from (Androutsopoulos et al., 2000) , which is in raw form." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-88", "text": "This fact will enable us to search for more powerful heuristics." }, { "sent_id": "a229630a81020951ec0be27f54885a-C001-89", "text": "Table 2 : UCE recall, UCE precision and weighted accuracy for costs = 9." 
} ], "y": { "@USE@": { "gold_contexts": [ [ "a229630a81020951ec0be27f54885a-C001-59" ] ], "cite_sentences": [ "a229630a81020951ec0be27f54885a-C001-59" ] }, "@BACK@": { "gold_contexts": [ [ "a229630a81020951ec0be27f54885a-C001-69" ], [ "a229630a81020951ec0be27f54885a-C001-82" ] ], "cite_sentences": [ "a229630a81020951ec0be27f54885a-C001-69", "a229630a81020951ec0be27f54885a-C001-82" ] }, "@FUT@": { "gold_contexts": [ [ "a229630a81020951ec0be27f54885a-C001-87" ] ], "cite_sentences": [ "a229630a81020951ec0be27f54885a-C001-87" ] } } }, "ABC_d63acda66b0c17c5c6725c0e20b2d9_41": { "x": [ { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-2", "text": "We obtain new results using referential translation machines with increased number of learning models in the set of results that are stacked to obtain a better mixture of experts prediction." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-3", "text": "We combine features extracted from the word-level predictions with the sentence-or document-level features, which significantly improve the results on the training sets but decrease the test set results." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-4", "text": "Quality estimation task in WMT19 (Specia et al., 2019) (QET19) address machine translation performance prediction (MTPP), where translation quality is predicted without using reference translations, at the sentence-and word-(Task 1), and document-levels (Task 2)." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-5", "text": "The tasks contain subtasks involving English-German, EnglishRussian, and English-French machine translation (MT)." 
}, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-6", "text": "The target to predict in Task 1 is HTER (human-targeted translation edit rate) scores (Snover et al., 2006) and binary classification of word-level translation errors and the target in Task 2 is multi-dimensional quality metrics (MQM) (Lommel, 2015) ." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-7", "text": "Table 1 lists the number of sentences in the training and test sets for each task and the number of instances used as interpretants in the RTM models (M for million)." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-8", "text": "We use referential translation machine (RTM) (Bi\u00e7ici, 2018; Bi\u00e7ici and Way, 2015) models for building our prediction models." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-9", "text": "RTMs predict data translation between the instances in the training set and the test set using interpretants, data close to the task instances." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-10", "text": "Interpretants provide context for the prediction task and are used during the derivation of the features measuring the closeness of the test sentences to the training data, the difficulty of translating them, and to identify translation acts between any two data sets for building prediction models." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-11", "text": "With the enlarging parallel and monolingual corpora made available by WMT, the capability of the interpretant datasets selected by RTM models to provide context for the training and test sets improve as can be seen in the data statistics of parfda instance selection (Bi\u00e7ici, 2019)." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-12", "text": "Figure 1 depicts RTMs and explains the model building process." 
}, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-13", "text": "RTMs use parfda for instance selection and machine translation performance prediction system (MTPPS) for obtaining the features, which includes additional features from word alignment and also from GLMd for word-level prediction." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-14", "text": "We use ridge regression, kernel ridge regression, k-nearest neighors, support vector regression, AdaBoost (Freund and Schapire, 1997), gradient tree boosting, gaussian process regressor, extremely randomized trees (Geurts et al., 2006), and multi-layer perceptron (Bishop, 2006) as learning models in combination with feature selection (FS) (Guyon et al., 2002) and partial least squares (PLS) (Wold et al., 1984) where most of these models can be found in scikit-learn." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-15", "text": "1 We experiment with:" }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-16", "text": "\u2022 including the statistics of the binary tags obtained as features extracted from word-level tag predictions for sentence-level prediction," }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-17", "text": "\u2022 using KNN to estimate the noise level for 1 http://scikit-learn.org/" }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-18", "text": "Figure 1: RTM depiction: parfda selects interpretants close to the training and test data using parallel corpus in bilingual settings and monolingual corpus in the target language or just the monolingual target corpus in monolingual settings; an MTPPS use interpretants and training data to generate training features and another use interpretants and test data to generate test features in the same feature space; learning and prediction takes place taking these features as input." 
}, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-19", "text": "SVR, which obtains accuracy with 5% error compared with estimates obtained with known noise level (Cherkassky and Ma, 2004) and set = \u03c3/2." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-41", "text": "We convert MQM annotation to word-level tags to train GLMd models and obtain word-level predictions." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-20", "text": "Martins et al. (2017) used a hybrid stacking model to combine the word-level predictions from 15 predictors using neural networks with different initializations together with the previous features from a linear model." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-21", "text": "The neural network architecture they used is also hybrid with different types of layers: input word embedding use 64 dimensional vectors, the next three layers are two feedforward layers with 400 nodes and a bidirectional gated recurrent units layer with 200 units, followed by similar three layers with half nodes, followed by a feedforward layer with 50 nodes and a softmax layer." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-22", "text": "We use Global Linear Models (GLM) (Collins, 2002) with dynamic learning (GLMd) (Bi\u00e7ici, 2018) for word-and phrase-level translation performance prediction." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-23", "text": "GLMd uses weights in a range [a, b] to update the learning rate dynamically according to the error rate." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-24", "text": "Evaluation metrics listed are Pearson's correlation (r), mean absolute error (MAE), and root mean squared error (RMSE)." 
}, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-25", "text": "We use prediction averaging (Bi\u00e7ici, 2018) to obtain a combined prediction from various prediction outputs better than the components, where the performance on the training set is used to obtain weighted average of the top k predictions,\u0177 with evaluation metrics indexed by j \u2208 J and weights with w:" }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-26", "text": "We assume independent predictions and use p i /(1 \u2212 p i ) for weights where p i represents the accuracy of the independent classifier i in a weighted majority ensemble (Kuncheva and Rodr\u00edguez, 2014) ." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-27", "text": "We only use the MIX prediction if we obtain better results on the training set." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-28", "text": "We select the best model using r and mix the results using r, RAE, MRAER, and MAER." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-29", "text": "We filter out those results with higher than 1 relative evaluation metric scores." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-30", "text": "We also use stacking to build higher level models using predictions from base prediction models where they can also use the probability associated with the predictions (Ting and Witten, 1999) ." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-31", "text": "The stacking models use the predictions from predictors as features and build second level predictors." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-32", "text": "For the document-level RTM model, instead of running separate MTPPS instances for each training or test document to obtain specific features for each document, we concatenate the sentences from each document to obtain a single sentence representing each and then run an RTM model." 
}, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-33", "text": "This conversion decreases the number of features and obtains close results (Bi\u00e7ici, 2018) ." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-34", "text": "Before model combination, we further filter prediction results from different machine learn- ing models based on the results on the training set to decrease the number of models combined and improve the results." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-35", "text": "A criteria that we use is to include results that are better than the best RR model's results." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-36", "text": "In general, the combined model is better than the best model in the set and stacking achieves better results than MIX." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-37", "text": "We tokenize and truecase all of the corpora using Moses ' (Koehn et al., 2007) processing tools." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-38", "text": "2 LMs are built using kenlm (Heafield et al., 2013) ." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-39", "text": "The comparison of results on the training set are in Table 2 and the results on the test set we obtained after the competition are in Tables 3 and 5 ." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-40", "text": "Official competition results of RTMs are similar." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-42", "text": "Addition of the tagging features from the word-level prediction improves the training results significantly but does not improve the test results at the same rate, which indicates overfitting." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-43", "text": "The reason for the overfitting with the word-level features is due to their high correlation with the target." 
}, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-44", "text": "Table 4 lists some of the top individual feature 2 https://github.com/moses-smt/ mosesdecoder/tree/master/scripts correlations for en-ru in Task1." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-45", "text": "Top 26 highly correlated features belong to word-level features." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-46", "text": "We also obtained new results on QET18 datasets and experimented adding features from word-level predictions on the QET18 sentencelevel results." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-47", "text": "QET18 results in Table 3 are improved overall." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-48", "text": "Referential translation machines pioneer a language independent approach and remove the need to access any task or domain specific information or resource and can achieve top performance in automatic, accurate, and language independent prediction of translation scores." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-49", "text": "We present RTM results with stacking." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-50", "text": "----------------------------------" }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-51", "text": "**REFERENTIAL TRANSLATION MACHINES FOR MACHINE TRANSLATION PERFORMANCE PREDICION**" }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-52", "text": "Quality estimation task in WMT19 (Specia et al., 2019 ) (QET19) address machine translation performance prediction (MTPP) , where translation quality is predicted without using reference translations, at the sentence-and word-(Task 1), and document-levels (Task 2)." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-53", "text": "The tasks contain subtasks involving English-German, EnglishRussian, and English-French machine translation (MT)." 
}, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-54", "text": "The target to predict in Task 1 is HTER (human-targeted translation edit rate) scores (Snover et al., 2006) and binary classification of word-level translation errors and the target in Task 2 is multi-dimensional quality metrics (MQM) (Lommel, 2015) ." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-55", "text": "Table 1 lists the number of sentences in the training and test sets for each task and the number of instances used as interpretants in the RTM models (M for million)." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-56", "text": "We use referential translation machine (RTM) (Bi\u00e7ici, 2018; Bi\u00e7ici and Way, 2015) models for building our prediction models." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-57", "text": "RTMs predict data translation between the instances in the training set and the test set using interpretants, data close to the task instances." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-58", "text": "Interpretants provide context for the prediction task and are used during the derivation of the features measuring the closeness of the test sentences to the training data, the difficulty of translating them, and to identify translation acts between any two data sets for building prediction models." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-59", "text": "With the enlarging parallel and monolingual corpora made available by WMT, the capability of the interpretant datasets selected by RTM models to provide context for the training and test sets improve as can be seen in the data statistics of parfda instance selection (Bi\u00e7ici, 2019) ." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-60", "text": "Figure 1 depicts RTMs and explains the model building process." 
}, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-61", "text": "RTMs use parfda for instance selection and machine translation performance prediction system (MTPPS) for obtaining the features, which includes additional features from word alignment and also from GLMd for word-level prediction." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-81", "text": "For the document-level RTM model, instead of running separate MTPPS instances for each training or test document to obtain specific features for each document, we concatenate the sentences from each document to obtain a single sentence representing each and then run an RTM model." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-62", "text": "We use ridge regression, kernel ridge regression, k-nearest neighors, support vector regression, AdaBoost (Freund and Schapire, 1997), gradient tree boosting, gaussian process regressor, extremely randomized trees (Geurts et al., 2006) , and multi-layer perceptron (Bishop, 2006) as learning models in combination with feature selection (FS) (Guyon et al., 2002) and partial least squares (PLS) (Wold et al., 1984) where most of these models can be found in scikit-learn." 
}, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-63", "text": "1 We experiment with:" }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-64", "text": "\u2022 including the statistics of the binary tags obtained as features extracted from word-level tag predictions for sentence-level prediction," }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-65", "text": "\u2022 using KNN to estimate the noise level for Figure 1 : RTM depiction: parfda selects interpretants close to the training and test data using parallel corpus in bilingual settings and monolingual corpus in the target language or just the monolingual target corpus in monolingual settings; an MTPPS use interpretants and training data to generate training features and another use interpretants and test data to generate test features in the same feature space; learning and prediction takes place taking these features as input." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-66", "text": "SVR, which obtains accuracy with 5% error compared with estimates obtained with known noise level (Cherkassky and Ma, 2004) and set = \u03c3/2." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-67", "text": "Martins et al. (2017) used a hybrid stacking model to combine the word-level predictions from 15 predictors using neural networks with different initializations together with the previous features from a linear model." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-68", "text": "The neural network architecture they used is also hybrid with different types of layers: input word embedding use 64 dimensional vectors, the next three layers are two feedforward layers with 400 nodes and a bidirectional gated recurrent units layer with 200 units, followed by similar three layers with half nodes, followed by a feedforward layer with 50 nodes and a softmax layer." 
}, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-69", "text": "We use Global Linear Models (GLM) (Collins, 2002) with dynamic learning (GLMd) (Bi\u00e7ici, 2018) for word-and phrase-level translation performance prediction." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-70", "text": "GLMd uses weights in a range [a, b] to update the learning rate dynamically according to the error rate." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-71", "text": "Evaluation metrics listed are Pearson's correlation (r), mean absolute error (MAE), and root mean squared error (RMSE)." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-72", "text": "----------------------------------" }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-73", "text": "**MIXTURE OF EXPERTS MODELS**" }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-74", "text": "We use prediction averaging (Bi\u00e7ici, 2018) to obtain a combined prediction from various prediction outputs better than the components, where the performance on the training set is used to obtain weighted average of the top k predictions,\u0177 with evaluation metrics indexed by j \u2208 J and weights with w:" }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-75", "text": "We assume independent predictions and use p i /(1 \u2212 p i ) for weights where p i represents the accuracy of the independent classifier i in a weighted majority ensemble (Kuncheva and Rodr\u00edguez, 2014) ." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-76", "text": "We only use the MIX prediction if we obtain better results on the training set." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-77", "text": "We select the best model using r and mix the results using r, RAE, MRAER, and MAER." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-78", "text": "We filter out those results with higher than 1 relative evaluation metric scores." 
}, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-79", "text": "We also use stacking to build higher level models using predictions from base prediction models where they can also use the probability associated with the predictions (Ting and Witten, 1999) ." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-80", "text": "The stacking models use the predictions from predictors as features and build second level predictors." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-82", "text": "This conversion decreases the number of features and obtains close results (Bi\u00e7ici, 2018) ." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-83", "text": "Before model combination, we further filter prediction results from different machine learn- ing models based on the results on the training set to decrease the number of models combined and improve the results." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-84", "text": "A criteria that we use is to include results that are better than the best RR model's results." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-85", "text": "In general, the combined model is better than the best model in the set and stacking achieves better results than MIX." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-86", "text": "----------------------------------" }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-87", "text": "**RESULTS**" }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-88", "text": "We tokenize and truecase all of the corpora using Moses' (Koehn et al., 2007) processing tools." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-89", "text": "2 LMs are built using kenlm (Heafield et al., 2013) ." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-90", "text": "The comparison of results on the training set are in Table 2 and the results on the test set we obtained after the competition are in Tables 3 and 5 ." 
}, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-91", "text": "Official competition results of RTMs are similar." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-92", "text": "We convert MQM annotation to word-level tags to train GLMd models and obtain word-level predictions." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-93", "text": "Addition of the tagging features from the word-level prediction improves the training results significantly but does not improve the test results at the same rate, which indicates overfitting." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-94", "text": "The reason for the overfitting with the word-level features is due to their high correlation with the target." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-95", "text": "Table 4 lists some of the top individual feature 2 https://github.com/moses-smt/ mosesdecoder/tree/master/scripts correlations for en-ru in Task1." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-96", "text": "Top 26 highly correlated features belong to word-level features." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-97", "text": "We also obtained new results on QET18 datasets and experimented adding features from word-level predictions on the QET18 sentencelevel results." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-98", "text": "QET18 results in Table 3 are improved overall." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-99", "text": "----------------------------------" }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-100", "text": "**CONCLUSION**" }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-101", "text": "Referential translation machines pioneer a language independent approach and remove the need to access any task or domain specific information or resource and can achieve top performance in automatic, accurate, and language independent prediction of translation scores." 
}, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-102", "text": "We present RTM results with stacking." }, { "sent_id": "d63acda66b0c17c5c6725c0e20b2d9-C001-103", "text": "(11) (7)" } ], "y": { "@USE@": { "gold_contexts": [ [ "d63acda66b0c17c5c6725c0e20b2d9-C001-8" ], [ "d63acda66b0c17c5c6725c0e20b2d9-C001-22" ], [ "d63acda66b0c17c5c6725c0e20b2d9-C001-25" ], [ "d63acda66b0c17c5c6725c0e20b2d9-C001-56" ], [ "d63acda66b0c17c5c6725c0e20b2d9-C001-69" ], [ "d63acda66b0c17c5c6725c0e20b2d9-C001-74" ] ], "cite_sentences": [ "d63acda66b0c17c5c6725c0e20b2d9-C001-8", "d63acda66b0c17c5c6725c0e20b2d9-C001-22", "d63acda66b0c17c5c6725c0e20b2d9-C001-25", "d63acda66b0c17c5c6725c0e20b2d9-C001-56", "d63acda66b0c17c5c6725c0e20b2d9-C001-69", "d63acda66b0c17c5c6725c0e20b2d9-C001-74" ] }, "@SIM@": { "gold_contexts": [ [ "d63acda66b0c17c5c6725c0e20b2d9-C001-33" ], [ "d63acda66b0c17c5c6725c0e20b2d9-C001-82" ] ], "cite_sentences": [ "d63acda66b0c17c5c6725c0e20b2d9-C001-33", "d63acda66b0c17c5c6725c0e20b2d9-C001-82" ] } } }, "ABC_1f72d18331beaef7adf4a78d1619c6_41": { "x": [ { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-78", "text": "**20NG METAPHOR**" }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-2", "text": "We introduce QVEC-CCA-an intrinsic evaluation metric for word vector representations based on correlations of learned vectors with features extracted from linguistic resources." }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-3", "text": "We show that QVEC-CCA scores are an effective proxy for a range of extrinsic semantic and syntactic tasks." }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-4", "text": "We also show that the proposed evaluation obtains higher and more consistent correlations with downstream tasks, compared to existing approaches to intrinsic evaluation of word vectors that are based on word similarity." 
}, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-5", "text": "----------------------------------" }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-7", "text": "Being linguistically opaque, vector-space representations of words-word embeddings-have limited practical value as standalone items." }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-76", "text": "Conversely, QVEC and QVEC-CCA consistently obtain moderate-to-high positive correlations with the downstream tasks." }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-8", "text": "They are effective, however, in representing meaning-through individual dimensions and combinations of thereof-when used as features in downstream applications (Turian et al., 2010; Lazaridou et al., 2013; Socher et al., 2013; Bansal et al., 2014; Guo et al., 2014, inter alia) ." }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-9", "text": "Thus, unless it is coupled with an extrinsic task, intrinsic evaluation of word vectors has little value in itself." }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-10", "text": "The main purpose of an intrinsic evaluation is to serve as a proxy for the downstream task the embeddings are tailored for." }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-11", "text": "This paper advocates a novel approach to constructing such a proxy." }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-12", "text": "What are the desired properties of an intrinsic evaluation measure of word embeddings?" }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-13", "text": "First, retraining models that use word embeddings as features is often expensive." }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-14", "text": "A computationally efficient intrinsic evaluation that correlates with extrinsic scores is useful for faster prototyping." 
}, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-15", "text": "Second, an intrinsic evaluation that enables interpretation and analysis of properties encoded by vector dimensions is an auxiliary mechanism for analyzing how these properties affect the target downstream task." }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-16", "text": "It thus facilitates refinement of word vector models and, consequently, improvement of the target task." }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-17", "text": "Finally, an intrinsic evaluation that approximates a range of related downstream tasks (e.g., semantic text-classification tasks) allows to assess generality (or specificity) of a word vector model, without actually implementing all the tasks." }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-18", "text": "Tsvetkov et al. (2015) proposed an evaluation measure-QVEC-that was shown to correlate well with downstream semantic tasks." }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-19", "text": "Additionally, it helps shed new light on how vector spaces encode meaning thus facilitating the interpretation of word vectors." }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-20", "text": "The crux of the method is to correlate distributional word vectors with linguistic word vectors constructed from rich linguistic resources, annotated by domain experts." }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-21", "text": "QVEC can easily be adjusted to specific downstream tasks (e.g., part-of-speech tagging) by selecting taskspecific linguistic resources (e.g., part-of-speech annotations)." }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-22", "text": "However, QVEC suffers from two weaknesses." }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-23", "text": "First, it is not invariant to linear transformations of the embeddings' basis, whereas the bases in word embeddings are generally arbitrary (Szegedy et al., 2014) ." 
}, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-24", "text": "Second, it produces an unnormalized score: the more dimensions in the embedding matrix the higher the score." }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-25", "text": "This precludes comparison of models of different dimensionality." }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-26", "text": "In this paper, we introduce QVEC-CCA, which simultaneously addresses both problems, while preserving major strengths of QVEC." }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-27", "text": "1" }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-28", "text": "----------------------------------" }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-29", "text": "**QVEC AND QVEC-CCA**" }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-30", "text": "We introduce QVEC-CCA-an intrinsic evaluation measure of the quality of word embeddings." }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-77", "text": "----------------------------------" }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-31", "text": "Our method is a modification of QVEC-an evaluation based on alignment of embeddings to a matrix of features extracted from a linguistic resource (Tsvetkov et al., 2015) ." }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-32", "text": "We review QVEC, and then describe QVEC-CCA." }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-33", "text": "----------------------------------" }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-34", "text": "**QVEC.**" }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-35", "text": "The main idea behind QVEC is to quantify the linguistic content of word embeddings by maximizing the correlation with a manuallyannotated linguistic resource." }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-36", "text": "Let the number of common words in the vocabulary of the word embeddings and the linguistic resource be N ." 
}, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-37", "text": "To quantify the semantic content of embeddings, a semantic/syntactic linguistic matrix S \u2208 R P \u00d7N is constructed from a semantic/syntactic database, with a column vector for each word." }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-38", "text": "Each word vector is a distribution of the word over P linguistic properties, based on annotations of the word in the database." }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-39", "text": "Let X \u2208 R D\u00d7N be embedding matrix with every row as a dimension vector x \u2208 R 1\u00d7N ." }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-40", "text": "D denotes the dimensionality of word embeddings." }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-41", "text": "Then, S and X are aligned to maximize the cumulative correlation between the aligned dimensions of the two matrices." }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-42", "text": "Specifically, let A \u2208 {0, 1} D\u00d7P be a matrix of alignments such that a ij = 1 iff x i is aligned to s j , otherwise a ij = 0." }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-43", "text": "If r(x i , s j ) is the Pearson's correlation between vectors x i and s j , then QVEC is defined as:" }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-44", "text": "The constraint j a ij \u2264 1, warrants that one distributional dimension is aligned to at most one linguistic dimension." }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-45", "text": "QVEC-CCA." }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-46", "text": "To measure correlation between the embedding matrix X and the linguistic matrix S, instead of cumulative dimension-wise correlation we employ canonical correlation analysis (Hardoon et al., 2004, CCA) ." 
}, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-47", "text": "CCA finds two sets of basis vectors, one for X \u22a4 and the other for S \u22a4 , such that the correlations between the projections of the matrices onto these basis vectors are maximized." }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-48", "text": "Formally, CCA finds a pair of basis vectors v and w such that" }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-49", "text": "Thus, QVEC-CCA ensures invariance to the matrices' bases' rotation, and since it is a single correlation, it produces a score in [\u22121, 1]." }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-50", "text": "----------------------------------" }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-51", "text": "**LINGUISTIC DIMENSION WORD VECTORS**" }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-52", "text": "Both QVEC and QVEC-CCA rely on a matrix of linguistic properties constructed from a manually crafted linguistic resource." }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-53", "text": "Linguistic resources are invaluable as they capture generalizations made by domain experts." }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-54", "text": "However, resource construction is expensive, therefore it is not always possible to find an existing resource that captures exactly the set of optimal lexical properties for a downstream task." }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-55", "text": "Resources that capture more coarse-grained, general properties can be used instead, for example, WordNet for semantic evaluation (Fellbaum, 1998) , or Penn Treebank (Marcus et al., 1993, PTB) for syntactic evaluation." }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-56", "text": "Since these properties are not an exact match to the task, the intrinsic evaluation tests for a necessary (but possibly not sufficient) set of generalizations." 
}, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-57", "text": "----------------------------------" }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-58", "text": "**SEMANTIC VECTORS.**" }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-59", "text": "To evaluate the semantic content of word vectors, Tsvetkov et al. (2015) exploit supersense annotations in a WordNetannotated corpus-SemCor (Miller et al., 1993 Table 2 : Linguistic dimension word vector matrix with syntactic vectors, constructed using PTB." }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-60", "text": "\u2022 We first train 21 word vector models: variants of CBOW and Skip-Gram models (Mikolov et al., 2013) ; their modifications CWindow, Structured Skip-Gram, and CBOW with Attention (Ling et al., 2015b; Ling et al., 2015a) ; GloVe vectors (Pennington et al., 2014) ; Latent Semantic Analysis (LSA) based vectors (Church and Hanks, 1990) ; and retrofitted GloVe and LSA vectors ." }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-61", "text": "\u2022 We then evaluate these word vector models using existing intrinsic evaluation methods: QVEC and the proposed QVEC-CCA, and also word similarity tasks using the WordSim353 dataset (Finkelstein et al., 2001, WS-353) , MEN dataset (Bruni et al., 2012) , and SimLex-999 dataset (Hill et al., 2014, SimLex) ." }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-62", "text": "3 \u2022 In addition, the same vectors are evaluated using extrinsic text classification tasks." }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-63", "text": "Our semantic benchmarks are four binary categorization tasks from the 20 Newsgroups (20NG); sentiment analysis task (Socher et al., 2013, Senti) ; and the metaphor detection (Tsvetkov et al., 2014, Metaphor) ." 
}, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-64", "text": "\u2022 Finally, we compute the Pearson's correlation coefficient r to quantify the linear relationship between the intrinsic and extrinsic scorings." }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-65", "text": "The higher the correlation, the better suited the intrinsic evaluation to be used as a proxy to the extrinsic task." }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-66", "text": "We extend the setup of Tsvetkov et al. (2015) with two syntactic benchmarks, and evaluate QVEC-CCA with the syntactic matrix." }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-67", "text": "The first task is POS tagging; we use the LSTM-CRF model (Lample et al., 2016) , and the second is dependency parsing (Parse), using the stack-LSTM model of ." }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-68", "text": "----------------------------------" }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-69", "text": "**RESULTS.**" }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-70", "text": "To test the efficiency of QVEC-CCA in capturing the semantic content of word vectors, we evaluate how well the scores correspond to the 3 We employ an implementation of a suite of word similarity tasks at wordvectors.org (Faruqui and Dyer, 2014) ." }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-71", "text": "scores of word vector models on semantic benchmarks." }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-72", "text": "QVEC and QVEC-CCA employ the semantic supersense-dimension vectors described in \u00a73." }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-73", "text": "In table 3, we show correlations between intrinsic scores (word similarity/QVEC/QVEC-CCA) and extrinsic scores across semantic benchmarks for 300-dimensional vectors." 
}, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-74", "text": "QVEC-CCA obtains high positive correlation with all the semantic tasks, and outperforms QVEC on two tasks." }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-75", "text": "Although some word similarity tasks obtain high correlations with syntactic applications, these results are inconsistent, and vary from a high negative to a high positive correlation." }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-79", "text": "Comparing performance of QVEC-CCA in PTB and PTB+SST setups sheds light on the importance of linguistic signals captured by the linguistic matrices." }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-80", "text": "Appending supersense-annotated columns to the linguistic matrix which already contains POS-annotated columns does not affect correlations of QVEC-CCA with the POS tagging task, since the additional linguistic information is not relevant for approximating how well dimensions of word embeddings encode POS-related properties." }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-81", "text": "In the case of dependency parsing-the task which encodes not only syntactic, but also semantic information (e.g., captured by subjectverb-object relations)-supersenses introduce relevant linguistic signals that are not present in POSannotated columns." }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-82", "text": "Thus, appending supersenseannotated columns to the linguistic matrix improves correlation of QVEC-CCA with the dependency parsing task." }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-83", "text": "----------------------------------" }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-84", "text": "**CONCLUSION**" }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-85", "text": "We introduced QVEC-CCA-an approach to intrinsic evaluation of word embeddings." 
}, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-86", "text": "We also showed that both QVEC and QVEC-CCA are not limited to semantic evaluation, but are general approaches, that can evaluate word vector content with respect to desired linguistic properties." }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-87", "text": "Semantic and syntactic linguistic features that we use to construct linguistic dimension matrices are rather coarse, thus the proposed evaluation can approximate a range of downstream tasks, but may not be sufficient to evaluate finergrained features." }, { "sent_id": "1f72d18331beaef7adf4a78d1619c6-C001-88", "text": "In the future work we propose to exploit existing semantic, syntactic, morphological, and typological resources (e.g., universal dependencies treebank (Agi\u0107 et al., 2015) and WALS (Dryer and Haspelmath, 2013) ), and also multilingual resources (e.g., Danish supersenses (Mart\u00ednez Alonso et al., 2015) ) to construct better linguistic matrices, suited for evaluating vectors used in additional NLP tasks." } ], "y": { "@EXT@": { "gold_contexts": [ [ "1f72d18331beaef7adf4a78d1619c6-C001-31" ], [ "1f72d18331beaef7adf4a78d1619c6-C001-66" ] ], "cite_sentences": [ "1f72d18331beaef7adf4a78d1619c6-C001-31", "1f72d18331beaef7adf4a78d1619c6-C001-66" ] }, "@BACK@": { "gold_contexts": [ [ "1f72d18331beaef7adf4a78d1619c6-C001-59" ] ], "cite_sentences": [ "1f72d18331beaef7adf4a78d1619c6-C001-59" ] } } }, "ABC_0b2e3651610aba4bd7150eee50797f_41": { "x": [ { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-2", "text": "The dominant practice of statistical machine translation (SMT) uses the same Chinese word segmentation specification in both alignment and translation rule induction steps in building Chinese-English SMT system, which may suffer from a suboptimal problem that word segmentation better for alignment is not necessarily better for translation." 
}, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-3", "text": "To tackle this, we propose a framework that uses two different segmentation specifications for alignment and translation respectively: we use Chinese character as the basic unit for alignment, and then convert this alignment to conventional word alignment for translation rule induction." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-4", "text": "Experimentally, our approach outperformed two baselines: fully word-based system (using word for both alignment and translation) and fully character-based system, in terms of alignment quality and translation performance." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-5", "text": "----------------------------------" }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-7", "text": "Chinese Word segmentation is a necessary step in Chinese-English statistical machine translation (SMT) because Chinese sentences do not delimit words by spaces." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-8", "text": "The key characteristic of a Chinese word segmenter is the segmentation specification 1 . As depicted in Figure 1 (a), the dominant practice of SMT uses the same word segmentation for both word alignment and translation rule induction." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-9", "text": "For brevity, we will refer to the word segmentation of the bilingual corpus as word segmentation for alignment (WSA for short), because it determines the basic tokens for alignment; and refer to the word segmentation of the aligned corpus as word segmentation for rules (WSR for short), because it determines the basic tokens of translation 1 We hereafter use \"word segmentation\" for short." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-10", "text": "rules 2 , which also determines how the translation rules would be matched by the source sentences." 
}, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-11", "text": "It is widely accepted that word segmentation with a higher F-score will not necessarily yield better translation performance (Chang et al., 2008; Zhang et al., 2008; Xiao et al., 2010) ." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-12", "text": "Therefore, many approaches have been proposed to learn word segmentation suitable for SMT." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-13", "text": "These approaches were either complicated (Ma et al., 2007; Chang et al., 2008; Ma and Way, 2009; Paul et al., 2010) , or of high computational complexity (Chung and Gildea 2009; Duan et al., 2010) ." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-14", "text": "Moreover, they implicitly assumed that WSA and WSR should be equal." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-15", "text": "This requirement may lead to a suboptimal problem that word segmentation better for alignment is not necessarily better for translation." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-16", "text": "To tackle this, we propose a framework that uses different word segmentation specifications as WSA and WSR respectively, as shown Figure 1(b) ." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-17", "text": "We investigate a solution in this framework: first, we use Chinese character as the basic unit for alignment, viz." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-18", "text": "character alignment; second, we use a simple method (Elming and Habash, 2007) to convert the character alignment to conventional word alignment for translation rule induction." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-19", "text": "In the 2 Interestingly, word is also a basic token in syntax-based rules." 
}, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-20", "text": "experiment, our approach consistently outperformed two baselines with three different word segmenters: fully word-based system (using word for both alignment and translation) and fully character-based system, in terms of alignment quality and translation performance." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-21", "text": "The remainder of this paper is structured as follows: Section 2 analyzes the influences of WSA and WSR on SMT respectively; Section 3 discusses how to convert character alignment to word alignment; Section 4 presents experimental results, followed by conclusions and future work in section 5." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-22", "text": "----------------------------------" }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-23", "text": "**UNDERSTANDING WSA AND WSR**" }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-24", "text": "We propose a solution to tackle the suboptimal problem: using Chinese character for alignment while using Chinese word for translation." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-25", "text": "Character alignment differs from conventional word alignment in the basic tokens of the Chinese side of the training corpus 3 ." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-26", "text": "Table 1 compares the token distributions of character-based corpus (CCorpus) and word-based corpus (WCorpus)." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-27", "text": "We see that the WCorpus has a longer-tailed distribution than the CCorpus." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-28", "text": "More than 70% of the unique tokens appear less than 5 times in WCorpus." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-29", "text": "However, over half of the tokens appear more than or equal to 5 times in the CCorpus." 
}, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-30", "text": "This indicates that modeling word alignment could suffer more from data sparsity than modeling character alignment." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-31", "text": "Table 2 shows the numbers of the unique tokens (#UT) and unique bilingual token pairs (#UTP) of the two corpora." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-32", "text": "Consider two extensively features, fertility and translation features, which are extensively used by many state-of-the-art word aligners." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-33", "text": "The number of parameters w.r.t." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-34", "text": "fertility features grows linearly with #UT while the number of parameters w.r.t." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-35", "text": "translation features grows linearly with #UTP." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-36", "text": "We compare #UT and #UTP of both corpora in Table 2 ." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-37", "text": "As can be seen, CCorpus has less UT and UTP than WCorpus, i.e. character alignment model has a compact parameterization than word alignment model, where the compactness of parameterization is shown very important in statistical modeling (Collins, 1999) ." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-38", "text": "Another advantage of character alignment is the reduction in alignment errors caused by word seg- Table 2 #UT and #UTP in CCorpus and WCorpus mentation errors." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-39", "text": "For example, \"\u5207\u5c3c (Cheney)\" and \"\u613f (will)\" are wrongly merged into one word \u5207 \u5c3c \u613f by the word segmenter, and \u5207 \u5c3c \u613f wrongly aligns to a comma in English sentence in the word alignment; However, both \u5207 and \u5c3c align to \"Cheney\" correctly in the character alignment." 
}, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-40", "text": "However, this kind of errors cannot be fixed by methods which learn new words by packing already segmented words, such as word packing (Ma et al., 2007) and Pseudo-word (Duan et al., 2010) ." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-41", "text": "As character could preserve more meanings than word in Chinese, it seems that a character can be wrongly aligned to many English words by the aligner." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-42", "text": "However, we found this can be avoided to a great extent by the basic features (co-occurrence and distortion) used by many alignment models." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-43", "text": "For example, we observed that the four characters of the non-compositional word \"\u963f\u62c9\u6cd5\u7279 (Arafat)\" align to Arafat correctly, although these characters preserve different meanings from that of Arafat." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-44", "text": "This can be attributed to the frequent co-occurrence (192 times) of these characters and Arafat in CCorpus." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-45", "text": "Moreover, \u6cd5 usually means France in Chinese, thus it may co-occur very often with France in CCorpus." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-46", "text": "If both France and Arafat appear in the English sentence, \u6cd5 may wrongly align to France." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-47", "text": "However, if \u963f aligns to Arafat, \u6cd5 will probably align to Arafat, because aligning \u6cd5 to Arafat could result in a lower distortion cost than aligning it to France." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-48", "text": "Different from alignment, translation is a pattern matching procedure (Lopez, 2008) ." 
}, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-49", "text": "WSR determines how the translation rules would be matched by the source sentences." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-50", "text": "For example, if we use translation rules with character as WSR to translate name entities such as the non-compositional word \u963f\u62c9\u6cd5\u7279, i.e. translating literally, we may get a wrong translation." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-51", "text": "That's because the linguistic knowledge that the four characters convey a specific meaning different from the characters has been lost, which cannot always be totally recovered even by using phrase in phrase-based SMT systems (see Chang et al. (2008) for detail)." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-52", "text": "Duan et al. (2010) and Paul et al., (2010) further pointed out that coarser-grained segmentation of the source sentence do help capture more contexts in translation." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-53", "text": "Therefore, rather than using character, using coarser-grained, at least as coarser as the conventional word, as WSR is quite necessary." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-54", "text": "----------------------------------" }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-55", "text": "**CONVERTING CHARACTER ALIGNMENT TO WORD ALIGNMENT**" }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-56", "text": "In order to use word as WSR, we employ the same method as Elming and Habash (2007) 4 to convert the character alignment (CA) to its word-based version (CA') for translation rule induction." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-57", "text": "The conversion is very intuitive: for every English-Chinese word pair , in the sentence pair, we align to as a link in CA', if and only if there is at least one Chinese character of aligns to in CA." 
}, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-58", "text": "Given two different segmentations A and B of the same sentence, it is easy to prove that if every word in A is finer-grained than the word of B at the corresponding position, the conversion is unambiguity (we omit the proof due to space limitation)." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-59", "text": "As character is a finer-grained than its original word, character alignment can always be converted to alignment based on any word segmentation." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-60", "text": "Therefore, our approach can be naturally scaled to syntax-based system by converting character alignment to word alignment where the word segmentation is consistent with the parsers." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-61", "text": "We compare CA with the conventional word alignment (WA) as follows: We hand-align some sentence pairs as the evaluation set based on characters (ESChar), and converted it to the evaluation set based on word (ESWord) using the above conversion method." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-62", "text": "It is worth noting that comparing CA and WA by evaluating CA on ESChar and evaluating WA on ESWord is meaningless, because the basic tokens in CA and WA are different." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-63", "text": "However, based on the conversion method, comparing CA with WA can be accomplished by evaluating both CA' and WA on ESWord." 
}, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-64", "text": "----------------------------------" }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-65", "text": "**EXPERIMENTS**" }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-66", "text": "----------------------------------" }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-67", "text": "**SETUP**" }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-68", "text": "FBIS corpus (LDC2003E14) (210K sentence pairs) was used for small-scale task." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-69", "text": "A large bilingual corpus of our lab (1.9M sentence pairs) was used for large-scale task." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-70", "text": "The NIST'06 and NIST'08 test sets were used as the development set and test set respectively." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-71", "text": "The Chinese portions of all these data were preprocessed by character segmenter (CHAR), ICTCLAS word segmenter 5 (ICT) and Stanford word segmenters with CTB and PKU specifications 6 respectively." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-72", "text": "The first 100 sentence pairs of the hand-aligned set in Haghighi et al. (2009) were hand-aligned as ESChar, which is converted to three ESWords based on three segmentations respectively." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-73", "text": "These ESWords were appended to training corpus with the corresponding word segmentation for evaluation purpose." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-74", "text": "Both character and word alignment were performed by GIZA++ (Och and Ney, 2003) enhanced with gdf heuristics to combine bidirectional alignments (Koehn et al., 2003) ." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-75", "text": "A 5-gram language model was trained from the Xinhua portion of Gigaword corpus." 
}, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-76", "text": "A phrase-based MT decoder similar to (Koehn et al., 2007) was used with the decoding weights optimized by MERT (Och, 2003) ." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-77", "text": "----------------------------------" }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-78", "text": "**EVALUATION**" }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-79", "text": "We first evaluate the alignment quality." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-80", "text": "The method discussed in section 3 was used to compare character and word alignment." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-81", "text": "As can be seen from Table 3 , the systems using character as WSA outperformed the ones using word as WSA in both small-scale (row 3-5) and large-scale task (row 6-8) with all segmentations." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-82", "text": "This gain can be attributed to the small vocabulary size (sparsity) for character alignment." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-83", "text": "The observation is consistent with Koehn (2005) which claimed that there is a negative correlation between the vocabulary size and translation performance without explicitly distinguishing WSA and WSR." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-84", "text": "We then evaluated the translation performance." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-85", "text": "The baselines are fully word-based MT systems (WordSys), i.e. using word as both WSA and WSR, and fully character-based systems (CharSys)." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-86", "text": "Table Word Table 4 Translation evaluation of WordSys and proposed system using BLEU-SBP (Chiang et al., 2008) 4 compares WordSys to our proposed system." 
}, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-87", "text": "Significant testing was carried out using bootstrap re-sampling method proposed by Koehn (2004) with a 95% confidence level." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-88", "text": "We see that our proposed systems outperformed WordSys in all segmentation specifications settings." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-89", "text": "Table 5 lists the results of CharSys in small-scale task." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-90", "text": "In this setting, we gradually set the phrase length and the distortion limits of the phrase-based decoder (context size) to 7, 9, 11 and 13, in order to remove the disadvantage of shorter context size of using character as WSR for fair comparison with WordSys as suggested by Duan et al. (2010) ." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-91", "text": "Comparing Table 4 and 5, we see that all CharSys underperformed WordSys." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-92", "text": "This observation is consistent with Chang et al. (2008) which claimed that using characters, even with large phrase length (up to 13 in our experiment) cannot always capture everything a Chinese word segmenter can do, and using word for translation is quite necessary." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-93", "text": "We also see that CharSys underperformed our proposed systems, that's because the harm of using character as WSR outweighed the benefit of using character as WSA, which indicated that word segmentation better for alignment is not necessarily better for translation, and vice versa." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-94", "text": "We finally compared our approaches to Ma et al. (2007) and Ma and Way (2009) , which proposed \"packed word (PW)\" and \"bilingual motivated word (BS)\" respectively." 
}, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-95", "text": "Both methods iteratively learn word segmentation and alignment alternatively, with the former starting from word-based corpus and the latter starting from characters-based corpus." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-96", "text": "Therefore, PW can be experimented on all segmentations." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-97", "text": "Table 6 Comparison with other works scale task, we see that both PW and BS underperformed our approach." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-98", "text": "This may be attributed to the low recall of the learned BS or PW in their approaches." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-99", "text": "BS underperformed both two baselines, one reason is that Ma and Way (2009) also employed word lattice decoding techniques (Dyer et al., 2008) to tackle the low recall of BS, which was removed from our experiments for fair comparison." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-100", "text": "Interestingly, we found that using character as WSA and BS as WSR (Char+BS), a moderate gain (+0.43 point) was achieved compared with fully BS-based system; and using character as WSA and PW as WSR (Char+PW), significant gains were achieved compared with fully PW-based system, the result of CTB segmentation in this setting even outperformed our proposed approach (+0.42 point)." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-101", "text": "This observation indicated that in our framework, better combinations of WSA and WSR can be found to achieve better translation performance." 
}, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-102", "text": "----------------------------------" }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-103", "text": "**CONCLUSIONS AND FUTURE WORK**" }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-104", "text": "We proposed a SMT framework that uses character for alignment and word for translation, which improved both alignment quality and translation performance." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-105", "text": "We believe that in this framework, using other finer-grained segmentation, with fewer ambiguities than character, would better parameterize the alignment models, while using other coarser-grained segmentation as WSR can help capture more linguistic knowledge than word to get better translation." }, { "sent_id": "0b2e3651610aba4bd7150eee50797f-C001-106", "text": "We also believe that our approach, if integrated with combination techniques (Dyer et al., 2008; Xi et al., 2011) , can yield better results." } ], "y": { "@BACK@": { "gold_contexts": [ [ "0b2e3651610aba4bd7150eee50797f-C001-13" ], [ "0b2e3651610aba4bd7150eee50797f-C001-40" ] ], "cite_sentences": [ "0b2e3651610aba4bd7150eee50797f-C001-13", "0b2e3651610aba4bd7150eee50797f-C001-40" ] }, "@USE@": { "gold_contexts": [ [ "0b2e3651610aba4bd7150eee50797f-C001-90" ] ], "cite_sentences": [ "0b2e3651610aba4bd7150eee50797f-C001-90" ] } } }, "ABC_021c423c731ecbe3e26b3ce234b390_41": { "x": [ { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-2", "text": "Misinformation detection at the level of full news articles is a text classification problem." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-3", "text": "Reliably labeled data in this domain is rare." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-4", "text": "Previous work relied on news articles collected from so-called \"reputable\" and \"suspicious\" websites and labeled accordingly." 
}, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-5", "text": "We leverage fact-checking websites to collect individuallylabeled news articles with regard to the veracity of their content and use this data to test the cross-domain generalization of a classifier trained on bigger text collections but labeled according to source reputation." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-6", "text": "Our results suggest that reputation-based classification is not sufficient for predicting the veracity level of the majority of news articles, and that the system performance on different test datasets depends on topic distribution." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-7", "text": "Therefore collecting well-balanced and carefully-assessed training data is a priority for developing robust misinformation detection systems." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-8", "text": "----------------------------------" }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-10", "text": "Automatic detection of fake from legitimate news in different formats such as headlines, tweets and full news articles has been approached in recent Natural Language Processing literature (Vlachos and Riedel, 2014; Vosoughi, 2015; Jin et al., 2016; Rashkin et al., 2017; Wang, 2017; Pomerleau and Rao, 2017; Thorne et al., 2018) ." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-11", "text": "The most important challenge in automatic misinformation detection using modern NLP techniques, especially at the level of full news articles, is data." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-12", "text": "Most previous systems built to identify fake news articles rely on training data labeled with respect to the general reputation of the sources, i.e., domains/user accounts (Fogg et al., 2001; Lazer et al., 2017; Rashkin et al., 2017) ." 
}, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-13", "text": "Even though some of these studies try to identify fake news based on linguistic cues, the question is whether they learn publishers' general writing style (e.g., common writing features of a few clickbaity websites) or deceptive style (similarities among news articles that contain misinformation)." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-14", "text": "In this study, we collect two new datasets that include the full text of news articles and individually assigned veracity labels." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-15", "text": "We then address the above question, by conducting a set of crossdomain experiments: training a text classification system on data collected in a batch manner from suspicious and reputable websites and then testing the system on news articles that have been assessed in a one-by-one fashion." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-16", "text": "Our experiments reveal that the generalization power of a model trained on reputation-based labeled data is not impressive on individually assessed articles." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-17", "text": "Therefore, we propose to collect and verify larger collections of news articles with reliably assigned labels that would be useful for building more robust fake news detection systems." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-18", "text": "----------------------------------" }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-19", "text": "**DATA COLLECTION**" }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-20", "text": "Most studies on fake news detection have examined microblogs, headlines and claims in the form of short statements." 
}, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-21", "text": "A few recent studies have examined full articles (i.e., actual 'fake news') to extract discriminative linguistic features of misinformation Rashkin et al., 2017; Horne and Adali, 2017) ." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-22", "text": "The issue with these studies is the data collection methodology." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-23", "text": "Texts are harvested from websites that are assumed to be fake news publishers (according to a list of suspicious websites), with no individual labeling of data." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-24", "text": "The so-called suspicious sources, however, sometimes do publish facts and valid information, and reputable websites sometimes publish inaccurate information (Mantzarlis, 2017) ." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-25", "text": "The key to collect more reliable data, then, is to not rely on the source but on the text of the article itself, and only after the text has been assessed by human annotators and determined to contain false information." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-26", "text": "Currently, there exists only small collections of reliably-labeled news articles (Rubin et al., 2016; Allcott and Gentzkow, 2017; Zhang et al., 2018; because this type of annotation is laborious." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-27", "text": "The Liar dataset (Wang, 2017) is the first large dataset collected through reliable annotation, but it contains only short statements." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-28", "text": "Another recently published large dataset is FEVER (Thorne et al., 2018) , which contains both claims and texts from Wikipedia pages that support or refute those claims." 
}, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-29", "text": "This dataset, however, has been built to serve the slightly different purpose of stance detection (Pomerleau and Rao, 2017; , the claims have been artificially generated, and texts are not news articles." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-30", "text": "Our objective is to elaborate on the distinction between classifying reputation-based labeled news articles and individually-assessed news articles." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-31", "text": "We do so by collecting and using datasets of the second type in evaluation of a text classifier trained on the first type of data." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-32", "text": "In this section, we first introduce one large collection of news text from previous studies that has been labeled according to the list of suspicious websites, and one small collection that was labeled manually for each and every news article, but only contains satirical and legitimate instances." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-33", "text": "We then introduce two datasets that we have scraped from the web by leveraging links to news articles mentioned by fact-checking websites (Buzzfeed and Snopes)." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-34", "text": "The distinguishing feature of these new collections is that they contain not only the full text of real news articles found online, but also individually assigned veracity labels indicative of their misinformative content." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-35", "text": "Rashkin et al. dataset: Rashkin et al. (2017) published a collection of roughly 20k news articles from eight sources categorized into four classes: propaganda (The Natural News and Activist Report), satire (The Onion, The Borowitz Report, and Clickhole), hoax (American News and DC Gazette) and trusted (Gigaword News)." 
}, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-36", "text": "This dataset is balanced across classes, and since the articles in their training and test splits come from different websites, the accuracy of the trained model on test data should be demonstrative of its understanding of the general writing style of each target class rather than author-specific cues." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-37", "text": "However, we suspect that the noisy strategy to label all articles of a publisher based on its reputation highly biases the classifier decisions and limits its power to distinguish individual misinformative from truthful news articles." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-38", "text": "----------------------------------" }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-39", "text": "**RUBIN ET AL. DATASET:**" }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-40", "text": "As part of a study on satirical cues, Rubin et al. (2016) published a dataset of 360 news articles." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-41", "text": "This dataset contains balanced numbers of individually evaluated satirical and legitimate texts." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-42", "text": "Even though small, it is a clean data to test the generalization power of a system trained on noisy data such as the above explained dataset." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-43", "text": "We use this data to make our point about the need for careful annotation of news articles on a one-by-one fashion, rather than harvesting from websites generally knows as hoax, propaganda or satire publishers." 
}, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-44", "text": "----------------------------------" }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-45", "text": "**BUZZFEEDUSE DATASET:**" }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-46", "text": "The first source of information that we used to harvest full news articles with veracity labels is from the Buzzfeed factchecking company." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-47", "text": "Buzzfeed has published a collection of links to Facebook posts, originally compiled for a study around the 2016 US election (Silverman et al., 2016) ." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-48", "text": "Each URL in this dataset was given to human experts so they can rate the amount of false information contained in the linked article." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-49", "text": "The links were collected from nine Facebook pages (three right-wing, three left-wing and three mainstream publishers)." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-50", "text": "1 We had to follow the facebook URLs and then the link to the original news articles to obtain the news texts." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-51", "text": "We scraped the full text of each news article from its original source." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-52", "text": "The resulting dataset includes a total of 1,380 news articles on a focused topic (US election and candidates)." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-53", "text": "Veracity labels come in a 4-way classification scheme including 1,090 mostly true, 170 mixture of true and false, 64 mostly false and 56 articles containing no factual content." 
}, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-54", "text": "----------------------------------" }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-55", "text": "**SNOPES312 DATASET:**" }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-56", "text": "The second source of information that we used to harvest full news articles with veracity labels is Snopes, a well-known rumor debunking website run by a team of expert editors." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-57", "text": "We scraped the entire archive of factchecking pages." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-58", "text": "On each page they talk about a claim, cite the sources (news articles, forums or social networks where the claim was distributed) and provide a veracity label for the claim." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-59", "text": "We automatically extracted all links mentioned on a Snopes page, followed the link to each original news article, and extracted the text." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-60", "text": "The resulting datafile includes roughly 4,000 rows, each containing a claim discussed by Snopes annotators, the veracity label assigned to it, and the text of a news article related to the claim." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-61", "text": "The main challenge in using this data for training/testing a fake news detector is that some of the links on a Snopes page that we collect automatically do not actually point to the discussed news article, i.e., the source of the claim." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-62", "text": "Many links are to pages that provide contextual information for the fact-checking of the claim." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-63", "text": "Therefore, not all the texts in our automatically extracted dataset are reliable or simply the \"supporting\" source of the claim." 
}, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-64", "text": "To come up with a reliable set of veracity-labeled news articles, we randomly selected 312 items and assessed them manually." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-65", "text": "Two annotators performed independent assessments on the 312 items." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-66", "text": "A third annotator went through the entire list of items for a final check and resolving disagreements." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-67", "text": "Snopes has a fine-grained veracity labeling system." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-68", "text": "We selected [fully] true, mostly true, mixture of true and false, mostly false, and [fully] false stories." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-69", "text": "Table 1 shows the distribution of these labels in the manually assessed 312 items, and how many from each category of news articles were verified to be the \"supporting\" source (distributing the discussed claim), \"context\" (providing background or related information about the topic of the claim), \"debunking\" (against the claim), \"irrelevant\" (completely unrelated to the claim or distorted text) and ambiguous (not sure how it related to the claim)." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-70", "text": "Table 2 provides information on the confusing choices: About 50% of the items received different category labels from the two first annotators." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-71", "text": "The first annotator had a more conservative bias, trying to avoid mistakes in the \"supporting\" category, whereas the second annotator often assigned either \"supporting\" or \"context\", and rarely \"irrelevant\"." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-72", "text": "For the disagreed items, the third annotator (who had access to all outputs) chose the final category." 
}, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-73", "text": "Results in Table 1 are based on this final assessment." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-74", "text": "We use the \"supporting\" portion of the data (145 items) in the following experiments." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-75", "text": "----------------------------------" }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-76", "text": "**EXPERIMENTS**" }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-77", "text": "In text classification, Convolutional Neural Networks (CNNs) have been competing with the TF-IDF model, a simple but strong baseline using scored n-grams (Le and Mikolov, 2014; Zhang et al., 2015; Conneau et al., 2017; Medvedeva et al., 2017) ." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-78", "text": "These methods have been used for fake news detection in previous work (Rashkin et al., 2017; Wang, 2017 fore, we use this model to demonstrate how a classifier trained on data labeled according to publisher's reputation would identify misinformative news articles." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-79", "text": "It is evident in the first section of Figure 1 , that the model performs well on similarly collected test items, i.e., Hoax, Satire, Propaganda and Trusted news articles within Rashkin et al.'s test dataset." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-80", "text": "However, when the model is applied to Rubin et al.'s data, which was carefully assessed for satirical cues in each and every article, the performance drops considerably (See the second section of the figure) ." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-81", "text": "Although the classifier detects more of the satirical texts in Rubin et al.'s data, the distribution of the given labels is not very different to that of legitimate texts." 
}, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-82", "text": "One important feature of Rubin et al.'s data is that topics of the legitimate instances were matched and balanced with topics of the satirical instances." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-83", "text": "The results here suggest that similarities captured by the classifier can be very dependent on the topics of the news articles." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-84", "text": "Next we examine the same model on our collected datasets, BuzzfeedUSE and Snopes312, as test material." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-85", "text": "The BuzzfeedUSE data comes with 4 categories (Figure 1) ." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-86", "text": "The classifier does seem to have some sensitivity to true vs. false information in this dataset, as more of the mostly true articles were labeled as Trusted." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-87", "text": "The difference with mostly false articles, however, is negligible." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-88", "text": "The most frequent label assigned by the classifier was Hoax in all four categories, which suggests that most BuzzfeedUSE articles looked like Hoax in Rashkin's data." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-89", "text": "Finally, the last section of 1 shows the results on the Snopes312 plotted fier was significantly better on both dev and test sets: 0.96 and 0.75 F1-score, respectively, compared to 0.91 and 0.65 reported in their paper." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-90", "text": "Source code will be made available at https://github.com/sfudiscourse-lab/Misinformation_detection along the 6-category distinction." 
}, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-91", "text": "A stronger correlation can be observed between the classifier decisions and the veracity labels in this data compared to BuzzfeedUSE." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-92", "text": "This suggests that distinguishing between news articles with true and false information is a more difficult task when topics are the same (BuzzfeedUSE data is all related to the US election)." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-93", "text": "In Snopes312, news articles come from a variety of topics." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-94", "text": "The strong alignment between the classifier's Propaganda and Hoax labels with the mostly false and [fully] false categories in this dataset reveals that most misinformative news articles indeed discuss the topics or use the language of generally suspicious publishers." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-95", "text": "This is an encouraging result in the sense that, with surface features such as n-grams and approximate reputationbased training data, we already can detect some of the misinformative news articles." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-96", "text": "Observing classification errors across these experiments, however, indicates that the model performance varies a lot with the type of test material: In a focused topic situation, it fails to distinguish between categories (false vs. true, or satirical vs. legitimate articles)." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-97", "text": "While a correlation is consistently observed between labels assigned by the classifier and the actual labels of target news articles, 3 reputationbased classification does not seem to be sufficient for predicting the veracity level of the majority of news articles." 
}, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-98", "text": "----------------------------------" }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-99", "text": "**CONCLUSION**" }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-100", "text": "We found that collecting reliable data for automatic misinformation detection at the level of full news articles is a challenging but necessary task for building robust models." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-101", "text": "If we want to benefit from state-of-the-art text classification techniques, such as CNNs, we require larger datasets than what is currently available." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-102", "text": "We took the first steps, by scraping claims and veracity labels from factchecking websites, extracting and cleaning of the original news articles' texts (resulting in roughly 4,000 items), and finally manual assessment of a subset of the data to provide reliable test material for misinformation detection." }, { "sent_id": "021c423c731ecbe3e26b3ce234b390-C001-103", "text": "Our future plan is to crowd-source annotators for the remaining scraped texts and publish a large set of labeled news articles for training purposes." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "021c423c731ecbe3e26b3ce234b390-C001-10" ], [ "021c423c731ecbe3e26b3ce234b390-C001-21" ], [ "021c423c731ecbe3e26b3ce234b390-C001-78" ] ], "cite_sentences": [ "021c423c731ecbe3e26b3ce234b390-C001-10", "021c423c731ecbe3e26b3ce234b390-C001-21", "021c423c731ecbe3e26b3ce234b390-C001-78" ] }, "@MOT@": { "gold_contexts": [ [ "021c423c731ecbe3e26b3ce234b390-C001-12", "021c423c731ecbe3e26b3ce234b390-C001-13" ] ], "cite_sentences": [ "021c423c731ecbe3e26b3ce234b390-C001-12" ] }, "@USE@": { "gold_contexts": [ [ "021c423c731ecbe3e26b3ce234b390-C001-78" ] ], "cite_sentences": [ "021c423c731ecbe3e26b3ce234b390-C001-78" ] } } }, "ABC_f3e9e5d7fb4001e3d29a171b5eb4a4_41": { "x": [ { "sent_id": "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-2", "text": "Research on question answering with knowledge base has recently seen an increasing use of deep architectures." }, { "sent_id": "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-3", "text": "In this extended abstract, we study the application of the neural machine translation paradigm for question parsing." }, { "sent_id": "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-4", "text": "We employ a sequence-to-sequence model to learn graph patterns in the SPARQL graph query language and their compositions." }, { "sent_id": "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-5", "text": "Instead of inducing the programs through question-answer pairs, we expect a semi-supervised approach, where alignments between questions and queries are built through templates." }, { "sent_id": "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-6", "text": "We argue that the coverage of language utterances can be expanded using late notable works in natural language generation." 
}, { "sent_id": "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-7", "text": "----------------------------------" }, { "sent_id": "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-9", "text": "Question Answering with Knowledge Base (KBQA) parses a natural-language question and returns an appropriate answer that can be found in a knowledge base." }, { "sent_id": "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-10", "text": "Today, one of the most exciting scenarios for question answering is the Web of Data, a fast-growing distributed graph of interlinked knowledge bases which comprises more than 100 billions of edges (McCrae et al., 2018) ." }, { "sent_id": "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-11", "text": "Question Answering over Linked Data (QALD) is a subfield of KBQA aimed at transforming utterances into SPARQL queries (Lopez et al., 2013) ." }, { "sent_id": "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-12", "text": "Being a W3C standard, SPARQL features a high expressivity (Prud'hommeaux et al., 2006) and is by far the most used query language for Linked Data." }, { "sent_id": "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-13", "text": "Among traditional approaches to KBQA, Bao et al. (2014) proposed question decomposition and Statistical Machine Translation to translate sub-questions into triple patterns." }, { "sent_id": "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-14", "text": "The method however relies on entity detection and struggles in recognizing predicates by their contexts (e.g., play in a film or a football team)." }, { "sent_id": "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-15", "text": "In the last years, several methods based on neural networks have been devised to Figure 1 ." }, { "sent_id": "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-16", "text": "Utterances are translated into SPARQL queries encoded as sequences of tokens." 
}, { "sent_id": "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-17", "text": "Using complex surface forms leads to more graph patterns." }, { "sent_id": "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-18", "text": "We aim at learning these compositions." }, { "sent_id": "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-19", "text": "tackle the KBQA problem (Liang et al., 2016; Hao et al., 2017; Lukovnikov et al., 2017; Sorokin & Gurevych, 2017) ." }, { "sent_id": "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-20", "text": "We study the application of the Neural Machine Translation paradigm for question parsing using a sequence-to-sequence model within an architecture dubbed Neural SPARQL Machine, previously introduced in Soru et al. (2017) ." }, { "sent_id": "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-21", "text": "Similarly to Liang et al. (2016) , we employ a sequence-to-sequence model to learn query expressions and their compositions." }, { "sent_id": "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-22", "text": "Instead of inducing the programs through question-answer pairs, we expect a semi-supervised approach, where alignments between questions and queries are built through templates." }, { "sent_id": "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-23", "text": "Although query induction can save a considerable amount of supervision effort (Liang et al., 2016; Zhong et al., 2017) , a pseudo-gold program is not guaranteed to be correct when the same answer can be found with more than one query (e.g., as the capital is often the largest city of a country, predicates might be confused)." }, { "sent_id": "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-24", "text": "On the contrary, our proposed solution relies on manual annotation and a weakly-supervised expansion of question-query templates." 
}, { "sent_id": "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-25", "text": "----------------------------------" }, { "sent_id": "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-26", "text": "**NEURAL SPARQL MACHINES**" }, { "sent_id": "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-27", "text": "Inspired by the Neural Programmer-Interpreter pattern by (Reed & De Freitas, 2015) , a Neural SPARQL Machine is composed by three modules: a generator, a learner, and an interpreter (Soru et al., 2017) ." }, { "sent_id": "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-28", "text": "We define a query template as an alignment between a natural language question and its respective SPARQL query, with entities replaced by placeholders (e.g., \"where is located in?\")." }, { "sent_id": "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-29", "text": "The gen- erator takes query templates as input and creates the training dataset, which is forwarded to the learner." }, { "sent_id": "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-30", "text": "The learner takes natural language as input and generates a sequence which encodes a SPARQL query." }, { "sent_id": "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-31", "text": "Here, a recurrent neural network based on (Bidirectional) Long Short-Term Memories (Hochreiter & Schmidhuber, 1997 ) is employed as a sequence-to-sequence translator (see example in Figure 1 )." }, { "sent_id": "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-32", "text": "The final structure is then reconstructed by the interpreter through rule-based heuristics." }, { "sent_id": "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-33", "text": "Note that a sequence can be represented by any LISP S-expression; therefore, alternatively, sentence dependency trees can be used to encode questions and ARQ algebra (Seaborne, 2010) can be used to encode SPARQL queries." }, { "sent_id": "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-34", "text": "Neural SPARQL Machines do not rely on entity linking methods, since entities and relations are detected within the query construction phase." 
}, { "sent_id": "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-35", "text": "External pre-trained word embeddings help deal with vocabulary mismatch." }, { "sent_id": "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-36", "text": "Knowledge graph jointly embedded with SPARQL operators (Wang et al., 2014) can be utilized in the target space." }, { "sent_id": "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-37", "text": "A curriculum learning (Bengio et al., 2009 ) paradigm can learn graph pattern and SPARQL operator composition, in a similar fashion of Liang et al. (2016) ." }, { "sent_id": "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-38", "text": "We argue that the coverage of language utterances can be expanded using techniques such as Question (Abujabal et al., 2017; Elsahar et al., 2018; Abujabal et al., 2018) and Query Generation (Zafar et al., 2018) as well as Universal Sentence Encoders (Cer et al., 2018 )." }, { "sent_id": "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-39", "text": "Another problem is the disambiguation between entities having the same surface forms." }, { "sent_id": "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-40", "text": "Building on top of the DBtrends approach (Marx et al., 2016) , we force the number of occurrences of a given entity in the training set to be inversely proportional to the entity ranking." }, { "sent_id": "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-41", "text": "Following this strategy, we expect the RNN to associate the word Berlin with the German capital and not with Berlin, New Hampshire." }, { "sent_id": "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-42", "text": "----------------------------------" }, { "sent_id": "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-43", "text": "**EXPERIMENTS AND CURRENT PROGRESS**" }, { "sent_id": "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-44", "text": "We selected the DBpedia Knowledge Base (Lehmann et al., 2015) as the dataset for our experiments, due to its central importance for the Web of Data." 
}, { "sent_id": "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-45", "text": "We built a dataset of 3,108 entities about movies from DBpedia and annotated 20 and 4 question-query templates with one and two placeholders, resp." }, { "sent_id": "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-46", "text": "Our preliminary results are given in Table 1 ." }, { "sent_id": "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-47", "text": "We experimented with 6 different SPARQL encodings, i.e. ways to encode a SPARQL query into a sequence of tokens." }, { "sent_id": "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-48", "text": "At each row of the table, we provide the description of the corresponding changes, each of which persists in the next encodings." }, { "sent_id": "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-49", "text": "The experiments were carried out on a 64-CPU Ubuntu machine with 512 GB RAM." }, { "sent_id": "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-50", "text": "1 We adopted the implementation of seq2seq in TensorFlow with internal embeddings of 128 dimensions, 2 hidden layers, and a dropout value of 0.2." }, { "sent_id": "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-51", "text": "All settings were tested on the same set of unseen questions after applying an 80-10-10% split." }, { "sent_id": "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-52", "text": "The results confirmed that the SPARQL encoding highly influences the learning." }, { "sent_id": "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-53", "text": "Adding more complex templates (i.e., with more than one placeholder) to the generator input yielded a richer training set and more questions were parsed correctly." }, { "sent_id": "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-54", "text": "Merging tokens (see queries and their respective sequences in Figure 1) helped the machine translation, as the SPARQL sequences became shorter." 
}, { "sent_id": "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-55", "text": "Adding alignments of entities and their labels to the training set turned out to be beneficial for a faster convergence, as Figure 2 shows." }, { "sent_id": "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-56", "text": "The most frequent errors were due to entity name collisions and out-of-vocabulary words; both issues can be tackled with the strategies introduced in this work." }, { "sent_id": "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-57", "text": "We plan to perform an evaluation on the WEBQUESTION-SSP (Yih et al., 2016) and QALD (Unger et al., 2014) benchmarks to compare with the state-of-the-art approaches for KBQA and QALD, respectively." } ], "y": { "@SIM@": { "gold_contexts": [ [ "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-21" ] ], "cite_sentences": [ "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-21" ] }, "@USE@": { "gold_contexts": [ [ "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-21" ] ], "cite_sentences": [ "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-21" ] }, "@BACK@": { "gold_contexts": [ [ "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-23" ], [ "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-37" ] ], "cite_sentences": [ "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-23", "f3e9e5d7fb4001e3d29a171b5eb4a4-C001-37" ] } } }, "ABC_887864e173d7f7c164fe9f9d940727_41": { "x": [ { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-91", "text": "TransNet [3] adds a regularizer on the penultimate layer that forces the network to predict review embedding." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-92", "text": "TransRev [5] is based on the same idea of restoring the review embedding from user and item embeddings." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-2", "text": "Abstract." 
}, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-3", "text": "We propose a novel end-to-end Aspect-based Rating Prediction model (AspeRa) that estimates user rating based on review texts for the items and at the same time discovers coherent aspects of reviews that can be used to explain predictions or profile users." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-4", "text": "The AspeRa model uses max-margin losses for joint item and user embedding learning and a dual-headed architecture; it significantly outperforms recently proposed state-of-the-art models such as DeepCoNN, HFT, NARRE, and TransRev on two real world data sets of user reviews." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-5", "text": "With qualitative examination of the aspects and quantitative evaluation of rating prediction models based on these aspects, we show how aspect embeddings can be used in a recommender system." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-6", "text": "----------------------------------" }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-8", "text": "As the scale of online services and the Web itself grows, recommender systems increasingly attempt to utilize texts available online, either as items for recommendation or as their descriptions [1, 24, 27, 43] ." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-9", "text": "One key complication is that a single text can touch upon many different features of the item; e.g., the same brief review of a laptop can assess its weight, performance, keyboard, and so on, with different results." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-10", "text": "Hence, real-world applications need to separate different aspects of reviews." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-11", "text": "This idea also has a long history [16, 28] ." 
}, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-12", "text": "Many recent works in recommender systems have applied deep learning methods [11, 33, 35, 43] ." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-13", "text": "In this work, we introduce novel deep learning methods for making recommendations with full-text items, aiming to learn interpretable user representations that reflect user preferences and at the same time help predict ratings." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-14", "text": "We propose a novel Aspect-based Rating Prediction Model (AspeRa) for aspect-based representation learning for items by encoding word-occurrence statistics into word embeddings and applying dimensionality reduction to extract the most important aspects that are used for the user-item rating estimation." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-15", "text": "We investigate how and in what settings such neural autoencoders can be applied to contentbased recommendations for text items." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-16", "text": "----------------------------------" }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-17", "text": "**ASPERA MODEL**" }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-18", "text": "The AspeRa model combines the advantages of deep learning (end-to-end learning, spatial text representation) and topic modeling (interpretable topics) for text-based recommendation systems." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-19", "text": "Fig. 1 shows the overall architecture of AspeRa." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-20", "text": "The model receives as input two reviews at once, treating both identically." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-21", "text": "Each review is embedded with self-attention to produce two vectors, one for author (user) features and the other for item features." 
}, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-22", "text": "These two vectors are used to predict a rating corresponding to the review." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-23", "text": "All vectors are forced to belong to the same feature space." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-24", "text": "The embedding is produced by the Neural Attention-Based Aspect Extraction Model (ABAE) [7] ." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-25", "text": "As in topic modeling or clustering, with ABAE the designer can determine a finite number of topics/clusters/aspects, and the goal is to find out for every document to which extent it satisfies each topics/aspects." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-26", "text": "From a bird's eye view, ABAE is an autoencoder." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-27", "text": "The main feature of ABAE is the reconstruction loss between bag-ofwords embeddings used as the sentence representation and a linear combination of aspect embeddings." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-28", "text": "A sentence embedding is additionally weighted by selfattention, an attention mechanism where the values are word embeddings and the key is the mean embedding of words in a sentence." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-29", "text": "The first step in ABAE is to compute the embedding z s \u2208 R d for a sentence s; below we call it a text embedding: z s = n i=1 a i e wi , where e wi is a word embedding for a word w i , e \u2208 R d ." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-30", "text": "As word vectors the authors use word2vec embeddings trained with the skip-gram model [22] ." 
}, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-31", "text": "Attention weights a i are computed as a multiplicative self-attention model: a i = softmax(e wi Ay s ), where y s is the average of word embeddings in a sentence, y s = n i=1 e wi , and A \u2208 R d\u00d7d is the learned attention model." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-32", "text": "The second step is to compute the aspectbased sentence representation r s \u2208 R d from an aspect embeddings matrix T \u2208 R k\u00d7d , where k is the number of aspects:" }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-33", "text": "is the vector of probability weights over k aspect embeddings, r s = T p s , and W \u2208 R k\u00d7d , b \u2208 R k are the parameters of a multi-class logistic regression model." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-34", "text": "Below we call r s the reconstructed embedding." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-35", "text": "To train the model, ABAE uses the cosine distance between r s and z s with a contrastive max-margin objective function [41] as the reconstruction error, also adding an orthogonality penalty term that tries to make the aspect embedding matrix T to produce aspect embeddings as diverse as possible." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-36", "text": "The proposed model's architecture includes an embedder, which provides text and reconstruction embeddings for an object similar to ABAE (\"user embedding\" and \"item embedding\" on Fig. 1 )." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-37", "text": "The intuition behind this separation of user and item embedding is as follows: there are some features (aspects) important in an item for a user, but the item also has other features." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-38", "text": "Hence, we want to extract user aspects from a user's reviews as well as item aspects from an item's reviews." 
}, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-39", "text": "The resulting embedding is conditioned on aspect representation of the reviews; we will see below that this model can discover interpretable topics." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-40", "text": "The model contains four embedders in total, one pair of user and item embedders for two reviews being considered at once, as shown on Fig. 1 ." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-41", "text": "First each review is paired with another review of the same user, grouping by users and shuffling the reviews inside a group; then with another review of the same item." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-42", "text": "Thus, the training set gives rise to only twice as many pairs as reviews available for training." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-43", "text": "The rating score for the first review in a pair is used to train the rating predictor (MSE ); at prediction stage, only one \"tower\" is used." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-44", "text": "There are two losses in AspeRa: MSE for rating prediction ( Fig. 1 ) and MaxMargin loss to put user and item embeddings in the same space (Fig. 1) ." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-45", "text": "The MSE loss assumes that rating is predicted as the dot product of user and item embeddings for a review:" }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-46", "text": "2 , where z u j is a text embedding for the author of review j, z i j is a text embedding for the item j is about, and r j is the true rating associated with j. Max-margin loss aims to project all user and item embeddings into the same feature (aspect) space; see Fig. 1 ." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-47", "text": "We use it in two ways." 
}, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-48", "text": "First, we push reconstructed and text embeddings to be closer for each user i, and pushes text embeddings for both considered items apart: MaxMargin(i, j) =" }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-49", "text": "Second, we keep user embeddings from two reviews of the same author close:" }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-50", "text": "." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-51", "text": "This second form is symmetrically applied to item and user embeddings for two reviews pf the same item from different authors." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-52", "text": "----------------------------------" }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-53", "text": "**EXPERIMENTAL EVALUATION**" }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-54", "text": "Datasets and experimental setup." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-55", "text": "We evaluated the proposed model on Amazon Instant Videos 5-core reviews and Amazon Toys and Games 5-core reviews 5 [9, 20] ." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-56", "text": "The first dataset consists of reviews written by users with at least five reviews on Amazon and/or for items with at least five reviews; it contains 37,126 reviews, 5,130 users, 1,685 items, and a total of 3,454,453 non-unique tokens." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-57", "text": "The second dataset follows 5 minimum reviews rule; it contains 167,597 reviews, 19, 412 users, 11, 924 items, and a total of 17, 082, 324 non-unique tokens." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-58", "text": "We randomly split each dataset into 10% test set and 90% training set, with 10% of the training set used as a validation set for tuning hyperparameters." 
}, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-59", "text": "Following ABAE [7] , we set the aspects matrix ortho-regularization coefficient equal to 0.1." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-60", "text": "Since this model utilizes an aspect embedding matrix to approximate aspect words in the vocabulary, initialization of aspect embeddings is crucial." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-61", "text": "The work [8] used k-means clustering-based initialization [17, 18, 36] , where the aspect embedding matrix is initialized with centroids of the resulting clusters of word embeddings." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-62", "text": "We compare two word embeddings for AspeRa: GloVe [29] and word2vec [21, 23] ." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-63", "text": "We adopted a GloVe model trained on the Wikipedia 2014 + Gigaword 5 dataset (6B tokens, 400K words vocabulary, uncased tokens) with dimension 50." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-64", "text": "For word2vec, we used the training set of reviews to train a skip-gram model (SGNS) with the gensim library [31] with dimension 200, window size 10, and 5 negative samples; see Table 1 for details." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-65", "text": "Rating Prediction." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-66", "text": "We evaluate the performance of AspeRa in comparison to state-of-the-art models: NMF [42] , DeepCoNN [43] , Attn+CNN [33] , SVD [14] , HFT [19] , NARRE [4] , and TransRev [5] ; we introduce these models in Section 4." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-67", "text": "Table 2 compares the best Mean Square Error (MSE) of AspeRa and other models for rating prediction." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-68", "text": "Results of existing models were adopted from [5] for Table 3 : Sample aspects from Instant Videos discovered by AspeRa (SGNS)." 
}, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-69", "text": "# Aspect words 1 communities governments incidents poverty unity hardships slaves citizens fought 2 coppola guillermo bram kurosawa toro ridley del prolific ti festivals 3 brisk dialouge manipulation snappy plotlines dialogues taunt camerawork muddled 4 sock vegans peanut stifling bats buh ammonium trollstench vegetables pepsi 5 the a and to is of joe's enters that fatal Table 4 : Sample aspects from Instant Videos discovered by AspeRa (GloVe)." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-70", "text": "# Aspect words 1 protein diagnose cell genes brain membrane interacts interact oxygen spinal 2 boost monetary raise introduce measures credit expects increase push demand 3 towel soaked greasy towels cloth dripping tucked crisp coat buckets 4 offbeat comic parody spoof comedic quirky cinematic campy parodies animated 5 sheesh wham whew hurrah oops yikes c'mon shhh oooh och Amazon Instant Videos 5-core reviews with the ratio 80:10:10." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-71", "text": "We also used the results of NARRE model [4] , obtained in the same setup as [5] but with a different random seed." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-72", "text": "Note that while AspeRa with generic GloVe word embeddings still works better than any other model, adding custom word embeddings trained on the same type of texts improves the results greatly." 
}, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-73", "text": "Topic Quality We compared the performance of AspeRa with OnlineLDA [10] trained with the gensim library [31] , with the same vocabulary and number of topics, and ABAE with 10 aspects and 18 epochs, initialized with the same word2vec vectors (SGNS) as AspeRa and having the same ortho-regularization coefficient as the best AspeRa model, evaluating the results in terms of topic coherence metrics, NPMI [2] and PMI [25, 26] computed with companion software for [15] ." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-74", "text": "Figure 2 shows that the quality is generally lower for larger number of representative words per aspect (horizontal axis), and that AspeRa achieves scores comparable to LDA and ABAE, although ABAE remains ahead." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-75", "text": "Tables 3 and 4 present several sample aspects discovered by AspeRA." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-76", "text": "Qualitative analysis shows that some aspects describe what could be called a topic (a set of words diverse by part of speech and function describing a certain domain), some encode sentiment (top words are adjectives showing attitude to certain objects discussed in the text), and some encode names (actors, directors, etc.)." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-77", "text": "We also found similar patterns in the output of the basic ABAE model [7] ." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-78", "text": "Thus, most aspects are clearly coherent, but there is room for improvement." 
}, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-79", "text": "----------------------------------" }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-80", "text": "**RELATED WORK**" }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-81", "text": "Classical collaborative filtering based on matrix factorization (MF) [14, 42] has been extended with textual information, often in the form of topics/aspects; as-pect extraction uses topic modelling [37, 38, 44] and phrase-based extraction [34] ." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-82", "text": "Collaborative topic regression (CTR) [39] was one of the first models to combine collaborative-based and topic-based approaches to recommendation; to recommend research articles; it uses an LDA topic vector as a prior of item embeddings for MF." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-83", "text": "Hidden Factors and Hidden Topics (HTF) [19] also combines MF and LDA but with user reviews used as contextual information." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-84", "text": "A few subsequent works use MF along with deep learning approaches; e.g., Collaborative Deep Learning (CDL) [40] improves upon CTR by replacing LDA with a stacked denoising autoencoder." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-85", "text": "Unlike our approach, all these models learn in alternating rather than end-to-end manner." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-86", "text": "Recent advances in distributed word representations have made it a cornerstone of modern natural language processing [6] , with neural networks recently used to learn text representations." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-87", "text": "He et al. [7] proposed an unsupervised neural attention-based aspect extraction (ABAE) approach that encodes word-occurrence statistics into word embeddings and applies an attention mechanism to remove irrelevant words, learning a set of aspect embeddings." 
}, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-88", "text": "Several recent works, including DeepCoNN [43] , propose a completely different approach." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-89", "text": "DeepCoNN is an end-to-end model, both user and item embedding vectors in this model are trainable functions (convolutional neural networks) of reviews associated with a user or item respectively." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-90", "text": "Experiments on Yelp and Amazon datasets showed significant improvements over HFT." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-93", "text": "Attn+CNN and D-Attn [32, 33] extend DeepCoNN with an attention mechanism on top of text reviews; it both improves performance and allows to explain predictions by highlighting significant words." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-94", "text": "However, user and item embeddings of these models are learned in a fully supervised way, unlike the proposed model." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-95", "text": "Our model combines semi-supervised embedding learning, which makes predictions interpretable similar to HTF, with a deep architecture and end-to-end training." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-96", "text": "----------------------------------" }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-97", "text": "**CONCLUSION**" }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-98", "text": "We have introduced a novel approach to learning rating-and text-aware recommender systems based on ABAE, metric learning, and autoencoder-enriched learning." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-99", "text": "Our approach jointly learns interpretable user and item representations." 
}, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-100", "text": "It is expectedly harder to tune to achieve better quality, but the final model performs better at rating prediction and almost on par at aspects coherence with other state-of-the-art approaches." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-101", "text": "Our results can also be viewed as part of the research effort to analyze and interpret deep neural networks, a very important recent trend [12, 30] ." }, { "sent_id": "887864e173d7f7c164fe9f9d940727-C001-102", "text": "We foresee the following directions for future work: (i) further improving prediction quality (especially for models that learn interpretable user representations), (ii) integrating methods that can remove \"purely sentimental\" aspects into interpretable models for recommendations that we have discussed above, (iii) developing visualization techniques for user profiles." } ], "y": { "@USE@": { "gold_contexts": [ [ "887864e173d7f7c164fe9f9d940727-C001-14", "887864e173d7f7c164fe9f9d940727-C001-15", "887864e173d7f7c164fe9f9d940727-C001-16", "887864e173d7f7c164fe9f9d940727-C001-17", "887864e173d7f7c164fe9f9d940727-C001-18", "887864e173d7f7c164fe9f9d940727-C001-19", "887864e173d7f7c164fe9f9d940727-C001-20", "887864e173d7f7c164fe9f9d940727-C001-21", "887864e173d7f7c164fe9f9d940727-C001-22", "887864e173d7f7c164fe9f9d940727-C001-23", "887864e173d7f7c164fe9f9d940727-C001-24" ], [ "887864e173d7f7c164fe9f9d940727-C001-59" ] ], "cite_sentences": [ "887864e173d7f7c164fe9f9d940727-C001-24", "887864e173d7f7c164fe9f9d940727-C001-59" ] }, "@SIM@": { "gold_contexts": [ [ "887864e173d7f7c164fe9f9d940727-C001-59" ], [ "887864e173d7f7c164fe9f9d940727-C001-76", "887864e173d7f7c164fe9f9d940727-C001-77" ] ], "cite_sentences": [ "887864e173d7f7c164fe9f9d940727-C001-59", "887864e173d7f7c164fe9f9d940727-C001-77" ] }, "@BACK@": { "gold_contexts": [ [ "887864e173d7f7c164fe9f9d940727-C001-87" ] ], "cite_sentences": [ 
"887864e173d7f7c164fe9f9d940727-C001-87" ] } } }, "ABC_ef6bd5e57196c013d7d0436e5b0ca5_42": { "x": [ { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-40", "text": "**SENTENCE RETRIEVAL**" }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-2", "text": "We describe here our system and results on the FEVER shared task." }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-3", "text": "We prepared a pipeline system which composes of a document selection, a sentence retrieval, and a recognizing textual entailment (RTE) components." }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-4", "text": "A simple entity linking approach with text match is used as the document selection component, this component identifies relevant documents for a given claim by using mentioned entities as clues." }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-5", "text": "The sentence retrieval component selects relevant sentences as candidate evidence from the documents based on TF-IDF." }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-6", "text": "Finally, the RTE component selects evidence sentences by ranking the sentences and classifies the claim simultaneously." }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-7", "text": "The experimental results show that our system achieved the FEVER score of 0.4016 and outperformed the official baseline system." }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-8", "text": "----------------------------------" }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-10", "text": "The increasing amounts of textual information on the Web have brought demands to develop techniques to extract and verify a fact." 
}, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-11", "text": "The Fact Extraction and VERification (FEVER) task (Thorne et al., 2018) focuses on verification of textual claims against evidence." }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-12", "text": "In the FEVER shared task, a given claim is classified as SUPPORTED, REFUTED, or NOTENOUGHINFO (NEI)." }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-13", "text": "Evidence to justify a given claim is required for SUPPORTED or REFUTED claims." }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-14", "text": "The evidence is not given and must be retrieved from Wikipedia." }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-15", "text": "This paper describes our participating system in the FEVER shared task." }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-39", "text": "----------------------------------" }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-16", "text": "The architecture of our system is designed by following the official baseline system (Thorne et al., 2018) ." }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-17", "text": "There are two * Authors contributed equally main differences between our system and the baseline system." }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-18", "text": "The first one is identifying documents that contain evidence by using text match between mentioned entities in a given claim and Wikipedia page title." }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-19", "text": "The details are described in Section 2.1." }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-20", "text": "The next one is a neural network based model, details of which are described in Section 2.3, for selecting evidence sentences as ranking task and classifying a claim simultaneously." 
}, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-21", "text": "----------------------------------" }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-22", "text": "**SYSTEM**" }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-23", "text": "We propose a pipeline system which composes of a document selection, a sentence retrieval, and a recognizing textual entailment (RTE) components." }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-24", "text": "A simple entity linking approach with text match is used as the document selection component." }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-25", "text": "This component identifies relevant documents for a given claim by using mentioned entities as clues." }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-26", "text": "The sentence retrieval component selects relevant sentences as candidate evidence from the documents based on Term Frequency-Inverse Document Frequency (TF-IDF)." }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-27", "text": "Finally, the RTE component selects evidence sentences by ranking the candidate sentences and classifies the claim as SUPPORTED, REFUTED, or NOTENOUGHINFO simultaneously." }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-28", "text": "Details of the components are described in the following Section." }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-29", "text": "----------------------------------" }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-30", "text": "**DOCUMENT SELECTION**" }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-31", "text": "Wikipedia pages of entities mentioned in a claim can be good candidate documents containing the SUPPORTED/REFUTED evidence." }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-32", "text": "Therefore, we use a simple but efficient entity linking approach as a document selection component." 
}, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-33", "text": "In our entity linking approach, relevant documents are retrieved by using exact match between page titles of Wikipedia and words in a claim." }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-34", "text": "We expect this component to select only surely correct documents." }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-35", "text": "In other words, we decided to prefer precision of evidence rather than recall." }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-36", "text": "In fact, our preliminary experiment indicates that 68% of claims excluding NEI in a development set can be fully supported or refuted by the retrieved documents with our approach." }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-37", "text": "This corresponds roughly to the accuracy of 10 nearest documents retrieved by the DrQA (Chen et al., 2017) based retrieval approach used in the baseline system." }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-38", "text": "The average number of selected documents in our approach is 3.7, and thus our approach is more efficient than the baseline system." }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-41", "text": "Following the baseline system, we use a sentence retrieval component which returns K nearest sentences for a claim using cosine similarity between unigram and bigram TF-IDF vectors." }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-42", "text": "The K nearest sentences are retrieved from the documents selected by the document selection component." }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-43", "text": "We selected optimal K using grid search over {5, 10, 15, 20, 50} in terms of the performance of the full pipeline system on a development set." }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-44", "text": "The optimal values was K = 15." 
}, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-45", "text": "----------------------------------" }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-46", "text": "**RECOGNIZING TEXTUAL ENTAILMENT**" }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-47", "text": "As RTE component, we adopt DEISTE (Deep Explorations of Inter-Sentence interactions for Textual Entailment) model that is the state-of-the-art in RTE tasks (Yin et al., 2018) ." }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-48", "text": "RTE component is trained on labeled claims paired with sentencelevel evidence." }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-49", "text": "To build the model, we utilize the NEARESTP dataset described in Thorne et al. (2018) ." }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-50", "text": "In a case where multiple sentences are required as evidence, the texts of the sentences are concatenated." }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-51", "text": "We use Adam (Kingma and Ba, 2014) as an optimizer and utilize 300 dimensional GloVe vector which is adapted by the baseline system." }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-52", "text": "The other model parameters are the same as the parameters described in Yin et al. (2018) ." }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-53", "text": "Claims labelled as NEI are easier to predict correctly than SUPPORTED and REFUTED because unlike SUPPORTED and REFUTED, NEI dose not need evidence." }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-54", "text": "Therefore, our RTE component are designed to predict the claims as NEI if the model can not predict claims as SUPPORTED or REFUTED with high confidence." }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-55", "text": "RTE prediction process is composed of three steps." 
}, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-56", "text": "Firstly, we calculate the probability score of each label for pairs of a claim and candidate sentence using DEISTE model." }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-57", "text": "Secondly, we decide a prediction label using the following equations." }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-58", "text": "where S is a set of pairs of a claim and candidate sentence; A = {SUPPORTED, REFUTED}; P s,a is a probability score of a pair for label a; P t is a threshold value; Label pred is prediction label for a claim." }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-59", "text": "Finally, we sort candidate sentences in descending order of scores and select at most 5 evidence sentences with the same label as predicted label." }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-60", "text": "We also apply grid search to find the best threshold P t and set it to 0.93." }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-61", "text": "----------------------------------" }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-62", "text": "**EVALUATION**" }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-63", "text": "----------------------------------" }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-64", "text": "**DATASET**" }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-65", "text": "We used official training dataset for training RTE component." }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-66", "text": "For parameter tuning and performance evaluation, we used a development and test datasets used in (Thorne et al., 2018) ." 
}, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-67", "text": "----------------------------------" }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-68", "text": "**IN-HOUSE EXPERIMENT**" }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-69", "text": "We evaluated our system and baseline system on the test dataset with FEVER score, label accuracy, evidence precision, evidence recall and evidence F1." }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-70", "text": "FEVER score is classification accuracy of claims if the correct evidence is selected." }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-71", "text": "Label accuracy is classification accuracy of claims if the requirement for correct evidence is ignored." }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-72", "text": "Table 2 shows the evaluation results on the test dataset." }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-73", "text": "Our system achieved FEVER score of 0.4016 and outperformed the baseline system." }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-74", "text": "As expected, our system produced a significant improvement of 59 points in evidence precision against the baseline system." }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-75", "text": "Though evidence recall decreased, evidence F1 increased by 17 points compared to the baseline system. tends to predict claims as NEI, the precisions of SUPPORTED (929/1396 = 0.67) and REFUTED (1331/1812 = 0.73) are higher than the precision of NEI (2670/6791 = 0.39)." }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-76", "text": "Table 4 presents the evaluation results of our submissions." }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-77", "text": "The models showed similar behavior as in the in-house experiment excepting evidence F1." }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-78", "text": "Our submission were ranked in 9th place." 
}, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-79", "text": "----------------------------------" }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-80", "text": "**SUBMISSION RUN**" }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-81", "text": "----------------------------------" }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-82", "text": "**CONCLUSION**" }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-83", "text": "We developed a pipeline system which composes of a document selection, a sentence retrieval, and an RTE components for the FEVER shared task." }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-84", "text": "Evaluation results of in-house experiment show that our system achieved improvement of 12% in FEVER score against the baseline system." }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-85", "text": "Even though document selection component of our system has contributed to find more correct evidence document, the component was too strict, and thus degraded evidence recall." }, { "sent_id": "ef6bd5e57196c013d7d0436e5b0ca5-C001-86", "text": "Therefore, as a future work, we plan to explore more sophisticated entity linking approach." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "ef6bd5e57196c013d7d0436e5b0ca5-C001-11" ] ], "cite_sentences": [ "ef6bd5e57196c013d7d0436e5b0ca5-C001-11" ] }, "@USE@": { "gold_contexts": [ [ "ef6bd5e57196c013d7d0436e5b0ca5-C001-11", "ef6bd5e57196c013d7d0436e5b0ca5-C001-15" ], [ "ef6bd5e57196c013d7d0436e5b0ca5-C001-16" ], [ "ef6bd5e57196c013d7d0436e5b0ca5-C001-49" ], [ "ef6bd5e57196c013d7d0436e5b0ca5-C001-66" ] ], "cite_sentences": [ "ef6bd5e57196c013d7d0436e5b0ca5-C001-11", "ef6bd5e57196c013d7d0436e5b0ca5-C001-16", "ef6bd5e57196c013d7d0436e5b0ca5-C001-49", "ef6bd5e57196c013d7d0436e5b0ca5-C001-66" ] } } }, "ABC_6cb86d91918743b0e4ff27e9d2351b_42": { "x": [ { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-2", "text": "We present an error mining tool that is designed to help human annotators to find errors and inconsistencies in their annotation." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-3", "text": "The output of the underlying algorithm is accessible via a graphical user interface, which provides two aggregate views: a list of potential errors in context and a distribution over labels." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-4", "text": "The user can always directly access the actual sentence containing the potential error, thus enabling annotators to quickly judge whether the found candidate is indeed incorrectly labeled." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-5", "text": "----------------------------------" }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-7", "text": "Manually annotated corpora and treebanks are the primary tools that we have for developing and evaluating models and theories for natural language processing." 
}, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-8", "text": "Given their importance for testing our hypotheses, it is imperative that they are of the best quality possible." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-9", "text": "However, manual annotation is tedious and error-prone, especially if many annotators are involved." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-10", "text": "It is therefore desirable to have automatic means for detecting errors and inconsistencies in the annotation." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-11", "text": "Automatic methods for error detection in treebanks have been developed in the DECCA project 1 for several different annotation types, for example part-of-speech (Dickinson and Meurers, 2003a) , constituency syntax (Dickinson and Meurers, 2003b) , and dependency syntax (Boyd et al., 2008) ." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-12", "text": "These algorithms work on the assumption that two data points that appear in identical contexts should be labeled in the same way." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-13", "text": "While the data points in question, or nuclei, can be single tokens, spans of tokens, or edges between two tokens, context is usually modeled as n-grams over the surrounding tokens." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-14", "text": "A nucleus that occurs 1 http://www.decca.osu.edu multiple times in identical contexts but is labeled differently shows variation and is considered a potential error." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-15", "text": "Natural language is ambiguous and variation found by an algorithm may be a genuine ambiguity rather than an annotation error." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-16", "text": "Although we can support an annotator in finding inconsistencies in a treebank, these inconsistencies still need to be judged by humans." 
}, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-17", "text": "In this paper, we present a tool that allows a user to run automatic error detection on a corpus annotated with part-of-speech or dependency syntax." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-18", "text": "2 The tool provides the user with a graphical interface to browse the variation nuclei found by the algorithm and inspect their label distribution." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-19", "text": "The user can always switch between high-level aggregate views and the actual sentences containing the potential error in order to decide if that particular annotation is incorrect or not." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-20", "text": "The interface thus brings together the output of the error detection algorithm with a direct access to the corpus data." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-21", "text": "This speeds up the process of tracking down inconsistencies and errors in the annotation considerably compared to working with the raw output of the original DECCA tools." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-22", "text": "Several options allow the user to fine-tune the behavior of the algorithm." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-23", "text": "The tool is part of ICARUS (G\u00e4rtner et al., 2013) , a general search and exploration tool." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-24", "text": "3" }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-25", "text": "----------------------------------" }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-26", "text": "**THE ERROR DETECTION ALGORITHM**" }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-27", "text": "The algorithm, described in Dickinson and Meurers (2003a) for POS tags, works by starting from individual tokens (the nuclei) by recording their assigned part-of-speech over an entire treebank." 
}, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-28", "text": "From there, it iteratively increases the context for each instance by extending the string to both sides to include adjacent tokens." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-29", "text": "It thus successively builds larger n-grams by adding tokens to the left or to the right. Instances are grouped together if their context is identical, i. e. if their token ngrams match." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-30", "text": "Groups where all instances have the same label do not show variation and are discarded." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-31", "text": "The algorithm stops when either no variation nuclei are left or when none of them can be further extended." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-32", "text": "All remaining groups that show variation are considered potential errors." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-33", "text": "Erroneous annotations that do not show variation in the data cannot be found by the algorithm." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-34", "text": "This limits the usefulness of the method for very small data sets." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-35", "text": "Also, given the inherent ambiguity of natural language, the algorithm is not guaranteed to exclusively output errors, but it achieves very high precision in experiments on several languages." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-36", "text": "The algorithm has been extended to find errors in constituency and dependency structures (Dickinson and Meurers, 2003b; Boyd et al., 2008) , where the definition of a nucleus is changed to capture phrases and dependency edges." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-37", "text": "Context is always modeled using n-grams over surrounding tokens, but see, e. g., Boyd et al. (2007) for extensions." 
}, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-38", "text": "----------------------------------" }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-39", "text": "**GRAPHICAL ERROR MINING**" }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-40", "text": "To start the error mining, a treebank and an error mining algorithm (part-of-speech or dependency) must be selected." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-41", "text": "The algorithm is then executed on the data to create the variation n-grams." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-67", "text": "To the right, the frequencies of the different labels are shown in a bar chart." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-42", "text": "The user can choose between two views for browsing the potential errors in the treebank: (1) a view showing the list of variation n-grams found by the error detection algorithm and (2) a view showing label distributions over word forms." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-43", "text": "Figure 1 shows a screenshot of the view where the user is presented with the list of variation n-grams output by the error detection algorithm." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-44", "text": "The main window shows the list of n-grams." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-45", "text": "When the user selects one of the n-grams, information about the nucleus is displayed below the main window." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-46", "text": "The user can inspect the distribution over labels (here part-of-speech tags) with their absolute frequencies." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-47", "text": "Above the main window, the user can adjust the length of the presented n-grams, sort them, or search for specific strings." 
}, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-48", "text": "----------------------------------" }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-49", "text": "**THE VARIATION N-GRAM VIEW**" }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-50", "text": "For example, Figure 1 shows a part of the variation n-grams found in the German TiGer corpus (Brants et al., 2002) ." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-51", "text": "The minimum and maximum length was restricted to four, thus the list contains only 4-grams." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-52", "text": "The 4-gram so hoch wie in was selected, which contains wie as its nucleus." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-53", "text": "In the lower part, the user can see that wie occurs with four different part-of-speech tags in the treebank, namely KOKOM, PWAV, KON, and KOUS." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-54", "text": "Note that the combination with KOUS occurs only once in the entire treebank." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-55", "text": "Double clicking on the selected 4-gram in the list will open up a new tab that displays all sentences that contain this n-gram, with the nucleus being highlighted." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-56", "text": "The user can then go through each of the sentences and decide whether the annotated part-of-speech tag is correct." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-57", "text": "Each time the user clicks on an n-gram, a new tab will be created, so that the user can jump back to previous results without having to recreate them." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-58", "text": "A double click on one of the lines in the lower part of the window will bring up all sentences that contain that particular combination of word form and part-of-speech tag." 
}, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-59", "text": "The fourth line will, for example, show the one sentence where wie has been tagged as KOUS, making it easy to quickly judge whether the tag is correct." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-60", "text": "In this case, the annotation is incorrect (it should have been PWAV) and should thus be marked for correction." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-61", "text": "----------------------------------" }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-62", "text": "**THE LABEL DISTRIBUTION VIEW**" }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-63", "text": "In addition to the output of the algorithm by Dickinson and Meurers (2003a), the tool also provides a second view, which displays tag distributions of word forms to the user (see Figure 2) ." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-64", "text": "To the left, a list of unique label combinations is shown." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-65", "text": "Selecting one of them displays a list of word forms that occur with exactly these tags in the corpus." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-66", "text": "This list is shown below the list of label combinations." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-68", "text": "The leftmost bar for each label always shows the total frequency summed over all word forms in the set." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-69", "text": "Selecting one or more in the list of word forms adds additional bars to the chart that show the frequencies for each selected word form." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-70", "text": "As an example, Figure 2 shows the tag combination [VVINF] [VVIZU], which are used to tag infinitives with and without incorporated zu in German." 
}, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-71", "text": "There are three word forms in the corpus that occur with these two part-of-speech tags: hinzukommen, aufzul\u00f6sen, and anzun\u00e4hern." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-72", "text": "The chart on the right shows the frequencies for each word form and part-of-speech tag, revealing that hinzukommen is mostly tagged as VVINF but once as VVIZU, whereas for the other two word forms it is the other way around." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-73", "text": "This example is interesting if one is looking for annotation errors in the TiGer treebank, because the two part-of-speech tags should have a complementary distribution (a German verb either incorporates zu or it does not)." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-74", "text": "Double clicking on the word forms in the list in the lower left corner will again open up a tab that shows all sentences containing this word form, regardless of their part-of-speech tag." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-75", "text": "The user may then inspect the sentences and decide whether the annotations are erroneous or not." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-76", "text": "If the user wants to see a specific combination, which is more useful if the total number of sentences is large, she can also click on one of the bars in the chart to get all sentences matching that combination." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-77", "text": "In the example, the one instance of hinzukommen being tagged as VVIZU is incorrect, 4 and the instances of the two other verbs tagged as VVINF are as well." 
}, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-78", "text": "----------------------------------" }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-79", "text": "**DEPENDENCY ANNOTATION ERRORS**" }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-80", "text": "As mentioned before, the tool also allows the user to search for errors in dependency structures." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-81", "text": "The error mining algorithm for dependency structures (Boyd et al., 2008) is very similar to the one for part-of-speech tags, and so is the interface to the n-gram list or the distribution view." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-82", "text": "Dependency edges are therein displayed as triples: the head, the dependent, and the edge label with the edge's direction." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-83", "text": "As with the part-of-speech tags, the user can always jump directly to the sentences that contain a particular n-gram or dependency relation." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-84", "text": "----------------------------------" }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-85", "text": "**ERROR DETECTION ON TIGER**" }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-86", "text": "We ran the error mining algorithm for part-ofspeech on the German TiGer Treebank (the dependency version by Seeker and Kuhn (2012) ) and manually evaluated a small sample of n-grams in order to get an idea of how useful the output is." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-87", "text": "We manually checked 115 out of the 207 variation 6-grams found by the tool, which amounts to 119 different nuclei." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-88", "text": "For 99.16% of these nuclei, we found erroneous annotations in the associated sentences." 
}, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-89", "text": "95.6% of these are errors where we are able to decide what the right tag should be, the remaining ones are more difficult to disambiguate because the annotation guidelines do not cover them." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-90", "text": "These results are in line with findings by Dickinson and Meurers (2003a) for the Penn Treebank." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-91", "text": "They show that even manually annotated corpora contain errors and an automatic error mining tool can be a big help in finding them." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-92", "text": "Furthermore, it can help annotators to improve their annotation guidelines by pointing out phenomena that are not covered by the guidelines, because these phenomena will be more likely to show variation." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-93", "text": "----------------------------------" }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-94", "text": "**RELATED WORK**" }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-95", "text": "We are aware of only one other graphical tool that was developed to help with error detection in treebanks: Ambati et al. (2010) and Agarwal et al. (2012) describe a graphical tool that was used in the annotation of the Hindi Dependency Treebank." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-96", "text": "To find errors, it uses a statistical and a rule-based component." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-97", "text": "The statistical component is recalloriented and learns a MaxEnt model, which is used to flag dependency edges as errors if their probability falls below a predefined threshold." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-98", "text": "In order to increase the precision, the output is postprocessed by the rule-based component, which is tailored to the treebank's annotation guidelines." 
}, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-99", "text": "Errors are presented to the annotators in tables, also with the option to go to the sentences directly from there." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-100", "text": "Unlike the algorithm we implemented, this approach needs annotated training data for training the classifier and tuning the respective thresholds." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-101", "text": "----------------------------------" }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-102", "text": "**CONCLUSION**" }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-103", "text": "High-quality annotations for linguistic corpora are important for testing hypotheses in NLP and linguistic research." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-104", "text": "Automatically marking potential annotation errors and inconsistencies are one way of supporting annotators in their work." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-105", "text": "We presented a tool that provides a graphical interface for annotators to find and evaluate annotation errors in treebanks." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-106", "text": "It implements the error detection algorithms by Dickinson and Meurers (2003a) and Boyd et al. (2008) ." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-107", "text": "The user can view errors from two perspectives that aggregate error information found by the algorithm, and it is always easy to go directly to the actual sentences for manual inspection." }, { "sent_id": "6cb86d91918743b0e4ff27e9d2351b-C001-108", "text": "The tool is currently extended such that annotators can make changes to the data directly in the interface when they find an error." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "6cb86d91918743b0e4ff27e9d2351b-C001-11" ], [ "6cb86d91918743b0e4ff27e9d2351b-C001-27" ] ], "cite_sentences": [ "6cb86d91918743b0e4ff27e9d2351b-C001-11", "6cb86d91918743b0e4ff27e9d2351b-C001-27" ] }, "@EXT@": { "gold_contexts": [ [ "6cb86d91918743b0e4ff27e9d2351b-C001-63" ] ], "cite_sentences": [ "6cb86d91918743b0e4ff27e9d2351b-C001-63" ] }, "@SIM@": { "gold_contexts": [ [ "6cb86d91918743b0e4ff27e9d2351b-C001-90" ] ], "cite_sentences": [ "6cb86d91918743b0e4ff27e9d2351b-C001-90" ] }, "@USE@": { "gold_contexts": [ [ "6cb86d91918743b0e4ff27e9d2351b-C001-106" ] ], "cite_sentences": [ "6cb86d91918743b0e4ff27e9d2351b-C001-106" ] } } }, "ABC_cd7bb4543828f915bc930841bb8d7c_42": { "x": [ { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-2", "text": "We train a char2char model on the E2E NLG Challenge data, by exploiting \"out-of-the-box\" the recently released tfseq2seq framework, using some of the standard options of this tool." }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-3", "text": "With minimal effort, and in particular without delexicalization, tokenization or lowercasing, the obtained raw predictions, according to a small scale human evaluation, are excellent on the linguistic side and quite reasonable on the adequacy side, the primary downside being the possible omissions of semantic material." }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-4", "text": "However, in a significant number of cases (more than 70%), a perfect solution can be found in the top-20 predictions, indicating promising directions for solving the remaining issues." 
}, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-5", "text": "----------------------------------" }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-7", "text": "Very recently, researchers (Novikova et al., 2017) at Heriot-Watt University proposed the E2E NLG Challenge 1 and released a dataset consisting of 50K (MR, RF) pairs, MR being a slot-value Meaning Representation of a restaurant, RF (human ReFerence) being a natural language utterance rendering of that representation." }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-8", "text": "The utterances were crowd-sourced based on pictorial representations of the MRs, with the intention of producing more natural and diverse utterances compared to the ones directly based on the original MRs (Novikova et al., 2016) ." }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-9", "text": "Most of the RNN-based approaches to Natural Language Generation (NLG) that we are aware of, starting with (Wen et al., 2015) , generate the output word-by-word, and resort to special delexicalization or copy mechanisms (Gu et al., 2016) to handle rare or unknown words, for instance restaurant names or telephone numbers." }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-10", "text": "One exception is (Goyal et al., 2016) , who employed a char-based seq2seq model where the input MR is simply represented as a character sequence, and the output is also generated char-by-char; this approach avoids the rare word problem, as the character vocabulary is very small." }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-11", "text": "While (Goyal et al., 2016) used an additional finite-state mechanism to guide the production of well-formed (and input-motivated) character sequences, the performance of their basic char2char model was already quite good." 
}, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-12", "text": "We further explore how a recent out-of-the box seq2seq model would perform on E2E NLG Challenge, when used in a char-based mode." }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-13", "text": "We choose attention-based tfseq2seq framework provided by authors of (Britz et al., 2017 ) (which we detail in next section)." }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-14", "text": "Using some standard options provided by this framework, and without any pre-or postprocessing (not even tokenization or lowercasing), we obtained results on which we conducted a small-scale human evaluation on one hundred MRs, involving two evaluators." }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-15", "text": "This evaluation, on the one hand, concentrated on the linguistic quality, and on the other hand, on the semantic adequacy of the produced utterances." }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-16", "text": "On the linguistic side, vast majority of the predictions were surprisingly grammatically perfect, while still being rather diverse and natural." }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-17", "text": "In particular, and contrary to the findings of (Goyal et al., 2016 ) (on a different dataset), our char-based model never produced non-words." }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-18", "text": "On the adequacy side, we found that the only serious problem was the tendency (in about half of the evaluated cases) of the model to omit to render one (rarely two) slot(s); on the other end, it never hallucinated, and very rarely duplicated, material." 
}, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-19", "text": "To try and assess the potential value of a simple re-ranking technique (which we did not implement at this stage, but the approach of (Wen et al., 2015) and more recently the \"inverted generation\" technique of (Chisholm et al., 2017) could be used), we generated (using the beam-search option of the framework) 20-best utterances for each MR, which the evaluators scanned towards finding an \"oracle\", i.e. a generated utterance considered as perfect not only from the grammatical but also from the adequacy viewpoint." }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-20", "text": "An oracle was found in the first position in around 50% of the case, otherwise among the 20 positions in around 20% of the cases, and not at all inside this list in the remaining 30% cases." }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-21", "text": "On the basis of these experiments and evaluations we believe that there remains only a modest gap towards a very reasonable NLG seq2seq model for the E2E NLG dataset." }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-22", "text": "----------------------------------" }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-23", "text": "**MODEL**" }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-24", "text": "Our model is a direct use of the seq2seq opensource software framework 2 , built over TensorFlow (Abadi et al., 2016) , and provided along with (Britz et al., 2017) , with some standard configuration options that will be detailed in section 3." }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-25", "text": "While in their large-scale NMT experiments (Britz et al., 2017) use word-based sequences, in our case we use character-based ones." }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-26", "text": "This simply involves changing \"delimiter\" option in configuration files." 
}, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-27", "text": "(Britz et al., 2017) (drawing borrowed from that paper)." }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-28", "text": "Contrary to word-based sequences, we use character-based sequences for generating grammatically correct and natural utterances." }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-29", "text": "(Britz et al., 2017) , provides an overview of the framework." }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-30", "text": "While many options are configurable (number of layers, unidirectional vs bidirectional encoder, additive vs multiplicative attention mechanism, GRU (Cho et al., 2014) vs LSTM cells (Hochreiter and Schmidhuber, 1997) , etc.), the core architecture is common to all models." }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-31", "text": "This is by now a pretty standard attention-based encoder-decoder archi-tecture based on (Bahdanau et al., 2015; Luong et al., 2015) ." }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-32", "text": "The encoder RNN embeds each of the source words (in our case, characters) into vectors exploiting the hidden states computed by the RNN." }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-33", "text": "The decoder RNN predicts the next word (resp. character) based on its current hidden state, previous character, and also based on the \"context\" vector c i , which is an attention-based weighted average of the embeddings of the source words (resp. characters)." }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-34", "text": "----------------------------------" }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-35", "text": "**EXPERIMENTS**" }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-36", "text": "3.1 Dataset (Novikova et al., 2016) explain the protocol followed for crowdsourcing the E2E NLG Challenge dataset." 
}, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-37", "text": "Slightly different from the description in the article, there are two additional slots in the dataset: 'kidsFriendly' and 'children-friendly' which seem to be alternates for 'familyFriendly'." }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-38", "text": "Thus, there are in total 10 slots (in decreasing order of frequency of being mentioned in the dataset MRs): name (100%), food (83%), customer rating (68%), priceRange (68%), area (60%), eatType (51%), near (50%), familyFriendly (25%), kidsFriendly (19%), children-friendly (19%)." }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-39", "text": "Also, the number of active slots in the MRs varies as: 3 (5%), 4 (17%), 5 (19%), 6 (19%), 7 (16%), 8 (4%)." }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-40", "text": "----------------------------------" }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-41", "text": "**IMPLEMENTATION**" }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-42", "text": "The tf-seq2seq toolkit (Britz et al., 2017) trains on pairs of sequences presented in parallel text format (separate source and target sequence files)." }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-43", "text": "3 4 Taking cue from recommended configurations in Table 7 of (Britz et al., 2017) and the provided example configs in tf-seq2seq, we experimented with different numbers of layers in the encoder and decoder as well as different beam widths, while using the bi-directional encoder along with \"additive\" attention mechanism." }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-44", "text": "As also observed by Britz et al. (2017) , using a non-null \"lengthpenalty\" (alias length normalization (Wu et al., 2016) ), significantly improved decoding results." 
}, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-45", "text": "----------------------------------" }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-46", "text": "**RESULTS**" }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-47", "text": "We report the BLEU scores 5 for different configurations of the seq2seq model in Table 1 ." }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-48", "text": "In our initial experiments, using a beam-width 5 (with no length penalty), with 4 layers in both the encoder and decoder and GRU cells, showed the best results in terms of BLEU (score of 24.94)." }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-49", "text": "We observed significant improvements using length penalty 1, and decided to use this architecture as a basis for human evaluations, with a beam-width 20 to facilitate the observation of oracles." }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-50", "text": "These evaluations were thus conducted on model [encoder 4 layers, decoder 4 layers, GRU cell, beam-width 20, length penalty 1] (starred in Table 1 ), though we found slightly better performing models in terms of BLEU at a later stage." }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-51", "text": "----------------------------------" }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-52", "text": "**EVALUATION**" }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-53", "text": "The human evaluations were performed by two annotators on the top 20 predictions of the previously discussed model, for the first 100 MRs of the devset, using the following metrics:" }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-54", "text": "1. Semantic Adequacy a) Omission [1/0]: information present in the MR that is omitted in the predicted utterance (1=No omission, 0=Omission)." }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-55", "text": "b) Addition [1/0]: information in the predicted utterance that is absent in the MR (1=No addition, 0=Addition)." 
}, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-56", "text": "c) Repetition [1/0]: repeated information in the predicted utterance 5 Calculated using multi-bleu perl script bundled with tfseq2seq." }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-57", "text": "Note that these results were computed on the original version of Challenge devset (updated recently) which did not group the references associated with the same MR, possibly resulting in lower scores than when exploiting multi-refs." }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-58", "text": "('vsRef' in the Table 2 , 1=Prediction better than RF, 0=Prediction at par with RF, -1=RF better than prediction)." }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-59", "text": "3. Oracle [1/0/-1]: 1 if the first prediction is an \"oracle\" (i.e. considered as perfect, see section 1), 0 when the oracle is found in the top 20, and -1 when no oracle is found there." }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-60", "text": "----------------------------------" }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-61", "text": "**ANALYSIS**" }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-62", "text": "We show a few examples of utterances (predictions in first position, i.e. most probable) produced by our model, for discussion." }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-63", "text": "6" }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-64", "text": "1. Among the utterances produced by the model in first position (Pred), the most prominent issue was that of omissions (underlined in example 2)." }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-65", "text": "There were no additions or non-words (which was one of the primary concerns for (Goyal et al., 2016) )." 
}, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-66", "text": "We observed only a couple of repetitions which were actually accompanied by omission of some slot(s) in the same utterance (repetition highlighted in bold in example 3)." }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-67", "text": "Surprisingly enough, we observed a similar issue of omissions in human references (target for our model)." }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-68", "text": "We then decided to perform comparisons against the human reference ('vsRef' in Table 2 )." }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-69", "text": "Often, the predictions were found to be semantically or grammatically better than the human reference; for example observe the underlined portion of the reference in the first example." }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-70", "text": "The two annotators independently found the predictions to be mostly grammatically correct as well as natural (to a slighty lesser extent)." }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-71", "text": "7 A general feeling of the annotators was that the predictions, while showing a significant amount of linguistic diversity and naturalness, had a tendency to respect grammatical constraints better than the references; the crowdsourcers tended to strive for creativity, sometimes not supported by evidence in the MR, and often with little concern for linguistic quality; it may be conjectured that the seq2seq model, by \"averaging\" over many linguistically diverse and sometimes incorrect training examples, was still able to learn what amounts to a reasonable linguistic model for its predictions." 
}, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-72", "text": "We also investigate whether we could find an 'oracle' (perfect solution as defined in section 1) in the top-20 predictions and observed that in around 70% of our examples the oracle could be found in the top results (see Table 3 ), very often (51%) at the first position." }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-73", "text": "In the rest 30% of the cases, even the top-20 predictions did not contain an oracle." }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-74", "text": "We found that the presence of an oracle was dependent on the number of slots in the MR." }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-75", "text": "When the number of slots was 7 or 8, the presence of an oracle in the top predictions decreased significantly to approximately 40%." }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-76", "text": "In contrast, with 4 slots, our model predicted an oracle right at the first place for 83% of the cases." }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-77", "text": "----------------------------------" }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-78", "text": "**CONCLUSION**" }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-79", "text": "We employed the open source tf-seq2seq framework for training a char2char model on the E2E NLG Challenge data." }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-80", "text": "This could be done with minimal effort, without requiring delexicalization, lowercasing or even tokenization, by exploiting standard options provided with the framework." }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-81", "text": "Human annotators found the predictions to have great linguistic quality, somewhat to our surprise, but also confirming the observations in (Karpathy, 2015) ." 
}, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-82", "text": "On the adequacy side, omissions were the major drawback; no hallucinations were observed and only very few instances of repetition." }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-83", "text": "We hope our results and annotations can help understand the dataset and issues better, while also being useful for researchers working on the challenge." }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-84", "text": "----------------------------------" }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-85", "text": "**PRED**" }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-86", "text": "The Eagle is a kid friendly Japanese coffee shop in the riverside area near Burger King." }, { "sent_id": "cd7bb4543828f915bc930841bb8d7c-C001-87", "text": "It has a moderate price range and a customer rating of 1 out of 5." } ], "y": { "@BACK@": { "gold_contexts": [ [ "cd7bb4543828f915bc930841bb8d7c-C001-10", "cd7bb4543828f915bc930841bb8d7c-C001-11" ], [ "cd7bb4543828f915bc930841bb8d7c-C001-65" ] ], "cite_sentences": [ "cd7bb4543828f915bc930841bb8d7c-C001-10", "cd7bb4543828f915bc930841bb8d7c-C001-11", "cd7bb4543828f915bc930841bb8d7c-C001-65" ] }, "@EXT@": { "gold_contexts": [ [ "cd7bb4543828f915bc930841bb8d7c-C001-10", "cd7bb4543828f915bc930841bb8d7c-C001-11", "cd7bb4543828f915bc930841bb8d7c-C001-12" ] ], "cite_sentences": [ "cd7bb4543828f915bc930841bb8d7c-C001-10", "cd7bb4543828f915bc930841bb8d7c-C001-11" ] }, "@DIF@": { "gold_contexts": [ [ "cd7bb4543828f915bc930841bb8d7c-C001-17" ] ], "cite_sentences": [ "cd7bb4543828f915bc930841bb8d7c-C001-17" ] } } }, "ABC_6d8612cfb4bf05322fed1c02f4885a_42": { "x": [ { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-76", "text": "These display an increase in readability (Figure 1a ) starting in the early hours of the morning, peaking at 10AM and then decreasing constantly throughout the day, which is in accordance with the mood swings reported by Golder and Macy (2011)." 
}, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-77", "text": "The proportion of pronouns (Figure 1b) and interjections (Figure 1c ) follows the exact opposite pattern, with a peak in frequency during nights." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-78", "text": "This suggests that the language gets more contextual (Heylighen and Dewaele, 2002) towards the end of the day." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-2", "text": "Writing style allows NLP tools to adjust to the traits of an author." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-3", "text": "In this paper, we explore the relation between stylistic and syntactic features and authors' age and income." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-4", "text": "We confirm our hypothesis that for numerous feature types writing style is predictive of income even beyond age." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-5", "text": "We analyze the predictive power of writing style features in a regression task on two data sets of around 5,000 Twitter users each." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-6", "text": "Additionally, we use our validated features to study daily variations in writing style of users from distinct income groups." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-7", "text": "Temporal stylistic patterns not only provide novel psychological insight into user behavior, but are useful for future research and applications in social media." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-8", "text": "----------------------------------" }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-10", "text": "The widespread use of social media enables researchers to examine human behavior at a scale hardly imaginable before." 
}, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-11", "text": "Research in text profiling has recently shown that a diverse set of user traits is predictable from language use." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-12", "text": "Examples range from demographics such as age (Rao et al., 2010) , gender (Burger et al., 2011; Bamman et al., 2014) , popularity (Lampos et al., 2014) , occupation (Preo\u0163iuc-Pietro et al., 2015a) and location (Eisenstein et al., 2010) to psychological traits such as personality (Schwartz et al., 2013) or mental illness (De Choudhury et al., 2013) and their interplay (Preotiuc-Pietro et al., 2015) ." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-13", "text": "To a large extent, the prominent differences captured by text are topical: adolescents post more about school, females about relationships (Sap et al., 2014) and sport fans about their local team (Cheng et al., * Project carried out during a research stay at the University of Pennsylvania 2010)." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-14", "text": "Writing style and readability offer a different insight into who the authors are." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-15", "text": "This can help applications such as cross-lingual adaptations without direct translation, for text simplification closely matching the reader's age, level of education and income or tailored to the specific moment the document is presented." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-16", "text": "Recently, Hovy and S\u00f8gaard (2015) have shown that the age of the authors should be taken into account when building and using part-of-speech taggers." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-17", "text": "Likewise, socioeconomic factors have been found to influence language use (Labov, 2006) ." 
}, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-18", "text": "Understanding these biases and their underlying factors in detail is important to develop NLP tools without sociodemographic bias." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-19", "text": "Writing style measures have initially been created to be applied at the document level, where they are often used to assess the quality of a document (Louis and Nenkova, 2013) or a summarization (Louis and Nenkova, 2014) , or even to predict the success of a novel (Ashok et al., 2013 )." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-20", "text": "In contrast to these document-level studies, we adopt a user-centric approach to measuring stylistic differences." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-21", "text": "We examine writing style of users on Twitter in relation to their age and income." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-22", "text": "Both attributes should be closely related to writing style: users of older age write on average more standard-conform (up to a certain point), and higher income is an indicator of education and conscientiousness (Judge et al., 1999) , which determines writing style." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-23", "text": "Indeed, many features that aim to measure the complexity of the language use have been developed in order to study human cognitive abilities, e.g., cognitive decline (Boy\u00e9 et al., 2014; Le et al., 2011) ." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-24", "text": "The relationship between age and language has been extensively studied by psychologists, and more recently by computational linguists in various corpora, including social media." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-25", "text": "Pennebaker et al. (2003) connect language use with style and personality, while Schler et al. 
(2006) automatically classified blog text into three classes based on self-reported age using part-of-speech features." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-26", "text": "Johannsen et al. (2015) uncover some consistent age patterns in part-of-speech usage across languages, while Rosenthal and McKeown (2011) study the use of Internet-specific phenomena such as slang, acronyms and capitalisation patterns." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-27", "text": "Preo\u0163iuc-Pietro et al. (2016) study differences in paraphrase choice between older and younger Twitter users as a measure of style." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-28", "text": "Nguyen et al. (2013) analyzed the relationship between language use and age, modelled as a continuous variable." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-29", "text": "They found similar language usage trends for both genders, with increasing word and tweet length with age, and an increasing tendency to write more grammatically correct, standardized text." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-30", "text": "Such findings encourage further research in the area of measuring readability, which not only facilitates adjusting the text to the reader (Danescu-Niculescu-Mizil et al., 2011), but can also play an important role in identifying authorial style (Pitler and Nenkova, 2008)." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-31", "text": "Davenport and DeLine (2014) report a negative correlation between tweet readability (i.e., simplicity) and the percentage of people with a college degree in the area." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-32", "text": "Eisenstein et al. (2011) employ language use as a socio-demographic predictor." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-33", "text": "In this paper we analyze two data sets of millions of tweets produced by thousands of users annotated with their age and income." 
}, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-34", "text": "We define a set of features ranging from readability and style to syntactic features." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-35", "text": "We use both linear and non-linear machine learning regression methods to predict and analyze user income and age." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-36", "text": "We show that writing style measures give large correlations with both age and income, and that writing style is predictive of income even beyond age." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-37", "text": "Finally, Twitter data allows the unique possibility to study the variation in writing with time." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-38", "text": "We explore the effects of time of day in user behavior dependent in part on the socio-demographic group." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-39", "text": "----------------------------------" }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-40", "text": "**DATA**" }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-41", "text": "We study two large data sets of tweets." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-42", "text": "Each data set consists of users and their historical record of tweet content, profile information and trait level features extracted with high precision from their profile information." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-43", "text": "All data was tokenized using the Trendminer pipeline (Preo\u0163iuc-Pietro et al., 2012) , @-mentions and URL's collapsed, automatically filtered for English using the langid.py tool (Lui and Baldwin, 2012) and part-of-speech tagged using the ArkTweet POS tagger (Gimpel et al., 2011) ." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-44", "text": "Income (D 1 ) First, we use a large data set consisting of 5,191 Twitter users mapped to their income through their occupational class." 
}, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-45", "text": "This data set, introduced in (Preo\u0163iuc-Pietro et al., 2015a; Preo\u0163iuc-Pietro et al., 2015b) , relies on a standardised job classification taxonomy (the UK Standard Occupational Classification) to extract job-related keywords, search user profile fields for users having those jobs and map them to their mean UK income, independently of user location." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-46", "text": "The final data set consists of 10,796,836 tweets." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-47", "text": "Age (D 2 ) The age data set consists of 4,279 users mapped to their age from (Volkova and Bachrach, 2015) ." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-48", "text": "The final data set consists of 574,095 tweets." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-49", "text": "----------------------------------" }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-50", "text": "**FEATURES**" }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-51", "text": "We use a variety of features to capture the language behavior of a user." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-52", "text": "We group these features into:" }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-53", "text": "Surface We measure the length of tweets in words and characters, and the length of words." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-54", "text": "As shorter words are considered more readable (Gunning, 1969; Pitler and Nenkova, 2008) , we also measure the ratio of words longer than five letters." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-79", "text": "Finally, named entities ( Figure 1d ) display a very distinctive pattern, with a constant increase starting mornings, which increases throughout the day." 
}, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-80", "text": "While the first three patterns mirror the active parts of the day, coinciding with regular working hours, the latter pattern is possibly associated with mentions of venues or news." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-81", "text": "An increase in usage of named entities in the evening is steeper for low-income users -we hypothesize that this phenomenon could be reasoned by a stronger association of named entities with leisure in this user group." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-82", "text": "Overall, we notice a similarity between income groups, which, despite strongly separated, follow similar -perhaps universal -patterns." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-83", "text": "----------------------------------" }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-84", "text": "**ANALYSIS**" }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-85", "text": "We view age and income as continuous variables and model them in a regression setup." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-86", "text": "This is in contrast to most previous studies on age as a categorical variable (Rangel et al., 2014) to allow for finer grained predictions useful for downstream applications which use exact values of user traits, as opposed to being limited to broad classes such as young vs. old." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-87", "text": "We apply linear regression with Elastic Net regularization (Zou and Hastie, 2005) and support vector regression with an RBF kernel (as a non-linear counterpart) for comparison (Vapnik, 1998) ." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-88", "text": "We report Pearson correlation results on 10-fold cross-validation." 
}, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-89", "text": "We also study if our features are predictive of income above age, by controlling for age assigned by a state-of-the-art model trained on social media data (Sap et al., 2014) ." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-90", "text": "Similar results have been obtained with log-scaling the income variable." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-91", "text": "Table 1 presents our prediction results." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-55", "text": "We further calculate the type-token ratio per user, which indicates the lexical density of text and is considered to be a readability predictor (Oakland and Lane, 2004 )." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-56", "text": "Additionally we capture the number of positive and negative smileys in the tweet and the number of URLs." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-57", "text": "Readability After filtering tweets to contain only words, we use the most prominent readability measures per user: the Automatic Readability Index (Senter and Smith, 1967) , the FleschKincaid Grade Level (Kincaid et al., 1975) , the Coleman-Liau Index (Coleman and Liau, 1975) , the Flesch Reading Ease (Flesch, 1948) , the LIX Index (Anderson, 1983) , the SMOG grade (McLaughlin, 1969 ) and the Gunning-Fog Index (Gunning, 1969) ." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-58", "text": "The majority of those are computed using the average word and sentence lengths and number of syllables per sentence, combined with weights." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-59", "text": "Syntax Researchers argue about longer sentences not necessarily being more complex in terms of syntax (Feng et al., 2009; Pitler and Nenkova, 2008) ." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-60", "text": "However, advanced sentence parsing on Twitter remains a challenging task." 
}, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-61", "text": "We thus limit ourselves in this study to the part-of-speech (POS) information." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-62", "text": "In previous work on writing style (Pennebaker et al., 2003; Argamon et al., 2009; Rangel et al., 2014) , a text with more nouns and articles as opposed to pronouns and adverbs is considered more formal." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-63", "text": "We thus measure the ratio of each POS using the universal tagset (Petrov et al., 2012) ." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-64", "text": "Style We implemented a contextuality measure, based on the work of Heylighen and Dewaele (2002) , which assesses explicitness of the text based on the POS used and serves as a proxy for formality." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-65", "text": "Using Stanford Named Entity Recognizer (Finkel et al., 2005) , we measure the proportion of named entities (3-classed) to words, as their presence potentially decreases readability (Beinborn et al., 2012) , and netspeak aspects such as the proportion of elongations (wooow) and words with numbers (good n8)." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-66", "text": "We quantify the number of hedges (Hyland, 2005) and abstract words 1 used, and the ratio of standalone numbers stated per user as these are indicators of specificity (Pennebaker et al., 2003; Pitler and Nenkova, 2008) ." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-67", "text": "We also capture the ratio of hapax legomena, and of superlatives and plurals using Stanford POS Tagger 1 www.englishbanana.com (Toutanova et al., 2003 ) using the Twitter model." 
}, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-68", "text": "----------------------------------" }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-69", "text": "**TEMPORAL PATTERNS IN STYLE**" }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-70", "text": "Social media data offers the opportunity to interpret the features in a richer context, including time or space." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-71", "text": "In our income data set, a timestamp is available for each message." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-72", "text": "Golder and Macy (2011) showed user-level diurnal and seasonal patterns of mood across the world using Twitter data, suggesting that individuals awaken in a good mood that deteriorates as the day progresses." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-73", "text": "In this work we explore user-level daily temporal trends in style for the 1500 highest-and 1500 lowest-income users (mean income \u2265 \u00a335,000 vs mean income \u2264 \u00a325,000)." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-74", "text": "In Figure 1 we present normalized temporal patterns for a selected set of features." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-75", "text": "While the difference between groups is most striking, we also observe some consistent daily patterns." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-92", "text": "The strength of the correlation to the income and age, together with the sign of the correlation coefficient, are visually displayed in Figure 2 ." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-93", "text": "As expected, all features correlate with age and income in the same direction." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-94", "text": "However, some features and groups are more predictive of one or the other (depicted above or below the principal diagonal in Figure 2 )." 
}, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-95", "text": "Most individual surface features correlate with age stronger than with income, with the exception of punctuation and, especially, words longer than 5 characters." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-96", "text": "The correlation of each readability measure is remarkably stronger with high income than with age, despite the fact Numbers in bold represent the highest correlations from the specific block of features and data set." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-97", "text": "All correlations are significant on p < 0.001 level except for those in brackets." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-98", "text": "these are to a large extent based on the surface features." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-99", "text": "Notably, Flesch Reading Ease -previously reported to correlate with education levels at a community level (Davenport and DeLine, 2014) and with the usage of pronouns (\u0160tajner et al., 2012) is highly indicative for income." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-100", "text": "On the syntactic level we observe that increased use of nouns, determiners and adjectives is correlated higher with age as opposed to income, while a high ratio of pronouns and interjections is a good predictor of lower income but, only to a lesser extent, younger age, with which it is traditionally associated (Schler et al., 2006) ." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-101", "text": "From the stylistic features, the contextuality measure stands out as being correlated with increase in age, in line with Heylighen and De- waele (2002), but is almost orthogonal to income." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-102", "text": "Similarly, the frequency of named entities is correlated with higher income, while elongations have stronger association with younger age." 
}, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-103", "text": "Our results show, that based on the desired application, one can exploit these differences to tailor the style of a document without altering the topic to suit either age or income individually." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-104", "text": "----------------------------------" }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-105", "text": "**CONCLUSIONS AND FUTURE WORK**" }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-106", "text": "Using two large data sets from thousands of users, annotated with their age and income, we presented the first study which analyzes these variables jointly, in relation to writing style." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-107", "text": "We have shown that the stylistic measures not only obtain significant correlations with both age and income, but are predictive of income beyond age." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-108", "text": "Moreover, we explored temporal patterns in user behavior on Twitter, discovering intriguing trends in writing style." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-109", "text": "While the discovery of these patterns provides useful psychosocial insight, it additionally hints to future research and applications that piggyback on author profiling in social media e.g., taking the message timestamp into account for stylistic features may yield improved results in user sociodemographic predictions." }, { "sent_id": "6d8612cfb4bf05322fed1c02f4885a-C001-110", "text": "Likewise, utilizing additional proxies to control for income and education may lead to improvements in user age prediction." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "6d8612cfb4bf05322fed1c02f4885a-C001-30" ], [ "6d8612cfb4bf05322fed1c02f4885a-C001-54" ], [ "6d8612cfb4bf05322fed1c02f4885a-C001-59" ], [ "6d8612cfb4bf05322fed1c02f4885a-C001-66" ] ], "cite_sentences": [ "6d8612cfb4bf05322fed1c02f4885a-C001-30", "6d8612cfb4bf05322fed1c02f4885a-C001-54", "6d8612cfb4bf05322fed1c02f4885a-C001-59", "6d8612cfb4bf05322fed1c02f4885a-C001-66" ] }, "@USE@": { "gold_contexts": [ [ "6d8612cfb4bf05322fed1c02f4885a-C001-54" ], [ "6d8612cfb4bf05322fed1c02f4885a-C001-66" ] ], "cite_sentences": [ "6d8612cfb4bf05322fed1c02f4885a-C001-54", "6d8612cfb4bf05322fed1c02f4885a-C001-66" ] } } }, "ABC_53cd860c539e20874ada4343ab788f_42": { "x": [ { "sent_id": "53cd860c539e20874ada4343ab788f-C001-54", "text": "----------------------------------" }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-55", "text": "**EXPERIMENTS**" }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-56", "text": "----------------------------------" }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-57", "text": "**MULTILINGUAL DATA AND TRAINING REGIMEN**" }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-2", "text": "One central mystery of neural NLP is what neural models \"know\" about their subject matter." }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-3", "text": "When a neural machine translation system learns to translate from one language to another, does it learn the syntax or semantics of the languages?" }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-4", "text": "Can this knowledge be extracted from the system to fill holes in human scientific knowledge?" }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-5", "text": "Existing typological databases contain relatively full feature specifications for only a few hundred languages." 
}, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-6", "text": "Exploiting the existence of parallel texts in more than a thousand languages, we build a massive many-to-one neural machine translation (NMT) system from 1017 languages into English, and use this to predict information missing from typological databases." }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-7", "text": "Experiments show that the proposed method is able to infer not only syntactic, but also phonological and phonetic inventory features, and improves over a baseline that has access to information about the languages' geographic and phylogenetic neighbors." }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-8", "text": "1" }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-9", "text": "----------------------------------" }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-11", "text": "Linguistic typology is the classification of human languages according to syntactic, phonological, and other classes of features, and the investigation of the relationships and correlations between these classes/features." }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-12", "text": "This study has been a scientific pursuit in its own right since the 19th century (Greenberg, 1963; Comrie, 1989; Nichols, 1992) , but recently typology has borne practical fruit within various subfields of NLP, particularly on problems involving lower-resource languages." 
}, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-13", "text": "(Dryer and Haspelmath, 2013) , has proven useful in many NLP tasks (O'Horan et al., 2016) , such as multilingual dependency parsing (Ammar et al., 2016) , generative parsing in low-resource settings (Naseem et al., 2012; T\u00e4ckstr\u00f6m et al., 2013) , phonological language modeling and loanword prediction (Tsvetkov et al., 2016) , POStagging (Zhang et al., 2012) , and machine translation (Daiber et al., 2016) ." }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-14", "text": "However, the needs of NLP tasks differ in many ways from the needs of scientific typology, and typological databases are often only sparsely populated, by necessity or by design." }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-53", "text": "4" }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-15", "text": "2 In NLP, on the other hand, what is important is having a relatively full set of features for the particular group of languages you are working on." }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-16", "text": "This mismatch of needs has motivated various proposals to reconstruct missing entries, in WALS and other databases, from known entries (Daum\u00e9 III and Campbell, 2007; Daum\u00e9 III, 2009; Coke et al., 2016; Littell et al., 2017) ." }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-17", "text": "In this study, we examine whether we can tackle the problem of inferring linguistic typology from parallel corpora, specifically by training a massively multi-lingual neural machine translation (NMT) system and using the learned representations to infer typological features for each language." 
}, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-18", "text": "This is motivated both by prior work in linguistics (Bugarski, 1991; Garc\u00eda, 2002) demonstrating strong links between translation studies and tools for contrastive linguistic analysis, work in inferring typology from bilingual data (\u00d6stling, 2015) and English as Second Language texts (Berzak et al., 2014) , as well as work in NLP (Shi et al., 2016; Kuncoro et al., 2017; Belinkov et al., 2017) showing that syntactic knowledge can be extracted from neural nets on the word-by-word or sentence-by-sentence level." }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-19", "text": "This work presents a more holistic analysis of whether we can discover what neural networks learn about the linguistic concepts of an entire language by aggregating their representations over a large number of the sentences in the language." }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-20", "text": "We examine several methods for discovering feature vectors for typology prediction, including those learning a language vector specifying the language while training multilingual neural language models (\u00d6stling and Tiedemann, 2017) or neural machine translation (Johnson et al., 2016) systems." }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-21", "text": "We further propose a novel method for aggregating the values of the latent state of the encoder neural network to a single vector representing the entire language." }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-22", "text": "We calculate these feature vectors using an NMT model trained on 1017 languages, and use them for typlogy prediction both on their own and in composite with feature vectors from previous work based on the genetic and geographic distance between languages (Littell et al., 2017) ." 
}, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-23", "text": "Results show that the extracted representations do in fact allow us to learn about the typology of languages, with particular gains for syntactic features like word order and the presence of case markers." }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-24", "text": "----------------------------------" }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-25", "text": "**DATASET AND EXPERIMENTAL SETUP**" }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-26", "text": "Typology Database: To perform our analysis, we use the URIEL language typology database (Littell et al., 2017) , which is a collection of binary features extracted from multiple typological, phylogenetic, and geographical databases such as WALS (World Atlas of Language Structures) (Collins and Kayne, 2011) , PHOIBLE (Moran et al., 2014) , Ethnologue (Lewis et al., 2015) , and Glottolog (Hammarstr\u00f6m et al., 2015) ." }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-27", "text": "These features are divided into separate classes regarding syntax (e.g. whether a language has prepositions or postpositions), phonology (e.g. whether a language has complex syllabic onset clusters), and phonetic inventory (e.g. whether a language has interdental fricatives)." }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-28", "text": "There are 103 syntactical features, 28 phonology features and 158 phonetic inventory features in the database." }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-29", "text": "Baseline Feature Vectors: Several previous methods take advantage of typological implicature, the fact that some typological traits correlate strongly with others, to use known features of a language to help infer other unknown features of the language (Daum\u00e9 III and Campbell, 2007; Takamura et al., 2016; Coke et al., 2016) ." 
}, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-30", "text": "As an alternative that does not necessarily require pre-existing knowledge of the typological features in the language at hand, Littell et al. (2017) have proposed a method for inferring typological features directly from the language's k nearest neighbors (k-NN) according to geodesic distance (distance on the Earth's surface) and genetic distance (distance according to a phylogenetic family tree)." }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-31", "text": "In our experiments, our baseline uses this method by taking the 3-NN for each language according to normalized geodesic+genetic distance, and calculating an average feature vector of these three neighbors." }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-32", "text": "Typology Prediction: To perform prediction, we trained a logistic regression classifier 3 with the baseline k-NN feature vectors described above and the proposed NMT feature vectors described in the next section." }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-33", "text": "We train individual classifiers for predicting each typological feature in a class (syntax etc)." }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-34", "text": "We performed 10-fold crossvalidation over the URIEL database, where we train on 9/10 of the languages to predict 1/10 of the languages for 10 folds over the data." }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-35", "text": "----------------------------------" }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-36", "text": "**LEARNING REPRESENTATIONS FOR TYPOLOGY PREDICTION**" }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-37", "text": "In this section we describe three methods for learning representations for typology prediction with multilingual neural models." 
}, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-38", "text": "LM Language Vector Several methods have been proposed to learn multilingual language models (LMs) that utilize vector representations of languages (Tsvetkov et al., 2016; \u00d6stling and Tiedemann, 2017) ." }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-39", "text": "Specifically, these models train a recurrent neural network LM (RNNLM; Mikolov et al. (2010) ) using long short-term memory (LSTM; Hochreiter and Schmidhuber (1997) ) with an additional vector representing the current language as an input." }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-40", "text": "The expectation is that this vector will be able to capture the features of the language and improve LM accuracy." }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-41", "text": "\u00d6stling and Tiedemann (2017) noted that, intriguingly, agglomerative clustering of these language vectors results in something that looks roughly like a phylogenetic tree, but stopped short of performing typological inference." }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-42", "text": "We train this vector by appending a special token representing the source language (e.g. \" fra \" for French) to the beginning of the source sentence, as shown in Fig. 1 , then using the word representation learned for this token as a representation of the language." }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-43", "text": "We will call this first set of feature vectors LMVEC, and examine their utility for typology prediction." }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-44", "text": "NMT Language Vector In our second set of feature vectors, MTVEC, we similarly use a language embedding vector, but instead learn a multilingual neural MT model trained to translate from many languages to English, in a similar fashion to Johnson et al. (2016) ; Ha et al. (2016) ." 
}, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-45", "text": "In contrast to LMVEC, we hypothesize that the alignments to an identical sentence in English, the model will have a stronger signal allowing it to more accurately learn vectors that reflect the syntactic, phonetic, or semantic consistencies of various languages." }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-46", "text": "This has been demonstrated to some extent in previous work that has used specifically engineered alignment-based models (Lewis and Xia, 2008; \u00d6stling, 2015; Coke et al., 2016) , and we examine whether these results apply to neural network feature extractors and expand beyond word order and syntax to other types of typology as well." }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-47", "text": "Table 1 : Accuracy of syntactic, phonological, and inventory features using LM language vectors (LMVEC), MT language vectors (MTVEC), MT encoder cell averages (MTCELL) or both MT feature vectors (MTBOTH)." }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-48", "text": "Aux indicates auxiliary information of geodesic/genetic nearest neighbors; \"NONE -Aux\" is the majority class chance rate, while \"NONE +Aux\" is a 3-NN classification." }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-49", "text": "----------------------------------" }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-50", "text": "**NMT ENCODER MEAN CELL STATES**" }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-51", "text": "sentiment (Radford et al., 2017) , we expect that the cell states will represent features that may be linked to the typology of the language." }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-52", "text": "To create vectors for each language using LSTM hidden states, we obtain the mean of cell states (c in the standard LSTM equations) for all time steps of all sentences in each language." 
}, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-58", "text": "To train a multilingual neural machine translation system, we used a corpus of Bible translations that was obtained by scraping a massive online Bible database at bible.com." }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-59", "text": "5 This corpus contains data for 1017 languages." }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-60", "text": "After preprocessing the corpus, we obtained a training set of 20.6 million sentences over all languages." }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-61", "text": "The implementation of both the LM and NMT models described in \u00a73 was done in the DyNet toolkit ." }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-62", "text": "In order to obtain a manageable shared vocabulary for all languages, we divided the data into subwords using joint byte-pair encoding of all languages (Sennrich et al., 2016) with 32K merge operations." }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-63", "text": "We used LSTM cells in a single recurrent layer with 512-dimensional hidden state and input embedding size." }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-64", "text": "The Adam optimizer was used with a learning rate of 0.001 and a dropout of 0.5 was enforced during training." }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-65", "text": "----------------------------------" }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-66", "text": "**RESULTS AND DISCUSSION**" }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-67", "text": "The results of the experiments can be found in Tab." }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-68", "text": "1. First, focusing on the \"-Aux\" results, we can see that all feature vectors obtained by the neural models improve over the chance rate, demonstrating that indeed it is possible to extract information about linguistic typology from unsupervised neural models." 
}, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-69", "text": "Comparing LMVEC to MTVEC, we can see a convincing improvement of 2-3% across the board, indicating that the use of bilingual information does indeed provide a stronger signal, allowing the network to extract more salient features." }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-70", "text": "Next, we can see that MTCELL further outperforms MTVEC, indicating that the proposed method of investigating the hidden cell dynamics is more effective than using a statically learned language vector." }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-71", "text": "Finally, combining both feature vectors as MTBOTH leads to further improvements." }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-72", "text": "To measure statistical significance of the results, we performed a paired bootstrap test to measure the gain between NONE+AUX and MTBOTH+AUX and found that the gains for syntax and inventory were significant (p=0.05), but phonology was not, perhaps because the number of phonological features was fewer than the other classes (only 28)." }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-73", "text": "When further using the geodesic/genetic distance neighbor feature vectors, we can see that these trends largely hold although gains are much smaller, indicating that the proposed method is still useful in the case where we have a-priori knowledge about the environment in which the language exists." }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-74", "text": "It should be noted, however, that the gains of LMVEC evaporate, indicating that access to aligned data may be essential when inferring the typology of a new language." 
}, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-75", "text": "We also noted that the accuracies of certain features decreased from NONE-AUX to MTBOTH-AUX, particularly gender markers, case suffix and negative affix, but these decreases were to a lesser extent in magnitude than the improvements." }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-76", "text": "Interestingly, and in contrast to previous methods for inferring typology from raw text, which have been specifically designed for inducing word order or other syntactic features (Lewis and Xia, Table 2 : Top 5 improvements from \"NONE -Aux\" to \"MTBOTH -Aux\" in the syntax (\"S \"), phonology (\"P \"), and inventory (\"I \") classes." }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-77", "text": "2008; \u00d6stling, 2015; Coke et al., 2016) , our proposed method is also able to infer information about phonological or phonetic inventory features." }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-78", "text": "This may seem surprising or even counterintuitive, but a look at the most-improved phonology/inventory features (Tab. 2) shows a number of features in which languages with the \"nondefault\" option (e.g. having uvular consonants or initial velar nasals, not having lateral consonants, etc.) are concentrated in particular geographical regions." }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-79", "text": "For example, uvular consonants are not common world-wide, but are common in particular geographic regions like the North American Pacific Northwest and the Caucasus (Maddieson, 2013b) , while initial velar nasals are common in Southeast Asia (Anderson, 2013) , and lateral consonants are uncommon in the Amazon Basin (Maddieson, 2013a) ." 
}, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-80", "text": "Since these are also regions with a particular and sometimes distinct syntactic character, we think the model may be finding regional clusters through syntax, and seeing an improvement in regionally-distinctive phonology/inventory features as a side effect." }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-81", "text": "Finally, given that MTCELL uses the feature vectors of the latent cell state to predict typology, it is of interest to observe how these latent cells behave for typologically different languages." }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-82", "text": "In Fig. 2 we examine the node that contributed most to the prediction of \"S OBJ BEFORE VERB\" (the node with maximum weight in the classifier) for German and Korean, where the feature is active, and Portuguese and Catalan, where the feature is inactive." }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-83", "text": "We can see that the node trajectories closely track each other (particularly at the beginning of the sentence) for Portuguese and Catalan, and in general the languages where objects precede verbs have higher average values, which would be expressed by our mean cell state features." }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-84", "text": "The similar trends for languages that share the value for a typological feature (S OBJ BEFORE VERB) indicate that information stored in the selected hidden node is consistent across languages with similar structures." }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-85", "text": "----------------------------------" }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-86", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-87", "text": "Through this study, we have shown that neural models can learn a range of linguistic concepts, and may be used to impute missing features in typological databases." 
}, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-88", "text": "In particular, we have demonstrated the utility of learning representations with parallel text, and results hinted at the importance of modeling the dynamics of the representation as models process sentences." }, { "sent_id": "53cd860c539e20874ada4343ab788f-C001-89", "text": "We hope that this study will encourage additional use of typological features in downstream NLP tasks, and inspire further techniques for missing knowledge prediction in under-documented languages." } ], "y": { "@BACK@": { "gold_contexts": [ [ "53cd860c539e20874ada4343ab788f-C001-16" ], [ "53cd860c539e20874ada4343ab788f-C001-29" ] ], "cite_sentences": [ "53cd860c539e20874ada4343ab788f-C001-16", "53cd860c539e20874ada4343ab788f-C001-29" ] }, "@EXT@": { "gold_contexts": [ [ "53cd860c539e20874ada4343ab788f-C001-46" ] ], "cite_sentences": [ "53cd860c539e20874ada4343ab788f-C001-46" ] }, "@DIF@": { "gold_contexts": [ [ "53cd860c539e20874ada4343ab788f-C001-76", "53cd860c539e20874ada4343ab788f-C001-77" ] ], "cite_sentences": [ "53cd860c539e20874ada4343ab788f-C001-77" ] } } }, "ABC_1786b6c1c6532d5baa092cca40e389_42": { "x": [ { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-2", "text": "This paper presents a novel approach to semantic role annotation implementing an entailmentbased view of the concept of semantic role." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-3", "text": "I propose to represent arguments of predicates with grammatically relevant primitive properties entailed by the semantics of predicates." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-4", "text": "Such meaning components generalise over a range of semantic relations which humans tend to express systematically through language." 
}, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-5", "text": "In a preliminary study, I show that we can model linguistic knowledge at a general, principled syntax-semantics interface by incorporating a layer of skeletal, entailment-based representation of word meaning in large-scale corpus annotation." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-6", "text": "----------------------------------" }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-8", "text": "Large-scale lexical semantic resources that provide relational information about words have recently received much focus in the field of Natural Language Processing (NLP)." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-9", "text": "In particular, data-driven models for lexical semantics require the creation of broad-coverage, hand-annotated corpora with predicateargument information, i.e. rich information about words expressing a semantic relation having argument slots filled by the interpretations of their grammatical complements." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-10", "text": "Corpora combining semantic and syntactic annotations constitute the backbone for the development of probabilistic models that automatically identify the semantic relationships, or semantic roles, conveyed by sentential constituents (Gildea and Jurafsky, 2002) ." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-11", "text": "That is, given an input sentence and a target predicator the system labels constituents with general roles like Agent, Patient, Theme, etc., or more specific roles, as in (1) ." 
}, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-12", "text": "(1) 1 The task of automatic semantic role labelling (or shallow semantic parsing) is a first step towards text understanding and has found use in a variety of NLP applications including information extraction (Surdeanu et al., 2003), machine translation (Boas, 2002) , question answering (Narayanan and Harabagiu, 2004) , summarisation (Melli et al., 2005) , recognition of textual entailment relations (Burchardt and Frank, 2006) , etc." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-13", "text": "Corpora with semantic role labels additionally lend themselves to extraction of linguistic knowledge at the syntax-semantics interface." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-14", "text": "The range of semantic and syntactic combinatorial properties (valences) of each word in each of its senses is documented in terms of annotated corpus attestations." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-15", "text": "For instance, the valence pattern for the use of admire in (1) is shown in (2)." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-16", "text": "This data enables the quantitative study of various linguistic phenomena and the investigation of the relationship between the distinct linguistic layers comprised by predicate-argument analysis." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-17", "text": "Furthermore, the formulation of generalisations over predicate-specific annotations can capture how predicates relate in terms of both semantic and syntactic features." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-18", "text": "Such syntax-semantics mappings (so-called linking generalisations) encode regularities concerning the associations of semantic roles with grammatical functions and are essential for a linguistic knowledge base for NLP applications." 
}, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-19", "text": "This paper addresses the problem of generalising over the valences of individual predicators and proposes an abstract semantic basis for the representation of participant roles." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-20", "text": "The definition of semantic notions at an appropriate level of abstraction is the prerequisite for the formulation of a general, principled syntax-semantics interface." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-21", "text": "This is in accordance with a somewhat intuitive conception of semantic roles as classificatory notions encoding semantic similarities across different types of events or situations in the world." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-22", "text": "In effect, all conceptions of semantic roles as opposed to predicate-specific roles, such as admirer-admired, posit some sort of semantic classification of arguments across predicators while indicating an acknowledgment that the syntax-semantics interface (referred to with the term linking) is not completely arbitrary." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-23", "text": "Put differently, semantic roles constitute a level of representation suitable for capturing semantic generalisations which humans tend to express systematically through language." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-24", "text": "The structure of the paper is organised as follows." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-25", "text": "Section 2 looks at conceptions of semantic roles in state-of-the-art approaches to semantic annotation indicating problems or complications related to the question of whether or how these roles can support generalisations across predicates." 
}, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-26", "text": "Section 3 calls attention to the theoretical underpinnings of the notion of semantic role and introduces an annotation schema which departs from the traditional view of semantic roles as atomic, undecomposable categories." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-27", "text": "Following the insight of Dowty's (1991) theory of Proto-Roles, I will propose analytical representations of verbal arguments based on semantically well-founded, grammatically relevant meaning components entailed by the semantics of predicates (Proto-Role entailments)." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-28", "text": "Finally, section 4 presents a study in which lexical entailments are marked in a corpus in accordance with the proposed schema." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-29", "text": "General syntax-semantics mappings are extracted from the annotated data and are formalised in abstract classes which readily encode generalisations concerning linking to syntactic form." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-30", "text": "----------------------------------" }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-31", "text": "**CORPORA WITH SEMANTIC ROLES AND RELATED WORK**" }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-32", "text": "Semantically annotated corpora currently available for English implement two distinct approaches to the prickly notion of semantic role." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-33", "text": "The Proposition Bank (PropBank) (Kingsbury et al., 2002 ) is a one million word corpus in which predicate-argument relations are hand-annotated for every occurrence of every verb in the Wall Street Journal part of the Penn Treebank (Marcus et al., 1994) ." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-34", "text": "Verb senses are distinguished informally on the basis of semantic as well as syntactic criteria." 
}, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-35", "text": "The semantic arguments of a verb are numbered sequentially." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-36", "text": "PropBank uses a common set of role labels (Arg0 up to Arg5) for all predicators, but these labels are defined on a per-verb basis, i.e. they have verb-specific meanings." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-37", "text": "Example PropBank annotations: As illustrated in (3), argument labels are consistent across alternate syntactic patterns of a given predicator in a given sense." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-38", "text": "However, PropBank refrains from formalising the semantics of the role labels and does not ensure their coherence across verbs." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-39", "text": "This is particularly clear with higher numbered labels, which correspond to distinct types of participants: Arg2 marks an Instrument for break (3), a Benefactive for provide (4), and a Source for buy (5)." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-40", "text": "Lower-numbered labels denote various roles as well, but they are less arbitrary across verbs: Arg0 corresponds to traditional Agents, Experiencers, certain types of Theme, etc." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-41", "text": "which surface as subjects of transitive verbs and a class of intransitives called unergatives; Arg1, on the other hand, is assigned to objects of transitive verbs and subjects of unaccusatives and is the equivalent of traditional Patients, Themes, etc." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-42", "text": "While the PropBank corpus enables empirical insight into a variety of linguistic phenomena (e.g. 
variations in the grammatical expression of arguments) providing useful frequency information for the uses of predicators, it does not lend itself to extraction of a principled linguistic knowledge base with semantic generalisations across predicates." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-43", "text": "Inasmuch as no consistent mapping is ensured between a label and a semantic role, the argument labels end up seriously overloaded across verbs." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-44", "text": "This explains why role recognition models have particularly poor performance in assigning the labels Arg2-Arg5." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-45", "text": "In fact, an attempt is currently being made to map PropBank argument labels to semantically coherent roles specified by VerbNet (Kipper et al., 2000) (i.e. a broad-coverage verb lexicon based on Levin's (1993) classification of English verbs according to shared meaning and behaviour)." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-46", "text": "Even though VerbNet specifies a small list of abstract roles (23 in total) which are intended to support generalisations, these roles are not defined as global primitives, but are meaningful only within verb classes." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-47", "text": "Because mappings of labels to semantic roles with class-specific interpretations would lead to very sparse data, argument labels are subdivided into groupings of VerbNet roles." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-48", "text": "The latter are created manually on the basis of analysis of argument use." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-49", "text": "3 The subdivided (more coherent) PropBank labels perform better for semantic role labelling (Loper et al., 2007)." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-50", "text": "A different paradigm for semantic role annotation is put forth by FrameNet." 
}, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-51", "text": "The Berkeley FrameNet project (Baker et al., 1998 ) is creating an online lexical database containing semantic descriptions of words based on Fillmore's (1985) theory of frame semantics." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-52", "text": "The basic unit of analysis is the semantic frame, i.e. a schematic representation of a stereotypical scene or situation." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-53", "text": "Each frame is associated with a set of predicates (including verbs, nouns, and adjectives) and a set of semantic roles (called Frame Elements, FEs) encoding the participants and props in the designated scene." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-54", "text": "FrameNet includes manually annotated example sentences from the British National Corpus incorporating additional layers of phrase structure and grammatical function annotation." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-55", "text": "It also includes two small corpora of full-text annotation intended to facilitate statistical analysis of frame-semantic structures." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-56", "text": "Currently it contains more than 960 frames covering more than 11,600 lexical items exemplified in more than 150,000 annotated sentences." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-57", "text": "The Judgment frame evoked by admire in (1) FrameNet avoids the difficulties of attempting to pin down a small set of general roles." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-58", "text": "Instead Frame Elements are defined locally, i.e. in terms of frames." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-59", "text": "Frames are situated in semantic space by means of directed (asymmetric) relations." 
}, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-60", "text": "Each frame-to-frame relation associates a less dependent or more general frame (Super frame) with a more dependent or less general one (Sub frame)." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-61", "text": "The hierarchical organisation of frames along with FE identities or analogs across frames are intended to enable the formulation of generalisations concerning the combinatorial properties (valences) of predicates." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-62", "text": "In practice, however, the frame hierarchy turns out to be somewhat complicated." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-63", "text": "Inheritance (i.e. the strongest semantic relation and the most plausible to propagate valence information across frames) is conditioned on complex sets of semantic components underlying frame definitions, ranging from FE membership and relations to other frames to relationships among FEs and Semantic Types on frames and FEs." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-64", "text": "4 This kind of frame dependence based on fine-grained semantic or ontological distinctions is doomed to miss argument structure commonalities in predicates evoking frames that are related at a more abstract, essentially structural semantic level." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-65", "text": "Section 4 includes a concrete example of the complications in generalising valence information across FrameNet frames." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-66", "text": "Researchers working in the FrameNet paradigm have proposed different approaches for abstracting over the properties of individual predicators and increasing the size of training data for semantic role labelling systems." 
}, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-67", "text": "Gildea and Jurafsky (2002) attempt to generalise the behaviour of semantically related predicates experimenting with a small set of abstract semantic roles mapped to FrameNet roles." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-68", "text": "The experimental result of the role classification using these generalisation features shows significant improvements in the system." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-69", "text": "This is due to the fact that role generalisations can form a remedy for the severe problem of sparse data which is inherent in lexical semantic corpus annotation." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-70", "text": "Data sparseness, i.e. the insufficient coverage of the range of predicate senses and constructions within sensible sizes of manually annotated data, is a bottleneck both for acquisition of linguistic knowledge for the semantic lexicon and for automated techniques for semantic role assignment." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-71", "text": "----------------------------------" }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-72", "text": "**AN ABSTRACT SEMANTIC BASIS FOR THE REPRESENTATION OF PARTICIPANT ROLES**" }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-73", "text": "From the presentation of different annotation projects it becomes evident that semantic role annotation is a complicated task whose product is deeply influenced by its initial design philosophy and underlying criteria." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-74", "text": "5 Among these criteria the notion of semantic role itself is central." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-75", "text": "PropBank uses general role labels that lack semantic coherence." 
}, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-76", "text": "VerbNet and FrameNet, on the other hand, specify coherent roles at a more fine-grained level (i.e. roles with class-specific or frame-specific interpretations)." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-77", "text": "In this section, I consider the linguistic contours of the concept of semantic role proposing an annotation schema based upon theoretically well-founded role concepts which meet the requirements of both generality and coherence." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-78", "text": "This schema is intended at enabling the formulation of a general syntax-semantics interface suitable for modelling the relations of predicates in terms of combinatorial features." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-79", "text": "Espousing and extending Dowty's (1991) Proto-Role hypothesis, I propose to associate arguments of predicates with properties entailed by the semantics of predicates." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-80", "text": "6 Mappings of entailments to syntactic constituents can be many-to-one." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-81", "text": "That is, an argument can be marked with one or more properties necessarily entailed by the meaning of the predicator." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-82", "text": "7 Prepositional complements are also marked with verbal entailments to which prepositions may contribute more specific content." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-83", "text": "In this paper, I will make no attempt to formalise the content added by prepositions; prepositional semantics is represented solely in terms of the common entailment basis it shares with verbal meaning." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-84", "text": "Each Proto-Role entailment indicates a grammatically pervasive concept, i.e. 
a property having direct effect on the grammatical behaviour of predicates." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-85", "text": "It is defined in terms of an abstract semantic relation underlying the lexical meaning of the predicate." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-86", "text": "Five such relations are identified in terms of which entailment-based representations are specified: Notion, Causation, Motion, Possession, Conditioning." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-87", "text": "Note that contrary to mere ontological labels, entailment-based representations encode structural characterisations of the semantics of arguments." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-88", "text": "Consider, for instance, the sentence in (1), repeated here as (6): (6)" }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-89", "text": "A structural representation of the meaning of this construction will explicitly encode the relationships between each of the arguments of admire, i.e. between the NP I and the NP him, between the NP him and the PP for his bravery and his cheerfulness, and between the NP I and the PP for his bravery and his cheerfulness." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-90", "text": "By contrast, the FrameNet roles shown above do not model the fact that the semantic content of an Evaluee requires a Cognizer, or that a Reason requires both a Cognizer and an Evaluee." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-91", "text": "The view that the semantic properties underlying lexical meaning are relational in nature (i.e. they are not to be conceived entirely independently of one another) has been advocated by several researchers, among others Wechsler (1995), Pinker (1989) , Jackendoff (1990) , and Davis (2001), on whose work I build." 
}, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-92", "text": "In the rest of this section, I define a set of recurring entailments which underlie the semantics of a range of verbs displaying various syntactic patterns." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-93", "text": "Note that this set can be extended on the basis of additional primitive meaning components of the sort described above, covering the semantics of broad verb classes." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-94", "text": "situation (in every model) in which the first is true, the second is also true." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-95", "text": "For linguistic predicates, in particular, an entailment (or lexical entailment) is an analytic implication following from the meaning of the predicate in question." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-96", "text": "7 The presence of 'necessarily' in this sentence is somewhat redundant, in that its meaning is incorporated by the notion of entailment." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-97", "text": "I insist, however, on emphasising it to indicate that semantic properties that are accidentally associated with the meaning of a particular use of a verb will not be annotated." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-98", "text": "Dowty points out that entailments of the predicate must be distinguished from what follows from any one sentence as a whole (e.g. entailments that may arise from NP meanings) (Dowty, 1991 :572, footnote 16)." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-99", "text": "For example, in the sentence Mary slapped John, assuming that John is a human entity, it follows from the meaning of the sentence that John will perceive something as a result of the action of slapping. 
But this 'entailment' is not intrinsically tied to the meaning of slap, because the sentences Mary slapped the table or Mary slapped the corpse are also felicitous." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-100", "text": "That is, sentience of the direct object is not an essential component of the semantics of slap, in the way it is for a verb like awaken." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-101", "text": "The sentences Mary awakened the table and Mary awakened the corpse are clearly anomalous." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-102", "text": "True entailments of predicators (which are the ones that will be annotated) must be detectable in every possible environment in which the predicator is used." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-103", "text": "8 The examples used to illustrate the proposed schema are from the British National Corpus." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-104", "text": "Some of them are slightly modified for reasons of conciseness." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-105", "text": "The predicates in (7)- (13) are represented in terms of a Notion relation." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-106", "text": "That is, they involve a Conceiver who is entailed to have a notion or perception of a Conceived participant (while the reverse entailment does not necessarily go through)." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-107", "text": "9 In situation types in which a Conceiver is entailed to have a notion of more than one participant, Conceived arguments are distinguished on the basis of their salience in the overall semantics of the predicate." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-108", "text": "For instance, test (8) intuitively lexicalises a dyadic relation between a Conceiver (tester) and a Conceived (tested) entity." 
}, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-109", "text": "A sought entity denoted by a for-PP is represented as part of a secondary Notion relation situated at the background of the primary (testing) relation." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-110", "text": "Conceived entities that are peripheral to the essential relation lexicalised by the predicate are associated with a more specific property termed Conceived background state of affairs (Conceived bsoa)." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-111", "text": "These arguments receive less focus in the meaning of the predicate, in a sense that they are not absolutely necessary to understand the predicate's meaning." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-112", "text": "The representation of test (8), stereotype (10), and find out (11) in terms of two Notion relations, one of which is treated as more salient, reifies the concept of relative significance of Proto-Role properties in the verbal semantics." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-113", "text": "This concept is related to the weighting of entailments in the overall semantics of a verb, which plays a critical role in determining the syntactic patterns in which the verb appears (i.e. the grammatical realisations of its arguments)." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-114", "text": "10 The verbs in (8) and (9) involve an additional entailment of Intentionality." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-115", "text": "This is used to mark entities characterised by conscious choice, decision, or control over the course of inherently intentional actions." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-116", "text": "Intentional participants necessarily have a notion/perception of some event participant(s)." 
}, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-117", "text": "The annotations in (12) and (13) include the Entity and Property tags which are intended to distinguish Conceived arguments in terms of a predicative relation assigned in the Conceiver's mental model." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-118", "text": "The Property label corresponds to a representation of the form P(x) denoting a property P which is predicated of some object x." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-119", "text": "The entailments of Notion are not applicable in the semantics of the predicates in (14)- (15) below." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-120", "text": "These verbs refer to situations with affected participants and are described in terms of an abstract relation of Causation." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-121", "text": "In the denoted events, a Causer is entailed to affect some entity (the Causee) either physically or mentally." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-122", "text": "Causally affected participants sometimes undergo radical changes in their (physical or mental) state, which are identified in terms of a readily observable transition from a source to a final (result) state, as shown in (15) ." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-123", "text": "(14) Verbs as in (16)- (17) Finally, verbs such as own, possess, acquire, lack, etc. are treated in terms of a Possession relation involving a Possessor and an entity entailed to be Possessed (18) ." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-124", "text": "9 The Notion relation, as defined by Wechsler (1995) , essentially reconstructs the entailment of sentience, which was proposed by Dowty (1991) ." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-125", "text": "10 Arguments identified as conceived bsoas have many of the syntactic properties of so-called semantic adjuncts." 
}, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-126", "text": "However, I refrain from invoking an argument versus adjunct division, in that it is known to involve serious theoretical pitfalls." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-127", "text": "Instead I classify conceived participants on the basis of the concept of importance of entailments, which lies exactly at the syntaxsemantics interface." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-128", "text": "This concept is defined in terms of the lexicalised event rather than the real-world event that traditional analyses of adjuncthood appeal to. (18) Proto-Role entailments are defined in terms of inherently asymmetric semantic relations involving fixed role positions." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-129", "text": "Each of these relations (with the exception of Motion) can be thought of as instance of a more general relation entailing that properties of an entity \u03b2 are dependent on an entity \u03b1." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-130", "text": "For example, a conceived entity in a Notion relation depends on the existence of a conceiver (it is taken to be within the scope of the conceiver's beliefs)." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-131", "text": "An affected or possessed object in a causation or possession relation depends on the existence of some causer or possessor, respectively." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-132", "text": "I refer to this relation as Conditioning relation and associate it with appropriate Proto-Role properties capturing the semantics of a broad range of verbs for which none of the entailments specified so far seems to hold." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-133", "text": "These verbs conform to the basic transitivity pattern that motivated Dowty's Proto-Role hypothesis." 
}, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-134", "text": "Below are some characteristic examples: A Conditioning relation encodes the asymmetries in such predicators in terms of the underlying entailment that the properties of a participant \u03b1 impose a condition on properties of a participant \u03b2." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-135", "text": "In each of the sentences above we can conclude something about the object participant (e.g. that it is necessary, illegal, or linguistically expressed) on the basis of the subject referent (i.e. the characteristics of the game, the regulations specified by the code, the usage of the adjective 'beautiful')." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-136", "text": "By contrast, no property of the subject referent is necessarily conditioned on the object: the semantics of ban, for example, does not allow us to characterise code 1425 as fair/unfair, severe/lax, complete/incomplete, new/old, etc." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-137", "text": "on the basis of the object NP 'large trucks in tunnels'; similarly, we cannot infer the precise meaning of the word 'beautiful' or whether it is a verb or a noun or an adjective on the basis of the content of the NP 'a quality which can be found in many different objects'." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-138", "text": "A more precise definition of the Conditioning relation could state that the intrinsic (i.e. invariable) properties of a participant \u03b1 determine or condition some non-intrinsic (i.e. variable or event-dependent) property of a participant \u03b2 while the converse entailment does not go through." 
}, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-139", "text": "In (24), for example, the sociolinguistics domain is associated with a property of being diverse whereas the intrinsic properties of the domain have no significance for the definition of 'diversity' or what this notion may characterise." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-140", "text": "----------------------------------" }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-141", "text": "**FORMULATION OF A GENERAL SYNTAX-SEMANTICS INTERFACE**" }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-142", "text": "A preliminary study has been carried out mapping state-of-the-art semantic role annotations to lexical entailment representations." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-143", "text": "In particular, a portion of the FrameNet corpora has been annotated with Proto-Role properties by a single annotator." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-144", "text": "The study focuses on a set of English verbs selected from 250 random FrameNet frames." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-145", "text": "For each verb in these frames, collections of example annotated sentences as well as sentences from the FrameNet full-text annotation corpora (where available) were extracted." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-146", "text": "More than 900 lexical units were considered in \u223c20K sentences." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-147", "text": "Proto-Role entailments were annotated on top of FrameNet's syntactic annotations in accordance with the schema sketched out above." 
}, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-148", "text": "The annotations were produced semi-automatically following a three-stage procedure: (i) mapping Frame Elements (FEs) to entailments at a frame level (ii) automatically adding this information to the data in a new annotation layer, (iii) manually correcting the novel annotations by examining the argument structures of individual predicators for finer semantic distinctions." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-149", "text": "From the newly annotated data mappings of entailments to grammatical categories were acquired." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-150", "text": "The syntactic realisations of Proto-Role properties were found to readily generalise over combinatorial features of verbs pertaining to various FrameNet frames." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-151", "text": "Valence information can be formally rendered in entailment-based classes called Lexicalisation Types (L-Types) abstracting away from the semantics of predicators." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-152", "text": "L-Types are defined on the basis of grammatically relevant meaning components and encode linking generalisations cutting across FrameNet frames." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-153", "text": "For instance, predicates such as believe and desire (evoking the frames Religious Belief and Desiring, respectively) involve arguments that are equivalent in terms of entailments, as illustrated in (25)- (26) below." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-154", "text": "Hence they are categorised in the Notion L-Type shown in Table 2 ." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-155", "text": "Table 2 Predicates grouped together in L-Types have some but not necessarily all their grammatical properties in common." 
}, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-156", "text": "This is in accordance with the fact that L-Types are essentially semantically-driven modelling recurring, abstract features in the semantics of predicators while disregarding ephemeral properties as well as lexical idiosyncrasies." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-157", "text": "11 In addition to the set of entailments discussed in the previous section, L-Types may also incorporate more fine-grained properties that are clearly relevant to linking." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-158", "text": "For instance, verbs lexicalising a Desiring situation were found with prepositional complements introduced by for, after, to, towards, of, or over (e.g. long for, hanker after, aspire to, pine over, etc.), but not on, upon, at, or about (like other Notion verbs, such as ponder, muse, think, etc.)." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-159", "text": "Inasmuch as a Desiring relation is identified as a recurring concept systematically associated with a particular grammatical relation (e.g. a for-PP), it can be represented in a separate L-Type inheriting from the Notion L-Type presented previously." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-160", "text": "12 An initial classification like the one exemplified above captures general conditions which determine possible associations between the semantics of predicators and grammatical relations realising their arguments (e.g. the fact that a conceived entity can only surface in subject position in a passive sentence)." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-161", "text": "It can be extended and refined on the basis of more specific semantic relations." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-162", "text": "Moreover, L-Types can be organised in hierarchical structures." 
}, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-163", "text": "They can form the upper portion of a principled hierarchy of classes encoding successively broader levels of generalisations concerning argument linking." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-164", "text": "This study indicated that a small number of Lexicalisation Types abstracts over a wide range of FrameNet frames." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-165", "text": "13 More precisely, in the annotated dataset 48 L-Types were identified based on various combinations of entailments: 9 Notion Types, 7 Intentionality Types, 10 Causation Types, 7 Communication (Caused Notion) Types, 7 Motion (including Caused Motion) Types, 7 Possession (including Caused Possession) Types, and 1 Conditioning Type." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-166", "text": "These Types readily abstract over associations of semantic properties and grammatical functions attested in over 200 FrameNet frames." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-167", "text": "14 In the FrameNet paradigm, L-Types can be modelled as non-lexicalised frames specifying syntactic mapping constraints." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-168", "text": "Mappings between FrameNet frames and L-Types can be stated by means of a separate relation in addition to the frame relations currently specified by FrameNet." }, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-169", "text": "A relation generalising the combinatorial properties of lexical items across frames would simplify the picture of the frame hierarchy, in that it would essentially decouple purely lexical semantic information (encoded by existing frame-to-frame relations) from information pertaining exactly to the interface of syntax and semantics." 
}, { "sent_id": "1786b6c1c6532d5baa092cca40e389-C001-170", "text": "In future work, our intention is to test whether the proposed semantic role schema and the attested L-Types can be useful for dealing with the sparse data problem and increasing the performance of semantic role labelling systems." } ], "y": { "@BACK@": { "gold_contexts": [ [ "1786b6c1c6532d5baa092cca40e389-C001-10" ] ], "cite_sentences": [ "1786b6c1c6532d5baa092cca40e389-C001-10" ] }, "@USE@": { "gold_contexts": [ [ "1786b6c1c6532d5baa092cca40e389-C001-112" ] ], "cite_sentences": [ "1786b6c1c6532d5baa092cca40e389-C001-112" ] } } }, "ABC_770368eff3410f3c2ab18b14b42243_42": { "x": [ { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-73", "text": "both activities and entities." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-2", "text": "Researchers have recently started investigating deep neural networks for dialogue applications." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-47", "text": "----------------------------------" }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-48", "text": "**EXPERIMENTS**" }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-3", "text": "In particular, generative sequence-to-sequence (Seq2Seq) models have shown promising results for unstructured tasks, such as word-level dialogue response generation." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-4", "text": "The hope is that such models will be able to leverage massive amounts of data to learn meaningful natural language representations and response generation strategies, while requiring a minimum amount of domain knowledge and hand-crafting." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-5", "text": "An important challenge is to develop models that can effectively incorporate dialogue context and generate meaningful and diverse responses." 
}, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-6", "text": "In support of this goal, we review recently proposed models based on generative encoder-decoder neural network architectures, and show that these models have better ability to incorporate long-term dialogue history, to model uncertainty and ambiguity in dialogue, and to generate responses with high-level compositional structure." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-7", "text": "----------------------------------" }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-9", "text": "Researchers have recently started investigating sequence-to-sequence (Seq2Seq) models for dialogue applications." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-10", "text": "These models typically use neural networks to both represent dialogue histories and to generate or select appropriate responses." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-11", "text": "Such models are able to leverage large amounts of data in order to learn meaningful natural language representations and generation strategies, while requiring a minimum amount of domain knowledge and hand-crafting." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-12", "text": "Although the Seq2Seq framework is different from the well-established goal-oriented setting [Gorin et al., 1997 , Young, 2000 , Singh et al., 2002 , these models have already been applied to several real-world applications, with Microsoft's system Xiaoice [Markoff and Mozur, 2015] and Google's Smart Reply system [Kannan et al., 2016] as two prominent examples." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-13", "text": "Researchers have mainly explored two types of Seq2Seq models." 
}, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-14", "text": "The first are generative models, which are usually trained with cross-entropy to generate responses word-by-word conditioned on a dialogue context [Ritter et al., 2011 , Vinyals and Le, 2015 , Sordoni et al., 2015 , Shang et al., 2015 , Li et al., 2016a , Serban et al., 2016b ." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-15", "text": "The second are discriminative models, which are trained to select an appropriate response from a set of candidate responses [Lowe et al., 2015 , Bordes and Weston, 2016 , Inaba and Takahashi, 2016 , Yu et al., 2016 ." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-16", "text": "In a related strand of work, researchers have also investigated applying neural networks to the different components of a standard dialogue system, including natural language understanding, natural language generation, dialogue state tracking and evaluation [Wen et al., 2016 , Henderson et al., 2013 , Mrk\u0161i\u0107 et al., 2015 ." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-17", "text": "In this paper, we focus on generative models trained with cross-entropy." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-18", "text": "One weakness of current generative models is their limited ability to incorporate rich dialogue context and to generate meaningful and diverse responses [Serban et al., 2016b , Li et al., 2016a ." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-19", "text": "To overcome this challenge, we propose new generative models that are better able to incorporate long-term dialogue history, to model uncertainty and ambiguity in dialogue, and to generate responses with high-level compositional structure." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-20", "text": "Our experiments demonstrate the importance of the model architecture and the related inductive biases in achieving this improved performance." 
}, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-21", "text": "Classic LSTM model, which uses a shallow generation process." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-22", "text": "This is problematic because it has no mechanism for incorporating uncertainty and ambiguity and because it forces the model to generate compositional and long-term structure incrementally on a word-by-word basis." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-23", "text": "(B): VHRED expands the generation process by adding one latent variable for each utterance, which helps incorporate uncertainty and ambiguity in the representations and generate meaningful, diverse responses." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-24", "text": "(C): MrRNN expands the generation process by adding a sequence of discrete stochastic variables for each utterance, which helps generate responses with high-level compositional structure." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-25", "text": "----------------------------------" }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-26", "text": "**MODELS**" }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-27", "text": "HRED: The Hierarchical Recurrent Encoder-Decoder model (HRED) [Serban et al., 2016b ] is a type of Seq2Seq model that decomposes a dialogue into a two-level hierarchy: a sequence of utterances, each of which is a sequence of words." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-28", "text": "HRED consists of three recurrent neural networks (RNNs): an encoder RNN, a context RNN and a decoder RNN." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-29", "text": "Each utterance is encoded into a real-valued vector representation by the encoder RNN." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-30", "text": "These utterance representations are given as input to the context RNN, which computes a real-valued vector representation summarizing the dialogue at every turn." 
}, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-31", "text": "This summary is given as input to the decoder RNN, which generates a response word-by-word." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-32", "text": "Unlike the RNN encoders in previous Seq2Seq models, the context RNN is only updated once every dialogue turn and uses the same parameters for each update." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-33", "text": "This gives HRED an inductive bias that helps incorporate long-term context and learn invariant representations." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-34", "text": "----------------------------------" }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-35", "text": "**VHRED:**" }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-36", "text": "The Latent Variable Hierarchical Recurrent Encoder-Decoder model (VHRED) [Serban et al., 2016c ] is an HRED model with an additional component: a high-dimensional stochastic latent variable at every dialogue turn." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-37", "text": "As in HRED, the dialogue context is encoded into a vector representation using encoder and context RNNs." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-38", "text": "Conditioned on the summary vector at each dialogue turn, VHRED samples a multivariate Gaussian variable, which is given along with the summary vector as input to the decoder RNN." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-39", "text": "The multivariate Gaussian latent variable allows modelling ambiguity and uncertainty in the dialogue through the latent variable distribution parameters (mean and variance parameters)." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-40", "text": "This provides a useful inductive bias, which helps VHRED encode the dialogue context into a real-valued embedding space even when the dialogue context is ambiguous or uncertain, and it helps VHRED generate more diverse responses." 
}, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-41", "text": "----------------------------------" }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-42", "text": "**MRRNN:**" }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-43", "text": "The Multiresolution RNN (MrRNN) [Serban et al., 2016a ] models dialogue as two parallel stochastic sequences: a sequence of high-level coarse tokens (coarse sequences), and a sequence of low-level natural language words (utterances)." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-44", "text": "The coarse sequences follow a latent stochastic process-analogous to hidden Markov models-which conditions the utterances through a hierarchical generation process." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-45", "text": "The hierarchical generation process first generates the coarse sequence, and conditioned on this generates the natural language utterance." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-46", "text": "In our experiments, the coarse" }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-49", "text": "We apply our generative models to dialogue response generation on the Ubuntu Dialogue Corpus [Lowe et al., 2015] ." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-50", "text": "For each example, given a dialogue context, the model must generate an appropriate response." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-51", "text": "We also present results on Twitter in the Appendix." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-52", "text": "This task has been studied extensively in the recent literature [Ritter et al., 2011 , Sordoni et al., 2015 , Li et al., 2016a ." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-53", "text": "Corpus: The Ubuntu Dialogue Corpus consists of about half a million dialogues extracted from the #Ubuntu Internet Relayed Chat (IRC) channel." 
}, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-54", "text": "Users entering this chat channel usually have a specific technical problem." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-55", "text": "Typically, users first describe their problem, and other users try to help them resolve it." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-56", "text": "The technical problems range from software-related and hardware-related issues (e.g. installing packages, fixing broken drivers) to informational needs (e.g. finding software)." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-57", "text": "----------------------------------" }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-58", "text": "**EVALUATION:**" }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-59", "text": "We carry out an in-lab human study to evaluate the model responses." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-60", "text": "We recruit 5 human evaluators." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-61", "text": "We show each evaluator between 30 and 40 dialogue contexts with the ground truth response, and 4 candidate model responses." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-62", "text": "For each example, we ask the evaluators to compare the candidate responses to the ground truth response and dialogue context, and rate them for fluency and relevancy on a scale 0-4, where 0 means incomprehensible or no relevancy and 4 means flawless English or all relevant." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-63", "text": "In addition to the human evaluation, we also evaluate dialogue responses w.r.t." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-64", "text": "the activity-entity metrics proposed by Serban et al. [2016a] ." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-65", "text": "These metrics measure whether the model response contains the same activities (e.g. download, install) and entities (e.g. 
ubuntu, firefox) as the ground truth responses." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-66", "text": "Models that generate responses with the same activities and entities as the ground truth responses-including expert responses, which often lead to solving the user's problem-are given higher scores." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-67", "text": "Sample responses from each model are shown in Table 1." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-68", "text": "MrRNN with noun representations obtains an F1 entity score of 6.31, while all other models obtain less than half that, with F1 scores between 0.87 and 2.53, and human evaluators consistently rate its fluency and relevancy significantly higher than all the baseline models." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-69", "text": "MrRNN with activity representations obtains an F1 activity score of 11.43, while all other models obtain less than half that, with F1 activity scores between 1.18 and 4.63, and performs substantially better than the baseline models w.r.t." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-70", "text": "the F1 entity score." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-71", "text": "This indicates that the MrRNNs have learned to model high-level, goal-oriented sequential structure in the Ubuntu domain." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-72", "text": "Followed by these, VHRED performs better than the HRED and LSTM models w.r.t." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-74", "text": "This shows that VHRED generates more appropriate responses, which suggests that the latent variables are useful for modeling uncertainty and ambiguity." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-75", "text": "Finally, HRED performs better than the LSTM baseline w.r.t."
}, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-76", "text": "both activities and entities, which underlines the importance of representing longer-term context." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-77", "text": "These conclusions are confirmed by additional experiments on response generation for the Twitter domain (see Appendix)." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-78", "text": "----------------------------------" }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-79", "text": "**DISCUSSION**" }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-80", "text": "We have presented generative models for dialogue response generation." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-81", "text": "We have proposed architectural modifications with inductive biases towards 1) incorporating longer-term context, 2) handling uncertainty and ambiguity, and 3) generating diverse and on-topic responses with high-level compositional structure." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-82", "text": "Our experiments show the advantage of the architectural modifications quantitatively through human experiments and qualitatively through manual inspections." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-83", "text": "These experiments demonstrate the need for further research into generative model architectures." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-84", "text": "Although we have focused on three generative models, other model architectures such as memory-based models Weston, 2016, Weston et al., 2015] and attention-based models [Shang et al., 2015] have also demonstrated promising results and therefore deserve the attention of future research." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-85", "text": "In another line of work, researchers have started proposing alternative training and response selection criteria [Weston, 2016] ." 
}, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-86", "text": "Li et al. [2016a] propose ranking candidate responses according to a mutual information criterion, in order to incorporate dialogue context efficiently and retrieve on-topic responses." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-87", "text": "Li et al. [2016b] further propose a model trained using reinforcement learning to optimize a hand-crafted reward function." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-88", "text": "Both these models are motivated by the lack of diversity observed in the generative model responses." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-89", "text": "Similarly, Yu et al. [2016] propose a hybrid model-combining retrieval models, neural networks and hand-crafted rules-trained using reinforcement learning to optimize a hand-crafted reward function." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-90", "text": "In contrast to these approaches, without combining several models or having to modify the training or response selection criterion, VHRED generates more diverse responses than previous models." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-91", "text": "Similarly, by optimizing the joint log-likelihood over sequences, MrRNNs generate more appropriate and on-topic responses with compositional structure." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-92", "text": "Thus, improving generative model architectures has the potential to compensate -or even remove the need -for hand-crafted reward functions." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-93", "text": "At the same time, the models we propose are not necessarily better language models, which are more efficient at compressing dialogue data as measured by word perplexity." 
}, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-94", "text": "Although these models produce responses that are preferred by humans, they often result in higher test set perplexity than traditional LSTM language models." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-95", "text": "This suggests maximizing log-likelihood (i.e. minimizing perplexity) is not a sufficient training objective for these models." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-96", "text": "An important line of future work therefore lies in improving the objective functions for training and response selection, as well as learning directly from interactions with real users." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-97", "text": "----------------------------------" }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-98", "text": "**TWITTER RESULTS**" }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-99", "text": "Corpus: We experiment on a Twitter Dialogue Corpus [Ritter et al., 2011] containing about one million dialogues." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-100", "text": "The task is to generate utterances to append to existing Twitter conversations." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-101", "text": "This task is typically categorized as a non-goal-driven task, because any fluent and on-topic response may be adequate." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-102", "text": "----------------------------------" }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-103", "text": "**EVALUATION:**" }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-104", "text": "We carry out a human study on Amazon Mechanical Turk (AMT)." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-105", "text": "We show human evaluators a dialogue context along with two potential responses: one response generated from each model conditioned on the dialogue context." 
}, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-106", "text": "We ask evaluators to choose the response most appropriate to the dialogue context." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-107", "text": "If the evaluators are indifferent, they can choose neither response." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-108", "text": "For each pair of models we conduct two experiments: one where the example contexts contain at least 80 unique tokens (long context), and one where they contain at least 20 (not necessarily unique) tokens (short context)." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-109", "text": "We experiment with the LSTM, HRED and VHRED models, as well as a TF-IDF retrieval-based baseline model." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-110", "text": "We do not experiment with the MrRNN models, because we do not have appropriate coarse representations for this domain." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-111", "text": "----------------------------------" }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-112", "text": "**RESULTS:**" }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-113", "text": "The results given in Table 3 show that VHRED is strongly preferred in the majority of the experiments." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-114", "text": "In particular, VHRED is strongly preferred over the HRED and TF-IDF baseline models for both short and long context settings." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-115", "text": "VHRED is also strongly preferred over the LSTM baseline model for long contexts, although the LSTM model is preferred over VHRED for short contexts." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-116", "text": "For short contexts, the LSTM model is often preferred over VHRED because the LSTM model tends to generate very generic responses." 
}, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-117", "text": "Such generic or safe responses are reasonable for a wide range of contexts, but are not useful when applied through-out a dialogue, because the user would loose interest in the conversation." }, { "sent_id": "770368eff3410f3c2ab18b14b42243-C001-118", "text": "In conclusion, VHRED performs substantially better overall than competing models, which suggests that the high-dimensional latent variables help model uncertainty and ambiguity in the dialogue context and help generate meaningful responses." } ], "y": { "@BACK@": { "gold_contexts": [ [ "770368eff3410f3c2ab18b14b42243-C001-14" ], [ "770368eff3410f3c2ab18b14b42243-C001-52" ], [ "770368eff3410f3c2ab18b14b42243-C001-86" ] ], "cite_sentences": [ "770368eff3410f3c2ab18b14b42243-C001-14", "770368eff3410f3c2ab18b14b42243-C001-52", "770368eff3410f3c2ab18b14b42243-C001-86" ] }, "@MOT@": { "gold_contexts": [ [ "770368eff3410f3c2ab18b14b42243-C001-18" ] ], "cite_sentences": [ "770368eff3410f3c2ab18b14b42243-C001-18" ] } } }, "ABC_b1df73c14f53fea607c4c8b71740fe_42": { "x": [ { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-2", "text": "Recent work applied Dirichlet Process Mixture Models to the task of verb clustering, incorporating supervision in the form of must-links and cannot-links constraints between instances." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-3", "text": "In this work, we introduce an active learning approach for constraint selection employing uncertaintybased sampling." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-4", "text": "We achieve substantial improvements over random selection on two datasets." 
}, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-5", "text": "----------------------------------" }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-7", "text": "Bayesian non-parametric mixture models have the attractive property that the number of components used to model the data is not fixed in advance but is determined by the model and the data." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-8", "text": "This property is particularly interesting for NLP where many tasks are aimed at discovering novel information." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-9", "text": "Recent work has applied such models to various tasks with promising results, e.g. Teh (2006) and Cohn et al. (2009) ." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-10", "text": "Vlachos et al. (2009) applied the basic model of this class, the Dirichlet Process Mixture Model (DPMM), to lexical-semantic verb clustering with encouraging results." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-11", "text": "The task involves discovering classes of verbs similar in terms of their syntactic-semantic properties (e.g. MOTION class for travel, walk, run, etc.)." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-12", "text": "Such classes can provide important support for other tasks, such as word sense disambiguation, parsing and semantic role labeling." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-13", "text": "(Dang, 2004; Swier and Stevenson, 2004 ) Although some fixed classifications are available these are not comprehensive and are inadequate for specific domains." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-14", "text": "Furthermore, Vlachos et al. (2009) used a constrained version of the DPMM in order to guide clustering towards some prior intuition or considerations relevant to the specific task at hand." 
}, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-15", "text": "This supervision was modelled as pairwise constraints between instances and it informs the model of relations between them that cannot be recovered by the model on the basis of the feature representation used." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-16", "text": "Like other forms of supervision, these constraints require manual annotation and it is important to maximize the benefits obtained from it." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-17", "text": "Therefore it is natural to consider active learning (Settles, 2009 ) in order to focus the supervision on clusterings on which the model is uncertain." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-18", "text": "In this work, we propose a simple yet effective active learning method employing uncertainty based sampling." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-19", "text": "The effectiveness of the AL method is demonstrated on two datasets, one of which has multiple gold standards." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-20", "text": "----------------------------------" }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-21", "text": "**CONSTRAINED DPMMS FOR CLUSTERING**" }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-22", "text": "In DPMMs, the parameters of each component are generated by a Dirichlet Process (DP) which can be seen as a distribution over distributions." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-23", "text": "Each instance, represented by its features, is generated by the component it is assigned to." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-24", "text": "The components discovered correspond to the clusters." 
}, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-25", "text": "The prior probability of assigning an instance to a particular component is proportionate to the number of instances already assigned to it, in other words, the DPMM exhibits the \"rich get richer\" property." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-26", "text": "A popular metaphor to describe the DPMM which exhibits an equivalent clustering property is the Chinese Restaurant Process (CRP)." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-27", "text": "Customers (instances) arrive at a Chinese restaurant which has an infinite number of tables (components)." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-28", "text": "Each customer sits at one of the tables that is either occupied or vacant with popular tables attracting more customers." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-96", "text": "While reducing the batch size leads to better learning rates, it requires estimating the model more often." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-29", "text": "Following Navarro et al. (2006) , parameter estimation is performed using Gibbs sampling by sampling the assignment z i of each instance x i given all the others z \u2212i and the data X:" }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-30", "text": "In Eq. 1 p(z i = z|z \u2212i ) is the CRP prior and P (x i |z i = z, X \u2212i ) is the distribution that generates instance x i given it has been assigned to component z. This sampling scheme is possible because the assignments in the model are exchangeable, i.e. their order is not relevant." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-31", "text": "The constrained version of the DPMM uses pairwise constraints over instances in order to adapt the clustering discovered." 
}, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-32", "text": "Following Wagstaff & Cardie (2000) , a pair of instances is either linked together (must-link) or not (cannotlink)." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-33", "text": "For example, charge and run should form a must-link if the aim is to cluster MOTION verbs together, but they should form a cannot-link if we are interested in BILL verbs." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-34", "text": "All links are assumed to be consistent with each other." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-35", "text": "In order to incorporate the constraints in the DPMM, the Gibbs sampling scheme is modified so that mustlinked instances are generated by the same component and cannot-linked instances always by different ones." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-36", "text": "Following Vlachos et al. (2009) , for each instance that does not belong to a linked-group, the sampler is restricted to choose components that do not contain instances cannot-linked with it." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-37", "text": "For instances in a linked-group, their assignment is sampled jointly, again taking into account their cannot-links." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-38", "text": "This is performed by adding each instance of the linked-group successively to the same component." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-39", "text": "In terms of the CRP metaphor, customers connected with must-links arrive at the restaurant and choose a table jointly, respecting their cannot-links with other customers." 
}, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-40", "text": "----------------------------------" }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-41", "text": "**ACTIVE CONSTRAINT SELECTION**" }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-42", "text": "In active learning, the model selects the supervision to be provided by a human expert." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-43", "text": "In the context of the DPMMs, the model chooses a pair of instances for which a must-link or a cannot-link must be provided." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-44", "text": "To select the pair, we employ the simple but effective idea of uncertainty based sampling." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-45", "text": "We consider the most informative link as that on which the model is most uncertain, more formally the link between instances l * ij that maximizes the following entropy:" }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-46", "text": "If we consider clustering as binary classification of links into must-links and cannot-links, it is equivalent to selecting the pair with the highest label entropy." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-47", "text": "During the sampling process used for parameter inference, component assignments vary between samples and the components themselves are not identifiable, i.e. one cannot match the components of one sample with those of another." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-48", "text": "Furthermore, the conditional assignments estimated during Gibbs sampling (Eq. 1) they do not capture the uncertainty of the assignments z \u2212i on which they condition." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-49", "text": "Therefore, we resort to generating a set of samples from the (possibly constrained) DPMM and pick the link on which these samples maximally disagree, i.e. we approximate the distribution in Eq. 
2 with the probability that instances i, j are in the same cluster or not." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-50", "text": "Thus, in a given set of samples the most uncertain link would be the one between two instances which are in the same cluster in exactly half of these samples." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-51", "text": "Using multiple samples allows us to take into account the uncertainty in the assignments of the other instances, as well as the varying number of components." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-52", "text": "Compared to standard pool-based AL, when clustering with constraints the possible links between two instances (ignoring transitivity) are C(N, 2) = N (N \u2212 1)/2 (N is the size of the dataset), and there is an equal number of candidate queries to be considered, as opposed to N queries in a supervised classification task." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-53", "text": "Another interesting difference is that the AL process can be initiated without any supervision, since the DPMM is unsupervised." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-54", "text": "On the other hand, in the standard AL scenario a (usually small) labelled seed set is used." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-55", "text": "Therefore, we rely exclusively on the model and the features to guide the constraint selection process." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-56", "text": "If the model combined with the features is not appropriate for the task, then the constraints chosen are unlikely to be useful."
}, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-57", "text": "----------------------------------" }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-58", "text": "**DATASETS AND EVALUATION**" }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-59", "text": "In our experiments we used two verb clustering datasets, one from general English (Sun et al., 2008) and one from the biomedical domain (Korhonen et al., 2006) ." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-60", "text": "In both datasets the features for each verb are its subcategorization frames (SCFs) which capture the syntactic context in which it occurs." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-61", "text": "They were acquired automatically using a domain-independent statistical parsing toolkit, RASP (Briscoe and Carroll, 2002) , and a classifier which identifies verbal SCFs." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-62", "text": "As a consequence, they include some noise due to standard text processing and parsing errors and due to the subtlety of the argument-adjunct distinction." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-63", "text": "The general English dataset contains 204 verbs belonging to 17 fine-grained classes in Levin's (Levin, 1993) taxonomy so that each class contains 12 verbs." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-64", "text": "The biomedical dataset consists of 193 medium to high frequency verbs from a corpus of 2230 full-text articles from 3 biomedical journals." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-65", "text": "A team of linguists and biologists created a three-level gold standard with 16, 34 and 50 classes." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-66", "text": "Both datasets were pre-processed using non-negative matrix factorization (Lin, 2007) which decomposes a large sparse matrix into two dense matrices (of lower dimensionality) with non-negative values." 
}, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-67", "text": "In all experiments 35 dimensions were kept." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-68", "text": "Preliminary experiments with different number of dimensions kept did not affect the performance substantially." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-69", "text": "We evaluate our results using three information theoretic measures: Variation of Information (Meil\u0203, 2007) , V-measure (Rosenberg and Hirschberg, 2007) and V-beta (Vlachos et al., 2009 )." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-70", "text": "All three assess the two desirable properties that a clustering should have with respect to a gold standard, homogeneity and completeness." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-71", "text": "Homogeneity reflects the degree to which each cluster contains instances from a single class and is defined as the conditional entropy of the class distribution of the gold standard given the clustering." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-72", "text": "Completeness reflects the degree to which each class is contained in a single cluster and is defined as the conditional entropy of clustering given the class distribution in the gold standard." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-73", "text": "V-beta balances these properties explicitly by taking into account the ratio of the number of cluster discovered over the number of classes in the gold standard." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-74", "text": "While an ideal clustering should have both properties, naively improving one of them can be harmful for the other." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-75", "text": "Compared to the more commonly used F-measure (Fung et al., 2003) , these measures have the advantage that they do not assume a mapping between clusters and classes." 
}, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-76", "text": "----------------------------------" }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-77", "text": "**EXPERIMENTS**" }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-78", "text": "We performed experiments in order to assess the effectiveness of the AL algorithm for the constrained DPMM comparing it to random selection." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-79", "text": "In each AL round, we run the Gibbs sampler for the (constrained) DPMM five times, using 100 iterations for burn-in, draw 20 samples from each run with 5 iterations lag between samples and select the most uncertain link to be labeled." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-80", "text": "Following Navarro et al. (2006) , the concentration parameter is inferred from the data using Gibbs sampling." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-81", "text": "The performances were averaged across the collected samples." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-82", "text": "Random selection was repeated three times." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-83", "text": "The three levels of the biomedical gold standard were used independently and together with the general English dataset result in four experimental setups." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-84", "text": "The comparison between AL and random selection for each dataset is shown in graphs 1(a)-1(d) using V-beta, noting that the observations made hold with all evaluation metrics used." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-85", "text": "Constraints selected via AL improve the performance rapidly." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-86", "text": "Indicatively, the performance reached using 1000 randomly chosen constraints is obtained using only 110 actively selected ones in the bio-50 dataset." 
}, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-87", "text": "AL performance levels out in later stages with performance superior to the one achieved using random selection with the same number of constraints." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-88", "text": "The poor performance of random selection is expected, since the unsupervised DPMM predicts more than 90% of the binary links correctly." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-89", "text": "Another interesting observation is that, during AL, homogeneity increased faster than completeness (graphs 1(g) and 1(h))." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-90", "text": "This suggests that the features used lead the model towards finergrained clusters, which is further confirmed by the fact that the highest scores on the biomedical dataset are achieved when comparing against the finest-grained version of the gold standard." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-91", "text": "While it is possible to choose constraints to the model that would increase completeness with respect to the gold standard, we argue that this would not allow us to obtain obtain insights on the model and the features used." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-92", "text": "We also noticed that the choice of batch size has a significant effect on the learning rate of the model." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-93", "text": "This phenomenon occurs in varying degrees in many applications of AL." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-94", "text": "Manual inspection of the links chosen at each round revealed that batches often contained links involving the same instances." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-95", "text": "This is expected due to transitivity: if the link between instances A and B is uncertain but the link between instances B and C is certain, then the link between A and C will be uncertain too." 
}, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-97", "text": "In order to ameliorate this issue, after obtaining the label of the most uncertain link, we remove the samples that disagreed with it and re-calculate the uncertainty of the remaining links given the remaining samples." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-98", "text": "This is repeated until the intended batch size is reached." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-99", "text": "Thus, we avoid selecting links involving the same instance, unless their uncertainty was not reduced by the constraints added." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-100", "text": "A consideration that arises is that by reducing the number of samples used for uncertainty estimation, progressively we are left with fewer samples to rank the remaining links." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-101", "text": "Each labeled link reduces the number of samples approximately by half since the most uncertain link is likely to be a must-link in half the samples and a cannot-lnk in the remaining half." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-102", "text": "As a result, for a batch with size |B| the uncertainty of the last link will be estimated using |S|/2 |B|\u22121 samples." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-103", "text": "A crude solution would be to generate enough samples for the desired batch size." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-104", "text": "However, obtaining a very large number of samples can be computationally expensive." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-105", "text": "Therefore, we set a threshold for the minimum number of samples to be used to estimate the link uncertainty and when it is reached, more samples are generated using the constraints selected." 
}, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-106", "text": "In graphs 1(e) and 1(f) we demonstrate the effectiveness of the batch selection method proposed (labeled \"batch\") compared to naive batch selection (labeled \"active10\")." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-107", "text": "----------------------------------" }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-108", "text": "**DISCUSSION AND FUTURE WORK**" }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-109", "text": "We presented an AL method for constrained DPMMs employing uncertainty based sampling." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-110", "text": "We applied it to two different verb clustering datasets with 4 gold standards in total and obtained very good results compared to random selection." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-111", "text": "The idea, while explored in the context of verb clustering with the constrained DPMM, is likely to be applicable to other models that can incorporate mustlinks and cannot-links in MCMC sampling." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-112", "text": "Most literature on AL for NLP considers supervised methods for classification or sequential tagging." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-113", "text": "However, AL for clustering is a relatively under-explored area." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-114", "text": "Klein et al. (2002) incorporated actively selected constraints in hierarchical agglomerative clustering." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-115", "text": "Basu et al. (2006) have applied AL to obtain must-links and cannot-links however, the clustering framework used requires the number of clusters to be known in advance which restricts counter-intuitively the clustering solutions that are discovered." 
}, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-116", "text": "Moreover, semisupervised clustering is a form of semi-supervised learning and in this light, our approach is related to the work of Zhu et al. (2003) ." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-117", "text": "With respect to the practical application of the AL method suggested, it is worth noting that in all our experiments the constraints were obtained for the respective gold standard of the dataset at question and consequently they are all consistent with each other." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-118", "text": "However, this assumption might not hold in case human experts are employed for the same purpose." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-119", "text": "In order to use such feedback in the framework suggested, it is necessary to filter the constraints provided in order to obtain a consistent subset." }, { "sent_id": "b1df73c14f53fea607c4c8b71740fe-C001-120", "text": "To this end, it would be interesting to investigate the potential of using \"soft\" constraints, i.e. constraints that are provided with relative confidence." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "b1df73c14f53fea607c4c8b71740fe-C001-10" ], [ "b1df73c14f53fea607c4c8b71740fe-C001-14" ] ], "cite_sentences": [ "b1df73c14f53fea607c4c8b71740fe-C001-10", "b1df73c14f53fea607c4c8b71740fe-C001-14" ] }, "@USE@": { "gold_contexts": [ [ "b1df73c14f53fea607c4c8b71740fe-C001-36" ], [ "b1df73c14f53fea607c4c8b71740fe-C001-69" ] ], "cite_sentences": [ "b1df73c14f53fea607c4c8b71740fe-C001-36", "b1df73c14f53fea607c4c8b71740fe-C001-69" ] } } }, "ABC_2a01f96893f9c0630a01ecce320184_42": { "x": [ { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-2", "text": "The goal of fine-grained propaganda detection is to determine whether a given sentence uses propaganda techniques (sentence-level) or to recognize which techniques are used (fragment-level)." }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-3", "text": "This paper presents the system of our participation in the sentence-level subtask of the propaganda detection shared task." }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-4", "text": "In order to better utilize the document information, we construct context-dependent input pairs (sentence-title pair and sentencecontext pair) to fine-tune the pretrained BERT, and we also use the undersampling method to tackle the problem of imbalanced data 1 ." }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-5", "text": "----------------------------------" }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-7", "text": "Propaganda detection is a process of determining whether a news article or a sentence is misleading." 
}, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-8", "text": "Several research works have been proposed to detect propaganda on document-level (Rashkin et al., 2017; Barr\u00f3n-Cede\u00f1o et al., 2019b) , sentencelevel and fragment-level (Da San Martino et al., 2019) ." }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-9", "text": "Sentence-level detection or classification (SLC) is to determine whether a given sentence is propagandistic and it is a special binary classification problem, while the goal of fragment-level classification (FLC) is to extract fragments and assign with given labels such as loaded language, flag-waving and causal oversimplification, and it could be treated as a sequence labeling problem." }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-10", "text": "Compared with document-level, sentence-level and fragment-level detection are much more helpful, since detection on sentences and fragments are more practical for real-life applications." }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-11", "text": "However, these fine-grained tasks are more challenging." }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-12", "text": "Although Da San Martino et al. (2019) indicates that multi-task learning of both the SLC and the FLC could be beneficial for the SLC, in this paper, we only focus on the SLC task so as to better investigate whether context information could improve the performance of our system." }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-13", "text": "Since several pretrained language models (Devlin et al., 2019; Liu et al., 2019) have been proved to be effective for text classification and other natural language understanding tasks, we use the pretrained BERT (Devlin et al., 2019) for the SLC task." }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-14", "text": "This paper elaborates our BERT-based system for which we construct sentence-title pairs and sentence-context pairs as input." 
}, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-15", "text": "In addition, in order to tackle the problem of imbalanced data, we apply the undersampling method (Zhou and Liu, 2006) to the training data, and we find that this method greatly boosts the performance of our system." }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-16", "text": "----------------------------------" }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-17", "text": "**RELATED WORK**" }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-18", "text": "Various methods have been proposed for propaganda detection." }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-19", "text": "Rashkin et al. (2017) proposed to use LSTM and other machine learning methods for deception detection in different types of news, including trusted, satire, hoax and propaganda." }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-20", "text": "Barr\u00f3n-Cede\u00f1o et al. (2019b) proposed to use Maximum Entropy classifier (Berger et al., 1996) with different features replicating the same experimental setup of Rashkin et al. (2017) for twoway and four-way classifications." }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-21", "text": "A fine-grained propaganda corpus was proposed in Da San Martino et al. (2019) which includes both sentencelevel and fragment-level information." }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-22", "text": "Based on this corpus and the pretrained BERT which is one of the most powerful pretrained language model, a multi-granularity BERT was proposed and it outperformed several strong BERT-based baselines." }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-23", "text": "----------------------------------" }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-24", "text": "**METHODOLOGY**" }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-25", "text": "In our system, we utilize BERT as our base model and construct different kinds of input pairs to finetune it." 
}, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-26", "text": "When constructing the input representa- tion, a special token [CLS] is padded in front of every sentence and another token [SEP] is added at the end of it." }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-27", "text": "In addition, for each input pair, a [SEP] is added between a sentence and its context or title." }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-28", "text": "Finally, a linear layer and a sigmoid function are applied to the final representation of [CLS] to obtain the probability for classification." }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-29", "text": "For comparison, we also use the official method (Random) as baseline which randomly labels sentences." }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-30", "text": "----------------------------------" }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-31", "text": "**DATA**" }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-32", "text": "The dataset is provided by NLP4IF 2019 Shared Task (Barr\u00f3n-Cede\u00f1o et al., 2019a) , and the training set, the development set, and the test set contain approximately 16,000, 2,000 and 3,400 sentences respectively." }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-33", "text": "According to the statistics, only 29% of the training sentences are labeled as propaganda, and thus in this paper, we treat propaganda sentences as positive samples and non-propaganda sentences as negative samples." }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-34", "text": "More details of the dataset could be found in Da San Martino et al. (2019) ." 
}, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-35", "text": "----------------------------------" }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-36", "text": "**INPUT PAIRS**" }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-37", "text": "Sentence Only: We only use the current sentence to fine-tune the model and models trained with this kind of input are used as baselines for those models trained with the following two kinds of input pairs." }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-38", "text": "Sentence-Title Pair:" }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-39", "text": "As described in Da San Martino et al. (2019) , the source of the dataset that we use is news articles, and since the title is usually the summarization of a news article, we use the title as supplementary information." }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-40", "text": "Sentence-Context Pair: In addition to setting the title as the supplementary information, we construct the sentence-context pair which also includes preceding sentences as additional context, since preceding sentences usually convey the same or related events and this historical content is closely related to the current sentence." }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-41", "text": "Figure 1 . shows the details of this kind of input pair in which the preceding sentence and the title are directly concatenated." }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-42", "text": "----------------------------------" }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-43", "text": "**UNDERSAMPLING**" }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-44", "text": "As mentioned above, there are only 29% of training sentences labeled as propaganda (positive)." 
}, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-45", "text": "In order to tackle the problem of imbalanced data, we first collect positive samples which size is S pos and negative samples, then we resample S neg (X percent of S pos ) from negative samples at the beginning of each training epoch." }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-46", "text": "Finally, we combine and shuffle both positive samples and sampled negative samples as a new training set S sampled ." }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-47", "text": "----------------------------------" }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-48", "text": "**EXPERIMENT DETAILS**" }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-49", "text": "In this paper, we use the pretrained uncased version of BERT BASE and BERT LARGE 2 for the SLC, and more details of these two models could Table 2 : Experiment results of the chosen model and the random baseline for the SLC task." }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-50", "text": "be found in Devlin et al. (2019) ." }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-51", "text": "Before finetuning, sentences are first converted to lower case and their maximum sequence length is set to 128." }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-52", "text": "For a sentence-context pair, the maximum length of context is set to 100." }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-53", "text": "If the sequence length of an input pair exceeds 128, then the context or title is truncated to meet the length." }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-54", "text": "When fine-tuning, we use the Adam (Kingma and Ba, 2014) with learning rate 2e-5 for 2 epochs, the batch size is 32 and the dropout probability is kept at 0.1." 
}, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-55", "text": "Since the title or context information could help improve the performance, we only apply the undersampling method to input pairs (sentence-title and sentence-context)." }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-56", "text": "For those models involved with undersampling, the sample rate X is set to 0.8, 0.9 or 1.0 empirically." }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-57", "text": "During the training stage, all training samples are used." }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-58", "text": "We directly evaluate all the models on the development set, and the best model is chosen to generate predictions of the test data." }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-59", "text": "----------------------------------" }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-60", "text": "**RESULT**" }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-61", "text": "Our approach is evaluated on Propaganda Detection@NLP4IF SLC dataset." }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-62", "text": "In the development stage, we use three kinds of input and three different sample rates for BERT." }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-63", "text": "Table 1 . shows the results of the development set." }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-64", "text": "From Table 1 ., without considering undersampling, we can see that using the sentence-title pair could boost the performance of BERT BASE , compared with the model using only the current sentence and the random baseline." }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-65", "text": "While using the sentence-context pair could improve the F 1 score of BERT BASE by 0.8% with precision rising to 71.10 and recall decreasing to 54.94, the performance of BERT BASE drops by around 1% with recall dropping significantly to 49.12." 
}, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-66", "text": "We also observe that both performances of BERT BASE and BERT LARGE trained with original training sentences are competitive compared with the random baseline." }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-67", "text": "However, the precision of BERT BASE at 70.54 and the one of BERT LARGE at 71.23 are significantly higher than the recall of both models, at 56.70 and at 54.26 respectively, and this may result from the problem of imbalanced instances." }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-68", "text": "Thus, we introduce the undersampling technique using 0.8, 0.9 or 1.0 sample rate to tackle this issue." }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-69", "text": "We observe from Table 1 ." }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-70", "text": "that the F 1 score of BERT BASE with the sentence-title pair and 0.8 sample rate rises around by 5% and the same model using the sentence-context pair and 0.9 sample rate performs similarly." }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-71", "text": "As for BERT LARGE , while using the sentence-title pair has the similar performance as it is employed in the base version model, using the sentence-context pair strongly boosts the F 1 score, at 67.94 with 0.8 sample rate and at 67.25 with 1.0 sample rate." }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-72", "text": "In addition, it is worth noting that there is a better trade-off between precision and recall with 1.0 sample rate than the one with 0.8." }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-73", "text": "In the test stage, since we are only allowed to submit a single run on the test set, we choose the model with the highest F 1 score (67.94) to generate predictions and the evaluated results are listed in Table 2." 
}, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-74", "text": "From Table 2 ., we can see that the recall raises by nearly 5% and the precision of it drops significantly, by around 7%, compared with the results on the development set, while the recall of Random Baseline also drops by approximately 5.5% and the precision of it remains nearly the same." }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-75", "text": "----------------------------------" }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-76", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-77", "text": "In this paper, we examine capability of the context-dependent BERT model." }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-78", "text": "In the sentencelevel propaganda detection task, we construct sentence-title pairs and sentence-context pairs in order to better utilize context information to improve the performance of our system." }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-79", "text": "Furthermore, the undersampling method is utilized to tackle the data imbalanced problem." }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-80", "text": "Experiments show that both sentence-title/context pairs and the undersampling method could boost the performance of BERT on the SLC task." }, { "sent_id": "2a01f96893f9c0630a01ecce320184-C001-81", "text": "In the future, we plan to apply multi-task learning to this context-dependent BERT, similar to the method mentioned in Da San Martino et al. (2019) or introducing other kinds of tasks, such as sentiment analysis or domain classification." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "2a01f96893f9c0630a01ecce320184-C001-8" ], [ "2a01f96893f9c0630a01ecce320184-C001-21" ], [ "2a01f96893f9c0630a01ecce320184-C001-34" ], [ "2a01f96893f9c0630a01ecce320184-C001-39" ] ], "cite_sentences": [ "2a01f96893f9c0630a01ecce320184-C001-8", "2a01f96893f9c0630a01ecce320184-C001-21", "2a01f96893f9c0630a01ecce320184-C001-34", "2a01f96893f9c0630a01ecce320184-C001-39" ] }, "@DIF@": { "gold_contexts": [ [ "2a01f96893f9c0630a01ecce320184-C001-12" ] ], "cite_sentences": [ "2a01f96893f9c0630a01ecce320184-C001-12" ] }, "@USE@": { "gold_contexts": [ [ "2a01f96893f9c0630a01ecce320184-C001-39" ] ], "cite_sentences": [ "2a01f96893f9c0630a01ecce320184-C001-39" ] }, "@SIM@": { "gold_contexts": [ [ "2a01f96893f9c0630a01ecce320184-C001-81" ] ], "cite_sentences": [ "2a01f96893f9c0630a01ecce320184-C001-81" ] }, "@FUT@": { "gold_contexts": [ [ "2a01f96893f9c0630a01ecce320184-C001-81" ] ], "cite_sentences": [ "2a01f96893f9c0630a01ecce320184-C001-81" ] } } }, "ABC_7cbe7dc02bbb53fe06d9215ed88fe0_42": { "x": [ { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-2", "text": "This paper reports our submissions to the Cross Level Semantic Similarity (CLSS) task in SemEval 2014." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-3", "text": "We submitted one Random Forest regression system on each cross level text pair, i.e., Paragraph to Sentence (P-S), Sentence to Phrase (SPh), Phrase to Word (Ph-W) and Word to Sense (W-Se)." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-4", "text": "For text pairs on P-S level and S-Ph level, we consider them as sentences and extract heterogeneous types of similarity features, i.e., string features, knowledge based features, corpus based features, syntactic features, machine translation based features, multi-level text features, etc." 
}, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-5", "text": "For text pairs on Ph-W level and W-Se level, due to lack of information, most of these features are not applicable or available." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-6", "text": "To overcome this problem, we propose several information enrichment methods using WordNet synonym and definition." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-7", "text": "Our systems rank the 2nd out of 18 teams both on Pearson correlation (official rank) and Spearman rank correlation." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-8", "text": "Specifically, our systems take the second place on P-S level, S-Ph level and Ph-W level and the 4th place on W-Se level in terms of Pearson correlation." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-9", "text": "----------------------------------" }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-11", "text": "Semantic similarity is an essential component of many applications in Natural Language Processing (NLP)." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-12", "text": "Previous works often focus on text semantic similarity on the same level, i.e., paragraph to paragraph or sentence to sentence, and many effective text semantic measurements have been proposed (Islam and Inkpen, 2008) , (B\u00e4r et al., 2012), This work is licensed under a Creative Commons Attribution 4.0 International Licence." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-13", "text": "Page numbers and proceedings footer are added by the organisers." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-14", "text": "Licence details: http://creativecommons.org/licenses/by/4.0/ (Heilman and Madnani, 2012) ." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-15", "text": "However, in many real world cases, the two texts may not always be on the same level." 
}, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-16", "text": "The Cross Level Semantic Similarity (CLSS) task in SemEval 2014 provides a universal platform to measure the degree of semantic equivalence between two texts across different levels." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-17", "text": "For each text pair on four cross levels, i.e., Paragraph to Sentence (P-S), Sentence to Phrase (S-Ph), Phrase to Word (Ph-W) and Word to Sense (W-Se), participants are required to return a similarity score which ranges from 0 (no relation) to 4 (semantic equivalence)." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-18", "text": "We participate in all the four cross levels and take the second place out of all 18 teams both on Pearson correlation (official) and Spearman correlation ranks." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-19", "text": "In this work, we present a supervised regression system for each cross level separately." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-20", "text": "For P-S level and S-Ph level, we regard the paragraph of P-S as a long sentence, and the phrase of SPh as a short sentence." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-21", "text": "Then we use various types of text similarity features including string features, knowledge based features, corpus based features, syntactic features, machine translation based features, multi-level text features and so on, to capture the semantic similarity between two texts." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-22", "text": "Some of these features are borrowed from our previous system in the Semantic Textual Similarity (STS) task in * SEM Shared Task 2013 (Zhu and Lan, 2013) ." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-23", "text": "Others followed the previous work in (\u0160aric et al., 2012) and (Pilehvar et al., 2013 )." 
}, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-24", "text": "For Ph-W level and W-Se level, since the text pairs lack contextual information, for example, word or sense alone no longer shares the property of sentence, most features used in P-S level and S-Ph level are not applicable or available." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-25", "text": "To overcome the problem of insufficient information in word and sense level, we propose several information enrichment methods to extend information with the aid of WordNet (Miller, 1995) , which significantly improved the system performance." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-26", "text": "The rest of this paper is organized as follows." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-27", "text": "Section 2 describes the similarity features used on four cross levels in detail." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-28", "text": "Section 3 presents experiments and the results of four cross levels on training data and test data." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-29", "text": "Conclusions and future work are given in Section 4." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-30", "text": "----------------------------------" }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-31", "text": "**TEXT SIMILARITY MEASUREMENTS**" }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-32", "text": "To estimate the semantic similarity on P-S level and S-Ph level, we treat the text pairs on both levels as traditional semantic similarity computation on sentence level and adopt 7 types of features, i.e., string features, knowledge based features, corpus based features, syntactic features, machine translation based features, multi-level text features and other features." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-33", "text": "All of them are borrowed from previous work due to their superior performance reported." 
}, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-34", "text": "For Ph-W level and W-Se level, since word and sense alone cannot be treated as sentence, we propose an information enrichment method to extend original text with the help of WordNet." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-35", "text": "Once the word or sense is enriched with its synonym and its definition description, we can thus adopt the previous features as well." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-36", "text": "----------------------------------" }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-37", "text": "**PREPROCESSING**" }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-38", "text": "For P-S level and S-Ph level, we perform text preprocessing before we extract semantic similarity features." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-39", "text": "Firstly, the Stanford parser 1 is used for sentence tokenization and parsing." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-40", "text": "Specifically, the tokens n't and 'm are replaced with not and am." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-41", "text": "Secondly, the Stanford POS Tagger 2 is used for POS tagging." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-42", "text": "Thirdly, we use Natural Language Toolkit 3 for WordNet based Lemmatization, which lemmatizes the word to its nearest base form that appears in WordNet, for example, was is lemmatized as is rather than be." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-43", "text": "----------------------------------" }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-44", "text": "**FEATURES ON P-S LEVEL AND S-PH LEVEL**" }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-45", "text": "We treat all text pairs of P-S level and S-Ph level as sentences and then extract 7 types of similarity features as below." 
}, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-46", "text": "Totally we get 52 similarity features." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-47", "text": "Generally, these similarity features are represented as numerical values." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-48", "text": "String features." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-49", "text": "Intuitively, if two texts share more strings, they are considered to be more semantic similar." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-50", "text": "We extract 13 string based features in consideration of the common sequence shared by two texts." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-51", "text": "We chose the Longest Common Sequence (LCS) feature (Zhu and Lan, 2013) , the Ngram Overlap feature (n=1,2,3) and the Weighted Word Overlap feature (\u0160aric et al., 2012) ." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-52", "text": "All these features are computed from original text and from the processed text after lemmatization as well." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-53", "text": "Besides, we also computed the N-gram Overlap on character level, named Character Ngram (n=2,3,4)." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-54", "text": "Knowledge based features." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-55", "text": "Knowledge based similarity estimation relies on the semantic network of words." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-56", "text": "In this work we used the knowledge based features in our previous work (Zhu and Lan, 2013) , which include four word similarity metrics based on WordNet: Path similarity (Banea et al., 2012) , WUP similarity (Wu and Palmer, 1994) , LCH similarity (Leacock and Chodorow, 1998) and Lin similarity (Lin, 1998) ." 
}, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-57", "text": "Then two strategies, i.e., the best alignment strategy and the aggregation strategy, are employed to propagate the word similarity to the text similarity." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-58", "text": "Totally we get 8 knowledge based features." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-59", "text": "Corpus based features." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-60", "text": "Latent Semantic Analysis (LSA) (Landauer et al., 1997) is a widely used corpus based measure when evaluating text similarity." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-61", "text": "In this work we use the Vector Space Sentence Similarity proposed by (\u0160aric et al., 2012) , which represents each sentence as a single distributional vector by summing up the LSA vector of each word in the sentence." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-62", "text": "Two corpora are used to compute the LSA vector of words: New York Times Annotated Corpus (NYT) and Wikipedia." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-63", "text": "Besides, in consideration of different weights for different words, they also calculated the weighted LSA vector for each word." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-64", "text": "In addition, we use the Co-occurrence Retrieval Model (CRM) feature from our previous work (Zhu and Lan, 2013) as another corpus-based feature." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-65", "text": "The CRM is calculated based on a notion of substitutability, that is, the more appropriate it is to substitute word w 1 in place of word w 2 in a suitable natural language task, the more semantically similar they are." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-66", "text": "At last, 6 corpus based features are extracted." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-67", "text": "Syntactic features." 
}, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-68", "text": "Dependency relations of sentences often contain semantic information." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-69", "text": "In this work we follow two syntactic dependency similarity features presented in our previous work (Zhu and Lan, 2013), i.e., Simple Dependency Overlap and Special Dependency Overlap." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-70", "text": "The Simple Dependency Overlap measures all dependency relations while the Special Dependency Overlap fea-ture only focuses on the primary roles extracted from several special dependency relations, i.e., subject, object and predict." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-71", "text": "Machine Translation based features." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-72", "text": "Machine translation (MT) evaluation metrics are designed to assess whether the output of a MT system is semantically equivalent to a set of reference translations." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-73", "text": "This type of feature has been proved to be effective in our previous work (Zhu and Lan, 2013) ." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-74", "text": "As a result, we extend the original 6 lexical level MT metrics to 10 metrics, i.e., WER, TER, PER, BLEU, NIST, ROUGE-L, GTM-1,GTM-2, GTM-3 and METEOR-ex." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-75", "text": "All these metrics are calculated using the Asiya Open Toolkit for Automatic Machine Translation (Meta-) Evaluation 4 ." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-76", "text": "Multi-level text Features." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-77", "text": "(Pilehvar et al., 2013) presented a unified approach to semantic similarity at multiple levels from word senses to text documents through the semantic signature representation of texts (e.g., sense, word or sentence)." 
}, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-78", "text": "Given initial nodes (senses), they performed random walks on semantic network like WordNet, then the resulting frequency distribution over all nodes in WordNet served as semantic signature of the text." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-79", "text": "By doing so the similarity of two texts can be computed as the similarity of two semantic signatures." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-80", "text": "In this work, we borrowed their semantic signature method and adopted 3 similarity measures to estimate two semantic signatures, i.e., Cosine similarity, Weighted Overlap and Topk Jaccard (k=250, 500)." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-81", "text": "Other Features." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-82", "text": "Besides, other simple surface features from texts, such as numbers, symbols and length of texts, are extracted." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-83", "text": "Following (\u0160aric et al., 2012) we adopt relative length difference, relative information content difference, numbers overlap, case match and stocks match." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-84", "text": "----------------------------------" }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-85", "text": "**FEATURES ON PH-W LEVEL**" }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-86", "text": "For Ph-W level, since word and phrase no longer share the property of sentence, most features used for sentence similarity estimation are not applicable for this level." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-87", "text": "Therefore, we adopt the following features as the basic feature set for Ph-W level." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-88", "text": "String features." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-89", "text": "This type contains two features." 
}, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-90", "text": "The first is a boolean feature which records whether the word appears in the phrase." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-91", "text": "The second is the Weighted Word Overlap feature mentioned in Section 2.2." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-92", "text": "Knowledge based features." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-93", "text": "As described in Sec-tion 2.2, we compute the averaged score and the maximal score between word and phrase using the four word similarity measures based on WordNet, i.e., Path, WUP, LCH and Lin." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-94", "text": "Corpus based features." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-95", "text": "We adopt the Vector Space Similarity described in Section 2.2." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-96", "text": "Specifically, for word the single distributional vector is the LSA vector of itself." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-97", "text": "Multi-level text Features." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-98", "text": "As described in Section 2.2, since the semantic signatures are proposed for various kinds of texts (e.g., sense, word or sentence), they serve as one basic feature." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-99", "text": "Obviously, the above features extracted from the phrase-word pair is significantly less than the features used in P-S level and S-Ph level." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-100", "text": "This is because the information contained in phrase-word pair is much less than that in sentences and paragraphs." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-101", "text": "To overcome this information insufficient problem, we propose an information enrichment method based on WordNet to extend the initial word in Ph-W level as below." 
}, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-102", "text": "Word Expansion with Definition." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-103", "text": "For the word part in Ph-W level, we extract its definition in terms of its most common concept in WordNet and then replace the initial word with this definition." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-104", "text": "This gives a much richer set of initial single word." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-105", "text": "Since a word may have many senses, not all of this word definition expansion are correct. But we show below empirically that using this expanded set improves performance." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-106", "text": "By doing so we treat the phrase and the definition of the original word as two sentences, and thus, all features described in Section 2.2 are calculated." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-107", "text": "----------------------------------" }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-108", "text": "**FEATURES ON W-SE LEVEL**" }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-109", "text": "For W-Se level, the information that a word and a sense carry is less than other levels." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-110", "text": "Hence, the basic features that can be extracted from the original word-sense pair are even less than Ph-W level." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-111", "text": "Therefore the basic features we use for W-Se level are as follows." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-112", "text": "String features." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-113", "text": "Two boolean string features are used." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-114", "text": "One records whether the word-sense pair shares the same POS tag and another records whether the word-sense pair share the same word." 
}, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-115", "text": "Knowledge based features." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-116", "text": "As described in Section 2.2, four knowledge-based word similarity measures based on WordNet are calculated." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-117", "text": "Multi-level text Features." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-118", "text": "The multi-level text features are the same as Ph-W level." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-119", "text": "In consideration of the lack of contextual infor-mation between word-sense pair, we also propose three information enrichment methods in order to generate more effective information for word and sense with the aid of WordNet." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-120", "text": "----------------------------------" }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-121", "text": "**EXPERIMENT AND RESULTS**" }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-122", "text": "We adopt supervised regression model for each cross level." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-123", "text": "In order to compare the performance of different regression algorithms, we perform 5-fold cross validation on training data for each cross level." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-124", "text": "We used several regression algorithms including Support Vector Regression (SVR) with 3 different kernels (i.e., linear, polynomial and rbf), Random Forest, Stochastic Gradient Descent (SGD) and Decision Tree implemented in the scikit-learn toolkit (Pedregosa et al., 2011) ." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-125", "text": "The system performance is evaluated in Pearson correlation (r) (official measure) and Spearman's rank correlation (\u03c1)." 
}, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-126", "text": "Table 1 and Table 2 show the averaged performance of different regression algorithms in terms of Pearson correlation (r) and Spearman's rank correlation (\u03c1) on the training data of P-S level and S-Ph level using 5-fold cross validation, where the standard deviation is given in brackets." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-127", "text": "The results show that Random Forest performs the best both on P-S level and S-Ph level whether in (r) or (\u03c1)." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-128", "text": "We also find that the results of P-S level are better than that of S-Ph level, and the reason may be that paragraph and sentence pair contain more information than the sentence and phrase pair." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-129", "text": "Table 2 : Results of different algorithms using 5-fold cross validation on training data of S-Ph level Table 3 shows the results of different regression algorithms and different feature sets in terms of r and \u03c1 on the training data of Ph-W level using 5-fold cross validation, where the basic features are denoted as Feature Set A and their combination with word definition expansion features are denoted as Feature Set B. The results show that almost all algorithms performance have been improved by using word definition expansion feature except Decision Tree." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-130", "text": "This proves the effectiveness of the information enrichment method we proposed in this level." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-131", "text": "Besides, Random Forest achieves the best performance again with r=44% and \u03c1=41%." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-132", "text": "However, in comparison with P-S level and S-Ph level, all scores in Table 3 drop a lot even with information enrichment method." 
}, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-133", "text": "The possible reason may be two: the reduction of information on Ph-W level and our information enrichment method brings in a certain noise as well." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-134", "text": "----------------------------------" }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-135", "text": "**RESULTS ON TRAINING DATA**" }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-136", "text": "For W-Se level, in order to examine the performance of different information enrichment methods, we perform experiments on 4 different feature sets from A to D, where feature set A contains the basic features, feature set B, C and D add one information enrichment method based on former feature set." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-137", "text": "Table 4 and 5 present the r and \u03c1 results of 4 feature sets using different regression algorithms." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-138", "text": "From Table 4 Table 3 : Results of different algorithms using 5-fold cross validation on training data of Ph-W level the performance of W-Se level is the worst among all these four levels." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-139", "text": "This illustrates that the less information the texts contain, the worse performance the model achieves." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-140", "text": "Again the Random Forest algorithm performs the best among all algorithms." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-141", "text": "Again almost all information enrichment features perform better than Feature set A. This illustrates that these information enrichment methods do help to improve performance." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-142", "text": "When we observe the three information enrichment methods, we find that feature set C performs the best." 
}, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-143", "text": "In comparison with feature set C, feature set B only used word synonyms to expand information and this expansion is quite limited." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-144", "text": "Feature set D performs better than B but still worse than C. The reason may be that when we extend sense with its definition, the definition is accurate and exactly represents the meaning of sense." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-145", "text": "However since a word often contains more than one concepts, and when we use the definition of the most common concept to extend word, such extension may not be correct and the generated information may contain more noise and/or change the original meaning of word." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-146", "text": "----------------------------------" }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-147", "text": "**RESULTS ON TEST DATA**" }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-148", "text": "According to the experiments on training data, we select Random Forest as the final regression algorithm." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-149", "text": "The number of trees in Random Forest n is optimized to 50 and the rest parameters are set to be default." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-150", "text": "All features in Section 2.2 are used on P-S level, S-Ph level and Ph-W level." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-151", "text": "For W-Se level, we take all features except word-sense definition expansion feature which has been shown to impair the system performance." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-152", "text": "For each level, all training examples are used to learn the corresponding regression model." 
}, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-153", "text": "According to the official results released by organizers, Table 6 and Table 7 list the top 3 systems in terms of r (official) and \u03c1." }, { "sent_id": "7cbe7dc02bbb53fe06d9215ed88fe0-C001-154", "text": "Our final systems rank the second both in terms of r and \u03c1 and also achieve the second place on P-S level, S-Ph level and Ph-W level, as well as the 4th place on W-Se level in terms of official Pearson correlation." } ], "y": { "@USE@": { "gold_contexts": [ [ "7cbe7dc02bbb53fe06d9215ed88fe0-C001-22" ], [ "7cbe7dc02bbb53fe06d9215ed88fe0-C001-51" ], [ "7cbe7dc02bbb53fe06d9215ed88fe0-C001-56" ], [ "7cbe7dc02bbb53fe06d9215ed88fe0-C001-64" ], [ "7cbe7dc02bbb53fe06d9215ed88fe0-C001-69" ] ], "cite_sentences": [ "7cbe7dc02bbb53fe06d9215ed88fe0-C001-22", "7cbe7dc02bbb53fe06d9215ed88fe0-C001-51", "7cbe7dc02bbb53fe06d9215ed88fe0-C001-56", "7cbe7dc02bbb53fe06d9215ed88fe0-C001-64", "7cbe7dc02bbb53fe06d9215ed88fe0-C001-69" ] }, "@BACK@": { "gold_contexts": [ [ "7cbe7dc02bbb53fe06d9215ed88fe0-C001-73" ] ], "cite_sentences": [ "7cbe7dc02bbb53fe06d9215ed88fe0-C001-73" ] }, "@EXT@": { "gold_contexts": [ [ "7cbe7dc02bbb53fe06d9215ed88fe0-C001-73", "7cbe7dc02bbb53fe06d9215ed88fe0-C001-74" ] ], "cite_sentences": [ "7cbe7dc02bbb53fe06d9215ed88fe0-C001-73" ] } } }, "ABC_60d39eec9573e42d7fb306b0f696c7_42": { "x": [ { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-2", "text": "State-of-the-art approaches of NER have used sequence-labeling BiLSTM as a core module." }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-3", "text": "This paper formally shows the limitation of BiLSTM in modeling cross-context patterns." }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-4", "text": "Two types of simple cross-structures -self-attention and Cross-BiLSTM -are shown to effectively remedy the problem." 
}, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-5", "text": "On both OntoNotes 5.0 and WNUT 2017, clear and consistent improvements are achieved over barebone models, up to 8.7% on some of the multi-token mentions." }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-6", "text": "In-depth analyses across several aspects of the improvements, especially the identification of multitoken mentions, are further given." }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-7", "text": "----------------------------------" }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-9", "text": "With state-of-the-art empirical results, most regard BiLSTM-CNN as a robust core module for sequence-labeling NER [1, 2, 3, 4, 5] ." }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-10", "text": "However, each direction of BiLSTM only sees and encodes half of a sequence at each time step." }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-11", "text": "For each token, the forward LSTM only encodes past context; the backward LSTM only encodes future context." }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-12", "text": "Both do not model the patterns that happen to cross past and future at this specific time step." }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-13", "text": "This paper explores two types of cross-structures to help cope with the problem: Cross-BiLSTM-CNN and Att-BiLSTM-CNN." }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-14", "text": "Section 2 formulates the three models, with Section 2.2 gives a concrete proof that patterns forming an XOR cannot be modeled by (Baseline-)BiLSTM-CNN used in all previous work." }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-15", "text": "Section 3 evaluates practical effectiveness of the approaches on two challenging NER datasets." 
}, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-16", "text": "The cross-structures bring consistent improvements over Baseline-BiLSTM-CNN without additional gazetteers, language-modeling, or multi-task supervision." }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-17", "text": "The improved core module surpasses comparable bare-bone models on OntoNotes and WNUT by 1.4% and 4.6% respectively." }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-18", "text": "Ablation experiments reveal that emerging, complex, confusing, and multi-token entity mentions benefitted much from the cross-structures, up to 8.7% on some of the multi-token mentions." }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-19", "text": "----------------------------------" }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-20", "text": "**MODEL**" }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-21", "text": "----------------------------------" }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-22", "text": "**(BASELINE-)BILSTM-CNN**" }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-23", "text": "For Baseline [2] , a CNN is used to compute character-level word features alongside word embedding and multi-layer BiLSTM is used to capture the future and the past for each time step:" }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-24", "text": "The probability of each token class is given by affine-softmax." }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-25", "text": "Using OSBIE sequential labels [2] , when there are P entity types, the number of token classes d p = P \u00d7 4 + 1." }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-26", "text": "First, note that the score vector at each time step is the sum of contributions of two directions." }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-27", "text": "Suppose the index of work-of-art:I and O are i, j respectively." 
}, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-28", "text": "Then, to predict each \"and\" correctly, it must hold that (superscripts denote the phrase number)" }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-29", "text": ", phrase 1 and phrase 3 have the same past context for \"and\", and hence the same" }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-30", "text": "Finally, summing the first two inequalities and the last two inequalities gives two contradicting constraints that cannot be satisfied." }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-31", "text": "In other words, even if an oracle is given to training the model, Baseline-BiLSTM-CNN can only tag at most 3 out of 4 \"and\" correctly." }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-32", "text": "No matter how many LSTM cells are stacked for each direction, the formulation in previous studies simply does not have enough modeling capacity to capture cross-context patterns for sequence labeling NER." }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-33", "text": "----------------------------------" }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-34", "text": "**CROSS-BILSTM-CNN**" }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-35", "text": "To resolve the problem, we proposes to use Cross-BiLSTM-CNN:" }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-36", "text": "As the forward and backward hidden states are interleaved between stacked LSTM layers, Cross-BiLSTM-CNN models cross-context patterns by computing representations of the whole sequence in a feed-forward, additive manner." 
}, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-37", "text": "----------------------------------" }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-38", "text": "**ATT-BILSTM-CNN**" }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-39", "text": "Another way to resolve the problem is to add a self-attentive mechanism [6] on baseline BiLSTM:" }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-40", "text": "Att-BiLSTM-CNN correlates past and future context for each token in a dot-product, multiplicative manner." }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-41", "text": "To see that, the computation of attention scores can be rewritten as follows." }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-42", "text": "----------------------------------" }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-43", "text": "**EXPERIMENTS**" }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-44", "text": "We evaluated on tow datasets: OntoNotes 5.0 Fine-Grained NER -a million-token corpus with diverse sources and 18 fine-grained entity types, including hard ones such as law, event, work-of-art [7, 8] ; WNUT 2017 Emerging NER -a corpus consists of noisy social media text, with text in the testing set containing surface forms seen in the training set filtered out [9, 10] ." }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-45", "text": "Overall Results." }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-46", "text": "Table 1 shows overall results." }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-47", "text": "Besides Baseline-, Cross-, and Att-BiLSTM-CNN, results of bare-bone BiLSTM-CNN [2] , CRF-BiLSTM(-BiLSTM) [11, 12] , and CRF-IDCNN [11] from the literature are also listed." }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-48", "text": "The models proposed in this paper surpassed previous reported bare-bone models by 1.4% on OntoNotes and 4.6% on WNUT." 
}, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-49", "text": "More substantial improvements were achieved for WNUT 2017 emerging NER, suggesting that cross-context patterns were even more crucial for emerging contexts and entities, which cannot be memorized by their surface forms." }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-50", "text": "Complex and Confusing Entity Mentions." }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-51", "text": "Table 2 shows significant results per entity type." }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-52", "text": "It could be seen that harder entity types generally benefitted more from the cross-structures." }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-53", "text": "For example, work-of-art/creative-work entities could in principle take any surface forms and written with unreliable capitalizations on social media, requiring models to learn better understanding of their context." }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-54", "text": "Both cross-structures were more capable in dealing with such hard entities (2.1%/5.6%/3.2%/2.0%) than Baseline." }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-55", "text": "Moreover, disambiguating fine-grained entity types is also a challenging task." }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-56", "text": "For example, entities of language and NORP often take the same surface forms." }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-57", "text": "Figure 1a shows a confusing example containing \"Dutch\" and \"English\", with the attention heat map (Figure 2a ) telling the story that Att has relied on its attention head to make context-aware decisions." }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-58", "text": "Both cross-structures were much better at disambiguating these fine-grained types (4.1%/0.8%/3.3%/3.4%)." }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-59", "text": "Multi-Token Entity Mentions." 
}, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-60", "text": "Table 3 shows results among different entity lengths." }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-61", "text": "It could be seen that cross-structures were much better at dealing with multi-token mentions (1.8%/2.3%/8.7%/2.6%)." }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-62", "text": "In fact, identifying correct mention boundaries for multi-token mentions requires modeling crosscontext." }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-63", "text": "For example, a token should be tagged as Inside if and only if it immediately follows a Begin or an I and is immediately followed by an I or an End." }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-64", "text": "Figure 1b shows a sentence with \"the White house\", a triple-token facility mention with unreliable capitalization, resulting in an emerging surface form." }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-65", "text": "While Att correctly tagged the three tokens, Baseline predicted a false single-token mention \"White\" without hints of a seen surface form." }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-66", "text": "Entity-Chunking." }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-67", "text": "We perform an ablation study focused on chunking tags to understand why cross-structures are better at locating multi-token mentions." }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-68", "text": "In 19) ." }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-69", "text": "Then, the quantitative results and the qualitative visualizations explains each other: C 2 and especially C 3 tended to focus on looking for preceding mention token (the diagonal shifted left in Figure 2b, 2c) , enabling them to signal for End and Inside; C 4 tended to focus on looking for succeeding mention token (the diagonal shifted right in Figure 2d ), enabling it to signal for Begin and Inside." 
}, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-70", "text": "In contrast, unable to model cross-context patterns, Baseline inadvertently retract to predict single-token entities (0.13 vs. -0.63, -0.41, -0.38) when hints from familiar surface forms are unavailable." }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-71", "text": "----------------------------------" }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-72", "text": "**CONCLUSION**" }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-73", "text": "We have formally analyzed the deficiency of the prevalently used BiLSTM-CNN in modeling crosscontext for NER." }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-74", "text": "A concrete proof of its inability to capture XOR patterns has been given." }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-75", "text": "Additive and multiplicative cross-structures have shown to be crucial in modeling cross-context, significantly enhancing recognition of emerging, complex, confusing, and multi-token entity mentions." }, { "sent_id": "60d39eec9573e42d7fb306b0f696c7-C001-76", "text": "Against comparable bare-bone models, 1.4% and 4.6% overall improvements on OntoNotes 5.0 and WNUT 2017 have been achieved, showing the importance of remedying the core module of NER." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "60d39eec9573e42d7fb306b0f696c7-C001-9" ] ], "cite_sentences": [ "60d39eec9573e42d7fb306b0f696c7-C001-9" ] }, "@MOT@": { "gold_contexts": [ [ "60d39eec9573e42d7fb306b0f696c7-C001-10", "60d39eec9573e42d7fb306b0f696c7-C001-13", "60d39eec9573e42d7fb306b0f696c7-C001-9" ] ], "cite_sentences": [ "60d39eec9573e42d7fb306b0f696c7-C001-9" ] }, "@USE@": { "gold_contexts": [ [ "60d39eec9573e42d7fb306b0f696c7-C001-23" ], [ "60d39eec9573e42d7fb306b0f696c7-C001-25" ], [ "60d39eec9573e42d7fb306b0f696c7-C001-47" ] ], "cite_sentences": [ "60d39eec9573e42d7fb306b0f696c7-C001-23", "60d39eec9573e42d7fb306b0f696c7-C001-25", "60d39eec9573e42d7fb306b0f696c7-C001-47" ] } } }, "ABC_52af1f5378194ccdb7c8f755a6ae34_42": { "x": [ { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-2", "text": "Adversarial training (AT) is a regularization method that can be used to improve the robustness of neural network methods by adding small perturbations in the training data." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-3", "text": "We show how to use AT for the tasks of entity recognition and relation extraction." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-4", "text": "In particular, we demonstrate that applying AT to a general purpose baseline model for jointly extracting entities and relations, allows improving the stateof-the-art effectiveness on several datasets in different contexts (i.e., news, biomedical, and real estate data) and for different languages (English and Dutch)." 
}, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-5", "text": "----------------------------------" }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-7", "text": "Many neural network methods have recently been exploited in various natural language processing (NLP) tasks, such as parsing , POS tagging (Lample et al., 2016) , relation extraction (dos Santos et al., 2015) , translation (Bahdanau et al., 2015) , and joint tasks (Miwa and Bansal, 2016) ." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-8", "text": "However, Szegedy et al. (2014) observed that intentional small scale perturbations (i.e., adversarial examples) to the input of such models may lead to incorrect decisions (with high confidence)." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-9", "text": "Goodfellow et al. (2015) proposed adversarial training (AT) (for image recognition) as a regularization method which uses a mixture of clean and adversarial examples to enhance the robustness of the model." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-10", "text": "Although AT has recently been applied in NLP tasks (e.g., text classification (Miyato et al., 2017) ), this paper -to the best of our knowledge -is the first attempt investigating regularization effects of AT in a joint setting for two related tasks." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-11", "text": "We start from a baseline joint model that performs the tasks of named entity recognition (NER) and relation extraction at once." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-12", "text": "Previously proposed models (summarized in Section 2) exhibit several issues that the neural network-based baseline approach (detailed in Section 3.1) overcomes: (i) our model uses automatically extracted features without the need of external parsers nor manually extracted features (see Gupta et al. 
(2016) ; Miwa and Bansal (2016) ; Li et al. (2017) ), (ii) all entities and the corresponding relations within the sentence are extracted at once, instead of examining one pair of entities at a time (see Adel and Sch\u00fctze (2017) ), and (iii) we model relation extraction in a multi-label setting, allowing multiple relations per entity (see Katiyar and Cardie (2017) ; Bekoulis et al. (2018a) )." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-13", "text": "The core contribution of the paper is the use of AT as an extension in the training procedure for the joint extraction task (Section 3.2)." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-14", "text": "To evaluate the proposed AT method, we perform a large scale experimental study in this joint task (see Section 4), using datasets from different contexts (i.e., news, biomedical, real estate) and languages (i.e., English, Dutch)." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-15", "text": "We use a strong baseline that outperforms all previous models that rely on automatically extracted features, achieving state-of-the-art performance (Section 5)." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-16", "text": "Compared to the baseline model, applying AT during training leads to a consistent additional increase in joint extraction effectiveness." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-17", "text": "----------------------------------" }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-18", "text": "**RELATED WORK**" }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-19", "text": "Joint entity and relation extraction: Joint models (Li and Ji, 2014; Miwa and Sasaki, 2014) that are based on manually extracted features have been proposed for performing both the named entity recognition (NER) and relation extraction subtasks at once." 
}, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-20", "text": "These methods rely on the availability of NLP tools (e.g., POS taggers) or manually designed features leading to additional complexity." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-21", "text": "Neural network methods have been exploited to overcome this feature design issue and usually involve RNNs and CNNs (Miwa and Bansal, 2016; Zheng et al., 2017) ." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-22", "text": "Specifically, Miwa and Bansal (2016) as well as Li et al. (2017) apply bidirectional tree-structured RNNs for different contexts (i.e., news, biomedical) to capture syntactic information (using external dependency parsers)." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-23", "text": "Gupta et al. (2016) propose the use of various manually extracted features along with RNNs." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-24", "text": "Adel and Sch\u00fctze (2017) solve the simpler problem of entity classification (EC, assuming entity boundaries are given), instead of NER, and they replicate the context around the entities, feeding entity pairs to the relation extraction layer." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-25", "text": "Katiyar and Cardie (2017) investigate RNNs with attention without taking into account that relation labels are not mutually exclusive." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-26", "text": "Finally, Bekoulis et al. (2018a) use LSTMs in a joint model for extracting just one relation at a time, but increase the complexity of the NER part." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-27", "text": "Our baseline model enables simultaneous extraction of multiple relations from the same input." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-28", "text": "Then, we further extend this strong baseline using adversarial training." 
}, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-29", "text": "Adversarial training (AT) (Goodfellow et al., 2015) has been proposed to make classifiers more robust to input perturbations in the context of image recognition." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-30", "text": "In the context of NLP, several variants have been proposed for different tasks such as text classification (Miyato et al., 2017) , relation extraction (Wu et al., 2017) and POS tagging (Yasunaga et al., 2018) ." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-31", "text": "AT is considered as a regularization method." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-32", "text": "Unlike other regularization methods (i.e., dropout (Srivastava et al., 2014) , word dropout (Iyyer et al., 2015) ) that introduce random noise, AT generates perturbations that are variations of examples easily misclassified by the model." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-33", "text": "----------------------------------" }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-34", "text": "**MODEL**" }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-35", "text": "----------------------------------" }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-36", "text": "**JOINT LEARNING AS HEAD SELECTION**" }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-37", "text": "The baseline model, described in detail in Bekoulis et al. (2018b) , is illustrated in Fig. 1 ." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-38", "text": "It aims to detect (i) the type and the boundaries of the entities and (ii) the relations between them." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-39", "text": "The input is a sequence of tokens (i.e., sentence) w = w 1 , ..., w n ." 
}, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-40", "text": "We use character level embeddings to implicitly capture morphological features (e.g., prefixes and suffixes), representing each character by a vector (embedding)." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-41", "text": "The character embeddings are fed to a bidirectional LSTM (BiLSTM) to obtain the character-based representation of the word." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-42", "text": "We also use pre-trained word embeddings." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-43", "text": "Word and character embeddings are concatenated to form the final token representation, which is then fed to a BiLSTM layer to extract sequential information." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-44", "text": "For the NER task, we adopt the BIO (Beginning, Inside, Outside) encoding scheme." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-45", "text": "In Fig. 1 , the B-PER tag is assigned to the beginning token of a 'person' (PER) entity." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-46", "text": "For the prediction of the entity tags, we use: (i) a softmax approach for the entity classification (EC) task (assuming entity boundaries given) or (ii) a CRF approach where we identify both the type and the boundaries for each entity." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-47", "text": "During decoding, in the softmax setting, we greedily detect the entity types of the tokens (i.e., independent prediction)." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-48", "text": "Although independent distribution of types is reasonable for EC tasks, this is not the case when there are strong correlations between neighboring tags." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-49", "text": "For instance, the BIO encoding scheme imposes several constraints in the NER task (e.g., the B-PER and I-LOC tags cannot be sequential)." 
}, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-50", "text": "Motivated by this intuition, we use a linear-chain CRF for the NER task (Lample et al., 2016) ." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-51", "text": "For decoding, in the CRF setting, we use the Viterbi algorithm." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-52", "text": "During training, for both EC (softmax) and NER tasks (CRF), we minimize the cross-entropy loss L NER ." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-53", "text": "The entity tags are later fed into the relation extraction layer as label embeddings (see Fig. 1 ), assuming that knowledge of the entity types is beneficial in predicting the relations between the involved entities." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-54", "text": "We model the relation extraction task as a multi-label head selection problem (Bekoulis et al., 2018b; )." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-55", "text": "In our model, each word w i can be involved in multiple relations with other words." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-56", "text": "For instance, in the example illustrated in Fig. 1 , \"Smith\" could be involved not only in a Lives in relation with the token \"California\" (head) but also in other relations simultaneously (e.g., Works for, Born In with some corresponding tokens)." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-57", "text": "The goal of the task is to predict for each word w i , a vector of heads\u0177 i and the vector of corresponding relationsr i ." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-58", "text": "We compute the score s(w j , w i , r k ) of word w j to be the head of w i given a relation label r k using a single layer neural network." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-59", "text": "The corresponding probability is defined as: P(w j , r k | w i ; \u03b8) = \u03c3(s(w j , w i , r k )), where \u03c3(.) 
is the sigmoid function." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-60", "text": "During training, we minimize the cross-entropy loss L rel as:" }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-61", "text": "where m is the number of associated heads (and thus relations) per word w i ." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-62", "text": "During decoding, the most probable heads and relations are selected using threshold-based prediction." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-63", "text": "The final objective for the joint task is computed as L JOINT (w; \u03b8) = L NER + L rel where \u03b8 is a set of parameters." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-64", "text": "In the case of multi-token entities, only the last token of the entity can serve as head of another token, to eliminate redundant relations." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-65", "text": "If an entity is not involved in any relation, we predict the auxiliary \"N\" relation label and the token itself as head." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-66", "text": "----------------------------------" }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-67", "text": "**ADVERSARIAL TRAINING (AT)**" }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-68", "text": "We exploit the idea of AT (Goodfellow et al., 2015) as a regularization method to make our model robust to input perturbations." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-69", "text": "Specifically, we generate examples which are variations of the original ones by adding some noise at the level of the concatenated word representation (Miyato et al., 2017) ." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-70", "text": "This is similar to the concept introduced by Goodfellow et al. (2015) to improve the robustness of image recognition classifiers." 
}, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-71", "text": "We generate an adversarial example by adding the worst-case perturbation \u03b7 adv to the original embedding w that maximizes the loss function:" }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-72", "text": "where\u03b8 is a copy of the current model parameters." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-73", "text": "Since Eq. (2) is intractable in neural networks, we use the approximation proposed in Goodfellow et al. (2015) defined as: \u03b7 adv = g/ g , with g = \u2207 w L JOINT (w;\u03b8), where is a small bounded norm treated as a hyperparameter." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-74", "text": "Similar to Yasunaga et al. (2018) , we set to be \u03b1 \u221a D (where D is the dimension of the embeddings)." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-75", "text": "We train on the mixture of original and adversarial examples, so the final loss is computed as: L JOINT (w;\u03b8) + L JOINT (w + \u03b7 adv ;\u03b8)." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-76", "text": "----------------------------------" }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-77", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-78", "text": "We evaluate our models on four datasets, using the code as available from our github codebase." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-79", "text": "1 Specifically, we follow the 5-fold crossvalidation defined by Miwa and Bansal (2016) for the ACE04 (Doddington et al., 2004) dataset." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-80", "text": "For the CoNLL04 (Roth and Yih, 2004 ) EC task (assuming boundaries are given), we use the same splits as in Gupta et al. (2016) ; Adel and Sch\u00fctze (2017) ." 
}, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-81", "text": "We also evaluate our models on the NER task similar to Miwa and Sasaki (2014) in the same dataset using 10-fold cross validation." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-82", "text": "For the Dutch Real Estate Classifieds, DREC (Bekoulis et al., 2017) dataset, we use train-test splits as in Bekoulis et al. (2018a) ." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-83", "text": "For the Adverse Drug Events, ADE (Gurulingappa et al., 2012) , we perform 10-fold cross-validation similar to Li et al. (2017) ." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-84", "text": "To obtain comparable results that are not affected by the input embeddings, we use the embeddings of the previous works." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-85", "text": "We employ early stopping in all of the experiments." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-86", "text": "We use the Adam optimizer (Kingma and Ba, 2015) and we fix the hyperparameters (i.e., \u03b1, dropout values, best epoch, learning rate) on the validation sets." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-87", "text": "The scaling parameter \u03b1 is selected from {5e\u22122, 1e\u22122, 1e\u22123, 1e\u22124}. Larger values of \u03b1 (i.e., larger perturbations) lead to consistent performance decrease in our early experiments." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-88", "text": "This can be explained from the fact that adding more noise can change the content of the sentence as also reported by Wu et al. (2017) ." 
}, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-89", "text": "We use three types of evaluation, namely: (i) S(trict): we score an entity as correct if both the entity boundaries and the entity type are correct (ACE04, ADE, CoNLL04, DREC); (ii) B(oundaries): we score an entity as correct if only the entity boundaries are correct, while the entity type is not taken into account (DREC); and (iii) R(elaxed): a multi-token entity is considered" }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-90", "text": "Table 1: Comparison of our method with the state-of-the-art in terms of F1 score. The proposed models are: (i) baseline, (ii) baseline EC (predicts only entity classes) and (iii) baseline (EC) + AT (regularized by AT)." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-91", "text": "The \u2713 and \u2717 symbols indicate whether the models rely on external NLP tools." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-92", "text": "We include different evaluation types (S, R and B)." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-93", "text": "correct if at least one correct type is assigned to the tokens comprising the entity, assuming that the boundaries are known (CoNLL04), to compare to previous works." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-94", "text": "In all cases, a relation is considered correct when both the relation type and the argument entities are correct." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-95", "text": "Table 1 shows our experimental results." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-96", "text": "The name of the dataset is presented in the first column while the models are listed in the second column." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-97", "text": "The proposed models are the following: (i) baseline: the baseline model shown in Fig.
1 with the CRF layer and the sigmoid loss, (ii) baseline EC: the proposed model with the softmax layer for EC, (iii) baseline (EC) + AT: the baseline regularized using AT." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-98", "text": "The final three columns present the F 1 results for the two subtasks and their average performance." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-99", "text": "Bold values indicate the best results among models that use only automatically extracted features." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-100", "text": "For ACE04, the baseline outperforms Katiyar and Cardie (2017) by \u223c2% in both tasks." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-101", "text": "This improvement can be explained by the use of: (i) multi-label head selection, (ii) CRF-layer and (iii) character level embeddings." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-102", "text": "Compared to Miwa and Bansal (2016) , who rely on NLP tools, the baseline performs within a reasonable margin (less than 1%) on the joint task." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-103", "text": "On the other hand, Li et al. (2017) use the same model for the ADE biomedical dataset, where we report a 2.5% overall improvement." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-104", "text": "This indicates that NLP tools are not always accurate for various contexts." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-105", "text": "For the CoNLL04 dataset, we use two evaluation settings." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-106", "text": "We use the relaxed evaluation similar to Gupta et al. (2016) ; Adel and Sch\u00fctze (2017) on the EC task." 
}, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-107", "text": "The baseline model outperforms the state-of-the-art models that do not rely on manually extracted features (>4% improvement for both tasks), since we directly model the whole sentence, instead of just considering pairs of entities." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-108", "text": "Moreover, compared to the model of Gupta et al. (2016) that relies on complex features, the baseline model performs within a margin of 1% in terms of overall F1 score." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-109", "text": "We also report NER results on the same dataset and improve the overall F1 score by \u223c1% compared to Miwa and Sasaki (2014), indicating that our automatically extracted features are more informative than the hand-crafted ones." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-110", "text": "These automatically extracted features owe their performance improvement mainly to the shared LSTM layer, which learns to automatically generate feature representations of entities and their corresponding relations within a single model." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-111", "text": "For the DREC dataset, we use two evaluation methods." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-112", "text": "In the boundaries evaluation, the baseline has an improvement of \u223c3% on both tasks compared to Bekoulis et al. (2018a), whose quadratic scoring layer complicates NER." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-113", "text": "Table 1 and Fig. 2 show the effectiveness of the adversarial training on top of the baseline model." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-114", "text": "In all of the experiments, AT improves the predictive performance of the baseline model in the joint setting." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-115", "text": "Moreover, as seen in Fig.
2, the performance of the models using AT is closer to the maximum even from the early training epochs." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-116", "text": "Specifically, for ACE04, there is an improvement in both tasks as well as in the overall F1 performance (0.4%)." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-117", "text": "For CoNLL04, we note an improvement in overall F1 of 0.4% for the EC and 0.8% for the NER task, respectively." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-118", "text": "For the DREC dataset, in both settings, there is an overall improvement of \u223c1%." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-119", "text": "Figure 2 shows that from the first epochs, the model obtains its maximum performance on the DREC validation set." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-120", "text": "Finally, for ADE, our AT model beats the baseline F1 by 0.7%." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-121", "text": "----------------------------------" }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-122", "text": "**RESULTS**" }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-123", "text": "Our results demonstrate that AT outperforms the neural baseline model consistently, considering our experiments across multiple and more diverse datasets than typical related works." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-124", "text": "The improvement of AT over our baseline (depending on the dataset) ranges from \u223c0.4% to \u223c0.9% in terms of overall F1 score." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-125", "text": "This seemingly small performance increase is mainly due to the limited performance benefit for the NER component, which is in accordance with the recent advances in NER using neural networks that report similarly small gains (e.g., the performance improvement in Ma and Hovy (2016) and Lample et al.
(2016) on the CoNLL-2003 test set is 0.01 and 0.17 F1 percentage points, while in the work of Yasunaga et al. (2018), a 0.07 F1 percentage point improvement on CoNLL-2000 using AT for NER is reported)." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-126", "text": "However, the relation extraction performance increases by \u223c1 F1 percentage point, except for the ACE04 dataset." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-127", "text": "Further, as seen in Fig. 2, the improvement for CoNLL04 is particularly small on the evaluation set." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-128", "text": "This may indicate a correlation between the dataset size and the benefit of adversarial training in the context of joint models, but this needs further investigation in future work." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-129", "text": "----------------------------------" }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-130", "text": "**CONCLUSION**" }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-131", "text": "We proposed to use adversarial training (AT) for the joint task of entity recognition and relation extraction." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-132", "text": "The contribution of this study is twofold: (i) investigation of the consistent effectiveness of AT as a regularization method over a multi-context baseline joint model, with (ii) a large scale experimental evaluation." }, { "sent_id": "52af1f5378194ccdb7c8f755a6ae34-C001-133", "text": "Experiments show that AT improves the results for each task separately, as well as the overall performance of the baseline joint model, while reaching high performance already during the first epochs of the training procedure."
} ], "y": { "@DIF@": { "gold_contexts": [ [ "52af1f5378194ccdb7c8f755a6ae34-C001-12" ] ], "cite_sentences": [ "52af1f5378194ccdb7c8f755a6ae34-C001-12" ] }, "@BACK@": { "gold_contexts": [ [ "52af1f5378194ccdb7c8f755a6ae34-C001-24" ] ], "cite_sentences": [ "52af1f5378194ccdb7c8f755a6ae34-C001-24" ] }, "@USE@": { "gold_contexts": [ [ "52af1f5378194ccdb7c8f755a6ae34-C001-80" ] ], "cite_sentences": [ "52af1f5378194ccdb7c8f755a6ae34-C001-80" ] }, "@SIM@": { "gold_contexts": [ [ "52af1f5378194ccdb7c8f755a6ae34-C001-106" ] ], "cite_sentences": [ "52af1f5378194ccdb7c8f755a6ae34-C001-106" ] } } }, "ABC_4c71ac789203d50e3e428458c2f88c_42": { "x": [ { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-2", "text": "Structured data summarization involves generation of natural language summaries from structured input data." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-3", "text": "In this work, we consider summarizing structured data occurring in the form of tables as they are prevalent across a wide variety of domains." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-112", "text": "Evaluation metrics: To evaluate our models we employed BLEU and Rouge-L scores." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-4", "text": "We formulate the standard table summarization problem, which deals with tables conforming to a single predefined schema." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-5", "text": "To this end, we propose a mixed hierarchical attention based encoderdecoder model which is able to leverage the structure in addition to the content of the tables." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-6", "text": "Our experiments on the publicly available WEATHERGOV dataset show around 18 BLEU (\u223c 30%) improvement over the current state-of-the-art." 
}, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-7", "text": "----------------------------------" }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-9", "text": "Abstractive summarization techniques from structured data seek to exploit both structure and content of the input data." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-10", "text": "The type of structure on the input side can be highly varied, ranging from key-value pairs (e.g. WIKIBIO (Lebret et al., 2016)), source code (Iyer et al., 2016), ontologies (Androutsopoulos et al., 2014; Colin et al., 2016), or tables (Wiseman et al., 2017), each of which requires a significantly different approach." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-11", "text": "In this paper, we focus on generating summaries from tabular data." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-12", "text": "Now, in most practical applications such as finance, healthcare or weather, data in a table are arranged in rows and columns where the schema is known beforehand." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-13", "text": "However, changes in the actual data values can necessitate drastically different output summaries." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-14", "text": "The examples shown in Figure 1 have a predefined schema obtained from the WEATHERGOV dataset (Liang et al., 2009) and its corresponding weather report summary." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-15", "text": "Therefore, the problem that we seek to address in this paper is to generate abstractive summaries of tables conforming to a predefined fixed schema (as opposed to cases where the schema is unknown)." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-16", "text": "We refer to this setting as the standard table summarization problem."
}, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-17", "text": "Another problem that could be formulated is one in which the output summary is generated from multiple tables as proposed in a recent challenge (Wiseman et al., 2017 ) (this setting is out of the scope of this paper)." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-18", "text": "Now, as the schema is fixed, simple rule based techniques (Konstas and Lapata, 2013) or template based solutions could be employed." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-19", "text": "However, due to the vast space of selection (which attributes to use in the summary based on the current value it takes) and generation (how to express these selected attributes in natural language) choices possible, such approaches are not scalable in terms of the number of templates as they demand hand-crafted rules for both selection and generation." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-20", "text": "We attempt to solve the problem of standard table summarization by leveraging the hierarchical nature of fixed-schema tables." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-21", "text": "In other words, rows consist of a fixed set of attributes and a table is defined by a set of rows." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-22", "text": "We cast this problem into a mixed hierarchical attention model following the encode-attend-decode (Cho et al., 2015) paradigm." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-23", "text": "In this approach, there is static attention on the attributes to compute the row representation followed by dynamic attention on the rows, which is subsequently fed to the decoder." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-24", "text": "This formulation is theoretically more efficient than the fully dynamic hierarchical attention framework followed by Nallapati et al. (2016) ." 
}, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-25", "text": "Also, our model does not need sophisticated sampling or sparsifying techniques like (Deng et al., 2017), thus retaining differentiability." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-26", "text": "To demonstrate the efficacy of our approach, we transform the publicly available WEATHERGOV dataset (Liang et al., 2009) into fixed-schema tables, which is then used for our experiments." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-27", "text": "Our proposed mixed hierarchical attention model provides an improvement of around 18 BLEU (around 30%) over the current state-of-the-art result by Mei et al. (2016). [Figure caption: Table Summarization with fixed-schema tables as input]" }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-28", "text": "----------------------------------" }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-29", "text": "**TABULAR DATA SUMMARIZATION**" }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-30", "text": "A standard table consists of a set of records (or rows) R = (r_1, r_2, ..., r_T) and each record r has a fixed set of attributes (or columns) A_r = (a_r1, a_r2, ..., a_rM)." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-31", "text": "The tables in Figure 1 have 7 columns (apart from 'TYPE') which correspond to different attributes." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-32", "text": "Also, U = (u_1, u_2, ..., u_T) represents the type of each record, where u_k is a one-hot encoding of the record type for record r_k." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-33", "text": "Training data consists of instance pairs (X_i, Y_i) for i = 1, 2, ..., n, where X_i = (R_i, U_i) represents the input table and Y_i = (y_1, ..., y_T) represents the corresponding natural language summary."
}, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-34", "text": "In this paper, we propose an end-to-end model which takes in a table instance X to produce the output summary Y." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-35", "text": "This can be derived by solving for Y the following conditional probability objective:" }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-36", "text": "----------------------------------" }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-37", "text": "**MIXED HIERARCHICAL ATTENTION MODEL (MHAM)**" }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-38", "text": "Our model is based on the encode-attend-decode paradigm as defined by Cho et al. (2015)." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-39", "text": "It consists of an encoder RNN which encodes a variable-length input sequence x = (x_1, ..., x_T) into a representation sequence c = (c_1, ..., c_T)." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-40", "text": "Another decoder RNN generates the sequence of output symbols. As illustrated in Figure 2, our encoder is not a single RNN." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-41", "text": "The encoder has a hierarchical structure to leverage the structural aspect of a standard table: a table consists of a set of records (or rows) and each record consists of values corresponding to a fixed set of attributes." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-42", "text": "We call it a mixed hierarchical attention based encoder, as it incorporates static attention and dynamic attention at two different levels of the encoder." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-43", "text": "At the record level, the attention over record representations is dynamic as it changes with each decoder time step."
}, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-44", "text": "Whereas at the attribute level, since the schema is fixed, a record representation can be computed without the need for varying attention over attributes; thus static attention is used." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-45", "text": "For example, with respect to the WEATHERGOV dataset, a temperature record will always be defined by attributes like min, max and mean, irrespective of the decoder time step." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-46", "text": "So, attention over attributes can be static." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-47", "text": "On the other hand, while generating a word by the decoder, there can be a preference for focusing on one record type, say temperature, over some other type, say windSpeed." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-48", "text": "Thus, dynamic attention is used across records." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-49", "text": "Capturing attribute semantics: We learn record type embeddings and use them to calculate attentions over attributes." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-50", "text": "For the trivial case of all records being of the same type, it boils down to having a single record type embedding." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-51", "text": "Given attributes A_r for a record r, where each attribute a_ri is encoded into a vector A_ri based on the attribute type (discussed further in section 3), we embed each attribute using equation 2, where W_j is the embedding matrix for the j-th attribute." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-52", "text": "We embed the record type one-hot vector u_r through W_0, which is used to compute the importance score I_rj for attribute j in record r according to equation 3."
}, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-53", "text": "Static Attribute attention: Not all attribute values contribute equally to the record." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-54", "text": "Hence, we introduce attention weights for the attributes of each record." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-55", "text": "These attention weights are static and do not change with the decoder time step." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-56", "text": "We calculate the attention probability vector \u03b1_r over attributes using the attribute importance vector I_r." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-57", "text": "The attention weights can then be used to calculate the record representation B_r for record r by using equations 4 and 5." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-58", "text": "Record Encoder: A GRU-based RNN encoder takes as input a sequence of attribute-attended records B_1:N and returns a sequence of hidden states h_1:N, where h_r is the encoded vector for record B_r." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-59", "text": "We obtain the final record encoding c_r (equation 6) by concatenating the GRU hidden states with the embedded record encodings B_r." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-60", "text": "----------------------------------" }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-61", "text": "**STATIC RECORD ATTENTION:**" }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-62", "text": "In a table, a subset of record types can always be more salient compared to other record types." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-63", "text": "This is captured by learning a static set of weights over all the records." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-64", "text": "These weights regulate the dynamic attention weights computed during decoding at each time step."
}, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-65", "text": "Equation 7 performs this step, where g_r is the static record attention weight for the r-th record and q and P are weights to be learnt." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-66", "text": "We do not have any constraints on the static attention vector." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-67", "text": "Dynamic Record attention for Decoder: Our decoder is a GRU-based decoder with a dynamic attention mechanism similar to (Mei et al., 2016), with modifications to modulate the attention weights at each time step using the static record attentions." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-68", "text": "At each time step t, attention weights are calculated using equations 8, 9 and 10, where \u03b3_rt is the aggregated attention weight of record r at time step t. We use soft attention over the input encoder sequence c_r to calculate the weighted average, which is passed to the GRU." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-69", "text": "The GRU hidden state s_t is used to calculate the output probabilities p_t by using a softmax as described by equations 11, 12 and 13, which are then used to get the output word y_t." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-70", "text": "Due to the static attention at the attribute level, the time complexity of a single pass is O(TM + TT'), where T is the number of records, M is the number of attributes and T' is the number of decoder steps." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-71", "text": "In the case of dynamic attention at both levels (as in Nallapati et al. (2016)), the time complexity is much higher: O(TMT')." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-72", "text": "Thus, the mixed hierarchical attention model is faster than fully dynamic hierarchical attention."
}, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-73", "text": "For a better understanding of the contribution of hierarchical attention (MHAM), we propose a simpler non-hierarchical (NHM) architecture with attention only at the record level." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-74", "text": "In NHM, B_r is calculated by concatenating all the record attributes along with the corresponding record type." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-75", "text": "Input Table (TYPE | TIME | MIN | MAX | MEAN | MODE | MB100-4 | MB20-2): temperature | 17-30 | 27 | 36 | 29 | - | - | -; windChill | 17-30 | 14 | 26 | 18 | - | - | -; windSpeed | 17-30 | 14 | 20 | 16 | - | - | 10-20; windDir | 17-30 | - | - | - | SSW | - | -; gust | 17-30 | 0 | 0 | 0 | - | - | -; skyCover | 17-30 | - | - | - | - | 75-100 | -; skyCover | 17-21 | - | - | - | - | 75-100 | -; skyCover | 17-26 | - | - | - | - | 75-100 | -; skyCover | 21-30 | - | - | - | - | 75-100 | -; skyCover | 26-30 | - | - | - | - | 75-100 | -; precipPotential | 17-30 | 26 | 58 | 43 | - | - | -; rainChance. Generated Output -- Reference: Rain and snow likely, becoming all snow after 8pm." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-76", "text": "Cloudy, with a low around 22." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-77", "text": "South southwest wind around 15 mph." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-78", "text": "Chance of precipitation is 60%." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-79", "text": "New snow accumulation of less than one inch possible." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-80", "text": "NHM: Rain or freezing rain likely before 8pm, then snow after 11pm, snow showers and sleet likely before 8pm, then a chance of rain or freezing rain after 3am. Mostly cloudy, with a low around 27." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-81", "text": "South southeast wind between 15 and 17 mph." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-82", "text": "Chance of precipitation is 80%." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-83", "text": "Little or no ice accumulation expected."
}, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-84", "text": "Little or no snow accumulation expected." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-85", "text": "MHAM: Snow, and freezing rain, snow after 9pm." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-86", "text": "Cloudy, with a steady temperature around 23." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-87", "text": "Breezy, with a south wind between 15 and 20 mph." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-88", "text": "Chance of precipitation is 60%." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-89", "text": "New snow accumulation of around an inch possible." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-90", "text": "Table 1: Anecdotal example." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-91", "text": "Records which contain all null attributes are not shown in the example." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-92", "text": "MB100-4 and MB20-2 correspond to mode-bucket-0-100-4 and mode-bucket-0-20-2 resp." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-93", "text": "in the dataset." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-94", "text": "----------------------------------" }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-95", "text": "**EXPERIMENTS**" }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-96", "text": "Dataset and methodology: To evaluate our model we have used the WEATHERGOV dataset (Liang et al., 2009), which is the standard benchmark dataset for evaluating tabular data summarization techniques." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-97", "text": "We compared the performance of our model against the state-of-the-art work of MBW (Mei et al., 2016), as well as two other baseline models, KL (Konstas and Lapata, 2013) and ALK (Angeli et al., 2010). [...] '00010000000000' and '00001000000000' respectively."
}, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-98", "text": "Similar encodings are used for directions; for example, 'NW', 'NNE' and 'NE' are encoded as '00000100000000', '00000011000000' and '00000001000000' resp." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-99", "text": "Time intervals were also given ordinal encodings; for example, '6-21' is encoded as '111100' and '6-13' is encoded as '110000', the six bits corresponding to the six atomic time intervals available in the dataset." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-100", "text": "Other attributes and words were encoded as one-hot vectors." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-101", "text": "----------------------------------" }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-102", "text": "**TRAINING AND HYPERPARAMETER TUNING**" }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-103", "text": "We used TensorFlow (Abadi et al., 2015) for our experiments." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-104", "text": "Encoder embeddings were initialized by generating the values from a uniform distribution in the range [-1, 1)." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-105", "text": "Other variables were initialized using Glorot uniform initialization (Glorot and Bengio, 2010)." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-106", "text": "We tuned each hyperparameter by choosing from a range of values, and selected the model with the best sBLEU score on the validation set over 500 epochs." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-107", "text": "We did not use any regularization while training the model." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-108", "text": "For both models, hyperparameter tuning was performed separately to give each model a fair chance."
}, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-109", "text": "For both models, the Adam optimizer (Kingma and Ba, 2014) was used with the learning rate set to 0.0001." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-110", "text": "We found an embedding size of 100, a GRU size of 400, and a static record attention size P of 150 to work best for the MHAM model." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-111", "text": "We also experimented with using a bi-directional GRU in the encoder but no significant boost was observed in the BLEU scores." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-113", "text": "In addition to the standard BLEU (sBleu) (Papineni et al., 2002), a customized BLEU (cBleu) (Mei et al., 2016) has also been reported." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-114", "text": "cBleu does not penalize numbers which differ by at most five; hence 20 and 18 will be considered the same." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-115", "text": "Table 2 describes the results of our proposed models (MHAM and NHM) along with the aforementioned baseline models." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-116", "text": "We observe a significant performance improvement of 16.6 cBleu points (24%) and 18.3 sBleu points (30%) compared to the current state-of-the-art model of MBW." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-117", "text": "MHAM also shows an improvement over NHM in all metrics, demonstrating the importance of hierarchical attention." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-118", "text": "Attention analysis: Analysis of Figure 3 reveals that the learnt attention weights are reasonable." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-119", "text": "For example, as shown in Figure 3(a), for the phrase 'with a high near 52', the model had a high attention on temperature before and while generating the number '52'."
}, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-120", "text": "Similarly, while generating 'mostly cloudy', the model had a high attention on precipitation potential." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-121", "text": "Attribute attentions are also learned as expected (Figure 3(b))." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-122", "text": "The temperature, wind speed and gust records have high weights on the min/max/mean values which describe these records." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-123", "text": "----------------------------------" }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-124", "text": "**RESULTS AND ANALYSES**" }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-125", "text": "Qualitative analysis: Table 1 contains example table-summary pairs, with summaries generated by the proposed hierarchical and non-hierarchical versions." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-126", "text": "We observe that our model is able to generate numbers more accurately by enabling hierarchical attention." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-127", "text": "Our model is also able to capture weak signals like snow accumulation." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-128", "text": "Further, our proposed model MHAM is able to avoid repetition as compared to NHM." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-129", "text": "----------------------------------" }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-130", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-131", "text": "In this work, we have formulated the problem of standard table summarization, where all the tables come from a predefined schema." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-132", "text": "Towards this, we proposed a novel mixed hierarchical attention based encoder-decoder approach."
}, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-133", "text": "Our experiments on the publicly available WEATHERGOV benchmark dataset have shown significant improvements over the current state-of-the-art work." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-134", "text": "Moreover, this proposed method is theoretically more efficient compared to the current fully dynamic hierarchical attention model." }, { "sent_id": "4c71ac789203d50e3e428458c2f88c-C001-135", "text": "As future work, we propose to tackle general tabular summarization where the schema can vary across tables in the whole dataset." } ], "y": { "@DIF@": { "gold_contexts": [ [ "4c71ac789203d50e3e428458c2f88c-C001-27" ] ], "cite_sentences": [ "4c71ac789203d50e3e428458c2f88c-C001-27" ] }, "@EXT@": { "gold_contexts": [ [ "4c71ac789203d50e3e428458c2f88c-C001-67" ] ], "cite_sentences": [ "4c71ac789203d50e3e428458c2f88c-C001-67" ] }, "@USE@": { "gold_contexts": [ [ "4c71ac789203d50e3e428458c2f88c-C001-97" ], [ "4c71ac789203d50e3e428458c2f88c-C001-113" ] ], "cite_sentences": [ "4c71ac789203d50e3e428458c2f88c-C001-97", "4c71ac789203d50e3e428458c2f88c-C001-113" ] } } }, "ABC_2292b2c0366ef12a5dd25e544f6b2d_42": { "x": [ { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-2", "text": "Response-based learning allows to adapt a statistical machine translation (SMT) system to an extrinsic task by extracting supervision signals from task-specific feedback." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-3", "text": "In this paper, we elicit response signals for SMT adaptation by executing semantic parses of translated queries against the Freebase database." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-4", "text": "The challenge of our work lies in scaling semantic parsers to the lexical diversity of opendomain databases." 
}, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-5", "text": "We find that parser performance on incorrect English sentences, which is standardly ignored in parser evaluation, is key in model selection." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-6", "text": "In our experiments, the biggest improvements in F1-score for returning the correct answer from a semantic parse for a translated query are achieved by selecting a parser that is carefully enhanced by paraphrases and synonyms." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-7", "text": "----------------------------------" }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-9", "text": "In response-based learning for SMT, supervision signals are extracted from an extrinsic response to a machine translation, in contrast to using humangenerated reference translations for supervision." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-10", "text": "We apply this framework to a scenario in which a semantic parse of a translated database query is executed against the Freebase database." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-11", "text": "We view learning from such task-specific feedback as adaptation of SMT parameters to the task of translating opendomain database queries, thereby grounding SMT in the task of multilingual database access." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-12", "text": "The success criterion for this task is F1-score in returning the correct answer from a semantic parse of the translated query, rather than BLEU." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-13", "text": "Since the semantic parser provides feedback to the response-based learner and defines the final evaluation criterion, the challenge of the presented work lies in scaling the semantic parser to the lexical diversity of open-domain databases such as Freebase." 
}, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-14", "text": "Riezler et al. (2014) showed how to use response-based learning to adapt an SMT system to a semantic parser for the Geoquery domain." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-15", "text": "The state-of-the-art in semantic parsing on Geoquery achieves a parsing accuracy of over 82% (see Andreas et al. (2013) for an overview), while the state-of-the-art in semantic parsing on the Free917 data (Cai and Yates, 2013) achieves 68.5% accuracy (Berant and Liang, 2014) ." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-16", "text": "This is due to the lexical variability of Free917 (2,036 word types) compared to Geoquery (279 word types)." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-17", "text": "In this paper, we compare different ways of scaling up state-of-the-art semantic parsers for Freebase by adding synonyms and paraphrases." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-18", "text": "First, we consider Berant and Liang (2014) 's own extension of the semantic parser of Berant et al. (2013) by using paraphrases." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-19", "text": "Second, we apply WordNet synonyms (Miller, 1995) for selected parts of speech to the queries in the Free917 dataset." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-20", "text": "The new pairs of queries and logical forms are added to the dataset on which the semantic parsers are retrained." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-21", "text": "We find that both techniques of enhancing the lexical coverage of the semantic parsers result in improved parsing performance, and that the improvements add up nicely." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-22", "text": "However, improved parsing performance does not correspond to improved F1-score in answer retrieval when using the respective parser in a response-based learning framework." 
}, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-23", "text": "We show that in order to produce helpful feedback for responsebased learning, parser performance on incorrect En-glish queries needs to be taken into account, which is standardly ignored in parser evaluation." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-24", "text": "That is, for the purpose of parsing translated queries, a parser should retrieve correct answers for correct English queries (true positives), and must not retrieve correct answers for incorrect translations (false positives)." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-25", "text": "In order to measure false discovery rate, we prepare a test set of manually verified incorrect English in addition to a standard test set of original English queries." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-26", "text": "We show that if false discovery rate on incorrect English queries is taken into account in model selection, the semantic parser that yields best results for response-based learning in SMT can be found reliably." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-27", "text": "----------------------------------" }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-28", "text": "**RELATED WORK**" }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-29", "text": "Our work is most closely related to Riezler et al. (2014) ." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-30", "text": "We extend their application of responsebased learning for SMT to a larger and lexically more diverse dataset and show how to perform model selection in the environment from which response signals are obtained." 
}, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-31", "text": "In contrast to their work where a monolingual SMT-based approach (Andreas et al., 2013 ) is used as semantic parser, our work builds on existing parsers for Freebase, with a focus on exploiting paraphrasing and synonym extension for scaling semantic parsers to open-domain database queries." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-32", "text": "Response-based learning has been applied in previous work to semantic parsing itself (Kwiatowski et al. (2013) , Berant et al. (2013) , Goldwasser and Roth (2013) , inter alia)." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-33", "text": "In these works, extrinsic responses in form of correct answers from a database are used to alleviate the problem of manual data annotation in semantic parsing." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-34", "text": "Saluja et al. (2012) integrate human binary feedback on the quality of an SMT system output into a discriminative learner." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-35", "text": "Further work on learning from weak supervision signals has been presented in the machine learning community, e.g., in form of coactive learning (Shivaswamy and Joachims, 2012) , reinforcement learning (Sutton and Barto, 1998) , or online learning with limited feedback (Cesa-Bianchi and Lugosi, 2006) ." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-36", "text": "----------------------------------" }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-37", "text": "**RESPONSE-BASED ONLINE SMT LEARNING**" }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-38", "text": "We denote by \u03c6(x, y) a joint feature representation of input sentences x and output translations Algorithm 1 Response-based Online Learning repeat for i = 1, . . . 
, n do Receive input string x (i) Predict translation\u0177 Receive task feedback e(\u0177) \u2208 {1, 0} if e(\u0177) = 1 then y + \u2190\u0177 Store\u0177 as reference" }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-39", "text": ", and by s(x, y; w) = w, \u03c6(x, y) a linear scoring function for predicting a translation\u0177." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-40", "text": "A response signal is denoted by a binary function e(y) \u2208 {1, 0} that executes a semantic parse against the database and checks whether it receives the same answer as the gold standard parse." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-41", "text": "Furthermore, a cost function c(y (i) , y) = (1\u2212BLEU(y (i) , y)) based on sentence-wise BLEU (Nakov et al., 2012 ) is used." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-42", "text": "Algorithm 1, called \"Response-based Online Learning\" in Riezler et al. (2014) , is based on contrasting a \"positive\" translation y + that receives positive feedback, has a high model score, and a low cost of predicting y instead of y (i) , with a \"negative\" translation y \u2212 that leads to negative feedback, has a high model score, and a high cost:" }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-43", "text": "The central algorithm operates as follows: The SMT system predicts translation\u0177, and in case of positive task feedback, the prediction is accepted and stored as positive example by setting y + \u2190\u0177." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-44", "text": "In that case, y \u2212 needs to be computed in order to perform the stochastic gradient descent update of the weight vector." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-45", "text": "If the feedback is negative, the prediction is treated as y \u2212 and y + needs to be computed for the update." 
}, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-46", "text": "If either y + or y \u2212 cannot be computed, the example is skipped." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-47", "text": "----------------------------------" }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-48", "text": "**SCALING SEMANTIC PARSING TO OPEN-DOMAIN DATABASE QUERIES**" }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-49", "text": "The main challenge of grounding SMT in semantic parsing for Freebase lies in scaling the semantic parser to the lexical diversity of the open-domain database." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-50", "text": "Our baseline system is the parser of Berant et al. (2013) , called SEMPRE." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-51", "text": "We first consider the approach presented by Berant and Liang (2014) to scale the baseline to open-domain database queries:" }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-52", "text": "In their system, called PARASEMPRE, pairs of logical forms and utterances are generated from a given query and the database, and the pair whose utterance best paraphrases the input query is selected." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-53", "text": "These new pairs of queries and logical forms are added as ambiguous labels in training a model from queryanswer pairs." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-54", "text": "Following a similar idea of extending parser coverage by paraphrases, we extend the training set with synonyms from WordNet." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-55", "text": "This is done by iterating over the queries in the FREE917 dataset." 
}, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-56", "text": "To ensure that the replacement is sensible, each sentence is first POS tagged (Toutanova et al., 2003) and WordNet lookups are restricted to matching POS between synonym and query words, for nouns, verbs, adjectives and adverbs." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-57", "text": "Lastly, in order to limit the number of retrieved words, a WordNet lookup is performed by carefully choosing from the first three synsets which are ordered from most common to least frequently used sense." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-58", "text": "Within a synset all words are taken." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-59", "text": "The new training queries are appended to the training portion of FREE917." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-60", "text": "----------------------------------" }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-61", "text": "**MODEL SELECTION**" }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-62", "text": "The most straightforward strategy to perform model selection for the task of response-based learning for SMT is to rely on parsing evaluation scores that are standardly reported in the literature." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-63", "text": "However, as we will show experimentally, if precision is taken as the percentage of correct answers out of instances for which a parse could be produced, recall as the percentage of total examples for which a correct answer could be found, and F1 score as their harmonic mean, the metrics are not appropriate for model selection in our case." 
}, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-64", "text": "This is because for our goal of learning the language of correct English database queries from positive and negative parsing feedback, the semantic parser needs to be able to parse and retrieve correct answers for correct database queries, but it must not do so for incorrect queries." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-65", "text": "However, information about incorrect queries is ignored in the definition of the metrics given above." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-66", "text": "In fact, retrieving correct answers for incorrect database queries hurts response-based learning for SMT." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-67", "text": "The problem lies in the incomplete nature of semantic parsing databases, where terms that are not parsed into logical forms in one context make a crucial difference in another context." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-68", "text": "For example in Geoquery, the gold standard queries \"People in Boulder?\" and \"Number of people in Boulder?\" parse into the same logical form, however, the queries \"Give me the cities in Virginia\" and \"Give me the number of cities in Virginia\" have different parses and different answers." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-69", "text": "While in the first case, for example in German-to-English translation of database queries, the German \"Anzahl\" may be translated incorrectly without consequences, it is crucial to translate the term into \"number\" in the second case." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-70", "text": "On an example from Free917, the SMT system translates the German \"Steinformationen\" into \"kind of stone\", which is incorrect in the geological context, where it should be \"rock formations\"." 
}, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-71", "text": "If during response-based learning, the error slips through because of an incomplete parse leading to the correct answer, it might hurt on the test data." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-72", "text": "Negative parser feedback for incorrect translations is thus crucial for learning how to avoid these cases in response-based SMT." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-73", "text": "In order to evaluate parsing performance on incorrect translations, we need to extend standard evaluation data of correct English database queries with evaluation data of incorrect English database queries." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-74", "text": "For this purpose, we took translations of an out-of-domain SMT system that were judged either grammatically or semantically incorrect by the authors to create a dataset of negative examples." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-75", "text": "On this dataset, we can define true positives (TP) as correct English queries that were given a correct answer by the semantic parser, and false positives (FP) as wrong English queries that obtained the correct answer." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-76", "text": "The crucial evaluation metric is the false discovery rate (FDR) (Murphy, 2012) , defined as FP/FP+TP, i.e., as the ratio of false positives out of all positive answer retrieval events." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-77", "text": "----------------------------------" }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-78", "text": "**EXPERIMENTS**" }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-79", "text": "We use a data dump of Freebase 1 which was has been indexed by the Virtuoso SPARQL engine 2 as our knowledge base." 
}, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-80", "text": "The corpus used in the experiments is the FREE917 corpus as assembled by Cai and Yates (2013) and consists of 614 training and 276 test queries in English and corresponding logical forms." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-81", "text": "3 The dataset of negative examples, i.e., incorrect English database queries that should receive incorrect answers, consists of 166 examples that were judged either grammatically or semantically incorrect by the authors." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-82", "text": "The translation of the English queries in FREE917 into German, in order to provide a set of source sentences for SMT, was done by the authors." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-83", "text": "The SMT framework used is CDEC (Dyer et al., 2010) with standard dense features and additional sparse features as described in Simianer et al. (2012) 4 ." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-84", "text": "Training of the baseline SMT system was performed on the COMMON CRAWL 5 (Smith et al., 2013 ) dataset consisting of 7.5M parallel English-German segments extracted from the web." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-85", "text": "Response-based learning for SMT uses the code described in Riezler et al. (2014) 6 ." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-86", "text": "For semantic parsing we use the SEMPRE and PARASEMPRE tools of Berant et al. (2013) and Berant and Liang (2014) which were trained on the training portion of the FREE917 corpus 7 ." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-87", "text": "Further models use the training data enhanced with synonyms from WordNet as described in Section 4." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-88", "text": "Following Jones et al. 
(2012), we evaluate semantic parsers according to precision, defined as the percentage of correctly answered examples out of those for which a parse could be produced, recall, defined as the percentage of total examples answered correctly, and F1-score, defined as harmonic mean of precision and recall." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-89", "text": "Furthermore, we report false discovery rate (FDR) on the combined set of 276 correct and 166 incorrect database queries." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-90", "text": "Table 1 reports standard parsing evaluation metrics for the different parsers SEMPRE (S), PARASEMPRE (P), and extensions of the latter with synonyms from the first one (P1), first two (P2) and first three (P3) synsets which are ordered according to frequency of use of the sense." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-91", "text": "As shown in the second column, the size of the training data is increased up to 10 times by using various synonym extensions." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-92", "text": "As shown in the third column, PARASEM-PRE improves F1 by nearly 10 points over SEMPRE." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-93", "text": "Another 0.5 points are added by extending the training data using two synsets." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-94", "text": "The third column shows that the system P1 that scored second-worst in terms of F1 score, scores best under the FDR metric 8 ." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-95", "text": "Table 2 shows an evaluation of the use of different parsing models to retrieve correct answers from the FREE917 test set of correct database queries." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-96", "text": "The systems are applied to translated queries, but evaluated in terms of standard parsing metrics." 
}, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-97", "text": "Statistical significance is measured using an Approximate Randomization test (Noreen, 1989; Riezler and Maxwell, 2005) ." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-98", "text": "The baseline system is CDEC as described above." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-99", "text": "It never sees the FREE917 data during training." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-100", "text": "As a second baseline method we use a stochastic (sub)gradient descent variant of RAM-PION (Gimpel and Smith, 2012 trained by using the correct English queries in the FREE917 training data as references." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-101", "text": "Neither CDEC nor RAMPION use parser feedback in training." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-102", "text": "RE-BOL (Response-based Online Learning) is an implementation of Algorithm 1 described in Section 3." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-103", "text": "This algorithm makes use of positive parser feedback to convert predicted translation into references, in addition to using the original English queries as references." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-104", "text": "Training for both RAMPION and REBOL is performed for 10 epochs over the FREE917 training set, using a constant learning rate \u03b7 that was chosen via cross-validation." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-105", "text": "All methods then proceed to translate the FREE917 test set." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-106", "text": "Best results in Table 2 are obtained by using an extension of PARASEMPRE with one synset as parser in responsebased learning with REBOL." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-107", "text": "This parsing system scored best under the FDR metric in Table 1 ." 
}, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-108", "text": "Table 3 shows the Spearman rank correlation (Siegel and Castellan, 1988) between the F1 / FDR ranking of semantic parsers from Table 1 and their contribution to F1 scores in Table 2 for parsing query translations of CDEC, RAMPION or REBOL." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-109", "text": "The system CDEC cannot learn from parser performance based on query translations, thus best results on translated queries correlate positively with good parsing F1 score per se." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-110", "text": "RAMPION can implicitly take advantage of parsers with good FDR score since learning to move away from translations dissimilar to the reference is helpful if they do not lead to correct answers." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-111", "text": "REBOL can make the best use of parsers with low FDR score since it can learn to prevent incorrect translations from hurting parsing performance at test time." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-112", "text": "----------------------------------" }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-113", "text": "**CONCLUSION**" }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-114", "text": "We presented an adaptation of SMT to translating open-domain database queries by using feedback of a semantic parser to guide learning." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-115", "text": "Our work highlights an important aspect that is often overlooked in parser evaluation, namely that parser model selection in real-world applications needs to take the possibility of parsing incorrect language into account." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-116", "text": "We found that for our application of response-based learning for SMT, the key is to learn to prevent cases where the correct answer is retrieved despite the translation being incorrect." 
}, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-117", "text": "This can be avoided by performing model selection on semantic parsers that parse and retrieve correct answers for correct database queries, but do not do retrieve correct answers for incorrect queries." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-118", "text": "In our experiments, we found that the parser that contributes most to response-based learning in SMT is one that is carefully extended by paraphrases and synonyms." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-119", "text": "In future work, we would like to investigate additional techniques for paraphrasing and synonym extension." }, { "sent_id": "2292b2c0366ef12a5dd25e544f6b2d-C001-120", "text": "For example, a good fit for our task of response-based learning for SMT might be Bannard and Callison-Burch (2005) 's approach to paraphrasing via pivoting on SMT phrase tables." } ], "y": { "@USE@": { "gold_contexts": [ [ "2292b2c0366ef12a5dd25e544f6b2d-C001-18" ], [ "2292b2c0366ef12a5dd25e544f6b2d-C001-50" ], [ "2292b2c0366ef12a5dd25e544f6b2d-C001-86" ] ], "cite_sentences": [ "2292b2c0366ef12a5dd25e544f6b2d-C001-18", "2292b2c0366ef12a5dd25e544f6b2d-C001-50", "2292b2c0366ef12a5dd25e544f6b2d-C001-86" ] }, "@BACK@": { "gold_contexts": [ [ "2292b2c0366ef12a5dd25e544f6b2d-C001-32" ] ], "cite_sentences": [ "2292b2c0366ef12a5dd25e544f6b2d-C001-32" ] } } }, "ABC_0984f12a6fea858c7f18263cc2fb01_42": { "x": [ { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-20", "text": "Image-sentence ranking using monolingual multimodal datasets (Hodosh et al., 2013, inter-alia) is also a natural task for multilingual modelling." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-69", "text": "The translations thus have a wider vocabulary, despite being generated by a smaller number of authors." 
}, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-2", "text": "We introduce the Multi30K dataset to stimulate multilingual multimodal research." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-3", "text": "Recent advances in image description have been demonstrated on Englishlanguage datasets almost exclusively, but image description should not be limited to English." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-4", "text": "This dataset extends the Flickr30K dataset with i) German translations created by professional translators over a subset of the English descriptions, and ii) descriptions crowdsourced independently of the original English descriptions." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-5", "text": "We outline how the data can be used for multilingual image description and multimodal machine translation, but we anticipate the data will be useful for a broader range of tasks." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-6", "text": "----------------------------------" }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-8", "text": "Image description is one of the core challenges at the intersection of Natural Language Processing (NLP) and Computer Vision (CV) (Bernardi et al., 2016) ." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-9", "text": "This task has only received attention in a monolingual English setting, helped by the availability of English datasets, e.g. Flickr8K (Hodosh et al., 2013) , Flickr30K (Young et al., 2014) , and MS COCO (Chen et al., 2015) ." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-10", "text": "However, the possible applications of image description are useful for all languages, such as searching for images using natural language, or providing alternative-description text for visually impaired Web users." 
}, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-11", "text": "We introduce a large-scale dataset of images paired with sentences in English and German as an initial step towards studying the value and the characteristics of multilingual-multimodal data." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-12", "text": "Multi30K is an extension of the Flickr30K dataset (Young et al., 2014) with 31,014 German translations of English descriptions and 155,070 independently collected German descriptions." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-13", "text": "The translations were collected from professionally contracted translators, whereas the descriptions were collected from untrained crowdworkers." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-14", "text": "The key difference between these corpora is the relationship between the sentences in different languages." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-15", "text": "In the translated corpus, we know there is a strong correspondence between the sentences in both languages." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-16", "text": "In the descriptions corpus, we only know that the sentences, regardless of the language, are supposed to describe the same image." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-17", "text": "A dataset of images paired with sentences in multiple languages broadens the scope of multimodal NLP research." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-18", "text": "Image description with multilingual data can also be seen as machine translation in a multimodal context." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-19", "text": "This opens up new avenues for researchers in machine translation (Koehn et al., 2003; Chiang, 2005; Sutskever et al., 2014; Bahdanau et al., 2015, inter-alia) to work with multilingual multimodal data." 
}, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-21", "text": "The only existing datasets of images paired with multilingual sentences were created by professionally translating English into the target language: IAPR-TC12 with 20,000 English-German described images (Grubinger et al., 2006) , and the Pascal Sentences Dataset of 1,000 Japanese-English described images (Funaki and Nakayama, 2015) ." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-22", "text": "Multi30K dataset is larger than both of these and contains both independent and translated sentences." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-23", "text": "We hope this dataset will be of broad interest across NLP and CV research and anticipate that these communities will put the data to use in a broader range of tasks than we can foresee." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-24", "text": "The independent sentences are all accurate descriptions of the image but do not contain the same details in both languages, such as shirt colour or the scaffolding." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-25", "text": "In the second translation pair (bottom left) the translator has translated \"glide\" as \"schweben\" (\"to float\") probably due to not seeing the image context (see Section 2.1 for more details)." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-26", "text": "----------------------------------" }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-27", "text": "**THE MULTI30K DATASET**" }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-28", "text": "The Flickr30K Dataset contains 31,014 images sourced from online photo-sharing websites (Young et al., 2014) ." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-29", "text": "Each image is paired with five English descriptions, which were collected from Amazon Mechanical Turk 1 ." 
}, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-30", "text": "The dataset contains 145,000 training, 5,070 development, and 5,000 test descriptions." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-31", "text": "The Multi30K dataset extends the Flickr30K dataset with translated and independent German sentences." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-32", "text": "----------------------------------" }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-33", "text": "**TRANSLATIONS**" }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-34", "text": "The translations were collected from professional English-German translators contracted via an established Language Service in Germany." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-35", "text": "Figure 1 presents an example of the differences between the types of data." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-36", "text": "We collected one translated description per image, resulting in a total of 31,014 translations." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-37", "text": "To ensure an even distribution over description length, the English descriptions were chosen based on their relative length, with an equal number of longest, shortest, and median length source descriptions." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-38", "text": "We paid a total of €23,000 to collect the data (€0.06 per word)." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-39", "text": "Translators were shown an English-language sentence and asked to produce a correct and fluent translation for it in German, without seeing the image." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-40", "text": "We decided against showing the images to translators to make this as close as possible to a standard translation task, also making the data collected here distinct from the independent descriptions collected as described in Section 2.2." 
}, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-41", "text": "----------------------------------" }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-42", "text": "**INDEPENDENT DESCRIPTIONS**" }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-43", "text": "The descriptions were collected from crowdworkers via the Crowdflower platform." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-44", "text": "We collected five descriptions per image in the Flickr30K dataset, resulting in a total of 155,070 sentences." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-45", "text": "Workers were presented with a translated version of the data collection interface used by (Hodosh et al., 2013), as shown in Figure 2." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-46", "text": "We translated the interface to make the task as similar as possible to the crowdsourcing of the English sentences." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-47", "text": "The instructions were translated by one of the authors and checked by a native German Ph.D. student." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-49", "text": "185 crowdworkers took part in the task over a period of 31 days." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-50", "text": "We split the task into 1,000 randomly selected images per day to control the quality of the data and to prevent worker fatigue." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-51", "text": "Workers were required to have a German-language skill certification and be at least a Crowdflower Level 2 Worker: they have participated in at least 10 different Crowdflower jobs, have passed at least 100 quality-control questions, and have a job acceptance rate of at least 85%." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-52", "text": "The descriptions were collected in batches of five images per job." 
}, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-53", "text": "Each image was randomly selected from the complete set of 1,000 images for that day, and workers were limited to writing at most 250 descriptions per day." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-54", "text": "We paid workers $0.05 per description and limited them to spending at least 90s per job to discourage poor/low-quality work." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-55", "text": "(This works out at a rate of 40 jobs per hour, i.e. 200 descriptions/hour.) We configured Crowdflower to automatically ban users who worked faster than this rate." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-56", "text": "Thus the theoretical maximum wage was $10/hour." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-57", "text": "We paid a total of $9,591.24 towards collecting the data and paying the Crowdflower platform fees." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-58", "text": "During the collection of the data, we assessed the quality both by manually checking a subset of the descriptions and also with automated checks." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-59", "text": "We inspected the submissions of users who wrote sentences with fewer than five words, and users with high type-to-token ratios (to detect repetition)." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-60", "text": "We also used a character-level 6-gram LM to flag descriptions with high perplexity, which was very effective at catching nonsense sentences." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-61", "text": "In general we did not have to ban or reject many users and overall description quality was high." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-62", "text": "----------------------------------" }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-63", "text": "**TRANSLATED VS. 
INDEPENDENT DESCRIPTIONS**" }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-64", "text": "We now analyse the differences between the translated and the description corpora." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-65", "text": "For this analysis, all sentences were stripped of punctuation and truecased using the Moses truecaser.pl script trained over Europarl v7 and News Commentary v11 English-German parallel corpora." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-66", "text": "Table 1 shows the differences between the corpora." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-67", "text": "The German translations are longer than the independent descriptions (11.1 vs. 9.6 words), while the English descriptions selected for translation are slightly shorter, on average, than the Flickr30K average (11.9 vs. 12.3)." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-68", "text": "When we compare the German translation dataset against an equal number of sentences from the German descriptions dataset, we find that the translations also have more word types (19.3K vs. 17.6K), and more singleton types, i.e. types occurring only once (11.3K vs. 10.2K; in both datasets singletons comprise 58% of the vocabulary)." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-70", "text": "The English datasets (all descriptions vs. those selected for translation) show a similar trend, indicating that these differences may be a result of the decision to select equal numbers of short, medium, and long English sentences for translation." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-71", "text": "----------------------------------" }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-72", "text": "**ENGLISH VS. GERMAN**" }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-73", "text": "The English image descriptions are generally longer than the German descriptions, both in terms of number of words and characters." 
}, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-74", "text": "Note that the difference is much smaller when measuring characters: German uses 22% fewer words but only 2.5% fewer characters." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-75", "text": "However, we observe a different pattern in the translation corpora: German uses 6.6% fewer words than English but 17.1% more characters." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-76", "text": "The vocabularies of the German description and translation corpora are more than twice as large as those of the English corpora." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-77", "text": "Additionally, the German corpora have two to three times as many singletons." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-78", "text": "This is likely due to richer morphological variation in German, as well as word compounding." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-79", "text": "----------------------------------" }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-80", "text": "**DISCUSSION**" }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-81", "text": "The Multi30K dataset is immediately suitable for research on a wide range of tasks, including but not limited to automatic image description, image-sentence ranking, multimodal and multilingual semantics, and machine translation." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-82", "text": "----------------------------------" }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-83", "text": "**MULTI30K FOR IMAGE DESCRIPTION**" }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-84", "text": "Deep neural networks for image description typically integrate visual features into a recurrent neural network language model (Vinyals et al., 2015; Xu et al., 2015, inter-alia)." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-85", "text": "Elliott et al. 
(2015) demonstrated how to build multilingual image description models that learn and transfer features between monolingual image description models." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-86", "text": "They performed a series of experiments on the IAPR-TC12 dataset (Grubinger et al., 2006) of images aligned with German translations, showing that both English and German image description could be improved by transferring features from a multimodal neural language model trained to generate descriptions in the other language." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-87", "text": "The Multi30K dataset will enable further research in this direction, allowing researchers to work with larger datasets with multiple references per image." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-88", "text": "----------------------------------" }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-89", "text": "**MULTI30K FOR MACHINE TRANSLATION**" }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-90", "text": "Machine translation is typically performed using only textual data, for example news data, the Europarl corpora, or corpora harvested from the Web (CommonCrawl, Wikipedia, etc.)." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-91", "text": "The Multi30K dataset makes it possible to further develop machine translation in a setting where multimodal data, such as images or video, are observed alongside text." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-92", "text": "The potential advantages of using multimodal information for machine translation include the ability to better deal with ambiguous source text and to avoid (untranslated) out-of-vocabulary words in the target language (Calixto et al., 2012)." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-93", "text": "Hitschler and Riezler (2016) have demonstrated the potential of multimodal features in a target-side translation reranking model." 
}, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-94", "text": "Their approach is initially trained over large text-only translation corpora and then fine-tuned with a small amount of in-domain data, such as our dataset." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-95", "text": "We expect that a variety of translation models can be adapted to take advantage of multimodal data, as features in a log-linear model or as feature vectors in neural machine translation models." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-96", "text": "----------------------------------" }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-97", "text": "**CONCLUSIONS**" }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-98", "text": "We introduced Multi30K: a large-scale multilingual multimodal dataset for interdisciplinary machine learning research." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-99", "text": "Our dataset is an extension of the popular Flickr30K dataset with descriptions and professional translations in German." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-100", "text": "The descriptions were collected from a crowdsourcing platform, while the translations were collected from professionally contracted translators." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-101", "text": "These differences are deliberate and part of the larger scope of studying multilingual multimodal data in different contexts." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-102", "text": "The descriptions were collected as similarly as possible to the original Flickr30K dataset by translating the instructions used by Young et al. (2014) into German." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-103", "text": "The translations were collected without showing the images to the translators to keep it as close to a standard translation task as possible." 
}, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-104", "text": "There are substantial differences between the translated and the description datasets." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-105", "text": "The translations contain approximately the same number of tokens and have sentences of approximately the same length in both languages." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-106", "text": "These properties make them well suited to machine translation models." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-107", "text": "The description datasets are very different in terms of average sentence lengths and the number of word types per language." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-108", "text": "This is likely to cause different engineering and scientific challenges because the descriptions are independently collected corpora instead of a sentence-level aligned corpus." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-109", "text": "In the future, we want to study multilingual multimodality over a wider range of languages, for example beyond the Indo-European family." }, { "sent_id": "0984f12a6fea858c7f18263cc2fb01-C001-110", "text": "We call on the community to engage with us on creating massively multilingual multimodal datasets." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "0984f12a6fea858c7f18263cc2fb01-C001-9" ], [ "0984f12a6fea858c7f18263cc2fb01-C001-28" ] ], "cite_sentences": [ "0984f12a6fea858c7f18263cc2fb01-C001-9", "0984f12a6fea858c7f18263cc2fb01-C001-28" ] }, "@EXT@": { "gold_contexts": [ [ "0984f12a6fea858c7f18263cc2fb01-C001-12" ], [ "0984f12a6fea858c7f18263cc2fb01-C001-28", "0984f12a6fea858c7f18263cc2fb01-C001-31" ] ], "cite_sentences": [ "0984f12a6fea858c7f18263cc2fb01-C001-12", "0984f12a6fea858c7f18263cc2fb01-C001-28" ] }, "@USE@": { "gold_contexts": [ [ "0984f12a6fea858c7f18263cc2fb01-C001-102" ] ], "cite_sentences": [ "0984f12a6fea858c7f18263cc2fb01-C001-102" ] } } }, "ABC_ecdd75533aff56771f0320694efc9a_42": { "x": [ { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-2", "text": "This paper describes our submission to the Shared Task on Similar Language Translation at the Fourth Conference on Machine Translation (WMT 2019)." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-3", "text": "We submitted three systems for the Hindi \u2192 Nepali direction, in which we examined the performance of a Recurrent Neural Network (RNN) based Neural Machine Translation (NMT) system, a semi-supervised NMT system where monolingual data of both languages is utilized using the architecture of (Artetxe et al., 2017), and a system trained with extra synthetic sentences generated by copying source and target sentences, without using any additional monolingual data." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-4", "text": "----------------------------------" }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-5", "text": "**INTRODUCTION**" }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-6", "text": "In this paper, we present our submission to the Similar Language Translation Task at WMT 2019." 
}, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-7", "text": "The task focuses on improving machine translation results for three language pairs: Czech-Polish (Slavic languages), Hindi-Nepali (Indo-Aryan languages) and Spanish-Portuguese (Romance languages)." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-8", "text": "The main focus of the task is to utilize monolingual data in addition to parallel data, because the provided parallel data is very small in amount." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-9", "text": "The details of the task are provided in (Barrault et al., 2019)." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-10", "text": "We participated for the Hindi-Nepali language pair and submitted three systems based on NMT for the Hindi \u2192 Nepali direction." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-11", "text": "We have utilized monolingual data of both languages and also trained an NMT system with copied data from both sides and no additional monolingual data." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-12", "text": "The rest of the paper is organized as follows: we start with an introduction to NMT, followed by a list of some of the existing methods for utilizing monolingual data in NMT." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-13", "text": "A brief introduction to unsupervised and semi-supervised NMT is provided." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-14", "text": "We also briefly describe two popular existing methods for training cross-lingual word embeddings in an unsupervised way." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-15", "text": "In Section 4.3 we describe our three submitted systems for the task." 
}, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-16", "text": "----------------------------------" }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-17", "text": "**NEURAL MACHINE TRANSLATION**" }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-18", "text": "Many architectures have been proposed for neural machine translation." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-19", "text": "The most famous is the RNN-based encoder-decoder proposed in (Cho et al., 2014), where the encoder and decoder are both recurrent neural networks; the encoder can be bi-directional." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-20", "text": "Later, attention-based sequence-to-sequence models, in which attention is utilized to improve performance in NMT, were proposed in (Bahdanau et al., 2014) and (Luong et al., 2015)." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-21", "text": "Attention basically instructs the system about which source words to focus on while generating a particular target word." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-22", "text": "A Transformer-based encoder-decoder architecture for NMT is proposed in (Vaswani et al., 2017), which is based entirely on self-attention and positional encoding." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-23", "text": "It does not follow a recurrent architecture." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-24", "text": "Positional encoding provides the system with information about the order of words." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-25", "text": "NMT needs a large amount of parallel data to train a system." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-26", "text": "This task basically focuses on how to improve performance for languages which are similar but resource-scarce." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-27", "text": "There are many language pairs for which parallel data does not exist or exists only in a very small amount." 
}, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-28", "text": "In the past, various techniques have been proposed to improve the performance of NMT systems, like Back-Translation (Sennrich et al., 2016a), utilizing other similar language pairs through pivoting (Cheng et al., 2017) or transfer learning (Zoph et al., 2016), completely unsupervised architectures (Artetxe et al., 2017; Lample et al., 2018), and many others." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-29", "text": "----------------------------------" }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-30", "text": "**UTILIZING MONOLINGUAL DATA IN NMT**" }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-31", "text": "There has been a good amount of work done on how monolingual data can be utilized to improve the performance of an NMT system." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-32", "text": "Back-Translation was introduced by (Sennrich et al., 2016b) to utilize monolingual data of the target language." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-33", "text": "This requires a translation system in the opposite direction." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-34", "text": "In (Sennrich et al., 2016b), a method where empty sentences are provided as the input for target-side monolingual data is also evaluated; back-translation performs better than this." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-35", "text": "In iterative Back-Translation, systems in both directions improve each other (Hoang et al., 2018); this is done in an incremental fashion." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-36", "text": "To generate back-translated data, the current system in the opposite direction is utilized." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-37", "text": "In (Currey et al., 2017), target-side monolingual data is copied to generate synthetic source sentences, and the system is trained by combining this synthetic data with parallel data." 
}, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-38", "text": "In (Zhang and Zong, 2016), source-side monolingual data is utilized to iteratively generate synthetic sentences from the same model." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-39", "text": "In (Domhan and Hieber, 2017), there is a separate layer for a target-side language model in training; the decoder utilizes both source-dependent and source-independent representations to generate a particular target word." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-40", "text": "In (Burlot and Yvon, 2018), it is claimed that the quality of back-translated sentences is important." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-41", "text": "Recently, many systems have been proposed for Unsupervised NMT, where only monolingual data is utilized." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-42", "text": "The Unsupervised NMT approach proposed in (Artetxe et al., 2017) follows an architecture where the encoder is shared and the decoder is separate for each language." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-43", "text": "The encoder tries to map sentences from both languages into the same space, which is supported by cross-lingual word embeddings." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-44", "text": "They fix the cross-lingual word embeddings in the encoder while training, which helps in generating cross-lingual sentence representations in the same space." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-45", "text": "The system, with one shared encoder and two separate decoders and no parallel data, is trained by iterating between Denoising and Back-Translation." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-46", "text": "Denoising tries to generate the correct sentence from noisy sentences; in that way, the decoder learns how to generate sentences in that particular language." 
}, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-47", "text": "These noisy sentences are created by shuffling words within a window." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-48", "text": "If the system is trained only with denoising, it may turn out to be a denoising auto-encoder." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-49", "text": "So they have also introduced back-translation into the training process to introduce the translation task." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-50", "text": "Training is done by alternating between denoising and back-translation for mini-batches if parallel data is not available." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-51", "text": "In a semi-supervised setting, if some amount of parallel data is available, training alternates between denoising, back-translation and parallel sentences." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-52", "text": "In (Lample et al., 2018), the encoder and decoder are both shared between the languages." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-53", "text": "Training is done by alternating between denoising and back-translation." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-54", "text": "Initialization is performed using a system trained on word-by-word translated sentences, which is done using cross-lingual word embeddings trained using MUSE (Conneau et al., 2017)." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-55", "text": "They also utilize a discriminator which tries to identify the language from the encoder representations; this leads to adversarial training." 
}, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-56", "text": "----------------------------------" }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-57", "text": "**CROSS-LINGUAL WORD EMBEDDINGS**" }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-58", "text": "Cross-lingual word embeddings try to map the word embedding spaces of two different languages into the same space." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-59", "text": "The basic assumption for generating cross-lingual word embeddings in most papers is that both embedding spaces must be isometric." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-60", "text": "Cross-lingual word embeddings are generated by learning a linear transformation which minimizes the distances between words given in a dictionary." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-61", "text": "There are many methods proposed for training cross-lingual word embeddings in an unsupervised way." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-62", "text": "While training cross-lingual word embeddings in an unsupervised manner, there is no dictionary available; only the monolingual embeddings are available." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-63", "text": "In (Artetxe et al., 2018), cross-lingual word embeddings are generated following a series of steps: normalization of the embeddings so they can be used together with a distance metric, unsupervised initialization using the normalized embeddings, a self-learning framework which iterates between creating the dictionary and finding the optimal mapping, and some weighting and refinement on top of this." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-64", "text": "Through these steps, a transformation of these spaces into a common space is learnt." 
}, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-65", "text": "In (Conneau et al., 2017), an adversarial training process is followed where a discriminator tries to correctly identify the language from a representation and the mapping matrix W tries to confuse the discriminator." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-66", "text": "----------------------------------" }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-67", "text": "**SYSTEM OVERVIEW**" }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-68", "text": "This section describes the specification of the submitted systems in detail." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-69", "text": "We have submitted systems for the Hindi-Nepali language pair in the Hindi \u2192 Nepali direction." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-70", "text": "Hindi and Nepali are both Indo-Aryan languages and are very similar to each other." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-71", "text": "They share a significant portion of their vocabulary and have similar word order." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-72", "text": "The three submitted systems are:" }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-73", "text": "\u2022 A pure RNN-based NMT system" }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-74", "text": "\u2022 A semi-supervised RNN-based NMT system" }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-75", "text": "\u2022 Utilization of copied data in RNN-based NMT. The first system is a pure RNN-based NMT system." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-76", "text": "To train this we have utilized only the parallel corpus." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-77", "text": "The second system is trained using a semi-supervised NMT approach where monolingual data from both languages is utilized." 
}, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-78", "text": "We have utilized the architecture proposed in (Artetxe et al., 2017), where the encoder is shared, the decoders are separate for each language, and the model is trained by alternating between denoising and back-translation." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-79", "text": "This architecture can also be utilized in a completely unsupervised setting." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-80", "text": "The third system is also a pure RNN-based NMT system where additional parallel data (synthetic data) is created by copying source-side sentences to the target side and target-side sentences to the source side, but we do this only for the available parallel sentences; no additional monolingual data is utilized." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-81", "text": "In this way the amount of available data becomes three times the original data." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-82", "text": "All the data is combined together, shuffled and then provided to the NMT system; no identification is provided to distinguish between parallel data and copied data." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-83", "text": "To train all three systems we have utilized the implementation of (Artetxe et al., 2017)." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-84", "text": "----------------------------------" }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-85", "text": "**EXPERIMENTAL DETAILS**" }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-86", "text": "----------------------------------" }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-87", "text": "**DATASET**" }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-88", "text": "We have utilized monolingual corpora of both languages in our primary system." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-89", "text": "The dataset details are given in Table 1." 
}, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-90", "text": "Hindi-Nepali parallel data is provided in the task, containing 65,505 sentences." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-91", "text": "The Hindi monolingual corpus is the IITB Hindi monolingual corpus (Kunchukuttan et al., 2018)." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-92", "text": "Nepali monolingual sentences are created using the monolingual data of Wikipedia and CommonCrawl provided for the Parallel Corpus Filtering task (http://www.statmt.org/wmt19/parallel-corpus-filtering.html) by separating each sentence using | and keeping sentences of length 500 or less." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-93", "text": "----------------------------------" }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-94", "text": "**PREPROCESSING**" }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-95", "text": "Sentences are preprocessed using tokenization and Byte Pair Encoding (BPE)." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-96", "text": "Sentences are tokenized for both Hindi and Nepali using the IndicNLP library." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-97", "text": "This tokenized data is preprocessed using BPE." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-98", "text": "The number of merge operations for BPE is set to 20,000 for both languages and learnt separately for each language." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-99", "text": "The results may improve if we learn BPE jointly, because both languages are similar." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-100", "text": "Byte Pair Encoding is learnt using the implementation by (Sennrich et al., 2016b)." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-101", "text": "Monolingual embeddings are trained using FastText (Bojanowski et al., 2017) on BPE-applied monolingual data for both languages." 
}, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-102", "text": "The dimension of embeddings is set to 100." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-103", "text": "Cross-lingual embeddings are created using VecMap (Artetxe et al., 2018) ." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-104", "text": "Table 2 reports BLEU score for the test and dev data for all three systems." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-105", "text": "We have not utilized dev data while training." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-106", "text": "We have used encoder and decoder with 2 layers, 600 hidden units each, GRU cells, batch size of 50 and maximum sentence length of 50." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-107", "text": "Adam optimizer is used with learning rate 0.0002." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-108", "text": "We have trained all three systems with fixed 300000 iterations." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-109", "text": "The number of sentences in test and dev data is 1567 and 3000 respectively." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-110", "text": "The BLEU score for test data is provided by task organizers and for dev data BLEU score is calculated using multi-bleu.pl from moses toolkit (Koehn et al., 2007) ." 
}, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-111", "text": "----------------------------------" }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-112", "text": "**SYSTEM DETAIL**" }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-113", "text": "----------------------------------" }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-114", "text": "**SYSTEM**" }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-115", "text": "Test Dev Basic 3.5 4.6 With Monolingual Data 2.8 3.27" }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-116", "text": "With copy data 2.7 4.38 Table 2 : Experimental results (BLEU scores)" }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-117", "text": "----------------------------------" }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-118", "text": "**RESULTS**" }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-119", "text": "As it is clear from the results in Table 2 that the system with only parallel data is performing better than when we are utilizing monolingual data." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-120", "text": "To answer why this is happening, a study of size and quality of monolingual data, the study of ratio of monolingual and parallel data provided to the system is required." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-121", "text": "The intuition behind using copied data with parallel data is, that both the languages are similar and this may provide more data to the system." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-122", "text": "But the results show the system is getting confused as we are providing all the data together without any distinguishing mark between parallel and copied sentences." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-123", "text": "For the same sentence both original translation and its copy is given in the output which may be causing confusion." 
}, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-124", "text": "----------------------------------" }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-125", "text": "**SUMMARY**" }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-126", "text": "In this paper we have explained about systems submitted for Similar Language Translation task in WMT 2019." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-127", "text": "We have reported results for a semi-supervised technique which utilizes denoising and back-translation." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-128", "text": "We have utilized lots of monolingual data together with available parallel data for training a neural machine translation system which share encoder and have separate decoders for each language, in a semi-supervised setting." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-129", "text": "A study of size and quality of monolingual data is required to analyze the performance which is left as future work." }, { "sent_id": "ecdd75533aff56771f0320694efc9a-C001-130", "text": "We have also explained results for utilizing copied data with parallel data and compared both the above mentioned techniques with a pure RNN based NMT system." 
} ], "y": { "@USE@": { "gold_contexts": [ [ "ecdd75533aff56771f0320694efc9a-C001-3" ], [ "ecdd75533aff56771f0320694efc9a-C001-78" ], [ "ecdd75533aff56771f0320694efc9a-C001-83" ] ], "cite_sentences": [ "ecdd75533aff56771f0320694efc9a-C001-3", "ecdd75533aff56771f0320694efc9a-C001-78", "ecdd75533aff56771f0320694efc9a-C001-83" ] }, "@BACK@": { "gold_contexts": [ [ "ecdd75533aff56771f0320694efc9a-C001-28" ], [ "ecdd75533aff56771f0320694efc9a-C001-42" ] ], "cite_sentences": [ "ecdd75533aff56771f0320694efc9a-C001-28", "ecdd75533aff56771f0320694efc9a-C001-42" ] } } }, "ABC_6edf517d79f7fd2a0653a3d5fb543d_42": { "x": [ { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-2", "text": "A recent research line has obtained strong results on bilingual lexicon induction by aligning independently trained word embeddings in two languages and using the resulting crosslingual embeddings to induce word translation pairs through nearest neighbor or related retrieval methods." }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-3", "text": "In this paper, we propose an alternative approach to this problem that builds on the recent work on unsupervised machine translation." }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-4", "text": "This way, instead of directly inducing a bilingual lexicon from cross-lingual embeddings, we use them to build a phrasetable, combine it with a language model, and use the resulting machine translation system to generate a synthetic parallel corpus, from which we extract the bilingual lexicon using statistical word alignment techniques." }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-5", "text": "As such, our method can work with any word embedding and cross-lingual mapping technique, and it does not require any additional resource besides the monolingual corpus used to train the embeddings." 
}, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-6", "text": "When evaluated on the exact same cross-lingual embeddings, our proposed method obtains an average improvement of 6 accuracy points over nearest neighbor and 4 points over CSLS retrieval, establishing a new state-of-the-art in the standard MUSE dataset." }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-7", "text": "----------------------------------" }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-9", "text": "Cross-lingual word embedding mappings have attracted a lot of attention in recent times." }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-10", "text": "These methods work by independently training word embeddings in different languages, and mapping them to a shared space through linear transformations." }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-11", "text": "While early methods required a training dictionary to find the initial alignment (Mikolov et al., 2013) , fully unsupervised methods have managed to obtain comparable results based on either adversarial training or selflearning (Artetxe et al., 2018b) ." }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-12", "text": "A prominent application of these methods is Bilingual Lexicon Induction (BLI), that is, using the resulting cross-lingual embeddings to build a bilingual dictionary." }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-13", "text": "For that purpose, one would typically induce the translation of each source word by taking its corresponding nearest neighbor in the target language." }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-14", "text": "However, it has been argued that this basic approach suffers from the hubness problem 1 , which has motivated alternative retrieval methods like inverted nearest neighbor 2 , inverted softmax (Smith et al., 2017) , and Cross-domain Similarity Local Scaling (CSLS) ." 
}, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-15", "text": "In this paper, we go one step further and, rather than directly inducing the bilingual dictionary from the cross-lingual word embeddings, we use them to build an unsupervised machine translation system, and extract a bilingual dictionary from a synthetic parallel corpus generated with it." }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-16", "text": "This allows us to take advantage of a strong language model and naturally extract translation equivalences through statistical word alignment." }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-17", "text": "At the same time, our method can be used as a drop-in replacement of traditional retrieval techniques, as it can work with any cross-lingual word embeddings and it does not require any additional resource besides the monolingual corpus used to train them." }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-18", "text": "Our experiments show the effectiveness of this alternative approach, which outperforms the previous best retrieval method by 4 accuracy points on average, establishing a new stateof-the-art in the standard MUSE dataset." }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-19", "text": "As such, we conclude that, contrary to recent trend, future research in BLI should not focus exclusively on direct retrieval methods." }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-20", "text": "----------------------------------" }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-21", "text": "**PROPOSED METHOD**" }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-22", "text": "The input of our method is a set of cross-lingual word embeddings and the monolingual corpora used to train them." 
}, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-23", "text": "In our experiments, we use fastText embeddings (Bojanowski et al., 2017) mapped through VecMap (Artetxe et al., 2018b ), but the algorithm described next can also work with any other word embedding and cross-lingual mapping method." }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-24", "text": "The general idea of our method is to to build an unsupervised phrase-based statistical machine translation system Artetxe et al., 2018c Artetxe et al., , 2019 , and use it to generate a synthetic parallel corpus from which to extract a bilingual dictionary." }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-25", "text": "For that purpose, we first derive phrase embeddings from the input word embeddings by taking the 400,000 most frequent bigrams and and the 400,000 most frequent trigrams in each language, and assigning them the centroid of the words they contain." }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-26", "text": "Having done that, we use the resulting cross-lingual phrase embeddings to build a phrase-table as described in Artetxe et al. (2018c) ." }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-27", "text": "More concretely, we extract translation candidates by taking the 100 nearest-neighbors of each source phrase, and score them with the softmax function over their cosine similarities:" }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-28", "text": "where the temperature \u03c4 is estimated using maximum likelihood estimation over a dictionary induced in the reverse direction." }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-29", "text": "In addition to the phrase translation probabilities in both directions, we also estimate the forward and reverse lexical weightings by aligning each word in the target phrase with the one in the source phrase most likely generating it, and taking the product of their respective translation probabilities." 
}, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-30", "text": "We then combine this phrase-table with a distortion model and a 5-gram language model estimated in the target language corpus, which results in a phrase-based machine translation system." }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-31", "text": "So as to optimize the weights of the resulting model, we use the unsupervised tuning procedure proposed by Artetxe et al. (2019) , which combines a cyclic consistency loss and a language modeling loss over a subset of 2,000 sentences from each monolingual corpora." }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-32", "text": "Having done that, we generate a synthetic parallel corpus by translating the source language monolingual corpus with the resulting machine translation system." }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-33", "text": "3 We then word align this corpus using FastAlign (Dyer et al., 2013) with default hyperparameters and the grow-diag-finaland symmetrization heuristic." }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-34", "text": "Finally, we build a phrase-table from the word aligned corpus, and extract a bilingual dictionary from it by discarding all non-unigram entries." }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-35", "text": "For words with more than one entry, we rank translation candidates according to their direct translation probability." }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-36", "text": "----------------------------------" }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-37", "text": "**EXPERIMENTAL SETTINGS**" }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-38", "text": "In order to compare our proposed method headto-head with other BLI methods, the experimental setting needs to fix the monolingual embedding training method, as well as the cross-lingual mapping algorithm and the evaluation dictionaries." 
}, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-39", "text": "In addition, in order to avoid any advantage, our method should not see any further monolingual corpora than those used to train the monolingual embeddings." }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-40", "text": "Unfortunately, existing BLI datasets distribute pre-trained word embeddings alone, but not the monolingual corpora used to train them." }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-41", "text": "For that reason, we decide to use the evaluation dictionaries from the standard MUSE dataset but, instead of using the pre-trained Wikipedia embeddings distributed with it, we extract monolingual corpora from Wikipedia ourselves and train our own embeddings trying to be as faithful as possible to the original settings." }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-42", "text": "This allows us to compare our proposed method to previous retrieval techniques in the exact same conditions, while keeping our results as comparable as possible to previous work reporting results for the MUSE dataset." }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-43", "text": "More concretely, we use WikiExtractor 4 to extract plain text from Wikipedia dumps, and preprocess the resulting corpus using standard Moses tools (Koehn et al., 2007) by applying sentence splitting, punctuation normalization, tokenization with aggressive hyphen splitting, and lowercasing." }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-44", "text": "We then train word embeddings for each language using the skip-gram implementation of fastText (Bojanowski et al., 2017) Table 1: P@1 of proposed system and previous retrieval methods, using the same cross-lingual embeddings." 
}, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-45", "text": "the MUSE dataset were trained using these exact same settings, so our embeddings only differ in the Wikipedia dump used to extract the training corpus and the pre-processing applied to it, which is not documented in the original dataset." }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-46", "text": "Having done that, we map these word embeddings to a cross-lingual space using the unsupervised mode in VecMap (Artetxe et al., 2018b) , which builds an initial solution based on the intralingual similarity distribution of the embeddings and iteratively improves it through self-learning." }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-47", "text": "Finally, we induce a bilingual dictionary using our proposed method and evaluate it in comparison to previous retrieval methods (standard nearest neighbor, inverted nearest neighbor, inverted softmax 5 and CSLS)." }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-48", "text": "Following common practice, we use precision at 1 as our evaluation measure." }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-49", "text": "6 4 Results and discussion Table 1 reports the results of our proposed system in comparison to previous retrieval methods." }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-50", "text": "As it can be seen, our method obtains the best results in all language pairs and directions, with an average improvement of 6 points over nearest neighbor and 4 points over CSLS, which is the best performing previous method." }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-51", "text": "These results are very consistent across all translation directions, with an absolute improvement between 2.7 and 6.3 points over CSLS." 
}, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-52", "text": "Interestingly, neither inverted nearest neighbor nor inverted soft-max are able to outperform standard nearest neighbor, presumably because our cross-lingual embeddings are less sensitive to hubness thanks to the symmetric re-weighting in VecMap (Artetxe et al., 2018a) ." }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-53", "text": "At the same time, CSLS obtains an absolute improvement of 2 points over nearest neighbor, only a third of what our method achieves." }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-54", "text": "This suggests that, while previous retrieval methods have almost exclusively focused on addressing the hubness problem, there is a substantial margin of improvement beyond this phenomenon." }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-55", "text": "So as to put these numbers into perspective, Table 2 compares our method to previous results reported in the literature." }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-56", "text": "7 As it can be seen, our proposed method obtains the best published results in all language pairs and directions, outperforming the previous state-of-the-art by a substantial margin." }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-57", "text": "Note, moreover, that these previous systems mostly differ in their cross-lingual mapping algorithm and not the retrieval method, so our improvements are orthogonal." }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-58", "text": "We believe that, beyond the substantial gains in this particular task, our work has important implications for future research in cross-lingual word embedding mappings." }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-59", "text": "While most work in this topic uses BLI as the only evaluation task, Glavas et al. (2019) recently showed that BLI results do not always correlate well with downstream performance." 
}, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-60", "text": "In particular, they observe that some mapping methods that are specifically designed for BLI perform poorly in other tasks." }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-61", "text": "Our work shows that, besides their poor performance in those tasks, these BLI-centric mapping methods might not even be the optimal approach to BLI, as our alternative method, which relies on unsupervised machine translation instead of direct retrieval over mapped embeddings, obtains substantially better results without requiring any additional resource." }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-62", "text": "As such, we argue that 1) future work in cross-lingual word embeddings should consider other evaluation tasks in addition to BLI, and 2) future work in BLI should consider other alternatives in addition to direct retrieval over crosslingual embedding mappings." }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-63", "text": "----------------------------------" }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-64", "text": "**RELATED WORK**" }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-65", "text": "While BLI has been previously tackled using count-based vector space models (Vuli\u0107 and Moens, 2013) and statistical decipherment (Ravi and Knight, 2011; Dou and Knight, 2012) , these methods have recently been superseded by crosslingual embedding mappings, which work by aligning independently trained word embeddings in different languages." }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-66", "text": "For that purpose, early methods required a training dictionary, which was used to learn a linear transformation that mapped these embeddings into a shared crosslingual space (Mikolov et al., 2013; Artetxe et al., 2018a) ." 
}, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-67", "text": "The resulting cross-lingual embeddings are then used to induce the translations of words that were missing in the training dictionary by taking their nearest neighbor in the target language." }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-68", "text": "The amount of required supervision was later reduced through self-learning methods (Artetxe et al., 2017) , and then completely eliminated through adversarial training (Zhang et al., 2017a; or more robust iterative approaches combined with initialization heuristics (Artetxe et al., 2018b; Hoshen and Wolf, 2018) ." }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-69", "text": "At the same time, several recent methods have formulated embedding mappings as an optimal transport problem (Zhang et al., 2017b; ." }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-70", "text": "In addition to that, a large body of work has focused on addressing the hubness problem that arises when directly inducing bilingual dictionaries from cross-lingual embeddings, either through the retrieval method Smith et al., 2017; or the mapping itself Shigeto et al., 2015; ." }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-71", "text": "While all these previous methods directly induce bilingual dictionaries from cross-lingually mapped embeddings, our proposed method combines them with unsupervised machine translation techniques, outperforming them all by a substantial margin." 
}, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-72", "text": "----------------------------------" }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-73", "text": "**CONCLUSIONS AND FUTURE WORK**" }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-74", "text": "We propose a new approach to BLI which, instead of directly inducing bilingual dictionaries from cross-lingual embedding mappings, uses them to build an unsupervised machine translation system, which is then used to generate a synthetic parallel corpus from which to extract bilingual lexica." }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-75", "text": "Our approach does not require any additional resource besides the monolingual corpora used to train the embeddings, and outperforms traditional retrieval techniques by a substantial margin." }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-76", "text": "We thus conclude that, contrary to recent trend, future work in BLI should not focus exclusively in direct retrieval approaches, nor should BLI be the only evaluation task for cross-lingual embeddings." }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-77", "text": "Our code is available at https://github.com/ artetxem/monoses." }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-78", "text": "In the future, we would like to further improve our method by incorporating additional ideas from unsupervised machine translation such as joint refinement and neural hybridization (Artetxe et al., 2019) ." }, { "sent_id": "6edf517d79f7fd2a0653a3d5fb543d-C001-79", "text": "In addition to that, we would like to integrate our induced dictionaries in other downstream tasks like unsupervised cross-lingual information retrieval (Litschko et al., 2018) ." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "6edf517d79f7fd2a0653a3d5fb543d-C001-11" ] ], "cite_sentences": [ "6edf517d79f7fd2a0653a3d5fb543d-C001-11" ] }, "@EXT@": { "gold_contexts": [ [ "6edf517d79f7fd2a0653a3d5fb543d-C001-11", "6edf517d79f7fd2a0653a3d5fb543d-C001-12", "6edf517d79f7fd2a0653a3d5fb543d-C001-15" ] ], "cite_sentences": [ "6edf517d79f7fd2a0653a3d5fb543d-C001-11" ] }, "@USE@": { "gold_contexts": [ [ "6edf517d79f7fd2a0653a3d5fb543d-C001-23" ], [ "6edf517d79f7fd2a0653a3d5fb543d-C001-46" ], [ "6edf517d79f7fd2a0653a3d5fb543d-C001-68" ] ], "cite_sentences": [ "6edf517d79f7fd2a0653a3d5fb543d-C001-23", "6edf517d79f7fd2a0653a3d5fb543d-C001-46", "6edf517d79f7fd2a0653a3d5fb543d-C001-68" ] } } }, "ABC_ee219d599e0e0c2bbebc1849863005_42": { "x": [ { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-24", "text": "----------------------------------" }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-2", "text": "One central mystery of neural NLP is what neural models \"know\" about their subject matter." }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-3", "text": "When a neural machine translation system learns to translate from one language to another, does it learn the syntax or semantics of the languages?" }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-4", "text": "Can this knowledge be extracted from the system to fill holes in human scientific knowledge?" }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-5", "text": "Existing typological databases contain relatively full feature specifications for only a few hundred languages." 
}, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-25", "text": "**DATASET AND EXPERIMENTAL SETUP**" }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-6", "text": "Exploiting the existence of parallel texts in more than a thousand languages, we build a massive many-to-one neural machine translation (NMT) system from 1017 languages into English, and use this to predict information missing from typological databases." }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-7", "text": "Experiments show that the proposed method is able to infer not only syntactic, but also phonological and phonetic inventory features, and improves over a baseline that has access to information about the languages' geographic and phylogenetic neighbors." }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-8", "text": "1" }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-9", "text": "----------------------------------" }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-11", "text": "Linguistic typology is the classification of human languages according to syntactic, phonological, and other classes of features, and the investigation of the relationships and correlations between these classes/features." }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-12", "text": "This study has been a scientific pursuit in its own right since the 19th century (Greenberg, 1963; Comrie, 1989; Nichols, 1992) , but recently typology has borne practical fruit within various subfields of NLP, particularly on problems involving lower-resource languages." 
}, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-13", "text": "(Dryer and Haspelmath, 2013) , has proven useful in many NLP tasks (O'Horan et al., 2016) , such as multilingual dependency parsing (Ammar et al., 2016) , generative parsing in low-resource settings (Naseem et al., 2012; T\u00e4ckstr\u00f6m et al., 2013) , phonological language modeling and loanword prediction (Tsvetkov et al., 2016) , POStagging (Zhang et al., 2012) , and machine translation (Daiber et al., 2016) ." }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-14", "text": "However, the needs of NLP tasks differ in many ways from the needs of scientific typology, and typological databases are often only sparsely populated, by necessity or by design." }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-15", "text": "2 In NLP, on the other hand, what is important is having a relatively full set of features for the particular group of languages you are working on." }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-16", "text": "This mismatch of needs has motivated various proposals to reconstruct missing entries, in WALS and other databases, from known entries (Daum\u00e9 III and Campbell, 2007; Daum\u00e9 III, 2009; Coke et al., 2016; Littell et al., 2017) ." }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-17", "text": "In this study, we examine whether we can tackle the problem of inferring linguistic typology from parallel corpora, specifically by training a massively multi-lingual neural machine translation (NMT) system and using the learned representations to infer typological features for each language." 
}, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-18", "text": "This is motivated both by prior work in linguistics (Bugarski, 1991; Garc\u00eda, 2002) demonstrating strong links between translation studies and tools for contrastive linguistic analysis, work in inferring typology from bilingual data (\u00d6stling, 2015) and English as Second Language texts (Berzak et al., 2014) , as well as work in NLP (Shi et al., 2016; Kuncoro et al., 2017; Belinkov et al., 2017) showing that syntactic knowledge can be extracted from neural nets on the word-by-word or sentence-by-sentence level." }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-19", "text": "This work presents a more holistic analysis of whether we can discover what neural networks learn about the linguistic concepts of an entire language by aggregating their representations over a large number of the sentences in the language." }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-20", "text": "We examine several methods for discovering feature vectors for typology prediction, including those learning a language vector specifying the language while training multilingual neural language models (\u00d6stling and Tiedemann, 2017) or neural machine translation (Johnson et al., 2016) systems." }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-21", "text": "We further propose a novel method for aggregating the values of the latent state of the encoder neural network to a single vector representing the entire language." }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-22", "text": "We calculate these feature vectors using an NMT model trained on 1017 languages, and use them for typlogy prediction both on their own and in composite with feature vectors from previous work based on the genetic and geographic distance between languages (Littell et al., 2017) ." 
}, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-23", "text": "Results show that the extracted representations do in fact allow us to learn about the typology of languages, with particular gains for syntactic features like word order and the presence of case markers." }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-26", "text": "Typology Database: To perform our analysis, we use the URIEL language typology database (Littell et al., 2017) , which is a collection of binary features extracted from multiple typological, phylogenetic, and geographical databases such as WALS (World Atlas of Language Structures) (Collins and Kayne, 2011) , PHOIBLE (Moran et al., 2014) , Ethnologue (Lewis et al., 2015) , and Glottolog (Hammarstr\u00f6m et al., 2015) ." }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-27", "text": "These features are divided into separate classes regarding syntax (e.g. whether a language has prepositions or postpositions), phonology (e.g. whether a language has complex syllabic onset clusters), and phonetic inventory (e.g. whether a language has interdental fricatives)." }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-28", "text": "There are 103 syntactical features, 28 phonology features and 158 phonetic inventory features in the database." }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-29", "text": "Baseline Feature Vectors: Several previous methods take advantage of typological implicature, the fact that some typological traits correlate strongly with others, to use known features of a language to help infer other unknown features of the language (Daum\u00e9 III and Campbell, 2007; Takamura et al., 2016; Coke et al., 2016) ." }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-30", "text": "As an alternative that does not necessarily require pre-existing knowledge of the typological features in the language at hand, Littell et al. 
(2017) have proposed a method for inferring typological features directly from the language's k nearest neighbors (k-NN) according to geodesic distance (distance on the Earth's surface) and genetic distance (distance according to a phylogenetic family tree)." }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-31", "text": "In our experiments, our baseline uses this method by taking the 3-NN for each language according to normalized geodesic+genetic distance, and calculating an average feature vector of these three neighbors." }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-32", "text": "Typology Prediction: To perform prediction, we trained a logistic regression classifier with the baseline k-NN feature vectors described above and the proposed NMT feature vectors described in the next section." }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-33", "text": "We train individual classifiers for predicting each typological feature in a class (syntax, etc.)." }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-34", "text": "We performed 10-fold cross-validation over the URIEL database, training on 9/10 of the languages to predict the remaining 1/10 in each fold." }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-35", "text": "----------------------------------" }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-36", "text": "**LEARNING REPRESENTATIONS FOR TYPOLOGY PREDICTION**" }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-37", "text": "In this section, we describe three methods for learning representations for typology prediction with multilingual neural models." }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-38", "text": "LM Language Vector: Several methods have been proposed to learn multilingual language models (LMs) that utilize vector representations of languages (Tsvetkov et al., 2016; \u00d6stling and Tiedemann, 2017)."
}, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-39", "text": "Specifically, these models train a recurrent neural network LM (RNNLM; Mikolov et al. (2010)) using long short-term memory (LSTM; Hochreiter and Schmidhuber (1997)) with an additional vector representing the current language as an input." }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-40", "text": "The expectation is that this vector will be able to capture the features of the language and improve LM accuracy." }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-41", "text": "\u00d6stling and Tiedemann (2017) noted that, intriguingly, agglomerative clustering of these language vectors results in something that looks roughly like a phylogenetic tree, but stopped short of performing typological inference." }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-42", "text": "We train this vector by appending a special token representing the source language (e.g., \"fra\" for French) to the beginning of the source sentence, as shown in Fig. 1, then using the word representation learned for this token as a representation of the language." }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-43", "text": "We will call this first set of feature vectors LMVEC, and examine their utility for typology prediction." }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-44", "text": "NMT Language Vector: In our second set of feature vectors, MTVEC, we similarly use a language embedding vector, but instead learn a multilingual neural MT model trained to translate from many languages to English, in a similar fashion to Johnson et al. (2016); Ha et al. (2016)." }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-45", "text": "In contrast to LMVEC, we hypothesize that, thanks to the alignments to an identical sentence in English, the model will have a stronger signal, allowing it to learn vectors that more accurately reflect the syntactic, phonetic, or semantic consistencies of the various languages."
}, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-46", "text": "This has been demonstrated to some extent in previous work that has used specifically engineered alignment-based models (Lewis and Xia, 2008; \u00d6stling, 2015; Coke et al., 2016), and we examine whether these results apply to neural network feature extractors and expand beyond word order and syntax to other types of typology as well." }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-47", "text": "Table 1: Accuracy of syntactic, phonological, and inventory features using LM language vectors (LMVEC), MT language vectors (MTVEC), MT encoder cell averages (MTCELL), or both MT feature vectors (MTBOTH)." }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-48", "text": "Aux indicates auxiliary information of geodesic/genetic nearest neighbors; \"NONE -Aux\" is the majority-class chance rate, while \"NONE +Aux\" is a 3-NN classification." }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-49", "text": "----------------------------------" }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-50", "text": "**NMT ENCODER MEAN CELL STATES**" }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-51", "text": "Motivated by prior work showing that individual cells in LSTMs can capture interpretable properties such as sentiment (Radford et al., 2017), we expect that the cell states will represent features that may be linked to the typology of the language." }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-52", "text": "To create vectors for each language using LSTM hidden states, we obtain the mean of cell states (c in the standard LSTM equations) for all time steps of all sentences in each language."
}, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-54", "text": "----------------------------------" }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-55", "text": "**EXPERIMENTS**" }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-56", "text": "----------------------------------" }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-57", "text": "**MULTILINGUAL DATA AND TRAINING REGIMEN**" }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-58", "text": "To train a multilingual neural machine translation system, we used a corpus of Bible translations that was obtained by scraping a massive online Bible database at bible.com." }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-59", "text": "This corpus contains data for 1017 languages." }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-60", "text": "After preprocessing the corpus, we obtained a training set of 20.6 million sentences over all languages." }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-61", "text": "The implementation of both the LM and NMT models described in \u00a73 was done in the DyNet toolkit." }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-62", "text": "In order to obtain a manageable shared vocabulary for all languages, we divided the data into subwords using joint byte-pair encoding of all languages (Sennrich et al., 2016) with 32K merge operations." }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-63", "text": "We used LSTM cells in a single recurrent layer with 512-dimensional hidden state and input embedding size." }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-64", "text": "The Adam optimizer was used with a learning rate of 0.001, and dropout of 0.5 was applied during training."
}, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-65", "text": "----------------------------------" }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-66", "text": "**RESULTS AND DISCUSSION**" }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-67", "text": "The results of the experiments can be found in Tab. 1." }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-68", "text": "First, focusing on the \"-Aux\" results, we can see that all feature vectors obtained by the neural models improve over the chance rate, demonstrating that it is indeed possible to extract information about linguistic typology from unsupervised neural models." }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-69", "text": "Comparing LMVEC to MTVEC, we can see a convincing improvement of 2-3% across the board, indicating that the use of bilingual information does indeed provide a stronger signal, allowing the network to extract more salient features." }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-70", "text": "Next, we can see that MTCELL further outperforms MTVEC, indicating that the proposed method of investigating the hidden cell dynamics is more effective than using a statically learned language vector." }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-71", "text": "Finally, combining both feature vectors as MTBOTH leads to further improvements." }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-72", "text": "To measure the statistical significance of the results, we performed a paired bootstrap test on the gain between NONE+AUX and MTBOTH+AUX, and found that the gains for syntax and inventory were significant at the p = 0.05 level, but the gain for phonology was not, perhaps because there are fewer phonological features than features in the other classes (only 28)."
}, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-73", "text": "When further using the geodesic/genetic distance neighbor feature vectors, we can see that these trends largely hold, although the gains are much smaller, indicating that the proposed method is still useful in the case where we have a priori knowledge about the environment in which the language exists." }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-74", "text": "It should be noted, however, that the gains of LMVEC evaporate, indicating that access to aligned data may be essential when inferring the typology of a new language." }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-75", "text": "We also noted that the accuracies of certain features decreased from NONE-AUX to MTBOTH-AUX, particularly gender markers, case suffixes, and negative affixes, but these decreases were smaller in magnitude than the improvements." }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-76", "text": "Table 2: Top 5 improvements from \"NONE -Aux\" to \"MTBOTH -Aux\" in the syntax (\"S\"), phonology (\"P\"), and inventory (\"I\") classes." }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-77", "text": "Interestingly, and in contrast to previous methods for inferring typology from raw text, which have been specifically designed for inducing word order or other syntactic features (Lewis and Xia, 2008; \u00d6stling, 2015; Coke et al., 2016), our proposed method is also able to infer information about phonological or phonetic inventory features." }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-78", "text": "This may seem surprising or even counterintuitive, but a look at the most-improved phonology/inventory features (Tab. 2) shows a number of features in which languages with the \"non-default\" option (e.g. having uvular consonants or initial velar nasals, not having lateral consonants, etc.) are concentrated in particular geographical regions."
}, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-79", "text": "For example, uvular consonants are not common world-wide, but are common in particular geographic regions like the North American Pacific Northwest and the Caucasus (Maddieson, 2013b) , while initial velar nasals are common in Southeast Asia (Anderson, 2013) , and lateral consonants are uncommon in the Amazon Basin (Maddieson, 2013a) ." }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-80", "text": "Since these are also regions with a particular and sometimes distinct syntactic character, we think the model may be finding regional clusters through syntax, and seeing an improvement in regionally-distinctive phonology/inventory features as a side effect." }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-81", "text": "Finally, given that MTCELL uses the feature vectors of the latent cell state to predict typology, it is of interest to observe how these latent cells behave for typologically different languages." }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-82", "text": "In Fig. 2 we examine the node that contributed most to the prediction of \"S OBJ BEFORE VERB\" (the node with maximum weight in the classifier) for German and Korean, where the feature is active, and Portuguese and Catalan, where the feature is inactive." }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-83", "text": "We can see that the node trajectories closely track each other (particularly at the beginning of the sentence) for Portuguese and Catalan, and in general the languages where objects precede verbs have higher average values, which would be expressed by our mean cell state features." }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-84", "text": "The similar trends for languages that share the value for a typological feature (S OBJ BEFORE VERB) indicate that information stored in the selected hidden node is consistent across languages with similar structures." 
}, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-85", "text": "----------------------------------" }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-86", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-87", "text": "Through this study, we have shown that neural models can learn a range of linguistic concepts, and may be used to impute missing features in typological databases." }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-88", "text": "In particular, we have demonstrated the utility of learning representations with parallel text, and results hinted at the importance of modeling the dynamics of the representation as models process sentences." }, { "sent_id": "ee219d599e0e0c2bbebc1849863005-C001-89", "text": "We hope that this study will encourage additional use of typological features in downstream NLP tasks, and inspire further techniques for missing knowledge prediction in under-documented languages." } ], "y": { "@BACK@": { "gold_contexts": [ [ "ee219d599e0e0c2bbebc1849863005-C001-16" ], [ "ee219d599e0e0c2bbebc1849863005-C001-30" ] ], "cite_sentences": [ "ee219d599e0e0c2bbebc1849863005-C001-16", "ee219d599e0e0c2bbebc1849863005-C001-30" ] }, "@USE@": { "gold_contexts": [ [ "ee219d599e0e0c2bbebc1849863005-C001-22" ], [ "ee219d599e0e0c2bbebc1849863005-C001-26" ], [ "ee219d599e0e0c2bbebc1849863005-C001-30", "ee219d599e0e0c2bbebc1849863005-C001-31" ] ], "cite_sentences": [ "ee219d599e0e0c2bbebc1849863005-C001-22", "ee219d599e0e0c2bbebc1849863005-C001-26", "ee219d599e0e0c2bbebc1849863005-C001-30" ] } } }, "ABC_db6c35071fe4e93c11acca4056e9ac_42": { "x": [ { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-2", "text": "Adversarial training (AT) is a regularization method that can be used to improve the robustness of neural network methods by adding small perturbations in the training data." 
}, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-3", "text": "We show how to use AT for the tasks of entity recognition and relation extraction." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-4", "text": "In particular, we demonstrate that applying AT to a general-purpose baseline model for jointly extracting entities and relations improves state-of-the-art effectiveness on several datasets in different contexts (i.e., news, biomedical, and real estate data) and for different languages (English and Dutch)." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-5", "text": "----------------------------------" }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-7", "text": "Many neural network methods have recently been exploited in various natural language processing (NLP) tasks, such as parsing, POS tagging (Lample et al., 2016), relation extraction (dos Santos et al., 2015), translation (Bahdanau et al., 2015), and joint tasks (Miwa and Bansal, 2016)." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-8", "text": "However, Szegedy et al. (2014) observed that intentional small-scale perturbations (i.e., adversarial examples) to the input of such models may lead to incorrect decisions (with high confidence)." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-9", "text": "Goodfellow et al. (2015) proposed adversarial training (AT) (for image recognition) as a regularization method which uses a mixture of clean and adversarial examples to enhance the robustness of the model." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-10", "text": "Although AT has recently been applied in NLP tasks (e.g., text classification (Miyato et al., 2017)), this paper is, to the best of our knowledge, the first attempt to investigate the regularization effects of AT in a joint setting for two related tasks."
}, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-11", "text": "We start from a baseline joint model that performs the tasks of named entity recognition (NER) and relation extraction at once." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-12", "text": "Previously proposed models (summarized in Section 2) exhibit several issues that the neural network-based baseline approach (detailed in Section 3.1) overcomes: (i) our model uses automatically extracted features without the need for external parsers or manually extracted features (see Gupta et al. (2016); Miwa and Bansal (2016); Li et al. (2017)), (ii) all entities and the corresponding relations within the sentence are extracted at once, instead of examining one pair of entities at a time (see Adel and Sch\u00fctze (2017)), and (iii) we model relation extraction in a multi-label setting, allowing multiple relations per entity (see Katiyar and Cardie (2017); Bekoulis et al. (2018a))." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-13", "text": "The core contribution of the paper is the use of AT as an extension of the training procedure for the joint extraction task (Section 3.2)." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-14", "text": "To evaluate the proposed AT method, we perform a large-scale experimental study of this joint task (see Section 4), using datasets from different contexts (i.e., news, biomedical, real estate) and languages (i.e., English, Dutch)." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-15", "text": "We use a strong baseline that outperforms all previous models that rely on automatically extracted features, achieving state-of-the-art performance (Section 5)." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-16", "text": "Compared to the baseline model, applying AT during training leads to a consistent additional increase in joint extraction effectiveness."
}, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-17", "text": "----------------------------------" }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-18", "text": "**RELATED WORK**" }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-19", "text": "Joint entity and relation extraction: Joint models (Li and Ji, 2014; Miwa and Sasaki, 2014) that are based on manually extracted features have been proposed for performing both the named entity recognition (NER) and relation extraction subtasks at once." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-20", "text": "These methods rely on the availability of NLP tools (e.g., POS taggers) or manually designed features leading to additional complexity." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-21", "text": "Neural network methods have been exploited to overcome this feature design issue and usually involve RNNs and CNNs (Miwa and Bansal, 2016; Zheng et al., 2017) ." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-22", "text": "Specifically, Miwa and Bansal (2016) as well as Li et al. (2017) apply bidirectional tree-structured RNNs for different contexts (i.e., news, biomedical) to capture syntactic information (using external dependency parsers)." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-23", "text": "Gupta et al. (2016) propose the use of various manually extracted features along with RNNs." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-24", "text": "Adel and Sch\u00fctze (2017) solve the simpler problem of entity classification (EC, assuming entity boundaries are given), instead of NER, and they replicate the context around the entities, feeding entity pairs to the relation extraction layer." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-25", "text": "Katiyar and Cardie (2017) investigate RNNs with attention without taking into account that relation labels are not mutually exclusive." 
}, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-26", "text": "Finally, Bekoulis et al. (2018a) use LSTMs in a joint model for extracting just one relation at a time, but increase the complexity of the NER part." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-27", "text": "Our baseline model enables simultaneous extraction of multiple relations from the same input." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-28", "text": "Then, we further extend this strong baseline using adversarial training." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-29", "text": "Adversarial training (AT) (Goodfellow et al., 2015) has been proposed to make classifiers more robust to input perturbations in the context of image recognition." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-30", "text": "In the context of NLP, several variants have been proposed for different tasks such as text classification (Miyato et al., 2017) , relation extraction (Wu et al., 2017) and POS tagging (Yasunaga et al., 2018) ." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-31", "text": "AT is considered as a regularization method." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-32", "text": "Unlike other regularization methods (i.e., dropout (Srivastava et al., 2014) , word dropout (Iyyer et al., 2015) ) that introduce random noise, AT generates perturbations that are variations of examples easily misclassified by the model." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-33", "text": "----------------------------------" }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-34", "text": "**MODEL**" }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-35", "text": "----------------------------------" }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-36", "text": "**JOINT LEARNING AS HEAD SELECTION**" }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-37", "text": "The baseline model, described in detail in Bekoulis et al. (2018b) , is illustrated in Fig. 
1." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-38", "text": "It aims to detect (i) the type and the boundaries of the entities and (ii) the relations between them." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-39", "text": "The input is a sequence of tokens (i.e., a sentence) w = w_1, ..., w_n." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-40", "text": "We use character level embeddings to implicitly capture morphological features (e.g., prefixes and suffixes), representing each character by a vector (embedding)." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-41", "text": "The character embeddings are fed to a bidirectional LSTM (BiLSTM) to obtain the character-based representation of the word." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-42", "text": "We also use pre-trained word embeddings." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-43", "text": "Word and character embeddings are concatenated to form the final token representation, which is then fed to a BiLSTM layer to extract sequential information." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-44", "text": "For the NER task, we adopt the BIO (Beginning, Inside, Outside) encoding scheme." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-45", "text": "In Fig. 1, the B-PER tag is assigned to the beginning token of a 'person' (PER) entity." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-46", "text": "For the prediction of the entity tags, we use: (i) a softmax approach for the entity classification (EC) task (assuming entity boundaries are given) or (ii) a CRF approach where we identify both the type and the boundaries for each entity." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-47", "text": "During decoding, in the softmax setting, we greedily detect the entity types of the tokens (i.e., independent prediction)."
}, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-48", "text": "Although an independent distribution of types is reasonable for EC tasks, this is not the case when there are strong correlations between neighboring tags." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-49", "text": "For instance, the BIO encoding scheme imposes several constraints in the NER task (e.g., the B-PER and I-LOC tags cannot be sequential)." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-50", "text": "Motivated by this intuition, we use a linear-chain CRF for the NER task (Lample et al., 2016)." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-51", "text": "For decoding, in the CRF setting, we use the Viterbi algorithm." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-52", "text": "During training, for both EC (softmax) and NER tasks (CRF), we minimize the cross-entropy loss L_NER." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-53", "text": "The entity tags are later fed into the relation extraction layer as label embeddings (see Fig. 1), assuming that knowledge of the entity types is beneficial in predicting the relations between the involved entities." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-54", "text": "We model the relation extraction task as a multi-label head selection problem (Bekoulis et al., 2018b)." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-55", "text": "In our model, each word w_i can be involved in multiple relations with other words." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-56", "text": "For instance, in the example illustrated in Fig. 1, \"Smith\" could be involved not only in a Lives in relation with the token \"California\" (head) but also in other relations simultaneously (e.g., Works for, Born In with some corresponding tokens)."
}, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-57", "text": "The goal of the task is to predict, for each word w_i, a vector of heads \u0177_i and a vector of corresponding relations r\u0302_i." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-58", "text": "We compute the score s(w_j, w_i, r_k) of word w_j being the head of w_i given a relation label r_k using a single-layer neural network." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-59", "text": "The corresponding probability is defined as: P(w_j, r_k | w_i; \u03b8) = \u03c3(s(w_j, w_i, r_k)), where \u03c3(.) is the sigmoid function." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-60", "text": "During training, we minimize the cross-entropy loss L_rel as:" }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-61", "text": "where m is the number of associated heads (and thus relations) per word w_i." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-62", "text": "During decoding, the most probable heads and relations are selected using threshold-based prediction." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-63", "text": "The final objective for the joint task is computed as L_JOINT(w; \u03b8) = L_NER + L_rel, where \u03b8 is the set of model parameters." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-64", "text": "In the case of multi-token entities, only the last token of the entity can serve as head of another token, to eliminate redundant relations." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-65", "text": "If an entity is not involved in any relation, we predict the auxiliary \"N\" relation label and the token itself as head."
}, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-66", "text": "----------------------------------" }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-67", "text": "**ADVERSARIAL TRAINING (AT)**" }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-68", "text": "We exploit the idea of AT (Goodfellow et al., 2015) as a regularization method to make our model robust to input perturbations." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-69", "text": "Specifically, we generate examples which are variations of the original ones by adding some noise at the level of the concatenated word representation (Miyato et al., 2017)." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-70", "text": "This is similar to the concept introduced by Goodfellow et al. (2015) to improve the robustness of image recognition classifiers." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-71", "text": "We generate an adversarial example by adding the worst-case perturbation \u03b7_adv to the original embedding w that maximizes the loss function:" }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-72", "text": "where \u03b8\u0302 is a copy of the current model parameters." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-73", "text": "Since Eq. (2) is intractable in neural networks, we use the approximation proposed in Goodfellow et al. (2015), defined as: \u03b7_adv = \u03b5 g/\u2016g\u2016, with g = \u2207_w L_JOINT(w; \u03b8\u0302), where \u03b5 is a small bounded norm treated as a hyperparameter." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-74", "text": "Similar to Yasunaga et al. (2018), we set \u03b5 to be \u03b1\u221aD (where D is the dimension of the embeddings)." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-75", "text": "We train on the mixture of original and adversarial examples, so the final loss is computed as: L_JOINT(w; \u03b8\u0302) + L_JOINT(w + \u03b7_adv; \u03b8\u0302)."
}, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-76", "text": "----------------------------------" }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-77", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-78", "text": "We evaluate our models on four datasets, using the code available from our GitHub repository." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-79", "text": "Specifically, we follow the 5-fold cross-validation defined by Miwa and Bansal (2016) for the ACE04 (Doddington et al., 2004) dataset." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-80", "text": "For the CoNLL04 (Roth and Yih, 2004) EC task (assuming boundaries are given), we use the same splits as in Gupta et al. (2016); Adel and Sch\u00fctze (2017)." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-81", "text": "We also evaluate our models on the NER task, similar to Miwa and Sasaki (2014), on the same dataset using 10-fold cross-validation." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-82", "text": "For the Dutch Real Estate Classifieds (DREC) dataset (Bekoulis et al., 2017), we use the train-test splits as in Bekoulis et al. (2018a)." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-83", "text": "For the Adverse Drug Events (ADE) dataset (Gurulingappa et al., 2012), we perform 10-fold cross-validation similar to Li et al. (2017)." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-84", "text": "To obtain comparable results that are not affected by the input embeddings, we use the embeddings of the previous works." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-85", "text": "We employ early stopping in all of the experiments." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-86", "text": "We use the Adam optimizer (Kingma and Ba, 2015) and we fix the hyperparameters (i.e., \u03b1, dropout values, best epoch, learning rate) on the validation sets."
}, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-87", "text": "The scaling parameter \u03b1 is selected from {5e\u22122, 1e\u22122, 1e\u22123, 1e\u22124}. Larger values of \u03b1 (i.e., larger perturbations) led to a consistent performance decrease in our early experiments." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-88", "text": "This can be explained by the fact that adding more noise can change the content of the sentence, as also reported by Wu et al. (2017)." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-89", "text": "We use three types of evaluation, namely: (i) S(trict): we score an entity as correct if both the entity boundaries and the entity type are correct (ACE04, ADE, CoNLL04, DREC); (ii) B(oundaries): we score an entity as correct if only the entity boundaries are correct, while the entity type is not taken into account (DREC); and (iii) R(elaxed): a multi-token entity is considered correct if at least one correct type is assigned to the tokens comprising the entity, assuming that the boundaries are known (CoNLL04), to compare to previous works." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-90", "text": "Table 1: Comparison of our method with the state-of-the-art in terms of F1 score. The proposed models are: (i) baseline, (ii) baseline EC (predicts only entity classes) and (iii) baseline (EC) + AT (regularized by AT)." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-91", "text": "The \u2713 and \u2717 symbols indicate whether the models rely on external NLP tools." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-92", "text": "We include different evaluation types (S, R and B)." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-94", "text": "In all cases, a relation is considered correct when both the relation type and the argument entities are correct."
}, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-95", "text": "Table 1 shows our experimental results." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-96", "text": "The name of the dataset is presented in the first column while the models are listed in the second column." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-97", "text": "The proposed models are the following: (i) baseline: the baseline model shown in Fig. 1 with the CRF layer and the sigmoid loss, (ii) baseline EC: the proposed model with the softmax layer for EC, (iii) baseline (EC) + AT: the baseline regularized using AT." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-98", "text": "The final three columns present the F 1 results for the two subtasks and their average performance." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-99", "text": "Bold values indicate the best results among models that use only automatically extracted features." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-100", "text": "For ACE04, the baseline outperforms Katiyar and Cardie (2017) by \u223c2% in both tasks." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-101", "text": "This improvement can be explained by the use of: (i) multi-label head selection, (ii) CRF-layer and (iii) character level embeddings." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-102", "text": "Compared to Miwa and Bansal (2016) , who rely on NLP tools, the baseline performs within a reasonable margin (less than 1%) on the joint task." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-103", "text": "On the other hand, Li et al. (2017) use the same model for the ADE biomedical dataset, where we report a 2.5% overall improvement." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-104", "text": "This indicates that NLP tools are not always accurate for various contexts." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-105", "text": "For the CoNLL04 dataset, we use two evaluation settings." 
}, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-106", "text": "We use the relaxed evaluation similar to Gupta et al. (2016) ; Adel and Sch\u00fctze (2017) on the EC task." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-107", "text": "The baseline model outperforms the state-of-the-art models that do not rely on manually extracted features (>4% improvement for both tasks), since we directly model the whole sentence, instead of just considering pairs of entities." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-108", "text": "Moreover, compared to the model of Gupta et al. (2016) that relies on complex features, the baseline model performs within a margin of 1% in terms of overall F 1 score." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-109", "text": "We also report NER results on the same dataset and improve overall F 1 score with \u223c1% compared to Miwa and Sasaki (2014) , indicating that our automatically extracted features are more informative than the hand-crafted ones." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-110", "text": "These automatically extracted features exhibit their performance improvement mainly due to the shared LSTM layer that learns to automatically generate feature representations of entities and their corresponding relations within a single model." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-111", "text": "For the DREC dataset, we use two evaluation methods." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-112", "text": "In the boundaries evaluation, the baseline has an improvement of \u223c3% on both tasks compared to Bekoulis et al. (2018a) , whose quadratic scoring layer complicates NER." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-113", "text": "Table 1 and Fig. 2 show the effectiveness of the adversarial training on top of the baseline model." 
}, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-114", "text": "In all of the experiments, AT improves the predictive performance of the baseline model in the joint setting." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-115", "text": "Moreover, as seen in Fig. 2 , the performance of the models using AT is closer to maximum even from the early training epochs." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-116", "text": "Specifically, for ACE04, there is an improvement in both tasks as well as in the overall F 1 performance (0.4%)." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-117", "text": "For CoNLL04, we note an improvement in the overall F 1 of 0.4% for the EC and 0.8% for the NER tasks, respectively." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-118", "text": "For the DREC dataset, in both settings, there is an overall improvement of \u223c1%." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-119", "text": "Figure 2 shows that from the first epochs, the model obtains its maximum performance on the DREC validation set." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-120", "text": "Finally, for ADE, our AT model beats the baseline F 1 by 0.7%." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-121", "text": "----------------------------------" }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-122", "text": "**RESULTS**" }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-123", "text": "Our results demonstrate that AT outperforms the neural baseline model consistently, consider- ing our experiments across multiple and more diverse datasets than typical related works." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-124", "text": "The improvement of AT over our baseline (depending on the dataset) ranges from \u223c0.4% to \u223c0.9% in terms of overall F 1 score." 
}, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-125", "text": "This seemingly small performance increase is mainly due to the limited performance benefit for the NER component, which is in accordance with the recent advances in NER using neural networks that report similarly small gains (e.g., the performance improvement in Ma and Hovy (2016) and Lample et al. (2016) on the CoNLL-2003 test set is 0.01% and 0.17% F 1 percentage points, while in the work of Yasunaga et al. (2018) , a 0.07% F 1 improvement on CoNLL-2000 using AT for NER is reported)." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-126", "text": "However, the relation extraction performance increases by \u223c1% F 1 scoring points, except for the ACE04 dataset." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-127", "text": "Further, as seen in Fig. 2 , the improvement for CoNLL04 is particularly small on the evaluation set." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-128", "text": "This may indicate a correlation between the dataset size and the benefit of adversarial training in the context of joint models, but this needs further investigation in future work." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-129", "text": "----------------------------------" }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-130", "text": "**CONCLUSION**" }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-131", "text": "We proposed to use adversarial training (AT) for the joint task of entity recognition and relation extraction." }, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-132", "text": "The contribution of this study is twofold: (i) investigation of the consistent effectiveness of AT as a regularization method over a multi-context baseline joint model, with (ii) a large scale experimental evaluation." 
}, { "sent_id": "db6c35071fe4e93c11acca4056e9ac-C001-133", "text": "Experiments show that AT improves the results for each task separately, as well as the overall performance of the baseline joint model, while reaching high performance already during the first epochs of the training procedure." } ], "y": { "@DIF@": { "gold_contexts": [ [ "db6c35071fe4e93c11acca4056e9ac-C001-12" ], [ "db6c35071fe4e93c11acca4056e9ac-C001-26", "db6c35071fe4e93c11acca4056e9ac-C001-27" ], [ "db6c35071fe4e93c11acca4056e9ac-C001-112" ] ], "cite_sentences": [ "db6c35071fe4e93c11acca4056e9ac-C001-12", "db6c35071fe4e93c11acca4056e9ac-C001-26", "db6c35071fe4e93c11acca4056e9ac-C001-112" ] }, "@BACK@": { "gold_contexts": [ [ "db6c35071fe4e93c11acca4056e9ac-C001-26" ] ], "cite_sentences": [ "db6c35071fe4e93c11acca4056e9ac-C001-26" ] }, "@EXT@": { "gold_contexts": [ [ "db6c35071fe4e93c11acca4056e9ac-C001-26", "db6c35071fe4e93c11acca4056e9ac-C001-27", "db6c35071fe4e93c11acca4056e9ac-C001-28" ] ], "cite_sentences": [ "db6c35071fe4e93c11acca4056e9ac-C001-26" ] }, "@USE@": { "gold_contexts": [ [ "db6c35071fe4e93c11acca4056e9ac-C001-82" ] ], "cite_sentences": [ "db6c35071fe4e93c11acca4056e9ac-C001-82" ] } } }, "ABC_7d7895690c84fb1af46c30f858470e_42": { "x": [ { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-98", "text": "**EFFECT OF SLICE SIZE**" }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-2", "text": "Any system which performs goal-directed continual learning must not only learn incrementally but process and absorb information incrementally." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-3", "text": "Such a system also has to understand when its goals have been achieved." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-4", "text": "In this paper, we consider these issues in the context of question answering." 
}, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-5", "text": "Current state-of-the-art question answering models reason over an entire passage, not incrementally. As we will show, naive approaches to incremental reading, such as restriction to unidirectional language models in the model, perform poorly." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-6", "text": "We present extensions to the DocQA [2] model to allow incremental reading without loss of accuracy." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-7", "text": "The model also jointly learns to provide the best answer given the text that is seen so far and predict whether this best-so-far answer is sufficient." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-8", "text": "----------------------------------" }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-10", "text": "Humans can read and comprehend text incrementally." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-11", "text": "For instance, given a piece of text, our mental state gets updated as we read [10] ." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-12", "text": "We do not necessarily wait until the end of a long document to understand its first sentence." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-13", "text": "This incremental reading mechanism allows us to avoid consuming more input if we have already reached our goal, and is analogous to the problem faced by a goal-directed continuous learning system, which must also incrementally absorb new information, determine how to use it, and determine if its goals have been achieved." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-14", "text": "Inspired by how humans read and learn from text incrementally, we introduce incremental models for text comprehension." 
}, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-15", "text": "Our primary goal is to address the problem of incremental reading in the context of language comprehension and Question Answering (QA)." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-16", "text": "We formulate the problem as designing a model for question-answering that consumes text incrementally." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-17", "text": "By design and definition, Recurrent Neural Networks (RNNs), e.g. Long Short-Term Memory networks (LSTMs) [5] , process data sequentially, and update their internal states as they read the new tokens." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-18", "text": "However, on tasks like Question Answering, in all the existing well-performing models, RNNs are employed in a bidirectional way, or a self-attention mechanism is employed [11, 2, 4, 8] ." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-19", "text": "This means these models need to processes the whole input sequence to compute the final answer." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-20", "text": "This is a reasonable approach if the input sequence is as short as a sentence, but it becomes less effective and efficient as the length of the input sequence increases." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-21", "text": "We introduce a new incremental model based on DocQA [2] , which is an RNN based model proposed for QA." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-48", "text": "**SLICED PREDICTION LAYER**" }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-22", "text": "The incremental DocQA performs similarly to the original system but can process the input text incrementally." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-23", "text": "We propose the use of slicing to build incremental models." 
}, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-24", "text": "Slicing RNNs were introduced in [13] with the motivation of enabling parallelization and speeding up sequence processing." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-25", "text": "Here, we explore using slicing to facilitate incremental processing of the input sequence." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-26", "text": "Besides the fact that incremental reading is more plausible from the cognitive perspective, it can also provide an inductive bias for the models, which will make it easier for them to find a more generalizable solution [1] ." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-27", "text": "Moreover, incrementality allows the model to be applied to tasks where we do not have the whole input in the beginning, and we need to compute some intermediate outputs." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-28", "text": "E.g. when the input is a stream, or we are in the context of a dialogue." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-29", "text": "We observed that, even if the whole input text is available, it is not always necessary to read the whole text to answer a given question." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-30", "text": "In fact, our model achieves its highest performance when it is limited to read only a few tokens after the answer, rather than when it is allowed access to the entire context." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-31", "text": "Considering this, we also augment the incremental DocQA with an \"early stopping\" strategy, where it stops consuming the rest of the input text as soon as it is confident that it has found the answer." 
}, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-32", "text": "Learning to stop reading or reasoning when the goal is reached is addressed for tasks like text classification [3] and question answering [9, 12, 6] , but the main challenge is that implementing early stopping strategies is only possible when we have an incremental model." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-33", "text": "----------------------------------" }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-34", "text": "**ARCHITECTURE OF THE MODELS**" }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-35", "text": "We use DocQA as the baseline model [2] ." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-36", "text": "The architecture of DocQA is illustrated in Figure 4a ." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-37", "text": "The output of this model is two vectors with the length of the given context." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-38", "text": "One of the vectors indicates the probability of each token in the context to be the start token of the answer span." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-39", "text": "The other vector indicates the probability of each token in the context to be the end of the answer span." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-40", "text": "The gold labels indicate the ground truth begin and end of the answer spans." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-41", "text": "DocQA is inherently bidirectional thus making it hard to process the input incrementally." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-42", "text": "It is possible to remove the bidirectionality by replacing the bidirectional layers with single-directional layers and replacing the global attention with an attention layer that only attends to the past, but this change significantly reduces performance." 
}, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-43", "text": "Sliced DocQA In order to enable the DocQA model to process the context sequence incrementally despite having bidirectional RNN and attention layers, we use the concept of slicing [13] ." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-44", "text": "We divide the context sequence into slices and apply the model to each slice." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-45", "text": "We call this \"Sliced DocQA\"." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-46", "text": "We explore different ways of using Sliced DocQA for incremental question answering." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-47", "text": "----------------------------------" }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-49", "text": "In the simple sliced model, we predict the output for each slice independently." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-50", "text": "Thus, as the model processes each slice, it predicts whether the answer span exists in that slice or not." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-51", "text": "We call this sliced prediction (see Figure 1a )." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-52", "text": "To aggregate, we use a softmax layer on the concatenation of the outputs of all slices." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-53", "text": "This architecture allows us to processes the input incrementally, but each slice is now processed independently: i.e., when reading the second slice, the model can make no use of what was read in the first slice." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-54", "text": "This architecture hence ignores the order of the slices." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-55", "text": "We propose two solutions to solve this issue, discussed in more detail below." 
}, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-56", "text": "The first solution is to let the model access all representations from all slices in the prediction layer: we call this strategy global answer prediction." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-57", "text": "The second solution is a novel mechanism to transfer knowledge between slices called step transfer." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-58", "text": "Global Answer Prediction While the sliced model processes the context sequence in slices, in the last layer, we use the encoded information from all the slices of the context to predict the answer." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-59", "text": "This means the prediction layer is not sliced." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-60", "text": "This model is illustrated in Figure 1b ." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-61", "text": "Global answer prediction can be made incremental by applying the prediction once for each slice, with information from the future slices masked out." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-62", "text": "Step Transfer To connect slices incrementally, we can transfer the knowledge between slices by using the knowledge learned until the current slice to initialize the next slice, as illustrated in Figure 1c ." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-63", "text": "Thus, at the global level we have a uni-directional RNN, but, locally, we can have bidirectional or self-attention layers." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-64", "text": "To do this, we use encoded information from the current slice as an input to a fully connected network to predict the initial state for the next slice." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-65", "text": "Early Stopping These incremental models can be used to support early stopping [3, 12] ." 
}, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-66", "text": "We use a supervised approach to predict when the model should stop, with a classifier that is simply trained on detecting whether an answer is contained in a given context or not." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-67", "text": "We train this classifier in a multi-task framework by adding an extra objective function to the QA system." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-68", "text": "Hence, at each training step, the model not only tries to predict an answer but also predicts whether the true answer is within the currently processed input or not." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-69", "text": "The early stopping classifier is a two-layer fully connected neural network with RELU activations and a sigmoid output layer." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-70", "text": "The input to this network is the average of the representations in the last layer of the current slice, and the output is a scalar indicating the probability of stopping." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-71", "text": "At test time the overall decision as to whether to stop early is based on thresholding the cumulative sum of these predictions, i.e. the probability of stopping at slice i is the sum of the predicted probabilities of stopping at slices 1 to i. Equation 1 shows how we compute the loss for the early stopping model." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-72", "text": "Here extra_length i is a factor for scaling the early stopping loss, based on the distance from the gold stop point, in particular: extra_length = log(max(dist_threshold,|length_read\u2212answer_end|))." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-73", "text": "In this equation, dist_threshold is a minimum number of tokens after which we start scaling the loss." 
}, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-74", "text": "Thus, in case the model has not yet reached the answer and decides to stop it will be punished more if it is too early in contrast to when it will reach the answer by reading a few extra tokens." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-75", "text": "Similarly, if the model has already passed the answer span, the more it reads, the bigger the loss will be. If we are in a distance less than the dist_threshold from the gold stop point, the scaling factor is log(dist_threshold)." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-76", "text": "In the end, this loss is added to the answer prediction loss to form the total loss." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-77", "text": "If the true answer span is not before the chosen stopping point, the answer prediction loss is set to zero for all possible answers, and the model is only trained to choose a better stopping point." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-78", "text": "We Evaluate our model on SQuAD v1 [7] which is a reading comprehension dataset." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-79", "text": "It contains question and context pairs, where the answer to the question is a span in the context." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-80", "text": "We will open source the code for reproducing the experiments." 
}, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-81", "text": "----------------------------------" }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-82", "text": "**EXPERIMENTS AND DISCUSSION**" }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-83", "text": "----------------------------------" }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-84", "text": "**EXPERIMENTS**" }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-85", "text": "We experiment with different slice sizes for Sliced DocQA in three different modes: with a sliced prediction layer, with a sliced prediction layer and step transfer, without step transfer but with a global prediction layer." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-86", "text": "In the sliced models, the size of each slice can play a crucial role." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-87", "text": "The extreme cases are when the length of the slices are 1, which means we have no bidirectional layer, and when the slices are as long as the whole sequence, which means we have a fully bidirectional model." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-88", "text": "Notice that when slice size is 1, the sliced model with step transfer is not equivalent to the uni-directional DocQA, since the attention layer is implemented differently." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-89", "text": "Figure 2 shows the performance of the Sliced DocQA models with respect to different slice sizes." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-90", "text": "With a sliced prediction layer, as expected, increasing the slice size leads to an increase in the performance." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-91", "text": "In this case, having a slice size of 1 means to predict the answer based on single word representations and still, we can get an accuracy of 31%." 
}, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-92", "text": "We can explain this by the fact that 30% of answers in the SQuAD are single words, e.g. when the questions are of type when, where or who." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-93", "text": "We expect to find a slice size for which the performance of the model reaches the performance of the complete, not sliced, model." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-94", "text": "This is because, only local attention might be enough to correctly understand the meaning of the input sequence at each step, and also to predict whether each token is part of the answer or not." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-95", "text": "We observe that by slice size = 32, the model already reaches the performance of the uni-directional DocQA, and by slice size = 128 it almost reaches the performance of the normal DocQA." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-96", "text": "The interesting finding here is that there is a limit to which increasing the slice size leads to an increase in the performance, e.g. for the SQuAD, as soon as the slice size reaches 128 tokens, the increase in the performance is almost not noticeable." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-97", "text": "----------------------------------" }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-99", "text": "Step transfer is useful When we have a step transfer mechanism, we observe that for each slice size, the performance of the model is higher." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-100", "text": "This means transferring knowledge between slices is useful." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-101", "text": "Among these models, the Sliced DocQA with step transfer models incremental comprehension and we see that for a slice size of 64 its performance is comparable with normal DocQA." 
}, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-102", "text": "Global answer prediction is effective Surprisingly, with the global prediction layer, the effect of the slice size is much less." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-103", "text": "In general, when we have a global prediction layer, the performance of the model with all slice sizes is higher than when we have a sliced prediction layer." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-104", "text": "It is interesting to see that in this model, even with slice size of 1, which means using word embeddings to predict the answer span, we still get a performance close to when we have larger slice sizes." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-105", "text": "Effect of early stopping In our early stopping experiments, we employed Sliced DocQA with Step Transfer as a truly incremental model." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-106", "text": "In Table 1 , we compare the performance of this model with different slice sizes." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-107", "text": "As it can be seen, at slice size of 128 it almost achieves the performance of the model without early stopping (%99)." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-108", "text": "Next, we investigate whether the early stopping model is more efficient regarding the number of read tokens." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-109", "text": "We looked into the performance of the model without early stopping at different context length to find the earliest point at which the model achieves its highest performance." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-110", "text": "This can be assumed as an oracle for early stopping model (i.e. best possible stopping point)." 
}, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-111", "text": "In Figure 3a we compare the distribution of best stopping points with the stopping points of our early stopping model." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-112", "text": "We observe that (1)For a large number of examples, we have to read the full context to achieve the best performance." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-113", "text": "(2)For some examples, the early stopping model is reading more than it should, and for some, it stops earlier than it should. (3)In general, while with the best stopping offsets we can read about %15 less text, our model reads about %8 less." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-114", "text": "We also studied if there is a correlation between the context length and the ratio of read context length." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-115", "text": "In Figure 3b , we see for context length up to 128, we read full context, which is because our slice size is 128." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-116", "text": "After that point, we see the trend of reading smaller ratio of longer contexts." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-117", "text": "----------------------------------" }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-118", "text": "**CONCLUSION**" }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-119", "text": "In this paper, we propose a model that reads and comprehends text incrementally." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-120", "text": "As a testbed for our approach, we have chosen the question answering task." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-121", "text": "We aim to build a model that can learn incrementally from text, where the learning goal is to answer a given question." 
}, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-122", "text": "In standard question answering, we do not care how the context is presented to the model, and for the models that achieve state of the art results, e.g. [11, 2] , they process the full context before making any decisions." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-123", "text": "We show that it is possible to modify these models to be incremental while achieving similar performance." }, { "sent_id": "7d7895690c84fb1af46c30f858470e-C001-124", "text": "Having an incremental model, allows us to employ an early stopping strategy where the model avoids reading the rest of the text as soon as it reaches a state where it thinks it has the answer." } ], "y": { "@EXT@": { "gold_contexts": [ [ "7d7895690c84fb1af46c30f858470e-C001-6" ], [ "7d7895690c84fb1af46c30f858470e-C001-21" ], [ "7d7895690c84fb1af46c30f858470e-C001-122", "7d7895690c84fb1af46c30f858470e-C001-123" ] ], "cite_sentences": [ "7d7895690c84fb1af46c30f858470e-C001-6", "7d7895690c84fb1af46c30f858470e-C001-21", "7d7895690c84fb1af46c30f858470e-C001-122" ] }, "@BACK@": { "gold_contexts": [ [ "7d7895690c84fb1af46c30f858470e-C001-18" ] ], "cite_sentences": [ "7d7895690c84fb1af46c30f858470e-C001-18" ] }, "@MOT@": { "gold_contexts": [ [ "7d7895690c84fb1af46c30f858470e-C001-18", "7d7895690c84fb1af46c30f858470e-C001-19", "7d7895690c84fb1af46c30f858470e-C001-20" ] ], "cite_sentences": [ "7d7895690c84fb1af46c30f858470e-C001-18" ] }, "@USE@": { "gold_contexts": [ [ "7d7895690c84fb1af46c30f858470e-C001-35" ] ], "cite_sentences": [ "7d7895690c84fb1af46c30f858470e-C001-35" ] }, "@DIF@": { "gold_contexts": [ [ "7d7895690c84fb1af46c30f858470e-C001-122" ] ], "cite_sentences": [ "7d7895690c84fb1af46c30f858470e-C001-122" ] } } }, "ABC_3816a122d7f0847c01415fadef2d3d_43": { "x": [ { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-2", "text": "An empirical comparison of CFG 
filtering techniques for LTAG and HPSG is presented." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-3", "text": "We demonstrate that an approximation of HPSG produces a more effective CFG filter than that of LTAG." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-4", "text": "We also investigate the reason for that difference." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-5", "text": "----------------------------------" }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-7", "text": "Various parsing techniques have been developed for lexicalized grammars such as Lexicalized Tree Adjoining Grammar (LTAG) (Schabes et al., 1988) and Head-Driven Phrase Structure Grammar (HPSG) (Pollard and Sag, 1994)." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-8", "text": "Along with the independent development of parsing techniques for individual grammar formalisms, some of them have been adapted to other formalisms (Schabes et al., 1988; van Noord, 1994; Yoshida et al., 1999; Torisawa et al., 2000)." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-9", "text": "However, these realizations sometimes exhibit quite different performance in each grammar formalism (Yoshida et al., 1999)." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-10", "text": "If we could identify an algorithmic difference that causes such a performance difference, it would reveal the advantages and disadvantages of the different realizations." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-11", "text": "This should also allow us to integrate the advantages of the realizations into one generic parsing technique, which would advance the whole parsing community."
}, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-12", "text": "In this paper, we compare CFG filtering techniques for LTAG (Harbusch, 1990; Poller and Becker, 1998) and HPSG (Torisawa et al., 2000; Kiefer and Krieger, 2000) , following an approach to parsing comparison among different grammar formalisms )." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-13", "text": "The key idea of the approach is to use strongly equivalent grammars, which generate equivalent parse results for the same input, obtained by a grammar conversion as demonstrated by ." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-14", "text": "The parsers with CFG filtering predict possible parse trees by a CFG approximated from a given grammar." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-15", "text": "Comparison of those parsers are interesting because effective CFG filters allow us to bring the empirical time complexity of the parsers close to that of CFG parsing." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-16", "text": "Investigating the difference between the ways of context-free (CF) approximation of LTAG and HPSG will thereby enlighten a way of further optimization for both techniques." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-17", "text": "We performed a comparison between the existing CFG filtering techniques for LTAG (Poller and Becker, 1998) and HPSG (Torisawa et al., 2000) , using strongly equivalent grammars obtained by converting LTAGs extracted from the Penn Treebank (Marcus et al., 1993) into HPSG-style." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-18", "text": "We compared the parsers with respect to the size of the approximated CFG and its effectiveness as a filter." 
}, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-19", "text": "----------------------------------" }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-20", "text": "**BACKGROUND**" }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-21", "text": "In this section, we introduce a grammar conversion ) and CFG filtering (Harbusch, 1990; Poller and Becker, 1998; Torisawa et al., 2000; Kiefer and Krieger, 2000) ." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-22", "text": "----------------------------------" }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-23", "text": "**GRAMMAR CONVERSION**" }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-24", "text": "The grammar conversion consists of a conversion of LTAG elementary trees to HPSG lexical entries and an emulation of substitution and adjunction by We can perform a comparison between LTAG and HPSG parsers using strongly equivalent grammars obtained by the above conversion." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-25", "text": "This is because strongly equivalent grammars can be a substitute for the same grammar in different grammar formalisms." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-26", "text": "----------------------------------" }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-27", "text": "**CFG FILTERING TECHNIQUES**" }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-28", "text": "An initial offline step of CFG filtering is performed to approximate a given grammar with a CFG." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-29", "text": "The obtained CFG is used as an efficient device to compute the necessary conditions for parse trees." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-30", "text": "The CFG filtering generally consists of two steps." 
}, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-31", "text": "In phase 1, the parser first predicts possible parse trees using the approximated CFG, and then filters out irrelevant edges by a top-down traversal starting from roots of successful context-free derivations." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-32", "text": "In phase 2, it then eliminates invalid parse trees by using constraints in the given grammar." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-33", "text": "We call the remaining edges that are used for the phase 2 parsing essential edges." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-34", "text": "The parsers with CFG filtering used in our experiments follow the above parsing strategy, but are different in the way the CF approximation and the elimination of impossible parse trees in phase 2 are performed." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-35", "text": "In the following sections, we briefly describe the CF approximation and the elimination of impossible parse trees in each realization." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-36", "text": "----------------------------------" }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-37", "text": "**CF APPROXIMATION OF LTAG**" }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-38", "text": "In CFG filtering techniques for LTAG (Harbusch, 1990; Poller and Becker, 1998) , every branching of elementary trees in a given grammar is extracted as a CFG rule as shown in Figure 1 ." 
}, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-39", "text": "----------------------------------" }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-40", "text": "**CFG RULES**" }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-41", "text": "Figure 2: Extraction of CFG from HPSG Because the obtained CFG can reflect only local constraints given in each local structure of the elementary trees, it generates invalid parse trees that connect local trees in different elementary trees." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-42", "text": "In order to eliminate such parse trees, a link between branchings is preserved as a node number which records a unique node address (a subscript attached to each node in Figure 1 )." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-43", "text": "We can eliminate these parse trees by traversing essential edges in a bottomup manner and recursively propagating ok-flag from a node number x to a node number y when a connection between x and y is allowed in the LTAG grammar." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-44", "text": "We call this propagation ok-prop." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-45", "text": "----------------------------------" }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-46", "text": "**CF APPROXIMATION OF HPSG**" }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-47", "text": "In CFG filtering techniques for HPSG (Torisawa et al., 2000; Kiefer and Krieger, 2000) , the extraction process of a CFG from a given HPSG grammar starts by recursively instantiating daughters of a grammar rule with lexical entries and generated feature structures until new feature structures are not generated as shown in Figure 2 ." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-48", "text": "We must impose restrictions on values of some features (i.e., ignoring them) and/or the number of rule applications in order to guarantee the termination of the rule application." 
}, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-49", "text": "A CFG is obtained by regarding each initial and generated feature structures as nonterminals and transition relations between them as CFG rules." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-50", "text": "Although the obtained CFG can reflect local and global constraints given in the whole structure of lexical entries, it generates invalid parse trees because they do not reflect upon constraints given by the values of features that are ignored in phase 1." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-51", "text": "These parse trees are eliminated in phase 2 by applying a grammar rule that corresponds to the applied CFG rule." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-52", "text": "We call this rule application rule-app." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-53", "text": "----------------------------------" }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-54", "text": "**COMPARISON WITH CFG FILTERING**" }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-55", "text": "In this section, we compare a pair of CFG filtering techniques for LTAG (Poller and Becker, 1998) and HPSG (Torisawa et al., 2000) described in Section 2.2.1 and 2.2.2." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-56", "text": "We hereafter refer to PB and TNT for the C++ implementations of the former and a valiant 1 of the latter, respectively." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-57", "text": "2 We first acquired LTAGs by a method proposed in Miyao et al. (2003) from Sections 2-21 of the Wall Street Journal (WSJ) in the Penn Treebank (Marcus et al., 1993) and its subsets." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-58", "text": "3 We then converted them into strongly equivalent HPSG-style grammars using the grammar conversion described in Section 2.1." 
}, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-59", "text": "Table 1 shows the size of CFG approximated from the strongly equivalent grammars." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-60", "text": "G x , CFG PB , and CFG TNT henceforth refer to the LTAG extracted from Section x of WSJ and CFGs approximated from G x by PB and TNT, respectively." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-61", "text": "The size of CFG TNT is much larger than that of CFG PB ." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-62", "text": "By investigating parsing performance using these CFGs, we show that the larger size of CFG TNT resulted in better parsing performance." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-63", "text": "Table 2 shows the parse time with 254 sentences of length n (\u226410) from Section 2 of WSJ (the average length is 6.72 words)." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-64", "text": "4 This result shows not only that TNT achieved a drastic speed-up against PB, but also that performance difference between them increases with the larger size of the grammars." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-65", "text": "In order to estimate the degree of CF approximation, we measured the number of essential (inactive) edges of phase 1." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-66", "text": "Table 3 shows the number of the essential edges." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-67", "text": "The number of essential edges produced by PB is much larger than that produced by TNT." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-68", "text": "We then investigated the effect on phase 2 as caused by the different number of the essential edges." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-69", "text": "Table 4 shows the success rate of ok-prop and rule-app." 
}, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-70", "text": "The success rate of rule-app is 100%, 5 whereas that of ok-prop is quite low." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-71", "text": "6 These results indicate that CFG TNT is superior to CFG PB with respect to the degree of the CF approximation." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-72", "text": "We can explain the reason for this difference by investigating how TNT approximates HPSG-style grammars converted from LTAGs." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-73", "text": "As described in Section 2.1, the grammar conversion preserves the whole structure of each elementary tree (precisely, a canonical elementary tree) in a stack, and grammar rules manipulate a head element of the stack." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-74", "text": "A generated feature structure in the approximation process thus corresponds to the whole unprocessed portion of a canonical elementary tree." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-75", "text": "This implies that successful context-free derivations obtained by CFG TNT basically involve elementary trees in which all substitution and adjunction have succeeded." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-76", "text": "However, CFG PB (also a CFG produced by the other work (Harbusch, 1990) ) cannot avoid generating invalid parse trees that connect two lo-cal structures where adjunction takes place between them." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-77", "text": "We measured with G 2-21 the proportion of the number of ok-prop between two node numbers of nodes that take adjunction and its success rate." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-78", "text": "It occupied 87% of the total number of ok-prop and its success rate was only 22%." 
}, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-79", "text": "These results suggest that the global contexts in a given grammar is essential to obtain an effective CFG filter." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-80", "text": "It should be noted that the above investigation also tells us another way of CF approximation of LTAG." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-81", "text": "We first define a unique way of tree traversal such as head-corner traversal (van Noord, 1994) on which we can perform a sequential application of substitution and adjunction." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-82", "text": "We then recursively apply substitution and adjunction on that traversal to an elementary tree and a generated tree structure." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-83", "text": "Because the processed portions of generated tree structures are no longer used later, we regard the unprocessed portions of the tree structures as nonterminals of CFG." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-84", "text": "We can thereby construct another CFG filtering for LTAG by combining this CFG filter with an existing LTAG parsing algorithm (van Noord, 1994) ." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-85", "text": "----------------------------------" }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-86", "text": "**CONCLUSION AND FUTURE DIRECTION**" }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-87", "text": "We presented an empirical comparison of LTAG and HPSG parsers with CFG filtering." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-88", "text": "We compared the parsers with strongly equivalent grammars obtained by converting LTAGs extracted from the Penn Treebank into HPSG-style." 
}, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-89", "text": "Experimental results showed that the existing CF approximation of HPSG (Torisawa et al., 2000) produced a more effective filter than that of LTAG (Poller and Becker, 1998) ." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-90", "text": "By investigating the different ways of CF approximation, we concluded that the global constraints in a given grammar is essential to obtain an effective filter." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-91", "text": "We are going to integrate the advantage of the CF approximation of HPSG into that of LTAG in order to establish another CFG filtering for LTAG." }, { "sent_id": "3816a122d7f0847c01415fadef2d3d-C001-92", "text": "We will also conduct experiments on trade-offs between the degree of CF approximation and the size of approximated CFGs as in Maxwell III and Kaplan (1993) ." } ], "y": { "@USE@": { "gold_contexts": [ [ "3816a122d7f0847c01415fadef2d3d-C001-12" ], [ "3816a122d7f0847c01415fadef2d3d-C001-21" ], [ "3816a122d7f0847c01415fadef2d3d-C001-38" ], [ "3816a122d7f0847c01415fadef2d3d-C001-76" ] ], "cite_sentences": [ "3816a122d7f0847c01415fadef2d3d-C001-12", "3816a122d7f0847c01415fadef2d3d-C001-21", "3816a122d7f0847c01415fadef2d3d-C001-38", "3816a122d7f0847c01415fadef2d3d-C001-76" ] }, "@MOT@": { "gold_contexts": [ [ "3816a122d7f0847c01415fadef2d3d-C001-76" ] ], "cite_sentences": [ "3816a122d7f0847c01415fadef2d3d-C001-76" ] } } }, "ABC_6a693f9cbc6dbb3676d765eee97db7_43": { "x": [ { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-2", "text": "My research is focused on developing machine learning algorithms for inferring dependency parsers from language data." 
}, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-3", "text": "By investigating several approaches I have developed a unifying perspective that allows me to share advances between both probabilistic and non-probabilistic methods." }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-4", "text": "First, I describe a generative technique that uses a strictly lexicalised parsing model, where all the parameters are based on words and do not use any partof-speech (POS) tags nor grammatical categories." }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-5", "text": "Then, I incorporate two ideas from probabilistic parsing-word similarity smoothing and local estimation-to improve the large margin approach." }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-6", "text": "Finally, I present a simpler and more efficient approach to training dependency parsers by applying a boosting-like procedure to standard training methods." }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-7", "text": "----------------------------------" }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-9", "text": "Over the past decade, there has been tremendous progress on learning parsing models from treebank data (Magerman, 1995; Collins, 1999; Charniak, 1997; Ratnaparkhi, 1999; Charniak, 2000; Wang et al., 2005; McDonald et al., 2005) ." }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-10", "text": "Most of the early work in this area was based on postulating generative probability models of language that included parse structures (Magerman, 1995; Collins, 1997; Charniak, 1997) ." }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-11", "text": "Learning in this context consisted of estimating the parameters of the model with simple likelihood based techniques, but incorporating various smoothing and back-off estimation tricks to cope with the sparse data problems (Collins, 1997; Bikel, 2004) ." 
}, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-12", "text": "Subsequent research began to focus more on conditional models of parse structure given the input sentence, which allowed discriminative training techniques such as maximum conditional likelihood (i.e. \"maximum entropy\") to be applied (Ratnaparkhi, 1999; Charniak, 2000) ." }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-13", "text": "Currently, the work on conditional parsing models appears to have culminated in large margin training approaches (Taskar et al., 2004; McDonald et al., 2005) , which demonstrates the state of the art performance in English dependency parsing." }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-14", "text": "Despite the realization that maximum margin training is closely related to maximum conditional likelihood for conditional models (McDonald et al., 2005) , a sufficiently unified view has not yet been achieved that permits the easy exchange of improvements between the probabilistic and nonprobabilistic approaches." }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-15", "text": "For example, smoothing methods have played a central role in probabilistic approaches (Collins, 1997; Wang et al., 2005) , and yet they are not being used in current large margin training algorithms." }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-16", "text": "Another unexploited connection is that probabilistic approaches pay closer attention to the individual errors made by each component of a parse, whereas the training error minimized in the large margin approach-the \"structured margin loss\" (McDonald et al., 2005) -is a coarse measure that only assesses the total error of an entire parse rather than focusing on the error of any particular component." }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-17", "text": "I have addressed both of these issues, as well as others in my work." 
}, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-18", "text": "----------------------------------" }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-19", "text": "**DEPENDENCY PARSING MODEL**" }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-20", "text": "Given a sentence denote the set of all the directed, projective trees that span ." }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-21", "text": "From an input sentence , one would like to be able to compute the best parse; that is, a projective tree, 5 4 1 \u00a7 \u00a2 2 3" }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-22", "text": ", that obtains the highest \"score\"." }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-66", "text": "----------------------------------" }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-23", "text": "In particular, I follow Eisner (1996) and McDonald et al. (2005) and assume that the score of a complete spanning tree for a given sentence, whether probabilistically motivated or not, can be decomposed as a sum of local scores for each link (a word pair)." }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-24", "text": "In which case, the parsing problem reduces to 7 6 \u00a1 9 8" }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-25", "text": "where the score s" }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-26", "text": "can depend on any measurable property of \u00a4 and \u00a4 % $ within the tree ." }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-27", "text": "This formulation is sufficiently general to capture most dependency parsing models, including probabilistic dependency models (Wang et al., 2005; Eisner, 1996) as well as non-probabilistic models (McDonald et al., 2005; Wang et al., 2006) ." 
}, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-28", "text": "For the purpose of learning, the score of each link can be expressed as a weighted linear combination of features" }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-29", "text": "where i are the weight parameters to be estimated during training." }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-30", "text": "----------------------------------" }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-31", "text": "**LEXICALISED DEPENDENCY PARSING**" }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-32", "text": "To learn an accurate dependency parser from data, the first approach I investigated is based on a strictly lexical parsing model where all the parameters are based on words (Wang et al., 2005) ." }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-33", "text": "The advantage of this approach is that it does not rely on part-ofspeech tags nor grammatical categories." }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-34", "text": "Furthermore, I based training on maximizing the conditional probability of a parse tree given a sentence, unlike most previous generative models (Magerman, 1995; Collins, 1997; Charniak, 1997) , which focus on maximizing the joint probability of the parse tree and the sentence." }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-35", "text": "An efficient training algorithm can be achieved by maximizing the conditional probability of each parsing decision, hence minimizing a loss based on each local link decision independently." }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-36", "text": "Importantly, inter-dependence between links can still be accommodated by exploiting dynamic features in training-features that take into account the labels of (some) of the surrounding components when predicting the label of a target component." 
}, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-37", "text": "To cope with the sparse data problem, I use distributional word similarity (Pereira et al., 1993; Grefenstette, 1994; Lin, 1998) to generalize the observed frequency counts in the training corpus." }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-38", "text": "The experimental results on the Chinese Treebank 4.0 show that the accuracy of the conditional model is 13.6% higher than corresponding joint models, while similarity smoothing also allows the strictly lexicalised approach to outperform corresponding models based on part-of-speech tags." }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-39", "text": "----------------------------------" }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-40", "text": "**EXTENSIONS TO LARGE MARGIN PARSING**" }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-41", "text": "The approach presented above has a limitation: it uses a local scoring function instead of a global scoring function to compute the score for a candidate tree." }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-42", "text": "The structured large margin approach, on the other hand, uses a global scoring function by minimizing a training loss-the \"structured margin loss\" (McDonald et al., 2005) -which is directly coordinated with the global tree." }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-43", "text": "However, the training error minimized in the large margin approach is a coarse measure that only assesses the total error of an entire parse rather than focusing on the error of any particular component." }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-67", "text": "**FUTURE WORK**" }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-44", "text": "Also, smoothing methods, which have been widely used in probabilistic approaches, are not currently being used in large margin training algorithms." 
}, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-45", "text": "In the second approach, I improve structured large margin training for parsing in two ways (Wang et al., 2006) ." }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-46", "text": "First, I incorporate local constraints that enforce the correctness of each individual link, rather than just scoring the global parse tree." }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-47", "text": "Second, to cope with sparse data and generalize to unseen words, I smooth the lexical parameters according to their underlying word similarities." }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-48", "text": "To smooth parameters in the large margin framework, I introduce the technique of Laplacian regularization in large margin parsing." }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-49", "text": "Finally, to demonstrate the benefits of my approach, I reconsider the problem of parsing Chinese treebank data using only lexical features, as in Section 3." }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-50", "text": "My results improve current large margin approaches and show that similarity smoothing combined with local constraint enforcement leads to state of the art performance, while only requiring word-based features that do not rely on part-of-speech tags nor grammatical categories in any way." }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-51", "text": "----------------------------------" }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-52", "text": "**TRAINING VIA STRUCTURED BOOSTING**" }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-53", "text": "Finally, I have recently demonstrated the somewhat surprising result that state of the art dependency parsing performance can be achieved through the use of conventional, local classification methods." 
}, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-54", "text": "In particular, I show how a simple form of structured boosting can be used to improve the training of standard local classification methods, in the context of structured predictions, without modifying the underlying training method (Wang et al., 2007) ." }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-55", "text": "The advantage of this approach is that one can use off-theshelf classification techniques, such as support vector machines or logistic regression, to achieve competitive parsing results with little additional effort." }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-56", "text": "The idea behind structured boosting is very simple." }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-57", "text": "To produce an accurate parsing model, one combines the local predictions of multiple weak predictors to obtain a score for each link, which a parser can then use to compute the maximum score tree for a given sentence." }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-58", "text": "Structured boosting proceeds in rounds." }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-59", "text": "On each round a local \"link predictor\" is trained merely to predict the existence and orientation of a link between two words given input features encoding context-without worrying about coordinating the predictions in a coherent global parse." }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-60", "text": "Once a weak predictor is learned, it is added to the ensemble of weak hypotheses, the training corpus is re-parsed using the new predictor, and the local training contexts are re-weighted based on errors made by the parser's output." }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-61", "text": "Thus, a wrapper approach is used to successively modify the training data so that the training algorithm is encouraged to facilitate improved global parsing accuracy." 
}, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-62", "text": "Table 1 shows that the results I am able to achieve on English are competitive with the state of the art, but are still behind the best results of ." }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-63", "text": "However, perhaps surprisingly, Table 1 also shows that the structured boosting approach actually surpasses state of the art accuracy on Chinese parsing for both treebank collections." }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-64", "text": "----------------------------------" }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-65", "text": "**CURRENT RESULTS**" }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-68", "text": "Although the three pieces of my work above look very different superficially, they are actually closely related by the \"scoring\" formulation and, more specifically, by the equations introduced in Section 2." }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-69", "text": "In other words, they all compute a linear classifier." }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-70", "text": "4 The only differences among them are: (1) What features are used? (2) How are the parameters i estimated?" }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-71", "text": "A general perspective I bring to my investigation is the desire to delineate the effects of domain engineering (choosing good features for representing and learning parsing models) from the general machine learning principles (training criteria, regularization and smoothing techniques) that permit good results." }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-72", "text": "In fact, combined features have been proved to be useful in dependency parsing with support vector machines (Yamada and Matsumoto, 2003) , and I have already obtained some preliminary results on generating useful feature combinations via boosting." 
}, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-73", "text": "Therefore, I will consider combining all the projects I presented above." }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-74", "text": "That is, I plan to incorporate all the useful features, the morphological features and the combined features as discussed above, into the training algorithms presented in Section 4 or Section 5, to train a dependency parser globally." }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-75", "text": "Then I am going to augment the training with the existing smoothing and regularization techniques (as described in Section 4), or new developed ones." }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-76", "text": "I expect the resulting parser to have better performance than those I have presented above." }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-77", "text": "There are a lot of other ideas which can be explored in my future work." }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-78", "text": "First and most important, I plan to investigate new advanced machine learning methods (e.g., structured boosting or unsupervised / semi-supervised algorithms (Xu et al., 2006) ) and apply them to the dependency parsing problem generally, since the goal of my research is to learn natural language parsers in an elegant and principled manner." }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-79", "text": "Next, I am going to apply my approaches to parse other languages, such as Czech, German, Spanish and French, and analyze the performance of my parsers on these different languages." }, { "sent_id": "6a693f9cbc6dbb3676d765eee97db7-C001-80", "text": "Furthermore, I plan to apply my parsers in other domains (e.g., biomedical data) (Blitzer et al., 2006) besides treebank data, to investigate the effectiveness and generality of my approaches." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "6a693f9cbc6dbb3676d765eee97db7-C001-9" ], [ "6a693f9cbc6dbb3676d765eee97db7-C001-15" ] ], "cite_sentences": [ "6a693f9cbc6dbb3676d765eee97db7-C001-9", "6a693f9cbc6dbb3676d765eee97db7-C001-15" ] }, "@MOT@": { "gold_contexts": [ [ "6a693f9cbc6dbb3676d765eee97db7-C001-15", "6a693f9cbc6dbb3676d765eee97db7-C001-17" ] ], "cite_sentences": [ "6a693f9cbc6dbb3676d765eee97db7-C001-15" ] }, "@USE@": { "gold_contexts": [ [ "6a693f9cbc6dbb3676d765eee97db7-C001-27" ], [ "6a693f9cbc6dbb3676d765eee97db7-C001-32" ] ], "cite_sentences": [ "6a693f9cbc6dbb3676d765eee97db7-C001-27", "6a693f9cbc6dbb3676d765eee97db7-C001-32" ] } } }, "ABC_cc5927700475b7abc0482a28ab209a_43": { "x": [ { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-2", "text": "Word embeddings have become a staple of several natural language processing tasks, yet much remains to be understood about their properties." }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-3", "text": "In this work, we analyze word embeddings in terms of their principal components and arrive at a number of novel conclusions." }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-4", "text": "In particular, we characterize the utility of variance explained by the principal components (widely used as a fundamental tool to assess the quality of the resulting representations) as a proxy for downstream performance." }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-5", "text": "Further, through dimensional linguistic probing of the embedding space, we show that the syntactic information captured by a principal component does not depend on the amount of variance it explains." 
}, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-6", "text": "Consequently, we investigate the limitations of variance based embedding post-processing techniques and demonstrate that such post-processing is counterproductive in a number of scenarios such as sentence classification and machine translation tasks." }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-7", "text": "Finally, we offer a few guidelines on variance based embedding post-processing." }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-8", "text": "We have released the source code along with the paper 1 . * equal contribution 1 Code Link: https://github.com/aclsrw/anonymized_code" }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-9", "text": "----------------------------------" }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-11", "text": "Word embeddings have revolutionized natural language processing by representing words as dense real-valued vectors in a low dimensional space." }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-12", "text": "Pre-trained word embeddings such as Glove (Pennington et al., 2014) , word2vec (Mikolov et al., 2013) and fasttext (Bojanowski et al., 2017) , trained on large corpora are readily available for use in a variety of tasks." }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-13", "text": "Subsequently, there has been emphasis on post-processing the embeddings to improve their performance on downstream tasks (Mu and Viswanath, 2018) or to induce linguistic properties (Mrk\u0161ic et al.; Faruqui et al., 2015) ." }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-14", "text": "In particular, the Principal Component Analysis (PCA) based post-processing algorithm proposed by (Mu and Viswanath, 2018) has led to significant gains in word and sentence similarity tasks, and has also proved useful in dimensionality reduction (Raunak, 2017) ." 
}, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-15", "text": "Similarly, understanding the geometry of word embeddings is another area of active research (Mimno and Thompson, 2017) ." }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-16", "text": "Researchers have tried to ascertain the importance of dimensionality for word embeddings, with results from (Yin and Shen, 2018) answering the question of optimal dimensionality selection." }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-17", "text": "In contrast to previous work, we explore the dimensional properties of existing pre-trained word embeddings through their principal components." }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-18", "text": "Specifically, our contributions are as follows:" }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-19", "text": "1. We analyze the word embeddings in terms of their principal components and demonstrate that their performance on both word similarity and sentence classification tasks saturates well before the full dimensionality." }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-20", "text": "2. We demonstrate that the amount of variance captured by the principal components is a poor representative for the downstream performance of the embeddings constructed using the very same principal components." }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-21", "text": "3. We investigate the reasons behind the aforementioned result through syntactic information based dimensional linguistic probing tasks and demonstrate that the syntactic information captured by a principal component is independent of the amount of variance it explains." }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-22", "text": "4. 
We point out the limitations of applying variance-based post-processing (Mu and Viswanath, 2018) and demonstrate that it leads to a decrease in performance in sentence classification and machine translation tasks. In Section 1, we provide an introduction to the problem statement." }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-23", "text": "In Section 2 we discuss dimensional properties of word embeddings." }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-24", "text": "In Section 3 we conduct a variance-based analysis by measuring performance of word embeddings on downstream tasks." }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-25", "text": "In Section 4 we move on to dimensional linguistic probing tasks, followed by Section 5 where we discuss the post-processing algorithm, and finally conclude in Section 6." }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-26", "text": "----------------------------------" }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-27", "text": "**DIMENSIONAL PROPERTIES OF THE WORD EMBEDDING SPACE**" }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-28", "text": "Principal components provide a natural basis for studying the properties of an embedding space." }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-29", "text": "In this work, we refer to the properties pertaining to the principal components of the embedding space as dimensional properties." }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-30", "text": "We study such dimensional properties in a number of different contexts such as word similarity, sentence classification and machine translation tasks." }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-31", "text": "For experiments in this section, we use Glove embeddings (trained on Wikipedia 2014 + Gigaword 5 2 )." 
}, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-32", "text": "Subsequently, we also use fasttext (trained on Wikipedia, UBMC webbase corpus and statmt.org news dataset 3 ) and Word2vec (trained on the GoogleNews dataset 4 ) embeddings." }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-33", "text": "Details of the corresponding datasets and evaluation tasks have been omitted due to space limit. Please refer to Conneau and Kiela (2018) for the details on sentence classification tasks and the classification algorithm and Faruqui and Dyer (2014) for word similarity benchmarks." }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-34", "text": "Further, each of our experiments (except Machine Translation in Section 5.2) are deterministic in nature." }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-35", "text": "Figure 1 shows the performance of word embeddings on 13 word similarity benchmarks (Faruqui and Dyer, 2014) ." }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-36", "text": "The dimensions vary along the X-axis and each new evaluation cumulatively adds 10 more principal components to the embeddings (thus, there are 30 measurements for each dataset, ranging from word embeddings constructed using the first 10 principal components to full 300 principal components)." }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-37", "text": "The performance is measured using Spearman's rank correlation coefficient (Rho) between the human assigned and cosine similarity based rankings of the word vectors." }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-38", "text": "Figure 2 shows the performance (Test accuracy) on 9 standard downstream sentence classification tasks (Conneau and Kiela, 2018) using the same procedure for constructing word embeddings as done in 2.1." 
}, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-39", "text": "Further, sentence vectors were constructed using an average of the contained word embeddings, which has been demonstrated to be a very strong baseline for downstream tasks (Arora et al., 2017) ." }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-40", "text": "From Figures 1, 2 it is evident that the performance saturates consistently at around 200 dimensions, after which adding new principal components does not lead to much gain in performance." }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-41", "text": "This implies redundancy (not noise) among the dimensions." }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-42", "text": "Further, this also suggests a simple strategy to reduce embedding size wherein one third of the components could be reliably removed without affecting the performance on word similarity or sentence classification tasks, leading to approximately 33% memory reduction." }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-43", "text": "----------------------------------" }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-44", "text": "**WORD SIMILARITY TASKS**" }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-45", "text": "----------------------------------" }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-46", "text": "**SENTENCE CLASSIFICATION TASKS**" }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-47", "text": "----------------------------------" }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-48", "text": "**VARIANCE BASED ANALYSIS**" }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-49", "text": "In this section, we characterize the redundancy observed in Section 2, in terms of variance of the principal components." 
}, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-50", "text": "Specifically, we measure downstream performance (on the sentence classification tasks of Section 2.2) of word embeddings against the amount of variance captured by the principal components." }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-51", "text": "For each of the embedding types, we first construct word embeddings using only top 100 principal components (T), the middle 100 principal components (M) and the bottom 100 principal components (B)." }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-52", "text": "The three sets of embeddings differ significantly, across the embedding types, in terms of the variance explained by the principal components used in their construction." }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-53", "text": "The results are presented in Table 1 ." }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-54", "text": "The highlighted cells in Table 1 show that, in a number of cases, embeddings built using the middle (M) and the bottom 100 principal components (B) outperform the embeddings constructed using the top 100 principal components (T)." }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-55", "text": "Although, variance explained by the principal components is widely used as a fundamental tool to assess the quality of the corresponding representations (Jolliffe and Cadima, 2016), these results demonstrate that for word embeddings, the variance explained by the principal components is a poor representative of downstream performance." 
}, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-56", "text": "----------------------------------" }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-57", "text": "**DIMENSIONAL LINGUISTIC PROBING TASKS**" }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-58", "text": "A hypothesis to explain the better performance of M and B embeddings (Table 1) in the earlier section is that the syntactic information required for downstream sentence classification tasks is distributed independently with respect to the principal components." }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-59", "text": "To explore this hypothesis, we propose to leverage two classification based linguistic probing tasks, namely TreeDepth and TopConst , which are designed to test whether sentence embeddings are sensitive to syntactic properties of the encoded sentences." }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-60", "text": "The TreeDepth task tests whether the model can predict the depth of the hierarchical syntactic structure of the sentence, whereas in TopConst, a sentence must be classified in terms of the sequence of its constituents occurring immediately below the sentence node of its hierarchical structure." }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-61", "text": "To evaluate the syntactic information contained in each of the principal components, we first construct one-dimensional word embeddings by projecting word vectors onto a single principal component and then use sentence vectors constructed by using these embeddings for solving the TreeDepth and TopConst tasks." }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-62", "text": "Figure 3 depicts the scores (Test accuracy) on TopConst and TreeDepth tasks respectively." 
}, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-63", "text": "Evidently, no single principal component (dimension) achieves a significantly higher score in any of the two tasks and the performance across the dimensions does not have any particular trend (increasing or decreasing)." }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-64", "text": "This validates the hypothesis that the principal components do not vary disproportionately in terms of the syntactic information contained." }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-65", "text": "----------------------------------" }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-66", "text": "**THE POST PROCESSING ALGORITHM (PPA)**" }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-67", "text": "In this section, we first describe and then evaluate the post-processing algorithm (PPA) proposed in (Mu and Viswanath, 2018) , which achieves high scores on Word and Semantic textual similarity tasks." }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-68", "text": "The algorithm removes the projections of top principal components from each of the word vectors, making the individual word vectors more discriminative (Refer to Algorithm 1 for details)." }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-69", "text": "Algorithm 1: Post Processing Algorithm PPA(X, D) Data: Embedding Matrix X, Threshold Parameter D Result: Post-Processed Word Embedding Matrix X 1 X = X -X ;" }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-70", "text": "// Subtract Mean Embedding / * Compute PCA Components * / 2 ui = PCA(X), where i = 1, 2 . ." }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-71", "text": ". 
d;" }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-72", "text": "----------------------------------" }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-73", "text": "**SENTENCE CLASSIFICATION TASKS**" }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-74", "text": "We compare the performance of PPA (with a constant D=5 across all the embeddings) on 9 downstream sentence classification tasks." }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-75", "text": "Table 2 shows that such post-processing doesn't always lead to gains in accuracy and can be counterproductive in a number of tasks." }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-76", "text": "This suggests that within the context of downstream sentence classification tasks, projecting word vectors away from the top components leads to a loss of 'useful' information." }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-77", "text": "This could again be explained using the analysis from Figure 3 , wherein it is evident that the top dimensions also contain syntactic information, the loss of which adversely impacts downstream classification tasks, which by construction, benefit from both semantic and syntactic information." }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-78", "text": "On the same tasks, we also observe a drop in sentence classification accuracy (2.37/1.99/3.94 average drop on word2vec/Glove/fasttext) using 150 dimensional embeddings obtained from PPA based dimensionality reduction (Raunak, 2017)." 
}, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-79", "text": "----------------------------------" }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-80", "text": "**MACHINE TRANSLATION**" }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-81", "text": "Recently, (Qi et al., 2018) have shown that pretrained embeddings lead to significant gains in performance for the translation of three low resource languages namely, Azerbaijani (AZ), Belarusian (BE) and Galician (GL) into English (EN)." }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-82", "text": "Here, we demonstrate the impact of the post processing algorithm on machine translation (MT) tasks." }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-83", "text": "We replicate the experimental settings of (Qi et al., 2018) and use a standard 1 layer encoder-decoder model with attention (Bahdanau et al., 2014) and a beam size of 5." }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-84", "text": "Prior to training, we initialize the encoder with fasttext word embeddings (no other embeddings are publically available for these languages) trained on Wikipedia 5 ." }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-85", "text": "We then use PPA on the pretrained embeddings and train again." }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-86", "text": "Results in Table 3 show that removing the top principal component(s) leads to a consistent drop in BLEU scores across the three language pairs." }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-87", "text": "This can be explained using the analysis from earlier section i.e. instead of strengthening the embeddings, removing the top components leads to a loss of 'useful' information for the Machine translation task." 
}, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-88", "text": "----------------------------------" }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-89", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-90", "text": "To conclude, besides elucidating redundancy in the word embedding space, we demonstrate that the variance explained by the word embeddings' principal components is not a reliable proxy for the downstream utility of the corresponding representations and that the syntactic information captured by a principal component does not depend on the amount of variance it explains." }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-91", "text": "Further, we show that variance based post-processing is not suitable for tasks which rely more on syntax, such as sentence classification and machine translation." }, { "sent_id": "cc5927700475b7abc0482a28ab209a-C001-92", "text": "Further, we wish to explore whether the geometric intuitions developed for word embeddings could be leveraged for contextualized embeddings such as ElMo (Peters et al., 2018) and BERT (Devlin et al., 2018) ." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "cc5927700475b7abc0482a28ab209a-C001-13" ], [ "cc5927700475b7abc0482a28ab209a-C001-14" ] ], "cite_sentences": [ "cc5927700475b7abc0482a28ab209a-C001-13", "cc5927700475b7abc0482a28ab209a-C001-14" ] }, "@MOT@": { "gold_contexts": [ [ "cc5927700475b7abc0482a28ab209a-C001-22" ] ], "cite_sentences": [ "cc5927700475b7abc0482a28ab209a-C001-22" ] }, "@EXT@": { "gold_contexts": [ [ "cc5927700475b7abc0482a28ab209a-C001-22" ] ], "cite_sentences": [ "cc5927700475b7abc0482a28ab209a-C001-22" ] }, "@USE@": { "gold_contexts": [ [ "cc5927700475b7abc0482a28ab209a-C001-67" ] ], "cite_sentences": [ "cc5927700475b7abc0482a28ab209a-C001-67" ] } } }, "ABC_fd50c8cf386e3ce8c8dd8dc46c467f_43": { "x": [ { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-78", "text": "----------------------------------" }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-79", "text": "**CNN**" }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-2", "text": "Humor is an essential but most fascinating element in personal communication." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-3", "text": "How to build computational models to discover the structures of humor, recognize humor and even generate humor remains a challenge and there have been yet few attempts on it." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-4", "text": "In this paper, we construct and collect four datasets with distinct joke types in both English and Chinese and conduct learning experiments on humor recognition." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-5", "text": "We implement a Convolutional Neural Network (CNN) with extensive filter size, number and Highway Networks to increase the depth of networks." 
}, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-6", "text": "Results show that our model outperforms in recognition of different types of humor with benchmarks collected in both English and Chinese languages on accuracy, precision, and recall in comparison to previous works." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-7", "text": "----------------------------------" }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-9", "text": "Humor, a highly intelligent communicative activity, provokes laughter or provides amusement." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-10", "text": "The role that humor plays in life can be viewed as a sociological phenomenon and function." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-11", "text": "Proper use of it can help eliminate embarrassment, establish social relationships, create positive affection in human social interactions." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-12", "text": "If computers can understand humor to some extent, it would facilitate predicting human's intention in human conversation, and thereby enhance the proficiency of many machine-human interaction systems." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-13", "text": "However, to automate the humor recognition is also a very challenging research topic in natural language understanding." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-14", "text": "The extent to which a person may sense humor depends on his/her personal background." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-15", "text": "For example, young children may favor cartoons while the grownups may feel the humor in cartoons boring." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-16", "text": "Also, many types of humor require substantial such external knowledge as irony, wordplay, metaphor and sarcasm." 
}, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-17", "text": "These factors make the task of automated humor recognition difficult." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-18", "text": "Recently, with the advance of deep learning that allows end-to-end training with big data without human intervention of feature selection, humor recognition becomes promising." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-19", "text": "In this work, we propose a convolutional neural network (CNN) with augmentation of both the filter sizes and filter numbers." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-20", "text": "We use the architecture called highway network to implement a much more proficient model for humor recognition." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-21", "text": "The performance on many benchmarks shows a significant improvement in detecting different humor context genre." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-22", "text": "----------------------------------" }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-23", "text": "**RELATED WORK**" }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-24", "text": "The task of automatic humor recognition refers to deciding whether a given sentence expresses a certain degree of humor." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-25", "text": "In early studies, most of them are formulated as a binary classification, based on selection on linguistic features." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-26", "text": "Purandare and Litman analyzed humorous spoken conversations from a classic comedy television show." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-27", "text": "They used standard supervised learning classifiers to identify humorous speech (Purandare and Litman, 2006 )." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-28", "text": "Taylor and Marlack focused on a specific type of humor, wordplays." 
}, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-29", "text": "Their algorithm of the study was based on the extraction of structural patterns and peculiar structure of jokes (Taylor and Mazlack, 2004) ." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-30", "text": "Later, Yang et al. (2015) formulated a classifier to distinguish between humorous and non-humorous instances, and also created computational models to discover the latent semantic structure behind humor from four perspectives: incongruity, ambiguity, interpersonal effect and phonetic style." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-31", "text": "Recently, with the rise of artificial neural networks, many studies utilize the methods for humor recognition." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-32", "text": "Luke and Alfredo applied recurrent neural network (RNN) to humor detec-tion from reviews in Yelp dataset." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-33", "text": "In addition, they also applied convolutional neural networks (CNNs) to train a model and the work shows that the model trained with CNNs has more accurate humor recognition (de Oliveira and Rodrigo, 2015) ." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-34", "text": "In other research (Bertero and Fung, 2016) , CNNs were found to be a better sentence encoder for humor recognition as well." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-35", "text": "In a recent work, Chen and Lee predicted audience's laughter also using convolutional neural network." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-36", "text": "Their work gets higher detection accuracy and is able to learn essential feature automatically (Chen and Lee, 2017) ." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-37", "text": "However, there are still some limitations: (a) they focused on only a specific humor type in TED data, that is puns." 
}, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-38", "text": "(b) the datasets in most studies are English corpus." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-39", "text": "(c) the evaluations are isolated from other research." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-40", "text": "In our work, we build the humor recognizer by using CNNs with extensive filter size and number, and the result shows higher accuracy from previous CNNs models." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-41", "text": "We conducted experiments on two different dataset, which were used in the previous studies." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-42", "text": "One is Pun of the Day (Yang et al., 2015) , and the other is 16000 One-Liners (Mihalcea and Strapparava, 2005) ." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-43", "text": "In addition, we constructed a Chinese dataset to evaluate the generality of the method performance on humor recognition against different languages." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-44", "text": "----------------------------------" }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-45", "text": "**DATA**" }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-46", "text": "To fairly evaluate the performance on humor recognition, we need the dataset to consist of both humorous (positive) and non-humorous (negative) samples." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-47", "text": "The datasets we use to construct humor recognition experiments includes four parts: Pun of the Day (Yang et al., 2015) , 16000 OneLiners (Mihalcea and Strapparava, 2005) , Short Jokes dataset and PTT jokes." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-48", "text": "The four datasets have different joke types, sentence lengths, data sizes and languages that allow us to conduct more comprehensive and comparative experiments." 
}, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-49", "text": "We would like to thank Yang and Mihalcea for their kindly provision of two former datasets. And we depict how we collect the latter two datasets in the following subsections." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-50", "text": "Table 1 shows the statistics of four datasets." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-51", "text": "----------------------------------" }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-52", "text": "**16000 ONE-LINERS**" }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-53", "text": "16000 One-Liners dataset collected humorous samples from daily joke websites while using formal writing resources (e.g., news titles) to obtain" }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-54", "text": "----------------------------------" }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-55", "text": "**PUN OF THE DAY**" }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-56", "text": "Pun of the Day dataset was constructed from the Pun of the Day website." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-57", "text": "The pun, also called paronomasia, is a form of wordplay that exploits multiple meanings of a term, or of similarsounding words, for an intended humorous or rhetorical effect." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-58", "text": "The negative samples of this dataset are sampled from news website." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-59", "text": "----------------------------------" }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-60", "text": "**SHORT JOKES DATASET**" }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-61", "text": "Short Jokes dataset, which collected the most amount of jokes among four datasets, are from an open database on a Kaggle project 1 ." 
}, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-62", "text": "It contains 231,657 short jokes with no restriction on joke types scraped from various joke websites and length ranging from 10 to 200 characters." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-63", "text": "We use it as our positive samples." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-64", "text": "For the negative samples, we choose WMT16 2 English news crawl as our non-humorous data resource." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-65", "text": "However, simply treating sentences from the resource as negative samples could result in deceptively high performance of classification due to the domain differences between positive and negative data." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-66", "text": "So we try to minimize such domain differences by selecting negative samples whose words all appear in the positive samples and whose average text length being close to the humorous ones." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-67", "text": "----------------------------------" }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-68", "text": "**PTT JOKES**" }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-69", "text": "PTT Bulletin Board System (PTT, Chinese: \u6279 \u8e22\u8e22, telnet://ptt.cc) is the largest terminal-based bulletin board system (BBS) in Taiwan." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-70", "text": "It has more than 1.5 million registered users and over 20,000 boards covering a multitude of topics." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-71", "text": "Every day more than 20,000 articles and 500,000 comments are posted." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-72", "text": "Additionally, there is a board called joke that we could acquire large amount of Chinese humor samples." 
}, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-73", "text": "Thus, we use some political-related words to extract political jokes from PTT and treat them as the positive samples." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-74", "text": "For the negative samples, we use Yahoo News in politics and select the samples by the same method we use in Short Jokes dataset to prevent from the problem of domain difference." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-75", "text": "----------------------------------" }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-76", "text": "**METHOD**" }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-77", "text": "In this section, we describe how we design our model for humor recognition." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-80", "text": "Convolutional neural network (CNN) is a neural network architecture designed to extract local features in high dimensional data such as image or speech signal." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-81", "text": "When it comes to natural language processing (NLP), CNN also shows successes in several text categorization tasks (Johnson and Zhang, 2015) ." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-82", "text": "The input of most NLP tasks, such as a sentence or a document could be represented as a 2D structure with word embedding (Mikolov et al., 2013) ." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-83", "text": "In the input 2D matrix, each row is a vector of a word, a word segment or even a character that depends on the embedding methods." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-84", "text": "And typically we make the window width of the filters the same as the embedding dimension." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-85", "text": "Thus, the filter size varies according to a sliding window size we decide." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-86", "text": "to 20." 
}, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-87", "text": "For each filter size, 100-200 filters are applied to the model." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-88", "text": "After convolutional layer, we exploit max pooling and then flatten the output." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-89", "text": "Assume we totally have n filters, eventually it will lead to a flatten 1D vector with dimension n at the prediction output." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-90", "text": "----------------------------------" }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-91", "text": "**HIGHWAY LAYER**" }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-92", "text": "To improve the performance we usually can connect the flattened output with a fully connected layer and predict labels." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-93", "text": "In this paper, we would like to evaluate the performance improvement as we increase the network depth." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-94", "text": "However, the training of deeper networks becomes more difficult with increasing depth." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-95", "text": "So we use the concept of highway network (Srivastava et al., 2015) to help improve our model." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-96", "text": "The highway network allows shortcut connections with gate functions." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-97", "text": "These gates are data-dependent with parameters." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-98", "text": "It allows information unimpeded to flow through several layers in information highways." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-99", "text": "The architecture is characterized by the gate units that learn to regulate the flow of information through a network." 
}, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-100", "text": "With this architecture, we could train much deeper nets." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-101", "text": "In the end, we also use dropout and connect the results to the output layer." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-102", "text": "----------------------------------" }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-103", "text": "**EXPERIMENT**" }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-104", "text": "In this section, we describe how we formulate humor recognition as a text classification problem and conduct experiments on four datasets which we mentioned in Section 3." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-105", "text": "We validate the performance of different network structure with 10 fold cross validation and compare with the performance of previous work." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-106", "text": "Table 2 shows the experiments on both 16000 One-Liners and Pun of the Day." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-107", "text": "We set the baseline on the previous works of Yang et al. (2015) by Random Forest with Word2Vec + Human Centric Feature (Word2Vec + HCF) and Chen and Lee (2017) by Convolutional Neural Networks." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-108", "text": "We choose a dropout rate at 0.5 and test our model's performance with two factors F and HN." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-109", "text": "F means the increase of filter size and number as we mentioned in section 4." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-110", "text": "Otherwise, the window sizes would be (5, 6, 7) and filter number is 100 that is the same with Chen and Lee (2017)'s. HN indicates that we use the highway layers to train deep networks and we set the HN layers = 3 because it has better stability and accuracy in training step." 
}, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-111", "text": "We could observe that when we use both F and 115 Table 3 presents the result of Short Jokes and PTT Jokes datasets." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-112", "text": "As we can see, for the datasets was construed, it achieve 0.924 on Short Jokes and 0.943 on PTT Jokes in terms of F1 score respectively." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-113", "text": "It shows that the deep learning model can, to some extent learn the humorous meaning and structure embedded in the text automatically without human selection of features." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-114", "text": "----------------------------------" }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-115", "text": "**DISCUSSION**" }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-116", "text": "In this section, we show a sample in each category (true positive, false positive, true negative and false negative) to get a sense of what kinds of sentences are predicted correctly and incorrectly." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-117", "text": "The sentences are shown in the table 4." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-118", "text": "Sentence TP when he gave his wife a necklace he got a chain reaction TN the barking of a dog does not disturb the man on a camel FP rats know the way of rats FN it's a fact taller people sleep longer in bed The TP sentence \"when he gave his wife a necklace he got a chain reaction\" shows that our model seems to be able to catch not only the literal meaning between the \"necklace\" and \"got a chain reaction\"." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-119", "text": "Besides, the TN sentence \"the barking of a dog does not disturb the man on a camel\" means that if you're lucky enough to own your own camel, a little thing like a barking dog won't bother you." 
}, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-120", "text": "The example is a proverb but not a joke and our model correctly recognizes it as a non-humor one." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-121", "text": "Model misclassifies certain instances such as the FP sentence \"rats know the way of rats\" is actually derived from a Chinese proverb and the model predict it as humor." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-122", "text": "In addition, the FN sentence \"it's a fact taller people sleep longer in bed\" is obviously a joke but it is not considered as a humor by the model." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-123", "text": "To deal with more subtle humor/non-humor, the model has room to be improved." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-124", "text": "----------------------------------" }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-125", "text": "**CONCLUSION**" }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-126", "text": "In this study, we have extended the techniques of automatic humor recognition to different types of humor as well as different languages in both English and Chinese." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-127", "text": "We proposed a deep learning CNN architecture with high way networks that can learn to distinguish between humorous and nonhumorous texts based on a large scale of balanced positive and negative dataset." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-128", "text": "The performance of the CNN model outperforms the previous work." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-129", "text": "It's worth mentioning that the recognition accuracy on PTT, political jokes in Chinese, and the short jokes dataset with various types of jokes in English are both as high as above 90%." 
}, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-130", "text": "The novel deep learning model relieves the required human intervention of selection linguistic features for humor recognition task." }, { "sent_id": "fd50c8cf386e3ce8c8dd8dc46c467f-C001-131", "text": "In future work, we would conduct more rigorous comparative evaluation with human humor recognition and look into how the humorous texts can be generated using deep learning models as well." } ], "y": { "@BACK@": { "gold_contexts": [ [ "fd50c8cf386e3ce8c8dd8dc46c467f-C001-30" ] ], "cite_sentences": [ "fd50c8cf386e3ce8c8dd8dc46c467f-C001-30" ] }, "@UNSURE@": { "gold_contexts": [ [ "fd50c8cf386e3ce8c8dd8dc46c467f-C001-42" ] ], "cite_sentences": [ "fd50c8cf386e3ce8c8dd8dc46c467f-C001-42" ] }, "@USE@": { "gold_contexts": [ [ "fd50c8cf386e3ce8c8dd8dc46c467f-C001-41", "fd50c8cf386e3ce8c8dd8dc46c467f-C001-42" ], [ "fd50c8cf386e3ce8c8dd8dc46c467f-C001-47" ], [ "fd50c8cf386e3ce8c8dd8dc46c467f-C001-107" ] ], "cite_sentences": [ "fd50c8cf386e3ce8c8dd8dc46c467f-C001-42", "fd50c8cf386e3ce8c8dd8dc46c467f-C001-47", "fd50c8cf386e3ce8c8dd8dc46c467f-C001-107" ] } } }, "ABC_0593fb7ee345cf632e6a61f1f21e6c_43": { "x": [ { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-2", "text": "Abstract-The Visual Question Answering (VQA) task combines challenges for processing data with both Visual and Linguistic processing, to answer basic 'common sense' questions about given images." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-3", "text": "Given an image and a question in natural language, the VQA system tries to find the correct answer to it using visual elements of the image and inference gathered from textual questions." 
}, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-4", "text": "In this survey, we cover and discuss the recent datasets released in the VQA domain dealing with various types of question-formats and enabling robustness of the machine-learning models." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-5", "text": "Next, we discuss about new deep learning models that have shown promising results over the VQA datasets." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-6", "text": "At the end, we present and discuss some of the results computed by us over the vanilla VQA models, Stacked Attention Network and the VQA Challenge 2017 winner model." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-7", "text": "We also provide the detailed analysis along with the challenges and future research directions." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-8", "text": "----------------------------------" }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-9", "text": "**I. INTRODUCTION**" }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-10", "text": "Visual Question Answering (VQA) refers to a challenging task which lies at the intersection of image understanding and language processing." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-11", "text": "The VQA task has witnessed a significant progress in the recent years by the machine intelligence community." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-12", "text": "The aim of VQA is to develop a system to answer specific questions about an input image." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-13", "text": "The answer could be in any of the following forms: a word, a phrase, binary answer, multiple choice answer, or a fill in the blank answer." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-14", "text": "Agarwal et al. 
[1] presented a novel way of combining computer vision and natural language processing concepts to achieve Visually Grounded Dialogue, a system mimicking human understanding of the environment with the use of visual observation and language understanding." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-15", "text": "The advancements in the field of deep learning have certainly helped to develop systems for the task of Image Question Answering." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-16", "text": "Krizhevsky et al. [2] proposed the AlexNet model, which created a revolution in the computer vision domain." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-17", "text": "The paper introduced the concept of Convolutional Neural Networks (CNN) to mainstream computer vision applications." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-18", "text": "Later, many authors have worked on CNNs, which has resulted in robust deep learning models like VGGNet [3] , Inception [4] , ResNet [5] , etc." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-19", "text": "Similarly, the recent advancements in the natural language processing area based on deep learning have improved text understanding performance as well." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-20", "text": "The first major algorithm in the context of text processing is considered to be the Recurrent Neural Network (RNN) [6] , which introduced the concept of prior context for time-series data." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-21", "text": "This architecture helped the growth of machine text understanding, which set new boundaries for machine translation, text classification and contextual understanding." 
}, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-22", "text": "Another major breakthrough in the domain was the introduction of Long-Short Term Memory (LSTM) architecture [7] which improvised over the RNN by introducing a context cell which stores the prior relevant information." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-23", "text": "The vanilla VQA model [1] used a combination of VGGNet [3] and LSTM [7] ." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-24", "text": "This model has been revised over the years, employing newer architectures and mathematical formulations." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-25", "text": "Along with this, many authors have worked on producing datasets for eliminating bias, strengthening the performance of the model by robust question-answer pairs which try to cover the various types of questions, testing the visual and language understanding of the system." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-26", "text": "In this survey, first we cover major datasets published for validating the Visual Question Answering task, such as VQA dataset [1] , DAQUAR [8] , Visual7W [9] and most recent datasets up to 2019 include Tally-QA [10] and KVQA [11] ." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-27", "text": "Next, we discuss the stateof-the-art architectures designed for the task of Visual Question Answering such as Vanilla VQA [1] , Stacked Attention Networks [12] and Pythia v1.0 [13] ." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-28", "text": "Next we present some of our computed results over the three architectures: vanilla VQA model [1] , Stacked Attention Network (SAN) [12] and Teney et al. model [14] ." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-29", "text": "Finally, we discuss the observations and future directions." 
}, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-30", "text": "----------------------------------" }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-31", "text": "**II. DATASETS**" }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-32", "text": "The major VQA datasets are summarized in Table I ." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-33", "text": "We present the datasets below." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-34", "text": "DAQUAR: DAQUAR stands for Dataset for Question Answering on Real World Images, released by Malinowski et al. [8] ." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-35", "text": "It is the first dataset released for the IQA task." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-36", "text": "The images are taken from NYU-Depth V2 dataset [18] ." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-37", "text": "The dataset is small with a total of 1449 images." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-38", "text": "The question bank includes 12468 question-answer pairs with 2483 unique questions." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-39", "text": "The questions have been generated by human annotations and confined within 9 question templates using annotations of the NYU-Depth dataset." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-40", "text": "VQA Dataset: The Visual Question Answering (VQA) dataset [1] is one of the largest datasets collected from the MS-COCO [19] [9] is also based on the MS-COCO dataset." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-41", "text": "It contains 47,300 COCO images with 327,939 question-answer pairs." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-42", "text": "The dataset also consists of 1,311,756 multiple choice questions and answers with 561,459 groundings." 
}, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-43", "text": "The dataset mainly deals with seven forms of questions (from where it derives its name): What, Where, When, Who, Why, How, and Which." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-44", "text": "It is majorly formed by two types of questions." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-45", "text": "The telling questions are the ones which are text-based, giving a sort of description." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-46", "text": "The pointing questions are the ones that begin with Which, and have to be correctly identified by the bounding boxes among the group of plausible answers." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-47", "text": "CLEVR: CLEVR [17] is a synthetic dataset to test the visual understanding of the VQA systems." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-48", "text": "The dataset is generated using three objects in each image, namely cylinder, sphere and cube." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-49", "text": "These objects are in two different sizes, two different materials and placed in eight different colors." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-50", "text": "The questions are also synthetically generated based on the objects placed in the image." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-51", "text": "The dataset also accompanies the groundtruth bounding boxes for each object in the image." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-52", "text": "Tally-QA: Very recently, in 2019, the Tally-QA [10] dataset is proposed which is the largest dataset of object counting in the open-ended task." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-53", "text": "The dataset includes both simple and complex question types which can be seen in Fig. II ." 
}, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-54", "text": "The dataset is quite large in numbers as well as it is 2.5 times the VQA dataset." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-55", "text": "The dataset contains 287,907 questions, 165,000 images and 19,000 complex questions." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-56", "text": "The Tally-QA samples are shown in Fig. II in 2 nd row and 1 st column." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-57", "text": "KVQA: The recent interest in common-sense questions has led to the development of world Knowledge based VQA dataset [11] ." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-58", "text": "The dataset contains questions targeting various categories of nouns and also require world knowledge to arrive at a solution." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-59", "text": "Questions in this dataset require multi-entity, multi-relation, and multi-hop reasoning over large Knowledge Graphs (KG) to arrive at an answer." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-60", "text": "The dataset contains 24,000 images with 183,100 question-answer pairs employing around 18K proper nouns." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-61", "text": "The KVQA samples are shown in Fig. II in 2 nd row and 2 nd column." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-62", "text": "----------------------------------" }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-63", "text": "**III. DEEP LEARNING BASED VQA METHODS**" }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-64", "text": "The emergence of deep-learning architectures have led to the development of the VQA systems." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-65", "text": "We discuss the stateof-the-art methods with an overview in Table II ." 
}, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-66", "text": "Vanilla VQA [1] : Considered as a benchmark for deep learning methods, the vanilla VQA model uses CNN for feature extraction and LSTM or Recurrent networks for language processing." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-67", "text": "These features are combined using element-wise operations to a common feature, which is used to classify to one of the answers as shown in Fig. 3 ." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-68", "text": "Neural-Symbolic VQA [24] : Specifically made for CLEVR dataset, this model leverages the question formation and image generation strategy of CLEVR." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-69", "text": "The images are converted to structured features and the question features are converted to their original root question strategy." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-70", "text": "This feature is used to filter out the required answer." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-71", "text": "Focal Visual Text Attention (FVTA) [25] : This model combines the sequence of image features generated by the network, text features of the image (or probable answers) and the question." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-72", "text": "It applies the attention based on the both text components, and finally classifies the features to answer the question." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-73", "text": "This model is better suited for the VQA in videos which has more use cases than images." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-74", "text": "[20] VQA [1] , TDIUC [29] , COCO-QA [21] Faster-RCNN [22] , Differential Modules [30] , GRU [31] 68.59 (VQA-v2), 86.73 (TDIUC), 69.36 (COCO-QA) AAAI 2019 Pythia v1.0 [28] : Pythia v1.0 is the award winning architecture for VQA Challenge 2018" }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-75", "text": "1 . 
The architecture is similar to Teney et al. [14] , with reduced computation via element-wise multiplication, the use of GloVe vectors [23] , and an ensemble of 30 models." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-76", "text": "Differential Networks [20] : This model uses the differences between forward propagation steps to reduce noise and to learn the interdependency between features." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-77", "text": "Image features are extracted using Faster-RCNN [22] ." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-78", "text": "The differential modules [30] are used to refine the features in both text and images." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-79", "text": "GRU [31] is used for question feature extraction." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-80", "text": "Finally, it is combined with an attention module to classify the answers." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-81", "text": "The Differential Networks architecture is illustrated in Fig. 4." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-82", "text": "1 https://github.com/facebookresearch/pythia" }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-83", "text": "----------------------------------" }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-84", "text": "**IV. EXPERIMENTAL RESULTS AND ANALYSIS**" }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-85", "text": "The reported results of different methods on different datasets are summarized in Table I and Table II ." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-86", "text": "It can be observed that the VQA dataset is very commonly used by different methods to test performance." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-87", "text": "Other datasets like Visual7W, Tally-QA and KVQA are also very challenging and recent datasets." 
}, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-88", "text": "It can be also seen that the Pythia v1.0 is one of the recent methods performing very well over VQA dataset." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-89", "text": "The Differentail Network is the very recent method proposed for VQA task and shows very promising performance over different datasets." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-90", "text": "As part of this survey, we also implemented different methods over different datasets and performed the experiments." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-91", "text": "We considered the following three models for our experiments, 1) the baseline Vanilla VQA model [1] which uses the VGG16 CNN architecture [3] and LSTMs [7] , 2) the Stacked Attention Networks [12] architecture, and 3) the 2017 VQA challenge winner Teney et al. model [14] ." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-92", "text": "We considered the widely adapted datasets such as standard VQA dataset [1] and Visual7W dataset [9] for the experiments." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-93", "text": "We used the Adam Optimizer for all models with Cross-Entropy loss function." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-94", "text": "Each model is trained for 100 epochs for each dataset." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-95", "text": "The experimental results are presented in Table III in terms of the accuracy for three models over two datasets." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-96", "text": "In the experiments, we found that the Teney et al. [14] is the best performing model on both VQA and Visual7W Dataset." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-97", "text": "The accuracies obtained over the Teney et al. model are 67.23% and 65.82% over VQA and Visual7W datasets for the openended question-answering task, respectively." 
}, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-98", "text": "The above results re-affirmed that the Teney et al. model is the best performing model till 2018 which has been pushed by Pythia v1.0 [13] , recently, where they have utilized the same model with more layers to boost the performance." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-99", "text": "----------------------------------" }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-100", "text": "**V. CONCLUSION**" }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-101", "text": "The Visual Question Answering has recently witnessed a great interest and development by the group of researchers and scientists from all around the world." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-102", "text": "The recent trends are observed in the area of developing more and more real life looking datasets by incorporating the real world type questions and answers." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-103", "text": "The recent trends are also seen in the area of development of sophisticated deep learning models by better utilizing the visual cues as well as textual cues by different means." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-104", "text": "The performance of even best model is still lagging and around 60-70% only." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-105", "text": "Thus, it is still an open problem to develop better deep learning models as well as more challenging datasets for VQA." }, { "sent_id": "0593fb7ee345cf632e6a61f1f21e6c-C001-106", "text": "Different strategies like object level details, segmentation masks, deeper models, sentiment of the question, etc. can be considered to develop the next generation VQA models." 
} ], "y": { "@USE@": { "gold_contexts": [ [ "0593fb7ee345cf632e6a61f1f21e6c-C001-28" ], [ "0593fb7ee345cf632e6a61f1f21e6c-C001-91" ], [ "0593fb7ee345cf632e6a61f1f21e6c-C001-96" ] ], "cite_sentences": [ "0593fb7ee345cf632e6a61f1f21e6c-C001-28", "0593fb7ee345cf632e6a61f1f21e6c-C001-91", "0593fb7ee345cf632e6a61f1f21e6c-C001-96" ] }, "@SIM@": { "gold_contexts": [ [ "0593fb7ee345cf632e6a61f1f21e6c-C001-75" ] ], "cite_sentences": [ "0593fb7ee345cf632e6a61f1f21e6c-C001-75" ] } } }, "ABC_88900d3533701056f6a26bf7c68670_43": { "x": [ { "sent_id": "88900d3533701056f6a26bf7c68670-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-2", "text": "We present a method for utilizing unannotated sentences to improve a semantic parser which maps natural language (NL) sentences into their formal meaning representations (MRs)." }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-3", "text": "Given NL sentences annotated with their MRs, the initial supervised semantic parser learns the mapping by training Support Vector Machine (SVM) classifiers for every production in the MR grammar." }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-4", "text": "Our new method applies the learned semantic parser to the unannotated sentences and collects unlabeled examples which are then used to retrain the classifiers using a variant of transductive SVMs." }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-5", "text": "Experimental results show the improvements obtained over the purely supervised parser, particularly when the annotated training set is small." 
}, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-6", "text": "----------------------------------" }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-8", "text": "Semantic parsing is the task of mapping a natural language (NL) sentence into a complete, formal meaning representation (MR) which a computer program can execute to perform some task, like answering database queries or controlling a robot." }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-9", "text": "These MRs are expressed in domain-specific unambiguous formal meaning representation languages (MRLs)." }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-10", "text": "Given a training corpus of NL sentences annotated with their correct MRs, the goal of a learning system for semantic parsing is to induce an efficient and accurate semantic parser that can map novel sentences into their correct MRs." }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-11", "text": "Several learning systems have been developed for semantic parsing, many of them recently (Zelle and Mooney, 1996; Zettlemoyer and Collins, 2005; Ge and Mooney, 2005; Kate and Mooney, 2006) ." }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-12", "text": "These systems use supervised learning methods which only utilize annotated NL sentences." }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-13", "text": "However, it requires considerable human effort to annotate sentences." }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-14", "text": "In contrast, unannotated NL sentences are usually easily available." }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-15", "text": "Semi-supervised learning methods utilize cheaply available unannotated data during training along with annotated data and often perform better than purely supervised learning methods trained on the same amount of annotated data (Chapelle et al., 2006) ." 
}, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-16", "text": "In this paper we present, to our knowledge, the first semi-supervised learning system for semantic parsing." }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-17", "text": "We modify KRISP, a supervised learning system for semantic parsing presented in (Kate and Mooney, 2006) , to make a semi-supervised system we call SEMISUP-KRISP." }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-18", "text": "Experiments on a realworld dataset show the improvements SEMISUP-KRISP obtains over KRISP by utilizing unannotated sentences." }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-19", "text": "----------------------------------" }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-20", "text": "**BACKGROUND**" }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-21", "text": "This section briefly provides background needed for describing our approach to semi-supervised semantic parsing." }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-22", "text": "----------------------------------" }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-23", "text": "**KRISP: THE SUPERVISED SEMANTIC PARSING**" }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-24", "text": "Learning System KRISP (Kernel-based Robust Interpretation for Semantic Parsing) (Kate and Mooney, 2006 ) is a supervised learning system for semantic parsing which takes NL sentences paired with their MRs as training data." }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-25", "text": "The productions of the formal MRL grammar are treated like semantic concepts." }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-26", "text": "For each of these productions, a Support-Vector Machine (SVM) (Cristianini and Shawe-Taylor, 2000) classifier is trained using string similarity as the kernel (Lodhi et al., 2002) ." 
}, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-27", "text": "Each classifier can then estimate the probability of any NL substring representing the semantic concept for its production." }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-28", "text": "During semantic parsing, the classifiers are called to estimate probabilities on different substrings of the sentence to compositionally build the most probable meaning representation (MR) of the sentence." }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-29", "text": "KRISP trains the classifiers used in semantic parsing iteratively." }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-30", "text": "In each iteration, for every production in the MRL grammar, KRISP collects positive and negative examples." }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-31", "text": "In the first iteration, the set of positive examples for production contains all sentences whose corresponding MRs use the production in their parse trees." }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-32", "text": "The set of negative examples includes all of the other training sentences." }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-33", "text": "Using these positive and negative examples, an SVM classifier is trained for each production using a string kernel." }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-34", "text": "In subsequent iterations, the parser learned from the previous iteration is applied to the training examples and more refined positive and negative examples, which are more specific substrings within the sentences, are collected for training." }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-35", "text": "Iterations are continued until the classifiers converge, analogous to iterations in EM (Dempster et al., 1977) ." 
}, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-36", "text": "Experimentally, KRISP compares favorably to other existing semantic parsing systems and is particularly robust to noisy training data (Kate and Mooney, 2006) ." }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-37", "text": "----------------------------------" }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-38", "text": "**TRANSDUCTIVE SVMS**" }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-39", "text": "SVMs (Cristianini and Shawe-Taylor, 2000) are state-of-the-art machine learning methods for classification." }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-40", "text": "Given positive and negative training examples in some vector space, an SVM finds the maximum-margin hyperplane which separates them." }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-41", "text": "Maximizing the margin prevents over-fitting in very high-dimensional data which is typical in natural language processing and thus leads to better generalization performance on test examples." }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-42", "text": "When the unlabeled test examples are also available during training, a transductive framework for learning (Vapnik, 1998) can further improve the performance on the test examples." }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-43", "text": "Transductive SVMs were introduced in (Joachims, 1999) ." }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-44", "text": "The key idea is to find the labeling of the test examples that results in the maximum-margin hyperplane that separates the positive and negative examples of both the training and the test data." }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-45", "text": "This is achieved by including variables in the SVM's objective function representing labels of the test examples." 
}, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-46", "text": "Finding the exact solution to the resulting optimization problem is intractable, however Joachims (1999) gives an approximation algorithm for it." }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-47", "text": "One drawback of his algorithm is that it requires the proportion of positive and negative examples in the test data be close to the proportion in the training data, which may not always hold, particularly when the training data is small." }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-48", "text": "Chen et al. (2003) present another approximation algorithm which we use in our system because it does not require this assumption." }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-49", "text": "More recently, new optimization methods have been used to scale-up transductive SVMs to large data sets (Collobert et al., 2006) , however we did not face scaling problems in our current experiments." }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-50", "text": "Although transductive SVMs were originally designed to improve performance on the test data by utilizing its availability during training, they can also be directly used in a semi-supervised setting (Bennett and Demiriz, 1999) where unlabeled data is available during training that comes from the same distribution as the test data but is not the actual data on which the classifier is eventually to be tested." }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-51", "text": "This framework is more realistic in the context of semantic parsing where sentences must be processed in real-time and it is not practical to re-train the parser transductively for every new test sentence." 
}, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-52", "text": "Instead of using an alternative semi-supervised SVM algorithm, we preferred to use a transductive SVM algorithm (Chen et al., 2003) in a semi-supervised manner, since it is easily implemented on top of an existing SVM system." }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-53", "text": "----------------------------------" }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-54", "text": "**SEMI-SUPERVISED SEMANTIC PARSING**" }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-55", "text": "We modified the existing supervised system KRISP, described in section 2.1, to incorporate semisupervised learning." }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-56", "text": "Supervised learning in KRISP involves training SVM classifiers on positive and negative examples that are substrings of the anno- The training algorithm first runs KRISP's existing training algorithm and obtains SVM classifiers for every production in the MRL grammar." }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-57", "text": "Sets of positive and negative examples that were used for training the classifiers in the last iteration are collected for each production." }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-58", "text": "Next, the learned parser is applied to the unannotated sentences." }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-59", "text": "During the parsing of each sentence, whenever a classifier is called to estimate the probability of a substring representing the semantic concept for its production, that substring is saved as an unlabeled example for that classifier." }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-60", "text": "These substrings are representative of the examples that the classifier will actually need to handle during testing." 
}, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-61", "text": "Note that the MRs obtained from parsing the unannotated sentences do not play a role during training since it is unknown whether or not they are correct." }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-62", "text": "These sets of unlabeled examples for each production, along with the sets of positive and negative examples collected earlier, are then used to retrain the classifiers using transductive SVMs." }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-63", "text": "The retrained classifiers are finally returned and used in the final semantic parser." }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-64", "text": "----------------------------------" }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-65", "text": "**EXPERIMENTS**" }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-66", "text": "We compared the performance of SEMISUP-KRISP and KRISP in the GEOQUERY domain for semantic parsing in which the MRL is a functional language used to query a U.S. geography database (Kate et al., 2005) ." }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-67", "text": "This domain has been used in most of the previous work." }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-68", "text": "The original corpus contains \u00be \u00bc NL queries collected from undergraduate students and annotated with their correct MRs (Zelle and Mooney, 1996) ." }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-69", "text": "Later, \u00bf\u00bc additional NL queries were collected from real users of a web-based interface and annotated (Tang and Mooney, 2001) ." }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-70", "text": "We used this data as unannotated sentences in our current experiments." }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-71", "text": "We also collected an additional \u00bc queries from the same interface, making a total of \u00bd \u00bc\u00bf unannotated sentences." 
}, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-72", "text": "The systems were evaluated using standard 10-fold cross validation." }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-73", "text": "All the unannotated sentences were used for training in each fold." }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-74", "text": "Performance was measured in terms of precision (the percentage of generated MRs that were correct) and recall (the percentage of all sentences for which correct MRs were obtained)." }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-75", "text": "An output MR is considered correct if and only if the resulting query retrieves the same answer as the correct MR when submitted to the database." }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-76", "text": "Since the systems assign confidences to the MRs they generate, the entire range of the precision-recall trade-off can be obtained for a system by measuring precision and recall at various confidence levels." }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-77", "text": "We present learning curves for the best F-measure (harmonic mean of precision and re- call) obtained across the precision-recall trade-off as the amount of annotated training data is increased." }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-78", "text": "Figure 2 shows the results for both systems." }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-79", "text": "The results clearly show the improvement SEMISUP-KRISP obtains over KRISP by utilizing unannotated sentences, particularly when the number of annotated sentences is small." }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-80", "text": "We also show the performance of a hand-built semantic parser GEOBASE (Borland International, 1988 ) for comparison." 
}, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-81", "text": "From the figure, it can be seen that, on average, KRISP achieves the same performance as GEOBASE when it is given \u00bd\u00be annotated examples, while SEMISUP-KRISP reaches this level given only annotated examples, a \u00be \u00b1 savings in humanannotation effort." }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-82", "text": "----------------------------------" }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-83", "text": "**CONCLUSIONS**" }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-84", "text": "This paper has presented a semi-supervised approach to semantic parsing." }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-85", "text": "Our method utilizes unannotated sentences during training by extracting unlabeled examples for the SVM classifiers it uses to perform semantic parsing." }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-86", "text": "These classifiers are then retrained using transductive SVMs." }, { "sent_id": "88900d3533701056f6a26bf7c68670-C001-87", "text": "Experimental results demonstrated that this exploitation of unlabeled data significantly improved the accuracy of the resulting parsers when only limited supervised data was provided." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "88900d3533701056f6a26bf7c68670-C001-11" ], [ "88900d3533701056f6a26bf7c68670-C001-24" ], [ "88900d3533701056f6a26bf7c68670-C001-36" ] ], "cite_sentences": [ "88900d3533701056f6a26bf7c68670-C001-11", "88900d3533701056f6a26bf7c68670-C001-24", "88900d3533701056f6a26bf7c68670-C001-36" ] }, "@DIF@": { "gold_contexts": [ [ "88900d3533701056f6a26bf7c68670-C001-11", "88900d3533701056f6a26bf7c68670-C001-12", "88900d3533701056f6a26bf7c68670-C001-16" ] ], "cite_sentences": [ "88900d3533701056f6a26bf7c68670-C001-11" ] }, "@EXT@": { "gold_contexts": [ [ "88900d3533701056f6a26bf7c68670-C001-17" ] ], "cite_sentences": [ "88900d3533701056f6a26bf7c68670-C001-17" ] } } }, "ABC_4705e8c0bac7bf29d8ef3193cf729b_43": { "x": [ { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-39", "text": "**RELATED WORK**" }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-2", "text": "The selection of features is critical in providing discriminative information for classifiers in Word Sense Disambiguation (WSD)." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-3", "text": "Uninformative features will degrade the performance of classifiers." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-4", "text": "Based on the strong evidence that an ambiguous word expresses a unique sense in a given collocation, this paper reports our experiments on automatic WSD using collocation as local features based on the corpus extracted from People's Daily News (PDN) as well as the standard SENSEVAL-3 data set." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-5", "text": "Using the Na\u00efve Bayes classifier as our core algorithm, we have implemented a classifier using a feature set combining both local collocation features and topical features." 
}, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-6", "text": "The average precision on the PDN corpus has 3.2% improvement compared to 81.5% of the baseline system where collocation features are not considered." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-7", "text": "For the SENSEVAL-3 data, we have reached the precision rate of 37.6% by integrating collocation features into contextual features, to achieve 37% improvement over 26.7% of precision in the baseline system." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-8", "text": "Our experiments have shown that collocation features can be used to reduce the size of human tagged corpus." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-9", "text": "----------------------------------" }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-11", "text": "WSD tries to resolve lexical ambiguity which refers to the fact that a word may have multiple meanings such as the word \"walk\" in \"Walk or Bike to school\" and \"BBC Education Walk Through Time\", or the Chinese word \" \" in \" \"(\"local government\") and \" \"(\"He is also partly right\")." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-12", "text": "WSD tries to automatically assign an appropriate sense to an occurrence of a word in a given context." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-13", "text": "Various approaches have been proposed to deal with the word sense disambiguation problem including rule-based approaches, knowledge or dictionary based approaches, corpus-based approaches, and hybrid approaches." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-14", "text": "Among these approaches, the supervised corpus-based approach had been applied and discussed by many researches ( [2] [3] [4] [5] [6] [7] [8] )." 
}, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-40", "text": "Automating word sense disambiguation tasks based on annotated corpora have been proposed." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-15", "text": "According to [1] , the corpusbased supervised machine learning methods are the most successful approaches to WSD where contextual features have been used mainly to distinguish ambiguous words in these methods." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-16", "text": "However, word occurrences in the context are too diverse to capture the right pattern, which means that the dimension of contextual words will be very large when all words in the training samples are used for WSD [14] ." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-17", "text": "Certain uninformative features will weaken the discriminative power of a classifier resulting in a lower precision rate." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-18", "text": "To narrow down the context, we propose to use collocations as contextual information as defined in Section 3.1.2." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-19", "text": "It is generally understood that the sense of an ambiguous word is unique in a given collocation [19] ." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-20", "text": "For example, \"" }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-21", "text": "\" means \"burden\" but not \"baggage\" when it appears in the collocation \" \" (\" burden of thought\")." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-22", "text": "In this paper, we apply a classifier to combine the local features of collocations which contain the target word with other contextual features to discriminate the ambiguous words." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-23", "text": "The intuition is that when the target context captures a collocation, the influence of other dimensions of contextual words can be reduced or even ignored." 
}, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-24", "text": "For example, in the expression \" \" (\"terrorists burned down the gene laboratory\"), the influence of contextual word \" \" (\"gene\") should be reduced to work on the target word \" \" because \" \" is a collocation whereas \" \" and \" \" are not collocations even though they do co-occur." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-25", "text": "Our intention is not to generally replace contextual information by collocation only." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-26", "text": "Rather, we would like to use collocation as an additional feature in WSD." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-27", "text": "We still make use of other contextual features because of the following reasons." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-28", "text": "Firstly, contextual information is proven to be effective for WSD in the previous research works." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-29", "text": "Secondly, collocations may be independent on the training corpus and a sentence in consideration may not contain any collocation." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-30", "text": "Thirdly, to fix the tie case such as ( terrorists' gene checking ), means human when presented in the collocation , but particle in the collocation \"." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-31", "text": "The primary purpose of using collocation in WSD is to improve precision rate without any sacrifices in recall rate." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-32", "text": "We also want to investigate whether the use of collocation as an additional feature can reduce the size of hand tagged sense corpus." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-33", "text": "The rest of this paper is organized as follows." 
}, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-34", "text": "Section 2 summarizes the existing Word Sense Disambiguation techniques based on annotated corpora." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-35", "text": "Section 3 describes the classifier and the features in our proposed WSD approach." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-36", "text": "Section 4 describes the experiments and the analysis of our results." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-37", "text": "Section 5 is the conclusion." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-38", "text": "----------------------------------" }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-41", "text": "Examples of supervised learning methods for WSD appear in [2] [3] [4] , [7] [8] ." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-42", "text": "The learning algorithms applied including: decision tree, decisionlist [15] , neural networks [7] , na\u00efve Bayesian learning ( [5] , [11] ) and maximum entropy [10] ." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-43", "text": "Among these leaning methods, the most important issue is what features will be used to construct the classifier." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-44", "text": "It is common in WSD to use contextual information that can be found in the neighborhood of the ambiguous word in training data ( [6] , [16] [17] [18] )." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-45", "text": "It is generally true that when words are used in the same sense, they have similar context and co-occurrence information [13] ." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-46", "text": "It is also generally true that the nearby context words of an ambiguous word give more effective patterns and features values than those far from it [12] ." 
}, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-47", "text": "The existing methods consider features selection for context representation including both local and topic features where local features refer to the information pertained only to the given context and topical features are statistically obtained from a training corpus." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-48", "text": "Most of the recent works for English corpus including [7] and [8] , which combine both local and topical information in order to improve their performance." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-49", "text": "An interesting study on feature selection for Chinese [10] has considered topical features as well as local collocational, syntactic, and semantic features using the maximum entropy model." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-50", "text": "In Dang's [10] work, collocational features refer to the local PoS information and bi-gram co-occurrences of words within 2 positions of the ambiguous word." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-51", "text": "A useful result from this work based on (about one million words) the tagged People's Daily News shows that adding more features from richer levels of linguistic information such as PoS tagging yielded no significant improvement (less than 1%) over using only the bi-gram co-occurrences information." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-52", "text": "Another similar study for Chinese [11] is based on the Naive Bayes classifier model which has taken into consideration PoS with position information and bi-gram templates in the local context." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-53", "text": "The system has a reported 60.40% in both precision and recall based on the SENSEVAL-3 Chinese training data." 
}, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-54", "text": "Even though in both approaches, statistically significant bi-gram co-occurrence information is used, they are not necessarily true collocations." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-55", "text": "For example, in the express \" \", the bi-grams in their system are ( , , ," }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-56", "text": "Some bi-grams such as may have higher frequency but may introduce noise when considering it as features in disambiguating the sense \"human| \" and \"symbol| \" like in the example case of \" \"." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-57", "text": "In our system, we do not rely on co-occurrence information." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-58", "text": "Instead, we utilize true collocation information ( , ) which fall in the window size of (-5, +5) as fea-tures and the sense of \"human| \" can be decided clearly using this features." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-59", "text": "The collocation information is a pre-prepared collocation list obtained from a collocation extraction system and verified with syntactic and semantic methods ( [21] , [24] )." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-60", "text": "Yarowsky [9] used the one sense per collocation property as an essential ingredient for an unsupervised Word-Sense Disambiguation algorithm to perform bootstrapping algorithm on a more general high-recall disambiguation." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-153", "text": "In each level of the sense number, the words are selected randomly." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-61", "text": "A few recent research works have begun to pay attention to collocation features on WSD." 
}, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-62", "text": "Domminic [19] used three different methods called bilingual method, collocation method and UMLS (Unified Medical Language System) relation based method to disambiguate unsupervised English and German medical documents." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-63", "text": "As expected, the collocation method achieved a good precision around 79% in English and 82% in German but a very low recall which is 3% in English and 1% in German." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-64", "text": "The low recall is due to the nature of UMLS where many collocations would almost never occur in natural text." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-65", "text": "To avoid this problem, we combine the contextual features in the target context with the pre-prepared collocations list to build our classifier." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-66", "text": "As stated early, an important issue is what features will be used to construct the classifier in WSD." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-67", "text": "Early researches have proven that using lexical statistical information, such as bi-gram co-occurrences was sufficient to produce close to the best results [10] for Chinese WSD." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-68", "text": "Instead of including bi-gram features as part of discrimination features, in our system, we consider both topical contextual features as well as local collocation features." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-69", "text": "These features are extracted form the 60MB human sense-tagged People's Daily News with segmentation information." 
}, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-70", "text": "----------------------------------" }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-71", "text": "**TOPICAL CONTEXTUAL FEATURES**" }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-72", "text": "Niu [11] proved in his experiments that Na\u00efve Bayes classifier achieved best disambiguation accuracy with small topical context window size (< 10 words)." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-73", "text": "We follow their method and set the contextual window size as 10 in our system." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-74", "text": "Each of the Chinese words except the stop words inside the window range will be considered as one topical feature." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-75", "text": "Their frequencies are calculated over the entire corpus with respect to each sense of an ambiguous word w. The sense definitions are obtained from HowNet." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-76", "text": "----------------------------------" }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-77", "text": "**LOCAL COLLOCATION FEATURES**" }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-78", "text": "We chose collocations as the local features." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-79", "text": "A collocation is a recurrent and conventional fixed expression of words which holds syntactic and semantic relations [21] ." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-80", "text": "Collocations can be classified as fully fixed collocations, fixed collocations, strong collocations and loose collocations." 
}, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-81", "text": "Fixed collocations means the appearance of one word implies the co-occurrence of another one such as \" \" (\"burden of history\"), while strong collocations allows very limited substitution of the components, for example, \"" }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-82", "text": "\" (\"local college\"), or \" \" (\"local university\")." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-83", "text": "The sense of ambiguous words can be uniquely determined in these two types of collocations, therefore are the collocations applied in our system." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-84", "text": "The sources of the collocations will be explained in Section 4.1." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-85", "text": "In both Niu [11] and Dang's [10] work, topical features as well as the so called collocational features were used." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-86", "text": "However, as discussed in Section 2, they both used bi-gram cooccurrences as the additional local features." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-87", "text": "However, bi-gram co-occurrences only indicate statistical significance which may not actually satisfy the conceptual definition of collocations." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-88", "text": "Thus instead of using co-occurrences of bigrams, we take the true bi-gram collocations extracted from our system and use this data to compare with bi-gram co-occurrences to test the usefulness of collocation for WSD." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-89", "text": "The local features in our system make use of the collocations using the template (w i , w) within a window size of ten (where i = \u00b1 5)." 
}, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-90", "text": "For example, \" \" (\"Government departments and local government commanded that\") fits the bi-gram collocation template (w, w 1 ) with the value of ( )." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-91", "text": "During the training and the testing processes, the counting of frequency value of the collocation feature will be increased by 1 if a collocation containing the ambiguous word occurs in a sentence." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-92", "text": "To have a good analysis on collocation features, we have also developed an algorithm using lonely adjacent bi-gram as locals features(named Sys-adjacent bi-gram as locals features(named System A) and another using collocation as local features(named System B)." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-93", "text": "----------------------------------" }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-94", "text": "**THE COLLOCATION CLASSIFIER**" }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-95", "text": "We consider all the features in the features set F = F t \u222aF l = {f 1 , f 2 , \u2026 , f m } as independent, where F t stands for the topical contextual features set, and F l stands for the local collocation features set." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-96", "text": "For an ambiguous word w with n senses, let S w = {w s1 , w s2 , \u2026 , w sn } be the sense set." 
}, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-97", "text": "For the contextual features, we directly apply the Na\u00efve Bayes algorithm using Add-Lambda Smoothing to handle unknown words:" }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-98", "text": "----------------------------------" }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-99", "text": "**EXPERIMENTAL RESULTS**" }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-100", "text": "We have designed a set of experiments to compare the classifier with and without the collocation features." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-101", "text": "In system A, the classifier is built with local bi-gram features and topical contextual features." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-102", "text": "The classifier in system B is constructed from combining the local collocation features with topical features." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-103", "text": "----------------------------------" }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-104", "text": "**PREPARATION THE DATA SET**" }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-105", "text": "We have selected 20 ambiguous words from nouns and verbs with the sense number as 4 in average." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-106", "text": "The sense definition is taken from HowNet [22] ." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-107", "text": "To show the effect of the algorithm, we try to choose words with high degree of ambiguity, high frequency of use [23] , and high frequency of constructing collocations." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-154", "text": "Table 5 shows the effect of sense distribution on the effect of collocation features." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-108", "text": "The selection of these 20 words is not completely random although within each criterion class we do try to pick word randomly." 
}, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-109", "text": "Based on the 20 words, we extracted 28,000 sentences from the 60 MB People's Daily News with segmentation information as our training/test set which is then manually sense-tagged." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-110", "text": "The collocation list is constructed from a combination of a digital collocation dictionary, a return result from a collocation automatic extraction system [21] , and a hand collection from the People's Daily News." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-111", "text": "As we stated early, the sense of ambiguous words in the fixed collocations and strong collocations can be decided uniquely although they are not unique in loose collocations." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-112", "text": "For example, the ambiguous word \" \" in the collocation \" \" may have both the sense of \"appearance| \" or \"reputation| \"." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-113", "text": "Therefore, when labeling the sense of collocations, we filter out the ones which cannot uniquely determine the sense of ambiguous words inside." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-114", "text": "However, this does not mean that loose collocations have no contribution in WSD classification." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-115", "text": "We simply reduce its weight when combining it with the contextual features compared with the fixed and strong collocations." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-116", "text": "The sense and collocation distribution over the 20 words on the training examples can be found in Table 1 ." 
}, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-117", "text": "----------------------------------" }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-118", "text": "**THE EFFECT OF COLLOCATION FEATURES**" }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-119", "text": "We recorded 6 trials with average precision over six-fold validation for each word." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-120", "text": "Their average precision for the six trials in the system A, and B can be found in Table 2 and Table 3 ." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-121", "text": "From Table 3, regarding to precision, there are 16 words have improved and 4 words remained the same in the system B. The results from the both system confirmed that collocation features do improve the precision." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-122", "text": "Note that 4 words have the same precision in the two systems, which fall into two cases." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-123", "text": "In the first case, it can be seen that these words already have very high precision in the system A (over 93%) which means that one sense dominates all other senses." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-124", "text": "In this case, the additional collation information is not necessary." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-125", "text": "In fact, when we checked the intermediate outputs, the score of the candidate senses of the ambiguous words contained in the collocations get improved." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-126", "text": "Even though, it would not change the result." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-127", "text": "Secondly, no collocation appeared in the sentences which are tagged incorrectly in the system A. This is confirmed when we check the error files." 
}, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-128", "text": "For example, the word \" \" with the sense as \" \" (\"closeness\") appeared in 4492 examples over the total 4885 examples (91.9%)." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-129", "text": "In the mean time, 99% of collocation in its collocation list has the same sense of \" \" (\"closeness\")." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-130", "text": "Only one collocation \" \" has the sense of \" \" (\"power\")." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-131", "text": "Therefore, the collocation features improved the score of sense \" \" which is already the highest one based on the contextual features." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-132", "text": "As can be seen from Table 3 , the collocation features work well for the sparse data." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-133", "text": "For example, the word \" \" in the training corpus has only one example with the sense \" \" (\"human\"), the other 30 examples all have the sense \" \" (\"management\")." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-134", "text": "Under this situation, the topical contextual features failed to identify the right sense for the only appearance of the sense \" \" (\"human\") in the training instance \" \u2026\"." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-135", "text": "However, it can be correctly identified in the system B because the appearance of the collocation \" \"." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-136", "text": "To well show the effect of collocations on the accuracy of classifier for the task of WSD, we also tested both systems on SENSEVAL-3 data set, and the result is recorded in the Table 4 ." 
}, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-137", "text": "From the difference in the relative improvement of both data sets, we can see that collocation features work well when the statistical model is not sufficiently built up such as from a small corpus like SENSEVAL-3." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-138", "text": "Actually, in this case, the training examples appear in the corpus only once or twice so that the parameters for such sparse training examples may not be accurate to forecast the test examples, which convinces us that collocation features are effective on handling sparse training data even for unknown words." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-139", "text": "Fig. 1 shows the precision comparison in the system A, and B on SENVESAL-3." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-140", "text": "----------------------------------" }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-141", "text": "**THE EFFECT OF COLLOCATIONS ON THE SIZE OF TRAINING CORPUS NEEDED**" }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-142", "text": "Hwee [21] stated that a large-scale, human sense-tagged corpus is critical for a supervised learning approach to achieve broad coverage and high accuracy WSD." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-143", "text": "He conducted a thorough study on the effect of training examples on the accuracy of supervised corpus based WSD." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-144", "text": "As the result showed, WSD accuracy continues to climb as the number of training examples increases." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-145", "text": "Similarly, we have tested the system A, and B with the different size of training corpus based on the PDN corpus we prepared." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-146", "text": "Our experiment results shown in Fig 2 follow the same fact." 
}, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-147", "text": "The purpose we did the testing is that we hope to disclose the effect of collocations on the size of training corpus needed." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-148", "text": "From Fig 2, we can see by using the collocation features, the precision of the system B has increased slower along with the growth of training examples than the precision of the system A. The result is reasonable because with collocation feature, the statistical contextual information over the entire corpus becomes side effect." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-149", "text": "Actually, as can be seen from Fig. 2 , after using collocation features in the system B, even we use 1/6 corpus as training, the precision is still higher than we use 5/6 train corpus in the system A." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-150", "text": "----------------------------------" }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-151", "text": "**INVESTIGATION OF SENSE DISTRIBUTION ON THE EFFECT OF COLLOCATION FEATURES**" }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-152", "text": "To investigate the sense distribution on the effect of collocation features, we selected the ambiguous words with the number of sense varied from 2 to 6." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-155", "text": "From the table, we can see that the collocation features work well when the sense distribution is even for a particular ambiguous word under which case the classifier may get confused." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-156", "text": "We have conducted a set of experiments based on both the PDN corpus and SENSEVLA-3 data to set the best value of \u03b1 for the formula (4) described in Section 3.2." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-157", "text": "The best start value of \u03b1 is tested based on the precision rate which is shown in Fig. 3 ." 
}, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-158", "text": "It is shown from the experiment that \u03b1 takes the start value of 0.5 for both corpuses." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-159", "text": "This paper reports a corpus-based Word Sense Disambiguation approach for Chinese word using local collocation features and topical contextual features." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-160", "text": "Compared with the base-line systems in which a Na\u00efve Bayes classifier is constructed by combining the contextual features with the bi-gram features, the new system achieves 3% precision improvement in average in Peoples' Daily News corpus and 10% improvement in SENSEVAL-3 data set." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-161", "text": "Actually, it works very well when disambiguating the sense with sparse distribution over the entire corpus under which the statistic calculation prone to identify it incorrectly." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-162", "text": "In the same time, because disambiguating using collocation fea-tures does not need statistical calculation, it makes contribution to reduce the size of human tagged corpus needed which is critical and time consuming in corpus based approach." }, { "sent_id": "4705e8c0bac7bf29d8ef3193cf729b-C001-163", "text": "Because different types of collocations may play different roles in classifying the sense of an ambiguous word, we hope to extend this work by integrating collocations with different weight based on their types in the future, which may need a pre-processing job to categorize the collocations automatically." 
} ], "y": { "@USE@": { "gold_contexts": [ [ "4705e8c0bac7bf29d8ef3193cf729b-C001-42" ], [ "4705e8c0bac7bf29d8ef3193cf729b-C001-72", "4705e8c0bac7bf29d8ef3193cf729b-C001-73" ] ], "cite_sentences": [ "4705e8c0bac7bf29d8ef3193cf729b-C001-42", "4705e8c0bac7bf29d8ef3193cf729b-C001-72" ] }, "@BACK@": { "gold_contexts": [ [ "4705e8c0bac7bf29d8ef3193cf729b-C001-52" ], [ "4705e8c0bac7bf29d8ef3193cf729b-C001-72" ], [ "4705e8c0bac7bf29d8ef3193cf729b-C001-85" ] ], "cite_sentences": [ "4705e8c0bac7bf29d8ef3193cf729b-C001-52", "4705e8c0bac7bf29d8ef3193cf729b-C001-72", "4705e8c0bac7bf29d8ef3193cf729b-C001-85" ] }, "@DIF@": { "gold_contexts": [ [ "4705e8c0bac7bf29d8ef3193cf729b-C001-52", "4705e8c0bac7bf29d8ef3193cf729b-C001-54", "4705e8c0bac7bf29d8ef3193cf729b-C001-57" ], [ "4705e8c0bac7bf29d8ef3193cf729b-C001-85", "4705e8c0bac7bf29d8ef3193cf729b-C001-88" ] ], "cite_sentences": [ "4705e8c0bac7bf29d8ef3193cf729b-C001-52", "4705e8c0bac7bf29d8ef3193cf729b-C001-85" ] } } }, "ABC_966106c9e00f0333bda45d977a9f35_43": { "x": [ { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-2", "text": "In this paper, we empirically compare the two encoder-decoder neural machine translation architectures: convolutional sequence to sequence model (ConvS2S) and recurrent sequence to sequence model (RNNS2S) for English-Hindi language pair as part of IIT Bombay's submission to WAT2017 shared task." }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-3", "text": "We report the results for both English-Hindi and HindiEnglish direction of language pair." 
}, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-4", "text": "----------------------------------" }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-5", "text": "**INTRODUCTION**" }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-6", "text": "Neural Machine Translation (NMT) systems are currently being widely investigated in the research community due to the benefits of distributed representation and continuous space modeling in generating more fluent outputs." }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-7", "text": "In this paper, we report the results of our experiments with NMT for English-Hindi language pair for the shared task in the 4th Workshop on Asian Translation (Nakazawa et al., 2017) ." }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-8", "text": "Hindi is the most widely spoken language in the Indian subcontinent, while English is a major link language in India as well across the world." }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-9", "text": "Hence, English-Hindi is an important language pair for machine translation." }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-10", "text": "In this work, we focus on comparing two variants of the encoder decoder architectures." }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-11", "text": "Section 2 describes our systems." }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-12", "text": "Section 3 describes the experimental setup." }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-13", "text": "Section 4 describes the results and observations of our experiments." }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-14", "text": "Section 5 concludes the report." 
}, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-15", "text": "----------------------------------" }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-16", "text": "**SYSTEM DESCRIPTION**" }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-17", "text": "We trained Neural Machine Transaltion systems using the encoder-decoder architecture with attention (Bahdanau et al., 2014) for English-Hindi as well Hindi-English translation." }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-18", "text": "We compared convolutional neural network (ConvS2S) (Gehring et al., 2017) and recurrent neural network (RNNS2S) (Bahdanau et al., 2014) based sequence to sequence learning architectures." }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-19", "text": "While RNN based architectures have proved to be successful and produce state-of-the-art results for machine translation, they take a long time to train." }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-20", "text": "The temporal dependencies between the elements in the sequence due to the RNN state vector requires sequential processing." }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-21", "text": "On the other hand, different parts of the sequence can be processed in parallel using a ConvS2S." }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-22", "text": "Hence, it is appealing to explore ConvS2S as the basis of an architecture to speed up training and decoding." }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-23", "text": "Recent work (Gehring et al., 2017) has shown that a purely CNN based encoder-decoder network is competitive with a RNN based network." 
}, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-24", "text": "----------------------------------" }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-25", "text": "**RECURRENT SEQUENCE TO SEQUENCE MODEL (RNNS2S)**" }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-26", "text": "Recurrent sequence to sequence model (Bahdanau et al., 2014) is currently the most popular method for neural machine translation." }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-27", "text": "It is been shown to be useful for other sequence to sequence tasks like image captioning (Vinyals et al., 2015) , language modeling, question answering (Wang and Nyberg, 2015) etc." }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-28", "text": "The typical architecture encodes the sequence of source word embeddings to generate annotations for the source words." }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-29", "text": "The encoder is typically a bi-directional RNN layer of LSTM or GRU units." }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-30", "text": "The final state of the encoder is used to initialize the decoder." }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-31", "text": "The decoder is also an RNN which generates one output token at a time." }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-32", "text": "Each output token is predicted based on the decoder state, previous output word and the context vector." }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-33", "text": "The context vector encodes source information required for predicting the words, and is generated using an attention mechanism on the source word annotations. Please refer to Bahdanau et al. (2014) for an detailed description of the method." 
}, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-34", "text": "----------------------------------" }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-35", "text": "**CONVOLUTIONAL SEQUENCE TO SEQUENCE MODEL (CONVS2S)**" }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-36", "text": "In convolutional sequence to sequence model (Gehring et al., 2017) , the input sequence is encoded into distributional vector space using a CNN and decoded back to output sequence again using CNN instead of RNN (Sutskever et al., 2014) ." }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-37", "text": "Each input element embedding is combined with its positional embedding (signifies the position of the input element)." }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-38", "text": "Positional embeddings help the network to realize what part of input it is dealing with, currently." }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-39", "text": "Encoder-Decoder." }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-40", "text": "Both the encoder and decoder are CNN blocks along with a multistep attention mechanism with multiple 'hops' (Sukhbaatar et al., 2015) ." }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-41", "text": "Each block consists of one dimensional convolutions followed by a Gated Linear Unit (GLU) non-linearity (Dauphin et al., 2016) ." }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-42", "text": "GLU is a gating function over the outputs of the convolutions." }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-43", "text": "The multi-step attention mechanism suggests that the attention mechanism is applied to every layer in the decoder." }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-44", "text": "The attention of the first layer gives contextual information which is then given as an input to the next layer that considers this information while calculating the attention weights of the current layer." 
}, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-45", "text": "from different sources at CFILT 1 lab." }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-46", "text": "The data provided was in tokenized format using moses tokenizer for English side and Indic NLP library 2 for Hindi side of the parallel data." }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-47", "text": "The training data was further cleaned for a sentence length of 100 words." }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-48", "text": "Table-1 shows data statistics used for the experiments." }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-49", "text": "----------------------------------" }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-50", "text": "**TRAINING**" }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-51", "text": "The RNNS2S model was trained using Nematus 3 framework." }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-52", "text": "To handle rare words, subword 4 technique was used through byte pair encoding(BPE) Shibata et al. (1999) with 16000 BPE operations." }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-53", "text": "Since there is no similarity between English and Hindi language vocabulary, both the languages were trained separately for BPE." }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-54", "text": "The encoder and decoder hidden layer size was kept at 512 and word embedding size as 256." }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-55", "text": "The model was trained with a batch size of 40 sentences and maximum sentence length of 100 using AdaDelta (Zeiler, 2012) optimizer with a learning rate of 0.0001 and no dropout setting." }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-56", "text": "The output parameters were saved after every 10000 iterations." 
}, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-57", "text": "The decoding was done using a beam size of 12 and ensemble of last 3 models and the best model taken together." }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-58", "text": "The ConvS2S model was trained using Fairseq 5 , an open source library developed by Facebook for neural machine translation using CNN or RNN networks." }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-59", "text": "For handling the rare words, the source side and target side corpora were segmented using byte pair encoding (BPE) (Shibata et al., 1999) ." }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-81", "text": "----------------------------------" }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-60", "text": "The baseline model with 4 encoder layers and 3 decoder layers was trained using nag optimizer (Gehring et al., 2017 ) with a learning rate of 0.25 with 0.2 as its dropout value and gradient clipping was also applied." }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-61", "text": "Table 6 : Hindi to English Translation Systems at WAT2017" }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-62", "text": "The inferencing was done using beam search with a beam size of 10 for both Hindi-English and English-Hindi translation task." }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-63", "text": "The model was also trained with more number of layers in the encoder and the decoder." }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-64", "text": "The resulting BLEU scores for different number of encoder and decoder layers are shown in Table 4 ." }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-65", "text": "The best results were obtained when the number of encoder layers were set to 13 and decoder layers to 7, with learning rate of 0.1 and no dropout regularization." 
}, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-66", "text": "The resulting BLEU scores with this setting for Hindi-English and English-Hindi are shown in Table 2 and Table 3 respectively." }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-67", "text": "----------------------------------" }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-68", "text": "**RESULTS AND OBSERVATION**" }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-69", "text": "The Table 2 and the Table 3 shows the different evaluation metrics such as Bilingual Evaluation Understudy (BLEU) (Papineni et al., 2002) , Rank-based Intuitive Bilingual Evaluation Score (RIBES) (Group et al., 2013) , Adequacy-Fluency Metrics (AMFM) (Banchs et al., 2015) (N/A for ConvS2S model) and human evaluation score (HUMAN) (N/A for ConvS2S model) for HindiEnglish and English-Hindi translation pairs." }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-70", "text": "In Hindi to English translation, the ConvS2S model outperforms the RNNS2S model in terms of BLEU score and the RIBES score." }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-71", "text": "On the other hand, in English to Hindi translation, the RNNS2S model performs better than the ConvS2S model in terms of BLEU score and the RIBES score is at par with the ConvS2S model." }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-72", "text": "The JPO Adequacy and pairwise evaluation of our RNNS2S output was compared against WAT2016 best system." }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-73", "text": "Table 5 and table 6 show the evaluation results of all other systems in comparison to our submission." }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-74", "text": "The results clearly indicate the scope of fine tuning our system parameters." }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-75", "text": "Due to time constraint, the ConvS2S output could not be submitted for manual evaluation." 
}, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-76", "text": "But the increasing trend of BLEU Scores have motivated us to continue our experimentation for a deeper analysis." }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-77", "text": "Further experimentation is required to see if the ConvS2S can perform better on English-Hindi as well." }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-78", "text": "One way to test this is by increasing the number of encoder and/or decoder layers even further." }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-79", "text": "This is because, in the Table 4 we can clearly observe that the BLEU scores increases when number of encoder and decoder layers are increased." }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-80", "text": "More experiments are required with RNNS2S architecture as well." }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-82", "text": "**CONCLUSION**" }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-83", "text": "In our system submission, we compared two sequence to sequence architectures: RNN based and CNN based for the English-Hindi language pairs." }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-84", "text": "The BLEU scores of CNN architecture improves by further tunning the parameters." }, { "sent_id": "966106c9e00f0333bda45d977a9f35-C001-85", "text": "In future, we would like to investigate the threshold of hyperparameters for RNNS2S and ConvS2S architectures for this language pair keeping processing time in consideration." 
} ], "y": { "@USE@": { "gold_contexts": [ [ "966106c9e00f0333bda45d977a9f35-C001-18" ], [ "966106c9e00f0333bda45d977a9f35-C001-60" ] ], "cite_sentences": [ "966106c9e00f0333bda45d977a9f35-C001-18", "966106c9e00f0333bda45d977a9f35-C001-60" ] }, "@BACK@": { "gold_contexts": [ [ "966106c9e00f0333bda45d977a9f35-C001-23" ], [ "966106c9e00f0333bda45d977a9f35-C001-36" ] ], "cite_sentences": [ "966106c9e00f0333bda45d977a9f35-C001-23", "966106c9e00f0333bda45d977a9f35-C001-36" ] } } }, "ABC_563476cdb64cdf7d47bc5e8e1c32c3_43": { "x": [ { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-11", "text": "When attempting to automate machine transliteration, modeling the channel that transforms source language characters into transliterated target language characters is a key component to good performance." }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-12", "text": "Since the primary signal followed by human transliterators is phonetic correspondence, it makes sense that a letter-to-phoneme (L2P) transcription engine would perform well at this task." }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-13", "text": "Of course, transliteration is often framed within the larger problems of translation and bilingual named entity co-reference, making available a number of other interesting features, such as target lexicons (Knight and Graehl, 1998) , distributional similarity (Bilac and Tanaka, 2005) , or the dates of an entity's mentions in the news (Klementiev and Roth, 2006 )." }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-14", "text": "However, this task's focus on generation has isolated the character-level component, which makes L2P technology a nearideal match." }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-15", "text": "For our submission, we re-implement the L2P approach described by Jiampojamarn et al. 
(2008) as faithfully as possible, and apply it unmodified to the transliteration shared task for the English-to-Hindi (Kumaran and Kellner, 2007) and English-to-Japanese Katakana 1 tests." }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-16", "text": "----------------------------------" }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-17", "text": "**APPROACH**" }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-18", "text": "----------------------------------" }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-19", "text": "**SUMMARY OF L2P APPROACH**" }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-20", "text": "The core of the L2P transduction engine is the dynamic programming algorithm for monotone phrasal decoding (Zens and Ney, 2004) ." }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-21", "text": "The main feature of this algorithm is its capability to transduce many consecutive characters with a single operation." }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-22", "text": "This algorithm is used to conduct a search for a max-weight derivation according to a linear model with indicator features." }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-23", "text": "A sample derivation is shown in Figure 1 ." }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-24", "text": "There are two main categories of features: context and transition features, which follow the first two feature templates described by Jiampojamarn et al. (2008) ." }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-25", "text": "Context features are centered around a transduction operation." }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-26", "text": "These features include an indicator for the operation itself, which is then conjoined with indicators for all n-grams of source context within a fixed window of the operation." }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-27", "text": "Transition features are Markov or n-gram features." 
}, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-28", "text": "They ensure that the produced target string makes sense as a character sequence, and are represented as indicators on the presence of target n-grams." }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-29", "text": "The feature templates have two main parameters, the size S of the character window from which source context features are drawn, and the maximum length T of target n-gram indicators." }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-30", "text": "We fit these parameters using grid search over 1-best The engine's features are trained using the structured perceptron (Collins, 2002) ." }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-31", "text": "Jiampojamarn et al. (2008) show strong improvements in the L2P domain using MIRA in place of the perceptron update; unfortunately, we did not implement a k-best MIRA update due to time constraints." }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-32", "text": "In our implementation, no special consideration was given to the availability of multiple correct answers in the training data; we always pick the first reference transliteration and treat it as the only correct answer." }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-33", "text": "Investigating the use of all correct answers would be an obvious next step to improve the system." }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-34", "text": "----------------------------------" }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-35", "text": "**MAJOR DIFFERENCES IN IMPLEMENTATION**" }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-36", "text": "Our system made two alternate design decisions (we do not claim improvements) over those made by (Jiampojamarn et al., 2008) , mostly based on the availability of software." 
}, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-37", "text": "First, we employed a beam of 40 candidates in our decoder, to enable efficient use of large language model contexts." }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-38", "text": "This is put to good use in the Hindi task, where we found n-gram indicators of length up to n = 6 provided optimal development performance." }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-39", "text": "Second, we employed an alternate character aligner to create our training derivations." }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-40", "text": "This aligner is similar to recent non-compositional phrasal word-alignment models (Zhang et al., 2008) , limited so it can only produce monotone character alignments." }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-41", "text": "The aligner creates substring alignments, without insertion or deletion operators." }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-42", "text": "As such, an aligned transliteration pair also serves as a transliteration derivation." }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-43", "text": "We employed a maximum substring length of 3." }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-44", "text": "The training data was heuristically cleaned after alignment." }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-45", "text": "Any derivation found by the aligner that uses an operation occurring fewer than 3 times throughout the entire training set was eliminated." }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-46", "text": "This reduced training set sizes to 8,511 pairs for English-Hindi and 20,306 pairs for EnglishKatakana." 
}, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-47", "text": "----------------------------------" }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-48", "text": "**THE BUG**" }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-49", "text": "The submitted version of our system had a bug in its transition features: instead of generating an indicator for every possible n-gram in the generated target sequence, it generated n-grams over target substrings, defined by the operations used during transduction." }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-50", "text": "Consider, for example, the derivation shown in Figure This leads to problems with data sparsity, which we had not noticed on unrelated experiments with larger training data." }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-51", "text": "We report results both with the bug and with fixed transition features." }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-52", "text": "We do so to emphasize the importance of a fine-grained language discriminative language model, as opposed to one which operates on a substring level." }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-53", "text": "----------------------------------" }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-54", "text": "**DEVELOPMENT**" }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-55", "text": "Development consisted of performing a parameter grid search over S and T for each language pair's development set." }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-56", "text": "All combinations of S = 0 . . ." }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-57", "text": "4 and T = 0 . . ." }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-58", "text": "7 were tested for each language pair." }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-59", "text": "Based on these experiments, we selected (for the fixed version), values of S = 2, T = 6 for English-Hindi, and S = 4, T = 3 for EnglishKatakana." 
}, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-60", "text": "----------------------------------" }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-61", "text": "**RESULTS**" }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-62", "text": "The results of our internal experiments with the official evaluation tool are shown in Table 1 ." }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-63", "text": "We report 1-best accuracy on both development and test sets, with both the buggy and fixed versions of our system." }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-64", "text": "As one can see, the bug makes less of an impact in the English-Katakana setting, where more training data is available." }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-65", "text": "----------------------------------" }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-66", "text": "**CONCLUSION**" }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-67", "text": "We have demonstrated that an automatic letterto-phoneme transducer performs fairly well on this transliteration shared task, with no languagespecific or transliteration-specific modifications." }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-68", "text": "Instead, we simply considered Hindi or Katakana to be an alternate encoding for English phonemes." }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-69", "text": "In the future, we would like to investigate proper use of multiple reference answers during perceptron training." }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-2", "text": "We interpret the problem of transliterating English named entities into Hindi or Japanese Katakana as a variant of the letter-to-phoneme (L2P) subtask of textto-speech processing." 
}, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-3", "text": "Therefore, we apply a re-implementation of a state-of-the-art, discriminative L2P system (Jiampojamarn et al., 2008) to the problem, without further modification." }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-4", "text": "In doing so, we hope to provide a baseline for the NEWS 2009 Machine Transliteration Shared Task (Li et al., 2009) , indicating how much can be achieved without transliteration-specific technology." }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-5", "text": "This paper briefly summarizes the original work and our reimplementation." }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-6", "text": "We also describe a bug in our submitted implementation, and provide updated results on the development and test sets." }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-7", "text": "----------------------------------" }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-9", "text": "Transliteration occurs when a word is borrowed into a language with a different character set from its language of origin." }, { "sent_id": "563476cdb64cdf7d47bc5e8e1c32c3-C001-10", "text": "The word is transcribed into the new character set in a manner that maintains phonetic correspondence." 
} ], "y": { "@USE@": { "gold_contexts": [ [ "563476cdb64cdf7d47bc5e8e1c32c3-C001-3" ], [ "563476cdb64cdf7d47bc5e8e1c32c3-C001-15" ], [ "563476cdb64cdf7d47bc5e8e1c32c3-C001-24" ] ], "cite_sentences": [ "563476cdb64cdf7d47bc5e8e1c32c3-C001-3", "563476cdb64cdf7d47bc5e8e1c32c3-C001-15", "563476cdb64cdf7d47bc5e8e1c32c3-C001-24" ] }, "@DIF@": { "gold_contexts": [ [ "563476cdb64cdf7d47bc5e8e1c32c3-C001-36" ] ], "cite_sentences": [ "563476cdb64cdf7d47bc5e8e1c32c3-C001-36" ] } } }, "ABC_60f54cf8f510affe214f63f8e23e19_43": { "x": [ { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-27", "text": "The first neural identification system was MU-MULS, submitted to the PARSEME shared task 1.0 (Klyueva et al., 2017) ." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-28", "text": "Although it did not obtain the best results, MUMULS influenced the development of more advanced models (Gharbieh et al., 2017) which ultimately led to the popularisation of the approach." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-2", "text": "Recent initiatives such as the PARSEME shared task have allowed the rapid development of MWE identification systems." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-3", "text": "Many of those are based on recent NLP advances, using neural sequence models that take continuous word representations as input." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-4", "text": "We study two related questions in neural verbal MWE identification: (a) the use of lemmas and/or surface forms as input features, and (b) the use of word-based or character-based embeddings to represent them." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-5", "text": "Our experiments on Basque, French, and Polish show that character-based representations yield systematically better results than word-based ones." 
}, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-6", "text": "In some cases, character-based representations of surface forms can be used as a proxy for lemmas, depending on the morphological complexity of the language." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-29", "text": "As a consequence, and inspired by the success of neural models in NLP, nine out of the 17 systems submitted to the PARSEME shared task 1.1 used neural networks ." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-30", "text": "Recently, improvements have been proposed, e.g. to deal with discontinuous MWEs (Rohanian et al., 2019) ." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-31", "text": "Previous work studied the impact of external lexicons (Riedl and Biemann, 2016) and of several feature sets (Maldonado et al., 2017) on CRFs for MWE identification." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-32", "text": "Character-based embeddings have been shown useful to predict MWE compositionality out of context (Hakimi Parizi and Cook, 2018) ." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-33", "text": "In other tasks such as named entity recognition, character convolution layers have been successfully applied (Ma and Hovy, 2016) ." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-34", "text": "The use of pre-trained vs. randomly initialised embeddings has been analysed in some PARSEME shared task papers (Ehren et al., 2018; Zampieri et al., 2018) ." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-35", "text": "The closest works to ours are the Veyn (Zampieri et al., 2018) and SHOMA (Taslimipoor and Rohanian, 2018) systems, submitted to the PARSEME shared task 1.1." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-36", "text": "Veyn is used as our off-the-shelf base system, so most of its architecture is identical to ours." 
}, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-37", "text": "Similarly to us, SHOMA employs FastText embeddings, a recurrent layer and a CRF output layer." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-38", "text": "To our knowledge, however, this is the first study to compare input representations for neural MWE identification." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-39", "text": "3 Experimental Setup Corpora The PARSEME shared task 1.1 released freely available VMWE-annotated corpora in 20 languages." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-40", "text": "1 Each language's corpus is split into training, development and test parts." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-41", "text": "To choose our target languages, we analysed the PARSEME corpora, choosing 3 languages with varying morphological richness: Basque (EU), French (FR) and Polish (PL), shown in Table 1 ." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-42", "text": "2 The FR training corpus has more than 420K tokens, whereas the PL and EU training corpora have around 220K and 117K tokens." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-43", "text": "EU contains less annotated VMWE occurrences than both FR and PL." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-44", "text": "The average length of annotated VMWE occurrences is similar in the three languages (2.02/2.29/2.13 in EU/FR/PL)." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-45", "text": "The proportion of discontinuous VMWEs in highest in FR (42.12%), whereas in Polish (29.76%) and in Basque (19.28%) they are less frequent." 
}, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-7", "text": "----------------------------------" }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-9", "text": "MWE identification consists in finding multiword expressions (MWEs) in running text (Constant et al., 2017) ." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-10", "text": "For many years, MWE identification was considered unrealistic, with most MWE research focusing on out-of-context MWE discovery (Ramisch et al., 2013) ." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-11", "text": "Indeed, the availability of MWE-annotated corpora was limited to some treebanks with partial annotations, often a by-product of syntax trees Constant et al., 2013) ." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-12", "text": "This prevented the widespread development and evaluation of MWE identification systems, as compared to other tasks such as POS tagging and named entity recognition." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-13", "text": "This landscape has drastically changed in the last few years, thanks to shared tasks such as DiMSUM (Schneider et al., 2016) and PARSEME 1.0 and 1.1 (Savary et al., 2017; and to the release of open corpora annotated for MWEs in \u223c20 languages." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-14", "text": "These initiatives provide a unified framework for MWE identification, including training/test corpus splits, evaluation metrics, benchmark results, and analysis tools." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-15", "text": "As a consequence, it is now possible to study some classical text processing problems and their impact on MWE identification systems." 
}, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-16", "text": "One of these problems is the relation between a language's morphology, lemmatisation, input feature representations, out-of-vocabulary (OOV) words, and the performance of the system." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-17", "text": "For instance, an MWE identification system based on (inflected) surface forms will likely encounter more OOV words than a system based on lemmas, especially for morphologically-rich languages in which a single lemma may correspond to dozens of surface forms (Seddah et al., 2013) ." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-18", "text": "This problem is particularly relevant for verbal MWEs, which present high morphological and syntactic variability ." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-19", "text": "Our goal is to study the impact of word representations on verbal MWE (VMWE) identification, comparing lemmas, surface forms, traditional word embeddings and subword representations." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-20", "text": "We compare the performance of an off-the-shelf MWE identification system based on neural sequence tagging (Zampieri et al., 2018) using lemmas and surface forms as input features, encoded in the form of classical pre-initialised word2vec embeddings (Mikolov et al., 2013) or, alternatively, using new-generation FastText embeddings built from character n-grams (Bojanowski et al., 2017) ." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-21", "text": "Our main hypothesis is that the latter can model morphological variability, representing an alternative for lemmatisation." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-22", "text": "We carry out experiments in 3 languages with varying morphological complexity: French, Polish and Basque." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-23", "text": "popular models for MWE identification (Constant et al., 2017) ." 
}, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-24", "text": "Parsing-based methods take the (recursive) structure of language into account, trying to identify MWEs as a by-product of parsing Constant et al., 2013) , or jointly (Constant and Nivre, 2016) ." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-25", "text": "Sequence tagging models, on the other hand, consider only linear context, using models such as CRFs (Vincze et al., 2011; Shigeto et al., 2013; Riedl and Biemann, 2016) and averaged perceptron (Schneider et al., 2014) combined with some variant of begin-inside-outside (BIO) encoding (Ramshaw and Marcus, 1995) ." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-26", "text": "Recurrent neural networks can be used for sequence tagging, being able to handle continuous word representations and unlimited context." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-46", "text": "These languages do have not the same morphological richness, as measured by the average number of surface forms per lemma in the vocabulary ('Morph' column)." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-47", "text": "For instance, the EU training corpus (2.32) has a higher morphological richness than PL (2.21) and FR (1.33)." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-48", "text": "The rate of OOVs, that is, of words that appear in the dev or test corpus vocabularies, but not in the training corpus, is higher for surface forms than for lemmas, with a potential negative impact on VMWE identification systems based on surface forms only." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-49", "text": "As expected, the OOV rate for surface forms is lowest in FR (20-26%), which also has the lowest morphological richness, and highest for EU (43%)." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-50", "text": "These differences are less visible for lemmas, which abstract away from morphology." 
}, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-51", "text": "3 An interesting figure is the OOV rate focusing on verbs only." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-52", "text": "4 Here, PL presents more OOV verb forms (42-44%) than EU (32%), but again this difference disappears for lemmas." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-53", "text": "This is relevant because our experimen-1 http://hdl.handle.net/11372/LRT-2842 2 Other languages have similar characteristics but were not selected due to the size of the corpora or to incomplete information (e.g. Turkish has missing surface forms for some verbs, preventing us from training a system based on surface forms only)." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-54", "text": "3 The official PARSEME French test corpus presents 11,632 missing lemmas." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-55", "text": "We have lemmatised it using UDPipe (http://ufal.mff.cuni.cz/udpipe) with default parameters, trained on the PARSEME shared task training corpus, to remain in the \"closed track\" conditions." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-56", "text": "4 For EU, we consider the POS tags VERB, ADI and ADT according to the conversion tal setup implies that it is difficult for a system to predict a VMWE without a reliable representation for a verb, learned from the training data." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-57", "text": "----------------------------------" }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-58", "text": "**MWE IDENTIFICATION SYSTEM**" }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-59", "text": "We use our inhouse MWE identification system Veyn (Zampieri et al., 2018) , based on sequence tagging using recurrent neural networks." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-60", "text": "5 The system takes as input the concatenation of the embeddings of the words' features (e.g. lemmas and POS)." 
}, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-61", "text": "It uses a CRF output layer (conditional random fields) to predict valid label sequences, with VMWEs encoded using the 'BIOG+cat' format." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-62", "text": "Each token is tagged 'B' if it is at the beginning of a VMWE, 'I' if it is inside a VMWE, 'O' if it does not belong to a VMWE, and 'G', if it does not belong to a VMWE but it is in the gap between two words that are part of a VMWE." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-63", "text": "The tags 'B' and 'I' are concatenated with the VMWE categories (VID, LVC.full, etc.) present in the corpus." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-64", "text": "The system is trained on the shared task training corpora, so that the results are comparable with the systems submitted to the closed track." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-65", "text": "6 We use the dev corpus as validation data, training for 25 epochs which 3 epochs of patience for early stopping." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-66", "text": "We configure it to use two layers of bidirectional gated recurrent units (GRU) of dimension 128, with all other parameters taking the default values suggested in the Veyn documentation." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-67", "text": "Word Representations We use two types of word embeddings to represent input surface forms 5 https://github.com/zamp13/Veyn 6 http://multiword.sourceforge.net/ sharedtaskresults2018 and lemmas: word2vec and FastText." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-68", "text": "Word2vec is a prediction-based distributional model in which a word representation is obtained from a neural network trying to predict a word from its context or vice-versa (Mikolov et al., 2013) ." 
}, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-69", "text": "FastText is an adaptation which also takes into account character n-grams, being able to build vectors for OOVs from its character n-grams (Bojanowski et al., 2017) ." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-70", "text": "For each representation, we used the gensim library 7 to train 256-dimensional vectors for both forms and lemmas on the training corpus of the shared task for 10 epochs." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-71", "text": "Furthermore, all embeddings use the CBOW algorithm with the same hyper-parameter values of 5 for the window size (left/right context of words) and 1 for min-count (minimum number of occurrences of words)." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-72", "text": "For FastText, we set the size of character n-grams to 1 to combine the whole word's embedding with the embeddings of its characters." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-73", "text": "We did not use contextual representations, like BERT, Elmo or Flair (Devlin et al., 2018; Peters et al., 2018; Akbik et al., 2018) , because they have to be pre-trained on large corpora and we wanted to have an experimental setup compatible with the closed track of the PARSEME shared task." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-74", "text": "----------------------------------" }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-75", "text": "**EVALUATION MEASURES**" }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-76", "text": "We adopt the metrics proposed in the PARSEME shared tasks (Savary et al., 2017) ." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-77", "text": "The MWE-based measure (F-MWE) is the F1 score for fully predicted VMWEs, whereas the token-based measure (F-TOK) is the F1 score for tokens belonging to a VMWE." 
}, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-78", "text": "Table 2 : MWE-based F-measure (F-MWE) and token-based F-measures (F-TOK) of the models on the test corpus, using word2vec and FastText word representations for different feature sets: lemmas, surface forms, and both." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-79", "text": "----------------------------------" }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-80", "text": "**RESULTS**" }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-81", "text": "We train Veyn using UPOS tags as input features, combined with word2vec and FastText embeddings for lemmas, surface forms, or both." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-82", "text": "8 Performances are given on the PARSEME test corpus for Basque (EU), French (FR) and Polish (PL)." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-83", "text": "On one hand, we compare performances with FastText and word2vec representations, and on the other hand, we compare performances with various input feature sets ( Table 2) ." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-84", "text": "si\u0119 w stanie, the system with FastText makes no prediction whereas with word2vec the prediction is b\u0119dzie si\u0119 w stanie where the reflexive clitic si\u0119 is tagged as being a single-token inherently reflexive verb and w stanie is predicted as a verbal idiom." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-85", "text": "More single token predictions increase the recall of F-TOK, but decrease the precision of F-MWE." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-86", "text": "Further investigation will be made to understand this phenomenon, which could be compensated by simple post-processing, e.g. grouping single-token predictions with adjacent ones." 
}, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-87", "text": "We hypothesise that the system with subword representation is able to take the morphological inflection into account." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-88", "text": "For example, the French expression faire r\u00e9f\u00e9rence 'to make reference' is seen in this form in the training corpus, but the test corpus contains a different inflection of the verb fait r\u00e9f\u00e9rence 'makes reference'." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-89", "text": "For this example, with FastText representation the system is able to find the expression, but with word2vec representation the system can not find it if we use surface form and lemma at input." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-90", "text": "----------------------------------" }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-91", "text": "**IMPACT OF WORD VECTOR REPRESENTATION**" }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-92", "text": "Impact of Input Pre-processing For Basque, which has a high morphological richness, the model with the richest information provides the best results." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-93", "text": "Performances are maximised with the form-lemma model, providing an F-MWE score of 69.24, while the form model yields a 66.52 score and the lemma model gives 62.86, suggesting that relevant information for VMWE identification is lost in lemmatisation." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-94", "text": "For Polish, similar results are obtained in terms of F-Tok while F-MWE is maximised for the lemma configuration with FastText." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-95", "text": "This is also a consequence of the phenomenon described in the previous subsection where single-token expressions are predicted for Polish." 
}, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-96", "text": "The lemma configuration is less affected by this phenomenon (F-TOK is lower) and thus full-expression identification is more effec-tive (higher F-MWE of 61.49)." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-97", "text": "Results on French corroborate this trend: although French has simpler morphology, lemmas are still important to obtain best results." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-98", "text": "As opposed to highly morphological languages like Basque, the combination of lemmas and forms for French does not yield as much improvement." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-99", "text": "Performances in terms of F-TOK are equivalent for lemma and form-lemma and are slightly better in terms of F-MWE." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-100", "text": "For the three languages under consideration, our best models would have ranked in the top-3 in the closed track of the official shared task results." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-101", "text": "----------------------------------" }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-102", "text": "**CONCLUSIONS AND FUTURE WORK**" }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-103", "text": "We have studied the impact of word representations on VMWE identification for Basque, French and Polish, comparing lemmas and surface forms as input features and comparing traditional word embeddings (word2vec) and subword representations (FastText)." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-104", "text": "Regarding the latter, subword representations proved to be efficient for our task." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-105", "text": "For the former, we have highlighted that the use of lemmas always have a positive impact." 
}, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-106", "text": "For languages with high morphological richness, the combination of lemmas and forms has an even higher impact, especially for Basque." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-107", "text": "Considering the high Out-of-Vocabulary rate, including for verbs, we intend to improve OOV handling in the future." }, { "sent_id": "60f54cf8f510affe214f63f8e23e19-C001-108", "text": "The use of recent embeddings such as BERT, Elmo and Flair, trained on large external corpora, could help with OOVs." } ], "y": { "@USE@": { "gold_contexts": [ [ "60f54cf8f510affe214f63f8e23e19-C001-20" ], [ "60f54cf8f510affe214f63f8e23e19-C001-34" ], [ "60f54cf8f510affe214f63f8e23e19-C001-59" ] ], "cite_sentences": [ "60f54cf8f510affe214f63f8e23e19-C001-20", "60f54cf8f510affe214f63f8e23e19-C001-34", "60f54cf8f510affe214f63f8e23e19-C001-59" ] }, "@SIM@": { "gold_contexts": [ [ "60f54cf8f510affe214f63f8e23e19-C001-35" ] ], "cite_sentences": [ "60f54cf8f510affe214f63f8e23e19-C001-35" ] } } }, "ABC_c663b64c73f2e583ea631c054824d8_43": { "x": [ { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-33", "text": "The method used for training it is a modified version of Skip-gram word2vec described in [7] ." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-85", "text": "Accuracy scores for movie review polarity prediction are presented in figures 3 and 4." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-86", "text": "Again we see that crawl 840 performs very well." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-32", "text": "Wikipedia Dependency corpus is a collection of 1 billion tokens from Wikipedia." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-2", "text": "Word embeddings or distributed representations of words are being used in various applications like machine translation, sentiment analysis, topic identification etc." 
}, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-3", "text": "Quality of word embeddings and performance of their applications depends on several factors like training method, corpus size and relevance etc." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-4", "text": "In this study we compare performance of a dozen of pretrained word embedding models on lyrics sentiment analysis and movie review polarity tasks." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-5", "text": "According to our results, Twitter Tweets is the best on lyrics sentiment analysis, whereas Google News and Common Crawl are the top performers on movie polarity analysis." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-6", "text": "Glove trained models slightly outrun those trained with Skipgram." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-7", "text": "Also, factors like topic relevance and size of corpus significantly impact the quality of the models." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-8", "text": "When medium or large-sized text sets are available, obtaining word embeddings from same training dataset is usually the best choice." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-9", "text": "----------------------------------" }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-11", "text": "Semantic vector space models of language were developed in the 90s to predict joint probabilities of words that appear together in a sequence." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-12", "text": "A particular upturn was proposed by Bengio et al. in [1] , replacing sparse n-gram models with word embeddings which are more compact representations obtained using feed-forward or more advanced neural networks." 
}, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-13", "text": "Recently, high quality and easy to train Skip-gram shallow architectures were presented in [10] and considerably improved in [11] with the introduction of negative sampling and subsampling of frequent words." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-14", "text": "The \"magical\" ability of word embeddings to capture syntactic and semantic regularities on text words is applicable in various applications like machine translations, error correcting systems, sentiment analyzers etc." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-15", "text": "This ability has been tested in [12] and other studies with analogy question tests of the form \"A is to B as C is to \" or male/female relations." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-16", "text": "A recent improved method for generating word embeddings is Glove [15] which makes efficient use of global statistics of text words and preserves the linear substructure of Skip-gram word2vec, the other popular method." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-17", "text": "Authors report that Glove outperforms other methods such as Skip-gram in several tasks like word similarity, word analogy etc." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-18", "text": "In this paper we examine the quality of word embeddings on 2 sentiment analysis tasks: Lyrics mood recognition and movie review polarity analysis." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-19", "text": "We compare various models pretrained with Glove and Skip-gram, together with corpora we train ourself." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-20", "text": "Our goal is to report the best performing models as well as to observe the impact that certain factors like training method, corpus size and thematic relevance of texts might have on model quality." 
}, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-21", "text": "According to the results, Common Crawl, Twitter Tweets and Google News are the best performing models." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-22", "text": "Corpus size and thematic relevance have a significant role on the performance of the generated word vectors." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-23", "text": "We noticed that models trained with Glove slightly outperform those trained with Skip-gram in most of experiments." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-24", "text": "----------------------------------" }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-25", "text": "**WORD EMBEDDING CORPORA AND MODELS**" }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-26", "text": "In this section we present the different word embedding models that we compare." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-27", "text": "Most of them are pretrained and publicly available." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-28", "text": "Two of them (Text8Corpus and Moody-Corpus) were trained by us." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-29", "text": "The full list with some basic characteristics is presented in Table 1 ." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-30", "text": "Wikipedia Gigaword is a combination of Wikipedia 2014 dump and Gigaword 5 with about 6 billion tokens in total." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-31", "text": "It was created by authors of [15] to evaluate Glove performance." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-34", "text": "Google News is one of the biggest and richest text sets with 100 billion tokens and a vocabulary of 3 million words and phrases [10] ." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-35", "text": "It was trained using Skipgram word2vec with negative sampling, windows size 5 and 300 dimensions." 
}, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-36", "text": "Even bigger is Common Crawl 840, a huge corpus of 840 billion tokens and 2.2 million word vectors also used at [15] ." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-37", "text": "It contains data of Common Crawl (http://commoncrawl.org), a nonprofit organization that creates and maintains public datasets by crawling the web." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-38", "text": "Common Crawl 42 is a reduced version made up of 42 billion tokens and a vocabulary of 1.9 million words." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-39", "text": "Common Crawl 840 and Common Crawl 42 were trained with Glove method producing vectors of 300 dimensions for each word." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-40", "text": "The last Glove corpus is the collection of Twitter Tweets." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-41", "text": "It consists of 2 billion tweets, 27 billion tokens and 1.2 million words." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-42", "text": "To observe the role of corpus size in quality of generated embeddings, we train and use Text8Corpus, a smaller corpus consisting of 17 million tokens and 25,000 words." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-43", "text": "The last model we use is MoodyCorpus, a collection of lyrics that followed our work in [3] where we build and evaluate MoodyLyrics, a sentiment annotated dataset of songs." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-44", "text": "The biggest part of MoodyCorpus was built using lyrics of Million Song Dataset (MSD) songs (https://labrosa.ee.columbia.edu/millionsong/)." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-45", "text": "As music tastes and characteristics change over time (http://kaylinwalker.com/50-years-ofpop-music), it is better to have diversified sources of songs in terms of epoch, genre etc." 
}, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-46", "text": "Thereby we added songs of different genres and epochs that we found in two subsets of MSD, Cal500 and TheBeatles." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-47", "text": "The resulting corpus of 90 million tokens and 43,000 words can be downloaded from http://softeng.polito.it/erion." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-48", "text": "Further information about public music datasets can be found at [2] ." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-49", "text": "----------------------------------" }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-50", "text": "**SENTIMENT ANALYSIS TASKS**" }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-51", "text": "The problem of music mood recognition is about utilizing machine learning, data mining and other techniques to automatically classify songs in 2 or more emotion categories with highest possible accuracy." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-52", "text": "Different combinations of features such as audio or lyrics are involved in the process." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-53", "text": "In this study we make use of song lyrics exploiting the dataset described in [9] (here AM628)." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-54", "text": "The original dataset contains 771 song texts collected from AllMusic portal." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-55", "text": "AllMusic tags and 3 human experts were used for the annotation of songs." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-56", "text": "We balanced the dataset obtaining 314 positive and 314 negative lyrics." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-57", "text": "We also utilize MoodyLyrics (here ML3K), a dataset of 3,000 mood labeled songs from different genres and epochs described in [3] ." 
}, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-58", "text": "Pioneering work in movie review polarity analysis has been conducted by Pang and Lee in [14] and [13] ." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-59", "text": "The authors released sentiment polarity dataset, a collection of 2,000 movie reviews categorized as positive or negative." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-60", "text": "Deep learning techniques and distributed word representations appeared on recent studies like [17] where the role of RNNs (Recurrent Neural Networks), and CNNs (Convolutional Neural Networks) is explored." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-61", "text": "The author reports that CNNs perform best." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-62", "text": "An important work that has relevance here is [8] where authors present an even larger movie review dataset of 50,000 movie reviews from IMBD." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-63", "text": "This dataset has been used in various works such as [5] , [16] etc." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-64", "text": "For our experiments we used a chunk of 10K (MR10K) as well as the full set (MR50K)." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-65", "text": "We first cleaned and tokenized texts of the datasets." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-66", "text": "The dataset of the current run is loaded and a set of unique text words is created." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-67", "text": "All 14 models are also loaded in the script." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-68", "text": "We train a 15th (self w2v) model using the corpus of the current run and Skip-gram method." 
}, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-69", "text": "The script iterates in every line of the pretrained models splitting apart the words and the float vectors and building {word: vec} dictionaries later used as classification feature sets." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-70", "text": "Next we prepare the classification models using tf-idf vectorizer which has been successfully applied in similar studies like [4] ." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-71", "text": "Instead of applying tf-idf in words only as in other text classifiers, we vectorize both word (for semantic relevance) and corresponding vector (for syntactic and contextual relevance)." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-72", "text": "Random forest was used as classifier and 5-fold cross-validation accuracy is computed for each of the models." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-73", "text": "----------------------------------" }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-74", "text": "**RESULTS**" }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-75", "text": "In figures 1 and 2 we see results of 5-fold cross-validation on the 2 lyrics datasets." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-76", "text": "Top three models are crawl 840, twitter 50 and self w2v." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-77", "text": "On AM628 (very smallest dataset), it is crawl 840 (the biggest model) that leads, followed by twitter 50." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-78", "text": "Self w2v is severely penalized by its size and thus is at the bottom." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-79", "text": "On ML3K (large dataset) self w2v reaches the top of the list, leaving behind twitter 50." 
}, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-80", "text": "Wikigiga, google news and dep based are positioned in the middle whereas MoodyCorpus and Text8Corpus end the list." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-81", "text": "Their accuracy scores drift from 0.62 to 0.75." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-82", "text": "It is interesting to see how self w2v goes up from the last to the top, with scores edging between 0.61 and 0.83." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-83", "text": "This model is trained with the data of each experiment and depends on the size of that dataset which grows significantly (see Table 2 )." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-84", "text": "We see that accuracy values we got here are in line with reports from other similar works such as [6] where they use a dataset of 1032 lyrics from AllMusic to perform content analysis with text features." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-87", "text": "Google news is also among the top whereas Twitter models are positioned in the middle of the list." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-88", "text": "Once again self w2v grows considerably, this time from the 3rd place to the top." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-89", "text": "On MR50K it has a discrete margin of more than 0.03 from the 2nd position." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-90", "text": "Again wikigiga models are positioned in the middle of the list and the worst performing models are MoodyCorpus and Text8Corpus." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-91", "text": "Our scores on this task are somehow lower than those reported from various studies that explore advanced deep learning constructs on same dataset." 
}, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-92", "text": "In [8] for example, authors who created movie review dataset try on it their probabilistic model that is able to capture semantic similarities between words." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-93", "text": "They report a maximal accuracy of 0.88." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-94", "text": "A study that uses a very similar method is [16] where authors combine random forest with word vector average values." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-95", "text": "On movie review dataset they achieve an accuracy of 0.84 which is about what we got here." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-96", "text": "----------------------------------" }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-97", "text": "**DISCUSSION**" }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-98", "text": "In this paper we examined the quality of different word embedding models on two sentiment analysis tasks: Lyrics mood recognition and movie review polarity." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-99", "text": "We observed the role of factors like training method, vocabulary and corpus size and thematic relevance of texts." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-100", "text": "According to our results, the best performing models are Common Crawl, Twitter Tweets and Google News." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-101", "text": "In general, models trained with Glove slightly outperform those trained using Skip-gram, especially on lyrics sentiment analysis (Twitter and Crawl)." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-102", "text": "We also notice that vocabulary richness and corpus size have a significant influence on model quality." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-103", "text": "The biggest models like crawl 840 are always among the best." 
}, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-104", "text": "Likewise self w2v performs very well on both tasks when trained with medium or large data sizes (see Table 2 )." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-105", "text": "Being the smallest in sizes, MoodyCorpus and Text8Corpus are always the worst." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-106", "text": "Regarding thematic relevance, Twitter corpora perform better on lyrics sentiment analysis." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-107", "text": "They are large and rich in vocabulary with texts of an informal and sentimental language." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-108", "text": "This language is very similar to the one of song lyrics, with love being the predominant word (see word cloud in [3] )." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-109", "text": "Movie review results on the other hand, are headed by Common Crawl and Google News which are the largest, both in size and vocabulary." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-110", "text": "These models are trained with diverse and informative texts that cover every possible subject or topic." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-111", "text": "Having a look on some movie reviews we also see a similar language with comments about the movies of different categories." }, { "sent_id": "c663b64c73f2e583ea631c054824d8-C001-112", "text": "Furthermore, we saw that when training set is big enough, obtaining word embeddings from it (self w2v) is the best option." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "c663b64c73f2e583ea631c054824d8-C001-16" ], [ "c663b64c73f2e583ea631c054824d8-C001-31" ] ], "cite_sentences": [ "c663b64c73f2e583ea631c054824d8-C001-16", "c663b64c73f2e583ea631c054824d8-C001-31" ] }, "@USE@": { "gold_contexts": [ [ "c663b64c73f2e583ea631c054824d8-C001-26", "c663b64c73f2e583ea631c054824d8-C001-31", "c663b64c73f2e583ea631c054824d8-C001-36" ] ], "cite_sentences": [ "c663b64c73f2e583ea631c054824d8-C001-31", "c663b64c73f2e583ea631c054824d8-C001-36" ] } } }, "ABC_b7158e8478b14cd8337f2aa8ee6193_43": { "x": [ { "sent_id": "b7158e8478b14cd8337f2aa8ee6193-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "b7158e8478b14cd8337f2aa8ee6193-C001-2", "text": "West African Pidgin English is a language that is significantly spoken in West Africa, consisting of at least 75 million speakers." }, { "sent_id": "b7158e8478b14cd8337f2aa8ee6193-C001-3", "text": "Nevertheless, proper machine translation systems and relevant NLP datasets for pidgin English are virtually absent." }, { "sent_id": "b7158e8478b14cd8337f2aa8ee6193-C001-4", "text": "In this work, we develop techniques targeted at bridging the gap between Pidgin English and English in the context of natural language generation." }, { "sent_id": "b7158e8478b14cd8337f2aa8ee6193-C001-5", "text": "By building upon the previously released monolingual Pidgin English text and parallel English data-to-text corpus, we hope to build a system that can automatically generate Pidgin English descriptions from structured data." }, { "sent_id": "b7158e8478b14cd8337f2aa8ee6193-C001-6", "text": "We first train a data-to-English text generation system, before employing techniques in unsupervised neural machine translation and self-training to establish the Pidgin-to-English cross-lingual alignment." 
}, { "sent_id": "b7158e8478b14cd8337f2aa8ee6193-C001-7", "text": "The human evaluation performed on the generated Pidgin texts shows that, though still far from being practically usable, the pivoting + self-training technique improves both Pidgin text fluency and relevance." }, { "sent_id": "b7158e8478b14cd8337f2aa8ee6193-C001-8", "text": "----------------------------------" }, { "sent_id": "b7158e8478b14cd8337f2aa8ee6193-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "b7158e8478b14cd8337f2aa8ee6193-C001-10", "text": "Pidgin English is one of the the most widely spoken languages in West Africa with roughly 75 million speakers estimated in Nigeria; and over 5 million speakers estimated in Ghana (Ogueji & Ahia, 2019) ." }, { "sent_id": "b7158e8478b14cd8337f2aa8ee6193-C001-11", "text": "1 While there have been recent efforts in popularizing the monolingual Pidgin English as seen in the BBC Pidgin 2 , it remains under-resourced in terms of the available parallel corpus for machine translation." }, { "sent_id": "b7158e8478b14cd8337f2aa8ee6193-C001-12", "text": "Similarly, this low-resource scenario extends to other domains in natural language generation (NLG) such as summarization, data-to-text and so on (Lebret et al., 2016; Su et al., 2018; Shen et al., 2019a; b; Zhao et al., 2019; Hong et al., 2019; de Souza et al., 2018) \u2212 where Pidgin English generation is largely under-explored." }, { "sent_id": "b7158e8478b14cd8337f2aa8ee6193-C001-13", "text": "The scarcity is further aggravated when the pipeline language generation system includes other sub-modules that computes semantic textual similarity (Shen et al., 2017; Zhuang & Chang, 2017) , which exists solely in English." }, { "sent_id": "b7158e8478b14cd8337f2aa8ee6193-C001-14", "text": "Previous works on unsupervised neural machine translation for Pidgin English constructed a monolingual corpus (Ogueji & Ahia, 2019) , and achieved a BLEU score of 5.18 from English to Pidgin." 
}, { "sent_id": "b7158e8478b14cd8337f2aa8ee6193-C001-15", "text": "However, there is an issue of domain mismatch between down-stream NLG tasks and the trained machine translation system." }, { "sent_id": "b7158e8478b14cd8337f2aa8ee6193-C001-16", "text": "This creates a caveat where the resulting English-to-Pidgin MT systems (trained on the domain of news and the Bible) cannot be directly used to translate out-domain English texts to Pidgin." }, { "sent_id": "b7158e8478b14cd8337f2aa8ee6193-C001-17", "text": "An example of the English/pidgin text in the restaurant domain (Novikova et al., 2017) is displayed in Table 1 ." }, { "sent_id": "b7158e8478b14cd8337f2aa8ee6193-C001-18", "text": "Nevertheless, we argue that this domain-mismatch problem can be alleviated by using English text in the target-domain as a pivot language (Guo et al., 2019) ." }, { "sent_id": "b7158e8478b14cd8337f2aa8ee6193-C001-19", "text": "To this end, we explore this idea on the task of neural data-to-text generation which has been the subject of much recent research." }, { "sent_id": "b7158e8478b14cd8337f2aa8ee6193-C001-20", "text": "Neural data-to-Pidgin generation is essential in the African continent especially given the fact that many English There is a pub Blue Spice located in the centre of the city that provides Chinese food." }, { "sent_id": "b7158e8478b14cd8337f2aa8ee6193-C001-21", "text": "Pidgin" }, { "sent_id": "b7158e8478b14cd8337f2aa8ee6193-C001-22", "text": "Wan pub blue spice dey for centre of city wey dey give Chinese food." }, { "sent_id": "b7158e8478b14cd8337f2aa8ee6193-C001-23", "text": "existing data-to-text systems are English-based e.g. Weather reporting systems (Sripada et al., 2002; Belz, 2008) ." }, { "sent_id": "b7158e8478b14cd8337f2aa8ee6193-C001-24", "text": "This work aims at bridging the gap between many of these English-based systems and Pidgin by training an in-domain English-to-pidgin MT system in an unsupervised way." 
}, { "sent_id": "b7158e8478b14cd8337f2aa8ee6193-C001-25", "text": "By this means, English-based NLG systems can be locally adapted by translating the output English text into pidgin English." }, { "sent_id": "b7158e8478b14cd8337f2aa8ee6193-C001-26", "text": "We employ the publicly available parallel data-to-text corpus E2E (Novikova et al., 2017) consisting of tabulated data and English descriptions in the restaurant domain." }, { "sent_id": "b7158e8478b14cd8337f2aa8ee6193-C001-27", "text": "The training of the in-domain MT system is done with a two-step process: (1) We use the target-side English texts as the pivot, and train an unsupervised NMT (model unsup ) directly between in-domain English text and the available monolingual Pidgin corpus." }, { "sent_id": "b7158e8478b14cd8337f2aa8ee6193-C001-28", "text": "(2) Next, we employ self-training (He et al., 2019) to create augmented parallel pairs to continue updating the system (model self )." }, { "sent_id": "b7158e8478b14cd8337f2aa8ee6193-C001-29", "text": "----------------------------------" }, { "sent_id": "b7158e8478b14cd8337f2aa8ee6193-C001-30", "text": "**APPROACH**" }, { "sent_id": "b7158e8478b14cd8337f2aa8ee6193-C001-31", "text": "First phase of the approach requires training of an unsupervised NMT system similar to Ogueji & Ahia (2019) (PidginUNMT)." }, { "sent_id": "b7158e8478b14cd8337f2aa8ee6193-C001-32", "text": "Similar to Ogueji & Ahia (2019) , we train the cross-lingual model using FastText Bojanowski et al. (2017) on the combined Pidgin-English corpus." }, { "sent_id": "b7158e8478b14cd8337f2aa8ee6193-C001-33", "text": "Next, we train an unsupervised NMT similar to Lample et al. (2017) ; Artetxe et al. (2017) ; Ogueji & Ahia (2019) between them to obtain model unsup ." }, { "sent_id": "b7158e8478b14cd8337f2aa8ee6193-C001-34", "text": "Then we further utilize model unsup to construct pseudo parallel corpus by predicting target Pidgin text given the English input." 
}, { "sent_id": "b7158e8478b14cd8337f2aa8ee6193-C001-35", "text": "We augment this dataset to the existing monolingual corpus." }, { "sent_id": "b7158e8478b14cd8337f2aa8ee6193-C001-36", "text": "The self-training step involves further updating model unsup on the pseudo parallel corpus and non-parallel monolingual corpus to yield model self ." }, { "sent_id": "b7158e8478b14cd8337f2aa8ee6193-C001-37", "text": "----------------------------------" }, { "sent_id": "b7158e8478b14cd8337f2aa8ee6193-C001-38", "text": "**EXPERIMENTS AND RESULTS**" }, { "sent_id": "b7158e8478b14cd8337f2aa8ee6193-C001-39", "text": "We conduct experiments on the E2E corpus (Novikova et al., 2017) which amounts to roughly 42k samples in the training set." }, { "sent_id": "b7158e8478b14cd8337f2aa8ee6193-C001-40", "text": "The monolingual Pidgin corpus contains 56,695 sentences and 32,925 unique words." }, { "sent_id": "b7158e8478b14cd8337f2aa8ee6193-C001-41", "text": "The human evaluation was performed on the test set (630 data instances for E2E) by averaging over scores by 2 native Pidgin speakers on both Relevance (0 or 1 to indicate relevant or not) and Fluency (0, 1, or 2 to indicate readability)." }, { "sent_id": "b7158e8478b14cd8337f2aa8ee6193-C001-42", "text": "Table 2 shows that model self outperforms direct translation (PidginUNMT) and unsupervisedly-trained model model unsup on relevance and performing on par with PidginUNMT on fluency." }, { "sent_id": "b7158e8478b14cd8337f2aa8ee6193-C001-43", "text": "We also display relevant sample outputs in Table 3 at all levels of fluency." }, { "sent_id": "b7158e8478b14cd8337f2aa8ee6193-C001-44", "text": "Pidgin text Fluency Every money of money on food and at least 1 of 1 points." }, { "sent_id": "b7158e8478b14cd8337f2aa8ee6193-C001-45", "text": "0 and na one na di best food for di world." }, { "sent_id": "b7158e8478b14cd8337f2aa8ee6193-C001-46", "text": "1 People dey feel the good food but all of us no dey available." 
}, { "sent_id": "b7158e8478b14cd8337f2aa8ee6193-C001-47", "text": "2" }, { "sent_id": "b7158e8478b14cd8337f2aa8ee6193-C001-48", "text": "----------------------------------" }, { "sent_id": "b7158e8478b14cd8337f2aa8ee6193-C001-49", "text": "**CONCLUSION**" }, { "sent_id": "b7158e8478b14cd8337f2aa8ee6193-C001-50", "text": "In this paper, we have shown that it is possible to improve upon low-resource Pidgin text generation in a demonstrated low-resource scenario." }, { "sent_id": "b7158e8478b14cd8337f2aa8ee6193-C001-51", "text": "By using non-parallel in-domain English and out-domain Pidgin text along with self-training algorithm, we show that both fluency and relevance can be further improved." }, { "sent_id": "b7158e8478b14cd8337f2aa8ee6193-C001-52", "text": "This work serves as the starting point for future works on Pidgin NLG in the absence of annotated data." }, { "sent_id": "b7158e8478b14cd8337f2aa8ee6193-C001-53", "text": "For future works, we will also further utilize phrase-based statistical machine translation to further improve upon current work." } ], "y": { "@USE@": { "gold_contexts": [ [ "b7158e8478b14cd8337f2aa8ee6193-C001-17" ], [ "b7158e8478b14cd8337f2aa8ee6193-C001-26" ], [ "b7158e8478b14cd8337f2aa8ee6193-C001-39" ] ], "cite_sentences": [ "b7158e8478b14cd8337f2aa8ee6193-C001-17", "b7158e8478b14cd8337f2aa8ee6193-C001-26", "b7158e8478b14cd8337f2aa8ee6193-C001-39" ] } } }, "ABC_90962438b8efa0f7a14fd050365310_43": { "x": [ { "sent_id": "90962438b8efa0f7a14fd050365310-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-2", "text": "We introduce a set of 1,000 gold standard parse trees for the British National Corpus (BNC) and perform a series of self-training experiments with Charniak and Johnson's reranking parser and BNC sentences." 
}, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-3", "text": "We show that retraining this parser with a combination of one million BNC parse trees (produced by the same parser) and the original WSJ training data yields improvements of 0.4% on WSJ Section 23 and 1.7% on the new BNC gold standard set." }, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-4", "text": "----------------------------------" }, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-5", "text": "**INTRODUCTION**" }, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-6", "text": "Given the success of statistical parsing models on the Wall Street Journal (WSJ) section of the Penn Treebank (PTB) (Charniak, 2000; Collins, 2003, for example) , there has been a change in focus in recent years towards the problem of replicating this success on genres other than American financial news stories." }, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-7", "text": "The main challenge in solving the parser adaptation problem are the resources required to construct reliable annotated training examples." }, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-8", "text": "A breakthrough has come in the form of research by McClosky et al. (2006a; 2006b ) who show that self-training can be used to improve parser performance when combined with a two-stage reranking parser model (Charniak and Johnson, 2005) ." }, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-9", "text": "Selftraining is the process of training a parser on its own output, and earlier self-training experiments using generative statistical parsers did not yield encouraging results (Steedman et al., 2003) ." }, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-10", "text": "McClosky et al. (2006a; 2006b ) proceed as follows: sentences * Now affiliated to Lalic, Universit\u00e9 Paris 4 La Sorbonne." 
}, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-11", "text": "from the LA Times newspaper are parsed by a firststage generative statistical parser trained on some seed training data (WSJ Sections 2-21) and the nbest parse trees produced by this parser are reranked by a discriminative reranker." }, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-12", "text": "The highest ranked parse trees are added to the training set of the parser and the parser is retrained." }, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-13", "text": "This self-training method gives improved performance, not only on Section 23 of the WSJ (an absolute f-score improvement of 0.8%), but also on test sentences from the Brown corpus (Francis and Ku\u010dera, 1979 ) (an absolute fscore improvement of 2.6%)." }, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-14", "text": "In the experiments of McClosky et al. (2006a; 2006b) , the parse trees used for self-training come from the same domain (American newspaper text) as the parser's original seed training material." }, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-15", "text": "Bacchiani et al. (2006) find that self-training is effective when the parse trees used for self-training (WSJ parse trees) come from a different domain to the seed training data and from the same domain as the test data (WSJ sentences)." }, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-16", "text": "They report a performance boost of 4.2% on WSJ Section 23 for a generative statistical parser trained on Brown seed data when it is self-trained using 200,000 WSJ parse trees." }, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-17", "text": "However, McCloskey et al. (2006b) report a drop in performance for their reranking parser when the experiment is repeated in the opposite direction, i.e. with Brown data for self-training and testing, and WSJ data for seed training." 
}, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-18", "text": "In contrast, we report successful in-domain 1 self-training experiments with the BNC data as self-training and test material, and with the WSJ-trained reranking parser used by McCloskey et al. (2006a; 2006b) ." }, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-19", "text": "We parse the BNC (Burnard, 2000) in its entirety using the reranking parser of Charniak and Johnson (2005) ." }, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-20", "text": "1,000 BNC sentences are manually annotated for constituent structure, resulting in the first gold standard set for this corpus." }, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-21", "text": "The gold standard set is split into a development set of 500 parse trees and a test set of 500 parse trees and used in a series of self-training experiments: Charniak and Johnson's parser is retrained on combinations of WSJ treebank data and its own parses of BNC sentences." }, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-22", "text": "These combinations are tested on the BNC development set and Section 00 of the WSJ." }, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-23", "text": "An optimal combination is chosen which achieves a Parseval labelled bracketing f-score of 91.7% on Section 23 and 85.6% on the BNC gold standard test set." }, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-24", "text": "For Section 23 this is an absolute improvement of 0.4% on the baseline results of this parser, and for the BNC data this is a statistically significant improvement of 1.7%." }, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-25", "text": "----------------------------------" }, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-26", "text": "**THE BNC DATA**" }, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-27", "text": "The BNC is a 100-million-word balanced part-ofspeech-tagged corpus of written and transcribed spoken English." 
}, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-28", "text": "Written text comprises 90% of the BNC: 75% non-fictional and 25% fictional." }, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-29", "text": "To facilitate parsing with a WSJ-trained parser, some reversible transformations were applied to the BNC data, e.g. British English spellings were converted to American English and neutral quotes disambiguated." }, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-30", "text": "The reranking parser of Charniak and Johnson (2005) was used to parse the BNC." }, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-31", "text": "99.8% of the 6 million BNC sentences obtained a parse, with an average parsing speed of 1.4s per sentence." }, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-32", "text": "A gold standard set of 1,000 BNC sentences was constructed by one annotator by correcting the output of the first stage of Charniak and Johnson's reranking parser." }, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-33", "text": "The sentences included in the gold standard were chosen at random from the BNC, subject to the condition that they contain a verb which does not occur in the training sections of the WSJ section of the PTB (Marcus et al., 1993) ." }, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-34", "text": "A decision was made to select sentences for the gold standard set which differ from the sentences in the WSJ training sections, and one way of finding different sentences is to focus on verbs which are not attested in the WSJ Sections 2-21." }, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-35", "text": "It is expected that these gold standard parse trees can be used as training data although they are used only as test and development data in this work." 
}, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-36", "text": "Because they contain verbs which do not occur in the parser's training set, they are likely to represent a hard test for WSJ-trained parsers." }, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-37", "text": "The PTB bracketing guidelines (Bies et al., 1995) and the PTB itself were used as references by the BNC annotator." }, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-38", "text": "Functional tags and traces were not annotated." }, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-39", "text": "The annotator noticed that the PTB parse trees sometimes violate the PTB bracketing guidelines, and in these cases, the annotator chose the analysis set out in the guidelines." }, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-40", "text": "It took approximately 60 hours to build the gold standard set." }, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-41", "text": "----------------------------------" }, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-42", "text": "**SELF-TRAINING EXPERIMENTS**" }, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-43", "text": "Charniak and Johnson's reranking parser (June 2006 version) is evaluated against the BNC gold standard development set." }, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-44", "text": "Labelled precision (LP), recall (LR) and f-score measures 2 for this parser are shown in the first row of Table 1 ." }, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-45", "text": "The f-score of 83.7% is lower than the f-score of 85.2% reported by McClosky et al. (2006b) for the same parser on Brown corpus data." }, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-46", "text": "This difference is reasonable since there is greater domain variation between the WSJ and the BNC than between the WSJ and the Brown corpus, and all BNC gold standard sentences contain verbs not attested in WSJ Sections 2-21." 
}, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-47", "text": "We retrain the first-stage generative statistical parser of Charniak and Johnson using combinations of BNC trees (parsed using the reranking parser) and WSJ treebank trees." }, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-48", "text": "We test the combinations on the BNC gold standard development set and on WSJ Section 00." }, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-49", "text": "Table 1 shows that parser accuracy increases with the size of the in-domain selftraining material." }, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-50", "text": "3 The figures confirm the claim of McClosky et al. (2006a) that self-training with a reranking parsing model is effective for improving parser accuracy in general, and the claim of Gildea (2001) that training on in-domain data is effective for parser adaption." }, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-51", "text": "They confirm that self-training on in-domain data is effective for parser adaptation." }, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-52", "text": "The WSJ Section 00 results suggest that, in order to maintain performance on the seed training domain, it is necessary to combine BNC parse trees Of the self-training combinations with abovebaseline improvements for both development sets, the combination of 1,000K BNC parse trees and Section 2-21 of the WSJ (multiplied by ten) yields the highest improvement for the BNC data, and we present final results with this combination for the BNC gold standard test set and WSJ Section 23." }, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-53", "text": "There is an absolute improvement on the original reranking parser of 1.7% on the BNC gold standard test set and 0.4% on WSJ Section 23." }, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-54", "text": "The improvement on BNC data is statistically significant for both precision and recall (p < 0.0002, p < 0.0002)." 
}, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-55", "text": "The improvement on WSJ Section 23 is statistically significant for precision only (p < 0.003)." }, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-56", "text": "----------------------------------" }, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-57", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-58", "text": "We have introduced a set of 1,000 gold standard parse trees for the BNC." }, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-59", "text": "We have performed selftraining experiments with Charniak and Johnson's reranking parser and sentences from the BNC." }, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-60", "text": "We have shown that retraining this parser with a combination of one million BNC parse trees (produced by the same parser) and the original WSJ training data yields improvements of 0.4% on WSJ Section 23 and 1.7% on the BNC gold standard sentences." }, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-61", "text": "These results indicate that self-training on in-domain data can be used for parser adaptation." }, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-62", "text": "Our BNC gold standard set consists of sentences containing verbs which are not in the WSJ training sections." }, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-63", "text": "We suspect that this makes the gold standard set a hard test for WSJ-trained parsers, and our results are likely to represent a lower bound for WSJ-trained parsers on BNC data." }, { "sent_id": "90962438b8efa0f7a14fd050365310-C001-64", "text": "When used as training data, we predict that the novel verbs in the BNC gold standard set add to the variety of training material, and will further help parser adaptation from the WSJ domain -a matter for further research." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "90962438b8efa0f7a14fd050365310-C001-8" ] ], "cite_sentences": [ "90962438b8efa0f7a14fd050365310-C001-8" ] }, "@USE@": { "gold_contexts": [ [ "90962438b8efa0f7a14fd050365310-C001-19" ], [ "90962438b8efa0f7a14fd050365310-C001-30" ] ], "cite_sentences": [ "90962438b8efa0f7a14fd050365310-C001-19", "90962438b8efa0f7a14fd050365310-C001-30" ] } } }, "ABC_460b820360d51d99b4a7ae3f0755bc_43": { "x": [ { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-2", "text": "SUDOKU's submissions to SemEval Task 13 treats Word Sense Disambiguation and Entity Linking as a deterministic problem that exploits two key attributes of open-class words as constraints -their degree of polysemy and their part of speech." }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-3", "text": "This is an extension and further validation of the results achieved by Manion and Sainudiin (2014)." }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-4", "text": "SUDOKU's three submissions are incremental in the use of the two aforementioned constraints." }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-5", "text": "Run1 has no constraints and disambiguates all lemmas in one pass." }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-6", "text": "Run2 disambiguates lemmas at increasing degrees of polysemy, leaving the most polysemous until last." }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-7", "text": "Run3 is identical to Run2, with the additional constraint of disambiguating all named entities and nouns first before other types of open-class words (verbs, adjectives, and adverbs)." }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-8", "text": "Over all-domains, for English Run2 and Run3 were placed second and third." }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-9", "text": "For Spanish Run2, Run3, and Run1 were placed first, second, and third respectively." 
}, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-10", "text": "For Italian Run1 was placed first with Run2 and Run3 placed second equal." }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-11", "text": "----------------------------------" }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-12", "text": "**INTRODUCTION & RELATED WORK**" }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-13", "text": "Almost a decade ago, Agirre and Edmonds (2007) suggested the promising potential for WSD that could exploit the interdependencies between senses in an interactive manner." }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-14", "text": "In other words, this would be a WSD system which allows the disambiguation of word a to directly influence the consecutive disambiguation of word b. This is analogous to treating WSD as a deterministic problem, much like the Sudoku puzzle in which the final solution is reached by adhering to a set of pre-determined constraints." }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-15", "text": "Conventional approaches to WSD often overlook the potential to exploit sense interdependencies, and simply disambiguate all senses in one pass based on a context window (e.g. a sentence or document)." }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-16", "text": "For this task the author proposes an iterative approach which makes several passes based on a set of constraints." }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-17", "text": "For a more formal distinction between the conventional and iterative approach to WSD, please refer to this paper (Manion and Sainudiin, 2014 Table 1 : Parts of Speech disambiguated (as percentages) for each SemEval Task (denoted by its year)." }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-18", "text": "In-Degree Centrality as implemented in (Manion and Sainudiin, 2014) observes F-Score improvement (F + \u2206F) by applying the iterative approach." 
}, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-19", "text": "The author found in the investigations of his thesis (Manion, 2014) that the iterative approach performed best on the SemEval 2013 Multilingual WSD Task (Navigli et al., 2013) , as opposed to earlier tasks such as SensEval 2004 English All Words WSD Task (Snyder and Palmer, 2004) and the SemEval 2010 All Words WSD task on a Specific Domain (Agirre et al., 2010) ." }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-20", "text": "While these earlier tasks also experienced improvement, F-Scores remained lower overall." }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-21", "text": "Table 1 Depicted above are distributions for each domain and language, detailing the probability (y-axis) of specific parts of speech at increasing degrees of polysemy (x-axis)." }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-22", "text": "These distributions were produced from the gold keys (or synsets) of the test documents by querying BabelNet for the polysemy of each word." }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-23", "text": "Each distribution was normalised with one sense per discourse assumed, therefore duplicate synsets were ignored." }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-24", "text": "Lastly the difference in F-Score between the conventional Run1 and the iterative Run2 and Run3 is listed beside each distribution." }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-25", "text": "Firstly WSD tasks before 2013 generally relied on only a lexicon, such as WordNet (Fellbaum, 1998) or an alternative equivalent, whereas SemEval 2013 Task 12 WSD and this task (Moro and Navigli, 2015) included Entity Linking (EL) using the encyclopaedia Wikipedia via BabelNet (Navigli and Ponzetto, 2012) ." 
}, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-26", "text": "Secondly, as shown by Manion and Sainudiin (2014) with a simple linear regression, the iterative approach increases WSD performance for documents that have a higher degree of document monosemy -the percentage of unique monosemous lemmas in a document." }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-27", "text": "As seen in Figures 1(a) to (i) on the previous page, named entities (or unique rather than common nouns) are more monosemous compared to other parts of speech, especially for more technical domains." }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-28", "text": "Lastly, the SemEval 2013 WSD task differs in that only nouns and named entities required disambiguation." }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-29", "text": "This simplifies the WSD task, as shown in the experiments on local context by Yarowsky (1993) , nouns are best disambiguated by directly adjacent nouns (or modifying adjectives)." }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-30", "text": "Based on these observations, the author hypothesized the following implementations of the iterative approach should perform well." }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-31", "text": "----------------------------------" }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-32", "text": "**SYSTEM DESCRIPTION & IMPLEMENTATION**" }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-33", "text": "Run1 (SUDOKU-1) is the conventional approachno constraints are applied." }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-34", "text": "Formalised in (Manion and Sainudiin, 2014) , this run can act as a baseline to gauge any improvement for Run2 and Run3 that apply the iterative approach." }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-35", "text": "Run2 (SUDOKU-2) has the constraint of words being disambiguated in order of increasing polysemy, leaving the most polysemous to last." 
}, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-36", "text": "Run3 (SUDOKU-3) is an untested and unpublished version of the iterative approach." }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-37", "text": "It includes Run2's constraint plus a second constraint -that all nouns and named entities must be disambiguated before other parts of speech." }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-38", "text": "For each run, a semantic subgraph is constructed from BabelNet (version 2.5.1)." }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-39", "text": "Then for disambiguation the graph centrality measure PageRank (Brin and Page, 1998 ) is used in conjunction with a surfing vector that biases probability mass to certain sense nodes in the semantic subgraph." }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-40", "text": "This idea is taken from Personalised PageRank (PPR) (Agirre and Soroa, 2009) , which applies the method put forward by Haveliwala (2003) to the field of WSD." }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-41", "text": "In the previous SemEval WSD task (Navigli et al., 2013) team UMCC DLSI (Gutierrez et al., 2013) implemented this method and achieved the best performance by biasing probability mass based on SemCor (Miller et al., 1993 ) sense frequencies." }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-42", "text": "As the winning method for this task, PPR was selected to test the iterative approach on." }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-43", "text": "For SUDOKU's implementation to be unsupervised, all runs biased probability mass towards senses from monosemous lemmas." }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-44", "text": "Additionally for Run2 and Run3, once a lemma is disambiguated it is considered to be monosemous." 
}, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-45", "text": "Therefore with each iteration of Run2 and Run3, probability mass is redistributed across the surfing vector to acknowledge these newly appointed monosemous lemmas." }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-46", "text": "All system runs are applied at the document level, across all languages and domains, for all named entities, nouns, verbs, adverbs, and adjectives." }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-47", "text": "Semantic subgraphs are constructed from BabelNet via a Depth First Search (DFS) up to 2 hops in path length." }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-48", "text": "PageRank's damping factor is set to 0.85, with a maximum of 30 iterations 1 ." }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-49", "text": "In order to avoid masking the effect of using the iterative approach, a back-off strategy (see (McCarthy et al., 2004) ) was not used." }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-50", "text": "Multiword units were found by finding lemma sequences that contained at least one noun and at the same time could return a result from BabelNet." }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-51", "text": "Lemma sequences beginning with definite/indefinite articles (e.g. the, a, il, la, and el) were removed as they induced too much noise, given they almost always returned a result from BabelNet (such as a book or movie title)." }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-52", "text": "----------------------------------" }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-53", "text": "**RESULTS, DISCUSSIONS, & CONCLUSIONS**" }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-54", "text": "As seen in Figures 1(a) to (i) on the previous page, the Biomedical and Math & Computers domains include a substantial degree of monosemy, no doubt increased by the monosemous technical terms and named entities present." 
}, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-55", "text": "Given the importance of document monosemy for the iterative approach, it is of no surprise that Run2 and Run3 in most cases performed much better than Run1 for these technical domains." }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-56", "text": "Equally so, Run2 and Run3 were outperformed by Run1 for the less technical Social Issues All Domains domain in which many of the named entities are polysemous rather than monosemous." }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-57", "text": "While the iterative approach achieved reasonably competitive results in English, this success did not translate as well to Spanish and Italian." }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-58", "text": "The Italian Biomedical domain had the highest document monosemy, observable in Figure 1 (g ), yet this did not help the iterative Run2 and Run3." }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-59", "text": "Yet it is worth noting the results of the task paper (Moro and Navigli, 2015) report that SUDOKU Run2 and Run3 achieved very low F-Scores for named entity disambiguation (<28.6) in Spanish and Italian." }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-60", "text": "Given that more than half of the named entities were monosemous in Figure 1 (d) and (g), the WSD system either did not capture them in text or filtered them out during subgraph construction (see BabelNet API)." }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-61", "text": "This underscores the importance of named entities being included in disambiguation tasks." }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-62", "text": "To further support this evidence, while the iterative approach is suited to domain based WSD, recall that the 2010 domain based WSD task in Table 1 also had no tagged named entities (and thus scores were lower than for successive named entity inclusive WSD tasks)." 
}, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-63", "text": "As seen in Table 2 , the iterative approach has a varied effect on different parts of speech." }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-64", "text": "Always improved is the disambiguation of named entities and adverbs." }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-65", "text": "This is also the case for nouns in technical domains (e.g. Biomedical as opposed to Social Issues)." }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-66", "text": "On the other hand the disambiguation of verbs and adjectives suffers under the iterative approach." }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-67", "text": "In hindsight, the iterative approach could be restricted to the parts of speech it is known to improve, while remaining with the conventional approach on others." }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-68", "text": "To the right in Table 3 the author's SUDOKU runs are compared against the team with the most competitive results -LIMSI." }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-69", "text": "The author could not improve on their superior results achieved in English, however for Spanish and Italian the BabelNet First Sense (BFS) baseline was much lower since it often resorted to lexicographic sorting in the absence of WordNet synsets -see (Navigli et al., 2013) ." }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-70", "text": "The author's baseline-independent submissions were unaffected by this, which on reviewing results in (Moro and Navigli, 2015) appears to have helped SUDOKU do best for these languages." }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-71", "text": "Table 3 : F1 scores for each domain/language for SUDOKU and LIMSI." 
}, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-72", "text": "In summary, the inclusion of named entities in disambiguation tasks certainly improves results, as well as the effectiveness of the iterative approach." }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-73", "text": "Furthermore in Table 3 above, the iterative Run3 for the English Biomedical domain is 0.1 short of achieving the best result of 71.3." }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-74", "text": "Investigating exactly which factors contributed to the success of this unsupervised result is a top priority for future work." }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-75", "text": "----------------------------------" }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-76", "text": "**RESOURCES**" }, { "sent_id": "460b820360d51d99b4a7ae3f0755bc-C001-77", "text": "Codebase and resources are at the author's homepage: http://www.stevemanion.com." } ], "y": { "@BACK@": { "gold_contexts": [ [ "460b820360d51d99b4a7ae3f0755bc-C001-25" ], [ "460b820360d51d99b4a7ae3f0755bc-C001-59" ], [ "460b820360d51d99b4a7ae3f0755bc-C001-70" ] ], "cite_sentences": [ "460b820360d51d99b4a7ae3f0755bc-C001-25", "460b820360d51d99b4a7ae3f0755bc-C001-59", "460b820360d51d99b4a7ae3f0755bc-C001-70" ] } } }, "ABC_abe561b75389e026a9f140280f211c_43": { "x": [ { "sent_id": "abe561b75389e026a9f140280f211c-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-2", "text": "Sentiment lexicon is very helpful in dimensional sentiment applications." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-3", "text": "Because of countless Chinese words, developing a method to predict unseen Chinese words is required." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-4", "text": "The proposed method can handle both words and phrases by using an ADVWeight List for word prediction, which in turn improves our performance at phrase level." 
}, { "sent_id": "abe561b75389e026a9f140280f211c-C001-5", "text": "The evaluation results demonstrate that our system is effective in dimensional sentiment analysis for Chinese phrases." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-6", "text": "The Mean Absolute Error (MAE) and Pearson's Correlation Coefficient (PCC) for Valence are 0.723 and 0.835, respectively, and those for Arousal are 0.914 and 0.756, respectively." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-7", "text": "----------------------------------" }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-9", "text": "Due to the vigorous development of social media in recent years, more and more user-generated sentiment data have been shared on the Web." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-10", "text": "It is a useful means to understand the opinion of the masses, which is a major issue for businesses." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-11", "text": "However, they exist in the forms of comments in a live webcast, opinion sites, or social media, and often contain considerable amount of noise." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-12", "text": "Such characteristics pose obstacles to those who intend to collect this type of information efficiently." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-13", "text": "It is the reason why opinion mining has recently become a topic of interest in both academia and business institutions." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-14", "text": "Sentiment analysis is a type of opinion mining where affective states are represented categorically or by multi-dimensional continuous values (Yu et al., 2015) ." 
}, { "sent_id": "abe561b75389e026a9f140280f211c-C001-15", "text": "The categorical approach aims at classifying the sentiment into polarity classes (such as positive, neutral, and negative,) or Ekman's six basic emotions, i.e., anger, happiness, fear, sadness, disgust, and surprise (Ekman, 1992 )." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-16", "text": "This approach is extensively studied because it can provide a desirable outcome, which is an overall evaluation of the sentiment in the material that is being analyzed." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-17", "text": "For instance, a popular form of media in recent years is live webcasting." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-18", "text": "This kind of applications usually provide viewers with the ability to comment immediately while the stream is live." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-19", "text": "Categorical sentiment analysis can immediately classify each response as either positive or negative, thus helping the host to quickly summarize every period of their broadcast." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-20", "text": "On the other hand, the dimensional approach represents affective states as continuous numerical values in multiple dimensions, such as valencearousal space (Markus and Kitayama, 1991) , as shown in Fig. 1 gree of pleasant and unpleasant (i.e., positive and negative) feelings, while the arousal represents the degree of excitement." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-21", "text": "According to the twodimensional representation, any affective state can be represented as a point in the valence-arousal space by determining the degrees of valence and arousal of given words (Wei et al., 2011; Yu et al., 2015) or texts (Kim et al., 2010) ." 
}, { "sent_id": "abe561b75389e026a9f140280f211c-C001-22", "text": "Dimen-sional sentiment analysis is an increasingly active research field with potential applications including antisocial behavior detection (Munezero et al., 2011) and mood analysis (De Choudhury et al., 2012) ." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-23", "text": "In light of this, the objective of the Dimensional Sentiment Analysis for Chinese Words (DSAW) shared task at the 21th International Conference on Asian Language Processing is to automatically acquire the valence-arousal ratings of Chinese affective words and phrases for compiling Chinese valence-arousal lexicons." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-24", "text": "The expected output of this task is to predict a real-valued score from 1 to 9 for both valence and arousal dimensions of the given 750 test words and phrases." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-25", "text": "The score indicates the degree from most negative to most positive for valence, and from most calm to most excited for arousal." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-26", "text": "The performance is evaluated by calculating mean absolute error and Pearson correlation coefficient between predicted and human-annotated reference scores for two dimensions separately." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-27", "text": "Participants are required to predict a valence-arousal score for each word, and each phrase." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-28", "text": "In order to tackle this problem at the word level, we propose a hybrid approach that integrates valence extension and word embedding-based model with cos similarity to predict valence dimensions." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-29", "text": "Word embedding-based model with SVM and regression to predict arousal dimensions." 
}, { "sent_id": "abe561b75389e026a9f140280f211c-C001-30", "text": "At phrase level, we use our ADVWeight List extracted from training sets and our word level method to predict both valence and arousal." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-31", "text": "The remainder of this paper is organized as follows." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-32", "text": "The proposed method is in Section 2." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-33", "text": "In Section 3, we evaluate performance and compare it with other methods." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-34", "text": "Finally, some conclusions are listed in Section 4." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-35", "text": "----------------------------------" }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-36", "text": "**METHOD**" }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-37", "text": "This study takes 2,802 single words in CVAW 2.0 (Yu et al., 2016) and 2,250 multi-word phrases, both annotated with valence-arousal ratings, as training material." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-38", "text": "At word level, we use EHowNet (Chen et al., 2005) , a system that is designed for the purpose of automatic semantic composition and decomposition, to extract synonyms of the words from CVAW 2.0, and expand it to 19,611 words with valence-arousal ratings, called WVA." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-39", "text": "Fig. 2 illustrates the proposed framework." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-40", "text": "In order to cope with the problem of unknown words, we separate words in WVA into 4,184 characters with valence-arousal ratings, called CVA." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-41", "text": "The valence-arousal score of the unknown word can be obtained by averaging the matched CVA." 
}, { "sent_id": "abe561b75389e026a9f140280f211c-C001-42", "text": "Moreover, previous research suggested that it is possible to improve the performance by aggregating the results of a number of valence-arousal methods (Yu et al., 2015) ." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-43", "text": "Thus, we use two sets of methods for the prediction of valence: (1) prediction based on WVA and CVA, and (2) a kNN valence prediction method." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-44", "text": "The results of these two methods are averaged as the final valence score." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-45", "text": "First, we describe the prediction of valence values." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-46", "text": "As shown in Fig. 3 , the \"\u5b8c\u6210\" of the test data exists in the WVA, so we can directly obtain its valence value of 7.0." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-47", "text": "However, another word \"\u901a\u77e5\" does not exist in the WVA, so we search in CVA and calculate a valence value of 5.6." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-48", "text": "Additionally, we propose another prediction method of the valence value, as shown in Fig. 4 , based on kNN." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-49", "text": "We begin by computing the similarity between words using word embeddings (Mikolov et al., 2013) ." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-50", "text": "Then, 10 most similar words are selected and their scores calculated by Eq. 1." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-51", "text": "As for the arousal prediction, we propose two methods: (1) linear regression, and (2) support vector regression (SVR) which averages linear regression and SVM predictions as the final arousal score." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-52", "text": "As shown in Fig. 
6, this study considers the linear regression equation in each range according to the valence-arousal value of words in WVA." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-53", "text": "According to our observation of the data, valence values are generally distributed in the range of 3-7." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-54", "text": "In order to boost learning on different ranges of data, we distribute them into two categories." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-55", "text": "For example, the word \"\u6bba\u5bb3\" has a valence value of 1.6." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-56", "text": "By our design, it will be distributed to the categories with valence values of 1 and 2." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-57", "text": "When the linear regression training is finished, we can predict the corresponding arousal score according to the valence value of the word." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-58", "text": "As for the SVR-based approach, we first train 300-dimensional word embeddings for all words in WVA using an online Chinese news corpus 1 ." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-59", "text": "As shown in Fig. 6, L is the label of the sample, and Dim represents the dimension of the features." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-60", "text": "We then predict the value of arousal through SVR." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-61", "text": "Finally, we aggregate the arousal scores predicted by these two methods by taking an average." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-62", "text": "We observe that the values obtained by linear regression are convergent, while the SVR values are more divergent." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-63", "text": "Thus, averaging the two values can overcome the shortcomings of both methods." 
}, { "sent_id": "abe561b75389e026a9f140280f211c-C001-64", "text": "At phrase level, we first experiment with using the proposed word-level model to predict the va- lence and arousal values." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-65", "text": "Unfortunately, the results are not satisfactory." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-66", "text": "We then explore the possibility to incorporate linguistic knowledge into the model." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-67", "text": "Structurally, phrases can be split into the adverb (ADV) and the adjective (ADJ)." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-68", "text": "An adverb is a word that modifies an adjective in a phrase." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-69", "text": "For instance, \"\u958b \u5fc3\" (happy) with a preceding \"\u975e\u5e38 (very)\" becomes \"\u975e\u5e38\u958b\u5fc3 (very happy),\" which we consider has an increased degree of happiness." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-70", "text": "Following this line of thought, we explore ADVs as weighting factors for ADJs." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-71", "text": "The ADVList and ADVWeight List are extracted from 2,250 multi-word phrases." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-72", "text": "We employ them to split phrases into ADV and ADJ parts." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-73", "text": "Subsequently, the valence and arousal values of an ADJ is determined by the word-level prediction model, while those of the ADV is used as an offset." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-74", "text": "An illustration of our phrase-level prediction process is in Fig. 7 ." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-75", "text": "As shown in Fig. 
7, in order to obtain the weight of the ADV word \"\u6700,\" we use the ADVList to split phrases that contain \"\u6700\" into the format \"[ADV] [ADJ].\" Then, our word prediction" }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-76", "text": "model is used to obtain the valence (VA) value of the ADJ part." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-77", "text": "This value is deducted from the VA of the corresponding phrase, and then the remainders are averaged to become the final ADV weight of the word \"\u6700\"." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-78", "text": "That is, ADVWeight(\u6700) = mean(VA_Phrase \u2212 VA_ADJ)." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-79", "text": "Most importantly, we hypothesize that ADVs have different effects on phrases with different ADJs, namely, those with valence values \u2265 5.0 and < 5.0." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-80", "text": "Thus, we have to consider them separately." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-81", "text": "In the end, there will be four weights for the ADV \"\u6700\": Positive valence offset, Positive arousal offset, Negative valence offset, and Negative arousal offset." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-82", "text": "----------------------------------" }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-83", "text": "**EXPERIMENTS**" }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-84", "text": "We utilize the test data in DSAP TestW Input, which contains 750 samples, for performance evaluation." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-85", "text": "The metrics used are mean absolute error (MAE) and Pearson correlation coefficient (PCC; P < 0.01)." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-86", "text": "In this shared task, valence and arousal values are evaluated separately." 
}, { "sent_id": "abe561b75389e026a9f140280f211c-C001-87", "text": "Table 1 shows the results of valence's performance evaluation, V WVA is the result of WVA alone, and V CVA is the result of CVA." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-88", "text": "V WCVA is the combination of WVA and CVA of the forecast results." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-89", "text": "V kNN is a valence prediction method based on kNN." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-90", "text": "V WVAE is in WVA through word embeddings to find 10 neighbors, and take the average." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-91", "text": "Through the comparison of performance, we found that V WCVA and V WVAE obtained good results with MAEs being 0.527 and 0.508, respectively, and the PCCs are 0.816 and 0.824." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-92", "text": "These results suggest that they are highly relevant, so we try to combine the two methods (namely, V mixed .) The final MAE and PCC were 0.496 and 0.845, which is the best-performing method." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-93", "text": "Table 2 shows the performance of arousal for different regression methods." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-94", "text": "R Polyfit and R Linear use Polyfit Regression and Linear Regression to predict arousal, while R WVA is based on linear regression." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-95", "text": "In addition, S CVAW and S WVA use non-corpusated SVR models." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-96", "text": "R WVA achieved an outstanding performance of an MAE of 0.939, but was the worst performer in PCC; S WVA was slightly inferior to R WVA in MAE, but was superior in PCC with a value of 0.427." 
}, { "sent_id": "abe561b75389e026a9f140280f211c-C001-97", "text": "Notably, the values predicted by S WVA are evenly distributed and are more similar to the actual answers, so we try to combine the two methods (RS) to achieve a performance of 0.858 and 0.474 on MAE and PCC, achieving the most outstanding results." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-98", "text": "Table 3 lists the averaged word-level score and rank of runs 1 and 2 from the participating teams." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-99", "text": "The Rank column in Table 3 represents the averaged rank of each team." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-100", "text": "The best-performing team, AL I NLP, obtained 0.546 in V MAE , 0.8915 in V PCC , 0.855 in A MAE , and 0.6725 in A PCC ." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-101", "text": "Our method (CIAL) only rank in the middle." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-102", "text": "Table 4 lists the averaged phrase-level score and rank of runs 1 and 2 from the participating teams." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-103", "text": "The Rank column in Table 4 represents the averaged rank of each team." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-104", "text": "The bestperforming team, THU NGN, obtained 0.347 in" }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-105", "text": "----------------------------------" }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-106", "text": "**CONCLUSION**" }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-107", "text": "The system we developed for DSAW integrates E-HowNet and word embeddings with K-Nearest Neighbors in valence dimension." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-108", "text": "Support vector regression and linear regression in arousal dimensions." 
}, { "sent_id": "abe561b75389e026a9f140280f211c-C001-109", "text": "The evaluation results show that the system performance outperforms previous work, but only achieves mediocre performance in this competition." }, { "sent_id": "abe561b75389e026a9f140280f211c-C001-110", "text": "Since the method we used for arousal prediction is still very straightforward, addressing the improvement of its performance should be our target for future research of dimensional sentiment analysis." } ], "y": { "@BACK@": { "gold_contexts": [ [ "abe561b75389e026a9f140280f211c-C001-14" ], [ "abe561b75389e026a9f140280f211c-C001-21" ], [ "abe561b75389e026a9f140280f211c-C001-42" ] ], "cite_sentences": [ "abe561b75389e026a9f140280f211c-C001-14", "abe561b75389e026a9f140280f211c-C001-21", "abe561b75389e026a9f140280f211c-C001-42" ] }, "@MOT@": { "gold_contexts": [ [ "abe561b75389e026a9f140280f211c-C001-42" ] ], "cite_sentences": [ "abe561b75389e026a9f140280f211c-C001-42" ] } } }, "ABC_7936967a70c44890f3a61f6625c59d_43": { "x": [ { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-93", "text": "**RESULTS AND DISCUSSION**" }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-116", "text": "**CONCLUSION**" }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-2", "text": "The task of building automatic agents that can negotiate with humans in free-form natural language has gained recent interest in the literature." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-3", "text": "Although there have been initial attempts, combining linguistic understanding with strategy effectively still remains a challenge." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-4", "text": "Towards this end, we aim to understand the role of natural language in negotiations from a datadriven perspective by attempting to predict a negotiation's outcome, well before the negotiation is complete." 
}, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-5", "text": "Building on the recent advancements in pre-trained language encoders, our model is able to predict correctly within 10% for more than 70% of the cases, by looking at just 60% of the negotiation." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-6", "text": "These results suggest that rather than just being a way to realize a negotiation, natural language should be incorporated in the negotiation planning as well." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-7", "text": "Such a framework can be directly used to get feedback for training an automatically negotiating agent." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-8", "text": "----------------------------------" }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-10", "text": "Negotiations, either between individuals or entities, are ubiquitous in everyday human interactions ranging from sales to legal proceedings." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-11", "text": "Being a good negotiator is a complex skill, requiring the ability to understand the partner's motives, ability to reason and to communicate effectively, making it a challenging task for an automated system." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-12", "text": "While research in building automatically negotiating agents has primarily focused on agent-agent negotiations (Williams et al., 2012; Lin et al., 2014) , there is a recent interest in agent-human negotiations (Gratch et al., 2015) as well." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-13", "text": "Such agents may act as mediators or can be helpful for pedagogical purposes (Johnson et al., 2019) ." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-14", "text": "Efforts in agent-human negotiations involving free-form natural language as a means of communication are rather sparse." 
}, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-15", "text": "Researchers (He et al., 2018) recently studied natural language negotiations in buyer-seller bargaining setup, which is comparatively less restricted than previously studied game environments (Asher et al., 2016; Lewis et al., 2017) ." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-16", "text": "Lack of a well-defined structure in such negotiations allows humans or agents to express themselves more freely, which better emulates a realistic scenario." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-17", "text": "Interestingly, this also provides an exciting research opportunity: how can an agent leverage the behavioral cues in natural language to direct its negotiation strategies? Understanding the impact of natural language on negotiation outcomes through a data-driven neural framework is the primary objective of this work." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-18", "text": "We focus on buyer-seller negotiations (He et al., 2018) where two individuals negotiate the price of a given product." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-19", "text": "Leveraging the recent advancements (Vaswani et al., 2017; Devlin et al., 2019) in pre-trained language encoders, we attempt to predict negotiation outcomes early on in the conversation, in a completely data-driven manner ( Figure 1 )." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-20", "text": "Early prediction of outcomes is essential for effective planning of an automatically negotiating agent." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-21", "text": "Although there have been attempts to gain insights into negotiations (Adair et al., 2001; Koit, 2018) , to the best of our knowledge, we are the first to study early natural language cues through a datadriven neural system (Section 3)." 
}, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-22", "text": "Our evaluations show that natural language allows the models to make better predictions by looking at only a fraction of the negotiation." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-23", "text": "Rather than just realizing the strategy in natural language, our empirical results suggest that language can be crucial in the planning as well." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-24", "text": "We provide a sample negotiation from the test set (He et al., 2018 ) along with our model predictions in Table 1 ." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-25", "text": "----------------------------------" }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-26", "text": "**PROBLEM SETUP**" }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-27", "text": "We study human-human negotiations in the buyerseller bargaining scenario, which has been a key research area in the literature (Williams et al., 2012) ." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-28", "text": "In this section, we first describe our problem setup and key terminologies by discussing the dataset used." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-29", "text": "Later, we formalize our problem definition." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-30", "text": "Dataset: For our explorations, we use the Craigslist Bargaining dataset (CB) introduced by He et al. (2018) ." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-31", "text": "Instead of focusing on the previously studied game environments (Asher et al., 2016; Lewis et al., 2017) , the dataset considers a more realistic setup: negotiating the price of products listed on Craigslist 1 ." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-32", "text": "The dataset consists of 6682 dialogues between a buyer and a seller who converse in natural language to negotiate the price of a given product (sample in Table 1 )." 
}, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-33", "text": "In total, 1402 product ad postings were scraped from Craigslist, belonging to six categories: phones, bikes, housing, furniture, car and electronics." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-34", "text": "Each ad posting contains details such as Product Title, Category Type and a Listing Price." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-35", "text": "Moreover, a secret target price is also pre-decided for the buyer." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-36", "text": "The final price after the agreement is called the Agreed Price, which we aim to predict." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-37", "text": "Defining the problem: Say we are provided with a product scenario S, a tuple: (Category, Title, Listing Price, Target Price) 2 . Define the interactions between a buyer and seller using a sequence of n events E n :< e 1 , e 2 , ..., e n >, where e i occurs before e j iff i < j. Event e i is also a tuple: (Initiator, Type, Data)." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-38", "text": "Initiator is either the Buyer or Seller, Type can be one of (message, offer, accept, reject or quit) and Data consists of either the corresponding natural language dialogue, offer price or can be empty." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-39", "text": "Nearly 80% of events in CB dataset are of type 'message', each consisting a textual message as Data." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-40", "text": "An offer is usually made and accepted at the end of each negotiation." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-41", "text": "Since the offers directly contain the agreed price (which we want to predict), we only consider 'message' events in our models." 
}, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-42", "text": "Given the scenario S and first n events E n , our problem is then to learn the function f n : A = f n (S, E n ) where A refers to the final agreed price between the two negotiating parties." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-43", "text": "----------------------------------" }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-44", "text": "**APPROACH**" }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-45", "text": "Pre-trained language models, such as BERT (Vaswani et al., 2017; Devlin et al., 2019 ) have recently gained huge success on a wide range of NLP tasks." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-68", "text": "We also consider the mean of Listing and Target price (TP+LP/2) as another baseline." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-46", "text": "However, since our framework deals with various auxiliary pieces (category, price, etc.), we cannot directly leverage these language models, which have only been trained on natural language inputs." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-47", "text": "Instead of relying on additional representations along with BERT outputs, we propose a simple, yet effective way to incorporate the auxiliary information into the same embedding space." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-48", "text": "Our model hierarchically builds a representation for the given negotiation to finally predict the agreed price." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-49", "text": "We present our complete architecture in Figure 1 ." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-50", "text": "Encoding the input: In order to effectively capture the natural language dialogue and the associated auxiliary information, we make use of predefined sentence templates." 
}, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-51", "text": "Table 2 shows how we represent the category, target price and the product title in natural language sentences." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-52", "text": "These sentences are concatenated to form our Scenario S. Moving ahead in a similar manner, we define templates to capture the negotiator identity (buyer/seller) and any message which is conveyed." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-53", "text": "As shown in Figure 1 , the scenario S and the events are separated with the usage of [SEP] tokens." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-54", "text": "Following Liu and Lapata (2019) , who use BERT for extractive text summarization, we add a [CLS] token at the beginning of each segment." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-55", "text": "We also alternate between a sequence of 0s and 1s for segment embeddings to differentiate between the scenario and the events." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-56", "text": "Architecture and Learning: BERT representation for each [CLS] token is a contextualized encoding for the associated word sequence after it." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-57", "text": "In order to further capture the sequential nature of negotiation events, we pass these [CLS] representations through Gated-Recurrent Units (GRU)." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-58", "text": "Recurrent Networks have been shown to be useful along with Transformer architectures ." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-59", "text": "Finally, a feed-forward network is applied to predict the agreed price for the negotiation." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-60", "text": "The model is end-to-end trained and fine-tuned using the Mean Squared Error (MSE) loss between the predicted price and the ground-truth." 
}, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-61", "text": "----------------------------------" }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-62", "text": "**EXPERIMENTAL DETAILS**" }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-63", "text": "We perform experiments on the CB dataset to primarily answer two questions: 1) Is it feasible to predict negotiation outcomes without observing the complete conversation between the buyer and seller? 2) To what extent does the natural language incorporation help in the prediction?" }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-64", "text": "In order to answer these questions, we compare our model empirically with a number of baseline methods." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-65", "text": "This section presents the methods we compare to, the training setup and the evaluation metrics." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-66", "text": "Methods: The first baseline is the Listing Price (LP) where the model ignores the negotiation and returns the listing price of the product." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-67", "text": "Similarly, we use Target Price (TP), where the model just returns the target price for the buyer." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-92", "text": "----------------------------------" }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-69", "text": "Although trivial, these baselines help in benchmarking our results and also show good performance in some cases." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-70", "text": "Next, we build another baseline which completely ignores the natural language incorporation." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-71", "text": "In this case, the model only sees a sequence of prices shared across the messages in the negotiation." 
}, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-72", "text": "We keep the input format the same as our model and all the parameters are randomly initialized to remove learning from natural language." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-73", "text": "We refer to this model as Prices-only." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-74", "text": "We compare two variants for BERT-based models." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-75", "text": "First, for the BERT method, we keep only the first [CLS] token in the input and then train the model with fine-tuning using a single feedforward network on top of the [CLS] representation." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-76", "text": "Secondly, we call our complete approach as BERT+GRU, where we use a recurrent network with BERT fine-tuning, as depicted in Figure 1 ." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-77", "text": "Training Details: Given the multiple segments in our model input and small data size, we use BERTbase (Devlin et al., 2019) , having output dimension of 768." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-78", "text": "To tackle the variance in product prices across different categories, all prices in the inputs and outputs were normalized by the listing price." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-79", "text": "The predictions were unnormalized before final evaluations." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-80", "text": "Further, we only considered the negotiations where an agreement was reached." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-81", "text": "These were the instances for which ground truth was available (\u223c 75% of the data)." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-82", "text": "We use a two-layer GRU with a dropout of 0.1 and 50 hidden units." 
}, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-83", "text": "The models were trained for a maximum of 5000 iterations, with AdamW optimizer (Loshchilov and Hutter, 2018) , a learning rate of 2x10 \u2212 5 and a batch size of 4." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-84", "text": "We used a linear warmup schedule for the first 0.1 fraction of the steps." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-85", "text": "All the hyper-parameters were optimized on the provided development set." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-86", "text": "Evaluation Metrics: We study the variants of the same model by training with different proportions of the negotiation seen, namely, f \u2208 {0.0, 0.2, 0.4, 0.6, 0.8, 1.0}. We compare the models on two evaluation metrics: MAE: Mean Absolute Error between the predicted and ground-truth agreed prices along with Accuracy\u00b1k: the percentage of cases where the predicted price lies Target (TP) Listing (LP) (TP+LP)/2 Prices-o Figure 2 : Performance for various approaches by varying the fraction of events seen by the model." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-87", "text": "We present scores on MAE, Accuracy\u00b15 and Accuracy\u00b110 from left to right for the complete test set." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-88", "text": "Overall, BERT-based approaches incorporating natural language cues almost always beat the baselines not using these cues." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-89", "text": "By looking at 60% of the negotiation, the model can predict correctly within 10% for more than 70% of the cases." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-90", "text": "within k percent of the ground-truth." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-91", "text": "We use k = 5 and k = 10 in our experiments." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-94", "text": "We present our results in Figure 2 ." 
}, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-95", "text": "We also show Accuracy\u00b110 for different product categories in the Appendix." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-96", "text": "First, Target Price (TP) and (TP+LP)/2 prove to be strong baselines, with the latter achieving 61.07% Accuracy\u00b110." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-97", "text": "This performance is also attested by relatively strong numbers on the other metrics as well." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-98", "text": "Prices-only, which does not incorporate any knowledge from natural language, fails to beat the average baseline even with 60% of the negotiation history." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-99", "text": "This can be attributed to the observation that in many negotiations, before discussing the price, buyers tend to get more information about the product by exchanging messages: what is the condition of the product, how old it is, is there an urgency for any of the buyer/seller and so on." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-100", "text": "Incorporating natural language in both the scenario and event messages paves the way to leverage such cues and make better predictions early on in the conversation, as depicted in the plots." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-101", "text": "Both BERT and BERT-GRU consistently perform well on the complete test set." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-102", "text": "There is no clear winner, although using a recurrent network proves to be more helpful in the early stages of the negotiation." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-103", "text": "Note that BERT method still employs multiple [SEP] tokens along with alternating segment embeddings (Section 3)." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-104", "text": "Without this usage, the fine-tuning pipeline proves to be inadequate." 
}, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-105", "text": "Overall, BERT-GRU achieves 67.08% Accuracy\u00b110 with just the product scenario, reaching to 71.16% with 60% of the messages and crosses 90% as more information about the final price is revealed." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-106", "text": "Paired Bootstrap Resampling (Koehn, 2004) with 10, 000 bootstraps shows that for a given f , BERT-GRU is better than its Prices-only counterpart with 95% statistical significance." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-107", "text": "The prices discussed during the negotiation still play a crucial role in making the predictions." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-108", "text": "In fact, in only 65% of the negotiations, the first price is quoted within the first 0.4 fraction of the events." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-109", "text": "This is visible in higher performance as more events are seen after this point." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-110", "text": "This number is lower than average for Housing, Bike and Car, resulting in relative better performance of Priceonly model for these categories over others." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-111", "text": "The models also show evidence of capturing buyer interest." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-112", "text": "By constructing artificial negotiations, we observe that the model predictions at f =0.2 increase when the buyer shows more interest in the product, indicating more willingness to pay." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-113", "text": "With the capability to incorporate cues from natural language, such a framework can be used in the future to get negotiation feedback, in order to guide the planning of a negotiating agent." 
}, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-114", "text": "This can be a viable middleground between following the average human behavior through supervised learning or exploring the wild by optimizing on rewards using reinforcement learning (Lewis et al., 2017; He et al., 2018) ." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-115", "text": "----------------------------------" }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-117", "text": "We presented a framework to attempt early predictions of the agreed product prices in buyer-seller negotiations." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-118", "text": "We construct sentence templates to encode the product scenario, exchanged messages and associated auxiliary information into the same hidden space." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-119", "text": "By combining a recurrent network and the pre-trained BERT encoder, our model leverages natural language cues in the exchanged messages to predict the negotiation outcomes early on in the conversation." }, { "sent_id": "7936967a70c44890f3a61f6625c59d-C001-120", "text": "With this capability, such a framework can be used in a feedback mechanism to guide the planning of a negotiating agent." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "7936967a70c44890f3a61f6625c59d-C001-15" ] ], "cite_sentences": [ "7936967a70c44890f3a61f6625c59d-C001-15" ] }, "@DIF@": { "gold_contexts": [ [ "7936967a70c44890f3a61f6625c59d-C001-31" ] ], "cite_sentences": [ "7936967a70c44890f3a61f6625c59d-C001-31" ] }, "@USE@": { "gold_contexts": [ [ "7936967a70c44890f3a61f6625c59d-C001-114" ] ], "cite_sentences": [ "7936967a70c44890f3a61f6625c59d-C001-114" ] }, "@FUT@": { "gold_contexts": [ [ "7936967a70c44890f3a61f6625c59d-C001-113", "7936967a70c44890f3a61f6625c59d-C001-114" ] ], "cite_sentences": [ "7936967a70c44890f3a61f6625c59d-C001-114" ] } } }, "ABC_9ac581130218d6c68ca785e3d5ba99_43": { "x": [ { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-2", "text": "SUDOKU's submissions to SemEval Task 13 treats Word Sense Disambiguation and Entity Linking as a deterministic problem that exploits two key attributes of open-class words as constraints -their degree of polysemy and their part of speech." }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-3", "text": "This is an extension and further validation of the results achieved by Manion and Sainudiin (2014)." }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-4", "text": "SUDOKU's three submissions are incremental in the use of the two aforementioned constraints." }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-5", "text": "Run1 has no constraints and disambiguates all lemmas in one pass." }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-6", "text": "Run2 disambiguates lemmas at increasing degrees of polysemy, leaving the most polysemous until last." }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-7", "text": "Run3 is identical to Run2, with the additional constraint of disambiguating all named entities and nouns first before other types of open-class words (verbs, adjectives, and adverbs)." 
}, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-8", "text": "Over all-domains, for English Run2 and Run3 were placed second and third." }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-9", "text": "For Spanish Run2, Run3, and Run1 were placed first, second, and third respectively." }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-10", "text": "For Italian Run1 was placed first with Run2 and Run3 placed second equal." }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-11", "text": "----------------------------------" }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-12", "text": "**INTRODUCTION & RELATED WORK**" }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-13", "text": "Almost a decade ago, Agirre and Edmonds (2007) suggested the promising potential for WSD that could exploit the interdependencies between senses in an interactive manner." }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-49", "text": "In order to avoid masking the effect of using the iterative approach, a back-off strategy (see (McCarthy et al., 2004) ) was not used." }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-50", "text": "Multiword units were found by finding lemma sequences that contained at least one noun and at the same time could return a result from BabelNet." }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-51", "text": "Lemma sequences beginning with definite/indefinite articles (e.g. the, a, il, la, and el) were removed as they induced too much noise, given they almost always returned a result from BabelNet (such as a book or movie title)." 
}, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-52", "text": "----------------------------------" }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-53", "text": "**RESULTS, DISCUSSIONS, & CONCLUSIONS**" }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-54", "text": "As seen in Figures 1(a) to (i) on the previous page, the Biomedical and Math & Computers domains include a substantial degree of monosemy, no doubt increased by the monosemous technical terms and named entities present." }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-55", "text": "Given the importance of document monosemy for the iterative approach, it is of no surprise that Run2 and Run3 in most cases performed much better than Run1 for these technical domains." }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-56", "text": "Equally so, Run2 and Run3 were outperformed by Run1 for the less technical Social Issues All Domains domain in which many of the named entities are polysemous rather than monosemous." }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-57", "text": "While the iterative approach achieved reasonably competitive results in English, this success did not translate as well to Spanish and Italian." }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-58", "text": "The Italian Biomedical domain had the highest document monosemy, observable in Figure 1 (g ), yet this did not help the iterative Run2 and Run3." }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-14", "text": "In other words, this would be a WSD system which allows the disambiguation of word a to directly influence the consecutive disambiguation of word b. This is analogous to treating WSD as a deterministic problem, much like the Sudoku puzzle in which the final solution is reached by adhering to a set of pre-determined constraints." 
}, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-15", "text": "Conventional approaches to WSD often overlook the potential to exploit sense interdependencies, and simply disambiguate all senses in one pass based on a context window (e.g. a sentence or document)." }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-16", "text": "For this task the author proposes an iterative approach which makes several passes based on a set of constraints." }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-17", "text": "For a more formal distinction between the conventional and iterative approach to WSD, please refer to this paper (Manion and Sainudiin, 2014 Table 1 : Parts of Speech disambiguated (as percentages) for each SemEval Task (denoted by its year)." }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-18", "text": "In-Degree Centrality as implemented in (Manion and Sainudiin, 2014) observes F-Score improvement (F + \u2206F) by applying the iterative approach." }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-19", "text": "The author found in the investigations of his thesis (Manion, 2014) that the iterative approach performed best on the SemEval 2013 Multilingual WSD Task (Navigli et al., 2013) , as opposed to earlier tasks such as SensEval 2004 English All Words WSD Task (Snyder and Palmer, 2004) and the SemEval 2010 All Words WSD task on a Specific Domain (Agirre et al., 2010) ." }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-20", "text": "While these earlier tasks also experienced improvement, F-Scores remained lower overall." }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-21", "text": "Table 1 Depicted above are distributions for each domain and language, detailing the probability (y-axis) of specific parts of speech at increasing degrees of polysemy (x-axis)." 
}, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-22", "text": "These distributions were produced from the gold keys (or synsets) of the test documents by querying BabelNet for the polysemy of each word." }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-23", "text": "Each distribution was normalised with one sense per discourse assumed, therefore duplicate synsets were ignored." }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-24", "text": "Lastly the difference in F-Score between the conventional Run1 and the iterative Run2 and Run3 is listed beside each distribution." }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-25", "text": "Firstly WSD tasks before 2013 generally relied on only a lexicon, such as WordNet (Fellbaum, 1998) or an alternative equivalent, whereas SemEval 2013 Task 12 WSD and this task (Moro and Navigli, 2015) included Entity Linking (EL) using the encyclopaedia Wikipedia via BabelNet (Navigli and Ponzetto, 2012) ." }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-26", "text": "Secondly, as shown by Manion and Sainudiin (2014) with a simple linear regression, the iterative approach increases WSD performance for documents that have a higher degree of document monosemy -the percentage of unique monosemous lemmas in a document." }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-27", "text": "As seen in Figures 1(a) to (i) on the previous page, named entities (or unique rather than common nouns) are more monosemous compared to other parts of speech, especially for more technical domains." }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-28", "text": "Lastly, the SemEval 2013 WSD task differs in that only nouns and named entities required disambiguation." }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-29", "text": "This simplifies the WSD task, as shown in the experiments on local context by Yarowsky (1993) , nouns are best disambiguated by directly adjacent nouns (or modifying adjectives)." 
}, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-30", "text": "Based on these observations, the author hypothesized the following implementations of the iterative approach should perform well." }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-31", "text": "----------------------------------" }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-32", "text": "**SYSTEM DESCRIPTION & IMPLEMENTATION**" }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-33", "text": "Run1 (SUDOKU-1) is the conventional approachno constraints are applied." }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-34", "text": "Formalised in (Manion and Sainudiin, 2014) , this run can act as a baseline to gauge any improvement for Run2 and Run3 that apply the iterative approach." }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-35", "text": "Run2 (SUDOKU-2) has the constraint of words being disambiguated in order of increasing polysemy, leaving the most polysemous to last." }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-36", "text": "Run3 (SUDOKU-3) is an untested and unpublished version of the iterative approach." }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-37", "text": "It includes Run2's constraint plus a second constraint -that all nouns and named entities must be disambiguated before other parts of speech." }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-38", "text": "For each run, a semantic subgraph is constructed from BabelNet (version 2.5.1)." }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-39", "text": "Then for disambiguation the graph centrality measure PageRank (Brin and Page, 1998 ) is used in conjunction with a surfing vector that biases probability mass to certain sense nodes in the semantic subgraph." }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-40", "text": "This idea is taken from Personalised PageRank (PPR) (Agirre and Soroa, 2009) , which applies the method put forward by Haveliwala (2003) to the field of WSD." 
}, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-41", "text": "In the previous SemEval WSD task (Navigli et al., 2013) team UMCC DLSI (Gutierrez et al., 2013) implemented this method and achieved the best performance by biasing probability mass based on SemCor (Miller et al., 1993 ) sense frequencies." }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-42", "text": "As the winning method for this task, PPR was selected to test the iterative approach on." }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-43", "text": "For SUDOKU's implementation to be unsupervised, all runs biased probability mass towards senses from monosemous lemmas." }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-44", "text": "Additionally for Run2 and Run3, once a lemma is disambiguated it is considered to be monosemous." }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-45", "text": "Therefore with each iteration of Run2 and Run3, probability mass is redistributed across the surfing vector to acknowledge these newly appointed monosemous lemmas." }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-46", "text": "All system runs are applied at the document level, across all languages and domains, for all named entities, nouns, verbs, adverbs, and adjectives." }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-47", "text": "Semantic subgraphs are constructed from BabelNet via a Depth First Search (DFS) up to 2 hops in path length." }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-48", "text": "PageRank's damping factor is set to 0.85, with a maximum of 30 iterations 1 ." }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-59", "text": "Yet it is worth noting the results of the task paper (Moro and Navigli, 2015) report that SUDOKU Run2 and Run3 achieved very low F-Scores for named entity disambiguation (<28.6) in Spanish and Italian." 
}, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-60", "text": "Given that more than half of the named entities were monosemous in Figure 1 (d) and (g), the WSD system either did not capture them in text or filtered them out during subgraph construction (see BabelNet API)." }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-61", "text": "This underscores the importance of named entities being included in disambiguation tasks." }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-62", "text": "To further support this evidence, while the iterative approach is suited to domain based WSD, recall that the 2010 domain based WSD task in Table 1 also had no tagged named entities (and thus scores were lower than for successive named entity inclusive WSD tasks)." }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-63", "text": "As seen in Table 2 , the iterative approach has a varied effect on different parts of speech." }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-64", "text": "Always improved is the disambiguation of named entities and adverbs." }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-65", "text": "This is also the case for nouns in technical domains (e.g. Biomedical as opposed to Social Issues)." }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-66", "text": "On the other hand the disambiguation of verbs and adjectives suffers under the iterative approach." }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-67", "text": "In hindsight, the iterative approach could be restricted to the parts of speech it is known to improve, while remaining with the conventional approach on others." }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-68", "text": "To the right in Table 3 the author's SUDOKU runs are compared against the team with the most competitive results -LIMSI." 
}, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-69", "text": "The author could not improve on their superior results achieved in English, however for Spanish and Italian the BabelNet First Sense (BFS) baseline was much lower since it often resorted to lexicographic sorting in the absence of WordNet synsets -see (Navigli et al., 2013) ." }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-70", "text": "The author's baseline-independent submissions were unaffected by this, which on reviewing results in (Moro and Navigli, 2015) appears to have helped SUDOKU do best for these languages." }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-71", "text": "Table 3 : F1 scores for each domain/language for SUDOKU and LIMSI." }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-72", "text": "In summary, the inclusion of named entities in disambiguation tasks certainly improves results, as well as the effectiveness of the iterative approach." }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-73", "text": "Furthermore in Table 3 above, the iterative Run3 for the English Biomedical domain is 0.1 short of achieving the best result of 71.3." }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-74", "text": "Investigating exactly which factors contributed to the success of this unsupervised result is a top priority for future work." }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-75", "text": "----------------------------------" }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-76", "text": "**RESOURCES**" }, { "sent_id": "9ac581130218d6c68ca785e3d5ba99-C001-77", "text": "Codebase and resources are at the author's homepage: http://www.stevemanion.com." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "9ac581130218d6c68ca785e3d5ba99-C001-19" ], [ "9ac581130218d6c68ca785e3d5ba99-C001-41" ], [ "9ac581130218d6c68ca785e3d5ba99-C001-69" ] ], "cite_sentences": [ "9ac581130218d6c68ca785e3d5ba99-C001-19", "9ac581130218d6c68ca785e3d5ba99-C001-41", "9ac581130218d6c68ca785e3d5ba99-C001-69" ] } } }, "ABC_b3ef4c176720bdc89d2f73a2560673_43": { "x": [ { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-2", "text": "The task of building automatic agents that can negotiate with humans in free-form natural language has gained recent interest in the literature." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-3", "text": "Although there have been initial attempts, combining linguistic understanding with strategy effectively still remains a challenge." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-4", "text": "Towards this end, we aim to understand the role of natural language in negotiations from a datadriven perspective by attempting to predict a negotiation's outcome, well before the negotiation is complete." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-5", "text": "Building on the recent advancements in pre-trained language encoders, our model is able to predict correctly within 10% for more than 70% of the cases, by looking at just 60% of the negotiation." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-6", "text": "These results suggest that rather than just being a way to realize a negotiation, natural language should be incorporated in the negotiation planning as well." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-7", "text": "Such a framework can be directly used to get feedback for training an automatically negotiating agent." 
}, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-8", "text": "----------------------------------" }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-10", "text": "Negotiations, either between individuals or entities, are ubiquitous in everyday human interactions ranging from sales to legal proceedings." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-11", "text": "Being a good negotiator is a complex skill, requiring the ability to understand the partner's motives, ability to reason and to communicate effectively, making it a challenging task for an automated system." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-79", "text": "The predictions were unnormalized before final evaluations." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-55", "text": "We also alternate between a sequence of 0s and 1s for segment embeddings to differentiate between the scenario and the events." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-12", "text": "While research in building automatically negotiating agents has primarily focused on agent-agent negotiations (Williams et al., 2012; Lin et al., 2014) , there is a recent interest in agent-human negotiations (Gratch et al., 2015) as well." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-13", "text": "Such agents may act as mediators or can be helpful for pedagogical purposes (Johnson et al., 2019) ." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-14", "text": "Efforts in agent-human negotiations involving free-form natural language as a means of communication are rather sparse." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-15", "text": "Researchers (He et al., 2018) recently studied natural language negotiations in buyer-seller bargaining setup, which is comparatively less restricted than previously studied game environments (Asher et al., 2016; Lewis et al., 2017) ." 
}, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-16", "text": "Lack of a well-defined structure in such negotiations allows humans or agents to express themselves more freely, which better emulates a realistic scenario." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-17", "text": "Interestingly, this also provides an exciting research opportunity: how can an agent leverage the behavioral cues in natural language to direct its negotiation strategies? Understanding the impact of natural language on negotiation outcomes through a data-driven neural framework is the primary objective of this work." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-18", "text": "We focus on buyer-seller negotiations (He et al., 2018) where two individuals negotiate the price of a given product." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-19", "text": "Leveraging the recent advancements (Vaswani et al., 2017; Devlin et al., 2019) in pre-trained language encoders, we attempt to predict negotiation outcomes early on in the conversation, in a completely data-driven manner ( Figure 1 )." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-20", "text": "Early prediction of outcomes is essential for effective planning of an automatically negotiating agent." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-21", "text": "Although there have been attempts to gain insights into negotiations (Adair et al., 2001; Koit, 2018) , to the best of our knowledge, we are the first to study early natural language cues through a data-driven neural system (Section 3)." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-22", "text": "Our evaluations show that natural language allows the models to make better predictions by looking at only a fraction of the negotiation." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-23", "text": "Rather than just realizing the strategy in natural language, our empirical results suggest that language can be crucial in the planning as well."
}, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-24", "text": "We provide a sample negotiation from the test set (He et al., 2018) along with our model predictions in Table 1 ." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-25", "text": "----------------------------------" }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-26", "text": "**PROBLEM SETUP**" }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-27", "text": "We study human-human negotiations in the buyer-seller bargaining scenario, which has been a key research area in the literature (Williams et al., 2012) ." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-28", "text": "In this section, we first describe our problem setup and key terminologies by discussing the dataset used." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-29", "text": "Later, we formalize our problem definition." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-30", "text": "Dataset: For our explorations, we use the Craigslist Bargaining dataset (CB) introduced by He et al. (2018) ." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-31", "text": "Instead of focusing on the previously studied game environments (Asher et al., 2016; Lewis et al., 2017) , the dataset considers a more realistic setup: negotiating the price of products listed on Craigslist." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-32", "text": "The dataset consists of 6682 dialogues between a buyer and a seller who converse in natural language to negotiate the price of a given product (sample in Table 1 )." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-33", "text": "In total, 1402 product ad postings were scraped from Craigslist, belonging to six categories: phones, bikes, housing, furniture, car and electronics." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-34", "text": "Each ad posting contains details such as Product Title, Category Type and a Listing Price."
}, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-35", "text": "Moreover, a secret target price is also pre-decided for the buyer." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-36", "text": "The final price after the agreement is called the Agreed Price, which we aim to predict." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-37", "text": "Defining the problem: Say we are provided with a product scenario S, a tuple: (Category, Title, Listing Price, Target Price). Define the interactions between a buyer and a seller as a sequence of n events E_n: <e_1, e_2, ..., e_n>, where e_i occurs before e_j iff i < j. Event e_i is also a tuple: (Initiator, Type, Data)." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-38", "text": "Initiator is either the Buyer or Seller, Type can be one of (message, offer, accept, reject or quit) and Data consists of either the corresponding natural language dialogue, offer price or can be empty." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-39", "text": "Nearly 80% of events in the CB dataset are of type 'message', each consisting of a textual message as Data." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-40", "text": "An offer is usually made and accepted at the end of each negotiation." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-41", "text": "Since the offers directly contain the agreed price (which we want to predict), we only consider 'message' events in our models." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-42", "text": "Given the scenario S and the first n events E_n, our problem is then to learn the function f_n: A = f_n(S, E_n), where A refers to the final agreed price between the two negotiating parties."
}, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-43", "text": "----------------------------------" }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-44", "text": "**APPROACH**" }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-45", "text": "Pre-trained language models such as BERT (Vaswani et al., 2017; Devlin et al., 2019) have recently gained huge success on a wide range of NLP tasks." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-46", "text": "However, since our framework deals with various auxiliary pieces (category, price, etc.), we cannot directly leverage these language models, which have only been trained on natural language inputs." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-47", "text": "Instead of relying on additional representations along with BERT outputs, we propose a simple, yet effective way to incorporate the auxiliary information into the same embedding space." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-48", "text": "Our model hierarchically builds a representation for the given negotiation to finally predict the agreed price." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-49", "text": "We present our complete architecture in Figure 1 ." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-50", "text": "Encoding the input: In order to effectively capture the natural language dialogue and the associated auxiliary information, we make use of predefined sentence templates." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-51", "text": "Table 2 shows how we represent the category, target price and the product title in natural language sentences." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-52", "text": "These sentences are concatenated to form our Scenario S. Moving ahead in a similar manner, we define templates to capture the negotiator identity (buyer/seller) and any message which is conveyed."
}, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-53", "text": "As shown in Figure 1 , the scenario S and the events are separated using [SEP] tokens." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-54", "text": "Following Liu and Lapata (2019) , who use BERT for extractive text summarization, we add a [CLS] token at the beginning of each segment." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-56", "text": "Architecture and Learning: The BERT representation for each [CLS] token is a contextualized encoding of the word sequence that follows it." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-57", "text": "In order to further capture the sequential nature of negotiation events, we pass these [CLS] representations through Gated Recurrent Units (GRU)." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-58", "text": "Recurrent Networks have been shown to be useful in combination with Transformer architectures." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-59", "text": "Finally, a feed-forward network is applied to predict the agreed price for the negotiation." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-60", "text": "The model is end-to-end trained and fine-tuned using the Mean Squared Error (MSE) loss between the predicted price and the ground-truth." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-61", "text": "----------------------------------" }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-62", "text": "**EXPERIMENTAL DETAILS**" }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-63", "text": "We perform experiments on the CB dataset to primarily answer two questions: 1) Is it feasible to predict negotiation outcomes without observing the complete conversation between the buyer and seller? 2) To what extent does the natural language incorporation help in the prediction?"
}, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-64", "text": "In order to answer these questions, we compare our model empirically with a number of baseline methods." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-65", "text": "This section presents the methods we compare to, the training setup and the evaluation metrics." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-66", "text": "Methods: The first baseline is the Listing Price (LP) where the model ignores the negotiation and returns the listing price of the product." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-67", "text": "Similarly, we use Target Price (TP), where the model just returns the target price for the buyer." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-68", "text": "We also consider the mean of the Listing and Target prices ((TP+LP)/2) as another baseline." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-69", "text": "Although trivial, these baselines help in benchmarking our results and also show good performance in some cases." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-70", "text": "Next, we build another baseline which completely ignores the natural language incorporation." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-71", "text": "In this case, the model only sees a sequence of prices shared across the messages in the negotiation." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-72", "text": "We keep the input format the same as our model and all the parameters are randomly initialized to remove learning from natural language." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-73", "text": "We refer to this model as Prices-only." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-74", "text": "We compare two variants for BERT-based models."
}, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-75", "text": "First, for the BERT method, we keep only the first [CLS] token in the input and then train the model with fine-tuning using a single feed-forward network on top of the [CLS] representation." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-76", "text": "Secondly, we call our complete approach BERT+GRU, where we use a recurrent network with BERT fine-tuning, as depicted in Figure 1 ." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-77", "text": "Training Details: Given the multiple segments in our model input and the small data size, we use BERT-base (Devlin et al., 2019) , which has an output dimension of 768." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-78", "text": "To tackle the variance in product prices across different categories, all prices in the inputs and outputs were normalized by the listing price." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-80", "text": "Further, we only considered the negotiations where an agreement was reached." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-81", "text": "These were the instances for which ground truth was available (\u223c 75% of the data)." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-82", "text": "We use a two-layer GRU with a dropout of 0.1 and 50 hidden units." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-83", "text": "The models were trained for a maximum of 5000 iterations, with the AdamW optimizer (Loshchilov and Hutter, 2018) , a learning rate of 2x10^-5 and a batch size of 4." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-84", "text": "We used a linear warmup schedule for the first 0.1 fraction of the steps." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-85", "text": "All the hyper-parameters were optimized on the provided development set."
}, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-86", "text": "Evaluation Metrics: We study the variants of the same model by training with different proportions of the negotiation seen, namely, f \u2208 {0.0, 0.2, 0.4, 0.6, 0.8, 1.0}. We compare the models on two evaluation metrics: MAE, the Mean Absolute Error between the predicted and ground-truth agreed prices, and Accuracy\u00b1k, the percentage of cases where the predicted price lies within k percent of the ground-truth. Figure 2: Performance for various approaches by varying the fraction of events seen by the model." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-87", "text": "We present scores on MAE, Accuracy\u00b15 and Accuracy\u00b110 from left to right for the complete test set." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-88", "text": "Overall, BERT-based approaches incorporating natural language cues almost always beat the baselines not using these cues." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-89", "text": "By looking at 60% of the negotiation, the model can predict correctly within 10% for more than 70% of the cases." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-91", "text": "We use k = 5 and k = 10 in our experiments." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-92", "text": "----------------------------------" }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-93", "text": "**RESULTS AND DISCUSSION**" }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-94", "text": "We present our results in Figure 2 ." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-95", "text": "We also show Accuracy\u00b110 for different product categories in the Appendix." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-96", "text": "First, Target Price (TP) and (TP+LP)/2 prove to be strong baselines, with the latter achieving 61.07% Accuracy\u00b110."
}, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-97", "text": "This performance is also reflected in relatively strong numbers on the other metrics." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-98", "text": "Prices-only, which does not incorporate any knowledge from natural language, fails to beat the average baseline even with 60% of the negotiation history." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-99", "text": "This can be attributed to the observation that in many negotiations, before discussing the price, buyers tend to gather more information about the product by exchanging messages: the condition of the product, how old it is, whether there is any urgency for the buyer or seller, and so on." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-100", "text": "Incorporating natural language in both the scenario and event messages paves the way to leverage such cues and make better predictions early on in the conversation, as depicted in the plots." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-101", "text": "Both BERT and BERT-GRU consistently perform well on the complete test set." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-102", "text": "There is no clear winner, although using a recurrent network proves to be more helpful in the early stages of the negotiation." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-103", "text": "Note that the BERT method still employs multiple [SEP] tokens along with alternating segment embeddings (Section 3)." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-104", "text": "Without this usage, the fine-tuning pipeline proves to be inadequate." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-105", "text": "Overall, BERT-GRU achieves 67.08% Accuracy\u00b110 with just the product scenario, reaches 71.16% with 60% of the messages, and crosses 90% as more information about the final price is revealed."
}, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-106", "text": "Paired Bootstrap Resampling (Koehn, 2004) with 10,000 bootstraps shows that for a given f, BERT-GRU is better than its Prices-only counterpart with 95% statistical significance." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-107", "text": "The prices discussed during the negotiation still play a crucial role in making the predictions." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-108", "text": "In fact, in only 65% of the negotiations is the first price quoted within the first 0.4 fraction of the events." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-109", "text": "This is reflected in the higher performance as more events are seen after this point." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-110", "text": "This number is lower than average for Housing, Bike and Car, resulting in relatively better performance of the Prices-only model for these categories compared to others." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-111", "text": "The models also show evidence of capturing buyer interest." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-112", "text": "By constructing artificial negotiations, we observe that the model predictions at f=0.2 increase when the buyer shows more interest in the product, indicating more willingness to pay." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-113", "text": "With the capability to incorporate cues from natural language, such a framework can be used in the future to get negotiation feedback, in order to guide the planning of a negotiating agent." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-114", "text": "This can be a viable middle ground between following the average human behavior through supervised learning and exploring the wild by optimizing on rewards using reinforcement learning (Lewis et al., 2017; He et al., 2018) ."
}, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-115", "text": "----------------------------------" }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-116", "text": "**CONCLUSION**" }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-117", "text": "We presented a framework to attempt early predictions of the agreed product prices in buyer-seller negotiations." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-118", "text": "We construct sentence templates to encode the product scenario, exchanged messages and associated auxiliary information into the same hidden space." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-119", "text": "By combining a recurrent network and the pre-trained BERT encoder, our model leverages natural language cues in the exchanged messages to predict the negotiation outcomes early on in the conversation." }, { "sent_id": "b3ef4c176720bdc89d2f73a2560673-C001-120", "text": "With this capability, such a framework can be used in a feedback mechanism to guide the planning of a negotiating agent." } ], "y": { "@USE@": { "gold_contexts": [ [ "b3ef4c176720bdc89d2f73a2560673-C001-19" ], [ "b3ef4c176720bdc89d2f73a2560673-C001-77" ] ], "cite_sentences": [ "b3ef4c176720bdc89d2f73a2560673-C001-19", "b3ef4c176720bdc89d2f73a2560673-C001-77" ] }, "@BACK@": { "gold_contexts": [ [ "b3ef4c176720bdc89d2f73a2560673-C001-45" ] ], "cite_sentences": [ "b3ef4c176720bdc89d2f73a2560673-C001-45" ] }, "@DIF@": { "gold_contexts": [ [ "b3ef4c176720bdc89d2f73a2560673-C001-45", "b3ef4c176720bdc89d2f73a2560673-C001-46" ] ], "cite_sentences": [ "b3ef4c176720bdc89d2f73a2560673-C001-45" ] } } }, "ABC_3fe979e570992b79c8656ab6cb34fb_43": { "x": [ { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-50", "text": "Verb classes allow us to capture generalizations about verb behavior." 
}, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-2", "text": "We present a class-based approach to building a verb lexicon that makes explicit the close relation between syntax and semantics for Levin classes." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-3", "text": "We have used a Lexicalized Tree Adjoining Grammar to capture the syntax associated with each verb class and have added semantic predicates to each tree, which allow for a compositional interpretation." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-4", "text": "----------------------------------" }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-5", "text": "**INTRODUCTION**" }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-6", "text": "We describe a computational verb lexicon called VerbNet which utilizes Levin verb classes (Levin, 1993) to systematically construct lexical entries." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-7", "text": "We have used Lexicalized Tree Adjoining Grammar (LTAG) (Joshi, 1985; Schabes, 1990) to capture the syntax associated with each verb class, and have added semantic predicates." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-8", "text": "We also show how regular extensions of verb meaning can be achieved through the adjunction of particular syntactic phrases." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-9", "text": "We base these regular extensions on intersective Levin classes, a fine-grained variation on Levin classes, as a source of semantic components associated with specific adjuncts (Dang et al., 1998) ." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-10", "text": "Whereas previous research on tying semantics to Levin classes (Dorr, 1997) has not explicitly implemented the close relation between syntax and semantics hypothesized by Levin, our lexical resource combines traditional lexical semantic information, such as thematic roles and semantic predicates, with syntactic frames and selectional restrictions." 
}, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-11", "text": "In order to increase the utility of VerbNet, we also include links to entries in WordNet, which is one of the most widely used online lexical databases in Natural Language Processing applications." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-12", "text": "----------------------------------" }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-13", "text": "**LEVIN CLASSES AND WORDNET**" }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-14", "text": "Two current approaches to English verb classifications are WordNet and Levin classes." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-15", "text": "WordNet is an on-line lexical database of English that currently contains approximately 120,000 sets of noun, verb, adjective, and adverb synonyms, each representing a lexicalized concept." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-16", "text": "A synset (synonym set) contains, besides all the word forms that can refer to a given concept, a definitional gloss and -in most cases -an example sentence." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-17", "text": "Words and synsets are interrelated by means of lexical and semantic-conceptual links, respectively." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-18", "text": "Antonymy or semantic opposition links individual words, while the super-/subordinate relation links entire synsets." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-19", "text": "WordNet was designed principally as a semantic network, and contains little syntactic information." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-20", "text": "Even as a semantic resource, however, it is missing some of the information that has traditionally been required by NLP applications, including explicit predicate-argument structures." 
}, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-21", "text": "WordNet senses are often too fine-grained as well, lacking an underlying notion of semantic components and a systematic extension of basic senses to produce these fine-grained senses." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-22", "text": "The Levin verb classification, on the other hand, does explicitly state the syntax for each class, but still falls short of assigning semantic components to each class." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-23", "text": "The classes are based on the ability or inability of a verb to occur in pairs of syntactic frames that are in some sense meaning preserving (diathesis alternations) (Levin, 1993) ." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-24", "text": "The sets of syntactic frames associated with a particular Levin class are supposed to reflect underlying semantic components that constrain allowable arguments and adjuncts." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-25", "text": "For example, break verbs and cut verbs are similar in that they can all participate in the transitive and middle constructions." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-26", "text": "However, only break verbs can also occur in the simple intransitive, and cut verbs can occur in the conative, where break verbs cannot." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-27", "text": "The explanation given is that cut describes a series of actions directed at achieving the goal of separating some object into pieces." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-28", "text": "It is possible for these actions to be performed without the end result being achieved, but where the cutting manner can still be recognized (i.e., \"John cut at the loaf\")." 
}, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-29", "text": "For break, the only thing specified is the resulting change of state where the object becomes separated into pieces." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-30", "text": "If the result is not achieved, no attempted breaking action can be recognized." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-31", "text": "For example: \"John valiantly cut/hacked at the frozen loaf, but his knife was too dull to make a dent in it.\"" }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-32", "text": "The fundamental assumption is that the syntactic frames are a direct reflection of the underlying semantics." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-33", "text": "However, Levin classes exhibit inconsistencies that have hampered researchers' ability to reference them directly in applications." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-34", "text": "Many verbs are listed in multiple classes, some of which have conflicting sets of syntactic frames." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-35", "text": "For instance, carry verbs are described as not taking the conative (*\"The mother carried at the baby\"), and yet many of the verbs in the carry class (push, pull, tug, shove, kick) are also listed in the push/pull class, which does take the conative." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-36", "text": "Dang et al. (1998) showed that multiple listings could in some cases be interpreted as regular sense extensions, and defined intersective Levin classes, which are a more syntactically and semantically coherent refinement of basic Levin classes." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-37", "text": "We implement these verb classes and their regular sense extensions in the Lexicalized Tree Adjoining Grammar formalism."
}, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-38", "text": "----------------------------------" }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-39", "text": "**VERB LEXICON**" }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-40", "text": "VerbNet can be viewed in both a static and a dynamic way." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-41", "text": "The static aspect refers to the verb entries and how they are organized, providing the characteristic descriptions of a verb sense or a verb class." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-42", "text": "The dynamic aspect of the lexicon constrains the entries to allow a compositional interpretation in LTAG derivation trees, capturing extended verb meanings by incorporating adjuncts." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-43", "text": "----------------------------------" }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-44", "text": "**STATIC DESCRIPTION**" }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-45", "text": "Each verb entry refers to a set of classes, corresponding to the different senses of the verb." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-46", "text": "For example, the manner of motion sense of \"run\" is a member of the Manner of Motion class, whereas \"run\" as in \"the street runs through the district\" is a member of the Meander class." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-47", "text": "For each verb sense there is a verb class as well as specific selectional restrictions (e.g., an instrument of \"kick\" must be of type foot) and semantic characteristics (e.g., a particular manner of directed motion) that may not be captured by the class membership." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-48", "text": "In order to provide a mapping to other dictionaries, we also include links to WordNet synsets." 
}, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-49", "text": "Because WordNet has more fine-grained sense distinctions than Levin, each verb sense in VerbNet references the set of WordNet synsets (if any) that captures the meaning appropriate to the class." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-51", "text": "This reduces not only the effort needed to construct the lexicon, but also the likelihood that errors are introduced when adding a new verb entry." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-52", "text": "Each verb class lists the thematic roles that the predicate-argument structure of its members allows, and provides descriptions of the syntactic frames corresponding to licensed constructions, with selectional restrictions defined for each argument in each frame." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-53", "text": "Each frame also includes semantic predicates describing the participants at various stages of the event described by the frame." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-54", "text": "We decompose each event E into a tripartite structure in a manner similar to Moens and Steedman (1988) , introducing a time function for each predicate to specify whether the predicate is true in the preparatory (during(E)), culmination (end(E)), or consequent (result(E)) stage of an event." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-55", "text": "The tripartite event structure (Figure 1 ) allows us to express the semantics of classes of verbs like change of state verbs whose adequate description requires reference to a complex event structure." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-56", "text": "In the case of a verb such as \"break\", it is important to make a distinction between the state of the object before the end of the action (during(E)), and the new state that results afterwards (result(E))." 
}, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-57", "text": "Verb classes are hierarchically organized, ensuring that each class is coherent -that is, all its members have common semantic elements and share a common set of thematic roles and basic syntactic frames." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-58", "text": "This requires some manual restructuring of the original Levin classes, which is facilitated by using intersective Levin classes." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-59", "text": "In addition, a particular verb may add more semantic information to the basic semantics of its class." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-60", "text": "Figure 2 shows a partial entry for the Hit class." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-61", "text": "This class allows for three thematic roles: Basic Transitive A V P manner(during(E),directedmotion,A)m anner(end(E),forceful,A)\u0109 ontact(end(E),A,P) Transitive with A V P with I manner(during(E),directedmotion,I)\u00ce nstrument manner(end(E),forceful,I)\u0109 ontact(end(E),I,P) Conative A V at P manner(during(E),directedmotion,A) With/against A V I against/on P manner(during(E),directedmotion,I)\u00e2 lternation manner(end(E),forceful,I)\u0109 ontact(end(E),I,P) Transitive I V P manner(during(E),directedmotion,I)m anner(end(E),forceful,I)\u0109 ontact(end(E),I,P)" }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-62", "text": "Figure 2: Partial entry for the Hit class a feature hierarchy where animate subsumes animal and human, concrete subsumes both animate and inanimate, and so forth." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-63", "text": "This representation does not suffer from some drawbacks of theta role analysis because our roles are not global primitives, but are only used to describe relationships within a class." 
}, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-64", "text": "The strength of our representation comes from the explicit relationship between syntax and semantics captured in each entry." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-65", "text": "Figure 2 shows some of the syntactic frames allowed for the Hit class and the semantic predicates for each frame." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-66", "text": "Thematic roles are used as descriptors which are mapped into arguments of semantic predicates as well as the argument positions in a TAG elementary tree." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-67", "text": "The tripartite event structure also handles the conative construction, in which there is an intention of a goal during the event which is not achieved at the end of the event." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-68", "text": "The example shown in Figure 2 for the conative construction has the predicate manner(during(E),directedmotion,A) but because the intended contact by sudden impact is not satisfied, the semantics does not include the predicates" }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-69", "text": "----------------------------------" }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-70", "text": "**MANNER(END(E),FORCEFUL,A)^CONTACT(END(E),A,P).**" }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-71", "text": "----------------------------------" }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-72", "text": "**COMPOSITIONAL SEMANTICS**" }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-73", "text": "We use TAG elementary trees to describe syntactic frames and associate semantic predicates and selectional restrictions with each tree." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-74", "text": "Elementary trees capture the basic semantics of the verbs in each class." 
}, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-75", "text": "Each frame in the static aspect of the lexicon maps onto a TAG elementary tree, in which the thematic roles correspond to substitution sites." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-76", "text": "Some auxiliary trees are class-based because they interact with the verbs in the class in peculiar ways and add semanrelaxed depending on the domain of a particular application." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-77", "text": "meets(E arg0 ; E)m otion(during(E); X arg0:arg1 )v ia(during(E); X arg0:arg1 ; X arg1 ) Figure 3 : Initial transitive tree for \"hit\" and auxiliary tree for \"across\" tic content specific to the class." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-78", "text": "Others, such as temporal adjuncts, bring the same semantic predicate independent of the verb." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-79", "text": "We use a flat semantic representation like that of Joshi and Vijay-Shanker (1999) in which the semantics of a sentence is the conjunction of the semantic predicates of the trees used to derive the sentence." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-80", "text": "We ensure that all the semantic arguments of basic predicates are local to the syntactic initial tree." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-81", "text": "For example, the basic transitive frame in Figure 2 shows that the Agent is in direct motion and contacts the Patient in a forceful manner." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-82", "text": "If an instrument is specified, it replaces the Agent in these predicates." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-83", "text": "Since the instrument can be an argument in the basic predicates of the Hit class, it must appear in the elementary trees whenever it is specified, even if it is in a prepositional phrase." 
}, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-84", "text": "The ability of certain verbs to take on extended senses based on their adjuncts is captured in a natural way by the TAG operation of adjunction and our conjunction of semantic predicates." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-85", "text": "Figure 3 shows an initial transitive tree anchored by \"hit\" and the semantic predicates associated with this syntactic frame." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-86", "text": "The original Hit verb class does not include movement of the direct object as part of the meaning of \"hit\" -only one event of contact by sudden impact is described." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-87", "text": "This event is subdivided into three predicates: the first, manner(during(E),directedmotion,X arg0 ) specifies that during the event E, X arg0 is in directed motion; the second, manner(end(E),forceful,X arg0 ) refers to the forceful contact of X arg0 at the end of E; and the third, contact(end(E),X arg0 ,X arg1 ) establishes that at the end of event E, contact between X arg0 and X arg1 has been achieved." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-88", "text": "By adjoining a path PP such as \"across NP\", we get an extended meaning, and a change in Levin class membership to the Throw class." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-89", "text": "Figure 3 shows the auxiliary tree anchored by the preposition \"across\" together with its semantic predicates." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-90", "text": "The class-specific path PP adds the predicates meets(E arg0 ,E)^motion(during(E),X arg0:arg1 )^via(during(E),X arg0:arg1 ,X arg1 ), introducing a motion event that immediately follows (meets) the contact event, which is the basic sense of the Hit class." 
}, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-91", "text": "In Figure 4 , we show the derived tree for the sentence \"John hit the apple across the room\" with all the predicates instantiated." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-92", "text": "The arguments are recovered from the derivation tree, following Candito and Kahane (1998)." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-93", "text": "When an initial tree, such as J ohn , is substituted into another tree hit , the dependency mirrors the derivation structure, so the variables associated with the substituting tree can be referenced as arguments in the host tree's predicates (see Figure 5 )." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-94", "text": "When an auxiliary tree across is adjoined, the dependency for the adjunction is reversed, so that variables associated with the host tree can be referenced as arguments in the adjoining tree's predicates." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-95", "text": "With this dependency from across to hit (labeled arg0), it is now possible for the semantic predicates associated with across to predicate over variables in the dependent tree hit , including the variable X arg0:arg1 instantiated as apple, resulting in the predicates motion(during(e2),apple)^via(during(e2),apple,room)." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-96", "text": "Verbs in the intersective class formed by the Push/Pull verbs and the Carry verbs behave in a similar manner." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-97", "text": "The core meaning of this verb class is exertion of force." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-98", "text": "Adjunction of a path PP implying motion modifies membership of these verbs to the Carry class." 
}, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-99", "text": "Push/Pull verbs can appear in the conative construction, which emphasizes their forceful semantic component and ability to express an attempted action where any result that might be associated with the verb is not necessarily achieved; Carry verbs (used with a goal or directional phrase) cannot take the conative alternation because this would conflict with the causation of motion which is the intrinsic meaning of the class (Dang et al., 1998) ." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-100", "text": "Palmer et al. (1999) and Bleam et al. (1998) also defined compositional semantics for classes of verbs implemented in FB-LTAG, but they represented general semantic components (e.g., motion, manner) as features on the nodes of the trees." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-101", "text": "Our use of separate logical forms gives a more detailed semantics for the sentence, so that for an event involving motion, it is possible to know not only that the event has a motion semantic component, but also which entity is actually in motion." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-102", "text": "----------------------------------" }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-103", "text": "**CONCLUSION**" }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-104", "text": "We have presented a class-based approach to building a verb lexicon that makes explicit the close association between syntax and semantics, as postulated by Levin." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-105", "text": "By using verb classes we capture generalizations about verb behavior and reduce not only the effort needed to construct the lexicon, but also the likelihood that errors are introduced when adding new verbs." 
}, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-106", "text": "Another important contribution of this work is that by dividing each event into a tripartite structure, we permit a more precise definition of the associated semantics, which is necessary for applications such as animation of natural language instructions (Bindiganavale et al., 2000) ." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-107", "text": "The power of the lexicon comes from its dynamic aspect which is based on the LTAG formalism." }, { "sent_id": "3fe979e570992b79c8656ab6cb34fb-C001-108", "text": "The operation of adjunction in TAGs provides a principled approach to representing the type of regular polysemy that has been a major obstacle in building verb lexicons." } ], "y": { "@USE@": { "gold_contexts": [ [ "3fe979e570992b79c8656ab6cb34fb-C001-9" ] ], "cite_sentences": [ "3fe979e570992b79c8656ab6cb34fb-C001-9" ] }, "@BACK@": { "gold_contexts": [ [ "3fe979e570992b79c8656ab6cb34fb-C001-9" ], [ "3fe979e570992b79c8656ab6cb34fb-C001-99" ] ], "cite_sentences": [ "3fe979e570992b79c8656ab6cb34fb-C001-9", "3fe979e570992b79c8656ab6cb34fb-C001-99" ] } } }, "ABC_8dbc779d455ad72def6654564f9e13_43": { "x": [ { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-26", "text": "----------------------------------" }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-2", "text": "Pour la documentation des langues, la transcription est un processus tr\u00e8s co\u00fbteux : une minute d'enregistrement n\u00e9cessiterait environ une heure et demie de travail pour un linguiste (Austin and Sallabank, 2013) ." 
}, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-3", "text": "R\u00e9cemment, la collecte de traductions (dans des langues bien document\u00e9es) align\u00e9es aux enregistrements est devenue une solution populaire pour garantir l'interpr\u00e9tabilit\u00e9 des enregistrements (Adda et al., 2016) et aider \u00e0 leur traitement automatique." }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-4", "text": "Dans cet article, nous \u00e9tudions l'impact de la langue de traduction sur les approches automatiques en documentation des langues." }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-5", "text": "Nous traduisons un corpus parall\u00e8le bilingue Mboshi-Fran\u00e7ais (Godard et al., 2017) dans quatre autres langues, et \u00e9valuons l'impact de la langue de traduction sur une t\u00e2che de segmentation en mots non supervis\u00e9e." }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-6", "text": "Nos r\u00e9sultats sugg\u00e8rent que la langue de traduction peut influencer l\u00e9g\u00e8rement la qualit\u00e9 de segmentation." }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-7", "text": "Cependant, combiner l'information apprise par diff\u00e9rents mod\u00e8les bilingues nous permet d'am\u00e9liorer ces r\u00e9sultats de mani\u00e8re marginale." }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-27", "text": "**METHODOLOGY**" }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-46", "text": "----------------------------------" }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-8", "text": "For language documentation initiatives, transcription is an expensive resource: one minute of audio is estimated to take one hour and a half on average of a linguist's work (Austin and Sallabank, 2013) ." }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-9", "text": "Recently, collecting aligned translations in well-resourced languages became a popular solution for ensuring posterior interpretability of the recordings (Adda et al., 2016) ." 
}, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-10", "text": "In this paper we investigate language-related impact in automatic approaches for computational language documentation." }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-11", "text": "We translate the bilingual Mboshi-French parallel corpus (Godard et al., 2017) into four other languages, and we perform bilingual-rooted unsupervised word discovery." }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-12", "text": "Our results hint towards an impact of the well-resourced language in the quality of the output." }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-13", "text": "However, by combining the information learned by different bilingual models, we are only able to marginally increase the quality of the segmentation." }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-14", "text": "MOTS-CL\u00c9S : d\u00e9couverte non supervis\u00e9e du lexique, documentation des langues, approches multilingues." }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-15", "text": "----------------------------------" }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-16", "text": "**INTRODUCTION**" }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-17", "text": "The Cambridge Handbook of Endangered Languages (Austin and Sallabank, 2011) estimates that at least half of the 7,000 languages currently spoken worldwide will no longer exist by the end of this century." }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-18", "text": "For these endangered languages, data collection campaigns have to accommodate the challenge that many of them are from oral tradition, and producing transcriptions is costly." 
}, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-19", "text": "This transcription bottleneck problem can be handled by translating into a widely spoken language to ensure subsequent interpretability of the collected recordings, and such parallel corpora have been recently created by aligning the collected audio with translations in a well-resourced language (Adda et al., 2016; Godard et al., 2017; Boito et al., 2018) ." }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-20", "text": "Moreover, some linguists suggested that more than one translation should be collected to capture deeper layers of meaning (Evans and Sasse, 2004) ." }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-21", "text": "This work is a contribution to the Computational Language Documentation (CLD) research field, that aims to replace part of the manual steps performed by linguists during language documentation initiatives by automatic approaches." }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-22", "text": "Here we investigate the unsupervised word discovery and segmentation task, using the bilingual-rooted approach from Godard et al. (2018) ." }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-23", "text": "There, words in the well-resourced language are aligned to unsegmented phonemes in the endangered language in order to identify group of phonemes, and to cluster them into word-like units." }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-24", "text": "We experiment with the Mboshi-French parallel corpus, translating the French text into four other well-resourced languages in order to investigate language impact in this CLD approach." }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-25", "text": "Our results hint that this language impact exists, and that models based on different languages will output different word-like units." 
}, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-28", "text": "The Multilingual Mboshi Parallel Corpus: In this work we extend the bilingual Mboshi-French parallel corpus (Godard et al., 2017) , fruit of the documentation process of Mboshi (Bantu C25), an endangered language spoken in Congo-Brazzaville." }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-29", "text": "The corpus contains 5,130 utterances, for which it provides audio, transcriptions and translations in French." }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-30", "text": "We translate the French into four other well-resourced languages through the use of the DeepL translator." }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-31", "text": "1 The languages added to the dataset are: English, German, Portuguese and Spanish." }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-32", "text": "Table 1 shows some statistics for the produced Multilingual Mboshi parallel corpus." }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-33", "text": "2 Bilingual Unsupervised Word Segmentation/Discovery Approach: We use the bilingual neuralbased Unsupervised Word Segmentation (UWS) approach from Godard et al. (2018) to discover words in Mboshi." }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-34", "text": "In this approach, Neural Machine Translation (NMT) models are trained between language pairs, using as source language the translation (word-level) and as target, the language to document (unsegmented phonemic sequence)." }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-35", "text": "Due to the attention mechanism present in these networks (Bahdanau et al., 2014) , posterior to training, it is possible to retrieve soft-alignment probability matrices between source and target sequences." 
}, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-36", "text": "These matrices give us sentence-level source-to-target alignment information, and by using it for clustering neighbor phonemes aligned to the same translation word, we are able to create segmentation in the target side." }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-37", "text": "The product of this approach is a set of (discovered-units, translation words) pairs." }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-38", "text": "Multilingual Leveraging: In this work we apply two simple methods for including multilingual information into the bilingual models from Godard et al. (2018) ." }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-39", "text": "The first one, Multilingual Voting, consists of merging the information learned by models trained with different language pairs by performing a voting over the final discovered boundaries." }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-40", "text": "The voting is performed by applying an agreement threshold T over the output boundaries." }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-41", "text": "This threshold balances between accepting all boundaries from all the bilingual models (zero agreement) and accepting only input boundaries discovered by all these models (total agreement)." }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-42", "text": "The second method is ANE Selection." }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-43", "text": "For every language pair and aligned sentence in the dataset, a soft-alignment probability matrix is generated." }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-44", "text": "We use Average Normalized Entropy (ANE) (Boito et al., 2019a) computed over these matrices for selecting the most confident one for segmenting each phoneme sequence." 
}, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-45", "text": "This exploits the idea that models trained on different language pairs will have language-related behavior, thus differing on the resulting alignment and segmentation over the same phoneme sequence." }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-47", "text": "**EXPERIMENTS**" }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-48", "text": "The experiment settings from this paper and evaluation protocol for the Mboshi corpus (Boundary F-scores using the ZRC speech reference) are the same from Boito et al. (2019a) ." }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-49", "text": "Table 2 presents the results for bilingual UWS and multilingual leveraging." }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-50", "text": "For the former, we reach our best result by using as aligned information the French, the original aligned language for this dataset." }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-51", "text": "Languages closely related to French (Spanish and Portuguese) ranked better, while our worst result used German." }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-52", "text": "English also performs notably well in our experiments." }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-53", "text": "We believe this is due to the statistics features of the resulting text." }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-54", "text": "We observe in Table 1 that the English portion of the dataset contains the smallest vocabulary among all languages." }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-55", "text": "Since we train our systems in very low-resource settings, vocabularyrelated features can impact greatly the system's capacity to language-model, and consequently the final quality of the produced alignments." 
}, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-56", "text": "Even in high-resource settings, it was already attested that some languages are more difficult to model than others (Cotterell et al., 2018) ." }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-57", "text": "For the multilingual selection experiments, we experimented combining the languages from top to bottom as they appear Table 2 (ranked by performance; e.g. 1-3 means the combination of FR(1), Table 3 : Top 10 confident (discovered type, translation) pairs for the five bilingual models." }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-58", "text": "The \"+\" mark means the discovered type is a concatenation of two existing true types." }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-59", "text": "EN(2) and PT (3))." }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-60", "text": "We observe that the performance improvement is smaller than the one observed in previous work (Boito et al., 2019b) , which we attribute to the fact that our dataset was artificially augmented." }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-61", "text": "This could result in the available multilingual form of supervision not being as rich as in a manually generated dataset." }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-62", "text": "Finally, the best boundary segmentation result is obtained by performing multilingual voting with all the languages and an agreement of 50%, which indicates that the information learned by different languages will provide additional complementary evidence." }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-63", "text": "Lastly, following the methodology from Boito et al. (2019a) , we extract the most confident alignments (in terms of ANE) discovered by the bilingual models." }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-64", "text": "Table 3 presents the top 10 most confident (discovered type, translation) pairs." 
}, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-65", "text": "3 Looking at the pairs the bilingual models are most confident about, we observe there are some types discovered by all the bilingual models (e.g. Mboshi word itua, and the concatenation obo\u00e1+ng\u00e1)." }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-66", "text": "However, the models still differ for most of their alignments in the table." }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-67", "text": "This hints that while a portion of the lexicon might be captured independently of the language used, other structures might be more dependent of the chosen language." }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-68", "text": "On this note, Haspelmath (2011) suggests the notion of word cannot always be meaningfully defined cross-linguistically." }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-69", "text": "----------------------------------" }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-70", "text": "**CONCLUSION**" }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-71", "text": "In this work we train bilingual UWS models using the endangered language Mboshi as target and different well-resourced languages as aligned information." }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-72", "text": "Results show that similar languages rank better in terms of segmentation performance, and that by combining the information learned by different models, segmentation is further improved." }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-73", "text": "This might be due to the different languagedependent structures that are captured by using more than one language." }, { "sent_id": "8dbc779d455ad72def6654564f9e13-C001-74", "text": "Lastly, we extend the bilingual Mboshi-French parallel corpus, creating a multilingual corpus for the endangered language Mboshi that we make available to the community." 
} ], "y": { "@USE@": { "gold_contexts": [ [ "8dbc779d455ad72def6654564f9e13-C001-22" ], [ "8dbc779d455ad72def6654564f9e13-C001-33" ], [ "8dbc779d455ad72def6654564f9e13-C001-38" ] ], "cite_sentences": [ "8dbc779d455ad72def6654564f9e13-C001-22", "8dbc779d455ad72def6654564f9e13-C001-33", "8dbc779d455ad72def6654564f9e13-C001-38" ] } } }, "ABC_f3b1a39203ebf0725d8dd2b8f8c7a9_44": { "x": [ { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-54", "text": "Evaluating creative generation tasks is both critical and complex [27] ." }, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-2", "text": "Generative models for text have substantially contributed to tasks like machine translation and language modeling, using maximum likelihood optimization (MLE)." }, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-3", "text": "However, for creative text generation, where multiple outputs are possible and originality and uniqueness are encouraged, MLE falls short." }, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-4", "text": "Methods optimized for MLE lead to outputs that can be generic, repetitive and incoherent." }, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-5", "text": "In this work, we use a Generative Adversarial Network framework to alleviate this problem." }, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-6", "text": "We evaluate our framework on poetry, lyrics and metaphor datasets, each with widely different characteristics, and report better performance of our objective function over other generative models." }, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-7", "text": "----------------------------------" }, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-8", "text": "**INTRODUCTION AND RELATED WORK**" }, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-9", "text": "Language models can be optimized to recognize syntax and semantics with great accuracy [1] ." 
}, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-10", "text": "However, the output generated can be repetitive and generic leading to monotonous or uninteresting responses (e.g \"I don't know\") regardless of the input [2] ." }, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-11", "text": "While application of attention [3, 4] and advanced decoding mechanisms like beam search and variation sampling [5] have shown improvements, it does not solve the underlying problem." }, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-12", "text": "In creative text generation, the objective is not strongly bound to the ground truth-instead the objective is to generate diverse, unique or original samples." }, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-13", "text": "We attempt to do this through a discriminator which can give feedback to the generative model through a cost function that encourages sampling of creative tokens." }, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-14", "text": "The contributions of this paper are in the usage of a GAN framework to generate creative pieces of writing." }, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-15", "text": "Our experiments suggest that generative text models, while very good at encapsulating semantic, syntactic and domain information, perform better with external feedback from a discriminator for fine-tuning objectiveless decoding tasks like that of creative text." }, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-16", "text": "We show this by evaluating our model on three very different creative datasets containing poetry, metaphors and lyrics." }, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-17", "text": "Previous work on handling the shortcomings of MLE include length-normalizing sentence probability [6] , future cost estimation [7] , diversity-boosting objective function [8, 2] or penalizing repeating tokens [9] ." 
}, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-18", "text": "When it comes to poetry generation using generative text models, Zhang and Lapata [10] , Yi et al. [11] and Wang et al. [12] use language modeling to generate Chinese poems." }, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-19", "text": "However, none of these methods provide feedback on the quality of the generated sample and hence, do not address the qualitative objective required for creative decoding." }, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-20", "text": "For the task of text generation, MaskGAN [13] uses a Reinforcement Learning signal from the discriminator, FMD-GAN [14] uses an optimal transport mechanism as an objective function." }, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-21", "text": "GumbelGAN [15] uses Gumbel-Softmax distribution that replaces the non-differentiable sample from a categorical distribution with a differentiable sample to propagate stronger gradients." }, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-22", "text": "Li et al. [2] use a discriminator for a diversity promoting objective." }, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-23", "text": "Yu et al. [16] use SeqGAN to generate poetry and comment on the performance of SeqGAN over MLE in human evaluations, encouraging our study of GANs for creative text generation." }, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-24", "text": "However, these studies do not focus solely on creative text." }, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-25", "text": "Using GANs, we can train generative models in a two-player game setting between a discriminator and a generator, where the discriminator (a binary classifier) learns to distinguish between real and fake data samples and the generator tries to fool the discriminator by generating authentic and high quality output [17] ." 
}, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-26", "text": "GANs have shown to be successful in image generation tasks [18] and recently, some progress has been observed in text generation [14, 13, 16] ." }, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-27", "text": "Our generator is a language model trained using backpropagation through time [19] ." }, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-28", "text": "During the pre-training phase we optimize for MLE and during the GAN training phase, we optimize on the creativity reward from the discriminator." }, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-29", "text": "The discriminator's encoder has the same architecture as the generator encoder module with the addition of a pooled decoder layer." }, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-30", "text": "The decoder contains 3 [DenseBatchN ormalization, ReLU ] blocks and an addtional Sigmoid layer." }, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-31", "text": "The discriminator decoder takes the hidden state at the last time step of a sequence concatenated with both the max-pooled and mean-pooled representation of the hidden states [20] and outputs a number in the range [0, 1]." }, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-32", "text": "The difficulty of using GANs in text generation comes from the discrete nature of text, making the model non-differentiable hence, we update parameters for the generator model with policy gradients as described in Yu [16] ." }, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-33", "text": "We utilize AWD-LSTM [21] and TransformerXL [22] based language models." }, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-34", "text": "For model hyperparameters please to refer to Supplementary Section Table 2 ." }, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-35", "text": "We use Adam optimizer [23] with \u03b21 = 0.7 and \u03b22 = 0.8 similar to [20] and use a batch size of 50." 
}, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-36", "text": "Other practices for LM training were the same as [22] and [21] for Transformer-XL and AWD-LSTM respectively." }, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-37", "text": "We refer to our proposed GAN as Creative-GAN and compare it to a baseline (a language model equivalent to our pre-trained generator) and a GumbelGAN model [15] across all proposed datasets." }, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-38", "text": "We use three creative English datasets with distinct linguistic characteristics: (1) A corpus of 740 classical and contemporary English poems, (2) a corpus of 14950 metaphor sentences retrieved from a metaphor database website 1 and (3) a corpus of 1500 song lyrics ranging across genres." }, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-39", "text": "The mix of linguistic styles within this corpus offers the potential for interesting variation during the generation phase." }, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-40", "text": "We use the same pre-processing as in earlier work [20, 24] ." }, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-41", "text": "We reserve 10% of our data for test set and another 10% for our validation set." }, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-42", "text": "We first pre-train our generator on the Gutenberg dataset [25] for 20 epochs and then fine-tune [20] them to our target datasets with a language modeling objective." }, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-43", "text": "The discriminator's encoder is initialized to the same weights as our fine-tuned language model." }, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-44", "text": "Once we have our fine-tuned encoders for each target dataset, we train in an adversarial manner." }, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-45", "text": "The discriminator objective here is to score the quality of the creative text." 
}, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-46", "text": "The discriminator is trained for 3 iterations for every iteration of the generator, a practice seen in previous work [26] ." }, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-47", "text": "Creative-GAN relies on using the reward from the discriminator [13, 16] for backpropagation." }, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-48", "text": "We follow a similar training procedure for GumbelGAN." }, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-49", "text": "Outputs are generated through sampling over a multinomial distribution for all methods, instead of argmax on the log-likelihood probabilities, as sampling has shown to produce better output quality [5] . Please refer to Supplementary Section Table 3 for training parameters of each dataset and Table 2 for hyperparameters of each encoder." }, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-50", "text": "We pick these values after experimentation with our validation set." }, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-51", "text": "Training and output generation code can be found online 2 ." }, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-52", "text": "----------------------------------" }, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-53", "text": "**EVALUATION AND CONCLUSION**" }, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-55", "text": "Along the lines of previous research on evaluating text generation tasks [27] , we report the perplexity scores of our test set on the evaluated models in the Supplementary Section, Table 1 Our model shows improvements over baseline and GumbelGAN." }, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-56", "text": "Common computational methods like BLEU [28] and perplexity are at best a heuristic and not strong indicators of good performance in text generation models [29] ." 
}, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-57", "text": "Particularly, since these scores use target sequences as a reference, it has the same pitfalls as relying on MLE." }, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-58", "text": "The advantages in this approach lie in the discriminator's ability to influence the generator to explore other possibilities." }, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-59", "text": "Sample outputs for our model can be found on our website 3 ." }, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-60", "text": "----------------------------------" }, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-61", "text": "**SUPPLEMENTARY MATERIAL**" }, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-62", "text": "In this section, we report our results on computational metrics, hyperparameters and training configurations for our models." }, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-63", "text": "Table 1 shows the results of the perplexity score evaluation of the evaluated models, Table 2 shows hyperparameters for each encoding method and Table 3 shows our training parameters." }, { "sent_id": "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-64", "text": "In Table 3 Table 3 : Training Parameters" } ], "y": { "@BACK@": { "gold_contexts": [ [ "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-10" ], [ "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-17" ], [ "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-22" ] ], "cite_sentences": [ "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-10", "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-17", "f3b1a39203ebf0725d8dd2b8f8c7a9-C001-22" ] } } }, "ABC_19a62878f72c84d1c5c83a9a8cdeff_44": { "x": [ { "sent_id": "19a62878f72c84d1c5c83a9a8cdeff-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "19a62878f72c84d1c5c83a9a8cdeff-C001-2", "text": "Liu et al's reflections on the term dependency length minimization [1] may look anecdotal but they are not." 
}, { "sent_id": "19a62878f72c84d1c5c83a9a8cdeff-C001-3", "text": "By the turn of the 20th century, we put forward a \"Euclidean distance minimization\" hypothesis for the distance between syntactically linked words and various word order phenomena [2, 3] 1 ." }, { "sent_id": "19a62878f72c84d1c5c83a9a8cdeff-C001-4", "text": "Later on, pressure from language researchers forced us to replace it with terms such as \"online memory minimization\" [5] because our initial formulation was obscure to them." }, { "sent_id": "19a62878f72c84d1c5c83a9a8cdeff-C001-5", "text": "Recently, researchers from all over the world have been granted to use the term \"dependency length minimization\" by the popes thanks to whom [6] came into light." }, { "sent_id": "19a62878f72c84d1c5c83a9a8cdeff-C001-6", "text": "Although \"length\" is a particular case of distance in this context and thus downsizes our original formulation, it is still abstract enough to allow for progress in theoretical research [7] and frees us from the heavy burden of contingency, i.e. the real implementation of the principle (at present believed to result from decay and interference as reviewed by Liu et al) or the current view of the architecture of memory [8, 9] ." }, { "sent_id": "19a62878f72c84d1c5c83a9a8cdeff-C001-7", "text": "Our position is grounded on the high predictive power of that principle per se [5] ." }, { "sent_id": "19a62878f72c84d1c5c83a9a8cdeff-C001-8", "text": "However, the lower generality of the term \"dependency length\" can be an" }, { "sent_id": "19a62878f72c84d1c5c83a9a8cdeff-C001-9", "text": "Email address: rferrericancho@cs.upc.edu (R. Ferrer-i-Cancho) 1 These were pieces of our PhD thesis [4] that were submitted for publication before its defense." }, { "sent_id": "19a62878f72c84d1c5c83a9a8cdeff-C001-10", "text": "June 16, 2017 obstacle to the construction of a fully-fledged scientific field [10] ." 
}, { "sent_id": "19a62878f72c84d1c5c83a9a8cdeff-C001-11", "text": "First, distance minimization allows one to unify pressure to reduce dependency lengths (still distances) with constraints on word order variation and change arising from a principle of swap distance minimization [11] . \"Distance minimization\" has therefore a higher predictive power and greater utility in a general theory of communication." }, { "sent_id": "19a62878f72c84d1c5c83a9a8cdeff-C001-12", "text": "Second, distance provides a \"formal background\" or a \"specific background\" (following Bunge's terminology [10]) from physics or mathematics such as the theory of geographical or spatial networks (where the syntactic dependency structures of sentences are particular cases in one dimension) [12, 13] or the theory for the distance between like elements in sequences (where the couple of words involved in a syntactic dependency are particular cases of like elements) [14] ." }, { "sent_id": "19a62878f72c84d1c5c83a9a8cdeff-C001-13", "text": "Therefore we agree with [1] on the convenience of the term distance." }, { "sent_id": "19a62878f72c84d1c5c83a9a8cdeff-C001-14", "text": "A less flashy contribution of [6] has been promoting the need of controlling for sentence length (as a predictor of dependency length in their mixed-effects regression model) in research on dependency length minimization, an important methodological issue [15] that was addressed early [2] but neglected in subsequent research (e.g., [16, 17, 18] )." }, { "sent_id": "19a62878f72c84d1c5c83a9a8cdeff-C001-15", "text": "Liu et al focus their review on the fundamental principle of dependency length minimization but understanding how it interacts with other principles is vital." }, { "sent_id": "19a62878f72c84d1c5c83a9a8cdeff-C001-16", "text": "In 2009, we put forward another fundamental word order principle, i.e. 
predictability maximization, and presented a theoretical framework culminating in a conflict between dependency length minimization and predictability maximization [19] ." }, { "sent_id": "19a62878f72c84d1c5c83a9a8cdeff-C001-17", "text": "For sociological reasons, these arguments started appearing in print many years later [20, 5, 21] ." }, { "sent_id": "19a62878f72c84d1c5c83a9a8cdeff-C001-18", "text": "For the case of a single head and its dependents, the minimization of dependency lengths yields that the head should be placed at the center of the sequence whereas the principle of predictability maximization (or uncertainty minimization) yields that the head should be placed at one of the ends of the sequence (last if the head is the target of the prediction; first otherwise) [21, 20] ." }, { "sent_id": "19a62878f72c84d1c5c83a9a8cdeff-C001-19", "text": "Liu et al review two major sources of evidence of dependency length minimization: the analysis of dependency treebanks and psychological experiments." }, { "sent_id": "19a62878f72c84d1c5c83a9a8cdeff-C001-20", "text": "A critical difference between them is that the former is based on the calculation of the total cost of the sentence (as a sum or mean of all the dependency lengths of the sentence) while the latter is based on a partial calculation and thus it can be misleading." }, { "sent_id": "19a62878f72c84d1c5c83a9a8cdeff-C001-21", "text": "Suppose that one wishes to compare the cost of two orderings of the same sentence." 
}, { "sent_id": "19a62878f72c84d1c5c83a9a8cdeff-C001-22", "text": "The observation that the processing cost of a sentence decreases when the length of a dependency 2 increases, does not allow one to conclude that dependency length minimization cannot explain the results because shortening an edge implies moving at least one of the words defining it, and every move could imply the reduction of other edges eventually reducing the total sum of dependency lengths or altering the so-called complexity profile (e.g., [22] ), rendering fair comparison impractical." }, { "sent_id": "19a62878f72c84d1c5c83a9a8cdeff-C001-23", "text": "The problem of partial calculation of length costs has already been discussed in the context of research on the cost of crossing dependencies [23] and worsens when the sentences being compared differ not only in order but also in content." }, { "sent_id": "19a62878f72c84d1c5c83a9a8cdeff-C001-24", "text": "Another challenge is the precision of dependency length that is typically measured in words." }, { "sent_id": "19a62878f72c84d1c5c83a9a8cdeff-C001-25", "text": "Lengths in phonemes or syllables shed light on why SVO languages show SOV order when the object is a short word such as a clitic [24] ." }, { "sent_id": "19a62878f72c84d1c5c83a9a8cdeff-C001-26", "text": "Without addressing these issues, the anti-locality effects or long-distance dependencies reviewed by Liu et al can neither be attributed to predictability maximization nor be interpreted as a violation of dependency length minimization safely; an effective evaluation of the theoretical framework above can be impossible (as that framework makes theoretical predictions based on the calculation of full length costs)." 
}, { "sent_id": "19a62878f72c84d1c5c83a9a8cdeff-C001-27", "text": "The real challenge for psycholinguistic research is not the extent to which the theoretical framework above is supported by current results in the lab but rather to increase the precision of dependency length measurements and investigate the experimental conditions in which the following theoretical predictions are observed [20, 21] : one principle beating the other, coexistence, collaboration between principles or the very same trade-off causing the delusion that word order constraints have relaxed dramatically or even disappeared." }, { "sent_id": "19a62878f72c84d1c5c83a9a8cdeff-C001-28", "text": "This is the way of physics." }, { "sent_id": "19a62878f72c84d1c5c83a9a8cdeff-C001-29", "text": "Our concern for units of measurement is not a simple matter of precision but one of great theoretical importance: if the length of a dependency is measured in units of word length (e.g., syllables or phonemes) then it follows that the length of a dependency will be strongly determined by the length of the words defining the dependency and that of the intermediate words." }, { "sent_id": "19a62878f72c84d1c5c83a9a8cdeff-C001-30", "text": "Therefore, pressure to reduce dependency lengths implies pressure for compression [25, 26] , linking a principle of word order with a principle that operates (nonexclusively) on individual words." }, { "sent_id": "19a62878f72c84d1c5c83a9a8cdeff-C001-31", "text": "An understanding of how the principle of dependency length minimization interacts with other highly predictive principles beyond word order is a fundamental component of a general theory of animal behavior that has human language as a particular case." 
}, { "sent_id": "19a62878f72c84d1c5c83a9a8cdeff-C001-32", "text": "3 Acknowledgments" }, { "sent_id": "19a62878f72c84d1c5c83a9a8cdeff-C001-33", "text": "----------------------------------" }, { "sent_id": "19a62878f72c84d1c5c83a9a8cdeff-C001-34", "text": "****" }, { "sent_id": "19a62878f72c84d1c5c83a9a8cdeff-C001-35", "text": "Liu et al's reflections on the term dependency length minimization [1] may look anecdotal but they are not." }, { "sent_id": "19a62878f72c84d1c5c83a9a8cdeff-C001-36", "text": "By the turn of the 20th century, we put forward a \"Euclidean distance minimization\" hypothesis for the distance between syntactically linked words and various word order phenomena [2, 3] 1 ." }, { "sent_id": "19a62878f72c84d1c5c83a9a8cdeff-C001-37", "text": "Later on, pressure from language researchers forced us to replace it with terms such as \"online memory minimization\" [5] because our initial formulation was obscure to them." }, { "sent_id": "19a62878f72c84d1c5c83a9a8cdeff-C001-38", "text": "Recently, researchers from all over the world have been granted to use the term \"dependency length minimization\" by the popes thanks to whom [6] came into light." }, { "sent_id": "19a62878f72c84d1c5c83a9a8cdeff-C001-39", "text": "Although \"length\" is a particular case of distance in this context and thus downsizes our original formulation, it is still abstract enough to allow for progress in theoretical research [7] and frees us from the heavy burden of contingency, i.e. the real implementation of the principle (at present believed to result from decay and interference as reviewed by Liu et al) or the current view of the architecture of memory [8, 9] ." }, { "sent_id": "19a62878f72c84d1c5c83a9a8cdeff-C001-40", "text": "Our position is grounded on the high predictive power of that principle per se [5] ." 
}, { "sent_id": "19a62878f72c84d1c5c83a9a8cdeff-C001-41", "text": "However, the lower generality of the term \"dependency length\" can be an obstacle to the construction of a fully-fledged scientific field [10] ." }, { "sent_id": "19a62878f72c84d1c5c83a9a8cdeff-C001-42", "text": "First, distance minimization allows one to unify pressure to reduce dependency lengths (still distances) with constraints on word order variation and change arising from a principle of swap distance minimization [11] . \"Distance minimization\" has therefore a higher predictive power and greater utility in a general theory of communication." }, { "sent_id": "19a62878f72c84d1c5c83a9a8cdeff-C001-43", "text": "Second, distance provides a \"formal background\" or a \"specific background\" (following Bunge's terminology [10] ) from physics or mathematics such as the theory of geographical or spatial networks (where the syntactic dependency structures of sentences are particular cases in one dimension) [12, 13] or the theory for the distance between like elements in sequences (where the couple of words involved in a syntactic dependency are particular cases of like elements) [14] ." }, { "sent_id": "19a62878f72c84d1c5c83a9a8cdeff-C001-44", "text": "Therefore we agree with [1] on the convenience of the term distance." }, { "sent_id": "19a62878f72c84d1c5c83a9a8cdeff-C001-45", "text": "A less flashy contribution of [6] has been promoting the need of controlling for sentence length (as a predictor of dependency length in their mixed-effects regression model) in research on dependency length minimization, an important methodological issue [15] that was addressed early [2] but neglected in subsequent research (e.g., [16, 17, 18] )." }, { "sent_id": "19a62878f72c84d1c5c83a9a8cdeff-C001-46", "text": "Liu et al focus their review on the fundamental principle of dependency length minimization but understanding how it interacts with other principles is vital." 
}, { "sent_id": "19a62878f72c84d1c5c83a9a8cdeff-C001-47", "text": "In 2009, we put forward another fundamental word order principle, i.e. predictability maximization, and presented a theoretical framework culminating in a conflict between dependency length minimization and predictability maximization [19] ." }, { "sent_id": "19a62878f72c84d1c5c83a9a8cdeff-C001-48", "text": "For sociological reasons, these arguments started appearing in print many years later [20, 5, 21] ." }, { "sent_id": "19a62878f72c84d1c5c83a9a8cdeff-C001-49", "text": "For the case of a single head and its dependents, the minimization of dependency lengths yields that the head should be placed at the center of the sequence whereas the principle of predictability maximization (or uncertainty minimization) yields that the head should be placed at one of the ends of the sequence (last if the head is the target of the prediction; first otherwise) [21, 20] ." }, { "sent_id": "19a62878f72c84d1c5c83a9a8cdeff-C001-50", "text": "Liu et al review two major sources of evidence of dependency length minimization: the analysis of dependency treebanks and psychological experiments." }, { "sent_id": "19a62878f72c84d1c5c83a9a8cdeff-C001-51", "text": "A critical difference between them is that the former is based on the calculation of the total cost of the sentence (as a sum or mean of all the dependency lengths of the sentence) while the latter is based on a partial calculation and thus it can be misleading." }, { "sent_id": "19a62878f72c84d1c5c83a9a8cdeff-C001-52", "text": "Suppose that one wishes to compare the cost of two orderings of the same sentence." 
}, { "sent_id": "19a62878f72c84d1c5c83a9a8cdeff-C001-53", "text": "The observation that the processing cost of a sentence decreases when the length of a dependency increases, does not allow one to conclude that dependency length minimization cannot explain the results because shortening an edge implies moving at least one of the words defining it, and every move could imply the reduction of other edges eventually reducing the total sum of dependency lengths or altering the so-called complexity profile (e.g., [22] ), rendering fair comparison impractical." }, { "sent_id": "19a62878f72c84d1c5c83a9a8cdeff-C001-54", "text": "The problem of partial calculation of length costs has already been discussed in the context of research on the cost of crossing dependencies [23] and worsens when the sentences being compared differ not only in order but also in content." }, { "sent_id": "19a62878f72c84d1c5c83a9a8cdeff-C001-55", "text": "Another challenge is the precision of dependency length that is typically measured in words." }, { "sent_id": "19a62878f72c84d1c5c83a9a8cdeff-C001-56", "text": "Lengths in phonemes or syllables shed light on why SVO languages show SOV order when the object is a short word such as a clitic [24] ." }, { "sent_id": "19a62878f72c84d1c5c83a9a8cdeff-C001-57", "text": "Without addressing these issues, the anti-locality effects or long-distance dependencies reviewed by Liu et al can neither be attributed to predictability maximization nor be interpreted as a violation of dependency length minimization safely; an effective evaluation of the theoretical framework above can be impossible (as that framework makes theoretical predictions based on the calculation of full length costs)." 
}, { "sent_id": "19a62878f72c84d1c5c83a9a8cdeff-C001-58", "text": "The real challenge for psycholinguistic research is not the extent to which the theoretical framework above is supported by current results in the lab but rather to increase the precision of dependency length measurements and investigate the experimental conditions in which the following theoretical predictions are observed [20, 21] : one principle beating the other, coexistence, collaboration between principles or the very same trade-off causing the delusion that word order constraints have relaxed dramatically or even disappeared." }, { "sent_id": "19a62878f72c84d1c5c83a9a8cdeff-C001-59", "text": "This is the way of physics." }, { "sent_id": "19a62878f72c84d1c5c83a9a8cdeff-C001-60", "text": "Our concern for units of measurement is not a simple matter of precision but one of great theoretical importance: if the length of a dependency is measured in units of word length (e.g., syllables or phonemes) then it follows that the length of a dependency will be strongly determined by the length of the words defining the dependency and that of the intermediate words." }, { "sent_id": "19a62878f72c84d1c5c83a9a8cdeff-C001-61", "text": "Therefore, pressure to reduce dependency lengths implies pressure for compression [25, 26] , linking a principle of word order with a principle that operates (nonexclusively) on individual words." }, { "sent_id": "19a62878f72c84d1c5c83a9a8cdeff-C001-62", "text": "An understanding of how the principle of dependency length minimization interacts with other highly predictive principles beyond word order is a fundamental component of a general theory of animal behavior that has human language as a particular case." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "19a62878f72c84d1c5c83a9a8cdeff-C001-17" ], [ "19a62878f72c84d1c5c83a9a8cdeff-C001-18" ], [ "19a62878f72c84d1c5c83a9a8cdeff-C001-27" ], [ "19a62878f72c84d1c5c83a9a8cdeff-C001-48" ], [ "19a62878f72c84d1c5c83a9a8cdeff-C001-49" ], [ "19a62878f72c84d1c5c83a9a8cdeff-C001-58" ] ], "cite_sentences": [ "19a62878f72c84d1c5c83a9a8cdeff-C001-17", "19a62878f72c84d1c5c83a9a8cdeff-C001-18", "19a62878f72c84d1c5c83a9a8cdeff-C001-27", "19a62878f72c84d1c5c83a9a8cdeff-C001-48", "19a62878f72c84d1c5c83a9a8cdeff-C001-49", "19a62878f72c84d1c5c83a9a8cdeff-C001-58" ] }, "@USE@": { "gold_contexts": [ [ "19a62878f72c84d1c5c83a9a8cdeff-C001-27" ], [ "19a62878f72c84d1c5c83a9a8cdeff-C001-58" ] ], "cite_sentences": [ "19a62878f72c84d1c5c83a9a8cdeff-C001-27", "19a62878f72c84d1c5c83a9a8cdeff-C001-58" ] } } }, "ABC_8d3a20b4e50f81c94e884a0b978575_44": { "x": [ { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-35", "text": "A richness of the vocabulary and the density of the captions." }, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-2", "text": "Annotated datasets are commonly used in the training and evaluation of tasks involving natural language and vision (image description generation, action recognition and visual question answering)." }, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-3", "text": "However, many of the existing datasets reflect problems that emerge in the process of data selection and annotation." }, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-4", "text": "Here we point out some of the difficulties and problems one confronts when creating and validating annotated vision and language datasets." 
}, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-5", "text": "----------------------------------" }, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-7", "text": "Recently, the use of Natural Language Processing (NLP) resources has become increasingly popular among the Computer Vision (CV) community, mostly thanks to the large-scale, easily accessible data from the web, and the growing popularity of online crowdsourcing platforms, such as Amazon Mechanical Turk (AMT)." }, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-8", "text": "Typical vision and language tasks that utilize such resources are image description generation, action and affordance recognition, and visual question answering (VQA)." }, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-10", "text": "In addition, some works aim to utilize both language and vision to create richer multi-modal semantic spaces and vectors [8] ." }, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-9", "text": "This expanding intersection has also been leading to the definition of new vision and language tasks, some of them, in their languageonly form, have been well studied in the NLP community, such as VQA." }, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-11", "text": "Most previous work analyzing vision and language datasets has dealt with the technical aspects of the collected datasets, rather than the data-gathering and annotation techniques used." }, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-12", "text": "1 Work about the annotation process of images mainly focused on the speed, efficiency and cost aspects of the process [9] ." }, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-13", "text": "As far as we know, we are the first to discuss the issue of the quality of annotation content within vision and language datasets." 
}, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-14", "text": "We present some of the major difficulties involving building and validating annotated vision and language resources, discuss their potential effects on results, and comment on combining NLP resources." }, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-15", "text": "1 The survey in [4] presented a large listing of datasets, analyzed by number of images and structure and more advanced criterions like syntactic" }, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-16", "text": "----------------------------------" }, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-17", "text": "**VISION AND LANGUAGE ANNOTATION TASKS**" }, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-18", "text": "Annotation tasks can have the form of multiple choice questions (closed) or open response ones, or a combination of the two." }, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-19", "text": "The first case includes choosing \"yes\"/\"no\" for a given option (e.g. a pair of an action and an object), choosing all true options from a given list (e.g. given an object, pick its attributes or actions) and more." }, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-20", "text": "The Second case includes supplying a full free-form sentence describing an image, picking free words to describe attributes or actions, or fill-in-the-blank a specific attribute or event [15] ." }, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-21", "text": "A combined case can allowed the user to add their own option to the list if needed [13] ." }, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-22", "text": "Tasks can ask the user to refer to the whole image or to only certain parts or objects, and sometimes require them to annotate the regions of some relevant objects." 
}, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-23", "text": "The images can be natural scenes taken by real people, or artificial scenes created with clip arts like in [1] ." }, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-24", "text": "----------------------------------" }, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-25", "text": "**DIMENSIONS FOR COMPARISON**" }, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-26", "text": "From the resources we surveyed, several dimensions emerged that can introduce potential weaknesses and inconsistencies during the creation of annotated datasets." }, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-27", "text": "1. Manual processing." }, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-28", "text": "This concerns both creating and validating the gathered data." }, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-29", "text": "Manual gold sets are widely used in the verification process." }, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-30", "text": "In [2] for example, the set of verbs for each object is filtered manually, and verbs with similar meaning are grouped together manually." }, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-31", "text": "In [6] , verbs for annotation are chosen and filtered manually, and annotations are examined manually too in order to penalize (what the authors see as) common mistakes." }, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-32", "text": "Apart from being inefficient and unscalable, this also creates author bias." }, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-33", "text": "2. Author bias." }, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-34", "text": "This can occur when the gold set to evaluate annotation is created solely by the paper's authors." 
}, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-36", "text": "way to minimize the bias is to start with a small set and bootstrap it according to an annotation guideline, as in [6] , or to use as a gold set the majority of a subset of the annotations, as in [5] ." }, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-37", "text": "3. Limited or sparse vocabulary." }, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-38", "text": "In [2] and [6] annotators can respond only to options created by a predefined set of verbs/actions, and have no way of introducing new terms into the dataset, imposing a heavier burden on initial annotation schema design." }, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-39", "text": "A possible solution can be to let the users add their own action terms, if necessary, as in [13] ." }, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-40", "text": "Not limiting the users at all, on the other hand, can potentially create a dataset with very low and insignificant counts for each term." }, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-41", "text": "This can become a problem when the dataset is small, like in [11] ." }, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-42", "text": "Grouping together similar terms is one way to solve it." }, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-43", "text": "However, such unification can potentially lead to a lose in the fine differences between textually described actions." }, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-44", "text": "An immediate preferable solution would be to largely extend the size of the dataset." }, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-45", "text": "4. Action/visual sense is not well defined." }, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-46", "text": "This becomes a problem when dealing with action or affordance recognition." 
}, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-47", "text": "A well-defined annotation is necessary for both grounding the action to match external evaluations and resources, and for creating consistency among annotators of the same dataset." }, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-48", "text": "This is especially crucial when dealing with data sets that were annotated with binary choices, such as [2, 6] (see Figure 1) ." }, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-49", "text": "5. Annotator's Attention." }, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-50", "text": "When asked to describe an image, people tend to pick easy-to-describe relation (like \"man wearing a t-shirt\") and start with the most salient parts of the image [10] ." }, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-51", "text": "In [11] and [13] workers were asked to annotate action for specific already detected objects, forcing them to focus on objects otherwise might have been forgotten." }, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-52", "text": "6." }, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-53", "text": "Validation and Averaging." }, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-54", "text": "These post-processing steps are especially important when the number of annotators per image is small." }, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-55", "text": "Most of the work surveyed validated the annotations to avoid data that was corrupted for various reasons." }, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-56", "text": "In [7] , a short quiz verified the English level, and in [6] two sets of test annotation questions were used to verify accuracy in submission time as well as to filter out malicious turkers." }, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-57", "text": "Sometimes an averaging step is performed to simplify data from closed annotation tasks." 
}, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-58", "text": "However, when the averaging results in picking the majority of the annotators only (like in [6] ), or in a general non-informative \"ambiguous\" tag (as in [2] ), or when non-agreement is completely ignored as in [12] , this can lead to a possible information loss -just in the more interesting cases." }, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-59", "text": "A better solution would be to weigh according to several confidence levels chosen by the user for each annotation task (vs. just Y/N) (as in [3] ), or according to the frequency (agreement) among annotators (as in [13] )." }, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-60", "text": "----------------------------------" }, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-61", "text": "**USING NLP RESOURCES**" }, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-62", "text": "Many of the more recent sources we reviewed used NLP tools and resources in some capacity." }, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-63", "text": "They are usually deployed as a \"single application\" solution, in order to expand the vocabulary or enrich the vision-similarity equations with semantic data, without further use." }, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-64", "text": "However, it is important to understand the expected behavior of such tools." }, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-65", "text": "For example, semantic vector space models (like Word2Vec) cannot distinguish between synonyms and antonyms, since they tend to appear in the same context [14] ." }, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-66", "text": "This is particularly important when dealing with binary attributes and yes/no questions." 
}, { "sent_id": "8d3a20b4e50f81c94e884a0b978575-C001-67", "text": "2" } ], "y": { "@SIM@": { "gold_contexts": [ [ "8d3a20b4e50f81c94e884a0b978575-C001-23" ] ], "cite_sentences": [ "8d3a20b4e50f81c94e884a0b978575-C001-23" ] } } }, "ABC_06de9a8e72b832beea9c2f17e0862a_44": { "x": [ { "sent_id": "06de9a8e72b832beea9c2f17e0862a-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "06de9a8e72b832beea9c2f17e0862a-C001-2", "text": "Liu et al's reflections on the term dependency length minimization [1] may look anecdotal but they are not." }, { "sent_id": "06de9a8e72b832beea9c2f17e0862a-C001-3", "text": "By the turn of the 20th century, we put forward a \"Euclidean distance minimization\" hypothesis for the distance between syntactically linked words and various word order phenomena [2, 3] 1 ." }, { "sent_id": "06de9a8e72b832beea9c2f17e0862a-C001-4", "text": "Later on, pressure from language researchers forced us to replace it with terms such as \"online memory minimization\" [5] because our initial formulation was obscure to them." }, { "sent_id": "06de9a8e72b832beea9c2f17e0862a-C001-5", "text": "Recently, researchers from all over the world have been granted to use the term \"dependency length minimization\" by the popes thanks to whom [6] came into light." }, { "sent_id": "06de9a8e72b832beea9c2f17e0862a-C001-6", "text": "Although \"length\" is a particular case of distance in this context and thus downsizes our original formulation, it is still abstract enough to allow for progress in theoretical research [7] and frees us from the heavy burden of contingency, i.e. the real implementation of the principle (at present believed to result from decay and interference as reviewed by Liu et al) or the current view of the architecture of memory [8, 9] ." }, { "sent_id": "06de9a8e72b832beea9c2f17e0862a-C001-7", "text": "Our position is grounded on the high predictive power of that principle per se [5] ." 
}, { "sent_id": "06de9a8e72b832beea9c2f17e0862a-C001-8", "text": "However, the lower generality of the term \"dependency length\" can be an" }, { "sent_id": "06de9a8e72b832beea9c2f17e0862a-C001-9", "text": "Email address: rferrericancho@cs.upc.edu (R. Ferrer-i-Cancho) 1 These were pieces of our PhD thesis [4] that were submitted for publication before its defense." }, { "sent_id": "06de9a8e72b832beea9c2f17e0862a-C001-10", "text": "June 16, 2017 obstacle to the construction of a fully-fledged scientific field [10] ." }, { "sent_id": "06de9a8e72b832beea9c2f17e0862a-C001-11", "text": "First, distance minimization allows one to unify pressure to reduce dependency lengths (still distances) with constraints on word order variation and change arising from a principle of swap distance minimization [11] . \"Distance minimization\" has therefore a higher predictive power and greater utility in a general theory of communication." }, { "sent_id": "06de9a8e72b832beea9c2f17e0862a-C001-12", "text": "Second, distance provides a \"formal background\" or a \"specific background\" (following Bunge's terminology [10]) from physics or mathematics such as the theory of geographical or spatial networks (where the syntactic dependency structures of sentences are particular cases in one dimension) [12, 13] or the theory for the distance between like elements in sequences (where the couple of words involved in a syntactic dependency are particular cases of like elements) [14] ." }, { "sent_id": "06de9a8e72b832beea9c2f17e0862a-C001-13", "text": "Therefore we agree with [1] on the convenience of the term distance." 
}, { "sent_id": "06de9a8e72b832beea9c2f17e0862a-C001-14", "text": "A less flashy contribution of [6] has been promoting the need of controlling for sentence length (as a predictor of dependency length in their mixed-effects regression model) in research on dependency length minimization, an important methodological issue [15] that was addressed early [2] but neglected in subsequent research (e.g., [16, 17, 18] )." }, { "sent_id": "06de9a8e72b832beea9c2f17e0862a-C001-15", "text": "Liu et al focus their review on the fundamental principle of dependency length minimization but understanding how it interacts with other principles is vital." }, { "sent_id": "06de9a8e72b832beea9c2f17e0862a-C001-16", "text": "In 2009, we put forward another fundamental word order principle, i.e. predictability maximization, and presented a theoretical framework culminating in a conflict between dependency length minimization and predictability maximization [19] ." }, { "sent_id": "06de9a8e72b832beea9c2f17e0862a-C001-17", "text": "For sociological reasons, these arguments started appearing in print many years later [20, 5, 21] ." }, { "sent_id": "06de9a8e72b832beea9c2f17e0862a-C001-18", "text": "For the case of a single head and its dependents, the minimization of dependency lengths yields that the head should be placed at the center of the sequence whereas the principle of predictability maximization (or uncertainty minimization) yields that the head should be placed at one of the ends of the sequence (last if the head is the target of the prediction; first otherwise) [21, 20] ." }, { "sent_id": "06de9a8e72b832beea9c2f17e0862a-C001-19", "text": "Liu et al review two major sources of evidence of dependency length minimization: the analysis of dependency treebanks and psychological experiments." 
}, { "sent_id": "06de9a8e72b832beea9c2f17e0862a-C001-20", "text": "A critical difference between them is that the former is based on the calculation of the total cost of the sentence (as a sum or mean of all the dependency lengths of the sentence) while the latter is based on a partial calculation and thus it can be misleading." }, { "sent_id": "06de9a8e72b832beea9c2f17e0862a-C001-21", "text": "Suppose that one wishes to compare the cost of two orderings of the same sentence." }, { "sent_id": "06de9a8e72b832beea9c2f17e0862a-C001-22", "text": "The observation that the processing cost of a sentence decreases when the length of a dependency 2 increases, does not allow one to conclude that dependency length minimization cannot explain the results because shortening an edge implies moving at least one of the words defining it, and every move could imply the reduction of other edges eventually reducing the total sum of dependency lengths or altering the so-called complexity profile (e.g., [22] ), rendering fair comparison impractical." }, { "sent_id": "06de9a8e72b832beea9c2f17e0862a-C001-23", "text": "The problem of partial calculation of length costs has already been discussed in the context of research on the cost of crossing dependencies [23] and worsens when the sentences being compared differ not only in order but also in content." }, { "sent_id": "06de9a8e72b832beea9c2f17e0862a-C001-24", "text": "Another challenge is the precision of dependency length that is typically measured in words." }, { "sent_id": "06de9a8e72b832beea9c2f17e0862a-C001-25", "text": "Lengths in phonemes or syllables shed light on why SVO languages show SOV order when the object is a short word such as a clitic [24] ." 
}, { "sent_id": "06de9a8e72b832beea9c2f17e0862a-C001-26", "text": "Without addressing these issues, the anti-locality effects or long-distance dependencies reviewed by Liu et al can neither be attributed to predictability maximization nor be interpreted as a violation of dependency length minimization safely; an effective evaluation of the theoretical framework above can be impossible (as that framework makes theoretical predictions based on the calculation of full length costs)." }, { "sent_id": "06de9a8e72b832beea9c2f17e0862a-C001-27", "text": "The real challenge for psycholinguistic research is not the extent to which the theoretical framework above is supported by current results in the lab but rather to increase the precision of dependency length measurements and investigate the experimental conditions in which the following theoretical predictions are observed [20, 21] : one principle beating the other, coexistence, collaboration between principles or the very same trade-off causing the delusion that word order constraints have relaxed dramatically or even disappeared." }, { "sent_id": "06de9a8e72b832beea9c2f17e0862a-C001-28", "text": "This is the way of physics." }, { "sent_id": "06de9a8e72b832beea9c2f17e0862a-C001-29", "text": "Our concern for units of measurement is not a simple matter of precision but one of great theoretical importance: if the length of a dependency is measured in units of word length (e.g., syllables or phonemes) then it follows that the length of a dependency will be strongly determined by the length of the words defining the dependency and that of the intermediate words." }, { "sent_id": "06de9a8e72b832beea9c2f17e0862a-C001-30", "text": "Therefore, pressure to reduce dependency lengths implies pressure for compression [25, 26] , linking a principle of word order with a principle that operates (nonexclusively) on individual words." 
}, { "sent_id": "06de9a8e72b832beea9c2f17e0862a-C001-31", "text": "An understanding of how the principle of dependency length minimization interacts with other highly predictive principles beyond word order is a fundamental component of a general theory of animal behavior that has human language as a particular case." }, { "sent_id": "06de9a8e72b832beea9c2f17e0862a-C001-32", "text": "3 Acknowledgments" }, { "sent_id": "06de9a8e72b832beea9c2f17e0862a-C001-33", "text": "----------------------------------" }, { "sent_id": "06de9a8e72b832beea9c2f17e0862a-C001-34", "text": "****" }, { "sent_id": "06de9a8e72b832beea9c2f17e0862a-C001-35", "text": "Liu et al's reflections on the term dependency length minimization [1] may look anecdotal but they are not." }, { "sent_id": "06de9a8e72b832beea9c2f17e0862a-C001-36", "text": "By the turn of the 20th century, we put forward a \"Euclidean distance minimization\" hypothesis for the distance between syntactically linked words and various word order phenomena [2, 3] 1 ." }, { "sent_id": "06de9a8e72b832beea9c2f17e0862a-C001-37", "text": "Later on, pressure from language researchers forced us to replace it with terms such as \"online memory minimization\" [5] because our initial formulation was obscure to them." }, { "sent_id": "06de9a8e72b832beea9c2f17e0862a-C001-38", "text": "Recently, researchers from all over the world have been granted to use the term \"dependency length minimization\" by the popes thanks to whom [6] came into light." }, { "sent_id": "06de9a8e72b832beea9c2f17e0862a-C001-39", "text": "Although \"length\" is a particular case of distance in this context and thus downsizes our original formulation, it is still abstract enough to allow for progress in theoretical research [7] and frees us from the heavy burden of contingency, i.e. the real implementation of the principle (at present believed to result from decay and interference as reviewed by Liu et al) or the current view of the architecture of memory [8, 9] ." 
}, { "sent_id": "06de9a8e72b832beea9c2f17e0862a-C001-40", "text": "Our position is grounded on the high predictive power of that principle per se [5] ." }, { "sent_id": "06de9a8e72b832beea9c2f17e0862a-C001-41", "text": "However, the lower generality of the term \"dependency length\" can be an obstacle to the construction of a fully-fledged scientific field [10] ." }, { "sent_id": "06de9a8e72b832beea9c2f17e0862a-C001-42", "text": "First, distance minimization allows one to unify pressure to reduce dependency lengths (still distances) with constraints on word order variation and change arising from a principle of swap distance minimization [11] . \"Distance minimization\" has therefore a higher predictive power and greater utility in a general theory of communication." }, { "sent_id": "06de9a8e72b832beea9c2f17e0862a-C001-43", "text": "Second, distance provides a \"formal background\" or a \"specific background\" (following Bunge's terminology [10] ) from physics or mathematics such as the theory of geographical or spatial networks (where the syntactic dependency structures of sentences are particular cases in one dimension) [12, 13] or the theory for the distance between like elements in sequences (where the couple of words involved in a syntactic dependency are particular cases of like elements) [14] ." }, { "sent_id": "06de9a8e72b832beea9c2f17e0862a-C001-44", "text": "Therefore we agree with [1] on the convenience of the term distance." }, { "sent_id": "06de9a8e72b832beea9c2f17e0862a-C001-45", "text": "A less flashy contribution of [6] has been promoting the need of controlling for sentence length (as a predictor of dependency length in their mixed-effects regression model) in research on dependency length minimization, an important methodological issue [15] that was addressed early [2] but neglected in subsequent research (e.g., [16, 17, 18] )." 
}, { "sent_id": "06de9a8e72b832beea9c2f17e0862a-C001-46", "text": "Liu et al focus their review on the fundamental principle of dependency length minimization but understanding how it interacts with other principles is vital." }, { "sent_id": "06de9a8e72b832beea9c2f17e0862a-C001-47", "text": "In 2009, we put forward another fundamental word order principle, i.e. predictability maximization, and presented a theoretical framework culminating in a conflict between dependency length minimization and predictability maximization [19] ." }, { "sent_id": "06de9a8e72b832beea9c2f17e0862a-C001-48", "text": "For sociological reasons, these arguments started appearing in print many years later [20, 5, 21] ." }, { "sent_id": "06de9a8e72b832beea9c2f17e0862a-C001-49", "text": "For the case of a single head and its dependents, the minimization of dependency lengths yields that the head should be placed at the center of the sequence whereas the principle of predictability maximization (or uncertainty minimization) yields that the head should be placed at one of the ends of the sequence (last if the head is the target of the prediction; first otherwise) [21, 20] ." }, { "sent_id": "06de9a8e72b832beea9c2f17e0862a-C001-50", "text": "Liu et al review two major sources of evidence of dependency length minimization: the analysis of dependency treebanks and psychological experiments." }, { "sent_id": "06de9a8e72b832beea9c2f17e0862a-C001-51", "text": "A critical difference between them is that the former is based on the calculation of the total cost of the sentence (as a sum or mean of all the dependency lengths of the sentence) while the latter is based on a partial calculation and thus it can be misleading." }, { "sent_id": "06de9a8e72b832beea9c2f17e0862a-C001-52", "text": "Suppose that one wishes to compare the cost of two orderings of the same sentence." 
}, { "sent_id": "06de9a8e72b832beea9c2f17e0862a-C001-53", "text": "The observation that the processing cost of a sentence decreases when the length of a dependency increases, does not allow one to conclude that dependency length minimization cannot explain the results because shortening an edge implies moving at least one of the words defining it, and every move could imply the reduction of other edges eventually reducing the total sum of dependency lengths or altering the so-called complexity profile (e.g., [22] ), rendering fair comparison impractical." }, { "sent_id": "06de9a8e72b832beea9c2f17e0862a-C001-54", "text": "The problem of partial calculation of length costs has already been discussed in the context of research on the cost of crossing dependencies [23] and worsens when the sentences being compared differ not only in order but also in content." }, { "sent_id": "06de9a8e72b832beea9c2f17e0862a-C001-55", "text": "Another challenge is the precision of dependency length that is typically measured in words." }, { "sent_id": "06de9a8e72b832beea9c2f17e0862a-C001-56", "text": "Lengths in phonemes or syllables shed light on why SVO languages show SOV order when the object is a short word such as a clitic [24] ." }, { "sent_id": "06de9a8e72b832beea9c2f17e0862a-C001-57", "text": "Without addressing these issues, the anti-locality effects or long-distance dependencies reviewed by Liu et al can neither be attributed to predictability maximization nor be interpreted as a violation of dependency length minimization safely; an effective evaluation of the theoretical framework above can be impossible (as that framework makes theoretical predictions based on the calculation of full length costs)." 
}, { "sent_id": "06de9a8e72b832beea9c2f17e0862a-C001-58", "text": "The real challenge for psycholinguistic research is not the extent to which the theoretical framework above is supported by current results in the lab but rather to increase the precision of dependency length measurements and investigate the experimental conditions in which the following theoretical predictions are observed [20, 21] : one principle beating the other, coexistence, collaboration between principles or the very same trade-off causing the delusion that word order constraints have relaxed dramatically or even disappeared." }, { "sent_id": "06de9a8e72b832beea9c2f17e0862a-C001-59", "text": "This is the way of physics." }, { "sent_id": "06de9a8e72b832beea9c2f17e0862a-C001-60", "text": "Our concern for units of measurement is not a simple matter of precision but one of great theoretical importance: if the length of a dependency is measured in units of word length (e.g., syllables or phonemes) then it follows that the length of a dependency will be strongly determined by the length of the words defining the dependency and that of the intermediate words." }, { "sent_id": "06de9a8e72b832beea9c2f17e0862a-C001-61", "text": "Therefore, pressure to reduce dependency lengths implies pressure for compression [25, 26] , linking a principle of word order with a principle that operates (nonexclusively) on individual words." }, { "sent_id": "06de9a8e72b832beea9c2f17e0862a-C001-62", "text": "An understanding of how the principle of dependency length minimization interacts with other highly predictive principles beyond word order is a fundamental component of a general theory of animal behavior that has human language as a particular case." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "06de9a8e72b832beea9c2f17e0862a-C001-4" ], [ "06de9a8e72b832beea9c2f17e0862a-C001-17" ], [ "06de9a8e72b832beea9c2f17e0862a-C001-37" ], [ "06de9a8e72b832beea9c2f17e0862a-C001-48" ] ], "cite_sentences": [ "06de9a8e72b832beea9c2f17e0862a-C001-4", "06de9a8e72b832beea9c2f17e0862a-C001-17", "06de9a8e72b832beea9c2f17e0862a-C001-37", "06de9a8e72b832beea9c2f17e0862a-C001-48" ] }, "@USE@": { "gold_contexts": [ [ "06de9a8e72b832beea9c2f17e0862a-C001-4" ], [ "06de9a8e72b832beea9c2f17e0862a-C001-7" ], [ "06de9a8e72b832beea9c2f17e0862a-C001-37" ], [ "06de9a8e72b832beea9c2f17e0862a-C001-40" ] ], "cite_sentences": [ "06de9a8e72b832beea9c2f17e0862a-C001-4", "06de9a8e72b832beea9c2f17e0862a-C001-7", "06de9a8e72b832beea9c2f17e0862a-C001-37", "06de9a8e72b832beea9c2f17e0862a-C001-40" ] } } }, "ABC_1b424cab4d7008997a31be8c2e5198_44": { "x": [ { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-50", "text": "Gradients are clipped to a maximum norm of 5.0." }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-2", "text": "In many neural models, new features as polynomial functions of existing ones are used to augment representations." }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-3", "text": "Using the natural language inference task as an example, we investigate the use of scaled polynomials of degree 2 and above as matching features." }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-4", "text": "We find that scaling degree 2 features has the highest impact on performance, reducing classification error by 5% in the best models." 
}, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-5", "text": "----------------------------------" }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-7", "text": "In many tasks in natural language processing, it is necessary to match or compare two distributed representations." }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-8", "text": "These representations may refer to whole sentences, word contexts or any other construct." }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-9", "text": "For concreteness, let u and v be two representations we want to match." }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-10", "text": "In order to facilitate the matching, it is often beneficial to explicitly create new features like element-wise absolute difference (|u\u2212v|) and element-wise product (u \u00b7 v) that augment u and v. The combined feature vector is then processed by further layers in the task specific neural network." }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-11", "text": "For example, Tai et al. (2015) use these heuristics to improve semantic representations." }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-12", "text": "Most notably, for the natural language inference task, augmenting the hypothesis (u) and premise (v) representations with |u\u2212v| and u \u00b7 v considerably improves performance in a siamese architecture Mou et al. (2016) ." }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-13", "text": "This is also used in the more sophisticated models of Chen et al. (2017) , where u and v represent word contexts." }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-14", "text": "Several of these approaches are explored in the Compare-and-Aggregate framework by Wang & Jiang (2017) ." 
}, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-15", "text": "In this paper we focus on polynomial features like u \u00b7 v for the natural language inference task, where it is trying to capture similarity between u and v. It is also a monomial of degree 2." }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-16", "text": "We investigate two aspects of such terms -the use of scaling and the use of higher degree polynomials." }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-17", "text": "The motivation for the former is the following." }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-18", "text": "The values taken by individual elements of u and u \u00b7 v will in general have slightly different statistical distributions." }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-19", "text": "For example, if elements of both u and v are approximately zero mean Gaussians with variance \u03c3 2 , the variance of the elements of u \u00b7 v will be approximately \u03c3 4 ." }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-20", "text": "As such, subsequent layers in the neural network use weights that are initialized assuming identically distributed inputs Glorot & Bengio (2010) , which is clearly not the case when \u03c3 = 1." }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-21", "text": "An appropriate scaling coefficient attached to u \u00b7 v that can bring its variance close to that of u is one possible way of addressing this anomaly." }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-22", "text": "The motivation for the latter is to incorporate more complex multiplicative interaction between u and v through degree 3 and 4 polynomials." }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-23", "text": "Our findings are two-fold." }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-24", "text": "Through numerical experiments using the Stanford Natural Language Inference (SNLI) Bowman et al. 
(2015) dataset, we show that in the absence of scaling, using higher degree polynomial features instead of u \u00b7 v improves the performance of baseline models." }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-25", "text": "In the presence of scaling, this difference all but vanishes and in fact the scaled u \u00b7 v achieves the best performance." }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-26", "text": "The use of scaling significantly reduces classification error, by up to 5.0% in the best performing models." }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-27", "text": "This is observed both for models that only use encodings of whole sentences and more complex ones." }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-28", "text": "----------------------------------" }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-29", "text": "**SCALED POLYNOMIAL FEATURES**" }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-30", "text": "We present our work in the context of two baseline models for the natural language inference task." }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-31", "text": "In this task, given a pair of sentences (premise and hypothesis), it needs to be classified into one of three categories -entailment, contradiction and neutral." }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-32", "text": "In the first model, both the premise and the hypothesis sentence are encoded using a bidirectional LSTM Hochreiter & Schmidhuber (1997) and the intermediate states are max-pooled to get the respective representations u and v. We refer to this as the InferSent model Conneau et al. (2017) ." }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-33", "text": "The standard matching feature of Mou et al. (2016) uses a concatenation of u, v, |u\u2212v| and u \u00b7 v. We define the following new matching feature vector that scales the multiplicative term by a constant factor \u03b7 > 0." 
}, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-34", "text": "(1)" }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-35", "text": "To incorporate polynomial multiplicative features between u and v of degree 3 and 4, we define the following define the following matching feature vectors." }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-36", "text": "and" }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-37", "text": "In w poly3 , the additional term is the sum of the two possible monomials of degree 3 involving both u and v. In w poly4 , the fourth degree term is the sum of the 3 possible monomials of degree 4 involving both u and v. Note that we scale the 3rd degree terms by \u03b7 2 and the 4th degree terms by \u03b7 3 ." }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-38", "text": "If the dimension of u and v is d, the dimensions of w poly2 , w poly3 and w poly4 are all 4d." }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-39", "text": "In each case, the feature vector is fed into a fully connected layer(s), before computing the 3-way softmax in the classification layer." }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-40", "text": "It is possible to use each of the degree 3 and 4 terms separately as a feature, but this did not make our models substantially more accurate." }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-41", "text": "Choosing \u03b7 = 1 in w poly2 reduces it to the matching feature vector proposed by Mou et al. (2016) ." }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-42", "text": "The same procedure is repeated for the other baseline model, namely ESIM Chen et al. (2017) ." }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-43", "text": "In this case, u represents one of the intermediate states of a bidirectional LSTM encoding of the premise (hypothesis) and v represents the hypothesis (premise) states weighted by relevance to the premise (hypothesis) state." 
}, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-44", "text": "We replace the matching feature vector used in ESIM by the ones defined above." }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-45", "text": "The rest of the model is unchanged and includes another bidirectional LSTM layer followed by pooling, a fully connected layer and a classification layer." }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-46", "text": "----------------------------------" }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-47", "text": "**TRAINING AND RESULTS**" }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-48", "text": "For the InferSent model, we train on the SNLI dataset for the sentence encoding dimensions d \u2208 {512, 1024, 2048, 4096}. The fully connected part after w poly2 , w poly3 , w poly4 consists of two layers of 512 dimensions each." }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-49", "text": "For optimization, we use SGD with an initial learning rate of 0.1 which is decayed by 0.99 after every epoch or by 0.2 if there is a drop in the validation accuracy." }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-51", "text": "Each experiment is repeated 5 times with random weight initializations and the average classification accuracies on the test set are reported." }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-52", "text": "For training the ESIM model, we follow the procedure outlined in Chen et al. (2017) with the bidirectional LSTM dimension being d = 600." }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-53", "text": "Fig. 1(a) shows the dependence of the test accuracies on \u03b7 for each value of d and matching feature vector w poly2 for InferSent." }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-54", "text": "The dependence of accuracy on \u03b7 is clearly visible and the best accuracy is obtained for \u03b7 = 32, 32, 16, 32 for d = 4096, 2048, 1024, 512, respectively."
}, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-55", "text": "This validates our intuition that the second degree feature (u \u00b7 v) should be scaled differently than the remaining first degree features in w poly2 ." }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-56", "text": "For d = 4096, the average accuracy for \u03b7 = 32 is 84.82%, which is 0.44% higher than that of \u03b7 = 1." }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-57", "text": "Comparing the best performing models for the different weight initializations, the one for \u03b7 = 32 has almost 5% less error than \u03b7 = 1." }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-58", "text": "Fig. 1(b) shows the same phenomenon for w poly3 ." }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-59", "text": "The highest accuracies are obtained for \u03b7 = 4 for d = 512, 1024 and for \u03b7 = 16 for d = 2048, 4096." }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-60", "text": "For d = 4096, the average accuracy for \u03b7 = 16 is 84.73%, which is 0.15% higher than that of \u03b7 = 1. Note that the performance tends to drop significantly for \u03b7 = 32, 64, which points to possibly unstable training because of the large values of the coefficients \u03b7 2 ." }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-61", "text": "We observe a similar trend for w poly4 ." }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-62", "text": "Fig. 2(a) compares the test accuracies of w poly2 , w poly3 and w poly4 for d = 4096." }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-63", "text": "Interestingly, the use of higher degree terms helps the classifier for \u03b7 = 1, with w poly4 achieving 0.25% better accuracy than w poly2 ." }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-64", "text": "However, with scaling w poly2 does progressively better while the other two models eventually suffer from unstable training." 
}, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-65", "text": "The overall highest average accuracy of 84.82% is achieved by w poly2 . Finally, we report the test accuracy of w poly2 on the ESIM model for natural language inference." }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-66", "text": "Here again, the gains from using scaled features are quite prominent, with a relative reduction in error of 5% for \u03b7 = 8 as compared to \u03b7 = 1." }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-67", "text": "----------------------------------" }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-68", "text": "**CONCLUSION AND FUTURE DIRECTIONS**" }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-69", "text": "In this paper, we explore the use of higher degree polynomial features and scaling to derive new features when two distributed representations have to be matched or compared." }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-70", "text": "Using the natural language inference task as an example, we show that scaling the higher degree terms helps reduce classification error, in some cases by almost 5%." }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-71", "text": "In our preliminary experiments, we use constant scaling factors, but they can be learnt as a parameter of the model." }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-72", "text": "We scale the exponent of \u03b7 with the degree of the monomials, which itself may be optimized to stabilize training for higher degree terms." }, { "sent_id": "1b424cab4d7008997a31be8c2e5198-C001-73", "text": "Finally, it will be interesting to use the same scaling mechanism for tasks other than natural language inference."
} ], "y": { "@BACK@": { "gold_contexts": [ [ "1b424cab4d7008997a31be8c2e5198-C001-12" ], [ "1b424cab4d7008997a31be8c2e5198-C001-33" ] ], "cite_sentences": [ "1b424cab4d7008997a31be8c2e5198-C001-12", "1b424cab4d7008997a31be8c2e5198-C001-33" ] }, "@USE@": { "gold_contexts": [ [ "1b424cab4d7008997a31be8c2e5198-C001-12", "1b424cab4d7008997a31be8c2e5198-C001-15" ], [ "1b424cab4d7008997a31be8c2e5198-C001-33" ], [ "1b424cab4d7008997a31be8c2e5198-C001-41" ] ], "cite_sentences": [ "1b424cab4d7008997a31be8c2e5198-C001-12", "1b424cab4d7008997a31be8c2e5198-C001-33", "1b424cab4d7008997a31be8c2e5198-C001-41" ] } } }, "ABC_692f7edc151a9a833c7dd7943bb608_44": { "x": [ { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-2", "text": "Offensive language identification (OLI) in user generated text is the automatic detection of any profanity, insult, obscenity, racism or vulgarity that degrades an individual or a group." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-3", "text": "It is helpful for hate speech detection, flame detection and cyber bullying." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-4", "text": "Due to the immense growth in accessibility of social media, OLI helps to avoid abuse and hurt." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-5", "text": "In this paper, we present deep and traditional machine learning approaches for OLI." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-6", "text": "In the deep learning approach, we have used bi-directional LSTMs with different attention mechanisms to build the models, and in traditional machine learning, TF-IDF weighting schemes with the classifiers Multinomial Naive Bayes and Support Vector Machines with a Stochastic Gradient Descent optimizer are used for model building."
}, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-7", "text": "The approaches are evaluated on the OffensEval@SemEval2019 dataset and our team SSN NLP submitted runs for the three tasks of the OffensEval shared task." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-8", "text": "The best runs of SSN NLP obtained F1 scores of 0.53, 0.48 and 0.30, and accuracies of 0.63, 0.84 and 0.42, for Tasks A, B and C respectively." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-9", "text": "Our approaches improved the baseline F1 scores by 12%, 26% and 14% for Task A, B and C respectively." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-10", "text": "----------------------------------" }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-11", "text": "**INTRODUCTION**" }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-12", "text": "Offensive language identification (OLI) is a process of detecting offensive language classes (Razavi et al., 2010) such as slurs, homophobia, profanity, extremism, insult, disguise, obscenity, racism or vulgarity that hurts or degrades an individual or a group from user-generated text like social media postings." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-13", "text": "OLI is useful for several applications such as hate speech detection, flame detection, aggression detection and cyber bullying." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-14", "text": "Recently, several research works have been reported on identifying offensive language in social media content." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-15", "text": "Several workshops such as TA-COS 1 , TRAC 2 (Kumar et al., 2018a) , Abusive Language Online 3 and GermEval (Wiegand et al., 2018) have been organized recently in this research area."
}, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-16", "text": "In this line, the OffensEval@SemEval2019 shared task (Zampieri et al., 2019b) focuses on the identification and categorization of offensive language in social media." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-17", "text": "It focuses on three subtasks namely offensive language detection, categorization of offensive language and offensive language target identification." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-18", "text": "Sub Task A aims to detect text as offensive (OFF) or not offensive (NOT)." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-19", "text": "Sub Task B aims to categorize the offensive type as targeted text (TIN) or untargeted text (UNT)." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-20", "text": "Sub Task C focuses on identification of target as individual (IND), group (GRP) or others (OTH)." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-21", "text": "Our team SSN NLP participated in all the three subtasks." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-22", "text": "----------------------------------" }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-23", "text": "**RELATED WORK**" }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-24", "text": "Several research works have been reported since 2010 in the field of hate speech detection (Kwok and Wang, 2013; Burnap and Williams, 2015; Djuric et al., 2015; Davidson et al., 2017; Malmasi and Zampieri, 2018; Schmidt and Wiegand, 2017; Fortuna and Nunes, 2018; ElSherief et al., 2018; Gamb\u00e4ck and Sikdar, 2017; Zhang et al., 2018; Mathur et al., 2018) ." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-25", "text": "Schmidt and Wiegand (2017) & Fortuna and Nunes (2018) reviewed the approaches used for hate speech detection."
}, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-26", "text": "Kwok and Wang (2013) used bag of words and bi-gram features with machine learning approach to classify the tweets as \"racist\" or \"nonracist\"." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-27", "text": "Burnap and Williams (2015) developed a supervised algorithm for hateful and antagonistic content in Twitter using voted ensemble meta-classifier." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-28", "text": "Djuric et al. (2015) learnt distributed low-dimensional representations of social media comments using neural language models for hate speech detection." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-29", "text": "Davidson et al. (2017) used n-gram (bigram, unigram, and trigram) features with TF-IDF score along with crowd-sourced hate speech lexicon and employed several classifiers including logistic regression with L1 regularization to separate hate speech from other offensive languages." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-30", "text": "Malmasi and Zampieri (2018) used n-grams, skip-grams and clustering-based word representations as features with ensemble classifier for hate speech detection." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-31", "text": "ElSherief et al. (2018) performed linguistic and psycholinguistic analysis to detect whether hate speech is \"directed\" towards a target or \"generalized\" towards a group." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-32", "text": "Gamb\u00e4ck and Sikdar (2017) used deep learning with CNN models to detect the hate speech as \"racism\", \"sexism\", \"both\" and \"nonhate-speech\"." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-33", "text": "They used character 4-grams, word vectors based on word2vec, randomly generated word vectors, and word vectors combined with character n-grams as features in their approach."
}, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-34", "text": "Zhang et al. (2018) used convolution-GRU based deep neural network for detecting hate speech." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-35", "text": "Many research works have been carried out on aggression detection (Aroyehun and Gelbukh, 2018; Madisetty and Desarkar, 2018; Raiyani et al., 2018; Kumar et al., 2018b) ." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-36", "text": "Aroyehun and Gelbukh (2018) & Raiyani et al. (2018) used LSTM and CNN respectively to detect aggression in text." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-37", "text": "Kumar et al. (2018b) presented the findings of the shared task on aggression identification which aims to detect different scales of aggression namely \"Overtly Aggressive\", \"Covertly Aggressive\", and \"Non-aggressive\"." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-38", "text": "Madisetty and Desarkar (2018) used CNN, LSTM and Bi-LSTM to detect the above scales of aggression." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-39", "text": "Waseem et al. (2017) & Park and Fung (2017) presented the methodologies on abusive language identification using deep neural networks." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-40", "text": "Research on identifying offensive language has also focused on non-English languages such as German (Wiegand et al., 2018) , Hindi (Kumar et al., 2018b) , Hinglish: Hindi-English (Mathur et al., 2018) , Slovene (Fi\u0161er et al., 2017) and Chinese (Su et al., 2017) ." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-41", "text": "Wiegand et al. (2018) presented an overview of GermEval shared task on the identification of offensive language that focused on classification of German tweets from Twitter." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-42", "text": "Kumar et al. (2018b) focused on the shared task to identify aggression on Hindi text."
}, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-43", "text": "Mathur et al. (2018) applied transfer learning to detect three classes namely \"nonoffensive\", \"abusive\" and \"hate-speech\" from Hindi-English code switched language." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-44", "text": "Fi\u0161er et al. (2017) presented a framework to annotate offensive labels in Slovene." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-45", "text": "Su et al. (2017) rephrased profanity in Chinese text after detecting them from social media text." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-46", "text": "----------------------------------" }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-47", "text": "**DATA AND METHODOLOGY**" }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-48", "text": "In our approach, we have used OLID dataset (Zampieri et al., 2019a) given by OffensEval@SemEval2019 shared task." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-49", "text": "The dataset is given in .tsv file format with columns namely, ID, INSTANCE, SUBA, SUBB, SUBC where ID represents the identification number for the tweet, INSTANCE represents the tweets, SUBA consists of the labels namely Offensive (OFF) and Not Offensive (NOT), SUBB consists of the labels namely Targeted Insult and Threats (TIN) and Untargeted (UNT) and SUBC consists of the labels namely Individual (IND), Group (GRP) and Other (OTH)." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-50", "text": "The dataset has 13240 tweets." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-51", "text": "All the instances are considered for Sub Task A. However, we have filtered and considered the data that are labelled with \"TIN/UNT\" and \"IND/GRP/OTH\" for Sub Task B and Sub Task C respectively by ignoring the instances labelled with \"NULL\"." 
}, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-52", "text": "Thus, we have obtained 4400 and 3876 instances for Sub Task B and Sub Task C respectively." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-53", "text": "We have preprocessed the data by removing the URLs and the text \"@USER\" from the tweets." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-54", "text": "Tweet tokenizer 4 is used to obtain the vocabulary and features for the training data." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-55", "text": "We have employed both traditional machine learning and deep learning approaches to identify the offensive language in social media." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-56", "text": "The models that are implemented for the three sub-tasks are given in Table 1 ." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-57", "text": "In deep learning (DL) approach, the tweets are vectorized using word embeddings and are fed into encoding and decoding processes." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-58", "text": "Bidirectional LSTMs are used for encoding and decoding processes." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-59", "text": "We have used 2 layers of LSTM for this." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-60", "text": "The output is given to softmax layer by incorporating attention wrapper to obtain the OffensEval class labels." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-61", "text": "We have trained the deep learning models with a batch size of 128 and dropout of 0.2 for 300 epochs." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-62", "text": "We have employed two attention mechanisms, namely Normed Bahdanau (NB) (Sutskever et al., 2014; Bahdanau et al., 2014) and Scaled Luong (SL) (Luong et al., 2015, 2017), in this approach." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-63", "text": "These two variations are implemented to predict the class labels for all the three sub tasks."
}, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-64", "text": "These attention mechanisms help the model to capture the group of input words relevant to the target output label." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-65", "text": "For example, consider the instance in Task C: \"we do not watch any nfl games this guy can shove it in his pie hole\"." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-66", "text": "This instance clearly contains the offensive slang \"pie hole\" and is about watching \"nfl games\"." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-67", "text": "The attention mechanism captures these named entities or groups of words and correctly maps the instance to the label \"GRP\"." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-68", "text": "Also, it is evident from the earlier experiments (Sutskever et al., 2014; Thenmozhi et al., 2018) that bi-directional LSTM with attention mechanism performs better for mapping input sequences to the output sequences." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-69", "text": "In the traditional learning (TL) approach, the features are extracted from the tokens with minimum count of two." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-70", "text": "The feature vectors are constructed using TF-IDF scores for the training instances." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-71", "text": "We have chosen the classifiers namely Multinomial Naive Bayes (MNB) and Support Vector Machine (SVM) with a Stochastic Gradient Descent optimizer to build the models for Task B and Task C respectively." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-72", "text": "These classifiers have been chosen based on the cross validation accuracies." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-73", "text": "The class labels namely \"TIN/UNT\" and \"IND/GRP/OTH\" are predicted for Task B and Task C using the respective models."
}, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-74", "text": "----------------------------------" }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-75", "text": "**RESULTS**" }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-76", "text": "We have evaluated our models using the test data of the OffensEval@SemEval2019 shared task for the three sub tasks." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-77", "text": "The performance was analyzed using the metrics namely precision, recall, macro-averaged F1 and accuracy." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-78", "text": "The results of our approaches are presented in Tables 2, 3 and 4 for Task A, Task B and Task C respectively." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-79", "text": "We have obtained the best results with the DL SL, DL NB and TL SVM models for Task A, Task B and Task C respectively." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-80", "text": "The attention mechanism Scaled Luong performs better when more data is available for training." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-81", "text": "The Normed Bahdanau attention mechanism performs better even for a small dataset." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-82", "text": "However, deep learning gives poorer results than the traditional learning approach for Task C, because only 3876 instances were considered for model building." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-83", "text": "The deep learning model could not learn the features appropriately due to the limited domain knowledge imparted by the smaller dataset." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-84", "text": "Thus, traditional learning performs better with the given data size when compared to deep learning for Task C. The confusion matrices for our best runs in the three sub tasks are depicted in Tables 5, 6 and 7."
}, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-85", "text": "These tables show that the true positive rates of the \"NOT\", \"TIN\" and \"IND\" classes are good, as those classes have more samples in the training set." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-86", "text": "Our approaches show improvement over the baseline systems for all the three tasks." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-87", "text": "We have obtained 12% and 14% improvement on F1 and accuracy respectively for Task A when compared with the baseline." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-88", "text": "For Task B, we have obtained 26% and 34% improvement on F1 and accuracy respectively." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-89", "text": "Also, Task C results have been improved by 14% and 7% for F1 and accuracy when compared to baseline results." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-90", "text": "Table 7 (Confusion Matrix for Task C TL SVM): GRP: 16, 26, 7; IND: 62, 71, 27; OTH: 0, 3, 2." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-91", "text": "----------------------------------" }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-92", "text": "**OFF NOT**" }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-93", "text": "----------------------------------" }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-94", "text": "**GRP IND OTH**" }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-95", "text": "----------------------------------" }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-96", "text": "**CONCLUSION**" }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-97", "text": "We have implemented both traditional machine learning and deep learning approaches for identifying offensive languages from social media." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-98", "text": "The approaches are evaluated on OffensEval@SemEval2019 dataset."
}, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-99", "text": "The given instances are preprocessed and vectorized using word embeddings in deep learning models." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-100", "text": "We have employed 2 layered bi-directional LSTM with Scaled Luong and Normed Bahdanau attention mechanisms to build the model for all the three sub tasks." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-101", "text": "For the traditional machine learning models, the instances are vectorized using TF-IDF scores over tokens with a minimum count of two." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-102", "text": "The classifiers namely Multinomial Naive Bayes and Support Vector Machine with Stochastic Gradient Descent optimizer were employed to build the models for sub tasks B and C. Deep learning with Scaled Luong attention, deep learning with Normed Bahdanau attention, and traditional machine learning with SVM give better results for Task A, Task B and Task C respectively." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-103", "text": "Our models outperform the baseline for all the three tasks." }, { "sent_id": "692f7edc151a9a833c7dd7943bb608-C001-104", "text": "The performance may be improved further by incorporating external datasets (Kumar et al., 2018a; Davidson et al., 2017)." } ], "y": { "@BACK@": { "gold_contexts": [ [ "692f7edc151a9a833c7dd7943bb608-C001-24" ] ], "cite_sentences": [ "692f7edc151a9a833c7dd7943bb608-C001-24" ] }, "@FUT@": { "gold_contexts": [ [ "692f7edc151a9a833c7dd7943bb608-C001-104" ] ], "cite_sentences": [ "692f7edc151a9a833c7dd7943bb608-C001-104" ] } } }, "ABC_f5ad574acf9ea27c0be3129238fd92_44": { "x": [ { "sent_id": "f5ad574acf9ea27c0be3129238fd92-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "f5ad574acf9ea27c0be3129238fd92-C001-2", "text": "Recent work has shown that neural models can be successfully trained on multiple languages simultaneously."
}, { "sent_id": "f5ad574acf9ea27c0be3129238fd92-C001-3", "text": "We investigate whether such models learn to share and exploit common syntactic knowledge among the languages on which they are trained." }, { "sent_id": "f5ad574acf9ea27c0be3129238fd92-C001-4", "text": "This extended abstract presents our preliminary results." }, { "sent_id": "f5ad574acf9ea27c0be3129238fd92-C001-5", "text": "----------------------------------" }, { "sent_id": "f5ad574acf9ea27c0be3129238fd92-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "f5ad574acf9ea27c0be3129238fd92-C001-7", "text": "Recent work has shown that state-of-the-art neural models of language and translation can be successfully trained on multiple languages simultaneously without changing the model architecture (\u00d6stling and Tiedemann, 2017; Johnson et al., 2017) ." }, { "sent_id": "f5ad574acf9ea27c0be3129238fd92-C001-8", "text": "In some cases this leads to improved performance compared to models only trained on a specific language, suggesting that multilingual models learn to share useful knowledge crosslingually through their learned representations." }, { "sent_id": "f5ad574acf9ea27c0be3129238fd92-C001-9", "text": "While a large body of research exists on the multilingual mind, the mechanisms explaining knowledge sharing in computational multilingual models remain largely unknown: What kind of knowledge is shared among languages?" }, { "sent_id": "f5ad574acf9ea27c0be3129238fd92-C001-10", "text": "Do multilingual models mostly benefit from a better modeling of lexical entries or do they also learn to share more abstract linguistic categories?" }, { "sent_id": "f5ad574acf9ea27c0be3129238fd92-C001-11", "text": "We focus on the case of language models (LM) trained on two languages, one of which (L1) is over-resourced with respect to the other (L2), and investigate whether the syntactic knowledge learned for L1 is transferred to L2." 
}, { "sent_id": "f5ad574acf9ea27c0be3129238fd92-C001-12", "text": "To this end we use the long-distance agreement benchmark recently introduced by Gulordava et al. (2018) ." }, { "sent_id": "f5ad574acf9ea27c0be3129238fd92-C001-13", "text": "----------------------------------" }, { "sent_id": "f5ad574acf9ea27c0be3129238fd92-C001-14", "text": "**BACKGROUND**" }, { "sent_id": "f5ad574acf9ea27c0be3129238fd92-C001-15", "text": "The recent advances in neural networks have opened the way to the design of architecturally simple multilingual models for various NLP tasks, such as language modeling or next word prediction (Tsvetkov et al., 2016; \u00d6stling and Tiedemann, 2017; Malaviya et al., 2017; Tiedemann, 2018) , translation (Dong et al., 2015; Zoph et al., 2016; Firat et al., 2016; Johnson et al., 2017) , morphological reinflection (Kann et al., 2017) and more (Bjerva, 2017) ." }, { "sent_id": "f5ad574acf9ea27c0be3129238fd92-C001-16", "text": "A practical benefit of training models multilingually is to transfer knowledge from high-resource languages to lowresource ones and improve task performance in the latter." }, { "sent_id": "f5ad574acf9ea27c0be3129238fd92-C001-17", "text": "Here we aim at understanding how linguistic knowledge is transferred among languages, specifically at the syntactic level, which to our knowledge has not been studied so far." }, { "sent_id": "f5ad574acf9ea27c0be3129238fd92-C001-18", "text": "Assessing the syntactic abilities of monolingual neural LMs trained without explicit supervision has been the focus of several recent studies: Linzen et al. (2016) analyzed the performance of LSTM LMs at an English subject-verb agreement task, while Gulordava et al. (2018) extended the analysis to various long-range agreement patterns in different languages." 
}, { "sent_id": "f5ad574acf9ea27c0be3129238fd92-C001-19", "text": "The latter study found that state-of-the-art LMs trained on a standard loglikelihood objective capture non-trivial patterns of syntactic agreement and can approach the performance levels of humans, even when tested on syntactically well-formed but meaningless (nonce) sentences." }, { "sent_id": "f5ad574acf9ea27c0be3129238fd92-C001-20", "text": "Cross-language interaction during language production and comprehension by human subjects has been widely studied in the fields of bilingualism and second language acquisition (Kellerman and Sharwood Smith; Odlin, 1989; Jarvis and Pavlenko, 2008) under the terms of language transfer or cross-linguistic influence." }, { "sent_id": "f5ad574acf9ea27c0be3129238fd92-C001-21", "text": "Numerous studies have shown that both the lexicons and the grammars of different languages are not stored independently but together in the mind of bilinguals and second-language learners, leading to observ-able lexical and syntactic transfer effects (Kootstra et al., 2012) ." }, { "sent_id": "f5ad574acf9ea27c0be3129238fd92-C001-22", "text": "For instance, through a crosslingual syntactic priming experiment, Hartsuiker et al. (2004) showed that bilinguals recently exposed to a given syntactic construction (passive voice) in their L1 tend to reuse the same construction in their L2." }, { "sent_id": "f5ad574acf9ea27c0be3129238fd92-C001-23", "text": "While the neural networks in this study are not designed to be plausible models of the human mind learning and processing multiple languages, we believe there is interesting potential at the intersection of these research fields." 
}, { "sent_id": "f5ad574acf9ea27c0be3129238fd92-C001-24", "text": "----------------------------------" }, { "sent_id": "f5ad574acf9ea27c0be3129238fd92-C001-25", "text": "**EXPERIMENT**" }, { "sent_id": "f5ad574acf9ea27c0be3129238fd92-C001-26", "text": "We consider the scenario where L1 is overresourced compared to L2 and train our bilingual models by joint training on a mixed L1/L2 corpus so that supervision is provided simultaneously in the two languages (\u00d6stling and Tiedemann, 2017; Johnson et al., 2017) ." }, { "sent_id": "f5ad574acf9ea27c0be3129238fd92-C001-27", "text": "We leave the evaluation of pre-training (or transfer learning) methods (Zoph et al., 2016; Nguyen and Chiang, 2017) to future work." }, { "sent_id": "f5ad574acf9ea27c0be3129238fd92-C001-28", "text": "The monolingual LM is trained on a small L2 corpus (LM L2 )." }, { "sent_id": "f5ad574acf9ea27c0be3129238fd92-C001-29", "text": "The bilingual LM is trained on a shuffled mix of the same small L2 corpus and a large L1 corpus, where L2 is oversampled to approximately match the amount of L1 sentences (LM L1+L2 )." }, { "sent_id": "f5ad574acf9ea27c0be3129238fd92-C001-30", "text": "See Table 1 for the actual training sizes." }, { "sent_id": "f5ad574acf9ea27c0be3129238fd92-C001-31", "text": "For our preliminary experiments we have chosen French as the helper language (L1) and Italian as the target language (L2)." }, { "sent_id": "f5ad574acf9ea27c0be3129238fd92-C001-32", "text": "Since French and Italian share many morphosyntactic patterns, accuracy on the Italian agreement tasks is expected to benefit from adding French sentences to the training data if syntactic transfer occurs." }, { "sent_id": "f5ad574acf9ea27c0be3129238fd92-C001-33", "text": "Data and training details: We train our LMs on French and Italian Wikipedia articles extracted using the WikiExtractor tool." 
}, { "sent_id": "f5ad574acf9ea27c0be3129238fd92-C001-34", "text": "1 For each language, we maintain a vocabulary of the 50k most frequent tokens, and replace the remaining tokens by . For the bilingual LM, all words are prepended with a language tag so that vocabularies are completely disjoint." }, { "sent_id": "f5ad574acf9ea27c0be3129238fd92-C001-35", "text": "Their union (100K types) is used to train the model." }, { "sent_id": "f5ad574acf9ea27c0be3129238fd92-C001-36", "text": "This is the least optimistic scenario for linguistic transfer but also the most controlled one." }, { "sent_id": "f5ad574acf9ea27c0be3129238fd92-C001-37", "text": "In future experiments we plan to study how transfer is affected by varying degrees of vocabulary overlap." }, { "sent_id": "f5ad574acf9ea27c0be3129238fd92-C001-38", "text": "Following the setup of Gulordava et al. (2018) , we train 2-layer LSTM models with embedding and hidden layers of 650 dimensions for 40 epochs." }, { "sent_id": "f5ad574acf9ea27c0be3129238fd92-C001-39", "text": "The trained models are evaluated on the Italian section of the syntactic benchmark provided by Gulordava et al. (2018) , which includes various non-trivial number agreement constructions." }, { "sent_id": "f5ad574acf9ea27c0be3129238fd92-C001-40", "text": "2 Note that all models are trained on a regular corpus likelihood objective and do not receive any specific supervision for the syntactic tasks." }, { "sent_id": "f5ad574acf9ea27c0be3129238fd92-C001-41", "text": "Table 1 shows the results of our preliminary experiments." }, { "sent_id": "f5ad574acf9ea27c0be3129238fd92-C001-42", "text": "The unigram baseline simply picks, for each sentence, the most frequent word form between singular or plural." }, { "sent_id": "f5ad574acf9ea27c0be3129238fd92-C001-43", "text": "As an upper-bound we report the agreement accuracy obtained by a monolingual model trained on a large L2 corpus." 
}, { "sent_id": "f5ad574acf9ea27c0be3129238fd92-C001-44", "text": "The effect of mixing the small Italian corpus with the large French one does not appear to be major." }, { "sent_id": "f5ad574acf9ea27c0be3129238fd92-C001-45", "text": "Agreement accuracy increases slightly in the original sentences, where the model is free to rely on collocational cues, but decreases slightly in the nonce sentences, where the model must rely on pure grammatical knowledge." }, { "sent_id": "f5ad574acf9ea27c0be3129238fd92-C001-46", "text": "Thus there is currently no evidence that syntactic transfer occurs in our setup." }, { "sent_id": "f5ad574acf9ea27c0be3129238fd92-C001-47", "text": "A possible explanation is that the bilingual model has to fit the knowledge from two language systems into the same number of hidden layer parameters and this may cancel out the benefits of being exposed to a more diverse set of sentences." }, { "sent_id": "f5ad574acf9ea27c0be3129238fd92-C001-48", "text": "In fact, the bilingual model achieves a considerably worse perplexity than the monolingual one (69.9 vs 55.62) on an Italian-only held-out set." }, { "sent_id": "f5ad574acf9ea27c0be3129238fd92-C001-49", "text": "For comparison,\u00d6stling and Tiedemann (2017) observed slightly better perplexities when mixing a small number of related languages, however their setup was considerably different (characterlevel LSTM with highly overlapping vocabulary)." }, { "sent_id": "f5ad574acf9ea27c0be3129238fd92-C001-50", "text": "This is work in progress." }, { "sent_id": "f5ad574acf9ea27c0be3129238fd92-C001-51", "text": "We are currently looking for a bilingual LM configuration that will result in better target language perplexity and, possibly, better agreement accuracy." }, { "sent_id": "f5ad574acf9ea27c0be3129238fd92-C001-52", "text": "We also plan to extend the evaluation to other, less related, language pairs and different multilingual training techniques." 
}, { "sent_id": "f5ad574acf9ea27c0be3129238fd92-C001-53", "text": "Finally, we plan to examine whether lexical syntactic categories (POS) are represented in a shared space among the two languages." }, { "sent_id": "f5ad574acf9ea27c0be3129238fd92-C001-54", "text": "----------------------------------" }, { "sent_id": "f5ad574acf9ea27c0be3129238fd92-C001-55", "text": "**RESULTS AND CONCLUSIONS**" } ], "y": { "@BACK@": { "gold_contexts": [ [ "f5ad574acf9ea27c0be3129238fd92-C001-7" ], [ "f5ad574acf9ea27c0be3129238fd92-C001-15" ] ], "cite_sentences": [ "f5ad574acf9ea27c0be3129238fd92-C001-7", "f5ad574acf9ea27c0be3129238fd92-C001-15" ] }, "@USE@": { "gold_contexts": [ [ "f5ad574acf9ea27c0be3129238fd92-C001-26" ] ], "cite_sentences": [ "f5ad574acf9ea27c0be3129238fd92-C001-26" ] } } }, "ABC_3d1f3980190048625ec93517ebffdc_44": { "x": [ { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-23", "text": "We use GDELT 3 to obtain links to news articles and the Newspaper3k Python library 4 to extract their content." }, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-2", "text": "We present proppy, the first publicly available real-world, real-time propaganda detection system for online news, which aims at raising awareness, thus potentially limiting the impact of propaganda and helping fight disinformation." }, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-3", "text": "The system constantly monitors a number of news sources, deduplicates and clusters the news into events, and organizes the articles about an event on the basis of the likelihood that they contain propagandistic content." }, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-4", "text": "The system is trained on known propaganda sources using a variety of stylistic features." 
}, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-5", "text": "The evaluation results on a standard dataset show stateof-the-art results for propaganda detection." }, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-6", "text": "----------------------------------" }, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-8", "text": "Propaganda is the expression of an opinion or an action by individuals or groups deliberately designed to influence the opinions or the actions of other individuals or groups with reference to predetermined ends (Institute for Propaganda Analysis 1938) ." }, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-9", "text": "We are interested in propaganda from a journalistic point of view: how news management lacking neutrality shapes information by emphasizing positive or negative aspects purposefully (Jowett and O'Donnell 2012, p. 1)." }, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-10", "text": "Propaganda uses psychological and rhetorical techniques that are intended to go unnoticed to achieve maximum effect." }, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-11", "text": "As a result, malicious propaganda news outlets have proven to be able to achieve large-scale impact." }, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-12", "text": "In particular, the power of disinformation and propaganda was arguably demonstrated during recent events, such as Brexit and the 2016 U.S. Presidential campaign." }, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-13", "text": "1 With the rise of the Web, a combination of freedom of expression and ease of publishing contents online has nurtured a number of news outlets that produce or distribute propagandistic content." }, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-14", "text": "Social media further amplified the problem by making it possible to reach millions of users almost instantaneously." 
}, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-15", "text": "Thus, with the aim of helping fight the rise of propaganda, here we introduce proppy, a system to unmask articles with propagandistic content, which can (i) help investigative journalists to study propaganda online and (ii) raise awareness that a news article, or a news outlet in general, might be trying to influence people's mindset." }, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-16", "text": "To the best of our knowledge, proppy 2 is the first publicly available real-world, real-time monitoring and propaganda detection system for online news, which aims at raising awareness about propaganda." }, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-17", "text": "Figure 1 shows the architecture of proppy." }, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-18", "text": "We describe its four modules next." }, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-19", "text": "----------------------------------" }, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-20", "text": "**ARCHITECTURE OF THE SYSTEM**" }, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-21", "text": "1. Article retrieval." }, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-22", "text": "Proppy regularly monitors a variety of news outlets and extracts the content of the latest news articles from their websites." }, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-24", "text": "Proppy then analyzes the articles in batches every 24 hours, and performs the remaining three steps." }, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-25", "text": "2." }, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-26", "text": "Event identification." }, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-27", "text": "We use the DBSCAN clustering algorithm (Ester et al. 1996) for event identification, as it does not require information related to the expected number of events." 
}, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-28", "text": "We use doc2vec embeddings (Le and Mikolov 2014) for article representation, pre-trained on articles from Associated Press." }, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-29", "text": "We compute the pairwise distances for DBSCAN as 1 minus the cosine similarity." }, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-30", "text": "DBSCAN has two hyper-parameters: the minimum number of members in a cluster and the maximum distance between two members of the same cluster, ." }, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-31", "text": "We set the former parameter to 2, thus discarding singletons." }, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-32", "text": "We estimate the parameter = 0.55 on the METER corpus (Clough, Gaizauskas, and Piao 2002)." }, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-33", "text": "3. Deduplication." }, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-34", "text": "Next, we discard near-duplicates using a standard text re-use technique: comparison of word n-grams (Lyon, Barret, and Malcolm 2004) after standard pre-processing (case-folding, tokenization, and stopword removal)." }, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-35", "text": "We compute the similarity between all pairs of documents in a cluster using the Jaccard coefficient." }, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-36", "text": "Once again, we use the METER corpus to optimize the value of n and the threshold to consider two documents as near-duplicates." }, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-37", "text": "At run time, we discard all near-duplicates but one." 
}, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-38", "text": "----------------------------------" }, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-39", "text": "**PROPAGANDA INDEX COMPUTATION.**" }, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-40", "text": "We train a maximum entropy classifier with L2 regularization to discriminate propagandistic vs non-propagandistic news articles." }, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-41", "text": "We use the confidence of the classifier, a value in the range [0, 1], to group articles into bins." }, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-42", "text": "We call this value the propaganda index, since it reflects the probability for an article to have a propagandistic intent." }, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-43", "text": "We use four families of features: Word n-gram features We use tf.idf -weighted word [1, 3]-grams (Rashkin et al. 2017) ." }, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-44", "text": "Lexicon features." }, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-45", "text": "We try to capture the typical vocabulary of propaganda by considering representations reflecting the frequency of specific words from a number of lexicons coming from the Wiktionary, Linguistic Inquiry and Word Count (LIWC), Wilson's subjectives, Hyland hedges, and Hooper's assertives." }, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-46", "text": "Rashkin et al. (2017) showed that the words in some of these lexicons appear more frequently in propagandistic than in trustworthy articles." }, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-47", "text": "Style, vocabulary richness, and readability." }, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-48", "text": "Our writing style representation consists of tf.idf -weighted character 3-grams." 
}, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-49", "text": "This representation captures different style markers, such as prefixes, suffixes, and punctuation marks." }, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-50", "text": "We further consider the type-token ratio (TTR) as well as the number of tokens appearing exactly once or twice in the document: hapax legomena and dislegomena." }, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-51", "text": "Moreover, we combine types, tokens, and hapax legomenae to compute Honore's R and Yule's characteristic K. We also use three readability features originally designed to estimate the level of complexity of a text: Flesch-Kincaid grade level, Flesch reading ease and the Gunning fog index." }, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-52", "text": "NELA." }, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-53", "text": "We also integrate the NEws LAndscape (NELA) features (Horne, Khedr, and Adal 2018) : 130 content-based features collected from the existing literature that measure different aspects of a news article (e.g., sentiment, bias, morality, complexity)." }, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-54", "text": "We evaluated proppy on data from Rashkin et al. (2017) in a binary setup of distinguishing propaganda vs nonpropaganda." }, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-55", "text": "Their n-grams system yielded an F 1 of 88.21, whereas proppy achieved 96.72 (+8.51), a statistically significant improvement (measured with the McNemar test)." }, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-56", "text": "Figure 2 shows a screenshot of proppy." }, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-57", "text": "The architecture follows a push publishing model: it updates automatically the material that it presents to the user without her taking any action but exploring the available events." 
}, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-58", "text": "The left panel shows events from the last 24 hours." }, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-59", "text": "When the user clicks on an event, its articles are shown in the right panel, organized into five bins according to their propaganda index." }, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-60", "text": "The articles in bins 1 and 2 are considered nearly nonpropagandistic, whereas those in the two right bins are propagandistic." }, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-61", "text": "In this way, the user can easily observe how different media cover related events on the propaganda dimension and may guide her further exploration and judgment." }, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-62", "text": "----------------------------------" }, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-63", "text": "**ONLINE INTERFACE**" }, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-64", "text": "----------------------------------" }, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-65", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-66", "text": "We have presented proppy, a publicly available real-world, real-time propaganda detection system for online news, which aims at raising awareness with the objective of limiting the impact of propaganda and helping fight disinformation." }, { "sent_id": "3d1f3980190048625ec93517ebffdc-C001-67", "text": "In future work, we plan to add support for multiple languages, and a pull mode where users will be able to submit any article and get its propaganda index." 
} ], "y": { "@USE@": { "gold_contexts": [ [ "3d1f3980190048625ec93517ebffdc-C001-43" ], [ "3d1f3980190048625ec93517ebffdc-C001-54" ] ], "cite_sentences": [ "3d1f3980190048625ec93517ebffdc-C001-43", "3d1f3980190048625ec93517ebffdc-C001-54" ] } } }, "ABC_418e03aa7ba304c4774111e9f300ad_44": { "x": [ { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-2", "text": "In this paper, we describe our three submissions to the inflection track of SIGMORPHON shared task." }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-3", "text": "We experimented with three models: namely, sequence to sequence model (popularly known as seq2seq), seq2seq model with data augmentation, and a multilingual multi-tasking seq2seq model that is multilingual in nature." }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-4", "text": "Our results with the multilingual model are below the baseline in the case of both high and medium datasets." }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-5", "text": "----------------------------------" }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-7", "text": "Morphological inflection is the task of predicting the target inflected form from a lemma and a bundle of inflectional features." }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-8", "text": "For instance, given the Norwegian lemma hus \"house\" and the morphological features N, DEF, PL the task is to predict husene \"houses\"." }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-9", "text": "The SIGMORPHON shared task for 2018 (Cotterell et al., 2018) provided three data scenarios consisting of high (10000), medium (1000), and low (100) examples." }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-10", "text": "This paper described the three systems that we submitted to the inflection track in the SIGMORPHON shared task." 
}, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-11", "text": "All our models are based on encoder-decoder model introduced by Faruqui et al. (2016) for the morphological inflection task." }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-12", "text": "We trained our models on all the data sizes and tested on the test datasets provided by the organizers." }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-13", "text": "----------------------------------" }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-14", "text": "**BACKGROUND**" }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-15", "text": "The morphological (re)inflection task has been studied mainly in last two SIGMORPHON shared tasks (Cotterell et al., 2016 (Cotterell et al., , 2017 ." }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-16", "text": "Most of the morphological inflection models are variants of sequence to sequence models applied by Faruqui et al. (2016) to morphological reinflection." }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-17", "text": "The input to the model is the source word prepended with relevant morphological tags, the output of the model is the target word for the inflection task." }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-18", "text": "For re-inflection task, the input includes the target tags as well." }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-19", "text": "The success of the system seems to depend highly on 'training data enhancement'." }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-20", "text": "For different tracks (with different restrictions on data used) of the 2016 shared task, Kann and Sch\u00fctze (2016) developed new techniques to increase the number of training instances." }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-21", "text": "The methods used mostly work well for re-inflection task, since the re-inflection task is symmetric, and one can invert the source and target forms." 
}, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-22", "text": "In the subsequent year's shared task for 2017 (Cotterell et al., 2017) , multiple authors explored new data enhancement techniques Bergmanis et al., 2017; Silfverberg et al., 2017) to improve the performance of the seq2seq models in medium and low resource scenarios." }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-23", "text": "The work presented in this paper is based on the work of the simple encoder-decoder system of Faruqui et al. (2016) ." }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-24", "text": "----------------------------------" }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-25", "text": "**MODELS**" }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-26", "text": "In this section, we describe the three different models and the feature representations used in our experiments." }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-27", "text": "Morphological features In this paper, we enumerated all the possible features in Unimorph (Kirov et al., 2018) and encoded the feature bundle as multi-hot feature vector." }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-28", "text": "We experimented with both one-hot feature vectors and multi-hot feature vectors." }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-29", "text": "In our development experiments, we found that multi-hot feature vectors have lower dimension than one-hot feature vectors and yielded similar results." }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-30", "text": "Seq2seq-baseline This model consists of two parts: bidirectional encoder and decoder." }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-31", "text": "In this model, each character is represented as a one-hot vector whereas the morphological features are rep-resented as multi-hot feature bundle." 
}, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-32", "text": "The encoder consists of LSTM cells that transform a sequence into a continuous vector." }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-33", "text": "The final time step's hidden state and the cell state are used to initialize the decoder LSTM network." }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-34", "text": "The decoder LSTM network predicts a character at each time step by passing the output of the decoder LSTM through a softmax layer." }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-35", "text": "The output of the softmax layer is a predicted character that is input along with the multi-hot morphological feature vector to the next timestep." }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-36", "text": "We intended this model to be the baseline model in our experiments." }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-37", "text": "Augment-Seq2seq This model is a variation of the baseline encoder-decoder model where the training data is augmented with random strings generated with weights proportional to the character probabilities." }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-38", "text": "This model is similar to the data augmentation model of Silfverberg et al. (2017) who generate new training instances by randomly sampling characters from unigram distributions." }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-39", "text": "In our model, we generate a training instance of the same length as the original training instance." }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-40", "text": "We also experimented with Seq2seq-MTL-global In this model, we train a single encoder-decoder model which is trained to perform both language identification and language modeling as auxiliary tasks apart from generating the target inflection." 
}, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-41", "text": "The encoder LSTM is trained to predict the next character in the source word at each time step." }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-42", "text": "The final hidden state of the encoder is trained to predict the language of the example." }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-43", "text": "This model differs from the other seq2seq models in that the model is multilingual (or global) and attempts to predict target inflections for all the languages in the test dataset." }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-44", "text": "The seq2seq-mtl-global model is similar to the model of Kann et al. (2018) and Bergmanis et al. (2017) who train their attention enhanced encoder-decoder model using an auxiliary autoencoder objective." }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-45", "text": "In contrast, our model uses both prediction of subsequent character and language prediction as auxiliary tasks." }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-46", "text": "----------------------------------" }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-47", "text": "**EXPERIMENTAL SETTINGS**" }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-48", "text": "We trained our models at all the three resource settings: high, medium, and low." }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-49", "text": "In all our experiments, the maximum length of both source and target strings are fixed to 30 and padded with zeroes at the end." }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-50", "text": "Both the encoder and decoder LSTM units consisted of 256 hidden units." }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-51", "text": "All the models were trained with Adam (Kingma and Ba, 2014) with minibatches of size 32 or 128 depending on the size of the data; and, used a early-stop with a patience of 5 to prevent overfitting." 
}, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-52", "text": "----------------------------------" }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-53", "text": "**RESULTS**" }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-54", "text": "Participating in the competition with less than three weeks at hand, we did not have much time to explore the hyperparameter settings required to tune our models." }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-55", "text": "In our development experiments, we found that the baseline seq2seq model performed the best among the tested models." }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-56", "text": "We observed similar results with the test dataset also." }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-57", "text": "We present the average accuracies of all the models at high and medium datasets in table 1." }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-58", "text": "Our results are lower than the baseline system." }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-59", "text": "We also present the top-5 and the bottom-5 languages' accuracies of the three models on high and medium data sizes in table 2." }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-60", "text": "We did not present the results for low sized datasets since all the models had accuracies lower than 5%." }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-61", "text": "Both the seq2seq and augmented-seq2seq systems performed the worst on languages such as Zulu, Swahili, and Basque." }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-62", "text": "On the other hand, the MTL system seemed to perform worse on the languages that have close orthography and substantial amount of borrowing such as Hindi, Urdu, and Persian." }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-63", "text": "Table 1 : Average accuracies for high and medium datasets with three different models." 
}, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-64", "text": "----------------------------------" }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-65", "text": "**CONCLUSION**" }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-66", "text": "In conclusion, our global multi-tasking model requires more effort to improve the results for languages with low accuracies." }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-67", "text": "As part of future work, we plan to work on incorporating embeddings and attention which are part of the winning systems from the shared tasks of 2016 and 2017." }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-68", "text": "We observed that the multi-tasking model's auxiliary objective was easier to achieve than the main objective." }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-69", "text": "Therefore, we need to explore ways to regularize the network, for instance, by weighing the individual loss components." }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-70", "text": "Finally, the output softmax layer of the decoder has to be made sensitive to the language of the example in the training data to prevent softmax from yielding low values due to the high dimension of the target of the softmax." }, { "sent_id": "418e03aa7ba304c4774111e9f300ad-C001-71", "text": "Table 2 : Top-5 and bottom-5 languages at which the three models perform the best and worse for high and medium datasets." 
} ], "y": { "@USE@": { "gold_contexts": [ [ "418e03aa7ba304c4774111e9f300ad-C001-11" ], [ "418e03aa7ba304c4774111e9f300ad-C001-23" ] ], "cite_sentences": [ "418e03aa7ba304c4774111e9f300ad-C001-11", "418e03aa7ba304c4774111e9f300ad-C001-23" ] }, "@BACK@": { "gold_contexts": [ [ "418e03aa7ba304c4774111e9f300ad-C001-16" ] ], "cite_sentences": [ "418e03aa7ba304c4774111e9f300ad-C001-16" ] } } }, "ABC_3c4c0875593ed0f196f5295fbaeb37_44": { "x": [ { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-2", "text": "We argue that there are currently two major bottlenecks to the commercial use of statistical machine learning approaches for natural language generation (NLG): (a) The lack of reliable automatic evaluation metrics for NLG, and (b) The scarcity of high quality in-domain corpora." }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-3", "text": "We address the first problem by thoroughly analysing current evaluation metrics and motivating the need for a new, more reliable metric." }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-4", "text": "The second problem is addressed by presenting a novel framework for developing and evaluating a high quality corpus for NLG training." }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-5", "text": "Up to 60% of NLG research published between 2012-2015 relies on automatic evaluation measures, such as BLEU (Gkatzia and Mahamood, 2015) ." }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-6", "text": "The use of such metrics is, however, only sensible if they are known to be sufficiently correlated with human preferences, which is not the case, as we show in the most complete study to date, across metrics, systems, datasets and domains." 
}, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-7", "text": "We evaluate three end-to-end NLG systems:" }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-8", "text": "RNNLG (Wen et al., 2015) , TGen (Du\u0161ek and Jur\u010d\u00ed\u010dek, 2015) and LOLS (Lampouras and Vlachos, 2016) , using a large number of 21 automated metrics." }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-28", "text": "At the same time, the proportion of syntactically complex sentences (levels 6 and 7) is one of the highest." }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-29", "text": "Furthermore, our dataset requires content selection in 40% of the cases." }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-51", "text": "We argue that these assumptions are very often invalid for corpus-based NLG, especially when using crowdsourced datasets." }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-9", "text": "The metrics are divided into groups of word-based metrics (WBMs, such as TER (Snover et al., 2006) , BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), semantic similarity (Han et al., 2013) reliability, we calculate the Spearman correlation between the metrics and human ratings for the same natural language (NL) utterances, the accuracy of relative rankings and conduct a detailed error analysis." }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-10", "text": "The results reveal that no metric produces an even moderate correlation with human ratings (max." }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-11", "text": "\u03c1 = 0.33), independently of dataset, system, or aspect of human rating." }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-12", "text": "WBMs make two strong assumptions: They treat human-generated NL references as a gold standard which is correct and complete." 
}, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-13", "text": "We argue that these assumptions are very often invalid for corpus-based NLG, especially when using crowdsourced datasets." }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-14", "text": "Grammar-based metrics, on the other hand, do not rely on human-generated references and are not influenced by their quality." }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-15", "text": "However, these metrics can be easily manipulated with grammatically correct and easily readable output that is unrelated to the input." }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-16", "text": "Our study clearly demonstrates the need for more advanced metrics, as used in related fields, e.g. MT (Specia et al., 2010)." }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-17", "text": "Recent advances in corpus-based NLG (Du\u0161ek and Jur\u010d\u00ed\u010dek, 2015; Wen et al., 2015; Mei et al., 2016; Wen et al., 2016; Du\u0161ek and Jur\u010d\u00ed\u010dek, 2016; Lampouras and Vlachos, 2016) require costly training data, consisting of meaning representations (MRs) paired with corresponding NL texts." }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-18", "text": "In our work, we propose a novel framework for crowdsourcing high quality NLG training data, using automatic quality control measures and evaluating two types of MRs, pictorial and textual, used to elicit data (Novikova and Rieser, 2016) ." }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-19", "text": "When collecting corpora for training NLG systems, especially when using crowd workers, the following challenges arise:" }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-20", "text": "(1) How to ensure the required high quality of the collected data? (2) What types of meaning representations can elicit spontaneous, natural and varied data from crowd workers?" 
}, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-21", "text": "To address (1), we filter the crowdsourced data using a combination of automatic and semimanual validation procedures, as described in (Novikova and Rieser, 2016) ." }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-22", "text": "We validate the data by selecting native English participants, allowing only well formed English sentences to be submitted, and measuring the semantic similarity of a collected NL utterance and an associated MR." }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-23", "text": "Using this framework, we collected a dataset of 50k instances in the restaurant domain, which is 10 times bigger than datasets currently used for NLG training, e.g. SFRest and SFHot (Wen et al., 2015) or Bagel (Mairesse et al., 2010) ." }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-24", "text": "To evaluate the quality of the collected corpus, we analyse the data with regards to lexical richness and syntactic variation and compare our results to other popular datasets in similar domains, i.e. SFRest, SFHot and BAGEL.We use the Lexical Complexity Analyser (Lu, 2012) to measure various dimensions of lexical richness and variation, such as mean segmental type-token ratio (MSTTR) and lexical sophistication (LS)." }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-25", "text": "The results of lexical analysis (see Table 1 ) show that our corpus is lexically more diverse and as such, considerably more complex." }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-26", "text": "In order to evaluate syntactic variation and complexity of NL references in our corpus, we use the D-Level Analyser (Lu, 2009) ." 
}, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-27", "text": "Table 1 shows that our collected corpus has the highest proportion of complex sentences (54% Logic-based MR Pictorial MR informativeness 4.28** 4.51** naturalness 4.09** 4.43** phrasing 4.01** 4.40** Table 2 : Human evaluation of the data collected with logic-based and pictorial MRs (** denotes p <0.01) of sentences scored above level 1)." }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-30", "text": "In contrast to the other datasets, crowd workers were asked to verbalise all the useful information from the MR and were allowed to skip an attribute value considered unimportant." }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-31", "text": "As such, learning from this dataset promises more natural, varied and less templatelike system utterances." }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-32", "text": "To address challenge (2) in corpus development, we conduct a principled study regarding the trade-off between semantic expressiveness of the MR and the quality of crowdsourced utterances elicited." }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-33", "text": "In particular, we investigate translating textual MRs (presented in the form of logicbased dialogue acts, such as \"inform(name = the Wrestlers, price range = cheap, customer rating = low)\") into pictorial representations as used in, e.g. (Williams and Young, 2007; Black et al., 2011) ." }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-34", "text": "We show that pictorial MRs result in better quality NLG data than logic-based textual MRs: utterances elicited by pictorial MRs are judged as significantly more natural, more informative, and better phrased, with a significant increase in average quality ratings (around 0.5 points on a 6-point scale), compared to the logical MRs (see Table 2 )." 
}, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-35", "text": "Pictorial MRs also result in more spontaneous, natural and varied utterances." }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-36", "text": "This is probably due to crowd workers not being primed by lexical tokens." }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-37", "text": "Moreover, as the MR becomes more complex, the benefits of pictorial stimuli increase." }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-38", "text": "Our work addresses two major bottlenecks of current data-driven NLG: Reliable automatic evaluation and efficient high-quality data collection." }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-39", "text": "While our work shows that we can effectively crowdsource data of sufficient quality to train NLG algorithms -in particular, using pictorial representations that reduce bias and elicit more syntactically varied and lexically rich data -our work also clearly demonstrates the need for more advanced evaluation metrics." }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-40", "text": "We see our work as a first step towards reference-less evaluation for NLG by introducing grammar-based metrics." }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-41", "text": "----------------------------------" }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-42", "text": "**EVALUATION METRICS FOR NLG**" }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-43", "text": "Up to 60% of NLG research published between 2012-2015 relies on automatic evaluation measures, such as BLEU (Gkatzia and Mahamood, 2015) ." }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-44", "text": "The use of such metrics is, however, only sensible if they are known to be sufficiently correlated with human preferences, which is not the case, as we show in the most complete study to date, across metrics, systems, datasets and domains." 
}, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-45", "text": "We evaluate three end-to-end NLG systems:" }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-46", "text": "RNNLG (Wen et al., 2015) , TGen (Du\u0161ek and Jur\u010d\u00ed\u010dek, 2015) and LOLS (Lampouras and Vlachos, 2016) , using a large number of 21 automated metrics." }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-47", "text": "The metrics are divided into groups of word-based metrics (WBMs, such as TER (Snover et al., 2006) , BLEU (Papineni et al., 2002) , ROUGE (Lin, 2004) , semantic similarity (Han et al., 2013) reliability, we calculate the Spearman correlation between the metrics and human ratings for the same natural language (NL) utterances, the accuracy of relative rankings and conduct a detailed error analysis." }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-48", "text": "The results reveal that no metric produces an even moderate correlation with human ratings (max." }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-49", "text": "\u03c1 = 0.33), independently of dataset, system, or aspect of human rating." }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-50", "text": "WBMs make two strong assumptions: They treat human-generated NL references as a gold standard which is correct and complete." }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-52", "text": "Grammar-based metrics, on the other hand, do not rely on human-generated references and are not influenced by their quality." }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-53", "text": "However, these metrics can be easily manipulated with grammatically correct and easily readable output that is unrelated to the input." }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-54", "text": "Our study clearly demonstrates the need for more advanced metrics, as used in related fields, e.g. MT (Specia et al., 2010) ." 
}, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-55", "text": "Recent advances in corpus-based NLG (Du\u0161ek and Jur\u010d\u00ed\u010dek, 2015; Wen et al., 2015; Mei et al., 2016; Wen et al., 2016; Du\u0161ek and Jur\u010d\u00ed\u010dek, 2016; Lampouras and Vlachos, 2016) require costly training data, consisting of meaning representations (MRs) paired with corresponding NL texts." }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-56", "text": "In our work, we propose a novel framework for crowdsourcing high quality NLG training data, using automatic quality control measures and evaluating two types of MRs, pictorial and textual, used to elicit data (Novikova and Rieser, 2016) ." }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-57", "text": "When collecting corpora for training NLG systems, especially when using crowd workers, the following challenges arise:" }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-58", "text": "(1) How to ensure the required high quality of the collected data? (2) What types of meaning representations can elicit spontaneous, natural and varied data from crowd workers?" }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-59", "text": "To address (1), we filter the crowdsourced data using a combination of automatic and semimanual validation procedures, as described in (Novikova and Rieser, 2016) ." }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-60", "text": "We validate the data by selecting native English participants, allowing only well formed English sentences to be submitted, and measuring the semantic similarity of a collected NL utterance and an associated MR." }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-61", "text": "Using this framework, we collected a dataset of 50k instances in the restaurant domain, which is 10 times bigger than datasets currently used for NLG training, e.g. SFRest and SFHot (Wen et al., 2015) or Bagel (Mairesse et al., 2010) ." 
}, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-62", "text": "To evaluate the quality of the collected corpus, we analyse the data with regards to lexical richness and syntactic variation and compare our results to other popular datasets in similar domains, i.e. SFRest, SFHot and BAGEL.We use the Lexical Complexity Analyser (Lu, 2012) to measure various dimensions of lexical richness and variation, such as mean segmental type-token ratio (MSTTR) and lexical sophistication (LS)." }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-63", "text": "The results of lexical analysis (see Table 1 ) show that our corpus is lexically more diverse and as such, considerably more complex." }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-64", "text": "In order to evaluate syntactic variation and complexity of NL references in our corpus, we use the D-Level Analyser (Lu, 2009 of sentences scored above level 1)." }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-65", "text": "At the same time, the proportion of syntactically complex sentences (levels 6 and 7) is one of the highest." }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-66", "text": "Furthermore, our dataset requires content selection in 40% of the cases." }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-67", "text": "In contrast to the other datasets, crowd workers were asked to verbalise all the useful information from the MR and were allowed to skip an attribute value considered unimportant." }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-68", "text": "As such, learning from this dataset promises more natural, varied and less templatelike system utterances." }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-69", "text": "To address challenge (2) in corpus development, we conduct a principled study regarding the trade-off between semantic expressiveness of the MR and the quality of crowdsourced utterances elicited." 
}, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-70", "text": "In particular, we investigate translating textual MRs (presented in the form of logicbased dialogue acts, such as \"inform(name = the Wrestlers, price range = cheap, customer rating = low)\") into pictorial representations as used in, e.g. (Williams and Young, 2007; Black et al., 2011) ." }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-71", "text": "We show that pictorial MRs result in better quality NLG data than logic-based textual MRs: utterances elicited by pictorial MRs are judged as significantly more natural, more informative, and better phrased, with a significant increase in average quality ratings (around 0.5 points on a 6-point scale), compared to the logical MRs (see Table 2 )." }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-72", "text": "Pictorial MRs also result in more spontaneous, natural and varied utterances." }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-73", "text": "This is probably due to crowd workers not being primed by lexical tokens." }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-74", "text": "Moreover, as the MR becomes more complex, the benefits of pictorial stimuli increase." }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-75", "text": "----------------------------------" }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-76", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-77", "text": "Our work addresses two major bottlenecks of current data-driven NLG: Reliable automatic evaluation and efficient high-quality data collection." 
}, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-78", "text": "While our work shows that we can effectively crowdsource data of sufficient quality to train NLG algorithms -in particular, using pictorial representations that reduce bias and elicit more syntactically varied and lexically rich data -our work also clearly demonstrates the need for more advanced evaluation metrics." }, { "sent_id": "3c4c0875593ed0f196f5295fbaeb37-C001-79", "text": "We see our work as a first step towards reference-less evaluation for NLG by introducing grammar-based metrics." } ], "y": { "@USE@": { "gold_contexts": [ [ "3c4c0875593ed0f196f5295fbaeb37-C001-7", "3c4c0875593ed0f196f5295fbaeb37-C001-8" ], [ "3c4c0875593ed0f196f5295fbaeb37-C001-45", "3c4c0875593ed0f196f5295fbaeb37-C001-46" ] ], "cite_sentences": [ "3c4c0875593ed0f196f5295fbaeb37-C001-8", "3c4c0875593ed0f196f5295fbaeb37-C001-46" ] }, "@MOT@": { "gold_contexts": [ [ "3c4c0875593ed0f196f5295fbaeb37-C001-17" ], [ "3c4c0875593ed0f196f5295fbaeb37-C001-55" ] ], "cite_sentences": [ "3c4c0875593ed0f196f5295fbaeb37-C001-17", "3c4c0875593ed0f196f5295fbaeb37-C001-55" ] }, "@DIF@": { "gold_contexts": [ [ "3c4c0875593ed0f196f5295fbaeb37-C001-17", "3c4c0875593ed0f196f5295fbaeb37-C001-18" ], [ "3c4c0875593ed0f196f5295fbaeb37-C001-23" ], [ "3c4c0875593ed0f196f5295fbaeb37-C001-55", "3c4c0875593ed0f196f5295fbaeb37-C001-56" ], [ "3c4c0875593ed0f196f5295fbaeb37-C001-61" ] ], "cite_sentences": [ "3c4c0875593ed0f196f5295fbaeb37-C001-17", "3c4c0875593ed0f196f5295fbaeb37-C001-23", "3c4c0875593ed0f196f5295fbaeb37-C001-55", "3c4c0875593ed0f196f5295fbaeb37-C001-61" ] } } }, "ABC_457f9916ed4d7eafacea57e208c760_44": { "x": [ { "sent_id": "457f9916ed4d7eafacea57e208c760-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "457f9916ed4d7eafacea57e208c760-C001-2", "text": "Appointment scheduling is a problem faced daily by many individuals and organizations." 
}, { "sent_id": "457f9916ed4d7eafacea57e208c760-C001-3", "text": "Cooperating agent systems have been developed to partially automate this task." }, { "sent_id": "457f9916ed4d7eafacea57e208c760-C001-4", "text": "In order to extend the circle of participants as far as possible we advocate the use of natural language transmitted by email." }, { "sent_id": "457f9916ed4d7eafacea57e208c760-C001-5", "text": "We demonstrate COSMA, a fully implemented German language server for existing appointment scheduling agent systems." }, { "sent_id": "457f9916ed4d7eafacea57e208c760-C001-6", "text": "COSMA can cope with multiple dialogues in parallel, and accounts for differences in dialogue behaviour between human and machine agents." }, { "sent_id": "457f9916ed4d7eafacea57e208c760-C001-7", "text": "----------------------------------" }, { "sent_id": "457f9916ed4d7eafacea57e208c760-C001-8", "text": "**MOTIVATION**" }, { "sent_id": "457f9916ed4d7eafacea57e208c760-C001-9", "text": "Appointment scheduling is a problem faced daily by many individuals and organizations, and typically solved using communication in natural language (NL) by phone, fax or by mail." }, { "sent_id": "457f9916ed4d7eafacea57e208c760-C001-10", "text": "In general, cooperative interaction between several participants is required." }, { "sent_id": "457f9916ed4d7eafacea57e208c760-C001-11", "text": "Systems available on the market allow for calendar and contact management." }, { "sent_id": "457f9916ed4d7eafacea57e208c760-C001-12", "text": "However, as (Busemann and Merget, 1995) point out in a market survey, all planning and scheduling activity remains with the user." }, { "sent_id": "457f9916ed4d7eafacea57e208c760-C001-13", "text": "Cooperative agent systems developed in the field of Distributed AI are designed to account for the scheduling tasks." 
}, { "sent_id": "457f9916ed4d7eafacea57e208c760-C001-14", "text": "Using distributed rather than centralized calendar systems, they not only guarantee a maximum privacy of calendar information but also offer their services to members or employees in external organizations." }, { "sent_id": "457f9916ed4d7eafacea57e208c760-C001-15", "text": "Although agent systems allow users to automate their scheduling tasks to a conThis work has been supported by a grant from the German Federal Ministry of Education, Science, Research and Technology (FKZ ITW-9402)." }, { "sent_id": "457f9916ed4d7eafacea57e208c760-C001-16", "text": "siderable degree, the circle of participants remains restricted to users with compatible systems." }, { "sent_id": "457f9916ed4d7eafacea57e208c760-C001-17", "text": "To overcome this drawback we have designed and implemented COSMA, a novel kind of NL dialogue system that serves as a German language front-end to scheduling agents." }, { "sent_id": "457f9916ed4d7eafacea57e208c760-C001-18", "text": "Human language makes agent services available to a much broader public." }, { "sent_id": "457f9916ed4d7eafacea57e208c760-C001-19", "text": "COSMA allows human and machine agents to participate in appointment scheduling dialogues via e-mail." }, { "sent_id": "457f9916ed4d7eafacea57e208c760-C001-20", "text": "We are concerned with meetings all participants should attend and the date of which is negotiable." }, { "sent_id": "457f9916ed4d7eafacea57e208c760-C001-21", "text": "----------------------------------" }, { "sent_id": "457f9916ed4d7eafacea57e208c760-C001-22", "text": "**2**" }, { "sent_id": "457f9916ed4d7eafacea57e208c760-C001-23", "text": "The Systems COSMA is organized as a client/server architecture." }, { "sent_id": "457f9916ed4d7eafacea57e208c760-C001-24", "text": "The server offers NL dialogue service to multiple client agent systems." 
}, { "sent_id": "457f9916ed4d7eafacea57e208c760-C001-25", "text": "The scheduling agent systems act for their respective users." }, { "sent_id": "457f9916ed4d7eafacea57e208c760-C001-26", "text": "The agents systems use a calendar management system for displaying to their owners the results of the appointment negotiations." }, { "sent_id": "457f9916ed4d7eafacea57e208c760-C001-27", "text": "The users can enter their appointment constraints via a graphical user interface and receive the results either by e-mail or via their electronic calendar." }, { "sent_id": "457f9916ed4d7eafacea57e208c760-C001-28", "text": "Agent systems are thus hooked up to e-mail, to a calendar manager and to the dialogue server." }, { "sent_id": "457f9916ed4d7eafacea57e208c760-C001-29", "text": "The server interface is command-driven." }, { "sent_id": "457f9916ed4d7eafacea57e208c760-C001-30", "text": "A client may connect to the server and open up a dialogue (see Figure 1 in (Busemann et al., 1997) )." }, { "sent_id": "457f9916ed4d7eafacea57e208c760-C001-31", "text": "During the dialogue, the client may request texts to be analyzed or semantic descriptions to be verbalized." }, { "sent_id": "457f9916ed4d7eafacea57e208c760-C001-32", "text": "When given a text, the server returns the semantic representation, and vice versa." }, { "sent_id": "457f9916ed4d7eafacea57e208c760-C001-33", "text": "The client ensures that the server has available to it linguistically relevant information about the interlocutors, such as names, sexes etc." }, { "sent_id": "457f9916ed4d7eafacea57e208c760-C001-34", "text": "The user agents may access the dialogue server via Internet." }, { "sent_id": "457f9916ed4d7eafacea57e208c760-C001-35", "text": "They use the server as their NL front end to human participants." }, { "sent_id": "457f9916ed4d7eafacea57e208c760-C001-36", "text": "Machine agents interact with each other in their own formal language." 
}, { "sent_id": "457f9916ed4d7eafacea57e208c760-C001-37", "text": "This interaction remains unnoticed by the dialogue server." }, { "sent_id": "457f9916ed4d7eafacea57e208c760-C001-38", "text": "As a consequence, the dialogues modeled within the server represent only part of the complete multi-participant negotiation." }, { "sent_id": "457f9916ed4d7eafacea57e208c760-C001-39", "text": "More precisely, only utterances between a human and a machine agent are modeled." }, { "sent_id": "457f9916ed4d7eafacea57e208c760-C001-40", "text": "The agent system used is a further development of the PASHA system (Schmeier and Schupeta, 1996) ." }, { "sent_id": "457f9916ed4d7eafacea57e208c760-C001-41", "text": "NL analysis in the server is based on a shallow parsing strategy implemented in the SMES system (Neumann et al., 1997) ." }, { "sent_id": "457f9916ed4d7eafacea57e208c760-C001-42", "text": "The use of SMES in COSMA, semantic analysis and inference, the dialogue model mapping between human and machine dialogue structures, utterance generation, the architectural framework of the server, and the PASHA agent system are described in (Busemann et al., 1997) ." }, { "sent_id": "457f9916ed4d7eafacea57e208c760-C001-43", "text": "Both papers can be found in the ANLP '97 conference proceedings." }, { "sent_id": "457f9916ed4d7eafacea57e208c760-C001-44", "text": "We demonstrate extended versions of the systems described in (Busemann et al., 1997) ." }, { "sent_id": "457f9916ed4d7eafacea57e208c760-C001-45", "text": "In particular, the systems to be demonstrated can process counterproposals, which form an important part of efficient and cooperative scheduling dialogues." }, { "sent_id": "457f9916ed4d7eafacea57e208c760-C001-46", "text": "----------------------------------" }, { "sent_id": "457f9916ed4d7eafacea57e208c760-C001-47", "text": "**THE DEMONSTRATION SCENARIO**" }, { "sent_id": "457f9916ed4d7eafacea57e208c760-C001-48", "text": "The demonstration scenario includes three participants." 
}, { "sent_id": "457f9916ed4d7eafacea57e208c760-C001-49", "text": "Two are using autonomous agent systems that partially automate the negotiation of appointment scheduling and manage their users' private electronic calendars." }, { "sent_id": "457f9916ed4d7eafacea57e208c760-C001-50", "text": "The third person plans his appointments himself and interacts with other participants through NL e-mail messages." }, { "sent_id": "457f9916ed4d7eafacea57e208c760-C001-51", "text": "His calendar is managed outside the scope of the systems." }, { "sent_id": "457f9916ed4d7eafacea57e208c760-C001-52", "text": "Dialogues can be initiated by the human participant or by one of the agent systems." }, { "sent_id": "457f9916ed4d7eafacea57e208c760-C001-53", "text": "In the former case, the users of the agent systems usually are not involved in the negotiation." }, { "sent_id": "457f9916ed4d7eafacea57e208c760-C001-54", "text": "They see the result :~hen it is entered into their electronic calendars." }, { "sent_id": "457f9916ed4d7eafacea57e208c760-C001-55", "text": "In the latter case, the user starts his agent by entering via a graphical interface the appointment constraints to be used in the negotiation." }, { "sent_id": "457f9916ed4d7eafacea57e208c760-C001-56", "text": "The basic constraints include the time interval within which the appointment must be fixed, the duration of the meeting, and the participants." }, { "sent_id": "457f9916ed4d7eafacea57e208c760-C001-57", "text": "For demonstration purposes, e-mail is exchanged between different accounts on a local host, which the server is running on as well." }, { "sent_id": "457f9916ed4d7eafacea57e208c760-C001-58", "text": "In principle, each participant and the server could reside on a different site in the Internet." }, { "sent_id": "457f9916ed4d7eafacea57e208c760-C001-59", "text": "The NL server is implemented in Common Lisp and C with a graphical surface written in Tcl/Tk." 
}, { "sent_id": "457f9916ed4d7eafacea57e208c760-C001-60", "text": "The PASHA agent system is implemented in DFKIOz (Smolka, 1995) ." }, { "sent_id": "457f9916ed4d7eafacea57e208c760-C001-61", "text": "The systems are demonstrated on a Sun workstation under Unix." } ], "y": { "@USE@": { "gold_contexts": [ [ "457f9916ed4d7eafacea57e208c760-C001-30" ] ], "cite_sentences": [ "457f9916ed4d7eafacea57e208c760-C001-30" ] }, "@BACK@": { "gold_contexts": [ [ "457f9916ed4d7eafacea57e208c760-C001-42" ] ], "cite_sentences": [ "457f9916ed4d7eafacea57e208c760-C001-42" ] }, "@EXT@": { "gold_contexts": [ [ "457f9916ed4d7eafacea57e208c760-C001-44" ] ], "cite_sentences": [ "457f9916ed4d7eafacea57e208c760-C001-44" ] } } }, "ABC_8bdfc9e82e474413f29ee92f81467e_44": { "x": [ { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-2", "text": "In this work, we investigate the task of textual response generation in a multimodal task-oriented dialogue system." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-3", "text": "Our work is based on the recently released Multimodal Dialogue (MMD) dataset (Saha et al., 2017) in the fashion domain." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-4", "text": "We introduce a multimodal extension to the Hierarchical Recurrent Encoder-Decoder (HRED) model and show that this extension outperforms strong baselines in terms of text-based similarity metrics." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-5", "text": "We also showcase the shortcomings of current vision and language models by performing an error analysis on our system's output." 
}, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-6", "text": "----------------------------------" }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-8", "text": "This work aims to learn strategies for textual response generation in a multimodal conversation directly from data." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-9", "text": "Conversational AI has great potential for online retail: It greatly enhances user experience and in turn directly affects user retention (Chai et al., 2000) , especially if the interaction is multi-modal in nature." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-10", "text": "So far, most conversational agents are uni-modal -ranging from opendomain conversation (Ram et al., 2018; Papaioannou et al., 2017; Fang et al., 2017) to task oriented dialogue systems Lemon, 2010, 2011; Young et al., 2013; Singh et al., 2000; Wen et al., 2016) ." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-11", "text": "While recent progress in deep learning has unified research at the intersection of vision and language, the availability of open-source multimodal dialogue datasets still remains a bottleneck." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-12", "text": "This research makes use of a recently released Multimodal Dialogue (MMD) dataset (Saha et al., 2017) , which contains multiple dialogue sessions in the fashion domain." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-13", "text": "The MMD dataset provides an interesting new challenge, combining recent efforts on task-oriented dialogue systems, as well as visually grounded dialogue." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-14", "text": "In contrast to simple QA tasks in visually grounded dialogue, e.g. (Antol et al., 2015) , it contains conversations with a clear end-goal." 
}, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-15", "text": "However, in contrast to previous slot-filling dialogue systems, e.g. (Rieser and Lemon, 2011; Young et al., 2013) , it heavily relies on the extra visual modality to drive the conversation forward (see Figure 1) ." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-16", "text": "In the following, we propose a fully data-driven response generation model for this task." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-17", "text": "Our work is able to ground the system's textual response with language and images by learning the semantic correspondence between them while modelling long-term dialogue context." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-18", "text": "Lu et al., 2016) ." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-19", "text": "In contrast to standard sequenceto-sequence models (Cho et al., 2014; Sutskever et al., 2014; Bahdanau et al., 2015) , HREDs model the dialogue context by introducing a context Recurrent Neural Network (RNN) over the encoder RNN, thus forming a hierarchical encoder." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-20", "text": "We build on top of the HRED architecture to include multimodality over multiple images." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-21", "text": "A simple HRED consists of three RNN modules: encoder, context and decoder." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-22", "text": "In multimodal HRED, we combine the output representations from the utterance encoder with concatenated multiple image representations and pass them as input to the context encoder (see Figure 2) ." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-23", "text": "A dialogue is modelled as a sequence of utterances (turns), which in turn are modelled as sequences of words and images." 
}, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-24", "text": "Formally, a dialogue is generated according to the following:" }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-25", "text": "where t n is the n-th utterance in a dialogue." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-26", "text": "For each m = 1, . . . , M n , we have hidden states of each module defined as:" }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-27", "text": "where f text \u03b8 ,f cxt \u03b8 and f dec \u03b8 are GRU cells (Cho et al., 2014) ." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-28", "text": "\u03b8 represent model parameters, w n,m is the m-th word in the n-th utterance and g enc \u03b8 is a Convolutional Neural Network (CNN); here we use VGGnet (Simonyan and Zisserman, 2014) ." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-29", "text": "We pass multiple images in a context through the CNN in order to get encoded image representations g enc \u03b8 (img k )." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-30", "text": "Then these are combined together and passed through a linear layer l img to get the aggregated image representation for one turn of context, denoted by h are subsequently concatenated and passed as input to the context RNN." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-31", "text": "h cxt N , the final hidden state of the context RNN, acts as the initial hidden state of the decoder RNN." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-32", "text": "Finally, output is generated by passing h dec n,m through an affine transformation followed by a softmax activation." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-33", "text": "The model is trained using cross entropy on next-word prediction." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-34", "text": "During generation, the decoder conditions on the previous output token." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-35", "text": "Saha et al. 
(2017) propose a similar baseline model for the MMD dataset, extending HREDs to include the visual modality." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-36", "text": "However, for simplicity's sake, they 'unroll' multiple images in a single utterance to include only one image per utterance." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-37", "text": "While computationally leaner, this approach ultimately loses the objective of capturing multimodality over the context of multiple images and text." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-38", "text": "In contrast, we combine all the image representations in the utterance using a linear layer." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-39", "text": "We argue that modelling all images is necessary to answer questions that address previous agent responses." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-40", "text": "For example in Figure 3 , when the user asks \"what about the 4th image?\", it is impossible to give a correct response without reasoning over all images in the previous response." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-41", "text": "In the following, we empirically show that our extension leads to better results in terms of text-based similarity measures, as well as quality of generated dialogues." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-42", "text": "Example contexts for a given system utterance; note the difference in our approach from Saha et al. (2017) when extracting the training data from the original chat logs." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-43", "text": "For simplicity, in this illustration we consider a context size of 2 previous utterances." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-44", "text": "'|' differentiates turns for a given context." 
}, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-45", "text": "We concatenate the representation vector of all images in one turn of a dialogue to form the image context." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-46", "text": "If there is no image in the utterance, we consider a 0 4096 vector to form the image context." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-47", "text": "In this work, we focus only on the textual response of the agent." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-48", "text": "----------------------------------" }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-49", "text": "**EXPERIMENTS AND RESULTS**" }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-50", "text": "----------------------------------" }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-51", "text": "**DATASET**" }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-52", "text": "The MMD dataset (Saha et al., 2017) consists of 100/11/11k train/validation/test chat sessions comprising 3.5M context-response pairs for the model." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-53", "text": "Each session contains an average of 40 dialogue turns (average of 8 words per textual response, 4 images per image response)." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-54", "text": "The data contains complex user queries, which pose new challenges for multimodal, task-based dialogue, such as quantitative inference (sorting, counting and filtering): \"Show me more images of the 3rd product in some different directions\", inference using domain knowledge and long term context: \"Will the 5th result go well with a large sized messenger bag?\", inference over aggregate of images: \"List more in the upper material of the 5th image and style as the 3rd and the 5th\", co-reference resolution." 
}, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-55", "text": "Note that we started with the raw transcripts of dialogue sessions to create our own version of the dataset for the model." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-56", "text": "This is done since the authors originally consider each image as a different context, while we consider all the images in a single turn as one concatenated context (cf. Figure 3 )." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-57", "text": "----------------------------------" }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-58", "text": "**IMPLEMENTATION**" }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-59", "text": "We use the PyTorch 1 framework (Paszke et al., 2017) for our implementation." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-60", "text": "2 We used 512 as the word embedding size as well as hidden dimension for all the RNNs using GRUs (Cho et al., 2014) with tied embeddings for the (bidirectional) encoder and decoder." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-61", "text": "The decoder uses Luong-style attention mechanism (Luong et al., 2015) with input feeding." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-62", "text": "We trained our model with the Adam optimizer (Kingma and Ba, 2015) , with a learning rate of 0.0004 and clipping gradient norm over 5." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-63", "text": "We perform early stopping by monitoring validation loss." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-64", "text": "For image representations, we use the FC6 layer representations of the VGG-19 (Simonyan and Zisserman, 2014) , pre-trained on ImageNet." 
}, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-65", "text": "3" }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-66", "text": "----------------------------------" }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-67", "text": "**ANALYSIS AND RESULTS**" }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-68", "text": "We report sentence-level BLEU-4 (Papineni et al., 2002) , METEOR (Lavie and Agarwal, 2007) and ROUGE-L (Lin and Och, 2004) using the evaluation scripts provided by (Sharma et al., 2017) ." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-69", "text": "We compare our results against Saha et al. (2017) by using their code and data-generation scripts." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-70", "text": "4 Note that the results reported in their paper are on a different version of the corpus, hence not directly comparable." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-71", "text": "Table 1 provides results for different configurations of our model (\"T\" stands for text-only in the encoder, \"M\" for multimodal, and \"attn\" for using attention in the decoder)." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-72", "text": "We experimented with different context sizes and found that output quality improved with increased context size (models with 5-turn context perform better than those with a 2-turn context), confirming the observation by Serban et al. (2016 Serban et al. ( , 2017 ." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-73", "text": "5 Using attention clearly helps: even T-HRED-attn outperforms M-HRED (without attention) for the same context size." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-74", "text": "We also tested whether multimodal input has an impact on the generated outputs." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-75", "text": "However, there was only a slight increase in BLEU score (M-HRED-attn vs T-HRED-attn)." 
}, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-76", "text": "To summarize, our best performing model (M-HRED-attn) outperforms the model of Saha et al. by 7 BLEU points." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-77", "text": "6 This can be primarily attributed to the way we created the input for our model from raw chat logs, as well as incorporating more information during decoding via attention." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-78", "text": "Figure 4 provides example output utterances using M-HRED-attn with a context size of 5." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-79", "text": "Our model is able to accurately map the response to previous textual context turns as shown in (a) and (c)." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-80", "text": "In (c), it is able to capture that the user is asking about the style in the 1st and 2nd image." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-81", "text": "(d) shows an example where our model is able to relate that the corresponding product is 'jeans' from visual features, while it is not able to model finegrained details like in (b) that the style is 'casual fit' but resorts to 'woven'." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-82", "text": "----------------------------------" }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-83", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-84", "text": "In this research, we address the novel task of response generation in search-based multimodal dialogue by learning from the recently released Multimodal Dialogue (MMD) dataset (Saha et al., 2017) ." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-85", "text": "We introduce a novel extension to the Hierarchical Recurrent Encoder-Decoder (HRED) model (Serban et al., 2016) and show that our implementation significantly outperforms the model of Saha et al. (2017) by modelling the full multimodal context." 
}, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-86", "text": "Contrary to their results, our generation outputs improved by adding attention and increasing context size." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-87", "text": "However, we also show that multimodal HRED does not improve significantly over text-only HRED, similar to observations by Agrawal et al. (2016) and Qian et al. (2018) ." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-88", "text": "Our model learns to handle textual correspondence between the questions and answers, while mostly ignoring the visual context." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-89", "text": "This indicates that we need better visual models to en-code the image representations when he have multiple similar-looking images, e.g., black hats in Figure 3 ." }, { "sent_id": "8bdfc9e82e474413f29ee92f81467e-C001-90", "text": "We believe that the results should improve with a jointly trained or fine-tuned CNN for generating the image representations, which we plan to implement in future work." 
} ], "y": { "@DIF@": { "gold_contexts": [ [ "8bdfc9e82e474413f29ee92f81467e-C001-19" ] ], "cite_sentences": [ "8bdfc9e82e474413f29ee92f81467e-C001-19" ] }, "@USE@": { "gold_contexts": [ [ "8bdfc9e82e474413f29ee92f81467e-C001-26", "8bdfc9e82e474413f29ee92f81467e-C001-27" ], [ "8bdfc9e82e474413f29ee92f81467e-C001-60" ] ], "cite_sentences": [ "8bdfc9e82e474413f29ee92f81467e-C001-27", "8bdfc9e82e474413f29ee92f81467e-C001-60" ] } } }, "ABC_db6794da83b12336ab946e5777346d_44": { "x": [ { "sent_id": "db6794da83b12336ab946e5777346d-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "db6794da83b12336ab946e5777346d-C001-48", "text": "What is truly necessary and sufficient for a task to be termed lexicography is that:" }, { "sent_id": "db6794da83b12336ab946e5777346d-C001-2", "text": "The title of our talk-an implicit reference to the English clich\u00e9 like a spider weaving her webintends to attract one's attention to the metaphor that can be drawn between the dance of a spider weaving her web and a new lexicographic gesture that is gradually emerging from the work on Net-like lexical resources (Fellbaum, 1998; Baker et al., 2003; Gader et al., 2012) ." }, { "sent_id": "db6794da83b12336ab946e5777346d-C001-3", "text": "Our claim is that the inherent graph structure of natural language lexicons not only determine vocabulary acquisition and use (Wolter, 2006) , but also lexicographic activity." }, { "sent_id": "db6794da83b12336ab946e5777346d-C001-4", "text": "In that respect, reflecting on new ways to implement the task of building lexical resources is essential for lexicographers themselves, but also for anyone interested is lexicons as mental structures." }, { "sent_id": "db6794da83b12336ab946e5777346d-C001-5", "text": "After all, lexicographers and language learners are those who have the most direct contact with lexical structures, through closely related activities: describing a natural phenomenon is a form of learning through explicit conceptualization." 
}, { "sent_id": "db6794da83b12336ab946e5777346d-C001-6", "text": "Lexicographers often experience the fact that by completing the description of a word they achieve a form of understanding and mastering of this word." }, { "sent_id": "db6794da83b12336ab946e5777346d-C001-7", "text": "They do not merely transcribe word knowledge and observations made on word behavior in speech and texts: they \"acquire\" the word." }, { "sent_id": "db6794da83b12336ab946e5777346d-C001-8", "text": "This makes them feel good and this explains why lexicography is indeed extremely addictive." }, { "sent_id": "db6794da83b12336ab946e5777346d-C001-9", "text": "Our talk title is also an implicit reference to the English collocation web of words, that is so often used to refer to natural language lexicons as messy and too big to be embraced entities -cf." }, { "sent_id": "db6794da83b12336ab946e5777346d-C001-10", "text": "(Murray, 1977) , entitled Caught in the web of words: James A. H. Murray and the Oxford English dictionary." }, { "sent_id": "db6794da83b12336ab946e5777346d-C001-11", "text": "Of course, webs can be seen as being essentially traps that one gets caught in." }, { "sent_id": "db6794da83b12336ab946e5777346d-C001-12", "text": "This is so to speak the fly or innocent bug perspective." }, { "sent_id": "db6794da83b12336ab946e5777346d-C001-13", "text": "However, lexicographers ought not be caught in the web: they can behave as spiders weaving the web." }, { "sent_id": "db6794da83b12336ab946e5777346d-C001-14", "text": "This is possible if the model they are constructing is indeed a diagrammatic representation-in a semiotic sense (Farias and Queiroz, 2006)-of the natural language lexicon that is being scrutinized." }, { "sent_id": "db6794da83b12336ab946e5777346d-C001-15", "text": "It is when lexicographers run on pages, writing dictionary articles, like flies walking on a glass window, that they have the most chance to get caught in the web of words." 
}, { "sent_id": "db6794da83b12336ab946e5777346d-C001-16", "text": "This is why lexicographers have long ago introduced systems of cards and records to help them compile data on lexical units." }, { "sent_id": "db6794da83b12336ab946e5777346d-C001-17", "text": "Lexicographic records helped lexicographers free themselves from the two-dimensional prison of the dictionary." }, { "sent_id": "db6794da83b12336ab946e5777346d-C001-18", "text": "Their knowledge about words occupied a \"volume,\" that of filing cabinets, which is more in line with the three-dimensional nature of the lexicons they had to describe." }, { "sent_id": "db6794da83b12336ab946e5777346d-C001-19", "text": "Later, with the advent of computational lexicography, relational databases replaced filing cabinets as convenient tools. . . and metaphors." }, { "sent_id": "db6794da83b12336ab946e5777346d-C001-20", "text": "* Extended abstract for CogALex III invited lecture." }, { "sent_id": "db6794da83b12336ab946e5777346d-C001-21", "text": "----------------------------------" }, { "sent_id": "db6794da83b12336ab946e5777346d-C001-22", "text": "**THE SPIDER ATTITUDE**" }, { "sent_id": "db6794da83b12336ab946e5777346d-C001-23", "text": "The title of our talk-an implicit reference to the English clich\u00e9 like a spider weaving her webintends to attract one's attention to the metaphor that can be drawn between the dance of a spider weaving her web and a new lexicographic gesture that is gradually emerging from the work on Net-like lexical resources (Fellbaum, 1998; Baker et al., 2003; Gader et al., 2012) ." }, { "sent_id": "db6794da83b12336ab946e5777346d-C001-24", "text": "Our claim is that the inherent graph structure of natural language lexicons not only determine vocabulary acquisition and use (Wolter, 2006) , but also lexicographic activity." 
}, { "sent_id": "db6794da83b12336ab946e5777346d-C001-47", "text": "However, the dictionary-whether in paper or electronic format-is just one among many possible incarnations of lexical models." }, { "sent_id": "db6794da83b12336ab946e5777346d-C001-25", "text": "In that respect, reflecting on new ways to implement the task of building lexical resources is essential for lexicographers themselves, but also for anyone interested is lexicons as mental structures." }, { "sent_id": "db6794da83b12336ab946e5777346d-C001-26", "text": "After all, lexicographers and language learners are those who have the most direct contact with lexical structures, through closely related activities: describing a natural phenomenon is a form of learning through explicit conceptualization." }, { "sent_id": "db6794da83b12336ab946e5777346d-C001-27", "text": "Lexicographers often experience the fact that by completing the description of a word they achieve a form of understanding and mastering of this word." }, { "sent_id": "db6794da83b12336ab946e5777346d-C001-28", "text": "They do not merely transcribe word knowledge and observations made on word behavior in speech and texts: they \"acquire\" the word." }, { "sent_id": "db6794da83b12336ab946e5777346d-C001-29", "text": "This makes them feel good and this explains why lexicography is indeed extremely addictive." }, { "sent_id": "db6794da83b12336ab946e5777346d-C001-30", "text": "Our talk title is also an implicit reference to the English collocation web of words, that is so often used to refer to natural language lexicons as messy and too big to be embraced entities -cf." }, { "sent_id": "db6794da83b12336ab946e5777346d-C001-31", "text": "(Murray, 1977) , entitled Caught in the web of words: James A. H. Murray and the Oxford English dictionary." }, { "sent_id": "db6794da83b12336ab946e5777346d-C001-32", "text": "Of course, webs can be seen as being essentially traps that one gets caught in." 
}, { "sent_id": "db6794da83b12336ab946e5777346d-C001-33", "text": "This is so to speak the fly or innocent bug perspective." }, { "sent_id": "db6794da83b12336ab946e5777346d-C001-34", "text": "However, lexicographers ought not be caught in the web: they can behave as spiders weaving the web." }, { "sent_id": "db6794da83b12336ab946e5777346d-C001-35", "text": "This is possible if the model they are constructing is indeed a diagrammatic representation-in a semiotic sense (Farias and Queiroz, 2006 )-of the natural language lexicon that is being scrutinized." }, { "sent_id": "db6794da83b12336ab946e5777346d-C001-36", "text": "It is when lexicographers run on pages, writing dictionary articles, like flies walking on a glass window, that they have the most chance to get caught in the web of words." }, { "sent_id": "db6794da83b12336ab946e5777346d-C001-37", "text": "This is why lexicographers have long ago introduced systems of cards and records to help them compile data on lexical units." }, { "sent_id": "db6794da83b12336ab946e5777346d-C001-38", "text": "Lexicographic records helped lexicographers free themselves from the two-dimensional prison of the dictionary." }, { "sent_id": "db6794da83b12336ab946e5777346d-C001-39", "text": "Their knowledge about words occupied a \"volume,\" that of filing cabinets, which is more in line with the three-dimensional nature of the lexicons they had to describe." }, { "sent_id": "db6794da83b12336ab946e5777346d-C001-40", "text": "Later, with the advent of computational lexicography, relational databases replaced filing cabinets as convenient tools. . . and metaphors." 
}, { "sent_id": "db6794da83b12336ab946e5777346d-C001-41", "text": "----------------------------------" }, { "sent_id": "db6794da83b12336ab946e5777346d-C001-42", "text": "**TOWARDS A LEXICOGRAPHY OF VIRTUAL DICTIONARIES**" }, { "sent_id": "db6794da83b12336ab946e5777346d-C001-43", "text": "New data structures for lexical resources should come together with new ways of building lexical models, and this is the main topic we are dealing with here." }, { "sent_id": "db6794da83b12336ab946e5777346d-C001-44", "text": "In order to propose an alternate perspective on lexicography, one that in our opinion is more cognition-compatible in nature, it is necessary to first eradicate a rather widespread misconception related to the construction of lexical models." }, { "sent_id": "db6794da83b12336ab946e5777346d-C001-45", "text": "According to common perception, lexicography is all about writing dictionaries and, therefore, any activity that targets the construction of other types of lexical models, freed from the two-dimensional (textual) dictionary, is not \"true\" lexicography." }, { "sent_id": "db6794da83b12336ab946e5777346d-C001-46", "text": "This misconception, very common among laypersons and endorsed by many natural language researchers, originates mainly for the sheer fact that, for centuries, lexicographers had no better medium of encoding than the text and no better physical support for their description than sheets of paper bound together to make dictionaries." 
}, { "sent_id": "db6794da83b12336ab946e5777346d-C001-49", "text": "\u2022 it targets the description of lexical units of one or more natural languages in terms of sense, forms and all other relevant linguistic properties;" }, { "sent_id": "db6794da83b12336ab946e5777346d-C001-50", "text": "\u2022 it uses a well-defined frame of analysis that allows for a coherent and uniform description of all lexical units;" }, { "sent_id": "db6794da83b12336ab946e5777346d-C001-51", "text": "\u2022 it is essentially a hand-made task, but with no limitation to the amount and diversity of tools and external data that can be used to perform this task;" }, { "sent_id": "db6794da83b12336ab946e5777346d-C001-52", "text": "\u2022 it \"sees big:\" the greater the coverage and depth of description for each lexical unit, the more lexicographic the task will be." }, { "sent_id": "db6794da83b12336ab946e5777346d-C001-53", "text": "This last point is more important than it may appear: when it comes to the lexicon-its description, as well as learning, mastering, etc.-size does matter." }, { "sent_id": "db6794da83b12336ab946e5777346d-C001-54", "text": "To take an extreme case, a person whose only experience in the field is the description of just one or a couple of lexical units can simply not be considered a lexicographer and the task accomplished is all but an exercise in lexicography." }, { "sent_id": "db6794da83b12336ab946e5777346d-C001-55", "text": "By contrast, someone who has achieved the description of tens of thousands of lexical units is no doubt an experienced lexicographer." }, { "sent_id": "db6794da83b12336ab946e5777346d-C001-56", "text": "Somewhere in between, there is the transition from being an apprentice to being an actual lexicographer." }, { "sent_id": "db6794da83b12336ab946e5777346d-C001-57", "text": "Notice that no mention of the formal nature of lexical models is made in the above characterization of lexicography." 
}, { "sent_id": "db6794da83b12336ab946e5777346d-C001-58", "text": "In fact, when the construction of a totally new, graph-based model of lexical knowledge was proposed by WordNet initiators (Miller et al., 1990) , no claim was made on the advent of a new discipline." }, { "sent_id": "db6794da83b12336ab946e5777346d-C001-59", "text": "On the contrary, lexicography remained the reference, with work performed by individuals called lexicographers, who were constructing datasets called lexicographer files. And this is entirely justified as, precisely, lexicography is not about writing dictionaries per se." }, { "sent_id": "db6794da83b12336ab946e5777346d-C001-60", "text": "This fact has already been pointed at by some dictionary-makers; (Atkins, 1996) , for instance, adopts a rather visionary perspective and goes as far as to consider that bilingual lexicography should be aiming at virtual dictionaries-cf." }, { "sent_id": "db6794da83b12336ab946e5777346d-C001-61", "text": "S. Atkins' proposal for \"real databases, real links and virtual dictionaries\" (section 2.2.1 of her paper)." }, { "sent_id": "db6794da83b12336ab946e5777346d-C001-62", "text": "----------------------------------" }, { "sent_id": "db6794da83b12336ab946e5777346d-C001-63", "text": "**FROM WRITING DICTIONARIES TO WEAVING LEXICAL NETWORKS**" }, { "sent_id": "db6794da83b12336ab946e5777346d-C001-64", "text": "In our talk, we take the above observations as given, including the fact that lexicography should indeed be targeting virtual dictionaries, generated from non-textual lexical models ." }, { "sent_id": "db6794da83b12336ab946e5777346d-C001-65", "text": "We illustrate how the lexicographic process of building graph-based lexical models can benefit from tools that allow lexicographers to wade through the lexical web, following paradigmatic and syntagmatic paths, while simultaneously weaving new links and incrementing the lexical description." 
}, { "sent_id": "db6794da83b12336ab946e5777346d-C001-66", "text": "Work performed on the French Lexical Network (Gader et al., 2012 ) will serve to demonstrate how the lexicographic process can be made closer to actual navigation through lexical knowledge by the speaker." }, { "sent_id": "db6794da83b12336ab946e5777346d-C001-67", "text": "The main theoretical and descriptive tool that makes such navigation possible is the system of lexical functions proposed by the Meaning-Text linguistic approach (Mel'\u010duk, 1996) ." }, { "sent_id": "db6794da83b12336ab946e5777346d-C001-68", "text": "It induces the multidimensional and non-hierarchical graph structure of the FLN that, we believe, is far better suited for designing lexical resources than hyperonymy-based structures." }, { "sent_id": "db6794da83b12336ab946e5777346d-C001-69", "text": "Computational aspects of the work on the French Lexical Network are dealt with in (Gader et al., 2012) ." }, { "sent_id": "db6794da83b12336ab946e5777346d-C001-70", "text": "In our presentation, we focus on the actual process of weaving lexical relations." } ], "y": { "@USE@": { "gold_contexts": [ [ "db6794da83b12336ab946e5777346d-C001-2" ], [ "db6794da83b12336ab946e5777346d-C001-23" ] ], "cite_sentences": [ "db6794da83b12336ab946e5777346d-C001-2", "db6794da83b12336ab946e5777346d-C001-23" ] }, "@FUT@": { "gold_contexts": [ [ "db6794da83b12336ab946e5777346d-C001-66" ] ], "cite_sentences": [ "db6794da83b12336ab946e5777346d-C001-66" ] }, "@BACK@": { "gold_contexts": [ [ "db6794da83b12336ab946e5777346d-C001-69" ] ], "cite_sentences": [ "db6794da83b12336ab946e5777346d-C001-69" ] } } }, "ABC_5ad1e8b75cc6f5b627f770cced8e0f_44": { "x": [ { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-97", "text": "Table 4 : Accuracy in solving our three tasks." 
}, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-2", "text": "The detection of reused text is important in a wide range of disciplines." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-3", "text": "However, even as research in the field of plagiarism detection is constantly improving, heavily modified or paraphrased text is still challenging for current methodologies." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-4", "text": "For historical texts, these problems are even more severe, since text sources were often subject to stronger and more frequent modifications." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-5", "text": "Despite the need for tools to automate text criticism, e.g., tracing modifications in historical text, algorithmic support is still limited." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-6", "text": "While current techniques can tell if and how frequently a text has been modified, very little work has been done on determining the degree and kind of paraphrastic modification-despite such information being of substantial interest to scholars." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-7", "text": "We present a human-interpretable, feature-based method to measure paraphrastic modification." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-8", "text": "Evaluating our technique on three data sets, we find that our approach performs competitive to text similarity scores borrowed from machine translation evaluation, being much harder to interpret." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-9", "text": "----------------------------------" }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-10", "text": "**INTRODUCTION WHY IS TEXT REUSE IMPORTANT?**" }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-11", "text": "The term text reuse refers to the repetition of a text within a new context." 
}, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-12", "text": "Examples are citations, paraphrases of a text, allusions, or even cases of cross-linguistic reuse in the form of translations." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-13", "text": "In the humanities context, the detection of text reuse helps tracing down lines of transmission, which is essential to the field of textual criticism (B\u00fcchler et al., 2012) ." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-14", "text": "Text reuse detection can also help consolidating today's digital libraries by assuring the consistency of content by inter-linking related documents (Schilling, 2012) ." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-15", "text": "Background: To this date, a lot of effort has been put into the investigation of detecting plagiarism, a special kind of text reuse." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-16", "text": "However, while constantly improving (see Ferrero et al. (2017) ), contemporary detection techniques are still quite unreliable when text is heavily modified." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-17", "text": "Historical text is even more challenging through incompleteness, copying errors, and evolution of language." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-18", "text": "Thus, only limited algorithmic support exists for the identification and analysis of (especially paraphrastic) repetition in such documents." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-19", "text": "While existing reuse detection techniques are able to tell if and how frequently a text has been modified, it is important to also determine the degree and characteristics of paraphrastic modification, i.e., the \"features\" that constitute a given modification." 
}, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-20", "text": "As such, understanding type and degree of reuse is an important prerequisite for enhancing reuse detection techniques for historical texts as well as giving scholars hints for deeper investigation." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-21", "text": "In this work, we present a technique to measure paraphrastic modification which is both human-interpretable and semantically informed." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-22", "text": "This interpretability sets our method apart from recent approaches based on distributional semantics which do not allow for easy manual inspection of individual model decisions (Wieting et al., 2015) ." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-23", "text": "We already investigated descriptive characteristics of paraphrasing in a specific humanities use case (Moritz et al., 2018) ." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-24", "text": "We found changes in inflection, synonym replacement and co-hyponym replacement to be the most frequent paraphrastic modifications, thus supporting the feasibility of feature-based approaches to this problem." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-25", "text": "Method and Questions: We measure the degree of modification based on a list of modification operations that we count in a prioritized order based on relations between aligned, parallel sentences." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-26", "text": "These relationships between two words can range from exact copy (no operation necessary) to co-hyponymy, see Table 1 ." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-27", "text": "Compared to scores such as Meteor that make use of synonymy, but do not model other relationships, our score also includes information on hypernymy, hyponymy, and co-hyponymy." 
}, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-28", "text": "This is especially useful in historical text, since meaning and, therefore, relationships change over time." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-29", "text": "The order in which these operations are counted is intuitive and follows the usual prepossessing steps that one would perform to reduce variance in a text corpus." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-30", "text": "Table 2 shows an example of the alignment output, thus illustrating our method." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-31", "text": "The relative frequencies of the operations then serve as input features for a binary classifier." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-32", "text": "In this contribution we investigate, how our human-interpretable method compares against text similarity metrics borrowed from machine translation evaluation (also serving as input for a classifier)." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-33", "text": "In particular, we examine the performance of those approaches for semantic equivalency in: (RQ1) a modern English paraphrase corpus; (RQ2) a parallel Bible corpus; and (RQ3) a medieval Latin text reuse dataset." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-34", "text": "----------------------------------" }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-35", "text": "**RELATED WORK**" }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-36", "text": "Surface Feature Approaches: Levenshtein's (1966) edit distance, which is based on characterlevel removal, insertion, and replacement operations, can be considered as one of the earliest works to measure text similarity." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-37", "text": "B\u00fcchler et al. (2012) use overlapping bi-grams to maximize recall in a reuse detection task of Homeric quotations, showing a good precision of more than 70% at the same time." 
}, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-38", "text": "Those techniques rely on surface features (token and character-level) only." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-39", "text": "Thus, our proposed method differs by also incorporating semantic information (lexico-semantic relationships between aligned word pairs)." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-40", "text": "Semantic Approaches: Computing the semantic similarity between two sentences is a popular task in NLP (Xu et al., 2015) ." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-41", "text": "Osman et al. (2012) present a plagiarism detection technique based on semantic role labeling." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-42", "text": "They analyze text by identifying the semantic space of each term in a sentence and find semantic arguments for each sentence." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-43", "text": "They also assign weights to the arguments and find that not all of them affect plagiarism detection." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-44", "text": "Techniques from the field of paraphrase detection can be used for e.g., sentence similarity, entailment, and sentiment classification." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-45", "text": "Wieting et al. (2015) use embedding models to identify paraphrastic sentences in such a mixed NLP task employing a large corpus of short phrases associated with paraphrastic relatives." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-46", "text": "Their simplest model represents a sentence embedding by the averaged vectors of its tokens, the most complex model is a long short-term memory (LSTM) recurrent neural network." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-47", "text": "They find that the word averaging model performs best on sentence similarity and entailment, and the LSTM performs best on sentiment classification." 
}, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-48", "text": "Although these methods generally show good results, they typically allow no manual inspection of why a specific judgment is made and are thus ill-suited for applications in the humanities." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-72", "text": "Our method will be denoted \"multi f\" (multiple features) for the remainder of this paper." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-49", "text": "Approaches Based on Machine Translation (MT) Evaluation Metrics: Madnani et al. (2012) conduct a study on the usefulness of automated MT evaluation metrics (e.g., BLEU, NIST and Meteor) for the task of paraphrase identification." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-50", "text": "They train an ensemble of different classifiers using scores of MT metrics as features." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-51", "text": "They evaluate their model on two corpora for paraphrase and plagiarism detection, respectively, finding that it performs very satisfyingly." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-52", "text": "This approach to paraphrase and plagiarism detection based on MT metrics combines surface and semantic features since Meteor incorporates synonymy information (see below)." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-53", "text": "Yet, the number of semantic features used is limited and so is also the interpretability of this approach." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-54", "text": "----------------------------------" }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-55", "text": "**BIBLE TRANSLATION CLASS PREDICTION:**" }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-56", "text": "We use a parallel corpus of eight English Bible translations that we gathered from three sources." 
}, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-57", "text": "5 We split them in two classes: literal translationsthose being directly translated from the primary languages Hebrew and Ancient Greek coming with rich linguistic diversity-and translations that mainly follow the translation tradition of the Anglican Church (standard)." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-58", "text": "Table 3 lists the detailed edition names accompanied by its publishing date and its class." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-59", "text": "For the experiments we extract parallel verses from two different editions and try to predict if they come from the same or different translation classes (literal vs. standard)." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-60", "text": "Latin Reuse Detection: Excerpts from a total of twelve works and two work collections from the 12th century Latin writer Bernard of Clairvaux constitute our third dataset." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-61", "text": "The team behind the Biblindex project (Mellerin, 2014) 6 manually identified 1,100 instances of text reuse in these writings and bundled them into a corpus." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-62", "text": "Every instance of reuse relates to a Bible verse from the Biblia Sacra Juxta Vulgatam Versionem and is typically half as long as the original verse." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-63", "text": "Negative training data of equal size were obtained by randomly shuffling the initial dataset." 
}, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-64", "text": "----------------------------------" }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-65", "text": "**METHODS**" }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-66", "text": "Our method relies on the relative frequencies of modification operations (see Table 1 ) in an aligned sentence pair which later serve as features for a classifier:" }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-67", "text": "where x i is the relative frequency of a modification operation i in an aligned sentence pair, m is the number of features, and o i is the absolute frequency of operation i. 7 Our method, hence, can be understood as a collection of features that are represented as relative frequencies of edits obtained from empirical values." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-68", "text": "These features are used as input to a maximum entropy classifier to predict if two sentences are paraphrases of each other." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-69", "text": "MaxEnt was chosen due to its simplicity, relying on a linear combination of features." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-70", "text": "Thus feature weights can be roughly interpreted as importance of the respective modification operation after fitting the model." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-71", "text": "Recall the example alignment presented in Table 2 illustrating the high interpretability of our approach." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-73", "text": "We evaluate our method by comparing it to several reference methods based on machine-translation evaluation metrics." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-74", "text": "8 To adapt these to our different paraphrase detection tasks, the source Bible provides the reference sentence (ref ) and the target Bible (and Bernard's reuse respectively) provides the system output (sys)." 
}, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-75", "text": "From the Gold corpus, also the source text (numbered in the repository with 1, see Madnani et al. (2012) ) serves as reference, and the paraphrastic reuse of it (numbered with 2), provides the system output." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-76", "text": "Reference Methods: Often, machine translation metrics are based on simple edit distance measures." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-77", "text": "Unlike simple word error rate (WER; Su et al. (1992) ), which depends on a strict word order, the positionindependent error rate (PER; Tillmann et al. (1997) ) uses a bag-of-words approach." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-78", "text": "Popovi\u0107 and Ney (2007) define PER based on counts of independent words that system output and reference sentence have in common." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-79", "text": "We adapt their document-wide score to the sentence level:" }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-80", "text": "where N sys is the length (in words) of the target reuse text-in MT a.k.a." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-81", "text": "the system output version of a text-and N ref is the length of the source text-in MT a.k.a." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-82", "text": "the reference sentence for a system output-, and n(e, ref ) is the frequency of a given word e in the reference sentence." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-83", "text": "The translation edit rate (TER; Snover et al. (2006) ) is the number of edits that a system output should undergo so that it matches a reference sentence." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-84", "text": "TER 9 is normalized by the length of the reference input." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-85", "text": "Following Papineni et al. 
(2002) , we define a sentence-based BLEU score:" }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-86", "text": "where N is the maximum n-gram size, which we set to 2." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-87", "text": "p n is a precision score calculated on the n-grams in both the source and target texts (see Papineni et al. (2002) )." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-88", "text": "We omit BLEU's brevity penalty, which would otherwise dominate our sentence-level analysis." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-89", "text": "The last measure we consider is Meteor 1.5 (Denkowski and Lavie, 2014) ." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-90", "text": "Meteor differs from the other scores especially by considering not only precision but also recall." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-91", "text": "It further takes synonymy and paraphrases into account." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-92", "text": "Meteor introduces so-called matchers, namely exact match, stem match, synonym match, and paraphrase match." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-93", "text": "The hypothesis (system) and reference texts h and r are split into content words h c and r c , and function words h f and r f ." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-94", "text": "Precision and recall measures are then used to determine the harmonic mean F mean ." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-95", "text": "Together with a fragmentation penalty that measures the degree of chunking, the Meteor score is calculated as M eteor = (1 \u2212 penalty) \u00d7 F mean ." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-96", "text": "Similar to Madnani et al. (2012) , we use these MT scores separately in a classification task to predict paraphrasticality, where the respective MT score is fed into a MaxEnt classifier as the only feature."
}, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-98", "text": "Detecting Paraphrases (RQ1): Using the relative operation count from the alignment as features in a classification task, we determine the classification accuracy of our approach on the gold corpus." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-99", "text": "We run a maximum entropy classifier on our operation features." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-100", "text": "The results in Table 4 show that Meteor performs best on that task, followed by our approach." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-101", "text": "Predicting Translation Classes (RQ2): Here we want to determine if two aligned Bible editions are of the same translation class (labeled with 0), or of different classes (labeled with 1); we distinguish between standard vs. literal translations." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-102", "text": "We use the operation counts based on two aligned verses as features in this binary classification task." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-103", "text": "Our operations equip us with a fine-grained description of the degree of modification of two text excerpts." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-104", "text": "The Bible corpus is a suitable source for measuring the degree of modification, since it holds a broad variety of paraphrastic reuses." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-105", "text": "To estimate a human judgment of deviation, we assume that standard translations are more homogeneous to each other (based on their evolution history) than literal translations that demand for more creative language use (Moritz et al., 2018) ." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-106", "text": "We use 10-fold-cross validation on the shuffled dataset." 
}, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-107", "text": "The results in Table 4 show that all methods under consideration perform comparably well." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-108", "text": "We also find that our proposed method suffers from a drop of accuracy when semantic features are ablated." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-109", "text": "When only WordNet, not BabelNet, is used for identifying lexico-semantic relations, performance increases slightly, which we attribute to noise that comes with using BabelNet." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-110", "text": "Detecting Latin Reuse (RQ3): Finally, we predict reuse in the medieval Latin dataset." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-111", "text": "Moritz et al. (2016) found out that co-hyponymy (besides synonymy) can be a common means of substitution in reuse, especially in medieval texts." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-112", "text": "Consequently, our method is well suited for this task, because it considers semantic relations beyond synonymy." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-113", "text": "10 Again, we use 10-fold cross-validation on the shuffled dataset." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-114", "text": "Table 4 shows that dropping features such as co-hyponyms indeed worsens the accuracy." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-115", "text": "The low score of TER may be explained by the fact that this metric's normalization term is based on the length of the reference version of a sentence." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-116", "text": "In our setup the Bible verse is the reference and the system output is the reuse." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-117", "text": "The reuse however, is often shorter than the Bible verse (see above)." 
}, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-118", "text": "----------------------------------" }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-119", "text": "**DISCUSSION AND CONCLUSION**" }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-120", "text": "We presented a method for paraphrase detection that describes reuse based on the frequency of specific modification operations and is thus easily interpretable for humans." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-121", "text": "We showed that modeling reuse in historical text using semantic relations beyond synonyms achieves results comparable to using features derived from machine translation metrics." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-122", "text": "Moreover, our method is especially useful for applications in the humanities as operation frequencies, their respective feature weights, and, by extensions, individual model decisions are open to manual inspection." }, { "sent_id": "5ad1e8b75cc6f5b627f770cced8e0f-C001-123", "text": "In future work, we plan to tune parameters and to qualitatively analyze weaknesses of our method (e.g., due to the tools used for pre-processing and alignment)." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "5ad1e8b75cc6f5b627f770cced8e0f-C001-49" ], [ "5ad1e8b75cc6f5b627f770cced8e0f-C001-75" ] ], "cite_sentences": [ "5ad1e8b75cc6f5b627f770cced8e0f-C001-49", "5ad1e8b75cc6f5b627f770cced8e0f-C001-75" ] }, "@USE@": { "gold_contexts": [ [ "5ad1e8b75cc6f5b627f770cced8e0f-C001-75" ] ], "cite_sentences": [ "5ad1e8b75cc6f5b627f770cced8e0f-C001-75" ] }, "@SIM@": { "gold_contexts": [ [ "5ad1e8b75cc6f5b627f770cced8e0f-C001-96" ] ], "cite_sentences": [ "5ad1e8b75cc6f5b627f770cced8e0f-C001-96" ] } } }, "ABC_a26260547114750ba2aa49e5f96136_44": { "x": [ { "sent_id": "a26260547114750ba2aa49e5f96136-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-2", "text": "Multiple-choice machine reading comprehension is difficult task as its required machines to select the correct option from a set of candidate or possible options using the given passage and question." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-3", "text": "Reading Comprehension with Multiple Choice Questions task, required a human (or machine) to read a given passage, question pair and select the best one option from n given options." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-4", "text": "There are two different ways to select the correct answer from the given passage." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-5", "text": "Either by selecting the best match answer to by eliminating the worst match answer." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-6", "text": "Here we proposed GenNet model, a neural network-based model." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-7", "text": "In this model first we will generate the answer of the question from the passage and then will matched the generated answer with given answer, the best matched option will be our answer." 
}, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-8", "text": "For answer generation we used S-net (Tan et al., 2017) model trained on SQuAD and To evaluate our model we used Large-scale RACE (ReAding Comprehension Dataset From Examinations) (Lai et al., 2017)." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-9", "text": "----------------------------------" }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-11", "text": "Reading comprehension is one of the fundamental skills for human, which one learn systematically since the elementary school." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-12", "text": "Reading comprehension give human the ability of reading texts, understanding their meanings,and with the help of given context answering questions." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-13", "text": "When machines are required to comprehend texts, they first need to understand the unstructured text and do reasoning based on given text (Chen et al., 2016) (Wang et al., 2018b) .Answering questions based a passage requires an individual unique skill set." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-14", "text": "It requires ability to perform basic mathematical operations and logical ability (e.g. to answer questions like how many times Amit visited sweet shop?), look-up ability, ability to deduce, ability to gather information contained in multiple sentences and passages." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-15", "text": "This diverse and unique skill set makes question answering a challenging task." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-16", "text": "There are several variants of this task, For example, if we have a given passage and a question, the answer could either (i) be generated from the passage (ii) match some span in the passage (iii) or could be one of the n number of given candidate answers." 
}, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-17", "text": "The last variant is mostly used in various high school, quiz , middle school, and different competitive examinations." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-18", "text": "This variant of Reading Comprehension generally referred as Reading Comprehension with Multiple Choice Questions (RC-MCQ).In the given figure 1 We have a passage and a question and 4 candidate answers." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-19", "text": "Task here defined is to find the most suitable answer from the passage for given question." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-20", "text": "While answering such Multiple Choice Questions (MCQs) figure 1, humans typically use a combination of option elimination and option selection or sometimes they find answer from the passage i.e they generate the answer of the question from passage and match the generated answer with given options and they choose more close candidate as correct answer." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-21", "text": "Here we proposed model which mimic the answer generation and then matching human process." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-22", "text": "First the span where possible answer in the passage is computed." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-23", "text": "we first compute a question-aware representation of the passage (which essentially tries to retain portions of the passage which are only relevant to the question)." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-24", "text": "Then we use answer generation using state-of-art S-Net model (Tan et al., 2017) which extract and generate answer figure 2." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-25", "text": "After we have answer generated from the passage now we weight every given candidate option and select the best matched option." 
}, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-26", "text": "That best matched option was our answer figure 3." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-27", "text": "----------------------------------" }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-28", "text": "**RELATED WORK**" }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-29", "text": "Datasets played an important role in machine reading comprehension, there were different type of datasets designed to solve different variant of machine reading comprehension." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-30", "text": "SQuAD dataset (Rajpurkar et al., 2016) was designed to answer simple question answer reading comprehension that aims to answer a question with exact text spans in a passage." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-31", "text": "Later MS-MACRO dataset (Nguyen et al., 2016) was designed for multi-passage reading comprehension." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-32", "text": "CNN/ Dailymail (Chen et al., 2016) and Who did what dataset (Onishi et al., 2016) designed for cloze variant reading comprehension." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-33", "text": "MCtest (Richardson et al., 2013) and RACE dataset (Lai et al., 2017) are released for Multiple choice question variant reading comprehension." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-34", "text": "(Zhu et al., 2018) , in this model the candidate options leverage to model the interaction between question options and passage." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-35", "text": "This was a option selection model which select the correct option from the given candidate options." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-36", "text": "Other work relatable to this paper was eliminating options model (Parikh et al., 2019) which eliminate the wrong answer from the candidate answer." 
}, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-37", "text": "Multi matching network (Tang et al., 2019) models interaction relationship between passage, questions and candidate answer." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-38", "text": "It take different paradigm of matching into account." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-39", "text": "Option comparison Network (Ran et al., 2019) compares between options at word level and identify correlation to help buildup logic and reasoning." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-40", "text": "Co-matching model (Wang et al., 2018a) is used to match between answer and question and passage pair." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-41", "text": "It match for the relationship between question and answer with the passage." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-42", "text": "Bidirectional co-matching based model (Zhang et al., 2019) matched passage and question, answer bidirectionally." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-43", "text": "The Convolutional Spatial Attention (CSA) model (Chen et al., 2019) form the enriched representaion by fully extract the mutual information among the passage, question, and the candidates." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-44", "text": "----------------------------------" }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-45", "text": "**SIMILAR WORK IN READING COMPREHENSION WHERE MULTIPLE CHOICE VARIANT OF COMPREHENSION CONSIDERED INCLUDES HIERARCHICAL ATTENTION FLOW MODEL**" }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-46", "text": "To generate answer several models are there like QANet (Yu et al., 2018) which combined local Convolution with Global Self-Attention and its encoder consist exclusively of convolution and self-attention." 
}, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-47", "text": "Bidirectional Attention Flow model (Seo et al., 2016) use to focus on large span of passage." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-48", "text": "BIDAF network is a multi stage hierarchical process and use bidirection attention flow to obtain a query-aware context representation." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-49", "text": "But the reason to use S-Net model as answer generation model because S-Net not only find the answer from the passage but it can also synthesise passage" }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-50", "text": "----------------------------------" }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-51", "text": "**PROPOSED MODEL**" }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-52", "text": "There are two tasks needs to be performed in this model." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-53", "text": "First is Answer extraction and Answer Synthesis/Generation and then option selection." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-54", "text": "Answer extraction and Generation will be done using state-of-art S-NET model (Tan et al., 2017) ." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-55", "text": "S-Net first pull out evidence snippets by matching the question and passage respectively, and then generates the answer by filtering the question, passage, and evidence snippets. consider a passage P = [p 1 , p 2 , p 3 , ...p p ] of word length P, Question Q = [Q 1 , Q 2 , Q 3 , ...Q q ] of word length Q, and n options Z n = [z 1 , z 2 , z 3 , ...z k ] where n > 1 and word length k. We first convert the words to their word-level embedding and character-level embedding using GLOVE (Pennington et al., 2014) .The encoding and embedding layers take in a series of tokens and represent it as a series of vectors." 
}, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-56", "text": "The character-level embeddings are cause by taking the final hidden states of a bi-directional GRU applied to embedding of characters in the token." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-57", "text": "They then use a bi-directional Gated Recurrent Unit to give rise to new depiction u p 1 , u p 2 , u p 3 , ...u p p for questions as well as u q 1 , u q 2 , u q 3 , ...u q q for passages too and u z 1 , u z 2 , u z 3 , ...u z z for options as well." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-58", "text": "The embedding matrix is boot only once and not trained in the entire learning process." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-59", "text": "As shown in Figure 4 S-NET uses the series-to-series model to incorporate the answer with the extracted evidences as features." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-60", "text": "They first produce the depiction It first produce the depiction h t p and h t q of all words in the question and passage respectively." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-61", "text": "When giving out the answer depiction, it merge the basic word embedding e t p with some added features f t s and f t e to indicate the end and start place of the evidence snippet given out by evidence extraction model." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-62", "text": "f t s =1 and f t e =1 mean the position t is the start and end of the evidence span, respectively." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-63", "text": "On top of the encoder, S-Net uses GRU with attention as the decoder to produce the answer." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-64", "text": "At each decoding time step t , the GRU reads the previous word embedding w t\u22121 and previous context vector c t\u22121 and finally produced answer." 
}, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-65", "text": "Figure 4 : Answer Synthesis/Generation Model (Tan et al., 2017) The produced answer will be stored in Answer vector." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-66", "text": "A n = [a 1 , a 2 , a 3 , ...a a ] where a is length of the answer." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-67", "text": "Figure 3 shows the overview of selection module." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-68", "text": "The selection module will take the refined answer representation a t and computes its bi-linear similarity with each option representation." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-69", "text": "score(i) = a t W att z ti (3) where i is the number of option, a t is generated answer vector, z ti is option vector and W att is a matrix which needs to be learned." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-70", "text": "We select the option which gives the highest score as computed above." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-71", "text": "We train the model using the cross entropy loss by normalizing the above scores (using softmax) first to obtain a probability distribution." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-72", "text": "----------------------------------" }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-73", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-74", "text": "Here we discussed about the dataset used to evaluate our model, Training procedure, result comparison and future work." 
}, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-75", "text": "----------------------------------" }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-76", "text": "**DATASET**" }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-77", "text": "We evaluate our model on RACE dataset (Lai et al., 2017) Race is a large-scale reading comprehension dataset with more than 28,000 passages and nearly 100,000 questions." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-78", "text": "The dataset is collected from English examinations in China, which are designed for middle school and high school students." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-79", "text": "Each passage is a JSON file." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-80", "text": "The JSON file contains fields (i) article: A string, which is the passage (ii) questions: A string list." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-81", "text": "Each string is a query." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-82", "text": "There are two types of questions." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-83", "text": "First one is an interrogative sentence." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-84", "text": "Another one has a placeholder, which is represented by _." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-85", "text": "----------------------------------" }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-86", "text": "**TRAINING PROCEDURES AND HYPER-PARAMETER**" }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-87", "text": "We integrate two different model into once." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-88", "text": "First we train our model on S-Net." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-89", "text": "To train model on S-Net we process dataset differently." 
}, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-90", "text": "We only consider passage and question and correct option to train model on S-Net." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-91", "text": "Later we pass the result on to next stage on our model where we train model using generated answer and all candidate options." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-92", "text": "To train the model, we used stochastic gradient descent with ADAM optimizer." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-93", "text": "(Kingma and Ba, 2014) We initialize learning rate with 0.005." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-94", "text": "Gradients are clipped in L2-norm to no larger than 10." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-95", "text": "To update model parameter per step,we used A mini-batch of 32 samples." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-96", "text": "We have created a vocabulary using top 65k words from passage and questions and if a new out-of-vocabulary(OOV) word encountered we add a special token UNK." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-97", "text": "We use the same vocabulary for the passage, question, and options vector embedding." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-98", "text": "We tune all our models based on the accuracy achieved on the validation set." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-99", "text": "We use 300 dimensional Glove embedding (Pennington et al., 2014) for word embedding and word and character encoding." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-100", "text": "We experiment with both fine-tuning and not fine-tuning these word embedding." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-101", "text": "We train all our models for upto 80 epochs as we do not see any benefit of training beyond 80 epochs as result were starting recurrence." 
}, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-102", "text": "The hidden state size of all GRU network is 128." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-103", "text": "We apply dropout (Srivastava et al., 2014) to word embeddings and BiGRU's outputs with a drop rate of 0.45." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-104", "text": "first answer are generated and then option is selected such model can be used to solve such multiple choice question whose answer option is not present or MCQ with \"none of the above\" or \"No answer\" type multiple choice questions." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-105", "text": "----------------------------------" }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-106", "text": "**RESULTS AND FUTURE WORK**" }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-107", "text": "----------------------------------" }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-108", "text": "**CONCLUSION**" }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-109", "text": "In this paper, we present the GenNet model for multiple-choice reading comprehension." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-110", "text": "Specifically, the model uses a combination of Generation and selection to arrive at the correct option." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-111", "text": "This is achieved by first generating the answer for the questions from the passage and then matching generated answer with the options." }, { "sent_id": "a26260547114750ba2aa49e5f96136-C001-112", "text": "At last, the proposed model achieves overall sate-of-the-art accuracy on RACE and significantly outperforms neural network baselines on RACE-M, RACE-H and RACE FULL.As future work, we would like to work towards unanswerable questions or questions where no option matched." 
} ], "y": { "@USE@": { "gold_contexts": [ [ "a26260547114750ba2aa49e5f96136-C001-8" ], [ "a26260547114750ba2aa49e5f96136-C001-24" ], [ "a26260547114750ba2aa49e5f96136-C001-54" ], [ "a26260547114750ba2aa49e5f96136-C001-65" ] ], "cite_sentences": [ "a26260547114750ba2aa49e5f96136-C001-8", "a26260547114750ba2aa49e5f96136-C001-24", "a26260547114750ba2aa49e5f96136-C001-54", "a26260547114750ba2aa49e5f96136-C001-65" ] } } }, "ABC_fe1d6ca4a88c03cfb2ae94ef45030d_44": { "x": [ { "sent_id": "fe1d6ca4a88c03cfb2ae94ef45030d-C001-22", "text": "All rights reserved." }, { "sent_id": "fe1d6ca4a88c03cfb2ae94ef45030d-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "fe1d6ca4a88c03cfb2ae94ef45030d-C001-2", "text": "In this paper, we present a keyphrase generation approach using conditional Generative Adversarial Networks (GAN)." }, { "sent_id": "fe1d6ca4a88c03cfb2ae94ef45030d-C001-3", "text": "In our GAN model, the generator outputs a sequence of keyphrases based on the title and abstract of a scientific article." }, { "sent_id": "fe1d6ca4a88c03cfb2ae94ef45030d-C001-4", "text": "The discriminator learns to distinguish between machine-generated and human-curated keyphrases." }, { "sent_id": "fe1d6ca4a88c03cfb2ae94ef45030d-C001-5", "text": "We evaluate this approach on standard benchmark datasets." }, { "sent_id": "fe1d6ca4a88c03cfb2ae94ef45030d-C001-6", "text": "Our model achieves state-of-theart performance in generation of abstractive keyphrases and is also comparable to the best performing extractive techniques." }, { "sent_id": "fe1d6ca4a88c03cfb2ae94ef45030d-C001-7", "text": "We also demonstrate that our method generates more diverse keyphrases and make our implementation publicly available 1 ." 
}, { "sent_id": "fe1d6ca4a88c03cfb2ae94ef45030d-C001-8", "text": "----------------------------------" }, { "sent_id": "fe1d6ca4a88c03cfb2ae94ef45030d-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "fe1d6ca4a88c03cfb2ae94ef45030d-C001-10", "text": "Keyphrases are employed to capture the most salient topics of a long document and are indexed in databases for convenient retrieval." }, { "sent_id": "fe1d6ca4a88c03cfb2ae94ef45030d-C001-11", "text": "Researchers annotate their scientific publications with high quality keyphrases to ensure discoverability in large scientific repositories." }, { "sent_id": "fe1d6ca4a88c03cfb2ae94ef45030d-C001-12", "text": "Keyphrases could either be extractive (part of the document) or abstractive." }, { "sent_id": "fe1d6ca4a88c03cfb2ae94ef45030d-C001-13", "text": "Keyphrase generation is the process of predicting both extractive and abstractive keyphrases from a given document." }, { "sent_id": "fe1d6ca4a88c03cfb2ae94ef45030d-C001-14", "text": "This process is similar to abstractive summarization but instead of a summary the models generate keyphrases." }, { "sent_id": "fe1d6ca4a88c03cfb2ae94ef45030d-C001-15", "text": "Researchers have achieved considerable success in the field of abstractive summarization using conditional-GANs (Wang and Lee 2018)." }, { "sent_id": "fe1d6ca4a88c03cfb2ae94ef45030d-C001-16", "text": "There has also been growing interest in deep learning models for keyphrase generation (Meng et al. 2017; Chan et al. 2019) ." }, { "sent_id": "fe1d6ca4a88c03cfb2ae94ef45030d-C001-17", "text": "Inspired by these advances, we propose a new GAN architecture for keyphrase generation where the generator produces a sequence of keyphrases from a given document and the discriminator distinguishes between human-curated and machine-generated keyphrases." 
}, { "sent_id": "fe1d6ca4a88c03cfb2ae94ef45030d-C001-18", "text": "----------------------------------" }, { "sent_id": "fe1d6ca4a88c03cfb2ae94ef45030d-C001-19", "text": "**PROPOSED ADVERSARIAL MODEL**" }, { "sent_id": "fe1d6ca4a88c03cfb2ae94ef45030d-C001-20", "text": "As with most GAN architectures, our model also consists of a generator (G) and discriminator (D), which are trained in an alternating fashion (Goodfellow et al. 2014) ." }, { "sent_id": "fe1d6ca4a88c03cfb2ae94ef45030d-C001-21", "text": "Copyright c 2019, Association for the Advancement of Artificial Intelligence (www.aaai.org)." }, { "sent_id": "fe1d6ca4a88c03cfb2ae94ef45030d-C001-23", "text": "1 Code is available at https://github.com/avinsit123/keyphrasegan Generator -Given a document d = {x 1 , x 2 , ..., x n }, where x i is the i th token, the generator produces a sequence of keyphrases: y = {y 1 , y 2 , ..., y m }, where each keyphrase y i is composed of tokens y 1 i , y 2 i , ..., y li i ." }, { "sent_id": "fe1d6ca4a88c03cfb2ae94ef45030d-C001-24", "text": "We employ catSeq model (Yuan et al. 2018) for the generation process, which uses an encoder-decoder framework: the encoder being a bidirectional Gated Recurrent Unit (bi-GRU) and the decoder a forward GRU." }, { "sent_id": "fe1d6ca4a88c03cfb2ae94ef45030d-C001-25", "text": "To incorporate the out-of-vocabulary words, we use a copying mechanism (Gu et al. 2016)." }, { "sent_id": "fe1d6ca4a88c03cfb2ae94ef45030d-C001-26", "text": "We also make use of attention mechanism to help the generator identify the relevant components of the source text." }, { "sent_id": "fe1d6ca4a88c03cfb2ae94ef45030d-C001-27", "text": "Discriminator -We propose a new hierarchical-attention model as the discriminator, which is trained to distinguish between human-curated and machine-generated keyphrases." }, { "sent_id": "fe1d6ca4a88c03cfb2ae94ef45030d-C001-28", "text": "The first layer of this model consists of m + 1 bi-GRUs." 
}, { "sent_id": "fe1d6ca4a88c03cfb2ae94ef45030d-C001-29", "text": "The first bi-GRU encodes the input document d as a sequence of vectors: h = {h 1 , h 2 , ..., h n }." }, { "sent_id": "fe1d6ca4a88c03cfb2ae94ef45030d-C001-30", "text": "The other m bi-GRUs, which have the same weight parameters, encode each keyphrase as a vector: {k 1 , k 2 , ..., k m }." }, { "sent_id": "fe1d6ca4a88c03cfb2ae94ef45030d-C001-31", "text": "We then use an attention-based approach (Luong, Pham, and Manning 2015) to build context vectors c j for each keyphrase, where c j is a weighted average over h. By concatenating c j and k j , we get a contextualized representation e j = [c j ; k j ] of keyphrase y j ." }, { "sent_id": "fe1d6ca4a88c03cfb2ae94ef45030d-C001-32", "text": "The second layer of the discriminator is another bi-GRU which consumes the document representation h and the keyphrase representations e. The final state of this layer is passed through one fully connected layer (W f ) and sigmoid transformation to get the probability that a given keyphrase sequence is human-curated." }, { "sent_id": "fe1d6ca4a88c03cfb2ae94ef45030d-C001-33", "text": "GAN training -For a given dataset (S), which contain the documents and corresponding keyphrases, we first pretrain the generator (G) using Maximum Likelihood Estimation." }, { "sent_id": "fe1d6ca4a88c03cfb2ae94ef45030d-C001-34", "text": "We then use this generator to produce machinegenerated keyphrases for all documents in S. These generated keyphrases along with the curated keyphrases are used to train the first version of the discriminator (D)." }, { "sent_id": "fe1d6ca4a88c03cfb2ae94ef45030d-C001-35", "text": "We then employ policy gradient reinforcement learning to train the subsequent versions of G. We freeze the weight parameters of D and use it for reward calculation to train a new version of G. The reward for each keyphrase is obtained from the last m states of the second bi-GRU layer in D (see Figure 1 )." 
}, { "sent_id": "fe1d6ca4a88c03cfb2ae94ef45030d-C001-36", "text": "The gradient update is given as:" }, { "sent_id": "fe1d6ca4a88c03cfb2ae94ef45030d-C001-37", "text": "where B is a baseline obtained by greedy decoding of keyphrase sequence." }, { "sent_id": "fe1d6ca4a88c03cfb2ae94ef45030d-C001-38", "text": "The resulting generator is then used to create new training samples for D. This process is continued till G converges." }, { "sent_id": "fe1d6ca4a88c03cfb2ae94ef45030d-C001-39", "text": "----------------------------------" }, { "sent_id": "fe1d6ca4a88c03cfb2ae94ef45030d-C001-40", "text": "**EXPERIMENTS AND RESULTS**" }, { "sent_id": "fe1d6ca4a88c03cfb2ae94ef45030d-C001-41", "text": "We trained the proposed GAN model on KP20k Table 1 , are broken down in terms of performance on extractive and abstractive keyphrases." }, { "sent_id": "fe1d6ca4a88c03cfb2ae94ef45030d-C001-42", "text": "For extractive keyphrases, our proposed model performs better than the pre-trained catSeq model on all datasets but is slightly worse than catSeq-RL except for on Krapivin where it obtains the best F1@M of 0.37." }, { "sent_id": "fe1d6ca4a88c03cfb2ae94ef45030d-C001-43", "text": "On the other hand, for abstractive keyphrases, our model performs better than the other two baselines on three of four datasets suggesting that GAN models are more effective in generation of keyphrases." }, { "sent_id": "fe1d6ca4a88c03cfb2ae94ef45030d-C001-44", "text": "We also evaluated the models in terms of \u03b1-nDCG@5 (Clarke et al. 2008) ." }, { "sent_id": "fe1d6ca4a88c03cfb2ae94ef45030d-C001-45", "text": "The results are summarized in Table 2 ." }, { "sent_id": "fe1d6ca4a88c03cfb2ae94ef45030d-C001-46", "text": "Our model obtains the best performance on three out of the four datasets." 
}, { "sent_id": "fe1d6ca4a88c03cfb2ae94ef45030d-C001-47", "text": "The difference is most prevalent in KP20k, the largest of the four datasets, where our GAN model (at 0.85) is nearly 5% better than both the other baseline models." }, { "sent_id": "fe1d6ca4a88c03cfb2ae94ef45030d-C001-48", "text": "----------------------------------" }, { "sent_id": "fe1d6ca4a88c03cfb2ae94ef45030d-C001-49", "text": "**CONCLUSION**" }, { "sent_id": "fe1d6ca4a88c03cfb2ae94ef45030d-C001-50", "text": "In this paper, we propose new GAN architecture for keyphrase generation." }, { "sent_id": "fe1d6ca4a88c03cfb2ae94ef45030d-C001-51", "text": "The proposed model obtains state-ofthe-art performance in generating abstractive keyphrases." }, { "sent_id": "fe1d6ca4a88c03cfb2ae94ef45030d-C001-52", "text": "To our knowledge, this is the first work that applies GANs to keyphrase generation problem." } ], "y": { "@MOT@": { "gold_contexts": [ [ "fe1d6ca4a88c03cfb2ae94ef45030d-C001-16" ] ], "cite_sentences": [ "fe1d6ca4a88c03cfb2ae94ef45030d-C001-16" ] }, "@USE@": { "gold_contexts": [ [ "fe1d6ca4a88c03cfb2ae94ef45030d-C001-20" ], [ "fe1d6ca4a88c03cfb2ae94ef45030d-C001-24" ] ], "cite_sentences": [ "fe1d6ca4a88c03cfb2ae94ef45030d-C001-20", "fe1d6ca4a88c03cfb2ae94ef45030d-C001-24" ] } } }, "ABC_fbd028e073459b1b4c2d8d99173e15_44": { "x": [ { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-2", "text": "We present a brief history and overview of statistical methods in frame-semantic parsing -the automatic analysis of text using the theory of frame semantics." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-3", "text": "We discuss how the FrameNet lexicon and frameannotated datasets have been used by statistical NLP researchers to build usable, state-of-the-art systems." 
}, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-4", "text": "We also focus on future directions in frame-semantic parsing research, and discuss NLP applications that could benefit from this line of work." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-27", "text": "Johansson and Nugues (2007) Table 1 : We show the current state of the art on the frame-semantic parsing task." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-5", "text": "Frame-semantic parsing has been considered as the task of automatically finding semantically salient targets in text, disambiguating their semantic frame representing an event and scenario in discourse, and annotating arguments consisting of words or phrases in text with various frame elements (or roles)." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-6", "text": "The FrameNet lexicon (Baker et al., 1998) , an ontology inspired by the theory of frame semantics (Fillmore, 1982) , serves as a repository of semantic frames and their roles." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-7", "text": "Figure 1 depicts a sentence with three evoked frames for the targets \"million\", \"created\" and \"pushed\" with FrameNet frames and roles." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-8", "text": "Automatic analysis of text using framesemantic structures can be traced back to the pioneering work of Gildea and Jurafsky (2002) ." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-9", "text": "Although their experimental setup relied on a primitive version of FrameNet and only made use of \"exemplars\" or example usages of semantic frames (containing one target per sentence) as opposed to a \"corpus\" of sentences, it resulted in a flurry of work in the area of automatic semantic role labeling ." 
}, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-10", "text": "However, the focus of semantic role labeling (SRL) research has mostly been on PropBank (Palmer et al., 2005) conventions, where verbal targets could evoke a \"sense\" frame, which is not shared across targets, making the frame disambiguation setup different from the representation in FrameNet." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-11", "text": "Furthermore, it is fair to say that early research on PropBank focused primarily on argument structure prediction, and the interaction between frame and argument structure analysis has mostly been unaddressed ." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-12", "text": "There are exceptions, where the verb frame has been taken into account during SRL (Meza-Ruiz and Riedel, 2009; Watanabe et al., 2010) ." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-13", "text": "Moreoever, the CoNLL 2008 and 2009 shared tasks also include the verb and noun frame identification task in their evaluations, although the overall goal was to predict semantic dependencies based on PropBank, and not full argument spans (Surdeanu et al., 2008; Haji\u010d et al., 2009) ." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-14", "text": "The SemEval 2007 shared task (Baker et al., 2007) attempted to revisit the frame-semantic analysis task based on FrameNet." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-15", "text": "It introduced a larger FrameNet lexicon (version 1.3), and also a larger corpus with full-text annotations compared to prior work, with multiple targets annotated per sentence." 
}, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-16", "text": "The corpus allowed words and phrases with noun, verb, adjective, adverb, number, determiner, conjunction and preposition syntactic categories to serve as targets and evoke frames, unlike any other single dataset; it also allowed targets from different syntactic categories share frames, and therefore roles." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-17", "text": "The repository of semantic role types was also much richer than PropBankstyle lexicons, numbering in several hundreds." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-18", "text": "Most systems participating in the task resorted to a cascade of classifiers and rule-based modules: identifying targets (a non-trivial subtask), disambiguating frames, identifying potential arguments, and then labeling them with roles." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-19", "text": "The system described by Johansson and Nugues (2007) performed the best in this shared task." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-20", "text": "Next, we focus on its performance, and subsequent improvements made by the research community on this task." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-21", "text": "Figure 1: A partial depiction of frame-semantic structures taken from ." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-22", "text": "The words in bold correspond to targets, which evoke semantic frames that are denoted in capital letters." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-23", "text": "Above each target is shown the corresponding lexical unit, which is a lemma appended by a coarse part-of-speech tag." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-24", "text": "Every frame is shown in a distinct color; each frame's arguments are annotated with the same color, and are marked below the sentence, at different levels." 
}, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-25", "text": "For the CARDINAL NUMBERS frame, \"M\" denotes the role Multiplier and \"E\" denotes the role Entity." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-26", "text": "SemEval'07 Data (automatic targets)" }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-28", "text": "The first section shows results on the SemEval 2007 shared task." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-29", "text": "The best system in the task, presented by Johansson and Nugues (2007) was later outperformed by SEMAFOR, a system described by Das et al. (2010) ." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-30", "text": "Both systems use a rule-based module to identify targets." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-31", "text": "On the FrameNet 1.5 data, presented additional semi-supervised experiments using gold targets, which was recently outperformed by an approach presented by Hermann et al. (2014) that made use of distributed word representations." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-32", "text": "Johansson and Nugues (2007) presented the system that resulted in the best F 1 score on the SemEval 2007 task of collectively identifying frameevoking targets, a disambiguated frame for each target, and the set of role-labeled arguments for each frame." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-33", "text": "The system contained a set of rulebased heuristics to identify targets followed by a cascade of three learned models as mentioned in \u00a71." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-34", "text": "Das et al. (2010) presented a tool called SE-MAFOR, 1 which improved upon this system with a similar framework for target identification, but only used two probabilistic models, one for frame identification, and one for predicting the arguments." 
}, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-35", "text": "The frame identification subpart involved a latent-variable log-linear model, which intended to capture frames for unseen targets, many of which appeared in the test data." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-36", "text": "Moreover, the feature sets in both the models were sufficiently different from prior work, resulting in improvements." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-37", "text": "Table 1 shows results on the SemEval 2007 data for these two systems." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-38", "text": "The FrameNet project released more annotations and a larger frame lexicon in 2010; used this dataset, and presented a variety of experiments improving upon their prior work, setting the new state of the art." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-39", "text": "A few salient aspects 1 See" }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-40", "text": "----------------------------------" }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-41", "text": "**FRAME-SEMANTIC PARSING**" }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-42", "text": "Frame-semantic parsing has been considered as the task of automatically finding semantically salient targets in text, disambiguating their semantic frame representing an event and scenario in discourse, and annotating arguments consisting of words or phrases in text with various frame elements (or roles)." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-43", "text": "The FrameNet lexicon (Baker et al., 1998) , an ontology inspired by the theory of frame semantics (Fillmore, 1982) , serves as a repository of semantic frames and their roles." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-44", "text": "Figure 1 depicts a sentence with three evoked frames for the targets \"million\", \"created\" and \"pushed\" with FrameNet frames and roles." 
}, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-45", "text": "Automatic analysis of text using framesemantic structures can be traced back to the pioneering work of Gildea and Jurafsky (2002) ." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-46", "text": "Although their experimental setup relied on a primitive version of FrameNet and only made use of \"exemplars\" or example usages of semantic frames (containing one target per sentence) as opposed to a \"corpus\" of sentences, it resulted in a flurry of work in the area of automatic semantic role labeling ." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-47", "text": "However, the focus of semantic role labeling (SRL) research has mostly been on PropBank (Palmer et al., 2005) conventions, where verbal targets could evoke a \"sense\" frame, which is not shared across targets, making the frame disambiguation setup different from the representation in FrameNet." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-48", "text": "Furthermore, it is fair to say that early research on PropBank focused primarily on argument structure prediction, and the interaction between frame and argument structure analysis has mostly been unaddressed ." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-49", "text": "There are exceptions, where the verb frame has been taken into account during SRL (Meza-Ruiz and Riedel, 2009; Watanabe et al., 2010) ." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-50", "text": "Moreoever, the CoNLL 2008 and 2009 shared tasks also include the verb and noun frame identification task in their evaluations, although the overall goal was to predict semantic dependencies based on PropBank, and not full argument spans (Surdeanu et al., 2008; Haji\u010d et al., 2009) ." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-51", "text": "The SemEval 2007 shared task (Baker et al., 2007) attempted to revisit the frame-semantic analysis task based on FrameNet." 
}, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-52", "text": "It introduced a larger FrameNet lexicon (version 1.3), and also a larger corpus with full-text annotations compared to prior work, with multiple targets annotated per sentence." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-53", "text": "The corpus allowed words and phrases with noun, verb, adjective, adverb, number, determiner, conjunction and preposition syntactic categories to serve as targets and evoke frames, unlike any other single dataset; it also allowed targets from different syntactic categories share frames, and therefore roles." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-54", "text": "The repository of semantic role types was also much richer than PropBankstyle lexicons, numbering in several hundreds." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-55", "text": "Most systems participating in the task resorted to a cascade of classifiers and rule-based modules: identifying targets (a non-trivial subtask), disambiguating frames, identifying potential arguments, and then labeling them with roles." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-56", "text": "The system described by Johansson and Nugues (2007) performed the best in this shared task." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-57", "text": "Next, we focus on its performance, and subsequent improvements made by the research community on this task. ." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-58", "text": "The words in bold correspond to targets, which evoke semantic frames that are denoted in capital letters." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-59", "text": "Above each target is shown the corresponding lexical unit, which is a lemma appended by a coarse part-of-speech tag." 
}, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-60", "text": "Every frame is shown in a distinct color; each frame's arguments are annotated with the same color, and are marked below the sentence, at different levels." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-61", "text": "For the CARDINAL NUMBERS frame, \"M\" denotes the role Multiplier and \"E\" denotes the role Entity." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-62", "text": "Table 1 : We show the current state of the art on the frame-semantic parsing task." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-63", "text": "The first section shows results on the SemEval 2007 shared task." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-64", "text": "The best system in the task, presented by Johansson and Nugues (2007) was later outperformed by SEMAFOR, a system described by Das et al. (2010) ." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-65", "text": "Both systems use a rule-based module to identify targets." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-66", "text": "On the FrameNet 1.5 data, presented additional semi-supervised experiments using gold targets, which was recently outperformed by an approach presented by Hermann et al. (2014) that made use of distributed word representations." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-67", "text": "----------------------------------" }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-68", "text": "**CURRENT STATE OF THE ART**" }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-69", "text": "Johansson and Nugues (2007) presented the system that resulted in the best F 1 score on the SemEval 2007 task of collectively identifying frameevoking targets, a disambiguated frame for each target, and the set of role-labeled arguments for each frame." 
}, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-70", "text": "The system contained a set of rulebased heuristics to identify targets followed by a cascade of three learned models as mentioned in \u00a71." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-71", "text": "Das et al. (2010) presented a tool called SE-MAFOR, 1 which improved upon this system with a similar framework for target identification, but only used two probabilistic models, one for frame identification, and one for predicting the arguments." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-72", "text": "The frame identification subpart involved a latent-variable log-linear model, which intended to capture frames for unseen targets, many of which appeared in the test data." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-73", "text": "Moreover, the feature sets in both the models were sufficiently different from prior work, resulting in improvements." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-74", "text": "Table 1 shows results on the SemEval 2007 data for these two systems." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-75", "text": "The FrameNet project released more annotations and a larger frame lexicon in 2010; of this updated version of SEMAFOR involved handling unseen targets using a graph-based semisupervised learning approach and improved inference using a dual decomposition algorithm." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-76", "text": "Subsequently, Hermann et al. (2014) used a very similar framework but presented a novel method using distributed word representations for better frame identification, outperforming the aforementioned update to SEMAFOR." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-77", "text": "Table 1 shows the performance in terms of F 1 score for frames and arguments given gold targets." 
}, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-78", "text": "Recent work on the FrameNet corpora, including the aforementioned two papers have used gold targets to measure the performance of statistical methods because the distribution of annotated targets in the data varied significantly across documents and domains, making it difficult to build a learnable system for target identification." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-79", "text": "The aforementioned papers focused on the task of sentence-internal frame-semantic analysis." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-80", "text": "There have been some investigation of finding implicit arguments of frames that may be present in other parts of a document, outside the sentential context." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-81", "text": "Although there has not been extensive research on this topic, a shared task at SemEval 2010 focused on this problem (Ruppenhofer et al., 2010) ." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-82", "text": "2 Moreover, there has been significant effort in developing unsupervised techniques for inducing frame-semantic structures (Modi et al., 2012) , to induce FrameNet-like lexicons from weak supervision, such as syntactic parses." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-83", "text": "----------------------------------" }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-84", "text": "**APPLICATIONS**" }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-85", "text": "Shallow semantic analysis based on FrameNet data has been recently utilized across various natural language processing applications with success." 
}, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-86", "text": "These include the generation of meeting summaries (Kleinbauer, 2012) , the prediction of stock price movement using (Xie et al., 2013) , inducing slots for domain-specific dialog systems (Chen et al., 2013) , stance classification in debates (Hasan and Ng, 2013) , modeling the clarity of student essays (Persing and Ng, 2013) to name a few." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-87", "text": "There is strong potential in using framesemantic structures in other applications such as question answering and machine translation, as demonstrated by prior work using PropBank-style SRL annotations (Shen and Lapata, 2007; Liu and Gildea, 2010) ." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-88", "text": "----------------------------------" }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-89", "text": "**FUTURE DIRECTIONS**" }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-90", "text": "Given the wide body of work in frame-semantic analysis of text, and recent interest in using framesemantic parsers in NLP applications, the future directions of research look exciting." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-91", "text": "First and foremost, to improve the quality of automatic frame-semantic parsers, the coverage of the FrameNet lexicon on free English text, and the number of annotated targets needs to increase." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-92", "text": "For example, the training dataset used for the state-ofthe-art system of Hermann et al. (2014) contains only 4,458 labeled targets, which is approximately 40 times less than the number of annotated targets in Ontonotes 4.0 (Hovy et al., 2006) , a standard NLP dataset, containing PropBank-style verb annotations." 
}, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-93", "text": "This comparison is important because FrameNet covers many more syntactic categories than the PropBank-style annotations, and features more than 1,000 semantic role labels compared to 51 in Ontonotes, but severely lacks annotations." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-94", "text": "A machine learned system would find it very hard to generalize to new data given such data sparsity." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-95", "text": "Increasing the quantity of such annotations requires exhaustive inter-annotator agreement studies (which has been rare in FrameNet corpora generation) and the development of annotation guidenominal targets in NomBank (Meyers et al., 2004) has been investigated recently (Gerber and Chai, 2012)." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-96", "text": "lines, such that these annotations can be produced outside the FrameNet project." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-97", "text": "Other than increasing the amount of labeled data, there is a necessity of automatically aligning predicate-level semantic knowledge present in resources like FrameNet, PropBank, NomBank and VerbNet (Schuler, 2005) ." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-98", "text": "These lexicons share a lot of knowledge about predicates and current resources like Ontonotes do align some of the information, but a lot remains missing." }, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-99", "text": "For example, alignment between these lexicons could be done within a statistical model for frame-semantic parsing, such that correlations between the coarse semantic role labels in PropBank or NomBank and the finer labels in FrameNet could be discovered automatically." 
}, { "sent_id": "fbd028e073459b1b4c2d8d99173e15-C001-100", "text": "Finally, the FrameNet data is an attractive test bed for semi-supervised learning techniques because of data sparsity; distributed word representations, which often capture more semantic information than surface-form features could be exploited in various aspects of the frame-semantic parsing task." } ], "y": { "@BACK@": { "gold_contexts": [ [ "fbd028e073459b1b4c2d8d99173e15-C001-31" ], [ "fbd028e073459b1b4c2d8d99173e15-C001-66" ], [ "fbd028e073459b1b4c2d8d99173e15-C001-92" ] ], "cite_sentences": [ "fbd028e073459b1b4c2d8d99173e15-C001-31", "fbd028e073459b1b4c2d8d99173e15-C001-66", "fbd028e073459b1b4c2d8d99173e15-C001-92" ] }, "@DIF@": { "gold_contexts": [ [ "fbd028e073459b1b4c2d8d99173e15-C001-31" ], [ "fbd028e073459b1b4c2d8d99173e15-C001-66" ], [ "fbd028e073459b1b4c2d8d99173e15-C001-76" ] ], "cite_sentences": [ "fbd028e073459b1b4c2d8d99173e15-C001-31", "fbd028e073459b1b4c2d8d99173e15-C001-66", "fbd028e073459b1b4c2d8d99173e15-C001-76" ] }, "@FUT@": { "gold_contexts": [ [ "fbd028e073459b1b4c2d8d99173e15-C001-90", "fbd028e073459b1b4c2d8d99173e15-C001-91", "fbd028e073459b1b4c2d8d99173e15-C001-92" ] ], "cite_sentences": [ "fbd028e073459b1b4c2d8d99173e15-C001-92" ] } } }, "ABC_cffa735deb802118640005a1d527ee_44": { "x": [ { "sent_id": "cffa735deb802118640005a1d527ee-C001-98", "text": "**EXPERIMENTAL SETTINGS**" }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-2", "text": "Active learning (AL) is getting more and more popular as a methodology to considerably reduce the annotation effort when building training material for statistical learning methods for various NLP tasks." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-3", "text": "A crucial issue rarely addressed, however, is when to actually stop the annotation process to profit from the savings in efforts." 
}, { "sent_id": "cffa735deb802118640005a1d527ee-C001-4", "text": "This question is tightly related to estimating the classifier performance after a certain amount of data has already been annotated." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-5", "text": "While learning curves are the default means to monitor the progress of the annotation process in terms of classifier performance, this requires a labeled gold standard which -in realistic annotation settings, at least -is often unavailable." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-6", "text": "We here propose a method for committee-based AL to approximate the progression of the learning curve based on the disagreement among the committee members." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-7", "text": "This method relies on a separate, unlabeled corpus and is thus well suited for situations where a labeled gold standard is not available or would be too expensive to obtain." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-8", "text": "Considering named entity recognition as a test case we provide empirical evidence that this approach works well under simulation as well as under real-world annotation conditions." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-9", "text": "----------------------------------" }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-11", "text": "State-of-the-art NLP components are increasingly based on supervised machine learning methods." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-12", "text": "This raises the need for large amounts of training data." 
}, { "sent_id": "cffa735deb802118640005a1d527ee-C001-13", "text": "While for the general language English newspaper domain syntactic (Marcus et al., 1993) , semantic (Palmer et al., 2005; Pustejovsky et al., 2003) , and even discourse (Carlson et al., 2003; Miltsakaki et al., 2008) annotations are increasingly made available, any language, domain, or genre shift pushes the severe burden on developers of NLP systems to supply comparably sized high-quality annotations." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-14", "text": "Even inner-domain shifts, such as, e.g., moving from hematology (Ohta et al., 2002) to the genetics of cancer (Kulick et al., 2004) within the field of molecular biology may have drastic consequences in the sense that entirely new meta data sets have to produced by annotation teams." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-15", "text": "Thus, reducing the human efforts for the creation of adequate training material is a major challenge." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-16", "text": "Active learning (AL) copes with this problem as it intelligently selects the data to be labeled." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-17", "text": "It is a sampling strategy where the learner has control over the training material to be manually annotated by selecting those examples which are of high utility for the learning process." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-18", "text": "AL has been successfully applied to speed up the annotation process for many NLP tasks without sacrificing annotation quality (Engelson and Dagan, 1996; Ngai and Yarowsky, 2000; Hwa, 2001; Tomanek et al., 2007a) ." 
}, { "sent_id": "cffa735deb802118640005a1d527ee-C001-19", "text": "Once we decide to use AL for meta-data annotation and a reasonable, stable level of annotation quality is reachedafter having run through only a fraction of the documents compared with the traditional annotation approach where a randomly and independently selected amount of documents is sequentially annotated -an obvious question turns up: When do we stop the annotation process to cash in the time savings?" }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-20", "text": "Stopping after a certain amount of time has elapsed or a certain amount of data has been annotated is clearly not the best choice since such criteria, easily applicable though, do not take into account how well a classifier trained on the annotated data really performs." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-21", "text": "An optimal stopping condition for any annotation would be to locate that point in time when no further improvement in terms of classifier performance can be achieved by additional annotations." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-22", "text": "Since learning curves show the classifier performance at different time steps, i.e., for different amounts of annotated training examples, we can observe that progression." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-23", "text": "Given this observation data we may stop the annotation process when the learning curve completely converges and is not ascending any more." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-24", "text": "In most real-world annotation scenarios, however, such a well-defined stopping point based on the convergence of classifier performance does not exist." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-25", "text": "Instead, additional annotations often result in slight improvements of the classifier's performance." 
}, { "sent_id": "cffa735deb802118640005a1d527ee-C001-26", "text": "Accordingly, one should rather consider the trade-off between further annotation efforts and gains in classifier performance to decide whether additional annotations are worth the effort for targeted application." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-27", "text": "This trade-off can be read from the learning curve which, unfortunately, will not always be available." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-28", "text": "Re-sampling strategies, e.g., cross-validation or bootstrapping, usually applied to estimate classifier performance, assume independently and identically distributed (i.i.d.) examples to sample from. But examples selected by means of AL do not meet this requirement." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-29", "text": "So, to estimate classifier performance a separately annotated gold standard with i.i.d." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-30", "text": "examples is often used to obtain a learning curve for AL." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-31", "text": "Yet, this solution comes with expensive extra annotation work." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-32", "text": "We present an approach to approximate the progression of the learning curve without the need for a labeled gold standard." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-33", "text": "We situate our discussion in the context of a simulation and a real-world annotation scenario and will find out that the second scenario imposes some restrictions on the configuration of the approach." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-34", "text": "The paper is structured as follows: In Section 2., we describe our approach in detail." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-35", "text": "Other work on stopping conditions for AL-based annotation is discussed in Section 3." 
}, { "sent_id": "cffa735deb802118640005a1d527ee-C001-36", "text": "Experimental results for the task of named entity recognition are presented in Section 4." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-37", "text": "----------------------------------" }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-38", "text": "**APPROXIMATING THE LEARNING CURVE**" }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-39", "text": "Given the idea that from the learning curve one can read the trade-off between annotation effort and classifier performance gain, we here propose an approach to approximate the progression of the learning curve which comes at no extra annotation costs." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-40", "text": "This approach is designed for use in committee-based AL (Seung et al., 1992) ." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-41", "text": "A committee consists of k classifiers of the same type trained on different subsets of the already labeled (training) data." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-42", "text": "Each committee member then makes its predictions on the pool of unlabeled examples, and those examples on which the committee members express the highest disagreement are considered most informative for learning and are thus selected for manual annotation." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-43", "text": "To calculate the disagreement among the committee members several metrics have been proposed including the vote entropy (Engelson and Dagan, 1996) as possibly the most well-known one." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-44", "text": "Our approach to approximating the learning curve is based on the disagreement within a committee." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-45", "text": "However, it is independent of the actual metric used to calculate the disagreement." 
}, { "sent_id": "cffa735deb802118640005a1d527ee-C001-46", "text": "Although in our experiments we considered the NLP task of named entity recognition (NER) only, our approach is not limited to this scenario and can be expected to be applicable to other tasks as well." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-47", "text": "In Tomanek et al. (2007a) we introduced the selection agreement (SA) curve -the average agreement amongst the selected examples plotted over time." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-48", "text": "When the SA values are close to '1', the committee members almost perfectly agree." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-49", "text": "So, any further AL iteration would resemble a random selection." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-50", "text": "Experiments have shown that at the point where the SA curve converges on values close to '1' the respective learning curve converges on its maximum value as well so that further annotations would have (almost) no impact on the classifier performance." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-51", "text": "As a result, we concluded that we can derive, from the SA curve, the point where the classifier performance is not increased any more by further annotation efforts." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-52", "text": "Hence, when this curve approaches values of '1' it can be interpreted as a stopping signal for annotation." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-53", "text": "However, this positive finding is due to an inherent feature of AL simulations." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-54", "text": "In typical simulation settings, the pool of annotation items is of a very limited size -normally only a few thousand examples." 
}, { "sent_id": "cffa735deb802118640005a1d527ee-C001-55", "text": "This is so because for simulations, a pre-annotated corpus is used and the manual annotation is simulated by just moving selected examples from the pool to the training set unveiling the labels." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-56", "text": "As a consequence, the total number of positive and hard examples, which are preferentially selected by AL, is rather limited." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-57", "text": "In the NER scenario, examples to be selected are complete sentences." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-58", "text": "Sentences containing (many) entity mentions can be considered as \"positive\" ones." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-59", "text": "Especially when very infrequent entity classes are to be annotated, a corpus will consist of a large proportion of \"negative\" examples which contain no entity mentions at all." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-60", "text": "In our experiments, we observed that sentences which contained many and complex entity mentions were already selected in early AL iterations." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-61", "text": "Thus, the more AL iterations are run, the less hard and positive examples are left in the pool." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-62", "text": "As a consequence, only in early iterations, AL really has choices to select useful examples." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-63", "text": "The SA curve is directly affected by this simulation effect and thus cannot be used as a reliable approximation of the learning curve in a real-world annotation scenario where the pool will be much larger and much more diverse." 
}, { "sent_id": "cffa735deb802118640005a1d527ee-C001-64", "text": "In such a setting there will always be useful (and, by this, hard) examples which AL may find, thus keeping the selection agreement constantly high." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-65", "text": "The solution we propose is to calculate the average agreement for each AL iteration on a separate validation set which should reflect the real data distribution and must not be used in the annotation process itself." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-66", "text": "As for most NLP tasks there is no limit to unlabeled data and no annotations are required, the validation set comes at no extra costs." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-67", "text": "Plotted over time we get the validation set agreement (VSA) curve." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-68", "text": "This curve is based on the same data in each AL iteration making the agreement values comparable between different AL iterations." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-69", "text": "Since the examples of the validation set are not used in the annotation process we can further guarantee that this curve is only affected by the benefit the selected and labeled examples have on training a classifier." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-70", "text": "Now, from a VSA curve which is only slightly ascending between selected measurement points we can infer that the respective learning curve has only a low slope at these positions, too." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-71", "text": "Although interpreting the actual agreement values of the VSA curve is still problematic, its progression behavior can be used to estimate whether further annotation is worth the human labeling effort." 
}, { "sent_id": "cffa735deb802118640005a1d527ee-C001-72", "text": "In Section 4., we will provide empirical evidence that the VSA curve is indeed an adequate approximation of the progression of the learning curve and that the SA curve fails in the real-world annotation scenario where examples are selected from a much larger pool." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-73", "text": "----------------------------------" }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-74", "text": "**RELATED WORK**" }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-75", "text": "While there is a large body of work on AL proper, there are only few papers reporting on stopping criteria or methods to monitor the progress of AL-driven annotations." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-76", "text": "Schohn and Cohn (2000) consider an AL approach for Support Vector Machines (SVM) where examples are selected according to their proximity to the hyperplane." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-77", "text": "They propose to stop the annotation process when, in the current AL iteration, none of the unlabeled examples are closer to the hyperplane than the support vectors." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-78", "text": "While this approach is restricted to AL for SVMs, Vlachos (2008) presents a stopping criterion for uncertainty-based AL (Cohn et al., 1996) in general." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-79", "text": "The confidence of the classifier at the current AL iteration is estimated on a large, separate validation set." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-80", "text": "The author reports that such a confidence curve follows a rise-peak-drop pattern: It rises in the beginning, then reaches its maximum values, and after that it constantly drops." 
}, { "sent_id": "cffa735deb802118640005a1d527ee-C001-81", "text": "The stopping condition is then defined as the point when the confidence curve starts dropping, i.e., the point when the learning curve has converged." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-82", "text": "This approach is similar to ours in that it employs the usefulness measure of the AL selection and in that it applies a separate validation set to calculate the confidence curve on." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-83", "text": "However, while it provides an exact stopping condition, it cannot provide a means to estimate the progression of the learning curve." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-84", "text": "This is equally important since, in practice, one might want to stop the annotation before such a final stopping condition is met, e.g., when the trade-off between additional annotation costs and gain in classifier performance is falling below some threshold." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-85", "text": "For uncertainty-based AL, further stopping criteria employing a confidence estimate of the current classifier were proposed by Zhu et al. (2008) ." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-86", "text": "The first one is based on an uncertainty measurement on all unlabeled examples of a pool, the second one uses the prediction accuracy on the selected examples, and the final one builds on the classifier's expected error on all unlabeled examples." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-87", "text": "Since these approaches are not based on a separate validation set we assume that their reported success rates are largely due to the simulation effect, i.e., the limited number of 'hard' examples in a simulation data set." 
}, { "sent_id": "cffa735deb802118640005a1d527ee-C001-88", "text": "Whereas the first and the third criterion could also be applied in a separate, unlabeled validation set to avoid this shortcoming, the second one would require an annotated validation set -not really an advantage over plotting a learning curve." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-89", "text": "Further on, Zhu et al. use their approaches as stopping condition by comparing the respective values against a fixed threshold." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-90", "text": "We find this problematic because a priori chosen or heuristically determined values are highly task-and data-dependent." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-91", "text": "In a real-world annotation scenario it is almost impossible to adequately define such values in advance." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-92", "text": "While all the above-mentioned approaches focus on singleclassifier AL strategies, ours is tailored to committee-based AL." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-93", "text": "----------------------------------" }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-94", "text": "**EXPERIMENTS**" }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-95", "text": "To empirically test whether our proposed approach works well as an approximation of the learning curves we ran several experiments both in a pure simulation mode, where the manual annotation was simulated by unveiling the labels already assigned in the simulation corpus, and in a realworld scenario where human annotators were asked to annotate the sentences selected by AL." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-96", "text": "For both scenarios the selection agreement (SA) and the validation set agreement (VSA) was calculated for each AL iteration." 
}, { "sent_id": "cffa735deb802118640005a1d527ee-C001-97", "text": "----------------------------------" }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-99", "text": "For our experiments on approximating the learning curves for AL-based selection, we chose named entity recognition (NER) as the annotation task in focus." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-100", "text": "We employed the committee-based AL approach described in Tomanek et al. (2007a) ." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-101", "text": "The committee consists of k = 3 Maximum Entropy (ME) classifiers (Berger et al., 1996) ." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-102", "text": "In each AL iteration, each classifier is trained on a randomly" }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-103", "text": ", L being the set of all examples seen so far." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-104", "text": "Disagreement is measured by vote entropy (Engelson and Dagan, 1996) ." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-105", "text": "In our NER scenario, complete sentences are selected by AL." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-106", "text": "While we made use of ME classifiers during the selection, we employed a NE tagger based on Conditional Random Fields (CRFs) (Lafferty et al., 2001 ) during evaluation time to determine the learning curves." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-107", "text": "We have already shown that in this scenario, ME classifiers perform equally well for AL-driven selection as CRFs when using the same features." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-108", "text": "This effect is truly beneficial, especially for real-world annotation projects, due to much lower training times and, by this, shorter annotator idle times (Tomanek et al., 2007a) ." 
}, { "sent_id": "cffa735deb802118640005a1d527ee-C001-109", "text": "For the AL simulation, we employed two simulation corpora: The CONLL corpus, based on the English data set of the CoNLL-2003 shared task (Tjong Kim Sang and De Meulder, 2003) , which consists of newspaper articles annotated with respect to person, location, and organisation entities." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-110", "text": "This pool consists of about 14,000 sentences." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-111", "text": "As validation set and as gold standard for plotting the learning curve we used CoNLL's evaluation corpus which sums up to 3,453 sentences." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-112", "text": "The PBVAR corpus consists of biomedical abstracts and was derived from the PENNBIOIE corpus (Kulick et al., 2004) by keeping only those annotations related to variation event mentions." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-113", "text": "We have randomly split this corpus into a pool set and a validation/gold set." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-114", "text": "In our simulations, 20 sentences were selected in each AL iteration and the simulations were started with a random seed set of 20 sentences." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-115", "text": "Our results are averaged over three independent runs." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-116", "text": "For the real-world annotation scenario, we considered two sub-corpora from the entity annotations described in (Hahn et al., 2008) : The cytokine and growth factor receptors corpus (CYTOREC) is annotated with various entity subclasses of special receptor entities, while the antigens corpus (CDANTIGEN) contains annotations of various immunologically relevant antigen entities." 
}, { "sent_id": "cffa735deb802118640005a1d527ee-C001-117", "text": "For both annotation projects, the pool from which AL selected the examples to be labeled consisted of approximately 2 million sentences taken from PUBMED 2 abstracts, the validation set and gold standard was composed of 2,165 sentences." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-118", "text": "In each AL iteration, 30 sentences were selected for manual annotation." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-119", "text": "The corresponding seed sets were considerably larger than in our simulations and were assembled by the heuristic described by Tomanek et al. (2007b) ." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-120", "text": "Table 1 summarizes the corpora used for our experiments." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-121", "text": "Figures 1 and 2 display the learning and agreement curves for the CONLL and the PBVAR corpus, respectively." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-122", "text": "The learning curves are depicted for both AL (solid line) and random selection (dashed line) revealing the increase in annotation efficiency when AL is used to select the examples to be annotated." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-123", "text": "As for the agreement curves, we plot both the exact agreement values (dots) and a curve obtained by local polynomial regression fitting (solid line)." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-124", "text": "On the CONLL corpus, the learning curve converges on its maximum f-score (\u2248 84%) after about 125,000 tokens." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-125", "text": "This is reflected by the SA curve which is not ascending any more at about the same number of tokens." 
}, { "sent_id": "cffa735deb802118640005a1d527ee-C001-126", "text": "A similar pattern is depicted in the VSA curve though it provides an even clearer picture of the progression of the learning curve." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-127", "text": "It is only slightly ascending after about 50,000 tokens, i.e., at a time when the slope of the learning curve already becomes very low." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-128", "text": "From both the learning and the VSA curve we can read that after 50,000 tokens any additional annotation is very costly compared to its benefits in terms of increased classifier performance." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-129", "text": "----------------------------------" }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-130", "text": "**RESULTS**" }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-131", "text": "On the PBVAR corpus, the maximal f-score (\u2248 80%) is reached after approximately 50,000 tokens, then there is a small decline which after about 100,000 tokens stabilizes at the maximum value." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-132", "text": "The SA curve reached values around '1' after about 100, 000 tokens, but is misleading here since it does not reflect that the learning curve had already reached a maximum before." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-133", "text": "The VSA curve, however, more comprehensively approximates the behavior of the learning curve." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-134", "text": "It has a clear bend after some 50,000 tokens and converges after approximately 100,000 tokens." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-135", "text": "Figures 3 and 4 display the learning and agreement curves for our experiments in the real-world annotation scenario." 
}, { "sent_id": "cffa735deb802118640005a1d527ee-C001-136", "text": "No learning curve for random selection is shown since only AL selection was performed to avoid unnecessary human efforts." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-137", "text": "Further, in this scenario, the agreement was not calculated during the selection to keep selection time as short as possible but was calculated afterwards for this experiment." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-138", "text": "3 On both corpora, agreement as well as learning curves start with the complete seed set (256 sentences, with about 10,00 tokens for CYTOREC and 853 sentences, with some 35,000 tokens for CDANTIGEN)." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-139", "text": "On the CDANTIGEN corpus, after 80, 000 tokens being annotated the learning curve has not completely converged but additional annotations do not pay off either." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-140", "text": "The VSA curve mirrors this behavior since it keeps on ascending with a low slope, though the SA curve remains quite obscure, here." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-141", "text": "A similar behavior can be observed for the CYTOREC corpus." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-142", "text": "The learning curve is only slightly ascending after about 65,000 tokens have been annotated." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-143", "text": "This is nicely mirrored by the VSA curve." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-144", "text": "Again, the SA curve is almost impossible to interpret: Though its slope decreases a bit after roughly 40,000 tokens, it keeps ascending thereafter." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-145", "text": "Both SA curves exhibit an oscillating behavior that does not contain any clue to guide stopping decisions." 
}, { "sent_id": "cffa735deb802118640005a1d527ee-C001-146", "text": "We have seen that in the simulation scenario the two agreement curves (SA and VSA) share a similar curve progression due to the simulation effect (cf. Figure 1 . But even in the simulation scenario the SA curve might be problematic and hence misleading as can be concluded from our experiments on the PBVAR corpus." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-147", "text": "In the real-world annotation scenario these SA curves are clueless to approximate the progression of the learning curve." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-148", "text": "However, our experiments suggest (see Figures 3 and 4 ) that the VSA curve is a good estimator for the progression of the learning curve and also works in practice, while the SA curve fails as a reliable predictor in our realworld annotation scenario." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-149", "text": "Still, even the more predictive VSA curves merely guide but do not finalize stopping decisions." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-150", "text": "So it is left to the annotation manager's over-all assessment to balance the trade-off between annotation costs and expectable quality gains for the learner." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-151", "text": "----------------------------------" }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-152", "text": "**CONCLUSIONS**" }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-153", "text": "In this paper, we discussed an approach to approximate the progression of the learning curve for AL-driven annotation." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-154", "text": "Such an approximation can be used to estimate the relative quality gains of further annotation efforts." 
}, { "sent_id": "cffa735deb802118640005a1d527ee-C001-155", "text": "This might render valuable decision support for the question when to actually stop an annotation process, in practice, and is especially helpful when a learning curve is not available due to the absence of a labeled gold standard." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-156", "text": "We have deliberately refrained from defining a fixed stopping condition for AL-driven annotations." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-157", "text": "In practice, further annotation efforts will mostly result in some, although mild, classifier improvement." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-158", "text": "Whether the respective gain justifies the efforts (and costs) depends on the task at hand." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-159", "text": "As far as the learning curve and its approximation is concerned, the relative gains can be estimated." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-160", "text": "Such an approach might be more adequate for practical use cases rather than a single-point stopping condition which does not incorporate trade-off considerations of any sort." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-161", "text": "Further, we have discussed that AL simulations are subject to the simulation effect." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-162", "text": "From our experiments we conclude that approaches to monitor the progress (in whatever manner) of AL-driven annotation should always be based on a separate validation set instead of the material directly involved in the AL training process." }, { "sent_id": "cffa735deb802118640005a1d527ee-C001-163", "text": "As the validation set does not need to be labeled and for almost all NLP applications unlabeled material is available in virtually unlimited volumes this approach comes at not extra costs." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "cffa735deb802118640005a1d527ee-C001-18" ], [ "cffa735deb802118640005a1d527ee-C001-43" ] ], "cite_sentences": [ "cffa735deb802118640005a1d527ee-C001-18", "cffa735deb802118640005a1d527ee-C001-43" ] }, "@SIM@": { "gold_contexts": [ [ "cffa735deb802118640005a1d527ee-C001-43", "cffa735deb802118640005a1d527ee-C001-44" ] ], "cite_sentences": [ "cffa735deb802118640005a1d527ee-C001-43" ] }, "@USE@": { "gold_contexts": [ [ "cffa735deb802118640005a1d527ee-C001-104" ] ], "cite_sentences": [ "cffa735deb802118640005a1d527ee-C001-104" ] } } }, "ABC_2dc830dd598102ee82f1b982b88be9_44": { "x": [ { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-23", "text": "Given an audio signal I (e.g. log mel-filter bank energies (LFBEs)), the task is to train a model f to predict a multi-hot vector y \u2208 {0, 1} C , with C being the size of event set E and y c being a binary indicator whether event c is present in I. We denote D L = {(I, y)} as the labeled training dataset." }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-24", "text": "Model f is trained using cross-entropy loss: L = \u2212 (I,y)\u2208DL C c=1 {w c y c log f c (I) + (1 \u2212 y c ) log(1 \u2212 f c (I))}, where w c is the penalty of positive mis-classification of class c. Specifically we focus on the RNN-based model in this paper." }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-25", "text": "Compared to CNNs, it is more memory efficient and easier to deploy on devices with constrained resources." }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-26", "text": "Low-rank matrix factorization The factorization of weight matrices is based on the SVD compression of LSTM [11] ." 
}, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-27", "text": "Let W" }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-28", "text": "Quantization training Quantization refers to representing floating-point values with n-bit integers (n < 32)." }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-29", "text": "as formulated in 2." }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-30", "text": "Note the scaling factor \u03b1 and minimum value \u03b2 in equation 2 are not quantized." }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-31", "text": "V" }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-32", "text": "As the quantization function (Q n ) is discrete, its gradient is almost zero everywhere." }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-33", "text": "To solve this problem, we apply straight-through estimator [18] to approximate the gradient of full-precision parameter ( \u2202l \u2202V ) with gradient of quantized value ( \u2202l \u2202V ) in the fine-tuning." }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-34", "text": "To combine low-rank matrix factorization and quantization training, the quantization operatorQ n is applied to Z l h , Z l x and\u1e7c lr h ." }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-35", "text": "We quantize both model parameters and inputs." }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-36", "text": "The original RNN is first trained in full-precision until convergence." }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-37", "text": "After the low-rank matrix factorization is applied (equation 1), the model is quantized and fine-tuned with quantization training." }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-38", "text": "We find the finetuning step is important to maintain performance with quantization." 
}, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-39", "text": "----------------------------------" }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-40", "text": "**EXPERIMENTS**" }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-41", "text": "The dataset used in our experiments is a subset of Audioset [17] , which includes a large collection of 10-second audio segments for 632 categories of events." }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-42", "text": "In particular, we select sounds of dog, baby crying and gunshots as our target events." }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-43", "text": "They include both human and non-human vocals, as well as different sound event durations." }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-44", "text": "The three events included in Audioset amount to 13,460, 2,313 and 4,083 audio segments, respectively." }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-45", "text": "We also randomly select 36,036 examples from all other audio clips in Audioset as negative samples." }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-46", "text": "The whole selected dataset is randomly split for training (70%), validation (10%) and test (20%) for each target events." }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-47", "text": "64 dimensional LFBE features are extracted for each audio clip, with window size of 25 ms and hop size of 10 ms, which are further normalized by global cepstral mean and variance normalization (CMVN)." }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-48", "text": "Our baseline model is a three-layer LSTM with 256 hidden units in each layer." }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-49", "text": "Dropout is added between layers at rate of 0.2." }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-50", "text": "Adam optimizer is used with learning rate fixed at 0.001." 
}, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-51", "text": "The evaluation is based on detection error tradeoff (DET) curve (false negative rate vs. false positive rate)." }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-52", "text": "We compute area under curve (AUC) and equal error rate (EER) as the two quantitative measures." }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-53", "text": "As the distribution of weight matrices' eigenvalues can be different across different LSTM layers, we follow the practice of [11] to set the same threshold \u03c4 across layers as the fraction of retained singular values, defined as \u03c4 = Table 1 summarizes the results of low-rank matrix factorization compared to our baseline 3-layer LSTM." }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-54", "text": "There is no accuracy degradation when \u03c4 is reduced to 0.6, which we hypothesize to be related to the regularizing effects." }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-55", "text": "Table 2 summarizes the results with quantization compared to our baseline." }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-56", "text": "Post-mortem (PM) refers to the case that quantization is only applied during inference, while quantization training (QT) refers to the case model fine-tuning is further performed on quantized weights." }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-57", "text": "Our quantization training approach outperforms PM significantly for the 4-bit quantization case." }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-58", "text": "We also note the simple PM quantization preserves the accuracy well (3.0% drop in AUC and 2.7% drop in EER) with 8-bit quantization." }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-59", "text": "Finally, we combine both low-rank matrix factorization and quantization training." }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-60", "text": "Its results are summarized in table 3." 
}, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-61", "text": "The attained singular value ratio is fixed to be 0.6, as parameter size and accuracy is well balanced at this point according to table 1." }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-62", "text": "When the model is quantized to 8-bit QT (\u2248 1.5% of original size), AUC is only increased by 0.2% and EER is even improved by a small margin (1.9%)." }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-63", "text": "Performance is significantly degraded for 8-bit PM, which is related to the relatively highly compactness and unbounded intermediate outputs in the low-rank setting." }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-64", "text": "Fine-tuning the quantized model is an essential step to reduce this effect." }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-65", "text": "Though the performance is decreased for the 4-bit QT (less than 1% of original model size), its relative degradation is much smaller compared to the 4-bit PM." }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-66", "text": "----------------------------------" }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-67", "text": "**CONCLUSION**" }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-68", "text": "In this paper we present a simple yet effective compression technique combining low-rank matrix factorization and quantization training." }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-69", "text": "The proposed technique is applied to a multi-layer LSTM for AED." }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-70", "text": "It compresses the AED model size to less than 1% of the original, with AUC and EER performance well maintained." }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-71", "text": "Model fine-tuning with quantization (QT) outperforms a naive quantization scheme (PM), which shows its consistent advantage with different number of quantization bits." 
}, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-2", "text": "In this paper, we present a compression approach based on the combination of low-rank matrix factorization and quantization training, to reduce complexity for neural network based acoustic event detection (AED) models." }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-3", "text": "Our experimental results show this combined compression approach is very effective." }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-4", "text": "For a threelayer long short-term memory (LSTM) based AED model, the original model size can be reduced to 1% with negligible loss of accuracy." }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-5", "text": "Our approach enables the feasibility of deploying AED for resource-constraint applications." }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-6", "text": "----------------------------------" }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-8", "text": "Acoustic event detection (AED), the task of detecting whether certain events occur in an audio clip, can be applied in many industry applications [1, 2, 3, 23] ." }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-9", "text": "The accuracy of AED models have been increased in a large scale in recent years based on deep learning approaches." }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-10", "text": "However, to ensure high performance, those models are of large scale computation and memory intense, which makes it a challenge to deploy for real industrial applications with limited computation resources and memory." }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-11", "text": "Our paper is focused on increasing the computation efficiency for AED models while maintaining their accuracies, so that AED deployment for resource-constraint industrial applications is feasible." 
}, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-12", "text": "Compression of neural networks has been explored in broad context." }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-13", "text": "We focus on two widely used and effective methods for deep models: low-rank matrix factorization and and quantization." }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-14", "text": "Singular value decomposition (SVD) is a common factorization technique and has been explored in feedforward networks [9, 10, 21, 22] and recurrent neural networks (RNN) [11] ." }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-15", "text": "Neural network quantization refers to compressing the original network by reducing number of bits required to represent weight matrices, and it has been studied for different model architectures [12, 13, 14, 15, 16, 19, 20] ." }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-16", "text": "By reducing the bit-width of weights, model size is reduced, and it also brings considerable acceleration via efficient low bit-width arithmetic operations supports available on hardware." }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-17", "text": "For the quantization approach, It is important to fine-tune models with quantized weights to reduce the performance loss with quantized networks." }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-18", "text": "Here we refer quantization with fine-turning as quantization training." }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-19", "text": "32nd Conference on Neural Information Processing Systems (NIPS 2018), Montr\u00e9al, Canada." }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-20", "text": "----------------------------------" }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-21", "text": "**METHODS**" }, { "sent_id": "2dc830dd598102ee82f1b982b88be9-C001-22", "text": "We start by formulating the multi-class audio event detection problem." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "2dc830dd598102ee82f1b982b88be9-C001-14" ], [ "2dc830dd598102ee82f1b982b88be9-C001-26" ] ], "cite_sentences": [ "2dc830dd598102ee82f1b982b88be9-C001-14", "2dc830dd598102ee82f1b982b88be9-C001-26" ] }, "@USE@": { "gold_contexts": [ [ "2dc830dd598102ee82f1b982b88be9-C001-13", "2dc830dd598102ee82f1b982b88be9-C001-14" ], [ "2dc830dd598102ee82f1b982b88be9-C001-22", "2dc830dd598102ee82f1b982b88be9-C001-26" ], [ "2dc830dd598102ee82f1b982b88be9-C001-53" ] ], "cite_sentences": [ "2dc830dd598102ee82f1b982b88be9-C001-14", "2dc830dd598102ee82f1b982b88be9-C001-26", "2dc830dd598102ee82f1b982b88be9-C001-53" ] } } }, "ABC_2e95d98d5f9d4d6fc90e3d8453f945_45": { "x": [ { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-2", "text": "In this paper we present results from two pilot studies which show that using the Amazon Mechanical Turk for preposition error annotation is as effective as using trained raters, but at a fraction of the time and cost." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-3", "text": "Based on these results, we propose a new evaluation method which makes it feasible to compare two error detection systems tested on different learner data sets." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-4", "text": "----------------------------------" }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-5", "text": "**INTRODUCTION**" }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-6", "text": "The last few years have seen an explosion in the development of NLP tools to detect and correct errors made by learners of English as a Second Language (ESL)." 
}, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-7", "text": "While there has been considerable emphasis placed on the system development aspect of the field, with researchers tackling some of the toughest ESL errors such as those involving articles (Han et al., 2006) and prepositions (Gamon et al., 2008) , (Felice and Pullman, 2009) , there has been a woeful lack of attention paid to developing best practices for annotation and evaluation." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-8", "text": "Annotation in the field of ESL error detection has typically relied on just one trained rater, and that rater's judgments then become the gold standard for evaluating a system." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-9", "text": "So it is very rare that inter-rater reliability is reported, although, in other NLP subfields, reporting reliability is the norm." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-10", "text": "Time and cost are probably the two most important reasons why past work has relied on only one rater because using multiple annotators on the same ESL texts would obviously increase both considerably." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-11", "text": "This is especially problematic for this field of research since some ESL errors, such as preposition usage, occur at error rates as low as 10%." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-12", "text": "This means that to collect a corpus of 1,000 preposition errors, an annotator would have to check over 10,000 prepositions." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-13", "text": "1 (Tetreault and Chodorow, 2008b) challenged the view that using one rater is adequate by showing that preposition usage errors actually do not have high inter-annotator reliability." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-14", "text": "For example, trained raters typically annotate preposition errors with a kappa around 0.60." 
}, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-15", "text": "This low rater reliability has repercussions for system evaluation: Their experiments showed that system precision could vary as much as 10% depending on which rater's judgments they used as the gold standard." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-16", "text": "For some grammatical errors such as subject-verb agreement, where rules are clearly defined, it may be acceptable to use just one rater." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-17", "text": "But for usage errors, the rules are less clearly defined and two native speakers can have very different judgments of what is acceptable." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-18", "text": "One way to address this is by aggregating a multitude of judgments for each preposition and treating this as the gold standard, however such a tactic has been impractical due to time and cost limitations." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-19", "text": "While annotation is a problem in this field, comparing one system to another has also been a major issue." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-20", "text": "To date, none of the preposition and article error detection systems in the literature have been evaluated on the same corpus." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-21", "text": "This is mostly due to the fact that learner corpora are difficult to acquire (and then annotate), but also to the fact that they are usually proprietary and cannot be shared." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-22", "text": "Examples include the Cambridge Learners Corpus 2 used in (Felice and Pullman, 2009) , and TOEFL data, used in (Tetreault and Chodorow, 2008a) ." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-23", "text": "This makes it difficult to compare systems since learner corpora can be quite different." 
}, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-24", "text": "For example, the \"difficulty\" of a corpus can be affected by the L1 of the writers, the number of years they have been learning English, their age, and also where they learn English (in a native-speaking country or a non-native speaking country)." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-25", "text": "In essence, learner corpora are not equal, so a system that performs at 50% precision in one corpus may actually perform at 80% precision on a different one." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-26", "text": "Such an inability to compare systems makes it difficult for this NLP research area to progress as quickly as it otherwise might." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-27", "text": "In this paper we show that the Amazon Mechanical Turk (AMT), a fast and cheap source of untrained raters, can be used to alleviate several of the evaluation and annotation issues described above." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-28", "text": "Specifically we show:" }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-29", "text": "\u2022 In terms of cost and time, AMT is an effective alternative to trained raters on the tasks of preposition selection in well-formed text and preposition error annotation in ESL text." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-30", "text": "\u2022 With AMT, it is possible to efficiently collect multiple judgments for a target construction." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-31", "text": "Given this, we propose a new method for evaluation that finally allows two systems to be compared to one another even if they are tested on different corpora." 
}, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-32", "text": "----------------------------------" }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-33", "text": "**AMAZON MECHNICAL TURK**" }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-34", "text": "Amazon provides a service called the Mechanical Turk which allows requesters (companies, researchers, etc.) to post simple tasks (known as Human Intelligence Tasks, or HITs) to the AMT website for untrained raters to perform for payments as low as $0.01 in many cases (Sheng et al., 2008) ." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-35", "text": "Recently, AMT has been shown to be an effective tool for annotation and evalatuation in NLP tasks ranging from word similarity detection and emotion detection (Snow et al., 2008) to Machine Translation quality evaluation (Callison-Burch, 2009 )." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-36", "text": "In these cases, a handful of untrained AMT workers (or Turkers) were found to be as effective as trained raters, but with the advantage of being considerably faster and less expensive." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-37", "text": "Given the success of using AMT in other areas of NLP, we test whether we can leverage it for our work in grammatical error detection, which is the focus of the pilot studies in the next two sections." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-38", "text": "The presence of a gold standard in the above papers is crucial." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-39", "text": "In fact, the usability of AMT for text annotation has been demostrated in those studies by showing that non-experts' annotation converges to the gold standard developed by expert annotators." 
}, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-40", "text": "However, in our work we concentrate on tasks where there is no single gold standard, either because there are multiple prepositions that are acceptable in a given context or because the conventions of preposition usage simply do not conform to strict rules." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-41", "text": "Typically, an early step in developing a preposition or article error detection system is to test the system on well-formed text written by native speakers to see how well the system can predict, or select, the writer's preposition given the context around the preposition." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-42", "text": "(Tetreault and Chodorow, 2008b) showed that trained human raters can achieve very high agreement (78%) on this task." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-43", "text": "In their work, a rater was shown a sentence with a target preposition replaced with a blank, and the rater was asked to select the preposition that the writer may have used." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-44", "text": "We replicate this experiment not with trained raters but with the AMT to answer two research questions: 1. Can untrained raters be as effective as trained 46 raters? 2. If so, how many raters does it take to match trained raters?" }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-45", "text": "In the experiment, a Turker was presented with a sentence from Microsoft's Encarta encyclopedia, with one preposition in that sentence replaced with a blank." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-46", "text": "There were 194 HITs (sentences) in all, and we requested 10 Turker judgments per HIT." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-47", "text": "Some Turkers did only one HIT, while others completed more than 100, though none did all 194." 
}, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-48", "text": "The Turkers' performance was analyzed by comparing their responses to those of two trained annotators and to the Encarta writer's preposition, which was considered the gold standard in this task." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-49", "text": "Comparing each trained annotator to the writer yielded a kappa of 0.822 and 0.778, and the two raters had a kappa of 0.742." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-50", "text": "To determine how many Turker responses would be required to match or exceed these levels of reliability, we randomly selected samples of various sizes from the sets of Turker responses for each sentence." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-51", "text": "For example, when samples were of size N = 4, four responses were randomly drawn from the set of ten responses that had been collected." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-52", "text": "The preposition that occurred most frequently in the sample was used as the Turker response for that sentence." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-53", "text": "In the case of a tie, a preposition was randomly drawn from those tied for most frequent." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-54", "text": "For each sample size, 100 samples were drawn and the mean values of agreement and kappa were calculated." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-55", "text": "The reliability results presented in Table 1 show that, with just three Turker responses, kappa with the writer (top line) is comparable to the values obtained from the trained annotators (around 0.8)." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-56", "text": "Most notable is that with ten judgments, the reliability measures are much higher than those of the trained annotators." 
}, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-57", "text": "3" }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-58", "text": "----------------------------------" }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-59", "text": "**ERROR DETECTION TASK**" }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-60", "text": "While the previous results look quite encouraging, the task they are based on, preposition selection in well-formed text, is quite different from, and less challenging than, the task that a system must perform in detecting errors in learner writing." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-61", "text": "To examine the reliability of Turker preposition error judgments, we ran another experiment in which Turkers were presented with a preposition highlighted in a sentence taken from an ESL corpus, and were in-structed to judge its usage as either correct, incorrect, or the context is too ungrammatical to make a judgment." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-62", "text": "The set consisted of 152 prepositions in total, and we requested 20 judgments per preposition." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-63", "text": "Previous work has shown this task to be a difficult one for trainer raters to attain high reliability." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-64", "text": "For example, (Tetreault and Chodorow, 2008b) found kappa between two raters averaged 0.630." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-65", "text": "Because there is no gold standard for the error detection task, kappa was used to compare Turker responses to those of three trained annotators." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-66", "text": "Among the trained annotators, inter-kappa agreement ranged from 0.574 to 0.650, for a mean kappa of 0.606." 
}, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-67", "text": "In Figure 2 , kappa is shown for the comparisons of Turker responses to each annotator for samples of various sizes ranging from N = 1 to N = 18." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-68", "text": "At sample size N = 13, the average kappa is 0.608, virtually identical to the mean found among the trained annotators." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-69", "text": "----------------------------------" }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-70", "text": "**RETHINKING EVALUATION**" }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-71", "text": "We contend that the Amazon Mechanical Turk can not only be used as an effective alternative annotation source, but can also be used to revamp evaluation since multiple judgments are now easily acquired." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-72", "text": "Instead of treating the task of error detection as a \"black or white\" distinction, where a preposition is either correct or incorrect, cases of preposition use can now be grouped into bins based on the level of agreement of the Turkers." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-73", "text": "For example, if 90% or more judge a preposition to be an error, 47 the high agreement is strong evidence that this is a clear case of an error." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-74", "text": "Conversely, agreement levels around 50% would indicate that the use of a particular preposition is highly contentious, and, most likely, it should not be flagged by an automated error detection system." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-75", "text": "The current standard method treats all cases of preposition usage equally, however, some are clearly harder to annotate than others." 
}, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-76", "text": "By breaking an evaluation set into agreement bins, it should be possible to separate the \"easy\" cases from the \"hard\" cases and report precision and recall results for the different levels of human agreement represented by different bins." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-77", "text": "This method not only gives a clearer picture of how a system is faring, but it also ameliorates the problem of cross-system evaluation when two systems are evaluated on different corpora." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-78", "text": "If each evaluation corpus is annotated by the same number of Turkers and with the same annotation scheme, it will now be possible to compare systems by simply comparing their performance on each respective bin." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-79", "text": "The assumption here is that prepositions which show X% agreement in corpus A are of equivalent difficulty to those that show X% agreement in corpus B." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-80", "text": "----------------------------------" }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-81", "text": "**DISCUSSION**" }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-82", "text": "In this paper, we showed that the AMT is an effective tool for annotating grammatical errors." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-83", "text": "At a fraction of the time and cost, it is possible to acquire high quality judgments from multiple untrained raters without sacrificing reliability." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-84", "text": "A summary of the cost and time of the two experiments described here can be seen in Table 1 ." 
}, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-85", "text": "In the task of preposition selection, only three Turkers are needed to match the reliability of two trained raters; in the more complicated task of error detection, up to 13 Turkers are needed." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-86", "text": "However, it should be noted that these numbers can be viewed as upper bounds." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-87", "text": "The error annotation scheme that was used is a very simple one." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-88", "text": "We intend to experiment with different guidelines and instructions, and to screen (CallisonBurch, 2009 ) and weight Turkers' responses (Snow et al., 2008) , in order to lower the number of Turkers required for this task." }, { "sent_id": "2e95d98d5f9d4d6fc90e3d8453f945-C001-89", "text": "Finally, we will look at other errors, such as articles, to determine how many Turkers are necessary for optimal annotation." } ], "y": { "@BACK@": { "gold_contexts": [ [ "2e95d98d5f9d4d6fc90e3d8453f945-C001-13" ], [ "2e95d98d5f9d4d6fc90e3d8453f945-C001-42" ], [ "2e95d98d5f9d4d6fc90e3d8453f945-C001-64" ] ], "cite_sentences": [ "2e95d98d5f9d4d6fc90e3d8453f945-C001-13", "2e95d98d5f9d4d6fc90e3d8453f945-C001-42", "2e95d98d5f9d4d6fc90e3d8453f945-C001-64" ] }, "@EXT@": { "gold_contexts": [ [ "2e95d98d5f9d4d6fc90e3d8453f945-C001-42", "2e95d98d5f9d4d6fc90e3d8453f945-C001-44" ] ], "cite_sentences": [ "2e95d98d5f9d4d6fc90e3d8453f945-C001-42" ] } } }, "ABC_f5bf9a833c3d46b00d70498e4f1c1b_45": { "x": [ { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-2", "text": "Identifying the language of social media messages is an important first step in linguistic processing." 
}, { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-3", "text": "Existing models for Twitter focus on content analysis, which is successful for dissimilar language pairs." }, { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-4", "text": "We propose a label propagation approach that takes the social graph of tweet authors into account as well as content to better tease apart similar languages." }, { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-5", "text": "This results in state-of-the-art shared task performance of 76.63%, 1.4% higher than the top system." }, { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-6", "text": "----------------------------------" }, { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-8", "text": "Language identification is a crucial first step in textual data processing and is considered feasible over formal texts [4] ." }, { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-9", "text": "The task is harder for social media (e.g. Twitter) where text is less formal, noisier and can be written in wide range of languages." }, { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-10", "text": "We focus on identifying similar languages, where surface-level content alone may not be sufficient." }, { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-11", "text": "Our approach combines a content model with evidence propagated over the social network of the authors." }, { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-12", "text": "For example, a user well-connected to users posting in a language is more likely to post in that language." }, { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-13", "text": "Our system scores 76.63%, 1.4% higher than the top submission to the tweetLID workshop." 
}, { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-14", "text": "----------------------------------" }, { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-15", "text": "**BACKGROUND**" }, { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-16", "text": "Traditional language identification compares a document with a language fingerprint built from n-gram bag-of-words 1 http://komunitatea.elhuyar.org/tweetlid Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page." }, { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-17", "text": "To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee." }, { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-18", "text": "Copyright 20XX ACM X-XXXXX-XX-X/XX/XX ...$15.00." }, { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-19", "text": "(character or word level)." }, { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-20", "text": "Tweets carry additional metadata useful for identifying language, such as geolocation [3] , username [2, 3] and urls mentioned in the tweet [2] ." }, { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-21", "text": "Other methods expand beyond the tweet itself to use a histogram of previously predicted languages, those of users @-mentioned and lexical content of other tweets in a discussion [3] ." }, { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-22", "text": "Discriminating between similar languages was the focus of the VarDial workshop [7] , and most submissions used content analysis." 
}, { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-23", "text": "These methods make limited use of the social context in which the authors are tweeting -our research question is \"Can we identify the language of a tweet using the social graph of the tweeter?\"." }, { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-24", "text": "Label propagation approaches [8] are powerful techniques for semi-supervised learning where the domain can naturally be described using an undirected graph." }, { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-25", "text": "Each node contains a probability distribution over labels, which may be empty for unlabelled nodes, and these labels are propagated over the graph in an iterative fashion." }, { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-26", "text": "Modified Adsorption (mad) [6] , is an extension that allows more control of the random walk through the graph." }, { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-27", "text": "Applications of lp and mad are varied, including video recommendation [1] and sentiment analysis over Twitter [5] ." }, { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-28", "text": "----------------------------------" }, { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-29", "text": "**METHOD**" }, { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-30", "text": "Our method predicts the language for a tweet t by combining scores from a content model and a graph model that takes social context into account, as per Equation 1:" }, { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-31", "text": "Where \u03b8content are the content model parameters, \u03b8 social the social model parameters." }, { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-32", "text": "----------------------------------" }, { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-33", "text": "**CONTENT MODEL**" }, { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-34", "text": "Our content model is a 1 vs. 
all 2 regularised logistic regression model 3 with character 2-to 5-grams features, not spanning over word boundaries." }, { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-35", "text": "The scores for a tweet are normalised to obtain a probability distribution." }, { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-36", "text": "----------------------------------" }, { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-37", "text": "**SOCIAL MODEL**" }, { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-38", "text": "We use a graph to model the social media context, relating tweets to one another, authors to tweets and other authors." }, { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-39", "text": "Figure 1 shows the graph, composed of three types of nodes: tweets (T), users (U) and the \"world\" (W)." }, { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-40", "text": "Edges are created between nodes and weighted as follows: T-T the unigram cosine similarity between tweets, T-U weighted 100 between a tweet and its author, U-U weighted 1 between two users in a \"follows\" relationship and U-W weighted 0.001 to ensure a connected graph for the mad algorithm." }, { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-41", "text": "We create the graph using all data, and training set tweets have an initial language label distribution." }, { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-42", "text": "4 A na\u00efve approach to building the tweet-tweet subgraph requires O(n 2 ) comparisons, measuring the similarity of each tweet with all others." }, { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-43", "text": "Instead, we performed k-nearest-neighbour classification on all tweets, represented as a bag of unigrams, and compared each tweet and the top-k neighbours." }, { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-44", "text": "5 We use Junto (mad) [6] to propagate labels from labelled to unlabelled nodes." 
}, { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-45", "text": "Upon convergence, we renormalise label scores for initially unlabelled nodes to find the value of \u03b8 graph ." }, { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-46", "text": "----------------------------------" }, { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-47", "text": "**EVALUATION**" }, { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-48", "text": "The tweetLID workshop shared task requires systems to identify the language of tweets written in Spanish (es), Portuguese (pt), Catalan (ca), English (en), Galician (gl) and Basque (eu)." }, { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-49", "text": "Some language pairs are similar (es and ca; pt and gl) and this poses a challenge to systems that rely on content features alone." }, { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-50", "text": "We use the supplied evaluation corpus, which has been manually labelled with six languages and evenly split into training and test collections." }, { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-51", "text": "We use the official evaluation script and report precision, recall and F-score, macro-averaged across languages." }, { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-52", "text": "This handles ambiguous tweets by permitting systems to return any of the annotated languages." }, { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-53", "text": "Table 1 shows that using the content model alone is more effective for languages that are distinct in our set of languages (i.e. English and Basque)." }, { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-54", "text": "For similar languages, adding the social model helps discriminate them (i.e. Spanish, Portuguese, Catalan and Galician), particularly those where a less-resourced language is similar to a more popular one." 
}, { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-55", "text": "Using the social graph almost doubles the F-score for undecided (und) languages, either not in the set above or hard-to-identify, from 18.85% to 34.95%." }, { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-56", "text": "Macro-averaged, our system scores 76.63%, higher than the best score in the competition: 75.2%." }, { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-57", "text": "4 We assume a uniform distribution for amb tweets." }, { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-58", "text": "5 We used scikit-learn with k = 0.25 * ntweets." }, { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-59", "text": "----------------------------------" }, { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-60", "text": "**CONCLUSION**" }, { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-61", "text": "Our approach uses social information to help identify the language of tweets." }, { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-62", "text": "This shows state-of-the-art performance, especially when discriminating between similar languages." }, { "sent_id": "f5bf9a833c3d46b00d70498e4f1c1b-C001-63", "text": "A by-product of our approach is that users are assigned a language distribution, which may be useful for other tasks." } ], "y": { "@BACK@": { "gold_contexts": [ [ "f5bf9a833c3d46b00d70498e4f1c1b-C001-27" ] ], "cite_sentences": [ "f5bf9a833c3d46b00d70498e4f1c1b-C001-27" ] } } }, "ABC_3ff58556ab973a9dde640a2b74c37b_45": { "x": [ { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-2", "text": "We integrate PropBank semantic role labels to an existing statistical parsing model producing richer output." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-3", "text": "We show conclusive results on joint learning and inference of syntactic and semantic representations." 
}, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-4", "text": "----------------------------------" }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-5", "text": "**INTRODUCTION**" }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-6", "text": "Recent successes in statistical syntactic parsing based on supervised techniques trained on a large corpus of syntactic trees (Collins, 1999; Charniak, 2000; Henderson, 2003) have brought the hope that the same approach could be applied to the more ambitious goal of recovering the propositional content and the frame semantics of a sentence." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-7", "text": "Moving towards a shallow semantic level of representation has immediate applications in question-answering and information extraction." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-8", "text": "For example, an automatic flight reservation system processing the sentence I want to book a flight from Geneva to New York will need to know that from Geneva indicates the origin of the flight and to New York the destination." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-9", "text": "(Gildea and Jurafsky, 2002 ) define this shallow semantic task as a classification problem where the semantic role to be assigned to each constituent is inferred on the basis of probability distributions of syntactic features extracted from parse trees." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-10", "text": "They use learning features such as phrase type, position, voice, and parse tree path." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-11", "text": "Consider, for example, a sentence such as The authority dropped at midnight Tuesday to $ 2.80 trillion (taken from section 00 of PropBank (Palmer et al., 2005) )." 
}, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-12", "text": "The fact that to $ 2.80 trillion receives a direction semantic label is highly correlated to the fact that it is a Prepositional Phrase (PP), that it follows the verb dropped, a verb of change of state requiring an end point, that the verb is in the active voice, and that the PP is in a certain tree configuration with the governing verb." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-13", "text": "All the recent systems proposed for semantic role labelling (SRL) follow this same assumption (CoNLL, 2005) ." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-14", "text": "The assumption that syntactic distributions will be predictive of semantic role assignments is based on linking theory." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-15", "text": "Linking theory assumes the existence of a hierarchy of semantic roles which are mapped by default on a hierarchy of syntactic positions." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-16", "text": "It also shows that regular mappings from the semantic to the syntactic level can be posited even for those verbs whose arguments can take several syntactic positions, such as psychological verbs, locatives, or datives, requiring a more complex theory." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-17", "text": "(See (Hale and Keyser, 1993; Levin and Rappaport Hovav, 1995) among many others.) If the internal semantics of a predicate determines the syntactic expressions of constituents bearing a semantic role, it is then reasonable to expect that knowledge about semantic roles in a sentence will be informative of its syntactic structure, and that learning semantic role labels at the same time as parsing will be beneficial to parsing accuracy." 
}, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-18", "text": "We present work to test the hypothesis that a current statistical parser (Henderson, 2003) can output rich information comprising both a parse tree and semantic role labels robustly, that is without any significant degradation of the parser's accuracy on the original parsing task." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-19", "text": "We achieve promising results both on the simple parsing task, where the accuracy of the parser is measured on the standard Parseval measures, and also on the parsing task where more complex labels comprising both syntactic labels and semantic roles are taken into account." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-20", "text": "These results have several consequences." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-21", "text": "First, we show that it is possible to build a single integrated system successfully." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-22", "text": "This is a meaningful achievement, as a task combining semantic role labelling and parsing is more complex than simple syntactic parsing." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-23", "text": "While the shallow semantics of a constituent and its structural position are often correlated, they sometimes diverge." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-24", "text": "For example, some nominal temporal modifiers occupy an object position without being objects, like Tuesday in the Penn Treebank representation of the sentence above." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-25", "text": "The indirectness of the relation is also confirmed by the difficulty in exploiting semantic information for parsing." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-26", "text": "Previous attempts have not been successful." 
}, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-27", "text": "(Klein and Manning, 2003) report a reduction in parsing accuracy of an unlexicalised PCFG from 77.8% to 72.9% in using Penn Treebank function labels in training." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-28", "text": "The two existing systems that use function labels sucessfully, either inherit Collins' modelling of the notion of complement (Gabbard, Kulick and Marcus, 2006) or model function labels directly (Musillo and Merlo, 2005) ." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-29", "text": "Furthermore, our results indicate that the proposed models are robust." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-30", "text": "To model our task accurately, additional parameters must be estimated." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-31", "text": "However, given the current limited availability of annotated treebanks, this more complex task will have to be solved with the same overall amount of data, aggravating the difficulty of estimating the model's parameters due to sparse data." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-32", "text": "----------------------------------" }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-33", "text": "**THE DATA AND THE EXTENDED PARSER**" }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-34", "text": "In this section we describe the augmentations to our base parsing models necessary to tackle the joint learning of parse tree and semantic role labels." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-35", "text": "PropBank encodes propositional information by adding a layer of argument structure annotation to the syntactic structures of the Penn Treebank (Marcus et al., 1993) ." 
}, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-36", "text": "Verbal predicates in the Penn Treebank (PTB) receive a label REL and their arguments are annotated with abstract semantic role labels A0-A5 or AA for those complements of the predicative verb that are considered arguments while those complements of the verb labelled with a semantic functional label in the original PTB receive the composite semantic role label AM-X, where X stands for labels such as LOC, TMP or ADV, for locative, temporal and adverbial modifiers respectively." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-37", "text": "PropBank uses two levels of granularity in its annotation, at least conceptually." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-38", "text": "Arguments receiving labels A0-A5 or AA do not express consistent semantic roles and are specific to a verb, while arguments receiving an AM-X label are supposed to be adjuncts, and the roles they express are consistent across all verbs." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-39", "text": "To achieve the complex task of assigning semantic role labels while parsing, we use a family of state-of-the-art history-based statistical parsers, the Simple Synchrony Network (SSN) parsers (Henderson, 2003) , which use a form of left-corner parse strategy to map parse trees to sequences of derivation steps." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-40", "text": "These parsers do not impose any a priori independence assumptions, but instead smooth their parameters by means of the novel SSN neural network architecture." 
}, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-41", "text": "This architecture is capable of inducing a finite history representation of an unbounded sequence of derivation steps, which we denote" }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-42", "text": "is computed from a set f of handcrafted features of the derivation move d i\u22121 , and from a finite set D of recent history representations h(d 1 , . . . , d j ), where j < i \u2212 1." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-43", "text": "Because the history representation computed for the move i \u2212 1 is included in the inputs to the computation of the representation for the next move i, virtually any information about the derivation history could flow from history representation to history representation and be used to estimate the probability of a derivation move." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-44", "text": "In our experiments, the set D of earlier history representations is modified to yield a model that is sensitive to regularities in structurally defined sequences of nodes bearing semantic role labels, within and across constituents." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-45", "text": "For more information on this technique to capture structural domains, see (Musillo and Merlo, 2005) where the technique was applied to function parsing." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-46", "text": "Given the hidden history representation h(d 1 , \u00b7 \u00b7 \u00b7 , d i\u22121 ) of a derivation, a normalized exponential output function is computed by the SSNs to estimate a probability distribution over the possible next derivation moves d i ." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-47", "text": "To exploit the intuition that semantic role labels are predictive of syntactic structure, we must pro-vide semantic role information as early as possible to the parser." 
}, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-48", "text": "Extending a technique presented in (Klein and Manning, 2003 ) and adopted in (Merlo and Musillo, 2005) for function labels with stateof-the-art results, we split some part-of-speech tags into tags marked with AM-X semantic role labels." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-49", "text": "As a result, 240 new POS tags were introduced to partition the original tag set which consisted of 45 tags." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-50", "text": "Our augmented model has a total of 613 nonterminals to represent both the PTB and PropBank labels, instead of the 33 of the original SSN parser." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-51", "text": "The 580 newly introduced labels consist of a standard PTB label followed by one or more PropBank semantic roles, such as PP-AM-TMP or NP-A0-A1." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-52", "text": "These augmented tags and the new non-terminals are included in the set f , and will influence bottomup projection of structure directly." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-53", "text": "These newly introduced fine-grained labels fragment our PropBank data." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-54", "text": "To alleviate this problem, we enlarge the set f with two additional binary features." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-55", "text": "One feature decides whether a given preterminal or nonterminal label is a semantic role label belonging to the set comprising the labels A0-A5 and AA." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-56", "text": "The other feature indicates if a given label is a semantic role label of type AM-X, or otherwise." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-57", "text": "These features allow the SSN to generalise in several ways." 
}, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-58", "text": "All the constituents bearing A0-A5 and AA labels will have a common feature." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-59", "text": "The same will be true for all nodes bearing an AM-X label." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-60", "text": "Thus, the SSN can generalise across these two types of labels." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-61", "text": "Finally, all constituents that do not bear any label will now constitute a class, the class of the nodes for which these two features are false." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-62", "text": "----------------------------------" }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-63", "text": "**EXPERIMENTS AND DISCUSSION**" }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-64", "text": "Our extended semantic role SSN parser was trained on sections 2-21 and validated on section 24 from the PropBank." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-65", "text": "Testing data are section 23 from the CoNLL-2005 shared task (Carreras and Marquez, 2005) ." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-66", "text": "We perform two different evaluations on our model trained on PropBank data." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-67", "text": "We distinguish between two parsing tasks: the PropBank parsing task and the PTB parsing task." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-68", "text": "To evaluate the former parsing task, we compute the standard Parseval measures of labelled recall and precision of constituents, taking into account not only the 33 original labels, but also the newly introduced PropBank labels." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-69", "text": "This evaluation gives us an indication of how accurately and exhaustively we can recover this richer set of non-terminal labels."
}, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-70", "text": "The results, computed on the testing data set from the PropBank, are shown in the PropBank column of Table 1 , first line." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-71", "text": "To evaluate the PTB task, we ignore the set of PropBank semantic role labels that our model assigns to constituents (PTB column of Table 1 , first line to be compared to the third line of the same column)." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-72", "text": "To our knowledge, no results have yet been published on parsing the PropBank." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-73", "text": "1 Accordingly, it is not possible to draw a straightforward quantitative comparison between our PropBank SSN parser and other PropBank parsers." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-74", "text": "However, state-of-the-art semantic role labelling systems (CoNLL, 2005) use parse trees output by state-of-the-art parsers (Collins, 1999; Charniak, 2000) , both for training and testing, and return partial trees annotated with semantic role labels." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-75", "text": "An indirect way of comparing our parser with semantic role labellers suggests itself." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-76", "text": "2 We merge the partial trees output by a semantic role labeller with the output of the parser on which it was trained, and compute PropBank parsing performance measures on the resulting parse trees." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-77", "text": "The third line, PropBank column of Table 1 reports such measures summarised for the five best semantic role labelling systems (Punyakanok et al., 2005b; Haghighi et al., 2005; Pradhan et al., 2005; Surdeanu and Turmo, 2005) in the CoNLL 2005 shared task."
}, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-78", "text": "These systems all use (Charniak, 2000) 's parse trees both for training and testing, as well as various other information sources including sets of n-best parse trees, chunks, or named entities." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-79", "text": "Thus, the partial trees output by these systems were merged with the parse trees returned by Charniak's parser (second line, PropBank column)." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-80", "text": "3 These results jointly confirm our initial hypothesis." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-81", "text": "The performance on the parsing task (PTB column) does not appreciably deteriorate compared to a current state-of-the-art parser, even if our learner can output a much richer set of labels, and therefore solves a considerably more complex problem, suggesting that the relationship between syntactic PTB parsing and semantic PropBank parsing is strict enough that an integrated approach to the problem of semantic role labelling is beneficial." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-82", "text": "Moreover, the results indicate that we can perform the more complex PropBank parsing task at levels of accuracy comparable to those achieved by the best semantic role labellers (PropBank column)." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-83", "text": "This indicates that the model is robust, as it has been extended to a richer set of labels successfully, without increase in training data." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-84", "text": "In fact, the limited availability of data is increased further by the high variability of the argumental labels A0-A5 whose semantics is specific to a given verb or a given verb sense."
}, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-85", "text": "Methodologically, these initial results on a joint solution to parsing and semantic role labelling provide the first direct test of whether parsing is necessary for semantic role labelling (Gildea and Palmer, 2002; Punyakanok et al., 2005a) ." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-86", "text": "Comparing semantic role labelling based on chunked input to the better semantic role labels retrieved based on parsed trees, (Gildea and Palmer, 2002) conclude that parsing is necessary." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-87", "text": "In an extensive experimental investigation of the different learning stages usually involved in semantic role labelling, (Punyakanok et al., 2005a) find instead that sophisticated chunking can achieve state-of-the-art results." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-88", "text": "Neither of these pieces of work actually used a parser to do SRL." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-89", "text": "Their investigation was therefore limited to establishing the usefulness of syntactic features for the SRL task." }, { "sent_id": "3ff58556ab973a9dde640a2b74c37b-C001-90", "text": "Our results do not yet indicate that parsing is beneficial to SRL, but they show that the joint task can be performed successfully." 
} ], "y": { "@USE@": { "gold_contexts": [ [ "3ff58556ab973a9dde640a2b74c37b-C001-28" ], [ "3ff58556ab973a9dde640a2b74c37b-C001-45" ] ], "cite_sentences": [ "3ff58556ab973a9dde640a2b74c37b-C001-28", "3ff58556ab973a9dde640a2b74c37b-C001-45" ] }, "@SIM@": { "gold_contexts": [ [ "3ff58556ab973a9dde640a2b74c37b-C001-28", "3ff58556ab973a9dde640a2b74c37b-C001-29" ] ], "cite_sentences": [ "3ff58556ab973a9dde640a2b74c37b-C001-28" ] }, "@BACK@": { "gold_contexts": [ [ "3ff58556ab973a9dde640a2b74c37b-C001-45" ] ], "cite_sentences": [ "3ff58556ab973a9dde640a2b74c37b-C001-45" ] }, "@EXT@": { "gold_contexts": [ [ "3ff58556ab973a9dde640a2b74c37b-C001-48" ] ], "cite_sentences": [ "3ff58556ab973a9dde640a2b74c37b-C001-48" ] } } }, "ABC_9c3c35343aeaae0520d92f64e118a2_45": { "x": [ { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-2", "text": "Due to the increased availability of online reviews, sentiment analysis has witnessed a booming interest from researchers." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-3", "text": "Sentiment analysis is a computational treatment of sentiment used to extract and understand the opinions of authors." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-4", "text": "While many systems were built to predict the sentiment of a document or a sentence, many others provide the necessary detail on various aspects of the entity (i.e. aspect-based sentiment analysis)." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-5", "text": "Most of the available data resources were tailored to English and the other popular European languages." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-6", "text": "Although Persian is a language with more than 110 million speakers, to the best of our knowledge, there is no public dataset on aspect-based sentiment analysis in Persian."
}, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-7", "text": "This paper provides a manually annotated Persian dataset, Pars-ABSA, which is verified by 3 native Persian speakers." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-8", "text": "The dataset consists of 5114 positive, 3061 negative and 1827 neutral data samples from 5602 unique reviews." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-9", "text": "Moreover, as a baseline, this paper reports the performance of some state-of-the-art aspect-based sentiment analysis methods with a focus on deep learning, on Pars-ABSA." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-10", "text": "The obtained results are impressive compared to similar English state-of-the-art." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-11", "text": "----------------------------------" }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-12", "text": "**INTRODUCTION**" }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-13", "text": "Nowadays, with the rapid growth of the Internet, a huge volume of data is generated daily. And it is said that the world's most valuable resource is no longer oil but data." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-14", "text": "On the other hand, humans are always curious about how others think and want to use others' recommendation." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-15", "text": "User opinion plays a vital role in the modern world, for instance, manufacturing companies use online reviews to get a sense of general sentiment for their products." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-16", "text": "Also, they study user reactions and reply to them on microblogs and use this media as an opportunity to market their products." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-17", "text": "As well as, taking advice from others' opinions is an important part of the decision making procedure."
}, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-18", "text": "According to two surveys of more than 2000 American adults, 81% of them have done online research on a product at least once; and between 78% and 87% of readers of online reviews of various services (like restaurants, hotels and etc.) reported that reviews had a great impact on their purchase (Pang et." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-19", "text": "al., 2008) ." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-20", "text": "Among the services available over the internet, personal blogs, social media, recommendation services and e-commerce websites are the main sources of reviews and user opinions; they allow people to share and express their views about topics, discuss current issues and express positive or negative sentiment toward products they use in daily life with others around the world." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-21", "text": "Personal blog: people share their routine activities through their blogs." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-22", "text": "An influencer can easily persuade people to buy a product by telling others about his experience of using it." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-23", "text": "Social Media: social media is another medium for sharing opinions." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-24", "text": "People use services such as Twitter, Instagram and Facebook, which are a place to post real-time short messages." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-25", "text": "E-Commerce: websites like Amazon and eBay are another source of user opinion." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-26", "text": "People who buy products share their experience of using products over these websites with others."
}, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-27", "text": "By being at the era of data explosion, for example around 500 million tweets are sent daily, one challenge is to build a system to detect and summarize an overall sentiment of these data." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-28", "text": "Sentiment analysis is the computational study of detecting and extracting subjective information and attitudes about entities." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-29", "text": "The entity can represent individuals, events, products or topics." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-30", "text": "The output of it is opinion polarity." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-31", "text": "Polarity is generally expressed in different forms from two classes of positive and negative or three classes of positive, neutral and negative while at some researches it is represented as a real number between 1-5 stars or 0-10 grade." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-32", "text": "Sentiment Analysis was acknowledged at the early 2000s with (Turney, 2002) and (Pang et al., 2002) , both of them doing binary classification on reviews with two classes of Positive and Negative." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-33", "text": "At (Turney, 2002) , an Unsupervised learning algorithm is used for classifying the reviews and the dataset used is from four different domains which consist of automobiles, banks, movies, and travel destinations." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-34", "text": "At (Pang et al., 2002) , they used Supervised Machine Learning methods and the dataset is collected from IMDB movie reviews." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-35", "text": "At (Pang and Lee, 2005) , they changed polarity from binary (Positive and Negative) to a multi-point scale (one to five stars)." 
}, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-36", "text": "Data sources used for models are generally collected from tweets, blogs and, reviews about movies and products like (Pang and Lee, 2005) and (Branavan et al., 2009) ." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-37", "text": "Sentiment Analysis can also be useful for Politics." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-38", "text": "For instance at presidential elections, candidates can understand public opinions and establish their policies by the use of sentiment analysis approaches." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-39", "text": "At (Ramteke et al., 2016) , a machine learning model is provided to predict the results of the 2016 United States presidential election based on collected tweets." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-40", "text": "Sentiment analysis is generally performed at three different levels: document-based, sentence-based, and aspect-based." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-41", "text": "At the document-based, it is considered that the whole document, for instance, a movie review is an entity and the whole document expresses a positive, negative or neutral polarity about a movie." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-42", "text": "All of the preceding reviewed articles were at document-based." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-43", "text": "There are many datasets available for this task from product review to tweets and hotel reviews, but two of the wellstudied datasets are discussed in (Maas et al., 2011) which consists of 50,000 reviews from IMDB website and Yelp reviews (Zhang et al., 2015) with more than 500,000 data samples in both binary and five-class versions." 
}, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-44", "text": "At the sentence level, the goal is to predict the polarity of each sentence; in other words, each sentence in a document is separated from the others and may have a different polarity." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-45", "text": "In the Stanford Sentiment Treebank (Socher et al., 2013) , the task of fine-grained classification is to assign one of five labels (very negative, negative, neutral, positive, very positive) to a sentence from a document." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-46", "text": "Most of the available sentiment analysis tools are concerned with detecting the polarity of a sentence or a document. But in recent years, a new task in the field of sentiment analysis has been introduced which is focused on identifying the polarity of the targets expressed in a sentence." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-47", "text": "A target is an object, and its components, attributes, and features." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-48", "text": "An object can be a product, service, etc." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-49", "text": "For instance, in a product review sentence, the model identifies the polarity of product features that have been commented on by the reviewer (Liu, 2010) ." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-50", "text": "As an example, in \"Food was great but the service was miserable.\" there are two opinion targets, \"food\" and \"service\"." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-51", "text": "The reviewer has a positive sentiment polarity on \"food\" and a negative sentiment polarity on \"service\"." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-52", "text": "This example shows why tools predicting the sentiment polarity of the sentence or the whole document are not sufficient for this task."
}, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-53", "text": "The superiority of aspect level models to sentence level and document level models becomes vivid when manufacturers or service providers want to know which component or feature of their product is not well enough and needs to improve based on the negative reviews they get from customers." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-54", "text": "The rest of the paper is organized as follows." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-55", "text": "In section 2, sentiment analysis datasets available in Persian and English are discussed." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-56", "text": "In section 3, details about the data collection and annotation are given." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-57", "text": "In section 4 we test Pars-ABSA dataset with recent systems available for English and discuss the results." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-58", "text": "In section 5 we conclude and give future directions of research." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-59", "text": "----------------------------------" }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-60", "text": "**RELATED WORKS**" }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-61", "text": "Persian is the official language of Iran, Afghanistan, and Tajikistan, and also is spoken in Uzbekistan and some other regions which were a part of the Greater Iran with more than 110 million speakers." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-62", "text": "Persian is generally classified as western Iranian languages and is from the Indo-European family." 
}, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-63", "text": "In document-based sentiment analysis in Persian, and provided a dataset of manually annotated reviews of cellphones from 829 online reviews; and, at (Hajmohammadi and Ibrahim, 2013) , a dataset of 400 manually annotated from Persian movie reviews was proposed." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-64", "text": "For" }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-65", "text": "----------------------------------" }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-66", "text": "**DATASET**" }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-67", "text": "Positive Neutral Negative Laptop Train 980 454 858 Test 340 171 128 Restaurant Train 2159 632 800 Test 730 196 195 Twitter Train 1567 3127 1563 Test 174 346 174 Table 1 -Number of data samples for each sentiment polarity of 3 English datasets sentence level, at (Basiri and Kabiri, 2017) proposed SPerSent, a Persian dataset consisting of 150,000 sentences from product reviews of Digikala website." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-68", "text": "Each sentence is associated with two types of labels, binary (Positive and Negative) and five-star rating and is labeled automatically based on the majority voting of three different lexicons." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-69", "text": "In last decade, in aspect-based sentiment analysis, most of the data resources and systems built so far are tailored to English and other languages like Chinese and Arabic." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-70", "text": "There are three datasets for English which are mainly used by researchers to compare the performance of their systems." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-71", "text": "These three datasets are Restaurants and Laptops (Pontiki et al., 2014) and Twitter (Dong et al., 2014) ." 
}, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-72", "text": "First and second datasets are annotated data samples from comments and reviews about laptops and restaurants from Semeval-2014 task 4: Aspectbased sentiment analysis and the last one is collected tweets from Twitter." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-73", "text": "The number of data samples in each dataset is given in Table 1 ." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-74", "text": "At (Saeidi et al., 2016) , Sentihood is presented, which consists of annotated data from a QA platform in the domain of neighborhoods of a city." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-75", "text": "Along with this dataset, the task of targeted aspectbased sentiment analysis is introduced which is different from general aspect-based sentiment analysis, in extracting fine-grained information with respect to targets specified in reviews." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-76", "text": "At (Chen et al., 2017) , a Chinese dataset for aspectbased sentiment analysis from comments about the news with 6365 positive, 9457 neutral and 6839 negative annotated data samples was proposed." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-77", "text": "----------------------------------" }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-78", "text": "**DATA COLLECTION AND ANNOTATION**" }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-79", "text": "To the best of our knowledge, there is no available research on aspect-based sentiment analysis of Persian which is due to the lack of 1 http://www.digikala.com, Based on the terms of Digikala, the information of their website is allowed to be used for non-commercial activities with referring to them." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-80", "text": "publically available datasets for this language." 
}, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-81", "text": "It is noteworthy to mention that there is a Persian corpus named, SentiPers which is at three levels (document-level, sentence-level, and aspect-level) but their article has not been published yet." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-82", "text": "Pars-ABSA was created from collected user reviews from Digikala 1 website." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-83", "text": "Digikala is the biggest ecommerce startup in Iran and thousands of people buy goods from its website daily, and some of them post comments about the products they bought and share their experiences with others." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-84", "text": "More than 600,000 comments were scraped from Digikala website till January 2019." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-85", "text": "For manually annotating data samples a framework with Python language and Jupyter notebook was made." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-86", "text": "After the completion of annotation process, the dataset was confirmed and verified by 3 participants who were native Persian speakers." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-87", "text": "Statistical information about the proposed dataset is given in Table 2 ." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-88", "text": "Pars-ABSA dataset is stored in two formats: 1-XML 2-Text." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-89", "text": "In XML format, there is a main tag named \"sentences\" that contains all of the data samples." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-90", "text": "For each sentence in the dataset, there is a \"sentence\" tag available inside the main tag. \"sentence\" tag contains two types of tags, first is \"text\" tag that contains the sentence and the second is \"aspectTerms\" that contains one or more \"aspectTerm\" tags." 
}, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-91", "text": "Since it is possible to have more than one aspect in each sentence, all of the available aspects in each sentence are going to be inside \"aspectTerms\" and for each aspect, there would be an \"aspectTerm\" tag with four attributes." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-92", "text": "These attributes are:" }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-93", "text": "\u2022 from: the position of the start of aspect in the sentence \u2022 to: the position of the end of aspect in the sentence \u2022 term: aspect term in the sentence \u2022 polarity: sentiment polarity of aspect term A data sample stored in XML format is given at Fig. 1. In the second format, for each aspect term, there are three lines inside the file, the sentence is at first line but the aspect term is replaced with \"$T$\", aspect term is written in second line and in third line, there is a number available for sentiment polarity of the aspect term (1 for positive, 0 for neutral and -1 for negative)." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-94", "text": "An example of a data sample in the text format is given at Fig. 2. Table 3 ." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-95", "text": "Performance of models on English datasets." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-96", "text": "----------------------------------" }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-97", "text": "**EXPERIMENTS**" }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-98", "text": "To test Pars-ABSA dataset, 6 recent systems available for aspect-based sentiment analysis with a focus on deep learning methods and a typical long short-term memory network model were used." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-99", "text": "Table 3 compares the performance of these models on English datasets based on f1 score macro and accuracy metrics."
}, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-100", "text": "These methods are: \u2022 AOA : An attention over attention neural network which captures the interaction between aspects and context sentences." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-101", "text": "\u2022 Cabasc (Liu et al., 2018) : This method utilizes two attention mechanisms: sentence-level content attention mechanism which captures the important information about given aspects from a global perspective, while the context attention mechanism is responsible for simultaneously taking the order of the words and their correlations into account, by embedding them into a series of customized memories." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-102", "text": "\u2022 RAM (Chen et al., 2017) : This method makes a memory from the input and with using multiple attention mechanism, it extracts important information from memory and for prediction it uses a combination of the extracted features of different attentions non-linearly." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-103", "text": "\u2022 IAN (Dehong et al., 2017) : It learns the attentions inside the document and targets interactively, and originates the representations for targets and the document separately." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-104", "text": "\u2022 ATAE-LSTM (Wang et al., 2016) : It uses attention mechanism along with Long Short-Term Memory Network." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-105", "text": "\u2022 TD-LSTM (Tang et al., 2016) Table 4 ." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-106", "text": "Performance of models on Pars-ABSA." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-107", "text": "But among the models, the result of TD-LSTM (Tang et al., 2016) was quite surprising."
}, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-108", "text": "Because the other models were proposed after TD-LSTM (Tang et al., 2016) and their performances on English datasets were better than TD-LSTM (Tang et al., 2016) , they were expected to perform better on the Pars-ABSA dataset too." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-109", "text": "The authors of RAM (Chen et al., 2017) claimed that their model is language insensitive, which means it can perform on all languages and, compared to TD-LSTM (Tang et al., 2016) which might lose features if the opinion word is far from the target, they employed recurrent attention to solve this problem." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-110", "text": "But by comparing results, it is obvious that TD-LSTM (Tang et al., 2016) outperforms their method in Persian." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-111", "text": "Pars-ABSA dataset is available through: https://github.com/Titowak/Pars-ABSA" }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-112", "text": "----------------------------------" }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-113", "text": "**CONCLUSION AND FUTURE WORKS**" }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-114", "text": "In this paper, Pars-ABSA, a Persian aspect-based sentiment analysis dataset gathered from Digikala website was presented; also, the method of collecting and annotating the dataset and its properties and statistics were provided." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-115", "text": "Some of the available models in aspect-based sentiment analysis were used as baselines, and their performances were compared on the dataset." }, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-116", "text": "As future plans, we aim to extend Pars-ABSA with more reviews to include different domains such as restaurants and hotels."
}, { "sent_id": "9c3c35343aeaae0520d92f64e118a2-C001-117", "text": "Due to the reviews and the documents posted on social medias are mostly in informal writing and the structure of sentences and even forms of the words are different from the formal writing, we are working on more advanced approaches to provide adequate tools for processing the data and a model that works properly for Persian language and the dataset." } ], "y": { "@BACK@": { "gold_contexts": [ [ "9c3c35343aeaae0520d92f64e118a2-C001-76" ], [ "9c3c35343aeaae0520d92f64e118a2-C001-102" ], [ "9c3c35343aeaae0520d92f64e118a2-C001-109" ] ], "cite_sentences": [ "9c3c35343aeaae0520d92f64e118a2-C001-76", "9c3c35343aeaae0520d92f64e118a2-C001-102", "9c3c35343aeaae0520d92f64e118a2-C001-109" ] }, "@USE@": { "gold_contexts": [ [ "9c3c35343aeaae0520d92f64e118a2-C001-102", "9c3c35343aeaae0520d92f64e118a2-C001-99" ] ], "cite_sentences": [ "9c3c35343aeaae0520d92f64e118a2-C001-102" ] }, "@DIF@": { "gold_contexts": [ [ "9c3c35343aeaae0520d92f64e118a2-C001-109", "9c3c35343aeaae0520d92f64e118a2-C001-110" ] ], "cite_sentences": [ "9c3c35343aeaae0520d92f64e118a2-C001-109" ] } } }, "ABC_929020618e8e1daa6a769f552a4655_45": { "x": [ { "sent_id": "929020618e8e1daa6a769f552a4655-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-2", "text": "Evaluating the readability of a text can significantly facilitate the precise expression of information in a written form." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-3", "text": "The formulation of text readability assessment demands the identification of meaningful properties of the text and correct conversion of features to the right readability level." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-4", "text": "Sophisticated features and models are being used to evaluate the comprehensibility of texts accurately." 
}, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-5", "text": "Still, these models are challenging to implement, heavily languagedependent, and do not perform well on short texts." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-6", "text": "Deep reinforcement learning models are demonstrated to be helpful in further improvement of state-of-the-art text readability assessment models." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-7", "text": "The main contributions of the proposed approach are the automation of feature extraction, loosening the tight language dependency of text readability assessment task, and efficient use of text by finding the minimum portion of a text required to assess its readability." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-8", "text": "The experiments on Weebit, Cambridge Exams, and Persian readability datasets display the model's state-of-theart precision, efficiency, and the capability to be applied to other languages." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-9", "text": "----------------------------------" }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-11", "text": "Text as a prevalent form of communication has a fundamental role in conducting knowledge and information between humans." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-12", "text": "Nevertheless, not all texts are equally intelligible and understandable for all people." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-13", "text": "Therefore, to ensure the clarity and understandability of the written information, it is crucial to measure its readability." 
}, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-14", "text": "The significance of this measurement is apparent from its applications in different fields such as education [1, 2] , medical instructions [3] [4] [5] [6] [7] [8] , social media communications [9] , marketing and advertising [10] [11] [12] , and in some related fields of research like text simplification [13] [14] [15] ." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-15", "text": "However, readability assessment entails some challenges." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-16", "text": "The first attempt to quantify the readability of the text was the manual intuition-based evaluation, which was done by human readability experts." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-17", "text": "Such evaluation is not standardized or globally correct; hence, researchers such as Flesch [16] has developed readability measurement formulas." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-18", "text": "These formulas use simple and manually calculable properties of the text, such as the number of syllables, words, or sentences in the text to assess its readability." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-19", "text": "These formulas become so popular that they are even widely used nowadays." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-20", "text": "Nonetheless, the low accuracy of such formulas and their language dependency made way for more advanced and accurate readability assessment methods, which involve machine learning techniques." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-21", "text": "These models are highly accurate for their use of sophisticated NLP features and machine intelligence to associate the extracted features to a proper readability level." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-22", "text": "Models proposed by Vajjala and Meurers [17] , Xia et al. 
[18] , and Mohammadi and Khasteh [19] are examples of state-of-the-art models for their target languages and target audience." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-23", "text": "These models are using Support Vector Machines trained on complex and proper feature sets extracted from related datasets." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-24", "text": "Still, their use of complicated and language-specific NLP features makes these models challenging to implement and heavily language-dependent." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-25", "text": "Furthermore, they do not offer any solution to the problem of finding the minimum portion of the text required to accurately assess the readability of a long text." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-26", "text": "The feature extraction task from a long text is computationally heavy, and minimizing the required length of the text to assess its readability is vital in large collections of documents." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-27", "text": "Utilizing the recent advances in deep learning and deep reinforcement learning, a new approach to text readability assessment is introduced in this study." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-28", "text": "Word-to-vec and frequency language model are used to represent the input text in order to eliminate the need for sophisticated NLP features." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-29", "text": "In addition to that, such text representation enables the applicability of the model on different languages using the word-to-vec and frequency language model of the target language." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-30", "text": "The new model is a deep convolutional recurrent double dueling Q network." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-31", "text": "The model's perception of the input text is limited to a window of five adjacent words." 
}, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-32", "text": "The position of the window could be changed by the model's actions to perceive other parts of the text." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-33", "text": "The ability of the model to intelligently pick the portion of the text to be perceived makes it possible to find the minimal portion of the fed text to assess its readability." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-34", "text": "The structure of the paper is as follows: Firstly, section 2 discusses the previous attempts to automate the readability assessment task in detail." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-35", "text": "Next, section 3 presents the proposed DRL model and describes its architecture." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-36", "text": "Later, section 4 reviews the results of the experiments and explains the advantages and disadvantages of the proposed model." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-37", "text": "Finally, section 5 states the main contributions of this study and the potential future works." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-38", "text": "----------------------------------" }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-39", "text": "**RELATED WORKS**" }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-40", "text": "The literature around automated text readability assessment can be classified into two main categories: (i) traditional formulas and (ii) machine learning models." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-41", "text": "In summary, traditional formulas are a naive mixture of shallow and easily calculable features." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-42", "text": "These formulas are hand calculated and tuned to capture the readability of the text." 
}, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-43", "text": "On the other hand, machine learning models, which came as an effort to compensate for the low accuracy of traditional models, use a large number of simple to complex machine extracted features." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-44", "text": "The relation between the value of these features and the readability of the corresponding text is learned by machine learning techniques utilizing a large dataset of pre-labeled texts." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-45", "text": "The created model can be used to assess the readability of newly seen texts accurately." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-46", "text": "Flesch-Kincaid grade level [16] can be named as one of the earliest and most utilized readability formulas for the English language." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-47", "text": "Flesch-Kincaid readability formula uses only the average number of words per sentence and the average number of syllables per word to evaluate the text readability." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-48", "text": "Other similar formulas are Gunning-Fog [20] and Chall-Dale Chall and Dale [21] , which also use simple features to estimate the readability of English texts." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-49", "text": "The number of \u00e2\u0102\u0132complex words\u00e2\u0102\u0130 and \u00e2\u0102\u0132difficult words\u00e2\u0102\u0130 are calculated using a predefined list of such words." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-50", "text": "These formulas can be seen in Eq. 1, 2, and 3, respectively." 
}, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-51", "text": "----------------------------------" }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-52", "text": "**FLESCH-KINCAID GRADELEVEL**" }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-53", "text": "The features used in the above formulas can be calculated by hand." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-54", "text": "With the advancements in automated computations, computer calculated and extracted features are used in text readability assessment applications." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-55", "text": "Lexile [22] and the work of Collins-Thompson and Callan [23] used word-frequency and language models, respectively." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-56", "text": "The use of statistical models in text readability assessment produced an enhancement in the accuracy of such models." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-57", "text": "The only available text readability assessment formula for the Persian language is an adaptation of the Flesch-Kincaid formula, called the Flesch-Dayani formula [24] ." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-58", "text": "The constants in this formula are specially tuned to suit the text readability assessment of the Persian language." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-59", "text": "This formula is shown in Eq. 4." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-60", "text": "The traditional formulas are straightforward to implement, need limited computational resources, and the results are clear to interpret." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-61", "text": "Despite these benefits, the most critical shortcoming of these methods is their low accuracy and the significant difference between their results and human judgments [25] [26] [27] [28] ." 
}, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-62", "text": "Moreover, these formulas cannot readily be used to assess the readability of texts in other languages as they are specially designed for a particular language." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-63", "text": "These formulas are also not suitable for short text applications, which are more widespread in web and social media nowadays [29] ." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-64", "text": "With the advent of machine learning methods and in order to overcome the deficiencies of traditional text readability assessment formulas, researchers have employed machine learning models to create a more accurate and comprehensive text readability assessment system." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-65", "text": "Text readability assessment can be viewed both as a regression or a classification problem." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-66", "text": "Although, studies have shown greater accuracy and applicability of text readability assessment as a classification task [30] ." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-67", "text": "The use of a more significant number of features (naive or sophisticated) and the automated learning of the relation between the features and the readability level is the foremost advantage of the machine learning models, which make them preferred to the traditional formulas [31] ." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-68", "text": "The choice of features is a vital step in assembling a machine learning model since a model is as good as its features." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-69", "text": "Consequently, the notable difference between proposed machine learning approaches to automated text readability assessment is their set of features." 
}, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-70", "text": "Simple features, such as the average number of characters or syllables in words, the average number of words in sentences, the number of sentences in a text, and simple statistical language models were features used in early machine learning models for text readability assessment like works presented in [27, 32] ." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-71", "text": "The use of syntactical features [33] and cohesive features [34] also supported the realization of models with higher accuracy." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-72", "text": "The state-of-the-art model for automated English text readability assessment for native readers is proposed by Vajjala and Meurers [17] ." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-73", "text": "The introduced model is a Support Vector Machine (SVM) classifier, using a comprehensive set of features." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-74", "text": "These features consist of different traditional, lexical, and syntactical features totaling a set of 46 distinct features." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-75", "text": "The second language text readability assessment can be considered as a subfield of text readability assessment." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-76", "text": "Despite the widespread use of English as a second language, this field has seen a few thorough studies." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-77", "text": "Different models are required to assess the readability of English texts for the second language readers as a different set of characteristics of text is influential on its readability level for second language readers [35] ." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-78", "text": "Xia et al. [18] has published a thorough study on second language text readability assessment." 
}, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-79", "text": "Similar to the study done by Vajjala and Meurers [17] , Xia has used an SVM classifier, and a set of NLP features consists of traditional, lexico-semantic, parse tree syntactic, language modeling, and discourse-based features." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-80", "text": "Many comparable studies have been carried out to create automated text readability assessment models for languages such as French [36] Russian [37] Germen [38] Chinese [39] Arabic [40] Portuguese [41] ." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-81", "text": "This study is concentrated on the English and Persian languages as our test case for multilingual text readability assessment." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-82", "text": "The only known machine learning based text readability assessment model for the Persian language is the model proposed by Mohammadi and Khasteh [19] , which also uses an SVM model." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-83", "text": "In conclusion, machine learning models can obtain higher accuracies in contrast to the traditional formulas while being more straightforward to construct, assuming the existence of a useful dataset." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-84", "text": "In contrast, their use of a large number of sophisticated features makes them time-consuming and costly to implement, language-dependent, and less interpretable." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-85", "text": "The focus of this study is to reduce the need for intricate feature engineering and language dependency of text readability assessment models." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-86", "text": "In order to overcome these problems, the current advances in deep learning, reinforcement learning, and their mixture, deep reinforcement learning became advantageous." 
}, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-87", "text": "In recent years, the massive expansion of accessible data and the progress in hardware resources, especially Graphical Processing Units (GPUs), have aided researchers to build bigger, deeper, and more sophisticated artificial neural networks." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-88", "text": "Besides, the increased popularity of convolutional [42, 43] , and recurrent [44, 45] architectures in NLP applications have provoked an unprecedented growth in the accuracy of computational linguistic models." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-89", "text": "The introduction of vector representation of text (i.e., word-to-vec [46] ), which can reduce the need for sophisticated feature engineering, has promoted the creation of a new set of deep NLP models that can deliver state-of-the-art accuracies on several standard NLP tasks without old-fashioned feature engineering and extraction [42, 47, 48] ." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-90", "text": "The ability to achieve high accuracy using simple features is the main advantage of deep NLP models." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-91", "text": "Nevertheless, these neural networks use more processing power and require specialized hardware to be trained and used and need massive datasets to achieve superior accuracies." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-92", "text": "Reinforcement learning can be considered as a type of semi-supervised machine learning technique." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-93", "text": "Being able to learn from partially labeled data makes reinforcement learning especially beneficial in NLP tasks." 
}, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-94", "text": "Hence, there is a trend in using reinforcement learning models in NLP tasks such as machine translation [49] [50] [51] [52] , sentence simplification [53] , text summarization [54, 55] , dialogue generation [56] , question answering [57] , question generation [58] , and text generation [59] ." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-95", "text": "Furthermore, the combination of reinforcement learning and deep learning as deep reinforcement learning models help to fuse the advantages of both fields to produce more accurate and efficient models for NLP tasks." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-96", "text": "These models can actively manipulate their input and intelligently focus on the part of a text that carries more valuable information in their task." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-97", "text": "Despite some drawbacks, such as training instability, these models can achieve higher accuracies and performance in NLP tasks in comparison to earlier models." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-98", "text": "----------------------------------" }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-99", "text": "**PROPOSED APPROACH 3.1 MODEL-TEXT INTERACTION**" }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-100", "text": "The prevalent representation of texts in automated text readability assessment studies is the vector space modeling." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-101", "text": "The predefined features extracted from each text can form a point in an N-dimensional space, which can be a fair representation of that text attributes." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-102", "text": "The formed vector representation can be used in order to assign a readability level to the text using a classification model." 
}, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-103", "text": "As features can miss some vital information in the text, the classifier model is limited to the insufficiencies of the extracted features." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-104", "text": "Also, in most cases, the readability of the text can be assessed only using a small portion of a homogenous text (a text in which the readability of the text is consistent throughout the text)." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-105", "text": "Hence the extraction of features from the whole text is a waste of processing power and time." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-106", "text": "Current advances in deep NLP reduces the necessity of old-fashioned feature extraction to a great extent." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-107", "text": "With the use of word-to-vec representation, it is possible to feed the raw text into a convolutional neural network and extract beneficial and application-specific features from a text [42] ." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-108", "text": "In this study, a specific form of word-to-vec representation named GloVe [60] is used, which resulted in the best performance of the new model." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-109", "text": "The representation of a text using this method can be described as a sequence of equal length vectors." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-110", "text": "The problem of finding the minimal length of text (text length is defined here by the number of words) to be fed to the model to achieve an optimal trade-off between the accuracy of evaluation and the fed length of the text is particularly challenging." 
}, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-111", "text": "Firstly, the required feed length is different for each text (as human evaluators might read different portions of different texts to determine their readability level)." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-112", "text": "Some text's readability level can be readily determined by a small portion of it, while some more ambiguous texts (i.e., texts with characteristics close to two adjacent readability levels) demand a more lengthy portion of text to be accurately assessed." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-113", "text": "Secondly, creating datasets including optimal length information is not possible, as the optimal length is different for each human evaluator and, therefore, for each model; hence, the optimal length should be determined exclusively for a certain model." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-114", "text": "To learn the optimal length for each text from data with no such labels, it is possible to switch from supervised learning to semi-supervised learning (as the dataset is labeled regarding its readability level but not the optimal feed length)." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-115", "text": "A text represented as a sequence of vectors (word-to-vecs) can be considered a partially observable environment for a reinforcement learning model." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-116", "text": "The observable information for the model at each step is the corresponding vectors of 5 words in a focus window." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-117", "text": "Five is chosen as window length as it is the minimal length of the window, which resulted in desired accuracy and performance in this study." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-118", "text": "The outputs (or actions) of the model are the readability classes and actions to move through the text." 
}, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-119", "text": "In the learning process, applying a suitable reward shaping, the RL model can learn the distinction between these two groups of actions and learn when to read further through the text and when to decide the readability of the text." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-120", "text": "The GloVe representation of a text does not capture the statistical properties of its words, such as their usage frequency." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-121", "text": "In order to compensate for this loss, a scaled language model feature (N-gram frequency, N from 1 to 5) is appended to each word's GloVe, which contains the language model information of the words in the window 1,2 ." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-122", "text": "An analysis of the effects of adding a frequency language model to the word representation shows an approximate 10 percent gain in the accuracy of the proposed model in the presence of this feature." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-123", "text": "Consequently, the model can observe a window of 5 word's feature vectors, containing the GloVe, and the language model values, from the input text at each step." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-124", "text": "In this study, a GloVe representation with a length of 100 is chosen." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-125", "text": "Smaller GloVe representations can be processed using a DRL model with a fewer number of parameters, while longer GloVe representations require a larger DRL model to be utilized effectively." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-126", "text": "The choice of 100 between other available GloVe representation lengths (i.e., 50, 200, or 300) is for its optimal balance between the accuracy and required processing power." 
}, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-127", "text": "Two sets of rewards are given to the model for its interactions with the partially observable textual environment." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-128", "text": "The first set is negative rewards given for changes in the window position to encourage the model to take the smallest number of steps." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-129", "text": "Further, a positive or negative reward can be observed by the model for choosing a readability level for the intended text by picking one of the readability classes." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-130", "text": "The positiveness or negativeness of the rewards depends on the correctness (positive reward) or incorrectness (negative reward) of the decided readability class." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-131", "text": "A visual depiction of model interaction with the textual environment is shown in Figure 1 ." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-132", "text": "----------------------------------" }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-133", "text": "**MODEL ARCHITECTURE**" }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-134", "text": "The model introduced in this study is a deep convolutional recurrent double dueling Q network with experience replay, as presented in Figure 2 ." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-135", "text": "As stated before, the input of the model is a list of five vectors carrying the GloVe and language model values of the corresponding words in the focus window." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-136", "text": "Thus, a 2D vector is fed to the model per each window, which is a raw representation of perceived words." 
}, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-137", "text": "To elicit useful features, a set of convolutional layers map the 2D input vector into a 1D feature vector, which can be used in the following layers of the DRL model to learn the Q values for the actions in the textual environment." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-138", "text": "The implemented convolutional encoder consists of four 2D convolutional layers." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-139", "text": "The sequential nature of the input text vector representation creates a need for memorizing the previously observed words and, as intended, determining the Q values of each action by integrating the information perceived throughout the revealed text." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-140", "text": "Consequently, a recurrent (LSTM) layer is utilized to capture and learn the temporal information in the extracted feature vectors." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-141", "text": "The stated DRL model is a dueling Q network [61] ." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-142", "text": "Dueling Q networks calculate the Q values of each action by independently calculating the state value and action advantage values." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-143", "text": "The Q value can be calculated by adding up the state value and advantage value." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-144", "text": "The benefit of dueling models is the detachment of prediction of the current state value from the value of taking a specific action, which is much useful in the current study as the careful judgment between the selection of a readability level in the current state or moving further through text is crucial." 
}, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-145", "text": "Each value (state value or advantage value) is calculated using separate dense layers, and the model takes a two-stream architecture after the LSTM layer." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-146", "text": "The products of these two streams are added up to form the Q values, which indicates the value of each readability level or move action." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-147", "text": "A more detailed report on the model architecture is presented in Table 1 ." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-148", "text": "----------------------------------" }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-149", "text": "**MODEL TRAINING**" }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-150", "text": "The Q learning equation is used to calculate the loss of the model at each training step (Eq. 5 and 6)." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-151", "text": "To stabilize the process of learning as deep reinforcement learning models are prone to divergence, a method called double Q network learning [62] is used." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-152", "text": "In this method, an identical separate instance of the main model is used to compute the target Q value of each state-action pair to avoid the oscillation of Q values." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-153", "text": "The copy model, which is called the target network, is frozen (is not trained) for some predefined training steps and is replaced by a new copy of the main Q network after the predefined training steps." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-154", "text": "Deep reinforcement learning models demand numerous interactions with their environment to be sufficiently trained." 
}, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-155", "text": "To overcome this problem and also further stabilize the process of learning, a technique called experience replay [63] is used in the current study." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-156", "text": "The previous interactions of the DRL model are stored in [current-state, action, reward, next-state] tuples, which are fed multiple times to the DRL model during its training process." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-157", "text": "The reintroduction of previous experiences to the model increases the efficiency of data usage and prevents the model from forgetting older experiences." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-158", "text": "Other hyper-parameters are presented in Table 2 ." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-159", "text": "----------------------------------" }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-160", "text": "**EXPERIMENTS 4.1 DATASETS**" }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-161", "text": "Three datasets are used to evaluate three different aspects of the proposed model." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-162", "text": "Firstly, the Weebit dataset [17] is used to assess the accuracy of the model on deciding the readability of English texts for native readers." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-163", "text": "Weebit dataset is gathered from articles in the Weekly Reader magazine and the BBC-Bitesize website, which is targeted at readers of various ages." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-164", "text": "Weebit dataset comprises texts from five different readability levels arranged by age (8-9, 9-10, 10-12, 11-14, [14] [15] [16] ." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-165", "text": "The total number of texts in the dataset is over ten thousand texts." 
}, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-166", "text": "Nonetheless, due to the significant imbalance in the dataset, some texts are randomly removed from some classes." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-167", "text": "A total of 3145 texts is used for evaluation purposes, similar to the original Weebit paper [17] ." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-168", "text": "Prior to this study, separate models were used to assess the readability of English texts for second-language readers." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-169", "text": "Since the DRL model eliminates the need for specific feature engineering for different types of text readability assessment, the proposed model can be applied to second-language datasets without any modifications." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-170", "text": "To examine this ability of the proposed DRL model, it is applied to the Cambridge Exams dataset [18] ." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-171", "text": "This dataset contains texts from the reading section of the Cambridge English Exams, which are targeted at students at five readability levels (A2 to C2) of the Common European Framework of Reference (CEFR)." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-172", "text": "This dataset contains 331 texts, which makes it a small dataset in comparison to the Weebit dataset." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-173", "text": "The automated feature extraction ability of the proposed model also allows it to be readily applied to other languages." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-174", "text": "As GloVe and language models are language-specific, the only necessary change is the use of the GloVe and frequency language model of the target language." 
}, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-175", "text": "These features are readily and freely available on the internet for many languages." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-176", "text": "The proposed model is evaluated on the Persian text readability dataset [19] ." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-177", "text": "This dataset includes nearly four thousand texts, which are split into three balanced classes." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-178", "text": "Additional details concerning the datasets are shown in Table 3 ." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-179", "text": "----------------------------------" }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-180", "text": "**RESULTS**" }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-181", "text": "The proposed model is evaluated on the mentioned datasets." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-182", "text": "Each dataset is split into two parts, containing 80 percent (for training) and 20 percent (for testing) of the texts in the dataset, to ensure the validity of the experiment results." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-183", "text": "Two sets of evaluation metrics are used to report the evaluation results, to make them comparable to the previous state-of-the-art models." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-184", "text": "Firstly, considering text readability assessment as a classification problem, metrics such as accuracy, precision, recall, and F1-score are reported." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-185", "text": "Secondly, as the distance between readability classes is meaningful (classes are labeled from 1 to 5 for the English datasets and 1 to 3 for the Persian dataset; larger label values indicate more difficult texts), a smaller distance between the predicted label and the actual label is preferred." 
}, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-186", "text": "To quantify this property, the RMSE metric is used, which is also used in some previous studies [17] ." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-187", "text": "The comparison between the proposed model and the other state-of-the-art models is shown in Table 4 ." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-188", "text": "To describe the model's performance on each dataset in detail, the class-level classification experiment results are presented in Tables 5, 6 , and 7." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-189", "text": "As shown in Table 4 , the proposed model achieves such performance while effectively using only a small portion of the input text." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-190", "text": "The average number of window moves to make an accurate text readability classification on the three datasets is 1.39 moves." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-191", "text": "On average, 38 percent of texts are classified without any movement of the window, and 3 percent of texts are classified after more than two moves." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-192", "text": "Considering the window length of five, the DRL model can assess a text's readability level only by looking at its first 6.95 words on average." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-193", "text": "Considering the problem of \"how much of a text should be given to an automated text readability assessment model, to assess its readability accurately?\", this ability can be advantageous." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-194", "text": "The small portion needed for a readability level classification also implies that the proposed model can adequately be used on short texts." 
}, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-195", "text": "Finally, the DRL model is capable of assessing the readability of different languages using the same architecture, only by being trained with the target language's GloVe representation and N-gram frequency language model." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-196", "text": "On the other hand, the proposed model has some drawbacks regarding its architecture and input interaction." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-197", "text": "First of all, the use of deep architecture in the proposed model has increased the required memory and computational resources." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-198", "text": "To train the DRL model, resources such as Graphical Processing Units (GPUs) are required, while the previous models are both lightweight and swiftly trainable on average CPUs." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-199", "text": "However, the prior models demanded a time-consuming feature extraction phase, which is largely eliminated in the new model." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-200", "text": "Additionally, the ability of the model to freely move through the text causes a particular problem." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-201", "text": "In some cases, the DRL model will not choose a readability class for the given text within a predefined maximum number of moves through the text, due to a lack of confidence in choosing a readability level (moves at the end of a text will not change the position of the window)." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-202", "text": "A solution to this problem is to increase the negative reward of moving through the text, which encourages the model to judge the readability level more quickly." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-203", "text": "This solution reduced this problem to around 1 percent of the texts." 
}, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-204", "text": "These cases are included in the reported test results as a false-negative prediction with maximum error in RMSE calculations." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-205", "text": "By decreasing the negative reward of moving, the number of wrong classifications decreases due to the higher number of moves through the text." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-206", "text": "Yet the number of undecided texts increases, and the overall accuracy of the model drops." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-207", "text": "----------------------------------" }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-208", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-209", "text": "The use of a deep reinforcement learning model as a classifier introduced a new class of text readability assessment models, which achieve multilinguality and efficient use of text, in addition to high classification accuracy." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-210", "text": "The proposed model also simplifies the process of feature extraction by representing text using GloVe and a frequency language model." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-211", "text": "The interaction between the deep reinforcement learning model and the textual environment makes it possible for the model to assess the readability of a text using only a minimal portion of it." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-212", "text": "The newly proposed approach to text readability assessment introduces some problems, such as its requirement for powerful computational resources in the training phase, which is mostly due to its use of convolutional feature extraction layers." 
}, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-213", "text": "The use of more efficient convolutional architectures, such as grouped convolution [64] , depthwise convolution [65] , or channel shuffle [66] , might decrease the complexity of the model and therefore reduce its computational weight." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-214", "text": "Better reward shaping or text representations can help to eliminate the undecided-texts problem, which would consequently improve the model's accuracy." }, { "sent_id": "929020618e8e1daa6a769f552a4655-C001-215", "text": "Experiments with the proposed model on different languages, and even on other classification tasks, are also exciting next steps of this research." } ], "y": { "@BACK@": { "gold_contexts": [ [ "929020618e8e1daa6a769f552a4655-C001-22" ], [ "929020618e8e1daa6a769f552a4655-C001-78" ] ], "cite_sentences": [ "929020618e8e1daa6a769f552a4655-C001-22", "929020618e8e1daa6a769f552a4655-C001-78" ] }, "@USE@": { "gold_contexts": [ [ "929020618e8e1daa6a769f552a4655-C001-170" ] ], "cite_sentences": [ "929020618e8e1daa6a769f552a4655-C001-170" ] } } }, "ABC_8b223f35a4685d6627d29c907e4742_45": { "x": [ { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-2", "text": "The central idea of this paper is to gain a deeper understanding of song lyrics computationally." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-3", "text": "We focus on two aspects: style and biases of song lyrics." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-4", "text": "All prior works to understand these two aspects are limited to manual analysis of a small corpus of song lyrics." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-5", "text": "In contrast, we analyzed more than half a million songs spread over five decades." 
}, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-6", "text": "We characterize the lyrics style in terms of vocabulary, length, repetitiveness, speed, and readability." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-7", "text": "We have observed that the style of popular songs significantly differs from other songs." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-8", "text": "We have used distributed representation methods and WEAT test to measure various gender and racial biases in the song lyrics." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-9", "text": "We have observed that biases in song lyrics correlate with prior results on human subjects." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-10", "text": "This correlation indicates that song lyrics reflect the biases that exist in society." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-11", "text": "Increasing consumption of music and the effect of lyrics on human emotions makes this analysis important." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-12", "text": "----------------------------------" }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-13", "text": "**INTRODUCTION**" }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-14", "text": "Music is an integral part of our culture." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-15", "text": "Smartphones and near ubiquitous availability of the internet have resulted in dramatic growth of online music consumption [17] ." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-16", "text": "More than 85% of online music subscribers search for song lyrics, indicating a keen interest of people in song lyrics." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-17", "text": "Song lyrics have a significant impact on human emotions and behavior." 
}, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-18", "text": "While listening to songs with violent lyrics can increase aggressive thoughts and hostile feelings [1] , listening to songs with pro-social lyrics increased the accessibility of pro-social thoughts, led to more interpersonal empathy, and fostered helping behavior [10] ." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-19", "text": "This paper is motivated by the observation that song lyrics have not received enough attention from the research community to understand them computationally." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-20", "text": "Several works focus on related problems such as personalized music recommendation [13] , song popularity prediction, and genre identification [8] ." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-25", "text": "SIGIR '19, July 21-25, 2019, Paris, France \u00a9 2019 Association for Computing Machinery." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-26", "text": "ACM ISBN 978-1-4503-6172-9/19/07. $15.00. https://doi.org/10.1145/3331184.3331363. However, prior works on style analysis and bias measurement are limited only to the manual analysis of a few hundred songs." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-27", "text": "Such an approach cannot scale to the analysis of millions of songs." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-28", "text": "These works also fail to capitalize on recent advances in computational methods that generate a semantic representation for natural language text." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-29", "text": "Our work aims to fill this gap by applying these methods on a large corpus of song lyrics scraped from online user-generated content." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-30", "text": "There are two main takeaway results from our work." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-31", "text": "First, popular songs significantly differ from other songs when it comes to the style of lyrics." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-32", "text": "This difference indicates that lyrics play a major role in deciding the popularity of a song." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-33", "text": "Second, biases in song lyrics correlate with biases measured in humans." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-34", "text": "This correlation indicates that song lyrics reflect the existing biases in society." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-35", "text": "To the best of our knowledge, this is the first work that analyzes song lyrics at the scale of half a million lyrics to understand style and bias." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-36", "text": "Our results are reproducible as all our code and datasets are available publicly on the Web 1 ." 
}, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-37", "text": "We briefly review the related work in Section 2." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-38", "text": "Our results are presented in Sections 3 and 4." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-39", "text": "Our conclusion and future work are highlighted in Section 5." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-40", "text": "----------------------------------" }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-41", "text": "**RELATED WORK**" }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-42", "text": "Song lyrics have been used for many tasks related to music mining such as genre identification and popularity prediction." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-43", "text": "Earlier works considered lyrics as a weak source of song characteristics as compared to auditory or social features." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-44", "text": "However, recent works have shown the strength of lyrics for music mining." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-45", "text": "Barman et al. have shown that knowledge encoded in lyrics can be utilized to improve the distributed representation of words [2] ." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-46", "text": "Mayer et al. have introduced various features for lyrics processing [15] ." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-47", "text": "Fell and Sporleder presented a lyrics-based analysis of songs based on vocabulary and song structure [8] ." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-48", "text": "Our work complements these works by characterizing lyrics style using multiple attributes extracted from lyrics." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-49", "text": "Many studies have analyzed gender and racial biases in song lyrics [14, 18] ." 
}, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-50", "text": "However, such an approach of manual analysis cannot scale to millions of songs." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-51", "text": "Caliskan et al. proposed Word Embedding Association Test (WEAT) to computationally measure biases in any text repository [5] ." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-52", "text": "Their test quantifies biases by computing similarity scores between various sets of words." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-53", "text": "To compute similarity, the WEAT test represents words using a distributed word representation method such as fastText or word2vec [4, 16] ." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-54", "text": "We apply the WEAT test on song lyrics and discuss its implications." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-55", "text": "----------------------------------" }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-56", "text": "**STYLE ANALYSIS**" }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-57", "text": "For style analysis, we created two datasets: Billboard (BB) and Million Song Dataset (MSD)." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-58", "text": "For both the datasets, we obtained the song lyrics by scraping through the user-generated content on multiple websites such as MetroLyrics (www.metrolyrics.com)and LyricsMode (www.lyricsmode.com)." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-59", "text": "The BB dataset contains top 100 songs for each year (1965 to 2015) from the Billboard Hot 100 list (www.billboard.com/charts/hot-100)." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-60", "text": "We consider these 5100 songs as popular songs." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-61", "text": "The original million song dataset only provides audio features and song metadata [3] ." 
}, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-62", "text": "It does not provide song lyrics." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-63", "text": "Out of all songs listed in the original million song dataset, we could obtain lyrics only for 451,045 songs." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-64", "text": "We ensured that our BB and MSD datasets had no songs in common." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-65", "text": "Thus, the total number of songs in our analysis is around half a million." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-66", "text": "We had to do extensive data cleaning and preprocessing to use the scraped lyrics." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-67", "text": "We started with an analysis of the vocabulary of song lyrics. Please refer to Figure 1 ." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-68", "text": "This figure shows word clouds of the top 100 words used in popular song lyrics for the years 1965 and 2015." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-69", "text": "We can observe that there is a major shift in vocabulary over time." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-70", "text": "For example, in 1965 the most popular word was \"love\"." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-71", "text": "However, that was no longer the case in 2015." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-72", "text": "To better visualize such trends, we have developed a word rank comparison tool 2 ." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-73", "text": "Given any set of words, this tool plots the relative popularity of those words in popular song lyrics through various years. Please refer to Figure 2 ." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-74", "text": "This figure compares the popularity of the words \"rock\" and \"blues\" over the period from 1965 to 2015." 
}, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-75", "text": "For a given year Y , a lower rank for a word W indicates more frequent use of that word W in popular song lyrics of year Y ." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-76", "text": "We can observe that the word \"rock\" has maintained its popularity as compared to the word \"blues\"." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-77", "text": "We also looked at the usage of swear words." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-78", "text": "To compile the list of swear words, we used various online resources." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-79", "text": "This list is available along with our code. Please refer to Figure 3 ." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-80", "text": "This graph shows a comparison between popular (BB) and other (MSD) song lyrics based on swear word usage over the period from 1965 to 2015." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-81", "text": "We can observe that other songs have steadily gained in swear word usage over this period." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-82", "text": "From 1965 to 1995, popular songs used fewer swear words as compared to the other songs." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-83", "text": "However, from the 1990s there is a persistent trend of increasing swear word usage in popular songs." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-84", "text": "As compared to the 1980s, swear words are used almost ten times more frequently now in popular song lyrics." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-85", "text": "Multiple studies have reported adverse effects of inappropriate content in music on the listeners [11] . (The word rank comparison tool is available at https://tiny.cc/songlyrics.)" }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-86", "text": "We also measured the length of song lyrics as the number of words in the song. Please refer to Figure 4a ." 
}, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-87", "text": "Other songs have shown a steady increase in length from 1965 to 2015." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-88", "text": "Popular songs also showed a similar trend till 1980." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-89", "text": "However, since then popular songs have been significantly longer than other songs. Please refer to Figure 4b ." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-90", "text": "This figure depicts the average duration of songs per year measured in seconds." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-91", "text": "Both popular and other songs rose in duration till 1980." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-92", "text": "Since then, other songs have maintained the average duration of about 240 seconds." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-93", "text": "In contrast, popular songs were of longer duration from 1980 to 2010." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-94", "text": "However, the current trend shows a far shorter duration for popular songs." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-95", "text": "Please refer to Figure 4c ." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-96", "text": "This graph compares popular and other songs based on speed." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-97", "text": "We measure the speed of a song as (length in words/duration in seconds)." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-98", "text": "We can observe that other songs have maintained an average speed of around 0.6 words per second." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-99", "text": "However, popular songs were comparatively slower till 1990 and have since become significantly faster than other songs." 
}, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-100", "text": "Some studies have reported that repetitive songs are lyrically processed more fluently and listeners prefer such songs [6] ." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-101", "text": "We computed repetitiveness of a song lyric as ((1-(number of unique lines/total number of lines))*100). Please refer to Figure 5 ." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-102", "text": "Except for the period from 1990 to 2000, popular songs are more repetitive than other songs." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-103", "text": "Readability tests are standardized tests designed to provide a measure indicating how difficult it is to read or understand a given English text." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-104", "text": "We applied the well known Flesch Kincaid Readability Test (FK) on the song lyrics [12] ." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-105", "text": "The FK test returns a number that directly relates to the US Education grade system while also indicating the number of years of education required to understand the given English text." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-106", "text": "For example, an FK score of 5 indicates that anybody educated up to 5th grade can read and understand the given English text. Please refer to Figure 6 ." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-107", "text": "It can be seen that the FK scores of popular songs have always been less than 2." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-108", "text": "Also, the FK scores of other songs have always been quite higher than the popular songs." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-109", "text": "This difference indicates that popular songs have always been easier to understand as compared to other songs." 
}, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-110", "text": "----------------------------------" }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-111", "text": "**BIAS MEASUREMENT**" }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-112", "text": "We humans have certain biases in our thinking." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-113", "text": "For example, some people can find flower names more pleasant and insect names more unpleasant." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-114", "text": "These biases reflect in our various activities such as politics, movies, and song lyrics as well." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-115", "text": "Implicit Association Test (IAT) is a well-known test designed to measure such biases in human beings [9] ." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-116", "text": "This test involves two sets of attribute words and two sets of target words." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-117", "text": "For example, consider two attribute word sets as pleasant words (nice, beautiful) and unpleasant words (dirty, awful)." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-118", "text": "Also, consider two target word sets as flower names (rose, daffodil) and insect names (gnat, cockroach)." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-119", "text": "The null hypothesis is that there should be little to no difference between the two sets of target words when we measure their similarity with the attribute word sets." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-120", "text": "The IAT test measures the unlikelihood of the null hypothesis by evaluating the effect size." 
}, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-121", "text": "A positive value of the IAT test indicates that people are biased to associate first attribute word set to the first target word set (Bias: flowers are pleasant) and second attribute word set with second target word set (Bias: insects are unpleasant)." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-122", "text": "A negative value of the effect size indicates the bias in the other direction, that is flowers are unpleasant, and insects are pleasant." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-123", "text": "Larger magnitude of effect size indicates a stronger bias." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-124", "text": "If the value of effect size is closer to zero, then it indicates slight or no bias." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-125", "text": "Caliskan et al. designed the Word Embedding Association Test (WEAT) by tweaking the IAT test [5] ." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-126", "text": "Similar to the IAT test, this test can measure bias given the sets of attribute and target words." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-127", "text": "However, the IAT test requires human subjects to compute the bias value." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-128", "text": "On the other hand, the WEAT test can compute the bias value using a large text repository, and it does not require human subjects." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-129", "text": "The WEAT test represents attribute and target words as vectors using distributed representation methods such as word2vec and fastText [4, 16] ." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-130", "text": "The WEAT test computes the similarity between words using the cosine similarity." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-131", "text": "Caliskan et al. 
have performed the bias measurement on a large internet crawl text corpus using the WEAT test." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-132", "text": "They have shown that their results correlate with the IAT tests conducted with human subjects." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-133", "text": "We applied the WEAT test on our song lyrics dataset." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-134", "text": "Due to the small size of the popular songs dataset, we cannot apply the WEAT test separately on popular song lyrics. Please refer to Table 1 . Corresponding to the eight rows of the table, we have measured eight biases." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-135", "text": "We borrowed these attribute and target word sets from Caliskan et al. [5] ." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-136", "text": "The first two columns (w2v and FT) correspond to measurements on the song lyrics dataset with word2vec and fastText. Out of all tests, we can see that the effect size of both the FT and CA columns is highest for test 4." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-137", "text": "This bias is about gender stereotypes and career paths." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-138", "text": "This bias can be expressed as follows: males are more associated with career-oriented roles and females are more associated with family-oriented roles." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-139", "text": "This high effect size means that there is greater bias in these target groups." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-140", "text": "Doering and Th\u00e9baud have shown that gender bias does not only disadvantage women; it can also disadvantage men [7] ." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-141", "text": "Their findings showed that when gender stereotypes get attached to a role, it biases the authority that people attribute to the person who happens to work in that position." 
}, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-142", "text": "In a way, these stereotypes harm us all." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-143", "text": "With the dramatic growth in consumption of music [17] , such biases can reinforce the psychological status of these target groups [19] ." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-144", "text": "Hence, it is crucial to address these prevalent biases in songs lyrics." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-145", "text": "----------------------------------" }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-146", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-147", "text": "We have analyzed over half a million lyrics to understand the style and prevalent biases." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-148", "text": "As compared to other songs, we have observed that popular songs have several distinguishing characteristics that can be expressed in terms of the style of lyrics." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-149", "text": "Lyrics can capture human biases quite accurately." }, { "sent_id": "8b223f35a4685d6627d29c907e4742-C001-150", "text": "This work can be extended further by investigating music genre-specific style and biases." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "8b223f35a4685d6627d29c907e4742-C001-51" ], [ "8b223f35a4685d6627d29c907e4742-C001-125" ] ], "cite_sentences": [ "8b223f35a4685d6627d29c907e4742-C001-51", "8b223f35a4685d6627d29c907e4742-C001-125" ] }, "@USE@": { "gold_contexts": [ [ "8b223f35a4685d6627d29c907e4742-C001-51", "8b223f35a4685d6627d29c907e4742-C001-54" ], [ "8b223f35a4685d6627d29c907e4742-C001-135" ] ], "cite_sentences": [ "8b223f35a4685d6627d29c907e4742-C001-51", "8b223f35a4685d6627d29c907e4742-C001-135" ] } } }, "ABC_f36b605a9088532e5f430c86ffb363_45": { "x": [ { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-2", "text": "Both vector space models and graph random walk models can be used to determine similarity between concepts." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-3", "text": "Noting that vectors can be regarded as local views of a graph, we directly compare vector space models and graph random walk models on standard tasks of predicting human similarity ratings, concept categorization, and semantic priming, varying the size of the dataset from which vector space and graph are extracted." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-4", "text": "----------------------------------" }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-5", "text": "**INTRODUCTION**" }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-6", "text": "Vector space models, representing word meanings as points in high-dimensional space, have been used in a variety of semantic relatedness tasks (Sahlgren, 2006; Pad\u00f3 and Lapata, 2007) ." 
}, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-7", "text": "Graphs are another way of representing relations between linguistic entities, and they have been used to capture semantic relatedness by using both corpus-based evidence and the graph structure of WordNet and Wikipedia (Pedersen et al., 2004; Widdows and Dorow, 2002; Minkov and Cohen, 2008) ." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-8", "text": "We study the relationship between vector space models and graph random walk models by embedding vector space models in graphs." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-9", "text": "The flexibility offered by graph random walk models allows us to compare the vector space-based similarity measures to extended notions of relatedness and similarity." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-10", "text": "In particular, a random walk model can be viewed as smoothing direct similarity between two vectors using second-order and even higher-order vectors." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-11", "text": "This view leads to the second focal point of this paper: We investigate whether random walk models can simulate the smoothing effects obtained by methods like Singular Value Decomposition (SVD)." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-12", "text": "To answer this question, we compute models on reduced (downsampled) versions of our dataset and evaluate the robustness of random walk models, a classic vector-based model, and SVD-based models against data sparseness." 
}, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-13", "text": "----------------------------------" }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-14", "text": "**MODEL DEFINITION AND IMPLEMENTATION**" }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-15", "text": "We use directed graphs with weighted edges, G = (V, E, w) where V is a set of nodes, E = V \u00d7 V is a set of edges and w : E \u2192 R is the weighting function on edges." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-16", "text": "For simplicity, we assume that G is fully connected, edges with zero weights can be considered as non-existing in the graph." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-17", "text": "On these graphs, we perform random walks with an initial probability distribution q over the nodes (a 1 \u00d7 |V | vector)." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-18", "text": "We then follow edges with probability proportional to their weights, so that the probability of walking from node v 1 to node v 2 is w(v 1 , v 2 )/ v w(v 1 , v)." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-19", "text": "A fixed length random walk ends after a predetermined number of steps." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-20", "text": "In flexible walks, there is a constant probability \u03b3 of stopping at each step." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-21", "text": "Thus, walk length follows a geometric distribution with parameter \u03b3, the probability of a walk of length k is \u03b3(1\u2212\u03b3) k\u22121 and the expected walk length is 1/\u03b3." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-22", "text": "For example, a flexible walk with \u03b3 = 1/2 will produce 1-step, 2-step, and higher-step walks while the expected average length is 2." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-23", "text": "Relating vectors and graphs." 
}, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-24", "text": "Corpus cooccurrence (e 1 , e 2 , a 12 ) of two entities e 1 and e 2 that co-occur with (potentially transformed) count a 12 can be represented in either a vector or a graph." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-25", "text": "In a vector, it corresponds to a dimension value of a 12 for the dimension e 2 of entity e 1 ." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-26", "text": "In a graph, it corresponds to two nodes labeled e 1 and e 2 connected by an edge with weight a 12 ." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-27", "text": "Similarity measures." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-28", "text": "Let R(q) = p denote a specific random walk process which transforms an initial probability distribution q to a final probability distribution p over the nodes." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-29", "text": "We write q(m) for the probability assigned to the node m under q. If the initial distribution q concentrates all probability on a single node n, i.e., q(n) = 1 and q(x) = 0 for all nodes x = n, we write pr(n \u2192 m) for the probability p(m) of ending up at node m." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-30", "text": "The simplest way of measuring relatedness through random walks is to consider the probability p(m) of a single node m as an endpoint for a walk starting with start probability distribution q, that is, p = R(q)." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-31", "text": "We call this a direct, onedirection measure of relatedness between q and m. Direct, one-direction measures are typically asymmetric." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-32", "text": "In case all start probability is concentrated on a single node n, we can also consider direct, two-direction measures, which will be a combination of pr(m \u2192 n) and pr(n \u2192 m)." 
}, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-33", "text": "The point of using two-direction measures is that these can be made symmetric, which is an advantage when we are modeling undirected semantic similarity." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-34", "text": "In the experiments below we focus on the average of the two probabilities." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-35", "text": "In addition to direct measures, we will use indirect measures, in which we compute the relatedness of endpoint probability distributions p 1 = R(q 1 ) and p 2 = R(q 2 )." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-36", "text": "As endpoint distributions can be viewed both as probability distributions and as vectors, we used three indirect measures: 1) Jensen/Shannon divergence, a symmetric variant of the Kullback/Leibler divergence between probability distributions." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-37", "text": "2) cosine similarity, and 3) dot product." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-38", "text": "Dot product is a natural choice in a graph setting because we can view it as the probability of a pair of walks, one starting at a node determined by q 1 and the other starting at a node governed by q 2 , ending at the same node." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-39", "text": "Discussion. Direct and indirect relatedness measures together with variation in walk length give us a simple, powerful and flexible way to capture different kinds of similarity (with traditional vectorbased approach as a special case)." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-40", "text": "Longer walks or flexible walks will capture higher order effects that may help coping with data sparseness, similar to the use of second-order vectors." 
}, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-41", "text": "Dimensionality reduction techniques like Singular Value Decomposition (SVD) also capture these higher-order effects, and it has been argued that that makes them more resistant against sparseness (Sch\u00fctze, 1997) ." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-42", "text": "To our knowledge, no systematic comparison of SVD and classical vector-based methods has been done on different corpus sizes." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-43", "text": "In our experiments, we will compare the performance of SVD and flexible-walk smoothing at different corpus sizes and for a variety of tasks." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-44", "text": "Implementation: We extract tuples from the 2-billion word ukWaC corpus, 1 dependency-parsed with MINIPAR." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-66", "text": "Both models' performances are comparable with the previously reported studies, and above that of random walks." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-45", "text": "2 Following Pad\u00f3 and Lapata (2007) , we only consider co-occurrences where two target words are connected by certain dependency paths, namely: the top 30 most frequent preposition-mediated noun-to-noun paths (soldier+with+gun), the top 50 transitive-verbmediated noun-to-noun paths (soldier+use+gun), the top 30 direct or preposition-mediated verbnoun paths (kill+obj+victim, kill+in+school), and the modifying and predicative adjective-to-noun paths." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-46", "text": "Pairs (w 1 , w 2 ) that account for 0.01% or less of the marginal frequency of w 1 were trimmed." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-47", "text": "The resulting tuple list, with raw counts converted to mutual information scores, contains about 25 million tuples." 
}, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-48", "text": "To test how well graph-based and alternative methods \"scale down\" to smaller corpora, we sampled random subsets of tuples corresponding to 0.1%, 1%, 10%, and 100% of the full list." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-49", "text": "To put things into perspective, the full list was extracted from a corpus of about 2 billion words; so, the 10% list is on the order of magnitude of the BNC, and the 0.1% list is on the order of magnitude of the Brown corpus." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-50", "text": "From each of the 4 resulting datasets, we built one graph and two vector space models: one space with full dimensionality, and one space reduced to 300 dimensions using singular value decomposition." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-51", "text": "----------------------------------" }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-52", "text": "**EXPERIMENTS**" }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-53", "text": "First, we report the results for all tasks obtained on the full data-set and then proceed with the comparison of different models on differing graph sizes to see the robustness of the models against data sparseness." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-54", "text": "Human similarity ratings: We use the dataset of Rubenstein and Goodenough (1965) , consisting of averages of subject similarity ratings for 65 noun pairs." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-55", "text": "We use the Pearson's coefficient between estimates and human judgments as our performance measure." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-56", "text": "The results obtained for Table 2 : Each cell contains the ratio of the performance of the corresponding model for the corresponding downsampling ratio to the performance of the same model on the full graph." 
}, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-57", "text": "The higher ratio means the less deterioration due to data sparseness." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-58", "text": "the full graph are in Table 1 , line 1." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-59", "text": "The SVD model clearly outperforms the pure-vector based approach and the graph-based approaches." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-60", "text": "Its performance is above that of previous models trained on the same corpus (Baroni and Lenci, 2009 )." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-61", "text": "The best model that we report is based on web search engine results (Chen et al., 2006) ." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-62", "text": "Among the graph-based random walk models, flexible walk with parameter 0.5 and fixed 1-step walk with indirect relatedness measures using dot product similarity achieve the highest performance." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-63", "text": "Concept categorization: Almuhareb (2006) proposed a set of 402 nouns to be categorized into 21 classes of both concrete (animals, fruit. . . ) and abstract (feelings, times. . . ) concepts." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-64", "text": "Our results on this clustering task are given in Table 1 (line 2) ." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-65", "text": "The difference between SVD and pure-vector models is negligible and they both obtain the best performance in terms of both cluster entropy (not shown in the table) and purity." 
}, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-67", "text": "Semantic priming: The next dataset comes from Hodgson (1991) and it is of interest since it requires capturing different forms of semantic relatedness between prime-target pairs: synonyms (synonym), coordinates (coord), antonyms (antonym), free association pairs (conass), superand subordinate pairs (supersub) and phrasal associates (phrasacc)." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-68", "text": "Following previous simulations of this data-set (Pad\u00f3 and Lapata, 2007) , we measure the similarity of each related target-prime pair, and we compare it to the average similarity of the target to all the other primes instantiating the same relation, treating the latter quantity as our surrogate of an unrelated target-prime pair." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-69", "text": "We report results in terms of differences between unrelated and related pairs, normalized to t-scores, marking significance according to twotailed paired t-tests for the relevant degrees of freedom." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-70", "text": "Even though the SVD-based and pure-vector models are among the top achievers in general, we see that in different tasks different random walk models achieve comparable or even better performances." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-71", "text": "In particular, for phrasal associates and conceptual associates, the best results are obtained by random walks based on direct measures." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-72", "text": "----------------------------------" }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-73", "text": "**ROBUSTNESS AGAINST DATA SPARSENESS**" }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-74", "text": "So far, we reported only the results obtained on the full graph." 
}, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-75", "text": "However, in order to see the response of the models to using smaller corpora we ran another set of experiments on artificially down-sampled graphs as explained above." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-76", "text": "In this case, we are not interested in the absolute performance of the models per se but the relative performance." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-77", "text": "Thus, for ease of comparison we fixed each model's performance on the full graph to 1 for each task and linearly scaled its performance on smaller graphs." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-78", "text": "For example saying that the SVD-based model achieves a score of 0.911 on 10% graph for the Rubenstein and Goodenough dataset means that the ratio of the performance of SVD-based model on 10% graph to the performance of the same model on the full graph is 0.911." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-79", "text": "The results are given in Table 2 , where the only random walk model we report is dot 2, i.e., a 2-step random walk coupled with the dot productbased indirect measure." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-80", "text": "This is by far the random walk model most robust to downsampling." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-81", "text": "In the 10% graph, we see that on all tasks but one, dot 2 is the model least affected by the data reduction." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-82", "text": "On the contrary, down-sampling has a positive effect on this model because on 6 tasks, it actually performs better than it does on the full graph!" 
}, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-83", "text": "The same behavior is also observed on the 1% graph -as an example, for phrasal associates relations, dot 2 performance increases by a factor of around 1.2 when we use one hundredth of the graph instead of the full one." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-84", "text": "For the smallest graph we used, 0.1%, still dot 2 provides the highest relative performance in 5 out of the 8 tasks." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-85", "text": "----------------------------------" }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-86", "text": "**CONCLUSION**" }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-87", "text": "We compared graph-based random walk models and vector models." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-88", "text": "For this purpose, we showed how corpus co-occurrences could be represented both as a graph and a vector, and we identified two different ways to calculate relatedness based on the outcomes of random walks, by direct and indirect measures." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-89", "text": "The experiments carried out on 8 different tasks by using the full graph revealed that SVD-based model performs very well across all types of semantic relatedness." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-90", "text": "However, there is also evidence that -depending on the particular relation-some random walk models can achieve results as good as or even better than those of SVD-based models." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-91", "text": "Our second question was whether the random walk models would be able to simulate the smoothing effects obtained by SVD." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-92", "text": "While answering this question, we also carried out a systematic comparison of plain and SVD-based models on different tasks with different sizes of data." 
}, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-93", "text": "One interesting result is that an SVD-based model is not necessarily more robust to data sparseness than the plain vector model." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-94", "text": "The more interesting result is that a 2-step random walk model, based on indirect measures with dot product, consistently outperforms both SVDbased and plain vector models in terms of relative performance, thus it is able to achieve comparable results on very small datasets." }, { "sent_id": "f36b605a9088532e5f430c86ffb363-C001-95", "text": "Actually, the improvement on absolute performance measures of this random walk by making the dataset even smaller calls for future research." } ], "y": { "@BACK@": { "gold_contexts": [ [ "f36b605a9088532e5f430c86ffb363-C001-6" ] ], "cite_sentences": [ "f36b605a9088532e5f430c86ffb363-C001-6" ] }, "@USE@": { "gold_contexts": [ [ "f36b605a9088532e5f430c86ffb363-C001-6", "f36b605a9088532e5f430c86ffb363-C001-8" ], [ "f36b605a9088532e5f430c86ffb363-C001-45" ], [ "f36b605a9088532e5f430c86ffb363-C001-68" ] ], "cite_sentences": [ "f36b605a9088532e5f430c86ffb363-C001-6", "f36b605a9088532e5f430c86ffb363-C001-45", "f36b605a9088532e5f430c86ffb363-C001-68" ] } } }, "ABC_cbfa4d71f40d8008ebd90026dc1bcd_45": { "x": [ { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-2", "text": "The paper steps outside the comfort-zone of the traditional NLP tasks like automatic speech recognition (ASR) and machine translation (MT) to addresses two novel problems arising in the automated multilingual news monitoring: segmentation of the TV and radio program ASR transcripts into individual stories, and clustering of the individual stories coming from various sources and languages into storylines." 
}, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-3", "text": "Storyline clustering of stories covering the same events is an essential task for inquisitorial media monitoring." }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-4", "text": "We address these two problems jointly by engaging the low-dimensional semantic representation capabilities of the sequence to sequence neural translation models." }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-5", "text": "To enable joint multi-task learning for multilingual neural translation of morphologically rich languages we replace the attention mechanism with the sliding-window mechanism and operate the sequence to sequence neural translation model on the character-level rather than on the word-level." }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-6", "text": "The story segmentation and storyline clustering problem is tackled by examining the low-dimensional vectors produced as a side-product of the neural translation process." }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-7", "text": "The results of this paper describe a novel approach to the automatic story segmentation and storyline clustering problem." }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-8", "text": "----------------------------------" }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-9", "text": "**THE SUMMA PROJECT OVERVIEW**" }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-10", "text": "Media monitoring enables the global news media to be viewed in terms of emerging trends, people in the news, and the evolution of storylines (Risen et al., 2013) ." }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-11", "text": "The massive growth in the number of broadcast and Internet media channels requires innovative ways to cope with this increasing amount of data." 
}, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-12", "text": "It is the aim of SUMMA 1 project to significantly improve media monitoring by creating a platform to automate the analysis of media streams across many languages." }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-13", "text": "Within SUMMA project three European news broadcasters BBC, Deutche Welle, and Latvian news agency LETA are joining the forces with the University of Edinburgh, University College London, Swiss IDIAP Research Institute, Qatar Computing Research Institute, and Priberam Labs from Portugal to adapt the emerging big data neural deep learning NLP techniques to the needs of the international news monitoring industry." }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-14", "text": "BBC Monitoring undertakes one of the most advanced, comprehensive, and large scale media monitoring operations world-wide, providing news and information from media sources around the world." }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-15", "text": "BBC monitoring journalists and analysts translate from over 30 languages into English, and follow approximately 13,500 sources, of which 1,500 are television broadcasters, 1,300 are radio, 3,700 are key news portals world-wide, 20 are commercial news feeds, and the rest are RSS feeds and selected Social Media sources." }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-16", "text": "Monitoring journalists follow important stories and flag breaking news events as part of the routine monitoring." 
}, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-17", "text": "1 SUMMA (Scalable Understanding of Multilingual MediA) is project 688139 funded by the European Union H2020- The central idea behind SUMMA is to develop a scalable multilingual media monitoring platform ( Fig.1 ) that combines the real-time media stream processing (speech recognition, machine translation, story clustering) with indepth batch-oriented construction of a rich knowledge base of reported events and entities mentioned, enabling extractive summarization of the storylines in the news." }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-18", "text": "In this paper we focus only on the streaming shallow processing part of the SUMMA project (the dark block in Fig.1 ), where the recently developed neural machine translation techniques (Sutskerev, Vinyals & Le, 2014; Bahdanau, Cho & Bengio, 2014) enable radically new end-to-end approach to machine translation and clustering of the incoming news stories." }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-19", "text": "The approach is informed by our previous work on machine learning (Barzdins, Paikens, Gosko, 2013) , media monitoring (Barzdins et al.,2014) , and character-level neural translation (Barzdins & Gosko, 2016) ." }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-20", "text": "The key difference of the SUMMA project is that it has been incepted after the recent paradigm-shift (Manning, 2015) in the NLP community towards neural network inspired deep learning techniques such as end-to-end automatic speech recognition (Graves & Jaitly, 2014; Hannun et al., 2014; Amodei, 2015) , end-to-end machinetranslation (Sutskerev, Vinyals & Le, 2014; Bahdanau, Cho & Bengio, 2014; Luong et al., 2015) , efficient distributed vectorspace word embeddings (Mikolov et al., 2013) , image and video captioning Venugopalan et al., 2015) , unsupervised learning of document representations by autoencoders (Li, Luong & Jurafsky, 2015) ." 
}, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-21", "text": "These recent deep learning breakthroughs along with massively parallel GPU computing allow addressing the media monitoring tasks in the completely new end-toend manner rather than relying on the legacy NLP pipelines." }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-22", "text": "The novelty of the SUMMA project approach is that all languages covered by the project (Table 1) can be embedded in the same vectorspace by means of joint multitask learning (Collobert et al., 2011; Dong et al., 2015; Pham, Luong & Manning, 2015) of eight LSTM-RNN translational autoencoders with hidden layer parameters shared as illustrated in Fig.2 ." }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-23", "text": "Sharing the same vectorspace for sentences in all project languages enables accurate multilingual news story clustering without resorting to the clustering of the less accurate target (English) language machine translations." }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-24", "text": "This shared vectorspace approach extends also to the unsupervised multi-task learning of language models from the large monolingual corpora (Fig. 3) , which is crucial for low-resourced languages: having a generic language model learned in parallel from the monolingual corpora reduces (Dai & Le, 2015) the need for large supervised parallel corpora to achieve the same translational accuracy for the Fig. 2 setup." }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-25", "text": "The joint training of seventeen translational and samelanguage autoencoders with shared parameters ( Fig. 2 and Fig. 3 together) to our knowledge has not been attempted so far." 
}, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-26", "text": "Even training a single state-of-the-art sentence-level translational autoencoder requires days of GPU computing (Barzdins & Gosko, 2016) in the TensorFlow (Abadi et al., 2015) seq2seq model (Sutskever, Vinyals & Le, 2014; Bahdanau, Cho & Bengio, 2014)." }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-27", "text": "To avoid the complexities of asynchronous parallel training with a shared parameter server (Dean et al., 2012), the architecture in Fig. 2 and Fig. 3 can instead be trained using the alternating training approach proposed in (Luong et al., 2016), where each task is optimized for a fixed number of parameter updates (or mini-batches) before switching to the next task (which is a different language pair)." }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-28", "text": "Although such an alternating approach prolongs the training process, it is preferred for reasons of simplicity and robustness." }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-29", "text": "Once produced within the SUMMA project, these translational autoencoders with a shared vectorspace will be a unique language resource, likely of interest also to the wider NLP community for multilingual applications outside the media monitoring domain."
}, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-30", "text": "----------------------------------" }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-31", "text": "**MULTILINGUAL NEURAL TRANSLATION**" }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-32", "text": "----------------------------------" }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-33", "text": "**CHARACTER-LEVEL NEURAL TRANSLATION FOR STREAMS**" }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-34", "text": "The neural translation attention mechanism (Bahdanau, Cho & Bengio, 2014) has been shown to be highly beneficial for bi-lingual neural translation of long sentences, but it is not compatible with the multi-task multilingual translation models (Dong et al., 2015; Luong et al., 2016) described in the previous Section or with the character-level translation models (Barzdins & Gosko, 2016) described in this Section." }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-35", "text": "For these reasons we replace the neural translation attention mechanism with the much simpler sliding-window translation (Barzdins & Gosko, 2016; Karpathy, 2015; Jozefowicz, 2016)." }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-36", "text": "Moving from word-level to character-level neural translation makes it even harder to cope with long sentences, presenting an additional reason to employ the sliding-window translation approach." }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-37", "text": "Table 2 illustrates the character-level neural translation from English to Latvian using a modified 2 TensorFlow (Abadi et al., 2015) seq2seq (Sutskever, Vinyals & Le, 2014) neural translation model." }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-38", "text": "Character-level neural translation is enabled by forcing the tokenizer to treat each input symbol as a separate \"word\", leading to a small and fixed \"vocabulary\" containing only the 90 most frequently encountered characters."
}, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-39", "text": "Another necessary change to the TensorFlow default seq2seq settings is disabling the attention mechanism (Bahdanau, Cho & Bengio, 2014), which is known to interfere with character-level translation (Barzdins & Gosko, 2016) because there are no mappings between the characters of the translated words." }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-40", "text": "The small vocabulary of 90 \"words\" also automatically disables the sampled softmax functionality of seq2seq, improving the overall performance." }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-41", "text": "Finally, we configure a single bucket of size 100 characters, which is the maximum translation window size." }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-42", "text": "Other hyperparameters used are: 1 LSTM layer of size 400, batch size 16. 2 https://github.com/didzis/tensorflowAMR" }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-43", "text": "Training is performed on the Europarl v7 EN-LV corpus 3 for 24h on a TitanX GPU." }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-44", "text": "The sliding-window mechanism is used only during decoding (translation), mapping a fragment of 6 English words into 5 Latvian words (Latvian translations typically contain fewer words than the English source; rich morphology substitutes for most prepositions and articles)." }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-45", "text": "The multiple sliding-window translations produced are later merged into the final translation, which consists only of words appearing at least twice in the neighboring sliding-window columns (word suffixes are ignored if the initial 6 characters of the words match; this reduces word drop due to inflection errors)."
}, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-46", "text": "The final translation in Table 2 (bottom row) is close to the manual verbatim translation (top row) and conveys the topic of the original sentence." }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-47", "text": "Moreover, the sliding-window translations are surprisingly fluent Latvian phrases with correct word forms and mostly correct coordination." }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-48", "text": "The only non-Latvian \"words\" fabricated by the character-level translation in Table 2 are \"transpastiv\u0113\u0161ana\", \"transpitts\" and \"transpirma\"; they are apparently triggered by the English verb \"transits\", because in Latvian \"tranz\u012bts\" is used only as a noun without a close substitute verb." }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-49", "text": "The sliding-window translation method obviously cannot handle long-range dependencies well and occasionally drops or inserts words in the translation; therefore SUMMA also provides a state-of-the-art translation service in parallel to the one described here." }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-50", "text": "Meanwhile, the sliding-window character-based translation method has unique advantages relevant to the scope of the SUMMA project, discussed in the next Section." }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-51", "text": "----------------------------------" }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-52", "text": "**POTENTIAL APPLICATIONS OF THE MULTILINGUAL CHARACTER-LEVEL STREAM TRANSLATION**" }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-53", "text": "Having a shared-vectorspace multilingual translation system (Fig. 2 and Fig. 3) able to operate on unsegmented streams of text has a number of novel applications."
}, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-54", "text": "The most straightforward novel application is the possibility to embed the documents of all project languages into the same shared semantic vectorspace and compute document semantic similarity (Hill, Cho & Korhonen, 2016) irrespective of the document language." }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-55", "text": "The sliding-window translation approach allows viewing the document as a sequence (trace) of vectors corresponding to every sliding-window step while translating the document." }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-56", "text": "These vectors are similar to word embedding vectors, but are likely to be semantically richer, as they would mostly distinguish word senses in the context of the window." }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-57", "text": "Such vector-traces corresponding to the documents can be compared in a bag-of-words fashion by measuring the cosine distance between the sums of document trace-vectors (as part of k-means clustering or nearest-neighbor search)." }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-58", "text": "This can be used as a building block for multilingual semantic clustering of stories into storylines, or for the semantic search of documents in any language which are similar to a given document." }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-59", "text": "Another novel application of character-level neural translation is stream segmentation into individual stories, a difficult task for news ingested from audio or video sources and transcribed with ASR, thus lacking any explicit sentence or story segmentation information."
}, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-60", "text": "For stream segmentation into stories it is possible to utilize the exceptional generalization and memorization capacity of neural networks, which is already applied in neural dialogue systems such as Gmail Smart Reply (Corrado, 2015; Vinyals & Le, 2015)." }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-61", "text": "Table 3 illustrates how a mere 400 LSTM cells of our single-layer 90-character neural translator have been able to generalize and memorize rather correct translations for the first 100 characters of the entire Europarl v7 EN-LV training corpus containing 600,000 sentence pairs." }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-62", "text": "For story segmentation, a sliding-window neural translation system can be incrementally trained to monolingually \"translate\" the current 5 words of the stream into the next 5 words of the stream (predicting the next 5 words from the previous 5 words), based on the actual news streams encountered." }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-63", "text": "Such a system should be able to predict the next 5 words within a news story reasonably well, but will fail to do so when there is a transition from one story to the next." }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-64", "text": "Along with additional auxiliary information, such as time-codes (when exactly the phrase was spoken and pauses in the speech) and speaker identification for each phrase, this should provide a rather reliable segmentation signal." }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-65", "text": "Table 3: Europarl v7 training corpus fragment (only the first 100 characters of each sentence were used for training) and the character-level neural translation output illustrating the memorization of the training corpus."
}, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-66", "text": "----------------------------------" }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-67", "text": "**CONCLUSIONS**" }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-68", "text": "It is still an open issue which vectorspace projections yield the semantically best clusters (Hill, Cho & Korhonen, 2016), and further experiments are needed." }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-69", "text": "Particularly for storyline (Rissen et al., 2013) clustering, the signal for stories belonging to the same storyline might not be so much the semantic similarity of the articles (they might report various developments of the storyline from differing viewpoints), but rather the matching time and location, as well as the same organizations and people being involved: the information typically supplied by Named Entity Linking (NEL) tools." }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-70", "text": "The tradeoffs between semantic clustering quality and computational complexity are likely to be crucial." }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-71", "text": "Once trained, the run-time use of the multilingual translation modules for translation and news story clustering takes around 1 sec on a TitanX GPU per average news story." }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-72", "text": "This is an order of magnitude slower than regular NEL or TF-IDF bag-of-words based clustering methods." }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-73", "text": "Establishing reliable storyline clustering benchmarking data sets and metrics is one of the goals of the SUMMA project, as good storyline clusters are the prerequisite for downstream storyline summarization, visualization, and predictive anticipation of upcoming developments."
}, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-74", "text": "----------------------------------" }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-75", "text": "**ACKNOWLEDGMENTS**" }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-76", "text": "This work was supported in part by the H2020 SUMMA project under grant agreement 688139/H2020-ICT-2015 and in part by the Latvian National research program SOPHIS under grant agreement Nr.10-4/VPP-4/11." }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-77", "text": "----------------------------------" }, { "sent_id": "cbfa4d71f40d8008ebd90026dc1bcd-C001-78", "text": "**ENGLISH (ORIGINAL) | LATVIAN (MANUAL TRANSLATION) | LATVIAN (CHARACTER-LEVEL NEURAL TRANSLATION)**" } ], "y": { "@USE@": { "gold_contexts": [ [ "cbfa4d71f40d8008ebd90026dc1bcd-C001-18" ], [ "cbfa4d71f40d8008ebd90026dc1bcd-C001-37" ] ], "cite_sentences": [ "cbfa4d71f40d8008ebd90026dc1bcd-C001-18", "cbfa4d71f40d8008ebd90026dc1bcd-C001-37" ] }, "@BACK@": { "gold_contexts": [ [ "cbfa4d71f40d8008ebd90026dc1bcd-C001-18" ], [ "cbfa4d71f40d8008ebd90026dc1bcd-C001-20" ], [ "cbfa4d71f40d8008ebd90026dc1bcd-C001-26" ] ], "cite_sentences": [ "cbfa4d71f40d8008ebd90026dc1bcd-C001-18", "cbfa4d71f40d8008ebd90026dc1bcd-C001-20", "cbfa4d71f40d8008ebd90026dc1bcd-C001-26" ] } } }, "ABC_c42e9d10ca8876af80eee021c969d7_45": { "x": [ { "sent_id": "c42e9d10ca8876af80eee021c969d7-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "c42e9d10ca8876af80eee021c969d7-C001-2", "text": "Automatically detecting dialogue structure within corpora of human-human dialogue is the subject of increasing attention." }, { "sent_id": "c42e9d10ca8876af80eee021c969d7-C001-3", "text": "In the domain of tutorial dialogue, automatic discovery of dialogue structure is of particular interest because these structures inherently represent tutorial strategies or modes, the study of which is key to the design of intelligent tutoring systems that communicate with learners through natural language."
}, { "sent_id": "c42e9d10ca8876af80eee021c969d7-C001-4", "text": "We propose a methodology in which a corpus of human-human tutorial dialogue is first manually annotated with dialogue acts." }, { "sent_id": "c42e9d10ca8876af80eee021c969d7-C001-5", "text": "Dependent adjacency pairs of these acts are then identified through \u03c7\u00b2 analysis, and hidden Markov modeling is applied to the observed sequences to induce a descriptive model of the dialogue structure." }, { "sent_id": "c42e9d10ca8876af80eee021c969d7-C001-6", "text": "----------------------------------" }, { "sent_id": "c42e9d10ca8876af80eee021c969d7-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "c42e9d10ca8876af80eee021c969d7-C001-8", "text": "Automatically learning dialogue structure from corpora is an active area of research driven by a recognition of the value offered by data-driven approaches (e.g., Bangalore et al., 2006)." }, { "sent_id": "c42e9d10ca8876af80eee021c969d7-C001-9", "text": "Dialogue structure information is of particular importance when the interaction is centered around a learning task, such as in natural language tutoring, because techniques that support empirical identification of dialogue strategies can inform not only the design of intelligent tutoring systems (Forbes-Riley et al., 2007), but also contribute to our understanding of the cognitive and affective processes involved in learning through tutoring (VanLehn et al., 2007)." }, { "sent_id": "c42e9d10ca8876af80eee021c969d7-C001-10", "text": "Although traditional top-down approaches (e.g., Cade et al., 2008) and some empirical work on analyzing the structure of tutorial dialogue (Forbes-Riley et al., 2007) have yielded significant results, the field is limited by the lack of an automatic, data-driven approach to identifying dialogue structure."
}, { "sent_id": "c42e9d10ca8876af80eee021c969d7-C001-11", "text": "An empirical approach to identifying tutorial dialogue strategies, or modes, could address this limitation by providing a mechanism for describing in succinct probabilistic terms the tutorial strategies that actually occur in a corpus." }, { "sent_id": "c42e9d10ca8876af80eee021c969d7-C001-12", "text": "Just as early work on dialogue act interpretation utilized hidden Markov models (HMMs) to capture linguistic structure (Stolcke et al., 2000), we propose a system that uses HMMs to capture the structure of tutorial dialogue implicit within sequences of already-tagged dialogue acts." }, { "sent_id": "c42e9d10ca8876af80eee021c969d7-C001-13", "text": "This approach operates on the premise that at any given point in the tutorial dialogue, the collaborative interaction is in a dialogue mode that characterizes the nature of the exchanges between tutor and student." }, { "sent_id": "c42e9d10ca8876af80eee021c969d7-C001-14", "text": "In our model, a dialogue mode is defined by a probability distribution over the observed symbols (e.g., dialogue acts and adjacency pairs)." }, { "sent_id": "c42e9d10ca8876af80eee021c969d7-C001-15", "text": "Our previous work has noted some limitations of first-order HMMs as applied to sequences of individual dialogue acts (Boyer et al., in press)." }, { "sent_id": "c42e9d10ca8876af80eee021c969d7-C001-16", "text": "Chief among these is that HMMs allow arbitrarily frequent transitions between hidden states, which does not conform well to human intuition about how tutoring strategies are applied." }, { "sent_id": "c42e9d10ca8876af80eee021c969d7-C001-17", "text": "Training an HMM on a sequence of adjacency pairs rather than individual dialogue acts is one way to generate a more descriptive model without increasing model complexity more than is required to accommodate the expanded set of observation symbols."
}, { "sent_id": "c42e9d10ca8876af80eee021c969d7-C001-18", "text": "To this end, we apply the approach of Midgley et al. (2006) for empirically identifying significant adjacency pairs within dialogue, and proceed by treating adjacency pairs as atomic units for the purposes of training the HMM." }, { "sent_id": "c42e9d10ca8876af80eee021c969d7-C001-19", "text": "----------------------------------" }, { "sent_id": "c42e9d10ca8876af80eee021c969d7-C001-20", "text": "**CORPUS ANALYSIS**" }, { "sent_id": "c42e9d10ca8876af80eee021c969d7-C001-21", "text": "This analysis uses a corpus of human-human tutorial dialogue collected in the domain of introductory computer science." }, { "sent_id": "c42e9d10ca8876af80eee021c969d7-C001-22", "text": "Forty-three learners interacted remotely with a tutor through a keyboard-to-keyboard remote learning environment, yielding 4,864 dialogue moves." }, { "sent_id": "c42e9d10ca8876af80eee021c969d7-C001-23", "text": "The tutoring corpus was manually tagged with dialogue acts designed to capture the salient characteristics of the tutoring process (Table 1)." }, { "sent_id": "c42e9d10ca8876af80eee021c969d7-C001-24", "text": "The correspondence between utterances and dialogue act tags is one-to-one." }, { "sent_id": "c42e9d10ca8876af80eee021c969d7-C001-25", "text": "Compound utterances (i.e., a single utterance comprising more than one dialogue act) were split by the primary annotator prior to the inter-rater reliability study." }, { "sent_id": "c42e9d10ca8876af80eee021c969d7-C001-26", "text": "----------------------------------" }, { "sent_id": "c42e9d10ca8876af80eee021c969d7-C001-27", "text": "**1**" }, { "sent_id": "c42e9d10ca8876af80eee021c969d7-C001-28", "text": "The importance of adjacency pairs is well-established in natural language dialogue (e.g., Schegloff & Sacks, 1973), and adjacency pair analysis has illuminated important phenomena in tutoring as well (Forbes-Riley et al., 2007)."
}, { "sent_id": "c42e9d10ca8876af80eee021c969d7-C001-29", "text": "For the current corpus, bigram analysis of dialogue acts yielded a set of commonly-occurring pairs." }, { "sent_id": "c42e9d10ca8876af80eee021c969d7-C001-30", "text": "However, as noted by Midgley et al. (2006), in order to establish that two dialogue acts are truly related as an adjacency pair, it is important to determine whether the presence of the first member of the pair is associated with a significantly higher probability of the second member occurring." }, { "sent_id": "c42e9d10ca8876af80eee021c969d7-C001-31", "text": "For this analysis we utilize a \u03c7\u00b2 test for independence of the categorical variables act_i and act_i+1 for all two-way combinations of dialogue act tags." }, { "sent_id": "c42e9d10ca8876af80eee021c969d7-C001-32", "text": "Only pairs in which speaker(act_i) \u2260 speaker(act_i+1) were considered." }, { "sent_id": "c42e9d10ca8876af80eee021c969d7-C001-33", "text": "Other dialogue acts were treated as atomic elements in subsequent analysis, as discussed in Section 3." }, { "sent_id": "c42e9d10ca8876af80eee021c969d7-C001-34", "text": "Table 2 displays a list of the dependent pairs sorted by descending (unadjusted) statistical significance; the subscript indicates tutor (t) or student (s)." }, { "sent_id": "c42e9d10ca8876af80eee021c969d7-C001-35", "text": "act_i act_i+1" }, { "sent_id": "c42e9d10ca8876af80eee021c969d7-C001-36", "text": "----------------------------------" }, { "sent_id": "c42e9d10ca8876af80eee021c969d7-C001-37", "text": "**HMM ON ADJACENCY PAIR SEQUENCES**" }, { "sent_id": "c42e9d10ca8876af80eee021c969d7-C001-38", "text": "The keyboard-to-keyboard tutorial interaction resulted in a sequence of utterances that were annotated with dialogue acts." }, { "sent_id": "c42e9d10ca8876af80eee021c969d7-C001-39", "text": "We have hypothesized that a higher-level dialogue structure, namely the tutorial dialogue mode, overlays the observed dialogue acts."
}, { "sent_id": "c42e9d10ca8876af80eee021c969d7-C001-40", "text": "To build an HMM model of this structure, we treat dialogue mode as a hidden variable and train a hidden Markov model to induce the dialogue modes and their associated dialogue act emission probability distributions." }, { "sent_id": "c42e9d10ca8876af80eee021c969d7-C001-41", "text": "An adjacency pair joining algorithm (Figure 1) was applied to each sequence of dialogue acts." }, { "sent_id": "c42e9d10ca8876af80eee021c969d7-C001-42", "text": "This algorithm joins pairs of dialogue acts into atomic units according to a priority determined by the strength of the adjacency pair dependency." }, { "sent_id": "c42e9d10ca8876af80eee021c969d7-C001-43", "text": "The final set of observed symbols consists of 39 tags: 23 adjacency pairs (Table 2) plus all individual dialogue acts augmented with a tag for the speaker (Table 1)." }, { "sent_id": "c42e9d10ca8876af80eee021c969d7-C001-44", "text": "It was desirable to learn n, the best number of hidden states, during modeling rather than specifying this value a priori." }, { "sent_id": "c42e9d10ca8876af80eee021c969d7-C001-45", "text": "To this end, we trained and ten-fold cross-validated seven models (each featuring randomly-initialized parameters) for each number of hidden states n from 2 to 15, inclusive." }, { "sent_id": "c42e9d10ca8876af80eee021c969d7-C001-46", "text": "----------------------------------" }, { "sent_id": "c42e9d10ca8876af80eee021c969d7-C001-47", "text": "**2**" }, { "sent_id": "c42e9d10ca8876af80eee021c969d7-C001-48", "text": "The average log-likelihood was computed across all seven models for each n. 2 n=15 was chosen as an initial maximum number of states because it comfortably exceeded our hypothesized range of 3 to 7 (informed by the tutoring literature)."
}, { "sent_id": "c42e9d10ca8876af80eee021c969d7-C001-49", "text": "The Akaike Information Criterion measure steadily worsened above n = 5, confirming no need to train models with n > 15." }, { "sent_id": "c42e9d10ca8876af80eee021c969d7-C001-50", "text": "This average log-likelihood l_n was used to compute the Akaike Information Criterion, a maximum-penalized likelihood estimator that penalizes more complex models (Scott, 2002)." }, { "sent_id": "c42e9d10ca8876af80eee021c969d7-C001-51", "text": "The best fit was obtained with n=4 (Figure 3)." }, { "sent_id": "c42e9d10ca8876af80eee021c969d7-C001-52", "text": "The transition probability distribution among hidden states is depicted in Figure 4, with the size of the nodes indicating the relative frequency of each hidden state; specifically, State 0 accounts for 63% of the corpus, States 1 and 3 account for approximately 15% each, and State 2 accounts for 7%." }, { "sent_id": "c42e9d10ca8876af80eee021c969d7-C001-53", "text": "This exploratory application of hidden Markov models involves training an HMM on a mixed input sequence consisting of both individual dialogue acts and adjacency pairs." }, { "sent_id": "c42e9d10ca8876af80eee021c969d7-C001-54", "text": "The best-fit HMM consists of four hidden states whose emission symbol probability distributions lend themselves to interpretation as tutorial dialogue modes." }, { "sent_id": "c42e9d10ca8876af80eee021c969d7-C001-55", "text": "For example, State 0 consists primarily of tutor statements and positive feedback, two of the most common dialogue acts in our corpus." }, { "sent_id": "c42e9d10ca8876af80eee021c969d7-C001-56", "text": "The transition probabilities also reveal that State 0 is highly stable; a self-transition is most likely, with probability 0.835." }, { "sent_id": "c42e9d10ca8876af80eee021c969d7-C001-57", "text": "State 3 is an interactive state featuring student reflection in the form of questions, statements, and requests for feedback."
}, { "sent_id": "c42e9d10ca8876af80eee021c969d7-C001-58", "text": "The transition probabilities show that nearly 60% of the time the dialogue transitions from State 3 to State 0; this may indicate that after establishing what the student does or does not know in State 3, the tutoring switches to a less collaborative \"teaching\" mode represented by State 0." }, { "sent_id": "c42e9d10ca8876af80eee021c969d7-C001-59", "text": "Future evaluation of the HMM presented here will include comparison with other types of graphical models." }, { "sent_id": "c42e9d10ca8876af80eee021c969d7-C001-60", "text": "Another important step is to correlate the dialogue profile of each tutoring session, as revealed by the HMM, to learning and affective outcomes of the tutoring session." }, { "sent_id": "c42e9d10ca8876af80eee021c969d7-C001-61", "text": "This type of inquiry can lead directly to design recommendations for tutorial dialogue systems that aim to maximize particular learner outcomes." }, { "sent_id": "c42e9d10ca8876af80eee021c969d7-C001-62", "text": "In addition, leveraging knowledge of the task state as well as surface-level utterance content below the dialogue act level are promising directions for refining the descriptive and predictive power of these models." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "c42e9d10ca8876af80eee021c969d7-C001-9" ], [ "c42e9d10ca8876af80eee021c969d7-C001-10" ], [ "c42e9d10ca8876af80eee021c969d7-C001-28" ] ], "cite_sentences": [ "c42e9d10ca8876af80eee021c969d7-C001-9", "c42e9d10ca8876af80eee021c969d7-C001-10", "c42e9d10ca8876af80eee021c969d7-C001-28" ] }, "@MOT@": { "gold_contexts": [ [ "c42e9d10ca8876af80eee021c969d7-C001-10" ] ], "cite_sentences": [ "c42e9d10ca8876af80eee021c969d7-C001-10" ] } } }, "ABC_253d635829c733309bb49fc1fcc1cd_45": { "x": [ { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-2", "text": "Misinformation detection at the level of full news articles is a text classification problem." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-3", "text": "Reliably labeled data in this domain is rare." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-4", "text": "Previous work relied on news articles collected from so-called \"reputable\" and \"suspicious\" websites and labeled accordingly." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-5", "text": "We leverage fact-checking websites to collect individuallylabeled news articles with regard to the veracity of their content and use this data to test the cross-domain generalization of a classifier trained on bigger text collections but labeled according to source reputation." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-6", "text": "Our results suggest that reputation-based classification is not sufficient for predicting the veracity level of the majority of news articles, and that the system performance on different test datasets depends on topic distribution." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-7", "text": "Therefore collecting well-balanced and carefully-assessed training data is a priority for developing robust misinformation detection systems." 
}, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-8", "text": "----------------------------------" }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-10", "text": "Automatic detection of fake from legitimate news in different formats such as headlines, tweets and full news articles has been approached in recent Natural Language Processing literature (Vlachos and Riedel, 2014; Vosoughi, 2015; Jin et al., 2016; Rashkin et al., 2017; Wang, 2017; Pomerleau and Rao, 2017; Thorne et al., 2018)." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-11", "text": "The most important challenge in automatic misinformation detection using modern NLP techniques, especially at the level of full news articles, is data." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-12", "text": "Most previous systems built to identify fake news articles rely on training data labeled with respect to the general reputation of the sources, i.e., domains/user accounts (Fogg et al., 2001; Lazer et al., 2017; Rashkin et al., 2017)." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-13", "text": "Even though some of these studies try to identify fake news based on linguistic cues, the question is whether they learn publishers' general writing style (e.g., common writing features of a few clickbaity websites) or deceptive style (similarities among news articles that contain misinformation)." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-14", "text": "In this study, we collect two new datasets that include the full text of news articles and individually assigned veracity labels."
}, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-15", "text": "We then address the above question by conducting a set of cross-domain experiments: training a text classification system on data collected in a batch manner from suspicious and reputable websites and then testing the system on news articles that have been assessed in a one-by-one fashion." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-16", "text": "Our experiments reveal that the generalization power of a model trained on reputation-based labeled data is not impressive on individually assessed articles." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-17", "text": "Therefore, we propose to collect and verify larger collections of news articles with reliably assigned labels that would be useful for building more robust fake news detection systems." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-18", "text": "----------------------------------" }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-19", "text": "**DATA COLLECTION**" }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-20", "text": "Most studies on fake news detection have examined microblogs, headlines and claims in the form of short statements." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-21", "text": "A few recent studies have examined full articles (i.e., actual 'fake news') to extract discriminative linguistic features of misinformation (Rashkin et al., 2017; Horne and Adali, 2017)." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-22", "text": "The issue with these studies is the data collection methodology." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-23", "text": "Texts are harvested from websites that are assumed to be fake news publishers (according to a list of suspicious websites), with no individual labeling of data."
}, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-24", "text": "The so-called suspicious sources, however, sometimes do publish facts and valid information, and reputable websites sometimes publish inaccurate information (Mantzarlis, 2017)." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-25", "text": "The key to collecting more reliable data, then, is to rely not on the source but on the text of the article itself, labeling it only after the text has been assessed by human annotators and determined to contain false information." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-26", "text": "Currently, there exist only small collections of reliably labeled news articles (Rubin et al., 2016; Allcott and Gentzkow, 2017; Zhang et al., 2018), because this type of annotation is laborious." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-27", "text": "The Liar dataset (Wang, 2017) is the first large dataset collected through reliable annotation, but it contains only short statements." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-28", "text": "Another recently published large dataset is FEVER (Thorne et al., 2018), which contains both claims and texts from Wikipedia pages that support or refute those claims." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-29", "text": "This dataset, however, has been built to serve the slightly different purpose of stance detection (Pomerleau and Rao, 2017); the claims have been artificially generated, and texts are not news articles." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-30", "text": "Our objective is to elaborate on the distinction between classifying reputation-based labeled news articles and individually assessed news articles." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-31", "text": "We do so by collecting and using datasets of the second type in the evaluation of a text classifier trained on the first type of data." 
}, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-32", "text": "In this section, we first introduce one large collection of news text from previous studies that has been labeled according to the list of suspicious websites, and one small collection that was labeled manually for each and every news article, but only contains satirical and legitimate instances." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-33", "text": "We then introduce two datasets that we have scraped from the web by leveraging links to news articles mentioned by fact-checking websites (Buzzfeed and Snopes)." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-34", "text": "The distinguishing feature of these new collections is that they contain not only the full text of real news articles found online, but also individually assigned veracity labels indicative of their misinformative content." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-35", "text": "Rashkin et al. dataset: Rashkin et al. (2017) published a collection of roughly 20k news articles from eight sources categorized into four classes: propaganda (The Natural News and Activist Report), satire (The Onion, The Borowitz Report, and Clickhole), hoax (American News and DC Gazette) and trusted (Gigaword News)." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-36", "text": "This dataset is balanced across classes, and since the articles in their training and test splits come from different websites, the accuracy of the trained model on test data should be demonstrative of its understanding of the general writing style of each target class rather than author-specific cues." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-37", "text": "However, we suspect that the noisy strategy of labeling all of a publisher's articles based on its reputation strongly biases the classifier's decisions and limits its power to distinguish individual misinformative from truthful news articles." 
}, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-38", "text": "----------------------------------" }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-39", "text": "**RUBIN ET AL. DATASET:**" }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-40", "text": "As part of a study on satirical cues, Rubin et al. (2016) published a dataset of 360 news articles." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-41", "text": "This dataset contains balanced numbers of individually evaluated satirical and legitimate texts." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-42", "text": "Even though small, it is clean data for testing the generalization power of a system trained on noisy data such as the dataset explained above." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-43", "text": "We use this data to make our point about the need for careful annotation of news articles in a one-by-one fashion, rather than harvesting from websites generally known as hoax, propaganda or satire publishers." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-44", "text": "----------------------------------" }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-45", "text": "**BUZZFEEDUSE DATASET:**" }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-46", "text": "The first source of information that we used to harvest full news articles with veracity labels is the Buzzfeed fact-checking company." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-47", "text": "Buzzfeed has published a collection of links to Facebook posts, originally compiled for a study around the 2016 US election (Silverman et al., 2016)." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-48", "text": "Each URL in this dataset was given to human experts so they could rate the amount of false information contained in the linked article." 
}, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-49", "text": "The links were collected from nine Facebook pages (three right-wing, three left-wing and three mainstream publishers)." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-50", "text": "1 We had to follow the Facebook URLs and then the links to the original news articles to obtain the news texts." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-51", "text": "We scraped the full text of each news article from its original source." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-52", "text": "The resulting dataset includes a total of 1,380 news articles on a focused topic (US election and candidates)." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-53", "text": "Veracity labels come in a 4-way classification scheme including 1,090 mostly true, 170 mixture of true and false, 64 mostly false and 56 articles containing no factual content." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-54", "text": "----------------------------------" }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-55", "text": "**SNOPES312 DATASET:**" }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-56", "text": "The second source of information that we used to harvest full news articles with veracity labels is Snopes, a well-known rumor-debunking website run by a team of expert editors." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-57", "text": "We scraped the entire archive of fact-checking pages." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-58", "text": "On each page, they discuss a claim, cite the sources (news articles, forums or social networks where the claim was distributed) and provide a veracity label for the claim." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-59", "text": "We automatically extracted all links mentioned on a Snopes page, followed the link to each original news article, and extracted the text." 
}, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-60", "text": "The resulting data file includes roughly 4,000 rows, each containing a claim discussed by Snopes annotators, the veracity label assigned to it, and the text of a news article related to the claim." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-61", "text": "The main challenge in using this data for training/testing a fake news detector is that some of the links that we collect automatically from a Snopes page do not actually point to the discussed news article, i.e., the source of the claim." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-62", "text": "Many links are to pages that provide contextual information for the fact-checking of the claim." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-63", "text": "Therefore, not all the texts in our automatically extracted dataset are reliably the \"supporting\" source of the claim." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-64", "text": "To come up with a reliable set of veracity-labeled news articles, we randomly selected 312 items and assessed them manually." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-65", "text": "Two annotators performed independent assessments on the 312 items." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-66", "text": "A third annotator went through the entire list of items for a final check and to resolve disagreements." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-67", "text": "Snopes has a fine-grained veracity labeling system." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-68", "text": "We selected [fully] true, mostly true, mixture of true and false, mostly false, and [fully] false stories." 
}, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-69", "text": "Table 1 shows the distribution of these labels in the manually assessed 312 items, and how many news articles from each category were verified to be the \"supporting\" source (distributing the discussed claim), \"context\" (providing background or related information about the topic of the claim), \"debunking\" (against the claim), \"irrelevant\" (completely unrelated to the claim or distorted text) and \"ambiguous\" (unclear how it relates to the claim)." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-70", "text": "Table 2 provides information on the confusing choices: about 50% of the items received different category labels from the first two annotators." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-71", "text": "The first annotator had a more conservative bias, trying to avoid mistakes in the \"supporting\" category, whereas the second annotator often assigned either \"supporting\" or \"context\", and rarely \"irrelevant\"." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-72", "text": "For the items with disagreements, the third annotator (who had access to all outputs) chose the final category." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-73", "text": "Results in Table 1 are based on this final assessment." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-74", "text": "We use the \"supporting\" portion of the data (145 items) in the following experiments." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-75", "text": "----------------------------------" }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-76", "text": "**EXPERIMENTS**" }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-77", "text": "In text classification, Convolutional Neural Networks (CNNs) have been competing with the TF-IDF model, a simple but strong baseline using scored n-grams (Le and Mikolov, 2014; Zhang et al., 2015; Conneau et al., 2017; Medvedeva et al., 2017)." 
}, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-78", "text": "These methods have been used for fake news detection in previous work (Rashkin et al., 2017; Wang, 2017). Therefore, we use this model to demonstrate how a classifier trained on data labeled according to the publisher's reputation would identify misinformative news articles." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-79", "text": "It is evident in the first section of Figure 1 that the model performs well on similarly collected test items, i.e., Hoax, Satire, Propaganda and Trusted news articles within Rashkin et al.'s test dataset." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-80", "text": "However, when the model is applied to Rubin et al.'s data, which was carefully assessed for satirical cues in each and every article, the performance drops considerably (see the second section of the figure)." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-81", "text": "Although the classifier detects more of the satirical texts in Rubin et al.'s data, the distribution of the given labels is not very different from that of legitimate texts." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-82", "text": "One important feature of Rubin et al.'s data is that topics of the legitimate instances were matched and balanced with topics of the satirical instances." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-83", "text": "The results here suggest that similarities captured by the classifier can be very dependent on the topics of the news articles." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-84", "text": "Next we examine the same model on our collected datasets, BuzzfeedUSE and Snopes312, as test material." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-85", "text": "The BuzzfeedUSE data comes with 4 categories (Figure 1)." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-86", "text": "The classifier does seem to have some sensitivity to true vs. false information in this dataset, as more of the mostly true articles were labeled as Trusted." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-87", "text": "The difference with mostly false articles, however, is negligible." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-88", "text": "The most frequent label assigned by the classifier was Hoax in all four categories, which suggests that most BuzzfeedUSE articles looked like Hoax in Rashkin's data." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-89", "text": "Finally, the last section of Figure 1 shows the results on the Snopes312 data, plotted along the 6-category distinction. (Our classifier was significantly better on both dev and test sets: 0.96 and 0.75 F1-score, respectively, compared to 0.91 and 0.65 reported in their paper.)" }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-90", "text": "Source code will be made available at https://github.com/sfudiscourse-lab/Misinformation_detection" }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-91", "text": "A stronger correlation can be observed between the classifier's decisions and the veracity labels in this data compared to BuzzfeedUSE." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-92", "text": "This suggests that distinguishing between news articles with true and false information is a more difficult task when topics are the same (BuzzfeedUSE data is all related to the US election)." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-93", "text": "In Snopes312, news articles come from a variety of topics." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-94", "text": "The strong alignment between the classifier's Propaganda and Hoax labels with the mostly false and [fully] false categories in this dataset reveals that most misinformative news articles indeed discuss the topics or use the language of generally suspicious publishers." 
}, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-95", "text": "This is an encouraging result in the sense that, with surface features such as n-grams and approximate reputation-based training data, we can already detect some of the misinformative news articles." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-96", "text": "Observing classification errors across these experiments, however, indicates that the model's performance varies considerably with the type of test material: in a focused-topic situation, it fails to distinguish between categories (false vs. true, or satirical vs. legitimate articles)." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-97", "text": "While a correlation is consistently observed between labels assigned by the classifier and the actual labels of target news articles, 3 reputation-based classification does not seem to be sufficient for predicting the veracity level of the majority of news articles." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-98", "text": "----------------------------------" }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-99", "text": "**CONCLUSION**" }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-100", "text": "We found that collecting reliable data for automatic misinformation detection at the level of full news articles is a challenging but necessary task for building robust models." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-101", "text": "If we want to benefit from state-of-the-art text classification techniques, such as CNNs, we require larger datasets than what is currently available." }, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-102", "text": "We took the first steps by scraping claims and veracity labels from fact-checking websites, extracting and cleaning the original news articles' texts (resulting in roughly 4,000 items), and finally manually assessing a subset of the data to provide reliable test material for misinformation detection." 
}, { "sent_id": "253d635829c733309bb49fc1fcc1cd-C001-103", "text": "Our future plan is to crowd-source annotators for the remaining scraped texts and publish a large set of labeled news articles for training purposes." } ], "y": { "@BACK@": { "gold_contexts": [ [ "253d635829c733309bb49fc1fcc1cd-C001-10" ], [ "253d635829c733309bb49fc1fcc1cd-C001-27" ], [ "253d635829c733309bb49fc1fcc1cd-C001-78" ] ], "cite_sentences": [ "253d635829c733309bb49fc1fcc1cd-C001-10", "253d635829c733309bb49fc1fcc1cd-C001-27", "253d635829c733309bb49fc1fcc1cd-C001-78" ] }, "@USE@": { "gold_contexts": [ [ "253d635829c733309bb49fc1fcc1cd-C001-78" ] ], "cite_sentences": [ "253d635829c733309bb49fc1fcc1cd-C001-78" ] } } }, "ABC_b21dfcb9854b0b48af47f4f13899b0_45": { "x": [ { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-2", "text": "While intelligent writing assistants have become more common, they typically have little support for revision behavior." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-3", "text": "We present ArgRewrite, a novel web-based revision assistant that focuses on rewriting analysis." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-4", "text": "The system supports two major functionalities: 1) to assist students as they revise, the system automatically extracts and analyzes revisions; 2) to assist teachers, the system provides an overview of students' revisions and allows teachers to correct the automatically analyzed results, ensuring that students get the correct feedback." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-5", "text": "----------------------------------" }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-7", "text": "Making revisions is central to improving a student's writing, especially when there is a helpful instructor to offer detailed feedback between drafts." 
}, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-8", "text": "However, it is not practical for instructors to provide feedback on every change every time." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-9", "text": "While multiple intelligent writing assistants have been developed (Writelab, 2015; Draft, 2015; Turnitin, 2016) , they typically focus on the quality of the current essay instead of the revisions that have been made." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-10", "text": "For example, Turnitin identifies weak points of the essay and gives suggestions on how to improve them; it also assigns an overall score to the essay so students can get a coarse-grained feedback on whether they are making progress in their revisions." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-11", "text": "However, without explicit feedback on each change, students may inefficiently search for a way to optimize the automatic score rather than actively making the existing revisions \"better\"." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-12", "text": "Moreover, because students are the target users of these systems, instructors typically can neither correct the errors made by the automatic analysis nor observe/assess the students' revision efforts." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-13", "text": "We argue that an intelligent writing assistant ought to be aware of the revision process; it should: 1) identify all significant changes made by a writer between the essay drafts, 2) automatically determine the purposes of these changes, 3) provide the writer the means to compare between drafts in an easy to understand visualization, and 4) support instructor monitoring and corrections in the revision process as well." 
}, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-14", "text": "In our previous work (Zhang and Litman, 2014; Zhang and Litman, 2015) , we focused on 1) and 2), the automatic extraction and classification of revisions for argumentative writings." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-15", "text": "In this work, we extend our framework to integrate the automatic analyzer with a web-based interface to support student argumentative writings." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-16", "text": "The purpose of each change between revisions is demonstrated to the writer as a kind of feedback." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-17", "text": "If the author's revision purpose is not correctly recognized, it indicates that the effect of the writer's change might have not met the writer's expectation, which suggests that the writer should revise their revisions." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-18", "text": "The framework also connects the automatic analyzer with an interface for the instructor to manually correct the analysis results." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-19", "text": "As a side benefit, it also sets up an annotation pipeline to collect further data to improve the underlying automatic analyzer." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-20", "text": "----------------------------------" }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-21", "text": "**SYSTEM OVERVIEW**" }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-22", "text": "The design of ArgRewrite aims to encourage students to concentrate on revision improvement: to iteratively refine the essay based on the feedback of the automatic system or the writing instructor." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-23", "text": "Our framework consists of three components, arranged in a server client model." 
}, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-24", "text": "On the server side, the automatic analysis component extracts revision changes by aligning sentences across drafts and infers the purposes of the extracted revisions; this may reduce the writing instructor's workload." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-25", "text": "On the client side, a web-based rewriting assistant interface 1 allows the student to retrieve the feedback to their revisions from the server, make changes to the essay and submit the modified essay to the server for another round of analysis." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-26", "text": "The interface is also accessible to the writing instructor and allows the instructor to have a quick overview of the students' revision efforts." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-27", "text": "Another client side interface is a Java-based revision correction component 2 , which allows the writing instructors to override the results of the automatic analysis and upload the corrected feedback to the server." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-28", "text": "As demonstrated in Figure 1 , the complete process of the student's writing using our system starts with the student's rewriting and submission of the essay." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-29", "text": "The student writes the first draft of the essay before using our system and then modifies the original draft in our rewriting assistant interface." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-30", "text": "The submitted writings are automatically analyzed immediately after the receipt of the student's submission." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-31", "text": "Afterwards the instructor can manually correct the analysis results if necessary." 
}, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-32", "text": "The student can choose to view the analysis results immediately after the completion of automatic revision analysis or wait until the analysis results have been corrected by the instructor." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-33", "text": "After receiving the analysis feedback, the student can choose to continue with the cycle of essay revising until the revisions are satisfactory." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-34", "text": "Aligned pairs in which the sentences are not identical are extracted as revisions." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-35", "text": "We first use the Stanford Parser (Klein and Manning, 2003) to break the original text into sentences and then align the sentences using the algorithm in our prior work (Zhang and Litman, 2014), which considers both sentence similarity (calculated using TF*IDF score) and the global context of sentences." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-36", "text": "Revision classification." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-37", "text": "Following the argumentative revision definition in our prior work (Zhang and Litman, 2015), revisions are first categorized into Content (Text-based) and Surface 3 according to whether the revision changed the meaning of the essay or not." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-38", "text": "The Text-based revisions include Thesis/Ideas (Claim), Rebuttal, Reasoning (Warrant), Evidence, and Other content changes (General Content)." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-61", "text": "The added/deleted sentences would be aligned to blank in the map." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-39", "text": "The Surface revisions include Fluency (Wordusage/Clarity), Reordering (Organization) and Errors (Conventions/Grammar/Spelling)." 
}, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-40", "text": "On the basis of later work, the system includes two new categories, Precision 4 and Unknown 5." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-41", "text": "Using the corpora and features defined in our prior work, a multiclass Random Forest classifier was trained to automatically predict the revision purpose type for each extracted revision." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-42", "text": "----------------------------------" }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-43", "text": "**REWRITING ASSISTANT INTERFACE**" }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-44", "text": "Our rewriting assistant interface is designed with several principles in mind." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-45", "text": "1) Because the revision classification taxonomy goes beyond the binary textual versus surface distinction, we want to make sure that users don't get lost distinguishing different categories; 2) We want to encourage users to think about their revisions holistically, not always just focusing on low-level details; 3) We want to encourage users to continuously re-evaluate whether they succeeded in making changes between drafts (rather than focusing on generating new content)." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-46", "text": "Thus, we have designed an interface that offers multiple views of the revision changes." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-47", "text": "As demonstrated in Figure 2 , the interface includes a revision overview interface for the overview of the authors' revisions and a revision detail interface that allows the author to access the details of their essays and revisions." 
}, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-48", "text": "Inspired by work on learning analytics (Liu et al., 2013; Verbert et al., 2013), we designed the revision overview interface, which displays the statistics of the revisions." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-49", "text": "Following design principle #1, the revision purposes are color-coded, and each purpose corresponds to a specific color." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-50", "text": "Our prior work (Zhang and Litman, 2015) demonstrates that only Text-based revisions are significantly correlated with writing improvement." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-51", "text": "To inspire the writers to focus more on the important Text-based revisions, cold colors are chosen for the Surface revisions and warm colors are chosen for the Text-based revisions." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-52", "text": "The statistics and the pie chart provide a quantitative summary of the writer's revision efforts." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-53", "text": "For example, in Figure 2, the writer makes many changes to the Fluency (15) of sentences but no changes to the Thesis/Ideas (0)." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-54", "text": "To allow the users to concentrate on improving one revision type at a time, the interface allows the user to click on a single revision purpose type and view only the specified revisions." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-55", "text": "Following our design principle #2, the revision map in both interfaces presents an at-a-glance visual representation of the revision." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-56", "text": "This design is inspired by Southavilay et al. (2013)." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-57", "text": "Each sentence is represented as a square in the map." 
}, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-58", "text": "The left column of the map represents the sentences in the first draft and the right column represents the sentences in the second draft." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-59", "text": "The paragraphs within one draft are segmented by blanks in the map." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-60", "text": "The aligned sentences appear in the same row." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-62", "text": "The revision map allows a user (either an instructor or a student) to view the structure of the essay and identify the locations of all the changes at once." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-63", "text": "For example, in Figure 2 , the user can quickly identify that the writer aims at improving the clarity and soundness of the third paragraph by making a Rebuttal modification on the second sentence and Fluency modifications on all other sentences." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-64", "text": "The user can also click on the square to view the details of the revision in the revision text area region of the revision detail interface." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-65", "text": "To encourage students to make revisions (design principle #3), in the revision detail interface the revision text area region highlights the revisions (colorcoded by the revision categories) in the essay and allows the writer to modify it directly." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-66", "text": "The writer clicks on the text to read the revision and examine whether the revision purpose is recognized by the instructor/system." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-67", "text": "A character-level diff 6 is done on the aligned sentences to help the writer identify the differences between two drafts." 
}, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-68", "text": "In the example the writer can see that their \"Evidence\" change is recognized, indicating that the revision effort is clear and effective." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-69", "text": "If the writer finds out that their real revision purpose is not recognized, they can modify the essay in the textbox directly and submit the essay to the server when all the edits are done." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-70", "text": "----------------------------------" }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-71", "text": "**REVISION CORRECTION**" }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-72", "text": "The revision correction tool is developed for instructors only." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-73", "text": "The instructor loads the revision annotation files from the server, corrects the analysis results and uploads the corrections to the server." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-74", "text": "As demonstrated in Figure 3 , the tool includes a sentence alignment correction interface and a revision purpose correction interface." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-75", "text": "The instructor first corrects the sentence alignment errors and then se-6 google diff match: https://code.google.com/ archive/p/google-diff-match-patch/ lects the revision purposes for the re-aligned or mislabeled sentence pairs." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-76", "text": "The correction actions of the instructors will be recorded and used to improve the analysis accuracy of the automatic analysis module." 
}, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-77", "text": "----------------------------------" }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-78", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-79", "text": "In this work we demonstrate a novel revision assistant for argumentative writings." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-80", "text": "Comparing to other assistants, the system focuses on inspiring writers to improve existing revisions instead of making new revisions." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-81", "text": "The system takes the writer's drafts as the input and presents the revision purposes (analyzed manually or automatically) as the feedback." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-82", "text": "The writer revises iteratively until the purposes of the revisions are clear enough to be recognized." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-83", "text": "In the future we plan to develop and incorporate the function of revision quality analysis, which not only recognizes the revision purpose, but also evaluates the quality of the revision (whether the revision weakly/strongly improves the essay)." }, { "sent_id": "b21dfcb9854b0b48af47f4f13899b0-C001-84", "text": "We are also about to begin a user study to evaluate the system." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "b21dfcb9854b0b48af47f4f13899b0-C001-14" ], [ "b21dfcb9854b0b48af47f4f13899b0-C001-50" ] ], "cite_sentences": [ "b21dfcb9854b0b48af47f4f13899b0-C001-14", "b21dfcb9854b0b48af47f4f13899b0-C001-50" ] }, "@EXT@": { "gold_contexts": [ [ "b21dfcb9854b0b48af47f4f13899b0-C001-14", "b21dfcb9854b0b48af47f4f13899b0-C001-15" ] ], "cite_sentences": [ "b21dfcb9854b0b48af47f4f13899b0-C001-14" ] }, "@USE@": { "gold_contexts": [ [ "b21dfcb9854b0b48af47f4f13899b0-C001-37" ] ], "cite_sentences": [ "b21dfcb9854b0b48af47f4f13899b0-C001-37" ] } } }, "ABC_8fc0d25eb177ea876c2b69096f0145_45": { "x": [ { "sent_id": "8fc0d25eb177ea876c2b69096f0145-C001-31", "text": "Our three embeddings are available online." }, { "sent_id": "8fc0d25eb177ea876c2b69096f0145-C001-32", "text": "5 Evaluation on KaWAT was done using gensim with its KeyedVectors.most similar method." }, { "sent_id": "8fc0d25eb177ea876c2b69096f0145-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "8fc0d25eb177ea876c2b69096f0145-C001-2", "text": "We introduced KaWAT (Kata Word Analogy Task), a new word analogy task dataset for Indonesian." }, { "sent_id": "8fc0d25eb177ea876c2b69096f0145-C001-3", "text": "We evaluated on it several existing pretrained Indonesian word embeddings and embeddings trained on Indonesian online news corpus." }, { "sent_id": "8fc0d25eb177ea876c2b69096f0145-C001-4", "text": "We also tested them on two downstream tasks and found that pretrained word embeddings helped either by reducing the training epochs or yielding significant performance gains." 
}, { "sent_id": "8fc0d25eb177ea876c2b69096f0145-C001-5", "text": "----------------------------------" }, { "sent_id": "8fc0d25eb177ea876c2b69096f0145-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "8fc0d25eb177ea876c2b69096f0145-C001-7", "text": "Despite the existence of various Indonesian pretrained word embeddings, there are no publicly available Indonesian analogy task datasets on which to evaluate these embeddings." }, { "sent_id": "8fc0d25eb177ea876c2b69096f0145-C001-8", "text": "Consequently, it is unknown if Indonesian word embeddings introduced in, e.g., (Al-Rfou et al., 2013) and (Grave et al., 2018) , capture syntactic or semantic information as measured by analogy tasks." }, { "sent_id": "8fc0d25eb177ea876c2b69096f0145-C001-9", "text": "Also, such embeddings are usually trained on Indonesian Wikipedia (Al-Rfou et al., 2013; Bojanowski et al., 2017) whose size is relatively small, approximately 60M tokens." }, { "sent_id": "8fc0d25eb177ea876c2b69096f0145-C001-10", "text": "Therefore, in this work, we introduce KaWAT (Kata Word Analogy Task), an Indonesian word analogy task dataset, and new Indonesian word embeddings pretrained on 160M tokens of online news corpus." }, { "sent_id": "8fc0d25eb177ea876c2b69096f0145-C001-11", "text": "We evaluated these embeddings on KaWAT, and also tested them on POS tagging and text summarization as representatives of syntactic and semantic downstream task respectively." }, { "sent_id": "8fc0d25eb177ea876c2b69096f0145-C001-12", "text": "----------------------------------" }, { "sent_id": "8fc0d25eb177ea876c2b69096f0145-C001-13", "text": "**METHODOLOGY**" }, { "sent_id": "8fc0d25eb177ea876c2b69096f0145-C001-14", "text": "We asked an Indonesian linguist to help build KaWAT based on English analogy task datasets such as Google Word Analogy (Mikolov et al., 2013a) and BATS (Gladkova et al., 2016) ." 
}, { "sent_id": "8fc0d25eb177ea876c2b69096f0145-C001-15", "text": "Following those works, we split the analogy tasks into two categories, syntax and semantic." }, { "sent_id": "8fc0d25eb177ea876c2b69096f0145-C001-16", "text": "We included mostly morphological analogies in the syntax category, leveraging the richness of Indonesian inflectional morphology." }, { "sent_id": "8fc0d25eb177ea876c2b69096f0145-C001-17", "text": "For semantic, we included analogies such as antonyms, country capitals and currencies, gender-specific words, measure words, and Indonesian province capitals." }, { "sent_id": "8fc0d25eb177ea876c2b69096f0145-C001-18", "text": "In total, we have 15K syntactic and 19K semantic analogy queries." }, { "sent_id": "8fc0d25eb177ea876c2b69096f0145-C001-19", "text": "KaWAT is available online." }, { "sent_id": "8fc0d25eb177ea876c2b69096f0145-C001-20", "text": "1 One of the goals of this work is to evaluate and compare existing Indonesian pretrained word embeddings." }, { "sent_id": "8fc0d25eb177ea876c2b69096f0145-C001-21", "text": "We used fastText pretrained embeddings introduced in (Bojanowski et al., 2017 ) and (Grave et al., 2018) , which have been trained on Indonesian Wikipedia and Indonesian Wikipedia plus Common Crawl data respectively." }, { "sent_id": "8fc0d25eb177ea876c2b69096f0145-C001-22", "text": "We refer to them as Wiki/fastText and CC/fastText hereinafter." }, { "sent_id": "8fc0d25eb177ea876c2b69096f0145-C001-23", "text": "We also used another two pretrained embeddings: polyglot embedding trained on Indonesian Wikipedia (Al-Rfou et al., 2013) and NLPL embedding trained on the Indonesian portion of CoNLL 2017 corpus (Fares et al., 2017) ." }, { "sent_id": "8fc0d25eb177ea876c2b69096f0145-C001-24", "text": "For training our word embeddings, we used online news corpus obtained from Tempo." }, { "sent_id": "8fc0d25eb177ea876c2b69096f0145-C001-25", "text": "2 We used Tempo newspaper and magazine articles up to year 2014." 
}, { "sent_id": "8fc0d25eb177ea876c2b69096f0145-C001-26", "text": "This corpus contains roughly 400K articles, 160M word tokens, and 600K word types." }, { "sent_id": "8fc0d25eb177ea876c2b69096f0145-C001-27", "text": "To train the word embeddings, we experimented with three algorithms: word2vec (Mikolov et al., 2013b) , fastText (Bojanowski et al., 2017) , and GloVe (Pennington et al., 2014) ." }, { "sent_id": "8fc0d25eb177ea876c2b69096f0145-C001-28", "text": "We refer to them henceforth as Tempo/word2vec, Tempo/fastText, and Tempo/GloVe respectively." }, { "sent_id": "8fc0d25eb177ea876c2b69096f0145-C001-29", "text": "We used gensim 3 to run word2vec and fastText and the original C implementation for GloVe." }, { "sent_id": "8fc0d25eb177ea876c2b69096f0145-C001-30", "text": "4 For all three, we used their default hyperparameters, i.e. no tuning was performed." }, { "sent_id": "8fc0d25eb177ea876c2b69096f0145-C001-33", "text": "Since the vocabularies of the word embeddings are different, for a fair comparison, we first removed analogy queries containing words that do not exist in any vocabulary." }, { "sent_id": "8fc0d25eb177ea876c2b69096f0145-C001-34", "text": "In other words, we only kept queries whose words all exist in all vocabularies." }, { "sent_id": "8fc0d25eb177ea876c2b69096f0145-C001-35", "text": "After this process, there were roughly 6K syntactic and 1.5K semantic queries." }, { "sent_id": "8fc0d25eb177ea876c2b69096f0145-C001-36", "text": "We performed evaluation by computing 95% confidence interval of the accuracy at rank 1 by bootstrapping." }, { "sent_id": "8fc0d25eb177ea876c2b69096f0145-C001-37", "text": "Our implementation code is available online." 
}, { "sent_id": "8fc0d25eb177ea876c2b69096f0145-C001-38", "text": "6 3 Results" }, { "sent_id": "8fc0d25eb177ea876c2b69096f0145-C001-39", "text": "----------------------------------" }, { "sent_id": "8fc0d25eb177ea876c2b69096f0145-C001-40", "text": "**WORD ANALOGY RESULTS**" }, { "sent_id": "8fc0d25eb177ea876c2b69096f0145-C001-41", "text": "We found that on syntactic analogies, Wiki/fastText achieved 2.7% accuracy, which significantly outperformed the others, even CC/fastText which has been trained on a much larger corpus." }, { "sent_id": "8fc0d25eb177ea876c2b69096f0145-C001-42", "text": "Other embeddings performed poorly, mostly less than 1% of accuracy." }, { "sent_id": "8fc0d25eb177ea876c2b69096f0145-C001-43", "text": "The overall trend of low accuracy scores attests to the difficulty of syntactic KaWAT analogies, making it suitable as benchmark for future research." }, { "sent_id": "8fc0d25eb177ea876c2b69096f0145-C001-44", "text": "On semantic analogies, Tempo/GloVe clearly outperformed the others with 20.42% accuracy, except Tempo/word2vec." }, { "sent_id": "8fc0d25eb177ea876c2b69096f0145-C001-45", "text": "Surprisingly, we found that Tempo/fastText performed very poorly with less than 1% accuracy, even worse than Wiki/fastText which has been trained on a much smaller data." }, { "sent_id": "8fc0d25eb177ea876c2b69096f0145-C001-46", "text": "Overall, the accuracies on semantic are also low, less than 25%, which again attests to the suitability of KaWAT as benchmark for future work." }, { "sent_id": "8fc0d25eb177ea876c2b69096f0145-C001-47", "text": "----------------------------------" }, { "sent_id": "8fc0d25eb177ea876c2b69096f0145-C001-48", "text": "**DOWNSTREAM TASK RESULTS**" }, { "sent_id": "8fc0d25eb177ea876c2b69096f0145-C001-49", "text": "To check how useful these embeddings are for downstream tasks, we evaluated them on POS tagging and text summarization task." 
}, { "sent_id": "8fc0d25eb177ea876c2b69096f0145-C001-50", "text": "For each task, we compared two embeddings, which are the best off-the-shelf pretrained embedding and our proposed embedding on the syntactic (for POS) and semantic (for summarization) analogy task respectively." }, { "sent_id": "8fc0d25eb177ea876c2b69096f0145-C001-51", "text": "7 We used the same model and setting as (Kurniawan and Aji, 2018) for POS tagging and (Kurniawan and Louvan, 2018) for summarization." }, { "sent_id": "8fc0d25eb177ea876c2b69096f0145-C001-52", "text": "However, for computational reasons, we tuned only the learning rate using grid search, and only used the first fold of the summarization dataset." }, { "sent_id": "8fc0d25eb177ea876c2b69096f0145-C001-53", "text": "Our key finding from the POS tagging experiment is that using the two embeddings did not yield significant gain on test F 1 score compared with not using any pretrained embedding (around 97.23)." }, { "sent_id": "8fc0d25eb177ea876c2b69096f0145-C001-54", "text": "However, on average, using Wiki/fastText resulted in 20% fewer training epochs, compared with only 4% when using Tempo/GloVe." }, { "sent_id": "8fc0d25eb177ea876c2b69096f0145-C001-55", "text": "For the summarization experiment, Tempo/GloVe was significantly better 8 than CC/fastText in ROUGE-1 and ROUGE-L scores (66.63 and 65.93 respectively)." }, { "sent_id": "8fc0d25eb177ea876c2b69096f0145-C001-56", "text": "The scores of using CC/fastText was on par to those of not using any pretrained word embedding, and we did not observe fewer training epochs when using pretrained word embedding." 
}, { "sent_id": "8fc0d25eb177ea876c2b69096f0145-C001-57", "text": "----------------------------------" }, { "sent_id": "8fc0d25eb177ea876c2b69096f0145-C001-58", "text": "**CONCLUSION**" }, { "sent_id": "8fc0d25eb177ea876c2b69096f0145-C001-59", "text": "We introduced KaWAT, a new dataset for Indonesian word analogy task, and evaluated several Indonesian pretrained word embeddings on it." }, { "sent_id": "8fc0d25eb177ea876c2b69096f0145-C001-60", "text": "We found that (1) in general, accuracies on the analogy tasks were low, suggesting that improvements for Indonesian word embeddings are still possible and KaWAT is hard enough to be the benchmark dataset for that purpose, (2) on syntactic analogies, embedding by (Bojanowski et al., 2017) performed best and yielded 20% fewer training epochs when employed for POS tagging, and (3) on semantic analogies, GloVe embedding trained on Tempo corpus performed best and produced significant gains on ROUGE-1 and ROUGE-L scores when used for text summarization." } ], "y": { "@MOT@": { "gold_contexts": [ [ "8fc0d25eb177ea876c2b69096f0145-C001-8" ], [ "8fc0d25eb177ea876c2b69096f0145-C001-9" ] ], "cite_sentences": [ "8fc0d25eb177ea876c2b69096f0145-C001-8", "8fc0d25eb177ea876c2b69096f0145-C001-9" ] }, "@BACK@": { "gold_contexts": [ [ "8fc0d25eb177ea876c2b69096f0145-C001-9" ] ], "cite_sentences": [ "8fc0d25eb177ea876c2b69096f0145-C001-9" ] }, "@USE@": { "gold_contexts": [ [ "8fc0d25eb177ea876c2b69096f0145-C001-23" ] ], "cite_sentences": [ "8fc0d25eb177ea876c2b69096f0145-C001-23" ] } } }, "ABC_fddb1d19895976661babdc17d232ee_45": { "x": [ { "sent_id": "fddb1d19895976661babdc17d232ee-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-2", "text": "Predicting popularity of social media videos before they are published is a challenging task, mainly due to the complexity of content distribution network as well as the number of factors that play part in this process." 
}, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-3", "text": "As solving this task provides tremendous help for media content creators, many successful methods were proposed to solve this problem with machine learning." }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-4", "text": "In this work, we change the viewpoint and postulate that it is not only the predicted popularity that matters, but also, maybe even more importantly, understanding of how individual parts influence the final popularity score." }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-5", "text": "To that end, we propose to combine the Grad-CAM visualization method with a soft attention mechanism." }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-6", "text": "Our preliminary results show that this approach allows for more intuitive interpretation of the content impact on video popularity, while achieving competitive results in terms of prediction accuracy." }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-7", "text": "----------------------------------" }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-9", "text": "Multiple factors make popularity prediction of social media content a challenging task, including propagation patterns, social graph of users and interestingness of content." }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-10", "text": "Current methods for online content popularity analysis focus mostly on determining its future popularity [6, 2, 8, 12] ." }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-11", "text": "For instance, [6] and [4] use visual cues to predict the popularity of online images." }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-12", "text": "[13, 14, 11] use machine learning methods, such as Support Vector Regression and recurrent neural networks, applied to visual and textual cues to predict the popularity of social media videos." 
}, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-13", "text": "[2] combines cues from different modalities for the same purpose." }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-14", "text": "Although popularity prediction is an important problem, we believe that it does not address all the challenges faced by online video creators." }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-15", "text": "More precisely, it does not allow to answer the question of many creators: how a given frame or title word contributes to the popularity of the video?" }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-16", "text": "Even though correlation does not mean causality, this kind of analysis allows to understand the importance of a given piece of content and prioritize the creation efforts accordingly." }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-17", "text": "Figure 1 ." }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-18", "text": "Sample social media video frames and its headline with visualized importance of their parts on predicted video popularity." }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-19", "text": "The top row shows 5 consecutive frames of a social media video with their attention weights above." }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-20", "text": "We use those weights to scale the magnitudes of Grad-CAM [9] heatmaps when visualizing the importance of video elements on the popularity score." }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-21", "text": "The bottom row shows the frame with the highest attention weight (left) with its popularity importance visualization (right)." }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-22", "text": "For the headline text darker color corresponds to higher importance." 
}, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-23", "text": "In this paper, we outline a fundamentally different approach to online video popularity analysis that allows social media creators both to predict video popularity as well as to understand the impact of its headline or video frames on the future popularity." }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-24", "text": "To that end, we propose to use an attention-based model and gradient-weighted class activation maps [9] , inspired by the recent successes of the attention mechanism in other domains [15, 16] ." }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-25", "text": "Although some works focused on understanding the influence of image parts on its popularity [6, 1] , our method addresses videos, not images, and exploits the temporal characteristics of video clips through the attention mechanism." }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-26", "text": "By extending the baseline popularity prediction method with the attention mechanism, we enable more intuitive visualization of the impact visual and textual features of the video have on its final popularity, while achieving state-of-the-art results on the popularity prediction task." }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-27", "text": "----------------------------------" }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-28", "text": "**ARXIV:1804.09949V1 [CS.CV] 26 APR 2018**" }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-29", "text": "----------------------------------" }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-30", "text": "**POPULARITY PREDICTION WITH ATTENTION**" }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-31", "text": "We cast the problem of social media video popularity prediction as a binary classification task, as in [13, 11] ." 
}, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-32", "text": "We focus on videos from Facebook and normalize their viewcount by the number of page followers." }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-33", "text": "We assign a popular/unpopular label by splitting our dataset at the median normalized viewcount, following the approach of [10] ." }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-34", "text": "We use a cross-entropy loss function to classify a video as popular/unpopular and take a set of video frames and/or headline features as an input." }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-35", "text": "Video frames." }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-36", "text": "We extract N = 18 evenly distributed frames from the first 6 seconds of a video 1 ." }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-37", "text": "We use 2048-dimensional output of the penultimate layer of ResNet50 [5] pre-trained on ImageNet [3] to get a high-level frame representation as in [13] ." }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-38", "text": "For each frame feature vector we apply a learnable linear transformation followed by ReLU, obtaining a sequence of frame embeddings (q j ) N j=1 ." }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-39", "text": "The final video embedding is a weighted average of these embed-" }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-40", "text": "Weights \u03b1 i are computed with attention mechanism implemented as a two-layer neural network [16] : the first layer produces a hidden representation u i = tanh(W u q i + b u ) and the second layer outputs unnormalized importance a i = W a u i + b a ." }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-41", "text": "W a can be interpreted as a trainable high level representation of the most informative vector in u i space." 
}, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-42", "text": "Final weights are normalized with softmax:" }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-43", "text": "Headline." }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-44", "text": "We represent a headline as a sequence of pre-trained GloVe [7] word vectors (w t ) N t=1 ." }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-45", "text": "We handle sequences of variable length using a bidirectional LSTM." }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-46", "text": "Similarly to video frames, we use a two-layer attention mechanism on hidden state vectors h t to let the network learn the importance coefficients \u03b2 t for each word." }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-47", "text": "The final text representation d is a weighted average of hidden state vectors d = N t=1 \u03b2 t h t ." }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-48", "text": "Multimodal prediction." }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-49", "text": "We concatenate previously trained video and text embeddings and train a two-layer neural network for popularity prediction." }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-50", "text": "The intermediate layer output serves as multimodal embedding that captures information from both image and text modality that contribute to image popularity." }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-51", "text": "Visualizations." }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-52", "text": "We visualize the importance of visual features using Grad-CAM [9] ." }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-53", "text": "More precisely, we generate heatmaps pointing to regions contributing to popularity in each frame." 
}, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-54", "text": "To this end, we compute gradients of the popular class score\u015d with respect to the output of the last convolutional layer of ResNet50 A \u2208 R K\u00d7K\u00d7F ." }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-55", "text": "Gradients are then used to compute weights" }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-56", "text": "\u2202\u015d \u2202A f i,j that applied to the convolutional output create class activation map H = max(0," }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-57", "text": "We then normalize the heatmap values to [0, 1] and use attention weights to 1 We use the first seconds of a video as this is how Facebook counts views, but we can extend our method to longer videos through sampling." }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-58", "text": "scale the heatmap by \u03b1 i / max(\u03b1)." }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-59", "text": "This way we obtain a sequence-wide normalized heatmap of frame regions influencing the final popularity score." }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-60", "text": "For visualizations in the text domain, we use attention weights \u03b2 t used to compute text representation d. These weights capture relative importance of words in their context to headline popularity, as shown in [16] in the context of sentiment analysis." }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-61", "text": "----------------------------------" }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-62", "text": "**EXPERIMENTS**" }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-63", "text": "We use a dataset of 37k Facebook videos with 80/10/10 train/validation/test splits." }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-64", "text": "We use validation set to perform randomized serach of hyperparameters such as embedding dimensionalities, dropout rates and batch normalization use." 
}, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-65", "text": "For training headline embeddings we use frozen pretrained GloVe word vectors trained on Wikipedia and Gigaword [7] ." }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-66", "text": "As baselines, we use a simple mean of ResNet50 feature vectors as input to two layer neural network (video frames) and concatenation of last states of LSTM (headlines)." }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-67", "text": "We use Keras for implementation." }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-68", "text": "Results." }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-69", "text": "For all methods, we follow the evaluation protocol of [13, 11] and compute the classification accuracy and Spearman correlation between the predicted probability of popular label and normalized view count of a video." }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-70", "text": "Tab." }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-71", "text": "1 shows the results." }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-72", "text": "Interestingly, almost equal popularity prediction results can be obtained using either video frames or headline features." }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-73", "text": "Combining both modalities leads to noticeable improvement, while adding attention mechanism improves the performance in the multimodal and visual case." }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-74", "text": "For headlines, the performance with attention deteriorates slightly." }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-75", "text": "We speculate that the bi-directional LSTM already learns internal dependencies between hidden states and adding attention cannot help further, while for video frames it enables the network to exploit the temporal dependencies between the frames." 
}, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-76", "text": "Overall, we see that prediction methods with attention achieve competitive results, while thanks to the attention mechanism we can increase the interpretability of our model." }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-77", "text": "As Fig. 1 shows, extending standard Grad-CAM visualization with the attention mechanism allows us to interpret the influence of each video frame." }, { "sent_id": "fddb1d19895976661babdc17d232ee-C001-78", "text": "Although the preliminary results presented in this paper leave place for improvement, they indicate the potential of using attention mechanism to increase the interpretability of popularity prediction methods for social media videos." } ], "y": { "@USE@": { "gold_contexts": [ [ "fddb1d19895976661babdc17d232ee-C001-24" ], [ "fddb1d19895976661babdc17d232ee-C001-40" ], [ "fddb1d19895976661babdc17d232ee-C001-60" ] ], "cite_sentences": [ "fddb1d19895976661babdc17d232ee-C001-24", "fddb1d19895976661babdc17d232ee-C001-40", "fddb1d19895976661babdc17d232ee-C001-60" ] }, "@BACK@": { "gold_contexts": [ [ "fddb1d19895976661babdc17d232ee-C001-60" ] ], "cite_sentences": [ "fddb1d19895976661babdc17d232ee-C001-60" ] } } }, "ABC_1e232f9dfa7d499d1ba39fcebf3d1a_45": { "x": [ { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-96", "text": "**CONCLUDING REMARKS**" }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-2", "text": "GLARF relations are generated from treebank and parses for English, Chinese and Japanese." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-3", "text": "Our evaluation of system output for these input types requires consideration of multiple correct answers." 
}, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-4", "text": "1" }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-5", "text": "----------------------------------" }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-7", "text": "Systems, such as treebank-based parsers (Charniak, 2001; Collins, 1999) and semantic role labelers (Gildea and Jurafsky, 2002; Xue, 2008) , are trained and tested on hand-annotated data." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-8", "text": "Evaluation is based on differences between system output and test data." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-9", "text": "Other systems use these programs to perform tasks unrelated to the original annotation." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-10", "text": "For example, participating systems in CONLL (Surdeanu et al., 2008; Haji\u010d et al., 2009 ), ACE and GALE tasks merged the results of several processors (parsers, named entity recognizers, etc.) not initially designed for the task at hand." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-11", "text": "This paper discusses differences between handannotated data and automatically generated data with respect to our GLARFers, systems for generating Grammatical and Logical Representation Framework (GLARF) for English, Chinese and Japanese sentences." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-12", "text": "The paper describes GLARF (Meyers et al., 2001; ) and GLARFers and compares GLARF produced from treebank and parses." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-13", "text": "Figure 1 includes simplified GLARF analyses for English, Chinese and Japanese sentences." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-14", "text": "For each sentence, a GLARFer constructs both a Feature Structure (FS) representing a constituency analysis and a set of 31-tuples, each representing up to three dependency relations between pairs of words." 
}, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-15", "text": "Due to space limitations, we will focus on the 6 fields of the 31-tuple represented in Figure 1 ." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-16", "text": "These include: (1) a functor (func); (2) the depending argument (Arg); (3) a surface (Surf) label based on the position in the parse tree with no regularizations; (4) a logic1 label (L 1) for a relation that reflects grammar-based regularizations of the surface level." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-17", "text": "This marks relations for filling gaps in relative clauses or missing infinitival subjects, represents passives as paraphrases as actives, etc." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-18", "text": "While the general framework supports many regularizations, the relations actually represented depends on the implemented grammar, e.g., our current grammar of English regularizes across passives and relative clauses, but our grammars of Japanese and Chinese do not currently.; (5) a logic2 label (L2) for Chinese and English, which represents PropBank, NomBank and Penn Discourse Treebank relations; and (6) Asterisks (*) indicate transparent relations, relations where the functor inherits semantic properties of certain special arguments (*CONJ, *OBJ, *PRD, *COMP)." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-19", "text": "phrases." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-20", "text": "In this same example, the adverb not can be attached to either the copula are or the predicative adjective, with no discernible difference in meaning-this factor is indicated by the transparent designation of the relations where the copula is a functor." 
}, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-21", "text": "Transparent features also provide us with a simple way of handling certain function words, such as the Chinese word De which inherits the function of its underlying head, connecting a variety of such modifiers to head nouns (an adjective in the Chinese example.)." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-22", "text": "For conjunction cases, the number of underlying relations would multiply, e.g., Mary and John bought and sold stock would (underlyingly) have four subject relations derived by pairing each of the underlying subject nouns Mary and John with each of the underlying main predicate verbs bought and sold." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-23", "text": "----------------------------------" }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-24", "text": "**GLARF**" }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-25", "text": "----------------------------------" }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-26", "text": "**AUTOMATIC VS. MANUAL ANNOTATION**" }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-27", "text": "Apart from accuracy, there are several other ways that automatic and manual annotation differs." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-28", "text": "For Penn-treebank (PTB) parsing, for example, most parsers (not all) leave out function tags and empty categories." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-29", "text": "Consistency is an important goal for manual annotation for many reasons including: (1) in the absence of a clear correct answer, consistency helps clarify measures of annotation quality (inter-annotator agreement scores); and (2) consistent annotation is better training data for machine learning." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-30", "text": "Thus, annotation specifications use defaults to ensure the consistent handling of spurious ambiguity." 
}, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-31", "text": "For example, given a sentence like I bought three acres of land in California, the PP in California can be attached to either acres or land with no difference in meaning." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-32", "text": "While annotation guidelines may direct a human annotator to prefer, for example, high attachment, systems output may have other preferences, e.g., the probability that land is modified by a PP (headed by in) versus the probability that acres can be so modified." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-33", "text": "Even if the manual annotation for a particular corpus is consistent when it comes to other factors such as tokenization or part of speech, developers of parsers sometimes change these guidelines to suit their needs." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-34", "text": "For example, users of the Charniak parser (Charniak, 2001) should add the AUX category to the PTB parts of speech and adjust their systems to account for the conversion of the word ain't into the tokens IS and n't." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-35", "text": "Similarly, tokenization decisions with respect to hyphens vary among different versions of the Penn Treebank, as well as different parsers based on these treebanks." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-36", "text": "Thus if a system uses multiple parsers, such differences must be accounted for." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-37", "text": "Differences that are not important for a particular application should be ignored (e.g., by merging alternative analyses)." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-38", "text": "For example, in the case of spurious attachment ambiguity, a system may need to either accept both as right answers or derive a common representation for both." 
}, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-39", "text": "Of course, many of the particular problems that result from spurious ambiguity can be accounted for in hind sight." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-40", "text": "Nevertheless, it is precisely this lack of a controlled environment which adds elements of spurious ambiguity." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-41", "text": "Using new processors or training on new treebanks can bring new instances of spurious ambiguity." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-42", "text": "----------------------------------" }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-43", "text": "**EXPERIMENTS AND EVALUATION**" }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-44", "text": "We ran GLARFers on both manually created treebanks and automatically produced parses for English, Chinese and Japanese." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-45", "text": "For each corpus, we created one or more answer keys by correcting system output." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-46", "text": "For this paper, we evaluate solely on the logic1 relations (the second column in figure 1.) Figure 2 lists our results for all three languages, based on treebank and parser input." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-47", "text": "As in ), we generated 4-tuples consisting of the following for each dependency: (A) the logic1 label (SBJ, OBJ, etc.), (B) its transparency (True or False), (C) The functor (a single word or a named entity); and (D) the argument (a single word or a named entity)." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-48", "text": "In the case of conjunction where there was no lexical conjunction word, we used either punctuation (commas or semi-colons) or the placeholder *NULL*. 
We then corrected these results by hand to produce the answer key-an answer was correct if all four members of the tuple were correct and incorrect otherwise." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-49", "text": "Table 2 provides the Precision, Recall and F-scores for our output." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-50", "text": "The F-T column indicates a modified F-score derived by ignoring the +/-Transparent distinction (resulting changes in precision, recall and F-score are the same)." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-51", "text": "For English and Japanese, an expert native speaking linguist corrected the output." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-52", "text": "For Chinese, several native speaking computational linguists shared the task." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-53", "text": "By checking compatibility of the answer keys with outputs derived from different sources (parser, treebank), we could detect errors and inconsistencies." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-54", "text": "We processed the following corpora." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-55", "text": "English: 86 sentence article (wsj 2300) from the Wall Street Journal PTB test corpus (WSJ); 46 sentence letter from Good Will (LET), the first 100 sentences of a switchboard telephone transcript (TEL) and the first 100 sentences of a narrative from the Charlotte Narrative and Conversation (NAR)." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-56", "text": "These samples are taken from the PTB WSJ Corpus and the SIGANN shared subcorpus of the OANC." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-57", "text": "The filenames are: 110CYL067, NapierDianne and sw2014." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-58", "text": "Chinese: a 20 sentence sample of text from the Penn Chinese Treebank (CTB) (Xue et al., 2005) ." 
}, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-59", "text": "Japanese: 20 sentences from the Kyoto Corpus (KYO) (Kurohashi and Nagao, 1998)" }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-60", "text": "----------------------------------" }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-61", "text": "**RUNNING THE GLARFER PROGRAMS**" }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-62", "text": "We use Charniak, UMD and KNP parsers (Charniak, 2001; Huang and Harper, 2009; Kurohashi and Nagao, 1998) , JET Named Entity tagger (Grishman et al., 2005; Ji and Grishman, 2006) and other resources in conjunction with languagespecific GLARFers that incorporate hand-written rules to convert output of these processors into a final representation, including logic1 structure, the focus of this paper." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-63", "text": "English GLARFer rules use Comlex (Macleod et al., 1998a) and the various NomBank lexicons (http:// nlp.cs.nyu.edu/meyers/nombank/) for lexical lookup." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-64", "text": "The GLARF rules implemented vary by language as follows." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-65", "text": "English: correcting/standardizing phrase boundaries and part of speech (POS); recognizing multiword expressions; marking subconstituents; labeling relations; incorporating NEs; regularizing infinitival, passives, relatives, VP deletion, predicative and numerous other constructions." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-66", "text": "Chinese: correcting/standardizing phrase boundaries and POS, marking subconstituents, labeling relations; regularizing copula constructions; incorporating NEs; recognizing dates and number expressions." 
}, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-67", "text": "Japanese: converting to PTB format; correcting/standardizing phrase boundaries and POS; labeling relations; processing NEs, double quote constructions, number phrases, common idioms, light verbs and copula constructions." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-68", "text": "----------------------------------" }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-69", "text": "**DISCUSSION**" }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-70", "text": "Naturally, the treebank-based system outperformed parse-based system." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-71", "text": "The Charniak parser for English was trained on the Wall Street Journal corpus and can achieve about 90% accuracy on similar corpora, but lower accuracy on other genres." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-72", "text": "Differences between treebank and parser results for English were higher for LET and NAR genres than for the TEL because the system is not currently designed to handle TEL-specific features like disfluencies." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-73", "text": "All processors were trained on or initially designed for news corpora." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-74", "text": "Thus corpora out of this domain usually produce lower results." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-75", "text": "LET was easier as it consisted mainly of short simple sentences." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-76", "text": "In , we evaluated our results on 40 Japanese sentences from the JENAAD corpus (Utiyama and Isahara, 2003) and achieved a higher F-score (90.6%) relative to the Kyoto corpus, as JENAAD tends to have fewer long complex sentences." 
}, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-77", "text": "By using our answer key for multiple inputs, we discovered errors and consequently improved the quality of the answer keys." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-78", "text": "However, at times we were also compelled to fork the answer keys-given multiple correct answers, we needed to allow different answer keys corresponding to different inputs." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-79", "text": "For English, these items represent approximately 2% of the answer keys (there were a total Figure 3 lists examples of answer key divergences that we have found: (1) alternative tokenizations; (2) spurious differences in attachment and conjunction scope; and (3) ambiguities specific to our framework." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-80", "text": "Examples 1 and 2 reflect different treatments of hyphenation and contractions in treebank specifications over time." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-81", "text": "Parsers trained on different treebanks will either keep hyphenated words together or separate more words at hyphens." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-82", "text": "The Treebank treatment of can't regularizes so that (can need not be differentiated from ca), whereas the parser treatment makes maintaining character offsets easier." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-83", "text": "In example 3, the Japanese parser recognizes a single word whereas the treebank divides it into a prefix plus stem." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-84", "text": "Example 4 is a case of differences in character encoding (zero)." 
}, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-85", "text": "Example 5 is a common case of spurious attachment ambiguity for English, where a transparent noun takes an of PP complement-nouns such as form, variety and thousands bear the feature transparent in the NOMLEX-PLUS dictionary (a NomBank dictionary based on NOMLEX (Macleod et al., 1998b) )." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-86", "text": "The relative clause attaches either to the noun thousands or people and, therefore, the subject gap of the relative is filled by either thousands or people." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-87", "text": "This ambiguity is spurious since there is no meaningful distinction between these two attachments." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-88", "text": "Example 6 is a case of attachment ambiguity due to a support construction (Meyers et al., 2004) ." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-89", "text": "The recipient of the gift will be Goodwill regardless of whether the PP is attached to give or gift." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-90", "text": "Thus there is not much sense in marking one attachment more correct than the other." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-91", "text": "Example 7 is a case of conjunction ambiguity-the context does not make it clear whether or not the pearls are part of a necklace or just the beads are." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-92", "text": "The distinction is of little consequence to the understanding of the narrative." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-93", "text": "Example 8 is a case in which our grammar handles a case ambiguously: the prenominal adjective can be analyzed either as a simple noun plus adjective phrase meaning various businesses or as a noun plus relative clause meaning businesses that are varied." 
}, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-94", "text": "Example 9 is a common case in Chinese where the verb/noun distinction, while unclear, is not crucial to the meaning of the phraseunder either interpretation, 5 billion was exported." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-95", "text": "----------------------------------" }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-97", "text": "We have discussed challenges of automatic annotation when transducers of other annotation schemata are used as input." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-98", "text": "Models underlying different transducers approximate the original annotation in different ways, as do transducers trained on different corpora." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-99", "text": "We have found it necessary to allow for multiple correct answers, due to such differences, as well as, genuine and spurious ambiguities." }, { "sent_id": "1e232f9dfa7d499d1ba39fcebf3d1a-C001-100", "text": "In the future, we intend to investigate automatic ways of identifying and handling spurious ambiguities which are predictable, including examples like 5,6 and 7 in figure 3 involving transparent functors." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "1e232f9dfa7d499d1ba39fcebf3d1a-C001-7" ], [ "1e232f9dfa7d499d1ba39fcebf3d1a-C001-34" ] ], "cite_sentences": [ "1e232f9dfa7d499d1ba39fcebf3d1a-C001-7", "1e232f9dfa7d499d1ba39fcebf3d1a-C001-34" ] }, "@USE@": { "gold_contexts": [ [ "1e232f9dfa7d499d1ba39fcebf3d1a-C001-62" ] ], "cite_sentences": [ "1e232f9dfa7d499d1ba39fcebf3d1a-C001-62" ] } } }, "ABC_ab919bfae9ddf780cadcd491fe0a9b_45": { "x": [ { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-27", "text": "**BACKGROUND**" }, { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-2", "text": "Grammar-based surface realizers require inputs compatible with their reversible, constraint-based grammars, including a proper representation of unbounded dependencies and coordination." }, { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-3", "text": "In this paper, we report on progress towards creating realizer inputs along the lines of those used in the first surface realization shared task that satisfy this requirement." }, { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-4", "text": "To do so, we augment the Universal Dependencies that result from running the Stanford Dependency Converter on the Penn Treebank with the unbounded and coordination dependencies in the CCGbank, since only the latter takes the Penn Treebank's trace information into account." }, { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-5", "text": "An evaluation against gold standard dependencies shows that the enhanced dependencies have greatly enhanced recall with moderate precision." }, { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-6", "text": "We conclude with a discussion of the implications of the work for a second realization shared task." 
}, { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-7", "text": "----------------------------------" }, { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-9", "text": "Surface realization systems employing reversible, broad coverage constraint-based grammars together with statistical ranking models have achieved impressive results in multiple languages, using a variety of formalisms (HPSG, TAG, LFG, CCG) ." }, { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-10", "text": "However, these systems all require somewhat different inputs, making comparative evaluation difficult." }, { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-11", "text": "In the first surface realization shared task (Belz et al., 2011, henceforth SR-11) , which aimed to ameliorate these difficulties, attempts to use grammar-based realizers were unsuccessful, as converting shared task inputs to systemnative inputs turned out to be more difficult than anticipated." }, { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-12", "text": "Subsequently, Narayan & Gardent (2012) demonstrated that grammarbased systems can be substantially improved with error mining techniques, and Gardent and Narayan (2013) showed that augmenting the (shallow) SR-11 representation of coordination to include shared dependencies can benefit grammar-based realizers." }, { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-13", "text": "White (2014) then showed that even better results can be achieved by inducing a grammar (Kwiatkowski et al., 2011; Artzi and Zettlemoyer, 2013) that is directly compatible with (an enhanced version of) the SR-11 inputs." }, { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-14", "text": "However, as explained below, subsequent analysis revealed substantial remaining issues with the data, which this paper takes a step towards addressing." 
}, { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-15", "text": "A common thread in work on reversible, constraint-based grammars is emphasis on properly representing unbounded dependencies and coordination." }, { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-16", "text": "For parsing, this emphasis has been shown to pay off in improved recall of unbounded dependencies (Rimell et al., 2009; Nguyen et al., 2012; Oepen et al., 2014) ." }, { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-17", "text": "For realization, however, it remains an open question as to whether approaches based on constraintbased grammars can likewise yield an empirical payoff, given the continuing lack of a common input representation that adequately treats unbounded dependencies and coordination, as these grammars require." }, { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-18", "text": "With this issue in mind, White (2014) experimented with a version of the shallow SR-11 inputs (created by Richard Johansson) which included extra dependencies for unbounded dependencies and coordination, yielding dependency graphs extending core dependency trees." }, { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-19", "text": "Unlike the rewrite rules employed by Gardent and Narayan (2013) , the extra dependencies were derived from the gold traces in the Penn Treebank (Marcus et al., 1993, PTB) , which is necessary to adequately handle right node raising and relativization." }, { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-20", "text": "However, this version was still found to be incomplete, in particular because it was missing cases where the extra dependencies are encoded structurally in the PTB." }, { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-21", "text": "Since then, Universal Dependencies (Nivre et al., 2016, UDs) , which aim to represent syntactic dependencies similarly across languages, have become increasingly prominent." 
}, { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-22", "text": "Building on the enhanced Stanford dependencies for English (de Marneffe et al., 2013)-which were designed to properly represent unbounded dependencies in dependency graphsenhanced UDs for English have been partially implemented in the Stanford Dependency Converter (Schuster and Manning, 2016, SDC) ." }, { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-23", "text": "The SDC transforms automatic or gold PTBstyle trees into UDs; unfortunately, however, it was not designed to take traces into account, and thus the treatment of unbounded dependencies and coordination is only heuristic." }, { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-24", "text": "To address this impasse, in this paper we report on progress towards creating SR-11-style realizer inputs that are both based on enhanced UDs and which accurately represent unbounded dependencies and coordination." }, { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-25", "text": "To do so, we augment the UDs that result from running the SDC on the PTB with the dependencies in the CCGbank (Hockenmaier and Steedman, 2007) , since the latter includes lexicalized dependencies derived from gold PTB traces." }, { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-26", "text": "----------------------------------" }, { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-28", "text": "Figures 1-2 show an example where the CCGbank preserves the information provided by the trace in a free relative clause along with a crucial structurally encoded dependency." }, { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-29", "text": "In Figure 1 (left) , the unbounded dependency between what and achieve is annotated via a trace in the PTB." }, { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-30", "text": "Figure 1 (right) shows the SDC output for the sentence." 
}, { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-31", "text": "While the SDC manages to capture the unbounded dependency in this case, what is not recognized as the head of the free relative clause and there is no direct dependency from the copula to what, contrary to de Marneffe et al.'s (2013) specifications." }, { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-32", "text": "The inadequacy of the representation here-which is essentially the same as the SR-11 representation for the sentence-has serious implications for realization, as it will be difficult for any realizer to determine that what should appear at the start of the free relative clause rather than following achieve, where direct objects would normally appear (or perhaps sentence initially)." }, { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-33", "text": "By contrast, Figure 2 shows how the Combinatory Categorial Grammar (Steedman, 2000; Steedman and Baldridge, 2011, CCG)" }, { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-34", "text": "It is what federal support should try hardest to achieve ." }, { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-35", "text": "It is what federal support should try hardest to achieve 3 Using the CCGbank to Augment PTB Universal Dependencies" }, { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-36", "text": "Unlike UDs, CCGbank dependencies are numeric and depend on the lexical category of the functor (e.g. what fills the second argument of the category for achieve in Figure 2 )." }, { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-37", "text": "To determine UD labels, we employ a maxent classifier taking information from CCGbank as input." }, { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-38", "text": "Comparing the CCGbank and SDC output, the classifier is trained where their dependencies overlap and predicts both a label and headdependent direction." 
}, { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-39", "text": "Features used are functor and argument categories; functor and argument tokens; functor and argument POS tags; and functor and argument relative directionality." }, { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-40", "text": "Our system for augmenting the SDC's PTB output begins by combining the SDC basic and enhanced output, since the basic representation does not skip words while the enhanced representation already includes many correct extra dependencies." }, { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-41", "text": "The system then scans the SDC output and CCGbank for 3 triggers: (i) shared arguments in coordination (e.g. shared objects in right node raising), (ii) CCGbank unbounded dependency annotations, and (iii) underspecified SDC dep relations (i.e. instances where the SDC cannot determine the appropriate dependency relation)." }, { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-42", "text": "In each case, the maxent classifier is used to predict UD labels for the CCGbank dependencies in question." }, { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-43", "text": "Predictions are only added to the corpus if there is no (nondep) SDC dependency already present." }, { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-44", "text": "In addition, ccomp and csubj relations that co-occur with free relatives are remapped to make the relative the head of the clause." }, { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-45", "text": "Finally, structural changes for coordination and compounding along SR-11 lines are carried out." 
}, { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-46", "text": "----------------------------------" }, { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-47", "text": "**EVALUATION**" }, { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-48", "text": "The system's recall was evaluated on Rimell et al.'s (2009) unbounded dependency corpus, a hand-curated corpus with gold annotations for constructions including object free relatives, right node raising, subject extraction, and object extraction." }, { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-49", "text": "During the creation of CCGbank, some problematic sentences involving gapping were left out of the CCGbank." }, { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-50", "text": "As a result, we evaluate the system using four different criteria: with and without the skipped CCG sentences, and with both exact and unlabeled matches." }, { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-51", "text": "Tables 1 and 2 show significant improvements across the board over the SDC." }, { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-52", "text": "Precision was evaluated by manually examining 401 predictions from the system's output to see whether the proposed edits adhered to UD specifications." }, { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-53", "text": "Precision from the converter is 70% for exact label matches and 91% for unlabeled matches." }, { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-54", "text": "----------------------------------" }, { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-55", "text": "**DISCUSSION AND FUTURE WORK**" }, { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-56", "text": "We have adapted and extended White's (2014) CCG induction algorithm to work with the augmented UDs that our system produces." 
}, { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-57", "text": "White's algorithm assumed CCG phrases are only rarely projected from a dependent rather than a heade.g., where an NP is projected from a determiner, which is a dependent of the head nounand thus could be easily handled by handcrafted lexical entries." }, { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-58", "text": "Since such cases are very common in UDs, the algorithm needed to be extended to induce such categories automatically." }, { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-59", "text": "Once this was done, the algorithm yielded complete derivations in most cases (approx. 94%)." }, { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-60", "text": "In particular, derivations were induced that captured all but one of the extra dependencies in Table 1 that appear in the CCGbank dev section, and realization experiments with the UD-based representations are underway." }, { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-61", "text": "With the augmented UD reported in this paper, we expect the resulting dependency graphs to serve as a promising basis for a second surface realization challenge (with using just the basic dependency trees as an option)." }, { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-62", "text": "A remaining obstacle, however, are the dependent cluster and gapping cases in the PTB, for which the SDC produces rather degenerate output." }, { "sent_id": "ab919bfae9ddf780cadcd491fe0a9b-C001-63", "text": "A promising avenue here would be to adapt Gardent and Narayan's (2013) method of enhancing the SR-11 representations for these cases." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "ab919bfae9ddf780cadcd491fe0a9b-C001-13" ], [ "ab919bfae9ddf780cadcd491fe0a9b-C001-18" ] ], "cite_sentences": [ "ab919bfae9ddf780cadcd491fe0a9b-C001-13", "ab919bfae9ddf780cadcd491fe0a9b-C001-18" ] }, "@MOT@": { "gold_contexts": [ [ "ab919bfae9ddf780cadcd491fe0a9b-C001-13", "ab919bfae9ddf780cadcd491fe0a9b-C001-14" ] ], "cite_sentences": [ "ab919bfae9ddf780cadcd491fe0a9b-C001-13" ] }, "@EXT@": { "gold_contexts": [ [ "ab919bfae9ddf780cadcd491fe0a9b-C001-56" ] ], "cite_sentences": [ "ab919bfae9ddf780cadcd491fe0a9b-C001-56" ] } } }, "ABC_8c530e0c9f7256ac44b1a2adfaf6a9_45": { "x": [ { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-23", "text": "----------------------------------" }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-2", "text": "Emotion identification is a process of identifying the emotions automatically from text, speech or images." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-3", "text": "Emotion identification from textual conversations is a challenging problem due to absence of gestures, vocal intonation and facial expressions." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-4", "text": "It enables conversational agents, chat bots and messengers to detect and report the emotions to the user instantly for a healthy conversation by avoiding emotional cues and miscommunications." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-24", "text": "**RELATED WORK**" }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-5", "text": "We have adopted a Seq2Seq deep neural network to identify the emotions present in the text sequences." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-6", "text": "Several layers namely embedding layer, encoding-decoding layer, softmax layer and a loss layer are used to map the sequences from textual conversations to the emotions namely Angry, Happy, Sad and Others." 
}, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-7", "text": "We have evaluated our approach on the EmoContext@SemEval2019 dataset and we have obtained the micro-averaged F1 scores as 0.595 and 0.6568 for the pre-evaluation dataset and final evaluation test set respectively." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-8", "text": "Our approach improved the base line score by 7% for final evaluation test set." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-9", "text": "----------------------------------" }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-11", "text": "Emotion identification is a process of identifying the emotions automatically from different modalities." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-12", "text": "Several research work have been presented on detecting emotions from text (Rao, 2016; AbdulMageed and Ungar, 2017; Samy et al., 2018; AlBalooshi et al., 2018; Gaind et al., 2019) , speech (Arias et al., 2014; Amer et al., 2014; Lim et al., 2016) , images (Shan et al., 2009; Ko, 2018; Ayvaz et al., 2017; Faria et al., 2017; Mohammadpour et al., 2017) and video (Matsuda et al., 2018; Hossain and Muhammad, 2019; Kahou et al., 2016) ." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-13", "text": "Emotion understanding from video may be easier by analyzing the body language, speech variations and facial expressions." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-14", "text": "However, identification of emotions from textual conversations is a challenging problem due to absence of above factors." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-15", "text": "Emotions in text are not only identified by its cue words such as happy, good, bore, hurt, hate and fun, but also the presence of interjections (e.g. \"whoops\"), emoticons (e.g. \":)\"), idiomatic expressions (e.g. \"am in cloud nine\"), metaphors (e.g. 
\"sending clouds\") and other descriptors mark the existence of emotions in the conversational text." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-16", "text": "Recently, the growth of text messaging applications for communications require emotion detection from conversation transcripts." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-17", "text": "This helps conversational agents, chat bots and messengers to avoid emotional cues and miscommunications by detecting the emotions during conversation." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-18", "text": "EmoContext@SemEval2019 shared task (Chatterjee et al., 2019) goal is to encourage more research in the field of contextual emotion detection in textual conversations." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-19", "text": "The shared task focuses on identifying emotions namely Angry, Happy, Sad and Others from conversation with three turns." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-20", "text": "Since, emotion detection is a classification problem, research works have been carried out by using machine learning with lexical features (Sharma et al., 2017) and deep learning with deep neural network (Phan et al., 2016) and convolutional neural network (Zahiri and Choi, 2018) to detect the emotions from text." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-21", "text": "However, we have adopted Seq2Seq deep neural network for detecting the emotions from textual conversations which include sequence of phrases." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-22", "text": "This paper elaborates our Seq2Seq approach for identifying emotions from text sequences." 
}, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-25", "text": "This section reviews the research work reported for emotion detection from text / tweets (Perikos and Hatzilygeroudis, 2013; Rao, 2016; AbdulMageed and Ungar, 2017; Samy et al., 2018; AlBalooshi et al., 2018; Gaind et al., 2019 ) and text conversations (Phan et al., 2016; Sharma et al., 2017; Zahiri and Choi, 2018) ." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-26", "text": "Sharma et al. (2017) proposed a methodology to create a lexicon -a vocabulary consisting of positive and negative expressions." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-27", "text": "This lexicon is used to assign an emotional value which is derived from a fuzzy set function." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-28", "text": "Gaind et al. (2019) classified twitter text into emotion by using textual and syntactic features with SMO and decision tree classifiers." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-29", "text": "The tweets are annotated manually by Liew and Turtle (2016) with 28 fine-grained emotion categories and experimented with different machine learning algorithms." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-30", "text": "Results show that SVM and BayesNet classifiers produce consistently good performance for fine-grained emotion classification." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-31", "text": "Phan et al. (2016) developed an emotion lexicon from WordNet." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-32", "text": "The conversation utterances are mapped to the lexicons and 22 features are extracted using rule-based algorithm." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-33", "text": "They used fully connected deep neural network to train and classify the emotions." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-34", "text": "TF-IDF with handcrafted NLP features were used by AlBalooshi et al. 
(2018) in logistic regression, XGBClassifier and CNN+LSTM models for emotion classification." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-35", "text": "The authors found that the logistic regression performed better than the deep neural network model." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-36", "text": "All the models discussed above considered the fine-grained emotion categories and used the Twitter data to create a manually annotated corpus." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-37", "text": "These models used rule-based or machine learning based algorithms to classify the emotion category." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-38", "text": "A new C-GRU (Context-aware Gated Recurrent Unit), a variant of the LSTM, was proposed by Samy et al. (2018); it extracts the contextual information (topics) from tweets and uses them as an extra layer to determine the sentiments conveyed by the tweet." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-39", "text": "The topic vectors, resembling an image, are fed to a CNN to learn the contextual information." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-40", "text": "Abdul-Mageed and Ungar (2017) built a very large dataset with 24 fine-grained types of emotions and classified the emotions using a gated RNN." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-41", "text": "Instead of using a basic CNN, a new recurrent sequential CNN is used by Zahiri and Choi (2018)." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-42", "text": "They proposed several sequence-based convolutional neural network (SCNN) models with attention to facilitate sequential dependencies among utterances." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-64", "text": "The input sequences and the target label are converted into their corresponding word embeddings by the embedding layer." 
}, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-43", "text": "All the models discussed above show that the emotion prediction can be handled using variants of deep neural network such as C-GRU, G-RNN and Sequential-CNN." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-44", "text": "The commonality between the above models are the variations of RNN or LSTM." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-45", "text": "This motivated us to use the Sequenceto-Sequence (Seq2Seq) model which consists of stacked LSTMs to predic the emotion labels conditioned on the given utterance sequences." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-46", "text": "----------------------------------" }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-47", "text": "**DATA AND PREPROCESSING**" }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-48", "text": "We have used the dataset provided by EmoContext@SemEval2019 shared task in our approach." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-49", "text": "The dataset consists of training set, development set and test set with 30160, 2755 and 5509 instances respectively." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-50", "text": "The dataset contains sequence id, text sequences with three turns which include user utterance along with the context, followed by emotion class label." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-51", "text": "The task is to label the user utterance as one of emotion class: happy, sad, angry or others." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-52", "text": "The textual sequences contain many short words." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-53", "text": "In preprocessing, these words are replaced with original or full word." 
}, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-54", "text": "We resort to build a look-up table which replace ''m', with 'am', ''re' with 'are', ''ere' with 'were', 'n't' with 'not', ''ll' with 'will', ''d' with 'would', 'what's' with 'what is' and 'it's' with 'it is'." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-55", "text": "The sequences are converted to lower case." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-56", "text": "Also, the three turns/sentences are delimited with \"eos\" in the input sequences." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-57", "text": "----------------------------------" }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-58", "text": "**METHODOLOGY**" }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-59", "text": "Seq2Seq model is the most popular model in learning the target sequence conditioned on the source sequence." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-60", "text": "The Seq2Seq model is adopted to map the sequences of n words with a target label (n:1 mapping)." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-61", "text": "This model has an embedding layer, an encoder, a decoder and a projection layer as shown in Figure 1 ." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-62", "text": "Once the dialogue sentences are preprocessed, the first three turns of each instance are considered as the input sequences w 1 , w 2 ,..,w n , and the corresponding label e is considered as the target sequence." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-63", "text": "For example, the given instance \"13 Bad Bad bad! That's the bad kind of bad. I have no gf sad\" is converted into input sequence \"bad eos bad bad that is the bad kind of bad eos i have no gf\" and target label \"sad\"." 
}, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-65", "text": "The vector representation for each word is derived at embedding layer by choosing a fixed vocabulary of size V for input sequences and target labels." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-66", "text": "Now, the encoder which uses Bi-LSTM, encode these embeddings into a fixed vector representa- Figure 1 : System Architecture tion s which also represents the summary of input sequences." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-67", "text": "Once the source sequences are encoded, the last hidden state of the encoder is used to initialize the decoder." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-68", "text": "The projection layer is fed with the tensors of the target output label." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-69", "text": "Given the hidden state h t , the decoder predicts the label e t ." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-70", "text": "However, h t and e t are conditioned on the previous output e t\u22121 and on the summary s of the input sequence." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-71", "text": "The projection layer is a dense matrix to turn the top hidden states of decoder to logit vectors of dimension V ." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-72", "text": "Given the logit values, the training loss is easily minimized by using standard SGD optimizer with a learning rate." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-73", "text": "The model is also trained with the attention mechanism, which computes the attention weight by comparing the current decoder hidden state with all encoder states." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-74", "text": "The detailed description of working principle about Seq2Seq model is described in (Sutskever et al., 2014) ." 
}, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-75", "text": "We have adopted Neural Machine Translation 1 code to implement our Seq2Seq deep neural network." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-76", "text": "Several variations have been implemented by varying the number of layers, units and attention mechanisms." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-77", "text": "It is evident from the earlier experiments (Sutskever et al., 2014; Thenmozhi et al., 2018 ) that bi-directional LSTM performs better for short text sequences." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-78", "text": "Hence, we have used it for encoding and decoding processes." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-79", "text": "The models were trained for 30000 steps with drop out 1 https://github.com/tensorflow/nmt (Sutskever et al., 2014; Bahdanau et al., 2014) and Scaled Luong (SL) (Luong et al., 2015 (Luong et al., , 2017 ." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-80", "text": "Since, the model was developed using deep learning technique, it does not require much of linguistic features such as stemming, case normalization and PoS in identifying the emotion cue words." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-81", "text": "These linguistic phenomena could be captured by the encoder RNNs in sequence-tosequence (Seq2Seq) model." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-82", "text": "The other statistical features such as the word frequency are also not considered as input to the model, because the presence of particular cue alone does not guarantee to detect emotions in the text." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-83", "text": "----------------------------------" }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-84", "text": "**RESULTS**" }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-85", "text": "Our approach is evaluated on EmoContext@SemEval2019 data set." 
}, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-86", "text": "During development, we have implemented our variations with and without end of sentence (EOS) delimiter." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-87", "text": "We have built the models using entire training set (No split) and train-validation splits (TV split)." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-88", "text": "27160 and 3000 instances from training data were considered as training and validation set in TV split." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-89", "text": "The performance was measured in terms of micro-averaged F1 score (F1\u00b5) for the three emotion classes namely Angry, Happy and Sad." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-90", "text": "We have submitted eleven runs for EmoContext@SemEval2019 shared task on pre-evaluation dataset." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-91", "text": "The results obtained for pre-evaluation dataset are given in Table 1 ." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-92", "text": "We observe from with TV split performs better than the model without split." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-93", "text": "The incorporation of delimiter text EOS also improved the performance of our approach." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-94", "text": "Further, the performance degrades with the increase in number of layers." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-95", "text": "Thus, 2 layered LSTM with TVsplit, EOS delimiter and Normed Bahdanau attention mechanism perform better on the pre-evaluation dataset of EmoContext@SemEval2019 and this architecture is considered for evaluating the final-evaluation test set." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-96", "text": "The final evaluation submissions are based upon the variations in TV split ratio and the number of units as 16, 32, 64, 128 and 256 ." 
}, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-97", "text": "For TV split 1, the development set (2755 instances) given by EmoContext@SemEval2019 was considered as a validation set." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-98", "text": "The other two TV splits are by keeping the validation set as 1/5 (TV split 2) and 1/3 (TV split 3) of training set." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-99", "text": "The results of our submissions on final evaluation test data are given in Table 2 ." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-100", "text": "It is observed from Table 2 that 64U TV split 1 model outperforms all the other models with 0.656752 F1\u00b5 score." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-101", "text": "This score is higher than the base line score with 7% improvement." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-102", "text": "Table 3 shows the class-wise performance of our models on final evaluation set." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-103", "text": "Our models perform better for Angry class than the other two classes namely Happy and Sad." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-104", "text": "----------------------------------" }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-105", "text": "**CONCLUSION**" }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-106", "text": "We have adopted a Seq2Seq deep neural network to identify the emotions present in the text se- quences." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-107", "text": "Our approach is evaluated on the EmoContext@SemEval2019 dataset." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-108", "text": "The input sequences are pre-processed by replacing the short hand notations and by introducing a delimiter string." 
}, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-109", "text": "The sequence is vectorized using word embeddings and given to bi-directional LSTM for encoding and decoding." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-110", "text": "We have implemented several variations by changing the parameters namely, number of layers, units, attention wrappers, with and without delimiter string and train-validation split." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-111", "text": "The performance is measured using microaveraged F1 score on three emotion class labels namely Angry, Happy and Sad." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-112", "text": "Our experiments on development set show that 2 layered LSTM with Normed Bahdanau attention mechanism with delimiter string and train-validation split performs better than all the other variations." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-113", "text": "Three variations of train-validation split ratio were experimented on final evaluation test data by varying the number of units with the best parameter values that are learnt during the development phase." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-114", "text": "64U TV split 1 model performs better than all the other runs we have submitted to the task." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-115", "text": "This model shows 7% improvement than the base line on final evaluation test set." }, { "sent_id": "8c530e0c9f7256ac44b1a2adfaf6a9-C001-116", "text": "Our Seq2Seq model can be improved further by incorporating the soft attention mechanism which uses joint distribution between attention and output layer (Shankar et al., 2018) ." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "8c530e0c9f7256ac44b1a2adfaf6a9-C001-20" ], [ "8c530e0c9f7256ac44b1a2adfaf6a9-C001-41" ] ], "cite_sentences": [ "8c530e0c9f7256ac44b1a2adfaf6a9-C001-20", "8c530e0c9f7256ac44b1a2adfaf6a9-C001-41" ] }, "@USE@": { "gold_contexts": [ [ "8c530e0c9f7256ac44b1a2adfaf6a9-C001-25" ] ], "cite_sentences": [ "8c530e0c9f7256ac44b1a2adfaf6a9-C001-25" ] }, "@MOT@": { "gold_contexts": [ [ "8c530e0c9f7256ac44b1a2adfaf6a9-C001-41", "8c530e0c9f7256ac44b1a2adfaf6a9-C001-43", "8c530e0c9f7256ac44b1a2adfaf6a9-C001-44", "8c530e0c9f7256ac44b1a2adfaf6a9-C001-45" ] ], "cite_sentences": [ "8c530e0c9f7256ac44b1a2adfaf6a9-C001-41" ] } } }, "ABC_2abfa447cea31af26d06d4325c94ac_45": { "x": [ { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-2", "text": "We integrate PropBank semantic role labels to an existing statistical parsing model producing richer output." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-3", "text": "We show conclusive results on joint learning and inference of syntactic and semantic representations." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-4", "text": "----------------------------------" }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-5", "text": "**INTRODUCTION**" }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-6", "text": "Recent successes in statistical syntactic parsing based on supervised techniques trained on a large corpus of syntactic trees (Collins, 1999; Charniak, 2000; Henderson, 2003) have brought the hope that the same approach could be applied to the more ambitious goal of recovering the propositional content and the frame semantics of a sentence." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-75", "text": "An indirect way of comparing our parser with semantic role labellers suggests itself." 
}, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-7", "text": "Moving towards a shallow semantic level of representation has immediate applications in question-answering and information extraction." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-8", "text": "For example, an automatic flight reservation system processing the sentence I want to book a flight from Geneva to New York will need to know that from Geneva indicates the origin of the flight and to New York the destination." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-9", "text": "(Gildea and Jurafsky, 2002 ) define this shallow semantic task as a classification problem where the semantic role to be assigned to each constituent is inferred on the basis of probability distributions of syntactic features extracted from parse trees." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-10", "text": "They use learning features such as phrase type, position, voice, and parse tree path." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-11", "text": "Consider, for example, a sentence such as The authority dropped at midnight Tuesday to $ 2.80 trillion (taken from section 00 of PropBank (Palmer et al., 2005) )." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-12", "text": "The fact that to $ 2.80 trillion receives a direction semantic label is highly correlated to the fact that it is a Prepositional Phrase (PP), that it follows the verb dropped, a verb of change of state requiring an end point, that the verb is in the active voice, and that the PP is in a certain tree configuration with the governing verb." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-13", "text": "All the recent systems proposed for semantic role labelling (SRL) follow this same assumption (CoNLL, 2005) ." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-14", "text": "The assumption that syntactic distributions will be predictive of semantic role assignments is based on linking theory." 
}, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-15", "text": "Linking theory assumes the existence of a hierarchy of semantic roles which are mapped by default on a hierarchy of syntactic positions." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-16", "text": "It also shows that regular mappings from the semantic to the syntactic level can be posited even for those verbs whose arguments can take several syntactic positions, such as psychological verbs, locatives, or datives, requiring a more complex theory." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-17", "text": "(See (Hale and Keyser, 1993; Levin and Rappaport Hovav, 1995) among many others.) If the internal semantics of a predicate determines the syntactic expressions of constituents bearing a semantic role, it is then reasonable to expect that knowledge about semantic roles in a sentence will be informative of its syntactic structure, and that learning semantic role labels at the same time as parsing will be beneficial to parsing accuracy." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-18", "text": "We present work to test the hypothesis that a current statistical parser (Henderson, 2003) can output rich information comprising both a parse tree and semantic role labels robustly, that is without any significant degradation of the parser's accuracy on the original parsing task." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-19", "text": "We achieve promising results both on the simple parsing task, where the accuracy of the parser is measured on the standard Parseval measures, and also on the parsing task where more complex labels comprising both syntactic labels and semantic roles are taken into account." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-20", "text": "These results have several consequences." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-21", "text": "First, we show that it is possible to build a single integrated system successfully." 
}, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-22", "text": "This is a meaningful achievement, as a task combining semantic role labelling and parsing is more complex than simple syntactic parsing." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-23", "text": "While the shallow semantics of a constituent and its structural position are often correlated, they sometimes diverge." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-24", "text": "For example, some nominal temporal modifiers occupy an object position without being objects, like Tuesday in the Penn Treebank representation of the sentence above." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-25", "text": "The indirectness of the relation is also confirmed by the difficulty in exploiting semantic information for parsing." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-26", "text": "Previous attempts have not been successful." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-27", "text": "(Klein and Manning, 2003) report a reduction in parsing accuracy of an unlexicalised PCFG from 77.8% to 72.9% in using Penn Treebank function labels in training." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-28", "text": "The two existing systems that use function labels sucessfully, either inherit Collins' modelling of the notion of complement (Gabbard, Kulick and Marcus, 2006) or model function labels directly (Musillo and Merlo, 2005) ." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-29", "text": "Furthermore, our results indicate that the proposed models are robust." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-30", "text": "To model our task accurately, additional parameters must be estimated." 
}, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-31", "text": "However, given the current limited availability of annotated treebanks, this more complex task will have to be solved with the same overall amount of data, aggravating the difficulty of estimating the model's parameters due to sparse data." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-32", "text": "----------------------------------" }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-33", "text": "**THE DATA AND THE EXTENDED PARSER**" }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-34", "text": "In this section we describe the augmentations to our base parsing models necessary to tackle the joint learning of parse tree and semantic role labels." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-35", "text": "PropBank encodes propositional information by adding a layer of argument structure annotation to the syntactic structures of the Penn Treebank (Marcus et al., 1993) ." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-36", "text": "Verbal predicates in the Penn Treebank (PTB) receive a label REL and their arguments are annotated with abstract semantic role labels A0-A5 or AA for those complements of the predicative verb that are considered arguments while those complements of the verb labelled with a semantic functional label in the original PTB receive the composite semantic role label AM-X, where X stands for labels such as LOC, TMP or ADV, for locative, temporal and adverbial modifiers respectively." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-37", "text": "PropBank uses two levels of granularity in its annotation, at least conceptually." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-38", "text": "Arguments receiving labels A0-A5 or AA do not express consistent semantic roles and are specific to a verb, while arguments receiving an AM-X label are supposed to be adjuncts, and the roles they express are consistent across all verbs." 
}, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-39", "text": "To achieve the complex task of assigning semantic role labels while parsing, we use a family of state-of-the-art history-based statistical parsers, the Simple Synchrony Network (SSN) parsers (Henderson, 2003) , which use a form of left-corner parse strategy to map parse trees to sequences of derivation steps." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-40", "text": "These parsers do not impose any a priori independence assumptions, but instead smooth their parameters by means of the novel SSN neural network architecture." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-41", "text": "This architecture is capable of inducing a finite history representation of an unbounded sequence of derivation steps, which we denote" }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-42", "text": "is computed from a set f of handcrafted features of the derivation move d i\u22121 , and from a finite set D of recent history representations h(d 1 , . . . , d j ), where j < i \u2212 1." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-43", "text": "Because the history representation computed for the move i \u2212 1 is included in the inputs to the computation of the representation for the next move i, virtually any information about the derivation history could flow from history representation to history representation and be used to estimate the probability of a derivation move." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-44", "text": "In our experiments, the set D of earlier history representations is modified to yield a model that is sensitive to regularities in structurally defined sequences of nodes bearing semantic role labels, within and across constituents." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-45", "text": "For more information on this technique to capture structural domains, see (Musillo and Merlo, 2005) where the technique was applied to function parsing." 
}, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-46", "text": "Given the hidden history representation h(d 1 , \u00b7 \u00b7 \u00b7 , d i\u22121 ) of a derivation, a normalized exponential output function is computed by the SSNs to estimate a probability distribution over the possible next derivation moves d i ." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-47", "text": "To exploit the intuition that semantic role labels are predictive of syntactic structure, we must pro-vide semantic role information as early as possible to the parser." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-48", "text": "Extending a technique presented in (Klein and Manning, 2003 ) and adopted in (Merlo and Musillo, 2005) for function labels with stateof-the-art results, we split some part-of-speech tags into tags marked with AM-X semantic role labels." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-49", "text": "As a result, 240 new POS tags were introduced to partition the original tag set which consisted of 45 tags." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-50", "text": "Our augmented model has a total of 613 nonterminals to represent both the PTB and PropBank labels, instead of the 33 of the original SSN parser." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-51", "text": "The 580 newly introduced labels consist of a standard PTB label followed by one or more PropBank semantic roles, such as PP-AM-TMP or NP-A0-A1." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-52", "text": "These augmented tags and the new non-terminals are included in the set f , and will influence bottomup projection of structure directly." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-53", "text": "These newly introduced fine-grained labels fragment our PropBank data." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-54", "text": "To alleviate this problem, we enlarge the set f with two additional binary features." 
}, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-55", "text": "One feature decides whether a given preterminal or nonterminal label is a semantic role label belonging to the set comprising the labels A0-A5 and AA." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-56", "text": "The other feature indicates if a given label is a semantic role label of type AM-X, or otherwise." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-57", "text": "These features allow the SSN to generalise in several ways." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-58", "text": "All the constituents bearing an A0-A5 and AA labels will have a common feature." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-59", "text": "The same will be true for all nodes bearing an AM-X label." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-60", "text": "Thus, the SSN can generalise across these two types of labels." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-61", "text": "Finally, all constituents that do not bear any label will now constitute a class, the class of the nodes for which these two features are false." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-62", "text": "----------------------------------" }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-63", "text": "**EXPERIMENTS AND DISCUSSION**" }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-64", "text": "Our extended semantic role SSN parser was trained on sections 2-21 and validated on section 24 from the PropBank." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-65", "text": "Testing data are section 23 from the CoNLL-2005 shared task (Carreras and Marquez, 2005) ." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-66", "text": "We perform two different evaluations on our model trained on PropBank data." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-67", "text": "We distinguish between two parsing tasks: the PropBank parsing task and the PTB parsing task." 
}, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-68", "text": "To evaluate the former parsing task, we compute the standard Parseval measures of labelled recall and precision of constituents, taking into account not only the 33 original labels, but also the newly introduced PropBank labels." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-69", "text": "This evaluation gives us an indication of how accurately and exhaustively we can recover this richer set of non-terminal labels." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-70", "text": "The results, computed on the testing data set from the PropBank, are shown in the PropBank column of Table 1 , first line." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-71", "text": "To evaluate the PTB task, we ignore the set of PropBank semantic role labels that our model assigns to constituents (PTB column of Table 1 , first line to be compared to the third line of the same column)." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-72", "text": "To our knowledge, no results have yet been published on parsing the PropBank." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-73", "text": "1 Accordingly, it is not possible to draw a straightforward quantitative comparison between our PropBank SSN parser and other PropBank parsers." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-74", "text": "However, state-of-theart semantic role labelling systems (CoNLL, 2005) use parse trees output by state-of-the-art parsers (Collins, 1999; Charniak, 2000) , both for training and testing, and return partial trees annotated with semantic role labels." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-76", "text": "2 We merge the partial trees output by a semantic role labeller with the output of the parser on which it was trained, and compute PropBank parsing performance measures on the resulting parse trees." 
}, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-77", "text": "The third line, PropBank column of Table 1 reports such measures summarised for the five best semantic role labelling systems (Punyakanok et al., 2005b; Haghighi et al., 2005; Pradhan et al., 2005; Surdeanu and Turmo, 2005) in the CoNLL 2005 shared task." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-78", "text": "These systems all use (Charniak, 2000) 's parse trees both for training and testing, as well as various other information sources including sets of n-best parse trees, chunks, or named entities." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-79", "text": "Thus, the partial trees output by these systems were merged with the parse trees returned by Charniak's parser (second line, PropBank column)." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-80", "text": "3 These results jointly confirm our initial hypothe- sis." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-81", "text": "The performance on the parsing task (PTB column) does not appreciably deteriorate compared to a current state-of-the-art parser, even if our learner can output a much richer set of labels, and therefore solves a considerably more complex problem, suggesting that the relationship between syntactic PTB parsing and semantic PropBank parsing is strict enough that an integrated approach to the problem of semantic role labelling is beneficial." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-82", "text": "Moreover, the results indicate that we can perform the more complex PropBank parsing task at levels of accuracy comparable to those achieved by the best semantic role labellers (PropBank column)." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-83", "text": "This indicates that the model is robust, as it has been extended to a richer set of labels successfully, without increase in training data." 
}, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-84", "text": "In fact, the limited availability of data is increased further by the high variability of the argumental labels A0-A5 whose semantics is specific to a given verb or a given verb sense." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-85", "text": "Methodologically, these initial results on a joint solution to parsing and semantic role labelling provide the first direct test of whether parsing is necessary for semantic role labelling (Gildea and Palmer, 2002; Punyakanok et al., 2005a) ." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-86", "text": "Comparing semantic role labelling based on chunked input to the better semantic role labels retrieved based on parsed trees, (Gildea and Palmer, 2002) conclude that parsing is necessary." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-87", "text": "In an extensive experimental investigation of the different learning stages usually involved in semantic role labelling, (Punyakanok et al., 2005a) find instead that sophisticated chunking can achieve state-of-the-art results." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-88", "text": "Neither of these pieces of work actually used a parser to do SRL." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-89", "text": "Their investigation was therefore limited to establishing the usefulness of syntactic features for the SRL task." }, { "sent_id": "2abfa447cea31af26d06d4325c94ac-C001-90", "text": "Our results do not yet indicate that parsing is beneficial to SRL, but they show that the joint task can be performed successfully." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "2abfa447cea31af26d06d4325c94ac-C001-6" ] ], "cite_sentences": [ "2abfa447cea31af26d06d4325c94ac-C001-6" ] }, "@USE@": { "gold_contexts": [ [ "2abfa447cea31af26d06d4325c94ac-C001-18" ], [ "2abfa447cea31af26d06d4325c94ac-C001-39" ] ], "cite_sentences": [ "2abfa447cea31af26d06d4325c94ac-C001-18", "2abfa447cea31af26d06d4325c94ac-C001-39" ] } } }, "ABC_ab08b0f3c8691852b3aab5e3575ebd_45": { "x": [ { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-2", "text": "We demonstrate SODA (Service Oriented Domain Adaptation) for efficient and scalable cross-domain microblog categorization which works on the principle of transfer learning." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-3", "text": "It is developed on a novel similarity-based iterative domain adaptation algorithm while extended with features such as active learning and interactive GUI to be used by business professionals." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-4", "text": "SODA demonstrates efficient classification accuracy on new collections while minimizing and sometimes eliminating the need for expensive data labeling efforts." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-5", "text": "SODA also implements an active learning (AL) technique to select informative instances from the new collection to seek annotations, if a small amount of labeled data is required by the adaptation algorithm." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-6", "text": "----------------------------------" }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-8", "text": "Online social media, such as Twitter.com, have become the de facto standard for sharing information, thoughts, ideas, personal feelings, daily happenings etc." 
}, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-9", "text": "which essentially led research and development in the field of social media analytics to flourish." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-10", "text": "Social media analytics provide actionable insights to business by analyzing huge amount of user generated content (UGC) (Sriram et al., 2010; Jo and Oh, 2011; He et al., 2012; Si et al., 2013; Nakov et al., 2013) ." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-11", "text": "Sentiment categorization, one of the common social media analytics task, segregates a collection of UGC into different buckets with positive, negative or neutral orientation (Liu and Zhang, 2012; * Work done at Xerox Research Centre India Thelwall et al., 2011; Bollen et al., 2009 )." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-75", "text": "2 The interactive UI of SODA is developed using Ruby on Rails framework." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-12", "text": "This information is used to aggregate statistics and identify trends which are helpful for many applications viz. Customer Care, Product Marketing, User Studies." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-13", "text": "Supervised machine learning (ML) techniques such as text categorization have played a key enabler role to classify microblogs into sentiment categories (Pang and Lee, 2008; Tan et al., 2009; Go et al., 2009; Fern\u00e1ndez et al., 2014) ." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-14", "text": "These are trained on a fraction of annotated data as per client provided label set e.g. {positive, negative, and neutral} for a product/service/domain 1 ." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-15", "text": "One of the obstacles towards rapid adoption of such systems is requirement of labeled tweets for developing MLbased models as it requires extensive human labeling efforts." 
}, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-16", "text": "Additionally, need of manual labeling slows down the process of categorization on high velocity social media which requires fast analytic insights." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-17", "text": "From our conversations with business professionals, we derived the need of a practical solution which would help them scale up across hundreds of collections and domains without the overhead of annotating data and building models from scratch every time for a new collection." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-18", "text": "In this paper, we demonstrate Service Oriented Domain Adaptation (SODA) which offers social media analytics as-a-service to the users." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-19", "text": "Specifically, it provides sentiment categorization as-aservice that allows users to efficiently analyze comments from any new collection without the over- head of manual annotations or re-training models." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-20", "text": "It thus enables faster wide-scale analysis within and across different domains/industries such as telecom, healthcare, finance etc." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-21", "text": "SODA is based on an iterative ensemble based adaptation technique (Bhatt et al., 2015) which gradually transfers knowledge from the source to the new target collection while being cognizant of similarity between the two collections." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-22", "text": "It has been extensively evaluated by business professionals in a user-trial and on a benchmark dataset." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-23", "text": "Figure 1 illustrates the architecture of SODA comprising three primary modules, 1) similarity, 2) domain adaptation, and 3) active learning." 
}, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-24", "text": "The first two modules use unlabeled data from the new collection while the optional third module helps in creating labeled data for enhanced classification performance." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-25", "text": "These modules are explained below." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-26", "text": "----------------------------------" }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-27", "text": "**SODA FEATURES**" }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-28", "text": "----------------------------------" }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-29", "text": "**SIMILARITY**" }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-30", "text": "In social media analytics, especially for sentiment categorization, there exist numerous collections about different products or services where labeled data is available and thus can be used to adapt to a new unlabeled collection." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-31", "text": "Given a target collection, the key question is to identify the best possible source collection to adapt from." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-32", "text": "The similarity module in SODA identifies the best adaptable source collection based on the similarity between the source and target collections." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-33", "text": "This is based on the observations from existing literature (Bhatt et al., 2015; Blitzer et al., 2007) which suggest that if the source and target collections are similar, the adaptation performance tends to be better than if the two collections are dissimilar." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-34", "text": "The similarity module in SODA is capable of computing different kinds of lexical, syntactic, and semantic similarities between unlabeled target and labeled source collections." 
}, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-35", "text": "For this demonstration on sentiment categorization from social media data, it measures cosine similarity between the comments in each collection and computes sim as the similarity score." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-36", "text": "----------------------------------" }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-37", "text": "**DOMAIN ADAPTATION**" }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-38", "text": "The heart of SODA is the adaptation module that works on two principles, generalization and adaptation." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-39", "text": "During generalization, it learns shared common representation (Blitzer et al., 2007; Ji et al., 2011; Pan et al., 2010) which minimizes the divergence between two collections." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-40", "text": "We leverage one of the widely used structural correspondence learning (SCL) approach (Blitzer et al., 2007) to compute shared representations." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-41", "text": "The idea adhered here is that a model learned on the shared feature representation using labeled data from the source collection will also generalize well on the target collection." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-42", "text": "Towards this, we learn a model (C S ) on the shared feature representation from the source collection, referred to as \"source classifier\"." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-43", "text": "C S is then used to predict labels for the pool of unlabeled instances from the target collection, referred to as P u , using the shared representations." 
}, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-44", "text": "All instances in P u which are predicted with a confidence (\u03b1 1 ) higher than a predefined threshold (\u03b8 1 ) are moved to the pool of pseudo-labeled target instances, referred to as P s ." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-45", "text": "We now learn a target domain model C T on P s using the target specific representation, referred to as \"target classifier\"." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-46", "text": "C T captures a separate view of the target instances than the shared representation and hence brings in discriminating target specific information which is useful for categorization in target collection." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-47", "text": "For further adaptation, the source (C S ) and target (C T ) classifiers are combined in a weighted ensemble (E) with w s and w t as the corresponding weights and iterate over the remaining unlabeled instances in P u ." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-48", "text": "In each iteration, the ensemble processes the remaining instances and iteratively adds confidently predicted instances to P s which are used to re-train/update C T ." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-49", "text": "This iterative process continues till all instances in P u are confidently labeled or a maximum number of iterations is reached." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-50", "text": "Transfer occurs within the ensemble where the source classifier progressively facilitates the learning of tar-78 end iterate." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-51", "text": "Output: Final Ct and updated weights w s and w t get classifier by providing pseudo labeled instances." 
}, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-52", "text": "The weights of the individual classifiers are updated, as a function of error (I(\u00b7)) and the similarity (sim) between the collections, which gradually shift the emphasis from source to the target classifier." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-53", "text": "Finally, the ensemble is used to predict labels for future unseen instances in the target collection." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-54", "text": "Algorithm 1 summarizes our approach (refer (Bhatt et al., 2015) for more details)." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-55", "text": "----------------------------------" }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-56", "text": "**ACTIVE LEARNING**" }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-57", "text": "SODA also implements an active learning module to allow users to annotate a few selected informative comments from the target collection." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-58", "text": "These comments are selected using cross entropy difference (CED) (Axelrod et al., 2011) such that the difference with source collection and the similarity with target collection is maximized." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-59", "text": "It selects comment(s) from target collection that have low CED score i.e. comments that have high entropy with respect to source H S (\u00b7) and low entropy with respect to target collection H T (\u00b7) as in Equation (1)." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-60", "text": "Note, this active learning module is optional and should be used when the adaptation performance with unlabeled instances is not satisfactory." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-61", "text": "More and more instances can be annotated in multiple rounds till a satisfactory performance is achieved." 
}, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-62", "text": "These annotated instances are used to build a stronger target classifier for the ensemble based adaptation algorithm." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-63", "text": "----------------------------------" }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-64", "text": "**DESIGN AND INTERNALS**" }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-65", "text": "Figure 2(a) illustrates the interactive user interface (UI) of SODA where one can select a new target collection for the analysis task (i.e. sentiment categorization)." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-66", "text": "For a new target collection, it identifies relevant adaptable source collections based on their similarity." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-67", "text": "One can select any of the candidate source collections (selected collection highlighted in Figure 2(a) ) and adapt." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-68", "text": "Figure 2(b) shows the performance report along with the predicted comments from the target collection." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-69", "text": "User evaluates the adaptation performance in unlabeled target collection by analyzing the predicted comments and decides whether to annotate additional comments in Figure 3 : The effect of labeled comments on the performance while adapting from Coll-1 \u2192 Coll-6." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-70", "text": "target collection? If yes, Figure 2 (c) lists a few informative comments selected using the active learning module to seek annotations." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-71", "text": "One can mark these comments as positive, negative or neutral and subsequently adapt using these labeled instances from the target collection." 
}, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-72", "text": "Figure 2 (b) also shows the adaptation performance with a few labeled instances in the target collection." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-73", "text": "One can continue annotating more instances in the target collection until satisfactory performance is achieved." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-74", "text": "For more detailed demonstration, please refer to the video." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-76", "text": "All collections are managed in MySQL server." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-77", "text": "All three modules in SODA fetch data from the server and write the output back to the server." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-78", "text": "All modules work in real time enabling the system to be highly responsive to the user." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-79", "text": "The application is hosted on Amazon AWS as RESTful web services using Java Jersey (Tomcat server) that act as a bridge between the UI and back end." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-80", "text": "----------------------------------" }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-81", "text": "**USER TRIAL & EXPERIMENTAL RESULTS**" }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-82", "text": "To evaluate the overall experience, a user trial was conducted where several business professionals provided feedback on SODA." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-83", "text": "The objective was to evaluate the overall usability, reduction in required efforts, and the performance on the new target collections." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-84", "text": "The overall evaluation rated SODA 5 on usability and 4 for reduction in efforts (1 being worst & 5 being the best)." 
}, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-85", "text": "Table 1 reports the classification accuracy of SODA with few labeled comments from the target collection (ranging from 0 to 100)." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-86", "text": "It also reports the performance of the in-domain classifier which is trained and tested on data from the same collection." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-87", "text": "Coll-1 to Coll-8 refer to collections pertaining to marketing & sales, comcast support, 2 https://www.youtube.com/watch?v= zKnP5QEHVAE Figure 3 compares the effect of adding labeled comments in batches of 25 comments at-a-time." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-88", "text": "When there is no labeled data in the target collection, in-domain classifier can not be applied while SODA still yields good classification accuracy." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-89", "text": "Moreover, SODA consistently performs better than the in-domain classifier with same amount of labeled data." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-90", "text": "We also evaluated the performance of domain adaptation (DA) module of SODA on the Amazon review dataset (Blitzer et al., 2007) which is a benchmark dataset for sentiment categorization." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-91", "text": "It has 4 domains, namely, books(B), dvds(D), electronics(E), and kitchen(K) each with 2000 reviews divided equally into positive and negative reviews." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-92", "text": "Table 2 shows that DA module of SODA outperforms 1) a widely used domain adaptation technique , namely, structural correspondence learning (SCL) (Blitzer et al., 2007; Blitzer et al., 2006) , 2) the baseline (BL) where a classifier trained on one domain is applied on another domain, and 3) the in-domain classifier." 
}, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-93", "text": "Note that in Table 2 , the performance of DA module of SODA is reported when it does not use any labeled instances from the target domain." }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-94", "text": "----------------------------------" }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-95", "text": "**CONCLUSION**" }, { "sent_id": "ab08b0f3c8691852b3aab5e3575ebd-C001-96", "text": "We demonstrated SODA for efficient microblog categorization on new social media collections with minimum (or sometimes no) need of manual annotations; thus, enabling faster and efficient analytics." } ], "y": { "@USE@": { "gold_contexts": [ [ "ab08b0f3c8691852b3aab5e3575ebd-C001-21" ], [ "ab08b0f3c8691852b3aab5e3575ebd-C001-33" ], [ "ab08b0f3c8691852b3aab5e3575ebd-C001-54" ] ], "cite_sentences": [ "ab08b0f3c8691852b3aab5e3575ebd-C001-21", "ab08b0f3c8691852b3aab5e3575ebd-C001-33", "ab08b0f3c8691852b3aab5e3575ebd-C001-54" ] }, "@BACK@": { "gold_contexts": [ [ "ab08b0f3c8691852b3aab5e3575ebd-C001-21" ], [ "ab08b0f3c8691852b3aab5e3575ebd-C001-33" ], [ "ab08b0f3c8691852b3aab5e3575ebd-C001-54" ] ], "cite_sentences": [ "ab08b0f3c8691852b3aab5e3575ebd-C001-21", "ab08b0f3c8691852b3aab5e3575ebd-C001-33", "ab08b0f3c8691852b3aab5e3575ebd-C001-54" ] } } }, "ABC_7404a4e4e23eea9663b580f9959689_45": { "x": [ { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-58", "text": "**DATA SETUP**" }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-82", "text": "As part of future work, we expect a technique that automatically identifies the mistransliterations would lead to improved transliteration quality." }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-2", "text": "This paper presents a transliteration system based on pair Hidden Markov Model (pair HMM) training and Weighted Finite State Transducer (WFST) techniques." 
}, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-3", "text": "Parameters used by WFSTs for transliteration generation are learned from a pair HMM." }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-4", "text": "Parameters from pair-HMM training on English-Russian data sets are found to give better transliteration quality than parameters trained for WFSTs for corresponding structures." }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-5", "text": "Training a pair HMM on English vowel bigrams and standard bigrams for Cyrillic Romanization, and using a few transformation rules on generated Russian transliterations to test for context improves the system's transliteration quality." }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-6", "text": "----------------------------------" }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-8", "text": "Machine transliteration is the automatic transformation of a word in a source language to a phonetically equivalent word in a target language that uses a different writing system." }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-9", "text": "Transliteration is important for various Natural Language Processing (NLP) applications including: Cross Lingual Information Retrieval (CLIR), and Machine Translation (MT)." }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-10", "text": "This paper introduces a system that utilizes parameters learned for a pair Hidden Markov Model (pair HMM) in a shared transliteration generation task 1 ." }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-11", "text": "The pair HMM has been used before (Mackay and Kondrak, 2005; Wieling et al., 2007) for string similarity estimation, and is based on the notion of string Edit Distance (ED)." 
}, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-12", "text": "String ED is defined here as the total edit cost incurred in transforming a source language string (S) to a target language string (T) through a sequence of edit operations." }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-34", "text": "A Weighted Finite State Transducer is a finite automaton whose state transitions are labeled with input and output elements and weights that express the level of relationship between the input and output elements." }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-13", "text": "The edit operations include: (M)atching an element in S with an element in T; (I)nserting an element into T, and (D)eleting an element in S. 1 The generation task is part of the NEWS 2009 machine transliteration shared task (Li et al., 2009) Based on all representative symbols used for each of the two languages, emission costs for each of the edit operations and transition parameters can be estimated and used in measuring the similarity between two strings." }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-14", "text": "To generate transliterations using pair HMM parameters, WFST (Graehl, 1997) techniques are adopted." }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-15", "text": "Transliteration training is based mainly on the initial orthographic representation and no explicit phonetic scheme is used." }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-16", "text": "Instead, transliteration quality is tested for different bigram combinations including all English vowel bigram combinations and n-gram combinations specified for Cyrillic Romanization by the US Board on Geographic Names and British Permanent Committee on Geographic Names (BGN/PCGN)." }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-17", "text": "However, transliteration parameters can still be estimated for a pair HMM when a particular phonetic representation scheme is used." 
}, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-18", "text": "The quality of transliterations generated using pair HMM parameters is evaluated against transliterations generated from training WFSTs and transliterations generated using a Phrase-based Statistical Machine Translation (PBSMT) system." }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-19", "text": "Section 2 describes the components of the transliteration system that uses pair HMM parameters; section 3 gives the experimental set up and results associated with the transliterations generated; and section 4 concludes the paper." }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-20", "text": "----------------------------------" }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-21", "text": "**MACHINE TRANSLITERATION SYSTEM**" }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-22", "text": "The transliteration system comprises of a training and generation components (Figure 1 )." }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-23", "text": "In the training component, the Baum-Welch Expectation Maximization (EM) algorithm (Baum et al., 1970) is used to learn the parameters of a pair HMM." }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-24", "text": "In the generation component, WFST techniques (Graehl, 1997) model the learned pair HMM parameters for generating transliterations." }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-25", "text": "----------------------------------" }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-26", "text": "**PARAMETER ESTIMATION FOR A PAIR-HMM**" }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-27", "text": "A pair HMM has two output observations (Figure 2) that are aligned through the hidden states, Figure 3 is the i th symbol in the source language string S while t j is the j th symbol in T. 
Mackay and Kondrak, 2005] Pair HMM Emission parameters are stored in matrix form in three tables associated with the edit operations; transition parameters are also stored in matrix form in a table." }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-28", "text": "The emission parameters are ( ) n m n m \u00d7 + + in total; n and m are the numbers of symbols in the pair HMM source language alphabet (V S ) and target language alphabet (V T ) respectively." }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-29", "text": "The parameters of starting in a given edit operation state are derived from the parameters of transiting from the match state (M) to either D or I or back to M." }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-30", "text": "Although pair HMM training is evaluated against WFST training, there is no major difference in the training approach used in both cases; a forward-backward EM algorithm is used in each case." }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-31", "text": "The main difference is in the structure; for the pair-HMM, the state transition parameter is also incorporated into the weight that measures the level of relationship between the input and output symbol when transformed to a WFST arc." }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-32", "text": "----------------------------------" }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-33", "text": "**GENERATING TRANSLITERATIONS IN WFSTS**" }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-57", "text": "----------------------------------" }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-35", "text": "Although the framework of WFSTs has mostly been applied in representing various models for speech recognition (Mohri et al., 2008) including HMMs, WFSTs have as well been used for transliteration (Knight and Graehl, 1998) , and are the most suitable for modeling pair HMM constraints for generating transliterations." 
}, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-36", "text": "In the WFST framework, it is possible to specify various configurations associated with constraints inherent in a particular model." }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-37", "text": "Figure 4 shows a WFST that precisely corresponds to the structure of the pair HMM considering the constraints specified for the pair HMM." }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-38", "text": "In Figure 4 , e is an empty symbol while s i and s j are as defined for the pair HMM in Figure 3 . Note that, in Figure 4 , a start state is needed to model pair HMM parameter constraints for starting in any of the three edit states." }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-39", "text": "However, it is possible to specify a WFST corresponding to the pair HMM with no start state." }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-40", "text": "Various WFST configurations that do not conform to the bias corresponding to the pair HMM constraints had low transliteration quality and for space limitations, are not reported in this paper." }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-41", "text": "----------------------------------" }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-42", "text": "**TRANSFORMATION RULES**" }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-43", "text": "A look into the transliterations generated using pair HMM parameters on English-Russian development data showed consistent mistransliterations mainly due to lack of contextual modeling in the generated transliterations." }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-44", "text": "For example in all cases where the Russian character \u043b 'l' precedes the Russian soft sign \u044c ' ' ', the Russian soft sign was missing, resulting into a loss of transliteration accuracy." 
}, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-45", "text": "Two examples of mistransliterations that do not include the Russian soft sign \u044c are: \u043a\u0440\u0435\u0444\u0435\u043b\u0434 instead of \u043a\u0440\u0435\u0444\u0435\u043b\u044c\u0434 'krefeld', and \u0431\u0438\u043b\u0431\u0430\u043e instead of \u0431\u0438\u043b\u044c\u0431\u0430\u043e 'bilbao'." }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-46", "text": "For such cases, simple transformation rules, such as \"\u043b\u2192\u043b\u044c\" were defined on the output transliterations in a post processing step." }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-47", "text": "25 transformation rules were specified for some of the mistransliterations to test the effect of modeling context." }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-48", "text": "----------------------------------" }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-49", "text": "**TRANSLITERATION USING PSMT SYSTEM**" }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-50", "text": "Transliterations generated using pair HMM parameters and WFSTs are evaluated against those generated from a state of the art Phrase-based Statistical Machine Translation system called Moses." }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-51", "text": "Moses has been used before for machine transliteration (Matthews, 2007) and performed way better than a baseline system that was associated with finding the most frequent mappings between source and target transliteration units in the training data." }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-52", "text": "In the PBSMT system, bilingual phrase-tables are used and several components are combined in a log-linear model (translation models, reverse translation model, word and phrase penalties, language models, distortion parameters, etc.) with weights optimized using minimum error rate training." 
}, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-53", "text": "For machine transliteration: characters are aligned instead of words, phrases refer to character n-grams instead of word n-grams, and language models are defined over character sequences instead of word sequences." }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-54", "text": "A major advantage of the PBSMT system over the pair HMM and a WFST models is that the phrase tables (character n-grams) cover a lot of contextual dependencies found in the data." }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-55", "text": "----------------------------------" }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-56", "text": "**EXPERIMENTS**" }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-59", "text": "The data used is divided according to the experimental runs that were specified for the NEWS 2009 shared transliteration task (Li et al., 2009 ): a standard run and non-standard runs." }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-60", "text": "The standard run involved using the transliteration system described above that uses pair HMM parameters combined with transformation rules." }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-61", "text": "The English-Russian datasets used here were provided for the NEWS 2009 shared transliteration task (Kumaran and Kellner, 2009): 5977 pairs of names for training, 943 pairs for development, and 1000 for testing." }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-62", "text": "For the non-standard runs, an additional English-Russian dataset extracted from the Geonames data dump was merged with the shared transliteration task data above to form 10481 pairs for training and development." }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-63", "text": "For a second set of experiments (Table 2) , a different set of test data (1000 pairs) extracted from the Geonames data dump was used." 
}, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-64", "text": "For the system used in the standard run, the training data was preprocessed to include representation of bigrams associated with Cyrillic Romanization and all English vowel bigram combinations." }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-65", "text": "----------------------------------" }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-66", "text": "**RESULTS**" }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-67", "text": "Six measures were used for evaluating system transliteration quality." }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-68", "text": "These include (Li et al., 2009) : Accuracy (ACC), Fuzziness in Top-1 (Mean F Score), Mean Reciprocal Rank (MRR), Mean Average Precision for reference transliterations (MAP_R), Mean Average Precision in 10 best candidate transliterations (MAP_10), Mean Average Precision for the system (MAP_sys)." }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-69", "text": "Table 1 shows the results obtained using only the data sets provided for the shared transliteration task." }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-70", "text": "The system used for the standard run is \"phmm_rules\" described in section 2 to sub section 2.3. \"phmm_basic\" is the system in which pair HMM parameters are used for transliteration generation but there is no representation for bigrams as described for the system used in the standard run." }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-71", "text": "Table 2 shows the results obtained when additional data from Geonames data dump was used for training and development." }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-72", "text": "In Table 2 , \"WFST_basic\" and \"WFST_rules\" are systems associated with training WFSTs for the \"phmm_basic\" and \"phmm_rules\" systems Table 2 Results from additional Geonames data sets." 
}, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-73", "text": "respectively." }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-74", "text": "Moses_PSMT is the phrase-based statistical machine translation system." }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-75", "text": "The results in both tables show that the systems using pair HMM parameters perform relatively better than the systems trained on WFSTs but not better than Moses." }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-76", "text": "The low transliteration quality in the pair HMM and WFST systems as compared to Moses can be attributed to lack of modeling contextual dependencies unlike the case in PBSMT." }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-77", "text": "----------------------------------" }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-78", "text": "**CONCLUSION**" }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-79", "text": "A Transliteration system using pair HMM parameters has been presented." }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-80", "text": "Although its performance is better than that of systems based on only WFSTs, its transliteration quality is lower than the PBSMT system." }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-81", "text": "On seeing that the pair HMM generated consistent mistransliterations, manual specification of a few contextual rules resulted in improved performance." }, { "sent_id": "7404a4e4e23eea9663b580f9959689-C001-83", "text": "A more general framework, in which we intend to investigate contextual issues in addition to other factors such as position in source and target strings and edit operation memory in transliteration, is that of Dynamic Bayesian Networks (DBNs)." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "7404a4e4e23eea9663b580f9959689-C001-13" ] ], "cite_sentences": [ "7404a4e4e23eea9663b580f9959689-C001-13" ] }, "@USE@": { "gold_contexts": [ [ "7404a4e4e23eea9663b580f9959689-C001-13" ], [ "7404a4e4e23eea9663b580f9959689-C001-59" ], [ "7404a4e4e23eea9663b580f9959689-C001-67", "7404a4e4e23eea9663b580f9959689-C001-68" ] ], "cite_sentences": [ "7404a4e4e23eea9663b580f9959689-C001-13", "7404a4e4e23eea9663b580f9959689-C001-59", "7404a4e4e23eea9663b580f9959689-C001-68" ] }, "@UNSURE@": { "gold_contexts": [ [ "7404a4e4e23eea9663b580f9959689-C001-68" ] ], "cite_sentences": [ "7404a4e4e23eea9663b580f9959689-C001-68" ] } } }, "ABC_16780bd3c2b350f6d61f2f55f9f88c_45": { "x": [ { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-104", "text": "----------------------------------" }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-2", "text": "We analyze overt displays of power (ODPs) in written dialogs." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-3", "text": "We present an email corpus with utterances annotated for ODP and present a supervised learning system to predict it." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-4", "text": "We obtain a best cross validation F-measure of 65.8 using gold dialog act features and 55.6 without using them." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-5", "text": "----------------------------------" }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-105", "text": "**NOT USING GOLD DIALOG ACTS**" }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-7", "text": "Analyzing written dialogs (such as email exchanges) to extract social power relations has generated great interest recently." 
}, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-8", "text": "This paper introduces a new task within the general field of finding power relations in written dialogs." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-9", "text": "In written dialog, an utterance can represent an overt display of power (ODP) on the part of the utterer if it constrains the addressee's actions beyond the constraints that the underlying dialog act on its own imposes." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-10", "text": "For example, a request for action is the first part of an adjacency pair and thus requires a response from the addressee, but declining the request is a valid response." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-11", "text": "However, the utterer may formulate her request for action in a way that attempts to remove the option of declining it (\"Come to my office now!\")." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-12", "text": "In so doing, she restricts her addressee's options for responding more severely than a simple request for action would." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-13", "text": "Our new task is to classify utterances in written dialog as to whether they are ODPs or not." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-14", "text": "Such a classification can be interesting in and of itself, and it can also be used to study social relations among dialog participants." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-15", "text": "After reviewing related work (Section 2), we define \"overt display of power\" (Section 3) and then present manual annotations for ODP in a small subset of Enron email corpus." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-16", "text": "In Section 5, we present a supervised learning system using word and part-ofspeech features along with features indicating dialog acts." 
}, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-17", "text": "----------------------------------" }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-18", "text": "**RELATED WORK**" }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-19", "text": "Many studies in sociolinguistics have shown that power relations are manifested in language use (e.g., (O'Barr, 1982) )." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-20", "text": "Locher (2004) recognizes \"restriction of an interactant's action-environment\" (Wartenberg, 1990 ) as a key element by which exercise of power in interactions can be identified." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-21", "text": "Through ODP we capture this action-restriction at an utterance level." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-22", "text": "In the computational field, several studies have used Social Network Analysis (e.g., (Diesner and Carley, 2005) ) for extracting social relations from online communication." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-23", "text": "Only recently have researchers started using NLP to analyze the content of messages to deduce social relations (e.g., (Diehl et al., 2007) )." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-24", "text": "Bramsen et al. (2011) use knowledge of the actual organizational structure to create two sets of messages: messages sent from a superior to a subordinate, and vice versa." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-25", "text": "Their task is to determine the direction of power (since all their data, by construction of the corpus, has a power relationship)." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-26", "text": "Their reported results cannot be directly compared with ours since their results are on classifying aggregations of messages as being to a superior or to a subordinate, whereas our results are on predicting whether a single utterance has an ODP or not." 
}, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-27", "text": "----------------------------------" }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-28", "text": "**OVERT DISPLAY OF POWER (ODP)**" }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-29", "text": "Dialog is successful when all discourse participants show cooperative dialog behavior." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-30", "text": "Certain types of dialog acts, notably requests for actions and requests for information (questions), \"set constraints on what should be done in a next turn\" (Sacks et al., 1974) ." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-31", "text": "Suppose a boss sends an email to her subordinate: \"It would be great if you could come to my office right now\". He responds by politely declining (\"Would love to, but unfortunately I need to pick up my kids\")." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-32", "text": "He has met the expectation to respond in one of the constrained ways that the request for action allows (other acceptable responses include a commitment to performing the action, or actually performing the action, while unacceptable responses include silence, or changing the topic)." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-33", "text": "However, dialog acts only provide an initial description of these constraints." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-34", "text": "Other sources of constraints include the social relations between the utterer and the addressee, and the linguistic form of the utterance." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-35", "text": "Assume our email example had come, say, from the CEO of the company." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-36", "text": "In this case, the addressee's response would not meet the constraints set by the utterance, even though it is still analyzed as the same dialog act (a request for action)." 
}, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-37", "text": "Detecting such power relations and determining their effect on dialog is a hard problem, and it is the ultimate goal of our research." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-38", "text": "Therefore, we do not use knowledge of power relations as features in performing a finergrained analysis of dialog acts." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-39", "text": "Instead, we turn to the linguistic form of an utterance." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-40", "text": "Specifically, the utterer can choose linguistic forms in her utterance to signal that she is imposing further constraints on the addressee's choice of how to respond, constraints which go beyond those defined by the standard set of dialog acts." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-41", "text": "For example, if the boss's email is \"Please come to my office right now\", and the addressee declines, he is clearly not adhering to the constraints the boss has signaled, though he is adhering to the general constraints of cooperative dialog by responding to the request for action." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-42", "text": "We are interested in these additional constraints imposed on utterances through choices in linguistic form." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-43", "text": "We define an utterance to have Overt Display of Power (ODP) if it is interpreted as creating additional constraints on the response beyond those imposed by the general dialog act." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-44", "text": "Note that use of polite lan- guage does not, on its own, determine the presence or absence of an ODP." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-45", "text": "Furthermore, the presence of an ODP does not presuppose that the utterer actually possess social power: the utterer could be attempting to gain power." 
}, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-46", "text": "Table 1 presents some sample utterances chosen from our corpus (the * indicates those without ODP)." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-47", "text": "An utterance with ODP can be an explicit order or command (s3, s8) or an implicit one (s2, s5)." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-48", "text": "It can be a simple sentence (s3) or a complex one (s1)." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-49", "text": "It can be an imperative (s3), an interrogative (s5) or even a declarative (s2) sentence. But not all imperatives (s4) or interrogatives (s6, s7) are ODPs." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-50", "text": "s5, s6 and s7 are all syntactically questions." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-51", "text": "However, s5's discourse function within an email is to request/order to work on \"that\" which makes it an instance of ODP, while s6 is merely an inquiry and s7 is a rhetorical question." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-52", "text": "This makes the problem of finding ODP in utterances a non-trivial one." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-53", "text": "----------------------------------" }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-54", "text": "**DATA AND ANNOTATIONS**" }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-55", "text": "For our study, we use a small corpus of Enron email threads which has been previously annotated with dialog acts (Hu et al., 2009 )." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-56", "text": "The corpus contains 122 email threads with 360 messages, 1734 utterances and 20,740 word tokens." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-57", "text": "We trained an annotator using the definition for ODP given in Section 3. She was given full email threads whose messages were already segmented into utterances." 
}, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-58", "text": "She identified 86 utterances (about 5%) to have an ODP." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-59", "text": "1 In order to validate the annotations, we trained another annotator using the same definitions and examples and had him annotate 46 randomly selected threads from the corpus, which contained a total of 595 utterances (34.3% of whole corpus)." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-60", "text": "We obtained a reasonable inter annotator agreement, \u03ba value, of 0.669, which validates the annotations while confirming that the task is not a trivial one." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-61", "text": "----------------------------------" }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-62", "text": "**AUTOMATIC ODP TAGGING**" }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-63", "text": "In this section, we present a supervised learning method to tag unseen utterances that contain an ODP using a binary SVM classifier." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-64", "text": "We use the tokenizer, POS tagger, lemmatizer and SVMLight (Joachims, 1999) wrapper that come with ClearTK (Ogren et al., 2008) ." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-65", "text": "We use a linear kernel with C = 1 for all experiments and present (P)recision, (R)ecall and (F)-measure obtained on 5-fold cross validation on the data." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-66", "text": "Our folds do not cross thread boundaries." 
}, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-67", "text": "----------------------------------" }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-68", "text": "**HANDLING CLASS IMBALANCE**" }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-69", "text": "In its basic formulation, SVMs learn a decision function f from a set of positive and negative training instances such that an unlabeled instance x is labeled as positive if f (x) > 0." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-70", "text": "Since SVMs optimize on training set accuracy to learn f , it performs better on balanced training sets." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-71", "text": "However, our dataset is highly imbalanced (\u223c 5% positive instances)." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-72", "text": "We explore two ways of handling this class imbalance problem: an instance weighting method, InstWeight, where training errors on negative instances are outweighed by errors on positive instances, and SigThresh, a threshold adjusting method to find a better threshold for f (x)." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-73", "text": "For InstWeight, we used the j option in SVMlight to set the outweighing factor to be the ratio of negative to positive instances in the training set for each cross validation fold." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-74", "text": "InstWeight is roughly equivalent to oversampling by repeating positive instances." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-75", "text": "For SigThresh, we used a threshold based on a posterior probabilistic score, p = P r(y = 1|x), calculated using the ClearTK implementation of Lin et al. (2007)'s algorithm." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-76", "text": "It uses Platt (1999)'s approximation of p to a sigmoid function P A,B (f ) = (1 + exp(Af + B)) \u22121 , where A and B are estimated from the training set." 
}, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-77", "text": "Then, we predict x as positive if p > 0.5 which in effect shifts the threshold for f (x) to a value based on its distri- bution on positive and negative training instances." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-78", "text": "----------------------------------" }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-79", "text": "**FEATURES**" }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-80", "text": "We present experiments using counts of three types of ngrams: lemma ngrams (LN), POS ngrams (PN) and mixed ngrams (MN We also used a feature (FV) to denote the first verb lemma in the utterance." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-81", "text": "Since ODPs, like dialog acts, constrain how the addressee should react, we also include Dialog Acts as features (DA)." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-82", "text": "We use the manual gold dialog act annotations present in our corpus, which use a very small dialog act tag set." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-83", "text": "An utterance has one of 5 dialog acts: RequestAction, RequestInformation, Inform, Commit and Conventional (see (Hu et al., 2009 ) for details)." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-84", "text": "For example, for utterance s2, FV would be 'need' and DA would be 'Inform'. 3" }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-85", "text": "----------------------------------" }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-86", "text": "**RESULTS AND ANALYSIS**" }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-87", "text": "We present two simple baselines -ALL-TRUE, where an utterance is always predicted to have an ODP, and RANDOM, where an utterance is predicted at random, with 50% chance to have an ODP." 
}, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-88", "text": "We also present a strong baseline WORD-UNG, which is trained using surface-form word unigrams as features." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-89", "text": "ALL-TRUE and RANDOM obtained F scores of 9.5 and 10.4 respectively, while WORD-UNG obtained an F score of 34.7 under InstWeight, and improved it to 48.6 under SigThresh." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-90", "text": "For LN, PN and MN, we first found the best value for n to be 1, 2 and 4, respectively." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-91", "text": "We then did an exhaustive search in all combinations of LN, PN, MN, FV and DA under both InstWeight and SigThresh." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-92", "text": "Results obtained for best feature subset under both configurations are presented in Table 2 in rows 3 and 4." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-93", "text": "SigThresh outweighed InstWeight in all our experiments." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-94", "text": "(Combining these two techniques for dealing with class imbalance performed worse than using either one.) In both settings, we surpassed the WORD-UNG baseline by a high margin." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-95", "text": "We found MN and DA to be most useful: removing either from the feature set dropped the F significantly in both settings." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-96", "text": "We obtained a best F score of 65.8 using PN, MN and DA under the SigThresh." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-97", "text": "Following (Guyon et al., 2002) , we inspected feature weights of the model created for the last fold of our best performing feature configuration as a posthoc analysis." 
}, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-98", "text": "The binary feature DA:RequestAction got the highest positive weight of 2.5." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-99", "text": "The top ten positive weighted features included patterns like you VB, * VB, MD PRP, VB VB and * MD, where * denotes the utterance boundary." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-100", "text": "DA:Inform got the most negative weight of -1.4, followed by DA:Conventional with -1.0." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-101", "text": "The top ten negative weighted features included patterns like MD VB, VB you, what, VB VB me VB and WP." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-102", "text": "In both cases, DA features got almost 2.5 times higher weight than the highest weighted ngram pattern, which reaffirms their importance in this task." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-103", "text": "Also, mixed ngrams helped to capture long patterns like \"please let me know\" by VB VB me VB without increasing dimensionality as much as word ngrams; they also distinguish VB you with a negative weight of -0.51 from VB me with a positive weight of 0.32, which pure POS ngrams couldn't have captured." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-106", "text": "We also evaluate the performance of our ODP tagger without using gold DA tags." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-107", "text": "We instead use the DA tagger of Hu et al. (2009) , which we re-trained using the training sets for each of our cross validation folds, applying it to the test set of that fold." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-108", "text": "We then did cross validation for the ODP tagger using gold dialog acts for training and automatically tagged dialog acts for testing." 
}, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-109", "text": "However, for our best performing feature set so far, this reduced the F score from 65.8 to 52.7." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-110", "text": "Our best result for ODP tagging without using gold DAs is shown in row 5 in Table 2 , 56.9 F score under SigThresh." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-111", "text": "The features used are all of our features other than the DA tags." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-112", "text": "On further analysis, we find that even though the dialog act tagger has a high accuracy (85.8% in our cross validation), it obtained a very low recall of 28.6% and precision of 47.6% for the RequestAction dialog act." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-113", "text": "Since RequestAction is the most important feature (weighted 1.7 times more than the next feature), the DA-tagger's poor performance on RequestAction hurt ODP tagging badly." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-114", "text": "The performance reduction in this setting is probably partly due to using gold DAs in training and automatically tagged DAs in testing; however, we feel that improving the detection of minority classes in dialog act tagging (RequestAction constitutes only 2.5% in the corpus) is a necessary first step towards successfully using automatically tagged DAs in ODP tagging." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-115", "text": "----------------------------------" }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-116", "text": "**CONCLUSION**" }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-117", "text": "We have introduced a new binary classification task on utterances in dialogs, namely predicting Overt Display of Power." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-118", "text": "An ODP adds constraints on the possible responses by the addressee." 
}, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-119", "text": "We have introduced a corpus annotated for ODP and we have shown that using supervised machine learning with gold dialog acts we can achieve an F-measure of 66% despite the fact that ODPs are very rare in the corpus." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-120", "text": "We intend to develop a better dialog act tagger which we can use to automatically obtain dialog act labels for ODP classification." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-121", "text": "----------------------------------" }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-122", "text": "**ACKNOWLEDGMENTS**" }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-123", "text": "This work is supported, in part, by the Johns Hopkins Human Language Technology Center of Excellence." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-124", "text": "Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the sponsor." }, { "sent_id": "16780bd3c2b350f6d61f2f55f9f88c-C001-125", "text": "We thank several anonymous reviewers for their constructive feedback." } ], "y": { "@USE@": { "gold_contexts": [ [ "16780bd3c2b350f6d61f2f55f9f88c-C001-55" ], [ "16780bd3c2b350f6d61f2f55f9f88c-C001-82", "16780bd3c2b350f6d61f2f55f9f88c-C001-83" ], [ "16780bd3c2b350f6d61f2f55f9f88c-C001-107" ] ], "cite_sentences": [ "16780bd3c2b350f6d61f2f55f9f88c-C001-55", "16780bd3c2b350f6d61f2f55f9f88c-C001-83", "16780bd3c2b350f6d61f2f55f9f88c-C001-107" ] }, "@BACK@": { "gold_contexts": [ [ "16780bd3c2b350f6d61f2f55f9f88c-C001-83" ] ], "cite_sentences": [ "16780bd3c2b350f6d61f2f55f9f88c-C001-83" ] } } }, "ABC_ec5897c392b05cb8712feadfc6d2bf_46": { "x": [ { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-12", "text": "One such feature is the knowledge of the semantic clusters in a domain." 
}, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-2", "text": "Semantic clusters of a domain form an important feature that can be useful for performing syntactic and semantic disambiguation." }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-3", "text": "Several attempts have been made to extract the semantic clusters of a domain by probabilistic or taxonomic techniques." }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-4", "text": "However, not much progress has been made in evaluating the obtained semantic clusters." }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-5", "text": "This paper focuses on an evaluation mechanism that can be used to evaluate semantic clusters produced by a system against those provided by human experts." }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-6", "text": "----------------------------------" }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-7", "text": "**INTRODUCTION 1**" }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-8", "text": "Most natural language processing (NLP) systems are designed to work on certain specific domains and porting them to other domains is often a very timeconsuming and human-intenslve process." }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-9", "text": "As the need for applying NLP systems to more and varied domains grows, it becomes increasingly important that some techniques be used to make these systems more portable." }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-10", "text": "Several researchers (Lang and Hirschman, 1988; Rau et al., 1989; Pustejovsky, 1992; Grishman and Sterling, 1993; Basili et al., 1994) , either directly or indirectly, have addressed issues that assist in making it easier to move an NLP system from one domain to another." 
}, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-11", "text": "One of the reasons for the lack of portability is the need for domain-specific semantic features that such systems often use for lexical, syntactic, and semantic disambiguation." }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-13", "text": "Since semantic classes are often domain-specific, their automatic acquisition is not trivial." }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-14", "text": "Such classes can be derived either by distributional means or from existing taxonomies, knowledge bases, dictionaries, thesauruses, and so on." }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-15", "text": "A prime example of the latter is WordNet which has been used to 1The author is currently at Texas Instruments and all inquiries should be addressed to rajeev@csc.ti.com." }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-16", "text": "provide such semantic classes (Resnik, 1993; Basili et al., 1994) to assist in text understanding." }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-17", "text": "Our efforts to obtain such semantic clusters with limited human intervention have been described elsewhere (Agarwal, 1995) ." }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-18", "text": "This paper concentrates on the aspect of evahiating the obtained clusters against classes provided by human experts." }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-19", "text": "----------------------------------" }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-20", "text": "**2**" }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-21", "text": "The Need" }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-22", "text": "Although there has been a lot of work done in extracting semantic classes of a given domain, relatively little attention has been paid to the task of evaluating the generated classes." 
}, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-23", "text": "In the absence of an evaluation scheme, the only way to decide if the semantic classes produced by a system are \"reasonable\" or not is by having an expert analyze them by inspection." }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-24", "text": "Such informal evaluations make it very difficult to compare one set of classes against another and are also not very reliable estimates of the quality of a set of classes." }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-25", "text": "It is clear that a formal evaluation scheme would be of great help." }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-26", "text": "Hatzivassiloglou and McKeown (1993) duster adjectives into partitions and present an interesting evaluation to compare the generated adjective classes against those provided by an expert." }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-27", "text": "Their evaluation scheme bases the comparison between two classes on the presence or absence of pairs of words in them." }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-28", "text": "Their approach involves filling in a YES-NO contingency table based on whether a pair of words (adjectives, in their case) is classified in the same class by the human expert and by the system." }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-29", "text": "This method works very well for partitions." }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-30", "text": "However, if it is used to evaluate sets of classes where the classes may be potentiaily overlapping, their technique yields a weaker measure since the same word pair could possibly be present in more than one class." }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-31", "text": "An ideal scheme used to evaluate semantic classes should be able to handle overlapping classes (as o1>. posed to partitions) as well as hierarchies." 
}, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-32", "text": "The technique proposed by Hatzivassiloglou and McKeown does not do a good job of evaluating either of these." }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-33", "text": "In this paper, we present an evaluation methodology which makes it possible to properly evaluate over- In the discussion that follows, the word \"clustering\" is used to refer to the set of classes that may be either provided by an expert or generated by the system, and the word \"class\" is used to refer to a single class in the clustering." }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-34", "text": "----------------------------------" }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-35", "text": "**3**" }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-36", "text": "Evaluation Approach" }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-37", "text": "As mentioned above, we intend to be able to compare a clustering generated by a system against one provided by an expert." }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-38", "text": "Since a word can occur in more than one class, it is important to find some kind of mapping between the classes generated by the system and the classes given by the expert." }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-39", "text": "Such a mapping tells us which class in the system's clustering maps to which one in the expert's clustering, and an overall comparison of the clusterings is based on the comparison of the mutually mapping classes." }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-40", "text": "Before we delve deeper into the evaluation process, we must decide on some measure of \"closeness\" between a pair of classes." }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-41", "text": "We have adopted the F-measure (Hatzivassiloglou and McKeown, 1993; Chincor, 1992) ." 
}, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-42", "text": "In our computation of the Fmeasure, we construct a contingency table based on the presence or absence of individual elements in the two classes being compared, as opposed to basing it on pairs of words." }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-43", "text": "For example, suppose that Class A is generated by the system and Class B is provided by an expert (as shown in Table 1 )." }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-44", "text": "The contingency table obtained for this pair of classes is shown in Table 2 ." }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-45", "text": "The three main steps in the evaluation process are the acquisition of \"correct\" classes from domain experts, mapping the experts' clustering to that generated by the system, and generating an overall measure that represents the system's performance when compared against the expert." }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-46", "text": "----------------------------------" }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-47", "text": "**KNOWLEDGE ACQUISITION FROM EXPERTS**" }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-48", "text": "The objective of this step is to get human experts to undertake the same task that the system performs, i.e., classifying a set of words into several potentially overlapping classes." }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-49", "text": "The classes produced by a system are later compared to these \"correct\" classifications provided by the expert." 
}, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-50", "text": "----------------------------------" }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-51", "text": "**MAPPING ALGORITHM**" }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-52", "text": "In order to determine pairwise mappings between the clustering generated by the system and one provided by an expert, a table of F-measures is constructed, with a row for each class generated by the system, and a column for every class provided by the expert." }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-53", "text": "Note that since the expert actually provides a hierarchy, there is one column corresponding to every individual class and subclass provided by the expert." }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-54", "text": "This allows the system's classes to map to a class at any level in the expert's hierarchy." }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-55", "text": "This table gives an estimate of how well each class generated by the system maps to the ones provided by the expert." }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-56", "text": "The algorithm used to compute the actual mappings from the F-measure table is briefly described here." }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-57", "text": "In each row of the table, mark the cell with the highest F-measure as a potential mapping." }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-58", "text": "In general, conflicts arise when more than one class generated by the system maps to a given class provided by the expert." }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-59", "text": "In other words, whenever a column in the table has more than one cell marked as a potential mapping, a conflict is said to exist." }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-60", "text": "To resolve a conflict, one of the system classes must be re-mapped." 
}, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-61", "text": "The heuristic used here is that the class for which such a re-mapping results in minimal loss of F-measure is the one that must be re-mapped." }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-62", "text": "Several such conflicts may exist, and re-mapping may lead to further conflicts." }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-63", "text": "The mapping algorithm iteratively searches for conflicts and resolves them till no more conflicts exist." }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-64", "text": "Note also that a system class may map to an expert class only if the F-measure between them exceeds a certain threshold value." }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-65", "text": "This ensures that a certain degree of similarity must exist between two classes for them to map to each other." }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-66", "text": "We have used a threshold value of 0.20." }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-67", "text": "This value is obtained purely by observations made on the F-measures between different pairs of classes with varying degrees of similarity." }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-68", "text": "Table 2 )." }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-69", "text": "Once all the mapped classes have been incorporated into this contingency table, add every element of all unmapped classes generated by the system to the YES-NO cell and every element of all unmapped classes provided by the expert to the NO-YES cell of this table." }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-70", "text": "Once all classes in the two clusterings have been accounted for, calculate the precision, recall, and F-measure as explained in (Hatzivassiloglou and McKeown, 1993) ." 
}, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-71", "text": "----------------------------------" }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-72", "text": "**RESULTS AND DISCUSSION**" }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-73", "text": "In one of our experiments, the 400 most frequent nouns in the Merck Veterinary Manual were clustered." }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-74", "text": "Three experts were used to evaluate the generated noun clusters." }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-75", "text": "Some examples of the classes that were generated by the system for the veterinary medicine domain are PROBLEM, TREAT-MENT, ORGAN, DIET, ANIMAL, MEASURE-MENT, PROCESS, and so on." }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-76", "text": "The results obtained by comparing these noun classes to the clusterings provided by three different experts are shown in Table 3 ." }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-77", "text": "We have also experimented with the use of WordNet to improve the classes obtained by a distributional technique." }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-78", "text": "Some initial experiments have shown that WordNet consistently improves the Fmeasures for these noun classes by about 0.05 on an average." }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-79", "text": "Details of these experiments can be found in (Agarwal, 1995) ." }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-80", "text": "It is our belief that the evaluation scheme presented in this paper is useful for comparing different clusterings produced by the same system or those produced by different systems against one provided by an expert." }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-81", "text": "The resulting precision, recall, and F-measure should not be treated as a kind of \"gold standard\" to represent the quality of these classes in some absolute sense." 
}, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-82", "text": "It has been our experience that, as semantic clustering is a highly subjective task, evaluating a given clustering against different experts may yield numbers that vary considerably." }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-83", "text": "However, when different clusterings generated by a system are compared against the same expert (or the same set of experts), such relative comparisons are useful." }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-84", "text": "The evaluation scheme presented here still suffers from one major limitation --it is not capable of evaluating a hierarchy generated by a system against one provided by an expert." }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-85", "text": "Such evaluations get complicated because of the restriction of one-to-one mapping." }, { "sent_id": "ec5897c392b05cb8712feadfc6d2bf-C001-86", "text": "More work definitely needs to be done in this area." } ], "y": { "@BACK@": { "gold_contexts": [ [ "ec5897c392b05cb8712feadfc6d2bf-C001-26" ] ], "cite_sentences": [ "ec5897c392b05cb8712feadfc6d2bf-C001-26" ] }, "@MOT@": { "gold_contexts": [ [ "ec5897c392b05cb8712feadfc6d2bf-C001-26", "ec5897c392b05cb8712feadfc6d2bf-C001-27", "ec5897c392b05cb8712feadfc6d2bf-C001-28", "ec5897c392b05cb8712feadfc6d2bf-C001-29", "ec5897c392b05cb8712feadfc6d2bf-C001-30", "ec5897c392b05cb8712feadfc6d2bf-C001-31", "ec5897c392b05cb8712feadfc6d2bf-C001-32", "ec5897c392b05cb8712feadfc6d2bf-C001-33" ] ], "cite_sentences": [ "ec5897c392b05cb8712feadfc6d2bf-C001-26" ] }, "@USE@": { "gold_contexts": [ [ "ec5897c392b05cb8712feadfc6d2bf-C001-41" ], [ "ec5897c392b05cb8712feadfc6d2bf-C001-70" ] ], "cite_sentences": [ "ec5897c392b05cb8712feadfc6d2bf-C001-41", "ec5897c392b05cb8712feadfc6d2bf-C001-70" ] } } }, "ABC_2f7b64db6939786a5026fc033c85bd_46": { "x": [ { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-1", "text": "**ABSTRACT**" }, { "sent_id": 
"2f7b64db6939786a5026fc033c85bd-C001-2", "text": "Existing algorithms for the Generation of Referring Expressions (GRE) aim at generating descriptions that allow a hearer to identify its intended referent uniquely; the length of the expression is also considered, usually as a secondary issue." }, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-3", "text": "We explore the possibility of making the trade-off between these two factors more explicit, via a general cost function which scores these two aspects separately." }, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-4", "text": "We sketch some more complex phenomena which might be amenable to this treatment." }, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-5", "text": "----------------------------------" }, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-7", "text": "Until recently, GRE algorithms have focussed on the generation of distinguishing descriptions that are either as short as possible (e.g. (Dale, 1992; Gardent, 2002) ) or almost as short as possible (e.g. (Dale and Reiter, 1995) )." }, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-8", "text": "Since reductions in ambiguity are achieved by increases in length, there is a tension between these factors, and algorithms usually resolve this in some fixed way." }, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-9", "text": "However, the need for a distinguishing description is usually assumed, and typically built in to GRE algorithms." }, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-10", "text": "We will suggest a way to make explicit this balance between clarity (i.e. lack of ambiguity) and brevity, and we indicate some phenomena which we believe may be illuminated by this approach." 
}, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-11", "text": "The ideas in this paper can be seen as a loosening of some of the many simplifying assumptions often made in GRE work." }, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-12", "text": "* This work is supported by a University of Aberdeen Sixth Century Studentship, and the TUNA project (EPSRC, UK) under grant number GR/S13330/01." }, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-13", "text": "We thank Ielka van der Sluis and Albert Gatt for valuable comments." }, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-14", "text": "----------------------------------" }, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-15", "text": "**CLARITY, BREVITY AND COST**" }, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-16", "text": "We consider only simple GRE, where the aim is to construct a conjunction of unary properties which distinguish a single target object from a set of potential distractors." }, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-17", "text": "Our notation is as follows." }, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-18", "text": "A domain consists of a set D of objects, and a set P of properties applicable to objects in D. A description is a subset of P. The denotation of S, written" }, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-19", "text": "( Krahmer et al., 2003) describe an approach to GRE in which a cost function guides search for a suitable description, and show that some existing GRE algorithms fit into this framework." }, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-20", "text": "However, they follow the practice of concentrating solely on distinguishing descriptions, treating cost as a matter of brevity." }, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-21", "text": "We suggest that decomposing cost into two components, for the clarity and brevity of descriptions, permits the examination of tradeoffs." 
}, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-22", "text": "For now, we will take the cost of a description S to be the sum of two terms:" }, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-23", "text": "where f C counts ambiguity (lack of clarity) and f B counts size (lack of brevity)." }, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-24", "text": "Even with this decomposition of cost, some existing algorithms can still be seen as cost-minimisation." }, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-25", "text": "For example, the cost functions:" }, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-26", "text": "allow the Full Brevity algorithm (Dale, 1992) to be viewed as minimising cost(S), and the incremental algorithm (Dale and Reiter, 1995) as hill-climbing (strictly, hill-descending), guided by the property-ordering which that algorithm requires." }, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-27", "text": "Whereas Krahmer et al.'s cost functions are (brevity-based) heuristic guidance functions, our alternative here is a global quantity for optimisation." }, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-28", "text": "Hence their simulation of Full Brevity relies on the details of their algorithm (rather than cost) to ensure clarity, while our own cost function ensures both brevity and clarity. , there is a curve where the cost drops more steeply as the more undesirable distractors are excluded." }, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-29", "text": "For example, each object could be assigned a numerical rating of how undesirable it is, with the target having a score of zero, and the f C value for a set A could be the maximum rating of any element of A. 
(This would, of course, require a suitably rich domain model.)" }, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-30", "text": "The brevity cost function f B could still be a relatively simple linear function, provided f B values do not mask the effect of the shape of the f C curve." }, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-31", "text": "----------------------------------" }, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-32", "text": "**FUZZINESS OF TARGET**" }, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-33", "text": "Suppose Mrs X has dropped a piece of raw chicken meat on the kitchen table, and immediately removed the meat." }, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-34", "text": "She would now like Mr X to wipe the area clean." }, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-35", "text": "The meat leaves no visible stain, so she has to explain where it was." }, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-36", "text": "In this case, it appears that there is no such thing as a distinguishing description (i.e. a description that pins down the area precisely), although Mrs X can arbitrarily increase precision, by adding properties:" }, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-37", "text": "-the edge of the table, -the edge of the table, on the left (etc.)" }, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-38", "text": "The ideal description would describe the dirty area and nothing more, but a larger area will also do, if not too large." }, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-39", "text": "Here, the domain D is implicitly defined as all conceivable subareas of the table, the target is again one element of D, but - unlike the traditional set-up with discrete elements - a description (fuzzily) defines one such area, not a disjoint collection of individual items." 
}, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-40", "text": "Our f C operates on the description S, not just on the number of distractors, so it can assess the aptness of the denotation of any potential S. However, it has to ensure that this denotation (subarea of the surface) contains the target (contaminated area), and does not contain too much beyond that." }, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-41", "text": "Hence, we may need to augment our clarity cost function with another argument: the target itself." }, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-42", "text": "In general, more complex domains may need more complicated functions." }, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-43", "text": "----------------------------------" }, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-44", "text": "**UNDERSPECIFICATION IN DIALOGUE**" }, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-45", "text": "Standard GRE algorithms assume that the speaker knows what the hearer knows (Dale and Reiter, 1995) ." }, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-46", "text": "In practice, speakers can often only guess." }, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-47", "text": "It has been observed that speakers sometimes produce referring expressions that are only disambiguated through negotiation with the hearer, as exemplified in the following excerpt (quoted in (Hirst, 2002) )." }, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-48", "text": "A and B are in the same room, in an informal setting, so A can be relatively interactive in conveying information." }, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-49", "text": "Also, the situation does not appear to be highly critical, in comparison to a military officer directing gunfire, or a surgeon guiding an incision." }, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-50", "text": "Initially, A produces an expression which is not very detailed." 
}, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-51", "text": "It may be that he thinks this is adequate (the object is sufficiently salient that B will uniquely determine the referent), or he doesn't really know, but is willing to make an opening bid in a negotiation to reach the goal of reference." }, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-52", "text": "In the former case, a GRE algorithm which took account of salience (e.g. (Krahmer and Theune, 1999)), operating with A's model of B's knowledge, should produce this sort of effect." }, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-53", "text": "(A dialogue model might also be needed.) In the latter case, we need an algorithm which can relax the need for complete clarity." }, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-54", "text": "This could be arranged by having f C give similar scores to denotations where there are no distractors and to denotations where there are just a few distractors, with f B making a large contribution to the cost." }, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-55", "text": "----------------------------------" }, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-56", "text": "**OVER-SPECIFICATION**" }, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-57", "text": "Recently, interest has been growing in 'overspecified' referring expressions, which contain more information than is required to identify their intended referent." }, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-58", "text": "Some of this work is mainly or exclusively experimental (Jordan and Walker, 2000; Arts, 2004) , but algorithmic consequences are also being explored (Horacek, 2005; Paraboni and van Deemter, 2002; van der Sluis and Krahmer, 2005) ." 
}, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-59", "text": "Over-specification could also arise in a dialogue situation (comparable to that in Section 3.3) if a speaker is unclear about the hearer's knowledge, and so over-specifies (relative to his own knowledge) to increase the chances of success." }, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-60", "text": "This goes beyond the classical algorithms, where the main goal is total clarity, with no reason for the algorithm to add further properties to an already unambiguous expression." }, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-61", "text": "That is, such algorithms assume that every description S for which | [[ S ]] |= 1 has the same level of clarity (f C value)." }, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-62", "text": "This assumption could be relaxed." }, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-63", "text": "For example, the approach of (Horacek, 2005) to GRE allows degrees of uncertainty about the effectiveness of properties to affect their selection." }, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-64", "text": "Within such a framework, one could separately compute costs for clarity (e.g. likelihood of being understood) and brevity (which might include the complexity of expressing the properties)." }, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-65", "text": "----------------------------------" }, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-66", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-67", "text": "We have argued that the GRE task becomes very different when some commonly-made assumptions are abandoned: some distractors might be worse than others (section 3.1); the target may be impossible to distinguish precisely (section 3.2); the speaker may be unsure what the hearer knows (section 3.3); or there may be a need for overspecification (section 3.4))." 
}, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-68", "text": "As a result, it may be necessary to consider other aspects of the descriptions and their denotations, not simply counting distractors or numbers of properties." }, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-69", "text": "Some effects could perhaps be modelled using costs which are not simple linear functions, but which give varying importance to particular aspects of the denotation of a description, or of its content." }, { "sent_id": "2f7b64db6939786a5026fc033c85bd-C001-70", "text": "We hope that this approach will ultimately shed light not only on the effect of the discourse situation, but also some aspects of generating indefinite descriptions." } ], "y": { "@BACK@": { "gold_contexts": [ [ "2f7b64db6939786a5026fc033c85bd-C001-7" ], [ "2f7b64db6939786a5026fc033c85bd-C001-26" ], [ "2f7b64db6939786a5026fc033c85bd-C001-45" ] ], "cite_sentences": [ "2f7b64db6939786a5026fc033c85bd-C001-7", "2f7b64db6939786a5026fc033c85bd-C001-26", "2f7b64db6939786a5026fc033c85bd-C001-45" ] } } }, "ABC_854679aec9cf4f53e97936f865f49c_46": { "x": [ { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-2", "text": "While the effect of domain variation on Penntreebank-trained probabilistic parsers has been investigated in previous work, we study its effect on a Penn-Treebank-trained probabilistic generator." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-3", "text": "We show that applying the generator to data from the British National Corpus results in a performance drop (from a BLEU score of 0.66 on the standard WSJ test set to a BLEU score of 0.54 on our BNC test set)." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-4", "text": "We develop a generator retraining method where the domain-specific training data is automatically produced using state-of-the-art parser output." 
}, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-5", "text": "The retraining method recovers a substantial portion of the performance drop, resulting in a generator which achieves a BLEU score of 0.61 on our BNC test data." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-6", "text": "----------------------------------" }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-8", "text": "Grammars extracted from the Wall Street Journal (WSJ) section of the Penn Treebank have been successfully applied to natural language parsing, and more recently, to natural language generation." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-9", "text": "It is clear that high-quality grammars can be extracted for the WSJ domain but it is not so clear how these grammars scale to other text genres." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-10", "text": "Gildea (2001) , for example, has shown that WSJ-trained parsers suffer a drop in performance when applied to the more varied sentences of the Brown Corpus." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-11", "text": "We investigate the effect of domain variation in treebank-grammar-based generation by applying a WSJ-trained generator to sentences from the British National Corpus (BNC)." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-12", "text": "As with probabilistic parsing, probabilistic generation aims to produce the most likely output(s) given the input." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-13", "text": "We can distinguish three types of probabilistic generators, based on the type of probability model used to select the most likely sentence." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-14", "text": "The first type uses an n-gram language model, e.g. (Langkilde, 2000) , the second type uses a probability model defined over trees or feature-structureannotated trees, e.g. 
(Cahill and van Genabith, 2006) , and the third type is a mixture of the first and second type, employing n-gram and grammar-based features, e.g. (Velldal and Oepen, 2005) ." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-15", "text": "The generator used in our experiments is an instance of the second type, using a probability model defined over Lexical Functional Grammar c-structure and f-structure annotations (Cahill and van Genabith, 2006; Hogan et al., 2007) ." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-16", "text": "In an initial evaluation, we apply our probabilistic WSJ-trained generator to BNC material, and show that the generator suffers a substantial performance degradation, with a drop in BLEU score from 0.66 to 0.54." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-17", "text": "We then turn our attention to the problem of adapting the generator so that it can more accurately generate the 1,000 sentences in our BNC test set." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-18", "text": "The problem of adapting any NLP system to a domain different from the domain upon which it has been trained and for which no gold standard training material is available is a very real one, and one which has been the focus of much recent research in parsing." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-19", "text": "Some success has been achieved by training a parser, not on gold standard hand-corrected trees, but on parser output trees." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-20", "text": "These parser output trees can be produced by a second parser in a co-training scenario (Steedman et al., 2003) , or by the same parser with a reranking component in a type of self-training scenario (McClosky et al., 2006) ." 
}, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-21", "text": "We tackle the problem of domain adaptation in generation in a similar way, by training the generator on domain specific parser output trees instead of manually corrected gold standard trees." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-22", "text": "This experiment achieves promising results, with an increase in BLEU score from 0.54 to 0.61." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-23", "text": "The method is generic and can be applied to other probabilistic generators (for which suitable training material can be automatically produced)." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-24", "text": "----------------------------------" }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-25", "text": "**BACKGROUND**" }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-26", "text": "The natural language generator used in our experiments is the WSJ-trained system described in Cahill and van Genabith (2006) and Hogan et al. (2007) ." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-27", "text": "Sentences are generated from Lexical Functional Grammar (LFG) f-structures (Kaplan and Bresnan, 1982) ." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-28", "text": "The f-structures are created automatically by annotating nodes in the gold standard WSJ trees with LFG functional equations and then passing these equations through a constraint solver (Cahill et al., 2004) ." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-29", "text": "The generation algorithm is a chartbased one which works by finding the most probable tree associated with the input f-structure." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-30", "text": "The yield of the most probable tree is the output sentence." 
}, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-31", "text": "An annotated PCFG, in which the nonterminal symbols are decorated with functional information, is used to generate the most probable tree from an f-structure." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-32", "text": "Cahill and van Genabith (2006) attain 98.2% coverage and a BLEU score of 0.6652 on the standard WSJ test set (Section 23)." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-33", "text": "Hogan et al. (2007) describe an extension to the system which replaces the annotated PCFG selection model with a more sophisticated history-based probabilistic model." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-34", "text": "Instead of conditioning the righthand side of a rule on the lefthand non-terminal and its associated functional information alone, the new model includes non-local conditioning information in the form of functional information associated with ancestor nodes of the lefthand side category." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-35", "text": "This system achieves a BLEU score of 0.6724 and 99.9% coverage." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-36", "text": "Other WSJ-trained generation systems include Nakanishi et al. (2005) and White et al. (2007) ." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-37", "text": "Nakanishi et al. (2005) describe a generator trained on a HPSG grammar derived from the WSJ Section of the Penn Treebank." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-38", "text": "On sentences of \u2264 20 words in length, their system attains coverage of 90.75% and a BLEU score of 0.7733." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-39", "text": "White et al. (2007) describe a CCG-based realisation system which has been trained on logical forms derived from CCGBank (Hockenmaier and Steedman, 2005) , achieving 94.3% coverage and a BLEU score of 0.5768 on WSJ23 for all sentence lengths." 
}, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-40", "text": "The input structures upon which these systems are trained vary in form and specificity, but what the systems have in common is that their various input structures are derived from Penn Treebank trees." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-41", "text": "----------------------------------" }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-42", "text": "**THE BNC TEST DATA**" }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-43", "text": "The new English test set consists of 1,000 sentences taken from the British National Corpus (Burnard, 2000) ." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-44", "text": "The BNC is a one hundred million word balanced corpus of British English from the late twentieth century." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-45", "text": "Ninety per cent of it is written text, and the remaining 10% consists of transcribed spontaneous and scripted spoken language." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-46", "text": "The BNC sentences in the test set are not chosen completely at random." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-47", "text": "Each sentence in the test set has the property of containing a word which appears as a verb in the BNC but not in the usual training sections of the Wall Street Journal section of the Penn Treebank (WSJ02-21)." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-48", "text": "Sentences were chosen in this way so that the resulting test set would be a difficult one for WSJ-trained systems." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-49", "text": "In order to produce input f-structures for the generator, the test sentences were manually parsed by one annotator, using as references the Penn Treebank trees themselves and the Penn Treebank bracketing guidelines (Bies et al., 1995) ." 
}, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-50", "text": "When the two references did not agree, the guidelines took precedence over the Penn Treebank trees." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-51", "text": "Difficult parsing decisions were documented." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-52", "text": "Due to time constraints, the annotator did not mark functional tags or traces." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-53", "text": "The context-free gold standard parse trees were transformed into fstructures using the automatic procedure of Cahill et al. (2004) ." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-54", "text": "----------------------------------" }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-55", "text": "**EXPERIMENTS**" }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-56", "text": "Experimental Setup In our first experiment, we apply the original WSJ-trained generator to our BNC test set." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-57", "text": "The gold standard trees for our BNC test set differ from the gold standard Wall Street Journal trees, in that they do not contain Penn-II traces or functional tags." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-58", "text": "The process which pro-duces f-structures from trees makes use of trace and functional tag information, if available." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-59", "text": "Thus, to ensure that the training and test input f-structures are created in the same way, we use a version of the generator which is trained using gold standard WSJ trees without functional tag or trace information." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-60", "text": "When we test this system on the WSJ23 f-structures (produced in the same way as the WSJ training material), the BLEU score decreases slightly from 0.67 to 0.66." 
}, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-61", "text": "This is our baseline system." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-62", "text": "In a further experiment, we attempt to adapt the generator to BNC data by using BNC trees as training material." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-63", "text": "Because we lack gold standard BNC trees (apart from those in our test set), we try instead to use parse trees produced by an accurate parser." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-64", "text": "We choose the Charniak and Johnson reranking parser because it is freely available and achieves state-of-the-art accuracy (a Parseval f-score of 91.3%) on the WSJ domain (Charniak and Johnson, 2005) ." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-65", "text": "It is, however, affected by domain variation - Foster et al. (2007) report that its f-score drops by approximately 8 percentage points when applied to the BNC domain." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-66", "text": "Our training size is 500,000 sentences." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-67", "text": "We conduct two experiments: the first, in which 500,000 sentences are extracted randomly from the BNC (minus the test set sentences), and the second in which only shorter sentences, of length \u2264 20 words, are chosen as training material." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-68", "text": "The rationale behind the second experiment is that shorter sentences are less likely to contain parser errors." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-69", "text": "We use the BLEU evaluation metric for our experiments." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-70", "text": "We measure both coverage and full coverage." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-71", "text": "Coverage measures the number of cases for which the generator produced some kind of output." 
}, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-72", "text": "Full coverage measures the number of cases for which the generator produced a tree spanning all of the words in the input." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-73", "text": "----------------------------------" }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-74", "text": "**RESULTS**" }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-75", "text": "The results of our experiments are shown in Fig. 1 ." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-76", "text": "The first row shows the results we obtain when the baseline system is applied to the fstructures derived from the 1,000 BNC gold standard parse trees." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-77", "text": "The second row shows the results on the same test set for a system trained on Charniak and Johnson parser output trees for 500,000 BNC sentences." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-78", "text": "The results in the final row are obtained by training the generator on Charniak and Johnson parser output trees for 500,000 BNC sentences of length \u2264 20 words in length." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-79", "text": "Discussion As expected, the performance of the baseline system degrades when faced with out-ofdomain test data." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-80", "text": "The BLEU score drops from a 0.66 score for WSJ test data to a 0.54 score for the BNC test data, and full coverage drops from 85.97% to 68.77%." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-81", "text": "There is a substantial improvement, however, when the generator is trained on BNC data." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-82", "text": "The BLEU score jumps from 0.5358 to 0.6135." 
}, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-83", "text": "There are at least two possible reasons why a BLEU score of 0.66 is not obtained: The first is that the quality of the f-structure-annotated trees upon which the generator has been trained has degraded." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-84", "text": "For the baseline system, the generator is trained on f-structure-annotated trees derived from gold trees." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-85", "text": "The new system is trained on f-structureannotated parser output trees, and the performance of Charniak and Johnson's parser degrades when applied to BNC data (Foster et al., 2007) ." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-86", "text": "The second reason has been suggested by Gildea (2001) : WSJ data is easier to learn than the more varied data in the Brown Corpus or BNC." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-87", "text": "Perhaps even if gold standard BNC parse trees were available for training, the system would not behave as well as it does for WSJ material." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-88", "text": "It is interesting to note that training on 500,000 shorter sentences does not appear to help." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-89", "text": "We hypothesized that it would improve results because shorter sentences are less likely to contain parser errors." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-90", "text": "The drop in full coverage from 86.69% to 79.58% suggests that the number of short sentences needs to be increased so that the size of the training material stays constant." 
}, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-91", "text": "----------------------------------" }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-92", "text": "**CONCLUSION**" }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-93", "text": "We have investigated the effect of domain variation on a LFG-based WSJ-trained generation system by testing the system's performance on 1,000 sentences from the British National Corpus." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-94", "text": "Performance drops from a BLEU score of 0.66 on WSJ test data to 0.54 on the BNC test set." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-95", "text": "Encouragingly, we have also shown that domain-specific training material produced by a parser can be used to claw back a significant portion of this performance degradation." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-96", "text": "Our method is general and could be applied to other WSJ-trained generators (e.g. (Nakanishi et , 2007) )." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-97", "text": "We intend to continue this research by training our generator on parse trees produced by a BNC-self-trained version of the Charniak and Johnson reranking parser (Foster et al., 2007) ." }, { "sent_id": "854679aec9cf4f53e97936f865f49c-C001-98", "text": "We also hope to extend the evaluation beyond the BLEU metric by carrying out a human judgement evaluation." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "854679aec9cf4f53e97936f865f49c-C001-65" ], [ "854679aec9cf4f53e97936f865f49c-C001-85" ] ], "cite_sentences": [ "854679aec9cf4f53e97936f865f49c-C001-65", "854679aec9cf4f53e97936f865f49c-C001-85" ] }, "@EXT@": { "gold_contexts": [ [ "854679aec9cf4f53e97936f865f49c-C001-97" ] ], "cite_sentences": [ "854679aec9cf4f53e97936f865f49c-C001-97" ] } } }, "ABC_31b06dfc081149e1e436f0bb5e0904_46": { "x": [ { "sent_id": "31b06dfc081149e1e436f0bb5e0904-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "31b06dfc081149e1e436f0bb5e0904-C001-2", "text": "As part of the SemEval-2015 shared task on Broad-Coverage Semantic Dependency Parsing, we evaluate the performace of our last year's system (TurboSemanticParser) on multiple languages and out-of-domain data." }, { "sent_id": "31b06dfc081149e1e436f0bb5e0904-C001-3", "text": "Our system is characterized by a feature-rich linear model, that includes scores for first and second-order dependencies (arcs, siblings, grandparents and co-parents)." }, { "sent_id": "31b06dfc081149e1e436f0bb5e0904-C001-4", "text": "For decoding this second-order model, we solve a linear relaxation of that problem using alternating directions dual decomposition (AD 3 )." }, { "sent_id": "31b06dfc081149e1e436f0bb5e0904-C001-5", "text": "The experiments have shown that, even though the parser's performance in Chinese and Czech attains around 80% (not too far from English performance), domain shift is a serious issue, suggesting domain adaptation as an interesting avenue for future research." 
}, { "sent_id": "31b06dfc081149e1e436f0bb5e0904-C001-6", "text": "----------------------------------" }, { "sent_id": "31b06dfc081149e1e436f0bb5e0904-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "31b06dfc081149e1e436f0bb5e0904-C001-8", "text": "The last years have witnessed a continuous progress in statistical multilingual models for syntax, thanks to shared tasks such as CoNLL 2006-7 (Buchholz and Marsi, 2006; Nivre et al., 2007) and, more recently, SPMRL 2013-14 (Seddah et al., 2013; Seddah et al., 2014) ." }, { "sent_id": "31b06dfc081149e1e436f0bb5e0904-C001-9", "text": "As a global trend, we observe that models that incorporate rich global features are typically more accurate, even if pruning is necessary or decoding needs to be approximate Koo and Collins, 2010; Bohnet and Nivre, 2012; Martins et al., 2009 Martins et al., , 2013 ." }, { "sent_id": "31b06dfc081149e1e436f0bb5e0904-C001-10", "text": "The same rationale applies to semantic dependency parsing, also a structured prediction problem, but where the output variable is a semantic graph, rather than a syntactic tree." }, { "sent_id": "31b06dfc081149e1e436f0bb5e0904-C001-11", "text": "Indeed, the best performing systems in last year shared task on broad-coverage semantic dependency parsing follow this principle (Oepen et al., 2014) ." }, { "sent_id": "31b06dfc081149e1e436f0bb5e0904-C001-12", "text": "This year, a new challenge was put forth: how to handle multiple languages and out-ofdomain data?" }, { "sent_id": "31b06dfc081149e1e436f0bb5e0904-C001-13", "text": "Our proposed parser ( \u00a72) is essentially the same that we submitted in the previous year to the same SemEval task (Martins and Almeida, 2014) , where we scored top in the open challenge and second in the closed track." }, { "sent_id": "31b06dfc081149e1e436f0bb5e0904-C001-14", "text": "This year, we report results using new out-of-domain and multilingual data (namely, Czech and Chinese, in addition to English)." 
}, { "sent_id": "31b06dfc081149e1e436f0bb5e0904-C001-15", "text": "For the English language, we participated in the closed and open tracks, using as additional resources the syntactic dependency annotations provided by the organizers." }, { "sent_id": "31b06dfc081149e1e436f0bb5e0904-C001-16", "text": "For Czech and Chinese, we only addressed the closed track, since no companion data were provided for these languages." }, { "sent_id": "31b06dfc081149e1e436f0bb5e0904-C001-17", "text": "We did not participate in the gold track that uses gold-standard syntactic annotations; and we did not address the prediction of predicate senses." }, { "sent_id": "31b06dfc081149e1e436f0bb5e0904-C001-18", "text": "----------------------------------" }, { "sent_id": "31b06dfc081149e1e436f0bb5e0904-C001-19", "text": "**SEMANTIC PARSER**" }, { "sent_id": "31b06dfc081149e1e436f0bb5e0904-C001-20", "text": "For this year's shared task, we re-run the semantic parser that we developed last year, which is fully desc1ribed in Martins and Almeida (2014), on the new datasets." }, { "sent_id": "31b06dfc081149e1e436f0bb5e0904-C001-21", "text": "Since this parser was designed to be multi-lingual, it was straightforward to apply it to the languages introduced this year (Chinese and Czech), as well as on the out-of-domain data." }, { "sent_id": "31b06dfc081149e1e436f0bb5e0904-C001-22", "text": "We briefly describe our semantic parser (which we dub TurboSemanticParser and release as opensource software 1 ), and refer the interested reader to Figure 1 : Parts considered by our semantic parser." }, { "sent_id": "31b06dfc081149e1e436f0bb5e0904-C001-23", "text": "The top row illustrate the basic parts, representing the event that a word is a predicate, or the existence of an arc between a predicate and an argument, eventually labeled with a semantic role." 
}, { "sent_id": "31b06dfc081149e1e436f0bb5e0904-C001-24", "text": "Our second-order model looks at some pairs of arcs: arcs bearing a grandparent relationship, arguments of the same predicate, predicates sharing the same argument, and consecutive versions of these two." }, { "sent_id": "31b06dfc081149e1e436f0bb5e0904-C001-25", "text": "See Martins and Almeida (2014) for further details." }, { "sent_id": "31b06dfc081149e1e436f0bb5e0904-C001-26", "text": "The parser was built as an extension of a recent dependency parser, TurboParser (Martins et al., 2010, 2013), with the goal of performing semantic parsing using any of the three formalisms considered in the shared task (DM, PAS, and PSD)." }, { "sent_id": "31b06dfc081149e1e436f0bb5e0904-C001-27", "text": "We have followed prior work in semantic role labeling (Toutanova et al., 2005; Johansson and Nugues, 2008; Das et al., 2012; Flanigan et al., 2014) by adding constraints and modeling interactions among arguments within the same frame; however, we went beyond such sibling interactions to consider more complex grandparent and co-parent structures, effectively correlating different predicates." }, { "sent_id": "31b06dfc081149e1e436f0bb5e0904-C001-28", "text": "The overall set of parts used by our parser is illustrated in Figure 1 ; note that by using only a subset of the parts (predicate, arc, labeled arc, and sibling parts), the semantic parser decodes each predicate frame independently from other predicates; it is the co-parent and grandparent parts that have the effect of creating inter-dependence among predicates; we will analyze the effect of these dependencies in the experimental section ( \u00a73)."
}, { "sent_id": "31b06dfc081149e1e436f0bb5e0904-C001-29", "text": "For each part in our model (shown in Figure 1 ), we computed binary features based on various combinations of lexical forms, lemmas, POS tags and syntactic dependency relations of words related to the corresponding predicates and arguments." }, { "sent_id": "31b06dfc081149e1e436f0bb5e0904-C001-30", "text": "Most of these features were taken from TurboParser (Martins et al., 2013) , and others were inspired by the semantic parser of Johansson and Nugues (2008) ." }, { "sent_id": "31b06dfc081149e1e436f0bb5e0904-C001-31", "text": "To tackle all the parts, we formulate parsing as a global optimization problem and solve a relaxation through AD3 (Martins et al., 2011), a fast dual decomposition algorithm in which several simple local subproblems are solved iteratively." }, { "sent_id": "31b06dfc081149e1e436f0bb5e0904-C001-32", "text": "Through a rich set of features, we arrive at top accuracies at parsing speeds of around 1,000 tokens per second." }, { "sent_id": "31b06dfc081149e1e436f0bb5e0904-C001-33", "text": "See Martins and Almeida (2014) for details on the model, features and decoding process that were used." }, { "sent_id": "31b06dfc081149e1e436f0bb5e0904-C001-34", "text": "----------------------------------" }, { "sent_id": "31b06dfc081149e1e436f0bb5e0904-C001-35", "text": "**EXPERIMENTAL RESULTS**" }, { "sent_id": "31b06dfc081149e1e436f0bb5e0904-C001-36", "text": "All models were trained by running 10 epochs of max-loss MIRA with C = 0.01 (Crammer et al., 2006) ." }, { "sent_id": "31b06dfc081149e1e436f0bb5e0904-C001-37", "text": "The cost function takes into account mismatches between predicted and gold dependencies, with a cost c_P on labeled arcs incorrectly predicted (false positives) and a cost c_R on gold labeled arcs that were missed (false negatives)."
}, { "sent_id": "31b06dfc081149e1e436f0bb5e0904-C001-38", "text": "These values were set through cross-validation on the dev set, yielding c_P = 0.4 and c_R = 0.6 in all runs, except for the English PSD dataset in the closed track, for which c_P = 0.3 and c_R = 0.7." }, { "sent_id": "31b06dfc081149e1e436f0bb5e0904-C001-39", "text": "As in previous work, we speed up decoding by training a probabilistic unlabeled first-order pruner and discarding the arcs whose posterior probability is below 10^-4." }, { "sent_id": "31b06dfc081149e1e436f0bb5e0904-C001-40", "text": "This allows a significant reduction of the search space with a very small drop in recall." }, { "sent_id": "31b06dfc081149e1e436f0bb5e0904-C001-41", "text": "Table 1 shows our final results on the test set, for a model trained on the train and development partitions." }, { "sent_id": "31b06dfc081149e1e436f0bb5e0904-C001-42", "text": "Note that we do not report scores for complete predications, since we did not predict predicate senses." }, { "sent_id": "31b06dfc081149e1e436f0bb5e0904-C001-43", "text": "Our system achieved the best final score in 3 out of the 4 tracks for the English language, and for the in-domain closed track in the Czech language." }, { "sent_id": "31b06dfc081149e1e436f0bb5e0904-C001-44", "text": "For the remaining 3 tracks we scored relatively close to the best system (Peking), which consists of an ensemble of various methods." }, { "sent_id": "31b06dfc081149e1e436f0bb5e0904-C001-45", "text": "For all languages, the runtimes are on par with last year's submission (around 1,000 tokens per second)." }, { "sent_id": "31b06dfc081149e1e436f0bb5e0904-C001-46", "text": "As expected, the scores obtained for out-of-domain data are significantly below those obtained with in-domain data." }, { "sent_id": "31b06dfc081149e1e436f0bb5e0904-C001-47", "text": "This degradation becomes particularly striking for Czech, with F1-scores dropping more than 15%."
}, { "sent_id": "31b06dfc081149e1e436f0bb5e0904-C001-48", "text": "This suggests that domain adaptation (Blitzer et al., 2006; Daum\u00e9 III, 2007 ) is an interesting research avenue for future work." }, { "sent_id": "31b06dfc081149e1e436f0bb5e0904-C001-49", "text": "In addition, as found last year for English, the gap between labeled and unlabeled scores is much higher in the PSD formalism (for English and Czech) than it is for the DM and PAS formalisms (for English and Chinese)." }, { "sent_id": "31b06dfc081149e1e436f0bb5e0904-C001-50", "text": "Finally, to assess the importance of the second-order features, Table 2 reports experiments on the dev set that progressively add several groups of features." }, { "sent_id": "31b06dfc081149e1e436f0bb5e0904-C001-51", "text": "We can see that second-order features provide valuable information that improves the final scores." }, { "sent_id": "31b06dfc081149e1e436f0bb5e0904-C001-52", "text": "In particular, the higher-order features are extremely useful for Chinese and Czech, where we can observe gains of 1.5-2.0% over a sibling model that factors over predicates." }, { "sent_id": "31b06dfc081149e1e436f0bb5e0904-C001-53", "text": "----------------------------------" }, { "sent_id": "31b06dfc081149e1e436f0bb5e0904-C001-54", "text": "**CONCLUSIONS**" }, { "sent_id": "31b06dfc081149e1e436f0bb5e0904-C001-55", "text": "Our system, which is inspired by prior work in syntactic parsing, implements a linear model with second-order features and is able to model interactions among siblings, grandparents and co-parents." }, { "sent_id": "31b06dfc081149e1e436f0bb5e0904-C001-56", "text": "We have shown empirically that, for all three languages, second-order features that correlate multiple predicates have a strong impact on the final scores." }, { "sent_id": "31b06dfc081149e1e436f0bb5e0904-C001-57", "text": "However, there is a large drop in accuracy when moving to out-of-domain data."
} ], "y": { "@BACK@": { "gold_contexts": [ [ "31b06dfc081149e1e436f0bb5e0904-C001-9" ] ], "cite_sentences": [ "31b06dfc081149e1e436f0bb5e0904-C001-9" ] }, "@MOT@": { "gold_contexts": [ [ "31b06dfc081149e1e436f0bb5e0904-C001-9" ] ], "cite_sentences": [ "31b06dfc081149e1e436f0bb5e0904-C001-9" ] }, "@USE@": { "gold_contexts": [ [ "31b06dfc081149e1e436f0bb5e0904-C001-26" ], [ "31b06dfc081149e1e436f0bb5e0904-C001-30" ] ], "cite_sentences": [ "31b06dfc081149e1e436f0bb5e0904-C001-26", "31b06dfc081149e1e436f0bb5e0904-C001-30" ] } } }, "ABC_fc4b56c865c8a9d0f6a7f5ae37ba96_46": { "x": [ { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-2", "text": "We build parallel FDA5 (ParFDA) Moses statistical machine translation (SMT) systems for all language pairs in the workshop on statistical machine translation (Bojar et al., 2015) (WMT15) translation task and obtain results close to the top with an average of 3.176 BLEU points difference, using significantly fewer resources for building SMT systems." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-3", "text": "ParFDA is a parallel implementation of feature decay algorithms (FDA) developed for fast deployment of accurate SMT systems (Bi\u00e7ici, 2013; Bi\u00e7ici et al., 2014; Bi\u00e7ici and Yuret, 2015) ." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-4", "text": "The ParFDA Moses SMT system we built is able to obtain the top TER performance in French to English translation." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-5", "text": "We make the data for building ParFDA Moses SMT systems for WMT15 available: https://github.com/bicici/ParFDAWMT15." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-6", "text": "Statistical machine translation performance is influenced by the data: if you already have the translations for the source being translated in your training set, or even portions of it, then the translation task becomes easier."
}, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-7", "text": "If some token does not appear in your language model (LM), then it becomes harder for the SMT engine to find its correct position in the translation." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-8", "text": "The importance of ParFDA increases with the proliferation of training material available for building SMT systems." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-9", "text": "Table 1 presents the statistics of the available training and LM corpora for the constrained (C) systems in WMT15 (Bojar et al., 2015) as well as the statistics of the ParFDA selected training and LM data." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-10", "text": "ParFDA (Bi\u00e7ici, 2013; Bi\u00e7ici et al., 2014) runs separate FDA5 (Bi\u00e7ici and Yuret, 2015) models on randomized subsets of the training data and combines the selections afterwards." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-11", "text": "FDA5 is available at http://github.com/bicici/FDA." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-12", "text": "We run ParFDA SMT experiments using Moses (Koehn et al., 2007) in all language pairs in WMT15 (Bojar et al., 2015) and obtain SMT performance close to the top constrained Moses systems." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-13", "text": "ParFDA allows rapid prototyping of SMT systems for a given target domain or task." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-14", "text": "We use ParFDA for selecting parallel training data and LM data for building SMT systems." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-15", "text": "We select the LM training data with ParFDA based on the following observation (Bi\u00e7ici, 2013):" }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-16", "text": "No word not appearing in the training set can appear in the translation." 
}, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-17", "text": "Thus we are only interested in correctly ordering the words appearing in the training corpus and collecting the sentences that contain them for building the LM." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-18", "text": "At the same time, a compact and more relevant LM corpus is also useful for modeling longer range dependencies with higher order n-gram models." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-19", "text": "We use 3-grams for selecting training data and 2-grams for LM corpus selection." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-20", "text": "We run ParFDA SMT experiments for all language pairs in both directions in the WMT15 translation task (Bojar et al., 2015) , which include English-Czech (en-cs), English-German (en-de), English-Finnish (en-fi), English-French (en-fr), and English-Russian (en-ru)." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-21", "text": "We truecase all of the corpora, set the maximum sentence length to 126, use 150-best lists during tuning, set the LM order to a value in [7, 10] for all language pairs, and train the LM using SRILM (Stolcke, 2002) with the -unk option." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-22", "text": "For GIZA++ (Och and Ney, 2003) , max-fertility is set to 10, with the number of iterations set to 7,3,5,5,7 for IBM models 1,2,3,4, and the HMM model, and 70 word classes are learned over 3 iterations with the mkcls tool during training." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-23", "text": "The development set contains up to 5000 sentences randomly sampled from previous years' development sets (2010-2014); the remainder comes from the development set for WMT15." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-24", "text": "The statistics for the ParFDA selected training data and the available training data for the constrained translation task are given in Table 1 ."
}, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-25", "text": "For en and fr, we have access to the LDC Gigaword corpora (Parker et al., 2011), from which we extract only the story type news." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-26", "text": "The size of the LM corpora includes both the LDC and the monolingual LM corpora provided by WMT15." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-27", "text": "Table 1 shows the significant size differences between the constrained dataset (C) and the ParFDA selected data and also presents the source and target coverage (SCOV and TCOV) in terms of the 2-grams of the test set." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-28", "text": "The quality of the training corpus can be measured by TCOV, which is found to correlate well with the BLEU performance achievable (Bi\u00e7ici, 2011)." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-29", "text": "The space and time required for building the ParFDA Moses SMT systems are quantified in Table 2 , where size is in MB and time in minutes." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-30", "text": "PT stands for the phrase table." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-31", "text": "We used Moses version 3.0, from www.statmt.org/moses." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-32", "text": "Building a ParFDA Moses SMT system can take about half a day." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-33", "text": "ParFDA Moses SMT results for each translation direction, together with the LM order used and the top constrained submissions to WMT15, are given in Table 3 1 , where BLEUc is cased BLEU." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-34", "text": "ParFDA significantly reduces the time required for training, development, and deployment of an SMT system for a given translation task."
}, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-35", "text": "The average difference to the top constrained submission in WMT15 is 3.176 BLEU points whereas the difference was 3.49 BLEU points in WMT14 (Bi\u00e7ici et al., 2014) ." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-36", "text": "Performance improvement over last year's results is likely due to using higher order n-grams for data selection." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-37", "text": "ParFDA Moses SMT system is able to obtain the top TER performance in fr-en." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-38", "text": "1 We use the results from matrix.statmt.org." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-39", "text": "S \u2192 T" }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-40", "text": "A LM selected for a given translation task allows us to train higher order language models, model longer range dependencies better, and achieve lower perplexity as shown in Table 4 ." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-41", "text": "We compare the perplexity of the ParFDA selected LM with a LM trained on the ParFDA selected training data and a LM trained using all of the available training corpora." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-42", "text": "We build LM using SRILM with interpolated Kneser-Ney discounting (-kndiscount -interpolate)." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-43", "text": "We also use -unk option to build open-vocabulary LM." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-44", "text": "We are able to achieve significant reductions in the number of OOV tokens and the perplexity, reaching up to 78% reduction in the number of OOV tokens and up to 63% reduction in the perplexity." 
}, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-45", "text": "ParFDA can achieve larger reductions in perplexity than the 27% that can be achieved using a morphological analyzer and disambiguator for Turkish (Yuret and Bi\u00e7ici, 2009) and can decrease the OOV rate at a similar rate." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-46", "text": "Table 4 also presents the average log probability of tokens and the log probability of the unknown-word token. The increase in the ratio between them in the last column shows that OOV tokens in the ParFDA LM are not only fewer but also less likely." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-47", "text": "We use ParFDA for solving computational scalability problems caused by the abundance of training data for SMT models and LMs and still achieve SMT performance that is on par with the top performing SMT systems." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-48", "text": "ParFDA raises the bar of expectations from SMT with highly accurate translations and lowers the bar to entry for SMT into new domains and tasks by allowing fast deployment of SMT systems." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-49", "text": "ParFDA enables a shift from general purpose SMT systems towards task adaptive SMT solutions." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-50", "text": "We make the data for building ParFDA Moses SMT systems for WMT15 available: https://github.com/bicici/ParFDAWMT15." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-51", "text": "----------------------------------" }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-52", "text": "**PARALLEL FDA5 (PARFDA)**" }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-53", "text": "Statistical machine translation performance is influenced by the data: if you already have the translations for the source being translated in your training set, or even portions of it, then the translation task becomes easier."
}, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-54", "text": "If some token does not appear in your language model (LM), then it becomes harder for the SMT engine to find its correct position in the translation." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-55", "text": "The importance of ParFDA increases with the proliferation of training material available for building SMT systems." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-56", "text": "Table 1 presents the statistics of the available training and LM corpora for the constrained (C) systems in WMT15 (Bojar et al., 2015) as well as the statistics of the ParFDA selected training and LM data." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-57", "text": "ParFDA (Bi\u00e7ici, 2013; Bi\u00e7ici et al., 2014) runs separate FDA5 (Bi\u00e7ici and Yuret, 2015 ) models on randomized subsets of the training data and combines the selections afterwards." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-58", "text": "FDA5 is available at http://github.com/bicici/FDA." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-59", "text": "We run ParFDA SMT experiments using Moses (Koehn et al., 2007) in all language pairs in WMT15 (Bojar et al., 2015) and obtain SMT performance close to the top constrained Moses systems." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-60", "text": "ParFDA allows rapid prototyping of SMT systems for a given target domain or task." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-61", "text": "We use ParFDA for selecting parallel training data and LM data for building SMT systems." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-62", "text": "We select the LM training data with ParFDA based on the following observation (Bi\u00e7ici, 2013) :" }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-63", "text": "No word not appearing in the training set can appear in the translation." 
}, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-64", "text": "Thus we are only interested in correctly ordering the words appearing in the training corpus and collecting the sentences that contain them for building the LM." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-65", "text": "At the same time, a compact and more relevant LM corpus is also useful for modeling longer range dependencies with higher order n-gram models." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-66", "text": "We use 3-grams for selecting training data and 2-grams for LM corpus selection." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-67", "text": "----------------------------------" }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-68", "text": "**RESULTS**" }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-69", "text": "We run ParFDA SMT experiments for all language pairs in both directions in the WMT15 translation task (Bojar et al., 2015) , which include English-Czech (en-cs), English-German (en-de), English-Finnish (en-fi), English-French (en-fr), and English-Russian (en-ru)." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-70", "text": "We truecase all of the corpora, set the maximum sentence length to 126, use 150-best lists during tuning, set the LM order to a value in [7, 10] for all language pairs, and train the LM using SRILM (Stolcke, 2002) with the -unk option." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-71", "text": "For GIZA++ (Och and Ney, 2003) , max-fertility is set to 10, with the number of iterations set to 7,3,5,5,7 for IBM models 1,2,3,4, and the HMM model. Table 1: Data statistics for the available training and LM corpora in the constrained (C) setting compared with the ParFDA selected training and LM data." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-72", "text": "#words is in millions (M) and #sents in thousands (K)."
}, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-73", "text": "70 word classes are learned over 3 iterations with the mkcls tool during training." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-74", "text": "The development set contains up to 5000 sentences randomly sampled from previous years' development sets (2010-2014); the remainder comes from the development set for WMT15." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-75", "text": "----------------------------------" }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-76", "text": "**STATISTICS**" }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-77", "text": "The statistics for the ParFDA selected training data and the available training data for the constrained translation task are given in Table 1 ." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-78", "text": "For en and fr, we have access to the LDC Gigaword corpora (Parker et al., 2011), from which we extract only the story type news." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-79", "text": "The size of the LM corpora includes both the LDC and the monolingual LM corpora provided by WMT15." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-80", "text": "Table 1 shows the significant size differences between the constrained dataset (C) and the ParFDA selected data and also presents the source and target coverage (SCOV and TCOV) in terms of the 2-grams of the test set." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-81", "text": "The quality of the training corpus can be measured by TCOV, which is found to correlate well with the BLEU performance achievable (Bi\u00e7ici, 2011) ." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-82", "text": "The space and time required for building the ParFDA Moses SMT systems are quantified in Table 2 , where size is in MB and time in minutes." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-83", "text": "PT stands for the phrase table."
}, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-84", "text": "We used Moses version 3.0, from www.statmt.org/moses." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-85", "text": "Building a ParFDA Moses SMT system can take about half a day." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-86", "text": "----------------------------------" }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-87", "text": "**TRANSLATION RESULTS**" }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-88", "text": "ParFDA Moses SMT results for each translation direction together with the LM order used and the top constrained submissions to WMT15 are given in Table 3 1 , where BLEUc is cased BLEU." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-89", "text": "ParFDA significantly reduces the time required for training, development, and deployment of an SMT system for a given translation task." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-90", "text": "The average difference to the top constrained submission in WMT15 is 3.176 BLEU points whereas the difference was 3.49 BLEU points in WMT14 (Bi\u00e7ici et al., 2014) ." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-91", "text": "Performance improvement over last year's results is likely due to using higher order n-grams for data selection." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-92", "text": "ParFDA Moses SMT system is able to obtain the top TER performance in fr-en." 
}, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-93", "text": "----------------------------------" }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-94", "text": "**S \u2192 T**" }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-95", "text": "----------------------------------" }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-96", "text": "**LM DATA QUALITY**" }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-97", "text": "A LM selected for a given translation task allows us to train higher order language models, model longer range dependencies better, and achieve lower perplexity, as shown in Table 4 ." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-98", "text": "We compare the perplexity of the ParFDA selected LM with a LM trained on the ParFDA selected training data and a LM trained using all of the available training corpora." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-99", "text": "We build the LM using SRILM with interpolated Kneser-Ney discounting (-kndiscount -interpolate)." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-100", "text": "We also use the -unk option to build an open-vocabulary LM." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-101", "text": "We are able to achieve significant reductions in the number of OOV tokens and the perplexity, reaching up to 78% reduction in the number of OOV tokens and up to 63% reduction in the perplexity." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-102", "text": "ParFDA can achieve larger reductions in perplexity than the 27% that can be achieved using a morphological analyzer and disambiguator for Turkish (Yuret and Bi\u00e7ici, 2009) and can decrease the OOV rate at a similar rate." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-103", "text": "Table 4 also presents the average log probability of tokens and the log probability of the unknown-word token. 
The increase in the ratio between them in the last column shows that OOV tokens in the ParFDA LM are not only fewer but also less likely." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-104", "text": "----------------------------------" }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-105", "text": "**CONCLUSION**" }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-106", "text": "We use ParFDA for solving computational scalability problems caused by the abundance of training data for SMT models and LMs and still achieve SMT performance that is on par with the top performing SMT systems." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-107", "text": "ParFDA raises the bar of expectations from SMT with highly accurate translations and lowers the bar to entry for SMT into new domains and tasks by allowing fast deployment of SMT systems." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-108", "text": "ParFDA enables a shift from general purpose SMT systems towards task adaptive SMT solutions." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-109", "text": "We make the data for building ParFDA Moses SMT systems for WMT15 available: https://github.com/bicici/ParFDAWMT15." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-110", "text": "Table 4 : Perplexity comparison of the LM built from the training corpus (train), ParFDA selected training data (FDA5 train), and the ParFDA selected LM data (FDA5 LM)." }, { "sent_id": "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-111", "text": "%red is the proportion of reduction."
} ], "y": { "@USE@": { "gold_contexts": [ [ "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-9" ], [ "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-12" ], [ "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-20" ], [ "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-56" ], [ "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-59" ], [ "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-69" ] ], "cite_sentences": [ "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-9", "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-12", "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-20", "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-56", "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-59", "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-69" ] }, "@SIM@": { "gold_contexts": [ [ "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-12" ] ], "cite_sentences": [ "fc4b56c865c8a9d0f6a7f5ae37ba96-C001-12" ] } } }, "ABC_ece5f95d3c616ceeb0b3061e606b41_46": { "x": [ { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-2", "text": "Distributed vector representations of words are useful in various NLP tasks." }, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-3", "text": "We briefly review the CBOW approach and propose a bilingual application of this architecture with the aim to improve consistency and coherence of Machine Translation." }, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-4", "text": "The primary goal of the bilingual extension is to handle ambiguous words for which the different senses are conflated in the monolingual setup." }, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-5", "text": "----------------------------------" }, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-7", "text": "Machine Translation (MT) systems are nowadays achieving a high-quality performance." }, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-8", "text": "However, they are typically developed at sentence level using only local information and ignoring the document-level one." 
}, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-9", "text": "Recent work claims that discourse-wide context can help to translate individual words in a way that leads to more coherent translations (Hardmeier et al., 2013; Hardmeier et al., 2012; Gong et al., 2011; Xiao et al., 2011) ." }, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-33", "text": "The semantic models are built using a combination of freely available corpora for English and Spanish (Europarl v7, United Nations and Multilingual United Nations, and Subtitles2012)." }, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-34", "text": "They can be found on the OPUS site (Tiedemann, 2012). We trained vectors to represent word-pair forms on these corpora with the word2vec CBOW implementation." }, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-35", "text": "We built a training set of almost 600 million words and used 600-dimension vectors in the training." }, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-36", "text": "Regarding the alignments, we only used word-to-word ones to avoid noise." }, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-37", "text": "----------------------------------" }, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-38", "text": "**ACCURACY OF THE SEMANTIC MODEL**" }, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-39", "text": "We first evaluate the quality of the models on the task of predicting semantically related words." }, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-40", "text": "A Spanish native speaker built the bilingual test set, following the same process used for the training data, from a list of 19,544 questions introduced by Mikolov et al. (2013c)." }, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-41", "text": "In our bilingual scenario, the task is to predict a pair of words given two pairs of related words."
}, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-42", "text": "For instance, given the pair Athens|Atenas Greece|Grecia and the question London|Londres, the task is to predict England|Inglaterra." }, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-43", "text": "Table 1 shows the results, both overall accuracy and accuracy over the known words for the models." }, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-44", "text": "Using the first 30, 000 entries of the model (the most frequent ones), we obtain 32% of accuracy for English (mono en) and 10% for Spanish (mono es)." }, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-45", "text": "We chose these parameters for our system to obtain comparable results to the ones in (Mikolov et al., 2013a ) for a CBOW architecture but trained with 783 million words (50.4%)." }, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-46", "text": "Decay for the model in Spanish can be due to the fact that it was built from automatic translations." }, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-47", "text": "In the bilingual case (bi en-es), the accuracy is lower than for English probably due to the noise in translations and word alignment." }, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-48", "text": "----------------------------------" }, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-49", "text": "**CROSS-LINGUAL LEXICAL SUBSTITUTION**" }, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-50", "text": "Another way to evaluate the semantic models is through the effect they have in translation." }, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-51", "text": "We implemented the Cross-Lingual Lexical Substitution task carried out in SemEval-2010 (Task2, 2010 and applied it to a test set of news data from the News Commentary corpus of 2011." 
}, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-52", "text": "We identify those content words which are translated in more than one way by a baseline translation system (Moses trained with Europarl v7)." }, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-53", "text": "Given one of these content words, we take the two previous and two following words and look for their vector representations using our bilingual models." }, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-54", "text": "We compute a linear combination of these vectors to obtain a context vector." }, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-55", "text": "Then, to chose the best translation option, we calculate a score based on the similarity among the vector of every possible translation option seen in the document and the context vector." }, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-56", "text": "In average there are 615 words per document within the test set and 7% are translated in more than one way by the baseline system." }, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-57", "text": "Our bilingual models know in average 87.5% of the words and 83.9% of the ambiguous ones, so although there is a good coverage for this test set, still, some of the candidates cannot be retranslated or some of the options cannot be used because they are missing in the models." }, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-58", "text": "The accuracy obtained after retranslation of the known ambiguous words is 62.4% and this score is slightly better than the result obtained by using the most frequent translation for ambiguous words (59.8%)." }, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-59", "text": "Even though this improvement is rather modest, it shows potential benefits of our model in MT." 
}, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-60", "text": "----------------------------------" }, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-61", "text": "**CONCLUSIONS**" }, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-62", "text": "We implemented a new application of word vector representations for MT." }, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-63", "text": "The system uses word alignments to build bilingual models with the final aim to improve the lexical selection for words that can be translated in more than one sense." }, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-64", "text": "The models have been evaluated regarding their accuracy when trying to predict related words (Section 3.1) and also regarding its possible effect within a translation system (Section 3.2)." }, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-65", "text": "In both cases one observes that the quality of the translation and alignments previous to building the semantic models are bottlenecks for the final performance: part of the vocabulary, and therefore translation pairs, are lost in the training process." }, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-66", "text": "Future work includes studying different kinds of alignment heuristics." }, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-67", "text": "We plan to develop new features based on the semantic models to use them inside state-of-the-art SMT systems like Moses (Koehn et al., 2007) or discourse-oriented decoders like Docent (Hardmeier et al., 2013) ." }, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-10", "text": "Standard SMT systems use n-gram models to represent words in the target language." }, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-11", "text": "However, there are other word representation techniques that use vectors of contextual information." 
}, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-12", "text": "Recently, several distributed word representation models have been introduced that have interesting properties regarding to the semantic information that they capture." }, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-13", "text": "In particular, we are interested in the word2vec package available in (Mikolov et al., 2013a) ." }, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-14", "text": "These models proved to be robust and powerful for predicting semantic relations between words and even across languages." }, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-15", "text": "However, they are not able to handle lexical ambiguity as they conflate word senses of polysemous words into one common representation." }, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-16", "text": "This limitation is already discussed in (Mikolov et al., 2013b) and in (Wolf et al., 2014) , in which bilingual extensions of the word2vec architecture are proposed." }, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-17", "text": "In contrast to their approach, we are not interested in monolingual applications but instead like to concentrate directly on the bilingual case in connection with MT." }, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-18", "text": "We built bilingual word representation models based on word-aligned parallel corpora by an application of the Continuous Bag-of-Words (CBOW) algorithm to the bilingual case (Section 2)." }, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-19", "text": "We made a twofold preliminary evaluation of the acquired word-pair representations on two different tasks (Section 3): predicting semantically related words (3.1) and cross-lingual lexical substitution (3.2)." }, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-20", "text": "Section 4 draws the conclusions and sets the future work in a direct application of these models to MT." 
}, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-21", "text": "----------------------------------" }, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-22", "text": "**SEMANTIC MODELS USING CBOW**" }, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-23", "text": "The basic architecture that we use to build our models is CBOW (Mikolov et al., 2013a) ." }, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-24", "text": "The algorithm uses a neural network (NN) to predict a word taking into account its context, but without considering word order." }, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-25", "text": "Despite its drawbacks, we chose to use it since we presume that the translation task applies the same strategy as the CBOW architecture, i.e., from a set of context words try to predict a translation of a specific given word." }, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-26", "text": "In the monolingual case, the NN is trained using a monolingual corpus to obtain the corresponding projection matrix that encloses the vector representations of the words." }, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-27", "text": "In order to introduce the semantic information in a bilingual scenario, we use a parallel corpus and automatic word alignment to extract a training corpus of word pairs: (w i,S |w i,T )." }, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-28", "text": "This approach is different from (Wolf et al., 2014) who build an independent model for each language." }, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-29", "text": "With our method, we try to capture simultaneously the semantic information associated to the source word and the information in the target side of the translation." }, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-30", "text": "In this way, we hope to better capture the semantic information that is implicitly given by translating a text." 
}, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-31", "text": "----------------------------------" }, { "sent_id": "ece5f95d3c616ceeb0b3061e606b41-C001-32", "text": "**EXPERIMENTS**" } ], "y": { "@MOT@": { "gold_contexts": [ [ "ece5f95d3c616ceeb0b3061e606b41-C001-13" ], [ "ece5f95d3c616ceeb0b3061e606b41-C001-45" ] ], "cite_sentences": [ "ece5f95d3c616ceeb0b3061e606b41-C001-13", "ece5f95d3c616ceeb0b3061e606b41-C001-45" ] }, "@USE@": { "gold_contexts": [ [ "ece5f95d3c616ceeb0b3061e606b41-C001-23" ] ], "cite_sentences": [ "ece5f95d3c616ceeb0b3061e606b41-C001-23" ] }, "@SIM@": { "gold_contexts": [ [ "ece5f95d3c616ceeb0b3061e606b41-C001-45" ] ], "cite_sentences": [ "ece5f95d3c616ceeb0b3061e606b41-C001-45" ] } } }, "ABC_511c17a6cb6bd74e0216c3d50eb9c0_46": { "x": [ { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-2", "text": "We present gdbank, a small handbuilt corpus of 32 sentences with dependency structures and categorial grammar type assignments." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-3", "text": "The sentences have been chosen to illustrate as broad a range of the unusual features of Scottish Gaelic as possible, particularly nouns being used to represent psychological states where more thoroughly-studied languages such as English and French would prefer a verb, and prepositions marking aspect, as is also seen in Welsh and, for example, Irish Gaelic." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-4", "text": "We provide hand-built dependency trees, building on previous work on Irish Gaelic and using the Universal Dependency Scheme." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-5", "text": "We also provide a tentative categorial grammar account of the words in the sentences, based largely on previous work on English." 
}, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-6", "text": "----------------------------------" }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-8", "text": "Scottish Gaelic (usually hereafter Gaelic) is a Celtic language, rather closely related to Irish, with around 59,000 speakers as of the last UK census in 2011." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-9", "text": "As opposed to the situation for Irish Gaelic (Lynn et al., 2012a; Lynn et al., 2012b; Lynn et al., 2013; Lynn et al., 2014) there are no treebanks or tagging schemes for Scottish Gaelic, although there are machine-readable dictionaries and databases available from Sabhal M\u00f2r Ostaig." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-10", "text": "A single paper in the ACL Anthology (Kessler, 1995) mentions Scottish Gaelic in the context of computational dialectology of Irish." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-11", "text": "There is also an LREC workshop paper (Scannell, 2006 ) on machine translation between Irish and Scottish Gaelic." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-12", "text": "Elsewhere in the Celtic languages, Welsh has an LFG grammar (Mittendorf and Sadler, 2005) but no treebanks." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-13", "text": "For Breton there is a small amount of work on morphological analysis and Constraint-Grammar-based machine translation (Tyers, 2010) ." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-14", "text": "Recent work on the grammar of Scottish Gaelic (for example (Adger and Ramchand, 2003; Adger and Ramchand, 2005) , but there are many more examples) has largely focussed on theoretical syntactic issues somewhat distant from the more surfacy approaches popular in the field of natural language processing." 
}, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-15", "text": "This paper explores grammatical issues in Scottish Gaelic by means of dependency tagging and combinatory categorial grammar (CCG), which we see as complementary approaches." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-16", "text": "As such it is explicitly inspired by CCGbank (Hockenmaier and Steedman, 2007) , which consists of dependency structures and CCG derivations for over 99% of the Penn Treebank." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-17", "text": "It is hoped that this corpus will be a useful adjunct to currently on-going work in developing a part-of-speech tagset and tagger for Scottish Gaelic." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-18", "text": "Section 2 describes how the corpus was prepared, sections 3 and 4 give some context for the dependency scheme and categorial grammar annotations respectively, and the main part of the paper is section 5, which deals with language-specific features of the corpus." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-19", "text": "----------------------------------" }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-20", "text": "**PREPARING THE CORPUS**" }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-21", "text": "The corpus consists of a small handbuilt selection of sentences from the transcripts of An Litir Bheag, which is a weekly podcast from the BBC written by a native speaker and aimed at Gaelic learners, example sentences from (Lamb, 2003) , the BBC's online news in Gaelic and the Gaelic column in the Scotsman newspaper." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-22", "text": "In order to illustrate as much of the interesting points of Scottish Gaelic as possible, we looked in particular for sentences describing psychological states and made sure that a reasonable number of the sentences used each verb for \"to be\", which we will illustrate in section 5." 
}, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-23", "text": "The sentences are tokenized by hand using the following rules: (1) Punctuation which never forms part of a lexical item such as the comma, the full stop, the colon and the semicolon is always separated out from the previous word." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-24", "text": "(2) Strings connected by a hyphen, for example h-Alba in Banca na h-Alba (Bank of Scotland) or t-\u00d2ban as in an t-\u00d2ban (the town of Oban) are always kept together." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-25", "text": "(3) The apostrophe is kept together with the copula where it proceeds it, for example in 'S fhearr leam (I like)." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-26", "text": "(4) Because the past tense particle do is reduced to dh' before a vowel and before f, and this is always typographically closed up, we separate out past-tense dh' as its own token." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-27", "text": "These rules work for the small dataset described here but would clearly need to be expanded for work in the wild." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-28", "text": "In this preliminary work the dependencies and types have been determined by a single, non-native speaker, annotator, according to a set of guidelines which were built up during the annotation process." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-29", "text": "This is clearly less than ideal, however, the guidelines are available along with the corpus and we hope to be able to get the input of a native speaker, not least for interannotator studies." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-30", "text": "We use the CoNLL-X format (Buchholz and Marsi, 2006) , leaving the POS and projective dependency fields empty and store the categorial grammar type under field 6, FEATS." 
}, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-31", "text": "----------------------------------" }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-32", "text": "**DEPENDENCY SCHEME**" }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-33", "text": "There are four dependency schemes that we consulted while preparing the corpus." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-34", "text": "The initial inspiration was provided by the C&C parser (Curran et al., 2007) , which in addition to providing categorial grammar derivations for sentences provides a dependency structure in the GR (Grammatical Representation) scheme due to (Briscoe and Carroll, 2000; Briscoe and Carroll, 2002) ." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-35", "text": "This contains 23 types and was developed originally for parser evaluation." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-36", "text": "Another popular scheme is the Stanford Dependency scheme (de Marneffe and Manning, 2008; de Marneffe and Manning, 2013) , which is more finely-grained with over twice the number of dependency types to deal specifically with noisy data and to make it more accessible to non-linguists building information extraction applications." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-37", "text": "A very important scheme is the Dublin scheme for Irish (Lynn et al., 2012a; Lynn et al., 2012b; Lynn et al., 2013) , which is of a similar size to the Stanford scheme, but the reason for its size relative to GR is that it includes a large number of dependencies intended to handle grammatical features found in Irish but not in English." 
}, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-38", "text": "Lastly we mention the Universal Dependency Scheme developed in (McDonald et al., 2013) , which we have adopted, despite its being coarser-grained than the Dublin scheme, on account of its simplicity and utility for cross-lingual comparisons and cross-training (Lynn et al., 2014) ." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-39", "text": "Table 1 gives examples of the dependency relations used along with their mapping to the GR scheme." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-40", "text": "----------------------------------" }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-41", "text": "**CATEGORIAL GRAMMAR**" }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-42", "text": "Combinatory categorial grammar (CCG) is a type-logical system which was developed to represent natural languages such as English but has subsequently been extended to other systems such as chord se-quences in jazz (Granroth-Wilding and Steedman, 2012) ." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-43", "text": "For a full description the reader is referred to (Steedman and Baldridge, 2003) , but in order to follow the rest of this paper you merely need to know that the type N/N is a function which takes an argument of N to its right, returning N, and that the type N\\N is a function expecting an argument of N to its left and that these are combined by application, composition, where A/B combines with B/C to yield A/C, and type-raising where N is converted to T/(N\\T)." 
}, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-44", "text": "Attractive features of CCG for modelling a less-well-studied language include that it is a lexical theory in which it is the lexicon contains the rules for how words are combined to make sense rather than an external grammar, that it allows all manner of unconventional constituents, which is particularly powerful for parsing coordinated structures in English, that it is equivalent to a weakly context-sensitive grammar and hence has the power of a real natural language." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-45", "text": "In Steedman and Baldridge (2003) there are examples of the application of multimodal CCG to Irish Gaelic." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-46", "text": "However, to the best of our knowledge this paper is the first application of CCG to Scottish Gaelic." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-47", "text": "In gdbank, there is a single hand-built CCG derivation for every sentence." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-48", "text": "The notation is based on that in CCGbank with a small number of adaptations for Gaelic (see next section)." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-49", "text": "The basic units that can be assembled into types are S (clauses), N (nouns), conj (conjugations), and PP (prepositional phrases)." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-50", "text": "For subcategorization purposes and to help keep things clear for the annotator and the reader we mark prepositional phrases with the dictionary form of the preposition." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-51", "text": "We have not yet investigated overgeneration and ungrammatical sentences, hence there is only one kind of modality in gdbank; however restricting the way words can combine to the way in which they actually do combine in Gaelic is an obvious and essential next step." 
}, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-52", "text": "----------------------------------" }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-53", "text": "**LANGUAGE-SPECIFIC FEATURES**" }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-54", "text": "Prepositional phrases in Gaelic are often single-word, fused preposition-pronouns, a part-of-speech found across the Celtic languages." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-55", "text": "An ambiguous case of this is the token ris, which can be either ri with the pronoun e, hence taking the CCG type PP [ri] , or the pre-determiner form of ri," }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-56", "text": "The other class of fused preposition-pronoun we need to consider is that in sentences like Tha mi gad chluinntinn, \"I can hear you\", where gad is ag fused with do \"your\"." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-57", "text": "In this case it has type PP[ag]/S[n]." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-58", "text": "Adjectives as in CCGbank are treated as clauses, S [adj] ." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-59", "text": "The verbal noun is labelled S[n] by analogy with Hockenmaier and Steedman (2007) ." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-60", "text": "In addition to declarative and interrogative clauses, S[dcl] and S[q], we take our lead from the fourfold division of preverbal particles and add negative clauses S[neg], usually introduced by cha or chan, and negative interrogative clauses, S[negq], introduced by nach." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-61", "text": "There are two verbs for \"to be\" in Scottish Gaelic, bi and is." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-62", "text": "Bi is used for predicate statements about nouns, to forming the present tense and to describe some psychological states." 
}, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-63", "text": "It does not usually equate two NPs, with an exception we will come to." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-64", "text": "In the Dublin scheme the prepositional phrase headed by ag in T\u00e1 s\u00e9 ag iascaireacht (\"He is fishing.\") is treated as being an externally-controlled complement of T\u00e1 (Gaelic tha) and we carry this analysis over into Scottish Gaelic where this is the most common way of expressing the present tense." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-65", "text": "Figure 1 demonstrates this, where dhachaigh is a non-clausal modifier of dol, the verbal noun for \"to go\"." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-66", "text": "Is can be used as the copula between two NPs, and to express psychological states such as liking and preference." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-67", "text": "To say \"I am a teacher\", the Gaelic is 'S e tidsear a th' annam." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-68", "text": "This, at least on the surface, equates pronoun e, with a noun described by a relative clause including the verb bi." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-69", "text": "Fig. 1 shows our dependency tree for this." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-70", "text": "Note that this is different from the scheme in Lynn et al. (2012b) because of a difference between the two languages." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-71", "text": "They treat the analogous sentence Is tusa an m\u00fainteoir \"You are the teacher\" as having a subject, \"the teacher\", and a clausal predicate, tusa, \"you indeed\"." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-72", "text": "The most straightforward way of expressing a preference is the assertive is followed by an adjective or noun, a PP marking the preferrer, and then the object." 
}, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-73", "text": "If you dislike music, you might say Is beag orm ce\u00f2l." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-74", "text": "There are exactly analogous constructions in Irish with is + adjective + PP[le] + object, for example Is maith liom... \"I like...\", which in (U\u00ed Dhonnchadha, 2009 ) is treated as having the prepositional phrase as the subject and the adjective as predicate." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-75", "text": "We modify this to use adpmod as in the Universal Dependency Scheme as shown in Fig. 1 ." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-76", "text": "----------------------------------" }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-77", "text": "**CONCLUSIONS AND FUTURE WORK**" }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-78", "text": "In this paper we have presented a small handbuilt corpus of Scottish Gaelic sentences, their dependency structures and their CCG derivations." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-79", "text": "To the best of our knowledge this represents the first attempt to handle a range of real-life Scottish Gaelic sentences in such a way." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-80", "text": "gdbank itself and the guidelines used to build it are available from https://code.google.com/p/gdbank/ and we welcome feedback." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-81", "text": "We have of course only been able to illustrate a small number of constructions." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-82", "text": "Tables 2 and 3 list counts for the categorial types and dependency relations used." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-83", "text": "In 32 sentences there are a total of 406 tokens." 
}, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-84", "text": "We have not yet on the other hand attempted to deal with the morphology of Scottish Gaelic, for example lenition and slenderization, beyond drawing the attention of the human annotator to these phenomena when they may affect the correct parsing of a sentence." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-85", "text": "Clearly for automated natural-language processing of Gaelic these will need to be treated programmatically." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-86", "text": "We also disregard case and gender, although we expect that these will be dealt with as part of a rather more ambitious project, that of the Lamb group at the University of Edinburgh to build a part-of-speech tagset and tagged corpus which we look forward to seeing." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-87", "text": "mark 23 amod 11 nsubj 47 nmod 18 advmod 9 adpobj 38 ccomp 17 acomp 7 det 34 prt 14 cc 6 p 33 dobj 13 rcmod 4 ROOT 32 xcomp 13 appos 2 Table 3 : Counts for dependency relations in gdbank." }, { "sent_id": "511c17a6cb6bd74e0216c3d50eb9c0-C001-88", "text": "Note the high number of adpmod relations which is significantly larger than adpobj because of fused preposition-pronouns in Gaelic." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "511c17a6cb6bd74e0216c3d50eb9c0-C001-9" ], [ "511c17a6cb6bd74e0216c3d50eb9c0-C001-37" ] ], "cite_sentences": [ "511c17a6cb6bd74e0216c3d50eb9c0-C001-9", "511c17a6cb6bd74e0216c3d50eb9c0-C001-37" ] }, "@DIF@": { "gold_contexts": [ [ "511c17a6cb6bd74e0216c3d50eb9c0-C001-69", "511c17a6cb6bd74e0216c3d50eb9c0-C001-70" ] ], "cite_sentences": [ "511c17a6cb6bd74e0216c3d50eb9c0-C001-70" ] } } }, "ABC_ec3702a6b30057fcae65ca297656d2_46": { "x": [ { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-2", "text": "Most work on extracting parallel text from comparable corpora depends on linguistic resources such as seed parallel documents or translation dictionaries." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-3", "text": "This paper presents a simple baseline approach for bootstrapping a parallel collection." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-4", "text": "It starts by observing documents published on similar dates and the cooccurrence of a small number of identical tokens across languages." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-5", "text": "It then uses fast, online inference for a latent variable model to represent multilingual documents in a shared topic space where it can do efficient nearestneighbor search." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-6", "text": "Starting from the Gigaword collections in English and Spanish, we train a translation system that outperforms one trained on the WMT'11 parallel training set." 
}, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-7", "text": "----------------------------------" }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-110", "text": "----------------------------------" }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-111", "text": "**CONCLUSION**" }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-9", "text": "In statistical machine translation (SMT), the quality of the translation model is highly dependent on the amount of parallel data used to build it." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-10", "text": "Parallel data has usually been generated through the process of human translation, which imposes significant costs when building systems for new languages and domains." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-11", "text": "To alleviate this problem, researchers have considered comparable corpora-a collection of multilingual documents that are only topically aligned but not necessary translations of each other (Fung and Cheung, 2004) ." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-12", "text": "While most previous approaches for mining comparable corpora heavily depend on initializing the learning process with some translation dictionaries or parallel text, we use multilingual topic models to detect document translation pairs and extract parallel sentences with only minimum cross-language prior knowledge: the publication dates of articles and the tendency of some vocabulary to overlap across languages." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-13", "text": "Processing only four years of Gigaword news stories in English and Spanish, we are able to outperform the WMT'11 baseline system trained on parallel News Commentary corpus (Table 1) ." 
}, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-14", "text": "----------------------------------" }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-15", "text": "**PRIOR WORK ON COMPARABLE CORPORA**" }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-16", "text": "Most previous, if not all, approaches for mining comparable corpora heavily depend on bilingual resources, such as translation lexica, bitext, and/or a pretrained baseline MT system." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-17", "text": "This paper, in contrast, investigates building MT systems from comparable corpora without such resources." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-18", "text": "In a widely cited early paper, Munteanu and Marcu (2005) use a bilingual dictionary and a collection of parallel sentences to train IBM Model 1 and a maximum entropy classifier to determine whether two sentences are translations of each other." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-19", "text": "Tillmann and Xu (2009) and Smith et al. (2010) detect parallel sentences by training IBM Model 1 and maximum entropy classifiers, respectively." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-20", "text": "In later work on detecting sentence and phrase translation pairs, Cettolo et al. (2010) and Hoang et al. (2014) use SMT systems to translate candidate documents; Quirk et al. (2007) use parallel data to train a translation equivalence model; and Ture and Lin (2012) use a translation lexicon to build a scoring function for parallel documents." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-21", "text": "More recently, Ling et al. (2013) trained IBM Model 1 on bitext to detect translationally equivalent phrase pairs within single microblog posts." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-22", "text": "Abdul-Rauf and Schwenk (2009), Uszkoreit et al. (2010) , and Gahbiche-Braham et al. 
(2011) , rather than trying to detect translated sentence pairs directly, translate the entire source language side of a comparable corpus into the target language with a baseline SMT system and then search for corresponding documents." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-23", "text": "On the other hand, there exist approaches that mine comparable corpora without any prior translation information or parallel data." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-24", "text": "Examples of this approach are rarer, and we briefly mention two: Enright and Kondrak (2007) use singleton words (hapax legomena) to represent documents in a bilingual collection for the task of detecting document translation pairs, and Krstovski and Smith (2011) construct a vocabulary of overlapping words to represent documents in multilingual collections." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-25", "text": "The latter approach demonstrates high precision vs. recall values on various language pairs from different languages and writing systems when detecting translation pairs on a document level such as Europarl sessions." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-26", "text": "Recently proposed approaches, such as (Klementiev et al., 2012) use monolingual corpora to estimate phrase-based SMT parameters." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-27", "text": "Unlike our paper, however, they do not demonstrate an end-toend SMT system trained without any parallel data." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-28", "text": "Our approach differs from these and other previous approaches by not relying on any initial translation dictionary or any bitext to train a seed SMT system." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-29", "text": "Therefore, the primary experimental comparison that we perform is between no bitext at all and a system trained with some bitext." 
}, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-30", "text": "----------------------------------" }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-31", "text": "**BOOTSTRAPPING APPROACH**" }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-32", "text": "Our bootstrapping approach (Figure 1 ) is a twostage system that used the Overlapping Cosine Distance (OCD) approach of Krstovski and Smith (2011) as its first step." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-33", "text": "OCD outputs a ranked list of candidate document pairs, which are then fed through a sentence-alignment system (Moore, 2002) ." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-34", "text": "A polylingual topic model (PLTM) (Mimno et al., 2009 ) is then trained on the aligned portions of these documents." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-35", "text": "Using the trained model, we infer topics on the whole comparable training set." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-36", "text": "Once represented as points in the topic space, documents are then compared for similarity using divergence based metrics such as Hellinger (He) distance." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-37", "text": "Results from these comparisons create a single ranked list of text translation pairs, which are on a sub docu- ment length level." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-38", "text": "From this single ranked list, using thresholding, we again extract the top n candidate translation pairs that are then fed to an aligner for further refinement." 
}, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-39", "text": "----------------------------------" }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-40", "text": "**DISCOVERING DOCUMENT TRANSLATION PAIRS**" }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-41", "text": "For a given comparable corpus, OCD assumes that there is a set of words that exist in both languages that could be used as features in order to discriminate between documents that are translations of each other, documents that carry similar content, and documents that are not related." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-42", "text": "Firstly, for each language in the collection a vocabulary is created which consists of all word types seen in the corpora of that language." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-43", "text": "Words found in both source (s) and target (t) languages are extracted and the overlapping list of words are then used as dimensions for constructing a feature vector template." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-44", "text": "Documents in both languages are then represented using the template vector whose dimensions are the tf\u00b7idf values computed on the overlapping words which we now consider as features." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-45", "text": "While the number of overlapping words is dependent on the families of the source and target languages and their orthography, Krstovski and Smith (2011) showed that this approach yields good results across language pairs from different families and writing systems such as English-Greek, English-Bulgarian and EnglishArabic where, as one would expect, most shared words are numbers and named entities." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-46", "text": "We compare these vector representations efficiently using Cosine (Cos) distance and locality sensitive hashing (Charikar, 2002) ." 
}, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-47", "text": "This results in a single ranked list of all document pairs." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-48", "text": "Compared to the traditional cross-language information retrieval (CLIR) task where a set of document queries is known in advance, there is no prior information on the documents in the source language that may or may not have translation documents in the target language of the collection." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-49", "text": "Due to the length in-variance of Cos distance, the single ranked list may contain document pairs with high similarity value across all documents in the target language." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-50", "text": "This issue in OCD is resolved by applying length and diversity filtering." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-51", "text": "Length filtering removes translation pairs where the length of the target document t is not within \u00b120% of the source document s length, lf : 0.8 \u2264 |s| / |t| \u2264 1.2 ." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-52", "text": "For a given source document, diversity filtering is done by allowing only the top five ranked target document pairs to be considered in the single ranked list." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-53", "text": "Limiting the number of target documents for a given source document may discard actual document translation pairs such as in a comparable corpus of news stories where documents in the target language originate from large number of news source." 
}, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-54", "text": "While it may restrict more document translation pairs to be discovered, the diversity filtering, on the other hand prevents from limiting the number of discovered similar and translation documents to be from the same topic and domain and thus introduces diversity on another, domain or topic based, level." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-55", "text": "----------------------------------" }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-56", "text": "**REPRESENTING MULTILINGUAL COLLECTIONS WITH TOPICS**" }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-57", "text": "Latent topic models are statistical models of text that discover underlying hidden topics in a text collection." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-58", "text": "We use PLTM (Mimno et al., 2009 ), a multilingual variant of LDA, which assumes that document tuples in multilingual parallel and comparable corpora are drawn from the same tuple-specific multinomial distribution over topics \u03b8." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-59", "text": "For each document in the tuple, PLTM assumes that words are generated from a language L specific topic distribution over words \u03b2 L ." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-60", "text": "Using this generative model we represent documents in multiple languages in a common topic space which allows us to perform similarity comparisons across documents in different languages." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-61", "text": "The original PLTM posterior inference is approximated using collapsed Gibbs sampling (Mimno et al., 2009) ." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-62", "text": "While more straightforward to implement, this inference approach requires iterating over the multilingual collection multiple times to achieve convergence." 
}, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-63", "text": "This incurs a computational cost that could be significant for large collections such as Gigaword." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-64", "text": "Moreover, detecting and retrieving document translation pairs requires all-pairs comparison across documents in both languages with a worst case time complexity of O(N 2 ) which is impractical for large comparable corpora." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-65", "text": "One solution to this problem is to parallelize the brute-force approach through the MapReduce framework (Ture et al., 2011; Ture and Lin, 2012 ) but this approach requires special programming methods." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-66", "text": "In order to use the PLTM on large collections and avoid the bottleneck introduced by Gibbs sampling, we use the online variational Bayes (VB) approach originally developed by (Hoffman et al., 2010) for LDA model to develop a fast, online PLTM model." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-67", "text": "As in the regular VB approach, online VB approximates the hidden parameters \u03b8, z and \u03b2 using the free variational parameters: \u03b3, \u03c6 and \u03bb." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-68", "text": "Rather than going over the whole collection of documents to bring the variational parameters to a convergence point, Krstovski and Smith (2013) perform updates of the variational parameters \u03b3 and \u03c6 L on document batches and update the \u03bb L variational parameter as a weighted average of its stochastic gradient based approximation and its value on the previous batch." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-69", "text": "The approximation is done through Expectation-Maximization (EM)." 
}, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-70", "text": "Unlike the usual metric spaces where two vectors are compared using distance metrics such as Euclidean (Eu) or Cos distance, in the probability simplex similarity is computed using informationtheoretic measurements such as Kullback-Leibler, Jensen-Shannon divergence and He distance." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-71", "text": "We alleviate the O(N 2 ) worst case time-complexity in the probability simplex by utilizing approximate nearest-neighbor (NN) search techniques proven in the metric space." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-72", "text": "More specifically, we use the formulaic similarity between He and Eu: He(p, q) \u2261 Eu(x, y), when \u2200i : i = 1, n of x i and y i , x i = \u221a p i and y i = \u221a q i , and compute He distance using Eu based, approximate NN computation approaches such as k-d trees 1 (Bentley, 1975) ." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-73", "text": "----------------------------------" }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-74", "text": "**EXPERIMENTS AND RESULTS**" }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-75", "text": "We demonstrate the performance of the bootstrapping approach on the task of extracting parallel sentences to train a translation system." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-76", "text": "We evaluate MT systems trained on extracted parallel sentences and compare their performance against MT systems created using clean parallel collections." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-77", "text": "MT systems were evaluated with the standard BLEU metric (Papineni et al., 2002) on two official WMT test sets that cover different domains: News (WMT'11) and Europarl (WMT'08)." 
}, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-78", "text": "We trained the Moses SMT system (Koehn et al., 2007) We perform the following processing in each step of the pipeline." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-79", "text": "We run OCD on days of news originating from multiple news agencies or more specifically on news stories originating from the same day which we consider as the \"minimal supervision\" in initiating the bootstrapping process." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-80", "text": "Since the OCD approach generates a single list of ranked document translation pairs, for the second stage of our pipeline we consider the top n document translation pairs." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-81", "text": "We define n to be all document translation pairs whose Cos similarity is between the range of the max (i.e. the top 1 scored document translation pair in the single ranked list) and max 2 ." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-82", "text": "Unlike previous thresholding based on absolute values (Ture et al., 2011) , this approach allows us to utilize threshold values that are automatically adjusted to the dynamic range of the Cos distance of a particular corpus." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-83", "text": "Sentences from the top n news stories are extracted and are further aligned." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-84", "text": "The output of the aligner is then used as a training set for the PLTM model." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-85", "text": "We represent each of the news stories using the per story aligned sentences." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-86", "text": "Once trained, we use the PLTM model to infer topics back on to the news stories." 
}, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-87", "text": "We then again create a single ranked list of translation news story pairs by computing divergence based similarity using He distance ( \u00a73.2)." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-88", "text": "Keeping the top n ranked news story pairs, we obtain a list of what we believe are parallel documents which we then use to extract sentence pairs." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-89", "text": "Sentences are finally processed through an aligner and then used as the training corpus to our MT system." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-90", "text": "The Gigaword collection contains news stories generated from various agencies in different languages." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-91", "text": "On any given day, a news story in English may or may not cover the same topic as one in a different language." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-92", "text": "To perform a fair evaluation with the WMT'11 News test, we considered stories published in non-overlapping years 2 : 2010, 2009, 2005 and 2004 ." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-93", "text": "Table 1 shows the performance comparison, on the News test set (WMT'11), of the MT system trained on extracted parallel sentences from four years of Gigaword data (GW) with a MT system trained on two WMT'11 baseline parallel collections: Europarl (EP) and News Commentary (NC)." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-94", "text": "While over 10 times bigger than NC, EP is out of domain and thus performs only slightly better." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-95", "text": "On the News test set, parallel sentences automatically extracted from only four years of Gigaword data outperform systems trained on clean NC or EP bitext." 
}, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-96", "text": "In order to determine statistically significant differences between the results of different MT systems we ran the randomization test (Smucker et al., 2007) on the News test set with 10k iterations." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-97", "text": "In each iteration we performed permutations across the translation sentences obtained from the two MT systems whose statistical difference in performance we evaluate." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-98", "text": "Table 2 shows the performance comparison on the Europarl test set (WMT'08) between the MT system trained on the extracted parallel sentences and the two MT baseline systems." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-99", "text": "On this test set, unsurprisingly, EP training performed very well." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-100", "text": "Table 3 gives a summary of ablation experiments that we performed across the two stages of our bootstrapping approach." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-101", "text": "More specifically, we ex- plored using bitext extracted by OCD alone, without PLTM reestimation, to train a MT system." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-102", "text": "Both extracted bitext sets also contained many duplicate sentence pairs." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-103", "text": "In this set of experiments we also explored the effect of deduplicating them, i.e. going over the extracted set of English-Spanish sentence pairs and removing the duplicate ones." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-104", "text": "Bitext extracted by OCD alone without PLTM reestimation performed only slightly worse on WMT'11." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-105", "text": "The OCD-only data, however, only showed 70% overlap with OCD+PLTM (GW)." 
}, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-106", "text": "Deduplicating the two bitexts (dedup.) hurts OCD somewhat more than OCD+PLTM." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-107", "text": "On the Europarl test set, however, deduplicating OCD+PLTM bitext caused a significant boost from 23.88 to 24.67, while causing slight performance drop for OCD (cf." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-108", "text": "NC-trained 25.43)." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-109", "text": "These interactions of test domain, redundancy, and model settings leave room for further studies of the performance of our bootstrapping approach." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-112", "text": "We introduced a bootstrapping approach for detecting document translations and extracting parallel sentences through latent topic models that are trained with minimal prior knowledge and no lexical resources." }, { "sent_id": "ec3702a6b30057fcae65ca297656d2-C001-113", "text": "The proposed approach is able to extract parallel sentences from comparable corpora to train MT models that outperform a baseline model trained on a parallel collection." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "ec3702a6b30057fcae65ca297656d2-C001-24" ], [ "ec3702a6b30057fcae65ca297656d2-C001-45" ] ], "cite_sentences": [ "ec3702a6b30057fcae65ca297656d2-C001-24", "ec3702a6b30057fcae65ca297656d2-C001-45" ] }, "@DIF@": { "gold_contexts": [ [ "ec3702a6b30057fcae65ca297656d2-C001-24", "ec3702a6b30057fcae65ca297656d2-C001-25", "ec3702a6b30057fcae65ca297656d2-C001-26", "ec3702a6b30057fcae65ca297656d2-C001-27" ] ], "cite_sentences": [ "ec3702a6b30057fcae65ca297656d2-C001-24" ] }, "@USE@": { "gold_contexts": [ [ "ec3702a6b30057fcae65ca297656d2-C001-32" ] ], "cite_sentences": [ "ec3702a6b30057fcae65ca297656d2-C001-32" ] } } }, "ABC_d9d5fce2b33c15bf073a5840930be1_46": { "x": [ { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-2", "text": "Abstract" }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-3", "text": "BioSimplify is an open source tool written in" }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-4", "text": "----------------------------------" }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-5", "text": "**INTRODUCTION**" }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-6", "text": "Explaining the limitation of the bag of words model, linguist Zellig Harris pointed out 1 : \"language is not merely a bag of words but a tool with particular properties \u2026 The linguist's work is precisely to discover these properties\"." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-7", "text": "When discourse analysts try to discover these properties, they usually break the sentence into simpler clauses." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-8", "text": "It should be noted however that a single independent clause or simple sentence as defined by Quirk 2 , may still be complex." 
}, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-9", "text": "Consider for example the following sentence, which even if one were to break it into simpler clauses, it would still be a complex :" }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-10", "text": "----------------------------------" }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-11", "text": "**WE HAVE IDENTIFIED A NEW TNF-RELATED LIGAND, DESIGNATED HUMAN GITR LIGAND (HGITRL), AND ITS HUMAN RECEPTOR (HGITR), AN ORTHOLOG OF THE RECENTLY DISCOVERED MURINE GLUCOCORTICOID-INDUCED**" }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-12", "text": "TNFR-related (mGITR) protein." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-13", "text": "Gee 3 advises that critically analyzing the discourse involves separating and unpacking clauses from sentences and phrases to understand all the perspectives." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-14", "text": "Automatically creating this set of simplified sentences for the purpose of information extraction on biomedical text is the subject of this paper." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-15", "text": "While existing NLP methods for information extraction already use grammatical information of text in the form of features like POS tags, parse trees and dependencies -informally known as \"bag of NLP\", the usual focus is to choose one optimal such parsing for further processing." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-16", "text": "Our approach is rather a \"shotgun\" approach: use grammatical information in elemental chunks that can then be combined and recombined to generate many sentences from one (different perspectives) in order to maximize the likelihood that an automatic extraction engine can find in one (or several) of them the information contained in the original sentence." 
}, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-17", "text": "Thus, BioSimplify outputs the set of all sentences it can generate from the original sentence such that they are: 1) implied by the original sentence, 2) grammatically correct, and 3) shorter than the original sentence." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-18", "text": "----------------------------------" }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-19", "text": "**BACKGROUND**" }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-20", "text": "Sentence Simplification." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-21", "text": "To understand the value of this approach, we need to recognize the different motivations for sentence simplification." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-22", "text": "One of the applications of sentence simplification is to create sentences that are more readable to humans." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-23", "text": "We see examples of this in the works of Siddharthan 4 and Carroll 5 ." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-24", "text": "These projects aim to create sentences that are shorter, grammatically correct, informationpreserving and cohesive (the property where the context of a discourse element can be inferred from its precedents)." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-25", "text": "Sentence simplification is also applied in text summarization systems like SumBasic 6 where the focus is to preserve only the important content." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-26", "text": "Such an approach called sentence shortening or sentence compression doesn't necessarily preserve semantic content." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-27", "text": "Our goal is rather to use sentence simplification for improving the automatic extraction process." 
}, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-28", "text": "We have previously shown that simplification improved the performance of parsers 7 ." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-29", "text": "There the goal was to preserve both semantic content and grammatical correctness, but not necessarily cohesiveness." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-30", "text": "Our present work shares the same properties, but is based on a different approach that lends itself better for open-source publication of the engine, scalability, and generalization to information extraction tasks (particularly named entity recognition and association extraction) and other NLP applications, like semantic role labeling." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-31", "text": "Protein-Protein Extraction." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-32", "text": "The study of proteinprotein interactions and other molecular events is a central tenet of modern translational and genomic research." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-33", "text": "Publications centering on reports of such atomic events abound, and their manual extraction from the literature currently occupies many trained curators that deposit them in databases such as DIP, MINT, or IntAct." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-34", "text": "Manual curation, however, despite years of effort, has only made a small dent (calculated at around 7%) into the volume of publications believed to report protein-protein interactions." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-35", "text": "Automatic extraction of such facts is thus a priority for biomedical text mining researchers, although performance is still poor." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-36", "text": "We attempt to increase the performance of extraction systems by reducing the complexity of sentences that could be hiding PPIs." 
}, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-37", "text": "In the recent BioCreative II.5 international competition for extracting protein-protein interactions, our team 9 introduced the preprocessing step of replacing noun phrases in the sentences with the head noun, a technique that helped us achieve the best F-score for the task." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-38", "text": "This paper presents a tool complete in itself for sentence simplification." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-39", "text": "----------------------------------" }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-40", "text": "**METHODS**" }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-41", "text": "While trying to understand how to break a sentence into simpler parts, we focused on how sentences grow, and devised methods to undo the expansion." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-42", "text": "Halliday 8 states that there are three ways for such expansion: 1) elaborating an existing basic structure, 2) extending it by addition or replacement, and 3) enhancing its environment." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-43", "text": "We used these basic guidelines to design rules (listed in Table 2 ) for creating simpler sentences out of a complex one." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-44", "text": "The rest of this section presents the Noun Phrase Replacement module and the Syntactic Simplification algorithm along with the rules (Tables 1 and 2 )." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-45", "text": "----------------------------------" }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-46", "text": "**NOUN PHRASE REPLACEMENT**" }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-47", "text": "A noun phrase in English consists of an optional determinative, an optional premodifier, a mandatory head noun, and an optional postmodifier 2 ." 
}, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-48", "text": "Noun Phrase chunkers usually return the noun phrases of the smallest length, excluding postmodifiers." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-49", "text": "Hence, the last word of an identified noun phrase is the head noun." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-50", "text": "Previously 9 , we introduced the preprocessing step of replacing noun phrases in the sentences with their head noun." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-51", "text": "However, removal of the optional determinative makes the sentence grammatically incorrect, while removal of the premodifier still gives a grammatically correct sentence." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-52", "text": "So, in this work, we only remove the premodifiers." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-53", "text": "For example, the noun phrase \"the recently discovered murine glucocorticoid\" is replaced with \"the glucocorticoid\"." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-54", "text": "The part-of-speech (POS) tags in the sentence are identified using Lingpipe 1 , one of the most widely used POS taggers trained on GENIA biomedical corpus 2 ." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-55", "text": "We then use the OpenNLP maximum entropy method 3 to identify the noun phrases in the sentence." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-56", "text": "The other chunkers that could be considered are GATE chunker, GENIA Tagger, Lingpipe and Yamacha." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-57", "text": "For noun phrase chunking, GENIA Tagger and OpenNLP perform the best (fscore of 90% on GENIA Corpus), but OpenNLP is more usable in our system workflow as it is written in Java, while the GENIA Tagger is written in C++." 
}, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-58", "text": "To remove the premodifiers, all the tokens other than the head noun and the starting determinative or numeral (if they exist) are removed from the noun phrases." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-59", "text": "----------------------------------" }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-60", "text": "**SYNTACTIC SIMPLIFICATION**" }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-61", "text": "While POS taggers only indicate the grammatical role of a particular word, a parse tree represents the syntactic structure of the whole sentence, giving complete details on how the words in it are related to each other." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-62", "text": "Sentence simplification systems 6, 10, 11 usually have parsers as integral part of their algorithm, while there are few systems 4 that use only POS." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-63", "text": "The latter aim for fast simplification at the point of application, while the former give a higher importance to accuracy of the output." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-64", "text": "BioSimplify is optimized for accurate and flexible biomedical information extraction, and thus uses the output from parsers." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-65", "text": "This is a reasonable choice considering that there has been significant increase in the accuracy of parsing biomedical text from 80% in 2005 12 , to 84% in 2008 13 , and to 88% in 2009 4 measured according to f-score and it facilitates inclusion of the simplification algorithm in different NLP pipelines." 
}, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-66", "text": "This choice also decouples the simplification process from the parsing step, allowing it to be done separately or to take advantage of the availability of collections of pre-processed sentences like the NLP" }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-67", "text": "The cloning of members of these gene families and the identification of the protein-interaction motifs found within their gene products has initiated the molecular identity of factors (TRADD, FADD/MORT, RIP, FLICE/MACH, and TRAFs) associated with both of the p60 and p80 forms of the TNF receptor and with other members of the TNF receptor superfamily." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-68", "text": "The cloning of members of these gene families and the identification of the protein-interaction motifs has initiated the molecular identity of factors\u2026 The protein-interaction motifs can be found within their gene products." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-69", "text": "These mechanisms must be understood in order to prevent, or combat, the emergence of a virulent, multidrug-resistant form of the bacillus that would be uncontrollable by means of today's treatment strategies." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-70", "text": "These mechanisms must be understood in order to prevent, the emergence of a virulent, multidrug \u2026 These mechanisms must be understood in order to combat , the \u2026 web service provided by NCIBI 14 which has an f-score of 88% for parsing biomedical text." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-71", "text": "BioSimplify can also be used with Penn trees produced from other parsers like Stanford parser and Link Grammar that can produce PTB-style 16 output and also from Penn trees created apriori." 
}, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-72", "text": "Our goal is to produce all possible grammatically correct simplified sentences, assuming the available Penn tree is completely accurate." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-73", "text": "Table 1 describes the algorithm for syntactic simplification to produce grammatically correct sentences." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-74", "text": "The algorithm has a time complexity of O(n 2 *R), where n is the number of tokens in the sentence and R is the number of simplifications rules." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-75", "text": "The average time complexity is, however, O(nlog(n)*R)." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-76", "text": "One of the features of BioSimplify is avoidance of domainspecific rules." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-77", "text": "For example, we don't replace entity names (like genes) with shorter alternatives as is done with the noun phrases in the present version and with the gene names in our earlier version 11 ." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-78", "text": "We also avoided hard-coding the words in the rules created to split sentences with relative clauses." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-79", "text": "These measures enhance the domain adaptability of the system." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-80", "text": "Table 2 are preserved from the original tree, {C1\u2026Cq} are removed from the original tree and added as a separate tree, and {D1\u2026Dr} are removed from the original tree." 
}, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-81", "text": "To make sure that the rules are optimized for biomedical sentences, we manually examined each sentence in GENIA biomedical corpus and designed rules that would create the largest possible \"bag of simplified sentences\" based on Halliday's formalism for sentence simplification." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-82", "text": "Some rules are classified as necessary (Table 2) , and are executed only once since that transformations based on those rules do not decrease the semantic content of the original sentence." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-83", "text": "However, some transformations, like the postmodification of noun phrases by prepositional phrases, often destroy the semantic content." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-84", "text": "For such transformations, the algorithm ensures that both the simplified sentences and the sentence from which they are derived are preserved." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-85", "text": "Our rule-set consists of all the rules present in the most comprehensive rule-based system currently known 4 -albeit using a different notation based on phrase structures -and includes many additional rules." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-86", "text": "Siddharthan's system 4 handles coordination only at the clause level, while we handle it also at the phrase level." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-87", "text": "We also remove section indicators, content clauses, and postmodifiers of phrases not handled before." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-88", "text": "Premodification of noun phrases is handled at NP replacement stage." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-89", "text": "We currently don't handle pronoun resolution." 
}, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-90", "text": "----------------------------------" }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-91", "text": "**PPI EXTRACTION EVALUATION**" }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-92", "text": "For the purpose of evaluating the impact of sentence simplification, we use AIMed 17 corpus (which is extensively used in comparing PPI extraction methods) and PIE 18 (a machine-learning based approach available as a web service that uses the parse tree information from the Collins statistical parser as its key component)." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-93", "text": "PIE returns two kinds of results -one with a high precision, which we call tight PIE; and the other with low precision, which we call light PIE." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-94", "text": "We also compare the present version of BioSimplify with the older version 11 which is limited in its functionality because it only implements the rules described by Siddharthan 4 ." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-95", "text": "The present version which has an average time complexity of O(nlog(n)*R) is faster than the older version which has a time complexity of O(n 3 *R), where n is the number of tokens in the sentence and R is the number of rules." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-96", "text": "The older version has domain specific optimizations (like replacing the gene names with single-word identifiers), which were not used in the newer version for portability." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-97", "text": "----------------------------------" }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-98", "text": "**RESULTS AND CONCLUSION**" }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-99", "text": "We used PIE to test for the presence of PPIs before and after simplification in both the versions." 
}, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-100", "text": "The test set is from 18 PubMed abstracts in AIMed with ids between 9121766 and 9427624." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-101", "text": "Overall, out of the 189 sentences in these abstracts, 63 contain PPI(s)." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-102", "text": "The aggregate results of PIE are presented in Tables 3 and 4 ." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-103", "text": "Precision, Recall, and F-score assume conventional meaning." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-104", "text": "Using BioSimplify improved the performance of light PIE by 9% in f-score and in recall by 24%." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-105", "text": "It also enabled an improvement in fscore by 7% and in recall by 20% on tight PIE." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-106", "text": "These improvements are statistically significant based on the two-tailed paired t-test on the outputs (PPI present/absent) of PIE before and after simplification (p=1.3 X 10 -5 for light PIE and p=7.2 X 10 -8 )." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-107", "text": "Overall, the present version of BioSimplify performs much better than the older version." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-108", "text": "This is because using the \"shotgun\" model for simplification (many simpler sentences) instead of the original sentences improves the chances that the PPI engine will detect a relationship hidden in complex syntax." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-109", "text": "The precision of the present system is slightly lower than that of the older system because of the exhaustive simplification." 
}, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-110", "text": "This was not the case with the old system where the rules were based on Siddharthan's work and didn't stress as much exhaustiveness of simplification." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-111", "text": "In situations where precision is much more important Table 4 : Results for tight PIE than recall, one could use only a subset of the rules that are empirically found to be more precise." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-112", "text": "The slight loss in precision could also be attributed to the removal of domain-specific features like replacing all gene names with unique identifiers." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-113", "text": "Three judges evaluated the precision and recall of the BioSimplify system itself by reading each simplified sentence produced." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-114", "text": "The evaluation criteria (grammatical correctness) and this post-hoc evaluation using judges follow the same model and rationale described by Siddharthan 4 ." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-115", "text": "We used 404 sentences from AIMed for this evaluation, for an estimated precision of 90%." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-116", "text": "Since the evaluation for grammatical correctness is done post-hoc by humans, the recall could be overestimated, as it is cognitively difficult to think of all possible grammatically correct sentences." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-117", "text": "Our judges found less than 1% new simpler sentences that were not produced by the system." }, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-118", "text": "The corpus of 4511 biomedical sentences (out of which 2017 contain PPIs) produced from the original 404 is available at https://biosimplify.sourceforge.net." 
}, { "sent_id": "d9d5fce2b33c15bf073a5840930be1-C001-119", "text": "These sentences were artificially created by BioSimplify from AIMed corpus and are manually annotated to indicate the presence or absence of PPI(s)." } ], "y": { "@DIF@": { "gold_contexts": [ [ "d9d5fce2b33c15bf073a5840930be1-C001-77" ], [ "d9d5fce2b33c15bf073a5840930be1-C001-94", "d9d5fce2b33c15bf073a5840930be1-C001-95", "d9d5fce2b33c15bf073a5840930be1-C001-96" ] ], "cite_sentences": [ "d9d5fce2b33c15bf073a5840930be1-C001-77", "d9d5fce2b33c15bf073a5840930be1-C001-94" ] }, "@USE@": { "gold_contexts": [ [ "d9d5fce2b33c15bf073a5840930be1-C001-94" ] ], "cite_sentences": [ "d9d5fce2b33c15bf073a5840930be1-C001-94" ] } } }, "ABC_c932ba05eb5cb30094dd98739daa95_46": { "x": [ { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-59", "text": "**EVALUATION**" }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-2", "text": "PredPatt is a pattern-based framework for predicate-argument extraction." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-3", "text": "While it works across languages and provides a well-formed syntax-semantics interface for NLP tasks, a large-scale and reproducible evaluation has been lacking, which prevents comparisons between PredPatt and other related systems, and inhibits the updates of the patterns in PredPatt." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-4", "text": "In this work, we improve and evaluate PredPatt by introducing a large set of high-quality annotations converted from PropBank, which can also be used as a benchmark for other predicate-argument extraction systems." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-5", "text": "We compare PredPatt with other prominent systems and shows that PredPatt achieves the best precision and recall." 
}, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-6", "text": "----------------------------------" }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-8", "text": "PredPatt 1 (White et al., 2016 ) is a pattern-based framework for predicate-argument extraction." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-9", "text": "It defines a set of interpretable, extensible and non-lexicalized patterns based on Universal Dependencies (UD) (de Marneffe et al., 2014) , and extracts predicates and arguments through these manual patterns." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-10", "text": "Figure 1 shows the predicates and arguments extracted by PredPatt from the sentence: \"Chris, the designer, wants to launch a new brand.\" The underlying predicate-argument structure constructed by PredPatt is a directed graph, where a special dependency ARG is built between a predicate head token and its arguments' head tokens, and the original UD relations are retained within predicate phrases and argument phrases." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-11", "text": "For example, Figure 2 shows the directed graph for the predicate-argument extraction (1) and (2) in Figure 1 ." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-12", "text": "Compared to other existing systems for predicate-argument extraction (Banko et al., 2007; Fader et al., 2011; Angeli et al., 2015) , the use of manual language-agnostic patterns on UD makes PredPatt a well-founded component across languages." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-13", "text": "Additionally, the underlying structure constructed by PredPatt has been shown to be a well-formed syntax-semantics interface for NLP tasks: Zhang et al. (2016) utilizes PredPatt to extract possibilistic propositions in automatic common-sense inference generation." 
}, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-14", "text": "White et al. (2016) uses PredPatt to help augmenting data with Universal Decompositional Semantics." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-15", "text": "Zhang et al. (2017) adapts PredPatt to data generation for cross-lingual open information extraction." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-16", "text": "However, the evaluation of PredPatt has been restricted to manually-checked extractions over a small set of sentences (White et al., 2016) , which lacks gold annotations to conduct an objective and reproducible evaluation, and inhibits the updates of patterns in PredPatt." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-17", "text": "Chris , the designer , wants to launch a new brand ." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-18", "text": "In this work, we aim to conduct a large-scale and reproducible evaluation of PredPatt by introducing a large set of gold annotations gathered from PropBank (Palmer et al., 2005) ." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-19", "text": "We leverage these gold annotations to improve PredPatt and compare it with other prominent systems." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-20", "text": "The evaluation results demonstrate that we make a promising improvement on PredPatt, and it significantly outperforms other comparing systems." 
}, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-21", "text": "The scripts for creating gold annotations and evaluation are available at: https: //github.com/hltcoe/PredPatt/tree/master/eval" }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-22", "text": "----------------------------------" }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-23", "text": "**CREATING GOLD ANNOTATIONS**" }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-24", "text": "Open Information Extraction (Open IE) and Semantic Role Labeling (SRL) (Carreras and M\u00e0rquez, 2005) are quite related: semantically labeled arguments correspond to the arguments in Open IE extractions, and verbs often match up with Open IE relations (Christensen et al., 2011) ." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-25", "text": "Lang and Lapata (2010) has acknowledged that the SRL task can be viewed as a two stage process of (1) recognizing predicates and arguments then (2) assigning semantics." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-26", "text": "Therefore, predicate-argument extraction (i.e., Open IE) should primarily be considered the same as the first of two stages of SRL, and expert annotated SRL data would be an ideal resource for evaluating Open IE systems." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-27", "text": "This makes PropBank (Palmer et al., 2005) a natural choice from which we can create gold annotations for Open IE, Here, we choose to use expert annotations from PropBank, as compared to the recent suggestion to employ non-expert annotations as a means of benchmarking systems Stanovsky and Dagan (2016) ." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-28", "text": "Another advantage of choosing PropBank is that PropBank has gold annotations for UD which lays the important groundwork for evaluating UD-based patterns in PredPatt." 
}, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-29", "text": "In this work, we create gold annotations for predicate-argument extraction by converting PropBank annotations on English Web Treebank (EWT) (LDC2012T13) and the Penn Treebank II Wall Street Journal Corpus (WSJ) (Marcus et al., 1994) ." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-30", "text": "3 These two corpora have all verbal predicates annotated, and are used to evaluate PredPatt in different perspectives: EWT is the corpus where the gold standard English UD Treebank is built over, which enables an evaluation and analysis of PredPatt patterns; WSJ is used to evaluate PredPatt in a real-world scenario where we run SyntaxNet Parser 4 (Andor et al., 2016) on the corpus to generate automated UD parses as input of PredPatt." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-31", "text": "Table 1 shows the statistics of the auto-converted gold annotations for predicate-argument extraction on EWT and WSJ." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-32", "text": "We convert the PropBank annotations for all verbal predicates in these two corpora, and ignore roles of directional (DIR), manner (MNR), modals (MOD), negation (NEG) and adverbials (ADV), as they aren't extracted as distinct argument but instead are folded into the complex predicate by PredPatt and other systems for predicate-argument extraction (Banko et al., 2007; Fader et al., 2011; Angeli et al., 2015) ." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-33", "text": "For EWT, we select 13,583 sentences that have the version 2.0 of the gold UD annotations." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-34", "text": "5 The resulting annotations on these two corpora contain over 94K extractions." 
}, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-35", "text": "----------------------------------" }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-36", "text": "**IMPROVING PREDPATT**" }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-37", "text": "PredPatt is a pattern-based system, comprising an extensible set of clean, interpretable linguistic patterns over UD parses." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-38", "text": "By analyzing PredPatt extractions in comparison with gold annotations (Sec. 2), we are able to refine and improve PredPatt's pattern set." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-39", "text": "From the auto-converted gold annotations, we create a held-out set by randomly sampling 10% sentences from EWT." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-40", "text": "We then update the existing PredPatt patterns and introduce new patterns by analyzing PredPatt annotations on the held-out set." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-41", "text": "PredPatt extracts predicates and arguments in four stages (White et al., 2016) : (1) predicate and argument root identification, (2) argument resolution, (3) predicate and argument phrase extraction, and (4) optional post-processing." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-42", "text": "We analyze PredPatt extraction in each of these stages on the held-out set, and make 19 improvements to PredPatt patterns." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-43", "text": "Due to lack of space, we only highlight one improvement for each stage below." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-44", "text": "Fixed-MWE-pred: The UD version 2.0 introduces a new dependency relation fixed for identifying fixed function-word \"multiword expressions\" (MWEs)." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-45", "text": "To accommodate this new feature, we add patterns to identify the MWE predicate and its argument." 
}, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-46", "text": "As shown in Figure 3 , the predicate root in this case is the dependent of fixed that is tagged as a verb (i.e., \"opposed\"); the root of its argument is the token which indirectly governs the predicate root via the case and fixed relation (i.e., \"one\")." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-47", "text": "Please use this new file as opposed to the one I sent earlier ." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-48", "text": "Cut-complex-pred: The existing patterns take clausal complements (ccomp and xcomp) as predicatives of complex predicates in the argument resolution stage, where the arguments of the clausal complement will be merged into the argument set of their head predicate." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-49", "text": "For example, in the sentence \"Chris, the designer, wants to launch a new brand\", PredPatt merges the argument \"a new brand\" of the predicate \"to launch\" into the argument set of the complex predicate \"wants to launch\"." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-50", "text": "As a result, only the complex predicate, \"[Chris, the designer] wants to launch [a new brand]\", will be extracted." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-51", "text": "It ignores the possibility of the clausal complement itself being a predicate." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-52", "text": "Here, we add a cutting option; when turned on, it will cut the complex predicate into simple predicates as shown in Figure 1 ." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-53", "text": "Prep-separation: By default, PredPatt considers prepositions to belong to the predicate, while PropBank places preopositions within the span of their corresponding argument." 
}, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-54", "text": "Either behavior may be preferable under different circumstances, so we make preposition placement a new configurable option of PredPatt." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-55", "text": "Borrow-subj-for-conj-of-xcomp: PredPatt contains a post-processing option for distributing a single nsubj argument over multiple predicates joined by a conj relation." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-56", "text": "PredPatt also contains a pattern assigning subject arguments to predicates introduced by open clausal complement (xcomp) relations, according to the theory of obligatory control (Farkas, 1988) ." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-57", "text": "We introduce a new post-processing option that combines these two patterns, allowing an argument in subject position to be distributed over multiple xcomp predicates that are joined by a conj relation, as illustrated in Figure 4 ." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-58", "text": "----------------------------------" }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-60", "text": "In this section, we evaluate the original PredPatt (PredPatt v1) and the improved PredPatt (PredPatt v2) on the English Web Treebank (EWT) and the Wall Street Journal corpus (WSJ), and compare their performance with four prominent Open IE systems: OpenIE 4, 6 OLLIE (Mausam et al., 2012) , ClausIE (Del Corro and Gemulla, 2013) , and Stanford Open IE (Angeli et al., 2015) ." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-61", "text": "----------------------------------" }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-62", "text": "**PRECISION-RECALL CURVE**" }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-63", "text": "We compare PredPatt with four prominent Open IE systems which are also built for predicate-argument extraction." 
}, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-64", "text": "To allow some flexibility, we compute the precision and recall of different systems by running the scripts used in Stanovsky and Dagan (2016), 7 where an automated extraction is matched with a gold extraction based on their token-level overlap." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-65", "text": "Figure 5 and Figure 6 show the Precision-Recall Curves for different systems on EWT and WSJ." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-66", "text": "8 When tested on EWT which has gold UD parses ( Figure 5 ), PredPatt v1 and v2 outperforms the other systems by a significant margin in both precision and recall." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-67", "text": "When tested on WSJ where only automated UD parses are available (Figure 6 ), ClausIE achieves a recall that is slightly better than PredPatt v1, but PredPatt v2 still shows the best performance across all systems." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-68", "text": "----------------------------------" }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-69", "text": "**EXTRACTION HEAD AGREEMENT**" }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-70", "text": "The rich underlying structure in PredPatt (see Figure 2 ) contains head information for predicates and arguments, which enables a precision-recall metric based on the agreement of head information." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-71", "text": "Similar to He et al. (2015) , we first match an automated predicate with a gold predicate if they both agree on their head." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-72", "text": "9 With two matched predicates, we then match an automated argument with a gold argument if the automated argument head is within the gold argument span." 
}, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-73", "text": "We evaluate the precision and recall by a loose macro measure: For the i-th extractions that have two matched predicates, let the argument set of the gold predicate be A i , and the argument set of the automated predicate be\u00c2 i ." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-74", "text": "The number of matched arguments is represented by |A i \u2229\u00c2 i |." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-75", "text": "Then the precision is computed by Precision = 1 N N i=1 |A i \u2229\u00c2 i |/|\u00c2 i | , and the recall is computed by Recall" }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-76", "text": "----------------------------------" }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-77", "text": "**STATISTICS OF ARGUMENT SPAN RELATIONS**" }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-78", "text": "Besides the precion-recall oriented metrics, we impose another metric to further measure the argument span relations." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-79", "text": "Following in same notations in \u00a7 4.2, for the i-th extractions that have an automated predicate and a gold predicate matched with each other, let an argument in the gold argument set be \u03b1 \u2208 A i , and an argument in the automated argument set \u03b2 \u2208\u00c2 i ." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-80", "text": "We categorize the automated extractions into four sets according to their arguments relation to the gold arguments." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-81", "text": "Table 3 shows the proportion of PredPatt extractions in different sets." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-82", "text": "As we expected, compared to WSJ, more extractions on EWT fall into S same , which shows that PredPatt works better on gold UD parses." 
}, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-83", "text": "In contrast to PredPatt v1, PredPatt v2 on EWT increases extractions in S same by 12.97%, which contributes to the most increase of S subset ; on WSJ, PredPatt v2 decreases extractions in S subset by 13.89%, which leads the major increases of S same and S superset ." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-84", "text": "There are still over 10% extractions not belonging to any of these four sets." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-85", "text": "Case analysis shows that the inconsistent extractions are mainly caused by incorrect borrowing of arguments for compound predicates or predicates under obligatory control, missing arguments for passive/active verbs that act as adjectival modifiers, etc." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-86", "text": "These cases are not easily reachable via UD analysis, but leave room for further improvement on PredPatt." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-87", "text": "In the current settings, the head of a gold predicate is the verb token in the predicate." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-88", "text": "----------------------------------" }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-89", "text": "**CONCLUSIONS**" }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-90", "text": "We introduce a large-scale benchmark for predicate-argument extraction by converting manual annotations from PropBank." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-91", "text": "Based on the benchmark, we improve PredPatt patterns, and compare PredPatt with four prominent Open IE systems." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-92", "text": "The comparison shows that PredPatt significantly outperforms the other systems." 
}, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-93", "text": "The evaluation results demonstrate that we improve the performance of PredPatt in both precion-recall and the argument span relation with the gold annotations." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-94", "text": "As for further work, we see the confidence score estimater for PredPatt extractions as a desirable target, so that the quality of extractions can be controlled." }, { "sent_id": "c932ba05eb5cb30094dd98739daa95-C001-95", "text": "Additionally, we would like to further improve the PredPatt patterns by analyzing more PredPatt extractions in comparison with gold annotations." } ], "y": { "@BACK@": { "gold_contexts": [ [ "c932ba05eb5cb30094dd98739daa95-C001-12" ], [ "c932ba05eb5cb30094dd98739daa95-C001-32" ] ], "cite_sentences": [ "c932ba05eb5cb30094dd98739daa95-C001-12", "c932ba05eb5cb30094dd98739daa95-C001-32" ] }, "@USE@": { "gold_contexts": [ [ "c932ba05eb5cb30094dd98739daa95-C001-60" ] ], "cite_sentences": [ "c932ba05eb5cb30094dd98739daa95-C001-60" ] } } }, "ABC_5cb56f6bf8123a9949a7f7c4ebc85c_46": { "x": [ { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-2", "text": "One of the tasks in aspect-based sentiment analysis is to extract aspect and opinion terms from review text." }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-3", "text": "Our study focuses on evaluating transfer learning using BERT (Devlin et al., 2019) to classify tokens from hotel reviews in bahasa Indonesia." }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-4", "text": "We show that the default BERT model failed to outperform a simple argmax method." }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-5", "text": "However, changing the default BERT tokenizer to our custom one can improve the F 1 scores on our labels of interest by at least 5%." 
}, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-6", "text": "For I-ASPECT and B-SENTIMENT, it can even increased the F 1 scores by 11%." }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-7", "text": "On entity-level evaluation, our tweak on the tokenizer can achieve F 1 scores of 87% and 89% for ASPECT and SENTI-MENT labels respectively." }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-8", "text": "These scores are only 2% away from the best model by Fernando et al. (2019) , but with much less training effort (8 vs 200 epochs)." }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-9", "text": "----------------------------------" }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-11", "text": "Sentiment analysis (Pang et al., 2008) in review text usually consists of multiple aspects." }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-12", "text": "For instance, the following review talks about the location, room, and staff aspects of a hotel, \"Excellent location to the Tower of London." }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-13", "text": "We also walked to several other areas of interest; albeit a bit of a trek if you don't mind walking." }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-14", "text": "The room was a typical hotel room in need of a refresh, however clean." }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-15", "text": "The staff couldn't have been more professional, they really helped get us a taxi when our pre arranged pickup ran late.\" In this review, some of the sentiment terms are \"excellent\", \"typical\", \"clean\", and \"professional\"." }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-16", "text": "In this study, we are focusing on the aspect and opinion term extraction from the reviews to do aspect-based sentiment analysis (Liu and Zhang, 2012) ." 
}, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-17", "text": "While some work has been done in this task Fernando et al., 2019; Xue and Li, 2018) , we have not seen a transfer learning approach (Ruder, 2019) employed, which should need much less training effort." }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-18", "text": "Using transfer learning is especially helpful for low-resource languages (Kocmi and Bojar, 2018) , such as bahasa Indonesia." }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-19", "text": "Our main contribution in this study is evaluating BERT (Devlin et al., 2019) as a pretrained transformer model on this token classification task on hotel reviews in bahasa Indonesia." }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-20", "text": "We also found that the current pretrained BERT tokenizer has a poor encoder for bahasa Indonesia, thus we proposed our own custom tokenizer." }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-21", "text": "We also provided simpler baselines, namely argmax and logistic regression on word embeddings as comparisons." }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-22", "text": "----------------------------------" }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-23", "text": "**METHODOLOGY**" }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-24", "text": "For this aspect and opinion term extraction task, we use tokenized and annotated hotel reviews on Airy Rooms 1 provided by Fernando et al. (2019)" }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-25", "text": "The dataset consists of 5000 reviews in bahasa Indonesia." }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-26", "text": "The dataset is divided into training and test sets of 4000 and 1000 reviews respectively." }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-27", "text": "The label distribution of the tokens in BIO scheme can be seen in Table 1 ." 
}, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-28", "text": "In addition, we also see this case as on entity level, i.e. ASPECT, SENTI-MENT, and OTHER labels." }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-29", "text": "We found that there are 1643 and 809 unique tokens in the training and test sets respectively." }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-30", "text": "Moreover, 75.4% of the unique tokens in the test set can be found in the training set." }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-31", "text": "For baseline model, we employed two methods: a simple argmax method and logistic regression on word embeddings from fastText implementation (Bojanowski et al., 2017) ." }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-32", "text": "In the argmax method, we classify a token as the most probable Label Train Test B-ASPECT 7005 1758 I-ASPECT 2292 584 B-SENTIMENT 9646 2384 I-SENTIMENT 4265 1067 OTHER 39897 9706 Total 63105 15499 Table 1 : Label distribution label in the training set." }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-33", "text": "For fastText implementation, we use the skip-gram model and produce 100-dimensional vectors." }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-34", "text": "We proposed to use transfer learning from pretrained BERT-Base, Multilingual Cased (Devlin et al., 2019) for this token classification problem." }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-35", "text": "We used the implementation in PyTorch by Hugging Face (2019) 3 ." }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-36", "text": "We found out that the multilingual cased tokenizer of BERT does not recognize some common terms in our dataset, such as \"kamar\" (room), \"kendala\" (issue), \"wifi\", \"koneksi\" (connection), \"bagus\" (good), \"bersih\" (clean)." }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-37", "text": "In the training and validation sets, we found 24,370 unknown tokens." 
}, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-38", "text": "Thus, we encode the token ourselves to have no unknown tokens." }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-39", "text": "For the rest of this paper, we will call this model BERT-custom." }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-40", "text": "Since the labels are imbalanced, we are using F 1score as the evaluation metric, which is defined as:" }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-41", "text": "Our experiment setup for BERT and BERTcustom is to use Adam (Kingma and Ba, 2015) optimizer with 10 \u22124 as the learning rate and 5 epochs." }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-42", "text": "The batch size is 32 and we are optimizing the cross entropy loss function." }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-43", "text": "We split the training set into 70:30 for training and validation sets to tune the hyperparameters and then train with the whole training set before applying the model onto the test set." }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-44", "text": "----------------------------------" }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-45", "text": "**RESULTS AND DISCUSSION**" }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-46", "text": "The results from our experiments with BIO scheme labels are summarized in Table 2 ." }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-47", "text": "We can see in the table that using the default tokenizer cannot beat the baseline F 1 scores for B-ASPECT and B-SENTIMENT labels." }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-48", "text": "However, changing the tokenizer can improve the F 1 scores by at least 3 https://github.com/huggingface/ pytorch-transformers 5%." }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-49", "text": "For I-ASPECT and B-SENTIMENT, it can increase the F 1 scores by 11%." 
}, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-50", "text": "On the other hand, Fernando et al. (2019) trained their model using 200 epochs, while we only use 5 epochs." }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-51", "text": "We also found that simply using word embedding (fast-Text) is not suitable for this task since it failed to achieve higher F 1 scores compared to a simple argmax method." }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-52", "text": "Furthermore, we can see in Figure 1 that the model overfits after about 12 iterations (mini-batches)." }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-53", "text": "Table 3 shows the performance on entity level." }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-54", "text": "We are only interested in evaluating the ASPECT and SENTIMENT labels while actually trained the models with 3 labels." }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-55", "text": "In this case, we increased the number of epochs to 8 since it can yield higher F 1 scores." }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-56", "text": "It is interesting to see that BERT is not even better than argmax in this simplified setting." }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-57", "text": "Nevertheless, changing the default BERT tokenizer is beneficial as well." }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-58", "text": "BERT-custom model outperforms argmax by more than 5% on our labels of interest and only 2% away from beating the results by Fernando et al. (2019) ." }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-59", "text": "summarized several studies on aspect and opinion terms extraction." 
}, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-60", "text": "Some of the methods used are association rule mining (Hu and Liu, 2004) , dependency rule parsers (Qiu et al., 2011) , conditional random fields (CRF) and hidden Markov model (HMM) Jin et al., 2009) , topic modelling (Chen et al., 2014; Zhao et al., 2010) , and deep learning (Fernando et al., 2019; Xue et al., 2017; Xue and Li, 2018) ." }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-61", "text": "Fernando et al. (2019) combines the idea of coupled multilayer attentions (CMLA) by and double embeddings by Xue and Li (2018) on aspect and opinion term extraction on SemEval." }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-62", "text": "The work by Xue and Li (2018) itself is an improvement from what their prior work on the same task (Xue et al., 2017) ." }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-63", "text": "Thus, we only included the work by Fernando et al. (2019) because they show that we can get the best result by combining the latest work by and Xue and Li (2018) ." }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-64", "text": "In their paper, Devlin et al. (2019) show that they can achieve state-of-the-art performance not only on sentence-level, but also on token-level tasks, such as for named entity recognition (NER)." }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-65", "text": "This motivates us to explore BERT in our study." }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-66", "text": "This way, we do not need to use dependency parsers or any feature engineering." 
}, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-67", "text": "----------------------------------" }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-68", "text": "**CONCLUSIONS AND FUTURE WORK**" }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-69", "text": "Our work shows that BERT can achieve F 1 scores of more than 80% in aspect and opinion term extraction task with BIO scheme in noisy bahasa Indonesia text by changing the default tokenizer to have fewer unknown tokens." }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-70", "text": "For both BIO scheme and aspect/sentiment/other labels, this simple tweak results in more than 5% absolute increase in F 1 scores on our labels of interest." }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-71", "text": "On entity-level evaluation, changing the default tokenizer yields around 8% absolute increase in F 1 scores." }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-72", "text": "In the future, we are aiming to compare several transformer-based models, such as XLNet (Yang et al., 2019) , XLM (Lample and Conneau, 2019) , and RoBERTa (Liu et al., 2019) when they are trained using multilingual datasets that include text in bahasa Indonesia as well." }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-73", "text": "We also plan to fine-tune those models with richer text in bahasa Indonesia to reduce the number of unknown tokens." }, { "sent_id": "5cb56f6bf8123a9949a7f7c4ebc85c-C001-74", "text": "Furthermore, it is also necessary to evaluate the same task on different datasets." 
} ], "y": { "@USE@": { "gold_contexts": [ [ "5cb56f6bf8123a9949a7f7c4ebc85c-C001-3" ], [ "5cb56f6bf8123a9949a7f7c4ebc85c-C001-19" ], [ "5cb56f6bf8123a9949a7f7c4ebc85c-C001-34" ] ], "cite_sentences": [ "5cb56f6bf8123a9949a7f7c4ebc85c-C001-3", "5cb56f6bf8123a9949a7f7c4ebc85c-C001-19", "5cb56f6bf8123a9949a7f7c4ebc85c-C001-34" ] }, "@BACK@": { "gold_contexts": [ [ "5cb56f6bf8123a9949a7f7c4ebc85c-C001-64" ] ], "cite_sentences": [ "5cb56f6bf8123a9949a7f7c4ebc85c-C001-64" ] }, "@MOT@": { "gold_contexts": [ [ "5cb56f6bf8123a9949a7f7c4ebc85c-C001-64", "5cb56f6bf8123a9949a7f7c4ebc85c-C001-65" ] ], "cite_sentences": [ "5cb56f6bf8123a9949a7f7c4ebc85c-C001-64" ] } } }, "ABC_4b17b24ec0203263e581cbeeaa9fc7_46": { "x": [ { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-20", "text": "**APPROACHES TO SEMANTIC CLUSTERING**" }, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-2", "text": "This paper describes the use of statistical analyses of untagged corpora to detect similarities and differences in the meaning of words in text." }, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-3", "text": "This work is motivated by psychological as well as by computational issues." }, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-4", "text": "The limitations of the method of cluster analysis in assessing the success of such analyses are discussed, and ongoing research using an alternative unsupervised neural network approach is described." 
}, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-5", "text": "----------------------------------" }, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-7", "text": "There has been considerable recent interest in the use of statistical methods for grouping words in large on-line corpora into categories which capture some of our intuitions about the reference of the words we use and the relationships between them (e.g. Brown et al., 1992; Schiitze, 1993) ." }, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-8", "text": "Although they have received most attention from within computational linguistics, such approaches are also of interest from the point of view of psychology." }, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-9", "text": "The huge task of developing concepts of word meanings is one that human beings readily achieve; we are all generally aware of the similarities and differences between the meanings of words, despite the fact that in many cases these meanings are not amenable to rigourous definition." }, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-10", "text": "Whilst supervision may enable children to learn the meanings of a limited number of common words, it seems extremely unlikely that the greater part of our understanding of word meanings is achieved in this way." }, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-11", "text": "Experimental evidence shows (Harris, 1992) that the occurrence of words in young children's language is strongly influenced by the appearance of those words in the speech they hear around them, and it may be that this process continues indefinitely." 
}, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-12", "text": "Such a process would seem to be particularly important when accounting for our understanding of abstract words, such as 'similar' and 'justice', which lack concrete *The author is supported by the Carnegie Trust for the Universities of Scotland referents." }, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-13", "text": "Despite our difficulty in being able to provide clear definitions for such words, we have strong intuitions about their usage and can readily categorize them on the basis of similarity in meaning." }, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-14", "text": "This process of developing concepts for abstract words is one which psychological research has tended to ignore." }, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-15", "text": "This situation suggests that the learning of the meanings of many words, and their relation to the meanings of other words, may be achieved in an unsupervised fashion, and that our ability to develop a categorization for words may be driven, at least in part, by structure latent in the language being learned." }, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-16", "text": "Recent work in computational linguistics which makes use of statisticM methods to cluster words into groups which reflect their meaning is attractive in this context as it potentially provides a means for developing conceptual structure without supervision, without giving any prior information about the language to the system, and without making a priori distinctions between concrete and abstract words." }, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-17", "text": "Supervision and knowledge of syntax (much useful information about which, as Finch and Chater (1992) have argued, is also contained in simple distributional statistics) arc two additional factors which are likely to assist in the process of developing concepts of word meanings." 
}, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-18", "text": "However, by focusing on the single; intralinguistic, source of information provided by the language data alone, we may be able to obtain useful insights regarding its influence on our conceptual structure." }, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-19", "text": "----------------------------------" }, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-21", "text": "A number of analyses were carried out on text corpora to examine the sorts of semantic groupings that can be achieved using simple statistical methods." }, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-22", "text": "Using an approach similar to that of Brown et al. (1992) , each 'target word '1 wi in the corpus was represented as a vector in which each component j is the probability that any one word position in a 'context window' will be occupied by a 'context word' wj, given that the window is centred on word wi." }, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-23", "text": "The length of the window used can be varied." }, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-24", "text": "The basic outline of the moving window used is shown in figure 1." }, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-25", "text": "As figure 1 indicates, the portion of the moving window in which the context words are contained may exclude a small number of word positions immediately adjacent to the target word." }, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-26", "text": "This is to weaken the effects of syntax, although the analyses described here do not make use of this facility." }, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-27", "text": "Following the creation of these vectors, heirarchi- Figure 1 : Design of the Moving Window word or immediately following the target word." 
}, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-28", "text": "Whilst it seems reasonable to suppose that children acquiring word meanings would be able to make use of more than this limited amount of context information, the analyses were carried out to investigate performance of the system under such crude conditions." }, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-29", "text": "It was found on examination of the dendrograms resulting from the cluster analyses that even using this extremely impoverished source of information about the target words did permit a limited number of semantically coherent groupings of words to be created." }, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-30", "text": "The members of some of these groups were selected following inspection of the relevant dendrograms and are listed in (1992), the distance metric used was the Spearman Rank Correlation coefficient." }, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-31", "text": "The approach described here differs from that of Brown et al. (1992) in that context words both preceding and following the target word are considered (although information about the ordering of the context was not used), and in that Euclidean distance, rather than average mutual information, is used for clustering." }, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-32", "text": "Each of the methods described here represents each target word in the same manner, regardless of the syntactic or semantic designation which might conventionally be assigned to it." }, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-33", "text": "Thus any differences or similarities between words must be detected purely from the statistics of the usage of the words, which are in turn determined by the characteristics of the contexts in which they occur." 
}, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-34", "text": "----------------------------------" }, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-35", "text": "**RESULTS**" }, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-36", "text": "The methods outlined above were used to cluster words appearing in the Lund corpus (470,000 words), a corpus created from issues of the Wall Street Journal (1.1 million words), and a corpus created from the works of Anthony Trollope (1.7 million words)." }, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-37", "text": "Initial analyses were carried out on the Lund and Trollope corpora using a short window length of only one word position either side of the target word." }, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-38", "text": "That is, target words were represented by vectors whose components reflected the (bigram) statistics of occurrence of context words at the word position immediately preceding the target in ~able 1 and a small number of others like them, they represent only a small proportion of the 1000 target words subjected to the analysis." }, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-39", "text": "Besides those shown above, a number of other types of groupings were evident which appeared to reflect syntactic rather than more specific semantic characteristics." }, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-40", "text": "This is perhaps not surprising if one regards the problem of grouping words on the basis of similarity as one of prediction; given statistical information only about those words immediately adjacent to a particular target word, it may be possible to say with reasonable confidence that the target word is a noun, a verb, or an adjective, but information about wider context is likely to be needed in order to provide more specific predictions about the particular noun, verb, or adjective in question." 
}, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-41", "text": "Since this information is not present, the dendrograms resulting from the analysis show groupings of prepositions, adjectives, verbs, and so on." }, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-42", "text": "Also present are groups of words whose members all commonly precede or follow a particular particle." }, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-43", "text": "Further analyses were carried out in which the length of the context window was extended to 5 words either side of the target word." }, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-44", "text": "The dendrograms resulting from these analyses did not show any marked improvement over those obtained from the earlier analyses, and even when the window length was increased to 25 words each side of the target word, clear differences were not easy to detect from the dendrograms, although the sorts of groupings noted earlier were still identifiable." }, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-45", "text": "----------------------------------" }, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-46", "text": "**FUTURE DIRECTIONS**" }, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-47", "text": "The use of cluster analysis and related techniques has been popular for presenting the results of recent statistical language work within computational linguistics." }, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-48", "text": "However, such methods clearly have a number of limitations." }, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-49", "text": "Firstly, it is difficult to compare dendrograms rigourously, which means that it can be difficult to determine which of a number of alternative approaches or sets of parameters is turning out to be the most successful." 
}, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-50", "text": "Secondly, the lack of an objective measure of the clusters obtained means that assessments of the success of a particular technique for categorizing language may well be unreliable; it is quite possible to focus on the .attractive looking groupings revealed in a dendrogram whilst ignoring what may be a very large number of less attractive ones." }, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-51", "text": "These criticisms arise largely because cluster analysis is a purely descriptive statistical method, and strongly suggest that alternative methods must be found which can provide a more objective measure of the success of the technique being used." }, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-52", "text": "Of these, word sense disambiguation is attractive." }, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-53", "text": "Since we can obtain from native speakers an assessment of the correct senses of target words in different contexts, we do have a means for determining how often a particular technique is able to give the correct sense for a particular target word." }, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-54", "text": "In other words, the evaluation of a native speaker can potentially be used to assess performance each time the system encounters a target word in context and assigns that word to a particular sense class." }, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-55", "text": "Whilst such assessments might also be applicable to the analysis of dendrograms, word sense disambiguation is of interest since it constitutes the task that continually meets human language users when reading text or listening to speech." }, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-56", "text": "For these reasons, current work is focusing on the problem of disambiguating words given statistical context." 
}, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-57", "text": "To achieve this, an unsupervised competitive neural network is being used." }, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-58", "text": "This has several features which appear to be desirable." }, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-59", "text": "Firstly, as in the human case, learning proceeds on-line, without any need for a separate stage of statistical analysis." }, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-60", "text": "Such a system has the potential to begin developing clusters from the very first exposure to the linguistic input, and the clusters into which the input words are placed evolve con-tinuously during the learning process." }, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-61", "text": "Thus one can usefully examine the state of the clusters at any point during learning." }, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-62", "text": "Secondly, it is straightforward to allow any given word to be clustered into as many separate clusters as the system dictates (subject to the maximum number of output units available)." }, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-63", "text": "Thus, the neural network approach, unlike that described above, has the potential to allow separate senses of a word to be distinguished on the basis of their context." }, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-64", "text": "This is not to say that non-neural network approaches could not permit a word to belong to more than one cluster (e.g. Pereira et al., 1993) , but rather that this is a very natural and attractive consequence of trsing the unsupervised neural network approach." }, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-65", "text": "At present, work is being undertaken to examine how well a simple competitive neural network can perform on such a task." 
}, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-66", "text": "Preliminary work has been undertaken using a simple competitive neural network similar to that described by Finch and Chater (1992) ." }, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-67", "text": "Unlike them, though, provision was made for presenting words along with context during the test phase as well as the training phase." }, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-68", "text": "This potentially allows disambiguation performance to be examined at any time." }, { "sent_id": "4b17b24ec0203263e581cbeeaa9fc7-C001-69", "text": "Initial work using the very simple artificial corpus devised by Elman (1988) has been encouraging, with the network demonstrating near-perfect performance in distinguishing between nouns and verbs in the corpus." } ], "y": { "@BACK@": { "gold_contexts": [ [ "4b17b24ec0203263e581cbeeaa9fc7-C001-7" ] ], "cite_sentences": [ "4b17b24ec0203263e581cbeeaa9fc7-C001-7" ] }, "@SIM@": { "gold_contexts": [ [ "4b17b24ec0203263e581cbeeaa9fc7-C001-22" ] ], "cite_sentences": [ "4b17b24ec0203263e581cbeeaa9fc7-C001-22" ] }, "@USE@": { "gold_contexts": [ [ "4b17b24ec0203263e581cbeeaa9fc7-C001-31" ] ], "cite_sentences": [ "4b17b24ec0203263e581cbeeaa9fc7-C001-31" ] } } }, "ABC_13249ad2fd022b9b4f1d22d2ca77cd_46": { "x": [ { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-2", "text": "This paper presents novel Bayesian optimisation algorithms for minimum error rate training of statistical machine translation systems." }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-3", "text": "We explore two classes of algorithms for efficiently exploring the translation space, with the first based on N-best lists and the second based on a hypergraph representation that compactly represents an exponential number of translation options." 
}, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-4", "text": "Our algorithms exhibit faster convergence and are capable of obtaining lower error rates than the existing translation model specific approaches, all within a generic Bayesian optimisation framework." }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-5", "text": "Further more, we also introduce a random embedding algorithm to scale our approach to sparse high dimensional feature sets." }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-6", "text": "----------------------------------" }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-8", "text": "State of the art statistical machine translation (SMT) models traditionally consist of a small number (<20) of sub-models whose scores are linearly combined to choose the best translation candidate." }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-9", "text": "The weights of this linear combination are usually trained to maximise some automatic translation metric (e.g. BLEU) [1] using Minimum Error Rate Training (MERT) [2, 3] or a variant of the Margin Infused Relaxed Algorithm (MIRA) [4, 5] ." }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-10", "text": "These algorithms are heavily adapted to exploit the properties of the translation search space." }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-11", "text": "In this paper we introduce generic, effective, and efficient Bayesian optimisation (BO) algorithms [6, 7] for training the weights of SMT systems for arbitrary metrics that outperform both MERT and MIRA." }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-12", "text": "To our knowledge this is the first application of BO in natural language processing (NLP) and our results show that their may be significant scope for using BO to tune hyperparameter in a range of NLP models." 
}, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-13", "text": "The linear model popular for SMT systems [2] is parametrised in terms of a source sentence f , target translation e, feature weights w k and corresponding feature functions H k (e, f ) (including a language model, conditional translation probabilities, etc.)." }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-14", "text": "The best translation is selected by," }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-15", "text": "(" }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-16", "text": "Since the translation metrics (e.g. BLEU score) can only be evaluated between the selected translations and reference translations (i.e. the standard manual translations from the parallel training data), meanwhile decoding new translations following Equation 1 is very time consuming, we cannot tune the linear weights directly as in ordinary classification tasks." }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-17", "text": "The most common approach is an iterative algorithm MERT [3] which employs N-best lists (the best N translations decoded with a weight set from a previous iteration) as candidate translations C. In this way, the loss function is constructed as E(\u0112,\u00ca) = S s=1 E(\u0113 s ,\u00ea s ), where\u0113 is the reference sentence,\u00ea is selected from N-best lists by\u00ea s = arg max e\u2208C K k=1 w k H k (e, f s ) and S represents the volume of sentences." }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-18", "text": "By exploiting the fact that the error surface is piece-wise linear, MERT iteratively applies line search to find the optimal parameters along the randomly chosen directions via" }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-19", "text": "Hypergraph, or lattice, MERT [8, 9] aims to tackle problems caused by the lack of diversity in N-best lists." 
}, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-20", "text": "A hypergraph [10] efficiently encodes the exponential translation space explored by the beam-search translation decoder." }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-44", "text": "To further expand the translation space searched at each iteration, we present a variant cumulative hypergraph BO algorithm (CHG-BO) which combines hypergraphs from one previous and current iterations in order to trade stability and speed of convergence with memory usage." }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-45", "text": "Similar to MERT, our BO algorithms become less reliable when the number of features in the linear model exceeds 30." }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-46", "text": "Hence, we introduce a variant of random embedding Bayesian optimisation (REMBO) [11] into our hypergraph algorithm (HG-REMBO) to tackle the large scale training problem." }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-47", "text": "The original REMBO generates a random matrix A \u2208 R h\u00d7l to map the sample x \u2208 R h from high dimensional space to a point z \u2208 R l in low dimensional space." }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-48", "text": "The objective function to be optimised then becomes g(z) = f (Az)." }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-49", "text": "Instead of A, we used a regularised random matrix\u0100 where\u0100 mn = Amn Am 1 and transform the objective function to g(z) = f (\u0100z + w), where w are the weights producing the hypergraphs." }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-50", "text": "w would remain constant during Bayesian optimisation." }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-51", "text": "In this way, BO can be carried out in the low dimensional space and the regularisation of A ensures that each update of the weights remains in a bounded domain." 
}, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-52", "text": "----------------------------------" }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-53", "text": "**EXPERIMENTS**" }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-54", "text": "We implemented our models using spearmint [7] 1 and the cdec SMT decoder [12] 2 ." }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-55", "text": "The datasets are from WMT14 shared task, 3 all tokenized and lowercased." }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-56", "text": "We employ ARD Matern 5/2 kernel and EI acquisition function." }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-57", "text": "The cdec implementations of hypergraph MERT [9] and MIRA [13] are used as benchmarks." }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-58", "text": "The experiment 4 results in Table 1 , averaged over 3 runs, show that our BO algorithms always achieve a higher training objective score than MERT and MIRA, and in most cases a higher test BLEU score." }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-59", "text": "Fig.2 illustrates the convergence w.r.t." }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-60", "text": "the development BLEU score and Table 2 illustrates the efficiency of the BO algorithms." }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-61", "text": "They consistently obtain a good weight set within 5 iterations, but the best one is always achieved after 7 iterations." }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-62", "text": "This suggests setting the maximum number of iterations to 10 in order to ensure a good result." }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-63", "text": "Our BO tuning algorithms only take advantage of multiple processors for decoding, thus there still exists some space to further improve their efficiency." }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-64", "text": "Fig. 
3a and 3b compare the development score and BO score 5 at each iteration in fr-en and cs-en, which again demonstrates the advantage of CHG-BO over NBL-BO and HG-BO in stability." }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-65", "text": "Fig. 3c and 3d compare the models with different bound size: b = 0.01 is able to achieve a development and test BLEU score as good as b = 0.1 with more iterations, but b = 0.5 performs worse on the test dataset." }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-66", "text": "Thus too large a search bound may introduce too much noise, which in turn affects the translation performance." }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-67", "text": "Table 3 shows the experiments on a large number of sparse features." }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-68", "text": "We modify HG-REMBO into a two-step coordinate ascent process in order to stabilise the updates of the core default feature weights." }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-69", "text": "First, we optimise the default 18 features, then we fix them and generate a regularised random matrix to update the large scale sparse features in the low dimensional space." }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-70", "text": "Table 3 demonstrates that HG-REMBO is able to carry out large scale discriminative training, performing almost on par with MIRA." }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-71", "text": "Although HG-REMBO loses its advantage in speed of convergence as it requires multiple runs to generate a good transformation matrix, these results illustrate the potential of applying REMBO to statistical machine translation systems." 
}, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-72", "text": "----------------------------------" }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-73", "text": "**CONCLUSION**" }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-74", "text": "We introduce novel Bayesian optimisation (BO) algorithms for machine translation." }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-75", "text": "Our algorithms exhibit faster convergence and achieve higher training objectives and better translation quality than existing translation model specific approaches." }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-76", "text": "We further demonstrate that by incorporating the method of random embeddings it is viable to employ Bayesian optimisation to carry out large sale training with a high number of sparse features." }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-77", "text": "This initial investigation also suggests that BO has great potential for general natural language processing tasks." }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-78", "text": "----------------------------------" }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-79", "text": "**ACKNOWLEDGEMENTS**" }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-80", "text": "This work was supported by a Xerox Foundation Award and EPSRC grant number EP/K036580/1." }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-81", "text": "5 BO score is the best BLEU score achieved by Gaussian processes on fixed N-best lists or hypergraphs." }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-21", "text": "The line search can then be carried out on the edges of the hypergraph, instead of the translations in the N-best lists." }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-22", "text": "And dynamic programming is used to find the upper envelope of the hypergraph corresponding to the maximum scoring translation." 
}, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-23", "text": "Prior work [8, 9] showed that hypergraph MERT is superior to the original N-best algorithm both in speed of convergence and stability." }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-24", "text": "MIRA is an online large-margin learning algorithm that applies a different strategy to MERT." }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-25", "text": "It enforces a margin between high and low loss translations and enables stochastic gradient descent to be used to update parameters." }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-26", "text": "A disadvantage of this approach is that it requires the global BLEU score, which is a non-linear function of local translation candidate statistics, to be approximated by a linear combination of sentence level BLEU scores." }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-27", "text": "In this paper, however, our BO algorithms treat the loss function as a black-box function so that we could directly query the function value without the cumbersome and inefficient work of constructing an error surface for random directions." }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-28", "text": "Instead of applying BO to the whole SMT pipeline, which would require expensive decoding of new translations with every parameter set sampled, our BO algorithms only decode new translations after obtaining the best parameters on fixed N-best lists or hypergraphs." }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-29", "text": "Hence our algorithms iteratively run Gaussian processes on the sub-models and only a few decoding iterations are required to reach convergence." }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-30", "text": "The experiments in Section 3 illustrate the superiority of our algorithms both in translation quality and speed of convergence." 
}, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-31", "text": "----------------------------------" }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-32", "text": "**BAYESIAN OPTIMISATION TUNING ALGORITHMS**" }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-33", "text": "Algorithm 1 describes our hypergraph algorithm (HG-BO)." }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-34", "text": "The N-best algorithm (NBL-BO) is similar to HG-BO and can be derived from Algorithm 1 by replacing the hypergraphs with N-best lists." }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-35", "text": "In HG-BO, both w i and x j represent the weights of the linear model." }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-36", "text": "The weights w i are used to produce the hypergraphs H i , while x j are the weights sampled from the GP to compute the BLEU score (i.e. objective function value) for a fixed set H i ." }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-37", "text": "Since H i remains unchanged during an iteration of Bayesian optimisation, the BLEU score calculated for the fixed hypergraphs approximates the true BLEU score that would be achieved if the translation system were run with x j ." }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-38", "text": "This introduces some noise owing to the variance between w i and x j ." }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-39", "text": "As depicted in Fig. 1 , a key aspect of Algorithm 1 is that we place a bound (blue area) around w i and only consider samples inside this region." }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-40", "text": "The sample with the highest BLEU score will then be Table 2 : Time consumption used to decode new hypergraphs for the next iteration of BO." }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-41", "text": "Intuitively, to speed up convergence, we would like the search space of BO to be as large as possible." 
}, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-42", "text": "When the search space is too large, however, a sampled x j could be so far from w i that the generated translations would become unreliable thus leading to noisy BLEU measurements." }, { "sent_id": "13249ad2fd022b9b4f1d22d2ca77cd-C001-43", "text": "HG-BO is preferable to NBL-BO as it weighs the translations directly in the hypergraphs, which encode an exponentially larger space of translations than the N-best lists, and thus noise is diminished." } ], "y": { "@BACK@": { "gold_contexts": [ [ "13249ad2fd022b9b4f1d22d2ca77cd-C001-9" ], [ "13249ad2fd022b9b4f1d22d2ca77cd-C001-17" ] ], "cite_sentences": [ "13249ad2fd022b9b4f1d22d2ca77cd-C001-9", "13249ad2fd022b9b4f1d22d2ca77cd-C001-17" ] } } }, "ABC_15bacab4a8c520cfcdd7e7bd1e9ec5_46": { "x": [ { "sent_id": "15bacab4a8c520cfcdd7e7bd1e9ec5-C001-55", "text": "**DURATION**" }, { "sent_id": "15bacab4a8c520cfcdd7e7bd1e9ec5-C001-56", "text": "The intended duration of this tutorial is 3.5 hours plus a half an hour break." }, { "sent_id": "15bacab4a8c520cfcdd7e7bd1e9ec5-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "15bacab4a8c520cfcdd7e7bd1e9ec5-C001-2", "text": "Adversarial learning is a game-theoretic learning paradigm, which has achieved huge successes in the field of Computer Vision recently." }, { "sent_id": "15bacab4a8c520cfcdd7e7bd1e9ec5-C001-3", "text": "It is a general framework that enables a variety of learning models, including the popular Generative Adversarial Networks (GANs)." }, { "sent_id": "15bacab4a8c520cfcdd7e7bd1e9ec5-C001-4", "text": "Due to the discrete nature of language, designing adversarial learning models is still challenging for NLP problems." }, { "sent_id": "15bacab4a8c520cfcdd7e7bd1e9ec5-C001-5", "text": "In this tutorial, we provide a gentle introduction to the foundation of deep adversarial learning, as well as some practical problem formulations and solutions in NLP." 
}, { "sent_id": "15bacab4a8c520cfcdd7e7bd1e9ec5-C001-6", "text": "We describe recent advances in deep adversarial learning for NLP, with a special focus on generation, adversarial examples & rules, and dialogue." }, { "sent_id": "15bacab4a8c520cfcdd7e7bd1e9ec5-C001-7", "text": "We provide an overview of the research area, categorize different types of adversarial learning models, and discuss pros and cons, aiming to provide some practical perspectives on the future of adversarial learning for solving real-world NLP problems." }, { "sent_id": "15bacab4a8c520cfcdd7e7bd1e9ec5-C001-8", "text": "----------------------------------" }, { "sent_id": "15bacab4a8c520cfcdd7e7bd1e9ec5-C001-9", "text": "**TUTORIAL DESCRIPTION**" }, { "sent_id": "15bacab4a8c520cfcdd7e7bd1e9ec5-C001-10", "text": "Adversarial learning (AdvL) is an emerging research area that involves a game-theoretical formulation of the learning problem." }, { "sent_id": "15bacab4a8c520cfcdd7e7bd1e9ec5-C001-11", "text": "Recently, with the introduction of Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) , we have observed some stunning results in the area of image synthesis in Computer Vision (Brock et al., 2018) ." }, { "sent_id": "15bacab4a8c520cfcdd7e7bd1e9ec5-C001-12", "text": "Comparing to images, even language is discrete, the general family of adversarial learning methods still have gained significantly more attentions in NLP in recent years 1 ." }, { "sent_id": "15bacab4a8c520cfcdd7e7bd1e9ec5-C001-13", "text": "In contrast to the focus of GANs in Computer Vision, Natural Language Processing researchers have taken a broader approach to adversarial learning." 
}, { "sent_id": "15bacab4a8c520cfcdd7e7bd1e9ec5-C001-14", "text": "For example, three core technical subareas for adversarial learning include:" }, { "sent_id": "15bacab4a8c520cfcdd7e7bd1e9ec5-C001-15", "text": "\u2022 Adversarial Examples, where researchers focus on learning or creating adversarial examples or rules to improve the robustness of NLP systems." }, { "sent_id": "15bacab4a8c520cfcdd7e7bd1e9ec5-C001-16", "text": "(Jia and Liang, 2017; Alzantot et al., 2018; Iyyer et al., 2018; Ebrahimi et al., 2018a,b; Shi et al., 2018b; Chen et al., 2018; Farag et al., 2018; Ribeiro et al., 2018; Zhao et al., 2018) \u2022 Adversarial Training, which focuses on adding noise, randomness, or adversarial loss during optimization." }, { "sent_id": "15bacab4a8c520cfcdd7e7bd1e9ec5-C001-17", "text": "(Wu et al., 2017; Wang and Bansal, 2018; Li et al., 2018a; Yasunaga et al., 2018; Ponti et al., 2018; Kurita et al., 2018; Kang et al., 2018; Li et al., 2018c; Masumura et al., 2018) \u2022 Adversarial Generation, which primarily includes practical solutions of GANs for processing and generation natural language." }, { "sent_id": "15bacab4a8c520cfcdd7e7bd1e9ec5-C001-18", "text": "(Yu et al., 2017; Li et al., 2017; Wang and Lee, 2018; Additionally, we will also introduce other technical focuses such as negative sampling and contrastive estimation (Cai and Wang, 2018; Bose et al., 2018 ), adversarial evaluation (Elliott, 2018 , and reward learning (Wang et al., 2018c) ." 
}, { "sent_id": "15bacab4a8c520cfcdd7e7bd1e9ec5-C001-19", "text": "In particular, we will also provide a gentle introduction to the applications of adversarial learning in different NLP problems, including social media (Wang et al., 2018a; Carton et al., 2018) , domain adaptation (Kim et al., 2017; Alam et al., 2018; Zou et al., 2018; Chen and Cardie, 2018; Tran and Nguyen, 2018; Cao et al., 2018; Li et al., 2018b) , data cleaning (Elazar and Goldberg, 2018; Shah et al., 2018; Ryu et al., 2018; Zellers et al., 2018) , information extraction (Qin et al., 2018; Hong et al., 2018; Wang et al., 2018b; Bekoulis et al., 2018) , and information retrieval (Li and Cheng, 2018) ." }, { "sent_id": "15bacab4a8c520cfcdd7e7bd1e9ec5-C001-20", "text": "Adversarial learning methods could easily combine any representation learning based neural networks, and optimize for complex problems in NLP." }, { "sent_id": "15bacab4a8c520cfcdd7e7bd1e9ec5-C001-21", "text": "However, a key challenge for applying deep adversarial learning techniques to real-world sized NLP problems is the model design issue." }, { "sent_id": "15bacab4a8c520cfcdd7e7bd1e9ec5-C001-22", "text": "This tutorial draws connections from theories of deep adversarial learning to practical applications in NLP." }, { "sent_id": "15bacab4a8c520cfcdd7e7bd1e9ec5-C001-23", "text": "In particular, we start with the gentle introduction to the fundamentals of adversarial learning." }, { "sent_id": "15bacab4a8c520cfcdd7e7bd1e9ec5-C001-24", "text": "We further discuss their modern deep learning extensions such as Generative Adversarial Networks (Goodfellow et al., 2014) ." }, { "sent_id": "15bacab4a8c520cfcdd7e7bd1e9ec5-C001-25", "text": "In the first part of the tutorial, we also outline various applications of deep adversarial learning in NLP listed above." 
}, { "sent_id": "15bacab4a8c520cfcdd7e7bd1e9ec5-C001-26", "text": "In the second part of the tutorial, we will focus on generation of adversarial examples and their uses in NLP tasks, including (1) The inclusion and creation of adversarial examples for robust NLP; (2) The usage of adversarial rules for interpretable and explainable models; and (3) The relationship between adversarial training and adversarial examples." }, { "sent_id": "15bacab4a8c520cfcdd7e7bd1e9ec5-C001-27", "text": "In the third part of the tutorial, we focus on GANs." }, { "sent_id": "15bacab4a8c520cfcdd7e7bd1e9ec5-C001-28", "text": "We start with the general background introduction of generative adversarial learning." }, { "sent_id": "15bacab4a8c520cfcdd7e7bd1e9ec5-C001-29", "text": "We will introduce an in-depth case study of Generative Adversarial Networks for NLP, with a focus on dialogue generation (Li et al., 2017) ." }, { "sent_id": "15bacab4a8c520cfcdd7e7bd1e9ec5-C001-30", "text": "This tutorial aims at introducing deep adversarial learning methods to researchers in the NLP community." }, { "sent_id": "15bacab4a8c520cfcdd7e7bd1e9ec5-C001-31", "text": "We do not assume any particular prior knowledge in adversarial learning." }, { "sent_id": "15bacab4a8c520cfcdd7e7bd1e9ec5-C001-32", "text": "The intended length of the tutorial is 3.5 hours, including a coffee break." }, { "sent_id": "15bacab4a8c520cfcdd7e7bd1e9ec5-C001-33", "text": "----------------------------------" }, { "sent_id": "15bacab4a8c520cfcdd7e7bd1e9ec5-C001-34", "text": "**OUTLINE**" }, { "sent_id": "15bacab4a8c520cfcdd7e7bd1e9ec5-C001-35", "text": "Noise-Robust Representation Learning, Adversarial Learning, and Generation are three closely related research subjects in Natural Language Processing." 
}, { "sent_id": "15bacab4a8c520cfcdd7e7bd1e9ec5-C001-36", "text": "In this tutorial, we touch the intersection of all the three research subjects, covering various aspects of the theories of modern deep adversarial learning methods, and show their successful applications in NLP." }, { "sent_id": "15bacab4a8c520cfcdd7e7bd1e9ec5-C001-37", "text": "This tutorial is organized in three parts:" }, { "sent_id": "15bacab4a8c520cfcdd7e7bd1e9ec5-C001-38", "text": "\u2022 Foundations of Deep Adversarial Learning." }, { "sent_id": "15bacab4a8c520cfcdd7e7bd1e9ec5-C001-39", "text": "First, we will provide a brief overview of adversarial learning (RL), and discuss the cutting-edge settings in NLP." }, { "sent_id": "15bacab4a8c520cfcdd7e7bd1e9ec5-C001-40", "text": "We describe methods such as Adversarial Training (Wu et al., 2017) , Negative Sampling, and Noise Contrastive Estimation (Cai and Wang, 2018; Bose et al., 2018) ." }, { "sent_id": "15bacab4a8c520cfcdd7e7bd1e9ec5-C001-41", "text": "We introduce domain-adaptation learning approaches, and the widely used data cleaning and information extraction methods (Elazar and Goldberg, 2018; Shah et al., 2018; Ryu et al., 2018; Zellers et al., 2018; Qin et al., 2018; Hong et al., 2018; Wang et al., 2018b; Bekoulis et al., 2018) ." }, { "sent_id": "15bacab4a8c520cfcdd7e7bd1e9ec5-C001-42", "text": "In this part, we also introduce the modern renovation of deep generative adversarial learning (Goodfellow et al., 2014) , with a focus on NLP (Yu et al., 2017; Wang and Lee, 2018; ." }, { "sent_id": "15bacab4a8c520cfcdd7e7bd1e9ec5-C001-43", "text": "\u2022 Adversarial Examples for NLP Second, we will focus on the designing practical adversarial examples for NLP tasks." }, { "sent_id": "15bacab4a8c520cfcdd7e7bd1e9ec5-C001-44", "text": "In particular, we will provide an overview of recent methods, including their categorization by whether they are white (e.g. Ebrahimi et al., 2018a) (Pezeshkpour et al., 2019) ." 
}, { "sent_id": "15bacab4a8c520cfcdd7e7bd1e9ec5-C001-45", "text": "\u2022 An In-depth Case Study of GANs in NLP." }, { "sent_id": "15bacab4a8c520cfcdd7e7bd1e9ec5-C001-46", "text": "Third, we switch from the focuses of adversarial training and adversarial examples to generative adversarial networks (Goodfellow et al., 2014) ." }, { "sent_id": "15bacab4a8c520cfcdd7e7bd1e9ec5-C001-47", "text": "We will discuss why it is challenging to deploy GANs for NLP problems, comparing to vision problems." }, { "sent_id": "15bacab4a8c520cfcdd7e7bd1e9ec5-C001-48", "text": "We then focus on introducing Seq-GAN (Yu et al., 2017) , an early solution of textual models of GAN, with a focus on policy gradient and Monte Carlo Tree Search." }, { "sent_id": "15bacab4a8c520cfcdd7e7bd1e9ec5-C001-49", "text": "Finally, we provide an in-depth case study of deploying two-agent GAN models for conversational AI (Li et al., 2017) ." }, { "sent_id": "15bacab4a8c520cfcdd7e7bd1e9ec5-C001-50", "text": "We will summarize the lessons learned, and how we can move forward to investigate game-theoretical approaches in advancing NLP problems." 
}, { "sent_id": "15bacab4a8c520cfcdd7e7bd1e9ec5-C001-51", "text": "----------------------------------" }, { "sent_id": "15bacab4a8c520cfcdd7e7bd1e9ec5-C001-52", "text": "**HISTORY**" }, { "sent_id": "15bacab4a8c520cfcdd7e7bd1e9ec5-C001-53", "text": "The full content of this tutorial has not yet been presented elsewhere, but some parts of this tutorial has also been presented at the following locations in recent years:" }, { "sent_id": "15bacab4a8c520cfcdd7e7bd1e9ec5-C001-54", "text": "----------------------------------" } ], "y": { "@UNSURE@": { "gold_contexts": [ [ "15bacab4a8c520cfcdd7e7bd1e9ec5-C001-18" ] ], "cite_sentences": [ "15bacab4a8c520cfcdd7e7bd1e9ec5-C001-18" ] }, "@BACK@": { "gold_contexts": [ [ "15bacab4a8c520cfcdd7e7bd1e9ec5-C001-29" ] ], "cite_sentences": [ "15bacab4a8c520cfcdd7e7bd1e9ec5-C001-29" ] }, "@USE@": { "gold_contexts": [ [ "15bacab4a8c520cfcdd7e7bd1e9ec5-C001-49" ] ], "cite_sentences": [ "15bacab4a8c520cfcdd7e7bd1e9ec5-C001-49" ] } } }, "ABC_7e52a90a9a0a703250d5c3c1890058_46": { "x": [ { "sent_id": "7e52a90a9a0a703250d5c3c1890058-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "7e52a90a9a0a703250d5c3c1890058-C001-2", "text": "There are many domain-specific and language-specific NLG systems, which are possibly adaptable across related domains and languages." }, { "sent_id": "7e52a90a9a0a703250d5c3c1890058-C001-3", "text": "The languages in the Bantu language family have their own set of features distinct from other major groups, which therefore severely limits the options to bootstrap an NLG system from existing ones." }, { "sent_id": "7e52a90a9a0a703250d5c3c1890058-C001-4", "text": "We present here our first proof-of-concept application for knowledge-to-text NLG as a plugin to the Prot\u00e9g\u00e9 5.x ontology development system, tailored to Runyankore, a Bantu language indigenous to Uganda." 
}, { "sent_id": "7e52a90a9a0a703250d5c3c1890058-C001-5", "text": "It comprises a basic annotation model for linguistic information such as noun class, an implementation of existing verbalisation rules and a CFG for verbs, and a basic interface for data entry." }, { "sent_id": "7e52a90a9a0a703250d5c3c1890058-C001-6", "text": "----------------------------------" }, { "sent_id": "7e52a90a9a0a703250d5c3c1890058-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "7e52a90a9a0a703250d5c3c1890058-C001-8", "text": "Natural Language Generation systems require content planning and format for the selected subject domain as input and specifics about the natural language in order to generate text (Staykova, 2014) , of which the latter tend to be bootstrappable for related languages (de Oliveira and Sripada, 2014) ." }, { "sent_id": "7e52a90a9a0a703250d5c3c1890058-C001-9", "text": "Our NLG system uses ontologies to represent domain knowledge." }, { "sent_id": "7e52a90a9a0a703250d5c3c1890058-C001-10", "text": "As for language, we are interested in Runyankore, a Bantu language indigenous to south western Uganda." }, { "sent_id": "7e52a90a9a0a703250d5c3c1890058-C001-11", "text": "The highly agglutinative structure and complex verbal morphology of Runyankore make existing NLG systems based on templates inapplicable (Keet and Khumalo, 2017) ." }, { "sent_id": "7e52a90a9a0a703250d5c3c1890058-C001-12", "text": "There have been efforts undertaken to apply the grammar engine technique instead (Byamugisha et al., 2016a; Byamugisha et al., 2016b; Byamugisha et al., 2016c) , which resulted in theoretical advances in verbalization rules for ontologies, pluralization of nouns, and verb conjugation that address the text generation needs for Runyankore." }, { "sent_id": "7e52a90a9a0a703250d5c3c1890058-C001-13", "text": "We present our implementation of these algorithms and required linguistic annotations as a Prot\u00e9g\u00e9 5.x plugin." 
}, { "sent_id": "7e52a90a9a0a703250d5c3c1890058-C001-14", "text": "also ensures no typographical errors are made in the XML file." }, { "sent_id": "7e52a90a9a0a703250d5c3c1890058-C001-15", "text": "These annotation fields are mandatory, and we allowed for the use of 0 as the NC for the POS which is not a noun." }, { "sent_id": "7e52a90a9a0a703250d5c3c1890058-C001-16", "text": "These restrictions to input were achieved using document filters." }, { "sent_id": "7e52a90a9a0a703250d5c3c1890058-C001-17", "text": "The XML file is queried during the verbalization process so as to obtain the required annotations that are needed for the algorithms." }, { "sent_id": "7e52a90a9a0a703250d5c3c1890058-C001-18", "text": "----------------------------------" }, { "sent_id": "7e52a90a9a0a703250d5c3c1890058-C001-19", "text": "**IMPLEMENTATION OF THE GRAMMAR ENGINE**" }, { "sent_id": "7e52a90a9a0a703250d5c3c1890058-C001-20", "text": "We implemented the algorithms for verbalization and pluralization presented in (Byamugisha et al., 2016a; Byamugisha et al., 2016c ) as a Java application." }, { "sent_id": "7e52a90a9a0a703250d5c3c1890058-C001-21", "text": "The CFG specified in (Byamugisha et al., 2016b) was implemented using the CFG Java tool (Xu et al., 2011) ." }, { "sent_id": "7e52a90a9a0a703250d5c3c1890058-C001-22", "text": "We used this tool for three main reasons: our grammar engine implementation was done in Java, so we wanted a Java tool as well; we wanted a small CFG implementation for reasonable performance; and their tool extended Purdom's algorithm to fulfill Context-Dependent Rule Coverage (CDRC), which generates more and simpler sentences." 
}, { "sent_id": "7e52a90a9a0a703250d5c3c1890058-C001-23", "text": "A sample of the generated text is presented below:" }, { "sent_id": "7e52a90a9a0a703250d5c3c1890058-C001-24", "text": "\u2022 Buri rupapura rwamakuru n'ekihandiiko ekishohoziibwe, (generated from: Newspaper Publication)" }, { "sent_id": "7e52a90a9a0a703250d5c3c1890058-C001-25", "text": "\u2022 Buri ntaama nerya ebinyaansi byoona, (gener-" }, { "sent_id": "7e52a90a9a0a703250d5c3c1890058-C001-26", "text": "The generated text is saved in a text file, which ensures that the text can be linked to other application scenarios." }, { "sent_id": "7e52a90a9a0a703250d5c3c1890058-C001-27", "text": "We are working on a better design to present the sentences within the tool, for interaction during multi-modal ontology development." }, { "sent_id": "7e52a90a9a0a703250d5c3c1890058-C001-28", "text": "The grammar engine can be launched through the 'Runyankore>Verbalize' submenu under the 'Tools' menu in Prot\u00e9g\u00e9 5.x." }, { "sent_id": "7e52a90a9a0a703250d5c3c1890058-C001-29", "text": "The jar file is available from https://github.com/ runyankorenlg/RunyankoreNLGSystem." }, { "sent_id": "7e52a90a9a0a703250d5c3c1890058-C001-30", "text": "----------------------------------" }, { "sent_id": "7e52a90a9a0a703250d5c3c1890058-C001-31", "text": "**CONCLUSION**" }, { "sent_id": "7e52a90a9a0a703250d5c3c1890058-C001-32", "text": "We briefly presented the core components of the Runyankore grammar engine Prot\u00e9g\u00e9 5.x plugin." }, { "sent_id": "7e52a90a9a0a703250d5c3c1890058-C001-33", "text": "It implements algorithms for verbalization patterns, noun pluralization, and verb conjugation." }, { "sent_id": "7e52a90a9a0a703250d5c3c1890058-C001-34", "text": "To make this work, the grammar engine requires linguistic information about each noun and verb (OWL class and object property) in the ontology in order to generate text." 
}, { "sent_id": "7e52a90a9a0a703250d5c3c1890058-C001-35", "text": "This linguistic information is stored in as separate XML file." }, { "sent_id": "7e52a90a9a0a703250d5c3c1890058-C001-36", "text": "The demo will show the working system and further details of the architecture." } ], "y": { "@BACK@": { "gold_contexts": [ [ "7e52a90a9a0a703250d5c3c1890058-C001-12" ], [ "7e52a90a9a0a703250d5c3c1890058-C001-21" ] ], "cite_sentences": [ "7e52a90a9a0a703250d5c3c1890058-C001-12", "7e52a90a9a0a703250d5c3c1890058-C001-21" ] }, "@USE@": { "gold_contexts": [ [ "7e52a90a9a0a703250d5c3c1890058-C001-12", "7e52a90a9a0a703250d5c3c1890058-C001-13" ], [ "7e52a90a9a0a703250d5c3c1890058-C001-20", "7e52a90a9a0a703250d5c3c1890058-C001-21" ] ], "cite_sentences": [ "7e52a90a9a0a703250d5c3c1890058-C001-12", "7e52a90a9a0a703250d5c3c1890058-C001-21" ] } } }, "ABC_4f0dec0ce2d7639c250be00d5efee4_46": { "x": [ { "sent_id": "4f0dec0ce2d7639c250be00d5efee4-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "4f0dec0ce2d7639c250be00d5efee4-C001-2", "text": "Given a noisy text page, a word recognizer can generate a set of candidates for each word image." }, { "sent_id": "4f0dec0ce2d7639c250be00d5efee4-C001-3", "text": "A relaxation algorithm was proposed previously by the authors that uses word collocation statistics to select the candidate for each word that has the highest probability of being the correct decision." }, { "sent_id": "4f0dec0ce2d7639c250be00d5efee4-C001-4", "text": "Because word collocation is a local constraint and collocation data trained from corpora are usually incomplete, the algorithm cannot select the correct candidates for some images." }, { "sent_id": "4f0dec0ce2d7639c250be00d5efee4-C001-5", "text": "To overcome this limitation, contextual information at the image level is now exploited inside the relaxation algorithm." }, { "sent_id": "4f0dec0ce2d7639c250be00d5efee4-C001-6", "text": "If two word images can match with each other, they should have same symbolic identity." 
}, { "sent_id": "4f0dec0ce2d7639c250be00d5efee4-C001-7", "text": "Visual inter-word relations provide a way to link word images in the text and to interpret them systematically." }, { "sent_id": "4f0dec0ce2d7639c250be00d5efee4-C001-8", "text": "By integrating visual inter-word constraints with word collocation data, the performance of the relaxation algorithm is improved." }, { "sent_id": "4f0dec0ce2d7639c250be00d5efee4-C001-9", "text": "----------------------------------" }, { "sent_id": "4f0dec0ce2d7639c250be00d5efee4-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "4f0dec0ce2d7639c250be00d5efee4-C001-11", "text": "Word collocation is one source of information that has been proposed as a useful tool to post-process word recognition results( [1, 4] )." }, { "sent_id": "4f0dec0ce2d7639c250be00d5efee4-C001-12", "text": "It can be considered as a constraint on candidate selection so that the word candidate selection problem can be formalized as an instance of constraint satisfaction." }, { "sent_id": "4f0dec0ce2d7639c250be00d5efee4-C001-13", "text": "Relaxation is a typical method for constraint satisfaction problems." }, { "sent_id": "4f0dec0ce2d7639c250be00d5efee4-C001-14", "text": "One of the advantages of relaxation is that it can achieve a global effect by using local constraints." }, { "sent_id": "4f0dec0ce2d7639c250be00d5efee4-C001-15", "text": "Previously, a probabilistic relaxation algorithm was proposed for word candidate re-evaluation and selection( [2] )." }, { "sent_id": "4f0dec0ce2d7639c250be00d5efee4-C001-16", "text": "The basic idea of the algorithm is to use word collocation constraints to select the word candidates that have a high probability of occurring simultaneously with word candidates at other nearby locations." }, { "sent_id": "4f0dec0ce2d7639c250be00d5efee4-C001-17", "text": "The algorithm runs iteratively." 
}, { "sent_id": "4f0dec0ce2d7639c250be00d5efee4-C001-18", "text": "In each iteration, the probability of each word candidate is upgraded based on its previous probability, the probabilities of its neighbors and word collocation data." }, { "sent_id": "4f0dec0ce2d7639c250be00d5efee4-C001-19", "text": "The initial probability of each word candidate is provided by a word recognizer." }, { "sent_id": "4f0dec0ce2d7639c250be00d5efee4-C001-20", "text": "The relaxation process terminates when the probabil-ity of each word candidate becomes stable." }, { "sent_id": "4f0dec0ce2d7639c250be00d5efee4-C001-21", "text": "After relaxation finishes, for each word image, the word candidate with highest probabilistic score will be selected as the decision word." }, { "sent_id": "4f0dec0ce2d7639c250be00d5efee4-C001-22", "text": "Because the window size of word collocation is usually small, word collocation is a local constraint. Because word collocation data are derived from text corpora, it usually is incomplete and unbalanced." }, { "sent_id": "4f0dec0ce2d7639c250be00d5efee4-C001-23", "text": "Those properties limit the usefulness of word collocation for candidate selection." }, { "sent_id": "4f0dec0ce2d7639c250be00d5efee4-C001-24", "text": "By analyzing the performance of the algorithm, three sources of errors were identified: (1) ." }, { "sent_id": "4f0dec0ce2d7639c250be00d5efee4-C001-25", "text": "the local context cannot provide enough information to distinguish the competitive candidates; (2) ." }, { "sent_id": "4f0dec0ce2d7639c250be00d5efee4-C001-26", "text": "word collocation data trained from corpora are not complete so that it does not include the statistical data needed to select the correct candidate; and (3)." }, { "sent_id": "4f0dec0ce2d7639c250be00d5efee4-C001-27", "text": "word collocation data trained from unbalanced corpora are biased so that the wrong candidate is selected." 
}, { "sent_id": "4f0dec0ce2d7639c250be00d5efee4-C001-28", "text": "In a normal English text, there are many occurrences of the same words." }, { "sent_id": "4f0dec0ce2d7639c250be00d5efee4-C001-29", "text": "Because the main body of a text is usually prepared in the same font type, different occurrences of the same word are visually similar even if the text image is highly degraded." }, { "sent_id": "4f0dec0ce2d7639c250be00d5efee4-C001-30", "text": "Visual similarity between word images can place useful constraints on the process of candidate selection( [3] )." }, { "sent_id": "4f0dec0ce2d7639c250be00d5efee4-C001-31", "text": "If two word images can match with each other, their identities should be the same." }, { "sent_id": "4f0dec0ce2d7639c250be00d5efee4-C001-32", "text": "For example, if there are two sentences, \"Please fill in the application X \" and \"This Y is almost the same as that one\", where X and Y are visually similar, and both of them have the candidate set { farm, form } ." }, { "sent_id": "4f0dec0ce2d7639c250be00d5efee4-C001-33", "text": "The candidate \"form\" can be easily selected as the decision for X and Y if we consider both word collocation and visual inter-word constraints, although it is difficult to select a candidate for Y by only using word collocation." }, { "sent_id": "4f0dec0ce2d7639c250be00d5efee4-C001-34", "text": "Figure 1 is the description of the new relaxation algorithm that integrates word collocation and visuM interword constraints for candidate selection." }, { "sent_id": "4f0dec0ce2d7639c250be00d5efee4-C001-35", "text": "Given a sequence of word images from a text page, the first step of the algorithm is word image clustering." }, { "sent_id": "4f0dec0ce2d7639c250be00d5efee4-C001-36", "text": "Then, a word recognizer is applied to the prototype for each image cluster to generate a set of word candidates." 
}, { "sent_id": "4f0dec0ce2d7639c250be00d5efee4-C001-37", "text": "Each word inside a cluster inherits the candidate set for the cluster." }, { "sent_id": "4f0dec0ce2d7639c250be00d5efee4-C001-38", "text": "In an iteration of relaxation, the probabilistic scores of the candidates for a word image are upgraded based on word collocation data." }, { "sent_id": "4f0dec0ce2d7639c250be00d5efee4-C001-39", "text": "The probabilistic scores of the candidates for a cluster are upgraded by summing up the probabilistic scores of the word images inside the cluster." }, { "sent_id": "4f0dec0ce2d7639c250be00d5efee4-C001-40", "text": "Each word image then inherits the candidate set from the cluster it belongs to." }, { "sent_id": "4f0dec0ce2d7639c250be00d5efee4-C001-41", "text": "When there is no further significant change in the confidence scores, the relaxation stops." }, { "sent_id": "4f0dec0ce2d7639c250be00d5efee4-C001-42", "text": "The top candidate for each word image is selected as the decision." }, { "sent_id": "4f0dec0ce2d7639c250be00d5efee4-C001-43", "text": "Table 1 : Relaxation Results Conclusions A word-collocation-based relaxation algorithm was proposed for candidate selection in degraded text recognition." }, { "sent_id": "4f0dec0ce2d7639c250be00d5efee4-C001-44", "text": "Word collocation is a local statistical constraint, which sometimes is not sufficient to distinguish among the candidates." }, { "sent_id": "4f0dec0ce2d7639c250be00d5efee4-C001-45", "text": "To make candidate selection more accurate, visual inter-word constraints are investigated." }, { "sent_id": "4f0dec0ce2d7639c250be00d5efee4-C001-46", "text": "A new relaxation algorithm augmented with visual interword constraints was designed." }, { "sent_id": "4f0dec0ce2d7639c250be00d5efee4-C001-47", "text": "Experimental results showed that the modified algorithm has better performance." 
}, { "sent_id": "4f0dec0ce2d7639c250be00d5efee4-C001-48", "text": "----------------------------------" }, { "sent_id": "4f0dec0ce2d7639c250be00d5efee4-C001-49", "text": "**MODIFIED RELAXATION ALGORITHM**" } ], "y": { "@BACK@": { "gold_contexts": [ [ "4f0dec0ce2d7639c250be00d5efee4-C001-11" ] ], "cite_sentences": [ "4f0dec0ce2d7639c250be00d5efee4-C001-11" ] } } }, "ABC_45723171ec398550e687c57d42e7cc_46": { "x": [ { "sent_id": "45723171ec398550e687c57d42e7cc-C001-13", "text": "Task 3: Automatic classification of adverse drug reaction mentioning posts." }, { "sent_id": "45723171ec398550e687c57d42e7cc-C001-14", "text": "Task 4: Automatic detection of posts mentioning vaccination behavior." }, { "sent_id": "45723171ec398550e687c57d42e7cc-C001-15", "text": "All tasks, except Task 2 are binary classification tasks." }, { "sent_id": "45723171ec398550e687c57d42e7cc-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "45723171ec398550e687c57d42e7cc-C001-2", "text": "This paper describes our systems in social media mining for health applications (SMM4H) shared task." }, { "sent_id": "45723171ec398550e687c57d42e7cc-C001-3", "text": "We participated in all four tracks of the shared task using linear models with a combination of character and word n-gram features." }, { "sent_id": "45723171ec398550e687c57d42e7cc-C001-4", "text": "We did not use any external data or domain specific information." }, { "sent_id": "45723171ec398550e687c57d42e7cc-C001-5", "text": "The resulting systems achieved above-average scores among other participating systems, with F 1 -scores of 91." }, { "sent_id": "45723171ec398550e687c57d42e7cc-C001-6", "text": "22, 46.8, 42.4, and 85.53 on tasks 1, 2, 3, and 4 respectively." 
}, { "sent_id": "45723171ec398550e687c57d42e7cc-C001-7", "text": "----------------------------------" }, { "sent_id": "45723171ec398550e687c57d42e7cc-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "45723171ec398550e687c57d42e7cc-C001-9", "text": "The increasing use of social media platforms world wide offers an interesting application of natural language processing tools for monitoring public health and health-related events on the social media." }, { "sent_id": "45723171ec398550e687c57d42e7cc-C001-10", "text": "The social media mining for health applications (SMM4H) shared task (Weissenbacher et al., 2018) hosts four tasks aiming to identify mentions of different aspects medication use on Twitter." }, { "sent_id": "45723171ec398550e687c57d42e7cc-C001-11", "text": "Briefly, the tasks and their descriptions are: Task 1: Automatic detection of posts mentioning drug names." }, { "sent_id": "45723171ec398550e687c57d42e7cc-C001-12", "text": "Task 2: Automatic classification of posts describing medication intake." }, { "sent_id": "45723171ec398550e687c57d42e7cc-C001-16", "text": "Task 2 requires three-way classification, including an uncertain class indicating posts mentioning possible medication intake." }, { "sent_id": "45723171ec398550e687c57d42e7cc-C001-17", "text": "For all tasks, we used linear SVM classifiers with character and word bag-of-n-gram features." }, { "sent_id": "45723171ec398550e687c57d42e7cc-C001-18", "text": "We also experimented with other methods, including deep learning methods with gated RNNs for building document representations." }, { "sent_id": "45723171ec398550e687c57d42e7cc-C001-19", "text": "However, SVM models achieved best results on the development data." }, { "sent_id": "45723171ec398550e687c57d42e7cc-C001-20", "text": "As a result, we only submitted results using linear SVMs, and we will only describe and discuss results of these model in this paper." 
}, { "sent_id": "45723171ec398550e687c57d42e7cc-C001-21", "text": "----------------------------------" }, { "sent_id": "45723171ec398550e687c57d42e7cc-C001-22", "text": "**METHODS AND EXPERIMENTAL PROCEDURE**" }, { "sent_id": "45723171ec398550e687c57d42e7cc-C001-23", "text": "We use the same general model for all tasks: linear SVM classifiers with character and word bagof-n-gram features." }, { "sent_id": "45723171ec398550e687c57d42e7cc-C001-24", "text": "Tokenization was done using a simple regular expression tokenizer that splits the text into consecutive alphanumeric and nonspace, non-alphanumeric tokens." }, { "sent_id": "45723171ec398550e687c57d42e7cc-C001-25", "text": "For each text to be classified, we extracted both character and word n-grams of order one up to a certain upper limit (specified below)." }, { "sent_id": "45723171ec398550e687c57d42e7cc-C001-26", "text": "All features are combined in a flat manner as a single text-feature matrix." }, { "sent_id": "45723171ec398550e687c57d42e7cc-C001-27", "text": "We experimented with two feature weighting methods: tf-idf (Jurafsky and Martin, 2009, p.805) and BM25 (Robertson et al., 2009 )." }, { "sent_id": "45723171ec398550e687c57d42e7cc-C001-28", "text": "The weighted features are then used for training an SVM classifier." }, { "sent_id": "45723171ec398550e687c57d42e7cc-C001-29", "text": "We used one-vs-rest multi-class strategy when training the SVM classifier for task 2." }, { "sent_id": "45723171ec398550e687c57d42e7cc-C001-30", "text": "All models were implemented in Python, using scikit-learn machine learning library (Pedregosa et al., 2011) ." }, { "sent_id": "45723171ec398550e687c57d42e7cc-C001-31", "text": "The models are similar to the models we used in a few other text classification tasks (\u00c7\u00f6ltekin and Rama, 2018; , where the models are explained in detail." 
}, { "sent_id": "45723171ec398550e687c57d42e7cc-C001-32", "text": "We tuned the models for each task separately, changing the maximum order of character and word n-gram features, case normalization, and SVM margin parameter 'C'." }, { "sent_id": "45723171ec398550e687c57d42e7cc-C001-33", "text": "The parameter ranges explored during tuning was 0-12 for maximum character n-gram order, 0-7 for maximum word n-gram order, and 0.1-2.0 with steps of 0.1 for 'C'." }, { "sent_id": "45723171ec398550e687c57d42e7cc-C001-34", "text": "We used 5-fold cross validation during tuning, using random search through the space of hy- Table 1 : F1-scores of tf-idf and BM25 weighted models on the development set and the official test set." }, { "sent_id": "45723171ec398550e687c57d42e7cc-C001-35", "text": "The F1-scores for task 2 are micro-averaged." }, { "sent_id": "45723171ec398550e687c57d42e7cc-C001-36", "text": "The two set of scores for Task 4 reflect the difference between the full labeled-data set (including additional 1211 training instances) in comparison to the original training set." }, { "sent_id": "45723171ec398550e687c57d42e7cc-C001-37", "text": "perparameters described above." }, { "sent_id": "45723171ec398550e687c57d42e7cc-C001-38", "text": "Approximately 1000 random hyperparameter settings were tried for each model." }, { "sent_id": "45723171ec398550e687c57d42e7cc-C001-39", "text": "The models with the best parameter settings were retrained using the complete training data for producing the final predictions." }, { "sent_id": "45723171ec398550e687c57d42e7cc-C001-40", "text": "The source of the texts for all tasks is Twitter." }, { "sent_id": "45723171ec398550e687c57d42e7cc-C001-41", "text": "At the time we downloaded them, some tweets were not available, resulting in training set sizes of 9182, 15 723, 16 888, and 5759 for tasks 1, 2, 3 and 4 respectively." 
}, { "sent_id": "45723171ec398550e687c57d42e7cc-C001-42", "text": "Some of these numbers are substantially lower than that of intended number of training samples of 10 000, 17 000, 25 000, and 8180 respectively." }, { "sent_id": "45723171ec398550e687c57d42e7cc-C001-43", "text": "For task 4, we also used an additional 1211 tweets, initially planned as the test set for this task." }, { "sent_id": "45723171ec398550e687c57d42e7cc-C001-44", "text": "The test sets contained (approximately) 5000 tweets for tasks 1, 2 and 3, and a considerably smaller number (161) for task 4." }, { "sent_id": "45723171ec398550e687c57d42e7cc-C001-45", "text": "All training sets showed some degree of class imbalance." }, { "sent_id": "45723171ec398550e687c57d42e7cc-C001-46", "text": "The imbalance was particularly strong for tasks 3 and 4, where over 70 % and 90 % of the instances belonged to the negative class, respectively." }, { "sent_id": "45723171ec398550e687c57d42e7cc-C001-47", "text": "Further information on the data sets can be found in Weissenbacher et al. (2018) ." }, { "sent_id": "45723171ec398550e687c57d42e7cc-C001-48", "text": "Table 1 presents F 1 -scores of the models on each task." }, { "sent_id": "45723171ec398550e687c57d42e7cc-C001-49", "text": "In general, we do not observe substantial differences between the term weighting schemes, but for some tasks the gap between training and development set scores is rather large." }, { "sent_id": "45723171ec398550e687c57d42e7cc-C001-50", "text": "We do not know the system rankings at the time of writing, but only know that the results above are above the mean of the best-scores from all participating teams." 
}, { "sent_id": "45723171ec398550e687c57d42e7cc-C001-51", "text": "----------------------------------" }, { "sent_id": "45723171ec398550e687c57d42e7cc-C001-52", "text": "**RESULTS AND DISCUSSION**" }, { "sent_id": "45723171ec398550e687c57d42e7cc-C001-53", "text": "The systems we used for the shared task are simple, yet, effective classifiers with character and word n-gram features." }, { "sent_id": "45723171ec398550e687c57d42e7cc-C001-54", "text": "The big discrepancies between the development and test set scores in task 2 and task 3 points either some differences between the distributions of the training and test sets, or it may also be due to large amount of missing tweets in our training set, indicating more data is likely to be particularly useful in these tasks." }, { "sent_id": "45723171ec398550e687c57d42e7cc-C001-55", "text": "We also compared the effectiveness of two feature weighting systems, tf-idf and BM25, which did not show any substantial differences." }, { "sent_id": "45723171ec398550e687c57d42e7cc-C001-56", "text": "Since our models were originally intended as baseline models, the scores presented in Table 1 were obtained without the use of any external data or source of information." }, { "sent_id": "45723171ec398550e687c57d42e7cc-C001-57", "text": "Better results are likely by use of external information, such as appropriate dictionaries, term lists, or embeddings trained on large amounts of unlabeled data." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "45723171ec398550e687c57d42e7cc-C001-10" ] ], "cite_sentences": [ "45723171ec398550e687c57d42e7cc-C001-10" ] }, "@UNSURE@": { "gold_contexts": [ [ "45723171ec398550e687c57d42e7cc-C001-47" ] ], "cite_sentences": [ "45723171ec398550e687c57d42e7cc-C001-47" ] } } }, "ABC_5ed24e18f892d7092c183acab4b175_46": { "x": [ { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-2", "text": "Word embeddings or distributed representations of words are being used in various applications like machine translation, sentiment analysis, topic identification etc." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-3", "text": "Quality of word embeddings and performance of their applications depends on several factors like training method, corpus size and relevance etc." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-4", "text": "In this study we compare performance of a dozen of pretrained word embedding models on lyrics sentiment analysis and movie review polarity tasks." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-5", "text": "According to our results, Twitter Tweets is the best on lyrics sentiment analysis, whereas Google News and Common Crawl are the top performers on movie polarity analysis." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-6", "text": "Glove trained models slightly outrun those trained with Skipgram." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-7", "text": "Also, factors like topic relevance and size of corpus significantly impact the quality of the models." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-8", "text": "When medium or large-sized text sets are available, obtaining word embeddings from same training dataset is usually the best choice." 
}, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-9", "text": "----------------------------------" }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-35", "text": "It was trained using Skipgram word2vec with negative sampling, windows size 5 and 300 dimensions." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-61", "text": "The author reports that CNNs perform best." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-11", "text": "Semantic vector space models of language were developed in the 90s to predict joint probabilities of words that appear together in a sequence." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-12", "text": "A particular upturn was proposed by Bengio et al. in [1] , replacing sparse n-gram models with word embeddings which are more compact representations obtained using feed-forward or more advanced neural networks." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-13", "text": "Recently, high quality and easy to train Skip-gram shallow architectures were presented in [10] and considerably improved in [11] with the introduction of negative sampling and subsampling of frequent words." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-14", "text": "The \"magical\" ability of word embeddings to capture syntactic and semantic regularities on text words is applicable in various applications like machine translations, error correcting systems, sentiment analyzers etc." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-15", "text": "This ability has been tested in [12] and other studies with analogy question tests of the form \"A is to B as C is to \" or male/female relations." 
}, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-16", "text": "A recent improved method for generating word embeddings is Glove [15] which makes efficient use of global statistics of text words and preserves the linear substructure of Skip-gram word2vec, the other popular method." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-17", "text": "Authors report that Glove outperforms other methods such as Skip-gram in several tasks like word similarity, word analogy etc." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-18", "text": "In this paper we examine the quality of word embeddings on 2 sentiment analysis tasks: Lyrics mood recognition and movie review polarity analysis." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-19", "text": "We compare various models pretrained with Glove and Skip-gram, together with corpora we train ourself." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-20", "text": "Our goal is to report the best performing models as well as to observe the impact that certain factors like training method, corpus size and thematic relevance of texts might have on model quality." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-21", "text": "According to the results, Common Crawl, Twitter Tweets and Google News are the best performing models." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-22", "text": "Corpus size and thematic relevance have a significant role on the performance of the generated word vectors." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-23", "text": "We noticed that models trained with Glove slightly outperform those trained with Skip-gram in most of experiments." 
}, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-24", "text": "----------------------------------" }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-25", "text": "**WORD EMBEDDING CORPORA AND MODELS**" }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-26", "text": "In this section we present the different word embedding models that we compare." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-27", "text": "Most of them are pretrained and publicly available." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-28", "text": "Two of them (Text8Corpus and Moody-Corpus) were trained by us." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-29", "text": "The full list with some basic characteristics is presented in Table 1 ." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-30", "text": "Wikipedia Gigaword is a combination of Wikipedia 2014 dump and Gigaword 5 with about 6 billion tokens in total." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-31", "text": "It was created by authors of [15] to evaluate Glove performance." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-32", "text": "Wikipedia Dependency corpus is a collection of 1 billion tokens from Wikipedia." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-33", "text": "The method used for training it is a modified version of Skip-gram word2vec described in [7] ." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-34", "text": "Google News is one of the biggest and richest text sets with 100 billion tokens and a vocabulary of 3 million words and phrases [10] ." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-36", "text": "Even bigger is Common Crawl 840, a huge corpus of 840 billion tokens and 2.2 million word vectors also used at [15] ." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-37", "text": "It contains data of Common Crawl (http://commoncrawl.org), a nonprofit organization that creates and maintains public datasets by crawling the web." 
}, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-38", "text": "Common Crawl 42 is a reduced version made up of 42 billion tokens and a vocabulary of 1.9 million words." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-39", "text": "Common Crawl 840 and Common Crawl 42 were trained with Glove method producing vectors of 300 dimensions for each word." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-40", "text": "The last Glove corpus is the collection of Twitter Tweets." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-41", "text": "It consists of 2 billion tweets, 27 billion tokens and 1.2 million words." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-42", "text": "To observe the role of corpus size in quality of generated embeddings, we train and use Text8Corpus, a smaller corpus consisting of 17 million tokens and 25,000 words." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-43", "text": "The last model we use is MoodyCorpus, a collection of lyrics that followed our work in [3] where we build and evaluate MoodyLyrics, a sentiment annotated dataset of songs." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-44", "text": "The biggest part of MoodyCorpus was built using lyrics of Million Song Dataset (MSD) songs (https://labrosa.ee.columbia.edu/millionsong/)." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-45", "text": "As music tastes and characteristics change over time (http://kaylinwalker.com/50-years-ofpop-music), it is better to have diversified sources of songs in terms of epoch, genre etc." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-46", "text": "Thereby we added songs of different genres and epochs that we found in two subsets of MSD, Cal500 and TheBeatles." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-47", "text": "The resulting corpus of 90 million tokens and 43,000 words can be downloaded from http://softeng.polito.it/erion." 
}, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-48", "text": "Further information about public music datasets can be found at [2] ." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-49", "text": "----------------------------------" }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-50", "text": "**SENTIMENT ANALYSIS TASKS**" }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-51", "text": "The problem of music mood recognition is about utilizing machine learning, data mining and other techniques to automatically classify songs in 2 or more emotion categories with highest possible accuracy." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-52", "text": "Different combinations of features such as audio or lyrics are involved in the process." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-53", "text": "In this study we make use of song lyrics exploiting the dataset described in [9] (here AM628)." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-54", "text": "The original dataset contains 771 song texts collected from AllMusic portal." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-55", "text": "AllMusic tags and 3 human experts were used for the annotation of songs." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-56", "text": "We balanced the dataset obtaining 314 positive and 314 negative lyrics." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-57", "text": "We also utilize MoodyLyrics (here ML3K), a dataset of 3,000 mood labeled songs from different genres and epochs described in [3] ." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-58", "text": "Pioneering work in movie review polarity analysis has been conducted by Pang and Lee in [14] and [13] ." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-59", "text": "The authors released sentiment polarity dataset, a collection of 2,000 movie reviews categorized as positive or negative." 
}, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-60", "text": "Deep learning techniques and distributed word representations appeared on recent studies like [17] where the role of RNNs (Recurrent Neural Networks), and CNNs (Convolutional Neural Networks) is explored." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-62", "text": "An important work that has relevance here is [8] where authors present an even larger movie review dataset of 50,000 movie reviews from IMBD." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-63", "text": "This dataset has been used in various works such as [5] , [16] etc." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-64", "text": "For our experiments we used a chunk of 10K (MR10K) as well as the full set (MR50K)." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-65", "text": "We first cleaned and tokenized texts of the datasets." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-66", "text": "The dataset of the current run is loaded and a set of unique text words is created." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-67", "text": "All 14 models are also loaded in the script." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-68", "text": "We train a 15th (self w2v) model using the corpus of the current run and Skip-gram method." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-69", "text": "The script iterates in every line of the pretrained models splitting apart the words and the float vectors and building {word: vec} dictionaries later used as classification feature sets." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-70", "text": "Next we prepare the classification models using tf-idf vectorizer which has been successfully applied in similar studies like [4] ." 
}, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-71", "text": "Instead of applying tf-idf in words only as in other text classifiers, we vectorize both word (for semantic relevance) and corresponding vector (for syntactic and contextual relevance)." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-72", "text": "Random forest was used as classifier and 5-fold cross-validation accuracy is computed for each of the models." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-73", "text": "----------------------------------" }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-74", "text": "**RESULTS**" }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-75", "text": "In figures 1 and 2 we see results of 5-fold cross-validation on the 2 lyrics datasets." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-76", "text": "Top three models are crawl 840, twitter 50 and self w2v." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-77", "text": "On AM628 (very smallest dataset), it is crawl 840 (the biggest model) that leads, followed by twitter 50." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-78", "text": "Self w2v is severely penalized by its size and thus is at the bottom." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-79", "text": "On ML3K (large dataset) self w2v reaches the top of the list, leaving behind twitter 50." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-80", "text": "Wikigiga, google news and dep based are positioned in the middle whereas MoodyCorpus and Text8Corpus end the list." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-81", "text": "Their accuracy scores drift from 0.62 to 0.75." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-82", "text": "It is interesting to see how self w2v goes up from the last to the top, with scores edging between 0.61 and 0.83." 
}, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-83", "text": "This model is trained with the data of each experiment and depends on the size of that dataset which grows significantly (see Table 2 )." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-84", "text": "We see that accuracy values we got here are in line with reports from other similar works such as [6] where they use a dataset of 1032 lyrics from AllMusic to perform content analysis with text features." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-85", "text": "Accuracy scores for movie review polarity prediction are presented in figures 3 and 4." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-86", "text": "Again we see that crawl 840 performs very well." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-87", "text": "Google news is also among the top whereas Twitter models are positioned in the middle of the list." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-88", "text": "Once again self w2v grows considerably, this time from the 3rd place to the top." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-89", "text": "On MR50K it has a discrete margin of more than 0.03 from the 2nd position." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-90", "text": "Again wikigiga models are positioned in the middle of the list and the worst performing models are MoodyCorpus and Text8Corpus." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-91", "text": "Our scores on this task are somehow lower than those reported from various studies that explore advanced deep learning constructs on same dataset." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-92", "text": "In [8] for example, authors who created movie review dataset try on it their probabilistic model that is able to capture semantic similarities between words." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-93", "text": "They report a maximal accuracy of 0.88." 
}, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-94", "text": "A study that uses a very similar method is [16] where authors combine random forest with word vector average values." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-95", "text": "On movie review dataset they achieve an accuracy of 0.84 which is about what we got here." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-96", "text": "----------------------------------" }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-97", "text": "**DISCUSSION**" }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-98", "text": "In this paper we examined the quality of different word embedding models on two sentiment analysis tasks: Lyrics mood recognition and movie review polarity." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-99", "text": "We observed the role of factors like training method, vocabulary and corpus size and thematic relevance of texts." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-100", "text": "According to our results, the best performing models are Common Crawl, Twitter Tweets and Google News." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-101", "text": "In general, models trained with Glove slightly outperform those trained using Skip-gram, especially on lyrics sentiment analysis (Twitter and Crawl)." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-102", "text": "We also notice that vocabulary richness and corpus size have a significant influence on model quality." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-103", "text": "The biggest models like crawl 840 are always among the best." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-104", "text": "Likewise self w2v performs very well on both tasks when trained with medium or large data sizes (see Table 2 )." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-105", "text": "Being the smallest in sizes, MoodyCorpus and Text8Corpus are always the worst." 
}, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-106", "text": "Regarding thematic relevance, Twitter corpora perform better on lyrics sentiment analysis." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-107", "text": "They are large and rich in vocabulary with texts of an informal and sentimental language." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-108", "text": "This language is very similar to the one of song lyrics, with love being the predominant word (see word cloud in [3] )." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-109", "text": "Movie review results on the other hand, are headed by Common Crawl and Google News which are the largest, both in size and vocabulary." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-110", "text": "These models are trained with diverse and informative texts that cover every possible subject or topic." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-111", "text": "Having a look on some movie reviews we also see a similar language with comments about the movies of different categories." }, { "sent_id": "5ed24e18f892d7092c183acab4b175-C001-112", "text": "Furthermore, we saw that when training set is big enough, obtaining word embeddings from it (self w2v) is the best option." } ], "y": { "@BACK@": { "gold_contexts": [ [ "5ed24e18f892d7092c183acab4b175-C001-13" ], [ "5ed24e18f892d7092c183acab4b175-C001-34" ] ], "cite_sentences": [ "5ed24e18f892d7092c183acab4b175-C001-13", "5ed24e18f892d7092c183acab4b175-C001-34" ] } } }, "ABC_2b7267b7b192aeca15c0d10a5f0a4b_46": { "x": [ { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-50", "text": "**SENTIMENT ANALYSIS TASKS**" }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-2", "text": "Word embeddings or distributed representations of words are being used in various applications like machine translation, sentiment analysis, topic identification etc." 
}, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-3", "text": "Quality of word embeddings and performance of their applications depends on several factors like training method, corpus size and relevance etc." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-4", "text": "In this study we compare performance of a dozen of pretrained word embedding models on lyrics sentiment analysis and movie review polarity tasks." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-5", "text": "According to our results, Twitter Tweets is the best on lyrics sentiment analysis, whereas Google News and Common Crawl are the top performers on movie polarity analysis." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-6", "text": "Glove trained models slightly outrun those trained with Skipgram." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-7", "text": "Also, factors like topic relevance and size of corpus significantly impact the quality of the models." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-8", "text": "When medium or large-sized text sets are available, obtaining word embeddings from same training dataset is usually the best choice." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-9", "text": "----------------------------------" }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-11", "text": "Semantic vector space models of language were developed in the 90s to predict joint probabilities of words that appear together in a sequence." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-12", "text": "A particular upturn was proposed by Bengio et al. in [1] , replacing sparse n-gram models with word embeddings which are more compact representations obtained using feed-forward or more advanced neural networks." 
}, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-13", "text": "Recently, high quality and easy to train Skip-gram shallow architectures were presented in [10] and considerably improved in [11] with the introduction of negative sampling and subsampling of frequent words." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-14", "text": "The \"magical\" ability of word embeddings to capture syntactic and semantic regularities on text words is applicable in various applications like machine translations, error correcting systems, sentiment analyzers etc." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-15", "text": "This ability has been tested in [12] and other studies with analogy question tests of the form \"A is to B as C is to \" or male/female relations." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-16", "text": "A recent improved method for generating word embeddings is Glove [15] which makes efficient use of global statistics of text words and preserves the linear substructure of Skip-gram word2vec, the other popular method." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-17", "text": "Authors report that Glove outperforms other methods such as Skip-gram in several tasks like word similarity, word analogy etc." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-18", "text": "In this paper we examine the quality of word embeddings on 2 sentiment analysis tasks: Lyrics mood recognition and movie review polarity analysis." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-19", "text": "We compare various models pretrained with Glove and Skip-gram, together with corpora we train ourself." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-20", "text": "Our goal is to report the best performing models as well as to observe the impact that certain factors like training method, corpus size and thematic relevance of texts might have on model quality." 
}, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-21", "text": "According to the results, Common Crawl, Twitter Tweets and Google News are the best performing models." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-22", "text": "Corpus size and thematic relevance have a significant role on the performance of the generated word vectors." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-49", "text": "----------------------------------" }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-23", "text": "We noticed that models trained with Glove slightly outperform those trained with Skip-gram in most of experiments." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-24", "text": "----------------------------------" }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-25", "text": "**WORD EMBEDDING CORPORA AND MODELS**" }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-26", "text": "In this section we present the different word embedding models that we compare." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-27", "text": "Most of them are pretrained and publicly available." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-28", "text": "Two of them (Text8Corpus and Moody-Corpus) were trained by us." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-29", "text": "The full list with some basic characteristics is presented in Table 1 ." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-30", "text": "Wikipedia Gigaword is a combination of Wikipedia 2014 dump and Gigaword 5 with about 6 billion tokens in total." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-31", "text": "It was created by authors of [15] to evaluate Glove performance." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-32", "text": "Wikipedia Dependency corpus is a collection of 1 billion tokens from Wikipedia." 
}, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-33", "text": "The method used for training it is a modified version of Skip-gram word2vec described in [7] ." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-34", "text": "Google News is one of the biggest and richest text sets with 100 billion tokens and a vocabulary of 3 million words and phrases [10] ." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-35", "text": "It was trained using Skipgram word2vec with negative sampling, windows size 5 and 300 dimensions." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-36", "text": "Even bigger is Common Crawl 840, a huge corpus of 840 billion tokens and 2.2 million word vectors also used at [15] ." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-37", "text": "It contains data of Common Crawl (http://commoncrawl.org), a nonprofit organization that creates and maintains public datasets by crawling the web." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-38", "text": "Common Crawl 42 is a reduced version made up of 42 billion tokens and a vocabulary of 1.9 million words." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-39", "text": "Common Crawl 840 and Common Crawl 42 were trained with Glove method producing vectors of 300 dimensions for each word." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-40", "text": "The last Glove corpus is the collection of Twitter Tweets." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-41", "text": "It consists of 2 billion tweets, 27 billion tokens and 1.2 million words." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-42", "text": "To observe the role of corpus size in quality of generated embeddings, we train and use Text8Corpus, a smaller corpus consisting of 17 million tokens and 25,000 words." 
}, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-43", "text": "The last model we use is MoodyCorpus, a collection of lyrics that followed our work in [3] where we build and evaluate MoodyLyrics, a sentiment annotated dataset of songs." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-44", "text": "The biggest part of MoodyCorpus was built using lyrics of Million Song Dataset (MSD) songs (https://labrosa.ee.columbia.edu/millionsong/)." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-45", "text": "As music tastes and characteristics change over time (http://kaylinwalker.com/50-years-ofpop-music), it is better to have diversified sources of songs in terms of epoch, genre etc." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-46", "text": "Thereby we added songs of different genres and epochs that we found in two subsets of MSD, Cal500 and TheBeatles." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-47", "text": "The resulting corpus of 90 million tokens and 43,000 words can be downloaded from http://softeng.polito.it/erion." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-48", "text": "Further information about public music datasets can be found at [2] ." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-51", "text": "The problem of music mood recognition is about utilizing machine learning, data mining and other techniques to automatically classify songs in 2 or more emotion categories with highest possible accuracy." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-52", "text": "Different combinations of features such as audio or lyrics are involved in the process." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-53", "text": "In this study we make use of song lyrics exploiting the dataset described in [9] (here AM628)." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-54", "text": "The original dataset contains 771 song texts collected from AllMusic portal." 
}, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-55", "text": "AllMusic tags and 3 human experts were used for the annotation of songs." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-56", "text": "We balanced the dataset obtaining 314 positive and 314 negative lyrics." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-57", "text": "We also utilize MoodyLyrics (here ML3K), a dataset of 3,000 mood labeled songs from different genres and epochs described in [3] ." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-58", "text": "Pioneering work in movie review polarity analysis has been conducted by Pang and Lee in [14] and [13] ." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-59", "text": "The authors released sentiment polarity dataset, a collection of 2,000 movie reviews categorized as positive or negative." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-60", "text": "Deep learning techniques and distributed word representations appeared on recent studies like [17] where the role of RNNs (Recurrent Neural Networks), and CNNs (Convolutional Neural Networks) is explored." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-61", "text": "The author reports that CNNs perform best." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-62", "text": "An important work that has relevance here is [8] where authors present an even larger movie review dataset of 50,000 movie reviews from IMBD." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-63", "text": "This dataset has been used in various works such as [5] , [16] etc." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-64", "text": "For our experiments we used a chunk of 10K (MR10K) as well as the full set (MR50K)." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-65", "text": "We first cleaned and tokenized texts of the datasets." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-66", "text": "The dataset of the current run is loaded and a set of unique text words is created." 
}, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-67", "text": "All 14 models are also loaded in the script." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-68", "text": "We train a 15th (self w2v) model using the corpus of the current run and Skip-gram method." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-69", "text": "The script iterates in every line of the pretrained models splitting apart the words and the float vectors and building {word: vec} dictionaries later used as classification feature sets." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-70", "text": "Next we prepare the classification models using tf-idf vectorizer which has been successfully applied in similar studies like [4] ." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-71", "text": "Instead of applying tf-idf in words only as in other text classifiers, we vectorize both word (for semantic relevance) and corresponding vector (for syntactic and contextual relevance)." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-72", "text": "Random forest was used as classifier and 5-fold cross-validation accuracy is computed for each of the models." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-73", "text": "----------------------------------" }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-74", "text": "**RESULTS**" }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-75", "text": "In figures 1 and 2 we see results of 5-fold cross-validation on the 2 lyrics datasets." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-76", "text": "Top three models are crawl 840, twitter 50 and self w2v." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-77", "text": "On AM628 (very smallest dataset), it is crawl 840 (the biggest model) that leads, followed by twitter 50." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-78", "text": "Self w2v is severely penalized by its size and thus is at the bottom." 
}, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-79", "text": "On ML3K (large dataset) self w2v reaches the top of the list, leaving behind twitter 50." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-80", "text": "Wikigiga, google news and dep based are positioned in the middle whereas MoodyCorpus and Text8Corpus end the list." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-81", "text": "Their accuracy scores drift from 0.62 to 0.75." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-82", "text": "It is interesting to see how self w2v goes up from the last to the top, with scores edging between 0.61 and 0.83." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-83", "text": "This model is trained with the data of each experiment and depends on the size of that dataset which grows significantly (see Table 2 )." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-84", "text": "We see that accuracy values we got here are in line with reports from other similar works such as [6] where they use a dataset of 1032 lyrics from AllMusic to perform content analysis with text features." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-85", "text": "Accuracy scores for movie review polarity prediction are presented in figures 3 and 4." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-86", "text": "Again we see that crawl 840 performs very well." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-87", "text": "Google news is also among the top whereas Twitter models are positioned in the middle of the list." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-88", "text": "Once again self w2v grows considerably, this time from the 3rd place to the top." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-89", "text": "On MR50K it has a discrete margin of more than 0.03 from the 2nd position." 
}, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-90", "text": "Again wikigiga models are positioned in the middle of the list and the worst performing models are MoodyCorpus and Text8Corpus." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-91", "text": "Our scores on this task are somehow lower than those reported from various studies that explore advanced deep learning constructs on same dataset." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-92", "text": "In [8] for example, authors who created movie review dataset try on it their probabilistic model that is able to capture semantic similarities between words." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-93", "text": "They report a maximal accuracy of 0.88." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-94", "text": "A study that uses a very similar method is [16] where authors combine random forest with word vector average values." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-95", "text": "On movie review dataset they achieve an accuracy of 0.84 which is about what we got here." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-96", "text": "----------------------------------" }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-97", "text": "**DISCUSSION**" }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-98", "text": "In this paper we examined the quality of different word embedding models on two sentiment analysis tasks: Lyrics mood recognition and movie review polarity." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-99", "text": "We observed the role of factors like training method, vocabulary and corpus size and thematic relevance of texts." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-100", "text": "According to our results, the best performing models are Common Crawl, Twitter Tweets and Google News." 
}, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-101", "text": "In general, models trained with Glove slightly outperform those trained using Skip-gram, especially on lyrics sentiment analysis (Twitter and Crawl)." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-102", "text": "We also notice that vocabulary richness and corpus size have a significant influence on model quality." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-103", "text": "The biggest models like crawl 840 are always among the best." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-104", "text": "Likewise self w2v performs very well on both tasks when trained with medium or large data sizes (see Table 2 )." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-105", "text": "Being the smallest in sizes, MoodyCorpus and Text8Corpus are always the worst." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-106", "text": "Regarding thematic relevance, Twitter corpora perform better on lyrics sentiment analysis." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-107", "text": "They are large and rich in vocabulary with texts of an informal and sentimental language." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-108", "text": "This language is very similar to the one of song lyrics, with love being the predominant word (see word cloud in [3] )." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-109", "text": "Movie review results on the other hand, are headed by Common Crawl and Google News which are the largest, both in size and vocabulary." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-110", "text": "These models are trained with diverse and informative texts that cover every possible subject or topic." }, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-111", "text": "Having a look on some movie reviews we also see a similar language with comments about the movies of different categories." 
}, { "sent_id": "2b7267b7b192aeca15c0d10a5f0a4b-C001-112", "text": "Furthermore, we saw that when training set is big enough, obtaining word embeddings from it (self w2v) is the best option." } ], "y": { "@BACK@": { "gold_contexts": [ [ "2b7267b7b192aeca15c0d10a5f0a4b-C001-62" ], [ "2b7267b7b192aeca15c0d10a5f0a4b-C001-92" ] ], "cite_sentences": [ "2b7267b7b192aeca15c0d10a5f0a4b-C001-62", "2b7267b7b192aeca15c0d10a5f0a4b-C001-92" ] }, "@DIF@": { "gold_contexts": [ [ "2b7267b7b192aeca15c0d10a5f0a4b-C001-91", "2b7267b7b192aeca15c0d10a5f0a4b-C001-92" ] ], "cite_sentences": [ "2b7267b7b192aeca15c0d10a5f0a4b-C001-92" ] } } }, "ABC_debdaa202ebd856991e09e5e00a12b_46": { "x": [ { "sent_id": "debdaa202ebd856991e09e5e00a12b-C001-19", "text": "**SUBTASK 1**" }, { "sent_id": "debdaa202ebd856991e09e5e00a12b-C001-20", "text": "For Subtask 1, a lexicon-based approach was followed." }, { "sent_id": "debdaa202ebd856991e09e5e00a12b-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "debdaa202ebd856991e09e5e00a12b-C001-2", "text": "Analyzing social media posts can offer insights into a wide range of topics that are commonly discussed online, providing valuable information for studying various healthrelated phenomena reported online." }, { "sent_id": "debdaa202ebd856991e09e5e00a12b-C001-3", "text": "The outcome of this work can offer insights into pharmacovigilance research to monitor the adverse effects of medications." }, { "sent_id": "debdaa202ebd856991e09e5e00a12b-C001-4", "text": "This research specifically looks into mentions of adverse drug reactions (ADRs) in Twitter data through the Social Media Mining for Health Applications (SMM4H) Shared Task 2019." }, { "sent_id": "debdaa202ebd856991e09e5e00a12b-C001-5", "text": "Adverse drug reactions are undesired harmful effects which can arise from medication or other methods of treatment." 
}, { "sent_id": "debdaa202ebd856991e09e5e00a12b-C001-6", "text": "The goal of this research is to build accurate models using natural language processing techniques to detect reports of adverse drug reactions in Twitter data and extract these words or phrases." }, { "sent_id": "debdaa202ebd856991e09e5e00a12b-C001-7", "text": "----------------------------------" }, { "sent_id": "debdaa202ebd856991e09e5e00a12b-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "debdaa202ebd856991e09e5e00a12b-C001-9", "text": "On average, one in a thousand messages from public Twitter data is health-related (Sadilek et al., 2012) ." }, { "sent_id": "debdaa202ebd856991e09e5e00a12b-C001-10", "text": "These health-related Twitter posts can be used to monitor and analyze various health-related phenomena such as drug use and side effects resulting from medication." }, { "sent_id": "debdaa202ebd856991e09e5e00a12b-C001-11", "text": "The purpose of this work was to develop a model to accurately analyze mentions of adverse drug reactions (ADRs) in Twitter posts." }, { "sent_id": "debdaa202ebd856991e09e5e00a12b-C001-12", "text": "To achieve this task, natural language processing techniques were used to predict whether each Tweet from a given set of Tweets contains a mention of an ADR and extract any mentions of ADRs." }, { "sent_id": "debdaa202ebd856991e09e5e00a12b-C001-13", "text": "The results of this project can be useful for research done in the field of pharmacovigilance, which is the monitoring of drug effects with the intention of finding and preventing adverse effects." }, { "sent_id": "debdaa202ebd856991e09e5e00a12b-C001-14", "text": "This work was conducted as part of the Social Media Mining for Health (SMM4H) challenge hosted by the Health Language Processing (HLP) Lab at the University of Pennsylvania." 
}, { "sent_id": "debdaa202ebd856991e09e5e00a12b-C001-15", "text": "The predictions of the models developed for this project were evaluated against test data and given F-scores as well as scores of accuracy, precision, and recall based on the degree to which they were able to accomplish the goals of each task." }, { "sent_id": "debdaa202ebd856991e09e5e00a12b-C001-16", "text": "----------------------------------" }, { "sent_id": "debdaa202ebd856991e09e5e00a12b-C001-17", "text": "**METHODS**" }, { "sent_id": "debdaa202ebd856991e09e5e00a12b-C001-18", "text": "----------------------------------" }, { "sent_id": "debdaa202ebd856991e09e5e00a12b-C001-42", "text": "In the Named Entity Recognition task, we utilized a deep learning approach, given the demonstrated effectiveness of such an architecture in this domain (Lample et al., 2016) ." }, { "sent_id": "debdaa202ebd856991e09e5e00a12b-C001-43", "text": "We expect to improve the performance of our systems through further refinement of our feature engineering and tuning of our model parameters." }, { "sent_id": "debdaa202ebd856991e09e5e00a12b-C001-21", "text": "To identify important keywords -keywords whose presence or absence in a Tweet can serve as valuable, reliable indicators of whether the Tweet contains a reference to an Adverse Drug Reaction or not -a methodology adapted from the Internal + External Lexicon Selection technique (Rawal et al., 2019) , a technique that has yielded successful results in previous similar classification tasks, was used." }, { "sent_id": "debdaa202ebd856991e09e5e00a12b-C001-22", "text": "First, uni-and bi-grams were extracted from the training dataset." }, { "sent_id": "debdaa202ebd856991e09e5e00a12b-C001-23", "text": "The presence or absence of each of these n-grams were then used as binary features in a logistic regression model." 
}, { "sent_id": "debdaa202ebd856991e09e5e00a12b-C001-24", "text": "To estimate the performance of the model using metrics that were to be used for evaluation, such as precision, recall, and F1-score, the model was trained via 10-fold cross-validation of the training set." }, { "sent_id": "debdaa202ebd856991e09e5e00a12b-C001-25", "text": "Finally, the coefficients associated with each keyword were examined." }, { "sent_id": "debdaa202ebd856991e09e5e00a12b-C001-26", "text": "There were 166,466 total features obtained through the aforementioned technique." }, { "sent_id": "debdaa202ebd856991e09e5e00a12b-C001-27", "text": "Through this process, the top 700 absolute-valued coefficients were hypothesized to be the most significant keywords and stored." }, { "sent_id": "debdaa202ebd856991e09e5e00a12b-C001-28", "text": "This number of top keywords to keep was a hyperparameter that was experimentally determined through model performance over 10-fold cross-validation of the training set." }, { "sent_id": "debdaa202ebd856991e09e5e00a12b-C001-29", "text": "This list of significant keywords was then manually pared down to exclude any intuitively irrelevant terms (such as stop words); the presence or absence of these remaining keywords were used as binary features for our final logistic regression model." }, { "sent_id": "debdaa202ebd856991e09e5e00a12b-C001-30", "text": "Other mod-els were also tested during training, such as a BioBERT (Lee et al., 2019) model that was finetuned using the provided training data." }, { "sent_id": "debdaa202ebd856991e09e5e00a12b-C001-31", "text": "Although the BioBERT model showed promising results, it was not implemented into the final submission due to time constraints." 
}, { "sent_id": "debdaa202ebd856991e09e5e00a12b-C001-32", "text": "----------------------------------" }, { "sent_id": "debdaa202ebd856991e09e5e00a12b-C001-33", "text": "**SUBTASK 2**" }, { "sent_id": "debdaa202ebd856991e09e5e00a12b-C001-34", "text": "For Subtask 2, a deep learning approach was taken." }, { "sent_id": "debdaa202ebd856991e09e5e00a12b-C001-35", "text": "Specifically, a Bidirectional Long Short-Term Memory (BiLSTM) coupled with a Conditional Random Field (CRF) layer neural network architecture was used to perform Named Entity Recognition to identify the Adverse Drug Reaction mentions." }, { "sent_id": "debdaa202ebd856991e09e5e00a12b-C001-36", "text": "This architecture has been empirically shown to perform well at Named Entity Recognition (NER) tasks (Lample et al., 2016) ." }, { "sent_id": "debdaa202ebd856991e09e5e00a12b-C001-37", "text": "To represent input words, the Embedding layer weights of the model was pre-initialized with values obtained from a word2vec model that was trained on the MIMIC-III dataset (Johnson et al., 2016) ." }, { "sent_id": "debdaa202ebd856991e09e5e00a12b-C001-38", "text": "----------------------------------" }, { "sent_id": "debdaa202ebd856991e09e5e00a12b-C001-39", "text": "**CONCLUSION**" }, { "sent_id": "debdaa202ebd856991e09e5e00a12b-C001-40", "text": "Overall, our systems for Tasks 1 and 2 consisted of a combination of (1) lexicon selection and domain-specific feature engineering; (2) classical machine learning techniques such as logistic regression; and (3) neural architectures, including BioBERT and BiLSTM-CRF models." }, { "sent_id": "debdaa202ebd856991e09e5e00a12b-C001-41", "text": "We found simpler models consisting of lexicon selection and classical machine learning models (such as the logistic regression model discussed previously) performed better with limited datasets and offered explainability into feature importance." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "debdaa202ebd856991e09e5e00a12b-C001-36" ] ], "cite_sentences": [ "debdaa202ebd856991e09e5e00a12b-C001-36" ] }, "@MOT@": { "gold_contexts": [ [ "debdaa202ebd856991e09e5e00a12b-C001-42" ] ], "cite_sentences": [ "debdaa202ebd856991e09e5e00a12b-C001-42" ] }, "@USE@": { "gold_contexts": [ [ "debdaa202ebd856991e09e5e00a12b-C001-42" ] ], "cite_sentences": [ "debdaa202ebd856991e09e5e00a12b-C001-42" ] } } }, "ABC_b6c33fbb73cbf0af580e8dd14dc59a_47": { "x": [ { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-2", "text": "Generative models for text have substantially contributed to tasks like machine translation and language modeling, using maximum likelihood optimization (MLE)." }, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-3", "text": "However, for creative text generation, where multiple outputs are possible and originality and uniqueness are encouraged, MLE falls short." }, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-4", "text": "Methods optimized for MLE lead to outputs that can be generic, repetitive and incoherent." }, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-5", "text": "In this work, we use a Generative Adversarial Network framework to alleviate this problem." }, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-6", "text": "We evaluate our framework on poetry, lyrics and metaphor datasets, each with widely different characteristics, and report better performance of our objective function over other generative models." }, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-7", "text": "----------------------------------" }, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-8", "text": "**INTRODUCTION AND RELATED WORK**" }, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-9", "text": "Language models can be optimized to recognize syntax and semantics with great accuracy [1] ." 
}, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-10", "text": "However, the output generated can be repetitive and generic leading to monotonous or uninteresting responses (e.g \"I don't know\") regardless of the input [2] ." }, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-11", "text": "While application of attention [3, 4] and advanced decoding mechanisms like beam search and variation sampling [5] have shown improvements, it does not solve the underlying problem." }, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-12", "text": "In creative text generation, the objective is not strongly bound to the ground truth-instead the objective is to generate diverse, unique or original samples." }, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-13", "text": "We attempt to do this through a discriminator which can give feedback to the generative model through a cost function that encourages sampling of creative tokens." }, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-14", "text": "The contributions of this paper are in the usage of a GAN framework to generate creative pieces of writing." }, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-15", "text": "Our experiments suggest that generative text models, while very good at encapsulating semantic, syntactic and domain information, perform better with external feedback from a discriminator for fine-tuning objectiveless decoding tasks like that of creative text." }, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-16", "text": "We show this by evaluating our model on three very different creative datasets containing poetry, metaphors and lyrics." }, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-17", "text": "Previous work on handling the shortcomings of MLE include length-normalizing sentence probability [6] , future cost estimation [7] , diversity-boosting objective function [8, 2] or penalizing repeating tokens [9] ." 
}, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-18", "text": "When it comes to poetry generation using generative text models, Zhang and Lapata [10] , Yi et al. [11] and Wang et al. [12] use language modeling to generate Chinese poems." }, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-19", "text": "However, none of these methods provide feedback on the quality of the generated sample and hence, do not address the qualitative objective required for creative decoding." }, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-20", "text": "For the task of text generation, MaskGAN [13] uses a Reinforcement Learning signal from the discriminator, FMD-GAN [14] uses an optimal transport mechanism as an objective function." }, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-21", "text": "GumbelGAN [15] uses Gumbel-Softmax distribution that replaces the non-differentiable sample from a categorical distribution with a differentiable sample to propagate stronger gradients." }, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-22", "text": "Li et al. [2] use a discriminator for a diversity promoting objective." }, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-23", "text": "Yu et al. [16] use SeqGAN to generate poetry and comment on the performance of SeqGAN over MLE in human evaluations, encouraging our study of GANs for creative text generation." }, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-24", "text": "However, these studies do not focus solely on creative text." }, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-25", "text": "Using GANs, we can train generative models in a two-player game setting between a discriminator and a generator, where the discriminator (a binary classifier) learns to distinguish between real and fake data samples and the generator tries to fool the discriminator by generating authentic and high quality output [17] ." 
}, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-26", "text": "GANs have shown to be successful in image generation tasks [18] and recently, some progress has been observed in text generation [14, 13, 16] ." }, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-27", "text": "Our generator is a language model trained using backpropagation through time [19] ." }, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-28", "text": "During the pre-training phase we optimize for MLE and during the GAN training phase, we optimize on the creativity reward from the discriminator." }, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-29", "text": "The discriminator's encoder has the same architecture as the generator encoder module with the addition of a pooled decoder layer." }, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-30", "text": "The decoder contains 3 [DenseBatchN ormalization, ReLU ] blocks and an addtional Sigmoid layer." }, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-31", "text": "The discriminator decoder takes the hidden state at the last time step of a sequence concatenated with both the max-pooled and mean-pooled representation of the hidden states [20] and outputs a number in the range [0, 1]." }, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-32", "text": "The difficulty of using GANs in text generation comes from the discrete nature of text, making the model non-differentiable hence, we update parameters for the generator model with policy gradients as described in Yu [16] ." }, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-33", "text": "We utilize AWD-LSTM [21] and TransformerXL [22] based language models." }, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-34", "text": "For model hyperparameters please to refer to Supplementary Section Table 2 ." }, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-35", "text": "We use Adam optimizer [23] with \u03b21 = 0.7 and \u03b22 = 0.8 similar to [20] and use a batch size of 50." 
}, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-36", "text": "Other practices for LM training were the same as [22] and [21] for Transformer-XL and AWD-LSTM respectively." }, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-37", "text": "We refer to our proposed GAN as Creative-GAN and compare it to a baseline (a language model equivalent to our pre-trained generator) and a GumbelGAN model [15] across all proposed datasets." }, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-38", "text": "We use three creative English datasets with distinct linguistic characteristics: (1) A corpus of 740 classical and contemporary English poems, (2) a corpus of 14950 metaphor sentences retrieved from a metaphor database website 1 and (3) a corpus of 1500 song lyrics ranging across genres." }, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-39", "text": "The mix of linguistic styles within this corpus offers the potential for interesting variation during the generation phase." }, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-40", "text": "We use the same pre-processing as in earlier work [20, 24] ." }, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-41", "text": "We reserve 10% of our data for test set and another 10% for our validation set." }, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-42", "text": "We first pre-train our generator on the Gutenberg dataset [25] for 20 epochs and then fine-tune [20] them to our target datasets with a language modeling objective." }, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-43", "text": "The discriminator's encoder is initialized to the same weights as our fine-tuned language model." }, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-44", "text": "Once we have our fine-tuned encoders for each target dataset, we train in an adversarial manner." }, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-45", "text": "The discriminator objective here is to score the quality of the creative text." 
}, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-46", "text": "The discriminator is trained for 3 iterations for every iteration of the generator, a practice seen in previous work [26] ." }, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-47", "text": "Creative-GAN relies on using the reward from the discriminator [13, 16] for backpropagation." }, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-48", "text": "We follow a similar training procedure for GumbelGAN." }, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-49", "text": "Outputs are generated through sampling over a multinomial distribution for all methods, instead of argmax on the log-likelihood probabilities, as sampling has shown to produce better output quality [5] . Please refer to Supplementary Section Table 3 for training parameters of each dataset and Table 2 for hyperparameters of each encoder." }, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-50", "text": "We pick these values after experimentation with our validation set." }, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-51", "text": "Training and output generation code can be found online 2 ." }, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-52", "text": "----------------------------------" }, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-53", "text": "**EVALUATION AND CONCLUSION**" }, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-54", "text": "Evaluating creative generation tasks is both critical and complex [27] ." }, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-55", "text": "Along the lines of previous research on evaluating text generation tasks [27] , we report the perplexity scores of our test set on the evaluated models in the Supplementary Section, Table 1 Our model shows improvements over baseline and GumbelGAN." 
}, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-56", "text": "Common computational methods like BLEU [28] and perplexity are at best a heuristic and not strong indicators of good performance in text generation models [29] ." }, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-57", "text": "Particularly, since these scores use target sequences as a reference, it has the same pitfalls as relying on MLE." }, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-58", "text": "The advantages in this approach lie in the discriminator's ability to influence the generator to explore other possibilities." }, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-59", "text": "Sample outputs for our model can be found on our website 3 ." }, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-60", "text": "----------------------------------" }, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-61", "text": "**SUPPLEMENTARY MATERIAL**" }, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-62", "text": "In this section, we report our results on computational metrics, hyperparameters and training configurations for our models." }, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-63", "text": "Table 1 shows the results of the perplexity score evaluation of the evaluated models, Table 2 shows hyperparameters for each encoding method and Table 3 shows our training parameters." 
}, { "sent_id": "b6c33fbb73cbf0af580e8dd14dc59a-C001-64", "text": "In Table 3 Table 3 : Training Parameters" } ], "y": { "@BACK@": { "gold_contexts": [ [ "b6c33fbb73cbf0af580e8dd14dc59a-C001-20" ], [ "b6c33fbb73cbf0af580e8dd14dc59a-C001-26" ] ], "cite_sentences": [ "b6c33fbb73cbf0af580e8dd14dc59a-C001-20", "b6c33fbb73cbf0af580e8dd14dc59a-C001-26" ] } } }, "ABC_19b647ab74d28b59b7df2be729b2d7_47": { "x": [ { "sent_id": "19b647ab74d28b59b7df2be729b2d7-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "19b647ab74d28b59b7df2be729b2d7-C001-2", "text": "Abstract In order to build dialogue systems to tackle the ambitious task of holding social conversations, we argue that we need a data-driven approach that includes insight into human conversational \"chit-chat\", and which incorporates different natural language processing modules." }, { "sent_id": "19b647ab74d28b59b7df2be729b2d7-C001-3", "text": "Our strategy is to analyze and index large corpora of social media data, including Twitter conversations, online debates, dialogues between friends, and blog posts, and then to couple this data retrieval with modules that perform tasks such as sentiment and style analysis, topic modeling, and summarization." }, { "sent_id": "19b647ab74d28b59b7df2be729b2d7-C001-4", "text": "We aim for personal assistants that can learn more nuanced human language, and to grow from task-oriented agents to more personable social bots." }, { "sent_id": "19b647ab74d28b59b7df2be729b2d7-C001-5", "text": "----------------------------------" }, { "sent_id": "19b647ab74d28b59b7df2be729b2d7-C001-6", "text": "**FROM TASK-ORIENTED AGENTS TO SOCIAL BOTS**" }, { "sent_id": "19b647ab74d28b59b7df2be729b2d7-C001-7", "text": "Devices like the Amazon Echo and Google Home have entered our homes to perform task-oriented functions, such as looking up today's headlines and setting reminders [1, 2] ." 
}, { "sent_id": "19b647ab74d28b59b7df2be729b2d7-C001-8", "text": "As these devices evolve, we have begun to expect social conversation, where the device must learn to personalize and produce natural language style." }, { "sent_id": "19b647ab74d28b59b7df2be729b2d7-C001-9", "text": "Social conversation is not explicitly goal-driven in the same way as task-oriented dialogue." }, { "sent_id": "19b647ab74d28b59b7df2be729b2d7-C001-10", "text": "Many dialogue systems in both the written and spoken medium have been developed for task-oriented agents with an explicit goal of restaurant information retrieval, booking a flight, diagnosing an IT issue, or providing automotive customer support [6, 16, 23, 24, 4, 21] ." }, { "sent_id": "19b647ab74d28b59b7df2be729b2d7-C001-11", "text": "These tasks often revolve around question answering, with little \"chit-chat\"." }, { "sent_id": "19b647ab74d28b59b7df2be729b2d7-C001-12", "text": "Templates are often used for generation and state tracking, but since they are optimized for the task at hand, the conversation can either become stale, or maintaining a conversation requires the intractable task of manually authoring many different social interactions that can be used in a particular context." }, { "sent_id": "19b647ab74d28b59b7df2be729b2d7-C001-13", "text": "We argue that a social agent should be spontaneous, and allow for human-friendly conversations that do not follow a perfectly-defined trajectory." }, { "sent_id": "19b647ab74d28b59b7df2be729b2d7-C001-14", "text": "In order to build such a conversational dialogue system, we exploit the abundance of human-human social media conversations, and develop methods informed by natural language processing modules that model, analyze, and generate utterances that better suit the context." 
}, { "sent_id": "19b647ab74d28b59b7df2be729b2d7-C001-15", "text": "----------------------------------" }, { "sent_id": "19b647ab74d28b59b7df2be729b2d7-C001-16", "text": "**DATA-DRIVEN MODELS OF HUMAN LANGUAGE**" }, { "sent_id": "19b647ab74d28b59b7df2be729b2d7-C001-17", "text": "A myriad of social media data has led to the development of new techniques for language understanding from open domain conversations, and many corpora are available for building data-driven dialogue systems [17, 18] ." }, { "sent_id": "19b647ab74d28b59b7df2be729b2d7-C001-18", "text": "While there are differences between how people speak in person and in an online text-based environment, the social agents we build should not be limited in their language; they should be exposed to many different styles and vocabularies." }, { "sent_id": "19b647ab74d28b59b7df2be729b2d7-C001-19", "text": "Online conversations can be repurposed in new dialogues, but only if they can be properly indexed or adapted to the context." }, { "sent_id": "19b647ab74d28b59b7df2be729b2d7-C001-20", "text": "Data retrieval algorithms have been successfully employed to co-construct an unfolding narrative between the user and computer [20] , and re-use existing conversations [5] ." }, { "sent_id": "19b647ab74d28b59b7df2be729b2d7-C001-21", "text": "Other approaches train on such conversations to analyze sequence and word patterns, but lack detailed annotations and analysis, such as emotion and humor [19, 22, 8] ." }, { "sent_id": "19b647ab74d28b59b7df2be729b2d7-C001-22", "text": "The large Ubuntu Dialogue Corpus [9] with over 7 million utterances is large enough to train neural network models [7, 10] ." }, { "sent_id": "19b647ab74d28b59b7df2be729b2d7-C001-23", "text": "We argue that combining data-driven retrieval with modules for sentiment analysis and style, topic analysis, summarization, paraphrasing, and rephrasing will allow for more human-like social conversation." 
}, { "sent_id": "19b647ab74d28b59b7df2be729b2d7-C001-24", "text": "This requires that data be indexed based on domain and requirement, and then retrieve candidate utterances based on dialogue state and context." }, { "sent_id": "19b647ab74d28b59b7df2be729b2d7-C001-25", "text": "Likewise, in order to avoid stale and repetitive utterances, we can alter and repurpose the candidate utterances; for example, we can use paraphrase or summarization to create new ways of saying the same thing, or to select utterance candidates according to the desired sentiment [12, 13] ." }, { "sent_id": "19b647ab74d28b59b7df2be729b2d7-C001-26", "text": "The style of an utterance can be altered based on requirements; introducing elements of sarcasm, or aspects of factual and emotional argumentation styles [15, 14] ." }, { "sent_id": "19b647ab74d28b59b7df2be729b2d7-C001-27", "text": "Changes in the perceived speaker personality can also make more personable conversations [11] ." }, { "sent_id": "19b647ab74d28b59b7df2be729b2d7-C001-28", "text": "Even utterances from monologic texts can be leveraged by converting the content to dialogic flow, and performing stylistic transformations [3] ." }, { "sent_id": "19b647ab74d28b59b7df2be729b2d7-C001-29", "text": "Of course, while many data sources may be of interest for indexing knowledge for a dialogue system, annotations are not always available or easy to obtain." }, { "sent_id": "19b647ab74d28b59b7df2be729b2d7-C001-30", "text": "By using machine learning models designed to classify different classes of interest, such as sentiment, sarcasm, and topic, data can be bootstrapped to greatly increase the amount of data available for indexing and utterance selection [15] ." }, { "sent_id": "19b647ab74d28b59b7df2be729b2d7-C001-31", "text": "There is no shortage of human generated dialogues, but the challenge is to analyze and harness them appropriately for social-dialogue generation." 
}, { "sent_id": "19b647ab74d28b59b7df2be729b2d7-C001-32", "text": "We aim to combine data-driven methods to repurpose existing social media dialogues, in addition to a suite of tools for sentiment analysis, topic identification, summarization, paraphrase, and rephrasing, to develop a socially-adept agent that can carry on a natural conversation." } ], "y": { "@BACK@": { "gold_contexts": [ [ "19b647ab74d28b59b7df2be729b2d7-C001-26" ], [ "19b647ab74d28b59b7df2be729b2d7-C001-30" ] ], "cite_sentences": [ "19b647ab74d28b59b7df2be729b2d7-C001-26", "19b647ab74d28b59b7df2be729b2d7-C001-30" ] } } }, "ABC_38ad38f25a2823c64cd16bc9f2af93_47": { "x": [ { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-2", "text": "Pour la documentation des langues, la transcription est un processus tr\u00e8s co\u00fbteux : une minute d'enregistrement n\u00e9cessiterait environ une heure et demie de travail pour un linguiste (Austin and Sallabank, 2013) ." }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-3", "text": "R\u00e9cemment, la collecte de traductions (dans des langues bien document\u00e9es) align\u00e9es aux enregistrements est devenue une solution populaire pour garantir l'interpr\u00e9tabilit\u00e9 des enregistrements (Adda et al., 2016) et aider \u00e0 leur traitement automatique." }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-4", "text": "Dans cet article, nous \u00e9tudions l'impact de la langue de traduction sur les approches automatiques en documentation des langues." }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-5", "text": "Nous traduisons un corpus parall\u00e8le bilingue Mboshi-Fran\u00e7ais (Godard et al., 2017) dans quatre autres langues, et \u00e9valuons l'impact de la langue de traduction sur une t\u00e2che de segmentation en mots non supervis\u00e9e." 
}, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-6", "text": "Nos r\u00e9sultats sugg\u00e8rent que la langue de traduction peut influencer l\u00e9g\u00e8rement la qualit\u00e9 de segmentation." }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-7", "text": "Cependant, combiner l'information apprise par diff\u00e9rents mod\u00e8les bilingues nous permet d'am\u00e9liorer ces r\u00e9sultats de mani\u00e8re marginale." }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-8", "text": "For language documentation initiatives, transcription is an expensive resource: one minute of audio is estimated to take one hour and a half on average of a linguist's work (Austin and Sallabank, 2013) ." }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-9", "text": "Recently, collecting aligned translations in well-resourced languages became a popular solution for ensuring posterior interpretability of the recordings (Adda et al., 2016) ." }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-10", "text": "In this paper we investigate language-related impact in automatic approaches for computational language documentation." }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-11", "text": "We translate the bilingual Mboshi-French parallel corpus (Godard et al., 2017) into four other languages, and we perform bilingual-rooted unsupervised word discovery." }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-12", "text": "Our results hint towards an impact of the well-resourced language in the quality of the output." }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-13", "text": "However, by combining the information learned by different bilingual models, we are only able to marginally increase the quality of the segmentation." }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-14", "text": "MOTS-CL\u00c9S : d\u00e9couverte non supervis\u00e9e du lexique, documentation des langues, approches multilingues." 
}, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-15", "text": "----------------------------------" }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-16", "text": "**INTRODUCTION**" }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-17", "text": "The Cambridge Handbook of Endangered Languages (Austin and Sallabank, 2011) estimates that at least half of the 7,000 languages currently spoken worldwide will no longer exist by the end of this century." }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-18", "text": "For these endangered languages, data collection campaigns have to accommodate the challenge that many of them are from oral tradition, and producing transcriptions is costly." }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-19", "text": "This transcription bottleneck problem can be handled by translating into a widely spoken language to ensure subsequent interpretability of the collected recordings, and such parallel corpora have been recently created by aligning the collected audio with translations in a well-resourced language (Adda et al., 2016; Godard et al., 2017; Boito et al., 2018) ." }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-20", "text": "Moreover, some linguists suggested that more than one translation should be collected to capture deeper layers of meaning (Evans and Sasse, 2004) ." }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-21", "text": "This work is a contribution to the Computational Language Documentation (CLD) research field, that aims to replace part of the manual steps performed by linguists during language documentation initiatives by automatic approaches." }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-22", "text": "Here we investigate the unsupervised word discovery and segmentation task, using the bilingual-rooted approach from Godard et al. (2018) ." 
}, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-23", "text": "There, words in the well-resourced language are aligned to unsegmented phonemes in the endangered language in order to identify group of phonemes, and to cluster them into word-like units." }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-24", "text": "We experiment with the Mboshi-French parallel corpus, translating the French text into four other well-resourced languages in order to investigate language impact in this CLD approach." }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-25", "text": "Our results hint that this language impact exists, and that models based on different languages will output different word-like units." }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-26", "text": "----------------------------------" }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-27", "text": "**METHODOLOGY**" }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-28", "text": "The Multilingual Mboshi Parallel Corpus: In this work we extend the bilingual Mboshi-French parallel corpus (Godard et al., 2017) , fruit of the documentation process of Mboshi (Bantu C25), an endangered language spoken in Congo-Brazzaville." }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-29", "text": "The corpus contains 5,130 utterances, for which it provides audio, transcriptions and translations in French." }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-30", "text": "We translate the French into four other well-resourced languages through the use of the DeepL translator." }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-31", "text": "1 The languages added to the dataset are: English, German, Portuguese and Spanish." }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-32", "text": "Table 1 shows some statistics for the produced Multilingual Mboshi parallel corpus." 
}, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-33", "text": "2 Bilingual Unsupervised Word Segmentation/Discovery Approach: We use the bilingual neuralbased Unsupervised Word Segmentation (UWS) approach from Godard et al. (2018) to discover words in Mboshi." }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-34", "text": "In this approach, Neural Machine Translation (NMT) models are trained between language pairs, using as source language the translation (word-level) and as target, the language to document (unsegmented phonemic sequence)." }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-35", "text": "Due to the attention mechanism present in these networks (Bahdanau et al., 2014) , posterior to training, it is possible to retrieve soft-alignment probability matrices between source and target sequences." }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-36", "text": "These matrices give us sentence-level source-to-target alignment information, and by using it for clustering neighbor phonemes aligned to the same translation word, we are able to create segmentation in the target side." }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-37", "text": "The product of this approach is a set of (discovered-units, translation words) pairs." }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-38", "text": "Multilingual Leveraging: In this work we apply two simple methods for including multilingual information into the bilingual models from Godard et al. (2018) ." }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-39", "text": "The first one, Multilingual Voting, consists of merging the information learned by models trained with different language pairs by performing a voting over the final discovered boundaries." }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-40", "text": "The voting is performed by applying an agreement threshold T over the output boundaries." 
}, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-41", "text": "This threshold balances between accepting all boundaries from all the bilingual models (zero agreement) and accepting only input boundaries discovered by all these models (total agreement)." }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-42", "text": "The second method is ANE Selection." }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-43", "text": "For every language pair and aligned sentence in the dataset, a soft-alignment probability matrix is generated." }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-44", "text": "We use Average Normalized Entropy (ANE) (Boito et al., 2019a) computed over these matrices for selecting the most confident one for segmenting each phoneme sequence." }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-45", "text": "This exploits the idea that models trained on different language pairs will have language-related behavior, thus differing on the resulting alignment and segmentation over the same phoneme sequence." }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-46", "text": "----------------------------------" }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-47", "text": "**EXPERIMENTS**" }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-48", "text": "The experiment settings from this paper and evaluation protocol for the Mboshi corpus (Boundary F-scores using the ZRC speech reference) are the same from Boito et al. (2019a) ." }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-49", "text": "Table 2 presents the results for bilingual UWS and multilingual leveraging." }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-50", "text": "For the former, we reach our best result by using as aligned information the French, the original aligned language for this dataset." }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-51", "text": "Languages closely related to French (Spanish and Portuguese) ranked better, while our worst result used German." 
}, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-52", "text": "English also performs notably well in our experiments." }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-53", "text": "We believe this is due to the statistics features of the resulting text." }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-54", "text": "We observe in Table 1 that the English portion of the dataset contains the smallest vocabulary among all languages." }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-55", "text": "Since we train our systems in very low-resource settings, vocabularyrelated features can impact greatly the system's capacity to language-model, and consequently the final quality of the produced alignments." }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-56", "text": "Even in high-resource settings, it was already attested that some languages are more difficult to model than others (Cotterell et al., 2018) ." }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-57", "text": "For the multilingual selection experiments, we experimented combining the languages from top to bottom as they appear Table 2 (ranked by performance; e.g. 1-3 means the combination of FR(1), Table 3 : Top 10 confident (discovered type, translation) pairs for the five bilingual models." }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-58", "text": "The \"+\" mark means the discovered type is a concatenation of two existing true types." }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-59", "text": "EN(2) and PT (3))." }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-60", "text": "We observe that the performance improvement is smaller than the one observed in previous work (Boito et al., 2019b) , which we attribute to the fact that our dataset was artificially augmented." }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-61", "text": "This could result in the available multilingual form of supervision not being as rich as in a manually generated dataset." 
}, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-62", "text": "Finally, the best boundary segmentation result is obtained by performing multilingual voting with all the languages and an agreement of 50%, which indicates that the information learned by different languages will provide additional complementary evidence." }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-63", "text": "Lastly, following the methodology from Boito et al. (2019a) , we extract the most confident alignments (in terms of ANE) discovered by the bilingual models." }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-64", "text": "Table 3 presents the top 10 most confident (discovered type, translation) pairs." }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-65", "text": "3 Looking at the pairs the bilingual models are most confident about, we observe there are some types discovered by all the bilingual models (e.g. Mboshi word itua, and the concatenation obo\u00e1+ng\u00e1)." }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-66", "text": "However, the models still differ for most of their alignments in the table." }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-67", "text": "This hints that while a portion of the lexicon might be captured independently of the language used, other structures might be more dependent of the chosen language." }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-68", "text": "On this note, Haspelmath (2011) suggests the notion of word cannot always be meaningfully defined cross-linguistically." }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-69", "text": "----------------------------------" }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-70", "text": "**CONCLUSION**" }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-71", "text": "In this work we train bilingual UWS models using the endangered language Mboshi as target and different well-resourced languages as aligned information." 
}, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-72", "text": "Results show that similar languages rank better in terms of segmentation performance, and that by combining the information learned by different models, segmentation is further improved." }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-73", "text": "This might be due to the different languagedependent structures that are captured by using more than one language." }, { "sent_id": "38ad38f25a2823c64cd16bc9f2af93-C001-74", "text": "Lastly, we extend the bilingual Mboshi-French parallel corpus, creating a multilingual corpus for the endangered language Mboshi that we make available to the community." } ], "y": { "@EXT@": { "gold_contexts": [ [ "38ad38f25a2823c64cd16bc9f2af93-C001-5" ], [ "38ad38f25a2823c64cd16bc9f2af93-C001-11" ], [ "38ad38f25a2823c64cd16bc9f2af93-C001-28" ] ], "cite_sentences": [ "38ad38f25a2823c64cd16bc9f2af93-C001-5", "38ad38f25a2823c64cd16bc9f2af93-C001-11", "38ad38f25a2823c64cd16bc9f2af93-C001-28" ] }, "@BACK@": { "gold_contexts": [ [ "38ad38f25a2823c64cd16bc9f2af93-C001-19" ] ], "cite_sentences": [ "38ad38f25a2823c64cd16bc9f2af93-C001-19" ] } } }, "ABC_d178b55f8d5928867b481ba89e165c_47": { "x": [ { "sent_id": "d178b55f8d5928867b481ba89e165c-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "d178b55f8d5928867b481ba89e165c-C001-2", "text": "Many services that perform information retrieval for Points of Interest (POI) utilize a Lucene-based setup with spatial filtering." }, { "sent_id": "d178b55f8d5928867b481ba89e165c-C001-3", "text": "While this type of system is easy to implement it does not make use of semantics but relies on direct word matches between a query and reviews leading to a loss in both precision and recall." }, { "sent_id": "d178b55f8d5928867b481ba89e165c-C001-4", "text": "To study the challenging task of semantically enriching POIs from unstructured data in order to support open-domain search and question answering (QA), we introduce a new dataset POIReviewQA 1 ." 
}, { "sent_id": "d178b55f8d5928867b481ba89e165c-C001-5", "text": "It consists of 20k questions (e.g. \"is this restaurant dog friendly?\") for 1022 Yelp business types." }, { "sent_id": "d178b55f8d5928867b481ba89e165c-C001-6", "text": "For each question we sampled 10 reviews, and annotated each sentence in the reviews whether it answers the question and what the corresponding answer is." }, { "sent_id": "d178b55f8d5928867b481ba89e165c-C001-7", "text": "To test a system's ability to understand the text we adopt an information retrieval evaluation by ranking all the review sentences for a question based on the likelihood that they answer this question." }, { "sent_id": "d178b55f8d5928867b481ba89e165c-C001-8", "text": "We build a Lucenebased baseline model, which achieves 77.0% AUC and 48.8% MAP." }, { "sent_id": "d178b55f8d5928867b481ba89e165c-C001-9", "text": "A sentence embedding-based model achieves 79.2% AUC and 41.8% MAP, indicating that the dataset presents a challenging problem for future research by the GIR community." }, { "sent_id": "d178b55f8d5928867b481ba89e165c-C001-10", "text": "The result technology can help exploit the thematic content of web documents and social media for characterisation of locations." }, { "sent_id": "d178b55f8d5928867b481ba89e165c-C001-11", "text": "----------------------------------" }, { "sent_id": "d178b55f8d5928867b481ba89e165c-C001-12", "text": "**INTRODUCTION**" }, { "sent_id": "d178b55f8d5928867b481ba89e165c-C001-13", "text": "Location-based services (LBS) and the underlying Point of Interest (POI) datasets play a increasingly important role in our daily interaction with mobile devices." }, { "sent_id": "d178b55f8d5928867b481ba89e165c-C001-14", "text": "Platforms such as Yelp, Foursquare, Google Map allow users to search nearby POIs based on their names, place types, or tags, which requires manual data annotation." 
}, { "sent_id": "d178b55f8d5928867b481ba89e165c-C001-15", "text": "In fact, besides these structured data, POIs are typically associated with abundant unstructured data such as descriptions and users' reviews which contain useful information for search and question answering purpose." }, { "sent_id": "d178b55f8d5928867b481ba89e165c-C001-16", "text": "For example questions like \"Is this restaurant dog friendly?\" or \"Is this night club 18+?\" can be answered by relevant text in reviews such as \"Great dog friendly restaurant\" or \"18+ night club\"." }, { "sent_id": "d178b55f8d5928867b481ba89e165c-C001-17", "text": "This information can also help accomplishing search needs such as \"find night clubs near me which are 18+\"." }, { "sent_id": "d178b55f8d5928867b481ba89e165c-C001-18", "text": "There are only a few existing GIR benchmark datasets (e.g., Geo-CLEF [4] ) and they often lack in rich annotations as would be required for the examples above." }, { "sent_id": "d178b55f8d5928867b481ba89e165c-C001-19", "text": "Recently many datasets have been produced for reading comprehension such as SQuAD [5] ." }, { "sent_id": "d178b55f8d5928867b481ba89e165c-C001-20", "text": "However, they do not have a spatial/platial component." }, { "sent_id": "d178b55f8d5928867b481ba89e165c-C001-21", "text": "Here we present a POI search and question answering dataset called POIReviewQA with detail annotations of context and answers." }, { "sent_id": "d178b55f8d5928867b481ba89e165c-C001-22", "text": "Baseline models are implemented to demonstrate the difficulty of this task." }, { "sent_id": "d178b55f8d5928867b481ba89e165c-C001-23", "text": "Our work provides an evaluation benchmark for geographic information retrieval and question answering systems." 
}, { "sent_id": "d178b55f8d5928867b481ba89e165c-C001-24", "text": "It follows the idea of semantic signatures for social sensing [2] by which we can study POI types using patterns extracted from human behavior, e.g., what people write about places of a particular type." }, { "sent_id": "d178b55f8d5928867b481ba89e165c-C001-25", "text": "Intuitively, questions about age limits only arise in the narrow context of a few such types, e.g., nightclubs, movie theaters, and so on." }, { "sent_id": "d178b55f8d5928867b481ba89e165c-C001-26", "text": "Furthermore, unstructured data such as reviews are often geo-indicative without the need for explicit geographic coordinates." }, { "sent_id": "d178b55f8d5928867b481ba89e165c-C001-27", "text": "For instance, people may be searching for a central but quiet hotel [3] ." }, { "sent_id": "d178b55f8d5928867b481ba89e165c-C001-28", "text": "It is those questions that we will address in the following." }, { "sent_id": "d178b55f8d5928867b481ba89e165c-C001-29", "text": "----------------------------------" }, { "sent_id": "d178b55f8d5928867b481ba89e165c-C001-30", "text": "**THE POIREVIEWQA TASK**" }, { "sent_id": "d178b55f8d5928867b481ba89e165c-C001-31", "text": "We created POIReviewQA based on the Yelp Challenge 11 (YC11) dataset 2 and the QA section of POI pages." }, { "sent_id": "d178b55f8d5928867b481ba89e165c-C001-32", "text": "Query Set Generation." }, { "sent_id": "d178b55f8d5928867b481ba89e165c-C001-33", "text": "We create the question answer dataset from the \"Ask the Community\" section 3 of POI pages." }, { "sent_id": "d178b55f8d5928867b481ba89e165c-C001-34", "text": "The Yelp platform is dominated by popular business types such as restaurants." }, { "sent_id": "d178b55f8d5928867b481ba89e165c-C001-35", "text": "In order to produce a balanced query set for all business types we performed stratified sampling: 1) count the frequencies of POI name suffixes (single words) in YC11; 2) for every suffix Relevance and Answer Annotation." 
}, { "sent_id": "d178b55f8d5928867b481ba89e165c-C001-36", "text": "For each question, 10 review candidates are selected by stratified sampling from the search result of a lucene-based setup, i.e., applying Elastic Search to POI reviews based on the question with constraint to the associated POI types." }, { "sent_id": "d178b55f8d5928867b481ba89e165c-C001-37", "text": "We developed a crowd-facing Web server and deployed it on Amazon Mechanical Turk to let raters annotate each sentence of these 10 reviews with respect to whether it answer the current question and what the corresponding answer is." }, { "sent_id": "d178b55f8d5928867b481ba89e165c-C001-38", "text": "The annotation results are collected for each question." }, { "sent_id": "d178b55f8d5928867b481ba89e165c-C001-39", "text": "To date, we have collected about 4100 questions." }, { "sent_id": "d178b55f8d5928867b481ba89e165c-C001-40", "text": "Basic statistic for these are shown in Tab." }, { "sent_id": "d178b55f8d5928867b481ba89e165c-C001-41", "text": "1. In order to study the relationship between raters (given 3 raters per review sentence) and the accuracy of the raters, we divide the sentences into 4 sets based on the number of raters that agreed on each sentence, denoted as R 0 , R 1 , R 2 , R 3 ." }, { "sent_id": "d178b55f8d5928867b481ba89e165c-C001-42", "text": "Then we randomly sample 20 sentences from each of the last three sets (R 1 , R 2 , R 3 )." }, { "sent_id": "d178b55f8d5928867b481ba89e165c-C001-43", "text": "By manually inspecting the relevance of these sentences to the corresponding questions." }, { "sent_id": "d178b55f8d5928867b481ba89e165c-C001-44", "text": "The resulting accuracy of each sample set is 45% for R 1 , i.e., 9/20 sentences, 90% for R 2 , 100% for R 3 ." }, { "sent_id": "d178b55f8d5928867b481ba89e165c-C001-45", "text": "We treat the sentences in R 2 , R 3 as relevant, and the rest are labeled as irrelevant sentences." 
}, { "sent_id": "d178b55f8d5928867b481ba89e165c-C001-46", "text": "These labels are used to evaluate different models." }, { "sent_id": "d178b55f8d5928867b481ba89e165c-C001-47", "text": "Evaluation Metrics." }, { "sent_id": "d178b55f8d5928867b481ba89e165c-C001-48", "text": "Area under curve (AUC) and mean average percision (MAP) are used as evaluation metrics." }, { "sent_id": "d178b55f8d5928867b481ba89e165c-C001-49", "text": "----------------------------------" }, { "sent_id": "d178b55f8d5928867b481ba89e165c-C001-50", "text": "**EXPERIMENT WITH BASELINE MODELS**" }, { "sent_id": "d178b55f8d5928867b481ba89e165c-C001-51", "text": "In order to provide a similar search functionality to Yelp's new review-based POI search 5 , we developed a TF-IDF based model to search through all sentences from 10 reviews based on a question." }, { "sent_id": "d178b55f8d5928867b481ba89e165c-C001-52", "text": "An evaluation using the POIReviewQA dataset gives 77% AUC and 48.8% MAP." }, { "sent_id": "d178b55f8d5928867b481ba89e165c-C001-53", "text": "We also applied the sentence embedding model proposed by Sanjeev Arora et al. [1] ." }, { "sent_id": "d178b55f8d5928867b481ba89e165c-C001-54", "text": "It improves the average word embeddings using SVD and gives what the authors call \"toughto-beat\" results for a text similarity tasks." }, { "sent_id": "d178b55f8d5928867b481ba89e165c-C001-55", "text": "We use the pretrained Google News 300 dimension Word2Vec embeddings to generate Comparing to the TF-IDF model, the sentence embedding-based model gives a higher AUC (which is sensitive to overall rankings) but lower MAP (which is sensitive to top rankings)." }, { "sent_id": "d178b55f8d5928867b481ba89e165c-C001-56", "text": "The results from both baseline models indicate that the POIReviewQA dataset presents a challenging task." }, { "sent_id": "d178b55f8d5928867b481ba89e165c-C001-57", "text": "Table 2 shows examples for which the baseline model fails." 
}, { "sent_id": "d178b55f8d5928867b481ba89e165c-C001-58", "text": "Correctly predicting relevant sentence requires an understanding of language and common sense." }, { "sent_id": "d178b55f8d5928867b481ba89e165c-C001-59", "text": "We hope that the dataset will enable further GIR research about question answering as it relates to place types." } ], "y": { "@BACK@": { "gold_contexts": [ [ "d178b55f8d5928867b481ba89e165c-C001-19" ] ], "cite_sentences": [ "d178b55f8d5928867b481ba89e165c-C001-19" ] } } }, "ABC_0b334057bc358f5537497ed15344c1_47": { "x": [ { "sent_id": "0b334057bc358f5537497ed15344c1-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "0b334057bc358f5537497ed15344c1-C001-2", "text": "Research in computational linguistics in the biomedical domain traditionally focuses on two major areas: fundamental advances in language processing; and application of language processing methods to bridge the gap between basic biomedical research, clinical research, and translation of both types of research into practice." }, { "sent_id": "0b334057bc358f5537497ed15344c1-C001-3", "text": "Several conferences provide opportunities for discussion of these two types of research in specific sub-domains of Biomedical Natural Language Processing." }, { "sent_id": "0b334057bc358f5537497ed15344c1-C001-4", "text": "For example, Intelligent Systems for Molecular Biology (ISMB) and its associated special interest group and Pacific Symposium on Biocomputing (PSB) focus on NLP research applied to issues of interest to biologists, whereas American Medical Informatics Association (AMIA) is concerned with medical informatics issues." }, { "sent_id": "0b334057bc358f5537497ed15344c1-C001-5", "text": "Rather than focusing on a specific area of interest, ACL BioNLP workshop strives to provide a forum for any important, new, and exciting research in the field of Biomedical Natural Language Processing." 
}, { "sent_id": "0b334057bc358f5537497ed15344c1-C001-6", "text": "Rather than focusing on a specific theme as we have in previous years, the goal of the workshop this year was to solicit work of interest to NLP researchers on any topic in the biomedical domain." }, { "sent_id": "0b334057bc358f5537497ed15344c1-C001-7", "text": "Asking researchers to share their interests was rewarded by 34 submissions (5 posters and 19 full papers)." }, { "sent_id": "0b334057bc358f5537497ed15344c1-C001-8", "text": "Of those, 10 were accepted as full papers and 18 as poster presentations." }, { "sent_id": "0b334057bc358f5537497ed15344c1-C001-9", "text": "The combined expertise of the program committee allowed for providing three thorough reviews for each paper." }, { "sent_id": "0b334057bc358f5537497ed15344c1-C001-10", "text": "The exceptionally high quality manuscripts accepted for presentation cover a wide area of subjects in clinical and biological areas, as well as methodological issues applicable to both sublanguages." }, { "sent_id": "0b334057bc358f5537497ed15344c1-C001-11", "text": "Increasingly sophisticated relation extraction methods [6, 8]" }, { "sent_id": "0b334057bc358f5537497ed15344c1-C001-12", "text": "----------------------------------" }, { "sent_id": "0b334057bc358f5537497ed15344c1-C001-13", "text": "**BACKGROUND**" }, { "sent_id": "0b334057bc358f5537497ed15344c1-C001-14", "text": "Research in computational linguistics in the biomedical domain traditionally focuses on two major areas: fundamental advances in language processing; and application of language processing methods to bridge the gap between basic biomedical research, clinical research, and translation of both types of research into practice." }, { "sent_id": "0b334057bc358f5537497ed15344c1-C001-15", "text": "Several conferences provide opportunities for discussion of these two types of research in specific sub-domains of Biomedical Natural Language Processing." 
}, { "sent_id": "0b334057bc358f5537497ed15344c1-C001-16", "text": "For example, Intelligent Systems for Molecular Biology (ISMB) and its associated special interest group and Pacific Symposium on Biocomputing (PSB) focus on NLP research applied to issues of interest to biologists, whereas American Medical Informatics Association (AMIA) is concerned with medical informatics issues." }, { "sent_id": "0b334057bc358f5537497ed15344c1-C001-17", "text": "Rather than focusing on a specific area of interest, ACL BioNLP workshop strives to provide a forum for any important, new, and exciting research in the field of Biomedical Natural Language Processing." }, { "sent_id": "0b334057bc358f5537497ed15344c1-C001-18", "text": "Rather than focusing on a specific theme as we have in previous years, the goal of the workshop this year was to solicit work of interest to NLP researchers on any topic in the biomedical domain." }, { "sent_id": "0b334057bc358f5537497ed15344c1-C001-19", "text": "----------------------------------" }, { "sent_id": "0b334057bc358f5537497ed15344c1-C001-20", "text": "**SUBMISSIONS, ACCEPTANCE, AND THEMES**" }, { "sent_id": "0b334057bc358f5537497ed15344c1-C001-21", "text": "Asking researchers to share their interests was rewarded by 34 submissions (5 posters and 19 full papers)." }, { "sent_id": "0b334057bc358f5537497ed15344c1-C001-22", "text": "Of those, 10 were accepted as full papers and 18 as poster presentations." }, { "sent_id": "0b334057bc358f5537497ed15344c1-C001-23", "text": "The combined expertise of the program committee allowed for providing three thorough reviews for each paper." }, { "sent_id": "0b334057bc358f5537497ed15344c1-C001-24", "text": "The exceptionally high quality manuscripts accepted for presentation cover a wide area of subjects in clinical and biological areas, as well as methodological issues applicable to both sublanguages." 
}, { "sent_id": "0b334057bc358f5537497ed15344c1-C001-25", "text": "Named entity recognition (NER) continues to be an active area of research." }, { "sent_id": "0b334057bc358f5537497ed15344c1-C001-26", "text": "NER research presented here involves development of new statistical and hybrid approaches to identification and disambiguation of gene [1] , protein [2] , chemical names [3] , and clinical entities." }, { "sent_id": "0b334057bc358f5537497ed15344c1-C001-27", "text": "Overwhelmingly, researchers chose statistical or hybrid approaches to the tasks at hand." }, { "sent_id": "0b334057bc358f5537497ed15344c1-C001-28", "text": "This is probably the reason for growing interest in creation of annotated corpora [4] , development of methods for augmenting the existing annotation [5] , speeding up the annotation process [5] , and reducing its cost; evaluating the comparability of results obtained applying the same methods to different collections [6] , And increasing compatibility of different annotations [7] ." }, { "sent_id": "0b334057bc358f5537497ed15344c1-C001-29", "text": "Increasingly sophisticated relation extraction methods [6, 8] are being applied to a broader set of iii relations [9] ." }, { "sent_id": "0b334057bc358f5537497ed15344c1-C001-30", "text": "Other steps towards deeper understanding of the text include methods for creation of gene profiles [10] , identification of uncertainty [11] , discourse connectivity [12] , and temporal features of clinical conditions [13] ." }, { "sent_id": "0b334057bc358f5537497ed15344c1-C001-31", "text": "The applicability of NLP methods to clinical tasks is explored in the work on identification of language impairments [14] and seriousness of suicidal attempts [15] ." 
}, { "sent_id": "0b334057bc358f5537497ed15344c1-C001-32", "text": "Finally, application of NLP methods to classic information retrieval problems such as automatic indexing of biomedical literature [16] and the newer information retrieval problem of image retrieval [17]" } ], "y": { "@BACK@": { "gold_contexts": [ [ "0b334057bc358f5537497ed15344c1-C001-28" ], [ "0b334057bc358f5537497ed15344c1-C001-29" ] ], "cite_sentences": [ "0b334057bc358f5537497ed15344c1-C001-28", "0b334057bc358f5537497ed15344c1-C001-29" ] } } }, "ABC_891d0e17bf2fb79a378c2d77dda768_47": { "x": [ { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-24", "text": "However, these studies do not focus solely on creative text." }, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-48", "text": "We follow a similar training procedure for GumbelGAN." }, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-2", "text": "Generative models for text have substantially contributed to tasks like machine translation and language modeling, using maximum likelihood optimization (MLE)." }, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-3", "text": "However, for creative text generation, where multiple outputs are possible and originality and uniqueness are encouraged, MLE falls short." }, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-4", "text": "Methods optimized for MLE lead to outputs that can be generic, repetitive and incoherent." }, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-5", "text": "In this work, we use a Generative Adversarial Network framework to alleviate this problem." }, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-6", "text": "We evaluate our framework on poetry, lyrics and metaphor datasets, each with widely different characteristics, and report better performance of our objective function over other generative models." 
}, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-7", "text": "----------------------------------" }, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-8", "text": "**INTRODUCTION AND RELATED WORK**" }, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-9", "text": "Language models can be optimized to recognize syntax and semantics with great accuracy [1] ." }, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-10", "text": "However, the output generated can be repetitive and generic leading to monotonous or uninteresting responses (e.g \"I don't know\") regardless of the input [2] ." }, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-11", "text": "While application of attention [3, 4] and advanced decoding mechanisms like beam search and variation sampling [5] have shown improvements, it does not solve the underlying problem." }, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-12", "text": "In creative text generation, the objective is not strongly bound to the ground truth-instead the objective is to generate diverse, unique or original samples." }, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-13", "text": "We attempt to do this through a discriminator which can give feedback to the generative model through a cost function that encourages sampling of creative tokens." }, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-14", "text": "The contributions of this paper are in the usage of a GAN framework to generate creative pieces of writing." }, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-15", "text": "Our experiments suggest that generative text models, while very good at encapsulating semantic, syntactic and domain information, perform better with external feedback from a discriminator for fine-tuning objectiveless decoding tasks like that of creative text." }, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-16", "text": "We show this by evaluating our model on three very different creative datasets containing poetry, metaphors and lyrics." 
}, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-17", "text": "Previous work on handling the shortcomings of MLE include length-normalizing sentence probability [6] , future cost estimation [7] , diversity-boosting objective function [8, 2] or penalizing repeating tokens [9] ." }, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-18", "text": "When it comes to poetry generation using generative text models, Zhang and Lapata [10] , Yi et al. [11] and Wang et al. [12] use language modeling to generate Chinese poems." }, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-19", "text": "However, none of these methods provide feedback on the quality of the generated sample and hence, do not address the qualitative objective required for creative decoding." }, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-20", "text": "For the task of text generation, MaskGAN [13] uses a Reinforcement Learning signal from the discriminator, FMD-GAN [14] uses an optimal transport mechanism as an objective function." }, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-21", "text": "GumbelGAN [15] uses Gumbel-Softmax distribution that replaces the non-differentiable sample from a categorical distribution with a differentiable sample to propagate stronger gradients." }, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-22", "text": "Li et al. [2] use a discriminator for a diversity promoting objective." }, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-23", "text": "Yu et al. [16] use SeqGAN to generate poetry and comment on the performance of SeqGAN over MLE in human evaluations, encouraging our study of GANs for creative text generation." 
}, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-25", "text": "Using GANs, we can train generative models in a two-player game setting between a discriminator and a generator, where the discriminator (a binary classifier) learns to distinguish between real and fake data samples and the generator tries to fool the discriminator by generating authentic and high quality output [17] ." }, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-26", "text": "GANs have shown to be successful in image generation tasks [18] and recently, some progress has been observed in text generation [14, 13, 16] ." }, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-27", "text": "Our generator is a language model trained using backpropagation through time [19] ." }, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-28", "text": "During the pre-training phase we optimize for MLE and during the GAN training phase, we optimize on the creativity reward from the discriminator." }, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-29", "text": "The discriminator's encoder has the same architecture as the generator encoder module with the addition of a pooled decoder layer." }, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-30", "text": "The decoder contains 3 [DenseBatchN ormalization, ReLU ] blocks and an addtional Sigmoid layer." }, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-31", "text": "The discriminator decoder takes the hidden state at the last time step of a sequence concatenated with both the max-pooled and mean-pooled representation of the hidden states [20] and outputs a number in the range [0, 1]." }, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-32", "text": "The difficulty of using GANs in text generation comes from the discrete nature of text, making the model non-differentiable hence, we update parameters for the generator model with policy gradients as described in Yu [16] ." 
}, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-33", "text": "We utilize AWD-LSTM [21] and TransformerXL [22] based language models." }, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-34", "text": "For model hyperparameters please to refer to Supplementary Section Table 2 ." }, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-35", "text": "We use Adam optimizer [23] with \u03b21 = 0.7 and \u03b22 = 0.8 similar to [20] and use a batch size of 50." }, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-36", "text": "Other practices for LM training were the same as [22] and [21] for Transformer-XL and AWD-LSTM respectively." }, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-37", "text": "We refer to our proposed GAN as Creative-GAN and compare it to a baseline (a language model equivalent to our pre-trained generator) and a GumbelGAN model [15] across all proposed datasets." }, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-38", "text": "We use three creative English datasets with distinct linguistic characteristics: (1) A corpus of 740 classical and contemporary English poems, (2) a corpus of 14950 metaphor sentences retrieved from a metaphor database website 1 and (3) a corpus of 1500 song lyrics ranging across genres." }, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-39", "text": "The mix of linguistic styles within this corpus offers the potential for interesting variation during the generation phase." }, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-40", "text": "We use the same pre-processing as in earlier work [20, 24] ." }, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-41", "text": "We reserve 10% of our data for test set and another 10% for our validation set." }, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-42", "text": "We first pre-train our generator on the Gutenberg dataset [25] for 20 epochs and then fine-tune [20] them to our target datasets with a language modeling objective." 
}, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-43", "text": "The discriminator's encoder is initialized to the same weights as our fine-tuned language model." }, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-44", "text": "Once we have our fine-tuned encoders for each target dataset, we train in an adversarial manner." }, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-45", "text": "The discriminator objective here is to score the quality of the creative text." }, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-46", "text": "The discriminator is trained for 3 iterations for every iteration of the generator, a practice seen in previous work [26] ." }, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-47", "text": "Creative-GAN relies on using the reward from the discriminator [13, 16] for backpropagation." }, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-49", "text": "Outputs are generated through sampling over a multinomial distribution for all methods, instead of argmax on the log-likelihood probabilities, as sampling has shown to produce better output quality [5] . Please refer to Supplementary Section Table 3 for training parameters of each dataset and Table 2 for hyperparameters of each encoder." }, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-50", "text": "We pick these values after experimentation with our validation set." }, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-51", "text": "Training and output generation code can be found online 2 ." }, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-52", "text": "----------------------------------" }, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-53", "text": "**EVALUATION AND CONCLUSION**" }, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-54", "text": "Evaluating creative generation tasks is both critical and complex [27] ." 
}, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-55", "text": "Along the lines of previous research on evaluating text generation tasks [27] , we report the perplexity scores of our test set on the evaluated models in the Supplementary Section, Table 1 Our model shows improvements over baseline and GumbelGAN." }, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-56", "text": "Common computational methods like BLEU [28] and perplexity are at best a heuristic and not strong indicators of good performance in text generation models [29] ." }, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-57", "text": "Particularly, since these scores use target sequences as a reference, it has the same pitfalls as relying on MLE." }, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-58", "text": "The advantages in this approach lie in the discriminator's ability to influence the generator to explore other possibilities." }, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-59", "text": "Sample outputs for our model can be found on our website 3 ." }, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-60", "text": "----------------------------------" }, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-61", "text": "**SUPPLEMENTARY MATERIAL**" }, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-62", "text": "In this section, we report our results on computational metrics, hyperparameters and training configurations for our models." }, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-63", "text": "Table 1 shows the results of the perplexity score evaluation of the evaluated models, Table 2 shows hyperparameters for each encoding method and Table 3 shows our training parameters." 
}, { "sent_id": "891d0e17bf2fb79a378c2d77dda768-C001-64", "text": "In Table 3 Table 3 : Training Parameters" } ], "y": { "@MOT@": { "gold_contexts": [ [ "891d0e17bf2fb79a378c2d77dda768-C001-11" ], [ "891d0e17bf2fb79a378c2d77dda768-C001-49" ] ], "cite_sentences": [ "891d0e17bf2fb79a378c2d77dda768-C001-11", "891d0e17bf2fb79a378c2d77dda768-C001-49" ] } } }, "ABC_2d2da2e9215691bffad74bfb97dbf3_47": { "x": [ { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-2", "text": "This document describes the senti.ue system and how it was used for participation in SemEval-2014 Task 9 challenge." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-3", "text": "Our system is an evolution of our prior work, also used in last year's edition of Sentiment Analysis in Twitter." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-4", "text": "This system maintains a supervised machine learning approach to classify the tweet overall sentiment, but with a change in the used features and the algorithm." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-5", "text": "We use a restricted set of 47 features in subtask B and 31 features in subtask A. In the constrained mode, and for the five data sources, senti.ue achieved a score between 78,72 and 84,05 in subtask A, and a score between 55,31 and 71,39 in subtask B. For the unconstrained mode, our score was slightly below, except for one case in subtask A." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-6", "text": "----------------------------------" }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-8", "text": "This paper describes the approach taken by a team of Universidade de\u00c9vora's Computer Science Department in SemEval-2014 Task 9: Sentiment Analysis in Twitter (Rosenthal et al., 2014) ." 
}, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-9", "text": "SemEval-2014 Task 9 has an expression-level (subtask A) and a message-level (subtask B) polarity classification challenges." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-10", "text": "The first subtask aims to determine whether a word (or phrase) is positive, negative or neutral, within the textual context in which it appears." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-11", "text": "The second subtask concerns the classification of the overall text polarity, which corresponds to automatically detecting the sentiment expressed in a Twitter message." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-12", "text": "In both subtasks, systems can operate in constrained or unconstrained mode." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-13", "text": "Constrained means that learn- ing is based only on provided training texts, with the possible aid of static resources such as lexicons." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-14", "text": "Extra tweets or additional annotated documents for training are permitted only in unconstrained mode." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-15", "text": "The system we used to respond to this challenge is called senti.ue, and follows on from our previous work on Natural Language Processing (NLP) and Sentiment Analysis (SA)." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-16", "text": "We developed work in automatic reputation assessment, using a Machine Learning (ML) based classifier for comments with impact on a particular target entity (Saias, 2013) ." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-17", "text": "We also participated in the previous edition of SemEval SA task, where we have implemented the basis for the current system." 
}, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-18", "text": "In last year's solution (Saias and Fernandes, 2013) , we treated both subtasks using the same method (except the training set)." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-19", "text": "We have updated the method for subtask A, now considering also the text around the area to classify, by dedicating new features to those preceding and following tweet parts." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-20", "text": "Text overall sentiment classification is the core objective of our system, and is performed, as before, with a supervised machine learning technique." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-21", "text": "For subtask B, we fixed some implementation issues in the previous version, and we went from 22 to 53 features, explained in Section 3." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-22", "text": "----------------------------------" }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-23", "text": "**RELATED WORK**" }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-24", "text": "The popularity of social networks and microblogging facilitated the sharing of opinions." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-25", "text": "To know whether people are satisfied or not with a particular brand or product is of great interest to marketing companies." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-26", "text": "Much work has appeared in SA, trying to capture valuable information in expressions of contentment or discontentment." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-27", "text": "Important international scientific events, NLP related, include SA challenges and workshops." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-28", "text": "This was the case in SemEval-2013, whose task 2 (Wilson et al., 2013) required sentiment analysis of Twitter and SMS text messages." 
}, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-29", "text": "Being the pre-decessor task of the challenge for which this work was developed, it is similar to this year's Task 9." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-30", "text": "The participating systems achieved better results in contextual polarity subtask (A) than those obtained for the overall message polarity subtask (B)." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-31", "text": "In that edition, the best results were obtained by systems in constrained mode." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-32", "text": "The most common method was supervised ML with features that can be related to text words, syntactic function, discourse elements relation, internet slang and symbols, or clues from sentiment lexicons." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-33", "text": "In that task, the NRC-Canada system (Mohammad et al., 2013) obtained the best performance, achieving an F1 of 88.9% in subtask A and 69% in subtask B. That system used one SVM classifier for each subtask, together with text surface based features, features associated with manually created and automatically generated sentiment lexicons, and n-gram features." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-34", "text": "Other systems with good results in that task were GU-MLT-LT (G\u00fcnther and Furrer, 2013) and AVAYA (Becker et al., 2013) ." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-35", "text": "The first was implemented in the Python language." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-36", "text": "It includes features for: text tokens after normalization, stems, word clusters, and two values for the accumulated positive and accumulated negative SentiWordNet (Baccianella et al., 2010) scores, considering negation." 
}, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-37", "text": "Its machine learning classifier is based on linear models with stochastic gradient descent." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-38", "text": "The approach taken in the AVAYA system centers on training highdimensional, linear classifiers with a combination of lexical and syntactic features." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-39", "text": "This system uses Bag-of-Words features, with negation represented in word suffix, and including not only the raw word forms but also combinations with lemmas and PoS tags." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-40", "text": "Then, word polarity features are added, using the MPQA lexicon (Wiebe et al., 2005) , as well as syntactic dependency and PoS tag features." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-41", "text": "Other features consider emoticons, capitalization, character repetition, and emphasis characters, such as asterisks and dashes." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-42", "text": "The resulting model was trained with the LIBLINEAR (Fan et al., 2008) classification library." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-43", "text": "Another NLP task very close to SA is polarity classification on the reputation of an entity." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-44", "text": "Here, instead the sentiment in the perspective of the opinion holder, the goal is to detect the impact of this particular opinion on some entity's reputation." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-45", "text": "The diue system (Saias, 2013 ) uses a supervised ML approach for reputation polarity classification, including Bag-of-Words and a limited set of features based on sentiment lexicons and superficial text analysis." 
}, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-46", "text": "----------------------------------" }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-47", "text": "**METHOD**" }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-48", "text": "This work follows on from our previous participation in SemEval-2013 SA task, where we have devoted greater effort to subtask B. We start by explaining our current approach for this subtask, and then we describe how such classifier is also used in subtask A." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-49", "text": "----------------------------------" }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-50", "text": "**MESSAGE POLARITY CLASSIFICATION**" }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-51", "text": "The senti.ue system maintains a supervised machine learning approach to perform the overall sentiment classification." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-52", "text": "As before, Python and the Natural Language Toolkit (NLTK 1 ) are used for text processing and ML feature extraction." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-53", "text": "The first step was to obtain the tweet content and forming the instances of the training set." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-54", "text": "During the download phase, several tweets were not found." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-55", "text": "In constrained mode, we got only 7352 instances available for training." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-56", "text": "Tweet preprocessing includes tokenization, which is punctuation and white space based, negation detection, and lemmatization, through NLTK class WordNetLemmatizer." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-57", "text": "After that, the system runs the ML component." 
}, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-58", "text": "Instead of the solution we used in 2013, with two differently configured classifiers in a pipeline, we chose to use a single classifier, which this year is based on SciKit-Learn 2 , and to increase the number of features that are extracted to represent each instance." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-59", "text": "The classification algorithm was Support Vector Machines (SVM), using SVC 3 class, with a linear kernel and 10 \u22125 tolerance for stopping criterion." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-60", "text": "SVC class implementation is based on libsvm (Chang and Lin, 2011) , and uses one-against-one approach for multi-class classification." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-61", "text": "From each instance, the system extracts the 47 features in Figure 1 ." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-62", "text": "The first two features represent the index of the first polarized token." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-63", "text": "The following represent the repeated occurrence of a question mark, and the existence of a token with negation (not, never)." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-64", "text": "Then there are two features that indicate whether there is negation before positive or negative words." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-65", "text": "The following 8 fea-tures indicate whether there are positive or negative terms, just after, or near, a question mark or an exclamation mark." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-66", "text": "We build a table with words or phrases marked as positive or negative in subtask A data." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-67", "text": "Using this resource, 4 features test the presence and the count of word n-grams marked as positive or negative." 
}, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-68", "text": "Then the TA.alike features represent the same, but after lemmatization and synonym verification." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-69", "text": "To find the synonyms of a term, we used the WordNet (Princeton University, 2010) resource." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-70", "text": "The probability of each word belonging to a class was calculated." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-71", "text": "There are 3 features avgProbWordOn, one per class, that represent the average of this probability for each instance words." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-72", "text": "Next 3 features represent the same, but focusing only on the last 5 words of each text." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-73", "text": "Then we have 6 ProbLog2Prob features, representing the average of P \u00d7 log 2 (P ), for all words, or only the latest 5 words, for all classes." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-74", "text": "P is the probability of the word belonging to one class." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-75", "text": "One feature cumulates the token polarity values, according to SentiWordNet." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-76", "text": "The final 12 features are based on sentiment lexicons: AFINN (Nielsen, 2011) , Bing Liu (Liu et al., 2005) , MPQA, and a custom polarity table with some manually entered entries." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-77", "text": "For each resource, we count the instance tokens with negative and positive polarity, and create a feature direction, having the value 1 if countTokens.pos>countTokens.neg, -1 if countTokens.pos(Wilson et al., 2013) ." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-113", "text": "This time, we implemented the contextual polarity solution based on the subtask B classifier." 
}, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-114", "text": "Given the results, we intend to do, in the near future, a new iteration of our system where the overall classifier will depend on (or receive features from) the current subtask A classifier." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-115", "text": "It seems to us that senti.ue feature engineering can be improved, maintaining this line of development." }, { "sent_id": "2d2da2e9215691bffad74bfb97dbf3-C001-116", "text": "Once stabilized, the introduction of named entity recognition and a richer linguistic analysis will help to identify the sentiment target entities, as the ultimate goal for this system." } ], "y": { "@BACK@": { "gold_contexts": [ [ "2d2da2e9215691bffad74bfb97dbf3-C001-28" ] ], "cite_sentences": [ "2d2da2e9215691bffad74bfb97dbf3-C001-28" ] }, "@SIM@": { "gold_contexts": [ [ "2d2da2e9215691bffad74bfb97dbf3-C001-112" ] ], "cite_sentences": [ "2d2da2e9215691bffad74bfb97dbf3-C001-112" ] } } }, "ABC_fc3775c0d23292160f5c5eb86861be_47": { "x": [ { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-2", "text": "Romanian has been traditionally seen as bearing three lexical genders: masculine, feminine, and neuter, although it has always been known to have only two agreement patterns (for masculine and feminine)." }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-3", "text": "Previous machine learning classifiers which have attempted to discriminate Romanian nouns according to gender have taken as input only the singular form, either presupposing the traditional tripartite analysis, or using additional information from case inflected forms." }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-4", "text": "We present here a tool based on two parallel support vector machines using n-gram features from the singular and from the plural, which distinguish the neuter." 
}, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-5", "text": "----------------------------------" }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-6", "text": "**THE ROMANIAN GENDER SYSTEM**" }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-7", "text": "Recently, a big grammatical mistake made by a Romanian politician brought in attention the plural form of Romanian nouns." }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-8", "text": "The traditional analysis (Graur et al., 1966; Rosetti, 1965 Rosetti, , 1973 Corbett, 1991) identifies Romanian as the only Romance language bearing three lexical genders (masculine, feminine and neuter), whether the neuter was inherited from Latin (Constantinescu-Dobridor, 2001, p. 44) , or redevelopped under the influence of Slavic languages (Rosetti, 1965; Petrucci, 1993) ." }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-9", "text": "The first two genders generally have no problem regarding their plurals (follow a pattern more or less), the neuter gender being the one which poses some dificulties." }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-10", "text": "These dificulties are not encompased only by politicians, but also for second language acquisition and; not to mention, in some cases, the long debates between linguists themselves." }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-11", "text": "The problem occurs since the neuter gender has a masculine form for singular and a feminine form for plural (see Table 1 for examples)." }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-12", "text": "Since the language bears only two agreement markers (masculine and feminine), the three genders then need to be mapped onto the dual agreement, the way in which this mapping is done and on what basis also having been debated." 
}, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-13", "text": "However, under the premise that gender is expressed through agreement, the fact that Romanian neuter nouns lack their own marking and their own agreement pattern (they systematically and without exception follow the masculine agreement in the singular and the feminine in the plural as seen in Table 1 ) have lead Bateman and Polinsky (2010) and others to ask the question of whether Romanian has three genders, or two." }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-14", "text": "Gender assignment thus becomes a burden not only for linguists to describe, but also for second language learners of Romanian to acquire." }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-15", "text": "Nastase and Popescu (2009) and (Cucerzan and Yarowsky, 2003) ." }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-16", "text": "Our goal was, thus, to better -in comparison to Nastase and Popescu (2009)'s results-or successfully -in comparison to Cucerzan and Yarowsky (2003) 's experiment-distinguish these \"neuter\" nouns from feminines and masculines, by employing the minimum amount of information." }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-17", "text": "We employed phonological information (coming from singular and plural noninflected nominative forms) as well as information coming from the feminine and masculine gender labels." }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-18", "text": "In what follows we will present our tool for Romanian neuter nouns, which outperforms all previous attempts." 
}, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-19", "text": "----------------------------------" }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-20", "text": "**OUR APPROACH**" }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-21", "text": "We will look at singular and plural nominative indefinite forms (as specified by Bateman and Polinsky and used by Nastasescu and Popescu) and see if phonological features (endings) and information from masculine and feminine labels are sufficient to correctly classify Romanian neuter nouns as such." }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-22", "text": "Another thing to take into consideration when looking at our classifier is the fact that, while Bateman and Polinsky (2010, p. 53-54) use both semantic and phonological features to assign gender, with the semantic features overriding the formal, we were unable use any semantic features, and used their phonological form as training examples." }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-23", "text": "----------------------------------" }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-24", "text": "**DATASET**" }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-25", "text": "The dataset we used is a Romanian language resource containing a total of 480,722 inflected forms of Romanian nouns and adjectives." }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-26", "text": "It was extracted from the text form of the morphological dictionary RoMorphoDict (Barbu, 2008) , which was also used by Nastase and Popescu (2009) for their Romanian classifier, where every entry has the following structure:" }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-27", "text": "Here, 'form' denotes the inflected form and 'description', the morphosyntactic description, encoding part of speech, gender, number, and case." 
}, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-28", "text": "For the morphosyntactic description, the initial dataset uses the slash ('/') as a disjunct operator meaning that 'm/n' stands for 'masculine or neuter', while the dash ('-') is used for the conjunct operator, with 'm-n' meaning 'masculine and neuter'." }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-29", "text": "In the following, we will see that some of the disjunct gender labels can cause some problems in the extraction of the appropriate gender and subsequently in the classifier." }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-30", "text": "Since our interest was in gender, we discarded all the adjectives listed and we isolated the nominative/accusative indefinite (without the enclitic article) form." }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-31", "text": "We then split them into singulars and plurals; the defective nouns were excluded." }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-32", "text": "The entries which were labeled as masculine or feminine were used as training and validation data for our experiment, while the neuters were left as the unlabeled test set." }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-33", "text": "The training and validation set contained 30,308 nouns, and the neuter test set 9,822 nouns (each with singular and plural form)." }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-34", "text": "----------------------------------" }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-35", "text": "**CLASSIFIER AND FEATURES**" }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-36", "text": "Our model consists of two binary linear support vector classifiers (Dinu et al., 2012) , one for the singular forms and another one for the plural forms." }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-37", "text": "Each of these has a free parameter C that needs to be optimized to ensure good performance." 
}, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-38", "text": "We extracted character n-gram features vectors from the masculine and feminine nouns, separately." }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-39", "text": "These vectors can represent counts of binary occurences of n-grams." }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-40", "text": "We also considered that the suffix might carry more importance so we added the '$' character at the end of each inflected form." }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-41", "text": "This allows the downstream classifier to assign a different weight to the (n \u2212 1)-grams that overlap with the suffix." }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-42", "text": "Each possible combinations of parameters: n-gram length, use of binarization, addition of suffix, and the C regularization parameter was evaluated using 10-fold cross-validation, for both singular and plurals." }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-43", "text": "After the model has been selected and trained in this manner, the neuter nouns are plugged in and their singular forms are classified according to the singular classifier, while their plural forms are classified by the plural model." }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-44", "text": "The experiment was set up and run using the scikit-learn machine learning library for Python (Pedregosa et al., 2011) ." }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-45", "text": "The implementation of linear support vector machines used is liblinear." }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-46", "text": "----------------------------------" }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-47", "text": "**OUR RESULTS**" }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-48", "text": "The best parameters chosen by cross-validation are 5-gram features, append the suffix character, but don't binarize the feature vectors." 
}, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-49", "text": "On masculine-feminine singulars, this obtained an accuracy of 99.59%, with a precision of 99.63%, a recall of 99.80% and an F 1 score of 99.71%." }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-50", "text": "The plural model scored an accuracy of 95.98%, with a precision of 97.32%, a recall of 97.05% and an F 1 score of 97.18%." }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-51", "text": "We then moved on to check the classification results of the neuter forms, and performed error analysis on the results." }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-52", "text": "Table 2a shows the distribution of neuter noun tuples (singular, plural) according to how our models classify their forms." }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-53", "text": "Our hypothesis states that all of the mass should gather in the top-left corner, i.e. neuters should classify as masculine in the singular and feminine in the plural." }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-54", "text": "There are more misclassifications in the plural form of neuter nouns than in their singular form." }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-55", "text": "In what follows, we will briefly analyze the misclassifications and see if there is any room for improvement or any blatant mistakes that can be rectified." }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-56", "text": "Table 2 : Distribution of neuters as classified by the system." }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-57", "text": "In each table, the upper left corner shows nouns classified as expected (masculine in the singular, feminine in the plural), while the lower right corner shows completely misclassified nouns (nouns that seem to be feminine in the singular and masculine in the plural)." 
}, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-58", "text": "The other two fields appropriately show nouns misclassified in only one of the forms." }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-59", "text": "----------------------------------" }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-60", "text": "**ANALYZING MISCLASSIFICATIONS**" }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-61", "text": "We first notice that 10 out of the 15 nouns that were completely misclassified are French borrowings which, although feminine in French, designate inanimate things." }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-62", "text": "According to (Butiurca, 2005, p. 209) , all feminine French nouns become neuter once they are borrowed into Romanian." }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-63", "text": "The ones discussed here have the singular ending in '\u00e9', written in Romanian without the accent, but retaining main stress as in French." }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-64", "text": "Another of the 15, which also ends in an 'e' carrying main stress but not of French origin, is a noun formed from an acronym: pefele from PFL." }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-65", "text": "There is also a noun (coclaur\u0203-coclauri) probably from the preLatin substratum, which is listed in Romanian dictionaries either as a pluralia tantum or as it is listed in the dataset." }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-66", "text": "The others are feminine singular forms wrongly labeled in the original corpus as being neuter or neuter/feminine." 
}, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-67", "text": "Looking at the entries in the original dataset for two of the last five nouns completely misclassified (levantin/levantin\u0203-levantinuri/levantine and bageac/bageac\u0203-bageacuri/bageci), we notice that the latter receives an 'n' tag for the singular form bageac\u0203, which in (Collective, 2002 ) is listed as a feminine, and the former receives the 'n/f' tag, meaning either a neuter, or a feminine (Barbu, 2008 (Barbu, , p. 1939 , for both the neuter levantin and the feminine levantin\u0203 singular form." }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-68", "text": "We further notice that, when the gender tag 'n/f' accompanies a singular form, from the perspective of our system, a contradiction is stated." }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-69", "text": "Seeing as Romanian has only two agreement patterns and that neuters agree like masculines in the singular and feminines in the plural, the feminine form levantin\u0203 cannot be either neuter, and receive the masculine numeral un in the singular, or feminine, and receive the feminine numeral o. It can only be feminine." }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-70", "text": "Through analoguous resoning, the tag 'n/m' accompanying a plural form is also \"absurd\"." }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-71", "text": "By eliminating the second gender from the two disjunct labels of the original dataset when extracting the nouns for our classifier, we correctly tagged the neuter variants with 'n', but also wrongly tagged 5 feminine singular forms with 'n' and 7 masculine plural forms with 'n'." 
}, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-72", "text": "There are other misclassified nouns, from the other two groups, whose misclasification is due to an error in their initial gender label, for instance algoritm-algoritmi is shown to be a masculine in (Collective, 2002) , however in the corpus it is tagged as neuter (together with the neuter variant algoritm-algoritme) and it subsequently appears to be misclassified in the plural as a masculine, which in fact it is." }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-73", "text": "Another problem causing the misclassification is represented by the hyphenated compound nouns, which are headed by the leftmost noun that also receives the number/gender inflection." }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-74", "text": "Seeing as our classification system weighed more on the suffix, it was prone to fail in correctly clasifying them." }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-75", "text": "----------------------------------" }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-76", "text": "**CONCLUSION AND PERSPECTIVES**" }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-77", "text": "The results of our classifier make a strong case, in particular, for Bateman and Polinsky's analysis according to which class membership of nouns in Romanian is assigned based on form (nominative noninflected singular endings and plural markers), when semantic cues relating to natural gender (masculine and feminine) are absent, and, in general, for their two separate (for the singular and plural) dual-class division of the Romanian nominal domain." }, { "sent_id": "fc3775c0d23292160f5c5eb86861be-C001-78", "text": "Furthermore, our classification model outperforms the two classifiers of Romanian nouns according to gender previously constructed in terms of correctly distinguishing the neuter." 
} ], "y": { "@USE@": { "gold_contexts": [ [ "fc3775c0d23292160f5c5eb86861be-C001-25", "fc3775c0d23292160f5c5eb86861be-C001-26" ] ], "cite_sentences": [ "fc3775c0d23292160f5c5eb86861be-C001-26" ] }, "@UNSURE@": { "gold_contexts": [ [ "fc3775c0d23292160f5c5eb86861be-C001-67" ] ], "cite_sentences": [ "fc3775c0d23292160f5c5eb86861be-C001-67" ] } } }, "ABC_b0701d41baf3b355d864f46821f34a_47": { "x": [ { "sent_id": "b0701d41baf3b355d864f46821f34a-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "b0701d41baf3b355d864f46821f34a-C001-2", "text": "Text representation is a fundamental concern in Natural Language Processing, especially in text classification." }, { "sent_id": "b0701d41baf3b355d864f46821f34a-C001-3", "text": "Recently, many neural network approaches with delicate representation model (e.g. FAST-TEXT, CNN, RNN and many hybrid models with attention mechanisms) claimed that they achieved state-of-art in specific text classification datasets." }, { "sent_id": "b0701d41baf3b355d864f46821f34a-C001-4", "text": "However, it lacks an unified benchmark to compare these models and reveals the advantage of each sub-components for various settings." }, { "sent_id": "b0701d41baf3b355d864f46821f34a-C001-5", "text": "We re-implement more than 20 popular text representation models for classification in more than 10 datasets." }, { "sent_id": "b0701d41baf3b355d864f46821f34a-C001-6", "text": "In this paper, we reconsider the text classification task in the perspective of neural network and get serval effects with analysis of the above results." }, { "sent_id": "b0701d41baf3b355d864f46821f34a-C001-7", "text": "----------------------------------" }, { "sent_id": "b0701d41baf3b355d864f46821f34a-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "b0701d41baf3b355d864f46821f34a-C001-9", "text": "In Natural Language Processing or text related community, effective representation of textual sequences is the fundamental topic for the up-stream tasks." 
}, { "sent_id": "b0701d41baf3b355d864f46821f34a-C001-10", "text": "Traditionally, bag-of-word models (TFIDF or language model) with vocabulary-aware vector space tends to be the main-stream approach, especially in the task with long text (e.g. ad hoc retrieval with long document, text classification for long sentence)." }, { "sent_id": "b0701d41baf3b355d864f46821f34a-C001-11", "text": "However, it tends to get pool performance in the tasks with short-text sentence (text classification for relatively short sentence, Question answering, machine comprehension and dialogue system), which there are little word-level overlaps in bag-of-word vector space." }, { "sent_id": "b0701d41baf3b355d864f46821f34a-C001-12", "text": "Distributed representation (Le and Mikolov, 2014 ) in a fixed low-dimensional space trained from large-scale * \u2020 means equal contribution." }, { "sent_id": "b0701d41baf3b355d864f46821f34a-C001-13", "text": "Need to be peer-reviewed corpus have been proposed to enhance the features of text, then break through the performance bottleneck of bag-of-words models in short-text tasks." }, { "sent_id": "b0701d41baf3b355d864f46821f34a-C001-14", "text": "With combination of Conventional Neural Network (CNN) (Kalchbrenner et al., 2014) , Recurrent Neural Network (RNN), Recursive Neural Network (Socher et al., 2013) and Attention, hundreds of models had been proposed to model text for further classification, matching (Fan et al., 2017) or other tasks." }, { "sent_id": "b0701d41baf3b355d864f46821f34a-C001-15", "text": "However, these models are tested in different settings with various datasets, preprocessing and even evaluation." }, { "sent_id": "b0701d41baf3b355d864f46821f34a-C001-16", "text": "Since subtle differences may lead to large divergence in final performance." }, { "sent_id": "b0701d41baf3b355d864f46821f34a-C001-17", "text": "It is essential to get a robust comparison and tested in rigid significance test." 
}, { "sent_id": "b0701d41baf3b355d864f46821f34a-C001-18", "text": "Moreover, models with both effective and efficient performance is impossible due to the No-Free-Lunch principle." }, { "sent_id": "b0701d41baf3b355d864f46821f34a-C001-19", "text": "Thus each model should be considered in a trade off between its effectiveness and efficiency." }, { "sent_id": "b0701d41baf3b355d864f46821f34a-C001-20", "text": "Out contribution is 1." }, { "sent_id": "b0701d41baf3b355d864f46821f34a-C001-21", "text": "A new open-source benchmark of text classification 1 with more than 20 models and 10 datasets." }, { "sent_id": "b0701d41baf3b355d864f46821f34a-C001-22", "text": "2." }, { "sent_id": "b0701d41baf3b355d864f46821f34a-C001-23", "text": "Systemic reconsideration of text classification in a trade off." }, { "sent_id": "b0701d41baf3b355d864f46821f34a-C001-24", "text": "----------------------------------" }, { "sent_id": "b0701d41baf3b355d864f46821f34a-C001-25", "text": "**MODELS**" }, { "sent_id": "b0701d41baf3b355d864f46821f34a-C001-26", "text": "Models are shown as follow: Fastext (Joulin et al., 2016) ." }, { "sent_id": "b0701d41baf3b355d864f46821f34a-C001-27", "text": "Sum with all the input embedding." }, { "sent_id": "b0701d41baf3b355d864f46821f34a-C001-28", "text": "LSTM." }, { "sent_id": "b0701d41baf3b355d864f46821f34a-C001-29", "text": "Basic LSTM (Hochreiter and Schmidhuber, 1997) Basic CNN." }, { "sent_id": "b0701d41baf3b355d864f46821f34a-C001-30", "text": "Convolution over the input embedding (Kalchbrenner et al., 2014) ." }, { "sent_id": "b0701d41baf3b355d864f46821f34a-C001-31", "text": "Multi-window CNN (Severyn and Moschitti, 2015) ." }, { "sent_id": "b0701d41baf3b355d864f46821f34a-C001-32", "text": "Padding input embedding for a fixed size and concat the feature maps after convolution." }, { "sent_id": "b0701d41baf3b355d864f46821f34a-C001-33", "text": "Multi-layer CNN." 
}, { "sent_id": "b0701d41baf3b355d864f46821f34a-C001-34", "text": "CNN with multi layers for high-level modelling." }, { "sent_id": "b0701d41baf3b355d864f46821f34a-C001-35", "text": "CNN with Inception." }, { "sent_id": "b0701d41baf3b355d864f46821f34a-C001-36", "text": "CNN with Inception mechanism (Szegedy et al., 2015) ." }, { "sent_id": "b0701d41baf3b355d864f46821f34a-C001-37", "text": "Capsules." }, { "sent_id": "b0701d41baf3b355d864f46821f34a-C001-38", "text": "CNN with Capsules Networks (Sabour et al., 2017) ." }, { "sent_id": "b0701d41baf3b355d864f46821f34a-C001-39", "text": "CNN inspired by Quantum." }, { "sent_id": "b0701d41baf3b355d864f46821f34a-C001-40", "text": "Neural representation inspired by Quantum Theory (Zhang et al., 2018; Niu et al., 2017) ." }, { "sent_id": "b0701d41baf3b355d864f46821f34a-C001-41", "text": "RCNN (Lai et al., 2015) ." }, { "sent_id": "b0701d41baf3b355d864f46821f34a-C001-42", "text": "LSTM with pooling mechanism." }, { "sent_id": "b0701d41baf3b355d864f46821f34a-C001-43", "text": "CRNN (Zhou et al., 2015) ." }, { "sent_id": "b0701d41baf3b355d864f46821f34a-C001-44", "text": "CNN After LSTM ." }, { "sent_id": "b0701d41baf3b355d864f46821f34a-C001-45", "text": "----------------------------------" }, { "sent_id": "b0701d41baf3b355d864f46821f34a-C001-46", "text": "**DATASET**" }, { "sent_id": "b0701d41baf3b355d864f46821f34a-C001-47", "text": "There are many datasets as showed in Tab." }, { "sent_id": "b0701d41baf3b355d864f46821f34a-C001-48", "text": "1" }, { "sent_id": "b0701d41baf3b355d864f46821f34a-C001-49", "text": "----------------------------------" }, { "sent_id": "b0701d41baf3b355d864f46821f34a-C001-50", "text": "**EVALUTION**" }, { "sent_id": "b0701d41baf3b355d864f46821f34a-C001-51", "text": "We adopt the Precision as the final evaluation metrics, which is widely used in the classification task." 
}, { "sent_id": "b0701d41baf3b355d864f46821f34a-C001-52", "text": "----------------------------------" }, { "sent_id": "b0701d41baf3b355d864f46821f34a-C001-53", "text": "**CONCLUSION**" }, { "sent_id": "b0701d41baf3b355d864f46821f34a-C001-54", "text": "As claimed in the introduction, A benchmark for text classification have been proposed to systemically compare these state-of-art models." }, { "sent_id": "b0701d41baf3b355d864f46821f34a-C001-55", "text": "Performance, Significance test, Effectiveness-efficiency Discussion, Case study, comparison between RNN and CNN, Embedding sensitive needs to be done." } ], "y": { "@BACK@": { "gold_contexts": [ [ "b0701d41baf3b355d864f46821f34a-C001-14", "b0701d41baf3b355d864f46821f34a-C001-15" ] ], "cite_sentences": [ "b0701d41baf3b355d864f46821f34a-C001-14" ] }, "@MOT@": { "gold_contexts": [ [ "b0701d41baf3b355d864f46821f34a-C001-14", "b0701d41baf3b355d864f46821f34a-C001-15" ] ], "cite_sentences": [ "b0701d41baf3b355d864f46821f34a-C001-14" ] } } }, "ABC_bbacc6539a7346e4f30a1ae42a636e_47": { "x": [ { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-2", "text": "In its early development, machine translation adopted rule-based approaches, which can include the use of language syntax." }, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-3", "text": "The late 1980s and early 1990s saw the inception of the statistical machine translation (SMT) approach, where translation models can be learned automatically from a parallel corpus rather than created manually by humans." }, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-4", "text": "Initial SMT models were word-based and phrase-based, without the use of syntactic knowledge." 
}, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-5", "text": "In phrase-based SMT, a source sentence is first segmented into phrases and then translated phrase-by-phrase with some reordering of the translated phrases in the target sentence." }, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-6", "text": "This has posed challenges when translating between two syntactically different languages." }, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-7", "text": "Syntax-based SMT approaches take advantage of syntactic knowledge within the framework of SMT." }, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-8", "text": "This book provides an introduction to syntax-based SMT approaches." }, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-9", "text": "It is a valuable resource for those who are interested in syntax-based SMT." }, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-10", "text": "The book consists of seven chapters." }, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-11", "text": "There is not an introduction chapter in this book, aside from the preface, which can be considered as a brief introduction." }, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-12", "text": "Readers are referred to Koehn (2010) for background knowledge." }, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-13", "text": "I think an introduction chapter categorized into sections would have been useful, before proceeding to describe the various models." }, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-14", "text": "The first two chapters provide principles applicable across various syntaxbased SMT approaches." }, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-15", "text": "The next three chapters describe syntax-based SMT decoding in detail; this constitutes half of the book." }, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-16", "text": "Selected extended topics are provided in the next chapter, which is followed by a concluding chapter." 
}, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-17", "text": "Chapter 1 describes the models and formalisms applicable to syntax-based SMT." }, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-18", "text": "The first section describes the phrasal translation units in phrase-based SMT, its limitations, and how tree structures address the limitations of the phrase-based approach." }, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-19", "text": "This explanation is useful as translation units are the key difference between the phrasebased and syntax-based SMT approaches." }, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-20", "text": "The next two sections describe the grammar formalisms and the statistical models that define syntax-based SMT." }, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-21", "text": "The section that covers the grammar formalisms (i.e., synchronous context-free grammar [SCFG] and synchronous tree-substitution grammar [STSG]), would have been clearer if their differences were presented in a side-by-side illustrating example." }, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-22", "text": "The remainder of the chapter discusses different categories of syntax-based SMT approaches and the history of these approaches, which include string-to-string, string-to-tree, tree-to-string, and" }, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-23", "text": "----------------------------------" }, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-24", "text": "****" }, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-25", "text": "tree-to-tree SMT approaches." }, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-26", "text": "Although the syntax-based translation model in Galley et al. 
(2006) falls under the string-to-tree category, I wonder why hierarchical phrasebased SMT, or Hiero (Chiang 2007) , is not explicitly put under the string-to-string category, since Hiero also uses \"unlabeled hierarchical phrases where there is no representation of linguistic categories.\"" }, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-27", "text": "Chapter 2 focuses on how the statistical framework of a syntax-based SMT approach learns its model from a word-aligned and parsed parallel text." }, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-28", "text": "The first section explains how phrase pairs are extracted as translation rules from a word-aligned sentence pair in phrase-based SMT (Koehn, Och, and Marcu 2003) , highlighting the definition of a phrase as a sequence of words and the alignment-consistency property of a phrase pair as defined in Och and Ney (2004) ." }, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-29", "text": "The remainder of the chapter introduces three predominant instantiations of syntax-based models: hierarchical phrase-based SMT (Hiero) (Chiang 2007) , which is a non-labeled syntax-based SMT approach arising from the phrase-based approach; syntax-augmented machine translation (SAMT), which introduces the notion of soft labels while keeping the nonlinguistic phrase notion; and GHKM (Galley et al. 2004 ), which only extracts translation rules consistent with constituency parse subtrees." }, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-30", "text": "This chapter is nicely organized and it is easy to follow the gradual evolution from phrase-based SMT to GHKM." }, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-31", "text": "Chapter 3 introduces the decoding formalism in the form of a directed hypergraph, defined as a set of vertices and a set of directed hyperedges." 
}, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-32", "text": "The first section introduces the notion of a weighted parse forest represented in a weighted hypergraph, representing alternative parse trees of a sentence." }, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-33", "text": "I found it important to pay careful attention to this section, in order to understand the next section and the following chapters." }, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-34", "text": "The next section presents various algorithms on a hypergraph to translate a sentence in a hypergraph representation of possible tree derivations." }, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-35", "text": "Overall, I found this chapter to contain many technical details." }, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-36", "text": "The last section of this chapter provides historical notes on the sources of these concepts." }, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-37", "text": "This chapter needs to be read before the next chapter, which assumes understanding of the concepts introduced in Chapter 3." }, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-38", "text": "Chapter 4 describes tree decoding-that is, decoding with the constituency parse tree of a source sentence as its input, focusing on the tree-to-string approach." }, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-39", "text": "The first two sections highlight decoding with local and non-local features, where non-local features accommodate n-gram language models and are more complex than local features." }, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-40", "text": "The next section is devoted to an in-depth description of a beam search algorithm on the parse tree of a source sentence." }, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-41", "text": "The description could have been improved if the running example showed the decoding steps." 
}, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-42", "text": "The next two sections present extensions to the concepts introduced in the earlier part of this chapter, by providing references to more efficient hypergraph operations." }, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-43", "text": "The content of this section requires readers who are interested in implementing an efficient tree-based algorithm to go through the cited references." }, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-44", "text": "Brief historical notes conclude this chapter nicely, by pointing to relevant materials for further reading." }, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-45", "text": "Chapter 5 describes string decoding with a source sentence string as its input." }, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-46", "text": "The first two sections describe beam search decoding algorithms in a binary SCFG, namely, a maximum of two non-terminal symbols on the right-hand side of each rule, adopted in Hiero and SAMT." }, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-47", "text": "The algorithms covered are a basic algorithm and an optimized algorithm." }, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-48", "text": "The complexity comparison between the two is nicely presented here, emphasizing the complexity reduction achieved by algorithm optimization." }, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-49", "text": "The handling of non-binary rules is described in the following section, illustrated by GHKM rule extraction." }, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-50", "text": "A mid-chapter summary section divides this chapter into two parts: beam search decoding and parsing." 
}, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-51", "text": "The second part describes parsing algorithms in the context of shared-category SCFG, assuming the same set of non-terminal symbols for the left-hand and right-hand sides of a rule, followed by a section extending the algorithm to STSG and distinct-category SCFG." }, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-52", "text": "The organization of this chapter is excellent." }, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-53", "text": "However, I feel that the inclusion of distinct-category SCFG decoding does not fit well into this chapter, as string decoding in string-to-tree SMT requires no knowledge of the source syntax." }, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-54", "text": "The historical notes also do not provide any references of prior work on string decoding using distinct-category SCFG." }, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-55", "text": "Chapter 6 contains various selected topics on syntax-based SMT." }, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-56", "text": "The first section discusses tree transformations, which make translation rule learning more effective." }, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-57", "text": "The description of non-context-free models serves as a prelude to the next section on dependency-based SMT, which covers dependency treelet (equivalent to the tree-tostring approach) and string-to-dependency (equivalent to the string-to-tree approach)." }, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-58", "text": "The next section focuses on the ability of syntax-based SMT to have a more grammatical output compared with phrase-based SMT, although there is still room for improvement, including the use of unification grammars and semantic properties." 
}, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-59", "text": "Finally, the last section of this chapter explains how MT evaluation benefits from syntax-based SMT principles." }, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-60", "text": "Overall, this chapter enriches readers' knowledge beyond basic syntax-based SMT in the earlier chapters." }, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-61", "text": "I would also suggest the inclusion of phrase-based decoding approaches that use syntax-based features (Cherry 2008; Chang et al. 2009 )." }, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-62", "text": "Chapter 7 nicely concludes this book by discussing the comparison between phrasebased and syntax-based SMT approaches and proposing possible future developments of syntax-based SMT." }, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-63", "text": "The chapter also highlights that syntax-driven MT predates statistical MT, as I mentioned at the beginning of this review." }, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-64", "text": "Overall, I found this book to be a useful reference book for those interested in syntax-based SMT." }, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-65", "text": "The book is well organized, which makes it easy for readers to refer to specific aspects of syntax-based SMT." }, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-66", "text": "An improvement can be made to the presentation of ideas in this book." }, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-67", "text": "Throughout the book, there are many technical keywords, resulting from the complexity of syntax-based SMT." }, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-68", "text": "It would be useful to highlight these keywords in a side bar to remind readers that they are important keywords." 
}, { "sent_id": "bbacc6539a7346e4f30a1ae42a636e-C001-69", "text": "In addition, although examples are given throughout the book, it would be even more useful to use these examples to illustrate how the algorithms work, so that readers can gain a better understanding of the algorithms." } ], "y": { "@UNSURE@": { "gold_contexts": [ [ "bbacc6539a7346e4f30a1ae42a636e-C001-26" ] ], "cite_sentences": [ "bbacc6539a7346e4f30a1ae42a636e-C001-26" ] }, "@BACK@": { "gold_contexts": [ [ "bbacc6539a7346e4f30a1ae42a636e-C001-29" ] ], "cite_sentences": [ "bbacc6539a7346e4f30a1ae42a636e-C001-29" ] } } }, "ABC_4b8de05982074c30f4d03af60827cb_47": { "x": [ { "sent_id": "4b8de05982074c30f4d03af60827cb-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "4b8de05982074c30f4d03af60827cb-C001-2", "text": "A long-term goal of AI research is to build intelligent agents that can see the rich visual environment around us, communicate this understanding in natural language to humans and other agents, and act in a physical or embodied environment." }, { "sent_id": "4b8de05982074c30f4d03af60827cb-C001-3", "text": "To this end, recent advances at the intersection of language and vision have made incredible progress -from being able to generate natural language descriptions of images/videos, to answering questions about them, to even holding freeform conversations about visual content!" }, { "sent_id": "4b8de05982074c30f4d03af60827cb-C001-4", "text": "However, while these agents can passively describe images or answer (a sequence of) questions about them, they cannot act in the world (what if I cannot answer a question from my current view, or I am asked to move or manipulate something?)." }, { "sent_id": "4b8de05982074c30f4d03af60827cb-C001-5", "text": "Thus, the challenge now is to extend this progress in language and vision to embodied agents that take actions and actively interact with their visual environments." 
}, { "sent_id": "4b8de05982074c30f4d03af60827cb-C001-6", "text": "----------------------------------" }, { "sent_id": "4b8de05982074c30f4d03af60827cb-C001-7", "text": "**TUTORIAL OVERVIEW**" }, { "sent_id": "4b8de05982074c30f4d03af60827cb-C001-8", "text": "This tutorial will provide an overview of the growing number of multimodal tasks and datasets that combine textual and visual understanding." }, { "sent_id": "4b8de05982074c30f4d03af60827cb-C001-9", "text": "We will comprehensively review existing stateof-the-art approaches to selected tasks such as image captioning (Chen et al., 2015) , visual question answering (VQA) (Antol et al., 2015; Goyal et al., 2017) and visual dialog (Das et al., 2017a,b) , presenting the key architectural building blocks (such as co-attention (Lu et al., 2016) ) and novel algorithms (such as cooperative/adversarial games (Das et al., 2017b) ) used to train models for these tasks." }, { "sent_id": "4b8de05982074c30f4d03af60827cb-C001-10", "text": "We will then discuss some of the current and upcoming challenges of combining language, vision and actions, and introduce some recently-released interactive 3D simulation environments designed for this purpose (Anderson et al., 2018b; Das et al., 2018) ." }, { "sent_id": "4b8de05982074c30f4d03af60827cb-C001-11", "text": "The goal of this tutorial is to provide a comprehensive yet accessible overview of existing work and to reduce the entry barrier for new researchers." }, { "sent_id": "4b8de05982074c30f4d03af60827cb-C001-12", "text": "In detail, we will first review the building blocks of the neural network architectures used for these tasks, starting from variants of recurrent sequenceto-sequence language models (Ilya Sutskever, 2014), applied to image captioning (Vinyals et al., 2015) , optionally with visual attentional mechanisms (Bahdanau et al., 2015; Xu et al., 2015; You et al., 2016; Anderson et al., 2018a )." 
}, { "sent_id": "4b8de05982074c30f4d03af60827cb-C001-13", "text": "We will then look at evaluation metrics for image captioning Anderson et al., 2016) , before reviewing how these metrics can be optimized directly using reinforcement learning (RL) (Williams, 1992; Rennie et al., 2017) ." }, { "sent_id": "4b8de05982074c30f4d03af60827cb-C001-14", "text": "Next, on the topic of visual question answering, we will look at more sophisticated multimodal attention mechanisms, wherein the network simultaneously attends to visual and textual features (Fukui et al., 2016; Lu et al., 2016) ." }, { "sent_id": "4b8de05982074c30f4d03af60827cb-C001-15", "text": "We will see how to combine factual and commonsense reasoning from learnt memory representations (Sukhbaatar et al., 2015) and external knowledge bases , and approaches that use the question to dynamically compose the answering neural network from specialized modules (Andreas et al., 2016a,b; Johnson et al., 2017a,b; Hu et al., 2017) ." }, { "sent_id": "4b8de05982074c30f4d03af60827cb-C001-16", "text": "Following the success of adversarial learning in visual recognition (Goodfellow et al., 2014) , it has recently been gaining momentum in language modeling (Yu et al., 2016) and in multimodal tasks such as captioning (Dai et al., 2017) and dialog (Wu et al., 2018a) ." }, { "sent_id": "4b8de05982074c30f4d03af60827cb-C001-17", "text": "Within visual dia-log, we will look at recent work that uses cooperative multi-agent tasks as a proxy for training effective visual conversational models via RL (Kottur et al., 2017; Das et al., 2017b) ." }, { "sent_id": "4b8de05982074c30f4d03af60827cb-C001-18", "text": "Finally, as a move away from static datasets, we will cover recent work on building active RL environments for language-vision tasks." 
}, { "sent_id": "4b8de05982074c30f4d03af60827cb-C001-19", "text": "Although models that link vision, language and actions have a rich history (Tellex et al., 2011; Paul et al., 2016; Misra et al., 2017) , we will focus primarily on embodied 3D environments (Anderson et al., 2018b; , considering tasks such as visual navigation from natural language instructions (Anderson et al., 2018b) , and question answering (Das et al., 2018; Gordon et al., 2018) ." }, { "sent_id": "4b8de05982074c30f4d03af60827cb-C001-20", "text": "We will position this work in context of related simulators that also offer significant potential for grounded language learning (Beattie et al., 2016; Zhu et al., 2017) ." }, { "sent_id": "4b8de05982074c30f4d03af60827cb-C001-21", "text": "To finish, we will discuss some of the challenges in developing agents for these tasks, as they need to be able to combine active perception, language grounding, commonsense reasoning and appropriate long-term credit assignment to succeed." }, { "sent_id": "4b8de05982074c30f4d03af60827cb-C001-22", "text": "----------------------------------" }, { "sent_id": "4b8de05982074c30f4d03af60827cb-C001-23", "text": "**STRUCTURE**" }, { "sent_id": "4b8de05982074c30f4d03af60827cb-C001-24", "text": "The following structure is based on an approximately 3 hour timeframe with a break." }, { "sent_id": "4b8de05982074c30f4d03af60827cb-C001-25", "text": "Peter Anderson is a final year PhD candidate in Computer Science at the Australian National University, supervised by Dr Stephen Gould, and a researcher within the Australian Centre for Robotic Vision (ACRV)." }, { "sent_id": "4b8de05982074c30f4d03af60827cb-C001-26", "text": "His PhD focuses on deep learning for visual understanding in natural language." 
}, { "sent_id": "4b8de05982074c30f4d03af60827cb-C001-27", "text": "He was an integral member of the team that won first place in the 2017 Visual Question Answering (VQA) challenge at CVPR, and he has made several contributions in image captioning, including achieving first place on the COCO leaderboard in July 2017." }, { "sent_id": "4b8de05982074c30f4d03af60827cb-C001-28", "text": "He has published at CVPR, ECCV, EMNLP and ICRA, and spent time at numerous universities and research labs including Adelaide University, Macquarie University, and Microsoft Research." }, { "sent_id": "4b8de05982074c30f4d03af60827cb-C001-29", "text": "His research is currently focused on vision-and-language understanding in complex 3D environments." }, { "sent_id": "4b8de05982074c30f4d03af60827cb-C001-30", "text": "----------------------------------" }, { "sent_id": "4b8de05982074c30f4d03af60827cb-C001-31", "text": "**ABHISHEK DAS**" }, { "sent_id": "4b8de05982074c30f4d03af60827cb-C001-32", "text": "Abhishek Das is a Computer Science PhD student at Georgia Institute of Technology, advised by Dhruv Batra, and working closely with Devi Parikh. He is interested in deep learning and its applications in building agents that can see (computer vision), think (reasoning and interpretability), talk (language modeling) and act (reinforcement learning)." }, { "sent_id": "4b8de05982074c30f4d03af60827cb-C001-33", "text": "He is a recipient of an Adobe Research Fellowship and a Snap Research Fellowship." }, { "sent_id": "4b8de05982074c30f4d03af60827cb-C001-34", "text": "He has published at CVPR, ICCV, EMNLP, HCOMP and CVIU, co-organized the NIPS 2017 workshop on Visually-Grounded Interaction and Language, and has held visiting positions at Virginia Tech, Queensland Brain Institute and Facebook AI Research." }, { "sent_id": "4b8de05982074c30f4d03af60827cb-C001-35", "text": "He graduated from Indian Institute of Technology Roorkee in 2015 with a Bachelor's in Electrical Engineering." 
}, { "sent_id": "4b8de05982074c30f4d03af60827cb-C001-36", "text": "----------------------------------" }, { "sent_id": "4b8de05982074c30f4d03af60827cb-C001-37", "text": "**QI WU**" }, { "sent_id": "4b8de05982074c30f4d03af60827cb-C001-38", "text": "Dr. Qi Wu, is a research fellow in the Australia Centre for Robotic Vision (ACRV) in the University of Adelaide." }, { "sent_id": "4b8de05982074c30f4d03af60827cb-C001-39", "text": "Before that, he was a postdoc researcher in the Australia Centre for Visual Technologies (ACVT) in the University of Adelaide." }, { "sent_id": "4b8de05982074c30f4d03af60827cb-C001-40", "text": "He obtained his PhD degree in 2015 and MSc degree in 2011, in Computer Science from University of Bath, United Kingdom." }, { "sent_id": "4b8de05982074c30f4d03af60827cb-C001-41", "text": "His research interests are mainly in Computer Vision and Machine Learning." }, { "sent_id": "4b8de05982074c30f4d03af60827cb-C001-42", "text": "Currently, he is working on the visionto-language problem and he is especially an expert in the area of Image Captioning and Visual Question Answering (VQA)." }, { "sent_id": "4b8de05982074c30f4d03af60827cb-C001-43", "text": "His attributes-based image captioning model got first place on the COCO Image Captioning Challenge Leader Board in the October of 2015." }, { "sent_id": "4b8de05982074c30f4d03af60827cb-C001-44", "text": "He has published several papers in prestigious conferences and journals, such as TPAMI, CVPR, ICCV, ECCV, IJCAI and AAAI." 
} ], "y": { "@UNSURE@": { "gold_contexts": [ [ "4b8de05982074c30f4d03af60827cb-C001-9" ], [ "4b8de05982074c30f4d03af60827cb-C001-17" ] ], "cite_sentences": [ "4b8de05982074c30f4d03af60827cb-C001-9", "4b8de05982074c30f4d03af60827cb-C001-17" ] } } }, "ABC_976e5cf02dc5430c0b1393d3cb38e5_47": { "x": [ { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-2", "text": "In its early development, machine translation adopted rule-based approaches, which can include the use of language syntax." }, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-3", "text": "The late 1980s and early 1990s saw the inception of the statistical machine translation (SMT) approach, where translation models can be learned automatically from a parallel corpus rather than created manually by humans." }, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-4", "text": "Initial SMT models were word-based and phrase-based, without the use of syntactic knowledge." }, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-5", "text": "In phrase-based SMT, a source sentence is first segmented into phrases and then translated phrase-by-phrase with some reordering of the translated phrases in the target sentence." }, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-6", "text": "This has posed challenges when translating between two syntactically different languages." }, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-7", "text": "Syntax-based SMT approaches take advantage of syntactic knowledge within the framework of SMT." }, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-8", "text": "This book provides an introduction to syntax-based SMT approaches." }, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-9", "text": "It is a valuable resource for those who are interested in syntax-based SMT." }, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-10", "text": "The book consists of seven chapters." 
}, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-11", "text": "There is not an introduction chapter in this book, aside from the preface, which can be considered as a brief introduction." }, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-12", "text": "Readers are referred to Koehn (2010) for background knowledge." }, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-13", "text": "I think an introduction chapter categorized into sections would have been useful, before proceeding to describe the various models." }, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-14", "text": "The first two chapters provide principles applicable across various syntaxbased SMT approaches." }, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-15", "text": "The next three chapters describe syntax-based SMT decoding in detail; this constitutes half of the book." }, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-16", "text": "Selected extended topics are provided in the next chapter, which is followed by a concluding chapter." }, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-17", "text": "Chapter 1 describes the models and formalisms applicable to syntax-based SMT." }, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-18", "text": "The first section describes the phrasal translation units in phrase-based SMT, its limitations, and how tree structures address the limitations of the phrase-based approach." }, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-19", "text": "This explanation is useful as translation units are the key difference between the phrasebased and syntax-based SMT approaches." }, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-20", "text": "The next two sections describe the grammar formalisms and the statistical models that define syntax-based SMT." 
}, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-21", "text": "The section that covers the grammar formalisms (i.e., synchronous context-free grammar [SCFG] and synchronous tree-substitution grammar [STSG]), would have been clearer if their differences were presented in a side-by-side illustrating example." }, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-22", "text": "The remainder of the chapter discusses different categories of syntax-based SMT approaches and the history of these approaches, which include string-to-string, string-to-tree, tree-to-string, and" }, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-23", "text": "----------------------------------" }, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-24", "text": "****" }, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-25", "text": "tree-to-tree SMT approaches." }, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-26", "text": "Although the syntax-based translation model in Galley et al. (2006) falls under the string-to-tree category, I wonder why hierarchical phrasebased SMT, or Hiero (Chiang 2007) , is not explicitly put under the string-to-string category, since Hiero also uses \"unlabeled hierarchical phrases where there is no representation of linguistic categories.\"" }, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-27", "text": "Chapter 2 focuses on how the statistical framework of a syntax-based SMT approach learns its model from a word-aligned and parsed parallel text." }, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-28", "text": "The first section explains how phrase pairs are extracted as translation rules from a word-aligned sentence pair in phrase-based SMT (Koehn, Och, and Marcu 2003) , highlighting the definition of a phrase as a sequence of words and the alignment-consistency property of a phrase pair as defined in Och and Ney (2004) ." 
}, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-29", "text": "The remainder of the chapter introduces three predominant instantiations of syntax-based models: hierarchical phrase-based SMT (Hiero) (Chiang 2007) , which is a non-labeled syntax-based SMT approach arising from the phrase-based approach; syntax-augmented machine translation (SAMT), which introduces the notion of soft labels while keeping the nonlinguistic phrase notion; and GHKM (Galley et al. 2004 ), which only extracts translation rules consistent with constituency parse subtrees." }, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-30", "text": "This chapter is nicely organized and it is easy to follow the gradual evolution from phrase-based SMT to GHKM." }, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-31", "text": "Chapter 3 introduces the decoding formalism in the form of a directed hypergraph, defined as a set of vertices and a set of directed hyperedges." }, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-32", "text": "The first section introduces the notion of a weighted parse forest represented in a weighted hypergraph, representing alternative parse trees of a sentence." }, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-33", "text": "I found it important to pay careful attention to this section, in order to understand the next section and the following chapters." }, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-34", "text": "The next section presents various algorithms on a hypergraph to translate a sentence in a hypergraph representation of possible tree derivations." }, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-35", "text": "Overall, I found this chapter to contain many technical details." }, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-36", "text": "The last section of this chapter provides historical notes on the sources of these concepts." 
}, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-37", "text": "This chapter needs to be read before the next chapter, which assumes understanding of the concepts introduced in Chapter 3." }, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-38", "text": "Chapter 4 describes tree decoding-that is, decoding with the constituency parse tree of a source sentence as its input, focusing on the tree-to-string approach." }, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-39", "text": "The first two sections highlight decoding with local and non-local features, where non-local features accommodate n-gram language models and are more complex than local features." }, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-40", "text": "The next section is devoted to an in-depth description of a beam search algorithm on the parse tree of a source sentence." }, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-41", "text": "The description could have been improved if the running example showed the decoding steps." }, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-42", "text": "The next two sections present extensions to the concepts introduced in the earlier part of this chapter, by providing references to more efficient hypergraph operations." }, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-43", "text": "The content of this section requires readers who are interested in implementing an efficient tree-based algorithm to go through the cited references." }, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-44", "text": "Brief historical notes conclude this chapter nicely, by pointing to relevant materials for further reading." }, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-45", "text": "Chapter 5 describes string decoding with a source sentence string as its input." 
}, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-46", "text": "The first two sections describe beam search decoding algorithms in a binary SCFG, namely, a maximum of two non-terminal symbols on the right-hand side of each rule, adopted in Hiero and SAMT." }, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-47", "text": "The algorithms covered are a basic algorithm and an optimized algorithm." }, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-48", "text": "The complexity comparison between the two is nicely presented here, emphasizing the complexity reduction achieved by algorithm optimization." }, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-49", "text": "The handling of non-binary rules is described in the following section, illustrated by GHKM rule extraction." }, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-50", "text": "A mid-chapter summary section divides this chapter into two parts: beam search decoding and parsing." }, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-51", "text": "The second part describes parsing algorithms in the context of shared-category SCFG, assuming the same set of non-terminal symbols for the left-hand and right-hand sides of a rule, followed by a section extending the algorithm to STSG and distinct-category SCFG." }, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-52", "text": "The organization of this chapter is excellent." }, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-53", "text": "However, I feel that the inclusion of distinct-category SCFG decoding does not fit well into this chapter, as string decoding in string-to-tree SMT requires no knowledge of the source syntax." }, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-54", "text": "The historical notes also do not provide any references of prior work on string decoding using distinct-category SCFG." }, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-55", "text": "Chapter 6 contains various selected topics on syntax-based SMT." 
}, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-56", "text": "The first section discusses tree transformations, which make translation rule learning more effective." }, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-57", "text": "The description of non-context-free models serves as a prelude to the next section on dependency-based SMT, which covers dependency treelet (equivalent to the tree-tostring approach) and string-to-dependency (equivalent to the string-to-tree approach)." }, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-58", "text": "The next section focuses on the ability of syntax-based SMT to have a more grammatical output compared with phrase-based SMT, although there is still room for improvement, including the use of unification grammars and semantic properties." }, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-59", "text": "Finally, the last section of this chapter explains how MT evaluation benefits from syntax-based SMT principles." }, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-60", "text": "Overall, this chapter enriches readers' knowledge beyond basic syntax-based SMT in the earlier chapters." }, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-61", "text": "I would also suggest the inclusion of phrase-based decoding approaches that use syntax-based features (Cherry 2008; Chang et al. 2009 )." }, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-62", "text": "Chapter 7 nicely concludes this book by discussing the comparison between phrasebased and syntax-based SMT approaches and proposing possible future developments of syntax-based SMT." }, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-63", "text": "The chapter also highlights that syntax-driven MT predates statistical MT, as I mentioned at the beginning of this review." }, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-64", "text": "Overall, I found this book to be a useful reference book for those interested in syntax-based SMT." 
}, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-65", "text": "The book is well organized, which makes it easy for readers to refer to specific aspects of syntax-based SMT." }, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-66", "text": "An improvement can be made to the presentation of ideas in this book." }, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-67", "text": "Throughout the book, there are many technical keywords, resulting from the complexity of syntax-based SMT." }, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-68", "text": "It would be useful to highlight these keywords in a side bar to remind readers that they are important keywords." }, { "sent_id": "976e5cf02dc5430c0b1393d3cb38e5-C001-69", "text": "In addition, although examples are given throughout the book, it would be even more useful to use these examples to illustrate how the algorithms work, so that readers can gain a better understanding of the algorithms." } ], "y": { "@BACK@": { "gold_contexts": [ [ "976e5cf02dc5430c0b1393d3cb38e5-C001-26" ], [ "976e5cf02dc5430c0b1393d3cb38e5-C001-29" ] ], "cite_sentences": [ "976e5cf02dc5430c0b1393d3cb38e5-C001-26", "976e5cf02dc5430c0b1393d3cb38e5-C001-29" ] } } }, "ABC_053ce92029e643bfade157b3172c05_47": { "x": [ { "sent_id": "053ce92029e643bfade157b3172c05-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-2", "text": "We discuss a number of practical issues that have arisen in the development of a wide-coverage lexicalized grammar for English." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-3", "text": "In particular, we consider the way in which the design of the grammar and of its encoding was influenced by issues relating to the size of the grammar." 
}, { "sent_id": "053ce92029e643bfade157b3172c05-C001-4", "text": "----------------------------------" }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-5", "text": "**INTRODUCTION**" }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-6", "text": "Hand-crafting a wide-coverage grammar is a difficult task, requiring consideration of a seemingly endless number of constructions in an attempt to produce a treatment that is as uniform and comprehensive as possible." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-7", "text": "In this paper we discuss a number of practical issues that have arisen in the development of a wide-coverage lexicalized grammar for English: the LEXSYS grammar." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-8", "text": "In particular, we consider the way in which the design of the grammar and of its encoding-from the viewpoint both of the grammar writer and of the parsing mechanism-was influenced by issues relating to the size of the grammar." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-9", "text": "One criterion that is often used as a judge of grammar quality is the extent to which 'linguistic generalizations' have been captured." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-10", "text": "Generally speaking, concern over this issue leads to a preference for smaller rather than larger grammars." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-11", "text": "A second reason for pr eferring smaller grammar sizes is on the basis of parsing efficiency, since the running time of parsing algorithms generally depends on the size of the grammar." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-12", "text": "However, a rather different criterion determining grammar quality has to do with the analyses that the grammar assigns to sentences: in particular, the extent to which they provide a good basis for further, perhaps deeper processing." 
}, { "sent_id": "053ce92029e643bfade157b3172c05-C001-13", "text": "It is not necessarily the case that this criterion is compatible with the desire to minimize grammar size." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-14", "text": "In developing the LEXSYS grammar we have explored the consequences of giving the grammar writer the freedom to write a grammar that maximizes analysis quality without any regard for grammar size." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-15", "text": "In the next three sections we present detailed statistics for the current LEXSYS grammar that give an indication of what the grammar contains, its current size, and why it has grown to this size." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-16", "text": "In order to ease the process of engineering such a large grammar, we have made use of the lexical knowledge representation language DATR (Evans & Gazdar, 1996) to compactly encode the elementary trees (Evans et al., 1995; Smets & Evans, 1998) ." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-17", "text": "In Section 5 we present some figures that show how the size of the encoding of the grammar has increased during the grammar development process as the number and complexity of elementary trees has grown." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-18", "text": "We have addressed problems that result from trying to parse with such a large grammar by using a technique proposed by (Evans & Weir, 1997) and (Evans & Weir, 1998) in which all the trees that each word can anchor are compactly represented using a collection of finite state automata." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-19", "text": "In Section 6 we give some data that shows the extent to which this technique is successful in compacting the grammar." 
}, { "sent_id": "053ce92029e643bfade157b3172c05-C001-20", "text": "----------------------------------" }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-21", "text": "**COVERAGE OF THE LEXSYS GRAMMAR**" }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-22", "text": "The LEXSYS grammar has roughly the same coverage as the Alvey NL Tools grammar (Grover et al., 1993) , and adopts the same set of subcategorization frames as in the Alvey lexicon." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-23", "text": "There are at present 143 families in the grammar." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-24", "text": "Each family contains the base tree of the family, and definitions of lexical rules which derive trees from the base tree." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-25", "text": "There are currently 88 lexical rules." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-26", "text": "Possible rule combinations are determined automatically (see (Smets & Evans, 1998) )." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-27", "text": "There are 7 noun and pronoun families." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-28", "text": "The noun families include trees for bare nouns, for small clauses headed by a noun, for noun-noun modifiers and for coordination." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-29", "text": "Coordination can be at the N,N or NP levels." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-30", "text": "There are 19 adjective families, distinguished according to the position of the adjective and its subcategorization frames." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-31", "text": "Trees derived by lexical rules include small clauses headed by an adjective, comparative constructions, trees with unbounded dependencies for adjectives which subcategorize for a complement (wh-questions, relative clauses, topicalization), a tree for tough-movement, and trees for coordination." 
}, { "sent_id": "053ce92029e643bfade157b3172c05-C001-32", "text": "Numerals also anchor adjective trees." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-33", "text": "Rules derive from the base tree uses of numerals as pronouns and nouns, and coordination of cardinal numbers (for example, hundred and ten)." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-34", "text": "However, the grammar does not as yet have a complete account of complex numerals." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-35", "text": "For ordinals, there are rules to derive fractions with complement, fractions without complement, and the use of ordinals as degree specifiers." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-36", "text": "Adverbs are distinguished according to whether they are complements or modifiers." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-37", "text": "Modifier trees differ according to the modified category and the relative position of the adverb and its argument." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-38", "text": "Rules derive coordinated structures headed by adverbs, and also adverb distribution." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-39", "text": "Long distance dependencies possibly involving adverbs (for example, How did he behave) are handled in the PP modifier family." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-40", "text": "1 The grammar contains an account of constituent and sentential negation (but in the latter disregarding scope issues arising when an adverb comes in between the auxiliary and the negation)." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-41", "text": "Specifier families include families for determiners, quantifying pre-determiners and genitive determiners." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-42", "text": "There is also a family for adjective and adverb specifiers." 
}, { "sent_id": "053ce92029e643bfade157b3172c05-C001-43", "text": "Prepositions followed by an NP are divided into two families: a family for case-marking prepositions and a family for predicative prepositions." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-44", "text": "These two types of prepositions differ in their semantic content, and syntactically also: case-marking prepositions do not head PP-modifiers." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-45", "text": "The case-marking preposition family includes trees for long-distance dependencies with preposition stranding (wh-questions, relative clause, tough-constructions) and trees for coordination." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-46", "text": "The family of predicative prepositions inherits these trees, and also contains trees for adjunct preposition phrases and long-distance dependencies involving adjunct PPs." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-47", "text": "There are also families for prepositions introducing Ss, VPs, PPs and AP." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-48", "text": "There are two families for complementizers (introducing an S or a VP)." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-49", "text": "The 94 verb families constitute the bulk of the grammar." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-50", "text": "Verb families include trees 2 for gerunds (nominal and verbal), long-distance dependencies (topicalization, relative clause and wh-questions), VP complements, VP complements for tough-constructions, small clauses (headed by a present participle or a passive verb), for-to clauses, extraposition, imperative, passive with or without by, inversion (for auxiliaries and modals), VP-ellipsis (after auxiliaries and modals), dative alternation, movement of particles, and coordination (at V, VP and S)." 
}, { "sent_id": "053ce92029e643bfade157b3172c05-C001-51", "text": "Finally, we have recently extended the grammar to include semantic features capturing predicate argument structures." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-52", "text": "We have not implemented quantification yet." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-53", "text": "The grammar adopts a semantic representation inspired by the Minimal Recursion Semantics (MRS) framework (Copestake et al., unpublished) ." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-54", "text": "MRS representations are flat lists of elementary predications, with relations between predications being expressed through coindexation." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-55", "text": "----------------------------------" }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-56", "text": "**LOCALIZATION OF SYNTACTIC DEPENDENCIES**" }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-57", "text": "The LEXSYS grammar has been designed to localize syntactic dependencies, not only unbounded dependencies between filler and gap, but agreement relations, case and mode of the clause, etc." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-58", "text": "(Carroll et al., 1999) ." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-59", "text": "One immediate advantage is that there is no need for feature percolation during parsing: all syntactic features are grounded during anchoring." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-60", "text": "There are, however, a few cases where all syntactic features cannot be localized in the same tree." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-61", "text": "This happens when the values of syntactic features are determined by more than one constituent." 
}, { "sent_id": "053ce92029e643bfade157b3172c05-C001-62", "text": "This is the case, for example, in raising constructions: the subject raising verb agrees with its syntactic subject but the complement of the raising verb (adjective or verb) determines the category of the subject." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-63", "text": "In such cases, feature percolation is needed, unless one defines trees for all the possible feature combinations." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-64", "text": "This is what we have done in the grammar, and 9 more trees are needed to that effect." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-65", "text": "In there-constructions, the NP following the verb (be) determines the agreement of the verb." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-66", "text": "This does not represent a problem if the dependency is local." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-67", "text": "However, if a subject raising verb comes in between there and the rest of the sentence, agreement cannot be determined locally anymore." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-68", "text": "We need one more tree to cover both possible instantiations of agreement features." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-69", "text": "Finally, PP phrases can involve a wh-NP or a rel-NP, and thus must be specified as such." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-70", "text": "Because the head of PPs does not set that feature, feature percolation would be needed between the NP and the root of the PP." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-71", "text": "In the grammar, we define three PP trees, one for each possible instantiation of that feature." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-72", "text": "Thus, two more trees are needed than if we had feature percolation." 
}, { "sent_id": "053ce92029e643bfade157b3172c05-C001-73", "text": "In all the above cases, the specification of all possible feature combinations does not involve the creation of many more trees." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-74", "text": "However, from a linguistic point of view, we do miss generalizations." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-75", "text": "With coordination, however, the problem is not the loss of linguistic generalizations, but the substantial increase in the number of trees." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-76", "text": "Indeed, coordination trees are anchored by the head of one of the coordinated constituents." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-77", "text": "The advantage of this is that constraints on the coordination phrase are defined at anchoring." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-78", "text": "But the disadvantage is that this doubles the number of trees in the grammar: every structure can occur in coordination." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-79", "text": "----------------------------------" }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-80", "text": "**ANCHORED TREES**" }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-81", "text": "The previous two sections discussed the coverage of the grammar, and how some decisions have increased the number of unanchored trees." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-82", "text": "Another important property of the grammar is the number of trees that result from anchoring with lexical items." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-83", "text": "We find that some verbs induce a very large number of anchored trees: for example, get results in 2847 trees, put 3465, come 2656, and turn 1425." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-84", "text": "To illustrate why, consider get." 
}, { "sent_id": "053ce92029e643bfade157b3172c05-C001-85", "text": "First, get has 17 different subcategorization frames (it can be transitive, ditransitive, it can have a prepositional complement, be followed by one or more particles, etc.)." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-86", "text": "It therefore belongs to 17 different families, and each family contains a number of trees (for example, the V PP family, selected by get, has 33 trees, and the V NP PPto family contains 146 trees)." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-87", "text": "Moreover, when a lexical item anchors a tree, features get grounded, and different feature instantiations characterize different trees." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-88", "text": "For example, get can be followed by one of 12 different prepositions, which means that there are at least 12 \u00d7 33 trees for that single subcategorization frame. [Table 1: distribution of tree sets by number of trees in the set, giving the number of sets, the mean numbers of merged and minimized states, and the ratio of merged to minimized states; e.g., for sets of 1-10 trees: 112 sets, 17.9 merged states, 6.9 minimized states, ratio 2.6.]" }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-89", "text": "----------------------------------" }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-90", "text": "**ENCODING FOR GRAMMAR DEVELOPMENT**" }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-91", "text": "Following (Evans et al., 1995) and (Smets & Evans, 1998), the LEXSYS grammar is encoded using DATR, a non-monotonic knowledge representation language." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-92", "text": "In 1998, the grammar contained 620 trees organized into 44 tree families and produced using 35 rules." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-93", "text": "This grammar was encoded in 2200 DATR statements, giving an average of 3.55 DATR statements per tree." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-94", "text": "The grammar currently contains around 4000 trees in 143 families produced with 88 rules." 
}, { "sent_id": "053ce92029e643bfade157b3172c05-C001-95", "text": "This grammar is encoded with around 5300 DATR statements, giving an average of 1.325 statements per tree." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-96", "text": "Thus, as the grammar has grown, the number of DATR statements needed to encode it has grown, but not as rapidly." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-97", "text": "----------------------------------" }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-98", "text": "**ENCODING FOR PARSING**" }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-99", "text": "Following (Evans & Weir, 1997) and (Evans & Weir, 1998), each elementary tree is encoded as a finite state automaton that specifies an accepting traversal of the tree from anchor to root." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-100", "text": "For each input word, the set of all the alternative trees that can anchor an input word can be captured in just one such automaton, which can be minimized in the standard way, and then used for parsing." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-101", "text": "In order to assess the extent to which this technique alleviates the problem of grammar size, we produced automata for the words appearing in the 1426 sentences (mean length 5.70 words) forming the Alvey NL Tools grammar development test suite." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-102", "text": "Each sentence was processed by a morphological analyser, and the result was then used in conjunction with the lexicon to determine for each word in the sentence the complete set of anchored trees, feature values being determined by the morphological analyser or lexicon as appropriate." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-103", "text": "474 distinct sets of anchored trees ('tree sets') were produced in this way, ranging in size from 1 to 3465 trees." 
}, { "sent_id": "053ce92029e643bfade157b3172c05-C001-104", "text": "The total number of anchored trees was 24198, with a mean of 175.5 trees in each tree set." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-105", "text": "Before parsing, the trees in each tree set are stripped of their anchor, merged into a single automaton and minimized; at parse time the relevant automaton is retrieved and the appropriate anchoring lexical item inserted." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-106", "text": "Table 1 shows what happens when the tree sets are converted into automata and minimized, giving figures for the distribution of tree sets, mean numbers of merged and minimized states in each tree set, and ratios of numbers of merged and minimized states." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-107", "text": "What is not clear from Table 1 is how often each of the 474 distinct tree sets occurred in the test sentences." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-108", "text": "This is shown in Table 2, which gives the numbers and mean sizes of tree sets (number of trees and minimized states) relative to the number of times they occurred in the test suite sentences." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-109", "text": "This shows that the larger tree sets tend to occur less often than small ones, and that very few of those tree sets containing more than 100 trees anchored more than 10 of the more than 8100 word tokens in the test sentences." }, { "sent_id": "053ce92029e643bfade157b3172c05-C001-110", "text": "The results we have presented in this section appear to show that by encoding the anchoring possibilities for words with minimized automata we are able to alleviate the grammar size problem to a considerable extent." 
} ], "y": { "@USE@": { "gold_contexts": [ [ "053ce92029e643bfade157b3172c05-C001-16" ], [ "053ce92029e643bfade157b3172c05-C001-91" ] ], "cite_sentences": [ "053ce92029e643bfade157b3172c05-C001-16", "053ce92029e643bfade157b3172c05-C001-91" ] }, "@SIM@": { "gold_contexts": [ [ "053ce92029e643bfade157b3172c05-C001-91" ] ], "cite_sentences": [ "053ce92029e643bfade157b3172c05-C001-91" ] } } }, "ABC_7f2383426952ddde6fd527cf0d63bd_47": { "x": [ { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-2", "text": "The paper steps outside the comfort-zone of the traditional NLP tasks like automatic speech recognition (ASR) and machine translation (MT) to address two novel problems arising in the automated multilingual news monitoring: segmentation of the TV and radio program ASR transcripts into individual stories, and clustering of the individual stories coming from various sources and languages into storylines." }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-3", "text": "Storyline clustering of stories covering the same events is an essential task for inquisitorial media monitoring." }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-4", "text": "We address these two problems jointly by engaging the low-dimensional semantic representation capabilities of the sequence to sequence neural translation models." }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-5", "text": "To enable joint multi-task learning for multilingual neural translation of morphologically rich languages we replace the attention mechanism with the sliding-window mechanism and operate the sequence to sequence neural translation model on the character level rather than on the word level." }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-6", "text": "The story segmentation and storyline clustering problem is tackled by examining the low-dimensional vectors produced as a side-product of the neural translation process." 
}, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-7", "text": "The results of this paper describe a novel approach to the automatic story segmentation and storyline clustering problem." }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-8", "text": "----------------------------------" }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-9", "text": "**THE SUMMA PROJECT OVERVIEW**" }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-10", "text": "Media monitoring enables the global news media to be viewed in terms of emerging trends, people in the news, and the evolution of storylines (Risen et al., 2013)." }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-11", "text": "The massive growth in the number of broadcast and Internet media channels requires innovative ways to cope with this increasing amount of data." }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-12", "text": "It is the aim of the SUMMA project to significantly improve media monitoring by creating a platform to automate the analysis of media streams across many languages." }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-13", "text": "Within the SUMMA project, three European news organizations (the BBC, Deutsche Welle, and the Latvian news agency LETA) are joining forces with the University of Edinburgh, University College London, the Swiss IDIAP Research Institute, the Qatar Computing Research Institute, and Priberam Labs from Portugal to adapt emerging big-data deep learning NLP techniques to the needs of the international news monitoring industry." }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-14", "text": "BBC Monitoring undertakes one of the most advanced, comprehensive, and large-scale media monitoring operations world-wide, providing news and information from media sources around the world." 
}, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-15", "text": "BBC Monitoring journalists and analysts translate from over 30 languages into English, and follow approximately 13,500 sources, of which 1,500 are television broadcasters, 1,300 are radio, 3,700 are key news portals world-wide, 20 are commercial news feeds, and the rest are RSS feeds and selected Social Media sources." }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-16", "text": "Monitoring journalists follow important stories and flag breaking news events as part of the routine monitoring." }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-17", "text": "SUMMA (Scalable Understanding of Multilingual MediA) is project 688139 funded by the European Union H2020 programme. The central idea behind SUMMA is to develop a scalable multilingual media monitoring platform (Fig. 1) that combines real-time media stream processing (speech recognition, machine translation, story clustering) with in-depth batch-oriented construction of a rich knowledge base of reported events and entities mentioned, enabling extractive summarization of the storylines in the news." }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-18", "text": "In this paper we focus only on the streaming shallow processing part of the SUMMA project (the dark block in Fig. 1), where the recently developed neural machine translation techniques (Sutskever, Vinyals & Le, 2014; Bahdanau, Cho & Bengio, 2014) enable a radically new end-to-end approach to machine translation and clustering of the incoming news stories." }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-19", "text": "The approach is informed by our previous work on machine learning (Barzdins, Paikens, Gosko, 2013), media monitoring (Barzdins et al., 2014), and character-level neural translation (Barzdins & Gosko, 2016)." 
}, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-20", "text": "The key difference of the SUMMA project is that it was conceived after the recent paradigm shift (Manning, 2015) in the NLP community towards neural-network-inspired deep learning techniques such as end-to-end automatic speech recognition (Graves & Jaitly, 2014; Hannun et al., 2014; Amodei, 2015), end-to-end machine translation (Sutskever, Vinyals & Le, 2014; Bahdanau, Cho & Bengio, 2014; Luong et al., 2015), efficient distributed vectorspace word embeddings (Mikolov et al., 2013), image and video captioning (Venugopalan et al., 2015), and unsupervised learning of document representations by autoencoders (Li, Luong & Jurafsky, 2015)." }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-21", "text": "These recent deep learning breakthroughs along with massively parallel GPU computing allow addressing the media monitoring tasks in a completely new end-to-end manner rather than relying on the legacy NLP pipelines." }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-22", "text": "The novelty of the SUMMA project approach is that all languages covered by the project (Table 1) can be embedded in the same vectorspace by means of joint multitask learning (Collobert et al., 2011; Dong et al., 2015; Pham, Luong & Manning, 2015) of eight LSTM-RNN translational autoencoders with hidden layer parameters shared, as illustrated in Fig. 2." }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-23", "text": "Sharing the same vectorspace for sentences in all project languages enables accurate multilingual news story clustering without resorting to the clustering of the less accurate target (English) language machine translations." }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-24", "text": "This shared vectorspace approach also extends to the unsupervised multi-task learning of language models from the large monolingual corpora (Fig. 
3), which is crucial for low-resourced languages: having a generic language model learned in parallel from the monolingual corpora reduces (Dai & Le, 2015) the need for large supervised parallel corpora to achieve the same translational accuracy for the Fig. 2 setup." }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-25", "text": "The joint training of seventeen translational and same-language autoencoders with shared parameters (Fig. 2 and Fig. 3 together) to our knowledge has not been attempted so far." }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-26", "text": "Even training of a single state-of-the-art sentence-level translational autoencoder requires days of GPU computing (Barzdins & Gosko, 2016) in the TensorFlow (Abadi et al., 2015) seq2seq model (Sutskever, Vinyals & Le, 2014; Bahdanau, Cho & Bengio, 2014)." }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-27", "text": "To avoid the complexities of asynchronous parallel training with a shared parameter server (Dean et al., 2012), the architectures in Fig. 2 and Fig. 3 can instead be trained using the alternating training approach proposed in (Luong et al., 2016), where each task is optimized for a fixed number of parameter updates (or mini-batches) before switching to the next task (which is a different language pair)." }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-28", "text": "Although such an alternating approach prolongs the training process, it is preferred for simplicity and robustness reasons." }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-29", "text": "Once produced within the SUMMA project, these translational autoencoders with a shared vectorspace will be a unique language resource of likely interest also to the wider NLP community for multilingual applications outside the media monitoring domain." 
}, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-30", "text": "----------------------------------" }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-31", "text": "**MULTILINGUAL NEURAL TRANSLATION**" }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-32", "text": "----------------------------------" }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-33", "text": "**CHARACTER-LEVEL NEURAL TRANSLATION FOR STREAMS**" }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-34", "text": "The neural translation attention mechanism (Bahdanau, Cho & Bengio, 2014) has been shown to be highly beneficial for bi-lingual neural translation of long sentences, but it is not compatible with the multi-task multilingual translation models (Dong et al., 2015; Luong et al., 2016) described in the previous Section and the character-level translation models (Barzdins & Gosko, 2016) described in this Section." }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-35", "text": "For these reasons we replace the neural translation attention mechanism with the much simpler sliding-window translation (Barzdins & Gosko, 2016; Karpathy, 2015; Jozefowicz, 2016)." }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-36", "text": "Moving from word-level to character-level neural translation makes it even harder to cope with long sentences, presenting an additional reason to employ the sliding-window translation approach." }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-37", "text": "Table 2 illustrates the character-level neural translation from English to Latvian using a modified TensorFlow (Abadi et al., 2015) seq2seq (Sutskever, Vinyals & Le, 2014) neural translation model." }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-38", "text": "The character-level neural translation is enabled by forcing the tokenizer to treat each input symbol as a separate \"word\", leading to a small and fixed \"vocabulary\" containing only the 90 most frequently encountered characters." 
}, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-39", "text": "Another necessary change to the TensorFlow default seq2seq settings is disabling the attention (Bahdanau, Cho & Bengio, 2014) mechanism, which is known to interfere with character-level translation (Barzdins & Gosko, 2016) because there are no mappings between the characters of the translated words." }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-40", "text": "The small vocabulary of 90 words also automatically disables the sampled softmax functionality of seq2seq, improving the overall performance." }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-41", "text": "Finally, we configure a single bucket of size 100 characters, which will be the max translation window size." }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-42", "text": "Other hyperparameters used are: 1 LSTM layer of size 400, batch size 16 (the modified TensorFlow code is available at https://github.com/didzis/tensorflowAMR)." }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-43", "text": "Training is performed on the Europarl v7 EN-LV corpus for 24h on a TitanX GPU." }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-44", "text": "The sliding-window mechanism is used only during decoding (translation), mapping a fragment of 6 English words into 5 Latvian words (Latvian translations typically contain fewer words than the English source; rich morphology substitutes for most prepositions and articles)." }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-45", "text": "The multiple sliding-window translations produced are later merged into the final translation consisting only of words appearing at least twice in the neighboring sliding-window columns (word suffixes are ignored if the initial 6 characters of the words match; this reduces word drop due to inflection errors)." 
}, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-46", "text": "The final translation in Table 2 (bottom row) is close to the manual verbatim translation (top row) and conveys the topic of the original sentence." }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-47", "text": "Moreover, the sliding-window translations are surprisingly fluent Latvian phrases with correct word forms and mostly correct coordination." }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-48", "text": "The only non-Latvian \"words\" fabricated by the character-level translation in Table 2 are \"transpastiv\u0113\u0161ana\", \"transpitts\", and \"transpirma\"; they are apparently triggered by the English verb \"transits\", because in Latvian \"tranz\u012bts\" is used only as a noun without a close substitute verb." }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-49", "text": "The sliding-window translation method obviously cannot handle long-range dependencies well and occasionally drops or inserts words in the translation; therefore SUMMA also provides a state-of-the-art translation service in parallel to the one described here." }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-50", "text": "Meanwhile, the sliding-window character-based translation method has unique advantages relevant to the scope of the SUMMA project, discussed in the next Section." }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-51", "text": "----------------------------------" }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-52", "text": "**POTENTIAL APPLICATIONS OF THE MULTILINGUAL CHARACTER-LEVEL STREAM TRANSLATION**" }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-53", "text": "Having a shared vectorspace multilingual translation system (Fig. 2 and Fig. 3) able to operate on unsegmented streams of text has a number of novel applications." 
}, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-54", "text": "The most straightforward novel application is the possibility of embedding the documents of all project languages into the same shared semantic vectorspace and computing document semantic similarity (Hill, Cho & Korhonen, 2016) irrespective of the document language." }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-55", "text": "The sliding-window translation approach allows us to view the document as a sequence (trace) of vectors corresponding to every sliding-window step while translating the document." }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-56", "text": "These vectors are similar to word embedding vectors, but are likely to be semantically richer, as they would mostly distinguish word-senses in the context of the window." }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-57", "text": "Such vector traces corresponding to the documents can be compared in a bag-of-words fashion by measuring the cosine distance between the sums of document trace-vectors (as part of k-means clustering or nearest-neighbor search)." }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-58", "text": "This can be used as a building block for multilingual semantic clustering of stories into storylines, or for semantic search of documents in any language that are similar to a given document." }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-59", "text": "Another novel application of the character-level neural translation is stream segmentation into individual stories: a difficult task for news ingested from audio or video sources and transcribed with ASR, thus lacking any explicit sentence or story segmentation information." 
}, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-60", "text": "For stream segmentation into stories it is possible to utilize the exceptional generalization and memorization capacity of neural networks, which is already applied in neural dialogue systems such as Gmail Smart Reply (Corrado, 2015; Vinyals & Le, 2015)." }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-61", "text": "Table 3 illustrates how a mere 400 LSTM cells of our single-layer 90-character neural translator have been able to generalize and memorize rather correct translations for the first 100 characters of the entire Europarl v7 EN-LV training corpus containing 600,000 sentence pairs." }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-62", "text": "For story segmentation a sliding-window neural translation system can be incrementally trained to monolingually \"translate\" the current 5 words of the stream into the next 5 words of the stream (predicting the next 5 words from the previous 5 words), based on the actual news streams encountered." }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-63", "text": "Such a system should be able to predict the next 5 words within a news story reasonably well, but will fail to do so when there is a transition from one story to the next." }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-64", "text": "Along with additional auxiliary information, such as the time-code (when exactly the phrase was spoken and pauses in the speech) and speaker identification for each phrase, this should provide a rather reliable segmentation signal." }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-65", "text": "Table 3: Europarl v7 training corpus fragment (only the first 100 characters of each sentence were used for training) and the character-level neural translation output illustrating the memorization of the training corpus." 
}, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-66", "text": "----------------------------------" }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-67", "text": "**CONCLUSIONS**" }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-68", "text": "It is still an open issue which vectorspace projections yield the semantically best clusters (Hill, Cho & Korhonen, 2016), and further experiments are needed." }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-69", "text": "Particularly for storyline (Risen et al., 2013) clustering, the signals for stories belonging to the same storyline might not be so much the semantic similarity of the articles (they might report various developments of the storyline from differing viewpoints), but rather the matching time and location as well as the same organizations and people being involved: the information typically supplied by Named Entity Linking (NEL) tools." }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-70", "text": "The tradeoffs between semantic clustering quality and computational complexity are likely to be crucial." }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-71", "text": "Once trained, the run-time use of the multilingual translation modules for translation and news story clustering is around 1 sec on a TitanX GPU per average news story." }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-72", "text": "This is an order of magnitude slower than regular NEL or TF-IDF bag-of-words based clustering methods." }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-73", "text": "Establishing reliable storyline clustering benchmarking data sets and metrics is one of the goals of the SUMMA project, as good storyline clusters are the prerequisite for downstream storyline summarization, visualization, and predictive anticipation of upcoming developments." 
}, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-74", "text": "----------------------------------" }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-75", "text": "**ACKNOWLEDGMENTS**" }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-76", "text": "This work was supported in part by the H2020 SUMMA project under grant agreement 688139/H2020-ICT-2015 and in part by the Latvian National research program SOPHIS under grant agreement Nr.10-4/VPP-4/11." }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-77", "text": "----------------------------------" }, { "sent_id": "7f2383426952ddde6fd527cf0d63bd-C001-78", "text": "**ENGLISH (ORIGINAL) | LATVIAN (MANUAL TRANSLATION) | LATVIAN (CHARACTER-LEVEL NEURAL TRANSLATION)**" } ], "y": { "@BACK@": { "gold_contexts": [ [ "7f2383426952ddde6fd527cf0d63bd-C001-54" ] ], "cite_sentences": [ "7f2383426952ddde6fd527cf0d63bd-C001-54" ] }, "@FUT@": { "gold_contexts": [ [ "7f2383426952ddde6fd527cf0d63bd-C001-68" ] ], "cite_sentences": [ "7f2383426952ddde6fd527cf0d63bd-C001-68" ] } } }, "ABC_ce0441b3ae7b957520d329799f8b9f_47": { "x": [ { "sent_id": "ce0441b3ae7b957520d329799f8b9f-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "ce0441b3ae7b957520d329799f8b9f-C001-2", "text": "A new annotated corpus can have a pivotal role in the future of computational linguistics." }, { "sent_id": "ce0441b3ae7b957520d329799f8b9f-C001-3", "text": "Corpus annotation can define new NLP tasks and set new standards." }, { "sent_id": "ce0441b3ae7b957520d329799f8b9f-C001-4", "text": "This may put many of the papers presented at this workshop on the cutting edge of our field." }, { "sent_id": "ce0441b3ae7b957520d329799f8b9f-C001-5", "text": "A standard, however, is a double-edged sword." }, { "sent_id": "ce0441b3ae7b957520d329799f8b9f-C001-6", "text": "A standard corpus urges users to accept the theory of how to represent things that underlie that corpus." 
}, { "sent_id": "ce0441b3ae7b957520d329799f8b9f-C001-7", "text": "For example, a Penn Treebank theory of grammar is implicit in PennTreebank-based parsers." }, { "sent_id": "ce0441b3ae7b957520d329799f8b9f-C001-8", "text": "This can be a problem if one rejects some aspects of that theory." }, { "sent_id": "ce0441b3ae7b957520d329799f8b9f-C001-9", "text": "Also one may object to a particular system of annotation because some theories generalize to cover new ground (e.g., new languages) better than others." }, { "sent_id": "ce0441b3ae7b957520d329799f8b9f-C001-10", "text": "Nevertheless, advantages of accepting a corpus as standard include the following:" }, { "sent_id": "ce0441b3ae7b957520d329799f8b9f-C001-11", "text": "It is straightforward to compare the performance of the set of systems that produce the same form of output, e.g., Penn Treebank-based parsers can be compared in terms of how well they reproduce the Penn Treebank." }, { "sent_id": "ce0441b3ae7b957520d329799f8b9f-C001-12", "text": "Alternative systems based on a standard are largely interchangeable." }, { "sent_id": "ce0441b3ae7b957520d329799f8b9f-C001-13", "text": "Thus a system that uses one PennTreebank-based parser as a component can easily be adapted to use another, better-performing PennTreebank-based parser." }, { "sent_id": "ce0441b3ae7b957520d329799f8b9f-C001-14", "text": "Standards can be built on." }, { "sent_id": "ce0441b3ae7b957520d329799f8b9f-C001-15", "text": "For example, if one accepts the framework of the Penn Treebank, it is easy to move on to representations of \"deeper\" structure as suggested in three papers in this volume (Miltsakaki et al., 2004; Babko-Malaya et al., 2004; Meyers et al., 2004)." }, { "sent_id": "ce0441b3ae7b957520d329799f8b9f-C001-16", "text": "It is my view that these advantages outweigh the disadvantages." 
}, { "sent_id": "ce0441b3ae7b957520d329799f8b9f-C001-17", "text": "I propose that the papers in this volume be viewed with the following question in mind: How can the work covered by this collection of papers be integrated together? Put differently, to what extent are these resources mergeable?" }, { "sent_id": "ce0441b3ae7b957520d329799f8b9f-C001-18", "text": "The first six papers describe linguistic annotation in four languages: Spanish (Alc\u00e1ntara and Moreno, 2004), English (Miltsakaki et al., 2004; Babko-Malaya et al., 2004; Meyers et al., 2004) , Czech (Sgall et al., 2004) and German (Baumann et al., 2004)." }, { "sent_id": "ce0441b3ae7b957520d329799f8b9f-C001-19", "text": "The sixth, seventh and eighth papers (Baumann et al., 2004; \u00c7 mejrek et al., 2004; Helmreich et al., 2004 ) explore questions of multilingual annotation of syntax and semantics, beginning to answer the question of how annotation systems can be made compatible across languages." }, { "sent_id": "ce0441b3ae7b957520d329799f8b9f-C001-20", "text": "Indeed (Helmreich et al., 2004) explores the question of integration across languages, as well as levels of annotation." }, { "sent_id": "ce0441b3ae7b957520d329799f8b9f-C001-21", "text": "(Baumann et al., 2004) also describes how a number of different linguistic levels can be related in annotation (pragmatic and prosodic) among two languages (English and German)." }, { "sent_id": "ce0441b3ae7b957520d329799f8b9f-C001-22", "text": "The ninth and tenth papers (Langone et al., 2004; Zabokrtsk\u00fd and Lopatkov\u00e1, 2004) are respectively about a corpus related to a lexicon and the reverse: a lexicon related to a corpus." }, { "sent_id": "ce0441b3ae7b957520d329799f8b9f-C001-23", "text": "This opens up the wider theme of the intergration of a number of different linguistic resources." 
}, { "sent_id": "ce0441b3ae7b957520d329799f8b9f-C001-24", "text": "As the natural language community produces more and more linguistic resources, especially corpora, it seems important to step back and look at the larger picture." }, { "sent_id": "ce0441b3ae7b957520d329799f8b9f-C001-25", "text": "If these resources can be fit together as part of a larger puzzle, this could produce a sketch of the future of our field." }, { "sent_id": "ce0441b3ae7b957520d329799f8b9f-C001-26", "text": "----------------------------------" }, { "sent_id": "ce0441b3ae7b957520d329799f8b9f-C001-27", "text": "****" }, { "sent_id": "ce0441b3ae7b957520d329799f8b9f-C001-28", "text": "A new annotated corpus can have a pivotal role in the future of computational linguistics." }, { "sent_id": "ce0441b3ae7b957520d329799f8b9f-C001-29", "text": "Corpus annotation can define new NLP tasks and set new standards." }, { "sent_id": "ce0441b3ae7b957520d329799f8b9f-C001-30", "text": "This may put many of the papers presented at this workshop on the cutting edge of our field." }, { "sent_id": "ce0441b3ae7b957520d329799f8b9f-C001-31", "text": "A standard, however, is a double edged sword." }, { "sent_id": "ce0441b3ae7b957520d329799f8b9f-C001-32", "text": "A standard corpus urges users to accept the theory of how to represent things that underlie that corpus." }, { "sent_id": "ce0441b3ae7b957520d329799f8b9f-C001-33", "text": "For example, a Penn Treebank theory of grammar is implicit in PennTreebank-based parsers." }, { "sent_id": "ce0441b3ae7b957520d329799f8b9f-C001-34", "text": "This can be a problem if one rejects some aspects of that theory." }, { "sent_id": "ce0441b3ae7b957520d329799f8b9f-C001-35", "text": "Also one may object to a particular system of annotation because some theories generalize to cover new ground (e.g., new languages) better than others." 
}, { "sent_id": "ce0441b3ae7b957520d329799f8b9f-C001-36", "text": "Nevertheless, advantages of accepting a corpus as standard include the following:" }, { "sent_id": "ce0441b3ae7b957520d329799f8b9f-C001-37", "text": "It is straight-forward to compare the performance of the set of systems that produce the same form of output, e.g., Penn Treebank-based parsers can be compared in terms of how well they reproduce the Penn Treebank." }, { "sent_id": "ce0441b3ae7b957520d329799f8b9f-C001-38", "text": "Alternative systems based on a standard are largely interchangeable." }, { "sent_id": "ce0441b3ae7b957520d329799f8b9f-C001-39", "text": "Thus a system that uses one PennTreebank-based parser as a component can easily be adapted to use another better performing PennTreebank-based parser." }, { "sent_id": "ce0441b3ae7b957520d329799f8b9f-C001-40", "text": "Standards can be built on." }, { "sent_id": "ce0441b3ae7b957520d329799f8b9f-C001-41", "text": "For example, if one accepts the framework of the Penn Treebank, it is easy to move on to representations of \"deeper\" structure as suggested in three papers in this volume (Miltsakaki et al., 2004; Babko-Malaya et al., 2004; Meyers et al., 2004) ." }, { "sent_id": "ce0441b3ae7b957520d329799f8b9f-C001-42", "text": "It is my view that these advantages outweigh the disadvantages." }, { "sent_id": "ce0441b3ae7b957520d329799f8b9f-C001-43", "text": "I propose that the papers in this volume be viewed with the following question in mind: How can the work covered by this collection of papers be integrated together? Put differently, to what extent are these resources mergeable?" }, { "sent_id": "ce0441b3ae7b957520d329799f8b9f-C001-44", "text": "The first six papers describe linguistic annotation in four languages: Spanish (Alc\u00e1ntara and Moreno, 2004) , English (Miltsakaki et al., 2004; Babko-Malaya et al., 2004; Meyers et al., 2004) , Czech (Sgall et al., 2004) and German (Baumann et al., 2004) ." 
}, { "sent_id": "ce0441b3ae7b957520d329799f8b9f-C001-45", "text": "The sixth, seventh and eighth papers (Baumann et al., 2004; \u00c7 mejrek et al., 2004; Helmreich et al., 2004) explore questions of multilingual annotation of syntax and semantics, beginning to answer the question of how annotation systems can be made compatible across languages." }, { "sent_id": "ce0441b3ae7b957520d329799f8b9f-C001-46", "text": "Indeed (Helmreich et al., 2004) explores the question of integration across languages, as well as levels of annotation." }, { "sent_id": "ce0441b3ae7b957520d329799f8b9f-C001-47", "text": "(Baumann et al., 2004) also describes how a number of different linguistic levels can be related in annotation (pragmatic and prosodic) among two languages (English and German)." }, { "sent_id": "ce0441b3ae7b957520d329799f8b9f-C001-48", "text": "The ninth and tenth papers (Langone et al., 2004; Zabokrtsk\u00fd and Lopatkov\u00e1, 2004) are respectively about a corpus related to a lexicon and the reverse: a lexicon related to a corpus." }, { "sent_id": "ce0441b3ae7b957520d329799f8b9f-C001-49", "text": "This opens up the wider theme of the intergration of a number of different linguistic resources." }, { "sent_id": "ce0441b3ae7b957520d329799f8b9f-C001-50", "text": "As the natural language community produces more and more linguistic resources, especially corpora, it seems important to step back and look at the larger picture." }, { "sent_id": "ce0441b3ae7b957520d329799f8b9f-C001-51", "text": "If these resources can be fit together as part of a larger puzzle, this could produce a sketch of the future of our field." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "ce0441b3ae7b957520d329799f8b9f-C001-15" ], [ "ce0441b3ae7b957520d329799f8b9f-C001-18" ], [ "ce0441b3ae7b957520d329799f8b9f-C001-41" ], [ "ce0441b3ae7b957520d329799f8b9f-C001-44" ] ], "cite_sentences": [ "ce0441b3ae7b957520d329799f8b9f-C001-15", "ce0441b3ae7b957520d329799f8b9f-C001-18", "ce0441b3ae7b957520d329799f8b9f-C001-41", "ce0441b3ae7b957520d329799f8b9f-C001-44" ] } } }, "ABC_17eb0ea80e5a2f18096ef41521af4e_47": { "x": [ { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-2", "text": "Domain adaptation is a time consuming and costly procedure calling for the development of algorithms and tools to facilitate its automation." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-3", "text": "This paper presents an unsupervised algorithm able to learn the main concepts in event summaries." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-4", "text": "The method takes as input a set of domain summaries annotated with shallow linguistic information and produces a domain template." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-5", "text": "We demonstrate the viability of the method by applying it to three different domains and two languages." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-6", "text": "We have evaluated the generated templates against human templates obtaining encouraging results." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-7", "text": "----------------------------------" }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-9", "text": "Our research is concerned with the development of techniques for knowledge induction in the field of text summarization." 
}, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-10", "text": "Our goal is to automatically induce the necessary knowledge for the generation of concise event summaries such as the one shown in Figure 1 ." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-11", "text": "This kind of summaries, which can be found on the Web and in text collections, contain key information of the events they describe." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-12", "text": "Previous work in the area of text summarization (DeJong, 1982; Oakes and Paice, 2001; Saggion and Lapalme, 2002) addressed the problem of generating this type of concise summaries from texts, relying on information extraction and text generation techniques." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-13", "text": "These approaches were difficult to port to new domains and languages because of the efforts needed for modelling the underlying event template structure." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-14", "text": "In this paper we propose a method for learning the main concepts in domain summaries in an unsupervised iterative procedure." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-15", "text": "The proposed algorithm takes a set of unannotated summaries in a given domain and produces auto-annotated summaries which can be used for training information extraction and text generation systems." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-16", "text": "Domain adaptation is essential for text summarization and information extraction, and the last two decades have seen a plethora of methods for supervised, semisupervised, and unsupervised learning from texts." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-17", "text": "2001 August 24: Air Transat Flight 236 runs out of fuel over the Atlantic Ocean and makes an emergency landing in the Azores." 
}, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-18", "text": "Upon landing some of the tires blow out, causing a fire that is extinguished by emergency personnel on the ground." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-19", "text": "None of the 304 people on board the Airbus A330-200 were seriously injured." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-20", "text": "For example, in (Li et al., 2010) clustering is applied to generate templates for specific entity types (actors, companies, etc.) and patterns are automatically produced that describe the information in the templates." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-21", "text": "In (Chambers and Jurafsky, 2009 ) narrative schemas are induced from corpora using coreference relations between participants in texts." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-22", "text": "Transformation-based learning is used in (Saggion, 2011) to induce templates and rules for non-extractive summary generation." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-23", "text": "Paraphrase templates containing concepts and typical strings were induced from comparable sentences in (Barzilay and Lee, 2003) using multisentence alignment to discover \"variable\" and fixed structures." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-24", "text": "Linguistic patterns were applied to huge amounts of non-annotated pre-classified texts in (Riloff, 1996) to bootstrap information extraction patterns." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-25", "text": "Similarly, semi-supervised or unsupervised methods have been used to learn question/answering patterns (Ravichandran and Hovy, 2002) or text schemas (Bunescu and Mooney, 2007) ." 
}, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-26", "text": "One current paradigm to learn from raw data is open information extraction (Downey et al., 2004; Banko, 2009) , which without any prior knowledge aims at discovering all possible relations between pairs of entities occurring in text." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-27", "text": "Our work tries to learn the main concepts making up the template structure in domain summaries, similar to (Chambers and Jurafsky, 2011) ." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-28", "text": "However, we do not rely on any source of external knowledge (i.e. WordNet) to do so." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-29", "text": "This paper presents an iterative-learning algorithm which is able to identify the key components of event summaries." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-30", "text": "We will show that the algorithm can induce template-like representations in various domains and languages." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-31", "text": "The rest of the paper is organized in the following way: In Section 2 we introduce the dataset we are using for our experiments and describe how we have prepated it for experimentation." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-32", "text": "Then, in Section 3 we provide an overview of our concept induction learnig algorithm while in Section 4 we explain how we have instantiated the algorithm for the experiments presented in this paper." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-33", "text": "Section 5 describe the experiments and results obtained and Section 6 discusses our approach comparing it with past research." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-34", "text": "Finally, in Section 7 we close the paper with conclusions and future work." 
}, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-35", "text": "----------------------------------" }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-36", "text": "**DATA AND DATA PREPARATION**" }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-59", "text": "Experiments were carried out per domain and language to assess the suitability of the algorithm to the conceptual learning task." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-37", "text": "The dataset used for this study -part of the CON-CISUS corpus (Saggion and Szasz, 2012 ) -consists of a set of 250 summaries in Spanish and English for three different domains: aviation accidents, rail accidents, and earthquakes." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-38", "text": "This dataset makes it possible to compare the performance of learning procedures across languages and domains." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-39", "text": "Based on commonsense, a human annotator developed an annotation schema per domain to describe in a templatelike representation the essential elements (i.e., slots) of each event." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-40", "text": "For example, for the aviation accident domain these essential elements were: the date of the accident, the number of victims, the airline, the aircraft, the location of the accident, the flight number, the origin and destination of the flight, etc." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-41", "text": "The dataset was then annotated following the schema using the GATE annotation tool." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-42", "text": "The human annotations are used for evaluation of the concept discovery algorithm." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-43", "text": "Each document in the dataset was automatically annotated using tools for each language." 
}, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-44", "text": "We relied on basic processing steps to identify sentences, words and word-roots, parts of speech, nounchunks, and named entities using the GATE system for English (Maynard et al., 2002) and TreeTagger for Spanish (Schmid, 1995) ." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-45", "text": "The method is designed to learn the conceptual information in the summaries by extension (i.e., the set of strings that make up the concept in a given corpus) and by intension (i.e., an algorithm able to recognise the concept members in new documents in the domain) (Buitelaar and Magnini, 2005) ." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-46", "text": "Concept extensions identified by our method in the English summaries in the aviation domain are listed in Table 3 ." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-47", "text": "Each summary in the corpus can be seen as a sequence of strings and chunks as shown in Figure 1 (named entities and noun chunks are shown in boldface and they may overlap)." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-48", "text": "The procedure to learn a concept in the corpus of summaries is given in pseudocode in Algorithm 2 which is repeatedly invoked by a main algorithm to learn all concepts (Algorithm 1)." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-49", "text": "The idea of the algorithm is rather simple, at each iteration a document is selected for learning, and from this document a single chunk (i.e., a noun chunk or a named entity) available for learning is selected as a seed example of a hypothetical concept (the concept is given a unique name at each itera- tion)." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-50", "text": "The document is annotated with this seed as a target concept and a classifier is trained using this document." 
}, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-51", "text": "The trained classifier is then applied to the rest of the documents to identify instances of the hypothetical concept." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-52", "text": "If the classifier is unsuccessful in identifying new instances, then the chunk used in the current iteration is discarded from the learning process, but if the classifier is successful and able to identify instances of the hypothetical concept, then the \"best\" annotated document is selected and added to the training set." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-53", "text": "The classifier is re-trained using the new added document and the process is repeated until no more instances can be identified." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-54", "text": "A hypothetical concept is kept only if there is enough support for it across the set of documents." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-55", "text": "The main procedure calls the basic algorithms a number of times while there are concepts to be learnt (or all chunks have been used)." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-56", "text": "The stopping criteria is the number of concepts which could possibly be learnt, an estimation of which is the average number of chunks in a document." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-57", "text": "----------------------------------" }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-58", "text": "**ALGORITHM INSTANTIATION**" }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-60", "text": "A number of points in Algorithm 2 need clarification: the selection of a document in line 4 of the algorithm can be carried out using different informed procedures; for the experiments described here we decided to select the document with more available hypotheses, i.e., the document with more chunks." 
}, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-61", "text": "For the selection of a (Li et al., 2004) ." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-62", "text": "The features we use for representing the instance to be learnt are very superficial for these experiments: lemmas, parts-of-speech tags, orthography, and named entity types of the words surrounding the target concept to be learnt." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-63", "text": "The SVMs provide as output a class together with a probability which is essential to our method." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-64", "text": "We use this probability for selecting the best document in line 13 of the algorithm: the instance predicted with the highest probability is located and the document where this instance occurs is returned as \"best document\"." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-65", "text": "In case no instances are learned (e.g., else in line 17), the iteration ends returning the extension learnt so far." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-66", "text": "Concerning Algorithm 1: in line 5 (the while) we use as stopping criteria for the maximum number of concepts to learn the average number of chunks in the corpus." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-67", "text": "In line 7, the FILTER CONCEPT function evaluates the concept, keeping it only if two criteria are met: (i) there are not \"too many\" repetitions of a string in the discovered concept and (ii) the discovered concept covers a reasonable number of documents." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-68", "text": "With criteria (i) we filter out a concept which contains repeated strings: a concept could be formed simply by grouping together all repeated phrases in the set of documents (i.e. \"the earthquake\" or \"the accident\" or \"the plane\")." 
}, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-69", "text": "While these phrases could be relevant in the target domain they do not constitute a key concept in our interpretation." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-70", "text": "Strings which are repeated in the concept extension are more like the \"backbone structure\" of the summaries in the domain." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-71", "text": "In our experiments both criteria are experimental variables and we vary them from 10% to 100% at 20% intervals." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-72", "text": "In Section 5 we will present results for the best configurations." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-73", "text": "----------------------------------" }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-74", "text": "**EXPERIMENTS AND RESULTS**" }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-75", "text": "In order to evaluate the discovered concepts we have treated learning as information extraction." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-76", "text": "In order to evaluate them in this context we first need to map each learnt concept onto one of the human concepts." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-77", "text": "The mapping, which is based on the concept extension, is straightforward: a discovered concept is mapped onto the human concept with which it has a majority of string matches." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-78", "text": "Note that we match the discovered text offsets in the analysed documents and not only the identified strings." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-79", "text": "In order to evaluate the matching procedure we have used precision, recall, and f-score measures comparing the automatic concept with the human concept." 
}, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-80", "text": "Note that we use a lenient procedure -counting as correct strings those with a partial match." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-81", "text": "This is justified since discovering the exact boundaries of a concept instance is a very difficult task." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-82", "text": "Table 2 shows some examples of the human annotated instances and related discovered one." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-83", "text": "It can be appreciated that the learnt concepts have a reasonable match degree with the human annotated ones." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-84", "text": "Table 3 gives information extraction results per domain and language for the best configuration of the algorithm." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-85", "text": "The best scores are generally obtained when coverage is set to 10% of the number of summaries, except for the learning of conceptual information in Spanish for the earthquake domain where the system performs better for 10% summary coverage." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-86", "text": "The parameter controlling string repetition in the concept extension should be kept small." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-87", "text": "The obtained results are quite satisfactory consider- ing the small dataset and the limited use of linguistic resources during learning." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-88", "text": "These results compare favorably to cross-validation results obtained using supervised machine learning techniques (Saggion and Szasz, 2011) ." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-89", "text": "Learning from the earthquake domain appears to be more challenging given the more verbose characteristics of these texts." 
}, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-90", "text": "Even though space restricions prevent us from showing all evaluation results, in Table 4 we present detailed results for the two domains and languages." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-91", "text": "Note that the concepts listed constitute the slots of the induced domain template." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-92", "text": "----------------------------------" }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-93", "text": "**DISCUSSION**" }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-94", "text": "Similar to active learning information extraction techniques (Ciravegna and Wilks, 2003) , the concept discovery algorithm presented here is inspired by techniques like learning by reading, where unfamiliar expressions in one document can be \"explained\" by association to expressions in similar document contexts." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-95", "text": "However, and unlike active learning, human intervention is unnecessary in our approach." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-96", "text": "Although the algorithm achieves reasonably lenient performance, strict (hard) evaluation indicates that in each experimental condition performance drops when a strict match is required." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-97", "text": "This is expected given the difficulty of finding the right instance boundaries based only on automatic chunking information." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-98", "text": "For this reason, we intend to carry out additional experiments based on richer domain independent features from a syntactic parser." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-99", "text": "We have identified a number of reasons why some concept instances can not be correctly associated with their concepts." 
}, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-100", "text": "In the aviation domain, for example, numeric expressions constitute the extensions of different concepts including: number of victims, crew members, and number of survivors; it is a rather common feature in the aviation domain to include these different concepts together in one sentence, making their \"separation\" complicated." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-101", "text": "Same explanations apply to other tested domains: for example locations playing the role of origin and destination of a given train or airplace are also sometimes confused." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-102", "text": "Our work demonstrates the possibility of learning conceptual information in several domains and languages, while previous work (Chambers and Jurafsky, 2011) has addressed sets of related domains (e.g., MUC-4 templates) in English." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-103", "text": "Learning full conceptualizations from raw data is a daunting and difficult enterprise (Biemann, 2005) ." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-104", "text": "Here, we provide a short-cut by proposing a method able to learn the essential concepts of a domain by relying on summaries which are freely available on the Web." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-105", "text": "Our method is able to produce conceptualizations from a few documents in each domain and language unlike recent open domain information extraction which requires massive amount of texts for relation learning (Banko, 2009 )." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-106", "text": "Our algorithm has a reasonable computational complexity, unlike alignment-based or clustering-based approaches (Barzilay and Lee, 2003) , which are computationally expensive." 
}, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-107", "text": "----------------------------------" }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-108", "text": "**CONCLUSIONS AND OUTLOOK**" }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-109", "text": "Domain adaptation is a time consuming and costly procedure calling for the development of algorithms and tools to facilitate its automation." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-110", "text": "In this paper we have presented a novel algorithm for learning information content in event summaries." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-111", "text": "The approach is fully unsupervised and based on the application of an iterative algorithm which grows a concept extension step-by-step." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-112", "text": "We have also proposed an instantiation of the algorithm and demonstrated its applicability to learning conceptual information in three different domains and two languages." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-113", "text": "We have obtained encouraging results, with the procedure able to model the main conceptual information in the summaries with lenient F-scores ranging from 0.25 to 0.66 F-scores depending on the language and domain." }, { "sent_id": "17eb0ea80e5a2f18096ef41521af4e-C001-114", "text": "There are, however, a number of avenues that should be further explored such as the use of a richer document representation based on syntactic information and the development of additional procedures to improve instance boundary recognition." 
} ], "y": { "@SIM@": { "gold_contexts": [ [ "17eb0ea80e5a2f18096ef41521af4e-C001-27" ] ], "cite_sentences": [ "17eb0ea80e5a2f18096ef41521af4e-C001-27" ] }, "@DIF@": { "gold_contexts": [ [ "17eb0ea80e5a2f18096ef41521af4e-C001-102" ] ], "cite_sentences": [ "17eb0ea80e5a2f18096ef41521af4e-C001-102" ] } } }, "ABC_134baefab4d27e9dafd0c050c43775_47": { "x": [ { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-2", "text": "We present the first open-source graphical annotation tool for combinatory categorial grammar (CCG), and the first set of detailed guidelines for syntactic annotation with CCG, for four languages: English, German, Italian, and Dutch." }, { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-3", "text": "We also release a parallel pilot CCG treebank based on these guidelines, with 4x100 adjudicated sentences, 10K singleannotator fully corrected sentences, and 82K single-annotator partially corrected sentences." }, { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-4", "text": "----------------------------------" }, { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-5", "text": "**INTRODUCTION**" }, { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-6", "text": "Combinatory Categorial Grammar (CCG; Steedman, 2000 ) is a grammar formalism distinguished by its transparent syntax-semantics interface and its elegant handling of coordination." }, { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-7", "text": "It is a popular tool in semantic parsing, and treebank creation efforts have been made for Turkish (\u00c7 ak\u0131c\u0131, 2005) , German (Hockenmaier, 2006) , English (Hockenmaier and Steedman, 2007) , Italian (Bos et al., 2009) , Chinese (Tse and Curran, 2010) , Arabic (Boxwell and Brew, 2010) , Japanese (Uematsu et al., 2013) , and Hindi (Ambati et al., 2018) ." 
}, { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-8", "text": "However, all of these treebanks were not directly annotated according to the CCG formalism, but automatically converted from phrase structure or dependency treebanks, which is an error-prone process." }, { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-9", "text": "Direct annotation in CCG has so far mostly been limited to small datasets for seeding or testing semantic parsers (e.g., Artzi et al., 2015) , and no graphical annotation interface is available to support such efforts, making the annotation process difficult to scale." }, { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-10", "text": "The only exceptions we are aware of are the Groningen Meaning Bank and the Parallel Meaning Bank (Abzianidze et al., 2017) , two annotation efforts which use a graphical user interface for annotating sentences with CCG derivations and other annotation layers, and which have produced CCG treebanks for English, German, Italian, and Dutch." }, { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-11", "text": "However, these efforts are focused on semantics and have not released explicit guidelines for syntactic annotation." }, { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-12", "text": "Their annotation tool is limited in that annotators only have control over lexical categories, not larger constituents." }, { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-13", "text": "Even though CCG is a lexicalized formalism, where most decisions can be made on the lexical level, there is no full control over attachment phenomena in the lexicon." }, { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-14", "text": "Moreover, these annotation tools are not open-source and cannot easily be deployed to support other annotation efforts." 
}, { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-36", "text": "When unsure about some annotation decision, they can click the \"report issue\" button to open a discussion thread in an external forum, such as a GitHub issue tracker." }, { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-15", "text": "In this paper, we present an open-source, lightweight, easy-to-use graphical annotation tool that employs a statistical parser to create initial CCG derivations for sentences, and allows annotators to correct these annotations via lexical category constraints and span constraints." }, { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-16", "text": "Together, these constraints make it possible to effect (almost) all annotation decisions consistent with the principles of CCG." }, { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-17", "text": "We also present a pilot study for multilingual CCG annotation, in which a parallel corpus of 4x100 sentences (in English, German, Italian, and Dutch) was annotated by two annotators per sentence, a detailed annotation manual was created, and adjudication was performed to create a final version." }, { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-18", "text": "We publicly release the manual, the annotation tool, and the adjudicated data." }, { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-19", "text": "Our release also includes an additional > 10 K derivations, each manually corrected by a single annotator, and an additional > 82 K sentences, each partially corrected by a single annotator." }, { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-20", "text": "----------------------------------" }, { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-21", "text": "**AN ANNOTATION TOOL FOR CCG**" }, { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-22", "text": "Our annotation tool CCGweb 1 is Web-based, implemented in Python, PHP, and JavaScript, and should be easy to deploy on any recent Linux dis- tribution." 
}, { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-23", "text": "It has two main views: the home page shows the list of sentences an annotator is assigned to annotate." }, { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-24", "text": "Those already done are marked as \"marked correct\". Clicking on a sentence takes the annotator to the sentence view." }, { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-25", "text": "Annotators can also enter arbitrary sentences to annotate, e.g., for experimenting or for producing illustrations." }, { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-26", "text": "Dynamic Annotation Annotation follows an approach called dynamic annotation (Oepen et al., 2002) or human-aided machine annotation , in which sentences are automatically analyzed, annotators impose constraints to rule out undesired analyses, sentences are then reanalyzed subject to the constraints, and the process is repeated until only the desired analysis remains." }, { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-27", "text": "The current system is backed by the EasyCCG parser (Lewis and Steedman, 2014) , slightly modified to allow for incorporating constraints, and other CCG parsers could be plugged in with similar modifications." }, { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-28", "text": "To do this, the annotator clicks on the category and changes it, as shown in the figure." }, { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-29", "text": "When they hit enter or click somewhere else, the sentence is automatically parsed again in the background, this time with the lexical category constraint that go has category (S[b]\\ NP)/ PP." }, { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-30", "text": "In many cases, the parser will directly find the desired parse, with there being a PP, and the annotator only has to check it, not make another edit." 
}, { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-31", "text": "Span Constraints Although constraining lexical categories is often enough to determine the entire CCG derivation (cf. Bangalore and Joshi, 1999; Lewis and Steedman, 2014) , this is not always the case." }, { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-32", "text": "For example, consider the sentence I want to be a millionaire like my dad. Assuming that like my dad is a verb phrase modifier (category (S \\ NP)\\(S \\ NP)), it could attach to either to be or want, giving very different meanings (cf. Zimmer, 2013)." }, { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-33", "text": "We therefore implemented one other type of edit operation/constraint: span constraints." }, { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-34", "text": "By simply clicking and dragging across a span of tokens as shown in Figure 2 , annotators can constrain this span to be a constituent in the resulting parse." }, { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-35", "text": "Additional Features Our tool offers annotators some additional convenient features." }, { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-37", "text": "To erase all constraints and restart annotation from the parser's original analysis, an annotator can click the \"reset\" button. And the buttons \"HTML\" and \"LaTeX\" provide code that can be copied and pasted to use the current derivation as an illustration on a web page or in a paper." }, { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-38", "text": "Adjudication Support Once two or more annotators have annotated a sentence, disagreements need to be discovered, and a final, authoritative version has to be created." }, { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-39", "text": "Our tool supports this adjudication process through the special user account judge." 
}, { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-40", "text": "This user can see the derivations of other annotators in a tabbed interface as shown in Figure 3 ." }, { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-41", "text": "In order to enable the judge to easily spot disagreements, categories that annotators disagree on are struck through, and constituents that annotators disagree on are dashed." }, { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-42", "text": "----------------------------------" }, { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-43", "text": "**A QUADRILINGUAL PILOT CCG TREEBANK**" }, { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-44", "text": "To test the viability of creating multilingual CCG treebanks by direct annotation, we conducted an annotation experiment on 110 short sentences from the Tatoeba corpus (Tatoeba, 2019) , each in four translations (English, German, Italian, and Dutch)." }, { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-45", "text": "The main annotation guideline was to copy the annotation style of CCGrebank (Honnibal et al., 2010), a CCG treebank adapted from CCGbank (Hockenmaier and Steedman, 2007) , which is in turn based on the Penn Treebank (Marcus et al., 1993) ." }, { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-46", "text": "Since CCGrebank only covers English and lacks some constructions observed in our corpus, an annotation manual with more specific instructions was needed." }, { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-47", "text": "We initially annotated ten sentences in four languages and discussed disagreements." }, { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-48", "text": "The results were recorded in an initial annotation manual, and the initial annotations were discarded." }, { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-49", "text": "Each of the remaining 4x100 sentences was then annotated independently by at least two of the authors." 
}, { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-50", "text": "Table 1 (upper part) shows the number of nonoverlapping category and span constraints that each annotator created on average per sentence before marking the sentence as correct." }, { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-51", "text": "Annotated sentences were manually classified by the first author into four classes: (0) sentences without any disagreements, (1) sentences with only trivial violations of the annotation guidelines (e.g., concerning attachment of punctuation or underspecifying modifier features), (2) sentences with only apparent oversights, such as giving a determiner a pronoun category, (3) sentences with more intricate disagreements which required additional guidelines to resolve." }, { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-52", "text": "Table 1 (upper part) shows the distribution of disagreement classes, and Table 2 shows examples of class (3)." }, { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-53", "text": "The first author adjudicated all disagreements and updated the annotation manual accordingly." }, { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-54", "text": "We release the manual and the full adjudicated dataset." }, { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-55", "text": "2 To make the resource more useful (e.g., for training parsers), we also include in the release the syntactic CCG derivations created so far in the Parallel Meaning Bank (Abzianidze et al., 2017) ." }, { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-56", "text": "These do not follow the annotation guidelines in detail due to their focus on semantics, nor have they been adjudicated, but instead corrected by a single annotator." }, { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-57", "text": "However, they are much greater in number." 
}, { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-58", "text": "For an even greater number, we also release partially corrected derivations, meaning that the annotator made at least one change to the automatically created derivation." }, { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-59", "text": "lexical label constraints and span constraints, adjudication support, and various conveniences." }, { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-60", "text": "We have used this tool to create the first published CCG resource that comes with an explicit annotation manual for syntax and has been created by direct annotation, rather than conversion from a non-CCG treebank." }, { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-61", "text": "It is multilingual, currently including English, German, Italian, and Dutch, and aims for cross-lingually consistent annotation guidelines." }, { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-62", "text": "For future work, we envision more extensive direct annotation of multilingual data with CCG derivations, and putting them to use for evaluating unsupervised and distantly supervised CCG parsers." }, { "sent_id": "134baefab4d27e9dafd0c050c43775-C001-63", "text": "We would also like to investigate the use of our tool as an interactive aid in teaching CCG." } ], "y": { "@BACK@": { "gold_contexts": [ [ "134baefab4d27e9dafd0c050c43775-C001-10" ] ], "cite_sentences": [ "134baefab4d27e9dafd0c050c43775-C001-10" ] }, "@UNSURE@": { "gold_contexts": [ [ "134baefab4d27e9dafd0c050c43775-C001-55" ] ], "cite_sentences": [ "134baefab4d27e9dafd0c050c43775-C001-55" ] } } }, "ABC_3a7f65a63e875db3e6d722a695aa5a_47": { "x": [ { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-59", "text": "lexical label constraints and span constraints, adjudication support, and various conveniences." 
}, { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-2", "text": "We present the first open-source graphical annotation tool for combinatory categorial grammar (CCG), and the first set of detailed guidelines for syntactic annotation with CCG, for four languages: English, German, Italian, and Dutch." }, { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-3", "text": "We also release a parallel pilot CCG treebank based on these guidelines, with 4x100 adjudicated sentences, 10K singleannotator fully corrected sentences, and 82K single-annotator partially corrected sentences." }, { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-4", "text": "----------------------------------" }, { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-5", "text": "**INTRODUCTION**" }, { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-6", "text": "Combinatory Categorial Grammar (CCG; Steedman, 2000 ) is a grammar formalism distinguished by its transparent syntax-semantics interface and its elegant handling of coordination." }, { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-7", "text": "It is a popular tool in semantic parsing, and treebank creation efforts have been made for Turkish (\u00c7 ak\u0131c\u0131, 2005) , German (Hockenmaier, 2006) , English (Hockenmaier and Steedman, 2007) , Italian (Bos et al., 2009) , Chinese (Tse and Curran, 2010) , Arabic (Boxwell and Brew, 2010) , Japanese (Uematsu et al., 2013) , and Hindi (Ambati et al., 2018) ." }, { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-8", "text": "However, all of these treebanks were not directly annotated according to the CCG formalism, but automatically converted from phrase structure or dependency treebanks, which is an error-prone process." 
}, { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-9", "text": "Direct annotation in CCG has so far mostly been limited to small datasets for seeding or testing semantic parsers (e.g., Artzi et al., 2015) , and no graphical annotation interface is available to support such efforts, making the annotation process difficult to scale." }, { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-10", "text": "The only exceptions we are aware of are the Groningen Meaning Bank and the Parallel Meaning Bank (Abzianidze et al., 2017) , two annotation efforts which use a graphical user interface for annotating sentences with CCG derivations and other annotation layers, and which have produced CCG treebanks for English, German, Italian, and Dutch." }, { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-11", "text": "However, these efforts are focused on semantics and have not released explicit guidelines for syntactic annotation." }, { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-12", "text": "Their annotation tool is limited in that annotators only have control over lexical categories, not larger constituents." }, { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-13", "text": "Even though CCG is a lexicalized formalism, where most decisions can be made on the lexical level, there is no full control over attachment phenomena in the lexicon." }, { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-14", "text": "Moreover, these annotation tools are not open-source and cannot easily be deployed to support other annotation efforts." }, { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-15", "text": "In this paper, we present an open-source, lightweight, easy-to-use graphical annotation tool that employs a statistical parser to create initial CCG derivations for sentences, and allows annotators to correct these annotations via lexical category constraints and span constraints." 
}, { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-16", "text": "Together, these constraints make it possible to effect (almost) all annotation decisions consistent with the principles of CCG." }, { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-17", "text": "We also present a pilot study for multilingual CCG annotation, in which a parallel corpus of 4x100 sentences (in English, German, Italian, and Dutch) was annotated by two annotators per sentence, a detailed annotation manual was created, and adjudication was performed to create a final version." }, { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-18", "text": "We publicly release the manual, the annotation tool, and the adjudicated data." }, { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-19", "text": "Our release also includes an additional > 10 K derivations, each manually corrected by a single annotator, and an additional > 82 K sentences, each partially corrected by a single annotator." }, { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-20", "text": "----------------------------------" }, { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-21", "text": "**AN ANNOTATION TOOL FOR CCG**" }, { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-22", "text": "Our annotation tool CCGweb 1 is Web-based, implemented in Python, PHP, and JavaScript, and should be easy to deploy on any recent Linux dis- tribution." }, { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-23", "text": "It has two main views: the home page shows the list of sentences an annotator is assigned to annotate." }, { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-24", "text": "Those already done are marked as \"marked correct\". Clicking on a sentence takes the annotator to the sentence view." }, { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-25", "text": "Annotators can also enter arbitrary sentences to annotate, e.g., for experimenting or for producing illustrations." 
}, { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-26", "text": "Dynamic Annotation Annotation follows an approach called dynamic annotation (Oepen et al., 2002) or human-aided machine annotation , in which sentences are automatically analyzed, annotators impose constraints to rule out undesired analyses, sentences are then reanalyzed subject to the constraints, and the process is repeated until only the desired analysis remains." }, { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-27", "text": "The current system is backed by the EasyCCG parser (Lewis and Steedman, 2014) , slightly modified to allow for incorporating constraints, and other CCG parsers could be plugged in with similar modifications." }, { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-28", "text": "To do this, the annotator clicks on the category and changes it, as shown in the figure." }, { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-29", "text": "When they hit enter or click somewhere else, the sentence is automatically parsed again in the background, this time with the lexical category constraint that go has category (S[b]\\ NP)/ PP." }, { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-30", "text": "In many cases, the parser will directly find the desired parse, with there being a PP, and the annotator only has to check it, not make another edit." }, { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-31", "text": "Span Constraints Although constraining lexical categories is often enough to determine the entire CCG derivation (cf. Bangalore and Joshi, 1999; Lewis and Steedman, 2014) , this is not always the case." }, { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-32", "text": "For example, consider the sentence I want to be a millionaire like my dad. Assuming that like my dad is a verb phrase modifier (category (S \\ NP)\\(S \\ NP)), it could attach to either to be or want, giving very different meanings (cf. Zimmer, 2013)." 
}, { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-33", "text": "We therefore implemented one other type of edit operation/constraint: span constraints." }, { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-34", "text": "By simply clicking and dragging across a span of tokens as shown in Figure 2 , annotators can constrain this span to be a constituent in the resulting parse." }, { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-35", "text": "Additional Features Our tool offers annotators some additional convenient features." }, { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-36", "text": "When unsure about some annotation decision, they can click the \"report issue\" button to open a discussion thread in an external forum, such as a GitHub issue tracker." }, { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-58", "text": "For an even greater number, we also release partially corrected derivations, meaning that the annotator made at least one change to the automatically created derivation." }, { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-37", "text": "To erase all constraints and restart annotation from the parser's original analysis, an annotator can click the \"reset\" button. And the buttons \"HTML\" and \"LaTeX\" provide code that can be copied and pasted to use the current derivation as an illustration on a web page or in a paper." }, { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-38", "text": "Adjudication Support Once two or more annotators have annotated a sentence, disagreements need to be discovered, and a final, authoritative version has to be created." }, { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-39", "text": "Our tool supports this adjudication process through the special user account judge." }, { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-40", "text": "This user can see the derivations of other annotators in a tabbed interface as shown in Figure 3 ." 
}, { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-41", "text": "In order to enable the judge to easily spot disagreements, categories that annotators disagree on are struck through, and constituents that annotators disagree on are dashed." }, { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-42", "text": "----------------------------------" }, { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-43", "text": "**A QUADRILINGUAL PILOT CCG TREEBANK**" }, { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-44", "text": "To test the viability of creating multilingual CCG treebanks by direct annotation, we conducted an annotation experiment on 110 short sentences from the Tatoeba corpus (Tatoeba, 2019) , each in four translations (English, German, Italian, and Dutch)." }, { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-45", "text": "The main annotation guideline was to copy the annotation style of CCGrebank (Honnibal et al., 2010), a CCG treebank adapted from CCGbank (Hockenmaier and Steedman, 2007) , which is in turn based on the Penn Treebank (Marcus et al., 1993) ." }, { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-46", "text": "Since CCGrebank only covers English and lacks some constructions observed in our corpus, an annotation manual with more specific instructions was needed." }, { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-47", "text": "We initially annotated ten sentences in four languages and discussed disagreements." }, { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-48", "text": "The results were recorded in an initial annotation manual, and the initial annotations were discarded." }, { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-49", "text": "Each of the remaining 4x100 sentences was then annotated independently by at least two of the authors." 
}, { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-50", "text": "Table 1 (upper part) shows the number of nonoverlapping category and span constraints that each annotator created on average per sentence before marking the sentence as correct." }, { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-51", "text": "Annotated sentences were manually classified by the first author into four classes: (0) sentences without any disagreements, (1) sentences with only trivial violations of the annotation guidelines (e.g., concerning attachment of punctuation or underspecifying modifier features), (2) sentences with only apparent oversights, such as giving a determiner a pronoun category, (3) sentences with more intricate disagreements which required additional guidelines to resolve." }, { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-52", "text": "Table 1 (upper part) shows the distribution of disagreement classes, and Table 2 shows examples of class (3)." }, { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-53", "text": "The first author adjudicated all disagreements and updated the annotation manual accordingly." }, { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-54", "text": "We release the manual and the full adjudicated dataset." }, { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-55", "text": "2 To make the resource more useful (e.g., for training parsers), we also include in the release the syntactic CCG derivations created so far in the Parallel Meaning Bank (Abzianidze et al., 2017) ." }, { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-56", "text": "These do not follow the annotation guidelines in detail due to their focus on semantics, nor have they been adjudicated, but instead corrected by a single annotator." }, { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-57", "text": "However, they are much greater in number." 
}, { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-60", "text": "We have used this tool to create the first published CCG resource that comes with an explicit annotation manual for syntax and has been created by direct annotation, rather than conversion from a non-CCG treebank." }, { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-61", "text": "It is multilingual, currently including English, German, Italian, and Dutch, and aims for cross-lingually consistent annotation guidelines." }, { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-62", "text": "For future work, we envision more extensive direct annotation of multilingual data with CCG derivations, and putting them to use for evaluating unsupervised and distantly supervised CCG parsers." }, { "sent_id": "3a7f65a63e875db3e6d722a695aa5a-C001-63", "text": "We would also like to investigate the use of our tool as an interactive aid in teaching CCG." } ], "y": { "@EXT@": { "gold_contexts": [ [ "3a7f65a63e875db3e6d722a695aa5a-C001-27" ] ], "cite_sentences": [ "3a7f65a63e875db3e6d722a695aa5a-C001-27" ] }, "@BACK@": { "gold_contexts": [ [ "3a7f65a63e875db3e6d722a695aa5a-C001-31" ] ], "cite_sentences": [ "3a7f65a63e875db3e6d722a695aa5a-C001-31" ] } } }, "ABC_ff73758fbef3ddc779a772e634b74e_47": { "x": [ { "sent_id": "ff73758fbef3ddc779a772e634b74e-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "ff73758fbef3ddc779a772e634b74e-C001-2", "text": "Autonomous systems in remote locations have a high degree of autonomy and there is a need to explain what they are doing and why in order to increase transparency and maintain trust." }, { "sent_id": "ff73758fbef3ddc779a772e634b74e-C001-3", "text": "Here, we describe a natural language chat interface that enables vehicle behaviour to be queried by the user." }, { "sent_id": "ff73758fbef3ddc779a772e634b74e-C001-4", "text": "We obtain an interpretable model of autonomy through having an expert 'speak out-loud' and provide explanations during a mission." 
}, { "sent_id": "ff73758fbef3ddc779a772e634b74e-C001-5", "text": "This approach is agnostic to the type of autonomy model and as expert and operator are from the same user-group, we predict that these explanations will align well with the operator's mental model, increase transparency and assist with operator training." }, { "sent_id": "ff73758fbef3ddc779a772e634b74e-C001-6", "text": "----------------------------------" }, { "sent_id": "ff73758fbef3ddc779a772e634b74e-C001-7", "text": "**INTRODUCTION**" }, { "sent_id": "ff73758fbef3ddc779a772e634b74e-C001-8", "text": "Autonomous systems (AxV) now routinely operate in regions that are dangerous or impossible for humans to reach, such as the deep underwater environment." }, { "sent_id": "ff73758fbef3ddc779a772e634b74e-C001-9", "text": "Typically, remote robots instil less trust than those co-located [1, 6] ." }, { "sent_id": "ff73758fbef3ddc779a772e634b74e-C001-10", "text": "This combined with high vulnerability in hazardous, high-stakes environments, such as that described in [7] , means that the interface between operator and AxV is key in maintaining situation awareness and understanding." }, { "sent_id": "ff73758fbef3ddc779a772e634b74e-C001-11", "text": "Specifically, AxVs need to be able to maintain a continuous communication with regards to what they are doing; and increase transparency through explaining their actions and behaviours." }, { "sent_id": "ff73758fbef3ddc779a772e634b74e-C001-12", "text": "Explanations can help formulate clear and accurate mental models of autonomous systems and robots." }, { "sent_id": "ff73758fbef3ddc779a772e634b74e-C001-13", "text": "Mental models, in cognitive theory, provide one view on how humans reason either functionally (understanding what the robot does) or structurally (understanding how it works)." }, { "sent_id": "ff73758fbef3ddc779a772e634b74e-C001-14", "text": "Mental models are important as they strongly impact how and whether robots and systems are used." 
}, { "sent_id": "ff73758fbef3ddc779a772e634b74e-C001-15", "text": "In previous work, explanations have been categorised as either explaining 1) machine learning as in [11] who showed that they can increase trust; 2) explaining plans [2, 13] ; 3) verbalising robot [12] or agent rationalisation [3] ." }, { "sent_id": "ff73758fbef3ddc779a772e634b74e-C001-16", "text": "However, humans do not present a constant verbalisation of their actions but they do need to be able to provide information on-demand about what they are doing and why during a live mission." }, { "sent_id": "ff73758fbef3ddc779a772e634b74e-C001-17", "text": "We present here, MIRIAM, (Multimodal Intelligent inteRactIon for Autonomous systeMs), as seen in Figure 1 ." }, { "sent_id": "ff73758fbef3ddc779a772e634b74e-C001-18", "text": "MIRIAM allows for these 'on-demand' queries for status and explanations of behaviour." }, { "sent_id": "ff73758fbef3ddc779a772e634b74e-C001-19", "text": "MIRIAM interfaces with the Neptune autonomy software provided by SeeByte Ltd and runs alongside their SeeTrack interface." }, { "sent_id": "ff73758fbef3ddc779a772e634b74e-C001-20", "text": "In this paper, we focus on explanations of behaviours and describe a method that is agnostic to the type of autonomy method." }, { "sent_id": "ff73758fbef3ddc779a772e634b74e-C001-21", "text": "With respect to providing communication for monitoring, please refer to [5] for further details and an overview of the system." }, { "sent_id": "ff73758fbef3ddc779a772e634b74e-C001-22", "text": "----------------------------------" }, { "sent_id": "ff73758fbef3ddc779a772e634b74e-C001-23", "text": "**EXPLANATIONS FOR REMOTE AUTONOMY**" }, { "sent_id": "ff73758fbef3ddc779a772e634b74e-C001-24", "text": "Types of explanations include why to provide a trace or reasoning and why not to elaborate on the system's control method or strategy [4] ." }, { "sent_id": "ff73758fbef3ddc779a772e634b74e-C001-25", "text": "Lim et al. 
(2009) [10] show that both why and why not explanations increase understanding, but only why increases trust." }, { "sent_id": "ff73758fbef3ddc779a772e634b74e-C001-26", "text": "We adopt here the 'speak-aloud' method whereby an expert provides rationalisation of the AxV behaviours while watching videos of missions on the SeeTrack software." }, { "sent_id": "ff73758fbef3ddc779a772e634b74e-C001-27", "text": "This has the advantage of being agnostic to the method of autonomy and could be used to describe not only rule-based autonomous behaviours but also complex deep models." }, { "sent_id": "ff73758fbef3ddc779a772e634b74e-C001-28", "text": "Similar human-provided rationalisation has been used to generate explanations of deep neural models for game play [3] ." }, { "sent_id": "ff73758fbef3ddc779a772e634b74e-C001-29", "text": "An interpretable model of autonomy was then derived from the expert, as partially shown in Figure 2 ." }, { "sent_id": "ff73758fbef3ddc779a772e634b74e-C001-30", "text": "If a why request is made, the decision tree is checked against the current mission status and history and the possible reasons are determined, along with a probability." }, { "sent_id": "ff73758fbef3ddc779a772e634b74e-C001-31", "text": "As we can see from example outputs in Figure 3A , there may be multiple reasons with varying levels of certainty depending on the information available at a given point in the mission." }, { "sent_id": "ff73758fbef3ddc779a772e634b74e-C001-32", "text": "Hence in this example, when the same why question is asked at a later point, only one higher-confidence answer is given." }, { "sent_id": "ff73758fbef3ddc779a772e634b74e-C001-33", "text": "In the example scenario given in Figure 3B , the operator is able to observe in the SeeTrack interface that the vehicle has not done a GPS fix for some time." 
}, { "sent_id": "ff73758fbef3ddc779a772e634b74e-C001-34", "text": "The user asks why it is not doing a GPS fix and the answer explains the relevant constraints on the vehicle, as captured in the interpretable autonomy model." }, { "sent_id": "ff73758fbef3ddc779a772e634b74e-C001-35", "text": "The surface representations of the explanations are generated using template-based NLG." }, { "sent_id": "ff73758fbef3ddc779a772e634b74e-C001-36", "text": "The wording of the output reflects the certainty on three levels: above 80% (high), 80% to 40% (medium) and below 40% (low)." }, { "sent_id": "ff73758fbef3ddc779a772e634b74e-C001-37", "text": "----------------------------------" }, { "sent_id": "ff73758fbef3ddc779a772e634b74e-C001-38", "text": "**DISCUSSION AND FUTURE WORK**" }, { "sent_id": "ff73758fbef3ddc779a772e634b74e-C001-39", "text": "Future work includes conducting user evaluations to examine the trade-off between providing all of the information, even if one is not 100% sure ('completeness') versus providing only those statements with very high confidence ('soundness')." }, { "sent_id": "ff73758fbef3ddc779a772e634b74e-C001-40", "text": "The former is shown in Figure 3 ." }, { "sent_id": "ff73758fbef3ddc779a772e634b74e-C001-41", "text": "This trade-off will vary between personnel with different information needs and expertise and domains as discussed in [9] ." }, { "sent_id": "ff73758fbef3ddc779a772e634b74e-C001-42", "text": "Verbal indicators (e.g. \"It is likely/probable\") have been used in weather reporting to reflect levels of certainty [8] ." }, { "sent_id": "ff73758fbef3ddc779a772e634b74e-C001-43", "text": "However, informal feedback from users indicate that the use of such verbal indicators may reduce confidence in the reporting system and therefore may not be suitable for highly critical, high risk situations." }, { "sent_id": "ff73758fbef3ddc779a772e634b74e-C001-44", "text": "Exactly how these should be expressed is the subject of future work." 
}, { "sent_id": "ff73758fbef3ddc779a772e634b74e-C001-45", "text": "We present here a method for explaining behaviours of remote AxV, which is agnostic to the autonomy model." }, { "sent_id": "ff73758fbef3ddc779a772e634b74e-C001-46", "text": "Fortunately, in this domain, it is appropriate for the expert to be from the same pool of end-users (i.e. operators) and therefore explanations are likely to align with their mental models and assumptions about the system." }, { "sent_id": "ff73758fbef3ddc779a772e634b74e-C001-47", "text": "This will not always be the case, as described in [2] , e.g. for in-home help robots, where users and planning experts have disparate mental models." }, { "sent_id": "ff73758fbef3ddc779a772e634b74e-C001-48", "text": "Future work will involve the evaluation of explanations with respect to mental models." } ], "y": { "@BACK@": { "gold_contexts": [ [ "ff73758fbef3ddc779a772e634b74e-C001-15" ], [ "ff73758fbef3ddc779a772e634b74e-C001-28" ] ], "cite_sentences": [ "ff73758fbef3ddc779a772e634b74e-C001-15", "ff73758fbef3ddc779a772e634b74e-C001-28" ] } } }, "ABC_0924035155d4bbac7768c65fbe8f9a_47": { "x": [ { "sent_id": "0924035155d4bbac7768c65fbe8f9a-C001-11", "text": "The chart realizer takes as input logical forms represented internally using Hybrid Logic Dependency Semantics (HLDS), a dependency-based approach to representing linguistic meaning (Baldridge and Kruijff, 2002) ." }, { "sent_id": "0924035155d4bbac7768c65fbe8f9a-C001-12", "text": "To illustrate the input to OpenCCG, consider the semantic dependency graph in Figure 1 ." }, { "sent_id": "0924035155d4bbac7768c65fbe8f9a-C001-13", "text": "In the graph, each node has a lexical predication (e.g. make.03) and a set of semantic features (e.g. NUM sg); nodes are connected via dependency relations (e.g. ARG0 )." }, { "sent_id": "0924035155d4bbac7768c65fbe8f9a-C001-14", "text": "Such graphs are broadly similar to the \"deep\" shared task inputs." 
}, { "sent_id": "0924035155d4bbac7768c65fbe8f9a-C001-15", "text": "Note, however, that they are quite different from the shallow input trees, where many of the expected dependencies from coordination, control and relatization are missing." }, { "sent_id": "0924035155d4bbac7768c65fbe8f9a-C001-16", "text": "For example, in the figure, both dependents of make.03 would be missing in the shallow tree, which involve control and relativization (with a null relativizer)." }, { "sent_id": "0924035155d4bbac7768c65fbe8f9a-C001-17", "text": "As it would be difficult to hallucinate such dependencies, we have only attempted the deep task." }, { "sent_id": "0924035155d4bbac7768c65fbe8f9a-C001-18", "text": "Grammar-based chart realization in the tradition of Kay is capable of attaining high precision, but achieving broad coverage is a challenge, as is robustness to any deviations in the expected input." }, { "sent_id": "0924035155d4bbac7768c65fbe8f9a-C001-19", "text": "Previous work on chart realization has primarily used inputs derived from gold standard parses, and indeed, native OpenCCG inputs have been obtained from gold standard derivations in the CCGbank (Hockenmaier and Steedman, 2007) ." }, { "sent_id": "0924035155d4bbac7768c65fbe8f9a-C001-20", "text": "Given the available time, our strategy was to make minor adjustments to OpenCCG's extracted grammars while devoting the bulk of our effort to converting the shared task inputs to be as similar as possible to the native inputs." }, { "sent_id": "0924035155d4bbac7768c65fbe8f9a-C001-21", "text": "Difficulties in conversion led us to employ machine learning for relation mapping and to introduce several robustness measures into OpenCCG's realization algorithm." }, { "sent_id": "0924035155d4bbac7768c65fbe8f9a-C001-22", "text": "Nevertheless, the percentage of grammatically complete realizations still remained well below results using native OpenCCG inputs on the development set, with a corresponding drop in output quality." 
}, { "sent_id": "0924035155d4bbac7768c65fbe8f9a-C001-23", "text": "----------------------------------" }, { "sent_id": "0924035155d4bbac7768c65fbe8f9a-C001-24", "text": "**CONVERSION**" }, { "sent_id": "0924035155d4bbac7768c65fbe8f9a-C001-25", "text": "In previous work, when extracting HLDS quasilogical form graphs from the CCGbank, we removed semantically empty function words such as complementizers, infinitival-to, expletive subjects, and case-marking prepositions." }, { "sent_id": "0924035155d4bbac7768c65fbe8f9a-C001-26", "text": "For improved consistency with shared task inputs, we have instead left expletive subjects and all prepositions (but not complementizers and relativizers) in the native dependency graphs." }, { "sent_id": "0924035155d4bbac7768c65fbe8f9a-C001-27", "text": "Even so, the logical forms our system expects differed from the shared task inputs in many ways, the most notable being the structure of conjunctions, possessives and relative clauses, so manual conversion rules were written to handle these cases." }, { "sent_id": "0924035155d4bbac7768c65fbe8f9a-C001-28", "text": "In addition, named entities and hyphenated words were collapsed to form atomic logical form predicates, and for simplicity quotes were ignored." }, { "sent_id": "0924035155d4bbac7768c65fbe8f9a-C001-29", "text": "The conversion was effected by a Java converter augmented by XSL transforms." }, { "sent_id": "0924035155d4bbac7768c65fbe8f9a-C001-30", "text": "----------------------------------" }, { "sent_id": "0924035155d4bbac7768c65fbe8f9a-C001-31", "text": "**RELATION TAGGER**" }, { "sent_id": "0924035155d4bbac7768c65fbe8f9a-C001-32", "text": "Since the shared task graphs used relations between nodes which were often not easily mappable to native OpenCCG relations, we trained a maxent classifier to tag the most likely relation, as well as an auxiliary maxent classifier to POS tag the graph nodes, much like hypertagging (Espinosa et al., 2008) ." 
}, { "sent_id": "0924035155d4bbac7768c65fbe8f9a-C001-33", "text": "Training data for the classifier was extracted by comparing each relation between two nodes in the input shared task graph with the corresponding relation in the HLDS logical form." }, { "sent_id": "0924035155d4bbac7768c65fbe8f9a-C001-34", "text": "In case a labeled relation did not exist in the HLDS graph, a NoRel relation label was assigned." }, { "sent_id": "0924035155d4bbac7768c65fbe8f9a-C001-35", "text": "On the development data, we obtained accuracies of 90% for the POS tagger and 90.5% for the relation classifier." }, { "sent_id": "0924035155d4bbac7768c65fbe8f9a-C001-36", "text": "A substantial portion of the errors were related to the NoRel outcome." }, { "sent_id": "0924035155d4bbac7768c65fbe8f9a-C001-37", "text": "Of the 5154 NoRel cases in the dev sect, 444 were miscategorized as Mod, 344 as Arg1, 212 as Arg0, and 107 as Det." }, { "sent_id": "0924035155d4bbac7768c65fbe8f9a-C001-38", "text": "The other major error was that the Mod relation was often erroneously misclassified as NoRel." }, { "sent_id": "0924035155d4bbac7768c65fbe8f9a-C001-39", "text": "----------------------------------" }, { "sent_id": "0924035155d4bbac7768c65fbe8f9a-C001-40", "text": "**REALIZATION RESULTS AND DISCUSSION**" }, { "sent_id": "0924035155d4bbac7768c65fbe8f9a-C001-41", "text": "In spite of the graph structure and relation label changes described above, it still proved necessary to make several adjustments to both OpenCCG as well as the converted graphs." }, { "sent_id": "0924035155d4bbac7768c65fbe8f9a-C001-42", "text": "OpenCCG's strict relation checking had to be relaxed to permit divergences between the relations supplied by a lexical item and the ones in the input graph." 
}, { "sent_id": "0924035155d4bbac7768c65fbe8f9a-C001-43", "text": "In cases where no complete realization could be found, we also employed a novel approach to assembling fragments using MTinspired glue rules (White, 2011) , which enable a more exhaustive search of possible fragment com- binations and allow for n-best outputs." }, { "sent_id": "0924035155d4bbac7768c65fbe8f9a-C001-44", "text": "Additionally, we added optionality operators into the converted shared task graphs, in order to allow certain features or relations to be used as required by the grammar's constraints." }, { "sent_id": "0924035155d4bbac7768c65fbe8f9a-C001-45", "text": "The most notable cases were an optional DET nil feature for nodes that could be expressed by bare nouns, and making certain relations optional, especially those derived from Nombank that yielded multiple parents for the child node." }, { "sent_id": "0924035155d4bbac7768c65fbe8f9a-C001-46", "text": "For the experiments reported below, as in previous work, we used a lexico-grammar extracted from Sections 02-21 of our enhanced CCGbank with a similar model training procedure." }, { "sent_id": "0924035155d4bbac7768c65fbe8f9a-C001-47", "text": "Development set results appear in Table 2 ." }, { "sent_id": "0924035155d4bbac7768c65fbe8f9a-C001-48", "text": "Single-best and weighted 5-best BLEU scores, along with coverage percentages, are given for both the converted shared task inputs as well as native OpenCCG inputs, for comparison." }, { "sent_id": "0924035155d4bbac7768c65fbe8f9a-C001-49", "text": "The OSU.1 system includes outputs for all sentences, assembling fragments if no grammatically complete realizations are found; the OSU.2 system only includes outputs for complete realizations." 
}, { "sent_id": "0924035155d4bbac7768c65fbe8f9a-C001-50", "text": "1 As the table shows, the percentage of grammatically complete realizations for the converted shared task inputs is well below the percentage using native inputs, with a corresponding drop in BLEU scores." }, { "sent_id": "0924035155d4bbac7768c65fbe8f9a-C001-51", "text": "Debugging efforts suggest that the remaining relation mismatches and other structural divergences are preventing complete realizations from being derived most of the time." }, { "sent_id": "0924035155d4bbac7768c65fbe8f9a-C001-52", "text": "The relative absence of punctuation-related features may also be an issue." }, { "sent_id": "0924035155d4bbac7768c65fbe8f9a-C001-53", "text": "In future work, we plan to explore using machine learning more comprehensively to convert the inputs, beyond just relation tagging." }, { "sent_id": "0924035155d4bbac7768c65fbe8f9a-C001-54", "text": "We also plan to explore whether grammars can be induced that are more directly compatible with shared task inputs." }, { "sent_id": "0924035155d4bbac7768c65fbe8f9a-C001-55", "text": "1 Native coverage is less than 100% because of failures to derive a complete LF from the CCGbank; shared task coverage could have been 100% but the system was only run on the same inputs as in the native case." }, { "sent_id": "0924035155d4bbac7768c65fbe8f9a-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "0924035155d4bbac7768c65fbe8f9a-C001-2", "text": "This report documents our efforts to develop a Generation Challenges 2011 surface realization system by converting the shared task deep inputs to ones compatible with OpenCCG." 
}, { "sent_id": "0924035155d4bbac7768c65fbe8f9a-C001-3", "text": "Although difficulties in conversion led us to employ machine learning for relation mapping and to introduce several robustness measures into OpenCCG's grammar-based chart realizer, the percentage of grammatically complete realizations still remained well below results using native OpenCCG inputs on the development set, with a corresponding drop in output quality." }, { "sent_id": "0924035155d4bbac7768c65fbe8f9a-C001-4", "text": "We discuss known conversion issues and possible ways to improve performance on shared task inputs." }, { "sent_id": "0924035155d4bbac7768c65fbe8f9a-C001-5", "text": "----------------------------------" }, { "sent_id": "0924035155d4bbac7768c65fbe8f9a-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "0924035155d4bbac7768c65fbe8f9a-C001-7", "text": "Our Generation Challenges 2011 shared task system represents an initial attempt to develop a surface realizer for shared task inputs that takes advantage of prior work on broad coverage realization with OpenCCG (White, 2006; Espinosa et al., 2008; Rajkumar and White, 2010) ." }, { "sent_id": "0924035155d4bbac7768c65fbe8f9a-C001-8", "text": "OpenCCG is a parsing/generation library for Combinatory Categorial Grammar (Steedman, 2000) ." }, { "sent_id": "0924035155d4bbac7768c65fbe8f9a-C001-9", "text": "CCG is a unification-based categorial grammar formalism defined almost entirely in terms of lexical entries that encode sub-categorization as well as syntactic features." }, { "sent_id": "0924035155d4bbac7768c65fbe8f9a-C001-10", "text": "OpenCCG implements a grammarbased chart realization algorithm in the tradition of Kay's (1996) approach to bidirectional processing with unification grammars." 
} ], "y": { "@UNSURE@": { "gold_contexts": [ [ "0924035155d4bbac7768c65fbe8f9a-C001-7" ] ], "cite_sentences": [ "0924035155d4bbac7768c65fbe8f9a-C001-7" ] }, "@SIM@": { "gold_contexts": [ [ "0924035155d4bbac7768c65fbe8f9a-C001-32" ] ], "cite_sentences": [ "0924035155d4bbac7768c65fbe8f9a-C001-32" ] } } }, "ABC_400bd47879aaed0aa1195bafe54e76_48": { "x": [ { "sent_id": "400bd47879aaed0aa1195bafe54e76-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "400bd47879aaed0aa1195bafe54e76-C001-2", "text": "We introduced KaWAT (Kata Word Analogy Task), a new word analogy task dataset for Indonesian." }, { "sent_id": "400bd47879aaed0aa1195bafe54e76-C001-3", "text": "We evaluated on it several existing pretrained Indonesian word embeddings and embeddings trained on Indonesian online news corpus." }, { "sent_id": "400bd47879aaed0aa1195bafe54e76-C001-4", "text": "We also tested them on two downstream tasks and found that pretrained word embeddings helped either by reducing the training epochs or yielding significant performance gains." }, { "sent_id": "400bd47879aaed0aa1195bafe54e76-C001-5", "text": "----------------------------------" }, { "sent_id": "400bd47879aaed0aa1195bafe54e76-C001-6", "text": "**INTRODUCTION**" }, { "sent_id": "400bd47879aaed0aa1195bafe54e76-C001-7", "text": "Despite the existence of various Indonesian pretrained word embeddings, there are no publicly available Indonesian analogy task datasets on which to evaluate these embeddings." }, { "sent_id": "400bd47879aaed0aa1195bafe54e76-C001-8", "text": "Consequently, it is unknown if Indonesian word embeddings introduced in, e.g., (Al-Rfou et al., 2013) and (Grave et al., 2018) , capture syntactic or semantic information as measured by analogy tasks." }, { "sent_id": "400bd47879aaed0aa1195bafe54e76-C001-9", "text": "Also, such embeddings are usually trained on Indonesian Wikipedia (Al-Rfou et al., 2013; Bojanowski et al., 2017) whose size is relatively small, approximately 60M tokens." 
}, { "sent_id": "400bd47879aaed0aa1195bafe54e76-C001-10", "text": "Therefore, in this work, we introduce KaWAT (Kata Word Analogy Task), an Indonesian word analogy task dataset, and new Indonesian word embeddings pretrained on 160M tokens of online news corpus." }, { "sent_id": "400bd47879aaed0aa1195bafe54e76-C001-11", "text": "We evaluated these embeddings on KaWAT, and also tested them on POS tagging and text summarization as representatives of syntactic and semantic downstream task respectively." }, { "sent_id": "400bd47879aaed0aa1195bafe54e76-C001-12", "text": "----------------------------------" }, { "sent_id": "400bd47879aaed0aa1195bafe54e76-C001-13", "text": "**METHODOLOGY**" }, { "sent_id": "400bd47879aaed0aa1195bafe54e76-C001-14", "text": "We asked an Indonesian linguist to help build KaWAT based on English analogy task datasets such as Google Word Analogy (Mikolov et al., 2013a) and BATS (Gladkova et al., 2016) ." }, { "sent_id": "400bd47879aaed0aa1195bafe54e76-C001-15", "text": "Following those works, we split the analogy tasks into two categories, syntax and semantic." }, { "sent_id": "400bd47879aaed0aa1195bafe54e76-C001-16", "text": "We included mostly morphological analogies in the syntax category, leveraging the richness of Indonesian inflectional morphology." }, { "sent_id": "400bd47879aaed0aa1195bafe54e76-C001-17", "text": "For semantic, we included analogies such as antonyms, country capitals and currencies, gender-specific words, measure words, and Indonesian province capitals." }, { "sent_id": "400bd47879aaed0aa1195bafe54e76-C001-18", "text": "In total, we have 15K syntactic and 19K semantic analogy queries." }, { "sent_id": "400bd47879aaed0aa1195bafe54e76-C001-19", "text": "KaWAT is available online." }, { "sent_id": "400bd47879aaed0aa1195bafe54e76-C001-20", "text": "1 One of the goals of this work is to evaluate and compare existing Indonesian pretrained word embeddings." 
}, { "sent_id": "400bd47879aaed0aa1195bafe54e76-C001-21", "text": "We used fastText pretrained embeddings introduced in (Bojanowski et al., 2017 ) and (Grave et al., 2018) , which have been trained on Indonesian Wikipedia and Indonesian Wikipedia plus Common Crawl data respectively." }, { "sent_id": "400bd47879aaed0aa1195bafe54e76-C001-22", "text": "We refer to them as Wiki/fastText and CC/fastText hereinafter." }, { "sent_id": "400bd47879aaed0aa1195bafe54e76-C001-23", "text": "We also used another two pretrained embeddings: polyglot embedding trained on Indonesian Wikipedia (Al-Rfou et al., 2013) and NLPL embedding trained on the Indonesian portion of CoNLL 2017 corpus (Fares et al., 2017) ." }, { "sent_id": "400bd47879aaed0aa1195bafe54e76-C001-24", "text": "For training our word embeddings, we used online news corpus obtained from Tempo." }, { "sent_id": "400bd47879aaed0aa1195bafe54e76-C001-25", "text": "2 We used Tempo newspaper and magazine articles up to year 2014." }, { "sent_id": "400bd47879aaed0aa1195bafe54e76-C001-26", "text": "This corpus contains roughly 400K articles, 160M word tokens, and 600K word types." }, { "sent_id": "400bd47879aaed0aa1195bafe54e76-C001-27", "text": "To train the word embeddings, we experimented with three algorithms: word2vec (Mikolov et al., 2013b) , fastText (Bojanowski et al., 2017) , and GloVe (Pennington et al., 2014) ." }, { "sent_id": "400bd47879aaed0aa1195bafe54e76-C001-28", "text": "We refer to them henceforth as Tempo/word2vec, Tempo/fastText, and Tempo/GloVe respectively." }, { "sent_id": "400bd47879aaed0aa1195bafe54e76-C001-29", "text": "We used gensim 3 to run word2vec and fastText and the original C implementation for GloVe." }, { "sent_id": "400bd47879aaed0aa1195bafe54e76-C001-30", "text": "4 For all three, we used their default hyperparameters, i.e. no tuning was performed." }, { "sent_id": "400bd47879aaed0aa1195bafe54e76-C001-31", "text": "Our three embeddings are available online." 
}, { "sent_id": "400bd47879aaed0aa1195bafe54e76-C001-32", "text": "5 Evaluation on KaWAT was done using gensim with its KeyedVectors.most similar method." }, { "sent_id": "400bd47879aaed0aa1195bafe54e76-C001-33", "text": "Since the vocabularies of the word embeddings are different, for a fair comparison, we first removed analogy queries containing words that do not exist in any vocabulary." }, { "sent_id": "400bd47879aaed0aa1195bafe54e76-C001-34", "text": "In other words, we only kept queries whose words all exist in all vocabularies." }, { "sent_id": "400bd47879aaed0aa1195bafe54e76-C001-35", "text": "After this process, there were roughly 6K syntactic and 1.5K semantic queries." }, { "sent_id": "400bd47879aaed0aa1195bafe54e76-C001-36", "text": "We performed evaluation by computing 95% confidence interval of the accuracy at rank 1 by bootstrapping." }, { "sent_id": "400bd47879aaed0aa1195bafe54e76-C001-37", "text": "Our implementation code is available online." }, { "sent_id": "400bd47879aaed0aa1195bafe54e76-C001-38", "text": "6 3 Results" }, { "sent_id": "400bd47879aaed0aa1195bafe54e76-C001-39", "text": "----------------------------------" }, { "sent_id": "400bd47879aaed0aa1195bafe54e76-C001-40", "text": "**WORD ANALOGY RESULTS**" }, { "sent_id": "400bd47879aaed0aa1195bafe54e76-C001-41", "text": "We found that on syntactic analogies, Wiki/fastText achieved 2.7% accuracy, which significantly outperformed the others, even CC/fastText which has been trained on a much larger corpus." }, { "sent_id": "400bd47879aaed0aa1195bafe54e76-C001-42", "text": "Other embeddings performed poorly, mostly less than 1% of accuracy." }, { "sent_id": "400bd47879aaed0aa1195bafe54e76-C001-43", "text": "The overall trend of low accuracy scores attests to the difficulty of syntactic KaWAT analogies, making it suitable as benchmark for future research." 
}, { "sent_id": "400bd47879aaed0aa1195bafe54e76-C001-44", "text": "On semantic analogies, Tempo/GloVe clearly outperformed the others with 20.42% accuracy, except Tempo/word2vec." }, { "sent_id": "400bd47879aaed0aa1195bafe54e76-C001-45", "text": "Surprisingly, we found that Tempo/fastText performed very poorly with less than 1% accuracy, even worse than Wiki/fastText which has been trained on a much smaller data." }, { "sent_id": "400bd47879aaed0aa1195bafe54e76-C001-46", "text": "Overall, the accuracies on semantic are also low, less than 25%, which again attests to the suitability of KaWAT as benchmark for future work." }, { "sent_id": "400bd47879aaed0aa1195bafe54e76-C001-47", "text": "----------------------------------" }, { "sent_id": "400bd47879aaed0aa1195bafe54e76-C001-48", "text": "**DOWNSTREAM TASK RESULTS**" }, { "sent_id": "400bd47879aaed0aa1195bafe54e76-C001-49", "text": "To check how useful these embeddings are for downstream tasks, we evaluated them on POS tagging and text summarization task." }, { "sent_id": "400bd47879aaed0aa1195bafe54e76-C001-50", "text": "For each task, we compared two embeddings, which are the best off-the-shelf pretrained embedding and our proposed embedding on the syntactic (for POS) and semantic (for summarization) analogy task respectively." }, { "sent_id": "400bd47879aaed0aa1195bafe54e76-C001-51", "text": "7 We used the same model and setting as (Kurniawan and Aji, 2018) for POS tagging and (Kurniawan and Louvan, 2018) for summarization." }, { "sent_id": "400bd47879aaed0aa1195bafe54e76-C001-52", "text": "However, for computational reasons, we tuned only the learning rate using grid search, and only used the first fold of the summarization dataset." }, { "sent_id": "400bd47879aaed0aa1195bafe54e76-C001-53", "text": "Our key finding from the POS tagging experiment is that using the two embeddings did not yield significant gain on test F 1 score compared with not using any pretrained embedding (around 97.23)." 
}, { "sent_id": "400bd47879aaed0aa1195bafe54e76-C001-54", "text": "However, on average, using Wiki/fastText resulted in 20% fewer training epochs, compared with only 4% when using Tempo/GloVe." }, { "sent_id": "400bd47879aaed0aa1195bafe54e76-C001-55", "text": "For the summarization experiment, Tempo/GloVe was significantly better 8 than CC/fastText in ROUGE-1 and ROUGE-L scores (66.63 and 65.93 respectively)." }, { "sent_id": "400bd47879aaed0aa1195bafe54e76-C001-56", "text": "The scores of using CC/fastText was on par to those of not using any pretrained word embedding, and we did not observe fewer training epochs when using pretrained word embedding." }, { "sent_id": "400bd47879aaed0aa1195bafe54e76-C001-57", "text": "----------------------------------" }, { "sent_id": "400bd47879aaed0aa1195bafe54e76-C001-58", "text": "**CONCLUSION**" }, { "sent_id": "400bd47879aaed0aa1195bafe54e76-C001-59", "text": "We introduced KaWAT, a new dataset for Indonesian word analogy task, and evaluated several Indonesian pretrained word embeddings on it." }, { "sent_id": "400bd47879aaed0aa1195bafe54e76-C001-60", "text": "We found that (1) in general, accuracies on the analogy tasks were low, suggesting that improvements for Indonesian word embeddings are still possible and KaWAT is hard enough to be the benchmark dataset for that purpose, (2) on syntactic analogies, embedding by (Bojanowski et al., 2017) performed best and yielded 20% fewer training epochs when employed for POS tagging, and (3) on semantic analogies, GloVe embedding trained on Tempo corpus performed best and produced significant gains on ROUGE-1 and ROUGE-L scores when used for text summarization." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "400bd47879aaed0aa1195bafe54e76-C001-8" ] ], "cite_sentences": [ "400bd47879aaed0aa1195bafe54e76-C001-8" ] }, "@USE@": { "gold_contexts": [ [ "400bd47879aaed0aa1195bafe54e76-C001-21" ] ], "cite_sentences": [ "400bd47879aaed0aa1195bafe54e76-C001-21" ] } } }, "ABC_1ebbddc6c6740aea71ade2ed915de4_48": { "x": [ { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-75", "text": "**ACKNOWLEDGMENTS**" }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-2", "text": "The paper steps outside the comfort-zone of the traditional NLP tasks like automatic speech recognition (ASR) and machine translation (MT) to addresses two novel problems arising in the automated multilingual news monitoring: segmentation of the TV and radio program ASR transcripts into individual stories, and clustering of the individual stories coming from various sources and languages into storylines." }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-3", "text": "Storyline clustering of stories covering the same events is an essential task for inquisitorial media monitoring." }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-4", "text": "We address these two problems jointly by engaging the low-dimensional semantic representation capabilities of the sequence to sequence neural translation models." }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-5", "text": "To enable joint multi-task learning for multilingual neural translation of morphologically rich languages we replace the attention mechanism with the sliding-window mechanism and operate the sequence to sequence neural translation model on the character-level rather than on the word-level." }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-6", "text": "The story segmentation and storyline clustering problem is tackled by examining the low-dimensional vectors produced as a side-product of the neural translation process." 
}, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-7", "text": "The results of this paper describe a novel approach to the automatic story segmentation and storyline clustering problem." }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-8", "text": "----------------------------------" }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-9", "text": "**THE SUMMA PROJECT OVERVIEW**" }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-10", "text": "Media monitoring enables the global news media to be viewed in terms of emerging trends, people in the news, and the evolution of storylines (Risen et al., 2013) ." }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-11", "text": "The massive growth in the number of broadcast and Internet media channels requires innovative ways to cope with this increasing amount of data." }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-12", "text": "It is the aim of SUMMA 1 project to significantly improve media monitoring by creating a platform to automate the analysis of media streams across many languages." }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-13", "text": "Within SUMMA project three European news broadcasters BBC, Deutche Welle, and Latvian news agency LETA are joining the forces with the University of Edinburgh, University College London, Swiss IDIAP Research Institute, Qatar Computing Research Institute, and Priberam Labs from Portugal to adapt the emerging big data neural deep learning NLP techniques to the needs of the international news monitoring industry." }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-14", "text": "BBC Monitoring undertakes one of the most advanced, comprehensive, and large scale media monitoring operations world-wide, providing news and information from media sources around the world." 
}, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-15", "text": "BBC monitoring journalists and analysts translate from over 30 languages into English, and follow approximately 13,500 sources, of which 1,500 are television broadcasters, 1,300 are radio, 3,700 are key news portals world-wide, 20 are commercial news feeds, and the rest are RSS feeds and selected Social Media sources." }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-16", "text": "Monitoring journalists follow important stories and flag breaking news events as part of the routine monitoring." }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-17", "text": "1 SUMMA (Scalable Understanding of Multilingual MediA) is project 688139 funded by the European Union H2020- The central idea behind SUMMA is to develop a scalable multilingual media monitoring platform ( Fig.1 ) that combines the real-time media stream processing (speech recognition, machine translation, story clustering) with indepth batch-oriented construction of a rich knowledge base of reported events and entities mentioned, enabling extractive summarization of the storylines in the news." }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-18", "text": "In this paper we focus only on the streaming shallow processing part of the SUMMA project (the dark block in Fig.1 ), where the recently developed neural machine translation techniques (Sutskerev, Vinyals & Le, 2014; Bahdanau, Cho & Bengio, 2014) enable radically new end-to-end approach to machine translation and clustering of the incoming news stories." }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-19", "text": "The approach is informed by our previous work on machine learning (Barzdins, Paikens, Gosko, 2013) , media monitoring (Barzdins et al.,2014) , and character-level neural translation (Barzdins & Gosko, 2016) ." 
}, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-36", "text": "Moving from wordlevel to character-level neural translation makes it even harder to cope with long sentences presenting additional reason to employ the sliding-window translation approach." }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-20", "text": "The key difference of the SUMMA project is that it has been incepted after the recent paradigm-shift (Manning, 2015) in the NLP community towards neural network inspired deep learning techniques such as end-to-end automatic speech recognition (Graves & Jaitly, 2014; Hannun et al., 2014; Amodei, 2015) , end-to-end machinetranslation (Sutskerev, Vinyals & Le, 2014; Bahdanau, Cho & Bengio, 2014; Luong et al., 2015) , efficient distributed vectorspace word embeddings (Mikolov et al., 2013) , image and video captioning Venugopalan et al., 2015) , unsupervised learning of document representations by autoencoders (Li, Luong & Jurafsky, 2015) ." }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-21", "text": "These recent deep learning breakthroughs along with massively parallel GPU computing allow addressing the media monitoring tasks in the completely new end-toend manner rather than relying on the legacy NLP pipelines." }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-22", "text": "The novelty of the SUMMA project approach is that all languages covered by the project (Table 1) can be embedded in the same vectorspace by means of joint multitask learning (Collobert et al., 2011; Dong et al., 2015; Pham, Luong & Manning, 2015) of eight LSTM-RNN translational autoencoders with hidden layer parameters shared as illustrated in Fig.2 ." }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-23", "text": "Sharing the same vectorspace for sentences in all project languages enables accurate multilingual news story clustering without resorting to the clustering of the less accurate target (English) language machine translations." 
}, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-24", "text": "This shared vectorspace approach extends also to the unsupervised multi-task learning of language models from the large monolingual corpora (Fig. 3) , which is crucial for low-resourced languages: having a generic language model learned in parallel from the monolingual corpora reduces (Dai & Le, 2015) the need for large supervised parallel corpora to achieve the same translational accuracy for the Fig. 2 setup." }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-25", "text": "The joint training of seventeen translational and samelanguage autoencoders with shared parameters ( Fig. 2 and Fig. 3 together) to our knowledge has not been attempted so far." }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-26", "text": "Even training of a single state-of-the-art sentencelevel translational autoencoder requires days of GPU computing (Barzdins & Gosko, 2016) ) in TensorFlow (Abadi et al., 2015) seq2seq model (Sutskerev, Vinyals & Le, 2014; Bahdanau, Cho & Bengio, 2014) ." }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-27", "text": "To avoid complexities of asynchronous parallel training with shared parameter server (Dean et al., 2012) , the architecture in Fig.2 and Fig. 3 instead can be trained using the alternating training approach proposed in (Luong et al., 2016) , where each task is optimized for a fixed number of parameter updates (or mini-batches) before switching to the next task (which is a different language pair)." }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-28", "text": "Although such alternating approach prolongs the training process, it is preferred for simplicity and robustness reasons." 
}, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-29", "text": "Once produced within SUMMA project, these translational autoencoders with shared vectorspace will be a unique language resource of likely interest also to the wider NLP community for multilingual applications outside the media monitoring domain." }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-30", "text": "----------------------------------" }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-31", "text": "**MULTILINGUAL NEURAL TRANSLATION**" }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-32", "text": "----------------------------------" }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-33", "text": "**CHARACTER-LEVEL NEURAL TRANSLATION FOR STREAMS**" }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-34", "text": "Neural translation attention mechanism (Bahdanau, Cho & Bengio, 2014) has been shown to be highly beneficial for bi-lingual neural translation of long sentences, but it is not compatible with the multi-task multilingual translation models (Dong et al., 2015; Luong et al, 2016) described in the previous Section and character-level translation models (Barzdins & Gosko, 2016) described in this Section." }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-35", "text": "For these reasons we replace the neural translation attention mechanism with much simpler sliding-window translation (Barzdins & Gosko, 2016; Karpathy, 2015; Jozefowicz, 2016 )." }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-74", "text": "----------------------------------" }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-37", "text": "Table 2 illustrates the character-level neural translation from English to Latvian using modified 2 TensorFlow (Abadi et al., 2015) seq2seq (Sutskerev, Vinyals & Le, 2014) neural translation model." 
}, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-38", "text": "The character-level neural translation is enabled by forcing tokenizer to treat each input symbol as a separate \"word\" leading to the small and fixed \"vocabulary\" containing only 90 most frequently encountered characters." }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-39", "text": "Another necessary change to the TensorFlow default seq2seq settings is disabling the attention (Bahdanau, Cho & Bengio, 2014) mechanism which is known to interfere with character-level translation (Barzdins & Gosko, 2016) because there are no mappings between the characters of the translated words." }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-40", "text": "The small vocabulary of 90 words automatically disables also the sampled softmax functionality of seq2seq improving the overall performance." }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-41", "text": "Finally, we configure single bucket of size 100 characters, which will be the max translation window size." }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-42", "text": "Other hyperparameters used are: 1 LSTM 2 https://github.com/didzis/tensorflowAMR layer of size 400, batch size 16." }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-43", "text": "Training is performed on Europarl v7 EN-LV corpus 3 for 24h on TitanX GPU." }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-44", "text": "The sliding-window mechanism is used only during decoding (translation), mapping a fragment of 6 English words into 5 Latvian words (Latvian translations typically contain less words than English source -rich morphology substitutes for most prepositions and articles)." 
}, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-45", "text": "The multiple sliding-window translations produced are later merged into the final translation consisting only of words appearing at least twice in the neighboring sliding window columns (word suffixes are ignored if the initial 6 characters of the words match -this reduces word drop due to inflection errors)." }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-46", "text": "The final translation in Table 2 (bottom row) is close to the manual verbatim translation (top row) and conveys the topic of the original sentence." }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-47", "text": "Moreover, the slidingwindow translations are surprisingly fluent Latvian phrases with correct word forms and mostly correct coordination." }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-48", "text": "The only non-Latvian \"words\" fabricated by the characterlevel translation in the Table 2 are \"transpastiv\u0113\u0161ana\", \"transpitts\", \"transpirma\" and apparently are triggered by the English verb \"transits\", because in Latvian \"tranz\u012bts\" is used only as a noun without a close substitute verb." }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-49", "text": "Sliding-window translation method, obviously, cannot handle long-range dependencies well and occasionally drops or inserts words in the translation -therefore SUMMA provides also state-of-the-art translation service in parallel to the one described here." }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-50", "text": "Meanwhile the sliding-window character-based translation method has unique advantages relevant to the scope of SUMMA project, discussed in the next Section." 
}, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-51", "text": "----------------------------------" }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-52", "text": "**POTENTIAL APPLICATIONS OF THE MULTILINGUAL CHARACTER-LEVEL STREAM TRANSLATION**" }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-53", "text": "Having a shared vectorspace multilingual translation system ( Fig.2 and Fig.3 ) able to operate on unsegmented streams of text have a number of novel applications." }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-54", "text": "The most straightforward novel application is the possibility to embed the documents of all project languages into the same shared semantics vectorspace and compute document semantic similarity (Hill, Cho & Korhonen, 2016) irrespective of the document language." }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-55", "text": "The sliding-window translation approach allows to view the document as a sequence (trace) of vectors corresponding to every slidingwindow step while translating the document." }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-56", "text": "These vectors are similar to word embedding vectors, but are likely to be semantically richer, as they would mostly distinguish word-senses in the context of the window." }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-57", "text": "Such vectortraces corresponding to the documents can be compared in the bag-of-words fashion by measuring cosine-distance between the sums of document trace-vectors (as part of kmeans clustering or nearest-neighbor search)." }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-58", "text": "This can be used as a building block for multilingual semantic clustering of stories into storylines, or for the semantic search of the documents in any language which are similar to the given document." 
}, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-59", "text": "Another novel application of the character-level neural translation is stream segmentation into the individual stories -a difficult task for news ingested from audio or video sources and transcribed with ASR and thus lacking any explicit sentence or story segmentation information." }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-60", "text": "For stream segmentation into the stories it is possible to utilize the exceptional generalization and memorization capacity of the neural networks, which is already applied in the neural dialogue systems such as Gmail Smart Replies (Corrado, 2015; Vinyals&Le, 2015) ." }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-61", "text": "Table 3 illustrates how mere 400 LSTM cells of our single-layer 90-character neural translator have been able to generalize and memorize rather correct translations for the first 100 characters of the entire Europarl v7 EN-LV training corpus containing 600,000 sentence pairs." }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-62", "text": "For story segmentation a sliding-window neural translation system can be incrementally trained to monolingually \"translate\" the current 5 words of the stream into the next 5 words of the stream (predicting next 5 words from previous 5 words), based on the actual news streams encountered." }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-63", "text": "Such system should be able to predict reasonably well the next 5 words within the news story, but will fail to do so when there is a transition from one story to the next." }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-64", "text": "Along with additional auxiliary information such as time-code (when exactly the phrase was spoken and pauses in the speech) and speaker identification for each phrase this should provide a rather reliable segmentation signal." 
}, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-65", "text": "Table 3 : Europarl v7 training corpus fragment (only first 100 characters of each sentence were used for training) and the character-level neural translation output illustrating the memorization of the training corpus." }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-66", "text": "----------------------------------" }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-67", "text": "**CONCLUSIONS**" }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-68", "text": "It is still an open issue which vectorspace projections yield the semantically best clusters (Hill, Cho & Korhonen, 2016) and further experiments are needed." }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-69", "text": "Particularly for storyline (Rissen et al., 2013) clustering the signals for the stories belonging to the same storyline might be not so much the semantic similarity of the articles (they might report various developments of the storyline from differing viewpoints), but rather the matching time and location as well as same organizations and people being involved -the information typically supplied by Named Entity Linking (NEL) tools." }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-70", "text": "The tradeoffs between semantic clustering quality and computational complexity are likely to be crucial." }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-71", "text": "Once trained, the run-time use of the multilingual translation modules for translation and news story clustering is around 1 sec on TitanX GPU per average news story." }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-72", "text": "This is an order of magnitude slower than regular NEL or IF IDF bagof-words based clustering methods." 
}, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-73", "text": "Establishing reliable storyline clustering benchmarking data sets and metrics is one of the goals of the SUMMA project, as good storyline clusters are the prerequisite for downstream storyline summarization, visualization, and predictive anticipation of upcoming developments." }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-76", "text": "This work was supported in part by H2020 SUMMA project under grant agreement 688139/H2020-ICT-2015 and in part by the Latvian National research program SOPHIS under grant agreement Nr.10-4/VPP-4/11." }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-77", "text": "----------------------------------" }, { "sent_id": "1ebbddc6c6740aea71ade2ed915de4-C001-78", "text": "**ENGLISH(ORIGINAL LATVIAN(MANUAL(TRANSLATION LATVIAN(CHARACTER3LEVEL(NEURAL(TRANSLATION**" } ], "y": { "@BACK@": { "gold_contexts": [ [ "1ebbddc6c6740aea71ade2ed915de4-C001-27" ], [ "1ebbddc6c6740aea71ade2ed915de4-C001-34" ] ], "cite_sentences": [ "1ebbddc6c6740aea71ade2ed915de4-C001-27", "1ebbddc6c6740aea71ade2ed915de4-C001-34" ] } } }, "ABC_d5a71358168d262dd1e9734c80234b_48": { "x": [ { "sent_id": "d5a71358168d262dd1e9734c80234b-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "d5a71358168d262dd1e9734c80234b-C001-2", "text": "A new annotated corpus can have a pivotal role in the future of computational linguistics." }, { "sent_id": "d5a71358168d262dd1e9734c80234b-C001-3", "text": "Corpus annotation can define new NLP tasks and set new standards." }, { "sent_id": "d5a71358168d262dd1e9734c80234b-C001-4", "text": "This may put many of the papers presented at this workshop on the cutting edge of our field." }, { "sent_id": "d5a71358168d262dd1e9734c80234b-C001-5", "text": "A standard, however, is a double edged sword." }, { "sent_id": "d5a71358168d262dd1e9734c80234b-C001-6", "text": "A standard corpus urges users to accept the theory of how to represent things that underlie that corpus." 
}, { "sent_id": "d5a71358168d262dd1e9734c80234b-C001-7", "text": "For example, a Penn Treebank theory of grammar is implicit in PennTreebank-based parsers." }, { "sent_id": "d5a71358168d262dd1e9734c80234b-C001-8", "text": "This can be a problem if one rejects some aspects of that theory." }, { "sent_id": "d5a71358168d262dd1e9734c80234b-C001-9", "text": "Also one may object to a particular system of annotation because some theories generalize to cover new ground (e.g., new languages) better than others." }, { "sent_id": "d5a71358168d262dd1e9734c80234b-C001-10", "text": "Nevertheless, advantages of accepting a corpus as standard include the following:" }, { "sent_id": "d5a71358168d262dd1e9734c80234b-C001-11", "text": "It is straight-forward to compare the performance of the set of systems that produce the same form of output, e.g., Penn Treebank-based parsers can be compared in terms of how well they reproduce the Penn Treebank." }, { "sent_id": "d5a71358168d262dd1e9734c80234b-C001-12", "text": "Alternative systems based on a standard are largely interchangeable." }, { "sent_id": "d5a71358168d262dd1e9734c80234b-C001-13", "text": "Thus a system that uses one PennTreebank-based parser as a component can easily be adapted to use another better performing PennTreebank-based parser." }, { "sent_id": "d5a71358168d262dd1e9734c80234b-C001-14", "text": "Standards can be built on." }, { "sent_id": "d5a71358168d262dd1e9734c80234b-C001-15", "text": "For example, if one accepts the framework of the Penn Treebank, it is easy to move on to representations of \"deeper\" structure as suggested in three papers in this volume (Miltsakaki et al., 2004; Babko-Malaya et al., 2004; Meyers et al., 2004) ." }, { "sent_id": "d5a71358168d262dd1e9734c80234b-C001-16", "text": "It is my view that these advantages outweigh the disadvantages." 
}, { "sent_id": "d5a71358168d262dd1e9734c80234b-C001-17", "text": "I propose that the papers in this volume be viewed with the following question in mind: How can the work covered by this collection of papers be integrated together? Put differently, to what extent are these resources mergeable?" }, { "sent_id": "d5a71358168d262dd1e9734c80234b-C001-18", "text": "The first six papers describe linguistic annotation in four languages: Spanish (Alc\u00e1ntara and Moreno, 2004), English (Miltsakaki et al., 2004; Babko-Malaya et al., 2004; Meyers et al., 2004) , Czech (Sgall et al., 2004) and German (Baumann et al., 2004)." }, { "sent_id": "d5a71358168d262dd1e9734c80234b-C001-44", "text": "The first six papers describe linguistic annotation in four languages: Spanish (Alc\u00e1ntara and Moreno, 2004) , English (Miltsakaki et al., 2004; Babko-Malaya et al., 2004; Meyers et al., 2004) , Czech (Sgall et al., 2004) and German (Baumann et al., 2004) ." }, { "sent_id": "d5a71358168d262dd1e9734c80234b-C001-45", "text": "The sixth, seventh and eighth papers (Baumann et al., 2004; \u00c7 mejrek et al., 2004; Helmreich et al., 2004) explore questions of multilingual annotation of syntax and semantics, beginning to answer the question of how annotation systems can be made compatible across languages." }, { "sent_id": "d5a71358168d262dd1e9734c80234b-C001-46", "text": "Indeed (Helmreich et al., 2004) explores the question of integration across languages, as well as levels of annotation." }, { "sent_id": "d5a71358168d262dd1e9734c80234b-C001-47", "text": "(Baumann et al., 2004) also describes how a number of different linguistic levels can be related in annotation (pragmatic and prosodic) among two languages (English and German)." }, { "sent_id": "d5a71358168d262dd1e9734c80234b-C001-48", "text": "The ninth and tenth papers (Langone et al., 2004; Zabokrtsk\u00fd and Lopatkov\u00e1, 2004) are respectively about a corpus related to a lexicon and the reverse: a lexicon related to a corpus." 
}, { "sent_id": "d5a71358168d262dd1e9734c80234b-C001-49", "text": "This opens up the wider theme of the intergration of a number of different linguistic resources." }, { "sent_id": "d5a71358168d262dd1e9734c80234b-C001-50", "text": "As the natural language community produces more and more linguistic resources, especially corpora, it seems important to step back and look at the larger picture." }, { "sent_id": "d5a71358168d262dd1e9734c80234b-C001-51", "text": "If these resources can be fit together as part of a larger puzzle, this could produce a sketch of the future of our field." }, { "sent_id": "d5a71358168d262dd1e9734c80234b-C001-19", "text": "The sixth, seventh and eighth papers (Baumann et al., 2004; \u00c7 mejrek et al., 2004; Helmreich et al., 2004 ) explore questions of multilingual annotation of syntax and semantics, beginning to answer the question of how annotation systems can be made compatible across languages." }, { "sent_id": "d5a71358168d262dd1e9734c80234b-C001-20", "text": "Indeed (Helmreich et al., 2004) explores the question of integration across languages, as well as levels of annotation." }, { "sent_id": "d5a71358168d262dd1e9734c80234b-C001-21", "text": "(Baumann et al., 2004) also describes how a number of different linguistic levels can be related in annotation (pragmatic and prosodic) among two languages (English and German)." }, { "sent_id": "d5a71358168d262dd1e9734c80234b-C001-22", "text": "The ninth and tenth papers (Langone et al., 2004; Zabokrtsk\u00fd and Lopatkov\u00e1, 2004) are respectively about a corpus related to a lexicon and the reverse: a lexicon related to a corpus." }, { "sent_id": "d5a71358168d262dd1e9734c80234b-C001-23", "text": "This opens up the wider theme of the intergration of a number of different linguistic resources." 
}, { "sent_id": "d5a71358168d262dd1e9734c80234b-C001-24", "text": "As the natural language community produces more and more linguistic resources, especially corpora, it seems important to step back and look at the larger picture." }, { "sent_id": "d5a71358168d262dd1e9734c80234b-C001-25", "text": "If these resources can be fit together as part of a larger puzzle, this could produce a sketch of the future of our field." }, { "sent_id": "d5a71358168d262dd1e9734c80234b-C001-26", "text": "----------------------------------" }, { "sent_id": "d5a71358168d262dd1e9734c80234b-C001-27", "text": "****" }, { "sent_id": "d5a71358168d262dd1e9734c80234b-C001-28", "text": "A new annotated corpus can have a pivotal role in the future of computational linguistics." }, { "sent_id": "d5a71358168d262dd1e9734c80234b-C001-29", "text": "Corpus annotation can define new NLP tasks and set new standards." }, { "sent_id": "d5a71358168d262dd1e9734c80234b-C001-30", "text": "This may put many of the papers presented at this workshop on the cutting edge of our field." }, { "sent_id": "d5a71358168d262dd1e9734c80234b-C001-31", "text": "A standard, however, is a double edged sword." }, { "sent_id": "d5a71358168d262dd1e9734c80234b-C001-32", "text": "A standard corpus urges users to accept the theory of how to represent things that underlie that corpus." }, { "sent_id": "d5a71358168d262dd1e9734c80234b-C001-33", "text": "For example, a Penn Treebank theory of grammar is implicit in PennTreebank-based parsers." }, { "sent_id": "d5a71358168d262dd1e9734c80234b-C001-34", "text": "This can be a problem if one rejects some aspects of that theory." }, { "sent_id": "d5a71358168d262dd1e9734c80234b-C001-35", "text": "Also one may object to a particular system of annotation because some theories generalize to cover new ground (e.g., new languages) better than others." 
}, { "sent_id": "d5a71358168d262dd1e9734c80234b-C001-36", "text": "Nevertheless, advantages of accepting a corpus as standard include the following:" }, { "sent_id": "d5a71358168d262dd1e9734c80234b-C001-37", "text": "It is straight-forward to compare the performance of the set of systems that produce the same form of output, e.g., Penn Treebank-based parsers can be compared in terms of how well they reproduce the Penn Treebank." }, { "sent_id": "d5a71358168d262dd1e9734c80234b-C001-38", "text": "Alternative systems based on a standard are largely interchangeable." }, { "sent_id": "d5a71358168d262dd1e9734c80234b-C001-39", "text": "Thus a system that uses one PennTreebank-based parser as a component can easily be adapted to use another better performing PennTreebank-based parser." }, { "sent_id": "d5a71358168d262dd1e9734c80234b-C001-40", "text": "Standards can be built on." }, { "sent_id": "d5a71358168d262dd1e9734c80234b-C001-41", "text": "For example, if one accepts the framework of the Penn Treebank, it is easy to move on to representations of \"deeper\" structure as suggested in three papers in this volume (Miltsakaki et al., 2004; Babko-Malaya et al., 2004; Meyers et al., 2004) ." }, { "sent_id": "d5a71358168d262dd1e9734c80234b-C001-42", "text": "It is my view that these advantages outweigh the disadvantages." }, { "sent_id": "d5a71358168d262dd1e9734c80234b-C001-43", "text": "I propose that the papers in this volume be viewed with the following question in mind: How can the work covered by this collection of papers be integrated together? Put differently, to what extent are these resources mergeable?" 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "d5a71358168d262dd1e9734c80234b-C001-15" ], [ "d5a71358168d262dd1e9734c80234b-C001-18" ], [ "d5a71358168d262dd1e9734c80234b-C001-41" ], [ "d5a71358168d262dd1e9734c80234b-C001-44" ] ], "cite_sentences": [ "d5a71358168d262dd1e9734c80234b-C001-15", "d5a71358168d262dd1e9734c80234b-C001-18", "d5a71358168d262dd1e9734c80234b-C001-41", "d5a71358168d262dd1e9734c80234b-C001-44" ] } } }, "ABC_7ddd5b18d774575ae7acb97ae9eb33_48": { "x": [ { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-4", "text": "----------------------------------" }, { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-5", "text": "**INTRODUCTION**" }, { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-46", "text": "Since CCGrebank only covers English and lacks some constructions observed in our corpus, an annotation manual with more specific instructions was needed." }, { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-47", "text": "We initially annotated ten sentences in four languages and discussed disagreements." }, { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-48", "text": "The results were recorded in an initial annotation manual, and the initial annotations were discarded." }, { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-49", "text": "Each of the remaining 4x100 sentences was then annotated independently by at least two of the authors." }, { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-2", "text": "We present the first open-source graphical annotation tool for combinatory categorial grammar (CCG), and the first set of detailed guidelines for syntactic annotation with CCG, for four languages: English, German, Italian, and Dutch." 
}, { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-3", "text": "We also release a parallel pilot CCG treebank based on these guidelines, with 4x100 adjudicated sentences, 10K singleannotator fully corrected sentences, and 82K single-annotator partially corrected sentences." }, { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-6", "text": "Combinatory Categorial Grammar (CCG; Steedman, 2000 ) is a grammar formalism distinguished by its transparent syntax-semantics interface and its elegant handling of coordination." }, { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-7", "text": "It is a popular tool in semantic parsing, and treebank creation efforts have been made for Turkish (\u00c7 ak\u0131c\u0131, 2005) , German (Hockenmaier, 2006) , English (Hockenmaier and Steedman, 2007) , Italian (Bos et al., 2009) , Chinese (Tse and Curran, 2010) , Arabic (Boxwell and Brew, 2010) , Japanese (Uematsu et al., 2013) , and Hindi (Ambati et al., 2018) ." }, { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-8", "text": "However, all of these treebanks were not directly annotated according to the CCG formalism, but automatically converted from phrase structure or dependency treebanks, which is an error-prone process." }, { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-9", "text": "Direct annotation in CCG has so far mostly been limited to small datasets for seeding or testing semantic parsers (e.g., Artzi et al., 2015) , and no graphical annotation interface is available to support such efforts, making the annotation process difficult to scale." }, { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-10", "text": "The only exceptions we are aware of are the Groningen Meaning Bank and the Parallel Meaning Bank (Abzianidze et al., 2017) , two annotation efforts which use a graphical user interface for annotating sentences with CCG derivations and other annotation layers, and which have produced CCG treebanks for English, German, Italian, and Dutch." 
}, { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-11", "text": "However, these efforts are focused on semantics and have not released explicit guidelines for syntactic annotation." }, { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-12", "text": "Their annotation tool is limited in that annotators only have control over lexical categories, not larger constituents." }, { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-13", "text": "Even though CCG is a lexicalized formalism, where most decisions can be made on the lexical level, there is no full control over attachment phenomena in the lexicon." }, { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-14", "text": "Moreover, these annotation tools are not open-source and cannot easily be deployed to support other annotation efforts." }, { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-15", "text": "In this paper, we present an open-source, lightweight, easy-to-use graphical annotation tool that employs a statistical parser to create initial CCG derivations for sentences, and allows annotators to correct these annotations via lexical category constraints and span constraints." }, { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-16", "text": "Together, these constraints make it possible to effect (almost) all annotation decisions consistent with the principles of CCG." }, { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-17", "text": "We also present a pilot study for multilingual CCG annotation, in which a parallel corpus of 4x100 sentences (in English, German, Italian, and Dutch) was annotated by two annotators per sentence, a detailed annotation manual was created, and adjudication was performed to create a final version." }, { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-18", "text": "We publicly release the manual, the annotation tool, and the adjudicated data." 
}, { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-19", "text": "Our release also includes an additional > 10 K derivations, each manually corrected by a single annotator, and an additional > 82 K sentences, each partially corrected by a single annotator." }, { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-20", "text": "----------------------------------" }, { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-21", "text": "**AN ANNOTATION TOOL FOR CCG**" }, { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-22", "text": "Our annotation tool CCGweb 1 is Web-based, implemented in Python, PHP, and JavaScript, and should be easy to deploy on any recent Linux dis- tribution." }, { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-23", "text": "It has two main views: the home page shows the list of sentences an annotator is assigned to annotate." }, { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-24", "text": "Those already done are marked as \"marked correct\". Clicking on a sentence takes the annotator to the sentence view." }, { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-25", "text": "Annotators can also enter arbitrary sentences to annotate, e.g., for experimenting or for producing illustrations." }, { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-26", "text": "Dynamic Annotation Annotation follows an approach called dynamic annotation (Oepen et al., 2002) or human-aided machine annotation , in which sentences are automatically analyzed, annotators impose constraints to rule out undesired analyses, sentences are then reanalyzed subject to the constraints, and the process is repeated until only the desired analysis remains." }, { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-27", "text": "The current system is backed by the EasyCCG parser (Lewis and Steedman, 2014) , slightly modified to allow for incorporating constraints, and other CCG parsers could be plugged in with similar modifications." 
}, { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-28", "text": "To do this, the annotator clicks on the category and changes it, as shown in the figure." }, { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-29", "text": "When they hit enter or click somewhere else, the sentence is automatically parsed again in the background, this time with the lexical category constraint that go has category (S[b]\\ NP)/ PP." }, { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-30", "text": "In many cases, the parser will directly find the desired parse, with there being a PP, and the annotator only has to check it, not make another edit." }, { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-31", "text": "Span Constraints Although constraining lexical categories is often enough to determine the entire CCG derivation (cf. Bangalore and Joshi, 1999; Lewis and Steedman, 2014) , this is not always the case." }, { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-32", "text": "For example, consider the sentence I want to be a millionaire like my dad. Assuming that like my dad is a verb phrase modifier (category (S \\ NP)\\(S \\ NP)), it could attach to either to be or want, giving very different meanings (cf. Zimmer, 2013)." }, { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-33", "text": "We therefore implemented one other type of edit operation/constraint: span constraints." }, { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-34", "text": "By simply clicking and dragging across a span of tokens as shown in Figure 2 , annotators can constrain this span to be a constituent in the resulting parse." }, { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-35", "text": "Additional Features Our tool offers annotators some additional convenient features." }, { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-36", "text": "When unsure about some annotation decision, they can click the \"report issue\" button to open a discussion thread in an external forum, such as a GitHub issue tracker." 
}, { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-37", "text": "To erase all constraints and restart annotation from the parser's original analysis, an annotator can click the \"reset\" button. And the buttons \"HTML\" and \"LaTeX\" provide code that can be copied and pasted to use the current derivation as an illustration on a web page or in a paper." }, { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-38", "text": "Adjudication Support Once two or more annotators have annotated a sentence, disagreements need to be discovered, and a final, authoritative version has to be created." }, { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-39", "text": "Our tool supports this adjudication process through the special user account judge." }, { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-40", "text": "This user can see the derivations of other annotators in a tabbed interface as shown in Figure 3 ." }, { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-41", "text": "In order to enable the judge to easily spot disagreements, categories that annotators disagree on are struck through, and constituents that annotators disagree on are dashed." }, { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-42", "text": "----------------------------------" }, { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-43", "text": "**A QUADRILINGUAL PILOT CCG TREEBANK**" }, { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-44", "text": "To test the viability of creating multilingual CCG treebanks by direct annotation, we conducted an annotation experiment on 110 short sentences from the Tatoeba corpus (Tatoeba, 2019) , each in four translations (English, German, Italian, and Dutch)." }, { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-45", "text": "The main annotation guideline was to copy the annotation style of CCGrebank (Honnibal et al., 2010), a CCG treebank adapted from CCGbank (Hockenmaier and Steedman, 2007) , which is in turn based on the Penn Treebank (Marcus et al., 1993) ." 
}, { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-50", "text": "Table 1 (upper part) shows the number of nonoverlapping category and span constraints that each annotator created on average per sentence before marking the sentence as correct." }, { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-51", "text": "Annotated sentences were manually classified by the first author into four classes: (0) sentences without any disagreements, (1) sentences with only trivial violations of the annotation guidelines (e.g., concerning attachment of punctuation or underspecifying modifier features), (2) sentences with only apparent oversights, such as giving a determiner a pronoun category, (3) sentences with more intricate disagreements which required additional guidelines to resolve." }, { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-52", "text": "Table 1 (upper part) shows the distribution of disagreement classes, and Table 2 shows examples of class (3)." }, { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-53", "text": "The first author adjudicated all disagreements and updated the annotation manual accordingly." }, { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-54", "text": "We release the manual and the full adjudicated dataset." }, { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-55", "text": "2 To make the resource more useful (e.g., for training parsers), we also include in the release the syntactic CCG derivations created so far in the Parallel Meaning Bank (Abzianidze et al., 2017) ." }, { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-56", "text": "These do not follow the annotation guidelines in detail due to their focus on semantics, nor have they been adjudicated, but instead corrected by a single annotator." }, { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-57", "text": "However, they are much greater in number." 
}, { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-58", "text": "For an even greater number, we also release partially corrected derivations, meaning that the annotator made at least one change to the automatically created derivation." }, { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-59", "text": "lexical label constraints and span constraints, adjudication support, and various conveniences." }, { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-60", "text": "We have used this tool to create the first published CCG resource that comes with an explicit annotation manual for syntax and has been created by direct annotation, rather than conversion from a non-CCG treebank." }, { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-61", "text": "It is multilingual, currently including English, German, Italian, and Dutch, and aims for cross-lingually consistent annotation guidelines." }, { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-62", "text": "For future work, we envision more extensive direct annotation of multilingual data with CCG derivations, and putting them to use for evaluating unsupervised and distantly supervised CCG parsers." }, { "sent_id": "7ddd5b18d774575ae7acb97ae9eb33-C001-63", "text": "We would also like to investigate the use of our tool as an interactive aid in teaching CCG." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "7ddd5b18d774575ae7acb97ae9eb33-C001-7" ] ], "cite_sentences": [ "7ddd5b18d774575ae7acb97ae9eb33-C001-7" ] }, "@MOT@": { "gold_contexts": [ [ "7ddd5b18d774575ae7acb97ae9eb33-C001-10", "7ddd5b18d774575ae7acb97ae9eb33-C001-11", "7ddd5b18d774575ae7acb97ae9eb33-C001-12", "7ddd5b18d774575ae7acb97ae9eb33-C001-13", "7ddd5b18d774575ae7acb97ae9eb33-C001-14", "7ddd5b18d774575ae7acb97ae9eb33-C001-15", "7ddd5b18d774575ae7acb97ae9eb33-C001-7", "7ddd5b18d774575ae7acb97ae9eb33-C001-8", "7ddd5b18d774575ae7acb97ae9eb33-C001-9" ] ], "cite_sentences": [ "7ddd5b18d774575ae7acb97ae9eb33-C001-7" ] }, "@USE@": { "gold_contexts": [ [ "7ddd5b18d774575ae7acb97ae9eb33-C001-45" ] ], "cite_sentences": [ "7ddd5b18d774575ae7acb97ae9eb33-C001-45" ] } } }, "ABC_cf7d01faf555f09973e44be400e768_48": { "x": [ { "sent_id": "cf7d01faf555f09973e44be400e768-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "cf7d01faf555f09973e44be400e768-C001-2", "text": "Many Natural Language Processing (NLP) tasks (including generation, language grounding, reasoning, information extraction, coreference resolution, and dialog) can be formulated as deep reinforcement learning (DRL) problems." }, { "sent_id": "cf7d01faf555f09973e44be400e768-C001-3", "text": "However, since language is often discrete and the space for all sentences is infinite, there are many challenges for formulating reinforcement learning problems of NLP tasks." }, { "sent_id": "cf7d01faf555f09973e44be400e768-C001-4", "text": "In this tutorial, we provide a gentle introduction to the foundation of deep reinforcement learning, as well as some practical DRL solutions in NLP." }, { "sent_id": "cf7d01faf555f09973e44be400e768-C001-5", "text": "We describe recent advances in designing deep reinforcement learning for NLP, with a special focus on generation, dialogue, and information extraction." 
}, { "sent_id": "cf7d01faf555f09973e44be400e768-C001-53", "text": "In particular, we will take the first deep reinforcement learning solution for dialogue (Li et al., 2016) as a case study." }, { "sent_id": "cf7d01faf555f09973e44be400e768-C001-6", "text": "We discuss why they succeed, and when they may fail, aiming at providing some practical advice about deep reinforcement learning for solving real-world NLP problems." }, { "sent_id": "cf7d01faf555f09973e44be400e768-C001-7", "text": "Deep Reinforcement Learning (DRL) (Mnih et al., 2015) is an emerging research area that involves intelligent agents that learn to reason in Markov Decision Processes (MDP)." }, { "sent_id": "cf7d01faf555f09973e44be400e768-C001-8", "text": "Recently, DRL has achieved many stunning breakthroughs in Atari games (Mnih et al., 2013) and the Go game (Silver et al., 2016) ." }, { "sent_id": "cf7d01faf555f09973e44be400e768-C001-9", "text": "In addition, DRL methods have gained significantly more attentions in NLP in recent years, because many NLP tasks can be formulated as DRL problems that involve incremental decision making." }, { "sent_id": "cf7d01faf555f09973e44be400e768-C001-10", "text": "DRL methods could easily combine embedding based representation learning with reasoning, and optimize for a variety of non-differentiable rewards." }, { "sent_id": "cf7d01faf555f09973e44be400e768-C001-11", "text": "However, a key challenge for applying deep reinforcement learning techniques to real-world sized NLP problems is the model design issue." }, { "sent_id": "cf7d01faf555f09973e44be400e768-C001-12", "text": "This tutorial draws connections from theories of deep reinforcement learning to practical applications in NLP." }, { "sent_id": "cf7d01faf555f09973e44be400e768-C001-13", "text": "In particular, we start with the gentle introduction to the fundamentals of reinforcement learning (Sutton and Barto, 1998; Sutton et al., 2000) ." 
}, { "sent_id": "cf7d01faf555f09973e44be400e768-C001-14", "text": "We further discuss We further discuss several critical issues in DRL solutions for NLP tasks, including (1) The efficient and practical design of the action space, state space, and reward functions; (2) The trade-off between exploration and exploitation; and (3) The goal of incorporating linguistic structures in DRL." }, { "sent_id": "cf7d01faf555f09973e44be400e768-C001-15", "text": "To address the model design issue, we discuss several recent solutions (He et al., 2016b; Li et al., 2016; Xiong et al., 2017) ." }, { "sent_id": "cf7d01faf555f09973e44be400e768-C001-16", "text": "We then focus on a new case study of hierarchical deep reinforcement learning for video captioning (Wang et al., 2018b) , discussing the techniques of leveraging hierarchies in DRL for NLP generation problems." }, { "sent_id": "cf7d01faf555f09973e44be400e768-C001-17", "text": "This tutorial aims at introducing deep reinforcement learning methods to researchers in the NLP community." }, { "sent_id": "cf7d01faf555f09973e44be400e768-C001-18", "text": "We do not assume any particular prior knowledge in reinforcement learning." }, { "sent_id": "cf7d01faf555f09973e44be400e768-C001-19", "text": "The intended length of the tutorial is 3 hours, including a coffee break." }, { "sent_id": "cf7d01faf555f09973e44be400e768-C001-20", "text": "Representation Learning, Reasoning (Learning to Search), and Scalability are three closely related research subjects in Natural Language Processing." }, { "sent_id": "cf7d01faf555f09973e44be400e768-C001-21", "text": "In this tutorial, we touch the intersection of all the three research subjects, covering various aspects of the theories of modern deep reinforcement learning methods, and show their successful applications in NLP." 
}, { "sent_id": "cf7d01faf555f09973e44be400e768-C001-22", "text": "This tutorial is organized in three parts:" }, { "sent_id": "cf7d01faf555f09973e44be400e768-C001-23", "text": "\u2022 Foundations of Deep Reinforcement Learning." }, { "sent_id": "cf7d01faf555f09973e44be400e768-C001-24", "text": "First, we will provide a brief overview" }, { "sent_id": "cf7d01faf555f09973e44be400e768-C001-25", "text": "----------------------------------" }, { "sent_id": "cf7d01faf555f09973e44be400e768-C001-26", "text": "**TUTORIAL DESCRIPTION**" }, { "sent_id": "cf7d01faf555f09973e44be400e768-C001-27", "text": "Deep Reinforcement Learning (DRL) (Mnih et al., 2015) is an emerging research area that involves intelligent agents that learn to reason in Markov Decision Processes (MDP)." }, { "sent_id": "cf7d01faf555f09973e44be400e768-C001-28", "text": "Recently, DRL has achieved many stunning breakthroughs in Atari games (Mnih et al., 2013) and the Go game (Silver et al., 2016) ." }, { "sent_id": "cf7d01faf555f09973e44be400e768-C001-29", "text": "In addition, DRL methods have gained significantly more attentions in NLP in recent years, because many NLP tasks can be formulated as DRL problems that involve incremental decision making." }, { "sent_id": "cf7d01faf555f09973e44be400e768-C001-30", "text": "DRL methods could easily combine embedding based representation learning with reasoning, and optimize for a variety of non-differentiable rewards." }, { "sent_id": "cf7d01faf555f09973e44be400e768-C001-31", "text": "However, a key challenge for applying deep reinforcement learning techniques to real-world sized NLP problems is the model design issue." }, { "sent_id": "cf7d01faf555f09973e44be400e768-C001-32", "text": "This tutorial draws connections from theories of deep reinforcement learning to practical applications in NLP." 
}, { "sent_id": "cf7d01faf555f09973e44be400e768-C001-33", "text": "In particular, we start with the gentle introduction to the fundamentals of reinforcement learning (Sutton and Barto, 1998; Sutton et al., 2000) ." }, { "sent_id": "cf7d01faf555f09973e44be400e768-C001-34", "text": "We further discuss their modern deep learning extensions such as Deep QNetworks (Mnih et al., 2015) , Policy Networks (Silver et al., 2016) , and Deep Hierarchical Reinforcement Learning (Kulkarni et al., 2016) ." }, { "sent_id": "cf7d01faf555f09973e44be400e768-C001-35", "text": "We outline the applications of deep reinforcement learning in NLP, including dialog (Li et al., 2016) , semi-supervised text classification (Wu et al., 2018) , coreference (Clark and Manning, 2016; Yin et al., 2018) , knowledge graph reasoning (Xiong et al., 2017 ), text games (Narasimhan et al., 2015; He et al., 2016a) , social media (He et al., 2016b; Zhou and Wang, 2018) , information extraction (Narasimhan et al., 2016; Qin et al., 2018) , language and vision (Pasunuru and Bansal, 2017; Misra et al., 2017; Wang et al., 2018a,b,c; Xiong et al., 2018) , etc." }, { "sent_id": "cf7d01faf555f09973e44be400e768-C001-36", "text": "We further discuss several critical issues in DRL solutions for NLP tasks, including (1) The efficient and practical design of the action space, state space, and reward functions; (2) The trade-off between exploration and exploitation; and (3) The goal of incorporating linguistic structures in DRL." }, { "sent_id": "cf7d01faf555f09973e44be400e768-C001-37", "text": "To address the model design issue, we discuss several recent solutions (He et al., 2016b; Li et al., 2016; Xiong et al., 2017) ." }, { "sent_id": "cf7d01faf555f09973e44be400e768-C001-38", "text": "We then focus on a new case study of hierarchical deep reinforcement learning for video captioning (Wang et al., 2018b) , discussing the techniques of leveraging hierarchies in DRL for NLP generation problems." 
}, { "sent_id": "cf7d01faf555f09973e44be400e768-C001-39", "text": "This tutorial aims at introducing deep reinforcement learning methods to researchers in the NLP community." }, { "sent_id": "cf7d01faf555f09973e44be400e768-C001-40", "text": "We do not assume any particular prior knowledge in reinforcement learning." }, { "sent_id": "cf7d01faf555f09973e44be400e768-C001-41", "text": "The intended length of the tutorial is 3 hours, including a coffee break." }, { "sent_id": "cf7d01faf555f09973e44be400e768-C001-42", "text": "----------------------------------" }, { "sent_id": "cf7d01faf555f09973e44be400e768-C001-43", "text": "**OUTLINE**" }, { "sent_id": "cf7d01faf555f09973e44be400e768-C001-44", "text": "Representation Learning, Reasoning (Learning to Search), and Scalability are three closely related research subjects in Natural Language Processing." }, { "sent_id": "cf7d01faf555f09973e44be400e768-C001-45", "text": "In this tutorial, we touch the intersection of all the three research subjects, covering various aspects of the theories of modern deep reinforcement learning methods, and show their successful applications in NLP." }, { "sent_id": "cf7d01faf555f09973e44be400e768-C001-46", "text": "This tutorial is organized in three parts:" }, { "sent_id": "cf7d01faf555f09973e44be400e768-C001-47", "text": "\u2022 Foundations of Deep Reinforcement Learning." }, { "sent_id": "cf7d01faf555f09973e44be400e768-C001-48", "text": "First, we will provide a brief overview of reinforcement learning (RL), and discuss the classic settings in RL." }, { "sent_id": "cf7d01faf555f09973e44be400e768-C001-49", "text": "We describe classic methods such as Markov Decision Processes, REINFORCE (Williams, 1992) , and Qlearning (Watkins, 1989 )." }, { "sent_id": "cf7d01faf555f09973e44be400e768-C001-50", "text": "We introduce modelfree and model-based reinforcement learning approaches, and the widely used policy gradient methods." 
}, { "sent_id": "cf7d01faf555f09973e44be400e768-C001-51", "text": "In this part, we also introduce the modern renovation of deep reinforcement learning (Mnih et al., 2015) , with a focus on games (Mnih et al., 2013; Silver et al., 2016) ." }, { "sent_id": "cf7d01faf555f09973e44be400e768-C001-52", "text": "\u2022 Practical Deep Reinforcement Learning: Case Studies in NLP Second, we will focus on the designing practical DRL models for NLP tasks." }, { "sent_id": "cf7d01faf555f09973e44be400e768-C001-54", "text": "We describe the main contributions of this work: including its design of the reward functions, and why they are necessary for dialog." }, { "sent_id": "cf7d01faf555f09973e44be400e768-C001-55", "text": "We then introduce the gigantic action space issue for deep Q-learning in NLP (He et al., 2016a,b) , including several solutions." }, { "sent_id": "cf7d01faf555f09973e44be400e768-C001-56", "text": "To conclude this part, we discuss interesting applications of DRL in NLP, including information extraction and reasoning." }, { "sent_id": "cf7d01faf555f09973e44be400e768-C001-57", "text": "\u2022 Lessons Learned, Future Directions, and Practical Advices for DRL in NLP Third, we switch from the theoretical presentations to an interactive demonstration and discussion session: we aim at providing an interactive session to transfer the theories of DRL into practical insights." }, { "sent_id": "cf7d01faf555f09973e44be400e768-C001-58", "text": "More specifically, we will discuss three important issues, including problem formulation/model design, exploration vs. exploitation, and the integration of linguistic structures in DRL." }, { "sent_id": "cf7d01faf555f09973e44be400e768-C001-59", "text": "We will show case a recent study (Wang et al., 2018b ) that leverages hierarchical deep reinforcement learning for language and vision, and extend the discussion." 
}, { "sent_id": "cf7d01faf555f09973e44be400e768-C001-60", "text": "Practical advice including programming advice will be provided as a part of the demonstration." }, { "sent_id": "cf7d01faf555f09973e44be400e768-C001-61", "text": "----------------------------------" }, { "sent_id": "cf7d01faf555f09973e44be400e768-C001-62", "text": "**HISTORY**" }, { "sent_id": "cf7d01faf555f09973e44be400e768-C001-63", "text": "The full content of this tutorial has not yet been presented elsewhere, but some parts of this tutorial has also been presented at the following locations in recent years:" } ], "y": { "@USE@": { "gold_contexts": [ [ "cf7d01faf555f09973e44be400e768-C001-16" ], [ "cf7d01faf555f09973e44be400e768-C001-38" ], [ "cf7d01faf555f09973e44be400e768-C001-59" ] ], "cite_sentences": [ "cf7d01faf555f09973e44be400e768-C001-16", "cf7d01faf555f09973e44be400e768-C001-38", "cf7d01faf555f09973e44be400e768-C001-59" ] } } }, "ABC_13d1d79a4922d3b5d215d6f8f722ba_48": { "x": [ { "sent_id": "13d1d79a4922d3b5d215d6f8f722ba-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "13d1d79a4922d3b5d215d6f8f722ba-C001-2", "text": "In this paper we present a novel approach to mapping FrameNet lexical units to WordNet synsets in order to automatically enrich the lexical unit set of a given frame." }, { "sent_id": "13d1d79a4922d3b5d215d6f8f722ba-C001-3", "text": "While the mapping approaches proposed in the past mainly rely on the semantic similarity between lexical units in a frame and lemmas in a synset, we exploit the definition of the lexical entries in FrameNet and the WordNet glosses to find the best candidate synset(s) for the mapping." }, { "sent_id": "13d1d79a4922d3b5d215d6f8f722ba-C001-4", "text": "Evaluation results are also reported and discussed." 
}, { "sent_id": "13d1d79a4922d3b5d215d6f8f722ba-C001-5", "text": "----------------------------------" }, { "sent_id": "13d1d79a4922d3b5d215d6f8f722ba-C001-6", "text": "**FRAMENET AND THE EXISTING MAPPING APPROACHES**" }, { "sent_id": "13d1d79a4922d3b5d215d6f8f722ba-C001-7", "text": "The FrameNet database [1] is a lexical resource of English describing some prototypical situations, the frames, and the frame-evoking words or expressions associated with them, the lexical units (LU)." }, { "sent_id": "13d1d79a4922d3b5d215d6f8f722ba-C001-8", "text": "Every frame corresponds to a scenario involving a set of participants, the frame elements, that are typically the syntactic dependents of the lexical units." }, { "sent_id": "13d1d79a4922d3b5d215d6f8f722ba-C001-9", "text": "The FrameNet resource is corpus-based, i.e. every lexical unit should be instantiated by at least one example sentence, even if at the moment the definition and annotation step is still incomplete for several LUs." }, { "sent_id": "13d1d79a4922d3b5d215d6f8f722ba-C001-10", "text": "FrameNet has proved to be useful in a number of NLP tasks, from textual entailment to question answering, but its coverage is still a major problem." }, { "sent_id": "13d1d79a4922d3b5d215d6f8f722ba-C001-11", "text": "In order to expand the resource, it would be a good solution to acquire lexical knowledge encoded in other existing resources and import it into the FrameNet database." }, { "sent_id": "13d1d79a4922d3b5d215d6f8f722ba-C001-12", "text": "WordNet [4] , for instance, covers the majority of nouns, verbs, adjectives and adverbs in the English language, organized in synonym sets called synsets." 
}, { "sent_id": "13d1d79a4922d3b5d215d6f8f722ba-C001-13", "text": "Mapping FrameNet LUs to WordNet synsets would automatically increase the number of LUs per frame by importing all synonyms from the mapped synset, and would allow to exploit the semantic and lexical relations in WordNet to enrich the information encoded in FrameNet." }, { "sent_id": "13d1d79a4922d3b5d215d6f8f722ba-C001-14", "text": "Several experiments have been carried out in this direction." }, { "sent_id": "13d1d79a4922d3b5d215d6f8f722ba-C001-15", "text": "Johansson and Nugues [5] created a feature representation for every WordNet lemma and used it to train an SVM classifier for each frame that tells whether a lemma belongs to the frame or not." }, { "sent_id": "13d1d79a4922d3b5d215d6f8f722ba-C001-16", "text": "Crespo and Buitelaar [3] carried out an automatic mapping of medical-oriented frames to WordNet synsets, trying to select synsets attached to a LU that were statistically significant in a given reference corpus." }, { "sent_id": "13d1d79a4922d3b5d215d6f8f722ba-C001-17", "text": "De Cao et al. [2] proposed a method to detect the set of suitable WordNet senses able to evoke the same frame by exploiting the hypernym hierarchies that capture the largest number of LUs in the frame." }, { "sent_id": "13d1d79a4922d3b5d215d6f8f722ba-C001-18", "text": "For all above mentioned approaches, a real evaluation on randomly selected frames is missing, and accuracy was mainly computed over the new lexical units obtained for a frame, not on a gold standard where one or more synsets are assigned to every lexical unit in a frame." }, { "sent_id": "13d1d79a4922d3b5d215d6f8f722ba-C001-19", "text": "Besides, it seems that the most common approach to carry out the mapping relies on some similarity measures that perform better on richer sets of lexical units." 
}, { "sent_id": "13d1d79a4922d3b5d215d6f8f722ba-C001-20", "text": "----------------------------------" }, { "sent_id": "13d1d79a4922d3b5d215d6f8f722ba-C001-21", "text": "**THE MAPPING ALGORITHM 2.1 MOTIVATION**" }, { "sent_id": "13d1d79a4922d3b5d215d6f8f722ba-C001-22", "text": "We propose a mapping algorithm that is independent of the number of LUs in a frame and from the example sentences available." }, { "sent_id": "13d1d79a4922d3b5d215d6f8f722ba-C001-23", "text": "In fact, we believe that under real-usage conditions, the automatic expansion of LUs is typically required for frames with a smaller LU set, especially for those with only one element." }, { "sent_id": "13d1d79a4922d3b5d215d6f8f722ba-C001-24", "text": "In the FrameNet database (v. 1.3), 33 frames out of 720 are described only by one lexical unit, and 63 are described by two." }, { "sent_id": "13d1d79a4922d3b5d215d6f8f722ba-C001-25", "text": "Furthermore, almost 3000 lexical units are characterized only by the lexicographic definition and are not provided with example sentences." }, { "sent_id": "13d1d79a4922d3b5d215d6f8f722ba-C001-26", "text": "For this reason, we suggest an alternative approach that makes use of usually unexploited information collected in the FrameNet database, namely the definition associated with every lexical unit." }, { "sent_id": "13d1d79a4922d3b5d215d6f8f722ba-C001-27", "text": "Since both WordNet glosses and FrameNet definitions are manually written by lexicographers, they usually show a high degree of similarity, and sometimes are even identical." }, { "sent_id": "13d1d79a4922d3b5d215d6f8f722ba-C001-28", "text": "For example, the definition of thicken in the Change of consistency frame is \"become thick or thicker\", which is identical to the WordNet gloss of synset n. v#00300319." 
}, { "sent_id": "13d1d79a4922d3b5d215d6f8f722ba-C001-29", "text": "The thicken lemma occurs in three WordNet synsets, and in each of them it is the only lemma available, so no synonymy information could be exploited for the sense disambiguation." }, { "sent_id": "13d1d79a4922d3b5d215d6f8f722ba-C001-30", "text": "----------------------------------" }, { "sent_id": "13d1d79a4922d3b5d215d6f8f722ba-C001-31", "text": "**THE ALGORITHM**" }, { "sent_id": "13d1d79a4922d3b5d215d6f8f722ba-C001-32", "text": "We tried to devise a simple method to map a FrameNet Lexical Unit (LU) into one or more WordNet synsets." }, { "sent_id": "13d1d79a4922d3b5d215d6f8f722ba-C001-33", "text": "Given a LU L from a frame F, we first find the set of all synsets containing L (L candidate set, LCandSet)." }, { "sent_id": "13d1d79a4922d3b5d215d6f8f722ba-C001-34", "text": "If LCandSet contains only one synset, this is assigned to L. Otherwise, we look for the synsets in LCandSet whose WN gloss has the highest similarity with the FrameNet definition of L. We tried two baseline similarity algorithms based respectively on stem overlap and on a modified version of the Levenshtein algorithm taking stems as comparison unit instead of characters." }, { "sent_id": "13d1d79a4922d3b5d215d6f8f722ba-C001-35", "text": "Stem overlap turned out to perform definitely better than Levehnstein." }, { "sent_id": "13d1d79a4922d3b5d215d6f8f722ba-C001-36", "text": "Then we tried to improve on simple stem overlap baseline by considering also the other LUs in F. To this extent, we calculate the set of all synsets linked to any LU in F (FCandSet)." }, { "sent_id": "13d1d79a4922d3b5d215d6f8f722ba-C001-37", "text": "This is exploited in two ways." }, { "sent_id": "13d1d79a4922d3b5d215d6f8f722ba-C001-38", "text": "First, we boost the similarity score of the synsets in LCandSet with the largest number of links to other LUs in F (according to FCandSet)." 
}, { "sent_id": "13d1d79a4922d3b5d215d6f8f722ba-C001-39", "text": "Secondly we assign to F the most common WordNet Domain in FCandSet, and then boost the similarity score of LCandSet synsets belonging to the most frequent WordNet-Domain in F. We discard any candidate synset with a similarity score below a MIN threshold; on the other side, we accept more than one candidate synset if they have a similarity score higher than a MAX threshold." }, { "sent_id": "13d1d79a4922d3b5d215d6f8f722ba-C001-40", "text": "----------------------------------" }, { "sent_id": "13d1d79a4922d3b5d215d6f8f722ba-C001-41", "text": "**EVALUATION**" }, { "sent_id": "13d1d79a4922d3b5d215d6f8f722ba-C001-42", "text": "We created a gold standard by manually mapping 380 LUs belonging to as many frames to the corresponding WordNet synsets." }, { "sent_id": "13d1d79a4922d3b5d215d6f8f722ba-C001-43", "text": "Then, we divided our dataset into a development set of 100 LUs and a testset of 280 LUs." }, { "sent_id": "13d1d79a4922d3b5d215d6f8f722ba-C001-44", "text": "We tested the Levenshtein algorithm and the Stem Overlap algorithm (SO), then we evaluated the improvement in performance of the latter taking into account information about the most frequent domain (D) and the most frequent synsets (Syn)." }, { "sent_id": "13d1d79a4922d3b5d215d6f8f722ba-C001-45", "text": "Results are reported in Table 1 ." }, { "sent_id": "13d1d79a4922d3b5d215d6f8f722ba-C001-46", "text": "We carried out several tests to set the MIN and MAX threshold in order to get the best F-measure, reported in Table 1 . As for precision, the best performance obtained with SO+D+Syn and a stricter MIN/MAX threshold scored 0.78 (recall 0.36, f-measure 0.49)." 
}, { "sent_id": "13d1d79a4922d3b5d215d6f8f722ba-C001-47", "text": "----------------------------------" }, { "sent_id": "13d1d79a4922d3b5d215d6f8f722ba-C001-48", "text": "**CONCLUSIONS**" }, { "sent_id": "13d1d79a4922d3b5d215d6f8f722ba-C001-49", "text": "We proposed a new method to map FrameNet LUs to WordNet synsets by computing a similarity measure between LU definitions and WordNet glosses." }, { "sent_id": "13d1d79a4922d3b5d215d6f8f722ba-C001-50", "text": "To our knowledge, this is the only approach to the task based on this kind of similarity." }, { "sent_id": "13d1d79a4922d3b5d215d6f8f722ba-C001-51", "text": "The only comparable evaluation available is reported in [5] , and shows that our results are promising." }, { "sent_id": "13d1d79a4922d3b5d215d6f8f722ba-C001-52", "text": "De Cao at al. [2] reported a better performance, particularly for recall, but evaluation of their mapping algorithm relied on a gold standard of 4 selected frames having at least 10 LUs and a given number of corpus instantiations." }, { "sent_id": "13d1d79a4922d3b5d215d6f8f722ba-C001-53", "text": "In the future, we plan to improve the algorithm by shallow parsing the LU definitions and the WordNet glosses." }, { "sent_id": "13d1d79a4922d3b5d215d6f8f722ba-C001-54", "text": "Besides, we will exploit information extracted from the WordNet hierarchy." }, { "sent_id": "13d1d79a4922d3b5d215d6f8f722ba-C001-55", "text": "We also want to evaluate the effectiveness of the approach focusing on the new LUs to be included in the existing frames." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "13d1d79a4922d3b5d215d6f8f722ba-C001-17" ] ], "cite_sentences": [ "13d1d79a4922d3b5d215d6f8f722ba-C001-17" ] }, "@DIF@": { "gold_contexts": [ [ "13d1d79a4922d3b5d215d6f8f722ba-C001-51", "13d1d79a4922d3b5d215d6f8f722ba-C001-52" ] ], "cite_sentences": [ "13d1d79a4922d3b5d215d6f8f722ba-C001-52" ] } } }, "ABC_bb45a61408a0ade8ce0aba2b8f9ce7_48": { "x": [ { "sent_id": "bb45a61408a0ade8ce0aba2b8f9ce7-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "bb45a61408a0ade8ce0aba2b8f9ce7-C001-2", "text": "A long-term goal of AI research is to build intelligent agents that can see the rich visual environment around us, communicate this understanding in natural language to humans and other agents, and act in a physical or embodied environment." }, { "sent_id": "bb45a61408a0ade8ce0aba2b8f9ce7-C001-3", "text": "To this end, recent advances at the intersection of language and vision have made incredible progress -from being able to generate natural language descriptions of images/videos, to answering questions about them, to even holding freeform conversations about visual content!" }, { "sent_id": "bb45a61408a0ade8ce0aba2b8f9ce7-C001-4", "text": "However, while these agents can passively describe images or answer (a sequence of) questions about them, they cannot act in the world (what if I cannot answer a question from my current view, or I am asked to move or manipulate something?)." }, { "sent_id": "bb45a61408a0ade8ce0aba2b8f9ce7-C001-5", "text": "Thus, the challenge now is to extend this progress in language and vision to embodied agents that take actions and actively interact with their visual environments." 
}, { "sent_id": "bb45a61408a0ade8ce0aba2b8f9ce7-C001-6", "text": "----------------------------------" }, { "sent_id": "bb45a61408a0ade8ce0aba2b8f9ce7-C001-7", "text": "**TUTORIAL OVERVIEW**" }, { "sent_id": "bb45a61408a0ade8ce0aba2b8f9ce7-C001-8", "text": "This tutorial will provide an overview of the growing number of multimodal tasks and datasets that combine textual and visual understanding." }, { "sent_id": "bb45a61408a0ade8ce0aba2b8f9ce7-C001-9", "text": "We will comprehensively review existing stateof-the-art approaches to selected tasks such as image captioning (Chen et al., 2015) , visual question answering (VQA) (Antol et al., 2015; Goyal et al., 2017) and visual dialog (Das et al., 2017a,b) , presenting the key architectural building blocks (such as co-attention (Lu et al., 2016) ) and novel algorithms (such as cooperative/adversarial games (Das et al., 2017b) ) used to train models for these tasks." }, { "sent_id": "bb45a61408a0ade8ce0aba2b8f9ce7-C001-10", "text": "We will then discuss some of the current and upcoming challenges of combining language, vision and actions, and introduce some recently-released interactive 3D simulation environments designed for this purpose (Anderson et al., 2018b; Das et al., 2018) ." }, { "sent_id": "bb45a61408a0ade8ce0aba2b8f9ce7-C001-11", "text": "The goal of this tutorial is to provide a comprehensive yet accessible overview of existing work and to reduce the entry barrier for new researchers." }, { "sent_id": "bb45a61408a0ade8ce0aba2b8f9ce7-C001-12", "text": "In detail, we will first review the building blocks of the neural network architectures used for these tasks, starting from variants of recurrent sequenceto-sequence language models (Ilya Sutskever, 2014), applied to image captioning (Vinyals et al., 2015) , optionally with visual attentional mechanisms (Bahdanau et al., 2015; Xu et al., 2015; You et al., 2016; Anderson et al., 2018a )." 
}, { "sent_id": "bb45a61408a0ade8ce0aba2b8f9ce7-C001-13", "text": "We will then look at evaluation metrics for image captioning Anderson et al., 2016) , before reviewing how these metrics can be optimized directly using reinforcement learning (RL) (Williams, 1992; Rennie et al., 2017) ." }, { "sent_id": "bb45a61408a0ade8ce0aba2b8f9ce7-C001-14", "text": "Next, on the topic of visual question answering, we will look at more sophisticated multimodal attention mechanisms, wherein the network simultaneously attends to visual and textual features (Fukui et al., 2016; Lu et al., 2016) ." }, { "sent_id": "bb45a61408a0ade8ce0aba2b8f9ce7-C001-15", "text": "We will see how to combine factual and commonsense reasoning from learnt memory representations (Sukhbaatar et al., 2015) and external knowledge bases , and approaches that use the question to dynamically compose the answering neural network from specialized modules (Andreas et al., 2016a,b; Johnson et al., 2017a,b; Hu et al., 2017) ." }, { "sent_id": "bb45a61408a0ade8ce0aba2b8f9ce7-C001-16", "text": "Following the success of adversarial learning in visual recognition (Goodfellow et al., 2014) , it has recently been gaining momentum in language modeling (Yu et al., 2016) and in multimodal tasks such as captioning (Dai et al., 2017) and dialog (Wu et al., 2018a) ." }, { "sent_id": "bb45a61408a0ade8ce0aba2b8f9ce7-C001-17", "text": "Within visual dia-log, we will look at recent work that uses cooperative multi-agent tasks as a proxy for training effective visual conversational models via RL (Kottur et al., 2017; Das et al., 2017b) ." }, { "sent_id": "bb45a61408a0ade8ce0aba2b8f9ce7-C001-18", "text": "Finally, as a move away from static datasets, we will cover recent work on building active RL environments for language-vision tasks." 
}, { "sent_id": "bb45a61408a0ade8ce0aba2b8f9ce7-C001-19", "text": "Although models that link vision, language and actions have a rich history (Tellex et al., 2011; Paul et al., 2016; Misra et al., 2017) , we will focus primarily on embodied 3D environments (Anderson et al., 2018b; , considering tasks such as visual navigation from natural language instructions (Anderson et al., 2018b) , and question answering (Das et al., 2018; Gordon et al., 2018) ." }, { "sent_id": "bb45a61408a0ade8ce0aba2b8f9ce7-C001-20", "text": "We will position this work in context of related simulators that also offer significant potential for grounded language learning (Beattie et al., 2016; Zhu et al., 2017) ." }, { "sent_id": "bb45a61408a0ade8ce0aba2b8f9ce7-C001-21", "text": "To finish, we will discuss some of the challenges in developing agents for these tasks, as they need to be able to combine active perception, language grounding, commonsense reasoning and appropriate long-term credit assignment to succeed." }, { "sent_id": "bb45a61408a0ade8ce0aba2b8f9ce7-C001-22", "text": "----------------------------------" }, { "sent_id": "bb45a61408a0ade8ce0aba2b8f9ce7-C001-23", "text": "**STRUCTURE**" }, { "sent_id": "bb45a61408a0ade8ce0aba2b8f9ce7-C001-24", "text": "The following structure is based on an approximately 3 hour timeframe with a break." }, { "sent_id": "bb45a61408a0ade8ce0aba2b8f9ce7-C001-25", "text": "Peter Anderson is a final year PhD candidate in Computer Science at the Australian National University, supervised by Dr Stephen Gould, and a researcher within the Australian Centre for Robotic Vision (ACRV)." }, { "sent_id": "bb45a61408a0ade8ce0aba2b8f9ce7-C001-26", "text": "His PhD focuses on deep learning for visual understanding in natural language." 
}, { "sent_id": "bb45a61408a0ade8ce0aba2b8f9ce7-C001-27", "text": "He was an integral member of the team that won first place in the 2017 Visual Question Answering (VQA) challenge at CVPR, and he has made several contributions in image captioning, including achieving first place on the COCO leaderboard in July 2017." }, { "sent_id": "bb45a61408a0ade8ce0aba2b8f9ce7-C001-28", "text": "He has published at CVPR, ECCV, EMNLP and ICRA, and spent time at numerous universities and research labs including Adelaide University, Macquarie University, and Microsoft Research." }, { "sent_id": "bb45a61408a0ade8ce0aba2b8f9ce7-C001-29", "text": "His research is currently focused on vision-and-language understanding in complex 3D environments." }, { "sent_id": "bb45a61408a0ade8ce0aba2b8f9ce7-C001-30", "text": "----------------------------------" }, { "sent_id": "bb45a61408a0ade8ce0aba2b8f9ce7-C001-31", "text": "**ABHISHEK DAS**" }, { "sent_id": "bb45a61408a0ade8ce0aba2b8f9ce7-C001-32", "text": "Abhishek Das is a Computer Science PhD student at Georgia Institute of Technology, advised by Dhruv Batra, and working closely with Devi Parikh. He is interested in deep learning and its applications in building agents that can see (computer vision), think (reasoning and interpretability), talk (language modeling) and act (reinforcement learning)." }, { "sent_id": "bb45a61408a0ade8ce0aba2b8f9ce7-C001-33", "text": "He is a recipient of an Adobe Research Fellowship and a Snap Research Fellowship." }, { "sent_id": "bb45a61408a0ade8ce0aba2b8f9ce7-C001-34", "text": "He has published at CVPR, ICCV, EMNLP, HCOMP and CVIU, co-organized the NIPS 2017 workshop on Visually-Grounded Interaction and Language, and has held visiting positions at Virginia Tech, Queensland Brain Institute and Facebook AI Research." }, { "sent_id": "bb45a61408a0ade8ce0aba2b8f9ce7-C001-35", "text": "He graduated from Indian Institute of Technology Roorkee in 2015 with a Bachelor's in Electrical Engineering." 
}, { "sent_id": "bb45a61408a0ade8ce0aba2b8f9ce7-C001-36", "text": "----------------------------------" }, { "sent_id": "bb45a61408a0ade8ce0aba2b8f9ce7-C001-37", "text": "**QI WU**" }, { "sent_id": "bb45a61408a0ade8ce0aba2b8f9ce7-C001-38", "text": "Dr. Qi Wu, is a research fellow in the Australia Centre for Robotic Vision (ACRV) in the University of Adelaide." }, { "sent_id": "bb45a61408a0ade8ce0aba2b8f9ce7-C001-39", "text": "Before that, he was a postdoc researcher in the Australia Centre for Visual Technologies (ACVT) in the University of Adelaide." }, { "sent_id": "bb45a61408a0ade8ce0aba2b8f9ce7-C001-40", "text": "He obtained his PhD degree in 2015 and MSc degree in 2011, in Computer Science from University of Bath, United Kingdom." }, { "sent_id": "bb45a61408a0ade8ce0aba2b8f9ce7-C001-41", "text": "His research interests are mainly in Computer Vision and Machine Learning." }, { "sent_id": "bb45a61408a0ade8ce0aba2b8f9ce7-C001-42", "text": "Currently, he is working on the visionto-language problem and he is especially an expert in the area of Image Captioning and Visual Question Answering (VQA)." }, { "sent_id": "bb45a61408a0ade8ce0aba2b8f9ce7-C001-43", "text": "His attributes-based image captioning model got first place on the COCO Image Captioning Challenge Leader Board in the October of 2015." }, { "sent_id": "bb45a61408a0ade8ce0aba2b8f9ce7-C001-44", "text": "He has published several papers in prestigious conferences and journals, such as TPAMI, CVPR, ICCV, ECCV, IJCAI and AAAI." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "bb45a61408a0ade8ce0aba2b8f9ce7-C001-10" ], [ "bb45a61408a0ade8ce0aba2b8f9ce7-C001-19" ] ], "cite_sentences": [ "bb45a61408a0ade8ce0aba2b8f9ce7-C001-10", "bb45a61408a0ade8ce0aba2b8f9ce7-C001-19" ] } } }, "ABC_f58555aa8fc78903df83af8309a3d7_48": { "x": [ { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-2", "text": "Linguistic annotation is the process of adding additional notations to raw linguistic data for descriptive or analytical purposes." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-3", "text": "In the tagging of complex Chinese and multilingual linguistic data with a sophisticated linguistic framework, immediate visualization of the complex multi-layered functional and discourse structures is crucial for both speeding up the tagging process and reducing errors." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-4", "text": "The need for large-scale linguistically annotated corpora has made collaborative annotation increasingly essential, and existing annotation tools are inadequate to the task of providing assistance to annotators when dealing with complex linguistic structural information." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-5", "text": "In this paper we describe the design and development of a collaborative tool to extend existing annotation tools." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-6", "text": "The tool improves annotation efficiency and addresses certain difficulties in representing complex linguistic relations." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-7", "text": "Here, we adopt annotation based on Systemic Functional Linguistics and Rhetorical Structure Theory to demonstrate the effectiveness of the interface built on such infrastructure." 
}, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-8", "text": "----------------------------------" }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-9", "text": "**INTRODUCTION**" }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-10", "text": "Recent years have witnessed an increasing need for large-scale high-quality annotated corpora on complex Chinese linguistic information where no automated annotators are available." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-11", "text": "Annotation on multi-level data complex structural relationships in such linguistic frameworks as Systemic Functional Grammar (SFG) [1] and Rhetorical Structure Theory [2] is a difficult task." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-12", "text": "SFG investigates texts as intentional acts of meaning, organized in functional-semantic components known as \"metafunctions\"." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-13", "text": "Three primary metafunctions, operating in parallel and each representing a layer of meaning with a set of options to language users, cover different functional aspects of human communication and expression: the ideational, interpersonal and textual metafunctions." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-14", "text": "For our purposes our discussion will focus on analysis and annotation of these three metafunctions in SFG." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-15", "text": "Despite the fact that SFG is becoming increasingly influential among Chinese linguistic researchers, a large-scale, high-quality corpus annotated with SFG has yet to be developed [3] ." 
}, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-16", "text": "Consequently, when trying to conduct corpus-based analysis using the SFG framework researchers must either 1) spend an enormous amount of time studying an unannotated corpus, 2) embark on the error-prone process of manually annotating a corpus on their own, or 3) rely on small corpora independently annotated by researchers which may not be particularly suited to needs of the tasks at hand." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-17", "text": "The lack of high-quality Chinese SFG corpora is partly attributable to the lack of a competent SFG tagger capable of annotating large-scale corpora while ensuring quality." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-18", "text": "In developing such a tagger, a number of challenges need to be addressed: 1) Lack of an efficient and sophisticated storage scheme for storing such multilayered information with complex structures 2) Additional visual cues to facilitate the tagging process 3) Need for collaborative tasking (co-tasking) by different annotators" }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-19", "text": "The most common method to annotate text includes the use of an open standard like XML document." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-20", "text": "Provided one possesses the prerequisite familiarity with XML conventions, the linguist-as-annotator inserts metadata most likely using a plain-text editor or generic XML editor." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-21", "text": "This method works well so long as the text is short, and the required linguistic information is relatively simple." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-22", "text": "While some special editing tools have been created which provide a graphical interface for linguists to tag texts, such tools, for the most part, tend to be stand-alone, primarily oriented to single users." 
}, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-23", "text": "To facilitate efficient, high quality annotation of a large amount of Chinese text material by a team of co-tasking linguists, we have developed a new multi-user linguistic information annotator, which provides real-time cross-domain reference as visual \"feedback\", thereby assisting linguists to tag text data in a highly effective way." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-24", "text": "Multiple users can work at the same time on any portion of the text, with their annotations revealed (or selectively not) to other members as reference." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-25", "text": "Those responsible for verification, comparison, correction, and progress tracking can view the work even as it is being carried out." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-26", "text": "This design is intended to improve both the efficiency and quality of annotation, while enabling multi-user tagging of substantially greater text material in shorter time." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-27", "text": "----------------------------------" }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-28", "text": "**THE FRAMEWORK**" }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-29", "text": "Here, we first review existing tools for annotating texts before discussing the advantages of our new tool." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-30", "text": "We also present an application scenario of our tools for annotating text and explain how visualized cross-domain reference works." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-31", "text": "A number of similar tools have been developed for various annotation scenarios." 
}, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-32", "text": "MMAX2 [4] is a customizable tool for creating and visualizing multilevel linguistic annotations that allows outputs the results of annotations according to predefined style-sheets." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-33", "text": "It supports tagging of part-of-speech tags, coreference and grammatical relations, but is not capable of representing and visualizing complex discourse level structures." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-34", "text": "SALTO [5] is a multilevel annotation tool for annotating semantic roles and Treebank syntactic structures." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-35", "text": "O'Donnell's annotation tool for Systemic Functional Linguistics, the UAM CorpusTool [6] , is intended for annotating multi-layered Systemic Functional Grammar structures by a Single User." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-36", "text": "Both tools are restrictive in terms of functionalities and do not support collaborative annotation and provide no means of representing complex sentential structures." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-37", "text": "Our representation model is built on the functionalities of Annotation Graph [7] and the underlying storage scheme is conceptually similar to Standoff XML format [9] , but we opted for a relational database structure built with an object-oriented design for efficiency, reusability and versatility." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-38", "text": "Several web-based annotation tools such as Serengeti [10] , a tool for annotating anaphoric relations and lexical chains, are limited to a particular domain and cannot be used for annotating and visualizing complex structural information without substantial modification." 
}, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-39", "text": "----------------------------------" }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-40", "text": "**WEB-BASED COLLABORATIVE ANNOTATION**" }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-79", "text": "One difficulty has been the lack of high-quality corpora to bootstrap the automation, a time and cost extensive task that has to be done manually." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-41", "text": "Traditionally, annotation processes that involve more than one annotator are often divided into multiple steps where one step is taken up and completed by one annotator before being passed on to another." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-42", "text": "This is adequate for small annotation projects where only a linear sequential procedure is involved." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-43", "text": "In recent years, however, the growing scale and complexity of annotation projects have necessitated the collaboration of different annotators who are often geographically dispersed." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-44", "text": "In view of these needs, we develop our application on a web-based infrastructure making it accessible from any web-accessible point and enabling collaborative annotation on the same data source either synchronously or asynchronously." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-45", "text": "One problem that arises in collaborative annotation is that annotators often come with different sets of skills and have varying, sometimes overlapping responsibilities." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-46", "text": "Our goal is provide a user-friendly, intuitive interface, designed to reduce the drudgery of XML-based annotation, while enforcing annotating standards and quality functionalities for user management and versioning." 
}, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-47", "text": "Each stage in the annotation process is divided into several hierarchically structured steps in which each parent step can spawn child steps to be taken up by one or more annotators." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-48", "text": "This gives the annotator fine-grained control over the annotating process and facilitates clear division of labor among different annotators." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-49", "text": "In addition, all annotators collaborating on the same step get notified of the relevant changes in annotation in real time once a modification has been made." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-50", "text": "The tagger is built on a generic, multifunctional relational database similar to the annotation graph model [7] that has been demonstrated to be capable of representing virtually all sorts of common linguistic annotations." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-51", "text": "In the collaborative environment annotators can plug in certain linguistic resource that can serve as the standardized version assessable to all annotators, instead of each annotator keeping his own version, which may cause severe merging difficulties." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-52", "text": "----------------------------------" }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-53", "text": "**REPRESENTATION OF COMPLEX LINGUISTIC STRUCTURES AND RELATIONSHIPS**" }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-54", "text": "The storage scheme for traditional annotation tools built using XML have been largely restrained by the inherent limitations of XML, which is suitable for storing written texts that are continuous, linear and single-layered." 
}, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-55", "text": "For non-continuous, overlapping and multi-layered linguistic information, XML-based tools typically rely on complex workarounds that unnecessarily overcomplicate the data model." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-56", "text": "Most linguistic structures can be represented with an Annotation Graph interface." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-57", "text": "In annotating corpora with linguistic models such as Systemic Functional Grammar, where the linguistic information is structured in a multi-layered, overlapping hierarchy with references pointing to the linguistic elements, the underlying representation model must be carefully designed." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-58", "text": "The underlying data model of our platform is built on the same principles as Annotation Graph but adopts a modularized design to cover emerging use cases." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-59", "text": "In annotating any sizable corpora, one recurring problem is representing the complex relations across various layers of linguistic elements." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-60", "text": "In this paper we have generalized common linguistic relations on three levels of linguistic elements, namely: 1) Unit Level: single linguistic elements (word, morpheme) 2) Segment Level: continuous range of linguistic elements (phrases, clauses, sentences, and paragraphs) 3) Group Level: groups of ranges of linguistic Elements (non-continuous grammatical units, i.e., clausal relations, hierarchical discourse trees in RST) Figure 1 : Three primary levels of linguistic relations." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-61", "text": "Figure 1 illustrates a simplified abstract view of the three-level structure." 
}, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-62", "text": "At the Unit Level, the basic linguistic elements (e.g. words, morphemes) are either broken up into several separate linguistic segments, or joined together by an unlimited number of continuous units into a common segment." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-63", "text": "For example, the word uncovered can be made up of several morphemes (i.e., un + cover + ed), each represented by a single segment, or it can be joined together by another word (e.g. cases) to form a new segment (uncovered cases)." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-64", "text": "At the Segment Level, segments (e.g. morphemes, words) can be part of a larger segment (e.g. clauses, paragraphs) in an indefinitely recursive and hierarchical manner." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-65", "text": "The Group Level is a generic structure that deals with relations among linguistic units and segments." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-66", "text": "For example, in RST there are different discourse relations (e.g. Antithesis, Condition) and roles (e.g. Nucleus, Satellite)." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-67", "text": "Such relations in the data model are defined as groups, with one textual segment pointing to another and attaching a relation (function, tag, or role) to the pointed segment." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-68", "text": "Similar to segments, the number of segments in each group is unlimited and the group as a whole can in turn be pointed to by another group with an arbitrary depth of recursion and hierarchy, but unlike units in segments, the segments in each group can be non-continuous and overlapping, thus enabling any complex relations to be aptly defined." 
}, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-69", "text": "In our application scenarios, we focus on annotating hierarchical discourse structures in RST and the three layers of metafunctions in SFG." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-70", "text": "These layers of linguistic units and the complex relations among them are represented using the proposed common structure." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-71", "text": "In one-to-many and many-to-many relations, a sequence of ordered linguistic objects may be linked across different layers." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-72", "text": "Such interrelationships can form complex linguistic networks representing intricate linguistic meanings." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-73", "text": "Due to their inherent complexity, understanding such relationships can pose challenges to annotators, especially when such relationships are constantly added or removed in a collaborative annotating environment." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-74", "text": "The platform introduces real-time visualization of the structural relations as the annotation progresses, allowing the annotator to keep track of and make changes to annotations accordingly." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-75", "text": "In annotating such structural relations, each unit is given a unique identifying number which we use for easy grouping of the units and to define the complex, often embedded interrelations between the units (e.g., in SFG these include logico-semantic relations such as Parataxis and Hypotaxis, Elaboration and Extension etc.)." 
}, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-76", "text": "----------------------------------" }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-77", "text": "**VISUALIZED CROSS-DOMAIN REFERENCE**" }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-78", "text": "While the past decade has seen significant advancement in the automatic annotation of Proceedings of the Twenty-Fourth Conference on Computational Linguistics and Speech Processing (ROCLING 2012) functional structures, the automatic annotation of semantic and discourse information has been largely ignored." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-80", "text": "In a collaborative environment, leveraging the resources of non-expert annotators can significantly boost the annotation efficiency, as has been demonstrated by recent experiments [11] ." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-81", "text": "The lack of sufficient linguistic expertise, however, restrains non-expert linguistic annotators from engaging in more complex annotations." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-82", "text": "The annotation process can be significantly accelerated using assistance and reference tools such as a tag dictionary [12] ." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-83", "text": "Different annotators may form different opinions on particular annotations based on their own reference to acquired linguistic knowledge." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-84", "text": "By unifying the source of such knowledge, we may be able to boost inter-annotator agreement on issues where they otherwise differ." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-85", "text": "Our annotation tool is built on a generic infrastructure compatible with various formats of linguistic information such as Treebanks, multilingual corpora, part-of-speech (POS) annotation and output from statistical syntactic parsers such as the Stanford parser." 
}, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-86", "text": "These additional corpora and annotations not only serve to enrich textual data with additional layers of linguistic information but can be potentially used to assist in annotation." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-87", "text": "In our current application scenarios, when annotating a corpus the annotator is often faced with the following tasks: 1) Divide the text into meaningful segments 2) Analyze the segmented texts for the internal structure, such as functional structure of a clause or sentence 3) Analyze the functions of each functional/semantic unit, such as the part-of-speech of each word 4) Refer to a previously annotated section similar to the one being annotated 5) Consult a thesaurus for entries to the words whose meaning is unclear 6) Consult a multilingual corpus parallel to or aligned with the corpus (when annotating a corpus in another language)." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-88", "text": "Figure 2 is an example of some of the available information that has been incorporated into our annotation platform to provide easy access for collaborating annotators." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-89", "text": "The panel is made up of three selected components that assist in the annotation task." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-90", "text": "The first section is produced from an automatic part-of-speech (POS) tagger (we use the Stanford POS tagger)." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-91", "text": "The tagger reads raw text as input and yields the POS tags of each word." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-92", "text": "This information is useful as it provide basic disambiguation and guidance when annotating the text." 
}, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-93", "text": "Similarly, the third section is produced by a syntactic parser (Stanford Parser), which not only parses the text syntactically, but generates the complete tree structure of the parse." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-94", "text": "Glancing at the tree can provide helpful information in understanding the text at a syntactic level." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-95", "text": "Both the tagger and parser are highly generic and customizable." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-96", "text": "They can be used for tagging and parsing different languages after being trained on data of corresponding languages." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-97", "text": "The second section, on the other hand, is specific to texts with corresponding translations." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-98", "text": "The example is taken from a text from the Bible, which comes with many different versions that were aligned to each other using a special mechanism." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-99", "text": "With such information integrated with the database, it needs to be easily accessible to aid in annotation and revision (correcting errors made in the annotation)." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-100", "text": "Visualization has been found to be effective in helping users process new information [13] so introducing visualization techniques to our platform should enable users to more effectively process such information." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-101", "text": "Each of the above-mentioned layers of extra information is visualized in a windowed interface that can be customized for the needs of a particular task." 
}, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-102", "text": "The annotator can decide which of the available layers to use for reference, and at different stages of annotation different layers may be presented." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-103", "text": "The visualization is an automatic process requiring no manual intervention apart from initial settings." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-104", "text": "When the annotation moves on to the next section/stage, the contents of the visualization will be automatically updated." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-105", "text": "When designing the annotation platform we have several goals in mind: it must be intuitive and easy-to-use." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-106", "text": "The learning curve must be kept to a minimum." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-107", "text": "We reduce the process of annotation to a two-step process: 1) define the annotation range 2) assign a label." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-108", "text": "We allow optional features such as defining the step hierarchy, placing labels in each step, visualizing and editing existing annotations, defining complex linguistic relations." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-109", "text": "In addition, it must provide immediate feedback through visualization." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-110", "text": "In functional grammar systems such as Systemic Function Grammar when tagging a particular layer of meaning, the other layers as defined in the step hierarchy should be immediately visible in a multilayered structured format." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-111", "text": "These information layers provide additional references to the current layer being annotated, especially when they are closely linked in terms of function or meaning." 
}, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-112", "text": "When errors are made they are visible from the reference panel and appropriate actions such as deletion or modification can be taken." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-113", "text": "Figure 3 shows the annotation interface we designed to meet these requirements." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-114", "text": "Figure 3 is an illustration of some of the functionalities currently implemented." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-115", "text": "The annotator starts by selecting a range of text to annotate." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-116", "text": "Visual channels appear to assist annotators in making the decisions more easily and with a higher degree of consistency." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-117", "text": "The channels on the right side of the interface provide a detailed collection of functional and semantic labels." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-118", "text": "The label structure for a particular annotation is shown at the bottom right where the structure of different metafunctions of the selected annotation is shown in a uniform way." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-119", "text": "The annotator can operate on the labeled structure directly by adding, removing and modifying the labels in the visual structure." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-120", "text": "----------------------------------" }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-121", "text": "**APPLICATION**" }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-122", "text": "The tagger built on the proposed infrastructure can be used for visualizing various types of analysis." 
}, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-123", "text": "Rhetorical Structure Theory, for example, has been adopted for the tagger to help visualize the analysis of US President Obama's speeches." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-124", "text": "Rhetorical Structure Theory (RST) is \"an abstract set of convention\" which \"provides a general way to describe the relations among clauses in a text\" [2] ." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-125", "text": "This theory is widely used for text analysis for complex multilayer sentence and paragraph relations." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-126", "text": "These sentence/paragraph relations are tagged using the proposed tagger, visualized and presented with the help of the \"RST generator\" which generates the RST figures, visualizing sentence/paragraph interpretation pictures." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-127", "text": "This annotating and visualizing method has already been applied in the analysis of Obama's inaugural and victory speeches, rendering 'the big picture' for how these speeches were constructed (Figure 4 )." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-128", "text": "----------------------------------" }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-129", "text": "**CONCLUSION**" }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-130", "text": "In this paper, we present a collaborative tool for Chinese and multilingual linguistic structure annotation with visualized cross-domain references." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-131", "text": "We begin by a discussion on current trends in annotating corpora and the requirements for developing a new annotation tool." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-132", "text": "A review of existing linguistics analysis tool is presented in our introduction." 
}, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-133", "text": "We demonstrate with example applications that 1) a large collaborative annotation platform is necessary for speeding up large-scale manual or semi-automated Chinese linguistic annotation; 2) annotating complex linguistic information is a difficult and error-prone process; 3) visualized annotation references for language structures can help facilitate the annotation process, especially in a collaborative environment; and 4) cross-domain references can further assist annotators in making the right decisions." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-134", "text": "Our tool is designed with collaborative tasking and cross-domain analysis in mind." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-135", "text": "All linguistic signals are converted into interoperable database structures in real time when users submit their input." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-136", "text": "Data obtained from different domains can be stored in the database structure and used to serve as the basis for cross-domain references." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-137", "text": "The use of our tools for handling these relationships requires a minimal learning curve." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-138", "text": "The same system may also be used for educational purposes like annotation training and examination marking for students." }, { "sent_id": "f58555aa8fc78903df83af8309a3d7-C001-139", "text": "Usage examples may include exercises on identifying SFL constituents, translation alignment and other language analysis." 
} ], "y": { "@USE@": { "gold_contexts": [ [ "f58555aa8fc78903df83af8309a3d7-C001-37" ] ], "cite_sentences": [ "f58555aa8fc78903df83af8309a3d7-C001-37" ] }, "@DIF@": { "gold_contexts": [ [ "f58555aa8fc78903df83af8309a3d7-C001-37" ] ], "cite_sentences": [ "f58555aa8fc78903df83af8309a3d7-C001-37" ] }, "@SIM@": { "gold_contexts": [ [ "f58555aa8fc78903df83af8309a3d7-C001-50" ] ], "cite_sentences": [ "f58555aa8fc78903df83af8309a3d7-C001-50" ] } } }, "ABC_7666d5a8e05e79f68ec60e47cddecd_48": { "x": [ { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-2", "text": "Domain adaptation is a time consuming and costly procedure calling for the development of algorithms and tools to facilitate its automation." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-3", "text": "This paper presents an unsupervised algorithm able to learn the main concepts in event summaries." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-4", "text": "The method takes as input a set of domain summaries annotated with shallow linguistic information and produces a domain template." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-5", "text": "We demonstrate the viability of the method by applying it to three different domains and two languages." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-6", "text": "We have evaluated the generated templates against human templates obtaining encouraging results." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-7", "text": "----------------------------------" }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-9", "text": "Our research is concerned with the development of techniques for knowledge induction in the field of text summarization." 
}, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-10", "text": "Our goal is to automatically induce the necessary knowledge for the generation of concise event summaries such as the one shown in Figure 1 ." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-11", "text": "This kind of summaries, which can be found on the Web and in text collections, contain key information of the events they describe." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-12", "text": "Previous work in the area of text summarization (DeJong, 1982; Oakes and Paice, 2001; Saggion and Lapalme, 2002) addressed the problem of generating this type of concise summaries from texts, relying on information extraction and text generation techniques." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-13", "text": "These approaches were difficult to port to new domains and languages because of the efforts needed for modelling the underlying event template structure." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-14", "text": "In this paper we propose a method for learning the main concepts in domain summaries in an unsupervised iterative procedure." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-15", "text": "The proposed algorithm takes a set of unannotated summaries in a given domain and produces auto-annotated summaries which can be used for training information extraction and text generation systems." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-16", "text": "Domain adaptation is essential for text summarization and information extraction, and the last two decades have seen a plethora of methods for supervised, semisupervised, and unsupervised learning from texts." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-17", "text": "2001 August 24: Air Transat Flight 236 runs out of fuel over the Atlantic Ocean and makes an emergency landing in the Azores." 
}, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-18", "text": "Upon landing some of the tires blow out, causing a fire that is extinguished by emergency personnel on the ground." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-19", "text": "None of the 304 people on board the Airbus A330-200 were seriously injured." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-20", "text": "For example, in (Li et al., 2010) clustering is applied to generate templates for specific entity types (actors, companies, etc.) and patterns are automatically produced that describe the information in the templates." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-21", "text": "In (Chambers and Jurafsky, 2009 ) narrative schemas are induced from corpora using coreference relations between participants in texts." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-22", "text": "Transformation-based learning is used in (Saggion, 2011) to induce templates and rules for non-extractive summary generation." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-23", "text": "Paraphrase templates containing concepts and typical strings were induced from comparable sentences in (Barzilay and Lee, 2003) using multisentence alignment to discover \"variable\" and fixed structures." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-24", "text": "Linguistic patterns were applied to huge amounts of non-annotated pre-classified texts in (Riloff, 1996) to bootstrap information extraction patterns." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-25", "text": "Similarly, semi-supervised or unsupervised methods have been used to learn question/answering patterns (Ravichandran and Hovy, 2002) or text schemas (Bunescu and Mooney, 2007) ." 
}, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-26", "text": "One current paradigm to learn from raw data is open information extraction (Downey et al., 2004; Banko, 2009) , which without any prior knowledge aims at discovering all possible relations between pairs of entities occurring in text." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-27", "text": "Our work tries to learn the main concepts making up the template structure in domain summaries, similar to (Chambers and Jurafsky, 2011) ." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-28", "text": "However, we do not rely on any source of external knowledge (i.e. WordNet) to do so." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-29", "text": "This paper presents an iterative-learning algorithm which is able to identify the key components of event summaries." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-30", "text": "We will show that the algorithm can induce template-like representations in various domains and languages." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-31", "text": "The rest of the paper is organized in the following way: In Section 2 we introduce the dataset we are using for our experiments and describe how we have prepated it for experimentation." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-32", "text": "Then, in Section 3 we provide an overview of our concept induction learnig algorithm while in Section 4 we explain how we have instantiated the algorithm for the experiments presented in this paper." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-33", "text": "Section 5 describe the experiments and results obtained and Section 6 discusses our approach comparing it with past research." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-34", "text": "Finally, in Section 7 we close the paper with conclusions and future work." 
}, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-35", "text": "----------------------------------" }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-36", "text": "**DATA AND DATA PREPARATION**" }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-37", "text": "The dataset used for this study -part of the CON-CISUS corpus (Saggion and Szasz, 2012 ) -consists of a set of 250 summaries in Spanish and English for three different domains: aviation accidents, rail accidents, and earthquakes." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-38", "text": "This dataset makes it possible to compare the performance of learning procedures across languages and domains." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-39", "text": "Based on commonsense, a human annotator developed an annotation schema per domain to describe in a templatelike representation the essential elements (i.e., slots) of each event." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-40", "text": "For example, for the aviation accident domain these essential elements were: the date of the accident, the number of victims, the airline, the aircraft, the location of the accident, the flight number, the origin and destination of the flight, etc." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-41", "text": "The dataset was then annotated following the schema using the GATE annotation tool." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-42", "text": "The human annotations are used for evaluation of the concept discovery algorithm." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-43", "text": "Each document in the dataset was automatically annotated using tools for each language." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-44", "text": "We relied on basic processing steps to identify sentences, words and word-roots, parts of speech, nounchunks, and named entities using the GATE system for English (Maynard et al., 2002) and TreeTagger for Spanish (Schmid, 1995) ." 
}, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-45", "text": "The method is designed to learn the conceptual information in the summaries by extension (i.e., the set of strings that make up the concept in a given corpus) and by intension (i.e., an algorithm able to recognise the concept members in new documents in the domain) (Buitelaar and Magnini, 2005) ." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-46", "text": "Concept extensions identified by our method in the English summaries in the aviation domain are listed in Table 3 ." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-47", "text": "Each summary in the corpus can be seen as a sequence of strings and chunks as shown in Figure 1 (named entities and noun chunks are shown in boldface and they may overlap)." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-48", "text": "The procedure to learn a concept in the corpus of summaries is given in pseudocode in Algorithm 2 which is repeatedly invoked by a main algorithm to learn all concepts (Algorithm 1)." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-49", "text": "The idea of the algorithm is rather simple, at each iteration a document is selected for learning, and from this document a single chunk (i.e., a noun chunk or a named entity) available for learning is selected as a seed example of a hypothetical concept (the concept is given a unique name at each itera- tion)." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-50", "text": "The document is annotated with this seed as a target concept and a classifier is trained using this document." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-51", "text": "The trained classifier is then applied to the rest of the documents to identify instances of the hypothetical concept." 
}, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-52", "text": "If the classifier is unsuccessful in identifying new instances, then the chunk used in the current iteration is discarded from the learning process, but if the classifier is successful and able to identify instances of the hypothetical concept, then the \"best\" annotated document is selected and added to the training set." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-53", "text": "The classifier is re-trained using the new added document and the process is repeated until no more instances can be identified." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-54", "text": "A hypothetical concept is kept only if there is enough support for it across the set of documents." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-55", "text": "The main procedure calls the basic algorithms a number of times while there are concepts to be learnt (or all chunks have been used)." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-56", "text": "The stopping criteria is the number of concepts which could possibly be learnt, an estimation of which is the average number of chunks in a document." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-57", "text": "----------------------------------" }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-58", "text": "**ALGORITHM INSTANTIATION**" }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-59", "text": "Experiments were carried out per domain and language to assess the suitability of the algorithm to the conceptual learning task." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-60", "text": "A number of points in Algorithm 2 need clarification: the selection of a document in line 4 of the algorithm can be carried out using different informed procedures; for the experiments described here we decided to select the document with more available hypotheses, i.e., the document with more chunks." 
}, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-61", "text": "For the selection of a (Li et al., 2004) ." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-62", "text": "The features we use for representing the instance to be learnt are very superficial for these experiments: lemmas, parts-of-speech tags, orthography, and named entity types of the words surrounding the target concept to be learnt." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-63", "text": "The SVMs provide as output a class together with a probability which is essential to our method." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-64", "text": "We use this probability for selecting the best document in line 13 of the algorithm: the instance predicted with the highest probability is located and the document where this instance occurs is returned as \"best document\"." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-65", "text": "In case no instances are learned (e.g., else in line 17), the iteration ends returning the extension learnt so far." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-66", "text": "Concerning Algorithm 1: in line 5 (the while) we use as stopping criteria for the maximum number of concepts to learn the average number of chunks in the corpus." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-67", "text": "In line 7, the FILTER CONCEPT function evaluates the concept, keeping it only if two criteria are met: (i) there are not \"too many\" repetitions of a string in the discovered concept and (ii) the discovered concept covers a reasonable number of documents." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-68", "text": "With criteria (i) we filter out a concept which contains repeated strings: a concept could be formed simply by grouping together all repeated phrases in the set of documents (i.e. \"the earthquake\" or \"the accident\" or \"the plane\")." 
}, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-69", "text": "While these phrases could be relevant in the target domain they do not constitute a key concept in our interpretation." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-70", "text": "Strings which are repeated in the concept extension are more like the \"backbone structure\" of the summaries in the domain." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-71", "text": "In our experiments both criteria are experimental variables and we vary them from 10% to 100% at 20% intervals." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-72", "text": "In Section 5 we will present results for the best configurations." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-73", "text": "----------------------------------" }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-74", "text": "**EXPERIMENTS AND RESULTS**" }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-75", "text": "In order to evaluate the discovered concepts we have treated learning as information extraction." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-76", "text": "In order to evaluate them in this context we first need to map each learnt concept onto one of the human concepts." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-77", "text": "The mapping, which is based on the concept extension, is straightforward: a discovered concept is mapped onto the human concept with which it has a majority of string matches." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-78", "text": "Note that we match the discovered text offsets in the analysed documents and not only the identified strings." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-79", "text": "In order to evaluate the matching procedure we have used precision, recall, and f-score measures comparing the automatic concept with the human concept." 
}, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-80", "text": "Note that we use a lenient procedure -counting as correct strings those with a partial match." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-81", "text": "This is justified since discovering the exact boundaries of a concept instance is a very difficult task." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-82", "text": "Table 2 shows some examples of the human annotated instances and related discovered one." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-83", "text": "It can be appreciated that the learnt concepts have a reasonable match degree with the human annotated ones." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-84", "text": "Table 3 gives information extraction results per domain and language for the best configuration of the algorithm." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-85", "text": "The best scores are generally obtained when coverage is set to 10% of the number of summaries, except for the learning of conceptual information in Spanish for the earthquake domain where the system performs better for 10% summary coverage." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-86", "text": "The parameter controlling string repetition in the concept extension should be kept small." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-87", "text": "The obtained results are quite satisfactory consider- ing the small dataset and the limited use of linguistic resources during learning." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-88", "text": "These results compare favorably to cross-validation results obtained using supervised machine learning techniques (Saggion and Szasz, 2011) ." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-89", "text": "Learning from the earthquake domain appears to be more challenging given the more verbose characteristics of these texts." 
}, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-90", "text": "Even though space restricions prevent us from showing all evaluation results, in Table 4 we present detailed results for the two domains and languages." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-91", "text": "Note that the concepts listed constitute the slots of the induced domain template." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-92", "text": "----------------------------------" }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-93", "text": "**DISCUSSION**" }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-94", "text": "Similar to active learning information extraction techniques (Ciravegna and Wilks, 2003) , the concept discovery algorithm presented here is inspired by techniques like learning by reading, where unfamiliar expressions in one document can be \"explained\" by association to expressions in similar document contexts." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-95", "text": "However, and unlike active learning, human intervention is unnecessary in our approach." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-96", "text": "Although the algorithm achieves reasonably lenient performance, strict (hard) evaluation indicates that in each experimental condition performance drops when a strict match is required." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-97", "text": "This is expected given the difficulty of finding the right instance boundaries based only on automatic chunking information." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-98", "text": "For this reason, we intend to carry out additional experiments based on richer domain independent features from a syntactic parser." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-99", "text": "We have identified a number of reasons why some concept instances can not be correctly associated with their concepts." 
}, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-100", "text": "In the aviation domain, for example, numeric expressions constitute the extensions of different concepts including: number of victims, crew members, and number of survivors; it is a rather common feature in the aviation domain to include these different concepts together in one sentence, making their \"separation\" complicated." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-101", "text": "Same explanations apply to other tested domains: for example locations playing the role of origin and destination of a given train or airplace are also sometimes confused." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-102", "text": "Our work demonstrates the possibility of learning conceptual information in several domains and languages, while previous work (Chambers and Jurafsky, 2011) has addressed sets of related domains (e.g., MUC-4 templates) in English." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-103", "text": "Learning full conceptualizations from raw data is a daunting and difficult enterprise (Biemann, 2005) ." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-104", "text": "Here, we provide a short-cut by proposing a method able to learn the essential concepts of a domain by relying on summaries which are freely available on the Web." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-105", "text": "Our method is able to produce conceptualizations from a few documents in each domain and language unlike recent open domain information extraction which requires massive amount of texts for relation learning (Banko, 2009 )." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-106", "text": "Our algorithm has a reasonable computational complexity, unlike alignment-based or clustering-based approaches (Barzilay and Lee, 2003) , which are computationally expensive." 
}, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-107", "text": "----------------------------------" }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-108", "text": "**CONCLUSIONS AND OUTLOOK**" }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-109", "text": "Domain adaptation is a time consuming and costly procedure calling for the development of algorithms and tools to facilitate its automation." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-110", "text": "In this paper we have presented a novel algorithm for learning information content in event summaries." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-111", "text": "The approach is fully unsupervised and based on the application of an iterative algorithm which grows a concept extension step-by-step." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-112", "text": "We have also proposed an instantiation of the algorithm and demonstrated its applicability to learning conceptual information in three different domains and two languages." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-113", "text": "We have obtained encouraging results, with the procedure able to model the main conceptual information in the summaries with lenient F-scores ranging from 0.25 to 0.66 F-scores depending on the language and domain." }, { "sent_id": "7666d5a8e05e79f68ec60e47cddecd-C001-114", "text": "There are, however, a number of avenues that should be further explored such as the use of a richer document representation based on syntactic information and the development of additional procedures to improve instance boundary recognition." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "7666d5a8e05e79f68ec60e47cddecd-C001-23" ] ], "cite_sentences": [ "7666d5a8e05e79f68ec60e47cddecd-C001-23" ] }, "@DIF@": { "gold_contexts": [ [ "7666d5a8e05e79f68ec60e47cddecd-C001-106" ] ], "cite_sentences": [ "7666d5a8e05e79f68ec60e47cddecd-C001-106" ] } } }, "ABC_2fdfa1b36fcf0d77826c96101ac428_48": { "x": [ { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-2", "text": "Many Natural Language Processing (NLP) tasks (including generation, language grounding, reasoning, information extraction, coreference resolution, and dialog) can be formulated as deep reinforcement learning (DRL) problems." }, { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-3", "text": "However, since language is often discrete and the space for all sentences is infinite, there are many challenges for formulating reinforcement learning problems of NLP tasks." }, { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-4", "text": "In this tutorial, we provide a gentle introduction to the foundation of deep reinforcement learning, as well as some practical DRL solutions in NLP." }, { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-5", "text": "We describe recent advances in designing deep reinforcement learning for NLP, with a special focus on generation, dialogue, and information extraction." }, { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-6", "text": "We discuss why they succeed, and when they may fail, aiming at providing some practical advice about deep reinforcement learning for solving real-world NLP problems." }, { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-7", "text": "Deep Reinforcement Learning (DRL) (Mnih et al., 2015) is an emerging research area that involves intelligent agents that learn to reason in Markov Decision Processes (MDP)." 
}, { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-8", "text": "Recently, DRL has achieved many stunning breakthroughs in Atari games (Mnih et al., 2013) and the Go game (Silver et al., 2016) ." }, { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-9", "text": "In addition, DRL methods have gained significantly more attention in NLP in recent years, because many NLP tasks can be formulated as DRL problems that involve incremental decision making." }, { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-10", "text": "DRL methods could easily combine embedding-based representation learning with reasoning, and optimize for a variety of non-differentiable rewards." }, { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-11", "text": "However, a key challenge for applying deep reinforcement learning techniques to real-world sized NLP problems is the model design issue." }, { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-12", "text": "This tutorial draws connections from theories of deep reinforcement learning to practical applications in NLP." }, { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-13", "text": "In particular, we start with a gentle introduction to the fundamentals of reinforcement learning (Sutton and Barto, 1998; Sutton et al., 2000) ." }, { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-14", "text": "We further discuss several critical issues in DRL solutions for NLP tasks, including (1) The efficient and practical design of the action space, state space, and reward functions; (2) The trade-off between exploration and exploitation; and (3) The goal of incorporating linguistic structures in DRL." }, { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-15", "text": "To address the model design issue, we discuss several recent solutions (He et al., 2016b; Li et al., 2016; Xiong et al., 2017) ."
}, { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-16", "text": "We then focus on a new case study of hierarchical deep reinforcement learning for video captioning (Wang et al., 2018b) , discussing the techniques of leveraging hierarchies in DRL for NLP generation problems." }, { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-17", "text": "This tutorial aims at introducing deep reinforcement learning methods to researchers in the NLP community." }, { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-18", "text": "We do not assume any particular prior knowledge in reinforcement learning." }, { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-19", "text": "The intended length of the tutorial is 3 hours, including a coffee break." }, { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-20", "text": "Representation Learning, Reasoning (Learning to Search), and Scalability are three closely related research subjects in Natural Language Processing." }, { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-21", "text": "In this tutorial, we touch on the intersection of all three research subjects, covering various aspects of the theories of modern deep reinforcement learning methods, and show their successful applications in NLP." }, { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-22", "text": "This tutorial is organized in three parts:" }, { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-23", "text": "\u2022 Foundations of Deep Reinforcement Learning."
}, { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-24", "text": "First, we will provide a brief overview" }, { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-25", "text": "----------------------------------" }, { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-26", "text": "**TUTORIAL DESCRIPTION**" }, { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-27", "text": "Deep Reinforcement Learning (DRL) (Mnih et al., 2015) is an emerging research area that involves intelligent agents that learn to reason in Markov Decision Processes (MDP)." }, { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-28", "text": "Recently, DRL has achieved many stunning breakthroughs in Atari games (Mnih et al., 2013) and the Go game (Silver et al., 2016) ." }, { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-29", "text": "In addition, DRL methods have gained significantly more attention in NLP in recent years, because many NLP tasks can be formulated as DRL problems that involve incremental decision making." }, { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-30", "text": "DRL methods could easily combine embedding-based representation learning with reasoning, and optimize for a variety of non-differentiable rewards." }, { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-31", "text": "However, a key challenge for applying deep reinforcement learning techniques to real-world sized NLP problems is the model design issue." }, { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-32", "text": "This tutorial draws connections from theories of deep reinforcement learning to practical applications in NLP." }, { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-33", "text": "In particular, we start with a gentle introduction to the fundamentals of reinforcement learning (Sutton and Barto, 1998; Sutton et al., 2000) ."
}, { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-34", "text": "We further discuss their modern deep learning extensions such as Deep QNetworks (Mnih et al., 2015) , Policy Networks (Silver et al., 2016) , and Deep Hierarchical Reinforcement Learning (Kulkarni et al., 2016) ." }, { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-35", "text": "We outline the applications of deep reinforcement learning in NLP, including dialog (Li et al., 2016) , semi-supervised text classification (Wu et al., 2018) , coreference (Clark and Manning, 2016; Yin et al., 2018) , knowledge graph reasoning (Xiong et al., 2017 ), text games (Narasimhan et al., 2015; He et al., 2016a) , social media (He et al., 2016b; Zhou and Wang, 2018) , information extraction (Narasimhan et al., 2016; Qin et al., 2018) , language and vision (Pasunuru and Bansal, 2017; Misra et al., 2017; Wang et al., 2018a,b,c; Xiong et al., 2018) , etc." }, { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-36", "text": "We further discuss several critical issues in DRL solutions for NLP tasks, including (1) The efficient and practical design of the action space, state space, and reward functions; (2) The trade-off between exploration and exploitation; and (3) The goal of incorporating linguistic structures in DRL." }, { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-37", "text": "To address the model design issue, we discuss several recent solutions (He et al., 2016b; Li et al., 2016; Xiong et al., 2017) ." }, { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-38", "text": "We then focus on a new case study of hierarchical deep reinforcement learning for video captioning (Wang et al., 2018b) , discussing the techniques of leveraging hierarchies in DRL for NLP generation problems." }, { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-39", "text": "This tutorial aims at introducing deep reinforcement learning methods to researchers in the NLP community." 
}, { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-40", "text": "We do not assume any particular prior knowledge in reinforcement learning." }, { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-41", "text": "The intended length of the tutorial is 3 hours, including a coffee break." }, { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-42", "text": "----------------------------------" }, { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-43", "text": "**OUTLINE**" }, { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-44", "text": "Representation Learning, Reasoning (Learning to Search), and Scalability are three closely related research subjects in Natural Language Processing." }, { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-45", "text": "In this tutorial, we touch on the intersection of all three research subjects, covering various aspects of the theories of modern deep reinforcement learning methods, and show their successful applications in NLP." }, { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-46", "text": "This tutorial is organized in three parts:" }, { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-47", "text": "\u2022 Foundations of Deep Reinforcement Learning." }, { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-48", "text": "First, we will provide a brief overview of reinforcement learning (RL), and discuss the classic settings in RL." }, { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-49", "text": "We describe classic methods such as Markov Decision Processes, REINFORCE (Williams, 1992) , and Q-learning (Watkins, 1989)." }, { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-50", "text": "We introduce model-free and model-based reinforcement learning approaches, and the widely used policy gradient methods." }, { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-51", "text": "In this part, we also introduce the modern renovation of deep reinforcement learning (Mnih et al., 2015) , with a focus on games (Mnih et al., 2013; Silver et al., 2016) ."
}, { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-52", "text": "\u2022 Practical Deep Reinforcement Learning: Case Studies in NLP Second, we will focus on designing practical DRL models for NLP tasks." }, { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-53", "text": "In particular, we will take the first deep reinforcement learning solution for dialogue (Li et al., 2016) as a case study." }, { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-54", "text": "We describe the main contributions of this work, including its design of the reward functions and why they are necessary for dialog." }, { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-55", "text": "We then introduce the gigantic action space issue for deep Q-learning in NLP (He et al., 2016a,b) , including several solutions." }, { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-56", "text": "To conclude this part, we discuss interesting applications of DRL in NLP, including information extraction and reasoning." }, { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-57", "text": "\u2022 Lessons Learned, Future Directions, and Practical Advice for DRL in NLP Third, we switch from the theoretical presentations to an interactive demonstration and discussion session: we aim at providing an interactive session to transfer the theories of DRL into practical insights." }, { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-58", "text": "More specifically, we will discuss three important issues, including problem formulation/model design, exploration vs. exploitation, and the integration of linguistic structures in DRL." }, { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-59", "text": "We will showcase a recent study (Wang et al., 2018b) that leverages hierarchical deep reinforcement learning for language and vision, and extend the discussion." }, { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-60", "text": "Practical advice, including programming advice, will be provided as a part of the demonstration."
}, { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-61", "text": "----------------------------------" }, { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-62", "text": "**HISTORY**" }, { "sent_id": "2fdfa1b36fcf0d77826c96101ac428-C001-63", "text": "The full content of this tutorial has not yet been presented elsewhere, but some parts of it have been presented at the following locations in recent years:" } ], "y": { "@BACK@": { "gold_contexts": [ [ "2fdfa1b36fcf0d77826c96101ac428-C001-15" ], [ "2fdfa1b36fcf0d77826c96101ac428-C001-35" ], [ "2fdfa1b36fcf0d77826c96101ac428-C001-37" ] ], "cite_sentences": [ "2fdfa1b36fcf0d77826c96101ac428-C001-15", "2fdfa1b36fcf0d77826c96101ac428-C001-35", "2fdfa1b36fcf0d77826c96101ac428-C001-37" ] }, "@DIF@": { "gold_contexts": [ [ "2fdfa1b36fcf0d77826c96101ac428-C001-15", "2fdfa1b36fcf0d77826c96101ac428-C001-16" ], [ "2fdfa1b36fcf0d77826c96101ac428-C001-37", "2fdfa1b36fcf0d77826c96101ac428-C001-38" ] ], "cite_sentences": [ "2fdfa1b36fcf0d77826c96101ac428-C001-15", "2fdfa1b36fcf0d77826c96101ac428-C001-37" ] } } }, "ABC_3477c0225d6a0e55365242d95a3dc9_48": { "x": [ { "sent_id": "3477c0225d6a0e55365242d95a3dc9-C001-15", "text": "----------------------------------" }, { "sent_id": "3477c0225d6a0e55365242d95a3dc9-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "3477c0225d6a0e55365242d95a3dc9-C001-2", "text": "Rescoring approaches for parsing aim to re-rank and change the order of parse trees produced by a general parser for a given sentence." }, { "sent_id": "3477c0225d6a0e55365242d95a3dc9-C001-3", "text": "The re-ranking quality depends on the precision of the rescoring function." }, { "sent_id": "3477c0225d6a0e55365242d95a3dc9-C001-4", "text": "However it is a challenge to design an appropriate function to determine the qualities of parse trees." }, { "sent_id": "3477c0225d6a0e55365242d95a3dc9-C001-5", "text": "No matter which method is used, Treebank is a widely used resource in parsing task."
}, { "sent_id": "3477c0225d6a0e55365242d95a3dc9-C001-6", "text": "Most approaches utilize complex features to re-estimate the tree structures of a given sentence [1, 2, 3] ." }, { "sent_id": "3477c0225d6a0e55365242d95a3dc9-C001-7", "text": "Unfortunately, sizes of treebanks are generally small and insufficient, which results in a common problem of data sparseness." }, { "sent_id": "3477c0225d6a0e55365242d95a3dc9-C001-8", "text": "Learning knowledge from analyzing large-scaled unlabeled data is compulsory and proved useful in the previous works [4, 5, 6] ." }, { "sent_id": "3477c0225d6a0e55365242d95a3dc9-C001-9", "text": "How to extract useful information from unannotated large scale corpus has been a research issue." }, { "sent_id": "3477c0225d6a0e55365242d95a3dc9-C001-10", "text": "Word embeddings have become increasingly popular lately, proving to be valuable as a source of features in a broad range of NLP tasks [7, 8, 9 ]." }, { "sent_id": "3477c0225d6a0e55365242d95a3dc9-C001-11", "text": "The word2vec [10] is among the most widely used word embedding models today." }, { "sent_id": "3477c0225d6a0e55365242d95a3dc9-C001-12", "text": "Their success is largely due to an efficient and user-friendly implementation that learns high quality word embeddings from very large corpora." }, { "sent_id": "3477c0225d6a0e55365242d95a3dc9-C001-13", "text": "The word2vec learns low dimensional continuous vector representations for words by considering window-based contexts, i.e., context words within some fixed distance of each side of the target words." 
}, { "sent_id": "3477c0225d6a0e55365242d95a3dc9-C001-14", "text": "Another different context type is dependency-based word embedding [11, 12, 13] , which considers syntactic contexts rather" }, { "sent_id": "3477c0225d6a0e55365242d95a3dc9-C001-16", "text": "****" }, { "sent_id": "3477c0225d6a0e55365242d95a3dc9-C001-17", "text": "knowledge from analyzing large-scaled unlabeled data is compulsory and proved useful in the previous works [4, 5, 6] ." }, { "sent_id": "3477c0225d6a0e55365242d95a3dc9-C001-18", "text": "How to extract useful information from unannotated large scale corpus has been a research issue." }, { "sent_id": "3477c0225d6a0e55365242d95a3dc9-C001-19", "text": "Word embeddings have become increasingly popular lately, proving to be valuable as a source of features in a broad range of NLP tasks [7, 8, 9] ." }, { "sent_id": "3477c0225d6a0e55365242d95a3dc9-C001-20", "text": "The word2vec [10] is among the most widely used word embedding models today." }, { "sent_id": "3477c0225d6a0e55365242d95a3dc9-C001-21", "text": "Their success is largely due to an efficient and user-friendly implementation that learns high quality word embeddings from very large corpora." }, { "sent_id": "3477c0225d6a0e55365242d95a3dc9-C001-22", "text": "The word2vec learns low dimensional continuous vector representations for words by considering window-based contexts, i.e., context words within some fixed distance of each side of the target words." }, { "sent_id": "3477c0225d6a0e55365242d95a3dc9-C001-23", "text": "Another different context type is dependency-based word embedding [11, 12, 13] , which considers syntactic contexts rather" }, { "sent_id": "3477c0225d6a0e55365242d95a3dc9-C001-24", "text": "The 2016 Conference on Computational Linguistics and Speech Processing ROCLING 2016, pp." 
}, { "sent_id": "3477c0225d6a0e55365242d95a3dc9-C001-25", "text": "100-102 \u00a9 The Association for Computational Linguistics and Chinese Language Processing than window contexts in word2vec." }, { "sent_id": "3477c0225d6a0e55365242d95a3dc9-C001-26", "text": "Bansal et al. [8] and Melamud et al. [11] show the benefits of such modified-context embeddings in the dependency parsing task." }, { "sent_id": "3477c0225d6a0e55365242d95a3dc9-C001-27", "text": "The dependency-based word embedding can relieve the problem of data sparseness, since even without occurrence of dependency word pairs in a corpus, dependency scores can be still calculated by word embeddings [12] ." }, { "sent_id": "3477c0225d6a0e55365242d95a3dc9-C001-28", "text": "In this paper, we propose a rescoring approach for parsing, based on a combination of original parsing scores and dependency word embedding scores to assist the determination of the best parse tree among the n-best parse trees." }, { "sent_id": "3477c0225d6a0e55365242d95a3dc9-C001-29", "text": "There are three main steps in our rescoring approach." }, { "sent_id": "3477c0225d6a0e55365242d95a3dc9-C001-30", "text": "The first step is to have the parser produce n-best parse trees with their structural scores." }, { "sent_id": "3477c0225d6a0e55365242d95a3dc9-C001-31", "text": "Each parse tree includes words, part-of-speech (PoS) tags, and semantic role labels." }, { "sent_id": "3477c0225d6a0e55365242d95a3dc9-C001-32", "text": "Second, we extract word-to-word associations (also called word dependencies; a dependency implies a close association between words from either a syntactic or semantic perspective) from large amounts of auto-parsed data and adopt word2vecf [13] to train dependency-based word embeddings." }, { "sent_id": "3477c0225d6a0e55365242d95a3dc9-C001-33", "text": "The last step is to build a structural rescoring method to find the best tree structure from the n-best candidates."
}, { "sent_id": "3477c0225d6a0e55365242d95a3dc9-C001-34", "text": "We conduct experiments on the standard data sets of the Chinese Treebank." }, { "sent_id": "3477c0225d6a0e55365242d95a3dc9-C001-35", "text": "We also study how different types of embeddings influence rescoring, including word, word with semantic role labels, and word senses (concepts)." }, { "sent_id": "3477c0225d6a0e55365242d95a3dc9-C001-36", "text": "Experimental results show that using semantic role labels in dependency embeddings has the best performance, and the final experimental results indicate that our proposed approach outperforms the best parser in Chinese." }, { "sent_id": "3477c0225d6a0e55365242d95a3dc9-C001-37", "text": "Furthermore, we attempt to compare the performance of using the traditional conditional probability method with our approach." }, { "sent_id": "3477c0225d6a0e55365242d95a3dc9-C001-38", "text": "From the experimental results, the embedding scores can relieve the data sparseness problem and have better results than the traditional approach." }, { "sent_id": "3477c0225d6a0e55365242d95a3dc9-C001-39", "text": "Keywords: Word Embeddings, Parsing, Word Dependency, Rescoring." } ], "y": { "@BACK@": { "gold_contexts": [ [ "3477c0225d6a0e55365242d95a3dc9-C001-19" ] ], "cite_sentences": [ "3477c0225d6a0e55365242d95a3dc9-C001-19" ] } } }, "ABC_57af9690eb41ff3f9217da6138425f_48": { "x": [ { "sent_id": "57af9690eb41ff3f9217da6138425f-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "57af9690eb41ff3f9217da6138425f-C001-2", "text": "Work on centering (Grosz, Joshi, and Weinstein 1995) has had a very strong impact both on research in discourse and on the development of systems for computational discourse processing." }, { "sent_id": "57af9690eb41ff3f9217da6138425f-C001-3", "text": "Early papers such as those of Grosz, Joshi, and Weinstein (1983) and Brennan, Friedman, and Pollard (1987) have inspired a number of studies that explore or use centering as an important tool in discourse processing."
}, { "sent_id": "57af9690eb41ff3f9217da6138425f-C001-4", "text": "A number of psycholinguistic and cross-language projects have reported promising results." }, { "sent_id": "57af9690eb41ff3f9217da6138425f-C001-5", "text": "On the other hand, one of the main ideas of centering theory, that certain entities mentioned in an utterance are more central than others and that this imposes certain constraints on the use of referring expressions and in particular on the use of pronouns, has been extensively used in anaphora resolution." }, { "sent_id": "57af9690eb41ff3f9217da6138425f-C001-6", "text": "This volume (resulting from a Workshop on Centering Theory held at the University of Pennsylvania) represents an important follow-up to the work on centering carried out in the late eighties and early nineties, and in this context it is a timely collection of a number of interesting contributions." }, { "sent_id": "57af9690eb41ff3f9217da6138425f-C001-7", "text": "The authors are well known for their contributions to the theory of centering and to other areas of computational linguistics." }, { "sent_id": "57af9690eb41ff3f9217da6138425f-C001-8", "text": "The book features a clear and well-organized introduction to centering (Walker, Joshi, Prince) and groups the subsequent chapters thematically into six parts." }, { "sent_id": "57af9690eb41ff3f9217da6138425f-C001-9", "text": "Part 1 discusses issues such as the complexity of inference in discourse (Joshi and Weinstein), summarizes the main goals of the original research on global focusing and centering, and outlines unsolved problems (Grosz and Sidner)." }, { "sent_id": "57af9690eb41ff3f9217da6138425f-C001-10", "text": "Part 2 covers utterance-level issues, in particular the ranking of forward-looking centers (Cote; Hudson-D'Zmura) and proposes an extension of the original intersentential centering model in order to be able to process complex sentences consisting of multiple clauses (Kameyama)." 
}, { "sent_id": "57af9690eb41ff3f9217da6138425f-C001-11", "text": "Part 3 focuses on the cross-linguistic phenomenon of centering and looks at centering in Italian (Di Eugenio), the ranking of forward-looking centers in Turkish (Turan), and discourse coherence and center shifting in Japanese (Iida)." }, { "sent_id": "57af9690eb41ff3f9217da6138425f-C001-12", "text": "Part 4 investigates the role of centering in processing models of discourse." }, { "sent_id": "57af9690eb41ff3f9217da6138425f-C001-13", "text": "It reports on an attempt to integrate centering theory and the givenness hierarchy (Gundel)." }, { "sent_id": "57af9690eb41ff3f9217da6138425f-C001-14", "text": "It also discusses the role of the center in assigning antecedents to ambiguous pronouns (Hudson-D'Zmura and Tanenhaus) and presents a psycholinguistic perspective on centering as a resource by which speakers and addressees can coordinate their attention moment by moment as they refer to things (Brennan)." }, { "sent_id": "57af9690eb41ff3f9217da6138425f-C001-15", "text": "----------------------------------" }, { "sent_id": "57af9690eb41ff3f9217da6138425f-C001-16", "text": "****" }, { "sent_id": "57af9690eb41ff3f9217da6138425f-C001-17", "text": "Part 5 is dedicated to information structure and centering." }, { "sent_id": "57af9690eb41ff3f9217da6138425f-C001-18", "text": "It covers issues such as word order and centering in Turkish (Hoffman), centering transitions as a measure of discourse coherence (Hurewitz), the relationship between centering, global focus, and right-dislocation (Grosz and Ziv) , and recency effects in English inversion (Birner)." 
}, { "sent_id": "57af9690eb41ff3f9217da6138425f-C001-19", "text": "Part 6 covers discourse structure and centering, commenting on open questions such as how local utterance processing relates to the global discourse context and how centering interacts with the global context in constraining the surface form of referring expressions other than pronouns (Passonneau)." }, { "sent_id": "57af9690eb41ff3f9217da6138425f-C001-20", "text": "Part 6 also looks at the place of centering in a general theory of anaphora resolution and proposes a new hybrid theory that integrates Grosz and Sidner's discourse structure theory with a dynamic theory of semantic interpretation (Roberts)." }, { "sent_id": "57af9690eb41ff3f9217da6138425f-C001-21", "text": "Finally in the last chapter of Part 6, it is argued that the restriction of centering to operate within a discourse segment should be abandoned in favor of a new model integrating centering and the global discourse structure and to this end it is proposed that a model of attentional state, the cache model, be integrated with the centering algorithm (Walker)." }, { "sent_id": "57af9690eb41ff3f9217da6138425f-C001-22", "text": "Centering has proved to be a powerful tool for accounting for discourse coherence and has been used successfully in anaphora resolution; however, as with every theory in linguistics, it has its limitations." }, { "sent_id": "57af9690eb41ff3f9217da6138425f-C001-23", "text": "Some chapters suggest extensions of or amendments to the centering theory with a view to achieving a more comprehensive and successful model (e.g., the chapters by Kameyama, Roberts, and Walker)." 
}, { "sent_id": "57af9690eb41ff3f9217da6138425f-C001-24", "text": "Ideally, in addition to papers such as Kameyama's and Walker's, this collection could perhaps also have featured extended versions of papers, such as those of Kehler (1997) and Hahn and Strube (1997) , that highlight certain weaknesses of the original centering model or suggest extensions or alternative solutions (Strube 1998) ." }, { "sent_id": "57af9690eb41ff3f9217da6138425f-C001-25", "text": "It must be acknowledged here that the production schedule of this volume may have been a factor in not including some of this recent work, and also that space limits might not have allowed all possible areas of centering to be covered." }, { "sent_id": "57af9690eb41ff3f9217da6138425f-C001-26", "text": "These are, however, minor points; the volume is a very well edited collection of excellent papers, and I recommend it unreservedly as a valuable source to anyone interested in centering, discourse, or computational linguistics in general." } ], "y": { "@BACK@": { "gold_contexts": [ [ "57af9690eb41ff3f9217da6138425f-C001-24" ] ], "cite_sentences": [ "57af9690eb41ff3f9217da6138425f-C001-24" ] } } }, "ABC_0526911ab71c85bfa4a20b630f34ae_48": { "x": [ { "sent_id": "0526911ab71c85bfa4a20b630f34ae-C001-2", "text": "Searching online information is increasingly a daily activity for many people." }, { "sent_id": "0526911ab71c85bfa4a20b630f34ae-C001-26", "text": "The resulting search engine supports sub-second query response." }, { "sent_id": "0526911ab71c85bfa4a20b630f34ae-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "0526911ab71c85bfa4a20b630f34ae-C001-3", "text": "The multilinguality of online content is also increasing (e.g. the proportion of English web users, which has been decreasing as a fraction of the increasing population of web users, dipped below 50% in the summer of 2001)."
}, { "sent_id": "0526911ab71c85bfa4a20b630f34ae-C001-4", "text": "To improve the ability of an English speaker to search multilingual content, we built a system that supports cross-lingual search of an Arabic newswire collection and provides on demand translation of Arabic web pages into English." }, { "sent_id": "0526911ab71c85bfa4a20b630f34ae-C001-5", "text": "The cross-lingual search engine supports a fast search capability (sub-second response for typical queries) and achieves state-of-the-art performance in the high precision region of the result list." }, { "sent_id": "0526911ab71c85bfa4a20b630f34ae-C001-6", "text": "The on demand statistical machine translation uses the Direct Translation model along with a novel statistical Arabic Morphological Analyzer to yield state-of-the-art translation quality." }, { "sent_id": "0526911ab71c85bfa4a20b630f34ae-C001-7", "text": "The on demand SMT uses an efficient dynamic programming decoder that achieves reasonable speed for translating web documents." }, { "sent_id": "0526911ab71c85bfa4a20b630f34ae-C001-8", "text": "----------------------------------" }, { "sent_id": "0526911ab71c85bfa4a20b630f34ae-C001-9", "text": "**OVERVIEW**" }, { "sent_id": "0526911ab71c85bfa4a20b630f34ae-C001-10", "text": "Morphologically rich languages like Arabic (Beesley, K. 1996 ) present significant challenges to many natural language processing applications such as the one described above, because a word often conveys complex meanings decomposable into several morphemes (i.e. prefix, stem, suffix) ." }, { "sent_id": "0526911ab71c85bfa4a20b630f34ae-C001-11", "text": "By segmenting words into morphemes, we can improve the performance of natural language systems including machine translation (Brown et al. 1993 ) and information retrieval (Franz, M. and McCarley, S. 2002) ."
}, { "sent_id": "0526911ab71c85bfa4a20b630f34ae-C001-12", "text": "In this paper, we present a cross-lingual English-Arabic search engine combined with an on demand Arabic-English statistical machine translation system that relies on source language analysis for both improved search and translation." }, { "sent_id": "0526911ab71c85bfa4a20b630f34ae-C001-13", "text": "We developed novel statistical learning algorithms for performing Arabic word segmentation (Lee, Y. et al 2003) into morphemes and morphological source language (Arabic) analysis (Lee, Y. et al 2003b) ." }, { "sent_id": "0526911ab71c85bfa4a20b630f34ae-C001-14", "text": "These components improve both monolingual (Arabic) search and cross-lingual (English-Arabic) search and machine translation." }, { "sent_id": "0526911ab71c85bfa4a20b630f34ae-C001-15", "text": "In addition, the system supports either document translation or convolutional models for cross-lingual search (Franz, M. and McCarley, S. 2002) ." }, { "sent_id": "0526911ab71c85bfa4a20b630f34ae-C001-16", "text": "The overall demonstration has the following major components:" }, { "sent_id": "0526911ab71c85bfa4a20b630f34ae-C001-17", "text": "1. Mono-lingual search: uses Arabic word segmentation and an Okapi-like search engine for document ranking." }, { "sent_id": "0526911ab71c85bfa4a20b630f34ae-C001-18", "text": "2. Cross-lingual search: uses Arabic word segmentation and morphological analysis along with a statistical morpheme translation matrix in a convolutional model for document ranking." }, { "sent_id": "0526911ab71c85bfa4a20b630f34ae-C001-19", "text": "The search can also use document translation into English to rank the Arabic documents." }, { "sent_id": "0526911ab71c85bfa4a20b630f34ae-C001-20", "text": "Both approaches achieve similar precision in the high precision region of retrieval." }, { "sent_id": "0526911ab71c85bfa4a20b630f34ae-C001-21", "text": "The English query is also morphologically analyzed to improve performance."
}, { "sent_id": "0526911ab71c85bfa4a20b630f34ae-C001-22", "text": "3." }, { "sent_id": "0526911ab71c85bfa4a20b630f34ae-C001-23", "text": "OnDemand statistical machine translation: this component uses both analysis components along with a direct channel translation model with a fast dynamic programming decoder (Tillmann, C. 2003) ." }, { "sent_id": "0526911ab71c85bfa4a20b630f34ae-C001-24", "text": "This system All of the above functionality is available through a web browser." }, { "sent_id": "0526911ab71c85bfa4a20b630f34ae-C001-25", "text": "We indexed the Arabic AFP corpus about 330k documents for the demonstration." }, { "sent_id": "0526911ab71c85bfa4a20b630f34ae-C001-27", "text": "We also provide an html detagging capability that allows the translation of Arabic web pages while trying to preserve the original layout as much as possible in the on demand SMT component." }, { "sent_id": "0526911ab71c85bfa4a20b630f34ae-C001-28", "text": "The Arabic Name Entity Tagger is currently run as an offline process but we expect to have it online by the demonstration time." }, { "sent_id": "0526911ab71c85bfa4a20b630f34ae-C001-29", "text": "We aslo include two screen shots of the demonstration system." } ], "y": { "@BACK@": { "gold_contexts": [ [ "0526911ab71c85bfa4a20b630f34ae-C001-10" ] ], "cite_sentences": [ "0526911ab71c85bfa4a20b630f34ae-C001-10" ] }, "@EXT@": { "gold_contexts": [ [ "0526911ab71c85bfa4a20b630f34ae-C001-10", "0526911ab71c85bfa4a20b630f34ae-C001-11" ] ], "cite_sentences": [ "0526911ab71c85bfa4a20b630f34ae-C001-10" ] } } }, "ABC_41bd8c692ac513b8a9cabbd5aafbda_48": { "x": [ { "sent_id": "41bd8c692ac513b8a9cabbd5aafbda-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "41bd8c692ac513b8a9cabbd5aafbda-C001-2", "text": "Rescoring approaches for parsing aim to re-rank and change the order of parse trees produced by a general parser for a given sentence." 
}, { "sent_id": "41bd8c692ac513b8a9cabbd5aafbda-C001-3", "text": "The re-ranking quality depends on the precision of the rescoring function." }, { "sent_id": "41bd8c692ac513b8a9cabbd5aafbda-C001-4", "text": "However it is a challenge to design an appropriate function to determine the qualities of parse trees." }, { "sent_id": "41bd8c692ac513b8a9cabbd5aafbda-C001-5", "text": "No matter which method is used, Treebank is a widely used resource in parsing task." }, { "sent_id": "41bd8c692ac513b8a9cabbd5aafbda-C001-6", "text": "Most approaches utilize complex features to re-estimate the tree structures of a given sentence [1, 2, 3] ." }, { "sent_id": "41bd8c692ac513b8a9cabbd5aafbda-C001-7", "text": "Unfortunately, sizes of treebanks are generally small and insufficient, which results in a common problem of data sparseness." }, { "sent_id": "41bd8c692ac513b8a9cabbd5aafbda-C001-8", "text": "Learning knowledge from analyzing large-scaled unlabeled data is compulsory and proved useful in the previous works [4, 5, 6] ." }, { "sent_id": "41bd8c692ac513b8a9cabbd5aafbda-C001-9", "text": "How to extract useful information from unannotated large scale corpus has been a research issue." }, { "sent_id": "41bd8c692ac513b8a9cabbd5aafbda-C001-10", "text": "Word embeddings have become increasingly popular lately, proving to be valuable as a source of features in a broad range of NLP tasks [7, 8, 9 ]." }, { "sent_id": "41bd8c692ac513b8a9cabbd5aafbda-C001-11", "text": "The word2vec [10] is among the most widely used word embedding models today." }, { "sent_id": "41bd8c692ac513b8a9cabbd5aafbda-C001-12", "text": "Their success is largely due to an efficient and user-friendly implementation that learns high quality word embeddings from very large corpora." 
}, { "sent_id": "41bd8c692ac513b8a9cabbd5aafbda-C001-13", "text": "The word2vec learns low dimensional continuous vector representations for words by considering window-based contexts, i.e., context words within some fixed distance of each side of the target words." }, { "sent_id": "41bd8c692ac513b8a9cabbd5aafbda-C001-14", "text": "Another different context type is dependency-based word embedding [11, 12, 13] , which considers syntactic contexts rather" }, { "sent_id": "41bd8c692ac513b8a9cabbd5aafbda-C001-15", "text": "----------------------------------" }, { "sent_id": "41bd8c692ac513b8a9cabbd5aafbda-C001-16", "text": "****" }, { "sent_id": "41bd8c692ac513b8a9cabbd5aafbda-C001-17", "text": "knowledge from analyzing large-scaled unlabeled data is compulsory and proved useful in the previous works [4, 5, 6] ." }, { "sent_id": "41bd8c692ac513b8a9cabbd5aafbda-C001-18", "text": "How to extract useful information from unannotated large scale corpus has been a research issue." }, { "sent_id": "41bd8c692ac513b8a9cabbd5aafbda-C001-19", "text": "Word embeddings have become increasingly popular lately, proving to be valuable as a source of features in a broad range of NLP tasks [7, 8, 9] ." }, { "sent_id": "41bd8c692ac513b8a9cabbd5aafbda-C001-20", "text": "The word2vec [10] is among the most widely used word embedding models today." }, { "sent_id": "41bd8c692ac513b8a9cabbd5aafbda-C001-21", "text": "Their success is largely due to an efficient and user-friendly implementation that learns high quality word embeddings from very large corpora." }, { "sent_id": "41bd8c692ac513b8a9cabbd5aafbda-C001-22", "text": "The word2vec learns low dimensional continuous vector representations for words by considering window-based contexts, i.e., context words within some fixed distance of each side of the target words." 
}, { "sent_id": "41bd8c692ac513b8a9cabbd5aafbda-C001-23", "text": "Another different context type is dependency-based word embedding [11, 12, 13] , which considers syntactic contexts rather" }, { "sent_id": "41bd8c692ac513b8a9cabbd5aafbda-C001-24", "text": "The 2016 Conference on Computational Linguistics and Speech Processing ROCLING 2016, pp." }, { "sent_id": "41bd8c692ac513b8a9cabbd5aafbda-C001-25", "text": "100-102 \uf0d3 The Association for Computational Linguistics and Chinese Language Processing 100 than window contexts in word2vec." }, { "sent_id": "41bd8c692ac513b8a9cabbd5aafbda-C001-26", "text": "Bansal et al. [8] and Melamud et al. [11] show the benefits of such modified-context embeddings in dependency parsing task." }, { "sent_id": "41bd8c692ac513b8a9cabbd5aafbda-C001-27", "text": "The dependency-based word embedding can relieve the problem of data sparseness, since even without occurrence of dependency word pairs in a corpus, dependency scores can be still calculated by word embeddings [12] ." }, { "sent_id": "41bd8c692ac513b8a9cabbd5aafbda-C001-28", "text": "In this paper, we proposed a rescoring approach for parsing, based on a combination of original parsing scores and dependency word embedding scores to assist the determination of the best parse tree among the n-best parse trees." }, { "sent_id": "41bd8c692ac513b8a9cabbd5aafbda-C001-29", "text": "There are three main steps in our rescoring approach." }, { "sent_id": "41bd8c692ac513b8a9cabbd5aafbda-C001-30", "text": "The first step is to have the parser to produce n-best parse trees with their structural scores." }, { "sent_id": "41bd8c692ac513b8a9cabbd5aafbda-C001-31", "text": "For each parsed tree including words, part-of-speech (PoS) and semantic role labels." 
}, { "sent_id": "41bd8c692ac513b8a9cabbd5aafbda-C001-32", "text": "Second, we extract word-to-word associations (or called word dependency, a dependency implies its close association with other words in either syntactic or semantic perspective) from large amounts of auto-parsed data and adopt word2vecf [13] to train dependency-based word embeddings." }, { "sent_id": "41bd8c692ac513b8a9cabbd5aafbda-C001-33", "text": "The last step is to build a structural rescoring method to find the best tree structure from the n-best candidates." }, { "sent_id": "41bd8c692ac513b8a9cabbd5aafbda-C001-34", "text": "We conduct experiments on the standard data sets of the Chinese Treebank." }, { "sent_id": "41bd8c692ac513b8a9cabbd5aafbda-C001-35", "text": "We also study how different types of embeddings influence on rescoring, including word, word with semantic role labels, and word senses (concepts)." }, { "sent_id": "41bd8c692ac513b8a9cabbd5aafbda-C001-36", "text": "Experimental results show that using semantic role labels in dependency embeddings has best performance. And the final experiments results indicate that our proposed approach outperforms the best parser in Chinese." }, { "sent_id": "41bd8c692ac513b8a9cabbd5aafbda-C001-37", "text": "Furthermore we attempt to compare the performance of using the traditional conditional probability method with our approach." }, { "sent_id": "41bd8c692ac513b8a9cabbd5aafbda-C001-38", "text": "From the experimental results, the embedding scores can relax data sparseness problem and have better results than the traditional approach." }, { "sent_id": "41bd8c692ac513b8a9cabbd5aafbda-C001-39", "text": "Keywords: Word Embeddings, Parsing, Word Dependency, Rescoring." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "41bd8c692ac513b8a9cabbd5aafbda-C001-11" ], [ "41bd8c692ac513b8a9cabbd5aafbda-C001-20" ] ], "cite_sentences": [ "41bd8c692ac513b8a9cabbd5aafbda-C001-11", "41bd8c692ac513b8a9cabbd5aafbda-C001-20" ] } } }, "ABC_51575bb1ffb066d9570551f3347622_48": { "x": [ { "sent_id": "51575bb1ffb066d9570551f3347622-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "51575bb1ffb066d9570551f3347622-C001-2", "text": "Rescoring approaches for parsing aim to re-rank and change the order of parse trees produced by a general parser for a given sentence." }, { "sent_id": "51575bb1ffb066d9570551f3347622-C001-3", "text": "The re-ranking quality depends on the precision of the rescoring function." }, { "sent_id": "51575bb1ffb066d9570551f3347622-C001-4", "text": "However it is a challenge to design an appropriate function to determine the qualities of parse trees." }, { "sent_id": "51575bb1ffb066d9570551f3347622-C001-5", "text": "No matter which method is used, Treebank is a widely used resource in parsing task." }, { "sent_id": "51575bb1ffb066d9570551f3347622-C001-6", "text": "Most approaches utilize complex features to re-estimate the tree structures of a given sentence [1, 2, 3] ." }, { "sent_id": "51575bb1ffb066d9570551f3347622-C001-7", "text": "Unfortunately, sizes of treebanks are generally small and insufficient, which results in a common problem of data sparseness." }, { "sent_id": "51575bb1ffb066d9570551f3347622-C001-8", "text": "Learning knowledge from analyzing large-scaled unlabeled data is compulsory and proved useful in the previous works [4, 5, 6] ." }, { "sent_id": "51575bb1ffb066d9570551f3347622-C001-9", "text": "How to extract useful information from unannotated large scale corpus has been a research issue." }, { "sent_id": "51575bb1ffb066d9570551f3347622-C001-10", "text": "Word embeddings have become increasingly popular lately, proving to be valuable as a source of features in a broad range of NLP tasks [7, 8, 9 ]." 
}, { "sent_id": "51575bb1ffb066d9570551f3347622-C001-11", "text": "The word2vec [10] is among the most widely used word embedding models today." }, { "sent_id": "51575bb1ffb066d9570551f3347622-C001-12", "text": "Their success is largely due to an efficient and user-friendly implementation that learns high quality word embeddings from very large corpora." }, { "sent_id": "51575bb1ffb066d9570551f3347622-C001-13", "text": "The word2vec learns low dimensional continuous vector representations for words by considering window-based contexts, i.e., context words within some fixed distance of each side of the target words." }, { "sent_id": "51575bb1ffb066d9570551f3347622-C001-14", "text": "Another different context type is dependency-based word embedding [11, 12, 13] , which considers syntactic contexts rather" }, { "sent_id": "51575bb1ffb066d9570551f3347622-C001-15", "text": "----------------------------------" }, { "sent_id": "51575bb1ffb066d9570551f3347622-C001-16", "text": "****" }, { "sent_id": "51575bb1ffb066d9570551f3347622-C001-17", "text": "knowledge from analyzing large-scaled unlabeled data is compulsory and proved useful in the previous works [4, 5, 6] ." }, { "sent_id": "51575bb1ffb066d9570551f3347622-C001-18", "text": "How to extract useful information from unannotated large scale corpus has been a research issue." }, { "sent_id": "51575bb1ffb066d9570551f3347622-C001-19", "text": "Word embeddings have become increasingly popular lately, proving to be valuable as a source of features in a broad range of NLP tasks [7, 8, 9] ." }, { "sent_id": "51575bb1ffb066d9570551f3347622-C001-20", "text": "The word2vec [10] is among the most widely used word embedding models today." }, { "sent_id": "51575bb1ffb066d9570551f3347622-C001-21", "text": "Their success is largely due to an efficient and user-friendly implementation that learns high quality word embeddings from very large corpora." 
}, { "sent_id": "51575bb1ffb066d9570551f3347622-C001-22", "text": "The word2vec learns low dimensional continuous vector representations for words by considering window-based contexts, i.e., context words within some fixed distance of each side of the target words." }, { "sent_id": "51575bb1ffb066d9570551f3347622-C001-23", "text": "Another different context type is dependency-based word embedding [11, 12, 13] , which considers syntactic contexts rather" }, { "sent_id": "51575bb1ffb066d9570551f3347622-C001-24", "text": "The 2016 Conference on Computational Linguistics and Speech Processing ROCLING 2016, pp." }, { "sent_id": "51575bb1ffb066d9570551f3347622-C001-25", "text": "100-102 \uf0d3 The Association for Computational Linguistics and Chinese Language Processing 100 than window contexts in word2vec." }, { "sent_id": "51575bb1ffb066d9570551f3347622-C001-26", "text": "Bansal et al. [8] and Melamud et al. [11] show the benefits of such modified-context embeddings in dependency parsing task." }, { "sent_id": "51575bb1ffb066d9570551f3347622-C001-27", "text": "The dependency-based word embedding can relieve the problem of data sparseness, since even without occurrence of dependency word pairs in a corpus, dependency scores can be still calculated by word embeddings [12] ." }, { "sent_id": "51575bb1ffb066d9570551f3347622-C001-28", "text": "In this paper, we proposed a rescoring approach for parsing, based on a combination of original parsing scores and dependency word embedding scores to assist the determination of the best parse tree among the n-best parse trees." }, { "sent_id": "51575bb1ffb066d9570551f3347622-C001-29", "text": "There are three main steps in our rescoring approach." }, { "sent_id": "51575bb1ffb066d9570551f3347622-C001-30", "text": "The first step is to have the parser to produce n-best parse trees with their structural scores." 
}, { "sent_id": "51575bb1ffb066d9570551f3347622-C001-31", "text": "For each parsed tree including words, part-of-speech (PoS) and semantic role labels." }, { "sent_id": "51575bb1ffb066d9570551f3347622-C001-32", "text": "Second, we extract word-to-word associations (or called word dependency, a dependency implies its close association with other words in either syntactic or semantic perspective) from large amounts of auto-parsed data and adopt word2vecf [13] to train dependency-based word embeddings." }, { "sent_id": "51575bb1ffb066d9570551f3347622-C001-33", "text": "The last step is to build a structural rescoring method to find the best tree structure from the n-best candidates." }, { "sent_id": "51575bb1ffb066d9570551f3347622-C001-34", "text": "We conduct experiments on the standard data sets of the Chinese Treebank." }, { "sent_id": "51575bb1ffb066d9570551f3347622-C001-35", "text": "We also study how different types of embeddings influence on rescoring, including word, word with semantic role labels, and word senses (concepts)." }, { "sent_id": "51575bb1ffb066d9570551f3347622-C001-36", "text": "Experimental results show that using semantic role labels in dependency embeddings has best performance. And the final experiments results indicate that our proposed approach outperforms the best parser in Chinese." }, { "sent_id": "51575bb1ffb066d9570551f3347622-C001-37", "text": "Furthermore we attempt to compare the performance of using the traditional conditional probability method with our approach." }, { "sent_id": "51575bb1ffb066d9570551f3347622-C001-38", "text": "From the experimental results, the embedding scores can relax data sparseness problem and have better results than the traditional approach." }, { "sent_id": "51575bb1ffb066d9570551f3347622-C001-39", "text": "Keywords: Word Embeddings, Parsing, Word Dependency, Rescoring." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "51575bb1ffb066d9570551f3347622-C001-10" ], [ "51575bb1ffb066d9570551f3347622-C001-19" ] ], "cite_sentences": [ "51575bb1ffb066d9570551f3347622-C001-10", "51575bb1ffb066d9570551f3347622-C001-19" ] } } }, "ABC_eb591565efc03df1706710218a8f19_48": { "x": [ { "sent_id": "eb591565efc03df1706710218a8f19-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "eb591565efc03df1706710218a8f19-C001-2", "text": "Rescoring approaches for parsing aim to re-rank and change the order of parse trees produced by a general parser for a given sentence." }, { "sent_id": "eb591565efc03df1706710218a8f19-C001-3", "text": "The re-ranking quality depends on the precision of the rescoring function." }, { "sent_id": "eb591565efc03df1706710218a8f19-C001-4", "text": "However it is a challenge to design an appropriate function to determine the qualities of parse trees." }, { "sent_id": "eb591565efc03df1706710218a8f19-C001-5", "text": "No matter which method is used, Treebank is a widely used resource in parsing task." }, { "sent_id": "eb591565efc03df1706710218a8f19-C001-6", "text": "Most approaches utilize complex features to re-estimate the tree structures of a given sentence [1, 2, 3] ." }, { "sent_id": "eb591565efc03df1706710218a8f19-C001-7", "text": "Unfortunately, sizes of treebanks are generally small and insufficient, which results in a common problem of data sparseness." }, { "sent_id": "eb591565efc03df1706710218a8f19-C001-8", "text": "Learning knowledge from analyzing large-scaled unlabeled data is compulsory and proved useful in the previous works [4, 5, 6] ." }, { "sent_id": "eb591565efc03df1706710218a8f19-C001-9", "text": "How to extract useful information from unannotated large scale corpus has been a research issue." }, { "sent_id": "eb591565efc03df1706710218a8f19-C001-10", "text": "Word embeddings have become increasingly popular lately, proving to be valuable as a source of features in a broad range of NLP tasks [7, 8, 9 ]." 
}, { "sent_id": "eb591565efc03df1706710218a8f19-C001-11", "text": "The word2vec [10] is among the most widely used word embedding models today." }, { "sent_id": "eb591565efc03df1706710218a8f19-C001-12", "text": "Their success is largely due to an efficient and user-friendly implementation that learns high quality word embeddings from very large corpora." }, { "sent_id": "eb591565efc03df1706710218a8f19-C001-13", "text": "The word2vec learns low dimensional continuous vector representations for words by considering window-based contexts, i.e., context words within some fixed distance of each side of the target words." }, { "sent_id": "eb591565efc03df1706710218a8f19-C001-14", "text": "Another different context type is dependency-based word embedding [11, 12, 13] , which considers syntactic contexts rather" }, { "sent_id": "eb591565efc03df1706710218a8f19-C001-15", "text": "----------------------------------" }, { "sent_id": "eb591565efc03df1706710218a8f19-C001-16", "text": "****" }, { "sent_id": "eb591565efc03df1706710218a8f19-C001-17", "text": "knowledge from analyzing large-scaled unlabeled data is compulsory and proved useful in the previous works [4, 5, 6] ." }, { "sent_id": "eb591565efc03df1706710218a8f19-C001-18", "text": "How to extract useful information from unannotated large scale corpus has been a research issue." }, { "sent_id": "eb591565efc03df1706710218a8f19-C001-19", "text": "Word embeddings have become increasingly popular lately, proving to be valuable as a source of features in a broad range of NLP tasks [7, 8, 9] ." }, { "sent_id": "eb591565efc03df1706710218a8f19-C001-20", "text": "The word2vec [10] is among the most widely used word embedding models today." }, { "sent_id": "eb591565efc03df1706710218a8f19-C001-21", "text": "Their success is largely due to an efficient and user-friendly implementation that learns high quality word embeddings from very large corpora." 
}, { "sent_id": "eb591565efc03df1706710218a8f19-C001-22", "text": "The word2vec learns low dimensional continuous vector representations for words by considering window-based contexts, i.e., context words within some fixed distance of each side of the target words." }, { "sent_id": "eb591565efc03df1706710218a8f19-C001-23", "text": "Another different context type is dependency-based word embedding [11, 12, 13] , which considers syntactic contexts rather" }, { "sent_id": "eb591565efc03df1706710218a8f19-C001-24", "text": "The 2016 Conference on Computational Linguistics and Speech Processing ROCLING 2016, pp." }, { "sent_id": "eb591565efc03df1706710218a8f19-C001-25", "text": "100-102 \uf0d3 The Association for Computational Linguistics and Chinese Language Processing 100 than window contexts in word2vec." }, { "sent_id": "eb591565efc03df1706710218a8f19-C001-26", "text": "Bansal et al. [8] and Melamud et al. [11] show the benefits of such modified-context embeddings in dependency parsing task." }, { "sent_id": "eb591565efc03df1706710218a8f19-C001-27", "text": "The dependency-based word embedding can relieve the problem of data sparseness, since even without occurrence of dependency word pairs in a corpus, dependency scores can be still calculated by word embeddings [12] ." }, { "sent_id": "eb591565efc03df1706710218a8f19-C001-28", "text": "In this paper, we proposed a rescoring approach for parsing, based on a combination of original parsing scores and dependency word embedding scores to assist the determination of the best parse tree among the n-best parse trees." }, { "sent_id": "eb591565efc03df1706710218a8f19-C001-29", "text": "There are three main steps in our rescoring approach." }, { "sent_id": "eb591565efc03df1706710218a8f19-C001-30", "text": "The first step is to have the parser to produce n-best parse trees with their structural scores." 
}, { "sent_id": "eb591565efc03df1706710218a8f19-C001-31", "text": "For each parsed tree including words, part-of-speech (PoS) and semantic role labels." }, { "sent_id": "eb591565efc03df1706710218a8f19-C001-32", "text": "Second, we extract word-to-word associations (or called word dependency, a dependency implies its close association with other words in either syntactic or semantic perspective) from large amounts of auto-parsed data and adopt word2vecf [13] to train dependency-based word embeddings." }, { "sent_id": "eb591565efc03df1706710218a8f19-C001-33", "text": "The last step is to build a structural rescoring method to find the best tree structure from the n-best candidates." }, { "sent_id": "eb591565efc03df1706710218a8f19-C001-34", "text": "We conduct experiments on the standard data sets of the Chinese Treebank." }, { "sent_id": "eb591565efc03df1706710218a8f19-C001-35", "text": "We also study how different types of embeddings influence on rescoring, including word, word with semantic role labels, and word senses (concepts)." }, { "sent_id": "eb591565efc03df1706710218a8f19-C001-36", "text": "Experimental results show that using semantic role labels in dependency embeddings has best performance. And the final experiments results indicate that our proposed approach outperforms the best parser in Chinese." }, { "sent_id": "eb591565efc03df1706710218a8f19-C001-37", "text": "Furthermore we attempt to compare the performance of using the traditional conditional probability method with our approach." }, { "sent_id": "eb591565efc03df1706710218a8f19-C001-38", "text": "From the experimental results, the embedding scores can relax data sparseness problem and have better results than the traditional approach." }, { "sent_id": "eb591565efc03df1706710218a8f19-C001-39", "text": "Keywords: Word Embeddings, Parsing, Word Dependency, Rescoring." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "eb591565efc03df1706710218a8f19-C001-8" ], [ "eb591565efc03df1706710218a8f19-C001-17" ] ], "cite_sentences": [ "eb591565efc03df1706710218a8f19-C001-8", "eb591565efc03df1706710218a8f19-C001-17" ] } } }, "ABC_e9b2f32ed29589b4a6d49d3b30fc3a_48": { "x": [ { "sent_id": "e9b2f32ed29589b4a6d49d3b30fc3a-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "e9b2f32ed29589b4a6d49d3b30fc3a-C001-2", "text": "Searching online information is increasingly a daily activity for many people." }, { "sent_id": "e9b2f32ed29589b4a6d49d3b30fc3a-C001-3", "text": "The multilinguality of online content is also increasing (e.g. the proportion of English web users, which has been decreasing as a fraction the increasing population of web users, dipped below 50% in the summer of 2001)." }, { "sent_id": "e9b2f32ed29589b4a6d49d3b30fc3a-C001-4", "text": "To improve the ability of an English speaker to search mutlilingual content, we built a system that supports cross-lingual search of an Arabic newswire collection and provides on demand translation of Arabic web pages into English." }, { "sent_id": "e9b2f32ed29589b4a6d49d3b30fc3a-C001-5", "text": "The cross-lingual search engine supports a fast search capability (sub-second response for typical queries) and achieves state-of-the-art performance in the high precision region of the result list." }, { "sent_id": "e9b2f32ed29589b4a6d49d3b30fc3a-C001-6", "text": "The on demand statistical machine translation uses the Direct Translation model along with a novel statistical Arabic Morphological Analyzer to yield state-of-the-art translation quality." }, { "sent_id": "e9b2f32ed29589b4a6d49d3b30fc3a-C001-7", "text": "The on demand SMT uses an efficient dynamic programming decoder that achieves reasonable speed for translating web documents." 
}, { "sent_id": "e9b2f32ed29589b4a6d49d3b30fc3a-C001-8", "text": "----------------------------------" }, { "sent_id": "e9b2f32ed29589b4a6d49d3b30fc3a-C001-9", "text": "**OVERVIEW**" }, { "sent_id": "e9b2f32ed29589b4a6d49d3b30fc3a-C001-10", "text": "Morphologically rich languages like Arabic (Beesley, K. 1996 ) present significant challenges to many natural language processing applications as the one described above because a word often conveys complex meanings decomposable into several morphemes (i.e. prefix, stem, suffix) ." }, { "sent_id": "e9b2f32ed29589b4a6d49d3b30fc3a-C001-11", "text": "By segmenting words into morphemes, we can improve the performance of natural language systems including machine translation (Brown et al. 1993 ) and information retrieval (Franz, M. and McCarley, S. 2002) ." }, { "sent_id": "e9b2f32ed29589b4a6d49d3b30fc3a-C001-12", "text": "In this paper, we present a cross-lingual English-Arabic search engine combined with an on demand ArabicEnglish statistical machine translation system that relies on source language analysis for both improved search and translation." }, { "sent_id": "e9b2f32ed29589b4a6d49d3b30fc3a-C001-13", "text": "We developed novel statistical learning algorithms for performing Arabic word segmentation (Lee, Y. et al 2003) into morphemes and morphological source language (Arabic) analysis (Lee, Y. et al 2003b) ." }, { "sent_id": "e9b2f32ed29589b4a6d49d3b30fc3a-C001-14", "text": "These components improve both monolingual (Arabic) search and cross-lingual (English-Arabic) search and machine translation." }, { "sent_id": "e9b2f32ed29589b4a6d49d3b30fc3a-C001-15", "text": "In addition, the system supports either document translation or convolutional models for cross-lingual search (Franz, M. and McCarley, S. 2002) ." }, { "sent_id": "e9b2f32ed29589b4a6d49d3b30fc3a-C001-16", "text": "The overall demonstration has the following major components:" }, { "sent_id": "e9b2f32ed29589b4a6d49d3b30fc3a-C001-17", "text": "1. 
Mono-lingual search: uses Arabic word segmentation and an okapi-like search engine for document ranking." }, { "sent_id": "e9b2f32ed29589b4a6d49d3b30fc3a-C001-18", "text": "2. Cross-lingual search: uses Arabic word segmentation and morphological analysis along with a statistical morpheme translation matrix in a convolutional model for document ranking." }, { "sent_id": "e9b2f32ed29589b4a6d49d3b30fc3a-C001-19", "text": "The search can also use document translation into English to rank the Arabic documents." }, { "sent_id": "e9b2f32ed29589b4a6d49d3b30fc3a-C001-20", "text": "Both approaches achieve similar precision in the high precision region of retrieval." }, { "sent_id": "e9b2f32ed29589b4a6d49d3b30fc3a-C001-21", "text": "The English query is also morphologically analyzed to improve performance." }, { "sent_id": "e9b2f32ed29589b4a6d49d3b30fc3a-C001-22", "text": "3." }, { "sent_id": "e9b2f32ed29589b4a6d49d3b30fc3a-C001-23", "text": "OnDemand statistical machine translation: this component uses both analysis components along with a direct channel translation model with a fast dynamic programming decoder (Tillmann, C. 2003) ." }, { "sent_id": "e9b2f32ed29589b4a6d49d3b30fc3a-C001-24", "text": "This system All of the above functionality is available through a web browser." }, { "sent_id": "e9b2f32ed29589b4a6d49d3b30fc3a-C001-25", "text": "We indexed the Arabic AFP corpus about 330k documents for the demonstration." }, { "sent_id": "e9b2f32ed29589b4a6d49d3b30fc3a-C001-26", "text": "The resulting search engine supports sub-second query response." }, { "sent_id": "e9b2f32ed29589b4a6d49d3b30fc3a-C001-27", "text": "We also provide an html detagging capability that allows the translation of Arabic web pages while trying to preserve the original layout as much as possible in the on demand SMT component." 
}, { "sent_id": "e9b2f32ed29589b4a6d49d3b30fc3a-C001-28", "text": "The Arabic Name Entity Tagger is currently run as an offline process but we expect to have it online by the demonstration time." }, { "sent_id": "e9b2f32ed29589b4a6d49d3b30fc3a-C001-29", "text": "We aslo include two screen shots of the demonstration system." } ], "y": { "@EXT@": { "gold_contexts": [ [ "e9b2f32ed29589b4a6d49d3b30fc3a-C001-11" ] ], "cite_sentences": [ "e9b2f32ed29589b4a6d49d3b30fc3a-C001-11" ] }, "@MOT@": { "gold_contexts": [ [ "e9b2f32ed29589b4a6d49d3b30fc3a-C001-11" ] ], "cite_sentences": [ "e9b2f32ed29589b4a6d49d3b30fc3a-C001-11" ] } } }, "ABC_7a437574a9a7fff56a480801e47711_48": { "x": [ { "sent_id": "7a437574a9a7fff56a480801e47711-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "7a437574a9a7fff56a480801e47711-C001-2", "text": "This book provides a detailed overview of Regulus, an open-source toolkit for building and compiling grammars for spoken dialog systems, which has been used for a number of applications including Clarissa, a spoken language system in use on the International Space Station." }, { "sent_id": "7a437574a9a7fff56a480801e47711-C001-3", "text": "The focus is on controlled-language user-initiative systems, in which the user drives the dialog, but is required to use constrained language." }, { "sent_id": "7a437574a9a7fff56a480801e47711-C001-4", "text": "Such a spoken language interface allows for a constrained range of commands-for example, \"open the pod-bay doors\"-to be issued in circumstances where, ergonomically, other interface options are infeasible." }, { "sent_id": "7a437574a9a7fff56a480801e47711-C001-5", "text": "The emphasis of the approach, given the kind of application that is the focus of the work, is thus less on robustness of speech recognition and more on depth of semantic processing and quality of the dialog management." 
}, { "sent_id": "7a437574a9a7fff56a480801e47711-C001-6", "text": "It is an interesting book, one which succeeds in motivating key problems and presenting general approaches to solving them, with enough in the way of explicit details to allow even a complete novice in spoken language processing to implement simple dialog systems." }, { "sent_id": "7a437574a9a7fff56a480801e47711-C001-7", "text": "The book is split into two parts." }, { "sent_id": "7a437574a9a7fff56a480801e47711-C001-8", "text": "The first half is a very detailed tutorial on using Regulus to build and compile grammars." }, { "sent_id": "7a437574a9a7fff56a480801e47711-C001-9", "text": "Regulus, although itself open-source, makes use of SICStus Prolog for the dialog processing and the Nuance Toolkit for speech recognition." }, { "sent_id": "7a437574a9a7fff56a480801e47711-C001-10", "text": "Grammars in Regulus are specified with features requiring unification, but are compiled into context-free grammars for use with the Nuance speech recognition system." }, { "sent_id": "7a437574a9a7fff56a480801e47711-C001-11", "text": "An abundance of implementation details and guidance for the reader (or grammar writer) is provided in this part of the book for building grammars as well as a dialog manager." }, { "sent_id": "7a437574a9a7fff56a480801e47711-C001-12", "text": "The presentation includes example implementations handling such phenomena as ellipsis and corrections." }, { "sent_id": "7a437574a9a7fff56a480801e47711-C001-13", "text": "In addition, details for building spoken language translation systems are presented within the same range of constrained language applications." }, { "sent_id": "7a437574a9a7fff56a480801e47711-C001-14", "text": "The tutorial format is terrifically explicit, which will make this volume appropriate for undergraduate courses looking to provide students with hands-on exercises in building spoken dialog systems." 
}, { "sent_id": "7a437574a9a7fff56a480801e47711-C001-15", "text": "One issue with the premise of an open-source toolkit that relies upon other software (SICStus Prolog, Nuance) for key parts of the application is that one is required to obtain and use that software." }, { "sent_id": "7a437574a9a7fff56a480801e47711-C001-16", "text": "In the book, the authors note that an individual license of SICStus is available for a relatively small fee, and that Nuance has a program to license their Toolkit for research purposes." }, { "sent_id": "7a437574a9a7fff56a480801e47711-C001-17", "text": "Unfortunately, since the writing of the book, corporate changes at Nuance have made obtaining such a research license more challenging, and this reviewer was only able to do so after several weeks of e-mail persistence." }, { "sent_id": "7a437574a9a7fff56a480801e47711-C001-18", "text": "It would be very beneficial to future readers if the authors would scout out the current state of" }, { "sent_id": "7a437574a9a7fff56a480801e47711-C001-19", "text": "----------------------------------" }, { "sent_id": "7a437574a9a7fff56a480801e47711-C001-20", "text": "****" }, { "sent_id": "7a437574a9a7fff56a480801e47711-C001-21", "text": "the process of obtaining such a license and provide a detailed map of it on some easily accessible Web site." }, { "sent_id": "7a437574a9a7fff56a480801e47711-C001-22", "text": "As it stands, the barriers to building the full spoken dialog systems detailed in the book are higher than they should be." }, { "sent_id": "7a437574a9a7fff56a480801e47711-C001-23", "text": "The second half of the book goes under the hood to examine, more generally, the issues involved in making the authors' approach work." }, { "sent_id": "7a437574a9a7fff56a480801e47711-C001-24", "text": "The chapter on compilation of feature grammars into context-free grammars looks in depth at alternatives to exhaustive expansion, including efficient filtering techniques and grammar transformation." 
}, { "sent_id": "7a437574a9a7fff56a480801e47711-C001-25", "text": "There is also a chapter on developing an English feature grammar specifically for integration with a speech recognition system." }, { "sent_id": "7a437574a9a7fff56a480801e47711-C001-26", "text": "A third chapter presents the authors' approach for adapting the general English feature grammar to a particular domain, which they term grammar specialization." }, { "sent_id": "7a437574a9a7fff56a480801e47711-C001-27", "text": "In effect, they induce a new feature grammar by transforming (flattening, to a greater or lesser extent) an automatically produced treebank of domain-specific text and extracting new rules from the treebank, as well as estimating production probabilities from the corpus." }, { "sent_id": "7a437574a9a7fff56a480801e47711-C001-28", "text": "A subsequent chapter investigates the impact of various parameterizations on this grammar specialization approach (for example, the degree of flattening), thus establishing, at least for the applications presented, some best practices." }, { "sent_id": "7a437574a9a7fff56a480801e47711-C001-29", "text": "A final chapter presents a comparison of their grammar-based approach with a class-based n-gram language model." }, { "sent_id": "7a437574a9a7fff56a480801e47711-C001-30", "text": "This half of the book will be more enjoyable for readers of this journal, who are presumably interested in more general questions of computation and language than the step-by-step tutorial format of the first half of the book." }, { "sent_id": "7a437574a9a7fff56a480801e47711-C001-31", "text": "The details of the approach are interesting, particularly the insights about how to build linguistically rich grammars that can be effectively compiled into high-utility context-free grammars for speech recognition." }
}, { "sent_id": "7a437574a9a7fff56a480801e47711-C001-32", "text": "The primary shortcoming of this presentation lies in perpetuating the false dichotomy between \"grammar-based\" and \"data-driven\" approaches to language modeling for speech recognition, which motivates the final chapter of the book." }, { "sent_id": "7a437574a9a7fff56a480801e47711-C001-33", "text": "In fact, the authors' approach is both grammar-based and data-driven, given the corpus-based grammar specialization and PCFG estimation, which the authors themselves demonstrate to be indispensable." }, { "sent_id": "7a437574a9a7fff56a480801e47711-C001-34", "text": "Robust grammar-based language modeling is a topic that has received a fair bit of attention over the past decade (Chelba and Jelinek 2000; Charniak 2001; Roark 2001; Wang, Stolcke, and Harper 2004, among others) , and while this line of research has not focused on the use of manually built, narrow-domain feature grammars, there is enough similarity between the approach described in this book and the cited papers that the papers would seem to be better comparison points than the class-based language models that are chosen to represent robust approaches in the comparison." }, { "sent_id": "7a437574a9a7fff56a480801e47711-C001-35", "text": "Beyond language modeling, methods for PCFG induction from treebanks have been a popular topic in the field over the past decade, and some understanding of the impact of flattening trees can be had in Johnson (1998) , where the beneficial impact of various tree transformations for probabilistic grammars is presented." 
}, { "sent_id": "7a437574a9a7fff56a480801e47711-C001-36", "text": "None of this work is discussed or cited, and the naive reader might be left with the impression that data-driven approaches have been demonstrated to underperform relative to knowledge-based approaches, when in fact the authors simply demonstrate that their hybrid grammar-based/data-driven approach outperforms class-based language models." }, { "sent_id": "7a437574a9a7fff56a480801e47711-C001-37", "text": "Perhaps this is worth demonstrating, but the chapter couches the results within the context of a clash between paradigms, which simply does not ring true." }, { "sent_id": "7a437574a9a7fff56a480801e47711-C001-38", "text": "This one misstep, however, does not detract from the quality of the authors' system, nor from the interesting presentation of too-often-ignored aspects of spoken language systems engineering." }, { "sent_id": "7a437574a9a7fff56a480801e47711-C001-39", "text": "The book and the toolkit it describes can serve a very useful pedagogical purpose by providing a foundation upon which students and researchers" } ], "y": { "@BACK@": { "gold_contexts": [ [ "7a437574a9a7fff56a480801e47711-C001-34" ] ], "cite_sentences": [ "7a437574a9a7fff56a480801e47711-C001-34" ] } } }, "ABC_0bfea881773f504206bef9c1394f20_48": { "x": [ { "sent_id": "0bfea881773f504206bef9c1394f20-C001-13", "text": "Our claims are that i) such evaluation posteriors are Normally distributed (Tab. I), and that ii) the variance is inversely proportional to the subset size (Tab. II)." }, { "sent_id": "0bfea881773f504206bef9c1394f20-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "0bfea881773f504206bef9c1394f20-C001-2", "text": "Estimating the internal state of a robotic system is complex: this is performed from multiple heterogeneous sensor inputs and knowledge sources." 
}, { "sent_id": "0bfea881773f504206bef9c1394f20-C001-3", "text": "Discretization of such inputs is done to capture saliences, represented as symbolic information, which often presents structure and recurrence." }, { "sent_id": "0bfea881773f504206bef9c1394f20-C001-4", "text": "As these sequences are used to reason over complex scenarios [1] , a more compact representation would aid exactness of technical cognitive reasoning capabilities, which are today constrained by computational complexity issues and fall back to representational heuristics or human intervention [1], [2] ." }, { "sent_id": "0bfea881773f504206bef9c1394f20-C001-5", "text": "Such problems need to be addressed to ensure timely and meaningful human-robot interaction." }, { "sent_id": "0bfea881773f504206bef9c1394f20-C001-6", "text": "Our work is towards understanding the variability of learning informativeness when training on subsets of a given input dataset." }, { "sent_id": "0bfea881773f504206bef9c1394f20-C001-7", "text": "This is in view of reducing the training size while retaining the majority of the symbolic learning potential." }, { "sent_id": "0bfea881773f504206bef9c1394f20-C001-8", "text": "We prove the concept on human-written texts, and conjecture this work will reduce training data size of sequential instructions, while preserving semantic relations, when gathering information from large remote sources [3] ." }, { "sent_id": "0bfea881773f504206bef9c1394f20-C001-9", "text": "We computed multiple random subsets of sentences from the UMBC WEBBASE CORPUS (\u223c 17.13GB) via a custom implementation using the SPARK distributed framework." }, { "sent_id": "0bfea881773f504206bef9c1394f20-C001-10", "text": "We evaluated the learning informativeness of such sets in terms of semantic word-sense classification accuracy (with WORD2VEC [4]), and of n-gram perplexity." }
}, { "sent_id": "0bfea881773f504206bef9c1394f20-C001-11", "text": "Previous literature informs us that corpus size and posterior quality do not follow a linear correlation for some learning tasks (e.g. semantic measures) [5] ." }, { "sent_id": "0bfea881773f504206bef9c1394f20-C001-12", "text": "In our semantic tests, on average 85% of the quality can be obtained by training on a random \u223c 4% subset of the original corpus (e.g. as in Fig. 1, 5 million random lines yield 64.14% instead of 75.14%)." }, { "sent_id": "0bfea881773f504206bef9c1394f20-C001-14", "text": "It is therefore possible to select the best random subset for a given size, if an information criterion is known." }, { "sent_id": "0bfea881773f504206bef9c1394f20-C001-15", "text": "Such a metric is currently under investigation." }, { "sent_id": "0bfea881773f504206bef9c1394f20-C001-16", "text": "Within the robotics domain, in order to reduce computational complexity of the training phase, cardinality reduction of human-written instructions is particularly important for non-recursive online training algorithms, such as current symbol-based probabilistic reasoning systems [1] , [3] , [6] ." }, { "sent_id": "0bfea881773f504206bef9c1394f20-C001-17", "text": "----------------------------------" }, { "sent_id": "0bfea881773f504206bef9c1394f20-C001-18", "text": "****" }, { "sent_id": "0bfea881773f504206bef9c1394f20-C001-19", "text": "Estimating the internal state of a robotic system is complex: this is performed from multiple heterogeneous sensor inputs and knowledge sources." }, { "sent_id": "0bfea881773f504206bef9c1394f20-C001-20", "text": "Discretization of such inputs is done to capture saliences, represented as symbolic information, which often presents structure and recurrence." }
}, { "sent_id": "0bfea881773f504206bef9c1394f20-C001-21", "text": "As these sequences are used to reason over complex scenarios [1] , a more compact representation would aid exactness of technical cognitive reasoning capabilities, which are today constrained by computational complexity issues and fall back to representational heuristics or human intervention [1] , [2] ." }, { "sent_id": "0bfea881773f504206bef9c1394f20-C001-22", "text": "Such problems need to be addressed to ensure timely and meaningful human-robot interaction." }, { "sent_id": "0bfea881773f504206bef9c1394f20-C001-23", "text": "Our work is towards understanding the variability of learning informativeness when training on subsets of a given input dataset." }, { "sent_id": "0bfea881773f504206bef9c1394f20-C001-24", "text": "This is in view of reducing the training size while retaining the majority of the symbolic learning potential." }, { "sent_id": "0bfea881773f504206bef9c1394f20-C001-25", "text": "We prove the concept on human-written texts, and conjecture this work will reduce training data size of sequential instructions, while preserving semantic relations, when gathering information from large remote sources [3] ." }, { "sent_id": "0bfea881773f504206bef9c1394f20-C001-26", "text": "----------------------------------" }, { "sent_id": "0bfea881773f504206bef9c1394f20-C001-27", "text": "**POSTERIOR EVALUATION DISTRIBUTION OF SUBSETS**" }, { "sent_id": "0bfea881773f504206bef9c1394f20-C001-28", "text": "We computed multiple random subsets of sentences from the UMBC WEBBASE CORPUS (\u223c 17.13GB) via a custom implementation using the SPARK distributed framework." }, { "sent_id": "0bfea881773f504206bef9c1394f20-C001-29", "text": "We evaluated the learning informativeness of such sets in terms of semantic word-sense classification accuracy (with WORD2VEC [4] ), and of n-gram perplexity." }
}, { "sent_id": "0bfea881773f504206bef9c1394f20-C001-30", "text": "Previous literature informs us that corpus size and posterior quality do not follow a linear correlation for some learning tasks (e.g. semantic measures) [5] ." }, { "sent_id": "0bfea881773f504206bef9c1394f20-C001-31", "text": "In our semantic tests, on average 85% of the quality can be obtained by training on a random \u223c 4% subset of the original corpus (e.g. as in Fig. 1, 5 million random lines yield 64.14% instead of 75.14%)." }, { "sent_id": "0bfea881773f504206bef9c1394f20-C001-32", "text": "Our claims are that i) such evaluation posteriors are Normally distributed (Tab. I), and that ii) the variance is inversely proportional to the subset size (Tab. II)." }, { "sent_id": "0bfea881773f504206bef9c1394f20-C001-33", "text": "It is therefore possible to select the best random subset for a given size, if an information criterion is known." }, { "sent_id": "0bfea881773f504206bef9c1394f20-C001-34", "text": "Such a metric is currently under investigation." }, { "sent_id": "0bfea881773f504206bef9c1394f20-C001-35", "text": "Within the robotics domain, in order to reduce computational complexity of the training phase, cardinality reduction of human-written instructions is particularly important for non-recursive online training algorithms, such as current symbol-based probabilistic reasoning systems [1] , [3] , [6] ." }
} ], "y": { "@USE@": { "gold_contexts": [ [ "0bfea881773f504206bef9c1394f20-C001-10" ], [ "0bfea881773f504206bef9c1394f20-C001-29" ] ], "cite_sentences": [ "0bfea881773f504206bef9c1394f20-C001-10", "0bfea881773f504206bef9c1394f20-C001-29" ] } } }, "ABC_36436e1d8a3f1d65fcc369649341f2_48": { "x": [ { "sent_id": "36436e1d8a3f1d65fcc369649341f2-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "36436e1d8a3f1d65fcc369649341f2-C001-2", "text": "The usual approach to learning language processing tasks such as tagging, parsing, grapheme-to-phoneme conversion, pp-attachment, etc., is to extract regularities from training data in the form of decision trees, rules, probabilities or other abstractions." }, { "sent_id": "36436e1d8a3f1d65fcc369649341f2-C001-3", "text": "These representations of regularities are then used to solve new cases of the task." }, { "sent_id": "36436e1d8a3f1d65fcc369649341f2-C001-4", "text": "The individual training examples on which the abstractions were based are discarded (forgotten)." }, { "sent_id": "36436e1d8a3f1d65fcc369649341f2-C001-5", "text": "While this approach seems to work well for other application areas of Machine Learning, I will show that there is evidence that it is not the best way to learn language processing tasks." }, { "sent_id": "36436e1d8a3f1d65fcc369649341f2-C001-6", "text": "I will briefly review empirical work in our groups in Antwerp and Tilburg on lazy language learning." }, { "sent_id": "36436e1d8a3f1d65fcc369649341f2-C001-7", "text": "In this approach (also called instance-based, case-based, memory-based, and example-based learning), generalization happens at processing time by means of extrapolation from the most similar items in memory to the new item being processed." }
}, { "sent_id": "36436e1d8a3f1d65fcc369649341f2-C001-8", "text": "Lazy Learning with a simple similarity metric based on information entropy (IB1-IG, Daelemans & van den Bosch, 1992) consistently outperforms abstracting (greedy) learning techniques such as C5.0 or backprop learning on a broad selection of natural language processing tasks ranging from phonology to semantics." }, { "sent_id": "36436e1d8a3f1d65fcc369649341f2-C001-9", "text": "Our intuitive explanation for this result is that lazy learning techniques keep all training items, whereas greedy approaches lose useful information by forgetting low-frequency or exceptional instances of the task, not covered by the extracted rules or models (Daelemans, 1996) ." }, { "sent_id": "36436e1d8a3f1d65fcc369649341f2-C001-10", "text": "Apart from the empirical work in Tilburg and Antwerp, a number of recent studies on statistical natural language processing (e.g. Dagan & Lee, 1997; Collins & Brooks, 1995) also suggest that, contrary to common wisdom, forgetting specific training items, even when they represent extremely low-frequency events, is harmful to generalization accuracy." }, { "sent_id": "36436e1d8a3f1d65fcc369649341f2-C001-11", "text": "After reviewing this empirical work briefly, I will report on new results (work in progress in collaboration with van den Bosch and Zavrel), systematically comparing greedy and lazy learning techniques on a number of benchmark natural language processing tasks: tagging, grapheme-to-phoneme conversion, and pp-attachment." }, { "sent_id": "36436e1d8a3f1d65fcc369649341f2-C001-12", "text": "The results show that forgetting individual training items, however \"improbable\" they may be, is indeed harmful." }, { "sent_id": "36436e1d8a3f1d65fcc369649341f2-C001-13", "text": "Furthermore, they show that combining lazy learning with training set editing techniques (based on typicality and other regularity criteria) also leads to worse generalization results." }
}, { "sent_id": "36436e1d8a3f1d65fcc369649341f2-C001-14", "text": "I will conclude that forgetting, either by abstracting from the training data or by editing exceptional training items in lazy learning, is harmful to generalization accuracy, and will attempt to provide an explanation for these unexpected results." }, { "sent_id": "36436e1d8a3f1d65fcc369649341f2-C001-15", "text": "Collins, M. and J. Brooks. \"Prepositional Phrase Attachment through a Backed-off Model." }, { "sent_id": "36436e1d8a3f1d65fcc369649341f2-C001-16", "text": "----------------------------------" }, { "sent_id": "36436e1d8a3f1d65fcc369649341f2-C001-17", "text": "**ABSTRACT**" }, { "sent_id": "36436e1d8a3f1d65fcc369649341f2-C001-18", "text": "The usual approach to learning language processing tasks such as tagging, parsing, grapheme-to-phoneme conversion, pp-attachment, etc., is to extract regularities from training data in the form of decision trees, rules, probabilities or other abstractions." }, { "sent_id": "36436e1d8a3f1d65fcc369649341f2-C001-19", "text": "These representations of regularities are then used to solve new cases of the task." }, { "sent_id": "36436e1d8a3f1d65fcc369649341f2-C001-20", "text": "The individual training examples on which the abstractions were based are discarded (forgotten)." }, { "sent_id": "36436e1d8a3f1d65fcc369649341f2-C001-21", "text": "While this approach seems to work well for other application areas of Machine Learning, I will show that there is evidence that it is not the best way to learn language processing tasks." }, { "sent_id": "36436e1d8a3f1d65fcc369649341f2-C001-22", "text": "I will briefly review empirical work in our groups in Antwerp and Tilburg on lazy language learning." }
}, { "sent_id": "36436e1d8a3f1d65fcc369649341f2-C001-23", "text": "In this approach (also called instance-based, case-based, memory-based, and example-based learning), generalization happens at processing time by means of extrapolation from the most similar items in memory to the new item being processed." }, { "sent_id": "36436e1d8a3f1d65fcc369649341f2-C001-24", "text": "Lazy Learning with a simple similarity metric based on information entropy (IB1-IG, Daelemans & van den Bosch, 1992) consistently outperforms abstracting (greedy) learning techniques such as C5.0 or backprop learning on a broad selection of natural language processing tasks ranging from phonology to semantics." }, { "sent_id": "36436e1d8a3f1d65fcc369649341f2-C001-25", "text": "Our intuitive explanation for this result is that lazy learning techniques keep all training items, whereas greedy approaches lose useful information by forgetting low-frequency or exceptional instances of the task, not covered by the extracted rules or models (Daelemans, 1996) ." }, { "sent_id": "36436e1d8a3f1d65fcc369649341f2-C001-26", "text": "Apart from the empirical work in Tilburg and Antwerp, a number of recent studies on statistical natural language processing (e.g. Dagan & Lee, 1997; Collins & Brooks, 1995) also suggest that, contrary to common wisdom, forgetting specific training items, even when they represent extremely low-frequency events, is harmful to generalization accuracy." }, { "sent_id": "36436e1d8a3f1d65fcc369649341f2-C001-27", "text": "After reviewing this empirical work briefly, I will report on new results (work in progress in collaboration with van den Bosch and Zavrel), systematically comparing greedy and lazy learning techniques on a number of benchmark natural language processing tasks: tagging, grapheme-to-phoneme conversion, and pp-attachment." }
}, { "sent_id": "36436e1d8a3f1d65fcc369649341f2-C001-28", "text": "The results show that forgetting individual training items, however \"improbable\" they may be, is indeed harmful." }, { "sent_id": "36436e1d8a3f1d65fcc369649341f2-C001-29", "text": "Furthermore, they show that combining lazy learning with training set editing techniques (based on typicality and other regularity criteria) also leads to worse generalization results." }, { "sent_id": "36436e1d8a3f1d65fcc369649341f2-C001-30", "text": "I will conclude that forgetting, either by abstracting from the training data or by editing exceptional training items in lazy learning, is harmful to generalization accuracy, and will attempt to provide an explanation for these unexpected results." } ], "y": { "@BACK@": { "gold_contexts": [ [ "36436e1d8a3f1d65fcc369649341f2-C001-10" ], [ "36436e1d8a3f1d65fcc369649341f2-C001-26" ] ], "cite_sentences": [ "36436e1d8a3f1d65fcc369649341f2-C001-10", "36436e1d8a3f1d65fcc369649341f2-C001-26" ] }, "@SIM@": { "gold_contexts": [ [ "36436e1d8a3f1d65fcc369649341f2-C001-10", "36436e1d8a3f1d65fcc369649341f2-C001-11", "36436e1d8a3f1d65fcc369649341f2-C001-12" ] ], "cite_sentences": [ "36436e1d8a3f1d65fcc369649341f2-C001-10" ] }, "@EXT@": { "gold_contexts": [ [ "36436e1d8a3f1d65fcc369649341f2-C001-26", "36436e1d8a3f1d65fcc369649341f2-C001-27" ] ], "cite_sentences": [ "36436e1d8a3f1d65fcc369649341f2-C001-26" ] } } }, "ABC_c07bc362ac7ee1f64e149fe8907fee_49": { "x": [ { "sent_id": "c07bc362ac7ee1f64e149fe8907fee-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "c07bc362ac7ee1f64e149fe8907fee-C001-2", "text": "In this talk, I will outline some of the myriad of challenges and opportunities that social media offer for natural language processing." }
}, { "sent_id": "c07bc362ac7ee1f64e149fe8907fee-C001-3", "text": "I will present analysis of how pre-processing can be used to make social media data more amenable to natural language processing, and review a selection of tasks which attempt to harness the considerable potential of different social media services." }, { "sent_id": "c07bc362ac7ee1f64e149fe8907fee-C001-4", "text": "----------------------------------" }, { "sent_id": "c07bc362ac7ee1f64e149fe8907fee-C001-5", "text": "****" }, { "sent_id": "c07bc362ac7ee1f64e149fe8907fee-C001-6", "text": "There is no question that social media are fantastically popular and varied in form -ranging from user forums, to microblogs such as Twitter, to social networking sites such as Facebook -and that much of the content they host is in the form of natural language." }, { "sent_id": "c07bc362ac7ee1f64e149fe8907fee-C001-7", "text": "This would suggest a myriad of opportunities for natural language processing (NLP) , and yet much of the applied research on social media which uses language data is based on superficial analysis, often in the form of simple keyword search." }, { "sent_id": "c07bc362ac7ee1f64e149fe8907fee-C001-8", "text": "This begs the question: Are NLP methods not suited to social media analysis?" }, { "sent_id": "c07bc362ac7ee1f64e149fe8907fee-C001-9", "text": "Conversely, is social media data too challenging for modern-day NLP?" }, { "sent_id": "c07bc362ac7ee1f64e149fe8907fee-C001-10", "text": "Alternatively, are simple term search-based methods sufficient for social media analysis, i.e. is NLP overkill for social media?" }, { "sent_id": "c07bc362ac7ee1f64e149fe8907fee-C001-30", "text": "For example, searching for keywords relating to earthquakes or influenza allows for impressive results to be achieved in earthquake detection or influenza outbreak analysis (Sakaki et al., 2010; Ritterman et al., 2009 )." 
}, { "sent_id": "c07bc362ac7ee1f64e149fe8907fee-C001-11", "text": "In exploring these questions, I attempt to answer the overarching question of whether social media data is the friend or foe of NLP." }, { "sent_id": "c07bc362ac7ee1f64e149fe8907fee-C001-12", "text": "I approach the question first from the perspective of what challenges social media language poses for NLP." }, { "sent_id": "c07bc362ac7ee1f64e149fe8907fee-C001-13", "text": "The most immediate answer is the infamously free-form nature of language in social media, encompassing spelling inconsistencies, the free-form adoption of new terms, and regular violations of English grammar norms." }, { "sent_id": "c07bc362ac7ee1f64e149fe8907fee-C001-14", "text": "Unsurprisingly, when NLP tools are applied directly to social media data, the results tend to be miserable when compared to data sets such as the Wall Street Journal component of the Penn Treebank." }, { "sent_id": "c07bc362ac7ee1f64e149fe8907fee-C001-15", "text": "However, there have been recent successes in adapting parsers and POS taggers to social media data (Foster et al., 2011; Gimpel et al., 2011) ." }, { "sent_id": "c07bc362ac7ee1f64e149fe8907fee-C001-16", "text": "Additionally, lexical normalisation and other preprocessing strategies have been shown to enhance the performance of NLP tools over social media data (Lui and Baldwin, 2012; Han et al., to appear) ." }, { "sent_id": "c07bc362ac7ee1f64e149fe8907fee-C001-17", "text": "Furthermore, social media posts tend to be short and the content highly varied, meaning it is difficult to adapt a tool to the domain, or harness textual context to disambiguate the content." }, { "sent_id": "c07bc362ac7ee1f64e149fe8907fee-C001-18", "text": "There is also the engineering challenge of real-time processing of the text stream, as much of NLP research is carried out offline with only secondary concern for throughput." 
}, { "sent_id": "c07bc362ac7ee1f64e149fe8907fee-C001-19", "text": "As such, we might conclude that social media data is a foe of NLP, in that it challenges traditional assumptions made in NLP research on the nature of the target text and the requirements for real-time responsiveness." }, { "sent_id": "c07bc362ac7ee1f64e149fe8907fee-C001-20", "text": "However, if we look beyond the immediate text content of social media, we quickly realise that there are various non-textual data sources that can be used to enhance the robustness and accuracy of NLP models, in a way which is not possible with static text corpora." }, { "sent_id": "c07bc362ac7ee1f64e149fe8907fee-C001-21", "text": "For example, simple information on the author of a post can be used to develop authoradapted models based on the previous posts of the same individual (at least for users who post sufficiently large volumes of data)." }, { "sent_id": "c07bc362ac7ee1f64e149fe8907fee-C001-22", "text": "Links in the post can be used to disambiguate the textual content of the post, whether in the form of URLs and the content contained in the target document(s), hashtags and the content of other similarly-tagged posts, thread-ing structure in web user forums, or addressee information and the content of posts from that individual." }, { "sent_id": "c07bc362ac7ee1f64e149fe8907fee-C001-23", "text": "Simple timestamp information may provide insights into what timezone the user is likely to be based in, allowing for adjustment of language priors for use in language identification." }, { "sent_id": "c07bc362ac7ee1f64e149fe8907fee-C001-24", "text": "User-declared metadata may also provide valuable information on the probable interpretation of a given post, e.g. knowing that a person is from Australia may allow for adjustment of lexical or word-POS priors." 
}, { "sent_id": "c07bc362ac7ee1f64e149fe8907fee-C001-25", "text": "Multimodal content such as images or videos included in the post may also provide valuable insights into the likely interpretation for particular words." }, { "sent_id": "c07bc362ac7ee1f64e149fe8907fee-C001-26", "text": "Social network information may also allow for user-specific adjustment of language priors of various types." }, { "sent_id": "c07bc362ac7ee1f64e149fe8907fee-C001-27", "text": "In this sense, the rich context that permeates social media can very much be the friend of NLP, in providing valuable assistance in disambiguating content." }, { "sent_id": "c07bc362ac7ee1f64e149fe8907fee-C001-28", "text": "Turning to the question of why the majority of social media analysis makes use of simple language analysis such as word counts for a canned set of query terms, I suggest that the cause is largely because of the constraints imposed on the user by different social media APIs, and also the relative accessibility of such simple techniques, as compared to full-strength NLP." }, { "sent_id": "c07bc362ac7ee1f64e149fe8907fee-C001-29", "text": "I go on to claim that \"the tail has been wagging the dog\" in social media research, in the sense that while impressive results have been achieved for particular application types, the choice of application has been constrained by what is achievable with relatively simple keyword analysis." }, { "sent_id": "c07bc362ac7ee1f64e149fe8907fee-C001-31", "text": "However, this style of approach presupposes a highly-constrained, predetermined information need which is expressible in a small number of relatively unambiguous query terms." }, { "sent_id": "c07bc362ac7ee1f64e149fe8907fee-C001-32", "text": "In applications such as trend analysis, the information need is more open-ended and it is unreasonable to expect that a static set of keywords will capture new trends." 
}, { "sent_id": "c07bc362ac7ee1f64e149fe8907fee-C001-33", "text": "Even for highly-constrained information needs, there may not be a high-precision set of query terms which provide the necessary information." }, { "sent_id": "c07bc362ac7ee1f64e149fe8907fee-C001-34", "text": "While it is certainly not the case that full-blown NLP is needed in all social media applications, it is equally not correct to say that NLP is overkill for all social media analysis." }, { "sent_id": "c07bc362ac7ee1f64e149fe8907fee-C001-35", "text": "Rather, the emergence of more mature, robust NLP technologies tailored to social media data will enable new opportunities for social media analysis, earning new friends for NLP in the process." } ], "y": { "@BACK@": { "gold_contexts": [ [ "c07bc362ac7ee1f64e149fe8907fee-C001-15" ] ], "cite_sentences": [ "c07bc362ac7ee1f64e149fe8907fee-C001-15" ] } } }, "ABC_c78464cfe0fc44f5fd0da2e4f9d90e_49": { "x": [ { "sent_id": "c78464cfe0fc44f5fd0da2e4f9d90e-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "c78464cfe0fc44f5fd0da2e4f9d90e-C001-2", "text": "In this talk, I will outline some of the myriad of challenges and opportunities that social media offer for natural language processing." }, { "sent_id": "c78464cfe0fc44f5fd0da2e4f9d90e-C001-3", "text": "I will present analysis of how pre-processing can be used to make social media data more amenable to natural language processing, and review a selection of tasks which attempt to harness the considerable potential of different social media services." 
}, { "sent_id": "c78464cfe0fc44f5fd0da2e4f9d90e-C001-4", "text": "----------------------------------" }, { "sent_id": "c78464cfe0fc44f5fd0da2e4f9d90e-C001-5", "text": "****" }, { "sent_id": "c78464cfe0fc44f5fd0da2e4f9d90e-C001-6", "text": "There is no question that social media are fantastically popular and varied in form -ranging from user forums, to microblogs such as Twitter, to social networking sites such as Facebook -and that much of the content they host is in the form of natural language." }, { "sent_id": "c78464cfe0fc44f5fd0da2e4f9d90e-C001-7", "text": "This would suggest a myriad of opportunities for natural language processing (NLP) , and yet much of the applied research on social media which uses language data is based on superficial analysis, often in the form of simple keyword search." }, { "sent_id": "c78464cfe0fc44f5fd0da2e4f9d90e-C001-8", "text": "This begs the question: Are NLP methods not suited to social media analysis?" }, { "sent_id": "c78464cfe0fc44f5fd0da2e4f9d90e-C001-9", "text": "Conversely, is social media data too challenging for modern-day NLP?" }, { "sent_id": "c78464cfe0fc44f5fd0da2e4f9d90e-C001-10", "text": "Alternatively, are simple term search-based methods sufficient for social media analysis, i.e. is NLP overkill for social media?" }, { "sent_id": "c78464cfe0fc44f5fd0da2e4f9d90e-C001-11", "text": "In exploring these questions, I attempt to answer the overarching question of whether social media data is the friend or foe of NLP." }, { "sent_id": "c78464cfe0fc44f5fd0da2e4f9d90e-C001-12", "text": "I approach the question first from the perspective of what challenges social media language poses for NLP." }, { "sent_id": "c78464cfe0fc44f5fd0da2e4f9d90e-C001-13", "text": "The most immediate answer is the infamously free-form nature of language in social media, encompassing spelling inconsistencies, the free-form adoption of new terms, and regular violations of English grammar norms." 
}, { "sent_id": "c78464cfe0fc44f5fd0da2e4f9d90e-C001-14", "text": "Unsurprisingly, when NLP tools are applied directly to social media data, the results tend to be miserable when compared to data sets such as the Wall Street Journal component of the Penn Treebank." }, { "sent_id": "c78464cfe0fc44f5fd0da2e4f9d90e-C001-15", "text": "However, there have been recent successes in adapting parsers and POS taggers to social media data (Foster et al., 2011; Gimpel et al., 2011) ." }, { "sent_id": "c78464cfe0fc44f5fd0da2e4f9d90e-C001-16", "text": "Additionally, lexical normalisation and other preprocessing strategies have been shown to enhance the performance of NLP tools over social media data (Lui and Baldwin, 2012; Han et al., to appear) ." }, { "sent_id": "c78464cfe0fc44f5fd0da2e4f9d90e-C001-17", "text": "Furthermore, social media posts tend to be short and the content highly varied, meaning it is difficult to adapt a tool to the domain, or harness textual context to disambiguate the content." }, { "sent_id": "c78464cfe0fc44f5fd0da2e4f9d90e-C001-18", "text": "There is also the engineering challenge of real-time processing of the text stream, as much of NLP research is carried out offline with only secondary concern for throughput." }, { "sent_id": "c78464cfe0fc44f5fd0da2e4f9d90e-C001-19", "text": "As such, we might conclude that social media data is a foe of NLP, in that it challenges traditional assumptions made in NLP research on the nature of the target text and the requirements for real-time responsiveness." }, { "sent_id": "c78464cfe0fc44f5fd0da2e4f9d90e-C001-20", "text": "However, if we look beyond the immediate text content of social media, we quickly realise that there are various non-textual data sources that can be used to enhance the robustness and accuracy of NLP models, in a way which is not possible with static text corpora." 
}, { "sent_id": "c78464cfe0fc44f5fd0da2e4f9d90e-C001-21", "text": "For example, simple information on the author of a post can be used to develop authoradapted models based on the previous posts of the same individual (at least for users who post sufficiently large volumes of data)." }, { "sent_id": "c78464cfe0fc44f5fd0da2e4f9d90e-C001-22", "text": "Links in the post can be used to disambiguate the textual content of the post, whether in the form of URLs and the content contained in the target document(s), hashtags and the content of other similarly-tagged posts, thread-ing structure in web user forums, or addressee information and the content of posts from that individual." }, { "sent_id": "c78464cfe0fc44f5fd0da2e4f9d90e-C001-23", "text": "Simple timestamp information may provide insights into what timezone the user is likely to be based in, allowing for adjustment of language priors for use in language identification." }, { "sent_id": "c78464cfe0fc44f5fd0da2e4f9d90e-C001-24", "text": "User-declared metadata may also provide valuable information on the probable interpretation of a given post, e.g. knowing that a person is from Australia may allow for adjustment of lexical or word-POS priors." }, { "sent_id": "c78464cfe0fc44f5fd0da2e4f9d90e-C001-25", "text": "Multimodal content such as images or videos included in the post may also provide valuable insights into the likely interpretation for particular words." }, { "sent_id": "c78464cfe0fc44f5fd0da2e4f9d90e-C001-26", "text": "Social network information may also allow for user-specific adjustment of language priors of various types." }, { "sent_id": "c78464cfe0fc44f5fd0da2e4f9d90e-C001-27", "text": "In this sense, the rich context that permeates social media can very much be the friend of NLP, in providing valuable assistance in disambiguating content." 
}, { "sent_id": "c78464cfe0fc44f5fd0da2e4f9d90e-C001-28", "text": "Turning to the question of why the majority of social media analysis makes use of simple language analysis such as word counts for a canned set of query terms, I suggest that the cause is largely because of the constraints imposed on the user by different social media APIs, and also the relative accessibility of such simple techniques, as compared to full-strength NLP." }, { "sent_id": "c78464cfe0fc44f5fd0da2e4f9d90e-C001-29", "text": "I go on to claim that \"the tail has been wagging the dog\" in social media research, in the sense that while impressive results have been achieved for particular application types, the choice of application has been constrained by what is achievable with relatively simple keyword analysis." }, { "sent_id": "c78464cfe0fc44f5fd0da2e4f9d90e-C001-30", "text": "For example, searching for keywords relating to earthquakes or influenza allows for impressive results to be achieved in earthquake detection or influenza outbreak analysis (Sakaki et al., 2010; Ritterman et al., 2009 )." }, { "sent_id": "c78464cfe0fc44f5fd0da2e4f9d90e-C001-31", "text": "However, this style of approach presupposes a highly-constrained, predetermined information need which is expressible in a small number of relatively unambiguous query terms." }, { "sent_id": "c78464cfe0fc44f5fd0da2e4f9d90e-C001-32", "text": "In applications such as trend analysis, the information need is more open-ended and it is unreasonable to expect that a static set of keywords will capture new trends." }, { "sent_id": "c78464cfe0fc44f5fd0da2e4f9d90e-C001-33", "text": "Even for highly-constrained information needs, there may not be a high-precision set of query terms which provide the necessary information." 
}, { "sent_id": "c78464cfe0fc44f5fd0da2e4f9d90e-C001-34", "text": "While it is certainly not the case that full-blown NLP is needed in all social media applications, it is equally not correct to say that NLP is overkill for all social media analysis." }, { "sent_id": "c78464cfe0fc44f5fd0da2e4f9d90e-C001-35", "text": "Rather, the emergence of more mature, robust NLP technologies tailored to social media data will enable new opportunities for social media analysis, earning new friends for NLP in the process." } ], "y": { "@BACK@": { "gold_contexts": [ [ "c78464cfe0fc44f5fd0da2e4f9d90e-C001-15" ] ], "cite_sentences": [ "c78464cfe0fc44f5fd0da2e4f9d90e-C001-15" ] } } }, "ABC_6e5ee9176bcc54c3c9c32965765990_49": { "x": [ { "sent_id": "6e5ee9176bcc54c3c9c32965765990-C001-27", "text": "In some cases, we had merely a few hundred words or several tens of thousand words." }, { "sent_id": "6e5ee9176bcc54c3c9c32965765990-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "6e5ee9176bcc54c3c9c32965765990-C001-2", "text": "The iADAATPA 1 project coded as N \u2022 2016-EU-IA-0132 that ended in February 2019 is made for building of customized, domain-specific engines for public administrations from EU Member States." }, { "sent_id": "6e5ee9176bcc54c3c9c32965765990-C001-3", "text": "The consortium of the project decided to use neural machine translation at the beginning of the project." }, { "sent_id": "6e5ee9176bcc54c3c9c32965765990-C001-4", "text": "This represented a challenge for all involved, and the positive aspect is that all public administrations engaged in the iADAATPA project were able to try, test and use state-of-the-art neural technology with a high level of satisfaction." }, { "sent_id": "6e5ee9176bcc54c3c9c32965765990-C001-5", "text": "One of the main challenges faced by all partners was data availability." 
}, { "sent_id": "6e5ee9176bcc54c3c9c32965765990-C001-6", "text": "Although all public administrations had some data available, it was clearly insufficient for high-level customization." }, { "sent_id": "6e5ee9176bcc54c3c9c32965765990-C001-7", "text": "In some cases, we had merely a few hundred words or several tens of thousand words." }, { "sent_id": "6e5ee9176bcc54c3c9c32965765990-C001-8", "text": "Each domain (field) has its own unique word distribution and neural machine translation systems are known to suffer a decrease in performance when data is out-of-domain." }, { "sent_id": "6e5ee9176bcc54c3c9c32965765990-C001-9", "text": "Pangeanic is a language service provider (LSP) specialised in natural language processing and machine translation." }, { "sent_id": "6e5ee9176bcc54c3c9c32965765990-C001-10", "text": "It provides solutions to cognitive companies, institutions, translation professionals, and corporations." }, { "sent_id": "6e5ee9176bcc54c3c9c32965765990-C001-11", "text": "The problem faced by the iADAATPA project at Pangeanic was twofold: Data acquisition For translation from Spanish to Russian there was no available in-domain data." }, { "sent_id": "6e5ee9176bcc54c3c9c32965765990-C001-12", "text": "Therefore, 2 translators were contracted as part of the project to create 30,000 segments of in-domain data, translating public administrations websites." }, { "sent_id": "6e5ee9176bcc54c3c9c32965765990-C001-13", "text": "They also cleaned United Nations material and post-edited general-domain data that was previously filtered as indomain following the \"invitation model\" (Hoang and Sima'an, 2014) ." }, { "sent_id": "6e5ee9176bcc54c3c9c32965765990-C001-14", "text": "For the other language pairs, the input material was 30,000 post-edited segments." 
}, { "sent_id": "6e5ee9176bcc54c3c9c32965765990-C001-15", "text": "The main part of the training corpora (approximately 75%) was part of Pangeanic's own repository harvested through web crawling and also OpenSubtitles (Tiedemann, 2012) ." }, { "sent_id": "6e5ee9176bcc54c3c9c32965765990-C001-16", "text": "The rest of the corpus was automatically validated synthetic material using general data from Leipzig (Goldhahn et al., 2012 Engine customization The data was cleaned using the Bicleaner tool (S\u00e1nchez-Cartagena et al., 2018) ." }, { "sent_id": "6e5ee9176bcc54c3c9c32965765990-C001-17", "text": "The data was lowercased and extra embeddings were added in order to keep the case information." }, { "sent_id": "6e5ee9176bcc54c3c9c32965765990-C001-18", "text": "The tokenization used was the one provided by OpenNMT 3 and words were divided in subwords according to the BPE (Sennrich et al., 2016) approach." }, { "sent_id": "6e5ee9176bcc54c3c9c32965765990-C001-19", "text": "The models were trained with multi-domain data and we improved performance following a domainmixing approach (Britz et al., 2017) ." }, { "sent_id": "6e5ee9176bcc54c3c9c32965765990-C001-20", "text": "The domain information was prepended with special tokens for each target sequence." }, { "sent_id": "6e5ee9176bcc54c3c9c32965765990-C001-21", "text": "The domain prediction was based only on the source as the extra token was added at target-side and there was no need for apriori domain information." }, { "sent_id": "6e5ee9176bcc54c3c9c32965765990-C001-22", "text": "This approach allowed the model to improve the quality for each domain." }, { "sent_id": "6e5ee9176bcc54c3c9c32965765990-C001-23", "text": "----------------------------------" }, { "sent_id": "6e5ee9176bcc54c3c9c32965765990-C001-24", "text": "****" }, { "sent_id": "6e5ee9176bcc54c3c9c32965765990-C001-25", "text": "One of the main challenges faced by all partners was data availability." 
}, { "sent_id": "6e5ee9176bcc54c3c9c32965765990-C001-26", "text": "Although all public administrations had some data available, it was clearly insufficient for high-level customization." }, { "sent_id": "6e5ee9176bcc54c3c9c32965765990-C001-28", "text": "Each domain (field) has its own unique word distribution and neural machine translation systems are known to suffer a decrease in performance when data is out-of-domain." }, { "sent_id": "6e5ee9176bcc54c3c9c32965765990-C001-29", "text": "Pangeanic is a language service provider (LSP) specialised in natural language processing and machine translation." }, { "sent_id": "6e5ee9176bcc54c3c9c32965765990-C001-30", "text": "It provides solutions to cognitive companies, institutions, translation professionals, and corporations." }, { "sent_id": "6e5ee9176bcc54c3c9c32965765990-C001-31", "text": "The problem faced by the iADAATPA project at Pangeanic was twofold: Data acquisition For translation from Spanish to Russian there was no available in-domain data." }, { "sent_id": "6e5ee9176bcc54c3c9c32965765990-C001-32", "text": "Therefore, 2 translators were contracted as part of the project to create 30,000 segments of in-domain data, translating public administrations websites." }, { "sent_id": "6e5ee9176bcc54c3c9c32965765990-C001-33", "text": "They also cleaned United Nations material and post-edited general-domain data that was previously filtered as indomain following the \"invitation model\" (Hoang and Sima'an, 2014) ." }, { "sent_id": "6e5ee9176bcc54c3c9c32965765990-C001-34", "text": "For the other language pairs, the input material was 30,000 post-edited segments." }, { "sent_id": "6e5ee9176bcc54c3c9c32965765990-C001-35", "text": "The main part of the training corpora (approximately 75%) was part of Pangeanic's own repository harvested through web crawling and also OpenSubtitles (Tiedemann, 2012) ." 
}, { "sent_id": "6e5ee9176bcc54c3c9c32965765990-C001-36", "text": "The rest of the corpus was automatically validated synthetic material using general data from Leipzig (Goldhahn et al., 2012) ." }, { "sent_id": "6e5ee9176bcc54c3c9c32965765990-C001-37", "text": "Engine customization The data was cleaned using the Bicleaner tool (S\u00e1nchez-Cartagena et al., 2018) ." }, { "sent_id": "6e5ee9176bcc54c3c9c32965765990-C001-38", "text": "The data was lowercased and extra embeddings were added in order to keep the case information." }, { "sent_id": "6e5ee9176bcc54c3c9c32965765990-C001-39", "text": "The tokenization used was the one provided by OpenNMT 3 and words were divided in subwords according to the BPE (Sennrich et al., 2016) approach." }, { "sent_id": "6e5ee9176bcc54c3c9c32965765990-C001-40", "text": "The models were trained with multi-domain data and we improved performance following a domainmixing approach (Britz et al., 2017) ." }, { "sent_id": "6e5ee9176bcc54c3c9c32965765990-C001-41", "text": "The domain information was prepended with special tokens for each target sequence." }, { "sent_id": "6e5ee9176bcc54c3c9c32965765990-C001-42", "text": "The domain prediction was based only on the source as the extra token was added at target-side and there was no need for apriori domain information." }, { "sent_id": "6e5ee9176bcc54c3c9c32965765990-C001-43", "text": "This approach allowed the model to improve the quality for each domain." 
} ], "y": { "@USE@": { "gold_contexts": [ [ "6e5ee9176bcc54c3c9c32965765990-C001-19" ], [ "6e5ee9176bcc54c3c9c32965765990-C001-40" ] ], "cite_sentences": [ "6e5ee9176bcc54c3c9c32965765990-C001-19", "6e5ee9176bcc54c3c9c32965765990-C001-40" ] } } }, "ABC_c0d2bbf9dc7615040763019f5c668b_49": { "x": [ { "sent_id": "c0d2bbf9dc7615040763019f5c668b-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "c0d2bbf9dc7615040763019f5c668b-C001-2", "text": "In this talk, I will outline some of the myriad of challenges and opportunities that social media offer for natural language processing." }, { "sent_id": "c0d2bbf9dc7615040763019f5c668b-C001-3", "text": "I will present analysis of how pre-processing can be used to make social media data more amenable to natural language processing, and review a selection of tasks which attempt to harness the considerable potential of different social media services." }, { "sent_id": "c0d2bbf9dc7615040763019f5c668b-C001-4", "text": "----------------------------------" }, { "sent_id": "c0d2bbf9dc7615040763019f5c668b-C001-5", "text": "****" }, { "sent_id": "c0d2bbf9dc7615040763019f5c668b-C001-6", "text": "There is no question that social media are fantastically popular and varied in form -ranging from user forums, to microblogs such as Twitter, to social networking sites such as Facebook -and that much of the content they host is in the form of natural language." }, { "sent_id": "c0d2bbf9dc7615040763019f5c668b-C001-7", "text": "This would suggest a myriad of opportunities for natural language processing (NLP) , and yet much of the applied research on social media which uses language data is based on superficial analysis, often in the form of simple keyword search." }, { "sent_id": "c0d2bbf9dc7615040763019f5c668b-C001-8", "text": "This begs the question: Are NLP methods not suited to social media analysis?" }, { "sent_id": "c0d2bbf9dc7615040763019f5c668b-C001-9", "text": "Conversely, is social media data too challenging for modern-day NLP?" 
}, { "sent_id": "c0d2bbf9dc7615040763019f5c668b-C001-10", "text": "Alternatively, are simple term search-based methods sufficient for social media analysis, i.e. is NLP overkill for social media?" }, { "sent_id": "c0d2bbf9dc7615040763019f5c668b-C001-11", "text": "In exploring these questions, I attempt to answer the overarching question of whether social media data is the friend or foe of NLP." }, { "sent_id": "c0d2bbf9dc7615040763019f5c668b-C001-12", "text": "I approach the question first from the perspective of what challenges social media language poses for NLP." }, { "sent_id": "c0d2bbf9dc7615040763019f5c668b-C001-13", "text": "The most immediate answer is the infamously free-form nature of language in social media, encompassing spelling inconsistencies, the free-form adoption of new terms, and regular violations of English grammar norms." }, { "sent_id": "c0d2bbf9dc7615040763019f5c668b-C001-14", "text": "Unsurprisingly, when NLP tools are applied directly to social media data, the results tend to be miserable when compared to data sets such as the Wall Street Journal component of the Penn Treebank." }, { "sent_id": "c0d2bbf9dc7615040763019f5c668b-C001-15", "text": "However, there have been recent successes in adapting parsers and POS taggers to social media data (Foster et al., 2011; Gimpel et al., 2011) ." }, { "sent_id": "c0d2bbf9dc7615040763019f5c668b-C001-16", "text": "Additionally, lexical normalisation and other preprocessing strategies have been shown to enhance the performance of NLP tools over social media data (Lui and Baldwin, 2012; Han et al., to appear) ." }, { "sent_id": "c0d2bbf9dc7615040763019f5c668b-C001-17", "text": "Furthermore, social media posts tend to be short and the content highly varied, meaning it is difficult to adapt a tool to the domain, or harness textual context to disambiguate the content." 
}, { "sent_id": "c0d2bbf9dc7615040763019f5c668b-C001-18", "text": "There is also the engineering challenge of real-time processing of the text stream, as much of NLP research is carried out offline with only secondary concern for throughput." }, { "sent_id": "c0d2bbf9dc7615040763019f5c668b-C001-19", "text": "As such, we might conclude that social media data is a foe of NLP, in that it challenges traditional assumptions made in NLP research on the nature of the target text and the requirements for real-time responsiveness." }, { "sent_id": "c0d2bbf9dc7615040763019f5c668b-C001-20", "text": "However, if we look beyond the immediate text content of social media, we quickly realise that there are various non-textual data sources that can be used to enhance the robustness and accuracy of NLP models, in a way which is not possible with static text corpora." }, { "sent_id": "c0d2bbf9dc7615040763019f5c668b-C001-21", "text": "For example, simple information on the author of a post can be used to develop authoradapted models based on the previous posts of the same individual (at least for users who post sufficiently large volumes of data)." }, { "sent_id": "c0d2bbf9dc7615040763019f5c668b-C001-22", "text": "Links in the post can be used to disambiguate the textual content of the post, whether in the form of URLs and the content contained in the target document(s), hashtags and the content of other similarly-tagged posts, thread-ing structure in web user forums, or addressee information and the content of posts from that individual." }, { "sent_id": "c0d2bbf9dc7615040763019f5c668b-C001-23", "text": "Simple timestamp information may provide insights into what timezone the user is likely to be based in, allowing for adjustment of language priors for use in language identification." }, { "sent_id": "c0d2bbf9dc7615040763019f5c668b-C001-24", "text": "User-declared metadata may also provide valuable information on the probable interpretation of a given post, e.g. 
knowing that a person is from Australia may allow for adjustment of lexical or word-POS priors." }, { "sent_id": "c0d2bbf9dc7615040763019f5c668b-C001-25", "text": "Multimodal content such as images or videos included in the post may also provide valuable insights into the likely interpretation for particular words." }, { "sent_id": "c0d2bbf9dc7615040763019f5c668b-C001-26", "text": "Social network information may also allow for user-specific adjustment of language priors of various types." }, { "sent_id": "c0d2bbf9dc7615040763019f5c668b-C001-27", "text": "In this sense, the rich context that permeates social media can very much be the friend of NLP, in providing valuable assistance in disambiguating content." }, { "sent_id": "c0d2bbf9dc7615040763019f5c668b-C001-28", "text": "Turning to the question of why the majority of social media analysis makes use of simple language analysis such as word counts for a canned set of query terms, I suggest that the cause is largely because of the constraints imposed on the user by different social media APIs, and also the relative accessibility of such simple techniques, as compared to full-strength NLP." }, { "sent_id": "c0d2bbf9dc7615040763019f5c668b-C001-29", "text": "I go on to claim that \"the tail has been wagging the dog\" in social media research, in the sense that while impressive results have been achieved for particular application types, the choice of application has been constrained by what is achievable with relatively simple keyword analysis." }, { "sent_id": "c0d2bbf9dc7615040763019f5c668b-C001-30", "text": "For example, searching for keywords relating to earthquakes or influenza allows for impressive results to be achieved in earthquake detection or influenza outbreak analysis (Sakaki et al., 2010; Ritterman et al., 2009 )." 
}, { "sent_id": "c0d2bbf9dc7615040763019f5c668b-C001-31", "text": "However, this style of approach presupposes a highly-constrained, predetermined information need which is expressible in a small number of relatively unambiguous query terms." }, { "sent_id": "c0d2bbf9dc7615040763019f5c668b-C001-32", "text": "In applications such as trend analysis, the information need is more open-ended and it is unreasonable to expect that a static set of keywords will capture new trends." }, { "sent_id": "c0d2bbf9dc7615040763019f5c668b-C001-33", "text": "Even for highly-constrained information needs, there may not be a high-precision set of query terms which provide the necessary information." }, { "sent_id": "c0d2bbf9dc7615040763019f5c668b-C001-34", "text": "While it is certainly not the case that full-blown NLP is needed in all social media applications, it is equally not correct to say that NLP is overkill for all social media analysis." }, { "sent_id": "c0d2bbf9dc7615040763019f5c668b-C001-35", "text": "Rather, the emergence of more mature, robust NLP technologies tailored to social media data will enable new opportunities for social media analysis, earning new friends for NLP in the process." } ], "y": { "@BACK@": { "gold_contexts": [ [ "c0d2bbf9dc7615040763019f5c668b-C001-16" ] ], "cite_sentences": [ "c0d2bbf9dc7615040763019f5c668b-C001-16" ] } } }, "ABC_9b594c5b29175fc8ee598f61609c79_49": { "x": [ { "sent_id": "9b594c5b29175fc8ee598f61609c79-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "9b594c5b29175fc8ee598f61609c79-C001-2", "text": "AMI Meeting Facilitator is a system that performs topic segmentation and extractive summarisation." }, { "sent_id": "9b594c5b29175fc8ee598f61609c79-C001-3", "text": "It consists of three components: (1) a segmenter that divides a meeting into a number of locally coherent segments, (2) a summarizer that selects the most important utterances from the meeting transcripts." 
}, { "sent_id": "9b594c5b29175fc8ee598f61609c79-C001-4", "text": "and (3) a compression component that removes the less important words from each utterance based on the degree of compression the user specied." }, { "sent_id": "9b594c5b29175fc8ee598f61609c79-C001-5", "text": "The goal of the AMI Meeting Facilitator is two-fold: rst, we want to provide sucient visual aids for users to interpret what is going on in a recorded meeting; second, we want to support the development of downstream information retrieval and information extraction modules with the information about the topics and summaries in meeting segments." }, { "sent_id": "9b594c5b29175fc8ee598f61609c79-C001-6", "text": "The AMI Meeting Segmenter is trained using a set of 50 meetings that are seperate from the input meeting." }, { "sent_id": "9b594c5b29175fc8ee598f61609c79-C001-7", "text": "We rst extract features from the audio and video recording of the input meeting in order to train the Maximum Entropy (MaxEnt) models for classifying topic boundaries and non-topic boundaries." }, { "sent_id": "9b594c5b29175fc8ee598f61609c79-C001-8", "text": "Then we test each utterance in the input meeting on the Segmenter to see if it is a topic boundary or not." }, { "sent_id": "9b594c5b29175fc8ee598f61609c79-C001-9", "text": "The features we use include the following ve categories: (1) Conversational Feature: These include a set of seven conversational features, including the amount of overlapping speech, the amount of silence between speaker segments, the level of similarity of speaker activity, the number of cue words, and the predictions of LCSEG (i.e., the lexical cohesion statistics, the estimated posterior probability, the predicted class)." }, { "sent_id": "9b594c5b29175fc8ee598f61609c79-C001-10", "text": "(2) Lexical Feature: Each spurt is represented as a vector space of uni-grams, wherein a vector is 1 or 0 depending on whether the cue word appears in the spurt." 
}, { "sent_id": "9b594c5b29175fc8ee598f61609c79-C001-11", "text": "(3) Prosodic Feature: These include dialogue-act (DA) rate-of-speech, maximum F0 of the DA, mean energy of the DA, amount of silence in the DA, precedent and subsequent pauses, and duration of the DA." }, { "sent_id": "9b594c5b29175fc8ee598f61609c79-C001-12", "text": "(4) Motion Feature: These include the average magnitude of speaker movements, which is measured by the number of pixels changed, over the frames of 40 ms within the spurt." }, { "sent_id": "9b594c5b29175fc8ee598f61609c79-C001-13", "text": "(5) Contextual Feature: These include the dialogue act types and the speaker role (e.g., project manager, marketing expert)." }, { "sent_id": "9b594c5b29175fc8ee598f61609c79-C001-14", "text": "In the dialogue act annotations, each dialogue act is classied as one of the 15 types." }, { "sent_id": "9b594c5b29175fc8ee598f61609c79-C001-15", "text": "The AMI summarizer is trained using a set of 98 scenario meetings." }, { "sent_id": "9b594c5b29175fc8ee598f61609c79-C001-16", "text": "We train a support vector machine (SVM) on these meetings, using 26 features relating to the following categories: (1) Prosodic Features: These include dialogueact (DA) rate-of-speech, maximum F0 of the DA, mean energy of the DA, amount of silence in the DA, precedent and subsequent pauses, 9" }, { "sent_id": "9b594c5b29175fc8ee598f61609c79-C001-17", "text": "----------------------------------" }, { "sent_id": "9b594c5b29175fc8ee598f61609c79-C001-18", "text": "****" }, { "sent_id": "9b594c5b29175fc8ee598f61609c79-C001-19", "text": "1 Introduction AMI Meeting Facilitator is a system that performs topic segmentation and extractive summarisation." }, { "sent_id": "9b594c5b29175fc8ee598f61609c79-C001-20", "text": "It consists of three components: (1) a segmenter that divides a meeting into a number of locally coherent segments, (2) a summarizer that selects the most important utterances from the meeting transcripts." 
}, { "sent_id": "9b594c5b29175fc8ee598f61609c79-C001-21", "text": "and (3) a compression component that removes the less important words from each utterance based on the degree of compression the user specied." }, { "sent_id": "9b594c5b29175fc8ee598f61609c79-C001-22", "text": "The goal of the AMI Meeting Facilitator is two-fold: rst, we want to provide sucient visual aids for users to interpret what is going on in a recorded meeting; second, we want to support the development of downstream information retrieval and information extraction modules with the information about the topics and summaries in meeting segments." }, { "sent_id": "9b594c5b29175fc8ee598f61609c79-C001-23", "text": "The AMI Meeting Segmenter is trained using a set of 50 meetings that are seperate from the input meeting." }, { "sent_id": "9b594c5b29175fc8ee598f61609c79-C001-24", "text": "We rst extract features from the audio and video recording of the input meeting in order to train the Maximum Entropy (MaxEnt) models for classifying topic boundaries and non-topic boundaries." }, { "sent_id": "9b594c5b29175fc8ee598f61609c79-C001-25", "text": "Then we test each utterance in the input meeting on the Segmenter to see if it is a topic boundary or not." }, { "sent_id": "9b594c5b29175fc8ee598f61609c79-C001-26", "text": "The features we use include the following ve categories: (1) Conversational Feature: These include a set of seven conversational features, including the amount of overlapping speech, the amount of silence between speaker segments, the level of similarity of speaker activity, the number of cue words, and the predictions of LCSEG (i.e., the lexical cohesion statistics, the estimated posterior probability, the predicted class)." }, { "sent_id": "9b594c5b29175fc8ee598f61609c79-C001-27", "text": "(2) Lexical Feature: Each spurt is represented as a vector space of uni-grams, wherein a vector is 1 or 0 depending on whether the cue word appears in the spurt." 
}, { "sent_id": "9b594c5b29175fc8ee598f61609c79-C001-28", "text": "(3) Prosodic Feature: These include dialogue-act (DA) rate-of-speech, maximum F0 of the DA, mean energy of the DA, amount of silence in the DA, precedent and subsequent pauses, and duration of the DA." }, { "sent_id": "9b594c5b29175fc8ee598f61609c79-C001-29", "text": "(4) Motion Feature: These include the average magnitude of speaker movements, which is measured by the number of pixels changed, over the frames of 40 ms within the spurt." }, { "sent_id": "9b594c5b29175fc8ee598f61609c79-C001-30", "text": "(5) Contextual Feature: These include the dialogue act types and the speaker role (e.g., project manager, marketing expert)." }, { "sent_id": "9b594c5b29175fc8ee598f61609c79-C001-31", "text": "In the dialogue act annotations, each dialogue act is classied as one of the 15 types." }, { "sent_id": "9b594c5b29175fc8ee598f61609c79-C001-32", "text": "----------------------------------" }, { "sent_id": "9b594c5b29175fc8ee598f61609c79-C001-33", "text": "**SUMMARIZATION**" }, { "sent_id": "9b594c5b29175fc8ee598f61609c79-C001-34", "text": "The AMI summarizer is trained using a set of 98 scenario meetings." }, { "sent_id": "9b594c5b29175fc8ee598f61609c79-C001-35", "text": "We train a support vector machine (SVM) on these meetings, using 26 features relating to the following categories: (1) Prosodic Features: These include dialogueact (DA) rate-of-speech, maximum F0 of the DA, mean energy of the DA, amount of silence in the DA, precedent and subsequent pauses, We use two types of term weighting: tf.idf, which is based on words that are frequent in the meeting but rare across a set of other meetings or documents, and a second weighting feature which relates to how word usage varies between the four meeting participants." 
}, { "sent_id": "9b594c5b29175fc8ee598f61609c79-C001-36", "text": "After training the SVM, we test on each meeting of the 20 meeting test set in turn, ranking the dialogue acts from most probable to least probable in terms of being extract-worthy." }, { "sent_id": "9b594c5b29175fc8ee598f61609c79-C001-37", "text": "Such a ranking allows the user to create a summary of whatever length she desires." }, { "sent_id": "9b594c5b29175fc8ee598f61609c79-C001-38", "text": "----------------------------------" }, { "sent_id": "9b594c5b29175fc8ee598f61609c79-C001-39", "text": "**COMPRESSION**" }, { "sent_id": "9b594c5b29175fc8ee598f61609c79-C001-40", "text": "Each dialogue act has its constituent words scored using tf.idf, and as the user compresses the meeting to a greater degree the browser gradually removes the less important words from each dialogue act, leaving only the most informative material of the meeting." }, { "sent_id": "9b594c5b29175fc8ee598f61609c79-C001-41", "text": "Previous work has explored the eect of lexical cohesion and conversational features on characterizing topic boundaries, following Galley et al.(2003) ." }, { "sent_id": "9b594c5b29175fc8ee598f61609c79-C001-42", "text": "In previous work, we have also studied the problem of predicting topic boundaries at dierent levels of granularity and showed that a supervised classication approach performs better on predicting a coarser level of topic segmentation (Hsueh et al., 2006" } ], "y": { "@BACK@": { "gold_contexts": [ [ "9b594c5b29175fc8ee598f61609c79-C001-42" ] ], "cite_sentences": [ "9b594c5b29175fc8ee598f61609c79-C001-42" ] } } }, "ABC_6ca7c7782c33f51e9bdb2e8613c24d_49": { "x": [ { "sent_id": "6ca7c7782c33f51e9bdb2e8613c24d-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "6ca7c7782c33f51e9bdb2e8613c24d-C001-2", "text": "Websites of businesses should accommodate both customer needs and business requirements." 
}, { "sent_id": "6ca7c7782c33f51e9bdb2e8613c24d-C001-3", "text": "Traditional menu-driven navigation and key word search do not allow users to describe their intentions precisely." }, { "sent_id": "6ca7c7782c33f51e9bdb2e8613c24d-C001-4", "text": "We have developed a conversational interface to online shopping that provides convenient, personalized access to information using natural language dialog." }, { "sent_id": "6ca7c7782c33f51e9bdb2e8613c24d-C001-5", "text": "User studies show significantly reduced length of interactions in terms of time and number of clicks in finding products." }, { "sent_id": "6ca7c7782c33f51e9bdb2e8613c24d-C001-6", "text": "The core dialog engine is easily adaptable to other domains." }, { "sent_id": "6ca7c7782c33f51e9bdb2e8613c24d-C001-7", "text": "----------------------------------" }, { "sent_id": "6ca7c7782c33f51e9bdb2e8613c24d-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "6ca7c7782c33f51e9bdb2e8613c24d-C001-9", "text": "Natural language dialog has been used in many areas, such as for call-center/routing application (Carpenter & Chu-Carroll 1998) , email routing (Walker, Fromer & Narayanan 1998), information retrieval and database access (Androutsopoulos & Ritchie 1995) , and for telephony banking (Zadrozny et al. 1998) ." }, { "sent_id": "6ca7c7782c33f51e9bdb2e8613c24d-C001-10", "text": "In this demonstration, we present a natural language dialog interface to online shopping." }, { "sent_id": "6ca7c7782c33f51e9bdb2e8613c24d-C001-11", "text": "Our user studies show natural language dialog to be a very effective means for negotiating user's requests and intentions in this domain." 
}, { "sent_id": "6ca7c7782c33f51e9bdb2e8613c24d-C001-12", "text": "----------------------------------" }, { "sent_id": "6ca7c7782c33f51e9bdb2e8613c24d-C001-13", "text": "**SYSTEM ARCHITECTURE**" }, { "sent_id": "6ca7c7782c33f51e9bdb2e8613c24d-C001-14", "text": "In our system, a presentation manager captures queries from users, employs a parser to transform the user's query into a logical form, and sends the logical form to a dialog manager." }, { "sent_id": "6ca7c7782c33f51e9bdb2e8613c24d-C001-15", "text": "The presentation manager is also responsible for obtaining the system's response from the dialog manager and presenting it to the user using template-based generation." }, { "sent_id": "6ca7c7782c33f51e9bdb2e8613c24d-C001-16", "text": "The dialog manager formulates action plans for an action manager to perform backend tasks such as database access, business transactions, etc." }, { "sent_id": "6ca7c7782c33f51e9bdb2e8613c24d-C001-17", "text": "The dialog manager applies information state-based dialog strategies to formulate responses depending on the current state, discourse history and the action results from the action manager." }, { "sent_id": "6ca7c7782c33f51e9bdb2e8613c24d-C001-18", "text": "The Data Management Subsystem maintains a \"concept\" repository with common sense \"concepts\" and a phrasal lexicon that lists possible ways for referring to the concepts." }, { "sent_id": "6ca7c7782c33f51e9bdb2e8613c24d-C001-19", "text": "Business Rules map concepts to business specifications by defining concepts using a propositional logic formula of constraints over product specifications." }, { "sent_id": "6ca7c7782c33f51e9bdb2e8613c24d-C001-20", "text": "Thus, the Business Rules reflect business goals and decisions." 
}, { "sent_id": "6ca7c7782c33f51e9bdb2e8613c24d-C001-21", "text": "The Extended Database combines product specifications and precompiled evaluations of the concept definitions for each product to provide a representation that guides the natural language dialog." }, { "sent_id": "6ca7c7782c33f51e9bdb2e8613c24d-C001-22", "text": "We are investigating automated tools for helping developers and maintainers extract relevant concepts and terms on the basis of user descriptions and queries about products." }, { "sent_id": "6ca7c7782c33f51e9bdb2e8613c24d-C001-23", "text": "----------------------------------" }, { "sent_id": "6ca7c7782c33f51e9bdb2e8613c24d-C001-24", "text": "**EVALUATION**" }, { "sent_id": "6ca7c7782c33f51e9bdb2e8613c24d-C001-25", "text": "We conducted several user studies to evaluate the usability of NLA (Chai et al. 2000) ." }, { "sent_id": "6ca7c7782c33f51e9bdb2e8613c24d-C001-26", "text": "In one study, seventeen test subjects preferred the dialog-driven navigation of NLA two to one over menu-driven navigation." }, { "sent_id": "6ca7c7782c33f51e9bdb2e8613c24d-C001-27", "text": "Moreover, with NLA, the average number of clicks was reduced by 63.2% and the average time was reduced by 33.3%." }, { "sent_id": "6ca7c7782c33f51e9bdb2e8613c24d-C001-28", "text": "Analysis of the user queries (average length = 5.31 words long; standard deviation = 2.62; 85% of inputs are noun phrases) revealed the brevity and relative linguistic simplicity of user input." }, { "sent_id": "6ca7c7782c33f51e9bdb2e8613c24d-C001-29", "text": "Hence, shallow parsing techniques were adequate for processing user input." }, { "sent_id": "6ca7c7782c33f51e9bdb2e8613c24d-C001-30", "text": "In general, sophisticated dialog management appears to be more important than the ability to handle complex natural language sentences." }, { "sent_id": "6ca7c7782c33f51e9bdb2e8613c24d-C001-31", "text": "The user studies also highlighted the need to combine multiple modalities and styles of interaction." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "6ca7c7782c33f51e9bdb2e8613c24d-C001-9" ] ], "cite_sentences": [ "6ca7c7782c33f51e9bdb2e8613c24d-C001-9" ] }, "@EXT@": { "gold_contexts": [ [ "6ca7c7782c33f51e9bdb2e8613c24d-C001-10", "6ca7c7782c33f51e9bdb2e8613c24d-C001-9" ] ], "cite_sentences": [ "6ca7c7782c33f51e9bdb2e8613c24d-C001-9" ] } } }, "ABC_600317fc3ce88ea730993d3cc94f19_49": { "x": [ { "sent_id": "600317fc3ce88ea730993d3cc94f19-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "600317fc3ce88ea730993d3cc94f19-C001-2", "text": "In this paper, we propose a new probabilistic GLR parsing method that can solve the problems of conventional methods." }, { "sent_id": "600317fc3ce88ea730993d3cc94f19-C001-3", "text": "Our proposed Conditional Action Model uses Surface Phrasal Types (SPTs) encoding the functional word sequences of the sub-trees for describing structural characteristics of the partial parse. And, the proposed GLR model outperforms the previous methods by about 6~8%." }, { "sent_id": "600317fc3ce88ea730993d3cc94f19-C001-4", "text": "----------------------------------" }, { "sent_id": "600317fc3ce88ea730993d3cc94f19-C001-5", "text": "**INTRODUCTION**" }, { "sent_id": "600317fc3ce88ea730993d3cc94f19-C001-6", "text": "Since the first approach [Wright and Wrigley 1991] of combining a probabilistic method into the GLR technique was published, Some probabilistic GLR parsers also have been implemented in which probabilities are assigned to actions of LR parsing tables by using lookaheads or LR states as simple context information of [Briscoe and Carroll 1993] , [Kentaro et al. 1998 ], and [Ruland, 2000] which does not use the stack information of the GLR parser effectively, because of highly complex internal GLR stack." }, { "sent_id": "600317fc3ce88ea730993d3cc94f19-C001-7", "text": "As a result, they have used relatively limited contextual information for disambiguation." 
}, { "sent_id": "600317fc3ce88ea730993d3cc94f19-C001-8", "text": "[Kwak et al., 2001] have proposed a conditional action model that uses the partially constructed parse represented by the graph-structured stack as the additional context." }, { "sent_id": "600317fc3ce88ea730993d3cc94f19-C001-9", "text": "However, this method inappropriately defined sub-tree structure." }, { "sent_id": "600317fc3ce88ea730993d3cc94f19-C001-10", "text": "Our proposed model uses Surface Phrasal Types representing the structural characteristics of the sub-trees for its additional contextual information." }, { "sent_id": "600317fc3ce88ea730993d3cc94f19-C001-11", "text": "----------------------------------" }, { "sent_id": "600317fc3ce88ea730993d3cc94f19-C001-12", "text": "**CONDITIONAL ACTION MODEL(CAM) USING SURFACE PHRASAL TYPE (SPT)**" }, { "sent_id": "600317fc3ce88ea730993d3cc94f19-C001-13", "text": "CAM is devised based on the hypothesis that this model can actively use rich information provided by the partially constructed parse built on the graph-structured stack, and thus estimate the probability of the shift/reduce actions more precisely [Kwak et al., 2001] ." }, { "sent_id": "600317fc3ce88ea730993d3cc94f19-C001-14", "text": "Surface Phrasal Type (SPT) is represented by a sequence of the primitive mnemonics which describes the specific types of phrases based on their terminal nodes." }, { "sent_id": "600317fc3ce88ea730993d3cc94f19-C001-15", "text": "In this work, we use functional words for mnemonics in SPT." }, { "sent_id": "600317fc3ce88ea730993d3cc94f19-C001-16", "text": "In Korean, the functional word system is highly developed in the morpheme level." }, { "sent_id": "600317fc3ce88ea730993d3cc94f19-C001-17", "text": "Therefore, this kind of phrasal description is meaningful way of representing the parse structure without considering the internal relation of the parse forest." 
}, { "sent_id": "600317fc3ce88ea730993d3cc94f19-C001-18", "text": "Moreover, this scheme can avoid the overhead of taking care of the packed node with a local ambiguity." }, { "sent_id": "600317fc3ce88ea730993d3cc94f19-C001-19", "text": "We represent SPTs as the corresponding mnemonic sequence(in backward order) as shown in Figure 1 ." }, { "sent_id": "600317fc3ce88ea730993d3cc94f19-C001-20", "text": "We have defined mnemonic sets of SPT combination for the production of the noun phases and verb phrases, respectively." }, { "sent_id": "600317fc3ce88ea730993d3cc94f19-C001-21", "text": "Example mnemonic sets for the both production forms are shown in Table 1 .Elements in each mnemonic set consist of representatives of part-of-speeches (POSs) with the same syntactic (structural) function." }, { "sent_id": "600317fc3ce88ea730993d3cc94f19-C001-22", "text": "For probabilistic model, we define the entire parse of the given input sentence as the sequence of actions taken until the parser reaches the accept state." }, { "sent_id": "600317fc3ce88ea730993d3cc94f19-C001-23", "text": "Thus, the probability of the i-th action and parse probability are calculated by the following formula:" }, { "sent_id": "600317fc3ce88ea730993d3cc94f19-C001-24", "text": "sub-trees along the reduce route." }, { "sent_id": "600317fc3ce88ea730993d3cc94f19-C001-25", "text": "s i-1 indicates the number of the state nodes at the top of the stack, and l i is the lookahead symbol (POS) read by the parser., and represents the i-th action." }, { "sent_id": "600317fc3ce88ea730993d3cc94f19-C001-26", "text": "Then, the probability of a parse tree can be calculated by the product of all action probabilities." }, { "sent_id": "600317fc3ce88ea730993d3cc94f19-C001-27", "text": "To cope with the sparse data problem when using our probabilistic model, we use a deleted interpolation method with the backing-off strategy similar to [Collins, 1999] ." 
}, { "sent_id": "600317fc3ce88ea730993d3cc94f19-C001-28", "text": "----------------------------------" }, { "sent_id": "600317fc3ce88ea730993d3cc94f19-C001-29", "text": "**EXPERIMENTAL RESULTS**" }, { "sent_id": "600317fc3ce88ea730993d3cc94f19-C001-30", "text": "We have experimented on the Korean treebank which consists of 12,084 sentences tagged with Korean grammar scheme with 56 CFG rules of [Park et al. 1999] ." }, { "sent_id": "600317fc3ce88ea730993d3cc94f19-C001-31", "text": "The distribution of sentence length over the corpus is shown in Table 3 ." }, { "sent_id": "600317fc3ce88ea730993d3cc94f19-C001-32", "text": "We have used 10,906 sentences for the training data and 1,178 sentences for the test data." }, { "sent_id": "600317fc3ce88ea730993d3cc94f19-C001-33", "text": "Average morpheme length is 22.5." }, { "sent_id": "600317fc3ce88ea730993d3cc94f19-C001-34", "text": "For CAM, because of the sparse data problem, we have restricted the maximum continuous repetition count of the same mnemonic and the maximum length of one SPT to 1 and 3, respectively (empirically optimal value) Our GLR parser uses the canonical SLR(1) parsing table constructed from the binary CFG entries provided by the CFG grammar." }, { "sent_id": "600317fc3ce88ea730993d3cc94f19-C001-35", "text": "Table 2 , our proposed model outperforms previous models by about 6~8 %(Upper and lower parts show the results for training data and test data, respectively)." }, { "sent_id": "600317fc3ce88ea730993d3cc94f19-C001-36", "text": "Furthermore, the performance of our parser could be improved if it is integrated with the properly lexicalized information." }, { "sent_id": "600317fc3ce88ea730993d3cc94f19-C001-37", "text": "The results show that functional category is an effective way of describing structural aspects of a phrase and can be used as contextual information in GLR parsing." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "600317fc3ce88ea730993d3cc94f19-C001-6" ] ], "cite_sentences": [ "600317fc3ce88ea730993d3cc94f19-C001-6" ] }, "@MOT@": { "gold_contexts": [ [ "600317fc3ce88ea730993d3cc94f19-C001-10", "600317fc3ce88ea730993d3cc94f19-C001-6", "600317fc3ce88ea730993d3cc94f19-C001-7", "600317fc3ce88ea730993d3cc94f19-C001-8", "600317fc3ce88ea730993d3cc94f19-C001-9" ] ], "cite_sentences": [ "600317fc3ce88ea730993d3cc94f19-C001-6" ] } } }, "ABC_a0614f13b4ed0c6370deb26032f62b_49": { "x": [ { "sent_id": "a0614f13b4ed0c6370deb26032f62b-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "a0614f13b4ed0c6370deb26032f62b-C001-2", "text": "This book provides a detailed overview of Regulus, an open-source toolkit for building and compiling grammars for spoken dialog systems, which has been used for a number of applications including Clarissa, a spoken language system in use on the International Space Station." }, { "sent_id": "a0614f13b4ed0c6370deb26032f62b-C001-3", "text": "The focus is on controlled-language user-initiative systems, in which the user drives the dialog, but is required to use constrained language." }, { "sent_id": "a0614f13b4ed0c6370deb26032f62b-C001-4", "text": "Such a spoken language interface allows for a constrained range of commands-for example, \"open the pod-bay doors\"-to be issued in circumstances where, ergonomically, other interface options are infeasible." }, { "sent_id": "a0614f13b4ed0c6370deb26032f62b-C001-5", "text": "The emphasis of the approach, given the kind of application that is the focus of the work, is thus less on robustness of speech recognition and more on depth of semantic processing and quality of the dialog management." 
}, { "sent_id": "a0614f13b4ed0c6370deb26032f62b-C001-6", "text": "It is an interesting book, one which succeeds in motivating key problems and presenting general approaches to solving them, with enough in the way of explicit details to allow even a complete novice in spoken language processing to implement simple dialog systems." }, { "sent_id": "a0614f13b4ed0c6370deb26032f62b-C001-7", "text": "The book is split into two parts." }, { "sent_id": "a0614f13b4ed0c6370deb26032f62b-C001-8", "text": "The first half is a very detailed tutorial on using Regulus to build and compile grammars." }, { "sent_id": "a0614f13b4ed0c6370deb26032f62b-C001-9", "text": "Regulus, although itself open-source, makes use of SICStus Prolog for the dialog processing and the Nuance Toolkit for speech recognition." }, { "sent_id": "a0614f13b4ed0c6370deb26032f62b-C001-10", "text": "Grammars in Regulus are specified with features requiring unification, but are compiled into context-free grammars for use with the Nuance speech recognition system." }, { "sent_id": "a0614f13b4ed0c6370deb26032f62b-C001-11", "text": "An abundance of implementation details and guidance for the reader (or grammar writer) is provided in this part of the book for building grammars as well as a dialog manager." }, { "sent_id": "a0614f13b4ed0c6370deb26032f62b-C001-12", "text": "The presentation includes example implementations handling such phenomena as ellipsis and corrections." }, { "sent_id": "a0614f13b4ed0c6370deb26032f62b-C001-13", "text": "In addition, details for building spoken language translation systems are presented within the same range of constrained language applications." }, { "sent_id": "a0614f13b4ed0c6370deb26032f62b-C001-14", "text": "The tutorial format is terrifically explicit, which will make this volume appropriate for undergraduate courses looking to provide students with hands-on exercises in building spoken dialog systems." 
}, { "sent_id": "a0614f13b4ed0c6370deb26032f62b-C001-15", "text": "One issue with the premise of an open-source toolkit that relies upon other software (SICStus Prolog, Nuance) for key parts of the application is that one is required to obtain and use that software." }, { "sent_id": "a0614f13b4ed0c6370deb26032f62b-C001-16", "text": "In the book, the authors note that an individual license of SICStus is available for a relatively small fee, and that Nuance has a program to license their Toolkit for research purposes." }, { "sent_id": "a0614f13b4ed0c6370deb26032f62b-C001-17", "text": "Unfortunately, since the writing of the book, corporate changes at Nuance have made obtaining such a research license more challenging, and this reviewer was only able to do so after several weeks of e-mail persistence." }, { "sent_id": "a0614f13b4ed0c6370deb26032f62b-C001-18", "text": "It would be very beneficial to future readers if the authors would scout out the current state of" }, { "sent_id": "a0614f13b4ed0c6370deb26032f62b-C001-19", "text": "----------------------------------" }, { "sent_id": "a0614f13b4ed0c6370deb26032f62b-C001-20", "text": "****" }, { "sent_id": "a0614f13b4ed0c6370deb26032f62b-C001-21", "text": "the process of obtaining such a license and provide a detailed map of it on some easily accessible Web site." }, { "sent_id": "a0614f13b4ed0c6370deb26032f62b-C001-22", "text": "As it stands, the barriers to building the full spoken dialog systems detailed in the book are higher than they should be." }, { "sent_id": "a0614f13b4ed0c6370deb26032f62b-C001-23", "text": "The second half of the book goes under the hood to examine, more generally, the issues involved in making the authors' approach work." }, { "sent_id": "a0614f13b4ed0c6370deb26032f62b-C001-24", "text": "The chapter on compilation of feature grammars into context-free grammars looks in depth at alternatives to exhaustive expansion, including efficient filtering techniques and grammar transformation." 
}, { "sent_id": "a0614f13b4ed0c6370deb26032f62b-C001-25", "text": "There is also a chapter on developing an English feature grammar specifically for integration with a speech recognition system." }, { "sent_id": "a0614f13b4ed0c6370deb26032f62b-C001-26", "text": "A third chapter presents the authors' approach for adapting the general English feature grammar to a particular domain, which they term grammar specialization." }, { "sent_id": "a0614f13b4ed0c6370deb26032f62b-C001-27", "text": "In effect, they induce a new feature grammar by transforming-flattening, to a greater or lesser extent-an automatically produced treebank of domain-specific text and extracting new rules from the treebank, as well as estimating production probabilities from the corpus." }, { "sent_id": "a0614f13b4ed0c6370deb26032f62b-C001-28", "text": "A subsequent chapter investigates the impact of various parameterizations on this grammar specialization approachfor example, the degree of flattening-thus establishing, at least for the applications presented, some best practices." }, { "sent_id": "a0614f13b4ed0c6370deb26032f62b-C001-29", "text": "A final chapter presents a comparison of their grammarbased approach with a class-based n-gram language model." }, { "sent_id": "a0614f13b4ed0c6370deb26032f62b-C001-30", "text": "This half of the book will be more enjoyable for readers of this journal, who are presumably interested in more general questions of computation and language than the step-by-step tutorial format of the first half of the book." }, { "sent_id": "a0614f13b4ed0c6370deb26032f62b-C001-31", "text": "The details of the approach are interesting, particularly the insights about how to build linguistically rich grammars that can be effectively compiled into high-utility context-free grammars for speech recognition." 
}, { "sent_id": "a0614f13b4ed0c6370deb26032f62b-C001-32", "text": "The primary shortcoming of this presentation lies in perpetuating the false dichotomy between \"grammar-based\" and \"data-driven\" approaches to language modeling for speech recognition, which motivates the final chapter of the book." }, { "sent_id": "a0614f13b4ed0c6370deb26032f62b-C001-33", "text": "In fact, the authors' approach is both grammar-based and data-driven, given the corpus-based grammar specialization and PCFG estimation, which the authors themselves demonstrate to be indispensable." }, { "sent_id": "a0614f13b4ed0c6370deb26032f62b-C001-34", "text": "Robust grammar-based language modeling is a topic that has received a fair bit of attention over the past decade (Chelba and Jelinek 2000; Charniak 2001; Roark 2001; Wang, Stolcke, and Harper 2004, among others) , and while this line of research has not focused on the use of manually built, narrow-domain feature grammars, there is enough similarity between the approach described in this book and the cited papers that the papers would seem to be better comparison points than the class-based language models that are chosen to represent robust approaches in the comparison." }, { "sent_id": "a0614f13b4ed0c6370deb26032f62b-C001-35", "text": "Beyond language modeling, methods for PCFG induction from treebanks have been a popular topic in the field over the past decade, and some understanding of the impact of flattening trees can be had in Johnson (1998) , where the beneficial impact of various tree transformations for probabilistic grammars is presented." 
}, { "sent_id": "a0614f13b4ed0c6370deb26032f62b-C001-36", "text": "None of this work is discussed or cited, and the naive reader might be left with the impression that data-driven approaches have been demonstrated to underperform relative to knowledge-based approaches, when in fact the authors simply demonstrate that their hybrid grammar-based/data-driven approach outperforms class-based language models." }, { "sent_id": "a0614f13b4ed0c6370deb26032f62b-C001-37", "text": "Perhaps this is worth demonstrating, but the chapter couches the results within the context of a clash between paradigms, which simply does not ring true." }, { "sent_id": "a0614f13b4ed0c6370deb26032f62b-C001-38", "text": "This one misstep, however, does not detract from the quality of the authors' system, nor from the interesting presentation of too-often-ignored aspects of spoken language systems engineering." }, { "sent_id": "a0614f13b4ed0c6370deb26032f62b-C001-39", "text": "The book and the toolkit it describes can serve a very useful pedagogical purpose by providing a foundation upon which students and researchers" } ], "y": { "@BACK@": { "gold_contexts": [ [ "a0614f13b4ed0c6370deb26032f62b-C001-35" ] ], "cite_sentences": [ "a0614f13b4ed0c6370deb26032f62b-C001-35" ] }, "@MOT@": { "gold_contexts": [ [ "a0614f13b4ed0c6370deb26032f62b-C001-35", "a0614f13b4ed0c6370deb26032f62b-C001-36", "a0614f13b4ed0c6370deb26032f62b-C001-37", "a0614f13b4ed0c6370deb26032f62b-C001-38", "a0614f13b4ed0c6370deb26032f62b-C001-39" ] ], "cite_sentences": [ "a0614f13b4ed0c6370deb26032f62b-C001-35" ] } } }, "ABC_7d29a7d7c19d097758481b466360b1_49": { "x": [ { "sent_id": "7d29a7d7c19d097758481b466360b1-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "7d29a7d7c19d097758481b466360b1-C001-2", "text": "In this paper, we propose a new probabilistic GLR parsing method that can solve the problems of conventional methods." 
}, { "sent_id": "7d29a7d7c19d097758481b466360b1-C001-3", "text": "Our proposed Conditional Action Model uses Surface Phrasal Types (SPTs) encoding the functional word sequences of the sub-trees for describing structural characteristics of the partial parse. And, the proposed GLR model outperforms the previous methods by about 6~8%." }, { "sent_id": "7d29a7d7c19d097758481b466360b1-C001-4", "text": "----------------------------------" }, { "sent_id": "7d29a7d7c19d097758481b466360b1-C001-5", "text": "**INTRODUCTION**" }, { "sent_id": "7d29a7d7c19d097758481b466360b1-C001-6", "text": "Since the first approach [Wright and Wrigley 1991] of combining a probabilistic method into the GLR technique was published, Some probabilistic GLR parsers also have been implemented in which probabilities are assigned to actions of LR parsing tables by using lookaheads or LR states as simple context information of [Briscoe and Carroll 1993] , [Kentaro et al. 1998 ], and [Ruland, 2000] which does not use the stack information of the GLR parser effectively, because of highly complex internal GLR stack." }, { "sent_id": "7d29a7d7c19d097758481b466360b1-C001-7", "text": "As a result, they have used relatively limited contextual information for disambiguation." }, { "sent_id": "7d29a7d7c19d097758481b466360b1-C001-8", "text": "[Kwak et al., 2001] have proposed a conditional action model that uses the partially constructed parse represented by the graph-structured stack as the additional context." }, { "sent_id": "7d29a7d7c19d097758481b466360b1-C001-9", "text": "However, this method inappropriately defined sub-tree structure." }, { "sent_id": "7d29a7d7c19d097758481b466360b1-C001-10", "text": "Our proposed model uses Surface Phrasal Types representing the structural characteristics of the sub-trees for its additional contextual information." 
}, { "sent_id": "7d29a7d7c19d097758481b466360b1-C001-11", "text": "----------------------------------" }, { "sent_id": "7d29a7d7c19d097758481b466360b1-C001-12", "text": "**CONDITIONAL ACTION MODEL(CAM) USING SURFACE PHRASAL TYPE (SPT)**" }, { "sent_id": "7d29a7d7c19d097758481b466360b1-C001-13", "text": "CAM is devised based on the hypothesis that this model can actively use rich information provided by the partially constructed parse built on the graph-structured stack, and thus estimate the probability of the shift/reduce actions more precisely [Kwak et al., 2001] ." }, { "sent_id": "7d29a7d7c19d097758481b466360b1-C001-14", "text": "Surface Phrasal Type (SPT) is represented by a sequence of the primitive mnemonics which describes the specific types of phrases based on their terminal nodes." }, { "sent_id": "7d29a7d7c19d097758481b466360b1-C001-15", "text": "In this work, we use functional words for mnemonics in SPT." }, { "sent_id": "7d29a7d7c19d097758481b466360b1-C001-16", "text": "In Korean, the functional word system is highly developed in the morpheme level." }, { "sent_id": "7d29a7d7c19d097758481b466360b1-C001-17", "text": "Therefore, this kind of phrasal description is meaningful way of representing the parse structure without considering the internal relation of the parse forest." }, { "sent_id": "7d29a7d7c19d097758481b466360b1-C001-18", "text": "Moreover, this scheme can avoid the overhead of taking care of the packed node with a local ambiguity." }, { "sent_id": "7d29a7d7c19d097758481b466360b1-C001-19", "text": "We represent SPTs as the corresponding mnemonic sequence(in backward order) as shown in Figure 1 ." }, { "sent_id": "7d29a7d7c19d097758481b466360b1-C001-20", "text": "We have defined mnemonic sets of SPT combination for the production of the noun phases and verb phrases, respectively." 
}, { "sent_id": "7d29a7d7c19d097758481b466360b1-C001-21", "text": "Example mnemonic sets for the both production forms are shown in Table 1 .Elements in each mnemonic set consist of representatives of part-of-speeches (POSs) with the same syntactic (structural) function." }, { "sent_id": "7d29a7d7c19d097758481b466360b1-C001-22", "text": "For probabilistic model, we define the entire parse of the given input sentence as the sequence of actions taken until the parser reaches the accept state." }, { "sent_id": "7d29a7d7c19d097758481b466360b1-C001-23", "text": "Thus, the probability of the i-th action and parse probability are calculated by the following formula:" }, { "sent_id": "7d29a7d7c19d097758481b466360b1-C001-24", "text": "sub-trees along the reduce route." }, { "sent_id": "7d29a7d7c19d097758481b466360b1-C001-25", "text": "s i-1 indicates the number of the state nodes at the top of the stack, and l i is the lookahead symbol (POS) read by the parser., and represents the i-th action." }, { "sent_id": "7d29a7d7c19d097758481b466360b1-C001-26", "text": "Then, the probability of a parse tree can be calculated by the product of all action probabilities." }, { "sent_id": "7d29a7d7c19d097758481b466360b1-C001-27", "text": "To cope with the sparse data problem when using our probabilistic model, we use a deleted interpolation method with the backing-off strategy similar to [Collins, 1999] ." }, { "sent_id": "7d29a7d7c19d097758481b466360b1-C001-28", "text": "----------------------------------" }, { "sent_id": "7d29a7d7c19d097758481b466360b1-C001-29", "text": "**EXPERIMENTAL RESULTS**" }, { "sent_id": "7d29a7d7c19d097758481b466360b1-C001-30", "text": "We have experimented on the Korean treebank which consists of 12,084 sentences tagged with Korean grammar scheme with 56 CFG rules of [Park et al. 1999] ." }, { "sent_id": "7d29a7d7c19d097758481b466360b1-C001-31", "text": "The distribution of sentence length over the corpus is shown in Table 3 ." 
}, { "sent_id": "7d29a7d7c19d097758481b466360b1-C001-32", "text": "We have used 10,906 sentences for the training data and 1,178 sentences for the test data." }, { "sent_id": "7d29a7d7c19d097758481b466360b1-C001-33", "text": "Average morpheme length is 22.5." }, { "sent_id": "7d29a7d7c19d097758481b466360b1-C001-34", "text": "For CAM, because of the sparse data problem, we have restricted the maximum continuous repetition count of the same mnemonic and the maximum length of one SPT to 1 and 3, respectively (empirically optimal value) Our GLR parser uses the canonical SLR(1) parsing table constructed from the binary CFG entries provided by the CFG grammar." }, { "sent_id": "7d29a7d7c19d097758481b466360b1-C001-35", "text": "Table 2 , our proposed model outperforms previous models by about 6~8 %(Upper and lower parts show the results for training data and test data, respectively)." }, { "sent_id": "7d29a7d7c19d097758481b466360b1-C001-36", "text": "Furthermore, the performance of our parser could be improved if it is integrated with the properly lexicalized information." }, { "sent_id": "7d29a7d7c19d097758481b466360b1-C001-37", "text": "The results show that functional category is an effective way of describing structural aspects of a phrase and can be used as contextual information in GLR parsing." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "7d29a7d7c19d097758481b466360b1-C001-6" ] ], "cite_sentences": [ "7d29a7d7c19d097758481b466360b1-C001-6" ] }, "@MOT@": { "gold_contexts": [ [ "7d29a7d7c19d097758481b466360b1-C001-10", "7d29a7d7c19d097758481b466360b1-C001-6", "7d29a7d7c19d097758481b466360b1-C001-7", "7d29a7d7c19d097758481b466360b1-C001-8", "7d29a7d7c19d097758481b466360b1-C001-9" ] ], "cite_sentences": [ "7d29a7d7c19d097758481b466360b1-C001-6" ] } } }, "ABC_66b6283cf1f20977286f99ef21b3c7_49": { "x": [ { "sent_id": "66b6283cf1f20977286f99ef21b3c7-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "66b6283cf1f20977286f99ef21b3c7-C001-2", "text": "It is an honor to have this chance to tie together themes from my recent research, and to sketch some challenges and opportunities for NLG in face-to-face conversational interaction." }, { "sent_id": "66b6283cf1f20977286f99ef21b3c7-C001-3", "text": "Communication reflects our general involvement in one anothers' lives." }, { "sent_id": "66b6283cf1f20977286f99ef21b3c7-C001-4", "text": "Through the choices we manifest with one another, we share our thoughts and feelings, strengthen our relationships and further our joint projects." }, { "sent_id": "66b6283cf1f20977286f99ef21b3c7-C001-5", "text": "We rely not only on words to articulate our perspectives, but also on a heterogeneous array of accompanying efforts: embodied deixis, expressive movement, presentation of iconic imagery and instrumental action in the world." }, { "sent_id": "66b6283cf1f20977286f99ef21b3c7-C001-6", "text": "Words showcase the distinctive linguistic knowledge which human communication exploits. But people's diverse choices in conversation in fact come together to reveal multifaceted, interrelated meanings, in which all our actions, verbal and nonverbal, fit the situation and further social purposes." 
}, { "sent_id": "66b6283cf1f20977286f99ef21b3c7-C001-7", "text": "In the best case, they let interlocutors understand not just each other's words, but each other." }, { "sent_id": "66b6283cf1f20977286f99ef21b3c7-C001-8", "text": "As NLG researchers, I argue, we have good reason to work towards models of social cognition that embrace the breadth of conversation." }, { "sent_id": "66b6283cf1f20977286f99ef21b3c7-C001-9", "text": "Scientifically, it connects us to an emerging consensus in favor of a general human pragmatic competence, rooted in capacities for engagement, coordination, shared intentionality and extended relationships." }, { "sent_id": "66b6283cf1f20977286f99ef21b3c7-C001-10", "text": "Technically, it lets us position ourselves as part of an emerging revolution in integrative Artificial Intelligence, characterized by research challenges like human-robot interaction and the design of virtual humans, and applications in assistive and educational technology and interactive entertainment." }, { "sent_id": "66b6283cf1f20977286f99ef21b3c7-C001-11", "text": "Researchers are already hard at work to place our accounts of embodied action in conversation in contact with pragmatic theories derived from text discourse and spoken dialogue." }, { "sent_id": "66b6283cf1f20977286f99ef21b3c7-C001-12", "text": "In my own experience, such work proves both illuminating and exciting." }, { "sent_id": "66b6283cf1f20977286f99ef21b3c7-C001-13", "text": "For example, it challenges us to support and refine theories of discourse coherence by accounting for the discourse relations and default inference that determine the joint interpretation of coverbal gesture and its accompanying speech (Lascarides and Stone, 2008) . And it challenges us to show how speakers work across modalities to engage with, disambiguate, and (on acceptance) recapitulate each others' communicative actions, to ground their meanings (Lascarides and Stone, In Preparation)." 
}, { "sent_id": "66b6283cf1f20977286f99ef21b3c7-C001-14", "text": "The closer we look at conversation, the more we can fit all its behaviors into a unitary framework-inviting us to implement behavioral control for embodied social agents through a pervasive analogy to NLG." }, { "sent_id": "66b6283cf1f20977286f99ef21b3c7-C001-15", "text": "We can already pursue such implementations easily." }, { "sent_id": "66b6283cf1f20977286f99ef21b3c7-C001-16", "text": "Computationally, motion is just sequence data, and we can manipulate it in parallel ways to the speech data we already use in spoken language generation ." }, { "sent_id": "66b6283cf1f20977286f99ef21b3c7-C001-17", "text": "At a higher level, we can represent an embodied performance through a matrix of discrete actions selected and synchronized to an abstract time-line, as in our RUTH system Stone and Oh, 2008) ." }, { "sent_id": "66b6283cf1f20977286f99ef21b3c7-C001-18", "text": "This lets us use any NLG method that manipulates structured selections of discrete actions as an architecture for the production of embodied behavior." }, { "sent_id": "66b6283cf1f20977286f99ef21b3c7-C001-19", "text": "Templates, as in (Stone and DeCarlo, 2003; , offer 5" }, { "sent_id": "66b6283cf1f20977286f99ef21b3c7-C001-20", "text": "----------------------------------" }, { "sent_id": "66b6283cf1f20977286f99ef21b3c7-C001-21", "text": "****" }, { "sent_id": "66b6283cf1f20977286f99ef21b3c7-C001-22", "text": "Researchers are already hard at work to place our accounts of embodied action in conversation in contact with pragmatic theories derived from text discourse and spoken dialogue." }, { "sent_id": "66b6283cf1f20977286f99ef21b3c7-C001-23", "text": "In my own experience, such work proves both illuminating and exciting." 
}, { "sent_id": "66b6283cf1f20977286f99ef21b3c7-C001-24", "text": "For example, it challenges us to support and refine theories of discourse coherence by accounting for the discourse relations and default inference that determine the joint interpretation of coverbal gesture and its accompanying speech (Lascarides and Stone, 2008) . And it challenges us to show how speakers work across modalities to engage with, disambiguate, and (on acceptance) recapitulate each others' communicative actions, to ground their meanings (Lascarides and Stone, In Preparation) ." }, { "sent_id": "66b6283cf1f20977286f99ef21b3c7-C001-25", "text": "The closer we look at conversation, the more we can fit all its behaviors into a unitary framework-inviting us to implement behavioral control for embodied social agents through a pervasive analogy to NLG." }, { "sent_id": "66b6283cf1f20977286f99ef21b3c7-C001-26", "text": "We can already pursue such implementations easily." }, { "sent_id": "66b6283cf1f20977286f99ef21b3c7-C001-27", "text": "Computationally, motion is just sequence data, and we can manipulate it in parallel ways to the speech data we already use in spoken language generation ." }, { "sent_id": "66b6283cf1f20977286f99ef21b3c7-C001-28", "text": "At a higher level, we can represent an embodied performance through a matrix of discrete actions selected and synchronized to an abstract time-line, as in our RUTH system Stone and Oh, 2008) ." }, { "sent_id": "66b6283cf1f20977286f99ef21b3c7-C001-29", "text": "This lets us use any NLG method that manipulates structured selections of discrete actions as an architecture for the production of embodied behavior." }, { "sent_id": "66b6283cf1f20977286f99ef21b3c7-C001-30", "text": "Templates, as in (Stone and DeCarlo, 2003; , offer a good illustration." }, { "sent_id": "66b6283cf1f20977286f99ef21b3c7-C001-31", "text": "Nevertheless, face-to-face dialogue does demand qualitatively new capabilities." 
}, { "sent_id": "66b6283cf1f20977286f99ef21b3c7-C001-32", "text": "In fact, people's choices and meanings in interactive conversation are profoundly informed by their social settings." }, { "sent_id": "66b6283cf1f20977286f99ef21b3c7-C001-33", "text": "We are a long way from general models that could allow NLG systems to recognize and exploit these connections in the words and other behaviors they use." }, { "sent_id": "66b6283cf1f20977286f99ef21b3c7-C001-34", "text": "In my experience, even the simplest social practices, such as interlocutors' cooperation on an ongoing practical task, require new models of linguistic meaning and discourse context." }, { "sent_id": "66b6283cf1f20977286f99ef21b3c7-C001-35", "text": "For example, systems must be creative to evoke the distinctions that matter for their ongoing task, and use meanings that are not programmed or learned but invented on the fly (DeVault and ." }, { "sent_id": "66b6283cf1f20977286f99ef21b3c7-C001-36", "text": "They must count on their interlocutors to recognize the background knowledge they presuppose by general inference from the logic of their behavior as a cooperative contribution to the task (Thomason et al., 2006) ." }, { "sent_id": "66b6283cf1f20977286f99ef21b3c7-C001-37", "text": "Such reasoning becomes particularly important in problematic cases, such as when systems must finetune the form and meaning of a clarification request so that the response is more likely to resolve a pending task ambiguity (DeVault and Stone, 2007) ." }, { "sent_id": "66b6283cf1f20977286f99ef21b3c7-C001-38", "text": "I expect many further exciting developments in our understanding of meaning and interpretation as we enrich the social intelligence of NLG." }, { "sent_id": "66b6283cf1f20977286f99ef21b3c7-C001-39", "text": "Modeling efforts will remain crucial to the exploration of these new capabilities." 
}, { "sent_id": "66b6283cf1f20977286f99ef21b3c7-C001-40", "text": "When we build and assemble models of actions and interpretations, we get systems that can plan their own behavior simply by exploiting what they know about communication." }, { "sent_id": "66b6283cf1f20977286f99ef21b3c7-C001-41", "text": "These systems give new evidence about the information and problem-solving that's involved." }, { "sent_id": "66b6283cf1f20977286f99ef21b3c7-C001-42", "text": "The challenge is that these models must describe semantics and pragmatics, as well as syntax and behavior." }, { "sent_id": "66b6283cf1f20977286f99ef21b3c7-C001-43", "text": "My own slow progress (Cassell et al., 2000; Koller and Stone, 2007) shows that there's still lots of hard work needed to develop suitable techniques." }, { "sent_id": "66b6283cf1f20977286f99ef21b3c7-C001-44", "text": "I keep going because of the methodological payoffs I see on the horizon." }, { "sent_id": "66b6283cf1f20977286f99ef21b3c7-C001-45", "text": "Modeling lets us take social intelligence seriously as a general implementation principle, and thus to aim for systems whose multimodal behavior matches the flexibility and coordination that distinguishes our own embodied meanings." }, { "sent_id": "66b6283cf1f20977286f99ef21b3c7-C001-46", "text": "More generally, modeling replaces programming with data fitting, and a good model of action and interpretation in particular would let an agent's own experience in conversational interaction determine the repertoire of behaviors and meanings it uses to make itself understood." 
} ], "y": { "@MOT@": { "gold_contexts": [ [ "66b6283cf1f20977286f99ef21b3c7-C001-42", "66b6283cf1f20977286f99ef21b3c7-C001-43" ] ], "cite_sentences": [ "66b6283cf1f20977286f99ef21b3c7-C001-43" ] }, "@FUT@": { "gold_contexts": [ [ "66b6283cf1f20977286f99ef21b3c7-C001-42", "66b6283cf1f20977286f99ef21b3c7-C001-43" ] ], "cite_sentences": [ "66b6283cf1f20977286f99ef21b3c7-C001-43" ] } } }, "ABC_e99193f62a8f3a9e46dee3cadd786f_49": { "x": [ { "sent_id": "e99193f62a8f3a9e46dee3cadd786f-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "e99193f62a8f3a9e46dee3cadd786f-C001-2", "text": "We present a comparative study between two machine learning methods, Conditional Random Fields and Support Vector Machines for clinical named entity recognition." }, { "sent_id": "e99193f62a8f3a9e46dee3cadd786f-C001-3", "text": "We explore their applicability to clinical domain." }, { "sent_id": "e99193f62a8f3a9e46dee3cadd786f-C001-4", "text": "Evaluation against a set of gold standard named entities shows that CRFs outperform SVMs." }, { "sent_id": "e99193f62a8f3a9e46dee3cadd786f-C001-5", "text": "The best F-score with CRFs is 0.86 and for the SVMs is 0.64 as compared to a baseline of 0.60." }, { "sent_id": "e99193f62a8f3a9e46dee3cadd786f-C001-6", "text": "----------------------------------" }, { "sent_id": "e99193f62a8f3a9e46dee3cadd786f-C001-7", "text": "**INTRODUCTION AND BACKGROUND**" }, { "sent_id": "e99193f62a8f3a9e46dee3cadd786f-C001-8", "text": "Named entity recognition (NER) is the discovery of named entities (NEs), or textual mentions that belong to the same semantic class." }, { "sent_id": "e99193f62a8f3a9e46dee3cadd786f-C001-9", "text": "In the biomedical domain NEs are diseases, signs/symptoms, anatomical signs, and drugs." }, { "sent_id": "e99193f62a8f3a9e46dee3cadd786f-C001-10", "text": "NER performance is high as applied to scholarly text and newswire narratives (Leaman et al., 2008) ." 
}, { "sent_id": "e99193f62a8f3a9e46dee3cadd786f-C001-11", "text": "Clinical free-text, on the other hand, exhibits characteristics of both informal and formal linguistic styles which, in turn, poses challenges for clinical NER." }, { "sent_id": "e99193f62a8f3a9e46dee3cadd786f-C001-12", "text": "Conditional Random Fields (CRFs) (Lafferty et al., 2001 ) and and Support Vector Machines (SVMs) (Cortes and Vapnik, 1995) are machine learning techniques which can handle multiple features during learning." }, { "sent_id": "e99193f62a8f3a9e46dee3cadd786f-C001-13", "text": "CRFs' main strength lies in their ability to include various unrelated features, while SVMs' in the inclusion of overlapping features." }, { "sent_id": "e99193f62a8f3a9e46dee3cadd786f-C001-14", "text": "Our goal is to compare CRFs and SVMs performance for clinical NER with focus on disease/disorder NEs." }, { "sent_id": "e99193f62a8f3a9e46dee3cadd786f-C001-15", "text": "----------------------------------" }, { "sent_id": "e99193f62a8f3a9e46dee3cadd786f-C001-16", "text": "**DATASET AND FEATURES**" }, { "sent_id": "e99193f62a8f3a9e46dee3cadd786f-C001-17", "text": "Our dataset is a gold standard corpus of 1557 single-and multi-word disorder annotations (Ogren et al., 2008) ." }, { "sent_id": "e99193f62a8f3a9e46dee3cadd786f-C001-18", "text": "For training and testing the CRF and SVM models the IOB (inside-outside-begin) notation (Leaman, 2008) was applied." }, { "sent_id": "e99193f62a8f3a9e46dee3cadd786f-C001-19", "text": "In our project, we used 1265 gold standard annotations for training and 292 for testing." }, { "sent_id": "e99193f62a8f3a9e46dee3cadd786f-C001-20", "text": "The features used for the learning process are described as follows." }, { "sent_id": "e99193f62a8f3a9e46dee3cadd786f-C001-21", "text": "Dictionary look-up is a binary value feature that represents if the NE is in the dictionary (SNOMED-CT)." 
}, { "sent_id": "e99193f62a8f3a9e46dee3cadd786f-C001-22", "text": "Bag of Words (BOW) is a representation of the context by the unique words in it." }, { "sent_id": "e99193f62a8f3a9e46dee3cadd786f-C001-23", "text": "Part-of-speech tags (POS) of BOW is the pos tags of the context words." }, { "sent_id": "e99193f62a8f3a9e46dee3cadd786f-C001-24", "text": "Window size is the number of tokens representing context surrounding the target word." }, { "sent_id": "e99193f62a8f3a9e46dee3cadd786f-C001-25", "text": "Orientation(left or right) is the location of the feature in regard to the target word." }, { "sent_id": "e99193f62a8f3a9e46dee3cadd786f-C001-26", "text": "Distance is the proximity of the feature in regard to the target word Capitalization has one of the four token-based values: all upper case, all lower case, mixed_case and initial upper case." }, { "sent_id": "e99193f62a8f3a9e46dee3cadd786f-C001-27", "text": "Number features refer to the presence or absence of related numbers." }, { "sent_id": "e99193f62a8f3a9e46dee3cadd786f-C001-28", "text": "Feature sets are in Table 1 ." }, { "sent_id": "e99193f62a8f3a9e46dee3cadd786f-C001-29", "text": "----------------------------------" }, { "sent_id": "e99193f62a8f3a9e46dee3cadd786f-C001-30", "text": "**RESULTS AND DISCUSSION**" }, { "sent_id": "e99193f62a8f3a9e46dee3cadd786f-C001-31", "text": "----------------------------------" }, { "sent_id": "e99193f62a8f3a9e46dee3cadd786f-C001-32", "text": "**CONCLUSION AND FUTURE WORK**" }, { "sent_id": "e99193f62a8f3a9e46dee3cadd786f-C001-33", "text": "We investigated the use of CRFs and SVMs for disorder NER in clinical free-text." }, { "sent_id": "e99193f62a8f3a9e46dee3cadd786f-C001-34", "text": "Our results show that, in general, CRFs outperformed SVMs." }, { "sent_id": "e99193f62a8f3a9e46dee3cadd786f-C001-35", "text": "We demonstrated that well-chosen features along with dictionary-based features tend to improve the CRF model's performance but not the SVM's." 
} ], "y": { "@USE@": { "gold_contexts": [ [ "e99193f62a8f3a9e46dee3cadd786f-C001-17" ] ], "cite_sentences": [ "e99193f62a8f3a9e46dee3cadd786f-C001-17" ] } } }, "ABC_d8c3d04514c0867d78a7603e49ea9b_49": { "x": [ { "sent_id": "d8c3d04514c0867d78a7603e49ea9b-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "d8c3d04514c0867d78a7603e49ea9b-C001-2", "text": "The iADAATPA 1 project coded as N \u2022 2016-EU-IA-0132 that ended in February 2019 is made for building of customized, domain-specific engines for public administrations from EU Member States." }, { "sent_id": "d8c3d04514c0867d78a7603e49ea9b-C001-3", "text": "The consortium of the project decided to use neural machine translation at the beginning of the project." }, { "sent_id": "d8c3d04514c0867d78a7603e49ea9b-C001-4", "text": "This represented a challenge for all involved, and the positive aspect is that all public administrations engaged in the iADAATPA project were able to try, test and use state-of-the-art neural technology with a high level of satisfaction." }, { "sent_id": "d8c3d04514c0867d78a7603e49ea9b-C001-5", "text": "One of the main challenges faced by all partners was data availability." }, { "sent_id": "d8c3d04514c0867d78a7603e49ea9b-C001-6", "text": "Although all public administrations had some data available, it was clearly insufficient for high-level customization." }, { "sent_id": "d8c3d04514c0867d78a7603e49ea9b-C001-7", "text": "In some cases, we had merely a few hundred words or several tens of thousand words." }, { "sent_id": "d8c3d04514c0867d78a7603e49ea9b-C001-8", "text": "Each domain (field) has its own unique word distribution and neural machine translation systems are known to suffer a decrease in performance when data is out-of-domain." }, { "sent_id": "d8c3d04514c0867d78a7603e49ea9b-C001-9", "text": "Pangeanic is a language service provider (LSP) specialised in natural language processing and machine translation." 
}, { "sent_id": "d8c3d04514c0867d78a7603e49ea9b-C001-10", "text": "It provides solutions to cognitive companies, institutions, translation professionals, and corporations." }, { "sent_id": "d8c3d04514c0867d78a7603e49ea9b-C001-11", "text": "The problem faced by the iADAATPA project at Pangeanic was twofold: Data acquisition For translation from Spanish to Russian there was no available in-domain data." }, { "sent_id": "d8c3d04514c0867d78a7603e49ea9b-C001-12", "text": "Therefore, 2 translators were contracted as part of the project to create 30,000 segments of in-domain data, translating public administrations websites." }, { "sent_id": "d8c3d04514c0867d78a7603e49ea9b-C001-13", "text": "They also cleaned United Nations material and post-edited general-domain data that was previously filtered as indomain following the \"invitation model\" (Hoang and Sima'an, 2014) ." }, { "sent_id": "d8c3d04514c0867d78a7603e49ea9b-C001-14", "text": "For the other language pairs, the input material was 30,000 post-edited segments." }, { "sent_id": "d8c3d04514c0867d78a7603e49ea9b-C001-15", "text": "The main part of the training corpora (approximately 75%) was part of Pangeanic's own repository harvested through web crawling and also OpenSubtitles (Tiedemann, 2012) ." }, { "sent_id": "d8c3d04514c0867d78a7603e49ea9b-C001-16", "text": "The rest of the corpus was automatically validated synthetic material using general data from Leipzig (Goldhahn et al., 2012 Engine customization The data was cleaned using the Bicleaner tool (S\u00e1nchez-Cartagena et al., 2018) ." }, { "sent_id": "d8c3d04514c0867d78a7603e49ea9b-C001-17", "text": "The data was lowercased and extra embeddings were added in order to keep the case information." }, { "sent_id": "d8c3d04514c0867d78a7603e49ea9b-C001-18", "text": "The tokenization used was the one provided by OpenNMT 3 and words were divided in subwords according to the BPE (Sennrich et al., 2016) approach." 
}, { "sent_id": "d8c3d04514c0867d78a7603e49ea9b-C001-19", "text": "The models were trained with multi-domain data and we improved performance following a domainmixing approach (Britz et al., 2017) ." }, { "sent_id": "d8c3d04514c0867d78a7603e49ea9b-C001-20", "text": "The domain information was prepended with special tokens for each target sequence." }, { "sent_id": "d8c3d04514c0867d78a7603e49ea9b-C001-21", "text": "The domain prediction was based only on the source as the extra token was added at target-side and there was no need for apriori domain information." }, { "sent_id": "d8c3d04514c0867d78a7603e49ea9b-C001-22", "text": "This approach allowed the model to improve the quality for each domain." }, { "sent_id": "d8c3d04514c0867d78a7603e49ea9b-C001-23", "text": "----------------------------------" }, { "sent_id": "d8c3d04514c0867d78a7603e49ea9b-C001-24", "text": "****" }, { "sent_id": "d8c3d04514c0867d78a7603e49ea9b-C001-25", "text": "One of the main challenges faced by all partners was data availability." }, { "sent_id": "d8c3d04514c0867d78a7603e49ea9b-C001-26", "text": "Although all public administrations had some data available, it was clearly insufficient for high-level customization." }, { "sent_id": "d8c3d04514c0867d78a7603e49ea9b-C001-27", "text": "In some cases, we had merely a few hundred words or several tens of thousand words." }, { "sent_id": "d8c3d04514c0867d78a7603e49ea9b-C001-28", "text": "Each domain (field) has its own unique word distribution and neural machine translation systems are known to suffer a decrease in performance when data is out-of-domain." }, { "sent_id": "d8c3d04514c0867d78a7603e49ea9b-C001-29", "text": "Pangeanic is a language service provider (LSP) specialised in natural language processing and machine translation." }, { "sent_id": "d8c3d04514c0867d78a7603e49ea9b-C001-30", "text": "It provides solutions to cognitive companies, institutions, translation professionals, and corporations." 
}, { "sent_id": "d8c3d04514c0867d78a7603e49ea9b-C001-31", "text": "The problem faced by the iADAATPA project at Pangeanic was twofold: Data acquisition For translation from Spanish to Russian there was no available in-domain data." }, { "sent_id": "d8c3d04514c0867d78a7603e49ea9b-C001-32", "text": "Therefore, 2 translators were contracted as part of the project to create 30,000 segments of in-domain data, translating public administrations websites." }, { "sent_id": "d8c3d04514c0867d78a7603e49ea9b-C001-33", "text": "They also cleaned United Nations material and post-edited general-domain data that was previously filtered as indomain following the \"invitation model\" (Hoang and Sima'an, 2014) ." }, { "sent_id": "d8c3d04514c0867d78a7603e49ea9b-C001-34", "text": "For the other language pairs, the input material was 30,000 post-edited segments." }, { "sent_id": "d8c3d04514c0867d78a7603e49ea9b-C001-35", "text": "The main part of the training corpora (approximately 75%) was part of Pangeanic's own repository harvested through web crawling and also OpenSubtitles (Tiedemann, 2012) ." }, { "sent_id": "d8c3d04514c0867d78a7603e49ea9b-C001-36", "text": "The rest of the corpus was automatically validated synthetic material using general data from Leipzig (Goldhahn et al., 2012) ." }, { "sent_id": "d8c3d04514c0867d78a7603e49ea9b-C001-37", "text": "Engine customization The data was cleaned using the Bicleaner tool (S\u00e1nchez-Cartagena et al., 2018) ." }, { "sent_id": "d8c3d04514c0867d78a7603e49ea9b-C001-38", "text": "The data was lowercased and extra embeddings were added in order to keep the case information." }, { "sent_id": "d8c3d04514c0867d78a7603e49ea9b-C001-39", "text": "The tokenization used was the one provided by OpenNMT 3 and words were divided in subwords according to the BPE (Sennrich et al., 2016) approach." 
}, { "sent_id": "d8c3d04514c0867d78a7603e49ea9b-C001-40", "text": "The models were trained with multi-domain data and we improved performance following a domainmixing approach (Britz et al., 2017) ." }, { "sent_id": "d8c3d04514c0867d78a7603e49ea9b-C001-41", "text": "The domain information was prepended with special tokens for each target sequence." }, { "sent_id": "d8c3d04514c0867d78a7603e49ea9b-C001-42", "text": "The domain prediction was based only on the source as the extra token was added at target-side and there was no need for apriori domain information." }, { "sent_id": "d8c3d04514c0867d78a7603e49ea9b-C001-43", "text": "This approach allowed the model to improve the quality for each domain." } ], "y": { "@USE@": { "gold_contexts": [ [ "d8c3d04514c0867d78a7603e49ea9b-C001-16" ], [ "d8c3d04514c0867d78a7603e49ea9b-C001-37" ] ], "cite_sentences": [ "d8c3d04514c0867d78a7603e49ea9b-C001-16", "d8c3d04514c0867d78a7603e49ea9b-C001-37" ] } } }, "ABC_a7223f2afa1afd5fbfa1257b98ec02_49": { "x": [ { "sent_id": "a7223f2afa1afd5fbfa1257b98ec02-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "a7223f2afa1afd5fbfa1257b98ec02-C001-2", "text": "Introduction: Most research in machine learning has been focused on binary classification, in which the learned classifier outputs one of two possible answers." }, { "sent_id": "a7223f2afa1afd5fbfa1257b98ec02-C001-3", "text": "Important fundamental questions can be analyzed in terms of binary classification, but realworld natural language processing problems often involve richer output spaces." }, { "sent_id": "a7223f2afa1afd5fbfa1257b98ec02-C001-4", "text": "In this tutorial, we will focus on classifiers with a large number of possible outputs with interesting structure." }, { "sent_id": "a7223f2afa1afd5fbfa1257b98ec02-C001-5", "text": "Notable examples include information retrieval, part-of-speech tagging, NP chucking, parsing, entity extraction, and phoneme recognition." 
}, { "sent_id": "a7223f2afa1afd5fbfa1257b98ec02-C001-6", "text": "Our algorithmic framework will be that of online learning, for several reasons." }, { "sent_id": "a7223f2afa1afd5fbfa1257b98ec02-C001-7", "text": "First, online algorithms are in general conceptually simple and easy to implement." }, { "sent_id": "a7223f2afa1afd5fbfa1257b98ec02-C001-8", "text": "In particular, online algorithms process one example at a time and thus require little working memory." }, { "sent_id": "a7223f2afa1afd5fbfa1257b98ec02-C001-9", "text": "Second, our example applications have all been treated successfully using online algorithms." }, { "sent_id": "a7223f2afa1afd5fbfa1257b98ec02-C001-10", "text": "Third, the analysis of online algorithms uses simpler mathematical tools than other types of algorithms." }, { "sent_id": "a7223f2afa1afd5fbfa1257b98ec02-C001-11", "text": "Fourth, the online learning framework provides a very general setting which can be applied to a broad setting of problems, where the only machinery assumed is the ability to perform exact inference, which computes a maxima over some score function." }, { "sent_id": "a7223f2afa1afd5fbfa1257b98ec02-C001-12", "text": "Goals: (1) To provide the audience systematic methods to design, analyze and implement efficiently learning algorithms for their specific complex-output problems: from simple binary classification through multi-class categorization to information extraction, parsing and speech recognition." }, { "sent_id": "a7223f2afa1afd5fbfa1257b98ec02-C001-13", "text": "(2) To introduce new online algorithms which provide state-of-the-art performance in practice backed by interesting theoretical guarantees." }, { "sent_id": "a7223f2afa1afd5fbfa1257b98ec02-C001-14", "text": "The tutorial is divided into two parts." 
}, { "sent_id": "a7223f2afa1afd5fbfa1257b98ec02-C001-15", "text": "In the first half we introduce online learning and describe the Perceptron algorithm (Rosenblatt, 1958) and the passive-aggressive framework (Crammer et al., 2006) ." }, { "sent_id": "a7223f2afa1afd5fbfa1257b98ec02-C001-16", "text": "We then discuss in detail an approach for deriving algorithms for complex natural language processing (Crammer, 2004) ." }, { "sent_id": "a7223f2afa1afd5fbfa1257b98ec02-C001-17", "text": "In the second half we discuss is detail relevant applications including text classification (Crammer and Singer, 2003) , named entity recognition (McDonald et al., 2005) , parsing (McDonald, 2006) , and other tasks." }, { "sent_id": "a7223f2afa1afd5fbfa1257b98ec02-C001-18", "text": "We also relate the online algorithms to their batch counterparts." }, { "sent_id": "a7223f2afa1afd5fbfa1257b98ec02-C001-19", "text": "----------------------------------" }, { "sent_id": "a7223f2afa1afd5fbfa1257b98ec02-C001-20", "text": "****" }, { "sent_id": "a7223f2afa1afd5fbfa1257b98ec02-C001-21", "text": "Introduction: Most research in machine learning has been focused on binary classification, in which the learned classifier outputs one of two possible answers." }, { "sent_id": "a7223f2afa1afd5fbfa1257b98ec02-C001-22", "text": "Important fundamental questions can be analyzed in terms of binary classification, but realworld natural language processing problems often involve richer output spaces." }, { "sent_id": "a7223f2afa1afd5fbfa1257b98ec02-C001-23", "text": "In this tutorial, we will focus on classifiers with a large number of possible outputs with interesting structure." }, { "sent_id": "a7223f2afa1afd5fbfa1257b98ec02-C001-24", "text": "Notable examples include information retrieval, part-of-speech tagging, NP chucking, parsing, entity extraction, and phoneme recognition." 
}, { "sent_id": "a7223f2afa1afd5fbfa1257b98ec02-C001-25", "text": "Our algorithmic framework will be that of online learning, for several reasons." }, { "sent_id": "a7223f2afa1afd5fbfa1257b98ec02-C001-26", "text": "First, online algorithms are in general conceptually simple and easy to implement." }, { "sent_id": "a7223f2afa1afd5fbfa1257b98ec02-C001-27", "text": "In particular, online algorithms process one example at a time and thus require little working memory." }, { "sent_id": "a7223f2afa1afd5fbfa1257b98ec02-C001-28", "text": "Second, our example applications have all been treated successfully using online algorithms." }, { "sent_id": "a7223f2afa1afd5fbfa1257b98ec02-C001-29", "text": "Third, the analysis of online algorithms uses simpler mathematical tools than other types of algorithms." }, { "sent_id": "a7223f2afa1afd5fbfa1257b98ec02-C001-30", "text": "Fourth, the online learning framework provides a very general setting which can be applied to a broad setting of problems, where the only machinery assumed is the ability to perform exact inference, which computes a maxima over some score function." }, { "sent_id": "a7223f2afa1afd5fbfa1257b98ec02-C001-31", "text": "Goals: (1) To provide the audience systematic methods to design, analyze and implement efficiently learning algorithms for their specific complex-output problems: from simple binary classification through multi-class categorization to information extraction, parsing and speech recognition." }, { "sent_id": "a7223f2afa1afd5fbfa1257b98ec02-C001-32", "text": "(2) To introduce new online algorithms which provide state-of-the-art performance in practice backed by interesting theoretical guarantees." }, { "sent_id": "a7223f2afa1afd5fbfa1257b98ec02-C001-33", "text": "----------------------------------" }, { "sent_id": "a7223f2afa1afd5fbfa1257b98ec02-C001-34", "text": "**CONTENT:**" }, { "sent_id": "a7223f2afa1afd5fbfa1257b98ec02-C001-35", "text": "The tutorial is divided into two parts." 
}, { "sent_id": "a7223f2afa1afd5fbfa1257b98ec02-C001-36", "text": "In the first half we introduce online learning and describe the Perceptron algorithm (Rosenblatt, 1958) and the passive-aggressive framework (Crammer et al., 2006) ." }, { "sent_id": "a7223f2afa1afd5fbfa1257b98ec02-C001-37", "text": "We then discuss in detail an approach for deriving algorithms for complex natural language processing (Crammer, 2004) ." }, { "sent_id": "a7223f2afa1afd5fbfa1257b98ec02-C001-38", "text": "In the second half we discuss in detail relevant applications including text classification (Crammer and Singer, 2003) , named entity recognition (McDonald et al., 2005) , parsing (McDonald, 2006) , and other tasks." }, { "sent_id": "a7223f2afa1afd5fbfa1257b98ec02-C001-39", "text": "We also relate the online algorithms to their batch counterparts." } ], "y": { "@USE@": { "gold_contexts": [ [ "a7223f2afa1afd5fbfa1257b98ec02-C001-17" ], [ "a7223f2afa1afd5fbfa1257b98ec02-C001-38" ] ], "cite_sentences": [ "a7223f2afa1afd5fbfa1257b98ec02-C001-17", "a7223f2afa1afd5fbfa1257b98ec02-C001-38" ] } } }, "ABC_bd12a9270c5f94056701b86eda9c8e_49": { "x": [ { "sent_id": "bd12a9270c5f94056701b86eda9c8e-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "bd12a9270c5f94056701b86eda9c8e-C001-2", "text": "The iADAATPA 1 project, coded as N \u2022 2016-EU-IA-0132, which ended in February 2019, was aimed at building customized, domain-specific engines for public administrations from EU Member States." }, { "sent_id": "bd12a9270c5f94056701b86eda9c8e-C001-3", "text": "The consortium of the project decided to use neural machine translation at the beginning of the project." }, { "sent_id": "bd12a9270c5f94056701b86eda9c8e-C001-4", "text": "This represented a challenge for all involved, and the positive aspect is that all public administrations engaged in the iADAATPA project were able to try, test and use state-of-the-art neural technology with a high level of satisfaction."
}, { "sent_id": "bd12a9270c5f94056701b86eda9c8e-C001-5", "text": "One of the main challenges faced by all partners was data availability." }, { "sent_id": "bd12a9270c5f94056701b86eda9c8e-C001-6", "text": "Although all public administrations had some data available, it was clearly insufficient for high-level customization." }, { "sent_id": "bd12a9270c5f94056701b86eda9c8e-C001-7", "text": "In some cases, we had merely a few hundred words or several tens of thousands of words." }, { "sent_id": "bd12a9270c5f94056701b86eda9c8e-C001-8", "text": "Each domain (field) has its own unique word distribution and neural machine translation systems are known to suffer a decrease in performance when data is out-of-domain." }, { "sent_id": "bd12a9270c5f94056701b86eda9c8e-C001-9", "text": "Pangeanic is a language service provider (LSP) specialised in natural language processing and machine translation." }, { "sent_id": "bd12a9270c5f94056701b86eda9c8e-C001-10", "text": "It provides solutions to cognitive companies, institutions, translation professionals, and corporations." }, { "sent_id": "bd12a9270c5f94056701b86eda9c8e-C001-11", "text": "The problem faced by the iADAATPA project at Pangeanic was twofold: Data acquisition For translation from Spanish to Russian there was no available in-domain data." }, { "sent_id": "bd12a9270c5f94056701b86eda9c8e-C001-12", "text": "Therefore, 2 translators were contracted as part of the project to create 30,000 segments of in-domain data, translating public administration websites." }, { "sent_id": "bd12a9270c5f94056701b86eda9c8e-C001-13", "text": "They also cleaned United Nations material and post-edited general-domain data that was previously filtered as in-domain following the \"invitation model\" (Hoang and Sima'an, 2014) ." }, { "sent_id": "bd12a9270c5f94056701b86eda9c8e-C001-14", "text": "For the other language pairs, the input material was 30,000 post-edited segments."
}, { "sent_id": "bd12a9270c5f94056701b86eda9c8e-C001-15", "text": "The main part of the training corpora (approximately 75%) was part of Pangeanic's own repository harvested through web crawling and also OpenSubtitles (Tiedemann, 2012) ." }, { "sent_id": "bd12a9270c5f94056701b86eda9c8e-C001-16", "text": "The rest of the corpus was automatically validated synthetic material using general data from Leipzig (Goldhahn et al., 2012) . Engine customization The data was cleaned using the Bicleaner tool (S\u00e1nchez-Cartagena et al., 2018) ." }, { "sent_id": "bd12a9270c5f94056701b86eda9c8e-C001-17", "text": "The data was lowercased and extra embeddings were added in order to keep the case information." }, { "sent_id": "bd12a9270c5f94056701b86eda9c8e-C001-18", "text": "The tokenization used was the one provided by OpenNMT 3 and words were divided in subwords according to the BPE (Sennrich et al., 2016) approach." }, { "sent_id": "bd12a9270c5f94056701b86eda9c8e-C001-19", "text": "The models were trained with multi-domain data and we improved performance following a domain-mixing approach (Britz et al., 2017) ." }, { "sent_id": "bd12a9270c5f94056701b86eda9c8e-C001-20", "text": "The domain information was prepended with special tokens for each target sequence." }, { "sent_id": "bd12a9270c5f94056701b86eda9c8e-C001-21", "text": "The domain prediction was based only on the source as the extra token was added at target-side and there was no need for a priori domain information." }, { "sent_id": "bd12a9270c5f94056701b86eda9c8e-C001-22", "text": "This approach allowed the model to improve the quality for each domain." }, { "sent_id": "bd12a9270c5f94056701b86eda9c8e-C001-23", "text": "----------------------------------" }, { "sent_id": "bd12a9270c5f94056701b86eda9c8e-C001-24", "text": "****" }, { "sent_id": "bd12a9270c5f94056701b86eda9c8e-C001-25", "text": "One of the main challenges faced by all partners was data availability."
}, { "sent_id": "bd12a9270c5f94056701b86eda9c8e-C001-26", "text": "Although all public administrations had some data available, it was clearly insufficient for high-level customization." }, { "sent_id": "bd12a9270c5f94056701b86eda9c8e-C001-27", "text": "In some cases, we had merely a few hundred words or several tens of thousands of words." }, { "sent_id": "bd12a9270c5f94056701b86eda9c8e-C001-28", "text": "Each domain (field) has its own unique word distribution and neural machine translation systems are known to suffer a decrease in performance when data is out-of-domain." }, { "sent_id": "bd12a9270c5f94056701b86eda9c8e-C001-29", "text": "Pangeanic is a language service provider (LSP) specialised in natural language processing and machine translation." }, { "sent_id": "bd12a9270c5f94056701b86eda9c8e-C001-30", "text": "It provides solutions to cognitive companies, institutions, translation professionals, and corporations." }, { "sent_id": "bd12a9270c5f94056701b86eda9c8e-C001-31", "text": "The problem faced by the iADAATPA project at Pangeanic was twofold: Data acquisition For translation from Spanish to Russian there was no available in-domain data." }, { "sent_id": "bd12a9270c5f94056701b86eda9c8e-C001-32", "text": "Therefore, 2 translators were contracted as part of the project to create 30,000 segments of in-domain data, translating public administration websites." }, { "sent_id": "bd12a9270c5f94056701b86eda9c8e-C001-33", "text": "They also cleaned United Nations material and post-edited general-domain data that was previously filtered as in-domain following the \"invitation model\" (Hoang and Sima'an, 2014) ." }, { "sent_id": "bd12a9270c5f94056701b86eda9c8e-C001-34", "text": "For the other language pairs, the input material was 30,000 post-edited segments."
}, { "sent_id": "bd12a9270c5f94056701b86eda9c8e-C001-35", "text": "The main part of the training corpora (approximately 75%) was part of Pangeanic's own repository harvested through web crawling and also OpenSubtitles (Tiedemann, 2012) ." }, { "sent_id": "bd12a9270c5f94056701b86eda9c8e-C001-36", "text": "The rest of the corpus was automatically validated synthetic material using general data from Leipzig (Goldhahn et al., 2012) ." }, { "sent_id": "bd12a9270c5f94056701b86eda9c8e-C001-37", "text": "Engine customization The data was cleaned using the Bicleaner tool (S\u00e1nchez-Cartagena et al., 2018) ." }, { "sent_id": "bd12a9270c5f94056701b86eda9c8e-C001-38", "text": "The data was lowercased and extra embeddings were added in order to keep the case information." }, { "sent_id": "bd12a9270c5f94056701b86eda9c8e-C001-39", "text": "The tokenization used was the one provided by OpenNMT 3 and words were divided in subwords according to the BPE (Sennrich et al., 2016) approach." }, { "sent_id": "bd12a9270c5f94056701b86eda9c8e-C001-40", "text": "The models were trained with multi-domain data and we improved performance following a domain-mixing approach (Britz et al., 2017) ." }, { "sent_id": "bd12a9270c5f94056701b86eda9c8e-C001-41", "text": "The domain information was prepended with special tokens for each target sequence." }, { "sent_id": "bd12a9270c5f94056701b86eda9c8e-C001-42", "text": "The domain prediction was based only on the source as the extra token was added at target-side and there was no need for a priori domain information." }, { "sent_id": "bd12a9270c5f94056701b86eda9c8e-C001-43", "text": "This approach allowed the model to improve the quality for each domain."
} ], "y": { "@USE@": { "gold_contexts": [ [ "bd12a9270c5f94056701b86eda9c8e-C001-18" ], [ "bd12a9270c5f94056701b86eda9c8e-C001-39" ] ], "cite_sentences": [ "bd12a9270c5f94056701b86eda9c8e-C001-18", "bd12a9270c5f94056701b86eda9c8e-C001-39" ] } } }, "ABC_574ab9a51f3414e6587da7dfca2ff8_49": { "x": [ { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-2", "text": "The last few years have seen an increased interest in narrative within the field of Natural Language Generation (Reiter et al., 2008; Elson and McKeown, 2010; Siddharthan et al., 2012; Lester, 2012) ." }, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-3", "text": "Narrative is generally acknowledged as a fundamental mode of presenting and communicating information between humans, with different manifestations across media but with a very significant presence in textual form." }, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-4", "text": "Yet efforts in Natural Language Generation research have generally sidestepped the issue." }, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-5", "text": "Aside from the pioneering work of (Callaway, 2002) and an early attempt to bridge the gap between narratology and natural language generation (L\u00f6nneker, 2005) , the field had mostly avoided narrative until recent times." }, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-6", "text": "Two possible arguments may be considered as an explanation of this: one based on the need to restrict initial work within a field to the simpler challenges before tackling the difficult ones, and another based on an assumption that the peculiarities of narrative have already been covered by existing work." }, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-7", "text": "Both arguments can be shown to be inappropriate."
}, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-8", "text": "With respect to the first argument, the field of natural language generation has for many years operated under the tacit assumption that state of the art technology can only aspire to generating texts within a limited range of domains and genres." }, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-9", "text": "These have over the years been defined in different ways, but in spite of changes, literary texts have usually been considered to be outside the range of possible candidates." }, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-10", "text": "From an engineering point of view, this kind of restriction made sense when the field was starting, for two important reasons." }, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-11", "text": "One, the technological solutions available at the time for the various tasks involved in natural language generation were in their infancy, and the linguistic complexity of literary text might have been beyond their scope." }, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-12", "text": "Two, natural language generation arose from a desire to extend the studies that had been carried out for computational analysis of language to the task of generation, and what was known about language from a computational point of view concerned simple texts." }, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-13", "text": "Most of the studies on language and computation had applied similar simplifying assumptions." }, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-14", "text": "However, such restricting assumptions are no longer necessary and may be inhibiting progress." }, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-15", "text": "In terms of technology, the field has matured significantly over the intervening years." 
}, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-16", "text": "The current state of the art provides a wide range of solutions that may be well suited to address some of the more complex phenomena involved in literary text." }, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-17", "text": "Additional objections may be made on the grounds that we do not know enough about these phenomena." }, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-18", "text": "Such objections, however valid they might have been originally, are no longer valid either." }, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-19", "text": "Many of the phenomena that were considered beyond computational treatment (metaphor, emotion, temporal reasoning, dialogue...) have been the subject of serious and sustained study over the same time period." }, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-20", "text": "Many approaches to their computational modelling and treatment have become available." }, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-21", "text": "More to the point, the last few years have seen a rise of interest in literary text within the natural language processing community." }, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-22", "text": "With respect to the second argument, the recent reappearance of narrative as a research topic for NLG should be enough to dispel the notion that all its problems have already been solved." }, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-23", "text": "Narrative has many peculiarities that set it apart from other kinds of text, and the body of work addressing narrative as a research topic within NLG has at most uncovered and staked out a set of problems and challenges that are awaiting further exploration." }, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-24", "text": "Of these various open problems in the treatment of narrative, my talk will focus on the problem of narrative composition."
}, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-25", "text": "Research on narrative is plagued by the difficulty of establishing a definition of the term that is both sufficiently formal to act as foundation for scientific rigour, and sufficiently rich to cover the fundamental aspects that people associate with the term." }, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-26", "text": "At the present stage of development, tentative definitions need to be established, to be later confirmed on the basis of empirical work and successful evaluation of results." }, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-27", "text": "The talk will outline some of the possibilities that must be considered (arising from established definitions in the field of narratology) and some of the restrictions that arise from the computational nature of the task." }, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-28", "text": "From the combination of these constraints, a working model of narrative structure will be outlined." }, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-29", "text": "However, it is clear that such a model must document the relation between a semantic description of the content of the narrative (what is usually termed the fabula) and its rendition as a sequential discourse." }, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-30", "text": "The task of narrative composition will be specified as the task of constructing such a discourse (or discourse plan) for a given semantic description of fabula." }, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-31", "text": "This discourse should be susceptible of being converted into text and it should appropriately convey the set of events in the fabula in such a way that satisfies a number of traditionally accepted requirements (like having an identifiable theme, a certain temporal and causal coherence, a recognisable set of characters...)."
}, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-32", "text": "A number of basic narratological concepts will be described where they provide tools for breaking down the task into computationally tractable subproblems." }, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-33", "text": "Of particular interest is the concept of focalization, which is used by narratologists to describe the way certain segments of a narrative follow a particular character, and which provides a useful computational representation of both the granularity and the shift in focus employed during the process of converting the semantics of the fabula into a linear discourse." }, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-34", "text": "As part of the talk, narrative composition will be framed in terms of the accepted task breakdown for natural language generation, considering that it may involve a combination of content determination and discourse planning that cannot be segregated into separate subtasks." }, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-35", "text": "The talk will also discuss the relation of the task of narrative composition with a number of existing research problems such as story generation (which could correspond to the construction of fabula but is sometimes simplified down to construction of a discourse directly) and creativity (which has been addressed with respect to story generation but may also constitute a fundamental ingredient of the composition task)." }, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-36", "text": "----------------------------------" }, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-37", "text": "****" }, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-38", "text": "The last few years have seen an increased interest in narrative within the field of Natural Language Generation (Reiter et al., 2008; Elson and McKeown, 2010; Siddharthan et al., 2012; Lester, 2012) ." 
}, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-39", "text": "Narrative is generally acknowledged as a fundamental mode of presenting and communicating information between humans, with different manifestations across media but with a very significant presence in textual form." }, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-40", "text": "Yet efforts in Natural Language Generation research have generally sidestepped the issue." }, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-41", "text": "Aside from the pioneering work of (Callaway, 2002) and an early attempt to bridge the gap between narratology and natural language generation (L\u00f6nneker, 2005) , the field had mostly avoided narrative until recent times." }, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-42", "text": "Two possible arguments may be considered as an explanation of this: one based on the need to restrict initial work within a field to the simpler challenges before tackling the difficult ones, and another based on an assumption that the peculiarities of narrative have already been covered by existing work." }, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-43", "text": "Both arguments can be shown to be inappropriate." }, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-44", "text": "With respect to the first argument, the field of natural language generation has for many years operated under the tacit assumption that state of the art technology can only aspire to generating texts within a limited range of domains and genres." }, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-45", "text": "These have over the years been defined in different ways, but in spite of changes, literary texts have usually been considered to be outside the range of possible candidates." }, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-46", "text": "From an engineering point of view, this kind of restriction made sense when the field was starting, for two important reasons."
}, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-47", "text": "One, the technological solutions available at the time for the various tasks involved in natural language generation were in their infancy, and the linguistic complexity of literary text might have been beyond their scope." }, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-48", "text": "Two, natural language generation arose from a desire to extend the studies that had been carried out for computational analysis of language to the task of generation, and what was known about language from a computational point of view concerned simple texts." }, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-49", "text": "Most of the studies on language and computation had applied similar simplifying assumptions." }, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-50", "text": "However, such restricting assumptions are no longer necessary and may be inhibiting progress." }, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-51", "text": "In terms of technology, the field has matured significantly over the intervening years." }, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-52", "text": "The current state of the art provides a wide range of solutions that may be well suited to address some of the more complex phenomena involved in literary text." }, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-53", "text": "Additional objections may be made on the grounds that we do not know enough about these phenomena." }, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-54", "text": "Such objections, however valid they might have been originally, are no longer valid either." }, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-55", "text": "Many of the phenomena that were considered beyond computational treatment (metaphor, emotion, temporal reasoning, dialogue...) have been the subject of serious and sustained study over the same time period." 
}, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-56", "text": "Many approaches to their computational modelling and treatment have become available." }, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-57", "text": "More to the point, the last few years have seen a rise of interest in literary text within the natural language processing community." }, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-58", "text": "With respect to the second argument, the recent reappearance of narrative as a research topic for NLG should be enough to dispel the notion that all its problems have already been solved." }, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-59", "text": "Narrative has many peculiarities that set it apart from other kinds of text, and the body of work addressing narrative as a research topic within NLG has at most uncovered and staked out a set of problems and challenges that are awaiting further exploration." }, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-60", "text": "Of these various open problems in the treatment of narrative, my talk will focus on the problem of narrative composition." }, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-61", "text": "Research on narrative is plagued by the difficulty of establishing a definition of the term that is both sufficiently formal to act as foundation for scientific rigour, and sufficiently rich to cover the fundamental aspects that people associate with the term." }, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-62", "text": "At the present stage of development, tentative definitions need to be established, to be later confirmed on the basis of empirical work and successful evaluation of results." }, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-63", "text": "The talk will outline some of the possibilities that must be considered (arising from established definitions in the field of narratology) and some of the restrictions that arise from the computational nature of the task."
}, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-64", "text": "From the combination of these constraints, a working model of narrative structure will be outlined." }, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-65", "text": "However, it is clear that such a model must document the relation between a semantic description of the content of the narrative (what is usually termed the fabula) and its rendition as a sequential discourse." }, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-66", "text": "The task of narrative composition will be specified as the task of constructing such a discourse (or discourse plan) for a given semantic description of fabula." }, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-67", "text": "This discourse should be susceptible of being converted into text and it should appropriately convey the set of events in the fabula in such a way that satisfies a number of traditionally accepted requirements (like having an identifiable theme, a certain temporal and causal coherence, a recognisable set of characters...)." }, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-68", "text": "A number of basic narratological concepts will be described where they provide tools for breaking down the task into computationally tractable subproblems." }, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-69", "text": "Of particular interest is the concept of focalization, which is used by narratologists to describe the way certain segments of a narrative follow a particular character, and which provides a useful computational representation of both the granularity and the shift in focus employed during the process of converting the semantics of the fabula into a linear discourse."
}, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-70", "text": "As part of the talk, narrative composition will be framed in terms of the accepted task breakdown for natural language generation, considering that it may involve a combination of content determination and discourse planning that cannot be segregated into separate subtasks." }, { "sent_id": "574ab9a51f3414e6587da7dfca2ff8-C001-71", "text": "The talk will also discuss the relation of the task of narrative composition with a number of existing research problems such as story generation (which could correspond to the construction of fabula but is sometimes simplified down to construction of a discourse directly) and creativity (which has been addressed with respect to story generation but may also constitute a fundamental ingredient of the composition task)." } ], "y": { "@BACK@": { "gold_contexts": [ [ "574ab9a51f3414e6587da7dfca2ff8-C001-2" ], [ "574ab9a51f3414e6587da7dfca2ff8-C001-38" ] ], "cite_sentences": [ "574ab9a51f3414e6587da7dfca2ff8-C001-2", "574ab9a51f3414e6587da7dfca2ff8-C001-38" ] } } }, "ABC_be4144c60068cf242479ece304fd19_49": { "x": [ { "sent_id": "be4144c60068cf242479ece304fd19-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "be4144c60068cf242479ece304fd19-C001-2", "text": "The last few years have seen an increased interest in narrative within the field of Natural Language Generation (Reiter et al., 2008; Elson and McKeown, 2010; Siddharthan et al., 2012; Lester, 2012) ." }, { "sent_id": "be4144c60068cf242479ece304fd19-C001-3", "text": "Narrative is generally acknowledged as a fundamental mode of presenting and communicating information between humans, with different manifestations across media but with a very significant presence in textual form." }, { "sent_id": "be4144c60068cf242479ece304fd19-C001-4", "text": "Yet efforts in Natural Language Generation research have generally sidestepped the issue."
}, { "sent_id": "be4144c60068cf242479ece304fd19-C001-5", "text": "Aside from the pioneering work of (Callaway, 2002) and an early attempt to bridge the gap between narratology and natural language generation (L\u00f6nneker, 2005) , the field had mostly avoided narrative until recent times." }, { "sent_id": "be4144c60068cf242479ece304fd19-C001-6", "text": "Two possible arguments may be considered as an explanation of this: one based on the need to restrict initial work within a field to the simpler challenges before tackling the difficult ones, and another based on an assumption that the peculiarities of narrative have already been covered by existing work." }, { "sent_id": "be4144c60068cf242479ece304fd19-C001-7", "text": "Both arguments can be shown to be inappropriate." }, { "sent_id": "be4144c60068cf242479ece304fd19-C001-8", "text": "With respect to the first argument, the field of natural language generation has for many years operated under the tacit assumption that state of the art technology can only aspire to generating texts within a limited range of domains and genres." }, { "sent_id": "be4144c60068cf242479ece304fd19-C001-9", "text": "These have over the years been defined in different ways, but in spite of changes, literary texts have usually been considered to be outside the range of possible candidates." }, { "sent_id": "be4144c60068cf242479ece304fd19-C001-10", "text": "From an engineering point of view, this kind of restriction made sense when the field was starting, for two important reasons." }, { "sent_id": "be4144c60068cf242479ece304fd19-C001-11", "text": "One, the technological solutions available at the time for the various tasks involved in natural language generation were in their infancy, and the linguistic complexity of literary text might have been beyond their scope." 
}, { "sent_id": "be4144c60068cf242479ece304fd19-C001-12", "text": "Two, natural language generation arose from a desire to extend the studies that had been carried out for computational analysis of language to the task of generation, and what was known about language from a computational point of view concerned simple texts." }, { "sent_id": "be4144c60068cf242479ece304fd19-C001-13", "text": "Most of the studies on language and computation had applied similar simplifying assumptions." }, { "sent_id": "be4144c60068cf242479ece304fd19-C001-14", "text": "However, such restricting assumptions are no longer necessary and may be inhibiting progress." }, { "sent_id": "be4144c60068cf242479ece304fd19-C001-15", "text": "In terms of technology, the field has matured significantly over the intervening years." }, { "sent_id": "be4144c60068cf242479ece304fd19-C001-16", "text": "The current state of the art provides a wide range of solutions that may be well suited to address some of the more complex phenomena involved in literary text." }, { "sent_id": "be4144c60068cf242479ece304fd19-C001-17", "text": "Additional objections may be made on the grounds that we do not know enough about these phenomena." }, { "sent_id": "be4144c60068cf242479ece304fd19-C001-18", "text": "Such objections, however valid they might have been originally, are no longer valid either." }, { "sent_id": "be4144c60068cf242479ece304fd19-C001-19", "text": "Many of the phenomena that were considered beyond computational treatment (metaphor, emotion, temporal reasoning, dialogue...) have been the subject of serious and sustained study over the same time period." }, { "sent_id": "be4144c60068cf242479ece304fd19-C001-20", "text": "Many approaches to their computational modelling and treatment have become available." }, { "sent_id": "be4144c60068cf242479ece304fd19-C001-21", "text": "More to the point, the last few years have seen a rise of interest in literary text within the natural language processing community."
}, { "sent_id": "be4144c60068cf242479ece304fd19-C001-22", "text": "This With respect to the second argument, the recent reappearance of narrative as a research topic for NLG should be enough to dispel the notion that all its problems have already been solved." }, { "sent_id": "be4144c60068cf242479ece304fd19-C001-23", "text": "Narrative has many peculiarities that set it apart from other kinds of text, and the body of work addressing narrative as a research topic within NLG has 103 at most uncovered and staked out a set of problems and challenges that area waiting further exploration." }, { "sent_id": "be4144c60068cf242479ece304fd19-C001-24", "text": "Of these various open problems in the treatment of narrative, my talk will focus on the problem of narrative composition." }, { "sent_id": "be4144c60068cf242479ece304fd19-C001-25", "text": "Research on narrative is plagued by the difficulty of establishing a definition of the term that is both sufficiently formal to act as foundation for scientific rigour, and sufficiently rich to cover the fundamental aspects that people associate with the term." }, { "sent_id": "be4144c60068cf242479ece304fd19-C001-26", "text": "At the present stage of development, tentative definition need to be established, to be later confirmed on the basis of empirical work and succesful evaluation of results." }, { "sent_id": "be4144c60068cf242479ece304fd19-C001-27", "text": "The talk will outline some of the possibilities that must be considered (arising from established definitions in the field of narratology) and some of the restrictions that arise from the computational nature of the task." }, { "sent_id": "be4144c60068cf242479ece304fd19-C001-28", "text": "From the combination of these constraints, a working model of narrative structure will be outlined." 
}, { "sent_id": "be4144c60068cf242479ece304fd19-C001-29", "text": "However, it is clear that such a model must document the relation between a semantic description of the content of the narrative (what is usually termed the fabula) and its rendition as a sequential discourse." }, { "sent_id": "be4144c60068cf242479ece304fd19-C001-30", "text": "The task of narrative composition will be specified as the task of constructing such a discourse (or discourse plan) for a given semantic description of fabula." }, { "sent_id": "be4144c60068cf242479ece304fd19-C001-31", "text": "This discourse should be susceptible of being converted into text and it should appropriately conveys the set of events in the fabula in such a way that satifies a number of traditionally accepted requirements (like having an identifiable theme, a certain temporal and causal coherence, a recognisable set of characters...)." }, { "sent_id": "be4144c60068cf242479ece304fd19-C001-32", "text": "A number of basic narratological concepts will be described where they provide tools for breaking down the task into computationally tractable subproblems." }, { "sent_id": "be4144c60068cf242479ece304fd19-C001-33", "text": "Of particular interest is the concept of focalization, which is used by narratologists to describe the way certain segments of a narrative follow a particular character, and which provides a useful computational representation of both the granularity and the shift in focus employed during the process of converting the semantics of the fabula into a linear discourse." }, { "sent_id": "be4144c60068cf242479ece304fd19-C001-34", "text": "As part of the talk, narrative composition will be framed in terms of the accepted task breakdown for natural language generation, considering that it may involve a combination of content determination and discourse planning that cannot be segregated into separate subtasks." 
}, { "sent_id": "be4144c60068cf242479ece304fd19-C001-35", "text": "The talk will also discuss the relation of the task of narrative composition with a number of existing research problems such as story generation (which could correspond to the construction of fabula but is sometimes simplified down to construction of a discourse directly) and creativity (which has been addressed with respect to story generation but may also constitute a fundamental ingredient of the composition task)." }, { "sent_id": "be4144c60068cf242479ece304fd19-C001-36", "text": "----------------------------------" }, { "sent_id": "be4144c60068cf242479ece304fd19-C001-37", "text": "****" }, { "sent_id": "be4144c60068cf242479ece304fd19-C001-38", "text": "The last few years have seen an increased interest in narrative within the field of Natural Language Generation (Reiter et al., 2008; Elson and McKeown, 2010; Siddharthan et al., 2012; Lester, 2012) ." }, { "sent_id": "be4144c60068cf242479ece304fd19-C001-39", "text": "Narrative is generally acknowledged as a fundamental mode of presenting and communicating information between humans, with different manifestations across media but with a very significant presence in textual form." }, { "sent_id": "be4144c60068cf242479ece304fd19-C001-40", "text": "Yet efforts in Natural Language Generation research have generally side stepped the issue." }, { "sent_id": "be4144c60068cf242479ece304fd19-C001-41", "text": "Aside from the pioneering work of (Callaway, 2002) and an early attempt to bridge the gap between narratology and natural language generation (L\u00f6nneker, 2005) , the field had mostly avoided narrative until recent times." 
}, { "sent_id": "be4144c60068cf242479ece304fd19-C001-42", "text": "Two possible arguments may be considered as an explanation of this: one based on the need to restrict initial work within a field to the simpler challenges before tackling the difficult ones, and another based on an assumption that the peculiarities of narrative have already been covered by existing work." }, { "sent_id": "be4144c60068cf242479ece304fd19-C001-43", "text": "Both arguments can be shown to be inappropriate." }, { "sent_id": "be4144c60068cf242479ece304fd19-C001-44", "text": "With respect to the first argument, the field of natural language generation has for many years operated under the tacit assumption that state of the art technology can only aspire to generating texts within a limited range of domains and genres." }, { "sent_id": "be4144c60068cf242479ece304fd19-C001-45", "text": "These have over the years been defined in different ways, but in spite of changes, literary texts have usually been considered to be outside the range of possible candidates." }, { "sent_id": "be4144c60068cf242479ece304fd19-C001-46", "text": "From an engineering point of view, this kind of restriction made sense when the field was starting, for two important reasons." }, { "sent_id": "be4144c60068cf242479ece304fd19-C001-47", "text": "One, the technological solutions available at the time for the various tasks involved in natural language generation were in their infancy, and the linguistic complexity of literary text might have been beyond their scope." }, { "sent_id": "be4144c60068cf242479ece304fd19-C001-48", "text": "Two, natural language generation arose from a desire to extend the studies that had been carried out for computational analysis of language to the task of generation, and what was known about language from a computational point of view concerned simple texts." 
}, { "sent_id": "be4144c60068cf242479ece304fd19-C001-49", "text": "Most of the studies on language and computation had applied similar simplifying assumptions." }, { "sent_id": "be4144c60068cf242479ece304fd19-C001-50", "text": "However, such restricting assumptions are no longer necessary and may be inhibiting progress." }, { "sent_id": "be4144c60068cf242479ece304fd19-C001-51", "text": "In terms of technology, the field has matured significantly over the intervening years." }, { "sent_id": "be4144c60068cf242479ece304fd19-C001-52", "text": "The current state of the art provides a wide range of solutions that may be well suited to address some of the more complex phenomena involved in literary text." }, { "sent_id": "be4144c60068cf242479ece304fd19-C001-53", "text": "Additional objections may be made on the grounds that we do not know enough about these phenomena." }, { "sent_id": "be4144c60068cf242479ece304fd19-C001-54", "text": "Such objections, however valid they might have been originally, are no longer valid either." }, { "sent_id": "be4144c60068cf242479ece304fd19-C001-55", "text": "Many of the phenomena that were considered beyond computational treatment (metaphor, emotion, temporal reasoning, dialogue...) have been the subject of serious and sustained study over the same time period." }, { "sent_id": "be4144c60068cf242479ece304fd19-C001-56", "text": "Many approaches to their computational modelling and treatment have become available." }, { "sent_id": "be4144c60068cf242479ece304fd19-C001-57", "text": "More to the point, the last few years have seen a rise of interest on literary text within the natural language processing community." }, { "sent_id": "be4144c60068cf242479ece304fd19-C001-58", "text": "This With respect to the second argument, the recent reappearance of narrative as a research topic for NLG should be enough to dispel the notion that all its problems have already been solved." 
}, { "sent_id": "be4144c60068cf242479ece304fd19-C001-59", "text": "Narrative has many peculiarities that set it apart from other kinds of text, and the body of work addressing narrative as a research topic within NLG has at most uncovered and staked out a set of problems and challenges that area waiting further exploration." }, { "sent_id": "be4144c60068cf242479ece304fd19-C001-60", "text": "Of these various open problems in the treatment of narrative, my talk will focus on the problem of narrative composition." }, { "sent_id": "be4144c60068cf242479ece304fd19-C001-61", "text": "Research on narrative is plagued by the difficulty of establishing a definition of the term that is both sufficiently formal to act as foundation for scientific rigour, and sufficiently rich to cover the fundamental aspects that people associate with the term." }, { "sent_id": "be4144c60068cf242479ece304fd19-C001-62", "text": "At the present stage of development, tentative definition need to be established, to be later confirmed on the basis of empirical work and succesful evaluation of results." }, { "sent_id": "be4144c60068cf242479ece304fd19-C001-63", "text": "The talk will outline some of the possibilities that must be considered (arising from established definitions in the field of narratology) and some of the restrictions that arise from the computational nature of the task." }, { "sent_id": "be4144c60068cf242479ece304fd19-C001-64", "text": "From the combination of these constraints, a working model of narrative structure will be outlined." }, { "sent_id": "be4144c60068cf242479ece304fd19-C001-65", "text": "However, it is clear that such a model must document the relation between a semantic description of the content of the narrative (what is usually termed the fabula) and its rendition as a sequential discourse." 
}, { "sent_id": "be4144c60068cf242479ece304fd19-C001-66", "text": "The task of narrative composition will be specified as the task of constructing such a discourse (or discourse plan) for a given semantic description of fabula." }, { "sent_id": "be4144c60068cf242479ece304fd19-C001-67", "text": "This discourse should be susceptible of being converted into text and it should appropriately conveys the set of events in the fabula in such a way that satifies a number of traditionally accepted requirements (like having an identifiable theme, a certain temporal and causal coherence, a recognisable set of characters...)." }, { "sent_id": "be4144c60068cf242479ece304fd19-C001-68", "text": "A number of basic narratological concepts will be described where they provide tools for breaking down the task into computationally tractable subproblems." }, { "sent_id": "be4144c60068cf242479ece304fd19-C001-69", "text": "Of particular interest is the concept of focalization, which is used by narratologists to describe the way certain segments of a narrative follow a particular character, and which provides a useful computational representation of both the granularity and the shift in focus employed during the process of converting the semantics of the fabula into a linear discourse." }, { "sent_id": "be4144c60068cf242479ece304fd19-C001-70", "text": "As part of the talk, narrative composition will be framed in terms of the accepted task breakdown for natural language generation, considering that it may involve a combination of content determination and discourse planning that cannot be segregated into separate subtasks." 
}, { "sent_id": "be4144c60068cf242479ece304fd19-C001-71", "text": "The talk will also discuss the relation of the task of narrative composition with a number of existing research problems such as story generation (which could correspond to the construction of fabula but is sometimes simplified down to construction of a discourse directly) and creativity (which has been addressed with respect to story generation but may also constitute a fundamental ingredient of the composition task)." } ], "y": { "@BACK@": { "gold_contexts": [ [ "be4144c60068cf242479ece304fd19-C001-2" ], [ "be4144c60068cf242479ece304fd19-C001-38" ] ], "cite_sentences": [ "be4144c60068cf242479ece304fd19-C001-2", "be4144c60068cf242479ece304fd19-C001-38" ] } } }, "ABC_b722b98f50669bf3b22208a25f6854_49": { "x": [ { "sent_id": "b722b98f50669bf3b22208a25f6854-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "b722b98f50669bf3b22208a25f6854-C001-2", "text": "Estimating the internal state of a robotic system is complex: this is performed from multiple heterogeneous sensor inputs and knowledge sources." }, { "sent_id": "b722b98f50669bf3b22208a25f6854-C001-3", "text": "Discretization of such inputs is done to capture saliences, represented as symbolic information, which often presents structure and recurrence." }, { "sent_id": "b722b98f50669bf3b22208a25f6854-C001-4", "text": "As these sequences are used to reason over complex scenarios [1] , a more compact representation would aid exactness of technical cognitive reasoning capabilities, which are today constrained by computational complexity issues and fallback to representational heuristics or human intervention [1], [2] ." }, { "sent_id": "b722b98f50669bf3b22208a25f6854-C001-5", "text": "Such problems need to be addressed to ensure timely and meaningful human-robot interaction." 
}, { "sent_id": "b722b98f50669bf3b22208a25f6854-C001-6", "text": "Our work is towards understanding the variability of learning informativeness when training on subsets of a given input dataset." }, { "sent_id": "b722b98f50669bf3b22208a25f6854-C001-7", "text": "This is in view of reducing the training size while retaining the majority of the symbolic learning potential." }, { "sent_id": "b722b98f50669bf3b22208a25f6854-C001-8", "text": "We prove the concept on human-written texts, and conjecture this work will reduce training data size of sequential instructions, while preserving semantic relations, when gathering information from large remote sources [3] ." }, { "sent_id": "b722b98f50669bf3b22208a25f6854-C001-9", "text": "We computed multiple random subsets of sentences from the UMBC WEBBASE CORPUS (\u223c 17.13GB) via a custom implementation using the SPARK distributed framework." }, { "sent_id": "b722b98f50669bf3b22208a25f6854-C001-10", "text": "We evaluated the learning informativess of such sets in terms of semantic word-sense classification accuracy (with WORD2VEC [4]), and of n-gram perplexity." }, { "sent_id": "b722b98f50669bf3b22208a25f6854-C001-11", "text": "Previous literature inform us that corpus size and posterior quality do not follow linear correlation for some learning tasks (e.g. semantic measures) [5] ." }, { "sent_id": "b722b98f50669bf3b22208a25f6854-C001-12", "text": "In our semantic tests, on average 85% of the quality can be obtained by training on a random \u223c 4% subset of the original corpus (e.g. as in Fig. 1 , 5 random million lines yield 64.14% instead of 75.14%)." }, { "sent_id": "b722b98f50669bf3b22208a25f6854-C001-13", "text": "Our claims are that i) such evaluation posteriors are Normally distributed (Tab. I), and that ii) the variance is inversely proportional to the subset size (Tab. II)." 
}, { "sent_id": "b722b98f50669bf3b22208a25f6854-C001-14", "text": "It is therefore possible to select the best random subset for a given size, if an information criterion is known." }, { "sent_id": "b722b98f50669bf3b22208a25f6854-C001-15", "text": "Such metric is currently under investigation." }, { "sent_id": "b722b98f50669bf3b22208a25f6854-C001-16", "text": "Within the robotics domain, in order to reduce computational complexity of the training phase, cardinality reduction of human-written instructions is particularly important for non-recursive online training algorithms, such as current symbol-based probabilistic reasoning systems [1] , [3] , [6] ." }, { "sent_id": "b722b98f50669bf3b22208a25f6854-C001-17", "text": "----------------------------------" }, { "sent_id": "b722b98f50669bf3b22208a25f6854-C001-18", "text": "****" }, { "sent_id": "b722b98f50669bf3b22208a25f6854-C001-19", "text": "Estimating the internal state of a robotic system is complex: this is performed from multiple heterogeneous sensor inputs and knowledge sources." }, { "sent_id": "b722b98f50669bf3b22208a25f6854-C001-20", "text": "Discretization of such inputs is done to capture saliences, represented as symbolic information, which often presents structure and recurrence." }, { "sent_id": "b722b98f50669bf3b22208a25f6854-C001-21", "text": "As these sequences are used to reason over complex scenarios [1] , a more compact representation would aid exactness of technical cognitive reasoning capabilities, which are today constrained by computational complexity issues and fallback to representational heuristics or human intervention [1] , [2] ." }, { "sent_id": "b722b98f50669bf3b22208a25f6854-C001-22", "text": "Such problems need to be addressed to ensure timely and meaningful human-robot interaction." }, { "sent_id": "b722b98f50669bf3b22208a25f6854-C001-23", "text": "Our work is towards understanding the variability of learning informativeness when training on subsets of a given input dataset." 
}, { "sent_id": "b722b98f50669bf3b22208a25f6854-C001-24", "text": "This is in view of reducing the training size while retaining the majority of the symbolic learning potential." }, { "sent_id": "b722b98f50669bf3b22208a25f6854-C001-25", "text": "We prove the concept on human-written texts, and conjecture this work will reduce training data size of sequential instructions, while preserving semantic relations, when gathering information from large remote sources [3] ." }, { "sent_id": "b722b98f50669bf3b22208a25f6854-C001-26", "text": "----------------------------------" }, { "sent_id": "b722b98f50669bf3b22208a25f6854-C001-27", "text": "**POSTERIOR EVALUATION DISTRIBUTION OF SUBSETS**" }, { "sent_id": "b722b98f50669bf3b22208a25f6854-C001-28", "text": "We computed multiple random subsets of sentences from the UMBC WEBBASE CORPUS (\u223c 17.13GB) via a custom implementation using the SPARK distributed framework." }, { "sent_id": "b722b98f50669bf3b22208a25f6854-C001-29", "text": "We evaluated the learning informativess of such sets in terms of semantic word-sense classification accuracy (with WORD2VEC [4] ), and of n-gram perplexity." }, { "sent_id": "b722b98f50669bf3b22208a25f6854-C001-30", "text": "Previous literature inform us that corpus size and posterior quality do not follow linear correlation for some learning tasks (e.g. semantic measures) [5] ." }, { "sent_id": "b722b98f50669bf3b22208a25f6854-C001-31", "text": "In our semantic tests, on average 85% of the quality can be obtained by training on a random \u223c 4% subset of the original corpus (e.g. as in Fig. 1 , 5 random million lines yield 64.14% instead of 75.14%)." }, { "sent_id": "b722b98f50669bf3b22208a25f6854-C001-32", "text": "Our claims are that i) such evaluation posteriors are Normally distributed (Tab. I), and that ii) the variance is inversely proportional to the subset size (Tab. II)." 
}, { "sent_id": "b722b98f50669bf3b22208a25f6854-C001-33", "text": "It is therefore possible to select the best random subset for a given size, if an information criterion is known." }, { "sent_id": "b722b98f50669bf3b22208a25f6854-C001-34", "text": "Such metric is currently under investigation." }, { "sent_id": "b722b98f50669bf3b22208a25f6854-C001-35", "text": "Within the robotics domain, in order to reduce computational complexity of the training phase, cardinality reduction of human-written instructions is particularly important for non-recursive online training algorithms, such as current symbol-based probabilistic reasoning systems [1] , [3] , [6] ." } ], "y": { "@BACK@": { "gold_contexts": [ [ "b722b98f50669bf3b22208a25f6854-C001-11" ], [ "b722b98f50669bf3b22208a25f6854-C001-30" ] ], "cite_sentences": [ "b722b98f50669bf3b22208a25f6854-C001-11", "b722b98f50669bf3b22208a25f6854-C001-30" ] }, "@EXT@": { "gold_contexts": [ [ "b722b98f50669bf3b22208a25f6854-C001-11", "b722b98f50669bf3b22208a25f6854-C001-12", "b722b98f50669bf3b22208a25f6854-C001-13" ], [ "b722b98f50669bf3b22208a25f6854-C001-30", "b722b98f50669bf3b22208a25f6854-C001-31", "b722b98f50669bf3b22208a25f6854-C001-32" ] ], "cite_sentences": [ "b722b98f50669bf3b22208a25f6854-C001-11", "b722b98f50669bf3b22208a25f6854-C001-30" ] } } }, "ABC_f856c4fb5e6e00729d33b15b24aff6_49": { "x": [ { "sent_id": "f856c4fb5e6e00729d33b15b24aff6-C001-10", "text": "Technically, it lets us position ourselves as part of an emerging revolution in integrative Artificial Intelligence, characterized by research challenges like human-robot interaction and the design of virtual humans, and applications in assistive and educational technology and interactive entertainment." }, { "sent_id": "f856c4fb5e6e00729d33b15b24aff6-C001-11", "text": "Researchers are already hard at work to place our accounts of embodied action in conversation in contact with pragmatic theories derived from text discourse and spoken dialogue." 
}, { "sent_id": "f856c4fb5e6e00729d33b15b24aff6-C001-12", "text": "In my own experience, such work proves both illuminating and exciting." }, { "sent_id": "f856c4fb5e6e00729d33b15b24aff6-C001-13", "text": "For example, it challenges us to support and refine theories of discourse coherence by accounting for the discourse relations and default inference that determine the joint interpretation of coverbal gesture and its accompanying speech (Lascarides and Stone, 2008) . And it challenges us to show how speakers work across modalities to engage with, disambiguate, and (on acceptance) recapitulate each others' communicative actions, to ground their meanings (Lascarides and Stone, In Preparation)." }, { "sent_id": "f856c4fb5e6e00729d33b15b24aff6-C001-14", "text": "The closer we look at conversation, the more we can fit all its behaviors into a unitary framework-inviting us to implement behavioral control for embodied social agents through a pervasive analogy to NLG." }, { "sent_id": "f856c4fb5e6e00729d33b15b24aff6-C001-15", "text": "We can already pursue such implementations easily." }, { "sent_id": "f856c4fb5e6e00729d33b15b24aff6-C001-16", "text": "Computationally, motion is just sequence data, and we can manipulate it in parallel ways to the speech data we already use in spoken language generation ." }, { "sent_id": "f856c4fb5e6e00729d33b15b24aff6-C001-17", "text": "At a higher level, we can represent an embodied performance through a matrix of discrete actions selected and synchronized to an abstract time-line, as in our RUTH system Stone and Oh, 2008) ." }, { "sent_id": "f856c4fb5e6e00729d33b15b24aff6-C001-18", "text": "This lets us use any NLG method that manipulates structured selections of discrete actions as an architecture for the production of embodied behavior." 
}, { "sent_id": "f856c4fb5e6e00729d33b15b24aff6-C001-19", "text": "Templates, as in (Stone and DeCarlo, 2003; , offer 5" }, { "sent_id": "f856c4fb5e6e00729d33b15b24aff6-C001-20", "text": "----------------------------------" }, { "sent_id": "f856c4fb5e6e00729d33b15b24aff6-C001-21", "text": "****" }, { "sent_id": "f856c4fb5e6e00729d33b15b24aff6-C001-22", "text": "Researchers are already hard at work to place our accounts of embodied action in conversation in contact with pragmatic theories derived from text discourse and spoken dialogue." }, { "sent_id": "f856c4fb5e6e00729d33b15b24aff6-C001-23", "text": "In my own experience, such work proves both illuminating and exciting." }, { "sent_id": "f856c4fb5e6e00729d33b15b24aff6-C001-24", "text": "For example, it challenges us to support and refine theories of discourse coherence by accounting for the discourse relations and default inference that determine the joint interpretation of coverbal gesture and its accompanying speech (Lascarides and Stone, 2008) . And it challenges us to show how speakers work across modalities to engage with, disambiguate, and (on acceptance) recapitulate each others' communicative actions, to ground their meanings (Lascarides and Stone, In Preparation) ." }, { "sent_id": "f856c4fb5e6e00729d33b15b24aff6-C001-25", "text": "The closer we look at conversation, the more we can fit all its behaviors into a unitary framework-inviting us to implement behavioral control for embodied social agents through a pervasive analogy to NLG." }, { "sent_id": "f856c4fb5e6e00729d33b15b24aff6-C001-26", "text": "We can already pursue such implementations easily." }, { "sent_id": "f856c4fb5e6e00729d33b15b24aff6-C001-27", "text": "Computationally, motion is just sequence data, and we can manipulate it in parallel ways to the speech data we already use in spoken language generation ." 
}, { "sent_id": "f856c4fb5e6e00729d33b15b24aff6-C001-28", "text": "At a higher level, we can represent an embodied performance through a matrix of discrete actions selected and synchronized to an abstract time-line, as in our RUTH system Stone and Oh, 2008) ." }, { "sent_id": "f856c4fb5e6e00729d33b15b24aff6-C001-29", "text": "This lets us use any NLG method that manipulates structured selections of discrete actions as an architecture for the production of embodied behavior." }, { "sent_id": "f856c4fb5e6e00729d33b15b24aff6-C001-30", "text": "Templates, as in (Stone and DeCarlo, 2003; , offer a good illustration." }, { "sent_id": "f856c4fb5e6e00729d33b15b24aff6-C001-31", "text": "Nevertheless, face-to-face dialogue does demand qualitatively new capabilities." }, { "sent_id": "f856c4fb5e6e00729d33b15b24aff6-C001-32", "text": "In fact, people's choices and meanings in interactive conversation are profoundly informed by their social settings." }, { "sent_id": "f856c4fb5e6e00729d33b15b24aff6-C001-33", "text": "We are a long way from general models that could allow NLG systems to recognize and exploit these connections in the words and other behaviors they use." }, { "sent_id": "f856c4fb5e6e00729d33b15b24aff6-C001-34", "text": "In my experience, even the simplest social practices, such as interlocutors' cooperation on an ongoing practical task, require new models of linguistic meaning and discourse context." }, { "sent_id": "f856c4fb5e6e00729d33b15b24aff6-C001-35", "text": "For example, systems must be creative to evoke the distinctions that matter for their ongoing task, and use meanings that are not programmed or learned but invented on the fly (DeVault and ." }, { "sent_id": "f856c4fb5e6e00729d33b15b24aff6-C001-36", "text": "They must count on their interlocutors to recognize the background knowledge they presuppose by general inference from the logic of their behavior as a cooperative contribution to the task (Thomason et al., 2006) ." 
}, { "sent_id": "f856c4fb5e6e00729d33b15b24aff6-C001-37", "text": "Such reasoning becomes particularly important in problematic cases, such as when systems must finetune the form and meaning of a clarification request so that the response is more likely to resolve a pending task ambiguity (DeVault and Stone, 2007) ." }, { "sent_id": "f856c4fb5e6e00729d33b15b24aff6-C001-38", "text": "I expect many further exciting developments in our understanding of meaning and interpretation as we enrich the social intelligence of NLG." }, { "sent_id": "f856c4fb5e6e00729d33b15b24aff6-C001-39", "text": "Modeling efforts will remain crucial to the exploration of these new capabilities." }, { "sent_id": "f856c4fb5e6e00729d33b15b24aff6-C001-40", "text": "When we build and assemble models of actions and interpretations, we get systems that can plan their own behavior simply by exploiting what they know about communication." }, { "sent_id": "f856c4fb5e6e00729d33b15b24aff6-C001-41", "text": "These systems give new evidence about the information and problem-solving that's involved." }, { "sent_id": "f856c4fb5e6e00729d33b15b24aff6-C001-42", "text": "The challenge is that these models must describe semantics and pragmatics, as well as syntax and behavior." }, { "sent_id": "f856c4fb5e6e00729d33b15b24aff6-C001-43", "text": "My own slow progress (Cassell et al., 2000; Koller and Stone, 2007) shows that there's still lots of hard work needed to develop suitable techniques." }, { "sent_id": "f856c4fb5e6e00729d33b15b24aff6-C001-44", "text": "I keep going because of the methodological payoffs I see on the horizon." }, { "sent_id": "f856c4fb5e6e00729d33b15b24aff6-C001-45", "text": "Modeling lets us take social intelligence seriously as a general implementation principle, and thus to aim for systems whose multimodal behavior matches the flexibility and coordination that distinguishes our own embodied meanings." 
}, { "sent_id": "f856c4fb5e6e00729d33b15b24aff6-C001-46", "text": "More generally, modeling replaces programming with data fitting, and a good model of action and interpretation in particular would let an agent's own experience in conversational interaction determine the repertoire of behaviors and meanings it uses to make itself understood." }, { "sent_id": "f856c4fb5e6e00729d33b15b24aff6-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "f856c4fb5e6e00729d33b15b24aff6-C001-2", "text": "It is an honor to have this chance to tie together themes from my recent research, and to sketch some challenges and opportunities for NLG in face-to-face conversational interaction." }, { "sent_id": "f856c4fb5e6e00729d33b15b24aff6-C001-3", "text": "Communication reflects our general involvement in one anothers' lives." }, { "sent_id": "f856c4fb5e6e00729d33b15b24aff6-C001-4", "text": "Through the choices we manifest with one another, we share our thoughts and feelings, strengthen our relationships and further our joint projects." }, { "sent_id": "f856c4fb5e6e00729d33b15b24aff6-C001-5", "text": "We rely not only on words to articulate our perspectives, but also on a heterogeneous array of accompanying efforts: embodied deixis, expressive movement, presentation of iconic imagery and instrumental action in the world." }, { "sent_id": "f856c4fb5e6e00729d33b15b24aff6-C001-6", "text": "Words showcase the distinctive linguistic knowledge which human communication exploits. But people's diverse choices in conversation in fact come together to reveal multifaceted, interrelated meanings, in which all our actions, verbal and nonverbal, fit the situation and further social purposes." }, { "sent_id": "f856c4fb5e6e00729d33b15b24aff6-C001-7", "text": "In the best case, they let interlocutors understand not just each other's words, but each other." 
}, { "sent_id": "f856c4fb5e6e00729d33b15b24aff6-C001-8", "text": "As NLG researchers, I argue, we have good reason to work towards models of social cognition that embrace the breadth of conversation." }, { "sent_id": "f856c4fb5e6e00729d33b15b24aff6-C001-9", "text": "Scientifically, it connects us to an emerging consensus in favor of a general human pragmatic competence, rooted in capacities for engagement, coordination, shared intentionality and extended relationships." } ], "y": { "@FUT@": { "gold_contexts": [ [ "f856c4fb5e6e00729d33b15b24aff6-C001-43" ] ], "cite_sentences": [ "f856c4fb5e6e00729d33b15b24aff6-C001-43" ] }, "@MOT@": { "gold_contexts": [ [ "f856c4fb5e6e00729d33b15b24aff6-C001-43" ] ], "cite_sentences": [ "f856c4fb5e6e00729d33b15b24aff6-C001-43" ] } } }, "ABC_a7559a8775941622d269433937633a_49": { "x": [ { "sent_id": "a7559a8775941622d269433937633a-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "a7559a8775941622d269433937633a-C001-2", "text": "This book provides a detailed overview of Regulus, an open-source toolkit for building and compiling grammars for spoken dialog systems, which has been used for a number of applications including Clarissa, a spoken language system in use on the International Space Station." }, { "sent_id": "a7559a8775941622d269433937633a-C001-3", "text": "The focus is on controlled-language user-initiative systems, in which the user drives the dialog, but is required to use constrained language." }, { "sent_id": "a7559a8775941622d269433937633a-C001-4", "text": "Such a spoken language interface allows for a constrained range of commands-for example, \"open the pod-bay doors\"-to be issued in circumstances where, ergonomically, other interface options are infeasible." 
}, { "sent_id": "a7559a8775941622d269433937633a-C001-5", "text": "The emphasis of the approach, given the kind of application that is the focus of the work, is thus less on robustness of speech recognition and more on depth of semantic processing and quality of the dialog management." }, { "sent_id": "a7559a8775941622d269433937633a-C001-6", "text": "It is an interesting book, one which succeeds in motivating key problems and presenting general approaches to solving them, with enough in the way of explicit details to allow even a complete novice in spoken language processing to implement simple dialog systems." }, { "sent_id": "a7559a8775941622d269433937633a-C001-7", "text": "The book is split into two parts." }, { "sent_id": "a7559a8775941622d269433937633a-C001-8", "text": "The first half is a very detailed tutorial on using Regulus to build and compile grammars." }, { "sent_id": "a7559a8775941622d269433937633a-C001-9", "text": "Regulus, although itself open-source, makes use of SICStus Prolog for the dialog processing and the Nuance Toolkit for speech recognition." }, { "sent_id": "a7559a8775941622d269433937633a-C001-10", "text": "Grammars in Regulus are specified with features requiring unification, but are compiled into context-free grammars for use with the Nuance speech recognition system." }, { "sent_id": "a7559a8775941622d269433937633a-C001-11", "text": "An abundance of implementation details and guidance for the reader (or grammar writer) is provided in this part of the book for building grammars as well as a dialog manager." }, { "sent_id": "a7559a8775941622d269433937633a-C001-12", "text": "The presentation includes example implementations handling such phenomena as ellipsis and corrections." }, { "sent_id": "a7559a8775941622d269433937633a-C001-13", "text": "In addition, details for building spoken language translation systems are presented within the same range of constrained language applications." 
}, { "sent_id": "a7559a8775941622d269433937633a-C001-14", "text": "The tutorial format is terrifically explicit, which will make this volume appropriate for undergraduate courses looking to provide students with hands-on exercises in building spoken dialog systems." }, { "sent_id": "a7559a8775941622d269433937633a-C001-15", "text": "One issue with the premise of an open-source toolkit that relies upon other software (SICStus Prolog, Nuance) for key parts of the application is that one is required to obtain and use that software." }, { "sent_id": "a7559a8775941622d269433937633a-C001-16", "text": "In the book, the authors note that an individual license of SICStus is available for a relatively small fee, and that Nuance has a program to license their Toolkit for research purposes." }, { "sent_id": "a7559a8775941622d269433937633a-C001-17", "text": "Unfortunately, since the writing of the book, corporate changes at Nuance have made obtaining such a research license more challenging, and this reviewer was only able to do so after several weeks of e-mail persistence." }, { "sent_id": "a7559a8775941622d269433937633a-C001-18", "text": "It would be very beneficial to future readers if the authors would scout out the current state of" }, { "sent_id": "a7559a8775941622d269433937633a-C001-19", "text": "----------------------------------" }, { "sent_id": "a7559a8775941622d269433937633a-C001-20", "text": "****" }, { "sent_id": "a7559a8775941622d269433937633a-C001-21", "text": "the process of obtaining such a license and provide a detailed map of it on some easily accessible Web site." }, { "sent_id": "a7559a8775941622d269433937633a-C001-22", "text": "As it stands, the barriers to building the full spoken dialog systems detailed in the book are higher than they should be." }, { "sent_id": "a7559a8775941622d269433937633a-C001-23", "text": "The second half of the book goes under the hood to examine, more generally, the issues involved in making the authors' approach work." 
}, { "sent_id": "a7559a8775941622d269433937633a-C001-24", "text": "The chapter on compilation of feature grammars into context-free grammars looks in depth at alternatives to exhaustive expansion, including efficient filtering techniques and grammar transformation." }, { "sent_id": "a7559a8775941622d269433937633a-C001-25", "text": "There is also a chapter on developing an English feature grammar specifically for integration with a speech recognition system." }, { "sent_id": "a7559a8775941622d269433937633a-C001-26", "text": "A third chapter presents the authors' approach for adapting the general English feature grammar to a particular domain, which they term grammar specialization." }, { "sent_id": "a7559a8775941622d269433937633a-C001-27", "text": "In effect, they induce a new feature grammar by transforming (flattening, to a greater or lesser extent) an automatically produced treebank of domain-specific text and extracting new rules from the treebank, as well as estimating production probabilities from the corpus." }, { "sent_id": "a7559a8775941622d269433937633a-C001-28", "text": "A subsequent chapter investigates the impact of various parameterizations on this grammar specialization approach (for example, the degree of flattening), thus establishing, at least for the applications presented, some best practices." }, { "sent_id": "a7559a8775941622d269433937633a-C001-29", "text": "A final chapter presents a comparison of their grammar-based approach with a class-based n-gram language model." }, { "sent_id": "a7559a8775941622d269433937633a-C001-30", "text": "This half of the book will be more enjoyable for readers of this journal, who are presumably interested in more general questions of computation and language than the step-by-step tutorial format of the first half of the book."
}, { "sent_id": "a7559a8775941622d269433937633a-C001-31", "text": "The details of the approach are interesting, particularly the insights about how to build linguistically rich grammars that can be effectively compiled into high-utility context-free grammars for speech recognition." }, { "sent_id": "a7559a8775941622d269433937633a-C001-32", "text": "The primary shortcoming of this presentation lies in perpetuating the false dichotomy between \"grammar-based\" and \"data-driven\" approaches to language modeling for speech recognition, which motivates the final chapter of the book." }, { "sent_id": "a7559a8775941622d269433937633a-C001-33", "text": "In fact, the authors' approach is both grammar-based and data-driven, given the corpus-based grammar specialization and PCFG estimation, which the authors themselves demonstrate to be indispensable." }, { "sent_id": "a7559a8775941622d269433937633a-C001-34", "text": "Robust grammar-based language modeling is a topic that has received a fair bit of attention over the past decade (Chelba and Jelinek 2000; Charniak 2001; Roark 2001; Wang, Stolcke, and Harper 2004, among others) , and while this line of research has not focused on the use of manually built, narrow-domain feature grammars, there is enough similarity between the approach described in this book and the cited papers that the papers would seem to be better comparison points than the class-based language models that are chosen to represent robust approaches in the comparison." }, { "sent_id": "a7559a8775941622d269433937633a-C001-35", "text": "Beyond language modeling, methods for PCFG induction from treebanks have been a popular topic in the field over the past decade, and some understanding of the impact of flattening trees can be had in Johnson (1998) , where the beneficial impact of various tree transformations for probabilistic grammars is presented." 
}, { "sent_id": "a7559a8775941622d269433937633a-C001-36", "text": "None of this work is discussed or cited, and the naive reader might be left with the impression that data-driven approaches have been demonstrated to underperform relative to knowledge-based approaches, when in fact the authors simply demonstrate that their hybrid grammar-based/data-driven approach outperforms class-based language models." }, { "sent_id": "a7559a8775941622d269433937633a-C001-37", "text": "Perhaps this is worth demonstrating, but the chapter couches the results within the context of a clash between paradigms, which simply does not ring true." }, { "sent_id": "a7559a8775941622d269433937633a-C001-38", "text": "This one misstep, however, does not detract from the quality of the authors' system, nor from the interesting presentation of too-often-ignored aspects of spoken language systems engineering." }, { "sent_id": "a7559a8775941622d269433937633a-C001-39", "text": "The book and the toolkit it describes can serve a very useful pedagogical purpose by providing a foundation upon which students and researchers" } ], "y": { "@BACK@": { "gold_contexts": [ [ "a7559a8775941622d269433937633a-C001-34" ] ], "cite_sentences": [ "a7559a8775941622d269433937633a-C001-34" ] }, "@SIM@": { "gold_contexts": [ [ "a7559a8775941622d269433937633a-C001-34" ] ], "cite_sentences": [ "a7559a8775941622d269433937633a-C001-34" ] } } }, "ABC_6cbc59d4cb2d3246b3efa1ee612270_49": { "x": [ { "sent_id": "6cbc59d4cb2d3246b3efa1ee612270-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "6cbc59d4cb2d3246b3efa1ee612270-C001-2", "text": "The iADAATPA project, coded as 2016-EU-IA-0132, which ended in February 2019, aimed at building customized, domain-specific engines for public administrations from EU Member States." }, { "sent_id": "6cbc59d4cb2d3246b3efa1ee612270-C001-3", "text": "At the beginning of the project, the consortium decided to use neural machine translation."
}, { "sent_id": "6cbc59d4cb2d3246b3efa1ee612270-C001-4", "text": "This represented a challenge for all involved, and the positive aspect is that all public administrations engaged in the iADAATPA project were able to try, test and use state-of-the-art neural technology with a high level of satisfaction." }, { "sent_id": "6cbc59d4cb2d3246b3efa1ee612270-C001-5", "text": "One of the main challenges faced by all partners was data availability." }, { "sent_id": "6cbc59d4cb2d3246b3efa1ee612270-C001-6", "text": "Although all public administrations had some data available, it was clearly insufficient for high-level customization." }, { "sent_id": "6cbc59d4cb2d3246b3efa1ee612270-C001-7", "text": "In some cases, we had merely a few hundred words or several tens of thousands of words." }, { "sent_id": "6cbc59d4cb2d3246b3efa1ee612270-C001-8", "text": "Each domain (field) has its own unique word distribution, and neural machine translation systems are known to suffer a decrease in performance when data is out-of-domain." }, { "sent_id": "6cbc59d4cb2d3246b3efa1ee612270-C001-9", "text": "Pangeanic is a language service provider (LSP) specialised in natural language processing and machine translation." }, { "sent_id": "6cbc59d4cb2d3246b3efa1ee612270-C001-10", "text": "It provides solutions to cognitive companies, institutions, translation professionals, and corporations." }, { "sent_id": "6cbc59d4cb2d3246b3efa1ee612270-C001-11", "text": "The problem faced by the iADAATPA project at Pangeanic was twofold. Data acquisition: For translation from Spanish to Russian there was no in-domain data available." }, { "sent_id": "6cbc59d4cb2d3246b3efa1ee612270-C001-12", "text": "Therefore, two translators were contracted as part of the project to create 30,000 segments of in-domain data, translating public administration websites."
}, { "sent_id": "6cbc59d4cb2d3246b3efa1ee612270-C001-13", "text": "They also cleaned United Nations material and post-edited general-domain data that was previously filtered as in-domain following the \"invitation model\" (Hoang and Sima'an, 2014)." }, { "sent_id": "6cbc59d4cb2d3246b3efa1ee612270-C001-14", "text": "For the other language pairs, the input material was 30,000 post-edited segments." }, { "sent_id": "6cbc59d4cb2d3246b3efa1ee612270-C001-15", "text": "The main part of the training corpora (approximately 75%) came from Pangeanic's own repository harvested through web crawling, and also from OpenSubtitles (Tiedemann, 2012)." }, { "sent_id": "6cbc59d4cb2d3246b3efa1ee612270-C001-16", "text": "The rest of the corpus was automatically validated synthetic material using general data from Leipzig (Goldhahn et al., 2012). Engine customization: The data was cleaned using the Bicleaner tool (S\u00e1nchez-Cartagena et al., 2018)." }, { "sent_id": "6cbc59d4cb2d3246b3efa1ee612270-C001-17", "text": "The data was lowercased and extra embeddings were added in order to keep the case information." }, { "sent_id": "6cbc59d4cb2d3246b3efa1ee612270-C001-18", "text": "The tokenization used was the one provided by OpenNMT 3 and words were divided into subwords according to the BPE (Sennrich et al., 2016) approach." }, { "sent_id": "6cbc59d4cb2d3246b3efa1ee612270-C001-19", "text": "The models were trained with multi-domain data and we improved performance following a domain-mixing approach (Britz et al., 2017)." }, { "sent_id": "6cbc59d4cb2d3246b3efa1ee612270-C001-20", "text": "The domain information was prepended as special tokens to each target sequence." }, { "sent_id": "6cbc59d4cb2d3246b3efa1ee612270-C001-21", "text": "The domain prediction was based only on the source, as the extra token was added at target-side and there was no need for a priori domain information."
}, { "sent_id": "6cbc59d4cb2d3246b3efa1ee612270-C001-22", "text": "This approach allowed the model to improve the quality for each domain." }, { "sent_id": "6cbc59d4cb2d3246b3efa1ee612270-C001-23", "text": "----------------------------------" }, { "sent_id": "6cbc59d4cb2d3246b3efa1ee612270-C001-24", "text": "****" }, { "sent_id": "6cbc59d4cb2d3246b3efa1ee612270-C001-25", "text": "One of the main challenges faced by all partners was data availability." }, { "sent_id": "6cbc59d4cb2d3246b3efa1ee612270-C001-26", "text": "Although all public administrations had some data available, it was clearly insufficient for high-level customization." }, { "sent_id": "6cbc59d4cb2d3246b3efa1ee612270-C001-27", "text": "In some cases, we had merely a few hundred words or several tens of thousands of words." }, { "sent_id": "6cbc59d4cb2d3246b3efa1ee612270-C001-28", "text": "Each domain (field) has its own unique word distribution, and neural machine translation systems are known to suffer a decrease in performance when data is out-of-domain." }, { "sent_id": "6cbc59d4cb2d3246b3efa1ee612270-C001-29", "text": "Pangeanic is a language service provider (LSP) specialised in natural language processing and machine translation." }, { "sent_id": "6cbc59d4cb2d3246b3efa1ee612270-C001-30", "text": "It provides solutions to cognitive companies, institutions, translation professionals, and corporations." }, { "sent_id": "6cbc59d4cb2d3246b3efa1ee612270-C001-31", "text": "The problem faced by the iADAATPA project at Pangeanic was twofold. Data acquisition: For translation from Spanish to Russian there was no in-domain data available." }, { "sent_id": "6cbc59d4cb2d3246b3efa1ee612270-C001-32", "text": "Therefore, two translators were contracted as part of the project to create 30,000 segments of in-domain data, translating public administration websites."
}, { "sent_id": "6cbc59d4cb2d3246b3efa1ee612270-C001-33", "text": "They also cleaned United Nations material and post-edited general-domain data that was previously filtered as in-domain following the \"invitation model\" (Hoang and Sima'an, 2014)." }, { "sent_id": "6cbc59d4cb2d3246b3efa1ee612270-C001-34", "text": "For the other language pairs, the input material was 30,000 post-edited segments." }, { "sent_id": "6cbc59d4cb2d3246b3efa1ee612270-C001-35", "text": "The main part of the training corpora (approximately 75%) came from Pangeanic's own repository harvested through web crawling, and also from OpenSubtitles (Tiedemann, 2012)." }, { "sent_id": "6cbc59d4cb2d3246b3efa1ee612270-C001-36", "text": "The rest of the corpus was automatically validated synthetic material using general data from Leipzig (Goldhahn et al., 2012)." }, { "sent_id": "6cbc59d4cb2d3246b3efa1ee612270-C001-37", "text": "Engine customization: The data was cleaned using the Bicleaner tool (S\u00e1nchez-Cartagena et al., 2018)." }, { "sent_id": "6cbc59d4cb2d3246b3efa1ee612270-C001-38", "text": "The data was lowercased and extra embeddings were added in order to keep the case information." }, { "sent_id": "6cbc59d4cb2d3246b3efa1ee612270-C001-39", "text": "The tokenization used was the one provided by OpenNMT 3 and words were divided into subwords according to the BPE (Sennrich et al., 2016) approach." }, { "sent_id": "6cbc59d4cb2d3246b3efa1ee612270-C001-40", "text": "The models were trained with multi-domain data and we improved performance following a domain-mixing approach (Britz et al., 2017)." }, { "sent_id": "6cbc59d4cb2d3246b3efa1ee612270-C001-41", "text": "The domain information was prepended as special tokens to each target sequence." }, { "sent_id": "6cbc59d4cb2d3246b3efa1ee612270-C001-42", "text": "The domain prediction was based only on the source, as the extra token was added at target-side and there was no need for a priori domain information."
}, { "sent_id": "6cbc59d4cb2d3246b3efa1ee612270-C001-43", "text": "This approach allowed the model to improve the quality for each domain." } ], "y": { "@BACK@": { "gold_contexts": [ [ "6cbc59d4cb2d3246b3efa1ee612270-C001-13" ], [ "6cbc59d4cb2d3246b3efa1ee612270-C001-33" ] ], "cite_sentences": [ "6cbc59d4cb2d3246b3efa1ee612270-C001-13", "6cbc59d4cb2d3246b3efa1ee612270-C001-33" ] } } }, "ABC_8c7722ecab0d6a21e15ce63b8a47f5_0": { "x": [ { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-292", "text": "Lemma 1.2." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-301", "text": "Proof." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-2", "text": "Selective rationalization has become a common mechanism to ensure that predictive models reveal how they use any available features." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-3", "text": "The selection may be soft or hard, and identifies a subset of input features relevant for prediction." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-4", "text": "The setup can be viewed as a cooperative game between the selector (aka rationale generator) and the predictor making use of only the selected features." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-5", "text": "The cooperative setting may, however, be compromised for two reasons." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-6", "text": "First, the generator typically has no direct access to the outcome it aims to justify, resulting in poor performance." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-7", "text": "Second, there is typically no control exerted on the information left outside the selection." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-8", "text": "We revise the overall cooperative framework to address these challenges."
}, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-9", "text": "We introduce an introspective model which explicitly predicts and incorporates the outcome into the selection process." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-10", "text": "Moreover, we explicitly control the rationale complement via an adversary so as not to leave any useful information out of the selection." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-11", "text": "We show that the two complementary mechanisms both maintain high predictive accuracy and lead to comprehensive rationales." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-12", "text": "----------------------------------" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-13", "text": "**INTRODUCTION**" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-14", "text": "Rapidly expanding applications of complex neural models also bring forth criteria other than mere performance." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-15", "text": "For example, medical (Yala et al., 2019) and other high-value decision applications require some means of verifying reasons for the predicted outcomes." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-16", "text": "This area of self-explaining models in the context of NLP applications has primarily evolved along two parallel tracks." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-17", "text": "On one hand, we can design neural architectures that expose more intricate mechanisms of reasoning, such as module networks (Andreas et al., 2016a,b; Johnson et al., 2017)." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-18", "text": "Table 1: An example of the rationales extracted by different models on the sentiment analysis task of beer reviews (appearance aspect). Red words are human-labeled rationales." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-19", "text": "Details of the experiments can be found in Appendix B."
}, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-20", "text": "While important, such approaches may still require adopting specialized designs and architectural choices that do not yet reach accuracies comparable to blackbox approaches." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-21", "text": "On the other hand, we can impose limited architectural constraints in the form of selective rationalization (Lei et al., 2016; Li et al., 2016b; Chen et al., 2018a,b) where the goal is to only expose the portion of the text relevant for prediction." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-22", "text": "The selection is done by a separately trained model called the rationale generator." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-23", "text": "The resulting text selection can be subsequently used as an input to an unconstrained, complex predictor, i.e., architectures used in the absence of any rationalization." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-24", "text": "The main challenge of this track is how to properly coordinate the rationale generator with the powerful predictor operating on the selected information during training." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-25", "text": "In this paper, we build on and extend selective rationalization." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-26", "text": "The selection process can be thought of as a cooperative game between the generator and the predictor operating on the selected, partial input text." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-27", "text": "The two players aim for the shared goal of achieving high predictive accuracy, just having to operate within the confines imposed by rationale selection (a small, concise portion of input text)." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-28", "text": "The rationales are learned entirely in an unsupervised manner, without any guidance other than constraints on their size and form."
}, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-29", "text": "Examples of ground-truth and learned rationales are given in Table 1 ." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-30", "text": "The key motivation for our work arises from the potential failures of cooperative selection." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-31", "text": "Since the generator typically has no direct access to the outcome it aims to justify, the learning process may converge to a poorly performing solution." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-32", "text": "Moreover, since only the selected portion is evaluated for its information value (via the predictor), there is typically no explicit control over the remaining portion of the text left outside the rationale." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-33", "text": "These two challenges are complementary and should be addressed jointly." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-34", "text": "Performance: The clues in text classification tasks are typically short phrases (Zaidan et al., 2007) ." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-35", "text": "However, diverse textual inputs offer a sea of such clues that may be difficult to disentangle in a manner that generalizes to evaluation data." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-36", "text": "Indeed, the generator may fail to disentangle the information about the correct label, offering misleading rationales instead." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-37", "text": "Moreover, as confirmed by the experiments presented in this paper, the collaborative nature of the game may enable the players to select a sub-optimal communication code that does not generalize, but rather overfits the training data."
}, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-38", "text": "Regression tasks considered in prior work typically offer greater feedback for the generator, making it less likely that such communication patterns would arise." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-39", "text": "We address these concerns by proposing an introspective rationale generator." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-40", "text": "The key idea is to force the generator to explicitly understand what to generate rationales for." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-41", "text": "Specifically, we provide the label that would be predicted from the full text as an additional input to the generator, thereby ensuring better overall performance." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-42", "text": "Rationale quality: The cooperative game setup does not explicitly control the information left out of the rationale." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-43", "text": "As a result, it is possible for the rationales to degenerate, as in containing only selected words without the appropriate context." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-44", "text": "In fact, the introspective generator proposed above can aggravate this problem." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-45", "text": "With access to the predicted label as input, the generator and the predictor can find a communication scheme by encoding the predicted label with special word patterns (e.g., highlighting \".\" for positive examples and \",\" for negative ones)." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-46", "text": "Table 1 shows such degenerate cases for the two cooperative methods." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-47", "text": "In order to prevent degenerate rationales, we propose a three-player game that renders explicit control over the unselected parts."
}, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-48", "text": "In addition to the generator and the predictor as in conventional cooperative rationale selection schemes, we add a third adversarial player, called the complement predictor, to regularize the cooperative communication between the generator and the predictor." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-49", "text": "The goal of the complement predictor is to predict the correct label using only words left out of the rationale." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-50", "text": "During training, the generator aims to fool the complement predictor while still maintaining high accuracy for the predictor." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-51", "text": "This ensures that the selected rationale must contain all/most of the information about the target label, leaving out irrelevant parts, within size constraints imposed on the rationales." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-52", "text": "We also theoretically show that the equilibrium of the three-player game guarantees good properties for the extracted rationales." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-53", "text": "Moreover, we empirically show that (1) the three-player framework on its own helps cooperative games such as (Lei et al., 2016) to improve both predictive accuracy and rationale quality; (2) by combining the two solutions -introspective generator and the three player game -we can achieve high predictive accuracy and non-degenerate rationales." 
}, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-54", "text": "----------------------------------" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-55", "text": "**PROBLEM FORMULATION**" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-56", "text": "This section formally defines the problem of rationalization, and then proposes a set of conditions that desirable rationales should satisfy, which address the problems of previous cooperative frameworks." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-57", "text": "Here are some notations used throughout this paper." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-58", "text": "Bolded upper-cased letters, e.g. X, denote random vectors; unbolded upper-cased letters, e.g. X, denote random scalar variables; bolded lower-cased letters, e.g. x, denote deterministic vectors or vector functions; unbolded lower-cased letters, e.g. x, denote deterministic scalars or scalar functions." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-59", "text": "p_X(\u00b7|Y) denotes the conditional probability density/mass function given Y." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-60", "text": "H(\u00b7) denotes Shannon entropy." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-61", "text": "E[\u00b7] denotes expectation." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-62", "text": "----------------------------------" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-63", "text": "**PROBLEM FORMULATION OF RATIONALIZATION**" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-64", "text": "The target application here is text classification on data points in the form of {(X, Y)}." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-65", "text": "Denote X = X_{1:L} as a sequence of words in an input text with length L. Denote Y as a label."
}, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-66", "text": "Our goal is to generate a rationale, denoted as r(X) = r_{1:L}(X), which is a selection of words in X that accounts for Y ." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-67", "text": "Formally, r(X) is a hard-masked version of X that takes the following form at each position i:" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-68", "text": "where z_i \u2208 {0, 1} is the binary mask value at position i." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-69", "text": "Many previous works (Lei et al., 2016; Chen et al., 2018a) follow the above definition of rationales." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-70", "text": "In this work, we further define the complement of the rationale, denoted as r^c(X), as" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-71", "text": "For notational ease, define" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-72", "text": "----------------------------------" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-73", "text": "**RATIONALE CONDITIONS**" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-74", "text": "An ideal rationale should satisfy the following conditions." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-75", "text": "Sufficiency:" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-76", "text": "Comprehensiveness: R^c does not contain sufficient information to predict Y , i.e." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-77", "text": "for some constant h. Compactness: the segments in X that are included in R should be sparse and consecutive, i.e.," }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-78", "text": "for some constants s and c." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-79", "text": "Here is an explanation of what each of these conditions means."
}, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-80", "text": "The sufficiency condition is the core one of a legitimate rationale, which essentially stipulates that the rationale maintains the same predictive power as X to predict Y ." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-81", "text": "The compactness condition stipulates that the rationale should be continuous and should not contain more words than necessary." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-82", "text": "For example, without the compactness condition, a trivial solution to Eq. (4) would be X itself." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-83", "text": "The first inequality in Eq. (6) constrains the sparsity of rationale, and the second one constrains the continuity." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-84", "text": "The comprehensiveness condition requires some elaboration, which we will discuss in the next subsection." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-85", "text": "----------------------------------" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-86", "text": "**COMPREHENSIVENESS AND DEGENERATION**" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-87", "text": "There are two justifications of the comprehensiveness condition." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-88", "text": "First, it regulates the information outside the rationale, so that the rationale contains all the relevant and useful information, hence the name comprehensiveness." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-89", "text": "Second and more importantly, we will show there exists a failure case, called degeneration, which can only be prevented by the comprehensiveness condition." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-90", "text": "Degeneration refers to the situation where, rather than finding words in X that explains Y , R attempts to encode the probability of Y using trivial information, e.g. 
punctuation and position." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-91", "text": "Consider the following toy example of binary classification (Y \u2208 {0, 1}), where X can always perfectly predict Y ." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-92", "text": "Then the following rationale satisfies the sufficiency and compactness: R includes only the first word of X when Y = 0, and only the last word when Y = 1." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-93", "text": "It is obvious that this R is sufficient to predict Y (by looking at if the first word or the last word is chosen), and thus satisfies the sufficiency condition." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-94", "text": "Apparently this R is perfectly compact (only one word)." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-95", "text": "However, this rationale does not provide a valid explanation." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-96", "text": "Theoretically, any previous cooperative framework may suffer from the above ill-defined problem, if the generator has the potential to accurately guess Y with sufficient capacity." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-97", "text": "3 This problem happens because they have no control of the words unselected by R. Intuitively, in the presence of degeneration, some key predictors in X will be left unselected by R. Thus by looking at the predictive power of R c , we can determine if degeneration occurs." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-98", "text": "Specifically, when degeneration is present, all the useful information is left unselected by R, and so H(Y |R c ) is low." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-99", "text": "That is why the lower bound in Eq. (5) rules out the degeneration cases." 
}, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-100", "text": "(Lei et al., 2016) , which consists of two players, i.e., a generator (Gen) and a predictor (Pred)." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-101", "text": "(b) A straightforward extension of the model (a) with a compliment predictor (Pred c )." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-102", "text": "The generator plays a cooperative game with the predictor and plays a mini-max game with the compliment predictor." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-103", "text": "(c) The introspective three-player framework." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-104", "text": "The introspective module first predicts the possible outcome\u1ef9 based on the full texts x and then generate rationales using both x and\u1ef9." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-105", "text": "The third framework (c) is a special case of (b)." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-106", "text": "Such an inductive bias in model design preserves the predictive performance." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-107", "text": "plement predictor that predicts the probability of Y based on R c ." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-108", "text": "Figure 1 (b) illustrates the basic three-player model." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-109", "text": "Compared with (Lei et al., 2016) , as shown in Figure 1(a) , the three-player model introduces an additional complement predictor, which plays a minimax game in addition to the cooperative game in (Lei et al., 2016) ." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-110", "text": "For clarity, we will describe the game backward, starting from the two predictors followed by the generator." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-111", "text": "Predictors: The predictor estimates probability of Y conditioned on R, denoted asp(Y |R)." 
}, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-112", "text": "The complement predictor estimates probability of Y conditioned on R c , denoted asp c (Y |R)." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-113", "text": "Both predictors are trained using the cross entropy loss, i.e." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-114", "text": "where H(p; q) denotes the cross entropy between p and q. p(\u00b7|\u00b7) denotes the empirical distribution." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-115", "text": "It is worth emphasizing that Lp and Lc are both functions of the generator." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-116", "text": "Generator: The generator extracts R and R c by generating the rationale mask, z(\u00b7), as shown in Eqs." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-117", "text": "(1-2)." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-118", "text": "Specifically, z(\u00b7) is determined by minimizing the weighted combination of four losses:" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-119", "text": "where Lg encourages the gap between Lp and Lc to be large, i.e." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-120", "text": "It stipulates the comprehensiveness property of the rationale (Eq. (5))." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-121", "text": "Intuitively, if the complement rationale is less informative of Y than the rationale, then Lc should be larger than Lp." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-122", "text": "Ls and Lc impose the sparsity and continuity respectively, which correspond to Eq. (6):" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-123", "text": "From Eq. (7), we can see that the generator plays a cooperative game with the predictor, because both tries to maximize the predictive performance of R. 
On the other hand, the generator plays an adversarial game with the complement predictor, because the latter tries to maximize the predictive performance of R c , but the former tries to reduce it." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-124", "text": "Without the complement predictor, and thus the loss Lg, the framework reduces to the method in (Lei et al., 2016) ." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-125", "text": "Training During training, the three players perform gradient descent steps with respect to their own losses." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-126", "text": "For the generator, Since z(X) is a set of binary variables, we cannot apply the regular gradient descent algorithm." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-127", "text": "Instead we will use policy gradient (Williams, 1992) to optimize the models." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-128", "text": "We maximize the reward that is defined as the negative loss in Eq. (8)." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-129", "text": "In order to have bounded rewards for training stability, the negative losses Lp and Lc are replaced with accuracy." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-130", "text": "Theoretical guarantees The proposed framework is able to obtain a rationale that simultane-ously satisfies the conditions in Eqs. (4) to (6), as stated in the following theorem:" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-131", "text": "Theorem 1." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-132", "text": "A rationalization scheme z(X) that simultaneously satisfies Eqs. (4)-(6) is the global optimizer of Eq. (8)." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-133", "text": "The proof is given in Appendix A. The basic idea is that there is a correspondence between each term in Eq. (8) and each of the properties Eqs. (4)-(6)." 
}, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-134", "text": "The minimization of each loss term is equivalent to satisfying the corresponding property." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-135", "text": "----------------------------------" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-136", "text": "**THE INTROSPECTION GENERATOR**" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-137", "text": "As discussed in Section 1, in the existing generator-predictor framework, the generator may fail to disentangle the information about the correct label, offering misleading rationales instead." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-138", "text": "To address this problem, we propose a new generator module, called the Introspective Generator, which explicitly predicts the label before making rationale selections." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-139", "text": "Figure 1 (c) illustrates the model with the introspection generator." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-140", "text": "Specifically, the improved generator still fits into the basic three-player framework in Section 3.1." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-141", "text": "The only difference is in how the generator generates the mask z(X), which now breaks down into two steps." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-142", "text": "First, the module uses a regular classifier that takes the input X and predicts the label, denoted y(X)." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-143", "text": "For classification tasks, we use the maximum likelihood estimate, i.e." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-144", "text": "wherep(Y = y|X) is the predicted probability by maximizing the cross entropy, which is pretrained." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-145", "text": "Second, a label-aware rationale generator generates the binary mask of the rationales, i.e. 
z(X) =z(X,\u1ef9(X))." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-146", "text": "Note that\u1ef9 is a function of X, so the entire introspective generator is essentially a function of X." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-147", "text": "In this way, all the formulations in Section 3.1 and the Theorem 1 still hold for the three-player game with the introspective generator." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-148", "text": "In the implementation, the classifier can use the same architecture like the predictor and the complement predictor." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-149", "text": "The generator is of the same architecture in Section 3.1, but with the additional input of\u0177(X)." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-150", "text": "Remark on degeneration Obviously, when working in a cooperative game, the introspection generator will make the degeneration problem more severe: when the classifierp(\u00b7|X) becomes sufficiently accurate during training, the generator only needs to encode the information of\u1ef9 into R. Therefore our three-player game, while improving over any existing generator-predictor frameworks on its own, is critical for the introspective model." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-151", "text": "----------------------------------" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-152", "text": "**EXPERIMENTAL SETTINGS**" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-153", "text": "----------------------------------" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-154", "text": "**DATASETS**" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-155", "text": "We construct three text classification tasks, including two sentiment classification tasks from the BeerAdvocate review dataset (McAuley et al., 2012) 4 , and a more complex relation classification task from SemEval 2010 Task 8 (Hendrickx et al., 2009) ." 
}, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-156", "text": "Table 2 gives examples of the above tasks." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-157", "text": "Finally, as suggested by an anonymous reviewer, we evaluate on the text matching benchmark AskUbuntu, following Lei et al. (2016) ." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-158", "text": "The experimental setting and results are reported in Appendix F." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-159", "text": "Multi-aspect beer review This is the same data used in (Lei et al., 2016) ." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-160", "text": "Each review evaluates multiple aspects of a kind of beer, including appearance, smell, palate, and an overall score." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-161", "text": "For each aspect, a rating \u2208 [0,1] is labeled." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-162", "text": "We limit ourselves to the appearance aspect only and use a threshold of 0.6 to create balanced binary classification tasks for each aspect." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-163", "text": "Then, the task becomes to predict the appearance aspect of a beer based on multi-aspect text inputs." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-164", "text": "The advantage of this dataset is that it enables automatic evaluation of rationale extraction." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-165", "text": "The dataset provides sentence-level annotations on about 1,000 reviews, where each sentence is labeled by the aspect it covers." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-166", "text": "Note that in this dataset, each aspect is often described by a single sentence with clear polarity." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-167", "text": "Thus, a generator can select a sentence based on the topic distribution of words." 
}, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-168", "text": "The selected sentence often has very high overlap with the ground truth annotations and also contains sufficient information for predicting the sentiment." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-169", "text": "This characteristic of the dataset makes it a relatively easy task, and thus we further consider two more challenging tasks." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-170", "text": "Single-aspect beer review To construct a more challenging task, for each review, we extract the Task Label" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-171", "text": "----------------------------------" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-172", "text": "**INPUT TEXTS**" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-173", "text": "Multi-Aspect Beer Review positive (regarding the aspect of appearance)" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-174", "text": "Clear, burnished copper-brown topped by a large beige head that displays impressive persistance and leaves a small to moderate amount of lace in sheets when it eventually departs." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-175", "text": "The nose is sweet and spicy and the flavor is malty sweet, accented nicely by honey and by abundant caramel/toffee notes." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-176", "text": "There's some spiciness (especially cinnamon), but it's not overdone." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-177", "text": "The finish contains a moderate amount of bitterness and a trace of alcohol." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-178", "text": "The mouthfeel is exemplary \u00b7 \u00b7 \u00b7 Single-Aspect Beer Review positive appearance : dark-brown/black color with a huge tan head that gradually collapses , leaving thick lacing ." 
}, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-179", "text": "----------------------------------" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-180", "text": "**RELATION CLASSIFICATION**" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-181", "text": "Message-Topic(e1,e2) It was a friendly calle 1 to remind them about the bille 2 and make sure they have a copy of the invoice sentences that are specifically about the appearance aspect from the aforementioned BeerAdvocate dataset 5 , and the task is to predict the sentiment of the appearance only on the extracted sentences." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-182", "text": "The details of the dataset construction can be found in Appendix C. This new dataset is obviously more challenging in terms of generating meaningful rationales." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-183", "text": "This is because the generator is required to select more find-grained rationales." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-184", "text": "Since there are no rationale annotations, we rely on subjective evaluations to test the quality of the extracted rationales." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-185", "text": "Multi-class relation classification To show the generalizability of our proposed approaches to other NLP applications, we further consider the SemEval 2010 Task 8 dataset (Hendrickx et al., 2009) ." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-186", "text": "Given two target entities e 1 and e 2 in a sentence, the goal is to classify the relation type (with directions) between the two entities (including None if there is no relation)." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-187", "text": "Similar to the single-aspect beer review, it is also a fine-grained rationalization task." 
}, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-188", "text": "The major difference is that the number of class labels in this task is much larger, and we hope to investigate its effects on degeneration and performance downgrade." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-189", "text": "----------------------------------" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-190", "text": "**IMPLEMENTATION DETAILS**" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-191", "text": "For both the generators and the two predictors, we use bidirectional LSTMs with hidden dimension 400." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-192", "text": "In the introspection generator, the classifier consists of the same bidirectional LSTM, and z(X,\u1ef9) is implemented as an LSTM sequential labeler with the label\u1ef9 transformed to an embedding vector that serves as the initial hidden states of the 5 We will release the single-aspect dataset." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-193", "text": "LSTM." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-194", "text": "For the relation classification task, since the model needs to be aware of the two target entities, we add the relative position features following Nguyen and Grishman (2015) ." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-195", "text": "We map the relative position features to learnable embedding vectors and concatenate them with word embeddings as the inputs to the LSTM encoder of each player." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-196", "text": "All hyper-parameters are tuned on the development sets according to predictive accuracy." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-197", "text": "In other words, all the models are tuned without seeing any rationale annotations." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-198", "text": "Table 3 summarizes the main results on the multiaspect beer review task." 
}, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-199", "text": "In this task, a desirable rationale should be both appearance-related and sufficient for sentiment predictions." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-200", "text": "Since the average sparsity level of human-annotated rationales is about 18%, we consider the following evaluation settings." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-201", "text": "Specifically, we compare the generated rationales to human annotations by measuring the precision when extracting 10% of words and the recall for 20% 6 ." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-202", "text": "In addition to the precision/recall, we also report the predictive accuracy of the extracted rationales on predicting the sentiment." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-203", "text": "----------------------------------" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-293", "text": "A rationalization scheme z(X) that satisfies Eq. (4) is the global minimizer of Lp as defined in Eq. (7)." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-294", "text": "----------------------------------" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-295", "text": "**PROOF. NOTICE THAT**" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-296", "text": "The first equality is given by Lemma 1.1; the second equality is given by Eq. (3) ." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-297", "text": "For the inequality, the equality holds if and only if pY (\u00b7|r(X)) = pY (\u00b7|X)," }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-298", "text": "which is Eq. (4)." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-299", "text": "Lemma 1.3." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-300", "text": "A rationalization scheme z(X) that satisfies Eq. (5) is the global minimizer of Lg as defined in Eq. (9)." 
}, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-204", "text": "**EXPERIMENTS**" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-205", "text": "----------------------------------" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-206", "text": "**MULTI-ASPECT BEER REVIEW**" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-207", "text": "When only 10% of the words are used, Lei et al. (2016) has a significant performance downgrade compared to the accuracy when using the whole passage (82.05 v.s. 87.59)." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-208", "text": "With the additional third player added, the accuracy is slightly improved, which validates that controlling the unselected words improves the robustness of rationales." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-209", "text": "On the other hand, our introspection models are able to maintain higher predictive accuracy (86.16 v.s. 82.05) compared to (Lei et al., 2016) , while only sacrificing a little loss on highlighting precision (0.47% drop)." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-210", "text": "Similar observations are made when 20% of the words required to highlight with one exception." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-211", "text": "Comparing the model of (Lei et al., 2016) with and without the proposed mini-max module, there is a huge gap of more than 5% on recall of generated rationales." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-212", "text": "This confirms the motivation that the original cooperative game tends to generate less comprehensive rationales, where the three-player framework controls the unselected words to be less informative so the recall is significantly improved." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-213", "text": "It is worth mentioning that when a classifier is trained with randomly highlighted rationales (i.e. 
random dropout (Srivastava et al., 2014) on the inputs), it performs significantly worse on both predictive accuracy and highlighting qualities." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-214", "text": "This confirms that extracting concise and sufficient rationales is not a trivial task." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-215", "text": "Moreover, our reimplementation of (Lei et al., 2016) for the original regression task achieves 90.1% precision when highlighting 10% words, which suggest that rationalization for binary classification is more challenging compared to the regression where finer supervision is available." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-216", "text": "In summary, our three-player framework consistently improves the quality of extracted rationales on both of the original (Lei et al., 2016) and the introspective framework." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-217", "text": "Particularly, without controlling the unselected words, the introspection model experiences a serious degeneration problem as expected." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-218", "text": "With the three-player framework, it manages to maintain both high predictive accuracy and high quality of rationalization." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-219", "text": "Table 4 : Predictive accuracy on the single-aspect beer appearance review (left) and the relation classification (right)." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-220", "text": "Acc c refers to the accuracy of the complement predictor." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-221", "text": "We restrict the extracted rationales to be on average eight words and four continuous pieces for the beer review and six words and three pieces for relation classification." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-222", "text": "A desired rationalization method will have high Acc and low Acc c ." 
}, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-223", "text": "----------------------------------" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-224", "text": "**SINGLE-ASPECT BEER REVIEW**" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-225", "text": "In this section, we evaluate the proposed methods on the more challenging single-aspect beer dataset." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-226", "text": "Similar to previous experiments, we force all the methods to have comparable highlighting ratio and continuity constraints for fair evaluation." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-227", "text": "The highlighting ratio is determined from human estimation on a small set of data." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-228", "text": "We report two classification results, which are the accuracy of the predictor and complement predictor." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-229", "text": "For both cooperative methods, i.e. (Lei et al., 2016) with and without introspection, we train an independent extra predictor on the unselected words from the generator, which does not affect the training of the generator-predictor framework." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-230", "text": "From the left part of Table 4 , we observe that the original model of (Lei et al., 2016 ) has a hard time maintaining the accuracy compared to the classifier trained with full texts." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-231", "text": "Transforming it into a three-player game helps to improve the performance of the evaluation set while lower the accuracy of the complement predictor." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-232", "text": "Since the extracted rationales are in a similar length, these results suggest that learning with a three-player game successfully enforces the generator not to leave informative words unselected." 
}, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-233", "text": "Similar to the multi-aspect results, the cooperative introspection model suffers from the degeneration problem, which produces a high complement accuracy." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-234", "text": "The three-player game yields a more comprehensive rationalization with small losses in accuracy." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-235", "text": "----------------------------------" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-236", "text": "**HUMAN EVALUATION**" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-237", "text": "We further conduct subjective evaluations by comparing the original model of (Lei et al., 2016) with our introspective threeplayer model." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-238", "text": "We mask the extracted rationales and present the unselected words only to the hu- Table 5 : Subjective evaluations on the task of controlling the unselected rationale words." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-239", "text": "Acc denotes the accuracy in guessing sentiment labels." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-240", "text": "Accw/o UNK denotes the sentiment accuracy for these samples that are not selected as \"UNK\" for the secondary task." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-241", "text": "\u2020 denotes p-value < 0.005 in t-test." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-242", "text": "A desired rationalization method achieves high \"UNK\" rate and performance randomly for the Acc predictions." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-243", "text": "man evaluators." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-244", "text": "The evaluators are asked to predict the sentiment label as their first task." 
}, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-245", "text": "If a rationalizing method successfully includes all informative pieces in the rationale, subjects should have around 50% of accuracy in guessing the label." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-246", "text": "In addition, after providing the sentiment label, subjects are then asked to answer a secondary question, which is whether the provided text spans are sufficient for them to predict the sentiment." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-247", "text": "If they believe there are not enough clues and their sentiment classification is based on a random guess, they are instructed to select unknown (denoted as \"UNK\") as the answer to the second question." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-248", "text": "Appendix E elaborates why we design subjective evaluations in such a way in more details." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-249", "text": "Table 5 shows the performance of subjective evaluations." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-250", "text": "Looking at the first column of the table, our model is better in confusing human, which gives a higher rate in selecting \"UNK\"." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-251", "text": "It confirms that the three-player introspective model selects more comprehensive rationales and leave less informative texts unattended." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-252", "text": "Furthermore, the results also show that human evaluators offer worse sentiment predictions on the proposed approach, which is also desired and expected." 
}, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-253", "text": "----------------------------------" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-254", "text": "**RELATION CLASSIFICATION**" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-255", "text": "The predictive performances on the relation classification task are shown in the right part of the Table 4." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-256", "text": "We observe consistent results as in previous datasets." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-257", "text": "Clearly, the introspective generator helps the accuracy and the three-player game regularize the complement of the rationale selections." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-258", "text": "Examples of the extracted rationales For relation classification, it is difficult to conduct subjective evaluations because the task requires people to have sufficient knowledge of the schema of relation annotation." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-259", "text": "To further demonstrate the quality of generated rationales, we provide some illustrative examples." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-260", "text": "Since there is a rich form of su- pervised signal, i.e., the number of class labels is large, the chance of any visible degeneration of the Lei et al. (2016) 's model should be low." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-261", "text": "However, we still spot quite a few cases." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-262", "text": "In the first example, Lei et al. (2016) fails to highlight the second entity while ours does." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-263", "text": "In the second example, the introspective three-player model selects more words than (Lei et al., 2016) ." 
}, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-264", "text": "In this case, the two entities themselves suffice to serve as the rationales." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-265", "text": "However, our model preserves the words like \"working\"." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-266", "text": "This problem might due to the bias of the dataset." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-267", "text": "For example, some words that are not relevant to the target entities may still correlate with the labels." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-268", "text": "In the case, our model will pick these words as a part of the rationale." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-269", "text": "----------------------------------" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-270", "text": "**RELATED WORK**" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-271", "text": "Model interpretability Besides the two major categories of self-explaining models discussed in Section 1, model interpretability is widely studied in the general machine learning field." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-272", "text": "For example, evaluating feature importance with gradient information (Simonyan et al., 2013; Li et al., 2016a; Sundararajan et al., 2017) or local perturbations (Kononenko et al., 2010; Lundberg and Lee, 2017) ; and interpreting deep networks by locally fitting interpretable models (Ribeiro et al., 2016; ." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-273", "text": "Besides selective rationalization, the cooperative game has been studied in the latter direction above (Lee et al., 2018) ." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-274", "text": "It has also been applied to a relevant problem on summarization (Arumae and Liu, 2019), where the selected summary should be sufficient for answering questions related to the document." 
}, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-275", "text": "In this problem, the sum-mary is a special type of rationale of a document." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-276", "text": "Another related concurrent work (Bastings et al., 2019) proposes differentiable solution to optimize the cooperative rationalization method." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-277", "text": "Game-theoretical methods Though not having been explored much for self-explaining models, the minimax game setup has been widely used in many machine learning problems, such as selfplaying for chess games (Silver et al., 2017) , generative models (Goodfellow et al., 2014) and many tasks that can be formulated as multi-agent reinforcement learning (Busoniu et al., 2006) ." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-278", "text": "Our three-player game framework also shares a similar idea with (Zhang et al., 2017; Zhao et al., 2017) , which aim to learn domain-invariant representations with both cooperative and minimax games." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-279", "text": "----------------------------------" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-280", "text": "**CONCLUSION**" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-281", "text": "We proposed a novel framework for improving the predictive accuracy and comprehensiveness of the selective rationalization methods." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-282", "text": "This framework (1) addresses the degeneration problem in previous cooperative frameworks by regularizing the unselected words via a three-player game; and (2) augments the conventional generator with introspection, which can better maintain the performance for down-stream tasks." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-283", "text": "Experiments with both automatic evaluation and subjective studies confirm the advantage of our proposed framework." 
}, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-284", "text": "----------------------------------" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-285", "text": "**A PROOF OF THEOREM 1**" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-286", "text": "This appendix provides the proof of theorem 1." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-287", "text": "First, we will need the following lemma." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-288", "text": "Lemma 1.1." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-289", "text": "If the predictors have sufficient representation power, we have" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-290", "text": "The proof is trivial, noticing cross entropy is upper bounded by entropy." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-291", "text": "The following lemmas show that there is a correspondence between the rationale properties in Eqs. (4) and loss terms in Eq. (8)." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-302", "text": "According to Lemma 1.1, Lg can be rewritten as Table 1 and Degeneration Cases of (Lei et al., 2016) This section provides the details to obtain the results in Table 1 in the introduction section, where the method of (Lei et al., 2016) generates degenerated rationales." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-303", "text": "The method of (Lei et al., 2016) works well in many applications." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-304", "text": "However, as discussed in Section 1 and 2.2, all the cooperative rationalization approaches may suffer from the problem of degeneration." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-305", "text": "In this section, we design an experiment to confirm the existence of the problem in the original (Lei et al., 2016) model." 
}, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-306", "text": "We use the same single-beer review constructed from (McAuley et al., 2012) , as will be described in Appendix C. Instead of constructing a balanced binary classification task, we set the samples with scores higher than 0.5 as positive examples." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-307", "text": "On such a task, the prediction model with full inputs achieves 82.3% accuracy on the development set." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-308", "text": "----------------------------------" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-309", "text": "**B EXPERIMENTAL SETUP OF EXAMPLES IN**" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-310", "text": "During the training of (Lei et al., 2016) , we stipulate that the generated rationales are very concise: we punish it when the rationales have more than 3 pieces or more than 20% of the words are generated (both with hinge losses)." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-311", "text": "From the results, we can see that Lei et al. (2016) tends to predict color words, like dark-brown, yellow, as rationales." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-312", "text": "This is a clue of degeneration, since most of the appearance reviews start with describing colors." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-313", "text": "Therefore a degenerated generator can learn to split the vocabulary of colors, and communicate with the predictor by using some of the colors for the positive label and some others for the negative label." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-314", "text": "Such a learned generator also fails to generalize well, given the significant performance decrease (76.4% v.s. 82.3%)." 
}, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-315", "text": "By comparison, our method with three-player game could achieve both higher accuracy and more meaningful rationales." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-316", "text": "----------------------------------" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-317", "text": "**C DATA CONSTRUCTION OF THE SINGLE-ASPECT BEER REVIEWS**" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-318", "text": "This section describes how we construct the single-aspect review task from the multi-aspect beer review dataset (McAuley et al., 2012) ." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-319", "text": "In many multi-aspect beer reviews, we can see clear patterns indicating the aspect of the following sentences." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-320", "text": "For example, the sentences starting with \"appearance:\" or \"a:\" are likely to be a review on the appearance aspect; and the sentences Original Text (positive): dark-brown/black color with a huge tan head that gradually collapses , leaving thick lacing ." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-321", "text": "Rationale from (Lei et al., 2016) (Acc: 76.4%):" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-322", "text": "[\"dark-brown/black color\"] Rationale from our method (Acc: 80.4%):" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-323", "text": "[\"huge tan\", \"thick lacing\"]" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-324", "text": "Original Text (negative): really cloudy , lots of sediment , washed out yellow color . looks pretty gross , actually , like swamp water ." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-325", "text": "no head , no lacing ." 
}, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-326", "text": "Rationale from (Lei et al., 2016) (Acc: 76.4%):" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-327", "text": "[\"really cloudy lots\", \"yellow\", \"no\", \"no\"] Rationale from our method (Acc: 80.4%):" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-328", "text": "[\"cloudy\", \"lots\", \"pretty gross\", \"no lacing\"] starting with \"smell:\" or \"nose:\" are likely to be about the aroma aspect." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-329", "text": "We then extract all the \"X:\" patterns, and count the frequencies of such patterns, where each X is a word." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-330", "text": "The patterns \"X:\" with a higher than 400 frequency are kept as anchor patterns." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-331", "text": "The sentences between two anchor patterns \"X 1 : \u00b7 \u00b7 \u00b7 X 2 \" are very likely the review regarding the aspect of X 1 ." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-332", "text": "Finally, we extract such review sentences after \"appearance:\" or \"a:\" and before the immediate subsequent anchor patterns as the single-aspect review for the appearance aspect." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-333", "text": "Each of such instances is regarded as a new single-aspect review." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-334", "text": "The score of the appearance aspect of the original multi-aspect review is regarded as the score of this new review." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-335", "text": "With such an automatically constructed dataset, we form our balanced single-review binary classification tasks (see Section 4.1 and Appendix D), on which our base predictor model (with all the words as inputs) performs an 87.1% on the development set." 
}, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-336", "text": "This is as high as the number we achieved on the multi-aspect task regarding the same aspect (87.6%)." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-337", "text": "This result indicates that the noise introduced by our data construction method is insignificant." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-338", "text": "Table 7 summarizes the statistics of the three datasets used in the experiments." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-339", "text": "The singleaspect sentiment classification and the relation classification have randomly held-out development sets from the original training sets." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-340", "text": "----------------------------------" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-341", "text": "**D DATA STATISTICS**" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-342", "text": "----------------------------------" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-343", "text": "**E EXPERIMENT DESIGNS FOR HUMAN STUDY**" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-344", "text": "This section explains how we designed the human study." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-345", "text": "The goal is to evaluate the unpredictable rates of the input texts after the rationales are removed." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-346", "text": "To this end, we mask the original texts with the rationales generated by (Lei et al., 2016) and our method." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-347", "text": "Each rationale word is masked with the symbol '*'." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-348", "text": "The masked texts from different methods are mixed and shuffled so the evaluators cannot know from which systems an input was generated." 
}, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-349", "text": "We have two human evaluators who are not the authors of the paper." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-350", "text": "During evaluation, an evaluator is presented with one masked text and asked to try her/his best to predict the sentiment label of it." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-351", "text": "If a rationalizing method successfully includes all informative pieces in the rationale, subjects should have around 50% of accuracy in guessing the label." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-352", "text": "After the evaluator provides a sentiment label, the subjects are asked to answer the second question about whether the provided text spans are sufficient for them to predict the sentiment." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-353", "text": "If they believe there are no enough clues and their sentiment classification is based on a random guess, they are instructed to input a UNK label as the answer to the second question." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-354", "text": "The reason we ask the evaluators to provide predicted labels first is based on the following idea: if the task is directly annotating whether the masked texts are unpredictable, the annotators will tend to label more UNK labels to save time." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-355", "text": "Therefore the ratios of UNK labels will be biased." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-356", "text": "Our experimental design alleviates this problem since the evaluators are always required to try the best to guess the labels first." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-357", "text": "Therefore they will spend more time thinking about the possible labels, instead of immediately putting a UNK label." 
}, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-358", "text": "On a small subset of 50 examples, the interannotator agreement is 76% on the UNK labels." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-359", "text": "----------------------------------" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-360", "text": "**F ADDITIONAL EXPERIMENTS ON ASKUBUNTU**" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-361", "text": "Setting Following the suggestion from the reviews, we evaluate the proposed method on the question retrieval task on AskUbuntu (Lei et al., 2016) ." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-362", "text": "AskUbuntu is a non-factoid question retrieval benchmark." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-363", "text": "The goal is to retrieve the most relevant questions from an input question." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-364", "text": "We use the same data split provided by (Lei et al., 2016) ." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-365", "text": "7 Specifically, each question consists of two parts, the question title and the question body." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-366", "text": "The former summarizes a problem from using Ubuntu, while the latter contains the detailed descriptions." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-367", "text": "In our experiments, we follow the same setting from (Lei et al., 2016) by only using the question bodies." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-368", "text": "Different from their work, we do not pretrain an encoder by predicting a question title using the corresponding question body." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-369", "text": "This is because, the question title can be considered as the rationale of its question body, which might result in potential information leaks to our main rationalization task." 
}, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-370", "text": "----------------------------------" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-371", "text": "**METHOD**" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-372", "text": "We formulate the problem of question retrieval as the pairwise classification task." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-373", "text": "Given two questions (i.e., a query and a candidate question), we aim to classify them as a positive label if they are relevant and vice-versa." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-374", "text": "We consider the same generator architecture as used in Section 5 in a siamese setting to extract rationales from questions." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-375", "text": "The predictor and the complement predictor make the prediction based on the pairwise selected spans." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-376", "text": "We believe it is the most straightforward way to adapt the proposed framework to the AskUbuntu task." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-377", "text": "There could be sophisticated and task-specific rationalization approaches to improve the performance on AskUbuntu (e.g., using a ranking model instead of a classification model)." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-378", "text": "However, newly design of introspective modules are also required." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-379", "text": "We leave these investigations to future works." 
}, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-380", "text": "----------------------------------" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-381", "text": "**IMPLEMENTATION DETAILS**" }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-382", "text": "We consider the following three-step training strategy: 1) pre-train a classifier with the full text; 2) fix the pre-trained classifier, which is used for both the predictor 7 https://github.com/taolei87/askubuntu." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-383", "text": "Table 9 : Testing MAP on the AskUbuntu dataset." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-384", "text": "MAP c refers to the MAP score of the complement predictor." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-385", "text": "The desired rationalization method will have high MAP and low MAP c . and the complement predictor in the three-player game approach, and pre-train the rationale generators; and 3) fine-tune all modules end-to-end." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-386", "text": "This pipeline significantly stabilizes the training and provides better performances." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-387", "text": "8 We use the same word embeddings as released by (Lei et al., 2016) ." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-388", "text": "Results Table 9 summarizes the results." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-389", "text": "We observe similar patterns as in previous datasets." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-390", "text": "The original model from (Lei et al., 2016) fails to maintain the performance compared to the model trained with full texts." 
}, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-391", "text": "Adding the proposed minimax game helps both the (Lei et al., 2016 ) and the introspection model to generate more informative texts as the rationales, which improves the MAP of the prediction while lowering the complement MAP." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-392", "text": "Compared to the other tasks, the complement MAPs on AskUbuntu are relatively large." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-393", "text": "One reason is that the reported results rely significantly on the three-step training strategy." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-394", "text": "The best MAP on the development set often occurs after a few epochs of end-to-end training (the third step of our training procedure), which may results in premature training of the generators due to early stop." }, { "sent_id": "8c7722ecab0d6a21e15ce63b8a47f5-C001-395", "text": "Another important reason is that there are a larger number of informative words in the questions, which makes it challenging for the generators to include all the useful information." 
} ], "y": { "@USE@": { "gold_contexts": [ [ "8c7722ecab0d6a21e15ce63b8a47f5-C001-21" ], [ "8c7722ecab0d6a21e15ce63b8a47f5-C001-53" ], [ "8c7722ecab0d6a21e15ce63b8a47f5-C001-124" ], [ "8c7722ecab0d6a21e15ce63b8a47f5-C001-157" ], [ "8c7722ecab0d6a21e15ce63b8a47f5-C001-159" ], [ "8c7722ecab0d6a21e15ce63b8a47f5-C001-229" ], [ "8c7722ecab0d6a21e15ce63b8a47f5-C001-237" ], [ "8c7722ecab0d6a21e15ce63b8a47f5-C001-260" ], [ "8c7722ecab0d6a21e15ce63b8a47f5-C001-302" ], [ "8c7722ecab0d6a21e15ce63b8a47f5-C001-303" ], [ "8c7722ecab0d6a21e15ce63b8a47f5-C001-305" ], [ "8c7722ecab0d6a21e15ce63b8a47f5-C001-310" ], [ "8c7722ecab0d6a21e15ce63b8a47f5-C001-311" ], [ "8c7722ecab0d6a21e15ce63b8a47f5-C001-321" ], [ "8c7722ecab0d6a21e15ce63b8a47f5-C001-326" ], [ "8c7722ecab0d6a21e15ce63b8a47f5-C001-346" ], [ "8c7722ecab0d6a21e15ce63b8a47f5-C001-361" ], [ "8c7722ecab0d6a21e15ce63b8a47f5-C001-364" ], [ "8c7722ecab0d6a21e15ce63b8a47f5-C001-367" ], [ "8c7722ecab0d6a21e15ce63b8a47f5-C001-387" ] ], "cite_sentences": [ "8c7722ecab0d6a21e15ce63b8a47f5-C001-21", "8c7722ecab0d6a21e15ce63b8a47f5-C001-53", "8c7722ecab0d6a21e15ce63b8a47f5-C001-124", "8c7722ecab0d6a21e15ce63b8a47f5-C001-157", "8c7722ecab0d6a21e15ce63b8a47f5-C001-159", "8c7722ecab0d6a21e15ce63b8a47f5-C001-229", "8c7722ecab0d6a21e15ce63b8a47f5-C001-237", "8c7722ecab0d6a21e15ce63b8a47f5-C001-260", "8c7722ecab0d6a21e15ce63b8a47f5-C001-302", "8c7722ecab0d6a21e15ce63b8a47f5-C001-303", "8c7722ecab0d6a21e15ce63b8a47f5-C001-305", "8c7722ecab0d6a21e15ce63b8a47f5-C001-310", "8c7722ecab0d6a21e15ce63b8a47f5-C001-311", "8c7722ecab0d6a21e15ce63b8a47f5-C001-321", "8c7722ecab0d6a21e15ce63b8a47f5-C001-326", "8c7722ecab0d6a21e15ce63b8a47f5-C001-346", "8c7722ecab0d6a21e15ce63b8a47f5-C001-361", "8c7722ecab0d6a21e15ce63b8a47f5-C001-364", "8c7722ecab0d6a21e15ce63b8a47f5-C001-367", "8c7722ecab0d6a21e15ce63b8a47f5-C001-387" ] }, "@EXT@": { "gold_contexts": [ [ "8c7722ecab0d6a21e15ce63b8a47f5-C001-21", "8c7722ecab0d6a21e15ce63b8a47f5-C001-25" ], [ 
"8c7722ecab0d6a21e15ce63b8a47f5-C001-215" ], [ "8c7722ecab0d6a21e15ce63b8a47f5-C001-391" ] ], "cite_sentences": [ "8c7722ecab0d6a21e15ce63b8a47f5-C001-21", "8c7722ecab0d6a21e15ce63b8a47f5-C001-215", "8c7722ecab0d6a21e15ce63b8a47f5-C001-391" ] }, "@BACK@": { "gold_contexts": [ [ "8c7722ecab0d6a21e15ce63b8a47f5-C001-69" ] ], "cite_sentences": [ "8c7722ecab0d6a21e15ce63b8a47f5-C001-69" ] }, "@SIM@": { "gold_contexts": [ [ "8c7722ecab0d6a21e15ce63b8a47f5-C001-69" ] ], "cite_sentences": [ "8c7722ecab0d6a21e15ce63b8a47f5-C001-69" ] }, "@DIF@": { "gold_contexts": [ [ "8c7722ecab0d6a21e15ce63b8a47f5-C001-109" ], [ "8c7722ecab0d6a21e15ce63b8a47f5-C001-207" ], [ "8c7722ecab0d6a21e15ce63b8a47f5-C001-209" ], [ "8c7722ecab0d6a21e15ce63b8a47f5-C001-211" ], [ "8c7722ecab0d6a21e15ce63b8a47f5-C001-216" ], [ "8c7722ecab0d6a21e15ce63b8a47f5-C001-230" ], [ "8c7722ecab0d6a21e15ce63b8a47f5-C001-262" ], [ "8c7722ecab0d6a21e15ce63b8a47f5-C001-263" ], [ "8c7722ecab0d6a21e15ce63b8a47f5-C001-321", "8c7722ecab0d6a21e15ce63b8a47f5-C001-322" ], [ "8c7722ecab0d6a21e15ce63b8a47f5-C001-326", "8c7722ecab0d6a21e15ce63b8a47f5-C001-327" ], [ "8c7722ecab0d6a21e15ce63b8a47f5-C001-367", "8c7722ecab0d6a21e15ce63b8a47f5-C001-368" ], [ "8c7722ecab0d6a21e15ce63b8a47f5-C001-390" ] ], "cite_sentences": [ "8c7722ecab0d6a21e15ce63b8a47f5-C001-109", "8c7722ecab0d6a21e15ce63b8a47f5-C001-207", "8c7722ecab0d6a21e15ce63b8a47f5-C001-209", "8c7722ecab0d6a21e15ce63b8a47f5-C001-211", "8c7722ecab0d6a21e15ce63b8a47f5-C001-216", "8c7722ecab0d6a21e15ce63b8a47f5-C001-230", "8c7722ecab0d6a21e15ce63b8a47f5-C001-262", "8c7722ecab0d6a21e15ce63b8a47f5-C001-263", "8c7722ecab0d6a21e15ce63b8a47f5-C001-321", "8c7722ecab0d6a21e15ce63b8a47f5-C001-326", "8c7722ecab0d6a21e15ce63b8a47f5-C001-367", "8c7722ecab0d6a21e15ce63b8a47f5-C001-390" ] } } }, "ABC_b3952c840ce970f0e66460ea6e145a_0": { "x": [ { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-2", "text": 
"Despite the fast developmental pace of new sentence embedding methods, it is still challenging to find comprehensive evaluations of these different techniques." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-3", "text": "In the past years, we saw significant improvements in the field of sentence embeddings and especially towards the development of universal sentence encoders that could provide inductive transfer to a wide variety of downstream tasks." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-4", "text": "In this work, we perform a comprehensive evaluation of recent methods using a wide variety of downstream and linguistic feature probing tasks." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-5", "text": "We show that a simple approach using bag-of-words with a recently introduced language model for deep contextdependent word embeddings proved to yield better results in many tasks when compared to sentence encoders trained on entailment datasets." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-6", "text": "We also show, however, that we are still far away from a universal encoder that can perform consistently across several downstream tasks." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-7", "text": "----------------------------------" }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-8", "text": "**INTRODUCTION**" }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-9", "text": "Word embeddings are nowadays pervasive on a wide spectrum of Natural Language Processing (NLP) and Natural Language Understanding (NLU) applications." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-10", "text": "These word representations improved downstream tasks in many domains such as machine translation, syntactic parsing, text classification, and machine comprehension, among others [6] ." 
}, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-11", "text": "Ranging from count-based to predictive or task-based methods, in the past years, many approaches were developed to produce word embeddings, such as Neural Probabilistic Language Model [3] , Word2Vec [28] , GloVe [32] , and more recently ELMo [33] , to name a few." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-12", "text": "Although most of the recent word embedding techniques rely on the distributional linguistic hypothesis, they differ on the assumptions of how meaning or context are modeled to produce the word embeddings." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-13", "text": "These differences between word embedding techniques can have unsuspected implications regarding their performance in downstream tasks as well as in their capacity to capture linguistic properties." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-14", "text": "Nowadays, the choice of word embeddings for particular downstream tasks is still a matter of experimentation and evaluation." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-15", "text": "Even though word embeddings produce high-quality representations for words (or sub-words), representing large chunks of text such as sentences, paragraphs or documents is still an open research problem [10] ." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-16", "text": "The tantalizing idea of learning sentence representations that could achieve good performance on a wide variety of downstream tasks, also called universal sentence encoder is, of course, the major goal of many sentence embedding techniques." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-17", "text": "However, as we will see, we are still far away from a universal representation that has consistent performance on a wide range of tasks." 
}, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-18", "text": "A common approach for sentence representations is to compute the Bag-of-Words (BoW) of the word vectors, traditionally using a simple arithmetic mean of the embeddings for the words in a sentence along the words dimension." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-19", "text": "This usually yielded limited performance, however, some recent methods demonstrated important improvements over the traditional averaging." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-20", "text": "By using weighted averages and modifying them using singular-value decomposition (SVD), the method known as smooth inverse frequency (SIF) [1] , proved to be a strong baseline over traditional averaging." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-21", "text": "Recently, p-mean [35] also demonstrated improvements over SIF and traditional averaging by concatenating power means of the embeddings, closing the gap with other complex sentence embedding techniques such as InferSent [10] ." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-22", "text": "Other sentence embedding techniques were also developed based on encoder/decoder architectures, such as the Skip-Thought [23] , where the skip-gram model from Word2Vec [28] was abstracted to form a sentence level encoder that is trained on a self-supervised fashion." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-23", "text": "Recently, bi-directional LSTM models were also employed by InferSent [10] on a supervised training scheme using the Stanford Natural Language Inference (SNLI) dataset [5] to predict entailment/contradiction." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-24", "text": "InferSent [10] proved to yield much better results on a variety of downstream tasks when compared to many strong baselines or self-supervised methods such as Skip-Thought [23] , by leveraging strong supervision." 
}, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-25", "text": "Lately, the Universal Sentence Encoder (USE) [8] mixed an unsupervised task using a large corpus together with the supervised SNLI task and showed a significant improvement by leveraging the Transformer architecture [37] , which is solely based on attention mechanisms, although without providing an evaluation with other baselines and previous works such as InferSent [10] ." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-26", "text": "Neural Language Models can be tracked back to [3] , and more recently deep bi-directional language models (biLM) [33] have successfully been applied to word embeddings in order to incorporate contextual information." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-27", "text": "Very recently, [34] used unsupervised generative pre-training of language models followed by discriminative fine-tunning to achieve state-of-the-art results in several NLP downstream tasks (improving 9 out of 12 tasks)." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-28", "text": "The authors concluded that using language model as objective to fine-tuning helped both in model generalization and convergence." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-29", "text": "A similar approach of transfer learning using pre-trained language model was previously presented in ULMFiT paper [18] , with surprisingly good results even when fine-tuning using small datasets." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-30", "text": "Despite the fast development of a variety of recent methods for sentence embedding, there are no extensive evaluations covering recent techniques on common grounds." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-31", "text": "The developmental pace of new methods has surpassed the pace of inter-methodology evaluation." 
}, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-32", "text": "SentEval [9] was recently proposed to reduce this comparison gap and the common problems associated with the evaluation of sentence embeddings, creating a common evaluation pipeline to assess the performance on different downstream tasks." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-33", "text": "Recently, [11] introduced an evaluation method based on 10 probing tasks designed to capture linguistic properties from sentence embeddings, which was later integrated into SentEval [9] ." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-34", "text": "However, many recent sentence embedding techniques were not evaluated in this pipeline, such as the Universal Sentence Encoder [8] ." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-35", "text": "In this work, we describe an extensive evaluation of many recent sentence embedding techniques." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-36", "text": "We perform an analysis of the transferability of these embeddings to downstream tasks as well as their linguistic properties through the use of probing tasks [11] ." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-37", "text": "We heavily make use of recent evaluation protocols based on SentEval [9] , creating a useful panorama of the current sentence embedding techniques and drawing important conclusions about their performance for different tasks." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-38", "text": "This paper is organized as follows: In Section 2, we review work on both word and sentence embeddings, which are the basis of our experiments." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-39", "text": "In Section 3 we describe the evaluation tasks and datasets employed for the evaluations and, in Section 4, we describe the evaluated models and the methodology used to generate the sentence embeddings." 
}, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-40", "text": "In Section 5, we describe the experimental results in downstream tasks and linguistic probing tasks." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-41", "text": "Finally, we summarize the contribution of this work in the concluding section." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-42", "text": "----------------------------------" }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-43", "text": "**RELATED WORK**" }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-44", "text": "Word embeddings are extensively used in state-of-the-art NLP techniques, mainly due to their ability to capture semantic and syntactic information of words using large unlabeled datasets and providing an important inductive transfer to other tasks." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-45", "text": "There are several implementations of word embeddings in the literature." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-46", "text": "Following the pioneering work by [3] on the neural language model for distributed word representations, the seminal Word2Vec [28] is one of the first popular approaches of word embeddings based on neural networks." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-47", "text": "This type of representation is able to preserve semantic relationships between words and their context, where context is modeled by nearby words." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-48", "text": "In [28] , they presented two different methods to compute Two main challenges exist when learning high-quality representations: they should capture semantic and syntax and the different meanings the word can represent in different contexts (polysemy)." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-49", "text": "To solve these two issues, Embedding from Language Models (ELMo) [33] was recently introduced." 
}, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-50", "text": "It uses representations from a bi-directional LSTM that is trained with a language model (LM) objective on a large text dataset." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-51", "text": "ELMo [33] representations are a function of the internal layers of the bi-directional Language Model (biLM), which provides a very rich representation about the tokens." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-52", "text": "Like in fastText [4] , ELMo [33] breaks the tradition of word embeddings by incorporating sub-word units, but ELMo [33] has also some fundamental differences with previous shallow representations such as fastText or Word2Vec." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-53", "text": "In ELMo [33] , they use a deep representation by incorporating internal representations of the LSTM network, therefore capturing the meaning and syntactical aspects of words." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-54", "text": "Since ELMo [33] is based on a language model, each token representation is a function of the entire input sentence, which can overcome the limitations of previous word embeddings where each word is usually modeled as an average of their multiple contexts." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-55", "text": "Through the lens of the Ludwig Wittgenstein philosophy of language [40] , it is clear that the ELMo [33] embeddings are a better approximation to the idea of \"meaning is use\" [40] , where a word can contain a wide spectrum of different meanings depending on context, as opposed to traditional word embeddings that are not only context-independent but have a very limited definition of context." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-56", "text": "Although bag-of-words of word embeddings showed good performance for some tasks, it is still unclear how to properly represent the full sentence meaning." 
}, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-57", "text": "Nowadays, there is still no consensus on how to represent sentences and many studies were proposed towards that research direction." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-58", "text": "Skip-Thought Vectors [23] are based on a sentence encoder that, instead of predicting the context of a word as Word2Vec, it predicts the surrounding sentences of a given sentence." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-59", "text": "It is based on encoder-decoder models, where the encoder (usually based on RNNs) maps words to a sentence vector and the decoder generates the surrounding sentences." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-60", "text": "A major advantage of Skip Thought Vectors for representing sentences when compared with a simple average of word embeddings is that order is considered during the encoding/decoding process." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-61", "text": "InferSent [10] proposes a supervised training for the sentence embeddings, contrasting with previous works such as Skip-Thought." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-62", "text": "The sentence encoders are trained using Stanford Natural Language Inference (SNLI) dataset, which consists of 570k human-generated English sentence-pairs and it is considered one of the largest high-quality labeled datasets for building sentence semantics understanding [5] ." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-63", "text": "The authors tested 7 different architectures for the sentence encoder and the best results are achieved with a bi-directional LSTM (BiLSTM) encoder." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-64", "text": "p-mean [35] emerged as a response to InferSent [10] and baselines such as Sent2Vec [29] ." 
}, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-65", "text": "According to the authors, averaging the word embeddings and comparing with approaches such as InferSent [10] can be unfair due to the difference in embedding dimensions (e.g. 300 vs 4096)." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-66", "text": "p-mean is a method that concatenates different word embeddings that represent different information such as syntactic and semantic information, resulting in a larger representation for the word embeddings." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-67", "text": "In addition, the computation of mean is based on power means [14] ." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-68", "text": "Google recently introduced Universal Sentence Encoder [8] , where two different encoders were implemented." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-69", "text": "The first is the Transformer based encoder model [37] , which aims for high-accuracy but has larger complexity and uses more computational resources." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-70", "text": "The second model uses a deep averaging network (DAN) [20] , where embeddings for words and bi-grams are averaged together and then used as input to a deep neural network that computes the sentence embeddings." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-71", "text": "Some other efforts in creating sentences embeddings include, but are not limited to, Doc2Vec/Paragraph2Vec [24] , fastSent [17] and Sent2Vec [29] ." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-72", "text": "In our work, we did not include these other approaches since we believe that the chosen ones are already representative of the existing ones and can enable indirect comparisons with omitted methods." 
}, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-73", "text": "----------------------------------" }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-74", "text": "**EVALUATION TASKS**" }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-75", "text": "In this section, we describe the evaluation tasks that were employed to asses the performance on downstream or linguistic probing tasks." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-76", "text": "----------------------------------" }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-77", "text": "**DOWNSTREAM TASKS**" }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-78", "text": "One of the main issues with both word and sentence embeddings is the evaluation procedure." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-79", "text": "One approach is to make use of such embeddings in downstream tasks, evaluating how suitable they are to different problems and which kind of semantic information they carry." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-80", "text": "Another approach is to explore the nature of the semantics by experimental methods from cognitive sciences [2] ." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-81", "text": "To evaluate each method, we used the entire set of tasks and datasets available on the SentEval [9] evaluation framework." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-82", "text": "These tasks cover a wide range of different tasks that are suitable for general-purpose/universal sentence embeddings." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-83", "text": "These tasks can be divided into 5 groups: binary and multi-class classification, entailment and semantic relatedness, semantic textual similarity, paraphrase detection, and caption-image retrieval." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-84", "text": "Please refer to the original SentEval [9] article for more information about these tasks." 
}, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-85", "text": "We provide a description and sample instances of these datasets and tasks in Table 2 for the classification tasks and in Table 3 for the semantic similarity tasks." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-86", "text": "----------------------------------" }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-87", "text": "**LINGUISTIC PROBING TASKS**" }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-88", "text": "Downstream tasks are not suitable to understand what the representations are in fact capturing from the linguistic perspective." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-89", "text": "Probing tasks are classification problems that focus on simple linguistic properties of the sentences [11] ." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-90", "text": "We executed experiments using the 10 probing tasks proposed by [11] and a summary of the tasks with examples is shown in" }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-91", "text": "----------------------------------" }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-92", "text": "**METHODS**" }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-93", "text": "----------------------------------" }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-94", "text": "**EXPERIMENTAL SETUP**" }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-95", "text": "In this section, we describe where the pre-trained models were obtained as well as the procedures employed to evaluate each method." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-96", "text": "ELMo (BoW, all layers, 5.5B) [33] : this model was obtained from the authors' website at https: //allennlp.org/elmo." 
}, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-97", "text": "According to the authors, the model was trained on a dataset with 5.5B tokens consisting of Wikipedia (1.9B) and all of the monolingual news crawl data from WMT 2008-2012 (3.6B)." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-98", "text": "To evaluate this model, we used the AllenNLP framework [13] ." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-99", "text": "An averaging bag-of-words was employed to produce the sentence embeddings, using features from all three layers of the ELMo [33] model." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-100", "text": "We did not employ the trainable task-specific weighting scheme described in [33] ." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-101", "text": "ELMo (BoW, all layers, original) [33] : this model was obtained from the authors website at https://allennlp.org/elmo." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-102", "text": "According to the authors, the model was trained on the 1 Billion Word Benchmark, approximately 800M tokens of news crawl data from WMT 2011." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-103", "text": "To evaluate this model, we used the AllenNLP framework [13] ." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-104", "text": "An averaging bag-of-words was employed to produce the sentence embeddings, using features from all three layers of the ELMo [33] model and averaging along the word dimension." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-105", "text": "We did not employ the trainable task-specific weighting scheme described in [33] ." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-106", "text": "ELMo (BoW, top layer, original) [33] : the same model and procedure as in ELMo (BoW, all layers, original) was employed, except that in this experiment, we used only the top layer representation from the ELMo [33] model." 
}, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-107", "text": "As shown in [33] , the higher-level LSTM representations capture context-dependent aspects of meaning, while the lower level representations capture aspects of syntax." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-108", "text": "Therefore, we split the evaluation of the top layer from the evaluation using all layers described in the previous experiment." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-109", "text": "We did not employ the trainable task-specific weighting scheme described in [33] ." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-110", "text": "[7] To measure the semantic similarity between two sentences from 0 (not similar" }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-111", "text": "Knowledge Semantic Relatedness (SICK-R) [27] To measure the degree of semantic relatedness between sentences from 0 (not related) to 5 (related)" }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-112", "text": "A man is singing a song and playing the guitar" }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-113", "text": "A man is opening a package that contains headphones" }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-114", "text": "----------------------------------" }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-115", "text": "**1.6**" }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-116", "text": "Stanford Natural Language Inference (SNLI) [5] To measure semantics in terms of Entailment, Contradiction, or Neutral A small girl wearing a pink jacket is riding on a carousel" }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-117", "text": "The carousel is moving Entailment FastText (BoW, Common Crawl) [4] : this model was obtained from the authors website at https: //fasttext.cc/docs/en/english-vectors.html." 
}, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-118", "text": "According to the authors, this model contains 2 million word vectors trained on Common Crawl (600B tokens) dataset." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-119", "text": "A traditional bag-ofwords averaging was employed to produce the sentence embedding." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-120", "text": "GloVe (BoW, Common Crawl) [32] : this model was obtained from the authors website at https: //nlp.stanford.edu/projects/glove/. According to the authors, it contains a 2.2M vocabulary and was trained on the Common Crawl (840B tokens) dataset." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-121", "text": "A traditional bag-of-words averaging was employed to produce the sentence embedding." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-122", "text": "Word2Vec (BoW, Google News) [28] : this model was obtained from the authors website at https: //code.google.com/archive/p/word2vec/. According to the authors, it was trained on part of the Google News dataset (about 100 billion words)." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-123", "text": "A traditional bag-of-words averaging was employed to produce the sentence embedding." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-124", "text": "p-mean (monolingual) [35] : this model was obtained from the authors website at https://github. com/UKPLab/arxiv2018-xling-sentence-embeddings." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-125", "text": "A TensorFlow (TF-Hub) module was employed and the sentences were all made lowercase as per authors website recommendation." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-126", "text": "Skip-Thought [23] : this model was obtained from the authors website at https://github.com/ ryankiros/skip-thoughts." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-127", "text": "The sentences were embedded according to the authors' website instructions." 
}, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-128", "text": "InferSent (AllNLI) [10] : this model was obtained from the authors website at https://github. com/facebookresearch/InferSent." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-129", "text": "According to the authors, it was trained on the SNLI and MultiNLI datasets." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-130", "text": "The sentences were embedded according to the authors website instructions." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-131", "text": "Predict the maximum depth of the syntactic tree of the sentence" }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-132", "text": "The leaves were in various of stages of life ." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-133", "text": "----------------------------------" }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-134", "text": "**10**" }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-135", "text": "Word Content (WC) Predict which of the target words (among 1000) appear in the sentence She eyed him skeptically ." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-136", "text": "eyed USE (DAN) [8] : the Universal Sentence Encoder (USE) was obtained from the TF Hub website at https://tfhub.dev/google/universal-sentence-encoder/1." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-137", "text": "According to the TF Hub website, the model was trained with a deep averaging network (DAN) encoder [20] ." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-138", "text": "USE (Transformer) [8] : the Universal Sentence Encoder (USE) was obtained from the TF Hub website at https://www.tensorflow.org/hub/modules/google/ universal-sentence-encoder-large/1." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-139", "text": "According to the TF Hub website, the model was trained with a Transformer [37] encoder." 
}, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-140", "text": "----------------------------------" }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-141", "text": "**DOWNSTREAM CLASSIFICATION TASKS**" }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-142", "text": "As in [9] , a classifier was employed on top of the sentence embeddings for the classification tasks." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-143", "text": "In this work, a Multi-Layer Perceptron (MLP) was used with a single hidden layer of 50 neurons with no dropout added, using Adam [22] optimizer and a batch size of 64." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-144", "text": "We provide more information about the number of classes and validation scheme employed for each task in Table 5 ." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-145", "text": "----------------------------------" }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-146", "text": "**SEMANTIC RELATEDNESS AND TEXTUAL SIMILARITY TASKS**" }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-147", "text": "We used the same scheme as in [9] to evaluate the semantic relatedness (SICK-R, STS Benchmark) and semantic textual similarity (STS- [12] [13] [14] [15] [16] )." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-148", "text": "For semantic relatedness, which predicts a semantic value between 0 and 5 between two input sentences, we learn to predict the probability distribution of relatedness scores." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-149", "text": "For the semantic textual similarity, where the goal is to assess how the cosine similarity between two sentences correlates with a human annotation, we employed a Pearson correlation coefficient." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-150", "text": "For more information about these tasks, please refer to the SentEval [9] paper." 
}, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-151", "text": "----------------------------------" }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-152", "text": "**INFORMATION RETRIEVAL TASKS**" }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-153", "text": "In the caption-image retrieval task, each image and language features are jointly evaluated with the objective of ranking a collection of images in respect to a given caption (image retrieval tasktext2image) or ranking captions with respect to a given image (caption retrieval -image2text)." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-154", "text": "The dataset used to evaluate the quality of image and caption retrieval tasks in SentEval is the Microsoft COCO [26] , which contains 91 common object categories present in 2,5 million labeled instances in 328k images." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-155", "text": "SentEval used 113k images from COCO dataset, each containing 5 captions." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-156", "text": "The metric used to rank caption and image retrieval in this task is recall at K (Recall@K), with K = 1, 5, 10, and also median over 5 splits of 1k images." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-157", "text": "COCO uses a ResNet-101 [16] for image embedding extraction, yielding 2048-d representation." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-158", "text": "----------------------------------" }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-159", "text": "**LINGUISTIC PROBING TASKS**" }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-160", "text": "For the linguistic probing tasks, a MLP was also used with a single hidden layer of 50 neurons, with no dropout added, using Adam [22] optimizer with a batch size of 64, except for the Word Content (WC) probing task, as in [11] , in which a Logistic Regression was used since it provided consistently better results." 
}, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-161", "text": "5 Experimental results" }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-162", "text": "----------------------------------" }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-163", "text": "**DOWNSTREAM CLASSIFICATION TASKS**" }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-164", "text": "In Table 6 we show the tabular results for the downstream classification tasks, and in Figure 1 we show a graphical comparison between the different methods." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-165", "text": "As seen in Table 6 , although no method had a consistent performance among all tasks, ELMo [33] achieved best results in 5 out of 9 tasks." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-166", "text": "Even though ELMo [33] was trained on a language model objective, it is important to note that in this experiment a bag-of-words approach was employed." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-167", "text": "Therefore, these results are quite impressive, which lead us to believe that excellent results can be obtained by integrating ELMo [33] and the trainable task-specific weighting scheme described in [33] into InferSent [10] ." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-168", "text": "InferSent [10] achieved very good results in the paraphrase detection as well as in the SICK-E (entailment)." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-169", "text": "We hypothesize that these results were due to the similarity of these tasks to the tasks were InferSent [10] was trained on (SNLI and MultiNLI)." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-170", "text": "As described in [10] , the SICK-E can be seen as an out-domain version of the SNLI dataset." 
}, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-171", "text": "The Universal Sentence Encoder (USE) [8] model, with the Transformer encoder, also achieved good results on the product review (CR) and on the question-type (TREC) tasks." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-172", "text": "Given that the USE model was trained on SNLI as well as on web question-answer pages, it is possible that these results were also due to the similarity of these tasks to the training data employed by the USE model." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-173", "text": "p-mean [35] also performed better on most tasks than a simple bag-of-words of GloVe, Word2Vec or fastText independently, and it is a recommended strong baseline when computational resources are limited." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-174", "text": "As we can see, sentence embedding methods are still far away from the idea of a universal sentence encoder that can have a broad transfer quality." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-175", "text": "Given that ELMo [33] demonstrated excellent results on a broad set of tasks, it is clear that a proper integration of deep representation from language models can potentially improve sentence embedding methods by a significant margin and it is a promising research line." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-176", "text": "For completeness, we also provide an evaluation using Logistic Regression instead of a MLP in Table 11 of the appendix." 
}, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-177", "text": "----------------------------------" }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-178", "text": "**SEMANTIC RELATEDNESS AND TEXTUAL SIMILARITY TASKS**" }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-179", "text": "As can be seen in Table 7 , where we report the results for the semantic relatedness and textual similarity tasks, the Universal Sentence Encoder (USE) [8] using Transformer model achieved excellent results on almost all tasks, except for the SICK-R (semantic relatedness) where InferSent [10] achieved better results." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-180", "text": "In Figure 2 we show a graphical comparison." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-181", "text": "----------------------------------" }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-182", "text": "**LINGUISTIC PROBING TASKS**" }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-183", "text": "In Table 8 we report the results for the linguistic probing tasks and in Figure 3 we show a graphical comparison as well." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-184", "text": "As we can see in Table 8 , ELMo [33] was one of the methods that were able to achieve high performance on a broad set of different tasks." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-185", "text": "Interestingly, in the BShift (bi-gram shift) task, where the goal is to identify whether if two consecutive tokens within the sentence have been inverted or not, ELMo [33] achieved a result that was better by a large margin when compared to all other methods, clearly a benefit of the language model objective, where it makes it easy to spot token inversion in sentences such as \"This is my Eve Christmas\", a sample from the BShift dataset." 
}, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-186", "text": "In [11] , they found that the binned sentence length task (SentLen) was negatively correlated with the performance in downstream tasks." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-187", "text": "This hypothesis was also supported by the model learning dynamics, since it seems that as model starts to capture deeper linguistic properties, it will tend to forget about this superficial feature [11] ." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-188", "text": "However, the [33] bag-of-words not only achieved the best result in the SentLent task but also in many downstream tasks." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-189", "text": "Our hypothesis is that this is due to the fact that ELMo [33] is a deep representation composed by different levels that can capture superficial features such as sentence length as well as deep linguistic properties as seen in the challenging SOMO task." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-190", "text": "ELMo [33] word embeddings can be seen as analogous to the hypercolumns [15] approach in Computer Vision, where multiple feature levels are aggregated to form a single pixelwise representation." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-191", "text": "We leave the exploration of probing tasks for each ELMo [33] layer representation to future research, given that it could provide a framework to expose the linguistic properties capture by each representation level of the LSTM." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-192", "text": "In [11] , they also found that the WC (Word Content) task was positively correlated with the performance in a wide variety of downstream tasks." 
}, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-193", "text": "However, in our evaluation, the p-mean [35] approach, which has achieved better results in the WC task did not exceed other techniques such as ELMo [33] bag-of-words or InferSent [10] and USE in the downstream classification tasks." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-194", "text": "We believe that the high performance of the p-mean [35] in the WC task is due to the concatenative approach employed to aggregate the different power means." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-195", "text": "For completeness, we also provide an evaluation using Logistic Regression instead of a MLP in Table 10 of the appendix." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-196", "text": "----------------------------------" }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-197", "text": "**INFORMATION RETRIEVAL TASKS**" }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-198", "text": "In Table 9 , we show the results for the image retrieval and caption retrieval tasks for the Microsoft COCO [26] dataset." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-199", "text": "Table 9 : Results for the image retrieval and caption retrieval tasks using the Microsoft COCO [26] dataset and features extracted with a ResNet-101 [16] ." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-200", "text": "In this table we present Recall at 1 (R@1), Recall at 5 (R@5) and so on, as well as the median." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-201", "text": "As we can see in Table 9 , InferSent [10] achieved excellent results on the three raking evaluations (R@k for k in [1, 5, 10] ) and for both tasks (caption retrieval and image retrieval), a similar performance to the results reported by [10] ." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-202", "text": "Figure 3 : Graphical results for the linguistic probing tasks using a MLP." 
}, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-203", "text": "Best viewed in color." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-204", "text": "----------------------------------" }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-205", "text": "**DISCUSSION**" }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-206", "text": "We provided a comprehensive evaluation of the inductive transfer as well as an exploration of the linguistic properties of multiple sentence embedding techniques that included bag-of-word baselines, as well as encoder architectures trained with supervised or self-supervised approaches." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-207", "text": "We showed that a bag-of-words approach using a recently introduced context-dependent word embedding technique was able to achieve excellent performance on many downstream tasks as well as capturing important linguistic properties." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-208", "text": "We demonstrated the importance of the linguistic probing tasks as a means for exploration of sentence embeddings." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-209", "text": "Especially for evaluating different levels of word representations, where it can be a very useful tool to provide insights on what kind of relationships and linguistic properties each representation level (in the case of deep representations such as ELMo [33] ) is capturing." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-210", "text": "We also showed that no method had a consistent performance across all tasks, with performance being linked mostly with the downstream task similarity to the trained task of these techniques." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-211", "text": "Given that we are still far from a universal sentence encoder, we believe that this evaluation can provide an important basis for choosing which technique can potentially perform well in particular tasks." 
}, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-212", "text": "Finally, we believe that new embedding training techniques that include language models as a way to capture context and meaning, such as ELMo [33] , combined with clever techniques of encoding sentences such as in InferSent [10] , can improve the performance of these encoders by a significant margin." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-213", "text": "However, as we saw in the experiments, the performance of these encoders trained on particular datasets such as entailment did not perform well on a broad set of downstream tasks." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-214", "text": "Therefore, one hypothesis is that these encoders are too narrow at modeling what these embeddings can carry." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-215", "text": "We believe that the research direction of incorporating language models and multiple levels of representations can help to provide a wide set of rich features that can capture context-dependent semantics as well as linguistic features, such as seen on ELMo [33] downstream and linguistic probing task experiments, but for sentence embeddings." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-216", "text": "----------------------------------" }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-217", "text": "**A.1 SUPPLEMENTAL RESULTS**" }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-218", "text": "In Table 10 we show the results for the probing tasks using a Logistic Regression instead of a MLP." }, { "sent_id": "b3952c840ce970f0e66460ea6e145a-C001-219", "text": "In Table 11 we show the results for the downstream tasks using a Logistic Regression instead of a MLP." 
} ], "y": { "@BACK@": { "gold_contexts": [ [ "b3952c840ce970f0e66460ea6e145a-C001-11" ], [ "b3952c840ce970f0e66460ea6e145a-C001-26" ], [ "b3952c840ce970f0e66460ea6e145a-C001-49" ], [ "b3952c840ce970f0e66460ea6e145a-C001-51" ], [ "b3952c840ce970f0e66460ea6e145a-C001-52" ], [ "b3952c840ce970f0e66460ea6e145a-C001-53" ], [ "b3952c840ce970f0e66460ea6e145a-C001-54" ], [ "b3952c840ce970f0e66460ea6e145a-C001-55" ], [ "b3952c840ce970f0e66460ea6e145a-C001-107" ], [ "b3952c840ce970f0e66460ea6e145a-C001-189" ], [ "b3952c840ce970f0e66460ea6e145a-C001-190" ], [ "b3952c840ce970f0e66460ea6e145a-C001-209" ] ], "cite_sentences": [ "b3952c840ce970f0e66460ea6e145a-C001-11", "b3952c840ce970f0e66460ea6e145a-C001-26", "b3952c840ce970f0e66460ea6e145a-C001-49", "b3952c840ce970f0e66460ea6e145a-C001-51", "b3952c840ce970f0e66460ea6e145a-C001-52", "b3952c840ce970f0e66460ea6e145a-C001-53", "b3952c840ce970f0e66460ea6e145a-C001-54", "b3952c840ce970f0e66460ea6e145a-C001-55", "b3952c840ce970f0e66460ea6e145a-C001-107", "b3952c840ce970f0e66460ea6e145a-C001-189", "b3952c840ce970f0e66460ea6e145a-C001-190", "b3952c840ce970f0e66460ea6e145a-C001-209" ] }, "@USE@": { "gold_contexts": [ [ "b3952c840ce970f0e66460ea6e145a-C001-96" ], [ "b3952c840ce970f0e66460ea6e145a-C001-99" ], [ "b3952c840ce970f0e66460ea6e145a-C001-101" ], [ "b3952c840ce970f0e66460ea6e145a-C001-104" ], [ "b3952c840ce970f0e66460ea6e145a-C001-165" ], [ "b3952c840ce970f0e66460ea6e145a-C001-184" ], [ "b3952c840ce970f0e66460ea6e145a-C001-185" ], [ "b3952c840ce970f0e66460ea6e145a-C001-188" ], [ "b3952c840ce970f0e66460ea6e145a-C001-193" ] ], "cite_sentences": [ "b3952c840ce970f0e66460ea6e145a-C001-96", "b3952c840ce970f0e66460ea6e145a-C001-99", "b3952c840ce970f0e66460ea6e145a-C001-101", "b3952c840ce970f0e66460ea6e145a-C001-104", "b3952c840ce970f0e66460ea6e145a-C001-165", "b3952c840ce970f0e66460ea6e145a-C001-184", "b3952c840ce970f0e66460ea6e145a-C001-185", "b3952c840ce970f0e66460ea6e145a-C001-188", "b3952c840ce970f0e66460ea6e145a-C001-193" ] }, 
"@DIF@": { "gold_contexts": [ [ "b3952c840ce970f0e66460ea6e145a-C001-100" ], [ "b3952c840ce970f0e66460ea6e145a-C001-105" ], [ "b3952c840ce970f0e66460ea6e145a-C001-109" ], [ "b3952c840ce970f0e66460ea6e145a-C001-166" ], [ "b3952c840ce970f0e66460ea6e145a-C001-215" ] ], "cite_sentences": [ "b3952c840ce970f0e66460ea6e145a-C001-100", "b3952c840ce970f0e66460ea6e145a-C001-105", "b3952c840ce970f0e66460ea6e145a-C001-109", "b3952c840ce970f0e66460ea6e145a-C001-166", "b3952c840ce970f0e66460ea6e145a-C001-215" ] }, "@EXT@": { "gold_contexts": [ [ "b3952c840ce970f0e66460ea6e145a-C001-106" ] ], "cite_sentences": [ "b3952c840ce970f0e66460ea6e145a-C001-106" ] }, "@FUT@": { "gold_contexts": [ [ "b3952c840ce970f0e66460ea6e145a-C001-167" ], [ "b3952c840ce970f0e66460ea6e145a-C001-175" ], [ "b3952c840ce970f0e66460ea6e145a-C001-191" ], [ "b3952c840ce970f0e66460ea6e145a-C001-212" ], [ "b3952c840ce970f0e66460ea6e145a-C001-215" ] ], "cite_sentences": [ "b3952c840ce970f0e66460ea6e145a-C001-167", "b3952c840ce970f0e66460ea6e145a-C001-175", "b3952c840ce970f0e66460ea6e145a-C001-191", "b3952c840ce970f0e66460ea6e145a-C001-212", "b3952c840ce970f0e66460ea6e145a-C001-215" ] } } }, "ABC_20330d309c218dc2e1521b9644ed9c_0": { "x": [ { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-2", "text": "Transformer is a powerful architecture that achieves superior performance on various sequence learning tasks, including neural machine translation, language understanding, and sequence prediction." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-3", "text": "At the core of the Transformer is the attention mechanism, which concurrently processes all inputs in the streams." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-4", "text": "In this paper, we present a new formulation of attention via the lens of the kernel." 
}, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-5", "text": "To be more precise, we realize that the attention can be seen as applying kernel smoother over the inputs with the kernel scores being the similarities between inputs." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-6", "text": "This new formulation gives us a better way to understand individual components of the Transformer's attention, such as the better way to integrate the positional embedding." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-7", "text": "Another important advantage of our kernel-based formulation is that it paves the way to a larger space of composing Transformer's attention." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-8", "text": "As an example, we propose a new variant of Transformer's attention which models the input as a product of symmetric kernels." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-9", "text": "This approach achieves competitive performance to the current state of the art model with less computation." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-10", "text": "In our experiments, we empirically study different kernel construction strategies on two widely used tasks: neural machine translation and sequence prediction." 
}, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-11", "text": "----------------------------------" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-12", "text": "**INTRODUCTION**" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-13", "text": "Transformer (Vaswani et al., 2017 ) is a relative new architecture which outperforms traditional deep learning models such as Recurrent Neural Networks (RNNs) (Sutskever et al., 2014) and Temporal Convolutional Networks (TCNs) (Bai et al., 2018) for sequence modeling tasks across neural machine translations (Vaswani et al., 2017) , language understanding (Devlin et al., 2018) , sequence prediction (Dai et al., 2019) , image generation (Child et al., 2019) , video activity classification (Wang et al., 2018) , music generation (Huang et al., 2018a) , and multimodal sentiment analysis (Tsai et al., 2019a) ." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-14", "text": "Instead of performing recurrence (e.g., RNN) or convolution (e.g., TCN) over the sequences, Transformer is a feed-forward model that concurrently processes the entire sequence." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-15", "text": "At the core of the Transformer is its attention mechanism, which is proposed to integrate the dependencies between the inputs." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-16", "text": "There are up to three types of attention within the full Transformer model as exemplified with neural machine translation application (Vaswani et al., 2017) : 1) Encoder self-attention considers the source sentence as input, generating a sequence of encoded representations, where each encoded token has a global dependency with other tokens in the input sequence." 
}, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-17", "text": "2) Decoder self-attention considers the target sentence (e.g., predicted target sequence for translation) as input, generating a sequence of decoded representations 1 , where each decoded token depends on previous decoded tokens." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-18", "text": "3) Decoder-encoder attention considers both encoded and decoded sequences, generating a sequence with the same length as the decoded sequence." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-19", "text": "It should be noted that some applications has only the decoder self-attention such as sequence prediction (Dai et al., 2019) ." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-20", "text": "In all cases, the Transformer's attentions follow the same general mechanism." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-21", "text": "At the high level, the attention can be seen as a weighted combination of the input sequence, where the weights are determined by the similarities between elements of the input sequence." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-22", "text": "We note that this operation is orderagnostic to the permutation in the input se-quence (order is encoded with extra positional embedding (Vaswani et al., 2017; Shaw et al., 2018; Dai et al., 2019) )." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-23", "text": "The above observation inspires us to connect Transformer's attention to kernel learning (Scholkopf and Smola, 2001) : they both concurrently and order-agnostically process all inputs by calculating the similarity between the inputs." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-24", "text": "Therefore, in the paper, we present a new formulation for Transformer's attention via the lens of kernel." 
}, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-25", "text": "To be more precise, the new formulation can be interpreted as a kernel smoother (Wasserman, 2006) over the inputs in a sequence, where the kernel measures how similar two different inputs are." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-26", "text": "The main advantage of connecting attention to kernel is that it opens up a new family of attention mechanisms that can relate to the well-established literature in kernel learning (Scholkopf and Smola, 2001) ." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-27", "text": "As a result, we develop a new variant of attention which simply considers a product of symmetric kernels when modeling non-positional and positional embedding." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-28", "text": "Furthermore, our proposed formulation highlights naturally the main components of Transformer's attention, enabling a better understanding of this mechanism: recent variants of Transformers (Shaw et al., 2018; Huang et al., 2018b; Dai et al., 2019; Child et al., 2019; Wang et al., 2018; Tsai et al., 2019a) can be expressed through these individual components." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-29", "text": "Among all the components, we argue that the most important one is the construction of the kernel function." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-30", "text": "We empirically study multiple kernel forms and the ways to integrate positional embedding in neural machine translation (NMT) using IWSLT'14 GermanEnglish (De-En) dataset (Edunov et al., 2017) and sequence prediction (SP) using WikiText-103 dataset (Merity et al., 2016) ." 
}, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-31", "text": "----------------------------------" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-32", "text": "**ATTENTION**" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-33", "text": "This section aims at providing an understanding of attention in Transformer via the lens of kernel." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-34", "text": "The inspiration for connecting the kernel (Scholkopf and Smola, 2001 ) and attention instantiates from the observation: both operations concurrently processes all inputs and calculate the similarity between the inputs." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-35", "text": "We first introduce the background (i.e., the original formulation) of attention and then provide a new reformulation within the class of kernel smoothers (Wasserman, 2006) ." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-36", "text": "Next, we show that this new formulation allows us to explore new family of attention while at the same time offering a framework to categorize previous attention variants (Vaswani et al., 2017; Shaw et al., 2018; Huang et al., 2018b; Dai et al., 2019; Child et al., 2019; Wang et al., 2018; Tsai et al., 2019a) ." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-37", "text": "Last, we present a new form of attention, which requires fewer parameters and empirically reaches competitive performance as the state-of-the-art models." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-38", "text": "For notation, we use lowercase representing a vector (e.g., x), bold lowercase representing a matrix (e.g., x), calligraphy letter denoting a space (e.g., X ), and S denoting a set." 
}, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-39", "text": "To relate the notations in sequence to sequence learning (Vaswani et al., 2017 ), x represents a specific element of a sequence, x = [x 1 , x 2 , \u22ef, x T ] denotes a sequence of features, S x = {x exp , x 2 , \u22ef, x T } represents the set with its elements being the features in sequence x, and we refer the space of set S x as S." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-40", "text": "----------------------------------" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-41", "text": "**TECHNICAL BACKGROUND**" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-42", "text": "Unlike recurrent computation (Sutskever et al., 2014 ) (i.e., RNNs) and temporal convolutional computation (Bai et al., 2018 ) (i.e., TCNs), Transformer's attention is an order-agnostic operation given the order in the inputs (Vaswani et al., 2017) ." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-43", "text": "Hence, in the presentation of the paper, we consider the inputs as a set instead of a sequence." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-44", "text": "When viewing sequence as a set, we lose the temporal (positional) information in inputs which is often crucial for sequence modeling (Sutskever et al., 2014) ." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-45", "text": "As a result, Transformer (Vaswani et al., 2017) introduced positional embedding to indicate the positional relation for the inputs." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-46", "text": "Formally, a sequence x = [x 1 , x 2 , \u22ef, x T ] defines each element as x i = (f i , t i ) with f i \u2208 F being the nontemporal feature at time i and t i \u2208 T as an temporal feature (or we called it positional embedding)." 
}, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-47", "text": "Note that f i can be the word representation (in neural machine translation (Vaswani et al., 2017) ), a pixel in a frame (in video activity recognition (Wang et al., 2018) ), or a music unit (in music generation (Huang et al., 2018b) )." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-48", "text": "t i can be a mixture of sine and cosine functions (Vaswani et al., 2017) or parameters that can be learned during back-propagation (Dai et al., 2019; Ott et al., 2019) ." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-49", "text": "The feature vector are defined over a joint space X \u2236= (F \u00d7 T )." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-50", "text": "The resulting permutationinvariant set is:" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-51", "text": "Followed the definition by Vaswani et al. (2017) , we use queries(q)/keys(k)/values(v) to represent the inputs for the attention." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-52", "text": "To be more precise, x {q k v} is used for denoting a query/key/value data in the query/key/value sequence x {q k v} (x {q k v} \u2208 S x { q k v} ) with S x { q k v} being its set representation." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-53", "text": "We note that the input sequences are the same (x q = x k ) for self-attention and are different (x q from decoder and x k from encoder) for encoder-decoder attention." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-54", "text": "Given the introduced notation, the attention mechanism in original Transformer (Vaswani et al., 2017) can be presented as:" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-55", "text": "with" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-56", "text": "the weight, and d k being the feature dimension of x k W k ." 
}, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-57", "text": "Decoder self-attention further introduces a mask to block the visibility of elements in S x k to x q ." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-58", "text": "Particularly, decoder self-attention considers the decoded sequence as inputs (x k = x q ), where the decoded token at time t is not allowed to access the future decoded tokens (i.e., tokens decoded at time greater than t)." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-59", "text": "On the contrary, encoder selfattention and decoder-encoder attention consider no additional mask to Eq. (1)." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-80", "text": "2.2.1 Kernel Feature Space X In Eq. (2), to construct a kernel on X , the first thing is to identify the kernel feature space X ." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-60", "text": "Recent work (Shaw et al., 2018; Dai et al., 2019; Huang et al., 2018b; Child et al., 2019; Parmar et al., 2018; Tsai et al., 2019a) proposed modifications to the Transformer for the purpose of better modeling inputs positional relation (Shaw et al., 2018; Huang et al., 2018b; Dai et al., 2019) , appending additional keys in S x k (Dai et al., 2019) , modifying the mask applied to Eq. (1) (Child et al., 2019) , or applying to distinct feature types Parmar et al., 2018; Tsai et al., 2019a) ." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-61", "text": "These works adopt different designs of attention as comparing to the original form (Eq. (1))." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-62", "text": "In our paper, we aim at providing an unified view via the lens of kernel." 
}, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-63", "text": "----------------------------------" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-64", "text": "**REFORMULATION VIA THE LENS OF KERNEL**" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-65", "text": "We now provide the intuition to reformulate Eq." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-66", "text": "(1) via the lens of kernel." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-67", "text": "First, the softmax function can be realized as a probability function for x q observing the keys {x k }s in S x k (S x k is the set representation of sequence x k )." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-68", "text": "The probability is determined by the dot product between x q and x k with additional mappings W q W k and scaling by d k , which we note the dot-product operation is an instance of kernel function." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-69", "text": "We also introduce a set filtering function M (x q , S x k ) \u2236 X \u00d7 S \u2192 S which returns a set with its elements that operate with (or are connected/visible to) x q ." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-70", "text": "The filtering function M (\u22c5, \u22c5) plays as the role of the mask in decoder self-attention (Vaswani et al., 2017) ." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-71", "text": "Putting these altogether, we re-represent Eq. (1) into the following definition." 
}, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-72", "text": ", and a value function v(\u22c5) \u2236 X \u2192 Y, the Attention function taking the input of a query feature x q \u2208 X is defined as" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-73", "text": "The Definition 1 is a class of linear smoothers (Wasserman, 2006) with kernel smoothing:" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-74", "text": "where v(x k ) outputs the \"values\" and" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-75", "text": "is a probability function depends on k and N when k(\u22c5, \u22c5) is always positive." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-76", "text": "In the prior work (Vaswani et al., 2017)" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-77", "text": "Note that the kernel form k(x q , x k ) in the original Transformer (Vaswani et al., 2017 ) is a asymmetric exponential kernel with additional mapping W q and W k (Wilson et al., 2016; Li et al., 2017) 2 ." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-78", "text": "The new formulation defines a larger space for composing attention by manipulating its individual components, and at the same time it is able to categorize different variants of attention in prior work (Shaw et al., 2018; Huang et al., 2018b; Dai et al., 2019; Child et al., 2019; Wang et al., 2018; Tsai et al., 2019a) ." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-79", "text": "In the following, we study these components by dissecting Eq. (2) into: 1) kernel feature space X , 2) kernel construction k(\u22c5, \u22c5), 3) value function v(\u22c5), and 4) set filtering function M (\u22c5, \u22c5)." 
}, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-81", "text": "In addition to modeling sequences like word sentences (Vaswani et al., 2017) or music signals (Huang et al., 2018b) , the Transformer can also be applied to images (Parmar et al., 2018) , sets , and multimodal sequences (Tsai et al., 2019a) ." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-82", "text": "Due to distinct data types, these applications admit various kernel feature space: (Vaswani et al., 2017; Dai et al., 2019) :" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-83", "text": "with F being non-positional feature space and T being the positional embedding space of the position in the sequence." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-84", "text": "(ii) Image Transformer (Parmar et al., 2018) :" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-85", "text": "with F being non-positional feature space, H being the positional space of the height in an image, and W being the positional space of the width in an image." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-86", "text": "(iii) Set Transformer and NonLocal Neural Networks (Wang et al., 2018) :" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-87", "text": "with no any positional information present." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-88", "text": "(iv) Multimodal Transformer (Tsai et al., 2019a) :" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-89", "text": "with F \u2113 representing the language feature space, F v representing the vision feature space, F a representing the audio feature space, and T representing the temporal indicator space." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-90", "text": "For the rest of the paper, we will focus on the setting for sequence Transformer X = (F \u00d7 T ) and discuss the kernel construction on it." 
}, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-91", "text": "----------------------------------" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-92", "text": "**KERNEL CONSTRUCTION AND THE ROLE OF**" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-93", "text": "Positional Embedding k(\u22c5, \u22c5) The kernel construction on X = (F \u00d7 T ) has distinct design in variants of Transformers (Vaswani et al., 2017; Dai et al., 2019; Huang et al., 2018b; Shaw et al., 2018; Child et al., 2019) ." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-94", "text": "Since now the kernel feature space considers a joint space, we will first discuss the kernel construction on F (the non-positional feature space) and then discuss how different variants integrate the positional embedding (with the positional feature space T ) into the kernel." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-95", "text": "Kernel construction on F. All the work considered the scaled asymmetric exponential kernel with the mapping W q and W k (Wilson et al., 2016; Li et al., 2017) for non-positional features f q and f k :" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-96", "text": "Note that the usage of asymmetric kernel is also commonly used in various machine learning tasks (Yilmaz, 2007; Tsuda, 1999; Kulis et al., 2011) , where they observed the kernel form can be flexible and even non-valid (i.e., a kernel that is not symmetric and positive semi-definite)." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-97", "text": "In Section 3, we show that symmetric design of the kernel has similar performance for various sequence learning tasks, and we also examine different kernel choices (i.e., linear, polynomial, and rbf kernel)." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-98", "text": "Kernel construction on X = (F \u00d7 T )." 
}, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-99", "text": "The designs for integrating the positional embedding t q and t k are listed in the following." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-100", "text": "(i) Absolute Positional Embedding (Vaswani et al., 2017; Dai et al., 2019; Ott et al., 2019) : For the original Transformer (Vaswani et al., 2017) , each t i is represented by a vector with each dimension being sine or cosine functions." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-101", "text": "For learned positional embedding (Dai et al., 2019; Ott et al., 2019) , each t i is a learned parameter and is fixed for the same position for different sequences." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-102", "text": "These works defines the feature space as the direct sum of its temporal and non-temporal space: X = F \u2295 T ." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-103", "text": "Via the lens of kernel, the kernel similarity is defined as" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-104", "text": "(ii) Relative Positional Embedding in Transformer-XL (Dai et al., 2019) : t represents the indicator of the position in the sequence, and the kernel is chosen to be asymmetric of mixing sine and cosine functions:" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-105", "text": "with k fq t q , t k being an asymmetric kernel with coefficients inferred by f q : log k fq t q , t k = \u2211 (iii) Relative Positional Embedding of Shaw et al. (2018) and Music Transformer (Huang et al., 2018b) : t \u22c5 represents the indicator of the position in the sequence, and the kernel is modified to be indexed by a look-up table:" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-106", "text": "where L tq\u2212t k ,fq = exp(f q W q a tq\u2212t k ) with a \u22c5 being a learnable matrix having matrix width to be the length of the sequence." 
}, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-107", "text": "We refer readers to Shaw et al. (2018) for more details." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-108", "text": "Dai et al. (2019) showed that the way to integrate positional embedding is better through Eq. (5) than through Eq. (6) and is better through Eq. (6) than through Eq. (4)." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-109", "text": "We argue the reason is that if viewing f i and t i as two distinct spaces X \u2236= (F \u00d7 T ) , the direct sum x i = f i + t i may not be optimal when considering the kernel score between x q and x k ." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-110", "text": "In contrast, Eq. (5) represents the kernel as a product of two kernels (one for f i and another for t i ), which is able to capture the similarities for both temporal and non-temporal components." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-111", "text": "----------------------------------" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-112", "text": "**VALUE FUNCTION V(\u22c5)**" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-113", "text": "The current Transformers consider two different value function construction: (Vaswani et al., 2017) and Sparse Transformer (Child et al., 2019) :" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-114", "text": "(ii) Transformer-XL (Dai et al., 2019) , Music Transformer (Huang et al., 2018b) , Self-Attention with Relative Positional Embedding (Shaw et al., 2018) :" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-115", "text": "Compared Eq. (7) to Eq. (8), Eq. (7) takes the positional embedding into account for constructing the value function." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-116", "text": "In Section 3, we empirically observe that constructing value function with Eq. (8) constantly outperforms the construction with Eq. 
(7), which suggests that we do not need positional embedding in the value function." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-117", "text": "----------------------------------" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-118", "text": "**SET FILTERING FUNCTION M (\u22c5, \u22c5)**" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-119", "text": "In Eq. (2), the set returned by the set filtering function M (x q , S x k ) defines how many keys and which keys interact with x q ." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-120", "text": "In the following, we itemize the corresponding designs for the variants of Transformers: (i) Encoder Self-Attention in the original Transformer (Vaswani et al., 2017) : For each query x q in the encoded sequence, M (x q , S x k ) = S x k contains the keys for all tokens in the encoded sequence." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-121", "text": "Note that encoder self-attention considers x q = x k with x q being the encoded sequence." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-122", "text": "(ii) Encoder-Decoder Attention in the original Transformer (Vaswani et al., 2017) : For each query x q in the decoded sequence, M (x q , S x k ) = S x k contains the keys for all tokens in the encoded sequence." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-123", "text": "Note that encoder-decoder attention considers x q \u2260 x k with x q being the decoded sequence and x k being the encoded sequence." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-124", "text": "(iii) Decoder Self-Attention in the original Transformer (Vaswani et al., 2017) : For each query x q in the decoded sequence, M (x q , S x k ) returns a subset of S x k (M (x q , S x k ) \u2282 S x k )." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-125", "text": "Note that decoder self-attention considers x q = x k with x q being the decoded sequence."
}, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-126", "text": "Since the decoded sequence is the output for previous timestep, the query at position i can only observe the keys being the tokens that are decoded with position < i. For convenience, let us define S 1 as the set returned by original Transformer (Vaswani et al., 2017 ) from M (x q , S x k ), which we will use it later." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-127", "text": "(iv) Decoder Self-Attention in Transformer-XL (Dai et al., 2019) : For each query x q in the decoded sequence, M (x q , S x k ) returns a set containing S 1 and additional memories (M (x q , S x k ) = S 1 + S mem , M (x q , S x k ) \u2283 S 1 )." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-128", "text": "S mem refers to additional memories." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-129", "text": "(v) Decoder Self-Attention in Sparse Transformer (Child et al., 2019) : For each query x q in the decoded sentence, M (x q , S x k ) returns a subset of S 1 (M (x q , S x k ) \u2282 S 1 )." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-130", "text": "To compare the differences for various designs, we see the computation time is inversely proportional to the number of elements in M (x q , S x k )." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-131", "text": "For performance-wise comparisons, Transformer-XL (Dai et al., 2019) showed that, the additional memories in M (x q , S x k ) are able to capture longer-term dependency than the original Transformer (Vaswani et al., 2017) and hence results in better performance." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-132", "text": "Sparse Transformer (Child et al., 2019) showed that although having much fewer elements in M (x q , S x k ), if the elements are carefully chosen, the attention can still reach the same performance as Transformer-XL (Dai et al., 2019) ." 
}, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-133", "text": "----------------------------------" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-134", "text": "**EXPLORING THE DESIGN OF ATTENTION**" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-135", "text": "So far, we see how Eq. (2) connects to the variants of Transformers." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-136", "text": "By changing the kernel construction in Section 2.2.2, we can define a larger space for composing attention." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-137", "text": "In this paper, we present a new form of attention with a kernel that is 1) valid (i.e., a kernel that is symmetric and positive semi-definite) and 2) delicate in the sense of constructing a kernel on a joint space (i.e., X = (F \u00d7 T )):" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-138", "text": "where W F and W T are weight matrices." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-139", "text": "The new form considers product of kernels with the first kernel measuring similarity between non-temporal features and the second kernel measuring similarity between temporal features." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-140", "text": "Both kernels are symmetric exponential kernel." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-141", "text": "Note that t i here is chosen as the mixture of sine and cosine functions as in the prior work (Vaswani et al., 2017; Ott et al., 2019) ." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-142", "text": "In our experiment, we find it reaching competitive performance as comparing to the current state-of-the-art designs (Eq. (5) by Dai et al. (2019) )." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-143", "text": "We fix the size of the weight matrices W \u22c5 in Eq. (9) and Eq. (5) which means we save 33% of the parameters in attention from Eq. 
(9)" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-144", "text": "----------------------------------" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-145", "text": "**EXPERIMENTS**" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-146", "text": "By viewing the attention mechanism with Eq. (2), we aims at answering the following questions regarding the Transformer's designs:" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-147", "text": "Q1. What is the suggested way for incorporating positional embedding in the kernel function?" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-148", "text": "Q2." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-149", "text": "What forms of kernel are recommended to choose in the attention mechanism?" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-150", "text": "Can we replace the asymmetric kernel with the symmetric version?" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-151", "text": "Q3." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-152", "text": "Is there any exception that the attention mechanism is not order-agnostic with respect to inputs?" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-153", "text": "If so, can we downplay the role of positional embedding?" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-154", "text": "Q4." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-155", "text": "Is positional embedding required in value function?" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-156", "text": "We conduct experiments on neural machine translation (NMT) and sequence prediction (SP) tasks since these two tasks are commonly chosen for studying Transformers (Vaswani et al., 2017; Dai et al., 2019) ." 
}, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-157", "text": "Note that NMT has three different types of attentions (e.g., encoder selfattention, decoder-encoder attention, decoder selfattention) and SP has only one type of attention (e.g., decoder self-attention)." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-158", "text": "For the choice of datasets, we pick IWSLT'14 German-English (De-En) dataset (Edunov et al., 2017) for NMT and WikiText-103 dataset (Merity et al., 2016) for SP as suggested by Edunov et al. (Edunov et al., 2017) and Dai et al. (Dai et al., 2019) ." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-159", "text": "For fairness of comparisons, we train five random initializations and report test accuracy with the highest validation score." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-160", "text": "We fix the position-wise operations in Transformer 3 and only change the attention mechanism." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-161", "text": "Similar to prior work (Vaswani et al., 2017; Dai et al., 2019) , we report BLEU score for NMT and perplexity for SP." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-162", "text": "----------------------------------" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-163", "text": "**INCORPORATING POSITIONAL EMBEDDING**" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-164", "text": "In order to find the best way to integrate positional embedding (PE), we study different PE incorporation in the kernel function k(\u22c5, \u22c5) in Eq. (2)." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-165", "text": "Referring to Sections 2.2.2 and 2.3, we consider four cases: 1) PE as direct sum in the feature space (see Eq. (4)), 2) PE as a look-up table (see Eq. (6)), 3) PE in product kernel with asymmetric kernel (see Eq. (5)), and 4) PE in product kernel with symmetric kernel (see Eq. (9))." 
}, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-166", "text": "We present the results in Table 1 ." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-167", "text": "First, we see that by having PE as a look-up (Edunov et al., 2017) and SP stands for sequence prediction on WikiText-103 dataset (Merity et al., 2016) ." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-168", "text": "\u2191 means the upper the better and \u2193 means the lower the better." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-169", "text": "Table 2 : Kernel Types." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-170", "text": "Other than manipulating the kernel choice of the non-positional features, we fix the configuration by Vaswani et al. (2017) for NMT and the configuration by Dai et al. (2019) for SP." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-171", "text": "34.14 24.13 24.21 table, it outperforms the case with having PE as direct-sum in feature space, especially for SP task." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-172", "text": "Note that the look-up table is indexed by the relative position (i.e., t q \u2212 t k ) instead of absolute position." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-173", "text": "Second, we see that PE in the product kernel proposed by Dai et al. (Dai et al., 2019) may not constantly outperform the other integration types (it has lower BLEU score for NMT)." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-174", "text": "Our proposed product kernel reaches the best result in NMT and is competitive to the best result in SP." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-175", "text": "----------------------------------" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-176", "text": "**KERNEL TYPES**" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-177", "text": "To find the best kernel form in the attention mechanism, in addition to the exponential kernel (see Eq. 
(3)), we compare different kernel forms (i.e., linear, polynomial, and rbf kernels) for the non-positional features." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-178", "text": "We also provide results for changing the asymmetric kernel to a symmetric one by forcing W q = W k , so that the resulting kernel is a valid kernel (Scholkopf and Smola, 2001) ." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-179", "text": "The numbers are shown in Table 2 ." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-180", "text": "Note that, for fairness, other than manipulating the kernel choice of the non-positional features, we fix the configuration of Vaswani et al. (2017) for NMT and the configuration of Dai et al. (2019) for SP." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-181", "text": "We first observe that the linear kernel does not converge for either NMT or SP." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-182", "text": "We argue the reason is that the linear kernel may produce negative values, which violates the assumption in kernel smoothers that the kernel score must be positive (Wasserman, 2006) ." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-183", "text": "Next, we observe that the kernels with infinite-dimensional feature spaces (i.e., the exponential and rbf kernels) outperform the kernel with a finite-dimensional feature space (i.e., the polynomial kernel), with the rbf kernel performing best for NMT and the exponential kernel best for SP." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-184", "text": "We conclude that the choice of kernel matters for the design of attention in the Transformer." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-185", "text": "Also, we see little performance difference when comparing the asymmetric to the symmetric kernel."
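The kernel forms compared here can be written down concretely. The sketch below is our own illustration (the scaling constants are assumptions; implementations differ) and makes visible why the linear kernel is problematic: its score can go negative, whereas the polynomial, exponential, and rbf forms stay non-negative.

```python
import numpy as np

def kernel_scores(q, k):
    # Candidate (unnormalized) kernel forms for the non-positional features.
    d = len(q)
    return {
        "linear": q @ k,                          # can be negative
        "polynomial": (q @ k) ** 2,               # finite feature space
        "exponential": np.exp(q @ k / np.sqrt(d)),
        "rbf": np.exp(-np.sum((q - k) ** 2) / (2.0 * np.sqrt(d))),
    }
```

In the kernel-smoother view, negative scores break the interpretation of attention weights as a normalized weighting, which is consistent with the non-convergence observed for the linear kernel.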
}, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-186", "text": "In the experiment, we fix the size of W \u22c5 in the kernel, and thus adopting the symmetric kernel benefits us from saving parameters." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-187", "text": "----------------------------------" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-188", "text": "**ORDER-INVARIANCE IN ATTENTION**" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-189", "text": "The need of the positional embedding (PE) in the attention mechanism is based on the argument that the attention mechanism is an order-agnostic (or, permutation equivariant) operation (Vaswani et al., 2017; Shaw et al., 2018; Huang et al., 2018b; Dai et al., 2019; Child et al., 2019) ." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-190", "text": "However, we show that, for decoder self-attention, the operation is not order-agnostic." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-191", "text": "For clarification, we are not attacking the claim made by the prior work (Vaswani et al., 2017; Shaw et al., 2018; Huang et al., 2018b; Dai et al., 2019; Child et al., 2019 ), but we aim at providing a new look at the order-invariance problem when considering the attention mechanism with masks (masks refer to the set filtering function in our kernel formulation)." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-192", "text": "In other words, previous work did not consider the mask between queries and keys when discussing the order-invariance problem (P\u00e9rez et al., 2019) ." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-193", "text": "To put it formally, we first present the definition by for a permutation equivariance function:" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-194", "text": "Definition 2." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-195", "text": "Denote \u03a0 as the set of all permutations over [n] = {1, \u22ef, n}. 
A function func \u2236 X n \u2192 Y n is permutation equivariant iff for any permutation \u03c0 \u2208 \u03a0, func(\u03c0x) = \u03c0func(x). Prior work showed that the standard attention (encoder self-attention (Vaswani et al., 2017; Dai et al., 2019) ) is permutation equivariant." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-196", "text": "Here, we present the non-permutation-equivariance problem for decoder self-attention: decoder self-attention (Vaswani et al., 2017; Dai et al., 2019) is not permutation equivariant." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-197", "text": "To proceed with the proof, we need the following definition and propositions." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-198", "text": "----------------------------------" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-199", "text": "**PROPOSITION 2. ATTENTION WITH THE SET FILTERING FUNCTION**" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-200", "text": "Proof." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-201", "text": "It is easy to show that if M (x q , S x k ) = S x k , Eq. (2) remains unchanged under any permutation \u03c0 performed on S x k ." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-202", "text": "\u220e Proposition 3." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-203", "text": "Attention with the set difference" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-204", "text": "Then, we construct a permutation \u03c0 such that" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-205", "text": "It is obvious that Eq. (2) changes after this permutation, and thus" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-206", "text": "it is not permutation equivariant w.r.t." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-207", "text": "S x k , which is equivalent to showing that the attention is not permutation equivariant."
}, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-208", "text": "Then, since the decoder self-attention considers masking (i.e., M (x q , S x k ) returns a subset of S x k ), by Proposition 3, the decoder self-attention is not permutation equivariant." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-209", "text": "\u220e In fact, not only being a permutation inequivariant process, the decoding process in the decoder self-attention already implies the order information from the data." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-210", "text": "To show this, take the decoded sequence y = [init, y 1 , y 2 , y 3 , y 4 ] as an example." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-211", "text": "init stands for the initial token." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-212", "text": "When determining the output y 1 from init, the set filtering function is M (init, S y ) = {init}. Similarly, we will have M (y 1 , S y ), M (y 2 , S y ), M (y 3 , S y ) to be {init, y 1 }, {init, y 1 , y 2 }, {init, y 1 , y 2 , y 3 }." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-213", "text": "Then, it raises a concern: do we require PE in decoder self-attention?" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-214", "text": "By removing PE in decoder selfattention, we present the results in Table 3 ." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-215", "text": "From the table, we can see that, for NMT, removing PE only in decoder self-attention results in slight performance drop (from 34.71 to 34.49)." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-216", "text": "However, removing PE in the entire model greatly degrades the performance (from 34.71 to 14.47)." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-217", "text": "On the other hand, for SP, removing PE from our proposed attention variant dramatically degrades the performance (from 24.28 to 30.92)." 
}, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-218", "text": "Nonetheless, the performance is slightly better than considering PE from the original Transformer (Vaswani et al., 2017) ." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-219", "text": "----------------------------------" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-220", "text": "**POSITIONAL EMBEDDING IN VALUE FUNCTION**" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-221", "text": "To determine the need of positional embedding (PE) in value function, we conduct the experiments by adopting Eq. (7) or Eq. (8) in the attention mechanism." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-222", "text": "The results are presented in Table 4 ." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-223", "text": "From the table, we find that considering PE in value function (Eq. (7)) does not gain performance as compared to not considering PE in value function (Eq. (8))." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-224", "text": "----------------------------------" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-225", "text": "**TAKE-HOME MESSAGES**" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-226", "text": "Based on the results and discussions, we can now answer the questions given at the beginning of this section." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-227", "text": "The answers are summarized into the takehome messages in the following." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-228", "text": "A1." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-229", "text": "We show that integrating the positional embedding in the form of product kernel (Eq. (5) or Eq. (9)) gives us best performance." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-230", "text": "A2." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-231", "text": "The kernel form does matter." 
}, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-232", "text": "Adopting kernel form with infinite feature dimension (i.e., exponential kernel or rbf kernel) gives us best results." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-233", "text": "The symmetric design of the kernel may benefit us from saving parameters and barely sacrifice the performance as compared to the non-symmetric one." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-234", "text": "A3." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-235", "text": "The decoder self-attention is not an orderagnostic operation with respect to the order of inputs." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-236", "text": "However, incorporating positional embedding into the attention mechanism may still improve performance." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-237", "text": "A4." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-238", "text": "We find that there is no much performance difference by considering or not considering the positional embedding in value function." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-239", "text": "----------------------------------" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-240", "text": "**RELATED WORK**" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-241", "text": "Other than relating Transformer's attention mechanism with kernel methods, the prior work (Wang et al., 2018; Shaw et al., 2018; Tsai et al., 2019b ) related the attention mechanism with graph-structured learning." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-242", "text": "For example, Non-Local Neural Networks (Wang et al., 2018) made a connection between the attention and the non-local operation in image processing (Buades et al., 2005) ." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-243", "text": "Others (Shaw et al., 2018; Tsai et al., 2019b) linked the attention to the message passing in graphical models." 
}, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-244", "text": "In addition to the fundamental difference between graph-structured learning and kernel learning, the prior work (Wang et al., 2018; Shaw et al., 2018; Tsai et al., 2019b) focused on presenting Transformer for its particular application (e.g., video classification (Wang et al., 2018) and neural machine translation (Shaw et al., 2018) )." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-245", "text": "Alternatively, our work focuses on presenting a new formulation of Transformer's attention mechanism that gains us the possibility for understanding the attention mechanism better." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-246", "text": "----------------------------------" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-247", "text": "**CONCLUSIONS**" }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-248", "text": "In this paper, we presented a kernel formulation for the attention mechanism in Transformer, which allows us to define a larger space for designing attention." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-249", "text": "As an example, we proposed a new variant of attention which reaches competitive performance when compared to previous state-of-the-art models." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-250", "text": "Via the lens of the kernel, we were able to better understand the role of individual components in Transformer's attention and categorize previous attention variants in a unified formulation." }, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-251", "text": "Among these components, we found the construction of the kernel function acts the most important role, and we studied different kernel forms and the ways to integrate positional embedding on neural machine translation and sequence prediction." 
}, { "sent_id": "20330d309c218dc2e1521b9644ed9c-C001-252", "text": "We hope our empirical study may potentially allow others to design better attention mechanisms given their particular applications." } ], "y": { "@BACK@": { "gold_contexts": [ [ "20330d309c218dc2e1521b9644ed9c-C001-13" ], [ "20330d309c218dc2e1521b9644ed9c-C001-19" ], [ "20330d309c218dc2e1521b9644ed9c-C001-22" ], [ "20330d309c218dc2e1521b9644ed9c-C001-60" ], [ "20330d309c218dc2e1521b9644ed9c-C001-100" ], [ "20330d309c218dc2e1521b9644ed9c-C001-101" ], [ "20330d309c218dc2e1521b9644ed9c-C001-104" ], [ "20330d309c218dc2e1521b9644ed9c-C001-127" ], [ "20330d309c218dc2e1521b9644ed9c-C001-156" ], [ "20330d309c218dc2e1521b9644ed9c-C001-189" ] ], "cite_sentences": [ "20330d309c218dc2e1521b9644ed9c-C001-13", "20330d309c218dc2e1521b9644ed9c-C001-19", "20330d309c218dc2e1521b9644ed9c-C001-22", "20330d309c218dc2e1521b9644ed9c-C001-60", "20330d309c218dc2e1521b9644ed9c-C001-100", "20330d309c218dc2e1521b9644ed9c-C001-101", "20330d309c218dc2e1521b9644ed9c-C001-104", "20330d309c218dc2e1521b9644ed9c-C001-127", "20330d309c218dc2e1521b9644ed9c-C001-156", "20330d309c218dc2e1521b9644ed9c-C001-189" ] }, "@MOT@": { "gold_contexts": [ [ "20330d309c218dc2e1521b9644ed9c-C001-22", "20330d309c218dc2e1521b9644ed9c-C001-23" ] ], "cite_sentences": [ "20330d309c218dc2e1521b9644ed9c-C001-22" ] }, "@USE@": { "gold_contexts": [ [ "20330d309c218dc2e1521b9644ed9c-C001-28" ], [ "20330d309c218dc2e1521b9644ed9c-C001-36" ], [ "20330d309c218dc2e1521b9644ed9c-C001-48" ], [ "20330d309c218dc2e1521b9644ed9c-C001-60", "20330d309c218dc2e1521b9644ed9c-C001-61", "20330d309c218dc2e1521b9644ed9c-C001-62" ], [ "20330d309c218dc2e1521b9644ed9c-C001-78" ], [ "20330d309c218dc2e1521b9644ed9c-C001-82" ], [ "20330d309c218dc2e1521b9644ed9c-C001-93" ], [ "20330d309c218dc2e1521b9644ed9c-C001-100" ], [ "20330d309c218dc2e1521b9644ed9c-C001-101" ], [ "20330d309c218dc2e1521b9644ed9c-C001-104" ], [ "20330d309c218dc2e1521b9644ed9c-C001-110", 
"20330d309c218dc2e1521b9644ed9c-C001-114" ], [ "20330d309c218dc2e1521b9644ed9c-C001-127" ], [ "20330d309c218dc2e1521b9644ed9c-C001-131" ], [ "20330d309c218dc2e1521b9644ed9c-C001-132" ], [ "20330d309c218dc2e1521b9644ed9c-C001-142" ], [ "20330d309c218dc2e1521b9644ed9c-C001-156" ], [ "20330d309c218dc2e1521b9644ed9c-C001-158" ], [ "20330d309c218dc2e1521b9644ed9c-C001-170" ], [ "20330d309c218dc2e1521b9644ed9c-C001-173" ], [ "20330d309c218dc2e1521b9644ed9c-C001-180" ], [ "20330d309c218dc2e1521b9644ed9c-C001-195" ], [ "20330d309c218dc2e1521b9644ed9c-C001-196" ] ], "cite_sentences": [ "20330d309c218dc2e1521b9644ed9c-C001-28", "20330d309c218dc2e1521b9644ed9c-C001-36", "20330d309c218dc2e1521b9644ed9c-C001-48", "20330d309c218dc2e1521b9644ed9c-C001-60", "20330d309c218dc2e1521b9644ed9c-C001-78", "20330d309c218dc2e1521b9644ed9c-C001-82", "20330d309c218dc2e1521b9644ed9c-C001-93", "20330d309c218dc2e1521b9644ed9c-C001-100", "20330d309c218dc2e1521b9644ed9c-C001-101", "20330d309c218dc2e1521b9644ed9c-C001-104", "20330d309c218dc2e1521b9644ed9c-C001-114", "20330d309c218dc2e1521b9644ed9c-C001-127", "20330d309c218dc2e1521b9644ed9c-C001-131", "20330d309c218dc2e1521b9644ed9c-C001-132", "20330d309c218dc2e1521b9644ed9c-C001-142", "20330d309c218dc2e1521b9644ed9c-C001-156", "20330d309c218dc2e1521b9644ed9c-C001-158", "20330d309c218dc2e1521b9644ed9c-C001-170", "20330d309c218dc2e1521b9644ed9c-C001-173", "20330d309c218dc2e1521b9644ed9c-C001-180", "20330d309c218dc2e1521b9644ed9c-C001-195", "20330d309c218dc2e1521b9644ed9c-C001-196" ] }, "@UNSURE@": { "gold_contexts": [ [ "20330d309c218dc2e1521b9644ed9c-C001-114" ] ], "cite_sentences": [ "20330d309c218dc2e1521b9644ed9c-C001-114" ] }, "@SIM@": { "gold_contexts": [ [ "20330d309c218dc2e1521b9644ed9c-C001-142" ], [ "20330d309c218dc2e1521b9644ed9c-C001-161" ] ], "cite_sentences": [ "20330d309c218dc2e1521b9644ed9c-C001-142", "20330d309c218dc2e1521b9644ed9c-C001-161" ] }, "@DIF@": { "gold_contexts": [ [ "20330d309c218dc2e1521b9644ed9c-C001-173" ], [ 
"20330d309c218dc2e1521b9644ed9c-C001-189", "20330d309c218dc2e1521b9644ed9c-C001-190" ], [ "20330d309c218dc2e1521b9644ed9c-C001-191" ] ], "cite_sentences": [ "20330d309c218dc2e1521b9644ed9c-C001-173", "20330d309c218dc2e1521b9644ed9c-C001-189", "20330d309c218dc2e1521b9644ed9c-C001-191" ] } } }, "ABC_a0db9c3a74487d3fcd11e79d44e163_0": { "x": [ { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-2", "text": "Automatically solving mathematical word problems (MWPs) is challenging, primarily due to the semantic gap between human-readable words and machine-understandable logics." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-3", "text": "Despite a long history dated back to the 1960s, MWPs has regained intensive attention in the past few years with the advancement of Artificial Intelligence (AI)." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-4", "text": "To solve MWPs successfully is considered as a milestone towards general AI." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-5", "text": "Many systems have claimed promising results in self-crafted and small-scale datasets." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-662", "text": "**MATH PROBLEM SOLVER IN OTHER LANGUAGES**" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-6", "text": "However, when applied on large and diverse datasets, none of the proposed methods in the literatures achieves a high precision, revealing that current MWPs solvers are still far from intelligent." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-7", "text": "This motivated us to present a comprehensive survey to deliver a clear and complete picture of automatic math problem solvers." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-8", "text": "In this survey, we emphasize on algebraic word problems, summarize their extracted features and proposed techniques to bridge the semantic gap, and compare their performance in the publicly accessible datasets." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-9", "text": "We will also cover automatic solvers for other types of math problems such as geometric problems that require the understanding of diagrams." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-10", "text": "Finally, we will identify several emerging research directions for the readers with interests in MWPs." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-11", "text": "----------------------------------" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-12", "text": "**INTRODUCTION**" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-13", "text": "Designing an automatic solver for mathematical word problems (MWPs) has a long history dated back to the 1960s [1] , [2] , [3] , and continues to attract intensive research attention." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-14", "text": "In the past three years, more than 40 publications on this topic have emerged in the premier venues of artificial intelligence." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-15", "text": "The problem is particularly challenging because there remains a wide semantic gap to parse the human-readable words into machineunderstandable logics so as to facilitate quantitative reasoning." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-16", "text": "Hence, MWPs solvers are broadly considered as good test beds to evaluate the intelligence level of agents in terms of natural language understanding [4] , [5] and the successful solving of MWPs would constitute a milestone towards general AI." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-17", "text": "We categorize the evolving of MWP solvers into three major stages according to the technologies behind these solvers, as depicted in Figure 1 ." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-18", "text": "In the first pioneering stage, roughly from the year 1960 to 2010, systems such as STUDENT [1] , DEDUCOM [6] , WORDPRO [7] and ROBUST [8] , manually craft rules and schemas for pattern matchings." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-19", "text": "Thereupon, these solvers heavily rely on human interventions and can only resolve a limited number of scenarios that are defined in advance." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-20", "text": "Those early efforts for automatic understanding of natural language mathematical problems have been thoroughly reviewed in [9] ." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-21", "text": "We exclude them from the scope of this survey paper and focus on the recent technology developments that have not been covered in the previous survey [9] ." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-22", "text": "In the second stage, MWP solvers made use of semantic parsing [10] , [11] , with the objective of mapping the sentences from problem statements into structured logic representations so as to facilitate quantitative reasoning." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-23", "text": "It has regained considerable \u2022 D. Zhang interests from the academic, and a booming number of methods have been proposed in the past years." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-24", "text": "These methods leveraged various strategies of feature engineering and statistical learning for performance boosting." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-25", "text": "The authors of these methods also claimed promising results in their public or manually harvested datasets." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-26", "text": "In this paper, one of our tasks is to present a comprehensive review on the proposed methods in this stage." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-27", "text": "The methods will be first organized according to the sub-tasks of MWPs which they were designed to solve, such as arithmetic word problem (in Section 2), equation set word problem (in Section 3) and geometric word problem (in Section 5)." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-28", "text": "We then examine the proposed techniques in each sub-task with a clear technical organization and accountable experimental evaluations." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-29", "text": "MWP solvers in the third stage originated from an empirical work [12] deserve some special attention." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-30", "text": "Its experimental results on a large-scale and diversified dataset showed that the status of MWP solvers was not as optimistic as they claimed to be." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-31", "text": "In fact, the accuracies of many approaches dropped sharply and there is a great room for improvement in this research area." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-32", "text": "To design more accurate and robust solutions, the subsequent publications are forked into two directions." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-33", "text": "One is to continue refining the technology of semantic parsing." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-34", "text": "For instance, Huang et al. proposed a new type of semantic representation to conduct finegrained inference [13] ." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-35", "text": "The other direction attempts to exploit the advantages of deep learning models, with the availability of largescale training datasets." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-36", "text": "This is an emerging research direction for MWP solvers and we observed two instances, including Deep Neural Solver [14] using recurrent neural network to generate the equation template, and MathDQN [15] relying on the deep reinforcement learning framework." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-37", "text": "It is expected to witness more and more deep learning based methods to solve MWPs in the future." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-38", "text": "To sum up, we present a comprehensive survey to review the MWP solvers proposed in recent years." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-39", "text": "Researchers in the community can benefit from this survey in the following ways: 1) We provide a wide coverage on the math word problems, including arithmetic word problem, equation set problem, geometry word problem and miscellaneous sub-tasks related to automatic math solvers." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-40", "text": "The practicers can easily identify all relevant approaches for performance evaluations." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-41", "text": "We observed that the unawareness of relevant competitors is not uncommon in the past literature." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-42", "text": "We are positive that the availability of our survey can help avoid such unawareness." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-43", "text": "2) The solvers designed for arithmetic word problems (AWP) with only one unknown variable and equation set problems (ESP) with multiple unknown variables are often not differentiated by previous works." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-44", "text": "In fact, the methods proposed for ESP are more general and can be used to solve AWP." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-45", "text": "In this survey, we clearly identify the difference and organize them in separate sections." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-46", "text": "We also generalize their methods in terms of the number of operands and the type of operators they are capable to support." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-47", "text": "3) Feature engineering plays an vital role to bridge the gap of semantic parsing." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-48", "text": "Almost all MWP solvers state their strategies of crafting effective features, resulting in a very diversified group of features, and there is a lack of clear organization among these features." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-49", "text": "In this survey, we will be the first to summarize all these proposed features in Table 5 . 4) As for the fairness on performance evaluations, ideally, there should be a benchmark dataset well accepted and widely adopted by the MWP research community, just like Ima-geNet [16] for visual object recognition and VQA [17] , [18] for visual question answering." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-50", "text": "Unfortunately, we observed that many approaches tend to compile their own datasets to verify their superiorities, which result in missing of relevant competitors as mentioned above." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-51", "text": "Tables 2 and 3 integrate the results of existing methods on all the public datasets." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-52", "text": "After collecting the accuracies that have been reported in the past literature, we observed many empty cells in the table." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-53", "text": "Each empty cell refers to a missing experiment on a particular algorithm and dataset." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-54", "text": "In this survey, we make our best efforts to fill the missing results by conducting a considerable number of additional experiments, and guarantee that this survey provides comprehensive comparison and delivers explicit experimental analysis." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-55", "text": "The remainder of the paper is organized as follows." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-56", "text": "We first review the arithmetic word problem solvers in Section 2, followed by equation set problem solvers in Section 3." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-57", "text": "Since feature engineering deserves special attention, we summarize the extracted features as well as the associated pre-processing techniques in Section 4." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-58", "text": "The geometric word problem solvers are reviewed in Section 5." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-59", "text": "We also cover miscellaneous automatic solvers related to math problems in Section 6." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-60", "text": "We conclude the paper and point out several future directions in MWPs that are worth examination in the final section." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-61", "text": "----------------------------------" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-62", "text": "**ARITHMETIC WORD PROBLEM SOLVER**" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-63", "text": "The arithmetic word problems are targeted for elementary school students." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-64", "text": "The input is the text description for the math problem, represented in the form of a sequence of k words w 0 , w 1 , . . . , w k ." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-65", "text": "There are n quantities q 1 , q 2 , . . . 
, q n mentioned in the text and an unknown variable x whose value is to be resolved." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-66", "text": "Our goal is to extract the relevant quantities and map this problem into an arithmetic expression E whose evaluation value provides the solution to the problem." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-67", "text": "There are only four types of fundamental operators O = {+, \u2212, \u00d7, \u00f7} involved in the expression E." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-68", "text": "An example of arithmetic word problem is illustrated in Figure 2 ." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-69", "text": "The relevant quantities to be extracted from the text include 17, 7 and 80." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-70", "text": "The number of hours spent on the bike is the unknown variable x. To solve the problem, we need to identify the correct operators between the quantities and their operation order such that we can obtain the final equation 17 + 7x = 80 or expression x = (80 \u2212 17) \u00f7 7 and return 9 as the solution to this problem." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-71", "text": "In this section, we consider feature extraction as a black box and focus on the high-level algorithms and models." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-72", "text": "The details of feature extraction will be comprehensively present in Section 4." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-73", "text": "We classify existing algebra word problem solvers into three categories: rule-based, statistic-based and tree-based methods." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-74", "text": "----------------------------------" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-75", "text": "**RULE-BASED METHODS**" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-76", "text": "The early approaches to math word problems are rule-based systems based on hand engineering." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-77", "text": "Published in 1985, WORD-PRO [7] solves one-step arithmetic problems." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-78", "text": "It predefines four types of schemas, including change-in, change-out, combine and compare." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-79", "text": "The problem text is transformed into a set of propositions and the answer is derived with simple reasoning based on the propositions." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-80", "text": "Another system ROBUST, developed by Bakman [8] , could understand free-format multi-step arithmetic word problems." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-81", "text": "It further expands the change schema of WORDPRO [7] into six distinct categories." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-82", "text": "The problem text is split into sentences and each sentence is mapped to a proposition." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-83", "text": "Yun et al. also proposed to use schema for multi-step math problem solving [19] ." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-84", "text": "However, the implementation details are not explicitly revealed in [19] ." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-85", "text": "Since these systems have been out of date, we only provide such a brief overview to cover the representative ones." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-86", "text": "Readers can refer to [9] for a comprehensive survey of early rule-driven systems for automatic understanding of natural language math problems." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-87", "text": "----------------------------------" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-88", "text": "**STATISTIC-BASED METHODS**" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-89", "text": "The statistic-based methods leverage traditional machine learning models to identify the entities, quantities and operators from the problem text and yield the numeric answer with simple logic inference procedure." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-90", "text": "The scheme of quantity entailment proposed in [20] can be used to solve arithmetic problems with only one operator." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-91", "text": "It involves three types of classifiers to detect different properties of the word problem." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-92", "text": "The quantity pair classifier is trained to determine which pair of quantities would be used to derive the answer." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-93", "text": "The operator classifier picks the operator op \u2208 {+, \u2212, \u00d7, \u00f7} with the highest probability." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-94", "text": "The order classifier is relevant only for problems involving subtraction or division because the order of operands matters for these two types of operators." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-95", "text": "With the inferred expression, it is straightforward to calculate the numeric answer for the simple math problem." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-96", "text": "To solve math problems with multi-step arithmetic expression, the statistic-based methods require more advanced logic templates." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-97", "text": "This usually incurs additional overhead to annotate the text problems and associate them with the introduced template." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-98", "text": "As an early attempt, ARIS [21] defines a logic template named state that consists of a set of entities, their containers, attributes, quantities and relations." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-99", "text": "For example, \"Liz has 9 black kittens\" initializes the number of kitten (referring to an entity) with black color (referring to an attribute) and belonging to Liz (referring to a container)." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-100", "text": "The solution splits the problem text into fragments and tracks the update of the states by verb categorization." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-101", "text": "More specifically, the verbs are classified into seven categories: observation, positive, negative, positive transfer, negative transfer, construct and destroy." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-102", "text": "To train such a classifier, we need to annotate each split fragment in the training dataset with the associated verb category." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-103", "text": "Another drawback of ARIS is that it only supports addition and subtraction." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-104", "text": "[22] follows a similar processing logic to ARIS." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-105", "text": "It predefines a corpus of logic representation named schema, inspired by [8] ." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-106", "text": "The sentences in the text problem are examined sequentially until the sentence matches a schema, triggering an update operation to modify the number associated with the entities." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-107", "text": "In [23] , Mitra et al. proposed a new logic template named formula." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-108", "text": "Three types of formulas are defined, including part whole, change and comparison, to solve problems with addition and subtraction operators." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-109", "text": "For example, the text problem \"Dan grew 42 turnips and 38 cantelopes." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-110", "text": "Jessica grew 47 turnips." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-111", "text": "How many turnips did they grow in total?\" is annotated with the partwhole template: whole : x, parts : {42, 47} ." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-112", "text": "To solve a math problem, the first step connects the assertions to the formulas." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-113", "text": "In the second step, the most probable formula is identified using the log-linear model with learned parameters and converted into an algebraic equation." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-114", "text": "Another type of annotation is introduced in [24] , [25] to facilitate solving a math problem." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-115", "text": "A group of logic forms are predefined and the problem text is converted into the logic form representation by certain mapping rules." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-116", "text": "For instance, the sentence \"Fred picks 36 limes\" will be transformed into verb 36) ." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-117", "text": "Finally, logic inference is performed on the derived logic statements to obtain the answer." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-118", "text": "To sum up, these statistical-based methods have two drawbacks that limit their usability." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-119", "text": "First, it requires additional annotation overhead that prevents them from handling large-scale datasets." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-120", "text": "Second, these methods are essentially based on a set of pre-defined templates, which are brittle and rigid." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-121", "text": "It will take great efforts to extend the templates to support other operators like multiplication and division." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-122", "text": "It is also not robust to diversified datasets." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-123", "text": "In the following, we will introduce the tree-based solutions, which are widely adopted and become the mainstreaming solutions to arithmetic word problems." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-124", "text": "----------------------------------" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-125", "text": "**TREE-BASED METHODS**" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-126", "text": "The arithmetic expression can be naturally represented as a binary tree structure such that the operators with higher priority are placed in the lower level and the root of the tree contains the operator with the lowest priority." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-127", "text": "The idea of tree-based approaches [26] , [27] , [28] , [15] is to transform the derivation of the arithmetic expression to constructing an equivalent tree structure step by step in a bottom-up manner." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-128", "text": "One of the advantages is that there is no need for additional annotations such as equation template, tags or logic forms." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-129", "text": "Figure 3 shows two tree examples derived from the math word problem in Figure 2 ." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-130", "text": "One is called expression tree that is used in [26] , [28] , [15] and the other is called equation tree in [27] ." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-131", "text": "These two types of trees are essentially equivalent and result in the same solution, except that equation tree contains a node for the unknown variable x. The overall algorithmic framework among the tree-based approaches consists of two processing stages." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-132", "text": "In the first stage, the quantities are extracted from the text and form the bottom level of the tree." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-133", "text": "The candidate trees that are syntactically valid, but with different structures and internal nodes, are enumerated." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-134", "text": "In the second stage, a scoring function is defined to pick the best matching candidate tree, which will be used to derive the final solution." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-135", "text": "A common strategy among these algorithms is to build a local classifier to determine the likelihood of an operator being selected as the internal node." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-136", "text": "Such local likelihood is taken into account in the global scoring function to determine the likelihood of the entire tree." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-137", "text": "Roy et al. [26] proposed the first algorithmic approach that leverages the concept of expression tree to solve arithmetic word problems." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-138", "text": "Its first strategy to reduce the search space is training a binary classifier to determine whether an extracted quantity is relevant or not." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-139", "text": "Only the relevant ones are used for tree construction and placed in the bottom level." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-140", "text": "The irrelevant quantities are discarded." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-141", "text": "The tree construction procedure is mapped to a collection of simple prediction problems, each determining the lowest common ancestor operation between a pair of quantities mentioned in the problem." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-165", "text": "Second, each quantity mentioned in the sentence can be used at most once in the final equation." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-166", "text": "The tree construction procedure consists of a pipeline of predictors that identify irrelevant quantities, recognize grounded variables, and generate the final equation tree." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-167", "text": "With customized feature selection and SVM based classifier, the relevant quantities and variables are extracted and used as the leaf nodes of the equation tree." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-168", "text": "The tree is built in a bottom-up manner." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-169", "text": "It is worth noting that to reduce the search space and simplify the tree construction, only adjacent nodes are combined to generate their parent node." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-170", "text": "UnitDep [28] can be viewed as an extension work of [26] by the same authors." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-171", "text": "An important concept, named Unit Dependency Graph (UDG), is proposed to enhance the scoring function." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-172", "text": "The vertices in UDG consist of the extracted quantities." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-173", "text": "If the quantity correspond to a rate (e.g., 8 dollars per hour), the vertex is marked as RATE." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-174", "text": "There are six types of edge relations to be considered, such as whether two quantities are associated with the same unit." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-175", "text": "Building the UDG requires additional annotation overhead as we need to train two classifiers for the nodes and edges." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-176", "text": "The node classifier determines whether a node is associated with a rate." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-177", "text": "The edge classifier predicts the type of relationship between any pair of quantity nodes." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-178", "text": "Given a valid unit dependency graph G generated by the classifiers, its likelihood is defined as" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-179", "text": "In other words, we sum up the prediction probability for the RATE nodes and all the edges." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-180", "text": "The new scoring function for an expression tree extends Equation 1 by incorporating \u03c6(G)." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-181", "text": "Rules are defined to enforce the rate consistence between an expression tree T and a candidate graph G. For example, if v i is the only node in the tree that is labeled RATE and it appears in the question, there should not exist a path from the leaf node to the root which only contains operators of addition and subtraction." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-182", "text": "Finally, the candidate graph G with the highest likelihood and rate-consistent with T is used to calculate the total score of T ." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-183", "text": "In [15] , Wang et al. made the first attempt of applying deep reinforcement learning to solve arithmetic word problems." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-184", "text": "The motivation is that deep Q-network has witnessed success in solving various problems with big search space such as playing text-based games [31] , information extraction [32] , text generation [33] and object detection in images [34] ." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-185", "text": "To fit the math problem scenario, they formulate the expression tree construction as a Markov Decision Process and propose the MathDQN that is customized from the general deep reinforcement learning framework." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-186", "text": "Technically, they tailor the definitions of states, actions, and reward functions which are key components in the reinforcement learning framework." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-187", "text": "By using a two-layer feedforward neural network as the deep Q-network to approximate the Q-value function, the framework learns model parameters from the reward feedback of the environment." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-188", "text": "Compared to the aforementioned approaches, MathDQN iteratively picks the best operator for two selected quantities." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-189", "text": "This procedure can be viewed as beam search with k = 1 when exploiting candidate expression trees." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-190", "text": "Its deep Q-network acts as the operator classifier and guides the model to select the most promising operator for tree construction." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-191", "text": "----------------------------------" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-192", "text": "**DATASET REPOSITORY AND PERFORMANCE ANALYSIS**" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-193", "text": "The accuracy of arithmetic word problems is evaluated on the datasets that are manually harvested and annotated from online websites." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-194", "text": "These datasets are small-scale and contain hundreds of math problems." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-195", "text": "In this subsection, we make a summary on the datasets that have been used in the aforementioned datasets." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-196", "text": "Moreover, we organize the performance results on these datasets into one unified table." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-197", "text": "We also make our best efforts to conduct additional experiments." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-198", "text": "The new results are highlighted in blue color." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-199", "text": "In this way, readers can easily identify the best performers in each dataset." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-200", "text": "----------------------------------" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-201", "text": "**DATASETS**" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-202", "text": "There have been a number of datasets collected for the arithmetic word problems." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-203", "text": "We present their descriptions in the following and summarize the statistics of the datasets in Table 1. 1) AI2 [21] ." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-204", "text": "There are 395 single-step or multi-step arithmetic word problems for the third, fourth, and fifth graders." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-205", "text": "It involves problems that can be solved with only addition and subtraction." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-206", "text": "The dataset is harvested from two websites: math-aids.com and ixl.com and comprises three subsets: MA1 (from math-aids.com), IXL (from ixl.com) and MA2 (from math-aids.com)." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-207", "text": "Among them, IXL and MA2 are more challenging than MA1 because IXL contains more information gaps and MA2 includes more irrelevant information in its math problems." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-208", "text": "2) IL [26] ." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-209", "text": "The problems are collected from websites k5learning.com and dadsworksheets.com." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-210", "text": "The problems that require background knowledge (e.g., \"apple is fruit\" and \"a week comprises 7 days\") are pruned." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-211", "text": "To improve the diversity, the problems are clustered by textual similarity." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-212", "text": "For each cluster, at most 5 problems are retained." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-213", "text": "Finally, the dataset contains 562 single-step word problems with only one operator, including addition, subtraction, multiplication, and division." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-214", "text": "3) CC [26] ." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-215", "text": "The dataset is designed for mult-step math problems." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-216", "text": "It contains 600 multi-step problems without irrelevant quantities, harvested from commoncoresheets.com." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-217", "text": "The dataset involves various combinations of four basic operators, including (a) addition followed by subtraction; (b) subtraction followed by addition; (c) addition and multiplication; (d) addition and division; (e) subtraction and multiplication; and (f) subtraction and division." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-218", "text": "It is worth noting that this dataset does not incorporate irrelevant quantities in the problem text." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-219", "text": "Hence, there is no need to apply the quantity relevance classifier for the algorithms containing this component." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-220", "text": "4) SingleEQ [27] ." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-221", "text": "The dataset contains both single-step and multi-step arithmetic problems and is a mixture of problems from a number of sources, including math-aids.com, k5learning.com, ixl.com and a subset of the data from AI2." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-222", "text": "Each problem involves operators of multiplication, division, subtraction, and addition over non-negative rational numbers." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-223", "text": "5) AllArith [28] ." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-224", "text": "The dataset is a mixture of the data from AI2, IL, CC and SingleEQ." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-225", "text": "All mentions of quantities are normalized into digit representation." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-226", "text": "To capture how well the automatic solvers can distinguish between different problem types, near-duplicate problems (with over 80% match of unigrams and bigrams) are removed." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-227", "text": "Finally, there remain 831 math problems." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-228", "text": "6) Dolphin-S. This is a subset of Dolphin18K [12] which originally contains 18, 460 problems and 5, 871 templates with one or multiple equations." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-229", "text": "The problems whose template is associated with only one problem are extracted as the dataset of Dolphin-S. It contains 115 problems with single operator and 6, 955 problems with multiple operators." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-230", "text": "7) Math23K [14] ." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-231", "text": "The dataset contains Chinese math word problems for elementary school students and is crawled from multiple online education websites." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-232", "text": "Initially, 60, 000 problems with only one unknown variable are collected." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-233", "text": "The equation templates are extracted in a rule-based manner." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-234", "text": "To ensure high precision, a large number of problems that do not fit the rules are discarded." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-235", "text": "Finally, 23, 161 math problems with 2, 187 templates are remained." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-236", "text": "----------------------------------" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-237", "text": "**PERFORMANCE ANALYSIS**" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-238", "text": "Given the aforementioned datasets, we merge the experimental results reported from previous works into one table." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-239", "text": "Such a unified organization can facilitate readers in identifying the methods with superior performance in each dataset." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-240", "text": "As shown in Table 2 , the rows refer to the corpus of datasets and the columns are the statistic-based and tree-based methods." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-241", "text": "The cells are filled with the accuracies of these algorithms when solving math word problems in different datasets." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-242", "text": "We conduct additional experiments to cover all the cells by the tree-based solutions." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-243", "text": "These new experiment results are highlighted in blue color." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-244", "text": "Those with missing value are indicated by \"-\" and it means that there was no experiment conducted for the algorithm in the particular dataset." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-245", "text": "The main reason is that they require particular efforts on logic templates and annotations, which are very trivial and cumbersome for experiment reproduction." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-246", "text": "There is no algorithm comparison for the dataset Math23K because the problem text is in Chinese and the feature extraction technologies proposed in the statistic-based and tree-based approaches are not applicable." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-247", "text": "From the results in Table 2 , we derive the following observations worth noting and provide reasonings to explain the results." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-248", "text": "First, the statistic-based methods with advanced logic representation, such as Schema [22] , Formula [23] and Logic-Form [24] , [25] , achieve dominating performance in the AI2 dataset." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-249", "text": "Their superiority is primarily owned to the additional efforts on annotating the text problem with more advanced logic representation." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-250", "text": "These annotations allow them to conduct finegrained reasoning." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-251", "text": "In contrast, ARIS [21] does not work as good because it focuses on \"change\" schema of quantities and does not fully exploit other schemas like \"compare\" [22] ." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-252", "text": "Since there are only hundreds of math problems in the datasets, it is feasible to make an exhaustive scan on the math problems and manually define the templates to fit these datasets." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-253", "text": "For instance, all quantities and the main-goal are first identified by rules in LogicForm [24] , [25] and explicitly associated with their roletags." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-254", "text": "Thus, with sufficient human intervention, the accuracy of statistic-baesd methods in AI2 can boost to 88.64%, much higher than that of tree-based methods." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-255", "text": "Nevertheless, these statistic-based methods are considered as brittle and rigid [12] and not scalable to handle large and diversified datasets, primarily due to the heavy annotation cost to train an accurate mapping between the text and the logic representation." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-256", "text": "Second, the results of tree-based methods in AI2, IL and CC are collected from [15] where the same experimental setting of 3fold cross validation is applied." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-257", "text": "It is interesting to observe that ALGES [27] , ExpressionTree [26] and UNITDEP [28] cannot perform equally well on the three datasets." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-258", "text": "ALGES works poorly in AI2 because irrelevant quantities exist in its math problems and ALGES is not trained with a classifier to get rid of them." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-259", "text": "However, it outperforms ExpressionTree and UNITDEP by a wide As to the datasets of SingleEQ and AllArith, UNITDEP is a winner in both datasets, owning to the effectiveness of the proposed unit dependency graph (UDG)." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-260", "text": "In the math problems with operators {\u00d7, \u00f7}, the unit and rate are important clues to determine the correct quantities and operators in the math expression." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-261", "text": "The UDG poses constraints on unit compatibility to filter the false candidates in the expression tree construction." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-262", "text": "It can alleviate the brittleness of the unit extraction system, even though it requires additional annotation overhead in order to induce UDGs." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-263", "text": "Last but not the least, these experiments were conducted on small-scale datasets and their performances on larger and more diversified datasets remain unclear." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-264", "text": "Recently, Huang et al. have noticed the gap and released Dolphin18K [12] which contains 18, 460 problems and 5, 871 templates with one or multiple equations." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-265", "text": "The findings in [12] are astonishing." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-266", "text": "The accuracies of existing approaches for equation set problems, which will be introduced in the next section, degrade sharply to less than 25%." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-267", "text": "These methods cannot even perform better than a simple baseline that first uses text similarity to find the most similar text problem in the training dataset and then fills the number slots in its associated equation template." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-268", "text": "To discover the performance of existing arithmetic problem solvers in large-scale datasets, we conduct similar experiments to examine the performance of tree-based approaches in Dolphin-S. The results are shown in the last two columns of Table 2 , leading to two conclusions." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-269", "text": "First, the overall performances of existing tree-based arithmetic problem solvers are not promising in large-scale and diversified datasets." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-270", "text": "There is still great room for improvement in the research area of arithmetic math word problem solver." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-271", "text": "Second, MathDQN exhibits superior performance over its two tree-based competitors, verifying its robustness by using the deep reinforcement learning framework." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-272", "text": "----------------------------------" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-273", "text": "**EQUATION SET SOLVER**" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-274", "text": "The equation set problems are much more challenging because they involve multiple unknown variables to resolve and require to formulate a set of equations to obtain the final solution." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-275", "text": "The aforementioned arithmetic math problem can be viewed as a simplified variant of equation set problem with only one unknown variable." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-276", "text": "Hence, the methods introduced in this section can also be applied to solve the problems in Section 2." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-277", "text": "Figure 4 shows an example of equation set problem." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-278", "text": "There are two unknown variables, including the acres of corn and the acres of wheat, to be inferred from the text description." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-279", "text": "A standard solution to this problem is to use variables x and y to represent the number of corn and wheat, respectively." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-280", "text": "From the text understanding, we can formulate two equations 42x + 30y = 18600 and x + y = 500." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-281", "text": "Finally, the values of x and y can be inferred." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-282", "text": "Compared to the arithmetic problem, the equation set problem contains more numbers of unknown variables and numbers in the text, resulting in a much larger search space to enumerate valid candidate equations." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-283", "text": "Hence, the methods designed for arithmetic problems can be hardly applied to solve equation set problems." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-284", "text": "For instance, the tree-based methods assume that the objective is to construct a single tree to maximize a scoring function." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-285", "text": "They require substantial revision to adjust the objective to building multiple trees which has exponentially higher search space." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-286", "text": "This would be likely to degrade the performance." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-287", "text": "In the following, we will review the existing methods, categorize them into four groups from the technical perspective, and examine how they overcome the challenge." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-288", "text": "----------------------------------" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-289", "text": "**PARSING-BASED METHODS**" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-290", "text": "The work of [35] can be viewed as an extension of tree-based approaches to solve a math problem with multiple equations." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-291", "text": "Since the objective is no longer to build an equation tree, a meaning representation language called DOL is designed as the structural semantic representation of natural language text." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-292", "text": "The core component is a semantic parser that transforms the textual sentences into DOL trees." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-293", "text": "The parsing algorithm is based on context-free grammar (CFG) [36] , [37] , a popular mathematical system for modeling constituent structure in natural languages." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-294", "text": "For every DOL node type, the lexicon and grammar rules are constructed in a semi-supervised manner." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-295", "text": "The association between math-related concepts and their grammar rules is manually constructed." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-296", "text": "Finally, the CFG parser is built on top of 9, 600 grammar rules." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-297", "text": "During the parsing, a score is calculated for each DOL node and the derivation of the DOL trees with the highest score is selected to obtain the answer via a reasoning module." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-298", "text": "----------------------------------" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-299", "text": "**SIMILARITY-BASED METHODS**" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-300", "text": "The work of [12] plays an important role in the research line of automatic math problem solvers because it rectifies the understanding of technology development in this area." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-301", "text": "It, for the first time, examines the performance of previous approaches in a large and diversified dataset and derives astonishing experimental findings." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-302", "text": "The methods that claimed to achieve an accuracy higher than 70% in a small-scale and self-collected dataset exhibit very poor performance in the new dataset." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-303", "text": "In other words, none of the methods proposed before [12] is really general and robust." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-304", "text": "Hence, the authors reach a conclusion that the math word problems are still far from being satisfactorily solved." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-305", "text": "A new baseline method based on text similarity, named SIM, is proposed in [12] ." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-306", "text": "In the first step, the problem text is converted into a word vector whose values are the associated TF-IDF scores [38] ." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-307", "text": "The similarity between two problems (one is the query problem to solve and the other is a candidate in the training dataset with known solutions) is calculated by the Jaccard similarity between their converted vectors." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-308", "text": "The problem with the highest similarity score is identified and its equation template is used to help solve the query math problem." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-309", "text": "In the second step, the unknown slots in the template are filled." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-310", "text": "With the availability of a large training dataset, the number filling is conducted in a simple and effective way." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-311", "text": "It finds an instance in the training dataset associated with the same target template and the minimum edit-distance to the query problem, and aligns the numbers in these two problems with ordered and one-to-one mapping." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-312", "text": "It is considered a failure if these two problems do not contain the same number of quantities." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-313", "text": "----------------------------------" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-314", "text": "**TEMPLATE BASED METHODS**" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-315", "text": "There are two methods [39] , [40] that pre-define a collection of equation set templates." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-316", "text": "Each template contains a set of number slots and unknown slots." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-317", "text": "The number slots are filled by the numbers extracted from the text and the unknown slots are aligned to the nouns." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-318", "text": "An example of template may look like" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-319", "text": "where n i is a number slot and u i represents an unknown variable." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-320", "text": "To solve an equation set problem, these approaches first identify a candidate template from the corpus of pre-defined templates." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-321", "text": "The next task is to fill the number slots and unknown slots with the information extracted from the text." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-322", "text": "Finally, the instance with the highest probability is returned." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-323", "text": "This step often involves a scoring function or a rank-aware classifier such as RankSVM [41] ." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-324", "text": "A widely-adopted practice is to define the probability of each instance of derivation y based on the feature representation x for a text problem and a parameter vector \u03b8, as in [30] , [39] , [40] : With the optimal derivation instance y opt , we can obtain the final solution." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-325", "text": "In [39] , the objective is to maximize the total probabilities of y that leads to the correct answer." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-326", "text": "The latent variables \u03b8 are learned by directly optimizing the marginal data log-likelihood." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-327", "text": "More specifically, L-BFGS [42] is used to optimize the parameters." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-328", "text": "The search space is exponential to the number of slots because each number in the text can be mapped to any number slot and the nouns are also candidates for the unknown slots." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-329", "text": "In practice, the search space is too huge to find the optimal \u03b8 and beam search inference procedure is adopted to prevent enumerating all the possible y leading to the correct answer." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-330", "text": "For the completion of each template, the next slot to be considered is selected according to a pre-defined canonicalized ordering and only top-k partial derivations are maintained." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-331", "text": "In [40] , Zhou et al. proposed an enhanced algorithm for the template-based learning framework." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-332", "text": "First, they only consider assigning the number slots with numbers extracted from the text." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-333", "text": "The underlying logic is that when the number slots have been processed, it would be an easy task to fill the unknown slots." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-334", "text": "In this way, the hypothesis space can be significantly reduced." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-335", "text": "Second, the authors argue that the beam search used in [39] does not exploit all the training samples, and its resulting model may be sub-optimal." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-336", "text": "To resolve the issue, the max-margin objective [43] is used to train the log-linear model." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-337", "text": "The training process is turned into a QP problem that can be efficiently solved with the constraint generation algorithm [44] ." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-338", "text": "Since the annotation of equation templates is expensive, a key challenge to KAZB and ZDC is the lack of sufficient annotated data." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-339", "text": "To resolve the issue, in [45] , Upadhyay et al. attempted to exploit the large number of algebra word problems that have been posted and discussed in online forums." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-340", "text": "These data are not explicitly annotated with equation templates but their numeric answers are extracted with little or no manual effort." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-341", "text": "The goal of [45] is to improve a strong solver trained by fully annotated data with a large number of math problems with noisy and implicit supervision signals." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-342", "text": "The proposed MixedSP algorithm makes use of both explicit and implicit supervised examples mixed at the training stage and learns the parameters jointly." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-343", "text": "With the learned model to formulate the mapping between an algebra word problem and an equation template, the math problem solving strategy is similar to KAZB and ZDC." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-344", "text": "All the templates in the training set have to be exploited to find the best alignment strategy." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-345", "text": "The aforementioned template-based methods suffer from two drawbacks [13] ." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-346", "text": "First, the math concept is expressed as an entire template and may fail to work well when the training instances are sparse." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-347", "text": "Second, the learning process relies on lexical and syntactic features such as the dependency path between two slots in a template." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-348", "text": "Such a huge and sparse feature space may play a negative impact on effective feature learning." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-349", "text": "Based on these two arguments, FG-Expression [13] parses an equation template into fine-grained units, called template fragment." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-350", "text": "Each template is represented in a tree structure as in Figure 3 and each fragment represents a sub-tree rooted at an internal node." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-351", "text": "The main objective and challenge in [13] are learning an accurate mapping between textual information and template fragments." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-352", "text": "For instance, a text piece \"20% discount\" can be mapped to a template fragment 1 \u2212 n 1 with n 1 = 0.2." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-353", "text": "Such mappings are extracted in a semi-supervised way from training datasets and stored as part of the sketch for templates." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-354", "text": "The proposed solution to a math problem consists of two stages." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-355", "text": "First, RankSVM model [41] is trained to select top-k templates." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-356", "text": "The features used for the training incorporate textual features, quantity features and solution features." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-357", "text": "It is worth noting that the proposed template fragment is applied in the feature selection for the classifier." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-358", "text": "The textual features preserve one dimension to indicate whether the problem text contains textual expressions in each template fragment." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-359", "text": "In the second stage, the alignment is conducted for the k templates and the one with the highest probability is used to solve the problem." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-360", "text": "The features and rank-based classifier used to select the best alignment are similar to those used in the first stage." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-361", "text": "Compared to the previous template-based methods, FG-Expression also significantly reduces the search space because only top-k templates are examined whereas previous methods align numbers for all the templates in the training dataset." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-362", "text": "----------------------------------" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-363", "text": "**DL-BASED METHODS**" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-364", "text": "In recent years, deep learning (DL) has witnessed great success in a wide spectrum of \"smart\" applications, such as video captioning [46] , video event recognition [47] , human action regnition [48] , visual question answering [49] , and question answering with knowledge base [50] ." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-365", "text": "The main advantage is that with sufficient amount of training data, DL is able to learn an effective feature representation in a data-driven manner without human intervention." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-366", "text": "It is not surprising to notice that several efforts have been attempted to apply DL for math word problem solving." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-367", "text": "Deep Neural Solver (DNS) [14] is the first deep learning based algorithm that does not rely on hand-crafted features." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-368", "text": "This is a milestone contribution because all the previous methods (including MathDQN) require human intelligence to help extract features that are effective." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-369", "text": "The deep model used in DNS is a typical sequence to sequence (seq2seq) model [51] , [52] , [53] ." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-370", "text": "The words in the problem are vectorized into features through word embedding techniques [54] , [55] ." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-465", "text": "Sorts of parsers have been developed, among which the Stanford parser works as the most comprehensive and widely-adopted one." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-371", "text": "In the encoding layer, GRU [56] is used as the Recurrent Neural Network (RNN) to capture word dependency because compared to LSTM [57] , GRU has fewer parameters in the model and is less likely to be over-fitting in small datasets." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-372", "text": "This seq2seq model translates math word problems to equation templates, followed by a number mapping step to fill the slots in the equation with the quantities extracted from the text." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-373", "text": "To ensure that the output equations by the model are syntactically correct, five rules are pre-defined as validity constraints." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-374", "text": "For example, if the i th character in the output sequence is an operator in {+, \u2212, \u00d7, \u00f7}, then the model cannot result in c \u2208 {+, \u2212, \u00d7, \u00f7, ), =} for the (i + 1) th character." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-375", "text": "To further improve the accuracy, DNS enhances the model in two ways." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-376", "text": "First, it builds a LSTM-based binary classification model to determine whether a number is relevant." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-377", "text": "This is similar to the relevance model trained in ExpressionTree [26] and UNITDEP [28] ." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-378", "text": "The difference is that DNS uses LSTM as the classifier with unsupervised word-embedding features whereas ExpressionTree and UNITDEP use SVM with handcrafted features." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-379", "text": "Second, the seq2seq model is integrated with a similarity-based method [12] introduced in Section 3.2." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-380", "text": "Given a pre-defined threshold, the similarity-based retrieval strategy is selected as the solver if the maximal similarity score is higher than the threshold." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-381", "text": "Otherwise, the seq2seq model is used to solve the problem." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-382", "text": "Another follow-up of DNS was proposed recently in [58] ." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-383", "text": "Instead of using GRU and LSTM, the math solver examines the performance of other seq2seq models when applied in mapping the problem text to equation templates." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-384", "text": "In particular, two models including BiLSTM [57] and structured self-attention [59] , were examined respectively for the equation template classification task." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-385", "text": "Results show that both models achieve comparable performance." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-386", "text": "----------------------------------" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-387", "text": "**DATASET REPOSITORY AND PERFORMANCE ANALYSIS**" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-388", "text": "Similar to the organization of Section 2, we summarize the dataset repository and performance analysis for the equation set solvers." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-389", "text": "----------------------------------" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-390", "text": "**BENCHMARK DATASETS**" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-391", "text": "There have been four datasets specifically collected for the equation set problems that involve multiple unknown variables." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-392", "text": "We present their descriptions in the following and summarize the statistics of the datasets in Table 3 ." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-393", "text": "We use # of problems # of templates to report the average number of problems associated with each template." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-394", "text": "We noticed that in each dataset, a small fraction of problems are associated with one unknown variable in the template." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-395", "text": "Thus, we also report the number of single-equation problems in each dataset." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-396", "text": "1) ALG514 [39] ." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-397", "text": "The dataset is crawled from Algebra.com, a crowd-sourced tutoring website." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-398", "text": "The problems are posted by students." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-399", "text": "The problems with information gap or require explicit background knowledge are discarded." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-400", "text": "Consequently, a set of 1024 questions is collected and cleaned by crowdworkers in Amazon Mechanical Turk." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-401", "text": "These problems are further filtered as the authors require each equation template to appear for at least 6 times." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-402", "text": "Finally, 514 problems are left in the dataset." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-403", "text": "2) Dolphin1878 [35] ." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-404", "text": "Its math problems are crawled from two websites: albegra.com and answers.yahoo.com." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-405", "text": "For math problems in answers.yahoo.com, the math equations and answers are manually added by human annotators." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-406", "text": "Finally, the dataset combined from the two sources contains 1, 878 math problems with 1183 equation templates." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-407", "text": "3) DRAW1K [60] ." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-408", "text": "The authors of DRAW1K argued that Dolphin1878 has limited textual variations and lacks narrative." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-409", "text": "This motivated them to construct a new dataset that is diversified in both vocabularies and equation systems." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-410", "text": "With these two objectives, they constructed DRAW1K with exactly 1, 000 linear equation problems that are crawled and filtered from algebra.com." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-411", "text": "4) Dolphin18K [12] ." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-412", "text": "The dataset is collected and rectified mainly from the math category of Yahoo! Answers 1 ." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-413", "text": "The problems, equation system annotations, and answers are extracted semi-automatically, with great intervention of human efforts." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-414", "text": "The procedure consists of four stages: removing irrelevant problems, cleaning problem text, extracting gold answers and constructing equation system annotations." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-415", "text": "The harvested dataset is so far the largest one, with 18, 460 problems and 5, 871 equation templates." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-416", "text": "----------------------------------" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-417", "text": "**PERFORMANCE ANALYSIS**" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-418", "text": "The performances of the equation set solvers on the existing datasets are reported in Table 4 ." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-419", "text": "From the experimental results, we derive the following observations and discussions." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-466", "text": "It is a package consisting of different probabilistic natural language parsers." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-467", "text": "To be more specific, its neural-network parser [66] is a transition-based dependency parser that uses high-order features to achieve high speed and good accuracy; the Compositional Vector Grammar parser [67] can be seen as factoring discrete and continuous parsing in one model; and the (English) Stanford Dependencies representation [68] is an automatic system to extract typed dependency parses from phrase structure parses, where a dependency parse represents dependencies between individual words and a phrase structure parse represents nesting of multi-word constituents." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-468", "text": "Besides Stanford parser, there exist other effective dependency parsers with their own traits." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-469", "text": "For example, [69] presents an easy-fist parsing algorithm that iteratively selects the best pair of neighbors in the tree structure to connect at each parsing step." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-470", "text": "Those parsers account in WMP solvers." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-471", "text": "For instance, the neural-network parser [66] is adopted in [70] for coreference resolution, which is another pre-processing step for MWP solvers." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-472", "text": "UnitDep [28] automatically generates features from a given math problem by analyzing its derived parser tree using the Compositional Vector Grammar parser [67] ." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-473", "text": "Additionally, the Stanford Dependencies representation [68] has been applied in multitple solvers." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-474", "text": "We observed its occurrence in Formula [23] and ARIS [21] to extract attributes of entities (the subject, verb, object, preposition and temporal information), in KAZB [39] to generate part-of-speech tags, lematizations, and dependency parses to compute features, and in ALGES [27] to obtain syntactic information used for grounding and feature computation." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-475", "text": "ExpressionTree [26] is an exceptional case without using Stanford Parser." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-476", "text": "Instead, it uses the easy-fist parsing algorithm [69] to detect the verb associated with each quantity." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-477", "text": "----------------------------------" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-478", "text": "**COREFERENCE RESOLUTION**" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-479", "text": "Co-reference resolution involves the identification and clustering of noun phrases mentions that refer to the same real-world entity." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-480", "text": "The MWP solvers use it as a pre-processing step to ensure the correct arithmetic operations or value update on the same entity." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-481", "text": "[71] is an early deterministic approach which is driven entirely by the syntactic and semantic compatibility learned from a large, unlabeled corpus." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-482", "text": "It allows proper and nominal mentions to only corefer with antecedents that have the same head, but pronominal mentions to corefer with any antecedent." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-420", "text": "First, there are many empty cells in the table." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-421", "text": "In the ideal case, the algorithms should be conducted on all the available benchmark datasets and compared with all the previous competitors." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-422", "text": "The reasons are multi-fold, such as limitation of the implementation (e.g., as Upadhyay stated in [45] , they could not run ZDC on DRAW1K because ZDC can only handle limited types of equation systems), delayed release of the dataset (e.g., the Dolphin18K dataset has not been released when the work of DNS is published) or unfitness in certain scenarios (e.g., the experiments of FG-Expression [13] were only conducted on Dolphin18K because the authors considered that the previous datasets are not suitable due to their limitation in scalability and diversity)." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-423", "text": "It is noticeable that such an incomplete picture brings difficulty to judge the performance and may miss certain insightful findings." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-424", "text": "Second, ALG514 is the smallest dataset and also the most widely adopted dataset for performance evaluation." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-425", "text": "Among the template-based methods, MixedSP outperforms KAZB and ZDC because it benefits from the mined implicit supervision from an external source of additional 2, 000 samples." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-426", "text": "As reported in [45] , if only the explicit dataset (i.e., the problems in ALG514) is used, its performance is slightly inferior to ZDC." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-427", "text": "A possible reason to explain this is that ZDC uses a richer set of features based on POS tags, coreference and dependency parses." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-428", "text": "In contrast, MixedSP only uses features based on POS tags." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-429", "text": "It is also interesting to see that SIM and DNS obtain the same accuracy on ALG514." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-430", "text": "This is the dataset is too small to train an effective deep learning model." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-431", "text": "The reported accuracy of seq2seq model is only 16.1% in ALG514." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-432", "text": "DNS is a hybrid approach that combines a seq2seq model and similarity retrieval model." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-433", "text": "It means the deep learning 1." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-434", "text": "https://answers.yahoo.com/ model does not take any effect when handling problems in ALG514." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-435", "text": "Finally, the datasets of Dolphin1878 and DRAW1K are released for the approaches of DOL and MixedSP, respectively." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-436", "text": "In the experimental settings, simple baselines such as SIM based on textual similarity or KAZB which is the earliest template-based method, are selected." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-437", "text": "It is not surprising to see that Dolphin1878 and DRAW1K outperform their competitors by a large margin." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-438", "text": "Nevertheless, the research for equation set solvers has shifted to proposing methods that can work well in a large and diversified dataset such as Dolphin18K." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-439", "text": "We implemented our own version of DNS and evaluated its performance on the large dataset." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-440", "text": "Unfortunately, we did not observe a higher accuracy derived from DNS." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-441", "text": "The reasons could be two-fold." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-442", "text": "First, our implementation may not be optimized and the model parameters may not be well tuned." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-443", "text": "Second, there are thousands of templates in the datasets, which may bring challenges for the classification task." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-444", "text": "Nevertheless, with the availability of large-scale datasets, applying deep learning models for MWPs is still an interesting research direction that deserves intensive attention." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-445", "text": "----------------------------------" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-446", "text": "**FEATURE EXTRACTION**" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-447", "text": "Feature extraction has been a vital component in the machine learning (ML) workflow." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-448", "text": "Effective feature construction, in an either supervised or semi-supervised manner, can significantly boost the accuracy." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-449", "text": "For instance, SIFT [61] , [62] and other local descriptors [63] , [64] , [65] have been intensively used in the domain of object recognition and image retrieval for decades as they are invariant to scaling, orientation and illumination changes." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-450", "text": "Consequently, a large amount of research efforts have been devoted to design effective features to facilitate ML tasks." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-451", "text": "Such a discipline was partially changed by the emergence of deep learning." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-452", "text": "In the past years, deep learning has transformed the world of artificial intelligence, due to the increasing processing power afforded by graphical processing units (GPUs), the enormous amount of available data, and the development of more advanced algorithms." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-453", "text": "With well-annotated sufficient training data, the methodology can automatically learn the effective latent feature representation for the classification or prediction tasks." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-454", "text": "Hence, it can help replace the manual feature engineering process, which is non-trivial and labor-intensive." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-455", "text": "In the area of automatic math problem solver, as reviewed in the previous sections, the only DL-based approach without feature engineering is DSN [14] ." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-456", "text": "It applies word embedding to vectorize the text information and encodes these vectors by GRU network for automatic feature extraction." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-457", "text": "The limitation is that a large amount of labeled training data is required to make the model effective." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-458", "text": "Before the appearance of DSN, most of the math problem solvers were designed with the availability of small-scale datasets." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-459", "text": "Thus, feature engineering plays an important role in these works to help achieve a high accuracy." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-460", "text": "In this section, we provide a comprehensive review on the features engineering process in the literature and show how they help bridge the gap between textual/visual information and the semantic/logical representation." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-461", "text": "----------------------------------" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-462", "text": "**PREPROCESSING**" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-463", "text": "Before we review the feature space defined in various MWP solvers, we first present the preliminary background on the preprocessing steps that have been commonly adopted to facilitate the subsequent feature extraction." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-464", "text": "Syntactic parsing focuses on organizing the lexical units and their semantic dependency in a tree structure, which serves as a useful resource for effective feature selection." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-142", "text": "The global scoring function for an enumerated tree takes into account two terms." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-143", "text": "The first one, denoted by \u03c6(q), is the likelihood of quantity q being irrelevant, i.e., q is not used in creating the expression tree." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-144", "text": "In the ideal case, all the irrelevant quantities are correctly predicted with high confidence, resulting in a large value for the sum of \u03c6(q)." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-145", "text": "The other term, denoted by \u03c6(op), is the likelihood of selecting op as the operator for an internal tree node." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-146", "text": "With these two factors, Score(E) is formally defined as" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-147", "text": "where I (E) is the group of irrelevant quantities that are not included in expression E, and N refers to the set of internal tree nodes." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-148", "text": "To further reduce the tree enumeration space, beam search is applied in [26] ." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-149", "text": "To generate the next state T \u2032 from the current partial tree, the algorithm avoids choosing all the possible pairs of terms and determining their operator." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-150", "text": "Instead, only top-k candidates with the highest partial scores are retained." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-151", "text": "Experimental results with k = 200 show that the strategy achieves a good balance between accuracy and running time." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-152", "text": "In [29] , the authors publish the service as a web tool and it can respond promptly to a math word problem." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-153", "text": "The solution in [27] , named ALGES, differs from [26] in two major ways." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-154", "text": "First, it adopts a more brutal-force manner to exploit all the possible equation trees." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-155", "text": "More specifically, ALGES does not discard irrevalent quantities, but enumerates all the syntactically valid trees." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-156", "text": "Integer Linear Programming (ILP) is applied as it can help enforce the constraints such as syntactic validity, type consistence and domain specific simplicity considerations." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-157", "text": "Consequently, its computation cost is dozens of times higher than that in [27] , according to an efficiency evaluation in [15] ." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-158", "text": "Second, its scoring function is different from Equation 1." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-159", "text": "There is no need for the term \u03c6(q) because ALGES does not build a classifier to check the quantity relevance." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-160", "text": "Besides the monotonic aggregation of the likelihood from local operator classifiers, the scoring function incorporates a new term \u03c6(P) = \u03b8 T f p to assign a coherence score for the tree instance." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-161", "text": "Here, f P is the global feature extracted from a problem text P, and \u03b8 refers to the parameter vector." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-162", "text": "The goal of [30] is also to build an equation tree by parsing the problem text." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-163", "text": "It makes two assumptions that can simplify the tree construction, but also limit its applicability." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-164", "text": "First, the final output equation form is restricted to have at most two variables." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-483", "text": "On top of [71] , Raghunathan et al. [72] proposed an architecture based on tiers of deterministic coreference models." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-484", "text": "The tiers are processed from the highest to the lowest precision and the entity output of a tier is forwarded to the next tier for further processing." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-485", "text": "[73] is another model that integrates a collection of deterministic coreference resolution models." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-486", "text": "Targeting at exploring rich feature space, [74] proposed a simple classification model for coreference resolution with a well-designed set of features." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-487", "text": "NECo is proposed in [75] and capable of solving both named entity linking and co-reference resolution jointly." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-488", "text": "As to applying coreference resolvers in MWP sovers, the Illinois Coreference Resolver [74] [76] is used in [20] to identify pronoun referents and facilitate semantic labeling." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-489", "text": "In [70] , a rule function Coref (A, B) , which is true when A and B represent the same entity, is derived as a component of the declarative rules to determine the math operators." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-490", "text": "Given a pair of sentences, each containing a quantity, ZDC [40] takes into account the existence of coreference relationship between these two sentences for feature exploitation." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-491", "text": "Meanwhile, ARIS [21] adopts the [72] for coreference resolution and uses the predicted coreference relationships to replace pronouns with their coreferenent links." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-492", "text": "----------------------------------" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-493", "text": "**COMMON FEATURES**" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-494", "text": "There have been various types of features proposed in the past literature." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-495", "text": "We separate them into common features and unique features, according to the number of solvers that have adopted a particular type of feature." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-496", "text": "The unique features were proposed once and not reused in another work, implying that their effect could be limited." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-497", "text": "The common features are considered to be more general and effective, and they are the focus of this survey." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-498", "text": "In Table 5 , we categorize the common features according to their syntactic sources for feature extraction, such as quantities, questions, verbs, etc." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-499", "text": "For each type of proposed feature, we identify its related MWP solvers, and provide necessary examples to explain features that are not straightforward to figure out." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-500", "text": "----------------------------------" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-501", "text": "**QUANTITY-RELATED FEATURES**" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-502", "text": "The basic units in an arithmetic expression or an equation set consist of quantities, unknown variables and operators." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-503", "text": "Hence, a natural idea is to extract quantity-related features to help identify the relevant operands and their associated operators." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-504", "text": "As shown in Table 5 , a binary indicator to determine whether a quantity refers to a rate is adopted in many solvers [26] [28] [15] [40] [45] ." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-528", "text": "The context similarity is measured by the Jaccard similarity on two sets of words among the context windows." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-505", "text": "It signals a strong connection between the quantity and operators of {\u00d7, \u00f7}. The value of the quantity is also useful for operator classifier or quantity relevance classifier." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-506", "text": "For instance, a quantity whose value is a real number between [0, 1] is likely to be associated with multiplication or division operators [40] , [45] ." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-507", "text": "It is also observed that quantities in the text format of \"one\" or \"two\" are unlikely to be relevant with the solution [39] [40], [45] , [13] ." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-508", "text": "Examples include \"if one airplane averages 400 miles per hour,...\" and \"the difference between two numbers is 36\"." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-509", "text": "----------------------------------" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-510", "text": "**CONTEXT-RELATED FEATURES**" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-511", "text": "The information embedded in the text window centered at a particular quantity can also provide important clues for solving math word problems." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-512", "text": "To differentiate two quantities both in the numeric format, we can leverage the word lemmas, part of speech (POS) tags and dependence types within the window as the features." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-513", "text": "In this manner, quantities associated with the same operators would to likely to share similar context information." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-514", "text": "A trivial trick used in [26] [28] [15] is to examine whether there exists comparative adverbs." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-515", "text": "For example, terms \"more\", \"less\" and \"than\" indicate operators of {+, \u2212}." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-516", "text": "----------------------------------" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-517", "text": "**QUANTITY-PAIR FEATURES**" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-518", "text": "The relationship between two quantities is helpful to determine their associated operator." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-519", "text": "A straightforward example is that if two quantities are associated with the same unit, they can be applied with addition and subtraction [26] [28] [15] [40] ." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-520", "text": "If one quantity is related to a rate and the other is associated with a unit that is part of the rate, their operator is likely to be multiplication or division [26] [27] [28] [15] ." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-521", "text": "Numeric relation and context similarity are two types of quantity-pair features proposed in [40] [45] ." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-522", "text": "The former obtains two sets of nouns located within the same sentence as the two quantities and sorts them by the distance in the dependency tree." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-523", "text": "Then, a scoring function is defined to measure the similarity between these two sorted noun lists." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-524", "text": "Higher similarity implies that the two quantities are more likely to be connected by addition or subtraction operators." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-525", "text": "The latter extracts features for equation template classifier." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-526", "text": "It is observed that the contextual information between two numbers is similar, they are likely to be located within in a template with symmetric number slots." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-527", "text": "For example, given a template n 1 \u00d7 u 1 + n 2 \u00d7 u 2 , \"n 1 \" and \"n 2 \" are symmetric." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-529", "text": "Given a problem text \"A plum costs 2 dollars and a peach costs 1 dollars\", \"2\" and \"1\" are two quantities with similar context." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-530", "text": "Two types of quantity-pair features were both adopted in the template-based solutions to equation set problems [39] [40] ." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-531", "text": "The first type is the dependency path between a pair of quantities." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-532", "text": "Their similarity may be helpful to determine the corresponding positions (or number slots) in the equation template." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-533", "text": "For example, given a sentence \"2 footballs and 3 soccer balls cost 220 dollars\", the dependency paths between two quantity pairs (2, 220) and (3, 220) are identical, implying that 2 and 3 refer to similar types of number slots in the template." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-534", "text": "The other feature is whether two quantities appear in the same sentence." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-535", "text": "If so, they are likely to appear in the same equation of the template." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-536", "text": "Finally, a popular quantity-pair feature used in [26] [28] [15] [39] [40] [45] examines whether the value of one quantity is greater than the other, which is helpful to determine the correct operands for subtraction operator." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-537", "text": "----------------------------------" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-538", "text": "**QUESTION-RELATED FEATURES**" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-539", "text": "Distinguishing features can also be derived from questions." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-540", "text": "It is straightforward to figure out that the unknown variable can be inferred from the question and if a quantity whose unit appears in the question, this quantity is likely to be relevant." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-541", "text": "The remain question-related features presented in Table 5 were proposed by Roy et al. 
[26] , [28] and followed by MathDQN [15] ." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-542", "text": "Their feature design leverages the number of matching tokens between the related noun phrase of a quantity and the question text." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-543", "text": "The quantities with the highest number of matching tokens are considered as useful clues." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-544", "text": "They also check whether the question contains rate indicators such as \"each\" and \"per\", or comparison indicators such as \"more\" or \"less\"." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-545", "text": "The former is related to {\u00d7, \u00f7} and the latter is related to {+, \u2212}. Moreover, if the question text contains \"how many\", it implies that the solution is a positive number." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-546", "text": "----------------------------------" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-547", "text": "**VERB-RELATED FEATURES**" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-548", "text": "Verbs are important indicators for correct operator determination." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-549", "text": "For example, \"lose\" is a verb indicating quantity loss for an entity and related to the subtraction operator." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-550", "text": "Given a quantity, we call the verb closest to it in the dependency tree as its dependent verb." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-551", "text": "[26] [27] [28] [15] directly use dependent verb as one of the features." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-552", "text": "Another widely-adopted verb-related feature is a vector capturing the distance between the dependent verb and a small pre-defined collection of verbs that are found to be useful in categorizing arithmetic operations." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-553", "text": "Again, the remaining features come from the works [26] , [28] , [15] ." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-554", "text": "The features indicate whether two quantities have the same dependent verbs or whether their dependent verbs refer to the same verb mention." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-555", "text": "As we can see from the examples in Table 5 , the difference between these two types of features is the occurrence number of the dependent verb in the sentence." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-556", "text": "----------------------------------" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-557", "text": "**GLOBAL FEATURES**" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-558", "text": "There are certain types of global features in the document-level proposed by existing solvers." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-559", "text": "[26] , [28] , [15] use the number of quantities in the problem text as part of feature space." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-560", "text": "Unigrams and bigrams are also applied in [20] [39] ." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-561", "text": "They may play certain effect in determining the quantities and their order." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-562", "text": "Note that the unigrams and bigrams are defined in the word level rather than the character level." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-563", "text": "----------------------------------" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-564", "text": "**GEOMETRIC WORD PROBLEM (GWP) SOLVER**" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-565", "text": "Geometry solvers have been studied for a long history." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-566", "text": "Visual diagram understanding is a sub-domain that has attracted significant attention." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-567", "text": "As an early work for understanding line drawings, [77] [45] For \"each ride cost 5 tickets\", the quantity \"5\" is a rate Is between 0 and 1 [40] [45] Is equal to one or two [39] [40] [45] [13]" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-568", "text": "----------------------------------" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-569", "text": "**CONTEXT-RELATED FEATURES**" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-570", "text": "Word lemma [39] [40] [45] For \"Connie has 41.0 red markers." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-571", "text": "\", the word lemmas around the quantity \"41.0\" are {Connie, have, red, marker}. POS tags [28] [39] [40] [45] [20] For \"A chef needs to cook 16.0 potatoes.\", the POS tags within a window of size 2 centered at the quantity \"16.0\" are {TO, VB, NNS}. Dependence type [39] [40] [45] For \"Ned bought 14.0 boxes of chocolates candy.\", we can detect multiple dependencies within the window of size 2 around the \"14.0\": (boxes, 14.0) \u2192 (num), (boxes, of)\u2192 (prep), (bought, Ned) \u2192 (nsubj)." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-572", "text": "The dependence root is \"bought\"." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-573", "text": "Comparative adverbs [26] [28] [15] For \"If she drank 25 of them and then bought 30 more.\", \"more\" is a comparative term in the window of quantity \"30\"." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-574", "text": "----------------------------------" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-575", "text": "**QUANTITY-PAIR FEATURES**" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-576", "text": "Whether both quantities have the same unit [26] [28] [15] [40] For \"Student tickets cost 4 dollars and general admission tickets cost 6 dollars\", quantities \"4\" and \"6\" have the same unit." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-577", "text": "If one quantity is related to a rate and the other is associated with a unit that is part of the rate [26] [27] [28] [15] For \"each box has 9 pieces\" and \"Paul bought 6 boxes of chocolate candy\", \"9\" is related to a rate ( i.e., pieces/box) and \"6\" is associated to the unit \"box\"." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-578", "text": "Numeric relation of two quantities [40] [45] For each quantity, the nouns around it are extracted and sorted by the distance in the dependency tree." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-579", "text": "Then, a scoring function is defined on the two sorted lists to measure the numeric relation." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-580", "text": "Context similarity between two quantities [40] [45] The context is represented by the set of words around the quantity." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-581", "text": "Dependency path between two quantities." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-582", "text": "[39] [40] For \"2 footballs and 3 soccer balls cost 220 dollars\", the dependency path for the quantity pair (2, 3) is num(footballs,2) conj(footballs, balls)num(balls, 3)." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-583", "text": "Whether both quantities appear in the same sentence [39] [40] Whether the value of the first quantity is greater than the other [ [15] For the question \"How many apples are left in the box?\" and a quantity 77 that appears in \"77 apples in a box\", there are two matching tokens (\"apples\" and \"box\")." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-584", "text": "Number of quantities which happen to have the maximum number of matching tokens with the question [26] [28] [15] For \"Rose have 9 apples and 12 erasers. ... 3 friends." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-585", "text": "How many apples dose each friend get?\", the number of matching tokens for quantities 9, 12 and 3 is 1, 0 and 1." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-586", "text": "Hence, there are two quantities with the maximum matching token number." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-587", "text": "Whether any component of the rate is present in the question [26] [28] [15] Given a question \"How many blocks does George have?\" and a quantity 6 associated with rate \"blocks/box\", the feature indicator is set to 1 since block appears in the question." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-588", "text": "Whether the question contains terms like \"each\" or \"per\" [26] [28] [15] Whether the question contains comparison-related terms like \"more\" or \"less\" [26] [28] [15] Whether the question contains terms like \"how many\" [39] [40] [45] [13] It implies that the solution is positive." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-589", "text": "----------------------------------" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-590", "text": "**VERB-RELATED FEATURES**" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-591", "text": "Dependent verb of a quantity [26] [27] [28] [15] the verb closest to the quantity in the dependency tree Distance vector between the dependent verb and a small collection of predefined verbs that are useful for arithmetic operator classification [21] [24] [27] Whether two quantities have the same dependent verbs [26] [28] [15] For \"In the first round she scored 40 points and in the second round she scored 50 points\", the quantities \"40\" and \"50\" both have the same verb \"scored\". Note that \"scored\" appeared twice in the sentence." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-592", "text": "Whether both dependent verbs refer to the same verb mention [26] [28] [15] For \"She baked 4 cupcakes and 29 cookies.\", the quantities \"4\" and \"29\" both shared the verb \"baked\". Note that \"baked\" appeared only once in the sentence." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-593", "text": "----------------------------------" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-594", "text": "**GLOBAL FEATURES**" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-595", "text": "Number of quantities mentioned in text [26] [28] [15] Unigrams and bigrams of sentences in the problem text [20] [39] presented an efficient characteristic pattern detection method by scanning the distribution of black pixels and generating feature points graph." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-596", "text": "In [78] , a structure mapping engine named GeoRep was proposed to generate qualitative spatial descriptions from line diagrams." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-597", "text": "After that, the visual elements can be formulated through a two-level representation architecture." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-598", "text": "This work was also applied to the repetition and symmetry detection model in MAGI [79] ." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-599", "text": "Inspired by human cognitive process of reading juxtaposition diagrams, MAGI detects repetition by aligning visual and conceptual relational structure to analyze repetition-based diagrams." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-600", "text": "The problem of rectangle and parallelogram detection in diagram understanding has also received a considerable amount of interest." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-601", "text": "The proposed techniques fall into two main categories, either based on primitive or Hough transform [80] ." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-602", "text": "The primitivebased methods combine line segments or curves to form possible edges of a quadrangle." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-603", "text": "For examples, Lin and Nevatia [81] proposed the approach of parallelogram detection from a single aerial image by linear feature extraction and formation of hypothesis following certain geometric constraints." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-604", "text": "Similarly, Lagunovsky and Ablameyko [82] studied the problem of rectangular detection based on line primitives." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-605", "text": "As to the Hough transform based techniques, [83] presented an approach for automatic rectangular particle detection in cryo-electron microscopy through Hough transform, but this method can only work well when all rectangles have the same dimensions and the dimensions must be aware in advance." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-606", "text": "Jung et.al." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-607", "text": "[84] proposed a window Hough transform algorithm to tackle the problem of rectangle detection with varying dimensions and orientations." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-608", "text": "Geometry theorem proving (GTP) [85] , [86] was initially viewed as an artificial intelligence problem that was expected to be easily tackled by machines." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-609", "text": "The difficulties of solving GTP problems lie in the visual reasoning in geometry and the generation of elegant and concise geometry proofs." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-610", "text": "Moreover, completing the proof requires the ingenuity and insights to the problem." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-611", "text": "The first automated GTP was developed by Gelernter [85] , which used the diagram as pruning heuristic." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-612", "text": "The system rejects geometry goals that fail to satisfy the diagram." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-613", "text": "Whereas the limitation of the method is the true sub-goal may be pruned erroneously due to the insufficient precise arithmetic applied to the diagram." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-614", "text": "Fleuriot et." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-615", "text": "al. [86] studied Newton's geometric reasoning procedures in his work Principia and presented theorem prover Isabelle to formalize and generate the style of reasoning performed by Newton." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-616", "text": "By combining the existing geometry theorem proving techniques and the concepts of Nonstandard Analysis, the prover Isabelle can produce proofs of lemmas and theorem in Principia." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-617", "text": "Readers can refer to the book chapter [86] for the survey of early development of GTP." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-618", "text": "In this survey, we are more interested to examine the math problems that are required to consider visual diagram and textual mentions simultaneously." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-619", "text": "As illustrated in Figure 5 , a typical geometry word problem contains text descriptions or attribute values of geometric objects." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-620", "text": "The visual diagram may contain essential information that are absent from the text." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-621", "text": "For instance, points O, B and C are located on the same line segment and there is a circle passing points A, B, C and D. To well solve geometry word problems, three main challenges need to be tackled: 1) diagram parsing requires the detection of visual mentions, geometric characteristics, the spatial information and the co-reference with text; 2) deriving visual semantics which refer to the textual information related to visual analogue involves the semantic and syntactic interpretation to the text; and 3) the inherent ambiguities lie in the task of mapping visual mentions in the diagram to the concepts in real world." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-622", "text": "----------------------------------" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-623", "text": "**TEXT-ALIGNED DIAGRAM UNDERSTANDING**" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-624", "text": "The very early computer program, BEATRIX [87] , [88] , parses the English text and diagram components of the elementary physics problems together by establishing the coreference between the text and diagram." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-625", "text": "Watanabe et al. 
proposed a framework to combine layout information and natural language to analyze the pictorial book of flora diagrams [89] ." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-626", "text": "An overview of the research on integration of visual and linguistic information was provided in the survey paper by Srihar [90] ." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-627", "text": "However, these early approaches rely on written rules or manual regulations, i.e., the visual elements needed to be recognized with human intervention and their performances were usually dependent on specified diagrams." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-628", "text": "G-ALINGER [91] is an algorithmic work that addresses the geometry understanding and text understanding simultaneously." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-629", "text": "To detect primitives from a geometric diagram, Hough transform [92] is first applied to initialize lines and circles segments." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-630", "text": "An objective function that incorporates pixel coverage, visual coherence and textual-visual alignment." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-631", "text": "The function is submodular and a greedy algorithm is designed to pick the primitive with the maximum gain in each iteration." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-632", "text": "The algorithm stops when no positive gain can be obtained according to the objective function." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-633", "text": "In [93] , the problem of visual understanding is addressed in the context of science diagrams." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-634", "text": "The objective is to identify the graphic representation for the visual entities and their relations such as temporal transitions, phase transformations and inter object dependencies." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-635", "text": "An LSTM-based network is proposed for syntactic parsing of diagrams and learns the graphic structure." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-636", "text": "----------------------------------" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-637", "text": "**GWP SOLVERS**" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-638", "text": "GEOS [94] can be considered as the first work to tackle a complete geometric word problem as shown in Figure 5 ." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-660", "text": "And algebraic word problems are solved by generating answer rationales written in natural language in [104] through a sequence-to-sequence model." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-639", "text": "The method consists of two main steps: 1) parsing text and diagram respectively by generating a piece of logical expression to represent the key information of the text and diagram as well as the confidence scores, and 2) addressing the optimization problem by aligning the satisfiability of the derived logical expression in a numerical method that requires manually defining indicator function for each predicate." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-640", "text": "It is noticeable that G-ALINGER is applied in GEOS [91] for primitive detection." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-641", "text": "Despite the superiority of automated solving process, the performance of the system would be undermined if the answer choices are unavailable in a geometry problem and the deductive reasoning based on geometric axiom is not used in this method." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-642", "text": "A subsequent improver of GEOS is presented in [95] ." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-643", "text": "It harvests an axiomatic knowledge from 20 publicly available math textbooks and builds a more powerful reasoning engine that leverages the structured axiomatic knowledge for logical inference." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-644", "text": "GeoShader [96] , as the first tool to automatically handle geometry problem with shaded area, presents an interesting reasoning technique based on analysis hypergraph." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-645", "text": "The nodes in the graph represent intermediate facts extracted from the diagram and the directed edges indicate the relationship of deductibility between two facts." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-646", "text": "The calculation of the shaded area is represented as the target node in the graph and the problem is formulated as finding a path in the hypergraph that can reach the target node." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-647", "text": "----------------------------------" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-648", "text": "**MISCELLANEOUS MATH TASKS**" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-649", "text": "----------------------------------" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-650", "text": "**OTHER VARIANTS OF MATH PROBLEMS**" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-651", "text": "Apart from geometric problems, there are also assorted variants of math problems that AI system focuses on." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-652", "text": "Aristo [97] is able to solve non-diagram multiple-choice questions through five parallel solvers, one for pure text, two for statistic and two for inference." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-653", "text": "Finally, the combiner of Aristo outputs a comprehensive score of each option based on scores from the five solvers." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-654", "text": "A similar work on multiple-choice questions is [98] , which takes Wikipedia as a knowledge base." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-655", "text": "After ranking and filtering relevant pages retrieved from Wikipedia, it presents a new scoring function to pick the best answer from the candidates." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-656", "text": "Another vairant is targeted at solving IQ test and a noticeable number of computer models have been proposed in [99] , [100] ." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-657", "text": "Taking [100] for example, it proposed a framework for solving verbal IQ questions, which classifies questions into several categories and each group of questions are solved by a specific solver respectively." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-658", "text": "Furthermore, logic puzzles are addressed in [101] by transforming robust natural language to precise semantics." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-659", "text": "For other forms of math problems, [102] solves probability problems automatically by a two-step approach, namely first formulating questions in a declarative language and then computing the answer through a solver implemented in ProbLog [103] ." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-661", "text": "----------------------------------" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-663", "text": "Solving math word problems in other languages also attracts research attention." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-664", "text": "Yu et al. addressed the equation set problem solver in Chinese [105] , [106] ." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-665", "text": "A pool of rule-based semantic models are crafted to map the sentences in Chinese text into equation templates." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-666", "text": "The experiments were conducted on a very small-scale dataset with 104 problems." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-667", "text": "Recently, there has been the first attempt to solve Arabic arithmetic word problems [107] ." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-668", "text": "Its test dataset was collected by translating the AI2 dataset [21] from English to Arabic." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-669", "text": "The proposed techniques also rely on the verb categorization, similar to those proposed in [21] , except that customization for the Arabic language needs to be made for the tasks of syntactic parser and named entity recognition." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-670", "text": "To conclude, the math word problem solvers in other languages than English are still at a very early stage." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-671", "text": "The datasets used are neither large-scale nor challenging and the proposed techniques are obsolete." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-672", "text": "This research area has great room for improvement and calls for more efforts to be involved." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-673", "text": "----------------------------------" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-674", "text": "**MATH PROBLEM GENERATOR**" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-675", "text": "We also review automatic math word problem generators that can efficiently produce a large, diverse and configurable corpus of question-answer database." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-676", "text": "The topics covered in this survey include algebra word problems with basic operators {+, \u2212, \u00d7, \u00f7} and geometry problems." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-677", "text": "In [108] , Wang et al. 
leveraged the concept of expression tree to generate a math word problem." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-678", "text": "The tree structure can provide the skeleton of the story, and meanwhile allow the story to be constructed recursively from the sub-stories." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-679", "text": "Each sub-story can be seen as a text template with value slots to be filled." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-680", "text": "These sub-stories will be concatenated into an entire narrative." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-681", "text": "Different from [108] , the work of [109] rewrites a given math word problem to fit a particular theme such as Star War." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-682", "text": "In this way, students may stay more engaged with their homework assignments." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-683", "text": "The candidate are scored with the coherence in multiple factors (e.g., syntactic, semantic and thematic)." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-684", "text": "[110] generates math word problems that match the personal interest of students." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-685", "text": "The generator uses Answer Set Programming [111] , in which programs are composed of facts and rules in a first-order logic representation, to satisfy a collection of pedagogical and narrative requirements." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-686", "text": "Its objective is to produce coherent and personalized story problems that meet pedagogical requirements." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-687", "text": "In the branch of geometry problem generator, GeoTutor [112] , [113] is designed to generate geometry proof problems for high school students." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-688", "text": "The input contains a figure and a set of geometry axioms." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-689", "text": "The output is a pair (I, G), where I refers to the assumptions for the figure and goals in G are sets of explicit facts to be inferred." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-690", "text": "Singhal et al. also tackled the automated generation of geometry questions for high school students [114] , [115] ." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-691", "text": "Its input interface allows users to select geometric objects, concepts and theorems." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-692", "text": "Compared with [112] , [113] , its geometric figure is generated by the algorithm rather than specified by the user." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-693", "text": "Based on the figure, the next step of generating facts and solutions is similar to that in [112] , [113] ." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-694", "text": "It requires pre-knowledge on axioms and theorems and results in the formation capturing the relationships between its objects." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-695", "text": "----------------------------------" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-696", "text": "**CONCLUSIONS AND FUTURE DIRECTIONS**" }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-697", "text": "In this paper, we present a comprehensive survey to review the development of math word problem solvers in recent years." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-698", "text": "The topics discussed in this survey cover arithmetic word problems, equation set problems, geometry word problems and miscellaneous others related to math." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-699", "text": "We compared the techniques proposed for each math task, provided a rational categorization, and conducted accountable experimental analysis." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-700", "text": "Moreover, we took a close examination on the subject of feature engineering proposed for MWP solvers and summarized the diversified proposal of syntactic features." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-701", "text": "Overall speaking, the current status of MWP solvers is far from promising and has great room for improvement." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-702", "text": "There is no doubt that the topic would continue to attract more and more research attention in the next few years, especially after the public release of large-scale datasets such as Dolphin18K and Math23K." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-703", "text": "In the following, we present a number of future directions that may be of interest to the community and worth further exploration." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-704", "text": "Firstly, DNS [14] was the first attempt that used deep learning models in MWP solvers so as to avoid non-trivial feature engineering." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-705", "text": "This work shed light on the feasibility of designing end-to-end models to enhance the accuracy and reduce human intervention." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-706", "text": "Considering that DNS only uses a basic seq-to-seq network structure, with LSTM and GRU as the encoding and decoing units, we expect more advanced networks to be developed." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-707", "text": "Moreover, as a common practice, these models can be integrated with attention mechanism [116] for performance advancement." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-708", "text": "Secondly, aligning visual understanding with text mention is an emerging direction that is particularly important to solve geometry word problems." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-709", "text": "However, this challenging problem has only been evaluated in self-collected and small-scale datasets, similar to those early efforts on evaluating the accuracy of solving algebra word problem." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-710", "text": "There is a chance that these proposed aligning methods fail to work well in a large and diversified dataset." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-711", "text": "Hence, it calls for a new round of evaluation for generality and robustness with a better benchmark dataset for geometry problems." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-712", "text": "Thirdly, intepretability plays a key role in measuring the usability of MWP solvers in the application of online tutoring, but may pose new challenges for the deep learning based solvers [100] ." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-713", "text": "For instance, AlphaGo [117] and AlphaZero [118] have achieved astonishing superiority over human players, but their near-optimal actions could be difficult for human to interpret." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-714", "text": "Similar issues may occur in the domain of automatic math problem solver and they deserve an early examination." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-715", "text": "Last but not the least, solving math word problems in English plays a dominating role in the literature." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-716", "text": "We only observed a very rare number of math solvers proposed to cope with other languages." 
}, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-717", "text": "This research topic may grow into a direction with significant impact." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-718", "text": "To our knowledge, many companies in China have harvested an enormous number of word problems in K12 education." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-719", "text": "As reported in 2015 2 , Zuoyebang, a spin off from Baidu, has collected 950 million questions and solutions in its database." }, { "sent_id": "a0db9c3a74487d3fcd11e79d44e163-C001-720", "text": "When coupled with deep learning models, this is an area with immense imagination and exciting achievements can be expected." } ], "y": { "@BACK@": { "gold_contexts": [ [ "a0db9c3a74487d3fcd11e79d44e163-C001-127" ], [ "a0db9c3a74487d3fcd11e79d44e163-C001-137" ], [ "a0db9c3a74487d3fcd11e79d44e163-C001-148" ], [ "a0db9c3a74487d3fcd11e79d44e163-C001-153" ], [ "a0db9c3a74487d3fcd11e79d44e163-C001-170" ], [ "a0db9c3a74487d3fcd11e79d44e163-C001-203", "a0db9c3a74487d3fcd11e79d44e163-C001-204", "a0db9c3a74487d3fcd11e79d44e163-C001-205", "a0db9c3a74487d3fcd11e79d44e163-C001-206", "a0db9c3a74487d3fcd11e79d44e163-C001-207", "a0db9c3a74487d3fcd11e79d44e163-C001-208", "a0db9c3a74487d3fcd11e79d44e163-C001-214" ], [ "a0db9c3a74487d3fcd11e79d44e163-C001-475" ], [ "a0db9c3a74487d3fcd11e79d44e163-C001-514" ], [ "a0db9c3a74487d3fcd11e79d44e163-C001-519" ], [ "a0db9c3a74487d3fcd11e79d44e163-C001-520" ], [ "a0db9c3a74487d3fcd11e79d44e163-C001-536" ], [ "a0db9c3a74487d3fcd11e79d44e163-C001-551" ], [ "a0db9c3a74487d3fcd11e79d44e163-C001-559" ], [ "a0db9c3a74487d3fcd11e79d44e163-C001-573" ], [ "a0db9c3a74487d3fcd11e79d44e163-C001-576" ], [ "a0db9c3a74487d3fcd11e79d44e163-C001-577" ], [ "a0db9c3a74487d3fcd11e79d44e163-C001-584" ], [ "a0db9c3a74487d3fcd11e79d44e163-C001-587" ], [ "a0db9c3a74487d3fcd11e79d44e163-C001-588" ], [ "a0db9c3a74487d3fcd11e79d44e163-C001-591" ], [ "a0db9c3a74487d3fcd11e79d44e163-C001-592" ], [ 
"a0db9c3a74487d3fcd11e79d44e163-C001-595" ] ], "cite_sentences": [ "a0db9c3a74487d3fcd11e79d44e163-C001-127", "a0db9c3a74487d3fcd11e79d44e163-C001-137", "a0db9c3a74487d3fcd11e79d44e163-C001-148", "a0db9c3a74487d3fcd11e79d44e163-C001-153", "a0db9c3a74487d3fcd11e79d44e163-C001-170", "a0db9c3a74487d3fcd11e79d44e163-C001-208", "a0db9c3a74487d3fcd11e79d44e163-C001-214", "a0db9c3a74487d3fcd11e79d44e163-C001-475", "a0db9c3a74487d3fcd11e79d44e163-C001-514", "a0db9c3a74487d3fcd11e79d44e163-C001-519", "a0db9c3a74487d3fcd11e79d44e163-C001-520", "a0db9c3a74487d3fcd11e79d44e163-C001-536", "a0db9c3a74487d3fcd11e79d44e163-C001-551", "a0db9c3a74487d3fcd11e79d44e163-C001-559", "a0db9c3a74487d3fcd11e79d44e163-C001-573", "a0db9c3a74487d3fcd11e79d44e163-C001-576", "a0db9c3a74487d3fcd11e79d44e163-C001-577", "a0db9c3a74487d3fcd11e79d44e163-C001-584", "a0db9c3a74487d3fcd11e79d44e163-C001-587", "a0db9c3a74487d3fcd11e79d44e163-C001-588", "a0db9c3a74487d3fcd11e79d44e163-C001-591", "a0db9c3a74487d3fcd11e79d44e163-C001-592", "a0db9c3a74487d3fcd11e79d44e163-C001-595" ] }, "@USE@": { "gold_contexts": [ [ "a0db9c3a74487d3fcd11e79d44e163-C001-130" ], [ "a0db9c3a74487d3fcd11e79d44e163-C001-203", "a0db9c3a74487d3fcd11e79d44e163-C001-204", "a0db9c3a74487d3fcd11e79d44e163-C001-205", "a0db9c3a74487d3fcd11e79d44e163-C001-206", "a0db9c3a74487d3fcd11e79d44e163-C001-207", "a0db9c3a74487d3fcd11e79d44e163-C001-208", "a0db9c3a74487d3fcd11e79d44e163-C001-214" ], [ "a0db9c3a74487d3fcd11e79d44e163-C001-257" ], [ "a0db9c3a74487d3fcd11e79d44e163-C001-504" ], [ "a0db9c3a74487d3fcd11e79d44e163-C001-530", "a0db9c3a74487d3fcd11e79d44e163-C001-536" ], [ "a0db9c3a74487d3fcd11e79d44e163-C001-541" ], [ "a0db9c3a74487d3fcd11e79d44e163-C001-553" ] ], "cite_sentences": [ "a0db9c3a74487d3fcd11e79d44e163-C001-130", "a0db9c3a74487d3fcd11e79d44e163-C001-208", "a0db9c3a74487d3fcd11e79d44e163-C001-214", "a0db9c3a74487d3fcd11e79d44e163-C001-257", "a0db9c3a74487d3fcd11e79d44e163-C001-504", "a0db9c3a74487d3fcd11e79d44e163-C001-536", 
"a0db9c3a74487d3fcd11e79d44e163-C001-541", "a0db9c3a74487d3fcd11e79d44e163-C001-553" ] }, "@SIM@": { "gold_contexts": [ [ "a0db9c3a74487d3fcd11e79d44e163-C001-377" ], [ "a0db9c3a74487d3fcd11e79d44e163-C001-550", "a0db9c3a74487d3fcd11e79d44e163-C001-551" ] ], "cite_sentences": [ "a0db9c3a74487d3fcd11e79d44e163-C001-377", "a0db9c3a74487d3fcd11e79d44e163-C001-551" ] } } }, "ABC_d2028986dc30ccc0ca840ca3b2f454_0": { "x": [ { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-1", "text": "**ABSTRACT**" }, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-2", "text": "We tackle the task of question generation over knowledge bases." }, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-3", "text": "Conventional methods for this task neglect two crucial research issues: 1) the given predicate needs to be expressed; 2) the answer to the generated question needs to be definitive." }, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-4", "text": "In this paper, we strive toward the above two issues via incorporating diversified contexts and answeraware loss." }, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-5", "text": "Specifically, we propose a neural encoder-decoder model with multi-level copy mechanisms to generate such questions." }, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-6", "text": "Furthermore, the answer aware loss is introduced to make generated questions corresponding to more definitive answers." }, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-7", "text": "Experiments demonstrate that our model achieves state-of-theart performance." }, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-8", "text": "Meanwhile, such generated question can express the given predicate and correspond to a definitive answer." 
}, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-9", "text": "----------------------------------" }, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-10", "text": "**INTRODUCTION**" }, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-11", "text": "Question Generation over Knowledge Bases (KBQG) aims at generating natural language questions for the corresponding facts on KBs, and it can benefit some real applications." }, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-12", "text": "Firstly, KBQG can automatically annotate question answering (QA) datasets." }, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-13", "text": "Secondly, the generated questions and answers will be able to augment the training data for QA systems." }, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-14", "text": "More importantly, KBQG can improve the ability of machines to actively ask questions on human-machine conversations (Duan et al., 2017; Sun et al., 2018) ." }, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-15", "text": "Therefore, this task has attracted more attention in recent years (Serban et al., 2016; Elsahar et al., 2018) ." }, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-16", "text": "Specifically, KBQG is the task of generating natural language questions according to the input facts from a knowledge base with triplet form, like . For example, as illustrated in Figure 1 , KBQG aims at generating a question \"Which city is Statue of Liberty located in?\" (Q3) for the input factual triplet Which city is Statue of Liberty located in? Figure 1 : Examples of KBQG." }, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-17", "text": "We aims at generating questions like Q3 which expresses (matches) the given predicate and refers to a definitive answer." }, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-18", "text": "\"\"." 
}, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-19", "text": "Here, the generated question is associated to the subject \"Statue of Liberty\" and the predicate fb:location/containedby) of the input fact, and the answer corresponds to the object \"New York City\"." }, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-20", "text": "As depicted by Serban et al. (2016) , KBQG is required to transduce the triplet fact into a question about the subject and predicate, where the object is the correct answer." }, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-21", "text": "Therefore, it is a key issue for KBQG to correctly understand the knowledge symbols (subject, predicate and object in the triplet fact) and then generate corresponding text descriptions." }, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-22", "text": "More recently, some researches have striven toward this task, where the behind intuition is to construct implicit associations between facts and texts." }, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-23", "text": "Specifically, Serban et al. (2016) designed an encoder-decoder architecture to generate questions from structured triplet facts." }, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-24", "text": "In order to improve the generalization for KBQG, Elsahar et al. (2018) utilized extra contexts as input via distant supervisions (Mintz et al., 2009) , then a decoder is equipped with attention and part-ofspeech (POS) copy mechanism to generate questions." }, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-25", "text": "Finally, this model obtained significant improvements." }, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-26", "text": "Nevertheless, we observe that there are still two important research issues (RIs) which are not processed well or even neglected." }, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-27", "text": "1 We omit the domain of the predicate for sake of brevity." 
}, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-28", "text": "----------------------------------" }, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-29", "text": "**RI-1:**" }, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-30", "text": "The generated question is required to express the given predicate in the fact." }, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-31", "text": "For example in Figure 1 , Q1 does not express (match) the predicate (fb:location/containedby) while it is expressed in Q2 and Q3." }, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-32", "text": "Previous work (Elsahar et al., 2018) usually obtained predicate textual contexts through distant supervision." }, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-33", "text": "However, the distant supervision is noisy or even wrong (e.g. \"X is the husband of Y\" is the relational pattern for the predicate fb:marriage/spouse, so it is wrong when \"X\" is a woman)." }, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-34", "text": "Furthermore, many predicates in the KB have no predicate contexts." }, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-35", "text": "We make statistic in the resources released by Elsahar et al. (2018) , and find that only 44% predicates have predicate textual context 2 ." }, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-36", "text": "Therefore, it is prone to generate error questions from such without-context predicates." }, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-37", "text": "RI-2: The generated question is required to contain a definitive answer." }, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-38", "text": "A definitive answer means that one question only associates with a determinate answer rather than alternative answers." }, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-39", "text": "As an example in Figure 1 , Q2 may contain ambiguous answers since it does not express the refined answer type." 
}, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-40", "text": "As a result, different answers including \"United State\", \"New York City\", etc. may be correct." }, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-41", "text": "In contrast, Q3 refers to a definitive answer (the object \"New York City\" in the given fact) by restraining the answer type to a city." }, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-42", "text": "We believe that Q3, which expresses the given predicate and refers to a definitive answer, is a better question than Q1 and Q2." }, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-43", "text": "In previous work, Elsahar et al. (2018) only regarded a most frequently mentioned entity type as the textual context for the subject or object in the triplet." }, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-44", "text": "In fact, most answer entities have multiple types, where the most frequently mentioned type tends to be universal (e.g. a broad type \"administrative region\" rather than a refined type \"US state\" for the entity \"New York\")." }, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-45", "text": "Therefore, generated questions from Elsahar et al. (2018) may be difficult to contain definitive answers." }, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-46", "text": "To address the aforementioned two issues, we exploit more diversified contexts for the given facts as textual contexts in an encoder-decoder model." }, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-47", "text": "Specifically, besides using predicate contexts from the distant supervision utilized by Elsahar et al. (2018) , we further leverage the domain, range and even topic for the given predicate as contexts, which are off-the-shelf in KBs (e.g. the range and the topic for the predicate fb:location/containedby are \"location\" and \"containedby\", respectively 1 )." 
}, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-48", "text": "Therefore, 100% predicates (rather than 44% 2 of those in Elsahar et al.) have contexts." }, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-49", "text": "Furthermore, in addition to the most frequently mentioned entity type as contexts used by Elsahar et al. (2018) , we leverage the type that best describes the entity as contexts (e.g. a refined entity type 3 \"US state\" combines a broad type \"administrative region\" for the entity \"New York\"), which is helpful to refine the entity information." }, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-50", "text": "Finally, in order to make full use of these contexts, we propose context-augmented fact encoder and multi-level copy mechanism (KB copy and context copy) to integrate diversified contexts, where the multilevel copy mechanism can copy from KB and textual contexts simultaneously." }, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-51", "text": "For the purpose of further making generated questions correspond to definitive answers, we propose the answer-aware loss by optimizing the cross-entropy between the generated question and answer type words, which is beneficial to generate precise questions." }, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-52", "text": "We conduct experiments on an open public dataset." }, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-53", "text": "Experimental results demonstrate that the proposed model using diversified textual contexts outperforms strong baselines (+4.5 BLEU4 score)." }, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-54", "text": "Besides, it can further increase the BLEU score (+5.16 BLEU4 score) and produce questions associated with more definitive answers by incorporating answer-aware loss." }, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-55", "text": "Human evaluations complement that our model can express the given predicate more precisely." 
}, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-56", "text": "In brief, our main contributions are as follows:" }, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-57", "text": "(1) We leverage diversified contexts and multilevel copy mechanism to alleviate the issue of incorrect predicate expression in traditional methods." }, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-58", "text": "( 2) We propose an answer-aware loss to tackle the issue that conventional methods can not generate questions with definitive answers." }, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-59", "text": "(3) Experiments demonstrate that our model achieves state-of-the-art performance." }, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-60", "text": "Meanwhile, such generated question can express the given predicate and refer to a definitive answer." }, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-61", "text": "----------------------------------" }, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-62", "text": "**TASK DESCRIPTION**" }, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-63", "text": "We leverage textual contexts concerned with the triplet fact to generate questions over KBs." }, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-64", "text": "The Figure 2 : Overall structure of the proposed model for KBQG." }, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-65", "text": "A context encoder is firstly employed to encode each textual context (Sec. 3.1), where \"Diversified Types\" represents the subject (object) context, and \"DS pattern\" denotes the relational pattern from distant supervisions." }, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-66", "text": "At the same time, a fact encoder transforms the fact into low-dimensional representations (Sec. 3.2)." }, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-67", "text": "The above two encoders are aggregated by the context-augmented fact encoder (Sec. 3.3)." 
}, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-68", "text": "Finally, the aggregated representations are fed to the decoder (Sec. 3.4), where the decoder leverages multi-level copy mechanism (KB copy and context copy) to generate target question words." }, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-69", "text": "task of KBQG can be formalized as follows:" }, { "sent_id": "d2028986dc30ccc0ca840ca3b2f454-C001-70", "text": "where F = (s, p, o) represents the subject (s), predicate (p) and object (o) of the input triplet, C = {x s , x p , x o } denotes a set of additional textual contexts, Y = (y 1 , y 2 , ..., y |Y | ) is the generated question, y